Towards a Unified Architecture for in-RDBMS Analytics

The increasing use of statistical data analysis in enterprise applications has created an arms race among database vendors to offer ever more sophisticated in-database analytics. One challenge in this race is that each new statistical technique must be implemented from scratch in the RDBMS, which leads to a lengthy and complex development process. We argue that the root cause for this overhead is the lack of a unified architecture for in-database analytics. Our main contribution in this work is to take a step towards such a uni-fied architecture.

1. Towards a Unified Architecture for in-RDBMS Analytics Xixuan Feng Arun Kumar Benjamin Recht Christopher Ré Department of Computer Sciences University of Wisconsin-Madison {xfeng, arun, brecht, chrisre} ABSTRACT late 1990s and early 2000s, this brought a wave of data min- The increasing use of statistical data analysis in enterprise ing toolkits into the RDBMS. Several major vendors are applications has created an arms race among database ven- again making an effort toward sophisticated in-database an- dors to offer ever more sophisticated in-database analytics. alytics with both open source efforts, e.g., the MADlib plat- One challenge in this race is that each new statistical tech- form [17], and several projects at major database vendors. nique must be implemented from scratch in the RDBMS, In our conversations with engineers from Oracle [38] and which leads to a lengthy and complex development process. EMC Greenplum [21], we learned that a key bottleneck in We argue that the root cause for this overhead is the lack of this arms race is that each new data analytics technique re- a unified architecture for in-database analytics. Our main quires several ad hoc steps: a new solver is employed that contribution in this work is to take a step towards such a uni- has new memory requirements, new data access methods, fied architecture. A key benefit of our unified architecture is etc. As a result, there is little code reuse across different al- that performance optimizations for analytics techniques can gorithms, slowing the development effort. Thus, it would be be studied generically instead of an ad hoc, per-technique a boon to the database industry if one could devise a single fashion. In particular, our technical contributions are the- architecture that was capable of processing many data an- oretical and empirical studies of two key factors that we alytics techniques. An ideal architecture would leverage as found impact performance: the order data is stored, and many of the existing code paths in the database as possible parallelization of computations on a single-node multicore as such code paths are likely to be maintained and optimized RDBMS. We demonstrate the feasibility of our architecture as the RDBMS code evolves to new platforms. by integrating several popular analytics techniques into two To find this common architecture, we begin with an ob- commercial and one open-source RDBMS. Our architecture servation from the mathematical programming community requires changes to only a few dozen lines of code to integrate that has been exploited in recent years by both the statis- a new statistical technique. We then compare our approach tics and machine learning communities: many common data with the native analytics tools offered by the commercial analytics tasks can be framed as convex programming prob- RDBMSes on various analytics tasks, and validate that our lems [16, 26]. Examples of such convex programming prob- approach achieves competitive or higher performance, while lems include support vector machines, least squares and lo- still achieving the same quality. gistic regression, conditional random fields, graphical mod- els, control theoretic models, and many more. It is hard to overstate the impact of this observation on data analysis Categories and Subject Descriptors theory: rather than studying properties of each new model, H.2 [Database Management]: Miscellaneous researchers in this area are able to unify their algorithmic and theoretical studies. In particular, convex programming Keywords problems are attractive as local solutions are always globally optimal, and one can find local solutions via a standard suite Analytics, Convex Programming, Incremental Gradient De- of well-established and analyzed algorithms. Thus, convex scent, User-Defined Aggregate programming is a natural starting point for a unified ana- lytics architecture. 1. INTRODUCTION The mathematical programming literature is filled with There is an escalating arms race to bring sophisticated algorithms to solve convex programming problems. Our data analysis techniques into enterprise applications. In the first goal is to find an algorithm in that literature whose data access properties are amenable to implementation in- side an RDBMS. We observe that a classical algorithm from the mathematical programming cannon, called incremental Permission to make digital or hard copies of all or part of this work for gradient descent (IGD), has a data-access pattern that is personal or classroom use is granted without fee provided that copies are essentially identical to the data access pattern of any SQL not made or distributed for profit or commercial advantage and that copies aggregation function, e.g., an SQL AVG. As we explain in bear this notice and the full citation on the first page. To copy otherwise, to Section 2, IGD can be viewed as examining the data one republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. tuple at time and then performing a (non-commutative) ag- SIGMOD ’12, May 20–24, 2012, Scottsdale, Arizona, USA. gregation of the results. Our first contribution is an archi- Copyright 2012 ACM 978-1-4503-1247-9/12/05 ...$10.00.

2. Analytics Task Objective Logistic Regression (LR) i log(1 + exp(−yi wT xi )) + µ w 1 T Classification (SVM) i (1 − yi w xi )+ + µ w 1 T 2 2 Recommendation (LMF) (i,j)∈Ω (Li Rj − Mij ) + µ L, R F Labeling (CRF) [48] k j wj Fj (yk , xk ) − log Z(xk ) T Kalman Filters t=1 ||Cwt − f (yt )||22 + ||wt − Awt−1 ||22 Portfolio Optimization pT w + wT Σw s.t. w ∈ ∆ Figure 1: Bismarck in an RDBMS: (A) In contrast to existing in-RDBMS analytics tools that have separate code paths for different analytics tasks, Bismarck provides a single framework to implement them, while possibly retaining similar interfaces. (B) Example tasks handled by Bismarck. In Logistic Regression and Classification, we minimize the error of a predictor plus a regularization term. In Recommendation, we find a low-rank approximation to a matrix M which is only observed on a sparse sampling of its entries. This problem is not convex, but it can still be solved via IGD. In Labeling with Conditional Random Fields, we maximize the weights associated with features (Fj ) in the text to predict the labels. In Kalman Filters, we fit noisy time series data. In quantitative finance, portfolios are optimized balancing risk (pT w) with expected returns (wT Σw); the allocations must lie in a simplex, ∆, i.e., ∆ = {w ∈ Rn | n i=1 wi = 1} and wi ≥ 0 for i = 1, . . . , n. tecture that leverages this observation: we show that we can on disk can lead to considerable degradation in performance. implement these methods using the user-defined aggregate With this in mind, we describe a theoretical example that features that are available inside every major RDBMS. To characterizes some “bad” orders for IGDs and shows that support our point, we implement our architecture over Post- they are indeed likely inside an RDBMS. For example, if greSQL and two commercial database systems. In turn, this one clusters the data for a classification task such that all allows us to implement all convex data analysis techniques of the positive examples come before the negative examples, that are available in current RDBMSes – and many next the resulting convergence rate may be much slower than if generation techniques (see Figure 1). The code to add in a the data were randomly ordered, i.e., to reach the same dis- new model can be as little as ten lines of C code, e.g., for tance to the optimal solution, more passes over the data are logistic regression.1 needed if the data is examined by IGD in the clustered order As with any generic architectural abstraction, a key ques- versus a random order. Our second technical contribution is tion is to understand how much performance overhead our to describe the clustering phenomenon theoretically, and use approach would incur. In the two commercial systems that this insight to develop a simple approach to combat this. A we investigate, we show that compared to a strawman user- common approach in machine learning randomly permutes defined aggregate that computes no value, our approach has the data with each pass. However, such random shuffling between 5% (for simple tasks like regression) to 100% over- may incur substantial computational overhead. Our method head (for complex tasks like matrix factorization). What is obviates this overhead by shuffling the data only once before perhaps more surprising is that our approach is often much the first pass. We implement and benchmark this approach faster than existing in-database analytic tools from commer- on all three RDBMSes that we study: empirically, we find cial vendors: our prototype implementations are in many that across a broad range of models, while shuffling once cases 2 − 4x faster than existing approaches for simple tasks has a slightly slower convergence rate than shuffling on each – and for some newly added tasks such as matrix factoriza- pass, the lack of expensive reshuffling allows us to simply run tion, orders of magnitude faster. more epochs in the same amount of time. Thus, shuffling A second benefit of a unified in-database architecture is once has better overall performance than shuffling always. that we can study the factors that impact performance and We then study how to parallelize IGD in an RDBMS. optimize them in a way that applies across several analyt- We first observe that recent work in the machine learn- ics tasks. Our preliminary investigation revealed many such ing community allows us to parallelize IGD [52] in a way optimization opportunities including data layout, compres- that leverages the standard user-defined aggregation fea- sion, data ordering, and parallelism. Here, we focus on two tures available in every RDBMS to do shared-nothing par- such factors that we discovered were important in our ini- allelism. We leverage this parallelization feature in a com- tial prototype: data clustering, i.e., how the data is ordered mercial database and show that we can get almost linear on-disk, and parallelism on a single-node multicore system. speed-ups. However, recent results in the machine learning Although IGD will converge to an optimal solution on con- community have shown that these approaches may yield sub- vex programming problems no matter how the underlying optimal runtime performance compared to approaches that data is ordered, empirically some orders allow us to termi- exploit shared-memory parallelism [29, 37]. This motivates nate more quickly than others. We observe that inside an us to adapt approaches that exploit shared memory for use RDBMS, data is often clustered for reasons unrelated to the inside an RDBMS. We focus on single-node multicore par- analysis task (e.g., to support efficient query performance), allelism where shared memory is available. Although not in and running IGD through the data in the order that is stored the textbook description of an RDBMS, all three RDBMSes 1 we inspected allow us to allocate and manage some shared Not all data analysis problems are convex. Notable exceptions are memory (even providing interfaces to help manage the nec- Apriori [9] and graph mining algorithms.

3.essary data structures). We show that the shared-memory for uncertain data based on sampling and graphical models version converges faster than the shared-nothing version. respectively. In contrast, we consider data analytics tech- In some cases, a single shuffle of the data may be too niques that are modeled as convex programming problems. expensive (e.g., for data sets that do not fit in available A related (but orthogonal issue) is how statistical models memory). To cope with such large data sets, users often should be integrated into the RDBMS to facilitate ease of perform a subsampling of the data (e.g., using a reservoir use, notably model-based views pioneered in MauveDB [19]. sample [46]). Subsampling is not always desirable, as it in- The idea is to give users a unified abstraction that hides from troduces an additional error (increasing the variance of the the user (but not the tool developer) the details of statistical estimate). Thus, for such large data sets, we would like to processing. In contrast, our goal is a lower level abstraction: avoid the costly shuffle of the data to achieve better perfor- we want to unify at the implementation of many different mance than subsampling. Our final technical contribution data analysis tasks. combines the parallelization scheme and reservoir sampling Using incremental gradient algorithms for convex program- to get our highest performance results for datasets that do ming problems is a classical idea, going back to the semi- not fit in available RAM. On simple tasks like logistic regres- nal work in the 1950s of Robbins and Monro [40]. Recent sion, we are 4X faster than state-of-the-art in-RDBMS tools. years have seen a resurgence of interest in these algorithms On more complex tasks like matrix factorization, these ap- due to their ability to tolerate noise, converge rapidly, and proaches allow us to converge in a few hours, while existing achieve high runtime performance. In fact, sometimes an tools do not finish even after several days. IGD method can converge before examining all of the data; In summary, our work makes the following contributions: in contrast, a traditional gradient method would need to touch all of the data items to take even a single step. These • We describe a novel unified architecture, Bismarck, properties have made IGD an algorithm of choice in the Web for integrating many data analytics tasks formulated community. Notable implementations include Vowpal Wab- as Incremental Gradient Descent into an RDBMS us- bit at Yahoo! [7], and in large-scale learning [14]. IGD has ing features available in almost every commercial and also been employed for specific algorithms, notably Gemulla open-source system. We give evidence that our ar- et al recently used it for matrix factorization [23]. What chitecture is widely applicable by implementing Bis- distinguishes our work is that we have observed that IGD marck in three RDBMS engines: PostgreSQL and two forms the basis of a systems abstraction that is well suited commercial engines. for in-RDBMS processing. As a result, our technical focus • We study the effect of data clustering on performance. is on optimizations that are implementable in an RDBMS We identify a theoretical example that shows that bad and span many different models. orderings not typically considered in machine learning Our techniques to study the impact of sorting is inspired do occur in databases and we develop a novel strategy by the work of Bottou and LeCun [15], who empirically stud- to improve performance. ied the related problem of different sampling strategies for • We study how to adapt existing approaches to make stochastic gradient algorithms. There has been a good deal Bismarck run in parallel. We verify that this allows of work in the machine learning community to create sev- us to achieve large speed-ups on a wide range of tasks eral clever parallelization schemes for IGD [12,18,20,29,53]. using features in existing RDBMSes. We combine our Our work builds on this work to study those methods that solution for clustering with the above parallelization are ideally suited for an RDBMS. For convex programming schemes to attack the problem of bad data ordering. problems, we find that the model averaging techniques of We validate our work by implementing Bismarck on three Zinkevich et al [53] fit well with user-defined aggregates. RDBMS engines: PostgreSQL, and two commercial engines, Recently, work on using shared memory without locking has DBMS A and DBMS B. We perform an extensive exper- been shown to converge more rapidly in some settings [37]. imental validation. We see that we are competitive, and We borrow from both approaches. often better than state-of-the-art in-database tools for stan- Finally, the area of convex programming problems is a dard tasks like regression and classification. We also show hot topic in data analysis [12, 16], e.g., the support vector that for next generation tasks like conditional random fields, machine [32], Lasso [44], and logistic regression [47] were all we have competitive performance against state-of-the-art designed and analyzed in a convex programming framework. special-purpose tools. Convex analysis also plays a pivotal role in approximation algorithms, e.g., the celebrated MAX-CUT relaxation [24] shows that the optimal approximation to this classical NP- Related Work. Every major database vendor has data min- hard problem is achieved by solving a convex program. In ing tools associated with their RDBMS offering. Recently, fact a recent result in the Theory community shows that there has been an escalating arms race to add sophisticated there is reason to believe that almost all combinatorial op- analytics into the RDBMS with each iteration bringing more timization problems have optimal approximations given by sophisticated tools into the RDBMS. So far, this arms race solving convex programs [39]. Thus, we argue that these has centered around bringing individual statistical data min- techniques may enable a number of sophisticated data pro- ing techniques into an RDBMS, notably Support Vector cessing algorithms in the RDBMS. Machines [35], Monte Carlo sampling [27, 51], Conditional Random Fields [25, 49], and Graphical Models [43, 50]. Our effort is inspired by these approaches, but the goal of this Outline. The rest of the paper is organized as follows: In work is to understand the extent to which we can handle Section 2, we explain how Bismarck interacts with the these analytics tasks with a single unified architecture. Of RDBMS, and give the necessary mathematical programming these approaches, MCDB [27] and Wick et al. [51] are the background on gradient methods. In Section 3, we discuss most related in that they propose a single unified interface the architecture of Bismarck, and how data ordering and

4.parallelism impact performance. In Section 4, we validate w ∈ Rd for some d ≥ 1 that minimizes the following objec- that Bismarck is able to integrate analytics techniques into tive:2 an RDBMS with low overhead and high performance. N min f (w, zi ) + P (w) (1) w∈Rd 2. PRELIMINARIES i=1 We start with a description of how Bismarck fits into an Here, the objective function decomposes into a sum of func- RDBMS, and then give a simple example of how an end-user tions f (w, zi ) for i = 1, . . . , N where each zi is an item of interacts with Bismarck in an RDBMS. We then discuss the (training) data. In Bismarck, the zi are represented by necessary mathematical programming background on gradi- tuples in the database, e.g., a pair (paper,area) for paper ent methods. classification. We abbreviate f (w, zi ) = fi (w). For exam- ple, in SVM classification, the function fi (w) could be the 2.1 Bismarck in an RDBMS hinge loss of the model w on the ith data element and P (w) We start by contrasting the high level architecture of most enforces the smoothness of the classifier (preventing overfit- existing in-RDBMS analytics tools with how Bismarck in- ting). Eq. 1 is general: Figure 1(B) gives an incomplete list tegrates analytics into an RDBMS, and explain how Bis- of examples that can be handled by Bismarck. marck is largely orthogonal to the end-user interfaces. Ex- A gradient is a generalization of a derivative that tells us isting tools like MADlib [17], Oracle Data Mining [4], and if the function is increasing or decreasing as we move in a Microsoft SQL Server Data Mining [1] provide SQL-like in- particular direction. Formally, a gradient of a function h : terfaces for the end-user to specify tasks like Logistic Regres- Rd → R is a function ∇h : Rd → Rd such that (∇h(w))i = sion, Support Vector Machine, etc. Declarative interface- ∂ ∂wi h(w) [16]. Linearity of the gradient implies the equation: level abstractions like model-based views [19] help in creat- N N ing such user-friendly interfaces. However, the underlying implementations of these tasks do not have a unified archi- ∇ fi (w) = ∇fi (w) . tecture, increasing the overhead for the developer. In con- i=1 i=1 trast, Bismarck provides a single architectural abstraction For our purpose, the importance of this equation is that for the developer to unify the in-RDBMS implementations of to compute the gradient of the objective function, we can these analytics techniques, as illustrated in Figure 1. Thus, compute the gradient of each fi individually. Bismarck is orthogonal to the end-user interface, and the Gradient methods are algorithms that solve (1). These developer has the freedom to provide any existing or new methods are defined by an iterative rule that describes how interfaces. In fact, in our implementation of Bismarck in one produces the (k+1)-st iterate, w(k+1) , given the previous each RDBMS, Bismarck’s user-interface mimics the inter- iterate, w(k) . For simplicity, we assume that P = 0. Then, face of that RDBMS’ native analytics tool. we are minimizing a function f (w) = N i=1 fi (w), our goal is For example, consider the interface provided by the open- to produce a new point w(k+1) where f (w(k) ) > f (w(k+1) ). source MADlib [17] used over PostgreSQL and Greenplum In 1-D, we need to move in the direction opposite the deriva- databases. Consider the task of classifying papers using a tive (gradient). A gradient method is defined by the rule: support vector machine (SVM). The data is in a table La- beledPapers(id, vec, label), where id is the key, vec is w(k+1) = w(k) − αk ∇f (w(k) ) the feature values (say as an array of floats) and label is the class label. In MADlib, the user can train an SVM model by here αk ≥ 0 is a positive parameter called step-size that simply issuing a SQL query with some pre-defined functions determines how far to follow the current search direction. that take in the data table details, parameters for the model, Typically, αk → 0 as k → ∞. etc. [17] In Bismarck, we mimic this familiar interface for The twist for incremental gradient methods is to approxi- users to do in-RDBMS analytics. For example, the query mate the full gradient using a single terms of the sum. That (similar to MADlib’s) to train an SVM is as follows: is, let η(k) ∈ {1, . . . , N }, chosen at iteration k. Intuitively, we approximate the gradient ∇f (w) with ∇fη(k) (w).3 Then, SELECT SVMTrain (‘myModel’, ‘LabeledPapers’, w(k+1) = w(k) − αk ∇fη(k) (w(k) ) (2) ‘vec’, ‘label’); This is a key connection: each fi can be represented as a SVMTrain is a function that passes the user inputs to Bis- single tuple. We illustrate this rule with a simple example: marck, which then performs the gradient computations for SVM and returns the model. The model, which is basically Example 2.1. Consider a simple least-squares problem a vector of coefficients for an SVM, is then persisted as a with 2n (n ≥ 1) data points (x1 , y1 ), . . . , (x2n , y2n ). The user table ‘myModel’. The model can be applied to new feature values are xi = 1 for i = 1, . . . , 2n and the labels are unlabeled data to make predictions by using a similar SQL yi = 1 for i ≤ n, and yi = −1, otherwise. The resulting query. mathematical programming problem is: 2.2 Background: Gradient Methods 1 2n min (wxi − yi )2 We provide a brief introduction to gradient methods. For w 2 i=1 a thorough introduction to gradient methods and their pro- 2 jected, incremental variants, we direct the interested reader In Appendix A, we generalize to include constraints via proximal point methods. One can also generalize to both matrix valued w and to the many surveys of the subject [13, 36]. We focus on non-differentiable functions [42]. a particular class of problems that have linearly separable 3 Observe that minimizing f and g(w) = N 1 f (w), means correcting objective functions. Formally, our goal is to find a vector by the factor N is not necessary and not done by convention.

5. Figure 3: The Standard Three Phases of a UDA. provide a simple iteration to test for convergence. We will explain more about these two aspects shortly, but first we describe the architecture of a UDA, and how we can handle IGD in this framework. Figure 2: High-level Architecture of Bismarck. IGD as a User-Defined Aggregate. As shown in Figure 3, a developer implements a UDA by writing three standard Since xi = 1 for all i, the optimal solution to the problem is functions: initialize(state), transition(state, data) the mean w = 0, but we choose this to illustrate the mechan- and terminate(state). Almost all RDBMSes provide the ics of the method. We begin with some point w(0) chosen abstraction of a UDA, albeit with different names or inter- arbitrarily. We choose i ∈ {1, . . . , 2n} at random. Fix some faces for these three steps, e.g., PostgreSQL names them α ≥ 0 and for k ≥ 0, set αk = α for simplicity. Then, our ‘initcond’, ‘sfunc’ and ‘finalfunc’ [5]. approximation to the gradient is ∇fi (w(0) ) = (w(0) − yi ). The state is basically the context of aggregation (e.g., And so, our first step is: the running total and count for an AVG query). The data is a tuple in the table. In our case, the state is essentially the w(1) = w(0) − α(w(0) − yi ) model (e.g., the coefficients of a logistic regressor) and per- haps some meta data (e.g., number of gradient steps taken). We then repeat the process with w(2) , etc. One can check In our current implementation, we assume that the state that after k + 1 steps, we will have: fits in memory (models are typically orders of magnitude k smaller than the data, which is not required to fit in mem- w(k+1) = (1 − α)k+1 w0 + α (1 − α)k−i yη(j) ory). The data is again an example from the data table, j=0 which includes the attribute values and the label (for super- vised schemes). We now explain the role of each function: Since the expectation of yη(j) equals 0, we can see that we converge exponentially quickly to 0 under this scheme – even • The initialize(state) function initializes the model before we see all 2n points. This serves to illustrate why with user-given values (e.g., a vector of zeros), or a an IGD scheme may converge much faster than traditional model returned by a previous run. gradient methods, where one must touch every data item at least once just to compute the first step. • In transition(state, data), we first compute the (incremental) gradient value of the objective function Remarkably, when both the functions n i=1 fi (w) and P (w) on the given data example, and then update the cur- are both convex, the incremental gradient method is guaran- rent model (Equation 2 from Section 2.2). This func- teed to converge to a globally optimal solution [36] at known tion is where one puts the logic of the the various ana- rates. Also, IGD converges (perhaps at a slower rate) even if lytics techniques – each technique has its own objective η(k) is a sequence in a fixed, arbitrary order [11,30,31,33,45]. function and gradient (Figure 1(B)). Thus, the main We explore this issue in more detail in Example 3.1. differences in the implementations of the various an- alytics techniques occur mainly in a few lines of code 3. BISMARCK ARCHITECTURE within this function, while the rest of our architec- ture is reused across techniques. Figure 4 illustrates We first describe the high-level architecture of Bismarck, the claim with actual code snippets for two tasks (LR and then explain how we implement IGD in an RDBMS. and SVM). This simplifies the development of sophis- Then, we drill down into two aspects of our architecture ticated in-database analytics, in contrast to existing that impact performance - data ordering and parallelism. systems that usually have different code paths for dif- 3.1 High-Level Architecture ferent techniques (Figure 1(A)). The high-level architecture of Bismarck is presented in • In terminate(state), we finish the gradient compu- Figure 2. Bismarck takes in the specifications for an an- tations and return the model, possibly persisting it. alytics task (e.g., data details, parameters, etc.) and runs the task using Incremental Gradient Descent (IGD). As ex- A key implementation detail is that Bismarck may re- plained before, IGD allows us to solve a number of ana- order the data to improve the convergence rate of IGD or lytics tasks in one unified way. The main component of to sample from the data. This feature is supported in all Bismarck is the in-RDBMS implementation of IGD with a major RDBMSes, e.g., in PostgreSQL using the ORDER BY data access pattern similar to a SQL aggregate query. For RANDOM() construct. this purpose, we leverage the mechanism of User-Defined Aggregate (UDA), a standard feature available in almost all Key Differences: Epochs and Convergence. A key dif- RDBMSes [2, 3, 5]. The UDA mechanism is used to run the ference from traditional aggregations, like SUM, AVG, or IGD computation, but also to test for convergence and com- MAX, is that to reach the optimal objective function value, pute information, e.g., error rates. Bismarck also needs to IGD may need to do more than one pass over the dataset.

6. LR_Transition(ModelCoef *w, Example e) { ... SVM_Transition(ModelCoef *w, Example e) { ... wx = Dot_Product(w, e.x); wx = Dot_Product(w, e.x); sig = Sigmoid(-wx * e.y); c = stepsize * e.y; c = stepsize * e.y * sig; if(1 - wx * e.y > 0) { Scale_And_Add(w, e.x, c); ... } Scale_And_Add(w, e.x, c); } ... } Figure 4: Snippets of the C-code implementations of the transition step for Logistic Regression (LR) and Support Vector Machine (SVM). Here, w is the coefficient vector, and e is a training example with feature vector x and label y. Scale_And_Add updates w by adding to it x multiplied by the scalar c. Note the minimal differences between the two implementations. Following the machine learning literature, we call each pass ple that characterizes pathological orders for IGD. We chose an epoch [15]. Thus, the aggregate may need to be executed this example to illustrate the important points with respect more than once, with the output model of one run being in- to clustering and be as theoretically simple as possible. put to the next (shown in Figure 2 as a loop). To determine Example 3.1 (1-D CA-TX). Suppose that our data is how many epochs to run, Bismarck supports an arbitrary clustered geographically, e.g., sales data from California, fol- Boolean function to be called (which may itself involve ag- lowed by Texas, and the attributes of the sales in the two gregation). This supports both what we observed in practice states cause the data to be in two different classes. With this as common heuristic convergence tests, e.g., run for a fixed in mind, recall Example 2.1. We are given a simple least- number of iterations, and more rigorous conditions based on squares problem with 2n (n ≥ 1) data points (x1 , y1 ), . . . , the norm of the gradient common in machine learning [10]. (x2n , y2n ). The feature values are xi = 1 for i = 1, . . . , 2n A second difference is that the we may need to compute and the labels are yi = 1 for i ≤ n, and yi = −1, otherwise. the actual value of the objective function (also known as The resulting mathematical programming problem is: the loss) using the model after each epoch. The loss value 2n may be needed by the stopping condition, e.g., a common 1 convergence test is based on the relative drop in the loss min (wxi − yi )2 w 2 i=1 value. This loss computation can also be implemented as a UDA (or piggybacked onto the IGD UDA). Since xi = 1 for all i, the optimal solution is the mean, w = 0. But our goal here is to analyze the behavior of IGD Technical Opportunities. A key conceptual benefit of Bis- on this problem under various orders. Due to this prob- marck’s approach is that one can study generic performance lem’s simplicity, we can solve the behavior of the resulting optimizations (i.e., optimizations that apply to many analyt- dynamical system in closed form under a variety of order- ics techniques) rather than ad hoc, per-technique ones. The ing schemes. Consider two schemes to illustrate our point: remainder of the technical sections are devoted to examin- (1) data points seen are randomly sampled from the dataset, ing two such generic optimizations. First, the conventional and (2) data points seen in ascending index order, (x1 , y1 ), wisdom is that for IGD to converge more rapidly, each data (x2 , y2 ), . . . . Scheme (2) simulates operating on data that is point should be sampled in random (without-replacement) clustered by class. order [15]. This can be achieved by randomly reordering, Figure 5 plots the value of w during the course of the IGD or shuffling, the dataset before running the aggregate for under the above two sampling schemes (using diminishing gradient computation at each epoch. The goal of course is step-size rule). We see that both approaches do indeed con- to converge faster in wall-clock time, not per epoch. Thus, verge to the optimal value, but approach (1), which uses ran- we study when the increased speed in convergence rate per dom sampling, converges more rapidly. In contrast, in ap- epoch outweighs the additional cost of reordering the data proach (2), w oscillates between 1 and −1, until converging at each epoch. The second optimization we describe is how eventually. Intuitively, this is so because the IGD initially to leverage multicore parallelism to speed-up the IGD ag- takes steps influenced by the positive examples, and is later gregate computation. influenced by the negative examples (within one epoch). In other words, convergence can be much slower on clustered 3.2 Impact of Data Ordering data. In the extended version of the paper [22], we present On convex programming problems, IGD is known to con- calculations to precisely explain this behavior. We conclude verge to the optimal value irrespective of how the underlying the example by noting that almost all permutations of the data is ordered. But empirically some data orderings allow data will behave similar to (1), and not (2). In other words, us to converge in fewer epochs than others. However, our ex- (2) is a pathological ordering, but one which is indeed possi- periments suggest that the sensitivity is not as great as one ble for data stored in an RDBMS. might think. In other words, presenting the data in a ran- Shuffling the data at each epoch is expensive and incurs dom order gets essentially optimal run-time behavior. This a high overhead. In fact, for simple tasks like LR and SVM, begs the question as to whether we should even reorder the the shuffling time dominates the gradient computation time data randomly at each epoch. In fact, some machine learn- by a factor of 5. To remove the overhead of shuffling the ing tools do not even bother to randomly reorder the data. data at every epoch, while still avoiding the pathological or- However, we observe that inside an RDBMS, data is often dering, we propose a simple solution – shuffle the data only clustered for reasons unrelated to the analysis task (e.g., for once. By randomly reordering the data once, we avoid the efficient join query performance). For example, the data for pathological ordering that might be present in data stored a classification task might be clustered by the class label. in a database. We implemented and benchmarked this ap- We now analyze this issue by providing a theoretical exam- proach on all three RDBMSes that we study. As explained

7. 1 the IGD aggregate completely in the user space with no (1) Random w 0 changes needed to the RDBMS code. This allows us to pre- -1 serve the 3-function abstraction from Section 3.1, and also 0 10000 20000 30000 40000 50000 reuse most of the code from the UDA-based implementation. 1 The model to be learned is maintained in shared memory (2) Clustered w 0 and is concurrently updated by parallel threads operating on different segments of the data. Concurrent updates suggest -1 0 10000 20000 30000 40000 50000 that we need locking on the shared model. Nevertheless, (10) (20) (30) (40) (50) recent results from the machine learning community show Number of Gradient Steps (No. of Epochs) that IGD can be parallelized in a shared-memory environ- ment with no locking at all [37]. We adopt this technique Figure 5: 1-D CA-TX Example: Plot of w against number into Bismarck. Light-weight locking schemes often have of gradient steps on (1) Random, and (2) Clustered data stronger theoretical properties for convergence, and so we orderings for a dataset with 1000 examples (i.e., n = 500). consider one such scheme called Atomic Incremental Gradi- The number of epochs is shown in parentheses on the x-axis. ent (AIG) that uses only CompareAndExchange instructions Random takes 18 epochs to converge (convergence defined to effectively perform per-component locking [37]. here as w2 < 0.001), while Clustered takes 48 epochs. As shown later in Section 4, we empirically observe that the model-averaging approach (pure UDA) has a worse con- vergence rate than the shared-memory UDA, and so worse later in Section 4.3, empirically, we find that shuffling once overall performance. This led us to consider the shared- suffices across a broad range of models. Shuffling once does memory UDA for Bismarck. have a slightly lower convergence rate than shuffling always. However, since we need not shuffle at every epoch, we signif- 3.4 Avoiding Shuffling Overhead icantly reduce the runtime per epoch, which means we can From the CA-TX example in Section 3.2, we saw that bad simply run more epochs within the same wall-clock time so data orderings can impact convergence, and that shuffling as to reach the optimal value. As we show later in Section once suffices in some instances to achieve good convergence 4.3, this allows shuffle-once to converge faster than shuffle- rate. However, shuffling even once could be expensive for always (between 2X-6X faster on the tasks we studied). very large datasets. We verified this on a scalability dataset, and it did not finish shuffling even in one day. Thus, we in- 3.3 Parallelizing Gradient Computations vestigate if it is possible to achieve good convergence rate We now study how we can parallelize the IGD aggregate even on bad data orderings without any shuffling. A clas- computation to achieve performance speed-ups on a single- sical technique to cope with this situation is to subsample node multicore system. We explain two mechanisms for the data using reservoir sampling (in fact, some vendors achieving this parallelism – one based on standard UDA fea- do implement subsampling); in this technique, given an in- tures, and another based on shared-memory features. We memory buffer size B, we can obtain a without-replacement emphasize that both features are available in almost all sample of size B in just one pass over the dataset, without RDBMSes. shuffling the dataset [46]. The main idea of reservoir sam- pling is straightforward: suppose that our reservoir (array) Pure UDA Version. The UDA infrastructure offered by can hold m items and our goal is to sample from N (≥ m) most RDBMSes (including the commercial DBMS A and items. Read the first m items and fill the reservoir. Then, DBMS B) include an built-in mechanism for ‘shared-nothing’ when we read the kth additional item (m + k overall), we parallelism. The RDBMS requires that the developer pro- randomly select an integer s in [0, m + k). If s < m, then vide a function merge(state, state), along with the 3 func- we put the item at slot s; otherwise we drop the item. tions discussed in Section 3.1. The merge function specifies Empirically, we observe that the subsampling may have how two aggregation contexts that were computed indepen- slow convergence. Our intuition is that the reservoir dis- dently in parallel can be combined. For example, for an cards valuable data items that could be used to help the AVG query, two individual averages with sufficient statistics model converge faster. To address this issue, we propose (total count) can be combined to obtain a new average. Gen- a simple scheme that we call multiplexed reservoir sampling erally, only aggregates that are commutative and algebraic (MRS), which combines the reservoir sampling idea with the can be parallelized in the above manner [8]. Although the concurrent model updates idea from Section 3.3. IGD is not commutative, we observe that it is essentially commutative, in that it eventually converges to the optimal Multiplexed Reservoir Sampling. The multiplexed reser- value regardless the data order (Section 3.2). And although voir sampling (MRS) idea is to combine, or multiplex, gradi- the IGD is not algebraic, recent results from the machine ent steps over both the reservoir sample and the data that is learning community suggest that one can achieve rapid con- not put in the reservoir buffer. By using the reservoir sam- vergence by averaging models (trained on different portions ple, which is a valuable without-replacement sample, and of the data) [53]. Thus, the IGD is essentially algebraic the rest of the data in conjunction, our scheme can achieve as well. In turn, this implies that we can use the parallel faster convergence than subsampling. UDA approach to achieve near-linear speed-ups on the IGD As Figure 6 illustrates, in MRS, there are two threads aggregate computations. that update the shared model concurrently, called the I/O Worker and the Memory Worker. The I/O Worker has two Shared-Memory UDA. Shared-memory management is pro- tasks: (1) it performs a standard gradient step (exactly as vided by most RDBMSes [6], and it enables us to implement the previous code), and (2) it places tuples into a reservoir.

8. licly available real-world datasets. For LR and SVM, we use two datasets – one dense (Forest, a standard benchmark dataset from the UCI repository) and one sparse (DBLife, which classifies papers by research areas). We binarized these datasets for the standard binary LR and SVM tasks. For LMF, we use MovieLens, which is a movie recommen- dation dataset, and for CRF, we use the CoNLL dataset, which is for text chunking. We also perform a scalability study with much larger datasets – two synthetic datasets Classify300M (for LR and SVM) and Matrix5B (for LMF), Figure 6: Multiplexed Reservoir Sampling (MRS): The I/O as well as DBLP (another real-world dataset) for CRF. The Worker reads example tuple e from the database, and uses relevant statistics for all datasets are presented in Table 1. buffer A to do reservoir sampling. The dropped example d is used for the gradient step, with updates to a shared model. Experimental Setup. All experiments are run on an iden- The Memory Worker iterates over buffer B, and performs tical configuration: a dual Xeon X5650 CPUs (6 cores each gradient steps on each example b in B concurrently. x 2 hyper-threading) machine with 128GB of RAM and a 1TB dedicated disk. The kernel is Linux 2.6.32-131. Each reported runtime is the average of three warm-cache runs. Both of these functions are performed within the previously Completion time for gradient schemes here means achiev- discussed UDA framework. The Memory Worker takes a ing 0.1% tolerance in the objective function value, unless buffer as input, and it loops over that buffer updating the specified otherwise. model using the gradient rule. After the I/O Worker finishes one pass over the data, the buffers are swapped. That is, 4.1 Overhead of Our Architecture the I/O Worker begins filling the buffer that the Memory We first validate that Bismarck incurs little development Worker is using, while the Memory Worker works on the overhead to add new analytics tasks. We then empirically buffer that has just been filled by the I/O Worker. The verify that the runtime overhead of the tasks in Bismarck Memory Worker is signaled by polling a common integer is low compared to a strawman aggregate. indicating which buffer it should run over and whether it should continue running. In Section 4, we show that even with a buffer size that is an order of magnitude smaller than Development Overhead. We implemented the 4 analytics tasks in Bismarck over three RDBMSes (PostgreSQL, com- the dataset, MRS can achieve better convergence rates than mercial DBMS A and DBMS B). Bismarck enables rapid both no-shuffling and subsampling. addition of a new analytics task since a large fraction of the code is shared across all the techniques implemented 4. EXPERIMENTS (on a given RDBMS). For example, starting with an end-to- We first show that our architecture, Bismarck, incurs lit- end implementation of LR in Bismarck (in C, over Post- tle overhead, in terms of both development effort to add new greSQL), we need to modify fewer than two dozen lines of analytics tasks, and runtime overhead inside an RDBMS. We code in order to add the SVM module.4 Similarly, we can then validate that Bismarck, implemented over two com- easily add in a more sophisticated task like LMF with only mercial RDBMSes and PostgreSQL, provides competitive or five dozen new lines of code. We believe that this is possi- better performance than the native analytics tools offered ble because our unified architecture based on IGD abstracts by these RDBMSes on popular in-database analytics tasks. out the logic of the various tasks into a small number of Finally, we evaluate how the generic optimizations that we generic functions. This is in contrast to existing systems, described in Section 3 impact Bismarck’s performance. where there is usually a dedicated code stack for each task. Dataset Dimension # Examples Size Runtime Overhead. We next verify that the tasks imple- Forest 54 581k 77M mented in Bismarck have low runtime overhead. To do DBLife 41k 16k 2.7M this, we compared our implementation to a strawman ag- MovieLens 6k x 4k 1M 24M gregate that sees the same data, but computes no values. CoNLL 7.4M 9K 20M We call this a NULL aggregate. We run three tasks – LR, Classify300M 50 300M 135G SVM and LMF in Bismarck over all the 3 RDBMSes, using Matrix5B 706k x 706k 5B 190G both the pure UDA infrastructure (shared-nothing) and the DBLP 600M 2.3M 7.2G shared-memory variant described in Section 3. We compare the single-iteration runtime of each task against the NULL aggregate for both implementations of Bismarck over the Table 1: Dataset Statistics. DBLife, CoNLL and DBLP are same datasets. The results are presented in Tables 2 and 3. in sparse-vector format. MovieLens and Matrix5B are in We see that the overhead compared to the NULL aggregate sparse-matrix format. can be as low as 4.6%, and is rarely more than 2X runtime for simple tasks like LR and SVM. The overhead is higher for the more computation-intensive task LMF, but is still Tasks and Datasets. We study 4 popular analytics tasks: less than 2.5X runtime of the NULL aggregate. We also see Logistic Regression (LR), Support Vector Machine classifi- cation (SVM), Low-rank Matrix Factorization (LMF) and 4 Both our code and the data used in our experiments are available Conditional Random Fields labeling (CRF). We use 4 pub- at:

9. PostgreSQL DBMS A DBMS B (8 segments) Dataset Run- Over- Dataset Run- Over- Dataset Run- Over- Tasks Tasks Tasks (NULL time) -time -head (NULL time) -time -head (NULL time) -time -head Forest LR 0.57s 90% Forest LR 24.1s 15.3% Forest LR 0.17s 21.4% (0.3s) SVM 0.56s 83.3% (20.9s) SVM 22.0s 5.26% (0.14s) SVM 0.16s 14.3% DBLife LR 0.035s 192% DBLife LR 1.1s 86.4% DBLife LR 0.1 17.6% (0.012s) SVM 0.03s 150% (0.59) SVM 0.8s 35.6% (0.085s) SVM 0.096s 12.9% MovieLens MovieLens MovieLens LMF 0.86s 244% LMF 45.8s 29.4% LMF 0.32s 100% (0.25s) (35.4s) (0.16s) Table 2: Pure UDA implementation overheads: single-iteration runtime of each task implemented in Bismarck against the strawman NULL aggregate. The parallel database DBMS B was run with 8 segments. PostgreSQL DBMS A DBMS B (8 segments) Dataset Run- Over- Dataset Run- Over- Dataset Run- Over- Tasks Tasks Tasks (NULL time) -time -head (NULL time) -time -head (NULL time) -time -head Forest LR 0.56s 86.7% Forest LR 5.1s 54.5% Forest LR 0.25s 150% (0.3s) SVM 0.55s 83.3% (3.3s) SVM 4.0s 21.2% (0.1s) SVM 0.21s 110% DBLife LR 0.017s 41.7% DBLife LR 0.2s 81.8% DBLife LR 0.045s 4.6% (0.012s) SVM 0.016s 33.3% (0.11s) SVM 0.3s 172% (0.043s) SVM 0.045s 4.6% MovieLens MovieLens MovieLens LMF 0.85s 193% LMF 10.3s 102% LMF 0.26s 160% (0.29s) (5.1s) (0.1s) Table 3: Shared-memory UDA implementation overheads: single-iteration runtime of each task implemented in Bismarck against the strawman NULL aggregate. The parallel database DBMS B was run with 8 segments. 100 Frac. of Opt. LogLik. PostgreSQL DBMS A DBMS B (8 segments) Dataset Task 80 CRF++ Bismarck MADlib Bismarck Native Bismarck Native (466) Forest LR 8.0 43.5 40.2 489.0 3.7 17.0 60 (Dense) SVM 7.5 140.2 32.7 66.7 3.3 19.2 40 Mallet Bismarck (1043) DBLife LR 0.8 N/A 9.8 20.6 2.3 N/A 20 (399) (Sparse) SVM 1.2 N/A 11.6 4.8 4.1 N/A 0 MovieLens LMF 36.0 29325.7 394.7 N/A 11.9 17431.3 10 100 1000 Time (sec) Figure 7: Benchmark Comparison: (A) Runtimes (in sec) for convergence (0.1% tolerance) or completion on 3 in-RDBMS analytics tasks. We compare Bismarck implemented over each RDBMS against the analytics tool native to that RDBMS. N/A means the task is not supported on that RDBMS’ native tool (B) For the CRF task, we compare Bismarck (over PostgreSQL) against custom tools by plotting the objective function value against time. Completion times (in sec) are shown in parentheses. that the shared-memory variant is several times faster than collection of in-RDBMS statistical techniques [17]), which the UDA implementation over DBMS A, since DBMS A has is run over PostgreSQL (single-threaded), and the native extra overheads (e.g., model passing, serializations, etc.) to analytics tools provided by the two commercial engines – run the pure UDA. It was this observation that prompted DBMS A (single-threaded), and the parallel DBMS B (with us to use the shared-memory UDA to implement Bismarck 8 segments). We tuned the parameters for each tool, includ- even for a single-thread RDBMS. ing Bismarck, on each task based on an extensive search in the parameter space. The data was preprocessed appro- 4.2 Benchmark Comparison priately for all tools. Some of the tasks we study are not We now validate that Bismarck implemented over two currently supported in the above tools. In particular, the commercial RDBMSes and PostgreSQL provides competi- CRF task is not available in any of the existing in-RDBMS tive or better performance than the native analytics tools analytics tools we considered, and so we compare Bismarck offered by these RDBMSes on three existing in-database an- (over PostgreSQL) against the custom tools CRF++ [28] alytics tasks – LR, SVM and LMF. For the comparison, we and Mallet [34]. use the shared-memory UDA implementation of Bismarck along with the shuffle-once approach described in Section Existing In-RDBMS Analytics Tasks. We first compare 3.2. For the parallel version of Bismarck, we use the no- the end-to-end runtimes of the various tools on LR, SVM lock shared-memory parallelism described in Section 3.3. and LMF. The results are summarized in Figure 7 (A). Over- all, we see that Bismarck implemented over each RDBMS Competitor Analytics Tools. We compare Bismarck against has competitive or faster performance on all these tasks three existing in-RDBMS tools – MADlib (an open-source against the native tool of the respective RDBMS. On sim-

10.ple tasks like LR and SVM, we see that Bismarck is often DBLP). Since Bismarck is not tied to any RDBMS, we run several times faster than existing tools. That is, on the it over PostgreSQL for this study. We compare against the dense LR task, Bismarck is about 12X faster than DBMS native analytics tools of both commercial engines, DBMS A A’s tool, and about 5X faster than MADlib over both Post- and DBMS B, as well as the task-specific in-memory tools greSQL and the native tool in DBMS B. In some cases, e.g., mentioned before. The results are summarized in Table 4.2. DBMS A for sparse SVM, Bismarck is slightly slower due to We see that almost all of the in-RDBMS tools scale on the the function call overheads in DBMS A. On a more complex simple tasks LR and SVM (less than an hour per epoch task like LMF, we see that Bismarck is about 3 orders-of- for Bismarck), except DBMS B on SVM, which did not magnitude faster than MADlib and DBMS B’s native tool. terminate even after 48 hours. Again, on the more complex This validates that Bismarck is able to efficiently handle tasks LMF and CRF, only Bismarck scales to the large several in-RDBMS analytics tasks, while offering a unified datasets. We also tried several custom in-memory tools – all architecture. We also verified that all the tools compared crashed either due to insufficient memory (Weka, SVMPerf, achieved similar training quality on a given task and dataset CRF++) or did not terminate even after 48 hours (Mallet). (recall that IGD converges to the optimal objective value on convex programs), but do not present details here due to 4.3 Impact of Data Ordering space constraints. We now empirically verify how the order the data is stored To understand why Bismarck performs faster, we looked affects the performance of our IGD schemes. We first study into the MADlib source code. While the reasons vary across the objective function value against epochs for data being tasks, Bismarck is faster generally because IGD has lower shuffled before each epoch (ShuffleAlways). We repeat the time complexity than the algorithms in MADlib. IGD, across study for data seen in clustered order (Clustered), without all tasks, is linear in the number of examples (fixing the di- any shuffling. Finally, we shuffle the data only once, before mension) and linear in the dimension of the model (fixing the first epoch (ShuffleOnce). We present the results for the the number of examples). But the algorithms in MADlib LR task on DBLife in Figure 8. We observed similar results for LR, for instance, are super-linear in the dimension, while on other datasets and tasks, but skip them here due to space that for LMF is super-linear in the number of examples. constraints. To get a sense of the performance compared to other tools, a comparison with the popular in-memory tool Weka shows 1E4 1E4 -Log Likelihood -Log Likelihood that Bismarck (over PostgreSQL) is faster on all these tasks Clustered(185) Clustered(9.3) 8E3 8E3 – from 4X faster on dense LR to over 4000X faster on dense ShuffleOnce(47) ShuffleOnce(2.4) 6E3 6E3 SVM. We also validated that our runtimes on SVM are 4E3 4E3 within a factor of 3X to the special-purpose SVM in-memory 2E3 2E3 tool, SVMPerf. This is not surprising as SVMPerf is highly ShuffleAlways(35) ShuffleAlways(5.9) 0E0 0 0E0 0 optimized for the SVM computation, but presents an avenue 0 50 100 150 200 0 2 4 6 8 10 for future work. Epoch Time (s) Next Generation Tasks. Existing in-RDBMS analytics tools Figure 8: Impact of Data Ordering on Sparse LR over do not support emerging advanced analytics tasks like CRF. DBLife: (A) Objective value over epochs, till convergence. But Bismarck is able to efficiently support even such next The number of epochs for convergence are shown in paren- generation tasks within the same architecture. To validate theses. (B) Objective value over time, till convergence. The this, we plot the convergence over time for Bismarck (over time to converge (in sec) are shown in parentheses. PostgreSQL) against in-memory tools. The results are shown in Figure 7(B). We see that Bismarck is able to achieve sim- Figure 8(A) shows that ShuffleAlways converges in the ilar convergence, and runtime as the hand-coded and opti- fewest epochs, as is expected for IGD. Clustered yields the mized in-memory tools, even though Bismarck is a more poorest convergence rate, as explained in Section 3.2. In generic in-RDBMS tool. fact, Clustered takes over 1000 epochs to reach the same objective value as ShuffleAlways. However, we see that Shuf- Bismarck DBMS A DBMS B Others fleOnce achieves very similar convergence rate to ShuffleAl- Task PostgreSQL (Native) (Native) (In-mem.) ways, and reaches the same objective value as ShuffleAlways √ √ √ in 12 extra epochs. Figure 8(B) shows why the extra epochs LR X √ √ are acceptable – ShuffleAlways takes several times longer to SVM X X √ finish than ShuffleOnce. This is because the shuffling over- LMF N/A X X √ head is significantly high. In fact, for simple tasks like LR, CRF N/A N/A X shuffling dominates the runtime – e.g., for LR on DBLife, √ shuffling takes nearly 5X the time for gradient computation Table 4: Scalability : means the task completes, and X per epoch. Even on more complex tasks, the overhead is sig- means that the approach either crashes or takes longer than nificant, e.g., it is 3X for LMF on MovieLens. By avoiding 48 hours. N/A means the task is not supported. The in- this overhead, ShuffleOnce finishes much faster than Shuf- memory tools (Weka, SVMPerf, CRF++, Mallet) all either fleAlways, while still achieving the same quality. crash or take too long. 4.4 Parallelizing IGD in an RDBMS We now verify that both the parallelism schemes (pure Scalability. We now study the scalability of the various UDA and shared-memory UDA) are able to achieve near- tools to much larger datasets (Classify300M, Matrix5B and linear speed-ups but the pure UDA has a worse convergence

11. 5E4 Pure UDA 8 vergence rate than both Subsampling and Clustered, and NoLock -Log Likelihood 4E4 6 Speed-up Lock reaches an objective value that is 20% lower than both. Fig- 3E4 Pure UDA ure 10(B) shows the sensitivity to the buffer size for the AIG 4 Lock AIG 2E4 Subsampling and MRS schemes. We see that the runtime NoLock 1E4 2 to reach 2X of the optimal objective value is lower for MRS. 0 0E0 0 This is as expected since MRS has faster convergence rate 0 5 10 15 20 0 2 4 6 8 than Subsampling. Finally, we verify that Bismarck with Epoch Number of Threads the MRS scheme provides better performance than existing Figure 9: Parallelizing IGD: (A) Plot of objective value over in-RDBMS tools on large datasets (that do not fit in avail- epochs for the pure UDA version and the shared-memory able RAM). For a simple task like LR on the Classify300M UDA variants (Lock, AIG, NoLock) for CRF over CoNLL dataset over PostgreSQL, with a buffer that is just 1% of on 8 threads (segments). (B) Speed-up of the per-epoch the dataset size, Bismarck with the MRS scheme achieves gradient computation times against the number of threads. the same objective value as MADlib in 45 minutes, while The per-epoch time of the single-threaded run is 20.6s. MADlib takes over 3 hours. On a more complex task like LMF on the Matrix5B dataset, Bismarck with MRS scheme finishes in a few hours, while MADlib did not terminate even after one week. rate than the shared-memory UDA. We first study the objec- tive value over epochs for both the implementations. We use the three concurrency schemes for the shared-memory UDA 5. CONCLUSIONS AND FUTURE WORK – lock the model (Lock), AIG, and no locking (NoLock). We present Bismarck, a novel architecture that takes a We present the results for CRF on CoNLL in Figure 9(A) step towards unifying in-RDBMS analytics. Using insights (similar results on other tasks skipped here for brevity). from the mathematical programming literature, Bismarck Figure 9(A) shows that the pure UDA implementation has provides a single systems-level abstraction to implement a poorer convergence rate compared to the shared-memory large class of existing and next-generation analytics tech- UDA with Lock, since the model averaging in the former niques. In providing a unified architecture, we argue that yields poorer quality [52]. The figure also shows that AIG Bismarck may reduce the development overhead for intro- and NoLock have similar convergence rate to the Lock ap- ducing and maintaining sophisticated analytics code in an proach. This is in line with recent results from the machine RDBMS. Bismarck also achieves high performance on these learning literature [37]. By adopting the NoLock shared- techniques by effectively utilizing standard features available memory UDA parallelism into Bismarck, we achieve signif- inside every RDBMS. We implemented Bismarck over two icant speed-ups in a generic way across all the analytics tasks commercial RDBMSes and PostgreSQL, and verified that we handle. Figure 9(B) shows the speed-ups (over a single- Bismarck achieves competitive, and often superior, per- threaded run) achieved by the four parallelism schemes in formance than the state-of-the-art analytics tools natively DBMS B. As expected, the Lock approach has no speed-up, offered by these RDBMSes. while the speed-up of the pure UDA approach is sub-optimal While Bismarck can handle many analytics techniques due to model passing overheads. NoLock and AIG achieve in the current framework, it is interesting future work to linear speed-ups, with NoLock having the highest speed-ups. integrate more sophisticated models, e.g., simulation mod- els, into our architecture. Another direction is to handle 4.5 Multiplexed Reservoir Sampling large-scale combinatorial optimization problems inside the We verify that our Multiplexed Reservoir Sampling (MRS) RDBMS, including tasks like linear programming and fun- scheme has faster convergence rate compared to both Sub- damental NP-hard problems like MAX-CUT. sampling and operating over clustered data (Clustered). One area to improve Bismarck is to match the perfor- mance of some specialized tools for tasks like support vector machines by using more optimizations, e.g. model or feature -Log Likelihood 1E4 12E3 Subsampling Sub- compression. There are also possibilities to improve perfor- 8E3 B MRS Sampling mance by modifying the DBMS engine, e.g., exploiting bet- 4E3 800 2.50 (48) 0.60 (10) ter mechanisms for model passing and storage, concurrency MRS Clustered1600 1.37 (26) 0.36 (6) control, etc. Another direction is to examine more fully how 0E0 0 0 10 20 30 40 50 3200 0.69 (13) 0.12 (2) to utilize features that are available in parallel RDBMSes. Epoch Figure 10: Multiplexed Reservoir Sampling: (A) Objective 6. ACKNOWLEDGMENTS value against epochs for LR on DBLife. The buffer size for This research has been supported by the ONR grant N00014- Subsampling and MRS is 1600 tuples (10% of the dataset). 12-1-0041, the NSF CAREER award IIS-1054009, and gifts (B) Runtime (in sec) to reach 2X the optimal objective value from EMC Greenplum and Oracle to Christopher R´e, and by for different buffer sizes, B. The numbers in parentheses in- the ONR grant N00014-11-1-0723 to Benjamin Recht. We dicate the respective number of epochs. The same values for also thank Joseph Hellerstein, and the analytics teams from Clustered are 1.03s (19). EMC Greenplum and Oracle for invaluable discussions. Figure 10(A) plots the objective value against epochs for 7. REFERENCES [1] Microsoft SQL Server 2008 R2 Data Mining. the three schemes. For Subsampling and MRS, we choose [2] Microsoft SQL Server Books Online. a buffer size that is about 10% the dataset size (for LR on [3] Oracle Data Cartridge Developer’s Guide 11g. DBLife). We see from the figure that MRS has faster con- [4] Oracle Data Mining.

12. [5] PostgreSQL 9.0 Documentation. [38] Oracle Advanced Analytics, Oracle R Enterprise Group. [6] Shared Memory and LWLocks in PostgreSQL. Personal Communication. [7] Vowpal Wabbit. [39] P. Raghavendra. Optimal Algorithms and Inapproximability [8] S. Abiteboul, R. Hull, and V. Vianu. Foundations of Results for Every CSP? In STOC, pages 245–254, 2008. Databases. Addison-Wesley, 1995. [40] H. Robbins and S. Monro. A Stochastic Approximation [9] R. Agrawal and R. Srikant. Fast Algorithms for Mining Method. Ann. Math. Statistics, 22(3):400–407, 1951. Association Rules in Large Databases. In VLDB, pages [41] R. T. Rockafellar. Monotone Operators and the Proximal Point 487–499, 1994. Algorithm. SIAM J. on Control and Optimization, 14(5), [10] K. M. Anstreicher and L. A. Wolsey. Two “Well-known” 1976. Properties of Subgradient Optimization. Math. Program., [42] R. T. Rockafellar. Convex Analysis (Princeton Landmarks in 120(1):213–220, 2009. Mathematics and Physics). Princeton University Press, 1996. [11] D. P. Bertsekas. A Hybrid Incremental Gradient Method for [43] P. Sen, A. Deshpande, and L. Getoor. Exploiting Shared Least Squares. SIAM Journal on Optimization, 7, 1997. Correlations in Probabilistic Databases. PVLDB, 1(1):809–820, [12] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, 2008. Belmont, MA, 2nd edition, 1999. [44] R. Tibshirani. Regression Shrinkage and Selection via the [13] D. P. Bertsekas. Incremental Gradient, Subgradient, and Lasso. Journal of the Royal Statistical Society, B, 58(1), 1996. Proximal Methods for Convex Optimization: A Survey. [45] P. Tseng. An Incremental Gradient(-Projection) Method with Technical report, Laboratory for Information and Decision Momentum Term and Adaptive Stepsize Rule. SIAM Joural on Systems, 2010. Optimization, 8(2), 1998. [14] L. Bottou and O. Bousquet. The Tradeoffs of Large Scale [46] J. S. Vitter. Random Sampling with a Reservoir. ACM Trans. Learning. In NIPS, 2007. Math. Softw., 11(1):37–57, 1985. [15] L. Bottou and Y. LeCun. Large Scale Online Learning. In [47] G. Wahba, C. Gu, Y. Wang, and R. Chappell. Soft NIPS, 2003. Classification, a.k.a. Risk Estimation, via Penalized Log [16] S. Boyd and L. Vandenberghe. Convex Optimization. Likelihood and Smoothing Spline Analysis of Variance. In The Cambridge University Press, New York, NY, USA, 2004. Mathematics of Generalization., Santa Fe Institute Studies in [17] J. Cohen, B. Dolan, M. Dunlap, J. M. Hellerstein, and the Sciences of Complexity. Addison-Wesley, 1995. C. Welton. MAD Skills: New Analysis Practices for Big Data. [48] H. M. Wallach. Conditional Random Fields: An Introduction. PVLDB, 2(2):1481–1492, 2009. Technical report, Dept. of CIS, Univ. of Pennsylvania, 2004. [18] O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao. Optimal [49] D. Z. Wang, M. J. Franklin, M. N. Garofalakis, and J. M. Distributed Online Prediction. In ICML, pages 713–720, 2011. Hellerstein. Querying Probabilistic Information Extraction. [19] A. Deshpande and S. Madden. MauveDB: Supporting PVLDB, 3(1):1057–1067, 2010. Model-based User Views in Database Systems. In SIGMOD, [50] D. Z. Wang, E. Michelakis, M. Garofalakis, and J. M. pages 73–84, 2006. Hellerstein. BayesStore: Managing Large, Uncertain Data [20] J. Duchi, A. Agarwal, and M. J. Wainwright. Distributed Dual Repositories with Probabilistic Graphical Models. PVLDB, Averaging in Networks. In NIPS, 2010. 1(1):340–351, 2008. [21] EMC Greenplum. Personal Communication. [51] M. Wick, A. McCallum, and G. Miklau. Scalable Probabilistic Databases with Factor Graphs and MCMC. PVLDB, [22] X. Feng, A. Kumar, B. Recht, and C. R´ e. Towards a Unified 3(1):794–804, 2010. Architecture for in-RDBMS Analytics. CoRR, abs/1203.2574, 2012. [52] Z. A. Zhu, W. Chen, G. Wang, C. Zhu, and Z. Chen. P-packSVM: Parallel Primal grAdient desCent Kernel SVM. In [23] R. Gemulla, E. Nijkamp, P. J. Haas, and Y. Sismanis. ICDM, pages 677–686, 2009. Large-scale Matrix Factorization with Distributed Stochastic Gradient Descent. In KDD, pages 69–77, 2011. [53] M. Zinkevich, M. Weimer, A. Smola, and L. Li. Parallelized Stochastic Gradient Descent. In NIPS, 2010. [24] M. X. Goemans and D. P. Williamson. Approximation Algorithms for MAX-3-CUT and Other Problems via Complex Semidefinite Programming. J. Comput. Syst. Sci., 68(2), 2004. [25] R. Gupta and S. Sarawagi. Creating Probabilistic Databases APPENDIX from Information Extraction Models. In VLDB, pages 965–976, 2006. A. PROXIMAL POINT METHODS [26] T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of To handle regularization and constraints, we need an addi- Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer-Verlag, 2001. tional concept called proximal point methods. These do not [27] R. Jampani, F. Xu, M. Wu, L. L. Perez, C. M. Jermaine, and change the data access patterns, but do enable us to han- P. J. Haas. MCDB: A Monte Carlo Approach to Managing dle constraints. We state the complete step rule including a Uncertain Data. In SIGMOD, pages 687–698, 2008. projection that allows us to handle constraints: [28] T. Kudo. CRF++: Yet Another CRF Toolkit. [29] J. Langford, L. Li, and T. Zhang. Sparse Online Learning via Truncated Gradient. JMLR, 10:777–801, 2009. w(k+1) = ΠαP w(k) − αk ∇fη(k) (w(k) ) (3) [30] Z. Luo. On the Convergence of the LMS Algorithm with Adaptive Learning Rate for Linear Feedforward Networks. Neural Computation, 3(2):226–245, 1991. Where the function ΠαP is called a proximal point oper- [31] Z. Q. Luo and P. Tseng. Analysis of an Approximate Gradient ator and is defined by the expression: Projection Method with Applications to the Backpropagation 1 2 Algorithm. Optimization Methods and Software, 4, 1994. ΠαP (x) = arg min 2 x−w 2 + αP (w) w [32] O. L. Mangasarian. Linear and Nonlinear Separation of Patterns by Linear Programming. Operations Research, 13, In the case where P is the indicator function of a set C, ΠαP 1965. is simply the Euclidean projection onto C [41]. Thus, these [33] O. L. Mangasarian and M. V. Solodov. Serial and Parallel Backpropagation Convergence via Nonmonotone Perturbed constraints can be used to ensure that the model stays in Minimization. Optimization Methods and Software, 4, 1994. some convex set of constraints. An example proximal-point [34] A. McCallum. MALLET: A Machine Learning for Language operator ensures that the model has unit Euclidean norm Toolkit, 2002. by projecting the model on to the the unit ball. P (w) might [35] B. L. Milenova, J. Yarmus, and M. M. Campos. SVM in Oracle Database 10g: Removing the Barriers to Widespread Adoption also be a regularization penalty such as total-variation or of Support Vector Machines. In VLDB, pages 1152–1163, 2005. negative entropy. These are very commonly used in statis- [36] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust tics to improve the generalization of the model or to take Stochastic Approximation Approach to Stochastic Programming. SIAM Journal on Optimization, 19(4), 2009. advantage of properties that are known about the model to [37] F. Niu, B. Recht, C. R´e, and S. Wright. Hogwild: A Lock-Free reduce the number of needed measurements. Approach to Parallelizing Stochastic Gradient Descent. In NIPS, 2011.