In this paper we highlight the emerging practice of Magnetic, Agile, Deep (MAD) data analysis as a radical departure from traditional Enterprise Data Warehouses and Business Intelligence. We present our design philosophy, techniques and experience providing MAD analytics for one of the world's largest advertising networks at Fox Audience Network, using the Greenplum parallel database system. We describe database design methodologies that support the agile working style of analysts in these settings. We present data-parallel algorithms for sophisticated statistical techniques, with a focus on density methods. Finally, we reflect on database system features that enable agile design and flexible algorithm development using both SQL and MapReduce interfaces over a variety of storage mechanisms.


This will produce a square matrix and a vector based on the size of the independent variable vector. The final calculation is computed by inverting the small matrix A and multiplying by the vector to derive the coefficients \beta^*. Additionally, the coefficient of determination R^2 can be calculated concurrently by

    SSR = b^T \beta^* - \frac{1}{n}\Big(\sum_i y_i\Big)^2

    TSS = \sum_i y_i^2 - \frac{1}{n}\Big(\sum_i y_i\Big)^2

    R^2 = \frac{SSR}{TSS}

In the following SQL query, we compute the coefficients \beta^*, as well as the components of the coefficient of determination:

CREATE VIEW ols AS
SELECT pseudo_inverse(A) * b AS beta_star,
       (transpose(b) * (pseudo_inverse(A) * b)
          - sum_y2 / n)              -- SSR
       / (sum_yy - sum_y2 / n)       -- TSS
       AS r_squared
FROM (SELECT sum(transpose(d.vector) * d.vector) AS A,
             sum(d.vector * y) AS b,
             sum(y)^2 AS sum_y2, sum(y^2) AS sum_yy,
             count(*) AS n
      FROM design d) AS ols_aggs;

Note the use of a user-defined function for vector transpose, and user-defined aggregates for summation of (multidimensional) array objects. The array A is a small in-memory matrix that we treat as a single object; the pseudo_inverse function implements the textbook Moore-Penrose pseudoinverse of the matrix.
All of the above can be efficiently calculated in a single pass of the data. For convenience, we encapsulated this yet further via two user-defined aggregate functions:

SELECT ols_coef(d.y, d.vector), ols_r2(d.y, d.vector)
FROM design d;

Prior to the implementation of this functionality within the DBMS, one Greenplum customer was accustomed to calculating the OLS coefficients by exporting data and importing it into R, a process that took several hours to complete. They reported a significant performance improvement when they moved to running the regression within the DBMS. Most of the benefit derived from running the analysis in parallel, close to the data, with minimal data movement.
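For readers outside the database, the same single-pass computation is easy to mirror in ordinary code. The sketch below (Python with numpy; the function name and toy data are ours, not the paper's UDFs) accumulates the small aggregates A = Σ xxᵀ, b = Σ xy, Σy and Σy² in one scan, then applies the Moore-Penrose pseudoinverse exactly as the ols view does.

```python
import numpy as np

def ols_single_pass(rows):
    """One scan over (x_vector, y) pairs: accumulate the small
    aggregates, then invert the tiny k x k matrix at the end."""
    A = None                 # running sum of x x^T (k x k)
    b = None                 # running sum of x * y  (k,)
    sum_y = sum_yy = n = 0.0
    for x, y in rows:
        x = np.asarray(x, dtype=float)
        if A is None:
            A = np.zeros((x.size, x.size))
            b = np.zeros(x.size)
        A += np.outer(x, x)
        b += x * y
        sum_y += y
        sum_yy += y * y
        n += 1
    beta = np.linalg.pinv(A) @ b          # Moore-Penrose, as in the UDF
    ssr = b @ beta - sum_y**2 / n         # SSR
    tss = sum_yy - sum_y**2 / n           # TSS
    return beta, ssr / tss

# toy usage: y = 2*x1 + 3*x2 exactly, so R^2 should be 1
data = [((1.0, 0.0), 2.0), ((0.0, 1.0), 3.0),
        ((1.0, 1.0), 5.0), ((2.0, 1.0), 7.0)]
beta, r2 = ols_single_pass(data)
```

The single pass over `rows` plays the role of the inner aggregation query; only the tiny k-by-k matrix is ever inverted.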
5.2.2 Conjugate Gradient

In this subsection we develop a data-parallel implementation of the Conjugate Gradient method for solving a system of linear equations. We can use this to implement Support Vector Machines (SVMs), a state-of-the-art technique for binary classification. Binary classifiers are a common tool in modern ad placement, used to turn complex multi-dimensional user features into simple boolean labels like "is a car enthusiast" that can be combined into enthusiast charts. In addition to serving as a building block for SVMs, the Conjugate Gradient method allows us to optimize a large class of functions that can be approximated by second-order Taylor expansions.

To a mathematician, the solution to the matrix equation Ax = b is simple when it exists: x = A^{-1}b. As noted in Section 5.1, we cannot assume we can find A^{-1}. If the matrix A is n x n, symmetric and positive definite (SPD), we can use the Conjugate Gradient method. This method requires neither df(y) nor A^{-1}, and converges in no more than n iterations. A general treatment is given in [17]. Here we outline the solution to Ax = b as an extremum of

    f(x) = \frac{1}{2} x^T A x - b^T x + c.

Broadly, we have an estimate \hat{x} of our solution x^*. Since \hat{x} is only an estimate, the residual r_0 = b - A\hat{x} is non-zero. Subtracting this error from the estimate allows us to generate a series of conjugate direction vectors v_i; the solution will be x^* = \sum_i \alpha_i v_i for \alpha_i defined below. We stop at the point \|r_k\|_2 < \epsilon for a suitable \epsilon. There are several update algorithms; we have written ours in matrix notation:

    r_0 = b - A\hat{x}_0, \quad v_0 = r_0, \quad \alpha_0 = \frac{r_0^T r_0}{v_0^T A v_0}, \quad i = 0.

Begin iteration over i:

    \alpha_i = \frac{r_i^T r_i}{v_i^T A v_i}
    x_{i+1} = x_i + \alpha_i v_i
    r_{i+1} = r_i - \alpha_i A v_i
    check \|r_{i+1}\|_2 \le \epsilon
    v_{i+1} = r_{i+1} + \frac{r_{i+1}^T r_{i+1}}{r_i^T r_i} v_i

To incorporate this method into the database, we stored (v_i, x_i, r_i, \alpha_i) as a row and inserted row i+1 in one pass. This required the construction of the functions update_alpha(r_i, v_i, A), update_x(x_i, alpha_i, v_i), update_r(r_i, alpha_i, v_i, A), and update_v(r_i, alpha_i, v_i, A). Though the function calls were redundant (for instance, update_v() also runs the update of r_{i+1}), this allowed us to insert one full row at a time. An external driver process then checks the value of r_i before proceeding. Upon convergence, it is rudimentary to compute x^*.

The presence of the Conjugate Gradient method enables even more sophisticated techniques like Support Vector Machines. At their core, SVMs seek to maximize the distance between a set of points and a candidate hyperplane; this distance is governed by the magnitude of the normal vector \|w\|^2. Most methods incorporate the integers {0, 1} as labels c, so the problem becomes

    \operatorname{argmin}_{w,b} f(w) = \frac{1}{2}\|w\|^2, \quad \text{subject to } c^T w - b \ge 0.

This method applies to the more general issue of high-dimensional functions under a Taylor expansion

    f_{x_0}(x) \approx f(x_0) + df(x_0)(x - x_0) + \frac{1}{2}(x - x_0)^T d^2 f(x_0)(x - x_0).

With a good initial guess for x^* and the common assumption of continuity of f(\cdot), we know that the matrix will be SPD near x^*. See [17] for details.
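The update rules above translate directly into a dense-matrix sketch; the following Python/numpy function (our own illustration, with an illustrative tolerance, not the paper's row-per-iteration implementation) runs the same iteration on an in-memory SPD system and can serve as a reference when validating the database version.

```python
import numpy as np

def conjugate_gradient(A, b, x0=None, eps=1e-10):
    """Solve A x = b for SPD A using the update rules of Section 5.2.2.
    Converges in at most n iterations for an n x n SPD matrix."""
    n = len(b)
    x = np.zeros(n) if x0 is None else np.asarray(x0, dtype=float)
    r = b - A @ x                          # r_0 = b - A x_0
    v = r.copy()                           # v_0 = r_0
    for _ in range(n):
        if np.linalg.norm(r) <= eps:       # check ||r||_2 <= eps
            break
        alpha = (r @ r) / (v @ A @ v)      # alpha_i
        x = x + alpha * v                  # x_{i+1}
        r_new = r - alpha * (A @ v)        # r_{i+1}
        v = r_new + ((r_new @ r_new) / (r @ r)) * v   # v_{i+1}
        r = r_new
    return x

# toy SPD system: [[4,1],[1,3]] x = [1,2] has solution (1/11, 7/11)
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = conjugate_gradient(A, b)
```

Each loop body corresponds to one row inserted by the database implementation; the convergence check mirrors the external driver process.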

5.3 Functionals

Basic statistics are not new to relational databases -- most support means, variances and some form of quantiles. But modeling and comparative statistics are not typically built-in functionality. In this section we provide data-parallel implementations of a number of comparative statistics expressed in SQL.

In the previous section, scalars or vectors were the atomic unit. Here a probability density function is the foundational object. For instance, the Normal (Gaussian) density f(x) = e^{-(x-\mu)^2/2\sigma^2} is considered by mathematicians as a single "entity" with two attributes: the mean \mu and the standard deviation \sigma. A common statistical question is to see how well a data set fits a target density function. The z-score of a datum x is given by z(x) = \frac{x - \mu}{\sigma / \sqrt{n}} and is easy to obtain in standard SQL:

SELECT x.value, (x.value - d.mu) * sqrt(d.n) / d.sigma AS z_score
FROM x, design d
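Because the query uses only scalar arithmetic, it runs unchanged on essentially any SQL engine. A minimal self-contained check (Python's sqlite3 with toy values; the one-row design table holding mu, sigma and n is our illustrative stand-in for the paper's design table) looks like:

```python
import sqlite3
import math

conn = sqlite3.connect(":memory:")
# register sqrt for SQLite builds without the math extension
conn.create_function("sqrt", 1, math.sqrt)
conn.executescript("""
    CREATE TABLE x(value REAL);
    CREATE TABLE design(mu REAL, sigma REAL, n INTEGER);
    INSERT INTO x VALUES (9.0), (10.0), (12.0);
    INSERT INTO design VALUES (10.0, 2.0, 4);  -- mu, sigma, sample size
""")
rows = conn.execute("""
    SELECT x.value, (x.value - d.mu) * sqrt(d.n) / d.sigma AS z_score
    FROM x, design d
    ORDER BY x.value
""").fetchall()
```

With mu = 10, sigma = 2 and n = 4, the standard error is sigma/sqrt(n) = 1, so each z-score is simply the deviation from the mean.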
5.3.1 Mann-Whitney U Test

Rank and order statistics are quite amenable to relational treatments, since their main purpose is to evaluate a set of data, rather than one datum at a time. The next example illustrates the notion of comparing two entire sets of data without the overhead of describing a parameterized density.

The Mann-Whitney U Test (MWU) is a popular substitute for Student's t-test in the case of non-parametric data. The general idea is to take two populations A and B and decide whether they are from the same underlying population by examining the rank order in which members of A and B show up in a general ordering. The cartoon is that if members of A are at the "front" of the line and members of B are at the "back" of the line, then A and B are different populations. In an advertising setting, click-through rates for web ads tend to defy simple parametric models like Gaussian or log-normal distributions. But it is often useful to compare click-through rate distributions for different ad campaigns, e.g., to choose the one with the better median click-through. MWU addresses this task.

Given a table T with columns SAMPLE_ID and VALUE, row numbers are obtained and summed via SQL windowing functions:

CREATE VIEW R AS
SELECT sample_id, avg(value) AS sample_avg,
       sum(rown) AS rank_sum, count(*) AS sample_n,
       sum(rown) - count(*) * (count(*) + 1) / 2 AS sample_u
FROM (SELECT sample_id, value,
             row_number() OVER (ORDER BY value DESC) AS rown
      FROM T) AS ordered
GROUP BY sample_id;

Assuming the condition of large sample sizes, for instance greater than 5,000, the normal approximation can be justified. Using the previous view R, the final reported statistics are given by:

SELECT r.sample_u, r.sample_avg, r.sample_n,
       (r.sample_u - a.sum_u / 2) /
         sqrt(a.sum_u * (a.sum_n + 1) / 12) AS z_score
FROM R AS r,
     (SELECT sum(sample_u) AS sum_u, sum(sample_n) AS sum_n
      FROM R) AS a
GROUP BY r.sample_u, r.sample_avg, r.sample_n, a.sum_n, a.sum_u;

The end result is a small set of numbers that describe a relationship of functions. This simple routine can be encapsulated by stored procedures and made available to the analysts via a simple SELECT mann_whitney(value) FROM table call, elevating the vocabulary of the database tremendously.
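The windowed rank-sum view runs on any engine with row_number() support; below is a self-contained sketch (Python's sqlite3 with toy samples of our own invention) that materializes the view and confirms the identity U_A + U_B = n_A * n_B on data where sample a entirely dominates sample b.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE T(sample_id TEXT, value REAL);
    -- sample 'a' sits at the front of the descending order,
    -- sample 'b' at the back
    INSERT INTO T VALUES
      ('a', 9.0), ('a', 8.0), ('a', 7.0),
      ('b', 3.0), ('b', 2.0), ('b', 1.0);
    CREATE VIEW R AS
    SELECT sample_id, avg(value) AS sample_avg,
           sum(rown) AS rank_sum, count(*) AS sample_n,
           sum(rown) - count(*) * (count(*) + 1) / 2 AS sample_u
    FROM (SELECT sample_id, value,
                 row_number() OVER (ORDER BY value DESC) AS rown
          FROM T) AS ordered
    GROUP BY sample_id;
""")
# each row: (sample_id, sample_avg, rank_sum, sample_n, sample_u)
stats = {row[0]: row for row in conn.execute("SELECT * FROM R")}
```

Sample a takes ranks 1..3 (rank sum 6, U = 0) and sample b takes ranks 4..6 (rank sum 15, U = 9), the extreme separation the "cartoon" describes.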
5.3.2 Log-Likelihood Ratios

Likelihood ratios are useful for comparing a subpopulation to an overall population on a particular attribute. As an example in advertising, consider two attributes of users: beverage choice and family status. One might want to know whether coffee attracts new parents more than the general population.

This is a case of having two density (or mass) functions for the same data set X. Denote one distribution the null hypothesis f_0 and the other the alternate f_A. Typically, f_0 and f_A are different parameterizations of the same density; for instance, N(\mu_0, \sigma_0) and N(\mu_A, \sigma_A). The likelihood L under f_i is given by

    L_{f_i} = L(X \mid f_i) = \prod_k f_i(x_k).

The log-likelihood ratio is given by the quotient -2 \log(L_{f_0} / L_{f_A}). Taking the log allows us to use the well-known \chi^2 approximation for large n. Also, the products turn nicely into sums, which an RDBMS can handle easily in parallel:

    LLR = 2 \sum_k \log f_A(x_k) - 2 \sum_k \log f_0(x_k).

This calculation distributes nicely if f_i : R \to R, which most do. If f_i : R^n \to R, then care must be taken in managing the vectors as distributed objects. Suppose the values are in a table T and the function f_A(\cdot) has been written as a user-defined function f_llk(x numeric, param numeric). Then the entire experiment can be performed with the call:

SELECT 2 * sum(log(f_llk(T.value, d.alt_param))) -
       2 * sum(log(f_llk(T.value, d.null_param))) AS llr
FROM T, design AS d

This represents a significant gain in flexibility and sophistication for any RDBMS.

Example: The Multinomial Distribution

The multinomial distribution extends the binomial distribution. Consider a random variable X with k discrete outcomes, having probabilities p = (p_1, \ldots, p_k). In n trials, the joint probability distribution is given by

    P(X \mid p) = \binom{n}{n_1, \ldots, n_k} p_1^{n_1} \cdots p_k^{n_k}.

To obtain p_i, we assume a table outcome with a column outcome representing the base population:

CREATE VIEW B AS
SELECT outcome,
       outcome_count / sum(outcome_count) OVER () AS p
FROM (SELECT outcome, count(*)::numeric AS outcome_count
      FROM input
      GROUP BY outcome) AS a

In the context of model selection, it is often convenient to compare the same data set under two different multinomial distributions:

    LLR = -2 \log \frac{P(X \mid p)}{P(X \mid \tilde{p})}
        = -2 \log \frac{\binom{n}{n_1,\ldots,n_k} p_1^{n_1} \cdots p_k^{n_k}}
                       {\binom{n}{n_1,\ldots,n_k} \tilde{p}_1^{n_1} \cdots \tilde{p}_k^{n_k}}
        = 2 \sum_i n_i \log \tilde{p}_i - 2 \sum_i n_i \log p_i.

Or in SQL: