HOT

Approximating Aggregates with Distribution Precision Guarantee
3 点赞
1 收藏
1下载
In many cases analysts require the distribution of (group, aggvalue) pairs in the estimated answer to be guaranteed within a certain error threshold of the exact distribution. Existing AQP techniques are inadequate for two main reasons. First, users cannot express such guarantees. Second, sampling techniques used in traditional AQP can produce arbitrarily large errors even for SUM queries. To address those limitations, we first introduce a new precision metric, called distribution precision, to express such error guarantees. We then study how to provide fast approximate answers to aggregation queries with distribution precision guaranteed within a user specified error bound.

3.construct and utilize uniform and measure-biased samples, is intro- distribution answer xˆ= x ˆ1 , . . . , x ˆr . We propose to measure the duced in Section 4, and the “seek” part, i.e., auxiliary indexes, is in ˆ using the L2 (Euclidean)-distance between x and x error in x ˆ , i.e., r Section 5. Section 6 discusses extensions of our approach. Exper- x−x ˆ 2 = i=1 (xi − x ˆi )2 . For the ease of understanding, imental study is reported in Section 7, followed by related work in we can rephrase x − x ˆ 2 in words as follows: Section 8 and conclusion in Section 9. Proofs of most non-trivial group i’s value estimated group i’s value 2 theorems and additional experiments are in the appendix. − . (2) total group value total est. group value group i 2. PRELIMINARIES Our system allows users to specify an error bound which is We first define the class of queries that we focus on, and formal- independent on both the query Q and the data distribution in table ize the precision guarantee our system offers for query answers. T . The estimated answer x ˆ is called an -approximate answer or -approximation of the true distribution answer x if x − x ˆ 2≤ . 2.1 Single-Block Aggregation Queries System task and precision guarantee. Our AQP system processes T (D1 , . . . , DdC , M1 , . . . , MdM ) is a table with dC + dM dimen- single-block aggregation queries and promises -distribution preci- sions. D1 , . . . , DdC are group-by and predicate dimensions and sion guarantee: that is, for a user-specified error bound , our sys- M1 , . . . , MdM are measure attributes. Types of dimensions are de- tem can produce an estimated answer x ˆ for any given single-block fined by data owner. A column can be both a group-by/predicate aggregation query Q, s.t. xˆ must be an -approximation of the exact dimension and a measure at the same time in different queries. answer x to Q, i.e., x − x ˆ 2 ≤ , with high probability. For a row t ∈ T , let tD and tM be the values of dimension D As discussed in Section 1, compared to CI, our distribution pre- and measure M on this row, respectively, and tID denote the ID of cision guarantee based on L2 -distance is more rigorous, easier to a row. Let |T | = n be the number of rows in T . be interpreted, and has only one parameter that consistently mea- Table aggregation queries. For the ease of explanation, we first sures how two answers differentiate globally. focus on queries in the form of Q(G, F (M ), P), called table ag- Another good property of our guarantee is: maxi |xi − x ˆi | ≤ gregation queries, on a single table T to introduce our techniques: x−x ˆ 2 ≤ . It says that, when -distribution precision guarantee holds, the max of errors |xi − x ˆi | for all groups is also bounded by SELECT G, F (M ) FROM T WHERE P GROUP BY G. (1) , with high probability. It implies that the order of groups or the • Group-by dimensions G ⊆ {D1 , . . . , DdC } are a subset of di- top-k groups derived from x ˆ can be wrong by at most . mensions of T on which rows are aggregated on. • Aggregate measure F (M ) is COUNT(∗) or SUM(M ) where 3. SYSTEM OVERVIEW M ∈ {M1 , . . . , Mm } is a measure attribute. We give an overview of our solution in Section 3.1, including • Predicate P consists of equality constraints on categorical di- sample/index building and aggregation query processing, and sum- mensions {D1 , . . . , DdC }, in the form of “Di = vi ”. P is an marize the main theoretical results we obtain. In Section 3.2, we AND-OR expression of multiple such constraints. introduce the architecture of our system. For simplicity of discussion, we assume that values of a measure M 3.1 Overview of Our Approach are non-negative when presenting our techniques. How to handle To tackle the inherent deficiency of sampling-based approach measures with negative values will be introduced in Section 6.1 . when handling queries with very selective predicates, we propose Single-block aggregation queries. Our techniques can be extended a novel sample+seek framework. We classify queries into large for a more general class, single-block aggregation queries, which queries and small queries based on how many rows satisfy the generalizes the class of table aggregation queries in three aspects: i) predicates. For large queries, with many rows satisfying their pred- (schema) to support foreign key joins on a star/snowflake schema; icates, we are able to collect enough number of rows satisfying ii) (predicates) to support range constraints on numeric dimen- their predicates from a uniform sample or a new sample designed sions and arbitrary AND-OR logical expressions P of equality con- for SUM aggregates, called measure-biased sample, to obtain - straints and range constraints; iii) (aggregates) support more ag- approximate answers. For small queries, random samples are not gregates, e.g., AVG(M ), STDEV(M ), and COUNT(Distinct M ). sufficient, so we propose two new index structures, called measure- We introduce how our system handles the above extensions in Sec- augmented inverted index and low-frequency group index, to eval- tion 6.2. The class of aggregates supported by us is the same as the uate them for -approximations using index seeks. class supported by state of the art AQP systems like BlinkDB . Answers and distributions. An output to an aggregation query Q 3.1.1 Classification of Queries is a set of (group, value)-pairs, {(g1 , z1 ), (g2 , z2 ), . . . , (gr , zr )}, We classify queries using (measure) selectivity into large queries where r is the number of distinct values on the product of group-by and small queries, based on the hardness of being answered. dimensions G = {G1 , . . . , Gl }. Each gi ∈ G1 × G2 × . . . Gl is D EFINITION 1. (Selectivity and Measure selectivity) The se- said to be a group, and zi is the corresponding aggregate value. We lectivity of predicate P in a query Q against a table T is s(P) = can present the output to be an r-dim vector z = z1 , . . . , zr . |TP |/|T | where |T | is the number of rows in T and TP ⊆ T is the The query output z can be normalized as a distribution on all set of rows that satisfy the predicates P in T . groups G1 × . . . × Gl : let xi ← zi / rj=1 zj , for i = 1, 2, . . . , r. The measure selectivity of predicate P in a query Q against The vector x = x1 , . . . , xr is said to be a distribution answer (or a table T on measure M is sM (P) = M (TP )/M (T ), where answer for short) to the query Q. Our guarantee about precision M (TP ) = t∈TP tM and M (T ) = t∈T tM . introduced next is w.r.t. the distribution answer with the goal of preserving the relative ratio/order of groups in the query output. D EFINITION 2. (Large queries and small queries) For a selec- tivity threshold s0 , a query Q with COUNT aggregate is a s0 -large 2.2 Distribution Precision Guarantee query iff s(P) ≥ s0 and otherwise a s0 -small query. Computing the exact output or distribution answer x to a query Q Similarly, a query Q with SUM(M ) is a s0 -large query iff we can be expensive on large tables. Our system produces an estimated have sM (P) ≥ s0 , and otherwise a s0 -small query. 681

4. Input: aggregation query Q(G, F (M ), P) collect enough number of rows satisfying the predicate from the Output: -approximate answer distribution samples, it continues to use the other two deterministic auxiliary 1: If F = COUNT then indexes to evaluate the query for -approximation. 2: x, supxˆ ) ← ProcessWithDataBubble(Q, T 0 ); (ˆ Our query processing algorithm is outlined in Algorithm 1. We 3: Else If F = SUM then process a query with the data bubble samples first (lines 1-4). Let 4: x, supxˆ ) ← ProcessWithDataBubble(Q, T M ). (ˆ ˆ be the estimated distribution answer, and supxˆ be the number of x 5: If supxˆ ≥ 1/ 2 then sample rows satisfying the predicate P (those used to calculate x ˆ ). 6: Output x ˆ; If supxˆ ≥ 1/ 2 , we can terminate with an -approximate answer x ˆ 7: Else ˆ is not accurate enough, so we will continue (lines 5-6); otherwise, x 8: If the low-frequency group index is usable to check whether the low-frequency group index is usable (lines 8- 9: ˆ ← ProcessWithLFIndex(Q); x 9), or refer to the measure-augmented inverted index (lines 10-11). 10: Else Section 4 introduces how to create and utilize data bubble sam- 11: ˆ ← ProcessWithMAIndex(Q); x ples. Section 5 presents how to build and process queries with 12: Output x ˆ. measure-augmented inverted index and low-frequency group index. Algorithm 1: Processing Aggregation Queries 3.1.4 A Summary of Theoretical Guarantees We now present the performance guarantees of our system for √ producing -approximations, in an informal way. We set s0 to be 1/ n, where n is the number of rows in T , in Main results. For a table T with dC group-by/predicate dimen- most of discussion and analysis later on. We simply√call the two √ sions, dM measures, and n rows. We need to keep O n/ 2 rows classes large queries and small queries when s0 = 1/ n. in each of the dM +1 data bubbles. For any aggregation query Q(G, F (M ), P), we need to scan one of the data√bubbles to compute - 3.1.2 Offline Sampling and Index Building approximate answer in linear time O |G| · n/ 2 . If Q is a large For a table T with dC group-by and predicate dimensions, dM query, we can terminate with an -approximation with high proba- measures, and n rows, we build three types of samples/indexes. bility. If Q is a small query and the low-frequency group index can √ The first one is called data bubble. Each data bubble is a random be used to answer it, we scan at most n rows in the index to get sample of rows drawn √ from T . We draw a uniform sample T 0 ⊆ its answer. If the low-frequency group index cannot be used, we T containing O n/ 2 rows to answer COUNT-queries, and a can always use the measure-augmented inverted index to perform √ measure-biased sample containing O n/ 2 rows T M used to O 1/ 2 random I/O’s for an -approximation. answer SUM(M )-queries for each measure dimension M . So there Parameters. Both the error bound and the selectivity threshold are a total of dM +1 data bubbles. The basic idea of measure-biased s0 are tunable. Our theoretical results and system can be extended sampling is to pick a row t in T with replacement with probabil- for general parameter settings. The precision parameter can be ity proportional to tM , to get -approximate answers to SUM(M )- specified by users or administrators to trade-off between precision queries using a minimum number of sample rows. As their sizes and√indexing/processing cost. The selectivity threshold s0 is set as are sublinear in n, we load them into memory in pre-processing. 1/ n by default. Generally, the size of each sample needs to be The second one is called measure-augmented inverted index. For O (1/s0 ) · (1/ 2 ) to get -approximations for s0 -large queries. each value of a categorical dimension, we keep a postings list of Error bound and accuracy probability in big O notations. All row IDs where this value appears in the corresponding dimension. the theoretical results about obtaining -approximation in this pa- Up until now, it is similar to an inverted index in IR systems. We per hold with high probability (“w.h.p.” for short) – we mean that, then augment each postings list with rough values of measures. For with probability at least 1 − δ (e.g., 0.95) for some constant δ example, if tM = 28 + 10, we store 8 (i.e., apx(tM ) = 28 ) for (e.g., 0.05), our estimation is an -approximation. For a fixed er- tM in the index to save space. Similar to measure-biased sampling, ror bound , there is a tradeoff between the sample sizes/processing these rough values apx(tM )’s will be used to guide our online sam- costs (those in “main results” above) and the accuracy probability pling when processing SUM-aggregation queries. We keep this in- 1−δ: theoretically, for an accuracy probability 1−δ, there is an ad- dex on disk if we do not have enough memory, but each postings ditional factor log(1/δ) behind the O(·) notations about the sample list is stored sequentially s.t. it can be efficiently read when needed. sizes/processing costs. However, since δ does not have to be arbi- The third one is called low-frequency group index. In this index, trarily small, we can treat log(1/δ) as a constant for simplicity of for each value that is infrequent in a categorical√dimension, i.e., the presentation. So in all O(·) or Ω(·) notations in the main body of number of rows containing it is no more than n, we materialize this paper, we omit all the small logarithmic terms like log(1/δ) these rows sequentially on disk. The motivation of this index is and log(1/ ) when presenting theoretical results. These terms will that, if an infrequent value appears in the predicate of a query as be found in the proofs of our theorems in the appendix. part of a conjunction, we can simply scan these rows sequentially on disk to calculate the answer. For example, consider the table T 3.2 System Architecture in Figure 1(a) and a predicate “C2 = 1 AND C3 = 0”. Only 10 The architecture of our system is in Figure 2. Our system can be rows satisfy C2 = 1, so we can simply scan them for the answer. attached to a database with its independent storage and query pro- cessing engine. It connects to a data source in the column-based 3.1.3 Sample+Seek Processing of Queries format using a component called data connector, which can be Our system does not estimate queries’ selectivities. For every in- customized for different database systems. In the index building coming query, we first proceed with our uniform sample or measure- stage, the three of indexes, data bubbles, measure-augmented in- biased samples. If enough number (1/ 2 ) of rows satisfying its verted index, and low-frequency group index, are created from the predicate are collected from the samples, an -approximation can column-based storage. After index building is finished, our query be produced. For large queries, the processing can be terminated processor needs to access the data bubbles (in memory) and the at this point with high probability. Only when our system does not other two indexes (on disk) to process aggregation queries. 682

5. Query Processing Query of Theorem 3 generalizes it extensively from two aspects: i) to Aggregation query Estimated answer processor distribution show E x − x ˆ 22 ≤ O 2 , we need to utilize the fact that the number of sample rows satisfying P, denoted as mP , concentrates Index Building on Ω 1/ 2 (w.h.p.); and ii) we need to generalize McDiarmid’s Data bubbles: √samples Measure-augmented Low-frequency inverted index group index inequality to complete the proof for x − x ˆ 2 ≤ (w.h.p.). with sizes O 2n C OROLLARY 4. (Without Predicate ) We use uniform sam- Meta data Data pling to draw O 1/ 2 rows T 0 from a table T of n rows. For any Columnizer Column-based Data source storage connector aggregation query Q with COUNT aggregate on T and no predi- (n rows) Dictionary cate, let x be the exact answer to Q. We can use T 0 to calculate an estimated answer x ˆ s.t. with high probability, x − x ˆ 2≤ . Figure 2: System Architecture Input: aggregation query Q(G, F (M ), P) and sample T Optimality. The information-theoretic lower bound in  also im- Output: -approximate answer x ˆ and support supxˆ plies the optimal sampling size for COUNT, i.e., at least Ω 1/ 2 sample rows are needed to obtain an -approximate answer (with ˆ ← 0; supxˆ ← 0; 1: x probability at least 1 − δ) for a COUNT-query with no predicate. 2: For each sample row t ∈ T : 3: If t satisfies P then Early termination. We can derive an early termination condition 4: Let i be the group t belonging to on G; for Algorithm 2 from Corollary 4. In order to be an -approximate 5: ˆi ← x x ˆi + 1; supxˆ ← supxˆ +1; distribution answer to a query Q, x ˆ needs a sample of O 1/ 2 6: Normalize x ˆ to be a distribution on all groups on G; rows among all rows that satisfy Q’s predicate. So we can track the 7: Output (ˆ x, supxˆ ). size of this sample using supxˆ in Algorithm 2. When supxˆ ≥ 2/ 2 , we can exit the loop of lines 2-5 and output the normalized x ˆ. Algorithm 2: ProcessWithDataBubble(Q, T ): processing ag- We have integrated this early termination condition in our imple- gregation queries with a data bubble (sample) mentation. It offers significant performance gain for √ large queries whose selectivities s(P)’s are much larger than 1/ √n, as for those queries, only a small portion of rows in the O n/ 2 -sample 4. SAMPLE: UNIFORM OR BIASED need to be scanned to get a sample of 2/ 2 rows satisfying P. We now introduce the two sampling schemes, uniform sampling and measure-biased sampling, and how to use them to answer large 4.2 Measure-biased Sampling COUNT and SUM-queries with -approximation, respectively. To extend the uniform sampling schema for queries with SUM aggregates on any measure dimension M , a naive way is to change 4.1 Uniform Sampling xi ← x “ˆ ˆi + 1” to “ˆ xi ← x ˆi + tM ” on line 5 of Algorithm 2 Uniform sampling is a very natural idea to approximate the an- (recall that tM is the value of measure M on a row t). This simple swer distribution x for an aggregation query Q. The sampling pro- extension, however, is not accurate enough. We can show that using cess can be formally described as follows. For a given sample size √ a uniform sample with size m = O n/ 2 , the expected squared m, we repeat drawing a row with replacement from a table T for error E x − x ˆ 22 for large queries is no less than Var [M ] · 2 , m times – each row t ∈ T is chosen into T 0 with equal probability ¯ − 1)2 and M ¯ is the average where Var [M ] = n1 t∈T (tM /M Pr [t is picked] = 1/n. (3) value of measure M . Later in Proposition 7, we will also prove 0 that E x − x ˆ 22 is upper bounded by O ∆3 · 2 with the same The resulting m sample rows in T are stored in memory for pro- sample size, if measure M takes value in [1, ∆]. Both Var [M ] cessing √queries. We show that a pre-drawn uniform sample with and ∆ could be very large for a general data distribution on M . size O n/ 2 suffices for all large COUNT-queries against T . Recall cases I and II in Example 1.1. If we use the above naive Algorithm 2 shows how the uniform sample T 0 can be used to adaption of uniform sampling to process query Q, we will get an estimate answers to aggregation queries. When Q(G, COUNT(∗), estimated answer either in Figure 1(c) or in 1(d) with high proba- P) comes, we process it with T 0 using one scan. For each row t ∈ bility. Both are far away from the true answer. T 0 , we check whether t satisfies P (line 3); if yes and t belongs to We now introduce a new sampling schema called measure-biased the ith group on group-by columns G, we increase x ˆi by 1 (lines 4- sampling. For each measure M , we first need to construct (ran- 5). Finally, we normalize x ˆ = x ˆ1 , . . . , x ˆr as a distribution and domly) a measure-biased sample T M in preprocessing, which will also report the number of rows satisfying P as supxˆ (lines 6-7). be used to estimate answers for SUM-queries with high accuracy. The follow √ theorem formally shows that why T 0 with sample Constructing measure-biased samples. To construct T M with 2 size O n/ suffices for all large COUNT-queries. a given sample size m, we repeat drawing a row with replacement from T for m times – each row t is picked into T M with probability T HEOREM 3. (Precision of Uniform Sampling) We use uniform proportional to its value tM on measure M , i.e., sampling to draw O (1/s0 ) · (1/ 2 ) rows to form a sample T 0 from a table T of n rows. For any COUNT-aggregation query on tM Pr [t is picked] = ∼ tM ; (4) T , Q(G, COUNT(∗), P), let x be the exact answer to Q. If the t ∈T tM selectivity of predicates P is no less than s0 , s(P) ≥ s0 , Algo- rithm 2 uses T 0 to calculate an estimated answer x ˆ s.t. with high All the rows drawn are kept in T M . Note that each row may appear probability, x − ˆ x 2 ≤ . In particular, if the selectivity threshold multiple times in T M , i.e., T M could be a multi-set. We will show √ √ s0 = 1/ n, a uniform sample of O n/ 2 rows suffices. a table T with n rows, a measure-biased sample T M with that for √ size O n/ 2 suffices for all large SUM(M )-queries. A special case of Theorem 3 with selectivity threshold s0 = 1 If there are d measure attributes, we need to maintain d measure- has been studied in , as stated in Corollary 4 below. Our proof biased samples, each for one measure. We can construct all the 683

6.measure-biased samples together with the uniform sample in two 5.1 Measure-Augmented Inverted Index scans of the table T . Please refer to Appendix B for details. As we first focus on aggregation queries with categorical dimen- Estimating using measure-biased samples. When using T M to sions in their predicates, it is a natural idea to apply inverted indexes estimate the answer for a SUM(M )-query Q, it is a bit surprising from IR systems in our problem. For each value v of each dimen- that we can reuse Algorithm 2 – we only need to call the procedure sion Di , we can keep a postings list of rows that have value v on ProcessWithDataBubble(Q, T M ). Readers may wonder why in dimension Di . Then for a query Q(G, F (M ), P), there have been line 5, x ˆi is increased by 1 instead of the measure value tM for a extensive studies on how to efficiently retrieve all rows satisfying sample row t. The reason is, informally, that since the sampling P using those postings lists, such as . For a predicate that is a probability of t is proportional to tM (as in (4)), we need to cali- conjunction (with only ANDs) of equi-constraints, it is equivalent ˆ with t−1 brate row t’s contribution to the estimation x M . This idea is to computing the intersection of postings lists. After that, we only similar to Horvitz-Thompson estimator (which is for single-point need to scan the retrieved rows once for aggregation. estimation), but here we are estimating a distribution x. Refer to There are, however, at least two optimizations we can do in the the following example for more intuition. scenario of aggregation queries processing. First, we cannot keep full rows in the postings list, as each row could be wide (in terms E XAMPLE 4.1. (Continue Example 1.1) Let’s look at the query of the number of bits), especially if the number of dimensions is Q in Example 1.1, for which uniform sample fails. We draw a non-trivial (e.g., 30). We can keep only the IDs of those rows, and measure-biased sample T M with rows 1, 26, 51, 76, 92, 94, 96, later seek their dimension/measure values in the original table. 98, 111, 136, 161, and 186 each appearing once, and 199 and 200 Secondly, of course, random seek based on row ID is expensive. each appearing four times (considering their values on M – refer to Our basic idea is that, after we get the IDs of rows that satisfy the Example B.1 and Figure 9 in Appendix B on how they were drawn). predicate P, we draw a random sample from these IDs, and per- Reusing Algorithm 2, call ProcessWithDataBubble(Q, T M ) – the form random seeks only for the sample IDs. The two sampling ˆ = 8/20, 12/20 = 0.40, 0.60 , which is estimation it gives is x schemes introduced in Section 4 can be applied here online to draw very close to the exact answer x = 0.39, 0.61 . proper samples to achieve our -approximation precision guaran- tee. To this end, besides row IDs, we need approximate measure The following theorem formally shows that, values to guide our online measure-biased sampling. √ for each measure We introduce how to construct and utilize measure-augmented M , a measure-biased sample T M with size O n/ 2 suffices for all large SUM(M )-queries to give -approximation. inverted index in this subsection. 5.1.1 Construction T HEOREM 5. (Precision of Measure-Biased Sampling) We ap- Our measure-augmented inverted index is described in Figure 3. ply measure-biased sampling on measure attribute M and draw For each value v of a categorical dimension D, we maintain a O (1/s0 ) · (1/ 2 ) rows T M from a table T of n rows. For any measure-augmented postings list inv(D, v). Each of our measure- aggregation query, Q(G, SUM(M ), P), with aggregate SUM(M ) augmented postings lists consists of two parts. on T , let x be the exact answer to Q. If the measure selectiv- First, inv(D, v) maintains IDs of rows that have value v on D ity of predicates P, sM (P), is no less than s0 , Algorithm 2 uses just as in an IR-style postings list. This part is depicted in the left T M to calculate an estimated answer x ˆ s.t. with high probability, √ of Figure 3. From the ID of a row t, denoted as tID , we can access x−ˆx 2 ≤ . In particular, if the √selectivity threshold s0 = 1/ n, the value of each the dimension or measure of t from our column- 2 a measure-biased sample of O n/ rows suffices. based storage, using one random seek (I/O). Second, let’s use inv(D, v) also to denote the set of rows that Optimality of measure-biased sampling. Our measure-biased have value v on D. For each row t ∈ inv(D, v), we maintain the sampling for SUM-queries draws the same number of sample rows values of measures M1 , . . . , MdM on this row. Note that keeping all as the uniform sampling does for COUNT-queries. Since COUNT the exact measure values is still too expensive (e.g., for ten 64-bit is a special case of SUM (when the measure attribute is a constant integers). So we only keep an approximation apx(tMi ) of measure 1), the low-bound result in  also implies that for SUM-queries tMi in order to save the space as well as to speedup the access of with no predicate (s0 = 1), we need at least Ω 1/ 2 sample rows inv(D, v). apx(tMi ) only needs to be an 2-approximation of tMi : to obtain an -approximate answer with high probability. apx(tMi ) ≤ tMi < 2 · apx(tMi ). (5) Early termination. The same early termination condition intro- duced in Section 4.1 for the uniform samples can be applied here So the number of bits we need is at most log log ∆(Mi ) + 1, where for measure-biased samples. When we collect supxˆ ≥ 2/ 2 sam- ∆(Mi ) is the range of values in measure Mi . ples rows that satisfy the predicate in Q, we can exit the loop of The construction of measure-augmented inverted index is similar lines 2-5 of Algorithm 2 and output the normalized x ˆ . This condi- to constructing traditional IR inverted index, which can be done tion is also integrated in our implementation and offers significant within one scan of all the rows in linear time. performance gain for queries with high measure selectivities. E XAMPLE 5.1. (Continue Example 1.1) For the example ta- ble T with 200 rows in Figure 1(a), Figure 4(a)-4(b) shows the 5. SEEK: RANDOM OR SEQUENTIAL measure-augmented postings lists inv(C2 , 0) and inv(C3 , 0), re- Large √ queries with (measure) selectivity no smaller than s0 (= spectively. Values on measure M , i.e., tM , are rounded to apx(tM ) 1/ n by default) have been handled by uniform and measure- as 20 , 21 , 22 , . . ., so that (5) is satisfied. For example, tM for row biased samples. So what left here are small queries with (measure) 100 is rounded to 8 from 10 in inv(C3 , 0) and tM for row 200 is selectivity smaller than s0 . For example, for such a small COUNT rounded to 64 from 100 in inv(C2 , 0). The number of bits needed aggregation query, we know √ that the number of rows satisfying its for each apx(tM ) is log2 log2 ∆(M ) + 1. In this example, the predicate is no more than n. We will keep this property in mind range of M is ∆(M ) = 100, so we need only log2 log2 100 + 1 = when designing the two indexes to handle small queries. log2 6 + 1 = 3 bits, as apx(tM ) can only be 20 , 21 , . . . , 26 . 684

7. Postings Approximate log log ∆(Mi ) + 1 bits list inv(D, v) measures Input: aggregation query Q(G, F (M ), P) and table T (1) (1) (1) Index: measure-augmented inverted index inv (D, v) tID apx(tM1 ) ... ... apx(tMi ) ... ... Output: -approximate answer distribution x ˆ ... ... ... ... ... ... ... ... ... number of rows ˆ ← 0; 1: x (j) tID (j) apx(tM1 ) ... ... (j) apx(tMi ) ... ... containing v 2: Compute TPID ← {tID | rows t that satisfy P} using inv; 3: Repeat min{1/ 2 , |TPID |} times: ... ... ... 4: Randomly pick tID ∈ TPID : for each tID ∈ TPID , Figure 3: Measure-Augmented Inverted Index apx(tM ) ID apx(M ) Pr [tID is picked] = ∼ apx(tM ); (6) t ∈T apx(tM ) ID apx(M ) 1 1 ID apx(M ) 1 1 ... 1 5: t ← seeking T using tID for other dimensions; 1 1 6: Let i be the group t belonging to on G; ... 1 90 1 ... 1 90 1 91 8 7: ˆi ← x x ˆi + tM /apx(tM ); 90 1 101 1 ... 8 8: Normalize xˆ to be a distribution and output x ˆ. 101 1 ... 1 100 8 ... 1 Algorithm 3: ProcessWithMAIndex(Q): online approximate 198 1 101 1 198 1 measure-biased sampling and processing 199 64 ... 1 199 64 200 64 198 1 199 64 (c) inv(C2 , 0) ∩ inv(C3 , 0) (a) inv(C2 , 0) (b) inv(C3 , 0) measure M are within range [1, ∆]. For a SUM aggregation query Q with predicate P, let x be the exact answer to Q. We can use TP Figure 4: Two Measure-Augmented Postings Lists ˆ s.t. E [ x − x to get an estimation x ˆ 2] ≤ . Size of index. Our measure-augmented inverted index is composed How to process a query using the approximate measure-biased of row IDs and approximate measures. Suppose a table has n rows sampling technique is described in Algorithm 3. Inspired by the and dC dimensions. Each row ID needs O(log n) bits and is stored idea of measure-biased sampling (recall (4)), the sample size can dC times in the index, one time for each dimension. For each row be reduced tremendously to O 1/ 2 if rows in TP can be sam- ID and each measure Mi , we need O(log log ∆(Mi )) bits to store pled with probability proportional to tM . In measure-augmented the approximate measure apx(tMi ). So the index size is: inverted index, tM is not available, but an approximation to tM , T HEOREM 6. For a table with n rows, dC dimensions, and dM apx(tM ), is available. A row t in TP is drawn into sample TP measures, the measure-augmented inverted index needs a total of with probability proportional to apx(tM ) (line 4). If t is picked, we dM retrieve its dimension and measure values by seeking our column- O n · dC · (log n + Σi=1 log log ∆(Mi )) bits. based storage (line 5), and let i be the group it belongs to (lines 6). Each postings list inv(D, v) can be compressed using standard Let tM = tM /apx(tM ) be the weight of t in the estimation x ˆi techniques as it will only be accessed sequentially. We will report ˆ is normalized and output (line 8). We know (line 7). Finally, x the actual size of such indexes in our experiments. that, for any t, 1 ≤ tM < 2, from how apx(·M ) is chosen as in (5). The reduction in sampling complexity is achieved by shrink- 5.1.2 Processing Queries ing ∆ in Proposition 7. Intuitively, from Proposition 7, if ∆ = 2, To process a query Q(G, F (M ), P) using measure-augmented which is true for tM , a sample of size O 1/ 2 should suffice. Be- inverted index, we first retrieve IDs of all rows that satisfy P. Let fore proving the formal result in Theorem 8, we use an example to TP be the set of all such rows, and TPID be the set of their IDs. For demonstrate the online approximate measure-biased sampling. a predicate expression with AND, OR, and NOT operators on equi- E XAMPLE 5.2. (Continue Example 5.1) For the table T with constraints in the form of “Di = vi ”, this is equivalent to comput- 200 rows in Figure 1(a), consider such a query Q: ing set intersection, union, and minus of postings lists inv(Di , vi ), SELECT C1 , SUM(M ) FROM T and we only need one scan of those lists. WHERE C2 = 0 AND C3 = 0 GROUP BY C1 . After we get IDs of rows in TPID , we can seek the original table The answer x = 90/288, 198/288 = 0.31, 0.69 . Using our T for their dimension and measure values to finish the aggregation. measure-augmented postings lists in Figures 4(a)-4(b), we can first But |TPID | random seeks are expensive. compute the IDs of rows that satisfy “C2 = 0 AND C3 = 0”. A very natural idea is to draw a random sample from TPID , and We get TPID = {1, . . . , 90, 101, . . . , 198, 199} as in Figure 4(c). perform random seeks only for the sample rows. There are 188 rows in TPID with apx(tM ) = 1, and 1 row (199) For COUNT aggregates, from Corollary 4, a uniform sample of with apx(tM ) = 64. So if we draw 20 rows with probability pro- O 1/ 2 rows suffice for an -approximate answer x ˆ. portional to apx(tM ) from TPID , with high likelihood, we will get 7 For SUM aggregates, we propose approximate measure-biased rows from {1, . . . , 90}, 8 rows from {101, . . . , 198}, 5 copies of sampling to estimate an -approximate answer. row 199. From line 7 in Algorithm 3, we can estimate x ˆ= Online approximate measure-biased sampling. We start with a 7×1 8 × 1 + 5 × 100/64 proposition about the sample size needed for an -approximation if , = 0.30, 0.70 . uniform sampling is used. We prove that, with uniform sampling, 7 + 8 + 5 × 100/64 7 + 8 + 5 × 100/64 it suffices to seek O ∆3 / 2 sample rows, where ∆ is the range of measure values. This result itself is discouraging as ∆3 is large. T HEOREM 8. (Sample using Approximate Measures) We can Our approximate measure-biased sampling scheme will improve draw O 1/ 2 rows TP from TP biased on approximate measure the sample size to O 1/ 2 using approximate measures apx(·M ). apx(·M ) of M (in (6)). For an aggregation query Q(G, SUM(M ), P), let x be the exact answer to Q. We can use the measure- P ROPOSITION 7. (Uniform Sample for SUM) We draw a uni- biased sample TP to calculate an approximate answer x ˆ (as in form sample of O ∆3 / 2 rows TP from TP , assuming values of Algorithm 3) s.t. with high probability, we have x − x ˆ 2≤ . 685

8.Query processing cost. We observe the bottleneck in cost lies on Input: aggregation query Q(G, F (M ), P) and sample T the second step, because the first step only involves sequential scans 1: Sˆi ← ∅ for all groups i’s; sup ← 0; of postings lists, but the second step needs random seeks. The num- 2: For each sample row t ∈ T : ber of random seeks, however, is bounded by min{1/ 2 , |TPID |} for 3: If t satisfies P then each dimension in the query to get an -approximation. 4: Let i be the group t belonging to on G; T HEOREM 9. A query Q with predicate P can be processed 5: Sˆi ← Sˆi ∪ {t}; w(t) ← Pr [t ∈ T ]; sup ← sup +1; using our measure-augmented inverted index (with Algorithm 3) to 6: For each group i: x ˆ i ← Estimator(Sˆi , w(·), F ); get an -approximation. Let TP be the set of rows satisfying P. The 7: Output (ˆ x, sup). total number of random seeks to our index and the column-based Algorithm 4: ProcessWithDataBubble(Q, T ): extended for storage is O min{1/ 2 , |TP |} · dQ , where dQ is the number of general measure F on uniform/measure-based sample dimensions and measures in the query Q. 5.2 Low-Frequency Group Index Input: aggregation query Q(G, F (M ), P) and table T Low-frequency group index is designed to benefit a special class Index: measure-augmented index MAInd of predicates in the form of 1: Sˆi ← ∅ for all groups i’s; 2: Compute TPID ← {tID | rows t that satisfy P} using MAInd; P : D1 = v1 AND . . . AND Dl = vl AND (. . .), (7) 3: Repeat min{1/ 2 , |TPID |} times: 4: Randomly pick tID ∈ TPID : for each tID ∈ TPID , i.e., a conjunction of one or more equi-constraints and some other 5: If MeasureBiased = true: Pr [tID ] ∼ apx(tM ); constraints. For example, for table T in Figure 1(a), “C2 = 1 AND 6: Else (uniform sampling): Pr [tID ] = 1/|TPID |; (C3 = 0 OR C3 = 1)” is such a predicate. 7: t ← seeking T using tID for other dimensions; The construction and processing using low-frequency √ group in- 8: Let i be the group t belonging to on G; dex is quite straightforward. If there is less than n rows (out of 9: Sˆi ← Sˆi ∪ {t}; w(t) ← Pr [tID ]; all the n rows in the table T ) with value v on a dimension D, we simply store these rows sequentially in the index. Any query in the 10: For each group i: x ˆ i ← Estimator(Sˆi , w(·), F ); form (7) with “D = v” in its predicate is a small query and cannot 11: Output x ˆ. be handled by our samples. Processing such queries using√the low- Algorithm 5: ProcessWithMAIndex(Q, MeasureBiased): ex- frequency group index is easy: we only need to scan the n rows tended for general measure F on uniform/measure-based sample sequentially in the index. More details are in Appendix C. Not all small queries can be processed using this index.√For ex- ample, when TD1 =v1 and TD2 =v2 are both large (with > n rows natively, we can estimate M (TP ) as M (T )·mP /m, and therefore, and thus not in the index), their intersection TD1 =v1 ∩ TD2 =v2 could be small. We still rely on our measure-augmented inverted in- zˆi = M (T ) · mi /m = x ˆi · M (T ) · mP /m. dex to answer the query with predicate “D1 = v1 AND D2 = v2 ”. Negative values on measures. For a measure attribute M with both positive values and negative values, we can create two versions 6. EXTENSIONS of M : M + takes only positive values and is set to 0 if the row has a negative value on M ; and similarly, M − takes only negative values 6.1 From Distributions to Absolute Answers and is set to 0 if the row has a positive value on M . Using the above We now introduce how to convert a distribution answer x ˆ to an tricks, we can get estimated answers for both M + and M − . The absolute answer z ˆ. For the distribution answer obtained from low- estimation for M can be obtained from the difference between the frequency group index, since it is exact, the conversion is trivial. So estimated M + and M − on each group. In this way, we can at least let’s focus on SUM-queries with estimated distribution obtained handle negative values for both SUM and AVG. from measure-biased sampling. Measure-augmented inverted in- dex also relies on (approximate) measure-biased sampling, so it is 6.2 Extensibility of Our Techniques similar. And COUNT is a special case of SUM. Foreign key joins and star schema. In order for our system to Consider a SUM(M )-query Q with predicate P against a table support aggregate queries with foreign key joins in a star schema, T . Suppose its output is z = z1 , . . . , zr , where zi is the aggre- our samples and indexes can be extended as follows. gate value of the ith group. z can be normalized to a distribution Uniform/measure-biased samples are drawn from the fact table x = x1 , . . . , xr as in Section 2.1. Let TP ⊆ T be the set of rows in the same way as introduced in Section 4. But for each dimen- satisfying P and Ti ⊆ TP be the set of rows in the ith group of the sion table, we have two options. i) We can join it with the sam- output to Q. For a set of rows S, let M (S) = t∈S tM . So we ples from the fact table in the offline sampling stage and attach the can easily derive that xi = M (Ti )/M (TP ) = zi /M (TP ). additional dimensions to the samples; or ii) we can perform the In the measure-biased sample T M , let m be the number of sam- join when processing aggregation queries using samples, if one or ple rows in T M , mP be the number of sample rows satisfying P more of its dimensions appear in the query. Option ii) is feasible in T M , and mi be the be the number of rows belonging to the ith when the dimension table is small enough to be fit into the mem- group in T M . So from Algorithm 2, we have x ˆi = mi /mP . ory. In our implementation, we choose option i). For the complex Since |ˆ xi − xi | ≤ (implied by our distribution precision guar- schema in TPC-H, Table 1 shows the space overhead of all uniform antee x − x ˆ 2 ≤ ), informally, we have and measure-biased samples, which is affordable in memory even xi = zi /M (TP ) ≈ mi /mP = x ˆi . though samples are attached with dimensions from all tables. For the measure-augmented inverted index and the low-frequency So it is intuitive to estimate zˆi = M (TP ) · mi /mP . Both mi and group index, we construct them in the view that joins the fact table mP are known from Algorithm 2, but M (TP ) is unknown. Alter- with all the dimensions. For TPC-H, Table 1 shows their total sizes. 686

9. In a snowflake schema with multiple fact tables and multi-level 7. EVALUATION dimensions, the above extensions can still be applied. We compare our system with a commercial RDB and two state Range predicates. Our samples can already be used to handle any of the art AQP solutions with stratified sampling: Babcock et al. predicate that is based on existing dimensions. To handle range  (workload-independent) and BlinkDB  (workload-aware). predicates like “D ≥ s AND D ≤ t” for small queries, we intro- • SPS: Our sample+seek solution introduced in this paper. duce the following measure-augmented B+ tree index. • DBX: A standard commercial database system with column-store • Measure-augmented B+ tree. In a B+ tree built on a numerical indexes built for all tables on all columns.1 dimension D, each leaf node is associated with a list of D’s values • SMG: Small group sampling introduced in . within certain range. In our measure-augmented B+ tree, for each • BLK: BlinkDB with multi-dim stratified sampling . D’s value at each leaf, we associate it with IDs (pointers) of the More implementation details of SPS. rows with this value on D and also approximate measure apx(tM ) • Parameters. Our √ SPS has two parameters. The selectivity thresh- for each measure (as in (5)) of these rows – recall that these approx- old s0 is set to 1/ n, where n is the total number of rows in a imate measures need much less bits than the exact measure values. table. The requested error bound varies from 0.025 to 0.1, and is When a range constraint s ≤ D ≤ t is in the predicate, we set as 0.05 if not specified. We use constants 1 behind all O(·)’s in can use this measure-augmented B+ tree to retrieve IDs of rows our theoretical results. √For uniform/measure-biased sampling, our with s ≤ tD ≤ t as well as their approximate measures. Our analysis shows that O n/ 2 rows in each sample suffice – we approximate measure-biased sampling introduced in Section 5.1.2 √ pick exactly n/ 2 rows for each sample. Similarly, from Theo- can then be used to pick O 1/ 2 row IDs and seek their dimension rem 8, we pick exactly 1/ 2 rows to perform random seeks. values in the original table to answer a query with -approximation. • Independent storage. We implement our storage/index and query Extension for other aggregates. We now focus on how to extend execution engine in SPS using C# from sketch. Our custom imple- our sample+seek solution to a wide range of aggregates. mentation (of all the components in Figure 2 except “data source”) • Simple extensions. We can convert both estimated distributions does not reply on other database systems. In our system architec- for SUM and COUNT to absolute answers using the way intro- ture in Figure 2, we have the original data kept in a column-based duced in Section 6.1. Then some measure like AVG can be easily storage with dictionary compression on disk. Our samples/indexes supported as it is nothing but the ratio of SUM to COUNT. are constructed from it, with samples loaded into memory and the • Generic aggregates. To support a generic range of aggregates in other two types of indexes maintained on disk. DBX has a better Algorithm 1, we extend processing with uniform/measure-biased optimized column-based storage engine – we do not utilize it for samples (Algorithm 2) and processing with measure-augmented in- SPS, as we want to highlight the speedups brought by our sam- dex (Algorithm 3) to Algorithm 4 and Algorithm 5, respectively. ples/indexes and processing algorithms with early termination. Processing with low-frequency group index is trivial: one can get • Column-store v.s. row-store. Our sample+seek framework is the exact value of any aggregate by scanning rows in the index. independent on the actual data storage of tables, but we choose The main idea behind Algorithms 4-5 is that, no matter with column-store for uniform/measure-biased samples, and row-store i) uniform/measure-biased samples or ii) with measure-augmented for low-frequency group indexes based on the ways they are ac- indexes, our estimations are made based on a set of sample rows cessed. For in-memory samples, a query may touch only a few di- with different sampling probabilities. For each group in the answer, mensions of sample rows and only data on these dimensions need with i), sample rows have been pre-drawn and kept in memory, so to be read. So to take the best advantage of data locality and cache, Algorithm 2 can directly use them. With ii), Algorithm 3 does it is more efficient to store the data column by column. For on- online sampling/seeking in lines 4-5, and make use of the sample √ disk low-frequency group indexes, since at most n rows need to rows for estimation in lines 6-7. So it is intuitive to replace the spe- be scanned to answer a query, it is more efficient to store the data cific estimation methods in Algorithm 2 (lines 4-5) and Algorithm 3 row by row so that only one random access is required per query. (lines 6-7) with generic estimators to handle other aggregates. Dictionary compression is applied on both types of storages. In Algorithm 4, when scanning rows in a (uniform or measure- SMG and BLK are implemented on the same storage engine (to biased) sample T , we keep rows from each group i in Sˆi as well as store data and samples) as SPS, and all the samples from SMG the probability of seeing each row t in w(t) (line 5). After the scan, and BLK are pre-loaded into memory for fair comparison. for each group i, a point estimation of its aggregate F can be made based on the set of sample rows Sˆi from group i and their weights 7.1 Experiment Settings and Summary w(·), using a generic estimator Estimator(Sˆi , w(·), F ) (line 6). All of our experiments are conducted on a Windows server with The extension in Algorithm 5 is similar, but the sampling is 8 core 2.27GHz CPUs and 64 GB memory. DBX is given 4 cores conducted online, either uniformly (line 6) or biased on measure for better performance, and each of the rest uses one core. (line 5). Again, rows collected from sampling are kept in Sˆi for group i together with sampling probabilities in w(·) (line 9). Af- Datasets. We use two datasets (a real one and a benchmark): ter enough number of rows are collected, estimation of aggregate • LOG: This is a real dataset about service logs in a Fortune 500 value F for each group can be made using Estimator(Sˆi , w(·), F ). company. It is a single table with 29 columns – 19 of them are Estimator is a customizable component and any sample-based categorical dimensions (for group-by), and the rest 10 are numer- point estimator for the aggregate F can be applied. It takes a set of ical dimensions/measures. The numbers of distinct values in the sample rows S,ˆ their sampling weights w(·), and F as the input, but categorical dimensions range from 2 to 3000. The whole table does not rely on other components of our system. So our solution has 1.12 billion rows. To test the scalability of different systems, enjoys the same extensibility as other sampling-based AQP tech- we extract two sub-tables from LOG, LOG-S with 310 million niques (e.g. [7, 5]) do to support a generic range of aggregates. For rows and LOG-M with 620 million rows. example, if F is STDEV(M ), the sample standard deviation esti- • TPC-H: With a scaling factor 100, we produce a 100 GB star- mator can be used as Estimator; if F is COUNT(DISTINCT M ), the estimator GEE proposed in  can be used as Estimator. 1 We tried different types of indexes. Column-store indexes turn out We will evaluate such extensions in experiments. to be the best option for our aggregation query workloads. 687