Approximating Aggregates with Distribution Precision Guarantee

In many cases, analysts require the distribution of (group, aggvalue) pairs in the estimated answer to be guaranteed within a certain error threshold of the exact distribution. Existing AQP techniques are inadequate for two main reasons. First, users cannot express such guarantees. Second, sampling techniques used in traditional AQP can produce arbitrarily large errors even for SUM queries. To address these limitations, we first introduce a new precision metric, called distribution precision, to express such error guarantees. We then study how to provide fast approximate answers to aggregation queries with distribution precision guaranteed within a user-specified error bound.
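As a concrete illustration (the toy numbers below are ours, not from any benchmark), the distribution precision metric can be read as the L2 distance between the exact and estimated answers once both are normalized into distributions over groups:

```python
import math

def to_distribution(agg_values):
    # Normalize per-group aggregate values into a distribution.
    total = sum(agg_values)
    return [v / total for v in agg_values]

def l2_error(x, x_hat):
    # ||x - x_hat||_2: the distribution-precision error between two answers.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_hat)))

x = to_distribution([39, 61])      # exact per-group SUMs   -> <0.39, 0.61>
x_hat = to_distribution([8, 12])   # estimate from a sample -> <0.40, 0.60>
err = l2_error(x, x_hat)           # ~0.014, within a bound like eps = 0.05
```

A guarantee of the kind studied here then reads: with high probability, `err` is at most a user-specified eps.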

...measure-biased samples together with the uniform sample in two scans of the table T. Please refer to Appendix B for details.

Estimating using measure-biased samples. When using T^M to estimate the answer for a SUM(M)-query Q, it is a bit surprising that we can reuse Algorithm 2 – we only need to call the procedure ProcessWithDataBubble(Q, T^M). Readers may wonder why, in line 5, x̂_i is increased by 1 instead of by the measure value t_M for a sample row t. The reason is, informally, that since the sampling probability of t is proportional to t_M (as in (4)), we need to calibrate row t's contribution to the estimation x̂ with t_M^(-1). This idea is similar to the Horvitz-Thompson estimator (which is for single-point estimation), but here we are estimating a distribution x. Refer to the following example for more intuition.

EXAMPLE 4.1. (Continuing Example 1.1) Let's look at the query Q in Example 1.1, for which the uniform sample fails. We draw a measure-biased sample T^M with rows 1, 26, 51, 76, 92, 94, 96, 98, 111, 136, 161, and 186 each appearing once, and 199 and 200 each appearing four times (considering their values on M – refer to Example B.1 and Figure 9 in Appendix B on how they were drawn). Reusing Algorithm 2, call ProcessWithDataBubble(Q, T^M) – the estimation it gives is x̂ = ⟨8/20, 12/20⟩ = ⟨0.40, 0.60⟩, which is very close to the exact answer x = ⟨0.39, 0.61⟩.

The following theorem formally shows that, for each measure M, a measure-biased sample T^M of size O(√n/ε²) suffices for all large SUM(M)-queries to give an ε-approximation.

THEOREM 5. (Precision of Measure-Biased Sampling) We apply measure-biased sampling on measure attribute M and draw O((1/s_0)·(1/ε²)) rows T^M from a table T of n rows. For any aggregation query Q(G, SUM(M), P) with aggregate SUM(M) on T, let x be the exact answer to Q. If the measure selectivity of the predicate P, s_M(P), is no less than s_0, Algorithm 2 uses T^M to calculate an estimated answer x̂ s.t., with high probability, ‖x − x̂‖_2 ≤ ε. In particular, if the selectivity threshold is s_0 = 1/√n, a measure-biased sample of O(√n/ε²) rows suffices.

Optimality of measure-biased sampling. Our measure-biased sampling for SUM-queries draws the same number of sample rows as uniform sampling does for COUNT-queries. Since COUNT is a special case of SUM (when the measure attribute is a constant 1), the lower-bound result also implies that for SUM-queries with no predicate (s_0 = 1), we need at least Ω(1/ε²) sample rows to obtain an ε-approximate answer with high probability.

Early termination. The same early-termination condition introduced in Section 4.1 for the uniform samples can be applied here for measure-biased samples. When we have collected sup ≥ 2/ε² sample rows that satisfy the predicate in Q, we can exit the loop of lines 2-5 of Algorithm 2 and output the normalized x̂. This condition is also integrated in our implementation and offers a significant performance gain for queries with high measure selectivities.

5. SEEK: RANDOM OR SEQUENTIAL

Large queries with (measure) selectivity no smaller than s_0 (= 1/√n by default) have been handled by the uniform and measure-biased samples. What is left are small queries with (measure) selectivity smaller than s_0. For example, for such a small COUNT aggregation query, we know that the number of rows satisfying its predicate is no more than √n. We will keep this property in mind when designing the two indexes that handle small queries.

5.1 Measure-Augmented Inverted Index

As we first focus on aggregation queries with categorical dimensions in their predicates, it is natural to apply inverted indexes from IR systems to our problem. For each value v of each dimension D_i, we can keep a postings list of the rows that have value v on dimension D_i. Then, for a query Q(G, F(M), P), there have been extensive studies on how to efficiently retrieve all rows satisfying P using those postings lists. For a predicate that is a conjunction (with only ANDs) of equi-constraints, this is equivalent to computing the intersection of postings lists. After that, we only need to scan the retrieved rows once for aggregation.

There are, however, at least two optimizations we can make in the scenario of aggregation-query processing. First, we cannot keep full rows in the postings list, as each row could be wide (in terms of the number of bits), especially if the number of dimensions is non-trivial (e.g., 30). We can keep only the IDs of those rows, and later seek their dimension/measure values in the original table. Second, of course, a random seek based on a row ID is expensive. Our basic idea is that, after we get the IDs of the rows that satisfy the predicate P, we draw a random sample from these IDs and perform random seeks only for the sampled IDs. The two sampling schemes introduced in Section 4 can be applied here online to draw proper samples to achieve our ε-approximation precision guarantee. To this end, besides row IDs, we need approximate measure values to guide our online measure-biased sampling.

We introduce how to construct and utilize the measure-augmented inverted index in this subsection.

5.1.1 Construction

Our measure-augmented inverted index is described in Figure 3. For each value v of a categorical dimension D, we maintain a measure-augmented postings list inv(D, v). Each of our measure-augmented postings lists consists of two parts.

First, inv(D, v) maintains the IDs of rows that have value v on D, just as in an IR-style postings list. This part is depicted on the left of Figure 3. From the ID of a row t, denoted t_ID, we can access the value of each dimension or measure of t from our column-based storage, using one random seek (I/O).

Second, let us use inv(D, v) also to denote the set of rows that have value v on D. For each row t ∈ inv(D, v), we maintain the values of the measures M_1, ..., M_{d_M} on this row. Note that keeping all the exact measure values is still too expensive (e.g., for ten 64-bit integers). So we keep only an approximation apx(t_{M_i}) of the measure t_{M_i}, in order to save space as well as to speed up access to inv(D, v). apx(t_{M_i}) only needs to be a 2-approximation of t_{M_i}:

apx(t_{M_i}) ≤ t_{M_i} < 2 · apx(t_{M_i}).    (5)

So the number of bits we need is at most log log ∆(M_i) + 1, where ∆(M_i) is the range of values in measure M_i.

The construction of the measure-augmented inverted index is similar to constructing a traditional IR inverted index, which can be done within one scan of all the rows, in linear time.

EXAMPLE 5.1. (Continuing Example 1.1) For the example table T with 200 rows in Figure 1(a), Figures 4(a)-4(b) show the measure-augmented postings lists inv(C_2, 0) and inv(C_3, 0), respectively. Values on the measure M, i.e., t_M, are rounded down to apx(t_M) ∈ {2^0, 2^1, 2^2, ...} so that (5) is satisfied. For example, t_M for row 100 is rounded from 10 to 8 in inv(C_3, 0), and t_M for row 200 is rounded from 100 to 64 in inv(C_2, 0). The number of bits needed for each apx(t_M) is ⌊log_2 log_2 ∆(M)⌋ + 1. In this example, the range of M is ∆(M) = 100, so we need only 3 bits, as apx(t_M) can only be 2^0, 2^1, ..., 2^6.
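To make the calibration intuition behind measure-biased sampling concrete, here is a small self-contained sketch (our own simplified re-implementation, not the paper's code): rows are drawn with probability proportional to t_M, and each sampled row then contributes 1 (not t_M) to its group, so the biased inclusion probability and the contribution cancel, as in a Horvitz-Thompson estimator.

```python
import random

def measure_biased_sample(rows, measure, k, seed=0):
    # Draw k rows (with replacement) with Pr[t] proportional to t[measure].
    rng = random.Random(seed)
    weights = [row[measure] for row in rows]
    return rng.choices(rows, weights=weights, k=k)

def estimate_sum_distribution(sample, group, predicate):
    # Each qualifying sampled row adds 1 (not t_M) to its group's count;
    # normalizing the counts estimates the per-group SUM distribution.
    counts = {}
    for row in sample:
        if predicate(row):
            counts[row[group]] = counts.get(row[group], 0) + 1
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

# Toy table: group "a" has many small measures, group "b" one huge one.
table = [{"g": "a", "m": 1} for _ in range(99)] + [{"g": "b", "m": 100}]
sample = measure_biased_sample(table, "m", k=400)
est = estimate_sum_distribution(sample, "g", lambda r: True)
# Exact SUM distribution: a -> 99/199, b -> 100/199; `est` should be close.
```

A uniform sample of the same size would rarely see row "b" at all and could be arbitrarily wrong on this table, which is exactly the failure mode the measure-biased sample avoids.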

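The 2-approximation in (5) can be realized by rounding each measure down to a power of two and storing only the exponent; a minimal sketch (function names are ours):

```python
def apx_exponent(t_m):
    # floor(log2(t_m)) for a positive integer measure.
    return t_m.bit_length() - 1

def apx(t_m):
    # Largest power of two <= t_m, so apx(t_m) <= t_m < 2 * apx(t_m), as in (5).
    return 1 << apx_exponent(t_m)

# Example 5.1's rounding: 10 -> 8 and 100 -> 64. With Delta(M) = 100 the
# exponents range over 0..6, so each approximate measure fits in 3 bits.
assert apx(10) == 8 and apx(100) == 64
assert all(apx(v) <= v < 2 * apx(v) for v in range(1, 101))
```

Storing the exponent rather than the value is what brings the per-measure cost down to roughly log log ∆(M) bits.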
[Figure 3: Measure-Augmented Inverted Index. Each postings list inv(D, v) stores, for every row t with value v on D, the row ID t_ID together with the approximate measures apx(t_{M_1}), ..., apx(t_{M_i}), ..., each occupying log log ∆(M_i) + 1 bits; the length of the list is the number of rows containing v.]

[Figure 4: Two Measure-Augmented Postings Lists. Panels: (a) inv(C_2, 0), (b) inv(C_3, 0), (c) inv(C_2, 0) ∩ inv(C_3, 0); each panel lists row IDs paired with their approximate measures apx(M).]

Size of index. Our measure-augmented inverted index is composed of row IDs and approximate measures. Suppose a table has n rows and d_C dimensions. Each row ID needs O(log n) bits and is stored d_C times in the index, once for each dimension. For each row ID and each measure M_i, we need O(log log ∆(M_i)) bits to store the approximate measure apx(t_{M_i}). So the index size is:

THEOREM 6. For a table with n rows, d_C dimensions, and d_M measures, the measure-augmented inverted index needs a total of O(n · d_C · (log n + Σ_{i=1}^{d_M} log log ∆(M_i))) bits.

Each postings list inv(D, v) can be compressed using standard techniques, as it will only be accessed sequentially. We will report the actual size of such indexes in our experiments.

5.1.2 Processing Queries

To process a query Q(G, F(M), P) using the measure-augmented inverted index, we first retrieve the IDs of all rows that satisfy P. Let T_P be the set of all such rows, and T_P^ID be the set of their IDs. For a predicate expression with AND, OR, and NOT operators on equi-constraints of the form "D_i = v_i", this is equivalent to computing the set intersection, union, and difference of the postings lists inv(D_i, v_i), and we need only one scan of those lists.

After we get the IDs of the rows in T_P^ID, we can seek the original table T for their dimension and measure values to finish the aggregation. But |T_P^ID| random seeks are expensive. A very natural idea is to draw a random sample from T_P^ID and perform random seeks only for the sampled rows.

For COUNT aggregates, from Corollary 4, a uniform sample of O(1/ε²) rows suffices for an ε-approximate answer x̂. For SUM aggregates, we propose approximate measure-biased sampling to estimate an ε-approximate answer.

Online approximate measure-biased sampling. We start with a proposition about the sample size needed for an ε-approximation if uniform sampling is used. We prove that, with uniform sampling, it suffices to seek O(∆³/ε²) sample rows, where ∆ is the range of the measure values. This result itself is discouraging, as ∆³ is large. Our approximate measure-biased sampling scheme will improve the sample size to O(1/ε²) using the approximate measures apx(·_M).

PROPOSITION 7. (Uniform Sample for SUM) We draw a uniform sample T̃_P of O(∆³/ε²) rows from T_P, assuming the values of measure M are within the range [1, ∆]. For a SUM aggregation query Q with predicate P, let x be the exact answer to Q. We can use T̃_P to get an estimation x̂ s.t. E[‖x − x̂‖_2] ≤ ε.

How to process a query using the approximate measure-biased sampling technique is described in Algorithm 3.

Input: aggregation query Q(G, F(M), P) and table T
Index: measure-augmented inverted index inv(D, v)
Output: ε-approximate answer distribution x̂
1: x̂ ← 0;
2: Compute T_P^ID ← {t_ID | rows t that satisfy P} using inv;
3: Repeat min{1/ε², |T_P^ID|} times:
4:   Randomly pick t_ID ∈ T_P^ID: for each t_ID ∈ T_P^ID,
     Pr[t_ID is picked] = apx(t_M) / Σ_{t'_ID ∈ T_P^ID} apx(t'_M) ∝ apx(t_M);    (6)
5:   t ← seek T using t_ID for the other dimensions and measures;
6:   Let i be the group t belongs to on G;
7:   x̂_i ← x̂_i + t_M / apx(t_M);
8: Normalize x̂ to be a distribution and output x̂.
Algorithm 3: ProcessWithMAIndex(Q): online approximate measure-biased sampling and processing

Inspired by the idea of measure-biased sampling (recall (4)), the sample size can be reduced tremendously to O(1/ε²) if the rows in T_P can be sampled with probability proportional to t_M. In the measure-augmented inverted index, t_M is not available, but an approximation of t_M, apx(t_M), is. A row t in T_P is drawn into the sample T̃_P with probability proportional to apx(t_M) (line 4). If t is picked, we retrieve its dimension and measure values by seeking our column-based storage (line 5), and let i be the group it belongs to (line 6). Let t̄_M = t_M / apx(t_M) be the weight of t in the estimation x̂_i (line 7). Finally, x̂ is normalized and output (line 8). We know that, for any t, 1 ≤ t̄_M < 2, from how apx(·_M) is chosen in (5). The reduction in sampling complexity is achieved by shrinking ∆ in Proposition 7: intuitively, if ∆ = 2, which is true for t̄_M, a sample of size O(1/ε²) should suffice. Before proving the formal result in Theorem 8, we use an example to demonstrate the online approximate measure-biased sampling.

EXAMPLE 5.2. (Continuing Example 5.1) For the table T with 200 rows in Figure 1(a), consider such a query Q:

SELECT C_1, SUM(M) FROM T WHERE C_2 = 0 AND C_3 = 0 GROUP BY C_1.

The answer is x = ⟨90/288, 198/288⟩ = ⟨0.31, 0.69⟩. Using our measure-augmented postings lists in Figures 4(a)-4(b), we can first compute the IDs of the rows that satisfy "C_2 = 0 AND C_3 = 0". We get T_P^ID = {1, ..., 90, 101, ..., 198, 199} as in Figure 4(c). There are 188 rows in T_P^ID with apx(t_M) = 1, and 1 row (199) with apx(t_M) = 64. So if we draw 20 rows with probability proportional to apx(t_M) from T_P^ID, with high likelihood we will get 7 rows from {1, ..., 90}, 8 rows from {101, ..., 198}, and 5 copies of row 199. From line 7 in Algorithm 3, we can estimate

x̂ = ⟨7×1 / (7 + 8 + 5×100/64), (8×1 + 5×100/64) / (7 + 8 + 5×100/64)⟩ = ⟨0.30, 0.70⟩.

THEOREM 8. (Sample using Approximate Measures) We can draw O(1/ε²) rows T̃_P from T_P biased on the approximate measure apx(·_M) of M (as in (6)). For an aggregation query Q(G, SUM(M), P), let x be the exact answer to Q. We can use the measure-biased sample T̃_P to calculate an approximate answer x̂ (as in Algorithm 3) s.t., with high probability, ‖x − x̂‖_2 ≤ ε.
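The arithmetic in Example 5.2 can be checked mechanically. In the sketch below (the sample composition is taken from the example; the code and group labels are ours), each sampled row contributes the weight t_M/apx(t_M) from line 7 of Algorithm 3 to its group:

```python
# 7 sampled rows from {1..90} (group "g1", t_M = apx(t_M) = 1), 8 rows
# from {101..198} (group "g2", t_M = apx(t_M) = 1), and 5 copies of
# row 199 (group "g2", t_M = 100, apx(t_M) = 64), per Example 5.2.
draws = [("g1", 1, 1)] * 7 + [("g2", 1, 1)] * 8 + [("g2", 100, 64)] * 5

weights = {}
for group, t_m, apx_t_m in draws:
    # Line 7 of Algorithm 3: add the calibrated weight t_M / apx(t_M).
    weights[group] = weights.get(group, 0.0) + t_m / apx_t_m

total = sum(weights.values())
x_hat = {g: w / total for g, w in weights.items()}
# x_hat is close to the exact answer <90/288, 198/288> = <0.31, 0.69>.
```

Note that 1 ≤ t_M/apx(t_M) < 2 for every draw, which is what shrinks the effective measure range ∆ down to 2.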

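Putting the pieces together, here is a minimal end-to-end sketch of the online sampling in Algorithm 3 under a toy in-memory layout of our own devising (postings lists as dicts from row ID to apx(t_M); the "column-based storage" is just a Python list we seek into):

```python
import random

def process_with_ma_index(postings, storage, pred_dims, group, measure, k, seed=0):
    # Line 2: intersect postings lists to get the IDs of rows satisfying
    # an AND-of-equality predicate P.
    ids = set.intersection(*(set(postings[dv]) for dv in pred_dims))
    tp_ids = sorted(ids)
    apx = postings[pred_dims[0]]          # apx(t_M) stored next to each row ID
    rng = random.Random(seed)
    x_hat = {}
    # Lines 3-7: draw min(k, |T_P|) IDs with Pr ~ apx(t_M), seek the full
    # row, and add the calibrated weight t_M / apx(t_M) to its group.
    for tid in rng.choices(tp_ids, weights=[apx[i] for i in tp_ids],
                           k=min(k, len(tp_ids))):
        row = storage[tid]                # one random seek per sampled ID
        x_hat[row[group]] = x_hat.get(row[group], 0.0) + row[measure] / apx[tid]
    total = sum(x_hat.values())
    return {g: w / total for g, w in x_hat.items()}   # line 8: normalize

# Toy data: 8 rows whose measures are exact powers of two, so apx(t_M) = t_M.
storage = [{"C1": i % 2, "C2": 0, "C3": 0, "M": 2 ** (i % 3)} for i in range(8)]
postings = {
    ("C2", 0): {i: storage[i]["M"] for i in range(8)},
    ("C3", 0): {i: storage[i]["M"] for i in range(8)},
}
est = process_with_ma_index(postings, storage, [("C2", 0), ("C3", 0)],
                            "C1", "M", k=100)
```

Only the sampled IDs trigger seeks into `storage`, mirroring why the seek count in Theorem 9 is bounded by min{1/ε², |T_P|} rather than |T_P|.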
Input: aggregation query Q(G, F(M), P) and sample T
1: Ŝ_i ← ∅ for all groups i; sup ← 0;
2: For each sample row t ∈ T:
3:   If t satisfies P then
4:     Let i be the group t belongs to on G;
5:     Ŝ_i ← Ŝ_i ∪ {t}; w(t) ← Pr[t ∈ T]; sup ← sup + 1;
6: For each group i: x̂_i ← Estimator(Ŝ_i, w(·), F);
7: Output (x̂, sup).
Algorithm 4: ProcessWithDataBubble(Q, T): extended for a general measure F on a uniform/measure-biased sample

Input: aggregation query Q(G, F(M), P) and table T
Index: measure-augmented index MAInd
1: Ŝ_i ← ∅ for all groups i;
2: Compute T_P^ID ← {t_ID | rows t that satisfy P} using MAInd;
3: Repeat min{1/ε², |T_P^ID|} times:
4:   Randomly pick t_ID ∈ T_P^ID:
5:     If MeasureBiased = true: Pr[t_ID] ∝ apx(t_M);
6:     Else (uniform sampling): Pr[t_ID] = 1/|T_P^ID|;
7:   t ← seek T using t_ID for the other dimensions and measures;
8:   Let i be the group t belongs to on G;
9:   Ŝ_i ← Ŝ_i ∪ {t}; w(t) ← Pr[t_ID];
10: For each group i: x̂_i ← Estimator(Ŝ_i, w(·), F);
11: Output x̂.
Algorithm 5: ProcessWithMAIndex(Q, MeasureBiased): extended for a general measure F on a uniform/measure-biased sample

Query processing cost. We observe that the bottleneck in cost lies in the second step, because the first step involves only sequential scans of postings lists, while the second step needs random seeks. The number of random seeks, however, is bounded by min{1/ε², |T_P^ID|} for each dimension in the query to get an ε-approximation.

THEOREM 9. A query Q with predicate P can be processed using our measure-augmented inverted index (with Algorithm 3) to get an ε-approximation. Let T_P be the set of rows satisfying P. The total number of random seeks to our index and the column-based storage is O(min{1/ε², |T_P|} · d_Q), where d_Q is the number of dimensions and measures in the query Q.

5.2 Low-Frequency Group Index

The low-frequency group index is designed to benefit a special class of predicates of the form

P: D_1 = v_1 AND ... AND D_l = v_l AND (...),    (7)

i.e., a conjunction of one or more equi-constraints and some other constraints. For example, for the table T in Figure 1(a), "C_2 = 1 AND (C_3 = 0 OR C_3 = 1)" is such a predicate.

The construction and processing using the low-frequency group index are quite straightforward. If there are fewer than √n rows (out of all the n rows in the table T) with value v on a dimension D, we simply store these rows sequentially in the index. Any query of the form (7) with "D = v" in its predicate is a small query and cannot be handled by our samples. Processing such queries using the low-frequency group index is easy: we only need to scan the at most √n rows sequentially in the index. More details are in Appendix C.

Not all small queries can be processed using this index. For example, when T_{D_1=v_1} and T_{D_2=v_2} are both large (with more than √n rows each, and thus not in the index), their intersection T_{D_1=v_1} ∩ T_{D_2=v_2} could be small. We still rely on our measure-augmented inverted index to answer the query with predicate "D_1 = v_1 AND D_2 = v_2".

6. EXTENSIONS

6.1 From Distributions to Absolute Answers

We now introduce how to convert a distribution answer x̂ into an absolute answer ẑ. For the distribution answer obtained from the low-frequency group index, since it is exact, the conversion is trivial. So let us focus on SUM-queries with an estimated distribution obtained from measure-biased sampling. The measure-augmented inverted index also relies on (approximate) measure-biased sampling, so it is similar. And COUNT is a special case of SUM.

Consider a SUM(M)-query Q with predicate P against a table T. Suppose its output is z = ⟨z_1, ..., z_r⟩, where z_i is the aggregate value of the ith group. z can be normalized into a distribution x = ⟨x_1, ..., x_r⟩ as in Section 2.1. Let T_P ⊆ T be the set of rows satisfying P and T_i ⊆ T_P be the set of rows in the ith group of the output of Q. For a set of rows S, let M(S) = Σ_{t∈S} t_M. We can easily derive that x_i = M(T_i)/M(T_P) = z_i/M(T_P).

In the measure-biased sample T^M, let m be the number of sample rows in T^M, m_P be the number of sample rows satisfying P in T^M, and m_i be the number of rows belonging to the ith group in T^M. From Algorithm 2, we have x̂_i = m_i/m_P.

Since |x̂_i − x_i| ≤ ε (implied by our distribution precision guarantee ‖x − x̂‖_2 ≤ ε), informally, we have

x_i = z_i/M(T_P) ≈ m_i/m_P = x̂_i.

So it is intuitive to estimate ẑ_i = M(T_P) · m_i/m_P. Both m_i and m_P are known from Algorithm 2, but M(T_P) is unknown. Alternatively, we can estimate M(T_P) as M(T) · m_P/m, and therefore,

ẑ_i = M(T) · m_i/m = x̂_i · M(T) · m_P/m.

Negative values on measures. For a measure attribute M with both positive and negative values, we can create two versions of M: M⁺ takes only the positive values and is set to 0 if the row has a negative value on M; similarly, M⁻ takes only the negative values and is set to 0 if the row has a positive value on M. Using the above tricks, we can get estimated answers for both M⁺ and M⁻. The estimation for M can then be obtained from the difference between the estimated M⁺ and M⁻ in each group. In this way, we can at least handle negative values for both SUM and AVG.

6.2 Extensibility of Our Techniques

Foreign-key joins and star schemas. In order for our system to support aggregation queries with foreign-key joins in a star schema, our samples and indexes can be extended as follows.

Uniform/measure-biased samples are drawn from the fact table in the same way as introduced in Section 4. But for each dimension table, we have two options: i) we can join it with the samples from the fact table in the offline sampling stage and attach the additional dimensions to the samples; or ii) we can perform the join when processing aggregation queries using samples, if one or more of its dimensions appear in the query. Option ii) is feasible when the dimension table is small enough to fit into memory. In our implementation, we choose option i). For the complex schema in TPC-H, Table 1 shows the space overhead of all uniform and measure-biased samples, which is affordable in memory even though the samples are attached with dimensions from all tables.

For the measure-augmented inverted index and the low-frequency group index, we construct them on the view that joins the fact table with all the dimension tables. For TPC-H, Table 1 shows their total sizes.