Data Summaries for Large Scale Aggregation
1. Data Summaries for Large Scale Aggregation Edward Gan, Moses Charikar, Peter Bailis Email: firstname.lastname@example.org
2.Data summaries enable scalable queries Raw Data Stream Minibatch Summarization 12:15:00 12:30:00 12:45:00 13:00:00 Query over time ranges Single Time Segment Summaries constructed for each data segment: Counts, Sums, Samples, HyperLogLog, CountMin, etc… Queries can be served directly using (approximate) summaries
3.Challenge: error accumulation Query: Top 10 ip addresses by request count in last 7 hours? 12:00:00 12:15:00 12:30:00 132.408.291 1028 32.8.138 4299 132.408.291 1028 … 324.483.998 308 324.483.998 482 324.483.998 308 32.8.138 52 184.108.40.206 256 220.127.116.11 256 Single Summary: error 𝜖𝜖 Aggregating k summaries: error 𝑘𝑘𝑘𝑘 Query accuracy degrades linearly with aggregation
4.Opportunity: error cancellation IP Address counts can be either overestimates or underestimates 😱😱 Consistent Bias 🙂🙂 Independent Errors 😎😎 Perfect Cancellation
5.Summaries for range aggregations Time-series summaries aggregated over k contiguous windows 𝑛𝑛 Truncate Top: 𝑂𝑂 𝑘𝑘 𝑠𝑠 𝑛𝑛 Simple Sampling: 𝑂𝑂 𝑘𝑘 𝑠𝑠 𝑛𝑛 Balanced: 𝑂𝑂 log 𝑘𝑘 𝑠𝑠 Balanced Summarization: bias summaries to cancel out errors of previous consecutive summaries.
6.Designing summaries for error cancellation Problem: Summary approximation error grows with aggregation Goal: Design summaries for error cancellation as an ensemble • For contiguous ranges, error can be controlled incrementally • Other aggregation patterns: 2d ranges, hierarchical, sliding window • Other queries: quantiles, sums Love to hear feedback and more use cases for data summaries! Email: email@example.com