Data Summaries for Large Scale Aggregation

1. Data Summaries for Large Scale Aggregation
Edward Gan, Moses Charikar, Peter Bailis
Email: edgan8@gmail.com

2. Data summaries enable scalable queries
[Figure: a raw data stream passes through minibatch summarization into one summary per time segment (12:15:00, 12:30:00, 12:45:00, 13:00:00); queries run over time ranges, with a single time segment highlighted.]
Summaries are constructed for each data segment: Counts, Sums, Samples, HyperLogLog, CountMin, etc.
Queries can be served directly using (approximate) summaries.
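The idea of serving range queries from per-segment summaries can be sketched as follows. This is a minimal illustration, not the authors' system: it assumes the simplest possible summary (exact counts held in a `Counter`), and the function names `summarize` and `query_range` as well as the sample keys are hypothetical.

```python
from collections import Counter

def summarize(segment):
    """Build a count summary for one time segment (a list of keys)."""
    return Counter(segment)

def query_range(summaries, start, end):
    """Merge the summaries of segments start..end-1 into combined counts."""
    total = Counter()
    for summary in summaries[start:end]:
        total.update(summary)  # Counter.update adds counts key by key
    return total

# Three hypothetical time segments of a raw stream.
segments = [
    ["a", "b", "a"],
    ["a", "c"],
    ["b", "b", "a"],
]
summaries = [summarize(seg) for seg in segments]

# The query touches only the small summaries, never the raw stream.
top = query_range(summaries, 0, 3).most_common(1)
```

With exact counts the merge is lossless; the approximate summaries listed on the slide trade some of that accuracy for a bounded size.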

3. Challenge: error accumulation
Query: Top 10 IP addresses by request count in the last 7 hours?
[Figure: per-window summaries of IP address request counts]
12:00:00             12:15:00             12:30:00
132.408.291  1028    32.8.138     4299    132.408.291  1028
…
324.483.998   308    324.483.998   482    324.483.998   308
32.8.138       52    192.128.0.1   256    192.128.0.1   256
Single summary: error $\epsilon$
Aggregating k summaries: error $k\epsilon$
Query accuracy degrades linearly with aggregation.
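A small worked example may make the linear growth concrete. It is a sketch under stated assumptions: the summary is a plain "keep the s largest counts" truncation, the helper names `truncate_top` and `range_error` are hypothetical, and the counts are made up so that key "x" always falls just below the cutoff.

```python
from collections import Counter

def truncate_top(counts, s):
    """Keep only the s largest counts; dropped keys are estimated as 0."""
    return Counter(dict(counts.most_common(s)))

def range_error(key, exact_segments, summaries):
    """True total minus estimated total for one key over a list of segments."""
    true_total = sum(c[key] for c in exact_segments)
    est_total = sum(c[key] for c in summaries)  # missing keys count as 0
    return true_total - est_total

k, s = 8, 2
# Hypothetical data: every segment has the same counts.
exact = [Counter({"a": 10, "b": 9, "x": 5}) for _ in range(k)]
summaries = [truncate_top(c, s) for c in exact]

single_error = range_error("x", exact[:1], summaries[:1])  # one summary
total_error = range_error("x", exact, summaries)           # k summaries
```

Because "x" is underestimated by the same amount in every segment, the errors all point the same way and the total is k times the single-summary error.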

4. Opportunity: error cancellation
IP address counts can be either overestimates or underestimates:
• 😱 Consistent Bias
• 🙂 Independent Errors
• 😎 Perfect Cancellation
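The three regimes can be compared with a toy simulation. This is only an illustration: it assumes every summary is off by exactly ±eps for a given key, and the values of `k`, `eps`, and the random seed are arbitrary choices, not taken from the slides.

```python
import random

def total_error(per_segment_errors):
    """Error of an aggregate is the sum of the per-segment errors."""
    return sum(per_segment_errors)

k, eps = 400, 1.0
rng = random.Random(0)  # fixed seed so the run is repeatable

# Consistent bias: every summary errs in the same direction.
consistent = total_error([eps] * k)

# Independent errors: each summary over- or underestimates at random.
independent = total_error([rng.choice([-eps, eps]) for _ in range(k)])

# Perfect cancellation: each error is offset by the next one.
cancelling = total_error([eps if i % 2 == 0 else -eps for i in range(k)])
```

The consistent case grows as k times eps, the independent case typically stays on the order of the square root of k times eps, and the alternating case sums to zero.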

5. Summaries for range aggregations
Time-series summaries aggregated over k contiguous windows:
• Truncate Top: $O(k \cdot n/s)$
• Simple Sampling: $O(\sqrt{k} \cdot n/s)$
• Balanced: $O(\log k \cdot n/s)$
Balanced Summarization: bias summaries to cancel out errors of previous consecutive summaries.
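One way to picture "biasing a summary to cancel earlier errors" is an error-feedback scheme: mass that earlier summaries dropped is carried forward and added to the next segment before truncating. The sketch below is an assumption made for illustration and is not claimed to be the authors' Balanced construction; `balanced_summaries` and `prefix_error` are hypothetical names, and the data is made up.

```python
from collections import Counter

def balanced_summaries(exact_segments, s):
    """Truncate each segment to s keys after adding the error carried so far."""
    carried = Counter()  # true minus estimated counts accumulated so far
    summaries = []
    for counts in exact_segments:
        adjusted = Counter(counts)
        adjusted.update(carried)              # add mass dropped by earlier summaries
        kept = dict(adjusted.most_common(s))  # stored values include that mass
        summaries.append(kept)
        carried = Counter({key: v for key, v in adjusted.items() if key not in kept})
    return summaries

def prefix_error(key, exact_segments, summaries, j):
    """True minus estimated total for one key over the first j segments."""
    true_total = sum(c[key] for c in exact_segments[:j])
    est_total = sum(summ.get(key, 0) for summ in summaries[:j])
    return true_total - est_total

k, s = 8, 2
exact = [Counter({"a": 10, "b": 9, "x": 6}) for _ in range(k)]
balanced = balanced_summaries(exact, s)

# Worst error over every key and every prefix of segments.
worst = max(
    abs(prefix_error(key, exact, balanced, j))
    for key in "abx"
    for j in range(1, k + 1)
)
# Plain truncation would drop "x" in every segment instead.
naive_x_error = sum(c["x"] for c in exact)
```

In this toy run the prefix error stays bounded by a single segment's dropped mass instead of growing with k. The scheme controls prefixes directly; the error of any other contiguous range is the difference of two prefix errors, so it is at most twice the prefix bound. The cost is that an individual summary may report mass that belongs to earlier segments.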

6. Designing summaries for error cancellation
Problem: Summary approximation error grows with aggregation
Goal: Design summaries for error cancellation as an ensemble
• For contiguous ranges, error can be controlled incrementally
• Other aggregation patterns: 2D ranges, hierarchical, sliding window
• Other queries: quantiles, sums
Love to hear feedback and more use cases for data summaries! Email: edgan8@gmail.com
