- 快召唤伙伴们来围观吧
- 微博 QQ QQ空间 贴吧
- 视频嵌入链接 文档嵌入链接
- 复制
- 微信扫一扫分享
- 已成功复制到剪贴板
Data Summaries for Large Scale Aggregation
大规模聚合的数据摘要
展开查看详情
1 . Data Summaries for Large Scale Aggregation Edward Gan, Moses Charikar, Peter Bailis Email: edgan8@gmail.com
2 .Data summaries enable scalable queries Raw Data Stream Minibatch Summarization 12:15:00 12:30:00 12:45:00 13:00:00 Query over time ranges Single Time Segment Summaries constructed for each data segment: Counts, Sums, Samples, HyperLogLog, CountMin, etc… Queries can be served directly using (approximate) summaries
3 .Challenge: error accumulation Query: Top 10 ip addresses by request count in last 7 hours? 12:00:00 12:15:00 12:30:00 132.408.291 1028 32.8.138 4299 132.408.291 1028 … 324.483.998 308 324.483.998 482 324.483.998 308 32.8.138 52 192.128.0.1 256 192.128.0.1 256 Single Summary: error 𝜖𝜖 Aggregating k summaries: error 𝑘𝑘𝑘𝑘 Query accuracy degrades linearly with aggregation
4 .Opportunity: error cancellation IP Address counts can be either overestimates or underestimates 😱😱 Consistent Bias 🙂🙂 Independent Errors 😎😎 Perfect Cancellation
5 .Summaries for range aggregations Time-series summaries aggregated over k contiguous windows 𝑛𝑛 Truncate Top: 𝑂𝑂 𝑘𝑘 𝑠𝑠 𝑛𝑛 Simple Sampling: 𝑂𝑂 𝑘𝑘 𝑠𝑠 𝑛𝑛 Balanced: 𝑂𝑂 log 𝑘𝑘 𝑠𝑠 Balanced Summarization: bias summaries to cancel out errors of previous consecutive summaries.
6 .Designing summaries for error cancellation Problem: Summary approximation error grows with aggregation Goal: Design summaries for error cancellation as an ensemble • For contiguous ranges, error can be controlled incrementally • Other aggregation patterns: 2d ranges, hierarchical, sliding window • Other queries: quantiles, sums Love to hear feedback and more use cases for data summaries! Email: edgan8@gmail.com