1. Real-Time Attribution with Structured Streaming and Databricks Delta
Caryl Yuhas, Databricks
#ExpSAIS13
2. Introduction
• Goal: Provide tools and information that can help you build more real-time / lower latency attribution pipelines
• Crawl, Walk, Run: Pull Model
Speaker note: Caryl, previously MediaMath / SE / PM for Attribution, SA for Databricks
3. Getting Started
• What is Attribution?
Image Source: www.mediamath.com
4. Introduction
What is Databricks Delta?
Delta is a data management capability that brings data reliability and performance optimizations to the cloud data lake.
5. Stream-to-Sink BEFORE
[Diagram labels: Events, Streaming Analytics, Data Lake, Partitioned, Scheduled to Avoid Compaction, Compact Small Files, Reporting. Numbered callouts: 1 λ-arch, 2 Validation, 3 Reprocessing, 4 Compaction]
6. Stream-to-Sink AFTER
[Diagram labels: Events, Streaming Analytics, DELTA, Optimize, ZOrder, Partitioned, Compact Small Files, Reporting. Numbered callouts: 1 λ-arch, 2 Validation, 3 Reprocessing, 4 Compaction]
7. Attribution in Practice
impressions JOIN conversions → attributed impressions
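The join on this slide can be illustrated with a small plain-Python sketch. This is not the Structured Streaming or Delta code from the talk; the record fields (`impression_id`, `conversion_id`, `user_id`, `ts`), the sample rows, and the 7-day window are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical sample data; field names are assumptions, not from the deck.
impressions = [
    {"impression_id": "i1", "user_id": "u1", "ts": datetime(2018, 6, 1, 9, 0)},
    {"impression_id": "i2", "user_id": "u1", "ts": datetime(2018, 6, 3, 12, 0)},
    {"impression_id": "i3", "user_id": "u2", "ts": datetime(2018, 5, 1, 8, 0)},
]
conversions = [
    {"conversion_id": "c1", "user_id": "u1", "ts": datetime(2018, 6, 4, 10, 0)},
    {"conversion_id": "c2", "user_id": "u2", "ts": datetime(2018, 6, 4, 11, 0)},
]


def attribute(impressions, conversions, window=timedelta(days=7)):
    """Join each conversion to the same user's impressions that fall
    inside the attribution window preceding the conversion."""
    by_user = {}
    for imp in impressions:
        by_user.setdefault(imp["user_id"], []).append(imp)

    attributed = []
    for conv in conversions:
        for imp in by_user.get(conv["user_id"], []):
            lag = conv["ts"] - imp["ts"]
            # Keep impressions seen before the conversion and within the window.
            if timedelta(0) <= lag <= window:
                attributed.append({
                    "conversion_id": conv["conversion_id"],
                    "impression_id": imp["impression_id"],
                    "lag": lag,
                })
    return attributed


attributed_impressions = attribute(impressions, conversions)
```

Here `c1` joins to both `i1` and `i2`, while `c2` joins to nothing because `i3` is older than the assumed window.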
8. Attribution Challenges
Scale
• Often dealing with millions to billions of data points per attribution window
Complexity
• Simple, last-click model is still common
• MTA (multi-touch attribution) and more sophisticated attribution are on the rise
9. High-Level Attribution Pipeline
10. Attribution in Practice
impressions JOIN conversions → attributed impressions
11. Data Architecture
[Diagram: impression stream → impressions table; conversion stream → conversions table; impressions table + conversions table → attributed table; attribution views on the attributed table (filters, logic, etc.): last touch, weighted]
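The two attribution views named on this slide, last touch and weighted, might be sketched as follows in plain Python. The slide does not say which weighting scheme is used, so equal (linear) weighting is an assumption here, as are the tuple layout and sample rows.

```python
from collections import defaultdict

# Hypothetical rows of an attributed table:
# (conversion_id, impression_id, impression_time)
attributed = [
    ("c1", "i1", 100),
    ("c1", "i2", 250),
    ("c1", "i3", 400),
    ("c2", "i4", 300),
]


def last_touch(rows):
    """Give full credit to the latest impression of each conversion."""
    latest = {}
    for conv, imp, ts in rows:
        if conv not in latest or ts > latest[conv][1]:
            latest[conv] = (imp, ts)
    return {imp: 1.0 for imp, _ in latest.values()}


def linear_weighted(rows):
    """Split each conversion's credit equally across its impressions
    (one possible weighting; assumed, not taken from the talk)."""
    touches = defaultdict(list)
    for conv, imp, _ in rows:
        touches[conv].append(imp)
    credit = defaultdict(float)
    for imps in touches.values():
        for imp in imps:
            credit[imp] += 1.0 / len(imps)
    return dict(credit)


last_touch_view = last_touch(attributed)
weighted_view = linear_weighted(attributed)
```

Both functions read the same attributed rows, which mirrors the idea of keeping the attribution logic in views over one shared table.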
12. System Architecture
[Diagram labels: Structured Streaming, Amazon Kinesis]
13. Unification of Streaming + Batch
DEMO
14. Managing Performance
• How can we optimize performance?
• Levers:
  – Delta Tools
    • Optimize
    • ZOrder
    • Caching
    • Data Skipping
  – Join on Stream
  – Cluster Size
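As a conceptual illustration of the data skipping lever, the sketch below prunes files using per-file min/max statistics before any data is read. It is plain Python and does not reflect how Delta stores or uses its statistics internally; the file names, the `min_ts` / `max_ts` fields, and the query range are hypothetical.

```python
# Hypothetical per-file statistics for a timestamp column.
files = [
    {"path": "part-0", "min_ts": 0, "max_ts": 99},
    {"path": "part-1", "min_ts": 100, "max_ts": 199},
    {"path": "part-2", "min_ts": 200, "max_ts": 299},
]


def files_to_scan(files, lo, hi):
    """Return only the files whose [min_ts, max_ts] range overlaps the
    query range [lo, hi]; all other files can be skipped unread."""
    return [f["path"] for f in files if f["max_ts"] >= lo and f["min_ts"] <= hi]


# A query for timestamps between 150 and 210 touches two of three files.
selected = files_to_scan(files, 150, 210)
```

Skipping works best when each file covers a narrow value range, which is the reason clustering related values into the same files helps.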
15. Handling Complexity
• Flexibility with Complex Logic
  – Forking streams
  – Logic on query vs. in-stream
• Late or Corrected Data
  – Upserts
  – Views automatically update when raw data changes
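The upsert idea for late or corrected data can be sketched in plain Python as a keyed merge: rows whose key already exists are replaced, and new keys are appended. This only illustrates the semantics, not the Delta API; the `conversion_id` key, the `value` field, and the sample rows are assumptions. The `total_value` function stands in for a view, since it is recomputed from whatever the table currently holds.

```python
def upsert(table, updates, key="conversion_id"):
    """Merge updates into table: replace rows with a matching key,
    append rows whose key is new. Returns a new list."""
    merged = {row[key]: row for row in table}
    for row in updates:
        merged[row[key]] = row
    return list(merged.values())


def total_value(table):
    """A stand-in for a view: derived from the current table contents."""
    return sum(row["value"] for row in table)


# Hypothetical conversions table and a batch of late/corrected rows.
conversions = [
    {"conversion_id": "c1", "value": 10},
    {"conversion_id": "c2", "value": 20},
]
late_or_corrected = [
    {"conversion_id": "c2", "value": 25},  # correction to an existing row
    {"conversion_id": "c3", "value": 30},  # late-arriving row
]

before = total_value(conversions)
conversions = upsert(conversions, late_or_corrected)
after = total_value(conversions)
```

Because the view is defined over the table rather than materialized separately, it reflects the corrected and late rows as soon as the upsert lands.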
16. Conclusion
• Unification of Batch & Streaming
• Easy APIs for Managing Performance
• Flexible and Scalable Analytics on Near Real-Time Data