Real-Time Attribution with Structured Streaming and Databricks Delta

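Companies that advertise typically run an attribution process that joins user conversions with the impressions those users were served or clicked on. The standard workflow is a batch job that runs hourly or once a day. As the technology matures, however, advertisers are looking for more real-time reporting and results. This talk presents an example infrastructure that uses Structured Streaming and Databricks Delta to perform near-real-time attribution and advanced analytics on live impression and conversion data.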

1.Real-Time Attribution with Structured Streaming and Databricks Delta Caryl Yuhas, Databricks #ExpSAIS13

2.Introduction
• Goal: Provide tools and information that can help you build more real-time / lower latency attribution pipelines
• Crawl, Walk, Run: Pull Model
• Caryl: previously MediaMath for Attribution, SA / SE / PM for Databricks

3.Getting Started • What is Attribution? Image Source: www.mediamath.com

4.Introduction What is Databricks Delta? Delta is a data management capability that brings data reliability and performance optimizations to the cloud data lake.

5. Stream-to-Sink BEFORE [Diagram: Events, Streaming, Analytics, Data Lake, Reporting. Numbered issues: 1 λ-arch, 2 Validation, 3 Reprocessing, 4 Compaction. Annotations: Partitioned; Scheduled to Avoid Compaction; Compact Small Files]

6. Stream-to-Sink AFTER [Diagram: Events, Streaming, Analytics, DELTA, Reporting. Numbered issues: 1 λ-arch, 2 Validation, 3 Reprocessing, 4 Compaction. Annotations: Optimize; ZOrder; Partitioned; Compact Small Files]

7.Attribution in Practice [Diagram: impressions JOIN conversions → attributed impressions]
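The join on this slide can be illustrated with a small in-memory sketch. It is not the speaker's pipeline and does not use Spark: the record types, the `last_touch` function, and the `window_seconds` parameter are all hypothetical names chosen for illustration, and the rule applied (credit the most recent impression for the same user within a lookback window) is the simple last-touch model.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Impression:
    impression_id: str
    user_id: str
    ts: int  # event time, epoch seconds


@dataclass(frozen=True)
class Conversion:
    conversion_id: str
    user_id: str
    ts: int  # event time, epoch seconds


def last_touch(
    impressions: List[Impression],
    conversions: List[Conversion],
    window_seconds: int,
) -> Dict[str, Optional[str]]:
    """Map each conversion to the latest impression for the same user
    that happened at most `window_seconds` before the conversion."""
    # Index impressions by user so the join is a lookup, not a full scan.
    by_user: Dict[str, List[Impression]] = {}
    for imp in impressions:
        by_user.setdefault(imp.user_id, []).append(imp)

    attributed: Dict[str, Optional[str]] = {}
    for conv in conversions:
        # Keep only impressions inside the attribution window.
        candidates = [
            imp
            for imp in by_user.get(conv.user_id, [])
            if 0 <= conv.ts - imp.ts <= window_seconds
        ]
        best = max(candidates, key=lambda imp: imp.ts, default=None)
        attributed[conv.conversion_id] = best.impression_id if best else None
    return attributed
```

A conversion with no impression inside the window maps to `None`, which is one way to keep unattributed conversions visible rather than dropping them.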

8.Attribution Challenges
Scale
• Often dealing with millions to billions of data points per attribution window
Complexity
• Simple, last-click model is still common
• MTA (multi-touch attribution) and more sophisticated attribution on the rise

9.High Level Attribution Pipeline

10.Attribution in Practice [Diagram: impressions JOIN conversions → attributed impressions]

11.Data Architecture [Diagram: impression stream → impressions table; conversion stream → conversions table; attributed table (last touch); attributed table (weighted); attribution views (filters, logic, etc.)]
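The slide names two attributed outputs, last touch and weighted, without specifying the weighting. The sketch below shows how both could be derived from the same matched rows. Everything in it is an assumption made for illustration: the row shape, the function names, and in particular the equal-split weighting, which is only one of many possible weighted models.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical row shape: (conversion_id, impression_id, impression_ts).
# Each (conversion_id, impression_id) pair is assumed to appear once.
AttributedRow = Tuple[str, str, int]
Credit = Dict[Tuple[str, str], float]


def last_touch_credit(rows: List[AttributedRow]) -> Credit:
    """Give full credit to the latest impression of each conversion."""
    latest: Dict[str, Tuple[str, int]] = {}
    for conv_id, imp_id, ts in rows:
        if conv_id not in latest or ts > latest[conv_id][1]:
            latest[conv_id] = (imp_id, ts)
    return {(conv_id, imp_id): 1.0 for conv_id, (imp_id, _) in latest.items()}


def equal_weight_credit(rows: List[AttributedRow]) -> Credit:
    """Split one unit of credit evenly across a conversion's impressions
    (an assumed weighting, not the one used in the talk)."""
    per_conversion: Dict[str, List[str]] = defaultdict(list)
    for conv_id, imp_id, _ in rows:
        per_conversion[conv_id].append(imp_id)

    credit: Credit = {}
    for conv_id, imp_ids in per_conversion.items():
        share = 1.0 / len(imp_ids)
        for imp_id in imp_ids:
            credit[(conv_id, imp_id)] = share
    return credit
```

Both functions read the same input, so filters or extra logic can be layered on top of either result without touching the underlying matched rows.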

12.System Architecture [Diagram labels: Amazon Kinesis, Structured Streaming]

13.Unification of Streaming + Batch DEMO

14.Managing Performance
• How can we optimize performance?
• Levers:
– Delta Tools
  • Optimize
  • ZOrder
  • Caching
  • Data Skipping
– Join on Stream
– Cluster Size
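Of the levers listed, data skipping is the easiest to illustrate without a cluster. The general idea is that per-file minimum and maximum values for a column let a query ignore files that cannot contain matching rows. The sketch below is a plain-Python illustration of that idea only; `FileStats`, `files_to_scan`, and the file paths are hypothetical and are not part of any Delta API.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FileStats:
    path: str    # hypothetical file name
    min_ts: int  # smallest event timestamp stored in the file
    max_ts: int  # largest event timestamp stored in the file


def files_to_scan(files: List[FileStats], lo: int, hi: int) -> List[str]:
    """Return only the files whose [min_ts, max_ts] range overlaps the
    queried range [lo, hi]; every other file can be skipped unread."""
    return [f.path for f in files if f.max_ts >= lo and f.min_ts <= hi]
```

Skipping works best when each file covers a narrow range of the filtered column, which is why it is usually discussed together with layout tools such as Optimize and ZOrder.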

15.Handling Complexity
• Flexibility with Complex Logic
– Forking streams
– Logic on query vs. in-stream
• Late or Corrected Data
– Upserts
– Views automatically update when raw data changed
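The pairing of upserts with views that reflect changed raw data can be sketched with an in-memory stand-in for a table. This is an illustration of the semantics only, not Delta code: `ConversionsTable`, its methods, and the `revenue` field are hypothetical. The point is that a corrected record replaces the earlier one by key, and anything computed at query time picks up the correction without a separate reprocessing step.

```python
from typing import Dict, Tuple


class ConversionsTable:
    """In-memory stand-in for a keyed table that supports upserts."""

    def __init__(self) -> None:
        # conversion_id -> (user_id, revenue)
        self._rows: Dict[str, Tuple[str, float]] = {}

    def upsert(self, conversion_id: str, user_id: str, revenue: float) -> None:
        """Insert a new conversion, or overwrite the existing row when a
        late or corrected record arrives for the same conversion_id."""
        self._rows[conversion_id] = (user_id, revenue)

    def revenue_by_user(self) -> Dict[str, float]:
        """A 'view': computed from the current rows on every call, so it
        always reflects the latest upserts."""
        totals: Dict[str, float] = {}
        for user_id, revenue in self._rows.values():
            totals[user_id] = totals.get(user_id, 0.0) + revenue
        return totals
```

Keeping the aggregation in the view rather than in the stream is one reading of the "logic on query vs. in-stream" trade-off on this slide: the correction costs one row write, and the recomputation happens only when someone queries.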

16.Conclusion
• Unification of Batch & Streaming
• Easy APIs for Managing Performance
• Flexible and Scalable Analytics on Near Real-Time Data