Building Sessionization Pipeline at Scale with Databricks Delta

Comcast has made a concerted effort to transform itself from a cable/ISP company into a technology company. Data-driven decision making is at the heart of this transformation: we use data to understand how customers interact with our products, and we see data as the most truthful representation of the voice of our customer. My team, the Product Analytics & Behavior Science (PABS) team, plays the role of interpreter, transforming data into consumable insights.

The X1 entertainment operating system is one of the largest video streaming platforms in the world, and our customers consume more than a billion hours of content a week on X1. Our team consumes X1 telemetry at a rate of more than 25 TB per day and uses this data to inform our product teams about the performance of, and engagement with, the platform. We also use this data to research customer behaviors and help our product teams identify areas of opportunity, which range from fixing bugs to creating new features.

To power these insights, we need reliable real-time data pipelines, and we need our data scientists and data engineers (DS/DEs) to be able to develop and commit new code quickly so that we can measure the new features the product teams are developing. To do this at this scale, we have been using Databricks and Databricks Delta to gain operational efficiencies, optimization, and cost savings. The Delta features we took advantage of to achieve this are:
· Distributed writes to S3 (essentially eliminating 500 errors)
· An S3 log with fast reads and ACID transactions (massive increases in S3 scan/read throughput, and consistent views of the bucket/table)
· Vacuum
· Optimize (which has allowed us to reduce a 640-node job to 40 nodes and massively increase the efficiency of our clusters as well as our DS/DEs)

1.Building Sessionization Pipeline at Scale with Databricks Delta April 24th, 2019

2.Comcast: Xfinity X1, Xfinity Internet and xFi, Xfinity Home, Xfinity Mobile

3.How do we improve our products? Data captures our customers' feedback. We decipher and extract insights… to empower data-informed decisions… to enhance customer experience at scale… We collect, store, and use all data in accordance with our privacy disclosures to users and applicable laws.

4.Data Scale Billions of Events Petabytes of Stored Data Millions TPS

5.What is sessionization?
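
The slide poses the question without answering it in text. A common working definition, assumed here rather than taken from the deck, is to group each device's events into sessions and start a new session whenever the gap between consecutive events exceeds an inactivity timeout. The sketch below illustrates that idea in plain Python; the 30-minute timeout, the `(device_id, epoch_seconds)` event shape, and the function name are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical inactivity timeout; the deck does not state the value used.
SESSION_GAP_SECONDS = 30 * 60


def sessionize(events: Iterable[Tuple[str, int]],
               gap: int = SESSION_GAP_SECONDS) -> Dict[str, List[List[int]]]:
    """Group (device_id, epoch_seconds) events into per-device sessions.

    A new session starts when the time since the device's previous event
    is strictly greater than `gap` seconds.
    """
    # Collect timestamps per device so each device is sessionized independently.
    by_device: Dict[str, List[int]] = defaultdict(list)
    for device_id, ts in events:
        by_device[device_id].append(ts)

    sessions: Dict[str, List[List[int]]] = {}
    for device_id, timestamps in by_device.items():
        timestamps.sort()  # events may arrive out of order
        device_sessions: List[List[int]] = []
        for ts in timestamps:
            if device_sessions and ts - device_sessions[-1][-1] <= gap:
                device_sessions[-1].append(ts)   # continue the current session
            else:
                device_sessions.append([ts])     # gap exceeded: open a new session
        sessions[device_id] = device_sessions
    return sessions
```

At the scale described in the deck this logic would run in a distributed engine rather than in a single process; the sketch only shows the grouping rule.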

6.Challenges/Goals: 1. Scalability 2. Reliability/Robustness 3. Performance

7.Value Gains
Before | After
Batch process | Stream process
84 jobs | 3 jobs
~14 hours data delay | ~7 hours data delay
Min. late data and failure support | Checkpointing

8.Initial Design Data Parse & Assign Sessionize Enrich Scalability Reliability Performance

9.Manually Partition Key to Enable Scaling Key 1 Key 2 Data Parse & Assign Key 32 Sessionize Enrich s3://mybucket/key=<key>/type=<type>/date=<yyyy-mm-dd>/hour=<hh>/… ? Scalability Reliability Performance
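
The slide shows data being split across 32 keys and written under a `key=/type=/date=/hour=` layout, but it does not say how a record gets its key. As a rough sketch, assuming a stable hash of a device identifier (CRC32 is chosen here purely for illustration, and `device_id`, `assign_key`, `partition_prefix`, and the `playback` type are hypothetical names), the assignment and the path prefix could look like this:

```python
import zlib
from datetime import datetime, timezone

NUM_KEYS = 32  # the slide shows Key 1 through Key 32


def assign_key(device_id: str, num_keys: int = NUM_KEYS) -> int:
    """Map a device id to a stable key in 1..num_keys.

    CRC32 is an assumption for illustration; the deck does not say
    how keys were actually assigned.
    """
    return zlib.crc32(device_id.encode("utf-8")) % num_keys + 1


def partition_prefix(bucket: str, key: int, event_type: str,
                     event_time: datetime) -> str:
    """Build the key/type/date/hour prefix shown on the slide."""
    return (
        f"s3://{bucket}/key={key}/type={event_type}"
        f"/date={event_time:%Y-%m-%d}/hour={event_time:%H}/"
    )


# Example usage with made-up values.
example_time = datetime(2019, 4, 24, 9, tzinfo=timezone.utc)
example_prefix = partition_prefix(
    "mybucket", assign_key("device-123"), "playback", example_time
)
```

Because the same device always maps to the same key, all of a device's events land in one partition, which is what lets each key be sessionized independently.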

10.From Batch to Streaming Data Parse & Assign Sessionize Enrich Data Scalability Reliability Performance

11.Delta Optimize Optimize Delta Delta Delta Data Parse & Assign Sessionize Enrich Data Scalability Reliability Performance

12.Random Prefixes Optimize Optimize Delta Delta Delta random prefix random prefix random prefix Data Parse & Assign Sessionize Enrich Data Scalability Reliability ? Performance
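
The deck does not show how the random prefixes are produced. The general idea is that S3 scales request throughput per key prefix, so leading each object key with a short random string spreads writes over many prefixes instead of concentrating them under one date/hour path. The following is only a conceptual sketch in plain Python; `random_prefixed_key` and its parameters are hypothetical and are not the Delta implementation.

```python
import secrets

HEX_DIGITS = "0123456789abcdef"


def random_prefixed_key(file_name: str, prefix_length: int = 2) -> str:
    """Return an object key that starts with a random lowercase-hex prefix.

    prefix_length is the number of hex characters in the prefix
    (a hypothetical default of 2 gives 256 possible prefixes).
    """
    if prefix_length < 1:
        raise ValueError("prefix_length must be at least 1")
    prefix = "".join(secrets.choice(HEX_DIGITS) for _ in range(prefix_length))
    return f"{prefix}/{file_name}"
```

With keys scattered like this, files can no longer be found by listing a partition directory, so the sketch assumes something else, such as a table transaction log, records which object keys belong to the table.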

13.Auto Optimize and More Delta Delta Delta random prefix random prefix random prefix Data Parse & Assign Sessionize Enrich Upsert Delta Data Data Scalability Reliability Performance

14.Result (three Delta tables, each with a random prefix)
· Reduced an 84-job process to 3 jobs
· Deliver enriched data 2x faster
· Increased operational friendliness

15.Outcome Scalable data pipeline that reliably provides consumable insights to our teams in near real time. Pipeline stages (backed by Delta): Data, Parse & Assign, Sessionize, Enrich. Outputs: OKRs, Experience Enhancements, Product Research, Feature Developments.