Databricks: What We Have Learned by Eating Our Dog Food

“Databricks Unified Analytics Platform (UAP) is a cloud-based service for running all analytics in one place – from highly reliable and performant data pipelines to state-of-the-art machine learning. From the original creators of Apache Spark and MLflow, it provides data science and engineering teams ready-to-use, pre-packaged clusters with optimized Apache Spark and various ML frameworks, coupled with powerful collaboration capabilities to improve productivity across the ML lifecycle. Yada yada yada… But in addition to being a vendor, Databricks is also a user of UAP. So, what have we learned by eating our own dog food? Attend a “from the trenches” report from Suraj Acharya, the Director of Engineering responsible for Databricks’ in-house data engineering team, on how his team put Databricks technology to use, the lessons they have learned along the way, and best practices for using Databricks for data engineering.”

1. WIFI SSID: SparkAISummit | Password: UnifiedAnalytics

2. What We Have Learned By Eating Our Dog Food
(Data engineering at Databricks using Databricks)
Suraj Acharya, Databricks
Xuan Wang, Databricks

3. What’s this talk about?
● Data engineering at Databricks using Databricks
● Sharing our approaches and lessons learned
● Hopefully there are a few things that are helpful to your organization and environment
● Starting a discussion and learning

4. Data Team at Databricks
● Mission
○ Create datasets, tools and analyses to inform decision making.
● Data Eng + Data Science
● Pipelines for Product Analytics

5. Data Engineering Playing Field
[Diagram of the components involved: Orchestration and Workflow, Sandbox, CI/CD, Data Quality, Dashboarding/Reporting/BI, Compute (ETL, analytics, ML), Message Log, Data Catalog/Lineage, Data Model, Storage]
https://pages.databricks.com/wb-data-engineering-best-practices.html

6. Data Engineering at Databricks
● ETL jobs written in Spark running on Databricks
● Azure and AWS
● Structured log events (mainly)
● Scale:
○ 100s of tables
○ 100s of billions of records processed every day
○ 100s of jobs x 10s of cloud regions

7. Challenges
● Common pitfalls
● Testing
● Deployment
● Configuration management
● Monitoring

8. Challenges
● Common pitfalls
○ overwrite table partition
○ optimize read performance
● Testing
● Deployment
● Configuration management
● Monitoring

9-12. Data Pipelines
[Diagram, built up over four slides: services (service 0, service 1, …) across Deployment 1 and Deployment 2 send raw logs to a centralized messaging system; hourly and nightly Databricks jobs (notebooks, clusters, jobs) consume the raw logs and produce processed logs, with credentials managed via Secrets]

13. Overwrite Table Partition
● Problems
○ nightly jobs vs. hourly jobs
○ backfill jobs
● df.write.mode("overwrite")
○ What if the job crashes in the middle?
○ What if a consumer reads before the write completes?
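The risk is easiest to see in code. Below is a minimal PySpark sketch of the pitfall; the paths are hypothetical, not from the talk. Overwriting a path deletes the existing files before the new ones land, so a mid-job crash or a concurrent reader can observe a partially written (or empty) partition.

```python
# Minimal sketch of the pitfall (hypothetical paths).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.json("/raw/logs/date=2019-04-24")   # hypothetical input

(df.write
   .mode("overwrite")                               # delete-then-write: not atomic
   .parquet("/processed/logs/date=2019-04-24"))     # readers may see partial data
```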

14-16. Overwrite Table Partition
● Solution 1: partition swap
[Diagram, built up over three slides: each run writes its output to a new location alongside the old output; the table is swapped to point at the new output only after the write succeeds, and the old output is retained]
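A hedged sketch of what a partition swap can look like, continuing with the `df` and `spark` from the sketch above; the table name, paths, and versioning scheme are illustrative assumptions, not the exact mechanism from the talk. The write never touches the live data, and the repointing is a single metastore update.

```python
import uuid

date = "2019-04-24"
# Fresh, hypothetical location per run; assumes table `logs` is partitioned by date.
new_path = f"/processed/logs/date={date}/v-{uuid.uuid4().hex}"

# Write next to (not over) the old output; a crash here leaves the live
# partition untouched.
df.write.parquet(new_path)

# Atomically repoint the partition at the new files; the old files stay on
# disk for rollback until they are garbage-collected.
spark.sql(f"""
    ALTER TABLE logs PARTITION (date = '{date}')
    SET LOCATION '{new_path}'
""")
```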

17. Overwrite Table Partition
● Solution 2: Delta Lake
○ transactional storage layer for Apache Spark + Parquet
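A sketch of the same overwrite done through Delta Lake (the path and predicate are hypothetical). Delta's `replaceWhere` option replaces just the matching partition inside a transaction, so readers see either the old snapshot or the new one, never a half-written state.

```python
(df.write
   .format("delta")
   .mode("overwrite")
   .option("replaceWhere", "date = '2019-04-24'")  # replace only this partition
   .save("/delta/processed_logs"))                 # atomic: old or new, never partial
```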

18. Optimize Read Performance
● Problem
○ Efficiently filter billions of records
[Diagram: multiple users querying the processed logs, each filtering on a different metric (metric=login, metric=clusterEvent, metric=…)]

19. Optimize Read Performance
● Solution 1: partitioning (a sketch follows below)
● df.write.partitionBy("date", "metric")
○ too many partitions => small files
○ uneven key distribution => skewness
● Choose partition keys
○ Good: partition by date/hour
○ Bad: partition by customerId
○ Rule of thumb: > 1 GB per partition
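The sketch below continues with the same hypothetical `df`; the column names follow the slide, the path does not. Low-cardinality, evenly sized keys such as date and metric let queries prune whole directories, whereas a key like customerId would produce huge numbers of small, skewed partitions.

```python
(df.write
   .partitionBy("date", "metric")   # good keys: bounded cardinality, even sizes
   .parquet("/processed/logs"))

# A filter on the partition columns now touches only the matching directories:
logins = (spark.read.parquet("/processed/logs")
          .where("date = '2019-04-24' AND metric = 'login'"))
```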

20-23. Optimize Read Performance
● Solution 2: OPTIMIZE and data skipping
● Example (built up over four slides):
○ Before OPTIMIZE: many small files
○ After OPTIMIZE: compacted into File1, File2, File3
○ Compute statistics: File1: min='a', max='g'; File2: min='g', max='n'; File3: min='o', max='z'
○ SELECT * FROM logs WHERE metric = 'login'
○ 'login' can only fall within File2's [min, max] range, so File1 and File3 are skipped entirely
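In code, the two steps look roughly like this on a Databricks Delta table (the table name `logs` follows the slide). OPTIMIZE compacts small files and records per-file min/max statistics; Z-ordering by the filter column makes those statistics selective, which is what enables the skipping above.

```python
# Compact small files and co-locate rows by `metric` so per-file min/max
# statistics become selective (Databricks Delta syntax).
spark.sql("OPTIMIZE logs ZORDER BY (metric)")

# Data skipping: only files whose [min, max] range can contain 'login'
# (File2 in the example above) are actually scanned.
logins = spark.sql("SELECT * FROM logs WHERE metric = 'login'")
```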

24. Challenges
● Common pitfalls
● Testing
○ Unit tests & integration tests
● Deployment
● Configuration management
● Monitoring

25-26. Dev & Deployment Workflow
[Diagram, built up over two slides: a development and deployment cycle spanning Testing, Deployment, Configuration Management, and Monitoring]

27-29. A Simple ETL Example
[Diagram, built up over three slides: JSON events from the message log flow through an ETL job and are written out as Parquet]
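A minimal sketch of the pictured pipeline (paths are hypothetical): raw JSON events are read from the message log's landing area and written back out as columnar Parquet.

```python
raw = spark.read.json("/raw/message-log/")        # JSON events in
raw.write.mode("append").parquet("/processed/")   # Parquet out
```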