Databricks: What We Have Learned by Eating Our Dog Food

“Databricks Unified Analytics Platform (UAP) is a cloud-based service for running all analytics in one place – from highly reliable and performant data pipelines to state-of-the-art machine learning. From the original creators of Apache Spark and MLflow, it provides data science and engineering teams ready-to-use, pre-packaged clusters with optimized Apache Spark and various ML frameworks, coupled with powerful collaboration capabilities to improve productivity across the ML lifecycle. Yada yada yada… But in addition to being a vendor, Databricks is also a user of UAP. So, what have we learned by eating our own dog food? Attend a “from the trenches” report from Suraj Acharya, the Director of Engineering responsible for Databricks’ in-house data engineering team, on how his team put Databricks technology to use, the lessons they have learned along the way, and best practices for using Databricks for data engineering.”

1. WIFI SSID: SparkAISummit | Password: UnifiedAnalytics

2. What We Have Learned By Eating Our Dog Food
(Data engineering at Databricks using Databricks)
Suraj Acharya, Databricks
Xuan Wang, Databricks

3. What’s this talk about?
● Data engineering at Databricks using Databricks
● Sharing our approaches and lessons learned
● Hopefully there are a few things that are helpful to your organization and environment
● Starting a discussion and learning

4. Data Team at Databricks
● Mission
○ Create datasets, tools and analyses to inform decision making.
● Data Eng + Data Science
● Pipelines for Product Analytics

5. Data Engineering Playing Field
[Diagram of the components involved: Orchestration and Workflow, Sandbox, CI/CD, Data Quality, Dashboarding/Reporting/BI, Compute (ETL, analytics, ML), Message Log, Data Catalog/Lineage, Data Model, Storage]
https://pages.databricks.com/wb-data-engineering-best-practices.html

6. Data Engineering at Databricks
● ETL jobs written in Spark running on Databricks
● Azure and AWS
● Structured log events (mainly)
● Scale:
○ 100s of tables
○ 100s of billions of records processed every day
○ 100s of jobs x 10s of cloud regions

7. Challenges
● Common pitfalls
● Testing
● Deployment
● Configuration management
● Monitoring

8. Challenges
● Common pitfalls
○ overwrite table partition
○ optimize read performance
● Testing
● Deployment
● Configuration management
● Monitoring

9-12. Data Pipelines
[Diagram, built up over four slides: services (service 0, service 1, …) across Deployment 1 and Deployment 2 send raw logs to a centralized messaging system; hourly and nightly Databricks jobs (notebooks, clusters, jobs) consume the raw logs and produce processed logs, with credentials managed via Secrets]

13. Overwrite Table Partition
● Problems
○ nightly jobs vs. hourly jobs
○ backfill jobs
● df.write.mode("overwrite")
○ What if the job crashes in the middle?
○ What if a consumer reads before the write completes?
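The risk is easiest to see in code. Below is a minimal PySpark sketch of the pitfall; the paths are hypothetical, not from the talk. Overwriting a path deletes the existing files before the new ones land, so a mid-job crash or a concurrent reader can observe a partially written (or empty) partition.

```python
# Minimal sketch of the pitfall (hypothetical paths).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.json("/raw/logs/date=2019-04-24")   # hypothetical input

(df.write
   .mode("overwrite")                               # delete-then-write: not atomic
   .parquet("/processed/logs/date=2019-04-24"))     # readers may see partial data
```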

14-16. Overwrite Table Partition
● Solution 1: partition swap
[Diagram, built up over three slides: each run writes its output to a new location alongside the old output; the table is swapped to point at the new output only after the write succeeds, and the old output is retained]
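A hedged sketch of what a partition swap can look like, continuing with the `df` and `spark` from the sketch above; the table name, paths, and versioning scheme are illustrative assumptions, not the exact mechanism from the talk. The write never touches the live data, and the repointing is a single metastore update.

```python
import uuid

date = "2019-04-24"
# Fresh, hypothetical location per run; assumes table `logs` is partitioned by date.
new_path = f"/processed/logs/date={date}/v-{uuid.uuid4().hex}"

# Write next to (not over) the old output; a crash here leaves the live
# partition untouched.
df.write.parquet(new_path)

# Atomically repoint the partition at the new files; the old files stay on
# disk for rollback until they are garbage-collected.
spark.sql(f"""
    ALTER TABLE logs PARTITION (date = '{date}')
    SET LOCATION '{new_path}'
""")
```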

17. Overwrite Table Partition
● Solution 2: Delta Lake
○ transactional storage layer for Apache Spark + Parquet
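A sketch of the same overwrite done through Delta Lake (the path and predicate are hypothetical). Delta's `replaceWhere` option replaces just the matching partition inside a transaction, so readers see either the old snapshot or the new one, never a half-written state.

```python
(df.write
   .format("delta")
   .mode("overwrite")
   .option("replaceWhere", "date = '2019-04-24'")  # replace only this partition
   .save("/delta/processed_logs"))                 # atomic: old or new, never partial
```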

18. Optimize Read Performance
● Problem
○ Efficiently filter billions of records
[Diagram: multiple users querying the processed logs, each filtering on a different metric (metric=login, metric=clusterEvent, metric=…)]

19. Optimize Read Performance
● Solution 1: partitioning (a sketch follows below)
● df.write.partitionBy("date", "metric")
○ too many partitions => small files
○ uneven key distribution => skewness
● Choose partition keys
○ Good: partition by date/hour
○ Bad: partition by customerId
○ Rule of thumb: > 1 GB per partition
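The sketch below continues with the same hypothetical `df`; the column names follow the slide, the path does not. Low-cardinality, evenly sized keys such as date and metric let queries prune whole directories, whereas a key like customerId would produce huge numbers of small, skewed partitions.

```python
(df.write
   .partitionBy("date", "metric")   # good keys: bounded cardinality, even sizes
   .parquet("/processed/logs"))

# A filter on the partition columns now touches only the matching directories:
logins = (spark.read.parquet("/processed/logs")
          .where("date = '2019-04-24' AND metric = 'login'"))
```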

20-23. Optimize Read Performance
● Solution 2: OPTIMIZE and data skipping
● Example (built up over four slides):
○ Before OPTIMIZE: many small files
○ After OPTIMIZE: compacted into File1, File2, File3
○ Compute statistics: File1: min='a', max='g'; File2: min='g', max='n'; File3: min='o', max='z'
○ SELECT * FROM logs WHERE metric = 'login'
○ 'login' can only fall within File2's [min, max] range, so File1 and File3 are skipped entirely
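In code, the two steps look roughly like this on a Databricks Delta table (the table name `logs` follows the slide). OPTIMIZE compacts small files and records per-file min/max statistics; Z-ordering by the filter column makes those statistics selective, which is what enables the skipping above.

```python
# Compact small files and co-locate rows by `metric` so per-file min/max
# statistics become selective (Databricks Delta syntax).
spark.sql("OPTIMIZE logs ZORDER BY (metric)")

# Data skipping: only files whose [min, max] range can contain 'login'
# (File2 in the example above) are actually scanned.
logins = spark.sql("SELECT * FROM logs WHERE metric = 'login'")
```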

24. Challenges
● Common pitfalls
● Testing
○ Unit tests & integration tests
● Deployment
● Configuration management
● Monitoring

25-26. Dev & Deployment Workflow
[Diagram, built up over two slides: a development and deployment cycle spanning Testing, Deployment, Configuration Management, and Monitoring]

27-29. A Simple ETL Example
[Diagram, built up over three slides: JSON events from the message log flow through an ETL job and are written out as Parquet]
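A minimal sketch of the pictured pipeline (paths are hypothetical): raw JSON events are read from the message log's landing area and written back out as columnar Parquet.

```python
raw = spark.read.json("/raw/message-log/")        # JSON events in
raw.write.mode("append").parquet("/processed/")   # Parquet out
```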