1. Continuous Applications at the Scale of 100 Teams with Databricks Delta and Structured Streaming. Viacheslav Inozemtsev, Max Schultze. April 25, 2019
2. Outline ● Introduction of Zalando ● Zalando’s Processing Platform ● Databricks Use Cases ● Lessons Learned
3. Who we are
Viacheslav Inozemtsev ● Data Engineer ● Degrees in Applied Math and in Computer Science ● Working with Spark since 0.9.2
Max Schultze ● Data Engineer ● MSc in Computer Science ● Took part in early development of Apache Flink
4. Introduction of Zalando
5-10. Zalando’s Data Lake [architecture diagram, built up over six slides. Labels: Ingestion, Storage, Serving, Data Center, DWH, Event Bus, Metastore, Ad-Hoc Querying, Processing Platform]
11. Zalando’s Databricks Processing Platform
12-16. Zalando’s Databricks Processing Platform - Technical Setup [diagram, built up over five slides]
17-20. Zalando’s Databricks Processing Platform - Organizational Setup
● Introduction to Databricks
○ RSA
○ Office Hours
● Initial Setup
○ Inner Source Configuration
● Development Phase
○ Office Hours
○ Guest Developer
● Productionizing
○ 24/7 Support
21. Databricks Use Cases
22-23. Batch Ingestion from Data Warehouse [data lake diagram repeated. Labels: Ingestion, Storage, Serving, Data Center, DWH, Event Bus, Metastore, Ad-Hoc Querying, Processing Platform]
24-27. Batch Ingestion from Data Warehouse
● Problem 1: extraction from databases via JDBC can be slow
● Solution:
○ use the parallelism of the Spark JDBC reader
○ for partitioned tables, a view with a PARTITION_ID column can be created
○ works especially well for tables partitioned across multiple machines
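The solution to Problem 1 can be sketched in PySpark. This is a minimal illustration, not the speakers' implementation: the slides do not say which parallel-read mechanism was used, so the sketch assumes the `predicates` argument of `DataFrameReader.jdbc`, which makes Spark issue one JDBC query (and one task) per predicate. The helper names `partition_predicates` and `read_view_in_parallel`, the view name, and the connection details in the usage note are hypothetical.

```python
from typing import Dict, List


def partition_predicates(partition_ids: List[int],
                         column: str = "PARTITION_ID") -> List[str]:
    """Build one WHERE-clause fragment per partition id.

    Spark creates one DataFrame partition per predicate, so each id
    is fetched by its own JDBC query, in parallel with the others.
    """
    return [f"{column} = {int(pid)}" for pid in partition_ids]


def read_view_in_parallel(spark, url: str, view: str,
                          partition_ids: List[int],
                          properties: Dict[str, str]):
    """Read a database view that exposes a PARTITION_ID column.

    `spark` is an existing SparkSession; `view` is assumed to be the
    view created on top of the partitioned table (hypothetical name).
    """
    return spark.read.jdbc(
        url=url,
        table=view,
        predicates=partition_predicates(partition_ids),
        properties=properties,
    )
```

A call might look like `read_view_in_parallel(spark, "jdbc:postgresql://dwh.example:5432/sales", "orders_with_partition_id", list(range(16)), {"user": "reader", "password": "secret", "fetchsize": "10000"})`, where every value is a placeholder. When the partitioning column is numeric and evenly distributed, the `partitionColumn`, `lowerBound`, `upperBound`, and `numPartitions` options of the JDBC source are an alternative to explicit predicates.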
28-29. Batch Ingestion from Data Warehouse
● Problem 2: the data warehouse is often still on premises
● Solution:
○ resolve this early!