Continuous Applications at the Scale of 100 Teams with Databricks Delta and Structured Streaming

Zalando strives to be a fully data-driven company that uses AI to make decisions quickly and accurately. To that end, we have built a Data Lake that contains all of the company's data. To provide easy access to that data and enable the company to make use of it, we have established an internal platform that offers Databricks as a service to all departments and teams. Making Databricks Delta tables available to all clients of the Data Lake has enabled them to leverage Structured Streaming and build continuous applications on top of it. A big part of this journey was solving challenges in governance, security, and access management. In this talk, we share our experience in productionizing and operating Databricks at scale and in making data-driven continuous applications feasible out of the box.

1. Continuous Applications at the Scale of 100 Teams with Databricks Delta and Structured Streaming. Viacheslav Inozemtsev, Max Schultze. April 25, 2019

2. Outline ● Introduction of Zalando ● Zalando’s Processing Platform ● Databricks Use Cases ● Lessons Learned

3. Who we are ● Viacheslav Inozemtsev: Data Engineer; degrees in Applied Math and in Computer Science; working with Spark since 0.9.2 ● Max Schultze: Data Engineer; MSc in Computer Science; took part in early development of Apache Flink

4. Introduction of Zalando

5. Zalando’s Data Lake Ingestion Serving Storage

6. Zalando’s Data Lake Ingestion Serving Data Center Storage DWH Event Bus

7. Zalando’s Data Lake Ingestion Serving Data Center Storage DWH Event Bus Metastore

8. Zalando’s Data Lake Ingestion Serving Ad-Hoc Querying Data Center Storage DWH Event Bus Metastore

9. Zalando’s Data Lake Ingestion Serving Ad-Hoc Querying Data Center Storage DWH Processing Platform Event Bus Metastore

10. Zalando’s Data Lake Ingestion Serving Ad-Hoc Querying Data Center Storage DWH Processing Platform Event Bus Metastore

11. Zalando’s Databricks Processing Platform

12. Zalando’s Databricks Processing Platform - Technical Setup

13. Zalando’s Databricks Processing Platform - Technical Setup

14. Zalando’s Databricks Processing Platform - Technical Setup

15. Zalando’s Databricks Processing Platform - Technical Setup

16. Zalando’s Databricks Processing Platform - Technical Setup

17. Zalando’s Databricks Processing Platform - Organizational Setup ● Introduction to Databricks: RSA, Office Hours

18. Zalando’s Databricks Processing Platform - Organizational Setup ● Introduction to Databricks: RSA, Office Hours ● Initial Setup: Inner Source Configuration

19. Zalando’s Databricks Processing Platform - Organizational Setup ● Introduction to Databricks: RSA, Office Hours ● Initial Setup: Inner Source Configuration ● Development Phase: Office Hours, Guest Developer

20. Zalando’s Databricks Processing Platform - Organizational Setup ● Introduction to Databricks: RSA, Office Hours ● Initial Setup: Inner Source Configuration ● Development Phase: Office Hours, Guest Developer ● Productionizing: 24/7 Support

21. Databricks Use Cases

22. Batch Ingestion from Data Warehouse Ingestion Serving Ad-Hoc Querying Data Center Storage DWH Processing Platform Event Bus Metastore

23. Batch Ingestion from Data Warehouse

24. Batch Ingestion from Data Warehouse ● Problem 1: extraction from databases via JDBC can be slow

25. Batch Ingestion from Data Warehouse ● Problem 1: extraction from databases via JDBC can be slow ● Solution: ○ use the parallelism of the Spark JDBC reader

26. Batch Ingestion from Data Warehouse ● Problem 1: extraction from databases via JDBC can be slow ● Solution: ○ use the parallelism of the Spark JDBC reader ○ for partitioned tables, a view with a PARTITION_ID column can be created

27. Batch Ingestion from Data Warehouse ● Problem 1: extraction from databases via JDBC can be slow ● Solution: ○ use the parallelism of the Spark JDBC reader ○ for partitioned tables, a view with a PARTITION_ID column can be created ○ works especially well for tables partitioned across multiple machines
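
The parallel JDBC read described on these slides could look roughly like the sketch below. It is not the speakers' implementation. It assumes a database view that exposes a PARTITION_ID column with integer values 0 to n-1, and the helper names `partition_predicates` and `read_view_in_parallel` are made up for illustration. The only Spark API used is PySpark's `DataFrameReader.jdbc` with its `predicates` argument.

```python
from typing import Dict, List, Optional


def partition_predicates(num_partitions: int, column: str = "PARTITION_ID") -> List[str]:
    """Build one WHERE-clause predicate per partition id.

    Assumption: the view numbers its partitions 0..num_partitions-1.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    return [f"{column} = {i}" for i in range(num_partitions)]


def read_view_in_parallel(spark, url: str, view: str, num_partitions: int,
                          properties: Optional[Dict[str, str]] = None):
    """Read a JDBC view so that each PARTITION_ID is fetched by its own task.

    `spark` is an existing SparkSession; `url`, `view` and `properties`
    (user, password, driver) are placeholders supplied by the caller.
    """
    return spark.read.jdbc(
        url=url,
        table=view,
        # Each predicate becomes one partition of the resulting DataFrame.
        predicates=partition_predicates(num_partitions),
        properties=properties or {},
    )
```

Each predicate is run as a separate query, so the number of predicates also bounds the number of concurrent connections the database has to serve. Choosing it to match the physical partitioning of the table is presumably what makes the approach work well for tables spread across multiple machines.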

28. Batch Ingestion from Data Warehouse ● Problem 2: data warehouse is still often on premises

29. Batch Ingestion from Data Warehouse ● Problem 2: data warehouse is still often on premises ● Solution: ○ resolve this early!