Delta Lake - Reliable Data Lakes at Scale

Delta Lake is an open-source storage layer that brings ACID transaction support to Spark and big data applications.


1.Welcome Wifi Network: Convene Conference Centers Wifi Password: meetings

2.Building Reliable Data Lakes Delta Lake | Hands-on Lab Joe Widen Senior Solutions Architect

3.Agenda
Time            Activity
08:30 – 09:00   Registration, Breakfast & Networking
09:00 – 09:20   Opening Remarks - Delta Lake Overview
09:20 – 09:30   Delta Lake in Action - Customer Cases
09:30 – 10:00   Delta Lake: Hands-on Walkthrough - Part 1
10:00 – 10:30   Break
10:30 – 11:30   Delta Lake: Hands-on Walkthrough - Part 2
11:30 – 11:45   Productionizing ML with Delta Lake Demo
11:45 – 12:15   Ask the Expert: bring your most challenging data problems
12:15 – 12:30   Wrap Up

4.Delta Lake: Reliable Data Lakes at Scale
Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark™ and big data workloads.

5.VISION: Accelerate innovation by unifying data science, engineering and business
SOLUTION: Unified Analytics Platform
WHO WE ARE
• Original creators of [project logos not captured]
• 2000+ global companies use our platform across big data & machine learning lifecycle

6.Databricks Community Edition
Databricks Community Edition is the free version of our cloud-based big data platform.
● Users can access a micro-cluster as well as a cluster manager and notebook environment.
● All users can share their notebooks and host them free of charge with Databricks.

7.Sign up for Databricks Community Edition Go to: databricks.com/try and choose Community Edition

8.Opening Remarks

9.Let’s talk about building data lakes

10.Data reliability challenges with data lakes
• Failed production jobs leave data in a corrupt state, requiring tedious recovery
• Lack of quality enforcement creates inconsistent and unusable data
• Lack of transactions makes it almost impossible to mix appends and reads, batch and streaming

11.How does this impact the building of data lakes?

12. Evolution of a Cutting-Edge Data Lake
[Diagram labels: Events, ?, Streaming Analytics, Data Lake, AI & Reporting]

13. Evolution of a Cutting-Edge Data Lake
[Diagram labels: Events, Streaming Analytics, Data Lake, AI & Reporting]

14. Challenge #1: Historical Queries?
[Diagram labels: Events, Streaming Analytics, Data Lake, AI & Reporting; added: (1) λ-arch]

15. Challenge #2: Messy Data?
[Diagram labels as on slide 14; added: (2) Validation]

16. Challenge #3: Mistakes and Failures?
[Diagram labels as on slide 15; added: (3) Reprocessing, Partitioned]

17. Challenge #4: Updates?
[Diagram labels as on slide 16; added: (4) Updates, Scheduled to Avoid Modifications, UPDATE & MERGE]

18. Challenges of the Data Lake
[Diagram labels: Events, Streaming Analytics, Data Lake, AI & Reporting; (1) λ-arch, (2) Validation, (3) Reprocessing, (4) Updates; Partitioned, Scheduled to Avoid Modifications, UPDATE & MERGE]

19.Let’s try it instead with

20.A New Standard for Building Data Lakes
• Open Format Based on Parquet
• With Transactions
• Apache Spark™ APIs

21. Delta Lake ensures data reliability
[Diagram labels: Batch, Streaming, Updates/Deletes; Parquet Files, Transactional Log; High Quality & Reliable Data always ready for analytics]
Key Features
● ACID Transactions
● Unified Batch & Streaming
● Schema Enforcement
● Time Travel/Data Snapshots
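The key features listed on this slide can be illustrated with a toy model. The sketch below is not Delta Lake's implementation or API: `ToyDeltaTable`, the `_toy_log` directory, and the JSON commit files are hypothetical names chosen for illustration, and the data is stored directly in the log instead of in Parquet files. It assumes only that a table is an ordered log of commits, that a batch is validated against a schema before anything is written, and that a reader can replay the log up to a chosen version.

```python
import json
import os
import tempfile


class ToyDeltaTable:
    """Toy model (hypothetical): a table is an ordered log of JSON commits."""

    def __init__(self, path, schema):
        self.schema = schema  # column name -> expected Python type
        self.log_dir = os.path.join(path, "_toy_log")
        os.makedirs(self.log_dir, exist_ok=True)

    def _versions(self):
        # Commit files are named <version>.json; return the versions in order.
        return sorted(int(name.split(".")[0]) for name in os.listdir(self.log_dir))

    def append(self, rows):
        # Schema enforcement: validate the whole batch before writing anything,
        # so a bad batch leaves the table unchanged.
        for row in rows:
            if set(row) != set(self.schema):
                raise ValueError(f"schema mismatch: {sorted(row)}")
            for col, typ in self.schema.items():
                if not isinstance(row[col], typ):
                    raise ValueError(f"bad type for {col!r}: {row[col]!r}")
        versions = self._versions()
        version = versions[-1] + 1 if versions else 0
        commit = os.path.join(self.log_dir, f"{version:020d}.json")
        # Mode "x" fails if the file already exists, so two writers cannot both
        # claim the same version. A real system would also have to make the
        # write itself atomic; this toy does not.
        with open(commit, "x") as f:
            json.dump(rows, f)
        return version

    def read(self, version=None):
        # Replay commits in order; passing a version gives a "time travel" read.
        rows = []
        for v in self._versions():
            if version is not None and v > version:
                break
            with open(os.path.join(self.log_dir, f"{v:020d}.json")) as f:
                rows.extend(json.load(f))
        return rows


table = ToyDeltaTable(tempfile.mkdtemp(), {"id": int, "name": str})
first = table.append([{"id": 1, "name": "a"}])   # first commit
second = table.append([{"id": 2, "name": "b"}])  # second commit
snapshot = table.read(version=first)             # only the first commit's rows
```

Because validation happens before the commit file is created, a rejected batch never becomes visible to readers, which is the property the slide contrasts with failed jobs that leave data in a corrupt state.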

22. The
[Diagram labels: Kinesis; CSV, JSON, TXT…; Data Lake; Bronze (Raw Ingestion); Silver (Filtered, Cleaned, Augmented); Gold (Business-level Aggregates); Streaming Analytics; AI & Reporting]

23. The
[Diagram as on slide 22]
• Full ACID Transactions
• Open Source (Apache License)
• Powered by

24. The
[Diagram as on slide 22, with the labels "Data Quality Levels" and "Quality" added]
Delta Lake allows you to incrementally improve the quality of your data until it is ready for consumption.
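The incremental refinement described here can be sketched in plain Python. The sketch assumes nothing about Delta Lake's API: the function names `to_silver` and `to_gold`, the `user` and `amount` fields, and the sample rows are all made up for illustration. It only shows the shape of the idea: raw rows are kept as-is, a cleaning step produces a filtered table, and an aggregation step produces a business-level table.

```python
from collections import defaultdict


def to_silver(bronze_rows):
    """Filter and clean: keep rows that parse, normalize names and amounts."""
    silver = []
    for row in bronze_rows:
        try:
            silver.append({
                "user": row["user"].strip().lower(),
                "amount": float(row["amount"]),
            })
        except (KeyError, ValueError, AttributeError, TypeError):
            # Unparseable rows are skipped here but remain in bronze,
            # so they can still be inspected or reprocessed later.
            continue
    return silver


def to_gold(silver_rows):
    """Business-level aggregate: total amount per user."""
    totals = defaultdict(float)
    for row in silver_rows:
        totals[row["user"]] += row["amount"]
    return dict(totals)


# Hypothetical raw input with a malformed amount and a missing field.
bronze = [
    {"user": " Alice ", "amount": "10.5"},
    {"user": "bob", "amount": "not-a-number"},
    {"amount": "3"},
    {"user": "alice", "amount": 2},
]
silver = to_silver(bronze)
gold = to_gold(silver)
```

Keeping the bronze rows untouched is the design choice that makes the later stages safe to rerun: a fix to the cleaning logic can be applied to the full raw history.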

25. The
[Diagram as on slide 22]
• Dumping ground for raw data
• Often with long retention (years)
• Raw data with minimal parsing

26. The
[Diagram as on slide 22]
Intermediate data with some cleanup applied. Queryable for easy debugging!

27. The
[Diagram as on slide 22]
Clean data, ready for consumption. Read with Spark or Presto*
*Coming Soon

28. The
[Diagram as on slide 22]
Streams move data through the Delta Lake
• Low-latency or manually triggered
• Eliminates management of schedules and jobs
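One way to picture a stream that is either low-latency or manually triggered is a step function that processes only what arrived since the last checkpoint. The sketch below is a simplified illustration under that assumption, not how Spark Structured Streaming or Delta Lake is implemented; `run_once`, the integer offset, and the sample records are hypothetical.

```python
def run_once(source_log, sink, checkpoint, transform):
    """Process only records appended since `checkpoint`; return the new checkpoint."""
    end = len(source_log)
    for record in source_log[checkpoint:end]:
        sink.append(transform(record))
    return end


# Hypothetical append-only source and downstream table.
bronze_log = ["click:a", "click:b"]
silver = []

# Manually triggered run: picks up everything written so far.
offset = run_once(bronze_log, silver, 0, str.upper)

# New data arrives; the next run resumes from the checkpoint instead of
# rescanning the whole source.
bronze_log.append("click:c")
offset = run_once(bronze_log, silver, offset, str.upper)
```

Calling `run_once` in a tight loop approximates the low-latency mode, while calling it on demand approximates a manual trigger. In both cases the checkpoint, not an external schedule, determines what work remains.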

29. The
[Diagram as on slide 22, with the operations OVERWRITE, MERGE, INSERT, DELETE, UPDATE added]
Delta Lake also supports batch jobs and standard DML
• Retention
• GDPR
• Corrections
• UPSERTS
*DML Coming in 0.3.0
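The semantics of an upsert (MERGE) and of a delete for a GDPR-style request can be shown on plain lists of dictionaries. This is a minimal sketch of the behavior only, not Delta Lake's DML syntax or implementation; `merge`, `delete_where`, and the `id`/`email` fields are hypothetical names used for illustration.

```python
def merge(target, updates, key):
    """Upsert: rows with a matching key are updated, the rest are inserted.

    Returns a new list and leaves `target` untouched.
    """
    by_key = {row[key]: dict(row) for row in target}
    for row in updates:
        # Overlay the update on the existing row, or start from an empty row.
        by_key[row[key]] = {**by_key.get(row[key], {}), **row}
    return list(by_key.values())


def delete_where(target, predicate):
    """Delete: return a new list without the rows matching `predicate`."""
    return [row for row in target if not predicate(row)]


# Hypothetical customer table.
customers = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": "b@example.com"},
]
changes = [
    {"id": 2, "email": "new@example.com"},  # correction to an existing row
    {"id": 3, "email": "c@example.com"},    # new row
]
merged = merge(customers, changes, key="id")

# Erasure request for customer 1.
remaining = delete_where(merged, lambda row: row["id"] == 1)
```

Both helpers return new lists instead of mutating their input, so the previous state stays available to anyone still reading it.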