Delta Lake: Making Cloud Data Lakes Transactional and Scalable

A guest lecture by Reynold Xin for a Stanford University computer science course, covering the original motivation behind Delta Lake, the fundamentals of how it achieves ACID guarantees, and application case studies.

1. Delta Lake: Making Cloud Data Lakes Transactional and Scalable. Reynold Xin (@rxin), Stanford University, 2019-05-15.

2. About Me
- Databricks co-founder & Chief Architect
- Designed most major things in "modern day" Apache Spark
- #1 contributor to Spark by commits and net lines deleted
- PhD in databases from Berkeley

3. Building a data analytics platform is hard: how do you turn raw data streams into insights?

4. Traditional Data Warehouses: OLTP databases feed ETL pipelines into a data warehouse, which is queried with SQL to produce insights.

5. Challenges with Data Warehouses
- ETL pipelines are often complex and slow: ad-hoc pipelines process data and ingest it into the warehouse, so there are no insights until daily data dumps have been processed.
- Workloads are often limited to SQL and BI tools: data sits in proprietary formats, making it hard to integrate streaming, ML, and AI workloads.
- Performance is expensive: scaling up/out usually comes at a high cost.

6. Dream of Data Lakes: data streams land in a scalable data lake via ETL, feeding SQL, streaming, ML, and AI workloads to produce insights.

7. Data Lakes + Spark = Awesome! Spark, the first unified analytics engine, runs structured SQL, ML, and streaming workloads over the data lake to turn data streams into insights.

8. Advantages of Data Lakes
- ETL pipelines become simpler and fast (instead of complex and slow): the unified Spark API between batch and streaming simplifies ETL, and raw unstructured data becomes available as structured data in minutes.
- Workloads are no longer limited to anything: data lives in files with open formats and integrates with data processing and BI tools as well as ML and AI workloads and tools.
- Performance becomes cheaper (instead of expensive): it is easy and cost-effective to scale out compute and storage.

9.Challenges of Data Lakes in practice

11. Evolution of a Cutting-Edge Data Pipeline: events flow into streaming analytics and into a data lake that feeds reporting.

13. Challenge #1: Historical queries? A lambda architecture (λ-arch) is bolted on: events are written both to the streaming analytics path and to the data lake, so reporting can answer historical queries.

14. Challenge #2: Messy data? Validation jobs are added to both the streaming path and the data lake path to catch bad records.

15. Challenge #3: Mistakes and failures? The data lake is partitioned, and reprocessing jobs recompute affected partitions whenever a job writes bad data or fails.

16. Challenge #4: Query performance? Compaction jobs merge small files in the data lake, and must be scheduled carefully to avoid colliding with reprocessing.

17. Data Lake Reliability Challenges
- Failed production jobs leave data in a corrupt state, requiring tedious recovery.
- Lack of consistency makes it almost impossible to mix appends, deletes, and upserts and still get consistent reads.
- Lack of schema enforcement creates inconsistent and low-quality data.

18. Data Lake Performance Challenges
- Too many small (or very big) files: more time is spent opening and closing files than reading content, and streaming makes this worse.
- Partitioning, aka "poor man's indexing," breaks down when data has many dimensions and/or high-cardinality columns.
- Neither storage systems nor processing engines handle a very large number of subdirectories/files well.

19.Figuring out what to read is too slow

20.Data integrity is hard

21.Band-aid solutions made it worse!

22.Everyone has the same problems

23. The good of data warehouses: pristine data, transactional reliability, fast SQL queries. The good of data lakes: massive scale-out, open formats, mixed workloads.

24. DELTA: the scale of the data lake, the reliability & performance of the data warehouse, and the low latency of streaming.

25. DELTA = Scalable storage + Transactional log

26. DELTA's on-disk layout:

pathToTable/
+---- 000.parquet
+---- 001.parquet
+---- 002.parquet
+---- ...
+---- _delta_log/
      +---- 000.json
      +---- 001.json
      +---- ...

Scalable storage: table data is stored as Parquet files on HDFS, AWS S3, or Azure Blob Stores. Transactional log: _delta_log/ holds a sequence of metadata files that track the operations made on the table, stored in the same scalable storage alongside the table.
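A minimal sketch of this layout in plain Python, using empty placeholder files and a simplified action format (real Delta tables hold actual Parquet data, and the real log entries carry much richer metadata than shown here):

```python
import json
import os
import tempfile

# Build a miniature Delta-style table layout (illustrative only).
table = tempfile.mkdtemp(prefix="pathToTable_")
log_dir = os.path.join(table, "_delta_log")
os.makedirs(log_dir)

# Table data: Parquet files sit in the table root.
for name in ["000.parquet", "001.parquet", "002.parquet"]:
    open(os.path.join(table, name), "w").close()

# Transactional log: ordered JSON metadata files tracking operations,
# stored in the same directory tree as the data itself.
with open(os.path.join(log_dir, "000.json"), "w") as f:
    json.dump([{"add": "000.parquet"}, {"add": "001.parquet"}], f)

print(sorted(os.listdir(table)))
# -> ['000.parquet', '001.parquet', '002.parquet', '_delta_log']
```

The key point the layout illustrates: the log needs no separate database; it lives in the same scalable storage as the table it describes.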

27. Log-Structured Storage: changes to the table are stored as ordered, atomic commits. Each commit is one actions file in the _delta_log directory. For example, an INSERT writes 000.json with the actions {Add 001.parquet, Add 002.parquet}; a later UPDATE writes 001.json with the actions {Remove 001.parquet, Remove 002.parquet, Add 003.parquet}.
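The commit sequence above can be sketched as follows. This is a toy model, not the actual Delta log format: the `commit` helper and the `{"add": ...}`/`{"remove": ...}` action shape are illustrative simplifications.

```python
import json
import os
import tempfile

# Each commit is one file in _delta_log/ holding an ordered list of actions.
log_dir = os.path.join(tempfile.mkdtemp(), "_delta_log")
os.makedirs(log_dir)

def commit(version, actions):
    # Writing the whole action set as a single file makes the commit atomic:
    # a reader sees all of a commit's actions or none of them.
    path = os.path.join(log_dir, f"{version:03d}.json")
    with open(path, "w") as f:
        json.dump(actions, f)

# INSERT: commit 000 adds two data files.
commit(0, [{"add": "001.parquet"}, {"add": "002.parquet"}])

# UPDATE: commit 001 removes them and adds a rewritten file.
commit(1, [{"remove": "001.parquet"},
           {"remove": "002.parquet"},
           {"add": "003.parquet"}])

print(sorted(os.listdir(log_dir)))  # -> ['000.json', '001.json']
```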

28. Log-Structured Storage: readers consume the log in atomic units and therefore read consistent snapshots. With the commits above, a reader sees either [001+002].parquet or 003.parquet, and nothing in between.
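Snapshot reads can be sketched by replaying whole commits in order, which is why a reader never observes a half-applied update. Again a toy model under the same simplified action format as above, not Delta's actual snapshot code:

```python
import json
import os
import tempfile

# Set up the two commits from the INSERT/UPDATE example.
log_dir = os.path.join(tempfile.mkdtemp(), "_delta_log")
os.makedirs(log_dir)
with open(os.path.join(log_dir, "000.json"), "w") as f:
    json.dump([{"add": "001.parquet"}, {"add": "002.parquet"}], f)
with open(os.path.join(log_dir, "001.json"), "w") as f:
    json.dump([{"remove": "001.parquet"},
               {"remove": "002.parquet"},
               {"add": "003.parquet"}], f)

def snapshot(log_dir, version):
    """Files visible after replaying commits 0..version as whole units."""
    live = set()
    for v in range(version + 1):
        with open(os.path.join(log_dir, f"{v:03d}.json")) as f:
            for action in json.load(f):  # each commit applies atomically
                if "add" in action:
                    live.add(action["add"])
                else:
                    live.discard(action["remove"])
    return live

print(sorted(snapshot(log_dir, 0)))  # -> ['001.parquet', '002.parquet']
print(sorted(snapshot(log_dir, 1)))  # -> ['003.parquet']
```

Because the replay granularity is a whole commit file, the only observable states are the two printed above; the intermediate state with both old and new files visible never exists for a reader.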

29. Mutual Exclusion: concurrent writers need to agree on the order of changes, so new commit files must be created mutually exclusively. Of two writers trying to concurrently write 002.json, only one may succeed; the loser must re-read the log and retry.
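One way to get this behavior on a local filesystem is an atomic create-if-absent, sketched below. This is an illustration of the mutual-exclusion requirement, not Delta's actual mechanism: depending on the storage system, Delta relies on primitives such as atomic rename, or on an external coordination service for stores that lack put-if-absent semantics.

```python
import os
import tempfile

log_dir = tempfile.mkdtemp()

def try_commit(version, payload):
    """Attempt to claim commit `version`; return False if another writer won."""
    path = os.path.join(log_dir, f"{version:03d}.json")
    try:
        # O_CREAT | O_EXCL makes creation fail if the file already exists,
        # so at most one writer can create a given commit file.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False  # lost the race: re-read the log, rebase, retry
    with os.fdopen(fd, "w") as f:
        f.write(payload)
    return True

# Two writers race for the same next version, 002.json.
results = [try_commit(2, "writer-1 actions"),
           try_commit(2, "writer-2 actions")]
print(results)  # -> [True, False]: exactly one writer commits 002.json
```

The losing writer is not stuck: it reloads the log, sees the winner's commit, rechecks that its changes still make sense on top of it, and retries as 003.json.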