Delta Lake - Reliable Data Lakes at Scale
Delta Lake is an open-source storage layer that brings ACID transaction support to Spark and big data applications.
1 .Welcome Wifi Network: Convene Conference Centers Wifi Password: meetings
2 .Building Reliable Data Lakes Delta Lake | Hands-on Lab Joe Widen Senior Solutions Architect
3 .Agenda
Time            Activity
08:30 – 09:00   Registration, Breakfast & Networking
09:00 – 09:20   Opening Remarks - Delta Lake Overview
09:20 – 09:30   Delta Lake in Action - Customer Cases
09:30 – 10:00   Delta Lake: Hands-on Walkthrough - Part 1
10:00 – 10:30   Break
10:30 – 11:30   Delta Lake: Hands-on Walkthrough - Part 2
11:30 – 11:45   Productionizing ML with Delta Lake Demo
11:45 – 12:15   Ask the Expert: bring your most challenging data problems
12:15 – 12:30   Wrap Up
4 .Delta Lake: Reliable Data Lakes at Scale Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark™ and big data workloads.
5 .VISION: Accelerate innovation by unifying data science, engineering and business. SOLUTION: Unified Analytics Platform. WHO WE ARE: • Original creators of [logos not captured in extraction] • 2000+ global companies use our platform across the big data & machine learning lifecycle
6 .Databricks Community Edition Databricks Community Edition is the free version of our cloud-based big data platform. ● Users can access a micro-cluster as well as a cluster manager and notebook environment. ● All users can share their notebooks and host them free of charge with Databricks.
7 .Sign up for Databricks Community Edition Go to: databricks.com/try and choose Community Edition
8 .Opening Remarks
9 .Let’s talk about building data lakes
10 .Data reliability challenges with data lakes • Failed production jobs leave data in a corrupt state, requiring tedious recovery • Lack of quality enforcement creates inconsistent and unusable data • Lack of transactions makes it almost impossible to mix appends and reads, batch and streaming
11 .How does this impact the building of data lakes?
12 . Evolution of a Cutting-Edge Data Lake [Diagram labels: Events, ?, Streaming Analytics, Data Lake, AI & Reporting]
13 . Evolution of a Cutting-Edge Data Lake [Diagram labels: Events, Streaming Analytics, Data Lake, AI & Reporting]
14 . Challenge #1: Historical Queries? [Diagram labels: Events, Streaming Analytics, Data Lake, AI & Reporting; (1) λ-arch]
15 . Challenge #2: Messy Data? [Diagram labels: Events, Streaming Analytics, Data Lake, AI & Reporting; (1) λ-arch, (2) Validation]
16 . Challenge #3: Mistakes and Failures? [Diagram labels: Events, Streaming Analytics, Data Lake, AI & Reporting; (1) λ-arch, (2) Validation, (3) Reprocessing; Partitioned]
17 . Challenge #4: Updates? [Diagram labels: Events, Streaming Analytics, Data Lake, AI & Reporting; (1) λ-arch, (2) Validation, (3) Reprocessing, (4) Updates; Partitioned; Scheduled to Avoid Modifications; UPDATE & MERGE]
18 . Challenges of the Data Lake [Diagram labels: Events, Streaming Analytics, Data Lake, AI & Reporting; (1) λ-arch, (2) Validation, (3) Reprocessing, (4) Updates; Partitioned; Scheduled to Avoid Modifications; UPDATE & MERGE]
19 .Let’s try it instead with Delta Lake
20 .A New Standard for Building Data Lakes • Open Format Based on Parquet • With Transactions • Apache Spark™ APIs
21 . Delta Lake ensures data reliability [Diagram labels: Batch, Streaming, Updates/Deletes; Parquet Files, Transactional Log; High Quality & Reliable Data always ready for analytics] Key Features ● ACID Transactions ● Unified Batch & Streaming ● Schema Enforcement ● Time Travel/Data Snapshots
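The key features on this slide (an append-only transaction log, schema enforcement, and time travel) can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: `ToyDeltaTable` and its methods are hypothetical names invented for illustration, and the code is neither Delta Lake's API nor its on-disk format. It shows the idea that a write becomes visible only through a single log commit, so a rejected batch leaves no partial data behind.

```python
class ToyDeltaTable:
    """Toy in-memory model of a versioned table. NOT Delta Lake's implementation."""

    def __init__(self, schema):
        # schema: column name -> Python type (a hypothetical representation)
        self.schema = dict(schema)
        self.log = []  # one entry per committed version: the rows it added

    def _validate(self, rows):
        # Schema enforcement: reject the whole batch if any row does not match.
        for row in rows:
            if set(row) != set(self.schema):
                raise ValueError(f"schema mismatch: {sorted(row)}")
            for col, typ in self.schema.items():
                if not isinstance(row[col], typ):
                    raise ValueError(f"bad type for column {col!r}")

    def append(self, rows):
        rows = [dict(r) for r in rows]
        self._validate(rows)      # raises before anything is recorded
        self.log.append(rows)     # the commit: a single log append
        return len(self.log) - 1  # version number of this commit

    def read(self, version=None):
        # Time travel: replay the log up to and including `version`.
        if version is None:
            version = len(self.log) - 1
        return [dict(r) for commit in self.log[: version + 1] for r in commit]


table = ToyDeltaTable({"id": int, "event": str})
v0 = table.append([{"id": 1, "event": "click"}])
v1 = table.append([{"id": 2, "event": "view"}])

try:
    # The second row has the wrong type, so the whole batch is rejected.
    table.append([{"id": 3, "event": "buy"}, {"id": "oops", "event": "buy"}])
    rejected = False
except ValueError:
    rejected = True

latest = table.read()               # both committed rows
snapshot = table.read(version=v0)   # the table as of the first commit
```

Readers in this model always see a whole number of commits, which is the property that lets appends and reads mix safely.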
22 . The [logo not captured in extraction] [Diagram labels: Kinesis; CSV, JSON, TXT…; Data Lake; Bronze (Raw Ingestion); Silver (Filtered, Cleaned, Augmented); Gold (Business-level Aggregates); Streaming Analytics; AI & Reporting]
23 . [Diagram repeated from slide 22] • Full ACID Transactions • Open Source (Apache License) • Powered by [logo not captured in extraction]
24 . Data Quality Levels [Diagram repeated from slide 22, with an added Quality label] Delta Lake allows you to incrementally improve the quality of your data until it is ready for consumption.
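The incremental refinement described here can be sketched in plain Python. This is an illustration only, assuming hypothetical JSON event records with `user` and `action` fields; the function names are invented for the example, and a real pipeline would use Spark jobs writing to Delta tables rather than in-memory lists.

```python
import json
from collections import Counter


def to_bronze(raw_lines):
    # Bronze: keep every raw record as-is, with minimal parsing.
    return [{"raw": line} for line in raw_lines]


def to_silver(bronze):
    # Silver: parse, drop malformed records, and normalize fields.
    silver = []
    for rec in bronze:
        try:
            data = json.loads(rec["raw"])
        except json.JSONDecodeError:
            continue  # unparseable input stays in bronze only
        if not isinstance(data, dict) or "user" not in data or "action" not in data:
            continue  # required fields missing
        silver.append({
            "user": str(data["user"]).strip().lower(),
            "action": data["action"],
        })
    return silver


def to_gold(silver):
    # Gold: a business-level aggregate, here the number of events per action.
    return dict(Counter(rec["action"] for rec in silver))


raw = [
    '{"user": " Ann ", "action": "click"}',
    'not json',
    '{"user": "bob", "action": "click"}',
    '{"user": "bob", "action": "buy"}',
    '{"action": "buy"}',
]
bronze = to_bronze(raw)
silver = to_silver(bronze)
gold = to_gold(silver)
```

Because bronze keeps the raw input untouched, the silver and gold steps can be rerun after a parsing fix without re-ingesting anything.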
25 . [Diagram repeated from slide 22; Bronze tier] • Dumping ground for raw data • Often with long retention (years) • Raw data with minimal parsing
26 . [Diagram repeated from slide 22; Silver tier] Intermediate data with some cleanup applied. Queryable for easy debugging!
27 . [Diagram repeated from slide 22; Gold tier] Clean data, ready for consumption. Read with Spark or Presto* *Coming Soon
28 . [Diagram repeated from slide 22] Streams move data through the Delta Lake • Low-latency or manually triggered • Eliminates management of schedules and jobs
29 . [Diagram repeated from slide 22, with added labels: OVERWRITE, MERGE, INSERT, DELETE, UPDATE] Delta Lake also supports batch jobs and standard DML • Retention • GDPR • Corrections • UPSERTS *DML Coming in 0.3.0
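The semantics of an upsert (MERGE) and a delete can be shown on plain Python lists. This is a minimal sketch, not Delta Lake's DML syntax or API: `merge` and `delete_where` are hypothetical helpers, and the sample rows are made up for illustration.

```python
def merge(target, updates, key):
    """Upsert: update rows whose key matches, insert the rest. Returns a new list."""
    by_key = {row[key]: dict(row) for row in target}
    for row in updates:
        # Matched keys are overwritten field by field; unmatched keys are inserted.
        by_key[row[key]] = {**by_key.get(row[key], {}), **row}
    return list(by_key.values())


def delete_where(target, predicate):
    """Delete: return a new list without the rows matching the predicate."""
    return [row for row in target if not predicate(row)]


users = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": "b@example.com"},
]
changes = [
    {"id": 2, "email": "b2@example.com"},  # correction to an existing row
    {"id": 3, "email": "c@example.com"},   # new row
]

merged = merge(users, changes, key="id")
# A GDPR-style removal of one user's data.
remaining = delete_where(merged, lambda row: row["id"] == 1)
```

Both helpers return new lists and leave their input unchanged, loosely mirroring how each DML operation would produce a new table version rather than modifying data in place.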