1. Building Robust Data Pipelines with Delta Lake
2. Requirements
• Sign in to Databricks Community Edition: https://community.cloud.databricks.com
• Create a cluster (DBR 5.3)
• Import the notebook at https://docs.delta.io/notebooks/sais19-tutorial.dbc
3. Enterprises have been spending millions of dollars getting data into data lakes with Apache Spark.
4. The aspiration is to do data science and ML on all that data using Apache Spark!
• Recommendation Engines
• Risk, Fraud Detection
• IoT & Predictive Maintenance
• Genomics & DNA Sequencing
5. But the data is not ready for data science & ML. The majority of these projects are failing due to unreliable data!
6. Why are these projects struggling with reliability?
7. Data reliability challenges with data lakes
• Failed production jobs leave data in a corrupt state, requiring tedious recovery
• Lack of schema enforcement creates inconsistent and low-quality data
• Lack of consistency makes it almost impossible to mix appends and reads, batch and streaming
8. A New Standard for Building Data Lakes
• Open format, based on Parquet
• With transactions
• Apache Spark APIs
9. Delta Lake makes data ready for analytics, data science & ML: reliability and performance.
10. Delta Lake ensures data reliability
High-quality, reliable data, always ready for analytics: batch and streaming writes, updates/deletes, Parquet files, and a transactional log.
Key features:
• ACID Transactions
• Schema Enforcement
• Unified Batch & Streaming
• Time Travel / Data Snapshots
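The key features above can be sketched with a toy, file-backed transaction log. This is a deliberately simplified stand-in for Delta Lake's actual `_delta_log` (which records JSON commit actions alongside Parquet data files), assuming nothing beyond the Python standard library; `ToyDeltaLog` and its methods are hypothetical names used only for illustration:

```python
# Toy sketch of a Delta-style transactional log (NOT Delta Lake's real
# implementation): each commit appends a numbered JSON file, so readers
# always see a consistent snapshot and can "time travel" to any version.
import json
import os
import tempfile


class ToyDeltaLog:
    def __init__(self, path):
        self.log_dir = os.path.join(path, "_toy_log")
        os.makedirs(self.log_dir, exist_ok=True)

    def _versions(self):
        # Committed versions, recovered from the numbered log files.
        return sorted(int(name.split(".")[0]) for name in os.listdir(self.log_dir))

    def commit(self, rows, schema):
        # Schema enforcement: reject rows whose columns differ from the schema.
        for row in rows:
            if set(row) != set(schema):
                raise ValueError(f"schema mismatch: {row}")
        # "Atomic" commit: the write becomes visible only once the numbered
        # log file exists; a crash mid-write leaves prior versions intact.
        versions = self._versions()
        version = versions[-1] + 1 if versions else 0
        with open(os.path.join(self.log_dir, f"{version:020d}.json"), "w") as f:
            json.dump({"version": version, "rows": rows}, f)
        return version

    def snapshot(self, version_as_of=None):
        # Time travel: replay commits up to the requested version.
        rows = []
        for v in self._versions():
            if version_as_of is not None and v > version_as_of:
                break
            with open(os.path.join(self.log_dir, f"{v:020d}.json")) as f:
                rows.extend(json.load(f)["rows"])
        return rows


table = ToyDeltaLog(tempfile.mkdtemp())
table.commit([{"id": 1}], schema=["id"])   # version 0
table.commit([{"id": 2}], schema=["id"])   # version 1
latest = table.snapshot()                  # both rows
v0 = table.snapshot(version_as_of=0)       # only the first commit
```

The design choice worth noting is that readers never coordinate with writers: they only ever see versions whose log file is fully written, which is the same idea that lets real Delta tables mix appends and reads, batch and streaming.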
11. References
• Docs: https://docs.delta.io
• Home page: https://delta.io
12. Let’s begin!