Applications of Time Travel with Delta Lake

Time travel is now possible with Delta Lake! We will uncover how Delta Lake makes time travel possible and why it matters to you. Through presentation, notebooks, and code, we will showcase several common applications and how they can improve your modern data engineering pipelines.

Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark(TM). It provides snapshot isolation for concurrent reads and writes, and it enables efficient upserts, deletes, and immediate rollback. It also allows background file optimization through compaction and Z-Order partitioning, achieving up to 100x performance improvements.

In this presentation you will learn:
• What challenges Delta Lake solves
• How Delta Lake works under the hood
• Applications of the new Delta Time Travel capability


1.WIFI SSID:Spark+AISummit | Password: UnifiedDataAnalytics

2. Applications of Time Travel with Delta Lake
Kyle Weller, Product Manager, Azure Databricks - Microsoft
#UnifiedDataAnalytics #SparkAISummit

3. Common Data Challenges
Gartner estimates that more than 65% of big data projects fail. Why?
Data sources: Customer Data, Click Streams, Unstructured, Sensors (IoT), etc.

4. Interactive Poll

5. Complexities Spark Solves
Solved:
• Complex Data: diverse data formats (json, avro, binary, …); data can be dirty, late, out-of-order
• Complex Workloads: combining streaming with interactive queries; machine learning
• Complex Systems: diverse storage systems (Kafka, Azure Storage, Event Hubs, SQL DW, …); system failures
Other Spark Challenges:
• Concurrency: multiple readers and writers; ensuring atomic transactions, consistency, and isolation
• The Small Files Problem: performance degradation; complex cleanup often incurs downtime
• Updates & Rollbacks: GDPR user delete requests or other upserts; data rollback or snapshots for audits

6. Delta Lake - Reliable Data Lakes at Scale
Delta Table = Parquet + Transaction Log + Indexes/Stats
ACID Transaction Guarantees
• Atomic, Consistent, Isolated, Durable
Versioned parquet files with transaction log
• Snapshot isolation for multiple concurrent read/writes
• Immediate rollback capabilities
Efficient Upserts (Updates+Inserts) with MERGE command
• GDPR DSR requests
• Change Data Capture
Time Travel
(Diagram: a Delta Table made up of Versioned Parquet Files, Indexes & Stats, and the Delta Log)
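The "Parquet + Transaction Log" idea can be illustrated with a toy model in plain Python. This is only a sketch of the concept, not Delta Lake's actual implementation: the `ToyDeltaTable` and `Commit` names are hypothetical, and the model tracks file names only. It assumes each commit records which data files it adds and removes, so the set of live files at any version can be rebuilt by replaying the log up to that version.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Commit:
    """One entry in the toy transaction log (hypothetical structure)."""
    added: frozenset = frozenset()
    removed: frozenset = frozenset()


@dataclass
class ToyDeltaTable:
    """Toy model: an append-only log of commits over immutable data files."""
    log: list = field(default_factory=list)

    def commit(self, added=(), removed=()):
        """Append a commit atomically and return its version number."""
        self.log.append(Commit(frozenset(added), frozenset(removed)))
        return len(self.log) - 1

    def files_at(self, version=None):
        """Replay the log up to `version` (latest if None) to get live files."""
        if version is None:
            version = len(self.log) - 1
        if not 0 <= version < len(self.log):
            raise ValueError(f"unknown version: {version}")
        live = set()
        for entry in self.log[: version + 1]:
            live -= entry.removed
            live |= entry.added
        return live


table = ToyDeltaTable()
table.commit(added=["part-0.parquet"])                              # version 0
table.commit(added=["part-1.parquet"])                              # version 1
table.commit(added=["part-2.parquet"], removed=["part-0.parquet"])  # version 2

print(sorted(table.files_at(0)))  # snapshot as of version 0
print(sorted(table.files_at()))   # latest snapshot
```

Because old files are never edited in place, reading an earlier version only requires replaying a shorter prefix of the log, which is what makes time travel and rollback cheap in this model.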

7. Delta Lake - Easy to Use
(Code screenshot: the format string "parquet" is replaced with "delta", both when writing and when reading.)
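The parquet-to-delta switch on this slide might look like the following in PySpark. This is a minimal sketch: the helper names `save_as` and `load_as` are hypothetical, `df` is assumed to be a Spark DataFrame, and `spark` is assumed to be a SparkSession configured with the Delta Lake package.

```python
def save_as(df, path, fmt="delta"):
    """Write `df` to `path`; changing fmt from "parquet" to "delta" is the only difference."""
    df.write.format(fmt).mode("overwrite").save(path)


def load_as(spark, path, fmt="delta"):
    """Read the table back using the same format name."""
    return spark.read.format(fmt).load(path)
```

For example, `save_as(df, "/tmp/events", fmt="parquet")` writes plain Parquet, while `save_as(df, "/tmp/events")` writes a Delta table (the path here is a placeholder).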

8. Delta Lake - Time Travel
Delta Table = Parquet + Transaction Log + Indexes/Stats
Applications Include:
• Audit Data Changes
• Data reproducibility
• Data pipeline debugging
• Immediate rollback capabilities
(Diagram: a Delta Table made up of Versioned Parquet Files, Indexes & Stats, and the Delta Log)

9. Delta Lake - Time Travel, Audit Applications
Audit Data Changes
• The history of all operations is recorded for auditing
• Audit operation types, userIds, clusterIds, notebookIds, timestamps and versions
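One way to inspect this audit trail is Delta Lake's `DESCRIBE HISTORY` SQL command. The sketch below assumes a SparkSession `spark` with Delta Lake enabled and a path-based table; the helper name `table_history` and the particular column selection are illustrative choices, not taken from the talk.

```python
# Columns picked for illustration; DESCRIBE HISTORY returns more than these.
AUDIT_COLUMNS = ["version", "timestamp", "userId", "operation", "clusterId"]


def table_history(spark, path):
    """Return one row per commit on the Delta table stored at `path`."""
    history = spark.sql(f"DESCRIBE HISTORY delta.`{path}`")
    return history.select(*AUDIT_COLUMNS)
```

Calling `table_history(spark, "/data/events").show()` would then list each recorded operation with its version and timestamp (the path is a placeholder).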

10. Delta Lake - Time Travel, Data Reproducibility
Data reproducibility
• Reproduce query results and reports
• Go back to the exact same data that was used to train an ML model version in the past
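Reading an earlier snapshot can be sketched with the `versionAsOf` and `timestampAsOf` reader options that Delta Lake provides. The helper names below are hypothetical, and `spark` is assumed to be a SparkSession with Delta Lake enabled.

```python
def read_version(spark, path, version):
    """Load the table exactly as it was at commit number `version`."""
    return spark.read.format("delta").option("versionAsOf", version).load(path)


def read_as_of(spark, path, timestamp):
    """Load the table as it was at `timestamp`, e.g. "2019-10-15"."""
    return spark.read.format("delta").option("timestampAsOf", timestamp).load(path)
```

Recording the version number alongside a trained model would let a later run call `read_version` with that number to get the same training data back.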

11. Delta Lake - Time Travel, Rollbacks
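A rollback can be sketched by reading an earlier version and overwriting the current table with it. This is one possible approach under stated assumptions, not necessarily what the demo used: the helper name `rollback_to_version` is hypothetical, `spark` is a SparkSession with Delta Lake enabled, and the schema is assumed not to have changed between versions.

```python
def rollback_to_version(spark, path, version):
    """Overwrite the table at `path` with its own contents as of `version`."""
    old_snapshot = (
        spark.read.format("delta")
        .option("versionAsOf", version)
        .load(path)
    )
    # The overwrite is itself a new commit, so the rollback stays auditable
    # and can be undone by time travelling again.
    old_snapshot.write.format("delta").mode("overwrite").save(path)
```

Newer Delta Lake releases also provide a dedicated `RESTORE` command for the same purpose.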

12.Delta at scale in the cloud

13. Azure Databricks – Introduction
Fast, easy, and collaborative Apache Spark™-based analytics platform
• Increase productivity
• Build on a secure, trusted cloud
• Scale without limits
Built with your needs in mind:
• Role-based access controls
• Effortless autoscaling
• Live collaboration
• Enterprise-grade SLAs
• Best-in-class notebooks
• Simple job scheduling
Seamlessly integrated with the Azure Portfolio

14. Azure Databricks – Delta Lake at Scale on Azure
(Architecture diagram with four stages: Ingest, Store, Process, Serve)
• Sources: Sensors and IoT (unstructured), Logs (unstructured), Media (unstructured), Files (unstructured), Business/custom apps (structured)
• Ingest: Azure Event Hub, Azure IoT Hub, Kafka, Azure Data Factory
• Store: Azure Data Lake Storage (Raw Format, Delta Format)
• Process: Azure Databricks
• Serve: Cosmos DB for Apps; Azure SQL Data Warehouse for Power BI

15. Azure Databricks – Delta Lake at Scale on Azure
Step 1: Load raw data to Azure Data Lake Storage
Step 2: Use Azure Databricks to
1. Combine streaming and batch
2. Save data as Delta format
Step 3: Use Azure Databricks to
1. Join, enrich, clean, transform data
2. Develop, train, and score ML models with Azure ML + MLFlow
Step 4: Load data into serving layers like
1. SQL Data Warehouse for enterprise BI scenarios
2. Cosmos DB for real-time Apps
(Diagram: sources (Sensors and IoT, Logs, Media, Files, Business/custom apps) flow through Azure Event Hub, Azure IoT Hub, Kafka, and Azure Data Factory into Azure Data Lake Storage as Raw Format, then Delta Format (Bronze Table), Delta Format (Silver Table), and Delta Format (Gold Table), and are served through Cosmos DB to Apps and, with Polybase, through Azure SQL Data Warehouse to Power BI.)
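Steps 2 and 3 could be sketched in PySpark roughly as follows. This is a batch-only illustration under several assumptions: the raw data is JSON, the function names and the `device_id` column are hypothetical, the cleaning and aggregation logic are placeholders, and `spark` is a SparkSession with Delta Lake enabled. Streaming ingestion, ML training, and loading into the serving layer are left out.

```python
def raw_to_bronze(spark, raw_path, bronze_path):
    """Step 2: land raw files unchanged in a Delta (Bronze) table."""
    raw = spark.read.json(raw_path)
    raw.write.format("delta").mode("append").save(bronze_path)


def bronze_to_silver(spark, bronze_path, silver_path):
    """Step 3: clean the Bronze data (placeholder rules) into a Silver table."""
    bronze = spark.read.format("delta").load(bronze_path)
    cleaned = bronze.dropDuplicates().na.drop()
    cleaned.write.format("delta").mode("overwrite").save(silver_path)


def silver_to_gold(spark, silver_path, gold_path, key="device_id"):
    """Step 3: aggregate Silver data into a Gold table ready for serving."""
    silver = spark.read.format("delta").load(silver_path)
    summary = silver.groupBy(key).count()
    summary.write.format("delta").mode("overwrite").save(gold_path)
```

Because every stage writes Delta format, each of the Bronze, Silver, and Gold tables keeps its own version history and can be time travelled independently.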

16.Demo

17. Learn More
• https://delta.io
• http://bit.ly/adbrelnote
• https://aka.ms/AzureDatabricksBestPractices
• https://docs.azuredatabricks.net/

18.DON’T FORGET TO RATE AND REVIEW THE SESSIONS SEARCH SPARK + AI SUMMIT