Building Robust Production Data Pipelines with Databricks Delta

Most data practitioners grapple with data quality issues and data pipeline complexity; it is the bane of their existence. Data engineers, in particular, strive to design and deploy robust data pipelines that serve reliable data in a performant manner so that their organizations can make the most of their valuable corporate data assets.

Databricks Delta, part of Databricks Runtime, is a next-generation unified analytics engine built on top of Apache Spark. Built on open standards, Delta employs co-designed compute and storage and is compatible with Spark APIs. It powers high data reliability and query performance to support big data use cases, from batch and streaming ingest and fast interactive queries to machine learning.

In this tutorial we will discuss the requirements of modern data pipelines, the challenges data engineers face when it comes to data reliability and performance, and how Delta can help. Through presentation, code examples, and notebooks, we will explain these pipeline challenges and how Delta addresses them. You will walk away with an understanding of how you can apply this innovation to your data architecture and the benefits you can gain. This tutorial will be an instructor-led, hands-on interactive session. Instructions on how to get the tutorial materials will be covered in class.

WHAT YOU’LL LEARN:
– Understand the key data reliability and performance challenges in data pipelines
– How Databricks Delta helps build robust pipelines at scale
– Understand how Delta fits within an Apache Spark™ environment
– How to use Delta to realize data reliability improvements
– How to deliver performance gains using Delta

PREREQUISITES:
– A fully charged laptop (8-16 GB memory) with Chrome or Firefox
– Pre-register for Databricks Community Edition

1. Building Robust Data Pipelines with Databricks Delta

2. Requirements
• Sign in to Databricks Community Edition
• Create a cluster (DBR 5.3)
• Import the tutorial notebook

3. Enterprises have been spending millions of dollars getting data into data lakes with Apache Spark.

4. The aspiration is to do data science and ML on all that data using Apache Spark!
Data Science & ML use cases:
• Recommendation Engines
• Risk & Fraud Detection
• IoT & Predictive Maintenance
• Genomics & DNA Sequencing

5. But the data is not ready for data science & ML. The majority of these projects are failing due to unreliable data!

6.Why are these projects struggling with reliability?

7. Data reliability challenges with data lakes
✗ Failed production jobs leave data in a corrupt state, requiring tedious recovery
✗ Lack of schema enforcement creates inconsistent and low-quality data
✗ Lack of consistency makes it almost impossible to mix appends and reads, or batch and streaming
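To make the first challenge concrete, here is a minimal Python sketch (a toy illustration, not Spark or Delta code; the file layout and function names are invented) of what happens when a job writes several output files with no transaction around them: a mid-job failure leaves partial data that readers pick up silently.

```python
import json, os, tempfile

def naive_write(table_dir, parts, fail_after=None):
    # Write each batch of rows as its own part file, with no
    # transaction: every file is visible the moment it is written.
    for i, rows in enumerate(parts):
        if fail_after is not None and i >= fail_after:
            raise RuntimeError("job died mid-write")
        with open(os.path.join(table_dir, f"part-{i}.json"), "w") as f:
            json.dump(rows, f)

def read_table(table_dir):
    # Readers simply pick up whatever files exist.
    rows = []
    for name in sorted(os.listdir(table_dir)):
        with open(os.path.join(table_dir, name)) as f:
            rows.extend(json.load(f))
    return rows

table = tempfile.mkdtemp()
try:
    # Simulate a crash after 2 of 3 part files were written.
    naive_write(table, [[1, 2], [3, 4], [5, 6]], fail_after=2)
except RuntimeError:
    pass
print(read_table(table))  # [1, 2, 3, 4] -- partial, corrupt result, silently visible
```

Cleaning this up after a real failure means hand-identifying and deleting the orphaned part files, which is exactly the tedious recovery the slide refers to.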

8. A New Standard for Building Data Lakes
• Open format, based on Parquet
• With transactions
• Apache Spark APIs
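The "with transactions" point is the heart of the design: data files become visible only once a commit entry lands in a transaction log (in real Delta Lake, JSON entries in a `_delta_log` directory alongside the Parquet files). The sketch below is a heavily simplified toy version of that idea; the class and file names are invented for illustration, and JSON stands in for Parquet. Replaying the log only up to a given version is also what enables time travel.

```python
import json, os, tempfile

class ToyDeltaTable:
    """Toy sketch of a log-structured table; not Delta's actual format."""

    def __init__(self, path):
        self.path = path
        self.log_dir = os.path.join(path, "_log")
        os.makedirs(self.log_dir, exist_ok=True)

    def _versions(self):
        return sorted(int(f.split(".")[0]) for f in os.listdir(self.log_dir))

    def commit(self, rows):
        versions = self._versions()
        version = versions[-1] + 1 if versions else 0
        # 1. Write the data file first; it is invisible until committed.
        data_file = f"data-{version}.json"
        with open(os.path.join(self.path, data_file), "w") as f:
            json.dump(rows, f)
        # 2. Publish it by writing a single log entry for this version.
        with open(os.path.join(self.log_dir, f"{version}.json"), "w") as f:
            json.dump({"add": data_file}, f)

    def read(self, as_of=None):
        # Readers trust only the log, replaying it up to `as_of`.
        rows = []
        for v in self._versions():
            if as_of is not None and v > as_of:
                break
            with open(os.path.join(self.log_dir, f"{v}.json")) as f:
                entry = json.load(f)
            with open(os.path.join(self.path, entry["add"])) as f:
                rows.extend(json.load(f))
        return rows

t = ToyDeltaTable(tempfile.mkdtemp())
t.commit([1, 2])
t.commit([3, 4])
print(t.read())         # [1, 2, 3, 4]
print(t.read(as_of=0))  # [1, 2] -- time travel to version 0
```

Because a commit is a single log write, a job that crashes before that write leaves only an invisible orphan data file, never a half-committed table.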

9. Delta Lake makes data ready for analytics, data science & ML, with reliability and performance
• Recommendation Engines
• Risk & Fraud Detection
• IoT & Predictive Maintenance
• Genomics & DNA Sequencing

10. Delta Lake ensures data reliability
Batch and streaming inputs, including updates/deletes, land as Parquet files plus a transaction log, yielding high-quality, reliable data that is always ready for analytics.
Key Features:
● ACID Transactions
● Schema Enforcement
● Unified Batch & Streaming
● Time Travel / Data Snapshots
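Schema enforcement, one of the key features above, can be sketched as a pre-write check: a batch whose columns or types do not match the table's declared schema is rejected before a single row lands. The toy Python class below is invented for illustration; it is not the Delta API, just the behavior.

```python
class SchemaEnforcedTable:
    """Toy sketch: validate every incoming row against a declared schema."""

    def __init__(self, schema):
        self.schema = schema  # e.g. {"id": int, "amount": float}
        self.rows = []

    def append(self, rows):
        # Validate the whole batch first, so a bad batch changes nothing.
        for row in rows:
            if set(row) != set(self.schema):
                raise ValueError(f"schema mismatch: columns {sorted(row)}")
            for col, typ in self.schema.items():
                if not isinstance(row[col], typ):
                    raise ValueError(f"bad type for column {col!r}")
        self.rows.extend(rows)  # only reached if every row passed

t = SchemaEnforcedTable({"id": int, "amount": float})
t.append([{"id": 1, "amount": 9.99}])
try:
    t.append([{"id": 2, "cost": 5.0}])  # wrong column name
except ValueError as e:
    print("rejected:", e)
print(len(t.rows))  # 1 -- the bad batch left the table untouched
```

This is the inverse of the data-lake failure mode on slide 7: instead of low-quality rows accumulating silently, the mismatch fails loudly at write time.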

11. References
• Docs
• Home page

12.Let’s begin!