Building Data Intensive Analytic Application on Top of Delta Lake

Why build your own analytics application on top of Delta Lake:
– Every enterprise is building a data lake. However, these data lakes are plagued by low user adoption and poor data quality, which result in lower ROI.
– BI tools may not be enough for your use case, especially when you want to build a data-driven analytical web application such as Paysa.
– Delta’s ACID guarantees allow you to build a real-time reporting app that displays consistent and reliable data.

In this talk we will learn:

How to build your own analytics app on top of Delta Lake.
How Delta Lake helps you build a pristine data lake, with several ways to expose data to end users.
How an analytics web application can be backed by a custom query layer that executes Spark SQL on a remote Databricks cluster.
We’ll explore various options for building an analytics application using different backend technologies.
Various architecture patterns, components, and frameworks that can be used to build a custom analytics platform in no time.
How to leverage machine learning to build advanced analytics applications.

Demo: an analytics application built on Play Framework (back end), React (front end), and Structured Streaming for ingesting data from a Delta table, with:
Live query analytics on real-time data
ML predictions based on analytics data



2. Ganesh Chand, Databricks; Ravi Gawai, Databricks

3. Agenda
• Delta Lake - What and Why?
• Common Delta Lake use cases
• Data as a Service (DaaS)
• Our Approach
• Use Cases
• Demo
• Q&A

4. What’s a Data Lake?
A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale.
“If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.” - James Dixon

5. Why Data Lake? Challenges with Data Warehouse
• Big Data problem
• Expensive (build, store and process)
• Proprietary technology (processing and storage)
• Vendor lock-in
• Lack of ML capabilities
[Diagram labels: LAKES, STREAMS, WAREHOUSES, NOSQL, CSV, JSON, TXT…]

6. Data Lake: Aspiration
Use AI and Machine Learning to outperform your competition, retain your customers, and boost your productivity with lower TCO, using a variety of data sources.
Real-time Streaming, Data Science and ML
• Recommendation Engines
• Risk, Fraud, & Intrusion Detection
• Customer Analytics
• IoT & Predictive Maintenance
• Genomics & DNA Sequencing

7. Data Lake: Reality
The majority of these projects are failing!
Real-time Streaming, Data Science and ML
• Recommendation Engines
• Risk, Fraud, & Intrusion Detection
• Customer Analytics
• IoT & Predictive Maintenance
• Genomics & DNA Sequencing
Unreliable, low quality data
Slow performance

8. Why?
Strengths of Data Lake
• Open Source, Open Standards
• Powered By Apache Spark
• Scale
• Unified platform for data & AI
Strengths of Data Warehouse
• Full ACID Transaction
• Insert, Delete, Update w/ SCD-II
• Indexing for faster query response
• Schema-On-Write
And
● Unification of Batch & Streaming workloads
● Incrementally improve the quality of your data until it is ready for consumption (Multi-hop pipelines)
● Dramatically reduces legacy Spark/Hive operational burdens
● Scalable Metadata Handling

9. What’s a Delta Lake
A Data Lake Powered By Delta
Sources: LAKES, STREAMS, WAREHOUSES, NOSQL, CSV, JSON, TXT…
Delta Lake tables:
• Bronze: Raw Ingestion
• Silver: Filtered, Cleaned, Augmented
• Gold: Business-level Aggregates
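The Bronze, Silver, Gold progression on this slide can be illustrated without a cluster. The sketch below is plain Python over in-memory records, not Delta Lake or Spark code; the record fields (`user`, `amount`) and the function names `to_silver` and `to_gold` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical raw events as they might land in a Bronze table (no cleaning yet).
bronze = [
    {"user": "a", "amount": "10.5"},
    {"user": "b", "amount": "oops"},  # malformed amount
    {"user": "a", "amount": "4.5"},
    {"user": None, "amount": "3.0"},  # missing user
]


def to_silver(rows):
    """Filter and clean: drop rows with a missing user or a non-numeric amount."""
    cleaned = []
    for row in rows:
        if row["user"] is None:
            continue
        try:
            amount = float(row["amount"])
        except ValueError:
            continue
        cleaned.append({"user": row["user"], "amount": amount})
    return cleaned


def to_gold(rows):
    """Business-level aggregate: total amount per user."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["user"]] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```

Each hop reads only the output of the previous one, which is the property that lets data quality improve incrementally until it is ready for consumption.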

10. Common Delta Lake Use Cases
• Interactive Queries
• BI reporting and dashboards
• Train and Build Machine Learning Models
• Create Data Warehouse
• Create / Monetize Data Products
• Sell or Share curated data to partners, vendors and internal customers
• Feed data back to source systems, web applications, Mobile Apps


12. Serving Data From Delta Lake
[Diagram labels: Data product, Web app, Data enrichment, Mobile app, Data Integration, ERP, Data export, Storage]

13. Serving Data From Delta Lake

14. Serving Data From Delta Lake
[Diagram layers: Storage, Compute, Serving, Consumers. Other labels: API, Metadata Service, S3, ADLS, HDFS, Data Service, Access Catalog Management]

15. Serving Data From Delta Lake
Data-as-a-Service (DaaS) Challenges
• REST APIs
• Security
• Read-Only
• Latency
• Data Format
• Throughput
• Delivery mechanism
• SLA
• Data licensing, ownership and monetization model
• Managing evolving requirements
• Minimizing Information Silos

16. Use Cases for Demo App
• UI to interact with Delta Lake
• Export classified and aggregated data out of Delta Lake to be consumed by a client app
• MVP features for the demo app
  • End-to-end ETL pipeline writing into Delta Lake
  • DaaS REST endpoint to export data
  • Front-end app to consume data and build a dashboard

17. Our implementation
Storage: S3
Compute: Databricks Jobs
Serving: REST API
Consumers
Routes: /listSchemas, /listTables, /exportData
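One way to picture the serving layer is a small dispatch table that maps the three routes named on this slide to handlers. The sketch below is an assumption-laden illustration in plain Python, not the Play Framework code used in the demo: the `CATALOG` contents, the handler names, the `schema` parameter, and the shape of the returned values are all hypothetical.

```python
# Hypothetical in-memory catalog standing in for the metadata behind Delta Lake.
CATALOG = {"sales": ["orders", "customers"], "web": ["clicks"]}


def list_schemas(params):
    """Return all schema names known to the (fake) catalog."""
    return sorted(CATALOG)


def list_tables(params):
    """Return the tables of the schema named in params, or an empty list."""
    return CATALOG.get(params.get("schema", ""), [])


def export_data(params):
    """Stub: a real service would submit a job to the cluster here.

    This placeholder only returns a made-up run id and status.
    """
    return {"run_id": 1, "status": "SUBMITTED"}


# Route names are taken from the slide; the handlers are illustrative.
ROUTES = {
    "/listSchemas": list_schemas,
    "/listTables": list_tables,
    "/exportData": export_data,
}


def handle(path, params=None):
    """Dispatch a request path to its handler and return (status_code, body)."""
    handler = ROUTES.get(path)
    if handler is None:
        return 404, {"error": "unknown route"}
    return 200, handler(params or {})


print(handle("/listSchemas"))
print(handle("/listTables", {"schema": "sales"}))
```

Keeping the route table separate from the handlers makes it straightforward to add endpoints as requirements evolve.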

18. DaaS APIs
GET delta-meta-service/getDbDetails
GET delta-meta-service/previewTable?table=db.tablename
POST delta-sql-service/exportSqlData -d { "inputSql": "select * from db.table where condition", "outputPath": "/path/", "format": "json" }
GET delta-sql-service/getRunStatus?run_id=id
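A client of the export endpoint would typically submit the SQL, then poll the run status until the export finishes. The following is a minimal sketch using only the Python standard library. The endpoint paths and JSON field names come from the slide; everything else is an assumption, including the `BASE_URL` host, the helper names, and the `RUNNING` status value, since the slides do not show the response format. The sketch builds the request but does not send it.

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:9000"  # hypothetical host; the slides do not give one


def build_export_request(input_sql, output_path, fmt="json"):
    """Build (but do not send) the POST request for exportSqlData."""
    body = json.dumps(
        {"inputSql": input_sql, "outputPath": output_path, "format": fmt}
    ).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + "/delta-sql-service/exportSqlData",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_status_url(run_id):
    """URL of the getRunStatus endpoint for a given run id."""
    return f"{BASE_URL}/delta-sql-service/getRunStatus?run_id={run_id}"


def wait_for_run(fetch_status, run_id, poll_seconds=0.0, max_polls=10):
    """Poll until the run leaves the assumed RUNNING state or max_polls is hit.

    fetch_status is any callable taking a run id and returning a status string;
    injecting it keeps this sketch independent of the real response format.
    """
    status = "RUNNING"
    for _ in range(max_polls):
        status = fetch_status(run_id)
        if status != "RUNNING":
            break
        time.sleep(poll_seconds)
    return status


request = build_export_request("select * from db.table", "/path/")
print(request.get_method(), request.full_url)
```

In a real client, `fetch_status` would call `urllib.request.urlopen(run_status_url(run_id))` and extract the status from whatever the service actually returns.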

19. Demo

20. Delta ETL pipeline

21. Front-End

22. Front-End

23. Thank You
ganesh@databricks.com
ravi@databricks.com
