Real-time analytics with Azure Databricks and Azure Event Hubs



12. You should write simple queries, and Azure Databricks should continuously update the answer.
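The idea on this slide can be illustrated with a toy, library-free sketch: the query is declared once, and each new batch of events is folded into the existing state so the answer stays current. This is not Spark or Databricks code; the `RunningCount` class and the event names are hypothetical and exist only for illustration.

```python
from collections import Counter
from typing import Dict, Iterable


class RunningCount:
    """Toy stand-in for a continuously updated
    'SELECT key, COUNT(*) ... GROUP BY key' query (hypothetical helper)."""

    def __init__(self) -> None:
        # State carried across batches: one running count per key.
        self._counts: Counter = Counter()

    def process_batch(self, events: Iterable[str]) -> Dict[str, int]:
        # Fold only the new events into the existing state and return the
        # refreshed answer; earlier events are never re-read.
        self._counts.update(events)
        return dict(self._counts)


query = RunningCount()
first = query.process_batch(["click", "view", "click"])
second = query.process_batch(["view", "purchase"])
print(first)
print(second)
```

The caller never says how to merge old and new results; it only states the query, which is the point the slide makes about streaming.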

13. Agenda
Real-time analytics scenarios
Real-time analytics with Azure Databricks
Demo
Q&A

14. Complexities in stream processing
Complex data: diverse data formats (JSON, Avro, binary, …); data can be dirty, late, or out of order.
Complex systems: diverse storage systems (Kafka, Azure Storage, Event Hubs, SQL DW, …); system failures.
Complex workloads: combining streaming with interactive queries; machine learning.

15. Introduction to Spark
Spark unifies: batch processing, interactive SQL, real-time processing, machine learning, deep learning, and graph processing.
[Diagram: the Spark Core Engine runs on YARN, Mesos, or the Standalone Scheduler and supports four libraries: Spark SQL (interactive queries), MLlib (machine learning), Structured Streaming (stream processing), and GraphX (graph computation).]

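16. Databricks Delta Transaction Flow
[Timeline diagram along a time axis t, marking write start, write complete, read start, and read complete.]
Files are only committed to the table when the transaction is complete.
A read that starts before the write completes sees only committed files.
When the write completes, the new files are committed to the table and new readers see the updated table.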
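The transaction flow above can be sketched as a toy in-memory table. This is only an illustration of the visibility rule described on the slide, not how Databricks Delta is implemented; `ToyTable` and all of its methods are hypothetical names.

```python
from typing import Dict, List, Tuple


class ToyTable:
    """Toy table in which files become visible only when a write commits."""

    def __init__(self) -> None:
        self._committed: List[str] = []          # files visible to readers
        self._pending: Dict[int, List[str]] = {}  # files of open transactions
        self._next_txn = 0

    def begin_write(self) -> int:
        # Open a new write transaction and return its id.
        txn = self._next_txn
        self._next_txn += 1
        self._pending[txn] = []
        return txn

    def write_file(self, txn: int, name: str) -> None:
        # The file exists, but it is not yet part of the table.
        self._pending[txn].append(name)

    def commit(self, txn: int) -> None:
        # Publish all files of the transaction in a single step.
        self._committed = self._committed + self._pending.pop(txn)

    def snapshot(self) -> Tuple[str, ...]:
        # A reader gets an immutable view of the committed files.
        return tuple(self._committed)


table = ToyTable()
txn = table.begin_write()
table.write_file(txn, "part-0")
during_write = table.snapshot()   # read starts before the write completes
table.commit(txn)
after_write = table.snapshot()    # a new reader after the commit
print(during_write, after_write)
```

The reader that started during the write keeps its original view even after the commit, which is the snapshot behavior the slide describes.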

17. Azure Databricks: Related Sessions
Code | Session | Date
THR2182 | An Introduction to big data processing with Azure Databricks | September 24, 5:45 PM – 6:05 PM
BRK3204 | AI with big data: Data science at massive scale with Apache Spark in Azure Databricks | September 25, 2:15 PM – 3:30 PM
BRK3313 | Azure Databricks for data engineers and data developers | September 25, 9:00 AM – 10:15 AM
WRK3004 | AI modeling: Understanding the data science process with Azure Databricks | September 25, 9:00 AM – 10:15 AM
BRK2372 | Learn how Devon Energy leverages Azure Databricks to build predictive machine learning and AI models at scale | September 26, 4:00 PM – 5:15 PM
BRK3203 | Real-time analytics with Azure Databricks and Azure Event Hubs | September 27, 12:30 PM – 1:45 PM
BRK3205 | AI for pros: Accelerating deep learning on Spark with Azure Databricks ML Runtime and GPU based clusters | September 27, 12:30 PM – 1:45 PM
WRK3005 | Building a Data Engineering Pipeline with Azure Databricks and Azure SQL DW | September 27, 12:30 PM – 1:45 PM
BRK4024 | Azure Databricks: Deep dive into deployment, networking, and security | September 27, 11:30 AM – 12:15 PM
WRK3004R | AI modeling: Understanding the data science process with Azure Databricks | September 27, 9:00 AM – 10:15 AM

18. You should not have to reason about streaming.

19. 3. Simplified Architecture
[Diagram: lots of new data (user behavior data, click streams, sensor data (IoT), video/speech, usage/billing data, machine telemetry, commerce data, …) flows through Stage I, Stage II, Stage III, and Stage IV.]
Without Delta: concurrent access suffers from inconsistent query results. With Delta: snapshot isolation simplifies consistent concurrent data reads and writes.
Without Delta: failing streaming jobs can require resetting and restarting data processing. With Delta: exactly-once semantics allow perfect resumability without reset and restart.

20. Multi-Hop Data Pipelines
Stage I: Raw events from many different parts of the organization.
Stage II: Normalized and enriched with dimension information.
Stage III: Filtered down and aggregated for a particular business objective.
Stage IV: High-level summaries of key business metrics.
[Diagram: lots of new data (user behavior data, click streams, sensor data (IoT), video/speech, usage/billing data, machine telemetry, commerce data, …) flows through Stage I, Stage II, Stage III, and Stage IV.]
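The four stages above can be sketched as plain functions chained together. This is a minimal illustration with made-up data, not code from the session; the event fields, the `user_region` lookup, and every function name are assumptions chosen for the example.

```python
from typing import Dict, List

# Stage I: raw events as they arrive (hypothetical sample data).
raw_events = [
    {"user": "u1", "action": "click", "amount": "0"},
    {"user": "u2", "action": "purchase", "amount": "30"},
    {"user": "u1", "action": "purchase", "amount": "12"},
    {"user": "u3", "action": "purchase", "amount": "bad"},  # dirty record
]

# Dimension information used for enrichment in Stage II (hypothetical).
user_region = {"u1": "EU", "u2": "US"}


def normalize_and_enrich(events: List[dict], regions: Dict[str, str]) -> List[dict]:
    """Stage II: parse amounts, drop unparseable rows, attach the region."""
    rows = []
    for event in events:
        try:
            amount = float(event["amount"])
        except ValueError:
            continue  # skip dirty records
        rows.append({
            "user": event["user"],
            "action": event["action"],
            "amount": amount,
            "region": regions.get(event["user"], "unknown"),
        })
    return rows


def purchases_by_region(rows: List[dict]) -> Dict[str, float]:
    """Stage III: keep purchases only and aggregate revenue per region."""
    totals: Dict[str, float] = {}
    for row in rows:
        if row["action"] == "purchase":
            totals[row["region"]] = totals.get(row["region"], 0.0) + row["amount"]
    return totals


def summarize(totals: Dict[str, float]) -> dict:
    """Stage IV: high-level business metrics."""
    return {
        "total_revenue": sum(totals.values()),
        "top_region": max(totals, key=totals.get),
    }


stage2 = normalize_and_enrich(raw_events, user_region)
stage3 = purchases_by_region(stage2)
stage4 = summarize(stage3)
print(stage4)
```

Each hop reads only the output of the previous one, so a later stage can be rebuilt or changed without touching the raw data.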


22. Structured Streaming with Azure Databricks: Overview
Built-in support for Azure Event Hubs, HDInsight Kafka, Databricks Delta, Azure Blob Storage, ADLS, and Azure SQL DW.
Flexible API to create arbitrary streaming sinks, such as an RDBMS or a NoSQL store.
Automatic restart of streaming queries, and recovery using checkpointing.
Optimized performance for stateful streaming with RocksDB.
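Recovery using checkpointing can be illustrated with a toy, library-free sketch: the query persists its offset and running state after every record, so a restarted query resumes where the failed one stopped instead of starting over. This is not the Structured Streaming checkpoint format; `run_query`, the JSON file layout, and the `fail_at` parameter are hypothetical and exist only to simulate a failure.

```python
import json
import os
import tempfile
from typing import List


def run_query(source: List[int], checkpoint_path: str, fail_at: int = -1) -> int:
    """Sum the records in `source`, checkpointing offset and total as it goes."""
    state = {"offset": 0, "total": 0}
    if os.path.exists(checkpoint_path):
        # Restart path: resume from the last persisted offset and state.
        with open(checkpoint_path) as f:
            state = json.load(f)
    for offset in range(state["offset"], len(source)):
        if offset == fail_at:
            raise RuntimeError("simulated failure")
        state = {"offset": offset + 1, "total": state["total"] + source[offset]}
        # Write to a temporary file and rename it, so the checkpoint is
        # replaced in one step and never left half-written.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, checkpoint_path)
    return state["total"]


records = [5, 7, 11, 13]
with tempfile.TemporaryDirectory() as workdir:
    checkpoint = os.path.join(workdir, "checkpoint.json")
    try:
        run_query(records, checkpoint, fail_at=2)  # fails after two records
    except RuntimeError:
        pass
    recovered_total = run_query(records, checkpoint)  # automatic restart
print(recovered_total)
```

Because the offset and the state are saved together, the restarted run neither skips nor double-counts a record in this sketch.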