1. Writing Continuous Applications with Structured Streaming in PySpark — Jules S. Damji, Spark + AI Summit, SF, April 24, 2019
2.I have used Apache Spark 2.x Before…
3.Apache Spark Community & Developer Advocate @ Databricks Developer Advocate @ Hortonworks Software engineering @ Sun Microsystems, Netscape, @Home, VeriSign, Scalix, Centrify, LoudCloud/Opsware, ProQuest Program Chair Spark + AI Summit https://www.linkedin.com/in/dmatrix @2twitme
4. VISION: Accelerate innovation by unifying data science, engineering and business. SOLUTION: Unified Analytics Platform. WHO WE ARE: • Original creators of Apache Spark • 2000+ global companies use our platform across the big data & machine learning lifecycle
5. Agenda for Today's Talk • Why Apache Spark • Why Streaming Applications are Difficult • What's Structured Streaming • Anatomy of a Continuous Application • Tutorials • Q&A
6. How to think about data in 2019–2020: "Data is the new currency"
7.Why Apache Spark?
8. What is Apache Spark? • General cluster computing engine that extends MapReduce (Streaming, SQL, ML, Graph, DL) • Rich set of APIs and libraries • Unified engine • Large community: 1000+ orgs, clusters up to 8000 nodes • Supports DL frameworks. (Apache Spark, Spark and Apache are trademarks of the Apache Software Foundation)
9.Unique Thing about Spark • Unification: same engine and same API for diverse use cases • Streaming, batch, or interactive • ETL, SQL, machine learning, or graph • Deep Learning Frameworks w/Horovod – TensorFlow – Keras – PyTorch
10. Faster, Easier to Use, Unified: first distributed processing engine → specialized data processing engines → unified data processing engine
11.Benefits of Unification 1. Simpler to use and operate 2. Code reuse: e.g. only write monitoring, FT, etc once 3. New apps that span processing types: e.g. interactive queries on a stream, online machine learning
12. An Analogy: specialized devices vs. a unified device → new applications
13. Why Are Streaming Applications Inherently Difficult?
14. building robust stream processing apps is hard
15. Complexities in stream processing:
COMPLEX DATA — diverse data formats (json, avro, txt, csv, binary, …); data can be dirty and tardy (out-of-order)
COMPLEX WORKLOADS — combining streaming with interactive queries; machine learning
COMPLEX SYSTEMS — diverse storage systems (Kafka, S3, Kinesis, RDBMS, …); system failures
16. Structured Streaming stream processing on Spark SQL engine fast, scalable, fault-tolerant rich, unified, high level APIs deal with complex data and complex workloads rich ecosystem of data sources integrate with many storage systems
17. you should not have to reason about streaming
18. Treat Streams as Unbounded Tables: data stream = unbounded input table; new data in the data stream = new rows appended to an unbounded table
19. you should write queries & Apache Spark should continuously update the answer
20. Apache Spark automatically streamifies! Spark SQL converts a batch-like query into a series of incremental execution plans (logical plan → incremental execution plans → optimized physical plan with codegen, off-heap memory, etc.) that run on each new batch of data (t=1, t=2, t=3, …):

input = (spark.readStream        # read from Kafka source
    .format("kafka")
    .option("subscribe", "topic")
    .load())

result = (input                  # project device, signal; filter signal > 15
    .select("device", "signal")
    .where("signal > 15"))

(result.writeStream              # write to Parquet sink
    .format("parquet")
    .start("dest-path"))
21. Structured Streaming – Processing Modes
22. Structured Streaming – Processing Modes
23. Anatomy of a Continuous Application
24. Anatomy of a Streaming Query: streaming word count, simple streaming ETL
25. Anatomy of a Streaming Query: Step 1 – Source
• Specify one or more locations to read data from
• Built-in support for Files/Kafka/Socket; pluggable

spark.readStream
    .format("kafka")
    .option("subscribe", "input")
    .load()
26. Anatomy of a Streaming Query: Step 2 – Transformation
• Using DataFrames, Datasets and/or SQL
• Internal processing is always exactly-once

spark.readStream
    .format("kafka")
    .option("subscribe", "input")
    .load()
    .groupBy(col("value").cast("string").alias("key"))
    .agg(count("*").alias("value"))
27. Anatomy of a Streaming Query: Step 3 – Sink
• Accepts the output of each batch
• When supported, sinks are transactional and exactly-once (e.g. Files)

spark.readStream
    .format("kafka")
    .option("subscribe", "input")
    .load()
    .groupBy(col("value").cast("string").alias("key"))
    .agg(count("*").alias("value"))
    .writeStream
    .format("kafka")
    .option("topic", "output")
    .trigger(processingTime="1 minute")
    .outputMode("complete")
    .option("checkpointLocation", "…")
    .start()
28. Anatomy of a Streaming Query: Output Modes and Triggers
Output mode – what is output:
• Complete – output the whole answer every time
• Update – output only the rows that changed
• Append – output new rows only
Trigger – when to output:
• Specified as a time interval (eventually also by data size)
• No trigger means output as fast as possible

spark.readStream
    .format("kafka")
    .option("subscribe", "input")
    .load()
    .groupBy(col("value").cast("string").alias("key"))
    .agg(count("*").alias("value"))
    .writeStream
    .format("kafka")
    .option("topic", "output")
    .trigger(processingTime="1 minute")
    .outputMode("update")
    .option("checkpointLocation", "…")
    .start()
29.Streaming Query: Output Modes Output mode – What's output • Complete – Output the whole answer every time • Update – Output changed rows • Append – Output new rows only