Writing Continuous Applications with Structured Streaming PySpark API

“We’re amid the Big Data Zeitgeist era, in which data comes at us fast, in myriad forms and formats, at intermittent intervals or in a continuous stream, and we need to respond to streaming data immediately. This need has created the notion of writing a streaming application that is continuous and that reacts and interacts with data in real time. We call this a continuous application. In this tutorial we’ll explore the concepts and motivations behind continuous applications, how the Structured Streaming Python APIs in Apache Spark™ enable writing them, the programming model behind Structured Streaming, and the APIs that support it. Through presentation, code examples, and notebooks, I will demonstrate how to write an end-to-end Structured Streaming application that reacts and interacts with both real-time and historical data to perform advanced analytics using the Spark SQL, DataFrames, and Datasets APIs. You’ll walk away with an understanding of what a continuous application is, an appreciation of the easy-to-use Structured Streaming APIs, and why Structured Streaming in Apache Spark is a step forward in developing new kinds of streaming applications. This tutorial will be both an instructor-led and a hands-on interactive session. Instructions on how to get the tutorial materials will be covered in class.

WHAT YOU’LL LEARN:
– Understand the concepts and motivations behind Structured Streaming
– How to use the DataFrame APIs
– How to use Spark SQL and create tables on streaming data
– How to write a simple end-to-end continuous application

PREREQUISITES:
– A fully charged laptop (8-16 GB memory) with Chrome or Firefox
– Pre-register for Databricks Community Edition”

1. Writing Continuous Applications with Structured Streaming in PySpark. Jules S. Damji, Spark + AI Summit, SF, April 24, 2019

2. I have used Apache Spark 2.x before…

3. Apache Spark Community & Developer Advocate @ Databricks • Developer Advocate @ Hortonworks • Software engineering @ Sun Microsystems, Netscape, @Home, VeriSign, Scalix, Centrify, LoudCloud/Opsware, ProQuest • Program Chair, Spark + AI Summit • https://www.linkedin.com/in/dmatrix • @2twitme

4. VISION: Accelerate innovation by unifying data science, engineering and business. SOLUTION: Unified Analytics Platform. WHO WE ARE: • Original creators of Apache Spark • 2000+ global companies use our platform across the big data & machine learning lifecycle

5. Agenda for Today’s Talk • Why Apache Spark • Why Streaming Applications Are Difficult • What’s Structured Streaming • Anatomy of a Continuous Application • Tutorials • Q&A

6. How to think about data in 2019-2020: “Data is the new currency”

7. Why Apache Spark?

8. What is Apache Spark? • General cluster-computing engine that extends MapReduce, with libraries for Streaming, SQL, ML, Graph, and DL • Rich set of APIs and libraries • Unified engine • Large community: 1000+ orgs, clusters up to 8000 nodes • Supports DL frameworks. (Apache Spark, Spark and Apache are trademarks of the Apache Software Foundation.)

9. Unique Thing about Spark • Unification: same engine and same API for diverse use cases • Streaming, batch, or interactive • ETL, SQL, machine learning, or graph • Deep learning frameworks w/ Horovod: TensorFlow, Keras, PyTorch
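To make the unification point concrete, here is a minimal sketch (not from the slides; the path, schema, and column names are invented for illustration) showing the same DataFrame transformation applied to both a batch read and a streaming read:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("unified-api-sketch").getOrCreate()

    # Batch read of a (hypothetical) directory of JSON event files
    batch_df = spark.read.schema("device STRING, signal DOUBLE").json("/tmp/events")

    # Streaming read of the same directory, treated as an unbounded stream of new files
    stream_df = spark.readStream.schema("device STRING, signal DOUBLE").json("/tmp/events")

    # The identical transformation works on both DataFrames
    def strong_signals(df):
        return df.select("device", "signal").where(col("signal") > 15)

    strong_signals(batch_df).show()          # batch: prints a finite result and returns

    query = strong_signals(stream_df).writeStream \
        .format("console") \
        .outputMode("append") \
        .start()                             # streaming: keeps updating as files arrive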

10. Faster, Easier to Use, Unified: from the first distributed data processing engine, to specialized data processing engines, to a unified data processing engine.

11. Benefits of Unification 1. Simpler to use and operate 2. Code reuse: e.g., write monitoring, fault tolerance, etc. only once 3. New apps that span processing types: e.g., interactive queries on a stream, online machine learning
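As a small illustration of point 3 (a sketch under the assumption that the built-in rate source and the in-memory sink are acceptable stand-ins; the table name events_table is made up), a running stream can be exposed as a table and queried with ordinary, interactive Spark SQL:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("interactive-on-stream").getOrCreate()

    # Built-in "rate" source: generates (timestamp, value) rows, handy for demos
    events = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

    # Materialize the running result as an in-memory table named "events_table"
    query = events.writeStream \
        .format("memory") \
        .queryName("events_table") \
        .outputMode("append") \
        .start()

    # Interactive Spark SQL against the continuously growing table
    spark.sql("SELECT count(*) AS n FROM events_table").show()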

12. An Analogy: specialized devices vs. a unified device that enables new applications.

13. Why Are Streaming Applications Inherently Difficult?

14. building robust stream processing apps is hard

15. Complexities in stream processing • COMPLEX DATA: diverse data formats (json, avro, txt, csv, binary, …); data can be dirty and tardy (out-of-order) • COMPLEX WORKLOADS: combining streaming with interactive queries; machine learning • COMPLEX SYSTEMS: diverse storage systems (Kafka, S3, Kinesis, RDBMS, …); system failures

16. Structured Streaming: stream processing on the Spark SQL engine • fast, scalable, fault-tolerant • rich, unified, high-level APIs to deal with complex data and complex workloads • rich ecosystem of data sources that integrate with many storage systems

17. you should not have to reason about streaming

18. Treat Streams as Unbounded Tables: new data in the data stream = new rows appended to an unbounded input table.

19. you should write queries & Apache Spark should continuously update the answer

20. Apache Spark automatically streamifies! The batch-like query

    input = spark.readStream \
        .format("kafka") \
        .option("subscribe", "topic") \
        .load()                          # read from the Kafka source

    result = input \
        .select("device", "signal") \
        .where("signal > 15")            # project device, signal; filter signal > 15

    result.writeStream \
        .format("parquet") \
        .start("dest-path")              # write to the Parquet sink

written against DataFrames, Datasets, or SQL is turned from a logical plan into a series of incremental execution plans and an optimized physical plan (optimized operators, code generation, off-heap memory, etc.). As new data arrives at t=1, t=2, t=3, …, Spark SQL converts the batch-like query into a series of incremental execution plans operating on each new batch of data.
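For readers who want to try the “streamify” idea without a Kafka cluster, here is a hedged, self-contained variant of the slide’s query: it substitutes the built-in rate source for Kafka and the console sink for Parquet, and the device and signal columns are synthesized purely for the example.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("streamify-sketch").getOrCreate()

    # Stand-in for the Kafka source: the built-in "rate" source needs no broker
    input_df = spark.readStream \
        .format("rate") \
        .option("rowsPerSecond", 10) \
        .load() \
        .withColumnRenamed("value", "signal") \
        .withColumn("device", col("timestamp").cast("string"))

    # Same shape of query as on the slide: project two columns, filter on signal
    result = input_df.select("device", "signal").where(col("signal") > 15)

    # Console sink instead of Parquet so each incremental batch is visible;
    # a real file sink would also need option("checkpointLocation", ...)
    query = result.writeStream.format("console").outputMode("append").start()
    query.awaitTermination(30)   # let it run for ~30 seconds, then return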

21. Structured Streaming – Processing Modes

22. Structured Streaming Processing Modes

23. Anatomy of a Continuous Application

24. Anatomy of a Streaming Query: a streaming word count and a simple streaming ETL job.
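Before dissecting the steps, here is what the streaming word count looks like end to end. This is the standard socket-based example from the Structured Streaming guide; the host localhost and port 9999 are assumptions, and you would feed it lines with something like `nc -lk 9999`.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode, split

    spark = SparkSession.builder.appName("streaming-word-count").getOrCreate()

    # Lines arriving on a local socket; feed it with e.g. `nc -lk 9999`
    lines = spark.readStream \
        .format("socket") \
        .option("host", "localhost") \
        .option("port", 9999) \
        .load()

    # Split each line into words and keep a running count per word
    words = lines.select(explode(split(lines.value, " ")).alias("word"))
    word_counts = words.groupBy("word").count()

    # Complete mode: print the whole updated answer after every micro-batch
    query = word_counts.writeStream.outputMode("complete").format("console").start()
    query.awaitTermination()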

25. Anatomy of a Streaming Query, Step 1: Source • Specify one or more locations to read data from • Built-in support for Files/Kafka/Socket, pluggable

    spark.readStream \
        .format("kafka") \
        .option("subscribe", "input") \
        .load()
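For comparison, a sketch of two other built-in sources; the directory path, schema, and port below are placeholders rather than values from the talk:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # File source: streams new files landing in a directory; streaming file sources
    # require an explicit schema (directory path and schema are hypothetical)
    csv_stream = spark.readStream \
        .schema("device STRING, signal DOUBLE, event_time TIMESTAMP") \
        .option("maxFilesPerTrigger", 1) \
        .csv("/data/incoming/")

    # Socket source: reads UTF-8 text lines from a host:port, mainly for testing
    socket_stream = spark.readStream \
        .format("socket") \
        .option("host", "localhost") \
        .option("port", 9999) \
        .load()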

26. Anatomy of a Streaming Query, Step 2: Transformation • Using DataFrames, Datasets and/or SQL • Internal processing is always exactly-once

    spark.readStream \
        .format("kafka") \
        .option("subscribe", "input") \
        .load() \
        .groupBy(col("value").cast("string").alias("key")) \
        .agg(count("*").alias("value"))
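A self-contained PySpark version of this step might look as follows (a sketch: it assumes a Kafka broker at localhost:9092 and the spark-sql-kafka connector on the classpath, neither of which is specified in the slides):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, count

    spark = SparkSession.builder.appName("step2-transform").getOrCreate()

    # Requires a reachable Kafka broker plus the spark-sql-kafka connector package;
    # the broker address and topic name below are placeholders
    counts = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "localhost:9092")
              .option("subscribe", "input")
              .load()
              # Kafka delivers key/value as binary; cast value to string to group on it
              .groupBy(col("value").cast("string").alias("key"))
              .agg(count("*").alias("value")))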

27. Anatomy of a Streaming Query, Step 3: Sink • Accepts the output of each batch • When supported, sinks are transactional and exactly-once (e.g., Files)

    spark.readStream \
        .format("kafka") \
        .option("subscribe", "input") \
        .load() \
        .groupBy(col("value").cast("string").alias("key")) \
        .agg(count("*").alias("value")) \
        .writeStream \
        .format("kafka") \
        .option("topic", "output") \
        .trigger(processingTime="1 minute") \
        .outputMode("complete") \
        .option("checkpointLocation", "…") \
        .start()
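If no Kafka cluster is at hand, a similar end-to-end query can be sketched with the built-in rate source and a Parquet file sink; the paths below are hypothetical, and, as the slide notes, the file sink is exactly-once provided a checkpoint location is set:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sink-sketch").getOrCreate()

    # Rate source stands in for Kafka; output and checkpoint paths are hypothetical
    events = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

    query = events.writeStream \
        .format("parquet") \
        .option("path", "/tmp/streaming-output/") \
        .option("checkpointLocation", "/tmp/streaming-checkpoint/") \
        .outputMode("append") \
        .trigger(processingTime="1 minute") \
        .start()

    query.awaitTermination(120)   # run a couple of triggers, then return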

28. Anatomy of a Streaming Query: Output Modes and Triggers. Output mode (what is output): • Complete: output the whole answer every time • Update: output changed rows only • Append: output new rows only. Trigger (when to output): • Specified as a time interval; may eventually support data size • No trigger means run as fast as possible

    spark.readStream \
        .format("kafka") \
        .option("subscribe", "input") \
        .load() \
        .groupBy(col("value").cast("string").alias("key")) \
        .agg(count("*").alias("value")) \
        .writeStream \
        .format("kafka") \
        .option("topic", "output") \
        .trigger(processingTime="1 minute") \
        .outputMode("update") \
        .option("checkpointLocation", "…") \
        .start()
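The trigger choices translate to PySpark roughly as follows (a sketch; the console sink and rate source are stand-ins so the snippet runs on its own):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("trigger-sketch").getOrCreate()
    df = spark.readStream.format("rate").load()   # any streaming DataFrame will do

    # Micro-batch every minute
    q1 = df.writeStream.format("console").outputMode("append") \
           .trigger(processingTime="1 minute").start()

    # One-shot trigger: process available data once, then stop (good for scheduled jobs)
    q2 = df.writeStream.format("console").outputMode("append") \
           .trigger(once=True).start()

    # No trigger: fire each micro-batch as soon as the previous one finishes
    q3 = df.writeStream.format("console").outputMode("append").start()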

29. Streaming Query: Output Modes. Output mode (what is output): • Complete – output the whole answer every time • Update – output changed rows • Append – output new rows only
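As a closing sketch of how the three modes apply to an aggregation (the socket source and port are placeholders, and the watermark remark is a general Structured Streaming rule rather than something stated on the slide):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode, split

    spark = SparkSession.builder.appName("output-modes-sketch").getOrCreate()

    # Same word-count aggregation shape as the earlier sketch
    lines = spark.readStream.format("socket") \
        .option("host", "localhost").option("port", 9999).load()
    counts = (lines.select(explode(split(lines.value, " ")).alias("word"))
              .groupBy("word").count())

    # Pick one output mode when starting the query:
    query = counts.writeStream.format("console").outputMode("complete").start()
    # .outputMode("update")  -> emit only the rows whose counts changed each trigger
    # .outputMode("append")  -> emit only brand-new rows; for streaming aggregations
    #                           this additionally requires an event-time watermark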