Self-Service Apache Spark Structured Streaming Applications and Analytics

Organizations are increasingly building more and more Apache Spark Structured Streaming Applications for IoT analytics, real-time fraud detection, anomaly detection, analyzing streaming data from devices, turbines etc. However building the streaming applications and operationalizing them is challenging. There is a need for a self-serve platform on Spark Structured Streaming to enable many users to quickly build, deploy, run and monitor a variety of big data streaming use cases. At Sparkflows we built out a Self-Service Platform for building Structured Streaming Applications in minutes. Variety of users can log in with their Browser and build and deploy these applications seamlessly with drag and drop of 200+ Processors. They can also build charts on the streaming data and perform streaming analytics. In this talk we will dive deeper into our journey. We started with a workflow editor and workflow engine for building and running structured streaming jobs. We will explain how we built out the connectors to streaming sources for running in the designer mode, perform ML model scoring with real-time ingestion, streaming analytics, schema inference and propagation and displaying results in continuously moving charts. We will go over how we built self-serve streaming data preparation, lookup and analytics with SQL, Scala, Python etc. Finally, we will also discuss how we enabled deployment, operationalization and monitoring of the long running Structured Streaming jobs. We want to show how Spark can be used to enable very complex Self-Serve Big Data Streaming Applications and Analytics for Enterprises.

1.WIFI SSID:SparkAISummit | Password: UnifiedAnalytics

2.Self-Service Apache Spark Structured Streaming Applications & Analytics Jayant Shekhar #UnifiedAnalytics #SparkAISummit

3.Use Cases • IoT analytics • Real-time fraud detection • Anomaly detection • Analyzing streaming data from devices, turbines etc. #UnifiedAnalytics #SparkAISummit 3

4.Why Self-Serve Build Quickly Easy Debugging Ability to handle error cases Build in minutes instead of hours and View what is happening inside days the jobs Visualizations Easy Operationalization Deploy immediately View the streaming data Track status of jobs, jobs history visually Failure Notifications etc. #UnifiedAnalytics #SparkAISummit 4

5. Workflow – Find delayed Flights Display Delayed Save to Flights Hbase/MongoDB Read from Find delayed Parse Fields Kafka Flights (SQL) Aggregate by Display Delays Airports by Airport #UnifiedAnalytics #SparkAISummit 5

6.Workflows stored as JSON #UnifiedAnalytics #SparkAISummit 6

7.Spark Submit spark2-submit --class fire.execute.WorkflowExecuteFromFile --master yarn --deploy-mode client --proxy-user sparkflows /home/sparkflows/fire-3.1.0/fire-core-lib/fire-spark_2_1-core-3.1.0-jar-with-dependencies.jar --postback-url http://demo50:8080/messageFromSparkJob --sql-context HiveContext --job-id 1aa66851-0e69-4fc8-b525-539839f72046 --workflow-file /tmp/fire/workflows/workflow-4994719000673479151.json #UnifiedAnalytics #SparkAISummit 7

8.Processor Types • Connectors • Visualizations • Transforms • ML Scoring • Aggregations • More • Languages (SQL, Scala, – Sessionization Python) – Dedup #UnifiedAnalytics #SparkAISummit 8

9. Processors Details Streaming Feature Generation Machine Learning Data Profiling Geo NLP/OCR • Kafka • Correlation • IP2Geo • Tokenization • Classification • Named Entity • Kinesis • Data Summary • Spatial Joins • Stop Words • Regression Recognition • Files • Histograms • Map lat/lon to Remover • Clustering • Sentiment Analysis • Flume Zipcode • Imputer • Collaborative • Document • Sockets • Locality Sensitive Filtering Categorizer Hashing • Save / Load Model • OCR • One Hot Encoder • Cross-Validation • Predict File File Formats Formats ETL Languages Visualizations Data Sources RDBMS •Scalability CSV / TSV is an • Join, Union • Scala • Graphs • HDFS / S3 • MySQL •attribute Avro that describes • Filter • SQL • Maps • HIVE • Oracle •theParquet ability of a • Data Validation • Jython • Heatmaps • HBase • Postgres •process, JSONnetwork, • Math/String/Date • Java • Barchart • Cassandra • Teradata •software XML or Functions • Python • Piechart • Elastic Search • Etc. •organization PDF to grow • Data Cleanup • Kafka •andBinary manage • Salesforce increased demand. A • Marketo system, business or software that is described as scalable has an advantage because it is more adaptable to the #UnifiedAnalytics #SparkAISummit 9

10.Components HDFS Kafka HBase Running Apache Spark Instances Mongo S3 Kinesis DB Browser Web Server ADLS Cassa ndra Apache Spark Clusters ES #UnifiedAnalytics #SparkAISummit 10

11.Connectors • Cannot easily start/stop streaming jobs when designing. • Build connectors for reading from the stores and converting to DataFrames #UnifiedAnalytics #SparkAISummit 11

12.Analytics • Slice & Dice data • Aggregations • Streaming Visualizations #UnifiedAnalytics #SparkAISummit 12

13.Analytics • SQL • Scala #UnifiedAnalytics #SparkAISummit 13

14.NLP • Integrated Apache OpenNLP & StanfordNLP • Processors ensure serialization of objects etc. and making things parallizable. • So users can easily start applying NLP to millions and millions of records. #UnifiedAnalytics #SparkAISummit 14

15.Streaming Charts • Results produced by the Spark Streaming jobs are streamed back to the browser. • Displayed on streaming charts. #UnifiedAnalytics #SparkAISummit 15

16.Streaming Connectors • Specialized code to run in the Workflow Designer reading from Streaming Sources. • We cannot run a full streaming job for interactive execution. #UnifiedAnalytics #SparkAISummit 16

17.ML Scoring • ML Pipelines include featurization significantly simplifying things. • Ability of Processors to read models and pass them to the next Processors • VectorAssembler, VectorIndexer, StringIndexer, OneHotEncoder, Bucketizer #UnifiedAnalytics #SparkAISummit 17

18.Storing Results • Ability to store in Hbase, ElasticSearch, HDFS etc. • Do not allow running at design mode so as not to mess up the stores. #UnifiedAnalytics #SparkAISummit 18

19.Large Scale Deployment & Monitoring #UnifiedAnalytics #SparkAISummit 19

20.Deployment - Executions #UnifiedAnalytics #SparkAISummit 20

21.Track Job Status • Status : STARTING / RUNNING / COMPLETED / FAILED / KILLED • Jobs post back their status to the server • Poll the jobs in various ways – logs, YARN etc. #UnifiedAnalytics #SparkAISummit 21

22.Scheduling & Triggering • Schedule by Time • Poll Kafka topic for events – Workflow ID – Workflow Parameters – Spark Submit Configurations #UnifiedAnalytics #SparkAISummit 22

23.Notification & Alerts • When jobs complete / fail send email alerts etc. #UnifiedAnalytics #SparkAISummit 23

24.Some more interesting things… • When no events received for defined time period, stop the Streaming Job. #UnifiedAnalytics #SparkAISummit 24

25.Execution Results Execution Results stored in an RDBMS and tracked #UnifiedAnalytics #SparkAISummit 25

26.View and compare various runs of a Workflow #UnifiedAnalytics #SparkAISummit 26

27.Performance #UnifiedAnalytics #SparkAISummit 27

28.Performance • Continue to update each process for best performance. Write once run many times… • Allow user control – Ability to control the Persistence Level of the DataFrames at any step • Focused on steps which took longer than expected, analyzed the code and updated it. • Ran load tests to compare various runs. #UnifiedAnalytics #SparkAISummit 28

29.Learnings… • Many more users able to get value from data. • Reduced time from idea to deployment. • Performance become easier. • Deployment becomes one click. • Easier to write complex modules like Dedup, CDC, ML etc. and use them at many places. #UnifiedAnalytics #SparkAISummit 29