An AI-Powered Chatbot to Simplify Apache Spark Performance Management

>Sarah: My Spark SQL query failed. How can I fix it?
>Jeeves: Your Spark query driver went out of memory.
>Jeeves: You can set spark.driver.memory to 2.2GB and rerun the query to complete it successfully.

Who is Jeeves? An experienced Spark developer? A seasoned administrator? No, Jeeves is a chatbot created to simplify data operations management for enterprise Spark clusters. This chatbot is powered by advanced AI algorithms and an intuitive conversational interface that together provide answers to get users out of performance problems quickly. Instead of being stuck at screens displaying performance logs and metrics, users can now have a more refreshing experience and consume performance insights via a two-way conversation with their own personal Spark expert.

This talk will give an overview of the chatbot, its architecture, and how it fits in a complex Spark environment. The chatbot connects to a large number of sources to get the data to power its AI algorithms. It can detect anomalies in performance and push key insights via alerts to users when they need them the most. The chatbot can also be told to take actions like creating tickets and making configuration changes. You will learn how to build chatbots that tackle your complex data operations challenges with AI algorithms and automation, keeping a cool head at all times.


2.An AI-powered Chatbot to Simplify Spark Performance Management Shivnath Babu Cofounder/CTO, Unravel Adjunct Professor, Duke University #UnifiedAnalytics #SparkAISummit

3.Meet the speaker • Cofounder/CTO at Unravel • Adjunct Professor of Computer Science at Duke University • Focusing on ease-of-use and manageability of data-intensive systems • Recipient of US National Science Foundation CAREER Award, three IBM Faculty Awards, HP Labs Innovation Research Award

4.What is a Chatbot?

5.A program which conducts a conversation via text or voice

6.Chatbots are making a real difference

7. Source:

8. TOBi generates 2x more ecommerce conversions in ½ the time for Vodafone

9. Zara provides fast services to 20% of Zurich Insurance customers

10. Woebot, the therapist chatbot, talks to more people in a day than a human therapist does in a lifetime

11.Chatbots ⇔ Spark Performance: What is the connection?

12.The happy Spark user • Spark is fast • Spark has easy-to-use and comprehensive APIs • Wow, I can do SQL, Streaming, AI/ML, and Graphs in one system! • Spark has a rich ecosystem

13.The frustrated Spark user “I have no idea why my app is slow” “My app failed and I don’t know why!” “I have no clue which cloud instance type to pick for my workload” “My cloud costs are getting out of control. Help!”

14.Typical app failure in Spark • Many levels of correlated stack traces • Identifying the root cause is hard and time-consuming

15.Spark User and Spark Chatbot
Spark User: “My app failed and I don’t know why!”
Spark Chatbot: “I know that sucks! Let me take a look here …”
Spark Chatbot: “I see the problem. Executors are running out of memory”
Spark Chatbot: “Setting spark.executor.memory to 12g fixes the problem. I have verified it. See this run here”
Spark User: “Wow. Thanks. You are awesome!”
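As a rough illustration of what acting on such a recommendation might look like, the sketch below builds a `spark-submit` command line that passes the suggested setting through the standard `--conf key=value` option. The helper name `build_spark_submit` and the application file `my_app.py` are hypothetical and not part of the talk; how the chatbot actually applies its recommendations is not described in the slides.

```python
from typing import Dict, List


def build_spark_submit(app_path: str, conf: Dict[str, str]) -> List[str]:
    """Build a spark-submit argument list with one --conf flag per setting.

    Hypothetical helper for illustration only.
    """
    args = ["spark-submit"]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    args.append(app_path)
    return args


# The setting recommended in the dialogue above; my_app.py is a placeholder.
command = build_spark_submit("my_app.py", {"spark.executor.memory": "12g"})
print(" ".join(command))
```

The resulting list can be handed to `subprocess.run` on a machine where `spark-submit` is installed.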

16.I will show you a Chatbot that • Makes you more productive • Saves you time and money • Becomes your AI-driven Spark Expert in a Bot!

17.My app is too slow… DATA ENGINEER

18.I need to make it faster… DATA ENGINEER

19.Current approach
1. Review Spark/YARN UI to find the app
2. Review metrics in the UI
3. Review jobs and stages associated with the app
4. Identify all containers associated with the app
5. Review and debug container logs
6. Identify “problematic” jobs, stages, or containers
7. Guess which parameters to tune for performance
8. Do trial-and-error by changing a parameter setting
9. Rinse & repeat

20.There has to be a better way

21.What is going on here?

22.Chatbot Architecture from 30,000 ft • Messaging Platform • Bot’s NLP Layer • Bot’s Backend Layer
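A minimal sketch of how a message might flow through these three layers, assuming a keyword-based stand-in for the NLP layer and canned replies in the backend. Every name here (`Intent`, `nlp_layer`, `backend_layer`, `handle_message`) is hypothetical; the slides do not describe the actual implementation of any layer.

```python
from dataclasses import dataclass


@dataclass
class Intent:
    """What the user wants, as inferred by the (hypothetical) NLP layer."""
    name: str


def nlp_layer(message: str) -> Intent:
    """Map free text to an intent. A real bot would use a trained NLP model;
    keyword matching is only a placeholder."""
    text = message.lower()
    if "fail" in text:
        return Intent("diagnose_failure")
    if "slow" in text or "faster" in text:
        return Intent("tune_performance")
    return Intent("unknown")


def backend_layer(intent: Intent) -> str:
    """Produce a reply for an intent. A real backend would run the
    analysis and tuning algorithms; these replies are canned."""
    replies = {
        "diagnose_failure": "Let me take a look at the failed run.",
        "tune_performance": "Let me look for a faster configuration.",
    }
    return replies.get(intent.name, "Sorry, I did not understand that.")


def handle_message(message: str) -> str:
    """Entry point the messaging platform would call for each user message."""
    return backend_layer(nlp_layer(message))


print(handle_message("My app is too slow"))
```

Keeping the layers as separate functions mirrors the slide: the messaging platform only sees text in and text out, while the NLP and backend layers can change independently.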

23.Algorithm running in bot’s backend [Diagram components: App, Goal; Recommendation Algorithm; Probe Algorithm; Xnext; Monitoring Data; Historic Data & Probe Data; Orchestrator; Cluster Services (On-premises and Cloud)]

24. Spark tuning parameters
spark.driver.cores                      2
spark.executor.cores                    10
…
spark.sql.shuffle.partitions            300
spark.sql.autoBroadcastJoinThreshold    20MB
…
SKEW('orders', 'o_custId')              true
spark.catalog.cacheTable("orders")      true
…
We represent this setting as vector X
[Plot: PERFORMANCE against X]
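One way to picture "setting as vector X" is to fix an order for the parameters and map each value to a number. The sketch below does this for the six values shown on the slide. The encoding rules (booleans become 1.0/0.0, sizes are stored in megabytes) and the names `PARAM_ORDER`, `to_number`, and `encode` are assumptions made for illustration; the slides do not say how the vector is actually built.

```python
from typing import Dict, List, Union

Value = Union[bool, int, float, str]

# Fixed parameter order, so position i of X always means the same parameter.
PARAM_ORDER = [
    "spark.driver.cores",
    "spark.executor.cores",
    "spark.sql.shuffle.partitions",
    "spark.sql.autoBroadcastJoinThreshold",
    "SKEW('orders', 'o_custId')",
    'spark.catalog.cacheTable("orders")',
]


def to_number(value: Value) -> float:
    """Encode one setting as a float (assumed encoding, see lead-in)."""
    if isinstance(value, bool):          # check bool before int: bool is an int
        return 1.0 if value else 0.0
    if isinstance(value, str) and value.upper().endswith("MB"):
        return float(value[:-2])         # "20MB" -> 20.0 (megabytes)
    return float(value)


def encode(setting: Dict[str, Value]) -> List[float]:
    """Turn a {parameter: value} mapping into the vector X."""
    return [to_number(setting[name]) for name in PARAM_ORDER]


setting = {
    "spark.driver.cores": 2,
    "spark.executor.cores": 10,
    "spark.sql.shuffle.partitions": 300,
    "spark.sql.autoBroadcastJoinThreshold": "20MB",
    "SKEW('orders', 'o_custId')": True,
    'spark.catalog.cacheTable("orders")': True,
}
X = encode(setting)
print(X)
```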

25.Given: App + Goal • Find the setting of X that best meets the goal • Challenge: Response surface y = ƒ(X) is unknown [Plot: PERFORMANCE against X]

26.Challenge: Response surface y = ƒ(X) is unknown
Model the response surface as
$$\hat{y}(X) = \vec{f}^{\,t}(X)\,\vec{\beta} + Z(X)$$
Here: $\vec{f}^{\,t}(X)\,\vec{\beta}$ is a regression model, and $Z(X)$ is the residual captured as a Gaussian Process.
The Gaussian Process model captures the uncertainty in our current knowledge of the response surface.
[Plot: PERFORMANCE against X]
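To make the model above concrete, here is a small pure-Python sketch of a Gaussian Process prediction for a one-dimensional X. It is a simplification under stated assumptions, not the speaker's implementation: the regression term is reduced to a constant (the mean of the observed y values), the residual Z(X) uses a squared-exponential kernel with unit variance, and the names `kernel`, `solve`, and `gp_predict` are hypothetical.

```python
import math
from typing import List, Tuple


def kernel(a: float, b: float, length_scale: float = 1.0) -> float:
    """Squared-exponential covariance between two settings (assumed kernel)."""
    return math.exp(-((a - b) ** 2) / (2.0 * length_scale ** 2))


def solve(A: List[List[float]], b: List[float]) -> List[float]:
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def gp_predict(xs: List[float], ys: List[float], x_new: float,
               length_scale: float = 1.0,
               jitter: float = 1e-9) -> Tuple[float, float]:
    """Return (mean, standard deviation) of the model's prediction at x_new."""
    mean = sum(ys) / len(ys)             # constant stand-in for f^t(X) beta
    K = [[kernel(a, b, length_scale) + (jitter if i == j else 0.0)
          for j, b in enumerate(xs)] for i, a in enumerate(xs)]
    k_new = [kernel(a, x_new, length_scale) for a in xs]
    alpha = solve(K, [y - mean for y in ys])
    mu = mean + sum(k * a for k, a in zip(k_new, alpha))
    v = solve(K, k_new)
    var = kernel(x_new, x_new, length_scale) - sum(k * w for k, w in zip(k_new, v))
    return mu, math.sqrt(max(var, 0.0))


xs, ys = [4.0, 8.0, 12.0], [6.0, 3.0, 5.0]   # made-up observations
print(gp_predict(xs, ys, 8.0))    # at an observed setting: little uncertainty
print(gp_predict(xs, ys, 10.0))   # between observations: more uncertainty
```

The standard deviation returned alongside the mean is the "uncertainty in our current knowledge" that the slide refers to: it is near zero at settings already observed and grows away from them.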

27. Opportunity
We can now estimate the expected improvement EIP(X) from doing a probe at any setting X:
$$\mathrm{EIP}(X) = \int_{p=-\infty}^{p=y(X^*)} \bigl(y(X^*) - p\bigr)\,\mathrm{pdf}_{\hat{y}(X)}(p)\,dp$$
Here $y(X^*) - p$ is the improvement at any setting X over the best performance seen so far, and $\mathrm{pdf}_{\hat{y}(X)}(p)$ is the probability density function (uncertainty estimate).
Gaussian Process model helps estimate EIP(X).
[Plot: PERFORMANCE against X]
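When the prediction at X is Gaussian with mean μ and standard deviation σ, the integral above has a standard closed form: EIP = (y* − μ)·Φ(z) + σ·φ(z) with z = (y* − μ)/σ, where y* = y(X*) is the best (lowest) value seen so far and Φ, φ are the standard normal CDF and PDF. The sketch below evaluates it; the function name `expected_improvement` is hypothetical, and lower y is assumed to be better, as the integral's limits imply.

```python
import math


def expected_improvement(mu: float, sigma: float, best: float) -> float:
    """Closed-form EIP for a Gaussian prediction N(mu, sigma^2).

    best is y(X*), the lowest value observed so far (lower is better).
    """
    if sigma <= 0.0:
        # No uncertainty: the improvement is known exactly.
        return max(best - mu, 0.0)
    z = (best - mu) / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))        # Phi(z)
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # phi(z)
    return (best - mu) * cdf + sigma * pdf


# Same predicted mean as the best value seen so far, with and without uncertainty.
print(expected_improvement(mu=3.0, sigma=1.0, best=3.0))
print(expected_improvement(mu=3.0, sigma=0.0, best=3.0))
```

The two calls show why uncertainty matters: with the same predicted mean, only the uncertain setting has a positive expected improvement, which is what draws probes toward unexplored regions.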

28. Probe Algorithm
1. Bootstrap: Get initial set of monitoring data from history or via probes: <X1,y1>, <X2,y2>, …, <Xn,yn>
2. Select next probe Xnext based on all history and probe data available so far to calculate the setting with maximum expected improvement EIP(X)
Repeat step 2 until the stopping condition is reached.
[Plot: PERFORMANCE against X]
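The two steps above can be sketched as a short loop over a one-dimensional grid of candidate settings. This is an illustration under assumptions, not the speaker's code: the surrogate `nearest_neighbor_model` is a deliberately crude stand-in for the Gaussian Process (its mean is the nearest observed y and its uncertainty is the distance to that observation), the stopping condition is a probe budget plus a minimum-EIP threshold, the probe function is a made-up quadratic, and all names are hypothetical.

```python
import math
from typing import Callable, List, Tuple

History = List[Tuple[float, float]]   # list of <X, y> pairs


def eip(mu: float, sigma: float, best: float) -> float:
    """Expected improvement for a Gaussian prediction (lower y is better)."""
    if sigma <= 0.0:
        return max(best - mu, 0.0)
    z = (best - mu) / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (best - mu) * cdf + sigma * pdf


def nearest_neighbor_model(history: History, x: float) -> Tuple[float, float]:
    """Crude stand-in surrogate, NOT a Gaussian Process: predicts the nearest
    observed y, with uncertainty equal to the distance to that observation."""
    x_near, y_near = min(history, key=lambda pair: abs(pair[0] - x))
    return y_near, abs(x_near - x)


def tune(run_probe: Callable[[float], float], candidates: List[float],
         bootstrap: List[float], max_probes: int = 10,
         min_eip: float = 1e-3) -> Tuple[float, float]:
    """Return the best <X, y> found by EIP-guided probing."""
    # Step 1: bootstrap with an initial set of probes.
    history: History = [(x, run_probe(x)) for x in bootstrap]
    for _ in range(max_probes):
        best = min(y for _, y in history)
        # Step 2: pick the candidate with maximum expected improvement.
        scored = []
        for x in candidates:
            mu, sigma = nearest_neighbor_model(history, x)
            scored.append((eip(mu, sigma, best), x))
        best_eip, x_next = max(scored)
        if best_eip < min_eip:            # stopping condition
            break
        history.append((x_next, run_probe(x_next)))
    return min(history, key=lambda pair: pair[1])


def fake_probe(x: float) -> float:
    """Made-up response surface standing in for a real Spark run."""
    return (x - 7.0) ** 2 + 2.0


grid = [float(i) for i in range(4, 13)]
best_x, best_y = tune(fake_probe, grid, bootstrap=[4.0, 12.0])
print(best_x, best_y)
```

In a real deployment each call to the probe function would be an actual run of the application with setting X, which is why the loop tries to get the most out of every probe.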

29. [Plot: performance y and EIP(X) against X, with the probes done so far] Xnext: Do next probe here. This approach balances Exploration vs. Exploitation.