An AI-Powered Chatbot to Simplify Apache Spark Performance Management

>Sarah: My Spark SQL query failed. How can I fix it?
>Jeeves: Your Spark query driver went out of memory.
>Jeeves: You can set spark.driver.memory to 2.2GB and rerun the query to complete it successfully.

Who is Jeeves? An experienced Spark developer? A seasoned administrator? No, Jeeves is a chatbot created to simplify data operations management for enterprise Spark clusters. This chatbot is powered by advanced AI algorithms and an intuitive conversational interface that together provide answers to get users in and out of performance problems quickly. Instead of being stuck to screens displaying performance logs and metrics, users can now have a more refreshing experience and consume performance insights via a two-way conversation with their own personal Spark expert.

This talk will give an overview of the chatbot, its architecture, and how it fits in a complex Spark environment. The chatbot connects to a large number of sources to get the data to power its AI algorithms. It can detect anomalies in performance and push key insights via alerts to users when they need them the most. The chatbot can also be told to take actions like creating tickets and making configuration changes. You will learn how to build chatbots that tackle your complex data operations challenges with AI algorithms and automation, keeping a cool head at all times.

An AI-powered Chatbot to Simplify Spark Performance Management Shivnath Babu Cofounder/CTO, Unravel Adjunct Professor, Duke University

Meet the speaker • Cofounder/CTO at Unravel • Adjunct Professor of Computer Science at Duke University • Focusing on ease-of-use and manageability of data-intensive systems • Recipient of US National Science Foundation CAREER Award, three IBM Faculty Awards, HP Labs Innovation Research Award

What is a Chatbot?

A program which conducts a conversation via text or voice

Chatbots are making a real difference


TOBi generates 2x more ecommerce conversions in ½ the time for Vodafone

Zara provides fast services to 20% of Zurich Insurance customers

Woebot, the therapist chatbot, talks to more people in a day than a human therapist does in a lifetime

Chatbots ⇔ Spark Performance: What is the connection?

The happy Spark user • Spark is fast • Spark has easy-to-use and comprehensive APIs • Wow, I can do SQL, Streaming, AI/ML, and Graphs in one system! • Spark has a rich ecosystem

The frustrated Spark user "I have no idea why my app is slow" "My app failed and I don't know why!" "I have no clue which cloud instance type to pick for my workload" "My cloud costs are getting out of control. Help!"

Typical app failure in Spark • Many levels of correlated stack traces • Identifying the root cause is hard and time consuming

Spark User: "My app failed and I don't know why!"
Spark Chatbot: "I know that sucks! Let me take a look here …"
Spark Chatbot: "I see the problem. Executors are running out of memory"
Spark Chatbot: "Setting spark.executor.memory to 12g fixes the problem. I have verified it. See this run here"
Spark User: "Wow. Thanks. You are awesome!"
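As a rough illustration of what applying the chatbot's recommendation could look like, the sketch below builds a `spark-submit` command line that passes the suggested value through the standard `--conf key=value` flag. The helper name `spark_submit_args` and the application file `my_app.py` are placeholders invented for this example, not something shown in the talk.

```python
from typing import Dict, List


def spark_submit_args(app: str, overrides: Dict[str, str]) -> List[str]:
    """Build a spark-submit argument list with one --conf flag per override."""
    args = ["spark-submit"]
    for key, value in sorted(overrides.items()):
        args += ["--conf", f"{key}={value}"]
    args.append(app)  # the application file goes last
    return args


# Value taken from the conversation above; "my_app.py" is a placeholder.
recommended = {"spark.executor.memory": "12g"}
print(" ".join(spark_submit_args("my_app.py", recommended)))
```

Returning a list rather than a single string keeps the result safe to hand to `subprocess.run` without shell quoting.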

I will show you a Chatbot that • Makes you more productive • Saves you time and money • Becomes your AI-driven Spark Expert in a Bot!

My app is too slow… DATA ENGINEER

I need to make it faster… DATA ENGINEER

Current approach 1. Review Spark/YARN UI to find the app 2. Review metrics in the UI 3. Review jobs and stages associated with the app 4. Identify all containers associated with the app 5. Review and debug container logs 6. Identify "problematic" jobs, stages, or containers 7. Guess which parameters to tune for performance 8. Do trial-and-error by changing a parameter setting 9. Rinse & repeat

There has to be a better way

What is going on here?

Chatbot Architecture from 30,000 ft: Messaging Platform • Bot's NLP Layer • Bot's Backend Layer
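The three layers can be pictured as a small pipeline: the messaging platform delivers text, the NLP layer turns it into an intent, and the backend acts on that intent. The sketch below is only a toy under strong assumptions: keyword matching stands in for a real NLP model, canned strings stand in for the backend's analysis, and every name in it (`Intent`, `nlp_layer`, `backend_layer`, `on_message`) is hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Intent:
    name: str               # "diagnose_failure", "tune_performance", or "unknown"
    app_id: Optional[str]   # YARN-style application id, if the user gave one


def nlp_layer(text: str) -> Intent:
    """Toy NLP layer: keyword matching stands in for a real intent model."""
    match = re.search(r"application_\d+_\d+", text)
    app_id = match.group(0) if match else None
    lowered = text.lower()
    if "fail" in lowered:
        return Intent("diagnose_failure", app_id)
    if "slow" in lowered or "faster" in lowered:
        return Intent("tune_performance", app_id)
    return Intent("unknown", app_id)


def backend_layer(intent: Intent) -> str:
    """Toy backend: a real one would run the diagnosis or tuning algorithms."""
    if intent.name == "unknown":
        return "Sorry, I did not understand that."
    if intent.app_id is None:
        return "Which application do you mean?"
    if intent.name == "diagnose_failure":
        return f"Looking into why {intent.app_id} failed."
    return f"Searching for faster settings for {intent.app_id}."


def on_message(text: str) -> str:
    """Entry point a messaging platform would call for each incoming message."""
    return backend_layer(nlp_layer(text))
```

Keeping the NLP layer and the backend behind separate functions means either can be swapped out without touching the messaging integration.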

Algorithm running in bot's backend
[Diagram components: App, Goal • Recommendation Algorithm • Probe Algorithm • Monitoring Data • Historic Data & Probe Data • Xnext • Orchestrator • Cluster Services (On-premises and Cloud)]

Spark tuning parameters
• spark.driver.cores: 2
• spark.executor.cores: 10
• …
• spark.sql.shuffle.partitions: 300
• spark.sql.autoBroadcastJoinThreshold: 20MB
• …
• SKEW('orders', 'o_custId'): true
• spark.catalog.cacheTable("orders"): true
• …
We represent this setting as vector X
[Plot axes: performance against setting X]
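One way to picture "setting as vector X" is a fixed parameter order plus a numeric encoding for each value. The sketch below is an assumption about how such an encoding might look, not the encoding used in the talk: booleans become 0/1, sizes are converted to megabytes, and only the `KB`, `MB`, and `GB` suffixes are handled. The key `cache.orders` is a made-up label for the `cacheTable("orders")` choice.

```python
from typing import Dict, List, Union

Value = Union[int, float, bool, str]

# Conversion factors to megabytes for the size suffixes this sketch understands.
_UNITS_IN_MB = {"KB": 1.0 / 1024, "MB": 1.0, "GB": 1024.0}


def encode(value: Value) -> float:
    """Map one parameter value to a float."""
    if isinstance(value, bool):  # test bool first: bool is a subclass of int
        return 1.0 if value else 0.0
    if isinstance(value, (int, float)):
        return float(value)
    text = value.strip().upper()
    for suffix, factor in _UNITS_IN_MB.items():
        if text.endswith(suffix):
            return float(text[: -len(suffix)]) * factor
    return float(text)  # plain numeric string


def to_vector(settings: Dict[str, Value], order: List[str]) -> List[float]:
    """Encode a settings dict as a vector X using a fixed parameter order."""
    return [encode(settings[name]) for name in order]


ORDER = [
    "spark.driver.cores",
    "spark.executor.cores",
    "spark.sql.shuffle.partitions",
    "spark.sql.autoBroadcastJoinThreshold",
    "cache.orders",  # hypothetical label for the cacheTable("orders") choice
]
settings = {
    "spark.driver.cores": 2,
    "spark.executor.cores": 10,
    "spark.sql.shuffle.partitions": 300,
    "spark.sql.autoBroadcastJoinThreshold": "20MB",
    "cache.orders": True,
}
X = to_vector(settings, ORDER)
```

A fixed order matters because the tuning algorithm compares vectors position by position across runs.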

Given: App + Goal • Find the setting of X that best meets the goal • Challenge: Response surface y = ƒ(X) is unknown [Plot axes: performance against setting X]

Challenge: Response surface y = ƒ(X) is unknown
Model the response surface as
$$\hat{y}(X) = \vec{f}^{\,t}(X)\,\vec{\beta} + Z(X)$$
Here: $\vec{f}^{\,t}(X)\,\vec{\beta}$ is a regression model, and $Z(X)$ is the residual captured as a Gaussian Process.
The Gaussian Process model captures the uncertainty in our current knowledge of the response surface.
[Plot axes: performance against setting X]
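To make the model concrete, here is a minimal pure-Python sketch for a one-dimensional X. It rests on several simplifying assumptions that are not from the talk: the regression term is just a constant (the mean of the observed y values), the Gaussian Process uses a squared-exponential kernel with unit variance and a fixed length scale, and a small noise term keeps the kernel matrix invertible. The function returns a predicted mean and a standard deviation, the latter being the uncertainty the slide refers to.

```python
import math
from typing import List, Tuple


def kernel(a: float, b: float, length: float = 2.0) -> float:
    """Squared-exponential covariance between two settings (assumed kernel)."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)


def solve(A: List[List[float]], b: List[float]) -> List[float]:
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def gp_predict(xs: List[float], ys: List[float], x_new: float,
               noise: float = 1e-6) -> Tuple[float, float]:
    """Return (mean, std) of y-hat at x_new given observations (xs, ys)."""
    beta = sum(ys) / len(ys)  # constant regression term
    K = [[kernel(a, b) + (noise if i == j else 0.0)
          for j, b in enumerate(xs)] for i, a in enumerate(xs)]
    k_star = [kernel(a, x_new) for a in xs]
    alpha = solve(K, [y - beta for y in ys])      # weights for the residual Z(X)
    mean = beta + sum(k * a for k, a in zip(k_star, alpha))
    v = solve(K, k_star)
    var = kernel(x_new, x_new) - sum(k * w for k, w in zip(k_star, v))
    return mean, math.sqrt(max(var, 0.0))         # clamp tiny negative round-off
```

Near an observed setting the standard deviation shrinks toward zero, while far from all observations the prediction falls back to the regression term with the full prior uncertainty.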

Opportunity: We can now estimate the expected improvement EIP(X) from doing a probe at any setting X
$$\mathrm{EIP}(X) = \int_{p=-\infty}^{p=y(X^*)} \bigl(y(X^*) - p\bigr)\, \mathrm{pdf}_{\hat{y}(X)}(p)\, dp$$
• $y(X^*) - p$: improvement at any setting X over the best performance seen so far
• $\mathrm{pdf}_{\hat{y}(X)}(p)$: probability density function (uncertainty estimate)
The Gaussian Process model helps estimate EIP(X).
[Plot axes: performance against setting X]
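When the prediction at X is Gaussian with mean μ and standard deviation σ, as a Gaussian Process gives, the EIP integral has a well-known closed form: with z = (y(X*) − μ)/σ, EIP = (y(X*) − μ)·Φ(z) + σ·φ(z), where Φ and φ are the standard normal CDF and PDF. The sketch below implements that formula. It reads the integral's upper limit y(X*) as meaning that smaller y is better (for example, run time), which is an interpretation of the slide rather than something it states.

```python
import math


def expected_improvement(mu: float, sigma: float, best: float) -> float:
    """Closed-form EIP for a Gaussian prediction N(mu, sigma^2).

    `best` is y(X*), the best (smallest) performance value seen so far.
    """
    if sigma <= 0.0:
        # No uncertainty: the improvement is known exactly.
        return max(best - mu, 0.0)
    z = (best - mu) / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))           # Phi(z)
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)    # phi(z)
    return (best - mu) * cdf + sigma * pdf
```

The first term rewards settings whose predicted mean already beats the best seen so far, and the second rewards settings where the model is still uncertain.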

1. Bootstrap: Get initial set of monitoring data from history or via probes: <X1,y1>, <X2,y2>, …, <Xn,yn>
2. Probe Algorithm: Select next probe Xnext based on all history and probe data available so far to calculate the setting with maximum expected improvement EIP(X). Repeat until the stopping condition is reached.
[Plot axes: performance against setting X]
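The bootstrap and probe steps can be sketched end to end for a one-dimensional X. Everything here is an illustrative assumption rather than the talk's implementation: the Gaussian Process uses a constant regression term and a squared-exponential kernel, smaller y is treated as better, candidates come from a fixed list, and the stopping condition (which the slide does not specify) is a probe budget or an EIP threshold. `run_probe` is a hypothetical callback that runs the app at a setting and returns its measured performance.

```python
import math
from typing import Callable, List, Tuple


def kernel(a: float, b: float, length: float = 2.0) -> float:
    return math.exp(-0.5 * ((a - b) / length) ** 2)


def solve(A: List[List[float]], b: List[float]) -> List[float]:
    """Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def predict(xs, ys, x_new, noise=1e-6) -> Tuple[float, float]:
    """GP mean and std at x_new (constant regression term + GP residual)."""
    beta = sum(ys) / len(ys)
    K = [[kernel(a, b) + (noise if i == j else 0.0)
          for j, b in enumerate(xs)] for i, a in enumerate(xs)]
    k_star = [kernel(a, x_new) for a in xs]
    alpha = solve(K, [y - beta for y in ys])
    mean = beta + sum(k * a for k, a in zip(k_star, alpha))
    var = 1.0 - sum(k * w for k, w in zip(k_star, solve(K, k_star)))
    return mean, math.sqrt(max(var, 0.0))


def expected_improvement(mu: float, sigma: float, best: float) -> float:
    if sigma <= 0.0:
        return max(best - mu, 0.0)
    z = (best - mu) / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (best - mu) * cdf + sigma * pdf


def tune(run_probe: Callable[[float], float],
         history: List[Tuple[float, float]],
         candidates: List[float],
         max_probes: int = 10,
         min_eip: float = 1e-3) -> Tuple[float, float]:
    """Step 1: start from history. Step 2: probe where EIP is largest."""
    xs = [x for x, _ in history]
    ys = [y for _, y in history]
    for _ in range(max_probes):                       # stopping condition: budget
        best = min(ys)
        scored = [(expected_improvement(*predict(xs, ys, x), best), x)
                  for x in candidates if x not in xs]
        if not scored:
            break                                     # nothing left to probe
        eip, x_next = max(scored)
        if eip < min_eip:                             # stopping condition: low EIP
            break
        xs.append(x_next)
        ys.append(run_probe(x_next))
    i = ys.index(min(ys))
    return xs[i], ys[i]
```

A usage example with a made-up response surface whose minimum is at X = 9:

```python
def fake_app(x: float) -> float:
    return (x - 9.0) ** 2 / 10.0 + 2.0   # stand-in for a real app run

history = [(4.0, fake_app(4.0)), (12.0, fake_app(12.0))]
candidates = [float(x) for x in range(4, 13)]
best_x, best_y = tune(fake_app, history, candidates)
```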

[Plot: y and EIP(X) against setting X, with Xnext marked "Do next probe here"]
This approach balances Exploration vs. Exploitation