Official Announcement of Koalas Open Source Project

Keynote from Spark + AI Summit 2019: Reynold Xin, Databricks, Brooke Wenig, Databricks

2.Jaime Woodfin, FIS Global; Brooke Wenig, Databricks • Special thanks to Amir Issaei, Kevin Mellott, and Aaron Colcord #UnifiedAnalytics #SparkAISummit

3.Outline • FIS & Databricks Intro • Business Problem & Motivation • Approach v1 – Problems encountered along the way • Approach v2 • Production

4.Who is FIS Global? • The global leader in financial services technology • Customers: banks and credit unions • Ecosystem of products and services built around core banking • FIS Digital Finance, Digital Data and Analytics

5.Who is Databricks? • VISION: Accelerate innovation by unifying data science, engineering and business • PRODUCT: Unified Analytics Platform powered by Apache Spark • WHO WE ARE: – Founded by the original creators of Apache Spark – Contributes 75% of the open source code, 10x more than any other company – Trained 100k+ Spark users on the Databricks platform

6.Business Problem

7.Conversational Analytics • Measure in support conversations • Follow up on support conversations

8.Example Conversation

9.Conversational Channels Developments • Face-to-face • Human support chat • Support chatbots • Conversational Banking

10.Goals • Score overall user conversation satisfaction • Question: What contributed to their satisfaction? • And how to do it at scale?

11.Approach v1

12.Approach v1 • Apply open-source NLP libraries to each turn in the conversation

13.Library Comparison • Different Scales – TextBlob: [-1, 1] – NLTK: [-1, 1] – John Snow Labs (sparknlp): Negative or Positive – Stanford CoreNLP: 0, 1, 2, 3, 4
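Since the four libraries report sentiment on different scales, their outputs need a shared scale before they can be compared. The following is a minimal sketch, assuming a simple linear rescaling onto [-1, 1]. The helper name `to_unit_scale` and the mappings (Negative/Positive to -1/1, CoreNLP classes 0..4 to -1..1) are illustrative assumptions, not something stated in the talk.

```python
def to_unit_scale(library, raw):
    """Map a library's raw sentiment output onto [-1, 1] (hypothetical helper)."""
    if library in ("textblob", "nltk"):
        # These already report a score in [-1, 1]; pass it through unchanged.
        return float(raw)
    if library == "sparknlp":
        # Binary label: assumed to sit at the two ends of the scale.
        return {"negative": -1.0, "positive": 1.0}[raw.lower()]
    if library == "corenlp":
        # Integer classes 0..4, with class 2 treated as the midpoint.
        return (int(raw) - 2) / 2.0
    raise ValueError(f"unknown library: {library}")
```

Note that a binary label has no midpoint, so a library without a "neutral" class can only ever land on -1 or 1 under this mapping.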

14.Demo

15.Problems Encountered • Stanford CoreNLP gave predictions per sentence, not turn • Performed poorly on neutral sentences – John Snow Labs had no “neutral” category • Didn’t do well with banking domain • Can we do better?
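A per-sentence score has to be aggregated somehow before it can describe a whole turn. The talk does not say how (or whether) this was done; one possible sketch, assuming a word-count-weighted mean of the sentence scores, is below. The function name `turn_score` is hypothetical.

```python
def turn_score(sentences):
    """Combine (text, score) pairs for one turn into a single score.

    Each sentence is weighted by its word count, so a long sentence
    counts for more than a one-word interjection.
    Returns None when the turn contains no words.
    """
    weighted = 0.0
    total_words = 0
    for text, score in sentences:
        n_words = len(text.split())
        weighted += score * n_words
        total_words += n_words
    if total_words == 0:
        return None
    return weighted / total_words
```

Other choices (an unweighted mean, or taking the most negative sentence) would be equally defensible; the weighting here is only one option.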

16.Approach v2

17.Approach v2 • No pre-trained sentiment analysis models! • Model: – Build LSTM model on all conversation text to predict sentiment – Augment with additional features (e.g. # of turns, time of day) – Pass features through end classifier • Positive/Negative

18.Transfer Learning • Distributed training of LSTM on open-source sentiment dataset with HorovodRunner • Transfer learning on banking data

19.LSTM Stats • Basic stats for performance: – Accuracy: 73% – FPR: 11% – FNR: 32% • FPR = FP / N = FP / (FP + TN), where N is the number of actual negatives • FNR = FN / P = FN / (TP + FN), where P is the number of actual positives
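The rates on this slide follow directly from the four confusion-matrix counts. A minimal sketch of the arithmetic is below; the function name `error_rates` and the counts in the usage note are made up for illustration and are not the talk's data.

```python
def error_rates(tp, fp, tn, fn):
    """Compute accuracy, false positive rate and false negative rate.

    tp, fp, tn, fn are the confusion-matrix counts for a binary classifier.
    """
    negatives = fp + tn  # all actual negatives
    positives = tp + fn  # all actual positives
    if negatives == 0 or positives == 0:
        raise ValueError("need at least one actual positive and one actual negative")
    return {
        "accuracy": (tp + tn) / (positives + negatives),
        "fpr": fp / negatives,
        "fnr": fn / positives,
    }
```

For example, `error_rates(tp=68, fp=11, tn=89, fn=32)` uses 100 actual positives and 100 actual negatives, so the FPR is 11/100 and the FNR is 32/100.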

20.Features • LSTM output • Conversational Features – User average turn length – Agent average turn length – # Turns – Duration • Temporal Features – Day of Week – Time of Day • Others
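The conversational and temporal features listed above can be derived from a timestamped transcript. The sketch below assumes each turn is a `(speaker, timestamp, text)` tuple, that turn length is measured in words, and that temporal features are taken from the first turn; the function name, field names, and these choices are all assumptions rather than details from the talk.

```python
from datetime import datetime


def conversation_features(turns):
    """Build a feature dict from a non-empty, time-ordered list of turns.

    Each turn is (speaker, timestamp, text), with speaker "user" or "agent".
    """
    def avg_len(speaker):
        # Average number of words per turn for one speaker.
        lengths = [len(text.split()) for who, _, text in turns if who == speaker]
        return sum(lengths) / len(lengths) if lengths else 0.0

    start, end = turns[0][1], turns[-1][1]
    return {
        "user_avg_turn_length": avg_len("user"),
        "agent_avg_turn_length": avg_len("agent"),
        "num_turns": len(turns),
        "duration_seconds": (end - start).total_seconds(),
        "day_of_week": start.weekday(),  # Monday is 0
        "hour_of_day": start.hour,
    }
```

The resulting dict would sit alongside the LSTM output as input to the end classifier.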

21.Classifier • Logistic Regression – Accuracy: 77% – FPR: 24% – FNR: 22% • Random Forests – Accuracy: 80% – FPR: 10% – FNR: 51% • Others

22.Random Forest • Chose the Random Forest because: – Lowest FP rate and highest accuracy – Good model interpretability – Part of SparkML and can be used with the Pipeline API (easy to switch to Scala)

23.Production

24.Production Requirements • Fit into dev pipeline, largely Scala/Java based – But a lot of data science is done in Python • Close the feedback loop: constantly learning • Automated deployments • Streaming instead of batch

25.Architecture

26.Architecture

27.Python & Scala • Train LSTM in Python (Keras) • Save Model • Load model via UDF • Apply using Scala!

28.Deployment

29.Recap • Idea • Notebook • Production • Happy