High Performance Transfer Learning for Classifying Intent of Sales Engagement Em

“The advent of pre-trained language models such as Google’s BERT promises a high performance transfer learning (HPTL) paradigm for many natural language understanding tasks. One such task is email classification. Given the complexity of the content and context of sales engagement, the lack of a standardized large corpus and benchmarks, limited labeled examples, and the heterogeneous context of intent, this real-world use case poses both a challenge and an opportunity for adopting an HPTL approach. This talk presents an experimental investigation to evaluate transfer learning with pre-trained language models and embeddings for classifying sales engagement emails arising from digital sales engagement platforms (e.g., Outreach.io). We will present our findings on evaluating BERT, ELMo, Flair and GloVe embeddings with both feature-based and fine-tuning based transfer learning implementation strategies and their scalability on a GPU cluster with a progressively increasing number of labeled samples. Databricks’ MLflow was used to track hundreds of experiments with different parameters, metrics and models (TensorFlow, PyTorch, etc.). While in this talk we focus on the email classification task, the approach described is generic and can be used to evaluate the applicability of HPTL to other machine learning tasks. We hope our findings will help practitioners better understand the capabilities and limitations of transfer learning and how to implement transfer learning at scale with Databricks for their scenarios.”

2. High Performance Transfer Learning for Classifying Intent of Sales Engagement Emails: An Experimental Study
Yong Liu, Corey Zumar
#UnifiedAnalytics #SparkAISummit

3. Outline
• Data Science Research Objectives
• Sales Engagement Platform (SEP)
• Use Cases and Technical Challenges
• Experiments and Datasets
• Results
• MLflow Integration and Experiments Tracking
• Summary and Future Work

4. Data Science Research Objectives
• Establish a high performance transfer learning evaluation framework for email classification
• Three research questions:
  – Which embeddings and pre-trained LMs are to be used?
  – Which transfer learning implementation strategies (feature-based vs. fine-tuning) are to be used?
  – How many labeled samples are needed?

5. Sales Engagement Platform (SEP)
• A new category of software
[Diagram layers: Sales Reps; Sales Engagement Platform (SEP) (e.g., Outreach); CRMs (e.g., Salesforce, Microsoft Dynamics, SAP)]

6. SEP Encodes and Automates Sales Activities into Workflows/Pipelines
• Automates execution and capture of activities (e.g., emails) and records in a CRM.
• Schedules and reminds the rep when it is the right time to do the manual tasks (e.g., phone call, custom manual email).
• Enables reps to perform one-on-one personalized outreach to up to 10x more prospects than before.

7. Why Email Intent Classification Is Needed
• Email content is critical for driving results in prospecting and other stages of the sales process
• A metric based on the intent of the reply (e.g., positive, objection, unsubscribe) is much better than a simple “reply rate”
• A/B testing with a better metric can pick the winning email content/template more confidently
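To make the contrast with a plain reply rate concrete, here is a minimal sketch of per-template, intent-based metrics. The function name `intent_metrics` and the toy counts are assumptions for illustration only; the deck does not show how these metrics were computed.

```python
from collections import Counter


def intent_metrics(num_sent, reply_intents):
    """Return the raw reply rate plus one rate per intent label.

    num_sent      -- number of emails sent with a template
    reply_intents -- predicted intent label for each reply received
    """
    if num_sent <= 0:
        raise ValueError("num_sent must be positive")
    metrics = {"reply_rate": len(reply_intents) / num_sent}
    for intent, count in Counter(reply_intents).items():
        metrics[f"{intent}_rate"] = count / num_sent
    return metrics


# Made-up toy data: 10 emails sent with each template (not from the study).
template_a = intent_metrics(10, ["objection", "unsubscribe", "objection", "positive"])
template_b = intent_metrics(10, ["positive", "positive", "positive"])

print("A:", template_a["reply_rate"], template_a["positive_rate"])
print("B:", template_b["reply_rate"], template_b["positive_rate"])
```

In this toy case template A wins on reply rate while template B wins on positive replies, which is the situation where a reply-rate-only A/B test would pick the wrong template.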

8. Why Email Intent Classification is Challenging @ SEP
• Different context and players: players in different roles are involved throughout the sales process and at different orgs
• Limited labeled sales engagement domain emails: GDPR and privacy/compliance constraints make labeling emails time-consuming, and in many orgs on a SEP not possible at all

9. Why Transfer Learning?
• Using pretrained language models opens doors for high performance transfer learning (HPTL):
  – Fewer training samples
  – Better accuracy
  – Reduced model training time and engineering complexity
• Pretrained language models such as BERT have achieved state-of-the-art scores on the GLUE NLP leaderboard (https://gluebenchmark.com/)
  – However, whether such benchmark success can be readily translated to practical application is still unknown

10. A List of Pretrained LMs and Embeddings for Experiments
• GloVe – count-based, context-free word embeddings released in 2014
• ELMo – context-aware, character-based embeddings built on a recurrent neural network (RNN) architecture, released in 2018
• Flair – contextual string embeddings released in 2018
• BERT – state-of-the-art transformer-based deep bidirectional language model released in late 2018 by Google

11. Experimental Email Dataset

12. Example Intents and Emails
• Positive: “Actually, I'd be interested in talking Friday. Do you have some time around 10am?”
• Objection: “Thanks for reaching out. This is not something I am interested in at this time.”
• Unsubscribe: “Please remove me from your email list.”
• Not-sure: “Mike, in regards to? John”

13. Two Sets of Experiment Runs
• Different pretrained language models (LMs) and embeddings: feature-based vs. fine-tuning
  – Using the full set of training examples
• Different labeled training sizes with the feature-based and fine-tuning approaches
  – Increasingly larger training sizes: 50, 100, 200, 300, 500, 1000, 2000, 3000
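One way to set up the second set of runs is to draw nested training subsets, so that each larger training set contains the smaller ones and scores are comparable across sizes. This is only a sketch: the deck does not say how the subsets were sampled, and the helper name `nested_subsets`, the fixed seed, and the pool size of 3000 are assumptions.

```python
import random

# Training sizes listed on the slide.
TRAIN_SIZES = [50, 100, 200, 300, 500, 1000, 2000, 3000]


def nested_subsets(n_examples, sizes=TRAIN_SIZES, seed=0):
    """Map each training size to a list of example indices.

    Indices are shuffled once and each subset is a prefix of that order,
    so every subset is contained in all larger ones. Sizes larger than
    the available pool are skipped.
    """
    order = list(range(n_examples))
    random.Random(seed).shuffle(order)
    subsets = {}
    for size in sorted(sizes):
        if size > n_examples:
            break
        subsets[size] = order[:size]
    return subsets


# Hypothetical pool of 3000 labeled emails.
subsets = nested_subsets(n_examples=3000)
for size, indices in subsets.items():
    print(size, len(indices))
```

Plain shuffling does not preserve class proportions; with very small subsets (50 or 100 examples) a stratified draw per intent label may be preferable.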

14. Result (1): Different Embeddings (feature-based)
• BERT fine-tuning has the best F1 score
• Among the feature-based approaches, GloVe performs slightly better
• Classical ML approaches such as LightGBM+TF-IDF underperform BERT fine-tuning

15. Result (2): Scaling Effect with Different Training Sample Sizes
• BERT fine-tuning outperforms all feature-based approaches when the training set has more than 300 examples
• When the training set is small (< 100 examples), BERT+Flair performs better
• To achieve an F1 score > 0.8, BERT fine-tuning needs at least 500 training examples, while the feature-based approach needs at least 2000 training examples

16. Introducing MLflow: Open machine learning platform
• Works with any ML library & language
• Runs the same way anywhere (e.g., any cloud)
• Designed to be useful for 1 or 1000+ person orgs
• Integrates with Databricks

17. MLflow Components
• Tracking: record and query experiments: code, configs, results, …etc
• Projects: packaging format for reproducible runs on any platform
• Models: general model format that supports diverse deployment tools

18. Key Concepts in Tracking
• Parameters: key-value inputs to your code
• Metrics: numeric values (can update over time)
• Artifacts: arbitrary files, including data and models
• Source: training code that ran
• Version: version of the training code
• Tags and Notes: any additional info
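The concepts above can be pictured as fields of a single run record. The toy class below is an illustration only: `RunRecord` and its methods are hypothetical names, not MLflow's API or implementation, and the logged numbers are placeholders rather than results from the study.

```python
from dataclasses import dataclass, field


@dataclass
class RunRecord:
    """Toy in-memory record of one experiment run (not MLflow's API)."""
    source: str = ""                                # training code that ran
    version: str = ""                               # version of that code
    params: dict = field(default_factory=dict)      # key-value inputs
    metrics: dict = field(default_factory=dict)     # name -> [(step, value), ...]
    artifacts: list = field(default_factory=list)   # paths of arbitrary files
    tags: dict = field(default_factory=dict)        # any additional info

    def log_param(self, key, value):
        self.params[key] = value

    def log_metric(self, key, value):
        # Metrics keep their full history, so a value can update over time.
        history = self.metrics.setdefault(key, [])
        history.append((len(history), float(value)))

    def latest_metric(self, key):
        return self.metrics[key][-1][1]


run = RunRecord(source="train.py", version="abc123")  # hypothetical values
run.log_param("train_size", 500)
run.log_metric("f1", 0.5)   # placeholder numbers, not study results
run.log_metric("f1", 0.6)
run.artifacts.append("tsne.png")
print(run.params, run.latest_metric("f1"))
```

The difference between the two dictionaries is the point: a parameter is written once per run, while a metric accumulates a history of values.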

19. MLflow Tracking: Example Code
Tracking: record and query experiments: code, configs, results, …etc
    import mlflow
    with mlflow.start_run():
        mlflow.log_param("layers", layers)
        mlflow.log_param("alpha", alpha)
        # train model
        mlflow.log_metric("mse", model.mse())
        mlflow.log_artifact("plot", model.plot(test_df))
        mlflow.tensorflow.log_model(model)

20. MLflow Models
[Diagram: ML Frameworks; Model Format (Flavor 1, Flavor 2), a standard for ML models; Inference Code; Batch & Stream Scoring; Serving Tools]

21. MLflow to Manage Hundreds of Experiments
• PyTorch models for the feature-based approach
  – Using the Flair framework
• TensorFlow for BERT fine-tuning
  – Using the bert-tensorhub framework

22. MLflow Tracking All Experiments

23. MLflow Logs Artifacts/Parameters/Metrics/Models
    mlflow.log_metric("micro_avg_f1_score_avg", np.asarray(test_scores).mean())
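The logged value is the mean of a list of micro-averaged F1 scores. As a rough sketch of what such a number involves, the code below computes micro-averaged F1 for single-label predictions and averages it over several evaluations. What `test_scores` held in the original experiments (folds, repeated runs, or something else) is not stated, so the two evaluations here are an assumption, and the labels are toy values.

```python
from statistics import mean


def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP and FN over all classes, then score."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = fp = fn = 0
    for label in set(y_true) | set(y_pred):
        for truth, pred in zip(y_true, y_pred):
            if pred == label and truth == label:
                tp += 1
            elif pred == label:
                fp += 1
            elif truth == label:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy labels using the deck's four intents (not data from the study).
y_true = ["positive", "objection", "unsubscribe", "not-sure"]
run_1 = ["positive", "objection", "objection", "not-sure"]   # one mistake
run_2 = ["positive", "objection", "unsubscribe", "not-sure"]  # all correct

test_scores = [micro_f1(y_true, run_1), micro_f1(y_true, run_2)]
avg_score = mean(test_scores)
print(test_scores, avg_score)
# The averaged value is what would then be passed to mlflow.log_metric.
```

For single-label classification, micro-averaged F1 coincides with accuracy, which is a useful sanity check when reading the reported scores.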

24. Images Can Be Logged as Artifacts
    mlflow.log_artifact(tSNE_img, 'run_{0}'.format(run_id))

25. Summary
• Transfer learning by fine-tuning BERT outperforms all feature-based approaches using different embeddings/pretrained LMs when the training set has more than 300 examples
• Pretrained language models solve the cold start problem when there is very little training data
  – E.g., with as few as 50 labeled examples, the F1 score reaches 0.67 with BERT+Flair using the feature-based approach
• However, reaching an F1 score > 0.8 may still require one to two thousand examples for a feature-based approach, or 500 examples for fine-tuning a pre-trained BERT language model
• MLflow proved useful and powerful for tracking all experiments

26. Future Work
• MLflow: from experimentation to production
  – Pick the best model for deployment
• Extend to cross-org transfer learning
  – Using data from one or multiple orgs for training and then applying the model to other orgs
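The cross-org idea can be framed as a leave-one-org-out evaluation: train on emails from all orgs but one, then test on the held-out org. The sketch below only builds the index splits; the function name `cross_org_splits` and the org identifiers are hypothetical, and this is not described in the deck as the planned protocol.

```python
def cross_org_splits(org_ids):
    """Yield (held_out_org, train_indices, test_indices) for each org.

    org_ids -- the org identifier of each labeled email, in dataset order
    """
    for held_out in sorted(set(org_ids)):
        train = [i for i, org in enumerate(org_ids) if org != held_out]
        test = [i for i, org in enumerate(org_ids) if org == held_out]
        yield held_out, train, test


# Hypothetical org assignment for five labeled emails.
org_ids = ["org_a", "org_a", "org_b", "org_c", "org_b"]
splits = list(cross_org_splits(org_ids))
for held_out, train, test in splits:
    print(held_out, train, test)
```

Because no email from the held-out org is seen in training, the score on each split estimates how well a model would carry over to an org with no labeled data of its own.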

27. Acknowledgements
• Outreach Data Science Team
• Databricks MLflow team