Deep Learning for Natural Language Processing Using Apache Spark and TensorFlow

When interacting with customers, the ability to extract relevant communication information in real time is key to success. This presentation illustrates how Salesforce uses Apache Spark and TensorFlow to monitor customer activity in real time and in depth. Long Short-Term Memory (LSTM) networks have proven to be an effective technique for achieving state-of-the-art results on a variety of natural language processing (NLP) tasks. When combined with word embedding models, they naturally capture the temporal and semantic information of human language.

1. Deep Learning for Natural Language Processing Using Apache Spark and TensorFlow
Alexis Roos – Director, Machine Learning – @alexisroos
Wenhao Liu – Senior Data Scientist, Activity Intelligence team

2.Agenda Introduction Email Classification Deep Learning Model Architecture TensorFrames/SparkDL Demo Wrap up

3.Forward-Looking Statement Statement under the Private Securities Litigation Reform Act of 1995 This presentation may contain forward-looking statements that involve risks, uncertainties, and assumptions. If any such uncertainties materialize or if any of the assumptions proves incorrect, the results of salesforce.com, inc. could differ materially from the results expressed or implied by the forward-looking statements we make. All statements other than statements of historical fact could be deemed forward-looking, including any projections of product or service availability, subscriber growth, earnings, revenues, or other financial items and any statements regarding strategies or plans of management for future operations, statements of belief, any statements concerning new, planned, or upgraded services or technology developments and customer contracts or use of our services. The risks and uncertainties referred to above include – but are not limited to – risks associated with developing and delivering new functionality for our service, new products and services, our new business model, our past operating losses, possible fluctuations in our operating results and rate of growth, interruptions or delays in our Web hosting, breach of our security measures, the outcome of any litigation, risks associated with completed and any possible mergers and acquisitions, the immature market in which we operate, our relatively limited operating history, our ability to expand, retain, and motivate our employees and manage our growth, new releases of our service and successful customer deployment, our limited history reselling non-salesforce.com products, and utilization and selling to larger enterprise customers. Further information on potential factors that could affect the financial results of salesforce.com, inc. is included in our annual report on Form 10-K for the most recent fiscal year and in our quarterly report on Form 10-Q for the most recent fiscal quarter. 
These documents and others containing important disclosures are available on the SEC Filings section of the Investor Information section of our Web site. Any unreleased services or features referenced in this or other presentations, press releases or public statements are not currently available and may not be delivered on time or at all. Customers who purchase our services should make the purchase decisions based upon features that are currently available. Salesforce.com, inc. assumes no obligation and does not intend to update these forward-looking statements.

4. Doing Well and Doing Good
• #1 The World's Best Workplaces
• Best Places to Work
• #1 Workplace for LGBTQ Equality
• The Future 50
• #1 for Giving Back
• The World's Most Innovative Companies
• Top 50 Companies that Care

5. Salesforce Keeps Getting Smarter with Einstein
• Guide Marketers: Einstein Engagement Scoring, Einstein Segmentation (pilot), Einstein Vision for Social
• Advise Retailers: Einstein Product Recommendations, Einstein Search Dictionaries, Einstein Predictive Sort
• Assist Service Agents: Einstein Bots (pilot), Einstein Agent (pilot), Einstein Vision for Field Service (pilot)
• Empower Admins & Developers: Einstein Prediction Builder (pilot), Einstein Vision & Language, Einstein Discovery
• Coach Sales Reps: Einstein Forecasting (pilot), Einstein Lead & Opportunity Scoring, Einstein Activity Capture
• Help Community Members: Einstein Answers (pilot), Community Sentiment (pilot), Einstein Recommendations
Austin Buchan, CEO, College Forward

6.Agenda Introduction Email Classification Deep Learning Model Architecture TensorFrames/SparkDL Demo Wrap up

7. Enhance CRM experience using AI and activity: email classification use case
Activity data (emails, meetings, tasks, calls, etc.) → Einstein AI → extract insights and suggest action(s)
• Insights: pricing discussed, executive involved, scheduling requested, angry email, competition mentioned, etc.
• Surfaced in: Inbox, activity timelines, other Salesforce apps, …

8.What types of emails do Sales users receive? • Emails from customers • Scheduling requests, pricing requests, competitor mentioned, etc. • Emails from coworkers • Marketing emails • Newsletters • Telecom, Spotify, iTunes, Amazon purchases • etc

9. Scheduling requests
We want to identify scheduling requests from customers. Three example emails:
• "Hi Alexis, Can we get together Thursday afternoon? Best, John"
• "Hello Wenhao, Can you send me that really important document? Thanks, Mark"
• "Welcome to Business review! Your subscription is active. Your next letter will be emailed on May 25th 2018."

10. Before scoring: filtering and parsing
Filtering criteria:
• Right language
• Automated vs. non-automated
• Inbound / outbound
• Within or outside the organization
• etc.
Example email, parsed into sections:
• HEADER INFORMATION: …
• INTRO: "Hey Alexis,"
• BODY: "Let's meet with Ascander on Friday to discuss the $10,000/year rate. Ascander's phone number is (123) 456-7890."
• SIGNATURE: "Thanks, Noah Bergman, Engineer at Salesforce, (123) 456-7890"
• CONFIDENTIALITY NOTICE: "The contents of this email and any attachments are confidential and are intended solely for addressee…"
• REPLY CHAIN: "From: Alexis alexis@salesforce.com / Date: April 1, 2017 / Subject: Important Document — Noah, how much does your product cost?"

11. "Basic" NLP text classifier
Steps:
• Normalize and tokenize
• Generate n-grams
• Remove stop words
• Compute TF with a minimum-count threshold to limit vocabulary size
• Compute IDF and filter n-grams based on an IDF threshold
Shortcomings:
• Lack of generalization, as the classifier is limited to tokens seen in the training data
• A bag of n-grams doesn't take ordering or sequences into account
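The steps above can be sketched in plain Python. This is a toy illustration of the tokenize → n-gram → stop-word → TF-IDF flow, not the production Scala/Spark pipeline; the stop-word list and thresholds are placeholders:

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "to", "is"}  # placeholder stop-word list

def tokenize(text):
    """Normalize to lowercase and keep alphabetic tokens."""
    return [t for t in text.lower().split() if t.isalpha()]

def ngrams(tokens, n=2):
    """Unigrams plus word n-grams, with stop words removed."""
    grams = list(tokens)
    grams += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return [g for g in grams if g not in STOP_WORDS]

def tf_idf(documents, min_tf=1):
    """Per-document TF-IDF weights, dropping terms below a count threshold."""
    doc_terms = [Counter(ngrams(tokenize(d))) for d in documents]
    n_docs = len(documents)
    df = Counter()
    for terms in doc_terms:
        df.update(terms.keys())
    weighted = []
    for terms in doc_terms:
        weighted.append({
            term: tf * math.log(n_docs / df[term])
            for term, tf in terms.items() if tf >= min_tf
        })
    return weighted

docs = ["can we meet thursday", "pricing for the thursday meeting"]
weights = tf_idf(docs)
```

A term present in every document (here "thursday") gets IDF log(1) = 0, which is exactly the "filter n-grams based on an IDF threshold" step: ubiquitous terms carry no signal.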

12. Word2Vec or GloVe
• Unsupervised learning algorithms for obtaining vector representations of words.
• Training is performed on aggregated global word-word co-occurrence statistics from a corpus.
• Word vectors for individual tokens capture their semantics.
word2VecModel.findSynonyms("cost", 5) → money, price, license, nominal, budget
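What `findSynonyms` does under the hood is a nearest-neighbour search in the embedding space. A minimal sketch, using made-up 3-d vectors (real GloVe/Word2Vec vectors are 50-300 dimensional):

```python
import math

# Toy 3-d embeddings, invented for illustration only
vectors = {
    "cost":   [0.9, 0.1, 0.0],
    "price":  [0.8, 0.2, 0.1],
    "budget": [0.7, 0.3, 0.0],
    "france": [0.0, 0.1, 0.9],
    "paris":  [0.1, 0.0, 0.8],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def find_synonyms(word, k):
    """Return the k words whose vectors are closest to `word` by cosine similarity."""
    query = vectors[word]
    scored = [(w, cosine(query, v)) for w, v in vectors.items() if w != word]
    return [w for w, _ in sorted(scored, key=lambda p: p[1], reverse=True)[:k]]

find_synonyms("cost", 2)  # -> ["price", "budget"]
```

With these toy vectors, "cost" lands near "price" and "budget" rather than "france", mirroring the slide's real example.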

13. High-level Architecture
Raw Emails → Preprocessing / Filtering → Filtered Text Emails → Feature Extraction (TF-IDF on n-grams, Word2Vec, LDA, or other) → ML models implemented in Scala/Spark
Our current machine learning pipeline is pure Scala/Spark, which has served us well.

14.Agenda Introduction Email Classification Deep Learning Model Architecture TensorFrames/SparkDL Demo Wrap up

15.What are Neural Networks: feed forward networks “Loose” brain inspiration: structure of cells

16. What are Neural Networks: recurrent networks
"I grew up in France… I speak fluent French."

17. LSTM
• RNNs suffer from vanishing or exploding gradients
• LSTM cells store and use memory across a sequence, controlled through gates and element-wise operations
• Designed to be chained into an RNN
https://medium.com/mlreview/understanding-lstm-and-its-diagrams-37e2f46f1714
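The gate mechanics can be shown in a few lines of plain Python. This is a single LSTM step with scalar state and arbitrary toy weights, purely to make the forget/input/output gating concrete (real cells use weight matrices and vector states):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """One LSTM step with scalar state (toy weights `w`, illustration only).

    f: forget gate, i: input gate, o: output gate, g: candidate value.
    """
    f = sigmoid(w["wf"] * x + w["uf"] * h_prev + w["bf"])
    i = sigmoid(w["wi"] * x + w["ui"] * h_prev + w["bi"])
    o = sigmoid(w["wo"] * x + w["uo"] * h_prev + w["bo"])
    g = math.tanh(w["wg"] * x + w["ug"] * h_prev + w["bg"])
    c = f * c_prev + i * g   # new cell state: keep part of old memory, write new
    h = o * math.tanh(c)     # new hidden state: gated view of the memory
    return h, c

# Arbitrary identical weights, just to run a short sequence through the cell
w = {k: 0.5 for k in ("wf", "uf", "bf", "wi", "ui", "bi",
                      "wo", "uo", "bo", "wg", "ug", "bg")}
h, c = 0.0, 0.0
for x in [1.0, -1.0, 0.5]:
    h, c = lstm_step(x, h, c, w)
```

Because the cell state `c` is updated additively (`f * c_prev + i * g`) rather than by repeated multiplication, gradients can flow across long sequences, which is the fix for the vanishing-gradient problem mentioned above.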

18.Agenda Introduction Email Classification Deep Learning Model Architecture TensorFrames/SparkDL Demo Wrap up

19. High Level Model Architecture
We present a "simple" BiLSTM model for text classification.
• Tokens are mapped into word embeddings (GloVe, pretrained on Wikipedia)
• The word embedding for each token is fed into both a forward and a backward recurrent network with LSTM (Long Short-Term Memory*) cells
• The "last" outputs of the forward and backward RNNs are concatenated and taken as input by the sigmoid unit for binary classification
* Hochreiter & Schmidhuber 1997

20. Detailed Considerations for the Model: dropout and regularization
• We applied dropout on the recurrent connections* and inputs…
• As well as L2 regularization on the model parameters.
trainable_vars = tf.trainable_variables()
regularization_loss = tf.reduce_sum(
    [tf.nn.l2_loss(v) for v in trainable_vars])
loss = original_loss + reg_weight * regularization_loss
* Gal & Ghahramani, NIPS 2016
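For readers unfamiliar with `tf.nn.l2_loss`, it computes sum(v²)/2 over a tensor. A framework-free sketch of the same penalty, with invented toy parameter values and a placeholder `reg_weight`:

```python
def l2_loss(values):
    """Mirror of tf.nn.l2_loss on a flat list: sum(v**2) / 2."""
    return sum(v * v for v in values) / 2.0

params = [[0.5, -0.5], [1.0]]   # toy "trainable variables"
reg_weight = 0.01               # placeholder hyperparameter
original_loss = 0.42            # placeholder training loss

regularization = sum(l2_loss(p) for p in params)
loss = original_loss + reg_weight * regularization
```

The penalty grows with the squared magnitude of every weight, so minimizing `loss` pushes the model toward smaller parameters, which is what curbs overfitting alongside dropout.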

21. Detailed Considerations for the Model: variable sequence lengths
Emails come in different lengths; some are extremely short while others are long:
• One-word email: "Thanks"
• Emails 800+ words long are also commonly seen in business email
Solution: dynamic_rnn + max length + sequence sampling
• tf.nn.dynamic_rnn (or tf.nn.bidirectional_dynamic_rnn) allows variable lengths for input sequences
tf.nn.dynamic_rnn(
    cell=lstm_cell,
    inputs=input_data,
    sequence_length=seq_len
)
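Before calling `dynamic_rnn`, each batch has to be padded to a rectangular tensor while the true lengths are kept for the `sequence_length` argument. A minimal sketch of that batching step; `MAX_LEN` and `PAD_ID` are placeholder values:

```python
PAD_ID = 0    # token id reserved for padding (assumption for this sketch)
MAX_LEN = 5   # placeholder cap; the talk mentions 800+-word emails

def pad_batch(sequences, max_len=MAX_LEN):
    """Truncate each id sequence to max_len, pad with PAD_ID, and record the
    true lengths (what tf.nn.dynamic_rnn's sequence_length expects)."""
    lengths = [min(len(s), max_len) for s in sequences]
    padded = [s[:max_len] + [PAD_ID] * (max_len - min(len(s), max_len))
              for s in sequences]
    return padded, lengths

batch, seq_len = pad_batch([[7], [3, 9, 4, 1, 8, 2, 6]])
# batch   -> [[7, 0, 0, 0, 0], [3, 9, 4, 1, 8]]
# seq_len -> [1, 5]
```

Passing `seq_len` lets the RNN stop at each sequence's real end instead of consuming the padding, so the "last" output fed to the classifier is the output at the true final token.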

22. Other Model Architectures Considered
We "settled" on the current architecture through lots of experiments and considerations.
• Single-direction RNN
• Single-direction RNN with GRU
• Single-direction RNN with LSTM
• Average pooling over outputs
• Max pooling over all outputs
• CNN on top of outputs
• …

23.Agenda Introduction Email Classification Deep Learning Model Architecture TensorFrames/SparkDL Demo Wrap up

24. Fitting a TensorFlow model into a Spark pipeline
(Same pipeline as before: Raw Emails → Preprocessing / Filtering → Filtered Text Emails → Feature Extraction (TF-IDF on n-grams, Word2Vec, LDA, or other) → ML models implemented in Scala/Spark.)
Our workflow around Spark is completely in the Scala/Spark stack:
• Train a SparkML model in the notebook environment and save it out
• At scoring time, load the pretrained SparkML model (part of a SparkML Pipeline) and call its transform method
Question: can we use a TF model as if it were a native Scala/Spark function?

25. Scala/Spark Pipeline + TensorFlow Model: TensorFrames / SparkDL as the interface
Raw Emails → Preprocessing / Filtering → Filtered Text Emails → encoded input
• Encoded input: a (batch size × sequence length) tensor of token ids, e.g. [[10 19853 3920 8425 43 … 18646] … [235 489 165638 46562 … 16516]]
• tf.nn.embedding_lookup maps each id through the (vocabulary size × embedding length) embedding matrix
• Result: a (batch size × sequence length × embedding length) tensor of word vectors, e.g. [[0.19853 0.3920 0.8646 0.459 … 0.1865] … [0.684 0.1894 0.1564 0.9874 … 0.354]]
* Shi Yan, Understanding LSTM and its diagrams
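The lookup step above is just an index into the embedding matrix. A plain-Python analogue of `tf.nn.embedding_lookup`, with a tiny invented matrix (2-d vectors instead of the real embedding length):

```python
# Toy embedding matrix: row i is the vector for token id i (values invented)
embedding_matrix = [
    [0.0, 0.0],   # id 0: padding
    [0.1, 0.9],   # id 1
    [0.4, 0.6],   # id 2
    [0.8, 0.2],   # id 3
]

def embedding_lookup(matrix, ids_batch):
    """Map a (batch, seq_len) tensor of token ids to a
    (batch, seq_len, embedding_dim) tensor of embedding rows."""
    return [[matrix[i] for i in seq] for seq in ids_batch]

batch_ids = [[1, 3, 0], [2, 2, 1]]
embedded = embedding_lookup(embedding_matrix, batch_ids)
# embedded[0][1] -> [0.8, 0.2]   (the row for token id 3)
```

In the real model the matrix is initialized from the pretrained GloVe vectors, so this lookup is where token ids turn into the word embeddings the BiLSTM consumes.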

26. TensorFrames turns a TensorFlow model into a UDF: Save → Load → Score
Save the model:
%python
graph_def = tfx.strip_and_freeze_until(["input_data", "predicted"], sess.graph, sess=sess)
tf.train.write_graph(graph_def, "/model", "model.pb", False)
Load the model:
%scala
val graph = new com.databricks.sparkdl.python.GraphModelFactory()
  .sqlContext(sqlContext)
  .fetches(asJava(Seq("prediction")))
  .inputs(asJava(Seq("input_data")), asJava(Seq("input_data")))
  .graphFromFile("/model/model.pb")
graph.registerUDF("model")
Score with the model:
%scala
val predictions = inputDataSet.selectExpr("InputData", "model(InputData)")

27.Agenda Introduction Email Classification Deep Learning Model Architecture TensorFrames/SparkDL Demo Wrap up

28.Agenda Introduction Email Classification Deep Learning Model Architecture TensorFrames/SparkDL Demo Wrap up

29. Lessons Learned
• A well-tuned LSTM model can outperform traditional ML approaches
• But data preparation is still needed and key to success
• Spark can play nicely with TensorFlow using TensorFrames as the interface
• We can do end-to-end work in a single notebook, mixing Spark/Scala with TF/Python
• The model outperforms our previous ML approach and is being productized