1. State of the Art Natural Language Processing at Scale • Alex Thomas, Data Scientist @ Indeed • David Talby, CTO @ Pacific AI • #DD4SAIS
2. CONTENTS ✓ NLU REAL-WORLD EXAMPLES ✓ DOCUMENT CLASSIFICATION - WALKTHROUGH ✓ STATE OF THE ART NLU IN HEALTHCARE ✓ TRAIN YOUR OWN DEEP LEARNING NLU MODELS
3. INTRODUCING SPARK NLP • Industrial Grade NLP for the Spark ecosystem • Design Goals: 1. Performance & Scale 2. Frictionless Reuse 3. Enterprise Grade • Built on top of the Spark ML APIs • Apache 2.0 licensed, with active development & support
4. NATIVE SPARK EXTENSION • High Performance Natural Language Understanding at Scale • Annotators and features: Part of Speech Tagger, Topic Modeling, Named Entity Recognition, Word2Vec, Sentiment Analysis, TF-IDF, Spell Checker, String distance calculation, Tokenizer, N-grams calculation, Stemmer, Stop word removal, Lemmatizer, Train/Test & Cross-Validate, Entity Extraction, Ensembles • Built on: Spark ML API (Pipeline, Transformer, Estimator), Spark SQL API (DataFrame, Catalyst Optimizer), Spark Core API (RDDs, Project Tungsten), Data Sources API
5. FRICTIONLESS REUSE
    pipeline = pyspark.ml.Pipeline(stages=[
        document_assembler, tokenizer, stemmer, normalizer,  # Spark NLP annotators
        stopword_remover, tf, idf,                           # Spark ML featurizers
        lda])                                                # Spark ML LDA implementation
    topic_model = pipeline.fit(df)  # Single execution plan for the given data frame
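The pipeline idea on this slide, a list of stages fitted once over the data, can be illustrated without Spark. What follows is a minimal pure-Python sketch of the fit/transform pattern, not the Spark ML or Spark NLP API: the stage classes (`Tokenize`, `Lowercase`, `RemoveFrequentTokens`), the `ToyPipeline` class and the sample sentences are all hypothetical and chosen only to show how stages that learn state and stateless stages chain in order.

```python
from collections import Counter


class Tokenize:
    """Stateless stage: split each document into whitespace tokens."""

    def fit(self, docs):
        return self  # nothing to learn

    def transform(self, docs):
        return [doc.split() for doc in docs]


class Lowercase:
    """Stateless stage: lowercase every token."""

    def fit(self, docs):
        return self

    def transform(self, docs):
        return [[tok.lower() for tok in doc] for doc in docs]


class RemoveFrequentTokens:
    """Stage with learned state: drop tokens seen in every training document."""

    def fit(self, docs):
        doc_freq = Counter(tok for doc in docs for tok in set(doc))
        self.stopwords = {tok for tok, n in doc_freq.items() if n == len(docs)}
        return self

    def transform(self, docs):
        return [[tok for tok in doc if tok not in self.stopwords] for doc in docs]


class ToyPipeline:
    """Hypothetical stand-in for a pipeline: an ordered list of stages."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, docs):
        # Each stage is fitted on the output of the stages before it.
        for stage in self.stages:
            docs = stage.fit(docs).transform(docs)
        return self

    def transform(self, docs):
        # Apply the already fitted stages in the same order.
        for stage in self.stages:
            docs = stage.transform(docs)
        return docs


train = ["The patient has a fever", "The patient reports chest pain"]
model = ToyPipeline([Tokenize(), Lowercase(), RemoveFrequentTokens()]).fit(train)
features = model.transform(["The patient has chest pain"])
```

The design point is that callers only see `fit` and `transform` on the whole pipeline, so swapping one stage for another does not change the calling code.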
6. Case study: Demand Forecasting of Admissions from ED • How many patients will be admitted today? • Data Source: EPIC Clarity data • Features from Structured Data: Reason for visit, Current wait time, Age, Number of orders, Gender, Admit in past 30 days, Vital signs, Type of insurance
7. Case study: Demand Forecasting of Admissions from ED • Features from Natural Language Text: Type of Pain, Symptoms, Intensity of Pain, Onset of symptoms, Body part or region, Attempted home remedy • A majority of the rich, relevant content lies in unstructured notes contributed by doctors and nurses from patient interactions. • Data Source: Emergency Department triage notes and other ED notes • [Chart: Accuracy for Baseline (human manual prediction), ML with structured data, and ML with NLP]
8. Risk prediction Case Study: Detecting Sepsis “Compared to previous work that only used structured data such as vital signs and demographic information, utilizing free text drastically improves the discriminatory ability (increase in AUC from 0.67 to 0.86) of identifying infection.”
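For readers unfamiliar with the metric in this quote: AUC is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one, with ties counted as one half. A minimal sketch of that definition follows; the function name `auc` and the toy labels and scores are illustrative and do not come from the sepsis study.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Toy data: 1 = infection, 0 = no infection (made-up values).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
toy_auc = auc(labels, scores)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which puts the reported move from 0.67 to 0.86 in context.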
9. Cohort selection Case Study: Oncology “Using the combination of structured and unstructured data, 8324 patients were identified as having advanced NSCLC. Of these patients, only 2472 were also in the cohort generated using structured data only. Further, 1090 patients who should have been excluded based on additional data, would be included in the structured data only cohort.”
10. CODE WALKTHROUGH: DOCUMENT CLASSIFICATION • A combined NLP & ML Pipeline • Word embeddings as features • Training your own custom NLP models github.com/melcutz/nlu_tutorial
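One common way to use word embeddings as features for document classification is to average the vectors of a document's tokens into a single fixed-length vector. The sketch below assumes that approach; the slide does not say which method the tutorial uses, and the 3-dimensional `embeddings` table and the `document_vector` helper are hypothetical.

```python
# Hypothetical 3-dimensional embeddings; real models use hundreds of dimensions.
embeddings = {
    "fever": [0.9, 0.1, 0.0],
    "cough": [0.8, 0.2, 0.0],
    "invoice": [0.0, 0.1, 0.9],
}


def document_vector(tokens, embeddings, dim=3):
    """Average the embeddings of known tokens; unknown tokens are skipped."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim  # no known token: fall back to a zero vector
    # zip(*vectors) groups the values of each dimension together.
    return [sum(col) / len(vectors) for col in zip(*vectors)]


doc_features = document_vector(["fever", "and", "cough"], embeddings)
```

The resulting vector has the same length for every document, so it can be passed to any standard classifier as a feature row.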
11. [Diagram] Annotators and tasks: Tokenizer, Normalizer, Lemmatizer, Fact Extraction, Part of Speech Tagger, Spell Checker, Coreference Resolution, Dependency Parser, Sentence Splitting, Negation Detection, Named Entity Recognition, Sentiment Analysis, Intent Classification, Summarization, Word Embeddings, Emotion Detection, Question Answering, Relevance Ranking, Best Next Action, Translation • Diagram labels: Different Vocabulary, Different Grammar, Different Context, Different Meaning, Different Language Models
12. Healthcare Extensions • High Performance Natural Language Understanding at Scale • com.johnsnowlabs.nlp.clinical.*: Healthcare specific NLP annotators for Spark in Scala, Java or Python: Entity Recognition, Value Extraction, Word Embeddings, Assertion Status, Sentiment Analysis, Spell Checking, … • data.johnsnowlabs.com/health: 1,800+ expert curated, clean, linked, enriched & always up to date data: Terminology, Providers, Demographics, Clinical Guidelines, Genes, Measures, … • Annotators and features: Part of Speech Tagger, Topic Modeling, Named Entity Recognition, Word2Vec, Sentiment Analysis, TF-IDF, Spell Checker, String distance calculation, Tokenizer, N-grams calculation, Stemmer, Stop word removal, Lemmatizer, Train/Test & Cross-Validate, Entity Extraction, Ensembles • Built on: Spark ML API (Pipeline, Transformer, Estimator), Spark SQL API (DataFrame, Catalyst Optimizer), Spark Core API (RDDs, Project Tungsten), Data Sources API
13. Named Entity Recognition
14. Deep Learning for NER
    F-Score   Dataset     Task
    85.81%    2010 i2b2   Medical concept extraction
    92.29%    2012 i2b2   Clinical event detection
    94.37%    2014 i2b2   De-identification
“Entity Recognition from Clinical Texts via Recurrent Neural Network”. Liu et al., BMC Medical Informatics & Decision Making, July 2017.
16. Deep Learning for Entity Resolution
    F-Score   Dataset        Task
    90.30%    ShARe / CLEF   Disease & problem norm.
    92.29%    NCBI           Disease norm. in literature
“CNN-based ranking for biomedical entity normalization”. Li et al., BMC Bioinformatics, October 2017.
17. Assertion Status Detection
• Positive: Prescribing sick days due to diagnosis of influenza.
• Speculative: Jane complains about flu-like symptoms.
• Negative: Jane’s RIDT came back clean.
• Conditional: Jane is at risk for flu if she’s not vaccinated.
• Family history: Jane’s older brother had the flu last month.
• Patient history: Jane had a severe case of flu last year.
18. Deep Learning for Assertion Status Detection
    Score     Metric              Dataset
    94.17%    Micro-averaged F1   4th i2b2/VA
    79.76%    Macro-averaged F1   4th i2b2/VA
“Improving Classification of Medical Assertions in Clinical Notes”. Kim et al., In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 2011.
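The gap between the two scores on this slide follows from how the averages are defined: micro-averaged F1 pools the counts of all classes, so frequent classes dominate, while macro-averaged F1 averages the per-class F1 scores, so a rare, poorly predicted class pulls it down. A minimal sketch, assuming made-up per-class counts (the class names and numbers below are illustrative and are not taken from the cited paper):

```python
def f1(tp, fp, fn):
    """F1 from true positives, false positives and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(counts):
    """counts maps class name -> (tp, fp, fn). Returns (micro F1, macro F1)."""
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1(*c) for c in counts.values()) / len(counts)
    # Micro: sum the counts over classes, then compute a single F1.
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return f1(tp, fp, fn), macro


# Hypothetical counts: two frequent classes and one rare, hard class.
counts = {
    "present": (90, 8, 5),
    "absent": (40, 5, 3),
    "conditional": (2, 3, 8),
}
micro, macro = micro_macro_f1(counts)
```

With these toy counts the micro average stays high while the macro average drops sharply, the same qualitative pattern as the two reported scores.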
19. USING SPARK NLP • Homepage: https://nlp.johnsnowlabs.com – Getting Started, Documentation, Examples, Videos, Blogs – Join the Slack Community • GitHub: https://github.com/johnsnowlabs/spark-nlp – Open Issues & Feature Requests – Contribute! • The library has Scala and Python 2 & 3 APIs • Get it directly from Maven Central or spark-packages • Tested on all Spark 2.x versions
20. THANK YOU! firstname.lastname@example.org email@example.com in/alnith/ in/davidtalby @davidtalby #DD4SAIS