Deep Learning for Entity Extraction

Entity extraction, also known as named-entity recognition (NER), entity chunking, and entity identification, is a subtask of information extraction whose goal is to detect phrases in text and classify them into predefined categories. While finding entities is useful on its own, it often serves as a preprocessing step for more complex tasks such as relation extraction. For example, biomedical entity extraction is a key step toward understanding interactions between different entity types, such as drug-disease or gene-protein relationships. Feature engineering for these tasks is typically complex and time-consuming; neural networks, however, can remove the need for feature engineering and take raw data as input.

1. Deep Learning for Domain-Specific Entity Extraction from Unstructured Text
Mohamed AbdelHady, Microsoft AI Platform
Zoran Dzunic, Microsoft AI Platform

2. Goals
• What is entity extraction?
• When to train a custom entity extraction model?
• What are word embeddings?
• How to train a custom word embedding model on a Spark cluster?
• How to train a custom deep neural network for entity extraction?

3-4. Entity Extraction
• Subtask of information extraction
• Also known as named-entity recognition (NER), entity chunking, and entity identification
• Find phrases in text that refer to real-world entities of specific types

Example: Zoran and Mohamed are at Spark+AI Summit in San Francisco.
Zoran : PERSON
Mohamed : PERSON
Spark+AI Summit : ORG
San Francisco : LOC

5. Biomedical Entity Extraction
• Entity types: drug/chemical, disease, protein, DNA, etc.
• Critical step for complex biomedical NLP tasks:
  – Extraction of diseases and symptoms from electronic medical or health records
  – Understanding the interactions between different entity types, such as drug-drug interactions, drug-disease relationships, and gene-protein relationships, e.g.,
    • Drug A cures Disease B.
    • Drug A causes Disease B.
• Similar for other domains (e.g., legal, finance)

6. Biomedical Entity Extraction (figure)

7. Demo: https://medicalentitydemo.azurewebsites.net

8. Approach
1. Feature Extraction Phase (domain-specific features): use a large amount of unlabeled domain-specific data, such as Medline/PubMed abstracts, to train a neural word embedding model.
2. Model Training Phase (domain-specific model): the output embeddings are treated as automatically generated features to train a neural entity extractor with a small/reasonable amount of labeled data.

9. Word Embedding
A semantic, continuous representation of words.

11. Input: Words
Words: Naloxone  reverses  the  ...
Tags:  B-Chemical  O  O

12. Features: Word Embeddings
Tags:       B-Chemical          O                   O
Embeddings: [0.3, 0.2, 0.9 …]   [0.8, 0.8, 0.1 …]   [0.5, 0.1, 0.5 …]
The embedding dimension is small (e.g., 50-200).
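To make the alignment between tokens, BIO tags, and embedding features on slides 11-12 concrete, here is a toy Python sketch; the 4-dimensional vectors and the lookup table are invented for illustration (the actual features are 50-200-dimensional PubMed embeddings).

```python
# Toy illustration of the per-token representation: each token carries a BIO tag,
# and its feature vector is looked up from a word-embedding table.
# The vectors below are made-up 4-dimensional examples, not real PubMed embeddings.
tokens = ["Naloxone", "reverses", "the"]
tags = ["B-Chemical", "O", "O"]

embeddings = {  # hypothetical lookup table; in practice produced by word2vec
    "Naloxone": [0.3, 0.2, 0.9, 0.1],
    "reverses": [0.8, 0.8, 0.1, 0.4],
    "the":      [0.5, 0.1, 0.5, 0.7],
}

features = [embeddings[t] for t in tokens]  # one dense vector per token
for token, tag, vec in zip(tokens, tags, features):
    print(f"{token:10s} {tag:11s} {vec}")
```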

13. Embeddings (figure)

14. Custom Word Embeddings
• Publicly available pre-trained models, such as Google News
• Can we do better on a specific domain?
• We trained a word embedding model for the biomedical domain on 27 million PubMed abstracts (22 GB)
• Azure HDInsight Spark cluster, 11 worker nodes
• Spark MLlib Word2Vec (see the sketch below)
• Trained in ~30 min
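A minimal PySpark sketch of this embedding-training step is below. The input path and column names are placeholders; only the rough settings (small vector size, 11 worker nodes) follow the slide.

```python
# Train a domain-specific word2vec model with Spark MLlib on tokenized PubMed abstracts.
from pyspark.sql import SparkSession
from pyspark.ml.feature import Word2Vec

spark = SparkSession.builder.appName("pubmed-word2vec").getOrCreate()

# Hypothetical pre-tokenized corpus: one row per abstract, column "tokens" = array<string>.
abstracts = spark.read.parquet("wasb:///pubmed/tokenized_abstracts")

word2vec = Word2Vec(
    vectorSize=200,    # small embedding dimension, e.g. 50-200
    minCount=5,        # drop very rare tokens
    windowSize=5,
    numPartitions=11,  # roughly one partition per worker node
    inputCol="tokens",
    outputCol="embedding",
)
model = word2vec.fit(abstracts)

# Persist the learned word vectors so the entity extractor can reuse them as features.
model.getVectors().write.mode("overwrite").parquet("wasb:///pubmed/word_vectors")
```

After training, calling `model.findSynonyms("naloxone", 5)` on an in-vocabulary token is a quick sanity check that the embeddings capture biomedical semantics.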

15. DNNs for Entity Extraction

17. Why Deep Learning?

18. DNN Architecture
• Keras with TensorFlow
• GPU-enabled Azure Data Science VM (DSVM), NC6 Standard (56 GB RAM, NVIDIA Tesla K80), or Deep Learning VM (DLVM)
• Parameters (see the sketch below):
  – # recurrent units = 150
  – dropout rate = 0.2
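As a concrete reading of these bullets, here is a hedged Keras sketch of a bidirectional LSTM tagger; only the 150 recurrent units and the 0.2 dropout rate come from the slide, while the layer stack, vocabulary size, tag count, and loss are illustrative assumptions.

```python
# Sketch of an LSTM-based entity tagger in Keras/TensorFlow.
# VOCAB_SIZE, EMB_DIM, and N_TAGS are placeholders supplied by data preparation;
# in the talk the Embedding layer would be initialized with the PubMed word2vec
# vectors rather than left randomly initialized.
from tensorflow.keras import layers, models

VOCAB_SIZE, EMB_DIM, N_TAGS = 50_000, 200, 11

model = models.Sequential([
    layers.Embedding(VOCAB_SIZE, EMB_DIM, mask_zero=True),               # word-embedding features
    layers.Bidirectional(layers.LSTM(150, return_sequences=True)),       # 150 recurrent units (slide)
    layers.Dropout(0.2),                                                  # dropout rate = 0.2 (slide)
    layers.TimeDistributed(layers.Dense(N_TAGS, activation="softmax")),  # one BIO tag per token
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```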

19. Results

20. Datasets
• Proteins, cell line, cell type, DNA and RNA detection: Bio-Entity Recognition Task at BioNLP/NLPBA 2004 - http://www.nactem.ac.uk/tsujii/GENIA/ERtask/report.html
• Chemicals and diseases detection: BioCreative V CDR task corpus - http://www.biocreative.org/tasks/biocreative-v/track-3-cdr/
• Drugs detection: SemEval 2013, Task 9.1 (Drug Recognition) - https://www.cs.york.ac.uk/semeval-2013/task9/

21. Dataset Description (table)
http://www.biocreative.org/tasks/biocreative-v/track-3-cdr/

22. Experimental Setup
• Azure ML Python Package for Text Analytics
  https://docs.microsoft.com/en-us/python/api/overview/azure-machine-learning/textanalytics
  https://aka.ms/aml-packages/text/download

23. Conditional Random Fields (CRF)
CRFsuite:
• Extract traditional features
• Train a CRF model
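A minimal sketch of such a CRF baseline with the sklearn-crfsuite wrapper around CRFsuite is shown below; the feature template and hyperparameters are illustrative assumptions, not the exact ones used in the talk.

```python
# CRF baseline: hand-crafted ("traditional") token features plus an
# L-BFGS-trained linear-chain CRF via sklearn-crfsuite.
import sklearn_crfsuite

def word_features(sent, i):
    """Traditional surface features for token i of a tokenized sentence."""
    word = sent[i]
    feats = {
        "lower": word.lower(),
        "isupper": word.isupper(),
        "istitle": word.istitle(),
        "isdigit": word.isdigit(),
        "suffix3": word[-3:],
    }
    if i > 0:
        feats["prev_lower"] = sent[i - 1].lower()
    else:
        feats["BOS"] = True
    if i < len(sent) - 1:
        feats["next_lower"] = sent[i + 1].lower()
    else:
        feats["EOS"] = True
    return feats

# Tiny placeholder training data (token sequences and their BIO tags).
X_sents = [["Naloxone", "reverses", "the", "antihypertensive", "effect"]]
y_train = [["B-Chemical", "O", "O", "O", "O"]]
X_train = [[word_features(s, i) for i in range(len(s))] for s in X_sents]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X_train, y_train)
print(crf.predict(X_train))
```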

24. Results (exact match)

Algorithm + Features                  Recall  Precision  F-score
Dictionary Lookup                     64%     74%        68%
CRF: Traditional Features             61%     81%        70%
CRF: PubMed Embedding                 40%     61%        48%
CRF: Traditional + PubMed Embedding   65%     80%        71%
LSTM: PubMed Embedding                76%     77%        76%
LSTM: Generic Embeddings              74%     63%        67%

25-26. Embedding Comparison (figures)

27. Takeaways
• Recipe for building a custom entity extraction pipeline:
  – Get a large amount of in-domain unlabeled data
  – Train a word2vec model on the unlabeled data on Spark
  – Get as much labeled data as possible
  – Train an LSTM-based neural network on a GPU-enabled machine
• Word embeddings are powerful features
  – Convey word semantics
  – Perform better than traditional features
  – No feature engineering
• An LSTM neural network is a more powerful model than a traditional CRF

28. Questions