Nouns Are Better Than N-Grams
1 .Asoka Diggs - Data Scientist June 2018
2 .Legal Notices
This presentation is for informational purposes only. INTEL MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY.
For more complete information about performance and benchmark results, visit www.intel.com/benchmarks
Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.
* Other names and brands may be claimed as the property of others.
Copyright © 2018, Intel Corporation. All rights reserved.
Information Technology
3 .About Me
- Lots of database design and architecture experience
- Text analytics class during my Master’s degree in Predictive Analytics
- Preparing text for analysis has been a theme of just about every project I’ve been involved in for the last 4+ years
4 .Agenda
- Overview of using part-of-speech tagging to extract tokens from text
- Python notebooks showing some different extraction approaches
  – Comparison of standard pipeline tokens to noun phrase extraction
- Pseudo code example of scaling the method to a Spark cluster
- Q&A
5 .Begin With the End in Mind
- A part-of-speech based token extraction pipeline for text:
  – Can be straightforward to implement
  – Generates tokens that are more human readable
  – Can find phrases naturally (a replacement for n-gram generation)
  – Removes many stop words “for free”
- Lemmatizing or singularizing terms can collapse terms that differ only in being singular or plural
6 .Typical pipeline
- Tokenize the text
- Convert all tokens to lower case
- Apply a stemmer to reduce tokens to common roots
  – E.g. boil, boils, boiler, boiled, boiling → boil
- Remove stop words
  – Common English words (the, of, to, in, and, or, which)
- Might also include n-gram generation
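The steps on this slide can be sketched in a few lines of standard-library Python. This is a minimal illustration and not the code from the talk's notebooks: `toy_stem`, `standard_pipeline`, the suffix rules, and the short stop word list are all assumptions made for the example. A real pipeline would typically use a proper stemmer and stop word list from a library such as NLTK.

```python
import re

# Tiny illustrative stop word list (assumption); real lists are much longer.
STOP_WORDS = {"the", "of", "to", "in", "and", "or", "which", "a", "is"}

# Suffixes stripped by the toy stemmer, longest first (hypothetical rule set).
SUFFIXES = ("ing", "ed", "er", "s")


def toy_stem(token):
    """Strip one common suffix, keeping at least three characters of root."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def standard_pipeline(text, n=2):
    """Tokenize, lowercase, stem, drop stop words, then append n-grams."""
    tokens = re.findall(r"[A-Za-z]+", text)           # tokenize
    lowered = [t.lower() for t in tokens]              # lower case
    stems = [toy_stem(t) for t in lowered]             # stem
    kept = [s for s in stems if s not in STOP_WORDS]   # remove stop words
    # Generate n-grams over the surviving stems.
    ngrams = [" ".join(kept[i:i + n]) for i in range(len(kept) - n + 1)]
    return kept + ngrams


print(standard_pipeline("The boiler is boiling the water"))
# ['boil', 'boil', 'wat', 'boil boil', 'boil wat']
```

Even on this short sentence the output hints at the readability problem: "water" is cut down to "wat", and the bigrams are built from stems instead of words a business user would recognize.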
7 .Challenge
- The standard pipeline for preparing text for analysis yields tokens that can be of low value
- These tokens can be difficult to explain to business users: what do the tokens mean?
- The pipeline relies on many rules for handling text, and text (nearly) always has exceptions to each rule
8 .Solution
- Replace the standard tokenize-stem-stop-n_gram pipeline with a part-of-speech based pipeline
- Extract words and phrases with the desired parts of speech
- Use singularization as an alternative to stemming to bring singular and plural forms of words together
- The result is not perfect, but it is of better quality than the tokenization approach
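As a rough sketch of the extraction step, the function below keeps runs of adjectives and nouns that end in a noun. It assumes the text has already been tagged with Penn Treebank tags by a part-of-speech tagger (for example one of the libraries listed under Resources and Links); the input here is tagged by hand. The name `extract_noun_phrases` and the adjective-plus-noun rule are assumptions for illustration, not necessarily the rule used in the talk.

```python
# Penn Treebank tags kept in a phrase: nouns, plus adjectives as modifiers.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}
MODIFIER_TAGS = {"JJ", "JJR", "JJS"}


def extract_noun_phrases(tagged_tokens):
    """Return runs of adjective/noun tokens that end in a noun."""
    phrases, run = [], []
    # A sentinel with an empty tag flushes the final run.
    for word, tag in list(tagged_tokens) + [("", "")]:
        if tag in NOUN_TAGS or tag in MODIFIER_TAGS:
            run.append((word.lower(), tag))
            continue
        # Trim trailing adjectives so each phrase ends in a noun.
        while run and run[-1][1] not in NOUN_TAGS:
            run.pop()
        if run:
            phrases.append(" ".join(w for w, _ in run))
        run = []
    return phrases


# Hand-tagged example sentence (the tags are an assumption, not tagger output).
tagged = [
    ("The", "DT"), ("text", "NN"), ("analytics", "NNS"), ("class", "NN"),
    ("covered", "VBD"), ("noisy", "JJ"), ("documents", "NNS"),
    ("and", "CC"), ("was", "VBD"), ("useful", "JJ"), (".", "."),
]
print(extract_noun_phrases(tagged))
# ['text analytics class', 'noisy documents']
```

Determiners, conjunctions, and verbs never enter a phrase because their tags are not in the kept sets, so no separate stop word list is needed, and multi-word phrases appear without generating n-grams.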
9 .Benefit
The resulting tokens naturally:
- Include phrases (n-gram replacement)
- Exclude the standard English stop words, whose parts of speech aren’t typically used in analysis
- Support readability and don’t need stemming to collapse similar words to a common root
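To illustrate why singularizing can stand in for stemming, here is a deliberately naive singularizer. The rules, the `IRREGULAR` table, and the function names are hypothetical and cover only a handful of English patterns; a library singularizer, such as the one Pattern is understood to provide, handles far more cases.

```python
# Irregular plurals handled by lookup (a small, hypothetical sample).
IRREGULAR = {"children": "child", "people": "person", "analyses": "analysis"}


def naive_singularize(word):
    """Map a plural English noun to a singular form using a few rules."""
    lower = word.lower()
    if lower in IRREGULAR:
        return IRREGULAR[lower]
    if lower.endswith("ies") and len(lower) > 4:
        return lower[:-3] + "y"            # queries -> query
    if lower.endswith(("ches", "shes", "sses", "xes")):
        return lower[:-2]                  # boxes -> box
    if lower.endswith("s") and not lower.endswith(("ss", "us", "is")):
        return lower[:-1]                  # boilers -> boiler
    return lower


def singularize_phrase(phrase):
    """Singularize only the head (last) word of a noun phrase."""
    words = phrase.split()
    if not words:
        return ""
    return " ".join(words[:-1] + [naive_singularize(words[-1])])


print(singularize_phrase("noisy documents"))   # noisy document
print(naive_singularize("boilers"))            # boiler
```

Unlike a stemmer, this collapses "boiler" and "boilers" into one token while leaving the word itself readable and distinct from "boil".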
11 .Resources and Links
- Pattern
  – For Python* 3 compatibility, I had to install a development branch of the library. See (link) for details
  – Something like:
      git clone -b development https://github.com/clips/pattern
      cd pattern
      sudo python setup.py install
- NLTK
- Penn-Treebank (on Wikipedia) and Tags
- Intel® Distribution for Python* (link)
- Others: RDRPOSTagger, spaCy, rakutenma, TextBlob
13 .IT@INTEL: Sharing Intel IT Best Practices With the World
Learn more about Intel IT’s initiatives at: www.intel.com/IT