Nouns优于 N-Grams

此备选工作流替换标准工作流,同时从文本中生成表示文本中名词的标记或特征列表,并且准备用作进一步分析的特征。这些名词和名词短语特征在分类类型分析中比词干标记更好,能够服务于N-gram服务的许多相同目的,同时也避免或消除了与标准工作流和N-gram相关的许多问题。该方法的主要限制是POS标签和解释在计算上是强烈的。然而,我们已经实现了我们的解决方案的规模已扩大到100的GB的文本,并相信这将合理地扩展到低TB范围内没有硬件的变化。
展开查看详情

1.Asoka Diggs - Data Scientist June 2018

2.Legal Notices This presentation is for informational purposes only. INTEL MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY. For more complete information about performance and benchmark results, visit www.intel.com/benchmarks Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries. * Other names and brands may be claimed as the property of others. Copyright © 2018, Intel Corporation. All rights reserved. Information Technology 2

3.About Me  Lots of database design and architecture experience  Text analytics class during my Master’s degree in Predictive Analytics  Preparing text for analysis has been a theme of just about every project I’ve been involved in for the last 4+ years Information Technology

4.Agenda  Overview of using part-of-speech tagging to extract tokens from text  Python notebooks showing some different extraction approaches – Comparison of standard pipeline tokens to noun phrase extraction  Pseudo code example of scaling the method to a Spark cluster  Q&A Information Technology

5.Begin With the End in Mind  A part-of-speech based token extraction pipeline for text: – Can be straightforward to implement – Generates tokens that are more human readable – Can find phrases naturally (replacement for n-grams generation) – Removes many stopwords “for free”  Lemmatizing or singularizing terms can collapse terms that are different because of plural / singular differences Information Technology

6.Typical pipeline  Tokenize the text  Convert all tokens to lower case  Apply a stemmer to reduce tokens to common roots – E.g. boil, boils, boiler, boiled, boiling  boil  Remove stop words – Common English words (the, of, to, in, and, or, which)  Might also include n-gram generation Information Technology

7.Challenge  Standard pipeline for preparing text for analysis yields tokens that can be of low value  These tokens can be difficult to explain to business users – what do the tokens mean?  Uses many rules for handling text, and text (nearly) always has exceptions to each rule. Information Technology

8.Solution  Replace the standard tokenize-stem-stop-n_gram pipeline with a part-of- speech based pipeline  Extract words and phrases with the desired parts-of-speech  Use singularization as a different approach to stemming to bring singular and plural forms of words together Result is not perfect, but better quality than the tokenization approach. Information Technology

9.Benefit The resulting tokens naturally:  Include phrases (n-gram replacement)  Remove the standard English stop words. They have a part-of-speech that isn’t typically used in analysis  Supports readability and don’t need stemming to collapse similar words to a common root Information Technology

10.Information Technology

11.Resources and Links  Pattern – For Python* 3 compatibility, I had to install a development branch of the library. See (link) for details – Something like: git clone –b development https://github.com/clips/pattern cd pattern sudo python setup.py install  NLTK  Penn-Treebank (on Wikipedia) and Tags  Intel® Distribution for Python* (link)  Others: RDRPOSTagger, spaCy, rakutenma, TextBlob Information Technology

12.Information Technology

13.IT@INTEL: Sharing Intel IT Best Practices With the World Learn more about Intel IT’s initiatives at: www.intel.com/IT Information Technology

14.