Nouns Are Better Than N-Grams
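This alternative workflow replaces the standard workflow. It generates from the text a list of tokens, or features, that represent the nouns in the text and are ready to use as features in further analysis. These noun and noun-phrase features work better than stemmed tokens in classification-type analyses. They serve many of the same purposes as n-grams while avoiding or eliminating many of the problems associated with the standard workflow and with n-grams. The main limitation of the approach is that POS tagging and interpretation are computationally intensive. However, our implementation has already scaled to hundreds of GB of text, and we believe it will extend reasonably into the low-TB range without hardware changes.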

1 .Asoka Diggs - Data Scientist June 2018

2 .Legal Notices
This presentation is for informational purposes only. INTEL MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY.
For more complete information about performance and benchmark results, visit www.intel.com/benchmarks
Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.
* Other names and brands may be claimed as the property of others.
Copyright © 2018, Intel Corporation. All rights reserved.
Information Technology

3 .About Me
• Lots of database design and architecture experience
• Text analytics class during my Master’s degree in Predictive Analytics
• Preparing text for analysis has been a theme of just about every project I’ve been involved in for the last 4+ years

4 .Agenda
• Overview of using part-of-speech tagging to extract tokens from text
• Python notebooks showing some different extraction approaches
  – Comparison of standard pipeline tokens to noun phrase extraction
• Pseudo code example of scaling the method to a Spark cluster
• Q&A

5 .Begin With the End in Mind
• A part-of-speech based token extraction pipeline for text:
  – Can be straightforward to implement
  – Generates tokens that are more human readable
  – Can find phrases naturally (replacement for n-grams generation)
  – Removes many stopwords “for free”
• Lemmatizing or singularizing terms can collapse terms that are different because of plural / singular differences

6 .Typical pipeline
• Tokenize the text
• Convert all tokens to lower case
• Apply a stemmer to reduce tokens to common roots
  – E.g. boil, boils, boiler, boiled, boiling → boil
• Remove stop words
  – Common English words (the, of, to, in, and, or, which)
• Might also include n-gram generation
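The steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the code from the talk's notebooks: the stop word list is a tiny sample, and `naive_stem` is a deliberately crude stand-in for a real stemmer (for example NLTK's `PorterStemmer`), used only to keep the sketch free of dependencies.

```python
import re

# Tiny illustrative stop word list; real pipelines use a much longer one.
STOP_WORDS = {"the", "of", "to", "in", "and", "or", "which"}


def naive_stem(token):
    """Strip a few common suffixes.

    A crude stand-in for a real stemmer, not an implementation of
    the Porter algorithm.
    """
    for suffix in ("ing", "ed", "er", "s"):
        # Keep at least three characters of the root.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def standard_pipeline(text, n=2):
    """Tokenize, lower-case, stem, drop stop words, then build n-grams."""
    # 1. Tokenize on runs of letters and 2. convert to lower case.
    tokens = [t.lower() for t in re.findall(r"[A-Za-z]+", text)]
    # 3. Reduce each token to a (rough) common root.
    stems = [naive_stem(t) for t in tokens]
    # 4. Remove stop words.
    stems = [s for s in stems if s not in STOP_WORDS]
    # 5. Optional n-gram generation over the remaining stems.
    ngrams = [" ".join(stems[i:i + n]) for i in range(len(stems) - n + 1)]
    return stems, ngrams


stems, ngrams = standard_pipeline("The boiler boiled the water in the kettle")
print(stems)   # ['boil', 'boil', 'wat', 'kettle']
print(ngrams)  # ['boil boil', 'boil wat', 'wat kettle']
```

With this crude stemmer, "water" is cut down to "wat", and the bigrams pair up whatever stems happen to be adjacent. Tokens like these are the kind that are hard to explain to business users.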

7 .Challenge
• Standard pipeline for preparing text for analysis yields tokens that can be of low value
• These tokens can be difficult to explain to business users – what do the tokens mean?
• Uses many rules for handling text, and text (nearly) always has exceptions to each rule.

8 .Solution
• Replace the standard tokenize-stem-stop-n_gram pipeline with a part-of-speech based pipeline
• Extract words and phrases with the desired parts-of-speech
• Use singularization as a different approach to stemming to bring singular and plural forms of words together
The result is not perfect, but it is better quality than the tokenization approach.
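The part-of-speech based approach can be sketched as follows. This is a minimal illustration under stated assumptions, not the talk's implementation: it assumes the text has already been tagged with Penn Treebank tags by some tagger (NLTK, Pattern, and spaCy are among the libraries listed in the resources), it treats a phrase as a run of adjectives and nouns ending in a noun, and `naive_singularize` is a rough stand-in for a library singularizer.

```python
# Penn Treebank tags for nouns and adjectives.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}
ADJ_TAGS = {"JJ", "JJR", "JJS"}


def naive_singularize(word):
    """Very rough plural-to-singular rule; a stand-in for a library function."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def noun_phrases(tagged_tokens):
    """Return maximal adjective/noun runs that end in a noun.

    tagged_tokens is an iterable of (word, penn_treebank_tag) pairs.
    """
    phrases, run = [], []
    # A sentinel with an empty tag flushes the final run.
    for word, tag in list(tagged_tokens) + [("", "")]:
        if tag in NOUN_TAGS or tag in ADJ_TAGS:
            word = word.lower()
            if tag in ("NNS", "NNPS"):
                # Collapse plural nouns onto their singular form.
                word = naive_singularize(word)
            run.append((word, tag))
            continue
        # The run has ended: trim trailing adjectives so it ends on a noun.
        while run and run[-1][1] in ADJ_TAGS:
            run.pop()
        if run:
            phrases.append(" ".join(w for w, _ in run))
        run = []
    return phrases


tagged = [
    ("The", "DT"), ("large", "JJ"), ("boilers", "NNS"), ("of", "IN"),
    ("the", "DT"), ("power", "NN"), ("plant", "NN"),
    ("failed", "VBD"), ("quickly", "RB"),
]
print(noun_phrases(tagged))  # ['large boiler', 'power plant']
```

Determiners and prepositions such as "the" and "of" never enter a run, so they drop out without a stop word list, and multi-word phrases come out whole without a separate n-gram step.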

9 .Benefit
The resulting tokens naturally:
• Include phrases (n-gram replacement)
• Remove the standard English stop words. They have a part-of-speech that isn’t typically used in analysis
• Support readability and don’t need stemming to collapse similar words to a common root
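The agenda mentions a pseudo code example of scaling the method to a Spark cluster, which is not reproduced in these slides. One plausible shape for it, offered only as a sketch: because POS tagging is the expensive step, build the tagger once per partition and stream documents through it. The names `extract_partition`, `make_tagger`, and `extract_features` are hypothetical, and the Spark wiring is shown in comments so the function itself runs as plain Python.

```python
def extract_partition(rows, make_tagger, extract_features):
    """Turn one partition of (doc_id, text) rows into (doc_id, features).

    make_tagger and extract_features are hypothetical callables supplied
    by the caller: make_tagger() builds a POS tagger, the tagger maps a
    text to (word, tag) pairs, and extract_features maps those pairs to
    a list of tokens.
    """
    # Build the tagger once per partition, not once per document, so any
    # expensive model load is amortized over many rows.
    tagger = make_tagger()
    for doc_id, text in rows:
        yield doc_id, extract_features(tagger(text))


# On a cluster this could be wired up roughly as follows, where docs is
# an RDD of (doc_id, text) pairs (placeholder name):
#   features = docs.mapPartitions(
#       lambda rows: extract_partition(rows, make_tagger, extract_features))
```

Written as a generator, the function never holds a whole partition of results in memory at once.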


11 .Resources and Links
• Pattern
  – For Python* 3 compatibility, I had to install a development branch of the library. See (link) for details
  – Something like:
      git clone -b development https://github.com/clips/pattern
      cd pattern
      sudo python setup.py install
• NLTK
• Penn-Treebank (on Wikipedia) and Tags
• Intel® Distribution for Python* (link)
• Others: RDRPOSTagger, spaCy, rakutenma, TextBlob


13 .IT@INTEL: Sharing Intel IT Best Practices With the World
Learn more about Intel IT’s initiatives at: www.intel.com/IT
