Stanford University lecture slides: Text Classification and Naïve Bayes.

Published by Daniel on 2018/06/02 00:00

1. Text Classification and Naïve Bayes: The Task of Text Classification

2.Is this spam?

3. Who wrote which Federalist papers? 1787-8: anonymous essays (by Jay, Madison, and Hamilton) tried to convince New York to ratify the U.S. Constitution. Authorship of 12 of the letters was in dispute. 1963: solved by Mosteller and Wallace using Bayesian methods. (James Madison / Alexander Hamilton)

4. Male or female author? "By 1925 present-day Vietnam was divided into three parts under French colonial rule. The southern region embracing Saigon and the Mekong delta was the colony of Cochin-China; the central area with its imperial capital at Hue was the protectorate of Annam…" "Clara never failed to be astonished by the extraordinary felicity of her own name. She found it hard to trust herself to the mercy of fate, which had managed over the years to convert her greatest shame into one of her greatest assets…" S. Argamon, M. Koppel, J. Fine, A. R. Shimoni, 2003. "Gender, Genre, and Writing Style in Formal Written Texts," Text, volume 23, number 3, pp. 321-346

5. Positive or negative movie review? "unbelievably disappointing" / "Full of zany characters and richly applied satire, and some great plot twists; this is the greatest screwball comedy ever filmed" / "It was pathetic. The worst part about it was the boxing scenes."

6. What is the subject of this article? A MEDLINE article is assigned to the MeSH Subject Category Hierarchy: Antagonists and Inhibitors, Blood Supply, Chemistry, Drug Therapy, Embryology, Epidemiology, …

7.Text Classification Assigning subject categories, topics, or genres Spam detection Authorship identification Age/gender identification Language Identification Sentiment analysis …

8. Text Classification: definition. Input: a document d and a fixed set of classes C = {c1, c2, …, cJ}. Output: a predicted class c ∈ C

9. Classification Methods: Hand-coded rules. Rules based on combinations of words or other features, e.g. spam: black-list-address OR ("dollars" AND "have been selected"). Accuracy can be high if the rules are carefully refined by an expert, but building and maintaining these rules is expensive.

10. Classification Methods: Supervised Machine Learning. Input: a document d; a fixed set of classes C = {c1, c2, …, cJ}; a training set of m hand-labeled documents (d1,c1),…,(dm,cm). Output: a learned classifier γ: d → c

11. Classification Methods: Supervised Machine Learning. Any kind of classifier: Naïve Bayes, logistic regression, support-vector machines, k-nearest neighbors, …

13. Text Classification and Naïve Bayes: Naïve Bayes (I)

14. Naïve Bayes Intuition. A simple ("naïve") classification method based on Bayes rule. Relies on a very simple representation of the document: bag of words.

15. The bag of words representation. "I love this movie! It's sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun… It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet." γ(·) = c

17.The bag of words representation: using a subset of words x love xxxxxxxxxxxxxxxx sweet xxxxxxx satirical xxxxxxxxxx xxxxxxxxxxx great xxxxxxx xxxxxxxxxxxxxxxxxxx fun xxxx xxxxxxxxxxxxx whimsical xxxx romantic xxxx laughing xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx recommend xxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx several xxxxxxxxxxxxxxxxx xxxxx happy xxxxxxxxx again xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx γ ( )=c

18. The bag of words representation. γ(·) = c, where the document is reduced to word counts: great 2, love 2, recommend 1, laugh 1, happy 1, …
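As a minimal sketch, the bag-of-words reduction is just a word-count dictionary; the review text below is a made-up example, not the slide's:

```python
from collections import Counter

def bag_of_words(text):
    """Reduce a document to unordered word counts (all position information discarded)."""
    return Counter(text.lower().split())

# Hypothetical review text for illustration:
review = "great plot and great satire ... I love it and would recommend it"
bow = bag_of_words(review)
print(bow["great"])  # 2
```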

19. Bag of words for document classification. Each class is summarized by its characteristic words: Machine Learning (learning, training, algorithm, shrinkage, network, …), NLP (parser, tag, training, translation, language, …), Garbage Collection (garbage, collection, memory, optimization, region, …), Planning (planning, temporal, reasoning, plan, language, …), GUI (…). A test document containing parser, language, label, translation, … is assigned to the class whose word bag it best matches.

21. Text Classification and Naïve Bayes: Formalizing the Naïve Bayes Classifier

22. Bayes' Rule Applied to Documents and Classes. For a document d and a class c: P(c|d) = P(d|c) P(c) / P(d)

23. Naïve Bayes Classifier (I). MAP is "maximum a posteriori" = most likely class: c_MAP = argmax_{c∈C} P(c|d) = argmax_{c∈C} P(d|c) P(c) / P(d) (Bayes rule) = argmax_{c∈C} P(d|c) P(c) (dropping the denominator, which is the same for every class)

24. Naïve Bayes Classifier (II). Document d represented as features x1,…,xn: c_MAP = argmax_{c∈C} P(x1,…,xn|c) P(c)

25. Naïve Bayes Classifier (IV). P(c): how often does this class occur? We can just count the relative frequencies in a corpus. P(x1,…,xn|c), however, has O(|X|^n · |C|) parameters and could only be estimated if a very, very large number of training examples was available.

26. Multinomial Naïve Bayes Independence Assumptions. Bag of Words assumption: assume position doesn't matter. Conditional Independence: assume the feature probabilities P(xi|cj) are independent given the class c, so P(x1,…,xn|c) = P(x1|c) · P(x2|c) · … · P(xn|c).

27. Multinomial Naïve Bayes Classifier: c_NB = argmax_{c∈C} P(c) Π_i P(xi|c)

28. Applying Multinomial Naive Bayes Classifiers to Text Classification. positions ← all word positions in the test document; c_NB = argmax_{cj∈C} P(cj) Π_{i∈positions} P(xi|cj)

30. Text Classification and Naïve Bayes: Naïve Bayes: Learning

31. Learning the Multinomial Naïve Bayes Model. First attempt: maximum likelihood estimates; simply use the frequencies in the data: P̂(cj) = doc-count(C = cj) / N_doc, and P̂(wi|cj) = count(wi, cj) / Σ_w count(w, cj). Sec. 13.3

32. Parameter estimation. P̂(wi|cj) = fraction of times word wi appears among all words in documents of topic cj. Create a mega-document for topic j by concatenating all docs in this topic; use the frequency of w in the mega-document.

33. Problem with Maximum Likelihood. What if we have seen no training documents with the word fantastic classified in the topic positive (thumbs-up)? Then P̂(fantastic|positive) = 0, and zero probabilities cannot be conditioned away, no matter the other evidence! Sec. 13.3

34. Laplace (add-1) smoothing for Naïve Bayes: P̂(wi|c) = (count(wi, c) + 1) / (Σ_w count(w, c) + |V|)

35. Multinomial Naïve Bayes: Learning. From the training corpus, extract Vocabulary. Calculate the P(cj) terms: for each cj in C, docs_j ← all docs with class = cj; P(cj) = |docs_j| / |total # documents|. Calculate the P(wk|cj) terms: Text_j ← a single doc containing all docs_j; for each word wk in Vocabulary, n_k ← # of occurrences of wk in Text_j; P(wk|cj) = (n_k + 1) / (n + |Vocabulary|), where n is the total number of words in Text_j.
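The learning procedure above can be sketched in a few lines of Python; the toy training data here is made up for illustration, not taken from the lecture:

```python
from collections import Counter

def train_multinomial_nb(docs, labels):
    """Train multinomial Naive Bayes with add-1 (Laplace) smoothing.
    docs: list of token lists; labels: parallel list of class names."""
    vocab = {w for doc in docs for w in doc}
    priors, cond_prob = {}, {}
    for c in set(labels):
        class_docs = [d for d, l in zip(docs, labels) if l == c]
        priors[c] = len(class_docs) / len(docs)
        # Text_j: the "mega-document" made by concatenating all docs of class c
        counts = Counter(w for d in class_docs for w in d)
        n = sum(counts.values())
        cond_prob[c] = {w: (counts[w] + 1) / (n + len(vocab)) for w in vocab}
    return vocab, priors, cond_prob

# Hypothetical two-document training set:
docs = [["fun", "great", "fun"], ["boring", "bad"]]
labels = ["pos", "neg"]
vocab, priors, cond_prob = train_multinomial_nb(docs, labels)
# priors["pos"] = 0.5; P(fun|pos) = (2+1)/(3+4) = 3/7
```

Note that with add-1 smoothing the smoothed probabilities for each class still sum to 1 over the vocabulary.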

37. Text Classification and Naïve Bayes: Naïve Bayes: Relationship to Language Modeling

38. Generative Model for Multinomial Naïve Bayes: c = China → X1 = Shanghai, X2 = and, X3 = Shenzhen, X4 = issue, X5 = bonds

39. Naïve Bayes and Language Modeling. Naïve Bayes classifiers can use any sort of feature: URL, email address, dictionaries, network features. But if, as in the previous slides, we use only word features and we use all of the words in the text (not a subset), then Naïve Bayes has an important similarity to language modeling.

40. Each class = a unigram language model. Assigning each word: P(word|c). Assigning each sentence: P(s|c) = Π P(word|c). Class pos: P(I) = 0.1, P(love) = 0.1, P(this) = 0.01, P(fun) = 0.05, P(film) = 0.1. For "I love this fun film": 0.1 × 0.1 × 0.01 × 0.05 × 0.1, so P(s|pos) = 0.0000005. Sec. 13.2.1
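The slide's arithmetic can be checked directly; the probabilities below are the slide's example values for the pos class:

```python
# Each class is a unigram language model: P(s|c) = product of P(word|c).
pos_model = {"i": 0.1, "love": 0.1, "this": 0.01, "fun": 0.05, "film": 0.1}

def sentence_prob(sentence, model):
    """Multiply the per-word unigram probabilities for the given class model."""
    p = 1.0
    for word in sentence.lower().split():
        p *= model[word]
    return p

p = sentence_prob("I love this fun film", pos_model)
# p is approximately 0.0000005, matching the slide
```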

41. Naïve Bayes as a Language Model. Which class assigns the higher probability to s? Model pos: I 0.1, love 0.1, this 0.01, fun 0.05, film 0.1. Model neg: I 0.2, love 0.001, this 0.01, fun 0.005, film 0.1. For "I love this fun film": pos gives 0.1 × 0.1 × 0.01 × 0.05 × 0.1; neg gives 0.2 × 0.001 × 0.01 × 0.005 × 0.1. So P(s|pos) > P(s|neg). Sec. 13.2.1

43. Text Classification and Naïve Bayes: Multinomial Naïve Bayes: A Worked Example

44. A worked example.

Training set: Doc 1: Chinese Beijing Chinese → class c. Doc 2: Chinese Chinese Shanghai → class c. Doc 3: Chinese Macao → class c. Doc 4: Tokyo Japan Chinese → class j. Test: Doc 5: Chinese Chinese Chinese Tokyo Japan → ?

Priors: P(c) = 3/4, P(j) = 1/4.

Conditional probabilities (add-1 smoothing):
P(Chinese|c) = (5+1) / (8+6) = 6/14 = 3/7
P(Tokyo|c) = (0+1) / (8+6) = 1/14
P(Japan|c) = (0+1) / (8+6) = 1/14
P(Chinese|j) = (1+1) / (3+6) = 2/9
P(Tokyo|j) = (1+1) / (3+6) = 2/9
P(Japan|j) = (1+1) / (3+6) = 2/9

Choosing a class:
P(c|d5) ∝ 3/4 × (3/7)^3 × 1/14 × 1/14 ≈ 0.0003
P(j|d5) ∝ 1/4 × (2/9)^3 × 2/9 × 2/9 ≈ 0.0001
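The worked example can be reproduced end to end; this sketch recomputes the priors, the add-1-smoothed likelihoods, and the two class scores:

```python
from collections import Counter

train = [
    (["Chinese", "Beijing", "Chinese"], "c"),
    (["Chinese", "Chinese", "Shanghai"], "c"),
    (["Chinese", "Macao"], "c"),
    (["Tokyo", "Japan", "Chinese"], "j"),
]
test_doc = ["Chinese", "Chinese", "Chinese", "Tokyo", "Japan"]

vocab = {w for doc, _ in train for w in doc}   # 6 word types
score = {}
for c in ("c", "j"):
    class_docs = [doc for doc, label in train if label == c]
    prior = len(class_docs) / len(train)       # P(c) = 3/4, P(j) = 1/4
    counts = Counter(w for doc in class_docs for w in doc)
    n = sum(counts.values())                   # 8 words for c, 3 for j
    p = prior
    for w in test_doc:
        p *= (counts[w] + 1) / (n + len(vocab))  # add-1 smoothing
    score[c] = p

# score["c"] is about 0.0003 and score["j"] about 0.0001, so class c wins
```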

45. Naïve Bayes in Spam Filtering. SpamAssassin features: mentions Generic Viagra; Online Pharmacy; mentions millions of dollars ($NN,NNN,NNN.NN); phrase "impress ... girl"; From: starts with many numbers; Subject is all capitals; HTML has a low ratio of text to image area; "One hundred percent guaranteed"; claims you can be removed from the list; "Prestigious Non-Accredited Universities". http://spamassassin.apache.org/tests_3_3_x.html

46. Summary: Naive Bayes is Not So Naive. Very fast, with low storage requirements. Robust to irrelevant features: irrelevant features cancel each other out without affecting results. Very good in domains with many equally important features (decision trees suffer from fragmentation in such cases, especially with little data). Optimal if the independence assumptions hold: if the assumed independence is correct, then it is the Bayes optimal classifier for the problem. A good, dependable baseline for text classification, but we will see other classifiers that give better accuracy.

48. Text Classification and Naïve Bayes: Precision, Recall, and the F measure

49. The 2-by-2 contingency table:

               correct   not correct
selected         tp          fp
not selected     fn          tn

50. Precision and recall. Precision: % of selected items that are correct = tp / (tp + fp). Recall: % of correct items that are selected = tp / (tp + fn).

               correct   not correct
selected         tp          fp
not selected     fn          tn

51. A combined measure: F. A combined measure that assesses the P/R tradeoff is the F measure (weighted harmonic mean): F = 1 / (α/P + (1 - α)/R) = (β² + 1)PR / (β²P + R). The harmonic mean is a very conservative average; see IIR §8.3. People usually use the balanced F1 measure, i.e., with β = 1 (that is, α = ½): F = 2PR / (P + R)
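The two definitions above translate directly into code; β defaults to 1 for the balanced F1 measure, and the counts in the usage example are made up:

```python
def precision_recall_f(tp, fp, fn, beta=1.0):
    """Compute precision, recall, and the weighted-harmonic-mean F measure
    from 2-by-2 contingency counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f

p, r, f1 = precision_recall_f(tp=90, fp=10, fn=30)
# p = 0.9, r = 0.75; F1 is their harmonic mean 2PR/(P+R), about 0.818
```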

53. Text Classification and Naïve Bayes: Text Classification: Evaluation

54. More Than Two Classes: Sets of binary classifiers. Dealing with any-of or multivalue classification: a document can belong to 0, 1, or more than 1 classes. For each class c ∈ C, build a classifier γc to distinguish c from all other classes c′ ∈ C. Given a test doc d, evaluate it for membership in each class using each γc; d belongs to any class for which γc returns true. Sec. 14.5

55. More Than Two Classes: Sets of binary classifiers. One-of or multinomial classification: classes are mutually exclusive; each document is in exactly one class. For each class c ∈ C, build a classifier γc to distinguish c from all other classes c′ ∈ C. Given a test doc d, evaluate it for membership in each class using each γc; d belongs to the one class with the maximum score. Sec. 14.5

56. Evaluation: Classic Reuters-21578 Data Set. The most (over)used data set: 21,578 docs (each about 90 types, 200 tokens); 9,603 training and 3,299 test articles (ModApte/Lewis split); 118 categories. An article can be in more than one category, so we learn 118 binary category distinctions. The average document (with at least one category) has 1.24 classes; only about 10 of the 118 categories are large. Common categories (#train, #test): Earn (2877, 1087), Acquisitions (1650, 179), Money-fx (538, 179), Grain (433, 149), Crude (389, 189), Trade (369, 119), Interest (347, 131), Ship (197, 89), Wheat (212, 71), Corn (182, 56). Sec. 15.2.4

57. Reuters Text Categorization data set (Reuters-21578) document: <REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="12981" NEWID="798"> <DATE>2-MAR-1987 16:51:43.42</DATE> <TOPICS><D>livestock</D><D>hog</D></TOPICS> <TEXT><TITLE>AMERICAN PORK CONGRESS KICKS OFF TOMORROW</TITLE> <DATELINE>CHICAGO, March 2 - </DATELINE><BODY>The American Pork Congress kicks off tomorrow, March 3, in Indianapolis with 160 of the nation's pork producers from 44 member states determining industry positions on a number of issues, according to the National Pork Producers Council, NPPC. Delegates to the three day Congress will be considering 26 resolutions concerning various issues, including the future direction of farm policy and the tax law as it applies to the agriculture sector. The delegates will also debate whether to endorse concepts of a national PRV (pseudorabies virus) control and eradication program, the NPPC said. A large trade show, in conjunction with the congress, will feature the latest in technology in all areas of the industry, the NPPC added. Reuter </BODY></TEXT></REUTERS> Sec. 15.2.4

58. Confusion matrix c. For each pair of classes <c1, c2>: how many documents from c1 were incorrectly assigned to c2? For example, c3,2: 90 wheat documents were incorrectly assigned to poultry.

Docs in test set   Assigned: UK  poultry  wheat  coffee  interest  trade
True UK                     95      1      13      0        1        0
True poultry                 0      1       0      0        0        0
True wheat                  10     90       0      1        0        0
True coffee                  0      0       0     34        3        7
True interest                -      1       2     13       26        5
True trade                   0      0       2     14        5       10

59. Per-class evaluation measures. Recall: fraction of docs in class i classified correctly: c_ii / Σ_j c_ij. Precision: fraction of docs assigned class i that are actually about class i: c_ii / Σ_j c_ji. Accuracy (1 - error rate): fraction of docs classified correctly: Σ_i c_ii / Σ_i Σ_j c_ij. Sec. 15.2.4

60. Micro- vs. Macro-Averaging. If we have more than one class, how do we combine multiple performance measures into one quantity? Macroaveraging: compute performance for each class, then average. Microaveraging: collect decisions for all classes, compute one contingency table, evaluate. Sec. 15.2.4

61. Micro- vs. Macro-Averaging: Example.

Class 1:             Truth: yes   Truth: no
Classifier: yes          10          10
Classifier: no           10         970

Class 2:             Truth: yes   Truth: no
Classifier: yes          90          10
Classifier: no           10         890

Micro-average table: Truth: yes   Truth: no
Classifier: yes         100          20
Classifier: no           20        1860

Macroaveraged precision: (0.5 + 0.9)/2 = 0.7. Microaveraged precision: 100/120 ≈ 0.83. The microaveraged score is dominated by the score on common classes. Sec. 15.2.4
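The two averaging schemes differ only in when the counts are pooled; this sketch reproduces the example's numbers from its (tp, fp) counts:

```python
# The two per-class contingency tables from the example, as (tp, fp) counts.
class_counts = {"class1": (10, 10), "class2": (90, 10)}

def precision(tp, fp):
    return tp / (tp + fp)

# Macro-average: compute precision per class, then average the scores.
macro_p = sum(precision(tp, fp) for tp, fp in class_counts.values()) / len(class_counts)

# Micro-average: pool the counts into one table, then compute precision once.
total_tp = sum(tp for tp, _ in class_counts.values())
total_fp = sum(fp for _, fp in class_counts.values())
micro_p = precision(total_tp, total_fp)

# macro_p = 0.7; micro_p = 100/120, about 0.83, dominated by the common class
```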

62. Development Test Sets and Cross-validation. Metric: P/R/F1 or accuracy. Use an unseen test set to avoid overfitting ("tuning to the test set") and to get a more conservative estimate of performance. Cross-validation over multiple splits handles sampling errors from different datasets: pool the results over each split and compute pooled dev-set performance. (Diagram: the corpus is divided into a training set, a development test set, and a test set; across cross-validation splits, the dev-test portion rotates through the data.)
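The rotating dev-test splits can be sketched as plain index arithmetic (a minimal illustration; real pipelines usually shuffle first and use a library utility):

```python
def kfold_splits(n_items, k):
    """Split item indices 0..n_items-1 into k folds; each fold serves once
    as the dev-test set while the remaining items form the training set."""
    splits = []
    base, extra = divmod(n_items, k)
    start = 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        dev = set(range(start, start + size))
        train = [j for j in range(n_items) if j not in dev]
        splits.append((train, sorted(dev)))
        start += size
    return splits

splits = kfold_splits(10, 5)
# 5 splits; every index is held out exactly once, so dev results can be pooled
```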

64. Text Classification and Naïve Bayes: Text Classification: Practical Issues

65. The Real World. Gee, I'm building a text classifier for real, now! What should I do? Sec. 15.3.1

66. No training data? Use manually written rules: If (wheat or grain) and not (whole or bread) then categorize as grain. These need careful crafting and human tuning on development data; time-consuming: about 2 days per class. Sec. 15.3.1

67. Very little data? Use Naïve Bayes: it is a "high-bias" algorithm (Ng and Jordan 2002 NIPS). Get more labeled data: find clever ways to get humans to label data for you. Try semi-supervised training methods: bootstrapping, EM over unlabeled documents, … Sec. 15.3.1

68. A reasonable amount of data? Perfect for all the clever classifiers: SVM, regularized logistic regression. You can even use user-interpretable decision trees: users like to hack, and management likes quick fixes. Sec. 15.3.1

69. A huge amount of data? Can achieve high accuracy! At a cost: SVMs (train time) or kNN (test time) can be too slow; regularized logistic regression can be somewhat better. So Naïve Bayes can come back into its own again! Sec. 15.3.1

70. Accuracy as a function of data size (Brill and Banko on spelling correction). With enough data, the classifier may not matter. Sec. 15.3.1

71. Real-world systems generally combine: automatic classification, and manual review of uncertain/difficult/"new" cases.

72. Underflow Prevention: log space. Multiplying lots of probabilities can result in floating-point underflow. Since log(xy) = log(x) + log(y), it is better to sum logs of probabilities instead of multiplying the probabilities: c_NB = argmax_{cj∈C} [log P(cj) + Σ_{i∈positions} log P(xi|cj)]. The class with the highest un-normalized log probability score is still the most probable; the model is now just a max of a sum of weights.
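A small demonstration of why this matters; the per-word probabilities are artificially tiny to force the underflow quickly:

```python
import math

# Per-word probabilities small enough that their product underflows:
word_probs = [1e-100] * 5

product = 1.0
for p in word_probs:
    product *= p
# product is now 0.0: 1e-500 is below the smallest positive double

log_score = sum(math.log(p) for p in word_probs)
# the log-space score is an ordinary finite number (about -1151.3),
# and comparing log scores across classes picks the same argmax
```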

73. How to tweak performance. Domain-specific features and weights: very important in real performance. Sometimes you need to collapse terms: part numbers, chemical formulas, …; but stemming generally doesn't help. Upweighting: counting a word as if it occurred twice: title words (Cohen & Singer 1996); the first sentence of each paragraph (Murata, 1999); words in sentences that contain title words (Ko et al., 2002). Sec. 15.3.2
