Text Classification and Naïve Bayes


Daniel


Stanford University course slides: Text Classification and Naïve Bayes.


1 .Text Classification and Naïve Bayes: The Task of Text Classification

2 .Is this spam?

3 .Who wrote which Federalist papers? 1787-8: anonymous essays try to convince New York to ratify the U.S. Constitution; the authors are Jay, Madison, and Hamilton. Authorship of 12 of the letters is in dispute. 1963: solved by Mosteller and Wallace using Bayesian methods. (Pictured: James Madison, Alexander Hamilton.)

4 .Male or female author? (1) "By 1925 present-day Vietnam was divided into three parts under French colonial rule. The southern region embracing Saigon and the Mekong delta was the colony of Cochin-China; the central area with its imperial capital at Hue was the protectorate of Annam…" (2) "Clara never failed to be astonished by the extraordinary felicity of her own name. She found it hard to trust herself to the mercy of fate, which had managed over the years to convert her greatest shame into one of her greatest assets…" Source: S. Argamon, M. Koppel, J. Fine, A. R. Shimoni, 2003. "Gender, Genre, and Writing Style in Formal Written Texts," Text, volume 23, number 3, pp. 321–346.

5 .Positive or negative movie review? (1) "unbelievably disappointing" (2) "Full of zany characters and richly applied satire, and some great plot twists" (3) "this is the greatest screwball comedy ever filmed" (4) "It was pathetic. The worst part about it was the boxing scenes."

6 .What is the subject of this article? A MEDLINE article is to be assigned a category from the MeSH Subject Category Hierarchy: Antagonists and Inhibitors, Blood Supply, Chemistry, Drug Therapy, Embryology, Epidemiology, …

7 .Text Classification: assigning subject categories, topics, or genres; spam detection; authorship identification; age/gender identification; language identification; sentiment analysis; …

8 .Text Classification: definition. Input: a document $d$ and a fixed set of classes $C = \{c_1, c_2, \ldots, c_J\}$. Output: a predicted class $c \in C$.

9 .Classification Methods: Hand-coded rules. Rules are based on combinations of words or other features, e.g. spam: black-list-address OR ("dollars" AND "have been selected"). Accuracy can be high if the rules are carefully refined by an expert, but building and maintaining these rules is expensive.
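
As a rough illustration of the hand-coded spam rule on this slide, here is a minimal Python sketch. The blacklist contents, the function name `is_spam`, and the choice of case-insensitive substring matching are assumptions made for the example; the slide only gives the boolean rule itself.

```python
# Hypothetical blacklist; in practice an expert would curate and maintain it.
BLACKLIST = {"promo@example.com"}

def is_spam(sender: str, body: str) -> bool:
    """Rule from the slide:
    black-list-address OR ("dollars" AND "have been selected").
    Matching is case-insensitive substring search (an assumption)."""
    text = body.lower()
    on_blacklist = sender.lower() in BLACKLIST
    phrase_match = "dollars" in text and "have been selected" in text
    return on_blacklist or phrase_match

print(is_spam("friend@example.org", "You have been selected to receive 1000 dollars"))
print(is_spam("friend@example.org", "Lunch costs ten dollars"))
```

Every new spam pattern needs another clause like these, which is one way to see why maintaining such rules becomes expensive.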

10 .Classification Methods: Supervised Machine Learning. Input: a document $d$; a fixed set of classes $C = \{c_1, c_2, \ldots, c_J\}$; a training set of $m$ hand-labeled documents $(d_1, c_1), \ldots, (d_m, c_m)$. Output: a learned classifier $\gamma: d \to c$.

11 .Classification Methods: Supervised Machine Learning. Any kind of classifier: Naïve Bayes, logistic regression, support-vector machines, k-nearest neighbors, …


13 .Text Classification and Naïve Bayes: Naïve Bayes (I)

14 .Naïve Bayes Intuition. A simple ("naïve") classification method based on Bayes' rule. It relies on a very simple representation of the document: the bag of words.

15 .The bag of words representation. "I love this movie! It's sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun… It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet." $\gamma(d) = c$


17 .The bag of words representation: using a subset of words. All words except a chosen subset are masked out (shown as x's on the slide), leaving: love, sweet, satirical, great, fun, whimsical, romantic, laughing, recommend, several, happy, again. $\gamma(d) = c$

18 .The bag of words representation. $\gamma(d) = c$, where the document is reduced to word counts: great 2, love 2, recommend 1, laugh 1, happy 1, ...
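
The word-count view of a document can be sketched in a few lines of Python. The tokenizer below (lowercasing plus a simple regular expression) and the short sample text are assumptions for illustration; the slides do not specify how the text is tokenized, and the counts here are for the sample text, not for the review on the slides.

```python
import re
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Map a text to word counts, discarding word order and position."""
    # Assumed tokenizer: lowercase, then runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

# Hypothetical short review, not the full text from the slides.
review = "I love this movie! I would recommend it. Great dialogue, great fun."
bow = bag_of_words(review)
print(bow["great"], bow["love"], bow["recommend"])

# Word order does not matter: reordered text gives the same bag.
print(bag_of_words("great fun") == bag_of_words("fun great"))
```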

19 .Bag of words for document classification. Test document: parser, language, label, translation, … Which class does it belong to? Machine Learning: learning, training, algorithm, shrinkage, network, ... NLP: parser, tag, training, translation, language, ... Garbage Collection: garbage, collection, memory, optimization, region, ... Planning: planning, temporal, reasoning, plan, language, ... GUI: ...


21 .Text Classification and Naïve Bayes: Formalizing the Naïve Bayes Classifier

22 .Bayes' Rule Applied to Documents and Classes. For a document $d$ and a class $c$: $P(c \mid d) = \frac{P(d \mid c)\, P(c)}{P(d)}$

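23 .Naïve Bayes Classifier (I). $c_{MAP} = \arg\max_{c \in C} P(c \mid d)$ (MAP is "maximum a posteriori" = most likely class) $= \arg\max_{c \in C} \frac{P(d \mid c)\, P(c)}{P(d)}$ (Bayes' rule) $= \arg\max_{c \in C} P(d \mid c)\, P(c)$ (dropping the denominator, since $P(d)$ is the same for every class).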
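
A small numeric check can make the "dropping the denominator" step concrete: dividing every class score by the same P(d) rescales the scores but cannot change which class is largest. The probabilities and class names below are made-up values for one hypothetical document, not figures from the slides.

```python
# Hypothetical values for a single document d and two classes.
prior = {"pos": 0.6, "neg": 0.4}            # P(c)
likelihood = {"pos": 0.002, "neg": 0.005}   # P(d | c)

# P(d) = sum over classes of P(d | c) P(c); identical for every class.
evidence = sum(likelihood[c] * prior[c] for c in prior)

# Full posterior via Bayes' rule, and the unnormalized score without P(d).
posterior = {c: likelihood[c] * prior[c] / evidence for c in prior}
score = {c: likelihood[c] * prior[c] for c in prior}

c_map_posterior = max(posterior, key=posterior.get)
c_map_score = max(score, key=score.get)
print(c_map_posterior, c_map_score)
```

Both selections agree, so a classifier that only needs the most likely class can skip computing P(d).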

24 .Naïve Bayes Classifier (II). Document $d$ represented as features $x_1, \ldots, x_n$: $c_{MAP} = \arg\max_{c \in C} P(x_1, x_2, \ldots, x_n \mid c)\, P(c)$

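25 .Naïve Bayes Classifier (IV). In $c_{MAP} = \arg\max_{c \in C} P(x_1, x_2, \ldots, x_n \mid c)\, P(c)$: the prior $P(c)$ asks how often this class occurs, and we can just count the relative frequencies in a corpus. The likelihood $P(x_1, x_2, \ldots, x_n \mid c)$ has $O(|X|^n \cdot |C|)$ parameters and could only be estimated if a very, very large number of training examples was available.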
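
To get a feel for the $O(|X|^n \cdot |C|)$ figure, the sketch below simply evaluates $|X|^n \cdot |C|$ for a few sizes. The function name and the example sizes are assumptions chosen for illustration, and the result is an order-of-magnitude count rather than an exact number of free parameters.

```python
def full_joint_params(vocab_size: int, n_positions: int, n_classes: int) -> int:
    """Order-of-magnitude parameter count |X|^n * |C| for modelling
    P(x_1, ..., x_n | c) without any independence assumption."""
    return vocab_size ** n_positions * n_classes

# Tiny hypothetical setting: 10 word types, 3 positions, 2 classes.
print(full_joint_params(10, 3, 2))

# A still-modest setting: 10,000 word types, 20 positions, 2 classes.
print(len(str(full_joint_params(10_000, 20, 2))), "digits")
```

Even short documents over a realistic vocabulary give a count far beyond any training corpus, which is the estimation problem the slide points to.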

26 .Multinomial Naïve Bayes Independence Assumptions. Bag of Words assumption: assume position doesn't matter. Conditional Independence: assume the feature probabilities $P(x_i \mid c_j)$ are independent given the class $c$, so that $P(x_1, \ldots, x_n \mid c) = P(x_1 \mid c) \cdot P(x_2 \mid c) \cdots P(x_n \mid c)$.

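27 .Multinomial Naïve Bayes Classifier. $c_{MAP} = \arg\max_{c \in C} P(x_1, x_2, \ldots, x_n \mid c)\, P(c)$; $c_{NB} = \arg\max_{c \in C} P(c) \prod_{x \in X} P(x \mid c)$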

28 .Applying Multinomial Naïve Bayes Classifiers to Text Classification. positions ← all word positions in test document; $c_{NB} = \arg\max_{c_j \in C} P(c_j) \prod_{i \in positions} P(x_i \mid c_j)$
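
The decision rule on this slide (a class prior multiplied by one word probability per position in the test document) can be sketched as follows. All probabilities, the class names, and the tiny vocabulary are hand-picked hypothetical values; the sketch assumes every test word appears in the table and does not cover how the parameters are estimated.

```python
from math import prod

# Hypothetical parameters, not learned from data.
prior = {"pos": 0.5, "neg": 0.5}  # P(c_j)
cond = {                          # P(x | c_j), one distribution per class
    "pos": {"great": 0.4, "fun": 0.3, "boring": 0.1, "plot": 0.2},
    "neg": {"great": 0.1, "fun": 0.1, "boring": 0.6, "plot": 0.2},
}

def classify(tokens):
    """Return (best class, scores) with
    score(c_j) = P(c_j) * product over positions i of P(x_i | c_j)."""
    positions = range(len(tokens))  # all word positions in the test document
    scores = {
        c: prior[c] * prod(cond[c][tokens[i]] for i in positions)
        for c in prior
    }
    return max(scores, key=scores.get), scores

label, scores = classify(["great", "fun", "plot"])
print(label, scores)
```

A word outside the table raises a `KeyError` here; handling unseen words is a separate design decision that this sketch leaves out.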

