AI: Feature Extraction

This chapter introduces the application of feature extraction to classification. It first uses examples to explain the basic concepts of feature extraction and the four main steps of any feature selection method: generation (select a candidate feature subset), evaluation (compute the relevancy value of the subset), stopping criterion (decide whether the subset is relevant), and validation (verify the subset). It then presents several algorithms, including Relief (Relevant Features) and Branch & Bound, and compares their characteristics. Finally, it illustrates with examples how to choose the right method.

1. Feature Selection for Classification
by Yan Ke, China Jiliang University
Lecture Notes of Applied Human Computer Interaction, Week 4

2. Feature Selection for Classification
Agenda:
• Overview and general introduction.
• Four main steps in any feature selection method.
• Categorization of the various methods.
• Algorithms: Relief, Branch & Bound.
• Algorithms: DTM, MDLM, POE+ACC, Focus.
• Algorithms: LVF, wrapper approach.
• Summary of the various methods.
• Empirical comparison using some artificial data sets.
• Guidelines for selecting the "right" method.

3. Feature Selection for Classification
(1) Overview.
• various feature selection methods have appeared since the 1970's.
• common steps in all feature selection tasks.
• key concepts in feature selection algorithms.
• categorize 32 selection algorithms.
• run through some of the main algorithms.
• pros and cons of each algorithm.
• compare the performance of different methods.
• guideline to select the appropriate method.

4. Feature Selection for Classification
(2) What is a feature?
TRS_DT    TRS_TYP_CD  REF_DT    REF_NUM  CO_CD   GDS_CD  QTY  UT_CD  UT_PRIC
21/05/93  00001       04/05/93  25119    10002J  001M     10  CTN     22.000
21/05/93  00001       05/05/93  25124    10002J  032J    200  DOZ      1.370
21/05/93  00001       05/05/93  25124    10002J  033Q    500  DOZ      1.000
21/05/93  00001       13/05/93  25217    10002J  024K      5  CTN     21.000
21/05/93  00001       13/05/93  25216    10026H  006C     20  CTN     69.000
21/05/93  00001       13/05/93  25216    10026H  008Q     10  CTN    114.000
21/05/93  00001       14/05/93  25232    10026H  006C     10  CTN     69.000
21/05/93  00001       14/05/93  25235    10027E  003A      5  CTN     24.000
21/05/93  00001       14/05/93  25235    10027E  001M      5  CTN     24.000
21/05/93  00001       22/04/93  24974    10035E  009F     50  CTN    118.000
21/05/93  00001       27/04/93  25033    10035E  015A    375  GRS     72.000
21/05/93  00001       20/05/93  25313    10041Q  010F     10  CTN     26.000
21/05/93  00001       12/05/93  25197    10054R  002E     25  CTN     24.000
Each column of this transaction record (date, type, reference, company code, goods code, quantity, unit, unit price) is a feature; each row is an instance.

5. Feature Selection for Classification
(3) What is classification?
• main data mining task besides association-rule discovery.
• predictive in nature: with a given set of features, predict the value of another feature (the class label).
• common scenario:
  - Given a large legacy data set.
  - Given a number of known classes.
  - Select an appropriate smaller training data set.
  - Build a model (e.g. a decision tree).
  - Use the model to classify the actual data set into the defined classes.

6. Feature Selection for Classification
(4) Main focus of the author.
• survey various known feature selection methods
• to select a subset of relevant features
• to achieve classification accuracy.
Thus: relevancy -> correct prediction.
(5) Why can't we use the full original feature set?
• too computationally expensive to examine all features.
• not necessary to include all features (i.e. irrelevant features gain no further information).

7. Feature Selection for Classification
(6) Four main steps in a feature selection method.
[Flow diagram: original feature set -> Generation -> subset of features -> Evaluation -> Stopping criterion (no: loop back to Generation; yes: continue) -> Validation -> selected subset of features.]
Generation = select a feature subset candidate.
Evaluation = compute the relevancy value of the subset.
Stopping criterion = determine whether the subset is relevant.
Validation = verify subset validity.

8. Feature Selection for Classification
(7) Generation
• select a candidate subset of features for evaluation.
• starting point = no features, all features, or a random feature subset.
• subsequent steps = add, remove, or add/remove features.
• feature selection methods can be categorised by how they generate feature subset candidates.
• 3 ways in which the feature space is examined: (7.1) Complete, (7.2) Heuristic, (7.3) Random.

9. Feature Selection for Classification
(7.1) Complete/exhaustive
• examine all combinations of feature subsets.
  {f1,f2,f3} => { {f1},{f2},{f3},{f1,f2},{f1,f3},{f2,f3},{f1,f2,f3} }
• order of the search space is O(2^p), p = # features.
• the optimal subset is achievable.
• too expensive if the feature space is large.
(7.2) Heuristic
• selection is directed under a certain guideline
  - selected features are taken out; not every combination of features is examined.
  - candidate = { {f1,f2,f3}, {f2,f3}, {f3} }
• incremental generation of subsets.
• the search space is smaller and results are produced faster.
• may miss features with high-order relations (parity problem).
  - some relevant feature subsets may be omitted, e.g. {f1,f2}.
A sketch contrasting the two generation strategies follows below.
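A minimal Python sketch, assuming the three features f1..f3 from the slide, that contrasts complete and heuristic generation; the heuristic branch simply drops the first remaining feature each round, whereas a real method would drop the least useful one (the helper names are illustrative, not from the paper):

```python
from itertools import combinations

features = ["f1", "f2", "f3"]

# Complete generation: every non-empty subset, O(2^p) candidates.
complete_candidates = [
    set(c)
    for r in range(1, len(features) + 1)
    for c in combinations(features, r)
]

# Heuristic generation (backward-elimination flavour): start from the full
# set and drop one feature at a time, so only p candidates are produced.
heuristic_candidates = []
remaining = list(features)
while remaining:
    heuristic_candidates.append(set(remaining))
    remaining = remaining[1:]   # a real method would drop the least useful feature

print(complete_candidates)    # 7 subsets for 3 features
print(heuristic_candidates)   # [{f1,f2,f3}, {f2,f3}, {f3}]
```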

10. Feature Selection for Classification
(7.3) Random
• no predefined way to select feature candidates.
• pick features at random (i.e. a probabilistic approach).
• the optimality of the subset depends on the number of tries
  - which in turn relies on the available resources.
• requires more user-defined input parameters.
  - result optimality will depend on how these parameters are defined.
  - e.g. the number of tries.

11. Feature Selection for Classification
(8) Evaluation
• determine the relevancy of the generated feature subset candidate towards the classification task.
  Rvalue = J(candidate subset)
  if (Rvalue > best_value) best_value = Rvalue
• 5 main types of evaluation functions:
  (8.1) distance (Euclidean distance measure)           - filter
  (8.2) information (entropy, information gain, etc.)   - filter
  (8.3) dependency (correlation coefficient)            - filter
  (8.4) consistency (min-features bias)                 - filter
  (8.5) classifier error rate (the classifier itself)   - wrapper
A skeleton of this generate-and-evaluate loop is sketched below.
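A hedged sketch of the generic generation/evaluation loop described above, where J stands for any of the evaluation functions listed; `generate`, `J` and the round limit are placeholders, not prescribed by the slides:

```python
def select_features(all_features, generate, J, max_rounds=100):
    """Generic feature selection loop: generate a candidate subset,
    evaluate its relevancy with J, keep the best subset seen so far."""
    best_value, best_subset = float("-inf"), set()
    for _ in range(max_rounds):              # stopping criterion: round limit
        candidate = generate(all_features)   # generation step
        r_value = J(candidate)               # evaluation step
        if r_value > best_value:
            best_value, best_subset = r_value, candidate
    return best_subset
```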

12. Feature Selection for Classification
(8.1) Distance measure
• z² = x² + y² (Euclidean distance).
• select those features that keep instances of the same class within the same proximity.
• instances of the same class should be closer in terms of distance than those from different classes.
(8.2) Information measure
• entropy = measurement of information content.
• information gain of a feature (e.g. induction of a decision tree):
  gain(A) = I(p,n) - E(A)
  gain(A) = entropy before branching on A - sum of the entropies of all nodes after branching
• select A if gain(A) > gain(B).
An entropy / information-gain computation is sketched below.
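A small Python illustration of the information measure: entropy I(p,n) and the gain of branching on a feature A. The toy labels and feature values are made up for illustration:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy I(p, n) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """gain(A) = I(p, n) - E(A): entropy before splitting on A minus
    the weighted entropy of the partitions induced by A."""
    total = len(labels)
    e_after = 0.0
    for v in set(feature_values):
        subset = [l for f, l in zip(feature_values, labels) if f == v]
        e_after += (len(subset) / total) * entropy(subset)
    return entropy(labels) - e_after

# toy data: feature A separates the classes perfectly, so gain(A) = I(p, n)
labels = ["M", "M", "F", "F"]
print(information_gain(["a", "a", "b", "b"], labels))  # 1.0
print(information_gain(["a", "b", "a", "b"], labels))  # 0.0
```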

13. Feature Selection for Classification
(8.3) Dependency measure
• correlation between a feature and a class label.
• how closely is the feature related to the outcome of the class label?
• dependence between features = degree of redundancy.
  - if a feature is heavily dependent on another, then it is redundant.
• to determine correlation, we need some physical value: value = distance, information.
A small correlation-based sketch follows below.
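A minimal sketch of the dependency measure using the Pearson correlation coefficient (NumPy's corrcoef); the numeric feature and class values are made up, and a real method would use whichever correlation statistic fits the data:

```python
import numpy as np

# toy data: feature_1 tracks the class label closely, feature_2 does not
feature_1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
feature_2 = np.array([2.0, 9.0, 1.0, 7.0, 3.0])
class_lbl = np.array([0, 0, 0, 1, 1])

# correlation between each feature and the class label: higher |r| = more relevant
print(abs(np.corrcoef(feature_1, class_lbl)[0, 1]))  # ~0.87
print(abs(np.corrcoef(feature_2, class_lbl)[0, 1]))  # ~0.16 (much weaker)

# correlation between two features: a high |r| here would signal redundancy
print(abs(np.corrcoef(feature_1, feature_2)[0, 1]))
```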

14. Feature Selection for Classification
(8.4) Consistency measure
• two instances are inconsistent if they have matching feature values but fall under different class labels.
              f1  f2  class
  instance 1   a   b   c1    } inconsistent
  instance 2   a   b   c2    }
• select {f1,f2} if no such instances exist in the training data set.
• relies heavily on the training data set.
• min-features bias = want the smallest subset that maintains consistency.
• problem = a single feature alone can guarantee no inconsistency (e.g. an identity-card number).
A consistency-count sketch follows below.
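A small Python check of the consistency measure: count instances that share identical values on the candidate subset but carry different class labels. The data and helper name are illustrative:

```python
from collections import defaultdict

def inconsistency_count(instances, labels, subset):
    """Count instances that match on `subset` but disagree on the class label."""
    groups = defaultdict(list)
    for inst, lbl in zip(instances, labels):
        key = tuple(inst[f] for f in subset)
        groups[key].append(lbl)
    # within each group, everything beyond the majority class is inconsistent
    return sum(len(lbls) - max(lbls.count(c) for c in set(lbls))
               for lbls in groups.values())

instances = [{"f1": "a", "f2": "b"}, {"f1": "a", "f2": "b"}]
labels = ["c1", "c2"]
print(inconsistency_count(instances, labels, ["f1", "f2"]))  # 1 -> inconsistent
```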

15. Feature Selection for Classification
Filter approach:
  original feature set -> feature selection (evaluation fn) -> selected feature subset -> classifier
• evaluation fn <> classifier.
• ignores the effect of the selected subset on the performance of the classifier.
Wrapper approach:
  original feature set -> feature selection (classifier) -> selected feature subset -> classifier
• evaluation fn = classifier.
• takes the classifier into account.
• loses generality, but gives a high degree of accuracy.

16. Feature Selection for Classification
(8.5) Classifier error rate.
• the wrapper approach.
  error_rate = classifier(feature subset candidate)
  if (error_rate < predefined threshold) select the feature subset
• feature selection loses its generality, but gains accuracy towards the classification task.
• computationally very costly.
A wrapper-style error-rate sketch follows below.
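A hedged sketch of the wrapper evaluation, using scikit-learn's cross_val_score to estimate the classifier error rate on each candidate subset; the library choice, classifier, threshold and toy data are assumptions, not from the slides:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

def wrapper_error_rate(X, y, feature_idx):
    """Estimate the classifier error rate using only the candidate features."""
    clf = DecisionTreeClassifier(random_state=0)
    accuracy = cross_val_score(clf, X[:, feature_idx], y, cv=3).mean()
    return 1.0 - accuracy

# toy data: feature 0 is informative, feature 1 is random noise
rng = np.random.default_rng(0)
y = np.array([0, 0, 0, 1, 1, 1] * 5)
X = np.column_stack([y + rng.normal(0, 0.1, y.size), rng.normal(0, 1, y.size)])

threshold = 0.1
for candidate in ([0], [1], [0, 1]):
    err = wrapper_error_rate(X, y, candidate)
    if err < threshold:
        print("select", candidate, "error:", round(err, 3))
```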

17. Feature Selection for Classification
(9) Comparison among the various evaluation methods.
  method                  generality   time       accuracy
  distance                yes          low        -
  information             yes          low        -
  dependency              yes          low        -
  consistency             yes          moderate   -
  classifier error rate   no           high       very high
generality = how general is the method towards different classifiers?
time = how complex is it in terms of time?
accuracy = how accurate is the resulting classification task?

18. Feature Selection for Classification
(10) Author's categorization of feature selection methods.
  Measure \ Generation    Heuristic                                 Complete                                   Random
  Distance                Relief                                    Branch & Bound (BB)                        -
  Information             Decision Tree Method (DTM)                Minimal Description Length Method (MDLM)   -
  Dependency              Probability of Error & Average
                          Correlation Coefficient Method (POE+ACC)  -                                          -
  Consistency             -                                         Focus                                      LVF
  Classifier Error Rate   SBS, SFS                                  AMB & B                                    LVW

19. Feature Selection for Classification
(11.1) Relief [generation = heuristic, evaluation = distance].
• Basic algorithm construct:
  - each feature is assigned a cumulative weightage, computed over a predefined number of sample instances drawn from the training data set.
  - features with weightage above a certain threshold form the selected feature subset.
• Assignment of weightage:
  - instances belonging to the same class should stay closer together than those in a different class.
  - near-hit instance = closest instance of the same class.
  - near-miss instance = closest instance of a different class.
  - W = W - diff(X, nearhit)² + diff(X, nearmiss)²

20. Feature Selection for Classification
1. selected_subset = {}
2. initialise all feature weightages to 0 (e.g. for 2 features: w1 = 0, w2 = 0)
3. for i = 1 to no_of_sample
     get one instance X from the training data set D.
     get nearhit H = instance in D where dist(X,H) is closest & X.class = H.class
     get nearmiss M = instance in D where dist(X,M) is closest & X.class <> M.class
     update the weightage for all features:
       weightage = weightage - diff(x,h)² + diff(x,m)²
       e.g. weightage1 = weightage1 - diff(x1,h1)² + diff(x1,m1)²
       e.g. weightage2 = weightage2 - diff(x2,h2)² + diff(x2,m2)²
4. for j = 1 to no_of_feature (e.g. 2)
     if weightagej >= Threshold, add featurej to selected_subset
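A runnable Python version of the Relief pseudocode above, applied to a made-up two-feature data set; the random sampling, the L1 neighbour distance and the threshold value are implementation choices, not prescribed by the slides:

```python
import random
import numpy as np

def relief(X, y, n_samples, threshold, rng=random.Random(0)):
    """Relief: accumulate a weight per feature from near-hit / near-miss distances."""
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(n_samples):
        i = rng.randrange(n)                       # pick one instance X_i
        dists = np.abs(X - X[i]).sum(axis=1)       # distance to every instance
        dists[i] = np.inf                          # never pick the instance itself
        same = np.where(y == y[i], dists, np.inf)  # candidates for the near-hit
        diff = np.where(y != y[i], dists, np.inf)  # candidates for the near-miss
        hit, miss = X[np.argmin(same)], X[np.argmin(diff)]
        w += -(X[i] - hit) ** 2 + (X[i] - miss) ** 2
    return [j for j in range(p) if w[j] >= threshold], w

# toy data: feature 0 separates the two classes, feature 1 is noise
X = np.array([[1.0, 5.0], [1.2, 2.0], [0.9, 4.0], [5.0, 3.0], [5.2, 5.0], [4.8, 2.5]])
y = np.array(["M", "M", "M", "F", "F", "F"])
selected, weights = relief(X, y, n_samples=6, threshold=5.0)
print(selected, weights)   # feature 0 should accumulate a much larger weight
```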

21. Data
  Hair Length   Shoe Size   Class
       5            9         M
      10            8         M
      15            5         F
      20            9         M
      25            6         F
      25            5         F

22. Feature Selection for Classification
Classification: Male / Female
[Scatter plot: shoe size (4-10) vs hair length (0-30 cm); the M and F instances cluster separately, and arrows mark the near-hit and near-miss neighbours of the two sampled instances x1 and x2.]
Weightage updates for the two sampled instances:
  feature       sample   w   -(x-hit)²   +(x-miss)²   = w
  shoe size     x1       0   -(4-5)²     +(4-1)²      -1+9 = 8
                x2       8   -(2-1)²     +(2-5)²      8-1+9 = 16
  hair length   x1       0   -(2-1)²     +(2-3)²      -1+1 = 0
                x2       0   -(5-5)²     +(5-4)²      0-0+1 = 1
* if (threshold = 5), the feature "shoe size" will be selected.

26. Feature Selection for Classification
• W = W - diff(X, nearhit)² + diff(X, nearmiss)²
  - decreases the weightage for instances belonging to the same class (*note: their distance difference should be small).
  - increases the weightage for instances belonging to different classes (*note: their distance difference should be large).
  - if (W <= 0), this is a sign of irrelevancy or redundancy.
  - if (W > 0), instances in different classes are further apart, as expected.
• Disadvantages:
  - applicable only to binary class problems.
  - insufficient training instances can fool Relief.
  - if most features are relevant, Relief selects all of them (even if not necessary).
• Advantages:
  - noise-tolerant.
  - unaffected by feature interaction (weightage is cumulative and determined collectively).

27. Feature Selection for Classification
(11.2) Branch & Bound [generation = complete, evaluation = distance].
• a fairly old method (1977).
• Modified assumption:
  - find a minimally sized feature subset.
  - a bound/threshold is used to prune irrelevant branches: if F(subset) < bound, remove it from the search tree (including all of its subsets).
• Model of the feature-set search tree (next slide); a sketch of the pruning idea follows the tree.

28. Feature Selection for Classification
Search tree for F = {f1, f2, f3}:
  {f1, f2, f3}
    {f2, f3} -> {f3}, {f2}
    {f1, f3} -> {f3}, {f1}
    {f1, f2} -> {f2}, {f1}
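A hedged sketch of the Branch & Bound idea over a tree like the one above: remove one feature at a time and prune any branch, together with all of its sub-subsets, as soon as the criterion falls below the bound. The criterion J here is a toy stand-in for a real monotonic distance measure:

```python
def branch_and_bound(features, J, bound):
    """Explore subsets by removing one feature at a time; prune a branch
    (and all of its sub-subsets) as soon as J drops below the bound."""
    best = set(features)
    def visit(subset, start):
        nonlocal best
        if J(subset) < bound:          # monotonicity: no sub-subset can recover
            return
        if len(subset) < len(best):    # prefer the smallest surviving subset
            best = set(subset)
        for i in range(start, len(features)):
            f = features[i]
            if f in subset:
                visit(subset - {f}, i + 1)
    visit(set(features), 0)
    return best

# toy criterion: subsets containing f1 are "good enough", everything else is pruned
J = lambda s: 1.0 if "f1" in s else 0.0
print(branch_and_bound(["f1", "f2", "f3"], J, bound=0.5))   # {'f1'}
```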

29. Category IV - Generation: Heuristic / Evaluation: Information
2 Methods:
• 1) Decision Tree Method (DTM)
  - Run C4.5 over the training set.
  - The selected features are the union of all features appearing in the pruned decision tree produced by C4.5.
  - An information-based function selects the feature at each node of the decision tree.
A rough scikit-learn analogue of this idea is sketched below.
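The slides name C4.5; as a rough analogue (an assumption, since scikit-learn implements CART rather than C4.5), one can fit a pruned DecisionTreeClassifier and take the union of features used at its internal nodes:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# toy training set: only the first feature carries class information
rng = np.random.default_rng(0)
y = np.array([0] * 20 + [1] * 20)
X = np.column_stack([y + rng.normal(0, 0.1, 40),
                     rng.normal(0, 1, 40),
                     rng.normal(0, 1, 40)])

# ccp_alpha gives a cost-complexity-pruned tree, standing in for C4.5's pruning step
tree = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X, y)

# DTM-style selection: the union of features appearing at the tree's internal nodes
used = {f for f in tree.tree_.feature if f >= 0}   # negative values mark leaf nodes
print(sorted(used))   # expected: [0]
```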