1. Feature Selection for Classification
Yan Ke, China Jiliang University
Lecture Notes of Applied Human Computer Interaction, Week 4
2. Feature Selection for Classification
Agenda:
• Overview and general introduction.
• Four main steps in any feature selection method.
• Categorization of the various methods.
• Algorithms: Relief, Branch & Bound.
• Algorithms: DTM, MDLM, POE+ACC, Focus.
• Algorithms: LVF, wrapper approach.
• Summary of the various methods.
• Empirical comparison using some artificial data sets.
• Guidelines for selecting the "right" method.
3. Feature Selection for Classification
(1) Overview.
• various feature selection methods since the 1970s.
• common steps in all feature selection tasks.
• key concepts in feature selection algorithms.
• categorize 32 selection algorithms.
• run through some of the main algorithms.
• pros and cons of each algorithm.
• compare the performance of the different methods.
• guidelines to select the appropriate method.
4. Feature Selection for Classification
(2) What is a feature? (Each column of the sample transaction data below is a feature.)
TRS_DT    TRS_TYP_CD  REF_DT    REF_NUM  CO_CD   GDS_CD  QTY  UT_CD  UT_PRIC
21/05/93  00001       04/05/93  25119    10002J  001M     10  CTN     22.000
21/05/93  00001       05/05/93  25124    10002J  032J    200  DOZ      1.370
21/05/93  00001       05/05/93  25124    10002J  033Q    500  DOZ      1.000
21/05/93  00001       13/05/93  25217    10002J  024K      5  CTN     21.000
21/05/93  00001       13/05/93  25216    10026H  006C     20  CTN     69.000
21/05/93  00001       13/05/93  25216    10026H  008Q     10  CTN    114.000
21/05/93  00001       14/05/93  25232    10026H  006C     10  CTN     69.000
21/05/93  00001       14/05/93  25235    10027E  003A      5  CTN     24.000
21/05/93  00001       14/05/93  25235    10027E  001M      5  CTN     24.000
21/05/93  00001       22/04/93  24974    10035E  009F     50  CTN    118.000
21/05/93  00001       27/04/93  25033    10035E  015A    375  GRS     72.000
21/05/93  00001       20/05/93  25313    10041Q  010F     10  CTN     26.000
21/05/93  00001       12/05/93  25197    10054R  002E     25  CTN     24.000
5. Feature Selection for Classification
(3) What is classification?
• a main data mining task besides association-rule discovery.
• predictive in nature: given a set of features, predict the value of another feature (the class).
• common scenario:
  - Given a large legacy data set.
  - Given a number of known classes.
  - Select an appropriate smaller training data set.
  - Build a model (e.g. a decision tree).
  - Use the model to classify the actual data set into the defined classes.
6. Feature Selection for Classification
(4) Main focus of the author.
• survey the various known feature selection methods
• to select a subset of relevant features
• to achieve classification accuracy.
Thus: relevancy -> correct prediction.
(5) Why can't we use the full original feature set?
• too computationally expensive to examine all features.
• not necessary to include all features (irrelevant features gain no further information).
7. Feature Selection for Classification
(6) Four main steps in a feature selection method.
Original feature set -> Generation -> subset of features -> Evaluation -> Stopping criterion
  (no: generate the next candidate; yes: Validation of the selected subset of features)
Generation = select a feature subset candidate.
Evaluation = compute the relevancy value of the subset.
Stopping criterion = determine whether the subset is relevant.
Validation = verify subset validity.
(a minimal code skeleton of this loop follows below)
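A minimal Python sketch of this four-step loop (illustrative only; generate, evaluate and stop are hypothetical placeholders for the generation procedure, the evaluation function J and the stopping criterion):

    # Sketch of the generation -> evaluation -> stopping-criterion loop.
    def feature_selection(features, generate, evaluate, stop):
        best_subset, best_value = None, float("-inf")
        for candidate in generate(features):      # Generation: next candidate subset
            value = evaluate(candidate)           # Evaluation: relevancy value J(candidate)
            if value > best_value:
                best_subset, best_value = candidate, value
            if stop(best_subset, best_value):     # Stopping criterion
                break
        return best_subset                        # Validation happens separately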
8. Feature Selection for Classification
(7) Generation
• select a candidate subset of features for evaluation.
• starting point = no features, all features, or a random feature subset.
• subsequent candidates = add, remove, or add/remove features.
• feature selection methods can be categorized by how they generate feature subset candidates.
• 3 ways in which the feature space is examined:
  (7.1) Complete (7.2) Heuristic (7.3) Random.
9. Feature Selection for Classification
(7.1) Complete/exhaustive
• examine all combinations of feature subsets.
  {f1,f2,f3} => { {f1},{f2},{f3},{f1,f2},{f1,f3},{f2,f3},{f1,f2,f3} }
• order of the search space is O(2^p), where p = number of features.
• the optimal subset is achievable.
• too expensive if the feature space is large.
(7.2) Heuristic
• selection is directed under certain guidelines
  - a selected feature is taken out; not all combinations of features are examined.
  - candidates = { {f1,f2,f3}, {f2,f3}, {f3} }
• incremental generation of subsets.
• the search space is smaller and results are produced faster.
• may miss features with high-order relations (parity problem).
  - some relevant feature subsets may be omitted, e.g. {f1,f2}.
10. Feature Selection for Classification
(7.3) Random
• no predefined way to select feature candidates.
• pick features at random (i.e. a probabilistic approach).
• optimality of the subset depends on the number of tries,
  which in turn relies on the available resources.
• requires more user-defined input parameters.
  - result optimality will depend on how these parameters are defined,
    e.g. the number of tries.
(a sketch contrasting the three generation styles follows below)
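The Python sketch below contrasts the three generation styles; the backward-style heuristic order and the 50% inclusion probability of the random generator are arbitrary illustrative choices, not prescribed by the slides:

    import itertools
    import random

    def complete_generation(features):
        # Complete/exhaustive: every non-empty subset, O(2^p) candidates.
        for r in range(1, len(features) + 1):
            for subset in itertools.combinations(features, r):
                yield set(subset)

    def heuristic_generation(features):
        # Heuristic (backward style): drop one fixed feature at a time,
        # e.g. {f1,f2,f3} -> {f2,f3} -> {f3}.
        current = list(features)
        while current:
            yield set(current)
            current = current[1:]

    def random_generation(features, tries=10, seed=0):
        # Random: probabilistic picks; quality depends on the number of tries.
        rng = random.Random(seed)
        for _ in range(tries):
            yield {f for f in features if rng.random() < 0.5} or {rng.choice(list(features))}

    # Example: list(complete_generation(["f1", "f2", "f3"])) yields 7 subsets.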
11. Feature Selection for Classification
(8) Evaluation
• determine the relevancy of the generated feature subset candidate towards the classification task.
    Rvalue = J(candidate subset)
    if (Rvalue > best_value) best_value = Rvalue
• 5 main types of evaluation functions:
  filter approach:
    (8.1) distance (Euclidean distance measure).
    (8.2) information (entropy, information gain, etc.).
    (8.3) dependency (correlation coefficient).
    (8.4) consistency (min-features bias).
  wrapper approach:
    (8.5) classifier error rate (the classifier itself).
12. Feature Selection for Classification
(8.1) Distance measure
• Euclidean distance: z² = x² + y².
• select those features that keep instances of the same class within the same proximity.
• instances of the same class should be closer in terms of distance than those from different classes.
(8.2) Information measure
• entropy = a measurement of information content.
• information gain of a feature (e.g. induction of decision trees):
    gain(A) = I(p,n) - E(A)
  i.e. gain(A) = information before branching on A - sum over all nodes after branching.
• select A if gain(A) > gain(B).
(a small sketch of the information-gain computation follows below)
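A small Python sketch of the information measure (illustrative, not from the slides): entropy over the class counts, and gain(A) as the entropy before branching on A minus the weighted entropy after branching:

    from collections import Counter
    from math import log2

    def entropy(labels):
        # I(p, n) generalised to any class counts: -sum(p_i * log2(p_i)).
        total = len(labels)
        return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

    def information_gain(feature_values, labels):
        # gain(A) = I(p, n) - E(A): entropy before branching on A minus the
        # weighted entropy of the branches after branching.
        total = len(labels)
        branches = {}
        for v, y in zip(feature_values, labels):
            branches.setdefault(v, []).append(y)
        expected = sum(len(ys) / total * entropy(ys) for ys in branches.values())
        return entropy(labels) - expected

    # Example: information_gain(["a", "a", "b", "b"], ["M", "M", "F", "F"]) == 1.0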
13. Feature Selection for Classification
(8.3) Dependency measure
• correlation between a feature and a class label.
• how closely is the feature related to the outcome of the class label?
• dependence between features = degree of redundancy.
  - if a feature heavily depends on another, then it is redundant.
• to determine correlation, we need some physical value:
    value = distance, information.
(a small correlation sketch follows below)
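One concrete way to realise the dependency measure (my illustration, not prescribed by the slides) is a plain Pearson correlation between a numeric feature and a 0/1-encoded class label:

    from math import sqrt

    def pearson_correlation(xs, ys):
        # With ys as a 0/1-encoded class label, this measures how closely the
        # feature tracks the class outcome.
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = sqrt(sum((x - mx) ** 2 for x in xs))
        sy = sqrt(sum((y - my) ** 2 for y in ys))
        return cov / (sx * sy) if sx and sy else 0.0

    # On the hair-length/shoe-size data used later in these slides, shoe size
    # correlates strongly (negatively) with the class; hair length more weakly.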
14. Feature Selection for Classification
(8.4) Consistency measure
• two instances are inconsistent if they have matching feature values but fall under different class labels.
               f1   f2   class
  instance 1   a    b    c1      } inconsistent
  instance 2   a    b    c2      }
• select {f1,f2} if no such pair of instances exists in the training data set.
• relies heavily on the training data set.
• min-features bias = want the smallest subset that preserves consistency.
• problem = a single identifier feature alone (e.g. an IC number) guarantees no inconsistency.
(a consistency-check sketch follows below)
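A minimal sketch of the consistency check in Python (illustrative only):

    def is_consistent(instances, subset):
        # Inconsistent if two instances agree on every selected feature
        # but carry different class labels.
        seen = {}
        for features, label in instances:
            key = tuple(features[f] for f in subset)
            if key in seen and seen[key] != label:
                return False
            seen[key] = label
        return True

    # The slide's example: {f1, f2} is rejected because ('a', 'b') maps to both c1 and c2.
    data = [({"f1": "a", "f2": "b"}, "c1"), ({"f1": "a", "f2": "b"}, "c2")]
    print(is_consistent(data, ["f1", "f2"]))  # False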
15. Feature Selection for Classification
Filter approach:
  original feature set -> feature selection (evaluation function) -> selected feature subset -> classifier
• evaluation function <> classifier.
• ignores the effect of the selected subset on the performance of the classifier.
Wrapper approach:
  original feature set -> feature selection (classifier) -> selected feature subset -> classifier
• evaluation function = classifier.
• takes the classifier into account.
• loses generality, but achieves a high degree of accuracy.
16. Feature Selection for Classification
(8.5) Classifier error rate.
• wrapper approach.
    error_rate = classifier(feature subset candidate)
    if (error_rate < predefined threshold) select the feature subset
• feature selection loses its generality, but gains accuracy towards the classification task.
• computationally very costly.
(a wrapper sketch follows below)
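A hedged sketch of the wrapper evaluation, assuming scikit-learn is available; the k-NN classifier and 5-fold cross-validation are arbitrary illustrative choices, not part of the original method:

    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.model_selection import cross_val_score

    def wrapper_error_rate(X, y, subset):
        # Cross-validated error rate of a classifier trained only on the candidate
        # feature subset. X is a 2-D NumPy array, subset a list of column indices.
        scores = cross_val_score(KNeighborsClassifier(n_neighbors=3),
                                 X[:, list(subset)], y, cv=5)
        return 1.0 - scores.mean()

    # select the candidate subset if wrapper_error_rate(X, y, subset) < predefined_threshold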
17. Feature Selection for Classification
(9) Comparison among the various evaluation methods.
  method                 generality   time       accuracy
  distance               yes          low        -
  information            yes          low        -
  dependency             yes          low        -
  consistency            yes          moderate   -
  classifier error rate  no           high       very high
generality = how general is the method towards different classifiers?
time = how complex is the method in terms of time?
accuracy = how accurate is the resulting classification task?
18. Feature Selection for Classification
(10) Author's categorization of feature selection methods (evaluation measure x generation):
• Distance:              Heuristic = Relief;  Complete = Branch & Bound (BB)
• Information:           Heuristic = Decision Tree Method (DTM);  Complete = Minimal Description Length Method (MDLM)
• Dependency:            Heuristic = Probability of Error & Average Correlation Coefficient Method (POE+ACC)
• Consistency:           Complete = Focus;  Random = LVF
• Classifier Error Rate: Heuristic = SBS, SFS;  Complete = AMB & B;  Random = LVW
19. Feature Selection for Classification
(11.1) Relief [generation = heuristic, evaluation = distance].
• Basic algorithm construct:
  - each feature is assigned a cumulative weightage, computed over a predefined number of sample instances selected from the training data set.
  - features with weightage over a certain threshold form the selected feature subset.
• Assignment of weightage:
  - instances belonging to the same class should stay closer together than those of a different class.
  - near-hit instance = nearest instance of the same class.
  - near-miss instance = nearest instance of a different class.
  - W = W - diff(X, nearhit)² + diff(X, nearmiss)²
20. Feature Selection for Classification
1. selected_subset = {}
2. initialise all feature weightages to 0 (e.g. for 2 features: w1 = 0, w2 = 0)
3. for i = 1 to no_of_sample:
     get one instance X from the training data set D.
     get near-hit H = instance in D where dist(X,H) is smallest and X.class = H.class
     get near-miss M = instance in D where dist(X,M) is smallest and X.class <> M.class
     update the weightage of every feature:
       weightage = weightage - diff(x,h)² + diff(x,m)²
       e.g. weightage1 = weightage1 - diff(x1,h1)² + diff(x1,m1)²
            weightage2 = weightage2 - diff(x2,h2)² + diff(x2,m2)²
4. for j = 1 to no_of_feature (e.g. 2):
     if weightagej >= Threshold, add featurej to selected_subset
(a runnable Python version follows below)
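A runnable Python rendering of this pseudocode (my sketch: Euclidean distance; the sample count, threshold and random seed are illustrative choices, and every class is assumed to have at least two instances):

    import random

    def relief(data, labels, no_of_sample, threshold, seed=0):
        rng = random.Random(seed)
        n_features = len(data[0])
        w = [0.0] * n_features                                    # step 2: all weightages = 0
        dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5
        for _ in range(no_of_sample):                             # step 3
            i = rng.randrange(len(data))
            x, cls = data[i], labels[i]
            hits = [j for j in range(len(data)) if j != i and labels[j] == cls]
            misses = [j for j in range(len(data)) if labels[j] != cls]
            h = data[min(hits, key=lambda j: dist(x, data[j]))]   # near-hit
            m = data[min(misses, key=lambda j: dist(x, data[j]))] # near-miss
            for k in range(n_features):                           # update every weightage
                w[k] += -(x[k] - h[k]) ** 2 + (x[k] - m[k]) ** 2
        selected = [k for k in range(n_features) if w[k] >= threshold]  # step 4
        return selected, w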
21. Data
  Hair Length   Shoe Size   Class
       5            9         M
      10            8         M
      15            5         F
      20            9         M
      25            6         F
      25            5         F
22. Feature Selection for Classification
Classification: Male / Female
(Scatter plot: shoe size vs. hair length (cm) for the six instances, with the near-hit and near-miss neighbours of the two sampled instances x1 and x2 marked.)
Weight updates:
  feature      sample   w    -(x-hit)² + (x-miss)²        = w
  shoe size    x1       0    -(4-5)² + (4-1)²  = -1 + 9     8
               x2       8    -(2-1)² + (2-5)²  = -1 + 9    16
  hair length  x1       0    -(2-1)² + (2-3)²  = -1 + 1     0
               x2       0    -(5-5)² + (5-4)²  =  0 + 1     1
* if (threshold = 5), only the feature "shoe size" will be selected.
(the Python sketch above is applied to this data below)
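Calling the sketch above on the slide's six instances; the worked example appears to compute its differences on normalised/plot-scale values, so weights computed from the raw values will not match its numbers exactly:

    # Raw values from the Data slide: (hair length, shoe size) and class labels.
    data = [(5, 9), (10, 8), (15, 5), (20, 9), (25, 6), (25, 5)]
    labels = ["M", "M", "F", "M", "F", "F"]
    selected, weights = relief(data, labels, no_of_sample=2, threshold=5)
    print(selected, weights)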
26. Feature Selection for Classification
• W = W - diff(X, nearhit)² + diff(X, nearmiss)²
  - the near-hit term decreases the weightage: instances of the same class should have a small distance difference.
  - the near-miss term increases the weightage: instances of different classes should have a large distance difference.
  - if (W <= 0), this is a sign of irrelevancy or redundancy.
  - if (W > 0), instances in different classes are further apart, as expected.
• Disadvantages:
  - applicable only to binary class problems.
  - insufficient training instances can fool Relief.
  - if most features are relevant, Relief selects all of them (even if not necessary).
• Advantages:
  - noise-tolerant.
  - unaffected by feature interaction (weightage is cumulative and determined collectively).
27. Feature Selection for Classification
(11.2) Branch & Bound [generation = complete, evaluation = distance].
• a very old method (1977).
• Modified assumptions:
  - find a minimally sized feature subset.
  - a bound/threshold is used to prune irrelevant branches:
    if F(subset) < bound, remove the subset from the search tree (including all of its subsets).
• Model of the feature set search tree (next slide).
28. Feature Selection for Classification
F = { f1, f2, f3 }
                      F = {f1, f2, f3}
          /                  |                  \
      {f2, f3}           {f1, f3}           {f1, f2}
       /    \             /    \             /    \
    {f3}   {f2}        {f1}   {f3}        {f1}   {f2}
(a Branch & Bound sketch follows below)
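A hedged Branch & Bound sketch in Python; it assumes the criterion J is monotonic (a subset never scores higher than its superset), which is what justifies pruning, and its child enumeration skips duplicate subsets rather than reproducing the slide's tree literally:

    def branch_and_bound(features, J, bound):
        best = list(features)                        # start from the full set F
        def search(subset, start):
            nonlocal best
            if J(subset) < bound:                    # prune this branch and all its subsets
                return
            if len(subset) < len(best):              # smaller subset still meeting the bound
                best = list(subset)
            for i in range(start, len(subset)):      # each child removes one more feature
                search(subset[:i] + subset[i + 1:], i)
        search(list(features), 0)
        return best

    # usage (J and bound are hypothetical): branch_and_bound(["f1", "f2", "f3"], J, bound=0.9)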
29. Category IV - Generation: Heuristic / Evaluation: Information
2 methods:
1) Decision Tree Method (DTM)
   - Run C4.5 over the training set.
   - The selected features are the union of all features appearing in the pruned decision tree produced by C4.5.
   - An information-based function selects the feature at each node of the decision tree.
(a sketch follows below)
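A hedged DTM sketch, assuming scikit-learn: C4.5 itself is not available there, so a CART tree with cost-complexity pruning (ccp_alpha is an illustrative value) stands in for it, and the selected features are the union of features appearing at the tree's internal nodes:

    from sklearn.tree import DecisionTreeClassifier

    def dtm_select(X, y, ccp_alpha=0.01):
        tree = DecisionTreeClassifier(ccp_alpha=ccp_alpha).fit(X, y)
        node_features = tree.tree_.feature               # -2 marks leaf nodes
        return sorted({int(f) for f in node_features if f >= 0})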