Learning vector quantization is to move a prototype close to training samples in its class and move away from samples with different classes. It uses information given by class lables, and often works better than k-means. For k-nearest neighbor classifiers, classification boundaries become smoother with larger k.
- 179
- 0
- 0
注脚
展开查看详情
1.Learning Vector Quantization and K-Nearest Neighbor Learning Vector Quantization and K-Nearest Neighbor Jia Li Department of Statistics The Pennsylvania State University Email: jiali@stat.psu.edu http://www.stat.psu.edu/∼jiali Jia Li http://www.stat.psu.edu/∼jiali
2.Learning Vector Quantization and K-Nearest Neighbor Learning Vector Quantization Developed by Kohonen. A package with document is available at: http://www.cis.hut.fi/nnrc/nnrc-programs.html . When position the prototypes, use information given by class labels, as a contrast to k-means which selects prototypes without using class labels. Often works better than k-means. The idea is to move a prototype close to training samples in its class, and move away from samples with different classes. Jia Li http://www.stat.psu.edu/∼jiali
3.Learning Vector Quantization and K-Nearest Neighbor Jia Li http://www.stat.psu.edu/∼jiali
4.Learning Vector Quantization and K-Nearest Neighbor Jia Li http://www.stat.psu.edu/∼jiali
5.Learning Vector Quantization and K-Nearest Neighbor The Algorithm 1. Start from a set of initial prototypes with classes assigned. Denote the M prototypes by Z = {z1 , ..., zM } and their associated classes by C (zm ), m = 1, 2, ..., M. The initial prototypes can be provided by k-means. 2. Sweep through the training samples and update zm after visiting each sample. Suppose xi is assigned to the mth prototype zm by the nearest neighbor rule: xi − zm ≤ xi − zm , ∀m = m, 1 ≤ m ≤ M If gi = C (zm ), move zm towards the training sample: zm ← zm + (xi − zm ) where is the learning rate. If gi = C (zm ), move zm away from the training sample: zm ← zm − (xi − zm ) 3. Step 2 can be repeated a number of times. Jia Li http://www.stat.psu.edu/∼jiali
6.Learning Vector Quantization and K-Nearest Neighbor Experiments Use the diabetes data set. Use prototypes obtained by k-means as initial prototypes. Use LVQ with = 0.1. Results obtained after 1, 2, and 5 passes are shown below. Classification is not guaranteed to improve after adjusting prototypes. One pass with a small usually helps. But don’t over do it. Comments: Fine tuning often helps: Select initial prototypes. Adjust learning rate . Read the package documents for details. Jia Li http://www.stat.psu.edu/∼jiali
7.Learning Vector Quantization and K-Nearest Neighbor Error rate: 27.74% Error rate: 27.61% Jia Li http://www.stat.psu.edu/∼jiali
8.Learning Vector Quantization and K-Nearest Neighbor Error rate: 27.86% Error rate: 32.37% Jia Li http://www.stat.psu.edu/∼jiali
9.Learning Vector Quantization and K-Nearest Neighbor K-Nearest Neighbor Classifiers Given a query point x0 , find the k training samples x(r ) , r = 1, ..., k closest in distance to x0 , and then classify using majority vote among the k neighbors. Feature normalization is often performed in pre-processing. Classification boundaries become smoother with larger k. Jia Li http://www.stat.psu.edu/∼jiali
10.Learning Vector Quantization and K-Nearest Neighbor Jia Li http://www.stat.psu.edu/∼jiali
11.Learning Vector Quantization and K-Nearest Neighbor Jia Li http://www.stat.psu.edu/∼jiali
12.Learning Vector Quantization and K-Nearest Neighbor Jia Li http://www.stat.psu.edu/∼jiali
13.Learning Vector Quantization and K-Nearest Neighbor Jia Li http://www.stat.psu.edu/∼jiali
14.Learning Vector Quantization and K-Nearest Neighbor A Comparative Study (ElemStatLearn) Two simulated problems. There are 10 independent features Xj , each uniformly distributed on [0, 1]. The two-class 0-1 target variable is defined as follows: problem 1: “easy” 1 Y = I (X1 > ); 2 problem 2: “difficult” 3 1 Y = I sign (Xj − ) > 0 . 2 j=1 Jia Li http://www.stat.psu.edu/∼jiali
15.Learning Vector Quantization and K-Nearest Neighbor For problem 1 (2), except X1 (and X2 , X3 ), all the other features are “noise”. The Bayes error rates are zero. In each run, 100 samples used in training and 1000 used in testing. Jia Li http://www.stat.psu.edu/∼jiali
16.Learning Vector Quantization and K-Nearest Neighbor The figure shows the mean and standard deviation of the misclassification error for nearest-neighbors, K-means and LVQ over 10 realizations (10 simulated data sets), as the tuning parameters change. K-means and LVQ give almost identical results. For the first problem, K-means and LVQ outperform nearest neighbors, assuming the best choice of tuning parameters for each. For the second problem, they perform similarly. The optimal k for the k-nearest neighbor classification differs significantly for the two problems. Jia Li http://www.stat.psu.edu/∼jiali
17.Learning Vector Quantization and K-Nearest Neighbor Jia Li http://www.stat.psu.edu/∼jiali
18.Learning Vector Quantization and K-Nearest Neighbor Adaptive Nearest-Neighbor Methods When dimension is high, data become relatively sparse. Implicit in nearest-neighbor classification is the assumption that the class probabilities are roughly constant in the neighborhood, and hence simple averages give good estimates. In high dimensional space, the neighborhood represented by the few nearest samples may not be local. Consider N data points uniformly distributed in the unit cube [− 21 , 12 ]p . Let R be the radius of a 1-nearest-neighborhood centered at the origin. Then 1/p 1 1/N median(R) = vp−1/p 1− , 2 where vp r p is the volume of the sphere of radius r in p dimensions. The median radius quickly approaches 0.5, the distance to the edge of the cube, when dimension increases. Jia Li http://www.stat.psu.edu/∼jiali
19.Learning Vector Quantization and K-Nearest Neighbor Adjust distance metric locally, so that the resulting neighborhoods stretch out in directions for which the class probabilities don’t change much. Jia Li http://www.stat.psu.edu/∼jiali
20.Learning Vector Quantization and K-Nearest Neighbor Jia Li http://www.stat.psu.edu/∼jiali
21.Learning Vector Quantization and K-Nearest Neighbor Jia Li http://www.stat.psu.edu/∼jiali
22.Learning Vector Quantization and K-Nearest Neighbor Discriminant Adaptive Nearest-Neighbor (DANN) At each query point, a neighborhood of say 50 points is formed. Class probabilities are NOT assumed constant in this neighborhood. This neighborhood is used only to decide how to define the adapted metric. After the metric is decided, a normal k-nearest neighbor rule is applied to classify the query. The metric changes with the query. Jia Li http://www.stat.psu.edu/∼jiali
23.Learning Vector Quantization and K-Nearest Neighbor The DANN metric at a query point x0 is defined by D(x, x0 ) = (x − x0 )T Σ(x − x0 ) , where Σ = W−1/2 [(W−1/2 )T BW−1/2 + I](W−1/2 )T = W−1/2 [B∗ + I](W−1/2 )T . W is the pooled within-class covariance matrix and B is the between class covariance matrix. Jia Li http://www.stat.psu.edu/∼jiali
24.Learning Vector Quantization and K-Nearest Neighbor Intuition We compute W and B in LDA. Recall we did similar computation when deriving discriminant coordinates. Eigen decomposition of B∗ = V∗ DB V∗T . Σ = W−1/2 [B∗ + I](W−1/2 )T = W−1/2 [V∗ DB V∗T + I](W−1/2 )T = W−1/2 V∗ · (DB + I) · (W−1/2 V∗ )T Note that the column vectors of W−1/2 V∗ are simply the discriminant coordinates. I is added to avoid using samples far away from the query point. Jia Li http://www.stat.psu.edu/∼jiali
25.Learning Vector Quantization and K-Nearest Neighbor Geometric interpretation: To compute DNAA metric, x − x0 is projected onto the discriminant coordinates. The projection values on the significant discriminant coordinates are magnified; those on the insignificant DCs are shrunk. The implication is that the neighborhood is stretched in the directions of the insignificant DCs and squeezed in those of the significant ones. (W−1/2 )T is introduced to sphere the data so that the within-class covariance matrix is the identity matrix. Jia Li http://www.stat.psu.edu/∼jiali
26.Learning Vector Quantization and K-Nearest Neighbor Significant discriminant coordinates represent directions in which class probabilities change substantially. Hence when we form a neighborhood, we want it to have small span in these directions. On the opposite, we want the neighborhood to have large span in directions in which class probabilities don’t change much. To summarize: we want to form a neighborhood that contains as many samples as possible but in the mean while has approximately constant class probabilities. Jia Li http://www.stat.psu.edu/∼jiali
27.Learning Vector Quantization and K-Nearest Neighbor To summarize: we want to form a neighborhood that contains as many samples as possible but in the mean while has approximately constant class probabilities. Jia Li http://www.stat.psu.edu/∼jiali
28.Learning Vector Quantization and K-Nearest Neighbor Jia Li http://www.stat.psu.edu/∼jiali
29.Learning Vector Quantization and K-Nearest Neighbor Global Dimension Reduction At each training sample xi , the between-centroids sum of squares matrix Bi is computed, using a neighborhood of say 50 points. Average Bi over all training samples: N ¯= 1 B Bi . N i=1 Jia Li http://www.stat.psu.edu/∼jiali