在机器学习中有效地运用稀疏性

稀疏性是指只使用几个特性表示解决方案。“稀疏”的解决方案推广到样本外更好,由于少过度拟合,并且稀疏解更容易解释由于重要的特性很少而且计算效率更高。本章除了介绍稀疏性的优点之外,还粗略的介绍了其实现算法以及简短的证明。
展开查看详情

1.Efficiently Implementing Sparsity in Learning M. Magdon-Ismail Rensselaer Polytechnic Institute (Joint Work) December 9, 2013.

2. Out-of-Sample is What Counts NO • A pattern exists • We don’t know it YES • We have data to learn it • Tested on new cases ? c Creator: Malik Magdon-Ismail Learning, Sparsity and Big Data: 2 /16 Data −→

3. Data Data Matrix Response Matrix d dimensions name age debt income ··· hair weight sex credit? limit risk John 21yrs −$10K $65K ··· black 175lbs M 2K high     n data points  Joe 74yrs −$100K · · · blonde 275lbs M  $25K  × 0 −           Jane 27yrs −$20K $85K · · · blonde 135lbs F  10K low            .  .  ..  ..             Jen 37yrs −$400K $105K · · · brun 155lbs F 15K high X ∈ Rn×d Y ∈ Rn×ω c Creator: Malik Magdon-Ismail Learning, Sparsity and Big Data: 3 /16 More beautiful data −→

4. More Beautiful Data X ∈ R231×174 Y ∈ R231×166 15 15 % reconstruction error % reconstruction error 10 10 5 5 0 0 10 20 30 40 50 60 10 20 30 40 50 60 number of principal components number of principal components c Creator: Malik Magdon-Ismail Learning, Sparsity and Big Data: 4 /16 Sparsity is good −→

5. Throwing Out Unnecessary Features is Good Sparsity: represent your solution using only a few features. ‘Sparse’ solutions generalize to out-of-sample better – less overfitting. Sparse solutions are easier to interpret – few important features. Computations are more efficient. Problem: How to find the few relevant features quickly. c Creator: Malik Magdon-Ismail Learning, Sparsity and Big Data: 5 /16 PCA, K-means, Linear Regression −→

6. PCA, K-means, Linear Regression k = 20 r = 2k PCA K-Means Regression = Exact Exact Exact Approx, fast (relative error) top-k PCA regression Fast-sparse regression (additive error) Sparse, approx, fast (relative error) Sparse, approx, fast (relative error) c Creator: Malik Magdon-Ismail Learning, Sparsity and Big Data: 6 /16 Sparsity example −→

7. Sparsity Represent your solution using only a few . . . Example: linear regression             =     Xw = y y is an optimal linear combination of only a few columns in X. (sparse regression; regularization (|| w ||0 ≤ k); feature subset selection; . . . ) c Creator: Malik Magdon-Ismail Learning, Sparsity and Big Data: 7 /16 SVD −→

8. Singular Value Decomposition (SVD) Σk 0 Vkt X = Uk Ud−k O(nd2) 0 Σd−k Vd−k t U Σ Vt (n × d) (d × d) (d × d) Xk = Uk Σk Vkt = XVk Vkt Xk is the best rank-k approximation to X. Reconstruction of X using only a few deg. of freedom. X X20 X40 X60 Vk is an orthonormal basis for the best k-dimensional subspace of the row space of X. c Creator: Malik Magdon-Ismail Learning, Sparsity and Big Data: 8 /16 Fast approximate SVD −→

9. Fast Approximate SVD 1: Z = XR R ∼ N (d × r), Z ∈ Rn×r 2: Q = qr.factorize(Z) 3: ˆ k ← svdk (QtX) V ˆ kV Theorem. Let r = k(1 + 1ǫ ) and E = X − XV ˆ t . Then, k E [|| E ||] ≤ (1 + ǫ)|| X − Xk || running time is O(ndk) = o(svd) [BDM, FOCS 2011] c Creator: Malik Magdon-Ismail Learning, Sparsity and Big Data: 9 /16 Vk and sparsity −→

10. Vk and Sparsity Important “dimensions” of Vkt are important for X      ×s 1 ×s2 ×s3 ×s4 ×s5  −→   Vkt ˆ t ∈ Rk×r V k The sampled r columns are “good” if ˆ kt V I = VktVk ≈ V ˆ k. Sampling schemes: Largest norm (Jollife, 1972); Randomized norm sampling (Rudelson, 1999; RudelsonVershynin, 2007); Greedy (Batson et al, 2009; BDM, 2011). c Creator: Malik Magdon-Ismail Learning, Sparsity and Big Data: 10 /16 Sparse PCA – algorithm −→

11. Sparse PCA – Algorithm 1: Choose a few columns C of X; C ∈ Rn×r . 2: Find the best rank-k approximation of X in the span of C, XC,k . 3: Compute the SVDk of t XC,k = UC,k ΣC,k VC,k . 4: Z = XVC,k . Each feature in Z is a mixture of only the few original r feature dimensions in C. t || X − XVC,k VC,k t || ≤ || X − XC,k VC,k VC,k || = || X − XC,k || ≤ 1 + O( 2k r ) || X − Xk ||. [BDM, FOCS 2011] c Creator: Malik Magdon-Ismail Learning, Sparsity and Big Data: 11 /16 Sparse PCA - pictures −→

12. Sparse PCA k = 20 k = 40 k = 60 Dense PCA Sparse PCA, r = 2k Theorem. One can construct, in o(svd), k features that are r-sparse, r = O(k), that are as good as exact dense top-k PCA-features. c Creator: Malik Magdon-Ismail Learning, Sparsity and Big Data: 12 /16 Clustering: K-Means −→

13. Clustering: K-Means Full, slow Fast, sparse 3 clusters 4 Clusters Theorem. There is a subset of features of size O(#clusters) which produces nearly the optimal partition (within a constant factor). One can quickly produce features with a log-approximation factor. [BDM,2013] c Creator: Malik Magdon-Ismail Learning, Sparsity and Big Data: 13 /16 Regression using few important features −→

14. Fast Regression using Few Important Features PCA, slow, dense Sparse, fast k = 20 k = 40 Theorem. Can find O(k) pure features which performs as well top-k pca-regression (additive error controlled by || X − Xk ||F /σk ). [BDM,2013] c Creator: Malik Magdon-Ismail Learning, Sparsity and Big Data: 14 /16 Proofs −→

15. The Proofs All the algorithms use the sparsifier of Vkt in [BDM,FOCS2011]. 1. Choose columns of Vkt to preserve its singular values. 2. Ensure that the selected columns preserve the structural properties of the objective with respect to the columns of X that are sampled. 3. Use dual set sparsification algorithms to accomplish (2). c Creator: Malik Magdon-Ismail Learning, Sparsity and Big Data: 15 /16 Thanks −→

16. THANKS! • Data compression (PCA): quick and reveals few important features • Unsupervised clustering: quick and reveals few important features • Supervised Regression: quick and reveals few important features Few features: easy to interpret; better generalizers; faster computations. c Creator: Malik Magdon-Ismail Learning, Sparsity and Big Data: 16 /16