This post is part of a series of slides from Stanford lectures on big data: analysis and inference with large matrices. The appearance of large matrices in many applications has brought with it a range of new algorithms and tools, and over the past few years large-matrix analysis and numerical linear algebra have become a thriving field. Much of machine learning is built on linear algebra: typically, a prediction is a function of the dot product between a parameter vector and a feature vector, which essentially assumes a kind of independence among the features. A matrix parameter, by contrast, can be used to learn interactions between features: entry (i, j) of the parameter matrix describes how feature i relates to feature j. This talk introduces several such matrix parameterizations and their applications.
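To make the contrast concrete, here is a minimal sketch (in NumPy, with a made-up feature vector and random parameters, none of which come from the slides) of a linear score w·x next to a matrix-parameterized score x′Mx, where M[i, j] weights the interaction between feature i and feature j.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
x = rng.normal(size=d)          # a feature vector (made-up data)

# Linear model: the prediction depends on each feature independently.
w = rng.normal(size=d)
linear_score = w @ x

# Matrix-parameterized model: M[i, j] weights the interaction
# between feature i and feature j (here as a quadratic form).
M = rng.normal(size=(d, d))
interaction_score = x @ M @ x

print(linear_score, interaction_score)
```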

1. Learning with Matrix Parameters (Nati Srebro)

2. Matrix Completion
[Figure: a users × movies rating matrix with observed entries and "?" for missing ratings]
• Predictor itself is a matrix
• To learn: need bias (prior / hypothesis class / regularizer)
• Elementwise (i.e. treat the matrix as a vector) → can't generalize
• Matrix constraints/regularizers:
  • Block/cluster structure (e.g. Plaid Model)
  • Rank
  • Factorization norms: Trace-Norm, Weighted Trace-Norm, Max-Norm, Local Max-Norm, …
  • Spectral regularizers
  • Group norms
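As an illustration of the factorization/regularization bias this slide points to, here is a minimal sketch that fits X ≈ UV′ to the observed entries of a toy rating matrix with an ℓ2 penalty on the factors (which, as slide 4 notes, is how the trace-norm arises). The data, rank, penalty, and step size are assumptions, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy ratings matrix with missing entries (np.nan marks "?").
A = np.array([[5, 4, np.nan, 1],
              [4, np.nan, 1, 1],
              [1, 1, 4, np.nan],
              [np.nan, 1, 5, 4]], dtype=float)
mask = ~np.isnan(A)

k, lam, lr = 2, 0.1, 0.05          # rank, regularization, step size (assumed)
U = 0.1 * rng.normal(size=(A.shape[0], k))
V = 0.1 * rng.normal(size=(A.shape[1], k))

for _ in range(2000):
    R = np.where(mask, U @ V.T - A, 0.0)   # residual on observed entries only
    # Gradient step on 1/2*sum_observed (UV' - A)^2 + lam/2*(||U||_F^2 + ||V||_F^2)
    U -= lr * (R @ V + lam * U)
    V -= lr * (R.T @ U + lam * V)

print(np.round(U @ V.T, 1))   # the fitted low-rank matrix fills in the "?" entries
```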

3. Matrix Factorization Models
rank(X) = min_{X = UV′} dim(U, V)
[Figure: the rating matrix X written as the product of a tall factor U and a wide factor V]

4. Matrix Factorization Models
rank(X) = min_{X = UV′} dim(U, V)
[Figure: the rating matrix X factored as X = UV′ with low-norm factors U and V]
• Bound the average norm of the factorization:
  ‖U‖_F² = Σ_i ‖U_i‖²,   ‖X‖_tr = min_{X = UV′} ‖U‖_F · ‖V‖_F
• Bound the norm of the factorization uniformly:
  ‖U‖_{2,∞} = max_i ‖U_i‖,   ‖X‖_max = min_{X = UV′} ‖U‖_{2,∞} · ‖V‖_{2,∞}   (a.k.a. the γ₂:ℓ₁→ℓ∞ norm)
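A small numerical check of the definitions above, on an assumed random test matrix: the trace-norm equals the sum of singular values and is attained by the balanced SVD factorization, while the product of maximum row norms of the same factors gives an upper bound on the max-norm.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))

# Trace-norm = sum of singular values.
Usvd, s, Vt = np.linalg.svd(X, full_matrices=False)
trace_norm = s.sum()

# Balanced factorization X = U V' with U = Usvd*sqrt(S), V = Vsvd*sqrt(S).
U = Usvd * np.sqrt(s)
V = Vt.T * np.sqrt(s)
assert np.allclose(U @ V.T, X)

# ||X||_tr = min_{X=UV'} ||U||_F * ||V||_F is attained by this factorization.
print(trace_norm, np.linalg.norm(U, "fro") * np.linalg.norm(V, "fro"))

# The max-norm instead bounds the factorization uniformly by row norms;
# this particular factorization only gives an upper bound on ||X||_max.
U_row_max = np.linalg.norm(U, axis=1).max()    # ||U||_{2,inf}
V_row_max = np.linalg.norm(V, axis=1).max()
print("upper bound on ||X||_max:", U_row_max * V_row_max)
```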

5. Transfer in Multi-Task Learning
• m related prediction tasks: learn a predictor 𝜙_i for each task i = 1..m [Argyriou et al 2007]
• m classes, predict with arg max_y 𝜙_y(x) [Amit et al 2007]
• Transfer from learned tasks to a new task
• Semi-supervised learning: create auxiliary tasks from unlabeled data (e.g. predict a held-out word from its context), transfer from the auxiliary tasks to the actual task of interest (e.g. parsing, tagging) [Ando & Zhang 2005]
[Figure: factorization model ≡ two-layer network, with shared units (learned features) in the hidden layer]
• Predictors are naturally parameterized by a matrix (but there is no requirement that we output a matrix)
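A sketch of the two-layer view on this slide, on assumed toy data: the m task predictors are the columns of W = UV, where U is the shared hidden layer (learned features) and each column of V holds one task's weights on those features. The training loop and all sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, m, n = 10, 3, 5, 200           # features, shared dims, tasks, samples (assumed)

# Toy data: m regression tasks whose true predictors share a k-dim subspace.
W_true = rng.normal(size=(d, k)) @ rng.normal(size=(k, m))
X = rng.normal(size=(n, d))
Y = X @ W_true + 0.1 * rng.normal(size=(n, m))

# Factorized predictor matrix W = U V: U is the shared "hidden layer",
# column t of V holds the task-t weights on those learned features.
U = 0.1 * rng.normal(size=(d, k))
V = 0.1 * rng.normal(size=(k, m))
lr = 0.05
for _ in range(1000):
    err = X @ (U @ V) - Y            # n x m residuals, all tasks at once
    grad_W = X.T @ err / n           # gradient w.r.t. the full d x m matrix W
    U -= lr * grad_W @ V.T           # chain rule through the factorization
    V -= lr * U.T @ grad_W
print("train MSE:", np.mean(err ** 2))
```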

6. Correlation Clustering as Matrix Learning [Jalali et al 2011, 2012]
[Figure: an input similarity matrix and the corresponding clustering matrix]

7. Correlation Clustering as Matrix Learning [Jalali et al 2011, 2012]
[Figure: an input similarity matrix and the corresponding clustering matrix]

8. Correlation Clustering as Matrix Learning [Jalali et al 2011, 2012]
[Figure: input similarity matrix A ≈ clustering matrix K]
min_K ‖K − A‖₁   s.t. K is permuted block-1-diagonal
• Can represent the desired object as a matrix
Also:
• Crowdsourced similarity learning [Tamuz et al 2011]
• Binary hashing [Tavakoli et al 2013]
• Collaborative permutation learning
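A tiny sketch of the representation (not the optimization): a clustering is encoded as a permuted block-1-diagonal matrix K, and ‖K − A‖₁ counts the entries of the similarity matrix that disagree with it. The 6-node binary similarity matrix below is an assumed toy example.

```python
import numpy as np

# Toy 6-node binary similarity matrix A (1 = "similar"), assumed for illustration.
A = np.array([[1, 1, 1, 0, 0, 0],
              [1, 1, 0, 0, 1, 0],
              [1, 0, 1, 0, 0, 0],
              [0, 0, 0, 1, 1, 1],
              [0, 1, 0, 1, 1, 1],
              [0, 0, 0, 1, 1, 1]])

def clustering_matrix(labels):
    """K[i, j] = 1 iff i and j share a cluster (a permuted block-1-diagonal matrix)."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(int)

# Candidate clustering {0,1,2} vs {3,4,5}.
K = clustering_matrix([0, 0, 0, 1, 1, 1])
print("disagreements ||K - A||_1 =", np.abs(K - A).sum())
```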

9. Covariance/Precision Matrix Estimation
• Learning a Mahalanobis metric: d(x, y) ∝ exp(−(x − y)′ M (x − y))
• Inferring dependency structure:
  • Sparse Markov net ⇔ sparse precision matrix
  • k latent variables ⇒ sparse + rank-k
  • Many latent variables with a regularized effect ⇒ sparse + Trace-Norm/Max-Norm
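To illustrate the "sparse Markov net ⇔ sparse precision matrix" point, here is a brief sketch using scikit-learn's GraphicalLasso as an off-the-shelf ℓ1-penalized precision estimator; the chain-structured toy Gaussian and the penalty value are assumptions.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)

# Toy data from a chain-structured Gaussian: the true precision is tridiagonal.
d, n = 5, 2000
prec_true = np.eye(d) + 0.4 * (np.eye(d, k=1) + np.eye(d, k=-1))
cov_true = np.linalg.inv(prec_true)
X = rng.multivariate_normal(np.zeros(d), cov_true, size=n)

# L1-penalized maximum-likelihood estimate of the precision matrix.
model = GraphicalLasso(alpha=0.05).fit(X)
print(np.round(model.precision_, 2))   # entries off the chain are driven toward zero
```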

10. Principal Component Analysis
• View I: low-rank matrix approximation
  • min_{rank(K) ≤ k} ‖A − K‖
  • The approximating matrix itself is the matrix parameter
  • Does not give a compact representation
  • Does not generalize
• View II: find a subspace capturing as much of the data distribution as possible
  • Maximize the variance inside the subspace: max E[‖Px‖²]
  • Minimize the reconstruction error: min E[‖x − Px‖²]
  • The parameter is a low-dimensional subspace ⇒ represent it as a matrix

11. Principal Component Analysis: Matrix Representation
• min_{rank(X) ≤ k} ‖A − X‖
• Represent the subspace using a basis matrix U ∈ ℝ^{d×k}:
  min_U E[ min_v ‖x − Uv‖² ] = min_{0 ≼ UU′ ≼ I} E[ x′(I − UU′)x ]
• Represent the subspace using a projector P = UU′ ∈ ℝ^{d×d}:
  min E[ x′(I − P)x ]   s.t. 0 ≼ P ≼ I,  rank(P) ≤ k

12. Principal Component Analysis: Matrix Representation
• min_{rank(X) ≤ k} ‖A − X‖
• Represent the subspace using a basis matrix U ∈ ℝ^{d×k}:
  min_U E[ min_v ‖x − Uv‖² ] = min_{0 ≼ UU′ ≼ I} E[ x′(I − UU′)x ]
• Represent the subspace using a projector P = UU′ ∈ ℝ^{d×d}:
  min E[ x′(I − P)x ]   s.t. 0 ≼ P ≼ I,  rank(P) ≤ k relaxed to tr(P) ≤ k
• The optimum is preserved under the relaxation
• A rank-k P can be extracted efficiently using randomized rounding, without loss in the objective
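A short sketch of the projector view, on assumed toy data: take the top-k eigenvectors U of the empirical covariance, form P = UU′, and compare the empirical objective E[x′(I − P)x] against that of a random rank-k projector.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, k = 8, 1000, 3
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d)) / np.sqrt(d)   # correlated toy data
X -= X.mean(axis=0)

def objective(P):
    # Empirical version of E[x'(I - P)x].
    return np.mean(np.einsum("ni,ij,nj->n", X, np.eye(d) - P, X))

# PCA projector: P = U U' with U the top-k eigenvectors of the covariance.
C = X.T @ X / n
eigvals, eigvecs = np.linalg.eigh(C)
U = eigvecs[:, -k:]                  # eigh returns eigenvalues in ascending order
P_pca = U @ U.T

# A random rank-k projector for comparison.
Q, _ = np.linalg.qr(rng.normal(size=(d, k)))
P_rand = Q @ Q.T

print("PCA projector objective:   ", objective(P_pca))
print("random projector objective:", objective(P_rand))
```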

13. Matrix Learning
• Matrix Completion, Direct Matrix Learning: the predictor itself is a matrix
• Multi-Task/Class Learning: predictors can be parameterized by a matrix
• Similarity Learning, Link Prediction, Collaborative Permutation Learning, Clustering: can represent the desired object as a matrix
• Subspace Learning (PCA), Topic Models: basis matrix or projector; the desired output must have a specific structure
What is a good inductive bias?

14. Possible Inductive Bias: Matrix Constraints / Regularizers
• Elementwise: Frobenius ‖X‖₂, ‖X‖₁, ‖X‖∞
• Factorization: Rank, Trace-Norm, Weighted Trace-Norm, Max-Norm, Local Max-Norm, NMF, Sparse MF
• Operator norms: Spectral Norm ‖X‖₂
• Group norms: Group Lasso, ‖X‖_{2,∞}
• Structural: Plaid Models, Block Structure

15. Spectral Functions
• Spectral function: F(X) = f(singular values of X)
• F is spectral iff it is rotation invariant: F(X) = F(UXV′)
• Examples:
  • rank(X) = ‖spectrum‖₀
  • Frobenius ‖X‖_F = ‖spectrum‖₂
  • Trace-Norm = ‖spectrum‖₁
  • Spectral Norm ‖X‖₂ = ‖spectrum‖∞
  • Positive semi-definite ≡ spectrum ≥ 0
  • Trace of a p.s.d. matrix = Σ spectrum
  • Relative entropy of the spectrum
• Can lift many vector properties:
  • Convexity (and strong convexity)
  • Gradients: ∇F(X) = U ∇f(S) V′ for X = USV′
  • Projection operations
  • Duality: F*(X) = U f*(S) V′
  • Mirror-descent updates (e.g. "multiplicative matrix updates")
  • ≈ concentration bounds
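A quick sketch of the spectral-function idea on an assumed random test matrix: each quantity listed above is a vector norm applied to the singular values, and it is unchanged when X is replaced by UXV′ for orthogonal U, V.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
s = np.linalg.svd(X, compute_uv=False)        # the spectrum (singular values)

print("rank       :", np.linalg.matrix_rank(X), "=", np.sum(s > 1e-10))
print("Frobenius  :", np.linalg.norm(X, "fro"), "=", np.linalg.norm(s, 2))
print("trace-norm :", np.linalg.norm(X, "nuc"), "=", np.linalg.norm(s, 1))
print("spectral   :", np.linalg.norm(X, 2), "=", np.linalg.norm(s, np.inf))

# Rotation invariance: F(X) = F(U X V') for orthogonal U, V.
U, _ = np.linalg.qr(rng.normal(size=(5, 5)))
V, _ = np.linalg.qr(rng.normal(size=(3, 3)))
print("trace-norm after rotation:", np.linalg.norm(U @ X @ V.T, "nuc"))
```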

16. Possible Inductive Bias: Matrix Constraints / Regularizers
• Elementwise: Frobenius ‖X‖₂, ‖X‖₁, ‖X‖∞
• Factorization: Rank, Trace-Norm, Weighted Trace-Norm, Max-Norm, Local Max-Norm, NMF, Sparse MF
• Operator norms: Spectral Norm ‖X‖₂
• Group norms: Group Lasso, ‖X‖_{2,∞}
• Structural: Plaid Models, Block Structure

17. Learning with Matrices
• Matrices occur explicitly or implicitly in many learning problems
• Advantages of the matrix view (even when not explicit): we can use existing tools, relaxations, optimization methods, concentration and generalization bounds, etc.
• What is a good inductive bias?
• Is some structure required, or is it just an inductive bias?
• Spectral functions are convenient, but don't capture everything!