
Learning with Matrices

1. Learning with Matrix Parameters (Nati Srebro)

2. Matrix Completion
[Figure: partially observed users × movies rating matrix]
• Predictor itself is a matrix
• To learn: need bias (prior / hypothesis class / regularizer)
• Elementwise (i.e. treat matrix as vector) ⇒ can’t generalize
• Matrix constraints/regularizers:
  • Block/cluster structure (e.g. Plaid Model)
  • Rank
  • Factorization norms: Trace-Norm, Weighted Tr-Norm, Max-Norm, Local Max-Norm, …
  • Spectral regularizers
  • Group norms
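The factorization approach to completion can be sketched numerically. Everything below (matrix sizes, observation rate, the plain gradient-descent fit) is an illustrative assumption, not taken from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: a rank-2 "true" users x movies matrix, partially observed.
n_users, n_movies, k = 20, 15, 2
X_true = rng.normal(size=(n_users, k)) @ rng.normal(size=(k, n_movies))
mask = rng.random((n_users, n_movies)) < 0.6   # which entries are observed

# Factorized completion: fit X ~ U V' to the observed entries by gradient descent.
U = 0.1 * rng.normal(size=(n_users, k))
V = 0.1 * rng.normal(size=(n_movies, k))
lr = 0.05
for _ in range(3000):
    R = mask * (U @ V.T - X_true)              # residual on observed entries only
    U, V = U - lr * R @ V, V - lr * R.T @ U

# The low-rank bias is what lets observed entries generalize to unobserved ones.
err_unseen = np.abs((~mask) * (U @ V.T - X_true)).mean()
```

Fitting the entries elementwise (no rank constraint) would reproduce the observed cells exactly and say nothing about the missing ones, which is the "can't generalize" point above.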

3. Matrix Factorization Models
[Figure: the matrix X factored as U (tall) times V′ (wide)]
rank(X) = min dim(U, V) s.t. X = UV′

4. Matrix Factorization Models
rank(X) = min dim(U, V) s.t. X = UV′
• Bound avg norm of factorization: ‖U‖²_F = Σᵢ ‖Uᵢ‖²
  ‖X‖_tr = min over X = UV′ of ‖U‖_F · ‖V‖_F
• Bound norm of factorization uniformly: ‖U‖_{2,∞} = maxᵢ ‖Uᵢ‖
  ‖X‖_max = min over X = UV′ of ‖U‖_{2,∞} · ‖V‖_{2,∞}
  (aka the γ₂:ℓ₁→ℓ∞ norm)
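The trace-norm definition above can be checked numerically: any factorization X = UV′ certifies the upper bound ‖U‖_F·‖V‖_F, and the balanced factorization from the SVD attains the sum of singular values exactly. A small sketch on random data:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 5))

# Trace norm = sum of singular values.
s = np.linalg.svd(X, compute_uv=False)
trace_norm = s.sum()

# Balanced factorization from the SVD: U = A sqrt(S), V = B sqrt(S), so that
# X = U V' and the bound ||U||_F * ||V||_F equals the trace norm.
A, S, Bt = np.linalg.svd(X, full_matrices=False)
U = A * np.sqrt(S)
V = Bt.T * np.sqrt(S)
bound = np.linalg.norm(U) * np.linalg.norm(V)   # product of Frobenius norms
```

Any unbalanced factorization gives a strictly larger product, which is why the trace norm is defined as the minimum over all factorizations.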

5. Transfer in Multi-Task Learning
• m related prediction tasks: learn predictor 𝜙ᵢ for each task i = 1..m [Argyriou et al 2007]
• m classes, predict with arg maxᵧ 𝜙ᵧ(𝑥) [Amit et al 2007]
• Transfer from learned tasks to new task
• Semi-supervised learning: create auxiliary tasks from unlabeled data (e.g. predict held-out word from context), transfer from aux tasks to actual task of interest (e.g. parsing, tagging) [Ando Zhang 2005]
• Factorization model ≡ two-layer network, shared units (learned features) in hidden layer
• Predictors naturally parameterized by a matrix (but there is no requirement that we output a matrix)
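The shared-factorization view of multi-task learning can be sketched as follows; the data generation, dimensions, and the SVD truncation of the least-squares solution are all illustrative assumptions rather than the method from any of the cited papers:

```python
import numpy as np

rng = np.random.default_rng(2)
d, m, k, n = 10, 5, 2, 200        # features, tasks, shared dims, samples

# Hypothetical related tasks: all m linear predictors lie in a k-dim subspace
# (the shared "learned features" of the two-layer view).
U_shared = rng.normal(size=(d, k))
W_true = U_shared @ rng.normal(size=(k, m))     # d x m matrix of predictors

X = rng.normal(size=(n, d))
Y = X @ W_true + 0.01 * rng.normal(size=(n, m))

# Learning all tasks jointly = learning one d x m matrix W; low rank is the
# transfer bias. Here: truncate the per-task least-squares solution to rank k.
W_ls, *_ = np.linalg.lstsq(X, Y, rcond=None)
A, s, Bt = np.linalg.svd(W_ls, full_matrices=False)
W_lowrank = (A[:, :k] * s[:k]) @ Bt[:k, :]
```

The rank-k factor A[:, :k] plays the role of the shared hidden layer: a new task can reuse it and only fit k coefficients instead of d.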

6. Correlation Clustering as Matrix Learning [Jalali et al 2011, 2012]
[Figure: input similarity matrix alongside the target clustering matrix]

8. Correlation Clustering as Matrix Learning [Jalali et al 2011, 2012]
A ≈ K: approximate the input similarity matrix A by a clustering matrix K
  min_K ‖K − A‖₁ s.t. K is permuted block-diagonal (blocks of ones)
• Can represent desired object as a matrix
Also:
• Crowdsourced similarity learning [Tamuz et al 2011]
• Binary hashing [Tavakoli et al 2013]
• Collaborative permutation learning
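The clustering-matrix objective above can be made concrete: for a fixed cluster assignment, K is the 0/1 co-membership matrix (permuted block-diagonal with all-ones blocks), and the cost is ‖K − A‖₁. The labels and noisy A below are made-up illustration data:

```python
import numpy as np

# Clustering matrix for a given assignment: K[i, j] = 1 iff i and j share a
# cluster, i.e. a permuted block-diagonal matrix of all-ones blocks.
labels = np.array([0, 0, 1, 1, 1, 2])
K = (labels[:, None] == labels[None, :]).astype(float)

# Hypothetical noisy input similarity matrix A, roughly equal to K.
rng = np.random.default_rng(3)
A = np.clip(K + 0.2 * rng.normal(size=K.shape), 0.0, 1.0)
A = (A + A.T) / 2

# Correlation-clustering cost of this assignment: entrywise L1 distance.
objective = np.abs(K - A).sum()
```

The learning problem minimizes this cost over all valid K, which is combinatorial; the matrix view is what enables the convex relaxations in the cited work.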

9. Covariance/Precision Matrix Estimation
• Learning a Mahalanobis metric, e.g. similarity(𝑥, 𝑦) ∝ exp(−(𝑥 − 𝑦)′M(𝑥 − 𝑦))
• Inferring dependency structure:
  • Sparse Markov net ⇔ sparse precision matrix
  • k latent variables ⇒ + rank-k component
  • Many latent variables with regularized effect ⇒ + Trace-Norm/Max-Norm
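The "sparse Markov net ⇔ sparse precision matrix" point can be illustrated with a tiny hand-picked Gaussian chain (the specific numbers are illustrative):

```python
import numpy as np

# Chain-structured Gaussian Markov net x1 - x2 - x3: the precision matrix
# (inverse covariance) is tridiagonal, i.e. sparse.
P = np.array([[ 2.0, -1.0,  0.0],
              [-1.0,  2.0, -1.0],
              [ 0.0, -1.0,  2.0]])
C = np.linalg.inv(P)

# P[0, 2] == 0 encodes that x1 and x3 are conditionally independent given x2,
# even though they are marginally correlated: the covariance C[0, 2] != 0.
```

This is why sparsity is imposed on the precision matrix rather than the covariance: zeros in the precision are exactly the missing edges of the Markov net.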

10. Principal Component Analysis
• View I: low-rank matrix approximation
  min ‖𝐴 − 𝑋‖ s.t. rank(𝑋) ≤ k
  • Approximating matrix itself is a matrix parameter
  • Does not give compact representation
  • Does not generalize
• View II: find subspace capturing as much of the data distribution as possible
  • Maximizing variance inside subspace: max 𝐸[‖𝑃𝑥‖²]
  • Minimizing reconstruction error: min 𝐸[‖𝑥 − 𝑃𝑥‖²]
  • Parameter is a low-dim subspace ⇒ represent as a matrix
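The two criteria in View II are equivalent, since captured variance plus reconstruction error equals total variance; a small numerical check with the top-k eigenvectors of the covariance (random correlated data as a stand-in):

```python
import numpy as np

rng = np.random.default_rng(4)
d, k = 8, 2
X = rng.normal(size=(500, d)) @ rng.normal(size=(d, d))   # correlated data
X -= X.mean(axis=0)

# View II: the top-k eigenvectors of the covariance span the best subspace.
C = X.T @ X / len(X)
evals, evecs = np.linalg.eigh(C)          # eigenvalues in ascending order
U = evecs[:, ::-1][:, :k]
P = U @ U.T                               # projector onto the subspace

# Captured variance E[||Px||^2] and reconstruction error E[||x - Px||^2].
captured = np.einsum('ij,ij->', X @ P, X) / len(X)
recon_err = np.einsum('ij,ij->', X @ (np.eye(d) - P), X) / len(X)
total = np.trace(C)                       # total variance
```

Because captured + recon_err = total for any projector, maximizing one criterion is the same as minimizing the other.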

11. Principal Component Analysis: Matrix Representation
• Represent subspace using basis matrix 𝑈 ∈ ℝ^{d×k}:
  min_𝑈 𝐸[min_𝑣 ‖𝑥 − 𝑈𝑣‖²] = min over 𝑈′𝑈 = 𝐼 of 𝐸[𝑥′(𝐼 − 𝑈𝑈′)𝑥]
• Represent subspace using projector 𝑃 = 𝑈𝑈′ ∈ ℝ^{d×d}:
  min 𝐸[𝑥′(𝐼 − 𝑃)𝑥] s.t. 0 ≼ 𝑃 ≼ 𝐼, rank(𝑃) ≤ k
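A quick numerical check of the identity on this slide, min_𝑣 ‖𝑥 − 𝑈𝑣‖² = 𝑥′(𝐼 − 𝑈𝑈′)𝑥 for an orthonormal basis 𝑈 (random 𝑈 and 𝑥 as stand-ins):

```python
import numpy as np

rng = np.random.default_rng(5)
d, k = 6, 2
U, _ = np.linalg.qr(rng.normal(size=(d, k)))   # orthonormal basis, U'U = I
x = rng.normal(size=d)

# Inner minimization in closed form: the best v is U'x (least squares), so
# min_v ||x - U v||^2 = x'(I - U U')x.
v_star = U.T @ x
lhs = np.linalg.norm(x - U @ v_star) ** 2

P = U @ U.T                                    # projector: 0 <= P <= I, rank k
rhs = x @ (np.eye(d) - P) @ x
```

The projector form is what makes the PCA objective linear in the matrix parameter P, setting up the relaxation on the next slide.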

12. Principal Component Analysis: Matrix Representation
• Represent subspace using basis matrix 𝑈 ∈ ℝ^{d×k}:
  min_𝑈 𝐸[min_𝑣 ‖𝑥 − 𝑈𝑣‖²] = min over 𝑈′𝑈 = 𝐼 of 𝐸[𝑥′(𝐼 − 𝑈𝑈′)𝑥]
• Represent subspace using projector 𝑃 = 𝑈𝑈′ ∈ ℝ^{d×d}:
  min 𝐸[𝑥′(𝐼 − 𝑃)𝑥] s.t. 0 ≼ 𝑃 ≼ 𝐼, tr(𝑃) ≤ k   (relaxing rank(𝑃) ≤ k)
• Optimum preserved under the relaxation
• Efficiently extract a rank-k 𝑃 using randomized rounding, without loss in objective
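One way the rounding step can be realized (a sketch, not necessarily the exact scheme intended on the slides): round the eigenvalues of a fractional 𝑃 with 0 ≼ 𝑃 ≼ 𝐼 and integer trace to {0, 1} by pairwise randomized rounding. The sum is preserved exactly and the expectation is unchanged; since the objective is linear in 𝑃, it is preserved in expectation.

```python
import numpy as np

rng = np.random.default_rng(6)

def round_eigenvalues(lam, rng):
    """Pairwise randomized rounding: map lam in [0,1]^d with integer sum to a
    0/1 vector with the same sum and E[output] = lam."""
    lam = lam.astype(float).copy()
    while True:
        frac = np.where((lam > 1e-9) & (lam < 1 - 1e-9))[0]
        if len(frac) < 2:
            break
        i, j = frac[:2]
        # Shift mass between lam[i] and lam[j] (sum fixed) until one of them
        # hits 0 or 1; the two moves are mixed so that E[lam] is unchanged.
        up = min(1 - lam[i], lam[j])       # +up to lam[i], -up to lam[j]
        down = min(lam[i], 1 - lam[j])     # -down to lam[i], +down to lam[j]
        if rng.random() < down / (up + down):
            lam[i], lam[j] = lam[i] + up, lam[j] - up
        else:
            lam[i], lam[j] = lam[i] - down, lam[j] + down
    return np.round(lam)

# Demo on a hypothetical fractional spectrum with trace k = 2.
d, k = 5, 2
W, _ = np.linalg.qr(rng.normal(size=(d, d)))    # eigenbasis of fractional P
lam = np.array([0.9, 0.6, 0.3, 0.2, 0.0])
lam_hat = round_eigenvalues(lam, rng)
P_hat = (W * lam_hat) @ W.T                     # a genuine rank-k projector
```

Each iteration makes at least one eigenvalue integral, so the loop terminates in at most d steps, and the output spectrum is 0/1 with exactly k ones.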

13. Matrix Learning
• Matrix completion, direct matrix learning: predictor itself is a matrix
• Multi-task/class learning: predictors can be parameterized by a matrix
• Similarity learning, link prediction, collaborative permutation learning, clustering: can represent desired object as a matrix (desired output must have specific structure)
• Subspace learning (PCA), topic models: basis matrix or projector
• What is a good inductive bias?

14. Possible Inductive Bias: Matrix Constraints / Regularizers
• Elementwise: Frobenius ‖𝑋‖₂, ‖𝑋‖₁, ‖𝑋‖∞; group norms (Group Lasso, ‖𝑋‖_{2,∞})
• Factorization: Rank; Trace-Norm, Weighted Tr-Norm, Max-Norm, Local Max-Norm; structural (Plaid Models, block structure, NMF, sparse MF)
• Operator norms: Spectral Norm ‖𝑋‖₂

15. Spectral Functions
• Spectral function: F(X) = f(singular values of X)
• F is spectral iff it is rotation invariant: F(X) = F(UXV′)
• Examples:
  • rank(X) = |spectrum|₀
  • Frobenius |X|₂ = |spectrum|₂
  • Trace-Norm = |spectrum|₁
  • Spectral Norm ‖X‖₂ = |spectrum|∞
  • Positive semi-definite ≡ spectrum ≥ 0
  • Trace of p.s.d. matrix = Σ spectrum
  • Relative entropy of spectrum
• Can lift many vector properties:
  • Convexity (and strong convexity)
  • Gradients: 𝛻𝐹(𝑋) = 𝑈 𝛻𝑓(𝑆) 𝑉′
  • Projection operations
  • Duality: F*(X) = U f*(S) V′
  • Mirror-descent updates (e.g. “multiplicative matrix updates”)
  • ≈ Concentration bounds
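The lifting recipe at the top of this slide is directly implementable: apply a vector function to the singular values. A small sketch checking the listed examples and the rotation invariance on random data:

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(5, 4))

def spectral(f, X):
    """Lift a vector function f to a spectral matrix function F(X) = f(svals)."""
    return f(np.linalg.svd(X, compute_uv=False))

rank_X    = spectral(lambda s: int((s > 1e-9).sum()), X)     # |spectrum|_0
frob      = spectral(lambda s: np.sqrt((s ** 2).sum()), X)   # |spectrum|_2
trace_nrm = spectral(lambda s: s.sum(), X)                   # |spectrum|_1
spec_nrm  = spectral(lambda s: s.max(), X)                   # |spectrum|_inf

# Rotation invariance: F(U X V') = F(X) for orthogonal U, V.
U, _ = np.linalg.qr(rng.normal(size=(5, 5)))
V, _ = np.linalg.qr(rng.normal(size=(4, 4)))
rotated = spectral(lambda s: s.sum(), U @ X @ V.T)
```

Invariance holds because orthogonal rotations leave the singular values unchanged, which is exactly the "F is spectral iff rotation invariant" characterization.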

16. Possible Inductive Bias: Matrix Constraints / Regularizers (slide 14 repeated)

17. Learning with Matrices
• Matrices occur explicitly or implicitly in many learning problems
• Advantages of the matrix view (even when not explicit): can use existing tools, relaxations, optimization methods, concentration and generalization bounds, etc.
• What is a good inductive bias?
• Is some structure required, or is it just an inductive bias?
• Spectral functions are convenient, but don’t capture everything!
