# Large-Scale Matrix Analysis and Inference

Wouter M. Koolen, Manfred Warmuth, Reza Bosagh Zadeh, Gunnar Carlsson, Michael Mahoney

NIPS 2013, Dec 9

## Introductory musing: what is a matrix?

A matrix $A = (a_{i,j})$ can be viewed as:

1. A vector of $n^2$ parameters
2. A covariance
3. A generalized probability distribution
4. ...

## 1. A vector of $n^2$ parameters

When you regularize with the squared Frobenius norm,

$$\min_W \; \|W\|_F^2 + \sum_n \mathrm{loss}\big(\mathrm{tr}(W X_n)\big),$$

this is equivalent to

$$\min_{\mathrm{vec}(W)} \; \|\mathrm{vec}(W)\|_2^2 + \sum_n \mathrm{loss}\big(\mathrm{vec}(W) \cdot \mathrm{vec}(X_n)\big).$$

No structure is used: the $n^2$ entries are treated as independent variables.
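The equivalence rests on two identities that are easy to check numerically: the Frobenius norm of $W$ is the 2-norm of $\mathrm{vec}(W)$, and for symmetric $X$ (the case of interest in the rest of the talk) $\mathrm{tr}(WX) = \mathrm{vec}(W) \cdot \mathrm{vec}(X)$. A minimal numpy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
W = rng.standard_normal((n, n))
X = rng.standard_normal((n, n))
X = X + X.T  # symmetric, as in the covariance / density-matrix setting

# Frobenius norm of W equals the 2-norm of its vectorization
assert np.isclose(np.linalg.norm(W, "fro"), np.linalg.norm(W.ravel()))

# For symmetric X, tr(W X) equals vec(W) . vec(X)
assert np.isclose(np.trace(W @ X), W.ravel() @ X.ravel())
```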

## 2. A covariance

View the symmetric positive definite matrix $C$ as the covariance matrix of some random feature vector $c \in \mathbb{R}^n$, i.e.

$$C = \mathbb{E}\big[(c - \mathbb{E}(c))(c - \mathbb{E}(c))^\top\big]$$

This captures the $n$ features plus their pairwise interactions.
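A quick numpy sketch of this view, estimating $C$ from samples of a hypothetical 3-dimensional feature vector (the ground-truth covariance below is illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
# 10000 samples of a 3-dimensional feature vector c
samples = rng.multivariate_normal(mean=[0.0, 0.0, 0.0],
                                  cov=[[2.0, 0.5, 0.0],
                                       [0.5, 1.0, 0.3],
                                       [0.0, 0.3, 1.5]],
                                  size=10_000)

mu = samples.mean(axis=0)
centered = samples - mu
# C = E[(c - E c)(c - E c)^T], estimated from samples
C = centered.T @ centered / len(samples)

assert np.allclose(C, C.T)                # symmetric
assert np.all(np.linalg.eigvalsh(C) > 0)  # positive definite
```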

## Symmetric matrices as ellipses

$$\text{Ellipse} = \{Cu : \|u\|_2 = 1\}$$

[Figure: dotted lines connect each point $u$ on the unit ball with the point $Cu$ on the ellipse.]

## Symmetric matrices as ellipses

- The eigenvectors form the axes of the ellipse
- The eigenvalues are their lengths

## Dyads

A dyad is $uu^\top$, where $u$ is a unit vector:

- One eigenvalue is one, all others are zero
- It is a rank-one projection matrix
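The stated properties of a dyad can be verified directly in numpy:

```python
import numpy as np

rng = np.random.default_rng(2)
u = rng.standard_normal(4)
u /= np.linalg.norm(u)          # unit vector
P = np.outer(u, u)              # the dyad u u^T

eigvals = np.sort(np.linalg.eigvalsh(P))
assert np.allclose(eigvals[:-1], 0.0)   # all eigenvalues zero ...
assert np.isclose(eigvals[-1], 1.0)     # ... except a single one
assert np.allclose(P @ P, P)            # idempotent: a projection
assert np.linalg.matrix_rank(P) == 1    # rank one
```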

## Directional variance

The variance along direction $u$ is

$$\mathbb{V}(c^\top u) = u^\top C u = \mathrm{tr}(C\, uu^\top) \ge 0$$

[Figure: the outer figure eight plots each direction $u$ scaled by the variance $u^\top C u$.]

PCA: find the direction of largest variance.
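A small numpy check of the variance identity and of the PCA claim (the covariance matrix here is illustrative):

```python
import numpy as np

C = np.array([[3.0, 1.0],
              [1.0, 1.0]])            # an illustrative covariance matrix

# Variance along a unit direction u: u^T C u = tr(C u u^T) >= 0
u = np.array([0.6, 0.8])
assert np.isclose(u @ C @ u, np.trace(C @ np.outer(u, u)))
assert u @ C @ u >= 0

# PCA: the direction of largest variance is the top eigenvector of C
eigvals, eigvecs = np.linalg.eigh(C)   # eigenvalues in ascending order
rng = np.random.default_rng(3)
for _ in range(100):                   # no unit direction beats it
    v = rng.standard_normal(2)
    v /= np.linalg.norm(v)
    assert v @ C @ v <= eigvals[-1] + 1e-12
```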

## 3-dimensional variance plots

[Figure: 3-dimensional variance plots.]

$\mathrm{tr}(C\, uu^\top)$ is a generalized probability when $\mathrm{tr}(C) = 1$.

## 3. Generalized probability distributions

A probability vector is a mixture of pure events,

$$\omega = (.2, .1, .6, .1) = \sum_i \omega_i\, e_i,$$

with mixture coefficients $\omega_i$ and pure events $e_i$. A density matrix is a mixture of pure density matrices,

$$W = \sum_i \omega_i\, w_i w_i^\top.$$

Matrices as generalized distributions:

- Many mixtures lead to the same density matrix
- There always exists a decomposition into $n$ eigendyads
- Density matrix: symmetric positive matrix of trace one
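A numpy sketch constructing a density matrix from the mixture above (the pure states $w_i$ are random unit vectors, an assumption for illustration) and recovering a decomposition into eigendyads:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 4
omega = np.array([.2, .1, .6, .1])      # mixture coefficients, sum to 1

# Pure density matrices: dyads w_i w_i^T of random unit vectors
ws = rng.standard_normal((n, n))
ws /= np.linalg.norm(ws, axis=1, keepdims=True)
W = sum(w * np.outer(v, v) for w, v in zip(omega, ws))

assert np.allclose(W, W.T)                       # symmetric
assert np.isclose(np.trace(W), 1.0)              # trace one
assert np.all(np.linalg.eigvalsh(W) >= -1e-12)   # positive semidefinite

# There always exists a decomposition into n eigendyads
lam, U = np.linalg.eigh(W)
W_rebuilt = sum(l * np.outer(u, u) for l, u in zip(lam, U.T))
assert np.allclose(W, W_rebuilt)
```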

## It's like a probability!

The total variance along an orthogonal set of directions is 1. In two dimensions,

$$u_1^\top W u_1 + u_2^\top W u_2 = 1$$

[Figure: in three dimensions, the variances $a$, $b$, $c$ along three orthogonal axes satisfy $a + b + c = 1$.]
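Checking numerically that the directional variances of a density matrix along any orthonormal basis sum to one (the density matrix and basis below are randomly generated for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 3
# A random density matrix: symmetric PSD with trace one
A = rng.standard_normal((n, n))
W = A @ A.T
W /= np.trace(W)

# Any orthonormal basis u_1, ..., u_n: directional variances sum to 1,
# since sum_i u_i^T W u_i = tr(Q^T W Q) = tr(W) = 1
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
variances = np.array([q @ W @ q for q in Q.T])
assert np.isclose(variances.sum(), 1.0)
```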

## Uniform density?

Under the uniform density matrix $\frac{1}{n}I$, all dyads have generalized probability $\frac{1}{n}$:

$$\mathrm{tr}\Big(\tfrac{1}{n} I\, uu^\top\Big) = \tfrac{1}{n}\, \mathrm{tr}(uu^\top) = \tfrac{1}{n}$$

The generalized probabilities of $n$ orthogonal dyads sum to 1.
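The uniform-density calculation in numpy:

```python
import numpy as np

n = 5
U_uniform = np.eye(n) / n               # the uniform density matrix (1/n) I

rng = np.random.default_rng(6)
u = rng.standard_normal(n)
u /= np.linalg.norm(u)

# Every dyad gets generalized probability 1/n under the uniform density
assert np.isclose(np.trace(U_uniform @ np.outer(u, u)), 1 / n)

# n orthogonal dyads: their generalized probabilities sum to 1
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
probs = [np.trace(U_uniform @ np.outer(q, q)) for q in Q.T]
assert np.isclose(sum(probs), 1.0)
```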

## Conventional Bayes rule

$$P(M_i \mid y) = \frac{P(M_i)\, P(y \mid M_i)}{P(y)}$$

- 4 updates with the same data likelihood (shown as an animation in the original deck)
- The update maintains uncertainty information about the maximum likelihood
- Soft max

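A minimal numpy sketch of repeated Bayes updates with a fixed likelihood (the prior and likelihood values are illustrative), showing the soft-max behavior:

```python
import numpy as np

# Prior over 4 models and a fixed likelihood vector P(y | M_i)
prior = np.array([0.25, 0.25, 0.25, 0.25])
likelihood = np.array([0.1, 0.3, 0.5, 0.2])

posterior = prior.copy()
for _ in range(4):                  # 4 updates with the same likelihood
    posterior = posterior * likelihood
    posterior /= posterior.sum()    # normalize: divide by P(y)

# Repeating the update concentrates mass on the maximum-likelihood
# model: Bayes acts as a "soft max" over the likelihoods
assert posterior.argmax() == likelihood.argmax()
assert np.isclose(posterior.sum(), 1.0)
```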

## Bayes rule for density matrices

$$D(M \mid y) = \frac{\exp\big(\log D(M) + \log D(y \mid M)\big)}{\mathrm{tr}\Big(\exp\big(\log D(M) + \log D(y \mid M)\big)\Big)}$$

- Repeated updates (1, 2, 3, 4, 10, 20 in the original animation) with the same data likelihood matrix $D(y \mid M)$
- The update maintains uncertainty information about the maximum eigenvalue
- Soft max eigenvalue calculation

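A sketch of the density-matrix Bayes update in numpy, assuming symmetric positive definite matrices so that the matrix log and exp can be computed via eigendecomposition; the prior and likelihood matrices below are illustrative:

```python
import numpy as np

def mlog(S):
    """Matrix log of a symmetric positive definite matrix."""
    lam, U = np.linalg.eigh(S)
    return (U * np.log(lam)) @ U.T

def mexp(S):
    """Matrix exp of a symmetric matrix."""
    lam, U = np.linalg.eigh(S)
    return (U * np.exp(lam)) @ U.T

def matrix_bayes(D_M, D_y_given_M):
    """One update: exp(log prior + log likelihood), renormalized to trace 1."""
    numer = mexp(mlog(D_M) + mlog(D_y_given_M))
    return numer / np.trace(numer)

prior = np.eye(3) / 3                    # uniform density matrix
lik = np.array([[3.0, 0.2, 0.0],         # illustrative likelihood matrix
                [0.2, 1.0, 0.1],
                [0.0, 0.1, 0.5]])

post = prior
for _ in range(15):                      # repeated updates, same likelihood
    post = matrix_bayes(post, lik)

# The posterior concentrates on the top eigenvector of the likelihood
# matrix: a soft-max eigenvalue calculation
top = np.linalg.eigh(lik)[1][:, -1]
assert np.isclose(np.trace(post), 1.0)
assert np.isclose(top @ post @ top, 1.0, atol=1e-4)
```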

## Bayes' rules

|             | vector | matrix |
|-------------|--------|--------|
| Bayes rule  | $P(M_i \mid y) = \dfrac{P(M_i) \cdot P(y \mid M_i)}{\sum_j P(M_j) \cdot P(y \mid M_j)}$ | $D(M \mid y) = \dfrac{D(M) \odot D(y \mid M)}{\mathrm{tr}\big(D(M) \odot D(y \mid M)\big)}$ |
| Regularizer | Entropy | Quantum entropy |

where $A \odot B := \exp(\log A + \log B)$.

## Vector case as special case of matrix case

- Embed vectors as diagonal matrices
- All matrices then share the same eigensystem
- The fancy product $\odot$ becomes the ordinary product $\cdot$
- Often the vector case is the hardest part: bounds for the vector case "lift" to the matrix case
- This phenomenon has been dubbed the "free matrix lunch"
- Size of matrix = size of vector = $n$
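The reduction can be checked numerically: embed the vectors as diagonal matrices, note that the fancy product collapses to the ordinary product, and compare the two Bayes rules (the probability values below are illustrative):

```python
import numpy as np

def odot(A, B):
    """Fancy product A (.) B = exp(log A + log B) for symmetric PD matrices."""
    def mlog(S):
        lam, U = np.linalg.eigh(S)
        return (U * np.log(lam)) @ U.T
    def mexp(S):
        lam, U = np.linalg.eigh(S)
        return (U * np.exp(lam)) @ U.T
    return mexp(mlog(A) + mlog(B))

# Vectors embedded as diagonal matrices: all share the same eigensystem
prior = np.array([0.25, 0.25, 0.25, 0.25])
lik = np.array([0.1, 0.3, 0.5, 0.2])
D_prior, D_lik = np.diag(prior), np.diag(lik)

# For commuting (here: diagonal) matrices, the fancy product is the
# ordinary product
assert np.allclose(odot(D_prior, D_lik), D_prior @ D_lik)

# Matrix Bayes then reduces exactly to vector Bayes
numer = odot(D_prior, D_lik)
matrix_post = numer / np.trace(numer)
vector_post = prior * lik / (prior * lik).sum()
assert np.allclose(np.diag(matrix_post), vector_post)
```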