SGD for large-scale SDP

本章介绍大规模矩阵算法,首先介绍了填充大规模矩阵的启发式算法、Nuclear Norm minimization、Low-rank parameterization、Method of Multipliers和随机梯度下降(SGD)在矩阵分解中的应用,并介绍了Quadratic Inverse 问题,并给出了一些解决问题的启发式想法。
展开查看详情

1.SGD for large-scale SDP Benjamin Recht University of California, Berkeley

2. Matrix Completion ! Mij known for black cells ! Mij unknown for white cells M! = Rows index questions ! Columns index users • How do you fill in the missing data? R* M = L p1 x p2 p1 x r r x p2 p1p2 entries r(p1+p2) entries

3. Heuristic for Matrix Completion IDEA: Replace rank with R* nuclear norm: M = L minimize kXk⇤ subject to (X) = b kxn kxr rxn • Step 1: Pick (i,j) and compute residual: ! e = (Li RTj Mij ) • Step 2: Take a mixture of current model and corrected model (𝛼,β>0):   Li ↵Li eRj Rj ↵Rj eLi Some guy on livejournal, 2006 Succeeds when number of Fazel, Parillo, Recht, 2007 samples is Õ(r(k+n)) Candes and Recht, 2008

4. Equivalent Formulations • Semidefinite embedding: ! ! ! ! ! • Low rank parametrization:

5.Nuclear Norm minimization Pk minimize kXk⇤ = i=1 i (X) subject to (X) = y Low-rank parameterization X = U ⌃V ⇤ minimize 1 2 (kLk2 F + kRk2F ) L = U⌃ 1/2 subject to (LR ) = y ⇤ R = V ⌃1/2 Method of Multipliers k X X r n X X r minimize L2ia + 2 Rja + k (LR⇤ ) yk22 i=1 a=1 j=1 a=1

6. Theoretical Foundation k X X r n X X r minimize L2ia + 2 Rja + k (LR⇤ ) yk22 i=1 a=1 j=1 a=1 • Any local minimum of this nonlinear program is a global optimum of the nuclear norm minimization problem provided the nuclear norm has a rank smaller than r. Burer and Monteiro, 2005

7. JELLYFISH • SGD for Matrix Factorizations. ! P Example: minimize (i,j)2⌦ (Xij Mij )2 + µkXk⇤ • Idea: approximate X ⇡ LRT ! X minimize(L,R) (Li RTj Mij )2 + µi kLi k2F + µj kRj k2F ! (i,j)2⌦ • Step 1: Pick (i,j) and compute residual: ! e = (Li RTj Mij ) • Step 2: Take a stochastic gradient step:   Li ↵Li eRj Rj ↵Rj eLi

8. JELLYFISH Observation: With replacement sample=poor locality Idea: Bias sample to improve locality. Algorithm: Shuffle the data. 1. Process {L1R1, L2R2, L3R3} in parallel Big win: No locks! 2. Process {L1R2, L2R3, L3R1} in parallel (model access) 3. Process {L1R3, L2R1, L3R2} in parallel Recht and Ré, 2010

9. jfish • 100x faster than standard solvers ! • 25% speedup over HOGWILD! on 12 core machine ! • 3x speedup on 40 core machine Netflix data-set 100M examples 17770 rows 480189 columns

10. Extensions • Other structured matrix problems • max-norm • NMF (with non-negativity or simplex constraint) • LDA? ! • Decoupling works for any algorithm where we can project the rows independently • However, those not arising from SDPs have anything resembling guarantees.

11. Quadratic Inverse Problems • Blind deconvolution • Phase retrieval • Dictionary learning • Nonnegative matrix decomposition

12. Quadratic Inverse Problems • Blind deconvolution X • Phase retrieval Ij = s k hj k k • Dictionary learning • Nonnegative matrix decomposition

13. Quadratic Inverse Problems • Blind deconvolution • Phase retrieval ⇤ 2 ⇤ ⇤ Ij = |fj s| =s f j fj s • Dictionary learning • Nonnegative matrix decomposition

14. Quadratic Inverse Problems • Blind deconvolution X • Phase retrieval Xij = Dik Ckj k • Dictionary learning • Nonnegative matrix decomposition

15. Challenges and Ideas • Extremely ill-posed ✦ Impose additional structures ✦ Diversify measurements • Nonlinear, non-convex ✦ Lift to a high-dimensional space ✦ Enforce low-rankness

16.Blind Deconvolution with Subspace Knowledge m encode p y decode ˜ m ˜ h = m = ⇤ discovers unknown x C y w x channel and message convolution with unknown channel Ahmed, R, and Romberg, 2012

17. Blind Deconvolution with Random Masks i i I = s ⇤ (m h) or (mi s) ⇤ h Sender Channel Receiver Shutter Shutter Open Open Close Close Time Time Shutter Traditional Exposure Shutter Coded Expo Open Close % Time Time aditional %OXUUHG Exposure Coded Exposure Image Courtesy: Raskar, Agrawal, Tumblin ,PDJH

18. Lifting + Low-Rank X Ij = mj k s k hj k k T X = hs { 2 3 2 32 3 I1 m1 0 0 ··· 0 s 1 h1 s 2 h1 s 3 h1 ··· s n h1 6 I2 7 6 0 m2 0 ··· 0 7 6 s 1 h2 s 2 h2 s 3 h2 ··· s n h2 7 6 7 6 76 7 6 I3 7 6 0 0 m3 ··· 0 7 6 s 1 h3 s 2 h3 s 3 h3 ··· s n h3 7 6 7 6 76 7 6 I4 7 = 6 0 0 0 ··· 0 7 6 s 1 h4 s 2 h4 s 3 h4 ··· s n h4 7 6 7 6 76 7 6 .. 7 6 .. .. .. .. .. 7 6 .. .. .. .. .. 7 4.5 4 . . . . . 54 . . . . . 5 In 0 0 0 ··· mn s 1 hn s 2 hn s 3 hn ··· s n hn find X minimize kXk⇤ subject to subject to Iji = trace(H jT diag(mi )X) Iji = trace(H jT diag(mi )X) rank(X) = 1

19. Theoretical Foundation • Any local minimum of this nonlinear program is a global optimum of the nuclear norm minimization problem provided the nuclear norm has a rank 1 solution. Burer and Monteiro, 2005

20. Super-resolution • Image size: 256x256 • filter size: 16x16 • channels: 20

21. SGD for large SDP • Burer and Monteiro lifts • Make large SDPs practically solvable • Still gaps in our theoretical understanding: stationary points, initializations, stochastic convergence • SGD with biased orderings • Bias for locality and speed • Little understanding of the theory of nonuniform orderings