# Sketching as a Tool for Numerical Linear Algebra

1.Sketching as a Tool for Numerical Linear Algebra David Woodruff IBM Almaden

2.Talk Outline  Exact Regression Algorithms  Sketching to speed up Least Squares Regression  Sketching to speed up Least Absolute Deviation (l1) Regression  Sketching to speed up Low Rank Approximation 2

3.Regression Linear Regression  Statistical method to study linear dependencies between variables in the presence of noise. Example  Ohm's law V = R ∙ I  Find linear function that best fits the data 3

4.Regression Standard Setting  One measured variable b  A set of predictor variables a1 ,…, a d  Assumption: b = x0 + a1 x1 + … + ad xd +   is assumed to be noise and the xi are model parameters we want to learn  Can assume x0 = 0  Now consider n observations of b 4

5.Regression analysis Matrix form Input: nd-matrix A and a vector b=(b1,…, bn) n is the number of observations; d is the number of predictor variables Output: x* so that Ax* and b are close  Consider the over-constrained case, when n À d  Can assume that A has full column rank 5

6.Regression analysis Least Squares Method  Find x* that minimizes |Ax-b|22 =  (bi – <Ai*, x>)²)²  Ai* is i-th row of A  Certain desirable statistical properties  Closed form solution: x = (ATA)-1 AT b Method of least absolute deviation (l1 -regression)  Find x* that minimizes |Ax-b|1 =  |bi – <Ai*, x>)²|  Cost is less sensitive to outliers than least squares  Can solve via linear programming Time complexities are at least n*d2, we want better! 6

7.Talk Outline  Exact Regression Algorithms  Sketching to speed up Least Squares Regression  Sketching to speed up Least Absolute Deviation (l1) Regression  Sketching to speed up Low Rank Approximation 7

8.Sketching to solve least squares regression  How to find an approximate solution x to minx |Ax-b|2 ?  Goal: output x‘ for which |Ax‘-b|2 · (1+ε) minx |Ax-b|2 with high probability  Draw S from a k x n random family of matrices, for a value k << n  Compute S*A and S*b  Output the solution x‘ to minx‘ |(SA)x-(Sb)|2 8

9.How to choose the right sketching matrix S?  Recall: output the solution x‘ to minx‘ |(SA)x-(Sb)|2  Lots of matrices work  S is d/ε2 x n matrix of i.i.d. Normal random variables  Computing S*A may be slow… 9

10.How to choose the right sketching matrix S? [S]  S is a Johnson Lindenstrauss Transform  S = P*H*D  D is a diagonal matrix with +1, -1 on diagonals  H is the Hadamard transform  P just chooses a random (small) subset of rows of H*D  S*A can be computed much faster 10

11.Even faster sketching matrices [CW,MM,NN]  CountSketch matrix  Define k x n matrix S, for k = d2/ε2  S is really sparse: single randomly chosen non-zero entry per column [ 0010 01 00 1000 00 00 0 0 0 -1 1 0 -1 0 [ Surprisingly, this works! 0-1 0 0 0 0 0 1 11

12.Talk Outline  Exact Regression Algorithms  Sketching to speed up Least Squares Regression  Sketching to speed up Least Absolute Deviation (l1) Regression  Sketching to speed up Low Rank Approximation 12

13.Sketching to solve l1-regression  How to find an approximate solution x to minx |Ax-b|1 ?  Goal: output x‘ for which |Ax‘-b|1 · (1+ε) minx |Ax-b|1 with high probability  Natural attempt: Draw S from a k x n random family of matrices, for a value k << n  Compute S*A and S*b  Output the solution x‘ to minx‘ |(SA)x-(Sb)|1  Turns out this does not work! 13

14.Sketching to solve l1-regression [SW]  Why doesn’t outputting the solution x‘ to minx‘ |(SA)x- (Sb)|1 work?  Don‘t know of k x n matrices S with small k for which if x‘ is solution to minx |(SA)x-(Sb)|1 then |Ax‘-b|1 · (1+ε) minx |Ax-b|1 with high probability  Instead: can find an S so that |Ax‘-b|1 · (d log d) minx |Ax-b|1  S is a matrix of i.i.d. Cauchy random variables 14

15.Cauchy random variables  Cauchy random variables not as nice as Normal (Gaussian) random variables  They don’t have a mean and have infinite variance  Ratio of two independent Normal random variables is Cauchy 15

16.Sketching to solve l1-regression  How to find an approximate solution x to minx |Ax-b|1 ?  Want x‘ for which if x‘ is solution to minx |(SA)x-(Sb)|1 , then |Ax‘-b|1 · (1+ε) minx |Ax-b|1 with high probability  For d log d x n matrix S of Cauchy random variables: |Ax‘-b|1 · (d log d) minx |Ax-b|1  For this “poor” solution x’, let b’ = Ax’-b  Might as well solve regression problem with A and b’ 16

17.Sketching to solve l1-regression  Main Idea: Compute a QR-factorization of S*A  Q has orthonormal columns and Q*R = S*A  A*R-1 turns out to be a “well-conditioning” of original matrix A  Compute A*R-1 and sample d3.5/ε2 rows of [A*R-1 , b’] where the i-th row is sampled proportional to its 1-norm  Solve regression problem on the (reweighted) samples 17

18.Sketching to solve l1-regression [MM]  Most expensive operation is computing S*A where S is the matrix of i.i.d. Cauchy random variables  All other operations are in the “smaller space”  Can speed this up by choosing S as follows: [ [ 0010 01 00 1000 00 00 0 0 0 -1 1 0 -1 0 [ ¢ C1 C2 C3 [ 0-1 0 0 0 0 0 1 … Cn 18

19.Further sketching improvements [WZ]  Can show you need a fewer number of sampled rows in later steps if instead choose S as follows  Instead of diagonal of Cauchy random variables, choose diagonal of reciprocals of exponential random variables [ [ 0010 01 00 1000 00 00 0 0 0 -1 1 0 -1 0 [ ¢ 1/E1 1/E2 1/E3 [ 0-1 0 0 0 0 0 1 … 1/En 19

20.Talk Outline  Exact regression algorithms  Sketching to speed up Least Squares Regression  Sketching to speed up Least Absolute Deviation (l1) Regression  Sketching to speed up Low Rank Approximation 20

21.Low rank approximation  A is an n x n matrix  Typically well-approximated by low rank matrix  E.g., only high rank because of noise  Want to output a rank k matrix A’, so that |A-A’|F · (1+ε) |A-Ak|F, w.h.p., where Ak = argminrank k matrices B |A-B|F  For matrix C, |C|F = (Σi,j Ci,j2)1/2 21

22. Solution to low-rank approximation [S]  Given n x n input matrix A Most time- consuming  Compute S*A using a sketching matrix S with k << n rows. S*A takes random linear combinations of rows of A step is computing S*A A SScan canbe bematrix matrixof ofi.i.d. i.i.d. Normals Normals SScan canbe beaaFast FastJohnson Johnson SA Lindenstrauss LindenstraussMatrix Matrix  Project rows of A onto SA, then find best rank-k SScan can be be aaCountSketch CountSketch approximation to points inside of SA. matrix matrix 22

23.Conclusion  Gave fast sketching-based algorithms for  Least Squares Regression  Least Absolute Deviation (l1) Regression  Low Rank Approximation  Sketching also provides “dimensionality reduction”  Communication-efficient solutions for these problems 23