a Parameter Server Approach to Distributed Matrix Factorization

a Parameter Server Approach to Distributed Matrix Factorization

1.Factorbird : a Parameter Server Approach to Distributed Matrix Factorization Sebastian Schelter , Venu Satuluri , Reza Zadeh Distributed Machine Learning and Matrix Computations workshop in conjunction with NIPS 2014

2.Latent Factor Models Given M sparse n x m Returns U and V rank k Applications Dimensionality reduction Recommendation Inference

3.Seem familiar? So why not just use SVD? SVD!

4.Problems with SVD (Feb 24, 2015 edition)

5.Revamped loss function g – global bias term b U i – user-specific bias term for user i b V j – item-specific bias term for item j prediction function p( i , j) = g + b U i + b V j + u T i v j a( i , j) – analogous to SVD’s m ij (ground truth) New loss function:


7.Problems Resulting U and V , for graphs with millions of vertices, still equate to hundreds of gigabytes of floating point values. SGD is inherently sequential; either locking or multiple passes are required to synchronize.

8.Problem 1: size of parameters Solution: Parameter Server architecture

9.Problem 2: simultaneous writes Solution: …so what?

10.Lock-free concurrent updates? Assumptions f is Lipshitz continuously differentiable f is strongly convex Ω (size of hypergraph ) is small Δ (fraction of edges that intersect any variable) is small ρ ( sparsity of hypergraph ) is small

11.Factorbird Architecture

12.Parameter server architecture Open source! http :// parameterserver.org /

13.Factorbird Machinery memcached – Distributed memory object caching system finagle – Twitter’s RPC system HDFS – persistent filestore for data Scalding – Scala front-end for Hadoop MapReduce jobs Mesos – resource manager for learner machines

14.Factorbird stubs

15.Model assessment Matrix factorization using RMSE Root-mean squared error SGD performance often a function of hyperparameters λ : regularization η : learning rate k : number of latent factors

16.[Hyper]Parameter grid search aka “parameter scans:” finding the optimal combination of hyperparameters Parallelize!

17.Experiments “ RealGraph ” Not a dataset; a framework for creating graph of user-user interactions on Twitter Kamath , Krishna, et al. " RealGraph : User Interaction Prediction at Twitter." User Engagement Optimization Workshop@ KDD. 2014.

18.Experiments Data: binarized adjacency matrix of subset of Twitter follower graph a( i , j) = 1 if user i interacted with user j , 0 otherwise All prediction errors weighted equally ( w( i , j) = 1) 100 million interactions 440,000 [popular] users

19.Experiments 80% training, 10% validation, 10% testing

20.Experiments k = 2 Homophily

21.Experiments Scalability of Factorbird large RealGraph subset 229M x 195M (44.6 quadrillion ) 38.5 billion non-zero entries Single SGD pass through training set: ~2.5 hours ~ 40 billion parameters

22.Important to note As with most (if not all) distributed platforms:

23.Future work Support streaming (user follows) Simultaneous factorization Fault tolerance Reduce network traffic s/ memcached /custom application/g Load balancing

24.Strengths Excellent extension of prior work Hogwild , RealGraph Current and [mostly] open technology Hadoop , Scalding, Mesos , memcached Clear problem, clear solution, clear validation

25.Weaknesses Lack of detail, lack of detail, lack of detail How does number of machines affect runtime? What were performance metrics of the large RealGraph subset? What were some of the properties of the dataset (when was it collected, how were edges determined, what does “popular” mean, etc )? How did other factorization methods perform by comparison?