1.Large-Scale Matrix Operations Using a Data Flow Engine Matei Zaharia
2.Outline Data flow vs. traditional network programming Limitations of MapReduce Spark computing engine Matrix operations on Spark
3.Problem Data growing faster than processing speeds Only solution is to parallelize on large clusters Wide use in both enterprises and web industry How do we program these things?
4.Traditional Network Programming Message-passing between nodes (e.g. MPI) Very difficult to do at scale: How to split problem across nodes? Must consider network & data locality How to deal with failures? (inevitable at scale) Even worse: stragglers (node not failed, but slow) Rarely used in commodity datacenters
5.Data Flow Models Restrict the programming interface so that the system can do more automatically Express jobs as graphs of high-level operators System picks how to split each operator into tasks and where to run each task Runs parts twice for fault recovery Biggest example: MapReduce (map tasks feeding reduce tasks)
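The map-then-reduce structure described on this slide can be sketched as a single-machine model in plain Python (a hypothetical local simulation, not the distributed engine): the user supplies a mapper and a reducer, and the "system" handles applying the mapper, shuffling by key, and reducing each group.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(records, mapper):
    """Apply the user's mapper to every record, yielding (key, value) pairs."""
    pairs = []
    for r in records:
        pairs.extend(mapper(r))
    return pairs

def reduce_phase(pairs, reducer):
    """Group mapped pairs by key (the 'shuffle'), then reduce each group."""
    pairs.sort(key=itemgetter(0))
    return {k: reducer([v for _, v in group])
            for k, group in groupby(pairs, key=itemgetter(0))}

# Word count, the canonical MapReduce example
lines = ["spark makes data flow easy", "data flow beats message passing"]
counts = reduce_phase(
    map_phase(lines, lambda line: [(w, 1) for w in line.split()]),
    sum)
# counts["data"] == 2, counts["spark"] == 1
```

In the real engine, each `map_phase` call would be split into parallel tasks, and a failed or slow task would simply be re-run, which is exactly why the interface is restricted to these two stateless phases.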
6.MapReduce for Matrix Operations Matrix-vector multiply Power iteration (e.g. PageRank) Gradient descent methods Stochastic SVD Tall skinny QR Many others!
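The first item on this list, matrix-vector multiply, fits the model directly; a minimal sketch in plain Python (assuming, as is common in practice, that the vector x is small enough to broadcast to every map task):

```python
# Sketch: y = A x as one MapReduce pass over sparse entries (i, j, A_ij).
# Map emits each entry's contribution to one output row; reduce sums per row.
from collections import defaultdict

def matvec_mapreduce(entries, x):
    # Map phase: one (row, partial product) pair per nonzero entry
    contribs = [(i, a_ij * x[j]) for (i, j, a_ij) in entries]
    # Reduce phase: sum contributions keyed by row index
    y = defaultdict(float)
    for i, c in contribs:
        y[i] += c
    return dict(y)

A = [(0, 0, 2.0), (0, 1, 1.0), (1, 1, 3.0)]  # a 2x2 sparse matrix
x = [1.0, 2.0]
y = matvec_mapreduce(A, x)  # {0: 4.0, 1: 6.0}
```

Power iteration (the next item) is just this one-pass job repeated, which is where MapReduce's lack of data sharing between passes starts to hurt.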
7.Why Use a Data Flow Engine? Ease of programming High-level functions instead of message passing Wide deployment More common than MPI, especially “near” data Scalability to very largest clusters Even HPC world is now concerned about resilience
9.Limitations of MapReduce MapReduce is great at one-pass computation, but inefficient for multi-pass algorithms No efficient primitives for data sharing: state between steps goes to a distributed file system Slow due to replication & disk storage No control of data partitioning across steps
10.Example: Iterative Apps (Diagram: each iteration reads its input from the file system and writes its result back — file system read, file system write, iter. 1, iter. 2, … — and each interactive query re-reads the same input to produce result 1, result 2, result 3, …) Commonly spend 90% of time doing I/O
11.Example: PageRank Repeatedly multiply sparse matrix and vector Requires repeatedly hashing together page adjacency lists and rank vector (Diagram: Neighbors (id, edges) and Ranks (id, rank) — the same neighbors file is grouped over and over at iteration 1, iteration 2, iteration 3, …)
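The sparse matrix-vector multiply at the heart of PageRank can be sketched locally in plain Python (a hypothetical toy graph, with the usual 0.85 damping factor; the Spark version of this loop appears later in the deck):

```python
# Minimal PageRank power iteration over an adjacency-list dict.
# Each iteration is one sparse matrix-vector multiply: the inner loop
# plays the role of "joining" the links dataset with the ranks vector.
def pagerank(links, iterations=20, d=0.85):
    n = len(links)
    ranks = {p: 1.0 / n for p in links}
    for _ in range(iterations):
        contribs = {p: 0.0 for p in links}
        for page, outlinks in links.items():
            for dest in outlinks:
                contribs[dest] += ranks[page] / len(outlinks)
        ranks = {p: (1 - d) / n + d * c for p, c in contribs.items()}
    return ranks

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
```

In MapReduce, every one of those iterations re-reads and re-hashes the links structure from the file system; the slides that follow show how Spark avoids exactly that.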
12.Spark Programming Model Extends MapReduce with primitives for efficient data sharing “Resilient distributed datasets” Open source in Apache Incubator Growing community with 100+ contributors APIs in Java, Scala & Python
13.Resilient Distributed Datasets (RDDs) Collections of objects stored across a cluster User-controlled partitioning & storage (memory, disk, …) Automatically rebuilt on failure
urls = spark.textFile("hdfs://...")
records = urls.map(lambda s: (s, 1))
counts = records.reduceByKey(lambda a, b: a + b)
bigCounts = counts.filter(lambda (url, cnt): cnt > 10)
bigCounts.cache()
bigCounts.filter(lambda (k, v): "news" in k).count()
bigCounts.join(otherPartitionedRDD)
(Diagram: input file → map → reduce → filter; the result of reduceByKey is known to be hash-partitioned, so the cached filter and the join can reuse that partitioning.)
14.Performance (Figure: time per iteration, in seconds.)
16.PageRank on Spark Using cache(), keep neighbors in RAM Using partitioning (partitionBy), avoid repeated hashing (Diagram: Neighbors (id, edges) and Ranks (id, rank) are co-partitioned on the same nodes after partitionBy, so each iteration's join happens locally — join, join, join, … — without reshuffling the neighbors.)
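The benefit of partitionBy can be sketched with a local model (hypothetical helper names, not Spark's internals): once both datasets are hash-partitioned the same way, every join key lands in the same partition of both sides, so the join touches no other partition — which is why Spark can skip the network shuffle.

```python
# Sketch of hash co-partitioning: partition both datasets with the same
# hash partitioner, then join partition i of one only against partition i
# of the other. No cross-partition (i.e. cross-node) traffic is needed.
def partition_by(pairs, n):
    parts = [[] for _ in range(n)]
    for k, v in pairs:
        parts[hash(k) % n].append((k, v))
    return parts

def copartitioned_join(parts_a, parts_b):
    joined = []
    for pa, pb in zip(parts_a, parts_b):   # partition i joins only partition i
        lookup = dict(pb)
        joined.extend((k, (v, lookup[k])) for k, v in pa if k in lookup)
    return joined

links = [("a", ["b"]), ("b", ["a"])]
ranks = [("a", 0.5), ("b", 0.5)]
out = copartitioned_join(partition_by(links, 4), partition_by(ranks, 4))
```

In the PageRank loop, links is partitioned once and cached; only the small ranks dataset moves between iterations.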
19.PageRank Code
# RDD of (id, neighbors) pairs
links = spark.textFile(...).map(parsePage).partitionBy(128).cache()
ranks = links.mapValues(lambda v: 1.0)  # RDD of (id, rank)
for i in range(ITERATIONS):
    ranks = links.join(ranks).flatMap(
        lambda (id, (links, rank)): [(d, rank / len(links)) for d in links]
    ).reduceByKey(lambda a, b: a + b)
20.PageRank Results
21.Alternating Least Squares Goal: factor R ≈ A·B^T Start with random A1, B1 Solve for A2 to minimize ||R − A2·B1^T|| Solve for B2 to minimize ||R − A2·B2^T|| Repeat until convergence
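For illustration, here is a toy rank-1 version of this alternation in plain Python (my sketch, not the talk's implementation): with B fixed, each row factor has the closed-form 1-D least-squares solution a_i = (Σ_j r_ij·b_j) / (Σ_j b_j²), and symmetrically for B. Real ALS solves a small k×k linear system per row, adds regularization, and handles missing entries of R.

```python
# Toy rank-1 ALS: factor a fully observed R as an outer product a·b^T.
# Each alternating step is the closed-form 1-D least-squares solution.
import random

def als_rank1(R, iterations=20, seed=0):
    rng = random.Random(seed)
    m, n = len(R), len(R[0])
    a = [rng.random() for _ in range(m)]
    b = [rng.random() for _ in range(n)]
    for _ in range(iterations):
        nb = sum(bj * bj for bj in b)                      # ||b||^2
        a = [sum(R[i][j] * b[j] for j in range(n)) / nb    # fix b, solve a
             for i in range(m)]
        na = sum(ai * ai for ai in a)                      # ||a||^2
        b = [sum(R[i][j] * a[i] for i in range(m)) / na    # fix a, solve b
             for j in range(n)]
    return a, b

R = [[1, 2], [2, 4], [3, 6]]    # exactly rank 1: R = [1,2,3]^T · [1,2]
a, b = als_rank1(R)             # a[i]*b[j] recovers R[i][j]
```

On an exactly rank-1 matrix the alternation converges essentially immediately; the next slide is about making these per-row solves cheap to run in parallel over a distributed R.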
22.ALS on Spark Cache 2 copies of R in memory, one partitioned by rows and one by columns Keep A & B partitioned in the corresponding way Operate on blocks to lower communication (Diagram: R ≈ A·B^T) Joint work with Joey Gonzalez, Virginia Smith
23.ALS Results
24.Benefit for Users Same engine performs data extraction, model training and interactive queries (Diagram: with separate engines, each stage — parse, train, query — does its own DFS read and DFS write; with Spark, a single DFS read feeds parse → train → query in memory.)
25.Other Projects on Spark MLlib: built-in Spark library for ML Includes ALS, K-means||, various algorithms on SGD Franklin, Gonzalez et al. [MLOSS '13] MLI: MATLAB-like language for writing apps Basic ALS in 35 lines of code Evan Sparks, Ameet Talwalkar et al. [ICDM '13]
26.Spark Community 100+ developers, 25+ companies contributing; most active development community after Hadoop
27.Conclusion Data flow engines are becoming an important platform for matrix algorithms Spark offers a simple programming model that greatly speeds these up More info: spark.incubator.apache.org