Large-Scale Matrix Operations Using a Data Flow Engine

Today, data is growing faster than processing speeds, and the only solution is to parallelize on large clusters, a technique already widely used in both enterprises and the web industry. This deck covers the difference between data flow models and traditional network programming, the limitations of MapReduce, the Spark computing engine, and matrix operations on Spark.

1. Large-Scale Matrix Operations Using a Data Flow Engine (Matei Zaharia)

2. Outline: data flow vs. traditional network programming; limitations of MapReduce; the Spark computing engine; matrix operations on Spark.

3. Problem: data is growing faster than processing speeds, and the only solution is to parallelize on large clusters, already in wide use in both enterprises and the web industry. How do we program these things?

4. Traditional Network Programming: message passing between nodes (e.g. MPI). Very difficult to do at scale: how to split the problem across nodes (must consider network & data locality)? How to deal with failures (inevitable at scale)? Even worse: stragglers (a node that has not failed, but is slow). Rarely used in commodity datacenters.

5. Data Flow Models: restrict the programming interface so that the system can do more automatically. Jobs are expressed as graphs of high-level operators; the system picks how to split each operator into tasks and where to run each task, and re-runs parts of the graph for fault recovery. Biggest example: MapReduce (map stages feeding reduce stages).

6. MapReduce for Matrix Operations: matrix-vector multiply, power iteration (e.g. PageRank), gradient descent methods, stochastic SVD, tall-skinny QR, and many others!
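
As a concrete illustration of the first item, a matrix-vector multiply maps naturally onto map and reduce: map each nonzero entry to a partial product keyed by its row, then reduce by summing per row. A minimal local sketch in plain Python (the toy entries and vector are illustrative, not from the slides):

    from collections import defaultdict

    # Sparse matrix as (row, col, value) triples, plus a dense vector x.
    entries = [(0, 0, 2.0), (0, 2, 1.0), (1, 1, 3.0), (2, 0, 4.0)]
    x = [1.0, 2.0, 3.0]

    # "Map": emit one partial product per nonzero, keyed by output row.
    mapped = [(i, a_ij * x[j]) for (i, j, a_ij) in entries]

    # "Reduce": sum the partial products for each row.
    y = defaultdict(float)
    for i, partial in mapped:
        y[i] += partial

    print(dict(y))   # {0: 5.0, 1: 6.0, 2: 4.0}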

7-8. Why Use a Data Flow Engine? Ease of programming: high-level functions instead of message passing. Wide deployment: more common than MPI, especially "near" the data. Scalability to the very largest clusters: even the HPC world is now concerned about resilience.

9. Limitations of MapReduce: MapReduce is great at one-pass computation, but inefficient for multi-pass algorithms. It has no efficient primitives for data sharing: state between steps goes to the distributed file system, which is slow due to replication and disk storage, and there is no control of data partitioning across steps.

10. Example: Iterative Apps. Each iteration reads its input from the file system and writes its output back (file system read, file system write, for iter. 1, iter. 2, ...), and each interactive query re-reads the input from the file system to produce its result. Such apps commonly spend 90% of their time doing I/O.

11. Example: PageRank. Repeatedly multiply a sparse matrix and a vector; this requires repeatedly hashing together the page adjacency lists (Neighbors: (id, edges)) and the rank vector (Ranks: (id, rank)), so the same file is grouped over and over, iteration after iteration.
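
To make "repeatedly multiply a sparse matrix and a vector" concrete, here is a minimal single-machine sketch of power-iteration PageRank; the toy link graph, the 10 iterations, and the standard 0.15 + 0.85 damping are illustrative assumptions, not taken from the slides:

    # Adjacency lists: page id -> outgoing links (toy graph).
    links = {0: [1, 2], 1: [2], 2: [0]}
    ranks = {p: 1.0 for p in links}

    for _ in range(10):
        contribs = {p: 0.0 for p in links}
        # Each page sends rank / #outlinks to its neighbors:
        # one row of the sparse matrix-vector product.
        for page, outlinks in links.items():
            for dest in outlinks:
                contribs[dest] += ranks[page] / len(outlinks)
        ranks = {p: 0.15 + 0.85 * c for p, c in contribs.items()}

    print(ranks)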

12. Spark Programming Model: extends MapReduce with primitives for efficient data sharing, called "resilient distributed datasets" (RDDs). Open source in the Apache Incubator, with a growing community of 100+ contributors and APIs in Java, Scala & Python.

13. Resilient Distributed Datasets (RDDs): collections of objects stored across a cluster, with user-controlled partitioning & storage (memory, disk, ...), automatically rebuilt on failure.

    # Build RDDs from an input file with map, reduce and filter.
    urls = spark.textFile("hdfs://...")
    records = urls.map(lambda s: (s, 1))
    counts = records.reduceByKey(lambda a, b: a + b)
    bigCounts = counts.filter(lambda (url, cnt): cnt > 10)   # known to be hash-partitioned
    # Later operations can exploit the known partitioning.
    bigCounts.cache()
    bigCounts.filter(lambda (k, v): "news" in k).count()
    bigCounts.join(otherPartitionedRDD)
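
The snippet above assumes a SparkContext bound to the name spark and an already-partitioned pair RDD called otherPartitionedRDD; a minimal, hedged setup that makes it runnable locally (the master string, app name, and sample data are assumptions):

    from pyspark import SparkContext

    # The slide's 'spark' is a SparkContext in this sketch.
    spark = SparkContext("local[2]", "rdd-example")
    # Any hash-partitioned pair RDD will do; the join above then avoids a full shuffle.
    otherPartitionedRDD = spark.parallelize([("http://news.example.com", 3)]).partitionBy(8)

Note that the tuple-unpacking lambdas shown on the slide (lambda (url, cnt): ...) are Python 2 syntax; under Python 3 they would have to index into the tuple instead.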

14-15. Performance (chart: time per iteration, in seconds)

16-18. PageRank on Spark: using cache(), keep the Neighbors (id, edges) RDD in RAM; using partitionBy(), place the Ranks (id, rank) RDD on the same nodes as the neighbor lists, so the join in each iteration avoids repeated hashing (diagram: one partitionBy up front, then a join per iteration).

19. PageRank Code

    # RDD of (id, neighbors) pairs
    links = spark.textFile(...).map(parsePage) \
                 .partitionBy(128).cache()
    # RDD of (id, rank) pairs
    ranks = links.mapValues(lambda v: 1.0)
    for i in range(ITERATIONS):
        ranks = links.join(ranks).flatMap(
            lambda (id, (links, rank)):
                [(d, rank / len(links)) for d in links]
        ).reduceByKey(lambda a, b: a + b)
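
parsePage and ITERATIONS are not defined on the slide; a hedged sketch of what they might look like, assuming each input line holds a page id followed by its neighbor ids:

    ITERATIONS = 10   # illustrative; the slide does not fix a value

    def parsePage(line):
        # Hypothetical input format: "pageId neighborId1 neighborId2 ..."
        parts = line.split()
        return (parts[0], parts[1:])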

20. PageRank Results (chart)

21. Alternating Least Squares: start with random A1, B1; solve for A2 to minimize ||R - A2 B1^T||; solve for B2 to minimize ||R - A2 B2^T||; repeat until convergence (factorization: R ≈ A B^T).
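
A minimal single-machine sketch of this alternation using NumPy; the ridge term reg is an assumption added to keep the solves well conditioned and is not on the slide:

    import numpy as np

    def als(R, k, iterations=10, reg=0.1):
        m, n = R.shape
        A = np.random.rand(m, k)
        B = np.random.rand(n, k)
        I = np.eye(k)
        for _ in range(iterations):
            # Fix B, solve for A to minimize ||R - A B^T||:  A = R B (B^T B + reg*I)^-1
            A = R @ B @ np.linalg.inv(B.T @ B + reg * I)
            # Fix A, solve for B to minimize ||R - A B^T||:  B = R^T A (A^T A + reg*I)^-1
            B = R.T @ A @ np.linalg.inv(A.T @ A + reg * I)
        return A, B

    A, B = als(np.random.rand(100, 80), k=5)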

22. ALS on Spark: cache 2 copies of R in memory, one partitioned by rows and one by columns; keep A & B partitioned in the corresponding way; operate on blocks to lower communication. (Joint work with Joey Gonzales, Virginia Smith.)
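
A hedged sketch of the "two copies of R" idea in PySpark, assuming ratings is an RDD of (user, item, rating) triples (the variable names and partition count are illustrative):

    # One copy of R keyed by row (user) and one keyed by column (item),
    # each hash-partitioned and cached, so both ALS half-steps read local data.
    R_by_row = ratings.map(lambda t: (t[0], (t[1], t[2]))).partitionBy(64).cache()
    R_by_col = ratings.map(lambda t: (t[1], (t[0], t[2]))).partitionBy(64).cache()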

23. ALS Results (chart)

24. Benefit for Users: the same engine performs data extraction, model training, and interactive queries. With separate engines, each stage (parse, train, query) does its own DFS read and DFS write; with Spark, a single DFS read feeds parse, train, and query in one pipeline.

25. Other Projects on Spark: MLlib, a built-in Spark library for ML, includes ALS, K-means||, and various SGD-based algorithms (Franklin, Gonzales et al. [MLOSS '13]); MLI, a MATLAB-like language for writing apps, expresses basic ALS in 35 lines of code (Evan Sparks, Ameet Talwalkar et al. [ICDM '13]).
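
For comparison with the hand-written version, a hedged sketch of calling the built-in MLlib ALS from PySpark; the import path and train() arguments reflect my understanding of the MLlib API rather than anything on the slides, and ratings is again assumed to be an RDD of (user, item, rating) tuples:

    from pyspark.mllib.recommendation import ALS

    model = ALS.train(ratings, rank=10, iterations=10)   # factorize the ratings matrix
    prediction = model.predict(1, 2)                      # predicted rating of item 2 by user 1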

26. Spark Community: 100+ developers and 25+ companies contributing; the most active development community after Hadoop.

27. Conclusion: data flow engines are becoming an important platform for matrix algorithms, and Spark offers a simple programming model that greatly speeds them up. More info: spark.incubator.apache.org