Large-Scale Matrix Operations Using a Data Flow Engine

Today, data is growing faster than processing speeds, and the only solution is to parallelize on large clusters, a technique already widely used in both enterprises and the web industry. This deck covers the difference between data flow models and traditional network programming, the limitations of MapReduce, the Spark computing engine, and matrix operations on Spark.

1. Large-Scale Matrix Operations Using a Data Flow Engine (Matei Zaharia)

2. Outline: data flow vs. traditional network programming; limitations of MapReduce; the Spark computing engine; matrix operations on Spark.

3. Problem: data is growing faster than processing speeds, and the only solution is to parallelize on large clusters, already in wide use in both enterprises and the web industry. How do we program these things?

4. Traditional Network Programming: message passing between nodes (e.g. MPI). Very difficult to do at scale: how to split the problem across nodes (must consider network & data locality)? How to deal with failures (inevitable at scale)? Even worse: stragglers (a node that has not failed, but is slow). Rarely used in commodity datacenters.

5. Data Flow Models: restrict the programming interface so that the system can do more automatically. Jobs are expressed as graphs of high-level operators; the system picks how to split each operator into tasks and where to run each task, and re-runs parts of the graph for fault recovery. Biggest example: MapReduce (map stages feeding reduce stages).

6. MapReduce for Matrix Operations: matrix-vector multiply, power iteration (e.g. PageRank), gradient descent methods, stochastic SVD, tall-skinny QR, and many others!
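
As a concrete illustration of the first item, a matrix-vector multiply maps naturally onto map and reduce: map each nonzero entry to a partial product keyed by its row, then reduce by summing per row. A minimal local sketch in plain Python (the toy entries and vector are illustrative, not from the slides):

    from collections import defaultdict

    # Sparse matrix as (row, col, value) triples, plus a dense vector x.
    entries = [(0, 0, 2.0), (0, 2, 1.0), (1, 1, 3.0), (2, 0, 4.0)]
    x = [1.0, 2.0, 3.0]

    # "Map": emit one partial product per nonzero, keyed by output row.
    mapped = [(i, a_ij * x[j]) for (i, j, a_ij) in entries]

    # "Reduce": sum the partial products for each row.
    y = defaultdict(float)
    for i, partial in mapped:
        y[i] += partial

    print(dict(y))   # {0: 5.0, 1: 6.0, 2: 4.0}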

7-8. Why Use a Data Flow Engine? Ease of programming: high-level functions instead of message passing. Wide deployment: more common than MPI, especially "near" the data. Scalability to the very largest clusters: even the HPC world is now concerned about resilience.

9. Limitations of MapReduce: MapReduce is great at one-pass computation, but inefficient for multi-pass algorithms. It has no efficient primitives for data sharing: state between steps goes to the distributed file system, which is slow due to replication and disk storage, and there is no control of data partitioning across steps.

10. Example: Iterative Apps. Each iteration reads its input from the file system and writes its output back (file system read, file system write, for iter. 1, iter. 2, ...), and each interactive query re-reads the input from the file system to produce its result. Such apps commonly spend 90% of their time doing I/O.

11. Example: PageRank. Repeatedly multiply a sparse matrix and a vector; this requires repeatedly hashing together the page adjacency lists (Neighbors: (id, edges)) and the rank vector (Ranks: (id, rank)), so the same file is grouped over and over, iteration after iteration.
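
To make "repeatedly multiply a sparse matrix and a vector" concrete, here is a minimal single-machine sketch of power-iteration PageRank; the toy link graph, the 10 iterations, and the standard 0.15 + 0.85 damping are illustrative assumptions, not taken from the slides:

    # Adjacency lists: page id -> outgoing links (toy graph).
    links = {0: [1, 2], 1: [2], 2: [0]}
    ranks = {p: 1.0 for p in links}

    for _ in range(10):
        contribs = {p: 0.0 for p in links}
        # Each page sends rank / #outlinks to its neighbors:
        # one row of the sparse matrix-vector product.
        for page, outlinks in links.items():
            for dest in outlinks:
                contribs[dest] += ranks[page] / len(outlinks)
        ranks = {p: 0.15 + 0.85 * c for p, c in contribs.items()}

    print(ranks)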

12. Spark Programming Model: extends MapReduce with primitives for efficient data sharing, called "resilient distributed datasets" (RDDs). Open source in the Apache Incubator, with a growing community of 100+ contributors and APIs in Java, Scala & Python.

13. Resilient Distributed Datasets (RDDs): collections of objects stored across a cluster, with user-controlled partitioning & storage (memory, disk, ...), automatically rebuilt on failure.

    # Build RDDs from an input file with map, reduce and filter.
    urls = spark.textFile("hdfs://...")
    records = urls.map(lambda s: (s, 1))
    counts = records.reduceByKey(lambda a, b: a + b)
    bigCounts = counts.filter(lambda (url, cnt): cnt > 10)   # known to be hash-partitioned
    # Later operations can exploit the known partitioning.
    bigCounts.cache()
    bigCounts.filter(lambda (k, v): "news" in k).count()
    bigCounts.join(otherPartitionedRDD)
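
The snippet above assumes a SparkContext bound to the name spark and an already-partitioned pair RDD called otherPartitionedRDD; a minimal, hedged setup that makes it runnable locally (the master string, app name, and sample data are assumptions):

    from pyspark import SparkContext

    # The slide's 'spark' is a SparkContext in this sketch.
    spark = SparkContext("local[2]", "rdd-example")
    # Any hash-partitioned pair RDD will do; the join above then avoids a full shuffle.
    otherPartitionedRDD = spark.parallelize([("http://news.example.com", 3)]).partitionBy(8)

Note that the tuple-unpacking lambdas shown on the slide (lambda (url, cnt): ...) are Python 2 syntax; under Python 3 they would have to index into the tuple instead.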

14-15. Performance (chart: time per iteration, in seconds)

16-18. PageRank on Spark: using cache(), keep the Neighbors (id, edges) RDD in RAM; using partitionBy(), place the Ranks (id, rank) RDD on the same nodes as the neighbor lists, so the join in each iteration avoids repeated hashing (diagram: one partitionBy up front, then a join per iteration).

19. PageRank Code

    # RDD of (id, neighbors) pairs
    links = spark.textFile(...).map(parsePage) \
                 .partitionBy(128).cache()
    # RDD of (id, rank) pairs
    ranks = links.mapValues(lambda v: 1.0)
    for i in range(ITERATIONS):
        ranks = links.join(ranks).flatMap(
            lambda (id, (links, rank)):
                [(d, rank / len(links)) for d in links]
        ).reduceByKey(lambda a, b: a + b)
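
parsePage and ITERATIONS are not defined on the slide; a hedged sketch of what they might look like, assuming each input line holds a page id followed by its neighbor ids:

    ITERATIONS = 10   # illustrative; the slide does not fix a value

    def parsePage(line):
        # Hypothetical input format: "pageId neighborId1 neighborId2 ..."
        parts = line.split()
        return (parts[0], parts[1:])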

20. PageRank Results (chart)

21. Alternating Least Squares: start with random A1, B1; solve for A2 to minimize ||R - A2 B1^T||; solve for B2 to minimize ||R - A2 B2^T||; repeat until convergence (factorization: R ≈ A B^T).
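
A minimal single-machine sketch of this alternation using NumPy; the ridge term reg is an assumption added to keep the solves well conditioned and is not on the slide:

    import numpy as np

    def als(R, k, iterations=10, reg=0.1):
        m, n = R.shape
        A = np.random.rand(m, k)
        B = np.random.rand(n, k)
        I = np.eye(k)
        for _ in range(iterations):
            # Fix B, solve for A to minimize ||R - A B^T||:  A = R B (B^T B + reg*I)^-1
            A = R @ B @ np.linalg.inv(B.T @ B + reg * I)
            # Fix A, solve for B to minimize ||R - A B^T||:  B = R^T A (A^T A + reg*I)^-1
            B = R.T @ A @ np.linalg.inv(A.T @ A + reg * I)
        return A, B

    A, B = als(np.random.rand(100, 80), k=5)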

22. ALS on Spark: cache 2 copies of R in memory, one partitioned by rows and one by columns; keep A & B partitioned in the corresponding way; operate on blocks to lower communication. (Joint work with Joey Gonzales, Virginia Smith.)
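
A hedged sketch of the "two copies of R" idea in PySpark, assuming ratings is an RDD of (user, item, rating) triples (the variable names and partition count are illustrative):

    # One copy of R keyed by row (user) and one keyed by column (item),
    # each hash-partitioned and cached, so both ALS half-steps read local data.
    R_by_row = ratings.map(lambda t: (t[0], (t[1], t[2]))).partitionBy(64).cache()
    R_by_col = ratings.map(lambda t: (t[1], (t[0], t[2]))).partitionBy(64).cache()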

23. ALS Results (chart)

24. Benefit for Users: the same engine performs data extraction, model training, and interactive queries. With separate engines, each stage (parse, train, query) does its own DFS read and DFS write; with Spark, a single DFS read feeds parse, train, and query in one pipeline.

25. Other Projects on Spark: MLlib, a built-in Spark library for ML, includes ALS, K-means||, and various SGD-based algorithms (Franklin, Gonzales et al. [MLOSS '13]); MLI, a MATLAB-like language for writing apps, expresses basic ALS in 35 lines of code (Evan Sparks, Ameet Talwalkar et al. [ICDM '13]).
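
For comparison with the hand-written version, a hedged sketch of calling the built-in MLlib ALS from PySpark; the import path and train() arguments reflect my understanding of the MLlib API rather than anything on the slides, and ratings is again assumed to be an RDD of (user, item, rating) tuples:

    from pyspark.mllib.recommendation import ALS

    model = ALS.train(ratings, rank=10, iterations=10)   # factorize the ratings matrix
    prediction = model.predict(1, 2)                      # predicted rating of item 2 by user 1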

26. Spark Community: 100+ developers and 25+ companies contributing; the most active development community after Hadoop.

27. Conclusion: data flow engines are becoming an important platform for matrix algorithms, and Spark offers a simple programming model that greatly speeds them up. More info: spark.incubator.apache.org