Other Distributed Frameworks

Other Distributed Frameworks
展开查看详情

1.Other Distributed Frameworks Shannon Quinn

2.Distinction General Compute Engines Hadoop User-facing APIs Cascading Scalding

3.Alternative Frameworks Apache Mahout Apache Giraph GraphLab Apache Storm Apache Tez Apache Flink

4.Alternative Frameworks Apache Mahout Apache Giraph GraphLab Apache Storm Apache Tez Apache Flink

5.Apache Mahout A Tale of Two Frameworks Distributed machine learning on Hadoop 0.1 to 0.9 “Samsara” New in 0.10+

6.Machine learning on Hadoop Born out of the Apache Lucene project Built on Hadoop (all in Java) Pragmatic machine learning at scale

7.1: Recommendation

8.2: Classification

9.3: Clustering

10.Other MapReduce algorithms Dimensionality reduction Lanczos SSVD LDA Regression Logistic Linear Random Forest Evolutionary algorithms

11.Mahout-Samsara Programming “environment” for distributed machine learning R-like syntax Interactive shell (like Spark) Under-the-hood algebraic optimizer Engine-agnostic Spark H2O Flink ?

12.Mahout-Samsara

13.Mahout 3 main components Engine-agnostic environment for building scalable ML algorithms (Samsara) Engine-specific algorithms (Spark, H2O) Legacy MapReduce algorithms

14.Mahout v0.10.0 released April 11 (as in, 5 days ago!) 0.10.1 More base linear algebra functionality 0.11.0 Compatible with Spark 1.3 1.0 ?

15.Mahout features by engine

16.Mahout features by engine

17.Mahout features by engine No engine-agnostic clustering algorithms yet Still the domain of legacy MapReduce H2O and especially Flink still highly experimental

18.Mahout features by engine No engine-agnostic clustering algorithms yet Still the domain of legacy MapReduce H2O and especially Flink still highly experimental

19.Apache Giraph Vertex-centric alternative to Hadoop Runs on Hadoop

20.Giraph “…an iterative graph processing system built for high scalability.” Bulk-synchronous Parallel (BSP) model of distributed computation

21.Bulk-synchronous Parallel Vertex-centric model

22.Giraph terminology Superstep Sequence of iterations Each “active” vertex invokes a compute() method receives messages sent to the vertex in the previous superstep , computes using the messages, and the vertex and outgoing edge values, which may result in modifications to the values, and may send messages to other vertices.

23.Shortest path Example compute() method

24.Giraph terminology Barrier The messages sent in any current superstep get delivered to the destination vertices only in the next superstep Vertices start computing the next superstep after every vertex has completed computing the current superstep

25.Giraph terminology Barrier The messages sent in any current superstep get delivered to the destination vertices only in the next superstep Vertices start computing the next superstep after every vertex has completed computing the current superstep

26.GraphLab / Dato Began as a PhD thesis at Carnegie Mellon University Like Mahout, a Tale of Two Frameworks GraphLab 1.0, 2.0 Vertex-centric alternative to Hadoop for graph analytics (a la Apache Giraph ) Dato , GraphLab Create ??? SaaS : front-facing Python API for interacting with [presumably] C++ backend on AWS

27.GraphLab : the early years Envisioned as a vertex-centric alternative to Hadoop and, in particular, Mahout Built in C++ Liked to compare apples and oranges…

28.GraphLab to Dato Data Engineering Extraction, transformation Visualization Data Intelligence Recommendation Clustering Classification Deployment Creating services

29.Dato data structures SArray An immutable, homogeneously typed array object backed by persistent storage. SArray is scaled to hold data that are much larger than the machine’s main memory . It fully supports missing values and random access. The data backing an SArray is located on the same machine as the GraphLab Server process. Each column in an SFrame is an SArray . SFrames A tabular, column-mutable dataframe object that can scale to big data. The data in SFrame is stored column-wise on the GraphLab Server side, and is stored on persistent storage (e.g. disk) to avoid being constrained by memory size. Each column in an SFrame is a size-immutable SArray , but SFrames are mutable in that columns can be added and subtracted with ease . An SFrame essentially acts as an ordered dict of SArrays . SGraph A scalable graph data structure. The SGraph data structure allows arbitrary dictionary attributes on vertices and edges, provides flexible vertex and edge query functions, and seamless transformation to and from SFrame .