Alluxio 2.0 Preview

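What new features does Alluxio 2.0, the latest release, bring? In his talk at the Alluxio Bay Area meetup this March, Calvin Jia, a maintainer in the Alluxio open source community, covered the new design and implementation features of Alluxio 2.0.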

1.Alluxio 2.0.0-preview (03/14, Alluxio Meetup)

2.About Me: Calvin Jia
- Release Manager for Alluxio 2.0.0
- Contributor since Tachyon 0.4 (2012)
- Founding Engineer @ Alluxio

3.Alluxio Overview
- Open source, distributed storage system
- Commonly used for data analytics such as OLAP on Hadoop
- Deployed at Huya, Two Sigma, Tencent, and many others
- Largest deployments of over 1000 nodes
- Interfaces (diagram labels): Java File API, HDFS Interface, S3 Interface, REST API, POSIX Interface
- Drivers (diagram labels): HDFS Driver, Swift Driver, S3 Driver, NFS Driver

4.Agenda
   1. Alluxio 2.0 Motivation
   2. Architectural Innovations
   3. Release Roadmap

5.Alluxio 2.0 Motivations

6.Why 2.0
- Alluxio 1.x target use cases are largely addressed
- Three major types of feedback from users:
  - Want to support POSIX-based workloads, especially ML
  - Want better options for data management
  - Want to scale to larger clusters

7.Use Cases
Alluxio 1.x
- Burst compute into cloud with data on-prem
- Enable object stores for data analytics platforms
- Accelerate OLAP on Hadoop
- Example: As a data scientist, I want to be able to spin up my own elastic compute cluster that can easily and efficiently access my data stores.
New in Alluxio 2.x
- Enable ML/DL frameworks on object stores
- Data lifecycle management and data migration
- Examples:
  - As a data scientist, I want to run my existing simulations on larger datasets stored in S3.
  - As a data infrastructure engineer, I want to automatically tier data between Alluxio and the under store.

8.ML/DL Workloads
- Alluxio 1.x focuses primarily on Hadoop-based workloads, i.e. OLAP on Hadoop
- Alluxio 2.x will continue to excel for these workloads
- New emphasis on ML frameworks such as TensorFlow
- These frameworks primarily access the same data sets that Alluxio already serves
- Challenges include a new API and different file characteristics, such as file access patterns and file sizes

9.Data Management
- Finer grained control over Alluxio replication
- Automated and scalable async persistence
- Distributed data loading
- Mechanism for cross-mount data operations

10.Scaling
- Namespace scaling: scale to 1 billion files
- Cluster scaling: scale to 3000 worker nodes
- Client scaling: scale to 30,000 concurrent clients

11.Architectural Innovations

12.Architectural Innovations in 2.0
- Off heap metadata storage (namespace scaling)
- gRPC transport layer (cluster and client scaling)
- Improved POSIX API (new workloads)
- Job Service (enable data management)
- Embedded Journal and Internal Leader Election (better integration with object stores, fewer external dependencies)

13.Off Heap Metadata Storage
- Uses an embedded RocksDB to store the inode tree
- Internal cache for frequently used inodes
- Performance is comparable to the previous on-heap option when the working set can fit in the cache
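
To illustrate why performance can stay comparable while the working set fits in the cache, here is a minimal Python sketch of a bounded cache in front of a slower backing store. It is not Alluxio's implementation: the class name `CachedInodeStore`, the plain dict standing in for RocksDB, and the LRU eviction policy are all assumptions made for illustration.

```python
from collections import OrderedDict


class CachedInodeStore:
    """Illustrative only: an LRU cache in front of a slower backing store.

    The dict `backing` stands in for an embedded on-disk store such as
    RocksDB; the name and the LRU policy are assumptions, not Alluxio code.
    """

    def __init__(self, backing, capacity):
        self.backing = backing          # hypothetical stand-in for the on-disk store
        self.capacity = capacity        # maximum number of inodes kept in memory
        self.cache = OrderedDict()      # inode id -> metadata, oldest first
        self.hits = 0
        self.misses = 0

    def get(self, inode_id):
        if inode_id in self.cache:
            self.cache.move_to_end(inode_id)   # mark as most recently used
            self.hits += 1
            return self.cache[inode_id]
        self.misses += 1
        value = self.backing[inode_id]         # slow path: read from the store
        self.cache[inode_id] = value
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)     # evict the least recently used
        return value


store = CachedInodeStore({i: f"/data/file-{i}" for i in range(1000)}, capacity=100)
for _ in range(10):
    for inode_id in range(100):     # working set of 100 inodes fits in the cache
        store.get(inode_id)
print(store.hits, store.misses)
```

In this toy run only the first pass over the working set touches the backing store; the remaining nine passes are served from memory. A working set larger than `capacity` would start evicting entries and fall back to the slow path more often.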

14.gRPC Transport Layer
- Switch from a Thrift (metadata) + Netty (data) transport to a consolidated gRPC-based transport
- Connection multiplexing reduces the number of connections from the number of application threads to the number of applications
- The threading model enables the master to serve concurrent requests without being limited by internal thread pool size or open file descriptors on the master
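
The effect of connection multiplexing can be estimated with simple arithmetic. The sketch below assumes one connection per application thread without multiplexing and one shared connection per application with it; the function name and the deployment numbers are hypothetical and chosen only for illustration.

```python
def connections_to_master(num_applications, threads_per_application, multiplexed):
    """Rough count of client connections the master must hold open.

    Assumes one connection per thread without multiplexing and one shared
    connection per application with it; real deployments will vary.
    """
    if multiplexed:
        return num_applications
    return num_applications * threads_per_application


# Hypothetical deployment: 300 applications with 100 threads each.
before = connections_to_master(300, 100, multiplexed=False)
after = connections_to_master(300, 100, multiplexed=True)
print(before, after)
```

Under these assumptions the master holds 300 connections instead of 30,000, which is why multiplexing eases pressure on open file descriptors.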

15.Improved POSIX API
- Alluxio FUSE based POSIX API
- Limitations remain: no random writes, and a file cannot be read until it is complete
- Validated against TensorFlow's image recognition and recommendation workloads
- Taking suggestions for other POSIX-based workloads!
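
A POSIX interface means an application can use ordinary file I/O, provided it stays within the limitations listed above. The following Python sketch shows an access pattern that respects them: it writes sequentially, closes the file, and only then reads it. The mount point is an assumption; a temporary directory stands in for an Alluxio FUSE mount so that the example runs anywhere.

```python
import os
import tempfile

# Stand-in for an Alluxio FUSE mount point (hypothetical); a temporary
# directory is used so the sketch runs anywhere.
mount_point = tempfile.mkdtemp()
path = os.path.join(mount_point, "sample.bin")

# Write sequentially from start to end, with no seeks back into the file,
# which matches the "no random write" limitation.
with open(path, "wb") as out:
    for chunk_index in range(4):
        out.write(bytes([chunk_index]) * 1024)

# Read only after the writer has closed the file, since a file cannot be
# read until it is complete.
with open(path, "rb") as src:
    data = src.read()

print(len(data))
```

Workloads that seek backwards while writing, or that read a file while another process is still writing it, fall outside this pattern.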

16.Job Service
- New process which serves as a lightweight computation framework for Alluxio specific tasks
- Enables replication factor control without user input
- Enables faster loading/persisting of data in a distributed manner
- Allows users to do cross-mount operations
- Async through is handled automatically
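
One way to picture replication factor control is as a periodic check that compares each block's current number of copies with a target range and schedules the difference as a job. The sketch below is a guess at that idea, not the Job Service's actual logic: the helper `replication_adjustment`, its parameter names, and the block names are all hypothetical.

```python
def replication_adjustment(current_replicas, min_replicas, max_replicas):
    """Return how many copies to add (positive) or remove (negative).

    Hypothetical helper: the parameter names are assumptions and do not
    mirror Alluxio's configuration or internals.
    """
    if min_replicas > max_replicas:
        raise ValueError("min_replicas must not exceed max_replicas")
    if current_replicas < min_replicas:
        return min_replicas - current_replicas
    if current_replicas > max_replicas:
        return max_replicas - current_replicas
    return 0


# Hypothetical blocks mapped to their current number of cached copies.
block_replicas = {"block-a": 1, "block-b": 3, "block-c": 6}
plan = {block: replication_adjustment(count, min_replicas=2, max_replicas=4)
        for block, count in block_replicas.items()}
print(plan)
```

Here `block-a` would gain one copy, `block-b` is left alone, and `block-c` would lose two, all without the user issuing any command.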

17.Embedded Journal and Internal Leader Election
- New journaling service reliant only on Alluxio master processes
- No longer need an external distributed storage to store the journal
- Greatly benefits environments without a distributed file system
- Uses Raft as the consensus algorithm
- Consensus is used for journal integrity
- Consensus can also be used for leader election in high availability mode
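
Raft relies on majorities: an entry is committed once a majority of the voting members have stored it, and a leader is elected with votes from a majority. The sketch below works through that arithmetic for small master clusters. The function names are illustrative and are not part of any Alluxio API.

```python
def quorum_size(num_masters):
    """Smallest majority of a cluster with `num_masters` voting members."""
    return num_masters // 2 + 1


def tolerated_failures(num_masters):
    """How many masters can fail while a majority is still reachable."""
    return num_masters - quorum_size(num_masters)


def is_committed(acknowledgements, num_masters):
    """A journal entry counts as committed once a majority has stored it."""
    return acknowledgements >= quorum_size(num_masters)


for n in (1, 3, 5):
    print(n, quorum_size(n), tolerated_failures(n))
```

With three masters the journal survives the loss of one, and with five it survives the loss of two, which is why high availability setups typically use an odd number of masters.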

18.Release Roadmap

19.Alluxio 2.0.0 Release
- Alluxio 2.0.0-preview is available now
- Any and all feedback is appreciated!
- File bugs and feature requests on our GitHub issues
- Alluxio 2.0.0 will be released in ~3 months

20.Questions?
- Alluxio Website
- Alluxio Community Slack Channel
- Alluxio Office Hours & Webinars

21.Questions?
- Alluxio Bay Area Meetup
- @alluxio
- /slack
- WeChat