Hybrid collaborative tiered storage with Alluxio

Applications reading data from AWS S3 or Alibaba Cloud OSS usually suffer serious performance problems, since every read goes over the remote network. Alluxio can provide a transparent caching layer that automatically caches the remote OSS/S3 data being read. But when does Alluxio itself pull the remote data? Does it cache everything by default, or only on demand? This deck introduces Alluxio's tiered-storage concept and combines it with the ZFS file system to maximize performance while keeping application development simple.

1.Hybrid collaborative tiered storage with Alluxio (Thai Bui, Data Engineer @ Bazaarvoice)

2.Bazaarvoice ● Founded in 2005 in Austin, TX ● Digital marketing SaaS platforms for ratings and reviews ○ Display & syndicate reviews from brands to retailer websites ○ Reporting & analytics on consumers, reviews, products, etc. ● 2,600 client websites ● 5.4 billion product page views each month ● 900 million unique shoppers each month

3.Reporting & analytics on S3 When you have 100s of TB of data on S3 ● Just listing the files is slow ● Download speed in EC2 is limited (50-150Mb/s per node) ● No concept of cache ● No concept of data locality

4.AWS S3 : The Need For Speed ● Add tiered storage to S3 ○ Hot, warm, cold storage (fastest, fast, and not so fast) ○ Metadata cache ○ Data cache ● Keep data local ○ In the same machine, not via the Ethernet cable ● Compatible with existing services ○ Hadoop, Spark, Hive, Presto, etc. ● Adaptive & highly configurable ○ Symlink for S3

5.Overview ● Alluxio ○ Distributed data storage ○ Hadoop compatible ○ By AMPLab ● ZFS ○ OS-level file system ○ Volume manager ○ By Sun Microsystems ● Both are open-source (Diagram: App1, App2 and Spark read through the Hive metastore; cold data stays in S3, hot & warm data sits in Alluxio backed by ZFS)

6.Alluxio : The tiered-storage layer ● Supports the native filesystem and Hadoop filesystem APIs ● Distributed and can be installed on every node ○ Provides data locality ● Mount S3, HDFS, etc. into Alluxio ○ Think symlink: no data movement ● Use the Hive metastore to partition data into hot/warm and cold regions ○ Acts as a remote tiered-storage layer
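The "think symlink" point above corresponds to Alluxio's mount command: the S3 bucket appears in the Alluxio namespace, and no bytes move until a client reads a path under it. A minimal sketch, assuming an Alluxio 1.x deployment; the bucket name, master hostname, and credential placeholders are all hypothetical:

```shell
# Mount an S3 bucket into the Alluxio namespace. Nothing is copied at
# mount time; blocks are cached lazily as clients read them.
alluxio fs mount \
  --option s3a.accessKeyId=<ACCESS_KEY> \
  --option s3a.secretKey=<SECRET_KEY> \
  /s3 s3a://my-reviews-bucket/warehouse

# Hadoop-compatible clients (Hive, Spark, Presto, ...) then read through
# the cache layer via the alluxio:// scheme instead of s3a://.
hadoop fs -ls alluxio://alluxio-master:19998/s3
```

This is a configuration sketch, not a runnable script: it assumes a live Alluxio master and valid AWS credentials.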

7.ZFS : The acceleration layer ● Both a filesystem & a volume manager ○ Mirrored writes to 2 SSDs -> 2x read speed ● Works in Linux kernel space ○ Works with RAM to accelerate reads/writes ○ Auto-promotes/demotes blocks between RAM and other storage ○ Uses local NVMe SSD if data is not in RAM ○ Acts as a local tiered-storage layer ● Extremely reliable ○ Automatic block checksum & repair
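The mirrored-SSD setup above maps onto standard zpool/zfs commands. A sketch under stated assumptions: the device names are illustrative (check `lsblk` for the real ones), the pool name `tank` and the 16 GiB ARC cap are arbitrary choices, not values from the talk:

```shell
# Create a mirrored pool from the two local NVMe devices. A mirror
# survives one device failure and roughly doubles streaming read
# throughput, since reads are balanced across both sides.
zpool create -o ashift=12 tank mirror /dev/nvme0n1 /dev/nvme1n1

# Cap the ARC (ZFS's in-RAM block cache) so it coexists with the JVM
# heaps of the Alluxio/Hive workers on the same node. 16 GiB example:
echo 17179869184 > /sys/module/zfs/parameters/zfs_arc_max

# Create a dataset for the Alluxio worker's top storage tier.
zfs create -o mountpoint=/alluxio/fs tank/alluxio
```

With this in place, blocks the ARC evicts from RAM are still served from the NVMe mirror, which is the local promote/demote behavior described on the slide.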

8.ZFS + NVMe: Micro benchmark i3.4xlarge, up to 10 Gbit network, 2 x 1.9 TB NVMe SSD ● Baseline w/ EBS ○ 135 MB/s write (dd if=/dev/zero of=/tmp/test1.img bs=1G count=1 oflag=dsync) ○ 157 MB/s read (dd if=/tmp/test1.img of=/dev/null bs=8k) ● ZFS + 2 mirrored NVMe SSDs ○ 820 MB/s write (dd if=/dev/zero of=/alluxio/fs/test1.img bs=1G count=1) ○ 1.7 GB/s read (dd if=/alluxio/fs/test1.img of=/dev/null bs=1G count=1) ● 4x write, 10x read compared to EBS ● 10-15x compared to S3
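The dd invocations on this slide can be wrapped in a small script to repeat the comparison on any mount point. A sketch, not the exact benchmark setup from the talk: the target path and the smaller block sizes are placeholders chosen so it runs anywhere (bump them back to bs=1G count=1 for a realistic run on GNU/Linux):

```shell
#!/bin/sh
# Sequential write/read micro-benchmark sketch for a given mount point.
TARGET=${1:-/tmp}          # pass the ZFS/Alluxio mount point to test
FILE="$TARGET/ddtest.img"

# Write test: oflag=dsync forces each block to stable storage before dd
# returns, matching the EBS baseline measurement on the slide.
dd if=/dev/zero of="$FILE" bs=1M count=64 oflag=dsync 2>&1 | tail -n 1

# Read test. Note this measures page-cache-warm reads unless caches are
# dropped first (echo 3 > /proc/sys/vm/drop_caches, requires root).
dd if="$FILE" of=/dev/null bs=1M 2>&1 | tail -n 1

rm -f "$FILE"
```

dd prints its throughput summary on stderr, hence the `2>&1 | tail -n 1` to keep just the rate line from each run.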

9.With ZFS (Diagram: applications use the native/Hadoop filesystem API against Alluxio in user space; in kernel space, ZFS promotes hot blocks into RAM and demotes warm blocks to NVMe SSD)

10.With Hive (Diagram: the Hive metastore routes the last 30 days of data to the hot & warm Alluxio tier and anything older than 30 days to cold storage on S3)
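The routing in this slide is driven entirely by partition locations in the Hive metastore, so it can be sketched with plain ALTER TABLE statements. The table name, partition keys, master hostname, and bucket below are all hypothetical:

```shell
# Recent partition (within the last 30 days): its location points into
# the Alluxio namespace, so scans hit the ZFS-backed hot/warm tiers.
hive -e "ALTER TABLE reviews PARTITION (ds='2017-06-01')
         SET LOCATION 'alluxio://alluxio-master:19998/s3/reviews/ds=2017-06-01';"

# Older partition (> 30 days): its location points straight at S3 (cold).
hive -e "ALTER TABLE reviews PARTITION (ds='2017-01-01')
         SET LOCATION 's3a://my-reviews-bucket/warehouse/reviews/ds=2017-01-01';"
```

A scheduled job can demote partitions as they age past the 30-day window; queries that span both ranges transparently mix cached and remote reads. This sketch assumes a running Hive metastore, so it is a configuration fragment rather than a standalone script.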

11.CPU/IO Monitoring

12.Tiered storage Monitoring

13.Alluxio Monitoring

14.Hive Monitoring & Performance ● Scanning 5 GB of data in tiered storage, 350M rows, fewer projections ● Scanning 200 GB of data in tiered storage, 500M rows, select *

15.Scanning 35 GB of data in S3, 1.6B rows, count distinct: 60s, with the majority of the time spent on scanning S3 (chart: metadata/split calculation ops)

16.Results ● 5-10x read improvement in Hive ○ Workers can short-circuit and read directly from ZFS instead of S3 ○ Moves compute to the data ● Easy to debug, with a feedback loop, collaborative ○ Data publishers + data analysts/scientists ● Good for iterating over the same data set multiple times ○ Machine learning ○ Exploratory analysis ● Gives us control over S3 ○ More recent data is faster to access