Accelerating Big Data Analytics on the Cloud with Alluxio, and New Opportunities from Persistent Memory

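Cloud-based big data analytics has become increasingly popular thanks to its low cost and flexible deployment, but its performance still lags behind on-premises clusters. This talk presents and analyzes the performance differences of several workloads (Terasort, TPC-DS, Machine Learning) on S3 versus on-premises deployments, and shows how much Alluxio accelerates these workloads. The arrival of persistent memory as a storage medium has opened up a new world in storage, and the talk also explores the new opportunities it creates for Alluxio.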

To learn more about Alluxio, visit the official website: https://www.alluxio.org/

1.Yuan Zhou Oct, 2018

2. Legal Disclaimer & Optimization Notice Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit www.intel.com/benchmarks. INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Copyright © 2018, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries. Optimization Notice Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors.
Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804. *Other names and brands may be claimed as the property of others.

3. Agenda
▪ Background
▪ Bigdata Analytics on Cloud Reference Architecture
▪ Bigdata Analytics on Cloud with Alluxio
▪ Persistent Memory for Alluxio
▪ Summary

5. Background
#1 Get a bigger cluster for many teams to share.
#2 Give each team their own dedicated cluster, each with a copy of PBs of data.
#3 Give teams the ability to spin up/spin down clusters which can share data sets.

6. Benefits of compute and storage disaggregation
▪ Independent scale of CPU and storage capacity
• Rightsized HW for each layer
• Reduce resource wastage
• Cost saving
▪ Single copy of data
• Multiple compute clusters share a common data repo/lake
• Simplified data management
• Reduced provisioning overhead
• Improve security
▪ Enable Agile application development
• In-memory cloning
• Snapshot service
• Quick & efficient copies
▪ Hybrid cloud deployment
• Mix and match resources depending on workload nature and life cycle
▪ Simple and flexible software management
• Avoid software version management
• Upgrade compute software only

7. Bigdata Analytics over Disaggregated Storage
Hadoop Compatible File System abstraction layer: unified storage API interface
hadoop fs -ls s3a://job/
[Timeline with year marks 2006, 2008, 2014, 2015, 2016; connector schemes shown: adl://, oss://, s3://, s3n://, gs://, wasb://, s3a://]
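As a rough illustration of the abstraction layer on this slide: Hadoop picks a FileSystem implementation from the URI scheme, which is why the same `hadoop fs` command works against different stores. The Python sketch below mimics that lookup. The class names are the commonly documented defaults for each connector and should be treated as assumptions to verify against the connector version in use; `resolve_filesystem` is a hypothetical helper, not a Hadoop API.

```python
from urllib.parse import urlparse

# Scheme -> Hadoop FileSystem implementation class.
# Assumed defaults; a real cluster takes this mapping from its
# configuration (fs.<scheme>.impl) or from the connector jars.
FS_IMPLS = {
    "hdfs": "org.apache.hadoop.hdfs.DistributedFileSystem",
    "s3n": "org.apache.hadoop.fs.s3native.NativeS3FileSystem",
    "s3a": "org.apache.hadoop.fs.s3a.S3AFileSystem",
    "wasb": "org.apache.hadoop.fs.azure.NativeAzureFileSystem",
    "adl": "org.apache.hadoop.fs.adl.AdlFileSystem",
    "oss": "org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem",
    "gs": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
}


def resolve_filesystem(uri: str) -> str:
    """Return the implementation class name registered for the URI's scheme."""
    scheme = urlparse(uri).scheme
    if scheme not in FS_IMPLS:
        raise ValueError(f"no FileSystem registered for scheme {scheme!r}")
    return FS_IMPLS[scheme]


print(resolve_filesystem("s3a://job/"))
```

The application code stays the same across stores; only the URI scheme, and therefore the connector behind it, changes.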

9. Workloads
Simple Read/Write
▪ DFSIO: TestDFSIO is the canonical example of a benchmark that attempts to measure the storage's capacity for reading and writing bulk data.
▪ Terasort: a popular benchmark that measures the amount of time to sort one terabyte of randomly distributed data on a given computer system.
Batch Analytics
▪ Consistently executes analytical processes over large data sets.
▪ Uses 54 queries derived from TPC-DS*, with intensive reads across objects in different buckets.
Data Transformation (not covered in this talk)
▪ ETL: taking data as it is originally generated and transforming it into a format (Parquet, ORC) that is better tuned for analytical workloads.
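For readers unfamiliar with Terasort, the benchmark has three phases: generate records, sort them by key, and validate the order. The sketch below is a single-process toy version of those phases, assuming the usual TeraGen layout of 100-byte records with a 10-byte key. The function names are illustrative, and it is not the Hadoop or Spark implementation used in the tests.

```python
import random

KEY_LEN, RECORD_LEN = 10, 100  # assumed TeraGen layout: 10-byte key in a 100-byte record


def teragen(n: int, seed: int = 0) -> list:
    """Generate n pseudo-random fixed-size records."""
    rng = random.Random(seed)
    return [bytes(rng.getrandbits(8) for _ in range(RECORD_LEN)) for _ in range(n)]


def terasort(records: list) -> list:
    """Sort records by their key prefix."""
    return sorted(records, key=lambda r: r[:KEY_LEN])


def teravalidate(records: list) -> bool:
    """Check that keys are in non-decreasing order."""
    return all(a[:KEY_LEN] <= b[:KEY_LEN] for a, b in zip(records, records[1:]))


records = teragen(1000)
print(teravalidate(terasort(records)))
```

In the distributed benchmark every phase reads or writes the full data set through the storage layer, which is why Terasort is so sensitive to storage throughput.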

10. System configurations
5x Compute Node
• Intel® Xeon™ processor E5-2699 v4 @ 2.2GHz, 128GB mem
• 2x 82599 10Gb NIC
• 2x SSDs
• 3x Data storage (can be eliminated)
Software:
• Hadoop 2.8.1
• Spark 2.2.0
• Hive 2.2.1
• CentOS 7.5
5x Storage Node, 5 RGW nodes (co-located)
• Intel(R) Xeon(R) CPU E5-2699v4 2.20GHz
• 128GB Memory
• 3x 82599 10Gb NIC
• 7x 1TB HDD as data drive
• 1 OSD instance on each HDD
• CentOS 7.5
• Ceph Mimic
[Diagram: a head node (DNS server) and compute nodes each running Hadoop, Hive, and Spark; network links labeled 1x10Gb NIC and 2x10Gb NIC; storage side with MON, RGW1 to RGW5, and OSD1 to OSD7 on each storage node]

11. Bigdata on Object Storage Performance Overview
[Chart: Performance Overview (Normalized) for TPC-DS (54 queries), Terasort 1T, and KMeans 374g; series: spark(yarn) + LOCAL HDFS (HDD), spark(yarn) + S3 (HDD), spark(yarn) + REMOTE HDFS (HDD)]
• In the Terasort workload, S3 showed a 60% performance gap compared with local HDFS, since Terasort is very I/O intensive
• In the TPC-DS workload, S3 showed a 30% performance gap, and in the KMeans workload, S3 showed a 40% performance gap

13. System Configuration (Alluxio)
5x Compute Node
Hardware:
• Intel® Xeon™ processor Gold 6140 @ 2.3GHz, 384GB Memory
• 1x 82599 10Gb NIC
• 5x P4500 SSD (2 for spark-shuffle)
Software:
• Hadoop 2.8.1
• Spark 2.2.0
• Hive 2.2.1
• CentOS 7.5
Alluxio Acceleration Layer
• 200GB Mem for mem mode
• 1TB SSD (P4500) for SSD mode
Software:
• Alluxio 1.8.0
5x Storage Node
• Intel(R) Xeon(R) CPU Gold 6140 @ 2.30GHz, 192GB Memory
• 2x 82599 10Gb NIC
• 7x 1TB HDD for Ceph bluestore or HDFS namenode and datanode
Software:
• Hadoop 2.8.1
• Ceph 12.2.7
• CentOS 7.5
[Diagram: DNS node and five compute nodes each running Hadoop, Hive, Spark, and Alluxio; a 1x10Gb NIC link to five storage nodes running either Ceph (MON, RGW, OSD) or remote HDFS (NN, DN)]
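The two acceleration-layer modes on this slide would typically be selected through worker tiered-storage settings. The sketch below renders two candidate `alluxio-site.properties` fragments from Python dictionaries. The property names follow the Alluxio 1.x tiered-storage settings as commonly documented and should be checked against the 1.8.0 documentation. The directory paths are hypothetical, and the talk does not state the exact settings that were used.

```python
# Candidate worker settings for the two modes; values mirror the slide
# (200GB memory, 1TB SSD), paths are hypothetical mount points.
MEM_MODE = {
    "alluxio.worker.memory.size": "200GB",
    "alluxio.worker.tieredstore.levels": "1",
    "alluxio.worker.tieredstore.level0.alias": "MEM",
    "alluxio.worker.tieredstore.level0.dirs.path": "/mnt/ramdisk",
}

SSD_MODE = {
    "alluxio.worker.tieredstore.levels": "1",
    "alluxio.worker.tieredstore.level0.alias": "SSD",
    "alluxio.worker.tieredstore.level0.dirs.path": "/mnt/p4500",
    "alluxio.worker.tieredstore.level0.dirs.quota": "1TB",
}


def render_properties(props: dict) -> str:
    """Render a dict as Java-style key=value properties lines."""
    return "\n".join(f"{key}={value}" for key, value in props.items())


print(render_properties(MEM_MODE))
print()
print(render_properties(SSD_MODE))
```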

14. Performance w/ Alluxio Acceleration
[Chart: Alluxio acceleration for disaggregated analytics storage with different workloads (normalized); workloads: Batch Query (54 queries), Terasort 1T, KMeans 374g; series: spark(yarn) + LOCAL HDFS (HDD), spark(yarn) + S3 (HDD), spark(yarn) + REMOTE HDFS (HDD), spark(yarn) + alluxio(SSD) + S3 (HDD), spark(yarn) + alluxio(MEM) + S3 (HDD)]
• An Alluxio-based in-memory acceleration layer provides a significant performance boost for analytics workloads with disaggregated storage
• Up to 3.25x for Terasort
• Up to 1.8x compared with local HDFS
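The 3.25x figure is consistent with the normalized scores if Terasort on plain S3 scores 0.4 (the 60% gap reported for S3 earlier in the talk) and Terasort with Alluxio scores 1.3. The 1.3 is an assumed reading of the chart, not a value stated in the bullets. A minimal check of the arithmetic:

```python
def speedup(accelerated: float, baseline: float) -> float:
    """Ratio of two normalized scores, where higher means faster."""
    return accelerated / baseline


terasort_s3 = 0.4       # from the reported 60% gap versus local HDFS (1.0)
terasort_alluxio = 1.3  # assumed reading of the chart

print(f"{speedup(terasort_alluxio, terasort_s3):.2f}x")
```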

15. Spark-SQL Performance Overview
[Chart: Spark-SQL performance with S3; series: S3, Warmup, Hot]
• The first several queries in the Warmup run perform worse than the S3 baseline
• For most queries, the SSD cache does improve performance
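One plausible reason the first Warmup queries are slower than the S3 baseline is that a cold read pays for the remote fetch plus the cache write, while later reads are served locally. The toy read-through cache below illustrates this. The cost units are hypothetical and not measured, and the class is a sketch rather than Alluxio's implementation.

```python
class ReadThroughCache:
    """Toy compute-side cache in front of a slow remote store."""

    REMOTE_READ_COST = 10  # hypothetical cost units, not measurements
    CACHE_WRITE_COST = 2
    CACHE_READ_COST = 1

    def __init__(self, remote: dict):
        self.remote = remote
        self.local = {}

    def read(self, key: str):
        """Return (data, cost) for one read."""
        if key in self.local:
            # Hot path: served from the local cache.
            return self.local[key], self.CACHE_READ_COST
        # Cold path: fetch from the remote store and populate the cache.
        data = self.remote[key]
        self.local[key] = data
        return data, self.REMOTE_READ_COST + self.CACHE_WRITE_COST


cache = ReadThroughCache({"store_sales/part-0": b"rows"})
_, warmup_cost = cache.read("store_sales/part-0")
_, hot_cost = cache.read("store_sales/part-0")
print(warmup_cost, hot_cost)  # prints "12 1"; an uncached remote read would cost 10
```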

16. Compute-side caching for I/O intensive queries
[Chart: I/O intensive query performance (normalized seconds, lower is better) for query19.sql, query42.sql, query43.sql, query52.sql, query55.sql, query63.sql, and query68.sql; series: S3 (1.00 for every query) and Compute-side Caching]
• Compute-side caching brings better efficiency (10% - 30%) for I/O intensive queries

17. Spark-SQL Performance Analysis: CPU utilization
• CPU utilization on the compute nodes is higher
[Charts: Compute CPU Utilization and Storage CPU Utilization over time for the S3, Warmup, and Hot runs; series: %user, %system, %nice, %iowait, %steal, %idle]

18. Spark-SQL Performance Analysis: Memory utilization
• Memory utilization on the compute nodes is higher
[Charts: Compute Memory Utilization and Storage Memory Utilization over time for the S3, Warmup, and Hot runs]

19. Spark-SQL Performance Analysis: Disk utilization
• The local cache saved a lot of disk I/O on the storage nodes
[Charts: Compute Disk Bandwidth and Storage Disk Bandwidth over time for the S3, Warmup, and Hot runs; series: sum of rkB/s, sum of wkB/s]

20. Spark-SQL Performance Analysis: Network utilization
• The local cache saved a lot of network I/O between the compute and storage nodes
[Charts: Compute Network IO and Storage Network IO over time for the S3, Warmup, and Hot runs; series: sum of rxkB/s, sum of txkB/s]

21. S3A committers
• The new S3A committers are designed to address the "rename" issue
• Instead of using COPY + DELETE, the new committers rely on the S3 multipart upload (MPU) mechanism, so output becomes visible atomically at commit time without a rename
• The staging committer is simpler in general and also more mature, as it comes from Netflix's production code, where it has been used for years
• Hadoop 3.1.0 was released with the "staging" and "magic" committers
• The magic committer should be better by design, but it is still a work in progress
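To make the difference concrete, the toy model below contrasts a rename-based commit (emulated as COPY + DELETE) with an upload-based commit in which data is written to its final key but only becomes visible when the upload is completed. `ToyObjectStore` is an in-memory stand-in invented for this sketch. It is not an S3 client, and it omits everything the real committers do around task failure and cleanup.

```python
class ToyObjectStore:
    """In-memory stand-in for an object store; not a real S3 client."""

    def __init__(self):
        self.objects = {}      # visible objects: key -> bytes
        self.pending = {}      # uploaded but not yet completed: key -> bytes
        self.bytes_copied = 0  # data moved by server-side copies

    def put(self, key: str, data: bytes) -> None:
        self.objects[key] = data

    def rename(self, src: str, dst: str) -> None:
        # Object stores have no rename: emulate it with COPY + DELETE.
        self.objects[dst] = self.objects[src]
        self.bytes_copied += len(self.objects[src])
        del self.objects[src]

    def start_upload(self, key: str, data: bytes) -> None:
        # Multipart-upload style: data is stored but not yet visible.
        self.pending[key] = data

    def complete_upload(self, key: str) -> None:
        # Completing the upload publishes the object in one step.
        self.objects[key] = self.pending.pop(key)


data = b"x" * 1024

# Rename-based commit: write to a temporary key, then rename.
store_a = ToyObjectStore()
store_a.put("out/_temporary/part-0", data)
store_a.rename("out/_temporary/part-0", "out/part-0")

# Upload-based commit: upload to the final key, publish at commit time.
store_b = ToyObjectStore()
store_b.start_upload("out/part-0", data)
visible_before_commit = "out/part-0" in store_b.objects
store_b.complete_upload("out/part-0")

print(store_a.bytes_copied, store_b.bytes_copied, visible_before_commit)  # prints "1024 0 False"
```

In the rename path the copy cost grows with the output size, while the upload path moves no data at commit time.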

22. S3A staging committer performance
[Chart: S3A committer performance with Terasort 1T (normalized); series: spark(yarn) + LOCAL HDFS (HDD), spark(yarn) + REMOTE HDFS (HDD), spark(yarn) + S3 (HDD), spark(yarn) + S3 (HDD) + S3a committer]
• The new S3A staging committer improved performance by ~50% over the original committer

24. Persistent Memory
• New media technologies that offer byte-addressable persistence along with DRAM-like performance
• Different types
• NVDIMM
• Intel Optane DC Persistent Memory
• Intel Optane DC SSD (disk)

25. Intel Optane DC Memory
Source: https://www.snia.org/sites/default/files/SDC/2018/presentations/PM/Andy_Rudoff_Update_SNIA_Persistent_Memory_Programming_Model.pdf

26. Performance of DCPPM vs. NAND
Source: https://software.intel.com/en-us/persistent-memory

28. A new PMEM tier for Alluxio
[Diagram: current tiers Mem, SSD, HDD; proposed tiers Mem, PMEM, SSD, HDD]

29. New PMEM tier for Alluxio
[Diagram: three I/O paths from client to worker reader/writer. (1) POSIX IO with context switches, through the filesystem and page cache. (2) POSIX IO with context switches, through a DAX filesystem to PMEM. (3) Our approach: PMDK with userspace load/store on a memory-mapped file backed by PMEM.]
Goal: bypass filesystem overhead and context switches
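The load/store path can be approximated with an ordinary memory-mapped file. The sketch below uses Python's standard `mmap` module on a temporary file as a stand-in. On a DAX-mounted persistent-memory filesystem the same mapping would reach the media without going through the page cache, and PMDK (a C library) would persist data with CPU cache flushes rather than the flush call used here. The file name is hypothetical, and this is not the Alluxio worker code.

```python
import mmap
import os
import tempfile

# Hypothetical location: on a real system this would be a file on a
# DAX-mounted persistent-memory filesystem; a temp file stands in here.
path = os.path.join(tempfile.mkdtemp(), "block-0")
size = 4096

with open(path, "wb") as f:
    f.truncate(size)  # allocate the backing file

with open(path, "r+b") as f:
    with mmap.mmap(f.fileno(), size) as buf:
        buf[0:5] = b"hello"      # store: a memory write, no write() call
        buf.flush()              # ask the OS to persist the mapped range
        first = bytes(buf[0:5])  # load: a memory read, no read() call

with open(path, "rb") as f:
    persisted = f.read(5)

print(first, persisted)
```

After the mapping is set up, reads and writes are plain memory accesses, which is what removes the per-I/O system calls and context switches from the data path.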