Elastify Cloud-Native Spark Application with Persistent Memory

Cloud native deployment has become one of the major trends for large scale Big Data analytics. Compared to on-premise data center, cloud offers much stronger scalability and higher elasticity to Big Data applications. However, cloud is also considered to be less performance than on-premise alternatives due to virtualization and cluster resource disaggregation. We present a new cloud native Spark application architecture backed by persistent memory technology. The key ingredient of this architecture is a novel acceleration engine that uses Intel’s 3DXPoint technology as external memory. We discuss how the performance of multiple aspects of data processing can be improved using this new architecture. As a key takeaway, audience will gain understanding on the benefits of latest persistent memory technology, and how such new technology could be leveraged in cloud data processing architecture.

1.WIFI SSID:SparkAISummit | Password: UnifiedAnalytics

2.Elastify Cloud-Native Spark Application with Persistent Memory Saisai Shao (Tencent), Peiyu Zhuang (MemVerge) #UnifiedAnalytics #SparkAISummit

3.About Me Saisai Shao Expert Software Engineer in Tencent Cloud Apache Spark Committer and Apache Livy (incubator) PPMC Peiyu Zhuang Software Engineer in MemVerge #UnifiedAnalytics #SparkAISummit 3

4.Tencent Cloud 3+ 3.5+ 12000+ 20+ Billions Trillions Trillions Largest big data cluster Records per day Time of computations Ads per day 100PB+ 500TB 500PB+ Data Tech Model Scenario

5.Tencent Cloud Big Data and AI AI Conversation Intelligent Smart Live (XiaoWei) Customer Service Conference Broadcasting AI Services Intelligent Face/Human GrandEye Smart Search Recommendation Identification AI Platform Natural Language Image Recognition Voice Recognition Processing Service AI TI-ML Foundations RayData Data Visualization SaaS BI Solution Big Data Stream Compute Service Elasticsearch Service Services Snova Data Warehouse Sparkling Data Warehouse Suite Elastic MapReduce #UnifiedAnalytics #SparkAISummit 5

6.About MemVerge MemVerge is a startup company based in San Jose. We started in 2017. We are delivering world’s first Memory- Converged Infrastructure (MCI) system, called Distributed Memory Objects (DMO). Up to 768 TB Up to 72 GB/s < 1 μs Total memory per cluster Read bandwidth per node Access Latency #UnifiedAnalytics #SparkAISummit 6

7.Back To the Date of MapReduce How did we design data application? • Network bandwidth vs. disk throughput 1Gbps • Move code rather than moving data • Fast small memory vs. slow large disk • Optimize sequence R/W #UnifiedAnalytics #SparkAISummit 7

8.The Trends of HW in DC Datacenter Bandwidth Migration Enterprise Bytes Shipments: HDD and SSD * https://www.cisco.com/c/dam/en/us/products/collateral/switches/nexus-9000-series-switches/white-paper-c11-734328.pdf * https://www.backblaze.com/blog/hdd-vs-ssd-in-data-centers/ #UnifiedAnalytics #SparkAISummit 8

9.Modern DC Architecture Changes happen to modern DC? Compute nodes • Separate data and computation, high- speed network between compute nodes and storage boxes storage boxes • Tiered storage for hot and cold data Accelerators 25~100Gbps #UnifiedAnalytics #SparkAISummit 9

10.Reimaging the DC Memory and Storage Hierarchy HOT Memory DRAM Improving memory capacity Improving SSD performance WARM SSD Storage Efficient and scalable storage HDD/TAPE COLD #UnifiedAnalytics #SparkAISummit 10

11.Embrace the New Architecture • Low latency and high throughput, like DRAM – Latency: 200 ~ 400ns – Bandwidth: • Read: Up to 8GB/s • Write: Up to 3GB/s • High density and non-volatility, like NAND – Up to 6TB per server IntelⓇ OptaneTM DC Persistent Memory • Memory-speed storage system #UnifiedAnalytics #SparkAISummit 11

12.How to Use DCPMM RDMA/DPDK DCPMM per node DCPMM centric arch #UnifiedAnalytics #SparkAISummit 12

13.MemVerge Elastic Spark Solution Ethernet Switch Data Source RDD Caching and Storage Shuffle Data #UnifiedAnalytics #SparkAISummit 13 13

14.A PMEM Centric Data Platform MemVerge Spark Adaptors MemVerge DMO Cluster Shared Persistent Memory Node 1 Node 2 Node 3 Node 4 Node N DRAM … DRAM DRAM DRAM DRAM PMEM PMEM PMEM PMEM PMEM #UnifiedAnalytics #SparkAISummit 14 14

15.Spark Integration Spark with additional Data Source RDD persist APIs RDD Hadoop compatible MemVerge Caching and Storage storage APIs DMO A new generic shuffle Shuffle Data manager #UnifiedAnalytics #SparkAISummit 15 15

16.DCPMM Equipped Shuffle Service #UnifiedAnalytics #SparkAISummit 16

17.Shuffle & Block Manager • Block manager persists data to the Compute Node memory or disk in local nodes. Spark Executor • Losing an executor means recomputing of the whole shuffle task. Shuffle Manager • The storage and network implementation is coupled with the Persist & Retrieve Data shuffle implementation. Shuffle Output Block Manager Memory Disk Local Store Store Disk #UnifiedAnalytics #SparkAISummit 17

18.The Problems of Current Shuffle • Poor elasticity – The failure of node leads to shuffle data lost • Heavy overhead to NodeManager – Coexisting with NM brings heavy overhead to NM for heavy workloads • Unsuitable to cloud environment – Data/computation separation architecture brings no advantages to local shuffle • The community is also working on these problems: – SPARK-25299 Use remote storage for persisting shuffle data – SPARK-26268 Decouple shuffle data from Spark deployment #UnifiedAnalytics #SparkAISummit 18

19.MemVerge Splash Shuffle Manager • A flexible shuffle manager – supports user-defined storage backend and network transport for shuffle data • Open source – https://github.com/MemVerge/splash • Spark JIRA: SPARK-25299 #UnifiedAnalytics #SparkAISummit 19 19

20.Splash Shuffle Manager Worker 1 Worker 2 • Create a new shuffle manager that Executor 1 Executor 2 implements shuffle manager interface Splash Splash • Extract the storage and network implementations to the storage plugin Storage Plugin Storage Plugin interface Read Write • Apply different plugins for different shuffle shuffle storage & network • Separate storage and compute Storage System • Tolerate node failure (NFS, local FS, • Support dynamic allocation HDFS, S3, DMO …) #UnifiedAnalytics #SparkAISummit 20

21.Persist Shuffle Data to PMEM • Distributed Memory Object (DMO) is a Shuffle distributed file system built on PMEM. {} Manager Splash Shuffle • The storage plugin allows us to persist data Manager into the DMO system, a separated storage cluster. Storage {} Plugin • The use of PMEM and fast network DMO Plugin technologies (RDMA or DPDK) in the storage cluster speeds up the shuffle. DMO System Persistent Memory #UnifiedAnalytics #SparkAISummit 21

22.Benchmark Settings Common Baseline DMO • 4 compute nodes • 4 HDD • 2 storage nodes • 10GbE network • 7200 RPM • 400GB PMEM/node • Driver memory 4g • Executor memory 6g • Total cores 160 • Executor cores 4 #UnifiedAnalytics #SparkAISummit 22

23.TeraSort Results Reduce Stage (min) TeraSort 400GB, 216G Shuffle Write Map Stage (min) 35 30 25 20 22 15 11 10 10 5 9.2 6 6.5 0 Baseline DMO with UDP DMO with DPDK #UnifiedAnalytics #SparkAISummit 23

24.TPC-DS Results TPC-DS 1.2TB Duration (s) 1800 Baseline 1600 DMO 1400 1200 1000 800 600 400 200 0 78 4 64 24a 24b 80 23a 23b 25 17 29 11 93 74 50 16 40 Query ID #UnifiedAnalytics #SparkAISummit 24

25.TPC-DS Results, cont. TPC-DS Query 80 TPC-DS Query 4 1200 80 1500 80 1000 60 60 800 1000 40 600 40 20 400 500 20 0 200 0 0 0 -20 400GB 800GB 1200GB 400GB 800GB 1200GB TPC-DS Query 23 TPC-DS Query 24 2000 50 2500 50 40 2000 40 1500 30 1500 30 1000 20 1000 20 10 500 500 10 0 Baseline 0 -10 0 0 DMO 400GB 800GB 1200GB 400GB 800GB 1200GB Percent #UnifiedAnalytics #SparkAISummit 25

26.Benchmark in Cloud • 4 Sparkling host: • Spark Conf – 32 core – Driver memory 4G – 128GB DRAM – 50GB cloud system disk – Executor memory 30G – 4*11TB SATA data disk – Executor memory overhead 2G – 10G network – Executor Cores 4 • 3 DMO host: – Executor Instances 12 – 32 core – 256GB DRAM (200GB PMEM) • TPC-DS – 50GB cloud system disk – data size 1TB – 10G network – 10 queries that has the biggest shuffle data #UnifiedAnalytics #SparkAISummit 26

27.TPC-DS Results in Cloud 1100.0 20.0% Baseline DMO 1000.0 15.0% Percent 900.0 800.0 10.0% 700.0 5.0% 600.0 500.0 0.0% 400.0 -5.0% 300.0 200.0 -10.0% q78 q64 q23a q24a q23b q80 q24b q4 q25 q17 #UnifiedAnalytics #SparkAISummit 27

28.Future Work • Verify the solution in production environment • Data path performance tuning for Splash shuffle manager • Enable map side merge in Splash shuffle manager • Support the Java NIO style IO interface in Splash shuffle manager • Landing the Splash shuffle service in cloud environment #UnifiedAnalytics #SparkAISummit 28

29.Summary • PMEM will bring fundamental changes to ALL data centers and enable a data-driven future • MemVerge and Tencent Cloud deliver better scalability and performance at a lower cost not just for Spark – AI, Big Data, Banking, Animation Studios, Gaming Industry, IoT, etc. – Machine learning, analytics, and online systems • Thank you Intel for supporting our work! #UnifiedAnalytics #SparkAISummit 29

由Apache Spark PMC & Committers发起。致力于发布与传播Apache Spark + AI技术,生态,最佳实践,前沿信息。