Elastify Cloud-Native Spark Application with Persistent Memory
2 .Elastify Cloud-Native Spark Application with Persistent Memory Saisai Shao (Tencent), Peiyu Zhuang (MemVerge) #UnifiedAnalytics #SparkAISummit
3 .About Me
• Saisai Shao: Expert Software Engineer at Tencent Cloud; Apache Spark Committer and Apache Livy (incubator) PPMC member
• Peiyu Zhuang: Software Engineer at MemVerge
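4 .Tencent Cloud
[Infographic: scale of Tencent Cloud big data. Figures: 3+, 3.5+, 12000+, 20+ (units: billions, trillions, trillions). Metric labels: largest big data cluster, records per day, time of computations, ads per day. Further figures: 100PB+, 500TB, 500PB+. Category labels: Data, Tech, Model, Scenario.]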
5 .Tencent Cloud Big Data and AI
• AI Services: AI Conversation (XiaoWei), Intelligent Customer Service, Smart Conference, Live Broadcasting, Intelligent Recommendation, Face/Human Identification, GrandEye, Smart Search
• AI Platform Service: Natural Language Processing, Image Recognition, Voice Recognition
• AI Foundations: TI-ML
• Data Visualization: RayData, SaaS BI Solution
• Big Data Services: Stream Compute Service, Elasticsearch Service, Snova Data Warehouse, Sparkling Data Warehouse Suite, Elastic MapReduce
6 .About MemVerge
MemVerge is a startup based in San Jose, founded in 2017. We are delivering the world's first Memory-Converged Infrastructure (MCI) system, called Distributed Memory Objects (DMO).
• Total memory per cluster: up to 768 TB
• Read bandwidth per node: up to 72 GB/s
• Access latency: < 1 μs
7 .Back to the Days of MapReduce
How did we design data applications?
• Network bandwidth (1Gbps) vs. disk throughput
• Move code rather than moving data
• Fast, small memory vs. slow, large disk
• Optimize for sequential R/W
8 .The Trends of HW in DC
[Charts: "Datacenter Bandwidth Migration" and "Enterprise Bytes Shipments: HDD and SSD"]
* https://www.cisco.com/c/dam/en/us/products/collateral/switches/nexus-9000-series-switches/white-paper-c11-734328.pdf
* https://www.backblaze.com/blog/hdd-vs-ssd-in-data-centers/
9 .Modern DC Architecture
What changes are happening in the modern DC?
• Separate data and computation, with a high-speed network between compute nodes and storage boxes
• Tiered storage for hot and cold data
[Diagram labels: Compute nodes, Storage boxes, Accelerators, 25~100Gbps]
10 .Reimagining the DC Memory and Storage Hierarchy
• Tiers: HOT (Memory: DRAM), WARM (Storage: SSD), COLD (HDD/TAPE)
• Callouts: improving memory capacity, improving SSD performance, efficient and scalable storage
11 .Embrace the New Architecture
Intel® Optane™ DC Persistent Memory
• Low latency and high throughput, like DRAM
 – Latency: 200 ~ 400ns
 – Bandwidth: read up to 8GB/s, write up to 3GB/s
• High density and non-volatility, like NAND
 – Up to 6TB per server
• Memory-speed storage system
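To put the quoted figures side by side, here is a small sketch that converts network link rates (the 1Gbps and 25~100Gbps figures from the earlier slides, plus 10GbE as used in the benchmarks) into GB/s and compares them with the DCPMM peak read bandwidth above. It assumes decimal units and ignores protocol overhead; the function and constant names are illustrative only.

```python
# Illustrative arithmetic only: decimal units, no protocol overhead.
DCPMM_READ_GB_S = 8.0  # "Read: Up to 8GB/s" as quoted on the slide


def gbps_to_gb_per_s(gbps: float) -> float:
    """Convert a link rate in gigabits/s to gigabytes/s (8 bits per byte)."""
    return gbps / 8.0


def links_vs_dcpmm(link_rates_gbps):
    """Return (rate in Gbps, rate in GB/s, fraction of DCPMM peak read) per link."""
    rows = []
    for rate in link_rates_gbps:
        gb_s = gbps_to_gb_per_s(rate)
        rows.append((rate, gb_s, gb_s / DCPMM_READ_GB_S))
    return rows


if __name__ == "__main__":
    for rate, gb_s, frac in links_vs_dcpmm([1, 10, 25, 100]):
        print(f"{rate:>4} Gbps = {gb_s:6.3f} GB/s ({frac:.0%} of DCPMM peak read)")
```

On these numbers a 1Gbps link is far below the device bandwidth, while a 100Gbps link exceeds the quoted peak read rate of a single DCPMM module set, which is the kind of shift that makes remote persistent memory plausible.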
12 .How to Use DCPMM
[Diagram: two deployment options, "DCPMM per node" and "DCPMM centric arch"; network label: RDMA/DPDK]
13 .MemVerge Elastic Spark Solution
[Diagram labels: Ethernet Switch; Data Source; RDD Caching and Storage; Shuffle Data]
14 .A PMEM Centric Data Platform
• MemVerge Spark Adaptors
• MemVerge DMO Cluster
• Shared Persistent Memory
• Node 1, Node 2, Node 3, Node 4 … Node N, each with DRAM and PMEM
15 .Spark Integration
• Data paths served by MemVerge DMO: Data Source, RDD Caching and Storage, Shuffle Data
• Integration points: Spark with additional RDD persist APIs; Hadoop compatible storage APIs; a new generic shuffle manager
16 .DCPMM Equipped Shuffle Service
17 .Shuffle & Block Manager
• The block manager persists data to memory or disk on the local node.
• Losing an executor means recomputing the whole shuffle task.
• The storage and network implementation is coupled with the shuffle implementation.
[Diagram: Compute Node > Spark Executor > Shuffle Manager (persist & retrieve data, shuffle output) > Block Manager > Memory Store, Disk Store > Local Disk]
18 .The Problems of Current Shuffle
• Poor elasticity
 – The failure of a node leads to the loss of shuffle data
• Heavy overhead on the NodeManager
 – The shuffle service coexists with the NM and puts heavy overhead on it under heavy workloads
• Unsuitable for cloud environments
 – Local shuffle gains nothing from a data/computation separation architecture
• The community is also working on these problems:
 – SPARK-25299 Use remote storage for persisting shuffle data
 – SPARK-26268 Decouple shuffle data from Spark deployment
19 .MemVerge Splash Shuffle Manager
• A flexible shuffle manager
 – Supports user-defined storage backend and network transport for shuffle data
• Open source
 – https://github.com/MemVerge/splash
• Spark JIRA: SPARK-25299
20 .Splash Shuffle Manager
• Create a new shuffle manager that implements the shuffle manager interface
• Extract the storage and network implementations into the storage plugin interface
• Apply different plugins for different storage & network
• Separate storage and compute
• Tolerate node failure
• Support dynamic allocation
[Diagram: Worker 1 (Executor 1 > Splash > Storage Plugin) and Worker 2 (Executor 2 > Splash > Storage Plugin) read and write shuffle data against a Storage System (NFS, local FS, HDFS, S3, DMO …)]
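The plugin idea above can be illustrated with a minimal sketch. This is not the Splash API: every class and method name below (`ShuffleStorage`, `SharedDirStorage`, `ShuffleManager`, and so on) is hypothetical, and a temporary directory stands in for a shared storage system such as NFS, HDFS, S3 or DMO. The point is only that the shuffle logic talks to an interface, so shuffle output written through a shared backend is still readable after the writing executor is gone.

```python
import os
import tempfile
from abc import ABC, abstractmethod


class ShuffleStorage(ABC):
    """Hypothetical storage plugin interface (not the real Splash API)."""

    @abstractmethod
    def write(self, shuffle_id: int, map_id: int, data: bytes) -> None: ...

    @abstractmethod
    def read(self, shuffle_id: int, map_id: int) -> bytes: ...


class InMemoryStorage(ShuffleStorage):
    """Executor-local store: its contents vanish together with the executor."""

    def __init__(self):
        self._blocks = {}

    def write(self, shuffle_id, map_id, data):
        self._blocks[(shuffle_id, map_id)] = data

    def read(self, shuffle_id, map_id):
        return self._blocks[(shuffle_id, map_id)]


class SharedDirStorage(ShuffleStorage):
    """Store backed by a directory that stands in for a shared storage system."""

    def __init__(self, root: str):
        self._root = root

    def _path(self, shuffle_id, map_id):
        return os.path.join(self._root, f"shuffle_{shuffle_id}_{map_id}.data")

    def write(self, shuffle_id, map_id, data):
        with open(self._path(shuffle_id, map_id), "wb") as f:
            f.write(data)

    def read(self, shuffle_id, map_id):
        with open(self._path(shuffle_id, map_id), "rb") as f:
            return f.read()


class ShuffleManager:
    """Shuffle logic that only depends on the plugin interface."""

    def __init__(self, storage: ShuffleStorage):
        self._storage = storage

    def write_map_output(self, shuffle_id, map_id, records):
        payload = "\n".join(records).encode("utf-8")
        self._storage.write(shuffle_id, map_id, payload)

    def read_map_output(self, shuffle_id, map_id):
        return self._storage.read(shuffle_id, map_id).decode("utf-8").split("\n")


def demo():
    """Write on one 'executor', drop it, then read from a fresh one."""
    with tempfile.TemporaryDirectory() as shared_root:
        writer = ShuffleManager(SharedDirStorage(shared_root))
        writer.write_map_output(0, 0, ["a,1", "b,2"])
        del writer  # simulate losing the executor that produced the output
        reader = ShuffleManager(SharedDirStorage(shared_root))
        return reader.read_map_output(0, 0)
```

Swapping `SharedDirStorage` for `InMemoryStorage` leaves `ShuffleManager` untouched, but a fresh reader would then find nothing to read, which mirrors the recompute-on-executor-loss behaviour of local shuffle.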
21 .Persist Shuffle Data to PMEM
• Distributed Memory Object (DMO) is a distributed file system built on PMEM.
• The storage plugin allows us to persist data into the DMO system, a separate storage cluster.
• The use of PMEM and fast network technologies (RDMA or DPDK) in the storage cluster speeds up the shuffle.
[Diagram: Shuffle Manager interface implemented by Splash Shuffle Manager; Storage Plugin interface implemented by DMO Plugin; DMO System on Persistent Memory]
22 .Benchmark Settings
• Common
 – 4 compute nodes
 – 10GbE network
 – Driver memory 4g
 – Executor memory 6g
 – Total cores 160
 – Executor cores 4
• Baseline
 – 4 HDD
 – 7200 RPM
• DMO
 – 2 storage nodes
 – 400GB PMEM/node
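The common settings imply an executor layout that the slide does not spell out. The sketch below derives it under two assumptions: that "Total cores 160" is the total across all executors, and that executors are spread evenly over the 4 compute nodes. The helper name is illustrative.

```python
def executor_layout(total_cores, executor_cores, executor_memory_gb, compute_nodes):
    """Derive executor count and memory footprint from the benchmark settings."""
    executors, remainder = divmod(total_cores, executor_cores)
    if remainder:
        raise ValueError("total cores must be a multiple of executor cores")
    return {
        "executors": executors,
        "total_executor_memory_gb": executors * executor_memory_gb,
        "executors_per_node": executors / compute_nodes,  # assumes an even spread
    }


if __name__ == "__main__":
    # Values from the slide: 160 total cores, 4 cores and 6g per executor, 4 nodes.
    print(executor_layout(160, 4, 6, 4))
```

With the slide's values this gives 40 executors and 240 GB of executor heap in total, or 10 executors per compute node under the even-spread assumption.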
23 .TeraSort Results
[Chart: TeraSort 400GB, 216G shuffle write. Map stage and reduce stage durations (min) for Baseline, DMO with UDP, and DMO with DPDK. Data labels: 22, 11, 10, 9.2, 6, 6.5.]
24 .TPC-DS Results
[Chart: TPC-DS 1.2TB, duration (s) by query ID, Baseline vs. DMO. Queries shown: 78, 4, 64, 24a, 24b, 80, 23a, 23b, 25, 17, 29, 11, 93, 74, 50, 16, 40.]
25 .TPC-DS Results, cont.
[Charts: four panels (TPC-DS Query 80, Query 4, Query 23, Query 24), each comparing Baseline and DMO at 400GB, 800GB and 1200GB, with a secondary "Percent" axis.]
26 .Benchmark in Cloud
• 4 Sparkling hosts:
 – 32 core
 – 128GB DRAM
 – 50GB cloud system disk
 – 4*11TB SATA data disk
 – 10G network
• 3 DMO hosts:
 – 32 core
 – 256GB DRAM (200GB PMEM)
 – 50GB cloud system disk
 – 10G network
• Spark Conf
 – Driver memory 4G
 – Executor memory 30G
 – Executor memory overhead 2G
 – Executor cores 4
 – Executor instances 12
• TPC-DS
 – Data size 1TB
 – The 10 queries that have the biggest shuffle data
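For reference, the Spark settings listed above map onto standard Spark configuration keys roughly as follows. This is a sketch, not the authors' actual job configuration: the slide does not state the Spark version (the key `spark.executor.memoryOverhead` applies to Spark 2.3 and later), and the helper `to_submit_args` is a hypothetical convenience for rendering `spark-submit` flags.

```python
# Values are taken from the slide; the mapping to keys is an assumption.
CLOUD_BENCHMARK_CONF = {
    "spark.driver.memory": "4g",
    "spark.executor.memory": "30g",
    "spark.executor.memoryOverhead": "2g",  # key name used by Spark 2.3+
    "spark.executor.cores": "4",
    "spark.executor.instances": "12",
}


def to_submit_args(conf):
    """Render a conf dict as a flat list of spark-submit arguments."""
    args = []
    for key, value in conf.items():
        args.extend(["--conf", f"{key}={value}"])
    return args


if __name__ == "__main__":
    print(" ".join(["spark-submit"] + to_submit_args(CLOUD_BENCHMARK_CONF)))
```

The shuffle manager setting itself is deliberately left out, since the slides do not give the class name to configure.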
27 .TPC-DS Results in Cloud
[Chart: duration for Baseline vs. DMO, with a secondary "Percent" axis, for queries q78, q64, q23a, q24a, q23b, q80, q24b, q4, q25, q17.]
28 .Future Work
• Verify the solution in a production environment
• Tune data path performance for the Splash shuffle manager
• Enable map-side merge in the Splash shuffle manager
• Support a Java NIO style I/O interface in the Splash shuffle manager
• Land the Splash shuffle service in cloud environments
29 .Summary
• PMEM will bring fundamental changes to ALL data centers and enable a data-driven future
• MemVerge and Tencent Cloud deliver better scalability and performance at a lower cost, and not just for Spark
 – AI, Big Data, Banking, Animation Studios, Gaming Industry, IoT, etc.
 – Machine learning, analytics, and online systems
• Thank you, Intel, for supporting our work!