Data volumes are growing rapidly in the big data area, and analytic jobs consume more and more memory, both for computation and for holding intermediate data. For memory-intensive workloads, users have to scale out the compute cluster or extend memory with storage such as HDDs or SSDs. Scaling out adds cluster management, operation, and maintenance cost, which raises the total cost if the extra CPU resources are not fully utilized. To address this, Intel Optane DC persistent memory (Optane DCPM) breaks the traditional memory/storage hierarchy and scales up the compute server with higher-capacity persistent memory; it also brings higher bandwidth and lower latency than storage like SSD or HDD. Apache Spark is widely used for analytics such as SQL and machine learning in cloud environments, where slow remote data access is a typical bottleneck, especially for I/O-intensive queries; ML workloads are iterative, so I/O bandwidth is key to end-to-end performance. In this talk, we will introduce how to accelerate Spark SQL on the cloud with OAP (https://github.com/Intel-bigdata/OAP) to achieve an 8X performance gain, and how to improve K-means performance by 2.5X with RDD cache, both leveraging Intel Optane DCPM. We will also take a deep dive into how Optane DCPM enables these performance gains.


1.WIFI SSID:SparkAISummit | Password: UnifiedAnalytics

2.Accelerate your Spark with Intel Optane DC Persistent Memory Cheng Xu, Intel Balcer Piotr, Intel #UnifiedAnalytics #SparkAISummit

3.Notices and Disclaimers © 2018 Intel Corporation. Intel, the Intel logo, 3D XPoint, Optane, Xeon, Xeon logos, and Intel Optane logo are trademarks of Intel Corporation in the U.S. and/or other countries. All products, computer systems, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com. The cost reduction scenarios described are intended to enable you to get a better understanding of how the purchase of a given Intel based product, combined with a number of situation-specific variables, might affect future costs and savings. Circumstances will vary and there may be unaccounted-for costs related to the use and deployment of a given product. Nothing in this document should be interpreted as either a promise of or contract for a given level of costs or cost reduction. The benchmark results reported above may need to be revised as additional testing is conducted. The results depend on the specific platform configurations and workloads utilized in the testing, and may not be applicable to any particular user’s components, computer system or workloads. The results are not necessarily representative of other benchmarks and other benchmark results may show greater or lesser impact from mitigations. Results have been estimated based on tests conducted on pre-production systems, and provided to you for informational purposes. Any differences in your system hardware, software or configuration may affect your actual performance. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance results are based on testing as of 03-14-2019 and may not reflect all publicly available security updates. See configuration disclosure for details. No product can be absolutely secure. 
Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to www.intel.com/benchmarks. Intel processors of the same SKU may vary in frequency or power as a result of natural variability in the production process. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice Revision #20110804. Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit www.intel.com/benchmarks. *Other names and brands may be claimed as the property of others. 3

4. About Us
Intel SSP/DAT – Data Analytics Technology
● Active open source contributions: about 30 committers on our team, with significant contributions to many key big data projects including Spark, Hadoop, HBase, Hive...
● A global engineering team (China, US, India); the majority is based in Shanghai
● Mission: to fully optimize analytics solutions on IA and make Intel the best and easiest platform for big data and analytics customers
Intel NVMS – Non-Volatile Memory Solutions Group
• Tasked with preparing the ground for Intel® Optane™ DC Persistent Memory
• Enabling new and existing software through the Persistent Memory Development Kit (PMDK)
• pmem.io

5. Some Challenges in Spark: Memory? Out of memory; large spills

6.Re-architecting the Memory/Storage Hierarchy

7. Intel® Optane™ DC Persistent Memory - Product Overview (Optane™-based memory module for the data center)
Attached to the integrated memory controllers (IMC) of a Cascade Lake server; DIMM population shown as an example only.
• DIMM capacity: 128, 256, 512GB
• Speed: 2666 MT/sec; DDR4 electrical & physical interface
• Capacity per CPU: 3TB (not including DRAM)
• Close to DRAM latency; cache-line-size access

8. Persistent Memory Operating Modes: Memory Mode & App Direct

9. Intel® Optane™ DC Persistent Memory - Operational Modes
• App Direct Mode: persistent; high availability / less downtime; significantly faster than storage
• Memory Mode: high capacity; affordable; ease of adoption†
Note that a BIOS update will be required before using Intel persistent memory

10. Intel® Optane™ DC Persistent Memory: support for a breadth of applications
• Persistent performance & maximum capacity: the application uses Optane persistent memory alongside DRAM
• Affordable memory capacity for many applications: the application sees a volatile memory pool of Optane persistent memory, with DRAM as cache

11. App Direct Mode Options
Legacy Storage APIs:
• No code changes required
• Operates in blocks like SSD/HDD; traditional read/write
• Works with existing file systems
• Atomicity at block level (BTT); block size configurable (4K, 512B*)
• NVDIMM driver required; support starting with kernel 4.2
• Can be configured as a boot device
• Higher endurance than enterprise SSDs
• High-performance block storage: low latency, higher bandwidth, high IOPs (*requires Linux)
Storage APIs with DAX (App Direct):
• Code changes may be required*
• Bypasses the file system page cache; no kernel code or interrupts
• Requires a DAX-enabled file system (XFS, EXT4, NTFS)
• Load/store access via mmap through a pmem-aware file system and MMU mappings (or raw device access via DevDAX); fastest I/O path possible
• PMDK supports this user-space path
* Code changes are required for load/store direct access if the application does not already support this.

12. Spark DCPMM Optimization Overview
Spark input data: 9 I/O-intensive workloads (from decision support queries) go through the OAP cache; the K-means workload goes through the RDD cache. A cache hit is served from the cache; a cache miss falls through to tiered storage (low-bandwidth I/O storage, e.g. HDD, S3), separating the compute layer from the I/O layer.
OAP DCPMM optimization (against DRAM):
● High capacity, fewer cache misses
● Avoids heavy-cost disk reads
● Status: to be added
RDD cache DCPMM optimization (against DRAM):
● Reduces DRAM footprint
● Higher capacity to cache more data
● Status: to be added

13.OAP I/O Cache Use Case 1: Spark SQL

14. OAP (Optimized Analytics Package)
Goal:
• IO cache is critical for I/O-intensive workloads, especially in low-bandwidth environments (e.g. cloud, on-prem HDD-based systems)
• Make full use of DCPMM's advantages to speed up Spark SQL: high capacity and high throughput; low latency and reduced DRAM footprint; better performance per TCO
Features:
• Fine-grained, columnar cache (e.g. column chunks for Parquet)
• Cache-aware scheduler (V2 API via preferred locations)
• Self-managed DCPMM pool (no extra GC overhead)
• Easy to use (easily turned on/off), transparent to users (no changes to their queries)
https://github.com/Intel-bigdata/OAP

15. Spark DCPMM Full Software Stack
• SQL workload: unmodified SQL
• Spark SQL: unchanged Spark
• OAP: provides the scheduler and fine-grained cache, based on the Data Source API
• Native library (VMEMCACHE): abstracts away hardware details and the cache implementation used to access DCPMM
• PMEM-aware file system: exposes persistent memory as memory-mapped files (DAX)
• Intel® Optane™ DC Persistent Memory Module

16. Deployment Overview
A Spark gateway (e.g. ThriftServer, Spark shell) runs the cache-aware scheduler, which schedules tasks to Spark executors on each server. The executor's cached data source (v1/v2) sits on top of the native library (vmemcache): a cache hit is served from Intel Optane DC Persistent Memory, a cache miss falls through to local storage (HDD).

17.Cache Design - Problem statement • Local LRU cache • Support for large capacities available with persistent memory (many terabytes per server) • Lightweight, efficient and embeddable • In-memory • Scalable

18. Cache Design - Existing solutions
• In-memory databases tend to rely on malloc() in some form for allocating memory for entries – which means allocating anonymous memory
• Persistent memory is exposed by the operating system through normal file-system operations – which means allocating byte-addressable PMEM needs to use file memory mapping (fsdax)
• We could modify the allocator of an existing in-memory database and be done with it, right?

19. Cache Design - Fragmentation
• Manual dynamic memory management à la dlmalloc/jemalloc/tcmalloc/palloc causes fragmentation
• Applications with substantial expected runtime durations need a way to combat this problem:
– Compacting GC (Java, .NET)
– Defragmentation (Redis, Apache Ignite)
– Slab allocation (memcached)
• Especially so if there's substantial expected variety in allocated sizes

20. Cache Design - Extent allocation
• If fragmentation is unavoidable, and defragmentation/compaction is CPU- and memory-bandwidth-intensive, let's embrace it!
• Extent allocation is usually done only in relatively large blocks in file systems.
• But on PMEM, we are no longer restricted by large transfer units (sectors, pages, etc.)

21. Cache Design - Scalable replacement policy
• Performance of libvmemcache was bottlenecked by a naïve implementation of LRU based on a doubly-linked list.
• With 100s of threads, most of the time of any request was spent waiting on a list lock…
• Locking per-node doesn't solve the problem…

22.Cache Design - Buffered LRU • Our solution was quite simple. • We’ve added a wait-free ringbuffer which buffers the list-move operations • This way, the list only needs to get locked during eviction or when the ringbuffer is full.

23. Cache Design - Lightweight, embeddable, in-memory caching

    VMEMcache *cache = vmemcache_new("/tmp", VMEMCACHE_MIN_POOL,
                                     VMEMCACHE_MIN_EXTENT,
                                     VMEMCACHE_REPLACEMENT_LRU);
    const char *key = "foo";
    vmemcache_put(cache, key, strlen(key), "bar", sizeof("bar"));
    char buf[128];
    ssize_t len = vmemcache_get(cache, key, strlen(key),
                                buf, sizeof(buf), 0, NULL);
    vmemcache_delete(cache);

libvmemcache has normal get/put APIs, an optional replacement policy, and a configurable extent size. It handles terabyte-sized in-memory workloads without a sweat, with very high space utilization. It also works on regular DRAM. https://github.com/pmem/vmemcache

24. Cache Design – Status and Fine-Grained Cache
A Parquet file consists of a footer and row groups (RowGroup #1 … #N), each holding column chunks (Column Chunk #1 … #N). The footer is kept in a Guava-based cache; the column chunks are cached fine-grained in vmemcache. Cache status is tracked and reported.

25. Experiments and Configurations
Hardware:
• DRAM: 192GB (12x 16GB DDR4) on the DCPMM system; 768GB (24x 32GB DDR4) on the DRAM system
• Intel Optane DC Persistent Memory: 1TB (QS: 8x 128GB) on the DCPMM system; N/A on the DRAM system
• DCPMM mode: App Direct (memkind); N/A on the DRAM system
• SSD: N/A on both
• CPU: 2x Cascade Lake 8280M (2 threads per core, 28 cores per socket, 2 sockets; CPU max 4000 MHz, min 1000 MHz; L1d 32K, L1i 32K, L2 1024K, L3 39424K)
• OS: 4.20.4-200.fc29.x86_64 (BKC: WW06'19, BIOS: SE5C620.86B.0D.01.0134.100420181737)
Software:
• OAP: 1TB DCPMM-based OAP cache vs. 610GB DRAM-based OAP cache
• Hadoop: 8x HDD (ST1000NX0313); 1-replica, uncompressed, plain-encoded data on Hadoop
• Spark: 1 driver (5GB) + 2 executors (62 cores, 74GB), spark.sql.oap.rowgroup.size=1MB
• JDK: Oracle JDK 1.8.0_161
Workload:
• Data scale: 2TB, 3TB, 4TB
• Decision making queries: 9 I/O-intensive queries
• Multi-tenants: 9 threads (fair scheduled)

26. Test Scenario 1: Both Fit In DRAM And DCPMM - 2TB On-Premise Performance
With a 2TB data scale, both the DCPMM- and the DRAM-based OAP cache can hold the full dataset (613GB). DCPMM performs 24.6% (= 1 - 100.1/132.81) lower than DRAM as measured on the 9 I/O-intensive decision making queries.
For more complete information visit www.intel.com/benchmarks. Configurations: see page 20. Tested by Intel on 24/02/2019.
*The performance gap can differ if IO is improved (e.g. compression & encoding enabled) or some hacking code is fixed (e.g. NUMA scheduler)

27. Test Scenario 2: Fits In DCPMM But Not In DRAM - 3TB On-Premise Performance
With a 3TB data scale, only the DCPMM-based OAP cache can hold the full dataset (920GB). DCPMM shows an 8X* performance gain over DRAM as measured on the 9 I/O-intensive decision making queries.
For more complete information visit www.intel.com/benchmarks. Configurations: see page 20. Tested by Intel on 24/02/2019.
*The performance gain can be much lower if IO is improved (e.g. compression & encoding enabled) or some hacking code is fixed (e.g. NUMA scheduler)

28. Performance Analysis - System Metrics
On DCPMM, the OAP cache avoids disk reads (blue); disk reads happen only from time to time, and disk writes (red) come only from shuffle. On DRAM, disk reads (blue) come from both input and shuffle data.
● Input data is fully cached in DCPMM, but only partially in DRAM
● DCPMM reaches up to 18GB/s bandwidth, while the DRAM case is bounded by disk IO at only about 250MB/s ~ 450MB/s
● The OAP cache does not apply to shuffle (intermediate) data, which puts extra IO pressure on the DRAM case
● DCPMM reads about 27.5GB from disk while DRAM reads about 395.6GB

29. Test Scenario 3: Fits In Neither DRAM Nor DCPMM - 4TB On-Premise Performance
With a 4TB data scale, neither the DCPMM- nor the DRAM-based OAP cache can hold the full dataset (1226.7GB). DCPMM shows a 1.66X* (= 2252.80/1353.67) performance gain over DRAM as measured on the 9 I/O-intensive decision making queries.
For more complete information visit www.intel.com/benchmarks. Configurations: see page 20. Tested by Intel on 24/02/2019.
*The performance gain can be much lower if IO is improved (e.g. compression & encoding enabled) or some hacking code is fixed (e.g. NUMA scheduler)