Applying Persistent Memory Technology in Real-Time Decision Systems

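Real-time decision systems, such as real-time recommendation and real-time anti-fraud, must process user-generated data within a very short time and fetch a trained ultra-high-dimensional model to score and predict. This process involves several steps, including real-time feature extraction and lookup of the ultra-high-dimensional model's parameters. To meet the strict real-time requirement, the decision system has to keep both the data used for feature extraction and the model parameters in DRAM. In actual deployments, we found that this in-memory data can reach the 10 TB level or even higher. With the high capacity and persistence of Intel Optane DC Persistent Memory (PMEM), the cost of storing a real-time decision system's in-memory data can be reduced substantially, and the recovery time that traditional DRAM-based in-memory systems need after a node failure can be shortened greatly. We will present the results of using PMDK to optimize two core technologies developed in-house at 4Paradigm: the real-time feature extraction database RTIDB and the high-dimensional parameter server cluster.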


1.Applying Persistent Memory Technology in Real-Time Decision Systems
Chen Cheng (陈宬), High Performance Computing Team, Singapore
21-Sep-20
4Paradigm (Beijing) Technology Co., Ltd.
Copyright ©2020 4Paradigm All Rights Reserved.

2.About 4Paradigm
A Leading Provider of AI Technologies and Services
4Paradigm continuously develops innovative data science applications and applies them in various industries. Based on machine learning technologies and rich experience in industrial applications, 4Paradigm products can mine and predict data more accurately and reveal the hidden rules behind the data, helping enterprises achieve intelligent transformation and improve operational efficiency. 4Paradigm has helped 8000+ customers in 12000+ scenarios to complete AI transformation.
Corporate vision: AI for Everyone
Corporate culture: Integrity and Innovation
4Paradigm is headquartered in Beijing, with offices in Shanghai, Shenzhen, Hong Kong, and Singapore.

3.About Me
• Focus on optimizing operating systems, database systems, and machine learning systems with new hardware such as non-volatile memory.
• Granted 2 US patents and 2 PCT patents. Published 20+ papers at top-tier conferences and journals (500+ citations).

4.Big Data + high-dimensional features achieve excellent accuracy
[Figure: target users vs. other users under three settings, labeled "The Traditional System" and "AI-based System": Small Data + Low-dimensional Features, Big Data + Low-dimensional Features, Big Data + High-dimensional Features. AUC values in slide order: 73%, 67%, 92%.]

5.AI Workflow & Online Parameter Server
[Diagram: applications (Finance, Media, Retail, Medical, Internet, Energy) send requests and receive a prediction / score as the response.
Off-line Training: history data in the data warehouse goes through feature engineering, then train, validate, and test, producing a trained model that is deployed online; feedback flows back into training.
On-line Inference: structured raw data from the transactional database becomes a structured record; feature extraction against the in-memory database (RTIDB) yields the selected features vector; the Parameter Server returns the model's high-dimensional parameters (get parameter).]

6.On-line Parameter Server
Model Characteristics
• Billion dimensions of features
• Data Replication (for high availability and throughput)
[Diagram: users get parameters from the On-line Parameter Server Cluster. PS Node 1, PS Node 2, and PS Node 3 each hold a subset of Model 1's storages (Storage 1 to Storage 5, ...) in DRAM, with each storage replicated on more than one node.]
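
The replication idea on this slide can be sketched roughly as follows. This is an illustration only: the round-robin placement, the node names, and the helper functions `place_replicas` and `pick_node` are assumptions, not the placement policy the 4Paradigm cluster actually uses.

```python
from typing import Dict, List, Set


def place_replicas(num_storages: int, nodes: List[str], replicas: int) -> Dict[int, List[str]]:
    """Assign each storage shard to `replicas` distinct nodes, round-robin (assumed policy)."""
    if replicas > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    return {
        sid: [nodes[(sid + k) % len(nodes)] for k in range(replicas)]
        for sid in range(num_storages)
    }


def pick_node(placement: Dict[int, List[str]], storage_id: int, down: Set[str]) -> str:
    """Return the first live replica for a storage; a read fails only if every replica is down."""
    for node in placement[storage_id]:
        if node not in down:
            return node
    raise RuntimeError(f"no live replica for storage {storage_id}")


if __name__ == "__main__":
    placement = place_replicas(5, ["ps1", "ps2", "ps3"], replicas=2)
    # With ps1 down, reads for storage 0 fall back to its second replica.
    print(pick_node(placement, 0, down={"ps1"}))
```

Holding every storage on more than one node is what gives both properties named on the slide: reads can be spread over the replicas, and a single node failure does not make a storage unavailable.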

7.On-line Parameter Server
Pain Point
• Huge Memory Consumption
• Long Recovery Time
[Diagram: users get parameters from the On-line Parameter Server Cluster. PS Node 1, PS Node 2, and PS Node 3 each hold replicated storages of Model 1 in DRAM.]

9.Intel® Optane™ DC Persistent Memory (PMEM)
Capacity
• DRAM: 4GB ~ 128GB
• PMEM: 128GB ~ 512GB

10.Intel® Optane™ DC Persistent Memory (PMEM)
[Diagram: two operating modes.
Memory Mode: applications use the Intel® Optane™ DC memory as main memory, with DRAM as cache.
App Direct Mode: applications in user space reach the persistent region through MMAP or PMDK, on top of a persistent-memory-aware file system in kernel space.]
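
The memory-mapped access path of App Direct Mode can be illustrated with an ordinary file. This is a minimal sketch under stated assumptions: the file below lives on a normal file system rather than a persistent-memory-aware one, `POOL_SIZE` and the helper names are hypothetical, and `flush()` stands in for the persistence calls that PMDK would issue on real PMEM.

```python
import mmap
import os
import struct
import tempfile

POOL_SIZE = 4096  # hypothetical pool size in bytes


def open_pool(path: str) -> mmap.mmap:
    """Map a file into the address space, creating and sizing it if needed."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        if os.fstat(fd).st_size < POOL_SIZE:
            os.ftruncate(fd, POOL_SIZE)
        return mmap.mmap(fd, POOL_SIZE)
    finally:
        os.close(fd)  # the mapping stays valid after the descriptor is closed


def store_value(pool: mmap.mmap, slot: int, value: float) -> None:
    """Write one 8-byte float with a plain store into the mapping, then flush it."""
    struct.pack_into("<d", pool, slot * 8, value)
    pool.flush()


def load_value(pool: mmap.mmap, slot: int) -> float:
    """Read one 8-byte float directly from the mapping."""
    return struct.unpack_from("<d", pool, slot * 8)[0]


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "pool.bin")
    pool = open_pool(path)
    store_value(pool, 3, 0.25)
    pool.close()
    # Reopening the same file finds the value again without any reload step.
    reopened = open_pool(path)
    print(load_value(reopened, 3))
    reopened.close()
```

The point of the sketch is that data written through the mapping is still there when the file is mapped again, which is the property the recovery-time slides rely on.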

11.Apply PMEM Memory Mode on Parameter Server
Pain Point
• Huge Memory Consumption
• Long Recovery Time
[Diagram: the same On-line Parameter Server Cluster, with the DRAM of PS Node 1, PS Node 2, and PS Node 3 replaced by PMEM in Memory Mode; each node still holds replicated storages of Model 1.]

12.Apply App Direct Mode on Parameter Server
Pain Point
• Huge Memory Consumption
• Long Recovery Time
[Diagram: the same On-line Parameter Server Cluster, with PS Node 1, PS Node 2, and PS Node 3 each holding replicated storages of Model 1 in PMEM in App Direct Mode.]

13.Propose PMEM-based Parameter Server: HyperPS
Problems to Solve:
a. How to organize data inside the PMEM.
b. A new recovery procedure after node failure.

14.HyperPS Storage Architecture
[Diagram: two-level layout.
Level 1: Storage ID List (Storage 0, Storage 3, ..., Storage N).
Level 2: each Storage X is split into Sub 0, Sub 1, ..., Sub N; each Sub is a PMEM-based HashMap that maps parameters (Para 0, Para 1, ..., Para N) to their values.]
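
The two-level layout can be sketched as follows. This is not the HyperPS code: plain Python dicts stand in for the PMEM-based hash maps, and the class names and the modulo rule for choosing a Sub are assumptions made for the illustration.

```python
from typing import Dict, Iterable, List, Optional


class Storage:
    """Level 2: one storage split into sub-maps; each dict stands in for a PMEM-based HashMap."""

    def __init__(self, num_subs: int) -> None:
        self.subs: List[Dict[int, float]] = [{} for _ in range(num_subs)]

    def _sub_for(self, param_id: int) -> Dict[int, float]:
        # Assumed rule: pick the sub-map by parameter ID modulo the number of subs.
        return self.subs[param_id % len(self.subs)]

    def put(self, param_id: int, value: float) -> None:
        self._sub_for(param_id)[param_id] = value

    def get(self, param_id: int) -> Optional[float]:
        return self._sub_for(param_id).get(param_id)


class ParameterStore:
    """Level 1: the storage ID list, mapping each storage ID to its Storage."""

    def __init__(self, storage_ids: Iterable[int], num_subs: int) -> None:
        self.storages: Dict[int, Storage] = {sid: Storage(num_subs) for sid in storage_ids}

    def put(self, storage_id: int, param_id: int, value: float) -> None:
        self.storages[storage_id].put(param_id, value)

    def get(self, storage_id: int, param_id: int) -> Optional[float]:
        return self.storages[storage_id].get(param_id)


if __name__ == "__main__":
    store = ParameterStore(storage_ids=[0, 3], num_subs=4)
    store.put(3, 42, 0.5)
    print(store.get(3, 42))
```

A lookup therefore takes two steps: find the storage in the level-1 list, then find the parameter in one of that storage's sub-maps.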

15.Environment Setup
• Hardware
• Software
  - PMDK 1.8
  - libpmemobj-cpp 1.10
• Benchmark tool
  - In-house benchmarking tool based on JMeter
  - The results shown are the end-to-end performance of prediction, not the performance of the parameter server alone

16.TCO (Total Cost of Ownership)
• Hardware Options
  - DRAM Server: 384 GB DRAM (12 x 32 GB DDR4 DRAM)
  - HyperPS Server: 2 TB PMEM (8 x 256 GB PMEM)
• On-line Model Size: ~128 GB

Number of models          1     10     50     80    100
DRAM (GB)               128   1280   6400  10240  12800
PMEM (GB)               204   2040  10200  16320  20400
Num of DRAM Server        1      4     17     27     34
Num of HyperPS Server     1      1      5      8     10
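
The server counts in the table are consistent with a simple capacity calculation: total model size divided by per-server memory, rounded up. The sketch below reproduces them under two assumptions that the slide does not state explicitly: 2 TB is taken as 2048 GB, and a server's whole memory capacity is usable for models.

```python
DRAM_SERVER_GB = 12 * 32   # 384 GB per DRAM server
PMEM_SERVER_GB = 8 * 256   # 2048 GB per HyperPS server (assumes 2 TB = 2048 GB)
DRAM_PER_MODEL_GB = 128    # per-model footprint in DRAM, from the table
PMEM_PER_MODEL_GB = 204    # per-model footprint in PMEM, from the table


def servers_needed(num_models: int, per_model_gb: int, server_gb: int) -> int:
    """Smallest number of servers whose combined memory holds all models (integer ceiling)."""
    total_gb = num_models * per_model_gb
    return (total_gb + server_gb - 1) // server_gb


if __name__ == "__main__":
    for n in (1, 10, 50, 80, 100):
        dram = servers_needed(n, DRAM_PER_MODEL_GB, DRAM_SERVER_GB)
        pmem = servers_needed(n, PMEM_PER_MODEL_GB, PMEM_SERVER_GB)
        print(n, dram, pmem)
```

For 100 models this gives 34 DRAM servers against 10 HyperPS servers, matching the last column of the table.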

17.Recovery Time (1 model, ~60 million features)
[Bar chart: Single Model Recovery Time (Seconds), DRAM Server vs. HyperPS PMEM Server.]
HyperPS PMEM Server is ≈ 2000X faster.

18.Recovery Time (20 models, ~12 billion features)
[Bar chart: 20 Models Recovery Time (Seconds), DRAM Server vs. HyperPS PMEM Server.]
HyperPS PMEM Server is ≈ 17114X faster.

19.PMEM-based PS Performance
• Latency
  - Gap between HyperPS PMEM Server & DRAM Server: ~1 ms
• Throughput
  - Gap between HyperPS PMEM Server & DRAM Server: < 3.5%

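20.Thanks.
Business inquiries: business@4paradigm.com
TEL: 400-179-1188
Beijing headquarters: Floors 7-13, Office Building, Huarun Wucaicheng (华润五彩城), Haidian District, Beijing
Shanghai headquarters: 15F, New Century Office Center (新世纪办公中心), 1111 Pudong South Road, Pudong New Area, Shanghai
Shenzhen headquarters: Unit 10A, 10F, Maoye Times (茂业时代), Haide 2nd Road, Wenxin 2nd Road, Cultural Center Area, Nanshan District, Shenzhen
Singapore headquarters: Fourth Paradigm Southeast Asia PTE LTD, 1 Fusionopolis Place, #03-20 Galaxis (West Lobby), Singapore, 138522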