A Feature Engineering In-Memory Database Based on Intel® Optane™ Persistent Memory
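Today, more and more enterprises recognize the important role AI plays in business operations and decision-making, and AI has entered a period of rapid real-world adoption. As a key component of deployed AI, ultra-high-dimensional online prediction systems score business data in real time, using ultra-high-dimensional features extracted in real time together with pre-trained models. They are therefore widely used in online real-time inference scenarios such as fraudulent transaction detection and personalized recommendation. To support high-performance real-time feature access, the industry has produced many real-time in-memory databases. However, as businesses keep expanding and data volumes grow exponentially, the latent drawbacks and risks of real-time in-memory databases make it hard for them to meet growing hard real-time business requirements efficiently and at low cost. This talk presents a paper recently accepted by VLDB. Aiming to address the business needs and pain points of online prediction systems, the work offers innovative designs and comprehensive optimizations in two areas: how to design the underlying database component so that it efficiently supports online prediction systems with trillion-dimensional sparse features, and how to use Intel® Optane™ persistent memory to further resolve the pain points in business and system design.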
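Yang Jun (杨俊), Ph.D., received his bachelor's degree from the Department of Computer Science at Shanghai Jiao Tong University and then pursued a Ph.D. in computer science and technology at the Hong Kong University of Science and Technology under Professor Luo Qiong, researching performance optimization of new storage media (Flash SSD, etc.) in the database field. After graduating, he joined the Data Storage Institute of Singapore's Agency for Science, Technology and Research, where he continued research and system optimization work on new storage media (PMEM) and published several research papers in top international conferences and journals (FAST, TC …). He now works in 4Paradigm's high performance computing department in Singapore, optimizing the performance of the entire machine learning pipeline from the perspective of storage systems.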
1. A Feature Database Based on Non-Volatile Memory. Yang Jun (杨俊), High Performance Computing Team, Singapore. 18-Mar-21. 4Paradigm (Beijing) Technology Co., Ltd. Copyright ©2020 4Paradigm All Rights Reserved.
2. Published in VLDB 2021
3. Big Data + high-dimensional features achieve excellent accuracy
[Figure: target users vs. other users, comparing the traditional system with the AI-based system]
Small Data + Low-dimensional Features: AUC 73%
Big Data + Low-dimensional Features: AUC 67%
Big Data + High-dimensional Features: AUC 92%
4. AI Workflow
[Diagram with two phases, Off-line Training and On-line Inference. Labels: History Data, Data Warehouse, Transactional Database, Structured Data, Structured Raw Data, Feature Engineering, Selected Features Matrix, Train / Validate / Test, Trained Model, Applications, Original Record, In-memory Database (FEDB), Feature Extraction, Features Vector, Get Parameter, Prediction / Score, Response]
5. What are features?
A New Transaction / Purchasing Record:
Card ID    | Date     | Time  | Amount | Types of Currency | POS Info   | ...
9527xxxxxx | 20200702 | 13:26 | 124.8  | USD               | 8880xxxxxx | ...
6. What are features?
A New Transaction / Purchasing Record (the same record as on slide 5: Card ID, Date, Time, Amount, Types of Currency, POS Info, ...)
Basic Features (hundreds of):
• Card Info: Card Level, Activation Date, Payment Due Date, ...
• Shop Info: Shop ID, Type, Location, City, Country, ...
• Account Info: Card Number, Current Balance, Credit Limit Amount, Available Credit, ...
Real-time Features (thousands of):
Pattern of the Transaction Time:
• The top 3 times of day at which transactions happened in the last 1/3/5 days and 1/2/3/4 weeks
• The top 3 transaction amounts in the last 10 s and 1/5/10 mins
• The difference in amount from the last transaction
• ...
Pattern of Visited Shops:
• The top 3 shops that appear most frequently in the last 10 s and 1/5/10 mins
• The top 3 shop types that appear most frequently in the last 10 s and 1/5/10 mins
• ...
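The real-time features on this slide are aggregates over sliding time windows of a card's transaction history. The following is a minimal Python sketch of three of them, not FEDB's implementation: the `Txn` record, its field names, the helper functions, and the sample data are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Txn:
    """One card transaction; the field names are hypothetical."""
    ts: int         # seconds since some epoch
    shop_id: str
    amount: float


def top_shops(history, now, window_s, k=3):
    """Return the k most frequent shop IDs seen in (now - window_s, now]."""
    counts = Counter(t.shop_id for t in history if now - window_s < t.ts <= now)
    return [shop for shop, _ in counts.most_common(k)]


def top_amounts(history, now, window_s, k=3):
    """Return the k largest transaction amounts in (now - window_s, now]."""
    amounts = [t.amount for t in history if now - window_s < t.ts <= now]
    return sorted(amounts, reverse=True)[:k]


def amount_delta(history, new_txn):
    """Difference between the new amount and the most recent earlier one."""
    earlier = [t for t in history if t.ts < new_txn.ts]
    if not earlier:
        return None
    return new_txn.amount - max(earlier, key=lambda t: t.ts).amount


# Made-up history for one card, followed by the incoming transaction.
history = [
    Txn(ts=1100, shop_id="A", amount=20.0),
    Txn(ts=1290, shop_id="B", amount=75.5),
    Txn(ts=1295, shop_id="B", amount=12.0),
    Txn(ts=1299, shop_id="C", amount=40.0),
]
new_txn = Txn(ts=1300, shop_id="B", amount=124.8)

features = {
    "top_shops_10s": top_shops(history, new_txn.ts, 10),
    "top_shops_5min": top_shops(history, new_txn.ts, 300),
    "top_amounts_10s": top_amounts(history, new_txn.ts, 10),
    "amount_delta": amount_delta(history, new_txn),
}
```

Each helper rescans the whole history, which is fine for a sketch. With thousands of such features per request, a production system would need an index ordered by key and time so that each window becomes a short range scan.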
7. Time constraints for on-line inference
[Diagram: On-line Inference path. Labels: Applications, Original Record, Structured Raw Data, In-memory Database (FEDB), Feature Extraction, Features Vector, Get Parameter, Prediction / Score, Response]
Time budget shown: << 50 milliseconds
8. 4Paradigm's Feature Engineering Database (FEDB)
9. 4Paradigm's Feature Engineering Database (FEDB): 37X ~ 610X faster
10. Limitation of DRAM-based FEDB
Pain points: Huge Memory Consumption; Long Recovery Time
[Diagram: On-line Inference path, with the In-memory Database (FEDB) annotated "~10 TB DRAM"]
11. Intel® Optane™ DC Persistent Memory (PMEM)
Capacity: DRAM: 4GB ~ 128GB; PMEM: 128GB ~ 512GB
12. Intel® Optane™ DC Persistent Memory (PMEM)
Memory Mode: Applications; DRAM as cache; Intel® Optane™ DC used as memory
App Direct Mode: Applications; User Space: MMAP, PMDK; Kernel Space: Persistent Memory Aware File System; Intel® Optane™ DC used as a persistent region
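In App Direct Mode, the application maps a file from a persistent-memory-aware file system and then reads and writes it with ordinary memory accesses. The Python sketch below only illustrates that access pattern, under stated assumptions. It maps a temporary file so that it runs on any machine, whereas on a PMEM system the file would live on a DAX-mounted file system. The file name and record layout are hypothetical. `mmap.flush()` stands in for the finer-grained persistence primitives that a library such as PMDK provides to C/C++ programs.

```python
import mmap
import os
import struct
import tempfile

REGION_SIZE = 4096  # one page; a real region would be far larger

# Hypothetical location: on a PMEM machine this would be a file on a
# DAX-mounted file system; a temp file keeps the sketch runnable anywhere.
path = os.path.join(tempfile.mkdtemp(), "fedb_region.bin")

# Create and size the backing file, then map it into the address space.
with open(path, "w+b") as f:
    f.truncate(REGION_SIZE)
    with mmap.mmap(f.fileno(), REGION_SIZE) as region:
        # Store a record with plain memory writes (no write() system call).
        struct.pack_into("<qd", region, 0, 9527, 124.8)
        # Ask the OS to make the stores durable before unmapping.
        region.flush()

# Simulated restart: map the same file again and read the record in place.
with open(path, "r+b") as f:
    with mmap.mmap(f.fileno(), REGION_SIZE) as region:
        card_id, amount = struct.unpack_from("<qd", region, 0)
```

The second mapping finds the data already in place, with no log replay or reload from disk. This is the property that makes App Direct Mode attractive for shortening recovery time.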
13. Different ways of using PMEM
14. The double-layer (persistent) skip list used by FEDB
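The slide names the data structure but its figure did not survive extraction. The Python sketch below therefore rests on an assumed layout: a first-level skip list ordered by key (for example a card ID), where each entry points to a second-level skip list ordered by timestamp, newest first. It is an in-memory illustration only. It does not show persistence, concurrency, or FEDB's actual node format, and the class and method names are hypothetical.

```python
import random


class SkipList:
    """Minimal sorted map backed by a skip list (keys must be comparable)."""
    MAX_LEVEL = 8

    def __init__(self, seed=0):
        self._rng = random.Random(seed)
        # Each node is [key, value, forward_pointers]; the head is a sentinel.
        self._head = [None, None, [None] * self.MAX_LEVEL]
        self._level = 1

    def _random_level(self):
        level = 1
        while level < self.MAX_LEVEL and self._rng.random() < 0.5:
            level += 1
        return level

    def _predecessors(self, key):
        """Return, for every level, the last node whose key is < key."""
        update = [self._head] * self.MAX_LEVEL
        node = self._head
        for lvl in range(self._level - 1, -1, -1):
            while node[2][lvl] is not None and node[2][lvl][0] < key:
                node = node[2][lvl]
            update[lvl] = node
        return update

    def get(self, key):
        nxt = self._predecessors(key)[0][2][0]
        return nxt[1] if nxt is not None and nxt[0] == key else None

    def insert(self, key, value):
        update = self._predecessors(key)
        nxt = update[0][2][0]
        if nxt is not None and nxt[0] == key:
            nxt[1] = value  # overwrite an existing key
            return
        level = self._random_level()
        self._level = max(self._level, level)
        node = [key, value, [None] * level]
        for lvl in range(level):
            node[2][lvl] = update[lvl][2][lvl]
            update[lvl][2][lvl] = node

    def items(self):
        node = self._head[2][0]
        while node is not None:
            yield node[0], node[1]
            node = node[2][0]


class TwoLevelIndex:
    """Assumed layout: key -> (timestamp -> row), both levels skip lists."""

    def __init__(self):
        self._keys = SkipList()

    def put(self, key, ts, row):
        series = self._keys.get(key)
        if series is None:
            series = SkipList()
            self._keys.insert(key, series)
        series.insert(-ts, row)  # negated so the newest entry sorts first

    def scan(self, key, since_ts):
        """Return (ts, row) pairs for key with ts >= since_ts, newest first."""
        series = self._keys.get(key)
        out = []
        for neg_ts, row in (series.items() if series is not None else ()):
            if -neg_ts < since_ts:
                break  # everything after this point is older
            out.append((-neg_ts, row))
        return out


index = TwoLevelIndex()
index.put("9527", 1290, "B")
index.put("9527", 1000, "A")
index.put("9527", 1299, "C")
index.put("8880", 1295, "D")
recent = index.scan("9527", since_ts=1200)
```

Keeping each key's rows ordered by time turns a window query such as "the last 5 minutes for this card" into a short scan from the head of the second-level list that stops at the first row that is too old.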
15. Long Tail Latency: reduces long-tail latency by ~20%
16. Total Cost of Ownership: saves 58.4% of total cost
17. Recovery Time: reduces recovery time by 99.7%
18. Publication & Open Source
VLDB 2021: Optimizing In-memory Database Engine for AI-powered On-line Decision Augmentation Using Persistent Memory. http://vldb.org/pvldb/vol14/p799-chen.pdf
PMEM Data Structure Git: https://github.com/4paradigm/pmemstore
SparkSQL with FEDB Git: https://github.com/4paradigm/SparkSQLWithFeDB
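19. Thanks.
Business inquiries: business@4paradigm.com; TEL: 400-179-1188
Beijing headquarters: Floors 7-13, China Resources Wucai City Office Building, Haidian District, Beijing
Shanghai headquarters: 15/F, New Century Office Center, 1111 Pudong South Road, Pudong New Area, Shanghai
Shenzhen headquarters: Unit 10A, 10/F, Maoye Times, Haide 2nd Road, Wenxin 2nd Road, Cultural Center Area, Nanshan District, Shenzhen
Singapore headquarters: Fourth Paradigm Southeast Asia PTE LTD, 1 Fusionopolis Place, #03-20 Galaxis (West Lobby), Singapore, 138522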