Challenges and Opportunities for Big Data Processing in the Post-Moore's Law Era


1. Challenges and Opportunities for Big Data Processing in the Post-Moore's Law Era
Xiaodong Zhang, The Ohio State University
2018 China Big Data Technology Conference (BDTC)

2. Three Evolutionary Stages of Computer Systems and Applications
• Computers were built for "computing" (1930s–1990s)
  – 1936–1937: the ABC at Iowa State, Zuse's machines in Berlin (vacuum tubes)
  – 1945: ENIAC at the University of Pennsylvania (vacuum tubes)
  – 1945: the von Neumann model, an architecture of computation, storage, and data movement between the two
  – 1950: the PN junction transistor (Bell Labs)
  – 1958–1962: CMOS integrated circuits (Fairchild and RCA)
  – 1965: Moore's law; 1971: the first CPU chip (Intel 4004)
  – Operating systems, storage systems, compilers, high-performance computing, …
  – The material and physical world was turned into a digital world for fast computation and deep analysis
  – Human society achieved unprecedented scientific breakthroughs: weather forecasting, new materials, …
• Computers were built for "connectivity" (1980s–2010s)
  – The Internet and wireless access are the foundation of a brand-new data world:
    • 1981–2017: bandwidth from 50 Kbps to 100 Pbps (2 million times)
    • 1981–2017: number of devices per user from 0.1 to 10 (100 times)
    • Internet telephony, microblogs, QQ, WeChat, online shopping, online search, …
• Computers are built for "data centers" (since the start of the 21st century)
  – Today's big-data explosion is not a continuation of the existing digitized physical and material world
  – This new data world precisely records and tracks human behavior itself
  – 90% of all data in history was produced in the past two years

3. Moore's Law Is Reaching Its End
– We are close to the final usable size limit: a transistor gate length of 5 nm (2020)
– Minimum cost per transistor has been rising since 32 nm (2010)
– Road map: 22 nm in 2012, 14 nm in 2014, 10 nm in 2017, 7 nm in 2018

4. A Powerful and General-Purpose Ecosystem
• A simple, one-size-fits-all computing abstraction
  – Any computing task is expressed by programs and executed by the ISA
  – A task is divided into a sequence of standard execution time units
  – Under the multiprogramming model, the OS assigns time units to each task
  – Data blocks are moved around in the memory hierarchy
  – Programming is very flexible
• Moore's law has hidden the dark side of this ecosystem
  – Efficiency keeps declining in both power and execution
  – Not inclusive of other modes, e.g. SIMD, dataflow, and domain-specific computing
• As Moore's law ends
  – General-purpose computing needs a lot of external help (within CMOS)
  – Efforts have been made beyond CMOS

5. Limit 1: Kernel-Centric Management
• The OS kernel controls everything; the CPU is frequently interrupted
• In-memory computing and NVM technologies can overload the CPU
• We are in an era of CPU-IO performance inversion, in which the CPU is a bottleneck
Prime minister Li Si advised the First Emperor of Qin: "Now that the Emperor possesses all under heaven, black and white are distinguished and everything is decided by a single supreme authority."

6. Limit 1: Kernel-Centric Management Is Not Sustainable
• The OS kernel controls everything; the CPU is frequently interrupted
• In-memory computing and NVM technologies can overload the CPU
• We are in an era of CPU-IO performance inversion, in which the CPU is a bottleneck

7. (figure-only slide)

8. Relative IO performance has grown faster than CPU performance over the last 10 years, due to SSDs and the ending of Moore's law
(Data collected by IBM Research, 2017)

9. Limit 2: General-Purpose Computing Is Flexible but Not Efficient
(fairness and efficiency are in conflict)
• CPUs and multicores
  – The mainstream environment, very flexible for all applications
  – Performance gains stop with the end of Moore's law
  – Efficiency is increasingly low
• GPUs
  – Efficient for applications with high SIMD parallelism
  – Special programming and low flexibility
• FPGAs
  – Efficient for applications defined by the hardware
  – Very special programming and very low flexibility
• ASICs
  – Very efficient for applications on customized hardware
  – Expert programming with no flexibility

10. A Case for Highly Efficient Specialization
– Cryptocurrency mining is a highly compute-intensive task in the Bitcoin system
– Since 2008, mining has evolved across CPUs, GPUs, FPGAs, and ASICs
– Relative efficiency factors: CPU = 1, multicore = 10, GPU = 100, FPGA = 1,000
– ASICs are the most efficient in throughput and power: 10,000 to 1,000,000

11. Limit 3: Non-Inclusive to Specialized Devices
CPU (general) vs. GPU, FPGA, and ASIC, due to three major differences:
• Programming: e.g. CUDA for GPUs, Verilog for FPGAs, …
• Execution model: e.g. SIMD for GPUs, flexible for FPGAs
• Management: e.g. no OS support for GPUs, FPGAs, …

12. The Post-Moore's Law Ecosystem
• Acceleration from specialized devices is highly necessary
  – Raise speedup and scalability by best utilizing external devices
• In-memory and NVM storage provide fast data accesses
  – Data will be served most of the time from DRAM or fast NVM
• The CPU bottleneck must be reduced
  – Kernel-centric management has to be modified
• The system must be flexible, specialized, and inclusive
  – A comprehensive abstraction and automatic tools are needed

13. Case 1: Maximizing Throughput of Key-Value Stores
• A simple but effective method of data processing in which a data record (a value) is stored and retrieved with its associated key
  – variable types and lengths of records (values)
  – simple or no schema
  – easy software development for many applications
• Key-value stores have been widely used in production systems

14. Key-Value Store: A Data Model Supporting Scale-Out

  Command          Operation
  GET(key)         Read a value
  SET(key, value)  Write a KV item
  DEL(key)         Delete a KV item

Hash(key) → Server ID
• It is horizontally scalable across data servers
• It can potentially deliver (very) high performance
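The Hash(key) → Server ID routing above can be sketched in a few lines. This is a minimal illustration, not any particular system's API: the server names and the in-process dicts standing in for remote data servers are hypothetical.

```python
import hashlib

# Hypothetical data servers; in a real deployment these would be network addresses.
SERVERS = ["server-0", "server-1", "server-2", "server-3"]

def server_for(key: str) -> str:
    """Route a key to a data server: Hash(key) -> Server ID."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return SERVERS[h % len(SERVERS)]

class KVClient:
    """Toy client exposing the GET/SET/DEL commands from the slide."""
    def __init__(self):
        # One in-process dict per "server" stands in for a remote node.
        self.shards = {s: {} for s in SERVERS}

    def set(self, key, value):
        self.shards[server_for(key)][key] = value

    def get(self, key):
        return self.shards[server_for(key)].get(key)

    def delete(self, key):
        self.shards[server_for(key)].pop(key, None)
```

Because every client computes the same hash, any client can locate any key without coordination, which is what makes the model horizontally scalable.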

15. Workflow of an In-Memory Key-Value Store
Network Processing → Memory Management → Index Operations → Access Value

16. Workflow of a Typical Key-Value Store
• Network processing: TCP/IP processing and request parsing; extract the key (GET, DELETE) or the key and value (SET)
• Memory management: allocate memory if it is not full, evict if it is
• Index operations: search in the index (GET), insert into it (SET), or delete from it (DELETE)
• Access value: read and send the value
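The stages above can be sketched as one toy single-node store. The dict-based index, slot-numbered "memory", and FIFO eviction are hypothetical simplifications chosen only to make the four stages visible in code.

```python
class KVServer:
    """Toy single-node store walking the typical KV workflow stages."""
    def __init__(self, capacity=2):
        self.capacity = capacity
        self.index = {}      # index operations: key -> slot in the value store
        self.values = {}     # memory management: slot -> allocated value
        self.next_slot = 0

    def set(self, key, value):
        if key in self.index:
            self.delete(key)                         # free the old slot first
        elif len(self.index) >= self.capacity:
            # Memory full: evict the oldest key (FIFO, a simplification).
            self.delete(next(iter(self.index)))
        slot = self.next_slot
        self.next_slot += 1
        self.values[slot] = value                    # memory management: allocate
        self.index[key] = slot                       # index operation: insert

    def get(self, key):
        slot = self.index.get(key)                   # index operation: search
        return None if slot is None else self.values[slot]  # access value

    def delete(self, key):
        slot = self.index.pop(key, None)             # index operation: delete
        if slot is not None:
            del self.values[slot]                    # memory management: free
```

Separating the index (key → location) from the value store matters later in the talk: the index is where the random memory accesses happen, and it is the part Mega-KV offloads.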

17. Where Does the Time Go in a KV Store? MICA [NSDI'14]
(time breakdown: network processing & memory management, index operations, access value)
Index operations become one of the major bottlenecks

18. Where does the time go in the KV store MICA [NSDI'14]? (time-breakdown figure: network processing & memory management, index operations, access value)

19. Data Workflow of Key-Value Stores
Query → network processing & memory management → index operation (hash table, random memory accesses) → access value

20. Random Memory Accesses of Indexing Are Expensive
(figure: random memory access is several times slower than sequential memory access; measured on an Intel Xeon E5-2650v2 with 1600 MHz DDR3 memory)
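The shape of the experiment behind this slide can be reproduced in a few lines: traverse the same array once in order and once in a shuffled order. Python's interpreter overhead dampens the gap the slide measures on bare hardware, so this is only an illustration of the methodology, not of the slide's numbers.

```python
import random
import time

N = 200_000
data = list(range(N))
seq_order = list(range(N))       # sequential visiting order
rand_order = list(range(N))      # same indices, shuffled
random.shuffle(rand_order)

def walk(order):
    """Sum data[] in the given visiting order, timing the traversal."""
    start = time.perf_counter()
    total = 0
    for i in order:
        total += data[i]
    return total, time.perf_counter() - start

seq_sum, seq_time = walk(seq_order)
rand_sum, rand_time = walk(rand_order)
assert seq_sum == rand_sum  # identical work, only the access pattern differs
```

The two traversals do exactly the same arithmetic; any time difference comes from the memory system, which is the point of the slide.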

21. Reason: Long Latency Comes from Row-Buffer Misses
Processor ↔ bus ↔ row buffer ↔ DRAM core
• Row-buffer hits: repeated accesses within the same DRAM page
• Row-buffer misses: random accesses across different pages

22. Why CPUs Cannot Accelerate Random Accesses
Can anything be done for index operations (random memory accesses)?
1. Caching: the working set is large (~100 GB) while the CPU cache is small (~10 MB)
2. Prefetching: it is hard to predict the next memory address
3. Multithreading: the number of hardware-supported threads is limited, bounded by the number of Miss Status Holding Registers (MSHRs) for large volumes of outstanding misses

23. CPU vs. GPU
CPUs spend die area on control logic and caches; GPUs spend it on massive numbers of ALUs.
  Intel Xeon E5-2650v2: 2.3 billion transistors, 8 cores, 59.7 GB/s memory bandwidth
  Nvidia GTX 780: 7 billion transistors, 2,304 cores, 288.4 GB/s memory bandwidth

24. Mega-KV System Framework
Request → pre-processing (network processing, memory management) → GPU processing (index operations) → post-processing (read & send value) → TX

25. Mega-KV System Framework
Pre-processing: network processing and memory management

26. Mega-KV System Framework
Pre-processing feeds index operations to parallel processing in the GPU

27. Mega-KV System Framework
Post-processing reads and sends the value after the GPU stage
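The three-stage pipeline in the slides above can be sketched with queues and a worker thread. Mega-KV's middle stage is a CUDA kernel; here a plain Python thread drains a whole batch at once, which captures the key property (index operations are batched so one launch serves many queries) without any of the real GPU machinery. All names below are hypothetical.

```python
import queue
import threading

index = {}                 # stands in for the GPU-resident hash index
gpu_in = queue.Queue()     # pre-processing -> GPU stage
gpu_out = queue.Ueue() if False else queue.Queue()  # GPU stage -> post-processing

def pre_process(requests):
    """Pre-processing: batch parsed requests and hand them to the 'GPU'."""
    gpu_in.put(requests)

def gpu_stage():
    """Index operations: one 'launch' serves a whole batch.
    On a real GPU this is one thread per query; here a loop."""
    batch = gpu_in.get()
    results = []
    for op, key, value in batch:
        if op == "SET":
            index[key] = value
            results.append((key, "STORED"))
        elif op == "GET":
            results.append((key, index.get(key)))
    gpu_out.put(results)

def post_process():
    """Post-processing: read values and send responses (here: return them)."""
    return gpu_out.get()

pre_process([("SET", "k1", "v1"), ("SET", "k2", "v2"), ("GET", "k1", None)])
worker = threading.Thread(target=gpu_stage)
worker.start()
worker.join()
replies = post_process()
```

Batching is the design choice that pays for the PCIe crossing: the fixed cost of a transfer and kernel launch is amortized over thousands of index operations.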

28. Challenges of Offloading Index Operations to GPUs
1. GPU memory capacity is small (~10 GB), while the keys may take hundreds of GBs
2. Low PCIe bandwidth: PCIe is generally the bottleneck of a GPU when bulk data must be transferred
3. Handling variable-length input is inefficient on GPUs: either search along the input buffer or transfer a separate offset buffer, plus variable-length string comparison
4. Linked lists are inefficient on GPUs: the locks needed to insert and delete linked items hinder parallelism and cause more random memory accesses

29. Our Solutions
• Compress variable-length keys into fixed-length signatures, addressing challenges 2 and 3 (PCIe bandwidth and variable-length data)
• A GPU-optimized cuckoo hash table that stores key signatures and value locations, addressing challenges 1, 3, and 4 (GPU memory capacity, variable-length data, and linked lists)
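The two ideas combine into a compact index: fixed 32-bit signatures replace variable-length keys, and a two-choice cuckoo table replaces linked-list chaining. The sketch below is a CPU-side Python illustration, not Mega-KV's CUDA implementation; the table size, truncated-SHA-1 signatures, and single-slot buckets are all hypothetical parameters, and it ignores the rare case of two distinct keys sharing a signature (a real design must still verify the full key at the returned value location).

```python
import hashlib

NUM_BUCKETS = 1024   # hypothetical table size
MAX_KICKS = 32       # displacement bound before giving up

def signature(key: bytes) -> int:
    """Compress a variable-length key into a fixed-length 32-bit signature."""
    return int.from_bytes(hashlib.sha1(key).digest()[:4], "little")

def bucket(sig: int, which: int) -> int:
    """The two candidate buckets of a signature (cuckoo hashing's two choices)."""
    h = hashlib.sha1(bytes([which]) + sig.to_bytes(4, "little")).digest()
    return int.from_bytes(h[:4], "little") % NUM_BUCKETS

class CuckooIndex:
    """Fixed-size table of (signature, value_location) pairs, no linked lists."""
    def __init__(self):
        self.table = [None] * NUM_BUCKETS

    def insert(self, sig, loc):
        entry = (sig, loc)
        b = bucket(sig, 0)
        for _ in range(MAX_KICKS):
            if self.table[b] is None:
                self.table[b] = entry
                return True
            # Evict the resident entry and move it to its alternative bucket.
            entry, self.table[b] = self.table[b], entry
            b0, b1 = bucket(entry[0], 0), bucket(entry[0], 1)
            b = b1 if b == b0 else b0
        return False  # displacement chain too long: a real system would rehash

    def lookup(self, sig):
        """At most two bucket probes per query: bounded, branch-light work."""
        for which in (0, 1):
            e = self.table[bucket(sig, which)]
            if e is not None and e[0] == sig:
                return e[1]
        return None
```

Fixed-length entries and a bounded probe count (at most two buckets per lookup, no pointer chasing) are what make the structure friendly to SIMD-style GPU execution and cheap to ship across PCIe.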