A Reconfigurable Fabric for Accelerating Large-Scale ... - H2RC


1. Accelerating Deep Neural Networks at Datacenter Scale with the BrainWave Architecture
Eric Chung, Jeremy Fowers, Kalin Ovtcharov, Michael Papamichael, Adrian Caulfield, Todd Massengil, Ming Liu, Daniel Lo, Shlomi Alkalay, Michael Haselman, Christian Boehn, Oren Firestein, Alessandro Forin, Kang Su Gatlin, Mahdi Ghandi, Stephen Heil, Kyle Holohan, Tamas Juhasz, Ratna Kumar Kovvuri, Sitaram Lanka, Friedel van Megen, Dima Mukhortov, Prerak Patel, Steve Reinhardt, Adam Sapek, Raja Seera, Balaji Sridharan, Lisa Woods, Phillip Yi-Xiao, Ritchie Zhao, Doug Burger

2. The Rise of Deep Learning in ML
Deep neural networks have enabled major advances in machine learning and AI: computer vision, language translation, speech recognition, question answering, and more.
Problem: DNNs are challenging to serve and deploy in large-scale online services
- Heavily constrained by latency, cost, and power
- Size and complexity of DNNs outpacing growth of commodity CPUs
[Figure: a Convolutional Neural Network and an unrolled Recurrent Neural Network (inputs x_t, hidden states h_t, outputs y_t)]

3. DNN Processing Units
Silicon alternatives for DNNs, ordered from flexibility to efficiency:
- CPUs (Control Unit, Registers, ALU)
- GPUs
- Soft DPU (FPGA): BrainWave, Baidu SDA, Deephi Tech, ESE, Teradeep, etc.
- Hard DPU (ASICs): Cerebras, Google TPU, Graphcore, Groq, Intel Nervana, Movidius, Wave Computing, etc.

4. The Power of Deep Learning on FPGA
FPGAs are ideal for adapting to rapidly evolving ML.
Flexibility:
- CNNs, LSTMs, MLPs, reinforcement learning, feature extraction, decision trees, etc.
- Inference-optimized numerical precision
- Exploit sparsity and deep compression for larger, faster models
Performance:
- Excellent inference performance at low batch sizes
- Ultra-low latency serving on modern DNNs: >10X lower than CPUs and GPUs
- Scale to many FPGAs in a single DNN service
Scale:
- Microsoft has the world's largest cloud investment in FPGAs
- Multiple exa-ops of aggregate AI capacity
- BrainWave runs on Microsoft's scale infrastructure

5. Project BrainWave
A scalable FPGA-powered DNN serving platform:
- Fast: ultra-low latency, high-throughput serving of DNN models at low batch sizes
- Flexible: adaptive numerical precision and custom operators
- Friendly: turnkey deployment of CNTK/Caffe/TF/etc.
[Figure: a pretrained DNN model (in CNTK, etc.) is compiled onto a scalable DNN hardware microservice of network switches and FPGAs; each BrainWave Soft DPU contains an instruction decoder & control and a Neural FU]

6. Runs on a Configurable Cloud at Massive Scale
[Figure: CPU compute layer, reconfigurable compute layer (FPGA), and converged network]

7. Deployed in Production Datacenters
- Sub-millisecond FPGA compute latencies at batch 1
- Deployment of an LSTM-based NLP model (tens of millions of parameters) that takes tens of milliseconds to serve on well-tuned CPU implementations
- Tail latencies in BrainWave-powered DNN models appear negligible in end-to-end software pipelines

8. How It Works: The BrainWave Stack
The BrainWave system, layered on general infrastructure: Compiler & Runtime, Architecture, Microarchitecture, Persistency at Scale, HW Microservices on Intel FPGAs

9. How It Works: The BrainWave Stack
- Compiler & Runtime: a framework-neutral federated compiler and runtime for compiling pretrained DNN models to soft DPUs

10. How It Works: The BrainWave Stack
- Architecture: an adaptive ISA for narrow-precision DNN inference; flexible and extensible to support fast-changing AI algorithms

11. How It Works: The BrainWave Stack
- Microarchitecture: the BrainWave Soft DPU microarchitecture, highly optimized for narrow precision and low batch

12. How It Works: The BrainWave Stack
- Persistency at Scale: model parameters persisted entirely in FPGA on-chip memories; large models supported by scaling across many FPGAs

13. How It Works: The BrainWave Stack
- HW Microservices: Intel FPGAs deployed at scale with HW microservices [MICRO'16]

14. The BrainWave Stack
Compiler & Runtime | Architecture | Microarchitecture | Persistency at Scale | HW Microservices on Intel FPGAs

15. FPGAs Are Deployed in MSFT Servers Worldwide
A Cloud-Scale Acceleration Architecture [MICRO'16]

16. FPGAs Are Deployed in MSFT Servers Worldwide
[Figure: WCS Gen4.1 blade with NIC and Catapult FPGA; Catapult v2 mezzanine card and its card locations]
[ISCA'14, HotChips'14, MICRO'16]

17. Hardware Microservices on FPGAs [MICRO'16]
- Interconnected FPGAs form a separate plane of computation
- Can be managed and used independently from the CPU
- Example workloads: web search ranking, deep neural networks, SDN offload, SQL
[Figure: a traditional software (CPU) server plane and a hardware acceleration plane; each server couples a CPU (QPI, DRAM, NIC) to an FPGA (PCIe Gen3 x8 / Gen3 2x8, local DRAM) whose two 40 Gb/s QSFP ports link the NIC and the ToR switch]

18. The BrainWave Stack
Compiler & Runtime | Architecture | Microarchitecture | Persistency at Scale | HW Microservices on Intel FPGAs

19. BrainWave Compiler & Runtime
Frontends accept CNTK, Caffe, and TensorFlow models and lower them to a portable IR; a graph splitter and optimizer produces transformed IRs for the target compilers (FPGA, CPU-CNTK, CPU-Caffe), which emit a deployment package for the FPGA HW microservice.
[Figure: an example graph in which a 1000-dim input vector is split into two 500-dim halves; MatMul500 ops against 500x500 weight matrices, Add500, Sigmoid500, and Concat nodes are partitioned across FPGA0 and FPGA1]
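The partitioning in the example graph can be illustrated with plain numpy. This is a sketch of the splitting idea only, not the actual BrainWave compiler: a 1000x1000 matrix-vector product is tiled into four 500x500 blocks, two "FPGAs" each compute a row of tiles, and the halves are summed, squashed, and concatenated.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((1000, 1000))   # 1000x1000 weight matrix
x = rng.standard_normal(1000)           # 1000-dim input vector

# Split: 500x500 tiles of W, 500-dim halves of x.
W00, W01 = W[:500, :500], W[:500, 500:]
W10, W11 = W[500:, :500], W[500:, 500:]
x0, x1 = x[:500], x[500:]

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

# "FPGA0": two MatMul500 ops, Add500, Sigmoid500 -> first 500 outputs.
y0 = sigmoid(W00 @ x0 + W01 @ x1)
# "FPGA1": same structure -> last 500 outputs.
y1 = sigmoid(W10 @ x0 + W11 @ x1)

y = np.concatenate([y0, y1])    # Concat
ref = sigmoid(W @ x)            # unsplit reference computation
assert np.allclose(y, ref)      # tiling does not change the result
```

The split is lossless because matrix-vector multiplication distributes over block partitions; only the partial-sum adds cross tile boundaries.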

20. Common Scenarios
- MLPs, LSTMs, GRUs: low compute-to-data ratio. y = Wx with an NxN weight matrix mapping an input activation to an output pre-activation: O(N^2) data, O(N^2) compute.
- Convolutional Neural Network (CNN): high compute-to-data ratio. An NxNxN input activation convolved with N KxKxN weight kernels yields an NxNxN output pre-activation: O(N^3) data, O(N^4 K^2) compute.

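The asymptotics above can be checked with a back-of-envelope calculation. The accounting below is my own illustration (data counted in elements, compute in multiply-accumulates), matching the slide's big-O terms:

```python
def mlp_ratio(N):
    data = N * N             # NxN weight matrix: O(N^2) data
    compute = N * N          # one MAC per weight for y = Wx: O(N^2) compute
    return compute / data    # O(1): each weight fetched does a single MAC

def cnn_ratio(N, K):
    data = N**3 + N * K * K * N   # NxNxN activations + N KxKxN kernels: O(N^3)
    compute = N**3 * K * K * N    # each of N^3 outputs needs K^2*N MACs: O(N^4 K^2)
    return compute / data         # grows with N: heavy reuse of each fetched value

print(mlp_ratio(1000))   # 1.0: memory-bound at batch 1
print(cnn_ratio(64, 3))  # hundreds of MACs per element fetched
```

This is why the two scenarios behave so differently on hardware: CNNs reuse every fetched value many times, while MLP/LSTM-style layers at batch 1 do one operation per weight read.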

22. Conventional Acceleration Approach: Local Offload and Streaming
[Figure: 2x CPU with attached FPGA; model parameters initialized in DRAM]

23. Conventional Acceleration Approach: Local Offload and Streaming
For memory-intensive DNNs with low compute-to-data ratios (e.g., LSTMs), HW utilization is limited by off-chip DRAM bandwidth.
[Figure: 2x CPU with FPGA; model parameters initialized in DRAM and streamed per request (R)]
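A roofline-style estimate makes the bandwidth limit concrete. All numbers below are assumed for illustration, not BrainWave specifications:

```python
peak_ops = 4e12          # assumed peak FPGA throughput: 4 TOPS
dram_bw = 20e9           # assumed effective DRAM bandwidth: 20 GB/s
bytes_per_weight = 2     # e.g., 16-bit weights
ops_per_weight = 2       # one multiply + one add per weight at batch 1

# At batch 1 every weight is fetched from DRAM once per inference, so
# sustained compute cannot exceed what the memory system can feed:
mem_bound_ops = dram_bw / bytes_per_weight * ops_per_weight
utilization = min(1.0, mem_bound_ops / peak_ops)
print(f"HW utilization at batch 1: {utilization:.1%}")  # 0.5%
```

Under these assumptions the datapath sits idle more than 99% of the time, which is the gap the following slides address.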

24. Improving HW Utilization with Batching
[Plot: hardware utilization (%) vs. batch size; FPGA streams weights from DRAM while serving batched requests (R)]

25. Improving HW Utilization with Batching
Batching improves HW utilization but increases latency.
[Plots: hardware utilization (%) vs. batch size; 99th-percentile latency vs. batch size against a maximum allowed latency]

26. Improving HW Utilization with Batching
Batching improves HW utilization but increases latency; ideally we want high HW utilization at low batch sizes.
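The trade-off in the two plots can be sketched with the same assumed numbers as before (illustrative only, not measured BrainWave data): batching amortizes each weight fetch over B inputs, raising utilization, but the whole batch finishes together, so latency eventually grows.

```python
peak_ops = 4e12          # assumed peak throughput (ops/s)
dram_bw = 20e9           # assumed DRAM bandwidth (bytes/s)
bytes_per_weight = 2
model_bytes = 200e6      # assumed model size

def utilization(batch):
    # Each weight fetched once now serves `batch` multiply-accumulates.
    mem_bound_ops = dram_bw / bytes_per_weight * 2 * batch
    return min(1.0, mem_bound_ops / peak_ops)

def latency_s(batch):
    # The batch completes together: bounded by streaming the model once
    # from DRAM or by the total compute, whichever dominates.
    stream = model_bytes / dram_bw
    compute = batch * 2 * (model_bytes / bytes_per_weight) / peak_ops
    return max(stream, compute)

for b in (1, 32, 512):
    print(b, f"{utilization(b):.0%}", f"{latency_s(b) * 1e3:.1f} ms")
```

Utilization climbs linearly with batch size until the design becomes compute-bound, at which point latency starts climbing instead, illustrating why high utilization at low batch is the real goal.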

27. Alternative: "Persistent" Neural Nets
[Figure: 2x CPU server with FPGA and DRAM]

28. Alternative: "Persistent" Neural Nets
Observations:
- State-of-the-art FPGAs have O(10K) distributed Block RAMs (O(10 MB) total), providing tens of TB/s of memory bandwidth
- Large-scale cloud services and DNN models run persistently
Solution: persist all model parameters in FPGA on-chip memory for the lifetime of the service
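The "tens of TB/s" figure follows from the Block RAM count: thousands of independent memories each deliver a few bytes per cycle. The per-BRAM parameters below are assumed typical values, not BrainWave-specific specs:

```python
num_brams = 10_000       # O(10K) distributed Block RAMs (from the slide)
bytes_per_cycle = 8      # assumed: two 32-bit ports per BRAM, read each cycle
clock_hz = 300e6         # assumed fabric clock of 300 MHz

aggregate_bw_tb_s = num_brams * bytes_per_cycle * clock_hz / 1e12
print(f"aggregate on-chip BW: ~{aggregate_bw_tb_s:.0f} TB/s")  # ~24 TB/s
```

That is roughly three orders of magnitude more bandwidth than a single off-chip DRAM channel, which is what makes batch-1 serving viable once the parameters live on chip.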

29. Alternative: "Persistent" Neural Nets
[Figure: 2x CPU server with FPGA serving a request (R) from on-chip parameters]