Implementing Machine Learning & Neural Network Chip Architectures


1. Designing Neural Network SoC Architectures Using Configurable Cache Coherent Interconnect IP
CHIPEX 2018, 1 May 2018
Regis Gaillard, Application Engineer, Arteris IP (regis.gaillard@arteris.com)

2. Agenda
1 Neural Networks & Deep Learning
2 Chips for NN
3 Technologies for NN Chips
4 Technologies for AI Chips
5 The Future
Copyright © 2018 Arteris

3. Automotive is driving Machine Learning
(Diagram: a feedback loop between vehicles and the datacenter: CPUs + HW accelerators in the field generate new data for training, and updated models are deployed back to CPUs + HW accelerators in the car.)

4. Mimicking a Brain Cell
(Diagram: a biological neuron next to its mathematical model. Inputs x0, x1, x2 arrive on dendrites, each is multiplied by a weight w0, w1, w2, the products are summed in the cell body along with input arriving on an axon from a different neuron, and the sum is passed through an activation function f to produce the output on the axon.)
Sources: MIT, Stanford
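The weighted-sum-and-activation model on this slide can be sketched in a few lines of Python. The ReLU activation and the specific weights below are illustrative choices, not values from the slide:

```python
def neuron(inputs, weights, bias, f):
    """Cell body: weighted sum of dendrite inputs, then activation f."""
    s = sum(x * w for x, w in zip(inputs, weights)) + bias
    return f(s)

relu = lambda s: max(0.0, s)   # one common choice of activation function

# x0*w0 + x1*w1 + x2*w2 + bias = 1*0.5 + 2*(-0.25) + 3*1.0 + 0.5 = 3.5
out = neuron([1.0, 2.0, 3.0], [0.5, -0.25, 1.0], 0.5, relu)
```

Every accelerator discussed later in the deck is, at bottom, hardware for evaluating enormous numbers of these weighted sums.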

5. Many cells become a neural network
(Diagram: neurons arranged into an input layer, a hidden layer, and an output layer, with each layer's outputs feeding the next.)
Source: Stanford
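Chaining such cells layer by layer gives the network's forward pass. A minimal sketch with made-up weights (two inputs, two hidden units, one output):

```python
def dense(layer_in, weights, biases, f):
    """One fully connected layer: each output neuron sees all inputs."""
    return [f(sum(x * w for x, w in zip(layer_in, row)) + b)
            for row, b in zip(weights, biases)]

relu = lambda s: max(0.0, s)
identity = lambda s: s

x = [1.0, 2.0]                                                     # input layer
hidden = dense(x, [[0.5, -1.0], [1.0, 1.0]], [0.0, -0.5], relu)    # hidden layer
output = dense(hidden, [[1.0, 1.0]], [0.0], identity)              # output layer
```

The nested loops inside `dense` are exactly the matrix-vector products that NN hardware accelerates.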

6. A neural network in operation
Source: Lee et al., Communications of the ACM, 2011

7. Many types with specialized processing requirements
Source: http://www.asimovinstitute.org/neural-network-zoo/

8. Training vs. Inference
(Figure: training adjusts a network's weights against labeled data; inference applies the trained, frozen network to new inputs.)
Source: Nvidia
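The distinction can be made concrete with a toy model: training iteratively updates a weight by gradient descent on labeled data, while inference is just a forward pass with the learned weight frozen. The one-parameter model and learning rate here are illustrative:

```python
# Training: iteratively adjust the weight to reduce loss on labeled data.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]   # samples of y = 2x
w, lr = 0.0, 0.05
for _ in range(200):
    for x, y in data:
        pred = w * x
        grad = 2 * (pred - y) * x              # d/dw of (pred - y)**2
        w -= lr * grad                         # gradient descent step

# Inference: weights are frozen; just run the forward pass on new input.
prediction = w * 5.0
```

Training is compute- and bandwidth-hungry (many passes over a dataset, weight updates), which is why it lives in the datacenter; inference is a single forward pass, which is what edge chips optimize.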

9. Why is Deep Learning Happening Now?
- Algorithms, and research in GPU utilization
- The Internet: large organized "Big Data" datasets are easier to gather
- Compute Power: we helped make chips more powerful and easier to design
- Datacenters: software & infrastructure created to support search & cloud computing
- Money: financial backing, a "gold rush" for first-mover advantage

10. A Silicon Design Renaissance For AI
Today's hype is driving a funding rush that is allowing many different approaches to AI processing to be explored, including new IP offerings.
Approaches: systolic; memory oriented; on-chip storage focused; big cores; massively parallel tiny cores; analog; optical; quantum?
Hardware accelerator IP: Cadence, Synopsys, Ceva, Nvidia, AiMotive

11. Deep Learning Startups Are Happening
Startup activity is unusually high, despite the well-known historical VC bias against chip startups.
- AIMotive: portable software for automated driving (Hungary)
- Axis Semi: massive array of compute cores (USA)
- Bitmain: coin miner builds training ASIC (China)
- BrainChip: Spiking Neuron Adaptive Processor (USA)
- Cambricon: device and cloud processors for AI (China)
- Cerebras Systems: specialized next-generation chip for deep-learning applications (USA)
- Deep Vision: low-power computer vision (USA)
- Deephi: compressed CNN networks and processors (China)
- Esperanto: massive array of RISC-V cores (USA)
- Graphcore: graph-oriented processors for deep learning (UK)
- Groq: Google spinout deep learning chip (USA)
- Horizon Robotics: smart home, automotive, and public safety (China)
- IntelliGo: hardware and software for image and speech processing (China)
- Mythic-AI: ultra-low power NN inference IC design based on flash + analog + digital (USA)
- Novumind: AI for IoT (USA)
- Preferred Networks: real-time data analytics with deep learning and the Chainer library (Japan)
- Reduced Energy Microsystems: lowest power silicon for deep learning and machine vision (USA)
- SenseTime: computer vision (China)
- Tenstorrent: deep learning processor designed for faster training and adaptability to future algorithms (Canada)
- Syntient: custom analog neural networks (USA)
- ThinCI: vision processing chips (USA)
- Thinkforce: AI chips (China)
- Unisound: AI-based speech and text (China)
- Vathys: deep learning supercomputers (USA)
- Wave Computing: deep learning computers based on custom silicon (USA)
Source: Chris Rowen, Cognite Ventures

12. A Wide Range of Applications and Markets
- Low Power: edge or consumer; battery powered; focused application
- Mid-Range: not battery powered; thermal constraints; functional safety
- High Performance: data center; high bandwidth; many compute engines; flexible computing

13. Technologies for Advanced AI Chips
- Power Efficiency
- Scalability
- Hardware Acceleration

14. Interconnects enable SoC architectures
(Diagram: an SoC assembled from interconnect IP. An Ncore™ cache coherent interconnect links the CPU subsystem (quad A72 and quad A53 clusters, each with L2 cache). A FlexNoC® non-coherent interconnect connects design-specific subsystems (GPU 3D graphics; A/V DSP with AES, 2D graphics, MPEG, etc.), the security subsystem (CRI Crypto Firewall (PCF+), RSA-PSS certificate engine), high-speed wired peripherals (USB 2/3, PCIe, Ethernet, with PHYs), the wireless subsystem (WiFi, GSM, LTE, LTE Adv.), I/O peripherals (HDMI, MIPI display, PMU, JTAG), and the memory subsystem (Wide IO, LPDDR, DDR3 PHYs, memory scheduler, memory controller). FlexWay® interconnects serve within application IP subsystems, and interchip links connect to other dies.)

15. Custom Hardware Acceleration
Deep Learning Needs:
- Specialized processing: custom matrix operations; multiply-accumulate (MAC); fixed-point operations; flexible operand bit widths (8→4→2 bits)
- Specialized dataflow: specific to algorithm(s); the goal is optimization of data reuse and local accumulation; process algorithm-specific data formats
Hardware Acceleration Delivers:
- Low power, low latency, data reuse, data locality
- The best performance and energy efficiency, as the hardware is customized for the application
- The more focused the application, the more likely the use of multiple types of tightly integrated custom processors
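The multiply-accumulate primitive this slide describes can be modeled in Python. This sketch assumes int8 operands feeding a 32-bit accumulator, a common fixed-point configuration chosen here for illustration:

```python
def mac8(acc, a, b):
    """Multiply-accumulate with int8 operands into a 32-bit accumulator,
    as is typical in fixed-point inference hardware."""
    assert -128 <= a <= 127 and -128 <= b <= 127
    acc += a * b
    # Wrap to 32 bits the way a hardware accumulator register would.
    return (acc + 2**31) % 2**32 - 2**31

acc = 0
for a, b in [(100, 50), (-30, 20), (7, 7)]:
    acc = mac8(acc, a, b)
# acc == 100*50 - 30*20 + 7*7 == 4449
```

Narrower operands (4 or 2 bits) shrink the multipliers and multiply the number of MAC units that fit in a given silicon and power budget, which is why flexible bit widths matter.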

16. Neural Network Architecture Revolution
Performance & power efficiency:
- High parallelism
- Structured, specialized architectures
- Excellent results with low precision
- Appropriate memory bandwidth
→ 100x energy benefit over CPU
(Chart: hardware accelerators, down to low-precision analog, outpace CPU-only datacenter performance.)
Source: Chris Rowen, Cognite Ventures

17. Anatomy of a Neural Network Processor
Example: Tensilica Vision C5 neural network DSP
Keys to success: scale to many 1000s of multiply-add (MAC) units; high MAC density; high MAC utilization; high programmability; high data bandwidth (registers, local memory, off-chip, compression/decompression).
Multi-processor neural network DSP: each DSP is a complete processor with distributed on-chip memory (64 KB – 256 KB per core); less data movement = lower energy; general vision DSP instruction set (8b & 16b); the tensor unit sustains 1024 MACs/cycle.
Per-core datapath: instruction cache; 4-way instruction decode; scalar op execution with 32b scalar registers; vector execution units with 1024b vector data registers and 3072b vector accumulators; tensor execution units; on-the-fly decompression; memory access units; DMA engine (64 GB/s); two 512b memory ports (64 GB/s each) to off-chip DDR.
Internal bandwidth: data 500 GB/s; weights 120 GB/s; accumulators 750 GB/s.
Source: Chris Rowen, Cognite Ventures
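The 1024 MACs/cycle figure translates directly into peak throughput; the clock frequency below is an assumed value for illustration, since the slide does not give one:

```python
macs_per_cycle = 1024   # from the slide: tensor unit sustains 1024 MACs/cycle
clock_hz = 1.0e9        # assumed 1 GHz clock, for illustration only
ops_per_mac = 2         # one multiply + one add

tera_ops = macs_per_cycle * clock_hz * ops_per_mac / 1e12
# about 2 TOPS peak at the assumed clock
```

The harder problem, as the next slide argues, is sustaining the data and weight bandwidth needed to keep those MACs busy.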

18. How Do You Feed It?
Each new AI chip has a unique memory access profile:
- High-performance chips are optimized for maximum bandwidth (Google TPU: attached DDR3; Google TPU2: HBM; Intel Nervana: HBM2; Graphcore: on-chip local memory)
- Mid-range chips (automotive, vision aggregation) rely on standard I/O like PCIe or Ethernet
- Low-end chips (mobile, low power) may have only sensor inputs and a low-power DDR
Once on chip, data flow over the on-chip interconnect between custom hardware elements must also be optimized:
- Bandwidth: wide paths when needed, narrower when not
- Coherency: available where needed to simplify software development and portability
- Flexibility: meet needs from high performance to minimal cost, and everything in between
The key is to optimize the available bandwidth for the algorithm of choice, so that the next data is ready when and where it is needed, and utilization of the compute resources is maximized.
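One way to see why on-chip storage matters for feeding the compute: in a convolution layer, each weight is reused once per output pixel, so keeping weights on chip divides off-chip weight traffic by the number of output pixels. The layer dimensions below are invented for illustration:

```python
# Hypothetical conv layer: 224x224 output, 64 filters of 3x3x64 int8 weights.
out_pixels = 224 * 224
filters, k, in_ch = 64, 3, 64
macs = out_pixels * filters * k * k * in_ch       # total MACs for the layer

bytes_per_weight = 1                              # int8 weights
weights_bytes = filters * k * k * in_ch * bytes_per_weight

# Without reuse, every MAC re-fetches its weight from DRAM;
# with on-chip storage, each weight is fetched exactly once.
naive_traffic = macs * bytes_per_weight
reused_traffic = weights_bytes
reuse_factor = naive_traffic / reused_traffic     # equals out_pixels
```

This is the arithmetic behind the slide's point: the memory hierarchy, not the MAC count, usually decides sustained utilization.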

19. Using Proxy Caches to Integrate HW Accelerators
Proxy caches allow existing non-coherent cores to participate fully in a coherent system, integrating non-coherent HW accelerators as fully coherent peers, with association and configurability for machine learning use cases.
(Diagram: Ncore cache coherent interconnect with a directory and snoop filter(s); CHI and ACE coherent agent interfaces to CPU clusters with caches; a CCIX coherent agent interface; system memory interfaces with coherent memory caches (CMC) to DRAM; a peripheral access interface; an SMMU behind an ACE-Lite non-coherent agent interface; and proxy caches behind AXI non-coherent agent interfaces connecting accelerators Acc1–Acc4 over the transport interconnect.)

20. Coherent Read Example – Cache Hit
(Diagram: the same Ncore interconnect as slide 19. A consumer agent issues a coherent read ❶, the directory/snoop filter finds the line held in a producer agent's cache ❷, and the data is delivered to the consumer ❸ without a memory access.)

21. Coherent Read Example – Cache Misses (to CMC)
(Diagram: the same interconnect. The consumer's coherent read ❶ misses in the directory/snoop filter ❷, is forwarded to the coherent memory cache (CMC) and system memory interface ❸, and the data returns from memory to the consumer ❹.)
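The two read flows on slides 20 and 21 can be mimicked with a toy directory model in Python. This deliberately ignores cache states, invalidation, and writes, and all class and variable names are made up for illustration:

```python
class Cache:
    """A cache (or proxy cache) holding lines for one agent."""
    def __init__(self):
        self.lines = {}

class Directory:
    """Toy directory: tracks which cache, if any, holds each line."""
    def __init__(self, memory):
        self.memory = memory    # address -> data (backing store / CMC path)
        self.owner = {}         # address -> cache currently holding the line

    def read(self, address, requester):
        if address in self.owner:                        # snoop filter hit:
            data = self.owner[address].lines[address]    # forward from peer cache
            source = "peer-cache"
        else:                                            # miss: fetch from memory
            data = self.memory[address]
            source = "memory"
        requester.lines[address] = data      # fill the requester's (proxy) cache
        self.owner[address] = requester      # single-owner simplification
        return data, source

directory = Directory(memory={0x100: "weights_v2"})
producer, consumer = Cache(), Cache()

data1, source1 = directory.read(0x100, producer)  # miss: filled from memory
data2, source2 = directory.read(0x100, consumer)  # hit: forwarded from producer
```

The payoff the slides are after: software on the consumer side never has to flush or invalidate anything by hand; the directory resolves where the freshest copy lives.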

22. The (Really) Hard Stuff – Safety and Reliability
- How do you verify a deep learning system?
- How do you debug the neural network black box?
- What are the ethics and biases of these systems?
- What does it mean to make a neural network "safe"?

23. Data flow protection for functional safety
- Data protection (at rest & in transit): parity for data path protection; ECC memory protection
- Intelligent Ncore hardware unit duplication: don't duplicate protected memories or links; do duplicate HW that affects packets; integrate checkers, ECC/parity generators & buffers
- Fault controller with BIST
(Diagram: the coherent interconnect of the previous slides with a fault controller attached; accelerators Acc1–Acc5 behind non-coherent bridges and proxy caches; CPU1/CPU2 on coherent agent interfaces; snoop filter(s), directory, and CMCs to DRAM.)
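The parity and ECC protection mentioned here can be illustrated with a classic Hamming(7,4) single-error-correcting code, the textbook version of what ECC hardware implements in logic (a generic illustration, not Arteris's implementation):

```python
def hamming74_encode(d):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit codeword."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4            # parity over codeword positions 1,3,5,7
    p2 = d1 ^ d3 ^ d4            # parity over codeword positions 2,3,6,7
    p3 = d2 ^ d3 ^ d4            # parity over codeword positions 4,5,6,7
    return [p1, p2, d1, p3, d2, d3, d4]   # positions 1..7

def hamming74_decode(c):
    """Return the data bits, correcting any single-bit error."""
    c = c[:]
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3   # 1-based error position; 0 means clean
    if syndrome:
        c[syndrome - 1] ^= 1          # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]]
```

Simple parity (one XOR bit across the word) can only detect a single flipped bit on a data path; the extra check bits here also locate the flip, which is what lets ECC-protected memories and links correct transient faults in flight.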

24. Neural Network SoCs: Takeaways
- Deep learning is being implemented with neural networks in custom SoCs
- Hardware acceleration and optimized dataflow, both to on- and off-chip resources, are required to optimize performance and power efficiency
- The more specific the use case, the more opportunities to use many different types of hardware accelerators: general-purpose systems will have few types, while more specialized systems can have five or more types implementing a custom pipeline
- As the number of hardware accelerators increases, managing the data flow in software becomes more difficult, driving the need to integrate accelerators in hardware-coherent systems
- Functional safety is a must for autonomous driving edge inference

25.Thank You regis.gaillard@arteris.com