A New Golden Age for Computer Architecture

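John Hennessy and David Patterson, two renowned professors of computer architecture, look back at the history of computer architecture and point out the problems facing today's architectures, such as the end of Moore's Law and security flaws. They propose new research directions for future computer architecture: using DSLs (domain-specific languages) and DSAs (domain-specific architectures) to move the chip industry forward, for example by designing dedicated neural network processors for machine learning workloads, and programmable network routers and network interface cards. In specific domains, new architectures and design methods open up more opportunities. The two also share some of the industry's current results.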

1. A New Golden Age for Computer Architecture: Domain-Specific Hardware/Software Co-Design, Enhanced Security, Open Instruction Sets, and Agile Chip Development. John Hennessy and David Patterson, Stanford and UC Berkeley, 13 June 2018. https://www.youtube.com/watch?v=3LVeEjsn8Ts

2. Outline ▪ Part I: History of Architecture - Mainframes, Minicomputers, Microprocessors, RISC vs CISC, VLIW ▪ Part II: Current Architecture Challenges - Ending of Dennard Scaling and Moore’s Law, Security ▪ Part III: Future Architecture Opportunities - Domain Specific Languages and Architecture, Open Architectures, Agile Hardware Development

3. IBM Compatibility Problem in Early 1960s By early 1960’s, IBM had 4 incompatible lines of computers! 701 ➡ 7094 650 ➡ 7074 702 ➡ 7080 1401 ➡ 7010 Each system had its own: ▪ Instruction set architecture (ISA) ▪ I/O system and Secondary Storage: magnetic tapes, drums and disks ▪ Assemblers, compilers, libraries,... ▪ Market niche: business, scientific, real time, ... IBM System/360 – one ISA to rule them all 3

4. Control versus Datapath ▪ Processor designs split between datapath, where numbers are stored and arithmetic operations computed, and control, which sequences operations on datapath ▪ Biggest challenge for computer designers was getting control correct ▪ Maurice Wilkes invented the idea of microprogramming to design the control unit of a processor* ▪ Logic expensive vs. ROM or RAM ▪ ROM cheaper than RAM ▪ ROM much faster than RAM [Diagram: Control (Instruction, Control Lines, Condition?, Busy?) driving Datapath (Registers, Inst. Reg., PC, ALU), connected to Main Memory (Address, Data)] * "Micro-programming and the design of the control circuits in an electronic digital computer," M. Wilkes, and J. Stringer. Mathematical Proc. of the Cambridge Philosophical Society, Vol. 49, 1953.

5. Microprogramming in IBM 360
Model                     M30          M40          M50          M65
Datapath width            8 bits       16 bits      32 bits      64 bits
Microcode size            4k x 50      4k x 52      2.75k x 85   2.75k x 87
Clock cycle time (ROM)    750 ns       625 ns       500 ns       200 ns
Main memory cycle time    1500 ns      2500 ns      2000 ns      750 ns
Price (1964 $)            $192,000     $216,000     $460,000     $1,080,000
Price (2018 $)            $1,560,000   $1,760,000   $3,720,000   $8,720,000
Fred Brooks, Jr.

6. IC Technology, Microcode, and CISC ▪ Logic, RAM, ROM all implemented using same transistors ▪ Semiconductor RAM ≈ same speed as ROM ▪ With Moore’s Law, memory for control store could grow ▪ Since RAM, easier to fix microcode bugs ▪ Allowed more complicated ISAs (CISC) ▪ Minicomputer (TTL server) example: - Digital Equipment Corp. (DEC) - VAX ISA in 1977 ▪ 5K x 96b microcode

7. Writable Control Store ▪ If Control Store is RAM, then could tailor “firmware” to application: “Writable Control Store” ▪ Microprogramming became popular in academia - Patterson PhD thesis* - SIGMICRO was for microprogramming** ▪ Xerox Alto (Bit Slice TTL) in 1973 -1st computer with Graphical User Interface & Ethernet -BitBlt and Ethernet controller in microcode * Verification of microprograms, David Patterson, UCLA, 1976 ** “The design of a system for the synthesis of correct microprograms,” David Patterson, Proc. 8th Annual Workshop of Microprogramming, 1975 Chuck Thacker 7

8. Microprocessor Evolution ▪ Rapid progress in 1970s, fueled by advances in MOS technology, imitated minicomputers and mainframe ISAs ▪ “Microprocessor Wars”: compete by adding instructions (easy for microcode), justified given assembly language programming ▪ Intel iAPX 432: Most ambitious 1970s micro, started in 1975 ▪ 32-bit capability-based object-oriented architecture, custom OS written in Ada ▪ Severe performance, complexity (multiple chips), and usability problems; announced 1981 ▪ Intel 8086 (1978, 8MHz, 29,000 transistors) ▪ “Stopgap” 16-bit processor, 52 weeks to new chip ▪ ISA architected in 3 weeks (10 person weeks) assembly-compatible with 8 bit 8080 ▪ IBM PC 1981 picks Intel 8088 for 8-bit bus (and Motorola 68000 was late) ▪ Estimated PC sales: 250,000 ▪ Actual PC sales: 100,000,000 ⇒ 8086 “overnight” success ▪ Binary compatibility of PC software ⇒ bright future for 8086 8

9. Analyzing Microcoded Machines 1980s ▪ World changed to HLL programming from assembly ▪ Compilers now source of measurements ▪ John Cocke group at IBM ▪ Worked on a simple pipelined processor, 801 minicomputer (ECL server), and advanced compilers inside IBM ▪ Ported their compiler to IBM 370, only used simple register-register and load/store instructions (similar to 801) ▪ Up to 3X faster than existing compilers that used full 370 ISA! ▪ Emer and Clark at DEC in early 1980s* ▪ Found VAX 11/780 average clock cycles per instruction (CPI) = 10! ▪ Found 20% of VAX ISA ⇒ 60% of microcode, but only 0.2% of execution time! ▪ Patterson after ‘79 DEC sabbatical: repair microcode bugs in microprocessors?** ▪ What’s magic about ISA interpreter in Writable Control Store? Why not other programs? * "A Characterization of Processor Performance in the VAX-11/780," J. Emer and D. Clark, ISCA, 1984. ** “RISCy History,” David Patterson, May 30, 2018, Computer Architecture Today Blog

10. From CISC to RISC ▪ Use SRAM for instruction cache of user-visible instructions ▪ Contents of fast instruction memory change to what application needs now vs. ISA interpreter ▪ Use simple ISA ▪ Instructions as simple as microinstructions, but not as wide ▪ Compiled code only used a few CISC instructions anyways ▪ Enable pipelined implementations ▪ Further benefit with chip integration ▪ In early ‘80s, could finally fit 32-bit datapath + small caches on a single chip ▪ Chaitin’s register allocation scheme* benefits load-store ISAs *Chaitin, Gregory J., et al. "Register allocation via coloring." Computer languages 6.1 (1981), 47-57. 10

11. Berkeley & Stanford RISC Chips ▪ RISC-I (1982) contains 44,420 transistors, fabbed in 5 µm NMOS, with a die area of 77 mm², ran at 1 MHz ▪ RISC-II (1983) contains 40,760 transistors, was fabbed in 3 µm NMOS, ran at 3 MHz, and the size is 60 mm² ▪ Stanford MIPS (1983) (Microprocessor without Interlocked Pipeline Stages) contains 25,000 transistors, was fabbed in 3 µm & 4 µm NMOS, ran at 4 MHz (3 µm), and size is 50 mm² (4 µm) ▪ Fitzpatrick, Daniel, John Foderaro, Manolis Katevenis, Howard Landman, David Patterson, James Peek, Zvi Peshkess, Carlo Séquin, Robert Sherburne, and Korbin Van Dyke. "A RISCy approach to VLSI." ACM SIGARCH Computer Architecture News 10, no. 1 (1982). ▪ Hennessy, John, Norman Jouppi, Steven Przybylski, Christopher Rowen, Thomas Gross, Forest Baskett, and John Gill. "MIPS: A microprocessor architecture." In ACM SIGMICRO Newsletter, vol. 13, no. 4, (1982).

12. “Iron Law” of Processor Performance: How RISC can win
$$\frac{\text{Time}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock cycles}}{\text{Instruction}} \times \frac{\text{Time}}{\text{Clock cycle}}$$
▪ CISC executes fewer instructions per program (≈ 3/4X instructions), but many more clock cycles per instruction (≈ 6X CPI) ⇒ RISC ≈ 4X faster than CISC “Performance from architecture: comparing a RISC and a CISC with similar hardware organization,” Dileep Bhandarkar and Douglas Clark, Proc. Symposium, ASPLOS, 1991.
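The arithmetic behind the slide's conclusion can be sketched in a few lines. This is only an illustration: the function name is hypothetical, the RISC values are normalized to 1, and equal clock cycle times are assumed for both machines.

```python
def execution_time(instructions, cpi, cycle_time):
    """Iron Law: time/program = instructions/program * cycles/instruction * time/cycle."""
    return instructions * cpi * cycle_time


# Normalize the RISC machine to 1.0 on every term (an assumption for illustration).
risc_time = execution_time(instructions=1.0, cpi=1.0, cycle_time=1.0)

# Ratios quoted on the slide: CISC runs about 3/4 as many instructions at about 6x the CPI.
# The same clock cycle time is assumed for both machines.
cisc_time = execution_time(instructions=0.75, cpi=6.0, cycle_time=1.0)

risc_advantage = cisc_time / risc_time

if __name__ == "__main__":
    print(f"RISC advantage: {risc_advantage:.1f}x")
```

The product 0.75 × 6 comes out at 4.5, which the slide rounds to "≈ 4X".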

13. Video of RISC History* *Full ACM video is at http://bit.ly/2KKltJ5

14. CISC vs. RISC Today
PC Era ▪ Hardware translates x86 instructions into internal RISC instructions ▪ Then use any RISC technique inside MPU ▪ > 350M / year! ▪ x86 ISA eventually dominates servers as well as desktops
PostPC Era: Client/Cloud ▪ IP in SoC vs. MPU ▪ Value die area, energy as much as performance ▪ > 20B total / year in 2017 ▪ x86 in PCs peaks in 2011, now decline ~8% / year (2016 < 2007) ▪ x86 servers ⇒ Cloud ~10M servers total* (0.05% of 20B) ▪ 99% Processors today are RISC
*“A Decade of Mobile Computing”, Vijay Reddi, 7/21/17, Computer Architecture Today

15. VLIW: Very Long Instruction Word (Josh Fisher) [Instruction format: Int Op 1 | Int Op 2 | Mem Op 1 | Mem Op 2 | FP Op 1 | FP Op 2] ▪ Two Integer Units, Single Cycle Latency ▪ Two Load/Store Units, Three Cycle Latency ▪ Two Floating-Point Units, Four Cycle Latency ▪ Multiple operations packed into one instruction (like a wide microinstruction) ▪ Each operation slot is for a fixed function ▪ Constant operation latencies are specified ▪ Architecture requires guarantee of: ▪ Parallelism within an instruction ⇒ no cross-operation RAW check ▪ No data use before data ready ⇒ no data interlocks
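Since the hardware performs neither check, the two guarantees fall to the compiler. The sketch below shows one way such a schedule check might look. The bundle encoding (one list of `(unit, dest, sources)` tuples per cycle) and the function name are hypothetical; only the three latencies come from the slide.

```python
# Latencies from the slide: integer 1 cycle, load/store 3 cycles, floating point 4 cycles.
LATENCY = {"int": 1, "mem": 3, "fp": 4}


def check_schedule(bundles):
    """Return violations of the two VLIW guarantees as (cycle, register, reason) tuples.

    bundles: one entry per cycle; each entry is a list of operations encoded as
    (unit, dest_register, [source_registers]). This encoding is an assumption.
    """
    ready_at = {}  # register -> first cycle in which its new value may be read
    violations = []
    for cycle, bundle in enumerate(bundles):
        for i, (_unit, _dest, sources) in enumerate(bundle):
            # Registers written by the *other* operations in the same instruction.
            other_dests = {d for j, (_, d, _) in enumerate(bundle) if j != i}
            for src in sources:
                if src in other_dests:
                    violations.append((cycle, src, "RAW inside one instruction"))
                elif ready_at.get(src, 0) > cycle:
                    violations.append((cycle, src, "used before ready"))
        # Record when each result written in this cycle becomes available.
        for unit, dest, _sources in bundle:
            ready_at[dest] = cycle + LATENCY[unit]
    return violations


# A load issued in cycle 0 is ready in cycle 3, so this schedule is legal.
good = [[("mem", "r1", [])], [], [], [("int", "r2", ["r1"])]]
# Using the loaded value one cycle later breaks the "no data interlocks" guarantee.
too_early = [[("mem", "r1", [])], [("int", "r2", ["r1"])]]
# Producer and consumer packed into the same instruction.
same_bundle = [[("int", "r1", []), ("int", "r2", ["r1"])]]
```

A real compiler would also have to respect the fixed function of each slot and the number of units of each kind, which this sketch leaves out.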

16. From RISC to Intel/HP Itanium, EPIC IA-64 ▪ EPIC is Intel’s name for their VLIW architecture ▪ “Explicitly Parallel Instruction Computing” ▪ A binary object-code-compatible VLIW ▪ Developed jointly with HP starting 1994 ▪ IA-64 was Intel’s chosen 64b ISA successor to 32b x86 ▪ IA-64 = Intel Architecture 64-bit ▪ AMD wouldn’t be able to make IA-64, unlike x86, so had to make 64-bit x86 ▪ First chip late (2001 vs 1997), but eventually delivered (2002) ▪ Many companies gave up RISC for Itanium since it was widely believed to be inevitable (Microsoft, SGI, Hitachi, Bull, …) 17

17. VLIW Issues and an “EPIC Failure” ▪ Compiler couldn't handle complex dependencies in integer code (pointers) ▪ Code size explosion ▪ Unpredictable branches ▪ Variable memory latency (unpredictable cache misses) - Out of Order techniques dealt with cache latencies ▪ Out of Order subsumed VLIW benefits ▪ “The Itanium approach...was supposed to be so terrific – until it turned out that the wished-for compilers were basically impossible to write.” - Donald Knuth, Stanford ▪ Pundits noted delays and underperformance of Itanium ▪ Product ridiculed by the chip industry: Itanium ⇒ “Itanic” (like infamous ship Titanic)

18. Summary Part I: Consensus on ISAs Today ▪ Not CISC: no new general-purpose CISC ISA in 30 years ▪ Not VLIW: no new general-purpose VLIW ISA in 15 years. VLIW has failed in general-purpose computing arena ▪ Complex VLIW architectures close to in-order superscalar in complexity, no real advantage on large complex apps ▪ Although VLIWs successful in embedded DSP market (Simpler VLIWs, easier branches, no caches, smaller programs) ▪ RISC! Widely agreed (still) that RISC principles are best for general purpose ISA! 19

19. Outline ▪ Part I: History of Architecture - Mainframes, Minicomputers, Microprocessors, RISC vs CISC, VLIW ▪ Part II: Current Architecture Challenges - Ending of Dennard Scaling and Moore’s Law, Security ▪ Part III: Future Architecture Opportunities - Domain Specific Languages and Architecture, Open Architectures, Agile Hardware Development

20. Fundamental Changes in Technology • Technology - End of Dennard scaling: power becomes the key constraint - Ending of Moore’s Law: transistor improvement slows • Architectural - Limitations and inefficiencies in exploiting instruction level parallelism end the uniprocessor era in 2004 - Amdahl’s Law and its implications end “easy” multicore era • Products - PC/Server ⇒ IoT, Mobile/Cloud
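As a rough illustration of why Amdahl's Law ends the "easy" multicore era, the sketch below evaluates the standard formula speedup = 1 / ((1 - p) + p / n), where p is the fraction of work that can run in parallel and n is the number of cores. The function name and the sample values of p and n are illustrative assumptions, not figures from the talk.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Speedup over one core when only `parallel_fraction` of the work scales with cores."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Hypothetical workload: 90% of the execution time is parallelizable.
    for cores in (2, 8, 64, 1024):
        print(cores, round(amdahl_speedup(0.9, cores), 2))
```

With 90% of the work parallel, the speedup stays below 10X however many cores are added, because the serial 10% comes to dominate.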

21. End of Growth of Single Program Speed? [Figure: single-program performance growth by era, based on SPECintCPU] ▪ CISC: 2X / 3.5 yrs (22%/yr) ▪ RISC: 2X / 1.5 yrs (52%/yr) ▪ End of Dennard Scaling ⇒ Multicore: 2X / 3.5 yrs (23%/yr) ▪ Amdahl’s Law ⇒ 2X / 6 yrs (12%/yr) ▪ End of the Line? 2X / 20 yrs (3%/yr) Source: John Hennessy and David Patterson, Computer Architecture: A Quantitative Approach, 6/e. 2018
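The slide quotes each era both as an annual growth rate and as a doubling time. The two are related by compound growth: a rate r per year doubles performance in ln 2 / ln(1 + r) years. The helper below is a small sketch of that conversion. Its name is hypothetical, and the slide's own labels appear to be rounded, so the computed values will not match them exactly.

```python
import math


def doubling_time_years(annual_growth):
    """Years needed to double at a compound growth rate (0.22 means 22% per year)."""
    return math.log(2) / math.log(1 + annual_growth)


if __name__ == "__main__":
    # Annual growth rates as printed on the slide.
    for rate in (0.22, 0.52, 0.23, 0.12, 0.03):
        print(f"{rate:.0%}/yr -> 2X every {doubling_time_years(rate):.1f} years")
```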

22. Moore’s Law in DRAMs

23. Moore’s Law Slowdown in Intel Processors ▪ Cost/transistor slowing down faster, due to fab costs.

24. Technology & Power: Dennard Scaling ▪ Energy scaling for fixed task is better, since more and faster transistors ▪ Power consumption based on models in “Dark Silicon and the End of Multicore Scaling,” Hadi Esmaeilzadeh, ISCA, 2011

25. Sorry State of Security ▪ Many protection mechanisms earlier - Domains, rings, even capabilities ▪ Not well used ⇒ disappeared - Didn’t seem to help, and lots of overhead ▪ Early hope: SW would eliminate attack vectors - Perhaps through verification: too hard - Kernels and microkernels: explosion in size ▪ Clearly, not the case for almost all software - Must build secure systems despite SW bugs! ▪ Hardware must help with security!

26.Example of Current State of the Art: x86 ● 40+ years of interfaces leading to attack vectors ○ e.g., Intel Management Engine (ME) processor ■ Runs firmware management system more privileged than system SW ■ “Sadly, and most depressing, there is no option for us users to opt-out from having this on our computing devices, whether we want it or not. The author considers this as probably the biggest mistake the PC industry has got itself into she has ever witnessed.”* ○ e.g., Fuzz testing of x86 potential opcodes** ■ Unknown instruction: freeze processor despite being in user mode * “Intel x86 considered harmful,” Joanna Rutkowska, 2015 ** “Breaking the x86 ISA,” Christopher Domas, 2016 27

27. Spectre & Computer Architecture ● Definition of instruction set architecture ● “What the machine language programmer must know to properly write a correct but timing-independent program.” ● Spectre: speculation ⇒ timing attacks that leak ≥10 kb/s ● More microarchitecture attacks on the way* ● Security via resource isolation? Turn off multithreading ● Spectre is bug in computer architecture definition vs chip ● Need Computer Architecture 2.0 to prevent timing leaks** * “A Survey of Microarchitectural Timing Attacks and Countermeasures on Contemporary Hardware,” Qian Ge, Yuval Yarom, David Cock, and Gernot Heiser, Journal of Cryptographic Engineering, April, 2018 ** “A Primer on the Meltdown & Spectre Hardware Security Design Flaws and their Important Implications”, Mark Hill, 2/15/18, Computer Architecture Today

28. Part II: Challenges Summary ▪ Performance improvements are at a standstill - Slowing Moore’s Law - No more Dennard Scaling - Microarchitecture techniques: ILP, multicore, etc. are inefficient, hence burn energy ▪ State of computer security is embarrassing for all of us in the computing field - Seems unlikely systems will ever become secure using software only solutions 29

29. Outline ▪ Part I: History of Architecture - Mainframes, Minicomputers, Microprocessors, RISC vs CISC, VLIW ▪ Part II: Current Architecture Challenges - Ending of Dennard Scaling and Moore’s Law, Security ▪ Part III: Future Architecture Opportunities - Domain Specific Languages and Architecture, Open Architectures, Agile Hardware Development