Thread-Level Parallelism

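This deck introduces multithreading in computer architecture. The speed of sequential software is limited, so parallel processing is the only path to higher performance, and it is achieved through SIMD and MIMD. The deck also introduces synchronization, which requires dedicated hardware support and is usually used through higher-level abstractions.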

1. CS 61C: Great Ideas in Computer Architecture
Lecture 19: Thread-Level Parallel Processing
Krste Asanović & Randy H. Katz
http://inst.eecs.berkeley.edu/~cs61c/fa17
11/2/17, Fall 2017 - Lecture #19

2. Agenda
MIMD - multiple programs simultaneously
Threads
Parallel programming: OpenMP
Synchronization primitives
Synchronization in OpenMP
And, in Conclusion …

3. Improving Performance
Increase clock rate f_s
  Reached practical maximum for today's technology
  < 5 GHz for general-purpose computers
Lower CPI (cycles per instruction)
  SIMD, "instruction-level parallelism"
Perform multiple tasks simultaneously (today's lecture)
  Multiple CPUs, each executing a different program
  Tasks may be related
    E.g., each CPU performs part of a big matrix multiplication
  or unrelated
    E.g., distribute different web HTTP requests over different computers
    E.g., run pptx (view lecture slides) and browser (YouTube) simultaneously
Do all of the above:
  High f_s, SIMD, multiple parallel tasks
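The first two levers on this slide can be read off the standard processor performance equation. The symbols T_exec and N_instr are labels introduced here for illustration; only f_s and CPI come from the slide.

```latex
T_{\text{exec}} = \frac{N_{\text{instr}} \times \text{CPI}}{f_s}
```

Raising f_s or lowering CPI shortens the time for a single instruction stream. Running tasks on multiple CPUs leaves this equation unchanged for each stream and instead spreads the total work over several streams.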

4. New-School Machine Structures (It's a bit more complicated!)
Software: harness parallelism and achieve high performance
  Parallel requests: assigned to computer, e.g., search "Katz"
  Parallel threads: assigned to core, e.g., lookup, ads
  Parallel instructions: >1 instruction @ one time, e.g., 5 pipelined instructions
  Parallel data: >1 data item @ one time, e.g., add of 4 pairs of words
  Hardware descriptions: all gates @ one time
  Programming languages
Hardware
  [Diagram: warehouse-scale computer and smartphone; computer with cores, memory (cache), and input/output; core with instruction unit(s) and functional unit(s) computing A0+B0, A1+B1, A2+B2, A3+B3; cache memory; logic gates]
Projects 3 and 5!

5. Parallel Computer Architectures
Several separate computers, some means for communication (e.g., Ethernet)
Massive array of computers, fast communication between processors
Multi-core CPU: more than one datapath on a single chip, sharing L3 cache, memory, and peripherals
  Example: Hive machines
GPU: "graphics processing unit"

6. Example: CPU with Two Cores
[Diagram: Processor 0 ("Core" 1) and Processor 1 ("Core" 2), each with its own control and datapath (PC, registers, ALU). Both processors issue memory accesses to one shared memory (bytes), which connects to input and output through I/O-memory interfaces.]

7. Multiprocessor Execution Model
Each processor (core) executes its own instructions
Separate resources (not shared)
  Datapath (PC, registers, ALU)
  Highest-level caches (e.g., 1st and 2nd)
Shared resources
  Memory (DRAM)
  Often 3rd-level cache
    Often on same silicon chip
    But not a requirement
Nomenclature
  "Multiprocessor microprocessor"
  Multicore processor
    E.g., four-core CPU (central processing unit)
    Executes four different instruction streams simultaneously

8. Transition to Multicore
Sequential app performance: +50% per year, ~100x per decade

9. Pixel 2 vs. iPhone 8

10. Pixel 2 vs. iPhone 8
[Figure: specification comparison. Surviving labels: ALUs, nm, MHz, GFlops; "2.35 GHz + 1.9 GHz, 64-bit octa-core"]

13. Multiprocessor Execution Model
Shared memory
  Each "core" has access to the entire memory in the processor
  Special hardware keeps caches consistent (next lecture!)
  Advantages:
    Simplifies communication in program via shared variables
  Drawbacks:
    Does not scale well:
      "Slow" memory shared by many "customers" (cores)
      May become bottleneck (Amdahl's Law)
Two ways to use a multiprocessor:
  Job-level parallelism
    Processors work on unrelated problems
    No communication between programs
  Partition work of single task between several cores
    E.g., each performs part of large matrix multiplication
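The second way of using a multiprocessor, partitioning one task across cores that communicate through shared variables, can be sketched as follows. This is a minimal illustration, not code from the lecture: the function names and the row-block partitioning are assumptions, and in CPython the global interpreter lock means the sketch shows the structure of the partitioning rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def matmul_rows(a, b, c, row_start, row_end):
    """Compute rows [row_start, row_end) of c = a * b.

    Each worker writes a disjoint block of rows in the shared result c,
    so no locking is needed for this particular partitioning.
    """
    inner, cols = len(b), len(b[0])
    for i in range(row_start, row_end):
        for j in range(cols):
            c[i][j] = sum(a[i][k] * b[k][j] for k in range(inner))


def parallel_matmul(a, b, num_workers=4):
    """Split the rows of the result across num_workers threads (hypothetical helper)."""
    rows, cols = len(a), len(b[0])
    c = [[0] * cols for _ in range(rows)]  # shared memory: visible to every worker
    chunk = max(1, (rows + num_workers - 1) // num_workers)  # ceiling division
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        futures = [
            pool.submit(matmul_rows, a, b, c, start, min(start + chunk, rows))
            for start in range(0, rows, chunk)
        ]
        for future in futures:
            future.result()  # wait for the worker and re-raise any exception
    return c


A = [[1, 2], [3, 4], [5, 6]]
B = [[7, 8], [9, 10]]
C = parallel_matmul(A, B, num_workers=2)
print(C)
```

The workers never exchange messages; they only read the shared inputs and write their own rows of the shared output, which is the "communication via shared variables" advantage the slide names.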

14. Parallel Processing
It's difficult!
It's inevitable
  Only path to increase performance
  Only path to lower energy consumption (improve battery life)
In mobile systems (e.g., smartphones, tablets)
  Multiple cores
  Dedicated processors, e.g., motion processor, image processor, neural processor in iPhone 8 + X
  GPU (graphics processing unit)
Warehouse-scale computers (next week!)
  Multiple "nodes"
    "Boxes" with several CPUs, disks per box
  MIMD (multi-core) and SIMD (e.g., AVX) in each node

15. Potential Parallel Performance (assuming software can use it)

Year  Cores  SIMD bits/core  Cores * SIMD bits  Total, e.g. FLOPs/cycle
2003      2             128                256                        4
2005      4             128                512                        8
2007      6             128                768                       12
2009      8             128               1024                       16
2011     10             256               2560                       40
2013     12             256               3072                       48
2015     14             512               7168                      112
2017     16             512               8192                      128
2019     18            1024              18432                      288
2021     20            1024              20480                      320

MIMD: +2 cores every 2 years, 2.5X in 12 years
SIMD: 2X width every 4 years, 8X in 12 years
MIMD & SIMD: 20X in 12 years
20^(1/12) = 1.28X, i.e., 28% per year, or 2X every 3 years!
IF (!) we can use it
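The growth-rate arithmetic on this slide can be checked with a few lines of Python. The start and end values below are taken from the 2009 and 2021 rows of the table, which is where the 20X over 12 years appears to come from; the variable names are illustrative.

```python
import math

flops_2009 = 16   # total FLOPs/cycle in the 2009 row
flops_2021 = 320  # total FLOPs/cycle in the 2021 row
years = 2021 - 2009

total_growth = flops_2021 / flops_2009             # 20X over the period
annual_growth = total_growth ** (1 / years)        # 20^(1/12)
doubling_years = math.log(2) / math.log(annual_growth)

print(f"total growth:  {total_growth:.0f}X in {years} years")
print(f"annual growth: {annual_growth:.2f}X per year")
print(f"doubling time: {doubling_years:.1f} years")
```

The doubling time comes out slightly under three years, which the slide rounds to "2X every 3 years".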

17. Programs Running on my Computer
$ ps -x
PID TTY TIME CMD
220 ?? 0:04.34 /usr/libexec/UserEventAgent (Aqua)
222 ?? 0:10.60 /usr/sbin/distnoted agent
224 ?? 0:09.11 /usr/sbin/cfprefsd agent
229 ?? 0:04.71 /usr/sbin/usernoted
230 ?? 0:02.35 /usr/libexec/nsurlsessiond
232 ?? 0:28.68 /System/Library/PrivateFrameworks/CalendarAgent.framework/Executables/CalendarAgent
234 ?? 0:04.36 /System/Library/PrivateFrameworks/GameCenterFoundation.framework/Versions/A/gamed
235 ?? 0:01.90 /System/Library/CoreServices/cloudphotosd.app/Contents/MacOS/cloudphotosd
236 ?? 0:49.72 /usr/libexec/secinitd
239 ?? 0:01.66 /System/Library/PrivateFrameworks/TCC.framework/Resources/tccd
240 ?? 0:12.68 /System/Library/Frameworks/Accounts.framework/Versions/A/Support/accountsd
241 ?? 0:09.56 /usr/libexec/SafariCloudHistoryPushAgent
242 ?? 0:00.27 /System/Library/PrivateFrameworks/CallHistory.framework/Support/CallHistorySyncHelper
243 ?? 0:00.74 /System/Library/CoreServices/mapspushd
244 ?? 0:00.79 /usr/libexec/fmfd
246 ?? 0:00.09 /System/Library/PrivateFrameworks/AskPermission.framework/Versions/A/Resources/askpermissiond
248 ?? 0:01.03 /System/Library/PrivateFrameworks/CloudDocsDaemon.framework/Versions/A/Support/bird
249 ?? 0:02.50 /System/Library/PrivateFrameworks/IDS.framework/identityservicesd.app/Contents/MacOS/identityservicesd
250 ?? 0:04.81 /usr/libexec/secd
254 ?? 0:24.01 /System/Library/PrivateFrameworks/CloudKitDaemon.framework/Support/cloudd
258 ?? 0:04.73 /System/Library/PrivateFrameworks/TelephonyUtilities.framework/callservicesd
267 ?? 0:02.15 /System/Library/CoreServices/AirPlayUIAgent.app/Contents/MacOS/AirPlayUIAgent --launchd
271 ?? 0:03.91 /usr/libexec/nsurlstoraged
274 ?? 0:00.90 /System/Library/PrivateFrameworks/CommerceKit.framework/Versions/A/Resources/storeaccountd
282 ?? 0:00.09 /usr/sbin/pboard
283 ?? 0:00.90 /System/Library/PrivateFrameworks/InternetAccounts.framework/Versions/A/XPCServices/com.apple.internetaccounts.xpc/Contents/MacOS/com.apple.internetaccounts
285 ?? 0:04.72 /System/Library/Frameworks/ApplicationServices.framework/Frameworks/ATS.framework/Support/fontd
291 ?? 0:00.25 /System/Library/Frameworks/Security.framework/Versions/A/Resources/CloudKeychainProxy.bundle/Contents/MacOS/CloudKeychainProxy
292 ?? 0:09.54 /System/Library/CoreServices/CoreServicesUIAgent.app/Contents/MacOS/CoreServicesUIAgent
293 ?? 0:00.29 /System/Library/PrivateFrameworks/CloudPhotoServices.framework/Versions/A/Frameworks/CloudPhotoServicesConfiguration.framework/Versions/A/XPCServices/com.apple.CloudPhotosConfiguration.xpc/Contents/MacOS/com.apple.CloudPhotosConfiguration
297 ?? 0:00.84 /System/Library/PrivateFrameworks/CloudServices.framework/Resources/com.apple.sbd
302 ?? 0:26.11 /System/Library/CoreServices/Dock.app/Contents/MacOS/Dock
303 ?? 0:09.55 /System/Library/CoreServices/SystemUIServer.app/Contents/MacOS/SystemUIServer
…
156 total at this moment
How does my laptop do this? Imagine doing 156 assignments all at the same time!

18. Threads
A thread is a sequential flow of instructions that performs some task
  Up to now we just called this a "program"
Each thread has:
  Dedicated PC (program counter)
  Separate registers
  Access to the shared memory
Each physical core provides one (or more) hardware threads that actively execute instructions
Operating system multiplexes multiple software threads onto the available hardware threads
  All threads except those mapped to hardware threads are waiting

19. Operating System Threads
Give illusion of many "simultaneously" active threads
Multiplex software threads onto hardware threads:
  Switch out blocked threads (e.g., cache miss, user input, network access)
  Timer (e.g., switch active thread every 1 ms)
Remove a software thread from a hardware thread by
  Interrupting its execution
  Saving its registers and PC to memory
Start executing a different software thread by
  Loading its previously saved registers into a hardware thread's registers
  Jumping to its saved PC
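The save and restore steps on this slide can be modelled with a toy simulation. Everything here is hypothetical scaffolding for illustration: the class names, the four-entry register file, and the `step` method that stands in for executing one instruction. A real operating system does this in privileged code on real registers.

```python
from dataclasses import dataclass, field
from typing import List, Optional

NUM_REGS = 4  # hypothetical register-file size, chosen only for the demo


@dataclass
class SoftwareThread:
    name: str
    saved_pc: int = 0
    saved_regs: List[int] = field(default_factory=lambda: [0] * NUM_REGS)


@dataclass
class HardwareThread:
    pc: int = 0
    regs: List[int] = field(default_factory=lambda: [0] * NUM_REGS)
    running: Optional[SoftwareThread] = None

    def switch_out(self) -> SoftwareThread:
        """Interrupt the running thread and save its registers and PC to memory."""
        thread = self.running
        assert thread is not None, "nothing is running"
        thread.saved_regs = list(self.regs)
        thread.saved_pc = self.pc
        self.running = None
        return thread

    def switch_in(self, thread: SoftwareThread) -> None:
        """Load the saved registers, then jump to the saved PC."""
        self.regs = list(thread.saved_regs)
        self.pc = thread.saved_pc
        self.running = thread

    def step(self) -> None:
        """Stand-in for one instruction: bump a register and advance the PC."""
        self.regs[0] += 1
        self.pc += 4


hw = HardwareThread()
a = SoftwareThread("A", saved_pc=0x100)
b = SoftwareThread("B", saved_pc=0x200)
ready = [a, b]

for _ in range(4):              # four timer slices, round-robin
    hw.switch_in(ready.pop(0))
    for _ in range(3):          # three "instructions" per slice
        hw.step()
    ready.append(hw.switch_out())

print(a.name, hex(a.saved_pc), a.saved_regs)
print(b.name, hex(b.saved_pc), b.saved_regs)
```

Each software thread resumes exactly where it left off because its PC and registers were saved on the way out, even though both threads shared one hardware thread.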

20. Example: Four Cores
Thread pool: list of threads competing for processor
OS maps threads to cores and schedules logical (software) threads
Each "core" actively runs one instruction stream at a time
[Diagram: Core 1, Core 2, Core 3, Core 4]
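A thread pool feeding four cores can be sketched as a simple round-robin queue. The `schedule` function and the thread names are made up for this illustration, and real schedulers use priorities and blocking information rather than strict rotation.

```python
from collections import deque


def schedule(thread_names, num_cores, num_slices):
    """Return, for each time slice, the threads running on the cores.

    Threads are taken from the front of the pool; at the end of a slice the
    threads that ran are preempted and rejoin the back of the pool.
    """
    pool = deque(thread_names)
    timeline = []
    for _ in range(num_slices):
        running = [pool.popleft() for _ in range(min(num_cores, len(pool)))]
        timeline.append(running)
        pool.extend(running)
    return timeline


timeline = schedule([f"T{i}" for i in range(6)], num_cores=4, num_slices=3)
for slice_index, running in enumerate(timeline):
    print(f"slice {slice_index}: {running}")
```

With six software threads and four cores, two threads are always waiting in the pool, and every thread gets a turn within two slices.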

21. Multithreading
Typical scenario:
  Active thread encounters cache miss
  Active thread waits ~1000 cycles for data from DRAM
  → switch out and run different thread until data available
Problem
  Must save current thread state and load new thread state
    PC, all registers (could be many, e.g., AVX)
  → must perform switch in ≪ 1000 cycles
Can hardware help?
  Moore's Law: transistors are plenty

22. Hardware Assisted Software Multithreading
Two copies of PC and registers inside processor hardware
Looks identical to two processors to software (hardware thread 0, hardware thread 1)
Hyperthreading: both threads can be active simultaneously
[Diagram: Processor (1 core, 2 threads) with one control unit and one datapath (ALU) but two sets of state, PC 0 / Registers 0 and PC 1 / Registers 1; connected to memory (bytes) and to input and output through I/O-memory interfaces]

23. Multithreading
Logical threads
  ≈1% more hardware, ≈10% (?) better performance
  Separate registers
  Share datapath, ALU(s), caches
Multicore => duplicate processors
  ≈50% more hardware, ≈2X better performance?
Modern machines do both
  Multiple cores with multiple threads per core
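One way to read the slide's rough figures is as performance gained per unit of extra hardware. The numbers below are the slide's own estimates (both performance figures carry a question mark there), and the metric itself is an illustrative choice rather than something the lecture defines.

```python
# Slide estimates, expressed as fractions of a baseline single-threaded core.
options = {
    "logical threads": {"extra_hardware": 0.01, "extra_performance": 0.10},
    "multicore": {"extra_hardware": 0.50, "extra_performance": 1.00},
}

# Performance gained per unit of additional hardware (illustrative metric).
gain_per_hardware = {
    name: o["extra_performance"] / o["extra_hardware"] for name, o in options.items()
}

for name, gain in gain_per_hardware.items():
    print(f"{name}: {gain:.1f}x performance gain per unit of added hardware")
```

On these estimates logical threads are the cheaper improvement per transistor while multicore gives the larger absolute gain, which is consistent with modern machines doing both.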

24. Randy's Laptop
$ sysctl -a | grep hw
hw.physicalcpu: 2
hw.logicalcpu: 4
hw.l1icachesize: 32,768
hw.l1dcachesize: 32,768
hw.l2cachesize: 262,144
hw.l3cachesize: 4,194,304
2 cores, 4 threads total
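A portable way to query the logical CPU count from a program is Python's `os.cpu_count()`. It reports logical CPUs (hardware threads), so on a machine like the one on this slide it would be expected to return 4 rather than 2. The standard library has no call for the physical core count, so this sketch does not attempt one.

```python
import os

logical = os.cpu_count()  # number of logical CPUs, or None if it cannot be determined

if logical is None:
    print("logical CPU count could not be determined")
else:
    print(f"logical CPUs (hardware threads): {logical}")
```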

25. Example: 6 Cores, 24 Logical Threads
Thread pool: list of threads competing for processor
OS maps threads to cores and schedules logical (software) threads
4 logical (hardware) threads per core
[Diagram: Cores 1 to 6, each holding Thread 1, Thread 2, Thread 3, Thread 4]

26. Break!

28. Languages Supporting Parallel Programming

ActorScript    Concurrent Pascal   JoCaml      Orc
Ada            Concurrent ML       Join        Oz
Afnix          Concurrent Haskell  Java        Pict
Alef           Curry               Joule       Reia
Alice          CUDA                Joyce       SALSA
APL            E                   LabVIEW     Scala
Axum           Eiffel              Limbo       SISAL
Chapel         Erlang              Linda       SR
Cilk           Fortran 90          MultiLisp   Stackless Python
Clean          Go                  Modula-3    SuperPascal
Clojure        Io                  Occam       VHDL
Concurrent C   Janus               occam-π     XC

Which one to pick?

29. Why So Many Parallel Programming Languages?
Why "intrinsics"?
  To Intel: fix your #()&$! compiler!
  It's happening ... but
    SIMD features are continually added to compilers (Intel, gcc)
    Intense area of research
Research progress:
  20+ years to translate C into good (fast!) assembly
  How long to translate C into good (fast!) parallel code?
    General problem is very hard to solve
    Present state: specialized solutions for specific cases
    Your opportunity to become famous!