Memory Consistency, Cache Coherence, and Synchronization

In modern operating systems, program performance is tightly coupled to memory and caches. Under a multitasking scheduler, processes and threads must execute in parallel without interfering with one another: the system has to deliver high performance while still preserving the memory and cache coherence and consistency promised by the programming model. This chapter examines these problems and their solutions in detail. For application developers, it is also very helpful for understanding the multithread/multiprocess synchronization interfaces that programming languages provide.

1. Advanced Operating Systems (CS 202) — Memory Consistency, Cache Coherence and Synchronization. Jan 27, 2017. (Some cache coherence slides adapted from Ian Watson; some memory consistency slides from Sarita Adve.)

2. Classic Example
•  Suppose we have to implement a function to handle withdrawals from a bank account:

    withdraw (account, amount) {
        balance = get_balance(account);
        balance = balance – amount;
        put_balance(account, balance);
        return balance;
    }

•  Now suppose that you and your father share a bank account with a balance of $1000
•  Then you each go to separate ATM machines and simultaneously withdraw $100 from the account

3. Interleaved Schedules
•  The problem is that the execution of the two threads can be interleaved. Execution sequence as seen by the CPU:

    balance = get_balance(account);
    balance = balance – amount;
    --- context switch ---
    balance = get_balance(account);
    balance = balance – amount;
    put_balance(account, balance);
    --- context switch ---
    put_balance(account, balance);

•  What is the balance of the account now?
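To make the lost update concrete, here is a minimal sketch in C with POSIX threads (an illustration, not the slides' code; the balance variable and amounts are made up). Compiled with cc -pthread, the expected result is 800, but a run that interleaves as above prints 900:

    /* Illustrative race demo: both threads run the unlocked
     * read-modify-write, so an unlucky interleaving loses one
     * of the two withdrawals. */
    #include <pthread.h>
    #include <stdio.h>

    static int balance = 1000;            /* shared account balance */

    static void *withdraw(void *arg) {
        int amount = *(int *)arg;
        int b = balance;                  /* get_balance(account)        */
        b = b - amount;                   /* balance = balance - amount  */
        balance = b;                      /* put_balance(account, ...)   */
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        int amount = 100;
        pthread_create(&t1, NULL, withdraw, &amount);
        pthread_create(&t2, NULL, withdraw, &amount);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        /* Expected 800; a bad interleaving can leave 900. */
        printf("balance = %d\n", balance);
        return 0;
    }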

4. How Interleaved Can It Get?
How contorted can the interleavings be?
•  We'll assume that the only atomic operations are reads and writes of individual memory locations
–  Some architectures don't even give you that!
•  We'll assume that a context switch can occur at any time
•  We'll assume that you can delay a thread as long as you like, as long as it's not delayed forever

    ............... get_balance(account);
    balance = get_balance(account);
    balance = ...................................
    balance = balance – amount;
    balance = balance – amount;
    put_balance(account, balance);
    put_balance(account, balance);

5. Mutual Exclusion
•  Mutual exclusion is used to synchronize access to shared resources
–  This allows us to have larger atomic blocks
–  What does atomic mean?
•  Code that uses mutual exclusion is called a critical section
–  Only one thread at a time can execute in the critical section
–  All other threads are forced to wait on entry
–  When a thread leaves a critical section, another can enter
–  Example: sharing an ATM with others
•  What requirements would you place on a critical section?

6. Using Locks

    withdraw (account, amount) {
        acquire(lock);
        balance = get_balance(account);     // critical
        balance = balance – amount;         // section
        put_balance(account, balance);
        release(lock);
        return balance;
    }

•  Two threads interleaving, as seen by the CPU:

    Thread 1: acquire(lock);
    Thread 1: balance = get_balance(account);
    Thread 1: balance = balance – amount;
    Thread 2: acquire(lock);                 // blocks: lock is held
    Thread 1: put_balance(account, balance);
    Thread 1: release(lock);                 // Thread 2 may now enter
    Thread 2: balance = get_balance(account);
    Thread 2: balance = balance – amount;
    Thread 2: put_balance(account, balance);
    Thread 2: release(lock);

–  Why is the "return" outside the critical section? Is this ok?
–  What happens when a third thread calls acquire?
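One way to realize this with a real API is a POSIX mutex. A hedged sketch (names are illustrative, not the slides' code) that also answers the first question: the return can sit outside the critical section because it only touches the thread-local copy:

    /* Locked withdraw using a POSIX mutex (illustrative). The
     * critical section covers the whole read-modify-write; returning
     * the local copy b after release is safe because no shared state
     * is touched after the unlock. */
    #include <pthread.h>

    static pthread_mutex_t account_lock = PTHREAD_MUTEX_INITIALIZER;
    static int balance = 1000;

    int withdraw(int amount) {
        pthread_mutex_lock(&account_lock);      /* acquire(lock) */
        int b = balance;
        b = b - amount;
        balance = b;
        pthread_mutex_unlock(&account_lock);    /* release(lock) */
        return b;               /* outside the critical section */
    }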

7. Stepping back
•  What does the OS need to support?
–  And why? Isn't this an application/programming problem?
•  Synchronization is hard – why?
•  Synchronization can be a performance problem – why?
•  Other semantics than mutual exclusion are possible.

8. Implementing locks
•  Software implementations are possible
–  You should have seen Dekker's algorithm and possibly Peterson's algorithm (a sketch of Peterson's algorithm follows below)
–  They are difficult to get right
–  They make assumptions about the system that may no longer hold
•  (e.g., memory consistency, as we will see shortly)
•  Most systems offer hardware support
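For reference, a sketch of Peterson's algorithm for two threads (my rendering, not the slides'). Note the assumption it bakes in: with plain loads and stores it is only correct under sequential consistency, which is exactly what can break on real hardware, so this version uses C11 seq_cst atomics to restore the required ordering:

    /* Peterson's algorithm for two threads, ids 0 and 1 (illustrative).
     * C11 atomics default to sequentially consistent ordering, which
     * the classic formulation silently assumes. */
    #include <stdatomic.h>
    #include <stdbool.h>

    static atomic_bool flag[2];    /* flag[i]: thread i wants in      */
    static atomic_int  turn;       /* which thread must wait its turn */

    void peterson_lock(int i) {
        int other = 1 - i;
        atomic_store(&flag[i], true);     /* announce intent */
        atomic_store(&turn, other);       /* politely yield  */
        while (atomic_load(&flag[other]) && atomic_load(&turn) == other)
            ;   /* spin while the other thread is in or has priority */
    }

    void peterson_unlock(int i) {
        atomic_store(&flag[i], false);
    }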

9. Using Test-And-Set
•  Here is our lock implementation with test-and-set:

    struct lock {
        int held = 0;
    }

    void acquire (lock) {
        while (test-and-set(&lock->held))
            ;
    }

    void release (lock) {
        lock->held = 0;
    }

•  When will the while return? What is the value of held?
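A runnable rendering of this lock with C11 atomics (an illustrative translation, not the slides' code): atomic_flag_test_and_set atomically sets the flag to 1 and returns its previous value, so the while loop returns as soon as it observes an old value of 0, and held stays 1 for the whole critical section:

    #include <stdatomic.h>

    typedef struct {
        atomic_flag held;              /* clear = free, set = held */
    } lock_t;

    void acquire(lock_t *l) {
        while (atomic_flag_test_and_set(&l->held))
            ;                          /* old value was 1: keep spinning */
    }

    void release(lock_t *l) {
        atomic_flag_clear(&l->held);   /* held = 0 */
    }

A lock would be initialized as lock_t l = { ATOMIC_FLAG_INIT };. Spinning like this burns bus bandwidth under contention, which is what queue-based locks such as MCS (discussed later) improve on.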

10. Overview
•  Before we talk deeply about synchronization
–  Need to get an idea about the memory model in shared memory systems
–  Is synchronization only an issue in multi-processor systems?
•  What is a shared-memory multiprocessor (SMP)?
•  Shared memory processors
–  Two primary architectures:
•  Bus-based/local-network shared-memory machines (small scale)
•  Directory-based shared-memory machines (large scale)

11. Plan…
•  Introduce and discuss cache coherence
•  Discuss basic synchronization, up to MCS locks (from the paper we are reading)
•  Introduce memory consistency and its implications
•  Is this an architecture class???
–  The same issues manifest in large-scale distributed systems

12. CRASH COURSE ON CACHE COHERENCE

13. Bus-based Shared Memory Organization
The basic picture is simple:
[Diagram: several CPUs, each with its own cache, attached to a shared bus; shared memory sits on the same bus]

14. Organization
•  Bus is usually a simple physical connection (wires)
•  Bus bandwidth limits no. of CPUs
•  Could be multiple memory elements
•  For now, assume that each CPU has only a single level of cache
•  Other organizations (e.g., with a network) have NUMA issues

15. Problem of Memory Coherence
•  Assume just single-level caches and main memory
•  Processor writes to a location in its cache
•  Other caches may hold shared copies – these will be out of date
•  Updating main memory alone is not enough
•  What happens if two updates happen at (nearly) the same time?
–  Can two different processors see them out of order?

16. Example
[Diagram: CPUs 1, 2, 3, each with a cache, on a shared bus; shared memory holds X: 24]
•  Processor 1 reads X: obtains 24 from memory and caches it
•  Processor 2 reads X: obtains 24 from memory and caches it
•  Processor 1 writes 32 to X: its locally cached copy is updated
•  Processor 3 reads X: what value should it get?
–  Memory and processor 2 think it is 24
–  Processor 1 thinks it is 32
•  Notice that having write-through caches is not good enough

17. Cache Coherence
•  Try to make the system behave as if there are no caches!
•  How? Idea: Try to make every CPU know who has a copy of its cached data?
–  Too complex!
•  More practical:
–  Snoopy caches
•  Each CPU snoops the memory bus
•  Looks for read/write activity concerned with data addresses which it has cached
–  What does it do with them?
•  This assumes a bus structure where all communication can be seen by all
•  More scalable solution: 'directory-based' coherence schemes

18. Snooping Protocols
•  Write Invalidate
–  CPU with write operation sends invalidate message
–  Snooping caches invalidate their copy
–  CPU writes to its cached copy
•  Write through or write back?
–  Any shared read in other CPUs will now miss in cache and re-fetch the new data

19. Snooping Protocols
•  Write Update
–  CPU with write updates its own copy
–  All snooping caches update their copy
•  Note that in both schemes, the problem of simultaneous writes is taken care of by bus arbitration – only one CPU can use the bus at any one time
•  Harder problem for arbitrary networks

20. Update or Invalidate?
•  Which should we use?
•  Bus bandwidth is a precious commodity in shared memory multi-processors
–  Contention/cache interrogation can lead to a 10x or more drop in performance
–  (also important to minimize false sharing – see the sketch after this slide)
•  Therefore, invalidate protocols are used in most commercial SMPs
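As an aside on false sharing, a small sketch (assumes 64-byte cache lines; not from the slides). With the pad field, the two counters live on separate lines; remove it and the loop can run several times slower, because every increment invalidates the other CPU's copy and the line ping-pongs across the bus:

    #include <pthread.h>

    struct padded {
        long count;
        char pad[64 - sizeof(long)];   /* keep neighbors off this line */
    };

    static struct padded counters[2];

    static void *worker(void *arg) {
        long id = (long)arg;
        for (long i = 0; i < 100000000L; i++)
            counters[id].count++;      /* private line: no invalidations */
        return NULL;
    }

    int main(void) {
        pthread_t t[2];
        for (long i = 0; i < 2; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (int i = 0; i < 2; i++)
            pthread_join(t[i], NULL);
        return 0;
    }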

21. Implementation Issues
•  In both schemes, knowing if a cached value is not shared (copy in another cache) can avoid sending any messages
•  The invalidate description assumed that a cache value update was written through to memory. If we used a 'copy back' scheme, other processors could re-fetch the old value on a cache miss
•  We need a protocol to handle all this

22. MESI – locally initiated accesses
[State diagram: transitions among Invalid, Shared, Exclusive and Modified driven by local Read Hits/Misses and Write Hits/Misses; bus transactions: Mem Read, Invalidate, RWITM (read with intent to modify); Read Miss(sh) = another cache holds the line, Read Miss(ex) = no other cached copy]

23. MESI – remotely initiated accesses
[State diagram: a snooped Mem Read moves Exclusive/Modified lines to Shared (a Modified line is copied back); a snooped Invalidate or RWITM moves any state to Invalid]

24. Both together
[Combined MESI state diagram]
Image credit: Wikipedia page on MESI. By Jugones55 – Own work, GFDL, https://commons.wikimedia.org/w/index.php?curid=7136764
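The locally initiated half of the protocol can be written down as a small transition function. A hypothetical sketch (the enum names and bus-action comments are mine, following the diagram; the miss events carry whether another cache holds the line, as in Read Miss(sh) vs Read Miss(ex)):

    typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_state;
    typedef enum {
        READ_HIT,
        READ_MISS_SHARED,      /* another cache has the line */
        READ_MISS_EXCLUSIVE,   /* no other cached copy       */
        WRITE_HIT,
        WRITE_MISS
    } access_type;

    mesi_state local_access(mesi_state s, access_type a) {
        switch (a) {
        case READ_HIT:            return s;         /* no bus traffic */
        case READ_MISS_SHARED:    return SHARED;    /* bus: Mem Read  */
        case READ_MISS_EXCLUSIVE: return EXCLUSIVE; /* bus: Mem Read  */
        case WRITE_HIT:           /* from S: bus Invalidate first;    */
                                  /* from E or M: silent upgrade      */
                                  return MODIFIED;
        case WRITE_MISS:          return MODIFIED;  /* bus: RWITM     */
        }
        return s;
    }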

25. MESI notes
•  There are other protocols and minor variations (particularly to do with write miss)
•  Normal 'write back' when a cache line is evicted is done if the line state is M
•  Multi-level caches
–  If caches are inclusive, only the lowest-level cache needs to snoop on the bus
•  Most modern CPUs have inclusive caches
•  But they don't perform as well as non-inclusive caches

26. Cache Coherence summary
•  Reads and writes are atomic
–  What does atomic mean?
•  As if there is no cache
•  Some magic to make things work
–  Has performance implications
–  …and therefore, has implications for the performance of programs

27. Directory Schemes
•  Snoopy schemes do not scale because they rely on broadcast
•  Directory-based schemes allow scaling
–  They avoid broadcasts by keeping track of all PEs caching a memory block, and then using point-to-point messages to maintain coherence
–  They allow the flexibility to use any scalable point-to-point network

28. Basic Scheme (Censier & Feautrier)
[Diagram: processors with caches connected through an interconnection network to memory plus a directory; the directory holds presence bits and a dirty bit per memory block]
•  Assume "k" processors.
•  With each cache block in memory: k presence bits, and 1 dirty bit
•  With each cache block in a cache: 1 valid bit, and 1 dirty (owner) bit
–  Read from main memory by PE-i:
•  If dirty-bit is OFF then { read from main memory; turn p[i] ON; }
•  If dirty-bit is ON then { recall line from dirty PE (cache state to shared); update memory; turn dirty-bit OFF; turn p[i] ON; supply recalled data to PE-i; }
–  Write to main memory:
•  If dirty-bit OFF then { send invalidations to all PEs caching that block; turn dirty-bit ON; turn p[i] ON; ... }
•  ...
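Rendered as data structures, the scheme's bookkeeping is small. A hypothetical sketch of the per-block directory entry and the read path (K, recall_line and supply_data are made-up placeholder names, not a real API):

    #include <stdbool.h>

    #define K 64                 /* assumed number of processors */

    struct dir_entry {
        bool p[K];               /* presence bits: p[i] set if PE-i caches the block */
        bool dirty;              /* set if exactly one PE holds a modified copy      */
    };

    /* Read of a block by PE-i, following the slide's two cases. */
    void directory_read(struct dir_entry *e, int i) {
        if (!e->dirty) {
            /* memory is up to date: just record the new sharer */
            e->p[i] = true;      /* read from main memory; turn p[i] ON */
        } else {
            /* recall_line(owner): owner's cache state goes to shared,
             * and memory is updated with the recalled data */
            e->dirty = false;    /* turn dirty-bit OFF */
            e->p[i] = true;      /* turn p[i] ON       */
            /* supply_data(i): send the recalled line to PE-i */
        }
    }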

29. Key Issues
•  Scaling of memory and directory bandwidth
–  Cannot have main memory or directory memory centralized
–  Need a distributed memory and directory structure
•  Directory memory requirements do not scale well
–  Number of presence bits grows with number of PEs
–  Many ways to get around this problem
•  Limited pointer schemes of many flavors
•  Industry standard
–  SCI: Scalable Coherent Interface