Memory Consistency, Cache Coherence, and Synchronization

In modern operating systems, program performance is tightly coupled to memory and caches. Under a multitasking scheduler, processes and threads must execute in parallel without interfering with one another: the system has to deliver high performance while still preserving the memory and cache coherence and consistency promised by the programming model. This chapter examines these problems and their solutions in detail. For application developers, it is also very helpful for understanding the multithread/multiprocess synchronization interfaces that programming languages provide.

1. Advanced Operating Systems (CS 202) — Memory Consistency, Cache Coherence and Synchronization. Jan 27, 2017. (Some cache coherence slides adapted from Ian Watson; some memory consistency slides from Sarita Adve.)

2. Classic Example
•  Suppose we have to implement a function to handle withdrawals from a bank account:

    withdraw (account, amount) {
        balance = get_balance(account);
        balance = balance – amount;
        put_balance(account, balance);
        return balance;
    }

•  Now suppose that you and your father share a bank account with a balance of $1000
•  Then you each go to separate ATM machines and simultaneously withdraw $100 from the account

3. Interleaved Schedules
•  The problem is that the execution of the two threads can be interleaved. Execution sequence as seen by the CPU:

    balance = get_balance(account);
    balance = balance – amount;
    --- context switch ---
    balance = get_balance(account);
    balance = balance – amount;
    put_balance(account, balance);
    --- context switch ---
    put_balance(account, balance);

•  What is the balance of the account now?
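To make the lost update concrete, here is a minimal sketch in C with POSIX threads (an illustration, not the slides' code; the balance variable and amounts are made up). Compiled with cc -pthread, the expected result is 800, but a run that interleaves as above prints 900:

    /* Illustrative race demo: both threads run the unlocked
     * read-modify-write, so an unlucky interleaving loses one
     * of the two withdrawals. */
    #include <pthread.h>
    #include <stdio.h>

    static int balance = 1000;            /* shared account balance */

    static void *withdraw(void *arg) {
        int amount = *(int *)arg;
        int b = balance;                  /* get_balance(account)        */
        b = b - amount;                   /* balance = balance - amount  */
        balance = b;                      /* put_balance(account, ...)   */
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        int amount = 100;
        pthread_create(&t1, NULL, withdraw, &amount);
        pthread_create(&t2, NULL, withdraw, &amount);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        /* Expected 800; a bad interleaving can leave 900. */
        printf("balance = %d\n", balance);
        return 0;
    }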

4. How Interleaved Can It Get?
How contorted can the interleavings be?
•  We'll assume that the only atomic operations are reads and writes of individual memory locations
–  Some architectures don't even give you that!
•  We'll assume that a context switch can occur at any time
•  We'll assume that you can delay a thread as long as you like, as long as it's not delayed forever

    ............... get_balance(account);
    balance = get_balance(account);
    balance = ...................................
    balance = balance – amount;
    balance = balance – amount;
    put_balance(account, balance);
    put_balance(account, balance);

5. Mutual Exclusion
•  Mutual exclusion is used to synchronize access to shared resources
–  This allows us to have larger atomic blocks
–  What does atomic mean?
•  Code that uses mutual exclusion is called a critical section
–  Only one thread at a time can execute in the critical section
–  All other threads are forced to wait on entry
–  When a thread leaves a critical section, another can enter
–  Example: sharing an ATM with others
•  What requirements would you place on a critical section?

6. Using Locks

    withdraw (account, amount) {
        acquire(lock);
        balance = get_balance(account);     // critical
        balance = balance – amount;         // section
        put_balance(account, balance);
        release(lock);
        return balance;
    }

•  Two threads interleaving, as seen by the CPU:

    Thread 1: acquire(lock);
    Thread 1: balance = get_balance(account);
    Thread 1: balance = balance – amount;
    Thread 2: acquire(lock);                 // blocks: lock is held
    Thread 1: put_balance(account, balance);
    Thread 1: release(lock);                 // Thread 2 may now enter
    Thread 2: balance = get_balance(account);
    Thread 2: balance = balance – amount;
    Thread 2: put_balance(account, balance);
    Thread 2: release(lock);

–  Why is the "return" outside the critical section? Is this ok?
–  What happens when a third thread calls acquire?
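One way to realize this with a real API is a POSIX mutex. A hedged sketch (names are illustrative, not the slides' code) that also answers the first question: the return can sit outside the critical section because it only touches the thread-local copy:

    /* Locked withdraw using a POSIX mutex (illustrative). The
     * critical section covers the whole read-modify-write; returning
     * the local copy b after release is safe because no shared state
     * is touched after the unlock. */
    #include <pthread.h>

    static pthread_mutex_t account_lock = PTHREAD_MUTEX_INITIALIZER;
    static int balance = 1000;

    int withdraw(int amount) {
        pthread_mutex_lock(&account_lock);      /* acquire(lock) */
        int b = balance;
        b = b - amount;
        balance = b;
        pthread_mutex_unlock(&account_lock);    /* release(lock) */
        return b;               /* outside the critical section */
    }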

7. Stepping back
•  What does the OS need to support?
–  And why? Isn't this an application/programming problem?
•  Synchronization is hard – why?
•  Synchronization can be a performance problem – why?
•  Other semantics than mutual exclusion are possible.

8. Implementing locks
•  Software implementations are possible
–  You should have seen Dekker's algorithm and possibly Peterson's algorithm (a sketch of Peterson's algorithm follows below)
–  They are difficult to get right
–  They make assumptions about the system that may no longer hold
•  (e.g., memory consistency, as we will see shortly)
•  Most systems offer hardware support
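For reference, a sketch of Peterson's algorithm for two threads (my rendering, not the slides'). Note the assumption it bakes in: with plain loads and stores it is only correct under sequential consistency, which is exactly what can break on real hardware, so this version uses C11 seq_cst atomics to restore the required ordering:

    /* Peterson's algorithm for two threads, ids 0 and 1 (illustrative).
     * C11 atomics default to sequentially consistent ordering, which
     * the classic formulation silently assumes. */
    #include <stdatomic.h>
    #include <stdbool.h>

    static atomic_bool flag[2];    /* flag[i]: thread i wants in      */
    static atomic_int  turn;       /* which thread must wait its turn */

    void peterson_lock(int i) {
        int other = 1 - i;
        atomic_store(&flag[i], true);     /* announce intent */
        atomic_store(&turn, other);       /* politely yield  */
        while (atomic_load(&flag[other]) && atomic_load(&turn) == other)
            ;   /* spin while the other thread is in or has priority */
    }

    void peterson_unlock(int i) {
        atomic_store(&flag[i], false);
    }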

9. Using Test-And-Set
•  Here is our lock implementation with test-and-set:

    struct lock {
        int held = 0;
    }

    void acquire (lock) {
        while (test-and-set(&lock->held))
            ;
    }

    void release (lock) {
        lock->held = 0;
    }

•  When will the while return? What is the value of held?
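A runnable rendering of this lock with C11 atomics (an illustrative translation, not the slides' code): atomic_flag_test_and_set atomically sets the flag to 1 and returns its previous value, so the while loop returns as soon as it observes an old value of 0, and held stays 1 for the whole critical section:

    #include <stdatomic.h>

    typedef struct {
        atomic_flag held;              /* clear = free, set = held */
    } lock_t;

    void acquire(lock_t *l) {
        while (atomic_flag_test_and_set(&l->held))
            ;                          /* old value was 1: keep spinning */
    }

    void release(lock_t *l) {
        atomic_flag_clear(&l->held);   /* held = 0 */
    }

A lock would be initialized as lock_t l = { ATOMIC_FLAG_INIT };. Spinning like this burns bus bandwidth under contention, which is what queue-based locks such as MCS (discussed later) improve on.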

10. Overview
•  Before we talk deeply about synchronization
–  Need to get an idea about the memory model in shared memory systems
–  Is synchronization only an issue in multi-processor systems?
•  What is a shared-memory multiprocessor (SMP)?
•  Shared memory processors
–  Two primary architectures:
•  Bus-based/local-network shared-memory machines (small scale)
•  Directory-based shared-memory machines (large scale)

11. Plan…
•  Introduce and discuss cache coherence
•  Discuss basic synchronization, up to MCS locks (from the paper we are reading)
•  Introduce memory consistency and its implications
•  Is this an architecture class???
–  The same issues manifest in large-scale distributed systems

12. CRASH COURSE ON CACHE COHERENCE

13. Bus-based Shared Memory Organization
The basic picture is simple:
[Diagram: several CPUs, each with its own cache, attached to a shared bus; shared memory sits on the same bus]

14. Organization
•  Bus is usually a simple physical connection (wires)
•  Bus bandwidth limits no. of CPUs
•  Could be multiple memory elements
•  For now, assume that each CPU has only a single level of cache
•  Other organizations (e.g., with a network) have NUMA issues

15. Problem of Memory Coherence
•  Assume just single-level caches and main memory
•  Processor writes to a location in its cache
•  Other caches may hold shared copies – these will be out of date
•  Updating main memory alone is not enough
•  What happens if two updates happen at (nearly) the same time?
–  Can two different processors see them out of order?

16. Example
[Diagram: CPUs 1, 2, 3, each with a cache, on a shared bus; shared memory holds X: 24]
•  Processor 1 reads X: obtains 24 from memory and caches it
•  Processor 2 reads X: obtains 24 from memory and caches it
•  Processor 1 writes 32 to X: its locally cached copy is updated
•  Processor 3 reads X: what value should it get?
–  Memory and processor 2 think it is 24
–  Processor 1 thinks it is 32
•  Notice that having write-through caches is not good enough

17. Cache Coherence
•  Try to make the system behave as if there are no caches!
•  How? Idea: Try to make every CPU know who has a copy of its cached data?
–  Too complex!
•  More practical:
–  Snoopy caches
•  Each CPU snoops the memory bus
•  Looks for read/write activity concerned with data addresses which it has cached
–  What does it do with them?
•  This assumes a bus structure where all communication can be seen by all
•  More scalable solution: 'directory-based' coherence schemes

18. Snooping Protocols
•  Write Invalidate
–  CPU with write operation sends invalidate message
–  Snooping caches invalidate their copy
–  CPU writes to its cached copy
•  Write through or write back?
–  Any shared read in other CPUs will now miss in cache and re-fetch the new data

19. Snooping Protocols
•  Write Update
–  CPU with write updates its own copy
–  All snooping caches update their copy
•  Note that in both schemes, the problem of simultaneous writes is taken care of by bus arbitration – only one CPU can use the bus at any one time
•  Harder problem for arbitrary networks

20. Update or Invalidate?
•  Which should we use?
•  Bus bandwidth is a precious commodity in shared memory multi-processors
–  Contention/cache interrogation can lead to a 10x or more drop in performance
–  (also important to minimize false sharing – see the sketch after this slide)
•  Therefore, invalidate protocols are used in most commercial SMPs
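As an aside on false sharing, a small sketch (assumes 64-byte cache lines; not from the slides). With the pad field, the two counters live on separate lines; remove it and the loop can run several times slower, because every increment invalidates the other CPU's copy and the line ping-pongs across the bus:

    #include <pthread.h>

    struct padded {
        long count;
        char pad[64 - sizeof(long)];   /* keep neighbors off this line */
    };

    static struct padded counters[2];

    static void *worker(void *arg) {
        long id = (long)arg;
        for (long i = 0; i < 100000000L; i++)
            counters[id].count++;      /* private line: no invalidations */
        return NULL;
    }

    int main(void) {
        pthread_t t[2];
        for (long i = 0; i < 2; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (int i = 0; i < 2; i++)
            pthread_join(t[i], NULL);
        return 0;
    }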

21. Implementation Issues
•  In both schemes, knowing if a cached value is not shared (copy in another cache) can avoid sending any messages
•  The invalidate description assumed that a cache value update was written through to memory. If we used a 'copy back' scheme, other processors could re-fetch the old value on a cache miss
•  We need a protocol to handle all this

22. MESI – locally initiated accesses
[State diagram: transitions among Invalid, Shared, Exclusive and Modified driven by local Read Hits/Misses and Write Hits/Misses; bus transactions: Mem Read, Invalidate, RWITM (read with intent to modify); Read Miss(sh) = another cache holds the line, Read Miss(ex) = no other cached copy]

23. MESI – remotely initiated accesses
[State diagram: a snooped Mem Read moves Exclusive/Modified lines to Shared (a Modified line is copied back); a snooped Invalidate or RWITM moves any state to Invalid]

24. Both together
[Combined MESI state diagram]
Image credit: Wikipedia page on MESI. By Jugones55 – Own work, GFDL, https://commons.wikimedia.org/w/index.php?curid=7136764
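The locally initiated half of the protocol can be written down as a small transition function. A hypothetical sketch (the enum names and bus-action comments are mine, following the diagram; the miss events carry whether another cache holds the line, as in Read Miss(sh) vs Read Miss(ex)):

    typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_state;
    typedef enum {
        READ_HIT,
        READ_MISS_SHARED,      /* another cache has the line */
        READ_MISS_EXCLUSIVE,   /* no other cached copy       */
        WRITE_HIT,
        WRITE_MISS
    } access_type;

    mesi_state local_access(mesi_state s, access_type a) {
        switch (a) {
        case READ_HIT:            return s;         /* no bus traffic */
        case READ_MISS_SHARED:    return SHARED;    /* bus: Mem Read  */
        case READ_MISS_EXCLUSIVE: return EXCLUSIVE; /* bus: Mem Read  */
        case WRITE_HIT:           /* from S: bus Invalidate first;    */
                                  /* from E or M: silent upgrade      */
                                  return MODIFIED;
        case WRITE_MISS:          return MODIFIED;  /* bus: RWITM     */
        }
        return s;
    }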

25. MESI notes
•  There are other protocols and minor variations (particularly to do with write miss)
•  Normal 'write back' when a cache line is evicted is done if the line state is M
•  Multi-level caches
–  If caches are inclusive, only the lowest-level cache needs to snoop on the bus
•  Most modern CPUs have inclusive caches
•  But they don't perform as well as non-inclusive caches

26. Cache Coherence summary
•  Reads and writes are atomic
–  What does atomic mean?
•  As if there is no cache
•  Some magic to make things work
–  Has performance implications
–  …and therefore, has implications for the performance of programs

27. Directory Schemes
•  Snoopy schemes do not scale because they rely on broadcast
•  Directory-based schemes allow scaling
–  They avoid broadcasts by keeping track of all PEs caching a memory block, and then using point-to-point messages to maintain coherence
–  They allow the flexibility to use any scalable point-to-point network

28. Basic Scheme (Censier & Feautrier)
[Diagram: processors with caches connected through an interconnection network to memory plus a directory; the directory holds presence bits and a dirty bit per memory block]
•  Assume "k" processors.
•  With each cache block in memory: k presence bits, and 1 dirty bit
•  With each cache block in a cache: 1 valid bit, and 1 dirty (owner) bit
–  Read from main memory by PE-i:
•  If dirty-bit is OFF then { read from main memory; turn p[i] ON; }
•  If dirty-bit is ON then { recall line from dirty PE (cache state to shared); update memory; turn dirty-bit OFF; turn p[i] ON; supply recalled data to PE-i; }
–  Write to main memory:
•  If dirty-bit OFF then { send invalidations to all PEs caching that block; turn dirty-bit ON; turn p[i] ON; ... }
•  ...
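Rendered as data structures, the scheme's bookkeeping is small. A hypothetical sketch of the per-block directory entry and the read path (K, recall_line and supply_data are made-up placeholder names, not a real API):

    #include <stdbool.h>

    #define K 64                 /* assumed number of processors */

    struct dir_entry {
        bool p[K];               /* presence bits: p[i] set if PE-i caches the block */
        bool dirty;              /* set if exactly one PE holds a modified copy      */
    };

    /* Read of a block by PE-i, following the slide's two cases. */
    void directory_read(struct dir_entry *e, int i) {
        if (!e->dirty) {
            /* memory is up to date: just record the new sharer */
            e->p[i] = true;      /* read from main memory; turn p[i] ON */
        } else {
            /* recall_line(owner): owner's cache state goes to shared,
             * and memory is updated with the recalled data */
            e->dirty = false;    /* turn dirty-bit OFF */
            e->p[i] = true;      /* turn p[i] ON       */
            /* supply_data(i): send the recalled line to PE-i */
        }
    }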

29. Key Issues
•  Scaling of memory and directory bandwidth
–  Cannot have main memory or directory memory centralized
–  Need a distributed memory and directory structure
•  Directory memory requirements do not scale well
–  Number of presence bits grows with number of PEs
–  Many ways to get around this problem
•  Limited pointer schemes of many flavors
•  Industry standard
–  SCI: Scalable Coherent Interface