存储器一致性和事务内存

本文讲述了存储器一致性和事务内存的基本知识。举例说明了二者之间的区别,介绍了多线程程序单处理器。
展开查看详情

1.Advanced Operating Systems (CS 202) Memory Consistency and Transactional Memory Feb. 1, 2017

2. Lets start with an example Code: Initially A = Flag = 0 P1 P2 A = 23; while (Flag != 1) {;} Flag = 1; B = A; Idea: –  P1 writes data into A and sets Flag to tell P2 that data value can be read from A. –  P2 waits till Flag is set and then reads data from A. –  What possible values can B have?

3. Multi-threaded programs on uniprocessor •  Processor executes all threads of program –  unspecified scheduling policy •  Operations in each thread are P executed in order •  Atomic operations: lock/unlock etc. for synchronization between threads •  Result is as if instructions from different threads were interleaved in some order •  Non-determinacy: program may produce different outputs depending MEMORY on scheduling of threads (eg) Thread 1 Thread 2 ….. …… x := 1; print(x); x := 2;

4. Multi-threaded programs on multiprocessor •  Each processor executes one thread –  let’s keep it simple •  Operations in each thread are P P P executed in order •  One processor at a time can access global memory to perform load/ store/atomic operations –  Assume no caching of global data •  You can show that running multi- MEMORY threaded program on multiple processors does not change possible output(s) of program from uniprocessor case

5. More realistic architecture •  Two key assumptions so far: 1.  processors do not cache global data •  improving execution efficiency: –  allow caching »  leads to cache coherence solved as we discussed 2.  Instructions are executed in order •  improving execution efficiency: –  allow processors to execute instructions out of order subject to data/control dependences »  this can change the semantics of the program! »  Reordering happens for other reasons too »  preventing this requires attention to memory consistency model of processor

6. Recall: uniprocessor execution •  Processors reorder or change operations to improve performance –  Registers may eliminate some loads and stores –  Load/store buffers may delay/reorder memory accesses –  Lockup free caches; split transactions buses; … •  Constraint on reordering: must respect dependences –  But only sequential ones •  Reorderings can be performed either by compiler or processor

7. Permitted memory-op reorderings •  Stores to different memory locations can be performed out of program order store v1, data store b1, flag store b1, flag ßà store v1, data •  Loads from different memory locations can be performed out of program order load flag, r1 load data,r2 load data, r2 ßà load flag, r1 •  Load and store to different memory locations can be performed out of program order

8. Example of hardware reordering Load bypassing Store buffer Processor Memory system •  Store buffer holds store operations that need to be sent to memory •  Loads are higher priority operations than stores since their results are needed to keep processor busy, so they bypass the store buffer •  Load address is checked against addresses in store buffer, so store buffer satisfies load if there is an address match •  Result: load can bypass stores to other addresses

9. Problem in multiprocessor context •  Canonical model –  operations from given processor are executed in program order –  memory operations from different processors appear to be interleaved in some order at the memory •  Question: –  If a processor is allowed to reorder independent operations in its own instruction stream, will the execution always produce the same results as the canonical model? –  Answer: no. Let us look at some examples.

10. Example (I) Code: Initially A = Flag = 0 P1 P2 A = 23; while (Flag != 1) {;} Flag = 1; ... = A; Idea: –  P1 writes data into A and sets Flag to tell P2 that data value can be read from A. –  P2 waits till Flag is set and then reads data from A.

11. Execution Sequence for (I) Code: Initially A = Flag = 0 P1 P2 A = 23; while (Flag != 1) {;} Flag = 1; ... = A; Possible execution sequence on each processor: P1 P2 Write A 23 Read Flag //get 0 Write Flag 1 …… Read Flag //get 1 Read A //what do you get? Problem: If the two writes on processor P1 can be reordered, it is possible for processor P2 to read 0 from variable A. Can happen on most modern processors.

12. Example II Code: (like Dekker’s algorithm) Initially Flag1 = Flag2 = 0 P1 P2 Flag1 = 1; Flag2 = 1; If (Flag2 == 0) If (Flag1 == 0) critical section critical section Possible execution sequence on each processor: P1 P2 Write Flag1, 1 Write Flag2, 1 Read Flag2 //get 0 Read Flag1 //what do you get?

13. Execution sequence for (II) Code: (like Dekker’s algorithm) Initially Flag1 = Flag2 = 0 P1 P2 Flag1 = 1; Flag2 = 1; If (Flag2 == 0) If (Flag1 == 0) critical section critical section Possible execution sequence on each processor: P1 P2 Write Flag1, 1 Write Flag2, 1 Read Flag2 //get 0 Read Flag1, ?? Most people would say that P2 will read 1 as the value of Flag1. Since P1 reads 0 as the value of Flag2, P1’s read of Flag2 must happen before P2 writes to Flag2. Intuitively, we would expect P1’s write of Flag to happen before P2’s read of Flag1. However, this is true only if reads and writes on the same processor to different locations are not reordered by the compiler or the hardware. Unfortunately, this is very common on most processors (store-buffers with load- bypassing).

14. Lessons •  Uniprocessors can reorder instructions subject only to control and data dependence constraints •  These constraints are not sufficient in shared- memory context –  simple parallel programs may produce counter- intuitive results •  Question: what constraints must we put on uniprocessor instruction reordering so that –  shared-memory programming is intuitive –  but we do not lose uniprocessor performance? •  Many answers to this question –  answer is called memory consistency model supported by the processor

15. Consistency models -  Consistency models are not about memory operations from different processors. -  Consistency models are not about dependent memory operations in a single processor’s instruction stream (these are respected even by processors that reorder instructions). -  Consistency models are all about ordering constraints on independent memory operations in a single processor’s instruction stream that have some high-level dependence (such as flags guarding data) that should be respected to obtain intuitively reasonable results.

16. Simplest Memory Consistency Model •  Sequential consistency (SC) [Lamport] –  our canonical model: processor is not allowed to reorder reads and writes to global memory P1 P2 P3 Pn MEMORY

17. Sequential Consistency •  SC constrains all memory operations: •  Write → Read •  Write → Write •  Read → Read, Write -  Simple model for reasoning about parallel programs -  You can verify that the examples considered earlier work correctly under sequential consistency. -  However, this simplicity comes at the cost of performance. -  Question: how do we reconcile sequential consistency model with the demands of performance?

18. Relaxed consistency •  Allow reordering for performance, but provide a safety net -  E.g., Processor has fence instruction: -  Accesses before fence in program order must complete before fence -  Accesses after fence in program order must wait for fence -  Fences are performed in program order -  Weak consistency: programmer puts fences where reordering is not acceptable -  Implementation of fence: -  processor has counter that is incremented when data op is issued, and decremented when data op is completed -  Example: PowerPC has SYNC instruction -  Language constructs: -  OpenMP: flush -  All synchronization operations like lock and unlock act like a fence

19. Weak ordering picture fence program Memory operations within these execution fence regions can be reordered fence

20. Example revisited Code: Initially A = Flag = 0 P1 P2 A = 23; flush; ßmemory fence while (Flag != 1) {;} Flag = 1; B = A; Execution: –  P1 writes data into A –  Flush waits till write to A is completed –  P1 then writes data to Flag –  Therefore, if P2 sees Flag = 1, it is guaranteed that it will read the correct value of A even if memory operations in P1 before flush and memory operations after flush are reordered by the hardware or compiler. –  Question: does P2 need a flush between the two statements?

21. Another relaxed model: release consistency -  Further relaxation of weak consistency -  Synchronization accesses are divided into -  Acquires: operations like lock -  Release: operations like unlock -  Semantics of acquire: -  Acquire must complete before all following memory accesses -  Semantics of release: -  all memory operations before release are complete -  However, -  acquire does not wait for accesses preceding it -  accesses after release in program order do not have to wait for release -  operations which follow release and which need to wait must be protected by an acquire

22.Implementations on Current Processors

23. Comments •  In the literature, there are a large number of other consistency models –  E.g., Eventual consistency –  We will revisit some later… •  It is important to remember that these are concerned with reordering of independent memory operations within a processor. •  Easy to come up with shared-memory programs that behave differently for each consistency model. –  Therefore, we have to be careful with concurrency primitives! –  How do we get them right? –  How do we make them portable?

24. Summary •  Two problems: memory consistency and cache coherence •  Coherence defines order of accesses to the same location –  Does not really complicate consistency •  Memory consistency model –  what instructions is compiler or hardware allowed to reorder? –  sequential consistency: perform global memory operations in program order –  relaxed consistency models: all of them rely on some safety mechanisms to restrict reordering when necessary

25. Transactional memory (FYI – Not required) slide credit: slides adapted from several presentations, including stanford TCC group and MIT superTech group

26. Motivation •  Uniprocessor Systems –  Frequency –  Power consumption –  Wire delay limits scalability –  Design complexity vs. verification effort –  Where is ILP? •  Support for multiprocessor or multicore systems –  Replicate small, simple cores, design is scalable –  Faster design turnaround time, Time to market –  Exploit TLP, in addition to ILP within each core –  But now we have new problems 26

27. Parallel Software Problems •  Parallel systems are often programmed with –  Synchronization through barriers –  Shared objects access control through locks •  Lock granularity and organization must balance performance and correctness –  Coarse-grain locking: Lock contention –  Fine-grain locking: Extra overhead –  Must be careful to avoid deadlocks or data races –  Must be careful not to leave anything unprotected •  Performance tuning is not intuitive –  Performance bottlenecks are related to low level events •  E.g. false sharing, coherence misses •  Feedback is often indirect (cache 27 lines, rather than variables)

28. Parallel Hardware Complexity (TCC’s view) •  Cache coherence protocols are complex –  Must track ownership of cache lines •  Consistency protocols are complex –  Must provide rules to correctly order individual loads/stores –  Difficult for both hardware and software •  Current protocols rely on low latency, not bandwidth –  Critical short control messages on ownership transfers –  Latency of short messages unlikely to scale well in the future –  Bandwidth is likely to scale much better •  High speed interchip connections 28

29. What do we want? •  A shared memory system with –  A simple, easy programming model (unlike message passing) –  A simple, low-complexity hardware implementation (unlike shared memory) –  Good performance 29