04-Automated Parallelization in Software

Out-of-order Superscalars and their Limitations Static Instruction Scheduling
展开查看详情

1.CSC2/458 Parallel and Distributed Systems Automated Parallelization in Software Sreepathi Pai January 30, 2018 URCS

2.Outline Out-of-order Superscalars and their Limitations Static Instruction Scheduling

3.Outline Out-of-order Superscalars and their Limitations Static Instruction Scheduling

4.How will a processor parallelize this? for(i = 0; i < A; i++) { sum1 = sum1 + i; } for(j = 0; j < A; j++) { sum2 = sum2 + j; }

5.Dynamic Instruction Stream i = 0 i < A (true) sum1 = sum1 + 0 i++ i < A (true) sum1 = sum1 + 1 i++ ... i < A (false) j = 0 j < A (true) sum2 = sum2 + 0 j++ j < A (true) sum2 = sum2 + 1 j++ j < A (true) ...

6.An Intel Processor Pipeline Source: Intel

7.Instruction Pipeline • Instructions flow into “issue window” • from dynamic instruction stream • Dependences are calculated and resources allocated • Independent instructions are dispatched to backend out-of-order • Instructions are retired in-order using a “reorder buffer”

8.Outline Out-of-order Superscalars and their Limitations Static Instruction Scheduling

9.VLIW Processors • Very Long Instruction Word Processors • Can execute multiple instructions at the same time • So superscalar • But leaves independence checking to the compiler • Compiler packs instructions into ”long words” • Example: Slot 1 Slot 2 VLIW1: ins1 ins2 VLIW2: ins3 [empty]

10.VLIW example Consider static code below: for(i = 0; i < A; i++) { sum1 = sum1 + i; } for(j = 0; j < A; j++) { sum2 = sum2 + j; } For a 2-wide VLIW, one packing could be: Slot 1 Slot 2 i = 0 j = 0 i < A j < A sum1 = sum1 + i sum2 = sum2 + j i++ j++

11.Program Semantics • When processors commit in-order, they preserve appearance of executing in program order • Not always true when multiple processors are involved • But when compilers emit code, they change order from what is in program • Which orders in the original program must be preserved? • Which orders do not need to be preserved?

12.Our Ordering Principles • Preserve Data Dependences • Preserve Control Dependences What about: printf("hello"); printf("world");

13.Basic Block Scheduling • Basic block is a single-entry, single-exit code block • Instructions in basic block have the same control dependence • All can execute together if they have no dependence • Is there an advantage in reordering instructions within a basic block?

14.Instruction Scheduling Consider: A = 1 // takes 1 cycle B = A + 1 // takes 1 cycle C = A * 3 // takes 2 cycles and 2 ALUs D = A + 5 // takes 1 cycle Assume you have 2 ALUs. How should you schedule these instructions to lower total time?

15.Increasing the size of Basic Blocks • Basic blocks are usually small • Not many opportunities to schedule instructions • How can we increase size of basic blocks? • Remember out-of-order processors do speculation ...