- 快召唤伙伴们来围观吧
- 微博 QQ QQ空间 贴吧
- 文档嵌入链接
- 复制
- 微信扫一扫分享
- 已成功复制到剪贴板
04-Automated Parallelization in Software
展开查看详情
1 .CSC2/458 Parallel and Distributed Systems Automated Parallelization in Software Sreepathi Pai January 30, 2018 URCS
2 .Outline Out-of-order Superscalars and their Limitations Static Instruction Scheduling
3 .Outline Out-of-order Superscalars and their Limitations Static Instruction Scheduling
4 .How will a processor parallelize this? for(i = 0; i < A; i++) { sum1 = sum1 + i; } for(j = 0; j < A; j++) { sum2 = sum2 + j; }
5 .Dynamic Instruction Stream i = 0 i < A (true) sum1 = sum1 + 0 i++ i < A (true) sum1 = sum1 + 1 i++ ... i < A (false) j = 0 j < A (true) sum2 = sum2 + 0 j++ j < A (true) sum2 = sum2 + 1 j++ j < A (true) ...
6 .An Intel Processor Pipeline Source: Intel
7 .Instruction Pipeline • Instructions flow into “issue window” • from dynamic instruction stream • Dependences are calculated and resources allocated • Independent instructions are dispatched to backend out-of-order • Instructions are retired in-order using a “reorder buffer”
8 .Outline Out-of-order Superscalars and their Limitations Static Instruction Scheduling
9 .VLIW Processors • Very Long Instruction Word Processors • Can execute multiple instructions at the same time • So superscalar • But leaves independence checking to the compiler • Compiler packs instructions into ”long words” • Example: Slot 1 Slot 2 VLIW1: ins1 ins2 VLIW2: ins3 [empty]
10 .VLIW example Consider static code below: for(i = 0; i < A; i++) { sum1 = sum1 + i; } for(j = 0; j < A; j++) { sum2 = sum2 + j; } For a 2-wide VLIW, one packing could be: Slot 1 Slot 2 i = 0 j = 0 i < A j < A sum1 = sum1 + i sum2 = sum2 + j i++ j++
11 .Program Semantics • When processors commit in-order, they preserve appearance of executing in program order • Not always true when multiple processors are involved • But when compilers emit code, they change order from what is in program • Which orders in the original program must be preserved? • Which orders do not need to be preserved?
12 .Our Ordering Principles • Preserve Data Dependences • Preserve Control Dependences What about: printf("hello"); printf("world");
13 .Basic Block Scheduling • Basic block is a single-entry, single-exit code block • Instructions in basic block have the same control dependence • All can execute together if they have no dependence • Is there an advantage in reordering instructions within a basic block?
14 .Instruction Scheduling Consider: A = 1 // takes 1 cycle B = A + 1 // takes 1 cycle C = A * 3 // takes 2 cycles and 2 ALUs D = A + 5 // takes 1 cycle Assume you have 2 ALUs. How should you schedule these instructions to lower total time?
15 .Increasing the size of Basic Blocks • Basic blocks are usually small • Not many opportunities to schedule instructions • How can we increase size of basic blocks? • Remember out-of-order processors do speculation ...