流水线处理机的危害

本文主要利用流水线来印证计算机管道处理器。处理器很好,但是需要小心使用。列举了流水线洗涤的例子,说明了流水线生产的好处。然后回顾了单周期、多周期、流水线的图解。复习了流水线数据控制综述与总结。说明了电脑不是那么容易的,单一内存是结构性的危害。讨论了解决存储结构危害的方法。
展开查看详情

1. ECE4680 Computer Organization and Architecture Hazards in a Pipeline Processor Pipeline is good but you need be careful. ECE4680 Hazards.1 2002-4-3 Pipelining: Its Natural! °Laundry Example °Ann, Brian, Cathy, Dave each have one load of clothes A B C D to wash, dry, and fold °Washer takes 30 minutes °Dryer takes 40 minutes °“Folder” takes 20 minutes ECE4680 Hazards.2 2002-4-3

2. Sequential Laundry 6 PM 7 8 9 10 11 Midnight Time 30 40 20 30 40 20 30 40 20 30 40 20 T a A s k B O r d C e r D °Sequential laundry takes 6 hours for 4 loads °If they learned pipelining, how long would laundry take? ECE4680 Hazards.3 2002-4-3 Pipelined Laundry: Start work ASAP 6 PM 7 8 9 10 11 Midnight Time 30 40 40 40 40 20 T a A s k B O r d C e r D °Pipelined laundry takes 3.5 hours for 4 loads ECE4680 Hazards.4 2002-4-3

3. Pipelining Lessons °Pipelining doesn’t help 6 PM 7 8 9 latency of single task, it helps throughput of entire workload Time °Pipeline rate limited by slowest pipeline stage 30 40 40 40 40 20 T °Multiple tasks operating a A simultaneously s k °Potential speedup = Number B of pipe stages O °Unbalanced lengths of pipe r stages reduces speedup d C e °Time to “fill” pipeline and r time to “drain” it reduces D speedup ECE4680 Hazards.5 2002-4-3 Why not pipeline for instructions? ECE4680 Hazards.6 2002-4-3

4. Review: Single Cycle, Multiple Cycle, vs. Pipeline Cycle 1 Cycle 2 Clk Single Cycle Implementation: Load Store Waste Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 Cycle 10 Clk Multiple Cycle Implementation: Load Store R-type Ifetch Reg Exec Mem Wr Ifetch Reg Exec Mem Ifetch Pipeline Implementation: Load Ifetch Reg Exec Mem Wr Store Ifetch Reg Exec Mem Wr R-type Ifetch Reg Exec Mem Wr ECE4680 Hazards.7 2002-4-3 Review: A Pipelined Datapath Clk Ifetch Reg/Dec Exec Mem Wr RegWr ExtOp ALUOp Branch 1 0 PC+4 PC+4 PC+4 PC Imm16 Imm16 Mem/Wr Register Ex/Mem Register Rs Zero Data ID/Ex Register busA IF/ID Register A Ra busB Mem IUnit Rb Exec RA Do Rt 1 Unit WA Mux RFile Di Rt Rw Di I 0 0 Rd 1 RegDst ALUSrc MemWr MemtoReg ECE4680 Hazards.8 2002-4-3

5. Review: Pipeline Control “Data Stationary Control” °The Main Control generates the control signals during Reg/Dec • Control signals for Exec (ExtOp, ALUSrc, ...) are used 1 cycle later • Control signals for Mem (MemWr Branch) are used 2 cycles later • Control signals for Wr (MemtoReg MemWr) are used 3 cycles later Reg/Dec Exec Mem Wr ExtOp ExtOp ALUSrc ALUSrc Ex/Mem Register Mem/Wr Register ALUOp ALUOp ID/Ex Register IF/ID Register Main RegDst RegDst Control MemWr MemWr MemWr Branch Branch Branch MemtoReg MemtoReg MemtoReg MemtoReg RegWr RegWr RegWr RegWr ECE4680 Hazards.9 2002-4-3 Review: Pipeline Summary °Pipeline Processor: • Natural enhancement of the multiple clock cycle processor • Each functional unit can only be used once per instruction • If a instruction is going to use a functional unit: - it must use it at the same stage as all other instructions • Pipeline Control: - Each stage’s control signal depends ONLY on the instruction that is currently in that stage ECE4680 Hazards.10 2002-4-3

6. Its not that easy for computers ° Limits to pipelining: Hazards prevent next instruction from executing during its designated clock cycle • Structural hazards (hardware resource conflicts): HW cannot support this combination of instructions (single person to fold and put clothes away) • Data hazards (data dependencies): Instruction depends on result of prior instruction still in the pipeline (missing sock) • Control hazards (changes in program flow): Pipelining of branches & other instructions stall the pipeline until the hazard (“bubbles”) in the pipeline disappear. ECE4680 Hazards.11 2002-4-3 Single Memory is a Structural Hazard Time (clock cycles) ALU I Mem Reg Mem Reg n Load s ALU Mem Reg Mem Reg t Instr 1 r. ALU Mem Reg Mem Reg O Instr 2 r ALU d Mem Reg Mem Reg e Instr 3 r ALU Mem Reg Mem Reg Instr 4 ECE4680 Hazards.12 2002-4-3

7. Option 1: Stall to resolve Memory Structural Hazard Time (clock cycles) ALU I Mem Reg Mem Reg n Load s ALU Mem Reg Mem Reg t Instr 1 r. ALU Mem Reg Mem Reg O Instr 2 r d ALU e Instr 3(stall) bubble Mem Reg Mem Reg r ALU Instr 4 Mem Reg Mem Reg ECE4680 Hazards.13 2002-4-3 Option 2: Duplicate to Resolve Structural Hazard • Separate Instruction Cache (Im) & Data Cache (Dm) Time (clock cycles) ALU I Im Reg Dm Reg n Load s ALU Im Reg Dm Reg t Instr 1 r. ALU Im Reg Dm Reg O Instr 2 r ALU d Im Reg Dm Reg e Instr 3 r ALU Im Reg Dm Reg Instr 4 ECE4680 Hazards.14 2002-4-3

8. Data Hazard on r1 add r1 ,r2,r3 sub r4, r1 ,r3 and r6, r1 ,r7 or r8, r1 ,r9 xor r10, r1 ,r11 ECE4680 Hazards.15 2002-4-3 Data Hazard on r1: (Fig. 6.36, page 477) • Dependencies backwards in time are hazards Time (clock cycles) IF ID/RF EX MEM WB ALU Reg Reg I add r1,r2,r3 Im Dm n ALU s Im Reg Dm Reg t sub r4,r1,r3 r. ALU Im Reg Dm Reg and r6,r1,r7 O ALU r Im Reg Dm Reg d or r8,r1,r9 e ALU Im Reg Dm Reg r xor r10,r1,r11 Shading on right/left half means read/write on that element. ECE4680 Hazards.16 2002-4-3

9. Option1: HW Stalls to Resolve Data Hazard • Dependencies backwards in time are hazards Time (clock cycles) IF ID/RF EX MEM WB ALU Reg Reg I add r1,r2,r3 Im Dm n ALU s sub r4, r1,r3 Im bubble bubble bubble Reg Dm Reg t r. ALU and r6,r1,r7 Im Reg Dm O r or r8,r1,r9 ALU d Im Reg e r Im Reg xor r10,r1,r11 • The Control logic can be complex ECE4680 Hazards.17 2002-4-3 But recall use of “Data Stationary Control” °The Main Control generates the control signals during Reg/Dec • Control signals for Exec (ExtOp, ALUSrc, ...) are used 1 cycle later • Control signals for Mem (MemWr Branch) are used 2 cycles later • Control signals for Wr (MemtoReg MemWr) are used 3 cycles later Reg/Dec Exec Mem Wr ExtOp ExtOp ALUSrc ALUSrc Ex/Mem Register Mem/Wr Register ALUOp ALUOp ID/Ex Register IF/ID Register Main RegDst RegDst Control MemW MemW MemW r Branch r Branch rBranch MemtoReg MemtoReg MemtoReg MemtoReg RegWr RegWr RegWr RegWr ECE4680 Hazards.18 2002-4-3

10. Option 1: How HW really stalls pipeline • HW doesn’t change PC keeps fetching same instruction & sets control signals to benign values (0) Time (clock cycles) IF ID/RF EX MEM WB ALU Reg Reg I add r1,r2,r3 Im Dm n s t stall Im bubble bubble bubble bubble r. Im bubble bubble bubble bubble stall O r Im bubble bubble bubble bubble d stall e ALU r sub r4,r1,r3 Im Reg Dm Reg ALU and r6,r1,r7 Im Reg Dm ECE4680 Hazards.19 2002-4-3 Option 2: SW inserts indepdendent instructions • Worst case inserts NOP instructions Time (clock cycles) IF ID/RF EX MEM WB ALU Reg Reg I add r1,r2,r3 Im Dm n ALU s Im Reg Dm Reg t nop r. ALU Im Reg Dm Reg nop O ALU r Im Reg Dm Reg d nop e ALU r sub r4,r1,r3 Im Reg Dm Reg ALU and r6,r1,r7 Im Reg Dm ECE4680 Hazards.20 2002-4-3

11. Option 3 Insight: Data is available! (Fig. 6.37, page 481) • Pipeline registers already contain needed data Time (clock cycles) IF ID/RF EX MEM WB ALU Reg Reg I add r1,r2,r3 Im Dm n ALU s Im Reg Dm Reg t sub r4,r1,r3 r. ALU Im Reg Dm Reg and r6,r1,r7 O ALU r Im Reg Dm Reg d or r8,r1,r9 e ALU Im Reg Dm Reg r xor r10,r1,r11 ECE4680 Hazards.21 2002-4-3 HW Change for “Forwarding” (Bypassing):(Fig 6.38-40, p 482-484) • Increase multiplexors to add paths from pipeline registers • Assumes register read during write gets new value (otherwise more results to be forwarded) ID/EX EX/MEM MEM/WB Zero? M u x ALU Data M Memory u x ECE4680 Hazards.22 2002-4-3

12. From Last Lecture: The Delay Load Phenomenon Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Clock I0: Load Ifetch Reg/Dec Exec Mem Wr Plus 1 Ifetch Reg/Dec Exec Mem Wr Plus 2 Ifetch Reg/Dec Exec Mem Wr Plus 3 Ifetch Reg/Dec Exec Mem Wr Plus 4 Ifetch Reg/Dec Exec Mem Wr °Although Load is fetched during Cycle 1: • The data is NOT written into the Reg File until the end of Cycle 5 • We cannot read this value from the Reg File until Cycle 6 • 3-instruction delay before the load take effect ECE4680 Hazards.23 2002-4-3 Forwarding reduces Data Hazard to 1 cycle: (Fig. 6.44, p489) Time (clock cycles) IF ID/RF EX MEM WB ALU Reg Reg I lw r1, 0(r2) Im Dm n ALU s Im Reg Dm Reg t sub r4,r1,r6 r. ALU Im Reg Dm Reg and r6,r1,r7 O ALU r Im Reg Dm Reg d or r8,r1,r9 e r ECE4680 Hazards.24 2002-4-3

13. Option1: HW Stalls to Resolve Data Hazard • “Interlock”: checks for hazard & stalls Time (clock cycles) IF ID/RF EX MEM WB ALU Reg Reg I lw r1, 0(r2) Im Dm n s t stall Im bubble bubble bubble bubble r. ALU Im Reg Dm Reg sub r4,r1,r3 O ALU r Im Reg Dm Reg d and r6,r1,r7 e ALU r Im Reg Dm Reg or r8,r1,r9 ECE4680 Hazards.25 2002-4-3 Option 2: SW inserts independent instructions • Worst case inserts NOP instructions • MIPS I solution: No HW checking Time (clock cycles) IF ID/RF EX MEM WB ALU Reg Reg I lw r1, 0(r2) Im Dm n ALU s Reg Reg t nop Im Dm r. ALU Im Reg Dm Reg sub r4,r1,r3 O ALU r Im Reg Dm Reg d and r6,r1,r7 e ALU r Im Reg Dm Reg or r8,r1,r9 ECE4680 Hazards.26 2002-4-3

14. Software Scheduling to Avoid Load Hazards Try producing fast code for a = b + c; d = e – f; assuming a, b, c, d ,e, and f in memory. Slow code: Fast code: LW Rb,b LW Rb,b LW Rc,c LW Rc,c ADD Ra,Rb,Rc LW Re,e SW a,Ra ADD Ra,Rb,Rc LW Re,e LW Rf,f LW Rf,f SW a,Ra SUB Rd,Re,Rf SUB Rd,Re,Rf SW d,Rd SW d,Rd ECE4680 Hazards.27 2002-4-3 Compiler Avoiding Load Stalls: scheduled unscheduled gcc 54% 31% spice 42% 14% 65% tex 25% 0% 20% 40% 60% 80% % loads stalling pipeline ECE4680 Hazards.28 2002-4-3

15. From Last Lecture: The Delay Branch Phenomenon Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 Cycle 10 Cycle 11 Clk 12: Beq Ifetch Reg/Dec Exec Mem Wr (target is 1000) 16: R-type Ifetch Reg/Dec Exec Mem Wr 20: R-type Ifetch Reg/Dec Exec Mem Wr 24: R-type Ifetch Reg/Dec Exec Mem Wr 1000: Target of Br Ifetch Reg/Dec Exec Mem Wr °Although Beq is fetched during Cycle 4: • Target address is NOT written into the PC until the end of Cycle 7 • Branch’s target is NOT fetched until Cycle 8 • 3-instruction delay before the branch take effect ECE4680 Hazards.29 2002-4-3 Control Hazard on Branches: 3 stage stall (Fig. 6.50, P497) Time (in Clock Cycles) CC 1 CC 2 CC 3 CC 4 CC 5 CC 6 CC 7 CC 8 CC 9 Program Execution Order (in Instructions) 40 beq 40 beq $1,$3,36 $1, $3, 36 IM Reg DM Reg 44 and$12,$2,$5 44 and $12, $2, $5 IM Reg DM Reg 48oror$13,$6,$2 $13, $6, $2 IM Reg DM Reg 48 52 add 52 add$14,$2,$2 $14, $2, $2 IM Reg DM Reg 72 lw$4,$7,100 80 ld $4, 50($7) IM Reg DM Reg ECE4680 Hazards.30 2002-4-3

16. Branch Stall Impact Why? °If CPI = 1, 30% branch, Stall 3 cycles new CPI = 1.9! °2 part solution: • Determine branch taken or not sooner, AND • Compute taken branch address earlier °MIPS branch tests = 0 or ≠ 0 ° Solution Option 1: • Move Zero test to ID/RF stage • Adder to calculate new PC in ID/RF stage • 1 clock cycle penalty for branch vs. 3 ECE4680 Hazards.31 2002-4-3 Option 1: move HW forward to reduce branch delay Instruction Instr. Decode Execute Memory Write Fetch Reg. Fetch Addr. Calc. Access Back M u x IF/ID ID/EX EX/MEM MEM/WB ADD 4 Zero? IR6..10 PC IR11..15 M Instruction IR u Memory Registers x MEM/WB.IR ALU M Data u M Memory x u x 16 Sign 32 extend ECE4680 Hazards.32 2002-4-3

17. Branch Delay now 1 clock cycle Instruction Instr. Decode Execute Memory Write Fetch Reg. Fetch Addr. Calc. Access Back M u x IF/ID ID/EX EX/MEM MEM/WB ADD Zero? ADD 4 IR6..10 PC IR11..15 M Instruction IR u Memory Registers x MEM/WB.IR ALU M Data u M Memory x u x 16 Sign 32 extend ECE4680 Hazards.33 2002-4-3 Option 2: Define Branch as Delayed °Worst case, Software inserts NOP into branch delay °Where to get instructions to fill branch delay slot? • Before branch instruction • From the target address: only valuable when branch • From fall through: only valuable when don’t branch °Compiler effectiveness for single branch delay slot: • Fills about 60% of branch delay slots • About 80% of instructions executed in branch delay slots useful in computation • about 50% (60% x 80%) of slots usefully filled ECE4680 Hazards.34 2002-4-3

18. When is pipelining hard? (1) ° Interrupts: 5 instructions executing in 5 stage pipeline • How to stop the pipeline? • Restrart? • Who caused the interrupt? Stage Problem interrupts occurring IF Page fault on instruction fetch; misaligned memory access; memory-protection violation ID Undefined or illegal opcode EX Arithmetic interrupt MEM Page fault on data fetch; misaligned memory access; memory-protection violation ° Load with data page fault, Add with instruction page fault? ° Solution 1: interrupt vector/instruction 2: interrupt ASAP, restart everything incomplete ECE4680 Hazards.35 2002-4-3 When is pipelining hard? (2) ° Complex Addressing Modes and Instructions ° Address modes: Auto increment causes register change during instruction execution • Interrupts? • Now worry about write hazards since write no longer last stage - Write After Read (WAR): Write occurs before independent read - Write After Write (WAW): Writes occur in wrong order, leaving wrong result in registers - (Previous data hazard called RAW, for Read After Write) ° Memory-memory Move instructions • Multiple page faults • make progress? ECE4680 Hazards.36 2002-4-3

19. When is pipelining hard? (3) ° Floating Point: long execution time ° Also, may pipeline FP execution unit so that can initiate new instructions without waiting full latency FP Instruction Latency Initiation Rate (MIPS R4000) Add, Subtract 4 3 Multiply 8 4 Divide 36 35 Square root 112 111 Negate 2 1 Absolute value 2 1 FP compare 3 2 ° Divide, Square Root take -10X to -30X longer than Add • Exceptions? • Adds WAR and WAW hazards since pipelines are no longer same length ECE4680 Hazards.37 2002-4-3 Data Hazards ° Read After Write (RAW) • Attempting to read a value that hasn't been written yet. This is the most common type, and can be overcome by forwarding. ° Write after write (WAW): • Writing a value before a preceding write has completed. This can only happen in complex pipes that allow instructions to proceed out of order, or that have multiple write-back stages (mostly CISC), or that have multiple pipes that can write. ° Write after read (WAR): • Writing a value before a preceding read has completed. These also require a complex pipeline that can sometimes write in an early stage, and read in a later stage. It is also possible when multiple pipelines or out-of-order issue are employed. ° Read after read (RAR) does not produce a hazard. ECE4680 Hazards.38 2002-4-3

20. Hazard Detection Suppose instruction i is about to be issued and a predecessor instruction j is in the instruction pipeline. Rregs ( i ) = Registers read by instruction i Wregs ( i ) = Registers written by instruction i ° A RAW hazard exists on register ρ if ∃ ρ, ρ ∈ Rregs( i ) ∩ Wregs( j ) – Keep a record of pending writes (for inst's in the pipe) and compare with operand regs of current instruction. – When instruction issues, reserve its result register. – When one operation completes, remove its write reservation. ° A WAW hazard exists on register ρ if ∃ ρ, ρ ∈ Wregs( i ) ∩ Wregs( j ) ° A WAR hazard exists on register ρ if ∃ ρ, ρ ∈ Wregs( i ) ∩ Rregs( j ) ECE4680 Hazards.39 2002-4-3 Avoiding Data Hazards by Design Suppose instructions are executed in a pipelined fashion such that Instructions are initiated in order. ° WAW avoidance: if writes to a particular resource (e.g., reg) are performed in the same stage for all instructions, then no WAW hazards occur. proof: writes are in the same time sequence as instructions. I R/D E W I R/D E W I R/D E W ° WAR avoidance: if in all instructions reads of a resource occur at an earlier stage than writes to that resource occur in any instruction, then no WAR hazards occur. proof: A successor instruction must issue later, hence it will perform writes only after all reads for the current instruction. ECE4680 Hazards.40 2002-4-3

21. First Generation RISC Pipelines ° All instructions follow same pipeline order (“static schedule”). ° Register write in last stage – Avoid WAW hazards ° All register reads performed in first stage after issue. – Avoid WAR hazards ° Memory access in stage 4 – Avoid all memory hazards ° Control hazards resolved by delayed branch (with fast path) ° RAW hazards resolved by bypass, except on load results which are resolved by fiat (delayed load). Substantial pipelining with very little cost or complexity. Machine organization is (slightly) exposed! Relies very heavily on "hit assumption"of memory accesses in cache ECE4680 Hazards.41 2002-4-3 Review: Summary of Pipelining Basics °Speed Up Š Pipeline Depth; if ideal CPI is 1, then: Speedup = Pipeline depth × Clock cycle unpipelined 1+Pipeline stall cycles per instruction Clock cycle pipelined °Hazards limit performance on computers: • structural: need more HW resources • data: need forwarding, compiler scheduling • control: early evaluation & PC, delayed branch, prediction °Increasing length of pipe increases impact of hazards since pipelining helps instruction bandwidth, not latency °Compilers key to reducing cost of data and control hazards • load delay slots • branch delay slots °Exceptions, Instruction Set, FP makes pipelining harder °Longer pipelines => Branch prediction, more instruction parallelism? ECE4680 Hazards.42 2002-4-3