07 Scoreboard,Tomasulo, Register Renaming

• Scoreboard: Track dependencies through reservations – Simple scheme for out-of-order execution – WAW and WAR hazards force stalls – cannot handle multiple instructions with same destination register • Reservations stations: renaming to larger set of registers + buffering source operands – Prevents registers as bottleneck – Avoids WAR, WAW hazards of Scoreboard – Allows loop unrolling in HW • Dynamic hardware schemes can unroll loops dynamically in hardware – Form of limited dataflow – Register renaming is essential • Lasting Contributions of Tomasulo Algorithm – Dynamic scheduling – Register renaming – Load/store disambiguation • 360/91 descendants are Pentium II; PowerPC 604; MIPS R10000; HP-PA 8000; Alpha 21264
展开查看详情

1. CS252 Recall: Revised FP Loop Minimizing Stalls Graduate Computer Architecture 1 Loop: LD F0,0(R1) Lecture 7 2 stall 3 ADDD F4,F0,F2 Scoreboard, Tomasulo, 4 SUBI R1,R1,8 5 BNEZ R1,Loop ;delayed branch Register Renaming 6 SD 8(R1),F4 ;altered when move past SUBI February 8th, 2012 Swap BNEZ and SD by changing address of SD Instruction Instruction Latency in producing result using result clock cycles John Kubiatowicz FP ALU op Another FP ALU op 3 Electrical Engineering and Computer Sciences FP ALU op Store double 2 University of California, Berkeley Load double FP ALU op 1 6 clocks: Unroll loop 4 times code to make faster? http://www.eecs.berkeley.edu/~kubitron/cs252 2/08/2012 cs252-S12, Lecture07 2 Recall: Software Pipelining Example Trace Scheduling in VLIW Before: Unrolled 3 times After: Software Pipelined • Problem: need large blocks of instructions w/o branches 1 LD F0,0(R1) 1 SD 0(R1),F4 ; Stores M[i] – Only way to be able to find groups of unrelated instructions 2 ADDD F4,F0,F2 2 ADDD F4,F0,F2 ; Adds to M[i-1] – Dynamic branch prediction not an option 3 SD 0(R1),F4 3 LD F0,-16(R1); Loads M[i-2] 4 LD F6,-8(R1) 4 SUBI R1,R1,#8 • Parallelism across IF branches vs. LOOP branches 5 ADDD F8,F6,F2 5 6 SD -8(R1),F8 BNEZ R1,LOOP • Two steps: 7 LD F10,-16(R1) – Trace Selection 8 ADDD F12,F10,F2 SW Pipeline » Find likely sequence of basic blocks (trace) overlapped ops of (statically predicted or profile predicted) 9 SD -16(R1),F12 long sequence of straight-line code 10 SUBI R1,R1,#24 – Trace Compaction 11 BNEZ R1,LOOP Time Loop Unrolled » Squeeze trace into few VLIW instructions • Symbolic Loop Unrolling » Need bookkeeping code in case prediction is wrong – Maximize result-use distance • This is a form of compiler-generated speculation – Less code space than unrolling – Compiler must generate “fixup” code to handle cases in which trace is – Fill & drain pipe only once per loop Time not the taken branch vs. once per each unrolled iteration in loop unrolling – Needs extra registers: undoes bad guess by discarding 5 cycles per iteration • Subtle compiler bugs mean wrong answer vs. poorer performance; no hardware interlocks 2/08/2012 cs252-S12, Lecture07 3 2/08/2012 cs252-S12, Lecture07 4

2. Does a loop-carried dependence mean When Safe to Unroll Loop? there is no parallelism??? • Example: Where are data dependencies? (A,B,C distinct & nonoverlapping) • Consider: for (i=0; i<100; i=i+1) { for (i=0; i< 8; i=i+1) { A[i+1] = A[i] + C[i]; /* S1 */ A = A + C[i]; /* S1 */ B[i+1] = B[i] + A[i+1]; /* S2 */ } }  Could compute: 1. S2 uses the value, A[i+1], computed by S1 in the same iteration. “Cycle 1”: temp0 = C[0] + C[1]; 2. S1 uses a value computed by S1 in an earlier iteration, since temp1 = C[2] + C[3]; iteration i computes A[i+1] which is read in iteration i+1. The same is temp2 = C[4] + C[5]; true of S2 for B[i] and B[i+1]. temp3 = C[6] + C[7]; This is a “loop-carried dependence”: between iterations “Cycle 2”: temp4 = temp0 + temp1; temp5 = temp2 + temp3; • For our prior example, each iteration was distinct “Cycle 3”: A = temp4 + temp5; – In this case, iterations can’t be executed in parallel, Right???? • Relies on associative nature of “+”. 2/08/2012 cs252-S12, Lecture07 5 2/08/2012 cs252-S12, Lecture07 6 Can we use HW to get CPI closer to 1? Problems? • Why in HW at run time? • How do we prevent WAR and WAW hazards? – Works when can’t know real dependence at compile time • How do we deal with variable latency? – Compiler simpler – Forwarding for RAW hazards harder. – Code for one machine runs well on another Clock Cycle Number Instruction 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 • Key idea: Allow instructions behind stall to proceed LD F6,34(R2) IF ID EX MEM WB DIVD F0,F2,F4 ADDD F10,F0,F8 LD F2,45(R3) IF ID EX MEM WB RAW SUBD F12,F8,F14 MULTD F0,F2,F4 IF ID stall M1 M2 M3 M4 M5 M6 M7 M8 M9 M10 MEM WB SUBD F8,F6,F2 IF ID A1 A2 MEM WB • Out-of-order execution => out-of-order completion. DIVD F10,F0,F6 IF ID stall stall stall stall stall stall stall stall stall D1 D2 ADDD F6,F8,F2 IF ID A1 A2 MEM WB WAR • How to get precise exceptions? 2/08/2012 cs252-S12, Lecture07 7 2/08/2012 cs252-S12, Lecture07 8

3. Scoreboard: a bookkeeping technique Scoreboard Architecture (CDC 6600) • Out-of-order execution divides ID stage: FP Mult 1. Issue—decode instructions, check for structural hazards FP Mult Functional Units 2. Read operands—wait until no data hazards, then read operands • Scoreboards date to CDC6600 in 1963 Registers – Readings for Monday include one on CDC6600 FP Divide • Instructions execute whenever not dependent on previous instructions and no hazards. FP Add • CDC 6600: In order issue, out-of-order execution, out- of-order commit (or completion) – No forwarding! Integer – Imprecise interrupt/exception model for now SCOREBOARD Memory 2/08/2012 cs252-S12, Lecture07 9 2/08/2012 cs252-S12, Lecture07 10 Scoreboard Implications Four Stages of Scoreboard Control • Out-of-order completion => WAR, WAW hazards? • Issue—decode instructions & check for structural • Solutions for WAR: hazards (ID1) – Stall writeback until registers have been read – Instructions issued in program order (for hazard checking) – Read registers only during Read Operands stage – Don’t issue if structural hazard – Don’t issue if instruction is output dependent on any previously • Solution for WAW: issued but uncompleted instruction (no WAW hazards) – Detect hazard and stall issue of new instruction until other instruction completes • Read operands—wait until no data hazards, then read operands (ID2) • No register renaming – All real dependencies (RAW hazards) resolved in this stage, since • Need to have multiple instructions in execution we wait for instructions to write back data. phase => multiple execution units or pipelined – No forwarding of data in this model! execution units • Scoreboard keeps track of dependencies between instructions that have already issued • Scoreboard replaces ID, EX, WB with 4 stages 2/08/2012 cs252-S12, Lecture07 11 2/08/2012 cs252-S12, Lecture07 12

4. Four Stages of Scoreboard Control Three Parts of the Scoreboard • Execution—operate on operands (EX) • Instruction status: – The functional unit begins execution upon receiving operands. Which of 4 steps the instruction is in When the result is ready, it notifies the scoreboard that it has completed execution. • Functional unit status:—Indicates the state of the • Write result—finish execution (WB) functional unit (FU). 9 fields for each functional unit – Stall until no WAR hazards with previous instructions: Busy: Indicates whether the unit is busy or not Op: Operation to perform in the unit (e.g., + or –) Example: DIVD F0,F2,F4 Fi: Destination register ADDD F10,F0,F8 SUBD F8,F8,F14 Fj,Fk: Source-register numbers Qj,Qk: Functional units producing source registers Fj, Fk CDC 6600 scoreboard would stall SUBD until ADDD reads operands Rj,Rk: Flags indicating when Fj, Fk are ready • Register result status—Indicates which functional unit will write each register, if one exists. Blank when no pending instructions will write that register 2/08/2012 cs252-S12, Lecture07 13 2/08/2012 cs252-S12, Lecture07 14 Scoreboard Example Detailed Scoreboard Pipeline Control Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R2 Instruction Wait until Bookkeeping LD F2 45+ R3 status MULTD F0 F2 F4 Busy(FU) yes; Op(FU) op; SUBD F8 F6 F2 Fi(FU) `D’; Fj(FU) `S1’; Not busy (FU) DIVD F10 F0 F6 Issue Fk(FU) `S2’; Qj Result(‘S1’); and not result(D) Qk Result(`S2’); Rj not Qj; ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Rk not Qk; Result(‘D’) FU; Read Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Rj and Rk Rj No; Rk No Integer No operands Mult1 No Execution Functional unit Mult2 No complete done Add No Divide No f((Fj(f)Fi(FU) f(if Qj(f)=FU then Rj(f) Yes); Register result status: Write or Rj(f)=No) & f(if Qk(f)=FU then Rj(f) Yes); result (Fk(f)Fi(FU) or Clock F0 F2 F4 F6 F8 F10 F12 ... F30 Result(Fi(FU)) 0; Busy(FU) No FU Rk( f )=No)) 2/08/2012 cs252-S12, Lecture07 15 2/08/2012 cs252-S12, Lecture07 16

5. Scoreboard Example: Cycle 1 Scoreboard Example: Cycle 2 Instruction status: Read Exec Write Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result Instruction j k Issue Oper Comp Result LD F6 34+ R2 1 LD F6 34+ R2 1 2 LD F2 45+ R3 LD F2 45+ R3 MULTD F0 F2 F4 MULTD F0 F2 F4 SUBD F8 F6 F2 SUBD F8 F6 F2 DIVD F10 F0 F6 DIVD F10 F0 F6 ADDD F6 F8 F2 ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Yes Load F6 R2 Yes Integer Yes Load F6 R2 Yes Mult1 No Mult1 No Mult2 No Mult2 No Add No Add No Divide No Divide No Register result status: Register result status: Clock F0 F2 F4 F6 F8 F10 F12 ... F30 Clock F0 F2 F4 F6 F8 F10 F12 ... F30 1 FU Integer 2 FU Integer • Issue 2nd LD? 2/08/2012 cs252-S12, Lecture07 17 2/08/2012 cs252-S12, Lecture07 18 Scoreboard Example: Cycle 3 Scoreboard Example: Cycle 4 Instruction status: Read Exec Write Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result Instruction j k Issue Oper Comp Result LD F6 34+ R2 1 2 3 LD F6 34+ R2 1 2 3 4 LD F2 45+ R3 LD F2 45+ R3 MULTD F0 F2 F4 MULTD F0 F2 F4 SUBD F8 F6 F2 SUBD F8 F6 F2 DIVD F10 F0 F6 DIVD F10 F0 F6 ADDD F6 F8 F2 ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Yes Load F6 R2 No Integer No Mult1 No Mult1 No Mult2 No Mult2 No Add No Add No Divide No Divide No Register result status: Register result status: Clock F0 F2 F4 F6 F8 F10 F12 ... F30 Clock F0 F2 F4 F6 F8 F10 F12 ... F30 3 FU Integer 4 FU Integer • Issue MULT? 2/08/2012 cs252-S12, Lecture07 19 2/08/2012 cs252-S12, Lecture07 20

6. Scoreboard Example: Cycle 5 Scoreboard Example: Cycle 6 Instruction status: Read Exec Write Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result Instruction j k Issue Oper Comp Result LD F6 34+ R2 1 2 3 4 LD F6 34+ R2 1 2 3 4 LD F2 45+ R3 5 LD F2 45+ R3 5 6 MULTD F0 F2 F4 MULTD F0 F2 F4 6 SUBD F8 F6 F2 SUBD F8 F6 F2 DIVD F10 F0 F6 DIVD F10 F0 F6 ADDD F6 F8 F2 ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Yes Load F2 R3 Yes Integer Yes Load F2 R3 Yes Mult1 No Mult1 Yes Mult F0 F2 F4 Integer No Yes Mult2 No Mult2 No Add No Add No Divide No Divide No Register result status: Register result status: Clock F0 F2 F4 F6 F8 F10 F12 ... F30 Clock F0 F2 F4 F6 F8 F10 F12 ... F30 5 FU Integer 6 FU Mult1 Integer 2/08/2012 cs252-S12, Lecture07 21 2/08/2012 cs252-S12, Lecture07 22 Scoreboard Example: Cycle 8a Scoreboard Example: Cycle 7 (First half of clock cycle) Instruction status: Read Exec Write Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result Instruction j k Issue Oper Comp Result LD F6 34+ R2 1 2 3 4 LD F6 34+ R2 1 2 3 4 LD F2 45+ R3 5 6 7 LD F2 45+ R3 5 6 7 MULTD F0 F2 F4 6 MULTD F0 F2 F4 6 SUBD F8 F6 F2 7 SUBD F8 F6 F2 7 DIVD F10 F0 F6 DIVD F10 F0 F6 8 ADDD F6 F8 F2 ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Yes Load F2 R3 No Integer Yes Load F2 R3 No Mult1 Yes Mult F0 F2 F4 Integer No Yes Mult1 Yes Mult F0 F2 F4 Integer No Yes Mult2 No Mult2 No Add Yes Sub F8 F6 F2 Integer Yes No Add Yes Sub F8 F6 F2 Integer Yes No Divide No Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Register result status: Clock F0 F2 F4 F6 F8 F10 F12 ... F30 Clock F0 F2 F4 F6 F8 F10 F12 ... F30 7 FU Mult1 Integer Add 8 FU Mult1 Integer Add Divide • Read multiply operands? 2/08/2012 cs252-S12, Lecture07 23 2/08/2012 cs252-S12, Lecture07 24

7. Scoreboard Example: Cycle 8b (Second half of clock cycle) Scoreboard Example: Cycle 9 Instruction status: Read Exec Write Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result Instruction j k Issue Oper Comp Result LD F6 34+ R2 1 2 3 4 LD F6 34+ R2 1 2 3 4 LD F2 45+ R3 5 6 7 8 LD F2 45+ R3 5 6 7 8 MULTD F0 F2 F4 6 MULTD F0 F2 F4 6 9 SUBD F8 F6 F2 7 SUBD F8 F6 F2 7 9 DIVD F10 F0 F6 8 DIVD F10 F0 F6 8 ADDD F6 F8 F2 ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No Integer No Mult1 Yes Mult F0 F2 F4 Yes Yes Note 10 Mult1 Yes Mult F0 F2 F4 Yes Yes Mult2 No Remaining Mult2 No Add Yes Sub F8 F6 F2 Yes Yes 2 Add Yes Sub F8 F6 F2 Yes Yes Divide Yes Div F10 F0 F6 Mult1 No Yes Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Register result status: Clock F0 F2 F4 F6 F8 F10 F12 ... F30 Clock F0 F2 F4 F6 F8 F10 F12 ... F30 8 FU Mult1 Add Divide 9 FU Mult1 Add Divide • Read operands for MULT & SUB? Issue ADDD? 2/08/2012 cs252-S12, Lecture07 25 2/08/2012 cs252-S12, Lecture07 26 Scoreboard Example: Cycle 10 Scoreboard Example: Cycle 11 Instruction status: Read Exec Write Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result Instruction j k Issue Oper Comp Result LD F6 34+ R2 1 2 3 4 LD F6 34+ R2 1 2 3 4 LD F2 45+ R3 5 6 7 8 LD F2 45+ R3 5 6 7 8 MULTD F0 F2 F4 6 9 MULTD F0 F2 F4 6 9 SUBD F8 F6 F2 7 9 SUBD F8 F6 F2 7 9 11 DIVD F10 F0 F6 8 DIVD F10 F0 F6 8 ADDD F6 F8 F2 ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No Integer No 9 Mult1 Yes Mult F0 F2 F4 No No 8 Mult1 Yes Mult F0 F2 F4 No No Mult2 No Mult2 No 1 Add Yes Sub F8 F6 F2 No No 0 Add Yes Sub F8 F6 F2 No No Divide Yes Div F10 F0 F6 Mult1 No Yes Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Register result status: Clock F0 F2 F4 F6 F8 F10 F12 ... F30 Clock F0 F2 F4 F6 F8 F10 F12 ... F30 10 FU Mult1 Add Divide 11 FU Mult1 Add Divide 2/08/2012 cs252-S12, Lecture07 27 2/08/2012 cs252-S12, Lecture07 28

8. Scoreboard Example: Cycle 12 Scoreboard Example: Cycle 13 Instruction status: Read Exec Write Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result Instruction j k Issue Oper Comp Result LD F6 34+ R2 1 2 3 4 LD F6 34+ R2 1 2 3 4 LD F2 45+ R3 5 6 7 8 LD F2 45+ R3 5 6 7 8 MULTD F0 F2 F4 6 9 MULTD F0 F2 F4 6 9 SUBD F8 F6 F2 7 9 11 12 SUBD F8 F6 F2 7 9 11 12 DIVD F10 F0 F6 8 DIVD F10 F0 F6 8 ADDD F6 F8 F2 ADDD F6 F8 F2 13 Functional unit status: dest S1 S2 FU FU Fj? Fk? Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No Integer No 7 Mult1 Yes Mult F0 F2 F4 No No 6 Mult1 Yes Mult F0 F2 F4 No No Mult2 No Mult2 No Add No Add Yes Add F6 F8 F2 Yes Yes Divide Yes Div F10 F0 F6 Mult1 No Yes Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Register result status: Clock F0 F2 F4 F6 F8 F10 F12 ... F30 Clock F0 F2 F4 F6 F8 F10 F12 ... F30 12 FU Mult1 Divide 13 FU Mult1 Add Divide • Read operands for DIVD? 2/08/2012 cs252-S12, Lecture07 29 2/08/2012 cs252-S12, Lecture07 30 Scoreboard Example: Cycle 14 Scoreboard Example: Cycle 15 Instruction status: Read Exec Write Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result Instruction j k Issue Oper Comp Result LD F6 34+ R2 1 2 3 4 LD F6 34+ R2 1 2 3 4 LD F2 45+ R3 5 6 7 8 LD F2 45+ R3 5 6 7 8 MULTD F0 F2 F4 6 9 MULTD F0 F2 F4 6 9 SUBD F8 F6 F2 7 9 11 12 SUBD F8 F6 F2 7 9 11 12 DIVD F10 F0 F6 8 DIVD F10 F0 F6 8 ADDD F6 F8 F2 13 14 ADDD F6 F8 F2 13 14 Functional unit status: dest S1 S2 FU FU Fj? Fk? Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No Integer No 5 Mult1 Yes Mult F0 F2 F4 No No 4 Mult1 Yes Mult F0 F2 F4 No No Mult2 No Mult2 No 2 Add Yes Add F6 F8 F2 Yes Yes 1 Add Yes Add F6 F8 F2 No No Divide Yes Div F10 F0 F6 Mult1 No Yes Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Register result status: Clock F0 F2 F4 F6 F8 F10 F12 ... F30 Clock F0 F2 F4 F6 F8 F10 F12 ... F30 14 FU Mult1 Add Divide 15 FU Mult1 Add Divide 2/08/2012 cs252-S12, Lecture07 31 2/08/2012 cs252-S12, Lecture07 32

9. Scoreboard Example: Cycle 16 Scoreboard Example: Cycle 17 Instruction status: Read Exec Write Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result Instruction j k Issue Oper Comp Result LD F6 34+ R2 1 2 3 4 LD F6 34+ R2 1 2 3 4 LD F2 45+ R3 5 6 7 8 LD F2 45+ R3 5 6 7 8 MULTD F0 F2 F4 6 9 MULTD F0 F2 F4 6 9 SUBD F8 F6 F2 7 9 11 12 SUBD F8 F6 F2 7 9 11 12 DIVD F10 F0 F6 8 DIVD F10 F0 F6 8 WAR Hazard! ADDD F6 F8 F2 13 14 16 ADDD F6 F8 F2 13 14 16 Functional unit status: dest S1 S2 FU FU Fj? Fk? Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No Integer No 3 Mult1 Yes Mult F0 F2 F4 No No 2 Mult1 Yes Mult F0 F2 F4 No No Mult2 No Mult2 No 0 Add Yes Add F6 F8 F2 No No Add Yes Add F6 F8 F2 No No Divide Yes Div F10 F0 F6 Mult1 No Yes Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Register result status: Clock F0 F2 F4 F6 F8 F10 F12 ... F30 Clock F0 F2 F4 F6 F8 F10 F12 ... F30 16 FU Mult1 Add Divide 17 FU Mult1 Add Divide • Why not write result of ADD??? 2/08/2012 cs252-S12, Lecture07 33 2/08/2012 cs252-S12, Lecture07 34 Scoreboard Example: Cycle 18 Scoreboard Example: Cycle 19 Instruction status: Read Exec Write Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result Instruction j k Issue Oper Comp Result LD F6 34+ R2 1 2 3 4 LD F6 34+ R2 1 2 3 4 LD F2 45+ R3 5 6 7 8 LD F2 45+ R3 5 6 7 8 MULTD F0 F2 F4 6 9 MULTD F0 F2 F4 6 9 19 SUBD F8 F6 F2 7 9 11 12 SUBD F8 F6 F2 7 9 11 12 DIVD F10 F0 F6 8 DIVD F10 F0 F6 8 ADDD F6 F8 F2 13 14 16 ADDD F6 F8 F2 13 14 16 Functional unit status: dest S1 S2 FU FU Fj? Fk? Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No Integer No 1 Mult1 Yes Mult F0 F2 F4 No No 0 Mult1 Yes Mult F0 F2 F4 No No Mult2 No Mult2 No Add Yes Add F6 F8 F2 No No Add Yes Add F6 F8 F2 No No Divide Yes Div F10 F0 F6 Mult1 No Yes Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Register result status: Clock F0 F2 F4 F6 F8 F10 F12 ... F30 Clock F0 F2 F4 F6 F8 F10 F12 ... F30 18 FU Mult1 Add Divide 19 FU Mult1 Add Divide 2/08/2012 cs252-S12, Lecture07 35 2/08/2012 cs252-S12, Lecture07 36

10. Scoreboard Example: Cycle 20 Scoreboard Example: Cycle 21 Instruction status: Read Exec Write Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result Instruction j k Issue Oper Comp Result LD F6 34+ R2 1 2 3 4 LD F6 34+ R2 1 2 3 4 LD F2 45+ R3 5 6 7 8 LD F2 45+ R3 5 6 7 8 MULTD F0 F2 F4 6 9 19 20 MULTD F0 F2 F4 6 9 19 20 SUBD F8 F6 F2 7 9 11 12 SUBD F8 F6 F2 7 9 11 12 DIVD F10 F0 F6 8 DIVD F10 F0 F6 8 21 ADDD F6 F8 F2 13 14 16 ADDD F6 F8 F2 13 14 16 Functional unit status: dest S1 S2 FU FU Fj? Fk? Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No Integer No Mult1 No Mult1 No Mult2 No Mult2 No Add Yes Add F6 F8 F2 No No Add Yes Add F6 F8 F2 No No Divide Yes Div F10 F0 F6 Yes Yes Divide Yes Div F10 F0 F6 Yes Yes Register result status: Register result status: Clock F0 F2 F4 F6 F8 F10 F12 ... F30 Clock F0 F2 F4 F6 F8 F10 F12 ... F30 20 FU Add Divide 21 FU Add Divide • WAR Hazard is now gone... 2/08/2012 cs252-S12, Lecture07 37 2/08/2012 cs252-S12, Lecture07 38 Scoreboard Example: Cycle 22 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R2 1 2 3 4 LD F2 45+ R3 5 6 7 8 MULTD F0 SUBD F8 F2 F6 F4 F2 6 7 9 9 19 11 20 12 Faster than light computation DIVD ADDD F10 F6 F0 F8 F6 F2 8 13 21 14 16 22 (skip a couple of cycles) Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No Mult1 No Mult2 No Add No 39 Divide Yes Div F10 F0 F6 No No Register result status: Clock F0 F2 F4 F6 F8 F10 F12 ... F30 22 FU Divide 2/08/2012 cs252-S12, Lecture07 39 2/08/2012 cs252-S12, Lecture07 40

11. Scoreboard Example: Cycle 61 Scoreboard Example: Cycle 62 Instruction status: Read Exec Write Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result Instruction j k Issue Oper Comp Result LD F6 34+ R2 1 2 3 4 LD F6 34+ R2 1 2 3 4 LD F2 45+ R3 5 6 7 8 LD F2 45+ R3 5 6 7 8 MULTD F0 F2 F4 6 9 19 20 MULTD F0 F2 F4 6 9 19 20 SUBD F8 F6 F2 7 9 11 12 SUBD F8 F6 F2 7 9 11 12 DIVD F10 F0 F6 8 21 61 DIVD F10 F0 F6 8 21 61 62 ADDD F6 F8 F2 13 14 16 22 ADDD F6 F8 F2 13 14 16 22 Functional unit status: dest S1 S2 FU FU Fj? Fk? Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No Integer No Mult1 No Mult1 No Mult2 No Mult2 No Add No Add No 0 Divide Yes Div F10 F0 F6 No No Divide No Register result status: Register result status: Clock F0 F2 F4 F6 F8 F10 F12 ... F30 Clock F0 F2 F4 F6 F8 F10 F12 ... F30 61 FU Divide 62 FU 2/08/2012 cs252-S12, Lecture07 41 2/08/2012 cs252-S12, Lecture07 42 Review: Scoreboard Example: Cycle 62 CDC 6600 Scoreboard Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result • Speedup 1.7 from compiler; 2.5 by hand LD F6 34+ R2 1 2 3 4 BUT slow memory (no cache) limits benefit LD F2 45+ R3 5 6 7 8 MULTD F0 F2 F4 6 9 19 20 • Limitations of 6600 scoreboard: SUBD F8 F6 F2 7 9 11 12 – No forwarding hardware DIVD F10 F0 F6 8 21 61 62 – Limited to instructions in basic block (small window) ADDD F6 F8 F2 13 14 16 22 – Small number of functional units (structural hazards), Functional unit status: dest S1 S2 FU FU Fj? Fk? especially integer/load store units Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk – Do not issue on structural hazards Integer No Mult1 No – Wait for WAR hazards Mult2 No – Prevent WAW hazards Add No Divide No Register result status: Clock F0 F2 F4 F6 F8 F10 F12 ... F30 62 FU • In-order issue; out-of-order execute & commit 2/08/2012 cs252-S12, Lecture07 43 2/08/2012 cs252-S12, Lecture07 44

12. Another Dynamic Algorithm: CS 252 Administrivia Tomasulo Algorithm • For IBM 360/91 about 3 years after CDC 6600 (1966) • Interesting Resource: http://bitsavers.org • Goal: High Performance without special compilers – Has digital versions of users manuals for old machines – Quite interesting! • Differences between IBM 360 & CDC 6600 ISA – I’ll link in some of them to your reading pages when it is appropriate – IBM has only 2 register specifiers/instr vs. 3 in CDC 6600 – Very limited bandwidth: use mirrors such as: http://bitsavers.vt100.net – IBM has 4 FP registers vs. 8 in CDC 6600 • Midterm I: March 21st – IBM has memory-register ops – Will try to do a 5:00-8:00 slot. Would this work for people? – No class that day • Why Study? lead to Alpha 21264, HP 8000, MIPS 10000, – Pizza afterwards… Pentium II, PowerPC 604, … 2/08/2012 cs252-S12, Lecture07 45 2/08/2012 cs252-S12, Lecture07 46 Tomasulo Organization Tomasulo Algorithm vs. Scoreboard From Mem FP Op FP Registers Queue • Control & buffers distributed with Function Units (FU) vs. Load Buffers centralized in scoreboard; Load1 – FU buffers called “reservation stations”; have pending operands Load2 Load3 • Registers in instructions replaced by values or pointers Load4 Load5 Store to reservation stations(RS); called register renaming ; Load6 Buffers – avoids WAR, WAW hazards – More reservation stations than registers, so can do optimizations Add1 compilers can’t Add2 Mult1 Add3 Mult2 • Results to FU from RS, not through registers, over Reservation To Mem Common Data Bus that broadcasts results to all FUs Stations FP adders FP multipliers • Load and Stores treated as FUs with RSs as well • Integer instructions can go past branches, allowing FP ops beyond basic block in FP queue Common Data Bus (CDB) 2/08/2012 cs252-S12, Lecture07 47 2/08/2012 cs252-S12, Lecture07 48

13. Reservation Station Components Three Stages of Tomasulo Algorithm 1. Issue—get instruction from FP Op Queue Op: Operation to perform in the unit (e.g., + or –) If reservation station free (no structural hazard), Vj, Vk: Value of Source operands control issues instr & sends operands (renames registers). – Store buffers has V field, result to be stored 2. Execution—operate on operands (EX) Qj, Qk: Reservation stations producing source When both operands ready then execute; if not ready, watch Common Data Bus for result registers (value to be written) – Note: No ready flags as in Scoreboard; Qj,Qk=0 => ready 3. Write result—finish execution (WB) – Store buffers only have Qi for RS producing result Write on Common Data Bus to all awaiting units; mark reservation station available Busy: Indicates reservation station or FU is busy • Normal data bus: data + destination (“go to” bus) • Common data bus: data + source (“come from” bus) Register result status—Indicates which functional unit – 64 bits of data + 4 bits of Functional Unit source address will write each register, if one exists. Blank when no – Write if matches expected Functional Unit (produces result) pending instructions that will write that register. – Does the broadcast 2/08/2012 cs252-S12, Lecture07 49 2/08/2012 cs252-S12, Lecture07 50 Tomasulo Example Tomasulo Example Cycle 1 Instruction status: Exec Write Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address Instruction j k Issue Comp Result Busy Address LD F6 34+ R2 Load1 No LD F6 34+ R2 1 Load1 Yes 34+R2 LD F2 45+ R3 Load2 No LD F2 45+ R3 Load2 No MULTD F0 F2 F4 Load3 No MULTD F0 F2 F4 Load3 No SUBD F8 F6 F2 SUBD F8 F6 F2 DIVD F10 F0 F6 DIVD F10 F0 F6 ADDD F6 F8 F2 ADDD F6 F8 F2 Reservation Stations: S1 S2 RS RS Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Time Name Busy Op Vj Vk Qj Qk Add1 No Add1 No Add2 No Add2 No Add3 No Add3 No Mult1 No Mult1 No Mult2 No Mult2 No Register result status: Register result status: Clock F0 F2 F4 F6 F8 F10 F12 ... F30 Clock F0 F2 F4 F6 F8 F10 F12 ... F30 0 FU 1 FU Load1 2/08/2012 cs252-S12, Lecture07 51 2/08/2012 cs252-S12, Lecture07 52

14. Tomasulo Example Cycle 2 Tomasulo Example Cycle 3 Instruction status: Exec Write Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address Instruction j k Issue Comp Result Busy Address LD F6 34+ R2 1 Load1 Yes 34+R2 LD F6 34+ R2 1 3 Load1 Yes 34+R2 LD F2 45+ R3 2 Load2 Yes 45+R3 LD F2 45+ R3 2 Load2 Yes 45+R3 MULTD F0 F2 F4 Load3 No MULTD F0 F2 F4 3 Load3 No SUBD F8 F6 F2 SUBD F8 F6 F2 DIVD F10 F0 F6 DIVD F10 F0 F6 ADDD F6 F8 F2 ADDD F6 F8 F2 Reservation Stations: S1 S2 RS RS Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Time Name Busy Op Vj Vk Qj Qk Add1 No Add1 No Add2 No Add2 No Add3 No Add3 No Mult1 No Mult1 Yes MULTD R(F4) Load2 Mult2 No Mult2 No Register result status: Register result status: Clock F0 F2 F4 F6 F8 F10 F12 ... F30 Clock F0 F2 F4 F6 F8 F10 F12 ... F30 2 FU Load2 Load1 3 FU Mult1 Load2 Load1 • Note: registers names are removed (“renamed”) in Note: Unlike 6600, can have multiple loads outstanding Reservation Stations; MULT issued vs. scoreboard 2/08/2012 cs252-S12, Lecture07 53 • Load1 completing; what 2/08/2012 is Lecture07 cs252-S12, waiting for Load1? 54 Tomasulo Example Cycle 4 Tomasulo Example Cycle 5 Instruction status: Exec Write Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address Instruction j k Issue Comp Result Busy Address LD F6 34+ R2 1 3 4 Load1 No LD F6 34+ R2 1 3 4 Load1 No LD F2 45+ R3 2 4 Load2 Yes 45+R3 LD F2 45+ R3 2 4 5 Load2 No MULTD F0 F2 F4 3 Load3 No MULTD F0 F2 F4 3 Load3 No SUBD F8 F6 F2 4 SUBD F8 F6 F2 4 DIVD F10 F0 F6 DIVD F10 F0 F6 5 ADDD F6 F8 F2 ADDD F6 F8 F2 Reservation Stations: S1 S2 RS RS Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Time Name Busy Op Vj Vk Qj Qk Add1 Yes SUBD M(A1) Load2 2 Add1 Yes SUBD M(A1) M(A2) Add2 No Add2 No Add3 No Add3 No Mult1 Yes MULTD R(F4) Load2 10 Mult1 Yes MULTD M(A2) R(F4) Mult2 No Mult2 Yes DIVD M(A1) Mult1 Register result status: Register result status: Clock F0 F2 F4 F6 F8 F10 F12 ... F30 Clock F0 F2 F4 F6 F8 F10 F12 ... F30 4 FU Mult1 Load2 M(A1) Add1 5 FU Mult1 M(A2) M(A1) Add1 Mult2 • Load2 completing; what is waiting for Load2? 2/08/2012 cs252-S12, Lecture07 55 2/08/2012 cs252-S12, Lecture07 56

15. Tomasulo Example Cycle 6 Tomasulo Example Cycle 7 Instruction status: Exec Write Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address Instruction j k Issue Comp Result Busy Address LD F6 34+ R2 1 3 4 Load1 No LD F6 34+ R2 1 3 4 Load1 No LD F2 45+ R3 2 4 5 Load2 No LD F2 45+ R3 2 4 5 Load2 No MULTD F0 F2 F4 3 Load3 No MULTD F0 F2 F4 3 Load3 No SUBD F8 F6 F2 4 SUBD F8 F6 F2 4 7 DIVD F10 F0 F6 5 DIVD F10 F0 F6 5 ADDD F6 F8 F2 6 ADDD F6 F8 F2 6 Reservation Stations: S1 S2 RS RS Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Time Name Busy Op Vj Vk Qj Qk 1 Add1 Yes SUBD M(A1) M(A2) 0 Add1 Yes SUBD M(A1) M(A2) Add2 Yes ADDD M(A2) Add1 Add2 Yes ADDD M(A2) Add1 Add3 No Add3 No 9 Mult1 Yes MULTD M(A2) R(F4) 8 Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1 Mult2 Yes DIVD M(A1) Mult1 Register result status: Register result status: Clock F0 F2 F4 F6 F8 F10 F12 ... F30 Clock F0 F2 F4 F6 F8 F10 F12 ... F30 6 FU Mult1 M(A2) Add2 Add1 Mult2 7 FU Mult1 M(A2) Add2 Add1 Mult2 • Issue ADDD here vs. scoreboard? • Add1 completing; what is waiting for it? 2/08/2012 cs252-S12, Lecture07 57 2/08/2012 cs252-S12, Lecture07 58 Tomasulo Example Cycle 8 Tomasulo Example Cycle 9 Instruction status: Exec Write Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address Instruction j k Issue Comp Result Busy Address LD F6 34+ R2 1 3 4 Load1 No LD F6 34+ R2 1 3 4 Load1 No LD F2 45+ R3 2 4 5 Load2 No LD F2 45+ R3 2 4 5 Load2 No MULTD F0 F2 F4 3 Load3 No MULTD F0 F2 F4 3 Load3 No SUBD F8 F6 F2 4 7 8 SUBD F8 F6 F2 4 7 8 DIVD F10 F0 F6 5 DIVD F10 F0 F6 5 ADDD F6 F8 F2 6 ADDD F6 F8 F2 6 Reservation Stations: S1 S2 RS RS Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Time Name Busy Op Vj Vk Qj Qk Add1 No Add1 No 2 Add2 Yes ADDD (M-M) M(A2) 1 Add2 Yes ADDD (M-M) M(A2) Add3 No Add3 No 7 Mult1 Yes MULTD M(A2) R(F4) 6 Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1 Mult2 Yes DIVD M(A1) Mult1 Register result status: Register result status: Clock F0 F2 F4 F6 F8 F10 F12 ... F30 Clock F0 F2 F4 F6 F8 F10 F12 ... F30 8 FU Mult1 M(A2) Add2 (M-M) Mult2 9 FU Mult1 M(A2) Add2 (M-M) Mult2 2/08/2012 cs252-S12, Lecture07 59 2/08/2012 cs252-S12, Lecture07 60

16. Tomasulo Example Cycle 10 Tomasulo Example Cycle 11 Instruction status: Exec Write Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address Instruction j k Issue Comp Result Busy Address LD F6 34+ R2 1 3 4 Load1 No LD F6 34+ R2 1 3 4 Load1 No LD F2 45+ R3 2 4 5 Load2 No LD F2 45+ R3 2 4 5 Load2 No MULTD F0 F2 F4 3 Load3 No MULTD F0 F2 F4 3 Load3 No SUBD F8 F6 F2 4 7 8 SUBD F8 F6 F2 4 7 8 DIVD F10 F0 F6 5 DIVD F10 F0 F6 5 ADDD F6 F8 F2 6 10 ADDD F6 F8 F2 6 10 11 Reservation Stations: S1 S2 RS RS Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Time Name Busy Op Vj Vk Qj Qk Add1 No Add1 No 0 Add2 Yes ADDD (M-M) M(A2) Add2 No Add3 No Add3 No 5 Mult1 Yes MULTD M(A2) R(F4) 4 Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1 Mult2 Yes DIVD M(A1) Mult1 Register result status: Register result status: Clock F0 F2 F4 F6 F8 F10 F12 ... F30 Clock F0 F2 F4 F6 F8 F10 F12 ... F30 10 FU Mult1 M(A2) Add2 (M-M) Mult2 11 FU Mult1 M(A2) (M-M+M(M-M) Mult2 • Write result of ADDD here vs. scoreboard? • Add2 completing; what is waiting for it? • All quick instructions complete in this cycle! 2/08/2012 cs252-S12, Lecture07 61 2/08/2012 cs252-S12, Lecture07 62 Tomasulo Example Cycle 12 Tomasulo Example Cycle 13 Instruction status: Exec Write Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address Instruction j k Issue Comp Result Busy Address LD F6 34+ R2 1 3 4 Load1 No LD F6 34+ R2 1 3 4 Load1 No LD F2 45+ R3 2 4 5 Load2 No LD F2 45+ R3 2 4 5 Load2 No MULTD F0 F2 F4 3 Load3 No MULTD F0 F2 F4 3 Load3 No SUBD F8 F6 F2 4 7 8 SUBD F8 F6 F2 4 7 8 DIVD F10 F0 F6 5 DIVD F10 F0 F6 5 ADDD F6 F8 F2 6 10 11 ADDD F6 F8 F2 6 10 11 Reservation Stations: S1 S2 RS RS Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Time Name Busy Op Vj Vk Qj Qk Add1 No Add1 No Add2 No Add2 No Add3 No Add3 No 3 Mult1 Yes MULTD M(A2) R(F4) 2 Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1 Mult2 Yes DIVD M(A1) Mult1 Register result status: Register result status: Clock F0 F2 F4 F6 F8 F10 F12 ... F30 Clock F0 F2 F4 F6 F8 F10 F12 ... F30 12 FU Mult1 M(A2) (M-M+M(M-M) Mult2 13 FU Mult1 M(A2) (M-M+M(M-M) Mult2 2/08/2012 cs252-S12, Lecture07 63 2/08/2012 cs252-S12, Lecture07 64

17. Tomasulo Example Cycle 14 Tomasulo Example Cycle 15 Instruction status: Exec Write Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address Instruction j k Issue Comp Result Busy Address LD F6 34+ R2 1 3 4 Load1 No LD F6 34+ R2 1 3 4 Load1 No LD F2 45+ R3 2 4 5 Load2 No LD F2 45+ R3 2 4 5 Load2 No MULTD F0 F2 F4 3 Load3 No MULTD F0 F2 F4 3 15 Load3 No SUBD F8 F6 F2 4 7 8 SUBD F8 F6 F2 4 7 8 DIVD F10 F0 F6 5 DIVD F10 F0 F6 5 ADDD F6 F8 F2 6 10 11 ADDD F6 F8 F2 6 10 11 Reservation Stations: S1 S2 RS RS Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Time Name Busy Op Vj Vk Qj Qk Add1 No Add1 No Add2 No Add2 No Add3 No Add3 No 1 Mult1 Yes MULTD M(A2) R(F4) 0 Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1 Mult2 Yes DIVD M(A1) Mult1 Register result status: Register result status: Clock F0 F2 F4 F6 F8 F10 F12 ... F30 Clock F0 F2 F4 F6 F8 F10 F12 ... F30 14 FU Mult1 M(A2) (M-M+M(M-M) Mult2 15 FU Mult1 M(A2) (M-M+M(M-M) Mult2 2/08/2012 cs252-S12, Lecture07 65 2/08/2012 cs252-S12, Lecture07 66 Tomasulo Example Cycle 16 Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address LD F6 34+ R2 1 3 4 Load1 No LD F2 45+ R3 2 4 5 Load2 No MULTD F0 SUBD F8 F2 F6 F4 F2 3 4 15 7 16 8 Load3 No Faster than light computation DIVD ADDD F10 F6 F0 F8 F6 F2 5 6 10 11 (skip a couple of cycles) Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Add1 No Add2 No Add3 No Mult1 No 40 Mult2 Yes DIVD M*F4 M(A1) Register result status: Clock F0 F2 F4 F6 F8 F10 F12 ... F30 16 FU M*F4 M(A2) (M-M+M(M-M) Mult2 2/08/2012 cs252-S12, Lecture07 67 2/08/2012 cs252-S12, Lecture07 68

18. Tomasulo Example Cycle 55 Tomasulo Example Cycle 56 Instruction status: Exec Write Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address Instruction j k Issue Comp Result Busy Address LD F6 34+ R2 1 3 4 Load1 No LD F6 34+ R2 1 3 4 Load1 No LD F2 45+ R3 2 4 5 Load2 No LD F2 45+ R3 2 4 5 Load2 No MULTD F0 F2 F4 3 15 16 Load3 No MULTD F0 F2 F4 3 15 16 Load3 No SUBD F8 F6 F2 4 7 8 SUBD F8 F6 F2 4 7 8 DIVD F10 F0 F6 5 DIVD F10 F0 F6 5 56 ADDD F6 F8 F2 6 10 11 ADDD F6 F8 F2 6 10 11 Reservation Stations: S1 S2 RS RS Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Time Name Busy Op Vj Vk Qj Qk Add1 No Add1 No Add2 No Add2 No Add3 No Add3 No Mult1 No Mult1 No 1 Mult2 Yes DIVD M*F4 M(A1) 0 Mult2 Yes DIVD M*F4 M(A1) Register result status: Register result status: Clock F0 F2 F4 F6 F8 F10 F12 ... F30 Clock F0 F2 F4 F6 F8 F10 F12 ... F30 55 FU M*F4 M(A2) (M-M+M(M-M) Mult2 56 FU M*F4 M(A2) (M-M+M(M-M) Mult2 • Mult2 is completing; what is waiting for it? 2/08/2012 cs252-S12, Lecture07 69 2/08/2012 cs252-S12, Lecture07 70 Tomasulo Example Cycle 57 Compare to Scoreboard Cycle 62 Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address LD F6 34+ R2 1 3 4 Load1 No LD F2 45+ R3 2 4 5 Load2 No Instruction status: Read Exec Write Exec Write MULTD F0 F2 F4 3 15 16 Load3 No Instruction j k Issue Oper Comp Result Issue Comp Result SUBD F8 F6 F2 4 7 8 LD F6 34+ R2 1 2 3 4 1 3 4 DIVD F10 F0 F6 5 56 57 LD F2 45+ R3 5 6 7 8 2 4 5 ADDD F6 F8 F2 6 10 11 MULTD F0 F2 F4 6 9 19 20 3 15 16 Reservation Stations: S1 S2 RS RS SUBD F8 F6 F2 7 9 11 12 4 7 8 Time Name Busy Op Vj Vk Qj Qk DIVD F10 F0 F6 8 21 61 62 5 56 57 Add1 No ADDD F6 F8 F2 13 14 16 22 6 10 11 Add2 No Add3 No Mult1 No Mult2 Yes DIVD M*F4 M(A1) • Why take longer on scoreboard/6600? Register result status: • Structural Hazards Clock 56 FU F0 F2 M*F4 M(A2) F4 F6 F8 F10 (M-M+M(M-M) Result F12 ... F30 • Lack of forwarding • Once again: In-order issue, out-of-order execution and completion. 2/08/2012 cs252-S12, Lecture07 71 2/08/2012 cs252-S12, Lecture07 72

19. Tomasulo v. Scoreboard (IBM 360/91 v. CDC 6600) Recall: Unrolled Loop That Minimizes Stalls 1 Loop:LD F0,0(R1) 2 LD F6,-8(R1) • What assumptions Pipelined Functional Units Multiple Functional Units 3 LD F10,-16(R1) made when moved (6 load, 3 store, 3 +, 2 x/÷) (1 load/store, 1 + , 2 x, 1 ÷) 4 LD F14,-24(R1) code? 5 ADDD F4,F0,F2 – OK to move store past window size: ≤ 14 instructions ≤ 5 instructions 6 ADDD F8,F6,F2 SUBI even though changes 7 ADDD F12,F10,F2 register No issue on structural hazard same 8 ADDD F16,F14,F2 – OK to move loads before WAR: renaming avoids stall completion 9 SD 0(R1),F4 stores: get right data? 10 SD -8(R1),F8 – When is it safe for WAW: renaming avoids stall issue 11 SD -16(R1),F12 compiler to do such Broadcast results from FU Write/read registers 12 SUBI R1,R1,#32 changes? 13 BNEZ R1,LOOP Control: reservation stations central scoreboard 14 SD 8(R1),F16 ; 8-32 = -24 14 clock cycles, or 3.5 per iteration 2/08/2012 cs252-S12, Lecture07 73 2/08/2012 cs252-S12, Lecture07 74 Tomasulo Loop Example Loop Example Instruction status: Exec Write Loop: LD F0 0 R1 ITER Instruction j k Issue CompResult Busy Addr Fu MULTD F4 F0 F2 1 LD F0 0 R1 Load1 No 1 MULTD F4 F0 F2 Load2 No SD F4 0 R1 1 SD F4 0 R1 Load3 No SUBI R1 R1 #8 2 LD F0 0 R1 Store1 No 2 MULTD F4 F0 F2 Store2 No BNEZ R1 Loop 2 SD F4 0 R1 Store3 No Reservation Stations: S1 S2 RS • Assume Multiply takes 4 clocks Time Name Busy Op Vj Vk Qj Qk Code: Add1 No LD F0 0 R1 • Assume first load takes 8 clocks (cache miss), second Add2 No MULTD F4 F0 F2 load takes 1 clock (hit) Add3 No SD F4 0 R1 • To be clear, will show clocks for SUBI, BNEZ Mult1 No SUBI R1 R1 #8 • Reality: integer instructions ahead Mult2 No BNEZ R1 Loop Register result status Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30 0 80 Fu 2/08/2012 cs252-S12, Lecture07 75 2/08/2012 cs252-S12, Lecture07 76

20. Loop Example Cycle 1 Loop Example Cycle 2 Instruction status: Exec Write Instruction status: Exec Write ITER Instruction j k Issue CompResult Busy Addr Fu ITER Instruction j k Issue CompResult Busy Addr Fu 1 LD F0 0 R1 1 Load1 Yes 80 1 LD F0 0 R1 1 Load1 Yes 80 1 MULTD F4 F0 F2 Load2 No 1 MULTD F4 F0 F2 2 Load2 No 1 SD F4 0 R1 Load3 No 1 SD F4 0 R1 Load3 No 2 LD F0 0 R1 Store1 No 2 LD F0 0 R1 Store1 No 2 MULTD F4 F0 F2 Store2 No 2 MULTD F4 F0 F2 Store2 No 2 SD F4 0 R1 Store3 No 2 SD F4 0 R1 Store3 No Reservation Stations: S1 S2 RS Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code: Time Name Busy Op Vj Vk Qj Qk Code: Add1 No LD F0 0 R1 Add1 No LD F0 0 R1 Add2 No MULTD F4 F0 F2 Add2 No MULTD F4 F0 F2 Add3 No SD F4 0 R1 Add3 No SD F4 0 R1 Mult1 No SUBI R1 R1 #8 Mult1 Yes Multd R(F4) Load1 SUBI R1 R1 #8 Mult2 No BNEZ R1 Loop Mult2 No BNEZ R1 Loop Register result status Register result status Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30 Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30 1 80 Fu Load1 2 80 Fu Load1 Mult1 2/08/2012 cs252-S12, Lecture07 77 2/08/2012 cs252-S12, Lecture07 78 Loop Example Cycle 3 Loop Example Cycle 4 Instruction status: Exec Write Instruction status: Exec Write ITER Instruction j k Issue CompResult Busy Addr Fu ITER Instruction j k Issue CompResult Busy Addr Fu 1 LD F0 0 R1 1 Load1 Yes 80 1 LD F0 0 R1 1 Load1 Yes 80 1 MULTD F4 F0 F2 2 Load2 No 1 MULTD F4 F0 F2 2 Load2 No 1 SD F4 0 R1 3 Load3 No 1 SD F4 0 R1 3 Load3 No 2 LD F0 0 R1 Store1 Yes 80 Mult1 2 LD F0 0 R1 Store1 Yes 80 Mult1 2 MULTD F4 F0 F2 Store2 No 2 MULTD F4 F0 F2 Store2 No 2 SD F4 0 R1 Store3 No 2 SD F4 0 R1 Store3 No Reservation Stations: S1 S2 RS Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code: Time Name Busy Op Vj Vk Qj Qk Code: Add1 No LD F0 0 R1 Add1 No LD F0 0 R1 Add2 No MULTD F4 F0 F2 Add2 No MULTD F4 F0 F2 Add3 No SD F4 0 R1 Add3 No SD F4 0 R1 Mult1 Yes Multd R(F4) Load1 SUBI R1 R1 #8 Mult1 Yes Multd R(F4) Load1 SUBI R1 R1 #8 Mult2 No BNEZ R1 Loop Mult2 No BNEZ R1 Loop Register result status Register result status Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30 Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30 3 80 Fu Load1 Mult1 4 80 Fu Load1 Mult1 • Implicit renaming sets up “DataFlow” graph • Dispatching SUBI Instruction 2/08/2012 cs252-S12, Lecture07 79 2/08/2012 cs252-S12, Lecture07 80

21. Loop Example Cycle 5 Loop Example Cycle 6 Instruction status: Exec Write Instruction status: Exec Write ITER Instruction j k Issue CompResult Busy Addr Fu ITER Instruction j k Issue CompResult Busy Addr Fu 1 LD F0 0 R1 1 Load1 Yes 80 1 LD F0 0 R1 1 Load1 Yes 80 1 MULTD F4 F0 F2 2 Load2 No 1 MULTD F4 F0 F2 2 Load2 Yes 72 1 SD F4 0 R1 3 Load3 No 1 SD F4 0 R1 3 Load3 No 2 LD F0 0 R1 Store1 Yes 80 Mult1 2 LD F0 0 R1 6 Store1 Yes 80 Mult1 2 MULTD F4 F0 F2 Store2 No 2 MULTD F4 F0 F2 Store2 No 2 SD F4 0 R1 Store3 No 2 SD F4 0 R1 Store3 No Reservation Stations: S1 S2 RS Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code: Time Name Busy Op Vj Vk Qj Qk Code: Add1 No LD F0 0 R1 Add1 No LD F0 0 R1 Add2 No MULTD F4 F0 F2 Add2 No MULTD F4 F0 F2 Add3 No SD F4 0 R1 Add3 No SD F4 0 R1 Mult1 Yes Multd R(F4) Load1 SUBI R1 R1 #8 Mult1 Yes Multd R(F4) Load1 SUBI R1 R1 #8 Mult2 No BNEZ R1 Loop Mult2 No BNEZ R1 Loop Register result status Register result status Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30 Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30 5 72 Fu Load1 Mult1 6 72 Fu Load2 Mult1 • And, BNEZ instruction • Notice that F0 never sees Load from location 80 2/08/2012 cs252-S12, Lecture07 81 2/08/2012 cs252-S12, Lecture07 82 Loop Example Cycle 7 Loop Example Cycle 8 Instruction status: Exec Write Instruction status: Exec Write ITER Instruction j k Issue CompResult Busy Addr Fu ITER Instruction j k Issue CompResult Busy Addr Fu 1 LD F0 0 R1 1 Load1 Yes 80 1 LD F0 0 R1 1 Load1 Yes 80 1 MULTD F4 F0 F2 2 Load2 Yes 72 1 MULTD F4 F0 F2 2 Load2 Yes 72 1 SD F4 0 R1 3 Load3 No 1 SD F4 0 R1 3 Load3 No 2 LD F0 0 R1 6 Store1 Yes 80 Mult1 2 LD F0 0 R1 6 Store1 Yes 80 Mult1 2 MULTD F4 F0 F2 7 Store2 No 2 MULTD F4 F0 F2 7 Store2 Yes 72 Mult2 2 SD F4 0 R1 Store3 No 2 SD F4 0 R1 8 Store3 No Reservation Stations: S1 S2 RS Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code: Time Name Busy Op Vj Vk Qj Qk Code: Add1 No LD F0 0 R1 Add1 No LD F0 0 R1 Add2 No MULTD F4 F0 F2 Add2 No MULTD F4 F0 F2 Add3 No SD F4 0 R1 Add3 No SD F4 0 R1 Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8 Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8 Mult2 Yes Multd R(F2) Load2 BNEZ R1 Loop Mult2 Yes Multd R(F2) Load2 BNEZ R1 Loop Register result status Register result status Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30 Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30 7 72 Fu Load2 Mult2 8 72 Fu Load2 Mult2 • Register file completely detached from computation • First and Second iteration 2/08/2012 completely overlapped cs252-S12, Lecture07 83 2/08/2012 cs252-S12, Lecture07 84

22. Loop Example Cycle 9 Loop Example Cycle 10 Instruction status: Exec Write Instruction status: Exec Write ITER Instruction j k Issue CompResult Busy Addr Fu ITER Instruction j k Issue CompResult Busy Addr Fu 1 LD F0 0 R1 1 9 Load1 Yes 80 1 LD F0 0 R1 1 9 10 Load1 No 1 MULTD F4 F0 F2 2 Load2 Yes 72 1 MULTD F4 F0 F2 2 Load2 Yes 72 1 SD F4 0 R1 3 Load3 No 1 SD F4 0 R1 3 Load3 No 2 LD F0 0 R1 6 Store1 Yes 80 Mult1 2 LD F0 0 R1 6 10 Store1 Yes 80 Mult1 2 MULTD F4 F0 F2 7 Store2 Yes 72 Mult2 2 MULTD F4 F0 F2 7 Store2 Yes 72 Mult2 2 SD F4 0 R1 8 Store3 No 2 SD F4 0 R1 8 Store3 No Reservation Stations: S1 S2 RS Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code: Time Name Busy Op Vj Vk Qj Qk Code: Add1 No LD F0 0 R1 Add1 No LD F0 0 R1 Add2 No MULTD F4 F0 F2 Add2 No MULTD F4 F0 F2 Add3 No SD F4 0 R1 Add3 No SD F4 0 R1 Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8 4 Mult1 Yes Multd M[80] R(F2) SUBI R1 R1 #8 Mult2 Yes Multd R(F2) Load2 BNEZ R1 Loop Mult2 Yes Multd R(F2) Load2 BNEZ R1 Loop Register result status Register result status Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30 Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30 9 72 Fu Load2 Mult2 10 64 Fu Load2 Mult2 • Load1 completing: who is waiting? • Load2 completing: who is waiting? • Note: Dispatching SUBIcs252-S12, Lecture07 2/08/2012 85 • Note: Dispatching BNEZ 2/08/2012 cs252-S12, Lecture07 86 Loop Example Cycle 11 Loop Example Cycle 12 Instruction status: Exec Write Instruction status: Exec Write ITER Instruction j k Issue CompResult Busy Addr Fu ITER Instruction j k Issue CompResult Busy Addr Fu 1 LD F0 0 R1 1 9 10 Load1 No 1 LD F0 0 R1 1 9 10 Load1 No 1 MULTD F4 F0 F2 2 Load2 No 1 MULTD F4 F0 F2 2 Load2 No 1 SD F4 0 R1 3 Load3 Yes 64 1 SD F4 0 R1 3 Load3 Yes 64 2 LD F0 0 R1 6 10 11 Store1 Yes 80 Mult1 2 LD F0 0 R1 6 10 11 Store1 Yes 80 Mult1 2 MULTD F4 F0 F2 7 Store2 Yes 72 Mult2 2 MULTD F4 F0 F2 7 Store2 Yes 72 Mult2 2 SD F4 0 R1 8 Store3 No 2 SD F4 0 R1 8 Store3 No Reservation Stations: S1 S2 RS Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code: Time Name Busy Op Vj Vk Qj Qk Code: Add1 No LD F0 0 R1 Add1 No LD F0 0 R1 Add2 No MULTD F4 F0 F2 Add2 No MULTD F4 F0 F2 Add3 No SD F4 0 R1 Add3 No SD F4 0 R1 3 Mult1 Yes Multd M[80] R(F2) SUBI R1 R1 #8 2 Mult1 Yes Multd M[80] R(F2) SUBI R1 R1 #8 4 Mult2 Yes Multd M[72] R(F2) BNEZ R1 Loop 3 Mult2 Yes Multd M[72] R(F2) BNEZ R1 Loop Register result status Register result status Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30 Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30 11 64 Fu Load3 Mult2 12 64 Fu Load3 Mult2 • Next load in sequence • Why not issue third multiply? 2/08/2012 cs252-S12, Lecture07 87 2/08/2012 cs252-S12, Lecture07 88

23. Loop Example Cycle 13 Loop Example Cycle 14 Instruction status: Exec Write Instruction status: Exec Write ITER Instruction j k Issue CompResult Busy Addr Fu ITER Instruction j k Issue CompResult Busy Addr Fu 1 LD F0 0 R1 1 9 10 Load1 No 1 LD F0 0 R1 1 9 10 Load1 No 1 MULTD F4 F0 F2 2 Load2 No 1 MULTD F4 F0 F2 2 14 Load2 No 1 SD F4 0 R1 3 Load3 Yes 64 1 SD F4 0 R1 3 Load3 Yes 64 2 LD F0 0 R1 6 10 11 Store1 Yes 80 Mult1 2 LD F0 0 R1 6 10 11 Store1 Yes 80 Mult1 2 MULTD F4 F0 F2 7 Store2 Yes 72 Mult2 2 MULTD F4 F0 F2 7 Store2 Yes 72 Mult2 2 SD F4 0 R1 8 Store3 No 2 SD F4 0 R1 8 Store3 No Reservation Stations: S1 S2 RS Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code: Time Name Busy Op Vj Vk Qj Qk Code: Add1 No LD F0 0 R1 Add1 No LD F0 0 R1 Add2 No MULTD F4 F0 F2 Add2 No MULTD F4 F0 F2 Add3 No SD F4 0 R1 Add3 No SD F4 0 R1 1 Mult1 Yes Multd M[80] R(F2) SUBI R1 R1 #8 0 Mult1 Yes Multd M[80] R(F2) SUBI R1 R1 #8 2 Mult2 Yes Multd M[72] R(F2) BNEZ R1 Loop 1 Mult2 Yes Multd M[72] R(F2) BNEZ R1 Loop Register result status Register result status Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30 Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30 13 64 Fu Load3 Mult2 14 64 Fu Load3 Mult2 • Mult1 completing. Who is waiting? 2/08/2012 cs252-S12, Lecture07 89 2/08/2012 cs252-S12, Lecture07 90 Loop Example Cycle 15 Loop Example Cycle 16 Instruction status: Exec Write Instruction status: Exec Write ITER Instruction j k Issue CompResult Busy Addr Fu ITER Instruction j k Issue CompResult Busy Addr Fu 1 LD F0 0 R1 1 9 10 Load1 No 1 LD F0 0 R1 1 9 10 Load1 No 1 MULTD F4 F0 F2 2 14 15 Load2 No 1 MULTD F4 F0 F2 2 14 15 Load2 No 1 SD F4 0 R1 3 Load3 Yes 64 1 SD F4 0 R1 3 Load3 Yes 64 2 LD F0 0 R1 6 10 11 Store1 Yes 80 [80]*R2 2 LD F0 0 R1 6 10 11 Store1 Yes 80 [80]*R2 2 MULTD F4 F0 F2 7 15 Store2 Yes 72 Mult2 2 MULTD F4 F0 F2 7 15 16 Store2 Yes 72 [72]*R2 2 SD F4 0 R1 8 Store3 No 2 SD F4 0 R1 8 Store3 No Reservation Stations: S1 S2 RS Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code: Time Name Busy Op Vj Vk Qj Qk Code: Add1 No LD F0 0 R1 Add1 No LD F0 0 R1 Add2 No MULTD F4 F0 F2 Add2 No MULTD F4 F0 F2 Add3 No SD F4 0 R1 Add3 No SD F4 0 R1 Mult1 No SUBI R1 R1 #8 Mult1 Yes Multd R(F2) Load3 SUBI R1 R1 #8 0 Mult2 Yes Multd M[72] R(F2) BNEZ R1 Loop Mult2 No BNEZ R1 Loop Register result status Register result status Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30 Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30 15 64 Fu Load3 Mult2 16 64 Fu Load3 Mult1 • Mult2 completing. Who is waiting? 2/08/2012 cs252-S12, Lecture07 91 2/08/2012 cs252-S12, Lecture07 92

24. Loop Example Cycle 17 Loop Example Cycle 18 Instruction status: Exec Write Instruction status: Exec Write ITER Instruction j k Issue CompResult Busy Addr Fu ITER Instruction j k Issue CompResult Busy Addr Fu 1 LD F0 0 R1 1 9 10 Load1 No 1 LD F0 0 R1 1 9 10 Load1 No 1 MULTD F4 F0 F2 2 14 15 Load2 No 1 MULTD F4 F0 F2 2 14 15 Load2 No 1 SD F4 0 R1 3 Load3 Yes 64 1 SD F4 0 R1 3 18 Load3 Yes 64 2 LD F0 0 R1 6 10 11 Store1 Yes 80 [80]*R2 2 LD F0 0 R1 6 10 11 Store1 Yes 80 [80]*R2 2 MULTD F4 F0 F2 7 15 16 Store2 Yes 72 [72]*R2 2 MULTD F4 F0 F2 7 15 16 Store2 Yes 72 [72]*R2 2 SD F4 0 R1 8 Store3 Yes 64 Mult1 2 SD F4 0 R1 8 Store3 Yes 64 Mult1 Reservation Stations: S1 S2 RS Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code: Time Name Busy Op Vj Vk Qj Qk Code: Add1 No LD F0 0 R1 Add1 No LD F0 0 R1 Add2 No MULTD F4 F0 F2 Add2 No MULTD F4 F0 F2 Add3 No SD F4 0 R1 Add3 No SD F4 0 R1 Mult1 Yes Multd R(F2) Load3 SUBI R1 R1 #8 Mult1 Yes Multd R(F2) Load3 SUBI R1 R1 #8 Mult2 No BNEZ R1 Loop Mult2 No BNEZ R1 Loop Register result status Register result status Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30 Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30 17 64 Fu Load3 Mult1 18 64 Fu Load3 Mult1 2/08/2012 cs252-S12, Lecture07 93 2/08/2012 cs252-S12, Lecture07 94 Loop Example Cycle 19 Loop Example Cycle 20 Instruction status: Exec Write Instruction status: Exec Write ITER Instruction j k Issue CompResult Busy Addr Fu ITER Instruction j k Issue CompResult Busy Addr Fu 1 LD F0 0 R1 1 9 10 Load1 No 1 LD F0 0 R1 1 9 10 Load1 No 1 MULTD F4 F0 F2 2 14 15 Load2 No 1 MULTD F4 F0 F2 2 14 15 Load2 No 1 SD F4 0 R1 3 18 19 Load3 Yes 64 1 SD F4 0 R1 3 18 19 Load3 Yes 64 2 LD F0 0 R1 6 10 11 Store1 No 2 LD F0 0 R1 6 10 11 Store1 No 2 MULTD F4 F0 F2 7 15 16 Store2 Yes 72 [72]*R2 2 MULTD F4 F0 F2 7 15 16 Store2 No 2 SD F4 0 R1 8 19 Store3 Yes 64 Mult1 2 SD F4 0 R1 8 19 20 Store3 Yes 64 Mult1 Reservation Stations: S1 S2 RS Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code: Time Name Busy Op Vj Vk Qj Qk Code: Add1 No LD F0 0 R1 Add1 No LD F0 0 R1 Add2 No MULTD F4 F0 F2 Add2 No MULTD F4 F0 F2 Add3 No SD F4 0 R1 Add3 No SD F4 0 R1 Mult1 Yes Multd R(F2) Load3 SUBI R1 R1 #8 Mult1 Yes Multd R(F2) Load3 SUBI R1 R1 #8 Mult2 No BNEZ R1 Loop Mult2 No BNEZ R1 Loop Register result status Register result status Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30 Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30 19 64 Fu Load3 Mult1 20 64 Fu Load3 Mult1 2/08/2012 cs252-S12, Lecture07 95 2/08/2012 cs252-S12, Lecture07 96

25. Why can Tomasulo overlap iterations of loops? Summary • Scoreboard: Track dependencies through reservations – Simple scheme for out-of-order execution • Register renaming – WAW and WAR hazards force stalls – cannot handle multiple instructions – Multiple iterations use different physical destinations for registers with same destination register (dynamic loop unrolling). • Reservations stations: renaming to larger set of registers + buffering source operands • Reservation stations – Prevents registers as bottleneck – Avoids WAR, WAW hazards of Scoreboard – Permit instruction issue to advance past integer control flow operations – Allows loop unrolling in HW • Dynamic hardware schemes can unroll loops dynamically in • Other idea: Tomasulo building dynamic “DataFlow” hardware graph from instructions – Form of limited dataflow – Fits in with readings for Wednesday – Register renaming is essential • Lasting Contributions of Tomasulo Algorithm – Dynamic scheduling – Register renaming – Load/store disambiguation • 360/91 descendants are Pentium II; PowerPC 604; MIPS R10000; HP-PA 8000; Alpha 21264 2/08/2012 cs252-S12, Lecture07 97 2/08/2012 cs252-S12, Lecture07 98