性能评估

本文主要学习了计算机体系结构的性能评估。学习了计算机系统组成和组织层次以及本章内容的总结。所有计算机由五个部件组成处理器:(1)数据通路和(2)控制(3)记忆(4)输入设备(5)输出设备。学习集成电路成本---公式的运用和CMOS改进。
展开查看详情

1. ECE4680 Computer Organization and Architecture Lecture 2: Performance Evaluation ECE4680 Lec2 Performance.1 January 23, 2003 Review: What is "Computer Architecture" ° Co-ordination of levels of abstraction Application Operating Compiler System Instruction Set Architecture Instr. Set Proc. I/O system Digital Design Circuit Design ° Under a set of rapidly changing Forces ° CA = IS + CO ECE4680 Lec2 Performance.2 January 23, 2003

2. Review: Levels of Representation temp = v[k]; High Level Language v[k] = v[k+1]; Program v[k+1] = temp; Compiler lw $15, 0($2) Assembly Language lw $16, 4($2) Program sw $16, 0($2) sw $15, 4($2) Assembler 0000 1001 1100 0110 1010 1111 0101 1000 Machine Language 1010 1111 0101 1000 0000 1001 1100 0110 Program 1100 0110 1010 1111 0101 1000 0000 1001 0101 1000 0000 1001 1100 0110 1010 1111 Machine Interpretation Control Signal Specification ECE4680 Lec2 Performance.3 January 23, 2003 Review: Levels of Organization SPARCstation 20 Computer SPARC Memory Devices Processor Control Input Datapath Output ECE4680 Lec2 Performance.4 January 23, 2003

3. Summary: Computer System Components Proc Caches Busses adapters Memory Controllers Disks I/O Devices: Displays Networks Keyboards °All have interfaces & organizations ECE4680 Lec2 Performance.5 January 23, 2003 Review: Summary from Last Lecture °All computers consist of five components • Processor: (1) datapath and (2) control • (3) Memory • (4) Input devices and (5) Output devices °Not all “memory” are created equally • Cache: fast (expensive) memory are placed closer to the processor • Main memory: less expensive memory--we can have more °Input and output (I/O) devices has the messiest organization • Wide range of speed: graphics vs. keyboard • Wide range of requirements: speed, standard, cost ... etc. • Least amount of research (so far) ECE4680 Lec2 Performance.6 January 23, 2003

4. Integrated Circuits Costs --- manufacturing process (p24) Click “how chips are made” on the website. ECE4680 Lec2 Performance.7 January 23, 2003 Integrated Circuits Costs --- formula Cost _ per _ wafter Die cost = Die _ per _ wafer × Yield wafer _ area Dies per wafer = Die _ area 1 Die Yield = (1 + ( Defect _ per _ area × Die _ area )) 2 Die Cost is goes roughly with the cube of the area. ECE4680 Lec2 Performance.8 January 23, 2003

5. Real World Examples Chip Metal Line Wafer Defect Area Dies/ Yield Die Cost layers width cost /cm2 mm2 wafer 386DX 2 0.90 $900 1.0 43 360 71% $4 486DX2 3 0.80 $1200 1.0 81 181 54% $12 PowerPC 601 4 0.80 $1700 1.3 121 115 28% $53 HP PA 7100 3 0.80 $1300 1.0 196 66 27% $73 DEC Alpha 3 0.70 $1500 1.2 234 53 19% $149 SuperSPARC 3 0.70 $1700 1.6 256 48 13% $272 Pentium 3 0.80 $1500 1.5 296 40 9% $417 From "Estimating IC Manufacturing Costs,” by Linley Gwennap, Microprocessor Report, August 2, 1993, p. 15 ECE4680 Lec2 Performance.9 January 23, 2003 Other Costs IC cost = Die cost + Testing cost + Packaging cost Final test yield Packaging Cost: depends on pins, heat dissipation Chip Die Package Test & Total cost pins type cost Assembly 386DX $4 132 QFP $1 $4 $9 486DX2 $12 168 PGA $11 $12 $35 PowerPC 601 $53 304 QFP $3 $21 $77 HP PA 7100 $73 504 PGA $35 $16 $124 DEC Alpha $149 431 PGA $30 $23 $202 SuperSPARC $272 293 PGA $20 $34 $326 Pentium $417 273 PGA $19 $37 $473 ECE4680 Lec2 Performance.10 January 23, 2003

6. CMOS improvements °Die size 2X / 3 years; Line widths halve / 7 years 25 20 15 Transistors per Unit Area 10 5 Die Size 0 1980 1983 1986 1989 1992 Capacity Speed Logic 2x in 3 years 2x in 3 years DRAM 4x in 3 years 1.4x in 10 years disk 4x in 3 years 1.4x in 10 years ECE4680 Lec2 Performance.11 January 23, 2003 Processor Performance 120 IBM Power 2/590 P e 100 r f 80 DEC AXP 300 o r 60 m 1.54X/yr HP 9000/750 a 40 n IBM RS6000/540 1.35X/yr c 20 MIPS M2000 e MIPS M/120 Sun-4/260 0 1987 1988 1989 1990 1991 1992 1993 Year ECE4680 Lec2 Performance.12 January 23, 2003

7. The bottom line: Performance (and cost) Airplane DC to Range Speed Passengers Throughput Cost Paris (m.p.h.) (p x m.p.h.) Boeing 6.5 hours 4150 610 470 286,700 ??? 747 BAC/Sud 3 hours 4000 1350 132 178200 ??? Concorde Douglas 7.3 hours 8720 544 146 79,424 ??? DC-8 °Different measurements lead to different results ° Time to do the task (Execution Time) – execution time, response time, latency ° Tasks per day, hour, week, sec, ns. .. (Performance) – throughput, bandwidth ° Cost renders the measurement more complex ° The bottom-line performance measurement is CPU execution time. ECE4680 Lec2 Performance.13 January 23, 2003 The bottom line: Performance (and cost) " X is n times faster than Y" means ExTime(Y) Performance(X) -------------- = ---------------------- ExTime(X) Performance(Y) • Time of Concorde vs. Boeing 747? • Throughput of Boeing 747 vs. Concorde? ECE4680 Lec2 Performance.14 January 23, 2003

8. Metrics of performance Answers per month Application Operations per second Programming Language Compiler (millions) of Instructions per second – MIPS (millions) of (F.P.) operations per second – MFLOP/s ISA Datapath Control Megabytes per second Function Units Transistors Wires Pins Cycles per second (clock rate) ECE4680 Lec2 Performance.15 January 23, 2003 Relating Processor Metrics (pp60-66) °CPU execution time = CPU clock cycles/pgm ÷ clock rate °or CPU execution time = CPU clock cycles/pgm X clock cycle time °CPU clock cycles/pgm = Instructions/pgm X avg. clock cycles per instr. °or CPI = CPU clock cycles/pgm ÷ Instructions/pgm °CPI tells us something about the Instruction Set Architecture, the Implementation of that architecture, and the program measured ECE4680 Lec2 Performance.16 January 23, 2003

9. Aspects of CPU Performance CPU CPUtime time == Seconds Seconds == Instructions Instructions xx Cycles Cycles xx Seconds Seconds Program Program Program Program Instruction Instruction Cycle Cycle instr. count CPI clock rate Program Compiler Instr. Set Arch. Organization Technology ECE4680 Lec2 Performance.17 January 23, 2003 Aspects of CPU Performance CPU CPUtime time == Seconds Seconds == Instructions Instructions xx Cycles Cycles xx Seconds Seconds Program Program Program Program Instruction Instruction Cycle Cycle instr. count CPI clock rate Program X X Compiler X (x) Instr. Set Arch. X X Organization X X Technology X ECE4680 Lec2 Performance.18 January 23, 2003

10. Organizational Trade-offs 3 factors: Where are they? Application How are they related? Programming Language Compiler ISA Instruction Mix Datapath Control CPI Function Units Transistors Wires Pins Cycle Time ECE4680 Lec2 Performance.19 January 23, 2003 CPI – How to compute? “Average cycles per instruction” CPI = (CPU Time × Clock Rate) / Instruction Count = Clock Cycles / Instruction Count “CPU clock cycles summed up” n CPU time = ClockCycleTime × ∑ CPI i × Ci i =1 "instruction frequency" n CPI = ∑ CPI i =1 i × Fi where Fi = Ci Instruction _ Count Invest Resources where time is Spent! ECE4680 Lec2 Performance.20 January 23, 2003

11. Example (page 60) Our favorite program runs in 10 sec on machine A, which has a 400MHz clock. We are trying to design a machine B with faster clock rate so as to reduce the execution time to 6 sec. The increase of clock rate will affect the rest of the CPU design, causing B to require 1.2 times as many clock cycles as machine A for this program. What clock rate should be? ECE4680 Lec2 Performance.21 January 23, 2003 Answer: CPU time A = CPU clock cycle A / clock rate A ==> CPU clock cycle A = 10 sec x 400 x 10^6 Clock rate B = CPU clock cycle B / CPU time B = 1.2*400*10^6 / 6 = 800 MHz ECE4680 Lec2 Performance.22 January 23, 2003

12. Example Base Machine (Reg / Reg) and Instruction frequencies in the execution of a program: Op Freq Cycles ALU 43% 1 Load 21% 2 Store 12% 2 Branch 24% 2 Question: What is the average CPI of the machine? CPI = 1×43% + 2×21% + 2×12% +2×24% = 1.57 ECE4680 Lec2 Performance.23 January 23, 2003 Example: (page 62) Suppose we have two implementations of the same instruction set. Machine A has a clock cycle time of 10 ns and an average CPI of 2.0 for some program. Machine B has a clock cycle time of 20 ns and an average CPI of 1.2 for the same program. Which is faster? And by how much? Let I denote the number of instructions of the program CPU time A = I * 2.0* 10 = 20 I CPU time B = I * 1.2 *20 = 24 I Machine A is 1.2 faster than B . Again we see 3 factors are related ! ECE4680 Lec2 Performance.24 January 23, 2003

13. Example (pp65-66) ISA has 3 kinds of instructions: Instruction class CPI for this instruction class A 1 B 2 C 3 One program has 2 code sequences: Code Sequence Instruction counts for instruction class A B C 1 2 1 2 2 4 1 1 Which code sequence has more instructions? Which will be faster? What is the CPI for each sequence? S.1 has 5 instructions; S.2 has 6. S.1 needs 2x1+1x2+2x3=10 cycles; S.2 needs 4x1+1x2+1x3=9 cycles. S.1 has CPI=10/5=2; S.2 has CPI=9/6=1.5 ECE4680 Lec2 Performance.25 January 23, 2003 Marketing Metrics MIPS = Instruction Count / Time * 10^6 = Clock Rate / CPI * 10^6 •Million Instructions Per Seconds •machines with different instruction sets ? •programs with different instruction mixes ? Many pitfalls? •dynamic frequency of instructions •Peak MIPS: impractical • uncorrelated with performance. ( see the next example) MFLOP/S = FP Operations / Time * 10^6 •Million Floating-point Operations Per Second •machine dependent •often not where time is spent ?? ECE4680 Lec2 Performance.26 January 23, 2003

14. Example: (similar to example at PP78-79, but not the same) Referring to example at slide 23, assume we build an optimizing compiler for the load/store machine. The compiler discards 50% of the ALU instructions. 1) What is the CPI_opt ? 2) Ignoring system issues and assuming a 20 ns clock cycle time (50 MHz clock rate). What is the MIPS rating for optimized code versus unoptimized code? Does the MIPS rating agree with the rating of execution time? Op Freq Cycle New Freq ALU 43% 1 27% Optimizing compiler Load 21% 2 27% Store 12% 2 15% Branch 24% 2 31% CPI 1.57 1.73 MIPS 31.8 28.9 ECE4680 Lec2 Performance.27 January 23, 2003 Why Do Benchmarks? Or How to evaluate an athlete? °Triathlon (3 sports) • swimming • bicycling • running °Pentathlon (5 sports) °Decathlon: (10 sports) • sprinting • 100-meter, • hurdling • 400-meter, • long jumping • 1,500-meter runs; • 110-meter high hurdle; • discus • discus, • javelin • javelin throws; °Heptathlon (7 sports) • shot-put; • pole vault; • high jump; • long jump. ECE4680 Lec2 Performance.28 January 23, 2003

15. Why Do Benchmarks? °How we evaluate differences • Different systems • Changes to a single system °Provide a target • Benchmarks should represent large class of important programs • Improving benchmark performance should help many programs °For better or worse, benchmarks shape a field °Good ones accelerate progress • good target for development °Bad benchmarks hurt progress • help real programs v. sell machines/papers? • Inventions that help real programs don’t help benchmark ECE4680 Lec2 Performance.29 January 23, 2003 Programs to Evaluate Processor Performance (pp86-87) °(Toy) Benchmarks • Small but easy to compile and run on simulators, convenient in early designing stage of a new machine. No compiler for novel machines. • 10-100 line • e.g.,: sieve, puzzle, quicksort °Synthetic Benchmarks • attempt to use a single benchmark to match average frequencies of real workloads or a set of benchmarks • e.g., Whetstone(Algol60 Fortran), Dhrystone(Ada C) °Kernels • Time critical excerpts of Real programs • Popular in scientific computing to illuminate performance of individual features of a machine. • e.g., Livermore loops(21 loops), Linpack(linear algebra) °Real programs • e.g., gcc, spice ECE4680 Lec2 Performance.30 January 23, 2003

16. Successful Benchmark: SPEC (pp87-89) °1987 RISC industry mired in “bench marketing”: (“That is 8 MIPS machine, but they claim 10 MIPS!”) °EETimes (http://www.eet.com/) + 5 companies band together to perform Systems Performance Evaluation Committee (SPEC) in 1988: Sun, MIPS, HP, Apollo, DEC °Create standard list of programs, inputs, reporting: some real programs, includes OS calls, some I/O ECE4680 Lec2 Performance.31 January 23, 2003 SPEC first round °First round 1989; 10 programs = 4 for integer + 6 for FP, single number to summarize performance °One program (matrix300): 99% of time in single line of code °New front-end compiler could improve dramatically (Fig. 2.3) 800 700 600 500 400 300 200 100 0 gcc doduc li fpppp spice eqntott nasa7 tomcatv epresso matrix300 Benchmark ECE4680 Lec2 Performance.32 January 23, 2003

17. SPEC Evolution °Second round; SPECInt92 (6 integer programs) and SPECfp92 (14 floating point programs) Compiler Flags unlimited. March 93 of DEC 4000 Model 610: spice: unix.c:/def=(sysv,has_bcopy,”bcopy(a,b,c)=memcpy(b,a,c )” wave5: /ali=(all,dcom=nat)/ag=a/ur=4/ur=200 nasa7: /norecu/ag=a/ur=4/ur2=200/lc=blas °Add SPECbase: don’t allow program-specific optimization flags. °Third round; 1995; new set of programs(8 int + 10fp) (fig. 2.6, p72) • “benchmarks useful for 3 years” • Base machine is changed from VAX-11/780 to Sun SPARC 10/40 °Newer rounds include SPEC HPC96, SPEC JVM98, SPEC WEB99, SPEC OMP2001. See http://www.spec.org for more details. ECE4680 Lec2 Performance.33 January 23, 2003 How to Summarize Results? Program 1: 1 sec on machine A, 10 sec on machine B Program 2: 1000 sec 100 sec What are your conclusions? • A is 10 times faster than B for program1. • B is 10 times faster than A for Program2. • Total execution time: a consistent summary measure B is 1001/110=9.1 times faster than A. Workload: need to consider the percentage/frequency of each program in the total job. ECE4680 Lec2 Performance.34 January 23, 2003

18. How to Summarize Performance °Suppose n programs have execution time ti where i = 1, 2 ,... n . n °Suppose the workload is ( w1 , w 2 ,.... w n ) where ∑ i =1 wi = 1 . n 1 1 °Arithmetic Mean AM ( t i ) = n ∑ i =1 ti = n ( t1 + t 2 +  + t n ) n °Weighted Arithmetic mean WAM (ti ) = ∑ i =1 t i w i = t1 w 1 + t 2 w 2 +  + t n w n n 1 1 1 °Geometric Mean GM ( t i ) = n ∏ i =1 ti = n t1 × t 2 ×  × t n = t1 n × t 2 n ×  × t n n ( t i ) = t1 × t2 ×  × tn w1 w2 wn °Weighted Geometric Mean WGM n n °Harmonic Mean HM ( t i ) = n = 1 1 1 1 ∑ i =1 ti + t1 t 2 + + tn n n °Weighted Harmonic Mean GHM (ti ) = n = wi w1 w2 w ∑ i =1 ti t1 + t2 + + n tn ECE4680 Lec2 Performance.35 January 23, 2003 How to Summarize Performance °3 means: arithmetic, geometric and harmonic. Which one is correct? °Arithmetic mean (or weighted arithmetic mean) tracks execution time. °Harmonic mean (or weighted harmonic mean) of rates (e.g., MFLOPS) tracks execution time °Normalized execution time is handy for scaling performance (e.g., time reference on machine ÷ time on measured machine ) °But do not take the arithmetic mean of normalized execution time, use of the geometric wards all improvements equally: program A going from 2 seconds to 1 second as important as program B going from 2000 seconds to 1000 seconds. See example p81 ECE4680 Lec2 Performance.36 January 23, 2003

19. Impact of Means on SPECmark89 for IBM 550 Ratio to VAX: Time: Weighted Time: Program Before After Before After Before After gcc 30 29 49 51 8.91 9.22 espresso 35 34 65 67 7.64 7.86 spice 47 47 510 510 5.69 5.69 doduc 46 49 41 38 5.81 5.45 nasa7 78 144 258 140 3.43 1.86 li 34 34 183 183 7.86 7.86 eqntott 40 40 28 28 6.68 6.68 matrix300 78 730 58 6 3.43 0.37 fpppp 90 87 34 35 2.97 3.07 tomcatv 133 138 20 19 2.01 1.94 Mean 54 72 124 108 54.42 49.99 Geometric Arithmetic Weighted Arith. Ratio 1.33 Ratio 1.16 Ratio 1.09 ECE4680 Lec2 Performance.37 January 23, 2003 Example (p81) Normalized to A Normalized to B Time on A Time on B A B A B program1 1 10 1 10 0.1 1 program2 1000 100 1 0.1 10 1 Arithmetic mean 500.5 55 1 5.05 5.05 1 Geometric mean 31.6 31.6 1 1 1 1 The difficulty arises from the use of arithmetic mean of normalized time. Geometric mean is independent of which data series we use for normalized time because GM ( xi ) x = GM ( i ) GM ( yi ) yi ECE4680 Lec2 Performance.38 January 23, 2003

20. Amdahl's Law Speedup due to enhancement E: ExTime(without E) Performance(with E) Speedup(E) = = ExTime(with E) Performance(without E) F F S Suppose that enhancement E accelerates a fraction F of the task by a factor S and the remainder of the task is unaffected then, F ExTime(with E) = ((1 - F) + ) × ExTime(without E) S ExTime(without E) 1 1 Speedup(E) = = ≤ F F 1- F ((1 - F) + ) × ExTime(without E) (1 - F) + S S ECE4680 Lec2 Performance.39 January 23, 2003 Example: Suppose a person wants to travel from city A to city B by city C. The routes from A to C are in mountains and the routes from C to B are in desert. The distances from A to C , and from C to B are 80 miles and 200 miles, respectively. From A to C, walk at speed of 4 mph From C to B, walk or drive (at speed of 100 mph) Question: How long will it take for the entire trip How much faster from A to B by a car as opposed to walk ECE4680 Lec2 Performance.40 January 23, 2003

21. Example: Suppose an enhancement runs 10 times faster than the original machine, but is only usable 40% of the time. Question: what is the overall speedup? Answer: Fraction_enhance = 0.4 Speedup_enhanced = 10 Speedup_overall = 1/(0.6+0.4/10) = 1.56 ECE4680 Lec2 Performance.41 January 23, 2003 Cost Summary °Integrated circuits driving computer industry °Die costs goes up with the cube of die area ECE4680 Lec2 Performance.42 January 23, 2003

22. Performance Evaluation Summary CPU CPUtime time == Seconds Seconds ==Instructions Instructions xx Cycles Cycles xx Seconds Seconds Program Program Program Program Instruction Instruction Cycle Cycle °Time is the measure of computer performance! °Good products created when have: • Good benchmarks • Good ways to summarize performance °If not good benchmarks and summary, then choice between improving product for real programs vs. improving product to get more sales=> sales almost always wins. °Remember Amdahl’s Law: Speedup is limited by unimproved part of program. ECE4680 Lec2 Performance.43 January 23, 2003 Homework, due Feb. 3 class time (Monday) Question 1.1 through 1.26 Question 2.1 through 2.4 Question 2.10 through 2.13 Question 2.18 through 2.20 Question 2.41 ECE4680 Lec2 Performance.44 January 23, 2003