13-Parallel Performance Methods and Technologies


1. Parallel Performance Methods and Technologies
Parallel Computing, CIS 410/510
Department of Computer and Information Science
Lecture 13 – Parallel Performance Methods

2. Parallel Performance and Complexity
•  To use a scalable parallel computer well, you must write high-performance parallel programs
•  To get high-performance parallel programs, you must understand and optimize performance for the combination of programming model, algorithm, language, platform, …
•  Unfortunately, parallel performance measurement, analysis, and optimization is not an easy process
•  Parallel performance is complex

3. Parallel Performance Evaluation
•  Study of performance in parallel systems
❍  Models and behaviors
❍  Evaluative techniques
•  Evaluation methodologies
❍  Analytical modeling and statistical modeling
❍  Simulation-based modeling
❍  Empirical measurement, analysis, and modeling
•  Purposes
❍  Planning
❍  Diagnosis
❍  Tuning

4. Parallel Performance Engineering and Productivity
•  Scalable, optimized applications deliver the HPC promise
•  Optimization through a performance engineering process
❍  Understand performance complexity and inefficiencies
❍  Tune the application to run optimally on high-end machines
•  How to make the process more effective and productive?
•  What performance technology should be used?
❍  Performance technology is part of a larger environment
❍  Programmability, reusability, portability, robustness
❍  Application development and optimization productivity
•  Process, performance technology, and its use will change as parallel systems evolve
•  Goal is to deliver effective performance with high productivity value now and in the future

5. Motivation
•  Parallel / distributed systems are complex
❍  Four layers:
◆  application – algorithm, data structures
◆  parallel programming interface / middleware – compiler, parallel libraries, communication, synchronization
◆  operating system – process and memory management, I/O
◆  hardware – CPU, memory, network
•  Mapping/interaction between the different layers

6. Performance Factors
•  Factors which determine a program's performance are complex, interrelated, and sometimes hidden
•  Application-related factors
❍  Algorithms, dataset sizes, task granularity, memory usage patterns, load balancing, I/O and communication patterns
•  Hardware-related factors
❍  Processor architecture, memory hierarchy, I/O network
•  Software-related factors
❍  Operating system, compiler/preprocessor, communication protocols, libraries

7. Utilization of Computational Resources
•  Resources can be under-utilized or used inefficiently
❍  Identifying these circumstances can give clues to where performance problems exist
•  Resources may be "virtual"
❍  Not actually a physical resource (e.g., thread, process)
•  Performance analysis tools are essential to optimizing an application's performance
❍  Can assist you in understanding what your program is "really doing"
❍  May provide suggestions for how program performance should be improved

8. Performance Analysis and Tuning: The Basics
•  The most important goal of performance tuning is to reduce a program's wall-clock execution time
❍  Iterative process to optimize efficiency
❍  Efficiency is a relationship of execution time
•  So, where does the time go? (see the timing sketch below)
•  Find your program's hot spots and eliminate the bottlenecks in them
❍  Hot spot: an area of code within the program that uses a disproportionately high amount of processor time
❍  Bottleneck: an area of code within the program that uses processor resources inefficiently and therefore causes unnecessary delays
•  Understand what, where, and how time is being spent
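As a first step in finding where the time goes, a candidate region can be bracketed with a wall-clock timer. A minimal sketch in C, assuming a POSIX system with clock_gettime; the summation loop is an invented stand-in for a hot spot:

```c
/* Minimal sketch: timing a candidate hot spot with a monotonic
 * wall-clock timer (POSIX clock_gettime). The region name and the
 * work loop are illustrative, not from the slides. */
#include <stdio.h>
#include <time.h>

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    double t0 = now_sec();
    volatile double sum = 0.0;
    for (long i = 0; i < 50000000L; i++)   /* candidate hot spot */
        sum += (double)i * 0.5;
    double t1 = now_sec();
    printf("region 'sum loop': %.3f s (sum=%g)\n", t1 - t0, (double)sum);
    return 0;
}
```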

9. Sequential Performance
•  Sequential performance is all about:
❍  How time is distributed
❍  What resources are used where and when
•  "Sequential" factors
❍  Computation
◆  choosing the right algorithm is important
◆  compilers can help
❍  Memory systems: cache and memory
◆  more difficult to assess and determine effects
◆  modeling can help
❍  Input / output

10. Parallel Performance
•  Parallel performance is about sequential performance AND parallel interactions
❍  Sequential performance is the performance within each thread of execution
❍  "Parallel" factors lead to overheads
◆  concurrency (threading, processes)
◆  interprocess communication (message passing)
◆  synchronization (both explicit and implicit)
❍  Parallel interactions also lead to parallelism inefficiency
◆  load imbalances

11. Sequential Performance Tuning
•  Sequential performance tuning is a time-driven process
•  Find the thing that takes the most time and make it take less time (i.e., make it more efficient)
•  May lead to program restructuring
❍  Changes in data storage and structure
❍  Rearrangement of tasks and operations
•  May look for opportunities for better resource utilization
❍  Cache management is a big one
❍  Locality, locality, locality! (see the loop-order sketch below)
❍  Virtual memory management may also pay off
•  May look for opportunities for better processor usage
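To see why locality dominates cache management, consider traversing the same matrix in row-major versus column-major order. A minimal sketch, assuming a POSIX system; the matrix size is an arbitrary choice for illustration:

```c
/* Minimal sketch of cache locality: the same matrix summed with
 * row-major traversal (unit stride, cache-friendly) and column-major
 * traversal (stride of N doubles, cache-hostile). */
#include <stdio.h>
#include <time.h>

#define N 2048   /* 2048 x 2048 doubles = 32 MB, larger than most caches */

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

static double a[N][N];   /* zero-initialized static storage */

int main(void) {
    double s = 0.0;
    double t0 = now_sec();
    for (int i = 0; i < N; i++)      /* row-major: consecutive addresses */
        for (int j = 0; j < N; j++)
            s += a[i][j];
    double t1 = now_sec();
    for (int j = 0; j < N; j++)      /* column-major: strided accesses */
        for (int i = 0; i < N; i++)
            s += a[i][j];
    double t2 = now_sec();
    printf("row-major %.3f s, column-major %.3f s (s=%g)\n",
           t1 - t0, t2 - t1, s);
    return 0;
}
```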

12. Parallel Performance Tuning
•  In contrast to sequential performance tuning, parallel performance tuning might be described as conflict-driven or interaction-driven
•  Find the points of parallel interactions and determine the overheads associated with them
•  Overheads can be the cost of performing the interactions
❍  Transfer of data
❍  Extra operations to implement coordination
•  Overheads also include time spent waiting (see the imbalance sketch below)
❍  Lack of work
❍  Waiting for a dependency to be satisfied
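One way to see waiting time is to time each thread's work separately from its arrival at a barrier. A minimal OpenMP sketch (compile with -fopenmp); the skewed per-thread workload is contrived for illustration:

```c
/* Minimal sketch: exposing load imbalance by timing each thread's
 * work and its wait at an explicit barrier. Thread i deliberately
 * does (i+1) units of work, so higher-numbered threads make the
 * others wait. */
#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        double t0 = omp_get_wtime();
        volatile double x = 0.0;
        for (long i = 0; i < (long)(tid + 1) * 20000000L; i++)
            x += 1.0;                        /* imbalanced work */
        double t_work = omp_get_wtime() - t0;
        #pragma omp barrier                  /* synchronization point */
        double t_total = omp_get_wtime() - t0;
        printf("thread %d: work %.3f s, barrier wait %.3f s\n",
               tid, t_work, t_total - t_work);
    }
    return 0;
}
```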

13. Parallel Performance Engineering Process
[Diagram: the performance engineering cycle, with stages including preparation, implementation, measurement, performance analysis, refinement analysis, program tuning, ranking, and production.]

15. Performance Observability (Guiding Thesis)
•  Performance evaluation problems define the requirements for performance analysis methods
•  Performance observability is the ability to "accurately" capture, analyze, and present (collectively, observe) information about computer system/software performance
•  Tools for performance observability must balance the need for performance data against the cost of obtaining it (environment complexity, performance intrusion)
❍  Too little performance data makes analysis difficult
❍  Too much data perturbs the measured system
•  Important to understand performance observability complexity and develop technology to address it

16. Parallel Performance Engineering Process
•  Traditionally an empirically-based approach
❍  observation → experimentation → diagnosis → tuning
•  Performance technology developed for each level
[Diagram: each stage paired with its performance technology – Performance Observation: instrumentation, measurement, analysis, visualization; Performance Experimentation (characterization): experiment management, performance storage; Performance Diagnosis (performance properties): expert systems; Performance Tuning (performance hypotheses): data mining, models.]

17. Performance Analysis and Optimization Cycle
•  Instrumentation: insertion of extra code (probes, hooks) into the application
•  Measurement: collection of data relevant to performance analysis
•  Analysis: calculation of metrics, identification of performance problems
•  Presentation: transformation of the results into a representation that can be easily understood by a human user
•  Optimization: elimination of performance problems

18. Performance Metrics and Measurement
•  Observability depends on measurement
•  What is able to be observed and measured?
•  A metric represents a type of measured data
❍  Count: how often something occurred
◆  calls to a routine, cache misses, messages sent, …
❍  Duration: how long something took place
◆  execution time of a routine, message communication time, …
❍  Size: how big something is
◆  message size, memory allocated, …
•  A measurement records performance data
•  Certain quantities cannot be measured directly
❍  Derived metric: calculated from other metrics
◆  rates of something (e.g., flops per second) are one example (see the sketch below)
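A minimal sketch of how a tool might accumulate the three metric kinds for a routine and compute a derived rate from them; the metric_t structure and the workload are invented for illustration:

```c
/* Minimal sketch of the three metric kinds (count, duration, size)
 * accumulated for a hypothetical routine, plus a derived rate metric
 * computed from measured ones. */
#include <stdio.h>
#include <time.h>

typedef struct {
    long   count;        /* how often: number of calls      */
    double duration_s;   /* how long: total time in seconds */
    long   size_bytes;   /* how big: total bytes processed  */
} metric_t;

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

static void do_work(metric_t *m, long bytes) {
    double t0 = now_sec();
    volatile long s = 0;
    for (long i = 0; i < bytes; i++) s += i;   /* stand-in workload */
    m->count++;
    m->duration_s += now_sec() - t0;
    m->size_bytes += bytes;
}

int main(void) {
    metric_t m = {0, 0.0, 0};
    for (int i = 0; i < 100; i++) do_work(&m, 1000000);
    /* derived metric: a rate calculated from measured metrics */
    printf("count=%ld  time=%.3f s  bytes=%ld  MB/s=%.1f\n",
           m.count, m.duration_s, m.size_bytes,
           m.size_bytes / m.duration_s / 1e6);
    return 0;
}
```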

19. Performance Benchmarking
•  Benchmarking typically involves the measurement of metrics for a particular type of evaluation
❍  Standardize on an experimentation methodology
❍  Standardize on a collection of benchmark programs
❍  Standardize on a set of metrics
•  High-Performance Linpack (HPL) for the Top500 list
•  NAS Parallel Benchmarks
•  SPEC
•  Typically look at MIPS and FLOPS (see the microbenchmark sketch below)
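A FLOPS figure is itself a derived metric: count the floating-point operations a kernel performs and divide by its measured time. A minimal sketch in the spirit of Linpack's DAXPY kernel (not the actual HPL benchmark); sizes are arbitrary:

```c
/* Minimal FLOPS microbenchmark: time a DAXPY-style kernel
 * (y = a*x + y), count 2 flops per element, report MFLOPS. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N    10000000L
#define REPS 10

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    double *x = malloc(N * sizeof *x), *y = malloc(N * sizeof *y);
    if (!x || !y) return 1;
    for (long i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }

    double t0 = now_sec();
    for (int r = 0; r < REPS; r++)
        for (long i = 0; i < N; i++)
            y[i] = 3.0 * x[i] + y[i];      /* multiply + add = 2 flops */
    double secs = now_sec() - t0;

    double flops = 2.0 * N * REPS;
    printf("DAXPY: %.3f s, %.1f MFLOPS (y[0]=%g)\n",
           secs, flops / secs / 1e6, y[0]);
    free(x); free(y);
    return 0;
}
```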

20. How Is Time Measured?
•  How do we determine where the time goes?
   "A person with one clock knows what time it is; a person with two clocks is never sure." – Confucius (attributed)
•  Clocks are not the same
❍  They have different resolutions and overheads for access (see the sketch below)
•  Time is an abstraction based on a clock
❍  Only as good (accurate) as the clock we use
❍  Only as good as what we use it for
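POSIX exposes each clock's nominal granularity through clock_getres, which makes the point concrete: two clocks on the same machine need not agree. A minimal sketch:

```c
/* Minimal sketch: query the nominal resolution of several POSIX
 * clocks with clock_getres. */
#include <stdio.h>
#include <time.h>

static void report(const char *name, clockid_t id) {
    struct timespec res;
    if (clock_getres(id, &res) == 0)
        printf("%-26s resolution: %ld ns\n", name, res.tv_nsec);
    else
        printf("%-26s not supported\n", name);
}

int main(void) {
    report("CLOCK_REALTIME", CLOCK_REALTIME);
    report("CLOCK_MONOTONIC", CLOCK_MONOTONIC);
    report("CLOCK_PROCESS_CPUTIME_ID", CLOCK_PROCESS_CPUTIME_ID);
    return 0;
}
```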

21. Execution Time
•  There are different types of time
•  Wall-clock time
❍  Based on a real-time clock (continuously running)
❍  Includes time spent in all activities
•  Virtual process time (aka CPU time)
❍  Time when the process is executing (CPU is active)
◆  user time and system time (can mean different things)
❍  Does not include time when the process is inherently waiting (contrast the two in the sketch below)
•  Parallel execution time
❍  Runs whenever any parallel part is executing
❍  Need to define a global time basis
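A minimal sketch contrasting wall-clock time and virtual process (CPU) time on a POSIX system: sleeping accrues wall time only, while computing accrues both:

```c
/* Minimal sketch: wall-clock vs. CPU time. A sleep consumes wall
 * time but almost no CPU time; a busy loop consumes both. */
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static double clk(clockid_t id) {
    struct timespec ts;
    clock_gettime(id, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    double w0 = clk(CLOCK_MONOTONIC);
    double c0 = clk(CLOCK_PROCESS_CPUTIME_ID);

    sleep(1);                                /* waiting: wall time only */
    volatile double x = 0.0;
    for (long i = 0; i < 100000000L; i++)    /* computing: wall + CPU  */
        x += 1.0;

    printf("wall %.2f s, CPU %.2f s (x=%g)\n",
           clk(CLOCK_MONOTONIC) - w0,
           clk(CLOCK_PROCESS_CPUTIME_ID) - c0, (double)x);
    return 0;
}
```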

22. Observation Types
•  There are two types of performance observation, which determine different measurement methods
❍  Direct performance observation
❍  Indirect performance observation
•  Direct performance observation is based on a scientific theory of measurement that considers the cost (overhead) with respect to accuracy
•  Indirect performance observation is based on a sampling theory of measurement that assumes some degree of statistical stationarity

23. Direct Performance Observation
•  Execution actions exposed as events
❍  In general, actions reflect some execution state
◆  presence at a code location or a change in data
◆  occurrence in a parallelism context (thread of execution)
❍  Events encode actions for observation
•  Observation is direct
❍  Direct instrumentation of program code (probes); see the probe sketch below
❍  Instrumentation invokes performance measurement
❍  Event measurement = performance data + context
•  Performance experiment
❍  Actual events + performance measurements
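A minimal sketch of direct instrumentation: hand-inserted enter/exit probes record an interval event, pairing performance data (elapsed time) with context (an event name). The probe_enter/probe_exit API is invented for illustration, not any particular tool's interface:

```c
/* Minimal sketch: probes inserted directly into program code invoke
 * measurement at routine entry and exit. Single-region only; a real
 * tool would keep a stack of start times and per-thread state. */
#include <stdio.h>
#include <time.h>

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

static double probe_start;

static void probe_enter(const char *event) {
    (void)event;
    probe_start = now_sec();          /* begin of the interval event */
}

static void probe_exit(const char *event) {
    /* event measurement = performance data + context */
    printf("event '%s': %.6f s\n", event, now_sec() - probe_start);
}

static void compute(void) {
    probe_enter("compute");           /* probe at routine entry */
    volatile double s = 0.0;
    for (long i = 0; i < 10000000L; i++) s += (double)i;
    probe_exit("compute");            /* probe at routine exit  */
}

int main(void) { compute(); return 0; }
```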

24. Indirect Performance Observation
•  Program code instrumentation is not used
•  Performance is observed indirectly
❍  Execution is interrupted
◆  can be triggered by different events
❍  Execution state is queried (sampled)
◆  different performance data measured
❍  Event-based sampling (EBS); see the sampling sketch below
•  Performance attribution is inferred
❍  Determined by execution context (state)
❍  Observation resolution determined by the interrupt period
❍  Performance data associated with the context for the period
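A minimal sketch of event-based sampling on a POSIX system: a profiling timer (SIGPROF) periodically interrupts execution and the handler samples a global phase marker, so time is attributed statistically rather than through inserted probes. The two phases are invented for illustration:

```c
/* Minimal sketch: indirect observation via a profiling timer. The
 * SIGPROF handler samples which "phase" the program is in; sample
 * counts approximate where CPU time was spent. */
#include <stdio.h>
#include <signal.h>
#include <sys/time.h>

static volatile sig_atomic_t phase;     /* sampled execution state  */
static volatile long samples[2];        /* sample counts per phase  */

static void on_sample(int sig) {
    (void)sig;
    samples[phase]++;                   /* attribute this period    */
}

int main(void) {
    signal(SIGPROF, on_sample);
    struct itimerval itv = { {0, 10000}, {0, 10000} };  /* 10 ms period */
    setitimer(ITIMER_PROF, &itv, NULL);

    volatile double x = 0.0;
    phase = 0;                          /* phase 0: addition loop   */
    for (long i = 0; i < 200000000L; i++) x += 1.0;
    phase = 1;                          /* phase 1: multiply loop   */
    for (long i = 0; i < 200000000L; i++) x *= 1.0000001;

    printf("samples: phase0=%ld phase1=%ld (x=%g)\n",
           (long)samples[0], (long)samples[1], (double)x);
    return 0;
}
```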

25. Direct Observation: Events
•  Event types
❍  Interval events (begin/end events)
◆  measure performance between begin and end
◆  metrics monotonically increase
❍  Atomic events
◆  used to capture performance data state
•  Code events
❍  Routines, classes, templates
❍  Statement-level blocks, loops
•  User-defined events
❍  Specified by the user
•  Abstract mapping events

26. Direct Observation: Instrumentation
•  Events defined by instrumentation access
•  Instrumentation levels
❍  Source code
❍  Library code
❍  Object code
❍  Executable code
❍  Runtime system
❍  Operating system
•  Levels provide different information / semantics
•  Different tools needed for each level
•  Often instrumentation on multiple levels is required

27. Direct Observation: Techniques
•  Static instrumentation
❍  Program instrumented prior to execution
•  Dynamic instrumentation
❍  Program instrumented at runtime
•  Manual and automatic mechanisms
•  Tool required for automatic support
❍  Source time: preprocessor, translator, compiler
❍  Link time: wrapper library, preload (see the interposition sketch below)
❍  Execution time: binary rewrite, dynamic
•  Advantages / disadvantages
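Link/load-time wrapping can instrument a program without touching its source. A minimal sketch of an LD_PRELOAD interposer for malloc on glibc, using dlsym(RTLD_NEXT, ...); the library name is an arbitrary choice, and a production interposer would need recursion guards this sketch omits:

```c
/* Minimal sketch: preload-time instrumentation. This shared library
 * interposes on malloc, counting calls and bytes. Build and use:
 *   gcc -shared -fPIC -o libmtrace.so mtrace.c -ldl
 *   LD_PRELOAD=./libmtrace.so ./your_program
 * Caveat: if dlsym itself allocates, this simple version can recurse;
 * real tools guard against that. */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <dlfcn.h>

static void *(*real_malloc)(size_t);
static long calls, bytes;

void *malloc(size_t size) {
    if (!real_malloc)                 /* resolve the real malloc once */
        real_malloc = (void *(*)(size_t))dlsym(RTLD_NEXT, "malloc");
    calls++;
    bytes += (long)size;
    return real_malloc(size);
}

__attribute__((destructor))
static void report(void) {            /* runs at program exit */
    fprintf(stderr, "malloc: %ld calls, %ld bytes\n", calls, bytes);
}
```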

28. Indirect Observation: Events/Triggers
•  Events are actions external to program code
❍  Timer countdown, HW counter overflow, …
❍  Consequence of program execution
❍  Event frequency determined by: type, setup, number enabled (exposed)
•  Triggers used to invoke the measurement tool
❍  Traps when events occur (interrupt)
❍  Associated with events
❍  May add differentiation to events

29. Indirect Observation: Context
•  When events trigger, execution context is determined at the time of the trap (interrupt)
❍  Access to the PC from the interrupt frame
❍  Access to information about the process/thread
❍  Possible access to the call stack
◆  requires a call stack unwinder (see the sketch below)
•  Assumption is that the context was the same during the preceding period
❍  Between successive triggers
❍  Statistical approximation valid for long-running programs assuming repeated behavior
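A minimal sketch of context capture at a sample trap, using glibc's backtrace from a SIGPROF handler. Note that backtrace is not formally async-signal-safe; real profilers use dedicated unwinders, so this is illustrative only (link with -rdynamic to see symbol names):

```c
/* Minimal sketch: on each sample interrupt, unwind and print the
 * call stack, associating the sample with its execution context. */
#include <stdio.h>
#include <signal.h>
#include <sys/time.h>
#include <execinfo.h>
#include <unistd.h>

static void on_sample(int sig) {
    (void)sig;
    void *frames[16];
    int n = backtrace(frames, 16);                   /* capture frames */
    backtrace_symbols_fd(frames, n, STDERR_FILENO);  /* no malloc used */
    write(STDERR_FILENO, "----\n", 5);
}

static void hot_loop(void) {
    volatile double x = 0.0;
    for (long i = 0; i < 300000000L; i++) x += 1.0;
}

int main(void) {
    signal(SIGPROF, on_sample);
    struct itimerval itv = { {0, 200000}, {0, 200000} };  /* 200 ms */
    setitimer(ITIMER_PROF, &itv, NULL);
    hot_loop();                        /* samples land in this context */
    return 0;
}
```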