Although hardware is evolving at an incredible rate, advances in parallel software have been hampered for many reasons, and developing an efficient parallel application is still not an easy task. Our thesis is that many performance problems and their causes can be quickly located and explained with automated techniques that work on unmodified parallel applications. This work identifies the main obstacles to such diagnosis and presents a two-step approach for addressing them, in which the application is automatically modeled and diagnosed during its execution.
First, we introduce an online performance modeling technique that enables automated discovery of causal execution flows through communication and computational activities in message-passing parallel programs. Second, we present a systematic approach to online performance analysis. The automated analysis uses the online model to quickly identify the most important performance problems and correlate them with the application source code. Our technique is able to discover causal dependences between problems, infer their root causes in some scenarios, and explain them to developers. In this work, we focus on diagnosing scientific MPI parallel applications and their communication and computational problems, although the approach can be extended to support other classes of activities and programming models.
We have evaluated our approach on a variety of scientific parallel applications. In all scenarios, our online performance modeling technique proved effective for low-overhead capture of a program's behavior and facilitated performance understanding. With our automated, model-based performance analysis approach, we were able to easily identify the most severe performance problems during application execution and locate their root causes without prior knowledge of application internals.
1. Online performance modeling and analysis of message-passing parallel applications
PhD Thesis
Oleg Morajko
Universitat Autònoma de Barcelona
Barcelona, 2008
[Figure: example timeline annotated with "Delayed receive" and "Long local calculations"]
2. Motivation
• Parallel system hardware is evolving at an incredible rate
• Contemporary HPC systems
– Top500 systems range from 1,000 to 200,000+ processors (June 2008)
– Take BSC MareNostrum: 10K processors
• The whole industry is shifting to parallel computing
3. Motivation
• Challenges of developing large-scale scientific software
– Evolution of programming models is much slower
– Hard to achieve good efficiency
– Hard to achieve scalability
• Parallel applications rarely achieve good performance immediately
• Careful performance analysis and optimization are crucial
5. Motivation
• Quickly finding performance problems and their causes is hard
• Requires a thorough understanding of the program's behavior
– Parallel algorithm, domain decomposition, communication, synchronization
• Large scale brings additional complexities
– Large data volumes, excessive analysis cost
• Existing tools support finding what happens, where, and when
– Locating root causes of problems is still manual
– Tools expose scalability limitations (e.g., tracing)
• Problem diagnosis still requires substantial time and effort from highly skilled professionals
6. Our goals
• Analyze the performance of parallel applications
• Detect bottlenecks and explain their causes
– Focus on communication and synchronization in message-passing programs
• Automate the approach to the extent possible
• Scalable to thousands of nodes
• Online approach without trace files
7. Contributions
• A systematic approach for automated diagnosis of application performance
– The application is monitored, modeled and diagnosed during its execution
• A scalable modeling technique that generates performance knowledge about application behavior
• An analysis technique that diagnoses MPI applications running in large-scale parallel systems
– Detects performance bottlenecks on-the-fly
– Finds root causes
• A prototype tool to demonstrate the ideas
8. Outline
1. Overview of approaches
2. Online performance modeling
3. Online performance analysis
4. Experimental evaluation
5. Conclusions and future work
11. Classical performance analysis
Drawbacks
• Manual task of an experimental nature
• Time consuming
• High degree of expertise required
• Full traces produce an excessive volume of information
• Poor scalability
17. Automated online analysis
Key characteristics
• Discovers the application model on-the-fly
– Models execution flows, not modules/functions
– Lossy trace compression
• Runtime analysis based on continuous model observation
• Automatically locates problems while the application runs
• Searches for root causes of problems
19. Modeling objectives
• Enable a high-level understanding of application performance
• Reflect parallel application structure and runtime behavior
• Maintain a tradeoff between the volume of collected data and the level of preserved detail
– Communication and computational patterns
– Causality of events
• Serve as a base for online performance analysis
20. Online performance modeling
• A novel application performance modeling approach
• Combines static code analysis with runtime monitoring to extract performance knowledge
• Three-step approach:
– Modeling individual tasks
– Modeling inter-task communication
– Modeling the entire application
21. Modeling individual tasks
• We decompose execution into units that correspond to different activities:
– Communication activities (e.g., MPI_Send, MPI_Gather)
– Computation activities (e.g., calc_gauss)
– Control activities (e.g., program start/termination)
– Others (e.g., I/O)
• We capture the execution flow through these activities using a directed graph called the Task Activity Graph (TAG), sketched below:
– Nodes model communication activities and loops
– Edges represent the sequential flow of execution (computation activities)
– Nodes and edges maintain the happens-before relationship
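To make the TAG concrete, here is a minimal C++ sketch of the data structures such a graph could use; the names and fields are illustrative assumptions, not the thesis implementation, and later sketches reuse these structures.

// Illustrative sketch of a Task Activity Graph (TAG); all names are assumptions.
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

using CallPath = std::vector<uintptr_t>;  // return addresses from a stack walk

struct TagNode {                // a communication activity (e.g., an MPI_Send site)
    CallPath location;          // call path identifying the activity
    uint64_t count = 0;         // number of executions
    uint64_t timeUs = 0;        // accumulated time spent inside the activity
};

struct TagEdge {                // computation between two activities
    int src = -1, dst = -1;     // node ids: execution flowed src -> dst
    uint64_t count = 0;         // number of traversals
    uint64_t timeUs = 0;        // accumulated computation time
};

struct TaskActivityGraph {
    std::vector<TagNode> nodes;
    std::map<std::pair<int, int>, TagEdge> edges;  // keyed by (src, dst)
};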
22. Modeling individual tasks
The Task Activity Graph (TAG) reflects program structure by modeling the executed flow of activities
27. Modeling techniques
We developed a set of techniques to automatically construct and exploit the PTAG (parallel TAG) model at runtime
• Targeted at parallel scientific applications
• Focus on modeling MPI applications
• But extensible to other programming paradigms
• Low overhead
• Scalable to 1000+ nodes
33. Building individual TAG
Runtime modeling (update step sketched below)
• Process generated events
• Walk the stack to capture the program location (call path)
• Update the TAG incrementally
[Figure: the MPI task's runtime library captures events into shared memory; the modeler process updates the TAG]
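A hedged sketch of the incremental update, reusing the TAG structures sketched earlier; findOrCreateNode is a hypothetical helper that matches a node by its call path.

// Sketch only: on each intercepted activity, account the activity time to its
// node and the time since the previous activity to the traversed edge.
static int lastNode = -1;          // node where the previous activity ended (per task)
static uint64_t lastExitUs = 0;    // timestamp of that activity's exit

void onActivityEvent(TaskActivityGraph& tag, const CallPath& cp,
                     uint64_t entryUs, uint64_t exitUs) {
    int node = findOrCreateNode(tag, cp);   // hypothetical: match node by call path
    tag.nodes[node].count++;
    tag.nodes[node].timeUs += exitUs - entryUs;
    if (lastNode >= 0) {                    // computation ran between the two activities
        TagEdge& e = tag.edges[{lastNode, node}];
        e.src = lastNode; e.dst = node;
        e.count++;
        e.timeUs += entryUs - lastExitUs;
    }
    lastNode = node;
    lastExitUs = exitUs;
}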
34. Building individual TAG
Model sampling
• Goal: examine the model at runtime
• Read the model from shared memory
• Sampling is periodic
• Lock-free synchronization (one possible scheme sketched below)
[Figure: the modeler periodically samples the TAG from shared memory]
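The deck does not detail the lock-free scheme; one plausible realization is a seqlock-style sequence counter, sketched below under the assumption that the counter and the serialized model live in the shared segment. SharedTag, Snapshot, serializeInto and copyFrom are hypothetical.

#include <atomic>
#include <cstdint>

// Sketch of seqlock-style, lock-free snapshots: even counter = stable,
// odd = update in progress. A real cross-process version keeps the counter
// in the shared memory segment next to the serialized TAG.
std::atomic<uint64_t> seq{0};

void writerUpdate(SharedTag* shm, const TaskActivityGraph& tag) {
    seq.fetch_add(1, std::memory_order_acq_rel);   // becomes odd: writing
    serializeInto(shm, tag);                       // hypothetical helper
    seq.fetch_add(1, std::memory_order_acq_rel);   // becomes even: stable
}

bool readerSnapshot(const SharedTag* shm, Snapshot* out) {
    for (int attempt = 0; attempt < 16; ++attempt) {
        uint64_t s1 = seq.load(std::memory_order_acquire);
        if (s1 & 1) continue;                      // writer active, retry
        copyFrom(shm, out);                        // hypothetical helper
        if (seq.load(std::memory_order_acquire) == s1)
            return true;                           // snapshot is consistent
    }
    return false;                                  // skip this sampling period
}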
35. Online communication modeling
How to model inter-task communication?
• Intercept MPI communication calls (nodes)
• Match sender nodes with receiver nodes
• Add message edges to the TAG models
36. Online communication modeling
• Requires tracking individual messages transmitted from sender to receiver(s) at runtime
• Achieved by propagating piggyback data over every transmitted MPI message (one strategy sketched below)
• Transmit the node id from sender to receiver(s)
• P2P / blocking / non-blocking / collective
• Optimized hybrid strategy to minimize intrusion
• Store references to the sender's nodes in the receiver's TAG
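One well-known piggyback strategy, shown as a sketch and not necessarily the optimized hybrid the thesis uses, glues a small header to the user buffer with a derived datatype so the node id and send timestamp travel inside the same message. currentTagNodeId is a hypothetical helper; the receiver unpacks symmetrically with its own glued datatype.

#include <mpi.h>

struct Piggyback { int nodeId; double sendEntryTs; };

// Sketch of a PMPI wrapper that piggybacks the sender's TAG node id and
// entry timestamp onto a blocking point-to-point message.
int MPI_Send(void* buf, int count, MPI_Datatype dt,
             int dest, int tag, MPI_Comm comm) {
    Piggyback pb = { currentTagNodeId(), MPI_Wtime() };  // hypothetical helper
    int lens[2] = { (int)sizeof(pb), count };
    MPI_Aint disps[2];
    MPI_Get_address(&pb, &disps[0]);
    MPI_Get_address(buf, &disps[1]);
    MPI_Datatype types[2] = { MPI_BYTE, dt };
    MPI_Datatype glued;
    MPI_Type_create_struct(2, lens, disps, types, &glued);
    MPI_Type_commit(&glued);
    int rc = PMPI_Send(MPI_BOTTOM, 1, glued, dest, tag, comm);  // absolute addresses
    MPI_Type_free(&glued);
    return rc;
}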
37. Online parallel application modeling
Building and maintaining the PTAG (merge step sketched below)
• Individual TAGs are distributed across tasks
• Collect TAG snapshots through a hierarchical reduction network (TBON)
• Distributed merge
• Periodic process
[Figure: individual TAGs → merged groups of TAGs → PTAG]
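A hedged sketch of the distributed merge at one reduction node, reusing the TAG structures and the hypothetical findOrCreateNode helper from earlier: nodes are matched by call path and statistics combined. Here they are simply summed; the actual merge may keep per-task statistics or group similar tasks.

#include <map>
#include <utility>

void mergeInto(TaskActivityGraph& acc, const TaskActivityGraph& in) {
    std::map<int, int> remap;                     // node id in 'in' -> id in 'acc'
    for (int i = 0; i < (int)in.nodes.size(); ++i) {
        int j = findOrCreateNode(acc, in.nodes[i].location);  // match by call path
        acc.nodes[j].count  += in.nodes[i].count;
        acc.nodes[j].timeUs += in.nodes[i].timeUs;
        remap[i] = j;
    }
    for (const auto& kv : in.edges) {             // remap and combine edges
        std::pair<int, int> key(remap[kv.first.first], remap[kv.first.second]);
        TagEdge& e = acc.edges[key];
        e.src = key.first; e.dst = key.second;
        e.count  += kv.second.count;
        e.timeUs += kv.second.timeUs;
    }
}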
41. Benefits of modeling
• Facilitates performance understanding
• Reveals communication and computational patterns and their causal relationships
• Enables an assortment of online analysis techniques
– Quick identification of performance bottlenecks and their location
– Behavioral task clustering
– Causal relationships permit root-cause analysis
– Feedback-guided analysis (refinements)
43. Online analysis objectives
• Diagnose the performance on-the-fly
• Detect relevant performance bottlenecks and their reasons
• Distinguish problem symptoms from root causes
• Explain what, where, when and why
• Focus on communication and synchronization problems in MPI applications
44. Online performance analysis
Time-continuous root-cause analysis process: monitoring → modeling → analysis
• Phase 1: Problem identification
• Phase 2: Problem analysis
• Phase 3: Cause-effect analysis
45. Root-cause analysis
Phase 1: Problem identification
• Focus attention on code regions with the biggest potential optimization benefits
• A potential bottleneck: an individual task activity with a significant amount of execution time
• A TAG node might correspond to a communication or synchronization problem
• A TAG edge might be a computation-bound problem
46. Problem identification
• Rainbow-spectrum TAG coloring: color intensity = activity time / max activity time
[Figure: colored TAG with cold and hot activities; a CPU-bound activity (~45% of time) and a blocked receive (~42% of time, a communication or synchronization problem) stand out as hot spots]
47. Problem identification
TAG ranking process (sketched below)
• Identify potential bottlenecks for further analysis
• Periodic ranking in a moving time window
• Select top problems from the TAG snapshot by rank:
Rank = activity time / task time
> 20% for computation activities
> 3% for communication activities
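A minimal sketch of the ranking rule applied to one TAG snapshot, with the thresholds from the slide; in the real analysis the times would be deltas within the moving window rather than cumulative totals. The Candidate struct is illustrative.

#include <algorithm>
#include <cstdint>
#include <vector>

struct Candidate { bool isEdge; int id; double rank; };

std::vector<Candidate> rankSnapshot(const TaskActivityGraph& tag,
                                    uint64_t windowTaskTimeUs) {
    std::vector<Candidate> top;
    for (int i = 0; i < (int)tag.nodes.size(); ++i) {   // communication activities
        double r = (double)tag.nodes[i].timeUs / windowTaskTimeUs;
        if (r > 0.03) top.push_back({false, i, r});     // > 3% of task time
    }
    int id = 0;
    for (const auto& kv : tag.edges) {                  // computation activities
        double r = (double)kv.second.timeUs / windowTaskTimeUs;
        if (r > 0.20) top.push_back({true, id, r});     // > 20% of task time
        ++id;
    }
    std::sort(top.begin(), top.end(),
              [](const Candidate& a, const Candidate& b) { return a.rank > b.rank; });
    return top;
}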
48. Root-cause analysis
Phase 2: In-depth problem analysis
• For each potential bottleneck, investigate its causes
• Explore a knowledge-based cause space
• Focus on causes that contribute most to the problem time
• Distinguish task-local problems from inter-task problems
– Find root causes of task-local problems
• e.g., CPU-bound computation, local I/O
– Find symptoms of inter-task problems
• e.g., blocked receive, barrier
49. In-depth problem analysis
Performance models for activities
• Classification of activities
• Each class has a performance model that divides the activity cost into separate components
• Each component is a non-exclusive potential cause of the problem
50. In-depth problem analysis
Model for computational activities
• Sequential code region modeled by a TAG edge
• No external knowledge about the computation
• Determine where the edge-constrained code spends time
• Divide the TAG edge into components
– Functional or basic-block decomposition
• Apply statistical profiling constrained to an edge (sketched below)
– Dynamic instrumentation
• Other metrics
– Idle time, I/O time, hardware counters
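As a rough illustration of edge-constrained statistical profiling (an assumption about the mechanism, not the thesis code): instrumentation injected at the edge's endpoints toggles a flag, and a SIGPROF sampler records program counters only while the flag is set. The PC extraction below is x86-64 Linux specific (may require _GNU_SOURCE), and recordSample is a hypothetical async-signal-safe histogram update.

#include <csignal>
#include <cstdint>
#include <sys/time.h>
#include <ucontext.h>

volatile sig_atomic_t insideEdge = 0;    // toggled by dynamic instrumentation

void onEdgeEnter() { insideEdge = 1; }   // injected at the edge's start node
void onEdgeExit()  { insideEdge = 0; }   // injected at the edge's end node

void profHandler(int, siginfo_t*, void* uc) {
    if (!insideEdge) return;             // attribute samples to this edge only
    uintptr_t pc = ((ucontext_t*)uc)->uc_mcontext.gregs[REG_RIP];  // x86-64 Linux
    recordSample(pc);                    // hypothetical histogram update
}

void startEdgeProfiling() {
    struct sigaction sa = {};
    sa.sa_sigaction = profHandler;
    sa.sa_flags = SA_SIGINFO | SA_RESTART;
    sigaction(SIGPROF, &sa, nullptr);
    itimerval it = { {0, 1000}, {0, 1000} };  // ~1 kHz profiling timer
    setitimer(ITIMER_PROF, &it, nullptr);
}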
51. In-depth problem analysis
Model for communication activities
Communication cost = synchronization cost + transmission cost
[Figure: timeline of a sender (events e1, e3 around Send) and a receiver (events e2, e4 around Receive); the overall communication cost at the receiver splits into synchronization cost and transmission cost]
• Piggyback the send entry timestamp (e1)
• Accumulate synchronization cost per message edge (see the sketch below)
• Captures the semantics of well-known synchronization inefficiencies
– Late sender, wait at barrier, early reduce, etc.
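A small sketch of the cost split, assuming globally comparable timestamps (e.g., after clock synchronization): whatever part of the receive interval elapsed before the sender entered its send is synchronization cost; the remainder is transmission.

struct RecvCost { double syncUs; double transferUs; };

// Split a receive's cost using the piggybacked send entry timestamp (e1).
// recvEntryTs / recvExitTs correspond to events e2 / e4 on the slide.
RecvCost splitReceiveCost(double recvEntryTs, double recvExitTs,
                          double sendEntryTs /* piggybacked e1 */) {
    double totalUs = (recvExitTs - recvEntryTs) * 1e6;
    double syncUs = 0.0;
    if (sendEntryTs > recvEntryTs)        // late sender: receiver blocked waiting
        syncUs = (sendEntryTs - recvEntryTs) * 1e6;
    return { syncUs, totalUs - syncUs };  // accumulated per message edge
}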
53. In-depth problem analysis
[Figure: example break-down of a receive activity; the synchronization component requires inter-task cause-effect analysis]
54. Root-cause analysis
Phase 3: Cause-effect analysis
• Explain the causes of synchronization inefficiencies
– Why is the sender late?
• Correlate problems into cause-effect chains
• Distinguish root causes of inefficiencies from their causal propagation (symptoms)
• Pinpoint problems in non-dominant code regions
• Improve the feedback provided to application developers
55. Cause-effect analysis
Causal propagation
[Figure: three-task timeline and causal chain. In task A, ComputationA causes a late sender (task A), which causes inefficiency 1: waiting time WT1 at Receive1 in task B (message m0). In task B, ComputationB causes a late sender (task B), which causes inefficiency 2: waiting time WT2 at Receive2 in task C (message m1).]
56. Cause-effect analysis
Explaining problem causes
• Causes of waiting time between two nodes are derived from the differences between their execution paths (sketched below)
– An online adaptation of the wait-time analysis approach by Meira et al.
– Based on the PTAG model, not a full trace
• Explain synchronization inefficiencies by means of other activities
– Identify the corresponding execution paths in the PTAG model
– Compare the paths
– Build a causal tree with explanations
– Merge the trees of individual problems
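A hedged sketch of the path-comparison step: given the per-activity times along the two corresponding PTAG paths, the waiting time is attributed proportionally to the activities on which the sender's path spent more time than the receiver's (e.g., yielding the 91.9% / 7.7% break-down on the next slide). The integer activity ids are an assumed keying scheme.

#include <map>
#include <vector>

struct Explanation { int activityId; double attributedUs; };

std::vector<Explanation> explainWait(const std::map<int, double>& senderPathUs,
                                     const std::map<int, double>& receiverPathUs,
                                     double waitUs) {
    std::map<int, double> excess;              // where the sender spent extra time
    double total = 0.0;
    for (const auto& kv : senderPathUs) {
        auto it = receiverPathUs.find(kv.first);
        double d = kv.second - (it == receiverPathUs.end() ? 0.0 : it->second);
        if (d > 0) { excess[kv.first] = d; total += d; }
    }
    std::vector<Explanation> causes;
    if (total <= 0) return causes;             // paths do not explain the wait
    for (const auto& kv : excess)              // proportional attribution
        causes.push_back({kv.first, waitUs * kv.second / total});
    return causes;
}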
57. Cause-effect analysis
Execution path comparison
[Figure: comparison of path p (task 2) and path q (task 1) through the PTAG. The inefficiency at MPI_Recv in task 1 (waiting time 138.4 sec) is caused by a late-sender problem in task 2; the comparison attributes 91.9% of the wait to computation edge e3 (task 2) and 7.7% to computation edge e2 (task 2), identified as the root causes.]
58. Benefits of RCA
• Systematic approach to online performance analysis
• Quick identification of problems as they manifest at runtime (without traces)
• Causal correlation of different problems
• Discovery of root causes of synchronization inefficiencies
60. Prototype tool
• Implemented in C++
• DynInst 5.1
• MRNet 1.2
• OpenMPI 1.2.x
• Linux platforms
– x86
– IA-64 (Itanium)
– PowerPC 32/64
[Figure: tool architecture; a global analyzer at the root of an MRNet tree of comm nodes, with a dmad daemon attached to each MPI task]
61. Experimental environment
UAB cluster: x86/Linux, 32 nodes, Intel Pentium IV 3 GHz, Linux FC4, Gigabit Ethernet
BSC MareNostrum: PowerPC-64/Linux, 512 nodes (restricted), PowerPC 2.3 GHz dual-core, SUSE Linux Enterprise Server 9, Myrinet
62. Modeling MPI applications
• Experiences with different classes of MPI codes
– SPMD codes
• WaveSend – 1D stencil, concurrent wave equation
• NAS Parallel Benchmarks – 2D stencils
• SMG2000 – 3D stencil, multigrid solver
– Master/Worker
• XFire – forest fire propagation simulator
+ Demonstrated the ability to model arbitrary MPI code with low overhead
+ Best with regular codes
– Limitations with recursive codes
63. Case study #1: Modeling SPMD
Integer Sort (IS) NAS Parallel Benchmark
• Large integer sort used in "particle method" codes
• Tests both integer computation speed and communication performance
• Mostly collective communication
• We extract the PTAG to understand application communication patterns and behavior
64. Case study #2: Master/Worker
Forest Fire Propagation Simulator (XFire)
• Calculates the expansion of the fireline
• Computationally intensive code, exploits data parallelism
• We extract and cluster the PTAG
65. Evaluation of overheads
Sources of overheads
• Offline startup
– Less than 20 seconds per 1 MB of executable
– A function of program size
• Online TAG construction
– 4-20 μs per instrumented call (*)
– Depends on the number of instrumented calls and loops
• Online TAG sampling
– 40-50 μs per snapshot (256 KB)
– Depends on program structure size and number of communication links
(*) Experiments conducted on the UAB cluster
67. Case study #3: SPMD analysis
WaveSend application
• Parallel calculation of a vibrating string over time
• Wave equation, block decomposition
• P2P communication to exchange boundary points with nearest neighbors
• Synthetic performance problems
68. Case study #3: SPMD analysis
[Figure: WaveSend PTAG after execution]
69. Case study #3: SPMD analysis
[Figure: PTAG after 30 seconds of execution, showing a CPU-bound problem at task 7]
70. Case study #3: SPMD analysis
Potential bottlenecks
• Task 0 findings: 35.4% CPU-bound in edge 8→6
• Task 1 findings: 33% CPU-bound in edge 11→6
• Task 6 findings: 32.1% CPU-bound in edge 11→6
• Task 7 findings: 50.5% CPU-bound in edge 8→6
71. Case study #3: SPMD analysis
Potential bottlenecks
• Task 0 findings: 21.4% blocked receive caused by a late sender from task 1
• Task 1 findings: 19.1% blocked receive caused by a late sender from task 2
• Task 6 findings: 19.2% blocked receive caused by a late sender from task 7
73. Case study #3: SPMD analysis
Analysis results
• Load imbalance found
• Multiple instances of the late-sender problem
• Causal propagation of inefficiencies
• Root cause found in task 7: an imbalanced computational edge
75. Conclusions
• A novel approach for online performance modeling
– Discovers high-level application structure and runtime behavior
– A hybrid technique that combines static code analysis with runtime monitoring to extract performance knowledge
– Scalable to 1000+ processors
• An automated online performance analysis approach
– Enables quick detection of performance bottlenecks
– Focuses on explaining sources of communication and synchronization problems
– Correlates different problems and identifies their root causes
• A prototype tool that models and analyzes MPI applications at runtime
76. Future work
• Modeling
– Support for other classes of activities (I/O, MPI RMA)
– OpenMP applications
– Support for recursive codes
– Multi-experiment support
• Analysis
– More accurate cause-effect analysis with causal paths
– Evaluation of scalability of analysis in large-scale HPC
– Actionable recommendations
– Integration with automatic tuning framework (MATE)
77. Online performance modeling and analysis of message-passing parallel applications
Thank You
PhD Thesis, Oleg Morajko
Universitat Autònoma de Barcelona