SlideShare a Scribd company logo
1 of 174
Download to read offline
Performance Optimization
with PerfExpert and MACPO
Jim Browne, Ashay Rane and Leo Fialho
ICS 2013
Victo
Ap
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Agenda
1 Introduction
2 PerfExpert
3 MACPO
4 GPU/Accelerators
5 Closure
2 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Agenda
3 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Agenda
In the morning:
09:00 Introduction and motivation [Jim]
09:20 What PerfExpert can provide to you? [Leo]
09:30 Demo [Leo]
09:45 How PerfExpert does that? (opening Pandora’s box) [Leo]
10:15 Extending PerfExpert [Leo]
10:30 (Coffee?) break [everyone, including you]
10:45 Hands on tutorial [all the team]
11:45 Morning closure [all the team]
3 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Agenda
4 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Agenda
In the afternoon:
01:30 What MACPO can provide to you? [Ashay]
02:00 Demo [Ashay, Jim]
02:30 How MACPO does that? [Ashay]
03:15 (Coffee?) break [everyone, including you]
03:30 Hands on tutorial [Ashay]
04:00 Selecting code segments to run on GPUs/accelerators [Jim]
04:30 Enhancing PerfExpert with MACPO analysis [all the team]
04:45 Afternoon closure and future work [all the team]
4 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Overview: why PerfExpert?
5 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Overview: why PerfExpert?
Problem: HPC systems operate far below peak
5 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Overview: why PerfExpert?
Problem: HPC systems operate far below peak
Chip/node architectural complexity is growing rapidly
5 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Overview: why PerfExpert?
Problem: HPC systems operate far below peak
Chip/node architectural complexity is growing rapidly
Performance optimization for these chips requires deep
knowledge of architectures, code patterns, compilers, etc.
Performance optimization tools
5 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Overview: why PerfExpert?
Problem: HPC systems operate far below peak
Chip/node architectural complexity is growing rapidly
Performance optimization for these chips requires deep
knowledge of architectures, code patterns, compilers, etc.
Performance optimization tools
Powerful in the hands of experts
5 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Overview: why PerfExpert?
Problem: HPC systems operate far below peak
Chip/node architectural complexity is growing rapidly
Performance optimization for these chips requires deep
knowledge of architectures, code patterns, compilers, etc.
Performance optimization tools
Powerful in the hands of experts
Require detailed performance and system expertise
5 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Overview: why PerfExpert?
Problem: HPC systems operate far below peak
Chip/node architectural complexity is growing rapidly
Performance optimization for these chips requires deep
knowledge of architectures, code patterns, compilers, etc.
Performance optimization tools
Powerful in the hands of experts
Require detailed performance and system expertise
HPC application developers are domain experts, not computer
gurus
5 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Overview: why PerfExpert?
Problem: HPC systems operate far below peak
Chip/node architectural complexity is growing rapidly
Performance optimization for these chips requires deep
knowledge of architectures, code patterns, compilers, etc.
Performance optimization tools
Powerful in the hands of experts
Require detailed performance and system expertise
HPC application developers are domain experts, not computer
gurus
Result: Many HPC programmers do not use these tools
(seriously)
5 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Goal for PerfExpert: democratize optimization!
6 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Goal for PerfExpert: democratize optimization!
Subgoals:
6 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Goal for PerfExpert: democratize optimization!
Subgoals:
Make use of the tool as simple as possible
6 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Goal for PerfExpert: democratize optimization!
Subgoals:
Make use of the tool as simple as possible
Start with only chip/node level optimization
6 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Goal for PerfExpert: democratize optimization!
Subgoals:
Make use of the tool as simple as possible
Start with only chip/node level optimization
Make it adaptable across multiple architectures
6 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Goal for PerfExpert: democratize optimization!
Subgoals:
Make use of the tool as simple as possible
Start with only chip/node level optimization
Make it adaptable across multiple architectures
Design for extension to communication and I/O performance
6 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Goal for PerfExpert: democratize optimization!
Subgoals:
Make use of the tool as simple as possible
Start with only chip/node level optimization
Make it adaptable across multiple architectures
Design for extension to communication and I/O performance
How to accomplish?
6 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Goal for PerfExpert: democratize optimization!
Subgoals:
Make use of the tool as simple as possible
Start with only chip/node level optimization
Make it adaptable across multiple architectures
Design for extension to communication and I/O performance
How to accomplish?
Formulate the performance optimization task as a workflow of
subtasks
6 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Goal for PerfExpert: democratize optimization!
Subgoals:
Make use of the tool as simple as possible
Start with only chip/node level optimization
Make it adaptable across multiple architectures
Design for extension to communication and I/O performance
How to accomplish?
Formulate the performance optimization task as a workflow of
subtasks
Leverage the state-of-the-art: Build on the best available tools
for the subtasks to minimize the effort and cost of
development
6 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Goal for PerfExpert: democratize optimization!
Subgoals:
Make use of the tool as simple as possible
Start with only chip/node level optimization
Make it adaptable across multiple architectures
Design for extension to communication and I/O performance
How to accomplish?
Formulate the performance optimization task as a workflow of
subtasks
Leverage the state-of-the-art: Build on the best available tools
for the subtasks to minimize the effort and cost of
development
Automate the entire workflow
6 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Goal for PerfExpert: democratize optimization!
Subgoals:
Make use of the tool as simple as possible
Start with only chip/node level optimization
Make it adaptable across multiple architectures
Design for extension to communication and I/O performance
How to accomplish?
Formulate the performance optimization task as a workflow of
subtasks
Leverage the state-of-the-art: Build on the best available tools
for the subtasks to minimize the effort and cost of
development
Automate the entire workflow
6 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Introduction
The four stages of automatic performance optimization:
7 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Introduction
The four stages of automatic performance optimization:
Measurement and attribution (1)
7 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Introduction
The four stages of automatic performance optimization:
Measurement and attribution (1)
Analysis, diagnosis and identification of bottlenecks (2)
7 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Introduction
The four stages of automatic performance optimization:
Measurement and attribution (1)
Analysis, diagnosis and identification of bottlenecks (2)
Selection of effective optimizations (3)
7 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Introduction
The four stages of automatic performance optimization:
Measurement and attribution (1)
Analysis, diagnosis and identification of bottlenecks (2)
Selection of effective optimizations (3)
Implementation of optimizations (4)
7 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Introduction
The four stages of automatic performance optimization:
Measurement and attribution (1)
Analysis, diagnosis and identification of bottlenecks (2)
Selection of effective optimizations (3)
Implementation of optimizations (4)
Use of State-of-the-Art:
7 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Introduction
The four stages of automatic performance optimization:
Measurement and attribution (1)
Analysis, diagnosis and identification of bottlenecks (2)
Selection of effective optimizations (3)
Implementation of optimizations (4)
Use of State-of-the-Art:
HPCToolkit, MACPO based on ROSE (1)
7 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Introduction
The four stages of automatic performance optimization:
Measurement and attribution (1)
Analysis, diagnosis and identification of bottlenecks (2)
Selection of effective optimizations (3)
Implementation of optimizations (4)
Use of State-of-the-Art:
HPCToolkit, MACPO based on ROSE (1)
PerfExpert Team (2 and 3)
7 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Introduction
The four stages of automatic performance optimization:
Measurement and attribution (1)
Analysis, diagnosis and identification of bottlenecks (2)
Selection of effective optimizations (3)
Implementation of optimizations (4)
Use of State-of-the-Art:
HPCToolkit, MACPO based on ROSE (1)
PerfExpert Team (2 and 3)
PerfExpert Team based on ROSE, PIPS, Bison and Flex (4)
7 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Introduction
The four stages of automatic performance optimization:
Measurement and attribution (1)
Analysis, diagnosis and identification of bottlenecks (2)
Selection of effective optimizations (3)
Implementation of optimizations (4)
Use of State-of-the-Art:
HPCToolkit, MACPO based on ROSE (1)
PerfExpert Team (2 and 3)
PerfExpert Team based on ROSE, PIPS, Bison and Flex (4)
7 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Introduction
Uniqueness of PerfExpert:
8 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Introduction
Uniqueness of PerfExpert:
Nearly complete optimization first three stages of optimization
for chip/node level
8 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Introduction
Uniqueness of PerfExpert:
Nearly complete optimization first three stages of optimization
for chip/node level
Framework for implementing optimizations is complete and
several optimizations are completed
8 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Introduction
Uniqueness of PerfExpert:
Nearly complete optimization first three stages of optimization
for chip/node level
Framework for implementing optimizations is complete and
several optimizations are completed
Integrates code segment focused and data structure based
measurements (MACPO)
8 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Introduction
Uniqueness of PerfExpert:
Nearly complete optimization first three stages of optimization
for chip/node level
Framework for implementing optimizations is complete and
several optimizations are completed
Integrates code segment focused and data structure based
measurements (MACPO)
Workflow will apply to communication and I/O optimization
as well
8 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Introduction
Uniqueness of PerfExpert:
Nearly complete optimization first three stages of optimization
for chip/node level
Framework for implementing optimizations is complete and
several optimizations are completed
Integrates code segment focused and data structure based
measurements (MACPO)
Workflow will apply to communication and I/O optimization
as well
8 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Introduction
Unique properties of MACPO:
9 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Introduction
Unique properties of MACPO:
Multicore resolved traces
9 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Introduction
Unique properties of MACPO:
Multicore resolved traces
Code segment local measurement
9 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Introduction
Unique properties of MACPO:
Multicore resolved traces
Code segment local measurement
Data structure specific traces
9 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Introduction
Unique properties of MACPO:
Multicore resolved traces
Code segment local measurement
Data structure specific traces
Order of magnitude lower overhead of measurement
9 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Introduction
Unique properties of MACPO:
Multicore resolved traces
Code segment local measurement
Data structure specific traces
Order of magnitude lower overhead of measurement
More accurate (associative) cache models
9 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Introduction
Unique properties of MACPO:
Multicore resolved traces
Code segment local measurement
Data structure specific traces
Order of magnitude lower overhead of measurement
More accurate (associative) cache models
Strides by data structure and code segment
9 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Introduction
Unique properties of MACPO:
Multicore resolved traces
Code segment local measurement
Data structure specific traces
Order of magnitude lower overhead of measurement
More accurate (associative) cache models
Strides by data structure and code segment
Architecture “independent” metrics
9 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Introduction
Unique properties of MACPO:
Multicore resolved traces
Code segment local measurement
Data structure specific traces
Order of magnitude lower overhead of measurement
More accurate (associative) cache models
Strides by data structure and code segment
Architecture “independent” metrics
9 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Agenda
1 Introduction
2 PerfExpert
3 MACPO
4 GPU/Accelerators
5 Closure
10 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
What PerfExpert can provide to you?
Performance report:
11 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
What PerfExpert can provide to you?
Performance report:
Identification of bottlenecks by relevance
11 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
What PerfExpert can provide to you?
Performance report:
Identification of bottlenecks by relevance
Performance analysis based on performance metrics
11 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
What PerfExpert can provide to you?
Performance report:
Identification of bottlenecks by relevance
Performance analysis based on performance metrics
Recommendations for optimization
11 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
What PerfExpert can provide to you?
Performance report:
Identification of bottlenecks by relevance
Performance analysis based on performance metrics
Recommendations for optimization
There are three possible outputs:
11 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
What PerfExpert can provide to you?
Performance report:
Identification of bottlenecks by relevance
Performance analysis based on performance metrics
Recommendations for optimization
There are three possible outputs:
Performance report only
11 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
What PerfExpert can provide to you?
Performance report:
Identification of bottlenecks by relevance
Performance analysis based on performance metrics
Recommendations for optimization
There are three possible outputs:
Performance report only
List of recommendations
11 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
What PerfExpert can provide to you?
Performance report:
Identification of bottlenecks by relevance
Performance analysis based on performance metrics
Recommendations for optimization
There are three possible outputs:
Performance report only
List of recommendations
Fully automated code transformation
11 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
What PerfExpert can provide to you?
Performance report:
Identification of bottlenecks by relevance
Performance analysis based on performance metrics
Recommendations for optimization
There are three possible outputs:
Performance report only
List of recommendations
Fully automated code transformation
11 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
What PerfExpert can provide to you?
Performance report:
12 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
What PerfExpert can provide to you?
Performance report:
Loop in function compute() at mm.c:8 (99.8% of the total runtime)
===============================================================================
ratio to total instrns % 0.........25...........50.........75........100
- floating point : 100 ***********************************************
- data accesses : 25 ************
* GFLOPS (% max) : 12 ******
- packed : 0 *
- scalar : 12 ******
-------------------------------------------------------------------------------
performance assessment LCPI good......okay......fair......poor......bad....
* overall : 3.0 >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>+
upper bound estimates
* data accesses : 9.6 >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>+
- L1d hits : 0.9 >>>>>>>>>>>>>>>>>
- L2d hits : 1.8 >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
- L2d misses : 6.9 >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>+
* instruction accesses : 0.1 >
- L1i hits : 0.0 >
- L2i hits : 0.0 >
- L2i misses : 0.1 >
* data TLB : 4.6 >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>+
* instruction TLB : 0.0 >
* branch instructions : 0.1 >>
- correctly predicted : 0.1 >>
- mispredicted : 0.0 >
* floating-point instr : 5.1 >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>+
- fast FP instr : 5.1 >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>+
- slow FP instr : 0.0 >
12 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
What PerfExpert can provide to you?
List of Recommendations:
13 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
What PerfExpert can provide to you?
List of Recommendations:
#--------------------------------------------------
# Recommendations for mm.c:8
#--------------------------------------------------
#
# This is a possible recommendation for this code segment
#
Recommendation ID: 31
Recommendation Description: change the order of loops
Recommendation Reason: this optimization may improve the memory access pattern and make it more
cache and TLB friendly
Pattern Recognizers: c loop2 f loop2
Code example:
loop i {
loop j {...}
}
=====> loop j {
loop i {...}
}
13 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Short Demo
Short demo
14 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
How PerfExpert does that: The Big Picture
User Interface!
original!
source!
code!
Compiler! Analyzer!
(HPCToolKit)!
MACPO!
code bottlenecks and
general performance
metrics!
add data access!
performance metrics to
previous output!
code fragments to!
optimize and list of!
recommendations!
!
Pattern
Recognizer!
(Bison/Flex)!
code fragments to
optimize and list of
code transformers!
!
optimized code
fragments!
Optimization
Formulator!
(ROSE)!
Integrator!
(ROSE)!
optimized!
source code!
!
Support Database!
Transformer!
(PIPS/ROSE)!
Compilation Phase!
DiagnoseandRecommendationPhases!
Code Transformation Phase!
CodeIntegrationPhase!
Input/output
data!
Developed by
the authors!
Standard
Compiler!
Measurement and Analysis Phases!
Work Flow
Script!
binary object!
15 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
How PerfExpert does that: Work Flow Script
User Interface!
original!
source!
code!
Compiler! Analyzer!
(HPCToolKit)!
MACPO!
code bottlenecks and
general performance
metrics!
add data access!
performance metrics to
previous output!
code fragments to!
optimize and list of!
recommendations!
!
Pattern
Recognizer!
(Bison/Flex)!
code fragments to
optimize and list of
code transformers!
!
optimized code
fragments!
Optimization
Formulator!
(ROSE)!
Integrator!
(ROSE)!
optimized!
source code!
!
Support Database!
Transformer!
(PIPS/ROSE)!
Compilation Phase!
DiagnoseandRecommendationPhases!
Code Transformation Phase!
CodeIntegrationPhase!
Input/output
data!
Developed by
the authors!
Standard
Compiler!
Measurement and Analysis Phases!
Work Flow
Script!
binary object!
16 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
How PerfExpert does that: Work Flow Script
User Interface!
original!
source!
code!
Compiler! Analyzer!
(HPCToolKit)!
MACPO!
code bottlenecks and
general performance
metrics!
add data access!
performance metrics to
previous output!
code fragments to!
optimize and list of!
recommendations!
!
Pattern
Recognizer!
(Bison/Flex)!
code fragments to
optimize and list of
code transformers!
!
optimized code
fragments!
Optimization
Formulator!
(ROSE)!
Integrator!
(ROSE)!
optimized!
source code!
!
Support Database!
Transformer!
(PIPS/ROSE)!
Compilation Phase!
DiagnoseandRecommendationPhases!
Code Transformation Phase!
CodeIntegrationPhase!
Input/output
data!
Developed by
the authors!
Standard
Compiler!
Measurement and Analysis Phases!
Work Flow
Script!
binary object!
This is a shell script
16 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
How PerfExpert does that: Work Flow Script
User Interface!
original!
source!
code!
Compiler! Analyzer!
(HPCToolKit)!
MACPO!
code bottlenecks and
general performance
metrics!
add data access!
performance metrics to
previous output!
code fragments to!
optimize and list of!
recommendations!
!
Pattern
Recognizer!
(Bison/Flex)!
code fragments to
optimize and list of
code transformers!
!
optimized code
fragments!
Optimization
Formulator!
(ROSE)!
Integrator!
(ROSE)!
optimized!
source code!
!
Support Database!
Transformer!
(PIPS/ROSE)!
Compilation Phase!
DiagnoseandRecommendationPhases!
Code Transformation Phase!
CodeIntegrationPhase!
Input/output
data!
Developed by
the authors!
Standard
Compiler!
Measurement and Analysis Phases!
Work Flow
Script!
binary object!
This is a shell script
Accepts parameters
16 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
How PerfExpert does that: Work Flow Script
User Interface!
original!
source!
code!
Compiler! Analyzer!
(HPCToolKit)!
MACPO!
code bottlenecks and
general performance
metrics!
add data access!
performance metrics to
previous output!
code fragments to!
optimize and list of!
recommendations!
!
Pattern
Recognizer!
(Bison/Flex)!
code fragments to
optimize and list of
code transformers!
!
optimized code
fragments!
Optimization
Formulator!
(ROSE)!
Integrator!
(ROSE)!
optimized!
source code!
!
Support Database!
Transformer!
(PIPS/ROSE)!
Compilation Phase!
DiagnoseandRecommendationPhases!
Code Transformation Phase!
CodeIntegrationPhase!
Input/output
data!
Developed by
the authors!
Standard
Compiler!
Measurement and Analysis Phases!
Work Flow
Script!
binary object!
This is a shell script
Accepts parameters
Invokes all tools
(including the compiler)
16 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
How PerfExpert does that: Work Flow Script
User Interface!
original!
source!
code!
Compiler! Analyzer!
(HPCToolKit)!
MACPO!
code bottlenecks and
general performance
metrics!
add data access!
performance metrics to
previous output!
code fragments to!
optimize and list of!
recommendations!
!
Pattern
Recognizer!
(Bison/Flex)!
code fragments to
optimize and list of
code transformers!
!
optimized code
fragments!
Optimization
Formulator!
(ROSE)!
Integrator!
(ROSE)!
optimized!
source code!
!
Support Database!
Transformer!
(PIPS/ROSE)!
Compilation Phase!
DiagnoseandRecommendationPhases!
Code Transformation Phase!
CodeIntegrationPhase!
Input/output
data!
Developed by
the authors!
Standard
Compiler!
Measurement and Analysis Phases!
Work Flow
Script!
binary object!
This is a shell script
Accepts parameters
Invokes all tools
(including the compiler)
Backward compatible
16 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
How PerfExpert does that: Analyzer
User Interface!
original!
source!
code!
Compiler! Analyzer!
(HPCToolKit)!
MACPO!
code bottlenecks and
general performance
metrics!
add data access!
performance metrics to
previous output!
code fragments to!
optimize and list of!
recommendations!
!
Pattern
Recognizer!
(Bison/Flex)!
code fragments to
optimize and list of
code transformers!
!
optimized code
fragments!
Optimization
Formulator!
(ROSE)!
Integrator!
(ROSE)!
optimized!
source code!
!
Support Database!
Transformer!
(PIPS/ROSE)!
Compilation Phase!
DiagnoseandRecommendationPhases!
Code Transformation Phase!
CodeIntegrationPhase!
Input/output
data!
Developed by
the authors!
Standard
Compiler!
Measurement and Analysis Phases!
Work Flow
Script!
binary object!
17 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
How PerfExpert does that: Analyzer
User Interface!
original!
source!
code!
Compiler! Analyzer!
(HPCToolKit)!
MACPO!
code bottlenecks and
general performance
metrics!
add data access!
performance metrics to
previous output!
code fragments to!
optimize and list of!
recommendations!
!
Pattern
Recognizer!
(Bison/Flex)!
code fragments to
optimize and list of
code transformers!
!
optimized code
fragments!
Optimization
Formulator!
(ROSE)!
Integrator!
(ROSE)!
optimized!
source code!
!
Support Database!
Transformer!
(PIPS/ROSE)!
Compilation Phase!
DiagnoseandRecommendationPhases!
Code Transformation Phase!
CodeIntegrationPhase!
Input/output
data!
Developed by
the authors!
Standard
Compiler!
Measurement and Analysis Phases!
Work Flow
Script!
binary object!
This is the old
PerfExpert, minus
“recommender”
17 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
How PerfExpert does that: Analyzer
User Interface!
original!
source!
code!
Compiler! Analyzer!
(HPCToolKit)!
MACPO!
code bottlenecks and
general performance
metrics!
add data access!
performance metrics to
previous output!
code fragments to!
optimize and list of!
recommendations!
!
Pattern
Recognizer!
(Bison/Flex)!
code fragments to
optimize and list of
code transformers!
!
optimized code
fragments!
Optimization
Formulator!
(ROSE)!
Integrator!
(ROSE)!
optimized!
source code!
!
Support Database!
Transformer!
(PIPS/ROSE)!
Compilation Phase!
DiagnoseandRecommendationPhases!
Code Transformation Phase!
CodeIntegrationPhase!
Input/output
data!
Developed by
the authors!
Standard
Compiler!
Measurement and Analysis Phases!
Work Flow
Script!
binary object!
This is the old
PerfExpert, minus
“recommender”
Based on HPCToolKit
17 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
How PerfExpert does that: MACPO
User Interface!
original!
source!
code!
Compiler! Analyzer!
(HPCToolKit)!
MACPO!
code bottlenecks and
general performance
metrics!
add data access!
performance metrics to
previous output!
code fragments to!
optimize and list of!
recommendations!
!
Pattern
Recognizer!
(Bison/Flex)!
code fragments to
optimize and list of
code transformers!
!
optimized code
fragments!
Optimization
Formulator!
(ROSE)!
Integrator!
(ROSE)!
optimized!
source code!
!
Support Database!
Transformer!
(PIPS/ROSE)!
Compilation Phase!
DiagnoseandRecommendationPhases!
Code Transformation Phase!
CodeIntegrationPhase!
Input/output
data!
Developed by
the authors!
Standard
Compiler!
Measurement and Analysis Phases!
Work Flow
Script!
binary object!
18 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
How PerfExpert does that: MACPO
User Interface!
original!
source!
code!
Compiler! Analyzer!
(HPCToolKit)!
MACPO!
code bottlenecks and
general performance
metrics!
add data access!
performance metrics to
previous output!
code fragments to!
optimize and list of!
recommendations!
!
Pattern
Recognizer!
(Bison/Flex)!
code fragments to
optimize and list of
code transformers!
!
optimized code
fragments!
Optimization
Formulator!
(ROSE)!
Integrator!
(ROSE)!
optimized!
source code!
!
Support Database!
Transformer!
(PIPS/ROSE)!
Compilation Phase!
DiagnoseandRecommendationPhases!
Code Transformation Phase!
CodeIntegrationPhase!
Input/output
data!
Developed by
the authors!
Standard
Compiler!
Measurement and Analysis Phases!
Work Flow
Script!
binary object!
Enhances the set of
metrics with data access
performance metrics
18 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
How PerfExpert does that: MACPO
User Interface!
original!
source!
code!
Compiler! Analyzer!
(HPCToolKit)!
MACPO!
code bottlenecks and
general performance
metrics!
add data access!
performance metrics to
previous output!
code fragments to!
optimize and list of!
recommendations!
!
Pattern
Recognizer!
(Bison/Flex)!
code fragments to
optimize and list of
code transformers!
!
optimized code
fragments!
Optimization
Formulator!
(ROSE)!
Integrator!
(ROSE)!
optimized!
source code!
!
Support Database!
Transformer!
(PIPS/ROSE)!
Compilation Phase!
DiagnoseandRecommendationPhases!
Code Transformation Phase!
CodeIntegrationPhase!
Input/output
data!
Developed by
the authors!
Standard
Compiler!
Measurement and Analysis Phases!
Work Flow
Script!
binary object!
Enhances the set of
metrics with data access
performance metrics
Based on ROSE
18 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
How PerfExpert does that: Optimization Formulator
User Interface!
original!
source!
code!
Compiler! Analyzer!
(HPCToolKit)!
MACPO!
code bottlenecks and
general performance
metrics!
add data access!
performance metrics to
previous output!
code fragments to!
optimize and list of!
recommendations!
!
Pattern
Recognizer!
(Bison/Flex)!
code fragments to
optimize and list of
code transformers!
!
optimized code
fragments!
Optimization
Formulator!
(ROSE)!
Integrator!
(ROSE)!
optimized!
source code!
!
Support Database!
Transformer!
(PIPS/ROSE)!
Compilation Phase!
DiagnoseandRecommendationPhases!
Code Transformation Phase!
CodeIntegrationPhase!
Input/output
data!
Developed by
the authors!
Standard
Compiler!
Measurement and Analysis Phases!
Work Flow
Script!
binary object!
19 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
How PerfExpert does that: Optimization Formulator
User Interface!
original!
source!
code!
Compiler! Analyzer!
(HPCToolKit)!
MACPO!
code bottlenecks and
general performance
metrics!
add data access!
performance metrics to
previous output!
code fragments to!
optimize and list of!
recommendations!
!
Pattern
Recognizer!
(Bison/Flex)!
code fragments to
optimize and list of
code transformers!
!
optimized code
fragments!
Optimization
Formulator!
(ROSE)!
Integrator!
(ROSE)!
optimized!
source code!
!
Support Database!
Transformer!
(PIPS/ROSE)!
Compilation Phase!
DiagnoseandRecommendationPhases!
Code Transformation Phase!
CodeIntegrationPhase!
Input/output
data!
Developed by
the authors!
Standard
Compiler!
Measurement and Analysis Phases!
Work Flow
Script!
binary object!
Loads performance metrics on the Support Database
19 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
How PerfExpert does that: Optimization Formulator
User Interface!
original!
source!
code!
Compiler! Analyzer!
(HPCToolKit)!
MACPO!
code bottlenecks and
general performance
metrics!
add data access!
performance metrics to
previous output!
code fragments to!
optimize and list of!
recommendations!
!
Pattern
Recognizer!
(Bison/Flex)!
code fragments to
optimize and list of
code transformers!
!
optimized code
fragments!
Optimization
Formulator!
(ROSE)!
Integrator!
(ROSE)!
optimized!
source code!
!
Support Database!
Transformer!
(PIPS/ROSE)!
Compilation Phase!
DiagnoseandRecommendationPhases!
Code Transformation Phase!
CodeIntegrationPhase!
Input/output
data!
Developed by
the authors!
Standard
Compiler!
Measurement and Analysis Phases!
Work Flow
Script!
binary object!
Loads performance metrics on the Support Database
Runs all “recommendation selection functions”
19 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
How PerfExpert does that: Optimization Formulator
User Interface!
original!
source!
code!
Compiler! Analyzer!
(HPCToolKit)!
MACPO!
code bottlenecks and
general performance
metrics!
add data access!
performance metrics to
previous output!
code fragments to!
optimize and list of!
recommendations!
!
Pattern
Recognizer!
(Bison/Flex)!
code fragments to
optimize and list of
code transformers!
!
optimized code
fragments!
Optimization
Formulator!
(ROSE)!
Integrator!
(ROSE)!
optimized!
source code!
!
Support Database!
Transformer!
(PIPS/ROSE)!
Compilation Phase!
DiagnoseandRecommendationPhases!
Code Transformation Phase!
CodeIntegrationPhase!
Input/output
data!
Developed by
the authors!
Standard
Compiler!
Measurement and Analysis Phases!
Work Flow
Script!
binary object!
Loads performance metrics on the Support Database
Runs all “recommendation selection functions”
Concatenates and ranks the list of recommendations
19 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
How PerfExpert does that: Optimization Formulator
User Interface!
original!
source!
code!
Compiler! Analyzer!
(HPCToolKit)!
MACPO!
code bottlenecks and
general performance
metrics!
add data access!
performance metrics to
previous output!
code fragments to!
optimize and list of!
recommendations!
!
Pattern
Recognizer!
(Bison/Flex)!
code fragments to
optimize and list of
code transformers!
!
optimized code
fragments!
Optimization
Formulator!
(ROSE)!
Integrator!
(ROSE)!
optimized!
source code!
!
Support Database!
Transformer!
(PIPS/ROSE)!
Compilation Phase!
DiagnoseandRecommendationPhases!
Code Transformation Phase!
CodeIntegrationPhase!
Input/output
data!
Developed by
the authors!
Standard
Compiler!
Measurement and Analysis Phases!
Work Flow
Script!
binary object!
Loads performance metrics on the Support Database
Runs all “recommendation selection functions”
Concatenates and ranks the list of recommendations
Extracts code fragments identified as bottlenecks
19 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
How PerfExpert does that: Optimization Formulator
User Interface!
original!
source!
code!
Compiler! Analyzer!
(HPCToolKit)!
MACPO!
code bottlenecks and
general performance
metrics!
add data access!
performance metrics to
previous output!
code fragments to!
optimize and list of!
recommendations!
!
Pattern
Recognizer!
(Bison/Flex)!
code fragments to
optimize and list of
code transformers!
!
optimized code
fragments!
Optimization
Formulator!
(ROSE)!
Integrator!
(ROSE)!
optimized!
source code!
!
Support Database!
Transformer!
(PIPS/ROSE)!
Compilation Phase!
DiagnoseandRecommendationPhases!
Code Transformation Phase!
CodeIntegrationPhase!
Input/output
data!
Developed by
the authors!
Standard
Compiler!
Measurement and Analysis Phases!
Work Flow
Script!
binary object!
Loads performance metrics on the Support Database
Runs all “recommendation selection functions”
Concatenates and ranks the list of recommendations
Extracts code fragments identified as bottlenecks
Based on ROSE
19 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
How PerfExpert does that: Optimization Formulator
User Interface!
original!
source!
code!
Compiler! Analyzer!
(HPCToolKit)!
MACPO!
code bottlenecks and
general performance
metrics!
add data access!
performance metrics to
previous output!
code fragments to!
optimize and list of!
recommendations!
!
Pattern
Recognizer!
(Bison/Flex)!
code fragments to
optimize and list of
code transformers!
!
optimized code
fragments!
Optimization
Formulator!
(ROSE)!
Integrator!
(ROSE)!
optimized!
source code!
!
Support Database!
Transformer!
(PIPS/ROSE)!
Compilation Phase!
DiagnoseandRecommendationPhases!
Code Transformation Phase!
CodeIntegrationPhase!
Input/output
data!
Developed by
the authors!
Standard
Compiler!
Measurement and Analysis Phases!
Work Flow
Script!
binary object!
Loads performance metrics on the Support Database
Runs all “recommendation selection functions”
Concatenates and ranks the list of recommendations
Extracts code fragments identified as bottlenecks
Based on ROSE
Extendable: accepts user-defined performance
metrics
19 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
How PerfExpert does that: Optimization Formulator
User Interface!
original!
source!
code!
Compiler! Analyzer!
(HPCToolKit)!
MACPO!
code bottlenecks and
general performance
metrics!
add data access!
performance metrics to
previous output!
code fragments to!
optimize and list of!
recommendations!
!
Pattern
Recognizer!
(Bison/Flex)!
code fragments to
optimize and list of
code transformers!
!
optimized code
fragments!
Optimization
Formulator!
(ROSE)!
Integrator!
(ROSE)!
optimized!
source code!
!
Support Database!
Transformer!
(PIPS/ROSE)!
Compilation Phase!
DiagnoseandRecommendationPhases!
Code Transformation Phase!
CodeIntegrationPhase!
Input/output
data!
Developed by
the authors!
Standard
Compiler!
Measurement and Analysis Phases!
Work Flow
Script!
binary object!
Loads performance metrics on the Support Database
Runs all “recommendation selection functions”
Concatenates and ranks the list of recommendations
Extracts code fragments identified as bottlenecks
Based on ROSE
Extendable: accepts user-defined performance
metrics
Extendable: it is possible to write new
“recommendation selection functions” (SQL query)
19 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
How PerfExpert does that: Support Database
User Interface!
original!
source!
code!
Compiler! Analyzer!
(HPCToolKit)!
MACPO!
code bottlenecks and
general performance
metrics!
add data access!
performance metrics to
previous output!
code fragments to!
optimize and list of!
recommendations!
!
Pattern
Recognizer!
(Bison/Flex)!
code fragments to
optimize and list of
code transformers!
!
optimized code
fragments!
Optimization
Formulator!
(ROSE)!
Integrator!
(ROSE)!
optimized!
source code!
!
Support Database!
Transformer!
(PIPS/ROSE)!
Compilation Phase!
DiagnoseandRecommendationPhases!
Code Transformation Phase!
CodeIntegrationPhase!
Input/output
data!
Developed by
the authors!
Standard
Compiler!
Measurement and Analysis Phases!
Work Flow
Script!
binary object!
20 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
How PerfExpert does that: Support Database
User Interface!
original!
source!
code!
Compiler! Analyzer!
(HPCToolKit)!
MACPO!
code bottlenecks and
general performance
metrics!
add data access!
performance metrics to
previous output!
code fragments to!
optimize and list of!
recommendations!
!
Pattern
Recognizer!
(Bison/Flex)!
code fragments to
optimize and list of
code transformers!
!
optimized code
fragments!
Optimization
Formulator!
(ROSE)!
Integrator!
(ROSE)!
optimized!
source code!
!
Support Database!
Transformer!
(PIPS/ROSE)!
Compilation Phase!
DiagnoseandRecommendationPhases!
Code Transformation Phase!
CodeIntegrationPhase!
Input/output
data!
Developed by
the authors!
Standard
Compiler!
Measurement and Analysis Phases!
Work Flow
Script!
binary object!
This is a SQLite database
20 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
How PerfExpert does that: Support Database
User Interface!
original!
source!
code!
Compiler! Analyzer!
(HPCToolKit)!
MACPO!
code bottlenecks and
general performance
metrics!
add data access!
performance metrics to
previous output!
code fragments to!
optimize and list of!
recommendations!
!
Pattern
Recognizer!
(Bison/Flex)!
code fragments to
optimize and list of
code transformers!
!
optimized code
fragments!
Optimization
Formulator!
(ROSE)!
Integrator!
(ROSE)!
optimized!
source code!
!
Support Database!
Transformer!
(PIPS/ROSE)!
Compilation Phase!
DiagnoseandRecommendationPhases!
Code Transformation Phase!
CodeIntegrationPhase!
Input/output
data!
Developed by
the authors!
Standard
Compiler!
Measurement and Analysis Phases!
Work Flow
Script!
binary object!
This is a SQLite database
Stores the list of “recommendation selection functions”,
“pattern recognizers” and “code transformers”
20 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
How PerfExpert does that: Support Database
User Interface!
original!
source!
code!
Compiler! Analyzer!
(HPCToolKit)!
MACPO!
code bottlenecks and
general performance
metrics!
add data access!
performance metrics to
previous output!
code fragments to!
optimize and list of!
recommendations!
!
Pattern
Recognizer!
(Bison/Flex)!
code fragments to
optimize and list of
code transformers!
!
optimized code
fragments!
Optimization
Formulator!
(ROSE)!
Integrator!
(ROSE)!
optimized!
source code!
!
Support Database!
Transformer!
(PIPS/ROSE)!
Compilation Phase!
DiagnoseandRecommendationPhases!
Code Transformation Phase!
CodeIntegrationPhase!
Input/output
data!
Developed by
the authors!
Standard
Compiler!
Measurement and Analysis Phases!
Work Flow
Script!
binary object!
This is a SQLite database
Stores the list of “recommendation selection functions”,
“pattern recognizers” and “code transformers”
Engine to run the “recommendation selection functions”
20 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
How PerfExpert does that: Pattern Recognizer
User Interface!
original!
source!
code!
Compiler! Analyzer!
(HPCToolKit)!
MACPO!
code bottlenecks and
general performance
metrics!
add data access!
performance metrics to
previous output!
code fragments to!
optimize and list of!
recommendations!
!
Pattern
Recognizer!
(Bison/Flex)!
code fragments to
optimize and list of
code transformers!
!
optimized code
fragments!
Optimization
Formulator!
(ROSE)!
Integrator!
(ROSE)!
optimized!
source code!
!
Support Database!
Transformer!
(PIPS/ROSE)!
Compilation Phase!
DiagnoseandRecommendationPhases!
Code Transformation Phase!
CodeIntegrationPhase!
Input/output
data!
Developed by
the authors!
Standard
Compiler!
Measurement and Analysis Phases!
Work Flow
Script!
binary object!
21 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
How PerfExpert does that: Pattern Recognizer
User Interface!
original!
source!
code!
Compiler! Analyzer!
(HPCToolKit)!
MACPO!
code bottlenecks and
general performance
metrics!
add data access!
performance metrics to
previous output!
code fragments to!
optimize and list of!
recommendations!
!
Pattern
Recognizer!
(Bison/Flex)!
code fragments to
optimize and list of
code transformers!
!
optimized code
fragments!
Optimization
Formulator!
(ROSE)!
Integrator!
(ROSE)!
optimized!
source code!
!
Support Database!
Transformer!
(PIPS/ROSE)!
Compilation Phase!
DiagnoseandRecommendationPhases!
Code Transformation Phase!
CodeIntegrationPhase!
Input/output
data!
Developed by
the authors!
Standard
Compiler!
Measurement and Analysis Phases!
Work Flow
Script!
binary object!
Acts as a “filter” trying to find (match) the right code
transformer for a source code fragment (identified as bottleneck)
21 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
How PerfExpert does that: Pattern Recognizer
User Interface!
original!
source!
code!
Compiler! Analyzer!
(HPCToolKit)!
MACPO!
code bottlenecks and
general performance
metrics!
add data access!
performance metrics to
previous output!
code fragments to!
optimize and list of!
recommendations!
!
Pattern
Recognizer!
(Bison/Flex)!
code fragments to
optimize and list of
code transformers!
!
optimized code
fragments!
Optimization
Formulator!
(ROSE)!
Integrator!
(ROSE)!
optimized!
source code!
!
Support Database!
Transformer!
(PIPS/ROSE)!
Compilation Phase!
DiagnoseandRecommendationPhases!
Code Transformation Phase!
CodeIntegrationPhase!
Input/output
data!
Developed by
the authors!
Standard
Compiler!
Measurement and Analysis Phases!
Work Flow
Script!
binary object!
Acts as a “filter” trying to find (match) the right code
transformer for a source code fragment (identified as bottleneck)
Language sensitive
21 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
How PerfExpert does that: Pattern Recognizer
User Interface!
original!
source!
code!
Compiler! Analyzer!
(HPCToolKit)!
MACPO!
code bottlenecks and
general performance
metrics!
add data access!
performance metrics to
previous output!
code fragments to!
optimize and list of!
recommendations!
!
Pattern
Recognizer!
(Bison/Flex)!
code fragments to
optimize and list of
code transformers!
!
optimized code
fragments!
Optimization
Formulator!
(ROSE)!
Integrator!
(ROSE)!
optimized!
source code!
!
Support Database!
Transformer!
(PIPS/ROSE)!
Compilation Phase!
DiagnoseandRecommendationPhases!
Code Transformation Phase!
CodeIntegrationPhase!
Input/output
data!
Developed by
the authors!
Standard
Compiler!
Measurement and Analysis Phases!
Work Flow
Script!
binary object!
Acts as a “filter” trying to find (match) the right code
transformer for a source code fragment (identified as bottleneck)
Language sensitive
Based on Bison and Flex
21 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
How PerfExpert does that: Pattern Recognizer
User Interface!
original!
source!
code!
Compiler! Analyzer!
(HPCToolKit)!
MACPO!
code bottlenecks and
general performance
metrics!
add data access!
performance metrics to
previous output!
code fragments to!
optimize and list of!
recommendations!
!
Pattern
Recognizer!
(Bison/Flex)!
code fragments to
optimize and list of
code transformers!
!
optimized code
fragments!
Optimization
Formulator!
(ROSE)!
Integrator!
(ROSE)!
optimized!
source code!
!
Support Database!
Transformer!
(PIPS/ROSE)!
Compilation Phase!
DiagnoseandRecommendationPhases!
Code Transformation Phase!
CodeIntegrationPhase!
Input/output
data!
Developed by
the authors!
Standard
Compiler!
Measurement and Analysis Phases!
Work Flow
Script!
binary object!
Acts as a “filter” trying to find (match) the right code
transformer for a source code fragment (identified as bottleneck)
Language sensitive
Based on Bison and Flex
One recommendation may have multiple pattern recognizers
21 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
How PerfExpert does that: Pattern Recognizer
User Interface!
original!
source!
code!
Compiler! Analyzer!
(HPCToolKit)!
MACPO!
code bottlenecks and
general performance
metrics!
add data access!
performance metrics to
previous output!
code fragments to!
optimize and list of!
recommendations!
!
Pattern
Recognizer!
(Bison/Flex)!
code fragments to
optimize and list of
code transformers!
!
optimized code
fragments!
Optimization
Formulator!
(ROSE)!
Integrator!
(ROSE)!
optimized!
source code!
!
Support Database!
Transformer!
(PIPS/ROSE)!
Compilation Phase!
DiagnoseandRecommendationPhases!
Code Transformation Phase!
CodeIntegrationPhase!
Input/output
data!
Developed by
the authors!
Standard
Compiler!
Measurement and Analysis Phases!
Work Flow
Script!
binary object!
Acts as a “filter” trying to find (match) the right code
transformer for a source code fragment (identified as bottleneck)
Language sensitive
Based on Bison and Flex
One recommendation may have multiple pattern recognizers
Extendable: it is possible to write new grammars to recognize/
match/filter code fragments (to work with new “transformers”)
21 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
How PerfExpert does that: Transformer
User Interface!
original!
source!
code!
Compiler! Analyzer!
(HPCToolKit)!
MACPO!
code bottlenecks and
general performance
metrics!
add data access!
performance metrics to
previous output!
code fragments to!
optimize and list of!
recommendations!
!
Pattern
Recognizer!
(Bison/Flex)!
code fragments to
optimize and list of
code transformers!
!
optimized code
fragments!
Optimization
Formulator!
(ROSE)!
Integrator!
(ROSE)!
optimized!
source code!
!
Support Database!
Transformer!
(PIPS/ROSE)!
Compilation Phase!
DiagnoseandRecommendationPhases!
Code Transformation Phase!
CodeIntegrationPhase!
Input/output
data!
Developed by
the authors!
Standard
Compiler!
Measurement and Analysis Phases!
Work Flow
Script!
binary object!
22 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
How PerfExpert does that: Transformer
User Interface!
original!
source!
code!
Compiler! Analyzer!
(HPCToolKit)!
MACPO!
code bottlenecks and
general performance
metrics!
add data access!
performance metrics to
previous output!
code fragments to!
optimize and list of!
recommendations!
!
Pattern
Recognizer!
(Bison/Flex)!
code fragments to
optimize and list of
code transformers!
!
optimized code
fragments!
Optimization
Formulator!
(ROSE)!
Integrator!
(ROSE)!
optimized!
source code!
!
Support Database!
Transformer!
(PIPS/ROSE)!
Compilation Phase!
DiagnoseandRecommendationPhases!
Code Transformation Phase!
CodeIntegrationPhase!
Input/output
data!
Developed by
the authors!
Standard
Compiler!
Measurement and Analysis Phases!
Work Flow
Script!
binary object!
Implements the recommendation by applying source code
transformation
22 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
How PerfExpert does that: Transformer
User Interface!
original!
source!
code!
Compiler! Analyzer!
(HPCToolKit)!
MACPO!
code bottlenecks and
general performance
metrics!
add data access!
performance metrics to
previous output!
code fragments to!
optimize and list of!
recommendations!
!
Pattern
Recognizer!
(Bison/Flex)!
code fragments to
optimize and list of
code transformers!
!
optimized code
fragments!
Optimization
Formulator!
(ROSE)!
Integrator!
(ROSE)!
optimized!
source code!
!
Support Database!
Transformer!
(PIPS/ROSE)!
Compilation Phase!
DiagnoseandRecommendationPhases!
Code Transformation Phase!
CodeIntegrationPhase!
Input/output
data!
Developed by
the authors!
Standard
Compiler!
Measurement and Analysis Phases!
Work Flow
Script!
binary object!
Implements the recommendation by applying source code
transformation
May or may not be language sensitive
22 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
How PerfExpert does that: Transformer
User Interface!
original!
source!
code!
Compiler! Analyzer!
(HPCToolKit)!
MACPO!
code bottlenecks and
general performance
metrics!
add data access!
performance metrics to
previous output!
code fragments to!
optimize and list of!
recommendations!
!
Pattern
Recognizer!
(Bison/Flex)!
code fragments to
optimize and list of
code transformers!
!
optimized code
fragments!
Optimization
Formulator!
(ROSE)!
Integrator!
(ROSE)!
optimized!
source code!
!
Support Database!
Transformer!
(PIPS/ROSE)!
Compilation Phase!
DiagnoseandRecommendationPhases!
Code Transformation Phase!
CodeIntegrationPhase!
Input/output
data!
Developed by
the authors!
Standard
Compiler!
Measurement and Analysis Phases!
Work Flow
Script!
binary object!
Implements the recommendation by applying source code
transformation
May or may not be language sensitive
Based on ROSE, PIPS or anything you want
22 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
How PerfExpert does that: Transformer
User Interface!
original!
source!
code!
Compiler! Analyzer!
(HPCToolKit)!
MACPO!
code bottlenecks and
general performance
metrics!
add data access!
performance metrics to
previous output!
code fragments to!
optimize and list of!
recommendations!
!
Pattern
Recognizer!
(Bison/Flex)!
code fragments to
optimize and list of
code transformers!
!
optimized code
fragments!
Optimization
Formulator!
(ROSE)!
Integrator!
(ROSE)!
optimized!
source code!
!
Support Database!
Transformer!
(PIPS/ROSE)!
Compilation Phase!
DiagnoseandRecommendationPhases!
Code Transformation Phase!
CodeIntegrationPhase!
Input/output
data!
Developed by
the authors!
Standard
Compiler!
Measurement and Analysis Phases!
Work Flow
Script!
binary object!
Implements the recommendation by applying source code
transformation
May or may not be language sensitive
Based on ROSE, PIPS or anything you want
One code pattern may lead to multiple code transformers
22 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
How PerfExpert does that: Transformer
User Interface!
original!
source!
code!
Compiler! Analyzer!
(HPCToolKit)!
MACPO!
code bottlenecks and
general performance
metrics!
add data access!
performance metrics to
previous output!
code fragments to!
optimize and list of!
recommendations!
!
Pattern
Recognizer!
(Bison/Flex)!
code fragments to
optimize and list of
code transformers!
!
optimized code
fragments!
Optimization
Formulator!
(ROSE)!
Integrator!
(ROSE)!
optimized!
source code!
!
Support Database!
Transformer!
(PIPS/ROSE)!
Compilation Phase!
DiagnoseandRecommendationPhases!
Code Transformation Phase!
CodeIntegrationPhase!
Input/output
data!
Developed by
the authors!
Standard
Compiler!
Measurement and Analysis Phases!
Work Flow
Script!
binary object!
Implements the recommendation by applying source code
transformation
May or may not be language sensitive
Based on ROSE, PIPS or anything you want
One code pattern may lead to multiple code transformers
Extendable: it is possible to write code transformers using
any language you want
22 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
How PerfExpert does that: Integrator
User Interface!
original!
source!
code!
Compiler! Analyzer!
(HPCToolKit)!
MACPO!
code bottlenecks and
general performance
metrics!
add data access!
performance metrics to
previous output!
code fragments to!
optimize and list of!
recommendations!
!
Pattern
Recognizer!
(Bison/Flex)!
code fragments to
optimize and list of
code transformers!
!
optimized code
fragments!
Optimization
Formulator!
(ROSE)!
Integrator!
(ROSE)!
optimized!
source code!
!
Support Database!
Transformer!
(PIPS/ROSE)!
Compilation Phase!
DiagnoseandRecommendationPhases!
Code Transformation Phase!
CodeIntegrationPhase!
Input/output
data!
Developed by
the authors!
Standard
Compiler!
Measurement and Analysis Phases!
Work Flow
Script!
binary object!
23 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
How PerfExpert does that: Integrator
User Interface!
original!
source!
code!
Compiler! Analyzer!
(HPCToolKit)!
MACPO!
code bottlenecks and
general performance
metrics!
add data access!
performance metrics to
previous output!
code fragments to!
optimize and list of!
recommendations!
!
Pattern
Recognizer!
(Bison/Flex)!
code fragments to
optimize and list of
code transformers!
!
optimized code
fragments!
Optimization
Formulator!
(ROSE)!
Integrator!
(ROSE)!
optimized!
source code!
!
Support Database!
Transformer!
(PIPS/ROSE)!
Compilation Phase!
DiagnoseandRecommendationPhases!
Code Transformation Phase!
CodeIntegrationPhase!
Input/output
data!
Developed by
the authors!
Standard
Compiler!
Measurement and Analysis Phases!
Work Flow
Script!
binary object!
Generates a new source
code by integrating to the
transformed code
fragments
23 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
How PerfExpert does that: Integrator
User Interface!
original!
source!
code!
Compiler! Analyzer!
(HPCToolKit)!
MACPO!
code bottlenecks and
general performance
metrics!
add data access!
performance metrics to
previous output!
code fragments to!
optimize and list of!
recommendations!
!
Pattern
Recognizer!
(Bison/Flex)!
code fragments to
optimize and list of
code transformers!
!
optimized code
fragments!
Optimization
Formulator!
(ROSE)!
Integrator!
(ROSE)!
optimized!
source code!
!
Support Database!
Transformer!
(PIPS/ROSE)!
Compilation Phase!
DiagnoseandRecommendationPhases!
Code Transformation Phase!
CodeIntegrationPhase!
Input/output
data!
Developed by
the authors!
Standard
Compiler!
Measurement and Analysis Phases!
Work Flow
Script!
binary object!
Generates a new source
code by integrating to the
transformed code
fragments
Based on ROSE
23 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
How PerfExpert does that: Key Points
24 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
How PerfExpert does that: Key Points
Why is this performance optimization “architecture” strong?
24 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
How PerfExpert does that: Key Points
Why is this performance optimization “architecture” strong?
Each piece of the tool chain can be updated/upgraded
individually
24 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
How PerfExpert does that: Key Points
Why is this performance optimization “architecture” strong?
Each piece of the tool chain can be updated/upgraded
individually
It is flexible: you can add new metrics as well as plug new
tools to measure application performance
24 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
How PerfExpert does that: Key Points
Why is this performance optimization “architecture” strong?
Each piece of the tool chain can be updated/upgraded
individually
It is flexible: you can add new metrics as well as plug new
tools to measure application performance
It is extendable: new recommendations, transformations and
strategies to select recommendations (we are counting on
you!)
24 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
How PerfExpert does that: Key Points
Why is this performance optimization “architecture” strong?
Each piece of the tool chain can be updated/upgraded
individually
It is flexible: you can add new metrics as well as plug new
tools to measure application performance
It is extendable: new recommendations, transformations and
strategies to select recommendations (we are counting on
you!)
Multi-language, multi-architecture, open-source and built on
top of well-established tools (HPCToolKit, ROSE, PIPS, etc.)
24 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
How PerfExpert does that: Key Points
Why is this performance optimization “architecture” strong?
Each piece of the tool chain can be updated/upgraded
individually
It is flexible: you can add new metrics as well as plug new
tools to measure application performance
It is extendable: new recommendations, transformations and
strategies to select recommendations (we are counting on
you!)
Multi-language, multi-architecture, open-source and built on
top of well-established tools (HPCToolKit, ROSE, PIPS, etc.)
Easy to use and lightweight!
24 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Extending PerfExpert
25 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Extending PerfExpert
25 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Extending PerfExpert
Adding performance metrics
25 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Extending PerfExpert
Adding performance metrics
Optimization recommendations [entries on the SQL database]
25 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Extending PerfExpert
Adding performance metrics
Optimization recommendations [entries on the SQL database]
“Recommendation selection functions”
25 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Extending PerfExpert
Adding performance metrics
Optimization recommendations [entries on the SQL database]
“Recommendation selection functions”
Pattern recognizers
25 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Extending PerfExpert
Adding performance metrics
Optimization recommendations [entries on the SQL database]
“Recommendation selection functions”
Pattern recognizers
Code transformers
25 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Extending PerfExpert
26 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Extending PerfExpert
Adding Performance Metrics
26 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Extending PerfExpert
Adding Performance Metrics
code.section info=Loop in function compute() at mm.c:8
code.filename=mm.c
code.line number=8
code.type=loop
code.function name=compute
code.extra info=3
code.representativeness=99.8
perfexpert.ratio.data accesses=0.25
perfexpert.instruction accesses.L2i hits=0.002
perfexpert.branch instructions.mispredicted=0.0
perfexpert.floating-point instr.fast FP instr=5.073
perfexpert.data accesses.L2d hits=1.846
...
26 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Extending PerfExpert
27 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Extending PerfExpert
Recommendation Selection Functions
27 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Extending PerfExpert
Recommendation Selection Functions
SELECT r.id AS recommendation id, SUM(
(CASE c.short WHEN ’d-l1’ THEN
(m.data accesses L1d hits - (max * 0.1))
ELSE 0 END) +
... ) AS score FROM recommendation AS r
JOIN metric AS m JOIN (SELECT MAX(
m.data accesses L1d hits, m.data accesses L2d hits,
... ) AS max
FROM metric AS m WHERE m.overall * 100 /
(0.5 * (100 - m.ratio floating point) +
m.ratio floating point) > 1
AND m.id = @RID)
WHERE (r.loop <= @LPD AND m.code type = ’loop’) OR
(r.loop IS NULL AND m.code type = ’function’)
AND m.id = @RID
GROUP BY r.id ORDER BY score DESC;
27 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Extending PerfExpert
28 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Extending PerfExpert
Pattern Recognizers
28 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Extending PerfExpert
Pattern Recognizers
nested iteration statement
: WHILE ’(’ exp ’)’ WHILE ’(’ exp ’)’ stmnt
| WHILE ’(’ exp ’)’ ’’ WHILE ’(’ exp ’)’ stmnt ’’
| DO DO stmnt WHILE ’(’ exp ’)’ ’;’ stmnt WHILE ’(’ exp ’)’ ’;’
| DO ’’ DO stmnt WHILE ’(’ exp ’)’ ’;’ ’’ WHILE ’(’ exp ’)’ ’;’
| FOR ’(’ exp stmnt exp stmnt ’)’ FOR ’(’ exp stmnt exp stmnt ’)’ stmnt
| FOR ’(’ exp stmnt exp stmnt ’)’ ’’ FOR ’(’ exp stmnt exp stmnt ’)’ stmnt ’’
| FOR ’(’ exp stmnt exp stmnt exp ’)’ FOR ’(’ exp stmnt exp stmnt exp ’)’ stmnt
| FOR ’(’ exp stmnt exp stmnt exp ’)’ ’’ FOR ’(’ exp stmnt exp stmnt exp ’)’ stmnt ’’ ;
28 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Extending PerfExpert
29 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Extending PerfExpert
Code Transformers
29 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Extending PerfExpert
Code Transformers
create c loop2 ../source/mm.c
activate INTERPROCEDURAL SUMMARY PRECONDITION
activate TRANSFORMERS INTER FULL
activate PRECONDITIONS INTER FULL
setproperty SEMANTICS FIX POINT OPERATOR ‘‘derivative’’
module compute
apply LOOP INTERCHANGE
loop 8
apply UNSPLIT[%PROGRAM]
close
quit
29 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Hands on Tutorial
Accessing Stampede:
ssh login@stampede.tacc.utexas.edu
use the password that has been provided to you
Request a Compute Node:
./reserve
now we are ready to go...
30 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Hands on Tutorial
Accessing Stampede:
cd 1
perfexpert
perfexpert -s mm.c mm
grep -R "running time" *
more mm.c
more perfexpert-temp-zUKfkx7/1/fragments/new/mm.c
perfexpert mm
perfexpert -r 5 mm
cd ../2
perfexpert -m -s backprop.c backprop
31 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Agenda
32 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Agenda
What we saw in the morning:
Introduction and motivation
What PerfExpert can provide to you?
Demo
How PerfExpert does that? (opening Pandora’s box)
Extending PerfExpert
Hands on tutorial
Morning closure
32 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Agenda
What we saw in the morning:
Introduction and motivation
What PerfExpert can provide to you?
Demo
How PerfExpert does that? (opening Pandora’s box)
Extending PerfExpert
Hands on tutorial
Morning closure
What we will see in the afternoon:
How to enhance the application performance using memory access
metrics (MAPCO)
32 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Agenda
1 Introduction
2 PerfExpert
3 MACPO
4 GPU/Accelerators
5 Closure
33 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Short Demo
Short demo
34 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Agenda
1 Introduction
2 PerfExpert
3 MACPO
4 GPU/Accelerators
5 Closure
35 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Performance Optimization by Mapping to Accelerators
36 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Performance Optimization by Mapping to Accelerators
36 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Performance Optimization by Mapping to Accelerators
Mapping of code segments to accelerators is
becoming one of the most methods for optimizing
the performance of an application
36 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Performance Optimization by Mapping to Accelerators
Mapping of code segments to accelerators is
becoming one of the most methods for optimizing
the performance of an application
Problem: how to select those parts of an
application which will benefit from execution on an
accelerator?
36 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Code Segments for SIMT/SIMD Execution
37 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Code Segments for SIMT/SIMD Execution
37 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Code Segments for SIMT/SIMD Execution
Optimize for multicore chip execution — PerfExpert, why?
37 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Code Segments for SIMT/SIMD Execution
Optimize for multicore chip execution — PerfExpert, why?
Identify time consuming kernels in code — PerfExpert
37 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Code Segments for SIMT/SIMD Execution
Optimize for multicore chip execution — PerfExpert, why?
Identify time consuming kernels in code — PerfExpert
Eliminate kernels not easily mappable for SIMT/SIMD execution —
How?
37 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Code Segments for SIMT/SIMD Execution
Optimize for multicore chip execution — PerfExpert, why?
Identify time consuming kernels in code — PerfExpert
Eliminate kernels not easily mappable for SIMT/SIMD execution —
How?
Characterize the kernels suitable for SIMT/SIMD execution ––
What properties?
37 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Code Segments for SIMT/SIMD Execution
Optimize for multicore chip execution — PerfExpert, why?
Identify time consuming kernels in code — PerfExpert
Eliminate kernels not easily mappable for SIMT/SIMD execution —
How?
Characterize the kernels suitable for SIMT/SIMD execution ––
What properties?
Rank appropriate kernels using the characteristics identified in the
last step
37 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Code Segments for SIMT/SIMD Execution
Optimize for multicore chip execution — PerfExpert, why?
Identify time consuming kernels in code — PerfExpert
Eliminate kernels not easily mappable for SIMT/SIMD execution —
How?
Characterize the kernels suitable for SIMT/SIMD execution ––
What properties?
Rank appropriate kernels using the characteristics identified in the
last step
Estimate cost of data movement
37 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Code Segments for SIMT/SIMD Execution
Optimize for multicore chip execution — PerfExpert, why?
Identify time consuming kernels in code — PerfExpert
Eliminate kernels not easily mappable for SIMT/SIMD execution —
How?
Characterize the kernels suitable for SIMT/SIMD execution ––
What properties?
Rank appropriate kernels using the characteristics identified in the
last step
Estimate cost of data movement
Look for refactorings that will enable leaving data on accelerator
37 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Code Segments for SIMT/SIMD Execution
Optimize for multicore chip execution — PerfExpert, why?
Identify time consuming kernels in code — PerfExpert
Eliminate kernels not easily mappable for SIMT/SIMD execution —
How?
Characterize the kernels suitable for SIMT/SIMD execution ––
What properties?
Rank appropriate kernels using the characteristics identified in the
last step
Estimate cost of data movement
Look for refactorings that will enable leaving data on accelerator
Generate compiler annotations for translation of C/C++/Fortran to
CUDA/OpenCL
37 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Code Segments for SIMT/SIMD Execution
Optimize for multicore chip execution — PerfExpert, why?
Identify time consuming kernels in code — PerfExpert
Eliminate kernels not easily mappable for SIMT/SIMD execution —
How?
Characterize the kernels suitable for SIMT/SIMD execution ––
What properties?
Rank appropriate kernels using the characteristics identified in the
last step
Estimate cost of data movement
Look for refactorings that will enable leaving data on accelerator
Generate compiler annotations for translation of C/C++/Fortran to
CUDA/OpenCL
Suggest kernels needing new algorithms
37 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Code Segments for SIMT/SIMD Execution
38 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Code Segments for SIMT/SIMD Execution
Unsuitable Kernels
38 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Code Segments for SIMT/SIMD Execution
Unsuitable Kernels
Frequent TLB misses
38 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Code Segments for SIMT/SIMD Execution
Unsuitable Kernels
Frequent TLB misses
High fraction of branches
38 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Code Segments for SIMT/SIMD Execution
Unsuitable Kernels
Frequent TLB misses
High fraction of branches
Cache conflicts across cores
38 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Code Segments for SIMT/SIMD Execution
Unsuitable Kernels
Frequent TLB misses
High fraction of branches
Cache conflicts across cores
Irregular access strides for kernel data structures
Characterizing “Good” Kernels
38 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Code Segments for SIMT/SIMD Execution
Unsuitable Kernels
Frequent TLB misses
High fraction of branches
Cache conflicts across cores
Irregular access strides for kernel data structures
Characterizing “Good” Kernels
Computational intensity
38 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Code Segments for SIMT/SIMD Execution
Unsuitable Kernels
Frequent TLB misses
High fraction of branches
Cache conflicts across cores
Irregular access strides for kernel data structures
Characterizing “Good” Kernels
Computational intensity
Pure “local” SPMD parallelism
38 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Code Segments for SIMT/SIMD Execution
Unsuitable Kernels
Frequent TLB misses
High fraction of branches
Cache conflicts across cores
Irregular access strides for kernel data structures
Characterizing “Good” Kernels
Computational intensity
Pure “local” SPMD parallelism
Streaming parallelism or vectorization
38 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Code Segments for SIMT/SIMD Execution
Unsuitable Kernels
Frequent TLB misses
High fraction of branches
Cache conflicts across cores
Irregular access strides for kernel data structures
Characterizing “Good” Kernels
Computational intensity
Pure “local” SPMD parallelism
Streaming parallelism or vectorization
Regular access strides for data structures
38 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Code Segments for SIMT/SIMD Execution
Unsuitable Kernels
Frequent TLB misses
High fraction of branches
Cache conflicts across cores
Irregular access strides for kernel data structures
Characterizing “Good” Kernels
Computational intensity
Pure “local” SPMD parallelism
Streaming parallelism or vectorization
Regular access strides for data structures
Data reuse factor and data transfer volume
38 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Code Segments for SIMT/SIMD Execution
Unsuitable Kernels
Frequent TLB misses
High fraction of branches
Cache conflicts across cores
Irregular access strides for kernel data structures
Characterizing “Good” Kernels
Computational intensity
Pure “local” SPMD parallelism
Streaming parallelism or vectorization
Regular access strides for data structures
Data reuse factor and data transfer volume
“Limited” recursion
38 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Code Segments for SIMT/SIMD Execution
39 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Code Segments for SIMT/SIMD Execution
Ranking “Good” Kernels
39 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Code Segments for SIMT/SIMD Execution
Ranking “Good” Kernels
Curve fit characteristics to speed-up measurements of kernels
that have already been mapped
39 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Code Segments for SIMT/SIMD Execution
Ranking “Good” Kernels
Curve fit characteristics to speed-up measurements of kernels
that have already been mapped
Sort by values of characteristics in some chosen order
39 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Code Segments for SIMT/SIMD Execution
Ranking “Good” Kernels
Curve fit characteristics to speed-up measurements of kernels
that have already been mapped
Sort by values of characteristics in some chosen order
Hold up your thumb?
39 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Example
40 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Example
40 / 42
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Agenda
1 Introduction
2 PerfExpert
3 MACPO
4 GPU/Accelerators
5 Closure
41 / 42
Thank You
Victor
Apr

More Related Content

Similar to Apresentacao

Similar to Apresentacao (20)

BKK16-308 The tool called Auto-Tuned Optimization System (ATOS)
BKK16-308 The tool called Auto-Tuned Optimization System (ATOS)BKK16-308 The tool called Auto-Tuned Optimization System (ATOS)
BKK16-308 The tool called Auto-Tuned Optimization System (ATOS)
 
OpenPOWER Webinar from University of Delaware - Title :OpenMP (offloading) o...
OpenPOWER Webinar from University of Delaware  - Title :OpenMP (offloading) o...OpenPOWER Webinar from University of Delaware  - Title :OpenMP (offloading) o...
OpenPOWER Webinar from University of Delaware - Title :OpenMP (offloading) o...
 
Tracing MariaDB server with bpftrace - MariaDB Server Fest 2021
Tracing MariaDB server with bpftrace - MariaDB Server Fest 2021Tracing MariaDB server with bpftrace - MariaDB Server Fest 2021
Tracing MariaDB server with bpftrace - MariaDB Server Fest 2021
 
Présentation du FME World Tour 2018 à Québec
Présentation du FME World Tour 2018 à QuébecPrésentation du FME World Tour 2018 à Québec
Présentation du FME World Tour 2018 à Québec
 
The Past, Present, and Future of OpenACC
The Past, Present, and Future of OpenACCThe Past, Present, and Future of OpenACC
The Past, Present, and Future of OpenACC
 
Présentation du FME World Tour 2018 à Montréal
Présentation du FME World Tour 2018 à MontréalPrésentation du FME World Tour 2018 à Montréal
Présentation du FME World Tour 2018 à Montréal
 
Petapath HP Cast 12 - Programming for High Performance Accelerated Systems
Petapath HP Cast 12 - Programming for High Performance Accelerated SystemsPetapath HP Cast 12 - Programming for High Performance Accelerated Systems
Petapath HP Cast 12 - Programming for High Performance Accelerated Systems
 
Introducing Performance Awareness in an Integrated Specification Environment
Introducing Performance Awareness in an Integrated Specification EnvironmentIntroducing Performance Awareness in an Integrated Specification Environment
Introducing Performance Awareness in an Integrated Specification Environment
 
GTC16 - S6410 - Comparing OpenACC 2.5 and OpenMP 4.5
GTC16 - S6410 - Comparing OpenACC 2.5 and OpenMP 4.5GTC16 - S6410 - Comparing OpenACC 2.5 and OpenMP 4.5
GTC16 - S6410 - Comparing OpenACC 2.5 and OpenMP 4.5
 
Tales of modernizing trello's web stack
Tales of modernizing trello's web stackTales of modernizing trello's web stack
Tales of modernizing trello's web stack
 
Fast Lane Company Profile Lin
Fast Lane Company Profile LinFast Lane Company Profile Lin
Fast Lane Company Profile Lin
 
Profiling and Optimizing for Xeon Phi with Allinea MAP
Profiling and Optimizing for Xeon Phi with Allinea MAPProfiling and Optimizing for Xeon Phi with Allinea MAP
Profiling and Optimizing for Xeon Phi with Allinea MAP
 
Ch1
Ch1Ch1
Ch1
 
Ch1
Ch1Ch1
Ch1
 
Design and Optimize your code for high-performance with Intel® Advisor and I...
Design and Optimize your code for high-performance with Intel®  Advisor and I...Design and Optimize your code for high-performance with Intel®  Advisor and I...
Design and Optimize your code for high-performance with Intel® Advisor and I...
 
Pragmatic Optimization in Modern Programming - Ordering Optimization Approaches
Pragmatic Optimization in Modern Programming - Ordering Optimization ApproachesPragmatic Optimization in Modern Programming - Ordering Optimization Approaches
Pragmatic Optimization in Modern Programming - Ordering Optimization Approaches
 
Introduction to Blackfin BF532 DSP
Introduction to Blackfin BF532 DSPIntroduction to Blackfin BF532 DSP
Introduction to Blackfin BF532 DSP
 
TI TechDays 2010: swiftBoot
TI TechDays 2010: swiftBootTI TechDays 2010: swiftBoot
TI TechDays 2010: swiftBoot
 
Learn more about the tremendous value Open Data Plane brings to NFV
Learn more about the tremendous value Open Data Plane brings to NFVLearn more about the tremendous value Open Data Plane brings to NFV
Learn more about the tremendous value Open Data Plane brings to NFV
 
Kunskapsbaren 2011 Linköping - Att produktifiera mjukvara
Kunskapsbaren 2011 Linköping - Att produktifiera mjukvaraKunskapsbaren 2011 Linköping - Att produktifiera mjukvara
Kunskapsbaren 2011 Linköping - Att produktifiera mjukvara
 

Recently uploaded

Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Victor Rentea
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Recently uploaded (20)

Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Cyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfCyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdf
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 

Apresentacao

  • 1. Performance Optimization with PerfExpert and MACPO Jim Browne, Ashay Rane and Leo Fialho ICS 2013 Victo Ap
  • 2. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Agenda 1 Introduction 2 PerfExpert 3 MACPO 4 GPU/Accelerators 5 Closure 2 / 42
  • 3. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Agenda 3 / 42
  • 4. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Agenda In the morning: 09:00 Introduction and motivation [Jim] 09:20 What PerfExpert can provide to you? [Leo] 09:30 Demo [Leo] 09:45 How PerfExpert does that? (opening Pandora’s box) [Leo] 10:15 Extending PerfExpert [Leo] 10:30 (Coffee?) break [everyone, including you] 10:45 Hands on tutorial [all the team] 11:45 Morning closure [all the team] 3 / 42
  • 5. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Agenda 4 / 42
  • 6. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Agenda In the afternoon: 01:30 What MACPO can provide to you? [Ashay] 02:00 Demo [Ashay, Jim] 02:30 How MACPO does that? [Ashay] 03:15 (Coffee?) break [everyone, including you] 03:30 Hands on tutorial [Ashay] 04:00 Selecting code segments to run on GPUs/accelerators [Jim] 04:30 Enhancing PerfExpert with MACPO analysis [all the team] 04:45 Afternoon closure and future work [all the team] 4 / 42
  • 7. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Overview: why PerfExpert? 5 / 42
  • 8. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Overview: why PerfExpert? Problem: HPC systems operate far below peak 5 / 42
  • 9. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Overview: why PerfExpert? Problem: HPC systems operate far below peak Chip/node architectural complexity is growing rapidly 5 / 42
  • 10. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Overview: why PerfExpert? Problem: HPC systems operate far below peak Chip/node architectural complexity is growing rapidly Performance optimization for these chips requires deep knowledge of architectures, code patterns, compilers, etc. Performance optimization tools 5 / 42
  • 11. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Overview: why PerfExpert? Problem: HPC systems operate far below peak Chip/node architectural complexity is growing rapidly Performance optimization for these chips requires deep knowledge of architectures, code patterns, compilers, etc. Performance optimization tools Powerful in the hands of experts 5 / 42
  • 12. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Overview: why PerfExpert? Problem: HPC systems operate far below peak Chip/node architectural complexity is growing rapidly Performance optimization for these chips requires deep knowledge of architectures, code patterns, compilers, etc. Performance optimization tools Powerful in the hands of experts Require detailed performance and system expertise 5 / 42
  • 13. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Overview: why PerfExpert? Problem: HPC systems operate far below peak Chip/node architectural complexity is growing rapidly Performance optimization for these chips requires deep knowledge of architectures, code patterns, compilers, etc. Performance optimization tools Powerful in the hands of experts Require detailed performance and system expertise HPC application developers are domain experts, not computer gurus 5 / 42
  • 14. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Overview: why PerfExpert? Problem: HPC systems operate far below peak Chip/node architectural complexity is growing rapidly Performance optimization for these chips requires deep knowledge of architectures, code patterns, compilers, etc. Performance optimization tools Powerful in the hands of experts Require detailed performance and system expertise HPC application developers are domain experts, not computer gurus Result: Many HPC programmers do not use these tools (seriously) 5 / 42
  • 15. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Goal for PerfExpert: democratize optimization! 6 / 42
  • 16. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Goal for PerfExpert: democratize optimization! Subgoals: 6 / 42
  • 17. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Goal for PerfExpert: democratize optimization! Subgoals: Make use of the tool as simple as possible 6 / 42
  • 18. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Goal for PerfExpert: democratize optimization! Subgoals: Make use of the tool as simple as possible Start with only chip/node level optimization 6 / 42
  • 19. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Goal for PerfExpert: democratize optimization! Subgoals: Make use of the tool as simple as possible Start with only chip/node level optimization Make it adaptable across multiple architectures 6 / 42
  • 20. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Goal for PerfExpert: democratize optimization! Subgoals: Make use of the tool as simple as possible Start with only chip/node level optimization Make it adaptable across multiple architectures Design for extension to communication and I/O performance 6 / 42
  • 21. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Goal for PerfExpert: democratize optimization! Subgoals: Make use of the tool as simple as possible Start with only chip/node level optimization Make it adaptable across multiple architectures Design for extension to communication and I/O performance How to accomplish? 6 / 42
  • 22. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Goal for PerfExpert: democratize optimization! Subgoals: Make use of the tool as simple as possible Start with only chip/node level optimization Make it adaptable across multiple architectures Design for extension to communication and I/O performance How to accomplish? Formulate the performance optimization task as a workflow of subtasks 6 / 42
  • 23. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Goal for PerfExpert: democratize optimization! Subgoals: Make use of the tool as simple as possible Start with only chip/node level optimization Make it adaptable across multiple architectures Design for extension to communication and I/O performance How to accomplish? Formulate the performance optimization task as a workflow of subtasks Leverage the state-of-the-art: Build on the best available tools for the subtasks to minimize the effort and cost of development 6 / 42
  • 24. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Goal for PerfExpert: democratize optimization! Subgoals: Make use of the tool as simple as possible Start with only chip/node level optimization Make it adaptable across multiple architectures Design for extension to communication and I/O performance How to accomplish? Formulate the performance optimization task as a workflow of subtasks Leverage the state-of-the-art: Build on the best available tools for the subtasks to minimize the effort and cost of development Automate the entire workflow 6 / 42
  • 25. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Goal for PerfExpert: democratize optimization! Subgoals: Make use of the tool as simple as possible Start with only chip/node level optimization Make it adaptable across multiple architectures Design for extension to communication and I/O performance How to accomplish? Formulate the performance optimization task as a workflow of subtasks Leverage the state-of-the-art: Build on the best available tools for the subtasks to minimize the effort and cost of development Automate the entire workflow 6 / 42
  • 26. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Introduction The four stages of automatic performance optimization: 7 / 42
  • 27. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Introduction The four stages of automatic performance optimization: Measurement and attribution (1) 7 / 42
  • 28. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Introduction The four stages of automatic performance optimization: Measurement and attribution (1) Analysis, diagnosis and identification of bottlenecks (2) 7 / 42
  • 29. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Introduction The four stages of automatic performance optimization: Measurement and attribution (1) Analysis, diagnosis and identification of bottlenecks (2) Selection of effective optimizations (3) 7 / 42
  • 30. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Introduction The four stages of automatic performance optimization: Measurement and attribution (1) Analysis, diagnosis and identification of bottlenecks (2) Selection of effective optimizations (3) Implementation of optimizations (4) 7 / 42
  • 31. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Introduction The four stages of automatic performance optimization: Measurement and attribution (1) Analysis, diagnosis and identification of bottlenecks (2) Selection of effective optimizations (3) Implementation of optimizations (4) Use of State-of-the-Art: 7 / 42
  • 32. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Introduction The four stages of automatic performance optimization: Measurement and attribution (1) Analysis, diagnosis and identification of bottlenecks (2) Selection of effective optimizations (3) Implementation of optimizations (4) Use of State-of-the-Art: HPCToolkit, MACPO based on ROSE (1) 7 / 42
  • 33. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Introduction The four stages of automatic performance optimization: Measurement and attribution (1) Analysis, diagnosis and identification of bottlenecks (2) Selection of effective optimizations (3) Implementation of optimizations (4) Use of State-of-the-Art: HPCToolkit, MACPO based on ROSE (1) PerfExpert Team (2 and 3) 7 / 42
  • 34. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Introduction The four stages of automatic performance optimization: Measurement and attribution (1) Analysis, diagnosis and identification of bottlenecks (2) Selection of effective optimizations (3) Implementation of optimizations (4) Use of State-of-the-Art: HPCToolkit, MACPO based on ROSE (1) PerfExpert Team (2 and 3) PerfExpert Team based on ROSE, PIPS, Bison and Flex (4) 7 / 42
  • 35. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Introduction The four stages of automatic performance optimization: Measurement and attribution (1) Analysis, diagnosis and identification of bottlenecks (2) Selection of effective optimizations (3) Implementation of optimizations (4) Use of State-of-the-Art: HPCToolkit, MACPO based on ROSE (1) PerfExpert Team (2 and 3) PerfExpert Team based on ROSE, PIPS, Bison and Flex (4) 7 / 42
  • 36. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Introduction Uniqueness of PerfExpert: 8 / 42
  • 37. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Introduction Uniqueness of PerfExpert: Nearly complete optimization first three stages of optimization for chip/node level 8 / 42
  • 38. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Introduction Uniqueness of PerfExpert: Nearly complete optimization first three stages of optimization for chip/node level Framework for implementing optimizations is complete and several optimizations are completed 8 / 42
  • 39. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Introduction Uniqueness of PerfExpert: Nearly complete optimization first three stages of optimization for chip/node level Framework for implementing optimizations is complete and several optimizations are completed Integrates code segment focused and data structure based measurements (MACPO) 8 / 42
  • 40. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Introduction Uniqueness of PerfExpert: Nearly complete optimization first three stages of optimization for chip/node level Framework for implementing optimizations is complete and several optimizations are completed Integrates code segment focused and data structure based measurements (MACPO) Workflow will apply to communication and I/O optimization as well 8 / 42
  • 41. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Introduction Uniqueness of PerfExpert: Nearly complete optimization first three stages of optimization for chip/node level Framework for implementing optimizations is complete and several optimizations are completed Integrates code segment focused and data structure based measurements (MACPO) Workflow will apply to communication and I/O optimization as well 8 / 42
  • 42. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Introduction Unique properties of MACPO: 9 / 42
  • 43. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Introduction Unique properties of MACPO: Multicore resolved traces 9 / 42
  • 44. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Introduction Unique properties of MACPO: Multicore resolved traces Code segment local measurement 9 / 42
  • 45. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Introduction Unique properties of MACPO: Multicore resolved traces Code segment local measurement Data structure specific traces 9 / 42
  • 46. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Introduction Unique properties of MACPO: Multicore resolved traces Code segment local measurement Data structure specific traces Order of magnitude lower overhead of measurement 9 / 42
  • 47. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Introduction Unique properties of MACPO: Multicore resolved traces Code segment local measurement Data structure specific traces Order of magnitude lower overhead of measurement More accurate (associative) cache models 9 / 42
  • 48. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Introduction Unique properties of MACPO: Multicore resolved traces Code segment local measurement Data structure specific traces Order of magnitude lower overhead of measurement More accurate (associative) cache models Strides by data structure and code segment 9 / 42
  • 49. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Introduction Unique properties of MACPO: Multicore resolved traces Code segment local measurement Data structure specific traces Order of magnitude lower overhead of measurement More accurate (associative) cache models Strides by data structure and code segment Architecture “independent” metrics 9 / 42
  • 50. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Introduction Unique properties of MACPO: Multicore resolved traces Code segment local measurement Data structure specific traces Order of magnitude lower overhead of measurement More accurate (associative) cache models Strides by data structure and code segment Architecture “independent” metrics 9 / 42
  • 51. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Agenda 1 Introduction 2 PerfExpert 3 MACPO 4 GPU/Accelerators 5 Closure 10 / 42
  • 52. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure What PerfExpert can provide to you? Performance report: 11 / 42
  • 53. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure What PerfExpert can provide to you? Performance report: Identification of bottlenecks by relevance 11 / 42
  • 54. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure What PerfExpert can provide to you? Performance report: Identification of bottlenecks by relevance Performance analysis based on performance metrics 11 / 42
  • 55. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure What PerfExpert can provide to you? Performance report: Identification of bottlenecks by relevance Performance analysis based on performance metrics Recommendations for optimization 11 / 42
  • 56. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure What PerfExpert can provide to you? Performance report: Identification of bottlenecks by relevance Performance analysis based on performance metrics Recommendations for optimization There are three possible outputs: 11 / 42
  • 57. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure What PerfExpert can provide to you? Performance report: Identification of bottlenecks by relevance Performance analysis based on performance metrics Recommendations for optimization There are three possible outputs: Performance report only 11 / 42
  • 58. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure What PerfExpert can provide to you? Performance report: Identification of bottlenecks by relevance Performance analysis based on performance metrics Recommendations for optimization There are three possible outputs: Performance report only List of recommendations 11 / 42
  • 59. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure What PerfExpert can provide to you? Performance report: Identification of bottlenecks by relevance Performance analysis based on performance metrics Recommendations for optimization There are three possible outputs: Performance report only List of recommendations Fully automated code transformation 11 / 42
  • 60. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure What PerfExpert can provide to you? Performance report: Identification of bottlenecks by relevance Performance analysis based on performance metrics Recommendations for optimization There are three possible outputs: Performance report only List of recommendations Fully automated code transformation 11 / 42
  • 61. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure What PerfExpert can provide to you? Performance report: 12 / 42
  • 62. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure What PerfExpert can provide to you? Performance report: Loop in function compute() at mm.c:8 (99.8% of the total runtime) =============================================================================== ratio to total instrns % 0.........25...........50.........75........100 - floating point : 100 *********************************************** - data accesses : 25 ************ * GFLOPS (% max) : 12 ****** - packed : 0 * - scalar : 12 ****** ------------------------------------------------------------------------------- performance assessment LCPI good......okay......fair......poor......bad.... * overall : 3.0 >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>+ upper bound estimates * data accesses : 9.6 >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>+ - L1d hits : 0.9 >>>>>>>>>>>>>>>>> - L2d hits : 1.8 >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> - L2d misses : 6.9 >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>+ * instruction accesses : 0.1 > - L1i hits : 0.0 > - L2i hits : 0.0 > - L2i misses : 0.1 > * data TLB : 4.6 >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>+ * instruction TLB : 0.0 > * branch instructions : 0.1 >> - correctly predicted : 0.1 >> - mispredicted : 0.0 > * floating-point instr : 5.1 >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>+ - fast FP instr : 5.1 >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>+ - slow FP instr : 0.0 > 12 / 42
  • 63. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure What PerfExpert can provide to you? List of Recommendations: 13 / 42
  • 64. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure What PerfExpert can provide to you? List of Recommendations: #-------------------------------------------------- # Recommendations for mm.c:8 #-------------------------------------------------- # # This is a possible recommendation for this code segment # Recommendation ID: 31 Recommendation Description: change the order of loops Recommendation Reason: this optimization may improve the memory access pattern and make it more cache and TLB friendly Pattern Recognizers: c loop2 f loop2 Code example: loop i { loop j {...} } =====> loop j { loop i {...} } 13 / 42
  • 65. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Short Demo Short demo 14 / 42
  • 66. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure How PerfExpert does that: The Big Picture User Interface! original! source! code! Compiler! Analyzer! (HPCToolKit)! MACPO! code bottlenecks and general performance metrics! add data access! performance metrics to previous output! code fragments to! optimize and list of! recommendations! ! Pattern Recognizer! (Bison/Flex)! code fragments to optimize and list of code transformers! ! optimized code fragments! Optimization Formulator! (ROSE)! Integrator! (ROSE)! optimized! source code! ! Support Database! Transformer! (PIPS/ROSE)! Compilation Phase! DiagnoseandRecommendationPhases! Code Transformation Phase! CodeIntegrationPhase! Input/output data! Developed by the authors! Standard Compiler! Measurement and Analysis Phases! Work Flow Script! binary object! 15 / 42
  • 67. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure How PerfExpert does that: Work Flow Script User Interface! original! source! code! Compiler! Analyzer! (HPCToolKit)! MACPO! code bottlenecks and general performance metrics! add data access! performance metrics to previous output! code fragments to! optimize and list of! recommendations! ! Pattern Recognizer! (Bison/Flex)! code fragments to optimize and list of code transformers! ! optimized code fragments! Optimization Formulator! (ROSE)! Integrator! (ROSE)! optimized! source code! ! Support Database! Transformer! (PIPS/ROSE)! Compilation Phase! DiagnoseandRecommendationPhases! Code Transformation Phase! CodeIntegrationPhase! Input/output data! Developed by the authors! Standard Compiler! Measurement and Analysis Phases! Work Flow Script! binary object! 16 / 42
  • 68. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure How PerfExpert does that: Work Flow Script User Interface! original! source! code! Compiler! Analyzer! (HPCToolKit)! MACPO! code bottlenecks and general performance metrics! add data access! performance metrics to previous output! code fragments to! optimize and list of! recommendations! ! Pattern Recognizer! (Bison/Flex)! code fragments to optimize and list of code transformers! ! optimized code fragments! Optimization Formulator! (ROSE)! Integrator! (ROSE)! optimized! source code! ! Support Database! Transformer! (PIPS/ROSE)! Compilation Phase! DiagnoseandRecommendationPhases! Code Transformation Phase! CodeIntegrationPhase! Input/output data! Developed by the authors! Standard Compiler! Measurement and Analysis Phases! Work Flow Script! binary object! This is a shell script 16 / 42
  • 69. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure How PerfExpert does that: Work Flow Script User Interface! original! source! code! Compiler! Analyzer! (HPCToolKit)! MACPO! code bottlenecks and general performance metrics! add data access! performance metrics to previous output! code fragments to! optimize and list of! recommendations! ! Pattern Recognizer! (Bison/Flex)! code fragments to optimize and list of code transformers! ! optimized code fragments! Optimization Formulator! (ROSE)! Integrator! (ROSE)! optimized! source code! ! Support Database! Transformer! (PIPS/ROSE)! Compilation Phase! DiagnoseandRecommendationPhases! Code Transformation Phase! CodeIntegrationPhase! Input/output data! Developed by the authors! Standard Compiler! Measurement and Analysis Phases! Work Flow Script! binary object! This is a shell script Accepts parameters 16 / 42
  • 70. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure How PerfExpert does that: Work Flow Script User Interface! original! source! code! Compiler! Analyzer! (HPCToolKit)! MACPO! code bottlenecks and general performance metrics! add data access! performance metrics to previous output! code fragments to! optimize and list of! recommendations! ! Pattern Recognizer! (Bison/Flex)! code fragments to optimize and list of code transformers! ! optimized code fragments! Optimization Formulator! (ROSE)! Integrator! (ROSE)! optimized! source code! ! Support Database! Transformer! (PIPS/ROSE)! Compilation Phase! DiagnoseandRecommendationPhases! Code Transformation Phase! CodeIntegrationPhase! Input/output data! Developed by the authors! Standard Compiler! Measurement and Analysis Phases! Work Flow Script! binary object! This is a shell script Accepts parameters Invokes all tools (including the compiler) 16 / 42
  • 71. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure How PerfExpert does that: Work Flow Script User Interface! original! source! code! Compiler! Analyzer! (HPCToolKit)! MACPO! code bottlenecks and general performance metrics! add data access! performance metrics to previous output! code fragments to! optimize and list of! recommendations! ! Pattern Recognizer! (Bison/Flex)! code fragments to optimize and list of code transformers! ! optimized code fragments! Optimization Formulator! (ROSE)! Integrator! (ROSE)! optimized! source code! ! Support Database! Transformer! (PIPS/ROSE)! Compilation Phase! DiagnoseandRecommendationPhases! Code Transformation Phase! CodeIntegrationPhase! Input/output data! Developed by the authors! Standard Compiler! Measurement and Analysis Phases! Work Flow Script! binary object! This is a shell script Accepts parameters Invokes all tools (including the compiler) Backward compatible 16 / 42
  • 72. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure How PerfExpert does that: Analyzer User Interface! original! source! code! Compiler! Analyzer! (HPCToolKit)! MACPO! code bottlenecks and general performance metrics! add data access! performance metrics to previous output! code fragments to! optimize and list of! recommendations! ! Pattern Recognizer! (Bison/Flex)! code fragments to optimize and list of code transformers! ! optimized code fragments! Optimization Formulator! (ROSE)! Integrator! (ROSE)! optimized! source code! ! Support Database! Transformer! (PIPS/ROSE)! Compilation Phase! DiagnoseandRecommendationPhases! Code Transformation Phase! CodeIntegrationPhase! Input/output data! Developed by the authors! Standard Compiler! Measurement and Analysis Phases! Work Flow Script! binary object! 17 / 42
  • 73. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure How PerfExpert does that: Analyzer User Interface! original! source! code! Compiler! Analyzer! (HPCToolKit)! MACPO! code bottlenecks and general performance metrics! add data access! performance metrics to previous output! code fragments to! optimize and list of! recommendations! ! Pattern Recognizer! (Bison/Flex)! code fragments to optimize and list of code transformers! ! optimized code fragments! Optimization Formulator! (ROSE)! Integrator! (ROSE)! optimized! source code! ! Support Database! Transformer! (PIPS/ROSE)! Compilation Phase! DiagnoseandRecommendationPhases! Code Transformation Phase! CodeIntegrationPhase! Input/output data! Developed by the authors! Standard Compiler! Measurement and Analysis Phases! Work Flow Script! binary object! This is the old PerfExpert, minus “recommender” 17 / 42
  • 74. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure How PerfExpert does that: Analyzer User Interface! original! source! code! Compiler! Analyzer! (HPCToolKit)! MACPO! code bottlenecks and general performance metrics! add data access! performance metrics to previous output! code fragments to! optimize and list of! recommendations! ! Pattern Recognizer! (Bison/Flex)! code fragments to optimize and list of code transformers! ! optimized code fragments! Optimization Formulator! (ROSE)! Integrator! (ROSE)! optimized! source code! ! Support Database! Transformer! (PIPS/ROSE)! Compilation Phase! DiagnoseandRecommendationPhases! Code Transformation Phase! CodeIntegrationPhase! Input/output data! Developed by the authors! Standard Compiler! Measurement and Analysis Phases! Work Flow Script! binary object! This is the old PerfExpert, minus “recommender” Based on HPCToolKit 17 / 42
  • 75. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure How PerfExpert does that: MACPO User Interface! original! source! code! Compiler! Analyzer! (HPCToolKit)! MACPO! code bottlenecks and general performance metrics! add data access! performance metrics to previous output! code fragments to! optimize and list of! recommendations! ! Pattern Recognizer! (Bison/Flex)! code fragments to optimize and list of code transformers! ! optimized code fragments! Optimization Formulator! (ROSE)! Integrator! (ROSE)! optimized! source code! ! Support Database! Transformer! (PIPS/ROSE)! Compilation Phase! DiagnoseandRecommendationPhases! Code Transformation Phase! CodeIntegrationPhase! Input/output data! Developed by the authors! Standard Compiler! Measurement and Analysis Phases! Work Flow Script! binary object! 18 / 42
  • 76. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure How PerfExpert does that: MACPO User Interface! original! source! code! Compiler! Analyzer! (HPCToolKit)! MACPO! code bottlenecks and general performance metrics! add data access! performance metrics to previous output! code fragments to! optimize and list of! recommendations! ! Pattern Recognizer! (Bison/Flex)! code fragments to optimize and list of code transformers! ! optimized code fragments! Optimization Formulator! (ROSE)! Integrator! (ROSE)! optimized! source code! ! Support Database! Transformer! (PIPS/ROSE)! Compilation Phase! DiagnoseandRecommendationPhases! Code Transformation Phase! CodeIntegrationPhase! Input/output data! Developed by the authors! Standard Compiler! Measurement and Analysis Phases! Work Flow Script! binary object! Enhances the set of metrics with data access performance metrics 18 / 42
  • 77. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure How PerfExpert does that: MACPO User Interface! original! source! code! Compiler! Analyzer! (HPCToolKit)! MACPO! code bottlenecks and general performance metrics! add data access! performance metrics to previous output! code fragments to! optimize and list of! recommendations! ! Pattern Recognizer! (Bison/Flex)! code fragments to optimize and list of code transformers! ! optimized code fragments! Optimization Formulator! (ROSE)! Integrator! (ROSE)! optimized! source code! ! Support Database! Transformer! (PIPS/ROSE)! Compilation Phase! DiagnoseandRecommendationPhases! Code Transformation Phase! CodeIntegrationPhase! Input/output data! Developed by the authors! Standard Compiler! Measurement and Analysis Phases! Work Flow Script! binary object! Enhances the set of metrics with data access performance metrics Based on ROSE 18 / 42
  • 78. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure How PerfExpert does that: Optimization Formulator User Interface! original! source! code! Compiler! Analyzer! (HPCToolKit)! MACPO! code bottlenecks and general performance metrics! add data access! performance metrics to previous output! code fragments to! optimize and list of! recommendations! ! Pattern Recognizer! (Bison/Flex)! code fragments to optimize and list of code transformers! ! optimized code fragments! Optimization Formulator! (ROSE)! Integrator! (ROSE)! optimized! source code! ! Support Database! Transformer! (PIPS/ROSE)! Compilation Phase! DiagnoseandRecommendationPhases! Code Transformation Phase! CodeIntegrationPhase! Input/output data! Developed by the authors! Standard Compiler! Measurement and Analysis Phases! Work Flow Script! binary object! 19 / 42
  • 79. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure How PerfExpert does that: Optimization Formulator User Interface! original! source! code! Compiler! Analyzer! (HPCToolKit)! MACPO! code bottlenecks and general performance metrics! add data access! performance metrics to previous output! code fragments to! optimize and list of! recommendations! ! Pattern Recognizer! (Bison/Flex)! code fragments to optimize and list of code transformers! ! optimized code fragments! Optimization Formulator! (ROSE)! Integrator! (ROSE)! optimized! source code! ! Support Database! Transformer! (PIPS/ROSE)! Compilation Phase! DiagnoseandRecommendationPhases! Code Transformation Phase! CodeIntegrationPhase! Input/output data! Developed by the authors! Standard Compiler! Measurement and Analysis Phases! Work Flow Script! binary object! Loads performance metrics on the Support Database 19 / 42
  • 80. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure How PerfExpert does that: Optimization Formulator User Interface! original! source! code! Compiler! Analyzer! (HPCToolKit)! MACPO! code bottlenecks and general performance metrics! add data access! performance metrics to previous output! code fragments to! optimize and list of! recommendations! ! Pattern Recognizer! (Bison/Flex)! code fragments to optimize and list of code transformers! ! optimized code fragments! Optimization Formulator! (ROSE)! Integrator! (ROSE)! optimized! source code! ! Support Database! Transformer! (PIPS/ROSE)! Compilation Phase! DiagnoseandRecommendationPhases! Code Transformation Phase! CodeIntegrationPhase! Input/output data! Developed by the authors! Standard Compiler! Measurement and Analysis Phases! Work Flow Script! binary object! Loads performance metrics on the Support Database Runs all “recommendation selection functions” 19 / 42
  • 81. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure How PerfExpert does that: Optimization Formulator User Interface! original! source! code! Compiler! Analyzer! (HPCToolKit)! MACPO! code bottlenecks and general performance metrics! add data access! performance metrics to previous output! code fragments to! optimize and list of! recommendations! ! Pattern Recognizer! (Bison/Flex)! code fragments to optimize and list of code transformers! ! optimized code fragments! Optimization Formulator! (ROSE)! Integrator! (ROSE)! optimized! source code! ! Support Database! Transformer! (PIPS/ROSE)! Compilation Phase! DiagnoseandRecommendationPhases! Code Transformation Phase! CodeIntegrationPhase! Input/output data! Developed by the authors! Standard Compiler! Measurement and Analysis Phases! Work Flow Script! binary object! Loads performance metrics on the Support Database Runs all “recommendation selection functions” Concatenates and ranks the list of recommendations 19 / 42
  • 82. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure How PerfExpert does that: Optimization Formulator User Interface! original! source! code! Compiler! Analyzer! (HPCToolKit)! MACPO! code bottlenecks and general performance metrics! add data access! performance metrics to previous output! code fragments to! optimize and list of! recommendations! ! Pattern Recognizer! (Bison/Flex)! code fragments to optimize and list of code transformers! ! optimized code fragments! Optimization Formulator! (ROSE)! Integrator! (ROSE)! optimized! source code! ! Support Database! Transformer! (PIPS/ROSE)! Compilation Phase! DiagnoseandRecommendationPhases! Code Transformation Phase! CodeIntegrationPhase! Input/output data! Developed by the authors! Standard Compiler! Measurement and Analysis Phases! Work Flow Script! binary object! Loads performance metrics on the Support Database Runs all “recommendation selection functions” Concatenates and ranks the list of recommendations Extracts code fragments identified as bottlenecks 19 / 42
  • 83. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure How PerfExpert does that: Optimization Formulator User Interface! original! source! code! Compiler! Analyzer! (HPCToolKit)! MACPO! code bottlenecks and general performance metrics! add data access! performance metrics to previous output! code fragments to! optimize and list of! recommendations! ! Pattern Recognizer! (Bison/Flex)! code fragments to optimize and list of code transformers! ! optimized code fragments! Optimization Formulator! (ROSE)! Integrator! (ROSE)! optimized! source code! ! Support Database! Transformer! (PIPS/ROSE)! Compilation Phase! DiagnoseandRecommendationPhases! Code Transformation Phase! CodeIntegrationPhase! Input/output data! Developed by the authors! Standard Compiler! Measurement and Analysis Phases! Work Flow Script! binary object! Loads performance metrics on the Support Database Runs all “recommendation selection functions” Concatenates and ranks the list of recommendations Extracts code fragments identified as bottlenecks Based on ROSE 19 / 42
  • 84. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure How PerfExpert does that: Optimization Formulator User Interface! original! source! code! Compiler! Analyzer! (HPCToolKit)! MACPO! code bottlenecks and general performance metrics! add data access! performance metrics to previous output! code fragments to! optimize and list of! recommendations! ! Pattern Recognizer! (Bison/Flex)! code fragments to optimize and list of code transformers! ! optimized code fragments! Optimization Formulator! (ROSE)! Integrator! (ROSE)! optimized! source code! ! Support Database! Transformer! (PIPS/ROSE)! Compilation Phase! DiagnoseandRecommendationPhases! Code Transformation Phase! CodeIntegrationPhase! Input/output data! Developed by the authors! Standard Compiler! Measurement and Analysis Phases! Work Flow Script! binary object! Loads performance metrics on the Support Database Runs all “recommendation selection functions” Concatenates and ranks the list of recommendations Extracts code fragments identified as bottlenecks Based on ROSE Extendable: accepts user-defined performance metrics 19 / 42
  • 85. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure How PerfExpert does that: Optimization Formulator User Interface! original! source! code! Compiler! Analyzer! (HPCToolKit)! MACPO! code bottlenecks and general performance metrics! add data access! performance metrics to previous output! code fragments to! optimize and list of! recommendations! ! Pattern Recognizer! (Bison/Flex)! code fragments to optimize and list of code transformers! ! optimized code fragments! Optimization Formulator! (ROSE)! Integrator! (ROSE)! optimized! source code! ! Support Database! Transformer! (PIPS/ROSE)! Compilation Phase! DiagnoseandRecommendationPhases! Code Transformation Phase! CodeIntegrationPhase! Input/output data! Developed by the authors! Standard Compiler! Measurement and Analysis Phases! Work Flow Script! binary object! Loads performance metrics on the Support Database Runs all “recommendation selection functions” Concatenates and ranks the list of recommendations Extracts code fragments identified as bottlenecks Based on ROSE Extendable: accepts user-defined performance metrics Extendable: it is possible to write new “recommendation selection functions” (SQL query) 19 / 42
  • 86. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure How PerfExpert does that: Support Database User Interface! original! source! code! Compiler! Analyzer! (HPCToolKit)! MACPO! code bottlenecks and general performance metrics! add data access! performance metrics to previous output! code fragments to! optimize and list of! recommendations! ! Pattern Recognizer! (Bison/Flex)! code fragments to optimize and list of code transformers! ! optimized code fragments! Optimization Formulator! (ROSE)! Integrator! (ROSE)! optimized! source code! ! Support Database! Transformer! (PIPS/ROSE)! Compilation Phase! DiagnoseandRecommendationPhases! Code Transformation Phase! CodeIntegrationPhase! Input/output data! Developed by the authors! Standard Compiler! Measurement and Analysis Phases! Work Flow Script! binary object! 20 / 42
  • 87. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure How PerfExpert does that: Support Database User Interface! original! source! code! Compiler! Analyzer! (HPCToolKit)! MACPO! code bottlenecks and general performance metrics! add data access! performance metrics to previous output! code fragments to! optimize and list of! recommendations! ! Pattern Recognizer! (Bison/Flex)! code fragments to optimize and list of code transformers! ! optimized code fragments! Optimization Formulator! (ROSE)! Integrator! (ROSE)! optimized! source code! ! Support Database! Transformer! (PIPS/ROSE)! Compilation Phase! DiagnoseandRecommendationPhases! Code Transformation Phase! CodeIntegrationPhase! Input/output data! Developed by the authors! Standard Compiler! Measurement and Analysis Phases! Work Flow Script! binary object! This is a SQLite database 20 / 42
  • 88. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure How PerfExpert does that: Support Database User Interface! original! source! code! Compiler! Analyzer! (HPCToolKit)! MACPO! code bottlenecks and general performance metrics! add data access! performance metrics to previous output! code fragments to! optimize and list of! recommendations! ! Pattern Recognizer! (Bison/Flex)! code fragments to optimize and list of code transformers! ! optimized code fragments! Optimization Formulator! (ROSE)! Integrator! (ROSE)! optimized! source code! ! Support Database! Transformer! (PIPS/ROSE)! Compilation Phase! DiagnoseandRecommendationPhases! Code Transformation Phase! CodeIntegrationPhase! Input/output data! Developed by the authors! Standard Compiler! Measurement and Analysis Phases! Work Flow Script! binary object! This is a SQLite database Stores the list of “recommendation selection functions”, “pattern recognizers” and “code transformers” 20 / 42
  • 89. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure How PerfExpert does that: Support Database User Interface! original! source! code! Compiler! Analyzer! (HPCToolKit)! MACPO! code bottlenecks and general performance metrics! add data access! performance metrics to previous output! code fragments to! optimize and list of! recommendations! ! Pattern Recognizer! (Bison/Flex)! code fragments to optimize and list of code transformers! ! optimized code fragments! Optimization Formulator! (ROSE)! Integrator! (ROSE)! optimized! source code! ! Support Database! Transformer! (PIPS/ROSE)! Compilation Phase! DiagnoseandRecommendationPhases! Code Transformation Phase! CodeIntegrationPhase! Input/output data! Developed by the authors! Standard Compiler! Measurement and Analysis Phases! Work Flow Script! binary object! This is a SQLite database Stores the list of “recommendation selection functions”, “pattern recognizers” and “code transformers” Engine to run the “recommendation selection functions” 20 / 42
  • 90. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure How PerfExpert does that: Pattern Recognizer User Interface! original! source! code! Compiler! Analyzer! (HPCToolKit)! MACPO! code bottlenecks and general performance metrics! add data access! performance metrics to previous output! code fragments to! optimize and list of! recommendations! ! Pattern Recognizer! (Bison/Flex)! code fragments to optimize and list of code transformers! ! optimized code fragments! Optimization Formulator! (ROSE)! Integrator! (ROSE)! optimized! source code! ! Support Database! Transformer! (PIPS/ROSE)! Compilation Phase! DiagnoseandRecommendationPhases! Code Transformation Phase! CodeIntegrationPhase! Input/output data! Developed by the authors! Standard Compiler! Measurement and Analysis Phases! Work Flow Script! binary object! 21 / 42
  • 91. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure How PerfExpert does that: Pattern Recognizer User Interface! original! source! code! Compiler! Analyzer! (HPCToolKit)! MACPO! code bottlenecks and general performance metrics! add data access! performance metrics to previous output! code fragments to! optimize and list of! recommendations! ! Pattern Recognizer! (Bison/Flex)! code fragments to optimize and list of code transformers! ! optimized code fragments! Optimization Formulator! (ROSE)! Integrator! (ROSE)! optimized! source code! ! Support Database! Transformer! (PIPS/ROSE)! Compilation Phase! DiagnoseandRecommendationPhases! Code Transformation Phase! CodeIntegrationPhase! Input/output data! Developed by the authors! Standard Compiler! Measurement and Analysis Phases! Work Flow Script! binary object! Acts as a “filter” trying to find (match) the right code transformer for a source code fragment (identified as bottleneck) 21 / 42
  • 92. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure How PerfExpert does that: Pattern Recognizer User Interface! original! source! code! Compiler! Analyzer! (HPCToolKit)! MACPO! code bottlenecks and general performance metrics! add data access! performance metrics to previous output! code fragments to! optimize and list of! recommendations! ! Pattern Recognizer! (Bison/Flex)! code fragments to optimize and list of code transformers! ! optimized code fragments! Optimization Formulator! (ROSE)! Integrator! (ROSE)! optimized! source code! ! Support Database! Transformer! (PIPS/ROSE)! Compilation Phase! DiagnoseandRecommendationPhases! Code Transformation Phase! CodeIntegrationPhase! Input/output data! Developed by the authors! Standard Compiler! Measurement and Analysis Phases! Work Flow Script! binary object! Acts as a “filter” trying to find (match) the right code transformer for a source code fragment (identified as bottleneck) Language sensitive 21 / 42
  • 93. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure How PerfExpert does that: Pattern Recognizer User Interface! original! source! code! Compiler! Analyzer! (HPCToolKit)! MACPO! code bottlenecks and general performance metrics! add data access! performance metrics to previous output! code fragments to! optimize and list of! recommendations! ! Pattern Recognizer! (Bison/Flex)! code fragments to optimize and list of code transformers! ! optimized code fragments! Optimization Formulator! (ROSE)! Integrator! (ROSE)! optimized! source code! ! Support Database! Transformer! (PIPS/ROSE)! Compilation Phase! DiagnoseandRecommendationPhases! Code Transformation Phase! CodeIntegrationPhase! Input/output data! Developed by the authors! Standard Compiler! Measurement and Analysis Phases! Work Flow Script! binary object! Acts as a “filter” trying to find (match) the right code transformer for a source code fragment (identified as bottleneck) Language sensitive Based on Bison and Flex 21 / 42
  • 94. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure How PerfExpert does that: Pattern Recognizer User Interface! original! source! code! Compiler! Analyzer! (HPCToolKit)! MACPO! code bottlenecks and general performance metrics! add data access! performance metrics to previous output! code fragments to! optimize and list of! recommendations! ! Pattern Recognizer! (Bison/Flex)! code fragments to optimize and list of code transformers! ! optimized code fragments! Optimization Formulator! (ROSE)! Integrator! (ROSE)! optimized! source code! ! Support Database! Transformer! (PIPS/ROSE)! Compilation Phase! DiagnoseandRecommendationPhases! Code Transformation Phase! CodeIntegrationPhase! Input/output data! Developed by the authors! Standard Compiler! Measurement and Analysis Phases! Work Flow Script! binary object! Acts as a “filter” trying to find (match) the right code transformer for a source code fragment (identified as bottleneck) Language sensitive Based on Bison and Flex One recommendation may have multiple pattern recognizers 21 / 42
  • 95. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure How PerfExpert does that: Pattern Recognizer User Interface! original! source! code! Compiler! Analyzer! (HPCToolKit)! MACPO! code bottlenecks and general performance metrics! add data access! performance metrics to previous output! code fragments to! optimize and list of! recommendations! ! Pattern Recognizer! (Bison/Flex)! code fragments to optimize and list of code transformers! ! optimized code fragments! Optimization Formulator! (ROSE)! Integrator! (ROSE)! optimized! source code! ! Support Database! Transformer! (PIPS/ROSE)! Compilation Phase! DiagnoseandRecommendationPhases! Code Transformation Phase! CodeIntegrationPhase! Input/output data! Developed by the authors! Standard Compiler! Measurement and Analysis Phases! Work Flow Script! binary object! Acts as a “filter” trying to find (match) the right code transformer for a source code fragment (identified as bottleneck) Language sensitive Based on Bison and Flex One recommendation may have multiple pattern recognizers Extendable: it is possible to write new grammars to recognize/ match/filter code fragments (to work with new “transformers”) 21 / 42
  • 96. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure How PerfExpert does that: Transformer User Interface! original! source! code! Compiler! Analyzer! (HPCToolKit)! MACPO! code bottlenecks and general performance metrics! add data access! performance metrics to previous output! code fragments to! optimize and list of! recommendations! ! Pattern Recognizer! (Bison/Flex)! code fragments to optimize and list of code transformers! ! optimized code fragments! Optimization Formulator! (ROSE)! Integrator! (ROSE)! optimized! source code! ! Support Database! Transformer! (PIPS/ROSE)! Compilation Phase! DiagnoseandRecommendationPhases! Code Transformation Phase! CodeIntegrationPhase! Input/output data! Developed by the authors! Standard Compiler! Measurement and Analysis Phases! Work Flow Script! binary object! 22 / 42
  • 97. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure How PerfExpert does that: Transformer User Interface! original! source! code! Compiler! Analyzer! (HPCToolKit)! MACPO! code bottlenecks and general performance metrics! add data access! performance metrics to previous output! code fragments to! optimize and list of! recommendations! ! Pattern Recognizer! (Bison/Flex)! code fragments to optimize and list of code transformers! ! optimized code fragments! Optimization Formulator! (ROSE)! Integrator! (ROSE)! optimized! source code! ! Support Database! Transformer! (PIPS/ROSE)! Compilation Phase! DiagnoseandRecommendationPhases! Code Transformation Phase! CodeIntegrationPhase! Input/output data! Developed by the authors! Standard Compiler! Measurement and Analysis Phases! Work Flow Script! binary object! Implements the recommendation by applying source code transformation 22 / 42
  • 98. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure How PerfExpert does that: Transformer User Interface! original! source! code! Compiler! Analyzer! (HPCToolKit)! MACPO! code bottlenecks and general performance metrics! add data access! performance metrics to previous output! code fragments to! optimize and list of! recommendations! ! Pattern Recognizer! (Bison/Flex)! code fragments to optimize and list of code transformers! ! optimized code fragments! Optimization Formulator! (ROSE)! Integrator! (ROSE)! optimized! source code! ! Support Database! Transformer! (PIPS/ROSE)! Compilation Phase! DiagnoseandRecommendationPhases! Code Transformation Phase! CodeIntegrationPhase! Input/output data! Developed by the authors! Standard Compiler! Measurement and Analysis Phases! Work Flow Script! binary object! Implements the recommendation by applying source code transformation May or may not be language sensitive 22 / 42
  • 99. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure How PerfExpert does that: Transformer User Interface! original! source! code! Compiler! Analyzer! (HPCToolKit)! MACPO! code bottlenecks and general performance metrics! add data access! performance metrics to previous output! code fragments to! optimize and list of! recommendations! ! Pattern Recognizer! (Bison/Flex)! code fragments to optimize and list of code transformers! ! optimized code fragments! Optimization Formulator! (ROSE)! Integrator! (ROSE)! optimized! source code! ! Support Database! Transformer! (PIPS/ROSE)! Compilation Phase! DiagnoseandRecommendationPhases! Code Transformation Phase! CodeIntegrationPhase! Input/output data! Developed by the authors! Standard Compiler! Measurement and Analysis Phases! Work Flow Script! binary object! Implements the recommendation by applying source code transformation May or may not be language sensitive Based on ROSE, PIPS or anything you want 22 / 42
  • 100. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure How PerfExpert does that: Transformer User Interface! original! source! code! Compiler! Analyzer! (HPCToolKit)! MACPO! code bottlenecks and general performance metrics! add data access! performance metrics to previous output! code fragments to! optimize and list of! recommendations! ! Pattern Recognizer! (Bison/Flex)! code fragments to optimize and list of code transformers! ! optimized code fragments! Optimization Formulator! (ROSE)! Integrator! (ROSE)! optimized! source code! ! Support Database! Transformer! (PIPS/ROSE)! Compilation Phase! DiagnoseandRecommendationPhases! Code Transformation Phase! CodeIntegrationPhase! Input/output data! Developed by the authors! Standard Compiler! Measurement and Analysis Phases! Work Flow Script! binary object! Implements the recommendation by applying source code transformation May or may not be language sensitive Based on ROSE, PIPS or anything you want One code pattern may lead to multiple code transformers 22 / 42
  • 101. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure How PerfExpert does that: Transformer User Interface! original! source! code! Compiler! Analyzer! (HPCToolKit)! MACPO! code bottlenecks and general performance metrics! add data access! performance metrics to previous output! code fragments to! optimize and list of! recommendations! ! Pattern Recognizer! (Bison/Flex)! code fragments to optimize and list of code transformers! ! optimized code fragments! Optimization Formulator! (ROSE)! Integrator! (ROSE)! optimized! source code! ! Support Database! Transformer! (PIPS/ROSE)! Compilation Phase! DiagnoseandRecommendationPhases! Code Transformation Phase! CodeIntegrationPhase! Input/output data! Developed by the authors! Standard Compiler! Measurement and Analysis Phases! Work Flow Script! binary object! Implements the recommendation by applying source code transformation May or may not be language sensitive Based on ROSE, PIPS or anything you want One code pattern may lead to multiple code transformers Extendable: it is possible to write code transformers using any language you want 22 / 42
  • 102. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure How PerfExpert does that: Integrator User Interface! original! source! code! Compiler! Analyzer! (HPCToolKit)! MACPO! code bottlenecks and general performance metrics! add data access! performance metrics to previous output! code fragments to! optimize and list of! recommendations! ! Pattern Recognizer! (Bison/Flex)! code fragments to optimize and list of code transformers! ! optimized code fragments! Optimization Formulator! (ROSE)! Integrator! (ROSE)! optimized! source code! ! Support Database! Transformer! (PIPS/ROSE)! Compilation Phase! DiagnoseandRecommendationPhases! Code Transformation Phase! CodeIntegrationPhase! Input/output data! Developed by the authors! Standard Compiler! Measurement and Analysis Phases! Work Flow Script! binary object! 23 / 42
  • 103. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure How PerfExpert does that: Integrator User Interface! original! source! code! Compiler! Analyzer! (HPCToolKit)! MACPO! code bottlenecks and general performance metrics! add data access! performance metrics to previous output! code fragments to! optimize and list of! recommendations! ! Pattern Recognizer! (Bison/Flex)! code fragments to optimize and list of code transformers! ! optimized code fragments! Optimization Formulator! (ROSE)! Integrator! (ROSE)! optimized! source code! ! Support Database! Transformer! (PIPS/ROSE)! Compilation Phase! DiagnoseandRecommendationPhases! Code Transformation Phase! CodeIntegrationPhase! Input/output data! Developed by the authors! Standard Compiler! Measurement and Analysis Phases! Work Flow Script! binary object! Generates a new source code by integrating to the transformed code fragments 23 / 42
  • 104. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure How PerfExpert does that: Integrator User Interface! original! source! code! Compiler! Analyzer! (HPCToolKit)! MACPO! code bottlenecks and general performance metrics! add data access! performance metrics to previous output! code fragments to! optimize and list of! recommendations! ! Pattern Recognizer! (Bison/Flex)! code fragments to optimize and list of code transformers! ! optimized code fragments! Optimization Formulator! (ROSE)! Integrator! (ROSE)! optimized! source code! ! Support Database! Transformer! (PIPS/ROSE)! Compilation Phase! DiagnoseandRecommendationPhases! Code Transformation Phase! CodeIntegrationPhase! Input/output data! Developed by the authors! Standard Compiler! Measurement and Analysis Phases! Work Flow Script! binary object! Generates a new source code by integrating to the transformed code fragments Based on ROSE 23 / 42
  • 105. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure How PerfExpert does that: Key Points 24 / 42
  • 106. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure How PerfExpert does that: Key Points Why is this performance optimization “architecture” strong? 24 / 42
  • 107. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure How PerfExpert does that: Key Points Why is this performance optimization “architecture” strong? Each piece of the tool chain can be updated/upgraded individually 24 / 42
  • 108. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure How PerfExpert does that: Key Points Why is this performance optimization “architecture” strong? Each piece of the tool chain can be updated/upgraded individually It is flexible: you can add new metrics as well as plug new tools to measure application performance 24 / 42
  • 109. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure How PerfExpert does that: Key Points Why is this performance optimization “architecture” strong? Each piece of the tool chain can be updated/upgraded individually It is flexible: you can add new metrics as well as plug new tools to measure application performance It is extendable: new recommendations, transformations and strategies to select recommendations (we are counting on you!) 24 / 42
  • 110. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure How PerfExpert does that: Key Points Why is this performance optimization “architecture” strong? Each piece of the tool chain can be updated/upgraded individually It is flexible: you can add new metrics as well as plug new tools to measure application performance It is extendable: new recommendations, transformations and strategies to select recommendations (we are counting on you!) Multi-language, multi-architecture, open-source and built on top of well-established tools (HPCToolKit, ROSE, PIPS, etc.) 24 / 42
  • 111. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure How PerfExpert does that: Key Points Why is this performance optimization “architecture” strong? Each piece of the tool chain can be updated/upgraded individually It is flexible: you can add new metrics as well as plug new tools to measure application performance It is extendable: new recommendations, transformations and strategies to select recommendations (we are counting on you!) Multi-language, multi-architecture, open-source and built on top of well-established tools (HPCToolKit, ROSE, PIPS, etc.) Easy to use and lightweight! 24 / 42
  • 112. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Extending PerfExpert 25 / 42
  • 113. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Extending PerfExpert 25 / 42
  • 114. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Extending PerfExpert Adding performance metrics 25 / 42
  • 115. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Extending PerfExpert Adding performance metrics Optimization recommendations [entries on the SQL database] 25 / 42
  • 116. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Extending PerfExpert Adding performance metrics Optimization recommendations [entries on the SQL database] “Recommendation selection functions” 25 / 42
  • 117. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Extending PerfExpert Adding performance metrics Optimization recommendations [entries on the SQL database] “Recommendation selection functions” Pattern recognizers 25 / 42
  • 118. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Extending PerfExpert Adding performance metrics Optimization recommendations [entries on the SQL database] “Recommendation selection functions” Pattern recognizers Code transformers 25 / 42
  • 119. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Extending PerfExpert 26 / 42
  • 120. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Extending PerfExpert Adding Performance Metrics 26 / 42
  • 121. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Extending PerfExpert Adding Performance Metrics code.section info=Loop in function compute() at mm.c:8 code.filename=mm.c code.line number=8 code.type=loop code.function name=compute code.extra info=3 code.representativeness=99.8 perfexpert.ratio.data accesses=0.25 perfexpert.instruction accesses.L2i hits=0.002 perfexpert.branch instructions.mispredicted=0.0 perfexpert.floating-point instr.fast FP instr=5.073 perfexpert.data accesses.L2d hits=1.846 ... 26 / 42
  • 122. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Extending PerfExpert 27 / 42
  • 123. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Extending PerfExpert Recommendation Selection Functions 27 / 42
  • 124. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Extending PerfExpert Recommendation Selection Functions SELECT r.id AS recommendation id, SUM( (CASE c.short WHEN ’d-l1’ THEN (m.data accesses L1d hits - (max * 0.1)) ELSE 0 END) + ... ) AS score FROM recommendation AS r JOIN metric AS m JOIN (SELECT MAX( m.data accesses L1d hits, m.data accesses L2d hits, ... ) AS max FROM metric AS m WHERE m.overall * 100 / (0.5 * (100 - m.ratio floating point) + m.ratio floating point) > 1 AND m.id = @RID) WHERE (r.loop <= @LPD AND m.code type = ’loop’) OR (r.loop IS NULL AND m.code type = ’function’) AND m.id = @RID GROUP BY r.id ORDER BY score DESC; 27 / 42
  • 125. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Extending PerfExpert 28 / 42
  • 126. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Extending PerfExpert Pattern Recognizers 28 / 42
  • 127. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Extending PerfExpert Pattern Recognizers nested iteration statement : WHILE ’(’ exp ’)’ WHILE ’(’ exp ’)’ stmnt | WHILE ’(’ exp ’)’ ’’ WHILE ’(’ exp ’)’ stmnt ’’ | DO DO stmnt WHILE ’(’ exp ’)’ ’;’ stmnt WHILE ’(’ exp ’)’ ’;’ | DO ’’ DO stmnt WHILE ’(’ exp ’)’ ’;’ ’’ WHILE ’(’ exp ’)’ ’;’ | FOR ’(’ exp stmnt exp stmnt ’)’ FOR ’(’ exp stmnt exp stmnt ’)’ stmnt | FOR ’(’ exp stmnt exp stmnt ’)’ ’’ FOR ’(’ exp stmnt exp stmnt ’)’ stmnt ’’ | FOR ’(’ exp stmnt exp stmnt exp ’)’ FOR ’(’ exp stmnt exp stmnt exp ’)’ stmnt | FOR ’(’ exp stmnt exp stmnt exp ’)’ ’’ FOR ’(’ exp stmnt exp stmnt exp ’)’ stmnt ’’ ; 28 / 42
  • 128. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Extending PerfExpert 29 / 42
  • 129. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Extending PerfExpert Code Transformers 29 / 42
  • 130. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Extending PerfExpert Code Transformers create c loop2 ../source/mm.c activate INTERPROCEDURAL SUMMARY PRECONDITION activate TRANSFORMERS INTER FULL activate PRECONDITIONS INTER FULL setproperty SEMANTICS FIX POINT OPERATOR ‘‘derivative’’ module compute apply LOOP INTERCHANGE loop 8 apply UNSPLIT[%PROGRAM] close quit 29 / 42
  • 131. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Hands on Tutorial Accessing Stampede: ssh login@stampede.tacc.utexas.edu use the password that has been provided to you Request a Compute Node: ./reserve now we are ready to go... 30 / 42
  • 132. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Hands on Tutorial Accessing Stampede: cd 1 perfexpert perfexpert -s mm.c mm grep -R "running time" * more mm.c more perfexpert-temp-zUKfkx7/1/fragments/new/mm.c perfexpert mm perfexpert -r 5 mm cd ../2 perfexpert -m -s backprop.c backprop 31 / 42
  • 133. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Agenda 32 / 42
  • 134. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Agenda What we saw in the morning: Introduction and motivation What PerfExpert can provide to you? Demo How PerfExpert does that? (opening Pandora’s box) Extending PerfExpert Hands on tutorial Morning closure 32 / 42
  • 135. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Agenda What we saw in the morning: Introduction and motivation What PerfExpert can provide to you? Demo How PerfExpert does that? (opening Pandora’s box) Extending PerfExpert Hands on tutorial Morning closure What we will see in the afternoon: How to enhance the application performance using memory access metrics (MAPCO) 32 / 42
  • 136. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Agenda 1 Introduction 2 PerfExpert 3 MACPO 4 GPU/Accelerators 5 Closure 33 / 42
  • 137. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Short Demo Short demo 34 / 42
  • 138. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Agenda 1 Introduction 2 PerfExpert 3 MACPO 4 GPU/Accelerators 5 Closure 35 / 42
  • 139. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Performance Optimization by Mapping to Accelerators 36 / 42
  • 140. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Performance Optimization by Mapping to Accelerators 36 / 42
  • 141. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Performance Optimization by Mapping to Accelerators Mapping of code segments to accelerators is becoming one of the most methods for optimizing the performance of an application 36 / 42
  • 142. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Performance Optimization by Mapping to Accelerators Mapping of code segments to accelerators is becoming one of the most methods for optimizing the performance of an application Problem: how to select those parts of an application which will benefit from execution on an accelerator? 36 / 42
  • 143. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Code Segments for SIMT/SIMD Execution 37 / 42
  • 144. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Code Segments for SIMT/SIMD Execution 37 / 42
  • 145. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Code Segments for SIMT/SIMD Execution Optimize for multicore chip execution — PerfExpert, why? 37 / 42
  • 146. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Code Segments for SIMT/SIMD Execution Optimize for multicore chip execution — PerfExpert, why? Identify time consuming kernels in code — PerfExpert 37 / 42
  • 147. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Code Segments for SIMT/SIMD Execution Optimize for multicore chip execution — PerfExpert, why? Identify time consuming kernels in code — PerfExpert Eliminate kernels not easily mappable for SIMT/SIMD execution — How? 37 / 42
  • 148. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Code Segments for SIMT/SIMD Execution Optimize for multicore chip execution — PerfExpert, why? Identify time consuming kernels in code — PerfExpert Eliminate kernels not easily mappable for SIMT/SIMD execution — How? Characterize the kernels suitable for SIMT/SIMD execution –– What properties? 37 / 42
  • 149. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Code Segments for SIMT/SIMD Execution Optimize for multicore chip execution — PerfExpert, why? Identify time consuming kernels in code — PerfExpert Eliminate kernels not easily mappable for SIMT/SIMD execution — How? Characterize the kernels suitable for SIMT/SIMD execution –– What properties? Rank appropriate kernels using the characteristics identified in the last step 37 / 42
  • 150. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Code Segments for SIMT/SIMD Execution Optimize for multicore chip execution — PerfExpert, why? Identify time consuming kernels in code — PerfExpert Eliminate kernels not easily mappable for SIMT/SIMD execution — How? Characterize the kernels suitable for SIMT/SIMD execution –– What properties? Rank appropriate kernels using the characteristics identified in the last step Estimate cost of data movement 37 / 42
  • 151. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Code Segments for SIMT/SIMD Execution Optimize for multicore chip execution — PerfExpert, why? Identify time consuming kernels in code — PerfExpert Eliminate kernels not easily mappable for SIMT/SIMD execution — How? Characterize the kernels suitable for SIMT/SIMD execution –– What properties? Rank appropriate kernels using the characteristics identified in the last step Estimate cost of data movement Look for refactorings that will enable leaving data on accelerator 37 / 42
  • 152. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Code Segments for SIMT/SIMD Execution Optimize for multicore chip execution — PerfExpert, why? Identify time consuming kernels in code — PerfExpert Eliminate kernels not easily mappable for SIMT/SIMD execution — How? Characterize the kernels suitable for SIMT/SIMD execution –– What properties? Rank appropriate kernels using the characteristics identified in the last step Estimate cost of data movement Look for refactorings that will enable leaving data on accelerator Generate compiler annotations for translation of C/C++/Fortran to CUDA/OpenCL 37 / 42
  • 153. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Code Segments for SIMT/SIMD Execution Optimize for multicore chip execution — PerfExpert, why? Identify time consuming kernels in code — PerfExpert Eliminate kernels not easily mappable for SIMT/SIMD execution — How? Characterize the kernels suitable for SIMT/SIMD execution –– What properties? Rank appropriate kernels using the characteristics identified in the last step Estimate cost of data movement Look for refactorings that will enable leaving data on accelerator Generate compiler annotations for translation of C/C++/Fortran to CUDA/OpenCL Suggest kernels needing new algorithms 37 / 42
  • 154. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Code Segments for SIMT/SIMD Execution 38 / 42
  • 155. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Code Segments for SIMT/SIMD Execution Unsuitable Kernels 38 / 42
  • 156. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Code Segments for SIMT/SIMD Execution Unsuitable Kernels Frequent TLB misses 38 / 42
  • 157. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Code Segments for SIMT/SIMD Execution Unsuitable Kernels Frequent TLB misses High fraction of branches 38 / 42
  • 158. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Code Segments for SIMT/SIMD Execution Unsuitable Kernels Frequent TLB misses High fraction of branches Cache conflicts across cores 38 / 42
  • 159. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Code Segments for SIMT/SIMD Execution Unsuitable Kernels Frequent TLB misses High fraction of branches Cache conflicts across cores Irregular access strides for kernel data structures Characterizing “Good” Kernels 38 / 42
  • 160. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Code Segments for SIMT/SIMD Execution Unsuitable Kernels Frequent TLB misses High fraction of branches Cache conflicts across cores Irregular access strides for kernel data structures Characterizing “Good” Kernels Computational intensity 38 / 42
  • 161. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Code Segments for SIMT/SIMD Execution Unsuitable Kernels Frequent TLB misses High fraction of branches Cache conflicts across cores Irregular access strides for kernel data structures Characterizing “Good” Kernels Computational intensity Pure “local” SPMD parallelism 38 / 42
  • 162. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Code Segments for SIMT/SIMD Execution Unsuitable Kernels Frequent TLB misses High fraction of branches Cache conflicts across cores Irregular access strides for kernel data structures Characterizing “Good” Kernels Computational intensity Pure “local” SPMD parallelism Streaming parallelism or vectorization 38 / 42
  • 163. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Code Segments for SIMT/SIMD Execution Unsuitable Kernels Frequent TLB misses High fraction of branches Cache conflicts across cores Irregular access strides for kernel data structures Characterizing “Good” Kernels Computational intensity Pure “local” SPMD parallelism Streaming parallelism or vectorization Regular access strides for data structures 38 / 42
  • 164. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Code Segments for SIMT/SIMD Execution Unsuitable Kernels Frequent TLB misses High fraction of branches Cache conflicts across cores Irregular access strides for kernel data structures Characterizing “Good” Kernels Computational intensity Pure “local” SPMD parallelism Streaming parallelism or vectorization Regular access strides for data structures Data reuse factor and data transfer volume 38 / 42
  • 165. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Code Segments for SIMT/SIMD Execution Unsuitable Kernels Frequent TLB misses High fraction of branches Cache conflicts across cores Irregular access strides for kernel data structures Characterizing “Good” Kernels Computational intensity Pure “local” SPMD parallelism Streaming parallelism or vectorization Regular access strides for data structures Data reuse factor and data transfer volume “Limited” recursion 38 / 42
  • 166. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Code Segments for SIMT/SIMD Execution 39 / 42
  • 167. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Code Segments for SIMT/SIMD Execution Ranking “Good” Kernels 39 / 42
  • 168. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Code Segments for SIMT/SIMD Execution Ranking “Good” Kernels Curve fit characteristics to speed-up measurements of kernels that have already been mapped 39 / 42
  • 169. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Code Segments for SIMT/SIMD Execution Ranking “Good” Kernels Curve fit characteristics to speed-up measurements of kernels that have already been mapped Sort by values of characteristics in some chosen order 39 / 42
  • 170. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Code Segments for SIMT/SIMD Execution Ranking “Good” Kernels Curve fit characteristics to speed-up measurements of kernels that have already been mapped Sort by values of characteristics in some chosen order Hold up your thumb? 39 / 42
  • 171. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Example 40 / 42
  • 172. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Example 40 / 42
  • 173. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Agenda 1 Introduction 2 PerfExpert 3 MACPO 4 GPU/Accelerators 5 Closure 41 / 42