The Impact of Compiler Auto-Optimisation
on Arm-based HPC Microarchitectures
Emanuele Del Sozzo¹, Javier Setoain², Filippo Spiga²
emanuele.delsozzo@polimi.it
¹ Politecnico di Milano
² Arm Research
NECSTLab – Politecnico di Milano
9/11/2018
<disclaimer>
This presentation describes the results of my
internship project at Arm Research in Cambridge
</disclaimer>
What is Arm?
• A semiconductor and software design company
• Primary business: CPU design
• Secondary business: software tools, SoCs, GPUs
• Market: (mainly) embedded devices (microcontrollers, mobile phones,
tablets, etc.)
• Business model: creation and licensing of its IPs
Why Arm?
Master’s thesis topic: scheduling policy for Arm big.LITTLE[1]
• QoS and power efficiency oriented
• Dynamic Voltage and Frequency Scaling (DVFS)
• Task allocation
• Exploitation of all the underlying resources
[Figure: a scheduling policy sits between the applications (App 0, App 1, …, App N) and an ARM big.LITTLE platform; labels: throughput, frequency, thread number, thread mapping; hardware: 4× A15 and 4× A7 cores, two L2 caches, AXI cache-coherent bus, DDR]
[1] E. Del Sozzo, G. C. Durelli, E. M. G. Trainiti, A. Miele, M. D. Santambrogio and C. Bolchini, "Workload-aware power optimization strategy for asymmetric multiprocessors," 2016 Design, Automation & Test in Europe Conference & Exhibition (DATE), Dresden, 2016, pp. 531-534.
Context Definition
• Divergence in Arm architecture implementation by partners
• Differences in performance for the same executable on different microarchitectures
• The difference is amplified when a binary optimized for a specific chip runs on another one
A big.LITTLE Example
[Figure: C / C++ code → compilation → binary, targeting a big cluster (4× Cortex-A15) and a LITTLE cluster (4× Cortex-A7)]
A big.LITTLE Example
[Figure: Cortex-A7 pipeline, with blocks labeled Fetch, Decode, Issue, Queue, Integer, Multiply, Floating-Point / NEON, Dual Issue, Load / Store, Writeback]
[Figure: Cortex-A15 pipeline, with blocks labeled Fetch, Loop Cache, Decode, Rename & Dispatch, Issue Queue, Integer (×2), Multiply, Floating-Point / NEON, Branch, Load, Store, Writeback]
Internship Goal
• Evaluate the impact of compiler optimizations on HPC microarchitectures
• Develop a methodology to systematically find optimal subsets of optimization flags
• Analyze the performance loss when compiling for one chip using the optimal subset of flags found for another
Compiler Optimization
Compiler Optimization
• Choosing the right set of compiler optimizations is complex and depends on[2]:
• Programming language
• Application
• Architecture
• (Some of the) open problems in the compiler optimization field:
1. Which optimizations to use / which set of parameters to choose from (selection problem)
2. In which order to apply the optimizations (phase ordering problem)
[2] A. H. Ashouri, W. Killian, G. Palermo and C. Silvano, "A Survey on Compiler Autotuning using Machine Learning", ACM Computing Surveys 2018
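To give a feel for why both problems are hard, the short Python sketch below counts the candidate configurations under two simplifying assumptions that are not stated on the slide: every flag is a plain on/off switch, and every pass is applied exactly once. The function names are illustrative only.

```python
import math


def selection_space(n_flags: int) -> int:
    """Number of on/off subsets of n independent boolean flags (assumption: no parametric flags)."""
    return 2 ** n_flags


def ordering_space(n_passes: int) -> int:
    """Number of orderings of n distinct passes, each applied exactly once (assumption)."""
    return math.factorial(n_passes)


if __name__ == "__main__":
    for n in (10, 20, 50):
        print(n, selection_space(n), ordering_space(n))
```

Real pipelines repeat passes and include parametric options, so the actual search spaces are larger still; exhaustive search quickly stops being an option.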
Selection Problem
• Compilers already provide fixed sequences of optimizations:
-O0, -O1, -O2, -O3, -Ofast, -Os, …
• These are not good enough to obtain the best achievable application-specific performance
• Literature approaches to optimize a given application rely on:
• Application characterization techniques[3]
• Optimization space exploration techniques[4]
• Machine learning models[5]
[3] G. Fursin et al., "Milepost GCC: Machine Learning Enabled Self-tuning Compiler", Int. J. of Parallel Programming 39, 3 (2011), 296–327.
[4] C. Blackmore, O. Ray and K. Eder, "Automatically Tuning the GCC Compiler to Optimize the Performance of Applications Running on Embedded Systems", ArXiv.org
[5] K. D. Cooper, P. J. Schielke and D. Subramanian, "Optimizing for reduced code space using genetic algorithms", ACM SIGPLAN Notices (1999)
Combined Elimination (CE)
An iterative method to find the optimal set of flags for a given application[4]
[4] C. Blackmore, O. Ray and K. Eder, "Automatically Tuning the GCC Compiler to Optimize the Performance of Applications Running on Embedded Systems", ArXiv.org
T_B is the execution time of the target program compiled with configuration B
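A minimal Python sketch of the iterative elimination loop, following the commonly described form of Combined Elimination: measure the baseline time T_B, try switching each remaining flag off, remove the flag whose removal helps most, then re-check the other promising flags against the updated baseline. The `measure` callback and the function name are assumptions for illustration; this is not the project's actual implementation.

```python
from typing import Callable, FrozenSet, Iterable


def combined_elimination(
    flags: Iterable[str],
    measure: Callable[[FrozenSet[str]], float],
) -> FrozenSet[str]:
    """Return the flags left enabled after the elimination loop.

    `measure(enabled)` is a hypothetical callback that compiles and runs the
    target program with exactly the flags in `enabled` and returns its time.
    """
    enabled = frozenset(flags)   # configuration B: start with every flag on
    t_base = measure(enabled)    # T_B
    candidates = set(enabled)    # flags that may still be switched off

    while candidates:
        # Time of the program with each candidate switched off on its own.
        times = {f: measure(enabled - {f}) for f in candidates}
        # Flags whose removal makes the program faster, most beneficial first.
        harmful = sorted(
            (f for f in candidates if times[f] < t_base),
            key=lambda f: times[f],
        )
        if not harmful:
            break  # no single removal improves on the current baseline

        # Remove the most harmful flag and update the baseline.
        worst = harmful[0]
        enabled, t_base = enabled - {worst}, times[worst]
        candidates.discard(worst)

        # Re-check the other harmful flags against the new baseline.
        for flag in harmful[1:]:
            t_without = measure(enabled - {flag})
            if t_without < t_base:
                enabled, t_base = enabled - {flag}, t_without
                candidates.discard(flag)

    return enabled
```

With real measurements, `measure` would typically repeat each run and use a robust statistic, since timing noise can otherwise make a neutral flag look harmful.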
Implementation and Initial Evaluation
CE Implementation
• Python implementation of CE
• First evaluation on an x86 desktop machine (quad-core Intel i7-7700 CPU @ 3.60GHz)
• Integration with GCC and LLVM compilers
• Benchmarks: SPEC 2017
SPEC 2017 Benchmarks
Name             Type            Language         KLOC
600.perlbench_s  Integer         C                362
602.gcc_s        Integer         C                1,304
605.mcf_s        Integer         C                3
620.omnetpp_s    Integer         C++              134
623.xalancbmk_s  Integer         C++              520
625.x264_s       Integer         C                96
631.deepsjeng_s  Integer         C++              10
641.leela_s      Integer         C++              21
648.exchange2_s  Integer         Fortran          1
657.xz_s         Integer         C                33
603.bwaves_s     Floating-Point  Fortran          1
607.cactuBSSN_s  Floating-Point  C++, C, Fortran  257
619.lbm_s        Floating-Point  C                1
621.wrf_s        Floating-Point  Fortran, C       991
627.cam4_s       Floating-Point  Fortran, C       407
628.pop2_s       Floating-Point  Fortran, C       338
638.imagick_s    Floating-Point  C                259
644.nab_s        Floating-Point  C                24
649.fotonik3d_s  Floating-Point  Fortran          14
654.roms_s       Floating-Point  Fortran          210
GCC Flags Selection
• GCC provides a command to list the optimization flags:
gcc -OLevelFlag -Q --help=optimizers
• GCC flags selection details:
• Starting from -O3 optimization flags, CE deactivates one flag at a time
• Only the flags enabled by -O3 are considered
• Parametric flags are never deactivated
• GCC version 5.4.0 on Ubuntu 16.04
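As an illustration of how the flag list could be collected, the sketch below parses the text printed by `gcc -O3 -Q --help=optimizers` and keeps only the `-f…` switches marked `[enabled]`; lines carrying a value instead of `[enabled]` (parametric flags) are skipped, in line with the rules above. The function names and the regular expression are assumptions, and the exact column layout of the GCC output may differ between versions.

```python
import re
import subprocess
from typing import List

# Matches lines such as "  -ftree-vectorize        [enabled]".
_ENABLED = re.compile(r"^\s*(-f[\w-]+)\s+\[enabled\]\s*$")


def enabled_flags(help_text: str) -> List[str]:
    """Return the boolean -f flags reported as [enabled] in the help text."""
    flags = []
    for line in help_text.splitlines():
        match = _ENABLED.match(line)
        if match:
            flags.append(match.group(1))
    return flags


def disable(flag: str) -> str:
    """Turn -fsomething into its negative form -fno-something."""
    return "-fno-" + flag[len("-f"):]


def query_gcc(level: str = "-O3") -> str:
    """Run GCC and return the optimizer listing (requires gcc on PATH)."""
    result = subprocess.run(
        ["gcc", level, "-Q", "--help=optimizers"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

A search step would then compile with `-O3` plus `disable(flag)` for each flag it wants to switch off.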
GCC Results
[Figure: bar chart of normalized execution time (y-axis, 0 to 1.0) per SPEC 2017 benchmark, comparing -O3 Optimization with Combined Elimination]
Bar labels, in extracted order: 1.54X, 1.22X, 1.13X, 1.08X, 1.08X, 1.08X, 1.07X, 1.07X, 1.05X, 1.04X, 1.04X, 1.04X, 1.04X, 1.02X, 1.02X, 1.01X, 1.01X, 1.01X, 1.01X, 1X
Benchmarks, in extracted order: 648.exchange2, 631.deepsjeng, 605.mcf, 644.nab, 627.cam4, 625.x264, 654.roms, 628.pop2, 619.lbm, 602.gcc, 607.cactuBSSN, 621.wrf, 641.leela, 657.xz, 638.imagick, 600.perlbench, 649.fotonik3d, 620.omnetpp, 623.xalancbmk, 603.bwaves
LLVM Architecture
[Figure: LLVM toolchain flow: .c source → clang -emit-llvm → .bc / .ll → llvm-link → .bc / .ll → opt → .bc / .ll → llc → .s → llvm-mc / as → .o → lld / ld → executable. Additional inputs shown: libWhatever.a, dynLibWhatever.o. llvm-as converts .ll to .bc; llvm-dis converts .bc to .ll]
https://github.com/skeru/LLVM-intro/blob/master/img/03/toolchain.pdf
LLVM Optimization Passes
-tti -tbaa -scoped-noalias -assumption-cache-tracker -targetlibinfo -verify -simplifycfg
-domtree -sroa -early-cse -lower-expect -targetlibinfo -tti -tbaa -scoped-noalias
-assumption-cache-tracker -profile-summary-info -forceattrs -inferattrs -ipsccp -globalopt
-domtree -mem2reg -deadargelim -domtree -basicaa -aa -instcombine -simplifycfg
-basiccg -globals-aa -prune-eh -inline -functionattrs -argpromotion -domtree -sroa -basicaa
-aa -memoryssa -early-cse-memssa -speculative-execution -domtree -basicaa
-aa -lazy-value-info -jump-threading -lazy-value-info -correlated-propagation -simplifycfg
-domtree -basicaa -aa -instcombine -libcalls-shrinkwrap -loops -branch-prob -block-freq
-pgo-memop-opt -domtree -basicaa -aa -tailcallelim -simplifycfg -reassociate -domtree
-loops -loop-simplify -lcssa-verification -lcssa -basicaa -aa -scalar-evolution -loop-rotate
-licm -loop-unswitch -simplifycfg -domtree -basicaa -aa -instcombine -loops -loop-simplify
-lcssa-verification -lcssa -scalar-evolution -indvars -loop-idiom -loop-deletion…
Pass categories highlighted on the slide: Analysis Passes, Transform Passes, Utility Passes
LLVM Passes Selection
• LLVM provides a command to list the optimization passes:
llvm-as < /dev/null | opt -OLevelFlag -disable-output -debug-pass=Arguments
• LLVM passes selection details:
• Starting from -O3 optimization passes, CE deactivates one transform pass at a time
• If one pass is applied multiple times, CE deactivates one instance of that pass at a time
• Only the passes enabled by -O3 are considered
• LLVM version 5.0.2 on Ubuntu 16.04
• Only C/C++ SPEC benchmarks considered
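The per-instance deactivation can be sketched in Python as follows: parse the `Pass Arguments:` lines printed by the command above, number each occurrence of a pass, and build a new pass list with exactly one occurrence removed. The function names are assumptions, and concatenating all `Pass Arguments:` lines into a single list is a simplification of how the pass managers are actually nested.

```python
from collections import Counter
from typing import List, Tuple


def parse_pass_arguments(debug_output: str) -> List[str]:
    """Collect the pass names from every 'Pass Arguments:' line, in order."""
    prefix = "Pass Arguments:"
    passes: List[str] = []
    for line in debug_output.splitlines():
        line = line.strip()
        if line.startswith(prefix):
            passes.extend(line[len(prefix):].split())
    return passes


def pass_instances(passes: List[str]) -> List[Tuple[int, str, int]]:
    """Return (position, pass name, occurrence number) for every pass."""
    seen: Counter = Counter()
    instances = []
    for position, name in enumerate(passes):
        seen[name] += 1
        instances.append((position, name, seen[name]))
    return instances


def without_instance(passes: List[str], position: int) -> List[str]:
    """Return a copy of the pass list with the pass at `position` removed."""
    return passes[:position] + passes[position + 1:]
```

For example, removing "the second instance of -simplifycfg" means looking up the entry `(position, "-simplifycfg", 2)` and calling `without_instance` with that position; the resulting list can be passed back to `opt` in place of `-O3`.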
LLVM Results
[Figure: bar chart of normalized execution time (y-axis, 0 to 1.0) per SPEC 2017 benchmark, comparing -O3 Optimization with Combined Elimination]
Bar labels, in extracted order: 1.49X, 1.18X, 1.17X, 1.12X, 1.08X, 1.05X, 1.04X, 1.02X, 1.02X, 1.02X, 1.01X, 1X
Benchmarks, in extracted order: 631.deepsjeng, 605.mcf, 638.imagick, 644.nab, 623.xalancbmk, 657.xz, 625.x264, 620.omnetpp, 602.gcc, 641.leela, 600.perlbench, 619.lbm
GCC vs LLVM Results
[Figure: bar chart of normalized execution time (y-axis, 0 to 2.0) per benchmark for four configurations: GCC -O3 Optimization, LLVM -O3 Optimization, Combined Elimination w/ GCC, Combined Elimination w/ LLVM]
Benchmarks: 600.perlbench, 602.gcc, 605.mcf, 619.lbm, 620.omnetpp, 623.xalancbmk, 625.x264, 631.deepsjeng, 638.imagick, 641.leela, 644.nab, 657.xz
Execution times normalized wrt GCC -O3
Evaluation on Arm HPC Machines
Arm HPC Machines
Centriq
Producer: Qualcomm
ISA: Armv8
OS: Ubuntu 16.04
Cores: 47 (1 thread per core)
Coherency: Qualcomm High-Speed Coherent Interconnect
Compiler: Arm HPC Compiler 18.4.1
ThunderX2
Producer: Cavium
ISA: Armv8
OS: Ubuntu 16.04
Cores: 56 (4 threads per core)
Coherency: Cavium Coherent Processor Interconnect
Compiler: Arm HPC Compiler 18.4.1
Centriq Results
[Figure: bar chart of normalized execution time (y-axis, 0 to 1.0) per SPEC 2017 benchmark, comparing -O3 Optimization with Combined Elimination]
Bar labels, in extracted order: 1.21X, 1.2X, 1.09X, 1.08X, 1.07X, 1.07X, 1.04X, 1.03X, 1.03X, 1.03X, 1.02X, 1.02X, 1.01X, 1.01X, 1.01X, 1.01X, 1.01X, 1.01X, 1.01X, 1.01X
Benchmarks, in extracted order: 648.exchange2, 602.gcc, 625.x264, 605.mcf, 638.imagick, 623.xalancbmk, 649.fotonik3d, 620.omnetpp, 654.roms, 641.leela, 628.pop2, 600.perlbench, 644.nab, 631.deepsjeng, 657.xz, 619.lbm, 607.cactuBSSN, 621.wrf, 627.cam4, 603.bwaves
ThunderX2 Results
[Figure: bar chart of normalized execution time (y-axis, 0 to 1.0) per SPEC 2017 benchmark, comparing -O3 Optimization with Combined Elimination]
Bar labels, in extracted order: 1.83X, 1.15X, 1.08X, 1.08X, 1.06X, 1.05X, 1.05X, 1.05X, 1.04X, 1.03X, 1.02X, 1.02X, 1.02X, 1.02X, 1.01X, 1.01X, 1.01X, 1.01X, 1X, 1X
Benchmarks, in extracted order: 631.deepsjeng, 648.exchange2, 625.x264, 605.mcf, 641.leela, 600.perlbench, 638.imagick, 644.nab, 607.cactuBSSN, 619.lbm, 654.roms, 628.pop2, 649.fotonik3d, 620.omnetpp, 627.cam4, 602.gcc, 621.wrf, 603.bwaves, 657.xz, 623.xalancbmk
Cross Validation
ThunderX2 Optimizations on Centriq
[Figure: bar chart of normalized execution time (y-axis, 0 to 1.5) per SPEC 2017 benchmark on Centriq, comparing -O3 Optimization, Combined Elimination, and ThunderX2 Optimization]
Bar labels, in extracted order: 0.965X, 0.819X, 0.979X, 1.04X, 0.998X, 0.976X, 0.977X, 0.996X, 1X, 1.09X, 1.01X, 1.01X, 1X, 0.626X, 1X, 0.97X, 1.02X, 0.996X, 0.985X, 0.992X
Benchmarks, in extracted order: 600.perlbench, 602.gcc, 603.bwaves, 605.mcf, 607.cactuBSSN, 619.lbm, 620.omnetpp, 621.wrf, 623.xalancbmk, 625.x264, 627.cam4, 628.pop2, 631.deepsjeng, 638.imagick, 641.leela, 644.nab, 648.exchange2, 649.fotonik3d, 654.roms, 657.xz
ThunderX2 Optimizations on Centriq
Versus -O3 Optimization:
• Norm. ThunderX2 Optimization <= Norm. -O3 Optimization: 7 benchmarks, average 0.9766, median 0.9863
• Norm. ThunderX2 Optimization > Norm. -O3 Optimization: 13 benchmarks, average 1.0763, median 1.0210
Versus Combined Elimination (CE):
• Norm. ThunderX2 Optimization <= Norm. CE: 1 benchmark, average 0.9961, median 0.9961
• Norm. ThunderX2 Optimization > Norm. CE: 19 benchmarks, average 1.0986, median 1.0407
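A summary of this kind could be computed with a few lines of Python. The slides do not state exactly which quantity is averaged, so the sketch assumes it is the ratio between the cross-optimized normalized time and the reference normalized time (-O3 or CE) for each benchmark; the function name, the dictionary layout, and the use of `None` for an empty group are illustrative choices.

```python
from statistics import mean, median
from typing import Dict, List, Optional, Tuple

Summary = Tuple[int, Optional[float], Optional[float]]  # (count, average, median)


def _stats(values: List[float]) -> Summary:
    """Count, average and median of a group; an empty group has no average or median."""
    if not values:
        return (0, None, None)
    return (len(values), mean(values), median(values))


def split_summary(cross: Dict[str, float], reference: Dict[str, float]) -> Dict[str, Summary]:
    """Split benchmarks by whether the cross-optimized build is no slower than the reference.

    Assumption: both dicts map benchmark name -> normalized execution time,
    and the statistic of interest is cross / reference per benchmark.
    """
    ratios = [cross[name] / reference[name] for name in cross]
    return {
        "<=": _stats([r for r in ratios if r <= 1.0]),
        ">": _stats([r for r in ratios if r > 1.0]),
    }
```

Each group in the result carries the same three values shown above: number of benchmarks, average, and median.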
Centriq Optimizations on ThunderX2
[Figure: bar chart of normalized execution time (y-axis, 0 to 1.2) per SPEC 2017 benchmark on ThunderX2, comparing -O3 Optimization, Combined Elimination, and Centriq Optimization]
Bar labels, in extracted order: 1.03X, 0.903X, 1X, 1.02X, 1.01X, 0.995X, 0.969X, 0.999X, 0.986X, 1.07X, 1X, 1X, 1.77X, 1.02X, 1.01X, 1.03X, 1.1X, 1X, 0.895X, 0.994X
Benchmarks, in extracted order: 600.perlbench, 602.gcc, 603.bwaves, 605.mcf, 607.cactuBSSN, 619.lbm, 620.omnetpp, 621.wrf, 623.xalancbmk, 625.x264, 627.cam4, 628.pop2, 631.deepsjeng, 638.imagick, 641.leela, 644.nab, 648.exchange2, 649.fotonik3d, 654.roms, 657.xz
Centriq Optimizations on ThunderX2
Versus -O3 Optimization:
• Norm. Centriq Optimization <= Norm. -O3 Optimization: 12 benchmarks, average 0.9395, median 0.9771
• Norm. Centriq Optimization > Norm. -O3 Optimization: 8 benchmarks, average 1.0355, median 1.0101
Versus Combined Elimination (CE):
• Norm. Centriq Optimization <= Norm. CE: 0 benchmarks (no average or median)
• Norm. Centriq Optimization > Norm. CE: 20 benchmarks, average 1.0352, median 1.0209
Interesting Facts
• Some combinations of flags/passes cause compile-time or runtime errors
• In benchmark 644.nab_s, deactivating one of the following passes:
-instcombine (fifth instance): combine redundant instructions
-jump-threading (second instance): find distinct threads of control flow running through a basic block
causes the LLVM compiler to loop indefinitely (both on x86 and Arm HPC machines)
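Because some configurations fail or never terminate, an automated search needs to guard every compile and run step. The sketch below shows one way to do that in Python with `subprocess.run` and a timeout; the function name and the choice to report every failure as `None` are assumptions, not a description of the tool used in the project.

```python
import subprocess
import time
from typing import Optional, Sequence


def timed_run(cmd: Sequence[str], timeout_s: float) -> Optional[float]:
    """Run `cmd` and return its wall-clock time in seconds.

    Returns None if the command times out, exits with a non-zero status,
    or cannot be started, so the caller can discard that configuration.
    """
    start = time.perf_counter()
    try:
        subprocess.run(
            cmd,
            check=True,                 # non-zero exit raises CalledProcessError
            timeout=timeout_s,          # hung process raises TimeoutExpired
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError, OSError):
        return None
    return time.perf_counter() - start
```

A search loop can treat a `None` result as "worse than any valid time", which keeps a looping compiler invocation from stalling the whole exploration.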
Results Discussion and Next Steps
Results Discussion
• In almost all cases, CE found a better configuration of passes
• Exception: for 623.xalancbmk_s on ThunderX2, no better configuration was found
• Final configurations:
• For each machine, different benchmarks converged to different configurations
• The final Centriq and ThunderX2 configurations of the same benchmark differ
• Cross validation shows the impact of optimizing for a different chip:
• In just one case, a benchmark optimized for a different chip performed (slightly) better
• Optimizing for Centriq gives better results than optimizing for ThunderX2 wrt a -O3 baseline
(Possible) Next Steps
• Repeat the same evaluation on Arm HPC machines using GCC
• Experiments are currently running
• Identify patterns within final configurations
• Investigate the use of machine learning models:
• PROS: faster than CE (which takes from a few hours to a few days)
• CONS: a huge number of benchmarks is needed to build a model
Thank You!
Danke!
Merci!
Gracias!
Kiitos!
Emanuele Del Sozzo (emanuele.delsozzo@polimi.it)
Javier Setoain (javier.setoain@arm.com)
Filippo Spiga (filippo.spiga@arm.com)
The trademarks featured in this
presentation are registered and/or
unregistered trademarks of ARM
Limited (or its subsidiaries) in the EU
and/or elsewhere. All rights
reserved. All other marks featured
may be trademarks of their
respective owners.

More Related Content

What's hot

Integrating Adaptation Mechanisms Using Control Theory Centric Architecture M...
Integrating Adaptation Mechanisms Using Control Theory Centric Architecture M...Integrating Adaptation Mechanisms Using Control Theory Centric Architecture M...
Integrating Adaptation Mechanisms Using Control Theory Centric Architecture M...Filip Krikava
 
Machine-Learning-based Performance Heuristics for Runtime CPU/GPU Selection
Machine-Learning-based Performance Heuristics for Runtime CPU/GPU SelectionMachine-Learning-based Performance Heuristics for Runtime CPU/GPU Selection
Machine-Learning-based Performance Heuristics for Runtime CPU/GPU SelectionAkihiro Hayashi
 
Cost estimation using cocomo model
Cost estimation using cocomo modelCost estimation using cocomo model
Cost estimation using cocomo modelNitesh Bichwani
 
Machine-learning based performance heuristics for Runtime CPU/GPU Selection i...
Machine-learning based performance heuristics for Runtime CPU/GPU Selection i...Machine-learning based performance heuristics for Runtime CPU/GPU Selection i...
Machine-learning based performance heuristics for Runtime CPU/GPU Selection i...Akihiro Hayashi
 
IBM ILOG CP Optimizer for Detailed Scheduling Illustrated on Three Problems
IBM ILOG CP Optimizer for Detailed Scheduling Illustrated on Three ProblemsIBM ILOG CP Optimizer for Detailed Scheduling Illustrated on Three Problems
IBM ILOG CP Optimizer for Detailed Scheduling Illustrated on Three ProblemsPhilippe Laborie
 
ACTRESS: Domain-Specific Modeling of Self-Adaptive Software Architectures
ACTRESS: Domain-Specific Modeling of Self-Adaptive Software ArchitecturesACTRESS: Domain-Specific Modeling of Self-Adaptive Software Architectures
ACTRESS: Domain-Specific Modeling of Self-Adaptive Software ArchitecturesFilip Krikava
 
TinyML as-a-Service
TinyML as-a-ServiceTinyML as-a-Service
TinyML as-a-ServiceHiroshi Doyu
 
“Introduction to the TVM Open Source Deep Learning Compiler Stack,” a Present...
“Introduction to the TVM Open Source Deep Learning Compiler Stack,” a Present...“Introduction to the TVM Open Source Deep Learning Compiler Stack,” a Present...
“Introduction to the TVM Open Source Deep Learning Compiler Stack,” a Present...Edge AI and Vision Alliance
 
Recent MIP Performance Improvements in IBM ILOG CPLEX Optimization Studio
Recent MIP Performance Improvements in IBM ILOG CPLEX Optimization StudioRecent MIP Performance Improvements in IBM ILOG CPLEX Optimization Studio
Recent MIP Performance Improvements in IBM ILOG CPLEX Optimization StudioIBM Decision Optimization
 
An introduction to CP Optimizer
An introduction to CP OptimizerAn introduction to CP Optimizer
An introduction to CP OptimizerPhilippe Laborie
 
TenYearsCPOptimizer
TenYearsCPOptimizerTenYearsCPOptimizer
TenYearsCPOptimizerPaulShawIBM
 

What's hot (15)

Integrating Adaptation Mechanisms Using Control Theory Centric Architecture M...
Integrating Adaptation Mechanisms Using Control Theory Centric Architecture M...Integrating Adaptation Mechanisms Using Control Theory Centric Architecture M...
Integrating Adaptation Mechanisms Using Control Theory Centric Architecture M...
 
Machine-Learning-based Performance Heuristics for Runtime CPU/GPU Selection
Machine-Learning-based Performance Heuristics for Runtime CPU/GPU SelectionMachine-Learning-based Performance Heuristics for Runtime CPU/GPU Selection
Machine-Learning-based Performance Heuristics for Runtime CPU/GPU Selection
 
Cocomo models
Cocomo modelsCocomo models
Cocomo models
 
Cost estimation using cocomo model
Cost estimation using cocomo modelCost estimation using cocomo model
Cost estimation using cocomo model
 
Machine-learning based performance heuristics for Runtime CPU/GPU Selection i...
Machine-learning based performance heuristics for Runtime CPU/GPU Selection i...Machine-learning based performance heuristics for Runtime CPU/GPU Selection i...
Machine-learning based performance heuristics for Runtime CPU/GPU Selection i...
 
IBM ILOG CP Optimizer for Detailed Scheduling Illustrated on Three Problems
IBM ILOG CP Optimizer for Detailed Scheduling Illustrated on Three ProblemsIBM ILOG CP Optimizer for Detailed Scheduling Illustrated on Three Problems
IBM ILOG CP Optimizer for Detailed Scheduling Illustrated on Three Problems
 
ACTRESS: Domain-Specific Modeling of Self-Adaptive Software Architectures
ACTRESS: Domain-Specific Modeling of Self-Adaptive Software ArchitecturesACTRESS: Domain-Specific Modeling of Self-Adaptive Software Architectures
ACTRESS: Domain-Specific Modeling of Self-Adaptive Software Architectures
 
TinyML as-a-Service
TinyML as-a-ServiceTinyML as-a-Service
TinyML as-a-Service
 
Cocomo
CocomoCocomo
Cocomo
 
“Introduction to the TVM Open Source Deep Learning Compiler Stack,” a Present...
“Introduction to the TVM Open Source Deep Learning Compiler Stack,” a Present...“Introduction to the TVM Open Source Deep Learning Compiler Stack,” a Present...
“Introduction to the TVM Open Source Deep Learning Compiler Stack,” a Present...
 
Recent MIP Performance Improvements in IBM ILOG CPLEX Optimization Studio
Recent MIP Performance Improvements in IBM ILOG CPLEX Optimization StudioRecent MIP Performance Improvements in IBM ILOG CPLEX Optimization Studio
Recent MIP Performance Improvements in IBM ILOG CPLEX Optimization Studio
 
YuanYiPeng_161017
YuanYiPeng_161017YuanYiPeng_161017
YuanYiPeng_161017
 
PhD Thesis Defense
PhD Thesis DefensePhD Thesis Defense
PhD Thesis Defense
 
An introduction to CP Optimizer
An introduction to CP OptimizerAn introduction to CP Optimizer
An introduction to CP Optimizer
 
TenYearsCPOptimizer
TenYearsCPOptimizerTenYearsCPOptimizer
TenYearsCPOptimizer
 

Similar to The Impact of Compiler Auto-Optimisation on Arm-based HPC Microarchitectures

How to use Apache TVM to optimize your ML models
How to use Apache TVM to optimize your ML modelsHow to use Apache TVM to optimize your ML models
How to use Apache TVM to optimize your ML modelsDatabricks
 
MILEPOST GCC: machine learning based research compiler
MILEPOST GCC: machine learning based research compilerMILEPOST GCC: machine learning based research compiler
MILEPOST GCC: machine learning based research compilerbutest
 
Developing Real-Time Systems on Application Processors
Developing Real-Time Systems on Application ProcessorsDeveloping Real-Time Systems on Application Processors
Developing Real-Time Systems on Application ProcessorsToradex
 
Cse viii-advanced-computer-architectures-06cs81-solution
Cse viii-advanced-computer-architectures-06cs81-solutionCse viii-advanced-computer-architectures-06cs81-solution
Cse viii-advanced-computer-architectures-06cs81-solutionShobha Kumar
 
Compiler Optimization-Space Exploration
Compiler Optimization-Space ExplorationCompiler Optimization-Space Exploration
Compiler Optimization-Space Explorationtmusabbir
 
Decision Optimization - CPLEX Optimization Studio - Product Overview(2).PPTX
Decision Optimization - CPLEX Optimization Studio - Product Overview(2).PPTXDecision Optimization - CPLEX Optimization Studio - Product Overview(2).PPTX
Decision Optimization - CPLEX Optimization Studio - Product Overview(2).PPTXSanjayKPrasad2
 
Software effort estimation through clustering techniques of RBFN network
Software effort estimation through clustering techniques of RBFN networkSoftware effort estimation through clustering techniques of RBFN network
Software effort estimation through clustering techniques of RBFN networkIOSR Journals
 
Early Successes Debugging with TotalView on the Intel Xeon Phi Coprocessor
Early Successes Debugging with TotalView on the Intel Xeon Phi CoprocessorEarly Successes Debugging with TotalView on the Intel Xeon Phi Coprocessor
Early Successes Debugging with TotalView on the Intel Xeon Phi CoprocessorIntel IT Center
 
Collective Mind: bringing reproducible research to the masses
Collective Mind: bringing reproducible research to the massesCollective Mind: bringing reproducible research to the masses
Collective Mind: bringing reproducible research to the massesGrigori Fursin
 
In Depth Constructive Cost Modeling related slides
In Depth Constructive Cost Modeling related slidesIn Depth Constructive Cost Modeling related slides
In Depth Constructive Cost Modeling related slidesChobodiDamsaraniPadm
 
Ml also helps generic compiler ?
Ml also helps generic compiler ?Ml also helps generic compiler ?
Ml also helps generic compiler ?Ryo Takahashi
 
An integrated approach for designing and testing specific processors
An integrated approach for designing and testing specific processorsAn integrated approach for designing and testing specific processors
An integrated approach for designing and testing specific processorsVLSICS Design
 
Webinar: Começando seus trabalhos com Machine Learning utilizando ferramentas...
Webinar: Começando seus trabalhos com Machine Learning utilizando ferramentas...Webinar: Começando seus trabalhos com Machine Learning utilizando ferramentas...
Webinar: Começando seus trabalhos com Machine Learning utilizando ferramentas...Embarcados
 

Similar to The Impact of Compiler Auto-Optimisation on Arm-based HPC Microarchitectures (20)

The Impact of Compiler Auto-Optimisation on Arm-based HPC Microarchitectures

  • 1. The Impact of Compiler Auto-Optimisation on Arm-based HPC Microarchitectures. Emanuele Del Sozzo (1), Javier Setoain (2), Filippo Spiga (2). emanuele.delsozzo@polimi.it. (1) Politecnico di Milano, (2) Arm Research. NECSTLab – Politecnico di Milano, 9/11/2018
  • 2. <disclaimer> This presentation describes the results of my internship project at Arm Research in Cambridge </disclaimer>
  • 3. What is Arm? • A semiconductor and software design company • Primary business: CPU design • Secondary business: software tools, SoC, GPUs • Market: (mainly) embedded devices (microcontrollers, mobile phones, tablets, etc.) • Business model: creation and licensing of its IPs
  • 4. Why Arm? Master’s thesis topic: scheduling policy for Arm big.LITTLE[1] • QoS and power efficiency oriented • Dynamic Voltage and Frequency Scaling (DVFS) • Task allocation • Exploitation of all the underlying resources [Diagram: applications (App 0, App 1, … App N) report throughput to a policy that controls frequency, thread number and thread mapping on an Arm big.LITTLE platform (four A15 cores and four A7 cores, one L2 cache per cluster, an AXI cache-coherent bus and DDR).] [1] E. Del Sozzo, G. C. Durelli, E. M. G. Trainiti, A. Miele, M. D. Santambrogio and C. Bolchini, "Workload-aware power optimization strategy for asymmetric multiprocessors," 2016 Design, Automation & Test in Europe Conference & Exhibition (DATE), Dresden, 2016, pp. 531-534.
  • 5. Context Definition • Arm partners implement the architecture in divergent ways • The same executable performs differently on different microarchitectures • The difference is amplified when a binary optimized for one specific chip runs on another one
  • 6. A big.LITTLE Example [Diagram: C / C++ code goes through compilation into one binary, which runs on both the big cluster (four Cortex-A15 cores) and the LITTLE cluster (four Cortex-A7 cores).]
  • 9. A big.LITTLE Example [Diagram comparing the two pipelines. Cortex-A7 pipeline labels: Fetch, Decode, Issue, Queue, Integer, Multiply, Floating-Point / NEON, Dual Issue, Load / Store, Writeback. Cortex-A15 pipeline labels: Fetch; Decode, Rename & Dispatch; Loop Cache; Issue Queue; Integer; Integer Multiply; Floating-Point / NEON; Branch; Load; Store; Writeback.]
  • 10. Internship Goal • Evaluate the impact of compiler optimizations on HPC microarchitectures • Develop a methodology to systematically find optimal subsets of optimization flags • Analyze the performance loss when compiling for one chip using the optimal subset of flags found for another
  • 12. Compiler Optimization • Choosing the right set of compiler optimizations is complex and depends on[2]: • Programming language • Application • Architecture • (Some of the) open problems in the compiler optimization field: 1. Which optimizations to use and which set of parameters to choose (selection problem) 2. In which order to apply the optimizations (phase ordering problem) [2] A. H. Ashouri, W. Killian, G. Palermo and C. Silvano, "A Survey on Compiler Autotuning using Machine Learning", ACM Computing Surveys 2018
  • 15. Selection Problem • Compilers already provide fixed sequences of optimizations: -O0, -O1, -O2, -O3, -Ofast, -Os, … • These are not good enough to obtain the best achievable application-specific performance • Literature approaches to optimize a given application rely on: • Application characterization techniques[3] • Optimization space exploration techniques[4] • Machine learning models[5] [3] G. Fursin et al. 2011. “Milepost GCC: Machine Learning Enabled Self-tuning Compiler”. Int. J. of Parallel Programming 39, 3 (2011), 296–327. [4] C. Blackmore, O. Ray and K. Eder, “Automatically Tuning the GCC Compiler to Optimize the Performance of Applications Running on Embedded Systems”, ArXiv.org [5] K.D. Cooper, P. J. Schielke, and D. Subramanian, 1999, "Optimizing for reduced code space using genetic algorithms", ACM SIGPLAN Notices (1999)
  • 17. Combined Elimination (CE) An iterative method to find the optimal set of flags for a given application[4]. T_B is the execution time of the target program compiled with configuration B. [4] C. Blackmore, O. Ray and K. Eder, “Automatically Tuning the GCC Compiler to Optimize the Performance of Applications Running on Embedded Systems”, ArXiv.org
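The slide names the method but the transcript does not carry its steps, so the following is only a minimal sketch of Combined Elimination as it is commonly described: measure the effect of switching off each flag on its own, remove the most harmful flag, then re-test the other harmful candidates against the updated baseline, and repeat until no removal helps. The names `combined_elimination` and `measure`, and the toy cost model, are hypothetical and not taken from the presentation; a real `measure` would compile and time the benchmark.

```python
from typing import Callable, FrozenSet, Iterable, Set, Tuple


def combined_elimination(
    flags: Iterable[str],
    measure: Callable[[FrozenSet[str]], float],
) -> Tuple[Set[str], float]:
    """Greedy CE sketch: returns the surviving flags and their execution time.

    `measure` (hypothetical) maps a set of enabled flags to an execution time.
    """
    enabled = set(flags)
    t_base = measure(frozenset(enabled))  # T_B for the current configuration B
    while True:
        # Relative improvement of switching each flag off, one at a time.
        rip = {}
        for flag in sorted(enabled):
            t_off = measure(frozenset(enabled - {flag}))
            rip[flag] = (t_off - t_base) / t_base
        # Flags whose removal speeds the program up, most harmful first.
        harmful = sorted((f for f in rip if rip[f] < 0), key=lambda f: rip[f])
        if not harmful:
            return enabled, t_base
        # Remove the most harmful flag and update the baseline.
        enabled.discard(harmful[0])
        t_base = measure(frozenset(enabled))
        # Re-test the remaining candidates against the new baseline.
        for flag in harmful[1:]:
            t_off = measure(frozenset(enabled - {flag}))
            if t_off < t_base:
                enabled.discard(flag)
                t_base = t_off


def toy_measure(enabled: FrozenSet[str]) -> float:
    """Made-up cost model: 'a' and 'c' slow the program down, 'b' speeds it up."""
    return 10.0 + 2.0 * ("a" in enabled) - 3.0 * ("b" in enabled) + 1.0 * ("c" in enabled)


best_flags, best_time = combined_elimination(["a", "b", "c", "d"], toy_measure)
speedup = toy_measure(frozenset("abcd")) / best_time
```

In the toy model the search drops `a` and `c` and keeps `b` and `d`. Each `measure` call stands for a full compile-and-run cycle, which is why the number of evaluations matters far more than the cost of the loop itself.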
  • 19. CE Implementation • Python implementation of CE • First evaluation on an x86 desktop machine (quad-core Intel i7-7700 CPU @ 3.60GHz) • Integration with GCC and LLVM compilers • Benchmarks: SPEC 2017
  • 20. SPEC 2017 Benchmarks
      Name             Type            Language         KLOC
      600.perlbench_s  Integer         C                362
      602.gcc_s        Integer         C                1,304
      605.mcf_s        Integer         C                3
      620.omnetpp_s    Integer         C++              134
      623.xalancbmk_s  Integer         C++              520
      625.x264_s       Integer         C                96
      631.deepsjeng_s  Integer         C++              10
      641.leela_s      Integer         C++              21
      648.exchange2_s  Integer         Fortran          1
      657.xz_s         Integer         C                33
      603.bwaves_s     Floating-Point  Fortran          1
      607.cactuBSSN_s  Floating-Point  C++, C, Fortran  257
      619.lbm_s        Floating-Point  C                1
      621.wrf_s        Floating-Point  Fortran, C       991
      627.cam4_s       Floating-Point  Fortran, C       407
      628.pop2_s       Floating-Point  Fortran, C       338
      638.imagick_s    Floating-Point  C                259
      644.nab_s        Floating-Point  C                24
      649.fotonik3d_s  Floating-Point  Fortran          14
      654.roms_s       Floating-Point  Fortran          210
  • 21. GCC Flags Selection • GCC provides a command to list the optimization flags: gcc -OLevelFlag -Q --help=optimizers • GCC flags selection details: • Starting from the -O3 optimization flags, CE deactivates one flag at a time • Only the flags enabled by -O3 are considered • Parametric flags are never deactivated • GCC version 5.4.0 on Ubuntu 16.04
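As an illustration of the selection rules on this slide, the sketch below parses the text printed by `gcc -O3 -Q --help=optimizers`, keeps only the flags reported as `[enabled]`, and builds the `-fno-` form used to switch one of them off. It assumes the usual "flag, whitespace, [enabled] or [disabled]" line layout; parametric flags (those printed with a value instead of a state) do not match and are therefore never deactivated. The function names and the sample text are hypothetical, and this is not the authors' implementation.

```python
import re
import subprocess
from typing import List

# A flag line such as "  -ftree-vectorize   \t\t[enabled]".
_ENABLED = re.compile(r"^\s+(-f[\w-]+)\s+\[enabled\]\s*$")


def parse_enabled_flags(help_text: str) -> List[str]:
    """Return the -f flags reported as [enabled]; valued flags are skipped."""
    flags = []
    for line in help_text.splitlines():
        match = _ENABLED.match(line)
        if match:
            flags.append(match.group(1))
    return flags


def negate(flag: str) -> str:
    """Turn -fsomething into -fno-something."""
    return "-fno-" + flag[len("-f"):]


def query_gcc(level: str = "-O3", gcc: str = "gcc") -> List[str]:
    """Ask an installed GCC which optimization flags the level enables."""
    result = subprocess.run(
        [gcc, level, "-Q", "--help=optimizers"],
        capture_output=True, text=True, check=True,
    )
    return parse_enabled_flags(result.stdout)


# Hand-written sample in the assumed output layout (not real GCC output).
SAMPLE = """The following options control optimizations:
  -faggressive-loop-optimizations \t[enabled]
  -falign-functions           \t\t[disabled]
  -fvect-cost-model=[unlimited|dynamic|cheap] \tdynamic
  -ftree-vectorize            \t\t[enabled]
"""

enabled = parse_enabled_flags(SAMPLE)
disabled_forms = [negate(flag) for flag in enabled]
```

Calling `query_gcc()` on a machine with GCC installed would replace the sample text; the exact list depends on the GCC version and target.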
  • 22. GCC Results [Bar chart: normalized execution time (0 to 1.0) per benchmark, -O3 Optimization vs. Combined Elimination. Speedup labels, in the order extracted: 1.54X, 1.22X, 1.13X, 1.08X, 1.08X, 1.08X, 1.07X, 1.07X, 1.05X, 1.04X, 1.04X, 1.04X, 1.04X, 1.02X, 1.02X, 1.01X, 1.01X, 1.01X, 1.01X, 1X. Benchmarks, in the order extracted: 648.exchange2, 631.deepsjeng, 605.mcf, 644.nab, 627.cam4, 625.x264, 654.roms, 628.pop2, 619.lbm, 602.gcc, 607.cactuBSSN, 621.wrf, 641.leela, 657.xz, 638.imagick, 600.perlbench, 649.fotonik3d, 620.omnetpp, 623.xalancbmk, 603.bwaves.]
  • 23. LLVM Architecture [Toolchain diagram: .c source → clang -emit-llvm → .bc / .ll → llvm-link (together with libWhatever.a) → .bc / .ll → opt → .bc / .ll → llc → .s → llvm-mc / as → .o → lld / ld (together with dynLibWhatever.o) → executable. llvm-as converts .ll to .bc; llvm-dis converts .bc to .ll.] https://github.com/skeru/LLVM-intro/blob/master/img/03/toolchain.pdf
  • 25. LLVM Optimization Passes -tti -tbaa -scoped-noalias -assumption-cache-tracker -targetlibinfo -verify -simplifycfg -domtree -sroa -early-cse -lower-expect -targetlibinfo -tti -tbaa -scoped-noalias -assumption-cache-tracker -profile-summary-info -forceattrs -inferattrs -ipsccp -globalopt -domtree -mem2reg -deadargelim -domtree -basicaa -aa -instcombine -simplifycfg -basiccg -globals-aa -prune-eh -inline -functionattrs -argpromotion -domtree -sroa -basicaa -aa -memoryssa -early-cse-memssa -speculative-execution -domtree -basicaa -aa -lazy-value-info -jump-threading -lazy-value-info -correlated-propagation -simplifycfg -domtree -basicaa -aa -instcombine -libcalls-shrinkwrap -loops -branch-prob -block-freq -pgo-memop-opt -domtree -basicaa -aa -tailcallelim -simplifycfg -reassociate -domtree -loops -loop-simplify -lcssa-verification -lcssa -basicaa -aa -scalar-evolution -loop-rotate -licm -loop-unswitch -simplifycfg -domtree -basicaa -aa -instcombine -loops -loop-simplify -lcssa-verification -lcssa -scalar-evolution -indvars -loop-idiom -loop-deletion… [The slide builds highlight three categories of passes in this list: Analysis Passes, Transform Passes and Utility Passes.]
  • 31. LLVM Passes Selection • LLVM provides a command to list the optimization passes: llvm-as < /dev/null | opt -OLevelFlag -disable-output -debug-pass=Arguments • LLVM passes selection details: • Starting from the -O3 optimization passes, CE deactivates one transform pass at a time • If a pass is applied multiple times, CE deactivates one instance of that pass at a time • Only the passes enabled by -O3 are considered • LLVM version 5.0.2 on Ubuntu 16.04 • Only C/C++ SPEC benchmarks considered
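The per-instance rule above can be sketched as follows: collect the pass names printed after each `Pass Arguments:` prefix by `-debug-pass=Arguments`, then generate one candidate sequence per transform-pass instance by dropping only that occurrence. Flattening all `Pass Arguments:` lines into a single list is a simplification, and the `TRANSFORM` set and the sample text are assumptions made for the example; the presentation does not say how the authors classified passes.

```python
from typing import Iterator, List, Set, Tuple

PREFIX = "Pass Arguments:"


def parse_pass_arguments(debug_output: str) -> List[str]:
    """Collect the pass names that follow every 'Pass Arguments:' prefix."""
    passes: List[str] = []
    for line in debug_output.splitlines():
        if line.startswith(PREFIX):
            passes.extend(line[len(PREFIX):].split())
    return passes


def candidates(passes: List[str], transform: Set[str]) -> Iterator[Tuple[int, List[str]]]:
    """Yield (index, sequence) pairs, each with one transform instance removed."""
    for index, name in enumerate(passes):
        if name in transform:
            yield index, passes[:index] + passes[index + 1:]


# Shortened, hand-written sample (not real opt output).
SAMPLE = (
    "Pass Arguments:  -tti -simplifycfg -domtree -sroa -early-cse\n"
    "Pass Arguments:  -domtree -instcombine -simplifycfg\n"
)
# Assumed classification for the example; analysis passes such as -domtree are left out.
TRANSFORM = {"-simplifycfg", "-sroa", "-early-cse", "-instcombine"}

o3_passes = parse_pass_arguments(SAMPLE)
variants = list(candidates(o3_passes, TRANSFORM))
```

Because `-simplifycfg` occurs twice in the sample, it produces two separate candidates, each keeping the other occurrence in place.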
  • 32. LLVM Results [Bar chart: normalized execution time (0 to 1.0) per benchmark, -O3 Optimization vs. Combined Elimination. Speedup labels, in the order extracted: 1.49X, 1.18X, 1.17X, 1.12X, 1.08X, 1.05X, 1.04X, 1.02X, 1.02X, 1.02X, 1.01X, 1X. Benchmarks, in the order extracted: 631.deepsjeng, 605.mcf, 638.imagick, 644.nab, 623.xalancbmk, 657.xz, 625.x264, 620.omnetpp, 602.gcc, 641.leela, 600.perlbench, 619.lbm.]
  • 33. GCC vs LLVM Results [Bar chart: normalized execution time (0 to 2.0) for GCC -O3 Optimization, LLVM -O3 Optimization, Combined Elimination w/ GCC and Combined Elimination w/ LLVM on 600.perlbench, 602.gcc, 605.mcf, 619.lbm, 620.omnetpp, 623.xalancbmk, 625.x264, 631.deepsjeng, 638.imagick, 641.leela, 644.nab and 657.xz.] Execution times are normalized with respect to GCC -O3.
  • 35. 34 Arm HPC Machines • Centriq: Producer: Qualcomm; ISA: Armv8; OS: Ubuntu 16.04; Cores: 47 (1 thread per core); Coherency: Qualcomm High-Speed Coherent Interconnect; Compiler: Arm HPC Compiler 18.4.1 • ThunderX2: Producer: Cavium; ISA: Armv8; OS: Ubuntu 16.04; Cores: 56 (4 threads per core); Coherency: Cavium Coherent Processor Interconnect; Compiler: Arm HPC Compiler 18.4.1
  • 36. 35 Centriq Results [Bar chart: normalized execution time (y-axis 0 to 1.0), -O3 Optimization vs Combined Elimination. Bar labels per benchmark: 648.exchange2 1.21X, 602.gcc 1.2X, 625.x264 1.09X, 605.mcf 1.08X, 638.imagick 1.07X, 623.xalancbmk 1.07X, 649.fotonik3d 1.04X, 620.omnetpp 1.03X, 654.roms 1.03X, 641.leela 1.03X, 628.pop2 1.02X, 600.perlbench 1.02X, 644.nab 1.01X, 631.deepsjeng 1.01X, 657.xz 1.01X, 619.lbm 1.01X, 607.cactuBSSN 1.01X, 621.wrf 1.01X, 627.cam4 1.01X, 603.bwaves 1.01X]
  • 37. 36 ThunderX2 Results [Bar chart: normalized execution time (y-axis 0 to 1.0), -O3 Optimization vs Combined Elimination. Bar labels per benchmark: 631.deepsjeng 1.83X, 648.exchange2 1.15X, 625.x264 1.08X, 605.mcf 1.08X, 641.leela 1.06X, 600.perlbench 1.05X, 638.imagick 1.05X, 644.nab 1.05X, 607.cactuBSSN 1.04X, 619.lbm 1.03X, 654.roms 1.02X, 628.pop2 1.02X, 649.fotonik3d 1.02X, 620.omnetpp 1.02X, 627.cam4 1.01X, 602.gcc 1.01X, 621.wrf 1.01X, 603.bwaves 1.01X, 657.xz 1X, 623.xalancbmk 1X]
  • 39. 38 ThunderX2 Optimizations on Centriq [Bar chart: normalized execution time (y-axis 0 to 1.5). Series: -O3 Optimization, Combined Elimination, ThunderX2 Optimization. Bar labels per benchmark: 600.perlbench 0.965X, 602.gcc 0.819X, 603.bwaves 0.979X, 605.mcf 1.04X, 607.cactuBSSN 0.998X, 619.lbm 0.976X, 620.omnetpp 0.977X, 621.wrf 0.996X, 623.xalancbmk 1X, 625.x264 1.09X, 627.cam4 1.01X, 628.pop2 1.01X, 631.deepsjeng 1X, 638.imagick 0.626X, 641.leela 1X, 644.nab 0.97X, 648.exchange2 1.02X, 649.fotonik3d 0.996X, 654.roms 0.985X, 657.xz 0.992X]
  • 40. 39 ThunderX2 Optimizations on Centriq • Compared with -O3: Norm. ThunderX2 Optimization <= Norm. -O3 Optimization in 7 benchmarks (average 0.9766, median 0.9863); Norm. ThunderX2 Optimization > Norm. -O3 Optimization in 13 benchmarks (average 1.0763, median 1.0210) • Compared with CE: Norm. ThunderX2 Optimization <= Norm. CE in 1 benchmark (average 0.9961, median 0.9961); Norm. ThunderX2 Optimization > Norm. CE in 19 benchmarks (average 1.0986, median 1.0407)
  • 41. 40 Centriq Optimizations on ThunderX2 [Bar chart: normalized execution time (y-axis 0 to 1.2). Series: -O3 Optimization, Combined Elimination, Centriq Optimization. Bar labels per benchmark: 600.perlbench 1.03X, 602.gcc 0.903X, 603.bwaves 1X, 605.mcf 1.02X, 607.cactuBSSN 1.01X, 619.lbm 0.995X, 620.omnetpp 0.969X, 621.wrf 0.999X, 623.xalancbmk 0.986X, 625.x264 1.07X, 627.cam4 1X, 628.pop2 1X, 631.deepsjeng 1.77X, 638.imagick 1.02X, 641.leela 1.01X, 644.nab 1.03X, 648.exchange2 1.1X, 649.fotonik3d 1X, 654.roms 0.895X, 657.xz 0.994X]
  • 42. 41 Centriq Optimizations on ThunderX2 • Compared with -O3: Norm. Centriq Optimization <= Norm. -O3 Optimization in 12 benchmarks (average 0.9395, median 0.9771); Norm. Centriq Optimization > Norm. -O3 Optimization in 8 benchmarks (average 1.0355, median 1.0101) • Compared with CE: Norm. Centriq Optimization <= Norm. CE in 0 benchmarks (no average or median); Norm. Centriq Optimization > Norm. CE in 20 benchmarks (average 1.0352, median 1.0209)
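A summary of this kind (a count, an average and a median on each side of the comparison) could be computed along the following lines. This is only a sketch: the function name `split_summary` and the benchmark names are hypothetical, the numbers are invented for illustration and are not the measured results, and the slides do not state exactly which quantity is averaged, so the sketch assumes it is the cross-optimized normalized time.

```python
from statistics import mean, median


def split_summary(cross, reference):
    """Split benchmarks by whether cross <= reference, then summarize cross times."""
    groups = {"<=": [], ">": []}
    for name, value in cross.items():
        key = "<=" if value <= reference[name] else ">"
        groups[key].append(value)
    return {
        key: {
            "count": len(vals),
            # An empty group has no average or median, as on the slide.
            "average": mean(vals) if vals else None,
            "median": median(vals) if vals else None,
        }
        for key, vals in groups.items()
    }


# Hypothetical normalized execution times, NOT the measured results.
o3 = {"bench_a": 1.0, "bench_b": 1.0, "bench_c": 1.0}
cross = {"bench_a": 0.9, "bench_b": 1.1, "bench_c": 1.3}
summary = split_summary(cross, o3)
```

With the invented inputs, one benchmark falls in the `<=` group and two in the `>` group; passing the CE times as `reference` instead of the -O3 times would give the second comparison.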
  • 43. 42 Interesting Facts • Some combinations of flags/passes cause compile-time or runtime errors • In benchmark 644.nab_s, deactivating either of the following passes causes the LLVM compiler to loop indefinitely (on both the x86 and the Arm HPC machines): • -instcombine (fifth instance): combines redundant instructions • -jump-threading (second instance): finds distinct threads of control flow running through a basic block
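A search that compiles many pass configurations needs some guard against a compiler that never terminates. One common approach, shown here as a hedged sketch rather than what the project actually did, is to run each compilation with a time limit. The helper name `run_with_timeout` is hypothetical, and the Python interpreter stands in for the compiler so the example is self-contained.

```python
import subprocess
import sys


def run_with_timeout(cmd, seconds):
    """Run cmd; return its exit code, or None if it exceeds the time limit."""
    try:
        return subprocess.run(cmd, timeout=seconds).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising.
        return None


# Stand-in commands: the Python interpreter plays the role of the compiler.
hangs = [sys.executable, "-c", "import time; time.sleep(30)"]
finishes = [sys.executable, "-c", "pass"]
hung_result = run_with_timeout(hangs, 1)
ok_result = run_with_timeout(finishes, 30)
```

A configuration whose build returns `None` can then be treated like one that fails to compile and excluded from the search.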
  • 45. 44 Results Discussion • In almost all cases CE found a better configuration of passes • Exception: for 623.xalancbmk_s on ThunderX2, no better configuration was found • Final configurations: • On each machine, different benchmarks converged to different configurations • The final Centriq and ThunderX2 configurations of the same benchmark differ • Cross validation shows the impact of optimizing for a different chip: • In just one case did a benchmark optimized for a different chip perform (slightly) better • Optimizing for Centriq gives better results than optimizing for ThunderX2 with respect to a -O3 baseline
  • 46. 45 (Possible) Next Steps • Repeat the same evaluation on Arm HPC machines using GCC • Experiments are currently running • Identify patterns within the final configurations • Investigate the use of Machine Learning models: • PROS: faster than CE (which takes from a few hours to a few days) • CONS: a huge number of benchmarks is needed to build a model
  • 47. 46 Thank You! Danke! Merci! Gracias! Kiitos! Emanuele Del Sozzo (emanuele.delsozzo@polimi.it) Javier Setoain (javier.setoain@arm.com) Filippo Spiga (filippo.spiga@arm.com)