Arm's business model leads partners to implement the architecture in divergent ways, which in turn causes the same executable to perform differently on different microarchitectures. The difference can be amplified when a binary optimized for one microarchitecture runs on another. This scenario may become increasingly common as Arm’s market grows, since the same program might have to run on different chips at different times, depending on load balancing and availability in a data center. This talk presents the results of an internship project that aimed to establish a baseline for the performance loss in such scenarios, evaluate its severity, and study potential techniques to mitigate it.
The Impact of Compiler Auto-Optimisation on Arm-based HPC Microarchitectures
Emanuele Del Sozzo (1), Javier Setoain (2), Filippo Spiga (2)
emanuele.delsozzo@polimi.it
(1) Politecnico di Milano
(2) Arm Research
NECSTLab – Politecnico di Milano
9/11/2018
3. 1
What is Arm?
• A semiconductor and software design company
• Primary business: CPU design
• Secondary business: software tools, SoCs, GPUs
• Market: (mainly) embedded devices (microcontrollers, mobile phones, tablets, etc.)
• Business model: creation and licensing of its IPs
4. 2
Why Arm?
Master’s thesis topic: scheduling policy for Arm big.LITTLE[1]
• QoS and power efficiency oriented
• Dynamic Voltage and Frequency Scaling (DVFS)
• Task allocation
• Exploitation of all the underlying resources
[1] E. Del Sozzo, G. C. Durelli, E. M. G. Trainiti, A. Miele, M. D. Santambrogio and C. Bolchini, "Workload-aware power optimization strategy for asymmetric multiprocessors," 2016 Design, Automation & Test in Europe Conference & Exhibition (DATE), Dresden, 2016, pp. 531-534.
[Figure: Arm big.LITTLE scheduling policy. Labels: applications App 0, App 1, …, App N; policy (throughput, frequency, thread number, thread mapping); four A15 and four A7 cores; two L2 caches; AXI cache-coherent bus; DDR.]
5. 3
Context Definition
• Divergence in Arm architecture implementation by partners
• Differences in performance for the same executable on different microarchitectures
• The difference is amplified when a binary optimized for a specific chip runs on another one
6. 4
A big.LITTLE Example
[Figure: C / C++ code is compiled into a binary for a big.LITTLE system with a big cluster of four Cortex-A15 cores and a LITTLE cluster of four Cortex-A7 cores.]
10. 8
Internship Goal
• Evaluate the impact of compiler optimizations on HPC microarchitectures
• Develop a methodology to systematically find optimal subsets of optimization flags
• Analyze the performance loss when compiling for one chip using the optimal subset of flags found for another
13. 12
Compiler Optimization
• Choosing the right set of compiler optimizations is complex and depends on[2]:
• Programming language
• Application
• Architecture
• (Some of the) open problems in the compiler optimization field:
1. Selection problem: which optimizations to use and which set of parameters to choose from
2. Phase ordering problem: in which order to apply the optimizations
[2] A. H. Ashouri, W. Killian, G. Palermo and C. Silvano, "A Survey on Compiler Autotuning using Machine Learning", ACM Computing Surveys 2018
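To give a rough sense of why both problems are hard, the sketch below counts the two search spaces under simplifying assumptions that are not taken from the slides: n independent on/off flags for the selection problem, and n distinct passes each applied exactly once for the phase ordering problem. The function names are hypothetical.

```python
from math import factorial


def selection_space(n_flags: int) -> int:
    """Number of on/off configurations for n independent flags (assumption)."""
    return 2 ** n_flags


def ordering_space(n_passes: int) -> int:
    """Number of orderings when each of n distinct passes runs exactly once (assumption)."""
    return factorial(n_passes)
```

Even for a few dozen flags or passes, both counts are far too large for exhaustive evaluation, which is why heuristic search is used instead.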
15. 14
Selection Problem
• Compilers already provide fixed sequences of optimizations:
-O0, -O1, -O2, -O3, -Ofast, -Os, …
• These are not good enough to obtain the best achievable application-specific performance
• Literature approaches to optimize a given application rely on:
• Application characterization techniques[3]
• Optimization space exploration techniques[4]
• Machine learning models[5]
[3] G. Fursin et al., “Milepost GCC: Machine Learning Enabled Self-tuning Compiler”, Int. J. of Parallel Programming 39, 3 (2011), 296–327.
[4] C. Blackmore, O. Ray and K. Eder, “Automatically Tuning the GCC Compiler to Optimize the Performance of Applications Running on Embedded Systems”, arXiv.org
[5] K. D. Cooper, P. J. Schielke and D. Subramanian, “Optimizing for Reduced Code Space Using Genetic Algorithms”, ACM SIGPLAN Notices (1999)
17. 16
Combined Elimination (CE)
An iterative method to find the optimal set of flags for a given application[4]
[4] C. Blackmore, O. Ray and K. Eder, “Automatically Tuning the GCC Compiler to Optimize the Performance of Applications Running on Embedded Systems”, arXiv.org
T_B is the execution time of the target program compiled with configuration B
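The slide's pseudocode did not survive extraction, so the following is only a minimal sketch of Combined Elimination as it is commonly described, not necessarily the exact variant used in the project. It assumes a hypothetical callback `measure(enabled_flags)` that compiles and runs the target program and returns its execution time; the function name and signature are illustrative.

```python
from typing import Callable, FrozenSet, Iterable, Tuple


def combined_elimination(
    flags: Iterable[str],
    measure: Callable[[FrozenSet[str]], float],
) -> Tuple[FrozenSet[str], float]:
    """Greedily switch off flags whose removal reduces the measured time."""
    baseline = frozenset(flags)   # configuration B: start with every flag enabled
    t_base = measure(baseline)    # execution time of B
    candidates = set(baseline)    # flags that may still be switched off
    while candidates:
        # Time with each candidate switched off on its own, and the relative
        # improvement percentage (RIP) with respect to the current baseline.
        times = {f: measure(baseline - {f}) for f in candidates}
        rip = {f: (times[f] - t_base) / t_base * 100.0 for f in candidates}
        harmful = sorted((f for f in candidates if rip[f] < 0), key=lambda f: rip[f])
        if not harmful:
            break  # no single removal helps any more
        # Remove the most harmful flag outright, reusing its measurement.
        first, rest = harmful[0], harmful[1:]
        baseline, t_base = baseline - {first}, times[first]
        candidates.discard(first)
        # Re-test the other harmful flags against the updated baseline.
        for f in rest:
            t_new = measure(baseline - {f})
            if t_new < t_base:
                baseline, t_base = baseline - {f}, t_new
                candidates.discard(f)
    return baseline, t_base
```

Each outer iteration removes at least one flag, so the loop terminates after at most as many iterations as there are flags, even when measurements are noisy.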
19. 18
CE Implementation
• Python implementation of CE
• First evaluation on an x86 desktop machine (quad-core Intel i7-7700 CPU @ 3.60GHz)
• Integration with GCC and LLVM compilers
• Benchmarks: SPEC 2017
20. 19
SPEC 2017 Benchmarks
Name Type Language KLOC
600.perlbench_s Integer C 362
602.gcc_s Integer C 1,304
605.mcf_s Integer C 3
620.omnetpp_s Integer C++ 134
623.xalancbmk_s Integer C++ 520
625.x264_s Integer C 96
631.deepsjeng_s Integer C++ 10
641.leela_s Integer C++ 21
648.exchange2_s Integer Fortran 1
657.xz_s Integer C 33
Name Type Language KLOC
603.bwaves_s Floating-Point Fortran 1
607.cactuBSSN_s Floating-Point C++, C, Fortran 257
619.lbm_s Floating-Point C 1
621.wrf_s Floating-Point Fortran, C 991
627.cam4_s Floating-Point Fortran, C 407
628.pop2_s Floating-Point Fortran, C 338
638.imagick_s Floating-Point C 259
644.nab_s Floating-Point C 24
649.fotonik3d_s Floating-Point Fortran 14
654.roms_s Floating-Point Fortran 210
21. 20
GCC Flags Selection
• GCC provides a command to list the optimization flags:
gcc -OLevelFlag -Q --help=optimizers
• GCC flags selection details:
• Starting from the -O3 optimization flags, CE deactivates one flag at a time
• Only the flags enabled by -O3 are considered
• Parametric flags are never deactivated
• GCC version 5.4.0 on Ubuntu 16.04
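As an illustration of how the candidate flags could be collected, here is a minimal sketch that parses the text printed by `gcc -O3 -Q --help=optimizers`. It assumes that each boolean flag appears on its own line followed by `[enabled]` or `[disabled]`, and that parametric flags (which show a value instead) should be skipped; the helper names are hypothetical and this is not the project's actual script.

```python
import re
import subprocess
from typing import List

# A boolean optimization flag line, e.g. "  -finline-functions   [enabled]".
_ENABLED_LINE = re.compile(r"^\s*(-f[\w-]+)\s+\[enabled\]\s*$")


def enabled_flags(help_text: str) -> List[str]:
    """Return the boolean -f flags reported as [enabled]; parametric flags are ignored."""
    flags = []
    for line in help_text.splitlines():
        match = _ENABLED_LINE.match(line)
        if match:
            flags.append(match.group(1))
    return flags


def negate(flag: str) -> str:
    """Turn -ffoo into -fno-foo, the form used to deactivate a flag."""
    return "-fno-" + flag[len("-f"):]


def query_gcc(level: str = "-O3", gcc: str = "gcc") -> str:
    """Run the listing command from the slide and return its standard output."""
    result = subprocess.run([gcc, level, "-Q", "--help=optimizers"],
                            check=True, capture_output=True, text=True)
    return result.stdout
```

A search driver could then call `enabled_flags(query_gcc())` once and append `negate(flag)` to the compile command for every flag it wants to switch off.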
31. 30
LLVM Passes Selection
• LLVM provides a command to list the optimization passes:
llvm-as < /dev/null | opt -OLevelFlag -disable-output -debug-pass=Arguments
• LLVM passes selection details:
• Starting from the -O3 optimization passes, CE deactivates one transform pass at a time
• If a pass is applied multiple times, CE deactivates one instance of that pass at a time
• Only the passes enabled by -O3 are considered
• LLVM version 5.0.2 on Ubuntu 16.04
• Only C/C++ SPEC benchmarks considered
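Deactivating a single instance of a repeated pass can be sketched as follows. The code assumes that the listing command prints one or more lines starting with `Pass Arguments:` followed by space-separated pass names, as the legacy pass manager does; the helper names are hypothetical, and the sketch does not distinguish transform passes from analysis passes.

```python
from typing import List


def parse_pass_arguments(debug_output: str) -> List[List[str]]:
    """Return one pass list per 'Pass Arguments:' line of the debug output."""
    prefix = "Pass Arguments:"
    return [line[len(prefix):].split()
            for line in debug_output.splitlines()
            if line.startswith(prefix)]


def instance_index(passes: List[str], name: str, occurrence: int) -> int:
    """Index of the occurrence-th (1-based) instance of `name` in `passes`."""
    seen = 0
    for i, current in enumerate(passes):
        if current == name:
            seen += 1
            if seen == occurrence:
                return i
    raise ValueError(f"{name} has fewer than {occurrence} instances")


def without_instance(passes: List[str], name: str, occurrence: int) -> List[str]:
    """Copy of `passes` with only the chosen instance of `name` removed."""
    i = instance_index(passes, name, occurrence)
    return passes[:i] + passes[i + 1:]
```

The resulting list can be passed back to `opt` as explicit pass arguments, so that every other instance of the same pass still runs.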
42. 41
Centriq Optimizations on ThunderX2
Case Count Average Median
Norm. Centriq Optimization <= Norm. -O3 Optimization 12 0.9395 0.9771
Norm. Centriq Optimization > Norm. -O3 Optimization 8 1.0355 1.0101
Norm. Centriq Optimization <= Norm. CE 0 - -
Norm. Centriq Optimization > Norm. CE 20 1.0352 1.0209
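A summary of this shape can be computed as in the sketch below. It rests on an assumption about the data, since the slides do not define the normalization: each benchmark contributes one ratio, the time of the Centriq-optimized build on ThunderX2 divided by the time of the reference build (-O3 or the ThunderX2 CE result). The function name is hypothetical.

```python
from statistics import mean, median
from typing import Dict, List, Optional


def split_summary(ratios: List[float]) -> Dict[str, Dict[str, Optional[float]]]:
    """Split ratios at 1.0 and report count, average and median of each group."""
    groups = {
        "<=": [r for r in ratios if r <= 1.0],  # cross-optimized build is no slower
        ">": [r for r in ratios if r > 1.0],    # cross-optimized build is slower
    }
    return {
        name: {
            "count": len(values),
            "average": mean(values) if values else None,
            "median": median(values) if values else None,
        }
        for name, values in groups.items()
    }
```

An empty group yields `None` for the average and median, which corresponds to the "-" entries in the table.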
43. 42
Interesting Facts
• Some combinations of flags/passes cause compile/runtime errors
• In benchmark 644.nab_s, deactivating either of the following passes causes the LLVM compiler to loop indefinitely (on both x86 and Arm HPC machines):
-instcombine (fifth instance): combines redundant instructions
-jump-threading (second instance): finds distinct threads of control flow running through a basic block
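An automated search has to survive configurations like these. One possible guard, shown here as a sketch rather than the project's actual approach, is to run every compile or benchmark command with a timeout and classify the outcome; the function name and return labels are hypothetical.

```python
import subprocess
from typing import Sequence


def run_guarded(cmd: Sequence[str], timeout_s: float) -> str:
    """Run a command and return 'ok', 'error' (failed to start or non-zero exit) or 'timeout'."""
    try:
        result = subprocess.run(cmd, stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so a looping compiler is stopped.
        return "timeout"
    except OSError:
        return "error"
    return "ok" if result.returncode == 0 else "error"
```

A configuration that ends in `error` or `timeout` can then simply be treated as not improving on the baseline, so the search moves on without it.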
45. 44
Results Discussion
• In almost all cases, CE found a better configuration of passes
• Exception: for 623.xalancbmk_s on ThunderX2, no better configuration was found
• Final configurations:
• For each machine, different benchmarks converged to different configurations
• The Centriq and ThunderX2 final configurations of the same benchmark differ
• Cross-validation shows the impact of optimizing for a different chip:
• In just one case did a benchmark optimized for a different chip perform (slightly) better
• Optimizing for Centriq gives better results than optimizing for ThunderX2 with respect to a -O3 baseline
46. 45
(Possible) Next Steps
• Repeat the same evaluation on Arm HPC machines using GCC
• Experiments are currently running
• Identify patterns within final configurations
• Investigate the use of machine learning models:
• Pros: faster than CE (which takes from a few hours to a few days)
• Cons: a huge number of benchmarks is needed to build a model