The Impact of Compiler Auto-Optimisation on Arm-based HPC Microarchitectures

Emanuele Del Sozzo (1), Javier Setoain (2), Filippo Spiga (2)
emanuele.delsozzo@polimi.it
(1) Politecnico di Milano
(2) Arm Research
NECSTLab – Politecnico di Milano
9/11/2018

<disclaimer>
This presentation describes the results of my
internship project at Arm Research in Cambridge
</disclaimer>

1
What is Arm?
• A semiconductor and software design company
• Primary business: CPU design
• Secondary business: software tools, SoC, GPUs
• Market: (mainly) embedded devices (microcontrollers, mobile phones,
tablets, etc.)
• Business model: creation and licensing of its IPs

2
Why Arm?
Master’s thesis topic: scheduling policy for Arm big.LITTLE[1]
• QoS and power efficiency oriented
• Dynamic Voltage and Frequency Scaling (DVFS)
• Task allocation
• Exploitation of all the underlying resources
[1] E. Del Sozzo, G. C. Durelli, E. M. G. Trainiti, A. Miele, M. D. Santambrogio and C. Bolchini, "Workload-aware power optimization strategy for asymmetric multiprocessors," 2016 Design, Automation & Test in Europe Conference & Exhibition (DATE), Dresden, 2016, pp. 531-534.
[Figure: scheduling policy on ARM big.LITTLE. Labels: frequency, thread number, thread mapping, throughput; applications App 0, App 1, …, App N; four A15 and four A7 cores, two L2 caches, AXI cache-coherent bus, DDR.]

3
Context Definition
• Divergence in Arm architecture implementation by partners
• Differences in performance for the same executable on different microarchitectures
• The difference is amplified when a binary optimized for a specific chip runs on another one

4
A big.LITTLE Example
[Figure: C / C++ code is compiled into a binary for a big.LITTLE system with a big cluster (four Cortex-A15 cores) and a LITTLE cluster (four Cortex-A7 cores).]

7
[Figure: Cortex-A7 pipeline and Cortex-A15 pipeline.
Cortex-A7 labels: Fetch, Decode, Issue, Queue, Integer, Multiply, Floating-Point / NEON, Dual Issue, Load / Store, Writeback.
Cortex-A15 labels: Fetch, Loop Cache, Decode, Rename & Dispatch, Issue Queue, Integer (twice), Multiply, Floating-Point / NEON, Branch, Load, Store, Writeback.]

8
Internship Goal
• Evaluate the impact of compiler optimizations on HPC microarchitectures
• Develop a methodology to systematically find optimal subsets of optimization flags
• Analyze the performance loss when compiling for one chip using the optimal subset of flags found for another

11
Compiler Optimization
• Choosing the right set of compiler optimizations is complex and depends on [2]:
• Programming language
• Application
• Architecture
• (Some of the) open problems in the compiler optimization field:
1. Which optimizations to use, and which set of parameters to choose
2. In which order to apply the optimizations
[2] A. H. Ashouri, W. Killian, G. Palermo and C. Silvano, "A Survey on Compiler Autotuning using Machine Learning", ACM Computing Surveys 2018
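To give a feeling for why both open problems are hard, the short sketch below counts the candidate configurations under two simplifying assumptions that are not taken from the slides: every flag is an independent on/off switch, and every pass is applied exactly once. The function names are illustrative only.

```python
from math import factorial


def selection_space(n_flags: int) -> int:
    """Number of on/off configurations of n_flags independent flags."""
    return 2 ** n_flags


def ordering_space(n_passes: int) -> int:
    """Number of orderings of n_passes distinct passes, each applied once."""
    return factorial(n_passes)


if __name__ == "__main__":
    for n in (10, 20, 30):
        print(n, selection_space(n), ordering_space(n))
```

Both counts grow far too quickly for exhaustive search, which is why heuristic exploration is used instead.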

12
• Open problem 1 (which optimizations to use) is the selection problem
• Open problem 2 (in which order to apply them) is the phase ordering problem

14
Selection Problem
• Compilers already provide fixed sequences of optimizations:
-O0, -O1, -O2, -O3, -Ofast, -Os, …
• These are not good enough to obtain the best achievable application-specific performance
• Literature approaches to optimize a given application rely on:
• Application characterization techniques[3]
• Optimization space exploration techniques[4]
• Machine learning models[5]
[3] G. Fursin et al., "Milepost GCC: Machine Learning Enabled Self-tuning Compiler", Int. J. of Parallel Programming 39, 3 (2011), 296–327.
[4] C. Blackmore, O. Ray and K. Eder, "Automatically Tuning the GCC Compiler to Optimize the Performance of Applications Running on Embedded Systems", arXiv.org
[5] K. D. Cooper, P. J. Schielke and D. Subramanian, "Optimizing for reduced code space using genetic algorithms", ACM SIGPLAN Notices (1999)

16
Combined Elimination (CE)
An iterative method to find the optimal set of flags for a given application [4]
T_B is the execution time of the target program compiled with configuration B
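The slide's pseudocode did not survive in the text, so the following is only a minimal Python sketch of CE as it is commonly described: measure how the execution time changes when each flag is switched off on its own, drop the flags whose removal helps (most helpful first, re-checking each against the updated baseline), and repeat until no removal improves T_B. The names `combined_elimination` and `measure` are hypothetical, and `measure` stands in for "compile with this configuration, run, and return the time"; this is not the authors' implementation.

```python
from typing import Callable, FrozenSet, Iterable, Tuple

Config = FrozenSet[str]


def combined_elimination(
    flags: Iterable[str],
    measure: Callable[[Config], float],
) -> Tuple[Config, float]:
    """Return (enabled flags, execution time) after Combined Elimination."""
    enabled = frozenset(flags)   # baseline B: every candidate flag switched on
    t_b = measure(enabled)       # T_B, execution time of the baseline
    while True:
        # Relative change in time when each flag is switched off on its own.
        rip = {f: (measure(enabled - {f}) - t_b) / t_b for f in enabled}
        # Flags whose removal speeds the program up, most helpful removal first.
        harmful = sorted((f for f in rip if rip[f] < 0), key=rip.get)
        if not harmful:
            return enabled, t_b
        for f in harmful:
            # Re-measure against the current baseline before dropping a flag.
            candidate = enabled - {f}
            t_new = measure(candidate)
            if t_new < t_b:
                enabled, t_b = candidate, t_new
```

With real measurements, `measure` would normally average several runs, since timing noise can otherwise flip the sign of a small change.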

Implementation and Initial Evaluation

18
CE Implementation
• Python implementation of CE
• First evaluation on an x86 desktop machine (quad-core Intel i7-7700 CPU @ 3.60GHz)
• Integration with GCC and LLVM compilers
• Benchmarks: SPEC 2017

19
SPEC 2017 Benchmarks
Name             Type            Language         KLOC
600.perlbench_s  Integer         C                362
602.gcc_s        Integer         C                1,304
605.mcf_s        Integer         C                3
620.omnetpp_s    Integer         C++              134
623.xalancbmk_s  Integer         C++              520
625.x264_s       Integer         C                96
631.deepsjeng_s  Integer         C++              10
641.leela_s      Integer         C++              21
648.exchange2_s  Integer         Fortran          1
657.xz_s         Integer         C                33
603.bwaves_s     Floating-Point  Fortran          1
607.cactuBSSN_s  Floating-Point  C++, C, Fortran  257
619.lbm_s        Floating-Point  C                1
621.wrf_s        Floating-Point  Fortran, C       991
627.cam4_s       Floating-Point  Fortran, C       407
628.pop2_s       Floating-Point  Fortran, C       338
638.imagick_s    Floating-Point  C                259
644.nab_s        Floating-Point  C                24
649.fotonik3d_s  Floating-Point  Fortran          14
654.roms_s       Floating-Point  Fortran          210

20
GCC Flags Selection
• GCC provides a command to list the optimization flags:
gcc -OLevelFlag -Q --help=optimizers
• GCC flags selection details:
• Starting from the -O3 optimization flags, CE deactivates one flag at a time
• Only the flags enabled by -O3 are considered
• Parametric flags are never deactivated
• GCC version 5.4.0 on Ubuntu 16.04
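As an illustration of the selection rules above, the sketch below parses the output of `gcc -O3 -Q --help=optimizers`, keeps only the flags reported as `[enabled]`, skips parametric flags (those containing `=`), and builds the `-fno-` form used to deactivate a flag. The helper names are hypothetical, and the assumption that each relevant line has the form `-fname [enabled]` should be checked against the GCC version in use.

```python
import subprocess
from typing import Iterable, List


def parse_enabled_flags(help_text: str) -> List[str]:
    """Extract non-parametric -f flags marked [enabled] in the help output."""
    flags = []
    for line in help_text.splitlines():
        parts = line.split()
        if (len(parts) == 2 and parts[0].startswith("-f")
                and parts[1] == "[enabled]" and "=" not in parts[0]):
            flags.append(parts[0])
    return flags


def negate(flag: str) -> str:
    """Turn -fname into -fno-name, GCC's convention for disabling a flag."""
    return "-fno-" + flag[len("-f"):]


def compile_command(source: str, disabled: Iterable[str],
                    gcc: str = "gcc", exe: str = "a.out") -> List[str]:
    """Command line for -O3 with the given flags switched off."""
    return [gcc, "-O3", *(negate(f) for f in disabled), source, "-o", exe]


def list_o3_flags(gcc: str = "gcc") -> List[str]:
    """Ask the installed compiler which optimization flags -O3 enables."""
    out = subprocess.run([gcc, "-O3", "-Q", "--help=optimizers"],
                         capture_output=True, text=True, check=True).stdout
    return parse_enabled_flags(out)
```

For example, `compile_command("main.c", ["-ftree-vectorize"])` yields `gcc -O3 -fno-tree-vectorize main.c -o a.out` as an argument list.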

21
GCC Results
[Figure: bar chart of normalized execution time, -O3 Optimization vs. Combined Elimination. Bar labels (CE speedup over -O3), in decreasing order:]
648.exchange2 1.54X, 631.deepsjeng 1.22X, 605.mcf 1.13X, 644.nab 1.08X, 627.cam4 1.08X,
625.x264 1.08X, 654.roms 1.07X, 628.pop2 1.07X, 619.lbm 1.05X, 602.gcc 1.04X,
607.cactuBSSN 1.04X, 621.wrf 1.04X, 641.leela 1.04X, 657.xz 1.02X, 638.imagick 1.02X,
600.perlbench 1.01X, 649.fotonik3d 1.01X, 620.omnetpp 1.01X, 623.xalancbmk 1.01X, 603.bwaves 1X

22
LLVM Architecture
[Figure: LLVM toolchain]
.c source → clang -emit-llvm → .bc / .ll → llvm-link (with libWhatever.a) → .bc / .ll → opt → .bc / .ll → llc → .s → llvm-mc / as → .o → lld / ld (with dynLibWhatever.o) → executable
llvm-as: .ll → .bc
llvm-dis: .bc → .ll
Source: https://github.com/skeru/LLVM-intro/blob/master/img/03/toolchain.pdf
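The toolchain stages can be driven one by one, which is what makes it possible to hand `opt` an explicit pass list. The sketch below only assembles the command lines for a single C file; the function name, file naming scheme and the use of `clang` for the final assemble-and-link step are assumptions for illustration, and the pass syntax follows the legacy `-passname` style shown in this deck (newer LLVM releases use a different `-passes=` syntax).

```python
from pathlib import Path
from typing import List


def toolchain_commands(source: str, passes: List[str],
                       exe: str = "a.out") -> List[List[str]]:
    """Command lines for source -> bitcode -> opt -> assembly -> executable."""
    stem = Path(source).stem
    bitcode = f"{stem}.bc"
    optimized = f"{stem}.opt.bc"
    assembly = f"{stem}.s"
    return [
        ["clang", "-emit-llvm", "-c", source, "-o", bitcode],  # front end only
        ["opt", *passes, bitcode, "-o", optimized],            # chosen passes
        ["llc", optimized, "-o", assembly],                    # code generation
        ["clang", assembly, "-o", exe],                        # assemble and link
    ]


if __name__ == "__main__":
    for cmd in toolchain_commands("foo.c", ["-sroa", "-instcombine"]):
        print(" ".join(cmd))
```

Each inner list can be passed directly to `subprocess.run`.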

24
LLVM Optimization Passes
-tti -tbaa -scoped-noalias -assumption-cache-tracker -targetlibinfo -verify -simplifycfg
-domtree -sroa -early-cse -lower-expect -targetlibinfo -tti -tbaa -scoped-noalias
-assumption-cache-tracker -profile-summary-info -forceattrs -inferattrs -ipsccp -globalopt
-domtree -mem2reg -deadargelim -domtree -basicaa -aa -instcombine -simplifycfg
-basiccg -globals-aa -prune-eh -inline -functionattrs -argpromotion -domtree -sroa -basicaa
-aa -memoryssa -early-cse-memssa -speculative-execution -domtree -basicaa
-aa -lazy-value-info -jump-threading -lazy-value-info -correlated-propagation -simplifycfg
-domtree -basicaa -aa -instcombine -libcalls-shrinkwrap -loops -branch-prob -block-freq
-pgo-memop-opt -domtree -basicaa -aa -tailcallelim -simplifycfg -reassociate -domtree
-loops -loop-simplify -lcssa-verification -lcssa -basicaa -aa -scalar-evolution -loop-rotate
-licm -loop-unswitch -simplifycfg -domtree -basicaa -aa -instcombine -loops -loop-simplify
-lcssa-verification -lcssa -scalar-evolution -indvars -loop-idiom -loop-deletion…

25
Analysis Passes

26
Utility Passes

27
Transform Passes

30
LLVM Passes Selection
• LLVM provides a command to list the optimization passes:
llvm-as < /dev/null | opt -OLevelFlag -disable-output -debug-pass=Arguments
• LLVM passes selection details:
• Starting from the -O3 optimization passes, CE deactivates one transform pass at a time
• If a pass is applied multiple times, CE deactivates one instance of that pass at a time
• Only the passes enabled by -O3 are considered
• LLVM version 5.0.2 on Ubuntu 16.04
• Only C/C++ SPEC benchmarks considered
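The per-instance rule above can be sketched as follows: collect the pass list from the `-debug-pass=Arguments` output, then build a variant in which only the n-th occurrence of a pass is removed. The helper names are hypothetical, and the parser assumes the pass list is printed on lines that start with `Pass Arguments:`.

```python
from typing import List


def parse_pass_arguments(debug_output: str) -> List[str]:
    """Collect passes from every 'Pass Arguments:' line, in order."""
    prefix = "Pass Arguments:"
    passes: List[str] = []
    for line in debug_output.splitlines():
        if line.startswith(prefix):
            passes.extend(line[len(prefix):].split())
    return passes


def drop_instance(passes: List[str], name: str, instance: int) -> List[str]:
    """Copy of passes without the instance-th (1-based) occurrence of name."""
    seen = 0
    result = []
    for p in passes:
        if p == name:
            seen += 1
            if seen == instance:
                continue  # skip only this occurrence
        result.append(p)
    if seen < instance:
        raise ValueError(f"{name} occurs only {seen} time(s)")
    return result
```

Treating every occurrence as a separate candidate keeps the pass order intact while letting the search discover that, say, a late `-instcombine` hurts even though an early one helps.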

31
LLVM Results
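[Figure: bar chart of normalized execution time, -O3 Optimization vs. Combined Elimination. Bar labels (CE speedup over -O3), in decreasing order:]
631.deepsjeng 1.49X, 605.mcf 1.18X, 638.imagick 1.17X, 644.nab 1.12X, 623.xalancbmk 1.08X, 657.xz 1.05X,
625.x264 1.04X, 620.omnetpp 1.02X, 602.gcc 1.02X, 641.leela 1.02X, 600.perlbench 1.01X, 619.lbm 1X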

32
GCC vs LLVM Results
[Figure: bar chart comparing GCC -O3 Optimization, LLVM -O3 Optimization, Combined Elimination w/ GCC and Combined Elimination w/ LLVM on 600.perlbench, 602.gcc, 605.mcf, 619.lbm, 620.omnetpp, 623.xalancbmk, 625.x264, 631.deepsjeng, 638.imagick, 641.leela, 644.nab and 657.xz]
Execution times normalized with respect to GCC -O3

Evaluation on Arm HPC Machines

34
Arm HPC Machines
           Centriq                                    ThunderX2
Producer   Qualcomm                                   Cavium
ISA        Armv8                                      Armv8
OS         Ubuntu 16.04                               Ubuntu 16.04
Cores      47 (1 thread per core)                     56 (4 threads per core)
Coherency  Qualcomm High-Speed Coherent Interconnect  Cavium Coherent Processor Interconnect
Compiler   Arm HPC Compiler 18.4.1                    Arm HPC Compiler 18.4.1

35
Centriq Results
[Figure: bar chart of normalized execution time, -O3 Optimization vs. Combined Elimination. Bar labels (CE speedup over -O3), in decreasing order:]
648.exchange2 1.21X, 602.gcc 1.2X, 625.x264 1.09X, 605.mcf 1.08X, 638.imagick 1.07X,
623.xalancbmk 1.07X, 649.fotonik3d 1.04X, 620.omnetpp 1.03X, 654.roms 1.03X, 641.leela 1.03X,
628.pop2 1.02X, 600.perlbench 1.02X, 644.nab 1.01X, 631.deepsjeng 1.01X, 657.xz 1.01X,
619.lbm 1.01X, 607.cactuBSSN 1.01X, 621.wrf 1.01X, 627.cam4 1.01X, 603.bwaves 1.01X

36
ThunderX2 Results
[Figure: bar chart of normalized execution time, -O3 Optimization vs. Combined Elimination. Bar labels (CE speedup over -O3), in decreasing order:]
631.deepsjeng 1.83X, 648.exchange2 1.15X, 625.x264 1.08X, 605.mcf 1.08X, 641.leela 1.06X,
600.perlbench 1.05X, 638.imagick 1.05X, 644.nab 1.05X, 607.cactuBSSN 1.04X, 619.lbm 1.03X,
654.roms 1.02X, 628.pop2 1.02X, 649.fotonik3d 1.02X, 620.omnetpp 1.02X, 627.cam4 1.01X,
602.gcc 1.01X, 621.wrf 1.01X, 603.bwaves 1.01X, 657.xz 1X, 623.xalancbmk 1X

38
ThunderX2 Optimizations on Centriq
[Figure: bar chart of normalized execution time on Centriq for -O3 Optimization, Combined Elimination and ThunderX2 Optimization. Bar labels, in x-axis order:]
600.perlbench 0.965X, 602.gcc 0.819X, 603.bwaves 0.979X, 605.mcf 1.04X, 607.cactuBSSN 0.998X,
619.lbm 0.976X, 620.omnetpp 0.977X, 621.wrf 0.996X, 623.xalancbmk 1X, 625.x264 1.09X,
627.cam4 1.01X, 628.pop2 1.01X, 631.deepsjeng 1X, 638.imagick 0.626X, 641.leela 1X,
644.nab 0.97X, 648.exchange2 1.02X, 649.fotonik3d 0.996X, 654.roms 0.985X, 657.xz 0.992X

39
ThunderX2 Optimizations on Centriq
Normalized execution time, ThunderX2 Optimization vs. -O3 Optimization:
         ThunderX2 Optimization <= -O3   ThunderX2 Optimization > -O3
Count    7                               13
Average  0.9766                          1.0763
Median   0.9863                          1.0210
Normalized execution time, ThunderX2 Optimization vs. CE:
         ThunderX2 Optimization <= CE    ThunderX2 Optimization > CE
Count    1                               19
Average  0.9961                          1.0986
Median   0.9961                          1.0407
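A summary of this shape can be produced with a few lines of Python. The sketch below is an assumption about the bookkeeping, not the authors' script: it takes the ratio of the cross-optimized time to the reference time for each benchmark, splits the benchmarks on whether that ratio is at most 1, and reports the count, average and median for each group. The function name and the dictionary layout are hypothetical.

```python
from statistics import mean, median
from typing import Dict, List, Optional


def group_stats(values: List[float]) -> Dict[str, Optional[float]]:
    """Count, average and median of a group; None when the group is empty."""
    if not values:
        return {"count": 0, "average": None, "median": None}
    return {"count": len(values), "average": mean(values), "median": median(values)}


def cross_validation_summary(cross: Dict[str, float],
                             reference: Dict[str, float]) -> Dict[str, dict]:
    """Split benchmarks by whether the cross-optimized build is no slower."""
    ratios = [cross[name] / reference[name] for name in cross]
    return {
        "not_slower": group_stats([r for r in ratios if r <= 1.0]),
        "slower": group_stats([r for r in ratios if r > 1.0]),
    }
```

An empty group is reported with `None` rather than a number, which plays the role of a dash in a table.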

40
Centriq Optimizations on ThunderX2
[Figure: bar chart of normalized execution time on ThunderX2 for -O3 Optimization, Combined Elimination and Centriq Optimization. Bar labels, in x-axis order:]
600.perlbench 1.03X, 602.gcc 0.903X, 603.bwaves 1X, 605.mcf 1.02X, 607.cactuBSSN 1.01X,
619.lbm 0.995X, 620.omnetpp 0.969X, 621.wrf 0.999X, 623.xalancbmk 0.986X, 625.x264 1.07X,
627.cam4 1X, 628.pop2 1X, 631.deepsjeng 1.77X, 638.imagick 1.02X, 641.leela 1.01X,
644.nab 1.03X, 648.exchange2 1.1X, 649.fotonik3d 1X, 654.roms 0.895X, 657.xz 0.994X

41
Centriq Optimizations on ThunderX2
Normalized execution time, Centriq Optimization vs. -O3 Optimization:
         Centriq Optimization <= -O3   Centriq Optimization > -O3
Count    12                            8
Average  0.9395                        1.0355
Median   0.9771                        1.0101
Normalized execution time, Centriq Optimization vs. CE:
         Centriq Optimization <= CE    Centriq Optimization > CE
Count    0                             20
Average  -                             1.0352
Median   -                             1.0209

42
Interesting Facts
• Some combinations of flags/passes cause compile-time or runtime errors
• In benchmark 644.nab_s, deactivating either of the following passes causes the LLVM compiler to loop indefinitely (on both x86 and Arm HPC machines):
• -instcombine (fifth instance): combine redundant instructions
• -jump-threading (second instance): find distinct threads of control flow running through a basic block
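Because some configurations fail or never terminate, an automated search needs a guard around every compile and run step. A minimal sketch, assuming each step is launched as a subprocess; the function name `run_guarded` and its three outcome labels are hypothetical:

```python
import subprocess
from typing import List


def run_guarded(cmd: List[str], timeout_s: float) -> str:
    """Run cmd and classify the outcome as 'ok', 'error' or 'timeout'."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True,
                              timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # The child is killed by subprocess.run before the exception is raised.
        return "timeout"
    return "ok" if proc.returncode == 0 else "error"
```

A configuration that ends in `error` or `timeout` can then be discarded instead of stalling the whole exploration.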

Results Discussion and Next Steps

44
Results Discussion
• In almost all cases, CE found a better configuration of passes than -O3
• Exception: for 623.xalancbmk_s on ThunderX2, no better configuration was found
• Final configurations:
• For each machine, different benchmarks converged to different configurations
• The final Centriq and ThunderX2 configurations of the same benchmark differ
• Cross validation shows the impact of optimizing for a different chip:
• In just one case, a benchmark optimized for a different chip performed (slightly) better
• Optimizing for Centriq and running on ThunderX2 gives better results with respect to the -O3 baseline (12 of 20 benchmarks no slower) than the reverse (7 of 20)

45
(Possible) Next Steps
• Repeat the same evaluation on Arm HPC machines using GCC
• Experiments are currently running
• Identify patterns within final configurations
• Investigate the use of machine learning models:
• Pros: faster than CE (which takes from a few hours to a few days)
• Cons: a huge number of benchmarks is needed to build a model

46
Thank You!
Danke!
Merci!
Gracias!
Kiitos!
Emanuele Del Sozzo (emanuele.delsozzo@polimi.it)
Javier Setoain (javier.setoain@arm.com)
Filippo Spiga (filippo.spiga@arm.com)

47
The trademarks featured in this presentation are registered and/or unregistered trademarks of ARM Limited (or its subsidiaries) in the EU and/or elsewhere. All rights reserved. All other marks featured may be trademarks of their respective owners.
