Pragmatic Optimization in Modern Programming - Ordering Optimization Approaches
26 Oct 2015
These slides give an idea of how to look pragmatically at software optimization and how to order optimization approaches from this pragmatic point of view.
OUTLINE
What is optimization?
Pragmatic approach
Optimization trade-offs
Knowledge which is required
Where to get the performance?
Optimization cycle
Top-Down (High-low) approach
Optimization cycle (revised)
Optimization steps overview
How to learn optimization?
Recommended literature
Summary
WHAT IS OPTIMIZATION?
In computing, optimization is the process of modifying a system
to make some aspect of it work more efficiently or use fewer
resources; in particular, the process of transforming a piece of
code to make it more efficient without changing its output.
PRAGMATIC APPROACH
“Programmers waste enormous amounts of time thinking
about, or worrying about, the speed of non-critical parts of
their programs, and these attempts at efficiency actually have
a strong negative impact when debugging and maintenance
are considered. We should forget about small inefficiencies, say
about 97% of the time; premature optimization is the root of
all evil. Yet we should not pass up our opportunities in that
critical 3%.”
- Donald Knuth, Structured Programming with go to Statements
1. Find what to start from (3%)
2. Know when to stop (97%)
OPTIMIZATION TRADE-OFFS
Code portability decreases when we go deeper
Performance portability decreases when we go deeper
The cost of maintenance and extensibility increases
when we go deeper
Optimizations are often not reusable
Optimizations become obsolete very quickly
...but performance is still a crucial requirement for most
applications.
KNOWLEDGE THAT IS REQUIRED
1. The code
The problem it solves
The algorithm it implements
The algorithmic complexity
2. The compiler
Compilation trajectory
Compiler's capabilities and obstacles
3. The platform
Architecture capabilities
Instruction Set Architecture
Micro-architecture specifics
WHERE TO GET THE PERFORMANCE?
High level: Programmer
Middle level: Compiler
Low level: Hardware
STEP #1: UNDERSTAND THE CODE
Different people think differently
you'll need some time to get used to the code
Understand data flow
input/output parameters
data dependencies
Identify performance limiters
Time
Profile
Collect metrics
e.g. CPI, power consumption
STEP #2: USE APPROPRIATE ALGORITHM
Consider and lower big O complexity
Choose data structures wisely
Look for optimized libraries
Find opportunities to scalarize & parallelize
STEP #2: USE APPROPRIATE ALGORITHM
Compilers are not aware of the semantics of the code; taking
this into account, focus on the algorithmic aspect first.
Decrease big-O complexity
Use optimized libraries for subroutines
Restructure the code to use fewer resources
Split the problem into subtasks and organize them wisely
Parallelize
What if you need to sort 100 MB of numerical data...
WHAT SORTING ALGORITHM WOULD YOU CHOOSE?
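For fixed-width numeric keys, a non-comparison sort can beat the O(n log n) bound of a general comparison sort. A minimal LSD radix sort sketch for 32-bit keys (an illustration, not from the slides):

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* LSD radix sort for 32-bit unsigned keys: four O(n) passes over
 * the data, versus O(n log n) comparisons for qsort. */
void radix_sort_u32(uint32_t *a, size_t n)
{
    uint32_t *tmp = malloc(n * sizeof *tmp);
    if (!tmp)
        return;
    for (int shift = 0; shift < 32; shift += 8) {
        size_t count[257] = {0};
        /* Histogram of the current byte, offset by one slot. */
        for (size_t i = 0; i < n; i++)
            count[((a[i] >> shift) & 0xFF) + 1]++;
        /* Prefix sums turn counts into starting positions. */
        for (int b = 0; b < 256; b++)
            count[b + 1] += count[b];
        /* Stable scatter by the current byte. */
        for (size_t i = 0; i < n; i++)
            tmp[count[(a[i] >> shift) & 0xFF]++] = a[i];
        memcpy(a, tmp, n * sizeof *a);
    }
    free(tmp);
}
```

Each pass is a sequential sweep over the data, which also plays well with the memory-access advice of the next step.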
STEP #3: OPTIMIZE MEMORY ACCESSES
You'll be surprised how many algorithms are memory bound!
Optimization for memory usually involves:
Data restructuring
to load only data that is really needed for computations
Data packing
to shrink the data in size
Loop transformations
to walk through the data in a more efficient way,
to increase temporal & spatial locality,
to perform cache-aware optimization
STEP #3: OPTIMIZE MEMORY ACCESSES
Compilers are quite good at local optimizations, such as
loop body transformations,
local function inlining,
arithmetic expression simplification,
so help the compiler rather than try to outfox it.
Work cohesively with it on
enabling auto-vectorization,
optimizing critical loops,
vectorizing.
STEP #3: OPTIMIZE MEMORY ACCESSES
for (int j = 0; j < height; j++)
for (int i = 0; i < width; i++)
if (img[j * width + i] > 0)
count++;
for (int i = 0; i < width; i++)
for (int j = 0; j < height; j++)
if (img[j * width + i] > 0)
count++;
WHICH IS MORE EFFICIENT ON A CONVENTIONAL CPU?
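In C, `img` is laid out row-major, so the first loop streams through memory sequentially while the second strides by `width` bytes per access. A side-by-side sketch (function names are mine, not from the slides):

```c
/* Row-major traversal: consecutive i touches consecutive
 * addresses, so each loaded cache line is fully consumed. */
long count_row_major(const unsigned char *img, int width, int height)
{
    long count = 0;
    for (int j = 0; j < height; j++)
        for (int i = 0; i < width; i++)
            if (img[j * width + i] > 0)
                count++;
    return count;
}

/* Column-major traversal: a stride of `width` bytes between
 * accesses, approaching one cache miss per element once the
 * stride exceeds a cache line. */
long count_col_major(const unsigned char *img, int width, int height)
{
    long count = 0;
    for (int i = 0; i < width; i++)
        for (int j = 0; j < height; j++)
            if (img[j * width + i] > 0)
                count++;
    return count;
}
```

Both produce identical counts; on large images the row-major version is typically several times faster.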
STEP #4: MINIMIZE NUMBER OF OPERATIONS
Reducing the number of operations in a program
doesn't necessarily decrease its runtime,
but it's a good heuristic nonetheless.
The compiler usually helps a lot here:
Machine-independent optimizations:
Common subexpression elimination
Constant propagation
Redundancy elimination
...
Machine-dependent optimizations:
Register allocation
Instruction selection
Instruction scheduling
Peephole optimization
...
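Common subexpression elimination in miniature (hypothetical functions for illustration):

```c
/* Before: the subexpression (a * b) is computed twice. */
double poly_before(double a, double b, double c, double d)
{
    return a * b + c + a * b * d;
}

/* After: the common subexpression is computed once and reused.
 * Compilers do this automatically at -O1 and above, but only
 * when they can prove the expression has no side effects. */
double poly_after(double a, double b, double c, double d)
{
    double ab = a * b;
    return ab + c + ab * d;
}
```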
STEP #4: MINIMIZE NUMBER OF OPERATIONS
float pows(float a, float b, float c, float d, float e, float f, float x)
{
    return a * powf(x, 5.f) +
           b * powf(x, 4.f) +
           c * powf(x, 3.f) +
           d * powf(x, 2.f) +
           e * x + f;
}
gcc -march=armv7-a -mfpu=neon-vfpv4 -mthumb -mfloat-abi=softfp -O3 1.c -S -o 1.s
STEP #4: MINIMIZE NUMBER OF OPERATIONS
float horner(float a, float b, float c, float d, float e, float f, float x)
{
return ((((a * x + b) * x + c) * x + d) * x + e) * x + f;
}
horner:
flds s15, [sp, #8]
fmsr s11, r0
fmsr s12, r1
flds s14, [sp]
vfma.f32 s12, s11, s15
fmsr s11, r2
flds s13, [sp, #4]
vfma.f32 s11, s12, s15
fcpys s12, s11
fmsr s11, r3
vfma.f32 s11, s12, s15
vfma.f32 s14, s11, s15
vfma.f32 s13, s14, s15
fmrs r0, s13
bx lr
STEP #4: MINIMIZE NUMBER OF OPERATIONS
Unfortunately, sometimes a compiler fails at some optimization
steps (e.g. register allocation, scalarization) and harms
performance by introducing redundant operations.
Starting from this optimization step, it is worth looking at the
assembly code to check whether the compiler is actually
automating a particular optimization.
STEP #5: SHRINK THE CRITICAL PATH
The critical path is the longest sequence of operations in a code
block that must be completed in order, usually caused
by dependencies between steps or operations.
The critical path of a code block is hardly deducible from
high-level code and requires assembly inspection.
Knowledge about architecture capabilities is required to
estimate the critical path more precisely.
Some profilers are able to do critical path analysis.
The term can also refer to the longest sequence of
dependent steps in a pipeline that limits its parallelization.
A control-flow diagram is used to find the critical path.
STEP #5: SHRINK THE CRITICAL PATH
Let's look at the critical path of the following code block.
const uint8_t* p0 = src.ptr(row0);
const uint8_t* p1 = src.ptr(row1);
uint8_t* dptr = dst.ptr(row);
for (int col = 0; col < cols; ++col)
{
dptr[col] = (p0[col*2]+p0[col*2+1]
+ p1[col*2]+p1[col*2+1]+2)>>2;
}
WHAT IS THE CRITICAL PATH OF THIS CODE?
STEP #5: SHRINK THE CRITICAL PATH
But the compiler reorders instructions,
since integer math is associative: 6?
And if we assume that the hardware schedules
1 arithmetic and 1 memory operation per clock:
AND BACK TO 8 AGAIN
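A standard way to shrink such a dependency chain is to split the reduction across independent accumulators. A sketch (assuming an out-of-order core; not from the slides):

```c
#include <stddef.h>

/* Single accumulator: every addition depends on the previous
 * one, so the critical path is n dependent adds. */
long sum_serial(const int *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Two independent accumulators: the two chains can execute in
 * parallel, roughly halving the dependency chain. */
long sum_two_acc(const int *a, size_t n)
{
    long s0 = 0, s1 = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += a[i];
        s1 += a[i + 1];
    }
    if (i < n)
        s0 += a[i];  /* odd-length tail */
    return s0 + s1;
}
```

For floating point, this reassociation changes rounding, which is why compilers only do it under flags like -ffast-math.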
STEP #6: DO HW-SPECIFIC OPTIMIZATION
It requires a comprehensive understanding of the target HW,
which usually goes beyond the compiler's abilities:
Using special hardware capabilities
Overcoming micro-architecture weaknesses
Using instructions which are specific to concrete HW
Balancing the usage of different instruction types
A classical example here is the question of recomputing a
temporary value vs. fetching it from memory.
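The recompute-vs-fetch trade-off in miniature, using population count as a stand-in (hypothetical example, not from the slides):

```c
#include <stdint.h>

/* Recompute: a handful of ALU operations, no memory traffic. */
int popcount_compute(uint8_t x)
{
    int c = 0;
    while (x) {
        x &= (uint8_t)(x - 1);  /* clear the lowest set bit */
        c++;
    }
    return c;
}

/* Fetch: one load from a 256-byte table. This wins only while
 * the table stays in cache; under cache pressure from the rest
 * of the code, recomputing is often cheaper. */
uint8_t popcount_table[256];

void popcount_init_table(void)
{
    for (int i = 0; i < 256; i++)
        popcount_table[i] = (uint8_t)popcount_compute((uint8_t)i);
}
```

Which side wins is exactly the kind of question only measurement on the target HW can answer.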
STEP #6: DO HW-SPECIFIC OPTIMIZATION
Modern hardware is quite advanced:
deep pipelines,
out-of-order execution,
sophisticated branch prediction,
multi-level memory hierarchies,
processor specialization,
so utilize the unique properties of the hardware.
Peephole optimization is not as important
as it used to be 10 years ago.
STEP #7: DIVE INTO ASSEMBLY
Reading assembly is a must-have skill for checking the compiler,
but it is rarely used to write low-level code.
Raw assembly makes sense to:
Overcome compiler bugs & optimization limitations
addition of redundant instructions
suboptimal register allocation
Use specific hardware features
which are not exposed in a higher-level ISA
Keep in mind that:
Writing assembly is the least portable optimization
Inline assembly limits compiler optimizations
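A minimal GNU extended asm sketch (x86-64, GCC/Clang syntax; my example, not from the slides) — the constraint strings are what let the compiler keep optimizing around the asm:

```c
#include <stdint.h>

/* Count trailing zeros via the x86 `bsf` instruction. The
 * "=r"/"r" constraints let the compiler choose registers, and
 * the "cc" clobber tells it the flags are overwritten. */
uint64_t ctz64(uint64_t x)
{
    uint64_t r;
    __asm__ ("bsfq %1, %0" : "=r"(r) : "r"(x) : "cc");
    return r;  /* like the raw instruction, undefined for x == 0 */
}
```

Intrinsics (here, `__builtin_ctzll`) usually emit the same instruction with none of the portability cost, so raw asm should stay the last resort this step describes.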
HOW TO LEARN OPTIMIZATION?
Optimization is a craft rather than a science.
Practice more
Do not make practical knowledge too theoretical.
Look at what other people do
Find real use cases of different optimization
approaches and techniques.
Dig into an architecture
HW evolves rapidly, so devices become obsolete in a wink.
Comprehensive knowledge helps you see ahead.
RECOMMENDED LITERATURE
Computer Architecture: A Quantitative Approach, Fifth Edition, by John L. Hennessy and David A. Patterson
The Mature Optimization Handbook, by Carlos Bueno
Is Parallel Programming Hard, And, If So, What Can You Do About It?, by Paul E. McKenney
Engineering a Compiler, by Keith Cooper and Linda Torczon
SUMMARY
Practice, look at what others do, and dig into an architecture.
The main task of an optimizer is finding the critical part.
An optimizer's mastery is knowing where to stop.
Knowledge about the code, the compiler, and the platform
is a must-have.
Optimization is a measure-analyze-optimize-check cycle.
Stick to the high-to-low approach:
get the performance from algorithmic and data structure
choices first,
... ensure memory access patterns next,
... then go deeper.