Code Optimization Techniques
“Code Less… Create More..!!”
By: Hardik Devani
Rockwell Collins, IDC
May 2011
Compiler Optimization Techniques
This presentation describes some of the optimizations the compiler, or the programmer, can use to improve the performance of code.
So….
Compiler/Programmer Optimization Techniques.
“The more information a programmer gives the compiler, the better job it can do at optimizing the code.”
Agenda
 Introduction
 CPU & Memory Organization
 Pipelining Architectures
 Optimization criteria
 16 Compiler Optimization Techniques
 Discussions
 Take away
What is not included here ??
 How a compiler used with the proper optimization
flags can reduce the runtime of a code.
 http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
 Machine Code Optimizations
 Implementing Optimization in the Hardware
Introduction
 Modern CPUs are super-scalar and super-pipelined.
 It is not only the clock speed of a CPU that
defines how fast it can execute code.
 CPUs now have clock speeds approaching
1000 MHz while the bus that moves data between
memory and the CPU may only run at 100 MHz.
Contd..
 In Order / Out of Order Instruction Execution
 By breaking complex operations into smaller
stages the CPU designer can remove bottlenecks
in the pipeline and speed up throughput.
 If a CPU has multiple floating-point units and
integer units, the problem becomes keeping them
all busy to achieve maximum efficiency.
CPU & Memory Organization
Cache Memory
 The cache is a small amount of very fast memory that is
used to store frequently used data or instructions.
 Level 1 cache is small, 8 KB-128 KB, and runs at the same speed as
the CPU.
 Level 2 cache is larger, 128 KB-8 MB, and generally runs slower,
maybe at half the speed of the CPU.
 Also see “Role of Cache in pipelined architecture”
What can we think of for
Optimization??
 Think in binary powers
 A shift is cheaper than a multiply or divide.
 Think independent
 Independent instruction execution suits multi-processing
and multiple-ALU capabilities.
 Keep pipelined architecture in mind.
 It’s not a “MUST” to optimize..
 Mind it..!!
Factors affecting optimization
 The machine
 GCC Target ?
 Embedded Sys
 Debugging Build
 Special Purpose
 Architecture of the CPU
 No. Of CPU registers
 RISC / CISC
 Pipelines (Pipeline Stalls)
 No. Of Functional Units
 Machine Architecture
 Cache size
Optimizing Techniques
1. Copy Propagation
2. Constant Folding
3. Dead Code Removal
4. Strength Reduction
5. Induction Variable Simplification
6. Function In-Lining
7. Loop invariant Conditionals
8. Variable Renaming
9. Common Sub-Expression Elimination
10. Loop Invariant Code Motion
11. Loop Fusion
12. Pushing Loops Inside Subroutine Calls
13. Loop Index Dependent Conditionals
14. Loop Unrolling
15. Loop Stride Size
16. Floating-Point Optimizations
Copy Propagation
 Consider
X=Y
Z=1.0+X
 The compiler might change this to
X=Y
Z=1.0+Y
 Modern CPUs can sometimes execute a number of
instructions at the same time. (Multi-Core / Multi-ALU)
 Remove as many dependencies as possible so that
instructions can be executed independently.
 Reuse outer scope variables.
Constant Folding
 Consider
 const int J=100;
 const int K=200;
 int M= J+K;
 The compiler might compute M at compile time as a const rather than at
run time.
 The programmer can help the compiler make this optimization by declaring
any variables that never change as constants, using either the keyword
const (C/C++) or PARAMETER (FORTRAN).
 This also helps prevent the programmer from accidentally changing the
value of a variable that should be constant.
 Defining scope and access specifiers likewise gives the compiler more information to optimize with.
Dead Code Removal
 The compiler will search out pieces of code that have
no effect and remove them.
 Examples of dead code are functions that are never
called and variables that are calculated but never
used again.
 When the compiler compiles a piece of code it makes
a number of passes; on each pass it tries to optimize
the code, and on some passes it may remove code it
created previously because it is no longer needed.
 When a piece of code is being developed or modified,
the programmer often adds code for testing purposes
that should be removed when the testing is complete.
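A minimal C sketch of code a compiler would treat as dead (function and variable names are hypothetical):

int dead_code_demo(int a)
{
    int unused = a * a;   /* computed but never read again: a dead store */
    if (0) {
        a = a + 1;        /* unreachable: the condition is always false */
    }
    return a;             /* the dead store and the if-block can both be removed */
}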
Strength Reduction
 Replace complex, difficult, or expensive operations with simpler ones.
 e.g. replacing division by a constant with multiplication by its reciprocal.
 Consider
Y=X**2
 Raising a number to an exponent:
 first X is converted to a logarithm, then multiplied by two, then converted
back.
 Y=X*X is much more efficient.
 Multiplication is easier than raising a number to a power.
 Consider Y=X**2.0
 This cannot be optimized as in the case above, since
(float) 2.0 != 2 (integer) in general.
 This is an optimization the programmer can use when raising a number to the power of a
small integer.
 Similarly, Y=2*X becomes Y=X+X (see also other implied inferences)
if the CPU can perform the addition faster than the multiplication.
 Some CPUs are much slower at performing division than at
addition/subtraction/multiplication, so divisions should be avoided where possible.
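A minimal C sketch of these reductions (variable names hypothetical; note that multiplying by a reciprocal can change the last bit of the result):

#include <math.h>

double strength_reduce(double x)
{
    double y;
    y = pow(x, 2.0);            /* library call: expensive exponentiation machinery */
    y = x * x;                  /* strength-reduced: a single multiply */

    const double inv9 = 1.0 / 9.0;  /* reciprocal folded at compile time */
    y = x * inv9;               /* replaces y = x / 9.0; rounding may differ */
    return y;
}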
Induction Variable
Simplification
 Consider
for(i=1; i<=N; i++)
{
K=i*4+M;
C=2*A[K];
.....
}
 Optimized as
K=M;
for(i=1; i<=N; i++)
{
K=K+4;
C=2*A[K];
.......
}
 This facilitates better memory access patterns: on each iteration through the
loop the CPU can predict which element of the array it will require.
Function In-lining & “inline”
directive
 Consider
float kinetic_energy(float m, float v)
{
return 0.5*m*v*v;
}
for(i=0; i<=N ; i++)
{
ke[i]=kinetic_energy(mass[i], velocity[i]);
}
 In-lined, the call is replaced by the function body:
for(i=0; i<=N ; i++)
{
ke[i]=0.5*mass[i]*velocity[i]*velocity[i];
}
 Calling a function is associated with overhead (stack
frame setup, saving/clearing registers, etc.).
 Huge functions being “inlined” => code bloat => it becomes difficult to
cache the large *.exe file.
Loop Invariant Conditionals
 Consider
DO I=1,K
IF ( N.EQ.0 ) THEN
A(I)=A(I)+B(I)*C
ELSE
A(I)=0
ENDIF
ENDDO
 Optimized as
IF ( N.EQ.0 ) THEN
DO I=1,K
A(I)=A(I)+B(I)*C
ENDDO
ELSE
DO I=1,K
A(I)=0
ENDDO
ENDIF
 Condition checking is now done only once, outside the loops, so further
optimizations such as LOOP UNROLLING can be applied.
Variable Renaming
 Consider
X=Y*Z
Q=R+X+X
X=A+B
 Optimized as
X_=Y*Z
Q=R + X_ + X_
X= A+B
 The third line of code cannot be executed until the second line
has completed. This inhibits the flexibility of the CPU. The
compiler might avoid this by creating a temporary variable, X_.
 The second and third lines are now independent of each other
and can be executed in any order, and possibly at the same time.
Common Sub-Expression
Elimination
 Consider
A=C*(F+G)
D=(F+G)/N
 Optimized as
temp=F+G
A=C*temp
D=temp/N
 If the expression gets more complicated it can be
more difficult for the compiler to perform, and so it is up to the
programmer to implement. As with most optimizations
there is some penalty for the increase in speed: in this
case the penalty for the computer is storing an extra
variable, and for the programmer it is readability.
Loop Invariant Code Motion
 Consider
for(i=0; i<=N; i++)
{
A[i]=F[i] + C*D;
E=G[K];
}
 Optimized as
temp=C*D;
for(i=0; i<=N; i++)
{
A[i]=F[i] + temp;
}
E=G[K];
 For each iteration of the loop the values of C*D, E and G[K] do
not change.
 C*D is only calculated once rather than N times, and E is only
assigned once.
 The compiler may not be able to move a complicated expression out
of a loop, and it will be up to the programmer to perform this
optimization.
Loop Fusion
 Consider
DO i=1,n
x(i) = a(i) + b(i)
ENDDO
DO i=1,n
y(i) = a(i) * c(i)
ENDDO
 Optimized as
DO i=1,n
x(i) = a(i) + b(i)
y(i) = a(i) * c(i)
ENDDO
 The fused loop has better data reuse: a(i) only has to be loaded
from memory once. The CPU also has a greater chance to
perform some of the operations in parallel.
Pushing Loops inside Subroutine
Calls
 Consider
DO i=1,n
CALL add(x(i),y(i),z(i))
ENDDO
SUBROUTINE add(x,y,z)
REAL*8 x,y,z
z = x + y
END
 Optimized as
CALL add(x,y,z,n)
SUBROUTINE add(x,y,z,n)
REAL*8 x(n),y(n),z(n)
INTEGER i
DO i=1,n
z(i)=x(i)+y(i)
ENDDO
END
 The subroutine add was called n times, adding the overhead of n subroutine calls to the run
time of the code. After the loop is moved inside the body of the subroutine, there is only
one subroutine call.
Loop Index Dependent
Conditionals
 Consider
DO I=1,N
DO J=1,N
IF( J.LT.I ) THEN
A(J,I)=A(J,I) + B(J,I)*C
ELSE
A(J,I)=0.0
ENDIF
ENDDO
ENDDO
 Optimized as
DO I=1,N
DO J=1,I-1
A(J,I)=A(J,I) + B(J,I)*C
ENDDO
DO J=I,N
A(J,I)=0.0
ENDDO
ENDDO
 The original loop has an if-block that must be tested on each iteration of the
inner loop; rewriting it as above removes the if-block entirely.
Loop unrolling
 Consider
DO I=1,N
A(I)=B(I)+C(I)
ENDDO
 There are two loads for B(I) and C(I), one store for A(I), one
addition B(I)+C(I), another addition for I=I+1 and a test I<=N: a
total of six operations per iteration.
 Consider
DO I=1,N,2
A(I)=B(I)+C(I)
A(I+1)=B(I+1)+C(I+1)
ENDDO
 Four loads, two stores, three additions and one test, for a total of
ten.
 The second form is more efficient: ten operations for
effectively two iterations, against six per iteration (twelve for two)
in the first loop. This is called loop unrolling.
Contd..
 If the loop is unrolled further it becomes even more efficient. It
should also be noted that if the CPU can perform more than one
operation at a time the payoff may be even greater.
 Originally programmers had to perform this
optimization themselves, but now it is best left to
the compiler.
 What happens when the iteration count is not
a multiple of the unroll factor??
The compiler adds a pre-conditioning loop to handle the leftover
iterations
NUM=MOD(N,4)
DO I=1,NUM
A(I)=B(I)+C(I)
ENDDO
DO I=NUM+1,N,4
A(I)=B(I)+C(I)
A(I+1)=B(I+1)+C(I+1)
A(I+2)=B(I+2)+C(I+2)
A(I+3)=B(I+3)+C(I+3)
ENDDO
 The second loop has been unrolled four times; the first loop
computes the “odd” iterations. MOD(N,4) returns the
remainder of dividing N by 4.
Loop Stride Size
 Arrays are stored in sequential parts of main
memory
 When the CPU requests an element from an
array it not only grabs the specified element but also a
number of elements adjacent to it. It stores these
extra elements in the cache, a small
area of very fast memory.
 In the slide’s diagram of an array in memory, if the CPU
requests A1 it will grab A1 through A9 and store them in the
cache.
Contd..
 Hopefully the next variable it requests will be one
of these elements stored in the cache saving time
by not accessing the slower main memory.
(Cache Hit)
 If not, all the elements stored in the cache are
useless and we have to make the long trip to
main memory. (Cache Miss)
 The elements are loaded into what is called a
cache line, the length of the cache line defines
how many elements can be stored. The cache
line size differs from CPU to CPU.
 Using the cache in the most efficient manner is
the objective.
Contd..
 The size of the step that a loop takes through an
array is called the stride.
 Unit Stride (max Cache Hits => Most Efficient
Utilization)
 Stride of Two
 The layout of multi-dimensional arrays in
memory differs depending on the language.
 Viz. implementation in FORTRAN vs. C:
o In FORTRAN unit stride is achieved by iterating over the left-
most subscript first.
o To achieve unit stride for a multi-dimensional array in C you
should iterate over the rightmost subscript.
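A one-dimensional C sketch of the two access patterns (array name and size hypothetical, contents assumed initialized):

double a[1024], sum = 0.0;
int i, n = 1024;
for(i=0; i<n; i++)  sum += a[i];   /* unit stride: every element of each cache line is used */
for(i=0; i<n; i+=2) sum += a[i];   /* stride of two: half of each cache line is wasted */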
Floating-Point Optimizations
 Compilers follow an IEEE floating-point standard
(viz. IEEE 754) to perform floating-point
operations.
 These rules can be relaxed: by using a less
restrictive set of rules the compiler may be
able to speed up the computation.
 The relaxed rules rarely give results that
differ from those of the non-optimized code,
and where they do the difference is usually insignificant.
 Optimized results should nevertheless be checked
against the non-optimized ones.
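A small C sketch of why relaxed rules can change results: reassociation, which relaxed modes such as GCC’s -ffast-math permit, is not value-preserving in floating point (values chosen for illustration).

float a = 1e20f, b = -1e20f, c = 1.0f;
float strict  = (a + b) + c;   /* IEEE evaluation order: 0.0f + 1.0f = 1.0f */
float relaxed = a + (b + c);   /* reassociated: c is absorbed into b, result 0.0f */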
Take Away
 Reorder operations to allow multiple computations
to happen in parallel, either at the instruction,
memory, or thread level.
 Code and data that are accessed closely together
in time should be placed close together in memory
to increase spatial locality of reference.
(Temporal and Spatial)
 Accesses to memory are increasingly more
expensive at each level of the memory hierarchy,
so place the most commonly used items in
registers first, then caches, then main memory.
Notes
 The more information a programmer gives the
compiler the better job it can do at optimizing the
code.
 “Code Less.. Create More..!! “
 The more precise the information the compiler
has, the better it can employ any or all of these
optimization techniques.
Glossary
 Super Scalar :
microprocessor architectures that enable more than
one instruction to be executed per clock cycle.
Nearly all modern microprocessors, including the
Pentium, PowerPC, Alpha, and SPARC
microprocessors are superscalar.
Bibliography
 http://en.wikipedia.org/wiki/Compiler_optimization
Thank you..!!
Replace Multiply by Shift
 A := A * 4;
 Can be replaced by a 2-bit left shift (signed/unsigned)
 But must worry about overflow if the language does
 A := A / 4;
 If unsigned, can be replaced with a shift right
 But arithmetic shift right on signed values is a well-known problem:
it rounds toward negative infinity rather than toward zero
 The language may allow it anyway (traditional C)
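A minimal C sketch (values hypothetical):

unsigned int a = 40;
a = a << 2;        /* a * 4: exact for unsigned values */
a = a >> 2;        /* a / 4: safe because a is unsigned */

int b = -7;
int q1 = b / 4;    /* -1: C division truncates toward zero */
int q2 = b >> 2;   /* -2 on two’s-complement machines: rounds toward negative infinity */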
Addition chains for multiplication
 If multiply is very slow (or on a machine with no
multiply instruction like the original SPARC),
decomposing a constant operand into sums and
differences of powers of two can be effective:
x * 125 = x*128 - x*4 + x
 Two shifts, one subtract and one add, which may
be faster than one multiply.
Note the similarity with the efficient (binary) exponentiation method.
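The same decomposition as a C sketch (helper name invented; assumes no overflow):

int times125(int x)
{
    return (x << 7) - (x << 2) + x;   /* 128x - 4x + x = 125x */
}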
Implementation of Multi
Dimensional Arrays
 FORTRAN (Unit Stride)
DO J=1,N
DO I=1,N
A(I,J)= B(I,J)*C(I,J)
ENDDO
ENDDO
 FORTRAN (Stride N)
DO J=1,N
DO I=1,N
A(J,I)= B(J,I)*C(J,I)
ENDDO
ENDDO
 C (Unit Stride)
for(i=0; i<=n; i++)
for(j=0; j<=n; j++)
a[i][j]=b[i][j]*c[i][j];
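For comparison, a stride-N C form (a sketch in the slide’s style): the inner loop varies the leftmost subscript, so successive accesses are a whole row apart in memory.

for(j=0; j<=n; j++)
for(i=0; i<=n; i++)
a[i][j]=b[i][j]*c[i][j];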
Instruction Pipelining in CPU
 Pipelining improves system performance in terms of
throughput.
 Pipelined organization requires sophisticated
compilation techniques.
 Performance :
 The potential increase in performance resulting from
pipelining is proportional to the number of pipeline
stages.
 However, this increase would be achieved only if all
pipeline stages require the same time to complete, and
there is no interruption throughout program execution.
 Unfortunately, this is not true.
Use the Idea of Pipelining in a
Computer
[Figure 1. Basic idea of instruction pipelining: (a) sequential execution of instructions I1-I3, each as fetch (F) then execute (E); (b) hardware organization, with an instruction fetch unit feeding an execution unit through interstage buffer B1; (c) pipelined execution, where the fetch of each instruction overlaps the execution of the previous one across clock cycles 1-4.]
Contd..
[Figure 8.2. A 4-stage pipeline: (a) instruction execution divided into four steps (F: fetch instruction; D: decode instruction and fetch operands; E: execute operation; W: write results); (b) hardware organization, with interstage buffers B1, B2, B3 between the Fetch, Decode, Execute and Write stages. Instructions I1-I4 progress through the stages in overlapping clock cycles.]
Pipeline Performance
[Figure 8.3. Effect of an execution operation taking more than one clock cycle: instructions I1-I5 shown over clock cycles 1-9; when one instruction’s execute stage takes several cycles, the later stages of the following instructions stall.]
Role of Cache Memory
 Each pipeline stage is expected to complete in one clock cycle.
 The clock period should be long enough to let the slowest
pipeline stage complete.
 Faster stages can only wait for the slowest one to complete.
 Since main memory is very slow compared to execution, if
each instruction needed to be fetched from main memory,
pipelining would be almost useless.
 Fortunately, we have cache.
[Figure 8.7. Operand forwarding in a pipelined processor: (b) position of the source and result registers in the processor pipeline, with the forwarding path shown.]
Optimization Techniques
 Common themes
 Optimize the common case
 Avoid redundancy
 Less code
 Fewer jumps by using straight line code, also called
branch-free code
 Locality
 Exploit the memory hierarchy
 Parallelize
 More precise information is better
 Runtime metrics can help
 Strength reduction
Specific techniques
 Loop optimizations
 Induction variable analysis
 Loop fission or loop distribution
 Loop Fusion or Loop Combining
 Loop inversion
 Loop interchange
 Loop-Invariant Code Motion
 Loop nest Optimization
 Loop Reversal
 Loop unrolling
 Loop Splitting
 Loop unswitching
 Software Pipelining
 Automatic Parallelization
 Data Flow Optimizations
 Common sub-expression elimination
 Constant Folding and Propagation
 Induction Variable recognition and elimination
 Alias Classification and pointer analysis
 Dead store elimination
 SSA Based Optimizations
 Global Value Numbering
 Sparse conditional constant propagation
 Code Generator Optimizations
 Register allocation
 Instruction selection
 Instruction scheduling
 Rematerialization
 Code Factoring
 Trampolines
 Reordering Computations
 Functional Language Optimizations
 Removing Recursion
 Deforestation (data structure fusion)
 Others
 Bounds checking elimination
 Branch offset optimization
 Code block reordering
 Dead code elimination
 Factoring out invariants
 Inline/Macro Expansion
 Jump threading
Inline expansion or macro expansion
 When some code invokes a procedure, it is possible to
directly insert the body of the procedure inside the calling
code rather than transferring control to it.
 This saves the overhead related to procedure calls, as well
as providing great opportunity for many different
parameter-specific optimizations, but comes at the cost of
space;
 The procedure body is duplicated each time the procedure
is called inline.
 Generally, inlining is useful in performance-critical code
that makes a large number of calls to small procedures.
 A "fewer jumps" optimization.
Macro Compressions
 A space optimization that recognizes common
sequences of code, creates subprograms ("code
macros") that contain the common code, and
replaces the occurrences of the common code
sequences with calls to the corresponding
subprograms.
 This is most effectively done as a machine-code
optimization, when all the code is present.
 The technique was first used to conserve space
in an interpretive byte stream used in an…
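A hypothetical C sketch of the idea (all names invented): a repeated sequence is factored into a shared subprogram.

/* Before: the same three-operation sequence appears in two functions */
int f(int a) { a &= 0xFF; a ^= 0x5A; a <<= 1; return a + 1; }
int g(int a) { a &= 0xFF; a ^= 0x5A; a <<= 1; return a - 1; }

/* After: the common code becomes a "code macro" (subprogram) */
static int common(int a) { a &= 0xFF; a ^= 0x5A; return a << 1; }
int f2(int a) { return common(a) + 1; }
int g2(int a) { return common(a) - 1; }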
Interprocedural optimizations
 Interprocedural optimization works on the entire program, across procedure and
file boundaries.
 It works closely with its intraprocedural counterparts, carried out through the cooperation
of a local part and a global part.
 Typical interprocedural optimizations are:
 procedure inlining
 interprocedural dead code elimination
 interprocedural constant propagation
 procedure reordering.
 As usual, the compiler needs to perform interprocedural analysis before its actual
optimizations.
 Interprocedural analyses include alias analysis, array access analysis, and the
construction of a call graph.
 Due to the extra time and space required by interprocedural analysis, most
compilers do not perform it by default. Users must use compiler options explicitly
to tell the compiler to enable interprocedural analysis and other expensive
optimizations.
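A small C sketch of interprocedural constant propagation (hypothetical functions): with whole-program visibility, the compiler can see that scale() is only ever called with 4 and fold the call away.

static int scale(int x) { return x * 10; }
int use(void) { return scale(4); }   /* can be folded to: return 40; */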