The Inner Secrets of Compilers
Daniel Towner
Tool chain engineer for massively multi-core
systems (several hundred cores per chip,
several thousand in a system).
Official GCC port maintainer.
Technical lead for multi-core debugger.
System architect for small-cell/femto-cell
telecommunications company.
Brief anatomy of a compiler
Register allocation
Vectorisation
Induction variables
Debugging optimised code
(Diagram: the anatomy of a compiler. Source input in C++, C, Fortran, Java, Ada or D enters the front-end; the middle-end works on an intermediate representation; the back-end performs machine code generation for targets such as x86, ARM or PowerPC.)
The intermediate representation is a generic assembly-like language, which can represent any of the basic instructions in any target.
(Diagram: the same pipeline, with optimisation shown as the job of the middle-end, between the front-end and machine code generation.)
All the interesting optimisations happen in the middle-end, and are transformations of the intermediate representation.
I will look at just a few.
Compilers have to map program
variables to processor registers,
but how does this happen?
Colour in the map, making
sure that adjacent countries
are not given the same
colour.
Bonus marks for:
• If you had N colours, could
the map be coloured?
• What is the minimum
number of colours needed?
• Given N colours, and
knowing that the graph
could be coloured, how
would it be coloured?
Let's create a graph:
• A node for each country.
• Edges connect nodes (countries) which share a land border.
Once we have a graph, we
can use a k-colouring
algorithm to assign colours
to nodes.
You’ll have to trust me that
such algorithms exist,
because there isn’t
sufficient time to show them
now.
(and k-colouring is an NP-complete problem!)
Colour in the map, making sure that adjacent countries are not given the same colour.
Bonus marks for:
• What is the minimum number of colours needed?
  • 4 colours for planar maps.
• If you had N colours, could the map be coloured in?
  • 4 or more, yes; 3, maybe.
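For reference, one of the simplest such algorithms can be sketched in a few lines of C++. This is an illustration only (a greedy heuristic, not the allocator a production compiler uses), assuming the graph arrives as adjacency lists:

#include <cstddef>
#include <optional>
#include <vector>

// Greedy k-colouring: visit the nodes in order and give each one the lowest
// colour not already used by a coloured neighbour. Returns no value if more
// than k colours would be needed.
std::optional<std::vector<int>>
greedy_colour(const std::vector<std::vector<int>>& adj, int k)
{
    std::vector<int> colour(adj.size(), -1);
    for (std::size_t node = 0; node != adj.size(); ++node) {
        std::vector<bool> used(k, false);
        for (int neighbour : adj[node])
            if (colour[neighbour] != -1)
                used[colour[neighbour]] = true;
        int c = 0;
        while (c != k && used[c])
            ++c;
        if (c == k)
            return std::nullopt;         // ran out of colours
        colour[node] = c;
    }
    return colour;
}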
L12:
t102 := mem[t9];
t22 := t1 * 5;
t99 := 123;
t33 := t1 + 23;
t63 := mem[t2];
t49 := t2 < t63;
if (t49) goto L45;
t66 := t44 - 23;
L34:
t45 := t6 - t2;
t41 := t43 - t99;
t19 := mem[t4];
L45:
t45 := t45 ^ t56;
mem[t918] := t45;
mem[t33] := t55;
t55 := t44 - t34;
t25 := t1 * 5;
L23:
t99 := 123;
t33 := t1 + 23;
t63 := mem[t2];
t45 := t98 - t55;
mem[t45] := t22;
*this is illustrative code - it isn't real!
L12:
t102 := mem[t9];
t22 := t1 * 5;
t99 := 123;
t33 := t1 + 23;
t63 := mem[t2];
t49 := t2 < t63;
if (t49) goto L45;
t66 := t44 - 23;
L34:
t45 := t6 - t2;
t41 := t43 - t99;
t19 := mem[t4];
L45:
t45 := t45 ^ t56;
mem[t918] := t45;
mem[t33] := t55;
t55 := t44 - t34;
t25 := t1 * 5;
L23:
t99 := 123;
t33 := t1 + 23;
t63 := mem[t2];
t45 := t98 - t55;
mem[t45] := t22;
*this is illustrative code - it isn't real!
(Diagram: the live ranges of the values in this code are marked alongside it and labelled V1 to V6.)
L12:
t102 := mem[t9];
t22 := t1 * 5;
t99 := 123;
t33 := t1 + 23;
t63 := mem[t2];
t49 := t2 < t63;
if (t49) goto L45;
t66 := t44 - 23;
L34:
t45 := t6 - t2;
t41 := t43 - t99;
t19 := mem[t4];
L45:
t45 := t45 ^ t56;
mem[t918] := t45;
mem[t33] := t55;
t55 := t44 - t34;
t25 := t1 * 5;
L23:
t99 := 123;
t33 := t1 + 23;
t63 := mem[t2];
t45 := t98 - t55;
mem[t45] := t22;
*this is illustrative code - it isn't real!
(Diagram: the live ranges V1 to V6, and the interference graph built from them: one node per live range, with an edge between any two ranges that are live at the same time.)
L12:
t102 := mem[t9];
t22 := t1 * 5;
t99 := 123;
t33 := t1 + 23;
t63 := mem[t2];
t49 := t2 < t63;
if (t49) goto L45;
t66 := t44 - 23;
L34:
t45 := t6 - t2;
t41 := t43 - t99;
t19 := mem[t4];
L45:
t45 := t45 ^ t56;
mem[t918] := t45;
mem[t33] := t55;
t55 := t44 - t34;
t25 := t1 * 5;
L23:
t99 := 123;
t33 := t1 + 23;
t63 := mem[t2];
t45 := t98 - t55;
mem[t45] := t22;
*this is illustrative code - it isn't real!
(Diagram: the interference graph for V1 to V6, coloured so that no two connected nodes share a colour; the three colours used correspond to registers R1, R2 and R3.)
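A sketch of how live ranges might be turned into an interference graph, assuming each value's live range has already been reduced to a single half-open interval of instruction numbers (real compilers compute liveness per instruction with dataflow analysis; LiveRange and build_interference are illustrative names):

#include <cstddef>
#include <vector>

// A live range approximated as one interval [start, end) over the
// instruction numbering.
struct LiveRange { int vreg; int start; int end; };

// Two values interfere if their live ranges overlap. Build adjacency lists
// indexed by virtual register number, ready for the colouring step.
std::vector<std::vector<int>>
build_interference(const std::vector<LiveRange>& ranges, int num_vregs)
{
    std::vector<std::vector<int>> adj(num_vregs);
    for (std::size_t i = 0; i != ranges.size(); ++i)
        for (std::size_t j = i + 1; j != ranges.size(); ++j) {
            const LiveRange& a = ranges[i];
            const LiveRange& b = ranges[j];
            if (a.start < b.end && b.start < a.end) {   // intervals overlap
                adj[a.vreg].push_back(b.vreg);
                adj[b.vreg].push_back(a.vreg);
            }
        }
    return adj;
}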
But graph colouring is an NP-complete problem, so how can compilers run in reasonable time?
Compilers are optimistic – they use an
algorithm which assumes everything
goes to plan.
• If this works – great!
• If their optimism was misplaced and they fail,
make the problem easier by reducing live
ranges...
L12:
t102 := mem[t9];
t22 := t1 * 5;
t99 := 123;
t33 := t1 + 23;
t63 := mem[t2];
t49 := t2 < t63;
if (t49) goto L45;
t66 := t44 - 23;
L34:
t45 := t6 - t2;
t41 := t43 - t99;
t19 := mem[t4];
L45:
t45 := t45 ^ t56;
mem[t918] := t45;
mem[t33] := t55;
t55 := t44 - t34;
t25 := t1 * 5;
L23:
t99 := 123;
t33 := t1 + 23;
t63 := mem[t2];
t45 := t98 - t55;
mem[t45] := t22;
*this is illustrative code - it isn't real!
(Diagram: one long live range, V1: the value t22 is defined near the top of the code and not used again until the very end.)
After spilling t22 to a stack slot, the code becomes:
L12:
t102 := mem[t9];
t22 := t1 * 5;
mem[fp+4] := t22;
t99 := 123;
t33 := t1 + 23;
t63 := mem[t2];
t49 := t2 < t63;
if (t49) goto L45;
t66 := t44 - 23;
L34:
t45 := t6 - t2;
t41 := t43 - t99;
t19 := mem[t4];
L45:
t45 := t45 ^ t56;
mem[t918] := t45;
mem[t33] := t55;
t55 := t44 - t34;
t25 := t1 * 5;
L23:
t99 := 123;
t33 := t1 + 23;
t63 := mem[t2];
t45 := t98 - t55;
t22 := mem[fp+4];
mem[t45] := t22;
(Diagram: the single long live range V1 has been split into two much shorter ranges, V7 and V8, one around the store to the stack slot and one around the reload.)
Smaller live ranges mean less interference.
Less interference means more chance of register allocation working.
A few iterations of allocation and spilling
will generally result in a valid register
allocation.
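Tying the two earlier sketches together, the optimistic allocate-then-spill loop might look like this. It is a sketch built on the hypothetical build_interference and greedy_colour above; the spill step is passed in as a callback because real allocators pick spill candidates by cost rather than arbitrarily:

#include <functional>
#include <vector>

// Optimistically try to colour the interference graph with the physical
// registers; if that fails, let the caller-supplied spill step shorten a
// live range (as in the fp+4 example above) and try again.
std::vector<int>
allocate_registers(std::vector<LiveRange> ranges, int num_vregs, int num_physical_regs,
                   const std::function<void(std::vector<LiveRange>&)>& spill_one_range)
{
    for (;;) {
        auto graph  = build_interference(ranges, num_vregs);
        auto colour = greedy_colour(graph, num_physical_regs);
        if (colour)
            return *colour;              // vreg -> physical register number
        spill_one_range(ranges);         // make the problem easier, then retry
    }
}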
Large data set operations will perform the
same operation on multiple elements of data.
Vectorisation allows those operations to be parallelised, but how?
• A vector stores multiple related elements of data:
[a0 a1 a2 a3 a4 a5 a6 a7]
• Combine with another vector in a single instruction to produce a third vector:
[a0 a1 a2 a3 a4 a5 a6 a7] ADD [b0 b1 b2 b3 b4 b5 b6 b7] = [c0 c1 c2 c3 c4 c5 c6 c7]
c0 = a0 + b0, c1 = a1 + b1, ...
• Merge elements into a scalar value:
MAX [a0 a1 a2 a3 a4 a5 a6 a7] = T
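As a concrete, if compiler-specific, illustration: GCC and Clang expose vector types directly through the vector_size attribute, and the two operations above can be written like this (a sketch of the idea, not of how a compiler's vectoriser is implemented; v8si is an illustrative name):

// A "vector of 8 ints" type which the compiler maps onto whatever SIMD unit
// the target has (GCC/Clang extension).
typedef int v8si __attribute__((vector_size(32)));

// Element-wise add: c0 = a0 + b0, c1 = a1 + b1, ... in a single instruction.
v8si add_vectors(v8si a, v8si b)
{
    return a + b;
}

// Merge the 8 elements into one scalar by taking their maximum.
int horizontal_max(v8si v)
{
    int m = v[0];                  // vector elements can be indexed directly
    for (int i = 1; i != 8; ++i)
        if (v[i] > m)
            m = v[i];
    return m;
}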
• Vectorisation is important for many types of large data set processing operations.
• Vector units (SIMD) are everywhere:
  • Intel MMX, SSE[1,2,3,4]
  • PowerPC (Xbox 360)
  • GPU engines
  • ARM 7s (iPhone 5, Galaxy S3)
• Automatically generating code to exploit these is therefore very important.
• Vector algorithms are important elsewhere, not just compilers - e.g., Map/reduce.
Consider a simple loop to find the largest
value:
int m = 0;
for (size_t i=0; i!=916; ++i)
m = std::max (m, a[i]);
Try to vectorise this to take advantage of
a vmax instruction, which compares 8
elements at once.
916 (114 * 8 + 4) elements, but vmax
works on 8 elements at a time. Split the
loop:
int m = 0;
for (size_t i=0; i!=912; ++i)
m = std::max (m, a[i]);
for (size_t i=912; i!=916; ++i)
m = std::max (m, a[i]);
Nest another loop inside the first which operates on 8 elements at a time:
for (size_t i=0; i!=912; i+=8)
for (size_t j=0; j!=8; ++j)
m = std::max (m, a[i + j]);
Introduce more instruction-like operations:
for (size_t i=0; i!=912; i+=8)
{
int* tbase = &a[i];
for (size_t j=0; j!=8; ++j)
{
int temp = tbase[j];
m = std::max (m, temp);
}
}
• Change one loop with two sub-operations into two loops with one sub-operation each:
for (size_t i=0; i!=912; i+=8)
{
int* tbase = &a[i];
int temp[8];
for (size_t j=0; j!=8; ++j)
temp[j] = tbase[j];
for (size_t j=0; j!=8; ++j)
m = std::max (m, temp[j]);
}
Replace the sub-loops with their vector
equivalents:
for (size_t i=0; i!=912; i+=8)
{
int* tbase = &a[i];
int temp[8] = vecLoad (tbase);
int localm = vecMax (temp);
m = std::max (m, localm);
}
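For comparison, here is roughly what that loop could look like written by hand with AVX2 intrinsics, which operate on 8 ints at a time. This is a sketch under stated assumptions: it requires AVX2, it keeps a running vector maximum and reduces to a scalar once at the end rather than per iteration, and it ignores the 4-element tail handled by the second loop earlier; max_of_912 is an illustrative name.

#include <immintrin.h>   // AVX2 intrinsics
#include <algorithm>
#include <cstddef>

int max_of_912(const int* a)
{
    __m256i best = _mm256_setzero_si256();                      // running 8-wide maximum
    for (std::size_t i = 0; i != 912; i += 8) {
        __m256i v = _mm256_loadu_si256((const __m256i*)&a[i]);  // the "vecLoad"
        best = _mm256_max_epi32(best, v);                       // element-wise max
    }
    alignas(32) int lanes[8];
    _mm256_store_si256((__m256i*)lanes, best);                  // spill the 8 lanes
    int m = 0;
    for (int lane : lanes)                                      // final horizontal reduction
        m = std::max(m, lane);
    return m;
}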
The loop could then be optimised
further:
• Strength reduction/induction
• Loop unrolling
• And so on...
Many loops contain variables whose
values are related. By understanding
these relationships, and substituting
equivalencies, the loops can run faster.
SOURCE:
int32_t *a;
for (size_t i=0; i!=N; ++i)
  a[i] = i * 5;
INTERMEDIATE:
copy 0, i
start:
cmp i, n
beq end
mul i, 5, t0
mul i, 4, t1
add t1, a, t2
store t0, t2
incr i
bra start
end:
• Three different counts:
  • i counts from 0 to n in steps of 1.
  • t0 counts from 0 to n*5 in steps of 5.
  • t2 counts from &a[0] to &a[n] in steps of 4.
• The values are related. In any iteration, knowing one gives you the others.
• The loop contains two (expensive?) multiplications.
copy 0, i
copy 0, t0
copy &a, t2
start:
cmp i, n
beq end
store t0, t2
add t0, 5, t0
add t2, 4, t2
incr i
bra start
end:
• The values written to a[i] are: i * 5
• Since i is incrementing by one each iteration, the values assigned are: 0, 5, 10, 15, ...
• So the multiplication can be replaced by an initialisation, and an addition in each iteration. As can the other multiplication.
• This also improves ILP (instruction-level parallelism).
copy 0, i
copy 0, t0
copy &a, t2
start:
cmp i, n
beq end
store t0, t2
add t0, 5, t0
add t2, 4, t2
incr i
bra start
end:
• There are three additions in the loop now:
  • i incremented by 1
  • t0 by 5
  • t2 by 4
• i is used only for controlling the loop [0..n)
• We could equally well use:
  • t0 in range [0..n*5)
  • t2 in range [&a, &a + 4*n).
copy 0, i
copy 0, t0
copy &a, t2
start:
cmp i, n
beq end
store t0, t2
add t0, 5, t0
add t2, 4, t2
incr i
bra start
end:
copy 0, t0
copy &a, t2
mul n, 5, t9
start:
cmp t0, t9
beq end
store t0, t2
add t0, 5, t0
add t2, 4, t2
bra start
end:
Fewer overall instructions, fewer loop
instructions and also fewer registers.
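At source level, the fully strength-reduced loop corresponds roughly to the following (a sketch with illustrative names; the compiler performs this transformation on its intermediate representation, not on your C++):

#include <cstdint>

// fill() writes a[i] = i*5 for i in [0, n), as in the example, but in the
// strength-reduced form the compiler arrives at.
void fill(std::int32_t* a, std::int32_t n)
{
    std::int32_t* p   = a;        // plays the role of t2: steps by one element
    std::int32_t  v   = 0;        // plays the role of t0: steps by 5
    std::int32_t  end = n * 5;    // plays the role of t9: the bound, expressed in terms of v
    while (v != end) {            // i itself is no longer needed
        *p = v;
        v += 5;
        ++p;
    }
}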
copy 0, t0
copy &a, t2
mul n, 5, t9
start:
cmp t0, t9
beq end
store t0, t2
add t0, 5, t0
add t2, 4, t2
bra start
end:
• Too many branches.
• Branches are expensive:
  • Long latency
  • Bubbles in the pipeline
  • Cause scheduling problems.
• You will always need a comparison, so can that close the loop?
• A loop with a comparison at the end is a do..while. In source language terms:
do {
  a[i] = i * 5;
  ++i;
} while (i != n);
• But we need to look out for zero iterations too:
if (n > 0) {
  do {
    a[i] = i * 5;
    ++i;
  } while (i != n);
}
(Note that the compiler implements this in intermediate code.)
copy 0, t0
copy &a, t2
mul n, 5, t9
start:
cmp t0, t9
beq end
store t0, t2
add t0, 5, t0
add t2, 4, t2
bra start
end:
cmp 0, n
beq end
copy 0, t0
copy &a, t2
mul n, 5, t9
start:
store t0, t2
add t0, 5, t0
add t2, 4, t2
cmp t0, t9
bne start
end:
Fewer branches. Fewer loop instructions.
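Continuing the earlier fill() sketch, the inverted loop corresponds roughly to this source-level shape (again illustrative names, and again a transformation the compiler really performs on its IR):

#include <cstdint>

// The same loop after inversion: a guard against zero iterations, then a
// do..while whose single comparison both closes and exits the loop.
void fill_inverted(std::int32_t* a, std::int32_t n)
{
    if (n == 0)
        return;                   // zero-iteration guard
    std::int32_t* p   = a;
    std::int32_t  v   = 0;
    std::int32_t  end = n * 5;
    do {
        *p = v;
        v += 5;
        ++p;
    } while (v != end);           // one conditional branch per iteration
}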
fn:
.LFB0:
        testl   %edi, %edi              # n <= 0?
        jle     .L4                     #   yes: skip the loop entirely
        leal    (%rdi,%rdi,4), %ecx     # ecx = n * 5     (the loop bound, t9)
        movl    $array, %edx            # edx = &array[0] (t2)
        xorl    %eax, %eax              # eax = 0         (t0)
.L3:
        movl    %eax, (%rdx)            # *t2 = t0
        addl    $5, %eax                # t0 += 5
        addq    $4, %rdx                # t2 += 4
        cmpl    %ecx, %eax              # loop again while t0 != n*5
        jne     .L3
.L4:
Have you ever come across these?
while (len--) *++dst = *src++;
while (*dst++ = *src++);
They still come up as recommended.
for (i=0; i!=len; ++i)
dst[i] = src[i];
This is more readable, more obvious, and
thanks to induction, strength reduction
and loop inversion, it’s just as fast.
As the compiler gets more clever with
its optimisations, how does the
debugger make sense of it all?
“Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.” – Brian W. Kernighan
The compiler tells the debugger what it
has generated using DWARF.
DWARF stands for Debugging With Attributed Record Formats.
Often stored in an ELF file.
Can be extended to allow for special
features of a compiler or platform.
DWARF is non-intrusive (i.e., the
compiled code doesn’t change)
• Each variable has an associated DWARF expression which tells the debugger the value of that variable.
• There are several basic expressions, such as:
  • Register23
  • Memory[ConstantAddress]
  • Memory[Register]
  • Constant(value)
• An expression can be multi-part too:
Piece(0,16) Register12 Piece(16,128) Memory(FP)
The debugger contains a virtual machine, based upon a stack engine.
The variable expression elements are instructions for that virtual machine.
For example, read the contents of a function argument passed on the stack:
RegSP, Constant(12), Add, Deref
(Animation: RegSP pushes the stack pointer value, 12356; Constant(12) pushes 12 on top of it; Add pops both and pushes 12368; Deref replaces that address with the word stored there. The variable's value, 42, is left on the stack.)
Consider our previous induction-optimised loop.
The variable i was optimised out of
existence.
However, the debugger can still generate
a value for i by running a program:
RegT0, Constant(5), Div
Or:
RegT2, Constant(&a), Sub, Constant(2), Shr
• The virtual machine implementation can do:
  • Arithmetic (add, sub, etc.)
  • Memory operations (load)
  • Register operations
  • Branching (if/then/else)
  • More exotic debugger-specific instructions (e.g., Piece).
• This allows the compiler to generate programs which reverse the effects of optimisations, to allow sane debug output to be generated.
• Special data types (e.g., built-in linked lists) can be supported by the compiler generating appropriate programs, rather than debuggers having to be extended with extra support.
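To make the idea concrete, a toy evaluator for such stack-machine expressions might look like this. It is a sketch only: real DWARF has its own opcode numbering, operand encodings and many more operations, and Op, Insn and evaluate are illustrative names.

#include <cstdint>
#include <functional>
#include <vector>

enum class Op { PushConst, PushReg, Add, Sub, Deref };

struct Insn { Op op; std::uint64_t operand; };   // operand: a constant or a register number

// Run the expression on a stack machine. Register and memory reads are
// supplied by the debugger as callbacks.
std::uint64_t evaluate(const std::vector<Insn>& expr,
                       const std::function<std::uint64_t(unsigned)>& read_reg,
                       const std::function<std::uint64_t(std::uint64_t)>& read_mem)
{
    std::vector<std::uint64_t> stack;
    for (const Insn& insn : expr) {
        switch (insn.op) {
        case Op::PushConst:
            stack.push_back(insn.operand);
            break;
        case Op::PushReg:
            stack.push_back(read_reg(static_cast<unsigned>(insn.operand)));
            break;
        case Op::Add: {
            std::uint64_t b = stack.back(); stack.pop_back();
            stack.back() += b;
            break;
        }
        case Op::Sub: {
            std::uint64_t b = stack.back(); stack.pop_back();
            stack.back() -= b;
            break;
        }
        case Op::Deref:
            stack.back() = read_mem(stack.back());   // load from the computed address
            break;
        }
    }
    return stack.back();   // the variable's value (or address) is left on top
}

// e.g. the slide's "RegSP, Constant(12), Add, Deref" would be:
//   evaluate({{Op::PushReg, SP}, {Op::PushConst, 12}, {Op::Add, 0}, {Op::Deref, 0}},
//            read_reg, read_mem);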
Hopefully I’ve revealed a little more of
what goes on in a compiler.
The sophisticated algorithms used allow
the programmer to concentrate on
writing clean, understandable, testable,
maintainable code.
Questions?
What is the fastest way to move/copy a
block of memory?
• What is the fastest way to move a region of memory?
memmove (dest, src, length);
• If length is known, and is small, the compiler is allowed to generate direct code to deal with it.
• If the compiler can't deal with it, it will hand off to the run-time library.
• The run-time library will often look at the length and the alignment, and choose the best strategy to do the copy (e.g., is a vector engine available?). It may even hand off to the OS...
• ...which is likely to know about the processor type, cache organisation, line sizes, DMA, page mappings, and all sorts of tricks to do this the fastest possible way.
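As a small illustration of the first point: with a known, small, compile-time length, the call can be expanded directly. Header and copy_header are illustrative names, and whether the call is actually inlined depends on the compiler and options.

#include <cstring>

struct Header { char bytes[16]; };

// sizeof(Header) is a small compile-time constant, so the compiler is free
// to expand this memmove into a handful of loads and stores rather than
// calling the library routine.
void copy_header(Header* dest, const Header* src)
{
    std::memmove(dest, src, sizeof(Header));
}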