3. 20 Nov 2005 roberto innocente 3
Speculative execution
A prediction is made of what work is likely to be
needed soon. That work is then executed
speculatively, in such a way that it can be
committed if the prediction was correct,
or aborted if it was not.
Linear scaling of speed
Quadratic scaling of transistors
Let's look at the last scaling step in
silicon lithography, from 0.13 µm
to 0.09 µm: a 0.70 linear scaling,
hence a 0.49 scaling of surface area.
Gate delays scale linearly,
while the transistors available scale
quadratically.
We will therefore gain much more in
available complexity than
in gate speed.
[Chart: Gate Speed and Transistors vs. feature size (0.25, 0.18, 0.13, 0.09, 0.065 µm); y-axis 0–250]
Modern microprocessors
Today's µprocessors take advantage of the
fact that they need to present an
architectural state compliant with the
standard von Neumann model only from
time to time; for the remaining time they are
free to proceed in whatever way they find
convenient.
Pipelining
The work to be done is
divided into stages, with a
clear signal interface
between them. After each
stage a latch memorizes
the state for the next
cycle. This adds some
overhead, but the hope is
to get 1 result per cycle
once the pipe is full.
[Diagram: 5-stage pipeline — Fetch (F), Decode (D), eXecute (X), Memory (M), Writeback (W) — with a pipeline latch between consecutive stages]
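The fill behaviour described above can be sketched with a few lines of Python. This is a hypothetical illustration, not a model of any real pipeline: it only shows that the first result appears after the pipe fills and then one result completes per cycle.

```python
# Minimal sketch: completion cycle of each instruction in an ideal
# 5-stage pipeline (F D X M W), assuming no stalls or dependencies.
STAGES = 5

def completion_cycle(i, stages=STAGES):
    """Instruction i (0-based) enters Fetch at cycle i+1 and leaves
    Writeback stages-1 cycles later."""
    return i + stages

# First result at cycle 5 (the pipe is full), then 1 result per cycle.
print([completion_cycle(i) for i in range(4)])  # [5, 6, 7, 8]
```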
Pipeline at work
cycle  F              D  X  M  W
  1    add r1,r3,r4
  2    mul r5,r6,r7
  3    bnez loop,r1
  4    X
  5    X              X
  6    X              X  X
  7    X              X  X  X
  8    div r8,r3,r6   X  X  X  X
  9    add r10,r8,r9     X  X  X
 10    jmp loop             X  X
When there is a
dependency we say
that the pipeline is
stalled, or that a
bubble is inserted,
while waiting for the
dependency to
resolve. Here a
control dependency
causes a 4-cycle
stall.
Instruction dependencies
Data dependency:
add r1,r2,r3 ; r1 <- r2+r3
mul r1,r4,r5 ; r1 <- r4*r5
Solution:
register renaming,
result forwarding
Structural dependency:
Solution:
add functional units
Control dependency:
bne label1,r1,r2
add r1,r2,r3
label1:
mul r4,r5,r6
Solution:
branch prediction
Multiple issue (Superscalar)
Architectures
[Diagram: two parallel 5-stage pipelines (F D X M W), i.e. a 2-way superscalar]
Architectures that are able to
process multiple instructions
at a time. While it was long
common to have multiple
execution units (such as an
integer and an FP unit), the first
superscalar architectures, e.g.
the IBM POWER and the Pentium
Pro, appeared only in the '90s.
These architectures require
very good branch prediction.
Depicted here is a 2-way
superscalar.
Superscalar/2
Current architectures are commonly 4- or 8-
way superscalar
The design of the last Alpha, canceled in its
late phase, was for an 8-way superscalar
Extremely good branch prediction is
needed: there can be over a hundred
instructions in flight (4 ways * 30 stages = 120)
Feature size, frequency, complexity
[Charts: feature size (0–1 µm), frequency (0–4000 MHz), and transistor count (0–130000) for the 386, 486dx, P 60, P Pro, P II, P III, P 4, P 4 571]
Control xfer instructions
Some instructions, instead of simply
incrementing the PC to the next instruction,
change it to a different value. We distinguish:
Unconditional branches, or simply jumps
Conditional branches, or simply branches
Subroutine calls
Subroutine returns
Traps, returns from interrupts or exceptions
Branches by frequency
[Chart: dynamic instructions, dynamic branches, and dynamic conditional branches for the SPEC95 benchmarks (compress, gcc, go, ijpeg, m88ksim, perl, vortex, xlisp); y-axis in millions of instructions, 0–250]
Branches by taken rate
Average from SPECint95:
Always taken:        14%
Taken 95–100%:       21%
Taken 50–95%:        20%
Taken 5–50%:         24%
Taken 0–5%:           7%
Never taken:         14%
Occurrences of branches
Occurrences of branches (conditional branches):
SPECint95: 1 out of 5 instructions executed (20%)
SPECfp95: 1 out of 10 instructions executed (10%)
Basic block is the term used for a sequence of
instructions without any control xfer
Note: this is different from, and much higher than,
the rate of branches in the static program
Mispredictions effects
b = rate of instructions
executed that are
branches (0.1–0.2)
p = prediction accuracy
(currently the best is in
the 0.90–0.97 range)
f = instructions "in flight"
(in execution, currently
over 100)
Oversimplification:
a misprediction is recognized
only at the very end and
forces us to squash all the
following f in-flight instructions.
Then every 1/(b*(1-p)) instructions
we squash f instructions:
E = 1/(1 + f*b*(1-p))
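The efficiency formula above can be evaluated for the ranges the slide quotes. A minimal sketch (the specific values b=0.2 and f=100 are just examples drawn from those ranges):

```python
# Efficiency under the slide's oversimplified model:
# E = 1 / (1 + f*b*(1-p))
def efficiency(b, p, f):
    """b: branch rate, p: prediction accuracy, f: instructions in flight."""
    return 1.0 / (1.0 + f * b * (1.0 - p))

# With b=0.2 and f=100, accuracy matters enormously:
print(round(efficiency(0.2, 0.90, 100), 3))  # 0.333
print(round(efficiency(0.2, 0.97, 100), 3))  # 0.625
```

Even a few points of prediction accuracy nearly double the machine's effective throughput in this model, which is why the rest of the deck is about predictors.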
Static branch prediction
Always taken (AT), Always not taken (ANT)
Backward taken, forward not taken (BTFNT)
frequently used by current processors, relies on compilers
too (Intel Pentium 4)
Complicated rules: for example the branch predictor of the
Pentium M looks at the distance between addresses and opcodes
Programmer hints (special opcodes on Pentium, flags on
Itanium)
Program reorganization by compilers
Achieves ~60–70% accuracy
Semi-static branch prediction
It relies on data collected from previous runs
of the program (profiling: Sun SPARC)
Appropriate hints are inserted into the code:
predict taken
predict not taken
Achieves accuracy of ~65–80%
Dynamic branch prediction
"Same as the last time" predictors
Bimodal predictors:
achieve accuracy of 70–85%
2-level / correlation predictors:
achieve accuracy of 80–90%
Combining/meta predictors
Markov/PPM predictors
Neural predictors
[Chart: branch prediction accuracy ranges (%, from 50 to 95) for static, semi-static, bimodal, 2-level, and combined predictors]
2bc – Two-bit saturating counter
The best 4-state FSA
(Finite State Automaton):
SNT, NT, T, ST (Strongly
Not Taken, Not Taken,
Taken, Strongly Taken)
Add 1 when the branch is
taken, subtract 1 when
not taken. Saturate at 0
and 3
[State diagram: SNT (00) <-> NT (01) <-> T (10) <-> ST (11); each taken outcome (t) moves one state toward ST, each not-taken outcome (nt) moves one state toward SNT]
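The counter is simple enough to sketch directly. A minimal Python model of the FSA above (the starting state is an assumption; real implementations vary):

```python
# 2-bit saturating counter: states 0..3 = SNT, NT, T, ST.
# Predict taken when the counter is in T or ST (state >= 2).
class TwoBitCounter:
    def __init__(self, state=2):        # assume start at T (weakly taken)
        self.state = state

    def predict(self):
        return self.state >= 2          # True = predict taken

    def update(self, taken):
        # Add 1 on taken, subtract 1 on not taken, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

c = TwoBitCounter()
preds = []
for outcome in [True, True, False, True]:
    preds.append(c.predict())
    c.update(outcome)
print(preds)  # [True, True, True, True]
```

Note the hysteresis: a single not-taken outcome inside a run of taken outcomes does not flip the prediction, which is exactly why this beats a 1-bit "same as last time" scheme on loop branches.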
Branch correlations
Global correlation — outcome depends on the
outcomes of previous branches:
if (cond1) { .. }
if (cond1 && cond2) { .. }

if (cond1) a=2;
..
if (a==0) { .. }

if (cond1) { .. }
if (cond2) { .. }
if (cond1 && cond2) { .. }

Local correlation — outcome depends on
previous outcomes of the same branch:
for(i=0;i<1000;i++) {
  if (i%4 == 0) a[i]=0;
}

for(i=0;i<12;i++) { .. }
Two-level/correlation predictor (Yeh &
Patt '92, Pan, So & Rahmeh '92)
Branches are correlated one
to the other
We keep a shift register with
the most recent branch
outcomes
We index a bimodal table
(Pattern History Table, PHT) with
this branch history register
(BHR)
We can keep only one global
BHR for all the branches
(global 2-level predictor) or a
BHR per branch (local 2-level
predictor). The same can
be done for the PHT.
[Diagram: branch history register indexing the Pattern History Table; the last outcome is shifted into the register and the indexed counter supplies the prediction]
gshare (McFarling ’93)
Alleviates the
problem of destructive
interference between
branches in the PHT
The PHT is indexed
with the XOR of the
branch history register
and the BIA
(branch instruction
address)
[Diagram: branch history register XORed with the branch address to index the Pattern History Table; the last outcome is shifted in and the indexed counter supplies the prediction]
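The gshare indexing scheme can be sketched in a few lines, reusing 2-bit saturating counters for the PHT entries. The table size and the bare XOR hash are illustrative assumptions; real designs pick these to fit their transistor budget.

```python
# Sketch of a gshare predictor: the PHT of 2-bit counters is indexed
# by XOR of the global history register and the branch address.
class Gshare:
    def __init__(self, bits=12):                 # 2^12 entries (assumption)
        self.mask = (1 << bits) - 1
        self.history = 0                         # global branch history register
        self.pht = [2] * (1 << bits)             # 2-bit counters, init weakly taken

    def index(self, pc):
        return (pc ^ self.history) & self.mask   # the gshare hash

    def predict(self, pc):
        return self.pht[self.index(pc)] >= 2     # True = predict taken

    def update(self, pc, taken):
        i = self.index(pc)
        self.pht[i] = min(3, self.pht[i] + 1) if taken else max(0, self.pht[i] - 1)
        # Shift the outcome into the history register.
        self.history = ((self.history << 1) | int(taken)) & self.mask

# Hypothetical usage: a branch at address 0x40 that is never taken.
g = Gshare(bits=8)
for _ in range(16):
    g.update(0x40, False)
```

Because the history is mixed into the index, two different branches that happen to share a PHT entry under a plain history index are likely to map to different entries here, which is the interference the slide refers to.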
Tournament/meta predictor (McFarling
'93)
It often happens that one predictor
is better for some branches
and another for other
branches
A bimodal predictor can then
be used to drive a mux that
chooses between the 2
predictors
When the outcome is known,
the meta predictor is updated if
one of the predictors was right
and the other wrong
In this case its states encode the
confidence in the 2 predictors
[Diagram: hybrid predictor — the branch instruction address indexes Predictor1, Predictor2, and a meta predictor; the meta predictor drives a mux selecting which prediction becomes the outcome]
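The selection and update rules above fit in two small functions. A minimal sketch, assuming a 2-bit meta counter per branch (values >= 2 meaning "trust predictor 1"):

```python
# Meta predictor update: move toward the predictor that was right,
# but only when the two components disagree on correctness.
def meta_update(meta, p1_correct, p2_correct):
    if p1_correct and not p2_correct:
        return min(3, meta + 1)      # trust predictor 1 more
    if p2_correct and not p1_correct:
        return max(0, meta - 1)      # trust predictor 2 more
    return meta                      # both right or both wrong: unchanged

# The mux: the meta counter selects which component's prediction is used.
def choose(meta, pred1, pred2):
    return pred1 if meta >= 2 else pred2
```

In a full predictor each branch address would index its own meta counter alongside, say, a bimodal and a gshare component; the functions above are only the selection logic.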
Data compression
It is a similar and well studied problem, for which there
exists an algorithm reputed to be nearly optimal (PPM).
The goal is to represent the data with fewer bits:
you use fewer bits for frequent sequences and more
bits for infrequent ones. The net effect is to use
fewer bits overall.
It relies on accurately predicting the probability
distribution of the data and using a coder tuned to it.
Markov predictor
A Markov predictor of
order j bases its
prediction on the last j
outcomes
It builds the matrix of
transition frequencies
and makes the prediction
according to it
pattern  next  frequency
00       0     0
00       1     1
01       0     0
01       1     2
10       0     1
10       1     1
11       0     2
11       1     0

Last outcomes: 1 0 1 1 0 0 1 1 0
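The table above can be reproduced mechanically. A sketch of an order-2 Markov predictor that builds the transition frequencies from the slide's outcome sequence and predicts the most frequent successor:

```python
from collections import Counter

# Count how often each length-`order` pattern is followed by 0 or 1.
def markov_counts(outcomes, order=2):
    counts = Counter()
    for i in range(order, len(outcomes)):
        pattern = tuple(outcomes[i - order:i])
        counts[(pattern, outcomes[i])] += 1
    return counts

# Predict the most frequent successor of the given pattern.
def predict(counts, pattern):
    return max((0, 1), key=lambda nxt: counts[(tuple(pattern), nxt)])

seq = [1, 0, 1, 1, 0, 0, 1, 1, 0]
c = markov_counts(seq)
print(c[((1, 1), 0)])     # 2, matching the slide's table
print(predict(c, (0, 1)))  # 1
```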
PPM – (Cleary, Witten 1984)
Prediction by Partial Matching
A PPM predictor of order m is a set of m+1
Markov predictors
[Diagram: PPM lookup on the last outcomes (1 0 1 1 0 0 1 1 0) — try the last m bits in the Markov predictor of order m; if found, predict with it; if not found, try the last m-1 bits in the order-(m-1) predictor, and so on down to the order-0 predictor]
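The fallback chain can be sketched as a loop over the set of Markov tables. The table representation (a dict from pattern tuple to outcome counts per order) is an assumption chosen for clarity:

```python
# PPM fallback: try the order-m Markov table first and fall back to
# lower orders until a matching pattern is found.
def ppm_predict(tables, history, m):
    """tables[j] maps a length-j pattern tuple to {0: count, 1: count}."""
    for order in range(m, -1, -1):
        pattern = tuple(history[-order:]) if order else ()
        freqs = tables[order].get(pattern)
        if freqs:                        # found: predict with this order
            return max(freqs, key=freqs.get)
    return 1                             # default when even order 0 is empty

# Hypothetical tables: order 2 is empty, order 1 knows pattern (0,).
tables = [{(): {0: 3, 1: 5}}, {(0,): {1: 2}}, {}]
print(ppm_predict(tables, [1, 0], 2))    # order 2 misses, order 1 hits: 1
```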
Neural methods (D. Jimenez 2002)
Machine learning has often used neural methods
Most neural networks can't be candidates for
hardware prediction at the microarchitecture level:
their implementation would require far more
than a few cycles
The standard method of training, the
backpropagation algorithm, is infeasible in a few
machine cycles
Perceptron
Introduced by Rosenblatt in
1962 as a model of brain
functioning, popularized by
M. Minsky
We will consider the simplest:
the single-layer perceptron
A vector of n inputs: x[1]..x[n]
Each input has a weight
associated with it: w[0]..w[n]
(w[0] is the bias weight).
This weight vector characterizes
the perceptron
Bipolar perceptron
The inputs and the outcome t can only be -1
or 1
Then t*x[i] = 1 if they agree, or -1 if they
disagree
The output is y = w[0] + Σ w[i]*x[i]; if the w[i]
are integers, y is an integer too,
and sign(y) is the prediction
Perceptron training
Simply stated: increase the weights of the inputs
that agree with the outcome, and decrease the weights
of those that do not
Let t be the outcome and theta a threshold beyond which we
stop training the perceptron. Then the algorithm is:
if ((sign(y) != t) || (|y| < theta)) {
    for (i = 0; i <= n; i++) {
        w[i] = w[i] + t * x[i];
    }
}
Perceptron limitations
A single perceptron can only learn linearly separable
functions of the inputs. The linear equation
w[0] + Σ w[i]*x[i] = 0 represents a hyperplane in the
n-dimensional space of inputs
AND, OR, NAND, NOR are linearly separable; XOR is
not
Of course any boolean function can be learned by a 2-
layer network of perceptrons (as any boolean function
can be represented by a 2-layer net of ANDs and
ORs), but it has been shown that for branch prediction
there is not much gain, and the delay gets much worse
Branch prediction with perceptrons
The inputs of the perceptron are the branch history
We keep a table of perceptrons (the weights) that we address
by hashing on the branch address
Every time we meet a branch we load its perceptron into a vector
register and compute in parallel the dot product between the
weights and the branch history (summing one's complements
instead of two's complements)
According to the result we predict the branch taken or not taken
The training algorithm is performed and the updated perceptron is
written back
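The predict-and-train loop for one perceptron can be sketched in software, with the bipolar convention from the previous slides (history bits and outcome t in {-1, +1}). The history length, threshold, and the "always taken" training sequence are illustrative assumptions:

```python
# y = w[0] + sum(w[i]*x[i]); predict taken when y >= 0.
def perceptron_predict(w, x):
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))

# Train when the prediction was wrong or |y| is below the threshold.
def perceptron_train(w, x, t, theta):
    y = perceptron_predict(w, x)
    if (y >= 0) != (t > 0) or abs(y) < theta:
        w[0] += t                        # the bias input is fixed at 1
        for i, xi in enumerate(x, start=1):
            w[i] += t * xi
    return w

# Hypothetical usage: learn an always-taken branch over a 4-bit history.
w = [0] * 5
for _ in range(3):
    perceptron_train(w, [1, 1, 1, 1], t=1, theta=2)
print(w)  # [1, 1, 1, 1, 1]
```

After the first update |y| exceeds the threshold and training stops, which is the point of theta: weights freeze once the perceptron is confident.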
The dataflow limit
It's the serialization
constraint imposed by data
dependencies among
instructions
It was always thought to be
an insurmountable limit:
an instruction that needs
data from another
instruction must be
executed after it
ADD R1,R2,R3 ; R1 <- R2+R3
ADD R4,R1,R5 ; R4 <- R1+R5
Exceeding the dataflow limit
At the end of the '90s some authors proposed the
use of data prediction to overcome the dataflow
limit (M. Lipasti and J. Shen, "Exceeding the
dataflow limit")
This is much more difficult than branch prediction,
where you need to predict only a binary value
Value prediction/1
It was shown, for instance, that 20–30% of
instructions that write a value into a register write
the same value as the last time
And 40–50% write one of the last 4
preceding values
Value prediction/2
What makes these values so predictable?
It seems this is because real-world
programs are not only designed
to manage quite infrequent contingencies, like
exceptions and error conditions, but are also
general by design. This shows up even in code
aggressively optimized by modern state-of-the-art
compilers
Research areas
Reverse engineering of prediction algorithm
implementations
Simulation of new prediction algorithms:
using legacy Instruction Sets (IS)
using abstract RISC instruction sets
Hand-coded optimization and compiler optimization
techniques
Reverse engineering
A Python or Perl script:
produces assembly language kernels (with, for
example, a fixed distance between branch
instructions)
compiles and runs the kernels, using the
hardware misprediction counters to detect
table sizes, conflicts and so on
Legacy IS/OS simulations
Can be obtained by instrumenting an x86 open
source simulator like bochs, which can run
Windows or Linux
You can then run statically precompiled
binaries on it
Problem: bochs is not even a complete
Pentium II simulator!
Abstract IS simulators
SimpleScalar is an open-source framework for a
generic software simulator, over which modules for
different prediction algorithms can be implemented
It also offers the possibility to customize the
Instruction Set (IS)
Problem: you need the sources and must compile
special libraries to use this tool
Scheduling
Code scheduling, or reordering of instructions, is used
to improve performance or guarantee correctness
Important for dynamically scheduled architectures,
essential for statically scheduled ones
Examples: branch delay slots, memory delays,
multicycle operations
Block scheduling, list scheduling, superblock
scheduling, trace scheduling
The BTA era is here
(Billion Transistor Architecture)
The Intel Itanium 2 with 6 MB of L3 cache has 0.41 billion transistors, of which
around 0.3 billion are for the cache memory
It's not clear what the best use of the available silicon will be:
CMP (Single-Chip Multiprocessors)
Superwide, superspeculative superscalars
Simultaneous Multithreading
Raw processors
[Chart: FO4 gates, pipe length, feature size, transistor count, and frequency across the 386, 486dx, P 60, P Pro, P II, P III, P 4, P 4 571]