3. 20 Nov 2005 roberto innocente 3
Speculative execution
A prediction is made of what work is likely to be
needed soon. That work is then executed
speculatively, in such a way that it can be
committed if the prediction was correct,
or aborted if it was not.
Linear scaling of speed
Quadratic scaling of transistors
Let's look at the last scaling step in
silicon lithography, from 0.13 µm
to 0.09 µm: a 0.70 linear scaling,
hence a 0.49 scaling of surface area.
Gate delays scale linearly,
while the transistors available scale
quadratically.
We will therefore gain much more in
available complexity than
in gate speed.
[Chart: Gate Speed and Transistors vs. feature size (0.25, 0.18, 0.13, 0.09, 0.065 µm); y-axis 0–250]
Modern microprocessors
Today's µprocessors take advantage of the
fact that they need to present an
architectural state compliant with the
standard von Neumann model only from
time to time; for the remaining time they are
free to proceed in whatever way they find
convenient.
Pipelining
The work to be done is
divided into stages, with a
clear signal interface
between them. After each
stage a latch memorizes
the state for the next
cycle. This adds some
overhead, but the hope is
to get 1 result per cycle
once the pipe is full.
[Diagram: 5-stage pipeline — Fetch (F), Decode (D), eXecute (X), Memory (M), Writeback (W) — with a pipeline latch between consecutive stages]
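The fill behaviour described above can be sketched with a few lines of Python. This is a hypothetical illustration, not a model of any real pipeline: it only shows that the first result appears after the pipe fills and then one result completes per cycle.

```python
# Minimal sketch: completion cycle of each instruction in an ideal
# 5-stage pipeline (F D X M W), assuming no stalls or dependencies.
STAGES = 5

def completion_cycle(i, stages=STAGES):
    """Instruction i (0-based) enters Fetch at cycle i+1 and leaves
    Writeback stages-1 cycles later."""
    return i + stages

# First result at cycle 5 (the pipe is full), then 1 result per cycle.
print([completion_cycle(i) for i in range(4)])  # [5, 6, 7, 8]
```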
Pipeline at work
cycle  F              D  X  M  W
  1    add r1,r3,r4
  2    mul r5,r6,r7
  3    bnez loop,r1
  4    X
  5    X              X
  6    X              X  X
  7    X              X  X  X
  8    div r8,r3,r6   X  X  X  X
  9    add r10,r8,r9     X  X  X
 10    jmp loop             X  X
When there is a
dependency we say
that the pipeline is
stalled, or that a
bubble is inserted,
while waiting for the
dependency to
resolve. Here a
control dependency
causes a 4-cycle
stall.
Instruction dependencies
Data dependency:
add r1,r2,r3 ; r1 <- r2+r3
mul r1,r4,r5 ; r1 <- r4*r5
Solution:
register renaming,
result forwarding
Structural dependency:
Solution:
add functional units
Control dependency:
bne label1,r1,r2
add r1,r2,r3
label1:
mul r4,r5,r6
Solution:
branch prediction
Multiple issue (Superscalar)
Architectures
[Diagram: two parallel 5-stage pipelines (F D X M W), i.e. a 2-way superscalar]
Architectures that are able to
process multiple instructions
at a time. While it was long
common to have multiple
execution units (such as an
integer and an FP unit), the first
superscalar architectures, e.g.
the IBM POWER and the Pentium
Pro, appeared only in the '90s.
These architectures require
very good branch prediction.
Depicted here is a 2-way
superscalar.
Superscalar/2
Current architectures are commonly 4- or 8-
way superscalar
The design of the last Alpha, canceled in its
late phase, was for an 8-way superscalar
Extremely good branch prediction is
needed: there can be over a hundred
instructions in flight (4 ways * 30 stages = 120)
Feature size, frequency, complexity
[Charts: feature size (0–1 µm), frequency (0–4000 MHz), and transistor count (0–130000) for the 386, 486dx, P 60, P Pro, P II, P III, P 4, P 4 571]
Control xfer instructions
Some instructions, instead of simply
incrementing the PC to the next instruction,
change it to a different value. We distinguish:
Unconditional branches, or simply jumps
Conditional branches, or simply branches
Subroutine calls
Subroutine returns
Traps, returns from interrupts or exceptions
Branches by frequency
[Chart: dynamic instructions, dynamic branches, and dynamic conditional branches for the SPEC95 benchmarks (compress, gcc, go, ijpeg, m88ksim, perl, vortex, xlisp); y-axis in millions of instructions, 0–250]
Branches by taken rate
Average from SPECint95:
Always taken:        14%
Taken 95–100%:       21%
Taken 50–95%:        20%
Taken 5–50%:         24%
Taken 0–5%:           7%
Never taken:         14%
Occurrences of branches
Occurrences of branches (conditional branches):
SPECint95: 1 out of 5 instructions executed (20%)
SPECfp95: 1 out of 10 instructions executed (10%)
Basic block is the term used for a sequence of
instructions without any control xfer
Note: this is different from, and much higher than,
the rate of branches in the static program
Mispredictions effects
b = rate of instructions
executed that are
branches (0.1–0.2)
p = prediction accuracy
(currently the best is in
the 0.90–0.97 range)
f = instructions "in flight"
(in execution, currently
over 100)
Oversimplification:
a misprediction is recognized
only at the very end and
forces us to squash all the
following f in-flight instructions.
Then every 1/(b*(1-p)) instructions
we squash f instructions:
E = 1/(1 + f*b*(1-p))
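The efficiency formula above can be evaluated for the ranges the slide quotes. A minimal sketch (the specific values b=0.2 and f=100 are just examples drawn from those ranges):

```python
# Efficiency under the slide's oversimplified model:
# E = 1 / (1 + f*b*(1-p))
def efficiency(b, p, f):
    """b: branch rate, p: prediction accuracy, f: instructions in flight."""
    return 1.0 / (1.0 + f * b * (1.0 - p))

# With b=0.2 and f=100, accuracy matters enormously:
print(round(efficiency(0.2, 0.90, 100), 3))  # 0.333
print(round(efficiency(0.2, 0.97, 100), 3))  # 0.625
```

Even a few points of prediction accuracy nearly double the machine's effective throughput in this model, which is why the rest of the deck is about predictors.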
Static branch prediction
Always taken (AT), Always not taken (ANT)
Backward taken, forward not taken (BTFNT)
frequently used by current processors, relies on compilers
too (Intel Pentium 4)
Complicated rules: for example the branch predictor of the
Pentium M looks at the distance between addresses and opcodes
Programmer hints (special opcodes on Pentium, flags on
Itanium)
Program reorganization by compilers
Achieves ~60–70% accuracy
Semi-static branch prediction
It relies on data collected from previous runs
of the program (profiling: Sun SPARC)
Appropriate hints are inserted into the code:
predict taken
predict not taken
Achieves accuracy of ~65–80%
Dynamic branch prediction
"Same as the last time" predictors
Bimodal predictors:
achieve accuracy of 70–85%
2-level / correlation predictors:
achieve accuracy of 80–90%
Combining/meta predictors
Markov/PPM predictors
Neural predictors
[Chart: branch prediction accuracy ranges (%, from 50 to 95) for static, semi-static, bimodal, 2-level, and combined predictors]
2bc – Two-bit saturating counter
The best 4-state FSA
(Finite State Automaton):
SNT, NT, T, ST (Strongly
Not Taken, Not Taken,
Taken, Strongly Taken)
Add 1 when the branch is
taken, subtract 1 when
not taken. Saturate at 0
and 3
[State diagram: SNT (00) <-> NT (01) <-> T (10) <-> ST (11); each taken outcome (t) moves one state toward ST, each not-taken outcome (nt) moves one state toward SNT]
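The counter is simple enough to sketch directly. A minimal Python model of the FSA above (the starting state is an assumption; real implementations vary):

```python
# 2-bit saturating counter: states 0..3 = SNT, NT, T, ST.
# Predict taken when the counter is in T or ST (state >= 2).
class TwoBitCounter:
    def __init__(self, state=2):        # assume start at T (weakly taken)
        self.state = state

    def predict(self):
        return self.state >= 2          # True = predict taken

    def update(self, taken):
        # Add 1 on taken, subtract 1 on not taken, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

c = TwoBitCounter()
preds = []
for outcome in [True, True, False, True]:
    preds.append(c.predict())
    c.update(outcome)
print(preds)  # [True, True, True, True]
```

Note the hysteresis: a single not-taken outcome inside a run of taken outcomes does not flip the prediction, which is exactly why this beats a 1-bit "same as last time" scheme on loop branches.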
Branch correlations
Global correlation — outcome depends on the
outcomes of previous branches:
if (cond1) { .. }
if (cond1 && cond2) { .. }

if (cond1) a=2;
..
if (a==0) { .. }

if (cond1) { .. }
if (cond2) { .. }
if (cond1 && cond2) { .. }

Local correlation — outcome depends on
previous outcomes of the same branch:
for(i=0;i<1000;i++) {
  if (i%4 == 0) a[i]=0;
}

for(i=0;i<12;i++) { .. }
Two-level/correlation predictor (Yeh &
Patt '92, Pan, So & Rahmeh '92)
Branches are correlated one
to the other
We keep a shift register with
the most recent branch
outcomes
We index a bimodal table
(Pattern History Table, PHT) with
this branch history register
(BHR)
We can keep only one global
BHR for all the branches
(global 2-level predictor) or a
BHR per branch (local 2-level
predictor). The same can
be done for the PHT.
[Diagram: branch history register indexing the Pattern History Table; the last outcome is shifted into the register and the indexed counter supplies the prediction]
gshare (McFarling ’93)
Alleviates the
problem of destructive
interference between
branches in the PHT
The PHT is indexed
with the XOR of the
branch history register
and the BIA
(branch instruction
address)
[Diagram: branch history register XORed with the branch address to index the Pattern History Table; the last outcome is shifted in and the indexed counter supplies the prediction]
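The gshare indexing scheme can be sketched in a few lines, reusing 2-bit saturating counters for the PHT entries. The table size and the bare XOR hash are illustrative assumptions; real designs pick these to fit their transistor budget.

```python
# Sketch of a gshare predictor: the PHT of 2-bit counters is indexed
# by XOR of the global history register and the branch address.
class Gshare:
    def __init__(self, bits=12):                 # 2^12 entries (assumption)
        self.mask = (1 << bits) - 1
        self.history = 0                         # global branch history register
        self.pht = [2] * (1 << bits)             # 2-bit counters, init weakly taken

    def index(self, pc):
        return (pc ^ self.history) & self.mask   # the gshare hash

    def predict(self, pc):
        return self.pht[self.index(pc)] >= 2     # True = predict taken

    def update(self, pc, taken):
        i = self.index(pc)
        self.pht[i] = min(3, self.pht[i] + 1) if taken else max(0, self.pht[i] - 1)
        # Shift the outcome into the history register.
        self.history = ((self.history << 1) | int(taken)) & self.mask

# Hypothetical usage: a branch at address 0x40 that is never taken.
g = Gshare(bits=8)
for _ in range(16):
    g.update(0x40, False)
```

Because the history is mixed into the index, two different branches that happen to share a PHT entry under a plain history index are likely to map to different entries here, which is the interference the slide refers to.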
Tournament/meta predictor (McFarling
'93)
It often happens that one predictor
is better for some branches
and another for other
branches
A bimodal predictor can then
be used to drive a mux that
chooses between the 2
predictors
When the outcome is known,
the meta predictor is updated if
one of the predictors was right
and the other wrong
In this case its states encode the
confidence in the 2 predictors
[Diagram: hybrid predictor — the branch instruction address indexes Predictor1, Predictor2, and a meta predictor; the meta predictor drives a mux selecting which prediction becomes the outcome]
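The selection and update rules above fit in two small functions. A minimal sketch, assuming a 2-bit meta counter per branch (values >= 2 meaning "trust predictor 1"):

```python
# Meta predictor update: move toward the predictor that was right,
# but only when the two components disagree on correctness.
def meta_update(meta, p1_correct, p2_correct):
    if p1_correct and not p2_correct:
        return min(3, meta + 1)      # trust predictor 1 more
    if p2_correct and not p1_correct:
        return max(0, meta - 1)      # trust predictor 2 more
    return meta                      # both right or both wrong: unchanged

# The mux: the meta counter selects which component's prediction is used.
def choose(meta, pred1, pred2):
    return pred1 if meta >= 2 else pred2
```

In a full predictor each branch address would index its own meta counter alongside, say, a bimodal and a gshare component; the functions above are only the selection logic.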
Data compression
It is a similar and well studied problem, for which there
exists an algorithm reputed to be nearly optimal (PPM).
The goal is to represent the data with fewer bits:
you use fewer bits for frequent sequences and more
bits for infrequent ones. The net effect is to use
fewer bits overall.
It relies on accurately predicting the probability
distribution of the data and using a coder tuned to it.
Markov predictor
A Markov predictor of
order j bases its
prediction on the last j
outcomes
It builds the matrix of
transition frequencies
and makes the prediction
according to it
pattern  next  frequency
00       0     0
00       1     1
01       0     0
01       1     2
10       0     1
10       1     1
11       0     2
11       1     0

Last outcomes: 1 0 1 1 0 0 1 1 0
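The table above can be reproduced mechanically. A sketch of an order-2 Markov predictor that builds the transition frequencies from the slide's outcome sequence and predicts the most frequent successor:

```python
from collections import Counter

# Count how often each length-`order` pattern is followed by 0 or 1.
def markov_counts(outcomes, order=2):
    counts = Counter()
    for i in range(order, len(outcomes)):
        pattern = tuple(outcomes[i - order:i])
        counts[(pattern, outcomes[i])] += 1
    return counts

# Predict the most frequent successor of the given pattern.
def predict(counts, pattern):
    return max((0, 1), key=lambda nxt: counts[(tuple(pattern), nxt)])

seq = [1, 0, 1, 1, 0, 0, 1, 1, 0]
c = markov_counts(seq)
print(c[((1, 1), 0)])     # 2, matching the slide's table
print(predict(c, (0, 1)))  # 1
```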
PPM – (Cleary, Witten 1984)
Prediction by Partial Matching
A PPM predictor of order m is a set of m+1
Markov predictors
[Diagram: PPM lookup on the last outcomes (1 0 1 1 0 0 1 1 0) — try the last m bits in the Markov predictor of order m; if found, predict with it; if not found, try the last m-1 bits in the order-(m-1) predictor, and so on down to the order-0 predictor]
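The fallback chain can be sketched as a loop over the set of Markov tables. The table representation (a dict from pattern tuple to outcome counts per order) is an assumption chosen for clarity:

```python
# PPM fallback: try the order-m Markov table first and fall back to
# lower orders until a matching pattern is found.
def ppm_predict(tables, history, m):
    """tables[j] maps a length-j pattern tuple to {0: count, 1: count}."""
    for order in range(m, -1, -1):
        pattern = tuple(history[-order:]) if order else ()
        freqs = tables[order].get(pattern)
        if freqs:                        # found: predict with this order
            return max(freqs, key=freqs.get)
    return 1                             # default when even order 0 is empty

# Hypothetical tables: order 2 is empty, order 1 knows pattern (0,).
tables = [{(): {0: 3, 1: 5}}, {(0,): {1: 2}}, {}]
print(ppm_predict(tables, [1, 0], 2))    # order 2 misses, order 1 hits: 1
```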
Neural methods (D. Jimenez 2002)
Machine learning has often used neural methods
Most neural networks can't be candidates for
hardware prediction at the microarchitecture level:
their implementation would require far more
than a few cycles
The standard method of training, the
backpropagation algorithm, is infeasible in a few
machine cycles
Perceptron
Introduced by Rosenblatt in
1962 as a model of brain
functioning, popularized by
M. Minsky
We will consider the simplest:
the single-layer perceptron
A vector of n inputs: x[1]..x[n]
Each input has a weight
associated with it: w[0]..w[n]
(w[0] is the bias weight).
This weight vector characterizes
the perceptron
Bipolar perceptron
The inputs and the outcome t can only be -1
or 1
Then t*x[i] = 1 if they agree, or -1 if they
disagree
The output is y = w[0] + Σ w[i]*x[i]; if the w[i]
are integers, y is an integer too,
and sign(y) is the prediction
Perceptron training
Simply stated: increase the weights of the inputs
that agree with the outcome, and decrease the weights
of those that do not
Let t be the outcome and theta a threshold beyond which we
stop training the perceptron. Then the algorithm is:
if ((sign(y) != t) || (|y| < theta)) {
    for (i = 0; i <= n; i++) {
        w[i] = w[i] + t * x[i];
    }
}
Perceptron limitations
A single perceptron can only learn linearly separable
functions of the inputs. The linear equation
w[0] + Σ w[i]*x[i] = 0 represents a hyperplane in the
n-dimensional space of inputs
AND, OR, NAND, NOR are linearly separable; XOR is
not
Of course any boolean function can be learned by a 2-
layer network of perceptrons (as any boolean function
can be represented by a 2-layer net of ANDs and
ORs), but it has been shown that for branch prediction
there is not much gain, and the delay gets much worse
Branch prediction with perceptrons
The inputs of the perceptron are the branch history
We keep a table of perceptrons (the weights) that we address
by hashing on the branch address
Every time we meet a branch we load its perceptron into a vector
register and compute in parallel the dot product between the
weights and the branch history (summing one's complements
instead of two's complements)
According to the result we predict the branch taken or not taken
The training algorithm is performed and the updated perceptron is
written back
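The predict-and-train loop for one perceptron can be sketched in software, with the bipolar convention from the previous slides (history bits and outcome t in {-1, +1}). The history length, threshold, and the "always taken" training sequence are illustrative assumptions:

```python
# y = w[0] + sum(w[i]*x[i]); predict taken when y >= 0.
def perceptron_predict(w, x):
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))

# Train when the prediction was wrong or |y| is below the threshold.
def perceptron_train(w, x, t, theta):
    y = perceptron_predict(w, x)
    if (y >= 0) != (t > 0) or abs(y) < theta:
        w[0] += t                        # the bias input is fixed at 1
        for i, xi in enumerate(x, start=1):
            w[i] += t * xi
    return w

# Hypothetical usage: learn an always-taken branch over a 4-bit history.
w = [0] * 5
for _ in range(3):
    perceptron_train(w, [1, 1, 1, 1], t=1, theta=2)
print(w)  # [1, 1, 1, 1, 1]
```

After the first update |y| exceeds the threshold and training stops, which is the point of theta: weights freeze once the perceptron is confident.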
The dataflow limit
It's the serialization
constraint imposed by data
dependencies among
instructions
It was always thought to be
an insurmountable limit:
an instruction that needs
data from another
instruction must be
executed after it
ADD R1,R2,R3 ; R1 <- R2+R3
ADD R4,R1,R5 ; R4 <- R1+R5
Exceeding the dataflow limit
At the end of the '90s some authors proposed the
use of data prediction to overcome the dataflow
limit (M. Lipasti and J. Shen, "Exceeding the
dataflow limit")
This is much more difficult than branch prediction,
where you need to predict only a binary value
Value prediction/1
It was shown, for instance, that 20–30% of
instructions that write a value into a register write
the same value as the last time
And 40–50% write one of the last 4
preceding values
Value prediction/2
What makes these values so predictable?
It seems this is because real-world
programs are not only designed
to manage quite infrequent contingencies, like
exceptions and error conditions, but are also
general by design. This shows up even in code
aggressively optimized by modern state-of-the-art
compilers
Research areas
Reverse engineering of prediction algorithm
implementations
Simulation of new prediction algorithms:
using legacy Instruction Sets (IS)
using abstract RISC instruction sets
Hand-coded optimization and compiler optimization
techniques
Reverse engineering
A Python or Perl script:
produces assembly language kernels (with, for
example, a fixed distance between branch
instructions)
compiles and runs the kernels, using the
hardware misprediction counters to detect
table sizes, conflicts and so on
Legacy IS/OS simulations
Can be obtained by instrumenting an x86 open
source simulator like bochs, which can run
Windows or Linux
You can then run statically precompiled
binaries on it
Problem: bochs is not even a complete
Pentium II simulator!
Abstract IS simulators
SimpleScalar is an open-source framework for a
generic software simulator, over which modules for
different prediction algorithms can be implemented
It also offers the possibility to customize the
Instruction Set (IS)
Problem: you need the sources and must compile
special libraries to use this tool
Scheduling
Code scheduling, or reordering of instructions, is used
to improve performance or guarantee correctness
Important for dynamically scheduled architectures,
essential for statically scheduled ones
Examples: branch delay slots, memory delays,
multicycle operations
Block scheduling, list scheduling, superblock
scheduling, trace scheduling
The BTA era is here
(Billion Transistor Architecture)
The Intel Itanium 2 with 6 MB of L3 cache has 0.41 billion transistors, of which
around 0.3 billion are for the cache memory
It's not clear what the best use of the available silicon will be:
CMP (Single-Chip Multiprocessors)
Superwide, superspeculative superscalars
Simultaneous Multithreading
Raw processors
[Chart: FO4 gates, pipe length, feature size, transistor count, and frequency across the 386, 486dx, P 60, P Pro, P II, P III, P 4, P 4 571]