The document discusses the Tomasulo algorithm, which enables out-of-order execution of instructions in computer processors. It relies on three key mechanisms: common data busing (CDB), register tagging, and reservation stations. Together they let independent instructions execute out of order while preserving the true dependences between them. Dependences are tracked through tags that name the unit producing a value rather than the register that will hold it, so a dependent instruction can be dispatched early and receive its operand directly from the common data bus. This decouples dependence tracking from instruction decoding and dispatch, improving parallelism.
2. Background
IBM System/360 Model 91
The FPU’s add/mul/div take 2/3/13 cycles
Can performance be improved by using multiple execution units?
[Figure: separate Adder, Mul, and Div units]
3. Major Contributions
Proposed three innovative mechanisms:
Common data busing (CDB)
Register tagging scheme
Reservation stations
which permit:
Out-of-order execution of independent instructions while preserving the essential precedences in the instruction stream
4. Doubt
When people talk about the Tomasulo algorithm, they talk about register renaming
However, this term can’t be found in the original paper
How could anyone invent a thing without noticing it?
6. From an FPU’s perspective
All instructions are ‘register-to-register’:
Register-to-register arithmetic
Storage-to-register arithmetic
Load
Store
The Instruction Unit (outside the FPU) is in charge of address generation and memory access.
7. ‘Sink’ and ‘source’ are equivalent to destination and source
For example, in AD R1, R2 (R1 ← R1 + R2), R1 is both a sink and a source
[Figure: a value flows from the source operands into the sink]
18-22. A Day in the Life of ‘LD R1, addr’
[Figure, animated over five slides: Storage, the FLB (floating-point buffers), the FLOS (floating-point operation stack), the Decoder, the FLR (floating-point registers), the SDB (store data buffers), the Adder, Mul, and Div units, and the Instruction Unit]
The Instruction Unit handles addr and has the operand fetched from storage into buffer FLB1
The FPU receives the instruction as LD R1, FLB1 and queues it in the FLOS
The Decoder decodes the op
The value in FLB1 is delivered to R1 in the FLR
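The sequence above can be sketched in Python as a toy model. The function names `instruction_unit` and `fpu_decode`, the address 0x100, and the value 3.5 are hypothetical; the sketch also assumes the memory fetch completes immediately and models the FLOS as a simple first-in, first-out queue.

```python
from collections import deque


def instruction_unit(instr, storage, flbs, flos):
    """Outside the FPU: generate the address, fetch into a free FLB, forward the op."""
    op, reg, addr = instr                      # e.g. ("LD", "R1", 0x100)
    # Pick the first free floating-point buffer (a real unit would stall if none is free).
    buf = next(name for name, slot in flbs.items() if slot is None)
    flbs[buf] = storage[addr]                  # assumption: the fetch completes at once
    flos.append((op, reg, buf))                # the FPU only ever sees "LD R1, FLB1"


def fpu_decode(flos, flbs, flr):
    """Inside the FPU: decode the oldest queued op and move the buffer into the register."""
    op, reg, buf = flos.popleft()
    if op == "LD":
        flr[reg] = flbs[buf]
        flbs[buf] = None                       # the buffer is free again


storage = {0x100: 3.5}
flbs = {"FLB1": None, "FLB2": None}
flos = deque()
flr = {"R1": 0.0}

instruction_unit(("LD", "R1", 0x100), storage, flbs, flos)
print(flos[0])      # ('LD', 'R1', 'FLB1')
fpu_decode(flos, flbs, flr)
print(flr["R1"])    # 3.5
```

The point of the split is that the FPU never handles an address: by the time the load reaches the FLOS, its storage operand has already become a buffer name.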
23. An Example of Dependence
LD F0, FLB1
MD F0, FLB2
What if we send them to different execution units at the same time to exploit parallelism?
[Figure: Adder, Mul, and Div units]
24. An Example of Dependence
LD F0, FLB1
MD F0, FLB2
The result (F0) does not reflect the effect of LD, because MD uses the old value of F0
25. An Example of Dependence
LD F0, FLB1
MD F0, FLB2
This is also called a true dependence, a.k.a. RAW (read after write)
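A toy Python sketch makes the failure concrete. The values (old F0 = 1.0, FLB1 = 5.0, FLB2 = 2.0) are made up, and the sketch assumes both instructions read F0 in the same cycle and that MD writes last.

```python
def run_sequential(f0, flb1, flb2):
    f0 = flb1          # LD F0, FLB1
    f0 = f0 * flb2     # MD F0, FLB2 reads the value LD just wrote
    return f0


def run_naive_parallel(f0, flb1, flb2):
    old_f0 = f0                # both instructions read F0 in the same cycle
    ld_result = flb1
    md_result = old_f0 * flb2  # stale F0: the RAW dependence is violated
    f0 = ld_result             # LD writes first...
    f0 = md_result             # ...then MD overwrites it
    return f0


print(run_sequential(1.0, 5.0, 2.0))      # 10.0
print(run_naive_parallel(1.0, 5.0, 2.0))  # 2.0
```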
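26. A Simple Solution
The ‘busy’ bit scheme
[Figure: registers R0-R3, each with a busy bit (B). A load into R1 (LD R1) sets R1’s busy bit: ‘I’m already the sink of some instruction’. MD R1, A needs R1’s content (‘I need your content’) and must wait]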
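A minimal sketch of the busy-bit scheme in Python, assuming one busy bit per register and tracking register sources only (the storage operand A never waits). `BusyBitRegisterFile` and `try_issue` are hypothetical names, not taken from the paper.

```python
class BusyBitRegisterFile:
    """Sketch of the 'busy' bit scheme: one busy bit per register."""

    def __init__(self, names):
        self.value = {n: 0.0 for n in names}
        self.busy = {n: False for n in names}

    def try_issue(self, sink, sources):
        """Issue only if no source register is busy; then mark the sink busy."""
        if any(self.busy[r] for r in sources):
            return False                 # "I need your content": wait
        self.busy[sink] = True           # "I'm already the sink of some instruction"
        return True

    def write_back(self, sink, value):
        self.value[sink] = value
        self.busy[sink] = False


rf = BusyBitRegisterFile(["R0", "R1", "R2", "R3"])
print(rf.try_issue("R1", []))        # LD R1: no register source -> True
print(rf.try_issue("R1", ["R1"]))    # MD R1, A: R1 is busy -> False
rf.write_back("R1", 4.0)             # the load completes
print(rf.try_issue("R1", ["R1"]))    # MD R1, A can now issue -> True
```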
27. Performance Degrades...
When the code keeps using one register
E.g. MD F0, E
AD F2, F0
AD F4, A
AD F2, F4
Overlap fails because the first AD depends on MD, though the second AD (AD F4, A) does not
The second AD is qualified to issue, but it is stuck behind the first
28. Cause of the Problem
If one instruction gets stuck (due to a dependence), the following instructions can’t be decoded (even if they are qualified to issue)
Solution:
Decouple dependence maintenance from decoding
Look ahead at more instructions for concurrency
29. Dispatch and Issue Decoupling
MD F0, E
AD F2, F0
AD F4, A
AD F2, F4
[Figure: Decode asks ‘Is that reg busy?’ and holds the instruction until it can issue to the Adder (‘Can issue?’)]
30. Dispatch and Issue Decoupling
MD F0, E
AD F2, F0
AD F4, A
AD F2, F4
[Figure: Decode dispatches the instruction to the Adder anyway; there it waits and asks ‘Are my operands ready?’, while MD F0, E asks ‘Can issue?’]
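The difference between the two decoders can be estimated with a rough cycle count. This sketch assumes one instruction decoded per cycle, the 2-cycle add and 3-cycle multiply from the Background slide, no limit on functional units, and that a result can be used in the cycle it completes; `start_cycles` is a hypothetical helper.

```python
# Each entry: (text, op, sink, register sources). Storage operands E and A
# are omitted because they never wait on a register.
PROGRAM = [
    ("MD F0, E", "MD", "F0", ["F0"]),
    ("AD F2, F0", "AD", "F2", ["F2", "F0"]),
    ("AD F4, A", "AD", "F4", ["F4"]),
    ("AD F2, F4", "AD", "F2", ["F2", "F4"]),
]
LATENCY = {"AD": 2, "MD": 3}   # cycles


def start_cycles(program, decoupled):
    ready_at = {}   # register -> cycle at which its pending value is available
    decode = 0      # cycle at which the next instruction is decoded
    starts = []
    for _text, op, sink, sources in program:
        operands = max((ready_at.get(r, 0) for r in sources), default=0)
        start = max(decode, operands)
        starts.append(start)
        ready_at[sink] = start + LATENCY[op]
        # Busy bit: the decoder is stuck until this instruction issues.
        # Decoupled: it dispatches "anyway" and moves on the next cycle.
        decode = decode + 1 if decoupled else start + 1
    return starts


print(start_cycles(PROGRAM, decoupled=False))  # [0, 3, 4, 6]
print(start_cycles(PROGRAM, decoupled=True))   # [0, 3, 2, 5]
```

Under these assumptions AD F4, A starts in cycle 2 instead of cycle 4, and the last AD starts one cycle earlier, because the decoder no longer waits behind AD F2, F0.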
31. An Example of True Dependence
LD F0, FLB1 (F0 as sink)
AD F2, F0 (F0 as source)
[Figure: the FLB (holding FLB1), the FLR (holding F0), and the Adder, Mul, and Div units]
Assume the CDB has not been introduced yet
32. An Example of True Dependence
LD F0, FLB1 dispatches to A1
AD F2, F0
[Figure: LD F0, FLB1 sits in adder station A1; F0 in the FLR is marked busy (B) with tag A1]
F0 is reserved for some instruction
33. An Example of True Dependence
LD F0, FLB1 dispatches to A1
AD F2, F0
[Figure: F0 is busy with tag A1]
Its content will be calculated by A1
34. An Example of True Dependence
LD F0, FLB1
AD F2, F0
[Figure: F0 is busy with tag A1]
AD F2, F0: ‘I need the value of F0, but it seems to be busy’
35. An Example of True Dependence
LD F0, FLB1
AD F2, F0 dispatches to A2
[Figure: A1 holds LD F0, FLB1, A2 holds AD F2, F0, and F0 is busy with tag A1]
‘Since A1 is the producer, just let it tell me’
36. An Example of True Dependence
LD F0, FLB1
AD F2, F0 dispatches to A2
‘Since A1 is the producer, just ask it for the value’: the operand F0 is replaced by the tag A1, so A2 holds AD F2, A1
37. An Example of True Dependence
LD F0, FLB1 executing
AD F2, F0 (waiting in A2 as AD F2, A1)
A1: ‘Operands are ready. Execute!’
38. An Example of True Dependence
LD F0, FLB1 broadcasts its result to the air
AD F2, F0 (waiting in A2 as AD F2, A1)
A1: ‘I’m A1. Who needs my result? Over.’
39. An Example of True Dependence
LD F0, FLB1 broadcasts its result to the air
AD F2, F0
A2 (AD F2, A1): ‘I depend on A1!’
F0 (busy with tag A1): ‘Me too!’
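The walkthrough on slides 32-39 can be sketched in Python. It follows the slides literally, including dispatching the load to adder station A1, and uses hypothetical names (`Slot`, `Station`, `dispatch`, `broadcast`) and made-up values (FLB1 holds 5.0, F2 starts at 10.0).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Slot:
    """A register or operand field: a valid value, or a tag naming its producer."""
    value: Optional[float] = None
    tag: Optional[str] = None


@dataclass
class Station:
    name: str
    op: str = ""
    operands: List[Slot] = field(default_factory=list)

    def ready(self) -> bool:
        return all(s.tag is None for s in self.operands)


OPS = {"LD": lambda a: a, "AD": lambda a, b: a + b}


def read(regs: Dict[str, Slot], r: str) -> Slot:
    # Copy the value if it is valid, otherwise copy the producer's tag.
    reg = regs[r]
    return Slot(tag=reg.tag) if reg.tag else Slot(value=reg.value)


def dispatch(regs, station, op, sink, sources):
    # Sources are register names or already-known values (e.g. an FLB's contents).
    station.op = op
    station.operands = [read(regs, s) if isinstance(s, str) else Slot(value=s)
                        for s in sources]
    regs[sink].tag = station.name      # the sink is now busy, waiting on this station


def execute(station):
    return OPS[station.op](*(s.value for s in station.operands))


def broadcast(tag, value, regs, stations):
    # "I'm A1. Who needs my result?": every slot holding this tag ingates the value.
    waiting = list(regs.values()) + [s for st in stations for s in st.operands]
    for slot in waiting:
        if slot.tag == tag:
            slot.value, slot.tag = value, None


regs = {"F0": Slot(1.0), "F2": Slot(10.0)}
a1, a2 = Station("A1"), Station("A2")
dispatch(regs, a1, "LD", "F0", [5.0])          # LD F0, FLB1 (FLB1 holds 5.0)
dispatch(regs, a2, "AD", "F2", ["F2", "F0"])   # becomes AD F2, A1
print(a2.operands[1].tag)                       # A1
broadcast("A1", execute(a1), regs, [a1, a2])   # F0 and A2 both pick up 5.0
print(a2.ready(), regs["F0"].value)             # True 5.0
broadcast("A2", execute(a2), regs, [a1, a2])
print(regs["F2"].value)                         # 15.0
```

Because `dispatch` reads the sources before tagging the sink, AD F2, F0 can name F2 as both a source and the sink.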
40. The Role of CDB
The Common Data Bus is in charge of value forwarding
In the reg-to-reg model, a value is passed through a register (write & read)
F0: written as sink (producer)
41. The Role of CDB
The Common Data Bus is in charge of value forwarding
In the reg-to-reg model, a value is passed through a register (write & read)
F0: read as source (consumer)
42. The Role of CDB
[Figure: reservation stations for Add and for Mul, the FLB, the SDB, and the FLR]
Load/Store doesn’t need to go through the ALU
Dependence management is decoupled from execution, as expected
43. The Role of CDB
CDB
Producers: all units that can alter a register
Consumers: all units that may take a register as an operand
[Figure: producers on the CDB are the Add reservation stations (P:3), the Mul reservation stations (P:2), and the FLB (P:6)]
44. The Role of CDB
CDB
Producers: all units that can alter a register
Consumers: all units that may take a register as an operand
[Figure: consumers on the CDB are the FLR (C:4), the SDB (C:3), the Mul reservation stations (C:2*2), and the Add reservation stations (C:3*2)]
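Counting the units on slides 43-44 gives the size of the tag space. This sketch assumes the P: counts belong to the Add reservation stations, the Mul reservation stations, and the FLB, that each C:n*2 counts two operand fields per reservation station, and that one extra tag code is reserved to mean ‘no producer, the value is valid’.

```python
import math

# Counts read off slides 43-44 (the mapping of each count to a unit is an assumption).
producers = {"Add reservation stations": 3, "Mul reservation stations": 2, "FLB": 6}
consumers = {"FLR": 4, "SDB": 3,
             "Mul reservation stations": 2 * 2,   # 2 stations x 2 operand fields
             "Add reservation stations": 3 * 2}   # 3 stations x 2 operand fields

n_producers = sum(producers.values())
# Assumption: one extra code means "no tag, the value is valid".
tag_bits = math.ceil(math.log2(n_producers + 1))

print(n_producers, tag_bits)        # 11 4
print(sum(consumers.values()))      # 17
```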
45-48. The Implementation of CDB
A consumer recognizes its producer by tag
Producers put <tag, value> on the bus in turns (after making a request first)
If the tag matches, the consumer ingates the value
[Figure, animated over four slides: six producers (P) and six consumers (C); three consumers hold other tags, one waits on X, and two wait on Y. Y makes a request (marked ‘2 cycles’) and then puts its value on the bus; X then requests and puts its value on the bus in turn]
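A minimal Python sketch of the tag-matching bus follows. It models only the arbitration and the ingate step: each transfer moves one <tag, value> pair, requests are served first come, first served (an assumption; the slides only say producers take turns), and the consumer tags P, Q, and R are hypothetical stand-ins for the three unrelated tags in the figure.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class Consumer:
    tag: Optional[str]              # producer this slot waits on (None: value valid)
    value: Optional[float] = None


class CommonDataBus:
    """One <tag, value> pair per transfer; producers request and take turns."""

    def __init__(self, consumers):
        self.consumers = consumers
        self.requests = deque()     # assumption: first come, first served

    def request(self, tag, value):
        self.requests.append((tag, value))

    def transfer(self):
        if not self.requests:
            return None
        tag, value = self.requests.popleft()
        for c in self.consumers:
            if c.tag == tag:        # tag match: ingate the value
                c.value, c.tag = value, None
        return tag


consumers = [Consumer(t) for t in ["P", "Q", "R", "X", "Y", "Y"]]
bus = CommonDataBus(consumers)
bus.request("Y", 2.0)
bus.request("X", 7.0)
print(bus.transfer())                  # Y
print([c.tag for c in consumers])      # ['P', 'Q', 'R', 'X', None, None]
print(bus.transfer())                  # X
print([c.value for c in consumers])    # [None, None, None, 7.0, 2.0, 2.0]
```

A single broadcast reaches every waiting consumer at once, which is why the two consumers tagged Y are satisfied by one transfer.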
49. The Principle behind the Scenes
A tag is a pointer to the producer of the value required by the current instruction
These pointers make up the dependence information that the reg-reg model hides (discussed later)
With this information, the order of execution can be resolved
The CDB enables ‘producer-consumer’ style data flow
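To see how these pointers expose the dependences, the following Python sketch follows the latest producer of each register through the four-instruction program of the next example. `dataflow_edges` is a hypothetical helper; each instruction is reduced to its sink and its register sources.

```python
def dataflow_edges(program):
    """Return (producer, consumer) index pairs, following tags instead of names."""
    producer = {}   # register -> index of the latest instruction that writes it
    edges = []
    for i, (_text, sink, sources) in enumerate(program):
        for r in sources:
            if r in producer:               # value comes from an earlier instruction
                edges.append((producer[r], i))
        producer[sink] = i                  # later readers point here from now on
    return edges


program = [
    ("LD F0, FLB1", "F0", []),
    ("AD F2, F0", "F2", ["F2", "F0"]),
    ("LD F0, FLB2", "F0", []),
    ("AD F3, F0", "F3", ["F3", "F0"]),
]
print(dataflow_edges(program))   # [(0, 1), (2, 3)]
```

Only the two RAW edges appear; the WAR and WAW relations on F0 leave no edge, because each reader points at a producer rather than at the register name.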
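50. An Example of False Dependence
LD F0, FLB1
AD F2, F0
LD F0, FLB2
AD F3, F0
[Figure: FLB1 and FLB2 in the FLB both load F0 in the FLR; Adder, Mul, and Div units]
WAW: LD F0, FLB2 writes F0 after LD F0, FLB1 writes it
WAR: LD F0, FLB2 writes F0 after AD F2, F0 reads it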
51. An Example of False Dependence
LD F0, FLB1 dispatches
AD F2, F0
LD F0, FLB2
AD F3, F0
[Figure: F0 is marked busy (B) with tag FLB1]
52. An Example of False Dependence
LD F0, FLB1
AD F2, F0 dispatches to A1
LD F0, FLB2
AD F3, F0
[Figure: A1 holds AD F2, F0; F0 is busy with tag FLB1]
54. An Example of False Dependence
LD F0, FLB1
AD F2, F0
LD F0, FLB2 dispatches
AD F3, F0
[Figure: A1 holds AD F2, F0; F0’s tag changes from FLB1 to FLB2]
55. An Example of False Dependence
LD F0, FLB1
AD F2, F0
LD F0, FLB2
AD F3, F0 dispatches to A2
[Figure: A1 holds AD F2, F0, A2 holds AD F3, F0; F0 is busy with tag FLB2]
56. An Example of False Dependence
LD F0, FLB1
AD F2, F0
LD F0, FLB2
AD F3, F0
[Figure: A1 holds AD F2, F0, A2 holds AD F3, F0; F0 is busy with tag FLB2]
Keep tracing the source of the value instead of the register holding it
57. An Example of False Dependence
LD F0, FLB1
AD F2, F0
LD F0, FLB2
AD F3, F0
[Figure: A1 holds AD F2, F0, A2 holds AD F3, F0; F0 is busy with tag FLB2]
There’s no need to rename a register (naming is just a way of referring to values)
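The same example can be run with register tags in a short Python sketch. The load values and initial register contents are made up, the adds are executed only after both loads arrive, and `simulate` is a hypothetical helper hard-wired to this four-instruction program.

```python
def simulate(flb_values, arrival_order, init_regs):
    """Run the four-instruction example with register tags; loads may arrive in any order."""
    regs = {r: {"value": v, "tag": None} for r, v in init_regs.items()}
    stations = {}   # station name -> list of operand slots

    def read(r):
        reg = regs[r]
        if reg["tag"]:
            return {"value": None, "tag": reg["tag"]}   # copy the producer's tag
        return {"value": reg["value"], "tag": None}     # copy the valid value

    def broadcast(tag, value):
        slots = list(regs.values()) + [s for ops in stations.values() for s in ops]
        for s in slots:
            if s["tag"] == tag:
                s["value"], s["tag"] = value, None

    regs["F0"]["tag"] = "FLB1"                        # LD F0, FLB1
    stations["A1"] = [read("F2"), read("F0")]         # AD F2, F0 -> waits on FLB1
    regs["F2"]["tag"] = "A1"
    regs["F0"]["tag"] = "FLB2"                        # LD F0, FLB2: F0 now points to FLB2
    stations["A2"] = [read("F3"), read("F0")]         # AD F3, F0 -> waits on FLB2
    regs["F3"]["tag"] = "A2"

    for flb in arrival_order:                         # loads complete in any order
        broadcast(flb, flb_values[flb])
    for name, ops in list(stations.items()):          # both adds are now ready
        broadcast(name, ops[0]["value"] + ops[1]["value"])
    return {r: reg["value"] for r, reg in regs.items()}


flb = {"FLB1": 10.0, "FLB2": 20.0}
init = {"F0": 0.0, "F2": 1.0, "F3": 2.0}
print(simulate(flb, ["FLB1", "FLB2"], init))   # {'F0': 20.0, 'F2': 11.0, 'F3': 22.0}
print(simulate(flb, ["FLB2", "FLB1"], init))   # {'F0': 20.0, 'F2': 11.0, 'F3': 22.0}
```

Whichever load arrives first, F0 ends up with FLB2’s value and each add sees the right operand: FLB1’s broadcast never matches F0’s tag because F0 already points to FLB2 (WAW), and A1 holds the tag FLB1 rather than the name F0 (WAR).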
58. Timing Sequence with Busy Bit
LD F0, FLB1
AD F2, F0
LD F0, FLB2
AD F3, F0
[Figure: pipeline timing chart for the four instructions above, using stages D, AG, FLB, T, EX, and WB]
59. Timing Sequence with Reservation Stations
LD F0, FLB1
AD F2, F0
LD F0, FLB2
AD F3, F0
[Figure: pipeline timing chart for the four instructions above, using stages D, AG, FLB, T, EX, and WB]
60. The Side Effect of Register Machine
What are the differences between a circuit and a
register machine?
61. The Side Effect of Register Machine
What are the differences between a circuit and a
register machine?
Register Machine
General purpose
Control-driven
Implicit dependence via registers
Circuit
Special purpose
Data-driven
Exposed dependence
...But registers are scarce
62. Conclusion
The Tomasulo algorithm has nothing to do with register renaming
It resolves WAR & WAW hazards by eliminating the side effect of using registers to pass values
With the Tomasulo algorithm, the execution of a program is driven by data flow, thus exploiting maximum concurrency