This document discusses pipelined processors and different approaches to pipelining. It describes ideal pipelining and how clock period is determined. It then discusses challenges with pipelining like hazards and clocking overhead. Different techniques for pipelining like conventional pipelining, wave pipelining, and self-timed circuits are explained. Issues with wave pipelining like timing constraints and balancing delays are also covered.
4. Determining Clock Period
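[Figure: combinational logic (Comb) with propagation delay P between two registers (Reg), both driven by a clock of period Δt]
Δt ≥ P, hence Δt = Pmax
P = propagation delay, Pmax = max propagation delay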
slide 4
Anshul Kumar, CSE IITD
5. Ideal Pipelining
An instruction taking time Tinst is divided into S stages.
Pmax = Tinst / S
Δt = Tinst / S
Effective CPI = 1
Effective time per instruction: Teff = CPI * Δt = 1 * Tinst / S
6. Pipelining with hazards
An instruction taking time Tinst is divided into S stages.
Frequency of interruptions = b
Δt = Tinst / S
CPI = 1 + (S - 1) * b
Teff = (1 + (S - 1) * b) * Tinst / S
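The hazard model above can be tabulated with a few lines of Python. This is a minimal sketch: the function name `effective_time` and the sampled stage counts are illustrative choices, not from the slides. Tinst = 10 and the three values of b are the ones used in the Teff vs. S plot.

```python
def effective_time(t_inst, stages, b):
    """Teff = (1 + (S - 1) * b) * Tinst / S for an S-stage pipeline."""
    cpi = 1 + (stages - 1) * b       # CPI with interruption frequency b
    clock_period = t_inst / stages   # ideal clock period, no overhead yet
    return cpi * clock_period


if __name__ == "__main__":
    for b in (0.05, 0.1, 0.2):
        row = ", ".join(
            f"S={s}: {effective_time(10, s, b):.2f}" for s in (1, 2, 5, 10)
        )
        print(f"b={b}: {row}")
```

Under this model Teff keeps falling as S grows and approaches b * Tinst, so nothing yet limits the useful pipeline depth.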
7. Teff vs. S (Tinst = 10)
[Plot: Teff (vertical axis, 0 to 12) against number of stages S (horizontal axis, 1 to 10), with one curve each for b = .2, b = .1, and b = .05]
8. A more realistic view
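[Figure: combinational logic (Comb) with propagation delay P between two registers (Reg) driven by a common clock, annotated with register output delay, register setup time, and clock skew]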
9. Clocking Overhead
• Fixed overhead c
– Setup time
– Output delay
• Variable overhead (stretching factor) k
– Clock skew
Δt = Pmax + k * Pmax + c
= (1 + k) * Tinst / S + c
Teff = [1 + (S - 1) * b] * [(1 + k) * Tinst / S + c]
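With clocking overhead included, Teff no longer falls indefinitely with S, and the formula can be minimised. The sketch below evaluates it and locates the minimum. The function names are illustrative, and the closed form in `optimal_stages` is not on the slide: it follows from setting dTeff/dS = 0 and assumes b > 0 and c > 0.

```python
import math


def effective_time(t_inst, stages, b, k, c):
    """Teff = [1 + (S - 1) * b] * [(1 + k) * Tinst / S + c]."""
    cpi = 1 + (stages - 1) * b
    clock_period = (1 + k) * t_inst / stages + c
    return cpi * clock_period


def optimal_stages(t_inst, b, k, c):
    """Real-valued S that minimises Teff (requires b > 0 and c > 0).

    Expanding Teff leaves two S-dependent terms, (1 - b)(1 + k) Tinst / S
    and b c S; their sum is smallest at S = sqrt((1 - b)(1 + k) Tinst / (b c)).
    """
    return math.sqrt((1 - b) * (1 + k) * t_inst / (b * c))


def best_integer_stages(t_inst, b, k, c, max_stages=64):
    """Brute-force search over integer stage counts (max_stages is arbitrary)."""
    return min(
        range(1, max_stages + 1),
        key=lambda s: effective_time(t_inst, s, b, k, c),
    )


if __name__ == "__main__":
    for b in (0.05, 0.1, 0.2):
        s_real = optimal_stages(10, b, 0.1, 1)
        s_int = best_integer_stages(10, b, 0.1, 1)
        print(f"b={b}: S_opt ~ {s_real:.2f}, best integer S = {s_int}")
```

For Tinst = 10, c = 1, k = .1 and b = .1, the real-valued optimum is sqrt(99), just under 10, and the integer search agrees on S = 10.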
10. Teff vs. S (Tinst = 10, c = 1, k = .1)
[Plot: Teff (vertical axis, 0 to 14) against number of stages S (horizontal axis, 1 to 15), with one curve each for b = .2, b = .1, and b = .05]
12. Partitioning instruction into cycles with non-uniform stage times
One action - one pipeline stage
=> large quantization overhead
Multiple actions per stage?
Multiple stages per action?
13. Example
Actions and their delays, listed from last (Put Away) to first (PC - MAR):
Put Away      2 ns
Execute       7+7+8 ns
Data - ALU    3 ns
Cache Data    10 ns
Cache Dir     6 ns
Addr - MAR    3 ns
Gen Addr      9 ns
Decode        6+6 ns
Data - IR     3 ns
Cache Data    10 ns
Cache Dir     6 ns
PC - MAR      4 ns
15. Example
Partition with Pmax = 10 ns (action delays as in example 13):
S = 10
Δt = 14.5 ns
S * Δt = 145 ns
16. Example
Partition with Pmax = 13 ns (action delays as in example 13):
S = 9
Δt = 17.65 ns
S * Δt = 159 ns
17. Example
Partition with Pmax = 20 ns (action delays as in example 13):
S = 5
Δt = 25 ns
S * Δt = 125 ns
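The three partitions can be reproduced with a simple greedy packing of the action delays. This is a sketch under stated assumptions, not the method used to draw the slides: the slides do not say how the stage boundaries were chosen, and the values k = 0.05 and c = 4 ns are not given anywhere in the deck. They are inferred here because Δt = (1 + k) * Pmax + c then matches all three quoted clock periods (14.5, 17.65, and 25 ns). Decode and Execute are treated as divisible at the "+" signs.

```python
# Delays (ns) of the indivisible steps in execution order, from PC - MAR to
# Put Away. Decode (6+6) and Execute (7+7+8) are split at the "+" signs.
STEPS = [4, 6, 10, 3, 6, 6, 9, 3, 6, 10, 3, 7, 7, 8, 2]


def greedy_partition(steps, limit):
    """Pack consecutive steps into stages whose total delay is <= limit."""
    stages, current = [], []
    for delay in steps:
        if delay > limit:
            raise ValueError(f"step of {delay} ns exceeds limit of {limit} ns")
        if current and sum(current) + delay > limit:
            stages.append(current)   # close the stage before it overflows
            current = []
        current.append(delay)
    if current:
        stages.append(current)
    return stages


def clock_period(p_max, k=0.05, c=4.0):
    """dt = (1 + k) * Pmax + c; k and c are inferred values, see lead-in."""
    return (1 + k) * p_max + c


if __name__ == "__main__":
    for limit in (10, 13, 20):
        stages = greedy_partition(STEPS, limit)
        p_max = max(sum(stage) for stage in stages)
        dt = clock_period(p_max)
        print(f"limit={limit} ns: S={len(stages)}, Pmax={p_max} ns, "
              f"dt={dt:.2f} ns, S*dt={len(stages) * dt:.2f} ns")
```

The packing yields S = 10, 9, and 5 for limits of 10, 13, and 20 ns, matching the slides. For the 13 ns case S * Δt comes out as 158.85 ns, which the slide rounds to 159 ns.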
19. Cycle Quantization
Delays are not integral multiples of the clock period
Total overhead = clocking overhead + quantization overhead
Δt ≥ Tinst / S + c (ignoring k)
∴ S * Δt ≥ Tinst + S * c
Quantization overhead = S * (Δt - c) - Tinst
This reduces as the clock period becomes smaller
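As a rough check, the quantization overhead can be evaluated for the three example partitions. The sketch below ignores k, as the slide does, so that Δt - c = Pmax and the overhead reduces to S * Pmax - Tinst. Tinst = 90 ns is the sum of the action delays in the example; the function name is illustrative.

```python
# Sum of the action delays from the example (ns), last action first.
T_INST = 2 + (7 + 7 + 8) + 3 + 10 + 6 + 3 + 9 + (6 + 6) + 3 + 10 + 6 + 4


def quantization_overhead(stages, p_max, t_inst=T_INST):
    """S * (dt - c) - Tinst with k ignored, i.e. dt - c = Pmax."""
    return stages * p_max - t_inst


if __name__ == "__main__":
    # (S, Pmax) pairs taken from the three example partitions.
    for stages, p_max in ((10, 10), (9, 13), (5, 20)):
        overhead = quantization_overhead(stages, p_max)
        print(f"S={stages}, Pmax={p_max} ns: overhead = {overhead} ns")
```

The three partitions give 10, 27, and 10 ns, so for this particular set of delays the 13 ns partition wastes the most time.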
20. Other Timing Approaches
• Self-Timed Circuits
– No centralized free-running clock
– An operation begins as soon as its inputs are available, that is, all its predecessors have completed
– Higher speed, lower power consumption
• Wave Pipelining
– Omit inter-stage registers
– Reduced clocking overhead
21. Conventional vs Wave Pipelining
Conventional Pipeline
• Registers separate adjoining stages
• Clock period > max prop delay
• Inter-stage data stored in registers
Wave Pipeline
• No registers between adjoining stages
• Clock period less than max prop delay
• Waves of data propagate through combinational network (effectively, data is stored in the combinational circuit delay!)
22. No pipelining
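[Figure: unpipelined circuit between two registers (Reg), with signals X, X’, and Y, and timing waveforms for Clock, X, X’, and Y]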
24. Wave pipelining
[Figure: wave-pipelined circuit between two registers (Reg), with signals X, Z’, and W, and timing waveforms for Clock, X, Z’, and W]
25. Timing
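[Figure: combinational circuit (Comb ckt) between two registers (Reg), input X and output Y, with timing waveforms for Clock, X, and Y]
T ≥ p + s
T = clock period, p = propagation delay, s = set-up time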
26. Timing with clock skew
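[Figure: combinational circuit (Comb ckt) between two registers (Reg), input X and output Y, with timing waveforms for Clock, X, and Y showing a skew of δ at each clock edge]
Clock skew = ±δ
T ≥ p + s + 2δ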
27. Variation in propagation delay
• Different delays in different paths
• Delay variation due to process/temperature/power variations
• Data-dependent delay variations
28. Timing for wave pipelining
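[Figure: combinational circuit (Comb ckt) between two registers (Reg), input X and output Y, with timing waveforms for Clock (period T, skew ±δ), X, and Y; Y changes between pmin and pmax after X, a spread of Δp]
T ≥ Δp + s + 4δ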
29. Timing for wave pipelining
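(expanded view)
[Figure: expanded timing diagram of X and Y over n clock periods, marking T, (n-1)T, nT, pmin, pmax, and Δp]
pmin ≥ (n-1) T + 2δ
nT ≥ pmax + s + 2δ
Combining the two (nT minus (n-1)T), with Δp = pmax - pmin:
⇒ T ≥ Δp + s + 4δ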
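The two inequalities bound the clock period from both sides once n, the number of clock periods spanned by the combinational delay, is fixed. The sketch below computes that window. The function name and the sample delays (pmin = 18, pmax = 20, s = 1, δ = 0.5, all in ns) are hypothetical values chosen for illustration, not taken from the slides.

```python
def wave_clock_window(p_min, p_max, s, delta, n):
    """Return (t_low, t_high) for a given n, or None if no T satisfies both.

    n * T >= pmax + s + 2*delta        gives the lower bound on T
    pmin  >= (n - 1) * T + 2*delta     gives the upper bound on T (n > 1)
    For n = 1 the second inequality does not involve T, so T has no
    upper bound here.
    """
    t_low = (p_max + s + 2 * delta) / n
    t_high = float("inf") if n == 1 else (p_min - 2 * delta) / (n - 1)
    return (t_low, t_high) if t_low <= t_high else None


if __name__ == "__main__":
    # Hypothetical delays in ns.
    for n in range(1, 6):
        print(n, wave_clock_window(18, 20, 1, 0.5, n))
```

With these sample values the window shrinks as n grows and disappears at n = 5. Every feasible lower bound stays above Δp + s + 4δ = 5 ns, in line with the inequality derived above.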
30. Comparison
Conventional Pipeline                  Wave Pipeline
T ≥ pmax/n + s + 2δ                    T ≥ Δp + s + 4δ
(plus cycle quantization overhead)
nT ≥ pmax + ns + 2nδ                   nT ≥ pmax + s + 2δ
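A small numeric comparison shows how the two clock-period bounds trade off. This is an illustrative sketch: the function names and the delay values (in ns) are hypothetical, and the conventional bound leaves out cycle quantization overhead.

```python
def conventional_min_period(p_max, s, delta, n):
    """T >= pmax/n + s + 2*delta (cycle quantization overhead not included)."""
    return p_max / n + s + 2 * delta


def wave_min_period(p_min, p_max, s, delta):
    """T >= dp + s + 4*delta, with dp = pmax - pmin."""
    return (p_max - p_min) + s + 4 * delta


if __name__ == "__main__":
    # Hypothetical delays in ns: pmax = 20, s = 1, delta = 0.5, n = 4.
    print("conventional:", conventional_min_period(20, 1, 0.5, 4))
    print("wave, balanced paths (pmin = 18):", wave_min_period(18, 20, 1, 0.5))
    print("wave, unbalanced paths (pmin = 10):", wave_min_period(10, 20, 1, 0.5))
```

With well-balanced paths the wave bound (5 ns) is lower than the conventional bound (7 ns). With a large spread between pmin and pmax it rises to 13 ns, since the wave bound depends on Δp rather than on pmax/n.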
31. Problems with wave pipelining
• Need to balance delays
• Narrow range of clock frequencies
• Control difficult
• Not very suitable for non-linear pipelines
32. References
1. M.J. Flynn, “Computer Architecture: Pipelined and Parallel Processor Design”, Narosa Publishing House / Jones and Bartlett, 1996.
2. Wayne P. Burleson, Maciej Ciesielski, Fabian Klass, and Wentai Liu, “Wave-Pipelining: A Tutorial and Research Survey”, IEEE Trans. on VLSI Systems, vol. 6, no. 3, September 1998, pp. 464–474.