Lec Jan15 2009

CSL718 : Pipelined Processors

Pipeline Timings
15th Jan, 2009

Anshul Kumar, CSE IITD

Pipelined Processors

Parallel architectures
• Function-parallel
  – Instr level (ILP)
    · Pipelined processors
    · VLIWs
    · Superscalar processors
  – Thread level
  – Process level
• Data-parallel

Intel’s terminology:
• intra ILP
• inter ILP
slide 2

Ideal Pipelining

[Figure: one instruction of duration Tinst divided into S pipeline stages]

slide 3

Determining Clock Period
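[Figure: Reg → Comb → Reg driven by a common clock; P is the delay through the combinational logic and Δt is the clock period]

Δt ≥ P
Δt = Pmax
P = propagation delay
Pmax = max propagation delay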

slide 4

Ideal Pipelining

[Figure: one instruction of duration Tinst divided into S pipeline stages]

Pmax = Tinst / S
Δt = Tinst / S
Effective CPI = 1
Effective time per instruction: Teff = CPI * Δt = 1 * Tinst / S
slide 5

Pipelining with hazards

[Figure: one instruction of duration Tinst divided into S pipeline stages]

Frequency of interruptions = b

Δt = Tinst / S
CPI = 1 + (S - 1) * b
Teff = (1 + (S - 1) * b) * Tinst / S
slide 6
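
The hazard model above is easy to tabulate. The following is a minimal Python sketch, not part of the original slides; the function name `teff_with_hazards` is an assumption chosen for illustration, and the sample values (Tinst = 10, b = 0.2, 0.1, 0.05) are the ones used for the plot on the next slide.

```python
def teff_with_hazards(t_inst, stages, b):
    """Teff = (1 + (S - 1) * b) * Tinst / S, as given on the slide."""
    cycle = t_inst / stages        # Δt = Tinst / S (no clocking overhead yet)
    cpi = 1 + (stages - 1) * b     # CPI = 1 + (S - 1) * b
    return cpi * cycle


if __name__ == "__main__":
    # One row per interruption frequency b, one column per S = 1..10.
    for b in (0.2, 0.1, 0.05):
        row = [round(teff_with_hazards(10, s, b), 2) for s in range(1, 11)]
        print(f"b = {b}: {row}")
```

Under this model Teff keeps falling as S grows (for b = 0.2 it goes from 10 at S = 1 to 2.8 at S = 10), because nothing yet penalizes short clock periods.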

Teff vs. S (Tinst = 10)
[Plot: Teff (vertical axis, 0 to 12) against S (horizontal axis, 1 to 10), one curve each for b = 0.2, 0.1 and 0.05]

A more realistic view
[Figure: Reg → Comb → Reg driven by a common clock, with propagation delay P and annotations for register output delay, register setup time and clock skew]

slide 8

Clocking Overhead
• Fixed overhead c
  – Setup time
  – Output delay
• Variable overhead (stretching factor) k
  – Clock skew

Δt = Pmax + k * Pmax + c
   = (1 + k) * Tinst / S + c
Teff = [1 + (S - 1) * b] * [(1 + k) * Tinst / S + c]

slide 9
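
A minimal Python sketch of the model with clocking overhead follows. It is not from the slides: the names `cycle_time`, `teff` and `best_stage_count` are assumptions for illustration, and the search range 1..15 simply matches the horizontal axis of the plot on the next slide.

```python
def cycle_time(t_inst, stages, k, c):
    """Δt = (1 + k) * Tinst / S + c."""
    return (1 + k) * t_inst / stages + c


def teff(t_inst, stages, b, k, c):
    """Teff = [1 + (S - 1) * b] * Δt."""
    return (1 + (stages - 1) * b) * cycle_time(t_inst, stages, k, c)


def best_stage_count(t_inst, b, k, c, max_stages=15):
    """Integer S in 1..max_stages with the smallest Teff (brute-force search)."""
    return min(range(1, max_stages + 1),
               key=lambda s: teff(t_inst, s, b, k, c))


if __name__ == "__main__":
    for b in (0.2, 0.1, 0.05):
        s = best_stage_count(10, b, k=0.1, c=1)
        print(f"b = {b}: best S = {s}, Teff = {teff(10, s, b, 0.1, 1):.3f}")
```

With the fixed overhead c included, Teff no longer decreases indefinitely: for Tinst = 10, c = 1, k = 0.1 and b = 0.2 the search finds a minimum at S = 7.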

Teff vs. S (Tinst = 10, c = 1, k = .1)
[Plot: Teff (vertical axis, 0 to 14) against S (horizontal axis, 1 to 15), one curve each for b = 0.2, 0.1 and 0.05]

Pipelining with Clocking Overhead
Teff = [1 + (S - 1) * b] * [(1 + k) * Tinst / S + c]

Sopt = √ [(1 - b) * (1 + k) * Tinst / (b * c)]

slide 11
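
The slide states Sopt without the intermediate step. One way to obtain it, sketched here on the assumption that S is treated as a continuous variable and that 0 < b < 1, is to expand Teff and set its derivative with respect to S to zero:

```latex
\begin{align*}
T_{\mathrm{eff}}(S)
  &= \bigl[1 + (S-1)\,b\bigr]\left[\frac{(1+k)\,T_{\mathrm{inst}}}{S} + c\right] \\
  &= \frac{(1-b)(1+k)\,T_{\mathrm{inst}}}{S} + (1-b)\,c
     + b\,(1+k)\,T_{\mathrm{inst}} + b\,c\,S \\[4pt]
\frac{dT_{\mathrm{eff}}}{dS}
  &= -\frac{(1-b)(1+k)\,T_{\mathrm{inst}}}{S^{2}} + b\,c = 0 \\[4pt]
S_{\mathrm{opt}}
  &= \sqrt{\frac{(1-b)(1+k)\,T_{\mathrm{inst}}}{b\,c}}
\end{align*}
```

The second derivative, 2(1 - b)(1 + k) Tinst / S^3, is positive for S > 0, so this stationary point is a minimum.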

Partitioning instruction into cycles with non-uniform stage times

One action - one pipeline stage
=> large quantization overhead

Multiple actions per stage?
Multiple stages per action?
slide 12

Example
Actions of one instruction and their delays (listed from the last action to the first):

Put Away      2 ns
Execute       7+7+8 ns
Data - ALU    3 ns
Cache Data    10 ns
Cache Dir     6 ns
Addr - MAR    3 ns
Gen Addr      9 ns
Decode        6+6 ns
Data - IR     3 ns
Cache Data    10 ns
Cache Dir     6 ns
PC - MAR      4 ns

slide 13

Optimal Pipelining
Tinst = 4+6+10+3+12+9+3+6+10+3+22+2
      = 90 ns
b = 0.2
c = 4 ns
k = 5%

Sopt = √ [(1 - b) * (1 + k) * Tinst / (b * c)]
     = 9.7 ⇒ 9
Pmax = Tinst / S = 10 ns (ideal value for S = 9)

slide 14
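
The arithmetic on this slide can be checked with a few lines of Python. This is an illustrative sketch; the names `ACTION_TIMES_NS` and `optimal_stages` are assumptions, while the numbers are the ones given in the example.

```python
import math

# Action delays in ns, in the order they are summed on the slide
# (Decode = 6+6 = 12 and Execute = 7+7+8 = 22 are entered as totals).
ACTION_TIMES_NS = [4, 6, 10, 3, 12, 9, 3, 6, 10, 3, 22, 2]


def optimal_stages(t_inst, b, k, c):
    """Sopt = sqrt((1 - b) * (1 + k) * Tinst / (b * c))."""
    return math.sqrt((1 - b) * (1 + k) * t_inst / (b * c))


t_inst = sum(ACTION_TIMES_NS)
s_opt = optimal_stages(t_inst, b=0.2, k=0.05, c=4)
print(f"Tinst = {t_inst} ns, Sopt = {s_opt:.2f}")
```

The unrounded value is about 9.72, which the slide truncates to 9 stages; the later comparison also evaluates S = 10.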


[Figure: the actions of the example grouped into S = 10 pipeline stages]

S = 10
Pmax = 10 ns
Δt = 14.5 ns
S * Δt = 145 ns


[Figure: the actions of the example grouped into S = 9 pipeline stages]

S = 9
Pmax = 13 ns
Δt = 17.65 ns


[Figure: the actions of the example grouped into S = 5 pipeline stages]

S = 5
Pmax = 20 ns
Δt = 25 ns

Comparison

S     Pmax (ns)   Δt (ns)   S * Δt (ns)   Teff (ns)
9     13          17.65     159           45.89
10    10          14.50     145           40.60
5     20          25.00     125           45.00

slide 18
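
The rows of the comparison follow from Δt = (1 + k) * Pmax + c and Teff = [1 + (S - 1) * b] * Δt with b = 0.2, c = 4 ns and k = 5%. A short Python sketch that recomputes them is given below; the names `PARTITIONS`, `cycle_time`, `teff` and `compare` are assumptions for illustration.

```python
B, K, C = 0.2, 0.05, 4.0           # parameters of the example (c in ns)
PARTITIONS = {9: 13, 10: 10, 5: 20}  # S -> Pmax in ns, from the three slides


def cycle_time(p_max, k=K, c=C):
    """Δt = (1 + k) * Pmax + c."""
    return (1 + k) * p_max + c


def teff(stages, dt, b=B):
    """Teff = [1 + (S - 1) * b] * Δt."""
    return (1 + (stages - 1) * b) * dt


def compare(partitions=PARTITIONS):
    """Return {S: (Pmax, Δt, S * Δt, Teff)} for each partitioning."""
    table = {}
    for s, p_max in partitions.items():
        dt = cycle_time(p_max)
        table[s] = (p_max, dt, s * dt, teff(s, dt))
    return table


if __name__ == "__main__":
    for s, (p_max, dt, total, t) in compare().items():
        print(f"S = {s:2d}  Pmax = {p_max:2d}  Δt = {dt:5.2f}  "
              f"S*Δt = {total:6.2f}  Teff = {t:5.2f}")
```

S = 10 gives the lowest Teff of the three, even though S = 5 has the smallest total latency S * Δt.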

Cycle Quantization
Delays are not integral multiples of the clock period
Total overhead = clocking overhead + quantization overhead
Δt ≥ Tinst / S + c (ignoring k)
∴ S * Δt ≥ Tinst + S * c
Quantization overhead = S * (Δt - c) - Tinst
This overhead shrinks as the clock period becomes smaller

slide 19
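
As an illustration, the split into clocking and quantization overhead can be computed for the three partitionings of the earlier example (Tinst = 90 ns, c = 4 ns). This is a sketch under the slide's simplification that k is ignored, so Δt = Pmax + c; the function name `overheads` is an assumption.

```python
T_INST = 90.0  # ns, from the example
C = 4.0        # ns, fixed clocking overhead per stage


def overheads(stages, p_max, c=C, t_inst=T_INST):
    """Return (clocking overhead, quantization overhead) in ns, ignoring k."""
    dt = p_max + c                             # Δt with k ignored
    clocking = stages * c                      # S * c
    quantization = stages * (dt - c) - t_inst  # S * (Δt - c) - Tinst
    return clocking, quantization


if __name__ == "__main__":
    for s, p_max in ((9, 13), (10, 10), (5, 20)):
        clk, quant = overheads(s, p_max)
        print(f"S = {s:2d}: clocking = {clk} ns, quantization = {quant} ns")
```

For S = 10 the two parts are 40 ns and 10 ns, which together account for the gap between S * Δt = 140 ns (with k ignored) and Tinst = 90 ns.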

Other Timing Approaches
• Self Timed Circuits
  – No centralized free running clock
  – An operation begins as soon as its inputs are available, that is, all its predecessors have completed
  – Higher speed, lower power consumption
• Wave Pipelining
  – Omit inter-stage registers
  – Reduced clocking overhead
slide 20

Conventional vs Wave Pipelining
Conventional Pipeline
• Registers separate adjoining stages
• Clock period > max prop delay
• Inter-stage data stored in registers

Wave Pipeline
• No registers between adjoining stages
• Clock period less than max prop delay
• Waves of data propagate through combinational network (effectively, data is stored in the combinational circuit delay!)

slide 21

No pipelining
[Figure: Reg → X → combinational logic → X’ → Reg → Y, driven by a common clock; timing diagram of X, X’ and Y]

slide 22

Conventional pipelining
[Figure: pipeline with inter-stage registers and signals X, X’, Y, Y’, Z, Z’, W, driven by a common clock; timing diagram of the same signals]

Wave pipelining
[Figure: Reg → X → combinational network → Z’ → Reg → W, with no inter-stage registers, driven by a common clock; timing diagram of X, Z’ and W]

slide 24

Timing
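[Figure: Reg → Comb ckt → Reg with input X and output Y, driven by a common clock; timing diagram of X and Y over one clock period]

T ≥ p + s
T = clock period
p = propagation delay
s = set-up time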
slide 25

Timing with clock skew
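[Figure: Reg → Comb ckt → Reg with input X and output Y, driven by a common clock; timing diagram of X and Y showing p, s and a skew margin δ on either side]

Clock skew = ±δ
T ≥ p + s + 2δ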
slide 26

Variation in propagation delay
• Different delays in different paths
• Delay variation due to process / temperature / power variations
• Data-dependent delay variations

slide 27

Timing for wave pipelining
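[Figure: Reg → Comb ckt → Reg with input X and output Y, driven by a common clock of period T with skew ±δ; timing diagram showing Y changing between pmin and pmax after X]

Δp = pmax - pmin
T ≥ Δp + s + 4δ

slide 28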

Timing for wave pipelining
(expanded view)
[Figure: timing diagram of X and Y over several clock periods, marking (n-1) T, nT, pmin, pmax and Δp]

pmin ≥ (n-1) T + 2δ
nT ≥ pmax + s + 2δ
⇒ T ≥ Δp + s + 4δ
slide 29
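
The last implication combines the two constraints. A short derivation, written with Δp = pmax - pmin as the assumed definition of the delay spread, rewrites the first constraint as an upper bound on (n-1)T and subtracts it from the second:

```latex
\begin{align*}
nT &\ge p_{\max} + s + 2\delta \\
(n-1)\,T &\le p_{\min} - 2\delta \\[4pt]
nT - (n-1)\,T &\ge \bigl(p_{\max} + s + 2\delta\bigr) - \bigl(p_{\min} - 2\delta\bigr) \\
T &\ge \Delta p + s + 4\delta,
  \qquad \Delta p = p_{\max} - p_{\min}
\end{align*}
```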

Comparison
                 Conventional Pipeline                      Wave Pipeline
Clock period     T ≥ pmax/n + s + 2δ                        T ≥ Δp + s + 4δ
                 (plus cycle quantization overhead)
Latency          nT ≥ pmax + ns + 2nδ                       nT ≥ pmax + s + 2δ

slide 30
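
The two sets of bounds can be compared numerically. The Python sketch below is illustrative only: the function names are assumptions, and the delay values (pmin = 36, pmax = 40, s = 1, δ = 0.5, in arbitrary time units) are hypothetical numbers chosen so that the arithmetic is exact, not data from the lecture.

```python
def conventional_min_period(p_max, n, s, delta):
    """Lower bound T >= pmax/n + s + 2δ (cycle quantization overhead not included)."""
    return p_max / n + s + 2 * delta


def wave_min_period(p_min, p_max, s, delta):
    """Lower bound T >= Δp + s + 4δ, with Δp = pmax - pmin."""
    return (p_max - p_min) + s + 4 * delta


def wave_feasible(t, n, p_min, p_max, s, delta):
    """Check both wave-pipelining constraints for clock period t and n waves."""
    return (p_min >= (n - 1) * t + 2 * delta
            and n * t >= p_max + s + 2 * delta)


if __name__ == "__main__":
    # Hypothetical delays, for illustration only.
    p_min, p_max, s, delta = 36, 40, 1, 0.5
    print("wave bound:        ", wave_min_period(p_min, p_max, s, delta))
    print("conventional, n=6: ", conventional_min_period(p_max, 6, s, delta))
    for n in (5, 6, 7):
        print(f"T = 7, n = {n}: feasible =",
              wave_feasible(7, n, p_min, p_max, s, delta))
```

With these numbers the wave-pipelined bound is T ≥ 7 against roughly 8.67 for a six-stage conventional pipeline, and at T = 7 only n = 6 satisfies both constraints.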

Problems with wave pipelining
• Need to balance delays
• Narrow range of clock frequencies
• Control difficult
• Not very suitable for non-linear pipelines

slide 31

References
1. M.J. Flynn, "Computer Architecture: Pipelined and Parallel Processor Design", Narosa Publishing House / Jones and Bartlett, 1996.
2. Wayne P. Burleson, Maciej Ciesielski, Fabian Klass, and Wentai Liu, "Wave-Pipelining: A Tutorial and Research Survey", IEEE Trans. on VLSI Systems, vol. 6, no. 3, September 1998, pp. 464–474.

slide 32
