1. Processor: Superscalar Pipeline Organization
Z. Jerry Shi
Computer Science and Engineering
University of Connecticut
* Slides adapted from Blumrich&Gschwind/ELE475’03, Peh/ELE475’*
2. Targeting better performance
•Factors that decide the execution time
Execution Time = Path Length × CPI × Cycle Time
•Exploit parallelism
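The execution-time relation above can be evaluated directly. The following is a minimal sketch; the workload numbers (instruction count, CPI, cycle time) are hypothetical values chosen only for illustration.

```python
def execution_time(path_length, cpi, cycle_time):
    """Execution time = path length (instruction count) x CPI x cycle time."""
    return path_length * cpi * cycle_time

# Hypothetical workload: 1 million instructions, CPI of 1.5, 1 ns cycle time.
baseline = execution_time(1_000_000, 1.5, 1e-9)

# Same program and clock, but CPI halved (for example, by overlapping instructions).
improved = execution_time(1_000_000, 0.75, 1e-9)

speedup = baseline / improved
print(baseline, improved, speedup)
```

Any of the three factors can be attacked; the rest of the slides concentrate on lowering CPI while keeping cycle time short.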
5. Pipelining
•An implementation technique whereby multiple instructions are overlapped in execution
–The parallelism among instructions in a sequential stream
–The parallelism among actions needed to execute an instruction
•Divide the execution into multiple steps and do one step each time
–Each step is called a pipe stage or a pipe segment
•Pipeline throughput: how often an instruction leaves the pipeline
•Need to balance the length of each pipeline stage
–Processor cycle time is determined by the slowest stage
•Ideally, the speedup is the number of pipe stages. However,…
–Ideal time per instruction on the pipelined machine = Time per instruction on unpipelined machine / Number of pipe stages
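A small sketch of why stage balance matters: the cycle time is set by the slowest stage, so the speedup over the unpipelined machine only reaches the number of stages when all stages take equal time. The stage delays below are hypothetical, and latch overhead is ignored.

```python
def pipeline_speedup(stage_delays):
    """Speedup of a pipelined design over an unpipelined one.

    Unpipelined time per instruction is the sum of all stage delays;
    pipelined time per instruction is one cycle, set by the slowest stage.
    """
    unpipelined_time = sum(stage_delays)
    cycle_time = max(stage_delays)
    return unpipelined_time / cycle_time

balanced = pipeline_speedup([2, 2, 2, 2, 2])    # five equal stages
unbalanced = pipeline_speedup([1, 2, 3, 2, 2])  # same total work, one slow stage
print(balanced, unbalanced)
```

With the same total work, the unbalanced pipeline falls well short of the ideal five-fold speedup.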
13. Towards Ideal Pipeline CPI
Pipeline CPI =
Ideal pipeline CPI + Structural Stalls + Data Hazard Stalls + Control Stalls
–Ideal pipeline CPI: measure of the maximum performance attainable by the implementation
–Structural hazards: HW cannot support this combination of instructions
–Data hazards: Instruction depends on the result of prior instructions
–Control hazards: Caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)
•Stall the pipeline when there is a hazard
–Any instructions issued earlier than the stalled instruction continue
–Any instructions after the stalled instruction are also stalled
•No new instructions are fetched
15. Performance impact of structural hazards
Processor A: ideal CPI = 1, no structural hazards, clock rate = 1
Processor B: 40% of the instructions cause a structural hazard (one stall cycle each), clock rate = 1.05
Which one is faster?
Instruction count is the same. Need to consider time per instr. only
The average time per instruction for the processor with the structural hazard is
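$$\text{Average instruction time} = \text{CPI} \times \text{Cycle time} = (1 + 0.4 \times 1) \times \frac{\text{Cycle time}_{\text{ideal}}}{1.05} \approx 1.3 \times \text{Cycle time}_{\text{ideal}}$$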
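The comparison can be checked numerically. This sketch assumes each structural hazard costs exactly one stall cycle and expresses time in units of the ideal cycle time; the function name is illustrative.

```python
def avg_instr_time(ideal_cpi, hazard_fraction, stall_cycles, clock_rate):
    """Average time per instruction, in units of the ideal cycle time."""
    cpi = ideal_cpi + hazard_fraction * stall_cycles
    cycle_time = 1.0 / clock_rate  # a faster clock means a shorter cycle
    return cpi * cycle_time

t_no_hazard = avg_instr_time(1.0, 0.0, 1, 1.00)
t_hazard = avg_instr_time(1.0, 0.4, 1, 1.05)
print(t_no_hazard, t_hazard)
```

Despite its 5% faster clock, the processor with the structural hazard takes about 1.3 times as long per instruction, so the hazard-free processor is faster.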
22. Examples of data forwarding
Cycle             1   2   3   4   5   6   7   8   9
LD R2, 0(R11)     IF  ID  EX  ME  WB
ADD R1, R2, R3        IF  ID  -   EX  ME  WB
ADD R4, R1, R4            IF  -   ID  EX  ME  WB
ADD R5, R1, R5                    IF  ID  EX  ME  WB

Cycle             1   2   3   4   5   6   7   8   9
LD R2, 0(R11)     IF  ID  EX  ME  WB
ST R2, 0(R12)         IF  ID  EX  ME  WB
ADD R1, R3, R4            IF  ID  EX  ME  WB
ST R1, 0(R13)                 IF  ID  EX  ME  WB
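The two timing diagrams of this slide can be reproduced with a small in-order scheduler. This is a sketch under stated assumptions: a 5-stage pipeline with full forwarding, a loaded value usable in EX only in the cycle after the load's ME stage, and store data forwarded into the store's own ME stage (so a store's data register is not listed as an EX source). The tuple layout and function name are illustrative.

```python
STAGES = ["IF", "ID", "EX", "ME", "WB"]

def schedule(program):
    """Return, for each instruction, the cycle in which it enters each stage.

    program: list of (op, dest, ex_sources), where ex_sources are the
    registers needed at the start of EX.
    """
    timeline = []
    for op, dest, ex_sources in program:
        entry = {}
        prev = timeline[-1] if timeline else None
        for s, stage in enumerate(STAGES):
            # Earliest possible cycle: one after entering the previous stage.
            cycle = entry[STAGES[s - 1]] + 1 if s > 0 else 1
            if prev is not None:
                if s + 1 < len(STAGES):
                    # A stage frees up when the older instruction moves on.
                    cycle = max(cycle, prev[STAGES[s + 1]])
                else:
                    cycle = max(cycle, prev[stage] + 1)
            if stage == "EX":
                # Load-use hazard: wait until the load has finished ME.
                for (p_op, p_dest, _), p_entry in zip(program, timeline):
                    if p_op == "LD" and p_dest in ex_sources:
                        cycle = max(cycle, p_entry["ME"] + 1)
            entry[stage] = cycle
        timeline.append(entry)
    return timeline

first = schedule([
    ("LD", "R2", ["R11"]),
    ("ADD", "R1", ["R2", "R3"]),
    ("ADD", "R4", ["R1", "R4"]),
    ("ADD", "R5", ["R1", "R5"]),
])
second = schedule([
    ("LD", "R2", ["R11"]),
    ("ST", None, ["R12"]),   # R2 is store data, forwarded into ME
    ("ADD", "R1", ["R3", "R4"]),
    ("ST", None, ["R13"]),   # R1 is store data, forwarded into ME
])
for row in first + second:
    print(row)
```

In the first sequence the dependent ADD enters EX one cycle late and the delay ripples to the instructions behind it; in the second sequence no instruction stalls.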
23. Try producing fast code for
a = b + c;
d = e - f;
assuming a, b, c, d, e, and f are in memory.
Slow code:
LW Rb,b
LW Rc,c
ADD Ra,Rb,Rc
SW a,Ra
LW Re,e
LW Rf,f
SUB Rd,Re,Rf
SW d,Rd
Software scheduling to avoid load hazards
Fast code:
LW Rb,b
LW Rc,c
LW Re,e
ADD Ra,Rb,Rc
LW Rf,f
SW a,Ra
SUB Rd,Re,Rf
SW d,Rd
Compiler optimizes for performance. Hardware checks for safety.
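A quick way to compare the two schedules is to count load-use stalls. The sketch below assumes a 5-stage pipeline with forwarding, where the only stall is a load immediately followed by an instruction that needs the loaded register in EX; store data is assumed to be forwarded to the store's memory stage, so SW lists no EX sources. The tuple encoding is illustrative.

```python
def load_use_stalls(program):
    """Count one-cycle load-use stalls.

    program: list of (op, dest, ex_sources); ex_sources are the registers
    an instruction needs at the start of its EX stage.
    """
    stalls = 0
    for (prev_op, prev_dest, _), (_, _, ex_sources) in zip(program, program[1:]):
        if prev_op == "LW" and prev_dest in ex_sources:
            stalls += 1
    return stalls

slow = [
    ("LW", "Rb", []), ("LW", "Rc", []), ("ADD", "Ra", ["Rb", "Rc"]),
    ("SW", None, []),
    ("LW", "Re", []), ("LW", "Rf", []), ("SUB", "Rd", ["Re", "Rf"]),
    ("SW", None, []),
]
fast = [
    ("LW", "Rb", []), ("LW", "Rc", []), ("LW", "Re", []),
    ("ADD", "Ra", ["Rb", "Rc"]),
    ("LW", "Rf", []), ("SW", None, []),
    ("SUB", "Rd", ["Re", "Rf"]), ("SW", None, []),
]
slow_stalls = load_use_stalls(slow)
fast_stalls = load_use_stalls(fast)
print(slow_stalls, fast_stalls)
```

Under these assumptions the slow code stalls twice (after `LW Rc` and after `LW Rf`), while the rescheduled code never places a consumer directly behind its load.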
25. Handling control hazards
Branch instruction      IF   ID   EX   MEM  WB
Branch successor             IF   IF   ID   EX   MEM  WB
Branch successor + 1                   IF   ID   EX   MEM  WB
Branch successor + 2                        IF   ID   EX   MEM  WB
• Freeze/flush the pipeline. Wait until the branch destination is known
–Penalty is fixed
• Treat every branch as not taken
• Treat every branch as taken
–Any advantages in our 5-stage pipeline?
• Delayed branch
Branch instruction
Sequential successor 1
Branch target if taken
What if the condition is not resolved until the EX stage?
26. Predicted Not Taken
Untaken branch instr.   IF   ID   EX   MEM  WB
Branch successor             IF   ID   EX   MEM  WB
Branch successor + 1              IF   ID   EX   MEM  WB
Branch successor + 2                   IF   ID   EX   MEM  WB

Taken branch instruction: IF ID EX MEM WB
Branch successor: IF IF ID EX MEM WB
Branch target: IF ID EX MEM WB
Branch successor + 1: IF ID EX MEM WB
Branch successor + 2: IF ID EX MEM WB
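The cost of each scheme can be expressed as extra CPI. This is a sketch assuming a one-cycle branch penalty (the repeated IF in the diagrams) and an otherwise ideal CPI of 1; the branch frequency and taken fraction are hypothetical numbers, not measurements from the slides.

```python
def branch_cpi(branch_freq, penalty_cycles, fraction_penalized):
    """CPI of an otherwise ideal pipeline (CPI = 1) with branch stalls."""
    return 1.0 + branch_freq * fraction_penalized * penalty_cycles

BRANCH_FREQ = 0.15     # hypothetical: 15% of instructions are branches
TAKEN_FRACTION = 0.60  # hypothetical: 60% of branches are taken

# Flush/freeze: every branch pays the penalty.
cpi_flush = branch_cpi(BRANCH_FREQ, 1, 1.0)

# Predicted not taken: only taken branches pay the penalty.
cpi_not_taken = branch_cpi(BRANCH_FREQ, 1, TAKEN_FRACTION)

print(cpi_flush, cpi_not_taken, cpi_flush / cpi_not_taken)
```

The more often branches fall through, the larger the gain of predicting not taken over always flushing.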
27. Scheduling the branch delay slot
•a) is the best choice, fills delay slot & reduces instruction count (IC)
•In b), the sub instruction may need to be copied, increasing IC
•In b) and c), it must be okay to execute sub when branch fails
28. Delayed Branch
•Compiler effectiveness for single branch delay slot:
–Fills about 60% of branch delay slots
–About 80% of instructions executed in branch delay slots useful in computation
–About 50% (60% x 80%) of slots usefully filled
•Delayed branch downside:
–As processors move to deeper pipelines and multiple issue, the branch delay grows and more than one delay slot is needed
–Delayed branching has lost popularity compared to more expensive but more flexible dynamic approaches
–Growth in available transistors has made dynamic approaches relatively cheaper
30. Evaluating Branch Alternatives
Branch scheme        Speedup vs flush    Delayed branch
Flush                1
Predicted taken      1.06                1.14
Predicted untaken    1.12                1.19
For delayed branch, 50% of the slots can be filled with useful instructions.
33. Latency and initiation interval
•Latency: the number of intervening cycles between an instruction that produces a result and an instruction that uses the result
–Typically 1 cycle less than the depth of the execution pipeline
•For example, LD has a two-stage execution, so its latency is 1 cycle when the following instruction is not ST
•Initiation interval: the number of cycles that must elapse between issuing two operations to the same functional unit
For example, a multiplier with a latency of 7 cycles
Unpipelined: initiation interval is 7 cycles. 1, 8, 15, …
Pipelined: initiation interval is 1 cycle. 1, 2, 3, …
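The issue cycles in the multiplier example follow directly from the initiation interval. A minimal sketch (the function name is illustrative):

```python
def issue_cycles(n_ops, initiation_interval, first_cycle=1):
    """Cycles in which n_ops back-to-back operations can be issued to one unit."""
    return [first_cycle + i * initiation_interval for i in range(n_ops)]

unpipelined = issue_cycles(3, 7)  # a new multiply only every 7 cycles
pipelined = issue_cycles(3, 1)    # a new multiply every cycle
print(unpipelined, pipelined)
```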
34. Latencies and initiation intervals for functional units
Functional unit    # of execution stages    Latency    Initiation interval
Integer ALU        1                        0          1
Data memory        2                        1          1
FP add             4                        3          1
FP multiply        7                        6          1
FP divide          25                       24         25
35. Pipeline timing of a set of independent FP operations
•Instructions are fetched and sent to functional units in order
•Instructions do not complete in order because their execution lengths differ
Cycle    1   2   3   4   5   6   7   8   9   10  11
MUL.D    IF  ID  M1  M2  M3  M4  M5  M6  M7  ME  WB
ADD.D        IF  ID  A1  A2  A3  A4  ME  WB
L.D              IF  ID  EX  ME  WB
S.D                  IF  ID  EX  ME  WB
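The out-of-order completion can be derived from the execution depth of each unit. This sketch assumes one instruction is fetched per cycle with no stalls, and that every instruction passes through IF, ID, its execution stages, ME, and WB; the dictionary of stage counts mirrors the timing diagram of this slide.

```python
EXEC_STAGES = {"MUL.D": 7, "ADD.D": 4, "L.D": 1, "S.D": 1}

def writeback_cycles(ops):
    """Map each operation to the cycle of its WB stage (no stalls assumed)."""
    result = {}
    for i, op in enumerate(ops):
        fetch_cycle = i + 1  # one fetch per cycle, in program order
        # IF, then ID (+1), execution stages, then ME and WB (+2).
        result[op] = fetch_cycle + 1 + EXEC_STAGES[op] + 2
    return result

wb = writeback_cycles(["MUL.D", "ADD.D", "L.D", "S.D"])
completion_order = sorted(wb, key=wb.get)
print(wb, completion_order)
```

MUL.D is fetched first but writes back last, which is the situation that later slides on write-port conflicts and precise exceptions have to deal with.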
36. FP code sequence showing the stalls (from RAW)
Cycle            1   2   3   4   5   6   7   8   9
L.D F4, 0(R2)    IF  ID  EX  ME  WB
MUL F0,F4,F6         IF  ID  -   M1  M2  M3  M4  M5
ADD F2,F0,F8             IF  -   ID  -   -   -   -
S.D F2, 0(R2)                    IF  -   -   -   -

Cycle            10  11  12  13  14  15  16  17  18
L.D F4, 0(R2)
MUL F0,F4,F6     M6  M7  ME  WB
ADD F2,F0,F8     -   -   A1  A2  A3  A4  ME  WB
S.D F2, 0(R2)    -   -   ID  EX  -   -   -   ME  WB
37. Handling multiple writes to register file
•Track the use of the write port in the ID stage and stall an instruction before it issues
–Stalls the instruction if it would write in the same cycle as an instruction already issued
–Use a shift register to track in which cycles already-issued instructions will need the register file write port
•Stall a conflicting instruction when it tries to enter either MEM or WB stage
–May choose either instruction
•May give priority to instructions with long latencies
–Does not detect conflict until the entrance of the MEM or WB stage, where it is easy to see
–Complicates pipeline control as stalls may arise from two places
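The shift-register approach from the first option can be sketched as follows. Bit k of the register means "the write port is already reserved k cycles from now"; the register shifts by one position every cycle. The class name, depth, and the ID-to-WB distances (execution stages + 2) are assumptions for illustration.

```python
class WritePortTracker:
    """Shift register that reserves the register-file write port at issue time."""

    def __init__(self, depth=32):
        self.bits = [False] * depth  # bits[k]: port reserved k cycles from now

    def tick(self):
        """Advance one clock cycle: every reservation moves one step closer."""
        self.bits = self.bits[1:] + [False]

    def try_issue(self, cycles_until_wb):
        """Reserve the port if it is free; return False if ID must stall."""
        if self.bits[cycles_until_wb]:
            return False
        self.bits[cycles_until_wb] = True
        return True

tracker = WritePortTracker()
mul_issued = tracker.try_issue(9)    # FP multiply: 7 execution stages + ME + WB
for _ in range(6):
    tracker.tick()
load_first_try = tracker.try_issue(3)   # load: would write in the same cycle
tracker.tick()                          # stall the load in ID for one cycle
load_second_try = tracker.try_issue(3)
print(mul_issued, load_first_try, load_second_try)
```

The load that reaches ID six cycles after the multiply would collide with it at the write port, so it is held for one cycle and then issues.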
38. Problems with Pipelining
•Exception: An unusual event happens to an instruction during its execution
–Examples: divide by zero, undefined opcode
•Interrupt: Hardware signal to switch the processor to a new instruction stream
–Example: a sound card interrupts when it needs more audio output samples (an audio “click” happens if it is left waiting)
•Problem: The exception or interrupt must appear to occur between two instructions (Ii and Ii+1)
–The effect of all instructions up to and including Ii is totally complete
– No effect of any instruction after Ii can take place
•The interrupt (exception) handler either aborts program or restarts at instruction Ii+1
39. Dealing with exceptions
•Exceptions are harder to handle in a pipelined processor
–An instruction is executed in several steps, making it more difficult to determine whether an instruction can safely change the state of the processor
•Other instructions in the pipeline may cause exceptions
•Examples of exceptions
–Invoking an operating system service
–Breakpoint (programmer-requested interrupt)
–Integer/FP arithmetic overflow or anomaly
–Memory access (Page fault, protection, misalignment)
–Unknown instructions
–Hardware malfunctions
–I/O request
–Power failure
40. Classification of exceptions
•Synchronous versus asynchronous
–Occur at the same place every time the program is executed?
•User requested versus coerced
–User asks for it?
•User maskable versus nonmaskable
–Can be masked (disabled) by user?
•Within versus between instructions
–Occur in the middle of execution and prevent instruction completion?
•Resume versus terminate
–Can program’s execution be resumed?
41. Stopping and restarting exceptions
•Most difficult exceptions
–Occur within instructions (e.g. in the EX and MEM stages)
–Must be restartable
•Possible solutions
–Force a trap instruction into the pipeline on the next IF
–Until the trap is taken, turn off all writes for the faulting and all following instructions
–In the exception handler, save the PC of the faulting instruction
•Precise exceptions: the pipeline can be stopped so that
–the instructions before the faulting instruction can complete
–the instructions after the faulting instruction can be restarted
42. Precise Exceptions in Static Pipelines
Key observation: architected state changes only in the memory and register write stages.
43. A more complicated pipeline
•Fetch
•Decode
•Dispatch
•Issue
•Execute
•Finish
•Complete
•Retire
Branch Prediction
Dynamic Scheduling
Reorder buffer
45. Instruction Fetch
•Limit on maximum throughput of pipeline
•Fetch s instructions per cycle from I cache
•Problems with attaining throughput:
–Control flow → Branch Prediction
–Alignment of cache line and PC
46. Interactions between Instruction Fetch and Instruction Cache Structure
•In b), if a fetch group (s instructions) straddles two cache lines, the I cache must be accessed twice
–If either cache line access is a miss, the pipeline stalls
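Whether a fetch group straddles a line boundary depends only on the PC, the group size, and the line size. In this sketch the instruction size (4 bytes) and cache line size (32 bytes) are assumed values, and the function name is illustrative.

```python
def icache_accesses(pc, s, instr_bytes=4, line_bytes=32):
    """Number of I-cache lines touched when fetching s instructions at pc."""
    first_line = pc // line_bytes
    last_line = (pc + s * instr_bytes - 1) // line_bytes
    return last_line - first_line + 1

aligned = icache_accesses(pc=0, s=4)      # bytes 0..15, all in line 0
straddling = icache_accesses(pc=24, s=4)  # bytes 24..39, lines 0 and 1
print(aligned, straddling)
```

A fetch group that starts near the end of a line needs two accesses, and a miss on either one stalls the fetch.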
47. Instruction Decode
•Extract from assembly instruction
–Instruction Type (Decoder)
–Dependencies (Comparators)
–Operands (Register Files & Buses)
•CISC → RISC:
–Converted to ROP (RISC OP)
50. Instruction Dispatch
•Dataflow:
–Send an instruction to a functional unit as soon as its operands are available, regardless of original program order.
–Tomasulo’s algorithm
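The dataflow idea can be illustrated with a toy issue loop. This is not Tomasulo's algorithm: it has no reservation stations or register renaming, it assumes unlimited functional units, and it assumes each register is either ready at the start or written exactly once. The instruction mix and latencies are hypothetical.

```python
def dataflow_issue(instrs, ready_regs, max_cycles=1000):
    """Issue each instruction in the first cycle in which its operands are ready.

    instrs: list of (name, dest, sources, latency) in program order.
    Returns a list of (issue_cycle, name).
    """
    ready_at = {reg: 0 for reg in ready_regs}  # register -> cycle value is ready
    pending = list(instrs)
    issued = []
    cycle = 0
    while pending:
        cycle += 1
        if cycle > max_cycles:
            raise RuntimeError("an operand is never produced")
        for ins in list(pending):
            name, dest, sources, latency = ins
            if all(ready_at.get(r, float("inf")) <= cycle for r in sources):
                issued.append((cycle, name))
                ready_at[dest] = cycle + latency
                pending.remove(ins)
    return issued

program = [
    ("L.D", "F4", ["R2"], 2),
    ("MUL", "F0", ["F4", "F6"], 7),
    ("ADD", "F2", ["F0", "F8"], 4),
    ("SUB", "F10", ["F6", "F8"], 4),  # independent of the chain above
]
order = dataflow_issue(program, ready_regs=["R2", "F6", "F8"])
print(order)
```

The independent SUB issues in the first cycle, ahead of the MUL and ADD that precede it in program order, because its operands are already available.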
52. Instruction Execution
•How many functional units? Why different types?
–Constraints of area, power, interconnection, etc.
•You cannot put as many as you want
–Mix of functional units may not be ideal for some applications
•Bypassing
–Bypassing needed between functional units to minimize stalls
53. Instruction Completion & Retiring
•Completion → Registers
•Reorder/Store buffer in between
–Registers in the buffer (not register file) hold the new values
•Retiring → Memory
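A minimal sketch of how a reorder buffer keeps results out of the register file until they can be committed in program order. The class, its method names, and the entry layout are assumptions for illustration; store buffering and retirement to memory are not modeled.

```python
from collections import deque

class ReorderBuffer:
    """Holds results in program order; the register file is updated in order."""

    def __init__(self):
        self.entries = deque()
        self.register_file = {}

    def dispatch(self, tag, dest):
        """Allocate an entry at the tail, in program order."""
        self.entries.append({"tag": tag, "dest": dest, "value": None, "done": False})

    def finish(self, tag, value):
        """A functional unit finished: the new value is held in the buffer."""
        for entry in self.entries:
            if entry["tag"] == tag:
                entry["value"], entry["done"] = value, True

    def complete(self):
        """Commit finished entries from the head only, preserving program order."""
        completed = []
        while self.entries and self.entries[0]["done"]:
            entry = self.entries.popleft()
            self.register_file[entry["dest"]] = entry["value"]
            completed.append(entry["tag"])
        return completed

rob = ReorderBuffer()
rob.dispatch("I1", "F0")
rob.dispatch("I2", "F2")
rob.finish("I2", 2.5)          # the younger instruction finishes first
first = rob.complete()         # nothing commits: I1 is still executing
rob.finish("I1", 1.5)
second = rob.complete()        # both commit, in program order
print(first, second, rob.register_file)
```

Because architected registers change only at the head of the buffer, an exception in I1 could discard I2's result without any visible effect.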
54. Limiting factors: Pipelining hazards
•Structural hazards
–Resource conflicts when hardware cannot support all possible combinations of instructions simultaneously
•Data hazards
–An instruction depends on the results of a previous instruction
•Control hazards
–Branch instructions that change the instruction flow