2. Parallel processing [2]
Processing instructions in parallel requires
three major tasks:
1. checking dependencies between
instructions to determine which
instructions can be grouped together for
parallel execution;
2. assigning instructions to the functional
units on the hardware;
3. determining when instructions are initiated
and placed together into a single word.
3. Major categories [2]
VLIW – Very Long Instruction Word
EPIC – Explicitly Parallel Instruction Computing
5. Superscalar Processors [1]
Superscalar processors are designed to exploit
more instruction-level parallelism in user
programs.
Only independent instructions can be executed
in parallel without causing a wait state.
The amount of instruction-level parallelism
varies widely depending on the type of code
being executed.
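As a rough illustration of what "independent" means here, the sketch below checks two instructions for register hazards. The `Instr` tuple and the `hazards` and `independent` helpers are hypothetical names chosen for this example, and the three-address form with one destination register is an assumption, not something the slides specify.

```python
from typing import NamedTuple, Tuple

class Instr(NamedTuple):
    # Hypothetical three-address form: one destination, any number of sources.
    text: str
    dest: str
    srcs: Tuple[str, ...]

def hazards(first: Instr, second: Instr) -> set:
    """Register hazards that prevent `second` from running alongside `first`."""
    found = set()
    if first.dest in second.srcs:
        found.add("RAW")  # second reads what first writes (true dependence)
    if second.dest in first.srcs:
        found.add("WAR")  # second overwrites what first still reads (anti-dependence)
    if first.dest == second.dest:
        found.add("WAW")  # both write the same register (output dependence)
    return found

def independent(first: Instr, second: Instr) -> bool:
    return not hazards(first, second)

add = Instr("add r3, r1, r2", "r3", ("r1", "r2"))
mul = Instr("mul r4, r3, r5", "r4", ("r3", "r5"))
sub = Instr("sub r6, r1, r5", "r6", ("r1", "r5"))

print(hazards(add, mul))      # mul needs the result of add
print(independent(add, sub))  # add and sub share only source registers
```

Only the RAW case is a true dependence; the other two come from register reuse and can in principle be removed by renaming.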
6. Pipelining in Superscalar Processors [1]
In order to fully utilise a superscalar processor
of degree m, m instructions must be executable
in parallel. This may not hold in every clock
cycle, in which case some of the pipelines
stall in a wait state.
In a superscalar processor, the simple
operation latency should require only one cycle,
as in the base scalar processor.
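The utilisation point can be made concrete with a small simulation. This is only a sketch under stated assumptions: in-order issue, a one-cycle latency for every operation, and reads happening before writes within a cycle (so anti-dependences are ignored). The function name `issue_groups` and the `(dest, srcs)` encoding are hypothetical.

```python
def issue_groups(instrs, m):
    """Group instructions into issue cycles for an in-order machine of degree m.

    instrs is a list of (dest, srcs) pairs. An instruction cannot issue in the
    same cycle as an earlier instruction whose result it reads or overwrites.
    """
    groups = []
    i = 0
    while i < len(instrs):
        group, written = [], set()
        while i < len(instrs) and len(group) < m:
            dest, srcs = instrs[i]
            if written & set(srcs) or dest in written:
                break  # depends on this cycle's group: wait for the next cycle
            group.append(i)
            written.add(dest)
            i += 1
        groups.append(group)
    return groups

program = [
    ("r1", ("r0",)),       # 0
    ("r2", ("r1",)),       # 1: needs the result of 0
    ("r3", ("r0",)),       # 2: independent of 1
    ("r4", ("r3", "r2")),  # 3: needs the results of 1 and 2
]

groups = issue_groups(program, m=2)
utilisation = len(program) / (len(groups) * 2)
print(groups, utilisation)
```

For this program the degree-2 machine needs three cycles for four instructions, so a third of its issue slots are idle.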
9. Superscalar Implementation
Simultaneously fetch multiple instructions
Logic to determine true dependencies involving register values
Mechanisms to communicate these values
Mechanisms to initiate multiple instructions in parallel
Resources for parallel execution of multiple instructions
Mechanisms for committing process state in correct order
10. Some Architectures
PowerPC 604
– six independent execution units:
Branch execution unit
Load/Store unit
3 Integer units
Floating-point unit
– in-order issue
– register renaming
PowerPC 620
– adds out-of-order issue to the features of the 604
Pentium
– three independent execution units:
2 Integer units
Floating point unit
– in-order issue
11. VLIW
Very Long Instruction Word (VLIW) architectures are used for executing more
than one basic instruction at a time.
These processors contain multiple functional units. They fetch from the
instruction cache a very long instruction word containing several basic
instructions, and dispatch the entire word for parallel execution. These
capabilities are exploited by compilers, which generate code in which
independent primitive instructions that can execute in parallel are grouped together.
VLIW has been described as a natural successor to RISC (Reduced Instruction
Set Computing), because it moves complexity from the hardware to the compiler,
allowing simpler, faster processors.
VLIW eliminates the complicated instruction scheduling and parallel dispatch
that occur in most modern microprocessors.
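A minimal sketch of the compiler's side of this bargain: packing operations into fixed-format words and padding unused slots with NOPs. The word layout in `SLOTS`, the function name `pack`, and the operation encoding are all assumptions made for illustration; a real VLIW compiler would also reorder operations rather than pack them strictly in program order.

```python
SLOTS = ("mem", "alu", "alu")  # hypothetical word layout: one memory slot, two ALU slots

def pack(ops):
    """Greedy in-order packing of (unit, dest, srcs, text) operations into words."""
    words = []
    current, written = [None] * len(SLOTS), set()
    for unit, dest, srcs, text in ops:
        # First free slot of the right kind in the current word, if any.
        slot = next((k for k, u in enumerate(SLOTS)
                     if u == unit and current[k] is None), None)
        conflict = dest in written or written & set(srcs)
        if slot is None or conflict:
            # Close the current word and start a new one.
            words.append(current)
            current, written = [None] * len(SLOTS), set()
            slot = SLOTS.index(unit)
        current[slot] = text
        written.add(dest)
    if any(current):
        words.append(current)
    return [tuple(s or "nop" for s in w) for w in words]

ops = [
    ("mem", "r1", ("r0",), "ld r1,[r0]"),
    ("alu", "r2", ("r5", "r6"), "add r2,r5,r6"),
    ("alu", "r3", ("r5", "r7"), "sub r3,r5,r7"),
    ("alu", "r4", ("r1", "r2"), "mul r4,r1,r2"),
]

for word in pack(ops):
    print(word)
```

The first three operations are independent and fill one word; the multiply depends on two of them, so it goes into a second word whose other slots are NOPs.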
12. WHY VLIW?
The key to higher performance in microprocessors for a broad range of
applications is the ability to exploit fine-grain, instruction-level
parallelism.
Some methods for exploiting fine-grain parallelism include:
Pipelining
Multiple processors
Superscalar implementation
Specifying multiple independent operations per instruction
13. Architecture Comparison: CISC, RISC & VLIW
CHARACTERISTIC          CISC                      RISC                       VLIW
INSTRUCTION SIZE        Varies                    One size, usually 32 bits  One size
INSTRUCTION FORMAT      Field placement varies    Regular, consistent        Regular, consistent
                                                  placement of fields        placement of fields
INSTRUCTION SEMANTICS   Varies from simple to     Almost always one          Many simple,
                        complex; possibly many    simple operation           independent
                        dependent operations                                 operations
                        per instruction
REGISTERS               Few, sometimes special    Many, general-purpose      Many, general-purpose
14. Architecture Comparison: CISC, RISC & VLIW
CHARACTERISTIC          CISC                      RISC                       VLIW
MEMORY REFERENCES       Bundled with operations   Not bundled with           Not bundled with
                        in many different types   operations, i.e.,          operations, i.e.,
                        of instructions           load/store                 load/store
                                                  architecture               architecture
HARDWARE DESIGN FOCUS   Exploit microcoded        Exploit implementations    Exploit implementations
                        implementations           with one pipeline and      with multiple pipelines,
                                                  no microcode               no microcode and no
                                                                             complex dispatch logic
PICTURES OF FIVE TYPICAL INSTRUCTIONS   [figures not reproduced]
15. Advantages of VLIW
VLIW processors rely on the compiler that generates the VLIW code to
explicitly specify parallelism. Relying on the compiler has advantages.
VLIW architecture reduces hardware complexity. VLIW simply moves
complexity from hardware into software.
16. What is ILP?
Instruction-level parallelism (ILP) is a measure of how many of the
operations in a computer program can be performed simultaneously.
A system is said to embody ILP if multiple instructions
run on it at the same time.
ILP can have a significant effect on performance, which is critical to
embedded systems.
ILP also provides a form of power saving: with more work done per cycle, the clock can be slowed.
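One common way to put a number on this measure is to divide the instruction count by the length of the longest dependence chain. The sketch below does that under assumptions that are not in the slides: every instruction takes one cycle, there are unlimited functional units, and the dependence lists are given by hand. The name `available_ilp` is hypothetical.

```python
def available_ilp(deps):
    """Instructions divided by the longest dependence chain (unit latency).

    deps maps each instruction number to the earlier instructions it depends on.
    """
    depth = {}
    for i in sorted(deps):
        # An instruction sits one level below its deepest predecessor.
        depth[i] = 1 + max((depth[p] for p in deps[i]), default=0)
    return len(deps) / max(depth.values())

# Two independent chains of three levels each: six instructions, depth 3.
two_chains = {1: [], 2: [1], 3: [2], 4: [], 5: [4], 6: [5]}
# One serial chain: nothing can overlap.
serial = {1: [], 2: [1], 3: [2], 4: [3]}

print(available_ilp(two_chains))  # 2.0
print(available_ilp(serial))      # 1.0
```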
17. What we intend to do with ILP?
We use micro-architectural techniques to exploit ILP. The techniques include:
Instruction pipelining, which depends on CPU caches to keep the pipeline supplied.
Register renaming, a technique used to avoid unnecessary serialization of
program operations imposed by the reuse of registers by those operations.
Speculative execution, which reduces pipeline stalls due to control dependencies.
Branch prediction, which is used to keep the pipeline full.
Superscalar execution, in which multiple execution units are used to execute
multiple instructions in parallel.
Out-of-order execution, which reduces pipeline stalls due to operand dependencies.
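Of these techniques, register renaming is the easiest to show in a few lines. The sketch below is an assumption-laden illustration, not a description of any particular processor: it gives every write a fresh physical register from an unbounded pool (`p0`, `p1`, ...) and never frees one. The function name `rename` and the `(dest, srcs)` encoding are hypothetical.

```python
from itertools import count

def rename(instrs):
    """Rewrite (dest, srcs) instructions onto fresh physical registers."""
    fresh = (f"p{n}" for n in count())
    mapping = {}  # architectural register -> current physical register
    renamed = []
    for dest, srcs in instrs:
        new_srcs = []
        for s in srcs:
            if s not in mapping:          # value live on entry: give it a home
                mapping[s] = next(fresh)
            new_srcs.append(mapping[s])   # read the most recent version
        mapping[dest] = next(fresh)       # every write gets a new register
        renamed.append((mapping[dest], tuple(new_srcs)))
    return renamed

program = [
    ("r1", ("r2", "r3")),  # reads r2
    ("r2", ("r4",)),       # overwrites r2: anti-dependence on the first instruction
    ("r1", ("r2",)),       # overwrites r1 again: output dependence on the first
]

for line in rename(program):
    print(line)
```

After renaming, the second and third instructions no longer conflict with the first; only the true dependence of the third on the second remains.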
18. Algorithms for scheduling
A few of the instruction scheduling algorithms in use are:
List scheduling
Trace scheduling
Software pipelining (modulo scheduling)
19. List Scheduling
List scheduling by steps:
1. Construct a dependence graph of the basic block. (The edges are
weighted with the latency of the instruction.)
2. Use the dependence graph to determine which instructions can execute;
insert them on a list, called the Ready list.
3. Use the dependence graph and the Ready list to schedule an instruction
that causes the smallest possible stall; update the Ready list. Repeat
until all instructions are scheduled.
20. Code Representation for List Scheduling
a = b + c
d = e - f
1. load R1, b
2. load R2, c
3. add R2, R1
4. store a, R2
5. load R3, e
6. load R4, f
7. sub R3, R4
8. store d, R3
Dependence graph (by instruction number):
1, 2 -> 3 -> 4
5, 6 -> 7 -> 8
21. Code Representation for List Scheduling
a = b + c
d = e - f
Original order:
1. load R1, b
2. load R2, c
3. add R2, R1
4. store a, R2
5. load R3, e
6. load R4, f
7. sub R3, R4
8. store d, R3
Scheduled order:
1. load R1, b
5. load R3, e
2. load R2, c
6. load R4, f
3. add R2, R1
7. sub R3, R4
4. store a, R2
8. store d, R3
Dependence graph (by instruction number):
1, 2 -> 3 -> 4
5, 6 -> 7 -> 8
Now we have a schedule that requires no stalls and no NOPs.
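The list scheduling steps can be sketched on this eight-instruction example. The latencies are an assumption (a load takes two cycles, everything else one), as are the single-issue machine, the critical-path priority, and the tie-break on instruction number; the function names are hypothetical. Treat it as one possible list scheduler, not the one the slides had in mind.

```python
# Assumed latencies: loads (1, 2, 5, 6) take 2 cycles, other instructions 1.
LATENCY = {1: 2, 2: 2, 3: 1, 4: 1, 5: 2, 6: 2, 7: 1, 8: 1}
# Dependence graph of the example, as successor lists.
SUCCS = {1: [3], 2: [3], 3: [4], 4: [], 5: [7], 6: [7], 7: [8], 8: []}

def predecessors(succs):
    preds = {n: [] for n in succs}
    for n, outs in succs.items():
        for s in outs:
            preds[s].append(n)
    return preds

def critical_path(node, succs, latency):
    """Latency-weighted length of the longest path from node to a leaf."""
    return latency[node] + max(
        (critical_path(s, succs, latency) for s in succs[node]), default=0)

def list_schedule(succs, latency):
    """One entry per cycle: an instruction number, or None for a stall."""
    preds = predecessors(succs)
    priority = {n: critical_path(n, succs, latency) for n in succs}
    done_at, schedule, cycle = {}, [], 0
    while len(done_at) < len(succs):
        ready = [n for n in succs if n not in done_at and
                 all(p in done_at and done_at[p] <= cycle for p in preds[n])]
        if ready:
            # Longest remaining path first; ties go to the lower number.
            pick = min(ready, key=lambda n: (-priority[n], n))
            done_at[pick] = cycle + latency[pick]
        schedule.append(pick if ready else None)
        cycle += 1
    return schedule

def run_in_order(order, succs, latency):
    """Timeline of a fixed instruction order, with None marking stall cycles."""
    preds = predecessors(succs)
    done_at, timeline, cycle = {}, [], 0
    for n in order:
        start = max([cycle] + [done_at[p] for p in preds[n]])
        timeline += [None] * (start - cycle) + [n]
        done_at[n] = start + latency[n]
        cycle = start + 1
    return timeline

print(run_in_order(range(1, 9), SUCCS, LATENCY))  # original order
print(list_schedule(SUCCS, LATENCY))              # scheduled order
```

Under these assumptions the original order stalls twice (before the add and before the sub). The scheduler produces 1, 2, 5, 6, 3, 7, 4, 8, which differs from the order 1, 5, 2, 6, 3, 7, 4, 8 shown on the slide, but both orders run without stalls; the difference comes only from the tie-break rule.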
22. Problem and Solution
Register allocation conflict: use of the same register creates
anti-dependencies that restrict scheduling
Register allocation before scheduling
– prevents good scheduling
Scheduling before register allocation
– spills destroy scheduling
Solution: schedule abstract assembly, allocate registers, then schedule again
23. Trace scheduling
Steps involved in Trace Scheduling :
Trace Selection
– Find the most common trace of basic blocks.
Trace Compaction
– Combine the basic blocks in the trace and schedule them as one block
– Create clean-up code if the execution goes off-trace
Parallelism across IF branches vs. LOOP branches
Can provide a speedup if static prediction is accurate
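Trace selection can be sketched as a walk that always follows the most frequently taken edge. This is a simplified, forward-only version under assumptions of its own: edge execution counts are available from a profile, the walk starts at the entry block, and it stops at the first block already on the trace. The names `select_trace`, `cfg` and `counts` are hypothetical.

```python
def select_trace(cfg, counts, entry):
    """Follow the hottest outgoing edge from entry until the path ends or loops.

    cfg maps a block to its successors; counts maps (src, dst) to how often
    that edge was taken in a profile run.
    """
    trace, seen, block = [entry], {entry}, entry
    while cfg[block]:
        hottest = max(cfg[block], key=lambda s: counts[(block, s)])
        if hottest in seen:
            break  # back edge: the trace does not wrap around a loop
        trace.append(hottest)
        seen.add(hottest)
        block = hottest
    return trace

# A simple if/else diamond in which the A -> B side is taken 90% of the time.
cfg = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
counts = {("A", "B"): 90, ("A", "C"): 10, ("B", "D"): 90, ("C", "D"): 10}

trace = select_trace(cfg, counts, "A")
off_trace = [b for b in cfg if b not in trace]
print(trace, off_trace)
```

Blocks A, B and D would be compacted and scheduled as one unit, while C is the off-trace path that needs clean-up code.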
26. How Trace Scheduling works
The blocks are traced in order of priority.
27. How Trace Scheduling works
• Create large extended basic blocks by duplication
• Schedule the larger blocks
The figure for this slide shows how the extended basic blocks can be
created.
28. How Trace Scheduling works
The block diagram in its final stage shows the parallelism across the
branches.
29. Limitations of Trace Scheduling
Optimization depends on the traces being the dominant paths
in the program's control flow.
Therefore, the following two things should be true:
– Programs should be skewed in the branches taken at run time,
for typical mixes of input data.
– We should have access to this information at compile time.
Not so easy.
30. Software Pipelining
In software pipelining, iterations of a loop in the source program are
continuously initiated at constant intervals, before the preceding
iterations complete, thus taking advantage of the parallelism in the data path.
It is also described as scheduling the operations within an iteration
such that the iterations can be pipelined to yield optimal throughput.
The sequence of instructions before the steady state is called the
PROLOG, and the sequence after the steady state is called the
EPILOG.
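The prolog, steady state and epilog can be seen by laying out which stage of which iteration runs at each time step. The sketch assumes a loop body that splits into equal-length stages and an initiation interval of one step; the stage names and the function `software_pipeline` are hypothetical.

```python
def software_pipeline(stages, n):
    """Return (prolog, kernel, epilog) for n iterations started one step apart.

    Each row lists the work of one time step as 'stage[iteration]'.
    """
    depth = len(stages)
    assert n >= depth, "sketch assumes at least as many iterations as stages"
    rows = []
    for t in range(n + depth - 1):
        # At step t, stage s works on iteration t - s, if that iteration exists.
        rows.append([f"{stages[s]}[{t - s}]"
                     for s in range(depth) if 0 <= t - s < n])
    return rows[:depth - 1], rows[depth - 1:n], rows[n:]

prolog, kernel, epilog = software_pipeline(("load", "compute", "store"), 5)
for name, part in (("prolog", prolog), ("kernel", kernel), ("epilog", epilog)):
    for row in part:
        print(f"{name:7}", row)
```

Every kernel row contains all three stages, each from a different iteration; the prolog rows fill the pipeline and the epilog rows drain it.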
31. Software Pipelining Example
• Source code:
for (i = 0; i < n; i++) sum += a[i]
• Loop body in assembly:
r1 = L r0
---;stall
r2 = add r2,r1
r0 = add r0,4
• Unroll loop & allocate registers:
r1 = L r0
---;stall
r2 = add r2,r1
r0 = add r0,12
r4 = L r3
---;stall
r2 = add r2,r4
r3 = add r3,12
r7 = L r6
---;stall
r2 = add r2,r7
r6 = add r6,12
r10 = L r9
---;stall
r2 = add r2,r10
r9 = add r9,12
34. Constraints in Software pipelining
Recurrence constraints: determined
by loop-carried data dependencies.
Resource constraints: determined by
total resource requirements.
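In the modulo scheduling literature these two constraints are usually turned into a lower bound on the initiation interval: the larger of a resource bound and a recurrence bound. The slides do not give the formula, so the sketch below is an assumption about how they would be applied; the function names and the example resource counts are hypothetical.

```python
from math import ceil

def resource_bound(uses, units):
    """Each resource needs ceil(uses / units) cycles per iteration."""
    return max(ceil(uses[r] / units[r]) for r in uses)

def recurrence_bound(cycles):
    """cycles: (total latency, iteration distance) of each dependence cycle."""
    return max((ceil(lat / dist) for lat, dist in cycles), default=0)

def min_initiation_interval(uses, units, cycles):
    return max(resource_bound(uses, units), recurrence_bound(cycles), 1)

# Hypothetical accounting for a loop like sum += a[i]: one load and two adds
# (the accumulation and the pointer update) per iteration.
uses = {"mem": 1, "alu": 2}
# The accumulation feeds itself in the next iteration: latency 1, distance 1.
cycles = [(1, 1)]

print(min_initiation_interval(uses, {"mem": 1, "alu": 1}, cycles))  # 2
print(min_initiation_interval(uses, {"mem": 1, "alu": 2}, cycles))  # 1
```

With a single ALU the resource constraint dominates; adding a second ALU lets a new iteration start every cycle, at which point the recurrence on the sum becomes the limit.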
35. Remarks on Software Pipelining
Innermost loops, loops with larger trip counts, and loops without conditionals
can be software pipelined.
Code size increases due to the prolog and epilog.
Code size increases due to unrolling for MVE (Modulo Variable
Expansion).
Register allocation strategies for software-pipelined loops.
Loops with conditionals can be software pipelined if predicated execution
is supported.
– Higher resource requirement, but efficient schedule