Advanced Techniques
For Exploiting ILP
Mr. A. B. Shinde
Assistant Professor,
Electronics Engineering,
P.V.P.I.T., Budhgaon
Contents…
• Compiler techniques for exposing ILP
• Limitations on ILP for realizable processors
• Hardware versus software speculation
Pipelining
• A pipeline is a set of data processing elements connected in series, so that the output of one element is the input of the next.
• The elements of a pipeline are often executed in parallel.
Parallel Computing
• Parallel computing is a form of computation in which many calculations are carried out simultaneously ("in parallel").
• There are several different forms of parallel computing:
  ◦ Bit-level,
  ◦ Instruction-level,
  ◦ Data, and
  ◦ Task parallelism.
Instruction Level Parallelism (ILP)
• A computer program is a stream of instructions executed by a processor.
• These instructions can be re-ordered and combined into groups that are then executed in parallel without changing the result of the program.
Instruction Level Parallelism (ILP)
• In ordinary programs, instructions are executed in the order specified by the programmer.
• How much ILP exists in programs is very application specific.
  In certain fields, such as graphics and scientific computing, the amount can be very large.
  However, workloads such as cryptography exhibit much less parallelism.
Pipeline Scheduling
• The straightforward MIPS code, not scheduled for the pipeline, looks like this:

  Loop: L.D    F0,0(R1)    ;F0 = array element
        ADD.D  F4,F0,F2    ;add scalar in F2
        S.D    F4,0(R1)    ;store result
        DADDUI R1,R1,#-8   ;decrement pointer by 8 bytes
        BNE    R1,R2,Loop  ;branch if R1 != R2

• Let's see how well this loop runs when it is scheduled on a simple pipeline for MIPS.
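For reference, here is a C-level sketch of the computation this MIPS fragment plausibly implements (the function and variable names are assumptions, not from the slides): adding a scalar to every element of a double-precision array, walking backward just as R1 is decremented.

  /* add_scalar.c -- assumed C equivalent of the MIPS loop above */
  void add_scalar(double *x, long n, double s) {
      for (long i = n - 1; i >= 0; i--)   /* R1 walks backward through x */
          x[i] = x[i] + s;                /* L.D / ADD.D / S.D */
  }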
Pipeline Scheduling
• Example: Show how the loop would look on MIPS, both scheduled and unscheduled, including any stalls or idle clock cycles.
• Answer: Without any scheduling, the loop executes as follows, taking 9 cycles:

                             Clock cycle issued
  Loop: L.D    F0,0(R1)      1
        stall                2
        ADD.D  F4,F0,F2      3
        stall                4
        stall                5
        S.D    F4,0(R1)      6
        DADDUI R1,R1,#-8     7
        stall                8
        BNE    R1,R2,Loop    9
Pipeline Scheduling
• We can schedule the loop to obtain only two stalls and reduce the time to 7 cycles:

                             Clock cycle issued
  Loop: L.D    F0,0(R1)      1
        DADDUI R1,R1,#-8     2
        ADD.D  F4,F0,F2      3
        stall                4
        stall                5
        S.D    F4,8(R1)      6
        BNE    R1,R2,Loop    7

• The stalls after ADD.D are for use by the S.D.
• For comparison, the unscheduled loop again:

                             Clock cycle issued
  Loop: L.D    F0,0(R1)      1
        stall                2
        ADD.D  F4,F0,F2      3
        stall                4
        stall                5
        S.D    F4,0(R1)      6
        DADDUI R1,R1,#-8     7
        stall                8
        BNE    R1,R2,Loop    9
Loop Unrolling
• In the previous example, we complete one loop iteration and store back one array element every 7 clock cycles…
  but the actual work of operating on the array element takes just 3 of those 7 clock cycles (the load, add, and store).
• The remaining 4 clock cycles consist of loop overhead (the DADDUI and BNE) and two stalls.
• To eliminate these 4 clock cycles, we need more operations relative to the number of overhead instructions.
• A simple scheme for increasing the number of instructions relative to the branch and overhead instructions is loop unrolling.
• Unrolling simply replicates the loop body multiple times.
Loop Unrolling
Code with loop unrolling (four iterations per pass):

  Loop: L.D    F0,0(R1)
        ADD.D  F4,F0,F2
        S.D    F4,0(R1)     ;drop DADDUI & BNE
        L.D    F6,-8(R1)
        ADD.D  F8,F6,F2
        S.D    F8,-8(R1)    ;drop DADDUI & BNE
        L.D    F10,-16(R1)
        ADD.D  F12,F10,F2
        S.D    F12,-16(R1)  ;drop DADDUI & BNE
        L.D    F14,-24(R1)
        ADD.D  F16,F14,F2
        S.D    F16,-24(R1)
        DADDUI R1,R1,#-32
        BNE    R1,R2,Loop

Code without loop unrolling (scheduled, from the previous slide):

  Loop: L.D    F0,0(R1)
        DADDUI R1,R1,#-8
        ADD.D  F4,F0,F2
        stall
        stall
        S.D    F4,8(R1)
        BNE    R1,R2,Loop
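A source-level sketch of the same transformation, assuming (as the MIPS version does) that the iteration count is a multiple of 4:

  /* add_scalar_unrolled.c -- assumed C equivalent of the 4x-unrolled loop */
  void add_scalar_unrolled(double *x, long n, double s) {
      for (long i = n - 1; i >= 3; i -= 4) {  /* one DADDUI/BNE per 4 elements */
          x[i]     += s;
          x[i - 1] += s;
          x[i - 2] += s;
          x[i - 3] += s;
      }
  }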
Loop Unrolling
• Loop unrolling can also be used to improve scheduling: because it eliminates the branch, it allows instructions from different iterations to be scheduled together.
• If we simply replicated the instructions when we unrolled the loop, the resulting reuse of the same registers could prevent us from effectively scheduling the loop.
• Thus, unrolling increases the number of registers required.
Loop Unrolling and Scheduling: Summary
• To obtain the final unrolled code, we had to make the following decisions and transformations:
• Determine that unrolling the loop would be useful by finding that the loop iterations are independent.
• Use different registers to avoid unnecessary constraints that would be forced by reusing the same registers for different computations.
• Eliminate the extra test and branch instructions and adjust the loop termination code.
• Determine that the loads and stores in the unrolled loop can be interchanged, by observing that loads and stores from different iterations are independent.
• Schedule the code, preserving any dependences needed to produce the same result as the original code (a source-level sketch of the result follows this list).
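A sketch of decisions 2 and 5 at the source level (assumed C counterpart, not from the slides): each iteration's value gets its own temporary, mirroring F0/F6/F10/F14 in the unrolled MIPS code, so all four loads can be hoisted together ahead of the adds and stores.

  /* add_scalar_scheduled.c -- distinct temporaries mimic register renaming */
  void add_scalar_scheduled(double *x, long n, double s) {
      for (long i = n - 1; i >= 3; i -= 4) {
          double t0 = x[i],     t1 = x[i - 1],   /* all loads first */
                 t2 = x[i - 2], t3 = x[i - 3];
          x[i]     = t0 + s;                      /* then adds and stores */
          x[i - 1] = t1 + s;
          x[i - 2] = t2 + s;
          x[i - 3] = t3 + s;
      }
  }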
Loop Unrolling and Scheduling: Summary
• There are three different types of limits to the gains that can be achieved by loop unrolling:
1. A decrease in the amount of overhead amortized with each unroll,
2. Code size limitations, and
3. Compiler limitations.

Let's consider the question of loop overhead first.
When we unrolled the loop four times, it generated sufficient parallelism among the instructions that the loop could be scheduled with no stall cycles.
In the previous example, out of 14 clock cycles, only 2 cycles were loop overhead: the DADDUI and the BNE.
Loop Unrolling and Scheduling: Summary
• A second limit to unrolling is the growth in code size.
• A factor often more important than code size is the potential shortfall in registers created by aggressive unrolling and scheduling.
• The transformed code is theoretically faster, but it generates a shortage of registers.
Loop Unrolling and Scheduling: Summary
• Loop unrolling is a simple but useful method for increasing the size of straight-line code fragments that can be scheduled effectively.
• This transformation is useful in a variety of processors, from simple pipelines to multiple-issue processors.
Limitations of ILP
• Exploiting ILP to increase performance began with the first pipelined processors in the 1960s.
• In the 1980s and 1990s, these techniques were used to achieve rapid performance improvements.
• To keep enhancing performance at the pace of integrated circuit technology, the critical question is:
  What is needed to exploit more ILP? The answer is crucial to both computer designers and compiler writers.
Limitations of ILP
• To know what actually limits ILP…
  … we first need to define an ideal processor.
• An ideal processor is one where all constraints on ILP are removed.
• The only limits on ILP in such a processor are those imposed by the actual data flows through either registers or memory.
Ideal Processor
• The assumptions made for an ideal or perfect processor are as follows:
1. Register renaming
   There are an infinite number of virtual registers available; hence all WAW and WAR hazards are avoided, and an unbounded number of instructions can begin execution simultaneously.
2. Branch prediction
   Branch prediction is perfect. All conditional branches are predicted exactly.
3. Jump prediction
   All jumps are perfectly predicted (a sketch of a realistic, imperfect predictor follows this list).
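Real hardware predicts imperfectly, of course. Below is a minimal C sketch of the classic 2-bit saturating-counter scheme a real branch predictor might use; the table size and the PC-modulo indexing are assumptions for illustration, not details from the slides.

  /* predictor.c -- 2-bit saturating-counter branch predictor sketch */
  #include <stdbool.h>

  #define ENTRIES 4096
  static unsigned char counters[ENTRIES];   /* 0..1: predict not taken, 2..3: predict taken */

  bool predict_taken(unsigned long pc) {
      return counters[pc % ENTRIES] >= 2;
  }

  void train(unsigned long pc, bool taken) {
      unsigned char *c = &counters[pc % ENTRIES];
      if (taken  && *c < 3) (*c)++;         /* saturate at strongly taken */
      if (!taken && *c > 0) (*c)--;         /* saturate at strongly not taken */
  }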
Ideal Processor
• The assumptions made for an ideal or perfect processor (continued):
4. Memory address analysis
   All memory addresses are known exactly, and a load can be moved before a store provided that the addresses are not identical.
   This amounts to perfect address analysis.
5. Perfect caches
   All memory accesses take 1 clock cycle.
Ideal Processor
• Assumptions 2 and 3 eliminate all control dependences.
• Assumptions 1 and 4 eliminate all but the true data dependences.
• Together, these four assumptions mean that any instruction in the program's execution can be scheduled on the cycle immediately following the execution of the predecessor on which it depends.
• Under these assumptions, it is even possible for the last executed instruction in the program to be scheduled on the very first cycle.
Ideal Processor
• How close could a dynamically scheduled, speculative processor come to the ideal processor?
  To answer this question, consider what the perfect processor must do:
1. Look arbitrarily far ahead to find a set of instructions to issue, predicting all branches perfectly.
2. Rename all registers used, to avoid WAR and WAW hazards.
3. Determine whether there are any data dependences among the issuing instructions; if so, rename accordingly.
4. Determine if any memory dependences exist among the issuing instructions and handle them appropriately.
5. Provide enough replicated functional units to allow all the ready instructions to issue (no structural hazards).
Ideal Processor
• For example, to determine whether n issuing instructions have any register dependences among them, assuming all instructions are register-register and the total number of registers is unbounded, requires

      (2n - 2) + (2n - 4) + ... + 2  =  n^2 - n   comparisons.

  Thus, issuing only 50 instructions requires 50^2 - 50 = 2450 comparisons.
  This cost obviously limits the number of instructions that can be considered for issue at once.
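A quick check of the count in C (illustrative only):

  /* comparisons.c -- pairwise register-dependence check count */
  #include <stdio.h>

  long comparisons(long n) { return n * n - n; }   /* n^2 - n */

  int main(void) {
      printf("%ld\n", comparisons(50));            /* prints 2450 */
      return 0;
  }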
Limitations on ILP for Realizable Processors
• The limitations are divided into two classes:
  ◦ Limitations that arise even for the perfect speculative processor, and
  ◦ Limitations that arise for one or more realistic models.
Limitations on ILP for Realizable Processors
• The most important limitations that apply even to the perfect model are:
1. WAW and WAR hazards through memory
   WAW and WAR hazards are eliminated through register renaming, but not for memory locations.
   A called procedure reuses the stack memory locations of a previous procedure, and this reuse can lead to WAW and WAR hazards through memory.
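A small C experiment suggesting why: on typical implementations, back-to-back calls place their locals at the same stack addresses, so stores from the second call rewrite memory the first call used. (Stack layout is implementation-defined; this is an illustration, not a guarantee.)

  /* stack_reuse.c -- successive calls usually reuse the same stack slots */
  #include <stdio.h>
  #include <stdint.h>

  static uintptr_t frame_slot(void) {
      int local = 0;                 /* one stack slot in this frame */
      return (uintptr_t)&local;
  }

  int main(void) {
      uintptr_t a = frame_slot();
      uintptr_t b = frame_slot();    /* usually identical to a */
      printf("call 1: %#lx, call 2: %#lx\n",
             (unsigned long)a, (unsigned long)b);
      return 0;
  }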
Limitations on ILP for Realizable Processors
• The most important limitations that apply even to the perfect model are:
2. Unnecessary dependences
   With infinite numbers of registers, all but true register data dependences are removed.
   There are, however, dependences arising from either recurrences or code generation conventions that introduce unnecessary data dependences.
   Code generation conventions introduce unneeded dependences, in particular the use of return address registers and a register for the stack pointer (which is incremented and decremented in the call/return sequence).
Limitations on ILP for Realizable Processors
• The most important limitations that apply even to the perfect model are:
3. Overcoming the data flow limit
   If value prediction worked with high accuracy, it could overcome the data flow limit.
   So far, however, it has proved difficult to achieve significant enhancements in ILP using such a prediction scheme.
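A minimal sketch of one such scheme, a last-value predictor: predict that an instruction will produce the same value it produced last time, and train on the actual outcome. The table structure and indexing here are assumptions for illustration.

  /* value_pred.c -- last-value predictor sketch */
  #define VP_ENTRIES 1024
  static long last_value[VP_ENTRIES];

  long vp_predict(unsigned long pc) {
      return last_value[pc % VP_ENTRIES];     /* guess: same value as last time */
  }

  void vp_train(unsigned long pc, long actual) {
      last_value[pc % VP_ENTRIES] = actual;   /* remember the produced value */
  }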
Limitations on ILP for Realizable Processors
• For a less-than-perfect processor, several ideas have been proposed that could expose more ILP.
• Speculating along multiple paths: this idea was discussed by Lam and Wilson [1992]. By speculating on both outcomes of a branch, the cost of incorrect recovery is reduced and more parallelism can be exposed.
• Wall [1993] provides data for speculating in both directions on up to eight branches.
  Of the two paths explored per branch, one will be thrown away.
  Every commercial design has instead devoted additional hardware to better speculation on the correct path.
Hardware vs. Software Speculation
• To speculate extensively, we must be able to disambiguate memory references, i.e., resolve whether two references can touch the same address.
• This is difficult to do at compile time for integer programs that contain pointers.
• In a hardware-based scheme, dynamic runtime disambiguation of memory addresses is done using the techniques of Tomasulo's algorithm.
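The core of that dynamic check, heavily simplified: a load may issue ahead of an older store only if their byte ranges cannot overlap. This is a sketch of the test alone, not Tomasulo's full load/store-queue machinery.

  /* disambig.c -- simplified dynamic memory-disambiguation test */
  #include <stdbool.h>

  bool load_may_pass_store(unsigned long ld_addr, unsigned ld_bytes,
                           unsigned long st_addr, unsigned st_bytes) {
      /* disjoint byte ranges => no dependence => safe to reorder */
      return ld_addr + ld_bytes <= st_addr || st_addr + st_bytes <= ld_addr;
  }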
Hardware vs. Software Speculation
• Hardware-based speculation works better when control flow is unpredictable, because hardware-based branch prediction is superior to software-based branch prediction done at compile time.
• For example: for four major integer SPEC92 programs, a good static predictor has a misprediction rate of about 16%, while a hardware predictor has a misprediction rate of under 10%.
  This difference matters because speculated instructions may slow down the computation when the prediction is incorrect.
Hardware vs. Software Speculation
• Hardware-based speculation maintains a completely precise exception model even for speculated instructions.
• Hardware-based speculation does not require compensation or bookkeeping code, which is needed by software speculation mechanisms.
• Compiler-based approaches may benefit from the ability to see further ahead in the code sequence, resulting in better code scheduling than a purely hardware-driven approach.
Hardware vs. Software Speculation
• Hardware-based speculation with dynamic scheduling does not require different code sequences to achieve good performance for different implementations of an architecture.
• On the other hand, more recent explicitly parallel architectures (such as IA-64) have added flexibility that reduces the hardware dependence inherent in a code sequence.
Hardware vs. Software Speculation
• The major disadvantage of supporting speculation in hardware is the complexity and additional hardware resources required.
• Some designers have tried to combine the dynamic and compiler-based approaches to achieve the best of each.
• For example:
  If conditional moves are combined with register renaming, a slight side effect appears: a conditional move that is annulled must still copy a value to the destination register, since the destination was renamed earlier.
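In C terms, the semantics the renamed hardware must implement look like this (a sketch of the side effect, not any particular ISA's definition): even when the condition is false, the destination must be written, because the "unchanged" value now lives in a different physical register.

  /* cmov.c -- a conditional move always writes its (renamed) destination */
  #include <stdbool.h>

  long cmov(bool cond, long src, long old_dest) {
      return cond ? src : old_dest;   /* the copy of old_dest is the side effect */
  }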
Thank You…
shindesir.pvp@gmail.com
(This Presentation is Published Only for Educational Purpose)