Ketan N. Kulkarni & M. V. Rajesh
ECEN 676 – Advanced Computer Architecture
1st May 2009
Reducing Memory Stalls by Hardware-Based Data Prefetching Schemes
Agenda
• Introduction
• Background work
• Hardware-based data cache prefetching algorithms
  1. Fixed Offset Prefetching
  2. Stride Based Prefetching
  3. Tag Correlated Prefetching
• Simulation Setup
• Results
• Conclusions
Introduction
• What is prefetching?
  • Filling the cache with relevant data before the program needs it.
• Why is prefetching needed?
  • The performance gap between microprocessors and DRAM keeps widening.
  • The data access penalty is increasing exponentially.
• When to prefetch?
  • Whenever the bus is idle. (A perfect prefetching scheme completely masks the memory latency.)
• Advantages:
  • Increases the L1 hit rate, reduces CPU stalls, and reduces the average memory access time (AMAT).
  • Program semantics remain unchanged.
• Caveats:
  • Prefetching too far in advance may lead to cache pollution.
  • Incorrect prefetches bring in data that is never used.
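The link between L1 hit rate and AMAT can be made concrete with the standard formula AMAT = hit time + miss rate × miss penalty. The Python sketch below is only an illustration: the function name `amat` and the miss rates are hypothetical and are not taken from the results in these slides. It also assumes that every L1 miss costs the same fixed penalty.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time in cycles: hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty


# Hypothetical numbers: 1-cycle L1 hit, 20-cycle penalty per L1 miss.
baseline = amat(hit_time=1, miss_rate=0.05, miss_penalty=20)
with_prefetch = amat(hit_time=1, miss_rate=0.03, miss_penalty=20)

# Relative AMAT reduction obtained by lowering the L1 miss rate.
reduction = (baseline - with_prefetch) / baseline
print(f"baseline={baseline:.2f} prefetch={with_prefetch:.2f} reduction={reduction:.0%}")
```

Because the miss penalty is much larger than the hit time, even a small increase in hit rate gives a noticeably lower AMAT.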
How Prefetching Works
[Figure: two timelines. Without prefetching, Load A and Load B each miss in L1 and the CPU stalls while Fetch A and Fetch B complete. With prefetching, Prefetch A and Prefetch B (from L2 to L1) are issued ahead of time, so Load A and Load B hit in L1 and the possible misses are avoided. Legend: CPU stalled, CPU executing.]
[Figure (background work): map of prior prefetching schemes. Group labels visible in the figure: Spatial, Software, Sequence-Base (Order Sensitive), Hybrid, Adaptive, Commercial Processors.]
Schemes shown in the figure:
• Buffer Block [Gindele1977]
• Sequential [Smith1978]
• Stride Prefetch [Fu1992]
• Software Support [Mowry1992]
• Adaptive Seq. [Dahlgren1993]
• HW/SW Integrate [Gornish1994]
• RPT [Chen1995]
• Markov Prefetch [Joseph1997]
• Hybrid [Hsu1998]
• Locality Detect [Johnson1998]
• Tag Correlation [Hu2003]
• GHB [Nesbit2004]
• AC/DC [Nesbit2004]
• Spatial Pat. [Chen2004]
• Adaptive Stream [Hur 2006]
• SMS [Somogyi2006]
• FDP [Srinath2007]
• Feedback based [Honio2009]
• AMPM Prefetch [Ishii2009]
• Fixed offset
• Commercial processors: SuperSPARC, R10000, PA7200, Power4, Pentium 4
Hardware-Based Prefetching
• Advantages:
  • Dynamic pattern matching.
  • No compiler support or ISA modification needed.
  • Takes advantage of regular, repeatable program behavior.
• Caveats:
  • Increased hardware complexity.
  • High-level program flow information is not available.
Fixed Offset Prefetching
• On a cache miss, retrieve the next block of memory.
• Sequential prefetching (exploits spatial locality).
[Figure: the miss address (Tag | Index | Offset) is added to a constant to form the prefetch address (Tag | Index | Offset).]
• Advantages:
  • Very simple scheme.
  • Less hardware.
• Disadvantages:
  • Relies solely on spatial locality to work.
  • Can't detect patterns.
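The address computation described above amounts to a single addition. A minimal Python sketch follows, assuming a 64-byte line and a lookahead measured in whole cache lines. The function name `fixed_offset_prefetch` and its parameters are hypothetical, chosen only for this illustration.

```python
def fixed_offset_prefetch(miss_addr, line_size=64, lookahead=1):
    """Return the block-aligned address to prefetch after a miss at miss_addr.

    The constant added to the miss address is lookahead * line_size,
    so lookahead=1 fetches the next sequential block.
    """
    block_base = miss_addr - (miss_addr % line_size)  # clear the offset bits
    return block_base + lookahead * line_size


print(hex(fixed_offset_prefetch(0x1234)))               # next block
print(hex(fixed_offset_prefetch(0x1234, lookahead=8)))  # eight blocks ahead
```

A larger lookahead gives the prefetch more time to complete, at the cost of a higher risk of fetching blocks that are never used.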
Stride Based Prefetching [Chen1995]
• Exploit stride patterns in data addresses.
• Prefetch the data at the predicted address before it is accessed.
• Store the state and stride of each load in a Reference Prediction Table (RPT), and update them on every access.
• Make state transitions based on correct or incorrect predictions.
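RPT - Structure
[Figure: each RPT entry holds four fields: instruction address, (previous) data address, stride, and state. The program counter selects the entry (lookup). An adder combines the stored data address and stride into the prefetching address. A subtractor compares the effective address with the stored data address to update the entry.]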
State Transition
[Figure: state diagram with four states (INIT, TRANS, STEADY, NO-PRED). Each state has one outgoing edge labeled Correct and one labeled Incorrect.]
Stride Based Prefetching (cont.)
• Advantages:
  • Detects uniform strides (e.g. loops).
  • Accurate prediction for many cases.
• Disadvantages:
  • Not much improvement with non-uniform strides.
  • Hardware overhead.
  • Cannot correlate strides of one instruction with those of others.
RPT - Example
• The load instructions are at addresses 500, 504, and 512.
• Matrices A, B, and C have base addresses 10,000, 50,000, and 90,000 respectively.

Matrix multiplication:

int A[100][100], B[100][100], C[100][100];
for (i = 0; i < 100; i++) {
    for (j = 0; j < 100; j++) {
        for (k = 0; k < 100; k++) {
            A[i][j] += B[i][k] * C[k][j];
        }
    }
}

Assembly code (inner loop):

Addr  Instruction            Comment
500   lw   r4, 0(r2)         ; load B[i][k]
504   lw   r5, 0(r3)         ; load C[k][j]
508   mul  r6, r5, r4        ; B[i][k] x C[k][j]
512   lw   r7, 0(r1)         ; load A[i][j]
516   addu r7, r7, r6        ; +=
520   sw   r7, 0(r1)         ; store A[i][j]
524   addu r2, r2, 4         ; ref B[i][k]
528   addu r3, r3, 400       ; ref C[k][j]
532   addu r11, r11, 1       ; increase k
536   bne  r11, r13, 500     ; loop
RPT - Example (contd.)

Initial state: the table is empty.

After iteration 1:
Instruction address   (Previous) data address   Stride   State
500                   50,000                    0        INIT
504                   90,000                    0        INIT
512                   10,000                    0        INIT

After iteration 2:
Instruction address   (Previous) data address   Stride   State
500                   50,004                    4        TRANS
504                   90,400                    400      TRANS
512                   10,000                    0        STEADY

After iteration 3:
Instruction address   (Previous) data address   Stride   State
500                   50,008                    4        STEADY
504                   90,800                    400      STEADY
512                   10,000                    0        STEADY
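The table evolution above can be replayed with a small model of the RPT. The Python sketch below is a simplified illustration, not the simulator used for these slides. The transition table follows the four-state scheme of [Chen1995] as commonly described, and the class and field names (`RPT`, `Entry`, `access`) are hypothetical. It also assumes an unbounded table keyed by the load's program counter, and it suppresses prefetches whose stride is zero.

```python
from dataclasses import dataclass

# state -> (next state if the prediction was correct, next state if incorrect)
TRANSITIONS = {
    "INIT":    ("STEADY", "TRANS"),
    "TRANS":   ("STEADY", "NO-PRED"),
    "STEADY":  ("STEADY", "INIT"),
    "NO-PRED": ("TRANS",  "NO-PRED"),
}


@dataclass
class Entry:
    prev_addr: int
    stride: int = 0
    state: str = "INIT"


class RPT:
    def __init__(self):
        self.table = {}  # assumption: unbounded, keyed by instruction address

    def access(self, pc, addr):
        """Record a load at pc touching addr; return a prefetch address or None."""
        entry = self.table.get(pc)
        if entry is None:
            self.table[pc] = Entry(prev_addr=addr)  # first sighting: allocate
            return None
        correct = addr == entry.prev_addr + entry.stride
        old_state = entry.state
        entry.state = TRANSITIONS[old_state][0 if correct else 1]
        # The stride is recomputed on a misprediction, except when leaving STEADY.
        if not correct and old_state != "STEADY":
            entry.stride = addr - entry.prev_addr
        entry.prev_addr = addr
        if entry.state == "NO-PRED" or entry.stride == 0:
            return None
        return addr + entry.stride


# Replay three inner-loop iterations of the matrix example.
rpt = RPT()
for k in range(3):
    rpt.access(500, 50_000 + 4 * k)    # load B[i][k]
    rpt.access(504, 90_000 + 400 * k)  # load C[k][j]
    rpt.access(512, 10_000)            # load A[i][j]
for pc, e in sorted(rpt.table.items()):
    print(pc, e.prev_addr, e.stride, e.state)
```

After the third iteration all three entries are in the STEADY state, matching the last table.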
Tag Correlated Prefetching [Hu2003]
• L1 cache tags exhibit strong regularity.
• Similar to the two-level branch prediction technique.
• Uses local and global history.
• A correlating prefetcher that works with tags.
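TCP Structure
[Figure: the miss address is split into tag, index, and offset. The miss index selects a row of the Tag History Table (THT), which holds the last K tags (TAG1 ... TAGK) seen at that index. An index function over the tag history and the miss index selects a Pattern History Table (PHT) entry, which supplies the predicted next tag TAG'. The figure shows both a lookup path and an update path driven by the miss tag.]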
Modified TCP
[Figure: variant of the TCP structure with fields labeled SUM, TAG, and TAG'. The miss address (tag, index, offset) feeds an index function between the THT and the PHT.]
TCP (cont.)
• Advantages:
  • Captures global and local history.
  • Recognizes recurring patterns.
• Disadvantages:
  • More hardware.
  • Vulnerable to noise.
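The two-table organization can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the design evaluated in these slides: both tables are unbounded dictionaries, the PHT is keyed directly by the tag history instead of a hardware index function, and the bit widths (6 offset bits, 8 index bits) are chosen to match a 32 KB, 2-way cache with 64-byte lines. The class and method names are hypothetical.

```python
class TagCorrelatingPrefetcher:
    """Simplified tag-correlating prefetcher driven by L1 miss addresses."""

    def __init__(self, k=2, offset_bits=6, index_bits=8):
        self.k = k
        self.offset_bits = offset_bits
        self.index_bits = index_bits
        self.tht = {}  # miss index -> tuple of the last k miss tags in that set
        self.pht = {}  # tag history -> tag that followed this history last time

    def split(self, addr):
        index = (addr >> self.offset_bits) & ((1 << self.index_bits) - 1)
        tag = addr >> (self.offset_bits + self.index_bits)
        return tag, index

    def join(self, tag, index):
        return ((tag << self.index_bits) | index) << self.offset_bits

    def on_miss(self, addr):
        """Update both tables for a miss at addr; return a prefetch address or None."""
        tag, index = self.split(addr)
        history = self.tht.get(index, ())
        if len(history) == self.k:
            self.pht[history] = tag  # learn: this history was followed by tag
        history = (history + (tag,))[-self.k:]
        self.tht[index] = history
        predicted = self.pht.get(history)
        if predicted is None:
            return None
        return self.join(predicted, index)


# Tags 1, 2, 3 repeat in set 5; the pattern is learned after one full pass.
tcp = TagCorrelatingPrefetcher()
for tag in (1, 2, 3, 1, 2):
    prefetch = tcp.on_miss(tcp.join(tag, 5))
print(prefetch == tcp.join(3, 5))
```

Because the PHT in this sketch is shared by all sets, a tag sequence learned in one set can also trigger a prediction in another set, which is one way to read the "global history" advantage listed above.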
Simulation Environment
[Figure: block diagram. A trace file drives the CPU model. The CPU, L1 data cache, L2 data cache, and main memory are connected by address and data paths. The prefetcher uses the L1 hit signal and address to calculate the next prefetch address.]
• Implementation: C++, Perl, Pin tool [Reddi2004].
• Trace-driven simulation.

Parameter            L1 cache            L2 cache
Size                 32 KB               256 KB
Associativity        2-way               8-way
Line size            64 bytes            128 bytes
Write policy         write-through       write-back
Write-miss policy    no-write-allocate   write-allocate
Latency              1 cycle hit time    20 cycle access time
Replacement policy   LRU                 LRU
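A trace-driven experiment of this kind can be approximated with a very small cache model. The Python sketch below is not the authors' C++/Pin simulator. It models only the L1 data cache with the parameters listed above (32 KB, 2-way, 64-byte lines, LRU), treats every access as a read, and uses a synthetic sequential trace. The names `Cache` and `hit_rate` are hypothetical.

```python
from collections import OrderedDict


class Cache:
    """Set-associative cache with LRU replacement (reads only)."""

    def __init__(self, size=32 * 1024, ways=2, line=64):
        self.line = line
        self.ways = ways
        self.num_sets = size // (ways * line)
        self.sets = [OrderedDict() for _ in range(self.num_sets)]

    def _locate(self, addr):
        block = addr // self.line
        return self.sets[block % self.num_sets], block // self.num_sets

    def fill(self, addr):
        """Bring the block containing addr into the cache if it is absent."""
        lines, tag = self._locate(addr)
        if tag not in lines:
            if len(lines) >= self.ways:
                lines.popitem(last=False)  # evict the least recently used line
            lines[tag] = True

    def access(self, addr):
        """Return True on a hit; on a miss, fill the block and return False."""
        lines, tag = self._locate(addr)
        if tag in lines:
            lines.move_to_end(tag)  # mark as most recently used
            return True
        self.fill(addr)
        return False


def hit_rate(trace, prefetch_next_line=False):
    cache = Cache()
    hits = 0
    for addr in trace:
        if cache.access(addr):
            hits += 1
        elif prefetch_next_line:
            cache.fill(addr + cache.line)  # fixed offset: next block on a miss
    return hits / len(trace)


# Synthetic trace: 4-byte sequential reads over 64 cache lines.
trace = list(range(0, 64 * 64, 4))
print(hit_rate(trace))                           # 0.9375
print(hit_rate(trace, prefetch_next_line=True))  # 0.96875
```

On this trace the next-line prefetch halves the number of misses, because a prefetch is issued only on a miss and therefore covers every second line.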
Benchmarks
[Figure: instruction mix (%) per benchmark (grep, g++, ls, plamap, testgen, matrix), split into loads, stores, and non-memory operations.]

Benchmark   Description
grep        Unix utility to search for a pattern in an input file.
g++         GNU C++ compiler for Unix.
testgen     Program for creating test patterns for scan chains in DFT.
plamap      A mapping algorithm for CPLD architectures.
ls          Unix utility to list information about the files in a directory.
matrix      100x100 matrix multiplication.
Simulation Results - I
[Figure: CPI per benchmark (grep, g++, ls, plamap, testgen, matrix) for four configurations: no prefetching, fixed offset prefetching, stride based prefetching, and tag correlating prefetching.]
Simulation Results - II
[Figure: L1 cache hit rate (%) per benchmark (grep, g++, ls, plamap, testgen, matrix) for four configurations: no prefetching, fixed offset prefetching, stride based prefetching, and tag correlating prefetching.]
Simulation Results - III
[Figure: average memory access time (AMAT, in cycles) per benchmark (grep, g++, ls, plamap, testgen, matrix) for four configurations: no prefetching, fixed offset prefetching, stride based prefetching, and tag correlating prefetching.]
Simulation Results - IV
[Figure, three panels, each plotting L1 hit rate (%):
 1. Effect of changing the offset on L1 hit rate, benchmark grep. Lookahead sizes: 64, 2*64, 8*64.
 2. Effect of RPT size on L1 hit rate, benchmark g++. Sizes: 8, 64, 256, 2048.
 3. Effect of THT/PHT parameters on L1 hit rate, benchmark testgen, for increasing m, n and for increasing k. Legend entries: m=8, n=8, k=4; m=4, n=4, k=4; m=8, n=8, k=2; m=8, n=8, k=4; m=8, n=8, k=8; m=8, n=8, k=64.]
Simulation Results - V

Prefetching algorithm   Hardware                      CPI (% improvement)   Hit rate (% increase)   AMAT (% decrease)
Fixed Offset            Area of adder, registers      10.28                 1.41                    9.62
Stride Based            26.75 KB (RPT)                16.56                 1.75                    20.42
TCP                     72 KB (THT) + 150 KB (PHT)    18.93                 1.96                    27.75
Modified TCP            26 KB (THT) + 150 KB (PHT)    16.98                 1.80                    21.23

The increase in hardware complexity pays off!
Conclusions
• Prefetching increases hit rate and decreases AMAT.
[Figure: arrow from Fixed Offset through Stride Based to Tag Correlated, labeled "Increasing hardware complexity" and "Increasing hit rate, decreasing AMAT".]
• Fixed offset prefetching gives good performance for code with high spatial locality.
• Stride prefetching performs best when a program has steady memory access patterns, regardless of locality.
• TCP performs better on average.
References
• Chen1995 - T.-F. Chen and J.-L. Baer, "Effective hardware-based data prefetching for high-performance processors," IEEE Transactions on Computers, vol. 44, no. 5, pp. 609-623, May 1995.
• Hu2003 - Z. Hu, M. Martonosi, and S. Kaxiras, "TCP: tag correlating prefetchers," Proceedings of the Ninth International Symposium on High-Performance Computer Architecture (HPCA-9), pp. 317-326, 8-12 Feb. 2003.
• Reddi2004 - V. J. Reddi, A. Settle, D. A. Connors, and R. S. Cohn, "PIN: A Binary Instrumentation Tool for Computer Architecture Research and Education," 2004.
Questions
• Question 1: Is cache pollution a serious concern for anyone designing a prefetching algorithm?
• Answer: Cache pollution happens when the cache is cluttered with useless data. The difficulty is that the data a program will need is not known exactly; it can only be predicted. The goal is therefore to prefetch all of the necessary data ahead of time while prefetching as little unused data as possible.
