This document describes several hardware-based data prefetching schemes that reduce memory stalls by bringing data into the cache before the program needs it. It introduces fixed offset prefetching, stride-based prefetching, and tag correlated prefetching, then presents the simulation setup used to evaluate these schemes and results in terms of CPI, cache hit rate, and average memory access time. Tag correlated prefetching achieved the best overall performance, at the cost of higher hardware complexity than the other schemes.
1. Reducing Memory Stalls by Hardware Based Data Prefetching Schemes
Ketan N. Kulkarni & M. V. Rajesh
ECEN 676 – Advanced Computer Architecture
1st May 2009
2. Agenda
Introduction
Background work
Hardware based Data Cache Prefetching Algorithms
1. Fixed Offset Prefetching
2. Stride Based Prefetching
3. Tag Correlated Prefetching
Simulation Setup
Results
Conclusions
3. Introduction
What is prefetching?
Filling the cache with relevant data before the program needs it.
Why is prefetching needed?
The gap between microprocessor and DRAM performance keeps expanding.
The data access penalty is increasing exponentially.
When to prefetch?
Whenever the bus is idle. (A perfect prefetching scheme completely masks the memory latency.)
Advantages:
Increases the L1 hit rate, reduces CPU stalls, and reduces average memory access time (AMAT).
Program semantics remain unchanged.
Caveats:
Prefetching too far in advance may lead to cache pollution.
Prefetches may be incorrect.
4. How Prefetching Works
[Timeline figure: without prefetching, LoadA and LoadB both miss in L1 and the CPU stalls while FetchA and FetchB complete. With prefetching, PrefetchA and PrefetchB (from L2 to L1) are issued early and overlap with CPU execution, so LoadA and LoadB hit in L1 and the possible misses are avoided.]
6. Hardware Based Prefetching
Advantages:
Dynamic pattern matching.
No compiler support or ISA modification needed.
Takes advantage of regular, repeatable program behavior.
Caveats:
Increased complexity and hardware.
High-level program flow information is not available.
7. Fixed Offset Prefetching
On a cache miss, retrieve the next block of memory.
Sequential prefetching (spatial locality).
[Figure: the miss address (tag/index/offset) plus a constant offset gives the prefetch address.]
Advantages:
Very simple scheme.
Little hardware.
Disadvantages:
Relies solely on spatial locality to work.
Cannot detect patterns.
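As a sketch, the fixed-offset computation amounts to aligning the miss address to a line boundary and adding one line. The code below assumes the 64-byte L1 line size from the simulation setup; the names are illustrative, not from the deck:

```cpp
#include <cstdint>

// 64-byte L1 line size, as in the simulation setup.
constexpr uint64_t kLineSize = 64;

// Fixed offset prefetching: on a miss at miss_addr, prefetch the
// next sequential block (a constant offset of one cache line).
uint64_t next_block(uint64_t miss_addr) {
    // Clear the offset bits, then add one line.
    return (miss_addr & ~(kLineSize - 1)) + kLineSize;
}
```

The hardware analog is just the adder shown in the figure: the tag/index bits of the miss address plus a constant.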
8. Stride Based Prefetching [Chen1995]
Exploit stride patterns in data addresses.
Prefetch the data before it is accessed.
Store state and stride information in a Reference Prediction Table (RPT), and update it on each access.
Make state transitions based on correct or incorrect predictions.
9. RPT - Structure
[Figure: each RPT entry holds an instruction address, the (previous) data address, a stride, and a state. The program counter looks up an entry; the prefetching address is the previous data address plus the stride. On each access, the effective address is used to update the entry.]
11. Stride Based Prefetching (cont.)
Advantages:
Detects uniform strides (e.g. loops).
Accurate prediction in many cases.
Disadvantages:
Not much improvement with non-uniform strides.
Hardware overhead.
Cannot correlate strides of one instruction with those of others.
12. RPT - Example
Load instructions at addresses 500, 504, and 512.
Base addresses of matrices A, B, and C at locations 10,000, 50,000, and 90,000 respectively.
Matrix Multiplication:
int A[100][100], B[100][100], C[100][100];
for(i = 1; i < 100; i++){
    for(j = 1; j < 100; j++){
        for(k = 1; k < 100; k++){
            A[i][j] += B[i][k] * C[k][j];
        }
    }
}
Assembly Code (inner loop):
500  lw r4, 0(r2)        ; load B[i][k]
504  lw r5, 0(r3)        ; load C[k][j]
508  mul r6, r5, r4      ; B[i][k] * C[k][j]
512  lw r7, 0(r1)        ; load A[i][j]
516  addu r7, r7, r6     ; +=
520  sw r7, 0(r1)        ; store A[i][j]
524  addu r2, r2, 4      ; advance B[i][k] reference
528  addu r3, r3, 400    ; advance C[k][j] reference
532  addu r11, r11, 1    ; increment k
536  bne r11, r13, 500   ; loop
13. RPT – Example (contd.)
Initial State (table empty):
Instruction Address | (Previous) Data Address | Stride | State
After Iteration 1:
500 | 50,000 | 0 | INIT
504 | 90,000 | 0 | INIT
512 | 10,000 | 0 | INIT
After Iteration 2:
500 | 50,004 | 4 | TRANS
504 | 90,400 | 400 | TRANS
512 | 10,000 | 0 | STEADY
After Iteration 3:
500 | 50,008 | 4 | STEADY
504 | 90,800 | 400 | STEADY
512 | 10,000 | 0 | STEADY
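The iteration tables above can be reproduced with a small state machine. This is a minimal sketch of one RPT entry using only the three states that appear in the example (the fourth no-prediction state of [Chen1995] is omitted for brevity); the names are illustrative:

```cpp
#include <cstdint>

enum class State { INIT, TRANSIENT, STEADY };

struct RptEntry {
    uint64_t prev_addr = 0;   // previous data address of this load PC
    int64_t  stride    = 0;   // last observed stride
    State    state     = State::INIT;

    // Update with a new effective address; returns true when the entry
    // is confident (STEADY), i.e. prefetch prev_addr + stride next.
    bool update(uint64_t addr) {
        int64_t new_stride = int64_t(addr) - int64_t(prev_addr);
        bool correct = (new_stride == stride);
        switch (state) {
            case State::INIT:
            case State::TRANSIENT:
                // A confirmed stride moves the entry toward STEADY.
                state = correct ? State::STEADY : State::TRANSIENT;
                break;
            case State::STEADY:
                // A mispredicted stride resets confidence.
                if (!correct) state = State::INIT;
                break;
        }
        stride = new_stride;
        prev_addr = addr;
        return state == State::STEADY;
    }
};
```

For the load at address 500 (B[i][k]): starting from prev_addr = 50,000, the access to 50,004 learns stride 4 (TRANS), and the access to 50,008 confirms it (STEADY), matching the tables.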
14. Tag Correlated Prefetching [Hu2003]
L1 cache tags exhibit strong regularity.
Similar to the two-level branch prediction technique.
Uses local and global history.
A correlating prefetcher that works with tags.
15. TCP Structure
[Figure: the miss address is split into tag, index, and offset. A Tag History Table (THT), selected through an index function on the miss index, records the last K miss tags (TAG1 ... TAGK) for that index. On a miss, the THT entry is updated with the miss tag, and the resulting tag pattern looks up the Pattern History Table (PHT), whose matching entry supplies the predicted next tag TAG'.]
16. Modified TCP
[Figure: the miss address again passes through an index function into the THT, but the full K-tag history is compressed: each pattern entry holds a SUM field, the most recent miss tag, and the predicted next tag TAG', reducing THT storage.]
17. TCP (cont.)
Advantages:
Captures global and local history.
Recognizes recurring patterns.
Disadvantages:
More hardware.
Vulnerable to noise.
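To illustrate the lookup/update flow, here is a much-simplified software sketch of a tag-correlating prefetcher: a per-set tag history (THT) whose last K miss tags index a pattern table (PHT) that stores the predicted next tag. The table organization, sizes, and hash are assumptions for illustration, not the parameters of [Hu2003]:

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <unordered_map>

class TagCorrelatingPrefetcher {
    static constexpr std::size_t K = 2;  // tag-history depth (assumed)
    // THT: per-set history of the last K miss tags.
    std::unordered_map<uint64_t, std::deque<uint64_t>> tht_;
    // PHT: hash of (set, tag pattern) -> predicted next tag.
    std::unordered_map<uint64_t, uint64_t> pht_;

    static uint64_t key(uint64_t set, const std::deque<uint64_t>& h) {
        uint64_t k = set;
        for (uint64_t t : h) k = k * 1000003u + t;  // toy hash (assumed)
        return k;
    }

public:
    // On a miss to (set, tag): train the tables and return the
    // predicted next miss tag for this set, or -1 if no prediction.
    int64_t on_miss(uint64_t set, uint64_t tag) {
        auto& h = tht_[set];
        if (h.size() == K) {
            pht_[key(set, h)] = tag;  // learn: this pattern led to `tag`
        }
        h.push_back(tag);
        if (h.size() > K) h.pop_front();
        auto it = pht_.find(key(set, h));
        return it == pht_.end() ? -1 : int64_t(it->second);
    }
};
```

With a recurring miss-tag sequence such as 1, 2, 3 in one set, the second time the pattern (1, 2) is seen the prefetcher predicts tag 3, which is the recurring-pattern behavior the slide describes.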
18. Simulation Environment
[Figure: trace-driven simulator. A trace file feeds addresses to the CPU model; accesses go to the L1 data cache, backed by the L2 data cache and main memory. The prefetcher observes addresses and hit/miss outcomes and calculates the next prefetch address.]
Implementation: C++, Perl, Pin Tool [Reddi2004]
Trace-driven simulation.
L1 Cache: 32KB, 2-way set associative, 64-byte line size, write-through, no-write-allocate, 1-cycle hit time, LRU replacement policy.
L2 Cache: 256KB, 8-way set associative, 128-byte line size, write-back, write-allocate, 20-cycle access time, LRU replacement policy.
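For reference, the AMAT metric reported in the results can be computed from these parameters. The sketch below uses the 1-cycle L1 hit time and 20-cycle L2 access time from the table; the miss rates and main-memory latency in the example comment are assumed values for illustration only:

```cpp
// AMAT = L1 hit time + L1 miss rate * (L2 access time
//        + L2 miss rate * main-memory latency), all in cycles.
double amat(double l1_hit_time, double l1_miss_rate,
            double l2_access_time, double l2_miss_rate,
            double mem_latency) {
    return l1_hit_time +
           l1_miss_rate * (l2_access_time + l2_miss_rate * mem_latency);
}

// Example with assumed rates: a 1-cycle L1 hit, 3% L1 miss rate,
// 20-cycle L2 access, 10% L2 miss rate, and 200-cycle memory latency
// give amat(1.0, 0.03, 20.0, 0.10, 200.0) = 1 + 0.03 * 40 cycles.
```

Prefetching improves AMAT by lowering the effective L1 miss rate, which is why the AMAT and hit-rate results track each other.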
19. Benchmarks
[Chart: instruction mix (loads, stores, non-memory ops) for grep, g++, ls, plamap, testgen, and matrix.]
Benchmark | Description
grep | Unix utility to search for a pattern in an input file.
g++ | GNU C++ compiler for Unix.
testgen | Program for creating test patterns for scan chains in DFT.
plamap | A mapping algorithm for CPLD architectures.
ls | Unix utility to list information about files in a directory.
matrix | 100x100 matrix multiplication.
20. Simulation Results - I
[Chart: CPI for each benchmark (grep, g++, ls, plamap, testgen, matrix) under no prefetching, fixed offset prefetching, stride based prefetching, and tag correlating prefetching.]
21. Simulation Results - II
[Chart: L1 cache hit rate (%) for each benchmark under the four schemes (no prefetching, fixed offset, stride based, tag correlating).]
22. Simulation Results - III
[Chart: average memory access time (cycles) for each benchmark under the four schemes.]
23. Simulation Results - IV
[Chart: effect of the lookahead offset (64, 2*64, 8*64 bytes) on L1 hit rate for grep.]
[Chart: effect of RPT size (8, 64, 256, 2048 entries) on L1 hit rate for g++.]
[Chart: effect of THT/PHT parameters on L1 hit rate for testgen, varying m and n (m=8,n=8,k=4 vs. m=4,n=4,k=4) and varying k (m=8,n=8 with k=2, 4, 8, 64).]
24. Simulation Results – V
Prefetching Algorithm | Hardware | CPI (% improvement) | Hit Rate (% increase) | AMAT (% decrease)
Fixed Offset | Area of adder, registers | 10.28 | 1.41 | 9.62
Stride Based | 26.75KB (RPT) | 16.56 | 1.75 | 20.42
TCP | 72KB (THT) + 150KB (PHT) | 18.93 | 1.96 | 27.75
Modified TCP | 26KB (THT) + 150KB (PHT) | 16.98 | 1.80 | 21.23
The increase in hardware complexity pays off!
25. Conclusions
Prefetching increases the hit rate and decreases AMAT.
Fixed Offset -> Stride Based -> Tag Correlated: increasing hardware complexity, increasing hit rate, decreasing AMAT.
Fixed offset gives good performance for highly spatial code.
Stride prefetching performs best when a program has steady memory access patterns, regardless of locality.
TCP performs better on average.
26. References
Chen1995 - Tien-Fu Chen and Jean-Loup Baer, "Effective hardware-based data prefetching for high-performance processors," IEEE Transactions on Computers, vol. 44, no. 5, pp. 609-623, May 1995.
Hu2003 - Z. Hu, M. Martonosi, and S. Kaxiras, "TCP: tag correlating prefetchers," in Proceedings of the Ninth International Symposium on High-Performance Computer Architecture (HPCA-9), pp. 317-326, Feb. 2003.
Reddi2004 - V. J. Reddi, A. Settle, D. A. Connors, and R. S. Cohn, "PIN: a binary instrumentation tool for computer architecture research and education," 2004.
28. Questions
Question 1: Is cache pollution a serious concern for anyone designing a prefetching algorithm?
Answer: Cache pollution happens when the cache is cluttered with useless data. The difficulty is that the exact data that will be needed is not known in advance; it can only be predicted. The goal is to prefetch all of the necessary data beforehand while prefetching as little unused data as possible.