5. Load / Store Processing
● For both Loads and Stores:
  ● Effective Address Generation:
    – Must wait on register value
    – Must perform address calculation
  ● Address Translation:
    – Must access TLB; can potentially induce a page fault (exception)
● For Loads: D-cache Access (Read)
  – Check aliasing against the store buffer for possible load forwarding
  – Can potentially induce a D-cache miss
  – If bypassing a store, must be flagged as a "speculative" load until completion
● For Stores: D-cache Access (Write)
  – When completing, must check aliasing against "speculative" loads
  – After completion, wait in the store buffer for access to the D-cache
  – Can potentially induce a D-cache miss
6. LSU pipeline
● RegFile Access
  – Read the source registers
● Address Generation
  – Add base, displacement, immediate fields to generate an EA
● Cache Access
  – TLB access
  – Index into set, tag comparison for ways
  – Bank access if cache is multi-banked
● Results
  – Target register write back for loads
  – Store buffer/cache updates for stores
● Finish
  – Post instruction status (complete or flush, etc.)
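The cache-access step above (index into a set, compare tags across the ways) can be sketched in a few lines. The cache geometry here (64-byte lines, 64 sets, 8 ways, i.e. a 32 KiB cache) is an assumption chosen only for illustration:

```python
# Sketch of a set-associative cache lookup. Geometry is assumed:
# 64 B lines, 64 sets, 8 ways (a 32 KiB cache).
LINE_BYTES = 64
NUM_SETS = 64
NUM_WAYS = 8

def split_address(addr):
    """Split an address into (tag, set index, line offset)."""
    offset = addr % LINE_BYTES
    set_index = (addr // LINE_BYTES) % NUM_SETS
    tag = addr // (LINE_BYTES * NUM_SETS)
    return tag, set_index, offset

def cache_lookup(cache, addr):
    """cache[set_index] holds NUM_WAYS (valid, tag) pairs.
    The set index selects one set; the tag is compared against every way."""
    tag, set_index, _ = split_address(addr)
    for way, (valid, way_tag) in enumerate(cache[set_index]):
        if valid and way_tag == tag:
            return way        # hit: the matching way
    return None               # miss

# Fill one way, then look the same line up again.
cache = [[(False, 0)] * NUM_WAYS for _ in range(NUM_SETS)]
tag, idx, _ = split_address(0x12345)
cache[idx][0] = (True, tag)
```

A second address that maps to the same set but carries a different tag misses, which is exactly what the per-way tag comparison decides.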
7. Addressing modes
● An addressing mode is a mechanism for specifying an address.
● absolute: the address is provided directly.
● register: the address is provided indirectly, by specifying where (in which register) the address can be found.
● displacement: the address is computed by adding a displacement to the contents of a register.
● indexed: the address is computed by adding a displacement to the contents of a register, and then adding the contents of another register times some constant.
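The four modes above can be written out as effective-address computations. This is a minimal sketch over a hypothetical register file (a name-to-value map), not any particular ISA's encoding:

```python
# Effective-address computation for the four addressing modes.
# regs is a hypothetical register file (name -> value).

def ea_absolute(addr):
    return addr                      # address given directly

def ea_register(regs, r):
    return regs[r]                   # address found in a register

def ea_displacement(regs, r, disp):
    return regs[r] + disp            # register contents + displacement

def ea_indexed(regs, base, index, disp, scale=1):
    # base + displacement + index register scaled by a constant
    return regs[base] + disp + regs[index] * scale

regs = {"r1": 0x1000, "r2": 4}
```

For example, with `r1 = 0x1000` and `r2 = 4`, indexed addressing with displacement 8 and scale 8 yields 0x1000 + 8 + 4·8 = 0x1028.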
9. Pipeline Arbitration
● Loads/Stores from the Issue Unit
● Re-executing loads/stores that missed the DL1 or DTLB
● Line Fills from L2
● Snoops from different agents in MP (multiprocessor) systems
● Data Prefetches
10. Sub Units
● Load/Store Engine
  – Load/Store execution pipeline
  – 2-3 pipelines present in modern designs
● L1 Data cache
  – Multi-banked for simultaneous access to the same line from multiple pipelines
  – Bank conflicts between loads/stores and snoops
  – Virtually/physically indexed
    ● Virtual indexing helps simultaneous access to the TLB, but needs handling of aliases
  – WB/WT (write-back / write-through)
    ● WB saves bandwidth on writes to L2, but needs handling of snoops
  – Inclusive/Exclusive
  – Line Size
11. Sub Units
● Data TLBs
  – Cache virtual-to-physical translations
  – A TLB miss will cause the load or store to stall
● Load Miss Queue
  – Tracks line fill requests to L2
  – Loads/stores that miss the DL1, including ownership upgrades
  – Handles multiple load/store misses to the same cacheline
  – Restarts loads/stores as line fills arrive
    ● Critical data forwarding to re-executing loads
    ● L2-hit restart for best load-to-use latency in L2 hit cases
● Store Buffers
● Load/Store Re-order queue
● Data Prefetch
● Exceptions
12. Alignments
● Aligned
  – Aligned on an operand-sized boundary
● Unaligned
  – Access crossing an operand-sized boundary
  – Might get broken down into multiple accesses
● Line Crossing
  – Access crossing cachelines
  – Broken down into 2 accesses, and the data gets merged together
  – Not guaranteed to be atomic (on both x86 and Power)
● Page Crossing
  – Access crossing page boundaries
  – Broken down into 2 accesses, with 2 TLB/page miss handlings
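The cases above reduce to simple address arithmetic. The line and page sizes used here (64 B, 4 KiB) are assumptions for illustration; real values vary by machine:

```python
LINE_BYTES = 64       # assumed cacheline size
PAGE_BYTES = 4096     # assumed page size

def classify_access(addr, size):
    """Return (aligned, crosses_line, crosses_page) for an access of
    `size` bytes starting at `addr`."""
    last = addr + size - 1
    aligned = (addr % size == 0)                         # operand-sized boundary
    crosses_line = (addr // LINE_BYTES) != (last // LINE_BYTES)
    crosses_page = (addr // PAGE_BYTES) != (last // PAGE_BYTES)
    return aligned, crosses_line, crosses_page
```

A 4-byte access at 0x3e is unaligned and crosses a 64-byte line; a 4-byte access at 0xffe additionally crosses a 4 KiB page, so it needs two translations as well as two cache accesses.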
16. Memory Data Dependencies
● Memory Dependency Detection:
  – Must compute the effective addresses of both memory references
  – Effective addresses can depend on run-time data and other instructions
  – Comparison of addresses requires much wider comparators
● Hard to handle memory dependencies
  – Memory addresses are much wider than register names (64 bits vs. 5 bits)
  – Memory dependencies are not static
    ● A load (or store) instruction's address can change (e.g., across loop iterations)
  – Addresses need to be calculated and translated first
  – Memory instructions take longer to execute relative to other instructions
    ● Cache misses can take 100s of cycles
    ● TLB misses can take 100s of cycles
17. Simple In-order Load/Store Processing: Total Load-Store Order
● Keep all loads and stores totally in order
● However, loads and stores can execute out of order with respect to other types of instructions while obeying register data dependences
● Question: So when can a store actually write to the cache?
  – What if we write to the cache as the store executes?
18. Store Buffers
● Stores
  – Allocate a store buffer entry at DISPATCH (in order)
  – When the register value is available, issue & calculate the address ("finished")
  – When all previous instructions retire, the store is considered "completed"
    ● The store buffer is split into "finished" and "completed" parts through pointers
  – Completed stores go to memory/cache in order
● Loads
  – Loads remember the store buffer entry of the last store before them
  – A load can issue when its address register value is available and
    ● all older stores are considered "completed"
● Q1: What happens to the store buffer when, say, a branch mispredicts?
● Q2: What happens when a snoop hits a store buffer entry?
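The finished/completed life cycle above can be sketched with an ordered queue. The entry fields and method names here are invented for illustration; a real design tracks the split with head/tail pointers rather than per-entry state:

```python
from collections import deque

class StoreBuffer:
    """Sketch of a store buffer: entries allocated in order at dispatch,
    marked 'finished' when address/data are ready, 'completed' at retire,
    and drained to the cache in order from the completed region."""
    def __init__(self):
        self.entries = deque()                 # oldest at the left

    def dispatch(self):
        e = {"addr": None, "data": None, "state": "allocated"}
        self.entries.append(e)
        return e

    def finish(self, e, addr, data):
        e["addr"], e["data"], e["state"] = addr, data, "finished"

    def complete(self, e):
        e["state"] = "completed"

    def drain_one(self, cache):
        """Write the oldest store to the cache only once it is completed."""
        if self.entries and self.entries[0]["state"] == "completed":
            e = self.entries.popleft()
            cache[e["addr"]] = e["data"]
            return True
        return False

    def flush_speculative(self):
        """Q1's answer in sketch form: on a mispredict, drop the young
        entries that are not yet completed; completed stores must drain."""
        while self.entries and self.entries[-1]["state"] != "completed":
            self.entries.pop()
```

Note how `drain_one` enforces in-order writes to memory, and `flush_speculative` shows why only the not-yet-completed part of the buffer can be discarded on a branch mispredict.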
20. Load Bypassing & Forwarding
● Bypassing
  – Loads can be allowed to bypass stores (if there is no aliasing)
  – Store addresses still need to be computed before loads can be issued, to allow checking for load dependences
  – If a dependence cannot be checked, e.g. because a store address cannot be determined, then all subsequent loads are held until the address is valid (conservative)
● Forwarding
  – If a subsequent load has a dependence on a store still in the store buffer, it need not wait until the store is issued to the data cache
  – The load can be satisfied directly from the store buffer if the address is valid and the data is available in the store buffer
21. Load Forwarding
Q: In case of multiple matches, which store do we forward from?
Q: In case of a partial match, can we forward?
22. Non-Speculative Disambiguation
● Non-speculative load/store disambiguation
  – Loads wait for the addresses of all prior stores
  – Full address comparison
  – Bypass if no match, forward if match
● Can limit performance:
  – load r5, MEM[r3]    ; cache miss
  – store r7, MEM[r5]   ; RAW for agen, stalled
  – …
  – load r8, MEM[r9]    ; independent load, stalled
23. Speculative Disambiguation
•
What if aliases are rare?
1.
2.
3.
4.
Loads don’t wait for addresses of all
prior stores
Full address comparison of stores that
are ready
Bypass if no match, forward if match
Check all store addresses when they
commit
–
–
5.
No matching loads – speculation was
correct
Matching unbypassed load – incorrect
speculation
Replay starting from incorrect load
27. Memory Dependence Prediction
● If aliases are rare: static prediction
  – Predict no alias every time (blind prediction)
  – Pay the misprediction penalty rarely
● If aliases are more frequent: dynamic prediction
  – Use some form of history table for loads
  – Store Set Algorithm
    ● Allow speculation of loads around stores when the program starts
    ● If a load and a store cause a violation, add the PC of the store to the load's store set
    ● The next time the load executes, it waits for all stores in its store set
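The store-set steps above can be sketched with a small table keyed by load PC. The class and method names are invented for illustration; the published algorithm additionally merges sets and bounds table size, which this sketch omits:

```python
class StoreSetPredictor:
    """Sketch of the store-set idea: speculate freely at first; on a
    violation, remember the store PC in the load's store set; later,
    the load waits for every pending store in its set."""
    def __init__(self):
        self.store_sets = {}            # load PC -> set of store PCs

    def may_speculate(self, load_pc, pending_store_pcs):
        """Load may bypass pending stores only if none is in its set."""
        blocked = self.store_sets.get(load_pc, set())
        return blocked.isdisjoint(pending_store_pcs)

    def record_violation(self, load_pc, store_pc):
        """A load/store pair violated: train the load's store set."""
        self.store_sets.setdefault(load_pc, set()).add(store_pc)
```

Initially every load may speculate; after one recorded violation, the load waits whenever the offending store PC is pending, but still bypasses unrelated stores.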
28. Prediction Implementation (Intel Core 2)
● A history table is indexed by the Instruction Pointer
● Each entry in the history table has a saturating counter
● Once the counter saturates, disambiguation is enabled for this load (taking effect from the next iteration): the load is allowed to go ahead even when it meets unknown store addresses
● When a particular load fails disambiguation, its counter is reset
● Each time a particular load is correctly disambiguated, its counter is incremented
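The counter discipline above can be sketched as follows. The table organization and the saturation threshold are assumptions for illustration, not Intel's actual parameters:

```python
class DisambiguationPredictor:
    """Sketch of a saturating-counter disambiguation predictor in the
    style described above (threshold value is an assumption)."""
    MAX = 15                          # counter saturates here

    def __init__(self):
        self.counters = {}            # load IP -> saturating counter

    def allow_speculation(self, ip):
        # Only a saturated counter lets the load go past unknown stores.
        return self.counters.get(ip, 0) == self.MAX

    def update(self, ip, disambiguated_correctly):
        if disambiguated_correctly:
            c = self.counters.get(ip, 0)
            self.counters[ip] = min(c + 1, self.MAX)
        else:
            self.counters[ip] = 0     # a single failure resets the counter
```

The asymmetry is the point: many correct disambiguations are needed to earn speculation, but one failure revokes it immediately.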
29. Data Prefetching
● S/W Prefetching
  – Instructions like prefetch (x86)
  – Cache touch instructions (Power)
● H/W Prefetching
  – Speculation about future memory access patterns based on previous patterns
  – Hardware monitors the processor's address reference pattern and issues a prefetch if a predictable memory address pattern is detected
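One classic H/W scheme matching the description above is a per-PC stride prefetcher: this is a minimal sketch, assuming a prefetch fires only after the same stride is seen twice in a row for a given load PC (confirmation policies vary across designs):

```python
class StridePrefetcher:
    """Sketch of a per-PC stride prefetcher: if two consecutive strides
    for the same load PC match, predict the next address and prefetch it."""
    def __init__(self):
        self.table = {}               # PC -> (last_addr, last_stride)

    def observe(self, pc, addr):
        """Feed one demand access; return a prefetch address or None."""
        prefetch = None
        if pc in self.table:
            last_addr, last_stride = self.table[pc]
            stride = addr - last_addr
            if stride != 0 and stride == last_stride:
                prefetch = addr + stride      # pattern confirmed
            self.table[pc] = (addr, stride)
        else:
            self.table[pc] = (addr, 0)        # first sighting: no stride yet
        return prefetch
```

A load walking an array with a 0x40-byte stride trains the entry on its first two accesses and triggers a prefetch for the next line from the third access onward.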