Load Store Execution in Processors – My learnings
Ramdas M
Core Block diagram

Load, Store Instructions
● Fixed point Load/Stores
  – Ld RT, RA, RB (Power)
  – St RS, RA, RB (Power)
  – MOV register, [address] (x86)
  – MOV [address], register (x86)
● Floating point Load/Stores
● Byte, half-word, word, double word access
● String forms (Block moves in x86)
● Locks
● Memory barriers (sfence, msync etc)
● Memory types (WB, UC, WT, WC)

Question 1
● What are the various steps in processing a load/store?

Load / Store Processing
● For both Loads and Stores:
  ● Effective Address Generation:
    – Must wait on register value
    – Must perform address calculation
  ● Address Translation:
    – Must access TLB; can potentially induce a page fault (exception)
● For Loads: D-cache Access (Read)
  – Can potentially induce a D-cache miss
  – Check aliasing against store buffer for possible load forwarding
  – If bypassing store, must be flagged as “speculative” load until completion
● For Stores: D-cache Access (Write)
  – When completing must check aliasing against “speculative” loads
  – After completion, wait in store buffer for access to D-cache
  – Can potentially induce a D-cache miss

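The first two steps (effective address generation, then translation through the TLB) can be sketched in a few lines. This is a toy model under stated assumptions: 4 KiB pages, a TLB and page table represented as plain dictionaries, and the hypothetical names `effective_address`, `translate` and `PageFault`, none of which come from the slides.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


class PageFault(Exception):
    """Raised when no translation exists for the virtual page."""


def effective_address(base, displacement=0):
    # Address generation: add the operands and wrap to 64 bits.
    return (base + displacement) & 0xFFFF_FFFF_FFFF_FFFF


def translate(ea, tlb, page_table):
    """Return (physical_address, tlb_hit) for an effective address.

    tlb and page_table both map virtual page number -> physical page number.
    """
    vpn, offset = divmod(ea, PAGE_SIZE)
    if vpn in tlb:
        return tlb[vpn] * PAGE_SIZE + offset, True
    if vpn not in page_table:
        raise PageFault(hex(ea))      # translation missing: exception
    tlb[vpn] = page_table[vpn]        # refill the TLB after the page walk
    return tlb[vpn] * PAGE_SIZE + offset, False
```

The first access to a page misses the TLB and refills it; a second access to the same page hits.
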
LSU pipeline
● RegFile Access
  – Read the source registers
● Address Generation
  – Add base, displacement, immediate fields to generate an EA
● Cache Access
  – Bank access if cache is multi-banked
  – Index into set, tag comparison for ways
  – TLB access
● Results
  – Target registers write back for loads
  – Store buffer/cache updates for stores
● Finish
  – Post instruction status (complete or flush etc)

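The "index into set, tag comparison for ways" step can be illustrated as follows. The geometry (64-byte lines, 64 sets) and the names `split_address` and `lookup` are assumptions made for this sketch only; a real cache compares all ways in parallel rather than in a loop.

```python
LINE_BYTES = 64   # assumption: 64-byte cache lines
NUM_SETS = 64     # assumption: 64 sets


def split_address(addr):
    """Split an address into (tag, set index, byte offset within the line)."""
    offset = addr % LINE_BYTES
    index = (addr // LINE_BYTES) % NUM_SETS
    tag = addr // (LINE_BYTES * NUM_SETS)
    return tag, index, offset


def lookup(cache, addr):
    """cache is a list of sets; each set is a list of tags, one per valid way.

    Returns the matching way number, or None on a miss.
    """
    tag, index, _ = split_address(addr)
    for way, stored_tag in enumerate(cache[index]):
        if stored_tag == tag:
            return way
    return None
```
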
Addressing modes
● An addressing mode is a mechanism for specifying an address.
● absolute: the address is provided directly
● register: the address is provided indirectly, by specifying where (in which register) the address can be found.
● displacement: the address is computed by adding a displacement to the contents of a register
● indexed: the address is computed by adding a displacement to the contents of a register, and then adding in the contents of another register times some constant.

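Each of the four modes above reduces to a small address computation. The sketch below models the register file as a dictionary; the function names and the register names in the usage note are illustrative, not taken from any ISA.

```python
def ea_absolute(address):
    # absolute: the address is given directly in the instruction
    return address


def ea_register(regs, base):
    # register: the address is the content of the named register
    return regs[base]


def ea_displacement(regs, base, disp):
    # displacement: register content plus a constant
    return regs[base] + disp


def ea_indexed(regs, base, index, scale, disp=0):
    # indexed: base + displacement + index register times a constant
    return regs[base] + disp + regs[index] * scale
```

With `regs = {"r1": 0x1000, "r2": 3}`, the indexed form with scale 8 and displacement 16 yields `0x1000 + 16 + 3 * 8 = 0x1028`.
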
Pipeline Diagrams
● Some Pipeline Diagrams to illustrate
  – L1 Hit loads/stores
  – L1 Miss, L2 Hit
  – L1, L2 Miss
  – TLB Misses

Pipeline Arbitration
● Loads/Stores from Issue Unit
● Re-executing loads/stores that missed DL1 or DTLB
● Line Fills from L2
● Snoops from different agent in case of MP
● Data Prefetches

Sub Units
● Load/Store Engine
  – Load/Store execution pipeline
  – 2-3 pipelines present in modern designs
● L1 Data cache
  – Multi-banked for simultaneous access to same line from multiple pipelines
  – Bank conflicts between loads/stores and snoops
  – Virtually/physically indexed
    ● Virtual indexing helps simultaneous access to TLB, but needs handling aliases.
  – WB/WT
    ● WB saves bandwidth on writes to L2, but needs handling snoops
  – Inclusive/Exclusive
  – Line Size

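Whether a virtually indexed cache needs alias handling depends on its geometry: if the set index and line offset bits all fall inside the page offset, the virtual and physical index are identical and aliases cannot land in different sets. A small check, assuming power-of-two sizes and using the hypothetical helper name `index_bits_above_page_offset`:

```python
def index_bits_above_page_offset(cache_bytes, ways, page_bytes):
    """Number of set-index bits taken from the virtual page number.

    Assumes all sizes are powers of two. Zero means the index is fully
    contained in the page offset, so no alias handling is needed.
    """
    way_bytes = cache_bytes // ways                 # span of index + offset
    way_bits = way_bytes.bit_length() - 1           # log2 for powers of two
    page_bits = page_bytes.bit_length() - 1
    return max(0, way_bits - page_bits)
```

For example, a 32 KiB 8-way cache with 4 KiB pages gives 0, while a 64 KiB 4-way cache with the same page size gives 2 bits that differ between aliases.
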
Sub Units
● Data TLBs
  – Caches Virtual to Physical translations
  – TLB miss will cause load or store to stall.
● Load Miss Queue
  – Tracks line fill requests to L2
  – Ld/St that miss DL1 including ownership upgrades
  – Handles multiple ld/store misses to same cacheline
  – Restarts loads/stores as line fills arrive
    ● Critical data forwarding to re-executing loads
    ● L2 Hit Restart for best load-to-use latency during L2 hit cases
● Store Buffers
● Load/Store Re-order queue
● Data Prefetch
● Exceptions

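The Load Miss Queue behaviour listed above (one fill request per line, later misses to the same line merged, waiters restarted when the fill arrives) can be sketched as below. The class and method names and the 64-byte line size are assumptions for illustration; ownership upgrades and critical-data forwarding are not modeled.

```python
LINE_BYTES = 64  # assumption: 64-byte cache lines


class LoadMissQueue:
    def __init__(self):
        # line number -> ids of loads/stores waiting on that line
        self.entries = {}

    def record_miss(self, addr, instr_id):
        """Track a DL1 miss. Returns True if a new fill request goes to L2."""
        line = addr // LINE_BYTES
        first_miss = line not in self.entries
        self.entries.setdefault(line, []).append(instr_id)
        return first_miss

    def line_fill(self, addr):
        """A fill arrived: return the instructions to restart, in miss order."""
        return self.entries.pop(addr // LINE_BYTES, [])
```
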
Alignments
● Aligned
  – Aligned on an operand sized boundary
● Unaligned
  – Access crossing operand sized boundary
  – Might get broken down into multiple accesses
● Line Crossing
  – Access crossing cachelines.
  – Broken down into 2 accesses and data gets merged together
  – Not guaranteed to be atomic (both x86, Power)
● Page Crossing
  – Access crossing page boundaries
  – Broken down into 2 accesses, 2 TLB/Page miss handling

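The four cases above can be told apart from the address and operand size alone. The sketch below assumes 64-byte lines and 4 KiB pages; `classify_access` and `split_at_line` are hypothetical helper names, and the split shows only how the two pieces of a line-crossing access would be formed.

```python
LINE_BYTES = 64    # assumption: 64-byte cache lines
PAGE_BYTES = 4096  # assumption: 4 KiB pages


def classify_access(addr, size):
    """Return the most severe category that applies to the access."""
    last = addr + size - 1                      # address of the last byte
    if addr // PAGE_BYTES != last // PAGE_BYTES:
        return "page crossing"
    if addr // LINE_BYTES != last // LINE_BYTES:
        return "line crossing"
    if addr % size != 0:
        return "unaligned"
    return "aligned"


def split_at_line(addr, size):
    """Break an access into (addr, size) pieces that each stay in one line."""
    boundary = (addr // LINE_BYTES + 1) * LINE_BYTES
    if addr + size <= boundary:
        return [(addr, size)]
    first = boundary - addr
    return [(addr, first), (boundary, size - first)]
```
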
Unaligned Access
● How are unaligned accesses handled?

Memory Data Dependences

Question 2
● Why is it hard to handle memory dependency?

Memory Data Dependencies
● Memory Dependency Detection:
  – Must compute effective addresses of both memory references
  – Effective addresses can depend on run-time data and other instructions
  – Comparison of addresses requires much wider comparators
● Hard to handle memory dependencies
  – Memory addresses are much wider than register names (64 bits vs 5 bits)
  – Memory dependencies are not static
    ● A load (or store) instruction’s address can change (e.g. loop)
  – Addresses need to be calculated and translated first
  – Memory instructions take longer to execute relative to other instructions
    ● Cache misses can take 100s of cycles
    ● TLB misses can take 100s of cycles

Simple In-order Load/Store Processing: Total Load-Store Order
● Keep all loads and stores totally in order
● However, loads and stores can execute out of order with respect to other types of instructions while obeying register data dependences
● Question: So when can a store actually write to cache?
  – What if we write to cache as it executes?

Store Buffers
● Stores
  – Allocate store buffer entry at DISPATCH (in-order)
  – When register value available, issue & calculate address (“finished”)
  – When all previous instructions retire, store considered completed
    ● Store buffer split into “finished” and “completed” parts through pointers
  – Completed stores go to memory/cache in order
● Loads
  – Loads remember the store buffer entry of the last store before them
  – A load can issue when the address register value is available and
    ● All older stores are considered “completed”
● Q1: What happens to the Store buffer when, say, a branch mispredicts?
● Q2: What happens when a snoop hits a Store Buffer entry?

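The store side of this scheme can be sketched as a queue with a "completed" pointer. This is a toy model, not a description of any particular design: the `StoreBuffer` class, its method names and the dictionary standing in for memory are all assumptions, and the two open questions on the slide (flush on mispredict, snoop hits) are deliberately not modeled.

```python
from collections import deque


class StoreBuffer:
    def __init__(self):
        self.entries = deque()  # oldest store at the left (program order)
        self.completed = 0      # the first `completed` entries are "completed"

    def dispatch(self, store_id):
        # Entry allocated in order at dispatch; address and data unknown yet.
        self.entries.append({"id": store_id, "addr": None, "data": None})

    def finish(self, store_id, addr, data):
        # Store has executed: address calculated, data available.
        for entry in self.entries:
            if entry["id"] == store_id:
                entry["addr"], entry["data"] = addr, data

    def complete_oldest(self):
        # Called once every instruction before this store has retired.
        entry = self.entries[self.completed]
        assert entry["addr"] is not None, "store must finish before completing"
        self.completed += 1

    def drain(self, memory):
        # Completed stores are written to memory/cache in program order.
        while self.completed:
            entry = self.entries.popleft()
            memory[entry["addr"]] = entry["data"]
            self.completed -= 1
```

Stores may finish out of order (store 2 before store 1, say), but `drain` only ever writes from the old end of the queue, so memory is updated in program order.
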
Load Bypassing & Forwarding
● Load Bypassing
● Load Forwarding

Load Bypassing & Forwarding
● Bypassing
  – Loads can be allowed to bypass stores (if no aliasing).
  – Store addresses still need to be computed before loads can be issued to allow checking for load dependences.
  – If dependence cannot be checked, e.g. store address cannot be determined, then all subsequent loads are held until address is valid (conservative).
● Forwarding
  – If a subsequent load has a dependence on a store still in the store buffer, it need not wait till the store is issued to the data cache.
  – The load can be directly satisfied from the store buffer if the address is valid and the data is available in the store buffer.

Load Forwarding

Q: In case of multiple matches, which store do we forward from?
Q: In case of a partial match, can we forward?

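The bypass/forward decision for one load can be sketched as a search over the older stores in the store buffer. The function `load_lookup` and its tuple encoding are assumptions for illustration. The sketch takes the youngest older store with a matching address as the forwarding source, which is what program order requires, and it assumes same-size accesses with exact address matches, so partial overlaps are not modeled.

```python
def load_lookup(older_stores, load_addr):
    """Decide how a load is satisfied.

    older_stores: (addr, data) pairs for stores older than the load, oldest
    first; addr or data is None while not yet available.
    Returns ("stall", None), ("forward", data) or ("bypass", None).
    """
    # Conservative rule: an unknown store address holds the load.
    if any(addr is None for addr, _ in older_stores):
        return ("stall", None)
    # Search from the youngest older store back to the oldest.
    for addr, data in reversed(older_stores):
        if addr == load_addr:
            if data is None:
                return ("stall", None)   # match, but store data not ready
            return ("forward", data)
    return ("bypass", None)              # no alias: read the data cache
```
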
Non-Speculative Disambiguation
● Non-speculative load/store disambiguation
  – Loads wait for addresses of all prior stores
  – Full address comparison
  – Bypass if no match, forward if match
● Can limit performance:
  – load r5, MEM[r3]     cache miss
  – store r7, MEM[r5]    RAW for agen, stalled
  – …
  – load r8, MEM[r9]     independent load stalled

Speculative Disambiguation
• What if aliases are rare?
  1. Loads don’t wait for addresses of all prior stores
  2. Full address comparison of stores that are ready
  3. Bypass if no match, forward if match
  4. Check all store addresses when they commit
     – No matching loads: speculation was correct
     – Matching unbypassed load: incorrect speculation
  5. Replay starting from incorrect load

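Steps 4 and 5 amount to a search of the load queue when a store commits. The sketch below is a simplified model: sequence numbers stand for program order, and each executed load records which store (if any) forwarded its data. The `LoadQueue` class and its field layout are assumptions made for illustration.

```python
class LoadQueue:
    def __init__(self):
        # (seq, addr, source) per executed load; source is the sequence
        # number of the store that forwarded the data, or None if the
        # load read the data cache.
        self.loads = []

    def execute_load(self, seq, addr, source=None):
        self.loads.append((seq, addr, source))

    def store_commit(self, store_seq, store_addr):
        """Return the oldest load that must replay, or None if none.

        A load violates if it is younger than the store, reads the same
        address, and took its data from somewhere older than this store.
        """
        violators = [
            seq for seq, addr, source in self.loads
            if seq > store_seq and addr == store_addr
            and (source is None or source < store_seq)
        ]
        return min(violators) if violators else None
```

A return value of `None` corresponds to "speculation was correct"; otherwise replay would start from the returned load.
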
Speculative Disambiguation: Safe Speculation
• i1 and i2 issue out of program order
• i1 checks load queue at commit (no match)

Speculative Disambiguation: Violation
• i1 and i2 issue out of program order
• i1 checks load queue at commit (match)
  – i2 marked for replay

Load/Store Re-order queues

Memory Dependence Prediction
● If aliases are rare: static prediction
  – Predict no alias every time (Blind prediction)
  – Pay misprediction penalty rarely
● If aliases are more frequent: dynamic prediction
  – Use some form of history tables for loads
  – Store Set Algorithm
    ● Allow speculation of loads around stores when program starts
    ● If a load and store cause a violation, add the PC of the store to the load's store set.
    ● Next time the load executes, it waits for all stores in the store set

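The three store-set rules above map directly onto a table keyed by load PC. This is a literal, unbounded rendering of the rules as stated on the slide, using the hypothetical names `StoreSetPredictor`, `record_violation` and `must_wait_for`; it makes no claim about how hardware would size or encode the table.

```python
from collections import defaultdict


class StoreSetPredictor:
    def __init__(self):
        # load PC -> PCs of stores this load has conflicted with.
        # Empty at program start, so every load may speculate.
        self.store_sets = defaultdict(set)

    def record_violation(self, load_pc, store_pc):
        # A load and a store caused a violation: grow the load's store set.
        self.store_sets[load_pc].add(store_pc)

    def must_wait_for(self, load_pc, inflight_store_pcs):
        """PCs of in-flight stores the load should wait for before issuing."""
        return self.store_sets.get(load_pc, set()) & set(inflight_store_pcs)
```
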
Prediction Implementation (Intel Core 2)
• History table indexed by Instruction Pointer
• Each entry in the history array has a saturating counter
• Once the counter saturates: disambiguation is possible on this load (takes effect from the next iteration); the load is allowed to go even if it meets unknown store addresses
• When a particular load fails disambiguation: reset its counter
• Each time a particular load is correctly disambiguated: increment its counter
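
The counter scheme described above can be sketched as follows. The table size, the counter maximum and the simple modulo indexing are assumptions for this example; the slide does not give the actual parameters, and the class name `DisambiguationPredictor` is hypothetical.

```python
class DisambiguationPredictor:
    def __init__(self, entries=256, max_count=15):
        # Assumed sizes; one saturating counter per history table entry.
        self.counters = [0] * entries
        self.max_count = max_count

    def _index(self, ip):
        return ip % len(self.counters)

    def may_speculate(self, ip):
        # The load may pass unknown store addresses only once saturated.
        return self.counters[self._index(ip)] == self.max_count

    def update(self, ip, aliased):
        """aliased=True means the load turned out to conflict with a store."""
        i = self._index(ip)
        if aliased:
            self.counters[i] = 0                      # failed: reset
        else:
            self.counters[i] = min(self.max_count, self.counters[i] + 1)
```
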
Data Prefetching
● S/W Prefetching
  – Instructions like prefetch (x86)
  – Cache touch instructions (Power)
● H/W Prefetching
  – Speculation about future memory access patterns based on previous patterns
  – Hardware monitors the processor's address reference pattern and issues prefetch if a predictable memory address pattern is detected

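One simple "predictable pattern" a hardware prefetcher could detect is a constant stride per load instruction. The slide does not say which patterns are detected, so the stride scheme, the `StridePrefetcher` class and the rule of prefetching after two equal strides are all assumptions chosen for illustration.

```python
class StridePrefetcher:
    def __init__(self):
        # load PC -> (last address, last observed stride)
        self.table = {}

    def access(self, pc, addr):
        """Observe one access. Returns an address to prefetch, or None."""
        last = self.table.get(pc)
        if last is None:
            self.table[pc] = (addr, 0)
            return None
        last_addr, last_stride = last
        stride = addr - last_addr
        self.table[pc] = (addr, stride)
        # Prefetch only when the same non-zero stride is seen twice in a row.
        if stride != 0 and stride == last_stride:
            return addr + stride
        return None
```
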
Exceptions
● Alignment Exceptions
● Page Faults
● Cache Parity Errors

Load Store Execution

  • 1. Load Store Execution in Processors – My learnings Ramdas M
  • 3. Load,Store Instructions ● Fixed point Load/Stores – Ld RT, RA, RB (Power) – St RS, RA, RB (Power) – MOV register, [address] (x86) – MOV [address], register (x86) ● Floating point Load/Stores ● Byte, half-word, word, double word access ● String forms (Block moves in x86) ● Locks ● Memory barriers (sfence, msync etc) ● Memory types (WB, UC, WT, WC) 3
  • 4. Question 1 ● What are the various steps in processing a load/store? 4
  • 5. Load / Store Processing ● For both Loads and Stores: ● Effective Address Generation: – – ● Must wait on register value Must perform address calculation Address Translation: – ● Must access TLB, Can potentially induce a page fault (exception) For Loads: D-cache Access (Read) – – Check aliasing against store buffer for possible load forwarding – ● Can potentially induce a D-cache miss If bypassing store, must be flagged as “speculative” load until completion For Stores: D-cache Access (Write) – When completing must check aliasing against “speculative” loads – After completion, wait in store buffer for access to D-cache – Can potentially induce a D-cache miss 5
  • 6. LSU pipeline
      ● RegFile Access
        – Read the source registers
      ● Address Generation
        – Add base, displacement, immediate fields to generate an effective address (EA)
      ● Cache Access
        – Index into set, tag comparison for ways
        – Bank access if cache is multi-banked
        – TLB access
      ● Results
        – Target register write-back for loads
        – Store buffer/cache updates for stores
      ● Finish
        – Post instruction status (complete or flush etc.)
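The "index into set, tag comparison for ways" step can be sketched as below. This is only an illustration: the line size, set count, and the helper names `split_address` and `lookup` are assumptions made for the example, not values from the slides.

```python
# Illustrative cache geometry (assumed, not from the slides).
LINE_BYTES = 64   # bytes per cache line
NUM_SETS = 64     # number of sets in the cache


def split_address(addr):
    """Split an address into (tag, set_index, line_offset)."""
    offset = addr % LINE_BYTES
    set_index = (addr // LINE_BYTES) % NUM_SETS
    tag = addr // (LINE_BYTES * NUM_SETS)
    return tag, set_index, offset


def lookup(cache_sets, addr):
    """Return the way that hits, or None on a miss.

    cache_sets[set_index] is a list with one stored tag per way
    (None marks an invalid way).
    """
    tag, set_index, _ = split_address(addr)
    for way, stored_tag in enumerate(cache_sets[set_index]):
        if stored_tag == tag:
            return way
    return None
```

In hardware the tag comparisons for all ways happen in parallel; the loop here only models the outcome.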
  • 7. Addressing modes
      ● An addressing mode is a mechanism for specifying an address.
      ● absolute: the address is provided directly
      ● register: the address is provided indirectly, by specifying which register holds it
      ● displacement: the address is computed by adding a displacement to the contents of a register
      ● indexed: the address is computed by adding a displacement to the contents of a register, and then adding in the contents of another register times some constant
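A minimal sketch of how the four modes above produce an effective address. The function name, the mode strings, and the register names are hypothetical, and 64-bit wraparound is an assumption of the example.

```python
MASK64 = (1 << 64) - 1  # assume 64-bit effective addresses


def effective_address(mode, regs, base=None, index=None, scale=1, disp=0):
    """Compute an effective address for the given addressing mode.

    regs maps register names to values; base and index name registers.
    """
    if mode == "absolute":
        ea = disp                      # address given directly
    elif mode == "register":
        ea = regs[base]                # address held in a register
    elif mode == "displacement":
        ea = regs[base] + disp         # register plus displacement
    elif mode == "indexed":
        ea = regs[base] + disp + regs[index] * scale
    else:
        raise ValueError("unknown addressing mode: " + mode)
    return ea & MASK64
```

For example, with `regs = {"r1": 0x1000, "r2": 3}`, the indexed form with `disp=8` and `scale=4` yields `0x1014`.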
  • 8. Pipeline Diagrams
      ● Some pipeline diagrams to illustrate
        – L1 hit loads/stores
        – L1 miss, L2 hit
        – L1, L2 miss
        – TLB misses
  • 9. Pipeline Arbitration
      ● Loads/stores from the issue unit
      ● Re-executing loads/stores that missed DL1 or DTLB
      ● Line fills from L2
      ● Snoops from other agents in multiprocessor (MP) systems
      ● Data prefetches
  • 10. Sub Units
      ● Load/Store Engine
        – Load/Store execution pipeline
        – 2-3 pipelines present in modern designs
      ● L1 Data cache
        – Multi-banked for simultaneous access to the same line from multiple pipelines
        – Bank conflicts between loads/stores and snoops
        – Virtually/physically indexed
          ● Virtual indexing lets the cache be indexed in parallel with the TLB access, but aliases must be handled
        – WB/WT
          ● WB saves bandwidth on writes to L2, but snoops must be handled
        – Inclusive/Exclusive
        – Line size
  • 11. Sub Units
      ● Data TLBs
        – Cache virtual-to-physical translations
        – A TLB miss will cause the load or store to stall
      ● Load Miss Queue
        – Tracks line fill requests to L2
        – Ld/St that miss DL1, including ownership upgrades
        – Handles multiple ld/st misses to the same cacheline
        – Restarts loads/stores as line fills arrive
          ● Critical data forwarding to re-executing loads
          ● L2 hit restart for best load-to-use latency during L2 hit cases
      ● Store Buffers
      ● Load/Store Re-order queue
      ● Data Prefetch
      ● Exceptions
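One way to picture the load miss queue merging several misses to the same cacheline into a single L2 request is the sketch below. The class and method names and the 64-byte line size are assumptions for illustration; ownership upgrades and critical data forwarding are not modeled.

```python
LINE_BYTES = 64  # assumed cache line size


class LoadMissQueue:
    def __init__(self):
        self.pending = {}       # line address -> waiting instruction ids
        self.l2_requests = []   # line fill requests sent to L2, in order

    def miss(self, inst_id, addr):
        """Record a DL1 miss; request the line from L2 only once."""
        line = (addr // LINE_BYTES) * LINE_BYTES
        if line not in self.pending:
            self.pending[line] = []
            self.l2_requests.append(line)
        self.pending[line].append(inst_id)

    def fill(self, line):
        """A line fill arrived: return the instructions to restart."""
        return self.pending.pop(line, [])
```

Two loads that miss on addresses `0x100` and `0x120` share one entry and are both restarted by the single fill of line `0x100`.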
  • 12. Alignments
      ● Aligned
        – Aligned on an operand-sized boundary
      ● Unaligned
        – Access crossing an operand-sized boundary
        – Might get broken down into multiple accesses
      ● Line Crossing
        – Access crossing cachelines
        – Broken down into 2 accesses and the data gets merged together
        – Not guaranteed to be atomic (both x86, Power)
      ● Page Crossing
        – Access crossing page boundaries
        – Broken down into 2 accesses, 2 TLB/page miss handling
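The categories above can be checked with simple address arithmetic. The following is a sketch only: the 64-byte line and 4 KB page sizes and the function names are assumptions, and `split_at_line` assumes the access is no larger than one line.

```python
LINE_BYTES = 64     # assumed cache line size
PAGE_BYTES = 4096   # assumed page size


def classify(addr, size):
    """Classify an access of `size` bytes starting at `addr`."""
    last = addr + size - 1  # address of the last byte touched
    return {
        "aligned": addr % size == 0,
        "line_crossing": addr // LINE_BYTES != last // LINE_BYTES,
        "page_crossing": addr // PAGE_BYTES != last // PAGE_BYTES,
    }


def split_at_line(addr, size):
    """Break a line-crossing access into two (addr, size) pieces."""
    last = addr + size - 1
    if addr // LINE_BYTES == last // LINE_BYTES:
        return [(addr, size)]
    boundary = (addr // LINE_BYTES + 1) * LINE_BYTES
    return [(addr, boundary - addr), (boundary, addr + size - boundary)]
```

An 8-byte access at `0x103C` is unaligned and line crossing but stays within one page; it splits into two 4-byte pieces at `0x103C` and `0x1040`.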
  • 13. Unaligned Access
      ● How are unaligned accesses handled?
  • 15. Question 2
      ● Why is it hard to handle memory dependencies?
  • 16. Memory Data Dependencies
      ● Memory Dependency Detection:
        – Must compute effective addresses of both memory references
        – Effective addresses can depend on run-time data and other instructions
        – Comparison of addresses requires much wider comparators
      ● Hard to handle memory dependencies
        – Memory addresses are much wider than register names (64 bits vs 5 bits)
        – Memory dependencies are not static
          ● A load (or store) instruction’s address can change (e.g. in a loop)
        – Addresses need to be calculated and translated first
        – Memory instructions take longer to execute relative to other instructions
          ● Cache misses can take 100s of cycles
          ● TLB misses can take 100s of cycles
  • 17. Simple In-order Load/Store Processing: Total Load-Store Order
      ● Keep all loads and stores totally in order
      ● However, loads and stores can execute out of order with respect to other types of instructions while obeying register data dependences
      ● Question: So when can a store actually write to the cache?
        – What if we write to the cache as the store executes?
  • 18. Store Buffers
      ● Stores
        – Allocate store buffer entry at DISPATCH (in-order)
        – When register value is available, issue & calculate address (“finished”)
        – When all previous instructions retire, store is considered completed
          ● Store buffer is split into “finished” and “completed” parts through pointers
        – Completed stores go to memory/cache in order
      ● Loads
        – Loads remember the store buffer entry of the last store before them
        – A load can issue when its address register value is available and
          ● All older stores are considered “completed”
      ● Q1: What happens to the store buffer when, say, a branch mispredicts?
      ● Q2: What happens when a snoop hits a store buffer entry?
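The store life cycle above (dispatch, finish, complete, drain in order) can be modeled roughly as follows. This is a sketch under stated assumptions: the class, its method names, and the `seq` program-order number are hypothetical, and `flush_younger_than` shows just one plausible answer to Q1 (drop the uncompleted stores younger than the mispredicted branch), not necessarily the one the author had in mind.

```python
class StoreBuffer:
    def __init__(self):
        self.entries = []    # program order, oldest first
        self.completed = 0   # entries[:completed] are "completed"

    def dispatch(self, seq):
        """Allocate an entry in order; seq is a program-order number."""
        entry = {"seq": seq, "addr": None, "data": None, "finished": False}
        self.entries.append(entry)
        return entry

    def finish(self, entry, addr, data):
        """Address and data are now known: the store is "finished"."""
        entry.update(addr=addr, data=data, finished=True)

    def retire_oldest(self):
        """All previous instructions retired: oldest pending store completes."""
        if not self.entries[self.completed]["finished"]:
            raise RuntimeError("store must finish before it completes")
        self.completed += 1

    def drain(self, memory):
        """Write the oldest completed store to memory; False if none."""
        if self.completed == 0:
            return False
        entry = self.entries.pop(0)
        self.completed -= 1
        memory[entry["addr"]] = entry["data"]
        return True

    def flush_younger_than(self, seq):
        """Branch mispredict at `seq`: drop younger (uncompleted) stores."""
        self.entries = [e for e in self.entries if e["seq"] <= seq]
```

Completed stores are always older than any unresolved branch, so the flush never touches them and `completed` stays valid.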
  • 19. Load Bypassing & Forwarding
      ● Load Bypassing
      ● Load Forwarding
  • 20. Load Bypassing & Forwarding
      ● Bypassing
        – Loads can be allowed to bypass stores (if no aliasing).
        – Store addresses still need to be computed before loads can be issued, to allow checking for load dependences.
        – If a dependence cannot be checked, e.g. a store address cannot be determined, then all subsequent loads are held until the address is valid (conservative).
      ● Forwarding
        – If a subsequent load has a dependence on a store still in the store buffer, it need not wait until the store is issued to the data cache.
        – The load can be directly satisfied from the store buffer if the address is valid and the data is available in the store buffer.
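A rough sketch of the conservative bypass/forward decision described above. The function name and the entry layout are hypothetical. It assumes full-address matches on same-sized accesses (partial matches are not modeled) and, when several older stores match, forwards from the youngest of them, which is a common choice but an assumption here.

```python
def resolve_load(load_addr, older_stores):
    """Decide how a load proceeds against the older stores in the buffer.

    older_stores is a list of dicts, oldest first, each with:
      "addr": store address, or None if not yet computed
      "data": store data, or None if not yet available
    Returns ("stall", None), ("forward", data) or ("bypass", None).
    """
    # Conservative rule: an unknown older store address holds the load.
    if any(s["addr"] is None for s in older_stores):
        return ("stall", None)
    # Search from the youngest older store back to the oldest.
    for store in reversed(older_stores):
        if store["addr"] == load_addr:
            if store["data"] is None:
                return ("stall", None)      # match, but data not ready
            return ("forward", store["data"])
    return ("bypass", None)                 # no aliasing: read the D-cache
```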
  • 21. Load Forwarding
      ● Q: In case of multiple matches, which store do we forward from?
      ● Q: In case of a partial match, can we forward?
  • 22. Non-Speculative Disambiguation
      ● Non-speculative load/store disambiguation
        – Loads wait for addresses of all prior stores
        – Full address comparison
        – Bypass if no match, forward if match
      ● Can limit performance:
        – load r5, MEM[r3] (cache miss)
        – store r7, MEM[r5] (RAW for agen, stalled)
        – …
        – load r8, MEM[r9] (independent load, stalled)
  • 23. Speculative Disambiguation
      ● What if aliases are rare?
        1. Loads don’t wait for addresses of all prior stores
        2. Full address comparison of stores that are ready
        3. Bypass if no match, forward if match
        4. Check all store addresses when they commit
           – No matching loads: speculation was correct
           – Matching unbypassed load: incorrect speculation
        5. Replay starting from incorrect load
  • 24. Speculative Disambiguation: Safe Speculation
      ● i1 (the older store) and i2 (the younger load) issue out of program order
      ● i1 checks load queue at commit (no match)
  • 25. Speculative Disambiguation: Violation
      ● i1 (the older store) and i2 (the younger load) issue out of program order
      ● i1 checks load queue at commit (match)
        – i2 marked for replay
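The commit-time check in the two cases above might look like the sketch below. The function name and the load-queue fields are hypothetical; a load counts as a violator when it is younger than the committing store, has already executed to the same address, and took its data from somewhere older than that store.

```python
def check_store_commit(store_seq, store_addr, load_queue):
    """Return the seq of the oldest load to replay from, or None.

    load_queue is a list of dicts with:
      "seq": program-order number of the load
      "addr": load address
      "executed": True if the load has already obtained its data
      "forwarded_from": seq of the store it forwarded from, or None
                        if the data came from the cache
    """
    violators = []
    for load in load_queue:
        if load["seq"] < store_seq or not load["executed"]:
            continue                      # older load, or not yet issued
        if load["addr"] != store_addr:
            continue                      # no alias
        source = load["forwarded_from"]
        if source is None or source < store_seq:
            violators.append(load["seq"])  # load saw stale data
    return min(violators) if violators else None
```

A `None` result corresponds to the safe case; otherwise replay starts from the returned load.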
  • 27. Memory Dependence Prediction
      ● If aliases are rare: static prediction
        – Predict no alias every time (blind prediction)
        – Pay misprediction penalty rarely
      ● If aliases are more frequent: dynamic prediction
        – Use some form of history tables for loads
        – Store Set Algorithm
          ● Allow speculation of loads around stores when the program starts
          ● If a load and a store cause a violation, add the PC of the store to the load's store set
          ● Next time the load executes, it waits for all stores in the store set
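The three store-set rules above can be sketched with a dictionary of sets. This is a simplified software model with hypothetical names; a hardware implementation would use fixed-size tables rather than unbounded per-load sets.

```python
class StoreSets:
    def __init__(self):
        self.sets = {}  # load PC -> set of store PCs it has conflicted with

    def must_wait_for(self, load_pc, inflight_store_pcs):
        """Return the in-flight older store PCs this load should wait for."""
        store_set = self.sets.get(load_pc, set())
        return [pc for pc in inflight_store_pcs if pc in store_set]

    def record_violation(self, load_pc, store_pc):
        """A load/store pair caused a violation: remember the store."""
        self.sets.setdefault(load_pc, set()).add(store_pc)
```

A load with an empty store set speculates freely; after one violation with the store at PC `0x3f0`, it waits whenever that store is in flight ahead of it.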
  • 28. Prediction Implementation (Intel Core 2)
      ● History table indexed by instruction pointer
      ● Each entry in the history array has a saturating counter
      ● Once the counter saturates: disambiguation is possible on this load (takes effect from the next iteration); the load is allowed to go even if it meets unknown store addresses
      ● When a particular load fails disambiguation: reset its counter
      ● Each time a particular load is correctly disambiguated: increment its counter
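The counter scheme above can be sketched as follows. The table size, the counter maximum, and the simple modulo indexing are assumptions chosen for the example; the slides do not give the actual values.

```python
class DisambiguationPredictor:
    def __init__(self, entries=256, max_count=15):
        self.entries = entries        # assumed table size
        self.max_count = max_count    # assumed saturation value
        self.counters = [0] * entries

    def _index(self, ip):
        return ip % self.entries      # assumed indexing function

    def may_speculate(self, ip):
        """A load may pass unknown store addresses once saturated."""
        return self.counters[self._index(ip)] == self.max_count

    def update(self, ip, disambiguated_correctly):
        """Increment on a correct outcome, reset on a failure."""
        i = self._index(ip)
        if disambiguated_correctly:
            self.counters[i] = min(self.counters[i] + 1, self.max_count)
        else:
            self.counters[i] = 0
```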
  • 29. Data Prefetching
      ● S/W Prefetching
        – Instructions like prefetch (x86), cache touch instructions (Power)
      ● H/W Prefetching
        – Speculation about future memory access patterns based on previous patterns
        – Hardware monitors the processor's address reference pattern and issues a prefetch if a predictable memory address pattern is detected
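As one example of detecting a "predictable memory address pattern", the sketch below implements a per-PC stride detector. Stride detection is an assumption made for illustration, since the slides do not name a specific scheme, and the class and method names are hypothetical.

```python
class StridePrefetcher:
    def __init__(self):
        self.table = {}  # load PC -> (last address, last stride)

    def access(self, pc, addr):
        """Record an access; return an address to prefetch, or None."""
        prefetch = None
        if pc in self.table:
            last_addr, last_stride = self.table[pc]
            stride = addr - last_addr
            # Prefetch only when the same non-zero stride repeats.
            if stride != 0 and stride == last_stride:
                prefetch = addr + stride
            self.table[pc] = (addr, stride)
        else:
            self.table[pc] = (addr, 0)
        return prefetch
```

A load that touches addresses 100, 164, 228 triggers a prefetch of 292 on the third access, once the stride of 64 has been seen twice.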