Taming Non-blocking Caches to Improve
Isolation in Multicore Real-Time Systems
Prathap Kumar Valsan, Heechul Yun, Farzad Farshchi
University of Kansas
1
High-Performance Multicores
for Real-Time Systems
• Why?
– Intelligence → more performance
– Space, weight, power (SWaP), cost
2
Time Predictability Challenge
3
• Shared hardware resource contention can cause
significant interference delays
• Shared cache is a major shared resource
[Figure: Task 1–4 on Core1–Core4, contending on the shared cache, memory controller (MC), and DRAM]
Cache Partitioning
• Can be done in either software or hardware
– Software: page coloring
– Hardware: way partitioning
• Eliminate unwanted cache-line evictions
• Common assumption
– cache partitioning → performance isolation
• If working-set fits in the cache partition
• Not necessarily true on out-of-order cores using
non-blocking caches
4
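The software option above, page coloring, works because some cache-set index bits lie above the page-offset bits, so the OS can steer a task onto a subset of cache sets by choosing physical pages accordingly. A minimal sketch of the color computation, with illustrative cache geometry (not taken from the slides):

```c
#include <stdint.h>

/* Page coloring: the OS picks physical pages whose "color" bits
   (set-index bits above the page offset) match a task's partition.
   The geometry below is illustrative, not any specific platform. */

#define PAGE_SHIFT 12   /* 4 KiB pages */
#define LINE_SHIFT 6    /* 64 B cache lines */
#define NUM_SETS   2048 /* e.g., a 2 MiB, 16-way LLC */

/* Number of usable colors = set-index bits above the page offset. */
static inline uint32_t num_colors(void)
{
    uint32_t set_bits = 0;
    for (uint32_t s = NUM_SETS; s > 1; s >>= 1)
        set_bits++;
    uint32_t overlap = (LINE_SHIFT + set_bits > PAGE_SHIFT)
                     ? (LINE_SHIFT + set_bits - PAGE_SHIFT) : 0;
    return 1u << overlap;
}

/* Color of a physical page frame number (PFN). */
static inline uint32_t page_color(uint64_t pfn)
{
    return (uint32_t)(pfn % num_colors());
}
```

With these parameters there are 32 colors; allocators like PALLOC (slide 11) apply the same idea to DRAM banks as well as cache sets.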
Outline
• Evaluate the isolation effect of cache partitioning
– On four real COTS multicore architectures
– Observed up to 21X slowdown of a task running on a
dedicated core accessing a dedicated cache partition
• Understand the source of contention
– Using a cycle-accurate full system simulator
• Propose an OS/architecture collaborative solution
– For better cache performance isolation
5
Memory-Level Parallelism (MLP)
• Broadly defined as the number of concurrent
memory requests that a given architecture
can handle at a time
6
Non-blocking Cache(*)
• Can serve cache hits under multiple cache misses
– Essential for an out-of-order core and any multicore
• Miss-Status-Holding Registers (MSHRs)
– On a miss, allocate an MSHR entry to track the request
– On receiving the data, clear the MSHR entry
7
[Figure: blocking vs. non-blocking cache timelines — a blocking cache stalls for the full miss penalty, while a non-blocking cache overlaps multiple outstanding misses and stalls only when the result is needed]
(*) D. Kroft. “Lockup-free instruction fetch/prefetch cache organization,” ISCA’81
Non-blocking Cache
• # of cache MSHRs determines the memory-level parallelism
(MLP) of the cache
• What happens if all MSHRs are occupied?
– The cache is locked up
– Subsequent accesses---including cache hits---to
the cache stall
– We will see the impact of this in later experiments
8
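The lockup behavior described above can be modeled in a few lines. This is a toy sketch of MSHR allocation (sizes and names are illustrative), showing why exhausting the MSHRs stalls even cache hits:

```c
#include <stdbool.h>

/* Toy model of a non-blocking cache's MSHR file: a miss allocates
   an MSHR; once all MSHRs are occupied the cache locks up and even
   hits must stall. NUM_MSHRS is illustrative. */

#define NUM_MSHRS 6

static int mshrs_in_use = 0;

/* Returns true if the access proceeds, false if the core stalls. */
static bool cache_access(bool is_miss)
{
    if (mshrs_in_use == NUM_MSHRS)
        return false;        /* cache locked up: even hits stall */
    if (is_miss)
        mshrs_in_use++;      /* allocate an MSHR to track the miss */
    return true;
}

/* Called when memory returns the data for an outstanding miss. */
static void miss_returns(void)
{
    if (mshrs_in_use > 0)
        mshrs_in_use--;      /* clear the MSHR entry */
}
```

After NUM_MSHRS outstanding misses, even a hit returns "stall" until one miss completes — exactly the contention mechanism the later experiments expose.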
COTS Multicore Platforms
Cortex-A7: 4 cores @ 1.4GHz, in-order; 512KB shared LLC
Cortex-A9: 4 cores @ 1.7GHz, out-of-order; 1MB shared LLC
Cortex-A15: 4 cores @ 2.0GHz, out-of-order; 2MB shared LLC
Nehalem: 4 cores @ 2.8GHz, out-of-order; 8MB shared LLC
9
• COTS multicore platforms
– Odroid-XU4: 4x Cortex-A7 and 4x Cortex-A15
– Odroid-U3: 4x Cortex-A9
– Dell desktop: Intel Xeon quad-core (Nehalem)
Identified MLP
• Local MLP
– MLP of a core-private cache
• Global MLP
– MLP of the shared cache (and DRAM)
10
(See paper for our experimental identification method)
Cortex-A7 Cortex-A9 Cortex-A15 Nehalem
Local MLP 1 4 6 10
Global MLP 4 4 11 16
Cache Interference Experiments
• Measure the performance of the ‘subject’
– (1) alone, (2) with co-runners
– LLC is partitioned (equal partition) using PALLOC (*)
• Q: Does cache partitioning provide isolation?
11
[Figure: subject task on Core1, co-runner(s) on Core2–Core4, sharing the LLC and DRAM]
(*) Heechul Yun, Renato Mancuso, Zheng-Pei Wu, Rodolfo Pellizzoni. “PALLOC: DRAM Bank-Aware Memory Allocator for Performance Isolation on Multicore Platforms.” RTAS’14
IsolBench: Synthetic Workloads
• Latency
– A linked-list traversal, data dependency, one outstanding miss
• Bandwidth
– Array reads or writes, no data dependency, multiple outstanding misses
• Subject benchmarks: LLC partition fitting
12
Working-set size: (LLC) < ¼ LLC → cache hits, (DRAM) > 2X LLC → cache misses
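The two kernels can be sketched as follows. This is a simplified rendition of the idea only; the actual IsolBench sources are at the GitHub link on the conclusion slide:

```c
#include <stddef.h>

/* Latency: pointer chasing over a linked list. Each load depends on
   the previous one, so at most one miss is outstanding at a time. */
struct node {
    struct node *next;
    char pad[64 - sizeof(struct node *)]; /* one node per cache line */
};

static size_t run_latency(struct node *head, size_t iters)
{
    struct node *p = head;
    size_t hops = 0;
    while (iters--) { p = p->next; hops++; }
    return hops;
}

/* Bandwidth: sequential array reads with no data dependency, so the
   core can keep many misses outstanding (bounded by the MSHRs). */
static long run_bw_read(const long *arr, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += arr[i];
    return sum;
}
```

The data-dependency difference is the whole point: Latency generates one outstanding miss, Bandwidth generates many, which is what saturates the shared MSHRs.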
Latency(LLC) vs. BwRead(DRAM)
• No interference on Cortex-A7 and Nehalem
• On Cortex-A15, Latency(LLC) suffers 3.8X slowdown
– despite partitioned LLC
13
BwRead(LLC) vs. BwRead(DRAM)
• Up to 10.6X slowdown on Cortex-A15
• Cache partitioning != performance isolation
– On all tested out-of-order cores (A9, A15, Nehalem)
14
BwRead(LLC) vs. BwWrite(DRAM)
• Up to 21X slowdown on Cortex-A15
• Writes generally cause more slowdowns
– Due to write-backs
15
EEMBC and SD-VBS
• X-axis: EEMBC, SD-VBS (cache partition fitting)
– Co-runners: BwWrite(DRAM)
• Cache partitioning != performance isolation
16
[Graphs: Cortex-A7 (in-order) and Cortex-A15 (out-of-order)]
MSHR Contention
• Shortage of cache MSHRs → the cache locks up
• LLC MSHRs are shared resources
– 4 cores × L1 cache MSHRs > LLC MSHRs
17
Cortex-A7 Cortex-A9 Cortex-A15 Nehalem
Local MLP 1 4 6 10
Global MLP 4 4 11 16
(See paper for our experimental identification method)
Outline
• Evaluate the isolation effect of cache
partitioning
• Understand the source of contention
– Effect of LLC MSHRs
– Effect of L1 MSHRs
• Propose an OS/architecture collaborative
solution
18
Simulation Setup
• Gem5 full-system simulator
– 4 out-of-order cores (modeling Cortex-A15)
• L1: 32K I&D, LLC (L2): 2048K
– Vary #of L1 and LLC MSHRs
• Linux 3.14
– Use PALLOC to partition LLC
• Workload
– IsolBench, EEMBC, SD-VBS, SPEC2006
19
DRAM
LLC
Core1 Core2 Core3 Core4
subject co-runner(s)
Effect of LLC (Shared) MSHRs
• Increasing LLC MSHRs eliminates the MSHR contention
• Issue: scalable?
– MSHRs are highly associative (must be accessed in parallel)
20
[Graphs: L1:6/LLC:12 MSHRs and L1:6/LLC:24 MSHRs]
Effect of L1 (Private) MSHRs
• Reducing L1 MSHRs can eliminate contention
• Issue: single thread performance. How much?
– It depends. L1 MSHRs are often underutilized
– EEMBC: low, SD-VBS: medium, SPEC2006: high
21
[Graphs: EEMBC, SD-VBS and IsolBench, SPEC2006]
Outline
• Evaluate the isolation effect of cache
partitioning
• Understand the source of contention
• Propose an OS/architecture collaborative
solution
22
OS Controlled MSHR Partitioning
23
• Add two registers in each core’s L1 cache
– TargetCount: max. MSHRs (set by OS)
– ValidCount: used MSHRs (set by HW)
• OS can control each core’s MLP
– By adjusting the core’s TargetCount register
[Figure: per-core L1 MSHR file — each MSHR entry (MSHR 1 … MSHR n) holds Valid, Block Addr., Issue, and Target Info. fields, alongside the new TargetCount and ValidCount registers; the four cores' MSHRs feed the shared Last Level Cache (LLC) MSHRs]
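The hardware side of the proposal can be sketched as a simple gate on MSHR allocation. TargetCount and ValidCount follow the slide; the struct and function names are illustrative, not from the paper's RTL:

```c
#include <stdbool.h>

/* Sketch of per-core L1 MSHR throttling: the OS writes TargetCount,
   hardware tracks ValidCount and refuses to allocate a new MSHR
   once the core reaches its target. */

struct l1_mshr_ctrl {
    unsigned target_count; /* max MSHRs this core may use (set by OS) */
    unsigned valid_count;  /* MSHRs currently in use (set by HW)      */
};

/* HW-side check on a cache miss: allocate only under the target. */
static bool mshr_allocate(struct l1_mshr_ctrl *c)
{
    if (c->valid_count >= c->target_count)
        return false;      /* core stalls; no more outstanding misses */
    c->valid_count++;
    return true;
}

static void mshr_release(struct l1_mshr_ctrl *c)
{
    if (c->valid_count > 0)
        c->valid_count--;
}

/* OS-side knob: lowering TargetCount limits the core's MLP. */
static void set_mlp_limit(struct l1_mshr_ctrl *c, unsigned n)
{
    c->target_count = n;
}
```

Capping each core's outstanding L1 misses this way bounds how many shared LLC MSHRs that core can tie up, which is what restores isolation.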
OS Scheduler Design
• Partition LLC MSHRs by enforcing per-core TargetCount limits
• When RT tasks are scheduled
– RT tasks: reserve MSHRs
– Non-RT tasks: share remaining MSHRs
• Implemented in Linux kernel
– prepare_task_switch() at kernel/sched/core.c
24
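The scheduling policy above can be modeled in user space. The real implementation hooks prepare_task_switch() in kernel/sched/core.c; the constants and function name here are hypothetical, chosen only to illustrate the reserve-vs-share split:

```c
/* User-space model of the scheduler policy: on a context switch,
   RT tasks get a reserved MSHR quota and non-RT tasks share what
   the scheduled RT tasks leave over. All numbers illustrative. */

#define LLC_MSHRS  12 /* total shared LLC MSHRs              */
#define RT_RESERVE  2 /* MSHRs reserved per scheduled RT task */

/* Returns the TargetCount to program into this core's L1 cache,
   given whether the incoming task is RT and how many cores are
   currently running RT tasks. */
static unsigned mshr_quota(int is_rt_task, int num_rt_cores)
{
    if (is_rt_task)
        return RT_RESERVE;   /* guaranteed reservation */
    /* Non-RT tasks share whatever the RT reservations left over. */
    unsigned reserved = (unsigned)num_rt_cores * RT_RESERVE;
    unsigned leftover = LLC_MSHRS > reserved ? LLC_MSHRS - reserved : 1;
    return leftover;
}
```

On each switch the kernel would program the core's TargetCount register with this quota, so RT tasks keep their reserved MLP regardless of what the best-effort tasks do.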
Case Study
• 4 RT tasks (EEMBC)
– One RT per core
– Reserve 2 MSHRs
– Periods (P): 20, 30, 40, 60 ms
– Execution time (C): ~8 ms
• 4 NRT tasks
– One NRT per core
– Run on slack
• Up to 20% WCET reduction
– Compared to cache partition only
25
Conclusion
• Evaluated the effect of cache partitioning on four real COTS
multicore architectures and a full system simulator
• Found cache partitioning does not ensure cache (hit)
performance isolation due to MSHR contention
• Proposed low-cost architecture extension and OS scheduler
support to eliminate MSHR contention
• Availability
– https://github.com/CSL-KU/IsolBench
– Synthetic benchmarks (IsolBench), test scripts, kernel patches
26
Exp3: BwRead(LLC) vs. BwRead(LLC)
• Cache bandwidth contention is not the main source
– Higher cache-bandwidth co-runners cause smaller slowdowns
27
EEMBC, SD-VBS Workload
• Subject
– Subset of EEMBC, SD-VBS
– High L2 hit, Low L2 miss
• Co-runners
– BwWrite(DRAM): High L2 miss, write-intensive
28
Contenu connexe

Tendances

Real time operating systems (rtos) concepts 9
Real time operating systems (rtos) concepts 9Real time operating systems (rtos) concepts 9
Real time operating systems (rtos) concepts 9Abu Bakr Ramadan
 
Cache coherence
Cache coherenceCache coherence
Cache coherenceEmployee
 
Chapter 13 file systems
Chapter 13   file systemsChapter 13   file systems
Chapter 13 file systemsAlvin Chin
 
Distributed file system
Distributed file systemDistributed file system
Distributed file systemAnamika Singh
 
Operating System Concepts Presentation
Operating System Concepts PresentationOperating System Concepts Presentation
Operating System Concepts PresentationNitish Jadia
 
Linux Memory Management with CMA (Contiguous Memory Allocator)
Linux Memory Management with CMA (Contiguous Memory Allocator)Linux Memory Management with CMA (Contiguous Memory Allocator)
Linux Memory Management with CMA (Contiguous Memory Allocator)Pankaj Suryawanshi
 
HDFS User Reference
HDFS User ReferenceHDFS User Reference
HDFS User ReferenceBiju Nair
 
Process synchronization
Process synchronizationProcess synchronization
Process synchronizationAli Ahmad
 
Linux internal
Linux internalLinux internal
Linux internalmcganesh
 
LAS16-200: SCMI - System Management and Control Interface
LAS16-200:  SCMI - System Management and Control InterfaceLAS16-200:  SCMI - System Management and Control Interface
LAS16-200: SCMI - System Management and Control InterfaceLinaro
 
LAS16-403: GDB Linux Kernel Awareness
LAS16-403: GDB Linux Kernel AwarenessLAS16-403: GDB Linux Kernel Awareness
LAS16-403: GDB Linux Kernel AwarenessLinaro
 
Translation Look Aside buffer
Translation Look Aside buffer Translation Look Aside buffer
Translation Look Aside buffer Zara Nawaz
 
Linux and windows file system
Linux and windows  file systemLinux and windows  file system
Linux and windows file systemlin yucheng
 
File models and file accessing models
File models and file accessing modelsFile models and file accessing models
File models and file accessing modelsishmecse13
 

Tendances (20)

Swapping | Computer Science
Swapping | Computer ScienceSwapping | Computer Science
Swapping | Computer Science
 
Memory Management
Memory ManagementMemory Management
Memory Management
 
6.distributed shared memory
6.distributed shared memory6.distributed shared memory
6.distributed shared memory
 
Real time operating systems (rtos) concepts 9
Real time operating systems (rtos) concepts 9Real time operating systems (rtos) concepts 9
Real time operating systems (rtos) concepts 9
 
Cache coherence
Cache coherenceCache coherence
Cache coherence
 
Chapter 13 file systems
Chapter 13   file systemsChapter 13   file systems
Chapter 13 file systems
 
Distributed file system
Distributed file systemDistributed file system
Distributed file system
 
Process Management
Process ManagementProcess Management
Process Management
 
Virtual memory
Virtual memoryVirtual memory
Virtual memory
 
Operating System Concepts Presentation
Operating System Concepts PresentationOperating System Concepts Presentation
Operating System Concepts Presentation
 
Linux Memory Management with CMA (Contiguous Memory Allocator)
Linux Memory Management with CMA (Contiguous Memory Allocator)Linux Memory Management with CMA (Contiguous Memory Allocator)
Linux Memory Management with CMA (Contiguous Memory Allocator)
 
HDFS User Reference
HDFS User ReferenceHDFS User Reference
HDFS User Reference
 
Process synchronization
Process synchronizationProcess synchronization
Process synchronization
 
Linux internal
Linux internalLinux internal
Linux internal
 
LAS16-200: SCMI - System Management and Control Interface
LAS16-200:  SCMI - System Management and Control InterfaceLAS16-200:  SCMI - System Management and Control Interface
LAS16-200: SCMI - System Management and Control Interface
 
Load balancing
Load balancingLoad balancing
Load balancing
 
LAS16-403: GDB Linux Kernel Awareness
LAS16-403: GDB Linux Kernel AwarenessLAS16-403: GDB Linux Kernel Awareness
LAS16-403: GDB Linux Kernel Awareness
 
Translation Look Aside buffer
Translation Look Aside buffer Translation Look Aside buffer
Translation Look Aside buffer
 
Linux and windows file system
Linux and windows  file systemLinux and windows  file system
Linux and windows file system
 
File models and file accessing models
File models and file accessing modelsFile models and file accessing models
File models and file accessing models
 

Similaire à Taming Non-blocking Caches to Improve Isolation in Multicore Real-Time Systems

Parallelism-Aware Memory Interference Delay Analysis for COTS Multicore Systems
Parallelism-Aware Memory Interference Delay Analysis for COTS Multicore SystemsParallelism-Aware Memory Interference Delay Analysis for COTS Multicore Systems
Parallelism-Aware Memory Interference Delay Analysis for COTS Multicore SystemsHeechul Yun
 
Deterministic Memory Abstraction and Supporting Multicore System Architecture
Deterministic Memory Abstraction and Supporting Multicore System ArchitectureDeterministic Memory Abstraction and Supporting Multicore System Architecture
Deterministic Memory Abstraction and Supporting Multicore System ArchitectureHeechul Yun
 
MemGuard: Memory Bandwidth Reservation System for Efficient Performance Isola...
MemGuard: Memory Bandwidth Reservation System for Efficient Performance Isola...MemGuard: Memory Bandwidth Reservation System for Efficient Performance Isola...
MemGuard: Memory Bandwidth Reservation System for Efficient Performance Isola...Heechul Yun
 
CPU Caches - Jamie Allen
CPU Caches - Jamie AllenCPU Caches - Jamie Allen
CPU Caches - Jamie Allenjaxconf
 
Ct213 memory subsystem
Ct213 memory subsystemCt213 memory subsystem
Ct213 memory subsystemSandeep Kamath
 
Memory access control in multiprocessor for real-time system with mixed criti...
Memory access control in multiprocessor for real-time system with mixed criti...Memory access control in multiprocessor for real-time system with mixed criti...
Memory access control in multiprocessor for real-time system with mixed criti...Heechul Yun
 
Memory Hierarchy PPT of Computer Organization
Memory Hierarchy PPT of Computer OrganizationMemory Hierarchy PPT of Computer Organization
Memory Hierarchy PPT of Computer Organization2022002857mbit
 
Erasing Belady's Limitations: In Search of Flash Cache Offline Optimality
Erasing Belady's Limitations: In Search of Flash Cache Offline OptimalityErasing Belady's Limitations: In Search of Flash Cache Offline Optimality
Erasing Belady's Limitations: In Search of Flash Cache Offline OptimalityYue Cheng
 
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...Databricks
 
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...Databricks
 
Lec11 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part3
Lec11 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part3Lec11 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part3
Lec11 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part3Hsien-Hsin Sean Lee, Ph.D.
 
Spectrum Scale Memory Usage
Spectrum Scale Memory UsageSpectrum Scale Memory Usage
Spectrum Scale Memory UsageTomer Perry
 
Mass storage systems presentation operating systems
Mass storage systems presentation operating systemsMass storage systems presentation operating systems
Mass storage systems presentation operating systemsnight1ng4ale
 
Chip Multithreading Systems Need a New Operating System Scheduler
Chip Multithreading Systems Need a New Operating System Scheduler Chip Multithreading Systems Need a New Operating System Scheduler
Chip Multithreading Systems Need a New Operating System Scheduler Sarwan ali
 

Similaire à Taming Non-blocking Caches to Improve Isolation in Multicore Real-Time Systems (20)

Parallelism-Aware Memory Interference Delay Analysis for COTS Multicore Systems
Parallelism-Aware Memory Interference Delay Analysis for COTS Multicore SystemsParallelism-Aware Memory Interference Delay Analysis for COTS Multicore Systems
Parallelism-Aware Memory Interference Delay Analysis for COTS Multicore Systems
 
Deterministic Memory Abstraction and Supporting Multicore System Architecture
Deterministic Memory Abstraction and Supporting Multicore System ArchitectureDeterministic Memory Abstraction and Supporting Multicore System Architecture
Deterministic Memory Abstraction and Supporting Multicore System Architecture
 
MemGuard: Memory Bandwidth Reservation System for Efficient Performance Isola...
MemGuard: Memory Bandwidth Reservation System for Efficient Performance Isola...MemGuard: Memory Bandwidth Reservation System for Efficient Performance Isola...
MemGuard: Memory Bandwidth Reservation System for Efficient Performance Isola...
 
Memory (Computer Organization)
Memory (Computer Organization)Memory (Computer Organization)
Memory (Computer Organization)
 
Cpu Caches
Cpu CachesCpu Caches
Cpu Caches
 
CPU Caches - Jamie Allen
CPU Caches - Jamie AllenCPU Caches - Jamie Allen
CPU Caches - Jamie Allen
 
CPU Caches
CPU CachesCPU Caches
CPU Caches
 
Ct213 memory subsystem
Ct213 memory subsystemCt213 memory subsystem
Ct213 memory subsystem
 
Memory access control in multiprocessor for real-time system with mixed criti...
Memory access control in multiprocessor for real-time system with mixed criti...Memory access control in multiprocessor for real-time system with mixed criti...
Memory access control in multiprocessor for real-time system with mixed criti...
 
General Purpose GPU Computing
General Purpose GPU ComputingGeneral Purpose GPU Computing
General Purpose GPU Computing
 
Memory Hierarchy PPT of Computer Organization
Memory Hierarchy PPT of Computer OrganizationMemory Hierarchy PPT of Computer Organization
Memory Hierarchy PPT of Computer Organization
 
Erasing Belady's Limitations: In Search of Flash Cache Offline Optimality
Erasing Belady's Limitations: In Search of Flash Cache Offline OptimalityErasing Belady's Limitations: In Search of Flash Cache Offline Optimality
Erasing Belady's Limitations: In Search of Flash Cache Offline Optimality
 
Lect18
Lect18Lect18
Lect18
 
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
 
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
 
7_mem_cache.ppt
7_mem_cache.ppt7_mem_cache.ppt
7_mem_cache.ppt
 
Lec11 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part3
Lec11 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part3Lec11 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part3
Lec11 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part3
 
Spectrum Scale Memory Usage
Spectrum Scale Memory UsageSpectrum Scale Memory Usage
Spectrum Scale Memory Usage
 
Mass storage systems presentation operating systems
Mass storage systems presentation operating systemsMass storage systems presentation operating systems
Mass storage systems presentation operating systems
 
Chip Multithreading Systems Need a New Operating System Scheduler
Chip Multithreading Systems Need a New Operating System Scheduler Chip Multithreading Systems Need a New Operating System Scheduler
Chip Multithreading Systems Need a New Operating System Scheduler
 

Dernier

Minimum and Maximum Modes of microprocessor 8086
Minimum and Maximum Modes of microprocessor 8086Minimum and Maximum Modes of microprocessor 8086
Minimum and Maximum Modes of microprocessor 8086anil_gaur
 
Thermal Engineering Unit - I & II . ppt
Thermal Engineering  Unit - I & II . pptThermal Engineering  Unit - I & II . ppt
Thermal Engineering Unit - I & II . pptDineshKumar4165
 
COST-EFFETIVE and Energy Efficient BUILDINGS ptx
COST-EFFETIVE  and Energy Efficient BUILDINGS ptxCOST-EFFETIVE  and Energy Efficient BUILDINGS ptx
COST-EFFETIVE and Energy Efficient BUILDINGS ptxJIT KUMAR GUPTA
 
Unit 2- Effective stress & Permeability.pdf
Unit 2- Effective stress & Permeability.pdfUnit 2- Effective stress & Permeability.pdf
Unit 2- Effective stress & Permeability.pdfRagavanV2
 
chapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineeringchapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineeringmulugeta48
 
A Study of Urban Area Plan for Pabna Municipality
A Study of Urban Area Plan for Pabna MunicipalityA Study of Urban Area Plan for Pabna Municipality
A Study of Urban Area Plan for Pabna MunicipalityMorshed Ahmed Rahath
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performancesivaprakash250
 
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Bookingdharasingh5698
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...roncy bisnoi
 
Block diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.pptBlock diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.pptNANDHAKUMARA10
 
notes on Evolution Of Analytic Scalability.ppt
notes on Evolution Of Analytic Scalability.pptnotes on Evolution Of Analytic Scalability.ppt
notes on Evolution Of Analytic Scalability.pptMsecMca
 
Thermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VThermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VDineshKumar4165
 

Dernier (20)

Minimum and Maximum Modes of microprocessor 8086
Minimum and Maximum Modes of microprocessor 8086Minimum and Maximum Modes of microprocessor 8086
Minimum and Maximum Modes of microprocessor 8086
 
Thermal Engineering Unit - I & II . ppt
Thermal Engineering  Unit - I & II . pptThermal Engineering  Unit - I & II . ppt
Thermal Engineering Unit - I & II . ppt
 
COST-EFFETIVE and Energy Efficient BUILDINGS ptx
COST-EFFETIVE  and Energy Efficient BUILDINGS ptxCOST-EFFETIVE  and Energy Efficient BUILDINGS ptx
COST-EFFETIVE and Energy Efficient BUILDINGS ptx
 
Unit 2- Effective stress & Permeability.pdf
Unit 2- Effective stress & Permeability.pdfUnit 2- Effective stress & Permeability.pdf
Unit 2- Effective stress & Permeability.pdf
 
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak HamilCara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
 
Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024
 
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
 
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
 
chapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineeringchapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineering
 
A Study of Urban Area Plan for Pabna Municipality
A Study of Urban Area Plan for Pabna MunicipalityA Study of Urban Area Plan for Pabna Municipality
A Study of Urban Area Plan for Pabna Municipality
 
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
 
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performance
 
Integrated Test Rig For HTFE-25 - Neometrix
Integrated Test Rig For HTFE-25 - NeometrixIntegrated Test Rig For HTFE-25 - Neometrix
Integrated Test Rig For HTFE-25 - Neometrix
 
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
 
Call Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
 
Block diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.pptBlock diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.ppt
 
notes on Evolution Of Analytic Scalability.ppt
notes on Evolution Of Analytic Scalability.pptnotes on Evolution Of Analytic Scalability.ppt
notes on Evolution Of Analytic Scalability.ppt
 
Thermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VThermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - V
 

Taming Non-blocking Caches to Improve Isolation in Multicore Real-Time Systems

  • 1. Taming Non-blocking Caches to Improve Isolation in Multicore Real-Time Systems Prathap Kumar Valsan, Heechul Yun, Farzad Farshchi University of Kansas 1
  • 2. High-Performance Multicores for Real-Time Systems • Why? – Intelligence  more performance – Space, weight, power (SWaP), cost 2
  • 3. Time Predictability Challenge 3 Core1 Core2 Core3 Core4 Memory Controller (MC) Shared Cache • Shared hardware resource contention can cause significant interference delays • Shared cache is a major shared resource DRAM Task 1 Task 2 Task 3 Task 4
  • 4. Cache Partitioning • Can be done in either software or hardware – Software: page coloring – Hardware: way partitioning • Eliminate unwanted cache-line evictions • Common assumption – cache partitioning  performance isolation • If working-set fits in the cache partition • Not necessarily true on out-of-order cores using non-blocking caches 4
  • 5. Outline • Evaluate the isolation effect of cache partitioning – On four real COTS multicore architectures – Observed up to 21X slowdown of a task running on a dedicated core accessing a dedicated cache partition • Understand the source of contention – Using a cycle-accurate full system simulator • Propose an OS/architecture collaborative solution – For better cache performance isolation 5
  • 6. Memory-Level Parallelism (MLP) • Broadly defined as the number of concurrent memory requests that a given architecture can handle at a time 6
  • 7. Non-blocking Cache(*) • Can serve cache hits under multiple cache misses – Essential for an out-of-order core and any multicore • Miss-Status-Holding Registers (MSHRs) – On a miss, allocate a MSHR entry to track the req. – On receiving the data, clear the MSHR entry 7 cpu cpu miss hit miss Miss penalty Miss penalty stall only when result is needed Multiple outstanding misses (*) D. Kroft. “Lockup-free instruction fetch/prefetch cache organization,” ISCA’81
  • 8. Non-blocking Cache • #of cache MSHRs  memory-level parallelism (MLP) of the cache • What happens if all MSHRs are occupied? – The cache is locked up – Subsequent accesses---including cache hits---to the cache stall – We will see the impact of this in later experiments 8
  • 9. COTS Multicore Platforms Cortex-A7 Cortex-A9 Cortex-A15 Nehalem Core 4core @ 1.4GHz In-order 4core @ 1.7GHz Out-of-order 4core @ 2.0GHz Out-of-order 4core @ 2.8GHz Out-of-order LLC (shared) 512KB 1MB 2MB 8MB 9 • COTS multicore platforms – Odroid-XU4: 4x Cortex-A7 and 4x Cortex-A15 – Odroid-U3: 4x Cortex-A9 – Dell desktop: Intel Xeon quad-core (Nehalem)
  • 10. Identified MLP • Local MLP – MLP of a core-private cache • Global MLP – MLP of the shared cache (and DRAM) 10 (See paper for our experimental identification method) Cortex-A7 Cortex-A9 Cortex-A15 Nehalem Local MLP 1 4 6 10 Global MLP 4 4 11 16
  • 11. Cache Interference Experiments • Measure the performance of the ‘subject’ – (1) alone, (2) with co-runners – LLC is partitioned (equal partition) using PALLOC (*) • Q: Does cache partitioning provide isolation? 11 DRAM LLC Core1 Core2 Core3 Core4 subject co-runner(s) (*) Heechul Yun, Renato Mancuso, Zheng-Pei Wu, Rodolfo Pellizzoni. “PALLOC: DRAM Bank-Aware Mem ory Allocator for Performance Isolation on Multicore Platforms.” RTAS’14
  • 12. IsolBench: Synthetic Workloads • Latency – A linked-list traversal, data dependency, one outstanding miss • Bandwidth – An array reads or writes, no data dependency, multiple misses • Subject benchmarks: LLC partition fitting 12 Working-set size: (LLC) < ¼ LLC  cache-hits, (DRAM) > 2X LLC  cache misses
  • 13. Latency(LLC) vs. BwRead(DRAM) • No interference on Cortex-A7 and Nehalem • On Cortex-A15, Latency(LLC) suffers 3.8X slowdown – despite partitioned LLC 13
  • 14. BwRead(LLC) vs. BwRead(DRAM) • Up to 10.6X slowdown on Cortex-A15 • Cache partitioning != performance isolation – On all tested out-of-order cores (A9, A15, Nehalem) 14
  • 15. BwRead(LLC) vs. BwWrite(DRAM) • Up to 21X slowdown on Cortex-A15 • Writes generally cause more slowdowns – Due to write-backs 15
  • 16. EEMBC and SD-VBS • X-axis: EEMBC, SD-VBS (cache partition fitting) – Co-runners: BwWrite(DRAM) • Cache partitioning != performance isolation 16 Cortex-A7 (in-order) Cortex-A15 (out-of-order)
  • 17. MSHR Contention • Shortage of cache MSHRs  lock up the cache • LLC MSHRs are shared resources – 4cores x L1 cache MSHRs > LLC MSHRs 17 Cortex-A7 Cortex-A9 Cortex-A15 Nehalem Local MLP 1 4 6 10 Global MLP 4 4 11 16 (See paper for our experimental identification method)
  • 18. Outline • Evaluate the isolation effect of cache partitioning • Understand the source of contention – Effect of LLC MSHRs – Effect of L1 MSHRs • Propose an OS/architecture collaborative solution 18
  • 19. Simulation Setup • Gem5 full-system simulator – 4 out-of-order cores (modeling Cortex-A15) • L1: 32K I&D, LLC (L2): 2048K – Vary #of L1 and LLC MSHRs • Linux 3.14 – Use PALLOC to partition LLC • Workload – IsolBench, EEMBC, SD-VBS, SPEC2006 19 DRAM LLC Core1 Core2 Core3 Core4 subject co-runner(s)
  • 20. Effect of LLC (Shared) MSHRs • Increasing LLC MSHRs eliminates the MSHR contention • Issue: scalable? – MSHRs are highly associative (must be accessed in parallel) 20 L1:6/LLC:12 MSHRs L1:6/LLC:24 MSHRs
• 21. Effect of L1 (Private) MSHRs
– Reducing L1 MSHRs can eliminate contention
– Issue: single-thread performance. How much? It depends; L1 MSHRs are often underutilized
– MLP utilization: EEMBC: low, SD-VBS: medium, IsolBench/SPEC2006: high
• 22. Outline
– Evaluate the isolation effect of cache partitioning
– Understand the source of contention
– Propose an OS/architecture collaborative solution
• 23. OS Controlled MSHR Partitioning
– Add two registers to each core’s L1 cache:
  TargetCount: max. MSHRs (set by the OS)
  ValidCount: MSHRs in use (set by HW)
– The OS can control each core’s MLP by adjusting the core’s TargetCount register
[Figure: MSHR file with per-entry Valid, Block Addr., Issue, and Target Info. fields, plus the TargetCount and ValidCount registers; per-core MSHRs feeding the shared LLC]
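The hardware-side check can be modeled as below. This is a software sketch of the proposed extension; the structure and function names are hypothetical, and the real logic would live in the cache controller (or the gem5 cache model).

```c
#include <stdbool.h>

#define NUM_CORES 4

struct mshr_quota {
    int target_count;   /* TargetCount: max in-flight misses, written by the OS */
    int valid_count;    /* ValidCount: entries in use, maintained by hardware */
};

struct mshr_quota quota[NUM_CORES];

/* On a miss from 'core': allocate an MSHR only if the core is under its
   quota.  Returning false stalls that core's request, so one core's burst
   of misses can no longer lock up the cache for everyone else. */
bool mshr_try_alloc(int core)
{
    struct mshr_quota *q = &quota[core];
    if (q->valid_count >= q->target_count)
        return false;
    q->valid_count++;
    return true;
}

/* On receiving the miss data from the lower level: clear the entry. */
void mshr_release(int core)
{
    if (quota[core].valid_count > 0)
        quota[core].valid_count--;
}
```

Because the quota is checked per core, adjusting one core's TargetCount changes only that core's attainable MLP, which is what lets the OS enforce MSHR partitioning.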
• 24. OS Scheduler Design
– Partition LLC MSHRs by enforcing that the per-core TargetCount values sum to no more than the number of LLC MSHRs
– When RT tasks are scheduled: RT tasks reserve MSHRs; non-RT tasks share the remaining MSHRs
– Implemented in the Linux kernel: prepare_task_switch() in kernel/sched/core.c
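The scheduler-side policy might look like the following. This is a user-space sketch of the idea only; the actual implementation hooks prepare_task_switch() in kernel/sched/core.c, and the function name, task-class enum, and total-MSHR constant here are illustrative.

```c
#define TOTAL_LLC_MSHRS 12   /* illustrative; matches the simulated LLC */

enum task_class { TASK_RT, TASK_NRT };

/* Value to program into the incoming core's TargetCount register on a
   context switch.  RT tasks get their reserved budget; non-RT tasks share
   whatever the currently scheduled RT tasks leave over.  The invariant is
   that the RT budgets in use sum to at most TOTAL_LLC_MSHRS. */
int targetcount_for(enum task_class cls, int rt_budget, int rt_reserved_total)
{
    if (cls == TASK_RT)
        return rt_budget;
    int leftover = TOTAL_LLC_MSHRS - rt_reserved_total;
    return leftover > 0 ? leftover : 1;   /* keep NRT tasks making progress */
}
```

On each switch the scheduler would compute this value for the incoming task and write it to the core's TargetCount register, so reservations track exactly which RT tasks are running at the moment.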
• 25. Case Study
– 4 RT tasks (EEMBC): one per core; each reserves 2 MSHRs; periods P: 20, 30, 40, 60 ms; execution time C: ~8 ms
– 4 NRT tasks: one per core; run in the slack
– Up to 20% WCET reduction compared to cache partitioning only
• 26. Conclusion
– Evaluated the effect of cache partitioning on four real COTS multicore architectures and a full-system simulator
– Found that cache partitioning does not ensure cache (hit) performance isolation, due to MSHR contention
– Proposed a low-cost architecture extension and OS scheduler support to eliminate MSHR contention
– Availability: https://github.com/CSL-KU/IsolBench (synthetic benchmarks (IsolBench), test scripts, kernel patches)
• 27. Exp3: BwRead(LLC) vs. BwRead(LLC)
– Cache bandwidth contention is not the main source: co-runners with higher cache bandwidth cause less slowdown
• 28. EEMBC, SD-VBS Workload
– Subject: subset of EEMBC and SD-VBS; high L2 hit rate, low L2 miss rate
– Co-runners: BwWrite(DRAM): high L2 miss rate, write-intensive

Editor's Notes

  1. This talk is about improving isolation in modern multicore real-time systems focusing on shared cache.
  2. High-performance multicore processors are increasingly demanded in many modern real-time systems to satisfy ever-increasing performance needs while reducing size, weight, and power consumption.
  3. Unfortunately, it is well known that providing timing predictability is very difficult in multicore processors due to contention in shared hardware resources. In this work, we focus on the shared cache.
  4. Cache partitioning is a well-known technique to improve isolation in real-time systems. It can be done either in software (by the OS) or in hardware. Either way, the goal of partitioning is to eliminate unwanted cache-line evictions by providing isolated space per core/task. Most of the literature equates cache partitioning with cache performance isolation. However, as we will show in this work, partitioning the cache does not necessarily provide performance isolation on modern high-performance multicore architectures.
  5. This talk is composed of three parts. First, we evaluate the isolation effect of cache partitioning on four real COTS multicore architectures. Surprisingly, we observe up to 21X slowdown of a task running on a dedicated core, accessing a dedicated cache partition, even though almost all of its memory accesses are cache hits. We then further investigate the source of the contention using a cycle-accurate full-system simulator. Finally, based on this understanding, we propose an OS and architecture collaborative solution to provide true cache performance isolation.
  6. Before we begin, let me briefly introduce essential background on non-blocking caches. A non-blocking cache can continue to serve cache-hit requests under multiple cache misses. Supporting multiple outstanding misses is important to leverage memory-level parallelism. It is especially important for out-of-order cores, but in-order multicores can also benefit from using non-blocking shared caches. In a non-blocking cache, there are special hardware registers known as miss-status-holding registers (MSHRs). When a cache miss occurs, the cache controller allocates an MSHR entry to track the outgoing cache miss, and the entry is cleared when the data is delivered from the lower-level memory hierarchy.
  7. The number of MSHRs essentially determines the parallelism of the cache. Now, because MSHRs are limited and getting data from DRAM can take hundreds of CPU cycles, it is possible for the MSHRs of a cache to be fully occupied by outstanding misses. If that happens, the cache locks up and all subsequent accesses to the cache are stalled.
  8. For the experiments, we use four COTS multicore architectures: one in-order and three out-of-order architectures.
  9. On these platforms, we perform a set of cache-interference experiments. The basic setup is as follows. We run a subject benchmark on one core and increase the number of co-runners on the other cores. I would like to stress that the LLC is partitioned on a per-core basis. So, what we would like to see is whether cache partitioning provides cache performance isolation.
  10. First, we use a benchmark suite called IsolBench that consists of two synthetic benchmarks: latency and bandwidth. The latency benchmark is a simple linked-list traversal program. Due to data dependency, only one outstanding miss can be generated at a time. The bandwidth benchmark is a simple array access program. As there is no data dependency, multiple concurrent accesses can be made on an out-of-order core. We created six workloads by varying their working-set sizes. Here, (LLC) denotes a working set smaller than the given cache partition, while (DRAM) denotes one twice the size of the LLC. In all workloads, the subject benchmarks always fit in their cache partitions; in other words, their accesses are cache hits. So, ideally, cache partitioning should provide performance isolation regardless of the co-runners.
  11. We also used real benchmarks as subjects and obtained similar results: isolation is provided on the in-order core but not on the out-of-order cores.
  12. We attribute this to the MSHR contention problem. This table shows the experimentally obtained per-core local and global memory-level parallelism of the tested platforms. The detailed method for obtaining the MLP is in the paper. These numbers were also validated against the processor manuals. What this table suggests is that the LLC has fewer MSHRs than the sum of the L1 MSHRs on the tested out-of-order cores. As I mentioned before, the shortage of LLC MSHRs can then lock up the LLC, even if the cache space is partitioned.
  13. Next, we validate the MSHR contention problem using a cycle-accurate full-system simulator. We also perform a set of sensitivity analyses to better understand the effects of MSHRs in the local and shared caches.
  14. At first, we suspected that this might be caused by the shared bus that connects the LLC and the cores, but the next experiment showed that this was not the case. In this experiment, we changed the co-runners' working-set sizes to fit in their cache partitions. Because the co-runners' memory accesses are now also cache hits, the cache bandwidth demand, and therefore the shared bus contention, should have increased. Yet the slowdown is much smaller than in the previous experiment, where cache bandwidth usage was much lower.