CPU Caches
Jamie Allen
@jamie_allen
http://github.com/jamie-allen
JavaZone 2012
Agenda
• Goal
• Definitions
• Architectures
• Development Tips
• The Future
Goal
Provide you with the information you need
about CPU caches so that you can improve the
performance of your applications
Why?
• Increased virtualization
– Runtime (JVM, RVM)
– Platforms/Environments (cloud)
• Disruptor, 2011
Definitions
SMP
• Symmetric Multiprocessor (SMP) Architecture
• Shared main memory controlled by single OS
• No more Northbridge
NUMA
• Non-Uniform Memory Access
• The organization of processors reflects the time
it takes to access data in RAM, called the “NUMA
factor”
• Shared memory space (as opposed to multiple
commodity machines)
Data Locality
• The most critical factor in performance
• Not guaranteed by a JVM
• Spatial - data accessed in small, contiguous
regions, so neighboring values arrive on the
same cache line
• Temporal - data reused over and over (e.g., in a
loop) has a high probability of being reused
before long (see the loop sketch after this list)
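A minimal Java sketch of both kinds of locality (the class and method names are illustrative, not from the talk):

public final class LocalityExample {
    // Spatial locality: the int[] elements are contiguous, so each cache line
    // fetched for data[i] also brings in the next several elements.
    // Temporal locality: 'sum' and 'i' are reused on every iteration and stay
    // hot in registers/L1 for the whole loop.
    static long sum(int[] data) {
        long sum = 0L;
        for (int i = 0; i < data.length; i++) {
            sum += data[i];
        }
        return sum;
    }
}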
Memory Controller
• Manages communication of reads/writes
between the CPU and RAM
• Integrated Memory Controller on die
Cache Lines
• 32-256 contiguous bytes, most commonly 64
• Beware “false sharing”: independent variables
updated by different threads on the same line
force coherency traffic
• Use padding to ensure unshared lines (see the
padding sketch after this list)
• Transferred in 64-bit blocks (8x for 64 byte
lines), arriving every ~4 cycles
• Position in the line of the “critical word”
matters, but not if pre-fetched
• @Contended annotation coming to JVM?
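A minimal sketch of the manual padding that the proposed @Contended annotation is meant to replace, assuming 64-byte lines and 8-byte longs; the class and field names are illustrative, and the JVM is free to reorder or eliminate fields, so this is a heuristic rather than a guarantee:

public class PaddedCounter {
    // Seven longs (56 bytes) on either side of the hot field make it unlikely
    // that another thread's hot data shares its 64-byte cache line.
    protected long p1, p2, p3, p4, p5, p6, p7;
    protected volatile long value;
    protected long q1, q2, q3, q4, q5, q6, q7;

    public void increment() { value++; }   // illustration only: ++ on a volatile is not atomic
    public long get()       { return value; }
}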
Cache Associativity
• Fully Associative: Put it anywhere
• Somewhere in the middle: n-way set-associative,
2-way skewed-associative (see the set-index
sketch after this list)
• Direct Mapped: Each entry can only go in one
specific place
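A minimal sketch of how an address maps to a set in an n-way set-associative cache; the sizes are illustrative (a 32K cache with 64-byte lines and 8 ways gives 64 sets). Direct mapped is the 1-way case, fully associative the single-set case.

public final class SetIndex {
    static final int CACHE_BYTES = 32 * 1024;
    static final int LINE_BYTES  = 64;
    static final int WAYS        = 8;
    static final int SETS        = CACHE_BYTES / (LINE_BYTES * WAYS); // 64 sets

    // The line holding 'address' may sit in any of the WAYS slots of this one set.
    static int setIndex(long address) {
        long lineNumber = address / LINE_BYTES;
        return (int) (lineNumber % SETS);
    }
}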
Cache Eviction Strategies
• Least Recently Used (LRU)
• Pseudo-LRU (PLRU): for large associativity
caches
• 2-Way Set Associative
• Direct Mapped
• Others
Cache Write Strategies
• Write through: changed cache line
immediately goes back to main memory
• Write back: cache line is marked when dirty,
eviction sends back to main memory
• Write combining: grouped writes of cache
lines back to main memory
• Uncacheable: dynamic values that can change
without warning
Exclusive versus Inclusive
• Only relevant below L3
• AMD is exclusive
– Progressively more costly due to eviction
– Can hold more data
– Bulldozer uses "write through" from L1d back to L2
• Intel is inclusive
– Can be better for inter-processor memory sharing
– More expensive as lines in L1 are also in L2 & L3
– If a line is evicted from an outer level (e.g., L3), it
must be evicted from the inner levels as well
Inter-Socket Communication
• GT/s – gigatransfers per second
• Quick Path Interconnect (QPI, Intel) – 8GT/s
• HyperTransport (HTX, AMD) – 6.4GT/s (?)
• Both transfer 16 bits per transmission in
practice, but Sandy Bridge is really 32
MESI+F Cache Coherency Protocol
• Specific to data cache lines
• Request for Ownership (RFO), when a processor tries to write to
a cache line
• Modified, the local processor has changed the cache line,
implying it is the only one that has it
• Exclusive, one processor is using the cache line, not modified
• Shared, multiple processors are using the cache line, not
modified
• Invalid, the cache line is invalid, must be re-fetched
• Forward, the cache designated to respond to requests for a cache line
• All processors MUST acknowledge a message for it to be valid (see the
state sketch after this list)
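A minimal reading aid for the five states as a Java enum, with the transitions from this slide noted as comments; it is not a model of a real coherency controller:

public enum CacheLineState {
    MODIFIED,   // this core changed the line; it is the only holder
    EXCLUSIVE,  // only holder, line is clean; a write moves it to MODIFIED with no announcement
    SHARED,     // several holders, line is clean; a write needs an RFO that invalidates the other copies
    INVALID,    // the line must be re-fetched before use
    FORWARD     // the shared copy designated to respond to requests for the line
}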
Static RAM (SRAM)
• Requires 6-8 pieces of circuitry (transistors) per bit
• Access is measured in clock cycles rather than
wall-clock time
• Uses a relatively large amount of power for
what it does
• Data does not fade or leak, does not need to
be refreshed/recharged
Dynamic RAM (DRAM)
• Requires 2 pieces of circuitry (one transistor and
one capacitor) per bit
• “Leaks” charge, but not sooner than 64ms
• Reads deplete the charge, requiring
subsequent recharge
• Takes 240 cycles (~100ns) to access
• Intel's Nehalem architecture - each CPU socket
controls a portion of RAM, no other socket has
direct access to it
Architectures
Current Processors
• Intel
– Nehalem/Westmere
– Sandy Bridge
– Ivy Bridge
• AMD
– Bulldozer
• Oracle
– UltraSPARC isn't dead
“Latency Numbers Everyone Should Know”
L1 cache reference ......................... 0.5 ns
Branch mispredict ............................ 5 ns
L2 cache reference ........................... 7 ns
Mutex lock/unlock ........................... 25 ns
Main memory reference ...................... 100 ns
Compress 1K bytes with Zippy ............. 3,000 ns = 3 µs
Send 2K bytes over 1 Gbps network ....... 20,000 ns = 20 µs
SSD random read ........................ 150,000 ns = 150 µs
Read 1 MB sequentially from memory ..... 250,000 ns = 250 µs
Round trip within same datacenter ...... 500,000 ns = 0.5 ms
Read 1 MB sequentially from SSD* ..... 1,000,000 ns = 1 ms
Disk seek ........................... 10,000,000 ns = 10 ms
Read 1 MB sequentially from disk .... 20,000,000 ns = 20 ms
Send packet CA->Netherlands->CA .... 150,000,000 ns = 150 ms
• Shamelessly cribbed from this gist: https://gist.github.com/2843375, originally by Peter Norvig and amended by Jeff Dean
Measured Cache Latencies
Sandy Bridge-E L1d L2 L3 Main
=======================================================================
Sequential Access ..... 3 clk 11 clk 14 clk 6ns
Full Random Access .... 3 clk 11 clk 38 clk 65.8ns
SI Software's benchmarks: http://www.sisoftware.net/?d=qa&f=ben_mem_latency
Registers
• On-core for instructions being executed and
their operands
• Can be accessed in a single cycle
• There are many different types
• A 64-bit Intel Nehalem CPU had 128 Integer &
128 floating point registers
Store Buffers
• Hold data for Out of Order (OoO) execution
• Fully associative
• Prevent “stalls” in execution on a thread when
the cache line is not local to a core on a write
• ~1 cycle
Level Zero (L0)
• New to Sandy Bridge
• A cache of the last 1536 uops decoded
• Well-suited for hot loops
• Not the same as the older "trace" cache
Level One (L1)
• Divided into data and instructions
• 32K data (L1d), 32K instructions (L1i) per core
on a Sandy Bridge
• Sandy Bridge loads data at 256 bits per cycle,
double that of Nehalem
• 3-4 cycles to access L1d
Level Two (L2)
• 256K per core on a Sandy Bridge
• 2MB per “module” on AMD's Bulldozer
architecture
• ~11 cycles to access
• Unified data and instruction caches from here
up
• If the working set size is larger than L2, misses
grow
Level Three (L3)
• Was a “unified” cache up until Sandy Bridge,
shared between cores
• Varies in size with different processors and
versions of an architecture
• 14-38 cycles to access
Programming Tips
Striding & Pre-fetching
• Predictable memory access is really important
• Hardware pre-fetcher on the core looks for
patterns of memory access
• Can be counter-productive if the access
pattern is not predictable
• Martin Thompson blog post: “Memory Access
Patterns are Important” shows the importance of
locality and striding (see the traversal sketch
after this list)
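A minimal sketch in the spirit of that post (the array size and names are illustrative): both methods sum the same grid, but the row-order walk strides through contiguous memory and pre-fetches well, while the column-order walk jumps to a different row's backing array on every access.

public final class StrideDemo {
    static final int N = 2048;
    static final int[][] grid = new int[N][N];

    // Predictable unit stride: the hardware pre-fetcher can stay ahead of the loop.
    static long rowOrder() {
        long sum = 0;
        for (int row = 0; row < N; row++)
            for (int col = 0; col < N; col++)
                sum += grid[row][col];
        return sum;
    }

    // Same work, but each access lands in a different row's array,
    // defeating spatial locality and the pre-fetcher.
    static long columnOrder() {
        long sum = 0;
        for (int col = 0; col < N; col++)
            for (int row = 0; row < N; row++)
                sum += grid[row][col];
        return sum;
    }
}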
Cache Misses
• Cost hundreds of cycles
• Keep your code simple
• Instruction read misses are most expensive
• Data read misses are less so, but still hurt performance
• Write misses are okay unless using Write Through
• Miss types:
– Compulsory
– Capacity
– Conflict
Programming Optimizations
• Stack allocated data is cheap
• Pointer indirection - you still have to retrieve the
data being pointed to, even if the pointer is in a register
• Avoid locking and resultant kernel arbitration
• CAS is better and occurs on-thread, but
algorithms become more complex (see the CAS
sketch after this list)
• Match workload to the size of the last level
cache (LLC, L3)
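A minimal sketch of the CAS approach using java.util.concurrent.atomic; the retry loop is the standard compare-and-swap pattern, and the class name is illustrative:

import java.util.concurrent.atomic.AtomicLong;

public final class CasCounter {
    private final AtomicLong value = new AtomicLong();

    // No lock, no kernel arbitration: if another thread won the race,
    // the CAS fails and we simply retry on this thread.
    public long increment() {
        for (;;) {
            long current = value.get();
            long next = current + 1;
            if (value.compareAndSet(current, next)) {
                return next;
            }
        }
    }
}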
What about Functional Programming?
• Immutability means allocating more and more
space for your data structures, which leads to
eviction
• When you cycle back around to older data, you
get cache misses
• Choose immutability by default, profile to find
poor performance
• Use mutable data in targeted locations
Hyperthreading
• Great for I/O-bound applications, or if you have
lots of cache misses
• Doesn't do much for CPU-bound applications
• Each hardware thread gets half of the core's
cache resources
Data Structures
• BAD: Linked list structures and tree structures
• BAD: Java's HashMap uses chained buckets
• BAD: Standard Java collections generate lots of
garbage
• GOOD: Array-based structures that are contiguous
in memory are much faster (see the layout sketch
after this list)
• GOOD: Write your own that are lock-free and
contiguous
• GOOD: Fastutil library, but note that it's additive
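A minimal sketch of the layout contrast (sizes and names are illustrative): the boxed HashMap reaches each entry through bucket and node pointers to separately allocated objects, while the primitive array keeps the same values contiguous, so a sequential scan walks cache line to cache line.

import java.util.HashMap;
import java.util.Map;

public final class LayoutDemo {
    // BAD for locality: every key and value is a separate boxed object,
    // reached through chained bucket nodes.
    static final Map<Integer, Integer> boxed = new HashMap<>();

    // GOOD for locality: values sit contiguously; the index acts as the key.
    static final int[] contiguous = new int[1_000_000];

    static long sumContiguous() {
        long sum = 0;
        for (int v : contiguous) sum += v;
        return sum;
    }
}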
Application Memory Wall & GC
• Tremendous amounts of RAM at low cost
• GC will kill you with compaction
• Use pauseless GC
– IBM's Metronome, very predictable
– Azul's C4, very performant
Using GPUs
• Remember, locality matters!
• Need to be able to export a task with data that
does not need to be updated
The Future
ManyCore
• David Ungar says > 24 cores, generally many
10s of cores
• Really gets interesting above 1000 cores
• Cache coherency won't be possible
• Non-deterministic
Memristor
• Non-volatile, static RAM, same write
endurance as Flash
• 200-300 MB on chip
• Sub-nanosecond writes
• Able to perform processing?
• Multistate, not binary
Phase Change Memory (PRAM)
• Higher performance than today's DRAM
• Intel seems more fascinated by this, just
released its "neuromorphic" chip design
• Not able to perform processing
• Write degradation is supposedly much slower
• Was considered susceptible to unintentional
change, maybe fixed?
Thanks!
Credits!
• What Every Programmer Should Know About Memory, Ulrich Drepper of
RedHat, 2007
• Java Performance, Charlie Hunt
• Wikipedia/Wikimedia Commons
• AnandTech
• The Microarchitecture of AMD, Intel and VIA CPUs
• Everything You Need to Know about the Quick Path Interconnect, Gabriel
Torres/Hardware Secrets
• Inside the Sandy Bridge Architecture, Gabriel Torres/Hardware Secrets
• Martin Thompson's Mechanical Sympathy blog and Disruptor
presentations
• The Application Memory Wall, Gil Tene, CTO of Azul Systems
• AMD Bulldozer/Intel Sandy Bridge Comparison, Gionatan Danti
• SI Software's Memory Latency Benchmarks
• Martin Thompson and Cliff Click provided feedback & additional content

Editor's Notes

  1. To improve performance, we need to understand the impact of our decisions in the physical world. My goal is to help everyone understand how each physical level of caching within a CPU works. Capped the talk below virtual memory (memory management for multitasking kernels), as that is specific to the operating system running on the machine. Not enough time to cover all of the levels of caching that exist on CPUs.
  2. 2011 Sandy Bridge and AMD Fusion integrated Northbridge functions into CPUs, along with processor cores, memory controller and graphics processing unit. Gil Tene of Azul claims the Vega architecture is a true SMP compared to those of AMD/Intel – circular architecture with linkages between every core.
  3. NUMA architectures have more trouble migrating processes across sockets because L3 isn't warmed. Initially, it has to look for a core with the same speed accessing memory resources and enough resources for the process; if none are found, it then allows for degradation.
  4. All algorithms are dominated by that - you have to send data somewhere to have an operation performed on it, and then you have to send it back to where it can be used
  5. Memory controllers were moved onto the processor die on AMD64 and Nehalem
  6. Be careful about what's on them, because if a line holds multiple variables and the state of one changes, coherency must be maintained, killing performance for parallel threads on an SMP machine. If the critical word in the line is in the last block, it takes 30 cycles or more to arrive after the first word. However, the memory controller is free to request the blocks in a different order. The processor can specify the "critical word", and the block that word resides in can be retrieved first for the cache line by the memory controller. The program has the data it needs and can continue while the rest of the line is retrieved and the cache is not yet in a consistent state, called "Critical Word First & Early Restart". Note that with pre-fetching, the critical word is not known - if the core requests the cache line while the pre-fetching is in transit, it will not be able to influence the block ordering and will have to wait until the critical word arrives in order; position in the cache line matters, faster to be at the front than the rear.
  7. Replacement policy determines where in the cache an entry of main memory will go. Tradeoff - if data can be mapped to multiple places, all of those places have to be checked to see if the data is there, costing power and possibly time; but more associativity means fewer cache misses. Set associative has a fixed number of places and can be 2-16-way; skewed does as well, but uses a hash to place the second cache line in the group, only up to 2-way that I know of.
  8. LRU can be expensive; it tracks each cache line with "age bits" - each time a cache line is used, all others have to be updated. PLRU is used where LRU is prohibitively expensive – it "almost always" discards one of the least recently used lines, and having no guarantee that it was the absolutely LEAST recently used is sufficient. 2WSA is for caches where even PLRU is too slow – the address of an item is used to put it in one of two positions, and the LRU of those two is discarded. Direct mapped evicts whatever was in that position before it, regardless of when it was last used.
  9. Write through caching: slow, but predictable, resulting in lots of FSB traffic. Write back caching: marks a flag bit as dirty; when evicted, the processor notes this and sends the cache line back to main memory instead of just dropping it; have to be careful of false sharing and cache coherency. Write combining caching: used by graphics cards, groups writes to main memory for speed. Uncacheable: dynamic memory values that can change without warning, used for commodity hardware.
  10. In an exclusive cache (AMD), when the processor needs data it is loaded by cache line into L1d, which evicts a line to L2, which evicts a line to L3, which evicts a line to main memory; each eviction is progressively more expensive. Exclusive is necessary if the L2 is less than 4-8x the size of L1, otherwise duplication of data affects L2's local hit rate. In an inclusive cache (Intel), eviction is much faster - if another processor wants to delete data, it only has to check L2 because it knows L1 will have it as well; exclusive will have to check both, which is more expensive; good for snooping.
  11. For intra-socket communication, Sandy Bridge has a 4-ring architecture linking cores, LLC and GPU; prior to Sandy Bridge, this required direct linkages. GT/s - how many times signals are sent per second. Not a bus, but a point to point interface for connections between sockets in a NUMA platform, or from processors to I/O and peripheral controllers. DDR = double data rate, two data transferred per clock cycle. Processors "snoop" on each other, and certain actions are announced on external pins to make changes visible to others; the address of the cache line is visible through the address bus. Sandy Bridge can transfer on both the rise & fall of the clock cycle and is full duplex. HTX could theoretically be up to 32 bits per transmission, but hasn't been done yet that I know of.
  12. Request for Ownership (RFO): a cache line is held exclusively for read by one processor, but another requests for write; cache line is sent, but not marked shared by the original holder - it is marked Invalid and the other cache becomes Exclusive. This marking happens in the memory controller. Performing this operation in the last level cache is expensive, as is the I->M transition. If a Shared cache line is modified, all other views of it must be marked invalid via an announcement RFO message. If Exclusive, no need to announce. Processors will try to keep cache lines in an exclusive state, as the E to M transition is much faster than S to M. RFOs occur when a thread is migrated to another processor and the cache lines have to be moved once, or when two threads absolutely must share a line; the costs are a bit smaller when shared across two cores on the same processor. Collisions on the bus can happen, latency can be high with NUMA systems, sheer traffic can slow things down - all reasons to minimize traffic. Since code almost never changes, instruction caches do not use MESI but rather just a SI protocol. If code does change, a lot of pessimistic assumptions have to be made by the controller.
  13. The number of circuits per datum means that SRAM can never be dense.
  14. One transistor, one capacitor Reading contiguous memory is faster than random access due to how you read - you get one write combining buffer at a time from each of the memory banks, 33% slower. 240 cycles to get data from here E7-8870 (Westmere) can hold 4TB of RAM, E5-2600 Sandy Bridge "only" 1TB DDR3-SDRAM: Double Data Rate, Synchronous Dynamic Has a high-bandwidth three-channel interface Reduces power consumption 30% over DDR2 Data is transferred on the rising and falling edges of a 400-1066 MHz I/O clock of the system DDR4 and 5 are coming, or are here with GPGPUs. DDR3L is for lower energy usage
  15. Westmere was the 32nm die shrink of Nehalem from 45nm; Ivy Bridge uses 22nm but keeps the Sandy Bridge architecture. I'm ignoring Oracle SPARC here, but note that Oracle is supposedly building a 16K-core UltraSPARC 64. The current Intel top of the line is the Xeon E7-8870 (8 cores per socket, 32MB L3, 4TB RAM), but it's a Westmere-based microarchitecture and doesn't have the Sandy Bridge improvements. The E5-2600 Romley is the top-shelf new Sandy Bridge offering - its specs don't sound as mighty (8 cores per socket, max 30MB L3, 1TB RAM). Don't be fooled: Data Direct I/O makes up for it, as disks and network cards can do DMA (Direct Memory Access) from L3 rather than RAM, decreasing latency by ~18%. It also has two memory read ports, whereas previous architectures only had one and were a bottleneck for math-intensive applications. Sandy Bridge L3 is 4-30MB (8MB is currently the max for mobile platforms, even with Ivy Bridge) and is shared with the processor graphics.
  16. The problem with these numbers is that, while they're not far off, the lowest-level cache interactions (registers, store buffers, L0, L1) are truly measured in cycles, not time. A clock cycle is also known as the Fetch and Execute or Fetch-Decode-Execute cycle (FDX): retrieve an instruction from memory, determine the actions required, and carry them out; roughly 1/3 of a nanosecond.
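As a rough worked example (my numbers, assuming a 3 GHz clock): one cycle ≈ 1 / (3 × 10⁹ Hz) ≈ 0.33 ns, so the ~240-cycle trip to DRAM mentioned above works out to roughly 80 ns.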
  17. Sandy Bridge has 2 load/store operations for each memory channel. The low values are because the pre-fetcher is hiding the real latency in these tests when "striding" through memory.
  18. Types of registers include: instruction, data (floating point, integer, chars, small bit arrays, etc.), address, conditional, constant, etc.
  19. Store buffers disambiguate memory access and manage dependencies for instructions occurring out of program order. If CPU 1 needs to write to cache line A but doesn't have the line in its L1, it can drop the write into the store buffer while the memory controller fetches the line, avoiding a stall; if the line is already in L1, it can store directly to it (bypassing the store buffer), possibly causing invalidate requests to be sent out.
  20. Macro-ops have to be decoded into micro-ops to be handled by the CPU. Nehalem used the L1i for macro-op decoding and had a copy of every operand required, but Sandy Bridge caches the uops (micro-operations), boosting performance in tight code (small, highly profiled and optimized, can be secure, testable, easy to migrate). This is not the same as the older "trace" cache, which stored uops in the order in which they were executed, which meant lots of potential duplication; there is no duplication in the L0, which stores only unique decoded instructions. Hot loops (where the program spends the majority of its time) should be sized to fit here.
  21. Instructions are generated by the compiler, which knows the rules for good code generation. Code has predictable patterns, and CPUs are good at recognizing them, which helps pre-fetching. Spatial and temporal locality are good. Nehalem could load 128 bits per cycle, but Sandy Bridge can load double that because the load and store address units can be interchanged - thus using two 128-bit units per cycle instead of one.
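As a small illustration of spatial locality (my own example, in Java since that is the talk's context): summing a 2D array row by row walks memory sequentially, so each 64-byte line that is loaded serves many subsequent reads and the pre-fetcher can stay ahead, while summing it column by column jumps a full row's worth of bytes between accesses.

    public class TraversalOrder {
        static final int N = 4096;                       // 4096 x 4096 ints = 64 MB
        static final int[][] grid = new int[N][N];

        static long sumRowMajor() {
            long sum = 0;
            for (int row = 0; row < N; row++)
                for (int col = 0; col < N; col++)
                    sum += grid[row][col];               // contiguous within each row
            return sum;
        }

        static long sumColumnMajor() {
            long sum = 0;
            for (int col = 0; col < N; col++)
                for (int row = 0; row < N; row++)
                    sum += grid[row][col];               // jumps a whole row per access
            return sum;
        }

        public static void main(String[] args) {
            long t0 = System.nanoTime();
            long a = sumRowMajor();
            long t1 = System.nanoTime();
            long b = sumColumnMajor();
            long t2 = System.nanoTime();
            System.out.println("row-major:    " + (t1 - t0) / 1_000_000 + " ms (" + a + ")");
            System.out.println("column-major: " + (t2 - t1) / 1_000_000 + " ms (" + b + ")");
        }
    }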
  22. When the working set size for random access is greater than the L2 size, cache misses start to grow. Caches from L2 up are "unified", holding both instructions and data (not unified in the sense of being shared across cores).
  23. Last Level Cache (LLC): Sandy Bridge allows only exclusive access per core.
  24. Access doesn't have to be contiguous in an array; you can be jumping in 2K chunks without a performance hit. As long as the pattern is predictable, the pre-fetcher will fetch the memory before you need it and have it staged. Pre-fetching happens with L1d and can happen for L2 in systems with long pipelines. Hardware pre-fetching cannot cross page boundaries, which slows it down as the working set size increases; if it did cross and the page wasn't there or was invalid, the OS would have to get involved and the program would experience a page fault it didn't initiate itself. Temporally relevant data may be evicted while fetching a line that will never be used.
  25. The pre-fetcher will bring data into L1d for you, and the simpler your access pattern, the better it can do this. If your code is complex, it will guess wrong, costing a cache miss, losing the value of pre-fetching, and forcing a trip out to an outer layer to get that line again. Compulsory, capacity and conflict are the cache miss types.
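A hedged sketch of how much the pre-fetcher is doing for you (sizes and names are my own): the two passes below touch exactly the same set of cache lines, once with a constant stride and once in a shuffled order, so only the predictability of the access pattern differs.

    import java.util.Random;

    public class PrefetchFriendliness {
        static final int LINE_INTS = 16;                 // 64-byte line / 4-byte int
        static final int LINES = 1 << 20;                // 1M lines -> 64 MB of data
        static final int[] data = new int[LINES * LINE_INTS];

        static long touch(int[] lineOrder) {
            long sum = 0;
            for (int line : lineOrder)
                sum += data[line * LINE_INTS];           // one read per cache line
            return sum;
        }

        public static void main(String[] args) {
            int[] sequential = new int[LINES];
            for (int i = 0; i < LINES; i++) sequential[i] = i;

            int[] shuffled = sequential.clone();         // same lines, random order
            Random rnd = new Random(42);
            for (int i = LINES - 1; i > 0; i--) {        // Fisher-Yates shuffle
                int j = rnd.nextInt(i + 1);
                int tmp = shuffled[i]; shuffled[i] = shuffled[j]; shuffled[j] = tmp;
            }

            long t0 = System.nanoTime();
            long a = touch(sequential);
            long t1 = System.nanoTime();
            long b = touch(shuffled);
            long t2 = System.nanoTime();
            System.out.println("constant stride: " + (t1 - t0) / 1_000_000 + " ms (" + a + ")");
            System.out.println("shuffled order:  " + (t2 - t1) / 1_000_000 + " ms (" + b + ")");
        }
    }

The constant-stride pass is predictable, so the pre-fetcher stages the lines ahead of time; the shuffled pass pays close to the full miss latency on most accesses.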
  26. Short-lived data is not cheap, but variables scoped within a method running on the JVM are stack allocated and very fast. Think about the effect of contended locking: you've got a warmed cache with all of the data you need, and because of contention (arbitrated at the kernel level) the thread is put to sleep and your cached data sits there until you can gain the lock from the arbitrator. Due to LRU, it could be evicted. The kernel is general purpose and may decide to do some housekeeping, like defragging some memory, further polluting your cache. When your thread finally does gain the lock, it may end up running on an entirely different core and will have to re-warm its cache; everything you do will be a cache miss until it's warm again. CAS is better: check the value before you replace it with a new value, and if it's different, re-read and try again. This happens in user space, not in the kernel, all on the same thread. Algorithms get a lot harder, though - state machines with many more steps and complexity - and there is still a non-negligible cost. Remember the cycle time differences for different cache sizes; if the workload can be tailored to the size of the last level cache, performance can be dramatically improved. Note that for large data sets, you may have to be oblivious to caching.
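A minimal sketch of that CAS loop in Java (class and method names are mine): read the current value, compute the new one, and install it only if nobody changed it in the meantime; on failure, re-read and retry, all in user space on the same thread.

    import java.util.concurrent.atomic.AtomicLong;

    public class CasCounter {
        private final AtomicLong value = new AtomicLong();

        long increment() {
            while (true) {
                long current = value.get();                 // read the current value
                long next = current + 1;                    // compute the replacement
                if (value.compareAndSet(current, next)) {   // install only if unchanged
                    return next;                            // success, no kernel involved
                }
                // another thread won the race: re-read and try again, still in user space
            }
        }
    }

java.util.concurrent's AtomicLong.incrementAndGet() performs essentially this loop internally; the sketch just makes the read-compute-swap cycle explicit.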
  27. Eventually, caches have to evict; cache misses cost hundreds of cycles.
  28. Hyperthreading shares all CPU resources except the register set. Intel CPUs limit it to two threads per core, allowing one hyperthread to access resources (like the arithmetic logic unit) while the other hyperthread is delayed, usually by memory access. It is only more efficient if the combined runtime is lower than that of a single thread, which is possible by overlapping wait times for different memory accesses that would ordinarily be sequential; the only variable is the number of cache hits. Note that the effectively shared L1d and L2 reduce the available cache and bandwidth for each thread to 50% when executing two completely different threads of execution (questionable unless the caches are very large), inducing more cache misses, so hyperthreading is only useful in a limited set of situations. It can be good for debugging with SMT (simultaneous multi-threading). Contrast this with Bulldozer's "modules" (akin to two cores), which have dedicated schedulers and integer units for each thread in a single processor.
  29. If n is not very large, an array will beat other data structures for performance. Linked lists and trees involve pointer chasing, which is bad for striding across 2K pre-fetchable chunks. Java's HashMap uses chained buckets, where each hash's bucket is a linked list. Clojure/Scala vectors are good because they use groupings of contiguous memory but do not require all data to be contiguous the way Java's ArrayList does. Cliff Click has his lock-free library, and Boundary have added their own implementations. Fastutil is additive with no removal, but that's not necessarily a bad thing with proper tombstoning and data cleanup.
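A rough, hedged comparison of the pointer-chasing cost (the example is mine; both lists hold boxed Integers, so neither is as cache-friendly as a plain int[], but the LinkedList adds an extra dependent load per element that the pre-fetcher cannot predict):

    import java.util.ArrayList;
    import java.util.LinkedList;
    import java.util.List;

    public class ListTraversal {
        // The same loop over both lists: with ArrayList the iterator walks a contiguous
        // Object[]; with LinkedList every element costs chasing a next pointer first.
        static long sum(List<Integer> list) {
            long sum = 0;
            for (int v : list) sum += v;
            return sum;
        }

        public static void main(String[] args) {
            final int n = 2_000_000;
            List<Integer> arrayList = new ArrayList<>(n);
            List<Integer> linkedList = new LinkedList<>();
            for (int i = 0; i < n; i++) { arrayList.add(i); linkedList.add(i); }

            long t0 = System.nanoTime();
            long a = sum(arrayList);
            long t1 = System.nanoTime();
            long b = sum(linkedList);
            long t2 = System.nanoTime();
            System.out.println("ArrayList:  " + (t1 - t0) / 1_000_000 + " ms (" + a + ")");
            System.out.println("LinkedList: " + (t2 - t1) / 1_000_000 + " ms (" + b + ")");
        }
    }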
  30. We can have as much RAM/heap space as we want now, and requirements for RAM grow at about 100x per decade. But are we now bound by GC? You can get 100GB of heap, but how long do you pause for marking/remarking phases and compaction? Even on a 2-4 GB heap you're going to get multi-second pauses - when, and how often? IBM's Metronome collector is very predictable; Azul's C4 stays around one millisecond even with lots of garbage.
  31. Graphics card processing (OpenCL) is great for specific kinds of operations, like floating point arithmetic. But GPUs don't perform well for all operations, and you have to get the data to them.
  32. When the number of cores is large enough, traditional multi-processor techniques are no longer efficient, largely because of congestion in supplying instructions and data to so many processors. Network-on-chip technology may be advantageous above this threshold.
  33. From HP, due to come out this year or next, and it may change memory fundamentally. If the data and the function are together, you get the highest throughput and lowest latency possible. It could replace DRAM and disk entirely. When current flows in one direction through the device, the electrical resistance increases; when current flows in the opposite direction, the resistance decreases. When the current is stopped, the component retains the last resistance it had, and when the flow of charge starts again, the resistance of the circuit will be what it was when it was last active. Is the write endurance good enough for anything but storage?
  34. PRAM relies on a thermally driven phase change, not an electronic process. It's faster because it doesn't need to erase a block of cells before writing. Flash degrades at around 5,000 writes per sector, whereas PRAM lasts 100 million writes. Some argue that PRAM should be considered a memristor as well.