IBM T.J. Watson Research Center
RACES’12 Oct 21, 2012 © 2012 IBM Corporation
Edge Chasing Delayed Consistency:
Pushing the Limits of Weak Memory Models
Harold “Trey” Cain
IBM T.J. Watson Research Center
Prof. Mikko H. Lipasti
University of Wisconsin
IBM Research
© 2012 IBM Corporation
2 Cain and Lipasti Oct 21, 2012
Gotta go back in time!
 Part of Ph.D. Dissertation
– Never submitted for publication, until now.
– Looked particularly relevant when I saw the RACES CFP.
 Journey back in time to the year 2004, when…
– … Mark Zuckerberg launched Facebook
– … Janet Jackson suffered a “wardrobe malfunction”
during the Super Bowl halftime show
– … an incumbent president was being challenged by a
Massachusetts politician
 88mph here we come!
Edge Chasing Delayed Consistency: Pushing the Limits of
Weak Ordering
 From the RACES website:
– “an approach towards scalability that reduces synchronization
requirements drastically, possibly to the point of discarding them
altogether.”
 A hardware developer’s perspective:
– Constraints of Legacy Code
• What if we want to apply this principle, but have no control over the
applications that are running on a system?
– Can one build a coherence protocol that avoids synchronizing cores as
much as possible?
• For example by allowing each core to use stale versions of cache lines as
long as possible
• While maintaining architectural correctness; i.e. we will not break existing
code
• If we do that, what will happen?
Cache-Coherent Shared-memory multiprocessors
 Are ubiquitous
 Coherence misses are a major source of performance loss for
shared memory applications
Figure: shared-memory multiprocessors 10 years ago and today
16 MB L3 cache misses per 1000 instructions
Edge-Chasing Delayed Consistency (ECDC)
 A new hardware implementation of POWER weak
ordering
– Not a new consistency model
 Allows a cache line to be non-speculatively read
after being invalidated.
 Based on necessary conditions
– Processor must fetch new data only if causally dependent
on it.
Constraint graph
 Introduced for SC by Landin et al., ISCA-18
 A directed graph that represents a multithreaded execution
– Nodes represent dynamic instances of instructions
– Edges represent their transitive orders (program order, RAW,
WAW, WAR).
 If the constraint graph is acyclic, then the execution is
correct
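The acyclicity test above can be sketched in a few lines. This is a minimal illustration, not the authors' tooling: the function name `has_cycle`, the node labels, and the two small executions (a classic message-passing litmus test under SC) are assumptions chosen for the example.

```python
from collections import defaultdict

def has_cycle(edges):
    """Return True if the directed graph given as (src, dst) pairs contains a cycle."""
    graph = defaultdict(list)
    nodes = set()
    for src, dst in edges:
        graph[src].append(dst)
        nodes.update((src, dst))

    WHITE, GREY, BLACK = 0, 1, 2          # unvisited, on the DFS stack, finished
    color = {n: WHITE for n in nodes}

    def visit(n):
        color[n] = GREY
        for m in graph[n]:
            if color[m] == GREY:          # back edge: a cycle exists
                return True
            if color[m] == WHITE and visit(m):
                return True
        color[n] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in nodes)

# Hypothetical execution: P1 runs ST A; ST B, P2 runs LD B; LD A.
# P2 sees the new B (RAW edge) but the old A (WAR edge back to ST A).
forbidden = [
    ("P1:ST A", "P1:ST B"),   # program order
    ("P1:ST B", "P2:LD B"),   # read-after-write
    ("P2:LD B", "P2:LD A"),   # program order
    ("P2:LD A", "P1:ST A"),   # write-after-read
]
# Same program, but P2 sees the new A as well (RAW edge instead of WAR).
allowed = forbidden[:3] + [("P1:ST A", "P2:LD A")]

forbidden_is_cyclic = has_cycle(forbidden)
allowed_is_cyclic = has_cycle(allowed)
```

Under these assumptions the first execution yields a cycle and the second does not, which matches the rule that only acyclic graphs correspond to correct executions.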
Constraint graph example - WO
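Figure: constraint graph for an execution on Proc 1 and Proc 2 with operations LD A, ST B, LD B, ST A and two memory barriers (MB). Edge labels: ST->MB order, LD->MB order, MB->ST order, MB->LD order, write-after-read dependence order, read-after-write dependence order. Edges are numbered 1 to 5.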
Observation: An aggressive coherence protocol can ignore coherence messages
unless doing so will create a cycle in the constraint graph
Edge-chasing delayed consistency
 Based on edge-chasing algorithms used by distributed
database systems for deadlock detection
Figure: wait-for graph (WFG) over processes P1, P2, P3, P4 (“Wham-O!”)
A cycle in the WFG is detected when a locally created probe is received
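A rough sketch of the edge-chasing idea follows. It simulates probe forwarding centrally with a queue, whereas real algorithms of this kind (for example Chandy-Misra-Haas) run as message exchanges between processes. The function name `detects_deadlock` and the four-process wait-for graph are assumptions made for illustration.

```python
from collections import deque

def detects_deadlock(waits_for, initiator):
    """Forward a probe created by `initiator` along wait-for edges.

    Returns True if the probe arrives back at the process that created it.
    """
    pending = deque(waits_for.get(initiator, ()))
    forwarded = set()
    while pending:
        proc = pending.popleft()
        if proc == initiator:
            return True                    # locally created probe received
        if proc in forwarded:
            continue                       # this process already passed the probe on
        forwarded.add(proc)
        pending.extend(waits_for.get(proc, ()))
    return False

# Hypothetical wait-for graphs: P1 -> P2 -> P3 -> P4, with and without P4 -> P1.
cyclic_wfg = {"P1": ["P2"], "P2": ["P3"], "P3": ["P4"], "P4": ["P1"]}
acyclic_wfg = {"P1": ["P2"], "P2": ["P3"], "P3": ["P4"]}

deadlock_found = detects_deadlock(cyclic_wfg, "P1")
no_deadlock_found = detects_deadlock(acyclic_wfg, "P1")
```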
ECDC - Basic idea
 Observation: Cycles in constraint graph can be detected
using a similar mechanism
 Protocol:
– Upon write miss, create a “probe”
– Upon receipt of invalidation, add probe to cache line
• Continue to read stale block until the probe is re-observed on
another message
– Pass probes to other processors whenever they communicate
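The protocol steps above can be mimicked by a toy model. This is a sketch under stated assumptions, not the hardware design: the class `Core`, its method names, the tuple encoding of probes, and the choice to place a new probe in the writer's own observed set are all hypothetical simplifications.

```python
class Core:
    """Toy model of one core's probe bookkeeping (hypothetical names throughout)."""

    def __init__(self, name):
        self.name = name
        self.stale = {}        # address -> probe attached by the invalidation
        self.observed = set()  # probes this core has seen on any message
        self._count = 0

    def write_miss(self):
        """Create a fresh probe for a write miss (assumed to join the writer's own set)."""
        self._count += 1
        probe = (self.name, self._count)
        self.observed.add(probe)
        return probe

    def receive_invalidation(self, addr, probe):
        """Keep the invalidated line as stale, tagged with the incoming probe."""
        if probe not in self.observed:
            self.stale[addr] = probe

    def receive_message(self, probes):
        """Merge probes carried by a message; drop stale lines whose probe reappears."""
        self.observed |= set(probes)
        self.stale = {a: p for a, p in self.stale.items() if p not in self.observed}

    def can_read_stale(self, addr):
        return addr in self.stale


p1, p2 = Core("P1"), Core("P2")
probe_a = p2.write_miss()                 # P2: ST A creates a probe
p1.receive_invalidation("A", probe_a)     # P1 keeps line A as a stale copy
stale_ok_before = p1.can_read_stale("A")  # no causal link to ST A yet
p1.receive_message(p2.observed)           # P1: LD B, with data supplied by P2
stale_ok_after = p1.can_read_stale("A")   # probe re-observed: stale copy dropped
```

In this scenario P1 may keep reading the stale copy of A until it receives a message from P2 that carries the probe, at which point the stale copy is discarded.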
Example – necessary miss (SC)
Figure: Proc 1 and Proc 2 with operations LD A, ST B, LD B, ST A, LD A, connected by RAW and WAR edges. Annotations: line A is in proc 1’s cache, valid bit = 1; line A is in proc 1’s cache, valid bit = 0; supplanter probe for A (ProbeA).
Detecting critical writes
 Some write values shouldn’t be delayed (e.g. lock
releases, barriers, etc.)
 Two heuristics
– Atomic primitives – any cache block that has been
touched by a store-conditional should not be delayed
– Polling detection – If consecutive cache accesses have
same PC and address, discard stale line
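The two heuristics can be written down as a small sketch. The class `PollingDetector`, the function `may_use_stale`, and the sample PC and address values are assumptions for illustration; the slides do not specify how the hardware tracks this state.

```python
class PollingDetector:
    """Flags an access that repeats the previous (PC, address) pair."""

    def __init__(self):
        self.last = None   # (pc, address) of the previous cache access

    def access(self, pc, address):
        repeated = self.last == (pc, address)
        self.last = (pc, address)
        return repeated


def may_use_stale(touched_by_store_conditional, is_polling):
    """A stale line stays usable only if neither heuristic marks it as critical."""
    return not (touched_by_store_conditional or is_polling)


detector = PollingDetector()
first = detector.access(0x100, 0x8000)    # first access: nothing to compare against
second = detector.access(0x100, 0x8000)   # same PC and address: looks like polling
third = detector.access(0x104, 0x8000)    # different PC: not flagged
```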
Performance Evaluation
 PHARMSim – Cycle-mode Full System Simulator
– Based on SimpleMP including Sun Gigaplane-like snooping coherence protocol [Rajwar], within
the SimOS-PPC full-system simulator
– Out-of-order single-threaded core
– 32k DM L1 icache (1), 32k DM L1 dcache (1), 256K 8-way L2 (7), 8MB 8-way L3 (15), 64 byte
cache lines
– Memory (400 cycle/100 ns best-case latency, 10 GB/s bandwidth based on 5 GHz clock)
– Stride-based prefetcher modeled after Power4
 Lock-free list insertion microbenchmark
 Full applications
– SPLASH2: fft, fmm, ocean, radix, raytrace
– Commercial: DB2/TPC-B, DB2/TPC-H, SPECjbb2000, SPECweb99
Why delayed consistency?
 False sharing/Silent sharing
 Convergent/Data-race tolerant algorithms
– Genetic algorithms
– Parallel equation solvers
– Sparse matrix factorization
 Lock-free parallel linked data structures
Lock-free Algorithms
 For example list insertion:
– New node’s next pointer set to cur
– CAS operation atomically updates prev’s next pointer to new
 Increasingly common
Figure: list nodes prev and cur, with node new being inserted between them
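The insertion steps can be sketched as follows. Python has no hardware compare-and-swap, so `cas_next` emulates one with a per-node lock; that emulation, the class `Node`, and the function `insert_after` are assumptions for illustration. Deletion and hazard pointers are left out.

```python
import threading

class Node:
    def __init__(self, value, next_node=None):
        self.value = value
        self.next = next_node
        self._lock = threading.Lock()   # only used to emulate an atomic CAS

    def cas_next(self, expected, new):
        """Emulated compare-and-swap on the next pointer."""
        with self._lock:
            if self.next is expected:
                self.next = new
                return True
            return False


def insert_after(prev, value):
    """Insert a new node after prev, retrying if another thread wins the race."""
    while True:
        cur = prev.next
        new = Node(value, next_node=cur)   # new node's next pointer set to cur
        if prev.cas_next(cur, new):        # atomically swing prev.next to new
            return new


tail = Node("tail")
head = Node("head", tail)
inserted = insert_after(head, "middle")
```

A failed CAS means another thread changed `prev.next` in the meantime, so the loop rereads `cur` and tries again with a fresh node.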
Prior work (Delayed consistency)
 Invalidate-based receiver-delayed protocols, sender-delayed
protocols (Dubois et al., SC ’91)
 Lazy release consistency (Keleher et al., ISCA ’92)
 Update-based receiver-delayed, sender-delayed protocols
(Afek et al., TOPLAS ’93)
 Tear-off blocks in DSI (Lebeck and Wood, ISCA ’95)
 Write cache for reducing bandwidth in update coherence
protocol (Dahlgren and Stenstrom, JPDC ’95)
Lock-free list microbenchmark
Figure: cycles/search (y-axis, 0 to 18000) versus % updates (x-axis, 0 to 100); series: base-1000, ecdc-1000, base-100, ecdc-100, base-10, ecdc-10.
 Based on hazard-pointer lock-free list maintenance algorithm [Michael, PODC ’02]
15 threads randomly updating or searching linked list, 1 thread performing searches
Intolerable miss reduction
Left to right: a) baseline, b) ECDC base,
c) ECDC merged read/write sets, d) ECDC scalar probe set
ECDC Performance (Infinite resources)
Conclusions
 Of nine applications studied, performance improvement for two
– Mostly due to reduction in false sharing misses
 Other applications:
– Not enough coherence misses, or
– The avoidance of those misses does not improve performance
 We believe these results generalize to lock-based programs
 Other programming models may have potential
– As shown, lock-free data structures
• Should also apply to transactional programming model
– But beware, “Premature Optimization is the Root of All Evil” – Donald Knuth
– Best to identify apps with a communication bottleneck before attacking
Questions?
Backup slides
Base machine model
PHARMsim: Based on SimpleMP including Sun Gigaplane-like snooping coherence protocol [Rajwar], within the SimOS-PPC full-system simulator
Out-of-order execution core: 15-stage, 8-wide pipeline; 256 entry reorder buffer, 128 entry load/store queue; 32 entry issue queue
Functional units (latency): 8 Int ALUs (1), 3 Int MULT/DIV (3/12), 4 FP ALUs (4), 4 FP MULT/DIV (4,4); 4 L1 Dcache load ports in OoO window; 1 L1 Dcache load/store port at commit
Front-end: Combined bimodal (16k entry)/gshare (16k entry) branch predictor with 16k entry selection table, 64 entry RAS, 8k entry 4-way BTB
Memory system (latency): 32k DM L1 icache (1), 32k DM L1 dcache (1); 256K 8-way L2 (7), 8MB 8-way L3 (15), 64 byte cache lines; memory (400 cycle/100 ns best-case latency, 10 GB/s bandwidth based on 5 GHz clock); stride-based prefetcher modeled after Power4
Causality (Lamport)
 An instruction i is causally dependent upon instruction j if there is a directed path from j to i
 Two operations are concurrent if neither causally depends upon the other
 Coherence misses are a significant source of performance degradation for many applications
 If two operations are concurrent, why is their performance penalized?
Figure: timeline of operations on processors P1, P2, and P3: st A, st C, ld A, st B, ld C, ld B, ld A.
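The causal-dependence and concurrency definitions on this slide amount to reachability queries on the graph. A minimal sketch, with the function names `reachable` and `concurrent` and the small three-edge graph being assumptions for illustration rather than the execution drawn on the slide:

```python
def reachable(graph, src, dst):
    """Return True if there is a directed path from src to dst."""
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph.get(node, ()))
    return False


def concurrent(graph, i, j):
    """Two operations are concurrent if neither causally depends on the other."""
    return not reachable(graph, i, j) and not reachable(graph, j, i)


# Hypothetical execution: P1 runs st A; st B, and P2 runs ld A (reading P1's A); st C.
graph = {
    "P1: st A": ["P1: st B", "P2: ld A"],
    "P2: ld A": ["P2: st C"],
}
b_and_c_concurrent = concurrent(graph, "P1: st B", "P2: st C")
a_and_c_concurrent = concurrent(graph, "P1: st A", "P2: st C")
```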
Prior work: formal memory model representations
 Local, WRT, global “performance” of memory ops (Dubois et
al., ISCA-13)
 Acyclic graph representation (Landin et al., ISCA-18)
 Modeling memory operation as a series of sub-operations
(Collier, RAPA)
 Acyclic graph + sub-operations (Adve, thesis)
 Initiation event, for modeling early store-to-load forwarding
(Gharachorloo, thesis)
Anatomy of a cycle
Figure: Proc 1 and Proc 2 with operations ST A, LD A, ST B, LD B, connected by two program order edges, a WAR edge, and a RAW edge. Annotations: incoming invalidate, cache miss.
Other prior work
 Speculative stale value usage
– LVP with Stale Values (Lepak, Ph.D. Thesis ‘03)
– Coherence Decoupling (Huh et al., ASPLOS ’04)
 Delayed RFO response to improve
synchronization throughput (Rajwar et al., HPCA
’00)
Constraint graph extensions
 Constraint graph definition differs for other
consistency models
 Processor consistency
– Remove program order edges from stores to subsequent
loads
– Remaining single-thread orders: edges from
• Loads to subsequent loads
• Stores to subsequent stores
• Loads to subsequent stores
Constraint graph extensions
 Constraint graph definition differs for other
consistency models
 Weak ordering
– Remove program order edges
– Add single-thread ordering edges between
• memory barrier and preceding/following instructions
• same address reads/writes
• dependent instructions
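The per-model edge rules on these two slides can be compared side by side in a short sketch. The function `ordering_edges`, the tuple encoding of instructions, and the model labels are assumptions for illustration; register data dependences are omitted here, so the weak-ordering case covers only barrier and same-address edges.

```python
def ordering_edges(thread, model):
    """Return single-thread ordering edges as (earlier_index, later_index) pairs.

    thread: list of (kind, addr) with kind in {"LD", "ST", "MB"}; addr is None for MB.
    model:  "SC", "PC", or "WO".
    """
    edges = []
    for i in range(len(thread)):
        for j in range(i + 1, len(thread)):
            (kind_i, addr_i), (kind_j, addr_j) = thread[i], thread[j]
            if model == "SC":
                ordered = True                                   # full program order
            elif model == "PC":
                ordered = not (kind_i == "ST" and kind_j == "LD")  # drop store-to-load
            elif model == "WO":
                same_addr = addr_i is not None and addr_i == addr_j
                ordered = "MB" in (kind_i, kind_j) or same_addr
            else:
                raise ValueError(f"unknown model: {model}")
            if ordered:
                edges.append((i, j))
    return edges


store_then_load = [("ST", "A"), ("LD", "B")]
with_barrier = [("ST", "A"), ("MB", None), ("LD", "B")]

sc_edges = ordering_edges(store_then_load, "SC")
pc_edges = ordering_edges(store_then_load, "PC")
wo_edges = ordering_edges(store_then_load, "WO")
wo_barrier_edges = ordering_edges(with_barrier, "WO")
```

With these inputs the store-then-load pair is ordered only under SC, and the barrier restores the ordering under weak ordering through the two barrier edges.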
PC Example – Dekker’s Alg.
Figure: Proc 1 executes ST A then LD B; Proc 2 executes ST B then LD A. Edge labels: write-after-read dependence order, program order (twice). Edges are numbered 1 to 4.
Lack of store-to-load order results in acyclic graph
Constraint graph example - SC
Figure: Proc 1 and Proc 2 with operations ST A, LD A, ST B, LD B. Edge labels: program order (twice), write-after-read dependence order, read-after-write dependence order. Edges are numbered 1 to 4.
Cycle indicates that execution is incorrect
Constraint graph example - PC
Figure: Proc 1 and Proc 2 with operations ST A, LD B, ST B, LD A. Edge labels: program order (twice), write-after-read dependence order, read-after-write dependence order. Edges are numbered 1 to 4.
ECDC Conceptual Description
 Identify causal dependences (upstream probe sets)
– 1 upstream set per processor
– 2 upstream sets per cache block (read set, write set)
 Communicating dependences
– Probe sets passed on response messages
– Probes attached to incoming invalidation messages
– Extra ProbePropagation messages sent at memory barriers
 Identifying usable stale blocks
– Extra stable state in cache (ST)
– Supplanter probe
ECDC Operation
Table: upstream probe sets after each step. Columns: Фproc upstream, Ф(read|write)A, Ф(read|write)B. Rows: Initially; 1. ld A; 2. st A; 3. ld B; 4. st B; 5. ld C.
Finite ECDC Performance
 When restricting PPB/STAB resources (220 KB per
processor)
– 16k probe lifetime counter
– 128 entry STAB per processor
– 32 Entry PPB per processor/directory controller (256 PPB
virtual namespace)
 TPC-H/SPECweb99 performance is within the margin of
error of the infinite-resource results
Non-atomicity of writes
 Absent from model
 Effect on optimizations
– Forces unnecessary orders to exist
– Correct, but another example of over-conservatism
 Hopefully, infrequent performance divot
Processor p1
st r1, [A]
Processor p2
ld r1, [A]
st r2, [r1]
Processor p3
ld r1, [B]
membar
ld r2, [A]
ECDC Base machine model
PHARMsim: Based on SimpleMP including Sun Gigaplane-like snooping coherence protocol [Rajwar], within the SimOS-PPC full-system simulator
Out-of-order execution core: 15-stage, 8-wide pipeline; 256 entry reorder buffer, 128 entry load/store queue; 32 entry issue queue
Functional units (latency): 8 Int ALUs (1), 3 Int MULT/DIV (3/12), 4 FP ALUs (4), 4 FP MULT/DIV (4,4); 4 L1 Dcache load ports in OoO window; 1 L1 Dcache load/store port at commit
Front-end: Combined bimodal (16k entry)/gshare (16k entry) branch predictor with 16k entry selection table, 64 entry RAS, 8k entry 4-way BTB
Cache hierarchy (latency): 32k DM L1 icache (1), 32k DM L1 dcache (1); 256K 8-way L2 (7), 16 MB 8-way L3 (15), 128 byte cache lines; stride-based prefetcher modeled after Power4
Memory system (latency): 2-D static DOR routed torus interconnect, 60 cycles per link+route (40 GB/s bandwidth per link, 5 GHz clock); memory (400 cycle best-case latency, 10 GB/s bandwidth)
Mapping ECDC to HW
 STAB – Maintains supplanting probe for each stale cache block
 PPB – Maintains approximation of upstream sets
 In caches – 2 extra bits for stale state and synch heuristic
Figure: node block diagram with components P, I$, D$, L2 $, STAB, PPB, Castout PPB, NIC, MemCtr, Dir, and DRAM.
Probe representation
 Each probe represented by n-bit timer
 Stale block may be used until supplanting probe
timer expires
 Probe set in p-processor system represented by p
timers
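A rough sketch of the timer representation follows. The slide gives only the n-bit timer idea, so the countdown direction, the starting value of 2^n - 1, the class name `ProbeTimer`, and the 14-bit width (chosen to give a lifetime on the order of 16k cycles) are all assumptions.

```python
class ProbeTimer:
    """An n-bit countdown timer standing in for one probe (assumed encoding)."""

    def __init__(self, n_bits):
        self.remaining = (1 << n_bits) - 1   # largest value an n-bit timer can hold

    def tick(self, cycles=1):
        self.remaining = max(0, self.remaining - cycles)

    def expired(self):
        return self.remaining == 0


def stale_block_usable(supplanting_timer):
    """A stale block may be used until its supplanting probe's timer expires."""
    return not supplanting_timer.expired()


timer = ProbeTimer(14)
initial_value = timer.remaining
timer.tick(16000)
usable_midway = stale_block_usable(timer)
timer.tick(383)
usable_at_end = stale_block_usable(timer)

# A probe set in a p-processor system: one timer per processor (here p = 4).
probe_set = [ProbeTimer(14) for _ in range(4)]
```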
STAB Detail
Figure: STAB structure. Labels: address, timer, Cache, Incoming Invalidates, counters (p1, p2, p3).
PPB Detail
Figure: PPB structure. Labels: address hash, timer index table, shift register/probe timers, incoming upstream set, expired upstream set.
Memory consistency review
 Memory consistency model
– Specifies the programming interface to a shared memory
– i.e. the allowable interleaving of instructions
 Models discussed here:
– Sequential Consistency
– Processor Consistency
• No store-to-load program order
– Weak Ordering
• Order wrt memory barriers
• Same-address order
• Dependence order
Example – necessary miss (SC)
Figure: Proc 1 and Proc 2 with operations LD A, ST B, LD B, ST A, LD A, connected by RAW, WAR, and program order (PO) edges. Annotations: block A is in proc 1’s cache, valid bit = 1; block A is in proc 1’s cache, valid bit = 0.
Example – avoidable miss (SC)
Figure: Proc 1 and Proc 2 with operations LD A, ST B, LD B, ST A, LD A, connected by RAW, WAR, and program order (PO) edges. Annotations: block A is in proc 1’s cache, valid bit = 1; block A is in proc 1’s cache, valid bit = 0.
Typical ReadX transaction
 When sending invalidation, create probe, add to PPB
 At receipt of invalidation (2b, 2c) add probe to STAB
 When sending invalidate acknowledgment, add probe set to the response
 When receiving invalidate acknowledgment, add incoming probe set to the PPB
Figure: message flow among nodes R, H, S1, and S2: 1. ReadX; 2(a) Sharers/Data; 2(b) Inval; 2(c) Inval; 3(a) Inval Ack; 3(b) Inval Ack.
Invalidation to read distance
Figure: % of load coherence misses (y-axis, 0% to 100%) versus cycles (x-axis, log scale from 1 to 1E+09); series: fft, fmm, ocean, radix, raytrace, SPECjbb2000, SPECweb99, TPC-B, TPC-H.
Invalidation to read distance (synch)
Figure: % of load coherence misses (y-axis, 0% to 100%) versus cycles (x-axis, log scale from 1 to 1E+09); series: fft, fmm, ocean, radix, raytrace, SPECjbb2000, SPECweb99, TPC-B, TPC-H.
Invalidation to read distance (data)
Figure: % of load coherence misses (y-axis, 0% to 100%) versus cycles (x-axis, log scale from 1 to 1E+09); series: fft, fmm, ocean, radix, raytrace, SPECjbb2000, SPECweb99, TPC-B, TPC-H.
STAB entry death cdf
Figure: fraction of STAB entries deallocated (y-axis, 0 to 1) versus cycles (x-axis, log scale from 1 to 1000000); series: fft, fmm, ocean, radix, raytrace, SPECjbb2000, SPECweb99, TPC-B, TPC-H.
STAB Entry Lifetime
ECDC performance (16k probe lifetime)
ECDC Perf (128 entry STAB, 32 entry PPB, 256 entry namespace)
ProbePropagation messages
ECDC Storage Overhead
Figure: storage in KB (y-axis, 0 to 350) versus processor count (x-axis: 4p, 8p, 16p, 32p, 64p, 128p, 256p, 512p, 1024p).
What about limit study?
 Indicated a larger number of avoidable coherence
misses
 Reasons:
– Did not account for non-speculative nature of protocol
(oracle ECDC could be better)
– Inaccurate measurement of critical writes
• Many loads perform polling to lines that have never been
touched by a load-linked or store-conditional
– Used isolated stale data detection mechanism
What about speculative load squashes?
 In a few applications, they occur frequently
(SPECjbb2000, TPC-H)
 Implemented/evaluated read-set-tracking w/
squash on miss
 Could eliminate a large fraction of squashes
– Unfortunately, little performance improvement
– Presumably, many squashes caused by contended spin
locks
ECDC and other consistency models
 Stricter model => more ProbePropagation
messages
 Potential for release consistency
 In SC/PC/TSO, ECDC benefits will probably be
dominated by extra ProbePropagation messages
Cause of STAB entry deallocation
Publications
 [ISCA ’04] Memory ordering: A Value-based approach.
– Selected for IEEE Micro Top Picks ‘04
 [PACT ’03] Constraint Graph Analysis of Multithreaded Programs.
– Selected for Best of PACT JILP Issue
 [PACT ’03] Redeeming IPC as a Performance Metric for Multithreaded Programs.
 [CAECW ’02] Precise and Accurate Processor Simulation
 [SPAA Revue ’02] Verifying Sequential Consistency Using Vector Clocks.
 [Micro ’01] Correctly Implementing Value Prediction in Microprocessors that Support
Multithreading or Multiprocessing.
 [WBT ’01] A Dynamic Binary Translation Approach to Architectural Simulation
 [HPCA ’01] An Architectural Characterization of Java TPC-W.
 [Euro-Par ’00] A Callgraph-Based Search Strategy for Automated Performance
Diagnosis.
– Selected as distinguished paper
 [CAECW ’00] Characterizing a Java Implementation of TPC-W

More Related Content

Similar to Edge Chasing Delayed Consistency: Pushing the Limits of Weak Memory Models

QPACE - QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)
QPACE - QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)QPACE - QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)
QPACE - QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)
Heiko Joerg Schick
 
Cell Today and Tomorrow - IBM Systems and Technology Group
Cell Today and Tomorrow - IBM Systems and Technology GroupCell Today and Tomorrow - IBM Systems and Technology Group
Cell Today and Tomorrow - IBM Systems and Technology Group
Slide_N
 
Information Retrieval, Applied Statistics and Mathematics onBigData - German ...
Information Retrieval, Applied Statistics and Mathematics onBigData - German ...Information Retrieval, Applied Statistics and Mathematics onBigData - German ...
Information Retrieval, Applied Statistics and Mathematics onBigData - German ...
Romeo Kienzler
 

Similar to Edge Chasing Delayed Consistency: Pushing the Limits of Weak Memory Models (20)

Prof. Uri Weiser,Technion
Prof. Uri Weiser,TechnionProf. Uri Weiser,Technion
Prof. Uri Weiser,Technion
 
QPACE - QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)
QPACE - QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)QPACE - QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)
QPACE - QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)
 
PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...
PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...
PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...
 
Optimizing LAMPhp Applications
Optimizing LAMPhp ApplicationsOptimizing LAMPhp Applications
Optimizing LAMPhp Applications
 
A zoom on membase vng
A zoom on membase vngA zoom on membase vng
A zoom on membase vng
 
Cell Today and Tomorrow - IBM Systems and Technology Group
Cell Today and Tomorrow - IBM Systems and Technology GroupCell Today and Tomorrow - IBM Systems and Technology Group
Cell Today and Tomorrow - IBM Systems and Technology Group
 
Back to The Future V
Back to The Future VBack to The Future V
Back to The Future V
 
The Berkeley View on the Parallel Computing Landscape
The Berkeley View on the Parallel Computing LandscapeThe Berkeley View on the Parallel Computing Landscape
The Berkeley View on the Parallel Computing Landscape
 
SQL? NoSQL? NewSQL?!? What’s a Java developer to do? - JDC2012 Cairo, Egypt
SQL? NoSQL? NewSQL?!? What’s a Java developer to do? - JDC2012 Cairo, EgyptSQL? NoSQL? NewSQL?!? What’s a Java developer to do? - JDC2012 Cairo, Egypt
SQL? NoSQL? NewSQL?!? What’s a Java developer to do? - JDC2012 Cairo, Egypt
 
Information Retrieval, Applied Statistics and Mathematics onBigData - German ...
Information Retrieval, Applied Statistics and Mathematics onBigData - German ...Information Retrieval, Applied Statistics and Mathematics onBigData - German ...
Information Retrieval, Applied Statistics and Mathematics onBigData - German ...
 
Z109889 z4 r-storage-dfsms-vegas-v1910b
Z109889 z4 r-storage-dfsms-vegas-v1910bZ109889 z4 r-storage-dfsms-vegas-v1910b
Z109889 z4 r-storage-dfsms-vegas-v1910b
 
Industrial trends in heterogeneous and esoteric compute
Industrial trends in heterogeneous and esoteric computeIndustrial trends in heterogeneous and esoteric compute
Industrial trends in heterogeneous and esoteric compute
 
RDBMS vs NoSQL
RDBMS vs NoSQLRDBMS vs NoSQL
RDBMS vs NoSQL
 
B9 cmis
B9 cmisB9 cmis
B9 cmis
 
Distributing your pandas ETL job using Modin and Ray.pdf
Distributing your pandas ETL job using Modin and Ray.pdfDistributing your pandas ETL job using Modin and Ray.pdf
Distributing your pandas ETL job using Modin and Ray.pdf
 
EMC Isilon Database Converged deck
EMC Isilon Database Converged deckEMC Isilon Database Converged deck
EMC Isilon Database Converged deck
 
lec0Intro.ppt
lec0Intro.pptlec0Intro.ppt
lec0Intro.ppt
 
RedisConf17 - Redis Enterprise on IBM Power Systems
RedisConf17 - Redis Enterprise on IBM Power SystemsRedisConf17 - Redis Enterprise on IBM Power Systems
RedisConf17 - Redis Enterprise on IBM Power Systems
 
Oracle Database Migration to Oracle Cloud Infrastructure
Oracle Database Migration to Oracle Cloud InfrastructureOracle Database Migration to Oracle Cloud Infrastructure
Oracle Database Migration to Oracle Cloud Infrastructure
 
Fog Cloud Caching at Network Edge via Local Hardware Awareness Spaces
Fog Cloud Caching at Network Edge via Local Hardware Awareness SpacesFog Cloud Caching at Network Edge via Local Hardware Awareness Spaces
Fog Cloud Caching at Network Edge via Local Hardware Awareness Spaces
 

Recently uploaded

EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
Earley Information Science
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
giselly40
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 

Recently uploaded (20)

08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 

Edge Chasing Delayed Consistency: Pushing the Limits of Weak Memory Models

  • 1. IBM T.J. Watson Research Center RACES’12 Oct 21, 2012 © 2012 IBM Corporation Edge Chasing Delayed Consistency: Pushing the Limits of Weak Memory Models Harold “Trey” Cain IBM T.J. Watson Research Center Prof. Mikko H. Lipasti University of Wisconsin
  • 2. IBM Research © 2012 IBM Corporation 2 Cain and Lipasti Oct 21, 2012 Gotta go back in time!  Part of Ph.D. Dissertation – Never submitted for publication, until now. – Looked particularly relevant when I saw the RACES CFP.  Journey back in time to the year 2004, when… – … Mark Zuckerberg launched Facebook – … Janet Jackson suffered a “wardrobe malfunction” during the Superbowl halftime show – … an incumbent president was being challenged by a Massachusetts politician  88mph here we come!
  • 3. IBM Research © 2012 IBM Corporation 3 Cain and Lipasti Oct 21, 2012 Edge Chasing Delayed Consistency: Pushing the Limits of Weak Ordering  From the RACES website: – “an approach towards scalability that reduces synchronization requirements drastically, possibly to the point of discarding them altogether.”  A hardware developer’s perspective: – Constraints of Legacy Code • What if we want to apply this principle, but have no control over the applications that are running on a system? – Can one build a coherence protocol that avoids synchronizing cores as much as possible? • For example by allowing each core to use stale versions of cache lines as long as possible • While maintaining architectural correctness; i.e. we will not break existing code • If we do that, what will happen?
  • 4. Cache-coherent shared-memory multiprocessors
      ▪ Are ubiquitous
      ▪ Coherence misses are a major source of performance loss for shared-memory applications
      [Images: 10 years ago vs. today]
  • 5. 16 MB L3 cache misses per 1000 instructions
      [Chart]
  • 6. Edge-Chasing Delayed Consistency (ECDC)
      ▪ A new hardware implementation of POWER weak ordering
          – Not a new consistency model
      ▪ Allows a cache line to be non-speculatively read after being invalidated.
      ▪ Based on necessary conditions
          – Processor must fetch new data only if causally dependent on it.
  • 7. Constraint graph
      ▪ Introduced for SC by Landin et al., ISCA-18
      ▪ Directed graph represents a multithreaded execution
          – Nodes represent dynamic instances of instructions
          – Edges represent their transitive orders (program order, RAW, WAW, WAR).
      ▪ If the constraint graph is acyclic, then the execution is correct
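The acyclicity condition can be checked mechanically. The Python sketch below is illustrative only and is not taken from the talk: `has_cycle` and the four-operation edge list are hypothetical, and the example assumes a message-passing style execution in which Proc 2 reads the new value of B but the old value of A.

```python
from collections import defaultdict


def has_cycle(edges):
    """Return True if the directed graph given as (src, dst) pairs has a cycle."""
    graph = defaultdict(list)
    nodes = set()
    for src, dst in edges:
        graph[src].append(dst)
        nodes.update((src, dst))

    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the DFS stack, finished
    color = {n: WHITE for n in nodes}

    def visit(n):
        color[n] = GREY
        for m in graph[n]:
            if color[m] == GREY:  # back edge: a cycle exists
                return True
            if color[m] == WHITE and visit(m):
                return True
        color[n] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in nodes)


# Assumed execution (hypothetical): Proc 1 runs ST A; ST B. Proc 2 runs LD B; LD A.
sc_edges = [
    ("P1:ST A", "P1:ST B"),  # program order on Proc 1
    ("P1:ST B", "P2:LD B"),  # RAW: Proc 2 reads the new B
    ("P2:LD B", "P2:LD A"),  # program order on Proc 2
    ("P2:LD A", "P1:ST A"),  # WAR: Proc 2 read the old A
]
print(has_cycle(sc_edges))
```

Dropping the WAR edge (that is, letting Proc 2 read the new A) removes the cycle, which is the case where the execution is a legal one under this formalism.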
  • 8. Constraint graph example - WO
      [Diagram: two-processor constraint graph (Proc 1, Proc 2) over LD A, ST A, LD B, ST B and a memory barrier (MB) on each processor, with edges numbered 1 to 5. Edge types: ST->MB order, LD->MB order, MB->ST order, MB->LD order, write-after-read dependence order, read-after-write dependence order]
      Observation: An aggressive coherence protocol can ignore coherence messages unless doing so will create a cycle in the constraint graph
  • 9. Edge-chasing delayed consistency
      ▪ Based on edge-chasing algorithms used by distributed database systems for deadlock detection
      [Diagram: processes P1, P2, P3, P4 in a wait-for graph (WFG). “Wham-O!”: a cycle in the WFG is detected when a locally created probe is received]
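The edge-chasing idea can be sketched in a few lines. The Python below is a simplified, centralized simulation written for illustration, not the algorithm from any particular database system: the function name `probe_returns` and the wait-for dictionaries are hypothetical, and the per-process message passing is collapsed into a worklist.

```python
def probe_returns(waits_for, initiator):
    """Simulate a probe forwarded along wait-for edges.

    waits_for maps each process to the processes it is waiting on.
    Returns True if the probe created by `initiator` comes back to it,
    which means the initiator sits on a cycle in the wait-for graph.
    """
    forwarded = set()  # processes that have already forwarded the probe
    in_flight = list(waits_for.get(initiator, []))
    while in_flight:
        receiver = in_flight.pop()
        if receiver == initiator:
            return True  # locally created probe received: cycle detected
        if receiver in forwarded:
            continue  # each process forwards a given probe at most once
        forwarded.add(receiver)
        in_flight.extend(waits_for.get(receiver, []))
    return False


# Hypothetical wait-for graphs over four processes.
deadlocked = {"P1": ["P2"], "P2": ["P3"], "P3": ["P4"], "P4": ["P1"]}
not_deadlocked = {"P1": ["P2"], "P2": ["P3"], "P3": ["P4"], "P4": []}
print(probe_returns(deadlocked, "P1"), probe_returns(not_deadlocked, "P1"))
```

The appeal of the scheme is that no process needs a global view of the graph: each one only forwards probes it receives and recognizes its own.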
  • 10. ECDC - Basic idea
      ▪ Observation: Cycles in constraint graph can be detected using a similar mechanism
      ▪ Protocol:
          – Upon write miss, create a “probe”
          – Upon receipt of invalidation, add probe to cache line
              • Continue to read stale block until the probe is re-observed on another message
          – Pass probe to other processors at communication
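The receiver side of this protocol can be pictured with a toy model. The Python sketch below is an assumption-laden illustration, not the hardware design from the talk: the class `ToyCache`, its method names, and the string probe identifiers are all hypothetical, and it models only a single cache with no timing or coherence states.

```python
class ToyCache:
    """Toy model of one core's cache under the ECDC idea (illustrative only)."""

    def __init__(self):
        self.data = {}        # address -> cached value
        self.supplanter = {}  # address -> probe attached by an invalidation

    def fill(self, address, value):
        """Install a fresh copy of a line; it is no longer stale."""
        self.data[address] = value
        self.supplanter.pop(address, None)

    def on_invalidation(self, address, probe):
        """Keep the stale value readable, but remember the supplanting probe."""
        if address in self.data:
            self.supplanter[address] = probe

    def on_message(self, probes):
        """Drop every stale line whose supplanting probe is re-observed."""
        for address, probe in list(self.supplanter.items()):
            if probe in probes:
                del self.data[address]
                del self.supplanter[address]

    def read(self, address):
        """Return the (possibly stale) cached value, or None on a miss."""
        return self.data.get(address)


cache = ToyCache()
cache.fill("A", 1)
cache.on_invalidation("A", "probe-7")  # another core wrote A
stale_value = cache.read("A")          # still served from the stale copy
cache.on_message({"probe-3"})          # unrelated probe: nothing changes
still_stale = cache.read("A")
cache.on_message({"probe-7"})          # supplanting probe seen again
after_probe = cache.read("A")          # now a miss
print(stale_value, still_stale, after_probe)
```

In this model a read of the stale line becomes a miss only once the probe that supplanted it arrives on some other message, which is the point at which the reader has become causally dependent on the newer write.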
  • 11. Example – necessary miss (SC)
      [Diagram: Proc 1 and Proc 2 executing LD A, ST B, LD B, ST A, LD A, with RAW and WAR edges. Annotations: “Line A is in proc 1’s cache, valid bit = 1”, then “Line A is in proc 1’s cache, valid bit = 0”; Supplanter ProbeA = RAW]
  • 12. Detecting critical writes
      ▪ Some write values shouldn’t be delayed (e.g. lock releases, barriers, etc.)
      ▪ Two heuristics
          – Atomic primitives: any cache block that has been touched by a store-conditional should not be delayed
          – Polling detection: if consecutive cache accesses have same PC and address, discard stale line
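The polling-detection heuristic amounts to remembering the previous access. A minimal Python sketch follows, under the assumption that one (PC, address) pair of history is enough; the class `PollingDetector` and its `access` method are hypothetical names, not part of the described hardware.

```python
class PollingDetector:
    """Flags an access that repeats the previous (PC, address) pair."""

    def __init__(self):
        self.last_access = None  # (pc, address) of the previous cache access

    def access(self, pc, address):
        """Record an access; return True if it looks like polling."""
        current = (pc, address)
        is_polling = current == self.last_access
        self.last_access = current
        return is_polling


detector = PollingDetector()
first = detector.access(pc=0x4000, address=0x80)   # nothing to compare with yet
second = detector.access(pc=0x4000, address=0x80)  # same PC and address: polling
third = detector.access(pc=0x4008, address=0x80)   # different PC: not polling
print(first, second, third)
```

When `access` returns True for a stale line, the heuristic on the slide would discard that line so the spinning load fetches the current value.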
  • 13. Performance Evaluation
      ▪ PHARMsim: cycle-mode full-system simulator
          – Based on SimpleMP, including Sun Gigaplane-like snooping coherence protocol [Rajwar], within the SimOS-PPC full-system simulator
          – Out-of-order single-threaded core
          – 32k DM L1 icache (1), 32k DM L1 dcache (1), 256K 8-way L2 (7), 8MB 8-way L3 (15), 64 byte cache lines
          – Memory (400 cycle/100 ns best-case latency, 10 GB/s BW based on 5 GHz clock)
          – Stride-based prefetcher modeled after Power4
      ▪ Lock-free list insertion microbenchmark
      ▪ Full applications
          – SPLASH2: fft, fmm, ocean, radix, raytrace
          – Commercial: DB2/TPC-B, DB2/TPC-H, SPECjbb2000, SPECweb99
  • 14. Why delayed consistency?
      ▪ False sharing/silent sharing
      ▪ Convergent/data-race tolerant algorithms
          – Genetic algorithms
          – Parallel equation solvers
          – Sparse matrix factorization
      ▪ Lock-free parallel linked data structures
  • 15. Lock-free Algorithms
      ▪ For example list insertion:
          – New node’s next pointer set to cur
          – CAS operation atomically updates prev’s next pointer to new
      ▪ Increasingly common
      [Diagram: list nodes prev, cur, new]
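The two insertion steps can be sketched as follows. Python has no hardware compare-and-swap, so this illustration assumes a small per-node lock standing in for the atomic instruction; `Node`, `cas_next`, and `insert_after` are hypothetical names, and the sketch ignores deletion and memory reclamation entirely.

```python
import threading


class Node:
    def __init__(self, value, next_node=None):
        self.value = value
        self.next = next_node
        self._lock = threading.Lock()  # stands in for hardware CAS atomicity

    def cas_next(self, expected, new):
        """Set self.next to `new` only if it still equals `expected`."""
        with self._lock:
            if self.next is expected:
                self.next = new
                return True
            return False


def insert_after(prev, value):
    """Insert a new node right after `prev`, retrying if the CAS fails."""
    while True:
        cur = prev.next                    # snapshot the current successor
        new = Node(value, next_node=cur)   # step 1: new.next = cur
        if prev.cas_next(cur, new):        # step 2: CAS prev.next from cur to new
            return new
        # Another thread changed prev.next in the meantime; try again.


head = Node("head")
workers = [threading.Thread(target=insert_after, args=(head, i)) for i in range(8)]
for w in workers:
    w.start()
for w in workers:
    w.join()

values = []
node = head.next
while node is not None:
    values.append(node.value)
    node = node.next
print(sorted(values))
```

The retry loop is what makes such structures tolerant of stale data: a thread that acted on an old view of `prev.next` simply fails its CAS and tries again.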
  • 16. Prior work (Delayed consistency)
      ▪ Invalidate-based receiver-delayed protocols, sender-delayed protocols (Dubois et al., SC ’91)
      ▪ Lazy release consistency (Keleher et al., ISCA ’92)
      ▪ Update-based receiver-delayed, sender-delayed protocols (Afek et al., TOPLAS ’93)
      ▪ Tear-off blocks in DSI (Lebeck and Wood, ISCA ’95)
      ▪ Write cache for reducing bandwidth in update coherence protocol (Dahlgren and Stenstrom, JPDC ’95)
  • 17. Lock-free list microbenchmark
      [Chart: cycles/search (0 to 18000) vs. % updates (0 to 100); series base-1000, ecdc-1000, base-100, ecdc-100, base-10, ecdc-10]
      ▪ Based on hazard-pointer lock-free list maintenance algorithm [Michael, PODC ’02]
      ▪ 15 threads randomly updating or searching linked list, 1 thread performing searches
  • 18. Intolerable miss reduction
      [Chart. Bars, left to right: a) baseline, b) ECDC base, c) ECDC merged read/write sets, d) ECDC scalar probe set]
  • 19. ECDC Performance (Infinite resources)
      [Chart]
  • 20. Conclusions
      ▪ Of nine applications studied, performance improvement for two
          – Mostly due to reduction in false sharing misses
      ▪ Other applications:
          – Not enough coherence misses, or
          – The avoidance of those misses does not improve performance
      ▪ We believe these results generalize to lock-based programs
      ▪ Other programming models may have potential
          – As shown, lock-free data structures
              • Should also apply to transactional programming model
          – But beware, “Premature Optimization is the Root of All Evil” (Donald Knuth)
          – Best to identify apps with a communication bottleneck before attacking
  • 21. Questions?
  • 22. Backup slides
  • 23. Base machine model (PHARMsim)
      ▪ Simulator: based on SimpleMP, including Sun Gigaplane-like snooping coherence protocol [Rajwar], within the SimOS-PPC full-system simulator
      ▪ Out-of-order execution core: 15-stage, 8-wide pipeline; 256 entry reorder buffer; 128 entry load/store queue; 32 entry issue queue
      ▪ Functional units (latency): 8 Int ALUs (1), 3 Int MULT/DIV (3/12), 4 FP ALUs (4), 4 FP MULT/DIV (4,4), 4 L1 Dcache load ports in OoO window, 1 L1 Dcache load/store port at commit
      ▪ Front-end: combined bimodal (16k entry)/gshare (16k entry) branch predictor with 16k entry selection table, 64 entry RAS, 8k entry 4-way BTB
      ▪ Memory system (latency): 32k DM L1 icache (1), 32k DM L1 dcache (1), 256K 8-way L2 (7), 8MB 8-way L3 (15), 64 byte cache lines; memory (400 cycle/100 ns best-case latency, 10 GB/s BW based on 5 GHz clock); stride-based prefetcher modeled after Power4
  • 24. Causality (Lamport)
      ▪ An instruction i is causally dependent upon instruction j if there is a directed path from j to i
      ▪ Two operations are concurrent if neither causally depends upon the other
      ▪ Coherence misses are a significant source of performance degradation for many applications
      ▪ If two operations are concurrent, why is their performance penalized?
      [Diagram: timeline of operations on P1, P2, P3: st A, st C, ld A, st B, ld C, ld B, ld A]
  • 25. Prior work: formal memory model representations
      ▪ Local, WRT, global “performance” of memory ops (Dubois et al., ISCA-13)
      ▪ Acyclic graph representation (Landin et al., ISCA-18)
      ▪ Modeling memory operation as a series of sub-operations (Collier, RAPA)
      ▪ Acyclic graph + sub-operations (Adve, thesis)
      ▪ Initiation event, for modeling early store-to-load forwarding (Gharachorloo, thesis)
  • 26. Anatomy of a cycle
      [Diagram: Proc 1 and Proc 2 executing ST A, LD A, ST B, LD B, connected by two program order edges, a WAR edge, and a RAW edge. Annotations: incoming invalidate, cache miss]
  • 27. Other prior work
      ▪ Speculative stale value usage
          – LVP with Stale Values (Lepak, Ph.D. Thesis ’03)
          – Coherence Decoupling (Huh et al., ASPLOS ’04)
      ▪ Delayed RFO response to improve synchronization throughput (Rajwar et al., HPCA ’00)
  • 28. Constraint graph extensions
      ▪ Constraint graph definition differs for other consistency models
      ▪ Processor consistency
          – Remove program order edges from stores to subsequent loads
          – Remaining single-thread orders: edges from
              • Loads to subsequent loads
              • Stores to subsequent stores
              • Loads to subsequent stores
  • 29. Constraint graph extensions
      ▪ Constraint graph definition differs for other consistency models
      ▪ Weak ordering
          – Remove program order edges
          – Add single-thread ordering edges between
              • memory barrier and preceding/following instructions
              • same address reads/writes
              • dependent instructions
  • 30. PC Example – Dekker’s Alg.
      [Diagram: Proc 1 and Proc 2 executing ST A, ST B, LD B, LD A, with edges numbered 1 to 4: two program order edges and write-after-read dependence order edges]
      Lack of store-to-load order results in acyclic graph
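The effect of dropping store-to-load program order can be reproduced in a small experiment. The Python sketch below is an illustration under stated assumptions: the thread contents follow the usual Dekker pattern (each processor stores its own flag, then loads the other's), both loads are assumed to return the old values, same-address ordering is ignored, and `program_order_edges`, `is_acyclic`, and `dekker_graph` are hypothetical helper names.

```python
def program_order_edges(proc, ops, model):
    """Single-thread ordering edges for `ops`, a list of (kind, address) pairs.

    Under "SC" every earlier op is ordered before every later op.
    Under "PC" the edges from a store to a later load are removed.
    """
    edges = []
    for i in range(len(ops)):
        for j in range(i + 1, len(ops)):
            store_to_load = ops[i][0] == "ST" and ops[j][0] == "LD"
            if model == "PC" and store_to_load:
                continue
            edges.append(((proc, i), (proc, j)))
    return edges


def is_acyclic(edges):
    """Kahn's algorithm: the graph is acyclic if every node can be removed."""
    nodes = {n for edge in edges for n in edge}
    indegree = {n: 0 for n in nodes}
    for _, dst in edges:
        indegree[dst] += 1
    ready = [n for n in nodes if indegree[n] == 0]
    removed = 0
    while ready:
        n = ready.pop()
        removed += 1
        for src, dst in edges:
            if src == n:
                indegree[dst] -= 1
                if indegree[dst] == 0:
                    ready.append(dst)
    return removed == len(nodes)


P1_OPS = [("ST", "A"), ("LD", "B")]
P2_OPS = [("ST", "B"), ("LD", "A")]
# Both loads see the old values, so each load precedes the other
# processor's store (write-after-read edges).
WAR_EDGES = [(("P1", 1), ("P2", 0)), (("P2", 1), ("P1", 0))]


def dekker_graph(model):
    return (program_order_edges("P1", P1_OPS, model)
            + program_order_edges("P2", P2_OPS, model)
            + WAR_EDGES)


print(is_acyclic(dekker_graph("SC")), is_acyclic(dekker_graph("PC")))
```

Under the SC edge rules this outcome forms a cycle and is rejected, while under the PC rules the same outcome leaves an acyclic graph and is therefore allowed.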
  • 31. Constraint graph example - SC
      [Diagram: Proc 1 and Proc 2 executing ST A, LD A, ST B, LD B, with edges numbered 1 to 4: two program order edges, a write-after-read dependence order edge, and a read-after-write dependence order edge]
      Cycle indicates that execution is incorrect
  • 32. Constraint graph example - PC
      [Diagram: Proc 1 and Proc 2 executing ST A, LD B, ST B, LD A, with edges numbered 1 to 4: two program order edges, a write-after-read dependence order edge, and a read-after-write dependence order edge]
  • 33. ECDC Conceptual Description
      ▪ Identify causal dependences (upstream probe sets)
          – 1 upstream set per processor
          – 2 upstream sets per cache block (read set, write set)
      ▪ Communicating dependences
          – Probe sets passed on response messages
          – Probes attached to incoming invalidation messages
          – Extra ProbePropagation messages sent at memory barriers
      ▪ Identifying usable stale blocks
          – Extra stable state in cache (ST)
          – Supplanter probe
  • 34. ECDC Operation
      [Table: contents of the processor upstream set (Фproc upstream) and of the per-block read|write sets Ф(read|write)A and Ф(read|write)B, shown initially and after each of: 1. ld A, 2. st A, 3. ld B, 4. st B, 5. ld C]
  • 35. Finite ECDC Performance
      ▪ When restricting PPB/STAB resources (220 KB per processor)
          – 16k probe lifetime counter
          – 128 entry STAB per processor
          – 32 entry PPB per processor/directory controller (256 PPB virtual namespace)
      ▪ TPC-H/SPECweb99 performance within margin of error to infinite resources
  • 36. Non-atomicity of writes
      ▪ Absent from model
      ▪ Effect on optimizations
          – Forces unnecessary orders to exist
          – Correct, but another example of over-conservatism
      ▪ Hopefully, infrequent performance divot
      Example:
          Processor p1:  st r1, [A]
          Processor p2:  ld r1, [A];  st r2, [r1]
          Processor p3:  ld r1, [B];  membar;  ld r2, [A]
  • 37. ECDC Base machine model (PHARMsim)
      ▪ Simulator: based on SimpleMP, including Sun Gigaplane-like snooping coherence protocol [Rajwar], within the SimOS-PPC full-system simulator
      ▪ Out-of-order execution core: 15-stage, 8-wide pipeline; 256 entry reorder buffer; 128 entry load/store queue; 32 entry issue queue
      ▪ Functional units (latency): 8 Int ALUs (1), 3 Int MULT/DIV (3/12), 4 FP ALUs (4), 4 FP MULT/DIV (4,4), 4 L1 Dcache load ports in OoO window, 1 L1 Dcache load/store port at commit
      ▪ Front-end: combined bimodal (16k entry)/gshare (16k entry) branch predictor with 16k entry selection table, 64 entry RAS, 8k entry 4-way BTB
      ▪ Cache hierarchy (latency): 32k DM L1 icache (1), 32k DM L1 dcache (1), 256K 8-way L2 (7), 16 MB 8-way L3 (15), 128 byte cache lines; stride-based prefetcher modeled after Power4
      ▪ Memory system (latency): 2-D static DOR routed torus interconnect, 60 cycle per link+route (40 GB/s bandwidth per link, 5 GHz clock); memory (400 cycle best-case latency, 10 GB/s bandwidth)
  • 38. Mapping ECDC to HW
      ▪ STAB
          – Maintains supplanting probe for each stale cache block
      ▪ PPB
          – Maintains approximation of upstream sets
      ▪ In caches
          – 2 extra bits for stale state and synch heuristic
      [Diagram: node block diagram with processor (P), I$, D$, L2 $, STAB, PPB, castout PPB, NIC, directory (Dir), memory controller (MemCtr), DRAM]
  • 39. Probe representation
      ▪ Each probe represented by n-bit timer
      ▪ Stale block may be used until supplanting probe timer expires
      ▪ Probe set in p-processor system represented by p timers
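One way to picture the timer representation is a saturating countdown. The slide does not say how the timers count, so the Python sketch below assumes a down-counter that expires at zero; `ProbeTimer`, `tick`, `stale_block_usable`, and `new_probe_set` are hypothetical names used only for illustration.

```python
class ProbeTimer:
    """An n-bit countdown timer standing in for one probe (assumed design)."""

    def __init__(self, n_bits, remaining=None):
        self.max_value = (1 << n_bits) - 1  # largest value an n-bit field holds
        if remaining is None:
            remaining = self.max_value
        self.remaining = min(remaining, self.max_value)

    def tick(self, cycles=1):
        """Advance time; the timer saturates at zero instead of wrapping."""
        self.remaining = max(0, self.remaining - cycles)

    @property
    def expired(self):
        return self.remaining == 0


def stale_block_usable(supplanting_timer):
    """A stale block may be read until its supplanting probe's timer expires."""
    return not supplanting_timer.expired


def new_probe_set(num_processors, n_bits):
    """A probe set for a p-processor system: one timer per processor."""
    return [ProbeTimer(n_bits) for _ in range(num_processors)]


timer = ProbeTimer(n_bits=4)            # counts down from 15
usable_at_start = stale_block_usable(timer)
timer.tick(14)
usable_near_end = stale_block_usable(timer)
timer.tick(1)
usable_after_expiry = stale_block_usable(timer)
print(usable_at_start, usable_near_end, usable_after_expiry)
```

Representing a probe as a timer trades precision for cost: a stale block is given up after a bounded time even if no cycle would have formed, but the storage per probe is only n bits.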
  • 40. STAB Detail
      [Diagram: STAB entries pairing block addresses with probe timers, filled by incoming invalidates from the cache and compared against per-processor counters p1, p2, p3]
  • 41. PPB Detail
      [Diagram: address hash indexing a timer index table; rows of shift register/probe timers; incoming upstream set and expired upstream set]
  • 42. Memory consistency review
      ▪ Memory consistency model
          – Specifies the programming interface to a shared memory
          – i.e. the allowable interleaving of instructions
      ▪ Models discussed here:
          – Sequential Consistency
          – Processor Consistency
              • No store-to-load program order
          – Weak Ordering
              • Order wrt memory barriers
              • Same-address order
              • Dependence order
  • 43. Example – necessary miss (SC)
      [Diagram: Proc 1 and Proc 2 executing LD A, ST B, LD B, ST A, LD A, with RAW, WAR, and three program order (PO) edges. Annotations: “Block A is in proc 1’s cache, valid bit = 1”, then “Block A is in proc 1’s cache, valid bit = 0”]
  • 44. Example – avoidable miss (SC)
      [Diagram: Proc 1 and Proc 2 executing LD A, ST B, LD B, ST A, LD A, with RAW, WAR, and three program order (PO) edges. Annotations: “Block A is in proc 1’s cache, valid bit = 1”, then “Block A is in proc 1’s cache, valid bit = 0”]
  • 45. Typical ReadX transaction
      ▪ When sending invalidation, create probe, add to PPB
      ▪ At receipt of invalidation (2b, 2c) add probe to STAB
      ▪ When sending invalidate acknowledgment, add probe set to the response
      ▪ When receiving invalidate acknowledgment, add incoming probe set to the PPB
      [Diagram: nodes R, H, S1, S2 exchanging messages 1. ReadX, 2(a) Sharers/Data, 2(b) Inval, 2(c) Inval, 3(a) Inval Ack, 3(b) Inval Ack]
  • 46. Invalidation to read distance
      [Chart: % of load coherence misses (0% to 100%) vs. cycles (log scale, 1 to 1E+09); series fft, fmm, ocean, radix, raytrace, SPECjbb2000, SPECweb99, TPC-B, TPC-H]
  • 47. Invalidation to read distance (synch)
      [Chart: % of load coherence misses (0% to 100%) vs. cycles (log scale, 1 to 1E+09); series fft, fmm, ocean, radix, raytrace, SPECjbb2000, SPECweb99, TPC-B, TPC-H]
  • 48. Invalidation to read distance (data)
      [Chart: % of load coherence misses (0% to 100%) vs. cycles (log scale, 1 to 1E+09); series fft, fmm, ocean, radix, raytrace, SPECjbb2000, SPECweb99, TPC-B, TPC-H]
  • 49. STAB entry death cdf
      [Chart: fraction of STAB entries deallocated (0 to 1) vs. cycles (log scale, 1 to 1000000); series fft, fmm, ocean, radix, raytrace, SPECjbb2000, SPECweb99, TPC-B, TPC-H]
  • 50. STAB Entry Lifetime
  • 51. ECDC performance (16k probe lifetime)
  • 52. ECDC Perf (128 entry STAB, 32 entry PPB, 256 entry namespace)
  • 53. ProbePropagation messages
  • 54. ECDC Storage Overhead
      [Chart: storage (KB, 0 to 350) vs. processor count (4p to 1024p)]
  • 55. What about limit study?
      ▪ Indicated a larger number of avoidable coherence misses
      ▪ Reasons:
          – Did not account for non-speculative nature of protocol (oracle ECDC could be better)
          – Inaccurate measurement of critical writes
              • Many loads perform polling to lines that have never been touched by a load-linked or store-conditional
          – Used isolated stale data detection mechanism
  • 56. What about speculative load squashes?
      ▪ In a few applications, they occur frequently (SPECjbb2000, TPC-H)
      ▪ Implemented/evaluated read-set-tracking w/ squash on miss
      ▪ Could eliminate a large fraction of squashes
          – Unfortunately, little performance improvement
          – Presumably, many squashes caused by contended spin locks
  • 57. ECDC and other consistency models
      ▪ Stricter model => more ProbePropagation messages
      ▪ Potential for release consistency
      ▪ In SC/PC/TSO, ECDC benefits will probably be dominated by extra ProbePropagation messages
  • 58. Cause of STAB entry deallocation
  • 59. Publications
      ▪ [ISCA ’04] Memory ordering: A Value-based approach.
          – Selected for IEEE Micro Top Picks ’04
      ▪ [PACT ’03] Constraint Graph Analysis of Multithreaded Programs.
          – Selected for Best of PACT JILP Issue
      ▪ [PACT ’03] Redeeming IPC as a Performance Metric for Multithreaded Programs.
      ▪ [CAECW ’02] Precise and Accurate Processor Simulation
      ▪ [SPAA Revue ’02] Verifying Sequential Consistency Using Vector Clocks.
      ▪ [Micro ’01] Correctly Implementing Value Prediction in Microprocessors that Support Multithreading or Multiprocessing.
      ▪ [WBT ’01] A Dynamic Binary Translation Approach to Architectural Simulation
      ▪ [HPCA ’01] An Architectural Characterization of Java TPC-W.
      ▪ [Euro-Par ’00] A Callgraph-Based Search Strategy for Automated Performance Diagnosis.
          – Selected as distinguished paper
      ▪ [CAECW ’00] Characterizing a Java Implementation of TPC-W

Editor's Notes

  1. OK, so giving this talk is kind of a blast from the past for me. I actually did this work with Mikko Lipasti, who was my advisor at the time, when I was finishing up my PhD in 2004. We never presented it outside of my defense, after which I started my job at IBM and my research directions shifted gears. Since this was way back in 2004, if you don’t mind I just wanted to jog my memory a little bit at the outset of the talk, so like Marty McFly in Back to the Future I’m going to take a journey back to some highlights of 2004. So, this was the year that Mark Zuckerberg launched Facebook. Boy, I can barely remember life before Facebook. But then, it was also the year Janet Jackson suffered a wardrobe malfunction during the Super Bowl. How could anyone forget that? And lastly, it was the year that an incumbent president was being challenged by a Massachusetts politician in an election year. OK, so maybe things haven’t changed that much since then, both for politics as well as multiprocessor systems. So that was eight years ago, and some has changed but much has stayed the same. And when you think about it, relaxing synchronization is about as close as a programmer can come to time travel. Will I get the old value, or will I get the new value? Will I get what I expect, or will I get a wardrobe malfunction? OK, so pushing the gas pedal down to 88 mph, the flux capacitor is lit, and let’s go!
  2. So like I said in the lightning round, when I saw the CFP for RACES, I knew this would be a great place to share this prior work. Most of the discussion so far has been about software mechanisms for relaxing synchronization, but we were working within the constraints of a hardware developer. We were trying to achieve the same sort of scalable performance, while supporting legacy applications written to the PowerPC weakly ordered memory model, which we were unable to change. Given that constraint, the lever that we used was the hardware cache coherence protocol, where we attempted to avoid coherence misses by allowing a core to continue using stale data in its cache for as long as possible. So by avoiding coherence misses, hopefully we would be able to improve performance. We came up with a new implementation of the PowerPC weakly ordered memory model, which we called edge chasing delayed consistency.
  3. Not that I need to motivate the problem to this audience, but you know shared memory multiprocessors are proliferating everywhere that you look. While they used to be relegated to high-end servers, now many cell phones, TVs, game consoles and tablets are SMPs. And the performance of these SMPs suffers due to coherence misses, even for relatively small systems.
  4. This graph measures a 16 core system operating with a 16MB L3 cache per core. It shows the number of L3 misses per 1000 instructions, broken down by type, where the lower blue portion of the bar is the number of coherence misses. As you can see, it is a significant fraction across all of the workloads.
  5. So what I’m going to be describing is an optimized implementation of weak ordering called edge-chasing delayed consistency. This is not a new consistency model for the programmer; it is a new implementation of weak ordering that allows a cache line to continue being read after it has been invalidated by another processor. In fact, it is going to allow that cache line to be read until it is absolutely necessary that the core see a new version of the line, where the necessary conditions are dictated by the consistency model, and that time is when the reading processor is causally dependent upon the new value. It is going to continue reading the old data until it is necessary for it to observe the new data. That is, until it observes a value in a memory location that the invalidation of the stale block precedes in the happens-before relation; in other words, until it causally depends upon it.
  6. So we were interested in developing a coherence protocol that enforced the necessary conditions of a consistency model, not sufficient conditions. In order to do this, to really understand what is necessary, we relied on a formalism called a “constraint graph”, which many of you are probably aware of. [Describe the constraint graph.] The key thing about the constraint graph is that if it is acyclic, then the execution is correct. If it contains a cycle, then it is impossible to put the set of operations in a total order, and therefore the execution is incorrect.
  7. So we extended the definition of the constraint graph to weakly ordered systems, where instead of there being edges between every instruction executed by a single thread, there are only edges between instructions and memory barriers, as well as a few other edges corresponding to single-threaded data dependences.
  8. So edge-chasing delayed consistency derives its name from a class of deadlock detection algorithms that have been described for distributed database systems.
  9. With 30% updates, speedups of 2.74, 1.82, and 1.18 for these list lengths. With 100% updates, speedups of 3.11, 3.87, and 1.35 for these list lengths.
  10. Intolerable vs. tolerable misses. Bars, left to right. We expect ECDC to improve performance through reductions in false sharing misses and true sharing misses to data. As we can see from this chart, most of the reduction comes from misses to falsely shared data and misses to truly shared synchronization data. We do not believe that any of these applications exhibit the data-race tolerant quality of the lock-free list insertion microbenchmark or convergent iterative algorithms. Raytrace exhibits the most reduction: over 50 percent of all coherence misses can be tolerated using ECDC, however most of these are synchronization misses. Other applications that can use a significant amount of stale data are TPC-H, SPECweb99, and SPECjbb2000.
  11. So this graph shows the normalized execution time for three variants of the ECDC protocol, so lower is better, relative to a baseline coherence protocol. In terms of performance improvements for real applications, it is a little disappointing, around 4% for SPECweb99 and 7.5% for TPC-H. (don’t go back)
  12. So our conclusion after staring at the data for a while was that the two success stories were mostly benefiting from the false sharing reduction. For the other applications, either there weren’t enough coherence misses, or the avoidance of those misses does not improve performance. For example in the case of synchronization variables, you may be able to see the “locked” value for a little longer than you would otherwise. So instead of being stalled on a cache miss to retrieve the lock from the processor releasing a lock, you’re simply able to see the old value, and spin longer. It is unclear to us why one would expect results to be any different for applications that rely on lock-based synchronization. For other sorts of synchronization models, the story may be different: for example lock-free data structures like the linked list example we showed, or for the transactional programming model perhaps. So one final word of caution before concluding. While Hans says data races are pure evil, Donald Knuth has stated that premature optimization is the root of all evil. If you have a Barnes-Hut and a vision for attacking the problem, go for it; in other words, find your nail before inventing hammers.
  13. So, when I talk about causality and causal dependences, what do I mean by that?
  14. Ended at 16:00
  15. E.G. OoO processor
  16. E.G. OoO processor
  17. Ask Mikko
  18. Infrastructure issues with models weaker than weak ordering