Edge Chasing Delayed Consistency: Pushing the Limits of Weak Memory Models

IBM T.J. Watson Research Center
RACES’12 Oct 21, 2012 © 2012 IBM Corporation
Harold “Trey” Cain
IBM T.J. Watson Research Center
Prof. Mikko H. Lipasti
University of Wisconsin

IBM Research
Gotta go back in time!
 Part of Ph.D. Dissertation
– Never submitted for publication, until now.
– Looked particularly relevant when I saw the RACES CFP.
 Journey back in time to the year 2004, when…
– … Mark Zuckerberg launched Facebook
– … Janet Jackson suffered a “wardrobe malfunction”
during the Super Bowl halftime show
– … an incumbent president was being challenged by a
Massachusetts politician
 88mph here we come!

Edge Chasing Delayed Consistency: Pushing the Limits of
Weak Ordering
 From the RACES website:
– “an approach towards scalability that reduces synchronization
requirements drastically, possibly to the point of discarding them
altogether.”
 A hardware developer’s perspective:
– Constraints of Legacy Code
• What if we want to apply this principle, but have no control over the
applications that are running on a system?
– Can one build a coherence protocol that avoids synchronizing cores as
much as possible?
• For example by allowing each core to use stale versions of cache lines as
long as possible
• While maintaining architectural correctness; i.e. we will not break existing
code
• If we do that, what will happen?

Cache-Coherent Shared-memory multiprocessors
 Are ubiquitous
 Coherence misses are a major source of performance loss for
shared memory applications
[Figure: two images captioned "10 years ago" and "Today"]

[Figure: 16 MB L3 cache misses per 1000 instructions]

Edge-Chasing Delayed Consistency (ECDC)
 A new hardware implementation of POWER weak
ordering
– Not a new consistency model
 Allows a cache line to be non-speculatively read
after being invalidated.
 Based on necessary conditions
– Processor must fetch new data only if causally dependent
on it.

Constraint graph
 Introduced for SC by Landin et al., ISCA-18
 A directed graph that represents a multithreaded execution
– Nodes represent dynamic instances of instructions
– Edges represent their transitive orders (program order, RAW,
WAW, WAR).
 If the constraint graph is acyclic, then the execution is
correct
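
The acyclicity test above can be sketched in a few lines of Python. This is only an illustration: the function name `has_cycle`, the edge-list encoding, and the four-operation example execution are assumptions made here, not taken from the talk.

```python
from collections import defaultdict

def has_cycle(edges):
    """Return True if the directed graph given as (src, dst) pairs has a cycle."""
    graph = defaultdict(list)
    nodes = set()
    for src, dst in edges:
        graph[src].append(dst)
        nodes.update((src, dst))

    WHITE, GREY, BLACK = 0, 1, 2          # unvisited, on current DFS path, finished
    color = {n: WHITE for n in nodes}

    def visit(node):
        color[node] = GREY
        for nxt in graph[node]:
            if color[nxt] == GREY:        # back edge: we returned to the current path
                return True
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in nodes)

# Hypothetical SC execution: P1 runs ST A; ST B and P2 runs LD B; LD A.
# LD B reads the new B (RAW edge), while LD A still reads the old A (WAR edge).
cyclic = [
    ("ST A", "ST B"),   # program order on P1
    ("ST B", "LD B"),   # read-after-write
    ("LD B", "LD A"),   # program order on P2
    ("LD A", "ST A"),   # write-after-read
]
acyclic = cyclic[:-1]   # same execution without the WAR edge
```

With these inputs, `has_cycle(cyclic)` reports a cycle (an incorrect execution under SC), and `has_cycle(acyclic)` does not.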

Constraint graph example - WO
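[Figure: constraint graph for a weakly ordered execution on Proc 1 and Proc 2. Operations shown: LD A, LD B, ST A, ST B, and one MB per processor. Edge labels: ST->MB order, LD->MB order, MB->ST order, MB->LD order, write-after-read dependence order, read-after-write dependence order. Edges are numbered 1 to 5.]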
Observation: An aggressive coherence protocol can ignore coherence messages
unless doing so will create a cycle in the constraint graph

Edge-chasing delayed consistency
 Based on edge-chasing algorithms used by distributed
database systems for deadlock detection
[Figure: a probe passed among processes P1, P2, P3, and P4]
A cycle in the wait-for graph (WFG) is detected when a process receives a probe that it created itself.
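
As a rough illustration of edge chasing, the sketch below forwards a probe along wait-for edges and reports a cycle when the probe returns to its initiator. It is a centralized simulation written for clarity; the function name `detects_deadlock` and the dictionary encoding of the wait-for graph are assumptions, and a real distributed database would exchange the probe in messages between sites.

```python
def detects_deadlock(wait_for, initiator):
    """Forward a probe from `initiator` along wait-for edges.

    `wait_for` maps each process to the processes it is waiting on.
    Returns True if the probe arrives back at the initiator, meaning
    the initiator lies on a cycle in the wait-for graph.
    """
    pending = list(wait_for.get(initiator, []))
    seen = set()
    while pending:
        proc = pending.pop()
        if proc == initiator:          # locally created probe came back
            return True
        if proc in seen:               # probe already forwarded from here
            continue
        seen.add(proc)
        pending.extend(wait_for.get(proc, []))
    return False

# Hypothetical wait-for graphs over four processes.
ring = {"P1": ["P2"], "P2": ["P3"], "P3": ["P4"], "P4": ["P1"]}
chain = {"P1": ["P2"], "P2": ["P3"], "P3": ["P4"]}
```

Here `detects_deadlock(ring, "P1")` finds the cycle, while `detects_deadlock(chain, "P1")` does not.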

ECDC - Basic idea
 Observation: Cycles in constraint graph can be detected
using a similar mechanism
 Protocol:
– Upon write miss, create a “probe”
– Upon receipt of invalidation, add probe to cache line
• Continue to read stale block until the probe is re-observed on
another message
– Pass probes to other processors whenever processors communicate
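
The receiver side of this protocol might be modeled as below. The class `StaleLine` and its method names are hypothetical; the sketch only captures the rule that an invalidated line stays readable until its attached probe shows up again on a later message.

```python
class StaleLine:
    """A cache line that may keep serving reads after an invalidation (sketch)."""

    def __init__(self, data):
        self.data = data
        self.usable = True
        self.supplanter = None      # probe attached by the invalidation, if any

    def invalidate(self, probe):
        # Do not drop the data; remember which probe supplanted it.
        self.supplanter = probe

    def observe_message(self, probes):
        # If the supplanting probe is re-observed, a further stale read
        # could close a cycle, so the line must stop being used.
        if self.supplanter is not None and self.supplanter in probes:
            self.usable = False

    def read(self):
        # None stands for a coherence miss that must fetch new data.
        return self.data if self.usable else None
```

A line invalidated with probe `"p7"` keeps returning its old data through messages carrying unrelated probes and misses only after a message carrying `"p7"` arrives.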

Example – necessary miss (SC)
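[Figure: SC execution on Proc 1 and Proc 2 with operations LD A, ST B, LD B, ST A, LD A, connected by RAW and WAR edges. Annotations: "Line A is in proc 1’s cache, valid bit = 1"; "Line A is in proc 1’s ..."; "Supplanter ProbeA = ..."]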

Detecting critical writes
 Some write values shouldn’t be delayed (e.g. lock
releases, barriers, etc.)
 Two heuristics
– Atomic primitives – any cache block that has been
touched by a store-conditional should not be delayed
– Polling detection – If consecutive cache accesses have
same PC and address, discard stale line
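
A minimal sketch of the two heuristics follows. The class `CriticalWriteDetector` and its method names are made up for illustration, and the real hardware tracks this state per cache block rather than in Python sets.

```python
class CriticalWriteDetector:
    """Sketch of the two critical-write heuristics (names are hypothetical)."""

    def __init__(self):
        self.sc_touched = set()     # blocks touched by a store-conditional
        self.last_access = None     # (pc, address) of the previous cache access

    def note_store_conditional(self, block):
        self.sc_touched.add(block)

    def may_delay(self, block):
        # Atomic-primitive heuristic: never delay writes to such blocks.
        return block not in self.sc_touched

    def is_polling(self, pc, address):
        # Polling heuristic: two consecutive accesses with the same PC and
        # address suggest a spin loop, so the stale line should be discarded.
        polling = self.last_access == (pc, address)
        self.last_access = (pc, address)
        return polling
```

For example, after `note_store_conditional(0x40)`, `may_delay(0x40)` is false, and the second of two back-to-back calls to `is_polling(0x1000, 0x80)` reports polling.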

Performance Evaluation
 PHARMSim – Cycle-mode Full System Simulator
– Based on SimpleMP including Sun Gigaplane-like snooping coherence protocol [Rajwar], within
the SimOS-PPC full-system simulator
– Out-of-order single-threaded core
– 32k DM L1 icache (1), 32k DM L1 dcache (1), 256K 8-way L2 (7), 8MB 8-way L3 (15), 64 byte
cache lines
– Memory (400 cycle/100 ns best-case latency, 10 GB/s bandwidth, based on a 5 GHz clock)
– Stride-based prefetcher modeled after Power4
 Lock-free list insertion microbenchmark
 Full applications
– SPLASH2: fft, fmm, ocean, radix, raytrace
– Commercial: DB2/TPC-B, DB2/TPC-H, SPECjbb2000, SPECweb99

Why delayed consistency?
 False sharing/Silent sharing
 Convergent/Data-race tolerant algorithms
– Genetic algorithms
– Parallel equation solvers
– Sparse matrix factorization
 Lock-free parallel linked data structures

Lock-free Algorithms
 For example list insertion:
– New node’s next pointer set to cur
– CAS operation atomically updates prev’s next pointer to new
 Increasingly common
[Figure: linked list with nodes prev and cur; node new is inserted between them]
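
The insertion steps can be sketched as follows. Python has no hardware compare-and-swap, so the `cas_next` method below emulates one with a lock; that emulation, and the names `Node` and `insert_after`, are assumptions for illustration only.

```python
import threading

class Node:
    def __init__(self, value, next=None):
        self.value = value
        self.next = next
        self._lock = threading.Lock()   # stands in for hardware atomicity

    def cas_next(self, expected, new):
        """Atomically set self.next to `new` if it still equals `expected`."""
        with self._lock:
            if self.next is expected:
                self.next = new
                return True
            return False

def insert_after(prev, value):
    """Insert a new node directly after `prev`, retrying if the CAS fails."""
    while True:
        cur = prev.next
        new = Node(value, next=cur)     # new node's next pointer set to cur
        if prev.cas_next(cur, new):     # CAS swings prev's next pointer to new
            return new
```

If another thread changes `prev.next` between the read of `cur` and the CAS, the CAS fails and the loop retries with the fresh successor.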

Prior work (Delayed consistency)
 Invalidate-based receiver-delayed protocols, sender-delayed
protocols (Dubois et al., SC ’91)
 Lazy release consistency (Keleher et al., ISCA ’92)
 Update-based receiver-delayed, sender-delayed protocols
(Afek et al., TOPLAS ’93)
 Tear-off blocks in DSI (Lebeck and Wood, ISCA ’95)
 Write cache for reducing bandwidth in update coherence
protocol (Dahlgren and Stenstrom, JPDC ’95)

Lock-free list microbenchmark
[Figure: cycles/search (y-axis, 0 to 18000) versus % updates (x-axis, 0 to 100) for the series base-1000, ecdc-1000, base-100, ecdc-100, base-10, and ecdc-10]
 Based on hazard-pointer lock-free list maintenance algorithm [Michael, PODC ’02]
 15 threads randomly update or search the linked list; 1 thread performs searches

Intolerable miss reduction
Left to right: a) baseline, b) ECDC base,
c) ECDC merged read/write sets, d) ECDC scalar probe set

ECDC Performance (Infinite resources)

Conclusions
 Of nine applications studied, performance improvement for two
– Mostly due to reduction in false sharing misses
 Other applications:
– Not enough coherence misses, or
– The avoidance of those misses does not improve performance
 We believe these results generalize to lock-based programs
 Other programming models may have potential
– As shown, lock-free data structures
• Should also apply to transactional programming model
– But beware, “Premature Optimization is the Root of All Evil” – Donald Knuth
– Best to identify apps with a communication bottleneck before attacking

Questions?

Backup slides

Base machine model
Simulator: PHARMsim, based on SimpleMP including a Sun Gigaplane-like snooping coherence protocol [Rajwar], within the SimOS-PPC full-system simulator
Out-of-order execution core: 15-stage, 8-wide pipeline; 256-entry reorder buffer; 128-entry load/store queue; 32-entry issue queue
Functional units (latency): 8 Int ALUs (1), 3 Int MULT/DIV (3/12), 4 FP ALUs (4), 4 FP MULT/DIV (4,4); 4 L1 Dcache load ports in OoO window; 1 L1 Dcache load/store port at commit
Front-end: combined bimodal (16k entry)/gshare (16k entry) branch predictor with 16k-entry selection table, 64-entry RAS, 8k-entry 4-way BTB
Memory system (latency): 32k DM L1 icache (1), 32k DM L1 dcache (1); 256K 8-way L2 (7), 8MB 8-way L3 (15), 64-byte cache lines; memory (400 cycle/100 ns best-case latency, 10 GB/s bandwidth, based on a 5 GHz clock); stride-based prefetcher modeled after Power4

Causality (Lamport)
 An instruction i is causally
dependent upon instruction j if
there is a directed path from j
to i
 Two operations are concurrent
if neither causally depends
upon the other
 Coherence misses are a
significant source of
performance degradation for
many applications
 If two operations are
concurrent, why is their
performance penalized?
[Figure: time diagram for processors P1, P2, and P3 showing the operations st A, st C, ld A, st B, ld C, ld B, ld A]
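
The definitions of causal dependence and concurrency can be made concrete with a small reachability check. The function names and the example edges below are hypothetical; they are not a reconstruction of the figure on this slide.

```python
def reachable(edges, src, dst):
    """Return True if there is a directed path from src to dst."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, []).append(b)
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, []):
            if nxt == dst:
                return True
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False

def causally_depends(edges, i, j):
    """i causally depends on j if there is a directed path from j to i."""
    return reachable(edges, j, i)

def concurrent(edges, i, j):
    """Two operations are concurrent if neither causally depends on the other."""
    return not causally_depends(edges, i, j) and not causally_depends(edges, j, i)

# Hypothetical execution: a chain through A and B, plus an unrelated store to C.
example = [("st A", "ld A"), ("ld A", "st B"), ("st B", "ld B")]
```

In this example `ld B` causally depends on `st A`, whereas `st C` is concurrent with every operation in the chain.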

Prior work: formal memory model representations
 Local, WRT, global “performance” of memory ops (Dubois et
al., ISCA-13)
 Acyclic graph representation (Landin et al., ISCA-18)
 Modeling memory operation as a series of sub-operations
(Collier, RAPA)
 Acyclic graph + sub-operations (Adve, thesis)
 Initiation event, for modeling early store-to-load forwarding
(Gharachorloo, thesis)

Anatomy of a cycle
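[Figure: constraint graph cycle between Proc 1 and Proc 2 over the operations ST A, LD A, ST B, LD B. Edge labels: program order (two edges), WAR, RAW. Annotations: "Incoming invalidate" and "Cache miss".]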

Other prior work
 Speculative stale value usage
– LVP with Stale Values (Lepak, Ph.D. Thesis ’03)
– Coherence Decoupling (Huh et al., ASPLOS ’04)
 Delayed RFO response to improve
synchronization throughput (Rajwar et al., HPCA
’00)

Constraint graph extensions
 Constraint graph definition differs for other
consistency models
 Processor consistency
– Remove program order edges from stores to subsequent
loads
– Remaining single-thread orders: edges from
• Loads to subsequent loads
• Stores to subsequent stores
• Loads to subsequent stores
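
The difference between the SC and PC edge sets can be sketched as below. The function `program_order_edges` and its encoding of operations are assumptions, and the sketch ignores same-address special cases.

```python
def program_order_edges(ops, model):
    """Single-thread ordering edges for one processor (sketch).

    ops:   list of (kind, address) in program order, kind is "LD" or "ST"
    model: "SC" keeps every program order edge,
           "PC" drops edges from a store to a subsequent load
    Returns a list of (earlier_index, later_index) pairs.
    """
    edges = []
    for i, (kind_i, _) in enumerate(ops):
        for j in range(i + 1, len(ops)):
            kind_j = ops[j][0]
            if model == "PC" and kind_i == "ST" and kind_j == "LD":
                continue            # no store-to-load order under PC
            edges.append((i, j))
    return edges
```

For one side of a Dekker-style handshake, `[("ST", "A"), ("LD", "B")]`, the SC edge set is `[(0, 1)]` and the PC edge set is empty, which is why the PC constraint graph for that pattern can stay acyclic.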

Constraint graph extensions
 Constraint graph definition differs for other
consistency models
 Weak ordering
– Remove program order edges
– Add single-thread ordering edges between
• memory barrier and preceding/following instructions
• same address reads/writes
• dependent instructions
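
A comparable sketch for weak ordering is shown below. The function `weak_order_edges` is hypothetical, dependences are passed in explicitly as index pairs, and treating every same-address pair as ordered is a simplifying assumption rather than a statement about the POWER architecture.

```python
def weak_order_edges(ops, deps=()):
    """Single-thread ordering edges under weak ordering (sketch).

    ops:  list of (kind, address) in program order; kind is "LD", "ST",
          or "MB" (memory barrier, address None)
    deps: explicit (earlier_index, later_index) dependence pairs
    """
    edges = set(deps)                       # dependent instructions
    for i in range(len(ops)):
        for j in range(i + 1, len(ops)):
            kind_i, addr_i = ops[i]
            kind_j, addr_j = ops[j]
            if kind_i == "MB" or kind_j == "MB":
                edges.add((i, j))           # barrier vs. preceding/following op
            elif addr_i == addr_j:
                edges.add((i, j))           # same-address reads/writes
    return sorted(edges)
```

With `[("LD", "A"), ("MB", None), ("LD", "B")]` the two loads are ordered only through the barrier; without the barrier, two loads to different addresses get no edge at all.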

PC Example – Dekker’s Alg.
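[Figure: constraint graph for Dekker's algorithm under PC. Proc 1: ST A, LD B. Proc 2: ST B, LD A. Edge labels: write-after-read dependence order, program order (two labels). Edges are numbered 1 to 4. Caption: lack of store-to-load order results in an acyclic graph.]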

Constraint graph example - SC
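[Figure: constraint graph for an SC execution on Proc 1 and Proc 2 over the operations ST A, LD A, ST B, LD B. Edge labels: program order (two edges), write-after-read dependence order, read-after-write dependence order. Edges are numbered 1 to 4. Caption: the cycle indicates that the execution is incorrect.]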

Constraint graph example - PC
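[Figure: constraint graph for a PC execution on Proc 1 and Proc 2 over the operations ST A, LD B, ST B, LD A. Edge labels: program order (two edges), write-after-read dependence order, read-after-write dependence order. Edges are numbered 1 to 4.]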

ECDC Conceptual Description
 Identify causal dependences (upstream probe sets)
– 1 upstream set per processor
– 2 upstream sets per cache block (read set, write set)
 Communicating dependences
– Probe sets passed on response messages
– Probes attached to incoming invalidation messages
– Extra ProbePropagation messages sent at memory barriers
 Identifying usable stale blocks
– Extra stable state in cache (ST)
– Supplanter probe

ECDC Operation
[Table: probe sets Φproc (upstream), Φ(read|write)A, and Φ(read|write)B, shown initially and after each of the operations 1. ld A, 2. st A, 3. ld B, 4. st B, 5. ld C; set contents omitted]

Finite ECDC Performance
 When restricting PPB/STAB resources (220 KB per
processor)
– 16k probe lifetime counter
– 128 entry STAB per processor
– 32 Entry PPB per processor/directory controller (256 PPB
virtual namespace)
 TPC-H/SPECweb99 performance is within the margin of
error of the infinite-resource results

Non-atomicity of writes
 Absent from model
 Effect on optimizations
– Forces unnecessary orders to exist
– Correct, but another example of over-conservatism
 Hopefully, infrequent performance divot
Processor p1
st r1, [A]
Processor p2
ld r1, [A]
st r2, [r1]
Processor p3
ld r1, [B]
membar
ld r2, [A]

ECDC Base machine model
Simulator: PHARMsim, based on SimpleMP including a Sun Gigaplane-like snooping coherence protocol [Rajwar], within the SimOS-PPC full-system simulator
Out-of-order execution core: 15-stage, 8-wide pipeline; 256-entry reorder buffer; 128-entry load/store queue; 32-entry issue queue
Functional units (latency): 8 Int ALUs (1), 3 Int MULT/DIV (3/12), 4 FP ALUs (4), 4 FP MULT/DIV (4,4); 4 L1 Dcache load ports in OoO window; 1 L1 Dcache load/store port at commit
Front-end: combined bimodal (16k entry)/gshare (16k entry) branch predictor with 16k-entry selection table, 64-entry RAS, 8k-entry 4-way BTB
Cache hierarchy (latency): 32k DM L1 icache (1), 32k DM L1 dcache (1); 256K 8-way L2 (7), 16 MB 8-way L3 (15), 128-byte cache lines; stride-based prefetcher modeled after Power4
Memory system (latency): 2-D static DOR routed torus interconnect, 60 cycles per link+route (40 GB/s bandwidth per link, 5 GHz clock); memory (400-cycle best-case latency, 10 GB/s bandwidth)

Mapping ECDC to HW
 STAB – Maintains
supplanting probe for each
stale cache block
 PPB – Maintains
approximation of upstream
sets
 In caches – 2 extra bits for
stale state and synch
heuristic
[Figure: node block diagram with components labeled P, I$, D$, L2 $, STAB, PPB, Castout, NIC, MemCtr, Dir, and DRAM]

Probe representation
 Each probe represented by n-bit timer
 Stale block may be used until supplanting probe
timer expires
 Probe set in p-processor system represented by p
timers
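
One way to picture the timer representation is sketched below. The helper names are hypothetical, and two details are assumptions made here rather than facts from the slides: that a timer counts down from the largest n-bit value, and that a value of zero in a probe set means no live probe from that processor.

```python
def new_probe(n_bits):
    """Initial timer value of a probe: the largest n-bit count (assumed)."""
    return (1 << n_bits) - 1

def tick(timer, cycles=1):
    """Advance a countdown timer by `cycles`, saturating at zero."""
    return max(0, timer - cycles)

def stale_block_usable(supplanting_timer):
    """A stale block may be used until its supplanting probe timer expires."""
    return supplanting_timer > 0

def empty_probe_set(p):
    """Probe set for a p-processor system: one timer per processor."""
    return [0] * p
```

For instance, a 14-bit probe starts at 16383 and, after that many ticks, no longer permits use of the stale block it supplanted.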

STAB Detail
[Figure: STAB detail, showing a table with address and timer columns, the cache, incoming invalidates, and per-processor counters p1, p2, and p3]

PPB Detail
[Figure: PPB detail, showing an address hash, a timer index table, shift registers of probe timers, the incoming upstream set, and the expired upstream set]

Memory consistency review
 Memory consistency model
– Specifies the programming interface to a shared memory
– i.e. the allowable interleaving of instructions
 Models discussed here:
– Sequential Consistency
– Processor Consistency
• No store-to-load program order
– Weak Ordering
• Order wrt memory barriers
• Same-address order
• Dependence order

Example – necessary miss (SC)
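[Figure: SC execution on Proc 1 and Proc 2 with operations LD A, ST B, LD B, ST A, LD A, connected by RAW and WAR edges and three program order (PO) edges. Annotation: "Block A is in proc 1’s ..."]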

Example – avoidable miss (SC)
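[Figure: SC execution on Proc 1 and Proc 2 with operations LD A, ST B, LD B, ST A, LD A, connected by RAW and WAR edges and three program order (PO) edges]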

Typical ReadX transaction
 When sending invalidation, create probe, add to PPB
 At receipt of invalidation (2b, 2c) add probe to STAB
 When sending invalidate acknowledgment, add probe set to the response
 When receiving invalidate acknowledgment, add incoming probe set to the PPB
[Figure: ReadX message flow among nodes R, H, S1, and S2. Messages: 1. ReadX; 2(a) Sharers/Data; 2(b) Inval; 2(c) Inval; 3(a) Inval Ack; 3(b) Inval Ack]
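
The four bookkeeping steps of this transaction can be traced with a toy model. The names `Node` and `readx` are hypothetical, PPB and STAB are reduced to a set and a dictionary, and the sketch assumes that node H creates the probe when it sends the invalidations, which the slide does not state explicitly.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.ppb = set()    # approximation of this node's upstream probe set
        self.stab = {}      # stale block address -> supplanting probe

def readx(requester, home, sharers, address, probe):
    """Trace the probe bookkeeping for one ReadX transaction (sketch)."""
    home.ppb.add(probe)                   # sending Inval: create probe, add to PPB
    for sharer in sharers:
        sharer.stab[address] = probe      # receiving Inval: add probe to STAB
        ack_probes = set(sharer.ppb)      # sending Inval Ack: attach probe set
        requester.ppb |= ack_probes       # receiving Inval Ack: merge into PPB
```

After the transaction, each sharer can keep reading its stale copy of the block while the STAB entry records which probe would force it to stop.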

Invalidation to read distance
[Figure: % of load coherence misses (y-axis, 0% to 100%) versus cycles (x-axis, log scale, 1 to 1E+09) for fft, fmm, ocean, radix, raytrace, SPECjbb2000, SPECweb99, TPC-B, and TPC-H]

Invalidation to read distance (synch)
[Figure: % of load coherence misses (y-axis, 0% to 100%) versus cycles (x-axis, log scale, 1 to 1E+09) for fft, fmm, ocean, radix, raytrace, SPECjbb2000, SPECweb99, TPC-B, and TPC-H]

Invalidation to read distance (data)
[Figure: % of load coherence misses (y-axis, 0% to 100%) versus cycles (x-axis, log scale, 1 to 1E+09) for fft, fmm, ocean, radix, raytrace, SPECjbb2000, SPECweb99, TPC-B, and TPC-H]

STAB entry death cdf
[Figure: fraction of STAB entries deallocated (y-axis, 0 to 1) versus cycles (x-axis, log scale, 1 to 1000000) for fft, fmm, ocean, radix, raytrace, SPECjbb2000, SPECweb99, TPC-B, and TPC-H]

STAB Entry Lifetime

ECDC performance (16k probe lifetime)

ECDC Perf (128 entry STAB, 32 entry PPB, 256 entry namespace)

ProbePropagation messages

ECDC Storage Overhead
[Figure: storage in KB (y-axis, 0 to 350) versus processor count (x-axis, 4p to 1024p)]

What about limit study?
 Indicated a larger number of avoidable coherence
misses
 Reasons:
– Did not account for non-speculative nature of protocol
(oracle ECDC could be better)
– Inaccurate measurement of critical writes
• Many loads perform polling to lines that have never been
touched by a load-linked or store-conditional
– Used isolated stale data detection mechanism

What about speculative load squashes?
 In a few applications, they occur frequently
(SPECjbb2000, TPC-H)
 Implemented/evaluated read-set-tracking w/
squash on miss
 Could eliminate a large fraction of squashes
– Unfortunately, little performance improvement
– Presumably, many squashes caused by contended spin
locks

ECDC and other consistency models
 Stricter model => more ProbePropagation
messages
 Potential for release consistency
 In SC/PC/TSO, ECDC benefits will probably be
dominated by extra ProbePropagation messages

Cause of STAB entry deallocation

Publications
 [ISCA ’04] Memory ordering: A Value-based approach.
– Selected for IEEE Micro Top Picks ’04
 [PACT ’03] Constraint Graph Analysis of Multithreaded Programs.
– Selected for Best of PACT JILP Issue
 [PACT ’03] Redeeming IPC as a Performance Metric for Multithreaded Programs.
 [CAECW ’02] Precise and Accurate Processor Simulation
 [SPAA Revue ’02] Verifying Sequential Consistency Using Vector Clocks.
 [Micro ’01] Correctly Implementing Value Prediction in Microprocessors that Support
Multithreading or Multiprocessing.
 [WBT ’01] A Dynamic Binary Translation Approach to Architectural Simulation
 [HPCA ’01] An Architectural Characterization of Java TPC-W.
 [Euro-Par ’00] A Callgraph-Based Search Strategy for Automated Performance
Diagnosis.
– Selected as distinguished paper
 [CAECW ’00] Characterizing a Java Implementation of TPC-W
