SlideShare une entreprise Scribd logo
1  sur  92
Multicore Synchronization
a pragmatic introduction
InfoQ.com: News & Community Site
• 750,000 unique visitors/month
• Published in 4 languages (English, Chinese, Japanese and Brazilian
Portuguese)
• Post content from our QCon conferences
• News 15-20 / week
• Articles 3-4 / week
• Presentations (videos) 12-15 / week
• Interviews 2-3 / week
• Books 1 / month
Watch the video with slide
synchronization on InfoQ.com!
http://www.infoq.com/presentations
/multicore-synchronization
Presented at QCon New York
www.qconnewyork.com
Purpose of QCon
- to empower software development by facilitating the spread of
knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of
change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
Speaker (@0xF390)
Co-founder of Backtrace
Building high performance debugging platform for native
applications.



http://backtrace.io @0xCD03
Founder of Concurrency Kit
A concurrent memory model for C99 and arsenal of tools for
high performance synchronization.
http://concurrencykit.org
Previously at AppNexus, Message Systems and GWU HPCL
Multicore Synchronization
This is a talk on mechanical sympathy of parallel
systems on modern multicore systems.
Understanding both your workload and your
environment allows for effective optimization.
Principles of Multicore
Memory
Cache Coherency
Cache coherency guarantees the eventual consistency
of shared state.
Core 0
Cache
Memory
Core 1
I S A Data
0 I 0x0000 backtrace.io…
1 I 0x0040 393929191
2 S 0x0080 asodkadoakdoak0
3 M 0x00c0 3.141592
Memory Controller
Cache
I S A Data
0 E 0x1000 ASDOKADO
1 M 0x1040 213091i491119
2 S 0x0080 asodkadoakdoak0
3 M 0x10c0 9940191
Memory Controller
Memory
Cache Coherency
int x = 443; (&x = 0x20c4)
Core 0
Cache
Memory
Core 1
I S A Data
0 I 0x0000 backtrace.io…
1 I 0x0040 393929191
2 S 0x0080 asodkadoakdoak0
3 M 0x00c0 3.141592
Memory Controller
Cache
I S A Data
0 E 0x1000 ASDOKADO
1 M 0x1040 213091i491119
2 S 0x0080 asodkadoakdoak0
3 M 0x10c0 9940191
Memory Controller
x = x + 10010;
Thread 0
printf(“%dn”, x);
Thread 1
Memory
Cache Coherency
Load x
Core 0
Cache
Memory
Core 1
I S A Data
0 I 0x0000 backtrace.io…
1 I 0x0040 393929191
2 S 0x0080 asodkadoakdoak0
3 E 0x20c0 1921919119443………
Memory Controller
Cache
I S A Data
0 E 0x1000 ASDOKADO
1 M 0x1040 213091i491119
2 S 0x0080 asodkadoakdoak0
3 M 0x10c0 9940191
Memory Controller
x = x + 10010;
Thread 0
printf(“%dn”, x);
Thread 1
Memory
Cache Coherency
Update the value of x
Core 0
Cache
Memory
Core 1
I S A Data
0 I 0x0000 backtrace.io…
1 I 0x0040 393929191
2 S 0x0080 asodkadoakdoak0
3 M 0x20c0 192191911910453………
Memory Controller
Cache
I S A Data
0 E 0x1000 ASDOKADO
1 M 0x1040 213091i491119
2 S 0x0080 asodkadoakdoak0
3 M 0x10c0 9940191
Memory Controller
x = x + 10010;
Thread 0
printf(“%dn”, x);
Thread 1
Memory
Cache Coherency
Thread 1 loads x
Core 0
Cache
Memory
Core 1
I S A Data
0 I 0x0000 backtrace.io…
1 I 0x0040 393929191
2 S 0x0080 asodkadoakdoak0
3 O 0x20c0 192191911910453………
Memory Controller
Cache
I S A Data
0 E 0x1000 ASDOKADO
1 M 0x1040 213091i491119
2 S 0x0080 asodkadoakdoak0
3 S 0x20c0 192191911910453………
Memory Controller
x = x + 10010;
Thread 0
printf(“%dn”, x);
Thread 1
Cache Coherency
MESI, MOESI and MESIF are common cache coherency
protocols.
MOESI: Modified, Owned, Exclusive, Shared, Invalid
MESIF: Modified, Exclusive, Shared, Forwarding
MESI: Modified, Exclusive, Shared, Invalid
Cache Coherency
The cache line is the unit of coherency and can
become an unnecessary source of contention.
[0] [1]
array[]
Thread 0
for (;;) {

array[0]++;

}
Thread 1
for (;;) {

array[1]++;

}
Cache Coherency
False sharing occurs when logically disparate objects
share the same cache line and contend on it.
struct {

rwlock_t rwlock;

int value;

} object;
Thread 0
for (;;) {

read_lock(&object.rwlock);

int v = atomic_read(&object.value);

do_work(v);

read_unlock(&object.rwlock);

}
Thread 1
for (;;) {

read_lock(&object.rwlock);

<short work>

read_unlock(&object.rwlock);

}
Cache Coherency
False sharing occurs when logically disparate objects
share the same cache line and contend on it.
0
750,000,000
1,500,000,000
2,250,000,000
3,000,000,000
One Reader False Sharing
67,091,751
2,165,421,795
Cache Coherency
Padding can be used to mitigate false sharing.
struct {

rwlock_t rwlock;

char pad[64 - sizeof(rwlock_t)];

int value;

} object;
Cache Coherency
Padding can be used to mitigate false sharing.
0
750,000,000
1,500,000,000
2,250,000,000
3,000,000,000
One Reader False Sharing Padding
1,954,712,036
67,091,751
2,165,421,795
Cache Coherency
Padding must consider access patterns and overall
footprint of application.
Too much padding is bad.
Simultaneous Multithreading
SMT technology allows for throughput increases by
allowing programs to better utilize processor
resources.
Figure from “The Architecture of the Nehalem Processor and Nehalem-
EP SMP Platforms” (Michael E. Thomadakis)
Atomic Operations
Atomic operations are typically implemented with the
help of the cache coherency mechanism.
lock cmpxchg(target, compare, new):

register = load_and_lock(target);

if (register == compare)

store(target, new);

unlock(target);

return register;
Cache line locking typically only serializes accesses to
the target cache line.
Atomic Operations
In the old commodity processor days, atomic
operations were implemented with a bus lock.
lock cmpxchg(target, compare, new):

lock(memory_bus);

register = load(target);

if (register == compare)

store(target, new);

unlock(memory_bus);

return register;
x86 will assert a bus lock if an atomic operations goes
across a cache line boundary. Be careful!
Atomic Operations
Atomic operations are crucial to efficient
synchronization primitives.
COMPARE_AND_SWAP(a, b, c): updates a to c if a is
equal to b, atomically.
LOAD_LINKED(a)/STORE_CONDITIONAL(a, b):
Updates a to b if a was not modified between the load-
linked (LL) and store-conditional (SC).
Topology
Most modern multicore systems are NUMA
architectures: the throughput and latency of memory
accesses varies.
Topology
The NUMA factor is a ratio that represents the relative
cost of a remote memory access.
Intel Xeon L5640 machine at 2.27 GHz (12x2)
Time
Local
wake-up
~140ns
Remote
wake-up
~289ns
Topology
NUMA effects can be pervasive and difficult to
mitigate.
Sun x4600
Topology
Be wary of your operating system’s memory placement
mechanisms.
First Touch: Allocate page on memory of first processor
to touch it.
Interleave: Allocate pages round-robin across nodes.
More sophisticated schemes exist that do hierarchical
allocation, page migration, replication and more.
Topology
NUMA-oblivious synchronization objects are not only
susceptible to performance mismatch but starvation
and even livelock under extreme load.
0
25,000,000
50,000,000
75,000,000
100,000,000
C0 C1 C2 C3
1E+07
7E+052E+06
1E+08
Intel(R) Xeon(R) CPU E7- 4850 @ 2.00GHz @ 10x4
Fairness
Fair locks guarantee starvation-freedom.
0 0
Next Position
CK_CC_INLINE static void
ck_spinlock_ticket_lock(struct ck_spinlock_ticket *ticket)
{
unsigned int request;
request = ck_pr_faa_uint(&ticket->next, 1);
while (ck_pr_load_uint(&ticket->position) != request)
ck_pr_stall();
return;
}
Fairness
1 0
Next Position
CK_CC_INLINE static void
ck_spinlock_ticket_lock(struct ck_spinlock_ticket *ticket)
{
unsigned int request;
request = ck_pr_faa_uint(&ticket->next, 1);
while (ck_pr_load_uint(&ticket->position) != request)
ck_pr_stall();
return;
}
request = 0
Fairness
1 0
Next Position
CK_CC_INLINE static void
ck_spinlock_ticket_lock(struct ck_spinlock_ticket *ticket)
{
unsigned int request;
request = ck_pr_faa_uint(&ticket->next, 1);
while (ck_pr_load_uint(&ticket->position) != request)
ck_pr_stall();
return;
}
request = 0
Fairness
1 1
Next Position
CK_CC_INLINE static void
ck_spinlock_ticket_unlock(struct ck_spinlock_ticket *ticket)
{
unsigned int update;
update = ck_pr_load_uint(&ticket->position);
ck_pr_store_uint(&ticket->position, update + 1);
return;
}
Fairness
0
6,250,000
12,500,000
18,750,000
25,000,000
C0 C1 C2 C3
4E+064E+064E+064E+06
1E+07
7E+05
2E+06
1E+08
FAS Ticket
Intel(R) Xeon(R) CPU E7- 4850 @ 2.00GHz @ 10x4
Fairness
Fair locks are not a silver bullet and may negatively
impact throughput.
Fairness comes at the cost of increased sensitivity to
preemption and other sources of jitter.
Fairness
0
1,500,000
3,000,000
4,500,000
6,000,000
C0 C1 C2 C3
4E+064E+064E+064E+06
5E+065E+065E+065E+06
MCS Ticket
Intel(R) Xeon(R) CPU E7- 4850 @ 2.00GHz @ 10x4
Distributed Locks
Array and queue-based locks provide lock scalability
and fairness with distributing spinning and point-to-
point wake-up.
Thread 1
Thread 2
Thread 3
Thread 4
Distributed Locks
The MCS lock was a seminal contribution to the area
and introduced queue locks to the masses.
Thread 1
Thread 2
Thread 3
Thread 4
Distributed Locks
Thread 1
Thread 2
Thread 3
Thread 4
Distributed Locks
Thread 1
Thread 2
Thread 3
Thread 4
Distributed Locks
Thread 1
Thread 2
Thread 3
Thread 4
Distributed Locks
Thread 1
Thread 2
Thread 3
Thread 4
Distributed Locks
Thread 1
Thread 2
Thread 3
Thread 4
Distributed Locks
Thread 1
Thread 2
Thread 3
Thread 4
Distributed Locks
Thread 1
Thread 2
Thread 3
Thread 4
Distributed Locks
Thread 1
Thread 2
Thread 3
Thread 4
Core 0
Distributed Locks
Similar mechanisms exist for read-write locks.
Writers Readers
Core 1
Core 2
Core 3
Core 0
Distributed Locks
Big reader locks (brlocks) or Read-Mostly Locks
(rmlocks) distribute read-side flags so that readers spin
only on local memory.
Writers
Core 1 Core 2
Reader Reader
Core 3
Reader
Distributed Locks
Similar mechanisms exist for read-write locks.
Intel Xeon E5-2630L at 2.40 GHz
Limitations of Locks
Locks are not composable and are susceptible to
priority inversion, livelock, starvation, deadlock and
more.
A delicate balance must be found between lock
hierarchies, granularity and quality of service.
A significant delay in one thread holding a
synchronization object leads to significant delays for all
other threads waiting on the same synchronization
object.
Limitations of Locks
Intel Xeon E5-2630L at 2.40 GHz
Lock-less Synchronization
With lock-based synchronization, it is sufficient to
reason in terms of lock dependencies and critical
sections.
This model doesn’t work with lock-less synchronization
where we must guarantee correctness in much more
subtle ways.
Memory Models
These days, cache coherency helps implement the
consistency model.
The memory model is specified by the runtime
environment and defines the correct behavior of
shared memory accesses.
Memory Models
int r_0;
x = 1;
r_0 = y;
int r_1;
y = 1;
r_1 = x;
int x = 0;
int y = 0;
if (r_0 == 0 && r_1 == 0)
abort();
Core 0 Core 1
Memory Models
int r_0;
x = 1;
r_0 = y;
int r_1;
y = 1;
r_1 = x;
int x = 0;
int y = 0;
Core 0 Core 1
(1,1)
Memory Models
int r_0;
x = 1;
r_0 = y;
int r_1;
y = 1;
r_1 = x;
int x = 0;
int y = 0;
Core 0
Core 1
(0,1)
Memory Models
int r_0;
x = 1;
r_0 = y;
int r_1;
y = 1;
r_1 = x;
int x = 0;
int y = 0;
Core 0
Core 1
(1,0)
Memory Models
This condition is possible and is an example of store-
to-load re-ordering.
int r_0;
r_0 = y;
x = 1;
int r_1;
r_1 = x;
y = 1;
Core 0 Core 1
(0,0)
Memory Models
Modern processors rely on a myriad of techniques to
achieve high levels of instruction-level parallelism.
Core 0
Execution Engine
Memory Order-Buffer
L1
L2
Memory
Core 1
Execution Engine
Memory Order-Buffer
L1
L2
x = 1; y = 1;
r_0 = y; r_1 = x;
Memory Models
The processor memory model is specified with respect
to loads, stores and atomic operations.
TSO RMO
Load to Load
Load to Store
Store to Store
Store to Load
Atomics
Examples x86*, SPARC-TSO ARM, Power
Memory Models
Ordering guarantees are provided by serializing
instructions such as memory fences.
mutex_lock(&mutex);
x = x + 1;
mutex_unlock(&mutex);
?
CK_CC_INLINE static void
mutex_lock(struct mutex *lock)
{
while (ck_pr_fas_uint(&lock->value, true) == true);
ck_pr_fence_memory();
return;
}
* Simplified
Memory Models
Serializing instructions are expensive because they
disable some processor optimizations.
Atomic instructions are expensive because they either
involve serialization (and locking) or are just plain old
complex.
Intel Core i7-3615QM at 2.30 GHz
Throughput (/ second)
lock cmpxchg 147,304,564
cmpxchg 458,940,006
Lock-less Synchronization
Non-blocking synchronization provides very specific
progress guarantees and high levels of resilience at the
cost of complexity on the fast path.
Lock-Less
Obstruction-Free
Lock-Free
Wait-Free
Lock-freedom provides
system-wide progress
guarantees.
Wait-freedom provides per-
operation progress
guarantees.
Lock-less Synchronization
struct node {
void *value;
struct node *next;
};
void
stack_push(struct node **top, struct node *entry, void *value)
{
entry->value = value;
entry->next = *top;
*top = entry;
return;
}
struct node *
stack_pop(struct node **top)
{
struct node *r;
r = *top;
*top = r->next;
return r;
}
Lock-less Synchronization
struct node {
void *value;
struct node *next;
};
void
stack_push(struct node **top, struct node *entry,
void *value)
{
entry->value = value;
do {
entry->next = ck_pr_load_ptr(top);
} while (ck_pr_cas_ptr(top, entry->next,
entry) == false);
return;
}
Lock-less Synchronization
struct node {
void *value;
struct node *next;
};
void
stack_push(struct node **top, struct node *entry,
void *value)
{
entry->value = value;
do {
entry->next = ck_pr_load_ptr(top);
} while (ck_pr_cas_ptr(top, entry->next,
entry) == false);
return;
}
struct node *
stack_pop(struct node **top)
{
struct node *r, *next;
do {
r = ck_pr_load_ptr(top);
if (r == NULL)
return NULL;
next = ck_pr_load_ptr(&r->next);
} while (ck_pr_cas_ptr(top, r, next) ==
false);
return r;
}
Lock-less Synchronization
Source: Nonblocking Algorithms and Scalable Multicore Programming,
Samy Bahra
Non-blocking synchronization
is not a silver bullet.
The cost of complexity on the
fast path will outweigh the
benefits until sufficient levels of
contention are reached.
Lock-less Synchronization
Relaxing correctness constraints and constraining runtime
requirements allows for many of the benefits without as
much additional complexity and impact on the fast path.
#define EMPLOYEE_MAX 8
struct employee {
const char *name;
unsigned long long number;
};
struct directory {
struct employee *employee[EMPLOYEE_MAX];
rwlock_t rwlock;
};
bool employee_add(struct directory *, const char *,
unsigned long long);
void employee_delete(struct directory *, const char *);
unsigned long long employee_number_get(struct directory *,
const char *);
rwlock
Lock-less Synchronization
unsigned long long
employee_number_get(struct directory *d, const char *n)
{
struct employee *em;
unsigned long number;
size_t i;
rwlock_read_lock(&d->rwlock);
for (i = 0; i < EMPLOYEE_MAX; i++) {
em = d->employee[i];
if (em == NULL)
continue;
if (strcmp(em->name, n) != 0)
continue;
number = em->number;
rwlock_read_unlock(&d->rwlock);
return number;
}
rwlock_read_unlock(&d->rwlock);
return 0;
}
The rwlock_t object provides
correctness at cost of forward
progress.
“Samy”
2912984911
rwlock
Lock-less Synchronization
bool
employee_add(struct directory *d, const char *n,
unsigned long long number)
{
struct employee *em;
size_t i;
rwlock_write_lock(&d->rwlock);
for (i = 0; i < EMPLOYEE_MAX; i++) {
if (d->employee[i] != NULL)
continue;
em = xmalloc(sizeof *em);
em->name = n;
em->number = number;
d->employee[i] = em;
rwlock_write_unlock(&d->rwlock);
return true;
}
rwlock_write_unlock(&d->rwlock);
return false;
}
“Samy”
2912984911
rwlock
The rwlock_t object provides
correctness at cost of forward
progress.
Lock-less Synchronization
“Samy”
2912984911
rwlock
The rwlock_t object provides
correctness at cost of forward
progress.
void
employee_delete(struct directory *d, const char *n)
{
struct employee *em;
size_t i;
rwlock_write_lock(&d->rwlock);
for (i = 0; i < EMPLOYEE_MAX; i++) {
if (d->employee[i] == NULL)
continue;
if (strcmp(d->employee[i]->name, n) != 0)
continue;
em = d->employee[i];
d->employee[i] = NULL;
rwlock_write_unlock(&d->rwlock);
free(em);
return;
}
rwlock_write_unlock(&d->rwlock);
return;
}
Lock-less Synchronization
void
employee_delete(struct directory *d, const char *n)
{
struct employee *em;
size_t i;
rwlock_write_lock(&d->rwlock);
for (i = 0; i < EMPLOYEE_MAX; i++) {
if (d->employee[i] == NULL)
continue;
if (strcmp(d->employee[i]->name, n) != 0)
continue;
em = d->employee[i];
d->employee[i] = NULL;
rwlock_write_unlock(&d->rwlock);
free(em);
return;
}
rwlock_write_unlock(&d->rwlock);
return;
}
“Samy”
2912984911
rwlock
If reachability and liveness are
coupled, you also protect
against a read-reclaim race.
Lock-less Synchronization
If reachability and liveness are coupled, you also protect against a
read-reclaim race.
strcmp(em->name, …
number = em->number
employee_delete waits on readers
Time
T0
T1
T2
employee_delete destroys object
Lock-less Synchronization
Decoupling is sometimes necessary, but requires a safe memory
reclamation scheme to guarantee that an object cannot be
physically destroyed if there are active references to it.
static struct employee *
employee_number_get(struct directory *d, const char *n,
ck_brlock_reader_t *reader)
{
…
ck_brlock_read_lock(&d->brlock, reader);
for (i = 0; i < EMPLOYEE_MAX; i++) {
em = d->employee[i];
if (em == NULL)
continue;
if (strcmp(em->name, n) != 0)
continue;
ck_pr_inc_uint(&em->ref);
ck_brlock_read_unlock(reader);
return em;
}
ck_brlock_read_unlock(reader);
…
Lock-less Synchronization
Decoupling is sometimes necessary, but requires a safe memory
reclamation scheme to guarantee that an object cannot be
physically destroyed if there are active references to it.
static void
employee_delref(struct employee *em)
{
bool z;
ck_pr_dec_uint_zero(&em->ref, &z);
if (z == true)
free(em);
return;
}
Lock-less Synchronization
Decoupling is sometimes necessary, but requires a safe memory
reclamation scheme to guarantee that an object cannot be
physically destroyed if there are active references to it.
static void
employee_delete(struct directory *d, const char *n)
{
…
ck_brlock_write_lock(&d->brlock);
for (i = 0; i < EMPLOYEE_MAX; i++) {
if (d->employee[i] == NULL)
continue;
if (strcmp(d->employee[i]->name, n) != 0)
continue;
em = d->employee[i];
d->employee[i] = NULL;
ck_brlock_write_unlock(&d->brlock);
employee_delref(em);
return;
}
ck_brlock_write_unlock(&d->brlock);
…
Lock-less Synchronization
Decoupling is sometimes necessary, but requires a safe memory
reclamation scheme to guarantee that an object cannot be
physically destroyed if there are active references to it.
get
get
logical deleteT0
T1
T2
logical delete
physical delete
active reference
active reference
Lock-less Synchronization
Concurrent Data Structures
static bool
employee_add(struct directory *d, const char *n,
unsigned long long number)
{
struct employee *em;
size_t i;
ck_rwlock_write_lock(&d->rwlock);
for (i = 0; i < EMPLOYEE_MAX; i++) {
if (d->employee[i] != NULL)
continue;
em = malloc(sizeof *em);
em->name = n;
em->number = number;
ck_pr_fence_store();
ck_pr_store_ptr(&d->employee[i], em);
ck_rwlock_write_unlock(&d->rwlock);
return true;
}
ck_rwlock_write_unlock(&d->rwlock);
return false;
}
static unsigned long long
employee_number_get(struct directory *d, const char *n)
{
struct employee *em;
unsigned long number;
size_t i;
for (i = 0; i < EMPLOYEE_MAX; i++) {
em = ck_pr_load_ptr(&d->employee[i]);
if (em == NULL)
continue;
ck_pr_fence_load_depends();
if (strcmp(em->name, n) != 0)
continue;
number = em->number;
return number;
}
return 0;
}
Concurrent Data Structures
static void
employee_delete(struct directory *d, const char *n)
{
struct employee *em;
size_t i;
ck_rwlock_write_lock(&d->rwlock);
for (i = 0; i < EMPLOYEE_MAX; i++) {
if (d->employee[i] == NULL)
continue;
if (strcmp(d->employee[i]->name, n) != 0)
continue;
em = d->employee[i];
ck_pr_store_ptr(&d->employee[i], NULL);
ck_rwlock_write_unlock(&d->rwlock);
/* XXX: When is it safe to free em? */
return;
}
ck_rwlock_write_unlock(&d->rwlock);
return;
}
static unsigned long long
employee_number_get(struct directory *d, const char *n)
{
struct employee *em;
unsigned long number;
size_t i;
for (i = 0; i < EMPLOYEE_MAX; i++) {
em = ck_pr_load_ptr(&d->employee[i]);
if (em == NULL)
continue;
ck_pr_fence_load_depends();
if (strcmp(em->name, n) != 0)
continue;
number = em->number;
return number;
}
return 0;
}
EXPERIMENT
• Uniform read-mostly workload
• Single writer attempts pessimistic add operation at fixed
frequency
• Readers attempt to get the number of the first employee
Environment
• 12 cores across 2 sockets
• Intel Xeon E5-2630L at 2.40 GHz
• Linux 2.6.32
Workload
Machine (64GB)
NUMANode L#0 (P#0 32GB)
Socket L#0 + L3 L#0 (15MB)
L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
Read Latency
No updates
Read Latency
No updates
Read Latency
Single writer
Write Latency
Single writer
Safe Memory Reclamation
A read-reclaim race occurs if an object is destroyed while there are
references or accesses to it.
strcmp(em->name, …
number = em->number
free(em)
Time
T0
T1
T2
Safe Memory Reclamation
Techniques such as hazard pointers, quiescent-state-based
reclamation and epoch-based reclamation protect against read-
reclaim races.
Schemes such as QSBR and EBR do so without affecting reader
progress but without guaranteeing writer progress.
Schemes provide strong guarantees on forward progress but
require heavy-weight instructions and retry logic for readers.
strcmp(em->name, …
number = em->number
free(em)
Time
T0
T1
T2
BLOCKING SMR SCHEMES
• Read-side critical sections
smr_read_lock();
<protected section>
smr_read_unlock();
• Explicit Reclamation
smr_synchronize();
smr_read_begin();
<protected section>
smr_read_end();
QUIESCENT-STATE-BASED RECLAMATION
Time
T0
T1
T2
logical delete synchronize
q read
q read
q
q
read q
G0
read q
synchronize
G1
destroy
QUIESCENT-STATE-BASED RECLAMATION
employee_number_delete
[…]
ck_pr_store_ptr(slot, NULL);
qsbr_synchronize();
free(em);
[…]
Writer Readers
[…]
for (;;) {
em = employee_get(…);
do_stuff(em);
quiesce();
}
[…]
Time
ck_pr_store_ptr(slot, NULL);
em = employee_get(…);
qsbr_synchronize();
do_stuff(em);
quiesce();
em = employee_get(…);
do_stuff(em);
quiesce();em = employee_get(…);
do_stuff(em);
em = employee_get(…);
do_stuff(em);
free(em);
T0
T1
T2
QUIESCENT-STATE-BASED RECLAMATION
static void
qsbr_synchronize(void)
{
int i;
uint64_t goal;
ck_pr_fence_memory();
goal = ck_pr_faa_64(&global.value, 1) + 1;
for (i = 0; i < n_reader; i++) {
uint64_t *c =
&threads.readers[i].counter.value;
while (ck_pr_load_64(c) < goal)
ck_pr_stall();
}
return;
}
Readers
static void
qsbr_quiesce(struct thread *th)
{
uint64_t v;
ck_pr_fence_memory();
v = ck_pr_load_64(&global.value);
ck_pr_store_64(&th->counter.value, v);
ck_pr_fence_memory();
return;
}
Writers
static void
qsbr_read_lock(struct thread *th)
{
ck_pr_barrier(); /* Compiler barrier. */
return;
}
static void
qsbr_read_unlock(struct thread *th)
{
ck_pr_barrier(); /* Compiler barrier. */
return;
}
Conclusion
There are no silver bullets in multicore synchronization,
but a deep understanding of both your workload and
your underlying environment may allow you to extract
phenomenal performance and reliability increases.
The End
http://concurrencykit.org
http://backtrace.io/
@0xF390
A lot of the content can be found on https://
queue.acm.org/detail.cfm?id=2492433 - along with
references.
Watch the video with slide synchronization on
InfoQ.com!
http://www.infoq.com/presentations/multicore-
synchronization

Contenu connexe

Plus de C4Media

Plus de C4Media (20)

Understand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java ApplicationsUnderstand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java Applications
 
Kafka Needs No Keeper
Kafka Needs No KeeperKafka Needs No Keeper
Kafka Needs No Keeper
 
High Performing Teams Act Like Owners
High Performing Teams Act Like OwnersHigh Performing Teams Act Like Owners
High Performing Teams Act Like Owners
 
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to JavaDoes Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
 
Service Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideService Meshes- The Ultimate Guide
Service Meshes- The Ultimate Guide
 
Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CD
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine Learning
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at Speed
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep Systems
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.js
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly Compiler
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix Scale
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's Edge
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home Everywhere
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing For
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data Engineering
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
 
Navigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery TeamsNavigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery Teams
 
High Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in AdtechHigh Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in Adtech
 
Rust's Journey to Async/await
Rust's Journey to Async/awaitRust's Journey to Async/await
Rust's Journey to Async/await
 

Dernier

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Victor Rentea
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 

Dernier (20)

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 

A Pragmatic Introduction to Multicore Synchronization

  • 2. InfoQ.com: News & Community Site • 750,000 unique visitors/month • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • News 15-20 / week • Articles 3-4 / week • Presentations (videos) 12-15 / week • Interviews 2-3 / week • Books 1 / month Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations /multicore-synchronization
  • 3. Presented at QCon New York www.qconnewyork.com Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide
  • 4. Speaker (@0xF390) Co-founder of Backtrace Building high performance debugging platform for native applications.
 
 http://backtrace.io @0xCD03 Founder of Concurrency Kit A concurrent memory model for C99 and arsenal of tools for high performance synchronization. http://concurrencykit.org Previously at AppNexus, Message Systems and GWU HPCL
  • 5. Multicore Synchronization This is a talk on mechanical sympathy of parallel systems on modern multicore systems. Understanding both your workload and your environment allows for effective optimization.
  • 7. Memory Cache Coherency Cache coherency guarantees the eventual consistency of shared state. Core 0 Cache Memory Core 1 I S A Data 0 I 0x0000 backtrace.io… 1 I 0x0040 393929191 2 S 0x0080 asodkadoakdoak0 3 M 0x00c0 3.141592 Memory Controller Cache I S A Data 0 E 0x1000 ASDOKADO 1 M 0x1040 213091i491119 2 S 0x0080 asodkadoakdoak0 3 M 0x10c0 9940191 Memory Controller
  • 8. Memory Cache Coherency int x = 443; (&x = 0x20c4) Core 0 Cache Memory Core 1 I S A Data 0 I 0x0000 backtrace.io… 1 I 0x0040 393929191 2 S 0x0080 asodkadoakdoak0 3 M 0x00c0 3.141592 Memory Controller Cache I S A Data 0 E 0x1000 ASDOKADO 1 M 0x1040 213091i491119 2 S 0x0080 asodkadoakdoak0 3 M 0x10c0 9940191 Memory Controller x = x + 10010; Thread 0 printf(“%dn”, x); Thread 1
  • 9. Memory Cache Coherency Load x Core 0 Cache Memory Core 1 I S A Data 0 I 0x0000 backtrace.io… 1 I 0x0040 393929191 2 S 0x0080 asodkadoakdoak0 3 E 0x20c0 1921919119443……… Memory Controller Cache I S A Data 0 E 0x1000 ASDOKADO 1 M 0x1040 213091i491119 2 S 0x0080 asodkadoakdoak0 3 M 0x10c0 9940191 Memory Controller x = x + 10010; Thread 0 printf(“%dn”, x); Thread 1
  • 10. Memory Cache Coherency Update the value of x Core 0 Cache Memory Core 1 I S A Data 0 I 0x0000 backtrace.io… 1 I 0x0040 393929191 2 S 0x0080 asodkadoakdoak0 3 M 0x20c0 192191911910453……… Memory Controller Cache I S A Data 0 E 0x1000 ASDOKADO 1 M 0x1040 213091i491119 2 S 0x0080 asodkadoakdoak0 3 M 0x10c0 9940191 Memory Controller x = x + 10010; Thread 0 printf(“%dn”, x); Thread 1
  • 11. Memory Cache Coherency Thread 1 loads x Core 0 Cache Memory Core 1 I S A Data 0 I 0x0000 backtrace.io… 1 I 0x0040 393929191 2 S 0x0080 asodkadoakdoak0 3 O 0x20c0 192191911910453……… Memory Controller Cache I S A Data 0 E 0x1000 ASDOKADO 1 M 0x1040 213091i491119 2 S 0x0080 asodkadoakdoak0 3 S 0x20c0 192191911910453……… Memory Controller x = x + 10010; Thread 0 printf(“%dn”, x); Thread 1
  • 12. Cache Coherency MESI, MOESI and MESIF are common cache coherency protocols. MOESI: Modified, Owned, Exclusive, Shared, Invalid MESIF: Modified, Exclusive, Shared, Forwarding MESI: Modified, Exclusive, Shared, Invalid
  • 13. Cache Coherency The cache line is the unit of coherency and can become an unnecessary source of contention. [0] [1] array[] Thread 0 for (;;) {
 array[0]++;
 } Thread 1 for (;;) {
 array[1]++;
 }
  • 14. Cache Coherency False sharing occurs when logically disparate objects share the same cache line and contend on it. struct {
 rwlock_t rwlock;
 int value;
 } object; Thread 0 for (;;) {
 read_lock(&object.rwlock);
 int v = atomic_read(&object.value);
 do_work(v);
 read_unlock(&object.rwlock);
 } Thread 1 for (;;) {
 read_lock(&object.rwlock);
 <short work>
 read_unlock(&object.rwlock);
 }
  • 15. Cache Coherency False sharing occurs when logically disparate objects share the same cache line and contend on it. 0 750,000,000 1,500,000,000 2,250,000,000 3,000,000,000 One Reader False Sharing 67,091,751 2,165,421,795
  • 16. Cache Coherency Padding can be used to mitigate false sharing. struct {
 rwlock_t rwlock;
 char pad[64 - sizeof(rwlock_t)];
 int value;
 } object;
  • 17. Cache Coherency Padding can be used to mitigate false sharing. 0 750,000,000 1,500,000,000 2,250,000,000 3,000,000,000 One Reader False Sharing Padding 1,954,712,036 67,091,751 2,165,421,795
  • 18. Cache Coherency Padding must consider access patterns and overall footprint of application. Too much padding is bad.
  • 19. Simultaneous Multithreading SMT technology allows for throughput increases by allowing programs to better utilize processor resources. Figure from “The Architecture of the Nehalem Processor and Nehalem- EP SMP Platforms” (Michael E. Thomadakis)
  • 20. Atomic Operations Atomic operations are typically implemented with the help of the cache coherency mechanism. lock cmpxchg(target, compare, new):
 register = load_and_lock(target);
 if (register == compare)
 store(target, new);
 unlock(target);
 return register; Cache line locking typically only serializes accesses to the target cache line.
  • 21. Atomic Operations In the old commodity processor days, atomic operations were implemented with a bus lock. lock cmpxchg(target, compare, new):
 lock(memory_bus);
 register = load(target);
 if (register == compare)
 store(target, new);
 unlock(memory_bus);
 return register; x86 will assert a bus lock if an atomic operations goes across a cache line boundary. Be careful!
  • 22. Atomic Operations Atomic operations are crucial to efficient synchronization primitives. COMPARE_AND_SWAP(a, b, c): updates a to c if a is equal to b, atomically. LOAD_LINKED(a)/STORE_CONDITIONAL(a, b): Updates a to b if a was not modified between the load- linked (LL) and store-conditional (SC).
  • 23. Topology Most modern multicore systems are NUMA architectures: the throughput and latency of memory accesses varies.
  • 24. Topology The NUMA factor is a ratio that represents the relative cost of a remote memory access. Intel Xeon L5640 machine at 2.27 GHz (12x2) Time Local wake-up ~140ns Remote wake-up ~289ns
  • 25. Topology NUMA effects can be pervasive and difficult to mitigate. Sun x4600
  • 26. Topology Be wary of your operating system’s memory placement mechanisms. First Touch: Allocate page on memory of first processor to touch it. Interleave: Allocate pages round-robin across nodes. More sophisticated schemes exist that do hierarchical allocation, page migration, replication and more.
  • 27. Topology NUMA-oblivious synchronization objects are not only susceptible to performance mismatch but starvation and even livelock under extreme load. 0 25,000,000 50,000,000 75,000,000 100,000,000 C0 C1 C2 C3 1E+07 7E+052E+06 1E+08 Intel(R) Xeon(R) CPU E7- 4850 @ 2.00GHz @ 10x4
  • 28. Fairness Fair locks guarantee starvation-freedom. 0 0 Next Position CK_CC_INLINE static void ck_spinlock_ticket_lock(struct ck_spinlock_ticket *ticket) { unsigned int request; request = ck_pr_faa_uint(&ticket->next, 1); while (ck_pr_load_uint(&ticket->position) != request) ck_pr_stall(); return; }
  • 29. Fairness 1 0 Next Position CK_CC_INLINE static void ck_spinlock_ticket_lock(struct ck_spinlock_ticket *ticket) { unsigned int request; request = ck_pr_faa_uint(&ticket->next, 1); while (ck_pr_load_uint(&ticket->position) != request) ck_pr_stall(); return; } request = 0
  • 30. Fairness 1 0 Next Position CK_CC_INLINE static void ck_spinlock_ticket_lock(struct ck_spinlock_ticket *ticket) { unsigned int request; request = ck_pr_faa_uint(&ticket->next, 1); while (ck_pr_load_uint(&ticket->position) != request) ck_pr_stall(); return; } request = 0
  • 31. Fairness 1 1 Next Position CK_CC_INLINE static void ck_spinlock_ticket_unlock(struct ck_spinlock_ticket *ticket) { unsigned int update; update = ck_pr_load_uint(&ticket->position); ck_pr_store_uint(&ticket->position, update + 1); return; }
  • 32. Fairness 0 6,250,000 12,500,000 18,750,000 25,000,000 C0 C1 C2 C3 4E+064E+064E+064E+06 1E+07 7E+05 2E+06 1E+08 FAS Ticket Intel(R) Xeon(R) CPU E7- 4850 @ 2.00GHz @ 10x4
  • 33. Fairness Fair locks are not a silver bullet and may negatively impact throughput. Fairness comes at the cost of increased sensitivity to preemption and other sources of jitter.
  • 34. Fairness 0 1,500,000 3,000,000 4,500,000 6,000,000 C0 C1 C2 C3 4E+064E+064E+064E+06 5E+065E+065E+065E+06 MCS Ticket Intel(R) Xeon(R) CPU E7- 4850 @ 2.00GHz @ 10x4
  • 35. Distributed Locks Array and queue-based locks provide lock scalability and fairness with distributing spinning and point-to- point wake-up. Thread 1 Thread 2 Thread 3 Thread 4
  • 36. Distributed Locks The MCS lock was a seminal contribution to the area and introduced queue locks to the masses. Thread 1 Thread 2 Thread 3 Thread 4
  • 37. Distributed Locks Thread 1 Thread 2 Thread 3 Thread 4
  • 38. Distributed Locks Thread 1 Thread 2 Thread 3 Thread 4
  • 39. Distributed Locks Thread 1 Thread 2 Thread 3 Thread 4
  • 40. Distributed Locks Thread 1 Thread 2 Thread 3 Thread 4
  • 41. Distributed Locks Thread 1 Thread 2 Thread 3 Thread 4
  • 42. Distributed Locks Thread 1 Thread 2 Thread 3 Thread 4
  • 43. Distributed Locks Thread 1 Thread 2 Thread 3 Thread 4
  • 44. Distributed Locks Thread 1 Thread 2 Thread 3 Thread 4
  • 45. Core 0 Distributed Locks Similar mechanisms exist for read-write locks. Writers Readers Core 1 Core 2 Core 3
  • 46. Core 0 Distributed Locks Big reader locks (brlocks) or Read-Mostly Locks (rmlocks) distribute read-side flags so that readers spin only on local memory. Writers Core 1 Core 2 Reader Reader Core 3 Reader
  • 47. Distributed Locks Similar mechanisms exist for read-write locks. Intel Xeon E5-2630L at 2.40 GHz
  • 48. Limitations of Locks Locks are not composable and are susceptible to priority inversion, livelock, starvation, deadlock and more. A delicate balance must be found between lock hierarchies, granularity and quality of service. A significant delay in one thread holding a synchronization object leads to significant delays for all other threads waiting on the same synchronization object.
  • 49. Limitations of Locks Intel Xeon E5-2630L at 2.40 GHz
  • 50. Lock-less Synchronization With lock-based synchronization, it is sufficient to reason in terms of lock dependencies and critical sections. This model doesn’t work with lock-less synchronization where we must guarantee correctness in much more subtle ways.
  • 51. Memory Models These days, cache coherency helps implement the consistency model. The memory model is specified by the runtime environment and defines the correct behavior of shared memory accesses.
  • 52. Memory Models int r_0; x = 1; r_0 = y; int r_1; y = 1; r_1 = x; int x = 0; int y = 0; if (r_0 == 0 && r_1 == 0) abort(); Core 0 Core 1
  • 53. Memory Models int r_0; x = 1; r_0 = y; int r_1; y = 1; r_1 = x; int x = 0; int y = 0; Core 0 Core 1 (1,1)
  • 54. Memory Models int r_0; x = 1; r_0 = y; int r_1; y = 1; r_1 = x; int x = 0; int y = 0; Core 0 Core 1 (0,1)
  • 55. Memory Models int r_0; x = 1; r_0 = y; int r_1; y = 1; r_1 = x; int x = 0; int y = 0; Core 0 Core 1 (1,0)
  • 56. Memory Models This condition is possible and is an example of store- to-load re-ordering. int r_0; r_0 = y; x = 1; int r_1; r_1 = x; y = 1; Core 0 Core 1 (0,0)
  • 57. Memory Models Modern processors rely on a myriad of techniques to achieve high levels of instruction-level parallelism. Core 0 Execution Engine Memory Order-Buffer L1 L2 Memory Core 1 Execution Engine Memory Order-Buffer L1 L2 x = 1; y = 1; r_0 = y; r_1 = x;
  • 58. Memory Models The processor memory model is specified with respect to loads, stores and atomic operations. TSO RMO Load to Load Load to Store Store to Store Store to Load Atomics Examples x86*, SPARC-TSO ARM, Power
  • 59. Memory Models Ordering guarantees are provided by serializing instructions such as memory fences. mutex_lock(&mutex); x = x + 1; mutex_unlock(&mutex); ? CK_CC_INLINE static void mutex_lock(struct mutex *lock) { while (ck_pr_fas_uint(&lock->value, true) == true); ck_pr_fence_memory(); return; } * Simplified
  • 60. Memory Models Serializing instructions are expensive because they disable some processor optimizations. Atomic instructions are expensive because they either involve serialization (and locking) or are just plain old complex. Intel Core i7-3615QM at 2.30 GHz Throughput (/ second) lock cmpxchg 147,304,564 cmpxchg 458,940,006
  • 61. Lock-less Synchronization Non-blocking synchronization provides very specific progress guarantees and high levels of resilience at the cost of complexity on the fast path. Lock-Less Obstruction-Free Lock-Free Wait-Free Lock-freedom provides system-wide progress guarantees. Wait-freedom provides per- operation progress guarantees.
  • 62. Lock-less Synchronization struct node { void *value; struct node *next; }; void stack_push(struct node **top, struct node *entry, void *value) { entry->value = value; entry->next = *top; *top = entry; return; } struct node * stack_pop(struct node **top) { struct node *r; r = *top; *top = r->next; return r; }
  • 63. Lock-less Synchronization struct node { void *value; struct node *next; }; void stack_push(struct node **top, struct node *entry, void *value) { entry->value = value; do { entry->next = ck_pr_load_ptr(top); } while (ck_pr_cas_ptr(top, entry->next, entry) == false); return; }
  • 64. Lock-less Synchronization struct node { void *value; struct node *next; }; void stack_push(struct node **top, struct node *entry, void *value) { entry->value = value; do { entry->next = ck_pr_load_ptr(top); } while (ck_pr_cas_ptr(top, entry->next, entry) == false); return; } struct node * stack_pop(struct node **top) { struct node *r, *next; do { r = ck_pr_load_ptr(top); if (r == NULL) return NULL; next = ck_pr_load_ptr(&r->next); } while (ck_pr_cas_ptr(top, r, next) == false); return r; }
  • 65. Lock-less Synchronization Source: Nonblocking Algorithms and Scalable Multicore Programming, Samy Bahra Non-blocking synchronization is not a silver bullet. The cost of complexity on the fast path will outweigh the benefits until sufficient levels of contention are reached.
  • 66. Lock-less Synchronization Relaxing correctness constraints and constraining runtime requirements allows for many of the benefits without as much additional complexity and impact on the fast path.
  • 67. #define EMPLOYEE_MAX 8 struct employee { const char *name; unsigned long long number; }; struct directory { struct employee *employee[EMPLOYEE_MAX]; rwlock_t rwlock; }; bool employee_add(struct directory *, const char *, unsigned long long); void employee_delete(struct directory *, const char *); unsigned long long employee_number_get(struct directory *, const char *); rwlock Lock-less Synchronization
  • 68. unsigned long long employee_number_get(struct directory *d, const char *n) { struct employee *em; unsigned long number; size_t i; rwlock_read_lock(&d->rwlock); for (i = 0; i < EMPLOYEE_MAX; i++) { em = d->employee[i]; if (em == NULL) continue; if (strcmp(em->name, n) != 0) continue; number = em->number; rwlock_read_unlock(&d->rwlock); return number; } rwlock_read_unlock(&d->rwlock); return 0; } The rwlock_t object provides correctness at cost of forward progress. “Samy” 2912984911 rwlock Lock-less Synchronization
  • 69. bool employee_add(struct directory *d, const char *n, unsigned long long number) { struct employee *em; size_t i; rwlock_write_lock(&d->rwlock); for (i = 0; i < EMPLOYEE_MAX; i++) { if (d->employee[i] != NULL) continue; em = xmalloc(sizeof *em); em->name = n; em->number = number; d->employee[i] = em; rwlock_write_unlock(&d->rwlock); return true; } rwlock_write_unlock(&d->rwlock); return false; } “Samy” 2912984911 rwlock The rwlock_t object provides correctness at cost of forward progress. Lock-less Synchronization
  • 70. “Samy” 2912984911 rwlock The rwlock_t object provides correctness at cost of forward progress. void employee_delete(struct directory *d, const char *n) { struct employee *em; size_t i; rwlock_write_lock(&d->rwlock); for (i = 0; i < EMPLOYEE_MAX; i++) { if (d->employee[i] == NULL) continue; if (strcmp(d->employee[i]->name, n) != 0) continue; em = d->employee[i]; d->employee[i] = NULL; rwlock_write_unlock(&d->rwlock); free(em); return; } rwlock_write_unlock(&d->rwlock); return; } Lock-less Synchronization
  • 71. void employee_delete(struct directory *d, const char *n) { struct employee *em; size_t i; rwlock_write_lock(&d->rwlock); for (i = 0; i < EMPLOYEE_MAX; i++) { if (d->employee[i] == NULL) continue; if (strcmp(d->employee[i]->name, n) != 0) continue; em = d->employee[i]; d->employee[i] = NULL; rwlock_write_unlock(&d->rwlock); free(em); return; } rwlock_write_unlock(&d->rwlock); return; } “Samy” 2912984911 rwlock If reachability and liveness are coupled, you also protect against a read-reclaim race. Lock-less Synchronization
  • 72. If reachability and liveness are coupled, you also protect against a read-reclaim race. strcmp(em->name, … number = em->number employee_delete waits on readers Time T0 T1 T2 employee_delete destroys object Lock-less Synchronization
  • 73. Decoupling is sometimes necessary, but requires a safe memory reclamation scheme to guarantee that an object cannot be physically destroyed if there are active references to it. static struct employee * employee_number_get(struct directory *d, const char *n, ck_brlock_reader_t *reader) { … ck_brlock_read_lock(&d->brlock, reader); for (i = 0; i < EMPLOYEE_MAX; i++) { em = d->employee[i]; if (em == NULL) continue; if (strcmp(em->name, n) != 0) continue; ck_pr_inc_uint(&em->ref); ck_brlock_read_unlock(reader); return em; } ck_brlock_read_unlock(reader); … Lock-less Synchronization
  • 74. Decoupling is sometimes necessary, but requires a safe memory reclamation scheme to guarantee that an object cannot be physically destroyed if there are active references to it. static void employee_delref(struct employee *em) { bool z; ck_pr_dec_uint_zero(&em->ref, &z); if (z == true) free(em); return; } Lock-less Synchronization
  • 75. Decoupling is sometimes necessary, but requires a safe memory reclamation scheme to guarantee that an object cannot be physically destroyed if there are active references to it. static void employee_delete(struct directory *d, const char *n) { … ck_brlock_write_lock(&d->brlock); for (i = 0; i < EMPLOYEE_MAX; i++) { if (d->employee[i] == NULL) continue; if (strcmp(d->employee[i]->name, n) != 0) continue; em = d->employee[i]; d->employee[i] = NULL; ck_brlock_write_unlock(&d->brlock); employee_delref(em); return; } ck_brlock_write_unlock(&d->brlock); … Lock-less Synchronization
  • 76. Decoupling is sometimes necessary, but requires a safe memory reclamation scheme to guarantee that an object cannot be physically destroyed if there are active references to it. get get logical deleteT0 T1 T2 logical delete physical delete active reference active reference Lock-less Synchronization
  • 77. Concurrent Data Structures static bool employee_add(struct directory *d, const char *n, unsigned long long number) { struct employee *em; size_t i; ck_rwlock_write_lock(&d->rwlock); for (i = 0; i < EMPLOYEE_MAX; i++) { if (d->employee[i] != NULL) continue; em = malloc(sizeof *em); em->name = n; em->number = number; ck_pr_fence_store(); ck_pr_store_ptr(&d->employee[i], em); ck_rwlock_write_unlock(&d->rwlock); return true; } ck_rwlock_write_unlock(&d->rwlock); return false; } static unsigned long long employee_number_get(struct directory *d, const char *n) { struct employee *em; unsigned long number; size_t i; for (i = 0; i < EMPLOYEE_MAX; i++) { em = ck_pr_load_ptr(&d->employee[i]); if (em == NULL) continue; ck_pr_fence_load_depends(); if (strcmp(em->name, n) != 0) continue; number = em->number; return number; } return 0; }
  • 78. Concurrent Data Structures static void employee_delete(struct directory *d, const char *n) { struct employee *em; size_t i; ck_rwlock_write_lock(&d->rwlock); for (i = 0; i < EMPLOYEE_MAX; i++) { if (d->employee[i] == NULL) continue; if (strcmp(d->employee[i]->name, n) != 0) continue; em = d->employee[i]; ck_pr_store_ptr(&d->employee[i], NULL); ck_rwlock_write_unlock(&d->rwlock); /* XXX: When is it safe to free em? */ return; } ck_rwlock_write_unlock(&d->rwlock); return; } static unsigned long long employee_number_get(struct directory *d, const char *n) { struct employee *em; unsigned long number; size_t i; for (i = 0; i < EMPLOYEE_MAX; i++) { em = ck_pr_load_ptr(&d->employee[i]); if (em == NULL) continue; ck_pr_fence_load_depends(); if (strcmp(em->name, n) != 0) continue; number = em->number; return number; } return 0; }
  • 79. EXPERIMENT • Uniform read-mostly workload • Single writer attempts pessimistic add operation at fixed frequency • Readers attempt to get the number of the first employee Environment • 12 cores across 2 sockets • Intel Xeon E5-2630L at 2.40 GHz • Linux 2.6.32 Workload Machine (64GB) NUMANode L#0 (P#0 32GB) Socket L#0 + L3 L#0 (15MB) L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
  • 84. Safe Memory Reclamation A read-reclaim race occurs if an object is destroyed while there are references or accesses to it. strcmp(em->name, … number = em->number free(em) Time T0 T1 T2
  • 85. Safe Memory Reclamation Techniques such as hazard pointers, quiescent-state-based reclamation and epoch-based reclamation protect against read- reclaim races. Schemes such as QSBR and EBR do so without affecting reader progress but without guaranteeing writer progress. Schemes provide strong guarantees on forward progress but require heavy-weight instructions and retry logic for readers. strcmp(em->name, … number = em->number free(em) Time T0 T1 T2
  • 86. BLOCKING SMR SCHEMES • Read-side critical sections smr_read_lock(); <protected section> smr_read_unlock(); • Explicit Reclamation smr_synchronize(); smr_read_begin(); <protected section> smr_read_end();
  • 87. QUIESCENT-STATE-BASED RECLAMATION Time T0 T1 T2 logical delete synchronize q read q read q q read q G0 read q synchronize G1 destroy
  • 88. QUIESCENT-STATE-BASED RECLAMATION employee_number_delete […] ck_pr_store_ptr(slot, NULL); qsbr_synchronize(); free(em); […] Writer Readers […] for (;;) { em = employee_get(…); do_stuff(em); quiesce(); } […] Time ck_pr_store_ptr(slot, NULL); em = employee_get(…); qsbr_synchronize(); do_stuff(em); quiesce(); em = employee_get(…); do_stuff(em); quiesce();em = employee_get(…); do_stuff(em); em = employee_get(…); do_stuff(em); free(em); T0 T1 T2
  • 89. QUIESCENT-STATE-BASED RECLAMATION static void qsbr_synchronize(void) { int i; uint64_t goal; ck_pr_fence_memory(); goal = ck_pr_faa_64(&global.value, 1) + 1; for (i = 0; i < n_reader; i++) { uint64_t *c = &threads.readers[i].counter.value; while (ck_pr_load_64(c) < goal) ck_pr_stall(); } return; } Readers static void qsbr_quiesce(struct thread *th) { uint64_t v; ck_pr_fence_memory(); v = ck_pr_load_64(&global.value); ck_pr_store_64(&th->counter.value, v); ck_pr_fence_memory(); return; } Writers static void qsbr_read_lock(struct thread *th) { ck_pr_barrier(); /* Compiler barrier. */ return; } static void qsbr_read_unlock(struct thread *th) { ck_pr_barrier(); /* Compiler barrier. */ return; }
  • 90. Conclusion There are no silver bullets in multicore synchronization, but a deep understanding of both your workload and your underlying environment may allow you to extract phenomenal performance and reliability increases.
  • 91. The End http://concurrencykit.org http://backtrace.io/ @0xF390 A lot of the content can be found on https:// queue.acm.org/detail.cfm?id=2492433 - along with references.
  • 92. Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations/multicore- synchronization