Lecture: Google Chubby lock service
http://research.google.com/archive/chubby.html
10/15/2014
Romain Jacotin – romain.jacotin@orange.fr

"ZooKeeper, I am your father!" – Chubby MASTER
Agenda
• Introduction
• Design
• Mechanisms for scaling
• Experience
• Summary

The "Part-Time" Parliament – Leslie Lamport
Introduction
Abstract
• The Chubby lock service is intended for use within a loosely-coupled distributed system consisting of a large number of machines (10,000) connected by a high-speed network
  – Provides coarse-grained locking
  – And reliable (low-volume) storage
• Chubby provides an interface much like a distributed file system with advisory locks
  – Whole-file read and write operations (no seek)
  – Advisory locks
  – Notification of various events such as file modification
• Design emphasis
  – Availability
  – Reliability
  – But not high performance / high throughput
• Chubby uses asynchronous consensus: PAXOS with lease timers to ensure liveness
Agenda
• Introduction
• Design
• Mechanisms for scaling
• Experience
• Summary

"Death Star tea infuser"
Design
Google's rationale (2006)
Design choice: Lock Service or Client PAXOS Library?
(slide image: "Initial Death Star design!")
• Client PAXOS Library?
  – Depends on NO other servers (besides the name service…)
  – Provides a standard framework for programmers
• Lock Service?
  – Makes it easier to add availability to a prototype, and to maintain existing program structure and communication patterns
  – Reduces the number of servers on which a client depends, by offering both a name service with consistent client caching and by allowing clients to store and fetch small quantities of data
  – A lock-based interface is more familiar to programmers
  – A lock service uses several replicas to achieve high availability (= quorums), but even a single client can obtain a lock and make progress safely
  → A lock service reduces the number of servers needed for a reliable client system to make progress
Design
Google's rationale (2006)
• Two key design decisions
  – Google chose the lock service (but also provides a PAXOS client library, independent from Chubby, for specific projects)
  – Serve small files to permit elected primaries (= client application masters) to advertise themselves and their parameters
• Decisions based on Google's expected usage and environment
  – Allow thousands of clients to observe Chubby files → an event notification mechanism avoids polling by clients that wish to know about changes
  – Consistent caching semantics are preferred by developers, and caching of files protects the lock service from intense polling
  – Security mechanisms (access control)
  – Provide only coarse-grained locks (long-duration lock = low lock-acquisition rate = less load on the lock server)
Design
System structure
• Two main components that communicate via RPC
  – A replica server
  – A library linked against client applications
• A Chubby cell consists of a small set of servers (typically 5) known as replicas
  – Replicas use a distributed consensus protocol (PAXOS) to elect a master and replicate logs
  – Read and write requests are satisfied by the master alone
  – If a master fails, the other replicas run the election protocol when their master lease expires (a new master is elected in a few seconds)
• Clients find the master by sending master-location requests to the replicas listed in DNS (sketched in code after this slide)
  – Non-master replicas respond by returning the identity of the master
  – Once a client has located the master, it directs all requests to it, either until it ceases to respond or until it indicates that it is no longer the master

(Slide diagram: 1 – the client application, via the Chubby library, asks the DNS servers (8.8.4.4, 8.8.8.8) for chubby.deathstar.sith and gets the five replica addresses 1.1.1.1 … 5.5.5.5; 2 – it sends a "master location?" request to a replica and learns Master = 5.5.5.5; 3 – it initiates a Chubby session with the master)
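The master-discovery flow above fits in a few lines of Go. This is a minimal sketch under stated assumptions: AskMasterLocation stands in for the "master location?" RPC and is not part of any real Chubby library.

```go
// Minimal sketch of the client-side master discovery described above.
package main

import (
	"errors"
	"fmt"
	"net"
)

// AskMasterLocation would send a master-location request to one replica and
// return the address of the current master (RPC elided in this sketch).
func AskMasterLocation(replica string) (string, error) {
	_ = replica
	return "", errors.New("not implemented")
}

// FindMaster resolves the cell name in DNS, then asks the replicas until one
// of them reveals the master's identity.
func FindMaster(cell string) (string, error) {
	replicas, err := net.LookupHost(cell) // e.g. "chubby.deathstar.sith"
	if err != nil {
		return "", err
	}
	for _, r := range replicas {
		if master, err := AskMasterLocation(r); err == nil {
			return master, nil
		}
	}
	return "", fmt.Errorf("no replica of %s answered", cell)
}
```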
Design
PAXOS distributed consensus
• Chubby cell with N = 3 replicas
  – Replicas use a distributed consensus protocol (PAXOS) to elect a master. Quorum = 2 for N = 3
  – The master must obtain votes from a majority of replicas that promise not to elect a different master for an interval of a few seconds (= master lease)
  – The master lease is periodically renewed by the replicas provided the master continues to win a majority of the votes
  – During its master lease, the master maintains copies of a simple database with the replicas (ordered replicated logs)
  – Write requests are propagated via the consensus protocol (PAXOS) to all replicas
  – Read requests are satisfied by the master alone
  – If a master fails, the other replicas run the election protocol when their master lease expires (a new master is elected in a few seconds)

(Slide diagram, message flow – each step requires a quorum; a toy sketch of this election round follows this slide)
  – Prepare = "please vote for me and promise not to vote for someone else for 12 seconds"
  – Promise = "OK, I vote for you and promise not to vote for someone else for 12 seconds"
  – If the proposer receives a quorum of Promises, it is the Master and can send many Accepts during its lease (the proposer votes for itself)
  – Accept = "update your replicated log with this Write client request"
  – Accepted = "I have written your Write client request in my log"
  – If a replica receives a quorum of Accepted messages, the Write is committed (the replica counts an Accepted from itself)
  – Re-Prepare = "I would love to stay the Master, please re-vote for me before the end of the lease so I can extend it"
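A toy Go sketch of the election round described above: a candidate counts Promise replies and only treats itself as master, for the lease duration, once it has a quorum. The types and channel are illustrative assumptions, not Chubby's actual PAXOS code.

```go
// Toy sketch of the master-lease election round.
package main

import "time"

const leaseDuration = 12 * time.Second

type Promise struct{ From int }

// runElection assumes Prepare messages have already been sent to every
// replica; it counts Promise replies. It returns the time until which the
// candidate may act as master, or a zero time if no quorum was reached.
func runElection(replicaCount int, promises <-chan Promise) time.Time {
	quorum := replicaCount/2 + 1 // e.g. 2 out of 3
	votes := 1                   // the candidate votes for itself
	timeout := time.After(leaseDuration)
	for votes < quorum {
		select {
		case <-promises:
			votes++
		case <-timeout:
			return time.Time{} // election round failed
		}
	}
	// During this lease the master serves reads alone and proposes writes;
	// it must Re-Prepare before the lease ends to extend it.
	return time.Now().Add(leaseDuration)
}
```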
Design
Files & directories
• Chubby exports a file system interface simpler than Unix
  – Tree of files and directories with name components separated by slashes
  – Each directory contains a list of child files and directories (collectively called nodes)
  – Each file contains a sequence of uninterpreted bytes
  – No symbolic or hard links
  – No directory modified times, no last-access times (to make it easier to cache file metadata)
  – No path-dependent permission semantics: a file is controlled by the permissions on the file itself

Example name: /ls/dc-tatooine/bigtable/root-tablet
  – The ls prefix is common to all Chubby names: it stands for lock service
  – The second component, dc-tatooine, is the name of the Chubby cell; it is resolved to one or more Chubby servers via a DNS lookup
  – The remainder of the name is interpreted within the named Chubby cell
Design
Files & directories: Nodes
• Nodes (= files or directories) may be either permanent or ephemeral
• Ephemeral files are used as temporary files, and act as indicators to others that a client is alive (see the sketch after this slide)
• Any node may be deleted explicitly
  – Ephemeral file nodes are also deleted if no client has them open
  – Ephemeral directory nodes are also deleted if they are empty
• Any node can act as an advisory reader/writer lock
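A small sketch of the "ephemeral file as liveness indicator" pattern, assuming a hypothetical client interface with an ephemeral flag on Open(); the real Chubby client API exposes ephemeral nodes differently, and the path used here is made up.

```go
// Illustrative ephemeral-node liveness indicator.
package main

type Handle interface{ Close() error }

type ChubbyClient interface {
	// Open creates or opens a node; an ephemeral file disappears once no
	// client holds it open.
	Open(path string, ephemeral bool) (Handle, error)
}

// advertiseAlive keeps an ephemeral file open for the lifetime of the worker;
// other clients watching the directory treat its presence as "worker alive".
// The path below is hypothetical.
func advertiseAlive(c ChubbyClient, worker string) (Handle, error) {
	return c.Open("/ls/dc-tatooine/workers/"+worker, true)
}
```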
Design
Files & directories: Metadata (collected into a struct sketch after this slide)
• 3 ACLs
  – Three names of access control lists (ACLs) used to control reading, writing and changing the ACL names for the node
  – A node inherits the ACL names of its parent directory on creation
  – ACLs are themselves files located in "/ls/dc-tatooine/acl" (an ACL file consists of a simple list of names of principals)
  – Users are authenticated by a mechanism built into the Chubby RPC system
• 4 monotonically increasing 64-bit numbers
  1. Instance number: greater than the instance number of any previous node with the same name
  2. Content generation number (files only): increases when the file's contents are written
  3. Lock generation number: increases when the node's lock transitions from free to held
  4. ACL generation number: increases when the node's ACL names are written
• 64-bit checksum
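The metadata listed above, gathered into one illustrative Go struct; the field names are mine, only the counters themselves come from the slide.

```go
// Per-node metadata as an illustrative struct.
package main

type NodeMetadata struct {
	ReadACL, WriteACL, ChangeACL string // names of ACL files under /ls/<cell>/acl

	InstanceNumber    uint64 // greater than any previous node with the same name
	ContentGeneration uint64 // files only: bumped on every content write
	LockGeneration    uint64 // bumped when the lock goes from free to held
	ACLGeneration     uint64 // bumped when the node's ACL names are written

	Checksum uint64 // 64-bit checksum of the file contents
}
```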
Design
Files & directories: Handles (see the struct sketch after this slide)
• Clients open nodes to obtain handles (analogous to UNIX file descriptors)
• Handles include:
  – Check digits: prevent clients from creating or guessing handles → full access-control checks need to be performed only when handles are created
  – A sequence number: lets the master know whether a handle was generated by it or by a previous master
  – Mode information (provided at open time): allows the master to recreate its state if an old handle is presented to a newly restarted master
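For concreteness, the three handle components above could be pictured as a struct like the following; the actual encoding is not public, so this is purely illustrative.

```go
// Illustrative picture of what a Chubby handle carries.
package main

type ChubbyHandle struct {
	NodePath    string
	CheckDigits uint32 // make handles hard for clients to forge or guess
	Sequence    uint64 // tells the master which master generated the handle
	Mode        uint32 // open-time mode, enough to rebuild state after a master restart
}
```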
Design
Locks, Sequencers and Lock-delay
• Each Chubby file and directory can act as a reader-writer lock (locks are advisory)
• Acquiring a lock in either mode requires write permission
  – Exclusive mode (writer): one client may hold the lock
  – Shared mode (reader): any number of client handles may hold the lock
• A lock holder can request a Sequencer: an opaque byte string describing the state of the lock immediately after acquisition
  – Name of the lock + lock mode (exclusive or shared) + lock generation number
• Sequencer usage (see the sketch after this slide)
  – The application's master can generate a sequencer and send it with any internal order it sends to other servers
  – Application servers that receive orders from a master can check with Chubby whether the sequencer is still good (= not a stale master)
• Lock-delay: the lock server prevents other clients from claiming the lock during the lock-delay period if the lock becomes free
  – A client may specify any lock-delay up to 60 seconds
  – This limit prevents a faulty client from making a lock unavailable for an arbitrarily long time
  – Lock-delay protects unmodified servers and clients from everyday problems caused by message delays and restarts…
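The sequencer-checking pattern above, sketched in Go. The Sequencer fields follow the slide; checkSequencer stands in for Chubby's CheckSequencer() call, whose exact signature is an assumption here.

```go
// Guarding against a stale master with a sequencer.
package main

import "fmt"

type Sequencer struct {
	LockName   string
	Exclusive  bool
	Generation uint64
}

// handleOrder is how an application server might guard against a stale
// master: it executes the order only if Chubby still considers the sequencer
// valid for that lock.
func handleOrder(checkSequencer func(Sequencer) (bool, error), s Sequencer, execute func() error) error {
	ok, err := checkSequencer(s)
	if err != nil || !ok {
		return fmt.Errorf("rejecting order: sequencer for %q is stale or unverifiable", s.LockName)
	}
	return execute()
}
```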
Design
Events
• Session events can be received by the application
  – Jeopardy: when the session lease times out and the grace period begins (see Fail-overs later ;-)
  – Safe: when the session is known to have survived a communication problem
  – Expired: if the session times out
• Handle events: clients may subscribe to a range of events when they create a handle (= Open phase)
  – File contents modified
  – Child node added/removed/modified
  – Master failed over
  – A handle (and its lock) has become invalid
  – Lock acquired
  – Conflicting lock request from another client
• These events are delivered to the clients asynchronously via an up-call from the Chubby library
• Mike Burrows: "The last two events mentioned are rarely used, and with hindsight could have been omitted."
Design
API (a small usage sketch follows this list)
• Open/Close a node by name
  – func Open()        Handles are created by Open() and destroyed by Close()
  – func Close()       This call never fails
• Poison
  – func Poison()      Allows a client to cancel Chubby calls made by other threads without fear of de-allocating the memory being accessed by them
• Read/Write full contents
  – func GetContentsAndStat()   Atomic reading of the entire content and metadata
  – func GetStat()              Reading of the metadata only
  – func ReadDir()              Reading of the names and metadata of the directory's children
  – func SetContents()          Atomic writing of the entire content
• ACL
  – func SetACL()      Changes the ACLs for a node
• Delete node
  – func Delete()      Deletes the node if it has no children
• Lock
  – func Acquire()     Acquires a lock
  – func TryAcquire()  Tries to acquire a potentially conflicting lock by sending a "conflicting lock request" to the holder
  – func Release()     Releases a lock
• Sequencer
  – func GetSequencer()    Returns a sequencer that describes any lock held by this handle
  – func SetSequencer()    Associates a sequencer with a handle; subsequent operations on the handle fail if the sequencer is no longer valid
  – func CheckSequencer()  Checks whether a sequencer is valid
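A hedged sketch of how a would-be primary might combine these calls for the election scenarios shown on the next slides. The Client interface below exists only so the example compiles; the real client library's signatures differ.

```go
// Primary election using the calls listed above (illustrative only).
package main

type Handle interface{}

type Client interface {
	Open(path string) (Handle, error)
	TryAcquire(h Handle) error
	SetContents(h Handle, data []byte) error
	GetSequencer(h Handle) (string, error) // sequencer as an opaque byte string
}

// electPrimary tries to become the application master: it takes the exclusive
// lock, advertises its name in the lock file, and keeps the sequencer to
// attach to every order it later sends to workers.
func electPrimary(c Client, myName string) (string, error) {
	h, err := c.Open("/ls/empire/deathstar/master") // lock file used on the next slides
	if err != nil {
		return "", err
	}
	if err := c.TryAcquire(h); err != nil {
		return "", err // someone else is primary; subscribe to events instead
	}
	if err := c.SetContents(h, []byte(myName)); err != nil {
		return "", err
	}
	return c.GetSequencer(h)
}
```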
Design
Design n°1
• Primary election example without sequencer usage, but with lock-delay (worst design)

(Slide diagram: a Chubby cell with its PAXOS leader, an application scheduler acting as Master, and application workers)
  – The first server to get the exclusive lock on the file "/ls/empire/deathstar/master" is the Master; it writes its name in the file and sets a lock-delay of 1 minute; the two other servers only receive event notifications about this file
  – The Master sends an order to a worker ("Execute order 66!"); the worker executes the request
  – If the lock on "/ls/empire/deathstar/master" becomes free because the holder has failed or become inaccessible, the lock server prevents the other backup servers from claiming the lock during the 1-minute lock-delay
Design
Design n°2
• Primary election example with sequencer usage (best design)

(Slide diagram: a Chubby cell with its PAXOS leader, an application scheduler acting as Master, and application workers)
  – The first server to get the exclusive lock on the file "/ls/empire/deathstar/master" is the Master; it writes its name in the file and gets a sequencer; the two other servers only receive event notifications about this file
  – The Master sends an order to a worker ("Execute order 66!") by adding the sequencer to the request
  – The worker must check the sequencer before executing the request; the check is made against the worker's Chubby cache
Design
Design n°3
• Primary election example with sequencer usage (optimized design)

(Slide diagram: a Chubby cell with its PAXOS leader, an application scheduler acting as Master, and an army of workers)
  – The first server to get the exclusive lock on the file "/ls/empire/deathstar/master" is the Master; it writes its name in the file and gets a sequencer; the two other servers only receive event notifications about this file
  – The Master sends an order to a worker ("Destroy Alderaan!") by adding the sequencer to the request
  – The worker must check the sequencer before executing the request; if the worker does not wish to maintain a session with Chubby, it checks against the most recent sequencer that it has observed
Design
Caching
• Clients cache file data and node metadata in a consistent, write-through, in-memory cache
  – The cache is maintained by a lease mechanism (a client that allows its cache lease to expire then informs the lock server)
  – The cache is kept consistent by invalidations sent by the master, which keeps a list of what each client may be caching
• When file data or metadata is to be changed (see the sketch after this slide)
  – The master blocks the modification while sending invalidations for the data to every client that may have cached it
  – A client that receives an invalidation flushes the invalidated state and acknowledges by making its next KeepAlive call
  – The modification proceeds only after the server knows that each client has invalidated its cache, either by acknowledging the invalidation or because the client allowed its cache lease to expire

(Slide diagram: a writer asks the master to change the content of the file "ls/empire/tux"; two other clients have the file cached; the master sends cache invalidations to the clients that cache the file; the clients flush their caches for the file and acknowledge the master; the master then writes the change to the Chubby replicated database and acknowledges the writer)
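The "invalidate before modifying" rule above, seen from the master's side as a Go sketch; the cachers map and the callback parameters are assumptions used only to make the ordering explicit.

```go
// Master-side write path: invalidate caches, wait, then commit.
package main

type clientID string

// writeFile delays the modification until every client that may cache the
// node has either acknowledged the invalidation or let its cache lease expire.
func writeFile(node string, data []byte,
	cachers map[string][]clientID,
	sendInvalidation func(clientID, string),
	awaitAckOrLeaseExpiry func(clientID),
	commit func(string, []byte)) {

	for _, c := range cachers[node] {
		sendInvalidation(c, node) // delivered with the client's next KeepAlive reply
	}
	for _, c := range cachers[node] {
		awaitAckOrLeaseExpiry(c) // the ack arrives on the client's next KeepAlive call
	}
	commit(node, data) // only now is the write applied to the replicated database
}
```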
Design
Sessions and KeepAlives
Handles, locks, and cached data all remain valid while the session remains valid
• A client requests a new session on first contacting the master of a Chubby cell
• A client ends the session explicitly either when it terminates, or if the session has been idle
  – Session idle = no open handles and no calls for a minute
• Each session has an associated lease interval
  – Lease interval = the master guarantees not to terminate the session unilaterally until the lease timeout
  – Lease timeout = the end of the lease interval

(Slide diagram: timeline of lease intervals, with the jeopardy state and the grace period)
Design
Sessions and KeepAlives
• The master advances the lease timeout
  – On session creation
  – When a master fail-over occurs
  – When it responds to a KeepAlive RPC from the client:
    • On receiving a KeepAlive (1), the master blocks the RPC until the client's previous lease interval is close to expiring
    • The master later allows the RPC to return to the client (2) and informs it of the new lease timeout (= lease M2)
    • The master uses a default extension of 12 seconds; an overloaded master may use higher values
    • The client initiates a new KeepAlive immediately after receiving the previous reply (a client-side loop sketch follows this slide)
• Protocol optimization: the KeepAlive reply is used to transmit events and cache invalidations back to the client

(Slide diagram: timeline of blocked KeepAlive RPCs and lease intervals, with the jeopardy state and the grace period)
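A sketch of the client-side KeepAlive loop implied by this slide: one KeepAlive is always outstanding, the master parks it until the previous lease is nearly over, and the reply both extends the lease and carries piggybacked events and invalidations. keepAliveRPC is a stand-in for the real RPC and the reply shape is an assumption.

```go
// Client-side KeepAlive loop (illustrative).
package main

import "time"

type keepAliveReply struct {
	NewLeaseTimeout time.Time
	Events          []string // piggybacked events and cache invalidations
}

func keepAliveLoop(keepAliveRPC func() (keepAliveReply, error),
	handleEvent func(string), enterJeopardy func()) {
	for {
		reply, err := keepAliveRPC() // blocks at the master until near lease expiry
		if err != nil {
			enterJeopardy() // local lease ran out: flush caches, start the grace period
			return
		}
		for _, ev := range reply.Events {
			handleEvent(ev)
		}
		// Loop immediately so the next KeepAlive is already in flight.
	}
}
```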
Design
Fail-overs
• When a master fails or otherwise loses mastership
  – It discards its in-memory state about sessions, handles, and locks
  – The session lease timer is stopped
  – If a master election occurs quickly, clients can contact the new master before their local lease timers expire
  – If the election takes a long time, clients flush their caches (= JEOPARDY) and wait for the GRACE PERIOD (45 seconds) while trying to find the new master

(Slide diagram: timeline showing the jeopardy state and the grace period during a fail-over)
Design
Fail-overs: Newly elected master's tasks (initial design)
1. Picks a new client epoch number (which clients are required to present on every call)
2. Responds to master-location requests, but does not at first process incoming session-related operations
3. Builds in-memory data structures for the sessions and locks recorded in the database. Session leases are extended to the maximum that the previous master may have been using
4. Lets clients perform KeepAlives, but no other session-related operations
5. Emits a fail-over event to each session: clients flush their caches and warn applications that other events may have been lost
6. Waits until each session expires or acknowledges the fail-over event
7. Now allows all operations to proceed
8. If a client uses a handle created prior to the fail-over, the master recreates the in-memory representation of the handle and then honors the call
9. After some interval (1 minute), the master deletes ephemeral files that have no open file handles: clients should refresh handles on ephemeral files during this interval after a fail-over
Design
Database implementation
• Simple key/value database using write-ahead logging and snapshotting
• Chubby needs atomic operations only (no general transactions)
• The database log is distributed among the replicas using a distributed consensus protocol (PAXOS)

Backup
• Every few hours, the master writes a snapshot of its database to a GFS file server in a different building = no cyclic dependencies (because the local GFS uses the local Chubby cell…)
• Backups are used for disaster recovery
• Backups are used for initializing the database of a newly replaced replica = no load on the other replicas
Design
Mirroring
• Chubby allows a collection of files to be mirrored from one cell to another
  – Mirroring is fast: the files are small, and the event mechanism informs the mirroring code immediately if a file is added, deleted, or modified
  – An unreachable mirror remains unchanged until connectivity is restored; updated files are then identified by comparing checksums
  – Used to copy configuration files to various computing clusters distributed around the world
• A special "global" Chubby cell
  – The subtree "/ls/global/master" is mirrored to the subtree "/ls/cell/slave" in every other Chubby cell
  – The "global" cell's replicas are located in widely-separated parts of the world
  – Usage:
    • Chubby's own ACLs
    • Various files in which Chubby cells and other systems advertise their presence to monitoring services
    • Pointers that allow clients to locate large data sets (Bigtable cells), and many configuration files
Agenda
• Introduction
• Design
• Mechanisms for scaling
• Experience
• Summary

"Judge me by my size, do you?" – Yoda
Mechanisms for scaling
90,000+ clients communicating with one Chubby master!
• Chubby's clients are individual processes, so Chubby handles more clients than expected
• Effective scaling techniques = reduce communication with the master
  – Minimize RTT: an arbitrary number of Chubby cells, so that clients almost always use a nearby cell (found with DNS) and avoid reliance on remote machines (= 1 Chubby cell in a datacenter for several thousand machines)
  – Minimize KeepAlive load: increase lease times from the default of 12 s up to around 60 s under heavy load (= fewer KeepAlive RPCs to process)
  – Optimized caching: Chubby clients cache file data, metadata, the absence of files, and open handles
  – Protocol-conversion servers: servers that translate the Chubby protocol into less-complex protocols (DNS, …)

(Slide diagram: Chubby clients sending READ & WRITE traffic to the master of a Chubby cell)
Mechanisms for scaling
Proxies
• Chubby's protocol can be proxied
  – Same protocol on both sides
  – A proxy can reduce server load by handling both KeepAlive and read requests
  – A proxy cannot reduce write traffic
• Proxies allow a significant increase in the number of clients
  – If an external proxy handles N clients, KeepAlive traffic is reduced by a factor of N (which could be 10,000 or more!)
  – The proxy cache can reduce read traffic by at most the mean amount of read-sharing

(Slide diagram: Chubby clients reach the cell through the replicas acting as internal proxies; READ traffic can be absorbed by the proxies, WRITE traffic still reaches the master)
Mechanisms for scaling
Partitioning (intended to enable large Chubby cells with little communication between the partitions)
• The code can partition the namespace by directory (see the sketch after this slide)
  – Chubby cell = N partitions
  – Each partition has a set of replicas and a master
  – Every node D/C in directory D would be stored on the partition P(D/C) = hash(D) mod N
  – Metadata for D may be stored on a different partition P(D) = hash(D') mod N, where D' is the parent of D
  – A few operations still require cross-partition communication
    • ACLs: one partition may use another for permission checks (only Open() and Delete() calls require ACL checks)
    • When a directory is deleted, a cross-partition call may be needed to ensure that the directory is empty
• Unless the number of partitions N is large, each client would still contact the majority of the partitions
  – Partitioning reduces read and write traffic on any given partition by a factor of N ☺
  – Partitioning does not necessarily reduce KeepAlive traffic…
• Partitioning is implemented in the code but not activated, because Google does not need it (2006)
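The partitioning rule quoted above, written out in Go. The FNV hash is my own choice for illustration; the paper does not say which hash function Chubby's partitioning code uses.

```go
// Directory-based partitioning: node D/C lands on hash(D) mod N, and D's own
// metadata lands on hash(parent(D)) mod N.
package main

import (
	"hash/fnv"
	"path"
)

func hashDir(dir string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(dir))
	return h.Sum64()
}

// partitionForNode returns the partition storing node "D/C".
func partitionForNode(node string, n uint64) uint64 {
	return hashDir(path.Dir(node)) % n // P(D/C) = hash(D) mod N
}

// partitionForDirMetadata returns the partition storing directory D's metadata.
func partitionForDirMetadata(dir string, n uint64) uint64 {
	return hashDir(path.Dir(dir)) % n // P(D) = hash(D') mod N, where D' is the parent of D
}
```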
Agenda
• Introduction
• Design
• Mechanisms for scaling
• Experience
• Summary

"Do. Or do not. There is no try." – Yoda
Experience (2006)
Use and behavior: statistics taken as a snapshot of a Chubby cell (RPC rate over a 10-minute period)

(Slide shows the table of cell statistics from the paper)
Experience (2006)
Use and behavior
Typical causes of outages
• 61 outages over a period of a few weeks, amounting to 700 cell-days of data in total
  – 52 outages lasted < 30 seconds → most applications are not affected significantly by Chubby outages under 30 seconds
  – 4 caused by network maintenance
  – 2 caused by suspected network connectivity problems
  – 2 caused by software errors
  – 1 caused by overload
A few dozen cell-years of operation
• Data lost on 6 occasions
  – 4 database errors
  – 2 operator errors
Overload
• Typically occurs when more than 90,000 sessions are active, or when millions of reads arrive simultaneously
Experience (2006)
Java clients
• Chubby is written in C++, like most of Google's infrastructure
• Problem: a growing number of systems are being written in Java
  – Java programmers dislike the Java Native Interface (slow and cumbersome) for accessing non-native libraries
  – Chubby's C++ client library is 7,000 lines: maintaining a Java version of the library is delicate and too expensive
• Solution: Java users run copies of a protocol-conversion server that exports a simple RPC protocol corresponding closely to Chubby's client API
• Mike Burrows (2006): "Even with hindsight, it is not obvious how we might have avoided the cost of writing, running and maintaining this additional server"

(Slide diagram: a Java application in a JVM talks to a local Chubby protocol-conversion daemon on 127.0.0.1:42, which talks to the master of the Chubby cell)
Experience (2006)
Use as a name service
• Chubby was designed as a lock service, but a popular use was as a name server
• DNS caching is based on time
  – Caching is inconsistent even with a small TTL = DNS data is discarded when not refreshed within the TTL period
  – A low TTL overloads the DNS servers
• Chubby caching uses explicit invalidations
  – Consistent caching
  – No polling
• Chubby DNS server
  – Another protocol-conversion server that makes the naming data stored within Chubby available to DNS clients: it eases the transition from DNS names to Chubby names, and accommodates existing applications that cannot be converted easily, such as browsers

(Slide diagram: DNS requests reach the Chubby protocol-conversion server, which issues Chubby requests to the master of the Chubby cell in order to resolve old and new server names)
Experience (2006)
Problems with fail-over
• The original design requires the master to write new sessions to the database as they are created
  – Overhead on the Berkeley DB version of the lock server!
• The new design avoids recording sessions in the database
  – Sessions are recreated in the same way the master currently recreates handles → newly elected master's task n°8
  – A new master must now wait a full worst-case lease timeout before allowing operations to proceed, because it cannot know whether all sessions have checked in → newly elected master's task n°6
  – Proxy fail-over is made possible, because proxy servers can now manage sessions that the master is not aware of
    • An extra operation, available only to trusted proxy servers, lets one proxy take over a client from another when a proxy fails

(Slide diagram: example of external proxy server fail-over, with external proxy servers 1 and 2 in front of the master of the Chubby cell)
Design
Problems with fail-over: Newly elected master's tasks (new design)
1. Picks a new client epoch number (which clients are required to present on every call)
2. Responds to master-location requests, but does not at first process incoming session-related operations
3. Builds in-memory data structures for the sessions and locks recorded in the database. Session leases are extended to the maximum that the previous master may have been using
4. Lets clients perform KeepAlives, but no other session-related operations
5. Emits a fail-over event to each session: clients flush their caches and warn applications that other events may have been lost
6. Waits until each session expires or acknowledges the fail-over event
7. Now allows all operations to proceed
8. If a client uses a handle created prior to the fail-over, the master recreates the in-memory representation of the session and the handle and then honors the call
9. After some interval (1 minute), the master deletes ephemeral files that have no open file handles: clients should refresh handles on ephemeral files during this interval after a fail-over
Experience (2006)
Abusive clients
• Many services use shared Chubby cells: clients need to be isolated from the misbehavior of others
Problems encountered:
1. Lack of aggressive caching
  – Developers regularly write loops that retry indefinitely when a file is not present, or poll a file by opening and closing it repeatedly
  – Need to cache the absence of a file and to reuse open file handles
  – This requires spending more time on developer education, but in the end it was easier to make repeated Open() calls cheap…
2. Lack of quotas
  – Chubby was never intended to be used as a storage system for large amounts of data!
  – A file size limit was introduced: 256 KB
3. Publish/subscribe
  – Chubby's design is not suited to using its event mechanism as a publish/subscribe system!
  – Project reviews cover Chubby usage and growth predictions (RPC rate, disk space, number of files) → bad usage needs to be tracked…
Experience (2006)
Lessons learned
• Developers rarely consider availability
  – They are inclined to treat a service like Chubby as though it were always available
  – They fail to appreciate the difference between a service being up and that service being available to their applications
  – API choices can affect the way developers choose to handle Chubby outages: many developers chose to crash their applications when a master fail-over takes place, but the original intent was for clients to check for possible changes…
• 3 mechanisms to prevent developers from being over-optimistic about Chubby availability
  1. Project reviews
  2. Supplying libraries that perform high-level tasks, so that developers are automatically isolated from Chubby outages
  3. A post-mortem of each Chubby outage: eliminates bugs in Chubby and in operational procedures, and reduces applications' sensitivity to Chubby's availability
Experience (2006)
Opportunities for design changes
• Fine-grained locking could be ignored
  – Developers must remove unnecessary communication to optimize their applications anyway, and that means finding a way to use coarse-grained locking
• Poor API choices have unexpected effects
  – One mistake: the only means for cancelling long-running calls are the Close() and Poison() RPCs, which also discard the server state for the handle… → a Cancel() RPC may be added to allow more sharing of open handles
• RPC use affects transport protocols
  – KeepAlives are used both for refreshing the client's lease and for passing events and cache invalidations from the master to the client: TCP's back-off policies pay no attention to higher-level timeouts such as Chubby leases, so TCP-based KeepAlives led to many lost sessions at times of high network congestion → Google was forced to send KeepAlive RPCs via UDP rather than TCP…
  – The protocol may be augmented with an additional TCP-based GetEvent() RPC, used to communicate events and invalidations in the normal case, in the same way as KeepAlives. The KeepAlive reply would still contain a list of unacknowledged events, so that events must eventually be acknowledged
Agenda
• Introduction
• Design
• Mechanisms for scaling
• Experience
• Summary

"May the lock service be with you."
Summary
Chubby lock service
• Chubby is a distributed lock service for coarse-grained synchronization of distributed systems
  – Distributed consensus among a few replicas for fault-tolerance
  – Consistent client-side caching
  – Timely notification of updates
  – Familiar file system interface
• Chubby has become Google's primary internal name service
  – A common rendezvous mechanism for systems such as MapReduce
  – Used to elect a primary from redundant replicas (GFS and Bigtable)
  – A standard repository for files that require high availability (ACLs)
  – A well-known and available location to store a small amount of metadata (= the root of distributed data structures)
• Bigtable usage
  – To elect a master
  – To allow the master to discover the servers it controls
  – To permit clients to find the master
Any questions?

"How to enlarge a light saber?"
Romain Jacotin – romain.jacotin@orange.fr

Contenu connexe

Tendances

Storm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computationStorm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computationnathanmarz
 
Hadoop World 2011: Advanced HBase Schema Design
Hadoop World 2011: Advanced HBase Schema DesignHadoop World 2011: Advanced HBase Schema Design
Hadoop World 2011: Advanced HBase Schema DesignCloudera, Inc.
 
Demystifying flink memory allocation and tuning - Roshan Naik, Uber
Demystifying flink memory allocation and tuning - Roshan Naik, UberDemystifying flink memory allocation and tuning - Roshan Naik, Uber
Demystifying flink memory allocation and tuning - Roshan Naik, UberFlink Forward
 
How to tune Kafka® for production
How to tune Kafka® for productionHow to tune Kafka® for production
How to tune Kafka® for productionconfluent
 
Linux monitoring and Troubleshooting for DBA's
Linux monitoring and Troubleshooting for DBA'sLinux monitoring and Troubleshooting for DBA's
Linux monitoring and Troubleshooting for DBA'sMydbops
 
What is new in PostgreSQL 14?
What is new in PostgreSQL 14?What is new in PostgreSQL 14?
What is new in PostgreSQL 14?Mydbops
 
Creating a complete disaster recovery strategy
Creating a complete disaster recovery strategyCreating a complete disaster recovery strategy
Creating a complete disaster recovery strategyMariaDB plc
 
Postgresql database administration volume 1
Postgresql database administration volume 1Postgresql database administration volume 1
Postgresql database administration volume 1Federico Campoli
 
Introduction to distributed file systems
Introduction to distributed file systemsIntroduction to distributed file systems
Introduction to distributed file systemsViet-Trung TRAN
 
Iceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data AnalyticsIceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data AnalyticsAlluxio, Inc.
 
Kafka replication apachecon_2013
Kafka replication apachecon_2013Kafka replication apachecon_2013
Kafka replication apachecon_2013Jun Rao
 
Grokking TechTalk #33: High Concurrency Architecture at TIKI
Grokking TechTalk #33: High Concurrency Architecture at TIKIGrokking TechTalk #33: High Concurrency Architecture at TIKI
Grokking TechTalk #33: High Concurrency Architecture at TIKIGrokking VN
 
How to size up an Apache Cassandra cluster (Training)
How to size up an Apache Cassandra cluster (Training)How to size up an Apache Cassandra cluster (Training)
How to size up an Apache Cassandra cluster (Training)DataStax Academy
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon RedshiftAmazon Web Services
 
What is in a Lucene index?
What is in a Lucene index?What is in a Lucene index?
What is in a Lucene index?lucenerevolution
 
Apache pulsar - storage architecture
Apache pulsar - storage architectureApache pulsar - storage architecture
Apache pulsar - storage architectureMatteo Merli
 
Zookeeper Introduce
Zookeeper IntroduceZookeeper Introduce
Zookeeper Introducejhao niu
 
Apache Cassandra and DataStax Enterprise Explained with Peter Halliday at Wil...
Apache Cassandra and DataStax Enterprise Explained with Peter Halliday at Wil...Apache Cassandra and DataStax Enterprise Explained with Peter Halliday at Wil...
Apache Cassandra and DataStax Enterprise Explained with Peter Halliday at Wil...DataStax Academy
 

Tendances (20)

Storm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computationStorm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computation
 
Google file system
Google file systemGoogle file system
Google file system
 
Hadoop World 2011: Advanced HBase Schema Design
Hadoop World 2011: Advanced HBase Schema DesignHadoop World 2011: Advanced HBase Schema Design
Hadoop World 2011: Advanced HBase Schema Design
 
Demystifying flink memory allocation and tuning - Roshan Naik, Uber
Demystifying flink memory allocation and tuning - Roshan Naik, UberDemystifying flink memory allocation and tuning - Roshan Naik, Uber
Demystifying flink memory allocation and tuning - Roshan Naik, Uber
 
How to tune Kafka® for production
How to tune Kafka® for productionHow to tune Kafka® for production
How to tune Kafka® for production
 
Linux monitoring and Troubleshooting for DBA's
Linux monitoring and Troubleshooting for DBA'sLinux monitoring and Troubleshooting for DBA's
Linux monitoring and Troubleshooting for DBA's
 
Google Spanner
Google SpannerGoogle Spanner
Google Spanner
 
What is new in PostgreSQL 14?
What is new in PostgreSQL 14?What is new in PostgreSQL 14?
What is new in PostgreSQL 14?
 
Creating a complete disaster recovery strategy
Creating a complete disaster recovery strategyCreating a complete disaster recovery strategy
Creating a complete disaster recovery strategy
 
Postgresql database administration volume 1
Postgresql database administration volume 1Postgresql database administration volume 1
Postgresql database administration volume 1
 
Introduction to distributed file systems
Introduction to distributed file systemsIntroduction to distributed file systems
Introduction to distributed file systems
 
Iceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data AnalyticsIceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data Analytics
 
Kafka replication apachecon_2013
Kafka replication apachecon_2013Kafka replication apachecon_2013
Kafka replication apachecon_2013
 
Grokking TechTalk #33: High Concurrency Architecture at TIKI
Grokking TechTalk #33: High Concurrency Architecture at TIKIGrokking TechTalk #33: High Concurrency Architecture at TIKI
Grokking TechTalk #33: High Concurrency Architecture at TIKI
 
How to size up an Apache Cassandra cluster (Training)
How to size up an Apache Cassandra cluster (Training)How to size up an Apache Cassandra cluster (Training)
How to size up an Apache Cassandra cluster (Training)
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon Redshift
 
What is in a Lucene index?
What is in a Lucene index?What is in a Lucene index?
What is in a Lucene index?
 
Apache pulsar - storage architecture
Apache pulsar - storage architectureApache pulsar - storage architecture
Apache pulsar - storage architecture
 
Zookeeper Introduce
Zookeeper IntroduceZookeeper Introduce
Zookeeper Introduce
 
Apache Cassandra and DataStax Enterprise Explained with Peter Halliday at Wil...
Apache Cassandra and DataStax Enterprise Explained with Peter Halliday at Wil...Apache Cassandra and DataStax Enterprise Explained with Peter Halliday at Wil...
Apache Cassandra and DataStax Enterprise Explained with Peter Halliday at Wil...
 

Similaire à Google Chubby Lock Service Lecture Overview

File service architecture and network file system
File service architecture and network file systemFile service architecture and network file system
File service architecture and network file systemSukhman Kaur
 
Gfs google-file-system-13331
Gfs google-file-system-13331Gfs google-file-system-13331
Gfs google-file-system-13331Fengchang Xie
 
Cloud computing UNIT 2.1 presentation in
Cloud computing UNIT 2.1 presentation inCloud computing UNIT 2.1 presentation in
Cloud computing UNIT 2.1 presentation inRahulBhole12
 
Kafka and ibm event streams basics
Kafka and ibm event streams basicsKafka and ibm event streams basics
Kafka and ibm event streams basicsBrian S. Paskin
 
Leveraging Structured Data To Reduce Disk, IO & Network Bandwidth
Leveraging Structured Data To Reduce Disk, IO & Network BandwidthLeveraging Structured Data To Reduce Disk, IO & Network Bandwidth
Leveraging Structured Data To Reduce Disk, IO & Network BandwidthPerforce
 
Resource Management of Docker
Resource Management of DockerResource Management of Docker
Resource Management of DockerSpeedyCloud
 
Lustre And Nfs V4
Lustre And Nfs V4Lustre And Nfs V4
Lustre And Nfs V4awesomesos
 
GFS - Google File System
GFS - Google File SystemGFS - Google File System
GFS - Google File Systemtutchiio
 
11 distributed file_systems
11 distributed file_systems11 distributed file_systems
11 distributed file_systemslongly
 
Linux Memory Analysis with Volatility
Linux Memory Analysis with VolatilityLinux Memory Analysis with Volatility
Linux Memory Analysis with VolatilityAndrew Case
 
Using filesystem capabilities with rsync
Using filesystem capabilities with rsyncUsing filesystem capabilities with rsync
Using filesystem capabilities with rsyncHazel Smith
 
Chapter 8 distributed file systems
Chapter 8 distributed file systemsChapter 8 distributed file systems
Chapter 8 distributed file systemsAbDul ThaYyal
 

Similaire à Google Chubby Lock Service Lecture Overview (20)

File service architecture and network file system
File service architecture and network file systemFile service architecture and network file system
File service architecture and network file system
 
Gfs google-file-system-13331
Gfs google-file-system-13331Gfs google-file-system-13331
Gfs google-file-system-13331
 
Cloud computing UNIT 2.1 presentation in
Cloud computing UNIT 2.1 presentation inCloud computing UNIT 2.1 presentation in
Cloud computing UNIT 2.1 presentation in
 
Gfs final
Gfs finalGfs final
Gfs final
 
Kafka and ibm event streams basics
Kafka and ibm event streams basicsKafka and ibm event streams basics
Kafka and ibm event streams basics
 
Nfs
NfsNfs
Nfs
 
Leveraging Structured Data To Reduce Disk, IO & Network Bandwidth
Leveraging Structured Data To Reduce Disk, IO & Network BandwidthLeveraging Structured Data To Reduce Disk, IO & Network Bandwidth
Leveraging Structured Data To Reduce Disk, IO & Network Bandwidth
 
5.distributed file systems
5.distributed file systems5.distributed file systems
5.distributed file systems
 
Resource Management of Docker
Resource Management of DockerResource Management of Docker
Resource Management of Docker
 
Lustre And Nfs V4
Lustre And Nfs V4Lustre And Nfs V4
Lustre And Nfs V4
 
GFS - Google File System
GFS - Google File SystemGFS - Google File System
GFS - Google File System
 
Hdfs architecture
Hdfs architectureHdfs architecture
Hdfs architecture
 
11 distributed file_systems
11 distributed file_systems11 distributed file_systems
11 distributed file_systems
 
Linux Memory Analysis with Volatility
Linux Memory Analysis with VolatilityLinux Memory Analysis with Volatility
Linux Memory Analysis with Volatility
 
Google File System
Google File SystemGoogle File System
Google File System
 
12. dfs
12. dfs12. dfs
12. dfs
 
Kafka blr-meetup-presentation - Kafka internals
Kafka blr-meetup-presentation - Kafka internalsKafka blr-meetup-presentation - Kafka internals
Kafka blr-meetup-presentation - Kafka internals
 
11. dfs
11. dfs11. dfs
11. dfs
 
Using filesystem capabilities with rsync
Using filesystem capabilities with rsyncUsing filesystem capabilities with rsync
Using filesystem capabilities with rsync
 
Chapter 8 distributed file systems
Chapter 8 distributed file systemsChapter 8 distributed file systems
Chapter 8 distributed file systems
 

Dernier

Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 

Dernier (20)

Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 

Google Chubby Lock Service Lecture Overview

  • 1. Lecture: Google Chubby lock service h#p://research.google.com/archive/chubby.html 10/15/2014 Romain Jaco7n romain.jaco9n@orange.fr ZooKeeper i am your father ! Chubby MASTER
  • 2. Agenda 2 • Introduc7on • Design • Mechanisms for scaling • Experience • Summary The “Part-­‐Time” Parliament – Leslie Lamport
  • 3. Introduc9on Abstract • Chubby lock service is intended for use within a loosely-­‐coupled distributed system consis7ng large number of machines (10.000) connected by a high-­‐speed network – Provides coarse-­‐grained locking – And reliable (low-­‐volume) storage • Chubby provides an interface much like a distributed file system with advisory locks – Whole file read and writes opera9on (no seek) – Advisory locks – No9fica9on of various events such as file modifica9on • Design emphasis – Availability – Reliability – But not for high performance / high throughput • Chubby uses asynchronous consensus: PAXOS with lease 7mers to ensure liveness 3
  • 4. Agenda • Introduc9on • Design • Mechanisms for scaling • Experience • Summary “Death Star tea infuser” 4
  • 5. Design Google’s ra7onale (2006) Design choice: Lock Service or Client PAXOS Library ? • Client PAXOS Library ? – Depend on NO other servers (besides the name service …) – Provide a standard framework for programmers • Lock Service ? – Make Initial Death Star design! it easier to add availability to a prototype, and to maintain exis9ng program structure and communica9on pa#erns – Reduces the number of servers on which a client depends by offering both a name service with consistent client caching and allowing clients to store and fetch small quan99es of data – Lock-­‐based interface is more familiar to programmers – Lock service use several replicas to achieve high availability (= quorums), but even a single client can obtain lock and make progress safely à Lock service reduces the number of servers needed for a reliable client system to make progress 5
  • 6. Design Google’s ra7onale (2006) • Two key design decisions – Google chooses lock service ( but also provide a PAXOS client library independently from Chubby for specific projects ) – Serve small files to permit elected primaries ( = client applica9on masters ) to adver9se themselves and their parameters • Decisions based on Google's expected usage and environment – Allow thousands of clients to observe Chubby files à Events no7fica7on mechanism to avoid polling by clients that wish to know change – Consistent caching seman7cs prefered by developers and caching of files protects lock service from intense polling – Security mechanisms ( access control ) – Provide only coarse-­‐grained locks ( long dura9on lock = low lock-­‐acquisi9on rate = less load on the lock server ) 6
  • 7. Design System structure • Two main components that communicate via RPC – A replica server – A library linked against client applications • A Chubby cell consists of a small set of servers (typically 5) known as replicas – Replicas use a distributed consensus protocol (PAXOS) to elect a master and replicate logs – Read and write requests are satisfied by the master alone – If the master fails, the other replicas run the election protocol when their master lease expires (a new master is elected in a few seconds) • Clients find the master by sending master-location requests to the replicas listed in DNS – Non-master replicas respond by returning the identity of the master – Once a client has located the master, it directs all requests to it either until it ceases to respond, or until it indicates that it is no longer the master (Diagram: an application linked with the Chubby library; 1 – DNS request: chubby.deathstar.sith? → DNS servers return 1.1.1.1, 2.2.2.2, 3.3.3.3, 4.4.4.4, 5.5.5.5; 2 – master location? → Master = 5.5.5.5; 3 – the client initiates a Chubby session with the master)
  • 8. Design PAXOS distributed consensus • Chubby cell with N = 3 replicas – Replicas use a distributed consensus protocol (PAXOS) to elect a master. Quorum = 2 for N = 3 – The master must obtain votes from a majority of replicas that promise not to elect a different master for an interval of a few seconds (= master lease) – The master lease is periodically renewed by the replicas provided the master continues to win a majority of the vote – During its master lease, the master maintains copies of a simple database with the replicas (ordered replicated logs) – Write requests are propagated via the consensus protocol (PAXOS) to all replicas – Read requests are satisfied by the master alone – If the master fails, the other replicas run the election protocol when their master lease expires (a new master is elected in a few seconds) (Diagram: Prepare = please vote for me and promise not to vote for someone else for 12 seconds; Promise = OK, I vote for you and promise not to vote for someone else for 12 seconds; if I receive a quorum of Promises then I am the master and I can send many Accepts during my lease (the proposer votes for itself); Accept = update your replicated log with this write request; Accepted = I have written your write request in my log; if a replica receives a quorum of Accepted then the write is committed (the replica sends an Accepted to itself); Re-Prepare = I love being the master, please re-vote for me before the end of the lease so I can extend it)
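
To make the master-lease idea on this slide concrete, here is a minimal Go sketch, not Chubby's implementation: a candidate counts unexpired Promise votes and only considers itself master while a quorum of promises is still valid. All names, the 12-second lease, and the 3-replica cell are taken from the slide; the structure is illustrative only.

    package main

    import (
        "fmt"
        "time"
    )

    // Promise records that a replica voted for this candidate and will not vote
    // for anyone else until Expires.
    type Promise struct {
        Replica string
        Expires time.Time
    }

    // isMaster reports whether the candidate currently holds a valid master lease:
    // a majority of the n replicas granted promises that have not yet expired.
    func isMaster(promises []Promise, n int, now time.Time) bool {
        valid := 0
        for _, p := range promises {
            if now.Before(p.Expires) {
                valid++
            }
        }
        return valid >= n/2+1 // quorum, e.g. 2 of 3
    }

    func main() {
        now := time.Now()
        lease := now.Add(12 * time.Second) // 12 s lease as described on the slide
        promises := []Promise{
            {Replica: "replica-1", Expires: lease}, // the candidate's own vote
            {Replica: "replica-2", Expires: lease},
        }
        fmt.Println("master:", isMaster(promises, 3, now)) // true: 2 of 3
    }
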
  • 9. Design Files & directories • Chubby exports a file system interface simpler than Unix – A tree of files and directories with name components separated by slashes – Each directory contains a list of child files and directories (collectively called nodes) – Each file contains a sequence of uninterpreted bytes – No symbolic or hard links – No directory modified times, no last-access times (to make it easier to cache file metadata) – No path-dependent permission semantics: a file is controlled by the permissions on the file itself • Example name: /ls/dc-tatooine/bigtable/root-tablet — the ls prefix is common to all Chubby names and stands for "lock service"; the second component, dc-tatooine, is the name of the Chubby cell and is resolved to one or more Chubby servers via a DNS lookup; the remainder of the name is interpreted within the named Chubby cell
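
As a small illustration of the naming scheme above, the Go sketch below splits a Chubby name of the form /ls/&lt;cell&gt;/&lt;path&gt; into the cell (the part resolved via DNS) and the in-cell path. This helper is hypothetical, not part of the real client library.

    package main

    import (
        "fmt"
        "strings"
    )

    // splitName returns the cell name and the path interpreted inside that cell.
    func splitName(name string) (cell, path string, err error) {
        parts := strings.Split(strings.TrimPrefix(name, "/"), "/")
        if len(parts) < 3 || parts[0] != "ls" {
            return "", "", fmt.Errorf("not a Chubby name: %q", name)
        }
        return parts[1], "/" + strings.Join(parts[2:], "/"), nil
    }

    func main() {
        cell, path, _ := splitName("/ls/dc-tatooine/bigtable/root-tablet")
        fmt.Println(cell, path) // dc-tatooine /bigtable/root-tablet
    }
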
  • 10. Design Files & directories: Nodes • Nodes (= files or directories) may be either permanent or ephemeral • Ephemeral files are used as temporary files, and act as indicators to others that a client is alive • Any node may be deleted explicitly – Ephemeral files are also deleted if no client has them open – Ephemeral directories are also deleted if they are empty • Any node can act as an advisory reader/writer lock
  • 11. Design Files & directories: Metadata • 3 ACLs – Three names of access control lists (ACLs) used to control reading, writing and changing the ACL names for the node – A node inherits the ACL names of its parent directory on creation – ACLs are themselves files located in /ls/dc-tatooine/acl (an ACL file consists of a simple list of names of principals) – Users are authenticated by a mechanism built into the Chubby RPC system • 4 monotonically increasing 64-bit numbers 1. Instance number: greater than the instance number of any previous node with the same name 2. Content generation number (files only): increases when the file's contents are written 3. Lock generation number: increases when the node's lock transitions from free to held 4. ACL generation number: increases when the node's ACL names are written • A 64-bit file-content checksum
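
The per-node metadata listed on this slide can be summarized as a small struct. The Go rendering below is only a sketch; the field names are illustrative and the real layout is internal to Chubby.

    package chubbymeta

    // NodeMetadata mirrors the metadata fields described on the slide above.
    type NodeMetadata struct {
        ReadACL   string // name of the ACL file controlling reads
        WriteACL  string // name of the ACL file controlling writes
        ChangeACL string // name of the ACL file controlling ACL changes

        InstanceNumber    uint64 // greater than any previous node with the same name
        ContentGeneration uint64 // files only: bumped on every content write
        LockGeneration    uint64 // bumped when the lock goes from free to held
        ACLGeneration     uint64 // bumped when the node's ACL names are written

        Checksum uint64 // 64-bit checksum of the file contents
    }
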
  • 12. Design Files & directories: Handles • Clients open nodes to obtain handles (analogous to UNIX file descriptors) • Handles include: – Check digits: prevent clients from creating or guessing handles → full access-control checks are performed only when handles are created – A sequence number: the master can tell whether a handle was generated by it or by a previous master – Mode information (provided at open time): allows the master to recreate its state if an old handle is presented to a newly restarted master
  • 13. Design Locks, Sequencers and Lock-delay • Each Chubby file and directory can act as a reader-writer lock (locks are advisory) • Acquiring a lock in either mode requires write permission – Exclusive mode (writer): one client may hold the lock – Shared mode (reader): any number of client handles may hold the lock • A lock holder can request a Sequencer: an opaque byte string describing the state of the lock immediately after acquisition – Name of the lock + lock mode (exclusive or shared) + lock generation number • Sequencer usage – The application's master can generate a sequencer and send it with any internal order it sends to other servers – Application servers that receive orders from the master can check with Chubby whether the sequencer is still good (= not a stale master) • Lock-delay: if the lock becomes free, the lock server prevents other clients from claiming it during the lock-delay period – A client may specify any lock-delay up to 60 seconds – This limit prevents a faulty client from making a lock unavailable for an arbitrarily long time – Lock-delay protects unmodified servers and clients from everyday problems caused by message delays and restarts
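
The lock-holder side of the sequencer pattern described above might look like the hedged Go sketch below: the application master takes the exclusive lock on its master file, advertises its name there, and obtains a sequencer to attach to every order it sends. The Handle interface is hypothetical (the real client library is C++); the method names follow the API slide later in the deck.

    package primary

    // Handle is a hypothetical stand-in for an open Chubby node.
    type Handle interface {
        Acquire(exclusive bool) error  // take the advisory lock
        SetContents(data []byte) error // atomically replace the file contents
        GetSequencer() (string, error) // opaque: lock name + mode + generation number
    }

    // becomeMaster returns a sequencer on success; a non-nil error means another
    // server already holds the lock (or Chubby is unreachable).
    func becomeMaster(h Handle, myName string) (string, error) {
        if err := h.Acquire(true); err != nil {
            return "", err
        }
        if err := h.SetContents([]byte(myName)); err != nil {
            return "", err
        }
        return h.GetSequencer()
    }
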
  • 14. Design Events • Session events can be received by the application – Jeopardy: when the session lease times out and the grace period begins (see fail-over later ;-) – Safe: when the session is known to have survived a communication problem – Expired: if the session times out • Handle events: clients may subscribe to a range of events when they create a handle (= open phase) – File contents modified – Child node added/removed/modified – Master failed over – A handle (and its lock) has become invalid – Lock acquired – Conflicting lock request from another client • These events are delivered to clients asynchronously via an up-call from the Chubby library • Mike Burrows: "The last two events mentioned are rarely used, and with hindsight could have been omitted."
  • 15. Design API • Open/Close node name – func Open(): handles are created by Open() and destroyed by Close() – func Close(): this call never fails • Poison – func Poison(): allows a client to cancel Chubby calls made by other threads without fear of deallocating the memory being accessed by them • Read/Write full contents – func GetContentsAndStat(): atomic reading of the entire content and metadata – func GetStat(): reading of the metadata only – func ReadDir(): reading of the names and metadata of the directory's children – func SetContents(): atomic writing of the entire content • ACL – func SetACL(): change the ACL names for a node • Delete node – func Delete(): deletes the node if it has no children • Lock – func Acquire(): acquire a lock – func TryAcquire(): try to acquire a potentially conflicting lock by sending a "conflicting lock request" event to the holder – func Release(): release a lock • Sequencer – func GetSequencer(): returns a sequencer that describes any lock held by this handle – func SetSequencer(): associates a sequencer with a handle; subsequent operations on the handle fail if the sequencer is no longer valid – func CheckSequencer(): checks whether a sequencer is valid
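
For readers who prefer to see the call surface in one place, the sketch below renders the API listed on this slide as a hypothetical Go interface. The method set mirrors the slide; the signatures are guesses for illustration only (the real client library is C++ and its exact types are not shown in the deck).

    package chubbyapi

    // Node stands in for a node's metadata (instance, generation numbers, checksum).
    type Node struct{}

    type Handle interface {
        Close()                                    // never fails
        Poison()                                   // cancel calls from other threads safely
        GetContentsAndStat() ([]byte, Node, error) // atomic read of content + metadata
        GetStat() (Node, error)
        ReadDir() ([]Node, error)
        SetContents(data []byte) error // atomic write of the entire content
        SetACL(read, write, change string) error
        Delete() error // only if the node has no children
        Acquire(exclusive bool) error
        TryAcquire(exclusive bool) error
        Release() error
        GetSequencer() (string, error)      // describes any lock held by this handle
        SetSequencer(seq string) error      // later operations fail if seq becomes invalid
        CheckSequencer(seq string) (bool, error)
    }

    // Client creates handles; everything else happens through a Handle.
    type Client interface {
        Open(name string) (Handle, error)
    }
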
  • 16. Design Design n°1 • Primary election example without sequencer usage, but with lock-delay (worst design) (Diagram: an application scheduler with a master and application workers, plus a Chubby cell with a PAXOS leader; the first server to get the exclusive lock on the file /ls/empire/deathstar/master is the master, writes its name in the file and sets the lock-delay to 1 minute; the two other servers only receive event notifications about this file; the master sends an order to a worker ("Execute order 66!") and the worker executes the request; if the lock on /ls/empire/deathstar/master becomes free because the holder has failed or become inaccessible, the lock server prevents the other backup servers from claiming the lock during the 1-minute lock-delay)
  • 17. Design Design n°2 • Primary election example with sequencer usage (best design) (Diagram: an application scheduler with a master and application workers, plus a Chubby cell with a PAXOS leader; the first server to get the exclusive lock on the file /ls/empire/deathstar/master is the master, writes its name in the file and gets a sequencer; the two other servers only receive event notifications about this file; the master sends an order to a worker ("Execute order 66!") by adding the sequencer to the request; the worker must check the sequencer against its Chubby cache before executing the request)
  • 18. Design Design n°3 • Primary election example with sequencer usage (optimized design) (Diagram: an application scheduler with a master and an army of workers, plus a Chubby cell with a PAXOS leader; the first server to get the exclusive lock on the file /ls/empire/deathstar/master is the master, writes its name in the file and gets a sequencer; the two other servers only receive event notifications about this file; the master sends an order to a worker ("Destroy Alderaan!") by adding the sequencer to the request; the worker must check the sequencer before executing the request, checking against the most recent sequencer it has observed if it does not wish to maintain a session with Chubby)
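
A hedged sketch of the worker-side check in designs n°2 and n°3 above: the worker validates the sequencer attached to an order either by asking Chubby (design n°2, possibly answered from the worker's cache) or, if it keeps no Chubby session (design n°3), by comparing against the most recent sequencer it has already observed. All types and names are illustrative, and the string comparison stands in for comparing lock generation numbers inside the (opaque) sequencer.

    package worker

    import "errors"

    var errStaleMaster = errors.New("order carries a stale master sequencer")

    // checker abstracts CheckSequencer() on the worker's Chubby client.
    type checker interface {
        CheckSequencer(seq string) (bool, error)
    }

    type Worker struct {
        chubby       checker // nil if the worker keeps no Chubby session (design n°3)
        lastObserved string  // most recent sequencer seen from the master
    }

    // Execute runs the order only if the attached sequencer still looks current.
    func (w *Worker) Execute(order string, seq string) error {
        if w.chubby != nil { // design n°2: validate via Chubby or its cache
            ok, err := w.chubby.CheckSequencer(seq)
            if err != nil || !ok {
                return errStaleMaster
            }
        } else if seq < w.lastObserved { // design n°3: older than one already seen
            return errStaleMaster
        }
        w.lastObserved = seq
        // ... perform the order here ...
        _ = order
        return nil
    }
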
  • 19. Design Caching • Clients cache file data and node metadata in a consistent, write-through, in-memory cache – The cache is maintained by a lease mechanism (a client that allows its cache lease to expire then informs the lock server) – The cache is kept consistent by invalidations sent by the master, which keeps a list of what each client may be caching • When file data or metadata is to be changed – The master blocks the modification while sending invalidations for the data to every client that may have cached it – A client that receives an invalidation flushes the invalidated state and acknowledges by making its next KeepAlive call – The modification proceeds only after the server knows that each client has invalidated its cache, either because it acknowledged the invalidation or because it allowed its cache lease to expire (Diagram: a client asks the master to change the content of file /ls/empire/tux; the master sends cache invalidations to the clients that cache that file; those clients flush their caches for the file and acknowledge the master; the master writes the change to the Chubby replicated database via PAXOS and acknowledges the writer that the change is done)
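
The master-side write path described above can be summarized in a few lines. The Go sketch below is illustrative pseudocode, not Chubby's internal structure: block the write, invalidate every client that may cache the node, wait for each to acknowledge or let its cache lease expire, and only then commit.

    package master

    // cachingClient abstracts what the master needs from each client session.
    type cachingClient interface {
        Invalidate(node string)           // in Chubby, piggybacked on the KeepAlive reply
        WaitAckOrLeaseExpiry(node string) // returns once the client cannot hold stale data
    }

    type Master struct {
        mayCache map[string][]cachingClient           // node -> clients possibly caching it
        commit   func(node string, data []byte) error // write through the consensus log
    }

    // Write keeps the node read-consistent: no client can observe stale cached data
    // once the new contents become visible.
    func (m *Master) Write(node string, data []byte) error {
        for _, c := range m.mayCache[node] {
            c.Invalidate(node)
        }
        for _, c := range m.mayCache[node] {
            c.WaitAckOrLeaseExpiry(node)
        }
        return m.commit(node, data)
    }
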
  • 20. Design Sessions and KeepAlives: handles, locks, and cached data all remain valid while the session remains valid • A client requests a new session on first contacting the master of a Chubby cell • A client ends the session explicitly either when it terminates, or if the session has been idle – Session idle = no open handles and no calls for a minute • Each session has an associated lease interval – Lease interval: the master guarantees not to terminate the session unilaterally until the lease timeout – Lease timeout = end of the lease interval (Diagram: lease interval, grace period, jeopardy)
  • 21. Design Sessions and KeepAlives • The master advances the lease timeout – On session creation – When a master fail-over occurs – When it responds to a KeepAlive RPC from the client: • On receiving a KeepAlive (1), the master blocks the RPC until the client's previous lease interval is close to expiring • The master later allows the RPC to return to the client (2) and informs it of the new lease timeout (= lease M2) • The master uses a default extension of 12 seconds; an overloaded master may use higher values • The client initiates a new KeepAlive immediately after receiving the previous reply • Protocol optimization: the KeepAlive reply is used to transmit events and cache invalidations back to the client (Diagram: blocked RPC, grace period, jeopardy)
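
To show how the pieces above fit together on the client side, here is a hedged Go sketch of a KeepAlive loop: the client always keeps one KeepAlive outstanding, tracks the lease timeout returned by the master, and enters jeopardy (then the roughly 45-second grace period) if its local lease timer expires. Names, structure, and the error handling are illustrative; a real client would also retry against other replicas to find a new master.

    package session

    import "time"

    type keepAliveReply struct {
        LeaseTimeout time.Time // new lease timeout granted by the master
        // events and cache invalidations are piggybacked here in the real protocol
    }

    type conn interface {
        KeepAlive() (keepAliveReply, error) // blocks on the master until near lease expiry
    }

    func run(c conn, onJeopardy, onSafe, onExpired func()) {
        grace := 45 * time.Second
        lease := time.Now().Add(12 * time.Second) // initial default lease
        for {
            done := make(chan keepAliveReply, 1)
            go func() {
                if r, err := c.KeepAlive(); err == nil {
                    done <- r
                }
            }()
            select {
            case r := <-done:
                lease = r.LeaseTimeout // extended; immediately send the next KeepAlive
            case <-time.After(time.Until(lease)):
                onJeopardy() // local lease expired: flush cache, block application calls
                select {
                case r := <-done: // reached the (possibly new) master in time
                    lease = r.LeaseTimeout
                    onSafe()
                case <-time.After(grace):
                    onExpired() // give up: report the session as expired
                    return
                }
            }
        }
    }
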
  • 22. Design Fail-overs • When a master fails or otherwise loses mastership – It discards its in-memory state about sessions, handles, and locks – The session lease timer is stopped – If a master election occurs quickly, clients can contact the new master before their local lease timer expires – If the election takes a long time, clients flush their caches (= JEOPARDY) and wait for the GRACE PERIOD (45 seconds) while trying to find the new master (Diagram: grace period, jeopardy)
  • 23. Design Fail-overs: newly elected master's tasks (initial design) 1. Picks a new client epoch number (clients are required to present it on every call) 2. Responds to master-location requests, but does not at first process incoming session-related operations 3. Builds in-memory data structures for the sessions and locks recorded in the database; session leases are extended to the maximum that the previous master may have been using 4. Lets clients perform KeepAlives, but no other session-related operations 5. Emits a fail-over event to each session: clients flush their caches and warn applications that other events may have been lost 6. Waits until each session expires or acknowledges the fail-over event 7. Now allows all operations to proceed 8. If a client uses a handle created prior to the fail-over, the master recreates the in-memory representation of the handle and then honors the call 9. After some interval (1 minute), the master deletes ephemeral files that have no open file handles: clients should refresh handles on ephemeral files during this interval after a fail-over
  • 24. Design Database implementation • A simple key/value database using write-ahead logging and snapshotting • Chubby needs atomic operations only (no general transactions) • The database log is distributed among the replicas using a distributed consensus protocol (PAXOS) Backup • Every few hours, the master writes a snapshot of its database to a GFS file server in a different building = no cyclic dependencies (because the local GFS uses the local Chubby cell…) • Backups are used for disaster recovery • Backups are used for initializing the database of a newly replaced replica = no load on the other replicas
  • 25. Design Mirroring • Chubby allows a collection of files to be mirrored from one cell to another – Mirroring is fast: the files are small, and the event mechanism informs the mirroring code immediately if a file is added, deleted, or modified – An unreachable mirror remains unchanged until connectivity is restored; updated files are then identified by comparing checksums – Used to copy configuration files to various computing clusters distributed around the world • A special "global" Chubby cell – The subtree /ls/global/master is mirrored to the subtree /ls/cell/slave in every other Chubby cell – The "global" cell's replicas are located in widely-separated parts of the world – Usage: • Chubby's own ACLs • Various files in which Chubby cells and other systems advertise their presence to monitoring services • Pointers to allow clients to locate large data sets (Bigtable cells) and many configuration files
  • 26. Agenda • Introduction • Design • Mechanisms for scaling • Experience • Summary ("Judge me by my size, do you?" – Yoda)
  • 27. Mechanisms for scaling 90,000+ clients communicating with a Chubby master! • Chubby's clients are individual processes, so Chubby handles more clients than expected • Effective scaling techniques = reduce communication with the master – Minimize RTT: an arbitrary number of Chubby cells; clients almost always use a nearby cell (found with DNS) to avoid reliance on remote machines (= 1 Chubby cell in a datacenter for several thousand machines) – Minimize KeepAlive load: increase lease times from the default of 12 s up to around 60 s under heavy load (= fewer KeepAlive RPCs to process) – Optimized caching: Chubby clients cache file data, metadata, the absence of files, and open handles – Protocol-conversion servers: servers that translate the Chubby protocol into less-complex protocols (DNS, …) (Diagram: Chubby clients sending READ & WRITE requests to the Chubby cell master)
  • 28. Mechanisms for scaling Proxies • Chubby's protocol can be proxied – Same protocol on both sides – A proxy can reduce server load by handling both KeepAlive and read requests – A proxy cannot reduce write traffic • Proxies allow a significant increase in the number of clients – If an external proxy handles N clients, KeepAlive traffic is reduced by a factor of N (which could be 10,000 or more!) – A proxy cache can reduce read traffic by at most the mean amount of read-sharing (Diagram: Chubby clients sending writes and reads to the Chubby cell master through replicas acting as internal proxies)
  • 29. Mechanisms for scaling Partitioning (intended to enable large Chubby cells with little communication between the partitions) • The code can partition the namespace by directory – Chubby cell = N partitions – Each partition has a set of replicas and a master – Every node D/C in directory D would be stored on the partition P(D/C) = hash(D) mod N – Metadata for D may be stored on a different partition P(D) = hash(D') mod N, where D' is the parent of D – A few operations still require cross-partition communication: • ACLs: one partition may use another for permission checks (only Open() and Delete() calls require ACL checks) • When a directory is deleted: a cross-partition call may be needed to ensure that the directory is empty • Unless the number of partitions N is large, each client would contact the majority of the partitions – Partitioning reduces read and write traffic on any given partition by a factor of N ☺ – Partitioning does not necessarily reduce KeepAlive traffic… • Partitioning was implemented in the code but not activated, because Google did not need it (2006)
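
A tiny sketch of the partition rule quoted above: a node D/C lands on the partition derived from hashing its parent directory D, so all children of a directory live on the same partition. The slide only specifies "hash(D) mod N"; the FNV-1a hash used here is an illustrative stand-in.

    package partition

    import "hash/fnv"

    // partitionOf returns P(D/C) = hash(D) mod N for a node whose parent directory is dir.
    func partitionOf(dir string, n uint32) uint32 {
        h := fnv.New32a()
        h.Write([]byte(dir))
        return h.Sum32() % n
    }

    // Example: partitionOf("/ls/cell/bigtable", 8) places every child of that
    // directory on the same one of the 8 partitions.
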
  • 30. Agenda • Introduction • Design • Mechanisms for scaling • Experience • Summary ("Do. Or do not. There is no try." – Yoda)
  • 31. Experience (2006) Use and behavior: statistics taken as a snapshot of a Chubby cell (RPC rate over a 10-minute period)
  • 32. Experience (2006) Use and behavior Typical causes of outages • 61 outages over a period of a few weeks, amounting to 700 cell-days of data in total – 52 outages lasted under 30 seconds → most applications are not affected significantly by Chubby outages under 30 seconds – 4 caused by network maintenance – 2 caused by suspected network connectivity problems – 2 caused by software errors – 1 caused by overload • Over a few dozen cell-years of operation, data was lost on 6 occasions – 4 database errors – 2 operator errors • Overload typically occurs when more than 90,000 sessions are active or when millions of reads arrive simultaneously
  • 33. Experience (2006) Java clients • Chubby is in C++, like most of Google's infrastructure • Problem: a growing number of systems are being written in Java – Java programmers dislike the Java Native Interface (slow and cumbersome) for accessing non-native libraries – Chubby's C++ client library is 7,000 lines: maintaining a Java version of the library would be delicate and too expensive • Solution: Java users run copies of a protocol-conversion server that exports a simple RPC protocol corresponding closely to Chubby's client API • Mike Burrows (2006): "Even with hindsight, it is not obvious how we might have avoided the cost of writing, running and maintaining this additional server" (Diagram: a Java application in a JVM talking to a local Chubby protocol-conversion daemon at 127.0.0.1:42, which talks to the Chubby cell master)
  • 34. Experience (2006) Use as a name service • Chubby was designed as a lock service, but a popular use was as a name server • DNS caching is based on time – Inconsistent caching even with a small TTL = DNS data is discarded when not refreshed within the TTL period – A low TTL overloads DNS servers • Chubby caching uses explicit invalidations – Consistent caching – No polling • Chubby DNS server – Another protocol-conversion server that makes the naming data stored within Chubby available to DNS clients: for easing the transition from DNS names to Chubby names, and to accommodate existing applications that cannot be converted easily, such as browsers (Diagram: old and new servers send DNS requests to a Chubby protocol-conversion server, which translates them into Chubby requests to the cell master)
  • 35. Experience (2006) Problems with fail-over • The original design requires the master to write new sessions to the database as they are created – Overhead on the Berkeley DB version of the lock server! • The new design avoids recording sessions in the database – Sessions are recreated in the same way the master currently recreates handles → new elected master's task n°8 – A new master must now wait a full worst-case lease timeout before allowing operations to proceed, since it cannot know whether all sessions have checked in → new elected master's task n°6 – Proxy fail-over becomes possible because proxy servers can now manage sessions that the master is not aware of • An extra operation, available only to trusted proxy servers, takes over a client from another proxy when a proxy fails (Diagram: example of an external proxy server fail-over, with two Chubby external proxy servers in front of the Chubby cell master)
  • 36. Design Problems with fail-over: newly elected master's tasks (new design) 1. Picks a new client epoch number (clients are required to present it on every call) 2. Responds to master-location requests, but does not at first process incoming session-related operations 3. Builds in-memory data structures for the sessions and locks recorded in the database; session leases are extended to the maximum that the previous master may have been using 4. Lets clients perform KeepAlives, but no other session-related operations 5. Emits a fail-over event to each session: clients flush their caches and warn applications that other events may have been lost 6. Waits until each session expires or acknowledges the fail-over event 7. Now allows all operations to proceed 8. If a client uses a handle created prior to the fail-over, the master recreates the in-memory representation of the session and the handle and then honors the call 9. After some interval (1 minute), the master deletes ephemeral files that have no open file handles: clients should refresh handles on ephemeral files during this interval after a fail-over
  • 37. Experience (2006) Abusive clients • Many services use shared Chubby cells, so clients need to be isolated from the misbehavior of others. Problems encountered: 1. Lack of aggressive caching – Developers regularly write loops that retry indefinitely when a file is not present, or poll a file by opening and closing it repeatedly – The fix is to cache the absence of a file and to reuse open file handles – This requires spending more time on developer education, but in the end it was easier to make repeated Open() calls cheap… 2. Lack of quotas – Chubby was never intended to be used as a storage system for large amounts of data! – A file size limit was introduced: 256 KB 3. Publish/subscribe – Chubby's design was not made for using its event mechanism as a publish/subscribe system! – Project reviews now cover Chubby usage and growth predictions (RPC rate, disk space, number of files) → bad usages need to be tracked…
  • 38. Experience (2006) Lessons learned • Developers rarely consider availability – They are inclined to treat a service like Chubby as though it were always available – They fail to appreciate the difference between a service being up and that service being available to their applications – API choices can affect the way developers choose to handle Chubby outages: many developers chose to crash their applications when a master fail-over takes place, but the original intent was for clients to check for possible changes… • 3 mechanisms to prevent developers from being over-optimistic about Chubby availability: 1. Project reviews 2. Supplying libraries that perform high-level tasks so that developers are automatically isolated from Chubby outages 3. A post-mortem of each Chubby outage: this eliminates bugs in Chubby and in operating procedures, and reduces applications' sensitivity to Chubby's availability
  • 39. Experience (2006) Opportunities for design changes • Fine-grained locking could be ignored – Developers must remove unnecessary communication to optimize their applications anyway, which means finding a way to use coarse-grained locking • Poor API choices have unexpected effects – One mistake: the means for cancelling long-running calls are the Close() and Poison() RPCs, which also discard the server state for the handle… a Cancel() RPC may be added to allow more sharing of open handles • RPC use affects transport protocols – KeepAlives are used both for refreshing the client's lease and for passing events and cache invalidations from the master to the client: TCP's back-off policies pay no attention to higher-level timeouts such as Chubby leases, so TCP-based KeepAlives led to many lost sessions at times of high network congestion → Google was forced to send KeepAlive RPCs via UDP rather than TCP… – The protocol may be augmented with an additional TCP-based GetEvent() RPC, used to communicate events and invalidations in the normal case in the same way as KeepAlives; the KeepAlive reply would still contain a list of unacknowledged events so that events must eventually be acknowledged
  • 40. Agenda • Introduction • Design • Mechanisms for scaling • Experience • Summary ("May the lock service be with you.")
  • 41. Summary Chubby lock service • Chubby is a distributed lock service for coarse-grained synchronization of distributed systems – Distributed consensus among a few replicas for fault-tolerance – Consistent client-side caching – Timely notification of updates – A familiar file system interface • It has become the primary Google-internal name service – A common rendezvous mechanism for systems such as MapReduce – Used to elect a primary from redundant replicas (GFS and Bigtable) – The standard repository for files that require high availability (ACLs) – A well-known and available location to store a small amount of metadata (= the root of distributed data structures) • Bigtable usage – To elect a master – To allow the master to discover the servers it controls – To permit clients to find the master
  • 42. Any questions? (How to enlarge a lightsaber?) — Romain Jacotin, romain.jacotin@orange.fr