Dynamo, Five Years Later
Andy Gross
Chief Architect, Basho Technologies
QCon London 2013
Friday, March 8, 13
InfoQ.com: News & Community Site
• 750,000 unique visitors/month
• Published in 4 languages (English, Chinese, Japanese and Brazilian
Portuguese)
• Post content from our QCon conferences
• News 15-20 / week
• Articles 3-4 / week
• Presentations (videos) 12-15 / week
• Interviews 2-3 / week
• Books 1 / month
Watch the video with slide
synchronization on InfoQ.com!
http://www.infoq.com/presentations/riak-dynamo
Presented at QCon London
www.qconlondon.com
Purpose of QCon
- to empower software development by facilitating the spread of
knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of
change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
Dynamo
Published October 2007 @ SOSP
Describes a collection of distributed systems
techniques applied to low-latency key-value storage
Spawned (along with BigTable) many imitators and an
industry (LinkedIn -> Voldemort, Facebook ->
Cassandra)
Authors nearly got fired from Amazon for publishing
Riak - A Dynamo Clone
First lines of first prototype written in Fall 2007 on a
plane on the way to my Basho interview
“Technical Debt” is another term we use at Basho for
this code
Mostly Erlang with some C/C++
Apache2 Licensed
First release in 2009, 1.3 released 2/21/13
Basho
Founded late 2007 by ex-Akamai people
Currently ~120 employees, distributed, with offices in
Cambridge, San Francisco, London, and Tokyo
We sponsor Riak Open Source
We sell Riak Enterprise (Riak + Multi-DC replication)
We sell Riak CS (S3 clone backed by Riak Enterprise)
Principles
Always-writable
Incrementally scalable
Symmetrical
Decentralized
Heterogeneous
Focus on SLAs, tail latency
Techniques
Consistent Hashing
Vector Clocks
Read Repair
Anti-Entropy
Hinted Handoff
Gossip Protocol
Consistent Hashing
Invented by Danny Lewin and others @ MIT/Akamai
Minimizes remapping of keys when number of hash
slots changes
Originally applied to CDNs, used in Dynamo for replica
placement
Enables incremental scalability, even spread
Minimizes hot spots
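The idea fits in a few lines. This is an illustrative Python sketch, not Riak's implementation (Riak's ring is Erlang and uses fixed-size partitions of a 160-bit space); the `HashRing` class and node names here are hypothetical:

```python
import hashlib
from bisect import bisect_right

class HashRing:
    """Toy consistent-hash ring: each node owns several points on a circle,
    and a key belongs to the first node clockwise from the key's hash."""

    def __init__(self, nodes, points_per_node=8):
        self.ring = []  # sorted list of (point, node)
        for node in nodes:
            for i in range(points_per_node):  # virtual nodes even out the spread
                self.ring.append((self._hash(f"{node}:{i}"), node))
        self.ring.sort()

    @staticmethod
    def _hash(key):
        return int(hashlib.sha1(key.encode()).hexdigest(), 16)

    def lookup(self, key):
        # first ring point clockwise from the key's hash (wrapping around)
        idx = bisect_right(self.ring, (self._hash(key), ""))
        return self.ring[idx % len(self.ring)][1]

ring = HashRing(["node1", "node2", "node3"])
owner = ring.lookup("blocks/6307C89A-710A-42CD-9FFB-2A6B39F983EA")
# Adding a node remaps only the keys that land on its new points;
# everything else keeps its owner - that is the incremental scalability.
grown = HashRing(["node1", "node2", "node3", "node4"])
```
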
Vector Clocks
Introduced independently by Fidge and Mattern in 1988
Extends Lamport’s timestamps (1978)
Each value in Dynamo tagged with vector clock
Allows detection of stale values, logical siblings
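The two operations that matter can be sketched in Python (illustrative only; actor names are hypothetical, and Riak's clocks live in Erlang):

```python
def descends(a, b):
    """True if clock a has seen everything clock b has (a >= b)."""
    return all(a.get(actor, 0) >= n for actor, n in b.items())

def merge(a, b):
    """Pairwise maximum: the smallest clock that descends from both."""
    return {actor: max(a.get(actor, 0), b.get(actor, 0))
            for actor in a.keys() | b.keys()}

v1 = {"node_a": 1}
v2 = {"node_a": 1, "node_b": 1}   # written after observing v1
stale = descends(v2, v1)          # v2 supersedes v1: v1 is stale
sib = {"node_c": 1}               # concurrent write by another actor
# Neither clock descends the other: logical siblings, kept for resolution.
conflict = not descends(v2, sib) and not descends(sib, v2)
```
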
Read Repair
Update stale versions opportunistically on reads
(instead of writes)
Pushes system toward consistency, after returning
value to client
Reflects focus on a cheap, always-available write path
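A minimal sketch of the mechanism, in Python for illustration. An integer version stands in for the vector clock, and `Replica` is a hypothetical stand-in for a vnode:

```python
class Replica:
    """Trivial replica: a dict of key -> (version, value)."""
    def __init__(self):
        self.store = {}
    def get(self, key):
        return self.store.get(key)
    def put(self, key, versioned):
        self.store[key] = versioned

def read_with_repair(replicas, key):
    """Collect replies, pick the newest, and opportunistically write it
    back to any stale replica - after the read, not on the write path."""
    replies = [(r, r.get(key)) for r in replicas if r.get(key) is not None]
    winner = max((v for _, v in replies), key=lambda v: v[0])
    for replica, value in replies:
        if value[0] < winner[0]:
            replica.put(key, winner)  # the repair
    return winner[1]

a, b, c = Replica(), Replica(), Replica()
a.put("k", (2, "v2")); b.put("k", (2, "v2")); c.put("k", (1, "v1"))
result = read_with_repair([a, b, c], "k")   # returns "v2", repairs c
```
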
Hinted Handoff
Any node can accept writes for other nodes if they’re
down
All messages include a destination
Data accepted by node other than destination is
handed off when node recovers
As long as a single node is alive the cluster can accept
a write
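The flow can be sketched as follows (an illustrative Python model, not Riak's Erlang code; `Node`, `coordinate_put`, and `handoff` are hypothetical names):

```python
class Node:
    def __init__(self, name, up=True):
        self.name, self.up = name, up
        self.data = {}    # keys this node owns
        self.hints = {}   # (intended_owner, key) -> value held on its behalf

def coordinate_put(owner, fallback, key, value):
    """Write to the owner if it is up; otherwise a fallback accepts the
    write along with a hint naming the intended destination."""
    if owner.up:
        owner.data[key] = value
    else:
        fallback.hints[(owner.name, key)] = value

def handoff(fallback, recovered):
    """When the owner recovers, hinted data is handed back to it."""
    for (dest, key), value in list(fallback.hints.items()):
        if dest == recovered.name:
            recovered.data[key] = value
            del fallback.hints[(dest, key)]

b = Node("node_b", up=False)              # node fails
c = Node("node_c")                        # fallback on the ring
coordinate_put(b, c, "blocks/abc", "v1")  # write still succeeds
b.up = True                               # node comes back
handoff(c, b)                             # data returns to recovered node
```
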
Anti-Entropy
Replicas maintain a Merkle Tree of keys and their
versions/hashes
Trees periodically exchanged with peer vnodes
Merkle tree enables cheap comparison
Only values with different hashes are exchanged
Pushes system toward consistency
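The cheap-comparison property can be shown with a small Python sketch. A real implementation walks matching subtrees to localize differences; this simplified version compares roots and then scans only when they differ (function names are hypothetical):

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha1(data).digest()

def merkle_root(leaves):
    """Fold leaf hashes pairwise up to a single root hash."""
    level = list(leaves)
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the odd leaf out
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def diff_keys(a, b):
    """Compare two replicas (key -> value bytes). Equal roots mean the
    replicas agree and nothing is exchanged; otherwise exchange only the
    keys whose value hashes differ."""
    keys = sorted(set(a) | set(b))
    if merkle_root([h(a.get(k, b"")) for k in keys]) == \
       merkle_root([h(b.get(k, b"")) for k in keys]):
        return []
    return [k for k in keys if a.get(k) != b.get(k)]

r1 = {"k1": b"v1", "k2": b"v2"}
r2 = {"k1": b"v1", "k2": b"stale"}
```
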
Gossip Protocol
Decentralized approach to managing global state
Trades off atomicity of state changes for a
decentralized approach
Volume of gossip can overwhelm networks without
care
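One round of a push-pull gossip protocol, sketched in Python (an illustrative model, not Riak's gossip; `Member` and the versioned-entry scheme are assumptions of this sketch):

```python
import random

class Member:
    def __init__(self, name):
        self.name = name
        self.state = {}  # key -> (version, value)

def merge(a, b):
    """For each key, the higher-versioned entry wins on both sides.
    No atomicity: each pair converges independently."""
    for key in set(a.state) | set(b.state):
        best = max(a.state.get(key, (0, None)), b.state.get(key, (0, None)),
                   key=lambda entry: entry[0])
        a.state[key] = best
        b.state[key] = best

def gossip_round(members, rng):
    """Each member exchanges state with one random peer per round;
    updates spread epidemically, with no coordinator."""
    for m in members:
        peer = rng.choice([p for p in members if p is not m])
        merge(m, peer)

cluster = [Member(f"n{i}") for i in range(5)]
cluster[0].state["ring_owner/42"] = (1, "n0")  # one member learns a change
rng = random.Random(0)
for _ in range(10):
    gossip_round(cluster, rng)
# After a few rounds the claim has spread to (almost surely) every member.
```

Note the trade-off the slide mentions: every member talking every round is exactly the traffic that can overwhelm a network at scale without rate limiting.
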
Hinted Handoff
• Node fails
• Requests for hash(“blocks/6307C89A-710A-42CD-9FFB-2A6B39F983EA”) go to a fallback node
• Node comes back
• “Handoff” - data returns to recovered node
• Normal operations resume
Anatomy of a Request
get(“blocks/6307C89A-710A-42CD-9FFB-2A6B39F983EA”)
• The client sends the get to Riak
• A Get Handler (FSM) starts on the coordinating node
• hash(“blocks/6307C89A-710A-42CD-9FFB-2A6B39F983EA”) == 10, 11, 12 on the Ring
• The coordinating node forwards the get to vnodes 10, 11, and 12 in the cluster
• With R=2, the FSM waits for two replies; the replicas return v1 and v2
• v2 supersedes v1, so v2 is returned to the client
Read Repair
get(“blocks/6307C89A-710A-42CD-9FFB-2A6B39F983EA”)
• During the get, one replica returns stale v1 while the others return v2
• After v2 is returned to the client, the coordinating node writes v2 back to the stale replica
• All replicas now hold v2
Riak Architecture
• Erlang/OTP Runtime
• Riak KV
• Client APIs: HTTP, Protocol Buffers, Erlang local client
• Request Coordination: get, put, delete, map-reduce
• Riak Core: consistent hashing, membership, handoff, node-liveness, gossip, buckets, vnode master
• vnodes, storage backend, JS Runtime
Problems with Dynamo
Eventual Consistency is a harsh mistress
Pushes conflict resolution to clients
Key/value data types limited in use
Random replica placement destroys locality
Gossip protocol can limit cluster size
R+W > N is NOT more consistent
TCP Incast
Key-Value Conflict Resolution
Forcing clients to resolve consistency issues on read is
a pain for developers
Most end up choosing the server-enforced last-write-wins policy
With many language clients, logic must be
implemented many times
One solution: https://github.com/bumptech/montage
Another: Make everything immutable
Another: CRDTs
Optimize for Immutability
“Accountants don’t use erasers” - Pat Helland
Eventual consistency is *great* for immutable data
Conflicts become a non-issue if data never changes
don’t need full quorums, vector clocks
backend optimizations are possible
Problem space shifts to distributed GC
... which is very hard, but not the user’s problem
anymore
CRDTs
Conflict-free|Commutative Replicated Data Types
A server side structure and conflict-resolution policy for
richer datatypes like counters and sets amenable to
eventual consistency
Letia et al. (2009), “CRDTs: Consistency without
concurrency control”: http://hal.inria.fr/inria-00397981/en
Prototype here: http://github.com/basho/riak_dt
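The simplest CRDT is the grow-only counter, sketched here in Python for illustration (riak_dt is Erlang; this class and its actor names are hypothetical):

```python
class GCounter:
    """Grow-only counter CRDT: one slot per actor; merge is pairwise max,
    so replicas converge regardless of message order or duplication."""

    def __init__(self):
        self.counts = {}

    def increment(self, actor, n=1):
        # each actor only ever increments its own slot
        self.counts[actor] = self.counts.get(actor, 0) + n

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # commutative, associative, idempotent: no coordination needed
        for actor, n in other.counts.items():
            self.counts[actor] = max(self.counts.get(actor, 0), n)

a, b = GCounter(), GCounter()
a.increment("node_a", 3)   # concurrent increments on two replicas
b.increment("node_b", 2)
a.merge(b)
b.merge(a)
# Both replicas now agree on the total, with no conflict for the client.
```
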
Random Placement and
Locality
By default, keys are randomly placed on different
replicas
But we have buckets!
Containers imply cheap iteration/enumeration, but with
random placement it becomes an expensive full-scan
Partial Solution: hash function defined per-bucket can
increase locality
Lots of work done to minimize impact of bucket listings
(R+W>N) != Consistency
R+W described in Dynamo paper as “consistency
knobs”
Some Basho/Riak docs still say this too! :(
Even if R=W=N, sloppy quorums and partial writes
make reading old values possible
“Read your own writes if your writes succeed but
otherwise you have no idea what you’re going to read
consistency (RYOWIWSBOYHNIWYGTRC)” - Joe
Blomstedt
Solution: actual “strong” consistency
Strong Consistency in Riak
CAP says you must choose C vs. A, but only during
failures
There’s no reason we can’t implement both models,
with different tradeoffs
Enable strong consistency on a per-bucket basis
See Joe Blomstedt’s talk at RICON 2012 (http://ricon2012.com);
earlier work at: http://github.com/jtuple/riak_zab
An Aside: Probabilistically Bounded Staleness
Bailis et al. : http://pbs.cs.berkeley.edu
R=W=1, 0.1 ms latency at all hops
TCP Incast
“You can’t pour two buckets of manure into one
bucket” - Scott Fritchie’s Grandfather
“microbursts” of traffic sent to one cluster member
Coordinator sends request to three replicas
All respond with large-ish result at roughly the same
time
Switch has to either buffer or drop packets
Cassandra tries to mitigate: 1 replica sends data,
others send hashes. We should do this in Riak.
What Riak Did Differently
(or wrong)
Screwed up vector clock implementation
Actor IDs in vector clocks were client ids, therefore
potentially unbounded
Size explosion resulted in huge objects, caused
OOM crashes
Vector clock pruning resulted in false siblings
Fixed by forwarding to a node in the preflist, circa 1.0
What Riak Did Differently
No active anti-entropy until v1.3
Early versions had slow, unstable AAE
Node loss required reading all objects and
repopulating replicas via read repair
Ok for objects that are read often
Rarely-read objects’ effective N value decreases over time
What Riak Did Differently
Initial versions had an unavailability window during
topology changes
Nodes would claim partitions immediately, before
data had been handed off
New versions don’t change request preflist until all
data has been handed off
Implemented as 2PC-ish commit over gossip
Riak, Beyond Dynamo
MapReduce
Search
Secondary Indexes
Pre/post-commit hooks
Multi-DC replication
Riak Pipe distributed computation
Riak CS
Riak CS
Amazon S3 clone implemented as a proxy in front of
Riak
Handles eventual consistency issues, object chunking,
multitenancy, and API for a much narrower use case
Forced us to eat our own dogfood and get serious
about fixing long-standing warts
Drives feature development
Riak the Product vs.
Dynamo the Service
Dynamo had luxury of being a service while Riak is a
product
Screwing things up with Riak cannot be fixed with
an emergency deploy
Multiple platforms, packaging are challenges
Testing distributed systems is another talk entirely
(QuickCheck FTW)
http://www.erlang-factory.com/upload/presentations/514/TestFirstConstructionDistributedSystems.pdf
Riak Core
Some of our best work!
Dynamo abstracted
Implements all Dynamo techniques without prescribing
a use case
Examples of Riak Core apps:
Riak KV!
Riak Search
Riak Pipe
Riak Core
Production deployments
OpenX: several 100+-node clusters of custom Riak
Core systems
StackMob: proxy for mobile services implemented
with Riak Core
Needs to be much easier to use and better
documented
Multi-Datacenter Replication
Intra-cluster replication in Riak is optimized for
consistently low latency, high throughput
WAN replication needs to deal with lossy links, long fat
networks, TCP oddities
MDC replication has 2 phases: full-sync (per-partition
merkle comparisons), real-time (asynchronous, driven
by post-commit hook)
Separate policies settable per-bucket
Erlang
Still the best language for this stuff, but
We mix data and control messages over Erlang
message passing. Switch to TCP (or uTP/UDT) for
data
NIFs are problematic
VM tuning can be a dark art
~90 public repos of mostly-Erlang, mostly-awesome
open source: https://github.com/basho
Other Future Directions
Security was not a factor in Dynamo’s or Riak’s design
Isolating Riak increases operational complexity, cost
Statically sized ring is a pain
Explore possibilities with smarter clients
Support larger clusters
Multitenancy, tenant isolation
More vertical products like Riak CS
Questions?
@argv0
http://www.basho.com
http://github.com/basho
http://docs.basho.com

Contenu connexe

Plus de C4Media

Streaming a Million Likes/Second: Real-Time Interactions on Live Video
Streaming a Million Likes/Second: Real-Time Interactions on Live VideoStreaming a Million Likes/Second: Real-Time Interactions on Live Video
Streaming a Million Likes/Second: Real-Time Interactions on Live VideoC4Media
 
Next Generation Client APIs in Envoy Mobile
Next Generation Client APIs in Envoy MobileNext Generation Client APIs in Envoy Mobile
Next Generation Client APIs in Envoy MobileC4Media
 
Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020C4Media
 
Understand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java ApplicationsUnderstand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java ApplicationsC4Media
 
Kafka Needs No Keeper
Kafka Needs No KeeperKafka Needs No Keeper
Kafka Needs No KeeperC4Media
 
High Performing Teams Act Like Owners
High Performing Teams Act Like OwnersHigh Performing Teams Act Like Owners
High Performing Teams Act Like OwnersC4Media
 
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to JavaDoes Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to JavaC4Media
 
Service Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideService Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideC4Media
 
Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDC4Media
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine LearningC4Media
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at SpeedC4Media
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsC4Media
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsC4Media
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerC4Media
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleC4Media
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeC4Media
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereC4Media
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing ForC4Media
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data EngineeringC4Media
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreC4Media
 

Plus de C4Media (20)

Streaming a Million Likes/Second: Real-Time Interactions on Live Video
Streaming a Million Likes/Second: Real-Time Interactions on Live VideoStreaming a Million Likes/Second: Real-Time Interactions on Live Video
Streaming a Million Likes/Second: Real-Time Interactions on Live Video
 
Next Generation Client APIs in Envoy Mobile
Next Generation Client APIs in Envoy MobileNext Generation Client APIs in Envoy Mobile
Next Generation Client APIs in Envoy Mobile
 
Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020
 
Understand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java ApplicationsUnderstand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java Applications
 
Kafka Needs No Keeper
Kafka Needs No KeeperKafka Needs No Keeper
Kafka Needs No Keeper
 
High Performing Teams Act Like Owners
High Performing Teams Act Like OwnersHigh Performing Teams Act Like Owners
High Performing Teams Act Like Owners
 
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to JavaDoes Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
 
Service Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideService Meshes- The Ultimate Guide
Service Meshes- The Ultimate Guide
 
Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CD
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine Learning
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at Speed
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep Systems
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.js
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly Compiler
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix Scale
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's Edge
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home Everywhere
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing For
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data Engineering
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
 

Dernier

WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAndikSusilo4
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 

Dernier (20)

WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx

Riak and Dynamo, Five Years Later

  • 1. Dynamo, Five Years Later Andy Gross Chief Architect, Basho Technologies QCon London 2013 Friday, March 8, 13
  • 2. InfoQ.com: News & Community Site • 750,000 unique visitors/month • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • News 15-20 / week • Articles 3-4 / week • Presentations (videos) 12-15 / week • Interviews 2-3 / week • Books 1 / month Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations/riak-dynamo
  • 3. Presented at QCon London www.qconlondon.com Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide
  • 4. Dynamo Published October 2007 @ SOSP Describes a collection of distributed systems techniques applied to low-latency key-value storage Spawned (along with BigTable) many imitators and an industry (LinkedIn -> Voldemort, Facebook -> Cassandra) Authors nearly got fired from Amazon for publishing
  • 5. Riak - A Dynamo Clone First lines of first prototype written in Fall 2007 on a plane on the way to my Basho interview “Technical Debt” is another term we use at Basho for this code Mostly Erlang with some C/C++ Apache2 Licensed First release in 2009, 1.3 released 2/21/13
  • 7. Basho Founded late 2007 by ex-Akamai people Currently ~120 employees, distributed, with offices in Cambridge, San Francisco, London, and Tokyo We sponsor Riak Open Source We sell Riak Enterprise (Riak + Multi-DC replication) We sell Riak CS (S3 clone backed by Riak Enterprise)
  • 9. Techniques Consistent Hashing Vector Clocks Read Repair Anti-Entropy Hinted Handoff Gossip Protocol
  • 10. Consistent Hashing Invented by Danny Lewin and others @ MIT/Akamai Minimizes remapping of keys when number of hash slots changes Originally applied to CDNs, used in Dynamo for replica placement Enables incremental scalability, even spread Minimizes hot spots
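The consistent hashing described above can be sketched in a few lines. This is an illustrative Python toy, not Riak's Erlang implementation; the `Ring` class, `ring_hash`, and the vnode count are all hypothetical. Nodes claim points on a fixed hash ring, a key maps to the first point clockwise from its hash, and adding or removing a node only remaps nearby keys.

```python
import bisect
import hashlib

def ring_hash(s: str) -> int:
    # Map an arbitrary string onto the ring's hash space
    return int(hashlib.sha1(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes=64):
        # vnodes: virtual nodes per physical node, for an even spread
        self._points = sorted(
            (ring_hash(f"{n}:{i}"), n) for n in nodes for i in range(vnodes)
        )

    def owner(self, key: str) -> str:
        # First ring point at or after the key's hash, wrapping around
        h = ring_hash(key)
        i = bisect.bisect(self._points, (h, "")) % len(self._points)
        return self._points[i][1]

ring = Ring(["node1", "node2", "node3"])
print(ring.owner("blocks/6307C89A"))  # deterministic: one of the three nodes
```

With many virtual nodes per physical node, keys spread evenly and hot spots are minimized, which is the property the slide calls out.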
  • 12. Vector Clocks Introduced by Mattern et al. in 1988 Extends Lamport’s timestamps (1978) Each value in Dynamo tagged with a vector clock Allows detection of stale values, logical siblings
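The stale-value and sibling detection above can be shown with a minimal sketch (illustrative Python, not Riak's representation; the actor names are made up). Each actor increments only its own counter; clock A supersedes clock B when every counter in B is at most the matching counter in A, and two clocks where neither supersedes the other are logical siblings.

```python
def increment(clock, actor):
    # Return a new clock with this actor's counter bumped
    c = dict(clock)
    c[actor] = c.get(actor, 0) + 1
    return c

def descends(a, b):
    # True if a has seen everything b has seen (a supersedes b)
    return all(a.get(actor, 0) >= n for actor, n in b.items())

def concurrent(a, b):
    # Neither descends the other: logical siblings
    return not descends(a, b) and not descends(b, a)

v1 = increment({}, "node1")          # {"node1": 1}
v2 = increment(v1, "node2")          # built on top of v1
fork = increment(v1, "node3")        # diverged from v1 in parallel
print(descends(v2, v1))              # True: v1 is stale relative to v2
print(concurrent(v2, fork))          # True: siblings, must be resolved
```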
  • 13. Read Repair Update stale versions opportunistically on reads (instead of writes) Pushes system toward consistency, after returning value to client Reflects focus on a cheap, always-available write path
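The read-repair flow can be sketched as follows (a simplified illustration, not Riak code; plain dicts stand in for vnode storage and the version counter replaces a real vector clock). The coordinator answers with the newest value, then pushes that value to any replica that returned a stale or missing version.

```python
def read_with_repair(replicas, key):
    # replicas: list of dicts acting as stand-ins for vnode storage
    replies = [(r, r.get(key)) for r in replicas]
    versions = [v for _, v in replies if v is not None]
    latest = max(versions, key=lambda v: v["version"])
    for replica, seen in replies:          # repair happens after the read
        if seen is None or seen["version"] < latest["version"]:
            replica[key] = latest
    return latest["value"]

a = {"k": {"version": 2, "value": "new"}}
b = {"k": {"version": 1, "value": "old"}}   # stale replica
c = {}                                      # missing replica
print(read_with_repair([a, b, c], "k"))     # "new"; b and c now hold v2
```

Because repair rides on reads, frequently read keys converge quickly while the write path stays cheap, which is the trade-off the slide points at.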
  • 14. Hinted Handoff Any node can accept writes for other nodes if they’re down All messages include a destination Data accepted by node other than destination is handed off when node recovers As long as a single node is alive the cluster can accept a write
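A minimal sketch of the handoff mechanics (illustrative Python, not Riak code; the `Node` class and its fields are invented for the example). A fallback node accepts a write addressed to a down node, remembers the intended destination as a hint, and hands the data off once that node recovers.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}
        self.hints = []            # (destination, key, value) to hand off

    def write(self, key, value, destination=None):
        if destination is not None and destination is not self:
            # We are a fallback: remember who this write really belongs to
            self.hints.append((destination, key, value))
        self.data[key] = value

    def handoff(self):
        remaining = []
        for dest, key, value in self.hints:
            if dest.up:
                dest.data[key] = value     # data returns to recovered node
                self.data.pop(key, None)   # fallback gives up its copy
            else:
                remaining.append((dest, key, value))
        self.hints = remaining

primary, fallback = Node("primary"), Node("fallback")
primary.up = False
fallback.write("k", "v", destination=primary)   # cluster stays writable
primary.up = True
fallback.handoff()
print(primary.data)                             # {'k': 'v'}
```

This is the sense in which a single live node keeps the cluster writable: the destination travels with the message, so any node can stand in for any other.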
  • 15. Anti-Entropy Replicas maintain a Merkle Tree of keys and their versions/hashes Trees periodically exchanged with peer vnodes Merkle tree enables cheap comparison Only values with different hashes are exchanged Pushes system toward consistency
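The cheap-comparison idea can be sketched with a flat hash table (illustrative Python, not Riak's AAE; a real Merkle tree makes the comparison logarithmic by hashing subtrees, while this sketch has just a root hash over per-key hashes). Replicas compare the top-level hash first; only if it differs do they compare per-key hashes and exchange the keys that disagree.

```python
import hashlib

def key_hashes(store):
    return {k: hashlib.sha1(v.encode()).hexdigest() for k, v in store.items()}

def root_hash(store):
    # One digest over all keys in a fixed order stands in for the tree root
    digest = hashlib.sha1()
    for k in sorted(store):
        digest.update(k.encode() + store[k].encode())
    return digest.hexdigest()

def diff_keys(a, b):
    if root_hash(a) == root_hash(b):
        return set()                       # cheap comparison: already in sync
    ha, hb = key_hashes(a), key_hashes(b)
    return {k for k in ha.keys() | hb.keys() if ha.get(k) != hb.get(k)}

r1 = {"k1": "v1", "k2": "v2"}
r2 = {"k1": "v1", "k2": "stale"}
print(diff_keys(r1, r2))                   # {'k2'}: only this value is exchanged
```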
  • 16. Gossip Protocol Decentralized approach to managing global state Trades off atomicity of state changes for a decentralized approach Volume of gossip can overwhelm networks without care
  • 18-22. Hinted Handoff (build sequence) • Node fails • Requests go to fallback • Node comes back • “Handoff” - data returns to recovered node • Normal operations resume
  • 23-33. Anatomy of a Request (build sequence) The client issues get(“blocks/6307C89A-710A-42CD-9FFB-2A6B39F983EA”) against Riak; a Get Handler (FSM) on the coordinating node hashes the key, which maps to ring positions 10, 11, 12 in the cluster; with R=2, the coordinator waits for two replica replies (v1, v2), resolves them to v2, and returns v2 to the client
  • 34-37. Read Repair (build sequence) After the response is returned, the replica still holding the stale v1 is repaired to v2, so all three replicas converge on v2
  • 39-56. Riak Architecture (build sequence) Erlang/OTP Runtime • Riak KV • Client APIs: HTTP, Protocol Buffers, Erlang local client • Request Coordination: get, put, delete, map-reduce • Riak Core: consistent hashing, membership, handoff, node-liveness, gossip, buckets, vnode master, vnodes, storage backend, JS Runtime
  • 57. Problems with Dynamo Eventual Consistency is a harsh mistress Pushes conflict resolution to clients Key/value data types limited in use Random replica placement destroys locality Gossip protocol can limit cluster size R+W > N is NOT more consistent TCP Incast
  • 58. Key-Value Conflict Resolution Forcing clients to resolve consistency issues on read is a pain for developers Most end up choosing the server-enforced last-write-wins policy With many language clients, logic must be implemented many times One solution: https://github.com/bumptech/montage Another: Make everything immutable Another: CRDTs
  • 59. Optimize for Immutability “Accountants don’t use erasers” - Pat Helland Eventual consistency is *great* for immutable data Conflicts become a non-issue if data never changes: no need for full quorums or vector clocks, and backend optimizations are possible Problem space shifts to distributed GC ... which is very hard, but not the user’s problem anymore
  • 60. CRDTs Conflict-free|Commutative Replicated Data Types A server-side structure and conflict-resolution policy for richer datatypes like counters and sets, amenable to eventual consistency Letia et al. (2009). CRDTs: Consistency without concurrency control: http://hal.inria.fr/inria-00397981/en Prototype here: http://github.com/basho/riak_dt
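The counter example from the slide can be sketched as a state-based G-Counter (an illustrative Python toy, not riak_dt's Erlang code). Each replica increments only its own slot, and merge takes the per-slot maximum, so merging is commutative, associative, and idempotent: concurrent updates never produce a conflict to resolve.

```python
def increment(counter, replica, n=1):
    # A replica may only bump its own slot
    c = dict(counter)
    c[replica] = c.get(replica, 0) + n
    return c

def merge(a, b):
    # Per-slot max: safe to apply in any order, any number of times
    return {r: max(a.get(r, 0), b.get(r, 0)) for r in a.keys() | b.keys()}

def value(counter):
    return sum(counter.values())

a = increment({}, "node1", 3)       # updated on one replica
b = increment({}, "node2", 2)       # concurrently on another
print(value(merge(a, b)))           # 5: both updates survive the merge
print(merge(a, b) == merge(b, a))   # True: order doesn't matter
```

Richer types (sets, per-replica counters with decrements) follow the same pattern: pick a merge function with these algebraic properties and eventual consistency does the rest.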
  • 61. Random Placement and Locality By default, keys are randomly placed on different replicas But we have buckets! Containers imply cheap iteration/enumeration, but with random placement it becomes an expensive full-scan Partial Solution: hash function defined per-bucket can increase locality Lots of work done to minimize impact of bucket listings
  • 62. (R+W>N) != Consistency R+W described in Dynamo paper as “consistency knobs” Some Basho/Riak docs still say this too! :( Even if R=W=N, sloppy quorums and partial writes make reading old values possible “Read your own writes if your writes succeed but otherwise you have no idea what you’re going to read consistency (RYOWIWSBOYHNIWYGTRC)” - Joe Blomstedt Solution: actual “strong” consistency
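One concrete scenario behind the slide's claim, as a sketch (an invented illustration, not Riak code; the node names and the failure sequence are assumptions). With a sloppy quorum, a fallback node's ack counts toward W, so a write can "succeed" with R + W > N and still be invisible to a later read that happens before handoff completes.

```python
N = 3
primaries = {"A", "B", "C"}

# Write with W = 2: primary C is down, fallback D takes its write via
# hinted handoff; B's copy is dropped (partial write). Acks: {A, D}.
write_acks = {"A", "D"}
W = len(write_acks)          # 2

# Read with R = 2: C has just recovered (handoff not yet complete) and
# B never stored the value, so the read quorum is {B, C}.
read_replies = {"B", "C"}
R = len(read_replies)        # 2

print(R + W > N)                               # True: the "knob" is satisfied
stale = write_acks.isdisjoint(read_replies)
print(stale)                                   # True: the read misses the write
```

The quorum arithmetic holds, yet no node in the read set ever saw the write, which is exactly why R + W > N is not a consistency guarantee under sloppy quorums.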
  • 63. Strong Consistency in Riak CAP says you must choose C vs. A, but only during failures There’s no reason we can’t implement both models, with different tradeoffs Enable strong consistency on a per-bucket basis See Joe Blomstedt’s talk at RICON 2012: http://ricon2012.com, earlier work at: http://github.com/jtuple/riak_zab
  • 64. An Aside: Probabilistically Bounded Staleness Bailis et al.: http://pbs.cs.berkeley.edu R=W=1, .1ms latency at all hops
  • 65. TCP Incast “You can’t pour two buckets of manure into one bucket” - Scott Fritchie’s Grandfather “microbursts” of traffic sent to one cluster member Coordinator sends request to three replicas All respond with large-ish result at roughly the same time Switch has to either buffer or drop packets Cassandra tries to mitigate: 1 replica sends data, others send hashes. We should do this in Riak.
  • 66. What Riak Did Differently (or wrong) Screwed up vector clock implementation Actor IDs in vector clocks were client ids, therefore potentially unbounded Size explosion resulted in huge objects, caused OOM crashes Vector clock pruning resulted in false siblings Fixed by forwarding to node in preflist circa 1.0
  • 67. What Riak Did Differently No active anti-entropy until v1.3 Early versions had slow, unstable AAE Node loss required reading all objects and repopulating replicas via read repair Ok for objects that are read often Rarely-read objects’ N value decreases over time
  • 68. What Riak Did Differently Initial versions had an unavailability window during topology changes Nodes would claim partitions immediately, before data had been handed off New versions don’t change request preflist until all data has been handed off Implemented as 2PC-ish commit over gossip
  • 69. Riak, Beyond Dynamo MapReduce Search Secondary Indexes Pre/post-commit hooks Multi-DC replication Riak Pipe distributed computation Riak CS
  • 70. Riak CS Amazon S3 clone implemented as a proxy in front of Riak Handles eventual consistency issues, object chunking, multitenancy, and API for a much narrower use case Forced us to eat our own dogfood and get serious about fixing long-standing warts Drives feature development
  • 71. Riak the Product vs. Dynamo the Service Dynamo had the luxury of being a service while Riak is a product Screwing things up with Riak cannot be fixed with an emergency deploy Multiple platforms, packaging are challenges Testing distributed systems is another talk entirely (QuickCheck FTW) http://www.erlang-factory.com/upload/presentations/514/TestFirstConstructionDistributedSystems.pdf
  • 72. Riak Core Some of our best work! Dynamo abstracted Implements all Dynamo techniques without prescribing a use case Examples of Riak Core apps: Riak KV! Riak Search Riak Pipe
  • 73. Riak Core Production deployments OpenX: several 100+-node clusters of custom Riak Core systems StackMob: proxy for mobile services implemented with Riak Core Needs to be much easier to use and better documented
  • 74. Multi-Datacenter Replication Intra-cluster replication in Riak is optimized for consistently low latency, high throughput WAN replication needs to deal with lossy links, long fat networks, TCP oddities MDC replication has 2 phases: full-sync (per-partition merkle comparisons), real-time (asynchronous, driven by post-commit hook) Separate policies settable per-bucket
  • 75. Erlang Still the best language for this stuff, but We mix data and control messages over Erlang message passing. Switch to TCP (or uTP/UDT) for data NIFs are problematic VM tuning can be a dark art ~90 public repos of mostly-Erlang, mostly-awesome open source: https://github.com/basho
  • 76. Other Future Directions Security was not a factor in Dynamo’s or Riak’s design Isolating Riak increases operational complexity, cost Statically sized ring is a pain Explore possibilities with smarter clients Support larger clusters Multitenancy, tenant isolation More vertical products like Riak CS