Stateless service architectures are easy to scale horizontally: just add backend servers behind a front-end load balancer. This approach is not always optimal, though. Any application that needs to perform soft real-time work cannot be built on a stateless CRUD model, because state locality is required to achieve those response times. In this talk I'll cover the benefits of stateful services and give an overview of academic research and existing frameworks in the JS, Scala, .NET and Go worlds. Unfortunately, Python has little to offer in this area. To fix this, we will work through the key concepts for building scalable stateful services: membership and dissemination protocols, failure detection and message routing.
2. I AM ...
Software Engineer at DataRobot Ukraine
GitHub: https://github.com/jettify
Twitter: https://twitter.com/isinf
aio-libs: https://github.com/aio-libs
My Projects:
database clients: aiomysql, aioodbc, aiogibson
web etc.: aiomonitor, aiohttp_debugtoolbar, aiobotocore, aiohttp_mako, aiohttp_admin, aiorwlock
3. POLL: HAVE YOU EVER READ THE DYNAMO PAPER?
1. I have read this paper.
2. I have heard about this paper and know its key ideas.
3. I think distributed systems are kinda cool.
4. AGENDA
1. Motivation: why and when we might want to use stateful services.
2. Industry examples: Uber, Halo 4, Dragon Age, HPC
3. Problem statement, required components
4. Overview of consistent hashing, gossip dissemination and SWIM failure detection
5. Possible improvements
5. USE STATELESS (DUCT TAPE) WHEN YOU CAN!
The stateless approach is a proven technique; use it like duct tape.
6. ISSUES WITH STATELESS SERVICES
Soft real time as a requirement
State serialization
Wasteful data fetching
Leaky DB transactions
8. BENEFITS OF STATEFUL SERVICES
Data locality: logic executes where the data is stored, with fast access
Lower latency: state is kept in memory, no extra network hops needed
Higher performance: no need to deserialize data
9. STATEFUL SERVICE EXAMPLE
Extra trips to the database are avoided, which reduces latency.
Even if the database is down, the request can be handled.
11. INDUSTRY EXAMPLE: HALO 4
Orleans is used as the backbone for the server side of the Halo game, including presence, statistics, cheat detection, etc.
12. INDUSTRY EXAMPLE: HPC
The San Diego Supercomputer Center uses Serf to coordinate compute resources in multiple locations; the cluster size is about 2k nodes.
13. LET'S TRY TO SOLVE A CLOSE-TO-REAL-WORLD PROBLEM: PREDICTION SERVICE
A service that predicts resale prices of different products, based on product specifications
A user enters a used product's specs and obtains a price estimate
Each product category
15. REQUIRED COMPONENTS
1. Work distribution and routing: move each job request to the appropriate node
2. Cluster membership updates: provide means to determine the nodes participating in the cluster, both when it is stable and when it is resizing
3. Failure detector: periodically check nodes and remove unresponsive/dead ones
16. ROUTING. NAIVE SOLUTION WITH HARD-CODED CLUSTER NODES
Very easy to implement; a viable solution when dynamic resizing is not required
Does not support dynamic scaling in or out
Requires a cluster restart to change the node configuration
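The naive approach can be sketched in a few lines of Python. The node names and key format here are made up for illustration; the point is what happens to key placement when the hard-coded list changes.

```python
import hashlib

# Hypothetical hard-coded node list.
NODES = ["node-a", "node-b", "node-c"]

def route(key: str, nodes=NODES) -> str:
    """Route a key by hashing it modulo the cluster size."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return nodes[h % len(nodes)]

# The downside: adding a single node remaps most keys.
keys = [f"product-{i}" for i in range(1000)]
before = {k: route(k, NODES) for k in keys}
after = {k: route(k, NODES + ["node-d"]) for k in keys}
moved = sum(1 for k in keys if before[k] != after[k])
# With mod-N routing about 3/4 of the keys move when going from 3 to 4 nodes.
```

That mass remapping on every resize is exactly the problem consistent hashing addresses.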
18. CONSISTENT HASHING. BASIC IDEA
Consistent hashing minimizes the number of keys that need to be remapped when nodes are added or removed.
http://blog.carlosgaldino.com/consistent-hashing.html
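A minimal sketch of a consistent-hash ring in Python, using virtual nodes for smoother balance. The class and parameter names are my own; this is an illustration of the idea, not a production implementation.

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    """Minimal consistent-hash ring with virtual nodes."""

    def __init__(self, nodes=(), replicas=100):
        self.replicas = replicas
        self._ring = []  # sorted list of (point, node) pairs
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        # Each node owns `replicas` points on the ring for smoother balance.
        for i in range(self.replicas):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove(self, node: str) -> None:
        self._ring = [(p, n) for p, n in self._ring if n != node]

    def get(self, key: str) -> str:
        # Walk clockwise to the first ring point at or after the key's hash.
        points = [p for p, _ in self._ring]
        idx = bisect.bisect(points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]
```

Adding a fourth node to a three-node ring now remaps only about a quarter of the keys, instead of about three quarters under mod-N routing.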
23. WHY NOT JUST USE ZOOKEEPER/CONSUL/ETCD (OR, IN OTHER WORDS, ZAB, PAXOS, RAFT)?
Issues:
Availability
Performance
Network partitions
Operational overhead
24. TYPICAL SYSTEM WITH COORDINATION
Zookeeper forces its own view of node availability
Possible links: n(n−1)/2, but only n of them are used for failure detection
Node availability decisions are best made locally
25. CLUSTER MEMBERSHIP UPDATE PROBLEM. NAIVE SOLUTION
Broadcast could be used for cluster membership updates:
Use network broadcast (usually disabled)
Send the message one by one to each peer (not reliable)
27. GOSSIP OVERVIEW
Basic gossip protocol:
Send a message to k random peers
Peers retransmit the message to the next k random peers
In O(log n) steps the information is disseminated
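The push-gossip scheme above is easy to simulate. This is a toy model under simplifying assumptions (peers chosen uniformly at random, synchronous rounds, optional independent message loss), but it shows both the O(log n) dissemination time and the behavior under packet loss.

```python
import random

def gossip_rounds(n=1000, k=3, loss=0.0, seed=1):
    """Simulate push gossip: every informed node forwards the message
    to k random peers each round; each transmission is independently
    dropped with probability `loss`. Returns the number of rounds
    until all n nodes are informed."""
    rng = random.Random(seed)
    informed = {0}  # node 0 starts with the message
    rounds = 0
    while len(informed) < n:
        rounds += 1
        received = set()
        for _ in informed:
            for peer in rng.sample(range(n), k):
                if rng.random() >= loss:
                    received.add(peer)
        informed |= received
    return rounds
```

With n = 1000 and k = 3 the message reaches everyone in a handful of rounds, and 50% packet loss roughly doubles the round count rather than stopping dissemination.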
28. GOSSIP PROTOCOL VS PACKET LOSS
Heavy packet loss does not stop dissemination; it simply takes a bit longer, about 2 times longer at 50% loss.
30. FAILURE DETECTORS FOR ASYNCHRONOUS SYSTEMS
Chandra, Tushar Deepak, and Sam Toueg. "Unreliable failure detectors for reliable distributed systems." Journal of the ACM (JACM) 43.2 (1996): 225-267.
In asynchronous distributed systems, the detection of crash failures is imperfect. There will be false positives and false negatives.
31. FAILURE DETECTORS. PROPERTIES
Completeness: every crashed process is eventually suspected
Accuracy: no correct process is ever suspected
Speed: how fast a faulty node can be detected
Network message load: the number of messages required during a protocol period
32. BASIC FAILURE DETECTOR
Each process periodically sends an incremented heartbeat counter to the outside world.
Another process is declared failed when no heartbeat is received from it for some time.
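The heartbeat scheme can be sketched as a small bookkeeping class. The class name and API are my own; injecting the clock makes the timeout logic easy to test.

```python
import time

class HeartbeatDetector:
    """Toy heartbeat failure detector: a peer is considered failed when
    its heartbeat counter has not advanced within `timeout` seconds."""

    def __init__(self, timeout=1.0, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock
        self._last = {}  # peer -> (counter, time of last counter increase)

    def heartbeat(self, peer, counter):
        """Record a heartbeat; only a higher counter counts as progress."""
        prev = self._last.get(peer)
        if prev is None or counter > prev[0]:
            self._last[peer] = (counter, self.clock())

    def alive(self, peer):
        entry = self._last.get(peer)
        if entry is None:
            return False
        return self.clock() - entry[1] < self.timeout
```

Every process runs such a detector against every other process, which is what drives the O(n²) message load noted on the next slide.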
33. BASIC FAILURE DETECTOR. PROPERTIES
Completeness: each failed process eventually misses a heartbeat
Speed: configurable, as little as one protocol interval
Accuracy: high, depends on speed
Network message load: each node sends messages to all other nodes, O(n²)
35. SWIM FAILURE DETECTOR
On each protocol round, a node sends only k = 3 ping messages.
SWIM uses direct pings as the primary way to do failure detection, and indirect pings for better tolerance to network partitions.
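One SWIM probe round can be sketched as follows. This is a simplified, synchronous model: `ping(a, b)` stands in for an actual network round trip and simply reports reachability, and the function names are my own.

```python
import random

def swim_probe(target, peers, ping, k=3, rng=random):
    """One SWIM probe round for `target`: try a direct ping first; on
    failure, ask k other peers to ping the target on our behalf.
    `ping(a, b)` returns True if a can reach b (modeling assumption)."""
    me = "self"
    if ping(me, target):
        return "alive"
    # Direct path failed; an indirect path may still exist.
    candidates = [p for p in peers if p != target]
    for helper in rng.sample(candidates, min(k, len(candidates))):
        if ping(me, helper) and ping(helper, target):
            return "alive"  # indirect ping succeeded
    return "suspect"  # not declared dead yet: suspicion subprotocol
```

The indirect pings are why SWIM tolerates a broken direct link: as long as some helper can reach the target, the target stays alive.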
36. SWIM FAILURE DETECTOR. PROPERTIES
Completeness: each process will eventually be pinged
Speed: configurable, 1 protocol interval
Accuracy: 99.9% with delivery probability 0.95 and k = 3
Network message load: (4k + 2)·n, i.e. O(n)
37. SWIM VS CONNECTION LOSS. SUSPICION
SUBPROTOCOL
Provides a mechanism to reduce the rate of false positives by
“suspecting” a process before “declaring” it as failed within
the group.
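A minimal sketch of that suspicion mechanism, assuming time is measured in protocol periods and a refutation simply clears the suspicion (in real SWIM a refutation carries a higher incarnation number; the class name here is my own):

```python
class SuspicionList:
    """Sketch of a suspicion subprotocol: a suspected peer gets
    `timeout` protocol periods to refute before being declared dead."""

    def __init__(self, timeout=3):
        self.timeout = timeout
        self._suspects = {}  # peer -> periods remaining

    def suspect(self, peer):
        self._suspects.setdefault(peer, self.timeout)

    def refute(self, peer):
        # The peer proved it is alive; drop the suspicion.
        self._suspects.pop(peer, None)

    def tick(self):
        """Advance one protocol period; return peers to declare dead."""
        dead = [p for p, t in self._suspects.items() if t <= 1]
        self._suspects = {p: t - 1 for p, t in self._suspects.items() if t > 1}
        return dead
```

A transient network hiccup now costs a few protocol periods of suspicion instead of an immediate (and wrong) death declaration.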
38. SWIM VS PACKET ORDER
Ordering between messages is important, but total order is not required, only happens-before/causal ordering.
Logical timestamps for state updates
Peer-specific and only incremented by the peer itself
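The per-peer logical timestamp rule boils down to one comparison when merging gossiped updates. A sketch, with my own function name and a plain dict as the membership state:

```python
def merge_update(state, peer, incarnation, status):
    """Apply a gossiped membership update only if its per-peer logical
    timestamp (incarnation number) is newer than what we already have;
    stale or reordered messages are simply ignored."""
    current = state.get(peer)
    if current is None or incarnation > current[0]:
        state[peer] = (incarnation, status)
    return state
```

Because only the peer itself increments its incarnation number, this gives causal ordering for that peer's updates without any global clock.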
39. SWIM VS NETWORK PARTITIONS
Nodes in each subnet can talk only to each other and, as a result, declare the peers in the other subnet dead.
How can we recover the cluster after the network heals?
Do not purge nodes on death
Periodically try to rejoin them
41. OVERVIEW OF FRAMEWORKS FOR BUILDING CLUSTER-AWARE SYSTEMS
??? (Python, ???): ???
RingPop (node.js, Uber): Used in services for matching users and drivers, with follow-up location updates
Serf (golang, Hashicorp): Used in a number of applications, for instance in HPC to manage computing resources
Orleans (.NET, Microsoft): General-purpose framework, used in the Halo online game
Orbit/jGroups (Java, EA Games): Used in BioWare games such as Dragon Age, not sure where though. Inspired by Orleans
riak_core (Erlang, Basho): Building block for the Riak database and Erlang distributed systems
Akka (Scala, Lightbend): General-purpose distributed-systems framework, often used as a microservices platform
42. IMPROVEMENT: NETWORK COORDINATES
A famous paper from MIT describes synthetic network coordinates based on ping delays; they are used in Serf/Consul for data-center failover.
44. IMPROVEMENT: PARTIAL VIEW FOR HUGE CLUSTERS
For huge clusters full membership is not scalable; the paper proposes a partial-membership protocol.
45. IMPROVEMENT: PARTIAL VIEW IN CASE OF NODE FAILURES
Even for failure rates as high as 95%, HyParView still manages to maintain reliability, on the order of deliveries to 90% of the active processes.
46. IMPROVEMENT: DHT FOR MORE BALANCING
Orleans uses a one-hop distributed hash table that maps actors to machines; as a result, actors can be moved across the cluster.
49. REFERENCES
1. Karger, David, et al. "Consistent hashing and random trees: Distributed caching protocols for
relieving hot spots on the World Wide Web." Proceedings of the twenty-ninth annual ACM
symposium on Theory of computing. ACM, 1997.
2. Chandra, Tushar Deepak, and Sam Toueg. "Unreliable failure detectors for reliable distributed
systems." Journal of the ACM (JACM) 43.2 (1996): 225-267.
3. Das, Abhinandan, Indranil Gupta, and Ashish Motivala. "Swim: Scalable weakly-consistent
infection-style process group membership protocol." Dependable Systems and Networks, 2002.
DSN 2002. Proceedings. International Conference on. IEEE, 2002.
4. Dabek, Frank, et al. "Vivaldi: A decentralized network coordinate system." ACM SIGCOMM
Computer Communication Review 34.4 (2004): 15-26.
5. Leitao, Joao, José Pereira, and Luis Rodrigues. "HyParView: A membership protocol for reliable
gossip-based broadcast." Dependable Systems and Networks, 2007. DSN'07. 37th Annual
IEEE/IFIP International Conference on. IEEE, 2007.
6. Stoica, Ion, et al. "Chord: A scalable peer-to-peer lookup service for internet applications."
ACM SIGCOMM Computer Communication Review 31.4 (2001): 149-160.
7. Bailis, Peter, and Kyle Kingsbury. "The network is reliable." Queue 12.7 (2014): 20.
8. Lamport, Leslie. "Time, clocks, and the ordering of events in a distributed system."
Communications of the ACM 21.7 (1978): 558-565.