Stateless service architectures are easy to scale horizontally: just add backend servers behind a front-end load balancer. This approach is not always optimal, though. Any application that needs to perform soft real-time work cannot be built on a stateless CRUD model, because state locality is required to achieve those response times. In this talk I'll cover the benefits of stateful services and give an overview of academic research and existing frameworks in the JS, Scala, .NET and Go worlds. Unfortunately, Python has little to offer in this area. To fix this, we will work through the key concepts for building scalable stateful services: membership and dissemination protocols, failure detection and message routing.
2. I AM ...
Software Engineer at DataRobot Ukraine
GitHub: https://github.com/jettify
Twitter: https://twitter.com/isinf
aio-libs: https://github.com/aio-libs
My Projects:
database clients: aiomysql, aioodbc, aiogibson
web etc.: aiomonitor, aiohttp_debugtoolbar, aiobotocore, aiohttp_mako, aiohttp_admin, aiorwlock
3. POLL: HAVE YOU EVER READ THE DYNAMO PAPER?
1. I have read this paper.
2. I have heard about this paper and know its key ideas.
3. I think distributed systems are kinda cool.
4. AGENDA
1. Motivation: why and when we might want to use stateful services.
2. Industry examples: Uber, Halo 4, Dragon Age, HPC
3. Problem statement, required components
4. Overview of consistent hashing, gossip dissemination and SWIM failure detection
5. Possible improvements
5. USE STATELESS (DUCT TAPE) WHEN YOU CAN!
The stateless approach is a proven technique; use it like duct tape.
6. ISSUES WITH STATELESS SERVICES
Soft real time as a requirement
State serialization
Wasteful data fetching
Leaky DB transactions
8. BENEFITS OF STATEFUL SERVICES
Data locality: logic executes where the data is stored, with fast access
Lower latency: state is kept in memory, no extra network hops needed
Higher performance: no need to deserialize data
9. STATEFUL SERVICE EXAMPLE
Extra trips to the database are avoided, which reduces latency.
Even if the database is down, the request can be handled.
11. INDUSTRY EXAMPLE: HALO 4
Orleans is used as the backbone for the server side of the Halo game, including presence, statistics, cheat detection, etc.
12. INDUSTRY EXAMPLE: HPC
The San Diego Supercomputer Center uses Serf to coordinate compute resources in multiple locations; the cluster size is about 2k nodes.
13. LET'S TRY TO SOLVE A CLOSE-TO-REAL-WORLD PROBLEM: PREDICTION SERVICE
A service that predicts resale prices of different products, based on product specifications
A user enters a used product's specs and obtains a price estimate
Each product category
15. REQUIRED COMPONENTS
1. Work distribution and routing: move each job request to the appropriate node
2. Cluster membership updates: provide means to determine the nodes participating in the cluster, both when it is stable and when it is resizing
3. Failure detector: periodically check nodes and remove unresponsive/dead ones
16. ROUTING. NAIVE SOLUTION WITH HARD-CODED CLUSTER NODES
Very easy to implement; a viable solution when dynamic resizing is not required
Does not support dynamic scaling in or out
Requires a cluster restart to change the node configuration
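The naive approach can be sketched in a few lines of Python. The node names and key format here are made up for illustration; the point is what happens to key placement when the hard-coded list changes.

```python
import hashlib

# Hypothetical hard-coded node list.
NODES = ["node-a", "node-b", "node-c"]

def route(key: str, nodes=NODES) -> str:
    """Route a key by hashing it modulo the cluster size."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return nodes[h % len(nodes)]

# The downside: adding a single node remaps most keys.
keys = [f"product-{i}" for i in range(1000)]
before = {k: route(k, NODES) for k in keys}
after = {k: route(k, NODES + ["node-d"]) for k in keys}
moved = sum(1 for k in keys if before[k] != after[k])
# With mod-N routing about 3/4 of the keys move when going from 3 to 4 nodes.
```

That mass remapping on every resize is exactly the problem consistent hashing addresses.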
18. CONSISTENT HASHING. BASIC IDEA
Consistent hashing minimizes the number of keys that need to be remapped when nodes are added or removed.
http://blog.carlosgaldino.com/consistent-hashing.html
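A minimal sketch of a consistent-hash ring in Python, using virtual nodes for smoother balance. The class and parameter names are my own; this is an illustration of the idea, not a production implementation.

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    """Minimal consistent-hash ring with virtual nodes."""

    def __init__(self, nodes=(), replicas=100):
        self.replicas = replicas
        self._ring = []  # sorted list of (point, node) pairs
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        # Each node owns `replicas` points on the ring for smoother balance.
        for i in range(self.replicas):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove(self, node: str) -> None:
        self._ring = [(p, n) for p, n in self._ring if n != node]

    def get(self, key: str) -> str:
        # Walk clockwise to the first ring point at or after the key's hash.
        points = [p for p, _ in self._ring]
        idx = bisect.bisect(points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]
```

Adding a fourth node to a three-node ring now remaps only about a quarter of the keys, instead of about three quarters under mod-N routing.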
23. WHY NOT JUST USE ZOOKEEPER/CONSUL/ETCD (OR, IN OTHER WORDS, ZAB, PAXOS, RAFT)?
Issues:
Availability
Performance
Network partitions
Operational overhead
24. TYPICAL SYSTEM WITH COORDINATION
Zookeeper forces its own view of node availability
Possible links: n(n−1)/2, but only n of them are used for failure detection
Node availability decisions are best made locally
25. CLUSTER MEMBERSHIP UPDATE PROBLEM. NAIVE SOLUTION
Broadcast could be used for cluster membership updates:
Use network broadcast (usually disabled)
Send the message one by one to each peer (not reliable)
27. GOSSIP OVERVIEW
Basic gossip protocol:
Send a message to k random peers
Peers retransmit the message to the next k random peers
In O(log n) steps the information is disseminated
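The push-gossip scheme above is easy to simulate. This is a toy model under simplifying assumptions (peers chosen uniformly at random, synchronous rounds, optional independent message loss), but it shows both the O(log n) dissemination time and the behavior under packet loss.

```python
import random

def gossip_rounds(n=1000, k=3, loss=0.0, seed=1):
    """Simulate push gossip: every informed node forwards the message
    to k random peers each round; each transmission is independently
    dropped with probability `loss`. Returns the number of rounds
    until all n nodes are informed."""
    rng = random.Random(seed)
    informed = {0}  # node 0 starts with the message
    rounds = 0
    while len(informed) < n:
        rounds += 1
        received = set()
        for _ in informed:
            for peer in rng.sample(range(n), k):
                if rng.random() >= loss:
                    received.add(peer)
        informed |= received
    return rounds
```

With n = 1000 and k = 3 the message reaches everyone in a handful of rounds, and 50% packet loss roughly doubles the round count rather than stopping dissemination.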
28. GOSSIP PROTOCOL VS PACKET LOSS
Heavy packet loss does not stop dissemination; it simply takes a bit longer, about 2 times longer at 50% loss.
30. FAILURE DETECTORS FOR ASYNCHRONOUS SYSTEMS
Chandra, Tushar Deepak, and Sam Toueg. "Unreliable failure detectors for reliable distributed systems." Journal of the ACM (JACM) 43.2 (1996): 225-267.
In asynchronous distributed systems, the detection of crash failures is imperfect. There will be false positives and false negatives.
31. FAILURE DETECTORS. PROPERTIES
Completeness: every crashed process is eventually suspected
Accuracy: no correct process is ever suspected
Speed: how fast a faulty node can be detected
Network message load: the number of messages required during a protocol period
32. BASIC FAILURE DETECTOR
Each process periodically sends an incremented heartbeat counter to the outside world.
Another process is declared failed when no heartbeat is received from it for some time.
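The heartbeat scheme can be sketched as a small bookkeeping class. The class name and API are my own; injecting the clock makes the timeout logic easy to test.

```python
import time

class HeartbeatDetector:
    """Toy heartbeat failure detector: a peer is considered failed when
    its heartbeat counter has not advanced within `timeout` seconds."""

    def __init__(self, timeout=1.0, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock
        self._last = {}  # peer -> (counter, time of last counter increase)

    def heartbeat(self, peer, counter):
        """Record a heartbeat; only a higher counter counts as progress."""
        prev = self._last.get(peer)
        if prev is None or counter > prev[0]:
            self._last[peer] = (counter, self.clock())

    def alive(self, peer):
        entry = self._last.get(peer)
        if entry is None:
            return False
        return self.clock() - entry[1] < self.timeout
```

Every process runs such a detector against every other process, which is what drives the O(n²) message load noted on the next slide.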
33. BASIC FAILURE DETECTOR. PROPERTIES
Completeness: each failed process eventually misses a heartbeat
Speed: configurable, as little as one protocol interval
Accuracy: high, depends on speed
Network message load: each node sends messages to all other nodes, O(n²)
35. SWIM FAILURE DETECTOR
On each protocol round, a node sends only k = 3 ping messages.
SWIM uses direct pings as the primary way to do failure detection, and indirect pings for better tolerance to network partitions.
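One SWIM probe round can be sketched as follows. This is a simplified, synchronous model: `ping(a, b)` stands in for an actual network round trip and simply reports reachability, and the function names are my own.

```python
import random

def swim_probe(target, peers, ping, k=3, rng=random):
    """One SWIM probe round for `target`: try a direct ping first; on
    failure, ask k other peers to ping the target on our behalf.
    `ping(a, b)` returns True if a can reach b (modeling assumption)."""
    me = "self"
    if ping(me, target):
        return "alive"
    # Direct path failed; an indirect path may still exist.
    candidates = [p for p in peers if p != target]
    for helper in rng.sample(candidates, min(k, len(candidates))):
        if ping(me, helper) and ping(helper, target):
            return "alive"  # indirect ping succeeded
    return "suspect"  # not declared dead yet: suspicion subprotocol
```

The indirect pings are why SWIM tolerates a broken direct link: as long as some helper can reach the target, the target stays alive.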
36. SWIM FAILURE DETECTOR. PROPERTIES
Completeness: each process will eventually be pinged
Speed: configurable, 1 protocol interval
Accuracy: 99.9% with delivery probability 0.95 and k = 3
Network message load: (4k + 2)·n, i.e. O(n)
37. SWIM VS CONNECTION LOSS. SUSPICION
SUBPROTOCOL
Provides a mechanism to reduce the rate of false positives by
“suspecting” a process before “declaring” it as failed within
the group.
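A minimal sketch of that suspicion mechanism, assuming time is measured in protocol periods and a refutation simply clears the suspicion (in real SWIM a refutation carries a higher incarnation number; the class name here is my own):

```python
class SuspicionList:
    """Sketch of a suspicion subprotocol: a suspected peer gets
    `timeout` protocol periods to refute before being declared dead."""

    def __init__(self, timeout=3):
        self.timeout = timeout
        self._suspects = {}  # peer -> periods remaining

    def suspect(self, peer):
        self._suspects.setdefault(peer, self.timeout)

    def refute(self, peer):
        # The peer proved it is alive; drop the suspicion.
        self._suspects.pop(peer, None)

    def tick(self):
        """Advance one protocol period; return peers to declare dead."""
        dead = [p for p, t in self._suspects.items() if t <= 1]
        self._suspects = {p: t - 1 for p, t in self._suspects.items() if t > 1}
        return dead
```

A transient network hiccup now costs a few protocol periods of suspicion instead of an immediate (and wrong) death declaration.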
38. SWIM VS PACKET ORDER
Ordering between messages is important, but total order is not required, only happens-before/causal ordering.
Logical timestamps for state updates
Peer-specific and only incremented by the peer itself
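The per-peer logical timestamp rule boils down to one comparison when merging gossiped updates. A sketch, with my own function name and a plain dict as the membership state:

```python
def merge_update(state, peer, incarnation, status):
    """Apply a gossiped membership update only if its per-peer logical
    timestamp (incarnation number) is newer than what we already have;
    stale or reordered messages are simply ignored."""
    current = state.get(peer)
    if current is None or incarnation > current[0]:
        state[peer] = (incarnation, status)
    return state
```

Because only the peer itself increments its incarnation number, this gives causal ordering for that peer's updates without any global clock.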
39. SWIM VS NETWORK PARTITIONS
Nodes in each subnet can talk only to each other and, as a result, declare the peers in the other subnet dead.
How can we recover the cluster after the network heals?
Do not purge nodes on death
Periodically try to rejoin them
41. OVERVIEW OF FRAMEWORKS FOR BUILDING CLUSTER-AWARE SYSTEMS
??? (Python, ???): ???
RingPop (node.js, Uber): Used in services for matching users and drivers, with follow-up location updates
Serf (golang, Hashicorp): Used in a number of applications, for instance in HPC to manage computing resources
Orleans (.NET, Microsoft): General-purpose framework, used in the Halo online game
Orbit/jGroups (Java, EA Games): Used in BioWare games such as Dragon Age, not sure where though. Inspired by Orleans
riak_core (Erlang, Basho): Building block for the Riak database and Erlang distributed systems
Akka (Scala, Lightbend): General-purpose distributed-systems framework, often used as a microservices platform
42. IMPROVEMENT: NETWORK COORDINATES
A famous paper from MIT describes synthetic network coordinates based on ping delays; they are used in Serf/Consul for data-center failover.
44. IMPROVEMENT: PARTIAL VIEW FOR HUGE CLUSTERS
For huge clusters full membership is not scalable; the paper proposes a partial-membership protocol.
45. IMPROVEMENT: PARTIAL VIEW IN CASE OF NODE FAILURES
Even for failure rates as high as 95%, HyParView still manages to maintain reliability, on the order of deliveries to 90% of the active processes.
46. IMPROVEMENT: DHT FOR MORE BALANCING
Orleans uses a one-hop distributed hash table that maps actors to machines; as a result, actors can be moved across the cluster.
49. REFERENCES
1. Karger, David, et al. "Consistent hashing and random trees: Distributed caching protocols for
relieving hot spots on the World Wide Web." Proceedings of the twenty-ninth annual ACM
symposium on Theory of computing. ACM, 1997.
2. Chandra, Tushar Deepak, and Sam Toueg. "Unreliable failure detectors for reliable distributed
systems." Journal of the ACM (JACM) 43.2 (1996): 225-267.
3. Das, Abhinandan, Indranil Gupta, and Ashish Motivala. "Swim: Scalable weakly-consistent
infection-style process group membership protocol." Dependable Systems and Networks, 2002.
DSN 2002. Proceedings. International Conference on. IEEE, 2002.
4. Dabek, Frank, et al. "Vivaldi: A decentralized network coordinate system." ACM SIGCOMM
Computer Communication Review 34.4 (2004): 15-26.
5. Leitao, Joao, José Pereira, and Luis Rodrigues. "HyParView: A membership protocol for reliable
gossip-based broadcast." Dependable Systems and Networks, 2007. DSN'07. 37th Annual
IEEE/IFIP International Conference on. IEEE, 2007.
6. Stoica, Ion, et al. "Chord: A scalable peer-to-peer lookup service for internet applications."
ACM SIGCOMM Computer Communication Review 31.4 (2001): 149-160.
7. Bailis, Peter, and Kyle Kingsbury. "The network is reliable." Queue 12.7 (2014): 20.
8. Lamport, Leslie. "Time, clocks, and the ordering of events in a distributed system."
Communications of the ACM 21.7 (1978): 558-565.