Slides created as a part of CS 295's week 4 on NoSQL Basics.
CS 295 (Cloud Computing and BigData) at UCI - https://sites.google.com/site/cs295cloudcomputing/
Cassandra - A Decentralized Structured Storage System
1. Cassandra - A Decentralized Structured Storage System
Avinash Lakshman, Prashant Malik (Facebook), ACM SIGOPS (2010)
Jiacheng & Varad
Dept. of Computer Science,
Donald Bren School of Information and Computer Sciences,
UC Irvine.
jiachenf@uci.edu, vmeru@ics.uci.edu
2. Outline
● Introduction
● Data Model
● API
● System Architecture
○ Partitioning
○ Replication
○ Membership
○ Bootstrapping and Scaling
○ Local Persistence
● Practical Experiences
● Conclusion
4. Why Cassandra?
● Facebook
○ Requirements -
■ Performance.
■ Reliability.
■ Efficiency.
■ Support for continuous growth.
■ Fail-friendly.
○ Application
■ Facebook’s Inbox Search (Old: now moved to HBase!¹)
● 2008: 100 Mn; 2010: 250 Mn users
● 2014: 1.35 Bn users; 864 Mn daily average users for September²
1: Storage Infrastructure Behind Facebook Messages: Using HBase at Scale: http://on.fb.me/1ok9SnC | 2: http://newsroom.fb.com/company-info/
5. Related Work
● Ficus, Coda
○ High availability with specialized conflict resolution for updates. (A-P)
● Farsite
○ Serverless, distributed file system without trust.
● Google File System
○ Single master - simple design. Made fault tolerant with Chubby.
● Bayou
○ Distributed relational database with eventual consistency and partition tolerance. (A-P)
● Dynamo
○ Structured overlay network with one-hop request routing. Gossip-based information sharing.
● Bigtable
○ GFS based system.
7. Data Model
● Multi-dimensional map.
○ key - string, generally 16-36 bytes long. No size restrictions.
● Column Family - group of columns.
○ Simple column family
○ Super - column family of column families.
○ Column order: sorted by time or by name. (Time sorting helps in Inbox Search.)
● Column Access (sketched below)
○ Simple column
■ column_family:column
○ Super column
■ column_family:super_column:column
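A minimal sketch of this data model as nested maps (Python; the row key, family names, and values are hypothetical, and real Cassandra keeps each column family in its own store - this collapses them into one map purely for illustration):

```python
# Cassandra's data model approximated as nested maps (illustrative sketch).
# Simple column family:  row key -> column family -> column -> value
# Super column family:   row key -> column family -> super column -> column -> value
data = {
    "user_42": {                 # row key (hypothetical)
        "Profile": {             # simple column family
            "name": "Ada",       # access path: Profile:name
        },
        "Contacts": {            # super column family
            "friends": {         # super column
                "user_99": "",   # access path: Contacts:friends:user_99
            },
        },
    },
}
print(data["user_42"]["Profile"]["name"])      # simple:  column_family:column
print(data["user_42"]["Contacts"]["friends"])  # super:   column_family:super_column
```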
14. Partitioning
● Purpose:
○ The ability to dynamically partition the data over the set of nodes.
○ The ability to scale incrementally.
● Consistent hashing using an order-preserving hash function.
15. Consistent Hashing
Basics:
● Assigns each node and key an identifier using a base hash function.
● The hash space is large and is treated as if it wraps around to form a circle - hence the "hash ring". Identifiers are ordered on an identifier circle modulo M.
● Each data item identified by a key is assigned to a node by hashing the data item’s key to yield its position on the ring, and then walking the ring clockwise to find the first node with a position larger than the item’s position (see the sketch below).
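A minimal sketch of the ring assignment just described (Python; using MD5 as the base hash function is an assumption for illustration):

```python
import hashlib
from bisect import bisect_right

def h(x: str) -> int:
    """Base hash: position on the identifier circle (assumption: MD5)."""
    return int(hashlib.md5(x.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        # Each node gets an identifier; keep (position, node) pairs sorted.
        self.positions = sorted((h(n), n) for n in nodes)

    def node_for(self, key: str):
        """Walk clockwise to the first node at a position >= the key's."""
        pos = h(key)
        idx = bisect_right(self.positions, (pos, ""))
        # Wrap around the circle if we walked past the largest position.
        return self.positions[idx % len(self.positions)][1]

ring = Ring(["node-a", "node-b", "node-c"])
print(ring.node_for("user_42"))
```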
16. Consistent Hashing
Example -
● The ring has 10 nodes and stores five keys.
● The successor of identifier 10 is node 14, so
key 10 would be located at node 14.
● If a node were to join with identifier 26, it would capture the key with identifier 24 from the node with identifier 32.
The principal advantage of consistent hashing is that the departure or arrival of a node affects only its immediate neighbors; other nodes remain unaffected.
17. Consistent Hashing
The basic consistent hashing algorithm presents some challenges:
1. The random position assignment of each node on the ring leads to non-uniform data and load distribution.
2. The basic algorithm is oblivious to the heterogeneity in the performance of nodes.
There are two solutions:
1. Assign nodes to multiple positions in the circle.
2. Analyze load information on the ring and have lightly loaded nodes move on the ring to alleviate heavily loaded nodes (chosen).
18. Order preserving hash function
If keys are given in some order a₁, a₂, …, aₙ, then for any keys aⱼ and aₖ, j < k implies F(aⱼ) < F(aₖ). (Toy example below.)
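A toy illustration of this property (Python; the byte-wise construction is an assumption for illustration, not Cassandra's actual order-preserving function):

```python
# Toy order-preserving hash: interpret fixed-width key bytes as a
# big-endian integer, so lexicographic key order maps to numeric order.
def F(key: str, width: int = 16) -> int:
    b = key.encode().ljust(width, b"\x00")[:width]
    return int.from_bytes(b, "big")

keys = ["alice", "bob", "carol"]   # already in sorted order
values = [F(k) for k in keys]
assert values == sorted(values)    # j < k implies F(a_j) < F(a_k)
```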
19. Replication
Purpose: achieve high availability and durability
● Each data item is replicated at N hosts, where N is the
replication factor
● Each key, k, is assigned to a coordinator node. The
coordinator is in charge of the replication of the data items
that fall within its range.
20. Replication
Rack Unaware
● The non-coordinator replicas are chosen by picking the N-1 successors of the coordinator on the ring (sketched below).
● The Cassandra system elects a leader amongst its nodes using a system called Zookeeper.
● All nodes, on joining the cluster, contact the leader.
● The leader tells them what ranges they are replicas for, maintaining the invariant that no node is responsible for more than N-1 ranges in the ring.
● This metadata is cached locally and in Zookeeper, so a node that crashes and comes back up knows what ranges it was responsible for.
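A minimal sketch of "Rack Unaware" replica selection (Python; the node positions and N below are illustrative):

```python
def replicas_for(key_pos: int, node_positions: list[int], n: int) -> list[int]:
    """'Rack Unaware': the coordinator plus its N-1 distinct successors."""
    ring = sorted(node_positions)
    # Coordinator: first node clockwise from the key's position.
    start = next((i for i, p in enumerate(ring) if p >= key_pos), 0)
    return [ring[(start + i) % len(ring)] for i in range(min(n, len(ring)))]

# Example: N = 3 replicas on a 5-node ring.
print(replicas_for(key_pos=26, node_positions=[4, 14, 24, 32, 50], n=3))
# -> [32, 50, 4]
```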
21. Replication
Scheme
● Cassandra is configured such that each row is replicated
across multiple data centers.
● Data centers are connected through high speed network
links.
This scheme allows Cassandra to handle entire data center failures without any outage.
22. Failure Detection
Purpose
● Locally determine whether any other node in the system is up or down.
● Avoid attempts to communicate with unreachable nodes during various operations.
Modified version of the Φ Accrual Failure Detector:
● Doesn’t emit a Boolean value stating whether a node is up or down.
● Emits a value which represents a suspicion level for each monitored node.
23. Failure Detection
● Every node in the system maintains a sliding window of inter-arrival
times of gossip messages from other nodes in the cluster.
● The distribution of these inter-arrival times is determined and Φ is calculated.
● Assuming that we decide to suspect a node A when Φ = 1, the likelihood that we will make a mistake is about 10%. The likelihood is about 1% with Φ = 2, 0.1% with Φ = 3, and so on.
Approximate Distribution
● Gaussian distribution
● Exponential distribution (chosen; see the sketch below)
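A minimal sketch of the Φ computation under the exponential model (Python; the window size and the use of the sample mean are assumptions):

```python
import math
from collections import deque

class PhiAccrualDetector:
    """Φ accrual failure detection with an exponential inter-arrival model."""

    def __init__(self, window: int = 1000):
        self.intervals = deque(maxlen=window)  # sliding window of gossip gaps
        self.last_heartbeat = None

    def heartbeat(self, now: float):
        """Record a gossip message arrival from the monitored node."""
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now: float) -> float:
        """Suspicion level: Φ = 1 ~ 10% chance of error, Φ = 2 ~ 1%, ..."""
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        t = now - self.last_heartbeat
        # Exponential model: P(heartbeat still coming) = exp(-t / mean),
        # so Φ = -log10(exp(-t / mean)) = t / (mean * ln 10).
        return t / (mean * math.log(10))
```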
24. Bootstrapping
● When a node starts for the first time, it chooses a random
token for its position in the ring.
● For fault tolerance, the mapping is persisted to disk locally
and also in Zookeeper.
● The token information is then gossiped around the cluster.
This is how we know about all nodes and their respective
positions in the ring. This enables any node to route a request
for a key to the correct node in the cluster.
25. Scaling the Cluster
When a new node is added into the system, it gets assigned a token such that
it can alleviate a heavily loaded node.
● The bootstrap algorithm is initiated from any other node in the system.
● The node giving up the data streams the data over to the new node using kernel-kernel copy techniques.
● Operational experience has shown that data can be transferred at the rate of 40 MB/sec from a single node.
● This is being improved by having multiple replicas take part in the bootstrap transfer, thereby parallelizing the effort, similar to BitTorrent.
26. Local Persistence
● A typical write operation involves a write into a commit log for durability and recoverability, and an update into an in-memory data structure.
● The write into the in-memory data structure is performed only after a successful write into the commit log.
● When the in-memory data structure crosses a certain threshold, it dumps itself to disk.
● Over time many such files can exist on disk, and a merge process runs in the background to collate the different files into one file. (Write path sketched below.)
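A minimal sketch of this write path (Python; the file format, naming, and flush threshold are assumptions, and the background merge is omitted):

```python
import json
import os

MEMTABLE_THRESHOLD = 4  # flush after this many entries (assumption)

class LocalStore:
    def __init__(self, data_dir: str):
        os.makedirs(data_dir, exist_ok=True)
        self.data_dir = data_dir
        self.commit_log = open(os.path.join(data_dir, "commit.log"), "a")
        self.memtable: dict[str, str] = {}
        self.sstable_seq = 0

    def write(self, key: str, value: str):
        # 1. Append to the commit log first, for durability/recoverability.
        self.commit_log.write(json.dumps({"k": key, "v": value}) + "\n")
        self.commit_log.flush()
        # 2. Only after the log write succeeds, update the in-memory structure.
        self.memtable[key] = value
        # 3. Past the threshold, dump the memtable to a new on-disk file.
        if len(self.memtable) >= MEMTABLE_THRESHOLD:
            self.flush()

    def flush(self):
        path = os.path.join(self.data_dir, f"sstable-{self.sstable_seq}.json")
        with open(path, "w") as f:
            json.dump(dict(sorted(self.memtable.items())), f)
        self.sstable_seq += 1
        self.memtable.clear()
```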
27. Local Persistence
Speed Up
● Bloom Filter: avoids lookups into files that do not contain the key (sketched below).
● Column Indices: prevent scanning of every column on disk and allow a jump to the right chunk on disk for column retrieval.
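A minimal sketch of the Bloom filter check on the read path (Python; the filter size and double-hashing scheme are assumptions):

```python
import hashlib

class BloomFilter:
    """Answers 'definitely absent' or 'maybe present' before a disk lookup."""

    def __init__(self, size_bits: int = 1 << 16, num_hashes: int = 5):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key: str):
        # Derive k bit positions from two hash halves (double hashing, assumption).
        d = hashlib.sha256(key.encode()).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big")
        return [(h1 + i * h2) % self.size for i in range(self.k)]

    def add(self, key: str):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

# One filter per data file: skip any file whose filter rules the key out.
f = BloomFilter()
f.add("user_42")
assert f.might_contain("user_42")
```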
29. Practical Experiences
● Use of MapReduce jobs to index the inbox data.
○ 7 TB of data.
○ Special back channel for MapReduce to process the aggregated reverse index per user and send over the serialized data.
● Atomic operations
● Failure detection is difficult
○ Initial: 2 minutes for a 100-node setup. Later, 15 seconds.
● Integrated with Ganglia for monitoring.
● Some coordination with Zookeeper.
30. Facebook Inbox Search
● Per user index of all the messages
● Two search features, two column families (both sketched below):
○ Term Search
■ Key: user_id; each word (term) in a message becomes a super column.
■ Individual message identifiers that contain the word become columns.
○ Interactions
■ Key: user_id; the recipient’s user_id is the super column.
■ Individual message identifiers are the columns.
● Caching user’s index as soon as he/she clicks the search bar.
● Current: 50 TB, on 150 Node cluster (East/West coast data centers).
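A minimal sketch of the two column families as nested maps (Python; all identifiers and message ids are hypothetical):

```python
# Term Search: per-user super column family keyed by search term.
# row key = user_id; super column = word/term; columns = message ids.
term_search = {
    "user_42": {
        "lunch": {"msg_1001": "", "msg_2007": ""},
    },
}

# Interactions: per-user super column family keyed by the other party.
# row key = user_id; super column = recipient's user_id; columns = message ids.
interactions = {
    "user_42": {
        "user_99": {"msg_1001": "", "msg_3300": ""},
    },
}

# Messages exchanged with user_99, already time-sorted by column order:
print(interactions["user_42"]["user_99"].keys())
```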