Cassandra - A Decentralized Structured Storage System
Avinash Lakshman, Prashant Malik (Facebook), ACM SIGOPS (2010)
Jiacheng & Varad
Dept. of Computer Science,
Donald Bren School of Information and Computer Sciences,
UC Irvine.
jiachenf@uci.edu, vmeru@ics.uci.edu
Outline
● Introduction
● Data Model
● API
● System Architecture
○ Partitioning
○ Replication
○ Membership
○ Bootstrapping and Scaling
○ Local Persistence
● Practical Experiences
● Conclusion
INTRODUCTION
The need!
Why Cassandra?
● Facebook
○ Requirements -
■ Performance.
■ Reliability.
■ Efficiency.
■ Support for continuous growth.
■ Fail-friendly.
○ Application
■ Facebook’s Inbox Search (old: now moved to HBase! [1])
● 2008: 100 Mn users; 2010: 250 Mn users
● 2014: 1.35 Bn users; 864 Mn daily active users on average for September [2].
[1] Storage Infrastructure Behind Facebook Messages: Using HBase at Scale: http://on.fb.me/1ok9SnC | [2] http://newsroom.fb.com/company-info/
Related Work
● Ficus, Coda
○ High availability with specialized conflict resolution for updates. (A-P)
● Farsite
○ Serverless, distributed file system without trust.
● Google File System
○ Single master - simple design. Made fault tolerant with Chubby.
● Bayou
○ Distributed Relational Database with eventual consistency and partition
tolerance. (A-P)
● Dynamo
○ Structured overlay network with 1-hop request routing. Gossip based
information sharing.
● Bigtable
○ GFS based system.
Data Model
Data Model
● Multi-dimensional map.
○ key - string, generally 16-36 bytes long. No size restrictions.
● Column Family - group of columns.
○ Simple column family
○ Super - column family of column families.
○ Column order: sorted by time or by name. (Time sorting helps in Inbox Search.)
● Column Access
○ Simple column
■ column_family:column
○ Super column
■ column_family:super_column:column
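To make the nesting concrete, here is a minimal in-memory sketch of the model as nested maps (Python; the keyspace, column family, and column names such as `UserIndex`, `Profile`, and `fname` are illustrative only, not from the paper):

```python
# Illustrative sketch of Cassandra's data model as nested maps (not the real storage layout).
# table -> row key -> column family -> column name -> (value, timestamp)
# A super column family adds one more level: ... -> super column -> column -> (value, timestamp)

inbox = {
    "UserIndex": {                         # table (keyspace)
        "vmeru": {                         # row key
            "Profile": {                   # simple column family
                "fname": ("Varad", 1352258473420),
            },
            "TermSearch": {                # super column family
                "cassandra": {             # super column (e.g., a search term)
                    "msg_1001": ("", 1352258473421),   # one column per message id
                    "msg_1007": ("", 1352258473425),
                },
            },
        },
    },
}

# Access paths mirror the slide:
#   simple column: column_family:column              -> Profile:fname
#   super column:  column_family:super_column:column -> TermSearch:cassandra:msg_1001
print(inbox["UserIndex"]["vmeru"]["Profile"]["fname"])
```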
Data Model (contd)
API
API
● Three simple methods
○ insert (table, key, rowMutation)
[default@keyspace] set User['vmeru']['fname'] = 'Varad';
[default@keyspace] set User['vmeru'][ascii('email')] = 'vmeru@uci.edu';
cqlsh> INSERT INTO users (firstname, lastname, age, email, city)
VALUES ('John', 'Smith', 46, 'johnsmith@email.com', 'Sacramento');
○ get (table, key, columnName)
[default@keyspace] get User['vmeru'];
=> (column=656d61696c, value=vmeru@uci.edu, timestamp=135225847342)
cqlsh:demo> SELECT * FROM users WHERE lastname = 'Doe';
○ delete (table, key, columnName)
[default@keyspace] del User['vmeru'];
row removed.
cqlsh:demo> DELETE FROM users WHERE lastname = 'Doe';
System Architecture
Over to Jiacheng :)
Requirements
● Scalability:
○ Continuous growth of the platform
● Reliability:
○ Treat failure as norm
System Architecture
Characteristics
● Partitioning
● Replication
● Membership
● Failure Handling
● Scaling
Partitioning
● Purpose:
○ The ability to dynamically partition the data over the set
of nodes.
○ The ability to scale incrementally.
● Consistent hashing using an order preserving hash function
Consistent Hashing
Basics :-
● Assigns each node and key an identifier using a base hash function.
● The hash space is large, and is treated as if it wraps around to form a
circle - hence hash ring. Identifiers are ordered on an identifier circle
modulo M.
● Each data item identified by a key is assigned to a node by hashing the
data item’s key to yield its position on the ring, and then walking the ring
clockwise to find the first node with a position larger than the item’s
position.
Consistent Hashing
Example -
● The ring has 10 nodes and stores five keys.
● The successor of identifier 10 is node 14, so
key 10 would be located at node 14.
● If a node were to join with identifier 26, it
would capture the key with identifier 24
from the node with identifier 32.
The principal advantage of consistent hashing is
that the departure or arrival of a node only affects
its immediate neighbors; other nodes remain
unaffected.
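A minimal ring-lookup sketch of this scheme (Python; MD5 as the base hash and a three-node ring are illustrative assumptions, not Cassandra's actual partitioner):

```python
import bisect
import hashlib

def h(x: str, m: int = 2**32) -> int:
    """Base hash: map a node id or key onto the identifier circle modulo m."""
    return int(hashlib.md5(x.encode()).hexdigest(), 16) % m

class Ring:
    def __init__(self, nodes):
        # Each node gets a position on the circle; keep positions sorted.
        self.positions = sorted((h(n), n) for n in nodes)

    def node_for(self, key: str) -> str:
        """Walk clockwise from the key's position to the first node at or past it."""
        pos = h(key)
        i = bisect.bisect_left(self.positions, (pos, ""))
        if i == len(self.positions):     # wrap around the circle
            i = 0
        return self.positions[i][1]

ring = Ring(["node-a", "node-b", "node-c"])
print(ring.node_for("vmeru"))   # only this key's successor changes if a node joins or leaves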
Consistent Hashing
The basic consistent hashing algorithm presents some challenges:
1. The random position assignment of each node on the ring leads to non-uniform data and load distribution.
2. The basic algorithm is oblivious to the heterogeneity in the performance of nodes.
There are two known solutions:
1. Nodes can be assigned to multiple positions in the circle (as in Dynamo).
2. Analyze load information on the ring and have lightly loaded nodes move on the ring to alleviate heavily loaded nodes (chosen by Cassandra).
Order preserving hash function
If keys are given in some order a1, a2, ..., an, then for any keys aj and ak, j < k implies F(aj) < F(ak).
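One simple order-preserving mapping (a sketch only, not Cassandra's actual OrderPreservingPartitioner) treats the leading bytes of the key as a big-endian integer, so byte-wise key order carries over to token order:

```python
def order_preserving_hash(key: str, width: int = 8) -> int:
    """Map a key to a token such that key1 < key2 (byte-wise) implies F(key1) <= F(key2).

    Sketch only: pads/truncates to `width` bytes and reads them big-endian,
    so keys sharing a long common prefix may collide.
    """
    b = key.encode("utf-8")[:width].ljust(width, b"\x00")
    return int.from_bytes(b, "big")

assert order_preserving_hash("apple") <= order_preserving_hash("banana")
assert order_preserving_hash("banana") <= order_preserving_hash("cherry")
```

Because tokens preserve key order, range scans over keys map to contiguous arcs of the ring, which is what makes per-user time-ordered scans (as in Inbox Search) cheap.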
Replication
Purpose: achieve high availability and durability
● Each data item is replicated at N hosts, where N is the
replication factor
● Each key, k, is assigned to a coordinator node. The
coordinator is in charge of the replication of the data items
that fall within its range.
Replication
Rack Unaware
● The non-coordinator replicas are chosen by picking the N-1 successors of the coordinator on the ring (sketched below).
● Cassandra elects a leader amongst its nodes using a system called Zookeeper.
● All nodes contact the leader when they join the cluster.
● The leader tells them what ranges they are replicas for, maintaining the invariant that no node is responsible for more than N-1 ranges in the ring.
● This metadata is cached locally and in Zookeeper, so a node that crashes and comes back up knows what ranges it was responsible for.
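A sketch of the "Rack Unaware" placement (Python; the node names and MD5-based ring are illustrative assumptions, and leader/Zookeeper coordination is omitted):

```python
import bisect
import hashlib

def h(x: str, m: int = 2**32) -> int:
    return int(hashlib.md5(x.encode()).hexdigest(), 16) % m

def replicas_for(key: str, nodes, n: int = 3):
    """Rack-unaware sketch: the coordinator is the key's successor on the ring,
    and the remaining N-1 replicas are the next N-1 nodes walking clockwise."""
    ring = sorted((h(node), node) for node in nodes)
    i = bisect.bisect_left(ring, (h(key), ""))
    return [ring[(i + j) % len(ring)][1] for j in range(min(n, len(ring)))]

print(replicas_for("vmeru", ["node-a", "node-b", "node-c", "node-d"], n=3))
```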
Replication
Scheme
● Cassandra is configured such that each row is replicated
across multiple data centers.
● Data centers are connected through high speed network
links.
This scheme allows handling entire data center failures without
any outage.
Failure Detection
Purpose
● Locally determine whether any other node in the system is up or down.
● Avoid attempts to communicate with unreachable nodes during various operations.
A modified version of the Φ Accrual Failure Detector is used:
● It doesn't emit a Boolean value stating that a node is up or down.
● It emits a value which represents a suspicion level for each of the monitored nodes.
Failure Detection
● Every node in the system maintains a sliding window of inter-arrival
times of gossip messages from other nodes in the cluster.
● The distribution of these inter-arrival times is determined and Φ is
calculated
● Assuming that we decide to suspect a node A when Φ = 1, the likelihood that we will make a mistake is about 10%. The likelihood is about 1% with Φ = 2, 0.1% with Φ = 3, and so on (see the sketch below).
Approximate Distribution
● Gaussian distribution
● Exponential distribution (chosen)
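A minimal sketch of the accrual calculation under the exponential-distribution assumption (Python; the window size, class name, and thresholds are illustrative, not Cassandra's actual implementation):

```python
from collections import deque
from math import exp, log10

class PhiAccrualDetector:
    """Sketch of an accrual failure detector assuming exponentially distributed
    inter-arrival times of gossip heartbeats (the 'chosen' option on the slide)."""

    def __init__(self, window: int = 1000):
        self.intervals = deque(maxlen=window)   # sliding window of inter-arrival times
        self.last_heartbeat = None

    def heartbeat(self, now: float) -> None:
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now: float) -> float:
        """Suspicion level: phi = -log10(P[heartbeat still arriving later than now])."""
        if not self.intervals or self.last_heartbeat is None:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        t = now - self.last_heartbeat
        p_later = exp(-t / mean)                 # exponential tail probability
        return -log10(max(p_later, 1e-12))       # clamp to avoid log of zero

d = PhiAccrualDetector()
for t in (0.0, 1.0, 2.0, 3.0):                  # heartbeats arriving roughly every second
    d.heartbeat(t)
print(round(d.phi(3.5), 2), round(d.phi(10.0), 2))   # suspicion grows with silence
```

With Φ = 1 corresponding to a roughly 10% chance of a wrong suspicion, Φ = 2 to 1%, and so on, each node can pick its own threshold for declaring a peer down.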
Bootstrapping
● When a node starts for the first time, it chooses a random
token for its position in the ring.
● For fault tolerance, the mapping is persisted to disk locally
and also in Zookeeper.
● The token information is then gossiped around the cluster.
This is how we know about all nodes and their respective
positions in the ring. This enables any node to route a request
for a key to the correct node in the cluster.
Scaling the Cluster
When a new node is added into the system, it gets assigned a token such that
it can alleviate a heavily loaded node.
● The bootstrap algorithm is initiated from any other node in the system.
● The node giving up the data streams it over to the new node using kernel-kernel copy techniques.
● Operational experience has shown that data can be transferred at the rate of 40 MB/sec from a single node.
● This is being improved by having multiple replicas take part in the bootstrap transfer, thereby parallelizing the effort, similar to BitTorrent.
Local Persistence
● A typical write operation involves a write into a commit log for durability and recoverability, and an update into an in-memory data structure (sketched below).
● The write into the in-memory data structure is performed only after a successful write into the commit log.
● When the in-memory data structure crosses a certain threshold, it dumps itself to disk.
● Over time many such files can exist on disk, and a merge process runs in the background to collate the different files into one file.
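A sketch of this write path (Python; the JSON file format, class name, and flush threshold are illustrative assumptions, not Cassandra's actual commit log or SSTable layout):

```python
import json
import os
import tempfile

class LocalStore:
    """Sketch of the commit-log + memtable + flush-to-disk write path."""

    def __init__(self, data_dir: str, flush_threshold: int = 4):
        self.data_dir = data_dir
        self.flush_threshold = flush_threshold
        self.commit_log = open(os.path.join(data_dir, "commit.log"), "a")
        self.memtable = {}                       # in-memory data structure
        self.sstables = []                       # immutable on-disk files

    def write(self, key: str, value: str) -> None:
        # 1. Append to the commit log first, for durability and recoverability.
        self.commit_log.write(json.dumps({"k": key, "v": value}) + "\n")
        self.commit_log.flush()
        # 2. Only after a successful log append, update the in-memory structure.
        self.memtable[key] = value
        # 3. Past a size threshold, dump the memtable to disk as a new file.
        if len(self.memtable) >= self.flush_threshold:
            self._flush()

    def _flush(self) -> None:
        path = os.path.join(self.data_dir, f"sstable-{len(self.sstables)}.json")
        with open(path, "w") as f:
            json.dump(dict(sorted(self.memtable.items())), f)   # keys kept sorted
        self.sstables.append(path)
        self.memtable = {}
        # A background merge (compaction) would later collate these files into one.

store = LocalStore(tempfile.mkdtemp())
for i in range(5):
    store.write(f"user-{i}", f"value-{i}")       # fifth write triggers a flush
```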
Local Persistence
Speed Up
● Bloom Filter: avoids lookups into files that do not contain the key (sketched below).
● Column Indices: prevent scanning every column on disk and allow us to jump to the right chunk on disk for column retrieval.
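A sketch of how a per-file Bloom filter lets reads skip files entirely (Python; a tiny hand-rolled filter with made-up sizes, not Cassandra's implementation):

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter sketch: may report false positives, never false negatives."""

    def __init__(self, bits: int = 1024, hashes: int = 3):
        self.bits, self.hashes = bits, hashes
        self.array = bytearray(bits // 8)

    def _positions(self, key: str):
        for i in range(self.hashes):
            digest = hashlib.md5(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % self.bits

    def add(self, key: str) -> None:
        for p in self._positions(key):
            self.array[p // 8] |= 1 << (p % 8)

    def might_contain(self, key: str) -> bool:
        return all(self.array[p // 8] & (1 << (p % 8)) for p in self._positions(key))

# One filter per on-disk file: only files whose filter says "maybe" are read at all.
file_filters = {"sstable-0": BloomFilter(), "sstable-1": BloomFilter()}
file_filters["sstable-0"].add("vmeru")
candidates = [f for f, bf in file_filters.items() if bf.might_contain("vmeru")]
print(candidates)   # typically just ["sstable-0"]
```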
Practical Experiences
Practical Experiences
● Use of MapReduce jobs to index the inbox data
○ 7 TB data.
○ Special back channel for the MapReduce jobs to aggregate the reverse index per user and
send over the serialized data.
● Atomic operations
● Failure detection is difficult
○ Initial: 2 minutes for a 100-node setup. Later, 15 seconds.
● Integrated with Ganglia for monitoring.
● Some coordination with Zookeeper.
Facebook Inbox Search
● Per user index of all the messages
● Two search features: Two column families.
○ Term Search
■ <user_id, word> : the word becomes the super-column.
■ Individual message identifiers of messages that contain the word become columns.
○ Interactions
■ <user_id, recipients_user_id> : recipients_user_id is the super-column.
■ Individual message identifiers are the columns.
● The user’s index is cached as soon as he/she clicks the search bar.
● Current: 50 TB on a 150-node cluster (East/West coast data centers).
Questions?
One Size does not fit all
Source: http://tcrn.ch/1x3RCSp
Thank You