Cassandra is a decentralized structured storage system developed at Facebook to handle large amounts of structured data across many servers. It uses a distributed architecture with no single point of failure and dynamically replicates data across nodes for high availability. Cassandra uses a column-oriented data model and supports operations like insert, get, and delete. It partitions and distributes data using consistent hashing and handles failures through gossip-based cluster membership and an anti-entropy protocol.
2. Outline …
Introduction
Data Model
System Architecture
Bootstrapping & Scaling
Local Persistence
Conclusion
3. What is Cassandra?
Distributed Storage System
Manages Structured Data
Highly available, no single point of failure (SPoF)
Not a Relational Data Model
Handles high write throughput
◦ No impact on read efficiency
5. Related Work
Google File System
◦ Distributed FS, single master/slave
Ficus/Coda
◦ Distributed FS
Farsite
◦ Distributed FS, No centralized server
Bayou
◦ Distributed Relational DB System
Dynamo
◦ Distributed Storage system
8. Data Model
Table
◦ Multidimensional map indexed by key
Columns
◦ Grouped into column families
◦ Column family types: Simple, Super (nested column families)
Column has
◦ Name/Value/Timestamp
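The data model above can be sketched as a nested map. This is a minimal illustration, not Cassandra's API: the names `Column`, `Table`, `insert`, `get`, and `delete` are hypothetical, resolving conflicting writes by newest timestamp is an assumption, and the delete here simply drops the column.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """A column as described on the slide: name, value, timestamp."""
    name: str
    value: bytes
    timestamp: int


class Table:
    """Hypothetical sketch: row key -> column family -> column name -> Column."""

    def __init__(self):
        self.rows = {}

    def insert(self, key, family, column):
        cols = self.rows.setdefault(key, {}).setdefault(family, {})
        current = cols.get(column.name)
        # Assumption: the write with the newest timestamp wins.
        if current is None or column.timestamp >= current.timestamp:
            cols[column.name] = column

    def get(self, key, family, name):
        return self.rows.get(key, {}).get(family, {}).get(name)

    def delete(self, key, family, name):
        # Simplification: remove the column outright.
        self.rows.get(key, {}).get(family, {}).pop(name, None)
```

A super column family would add one more level of nesting between the family and its columns.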
13. Cassandra Architecture
Partitioning
Data distribution across nodes
Replication
Data duplication across nodes
Cluster Membership
Node management in cluster (adding/removing nodes)
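Partitioning by consistent hashing can be sketched as a ring of tokens, where a key belongs to the first node whose token is at or after the key's hash. This is an illustrative sketch only: `Ring`, `token_for`, `owner_of_token`, and the choice of MD5 as the hash function are assumptions, not details taken from the slides.

```python
import bisect
import hashlib


def token_for(key: str) -> int:
    """Map a key to a ring position (MD5 is an assumed hash choice)."""
    return int.from_bytes(hashlib.md5(key.encode()).digest(), "big")


class Ring:
    """Hypothetical consistent-hashing ring: one token per node."""

    def __init__(self):
        self.tokens = []   # sorted ring positions
        self.owners = {}   # token -> node name

    def add_node(self, node, token):
        bisect.insort(self.tokens, token)
        self.owners[token] = node

    def remove_node(self, node):
        for token in [t for t, n in self.owners.items() if n == node]:
            self.tokens.remove(token)
            del self.owners[token]

    def owner_of_token(self, token):
        if not self.tokens:
            raise LookupError("ring is empty")
        # First node token at or after the position, wrapping around the ring.
        idx = bisect.bisect_left(self.tokens, token) % len(self.tokens)
        return self.owners[self.tokens[idx]]

    def coordinator(self, key):
        return self.owner_of_token(token_for(key))
```

Adding or removing a node only changes ownership of the range next to its token, which is why membership changes affect only neighboring nodes.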
18. Replication
Different Replication Policies
◦ Rack Unaware
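Replicate at the N-1 successor nodes of the coordinator on the ring
◦ Rack Aware
Uses ZooKeeper to elect a leader that coordinates replica placement
◦ Data center Aware
Similar to Rack Aware, with the leader chosen at the datacenter level.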
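The Rack Unaware policy can be sketched as walking the ring clockwise from the key's coordinator until N distinct nodes are collected. This is a minimal sketch: the function name `rack_unaware_replicas` and the `token -> node` dictionary layout are hypothetical, and the rack-aware and datacenter-aware policies are not modeled.

```python
import bisect


def rack_unaware_replicas(ring_tokens, key_token, n):
    """Return the coordinator plus up to n-1 distinct successors.

    ring_tokens: dict mapping token -> node name (assumed layout).
    key_token:   ring position of the key.
    n:           replication factor.
    """
    tokens = sorted(ring_tokens)
    if not tokens:
        return []
    # The coordinator owns the first token at or after the key position.
    start = bisect.bisect_left(tokens, key_token) % len(tokens)
    replicas = []
    for i in range(len(tokens)):
        node = ring_tokens[tokens[(start + i) % len(tokens)]]
        if node not in replicas:
            replicas.append(node)
        if len(replicas) == n:
            break
    return replicas
```

If the cluster has fewer than N nodes, the sketch simply returns every node once.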
19. Cluster Membership
Based on Scuttlebutt
Efficient gossip-based mechanism
Inspired by real-life rumor spreading
Anti-entropy protocol
◦ Repair replicated data by comparing and reconciling differences
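The compare-and-reconcile step of anti-entropy can be sketched for two replicas that store `(value, timestamp)` pairs. This is a simplified illustration: the function `reconcile` and the dictionary layout are hypothetical, and keeping the entry with the newest timestamp is an assumption about how differences are resolved.

```python
def reconcile(replica_a, replica_b):
    """Bring two replicas into agreement; return the number of repaired keys.

    Each replica is a dict: key -> (value, timestamp). Assumed layout.
    """
    repaired = 0
    for key in set(replica_a) | set(replica_b):
        a = replica_a.get(key)
        b = replica_b.get(key)
        if a == b:
            continue
        # Assumption: the entry with the newest timestamp is authoritative.
        newest = max((v for v in (a, b) if v is not None), key=lambda v: v[1])
        replica_a[key] = newest
        replica_b[key] = newest
        repaired += 1
    return repaired
```

Comparing every key directly is only practical for small data sets; the sketch is meant to show the reconciliation rule, not an efficient comparison scheme.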
21. Cluster Membership
Failure Detection
◦ Accrual Failure Detector
If a node is faulty, the suspicion level increases.
Φ(t) → k as t → k
k - threshold variable
◦ If node is correct
Φ(t) = 0
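A suspicion level of this kind can be sketched by assuming heartbeat inter-arrival times follow an exponential distribution, so that Φ = -log10(P(no heartbeat yet after the elapsed time)). The class name, the window size, and the default threshold of 8 are hypothetical choices for illustration, not values from the slides.

```python
import math
from collections import deque


class AccrualFailureDetector:
    """Hypothetical sketch of an accrual failure detector for one peer."""

    def __init__(self, threshold=8.0, window=1000):
        self.threshold = threshold            # the slide's threshold k (assumed value)
        self.intervals = deque(maxlen=window)  # recent inter-arrival times
        self.last_heartbeat = None

    def heartbeat(self, now):
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now):
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        if mean <= 0:
            return 0.0
        elapsed = now - self.last_heartbeat
        # Exponential model: P(later) = exp(-elapsed / mean),
        # so phi = -log10(P(later)) = elapsed / (mean * ln 10).
        return elapsed / (mean * math.log(10))

    def is_suspected(self, now):
        return self.phi(now) > self.threshold
```

Under this model Φ is 0 immediately after a heartbeat arrives and grows with the silence that follows, so a responsive node stays near 0 while a failed node eventually crosses the threshold.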
23. Bootstrapping & Scaling
Bootstrapping
◦ Node selects random token
◦ Locally persisted, gossiped to cluster
Scaling
◦ Cassandra bootstrap algorithm initiated by operator
◦ New node gets a split range from a heavily loaded node
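The scaling step can be sketched as picking the most heavily loaded node and giving the new node a token in the middle of that node's range. This is a sketch under stated assumptions: `split_heaviest`, the load metric, the midpoint split, and the ring size of 2**127 are all hypothetical.

```python
def split_heaviest(tokens, load, ring_size=2**127):
    """Pick the heaviest node and a token that halves its range.

    tokens: dict node -> token (assumed one token per node).
    load:   dict node -> load figure (assumed metric).
    Returns (heaviest_node, new_token).
    """
    heavy = max(load, key=load.get)
    ordered = sorted(tokens.values())
    idx = ordered.index(tokens[heavy])
    pred = ordered[idx - 1]  # index -1 wraps to the last token on the ring
    span = (tokens[heavy] - pred) % ring_size
    if span == 0:
        span = ring_size     # a single node owns the whole ring
    new_token = (pred + span // 2) % ring_size
    return heavy, new_token
```

With the new token in place, the new node takes over the lower half of the range and the heavily loaded node keeps the upper half.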
29. Conclusion
Proven high scalability, performance, and wide applicability
Very high update throughput, delivering low latency
Future work
◦ Adding compression
◦ Supporting atomicity across keys
◦ Adding secondary index support