2. What is Cassandra?
01 Cassandra Architecture
02 Cassandra Cluster
03 CQL Column Family
04 Cassandra Write Path
05 Cassandra Read Path
06 C* Data Model
11. Consistent Hashing
Consistent hashing distributes data across a cluster in a way that minimizes
reorganization when nodes are added or removed. Data is partitioned based on
the partition key.
For example, given this table:

name    age  car     gender
jim     36   camaro  M
carol   37   bmw     F
johnny  12           M
suzy    10           F

Cassandra assigns a hash value to each partition key:

Partition key  Murmur3 hash value
jim            -2245462676723223822
carol          7723358927203680754
johnny         -6723372854036780875
suzy           1168604627387940318
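
The "minimal reorganization" property can be illustrated with a small sketch. This is not Cassandra's implementation: the `Ring` class, the node names, and the `token` helper are hypothetical, and the hash is an MD5-based stand-in because Murmur3 is not in the Python standard library, so the tokens will not match the values listed above.

```python
import bisect
import hashlib


def token(key: str) -> int:
    """Stand-in hash (NOT Murmur3): first 8 bytes of MD5 as a signed 64-bit int."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


class Ring:
    """Hypothetical ring with one token per node, kept sorted by token."""

    def __init__(self, nodes):
        self._tokens = []  # sorted list of (token, node_name)
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        bisect.insort(self._tokens, (token(node), node))

    def owner(self, key: str) -> str:
        # The owner is the first node whose token is >= the key's token,
        # wrapping around to the first node at the end of the ring.
        i = bisect.bisect_left(self._tokens, (token(key), ""))
        return self._tokens[i % len(self._tokens)][1]


ring = Ring(["node1", "node2", "node3"])
keys = ["jim", "carol", "johnny", "suzy"] + [f"user{i}" for i in range(1000)]

before = {k: ring.owner(k) for k in keys}
ring.add_node("node4")
after = {k: ring.owner(k) for k in keys}

# Keys that changed owner; every one of them now belongs to the new node.
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved, all to node4")
```

Only the keys in the range taken over by the new node change owner; every other key stays where it was, which is the point of consistent hashing.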
12. What is a CQL table and how is it related to a column family?
13. What are row, row key, column key and column value?
A row is the smallest unit that stores related data in Cassandra.
• Rows – individual rows constitute a column family
• Row key – uniquely identifies a row in a column family
• Row – stores pairs of column keys and column values
• Column key – uniquely identifies a column value in a row
• Column value – stores one value or a collection of values
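
One way to picture these terms is as nested maps. The sketch below is only a mental model, not how Cassandra stores data on disk; the type aliases and the `get_column` helper are hypothetical, and the sample rows reuse the example table from the consistent hashing slide.

```python
from typing import Dict, List, Optional, Union

ColumnValue = Union[str, int, List[str]]  # one value or a collection of values
Row = Dict[str, ColumnValue]              # column key -> column value
ColumnFamily = Dict[str, Row]             # row key -> row

users: ColumnFamily = {
    "jim": {"age": 36, "car": "camaro", "gender": "M"},
    "carol": {"age": 37, "car": "bmw", "gender": "F"},
    "johnny": {"age": 12, "gender": "M"},  # no "car" column in this row
    "suzy": {"age": 10, "gender": "F"},
}


def get_column(cf: ColumnFamily, row_key: str, column_key: str) -> Optional[ColumnValue]:
    """Look up a column value by row key, then by column key."""
    return cf.get(row_key, {}).get(column_key)


print(get_column(users, "jim", "car"))     # camaro
print(get_column(users, "johnny", "car"))  # None
```

Rows in the same column family do not need the same set of column keys, which is why "johnny" and "suzy" simply have no "car" entry.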
16. How does Cassandra write so fast?
Cassandra is a log-structured storage engine
• Data is sequentially appended, not placed in pre-set locations
17. What are the key components of the write path?
Each node implements four key components to handle its writes:
• Memtables – in-memory tables corresponding to CQL tables, with indexes
• CommitLog – append-only log, replayed to restore a downed node's Memtables
• SSTables – Memtable snapshots periodically flushed to disk, clearing heap
• Compaction – periodic process to merge and streamline SSTables
When a node receives a write request:
• The record is appended to the CommitLog, and
• The record is appended to the Memtable for this record's target CQL table
• Periodically, Memtables flush to SSTables, clearing JVM heap and CommitLog
• Periodically, Compaction runs to merge and streamline SSTables
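
The write path above can be sketched as a toy in-memory model. Everything here is an assumption made for illustration: the `Node` class, its method names, and the `flush_threshold` trigger are hypothetical, and real Cassandra persists the CommitLog and SSTables to disk and flushes on size and time criteria rather than a row count.

```python
class Node:
    """Toy model of one node's write path (hypothetical, in-memory only)."""

    def __init__(self, flush_threshold=2):
        self.commit_log = []   # append-only list of (table, key, value)
        self.memtables = {}    # table -> {key: value}
        self.sstables = {}     # table -> list of immutable snapshots, oldest first
        self.flush_threshold = flush_threshold  # assumed trigger: rows per Memtable

    def write(self, table, key, value):
        self.commit_log.append((table, key, value))  # 1. append to the CommitLog
        memtable = self.memtables.setdefault(table, {})
        memtable[key] = value                        # 2. append to the Memtable
        if len(memtable) >= self.flush_threshold:
            self.flush(table)

    def flush(self, table):
        memtable = self.memtables.pop(table, {})
        if memtable:
            # 3. snapshot the Memtable as a sorted, never-modified SSTable
            self.sstables.setdefault(table, []).append(dict(sorted(memtable.items())))
        # Flushed data is durable, so its CommitLog entries can be dropped.
        self.commit_log = [e for e in self.commit_log if e[0] != table]

    def compact(self, table):
        merged = {}
        for sstable in self.sstables.get(table, []):
            merged.update(sstable)  # 4. newer SSTables overwrite older values
        self.sstables[table] = [dict(sorted(merged.items()))] if merged else []

    def read(self, table, key):
        if key in self.memtables.get(table, {}):
            return self.memtables[table][key]
        for sstable in reversed(self.sstables.get(table, [])):  # newest first
            if key in sstable:
                return sstable[key]
        return None


node = Node(flush_threshold=2)
node.write("users", "jim", 36)
node.write("users", "carol", 37)  # second row triggers a flush
node.write("users", "jim", 37)    # newer value for an existing key
node.write("users", "suzy", 10)   # triggers a second flush
node.compact("users")             # two SSTables merge into one
print(node.sstables["users"])
```

Note that a write never updates an existing SSTable in place: the newer value for "jim" lands in a second SSTable, and only compaction reconciles the two.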
32. Partitioning
• Nodes are logically structured in Ring Topology.
• The hashed value of the partition key determines which node in the ring
stores the data.
• Hash values wrap around after the maximum value, which closes the ring.
• Lightly loaded nodes move position on the ring to relieve heavily loaded
nodes.
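
The wrap-around can be shown with the Murmur3 hash values from the consistent hashing example. This sketch assumes four hypothetical nodes with evenly spaced tokens over the signed 64-bit range, and assumes each node owns the tokens from just after the previous node's token up to and including its own; the load-based repositioning in the last bullet is not modeled.

```python
import bisect

MIN_TOKEN = -2**63   # lowest value of a signed 64-bit token
RING_SIZE = 2**64    # number of distinct token values


def even_tokens(node_count):
    """Evenly spaced tokens, one per node, starting at MIN_TOKEN."""
    step = RING_SIZE // node_count
    return [MIN_TOKEN + i * step for i in range(node_count)]


def owner_index(tokens, key_token):
    """Index of the first node whose token is >= key_token, wrapping to node 0."""
    return bisect.bisect_left(tokens, key_token) % len(tokens)


tokens = even_tokens(4)
nodes = ["node1", "node2", "node3", "node4"]  # hypothetical node names

# Hash values taken from the example table of partition keys.
key_tokens = {
    "jim": -2245462676723223822,
    "carol": 7723358927203680754,
    "johnny": -6723372854036780875,
    "suzy": 1168604627387940318,
}

owners = {key: nodes[owner_index(tokens, t)] for key, t in key_tokens.items()}
print(owners)
```

With these assumptions "carol" hashes above the highest node token, so the lookup wraps around to node1, while "johnny", "jim", and "suzy" land on node2, node3, and node4.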
34. CAP Theorem
Consistency – All the servers in the system will have the same data so anyone using the system will get the same copy regardless of which server answers their request.
Availability – The system will always respond to a request (even if it's not the latest data or consistent across the system or just a message saying the system isn't working).
Partition Tolerance – The system continues to operate as a whole even if individual servers fail or can't be reached.
35. Cassandra Architecture Overview
○ Cassandra was designed with the understanding that system/hardware failures
can and do occur
○ Peer-to-peer, distributed system
○ All nodes are the same
○ Data partitioned among all nodes in the cluster
○ Custom data replication to ensure fault tolerance
○ Read/Write-anywhere design
○ Google BigTable - data model
    ○ Column Families
    ○ Memtables
    ○ SSTables
○ Amazon Dynamo - distributed systems technologies
    ○ Consistent hashing
    ○ Partitioning
    ○ Replication
    ○ One-hop routing
36. Transparent Elasticity
Nodes can be added to and removed from a Cassandra cluster online, with no
downtime.
[Figure: a 6-node ring grows to a 12-node ring.]
37. Transparent Scalability
Adding Cassandra nodes increases performance linearly and extends the
cluster's capacity to manage terabytes to petabytes of data.
[Figure: a 6-node ring with performance throughput = N; a 12-node ring with performance throughput = N x 2.]
39. Multi-Geography/Zone Aware
Cassandra allows a single logical database to span 1-N datacenters
that are geographically dispersed. It also supports hybrid
on-premise/cloud deployments.
40. Data Redundancy
Cassandra allows for customizable data redundancy so that data is
protected against node failures. It also supports rack awareness (data
can be replicated across different racks to guard against machine/rack
failures).
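
Rack awareness can be sketched as a walk around the ring that prefers racks not yet holding a replica. This is a simplified illustration under stated assumptions, not Cassandra's actual replication strategy code: the `place_replicas` function, the node names, and the rack names are all hypothetical.

```python
def place_replicas(ring, start, rf):
    """Pick rf replica nodes walking clockwise from index `start`.

    ring: list of (node, rack) pairs in ring order.
    Nodes on racks that already hold a replica are skipped at first and only
    used as a fallback when there are fewer distinct racks than replicas.
    """
    replicas, racks_used, skipped = [], set(), []
    for step in range(len(ring)):
        if len(replicas) == rf:
            break
        node, rack = ring[(start + step) % len(ring)]
        if rack not in racks_used:
            replicas.append(node)
            racks_used.add(rack)
        else:
            skipped.append(node)
    for node in skipped:  # fallback: reuse racks, still in ring order
        if len(replicas) == rf:
            break
        replicas.append(node)
    return replicas


ring = [
    ("n1", "rackA"), ("n2", "rackA"),
    ("n3", "rackB"), ("n4", "rackB"),
    ("n5", "rackC"), ("n6", "rackC"),
]
print(place_replicas(ring, start=0, rf=3))  # ['n1', 'n3', 'n5']
```

With three racks and a replication factor of 3, each replica lands on a different rack, so losing one machine or one whole rack still leaves two copies.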
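Note: the original Facebook design of Cassandra used ZooKeeper to elect a
leader that told nodes which ranges they are replicas for. Apache Cassandra
does not depend on ZooKeeper; nodes share token ownership through gossip.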
41. Security in Cassandra
• Internal Authentication
Manages login IDs and passwords inside the database.
• Object Permission Management
Controls who has access to what and who can do what in the database.
Uses the familiar GRANT/REVOKE syntax from relational systems.
• Client to Node Encryption
Protects data in flight to and from the database.