A highly scalable, eventually
consistent, distributed,
structured key-value store.
张天伦
Stable release: 1.2.0 / Jan. 2, 2013
Family tree
BigTable ('06) + Dynamo ('07) → Cassandra ('09)
From BigTable: data model, tablet write / read, compaction, Bloom filter
From Dynamo: cluster membership, eventual consistency, partition, fault tolerance
Relatives: HBase, Hypertable (BigTable side); Voldemort, Riak (Dynamo side)
Architecture Overview
Messaging Layer
Cluster Membership, Failure Detector
Storage Layer
Partitioner, Replicator
Cassandra API, Tools
Agenda
● Data model
● Cluster membership
● Partition
● Replication
● Client request
● CQL
● Secondary index
Cassandra is a row-oriented
database
Data Model
● Keyspace is like a database in an RDBMS
● A column family is like a table
● Each row has a unique Row Key, like a primary key
(Diagram: a keyspace contains column families; a column family contains rows; each row is a Row Key followed by columns; each column has a name, a value and a timestamp)
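The nesting described above can be pictured as maps within maps. The following is a minimal Python sketch, assuming plain dicts stand in for the keyspace, column family and row; the names `Column`, `keyspace` and `put` are illustrative and are not part of Cassandra.

```python
from dataclasses import dataclass

@dataclass
class Column:
    name: str
    value: object
    timestamp: int  # write time; the newest timestamp wins on conflict

# keyspace -> column family -> row key -> {column name: Column}
keyspace = {"profiles": {}}

def put(cf, row_key, name, value, timestamp):
    """Store a column unless a newer version is already present."""
    row = keyspace[cf].setdefault(row_key, {})
    current = row.get(name)
    if current is None or timestamp >= current.timestamp:
        row[name] = Column(name, value, timestamp)

put("profiles", "11485603", "first_name", "tianlun", timestamp=1)
put("profiles", "11485603", "age", 23, timestamp=1)
put("profiles", "11485603", "age", 24, timestamp=2)  # newer write replaces 23
put("profiles", "11485603", "age", 22, timestamp=0)  # older write is ignored
```

Each column carries its own timestamp, which is what lets replicas agree later on which version of a column is the most recent.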
Cassandra is a Distributed Hash Table using
consistent hashing
First, we have an empty token ring with 2^64 positions, from -2^63 to 2^63 - 1
A token represents a position on the ring
We add two nodes (B and D); their tokens determine their positions on the ring
(Ring diagram: nodes B and D on a ring running from -2^63 through 0 to 2^63 - 1)
Nodes mean machines here
Tokens can be assigned manually or generated randomly
A node is responsible for the range between its predecessor and itself
(Ring diagram: B's range and D's range)
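The range rule above can be sketched with a sorted list of node tokens and a binary search. This is a minimal illustration, assuming made-up token values for B and D and treating each range as (predecessor, self]; `ring` and `owner` are hypothetical names.

```python
import bisect

# Hypothetical tokens for two nodes on a ring of [-2**63, 2**63 - 1].
ring = sorted([(-2**62, "B"), (2**62, "D")])
tokens = [t for t, _ in ring]

def owner(token):
    """Return the node whose range (predecessor, self] contains token."""
    i = bisect.bisect_left(tokens, token)
    # Past the last token, wrap around to the first node on the ring.
    return ring[i % len(ring)][1]
```

For example, `owner(0)` returns "D", while any token above D's token wraps around the ring and belongs to "B".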
D has a list of seed nodes that includes B, so D knows B's IP address and can talk to B
(Ring diagram: D and B exchange messages)
When D hasn't received a reply from B for a while, it suspects that B is down
(Ring diagram: D gets no reply from B)
Then we add more nodes (A and C)
(Ring diagram: A, B, C and D on the ring, with marks at -2^63, -2^62, 0, 2^62 and 2^63 - 1)
Parts of B's and D's ranges are taken over by A and C
Nodes A and C have D as their seed node, so they can talk to D
(Ring diagram: A and C exchange messages with D)
Node A gets B's and C's IP addresses from D; node C gets B's and A's IP addresses from D
Now nodes A, B and C can talk to one another
(Ring diagram: A, B and C exchange messages)
The way A and C learn about other nodes is called Gossip
● Gossip is a peer-to-peer communication
protocol for exchanging location and state
information between nodes
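The seed-and-gossip walk-through above can be sketched as nodes merging their membership views. This is a simplified illustration, assuming node names stand in for IP addresses and that an exchange simply unions the two views; the `Node` class is hypothetical, and the state information that real gossip also carries is omitted.

```python
class Node:
    def __init__(self, name, seeds=()):
        self.name = name
        # A node initially knows itself and its seed nodes only.
        self.known = {name} | set(seeds)

    def gossip_with(self, peer):
        """Exchange membership views so both sides know the union."""
        merged = self.known | peer.known
        self.known = set(merged)
        peer.known = set(merged)

b = Node("B")
d = Node("D", seeds=["B"])
a = Node("A", seeds=["D"])
c = Node("C", seeds=["D"])

d.gossip_with(b)  # D talks to its seed B
a.gossip_with(d)  # A learns about B from D
c.gossip_with(d)  # C learns about A and B from D
a.gossip_with(d)  # a later round tells A about C
```

After these rounds A and C each know all four nodes, although neither was configured with anything but D.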
A row is the unit of partitioning
Each row key also gets a token (a position on the ring)
(Diagram: row key → token)
The row is stored on the node that is responsible for the range containing its token
(Ring diagram: row keys johnny, jim, suzy and carol placed on the ring of A, B, C and D)
e.g. johnny's token falls in the range of A, so the row is stored there
A partitioner assigns tokens
partitioner | function | range
Murmur3Partitioner | MurmurHash function | [-2^63, 2^63 - 1]
RandomPartitioner | MD5 hash value | [0, 2^127 - 1]
ByteOrderedPartitioner | Orders rows lexically by key bytes | Platform's default charset (e.g. 32 bit for utf8)
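The difference between a hashing partitioner and a byte-ordered one can be seen in a few lines. The sketch below is in the spirit of RandomPartitioner only: it assumes the token is the absolute value of the MD5 digest read as a signed 128-bit integer, and `md5_token` is a hypothetical helper, not Cassandra's API. Murmur3 is left out because it is not in the Python standard library.

```python
import hashlib

def md5_token(row_key: bytes) -> int:
    """Hash a row key to a ring position (assumed derivation, see lead-in)."""
    digest = hashlib.md5(row_key).digest()
    return abs(int.from_bytes(digest, "big", signed=True))

keys = [b"johnny", b"jim", b"suzy", b"carol"]
by_token = sorted(keys, key=md5_token)  # placement order under a hashing partitioner
by_bytes = sorted(keys)                 # placement order under a byte-ordered partitioner
```

Under hashing, neighbouring keys are scattered around the ring; under byte ordering, they stay in key order, which is what makes range scans by row key possible.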
One cluster, one partitioner!
Scans are different for them
● Murmur3Partitioner / RandomPartitioner: scan by token (rows spread around the ring of A, B, C and D)
● ByteOrderedPartitioner: scan by row key order (carol, jim, johnny, suzy)
Drawbacks of ByteOrderedPartitioner
✗ Sequential writes can cause hot spots
✗ More administrative overhead to load balance the cluster
✗ Uneven load balancing for multiple column families
Data could be lost when nodes fail; we need a replication strategy
(Ring diagram: johnny is stored on a single node of the ring A, B, C, D)
The first replica is determined by the partitioner, and additional replicas are placed on the next nodes clockwise in the ring (SimpleStrategy)
Suppose we store 3 replicas
(Ring diagram: johnny is stored on three consecutive nodes)
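The clockwise placement just described can be sketched as follows. This is a minimal illustration, assuming made-up node tokens; `simple_strategy` is a hypothetical helper name and not Cassandra's implementation.

```python
import bisect

def simple_strategy(ring, token, replicas):
    """ring: sorted list of (token, node). Return replica nodes clockwise."""
    tokens = [t for t, _ in ring]
    # The first replica is the node whose range contains the token.
    start = bisect.bisect_left(tokens, token)
    count = min(replicas, len(ring))
    # Additional replicas are the next nodes clockwise, wrapping around.
    return [ring[(start + i) % len(ring)][1] for i in range(count)]

# Hypothetical token assignment for the four nodes.
ring = [(-2**62, "A"), (0, "B"), (2**62, "C"), (2**63 - 1, "D")]
```

With this ring, a row whose token falls in A's range and a replica count of 3 ends up on A, B and C.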
What if nodes A, B and C are on the same rack?
A rack failure would mean data loss
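Cassandra can replicate data across racks and data centers
(Diagram: West data center with a ring of A, B, C and D; East data center with a ring of E, F, G and H; johnny is replicated in both)
West data center: suppose A and B are on rack1, and C and D are on rack2
East data center: suppose E, F and G are on rack1, and H is on rack2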
This is called NetworkTopologyStrategy
● Use for multiple racks in a data center and
multiple data centers
● Specify how many replicas you want in each
data center
● Places replicas in the same data center by
walking down the ring clockwise until reaching
the first node in another rack
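The rack-aware walk in the last bullet can be sketched for a single data center. This is a simplified illustration under stated assumptions: the ring order within each data center is made up, and `place_in_datacenter` is a hypothetical helper that prefers unused racks first and then falls back to the remaining nodes; the real strategy differs in detail.

```python
def place_in_datacenter(ring, start, replicas, rack_of):
    """ring: node names in clockwise order for ONE data center."""
    n = len(ring)
    order = [ring[(start + i) % n] for i in range(n)]
    chosen, used_racks = [], set()
    # First pass: walk clockwise, taking a node only if its rack is new.
    for node in order:
        if len(chosen) < replicas and rack_of[node] not in used_racks:
            chosen.append(node)
            used_racks.add(rack_of[node])
    # Second pass: if racks ran out, fill with the remaining nodes in order.
    for node in order:
        if len(chosen) < replicas and node not in chosen:
            chosen.append(node)
    return chosen

west_racks = {"A": "rack1", "B": "rack1", "C": "rack2", "D": "rack2"}
east_racks = {"E": "rack1", "F": "rack1", "G": "rack1", "H": "rack2"}
west = place_in_datacenter(["A", "B", "C", "D"], 0, 2, west_racks)
east = place_in_datacenter(["E", "F", "G", "H"], 0, 2, east_racks)
```

With two replicas per data center, the copies land on different racks in both West and East, so a single rack failure no longer loses the row.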
A write or read request can go to any node, which then serves as the coordinator
A write request where D serves as the coordinator and replicas are stored on A, B and C
(Diagram: the client sends insert 'johnny' to coordinator D, which forwards johnny to A, B and C)
Using the partitioner and the replication strategy, the coordinator determines which nodes receive the request
When does a coordinator return an acknowledgement to the client?
● When the write succeeds on as many replicas as the consistency level requires
✔ Consistency is the synchronization of data on
replicas in a cluster
✔ Consistency level is a client setting that defines a
successful write or read by the number of cluster
replicas that acknowledge the write or respond to
the read request, respectively
insert 'johnny' with consistency level = one
(Diagram: coordinator D forwards the write to A, B and C; two of the writes are lost, one replica returns an ACK, and the coordinator returns an ACK to the client)
insert 'johnny' with consistency level = quorum
Quorum means majority: (replicas / 2) + 1
(Diagram: one of the writes is lost; the coordinator returns an ACK to the client once a quorum of replicas has acknowledged)
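The acknowledgement rule for the two write examples can be written down directly. This is a minimal sketch; `quorum`, `REQUIRED` and `write_succeeds` are illustrative names, and the division is integer division.

```python
def quorum(replicas):
    """Majority of the replicas: (replicas / 2) + 1 with integer division."""
    return replicas // 2 + 1

# Number of replica acknowledgements each consistency level needs.
REQUIRED = {
    "one": lambda n: 1,
    "quorum": quorum,
    "all": lambda n: n,
}

def write_succeeds(level, replicas, acks):
    """True when enough replicas acknowledged for the given level."""
    return acks >= REQUIRED[level](replicas)
```

With 3 replicas, a single ACK is enough at level one, while quorum needs 2 ACKs and therefore tolerates one lost write.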
get 'johnny' with consistency level = quorum
(Diagram: the replicas hold johnny v2, johnny v1 and johnny v2)
The coordinator returns the most recent data, determined by timestamp
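Picking the most recent version among replica responses is a one-line reduction. A minimal sketch, assuming each response is a (value, timestamp) pair; `resolve` is a hypothetical helper name.

```python
def resolve(responses):
    """responses: list of (value, timestamp) pairs; the newest timestamp wins."""
    return max(responses, key=lambda r: r[1])[0]

# Two replicas already have v2, one still has the older v1.
latest = resolve([("v2", 20), ("v1", 10), ("v2", 20)])
```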
What if I want strong consistency?
● Write CL + Read CL > Replicas
e.g. write one, read all
write all, read one
write quorum, read quorum
(Diagram: a client writing to and reading from replicas A, B and C under each of the three combinations)
So Cassandra's consistency model is tunable
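The inequality above can be checked mechanically for every pair of levels. A small sketch for 3 replicas; `is_strong`, `levels` and `strong` are illustrative names.

```python
def is_strong(write_cl, read_cl, replicas):
    """Reads overlap with the latest write when W + R > N."""
    return write_cl + read_cl > replicas

N = 3
levels = {"one": 1, "quorum": N // 2 + 1, "all": N}

# Every (write level, read level) pair that satisfies the inequality.
strong = sorted(
    (w, r) for w in levels for r in levels
    if is_strong(levels[w], levels[r], N)
)
```

The three combinations named on the slide all appear in `strong`, while write one with read quorum does not, because 1 + 2 is not greater than 3.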
A write's journey
Each column family has a memtable
(Diagram: a write goes to the commit log and the memtable; the memtable is flushed after several inserts)
What are they?
● Memtable
an in-memory sorted map from row key to columns
● SSTable
an immutable data file to which Cassandra writes memtables periodically
● Commit log
a redo log to which Cassandra appends data for recovery in the event of a hardware failure
More updates and flushes
(Diagram: later flushes produce more SSTables; they belong to the same column family)
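The write path above can be sketched with three in-memory structures. This is a simplified illustration, assuming a flush is triggered by a row count and that the commit log can be cleared after a flush; `ColumnFamilyStore` and `flush_threshold` are hypothetical names chosen for the sketch.

```python
class ColumnFamilyStore:
    def __init__(self, flush_threshold=3):
        self.commit_log = []   # append-only redo log
        self.memtable = {}     # row key -> {column name: value}
        self.sstables = []     # immutable snapshots, sorted by row key
        self.flush_threshold = flush_threshold  # assumption: flush by row count

    def write(self, row_key, column, value):
        # 1. Append to the commit log so the write can be replayed after a crash.
        self.commit_log.append((row_key, column, value))
        # 2. Update the in-memory table.
        self.memtable.setdefault(row_key, {})[column] = value
        # 3. Flush once the memtable holds enough rows.
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Freeze the memtable into a tuple of rows sorted by row key.
        rows = tuple((k, dict(v)) for k, v in sorted(self.memtable.items()))
        self.sstables.append(rows)
        self.memtable = {}
        # Assumption: flushed data no longer needs to be replayed.
        self.commit_log.clear()

store = ColumnFamilyStore(flush_threshold=2)
store.write("johnny", "age", 30)
store.write("carol", "age", 25)   # second row triggers a flush
store.write("jim", "age", 41)     # lands in a fresh memtable
```

Nothing is ever rewritten in place: every flush adds a new immutable SSTable, which is why reads and deletes need the extra machinery described next.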
A read's journey
(Diagram: memtable and commit log)
SSTable is immutable; how about deletes?
● A tombstone is written to indicate a deleted column
● Columns marked with a tombstone exist for the configured gc_grace_seconds, after which compaction permanently deletes the column
Compaction
● In the background, Cassandra periodically
merges SSTables together into larger
SSTables
● Compaction merges row fragments, removes
expired tombstones, and rebuilds indexes.
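Merging SSTables and dropping expired tombstones can be sketched as a newest-wins merge followed by a filter. This is a simplified illustration, assuming each SSTable is a dict of key to (value, timestamp) with timestamps in seconds; `TOMBSTONE` and `compact` are hypothetical names, and index rebuilding is omitted.

```python
TOMBSTONE = object()  # marker value for a deleted column

def compact(sstables, now, gc_grace_seconds):
    """Merge SSTables into one, keeping the newest version of each key."""
    merged = {}
    for table in sstables:
        for key, (value, ts) in table.items():
            if key not in merged or ts > merged[key][1]:
                merged[key] = (value, ts)
    # Drop tombstones older than the grace period; keep younger ones.
    return {
        key: (value, ts)
        for key, (value, ts) in sorted(merged.items())
        if not (value is TOMBSTONE and now - ts > gc_grace_seconds)
    }

old = {"a": ("1", 10), "b": ("2", 10)}
new = {"a": (TOMBSTONE, 20), "c": ("3", 30)}
purged = compact([old, new], now=1000, gc_grace_seconds=100)
kept = compact([old, new], now=1000, gc_grace_seconds=10000)
```

In `purged` the deleted key "a" is gone for good; in `kept` its tombstone survives because the grace period has not yet passed.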
CQL
● Cassandra Query Language (CQL) is a SQL-like language for querying Cassandra.
● CQL doesn't support joins; Cassandra
encourages denormalization
We refer to CQL3 here
Joins require expensive random reads, which
need to be merged across the network
CQL3 structure
(Diagram: on the client side, cqlsh and the Java / .NET driver; the transport is Thrift RPC or the CQL binary protocol; on the server side, the Query Processor calls the internal write / read API, which follows a local path or a remote path)
CQL3 queries
Table means column family here
CREATE TABLE profiles (
    id text PRIMARY KEY,
    first_name text,
    last_name text,
    age int
);
INSERT INTO profiles (id, first_name, last_name, age)
VALUES ('11485603', 'tianlun', 'zhang', 23);
SELECT * FROM profiles;
 id       | first_name | last_name | age
 11485603 | tianlun    | zhang     | 23
CQL3 hides internal storage from users
CQL3 view:
 id       | first_name | last_name | age
 11485603 | tianlun    | zhang     | 23
Internal storage (Row key, then Column name: Column value):
 11485603 | first_name: tianlun | last_name: zhang | age: 23
Columns are sorted by column name
Compound primary key in CQL3
CREATE TABLE comments (
    article_id uuid,
    posted_at timestamp,
    author text,
    content text,
    PRIMARY KEY (article_id, posted_at)
);
The first component (article_id) is the Row Key
The remaining component (posted_at) ensures that the columns in a row are stored in ascending order on disk
Columns are sorted first by posted_at and then by column name
 article_id  | posted_at                | author | content
 550e8400-.. | 1970-01-17 00:08:19+0900 | yukim  | blah, blah, blah
 550e8400-.. | 1970-01-17 05:08:19+0900 | yukim  | well, well, well
Since the columns of a row are sorted by time, we can efficiently get the comments posted on an article after a certain time
SELECT * FROM comments WHERE
article_id = '550e8400-..' AND
posted_at >= '1970-01-17 03:08:19+0900';
 article_id  | posted_at                | author | content
 550e8400-.. | 1970-01-17 05:08:19+0900 | yukim  | well, well, well
How about querying on a value?
SELECT * FROM comments WHERE author = 'yukim';
Bad Request: No indexed columns present in by-columns clause with Equal operator
A secondary index enables us to query on a value
Secondary index
● Index on column values (should not be the primary key or part of a compound primary key)
● Cassandra implements secondary indexes as a hidden column family (invisible to the client), separate from the column family that contains the values being indexed
CREATE INDEX c_author ON comments (author);
Index column family (its row key is the indexed column value; its column names are the base row key + column name):
yukim | [550e8400-.., 1350499616, author]: | [550e8400-.., 1368499616, author]:
Base CF and index CF are flushed to disk at the same time
SELECT * FROM comments WHERE author = 'yukim';
● The index column family is stored on the same node as the base column family
● Cassandra doesn't keep the column value information on any single node, so the query still needs to be sent to all nodes
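The two bullets above imply a scatter-gather query: each node indexes only its own rows, so a lookup by value must ask every node. A minimal sketch, assuming an in-memory dict per node; `NodeStore` and `select_by_author` are hypothetical names.

```python
from collections import defaultdict

class NodeStore:
    def __init__(self):
        self.rows = {}                        # row key -> row dict
        self.author_index = defaultdict(set)  # author -> row keys on THIS node

    def insert(self, row_key, row):
        # The base row and its index entry are written on the same node.
        self.rows[row_key] = row
        self.author_index[row["author"]].add(row_key)

    def find_by_author(self, author):
        keys = sorted(self.author_index.get(author, ()))
        return [self.rows[k] for k in keys]

def select_by_author(nodes, author):
    """No node holds a global index, so every node is queried."""
    results = []
    for node in nodes:
        results.extend(node.find_by_author(author))
    return results

n1, n2 = NodeStore(), NodeStore()
n1.insert("c1", {"author": "yukim", "content": "blah, blah, blah"})
n2.insert("c2", {"author": "yukim", "content": "well, well, well"})
n2.insert("c3", {"author": "bob", "content": "hello"})
found = select_by_author([n1, n2], "yukim")
```

The lookup on each node is cheap, but the fan-out grows with the cluster, which is the cost the second bullet points at.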
Using multiple secondary indexes
● If 'bob' is less frequent than 'smith', Cassandra
will process users_fname = 'bob' first for
efficiency
DELETE FROM comments WHERE author = 'yukim';
● This is not allowed
● Deleting an indexed column won't update the index
Secondary index updates
● Cassandra appends data to the commit log,
updates the memtable, and updates the
secondary index
● If a read sees a stale index entry before
compaction purges it, the reader thread
invalidates it
Secondary index overhead
● Built on existing data in the background
automatically, without blocking reads or writes
(the CREATE clause)
● Updating indexes blocks reads or writes at row
level
(the INSERT clause)
There are more...
● Virtual nodes
● Atomic batches
● Request tracing
● Expiring / counter columns
● CQL collections
● Composite partition keys
Cassandra links
● Cassandra Official website
http://cassandra.apache.org/
● Apache Cassandra 1.2 Documentation
http://www.datastax.com/docs/1.2/index
● Cassandra trunk
http://git-wip-us.apache.org/repos/asf/cassandra.git
● Configuration file
conf / cassandra.yaml
Thank you !

Cassandra1.2

  • 1. A highly scalable, eventually consistent, distributed, structured key-value store. 张天伦
  • 2.
  • 3. Stable release: 1.2.0 / Jan. 2, 2013 '07 '06 '09
  • 4. BigTable Dynamo Cassandra Data model Tablet write / read Compaction Bloom filter Cluster membership Eventual consistency Partition Fault tolerance Hbase Hypertable Voldemort Riak Family tree
  • 5. Architecture Overview Messaging Layer Cluster MembershipFailure Detector Storage Layer Partitioner Replicator Cassandra API Tools
  • 6. Agenda ● Data model ● Cluster membership ● Partition ● Replication ● Client request ● CQL ● Secondary index
  • 7. Agenda, next section: Data model
  • 8. Cassandra is a row-oriented database
  • 9. Data Model: a keyspace contains column families; a column family contains rows; each row has a Row Key and a set of columns; each column has a name, a value, and a timestamp. A keyspace is like a database in an RDBMS. A column family is like a table. Each row has a unique Row Key, like a primary key.
  • 10. Agenda, next section: Cluster membership
  • 11. Cassandra is a Distributed Hash Table using consistent hashing
  • 12. First, we have an empty token ring with 2^64 positions, from -2^63 to 2^63 - 1. A token represents a position on the ring.
  • 13. We add two nodes (B and D), and their tokens determine their positions on the ring. Nodes mean machines here. Tokens can be assigned manually or generated randomly. (Ring diagram: nodes D and B; marked positions -2^63, 0, 2^63 - 1.)
  • 14. A node is responsible for the range between its predecessor and itself. (Ring diagram: B's range and D's range; marked positions -2^63, 0, 2^63 - 1.)
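The ownership rule on this slide can be sketched as a sorted-token lookup. This is a minimal illustration, not Cassandra's code: the token values given to B and D are hypothetical (the slide does not state them), and `owner` is a name made up for this example.

```python
from bisect import bisect_left

# Hypothetical tokens for the two nodes; real tokens are configured or generated.
RING = sorted([(-2**62, "B"), (2**62, "D")])
TOKENS = [token for token, _ in RING]

def owner(token: int) -> str:
    """Return the node whose range (predecessor's token, own token] contains `token`."""
    i = bisect_left(TOKENS, token)
    # Past the largest token we wrap around to the first node on the ring.
    return RING[i % len(RING)][1]
```

With these assumed tokens, `owner(0)` falls in D's range, while any token above D's wraps around to B.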
  • 15. D has a list of seed nodes that includes B, so D knows B's IP address and can talk to B. (Ring diagram: messages between D and B.)
  • 16. When D hasn't received a reply from B for a while, it suspects that B is down. (Ring diagram: no reply from B.)
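A rough way to picture "no reply for a while" is a fixed-timeout detector. This is only a sketch under that assumption: the class name, the timeout value, and the explicit `now` argument are all hypothetical, and Cassandra's own failure detector is more adaptive than a single fixed timeout.

```python
class TimeoutDetector:
    """Suspects a peer when nothing has been heard from it for `timeout_s` seconds."""

    def __init__(self, timeout_s: float):
        self.timeout_s = timeout_s
        self.last_heard = {}  # peer name -> time of the last reply

    def heard_from(self, peer: str, now: float) -> None:
        self.last_heard[peer] = now

    def is_suspected(self, peer: str, now: float) -> bool:
        last = self.last_heard.get(peer)
        # A peer we have never heard from is also treated as suspect.
        return last is None or now - last > self.timeout_s
```

Passing `now` explicitly keeps the sketch deterministic; a real detector would read a clock.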
  • 17. Then we add more nodes (A and C). Parts of B's and D's ranges are taken over by A and C. (Ring diagram: nodes D, A, C, B; marked positions -2^63, -2^62, 0, 2^62, 2^63 - 1.)
  • 18. Nodes A and C have D as their seed node, so they can talk to D. (Ring diagram: messages from A and C to D.)
  • 19. Node A gets B's and C's IP addresses from D; node C gets B's and A's IP addresses from D. Now nodes A, B, and C can talk to one another. (Ring diagram: messages among A, B, C, and D.)
  • 20. The way A and C learn about other nodes is called Gossip ● Gossip is a peer-to-peer communication protocol for exchanging location and state information between nodes
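The exchange described on the last few slides can be sketched as nodes merging their address books. This is a deliberately simplified model: real gossip also carries versioned state, and the IP addresses, the `views` dictionary, and `gossip_round` are hypothetical names for illustration.

```python
def gossip_round(a: dict, b: dict) -> None:
    """Two nodes exchange and merge what they know (node name -> IP address)."""
    merged = {**a, **b}
    a.update(merged)
    b.update(merged)

# Initial knowledge: each node knows itself plus its seed, if it has one.
views = {
    "B": {"B": "10.0.0.2"},
    "D": {"D": "10.0.0.4", "B": "10.0.0.2"},  # D has B as a seed
    "A": {"A": "10.0.0.1", "D": "10.0.0.4"},  # A has D as a seed
    "C": {"C": "10.0.0.3", "D": "10.0.0.4"},  # C has D as a seed
}

gossip_round(views["D"], views["B"])
gossip_round(views["A"], views["D"])  # A learns about B from D
gossip_round(views["C"], views["D"])  # C learns about A and B from D
gossip_round(views["A"], views["D"])  # A learns about C from D
gossip_round(views["B"], views["D"])  # B learns about A and C from D
```

After these rounds every node holds an address for every other node, which is the state slide 19 describes.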
  • 21. Agenda, next section: Partition
  • 22. Row is the unit of partition
  • 23. A row key also gets a token (a position on the ring): each row key maps to a token.
  • 24. The row is stored on the node that is responsible for the range its token falls in, e.g. johnny's token falls in the range of A, so the row is stored there. (Ring diagram: nodes D, A, C, B; row keys johnny, jim, suzy, carol placed on the ring.)
  • 25. The partitioner assigns tokens (partitioner: function; range). Murmur3Partitioner: MurmurHash function; [-2^63, 2^63 - 1]. RandomPartitioner: MD5 hash value; [0, 2^127 - 1]. ByteOrderedPartitioner: orders rows lexically by key bytes; platform's default charset (e.g. 32 bit for utf8). One cluster, one partitioner!
  • 26. Scans are different for them. Murmur3Partitioner / RandomPartitioner: scan by token. ByteOrderedPartitioner: scan by row key order. (Diagram: a column family with row keys carol, jim, johnny, suzy spread over nodes D, A, C, B.)
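The difference between the two scan orders can be seen with a toy hash partitioner. The sketch below folds an MD5 digest into [0, 2^127 - 1] with a modulo, which is an assumption made for brevity and not necessarily how RandomPartitioner derives its token; `md5_token` is a hypothetical helper name.

```python
import hashlib

def md5_token(row_key: str) -> int:
    """Map a row key to a token in [0, 2**127 - 1] (simplified folding of the MD5 digest)."""
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % 2**127

keys = ["johnny", "jim", "suzy", "carol"]

scan_by_token = sorted(keys, key=md5_token)  # hash partitioners: order looks random
scan_by_key = sorted(keys)                   # byte-ordered: lexical row key order
```

A hash partitioner spreads adjacent keys across the ring, which helps load balance but means range scans over row keys are only meaningful with the byte-ordered variant.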
  • 27. Drawbacks of ByteOrderedPartitioner ✗ Sequential writes can cause hot spots ✗ More administrative overhead to load balance the cluster ✗ Uneven load balancing for multiple column families
  • 28. Agenda, next section: Replication
  • 29. Data could be lost when nodes fail; we need a replication strategy
  • 30. The first replica is determined by the partitioner, and additional replicas are placed on the next nodes clockwise in the ring (SimpleStrategy). Suppose we store 3 replicas. (Ring diagram: nodes D, A, C, B; row johnny.)
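SimpleStrategy as described here can be sketched as "find the owner, then keep walking clockwise". The node tokens below are hypothetical, chosen only so that the nodes sit in the order A, B, C, D around the ring; `simple_strategy` is an illustrative name.

```python
from bisect import bisect_left

# Hypothetical (token, node) pairs in clockwise order.
RING = [(-2**62, "A"), (0, "B"), (2**62, "C"), (2**63 - 1, "D")]

def simple_strategy(token: int, replication_factor: int) -> list:
    """Return the replica nodes for a row token: the owner, then the next nodes clockwise."""
    tokens = [t for t, _ in RING]
    start = bisect_left(tokens, token) % len(RING)
    count = min(replication_factor, len(RING))  # cannot place more replicas than nodes
    return [RING[(start + i) % len(RING)][1] for i in range(count)]
```

With three replicas, a row whose token lands in A's range ends up on A, B, and C.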
  • 31. What if nodes A, B, and C are on the same rack? A rack failure would mean data loss. (Ring diagram: nodes D, A, C, B; row johnny.)
  • 32. Cassandra can replicate data across racks and data centers. West data center: suppose A and B are on rack1 and C and D are on rack2. East data center: suppose E, F, and G are on rack1 and H is on rack2. (Diagram: rings D, A, C, B and H, E, G, F; row johnny.)
  • 33. This is called NetworkTopologyStrategy ● Use for multiple racks in a data center and multiple data centers ● Specify how many replicas you want in each data center ● Places replicas in the same data center by walking down the ring clockwise until reaching the first node in another rack
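The rack-aware walk can be sketched for a single data center as follows. This is a simplified reading of the bullet above, not the actual NetworkTopologyStrategy algorithm: it prefers nodes on racks not used yet, then fills any remaining replicas with the next nodes clockwise. The function name and the list layout are assumptions.

```python
def rack_aware_replicas(ring: list, start: int, replication_factor: int) -> list:
    """ring: (node, rack) pairs in clockwise order; start: index of the first replica."""
    n = len(ring)
    clockwise = [ring[(start + i) % n] for i in range(n)]
    chosen, used_racks = [], set()
    # First pass: walk clockwise and take the first node found on each new rack.
    for node, rack in clockwise:
        if len(chosen) < replication_factor and rack not in used_racks:
            chosen.append(node)
            used_racks.add(rack)
    # Second pass: if racks ran out, fill up with the remaining nodes clockwise.
    for node, _rack in clockwise:
        if len(chosen) < replication_factor and node not in chosen:
            chosen.append(node)
    return chosen

# Rack layout of the West data center from the previous slide.
WEST = [("A", "rack1"), ("B", "rack1"), ("C", "rack2"), ("D", "rack2")]
```

Starting at A with two replicas, the walk skips B (same rack as A) and picks C, so a single rack failure cannot take out both copies.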
  • 34. Agenda, next section: Client request
  • 35. A write or read request can go to any node, which then serves as the coordinator
  • 36. A write request where D serves as the coordinator and replicas are stored on A, B, and C. Using the partitioner and the replication strategy, the coordinator determines which nodes receive the request. (Diagram: client sends insert 'johnny' to coordinator D; johnny is written to A, B, and C.)
  • 37. When does a coordinator return an acknowledgement to the client? ● When the write succeeds on as many replicas as the consistency level requires ✔ Consistency is the synchronization of data on replicas in a cluster ✔ Consistency level is a client setting that defines a successful write or read by the number of cluster replicas that acknowledge the write or respond to the read request, respectively
  • 38. insert 'johnny' with consistency level = one. (Diagram: client sends insert 'johnny' to coordinator D; one replica stores johnny and ACKs, the other two writes are lost; the coordinator still ACKs the client.)
  • 39. insert 'johnny' with consistency level = quorum. Quorum means majority: (replicas / 2) + 1. (Diagram: client sends insert 'johnny' to coordinator D; two replicas store johnny and ACK, one write is lost; the coordinator ACKs the client.)
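The acknowledgement rule from the last three slides reduces to a small calculation. In this sketch the division in (replicas / 2) + 1 is taken as integer division, and the `REQUIRED_ACKS` table and `is_successful` helper are hypothetical names covering only the three levels the slides mention.

```python
def quorum(replicas: int) -> int:
    """Majority of replicas: (replicas / 2) + 1, using integer division."""
    return replicas // 2 + 1

# Number of replica acknowledgements each consistency level waits for.
REQUIRED_ACKS = {
    "ONE": lambda replicas: 1,
    "QUORUM": quorum,
    "ALL": lambda replicas: replicas,
}

def is_successful(level: str, replicas: int, acks: int) -> bool:
    """True when enough replicas acknowledged for the coordinator to ACK the client."""
    return acks >= REQUIRED_ACKS[level](replicas)
```

With 3 replicas, one ACK satisfies level one (slide 38) and two ACKs satisfy quorum (slide 39), even though some writes were lost.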
  • 40. get 'johnny' with consistency level = quorum. The coordinator returns the most recent data, determined by timestamp. (Diagram: replicas hold johnny v2, johnny v1, and johnny v2.)
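Picking the most recent data by timestamp can be sketched in one function. The (value, timestamp) tuple shape and the timestamp numbers are assumptions for illustration; the slide only states that the newest timestamp wins.

```python
def resolve(responses: list) -> str:
    """Given (value, timestamp) pairs from replicas, return the value with the newest timestamp."""
    value, _timestamp = max(responses, key=lambda response: response[1])
    return value

# Replica responses as on the slide: two replicas have v2, one still has v1.
responses = [("johnny v2", 200), ("johnny v1", 100), ("johnny v2", 200)]
latest = resolve(responses)
```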
  • 41. What if I want strong consistency? ● Write CL + Read CL > Replicas, e.g. write one, read all; write all, read one; write quorum, read quorum. (Diagram: three panels showing a client reading from and writing to replicas A, B, C.)
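The rule Write CL + Read CL > Replicas works because the set of replicas written and the set of replicas read must then overlap in at least one node. A minimal check, with `is_strongly_consistent` as a hypothetical name and the levels expressed as replica counts:

```python
def is_strongly_consistent(write_cl: int, read_cl: int, replicas: int) -> bool:
    """True when every read set must overlap every write set in at least one replica."""
    return write_cl + read_cl > replicas

replicas = 3
quorum = replicas // 2 + 1  # 2 of 3

examples = {
    "write one, read all": is_strongly_consistent(1, replicas, replicas),
    "write all, read one": is_strongly_consistent(replicas, 1, replicas),
    "write quorum, read quorum": is_strongly_consistent(quorum, quorum, replicas),
    "write one, read one": is_strongly_consistent(1, 1, replicas),
}
```

Only the last combination fails the rule: a read at level one may hit a replica the write never reached.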
  • 42. So Cassandra's consistency model is tunable
  • 43. A write's journey: each column family has a Memtable
  • 44. Flush after several inserts. (Diagram: memtable and commit log.)
  • 45. What are they? ● Memtable: an in-memory sorted map from row key to columns ● SSTable: an immutable data file to which Cassandra writes memtables periodically ● Commit log: a redo log to which Cassandra appends data for recovery in the event of a hardware failure
  • 46. More updates and flush. They belong to the same column family. (Diagram: memtable and commit log.)
  • 48. SSTable is immutable, so how about deletes? ● A tombstone is written to indicate a deleted column ● Columns marked with a tombstone exist for the configured gc_grace_seconds, after which compaction permanently deletes the column
  • 49. Compaction ● In the background, Cassandra periodically merges SSTables together into larger SSTables ● Compaction merges row fragments, removes expired tombstones, and rebuilds indexes.
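The write path on slides 43 to 49 can be sketched as a toy in-memory store. Several simplifications are assumptions of this sketch: the flush threshold counts rows, there are no timestamps, the commit log is never truncated, and tombstones are dropped at the first compaction rather than after gc_grace_seconds. The class and attribute names are hypothetical.

```python
TOMBSTONE = object()  # marker written in place of a deleted column's value

class ColumnFamilyStore:
    def __init__(self, flush_threshold: int = 2):
        self.commit_log = []   # append-only redo log
        self.memtable = {}     # row key -> {column name: value}
        self.sstables = []     # oldest first; never modified after a flush
        self.flush_threshold = flush_threshold

    def write(self, row_key, column, value):
        self.commit_log.append((row_key, column, value))  # log first, for recovery
        self.memtable.setdefault(row_key, {})[column] = value
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def delete(self, row_key, column):
        # SSTables are immutable, so a delete is just another write.
        self.write(row_key, column, TOMBSTONE)

    def flush(self):
        if self.memtable:
            # Sort rows and columns before writing the immutable table.
            table = {k: dict(sorted(v.items())) for k, v in sorted(self.memtable.items())}
            self.sstables.append(table)
            self.memtable = {}

    def compact(self):
        merged = {}
        for table in self.sstables:  # newer tables overwrite older row fragments
            for row_key, columns in table.items():
                merged.setdefault(row_key, {}).update(columns)
        compacted = {}
        for row_key, columns in merged.items():
            live = {c: v for c, v in columns.items() if v is not TOMBSTONE}
            if live:
                compacted[row_key] = live
        self.sstables = [compacted]

store = ColumnFamilyStore(flush_threshold=1)
store.write("johnny", "age", 23)
store.write("johnny", "city", "x")
store.delete("johnny", "age")
tables_before = len(store.sstables)
store.compact()
```

Three writes with a threshold of one leave three row fragments for johnny on disk; compaction merges them into one and removes the tombstoned column.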
  • 50. Agenda, next section: CQL
  • 51. CQL (we refer to CQL3 here) ● Cassandra Query Language (CQL) is a SQL-like language for querying Cassandra. ● CQL doesn't support joins; Cassandra encourages denormalization. Joins require expensive random reads, which need to be merged across the network.
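  • 52. CQL3 structure (diagram): on the client side, cqlsh and the Java / .NET driver; between client and server, Thrift RPC and the CQL binary protocol; on the server side, the transport, the Query Processor, and the internal write / read API, which takes a local path or a remote path.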
  • 53. CQL3 queries (table means column family here): CREATE TABLE profiles ( id text PRIMARY KEY, first_name text, last_name text, age int ); INSERT INTO profiles (id, first_name, last_name, age) VALUES ('11485603', 'tianlun', 'zhang', 23); SELECT * FROM profiles; Result columns: id, first_name, last_name, age. Result row: 11485603, tianlun, zhang, 23.
  • 54. CQL3 hides internal storage from users. CQL view: id = 11485603, first_name = tianlun, last_name = zhang, age = 23. Internal storage: row key 11485603 with columns (column name: column value) first_name: tianlun, last_name: zhang, age: 23. Columns are sorted by column name.
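The mapping from a CQL3 row to the internal layout can be sketched as splitting off the key and sorting the remaining columns by name. `to_internal` is a hypothetical helper for illustration; it ignores timestamps and the on-disk encoding.

```python
def to_internal(cql_row: dict, key_column: str):
    """Split a CQL row into (row key, columns sorted by column name)."""
    row_key = cql_row[key_column]
    columns = sorted(
        (name, value) for name, value in cql_row.items() if name != key_column
    )
    return row_key, columns

profile = {"id": "11485603", "first_name": "tianlun", "last_name": "zhang", "age": 23}
row_key, columns = to_internal(profile, "id")
```

Sorting by name puts `age` ahead of `first_name` and `last_name`, regardless of the order in the CREATE TABLE statement.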
  • 55. Compound primary key in CQL3: CREATE TABLE comments ( article_id uuid, posted_at timestamp, author text, content text, PRIMARY KEY (article_id, posted_at) ); The first component (article_id) is the Row Key. The remaining component (posted_at) ensures that the columns in a row are stored in ascending order on disk.
  • 56. Columns are sorted first by posted_at and then by column name. Table columns: article_id, posted_at, author, content. Row 1: 550e8400-.., 1970-01-17 00:08:19+0900, yukim, "blah, blah, blah". Row 2: 550e8400-.., 1970-01-17 05:08:19+0900, yukim, "well, well, well".
  • 57. Since the columns of a row are sorted by time, we can efficiently get the comments on an article after a certain time: SELECT * FROM comments WHERE article_id = '550e8400-..' AND posted_at >= '1970-01-17 03:08:19+0900'; Result: 550e8400-.., 1970-01-17 05:08:19+0900, yukim, "well, well, well".
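Why this range query is cheap can be sketched by modelling one wide row as a list sorted by (posted_at, column name). The millisecond timestamps match the two comments shown on the slides; the list-of-tuples layout and `columns_since` are assumptions of this sketch, not Cassandra's storage format.

```python
from bisect import bisect_left

# One internal row for a single article: ((posted_at in ms, column name), value), kept sorted.
row = sorted([
    ((1350499616, "author"), "yukim"),
    ((1350499616, "content"), "blah, blah, blah"),
    ((1368499616, "author"), "yukim"),
    ((1368499616, "content"), "well, well, well"),
])

def columns_since(row: list, posted_at: int) -> list:
    """Return all columns with posted_at >= the given time via one binary search."""
    names = [name for name, _ in row]
    # (posted_at,) sorts before every (posted_at, column_name) pair with the same time.
    start = bisect_left(names, (posted_at,))
    return row[start:]

recent = columns_since(row, 1360000000)
```

Because the columns are already in time order, the query is a seek to the start position followed by a sequential read.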
  • 58. How about a query on value? SELECT * FROM comments WHERE author = 'yukim'; returns: Bad Request: No indexed columns present in by-columns clause with Equal operator. A secondary index enables us to query on value.
  • 59. Agenda, next section: Secondary index
  • 60. Secondary index ● Index on column values (should not be the primary key or part of a compound primary key) ● Cassandra implements secondary indexes as a hidden column family (invisible to the client), separate from the column family that contains the values being indexed
  • 61. CREATE INDEX c_author ON comments (author); Index column family: the row key is the column value (yukim) and the column names are row key + column name of the base column family: [550e8400-.., 1350499616, author], [550e8400-.., 1368499616, author]. Base CF and Index CF are flushed to disk at the same time.
  • 62. SELECT * FROM comments where author='yukim'; ● Index column family is stored on the same node as base column family ● Cassandra doesn't maintain column value information in any one node and the query still needs to be sent to all nodes
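The two bullets above can be sketched with a per-node index that only covers that node's own rows, so a lookup has to visit every node. The `Node` class, `query_all`, and the article ids "a1" and "a2" are hypothetical; the index layout is a loose model of the hidden column family, not its real format.

```python
from collections import defaultdict

class Node:
    def __init__(self):
        self.base = {}                 # (article_id, posted_at) -> columns
        self.index = defaultdict(set)  # author -> set of (article_id, posted_at)

    def insert(self, article_id, posted_at, author, content):
        self.base[(article_id, posted_at)] = {"author": author, "content": content}
        # The index entry lives on the same node as the row it points to.
        self.index[author].add((article_id, posted_at))

    def find_by_author(self, author):
        return [self.base[key] for key in sorted(self.index.get(author, ()))]

def query_all(nodes, author):
    """No single node knows every value, so the query is sent to all nodes."""
    results = []
    for node in nodes:
        results.extend(node.find_by_author(author))
    return results

node_a, node_b = Node(), Node()
node_a.insert("a1", 1350499616, "yukim", "blah, blah, blah")
node_b.insert("a2", 1368499616, "yukim", "well, well, well")
found = query_all([node_a, node_b], "yukim")
```

The local index makes each node's lookup cheap, but the fan-out to every node is the cost that remains.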
  • 63. Using multiple secondary indexes ● If 'bob' is less frequent than 'smith', Cassandra will process users_fname = 'bob' first for efficiency
  • 64. DELETE FROM comments WHERE author='yukim'; ● This is not allowed ● Deleting an indexed column won't update the index
  • 65. Secondary index updates ● Cassandra appends data to the commit log, updates the memtable, and updates the secondary index ● If a read sees a stale index entry before compaction purges it, the reader thread invalidates it
  • 66. Secondary index overhead ● Built on existing data in the background automatically, without blocking reads or writes (the CREATE clause) ● Updating indexes blocks reads or writes at row level (the INSERT clause)
  • 67. There are more... ● Virtual nodes ● Atomic batches ● Request tracing ● Expiring / counter columns ● CQL collections ● Composite partition keys
  • 68. Cassandra links ● Cassandra official website: http://cassandra.apache.org/ ● Apache Cassandra 1.2 Documentation: http://www.datastax.com/docs/1.2/index ● Cassandra trunk: http://git-wip-us.apache.org/repos/asf/cassandra.git ● Configuration file: conf/cassandra.yaml