Overview of Column DBs
Presenter – Pooja Biswas
Presented On - DOL
Topics To Cover
Quick Recap of the Different Types of Databases
What is a Column DB
Types of Column DBs
Examples of Column DBs
Deep Dive into Cassandra’s Data Model, DB Engine and Read/Write Patterns
Usage of Column DBs in Real-World Systems
Let’s revise the DB hierarchy once
What is a column DB?
• Column database
• Column family database
• Column oriented database
• Wide column store database
• Wide column store
• Columnar database
• Columnar store
What is a column store, column-oriented DBMS, or columnar DBMS?
As per Wikipedia, a column-oriented DBMS or columnar
DBMS is a database management system (DBMS) that stores
data tables by column rather than by row.
Practical use of a column store versus a row store differs little in
the relational DBMS world. Both columnar and row databases
can use traditional database query languages like SQL to load
data and perform queries.
Both row and columnar databases can become the backbone in
a system to serve data for common extract, transform, load (ETL)
and data visualization tools.
However, by storing data in columns rather than rows, the
database can more precisely access the data it needs to answer
a query rather than scanning and discarding unwanted data in
rows.
Examples: Sybase IQ, C-Store, Vertica, VectorWise, MonetDB, ParAccel, and Infobright.
Cont..
• A column-oriented database stores each
column contiguously, i.e. on disk or in
memory the values of each column are stored
in sequential blocks.
• For analytical queries that perform aggregate
operations over a small number of columns,
retrieving data in this format is extremely fast.
• As disk storage is optimized for block access,
storing a column's values beside each other
exploits locality of reference.
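The difference between the two layouts can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three-row table, its column names, and the helper functions are hypothetical, and Python lists stand in for blocks on disk.

```python
# Hypothetical three-column table used only for illustration.
rows = [
    {"id": 1, "city": "Pune", "amount": 120},
    {"id": 2, "city": "Delhi", "amount": 80},
    {"id": 3, "city": "Pune", "amount": 200},
]


def to_columns(rows):
    """Pivot a list of row dicts into one contiguous list per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def sum_row_store(rows, column):
    """Row layout: every full row is visited even though one field is needed."""
    return sum(row[column] for row in rows)


def sum_column_store(columns, column):
    """Column layout: only the requested column is scanned."""
    return sum(columns[column])


columns = to_columns(rows)
total_from_rows = sum_row_store(rows, "amount")
total_from_columns = sum_column_store(columns, "amount")
print(total_from_rows, total_from_columns)
```

Both functions return the same total; the point is that the column layout touches only the "amount" values, while the row layout has to walk past "id" and "city" for every record.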
Benefits of Column Store Databases
Compression. Column stores are very efficient at data
compression and/or partitioning.
Aggregation queries. Due to their structure, columnar databases
perform particularly well with aggregation queries (such as
SUM, COUNT, AVG, etc).
Scalability. Columnar databases are very scalable. They are well
suited to massively parallel processing (MPP), which involves
having data spread across a large cluster of machines – often
thousands of machines.
Fast to load and query. Columnar stores can be loaded
extremely fast. A billion-row table could be loaded within a few
seconds, and you can start querying and analysing almost
immediately.
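One reason a column compresses well is that it holds values of a single type, often with long runs of repeats. The sketch below uses run-length encoding as an example; it is one of several techniques a column store might apply, and the sample column and function names are hypothetical.

```python
from itertools import groupby


def rle_encode(column):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(column)]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original column."""
    return [value for value, count in runs for _ in range(count)]


# Hypothetical "country" column of a six-row table.
country_column = ["IN", "IN", "IN", "US", "US", "IN"]
encoded = rle_encode(country_column)
decoded = rle_decode(encoded)
print(encoded)
```

A row store interleaves these values with the other fields of each record, so the runs never sit next to each other and this kind of encoding has much less to work with.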
What is a column family or wide column DB?
As per Wikipedia
• A wide-column store (or extensible
record store) is a type
of NoSQL database.[1]
• It uses tables, rows, and columns,
but unlike a relational database, the
names and format of the columns
can vary from row to row in the
same table. A wide-column store
can be interpreted as a
two-dimensional key–value store.[1]
Cont..
• Wide-column stores such as Bigtable and
Apache Cassandra are not column stores in
the original sense of the term, since their
two-level structures do not use a columnar data
layout.
• In genuine column stores, a columnar data
layout is adopted such that each column is
stored separately on disk.
• Wide-column stores do often support the
notion of column families that are stored
separately
• However, each such column family typically
contains multiple columns that are used
together, similar to traditional relational
database tables
• Within a given column family, all data is
stored in a row-by-row fashion, such that the
columns for a given row are stored together,
rather than each column being stored
separately
• Wide-column stores that support column
families are also known as column family
databases.
Cont..
• As a two-dimensional key-value store,
the first part of the key is used to
distribute the data across servers, and
the second part of the key lets you
quickly find the data on the target
server.
• Examples: Cassandra, Amazon
DynamoDB, Bigtable, HBase
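The two roles of the key can be sketched as follows. This is a simplified illustration, not how any of the listed systems is implemented: the node names are hypothetical, and hash-modulo placement stands in for the token-ring or consistent-hashing schemes real systems use.

```python
import bisect
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical server names


def node_for(partition_key):
    """First key part: hash it to choose the server that owns the data."""
    digest = hashlib.sha256(partition_key.encode("utf-8")).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]


class Partition:
    """Entries of one partition, kept sorted by the second key part."""

    def __init__(self):
        self._keys = []
        self._values = []

    def put(self, key, value):
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            self._values[i] = value  # overwrite an existing entry
        else:
            self._keys.insert(i, key)
            self._values.insert(i, value)

    def get(self, key):
        """Second key part: binary search inside the partition."""
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._values[i]
        return None


# node name -> {partition key -> Partition}
cluster = {name: {} for name in NODES}


def write(partition_key, second_key, value):
    node = cluster[node_for(partition_key)]
    node.setdefault(partition_key, Partition()).put(second_key, value)


def read(partition_key, second_key):
    partition = cluster[node_for(partition_key)].get(partition_key)
    return partition.get(second_key) if partition else None


write("user:42", "2024-01-02", "logout")
write("user:42", "2024-01-01", "login")
print(node_for("user:42"), read("user:42", "2024-01-01"))
```

All entries that share the first key part land on the same server, so a lookup needs one hash to pick the node and one search within the sorted partition.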
Cassandra Deep Dive
Cassandra
• Apache Cassandra is a distributed, wide-column
store, NoSQL database management system designed to
handle large amounts of data across many commodity servers.
• It is an AP system in CAP terms, i.e. it prioritizes availability and
partition tolerance.
• It has no single point of failure.
• Cassandra offers support for clusters spanning multiple
datacentres, with asynchronous masterless replication allowing
low latency operations for all clients.
• Cassandra was designed to combine
Amazon's Dynamo distributed storage and replication
techniques with Google's Bigtable data and storage
engine model.
• Cassandra and DynamoDB both build on the same paper:
"Dynamo: Amazon’s Highly Available Key-value Store"
Cassandra Data Model
Cassandra is considered a wide-column
store, which manages data in column
families.
Columns are first-class citizens in the
Cassandra world.
Rows in a wide-column database don’t
need to have the same columns.
This lets developers add and remove
columns dynamically without impacting
the underlying table.
Cont..
• Cassandra uses a concept called
a keyspace
• A keyspace is kind of like
a schema in the relational model
• The keyspace contains all the
column families (kind of like tables in
the relational model), which contain
rows, which contain columns.
Cont..
• The key difference between
column stores and a traditional
RDBMS is that, in a column
store, each record (think row in
an RDBMS) doesn’t require a
single value per column
• Instead, it’s possible to model
column families. A single record
may consist of an ID field, a
column family for “customer”
information, and another
column family for “order item”
information.
• It also stores a timestamp with
each column value, which it uses to
decide which version of a value is the most recent.
• The model in Cassandra is that
rows contain columns. To access
the smallest unit of data (a
column) you have to specify first
the row name (key), then the
column name.
• Each column is confined to its
row. It doesn’t span all rows as
in a relational database.
• Each column contains a
name/value pair along with a
timestamp. Because a value is addressed
by row key and then by column name, the model is also
called a two-dimensional key-value
store.
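The nesting described on these slides (keyspace, column family, row, column) can be sketched with plain dictionaries. This is a toy model under stated assumptions: the column family names, keys, and helper functions are hypothetical, timestamps are supplied by the caller, and the newest timestamp wins when the same column is written twice.

```python
from dataclasses import dataclass


@dataclass
class Column:
    name: str
    value: object
    timestamp: int  # write time, supplied by the caller in this sketch


# keyspace -> column family -> row key -> column name -> Column
keyspace = {"customers": {}, "orders": {}}


def put(column_family, row_key, name, value, timestamp):
    """Store a column; an older timestamp never overwrites a newer one."""
    row = keyspace[column_family].setdefault(row_key, {})
    current = row.get(name)
    if current is None or timestamp >= current.timestamp:
        row[name] = Column(name, value, timestamp)


def get(column_family, row_key, name):
    """Address a value by row key first, then by column name."""
    column = keyspace[column_family].get(row_key, {}).get(name)
    return None if column is None else column.value


put("customers", "c1", "email", "old@example.com", timestamp=1)
put("customers", "c1", "email", "new@example.com", timestamp=2)
put("customers", "c1", "email", "stale@example.com", timestamp=1)  # ignored
put("customers", "c2", "phone", "555-0100", timestamp=1)
print(get("customers", "c1", "email"))
```

Rows "c1" and "c2" live in the same column family yet hold different columns, which is the flexibility the slides describe.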
Cassandra DB Engine Architecture
• The LSM tree (log-structured merge-tree) is the heart of Cassandra’s DB engine.
• Partition key: each column family has a
partition key, which determines which
node in the cluster stores the data.
• Commit log: the transactional log, used for
recovery after system failures.
It is an append-only file and provides durability.
• Memtable: an in-memory cache of recent
writes. Each node has a
memtable for each CQL table. The memtable
accumulates writes and serves reads for data
not yet flushed to disk.
• SSTable: the final destination of data in C* (Cassandra). SSTables
are actual files on disk and are immutable.
• Compaction: the periodic process of merging
multiple SSTables into a single SSTable, done
primarily to optimize read operations.
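How these pieces fit together on a write can be sketched as a toy node. It is a deliberately simplified model, not Cassandra's implementation: the class and its `memtable_limit` parameter are hypothetical, "files" are in-memory tuples, and the commit log is never truncated.

```python
class Node:
    """Toy LSM-style storage node: commit log, memtable, SSTables."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []  # append-only record of every write
        self.memtable = {}    # in-memory writes not yet flushed
        self.sstables = []    # immutable sorted tables, newest last
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))  # durability first
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Freeze the memtable into an immutable, key-sorted SSTable."""
        if self.memtable:
            self.sstables.append(tuple(sorted(self.memtable.items())))
            self.memtable = {}

    def read(self, key):
        """Check the memtable, then SSTables from newest to oldest."""
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):
            for k, v in sstable:
                if k == key:
                    return v
        return None

    def compact(self):
        """Merge all SSTables into one, keeping the newest value per key."""
        merged = {}
        for sstable in self.sstables:  # oldest first, newer overwrite
            merged.update(sstable)
        self.sstables = [tuple(sorted(merged.items()))] if merged else []


node = Node(memtable_limit=2)
for key, value in [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("d", 5)]:
    node.write(key, value)
sstables_before = len(node.sstables)
node.compact()
print(sstables_before, len(node.sstables), node.read("a"))
```

Before compaction a read of "a" may have to look through two tables that both contain the key; afterwards only the latest version remains in a single table.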
Cassandra Uses Peer-to-Peer Cluster Communication
• It is a peer-to-peer database
where each node in the cluster
constantly communicates with
the other nodes to share and receive
information (node status, data
ranges and so on).
• There is no concept of master
or slave in a Cassandra
cluster. Any node can be the
coordinator node for a
query.
Writes in Cassandra
Cassandra also performs some special tasks
during or after a write.
These include:
• Compaction
• Hinted Handoff
• Caching
Compaction
Reads in Cassandra
Read repair
Read repair is a lazy mechanism in Cassandra that helps ensure the data you
request from the database is accurate and consistent.
For every read request, the coordinator node sends requests to the replica nodes
holding the data requested by the client.
The replicas return the data the client requested.
The most recent data is sent to the client and, asynchronously, the coordinator
identifies any replicas that returned obsolete data and issues a read-repair
request to each of them to update their data to the latest version.
Cassandra also uses Bloom filters to optimize reads.
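The read-repair steps can be sketched as below. This is an illustration only: replica names and values are hypothetical, each replica is reduced to a single (value, timestamp) pair, and the repair runs inline rather than asynchronously.

```python
# Hypothetical replicas: name -> (value, write timestamp) for one column.
replicas = {
    "replica-1": ("alice@new.example", 20),
    "replica-2": ("alice@old.example", 10),  # missed the latest write
    "replica-3": ("alice@new.example", 20),
}


def read_with_repair(replicas):
    """Return the newest value and bring stale replicas up to date."""
    responses = dict(replicas)  # what each replica answered
    latest = max(responses.values(), key=lambda response: response[1])
    stale = [name for name, response in responses.items()
             if response[1] < latest[1]]
    for name in stale:
        replicas[name] = latest  # the "read-repair request"
    return latest[0], stale


value, repaired = read_with_repair(replicas)
print(value, repaired)
```

After the call every replica holds the newest version, so a later read sees consistent answers from all of them.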
Strategies for Reads
1. ONE: reads from the closest node holding the
data
2. QUORUM: returns the result with the most recent
timestamp from a quorum of servers
3. LOCAL_QUORUM: returns the result with the most
recent timestamp from a quorum of servers
in the same data center as the
coordinator node
4. EACH_QUORUM: returns the result with the most
recent timestamp from a quorum of servers in each
data center
5. ALL: returns a result after all replica nodes for a
row key have responded
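How many replicas must answer at each level follows from the replication factor; a quorum is a majority, i.e. floor(RF / 2) + 1. The sketch below works this out for a hypothetical two-data-center cluster; the function names and the data center names "dc1" and "dc2" are assumptions made for illustration.

```python
def quorum(replication_factor):
    """Majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def required_responses(level, rf_by_datacenter, local_dc):
    """Number of replica responses the coordinator waits for at each level."""
    total = sum(rf_by_datacenter.values())
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return quorum(total)
    if level == "LOCAL_QUORUM":
        return quorum(rf_by_datacenter[local_dc])
    if level == "EACH_QUORUM":
        return sum(quorum(rf) for rf in rf_by_datacenter.values())
    if level == "ALL":
        return total
    raise ValueError(f"unknown consistency level: {level}")


# Hypothetical cluster: two data centers with three replicas each.
rf = {"dc1": 3, "dc2": 3}
needed = {level: required_responses(level, rf, local_dc="dc1")
          for level in ["ONE", "QUORUM", "LOCAL_QUORUM", "EACH_QUORUM", "ALL"]}
print(needed)
```

The stronger the level, the more replicas must respond, which trades read latency and availability for a better chance of seeing the latest write.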
Cassandra is suitable for write-heavy workloads
Cassandra's storage engine performs very well
on writes because it stores data in an append-only
format.
This makes great use of spinning disk drives,
which have poor seek times.
It can do sequential writes very quickly.
But the downside is that when you do a read,
you often need to scan through several
versions of an object to get the most recent
version to return to the caller.
Who is using Cassandra
• Twitter, Discord, Instagram, Reddit, Uber …
• Azure Cosmos DB also supports the Cassandra API
• “We’re using Cassandra in production for a bunch of things
at Twitter. A few examples: Our geo team uses it to store
and query their database of places of interest. The
research team uses it to store the results of data mining
done over our entire user base. Our analytics, operations
and infrastructure teams are working on a system that
uses cassandra for large-scale real time analytics for use
both internally and externally”
• Thank you

