An Introduction to Apache Cassandra
Aaron Ploetz
What is Cassandra? 
● Cassandra is a non-relational, partitioned row store. 
● Rows are organized into column families (tables) with a 
required primary key. 
● Data is distributed across multiple master-less nodes in
an application-transparent manner. 
● DataStax oversees the development of the Apache 
Cassandra open-source project, provides support to 
companies using Cassandra, and provides an enterprise-ready 
version of Cassandra.
$ whoami 
Aaron Ploetz 
@APloetz 
● Lead Database Engineer 
● B.S.-MCS UW-Whitewater 
● M.S.-SED Regis University 
● Using Cassandra since version 0.8 
● Contributor to the Cassandra tag on Stack Overflow
● Contributor to the Apache Cassandra project 
● 2014/15 DataStax MVP for Apache Cassandra
(short) History of Cassandra and 
DataStax 
● Developed at Facebook, open sourced in 2008.
● Design influenced by Google BigTable and Amazon Dynamo. 
● Graduated to Apache “Top-Level Project” status in Feb 2010. 
● DataStax founded by Jonathan Ellis and Matt Pfeil in late 2010, 
offering enterprise Cassandra support. 
● Secured $190 million in VC funding. 
● Started with eight people, now employs more than 350. 
● 400+ Customers, including 25 of the Fortune 100.
Key Features 
● Current release is Cassandra 2.1 (Sept 10). 
● Distributed, decentralized storage; no SPOF. 
● Scalable. 
● High-availability, Fault-tolerance. 
● Tunable Consistency. 
● High-performance. 
● Data center awareness.
Distributed, Decentralized Storage
[Diagram: two data centers, DC1 and DC2]
● Peer-to-peer, master-less replication. 
● Any node can handle a read or write operation. 
● Supports local read/write ops via “logical” data centers. 
● Gossip protocol allows nodes to be aware of each other. 
● Snitch ensures that data is replicated appropriately.
Scalability 
● Cassandra allows you to easily add nodes to scale your 
application back-end. 
● Netflix benchmark from 2011:
– 48 node cluster could handle 174,373 writes/sec. 
– 288 node cluster could handle 1,099,837 writes/sec. 
– Indicates that Cassandra scales linearly. 
● Throughput of N nodes = T. 
● Throughput of Nx2 nodes = Tx2.
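The linear-scaling claim can be checked with a little arithmetic on the two benchmark figures quoted above. The following is a minimal Python sketch; the function and constant names are illustrative and not part of any Cassandra tooling.

```python
# Benchmark figures quoted on the slide: (node count, writes per second).
SMALL_CLUSTER = (48, 174_373)
LARGE_CLUSTER = (288, 1_099_837)


def writes_per_node(nodes: int, writes_per_sec: int) -> float:
    """Average write throughput contributed by each node."""
    return writes_per_sec / nodes


def scaling_factors(small, large):
    """Return (node growth factor, throughput growth factor)."""
    return large[0] / small[0], large[1] / small[1]


if __name__ == "__main__":
    node_factor, throughput_factor = scaling_factors(SMALL_CLUSTER, LARGE_CLUSTER)
    print(f"per node (48 nodes):  {writes_per_node(*SMALL_CLUSTER):.0f} writes/sec")
    print(f"per node (288 nodes): {writes_per_node(*LARGE_CLUSTER):.0f} writes/sec")
    print(f"nodes grew {node_factor:.1f}x, throughput grew {throughput_factor:.2f}x")
```

Per-node throughput stays in roughly the same range (about 3,600 to 3,800 writes/sec) while the node count grows sixfold, which is what linear scaling predicts.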
High Availability 
[Diagram: data centers DC1 and DC2, with a failed node marked X]
● Cassandra was designed under the premise that 
hardware failures can and do occur.
● Gossip protocol keeps live nodes informed of failures.
● Cassandra 2.0.2 implemented Rapid Read Protection, which
redirects read operations to live nodes.
Tunable Consistency 
● Cassandra allows you to alter your consistency level on a
per-operation basis.
● Also allows configuration for data center locality.
● Consistency levels range from ALL (strong consistency) through
QUORUM to ONE (high availability / eventual consistency).
● Quorum == (replicas / 2) + 1, rounded down to a whole number.
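The quorum formula can be worked through in a few lines of Python. This is a minimal sketch, assuming the count in the formula is the number of replicas of the data (the replication factor) and that the division is integer division. The function names are illustrative, and the overlap check is the general "reads plus writes exceed replicas" rule of thumb rather than anything taken from the slides.

```python
def quorum(replicas: int) -> int:
    """Replicas that must respond for QUORUM: floor(replicas / 2) + 1."""
    if replicas < 1:
        raise ValueError("need at least one replica")
    return replicas // 2 + 1


def reads_see_latest_write(read_replicas: int, write_replicas: int, replicas: int) -> bool:
    """True when every read set must overlap every write set (R + W > N)."""
    return read_replicas + write_replicas > replicas


if __name__ == "__main__":
    rf = 3  # hypothetical replication factor
    q = quorum(rf)
    print(f"QUORUM with {rf} replicas = {q}")
    print("QUORUM read + QUORUM write overlap:", reads_see_latest_write(q, q, rf))
    print("ONE read + ONE write overlap:", reads_see_latest_write(1, 1, rf))
```

With three replicas, QUORUM is 2, so a QUORUM write followed by a QUORUM read always shares at least one replica; reading and writing at ONE gives no such guarantee, which is the availability end of the trade-off.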
Eventual Consistency != Hopeful Consistency
● Netflix experiment on consistency:
– Created a C* 1.1.7 cluster spanning two data centers, with 48 nodes in
each data center.
– Wrote 1,000,000 records at consistency level ONE (CL1) in one data center.
– Read the same 1,000,000 records at CL1 in the other data center.
– All records read successfully!
– “Eventually consistent does not mean a day, minute or
even a second from now… in most cases, it is
milliseconds!” - Christos Kalantzis
High Performance 
● Cassandra is optimized from the ground up for 
performance: 
Source: DataStax.com
High Performance 
● All disk writes are sequential, append-only 
operations. 
● No reading before writing. 
● Cassandra is optimized for threading on multi-core/
multi-processor machines.
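The two write-path points (append-only writes, no read before write) can be illustrated with a toy model. The sketch below is not Cassandra's implementation: the class name, the in-memory list standing in for the on-disk log, and the read-time resolution by newest timestamp are all simplifying assumptions made for illustration.

```python
import time
from typing import Optional


class AppendOnlyStore:
    """Toy model: every write is appended; nothing is read or updated in place."""

    def __init__(self) -> None:
        self._log = []  # (key, value, timestamp) tuples, in arrival order

    def __len__(self) -> int:
        return len(self._log)

    def write(self, key: str, value: str, timestamp: Optional[int] = None) -> None:
        # No lookup of the existing value: the new version is simply appended.
        if timestamp is None:
            timestamp = time.time_ns() // 1_000  # microseconds since the epoch
        self._log.append((key, value, timestamp))

    def read(self, key: str) -> Optional[str]:
        # The cost is paid at read time: pick the version with the newest timestamp.
        versions = [(ts, value) for k, value, ts in self._log if k == key]
        if not versions:
            return None
        return max(versions)[1]


if __name__ == "__main__":
    store = AppendOnlyStore()
    store.write("Mal", "111-555-1234", timestamp=1)
    store.write("Mal", "111-555-0000", timestamp=2)  # an "update" is just another append
    print(store.read("Mal"), "from", len(store), "stored versions")
```

Because a write never has to find and modify an earlier version, it stays a cheap sequential operation; the older versions simply remain in the log until something cleans them up.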
Potential Drawbacks? 
● Some use cases are not appropriate (transient data 
or delete-heavy patterns). 
● Developer learning curve: CQL != SQL 
● Simple queries only. No JOINs or sub-queries. 
● Optimal performance is achieved through de-normalization
and query-based data modeling.
Cassandra moves beyond disco-era data modeling
● Everything MUST be normalized!!!
● Redundant data == “bad”
● Relational database theory originated when disk space was expensive. In
1975 some vendors were selling disk space at $11k per MB.
● By 1980 prices “dropped” so that you could finally buy 1GB of storage for
under $1 million.
● Today I can buy a 1TB disk for $60.
Cassandra Storage Structures
● Keyspace == Database (in the RDBMS world)
    CREATE KEYSPACE products WITH replication = {
        'class': 'NetworkTopologyStrategy',
        'RFD': '2', 'MKE': '4'};
● Column Family == Table
    CREATE TABLE hierarchy (
        category text,
        subcategory text,
        classification text,
        skumap map<uuid, text>,
        PRIMARY KEY (category, subcategory, classification));
Cassandra Primary Keys
● Primary Keys are unique.
● Single Primary Key:
    PRIMARY KEY (keyColumn)
● Composite Primary Key:
    PRIMARY KEY (myPartitionKey, my1stClusteringKey, my2ndClusteringKey)
● Composite Partitioning Key:
    PRIMARY KEY ((my1stPartitionKey, my2ndPartitionKey), myClusteringKey)
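The three key forms above follow one rule: the first element of the PRIMARY KEY is the partition key (wrapped in extra parentheses when it is composite), and everything after it is a clustering key. A minimal Python sketch of that rule follows; modeling the key as a list with an optional nested tuple is an assumption made purely for illustration, and `split_primary_key` is a hypothetical helper.

```python
from typing import Sequence, Tuple, Union

KeyPart = Union[str, Tuple[str, ...]]


def split_primary_key(
    primary_key: Sequence[KeyPart],
) -> Tuple[Tuple[str, ...], Tuple[str, ...]]:
    """Split a PRIMARY KEY definition into (partition key columns, clustering columns).

    The first element is the partition key; a nested tuple there models the
    extra parentheses of a composite partitioning key. Everything after it
    is a clustering key.
    """
    if not primary_key:
        raise ValueError("a primary key is required")
    first, *rest = primary_key
    partition = (first,) if isinstance(first, str) else tuple(first)
    return partition, tuple(rest)


if __name__ == "__main__":
    print(split_primary_key(["keyColumn"]))
    print(split_primary_key(["myPartitionKey", "my1stClusteringKey", "my2ndClusteringKey"]))
    print(split_primary_key([("my1stPartitionKey", "my2ndPartitionKey"), "myClusteringKey"]))
```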
Cassandra Secondary Indexes 
● Cassandra does allow secondary indexes:
    CREATE INDEX myIndex ON myTable(myNonKeyColumn);
● Designed for query convenience, not for performance. 
● Does not perform well on high-cardinality columns, because you filter a 
huge volume of records for a small number of results. Extremely low 
cardinality is also not a good idea (ex: customer address [state == good, 
phone == bad, gender == bad]). 
● Works best on a table having many rows that contain the indexed value; 
middle-of-the-road cardinality.
Serenity “crew” 
● Create a table to store data for the crew of “Serenity” from “Firefly.” 
    CREATE TABLE crew (
        crewname TEXT,
        firstname TEXT,
        lastname TEXT,
        phone TEXT,
        PRIMARY KEY (crewname));

 crewname | firstname | lastname | phone
----------+-----------+----------+--------------
 Mal      | Malcolm   | Reynolds | 111-555-1234
 Jayne    | Jayne     | Cobb     | 111-555-3464
 Sheppard | Derial    | Book     | 111-555-2349
 Simon    | Simon     | Tam      | 111-555-8899
Serenity “crew” under the hood 
RowKey:Mal
=> (column=, value=, timestamp=1374546754299000)
=> (column=firstname, value=Malcolm, timestamp=1374546754299000)
=> (column=lastname, value=Reynolds, timestamp=1374546754299000)
=> (column=phone, value=111-555-1234, timestamp=1374546754299000)
RowKey:Jayne
=> (column=, value=, timestamp=1374546757815000)
=> (column=firstname, value=Jayne, timestamp=1374546757815000)
=> (column=lastname, value=Cobb, timestamp=1374546757815000)
=> (column=phone, value=111-555-3464, timestamp=1374546757815000)
Serenity “crewbyphone” 
● To solve the problem of being able to query crew members by phone:
    CREATE TABLE crewbyphone (
        crewname TEXT,
        firstname TEXT,
        lastname TEXT,
        phone TEXT,
        PRIMARY KEY (phone, crewname));

 crewname | firstname | lastname  | phone
----------+-----------+-----------+--------------
 Mal      | Malcolm   | Reynolds  | 111-555-1234
 Wash     | Hoban     | Washburne | 111-555-1212
 Zoey     | Zoey      | Washburne | 111-555-1212
 Jayne    | Jayne     | Cobb      | 111-555-3464
Serenity “crewbyphone” under the hood
RowKey:111-555-1234
=> (column=Mal, value=, timestamp=1374546754299000)
=> (column=Mal:firstname, value=Malcolm, timestamp=...
=> (column=Mal:lastname, value=Reynolds, timestamp=...
RowKey:111-555-1212
=> (column=Wash, value=, timestamp=1374546754299000)
=> (column=Wash:firstname, value=Hoban, timestamp=...
=> (column=Wash:lastname, value=Washburne, timestamp=...
=> (column=Zoey, value=, timestamp=1374546754299000)
=> (column=Zoey:firstname, value=Zoey, timestamp=...
=> (column=Zoey:lastname, value=Washburne, timestamp=...
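The layout above can be reproduced with a short Python sketch: rows are grouped under their partition key (phone), and within a partition each cell is named after the clustering key value (crewname) plus the column. This is a simplified model of what the listing shows, not Cassandra's storage engine; timestamps are omitted, and `storage_layout` and `crew_by_phone` are hypothetical helper names.

```python
from collections import defaultdict

# Rows from the crewbyphone table shown on the slide.
ROWS = [
    {"crewname": "Mal", "firstname": "Malcolm", "lastname": "Reynolds", "phone": "111-555-1234"},
    {"crewname": "Wash", "firstname": "Hoban", "lastname": "Washburne", "phone": "111-555-1212"},
    {"crewname": "Zoey", "firstname": "Zoey", "lastname": "Washburne", "phone": "111-555-1212"},
    {"crewname": "Jayne", "firstname": "Jayne", "lastname": "Cobb", "phone": "111-555-3464"},
]


def storage_layout(rows, partition_key, clustering_key):
    """Group rows by partition key; name each cell '<clustering value>:<column>'."""
    partitions = defaultdict(dict)
    # Sorting first keeps the cells of each partition ordered by clustering key.
    for row in sorted(rows, key=lambda r: r[clustering_key]):
        cells = partitions[row[partition_key]]
        cells[row[clustering_key]] = ""  # row marker with an empty value
        for column, value in row.items():
            if column not in (partition_key, clustering_key):
                cells[f"{row[clustering_key]}:{column}"] = value
    return dict(partitions)


def crew_by_phone(layout, phone):
    """Return the crew names stored under one phone number (one partition)."""
    return [name for name in layout.get(phone, {}) if ":" not in name]


if __name__ == "__main__":
    layout = storage_layout(ROWS, "phone", "crewname")
    for cell, value in layout["111-555-1212"].items():
        print(f"=> (column={cell}, value={value})")
    print(crew_by_phone(layout, "111-555-1212"))
```

Looking up a phone number touches a single partition, which is why the query table is keyed on phone: Wash and Zoey share a number, so both come back from the same row key.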
Who Uses Cassandra?
Who else Uses Cassandra?
Cassandra Large Deployments 
● 100+ nodes. 250TB of data, cluster sizes vary from 6 to 32 
nodes. 
● 2,500+ nodes, 420TB of data, 4 DCs, handles 1 trillion 
operations per day. 
● 75,000+ nodes, 10s of PB of data, largest cluster 1000+ nodes.
Additional Reading 
● Amazon Dynamo paper 
● Facebook Cassandra paper 
● Harvest, Yield, and Scalable, Tolerant Systems - Brewer, Fox, 1999 
● DataStax grabs $106M to achieve big-dog status in database country 
● http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html 
● http://planetcassandra.org/blog/a-netflix-experiment-eventual-consistency-hopeful-consistency-
● DataStax Documentation
● KillrVideo.com
Getting Started 
● Community site: http://planetcassandra.org 
● http://datastax.com 
● DataStax community edition: 
http://planetcassandra.org/cassandra 
● DataStax startup program: 
http://www.datastax.com/what-we-offer/products-services/datastax-enterprise/
● Apache Cassandra project site:
http://cassandra.apache.org/
Questions?
Demo
Want to work at AccuLynx? 
We're hiring! 
http://careers.stackoverflow.com/company/acculynx

Intro to cassandra

  • 1. An Introduction to Apache assandra Aaron Ploetz
  • 2. What is Cassandra? ● Cassandra is a non-relational, partitioned row store. ● Rows are organized into column families (tables) with a required primary key. ● Data is distributed across multiple master-less, nodes in an application-transparent manner. ● DataStax oversees the development of the Apache Cassandra open-source project, provides support to companies using Cassandra, and provides an enterprise-ready version of Cassandra.
  • 3. $ whoami Aaron Ploetz @APloetz ● Lead Database Engineer ● B.S.-MCS UW-Whitewater ● M.S.-SED Regis University ● Using Cassandra since version 0.8 ● Contributor to the Casandra tag on StackOverflow ● Contributor to the Apache Cassandra project ● 2014/15 DataStax MVP for Apache Cassandra
  • 4. (short) History of Cassandra and DataStax ● Developed at , open sourced in 2008. ● Design influenced by Google BigTable and Amazon Dynamo. ● Graduated to Apache “Top-Level Project” status in Feb 2010. ● DataStax founded by Jonathan Ellis and Matt Pfeil in late 2010, offering enterprise Cassandra support. ● Secured $190 million in VC funding. ● Started with eight people, now employs more than 350. ● 400+ Customers, including 25 of the Fortune 100.
  • 5. Key Features ● Current release is Cassandra 2.1 (Sept 10). ● Distributed, decentralized storage; no SPOF. ● Scalable. ● High-availability, Fault-tolerance. ● Tunable Consistency. ● High-performance. ● Data center awareness.
  • 6. Distributed, Decentralized Storage (diagram: data centers DC1 and DC2) ● Peer-to-peer, master-less replication. ● Any node can handle a read or write operation. ● Supports local read/write ops via “logical” data centers. ● Gossip protocol allows nodes to be aware of each other. ● Snitch ensures that data is replicated appropriately.
  • 7. Scalability ● Cassandra allows you to easily add nodes to scale your application back-end. ● Benchmark from 2011: – 48 node cluster could handle 174,373 writes/sec. – 288 node cluster could handle 1,099,837 writes/sec. – Indicates that Cassandra scales linearly. ● Throughput of N nodes = T. ● Throughput of Nx2 nodes = Tx2.
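
The linear-scaling claim on this slide can be checked with simple arithmetic. The following is a minimal Python sketch that uses only the two benchmark figures quoted above; the function and variable names are illustrative and do not come from the deck.

```python
# Benchmark figures quoted on the slide: cluster size -> writes per second.
BENCHMARK = {48: 174_373, 288: 1_099_837}


def per_node(nodes: int, writes_per_sec: int) -> float:
    """Average write throughput contributed by each node."""
    return writes_per_sec / nodes


small, large = 48, 288

# How much bigger the large cluster is, and how much more it writes.
node_ratio = large / small
throughput_ratio = BENCHMARK[large] / BENCHMARK[small]

if __name__ == "__main__":
    for nodes, writes in BENCHMARK.items():
        print(f"{nodes} nodes: {per_node(nodes, writes):.0f} writes/sec per node")
    print(f"nodes x{node_ratio:.1f}, throughput x{throughput_ratio:.2f}")
```

If scaling is linear, the throughput ratio should track the node ratio and the per-node figure should stay roughly constant, which is what these two data points show.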
  • 8. High Availability (diagram: DC1 and DC2, one node marked X) ● Cassandra was designed under the premise that hardware failures can and do occur.
  • 9. High Availability (diagram: DC1 and DC2, two nodes marked X) ● Cassandra was designed under the premise that hardware failures can and do occur.
  • 10. High Availability (diagram: DC1 and DC2, seven nodes marked X) ● Gossip protocol keeps live nodes informed of failures. ● Cassandra 2.0.2 implemented Rapid Read Protection, which redirects read operations to live nodes.
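  • 11. Tunable Consistency ● Cassandra allows you to alter your consistency level on a per-operation basis. ● Also allows configuration for data center locality. ● Consistency levels span a spectrum: ALL (strong consistency), QUORUM, ONE (high availability / eventual consistency). ● Quorum == (replicas / 2) + 1, using integer division, where replicas is the replication factor rather than the total number of nodes in the cluster.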
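
The quorum formula on this slide is easy to get wrong by one, so here is a small Python sketch of it. It assumes the count is taken over replicas (the replication factor) with integer division, and it adds the commonly cited rule that reads and writes overlap when R + W > RF. The function names are illustrative, not a Cassandra API.

```python
def quorum(replicas: int) -> int:
    """Smallest majority of `replicas`: (replicas / 2) + 1, rounded down."""
    return replicas // 2 + 1


def reads_see_latest_write(read_replicas: int, write_replicas: int,
                           replication_factor: int) -> bool:
    """True when every read set must overlap every write set (R + W > RF)."""
    return read_replicas + write_replicas > replication_factor


if __name__ == "__main__":
    rf = 3
    q = quorum(rf)
    print(f"RF={rf}: QUORUM needs {q} replicas")
    print("QUORUM reads + QUORUM writes overlap:", reads_see_latest_write(q, q, rf))
    print("ONE reads + ONE writes overlap:", reads_see_latest_write(1, 1, rf))
```

With a replication factor of 3, QUORUM reads and writes always share at least one replica, while ONE/ONE trades that guarantee for availability and relies on eventual consistency.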
  • 12. Eventual Consistency != Hopeful Consistency ● Netflix experiment on consistency: – Created a C* 1.1.7 cluster spanning two data centers, with 48 nodes in each data center. – Wrote 1,000,000 records at CL1 in one data center. – Read the same 1,000,000 records at CL1 in the other data center. – All records read successfully! – “Eventually consistent does not mean a day, minute or even a second from now… in most cases, it is milliseconds!” - Christos Kalantzis
  • 13. High Performance ● Cassandra is optimized from the ground up for performance. (Chart not reproduced; source: DataStax.com)
  • 14. High Performance ● All disk writes are sequential, append-only operations. ● No reading before writing. ● Cassandra is optimized for threading on multi-core/multi-processor machines.
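
To make "append-only, no reading before writing" concrete, here is a deliberately simplified Python sketch. It is a toy in-memory model, not Cassandra's actual storage engine: the class name, the list-based log and the scan on read are all assumptions made for illustration. Writes only append, and a read picks the value with the newest timestamp (last write wins).

```python
class AppendOnlyStore:
    """Toy model: every write is appended; nothing is updated in place."""

    def __init__(self):
        self.log = []  # list of (row_key, column, value, timestamp)

    def write(self, row_key, column, value, timestamp):
        # No lookup of the existing value: the write is a pure append.
        self.log.append((row_key, column, value, timestamp))

    def read(self, row_key, column):
        # Conflicts are resolved at read time: the newest timestamp wins.
        latest = None
        for key, col, value, ts in self.log:
            if key == row_key and col == column:
                if latest is None or ts >= latest[1]:
                    latest = (value, ts)
        return None if latest is None else latest[0]


if __name__ == "__main__":
    store = AppendOnlyStore()
    store.write("Mal", "phone", "111-555-0000", timestamp=1)
    store.write("Mal", "phone", "111-555-1234", timestamp=2)
    print(store.read("Mal", "phone"))
```

The design choice this illustrates is that an update costs the same as an insert, because the cost of reconciling versions is deferred to reads.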
  • 15. Potential Drawbacks? ● Some use cases are not appropriate (transient data or delete-heavy patterns). ● Developer learning curve: CQL != SQL. ● Simple queries only. No JOINs or sub-queries. ● Optimal performance is achieved through denormalization and query-based data modeling.
  • 16. Cassandra moves beyond disco-era data modeling ● Everything MUST be normalized!!! ● Redundant data == “bad” ● Relational database theory originated when disk space was expensive. In 1975 some vendors were selling disk space at $11k per MB. ● By 1980 prices “dropped” so that you could finally buy 1GB of storage for under $1 million. ● Today I can buy a 1TB disk for $60.
  • 17. Cassandra Storage Structures
      ● Keyspace == Database (in the RDBMS world)
          CREATE KEYSPACE products WITH replication = {
            'class': 'NetworkTopologyStrategy', 'RFD': '2', 'MKE': '4'};
      ● Column Family == Table
          CREATE TABLE hierarchy (
            category text,
            subcategory text,
            classification text,
            skumap map<uuid, text>,
            PRIMARY KEY (category, subcategory, classification));
  • 18. Cassandra Primary Keys ● Primary Keys are unique. ● Single Primary Key: PRIMARY KEY (keyColumn) ● Composite Primary Key: PRIMARY KEY (myPartitionKey, my1stClusteringKey, my2ndClusteringKey) ● Composite Partitioning Key: PRIMARY KEY ((my1stPartitionKey, my2ndPartitionKey), myClusteringKey)
  • 19. Cassandra Secondary Indexes ● Cassandra does allow secondary indexes: CREATE INDEX myIndex ON myTable(myNonKeyColumn); ● Designed for query convenience, not for performance. ● Does not perform well on high-cardinality columns, because you filter a huge volume of records for a small number of results. Extremely low cardinality is also not a good idea (ex: customer address [state == good, phone == bad, gender == bad]). ● Works best on a table having many rows that contain the indexed value; middle-of-the-road cardinality.
  • 20. Serenity “crew”
      ● Create a table to store data for the crew of “Serenity” from “Firefly.”
          CREATE TABLE crew (
            crewname TEXT,
            firstname TEXT,
            lastname TEXT,
            phone TEXT,
            PRIMARY KEY (crewname));

          crewname | firstname | lastname | phone
          ---------+-----------+----------+--------------
          Mal      | Malcolm   | Reynolds | 111-555-1234
          Jayne    | Jayne     | Cobb     | 111-555-3464
          Sheppard | Derial    | Book     | 111-555-2349
          Simon    | Simon     | Tam      | 111-555-8899
  • 21. Serenity “crew” under the hood
          RowKey:Mal
          => (column=, value=, timestamp=1374546754299000)
          => (column=firstname, value=Malcolm, timestamp=1374546754299000)
          => (column=lastname, value=Reynolds, timestamp=1374546754299000)
          => (column=phone, value=111-555-1234, timestamp=1374546754299000)
          RowKey:Jayne
          => (column=, value=, timestamp=1374546757815000)
          => (column=firstname, value=Jayne, timestamp=1374546757815000)
          => (column=lastname, value=Cobb, timestamp=1374546757815000)
          => (column=phone, value=111-555-3464, timestamp=1374546757815000)
  • 22. Serenity “crewbyphone”
      ● To solve the problem of being able to query crew members by phone:
          CREATE TABLE crewbyphone (
            crewname TEXT,
            firstname TEXT,
            lastname TEXT,
            phone TEXT,
            PRIMARY KEY (phone, crewname));

          crewname | firstname | lastname  | phone
          ---------+-----------+-----------+--------------
          Mal      | Malcolm   | Reynolds  | 111-555-1234
          Wash     | Hoban     | Washburne | 111-555-1212
          Zoey     | Zoey      | Washburne | 111-555-1212
          Jayne    | Jayne     | Cobb      | 111-555-3464
  • 23. Serenity “crewbyphone” under the hood
          RowKey:111-555-1234
          => (column=Mal, value=, timestamp=1374546754299000)
          => (column=Mal:firstname, value=Malcolm, timestamp=...
          => (column=Mal:lastname, value=Reynolds, timestamp=...
          RowKey:111-555-1212
          => (column=Wash, value=, timestamp=1374546754299000)
          => (column=Wash:firstname, value=Hoban, timestamp=...
          => (column=Wash:lastname, value=Washburne, timestamp=...
          => (column=Zoey, value=, timestamp=1374546754299000)
          => (column=Zoey:firstname, value=Zoey, timestamp=...
          => (column=Zoey:lastname, value=Washburne, timestamp=...
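
The "under the hood" view above can be mimicked with nested dictionaries: the partition key (phone) selects one wide row, and the clustering key (crewname) orders the entries inside it. The Python sketch below is only a mental model, assuming an in-memory structure; the class and method names are hypothetical and are not part of any Cassandra driver.

```python
from collections import defaultdict


class WideRowTable:
    """Toy model of PRIMARY KEY (partition_key, clustering_key)."""

    def __init__(self):
        # partition key -> {clustering key -> {column name -> value}}
        self.partitions = defaultdict(dict)

    def insert(self, partition_key, clustering_key, **columns):
        row = self.partitions[partition_key].setdefault(clustering_key, {})
        row.update(columns)

    def select(self, partition_key):
        """Return every entry in one partition, ordered by clustering key."""
        rows = self.partitions.get(partition_key, {})
        return [(ck, rows[ck]) for ck in sorted(rows)]


if __name__ == "__main__":
    crewbyphone = WideRowTable()
    crewbyphone.insert("111-555-1212", "Zoey", firstname="Zoey", lastname="Washburne")
    crewbyphone.insert("111-555-1212", "Wash", firstname="Hoban", lastname="Washburne")
    crewbyphone.insert("111-555-1234", "Mal", firstname="Malcolm", lastname="Reynolds")
    for crewname, cols in crewbyphone.select("111-555-1212"):
        print(crewname, cols)
```

Looking up by phone touches a single partition, which is why both Wash and Zoey come back from one row key in the listing above.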
  • 25. Who else Uses Cassandra?
  • 26. Cassandra Large Deployments ● 100+ nodes. 250TB of data, cluster sizes vary from 6 to 32 nodes. ● 2,500+ nodes, 420TB of data, 4 DCs, handles 1 trillion operations per day. ● 75,000+ nodes, 10s of PB of data, largest cluster 1000+ nodes.
  • 27. Additional Reading ● Amazon Dynamo paper ● Facebook Cassandra paper ● Harvest, Yield, and Scalable, Tolerant Systems - Brewer, Fox, 1999 ● DataStax grabs $106M to achieve big-dog status in database country ● http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html ● http://planetcassandra.org/blog/a-netflix-experiment-eventual-consistency-hopeful-consistency- ● DataStax Documentation ● KillrVideo.com
  • 28. Getting Started ● Community site: http://planetcassandra.org ● http://datastax.com ● DataStax community edition: http://planetcassandra.org/cassandra ● DataStax startup program: http://www.datastax.com/what-we-offer/products-services/datastax-enterprise/ ● Apache Cassandra project site: http://cassandra.apache.org/
  • 30. Demo
  • 31. Want to work at AccuLynx? We're hiring! http://careers.stackoverflow.com/company/acculynx