Cassandra
FUNdamentals Overview
Main points

Structured log storage

Columns ordered by name inside key

Rows ordered by hash of row key *

Column family storage

Fully distributed peer-to-peer

Partitioned by row key

Dynamo consistency
Structured log storage

No writes in place for you

JVM heap is reserved for memtables

Memtables are sorted

When memtables reach a specific size they are
flushed to disk
− Creates sstable file
− Bloom filter file
− Index file
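
The write path above can be sketched as a toy model. This is only an illustration under stated assumptions, not Cassandra's implementation: the `Memtable` class, the `flush_threshold` parameter (counted in entries rather than bytes), and the use of a sorted tuple as a stand-in for an sstable file are all hypothetical.

```python
class Memtable:
    """Toy in-memory write buffer; not Cassandra's actual implementation."""

    def __init__(self, flush_threshold):
        # Hypothetical knob: number of entries that triggers a flush.
        self.flush_threshold = flush_threshold
        self.data = {}
        # Each flushed "sstable" is an immutable, sorted tuple of (key, value).
        self.sstables = []

    def write(self, key, value):
        # Writes only touch memory; nothing on disk is modified in place.
        self.data[key] = value
        if len(self.data) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Emit entries in sorted key order, then start a fresh memtable.
        self.sstables.append(tuple(sorted(self.data.items())))
        self.data = {}


mt = Memtable(flush_threshold=3)
for k, v in [("c", 3), ("a", 1), ("b", 2), ("d", 4)]:
    mt.write(k, v)
```

After the third write the buffer is flushed as one sorted, immutable run, and the fourth write lands in a new, empty memtable.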
Compaction

SSTables merged

Deleted columns physically removed

Two compaction strategies
− Size-tiered
− Leveled (inspired by LevelDB)
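
A minimal sketch of what a compaction does, assuming each sstable is modeled as a dict of column name to `(timestamp, value)`. The `TOMBSTONE` marker and the `compact` function are hypothetical names for illustration, and the sketch ignores the grace period real Cassandra applies before dropping tombstones.

```python
TOMBSTONE = object()  # hypothetical marker standing in for a deleted column


def compact(sstables):
    """Merge sstables of {column: (timestamp, value)}; newest timestamp wins."""
    merged = {}
    for table in sstables:
        for column, (ts, value) in table.items():
            if column not in merged or ts > merged[column][0]:
                merged[column] = (ts, value)
    # Physically drop columns whose newest version is a tombstone.
    return {c: tv for c, tv in sorted(merged.items()) if tv[1] is not TOMBSTONE}


old = {"a": (1, "x"), "b": (1, "y")}
new = {"a": (2, "z"), "b": (3, TOMBSTONE)}
compacted = compact([old, new])
```

Here column `a` keeps only its newest value and column `b` disappears entirely, because its most recent version is a delete.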
Commit logs

Every write/delete operation goes to the commit
log

If a node shuts down with un-flushed
memtables (every shutdown, really)

The commit logs are replayed on restart
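
The recovery idea can be sketched in a few lines. This is a simplification: a Python list stands in for the on-disk commit log, a dict stands in for the memtable, and `apply_write` and `replay` are hypothetical helper names.

```python
def apply_write(commit_log, memtable, key, value):
    # Record the operation in the log first (a list stands in for the file),
    # then apply it to the in-memory table.
    commit_log.append((key, value))
    memtable[key] = value


def replay(commit_log):
    # Rebuild a lost memtable by re-applying every logged write in order.
    memtable = {}
    for key, value in commit_log:
        memtable[key] = value
    return memtable


log, mem = [], {}
apply_write(log, mem, "a", "1")
apply_write(log, mem, "b", "2")
apply_write(log, mem, "a", "3")

# Simulate a restart: the memtable is gone, the log survives.
recovered = replay(log)
```

Replaying in order reproduces exactly the state the memtable held before the shutdown.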
Columns ordered inside key

Cassandra likes wide rows
− Up to 2 billion columns per row
− (but not really; that would be a 32GB row)

set mystuff['ecapriolo']['a']='1'

set mystuff['ecapriolo']['b']='2'

set mystuff['ecapriolo']['c']='3'
...

slice mystuff['ecapriolo'] ['b'] ['g']
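
Because columns are kept sorted by name, a slice is a range lookup rather than a scan. A rough sketch, assuming string column names and treating both bounds as inclusive (an assumption of this example); `column_slice` is a hypothetical helper, not a Cassandra API.

```python
import bisect


def column_slice(sorted_names, start, finish):
    """Return the column names in [start, finish] from a sorted list."""
    lo = bisect.bisect_left(sorted_names, start)
    hi = bisect.bisect_right(sorted_names, finish)
    return sorted_names[lo:hi]


# Columns are stored sorted by name, whatever order they were written in.
row = sorted(["c", "a", "h", "b", "e"])
result = column_slice(row, "b", "g")
```

The slice bounds do not need to exist as columns: `g` is not in the row, yet the lookup still returns everything from `b` up to it.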
Rows ordered by hash of row key

All columns of row 'a1' on the same node

But the columns of row 'a2' may be on a
different node

Reduces hot spots

But there is no total ordering based on row
keys
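
To see why hashing removes ordering, here is a small sketch in the spirit of an MD5-based partitioner. The `token` function is illustrative only and is not claimed to reproduce Cassandra's exact token values.

```python
import hashlib


def token(row_key):
    """MD5-based token for a row key (illustrative only)."""
    return int(hashlib.md5(row_key.encode("utf-8")).hexdigest(), 16)


keys = ["a1", "a2", "a3", "a4"]

# Placement follows token order, which is unrelated to key order.
by_token = sorted(keys, key=token)
```

Adjacent keys such as `a1` and `a2` get unrelated tokens, which spreads load but means a range scan over row keys is not meaningful.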
Peer to Peer

Node list and token ranges are gossiped

Each node responsible for local storage and
requests

When a new node joins it takes some token
range away from other nodes.
[Diagram: rows 'Ed', 'stacey' and 'bob', each stored on three nodes. Replication 3]
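
A sketch of how a row could end up on three nodes with a replication factor of 3, assuming the simplest placement rule: walk the ring clockwise from the row's token and take the next distinct nodes. The node names, the token values and the 32-bit ring size are all made up for the example.

```python
import bisect
import hashlib


def token(key, ring_size=2**32):
    # Hypothetical small ring so the numbers stay readable.
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16) % ring_size


def replicas(row_key, node_tokens, rf):
    """node_tokens maps node name -> token. Returns rf distinct nodes clockwise."""
    ring = sorted((t, n) for n, t in node_tokens.items())
    tokens = [t for t, _ in ring]
    # First node whose token is >= the row token, wrapping around the ring.
    start = bisect.bisect_left(tokens, token(row_key)) % len(ring)
    return [ring[(start + i) % len(ring)][1] for i in range(min(rf, len(ring)))]


nodes = {"node1": 0, "node2": 2**30, "node3": 2**31, "node4": 3 * 2**30}
owners = replicas("Ed", nodes, rf=3)
```

Every node can compute the same answer from the gossiped node list and tokens, so no central lookup service is needed.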
Dynamo consistency

Operations have a requested Consistency
Level
− ONE
− QUORUM

The number of replicas required by the CL must ack
the operation before the user receives an ack

If an operation fails it is safe to retry *
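
The arithmetic behind the consistency levels, as a small sketch. Only the two levels named above are modeled; `acks_required` and `overlaps` are hypothetical helper names.

```python
def quorum(rf):
    # A majority of the replicas.
    return rf // 2 + 1


def acks_required(consistency_level, rf):
    return {"ONE": 1, "QUORUM": quorum(rf)}[consistency_level]


def overlaps(read_cl, write_cl, rf):
    """True when every read set must intersect every write set (R + W > RF)."""
    return acks_required(read_cl, rf) + acks_required(write_cl, rf) > rf
```

With a replication factor of 3, QUORUM reads plus QUORUM writes touch 2 + 2 = 4 > 3 replicas, so a read always sees at least one replica that took the latest write. ONE plus ONE gives 2, which does not exceed 3, so a stale read is possible.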
Fully distributed. The good

Highly available

Redundant

Fault tolerant
Fully distributed! The bad

Locks

Counters

Tombstones

Consistency
Hadoop
Hadoop and Cassandra

ColumnFamilyInputFormat
− Takes a ColumnFamily as input
− Map input: ByteBuffer key,
SortedMap<ByteBuffer,Column> columns

ColumnFamilyOutputFormat
− Writes out to a column family
− OutputFormat<ByteBuffer,List<Mutation>>
Hadoop optimizations

Tasks run with data locality if C* and Hadoop are on the same node

InputFormat can leverage c* secondary
indexes

OutputFormat can use bulk loader
− C* writes are helluva fast anyway
Hive and Cassandra

Hive support is similar to the HBase storage
handler support

Create a hive table specifying properties
similar to those in map reduce

hive> CREATE EXTERNAL TABLE
Users(userid string, name string, email string, phone string)
STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
WITH
Other support out there

github.com/edwardcapriolo/hive-cassandra-udfs
− Delete UDF
− Composite splitter/builder UDFs

Not very hard to roll your own input format
− OneRowInputFormat
− ListOfRowsInputFormat
Pig Cassandra

Nice support for pig/cassandra

Pygmalion library

But I don't use it
− Cause I use hive
− You should as well
− And get my book :)
Comparison between c*
and “other noSQL”

I know you're talking about HBase :)

Cassandra does not store multiple versions of
a column
− Last update wins
− Use UUID as part of column name instead

The row keys are not globally ordered *
− Unless you are using ByteOrderedPartitioner (no one
should use this)
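
A sketch of "last update wins" and of the UUID workaround, assuming each column version is modeled as a `(timestamp, value)` pair. The `reconcile` helper and the `("email", uuid)` column-name layout are hypothetical, and ties are resolved arbitrarily here.

```python
import uuid


def reconcile(a, b):
    """Each version is (timestamp, value); the higher timestamp wins."""
    return a if a[0] >= b[0] else b


winner = reconcile((100, "old@example.com"), (200, "new@example.com"))

# To keep history, put a time-based UUID in the column name so that each
# write creates a new column instead of overwriting the previous one.
row = {}
for value in ["v1", "v2"]:
    row[("email", uuid.uuid1())] = value
```

With a plain column name the second write would replace the first; with the UUID in the name both values survive as separate columns in the same row.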
Comparison between c*
and “other noSQL”

Each c* replica actively serves reads & writes

Cassandra directly manages its storage

Shards are pre-defined tokens (no auto-split)

Qualifier/column name can NOT be null
Key Performance tips
Know your data

Design for the long tail scenarios
− With design x our largest customer will have
10000000000000 columns in one row

How large will this column family be in 5
months?

What is the request rate?

How random is the read pattern?
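
A back-of-the-envelope check for these questions. The 16 bytes per column is an assumption, chosen because it is what the earlier "2 billion columns would be a 32GB row" figure implies; `row_size_bytes` and `fits_in_one_row` are hypothetical helpers.

```python
GB = 10**9
MAX_COLUMNS_PER_ROW = 2 * 10**9  # the "up to 2 billion" limit


def row_size_bytes(columns, bytes_per_column):
    return columns * bytes_per_column


def fits_in_one_row(columns, limit=MAX_COLUMNS_PER_ROW):
    return columns <= limit


# 2 billion columns at an assumed 16 bytes each.
size = row_size_bytes(2 * 10**9, 16)

# The "largest customer" figure from design x.
largest_customer_ok = fits_in_one_row(10_000_000_000_000)
```

At the assumed column size the limit works out to 32 GB, and the long-tail customer with 10000000000000 columns does not fit in one row at all, so design x needs the row split across several keys.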
Understanding write-once files

Deletes are writes that get compacted away
later

Can you optimize by using blind writes?

What percent of your application is
update/insert?
Profiling / Dark Launch

Compression

Compaction strategy
Metrics

Collect the JMX information
− Column family
− Caches

Set milestone alerts (traps)
Hardware

Fast disk (you almost always want SSD)

RAM
− Caches, bloom filters, young gen

CPU
− Garbage collection, deserialization and compaction
all need CPU to work
Anti patterns

Using one row key as a queue

Doing N reads to satisfy a request

Read before write

Using collection support in place of wide rows

Encoding
Questions?


  • 3. Main points
      • Structured log storage
      • Columns ordered by name inside key
      • Rows ordered by hash of row key *
      • Column family storage
      • Fully distributed peer-to-peer
      • Partitioned by row key
      • Dynamo consistency
  • 4. Structured log storage
      • No writes in place for you
      • JVM heap is reserved for memtables
      • Memtables are sorted
      • When memtables reach a specific size they are flushed to disk
          − Creates sstable file
          − Bloom filter file
          − Index file
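
A minimal sketch of the flush behaviour on this slide, assuming a toy in-memory store rather than Cassandra's real storage engine. The class name `ToyStore` and the `flush_threshold` parameter are hypothetical, and the bloom filter and index files are left out.

```python
class ToyStore:
    """Toy log-structured store: writes go to a memtable, never to files in place."""

    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold  # hypothetical size limit, in entries
        self.memtable = {}                      # in-memory buffer of recent writes
        self.sstables = []                      # flushed, immutable, sorted tuples

    def write(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # A flush writes the memtable out in sorted key order and starts a new one.
        if self.memtable:
            self.sstables.append(tuple(sorted(self.memtable.items())))
            self.memtable = {}

    def read(self, key):
        # Check the memtable first, then sstables from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):
            for k, v in table:
                if k == key:
                    return v
        return None


store = ToyStore(flush_threshold=3)
for k, v in [("c", 1), ("a", 2), ("b", 3), ("a", 4)]:
    store.write(k, v)
```

The third write triggers a flush, so the later write to `a` sits in a fresh memtable while the older value stays untouched in the flushed table.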
  • 5. Compaction
      • SSTables merged
      • Deleted columns physically removed
      • Two compaction strategies
          − Size-tiered
          − Leveled (LevelDB-style)
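
A rough sketch of what merging sstables means, assuming each sstable is a sorted tuple of (column, value) pairs and that a sentinel object marks a deleted column. The names `TOMBSTONE` and `compact` are hypothetical, and the sketch ignores any grace period before tombstones are purged.

```python
TOMBSTONE = object()  # marker standing in for a deleted column


def compact(sstables):
    """Merge sstables (oldest first) into one; newer values win, tombstones are dropped."""
    merged = {}
    for table in sstables:          # later tables overwrite earlier ones
        for key, value in table:
            merged[key] = value
    return tuple(sorted((k, v) for k, v in merged.items() if v is not TOMBSTONE))


old = (("a", 1), ("b", 2), ("c", 3))
new = (("b", TOMBSTONE), ("c", 30), ("d", 4))
result = compact([old, new])
```

Here `b` disappears physically, `c` keeps only its newest value, and the output is a single sorted table.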
  • 6. Commit logs
      • Every write/delete operation goes to the commit log
      • If a node shuts down with un-flushed memtables (every shutdown, really)
      • Replay the commit logs
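
A small sketch of commit log replay, assuming the log is a list of JSON lines standing in for a file on disk. The `Node` class and its record layout are hypothetical, and deletes are simplified to removing the key rather than writing a tombstone.

```python
import json


class Node:
    def __init__(self, commit_log):
        self.commit_log = commit_log   # list of JSON lines, standing in for a file
        self.memtable = {}

    def write(self, key, value):
        # Append to the commit log first, then update the memtable.
        self.commit_log.append(json.dumps({"op": "set", "key": key, "value": value}))
        self.memtable[key] = value

    def delete(self, key):
        self.commit_log.append(json.dumps({"op": "del", "key": key}))
        self.memtable.pop(key, None)

    @classmethod
    def recover(cls, commit_log):
        # Rebuild the lost memtable by replaying the log in order.
        node = cls(commit_log)
        for line in commit_log:
            entry = json.loads(line)
            if entry["op"] == "set":
                node.memtable[entry["key"]] = entry["value"]
            else:
                node.memtable.pop(entry["key"], None)
        return node


log = []
node = Node(log)
node.write("a", "1")
node.write("b", "2")
node.delete("a")
del node  # simulate a shutdown with an un-flushed memtable
restarted = Node.recover(log)
```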
  • 7. Columns ordered inside key
      • Cassandra likes wide rows
          − Up to 2 billion columns
          − (but not really: that would be a 32GB row)
      • set mystuff['ecapriolo']['a']='1'
      • set mystuff['ecapriolo']['b']='2'
      • set mystuff['ecapriolo']['c']='3'
      • ...
      • slice mystuff['ecapriolo'] ['b'] ['g']
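
Because columns are kept sorted by name inside a row, a slice from 'b' to 'g' is a range lookup rather than a scan. A minimal sketch, assuming a row is a name-sorted list of (name, value) pairs; `slice_columns` is a hypothetical helper, not a Cassandra API.

```python
from bisect import bisect_left, bisect_right


def slice_columns(row, start, finish):
    """Return (name, value) pairs with start <= name <= finish from a name-sorted row."""
    names = [name for name, _ in row]
    lo = bisect_left(names, start)     # first column with name >= start
    hi = bisect_right(names, finish)   # one past the last column with name <= finish
    return row[lo:hi]


row = sorted({"a": "1", "b": "2", "c": "3", "h": "8"}.items())
result = slice_columns(row, "b", "g")
```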
  • 8. Rows ordered by hash of row key
      • All columns of row 'a1' are on the same node
      • But the columns of row 'a2' may be on a different node than 'a1'
      • Reduces hot spots
      • But there is no total ordering based on row keys
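
A sketch of why hashed row keys lose key ordering, assuming an MD5-based token; the `token` function is illustrative only and is not Cassandra's partitioner code.

```python
import hashlib


def token(row_key):
    """Map a row key to an integer token by hashing it (MD5 here, as an assumption)."""
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


keys = ["a1", "a2", "a3", "a4"]
# Rows are placed by token, so adjacent keys need not end up adjacent on the ring.
by_token = sorted(keys, key=token)
```

The same key always hashes to the same token, which is what keeps all columns of one row together, while `by_token` follows hash order rather than key order.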
  • 9. Peer to Peer
      • Node list and token ranges are gossiped
      • Each node is responsible for local storage and requests
      • When a new node joins it takes some token range away from other nodes
  • 10. [Diagram: rows 'Ed', 'stacey', and 'bob' each stored three times; Replication 3]
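
One way to picture replication 3 on a token ring, as a sketch under stated assumptions: a tiny token space of 100, four hypothetical nodes, and replicas chosen by walking clockwise from the key's token. The function names and the ring layout are made up for illustration.

```python
from bisect import bisect_left
import hashlib


def token(row_key, ring_size=100):
    # Hypothetical small token space so the ring is easy to read.
    return int(hashlib.md5(row_key.encode("utf-8")).hexdigest(), 16) % ring_size


def replicas(row_key, ring, replication_factor=3):
    """ring: sorted list of (token, node). Walk clockwise from the key's token."""
    tokens = [t for t, _ in ring]
    start = bisect_left(tokens, token(row_key)) % len(ring)  # wrap past the last token
    count = min(replication_factor, len(ring))
    return [ring[(start + i) % len(ring)][1] for i in range(count)]


ring = [(0, "node1"), (25, "node2"), (50, "node3"), (75, "node4")]
owners = replicas("Ed", ring)
```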
  • 11. Dynamo consistency
      • Operations have a requested Consistency Level (CL)
          − ONE
          − QUORUM
      • CL nodes ack the operation before the user receives ack
      • If an operation fails it is safe to retry *
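
The ack counting behind ONE and QUORUM can be sketched as below, assuming QUORUM means a majority of the replicas. Both function names are hypothetical.

```python
def required_acks(consistency_level, replication_factor):
    """Number of replica acks needed before the client is acknowledged."""
    if consistency_level == "ONE":
        return 1
    if consistency_level == "QUORUM":
        return replication_factor // 2 + 1   # strict majority of replicas
    raise ValueError(f"unsupported consistency level: {consistency_level}")


def operation_succeeds(consistency_level, replication_factor, acks_received):
    return acks_received >= required_acks(consistency_level, replication_factor)
```

With replication 3, QUORUM needs two acks, so one replica can be down without failing the operation.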
  • 12. Fully distributed. The good
      • Highly available
      • Redundant
      • Fault tolerant
  • 13. Fully distributed! The bad
      • Locks
      • Counters
      • Tombstones
      • Consistency
  • 15. Hadoop and Cassandra
      • ColumnFamilyInputFormat
          − Takes a ColumnFamily as input
          − Map(ByteBuffer[] key, SortedMap<ByteBuffer,Column>)
      • ColumnFamilyOutputFormat
          − Writes out to a column family
          − OutputFormat ByteBuffer,List<Mutation>
  • 16. Hadoop optimizations
      • Tasks run with locality if C* and Hadoop are on the same node
      • InputFormat can leverage C* secondary indexes
      • OutputFormat can use bulk loader
          − C* writes are helluva fast anyway
  • 17. Hive and Cassandra
      • Hive support similar to the HBase handler support
      • Create a Hive table specifying properties similar to those in MapReduce
      • hive> CREATE EXTERNAL TABLE Users(userid string, name string, email string, phone string) STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler' WITH
  • 18. Other support out there
      • github.com/edwardcapriolo/hive-cassandra-udfs
          − Delete UDF
          − Composite splitter/builder UDFs
      • Not very hard to roll your own input format
          − OneRowInputFormat
          − ListOfRowsInputFormat
  • 19. Pig Cassandra
      • Nice support for Pig/Cassandra
      • Pygmalion library
      • But I don't use it
          − Cause I use Hive
          − You should as well
          − And get my book :)
  • 20. Comparison between C* and “other noSQL”
      • I know you're talking about HBase :)
      • Cassandra does not store multiple versions of a column
          − Last update wins
          − Use UUID as part of column name instead
      • The row keys are not globally ordered *
          − Unless you are using ByteOrderPartitioner (no one should use this)
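
A sketch of "last update wins" and of the UUID workaround, assuming each stored version is a (value, timestamp) pair. The `reconcile` helper, the tie-break on equal timestamps, and the composite column name layout are all hypothetical.

```python
import uuid


def reconcile(a, b):
    """Each version is (value, timestamp); the later timestamp wins."""
    return a if a[1] >= b[1] else b   # tie-break here is arbitrary, an assumption


winner = reconcile(("old@example.com", 100), ("new@example.com", 200))

# To keep history instead, put a time-based UUID in the column name so each
# write lands in a new column rather than overwriting the previous one.
row = {}
for value in ["v1", "v2", "v3"]:
    column_name = ("email", uuid.uuid1())   # hypothetical composite column name
    row[column_name] = value
```

Three writes to the same logical field leave three columns in the row, so older versions stay readable with a slice.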
  • 21. Comparison between C* and “other noSQL”
      • Each C* replica actively serves reads & writes
      • Cassandra directly manages its storage
      • Shards are pre-defined tokens (no auto-split)
      • Qualifier/column name can NOT be null
  • 23. Know your data
      • Design for the long tail scenarios
          − With design x our largest customer will have 10000000000000 columns in one row
      • How large will this column family be in 5 months?
      • What is the request rate?
      • How random is the read pattern?
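
Back-of-the-envelope sizing for a wide row can be sketched as below. The 16 bytes per column is an assumption, picked because it makes 2 billion columns come out at the 32GB figure mentioned on slide 7; real per-column overhead depends on names, values, and storage format.

```python
def estimated_row_bytes(column_count, avg_bytes_per_column):
    """Crude row size estimate: number of columns times average bytes per column."""
    return column_count * avg_bytes_per_column


def to_gb(byte_count):
    return byte_count / 10**9   # decimal gigabytes


max_row = estimated_row_bytes(2_000_000_000, 16)        # assumed 16 bytes per column
long_tail = estimated_row_bytes(10_000_000_000_000, 16)  # the largest-customer case
```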
  • 24. Understanding write-once files
      • Deletes are writes that get compacted away later
      • Can you optimize from blind writes?
      • What percent of your application is update/insert?
  • 25. Profiling / Dark Launch
      • Compression
      • Compaction strategy
  • 26. Metrics
      • Collect the JMX information
          − Column family
          − Caches
      • Set milestone alerts (traps)
  • 27. Hardware
      • Fast disk (you almost always want SSD)
      • RAM
          − Caches, bloom filters, young gen
      • CPU
          − Garbage collector, deserialization + compaction needs CPU to work
  • 28. Anti patterns
      • Using one row key as a queue
      • Doing N reads to satisfy a request
      • Read before write
      • Using collection support in place of wide rows
      • Encoding