Apache Cassandra




                                   Vova Miguro
                               THE END
                                trnl.me@gmail.com




Thursday, September 22, 11
What is Cassandra?

                    •        key-value store with some structure

                    •        fault-tolerant

                    •        scalable

                    •        eventually consistent

                    •        tunable

                             -   consistency level

                             -   replication



Where did it come from?

                    •        created at Facebook

                             -   Dynamo: distribution architecture

                             -   BigTable: data model

                    •        open-sourced in 2008

                    •        Apache incubator in early 2009

                    •        graduation in March 2010




Who uses it?

                    •        Facebook (of course)

                    •        Rackspace

                    •        Twitter

                    •        Digg

                    •        Reddit

                    •        IBM

                    •        others...



What problems does it solve?

                    •        reliability at scale

                             -   no single point of failure (all nodes are
                                 identical)

                    •        simple scaling (linear)

                    •        high write throughput

                    •        large data sets




What problems can’t it solve?

                    •        no flexible indices (more on this later)

                    •        not good for big binary data (> 64 MB) unless
                             you chunk it

                    •        row contents must fit in available memory




Clustering: CAP

                    •        CAP Theorem

                             -   Consistency

                             -   Availability

                             -   Partition tolerance

                    •        choose two

                    •        Cassandra chooses A and P, but lets you tune
                             the trade-off to gain more C




Clustering: Replication & Consistency

                    •        replication factor

                             -   how many nodes data is replicated on

                    •        consistency level

                             -   zero (async write)

                             -   any

                             -   one

                             -   quorum (rf/2+1)

                             -   all

Clustering: Consistency Level

                             level     waits for                       applies to

                             zero      none (async write)              write
                             any       1st response                    write
                                       (including hinted handoff)
                             one       1st response                    read/write
                             quorum    rf/2 + 1                        read/write
                             all       all                             read/write
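
The quorum arithmetic above can be sketched in a few lines. This is a minimal illustration, not Cassandra code: the function names are hypothetical, and the overlap rule (read replicas + write replicas > rf) is the usual rule of thumb for getting stronger consistency, which the slides do not state explicitly.

```python
def quorum(rf: int) -> int:
    """Replicas needed at QUORUM: rf/2 + 1, using integer division."""
    return rf // 2 + 1


def replicas_required(level: str, rf: int) -> int:
    """Replicas that must respond for the replica-based levels in the table."""
    if level == "one":
        return 1
    if level == "quorum":
        return quorum(rf)
    if level == "all":
        return rf
    raise ValueError(f"unsupported level: {level}")


def read_overlaps_write(read_level: str, write_level: str, rf: int) -> bool:
    """True when every read set must share at least one replica with every write set."""
    return replicas_required(read_level, rf) + replicas_required(write_level, rf) > rf


if __name__ == "__main__":
    for rf in (1, 3, 5):
        print(rf, quorum(rf), read_overlaps_write("quorum", "quorum", rf))
```

With a replication factor of 3, quorum reads plus quorum writes overlap, while reading and writing at level one does not.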


Clustering: Ring

                •      every node gets a token

                    -        defines its place
                             in the ring

                    -        and which keys it
                             is responsible
                             for (ranges)
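
As a rough sketch of how a token defines a node's place and its key range: each node owns the range that ends at its own token, and a key belongs to the first node whose token is greater than or equal to the key's token, wrapping around at the end. The MD5-based `token_for` function, the ring size, and the class name are assumptions made for illustration only.

```python
import bisect
import hashlib

RING_SIZE = 2 ** 127  # assumed token space for this sketch


def token_for(key: str) -> int:
    """Hash a row key onto the ring (assumed hashing scheme)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % RING_SIZE


class Ring:
    def __init__(self, node_tokens):
        # node_tokens: {node name: token}; keep entries sorted by token.
        self._entries = sorted((tok, name) for name, tok in node_tokens.items())
        self._tokens = [tok for tok, _ in self._entries]

    def owner_of_token(self, token: int) -> str:
        # First node whose token is >= the given token; wrap to the first node.
        idx = bisect.bisect_left(self._tokens, token)
        return self._entries[idx % len(self._entries)][1]

    def owner(self, key: str) -> str:
        return self.owner_of_token(token_for(key))


if __name__ == "__main__":
    ring = Ring({"a": 0, "b": RING_SIZE // 3, "c": 2 * RING_SIZE // 3})
    print(ring.owner("jsmith"))
```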




Clustering: Ring

                •      new node

                    -        token assignment

                    -        ranges adjusted

                    -        bootstrap

                    -        only neighbor
                             nodes affected
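
To illustrate why only a neighbor is affected, the sketch below compares key ownership before and after a new token is inserted into the ring. The small integer tokens, node names, and helper functions are hypothetical; real bootstrap also streams the data, which is not modeled here.

```python
import bisect


def owner_of(entries, key_token):
    """entries: sorted list of (token, node). Returns the node owning key_token."""
    positions = [tok for tok, _ in entries]
    idx = bisect.bisect_left(positions, key_token)
    return entries[idx % len(entries)][1]


def moved_keys(old_entries, new_entry, key_tokens):
    """Map each key token whose owner changes to an (old owner, new owner) pair."""
    before = sorted(old_entries)
    after = sorted(old_entries + [new_entry])
    moved = {}
    for token in key_tokens:
        old_owner, new_owner = owner_of(before, token), owner_of(after, token)
        if old_owner != new_owner:
            moved[token] = (old_owner, new_owner)
    return moved


if __name__ == "__main__":
    old = [(0, "a"), (100, "b"), (200, "c")]
    print(moved_keys(old, (150, "d"), range(0, 300, 10)))
```

In this example only keys in the range that node "c" used to own move to the new node "d"; nodes "a" and "b" keep all of their keys.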




Clustering: Ring

                •      node dies or becomes
                       isolated

                •      hinted handoff
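
A toy model of hinted handoff, under simplifying assumptions: the coordinator keeps a hint for each replica that is down and replays it once the replica is reachable again. The `Node` class and both functions are hypothetical names for this sketch; timeouts, hint expiry, and failure detection are left out.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}   # key -> value stored on this node
        self.hints = []  # (target node, key, value) held for unreachable replicas


def write(coordinator, replicas, key, value):
    """Write to every live replica; keep a hint for each replica that is down."""
    acks = 0
    for replica in replicas:
        if replica.up:
            replica.data[key] = value
            acks += 1
        else:
            coordinator.hints.append((replica, key, value))
    return acks


def replay_hints(coordinator):
    """Deliver stored hints to targets that are back up; keep the rest."""
    remaining = []
    for target, key, value in coordinator.hints:
        if target.up:
            target.data[key] = value
        else:
            remaining.append((target, key, value))
    coordinator.hints = remaining


if __name__ == "__main__":
    a, b, c = Node("a"), Node("b"), Node("c")
    c.up = False
    print(write(a, [a, b, c], "jsmith", "ch@ngem3a"), len(a.hints))
    c.up = True
    replay_hints(a)
    print(c.data)
```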




Data Model

                    •        keyspace

                             •   column family

                                 •   row (indexed)

                                     •   key

                                     •   columns

                                         •   name (sorted)

                                         •   value
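
One way to picture this hierarchy is as nested maps, sketched below with plain Python dictionaries. The column values are taken from the CQL example later in the deck; the dictionary layout and the `get_row` helper are illustrative assumptions, not how Cassandra stores data.

```python
# keyspace -> column family -> row key -> {column name: column value}
keyspace = {
    "users": {
        "user1": {
            "state": "TX",
            "gender": "f",
            "birth_year": 1968,
            "password": "ch@ngem3",
        },
    },
}


def get_row(keyspace, column_family, key):
    """Return a row's columns as (name, value) pairs, sorted by column name."""
    row = keyspace[column_family][key]
    return sorted(row.items())


if __name__ == "__main__":
    for name, value in get_row(keyspace, "users", "user1"):
        print(name, value)
```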



Data Model: ColumnFamily

                Column families




Data Model: SuperColumnFamily

                Supercolumn families




Easier to start from the bottom up




Data Model: Column




Data Model: Row




Data Model: Column comparators

                    •        TimeUUID

                    •        LexicalUUID

                    •        UTF8

                    •        Long

                    •        Bytes

                    •        ...
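
The comparator decides how column names are ordered within a row. The sketch below imitates three of the comparators above on raw byte names to show that the same names sort differently; the `COMPARATORS` table and helper functions are hypothetical, and the UUID comparators are omitted.

```python
import struct

# Each entry turns a raw column name (bytes) into a sort key.
COMPARATORS = {
    "Bytes": lambda raw: raw,                        # plain byte order
    "UTF8": lambda raw: raw.decode("utf-8"),         # order of the decoded string
    "Long": lambda raw: struct.unpack(">q", raw)[0], # signed 64-bit integer order
}


def sort_columns(names, comparator):
    """Order raw column names the way the chosen comparator would."""
    return sorted(names, key=COMPARATORS[comparator])


def pack_long(value: int) -> bytes:
    """Encode an integer as an 8-byte big-endian column name."""
    return struct.pack(">q", value)


if __name__ == "__main__":
    names = [pack_long(1), pack_long(-1), pack_long(10)]
    print([struct.unpack(">q", n)[0] for n in sort_columns(names, "Long")])
    print([struct.unpack(">q", n)[0] for n in sort_columns(names, "Bytes")])
```

Under the Long ordering the negative name sorts first; under byte ordering it sorts last, because its leading byte is 0xff.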




Data Model: ColumnFamily




Writing
                    •        simple: put(key,col,value)

                    •        complex: put(key,[col,value,...col,value])

                    •        batch: multi key
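
The three write shapes can be sketched against an in-memory stand-in for a column family. The class and method names below are hypothetical (the slide's `put` is split into `put` and `put_many` for clarity), and nothing here talks to a real cluster.

```python
from collections import defaultdict


class ColumnFamily:
    """In-memory stand-in: row key -> {column name: value}."""

    def __init__(self):
        self.rows = defaultdict(dict)

    def put(self, key, column, value):
        # simple: one column in one row
        self.rows[key][column] = value

    def put_many(self, key, columns):
        # complex: several columns in one row
        self.rows[key].update(columns)

    def batch(self, mutations):
        # batch: several rows at once, given as {key: {column: value}}
        for key, columns in mutations.items():
            self.put_many(key, columns)


if __name__ == "__main__":
    users = ColumnFamily()
    users.put("jsmith", "password", "ch@ngem3a")
    users.put_many("jsmith", {"gender": "m", "state": "TX"})
    users.batch({"user1": {"state": "TX"}, "user2": {"state": "CA"}})
    print(dict(users.rows))
```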




Writing

                Writes




Reading
                    •        get(): retrieve column by name

                    •        multiget(): by column name for a number of keys

                    •        get_slice(): by column name or a range of names

                             -   returning columns

                             -   returning supercolumns

                    •        multiget_slice(): a subset of columns for a set of keys

                    •        get_count(): number of columns or subcolumns

                    •        get_range_slice(): subset of columns for a range of keys
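
The read calls listed above can be imitated over a plain dictionary to show what each one returns. The function names follow the slide, but the signatures are simplified assumptions, supercolumns are ignored, and the key range is ordered by plain key comparison, which a real cluster only does with an order-preserving partitioner.

```python
def get(rows, key, column):
    """Retrieve one column by name."""
    return rows[key][column]


def multiget(rows, keys, column):
    """One column by name for a number of keys."""
    return {k: rows[k][column] for k in keys if column in rows.get(k, {})}


def get_slice(rows, key, start="", finish=""):
    """Columns of one row whose names fall in [start, finish]; empty means open."""
    cols = rows.get(key, {})
    return [(n, cols[n]) for n in sorted(cols)
            if n >= start and (finish == "" or n <= finish)]


def multiget_slice(rows, keys, start="", finish=""):
    """The same column slice for a set of keys."""
    return {k: get_slice(rows, k, start, finish) for k in keys}


def get_count(rows, key):
    """Number of columns in a row."""
    return len(rows.get(key, {}))


def get_range_slice(rows, start_key, end_key, start="", finish=""):
    """A column slice for every key in [start_key, end_key]."""
    return {k: get_slice(rows, k, start, finish)
            for k in sorted(rows) if start_key <= k <= end_key}


if __name__ == "__main__":
    rows = {"a": {"x": 1, "y": 2, "z": 3}, "b": {"x": 10}, "c": {"y": 20}}
    print(get_slice(rows, "a", "x", "y"))
    print(get_range_slice(rows, "a", "b", "x", "x"))
```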




Reading

                Reads




Clients
                  •   Python:
                        •Pycassa: http://github.com/pycassa/pycassa
                        •Telephus: http://github.com/driftx/Telephus (Twisted)
                  •   Java:
                        •Hector: http://github.com/rantav/hector
                        •Kundera: http://github.com/impetus-opensource/Kundera
                        •Pelops: http://github.com/s7/scale7-pelops
                        •Cassandrelle (Demoiselle Cassandra): http://demoiselle.sf.net/component/demoiselle-cassandra/
                  •   .NET:
                        •Aquiles: http://aquiles.codeplex.com/
                  •   Ruby:
                        •Cassandra: http://github.com/fauna/cassandra
                  •   PHP:
                        •PHP Client Library: https://github.com/kallaspriit/Cassandra-PHP-Client-Library
                        •phpcassa: http://github.com/thobbs/phpcassa



CQL (from 0.8)
                    •        USE

                    •        SELECT

                    •        INSERT/UPDATE

                    •        DELETE

                    •        TRUNCATE/DROP

                    •        BATCH



                    •        CREATE KEYSPACE

                    •        CREATE COLUMNFAMILY

                    •        CREATE INDEX




CQL: Example
                CREATE COLUMNFAMILY users (
                  KEY varchar PRIMARY KEY,
                  password varchar,
                  gender varchar,
                  session_token varchar,
                  state varchar,
                  birth_year bigint);

                INSERT INTO users (KEY, password) VALUES ('jsmith', 'ch@ngem3a');

                SELECT * FROM users WHERE KEY='jsmith';
                u'jsmith' | u'password',u'ch@ngem3a'

                DROP COLUMNFAMILY users;




CQL: Example
                CREATE INDEX birth_year_key ON users (birth_year);
                CREATE INDEX state_key ON users (state);

                SELECT * FROM users
                  WHERE gender='f' AND
                  state='TX' AND
                  birth_year='1968';
                u'user1' | u'birth_year',1968 | u'gender',u'f' |
                u'password',u'ch@ngem3' | u'state',u'TX'

                DROP COLUMNFAMILY users;




Indexing

                    •        secondary indexes

                             -   hashed

                             -   equality predicates (where column x = y)

                             -   specified on creation or later

                             -   work best when many rows share similar columns

                    •        self-managed indexes
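
A hashed secondary index can be pictured as a map from each indexed value to the set of row keys holding it, which is why it answers equality predicates only. The sketch below is a simplified model with hypothetical function names: one indexed column narrows the candidates, and any further equality conditions are checked row by row. Rows other than user1 are made up for the example.

```python
from collections import defaultdict


def build_index(rows, column):
    """Map each value of the indexed column to the set of row keys that hold it."""
    index = defaultdict(set)
    for key, columns in rows.items():
        if column in columns:
            index[columns[column]].add(key)
    return index


def select_where(rows, index, indexed_value, **other_equalities):
    """Look up candidates through the index, then filter on the other columns."""
    candidates = index.get(indexed_value, set())
    return sorted(
        key for key in candidates
        if all(rows[key].get(col) == val for col, val in other_equalities.items())
    )


if __name__ == "__main__":
    rows = {
        "user1": {"gender": "f", "state": "TX", "birth_year": 1968},
        "user2": {"gender": "m", "state": "TX", "birth_year": 1970},  # hypothetical
        "user3": {"gender": "f", "state": "CA", "birth_year": 1968},  # hypothetical
    }
    state_index = build_index(rows, "state")
    print(select_where(rows, state_index, "TX", gender="f", birth_year=1968))
```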




Indexing: Self-managed: one-to-one



                             index   |  indexed value #1  |  indexed value #2
                             name    |  related key       |  related key




Indexing: Self-managed: one-to-several



                             index   |       indexed value #1        |       indexed value #2
                             name    |  related key  |  related key  |  related key  |  related key




Indexing: Self-managed: one-to-many


                             indexed value #1  |  related key  |  related key
                                               |       -       |       -

                             indexed value #2  |  related key  |  related key
                                               |       -       |       -



Indexing: Self-managed: one-to-many


                             indexed value #1  |  ordering value  |  ordering value
                                               |  related key     |  related key

                             indexed value #2  |  ordering value  |  ordering value
                                               |  related key     |  related key
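
The ordered one-to-many layout can be sketched as one index row per indexed value, whose columns are kept sorted by an ordering value and point at the related keys. The class below is a hypothetical in-memory model; in Cassandra the sorting would come from the column comparator rather than from `bisect`.

```python
import bisect


class OrderedIndex:
    """Index row per indexed value; columns are (ordering value, related key)."""

    def __init__(self):
        self.rows = {}

    def add(self, indexed_value, ordering_value, related_key):
        # Keep each row's columns sorted by ordering value, like sorted column names.
        columns = self.rows.setdefault(indexed_value, [])
        bisect.insort(columns, (ordering_value, related_key))

    def lookup(self, indexed_value, limit=None):
        """Related keys for one indexed value, in ordering-value order."""
        columns = self.rows.get(indexed_value, [])
        return [related_key for _, related_key in columns[:limit]]


if __name__ == "__main__":
    by_state = OrderedIndex()
    by_state.add("TX", 1970, "user2")  # hypothetical rows, ordered by birth year
    by_state.add("TX", 1968, "user1")
    print(by_state.lookup("TX"))
```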



Let’s practice: Twitter
                  •   Get a user record by username
                  •   Get the friends of a username
                  •   Get the followers of a username
                  •   Get a timeline for a user
                  •   Get a timeline of a specific user’s tweets
                  •   Get a tweet from a tweet ID
                  •   Create a tweet
                  •   Create a user
                  •   Add friends to a user
                  •   Remove friends from a user
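
One possible starting point for this exercise, sketched with in-memory dictionaries standing in for column families. The layout (one row per user for friends, followers, timeline, and userline, with tweet IDs stored under an increasing sequence number) is an assumption chosen to make each query a single row lookup; all names are hypothetical and this is not the deck's own solution.

```python
import itertools
from collections import defaultdict

users = {}                     # username -> user record
friends = defaultdict(dict)    # username -> {friend username: sequence}
followers = defaultdict(dict)  # username -> {follower username: sequence}
tweets = {}                    # tweet id -> {"username": ..., "body": ...}
timeline = defaultdict(dict)   # username -> {sequence: tweet id}, own + friends' tweets
userline = defaultdict(dict)   # username -> {sequence: tweet id}, own tweets only
_clock = itertools.count(1)    # stand-in for a timestamp or TimeUUID


def create_user(username, password):
    users[username] = {"password": password}


def add_friend(username, friend):
    seq = next(_clock)
    friends[username][friend] = seq
    followers[friend][username] = seq


def remove_friend(username, friend):
    friends[username].pop(friend, None)
    followers[friend].pop(username, None)


def create_tweet(username, body):
    seq = next(_clock)
    tweet_id = f"tweet-{seq}"
    tweets[tweet_id] = {"username": username, "body": body}
    userline[username][seq] = tweet_id
    timeline[username][seq] = tweet_id
    # Fan out on write: copy the tweet id into every follower's timeline row.
    for follower in followers[username]:
        timeline[follower][seq] = tweet_id
    return tweet_id


def _latest(line, limit):
    return [tweets[line[seq]] for seq in sorted(line, reverse=True)[:limit]]


def get_timeline(username, limit=10):
    return _latest(timeline[username], limit)


def get_userline(username, limit=10):
    return _latest(userline[username], limit)


if __name__ == "__main__":
    create_user("alice", "secret")
    create_user("bob", "secret")
    add_friend("bob", "alice")
    create_tweet("alice", "hello")
    print(get_timeline("bob"))
```

The design choice here is to pay for the fan-out at write time, which fits the high write throughput noted earlier and keeps every read a single-row slice.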



Facebook messaging




?