Using Cassandra in your Web Applications Tom Melendez, Yahoo!
Why do we need another DB? We really like MySQL Everyone knows MySQL and if they don’t, they definitely know SQL, Codd, Normalization etc. Lots of tools are based on SQL backends: 3rd party, home grown
Should I consider NoSQL? Well, maybe There’s a gazillion NoSQL solutions out there If you’re already using Memcached on top of your db, then you should look closely at NoSQL, as you’ve already identified an issue with your current infrastructure.
Cassandra: Overview Eventually consistent Highly Available Really fast reads, Really fast writes Flexible schemas Distributed No “Master” - No Single Point of Failure BigTable plus Dynamo written in Java
A little context SQL Joins can be expensive Sharding can be a PITA Master is a point of failure (that can be mitigated, but we all know it’s painful) The data really might not be that important RIGHT NOW. Oh yeah, someone got tired of lousy response times
A little history Released by Facebook as Open Source Hosted at Google Code for a bit Now an Apache Project Based on: Amazon’s Dynamo All nodes are Equal (no master) partitioning/replication Google’s Big Table Column Families
Sounds great, right? When do I throw away our SQL DB? When do I get my promotion? When do I go on vacation? Not So Fast.
What you talkin’ about, Willis?
You WILL see this slide again You will need to rewrite code and probably re-arch the application You will need to run in parallel for testing You will need training for your Dev and Ops You will need to develop new tools and processes Cassandra isn’t the only NoSQL option You’ll (likely) still need/want SQL somewhere in your infrastructure
CAP Theorem Consistency – how consistent is the data across nodes? Availability – how available is the system? Partition Tolerance – will the system function if we lose a piece of it? CAP Theorem basically says you get to pick 2 of the above. (Anyone else reminded of: “Good, Fast and Cheap, pick two”?)
CAP and Cassandra The tradeoffs among C, A and P are tunable by the client on a per-transaction basis For example, when adding a user record, you could insist that this transaction is CONSISTENCY.ALL if you wanted. To really get the benefit of Cassandra, you need to look at what data DOES NOT need CONSISTENCY.ALL
Consistency Levels: Writes
Consistency Levels: Reads
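One piece of arithmetic sits behind the read and write levels: with N replicas, a read that waits for R replicas must overlap a write acknowledged by W replicas whenever R + W > N. A minimal sketch of that check follows; the function names are made up for illustration and are not part of any Cassandra client.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that form a majority (what QUORUM waits for)."""
    return replication_factor // 2 + 1


def read_sees_latest_write(n: int, w: int, r: int) -> bool:
    """True when every read set must overlap every write set (R + W > N)."""
    return r + w > n


N = 3  # replication factor, chosen only for this example
print(quorum(N))                                            # 2
print(read_sees_latest_write(N, w=quorum(N), r=quorum(N)))  # True
print(read_sees_latest_write(N, w=1, r=1))                  # False
print(read_sees_latest_write(N, w=N, r=1))                  # True
```

Writing and reading at ONE is the fast, eventually consistent corner; QUORUM on both sides or ALL on one side buys the overlap at the cost of latency and availability.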
Running Cassandra Does it fit in your infrastructure? Clustering/Partitioning Replication/Snitching Monitoring Tuning Tools/Utilities A couple exist, but you’ll likely need to build your own or at least augment what’s available
Clustering The ring Each node has a unique token (dependent on the Partitioner used) Each node is responsible for the range between the previous node’s token and its own The token determines on which node rows are stored
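The ring lookup can be sketched in a few lines. This is a toy model, assuming a hypothetical four-node ring with small integer tokens; real tokens and node names depend on the partitioner and on your cluster.

```python
import bisect

# Hypothetical ring: token -> node name (values invented for the example).
RING = {0: "node-a", 25: "node-b", 50: "node-c", 75: "node-d"}
TOKENS = sorted(RING)


def owner(key_token: int) -> str:
    """Return the node whose range (previous token, own token] holds key_token."""
    i = bisect.bisect_left(TOKENS, key_token)
    # Past the highest token we wrap around to the first node on the ring.
    return RING[TOKENS[i % len(TOKENS)]]


print(owner(25))  # node-b
print(owner(26))  # node-c
print(owner(80))  # node-a
```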
Partitioning How data is stored on the cluster Random Order Preserving You can implement your own Custom Partitioning
Partitioning: Types Random Default Good distribution of data across cluster Example usage: logging application Order Preserving Good for range queries OPP has seen some issues on the mailing list lately Custom implement IPartitioner to create your own
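To make the difference concrete, here is a rough sketch of how the two built-in styles turn a row key into a ring position. It assumes an MD5-based hash for the random case and uses the key itself for the order-preserving case; it is a simplification, not the exact token computation Cassandra performs, and the sample keys are arbitrary.

```python
import hashlib

KEYS = ["albert", "b", "jason", "jen1982", "zack"]


def random_token(key: str) -> int:
    """Hash the key so that placement is spread out and unrelated to key order."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


def order_preserving_token(key: str) -> str:
    """Use the key itself, so neighbouring keys land next to each other."""
    return key


by_random = sorted(KEYS, key=random_token)
by_order = sorted(KEYS, key=order_preserving_token)

print(by_random)  # ring order is scrambled relative to the keys
print(by_order)   # ring order equals key order, so key-range scans are contiguous
```

The scrambled order is what gives the random style its even distribution, and it is also why a range of keys is not a range on the ring.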
Operations: Replication The first replica goes to whatever node claims that range The rest (the copies you fall back on should that node fail) are determined by replication strategies You can tell Cassandra whether the nodes are in a rack via IReplicaPlacementStrategy: RackUnawareStrategy RackAwareStrategy You can create your own Replication factor – how many copies of the data do we want These options go in conf/storage-conf.xml
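A minimal sketch of rack-unaware placement, assuming the simple rule that the remaining replicas go to the next nodes clockwise on the ring. The ring and its tokens are hypothetical, and rack-aware placement is not modelled here.

```python
import bisect

# Hypothetical ring: token -> node name (values invented for the example).
RING = {0: "node-a", 25: "node-b", 50: "node-c", 75: "node-d"}
TOKENS = sorted(RING)


def replicas(key_token: int, replication_factor: int) -> list:
    """First replica is the range owner; the rest follow it around the ring."""
    first = bisect.bisect_left(TOKENS, key_token) % len(TOKENS)
    count = min(replication_factor, len(TOKENS))  # cannot exceed the node count
    return [RING[TOKENS[(first + i) % len(TOKENS)]] for i in range(count)]


print(replicas(60, 3))  # ['node-d', 'node-a', 'node-b']
print(replicas(10, 2))  # ['node-b', 'node-c']
```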
Operations: Snitching Telling Cassandra the physical location of nodes EndPoint – figure out based on IP address PropertySnitch – individual IPs to datacenters/racks DatacenterEndpointSnitch – give it subnets and datacenters
Operations - Monitoring IMO, it is critical that you get this working immediately (i.e. as soon as you have something running) Basically requires being able to run JMX queries and ideally store this data over time. Advice: watch the mailing list.  I’m betting a HOWTO will pop up soon, as we all have the same problem.
Operations - Tuning You’ve set up monitoring, right? As you add ColumnFamilies, tuning might change Things you tune: Memtables (in mem structure: like a write-back cache) Heap Sizing: don’t ramp up the heap without testing first key cache: probably want to raise this for reads row cache
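To see why memtable sizing is a tuning knob, here is a toy write path: writes are buffered in memory and flushed to an immutable sorted run once a threshold is hit. The class, its `memtable_limit` parameter and the in-memory "runs" are all invented for the example; the commit log, compaction and the caches are left out.

```python
class TinyStore:
    """Toy write path: buffer writes in a memtable, flush to immutable sorted runs."""

    def __init__(self, memtable_limit: int = 3):
        self.memtable_limit = memtable_limit  # hypothetical tuning knob
        self.memtable = {}
        self.flushed = []  # oldest first; each run is a sorted list of (key, value)

    def write(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        if self.memtable:
            self.flushed.append(sorted(self.memtable.items()))
            self.memtable = {}

    def read(self, key):
        if key in self.memtable:  # freshest data lives in memory
            return self.memtable[key]
        for run in reversed(self.flushed):  # then newest flushed run first
            for k, v in run:
                if k == key:
                    return v
        return None


store = TinyStore(memtable_limit=2)
store.write("a", 1)
store.write("b", 2)  # reaches the limit and triggers a flush
store.write("a", 3)  # newer value stays in the memtable
print(store.read("a"), store.read("b"), len(store.flushed))  # 3 2 1
```

A larger limit means fewer flushes and fewer runs to search on reads, but more heap in use, which is why the slide pairs memtables with heap sizing.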
Utilities: NodeTool Really important.  Helps you manage your cluster.  Find under the bin/ dir in the download get some disk storage stats heap memory usage data snapshot decommission a node move a node
Utilities: cassandra-cli This is NOT the equivalent of the mysql executable and its mysql> prompt (although it does provide a prompt) You can do basic get/set operations and some other stuff It is really meant to check and see if things are working Maybe one day it will grow into something more
Utilities: cassandra-cli Example:
cassandra> set Keyspace1.Standard1['user']['tom'] = 'cool'
Value inserted.
cassandra> count Keyspace1.Standard1['user']
1 columns
cassandra> get Keyspace1.Standard1['user']['tom']
=> (column=746f6d, value=cool, timestamp=1286875497246000)
cassandra> show api version
2.2.0
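The `get` output in this session is easier to read once decoded: the column name is printed as hex bytes, and the timestamp looks like microseconds since the Unix epoch (an assumption based on its magnitude). A short sketch of the decoding:

```python
from datetime import datetime, timezone

column_hex = "746f6d"            # as printed by the cli
timestamp_us = 1286875497246000  # assumed to be microseconds since the epoch

column_name = bytes.fromhex(column_hex).decode("ascii")
written_at = datetime.fromtimestamp(timestamp_us // 1_000_000, tz=timezone.utc)

print(column_name)        # tom
print(written_at.date())  # 2010-10-12
```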
Other Utilities stress.py – helps you test the performance of your cluster. run periodically against your cluster(s) be prepared with these results when asking for perf help on the mailing list binary-memtable – a bulk loader that avoids some of the Thrift overhead.  Use with caution.
Data Model Simply put, it is similar to a multi-dimensional array The general strategy is denormalized data, sacrificing disk space for speed/efficiency Think about your queries (your DBAs will like this, but won’t like the way it is done!) You’ll end up getting very creative You need to know your queries in advance, they ultimately define your schema.
Data Model Again, keep in mind that you’re (probably) after denormalizing. I know it’s painful.  Terms you’ll see: Keyspaces Column Families SuperColumns Indexes Queries
Data Model Column Family Think of it as a DB table Column Key-Value Pair (NOT just a value, like a DB column) they also have a timestamp SuperColumn Columns inside a column So, you have a key, and its values are columns no timestamp Keyspace – like a namespace, generally 1 per app
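The "multi-dimensional array" picture can be written down as nested dictionaries. This is only a mental model, assuming made-up column family names (`Users`, `UsersByLocation`) and made-up timestamps; it says nothing about how Cassandra stores data on disk.

```python
def column(name, value, timestamp):
    """A column is a name/value pair that also carries a timestamp."""
    return {"name": name, "value": value, "timestamp": timestamp}


# keyspace -> column family -> row key -> column name -> column
keyspace = {
    "Users": {  # standard column family
        "zack": {
            "name": column("name", "Zack", 1),
            "city": column("city", "Seattle", 1),
        },
    },
    "UsersByLocation": {  # super column family: one extra level of nesting
        "WA": {
            "Seattle": {  # super column: holds columns, has no timestamp itself
                "Zack": column("Zack", "zack", 1),
            },
        },
    },
}

print(keyspace["Users"]["zack"]["city"]["value"])                     # Seattle
print(keyspace["UsersByLocation"]["WA"]["Seattle"]["Zack"]["value"])  # zack
```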
Data Model Indexes and Queries Here is where you get creative Regardless of the partitioner, rows are always stored sorted by key Column sorting:  CompareWith  and CompareSubcolumnsWith
Data Model: Indexes and Queries Your bag of tricks includes: creating column families for each query getting the row key to be the WHERE of your SQL query using column and SuperColumn names as “values” columns are stored sorted within the row
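Because columns are sorted within a row, putting data in the column name makes range reads cheap. A small sketch of that trick, assuming a hypothetical row whose column names are dates; the sort happens at read time here only because a Python dict is not stored sorted.

```python
import bisect


def get_slice(row: dict, start: str, finish: str) -> list:
    """Return (name, value) pairs with start <= name <= finish, in name order."""
    names = sorted(row)
    lo = bisect.bisect_left(names, start)
    hi = bisect.bisect_right(names, finish)
    return [(n, row[n]) for n in names[lo:hi]]


# Hypothetical row: one column per login, named by date, so the name is the data.
logins = {
    "2010-10-03": "web",
    "2010-10-01": "web",
    "2010-10-12": "mobile",
    "2010-11-02": "web",
}

print(get_slice(logins, "2010-10-01", "2010-10-31"))
# [('2010-10-01', 'web'), ('2010-10-03', 'web'), ('2010-10-12', 'mobile')]
```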
Data Model: Example Example data set:
"b": {"name": "Ben", "street": "1234 Oak St.", "city": "Seattle", "state": "WA"}
"jason": {"name": "Jason", "street": "456 First Ave.", "city": "Bellingham", "state": "WA"}
"zack": {"name": "Zack", "street": "4321 Pine St.", "city": "Seattle", "state": "WA"}
"jen1982": {"name": "Jennifer", "street": "1120 Foo Lane", "city": "San Francisco", "state": "CA"}
"albert": {"name": "Albert", "street": "2364 South St.", "city": "Boston", "state": "MA"}
(Taken from Benjamin Black’s presentation on indexing – twitter: @b6n)
Data Model: Example Given that data set, we want to say: SELECT name FROM Users WHERE state="WA" We create a ColumnFamily:
<ColumnFamily Name="LocationUserIndexSCF" CompareWith="UTF8Type" CompareSubcolumnsWith="UTF8Type" ColumnType="Super" />
(Taken from Benjamin Black’s presentation on indexing – twitter: @b6n)
Data Model: Example Which looks like this:
[state]: {
    [city1]: {[name1]:[user1], [name2]:[user2], ... },
    [city2]: {[name3]:[user3], [name4]:[user4], ... },
    ...
    [cityX]: {[name5]:[user5], [name6]:[user6], ... }
}
State is the row key, so we can select by it and we’ll get the city grouping and name sorting basically for free. (Taken from Benjamin Black’s presentation on indexing – twitter: @b6n)
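The index layout can be tried out with plain dictionaries. The sketch below builds the state, city, name structure from the example data set and answers the WHERE state="WA" query; the helper names are invented, and the explicit `sorted` calls stand in for the ordering Cassandra would give via CompareWith and CompareSubcolumnsWith.

```python
USERS = {
    "b": {"name": "Ben", "street": "1234 Oak St.", "city": "Seattle", "state": "WA"},
    "jason": {"name": "Jason", "street": "456 First Ave.", "city": "Bellingham", "state": "WA"},
    "zack": {"name": "Zack", "street": "4321 Pine St.", "city": "Seattle", "state": "WA"},
    "jen1982": {"name": "Jennifer", "street": "1120 Foo Lane", "city": "San Francisco", "state": "CA"},
    "albert": {"name": "Albert", "street": "2364 South St.", "city": "Boston", "state": "MA"},
}


def build_location_index(users: dict) -> dict:
    """state (row key) -> city (super column) -> name (column) -> user key (value)."""
    index = {}
    for user_key, user in users.items():
        cities = index.setdefault(user["state"], {})
        cities.setdefault(user["city"], {})[user["name"]] = user_key
    return index


def names_in_state(index: dict, state: str) -> list:
    """Equivalent of: SELECT name FROM Users WHERE state = <state>."""
    names = []
    for city in sorted(index.get(state, {})):
        names.extend(sorted(index[state][city]))
    return names


index = build_location_index(USERS)
print(names_in_state(index, "WA"))  # ['Jason', 'Ben', 'Zack']
```

The result is grouped by city (Bellingham before Seattle) and sorted by name within each city, which is the "for free" ordering the slide refers to. The price is that every user write must also update this second structure.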
Talking to Cassandra Generally two ways to do this: Native clients (ideal) Thrift Avro support is coming All of the PHP clients are still very Alpha All of the PHP clients I’ve seen use Thrift If you can, please use them and file bugs. Or even better than that – FIX IT YOURSELF! If you need something more stable, use Thrift directly
PHP Clients Pandra (LGPL)  PHP Cassa – pycassa port  Simple Cassie (New BSD License)  Prophet (PHP License) Clients in other languages are further along Thanks to Chris Barber (@cb1inc) for this list
Raw Cassandra API These are wrapped differently per client but generally exposed by Thrift.  These are just the major data-manipulation methods; there are others to gather information, etc. Full list is here: http://wiki.apache.org/cassandra/API
Raw Cassandra API get get_count get_key_range get_range_slices get_slice multiget_slice insert batch_mutate remove truncate
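To get a feel for a few of these calls without a cluster, here is an in-memory stand-in. The method names are borrowed from the list above, but the signatures are simplified for illustration and are not the Thrift signatures; the class name is invented. It models one real behaviour worth knowing: a write with an older timestamp loses to a newer one.

```python
class FakeColumnFamily:
    """In-memory stand-in for one column family (illustration only)."""

    def __init__(self):
        self._rows = {}  # row key -> {column name: (value, timestamp)}

    def insert(self, key, name, value, timestamp):
        row = self._rows.setdefault(key, {})
        current = row.get(name)
        # Newest timestamp wins; ties are resolved arbitrarily in this sketch.
        if current is None or timestamp >= current[1]:
            row[name] = (value, timestamp)

    def get(self, key, name):
        return self._rows[key][name][0]

    def get_count(self, key):
        return len(self._rows.get(key, {}))

    def get_slice(self, key, start, finish):
        row = self._rows.get(key, {})
        return [(n, row[n][0]) for n in sorted(row) if start <= n <= finish]

    def remove(self, key, name, timestamp):
        # Real deletes are more involved; this just drops the column.
        row = self._rows.get(key, {})
        if name in row and timestamp >= row[name][1]:
            del row[name]


cf = FakeColumnFamily()
cf.insert("user", "tom", "cool", timestamp=2)
cf.insert("user", "tom", "stale", timestamp=1)  # older write is ignored
cf.insert("user", "ann", "busy", timestamp=1)
print(cf.get("user", "tom"))             # cool
print(cf.get_count("user"))              # 2
print(cf.get_slice("user", "a", "m"))    # [('ann', 'busy')]
```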
What is Thrift? Thrift is a remote procedure call framework developed at Facebook for “scalable cross-language services development” – Wikipedia In short, you define a .thrift file (IDL file) with data structures, services, etc., run the “thrift compiler”, and get code, which you then use. PHP, Java, Perl, Python, C#, Erlang, Ruby (and probably others) are supported thrift --gen php myproject.thrift is what you run (older releases also accepted thrift -php) Generated files are in a dir called: gen-php Then go in and add your logic
Example IDL file Heavily snipped from: http://wiki.apache.org/thrift/Tutorial
# Thrift Tutorial (heavily snipped)
# Mark Slee (mcslee@facebook.com)
# C and C++ comments also supported
include "shared.thrift"
namespace php tutorial
service Calculator extends shared.SharedService {
   void ping(),
   i32 add(1:i32 num1, 2:i32 num2),
   i32 calculate(1:i32 logid, 2:Work w) throws (1:InvalidOperation ouch),
   oneway void zip(),
}
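"Adding your logic" to generated code usually means writing a handler with one method per function in the service. A rough Python-shaped sketch for the ping and add functions of this IDL follows; it deliberately imports nothing from Thrift, the class name is hypothetical, and the wiring to the generated processor, transport and protocol is left out.

```python
class CalculatorHandler:
    """Hypothetical handler: one method per function declared in the service."""

    def __init__(self):
        self.pings = 0  # bookkeeping for the example only

    def ping(self) -> None:
        # void ping(): nothing to return, just record that we were called.
        self.pings += 1

    def add(self, num1: int, num2: int) -> int:
        # i32 add(1:i32 num1, 2:i32 num2)
        return num1 + num2


handler = CalculatorHandler()
handler.ping()
print(handler.add(2, 3))  # 5
print(handler.pings)      # 1
```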
Installing Thrift and the PHP ext Download and install Thrift: http://incubator.apache.org/thrift/download/ To use PHP, you install the PHP extension “thrift_protocol” You’ll find this in the Thrift download above Steps:
   cd PATH-TO-THRIFT/lib/php/src/ext/thrift_protocol
   phpize && ./configure --enable-thrift_protocol && make
   sudo cp modules/thrift_protocol.so /php/ext/dir
   add extension=thrift_protocol.so to the appropriate php.ini file
You really need APC, too (http://www.php.net/apc)
PHP Thrift Example http://wiki.apache.org/cassandra/ThriftExamples#PHP
So, who’s using this thing? Big and small companies alike Not sure if their applications of Cassandra are mission-critical Yahoo! is NOT a user, but we have our own implementation, and that implementation IS mission-critical.  Do a search for “PNUTS”
Facebook – Inbox search
Twitter – Heavy users, but not for tweets.  Yet.
Digg – Probably the biggest consumer-facing users of Cassandra
Digg - continued These guys have provided a lot Patches Documentation/Blogs/Advocacy LazyBoy Python client: http://github.com/digg/lazyboy#readme
Rackspace – Not totally sure, probably logging the massive amounts of data they generate from routers, switches and other hardware http://www.rackspacecloud.com/blog/2010/06/07/speaking-session-on-cassandra-at-velocity-2010/
Others using Cassandra Comcast, Cisco, CBS Interactive http://www.dbthink.com/?p=183
Competitors, sort of CouchDB – document db, accessible via javascript and REST HBase – no SPOF, Column Families, runs on top of Hadoop Memcached – used with MySQL, FB are big users MongoDB – cool online shell; k/v store, document db Redis – see Cassandra vs. Redis presentation by @tlossen from NoSQL Frankfurt 9/28/2010 Voldemort – distributed db, built by LinkedIn
Cassandra and Hadoop and Pig/Hive Yes, it is possible, I haven’t done it myself 0.6x Cassandra – Hadoop M/R jobs can read from Cassandra  0.7x Cassandra – Hadoop M/R jobs can write to it (again, according to the docs) Pig: own implementation of LoadFunc; Hive work has been started See:  http://wiki.apache.org/cassandra/HadoopSupport github.com/stuhood/cassandra-summit-demo slideshare.net/jeromatron/cassandrahadoop-4399672 Hive: https://issues.apache.org/jira/browse/CASSANDRA-913
Developing Cassandra itself Using Eclipse http://wiki.apache.org/cassandra/RunningCassandraInEclipse
My personal recommendations Not that you asked. Understand that this is bleeding-edge You’re giving up a lot of SQL comforts Evaluate if you really need this (like anything else) If so, go with the latest and greatest and create a procedure to keep you running the latest and greatest (that would be 0.7x) Contribute back – it is good for your company and for you. Consider commercial support: http://www.riptano.com (I’m not affiliated in any way)
Things to think about during your evaluation: Are we just in another cycle? Fat client, thin client, big bandwidth, little bandwidth, big transactions, micro transactions Have we been here before? Remember dBase, FoxPro, Sleepycat/BerkeleyDB? Is it just a technology fad? How many people developed in WML/HDML only to have phones support full HTML/JS? Do we all need native iPhone apps?
I told you that you’d see this again… You will need to rewrite code and probably re-arch the application You will need to run in parallel for testing You will need training for your Dev and Ops You will need to develop new tools and processes Cassandra isn’t the only NoSQL option You’ll (likely) still need/want SQL
Thanks! http://wiki.apache.org/cassandra/GettingStarted http://www.riptano.com/blog/slides-and-videos-cassandra-summit-2010

 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesBoston Institute of Analytics
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 

Recently uploaded (20)

Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 

Using Cassandra with your Web Application

  • 1. Using Cassandra in your Web Applications Tom Melendez, Yahoo!
  • 2. Why do we need another DB? We really like MySQL. Everyone knows MySQL, and if they don’t, they definitely know SQL, Codd, Normalization etc. Lots of tools are based on SQL backends: 3rd party, home grown
  • 3. Should I consider NoSQL? Well, maybe There’s a gazillion NoSQL solutions out there If you’re already using Memcached on top of your db, then you should look closely at NoSQL, as you’ve already identified an issue with your current infrastructure.
  • 4. Cassandra: Overview Eventually consistent Highly Available Really fast reads, Really fast writes Flexible schemas Distributed No “Master” - No Single Point of Failure BigTable plus Dynamo written in Java
  • 5. A little context SQL Joins can be expensive Sharding can be a PITA Master is a point of failure (that can be mitigated but we all know it’s painful) The data really might not be that important RIGHT NOW. Oh yeah, someone got tired of lousy response times
  • 6. A little history Released by Facebook as Open Source Hosted at Google Code for a bit Now an Apache Project Based on: Amazon’s Dynamo All nodes are Equal (no master) partitioning/replication Google’s Big Table Column Families
  • 7. Sounds great, right? When do I throw away our SQL DB? When do I get my promotion? When do I go on vacation? Not So Fast.
  • 8. What you talkin’ about, Willis?
  • 9. You WILL see this slide again You will need to rewrite code and probably re-arch the application You will need to run in parallel for testing You will need training for your Dev and Ops You will need to develop new tools and processes Cassandra isn’t the only NoSQL option You’ll (likely) still need/want SQL somewhere in your infrastructure
  • 10. CAP Theorem Consistency – how consistent is the data across nodes? Availability – how available is the system? Partition Tolerance – will the system function if we lose a piece of it? CAP Theorem basically says you get to pick 2 of the above. (Anyone else reminded of: “Good, Fast and Cheap, pick two”?)
  • 11. CAP and Cassandra The tradeoffs among C, A and P are tunable by the client on a per-transaction basis. For example, when adding a user record, you could insist that this transaction is CONSISTENCY.ALL if you wanted. To really get the benefit of Cassandra, you need to look at what data DOES NOT need CONSISTENCY.ALL
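
To make the per-transaction tradeoff concrete, here is a minimal Python sketch of the usual replica-overlap arithmetic: a read is guaranteed to touch at least one replica holding the latest acknowledged write when the replicas read plus the replicas written exceed the replication factor. The function names are hypothetical and this is not a client API; only the level names ONE, QUORUM and ALL are taken from Cassandra.

```python
def replicas_required(level: str, replication_factor: int) -> int:
    """Replicas that must acknowledge a request at the given consistency level."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        # A majority of the replicas.
        return replication_factor // 2 + 1
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unknown level: {level}")


def read_sees_latest_write(write_level: str, read_level: str,
                           replication_factor: int) -> bool:
    """True when the read set and write set must overlap (R + W > N)."""
    w = replicas_required(write_level, replication_factor)
    r = replicas_required(read_level, replication_factor)
    return r + w > replication_factor
```

With a replication factor of 3, QUORUM writes paired with QUORUM reads overlap, while ONE/ONE does not. Data that can tolerate the latter is where the speed comes from.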
  • 14. Running Cassandra Does it fit in your infrastructure? Clustering/Partitioning Replication/Snitching Monitoring Tuning Tools/Utilities A couple exist, but you’ll likely need to build your own or at least augment what’s available
  • 15. Clustering The ring Each node has a unique token (dependent on the Partitioner used). Each node is responsible for the range from the previous node’s token up to its own token. The token determines on which node rows are stored
  • 16. Partitioning How data is stored on the cluster Random Order Preserving You can implement your own Custom Partitioning
  • 17. Partitioning: Types Random Default Good distribution of data across cluster Example usage: logging application Order Preserving Good for range queries OPP has seen some issues on the mailing list lately Custom implement IPartitioner to create your own
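
The ring and the two built-in partitioning styles can be sketched in a few lines of Python. This is an illustration under assumptions, not Cassandra's code: the `Ring` class and `random_token` helper are hypothetical names, MD5 stands in for the hashing a random partitioner does, and an order-preserving layout is modelled by using the row key itself as the token.

```python
import bisect
import hashlib


def random_token(key: str) -> int:
    """Hash the key so rows spread evenly around the ring."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Nodes sorted by token; each node owns (previous token, own token]."""

    def __init__(self, node_tokens: dict):
        self._entries = sorted((tok, node) for node, tok in node_tokens.items())
        self._tokens = [tok for tok, _ in self._entries]

    def node_for_token(self, token):
        # First node whose token is >= the row's token, wrapping past the end.
        i = bisect.bisect_left(self._tokens, token)
        return self._entries[i % len(self._entries)][1]


# Random style: the token is a hash of the key, so neighbouring keys scatter.
hashed_ring = Ring({"n1": 2**125, "n2": 2**126, "n3": 2**127})
owner = hashed_ring.node_for_token(random_token("jason"))

# Order-preserving style: the key is the token, so key ranges stay together.
ordered_ring = Ring({"n1": "g", "n2": "p", "n3": "z"})
```

In the ordered ring, every key between "g" (exclusive) and "p" (inclusive) lands on n2, which is what makes range queries cheap and also what makes hot spots possible.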
  • 18. Operations: Replication The first replica goes to whatever node claims that range. The rest of the replicas, which cover you should that node fail, are determined by replication strategies. You can tell Cassandra if the nodes are in a rack via IReplicaPlacementStrategy: RackUnawareStrategy, RackAwareStrategy. You can create your own. Replication factor – how many copies of the data do we want. These options go in conf/storage-conf.xml
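
As a rough picture of the simplest placement, the Python sketch below walks the ring clockwise from the node that claims the range and takes the next nodes until the replication factor is met. It assumes that is how a rack-unaware placement behaves; `rack_unaware_replicas` is a hypothetical helper, not the IReplicaPlacementStrategy interface.

```python
def rack_unaware_replicas(ring_nodes, primary_index, replication_factor):
    """Return the primary plus the following nodes on the ring.

    ring_nodes is the list of nodes in token order; the result never holds
    more entries than there are nodes.
    """
    count = min(replication_factor, len(ring_nodes))
    return [ring_nodes[(primary_index + i) % len(ring_nodes)]
            for i in range(count)]
```

A rack-aware strategy would differ only in which nodes it skips while walking the ring, which is why the snitch information matters.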
  • 19. Operations: Snitching Telling Cassandra the physical location of nodes EndPoint – figure out based on IP address PropertySnitch – individual IPs to datacenters/racks DatacenterEndpointSnitch – give it subnets and datacenters
  • 20. Operations - Monitoring IMO, It is critical that you get this working immediately (i.e. as soon as you have something running) Basically requires being able to run JMX queries and ideally store this data over time. Advice: watch the mailing list. I’m betting a HOWTO will pop up soon as we all have the same problem.
  • 21. Operations - Tuning You’ve set up monitoring, right? As you add ColumnFamilies, tuning might change Things you tune: Memtables (in mem structure: like a write-back cache) Heap Sizing: don’t ramp up the heap without testing first key cache: probably want to raise this for reads row cache
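
To see why memtable settings matter, here is a toy Python model of the write path (commit log, then memtable, then a flush to an immutable sorted table). `TinyStore` and `memtable_limit` are hypothetical names for illustration; the real thresholds and on-disk formats are more involved.

```python
class TinyStore:
    """Toy write path: append to a log, buffer in memory, flush when full."""

    def __init__(self, memtable_limit: int):
        self.memtable_limit = memtable_limit
        self.commit_log = []   # durability record for unflushed writes
        self.memtable = {}     # in-memory buffer, like a write-back cache
        self.sstables = []     # flushed, immutable, sorted tables

    def write(self, key, value):
        self.commit_log.append((key, value))
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Persist the buffer as one sorted table, then start a fresh buffer.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}
        self.commit_log.clear()

    def read(self, key):
        # Newest data first: memtable, then tables from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):
            for k, v in table:
                if k == key:
                    return v
        return None
```

A larger limit means fewer flushes and fewer tables to search on reads, at the cost of more heap, which is the balance the tuning knobs above control.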
  • 22. Utilities: NodeTool Really important. Helps you manage your cluster. Find under the bin/ dir in the download get some disk storage stats heap memory usage data snapshot decommission a node move a node
  • 23. Utilities: cassandra-cli This is NOT the equivalent of: mysql> (although it does provide a prompt) the mysql executable You can do basic get/set operations and some other stuff It is really meant to check and see if things are working Maybe one day it will grow into something more
  • 24. Utilities: cassandra-cli Example: cassandra> set Keyspace1.Standard1['user']['tom'] = 'cool' Value inserted. cassandra> count Keyspace1.Standard1['user'] 1 columns cassandra> get Keyspace1.Standard1['user']['tom'] => (column=746f6d, value=cool, timestamp=1286875497246000) cassandra> show api version 2.2.0
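
The `get` output above is easier to read once decoded. The Python sketch below turns the hex column name back into text and converts the timestamp, assuming (as is the common client convention) that it counts microseconds since the Unix epoch. Both helper names are hypothetical.

```python
from datetime import datetime, timezone


def decode_column_name(hex_name: str) -> str:
    """The CLI prints the column name as hex-encoded bytes."""
    return bytes.fromhex(hex_name).decode("utf-8")


def timestamp_to_datetime(micros: int) -> datetime:
    """Assumes the timestamp is microseconds since the Unix epoch."""
    seconds, remainder = divmod(micros, 1_000_000)
    moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return moment.replace(microsecond=remainder)


name = decode_column_name("746f6d")             # the column set above
when = timestamp_to_datetime(1286875497246000)  # a UTC datetime
```

Here `name` comes back as the string "tom", matching the column that was set.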
  • 25. Other Utilities stress.py – helps you test the performance of your cluster. run periodically against your cluster(s) be prepared with these results when asking for perf help on the mailing list binary-memtable – a bulk loader that avoids some of the Thrift overhead. Use with caution.
  • 26. Data Model Simply put, it is similar to a multi-dimensional array The general strategy is denormalized data, sacrificing disk space for speed/efficiency Think about your queries (your DBAs will like this, but won’t like the way it is done!) You’ll end up getting very creative You need to know your queries in advance, they ultimately define your schema.
  • 27. Data Model Again, keep in mind that you’re (probably) after denormalizing. I know it’s painful. Terms you’ll see: Keyspaces Column Families SuperColumns Indexes Queries
  • 28. Data Model Column Family Think of it as a DB table Column Key-Value Pair (NOT just a value, like a DB column) they also have a timestamp SuperColumn Columns inside a column So, you have a key, and its value are columns no timestamp Keyspace – like a namespace, generally 1 per app
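
The "multi-dimensional array" idea can be mimicked with nested dictionaries in Python: keyspace, then column family, then row key, then column name, with each column holding a value and a timestamp. This is only a mental model; `insert` and `get` here are hypothetical helpers, and resolving conflicts by keeping the newest timestamp is an assumption of the sketch.

```python
import time


def insert(keyspace, column_family, row_key, column_name, value, timestamp=None):
    """Store (value, timestamp) under keyspace[cf][row][column]."""
    ts = timestamp if timestamp is not None else int(time.time() * 1_000_000)
    row = keyspace.setdefault(column_family, {}).setdefault(row_key, {})
    current = row.get(column_name)
    # Keep whichever write carries the newest timestamp.
    if current is None or ts >= current[1]:
        row[column_name] = (value, ts)


def get(keyspace, column_family, row_key, column_name):
    """Return just the value of one column."""
    return keyspace[column_family][row_key][column_name][0]


keyspace1 = {}
insert(keyspace1, "Standard1", "user", "tom", "cool", timestamp=2)
insert(keyspace1, "Standard1", "user", "tom", "stale", timestamp=1)  # ignored
```

A SuperColumn would add one more dictionary level between the row key and the columns.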
  • 29. Data Model Indexes and Queries Here is where you get creative Regardless of the partitioner, rows are always stored sorted by key Column sorting: CompareWith and CompareSubcolumnsWith
  • 30. Data Model: Indexes and Queries Your bag of tricks include: creating column families for each query getting the row key to be the WHERE of your SQL query using column and SuperColumn names as “values” columns are stored sorted within the row
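
Because columns are kept sorted within a row, using column names as "values" lets you answer range questions with a slice. The Python sketch below imitates that with a sorted list and a binary search; `get_slice` only borrows its name loosely from the API list and is not the Thrift signature, and the date-named columns are made up for the example.

```python
import bisect


def get_slice(row: dict, start: str, finish: str):
    """Return (name, value) pairs whose names fall in [start, finish]."""
    names = sorted(row)  # a real row is already stored in this order
    lo = bisect.bisect_left(names, start)
    hi = bisect.bisect_right(names, finish)
    return [(name, row[name]) for name in names[lo:hi]]


# Hypothetical row: one column per login, named by date so it sorts by time.
logins = {
    "2010-10-01": "a",
    "2010-10-05": "b",
    "2010-10-09": "c",
    "2010-11-01": "d",
}
october = get_slice(logins, "2010-10-02", "2010-10-31")
```

Choosing the comparator (CompareWith) decides what "sorted" means, so it has to match the way the names were built.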
  • 31. Data Model: Example Example data set: “b”: {“name”:”Ben”, “street”:”1234 Oak St.”, “city”:”Seattle”, “state”:”WA”} “jason”: {”name”:”Jason”, “street”:”456 First Ave.”, “city”:”Bellingham”, “state”:”WA”} “zack”: {”name”: “Zack”, “street”: “4321 Pine St.”, “city”: “Seattle”, “state”: “WA”} “jen1982”: {”name”:”Jennifer”, “street”:”1120 Foo Lane”, “city”:”San Francisco”, “state”:”CA”} “albert”: {”name”:”Albert”, “street”:”2364 South St.”, “city”:”Boston”, “state”:”MA”} (Taken from Benjamin Black’s presentation on indexing – twitter: @b6n)
  • 32. Data Model: Example Given that data set, we want to say: SELECT name FROM Users WHERE state=“WA” We create a ColumnFamily: <ColumnFamily Name=”LocationUserIndexSCF” CompareWith=”UTF8Type” CompareSubcolumnsWith=”UTF8Type” ColumnType=”Super” /> (Taken from Benjamin Black’s presentation on indexing – twitter: @b6n)
  • 33. Data Model: Example Which looks like this: [state]: { [city1]: {[name1]:[user1], [name2]:[user2], ... }, [city2]: {[name3]:[user3], [name4]:[user4], ... }, ... [cityX]: {[name5]:[user5], [name6]:[user6], ... } } State is the row key, so we can select by it and we’ll get the city grouping and name sorting basically for free. (Taken from Benjamin Black’s presentation on indexing – twitter @b6n)
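
The index layout above can be tried out in plain Python with the example data set. This sketch builds the state, city, name nesting as dictionaries and reads it back in sorted order, which stands in for the sorting a UTF8Type comparator would give; `build_location_index` and `names_in_state` are hypothetical helpers, not client calls.

```python
users = {
    "b": {"name": "Ben", "street": "1234 Oak St.", "city": "Seattle", "state": "WA"},
    "jason": {"name": "Jason", "street": "456 First Ave.", "city": "Bellingham", "state": "WA"},
    "zack": {"name": "Zack", "street": "4321 Pine St.", "city": "Seattle", "state": "WA"},
    "jen1982": {"name": "Jennifer", "street": "1120 Foo Lane", "city": "San Francisco", "state": "CA"},
    "albert": {"name": "Albert", "street": "2364 South St.", "city": "Boston", "state": "MA"},
}


def build_location_index(users):
    """Row key = state, super column = city, column name = name, value = user key."""
    index = {}
    for user_key, user in users.items():
        cities = index.setdefault(user["state"], {})
        cities.setdefault(user["city"], {})[user["name"]] = user_key
    return index


def names_in_state(index, state):
    """Equivalent of SELECT name FROM Users WHERE state = ?, grouped by city."""
    result = []
    for city in sorted(index.get(state, {})):
        result.extend(sorted(index[state][city]))
    return result


index = build_location_index(users)
wa_names = names_in_state(index, "WA")
```

Note the cost of the trick: every write to a user now also has to update this index row, which is the denormalization trade described earlier.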
  • 34. Talking to Cassandra Generally two ways to do this: Native clients (ideal) Thrift Avro support is coming All of the PHP clients are still very Alpha All the PHP clients use Thrift that I’ve seen If you can, please use them and file bugs. Or even better than that – FIX IT YOURSELF! If you need something more stable, use Thrift
  • 35. PHP Clients Pandra (LGPL) PHP Cassa – pycassa port Simple Cassie (New BSD License) Prophet (PHP License) Clients in other languages are further along Thanks to Chris Barber (@cb1inc) for this list
  • 36. Raw Cassandra API These are wrapped differently per client but generally exposed by thrift. These are just the major data manip methods, there are others to gather information, etc.. Full list is here: http://wiki.apache.org/cassandra/API
  • 37. Raw Cassandra API get get_count get_key_range get_range_slices get_slice multiget_slice insert batch_mutate remove truncate
  • 38. What is Thrift? Thrift is a remote procedure call framework developed at Facebook for “scalable cross-language services development” – Wikipedia In short, you define a .thrift file (IDL file) with data structures, services, etc., run the “thrift compiler” and get code, which you then use. PHP, Java, Perl, Python, C#, Erlang, Ruby (and probably others) are supported. thrift -php myproject.thrift is what you run. Generated files are in a dir called: gen-php Then go in and add your logic
  • 39. Example IDL file Heavily Snipped from: http://wiki.apache.org/thrift/Tutorial # Thrift Tutorial (heavily snipped) # Mark Slee (mcslee@facebook.com) # C and C++ comments also supported include "shared.thrift" namespace phptutorial service Calculator extends shared.SharedService { void ping(), i32 add(1:i32 num1, 2:i32 num2), i32 calculate(1:i32 logid, 2:Work w) throws (1:InvalidOperation ouch), oneway void zip(), }
  • 40. Installing Thrift and the PHP ext Download and install Thrift http://incubator.apache.org/thrift/download/ To use PHP, you install the PHP extension “thrift_protocol” You’ll find this in the Thrift download above Steps cd PATH-TO-THRIFT/lib/php/src/ext/thrift_protocol phpize && ./configure --enable-thrift_protocol && make sudo cp modules/thrift_protocol.so /php/ext/dir add extension=thrift_protocol.so to the appropriate php.ini file You really need APC, too (http://www.php.net/apc)
  • 41. PHP Thrift Example http://wiki.apache.org/cassandra/ThriftExamples#PHP
  • 42. So, who’s using this thing? Big and small companies alike Not sure if their applications of Cassandra are mission-critical Yahoo! is NOT a user, but we have our own implementation, and that implementation IS mission critical. Do a search for “PNUTS”
  • 44. Heavy users, but not for tweets. Yet.
  • 45. Probably the biggest consumer-facing users of Cassandra
  • 46. Digg - continued These guys have provided a lot Patches Documentation/Blogs/Advocacy LazyBoy Python client: http://github.com/digg/lazyboy#readme
  • 47. Not totally sure, probably logging the massive amounts of data they generate from routers, switches and other hardware http://www.rackspacecloud.com/blog/2010/06/07/speaking-session-on-cassandra-at-velocity-2010/
  • 48. Others using Cassandra Comcast, Cisco, CBS Interactive http://www.dbthink.com/?p=183
  • 49. Competitors, sort of CouchDB – document db, accessible via javascript and REST HBase – no SPOF, Column Families, runs on top of Hadoop Memcached – used with MySQL, FB are big users MongoDB – cool online shell; k/v store, document db Redis – see Cassandra vs. Redis presentation by @tlossen from NoSQL Frankfurt 9/28/2010 Voldemort – distributed db, built by LinkedIn
  • 50. Cassandra and Hadoop and Pig/Hive Yes, it is possible, I haven’t done it myself 0.6x Cassandra - Hadoop M/R jobs can read from Cassandra 0.7x Cassandra – Hadoop M/R jobs can write to it (again, according to the docs) Pig: own implementation of LoadFunc; Hive work has been started See: http://wiki.apache.org/cassandra/HadoopSupport github.com/stuhood/cassandra-summit-demo slideshare.net/jeromatron cassandrahadoop-4399672 Hive: https://issues.apache.org/jira/browse/CASSANDRA-913
  • 51. Developing Cassandra itself Using Eclipse http://wiki.apache.org/cassandra/RunningCassandraInEclipse
  • 52. My personal recommendations Not that you asked. Understand that this is bleeding-edge You’re giving up a lot of SQL comforts Evaluate if you really need this (like anything else) If so, go with the latest and greatest and create a procedure to keep you running the latest and greatest (that would be 0.7x) Contribute back – it is good for your company and for you. Consider commercial support: http://www.riptano.com (I’m not affiliated in any way)
  • 53. Think about during your evaluation: Are we just in another cycle? Fat client, thin client, Big bandwidth, little bandwidth, big transactions, micro transactions Have we been here before? Remember dbase, Foxpro, Sleepycat/BerkeleyDB? Is it just a technology Fad? How many people developed in WML/HDML only to have phones support full HTML/JS? Do we all need native iPhone Apps?
  • 54. I told you that you’d see this again… You will need to rewrite code and probably re-arch the application You will need to run in parallel for testing You will need training for your Dev and Ops You will need to develop new tools and processes Cassandra isn’t the only NoSQL option You’ll (likely) still need/want SQL

Editor's Notes

  1. http://blog.medallia.com/2010/05/choosing_a_keyvalue_storage_sy.html
  2. http://www.julianbrowne.com/article/viewer/brewers-cap-theorem http://dbmsmusings.blogspot.com/2010/04/problems-with-cap-and-yahoos-little.html
  3. http://www.riptano.com/docs/0.6.5/consistency/levels
  4. http://www.riptano.com/docs/0.6.5/consistency/levels
  5. http://www.riptano.com/docs/0.6.5/operations/clustering
  6. http://www.riptano.com/docs/0.6.5/operations/clustering http://www.slideshare.net/benjaminblack/introduction-to-cassandra-replication-and-consistency
  7. http://www.slideshare.net/benjaminblack/introduction-to-cassandra-replication-and-consistency
  8. You need to know and understand where you started from and where you are now. If you don’t do this, you’ll be on the mailing list having to explain in detail your setup and reporting back the numbers provided by JMX. So, save yourself the trouble and understand how it works from day one. Maybe Cassandra is a good store for holding Cassandra JMX data.
  9. See: http://www.riptano.com/docs/0.6.5/operations/tuning http://wiki.apache.org/cassandra/MemtableSSTable commit log -> memtable -> sstables http://wiki.apache.org/cassandra/ArchitectureSSTable
  10. nodetool --h localhost cfstats; nodetool --h localhost ring; nodetool --h localhost info
  11. http://www.riptano.com/docs/0.6.5/utils/binary-memtable
  12. http://www.slideshare.net/benjaminblack/cassandra-basics-indexing
  13. http://www.slideshare.net/benjaminblack/cassandra-basics-indexing
  14. http://www.slideshare.net/benjaminblack/cassandra-basics-indexing
  15. http://www.riptano.com/docs/0.6.5/api/clients http://avro.apache.org/docs/current/
  16. http://wiki.apache.org/cassandra/API
  17. http://chanian.com/2010/05/13/thrift-tutorial-a-php-client/ http://incubator.apache.org/thrift/about/ http://wiki.apache.org/thrift/ThriftIDL
  18. http://incubator.apache.org/thrift/download/ https://wiki.fourkitchens.com/display/PF/Using+Cassandra+with+PHP http://www.php.net/apc
  19. http://www.facebook.com/note.php?note_id=24413138919
  20. As of July, Twitter is using Cassandra, but not to store tweets. http://engineering.twitter.com/2010/07/cassandra-at-twitter-today.html
  21. http://project-voldemort.com/ http://project-voldemort.com/performance.php http://blog.oskarsson.nu/2009/06/nosql-debrief.html http://static.last.fm/johan/nosql-20090611/vpork_nosql.pdf
  22. http://highscalability.com/blog/2009/10/13/why-are-facebook-digg-and-twitter-so-hard-to-scale.html http://maxgrinev.com/2010/07/12/do-you-really-need-sql-to-do-it-all-in-cassandra/ http://arin.me/blog/wtf-is-a-supercolumn-cassandra-data-model http://www.rackspacecloud.com/blog/2010/02/25/should-you-switch-to-nosql-too/ http://www.slideshare.net/jbellis/what-every-developer-should-know-about-database-scalability-pycon-2010 http://david415.wordpress.com/2010/09/03/cassandra-data-storage-performance-tool/