Summary of "Cassandra" for the 3rd NoSQL Summer reading session in Tokyo
1. Cassandra – A Decentralized Structured Storage System
Gemini Mobile Technologies, Inc.
NOSQL Tokyo Reading Group (http://nosqlsummer.org/city/tokyo)
August 25, 2010
Tags: #cassandra #nosql
2010/8/23 Gemini Mobile Technologies, Inc.
2. Cassandra: A Decentralized Structured Storage System
Authors: Avinash Lakshman, Prashant Malik.
Abstract: Cassandra is a distributed storage system for managing very large amounts of structured data spread out across many commodity servers, while providing highly available service with no single point of failure. Cassandra aims to run on top of an infrastructure of hundreds of nodes (possibly spread across different data centers). At this scale, small and large components fail continuously. The way Cassandra manages the persistent state in the face of these failures drives the reliability and scalability of the software systems relying on this service. While in many ways Cassandra resembles a database and shares many design and implementation strategies therewith, Cassandra does not support a full relational data model; instead, it provides clients with a simple data model that supports dynamic control over data layout and format. …
Appeared in: 3rd ACM SIGOPS International Workshop on Large Scale Distributed Systems and Middleware (LADIS), 2009. http://www.cs.cornell.edu/projects/ladis2009/papers/lakshman-ladis2009.pdf
3. 1. Introduction and 2. Related Work
Facebook inbox search: enables users to search through their inbox. Launched June 2008. Highly scalable: 250M users. Tolerant of server/network failures. Very high write throughput: "billions of writes per day". Data is replicated across data centers.
Related work: distributed file systems (Ficus, Coda, Farsite, GFS, Bayou) and storage systems (Dynamo, Bigtable).
"The Apache Cassandra Project develops a highly scalable second-generation distributed database, bringing together Dynamo's fully distributed design and Bigtable's ColumnFamily-based data model."
4. 3. Data Model
Multi-level index:
1. Table: a set of rows.
2. Key: identifies the row. The key is an arbitrary byte[]. Each row can contain a variable number of columns/CFs; rows need not contain the same columns/CFs, and a single row can contain millions of them. Operations are atomic per key per replica.
3. ColumnName: identifies the column value(s). Can be "Column", "ColumnFamily", "ColumnFamily:Column", "ColumnFamily:ColumnFamily", etc. A ColumnFamily (CF) is a group of Columns. CFs and Columns are sorted, either time-based or name-based. Columns can be added/deleted efficiently at run time.
5. Data Model Example: Inbox Search
Query: find all messages of user3 containing "hello": Get(UserMessages, "user3", "term:hello")
Table: UserMessages. Key: <userid>. CF: "term". Super column: <word>. Column name: <timestamp>. Value: <messageID>.
Example row (key user3, CF "term"):
  "hello": time4 -> msg10, time12 -> msg81
  "how":   time4 -> msg10
  "you":   time4 -> msg10, time12 -> msg81, time1 -> msg03
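The nested table -> key -> CF -> super column -> column structure above can be sketched as plain nested maps. This is an illustrative stand-in, not Cassandra's actual storage layout or API; the `get` helper and the row contents mirror the inbox-search example on this slide.

```python
# Illustrative sketch (not Cassandra's API): the inbox-search data model
# as nested maps: row key -> column family -> super column -> {timestamp: messageID}.
user_messages = {
    "user3": {
        "term": {
            "hello": {"time4": "msg10", "time12": "msg81"},
            "how":   {"time4": "msg10"},
            "you":   {"time4": "msg10", "time12": "msg81", "time1": "msg03"},
        }
    }
}

def get(table, key, column_path):
    """Walk the nested maps along a 'CF:Column' path, e.g. 'term:hello'."""
    node = table[key]
    for part in column_path.split(":"):
        node = node[part]
    return node

# All (timestamp, messageID) columns for user3's word "hello":
print(get(user_messages, "user3", "term:hello"))
```

A query for one word returns that word's whole column slice, which is exactly what the slide's Get(UserMessages, "user3", "term:hello") call expresses.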
6. 4. API
Simple get/put operations:
Insert(table, key, rowMutation): a single column, multiple columns, or a batch of multiple keys.
Get(table, key, columnName): key is a single key or a key range; columnName is a "slice" range or a name.
Delete(table, key, columnName)
Each request also specifies a Consistency Level.
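The call shapes above can be sketched as a toy in-memory client. This is a hypothetical stand-in: real Cassandra clients of this era spoke Thrift, and the class and method names here are illustrative only; the consistency-level names are the ones listed on the slide.

```python
from enum import Enum

class ConsistencyLevel(Enum):
    # Levels named on the slide; semantics are per read/write request.
    ZERO = "zero"
    ONE = "one"
    QUORUM = "quorum"
    ALL = "all"
    ANY = "any"

class ToyClient:
    """Hypothetical in-memory stand-in for the paper's three calls."""
    def __init__(self):
        self._tables = {}

    def insert(self, table, key, row_mutation, cl=ConsistencyLevel.ONE):
        # row_mutation is {columnName: value}; may carry one or many columns.
        self._tables.setdefault(table, {}).setdefault(key, {}).update(row_mutation)

    def get(self, table, key, column_name, cl=ConsistencyLevel.ONE):
        return self._tables.get(table, {}).get(key, {}).get(column_name)

    def delete(self, table, key, column_name, cl=ConsistencyLevel.ONE):
        self._tables.get(table, {}).get(key, {}).pop(column_name, None)

c = ToyClient()
c.insert("UserMessages", "user3", {"term:hello": "msg10"}, cl=ConsistencyLevel.QUORUM)
print(c.get("UserMessages", "user3", "term:hello"))
```

The single-node toy obviously ignores the consistency level; the point is only the three-call surface and the per-request level parameter.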
7. 5. System Architecture
Data is partitioned across a subset of nodes: consistent hashing.
Data is replicated to multiple nodes for redundancy and performance: quorum using a "preference list" of nodes.
Node management: a membership algorithm tracks which nodes are up/down ("accrual failure detection + gossip"); bootstrapping adds a node (manual operation + "seed" nodes).
[Figure: ring of nodes A–D connected by consistent hashing and gossip]
8. 5.1 Partitioning Algorithm: Consistent Hashing
Each node is assigned a random position on a ring; key k is hashed into the same fixed circular space. A key is assigned to nodes by walking clockwise from its hash location.
Example: nodes A–G are placed on the ring, and Hash(k) falls between A and B. With 3 replicas, the next 3 nodes on the ring are chosen (i.e., B, C, D).
[Figure: hash ring with nodes A–G and Hash(k) between A and B]
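The clockwise-walk assignment can be sketched in a few lines. This is a minimal illustrative ring, not Cassandra's partitioner: MD5 stands in for the hash into the circular space, each node gets one position, and a key's replicas are the first N nodes at or after its hash.

```python
import bisect
import hashlib

def ring_hash(s: str) -> int:
    """Map a string to a point on a fixed circular hash space (MD5 here)."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    """Minimal consistent-hash ring: a key is served by the first
    n_replicas nodes found walking clockwise from Hash(key)."""
    def __init__(self, nodes, n_replicas=3):
        self.n = n_replicas
        self.points = sorted((ring_hash(node), node) for node in nodes)

    def replicas(self, key):
        # Find the first node position at or after Hash(key), wrapping around.
        i = bisect.bisect(self.points, (ring_hash(key), ""))
        return [self.points[(i + j) % len(self.points)][1] for j in range(self.n)]

ring = Ring(["A", "B", "C", "D", "E", "F", "G"])
print(ring.replicas("k"))  # the 3 successor nodes of Hash(k) on the ring
```

Because node positions are random hash values, the actual successor nodes for "k" depend on the hash function, not on the alphabetical order in the slide's example.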
9. 5.1 Consistent Hashing
Key advantage: adding, deleting, and re-allocating nodes is cheap; a change affects only the keys of the immediate neighbor node. The hash function trades data locality for even load distribution. Load balancing: lightly loaded nodes can move on the ring toward heavily loaded nodes.
10. 5.2 Replication
Each data item is replicated at N nodes. Each key is assigned to a "coordinator" node by the consistent hash function; the coordinator replicates the key to an additional N-1 nodes. The "Consistency Level" is set by the client per read/write request: ZERO, ONE, ALL, ANY, QUORUM. ZooKeeper is used to elect a leader node and distribute the "preference list"; the leader maintains the "preference list" that maps each key to its node list.
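The slide does not spell out why QUORUM matters, so a one-line worked check may help: this is the standard quorum-overlap argument (not stated explicitly on the slide), namely that a read of R replicas and a write acknowledged by W replicas must intersect whenever R + W > N.

```python
def quorum(n: int) -> int:
    """Smallest majority of n replicas."""
    return n // 2 + 1

# With N replicas, a write acknowledged by W nodes and a read that
# consults R nodes share at least one replica whenever R + W > N,
# so a QUORUM read observes the latest QUORUM write.
N = 3
W = quorum(N)   # 2 of 3 replicas ack the write
R = quorum(N)   # 2 of 3 replicas answer the read
assert R + W > N
```

With N = 3, QUORUM is 2 on both sides, so any read quorum overlaps any write quorum in at least one node.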
11. 5.3 Membership
Each node locally determines whether any other node in the system is up or down.
Φ (phi) Accrual Failure Detector: instead of a boolean value (up or down), compute a numeric suspicion level Φ for each monitored node. Φ is computed from the inter-arrival times of gossip messages from other nodes in the cluster; if Φ exceeds a threshold, the node is considered down. In an experiment with 100 nodes and a threshold of 5, the average time to detect a failure was 15 seconds.
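The suspicion-level idea can be sketched with one simplifying assumption: modelling gossip inter-arrival times as exponentially distributed (one common variant of the accrual detector; the exact distribution used is an implementation detail not stated on the slide). Φ is then the negative log10 of the probability that a heartbeat would still arrive after this much silence.

```python
import math

def phi(time_since_last: float, mean_interval: float) -> float:
    """Simplified accrual suspicion level, assuming exponentially
    distributed gossip inter-arrival times (an assumption of this
    sketch): phi = -log10(P(next message arrives later than now))."""
    p_later = math.exp(-time_since_last / mean_interval)
    return -math.log10(p_later)

# With gossip arriving every ~1 s on average, suspicion grows the
# longer a node stays silent; once phi crosses the configured
# threshold (5 in the slide's experiment), the node is declared down.
threshold = 5.0
for silence in (1.0, 5.0, 15.0):
    print(silence, round(phi(silence, 1.0), 2), phi(silence, 1.0) > threshold)
```

Under this toy model, roughly 12 seconds of silence against a 1-second mean pushes Φ past 5, which is at least consistent in spirit with the ~15-second detection time reported on the slide.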
12. 5.4 Bootstrapping, 5.5 Scaling the Cluster
New nodes check the configuration for "seed" nodes to get initial gossip data, such as "preference" lists. Adding/removing nodes is not automatic; it requires a manual command-line operation. A new node needs data moved to it from other nodes (operationally, about 40 MB/s). Work is under way to improve this by copying data from multiple replicas, a la BitTorrent.
21. Epilogue
Active Apache project with good documentation:
http://cassandra.apache.org/
http://wiki.apache.org/cassandra/ArticlesAndPresentations
In use at companies like Digg, Facebook, Twitter, Reddit, and Rackspace. The largest production cluster has over 100 TB of data across more than 150 machines.