cassandra
1. Apache Cassandra
Vova Miguro
THE END
trnl.me@gmail.com
Thursday, September 22, 11
2. What is Cassandra?
• key-value store with some structure
• fault-tolerant
• scalable
• eventually consistent
• tunable
- consistency level
- replication
3. Where did it come from?
• created at Facebook
- Dynamo: distribution architecture
- BigTable: data model
• open-sourced in 2008
• Apache incubator in early 2009
• graduation in March 2010
4. Who uses it?
• Facebook (of course)
• Rackspace
• Twitter
• Digg
• Reddit
• IBM
• others...
5. What problems does it solve?
• reliability at scale
- no single point of failure (all nodes are
identical)
• simple scaling (linear)
• high write throughput
• large data sets
6. What problems can’t it solve?
• no flexible indexes (more on this later)
• not good for big binary data (>64 MB) unless
you chunk it
• row contents must fit in available memory
7. Clustering: CAP
• CAP Theorem
- Consistency
- Availability
- Partition tolerance
• choose two
• Cassandra chooses A and P, but lets you tune
the trade-off toward more C
8. Clustering: Replication & Consistency
• replication factor
- how many nodes data is replicated on
• consistency level
- zero (async write)
- any
- one
- quorum (rf/2+1)
- all
9. Clustering: Consistency Level
level     waits for                                  applies to
zero      none (async write)                         write
any       1st response (including hinted handoff)    write
one       1st response                               read/write
quorum    rf/2 + 1                                   read/write
all       all                                        read/write
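The quorum size above can be sketched in a few lines of Python. The function names are hypothetical, and the overlap check (R + W > RF) is the usual rule of thumb for when a read is guaranteed to see the latest write; it is not stated on the slide.

```python
def quorum(replication_factor: int) -> int:
    """Quorum size as given on the slide: rf/2 + 1, using integer division."""
    return replication_factor // 2 + 1


def read_sees_latest_write(read_replicas: int, write_replicas: int,
                           replication_factor: int) -> bool:
    """True when every read set must share a replica with every write set."""
    return read_replicas + write_replicas > replication_factor


rf = 3
q = quorum(rf)
print("quorum for rf=3:", q)
print("quorum reads + quorum writes overlap:", read_sees_latest_write(q, q, rf))
print("one + one overlap:", read_sees_latest_write(1, 1, rf))
```

With rf = 3, reading and writing at quorum touches two replicas each, so at least one replica is common to both.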
10. Clustering: Ring
• every node gets a token
- defines its place
in the ring
- and which keys it
is responsible
for (ranges)
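A minimal sketch of how a token can define both a node's place in the ring and its key range. The node names, the MD5-based hash, and the ring size are assumptions made for illustration; a real partitioner may compute tokens differently.

```python
import bisect
import hashlib

RING_SIZE = 2 ** 127  # assumption: tokens live in [0, 2**127)


def token_for(key: str) -> int:
    """Map a row key to a position on the ring (illustrative hash only)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % RING_SIZE


def owner_for_token(ring, token):
    """ring is a sorted list of (token, node).

    A node owns the range (previous node's token, its own token];
    tokens past the last node wrap around to the first node.
    """
    tokens = [t for t, _ in ring]
    index = bisect.bisect_left(tokens, token)
    return ring[index % len(ring)][1]


def owner(ring, key):
    return owner_for_token(ring, token_for(key))


# Four hypothetical nodes with evenly spaced tokens.
ring = [(i * RING_SIZE // 4, "node%d" % i) for i in range(4)]
print(owner(ring, "jsmith"))
```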
12. Clustering: Ring
• new node
- token assignment
- ranges adjusted
- bootstrap
- only neighbor
nodes affected
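To illustrate why only a neighbor is affected, here is a toy ring with 100 token positions. It is a sketch under simplifying assumptions: node names and tokens are made up, and replication is ignored (with a replication factor above one, a few more neighbors would take part).

```python
import bisect


def owner_for_token(ring, token):
    """A node owns the range (previous node's token, its own token]."""
    tokens = [t for t, _ in ring]
    return ring[bisect.bisect_left(tokens, token) % len(ring)][1]


def add_node(ring, token, node):
    """Return a new sorted ring with the extra node inserted at its token."""
    return sorted(ring + [(token, node)])


before = [(0, "n1"), (25, "n2"), (50, "n3"), (75, "n4")]
after = add_node(before, 60, "n5")

# Collect every token position whose owner changed.
moved = {}
for token in range(100):
    old, new = owner_for_token(before, token), owner_for_token(after, token)
    if old != new:
        moved[token] = (old, new)

print(sorted(moved))         # token positions that changed owner
print(set(moved.values()))   # which (old owner, new owner) pairs were involved
```

Only the range between the new node's predecessor and its own token changes hands, and it is taken from a single existing node.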
13. Clustering: Ring
• node dies or becomes
isolated
• hinted handoff
14. Data Model
• keyspace
• column family
• row (indexed)
• key
• columns
• name (sorted)
• value
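One way to picture the hierarchy above is as nested maps. This is only a mental model in Python, not how Cassandra stores data; the `users` column family and its values are borrowed from the CQL example later in the deck, and the helper names are hypothetical.

```python
# keyspace -> column family -> row key -> {column name: value}
keyspace = {}


def insert(column_family, key, name, value):
    """Create the column family and row on demand, then set one column."""
    keyspace.setdefault(column_family, {}).setdefault(key, {})[name] = value


def columns(column_family, key):
    """Return a row's columns sorted by column name, as the slide describes."""
    return sorted(keyspace[column_family][key].items())


insert("users", "jsmith", "state", "TX")
insert("users", "jsmith", "password", "ch@ngem3a")
insert("users", "jsmith", "gender", "f")
print(columns("users", "jsmith"))
```

Rows in the same column family do not need the same set of columns; each row simply holds whatever columns were inserted under its key.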
24. Reading
• get(): retrieve column by name
• multiget(): by column name for a number of keys
• get_slice(): by column name or a range of names
- returning columns
- returning supercolumns
• multiget_slice(): a subset of columns for a set of keys
• get_count(): number of columns or subcolumns
• get_range_slice(): subset of columns for a range of keys
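The read calls above can be mimicked over an in-memory row to show what each one returns. These Python functions are stand-ins with simplified, hypothetical signatures; they are not the actual client API, which also takes a column family, a consistency level, and so on.

```python
def get(row, name):
    """Retrieve one column by name."""
    return row[name]


def get_slice(row, names=None, start="", finish=""):
    """Columns by explicit names, or by an inclusive range of names.

    An empty start or finish means "unbounded" on that side.
    Results are ordered by column name.
    """
    if names is not None:
        return [(n, row[n]) for n in sorted(names) if n in row]
    return [(n, v) for n, v in sorted(row.items())
            if n >= start and (finish == "" or n <= finish)]


def get_count(row):
    """Number of columns in the row."""
    return len(row)


def multiget_slice(column_family, keys, names):
    """The same subset of columns for several row keys."""
    return {k: get_slice(column_family.get(k, {}), names=names) for k in keys}


users = {"jsmith": {"gender": "f", "password": "ch@ngem3a", "state": "TX"}}
print(get(users["jsmith"], "state"))
print(get_slice(users["jsmith"], start="h", finish="s"))
print(multiget_slice(users, ["jsmith", "nobody"], ["state"]))
```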
27. CQL (from 0.8)
• USE
• SELECT
• INSERT/UPDATE
• DELETE
• TRUNCATE/DROP
• BATCH
• CREATE KEYSPACE
• CREATE COLUMNFAMILY
• CREATE INDEX
28. CQL: Example
CREATE COLUMNFAMILY users (
  KEY varchar PRIMARY KEY,
  password varchar,
  gender varchar,
  session_token varchar,
  state varchar,
  birth_year bigint);
INSERT INTO users (KEY, password) VALUES ('jsmith', 'ch@ngem3a');
SELECT * FROM users WHERE KEY='jsmith';
u'jsmith' | u'password',u'ch@ngem3a'
DROP COLUMNFAMILY users;
29. CQL: Example
CREATE INDEX birth_year_key ON users (birth_year);
CREATE INDEX state_key ON users (state);
SELECT * FROM users
  WHERE gender='f' AND
  state='TX' AND
  birth_year='1968';
u'user1' | u'birth_year',1968 | u'gender',u'f' | u'password',u'ch@ngem3' | u'state',u'TX'
DROP COLUMNFAMILY users;
30. Indexing
• secondary indexes
- hashed
- equality predicates (where column x = y)
- specified on creation or later
- best when many rows with similar columns
• self-managed indexes
32. Indexing: Self-managed: one-to-several
[Diagram: a single row keyed by index name, with entries for indexed value #1 and indexed value #2, each pointing to related keys]
33. Indexing: Self-managed: one-to-many
[Diagram: one row per indexed value (#1, #2); column names are the related keys; column values are empty (-)]
34. Indexing: Self-managed: one-to-many
[Diagram: one row per indexed value (#1, #2); column names are ordering values; column values are the related keys]
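A self-managed one-to-many index can be sketched with two maps: the data itself, and an index "column family" with one row per indexed value whose column names are the related keys. All names here are hypothetical, and the point of the sketch is that the application, not the database, has to keep the index in step with the data.

```python
users = {}           # row key -> {column name: value}
users_by_state = {}  # index: indexed value -> {related key: ""}


def save_user(key, state):
    """Write the user row and maintain the index row by hand."""
    old = users.get(key)
    if old is not None:
        # Remove the stale index entry before writing the new one.
        users_by_state.get(old["state"], {}).pop(key, None)
    users[key] = {"state": state}
    users_by_state.setdefault(state, {})[key] = ""


def find_by_state(state):
    """Read one index row; its column names are the matching row keys."""
    return sorted(users_by_state.get(state, {}))


save_user("jsmith", "TX")
save_user("user1", "TX")
save_user("jsmith", "CA")   # moves jsmith from the TX index row to CA
print(find_by_state("TX"))
print(find_by_state("CA"))
```

For the ordered variant, the column name would be the ordering value (for example a timestamp) and the column value the related key, so that a slice over the row comes back in order.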
35. Let’s practice: Twitter
• Get a user record by username
• Get the friends of a username
• Get the followers of a username
• Get a timeline for a user
• Get a timeline of a specific user’s tweets
• Get a tweet from a tweet ID
• Create a tweet
• Create a user
• Add friends to a user
• Remove friends from a user
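One possible answer to the exercise, sketched with Python dicts standing in for column families. The layout (one row per user in each of `friends`, `followers`, `userline`, and `timeline`, with tweets fanned out to followers at write time) is an assumption for illustration, not the only reasonable design; all names are hypothetical, and a simple counter stands in for time-ordered tweet IDs.

```python
import itertools

users, friends, followers = {}, {}, {}   # username -> columns
tweets, userline, timeline = {}, {}, {}  # tweet id / username -> columns
_next_id = itertools.count(1)            # stand-in for time-ordered IDs


def create_user(username, password):
    users[username] = {"password": password}


def add_friend(username, friend):
    """username starts following friend; keep both directions in sync."""
    friends.setdefault(username, {})[friend] = ""
    followers.setdefault(friend, {})[username] = ""


def remove_friend(username, friend):
    friends.get(username, {}).pop(friend, None)
    followers.get(friend, {}).pop(username, None)


def create_tweet(username, body):
    tweet_id = next(_next_id)
    tweets[tweet_id] = {"username": username, "body": body}
    userline.setdefault(username, {})[tweet_id] = ""
    # Fan out on write: the author and every current follower get a column.
    for reader in [username, *followers.get(username, {})]:
        timeline.setdefault(reader, {})[tweet_id] = ""
    return tweet_id


def read_line(line, username, limit=20):
    """Newest-first tweets from a userline or timeline row."""
    ids = sorted(line.get(username, {}), reverse=True)[:limit]
    return [tweets[i] for i in ids]


create_user("alice", "secret")
create_user("bob", "secret")
add_friend("bob", "alice")
create_tweet("alice", "hello")
print(read_line(timeline, "bob"))
```

Every query in the list becomes a single row read: `users[name]`, `friends[name]`, `followers[name]`, `timeline[name]`, `userline[name]`, and `tweets[tweet_id]`.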