MyCassandra (Full English Version)

Shunsuke Nakamura
/ @sunsuk7tp
Tokyo Institute of Technology
Master Course
Tokyo, Japan

[Figure: two panels, "Update latency in write-heavy workload" and "Read latency in read-heavy workload", each comparing a write-optimized and a read-optimized data store; an arrow marks the better direction.]

                   performance       storage engine   distribution
Apache HBase       write-optimized   Bigtable-like    centralized
Apache Cassandra   write-optimized   Bigtable-like    decentralized
Sharded MySQL      read-optimized    MySQL            centralized
Yahoo! Sherpa      read-optimized    MySQL            centralized

The storage engine determines which workloads a data store
handles efficiently.
The distribution architecture of a data store is independent of
its read and write performance characteristics.

For example, if the storage engine is exchanged for MySQL, how
do the read and write characteristics change?

Cassandra = Dynamo (distribution: P2P/decentralized) + Bigtable (storage engine)

MyCassandra = Dynamo (distribution: P2P/decentralized) + a selectable storage engine (MySQL, Bigtable, Redis, ...)

MyCassandra is a modular distributed data store.
  You can select a storage engine per keyspace. Criteria for choosing one:
  Index algorithm
  Read-optimized vs. write-optimized
  Sequential or random access
  Volatile or persistent
  Your experience with the storage engine

  MySQL (B+-Trees)
  read-optimized

  Bigtable (LSM-Tree)
  write-optimized; Cassandra’s original engine

  Redis (hash)
  in-memory, with asynchronous snapshots

  MongoDB (B-Tree)
  schema-less, document-oriented database

  KyotoCabinet (hash/B+-Tree)
  simple pluggable DBM (extended TokyoCabinet)

  You can adapt any data store to MyCassandra, a scalable data store.
•  RDB (MySQL/PostgreSQL)

  You can apply it to applications whose I/O characteristics change from phase to phase.
•  MapReduce: Map - Shuffle - Reduce
•  Full-text search: crawl - indexing - search

  You can apply it to any IaaS environment.
•  EC2 + RDS (MyCassandra with MySQL)

[Figure: maximum throughput (qps) for 40 clients with the Bigtable, MySQL, and Redis storage engines under Write Only, Write Heavy, Read Heavy, and Read Only workloads. The y-axis runs from 0 to 40000 qps; higher is better.]

  client
•  o.a.c.cli
•  o.a.c.avro/thrift server
  proxy
•  o.a.c.service.StorageProxy
  server
•  o.a.c.service.StorageService
•  o.a.c.db.ReadVerbHandler/RowMutationVerbHandler
  engine
•  o.a.c.db.Table (one per keyspace)
  o.a.c.db.commitlog
  o.a.c.db.ColumnFamilyStore (one per column family)
  o.a.c.db.engine.StorageEngineInterface
  o.a.c.db.engine.MySQLInstance, RedisInstance, MongoDBInstance, …

  Currently supported
•  put (key, cf)
  Insert/Update/Delete
•  get (key)
  At a minimum, you implement these two methods (put and get).
•  getRangeSlice (startWith, endWith, maxResults)
•  truncate/dropTable/dropDB
  Planned
•  secondaryIndex
•  expire
•  counter (Cassandra-0.8 ~)
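
A minimal sketch of what an engine plug-in along these lines could look like, written in Python for brevity rather than the Java of o.a.c.db.engine.StorageEngineInterface. The class names `StorageEngine` and `InMemoryEngine`, the method signatures, and the convention that `put(key, None)` means delete are illustrative assumptions, not the MyCassandra API; only the operation names come from the list above.

```python
from abc import ABC, abstractmethod
from bisect import bisect_left, insort


class StorageEngine(ABC):
    """Hypothetical engine contract: put and get are the required minimum."""

    @abstractmethod
    def put(self, key, cf):
        """Insert or update the serialized column family stored under key."""

    @abstractmethod
    def get(self, key):
        """Return the stored value for key, or None if it is absent."""


class InMemoryEngine(StorageEngine):
    """Toy engine: a dict for rows plus a sorted key list for range slices."""

    def __init__(self):
        self._rows = {}
        self._keys = []  # kept sorted so a range slice is a simple scan

    def put(self, key, cf):
        if cf is None:  # assumption: None stands for a delete
            if key in self._rows:
                del self._rows[key]
                self._keys.remove(key)
            return
        if key not in self._rows:
            insort(self._keys, key)
        self._rows[key] = cf

    def get(self, key):
        return self._rows.get(key)

    def get_range_slice(self, start_with, end_with, max_results):
        """Return up to max_results (key, value) pairs with start_with <= key <= end_with."""
        out = []
        for key in self._keys[bisect_left(self._keys, start_with):]:
            if key > end_with or len(out) >= max_results:
                break
            out.append((key, self._rows[key]))
        return out

    def truncate(self):
        self._rows.clear()
        self._keys.clear()
```

An engine that implements only `put` and `get` already satisfies the abstract base class; range slices and truncation are optional extras in this sketch.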

  The data model is the same as Cassandra’s.
•  Super columns are not supported yet.
•  Data is stored in the same key/value format as an SSTable.
  Support for NoSQL stores of any data model
•  This includes NoSQL stores whose data model has fewer dimensions than Cassandra’s.
•  A prefix is added to the primary key.
•  The prefix is the keyspace/column family name.

Cassandra       MySQL      Redis
keyspace        database   db
column family   table      record
column          field

RDB (MySQL): one database, one table per column family
  table A
    key     values
    sato    gender;male;age;17
    suzuki  gender;female;age;21;region;Tokyo
  table B
    key     values
    sato    visits;18;plan;Gold
    suzuki  visits;214;plan;Bronze

KVS (Redis): one db, keys prefixed with the column family name
    key       values
    A:sato    …
    B:ito     …
    A:suzuki  …
    B:tanaka  …

Bigtable (Cassandra): one keyspace
  columnfamily A
    key     gender  age  region
    sato    male    17   [null]
    suzuki  female  21   Tokyo
  columnfamily B
    key     visits  plan
    sato    18      Gold
    suzuki  214     Bronze
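
The key-prefix scheme for a flat key-value store can be sketched in a few lines of Python. The helper names `prefixed_key` and `split_key` and the ":" separator are assumptions chosen to match keys such as A:sato; a plain dict stands in for Redis.

```python
def prefixed_key(column_family, row_key, sep=":"):
    """Build the key stored in the flat KVS: '<column family><sep><row key>'."""
    return f"{column_family}{sep}{row_key}"


def split_key(stored_key, sep=":"):
    """Recover (column family, row key); only the first separator is significant."""
    column_family, row_key = stored_key.split(sep, 1)
    return column_family, row_key


# A dict stands in for the key-value store (hypothetical stand-in for Redis).
kvs = {}
kvs[prefixed_key("A", "sato")] = "gender;male;age;17"
kvs[prefixed_key("B", "sato")] = "visits;18;plan;Gold"
```

Because the prefix is part of the key, the same row key (here sato) can live in two column families without colliding.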

  Currently, a row is stored as a key and a value, where the value is a serialized object.
  (This can be changed easily.)
  A column can instead be mapped to a MySQL field.
•  This has smaller overhead, but a schema is needed.
  Specialized columns are added
•  For secondary search
•  For range queries

column            description
rowKey            primary key (the Key)
CF                serialized object (the Value)
counter           specialized column
secondary index   for secondary search
token             for range search

  A heterogeneous cluster
•  It combines multiple types of nodes that run different storage engines.
•  The replicas of each record are placed on nodes with different storage engines.
•  A proxy routes each query to the nodes that process it efficiently.

[Figure: a write query is sent synchronously to the W node (Bigtable) and asynchronously to the R node (MySQL); a read query is sent synchronously to the R node (MySQL) and asynchronously to the W node (Bigtable).]

•  W: write-optimized (e.g. Bigtable)
•  R: read-optimized (e.g. MySQL)
•  RW: memory-based (e.g. Redis)
  MyCassandra Cluster keeps the same consistency strength as Cassandra.
Quorum protocol: (write agreements) + (read agreements) > (replicas)
•  This protocol guarantees that a read gets at least one copy of the most recent value.
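
The quorum inequality can be checked directly. The following Python sketch tests the condition and, by brute force over all replica subsets, confirms that it is what makes every read set overlap every write set. The function names are illustrative assumptions.

```python
from itertools import combinations


def quorum_holds(replicas, write_agreements, read_agreements):
    """The quorum condition: (write agreements) + (read agreements) > (replicas)."""
    return write_agreements + read_agreements > replicas


def every_read_sees_a_write(replicas, write_agreements, read_agreements):
    """Brute force: does every possible read set share a node with every write set?"""
    nodes = range(replicas)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(nodes, write_agreements)
        for read_set in combinations(nodes, read_agreements)
    )
```

With 3 replicas, 2 write agreements and 2 read agreements satisfy the condition, so any read contacts at least one node that acknowledged the latest write; with 1 write agreement and 2 read agreements the sets can be disjoint.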

Our system needs one node that synchronously processes
both read and write queries:
a memory-based node (Redis).

[Figure: a memory-based RW node placed between the W node (Bigtable) and the R node (MySQL), serving both write and read queries.]

Setting: 3 replicas, 2 write agreements, W:RW:R = 1:1:1
•  W: write-optimized (e.g. Bigtable)
•  R: read-optimized (e.g. MySQL)
•  RW: memory-based (e.g. Redis)

[Figure: Client, Proxy, and the nodes responsible for a record (W, RW, R). The proxy waits for two acks for the write and returns; the write to the remaining node is asynchronous.]

1)  A proxy broadcasts the query to nodes.
2)  The proxy waits for two acks.
3a) write success: The proxy returns a success msg. to the client.
3b) write failure: The proxy waits for acks from total …
4)  The proxy asynchronously waits for acks from the remaining nodes.

Write latency: max (W, RW)
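
The write path can be illustrated with a small deterministic simulation in Python. It assumes the proxy returns as soon as the required number of acks has arrived, in order of arrival; the function name `quorum_write` and the latency figures are hypothetical and are not measurements from the slides.

```python
def quorum_write(ack_latency_ms, write_agreements=2):
    """Simulate a broadcast write.

    ack_latency_ms maps a node name to the time its ack takes to arrive.
    Returns (client-visible latency, nodes acked before the reply,
    nodes whose acks the proxy collects asynchronously afterwards).
    """
    arrivals = sorted(ack_latency_ms.items(), key=lambda item: item[1])
    sync_acks = arrivals[:write_agreements]
    async_acks = arrivals[write_agreements:]
    client_latency = sync_acks[-1][1]  # the reply waits for the slowest required ack
    return (
        client_latency,
        [node for node, _ in sync_acks],
        [node for node, _ in async_acks],
    )


# Hypothetical latencies: the read-optimized node is the slowest at writing.
latency, synchronous, asynchronous = quorum_write({"W": 1.0, "RW": 0.5, "R": 8.0})
```

When the R node is the slowest writer, the two acks that arrive first come from the RW and W nodes, so the client sees max (W, RW) and the slow write to R stays off the critical path.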

Setting: 3 replicas, 2 read agreements, W:RW:R = 1:1:1
•  W: write-optimized (e.g. Bigtable)
•  R: read-optimized (e.g. MySQL)
•  RW: memory-based (e.g. Redis)

[Figure: Client, Proxy, and the nodes responsible for a record (W, RW, R). The proxy checks consistency and returns the result; the consistency check against the remaining node is asynchronous.]

1)  A proxy sends a request to an R or RW node, and a digest request to the other replicas.
2)  The proxy waits for replies, including the specified record.
3a) success: If the record and the digests are consistent, the proxy returns the record to the client.
3b) failure or inconsistency: The proxy tries to read and collect digests until they satisfy the quorum.
4)  After replying to the client, the proxy waits for replies from the remaining nodes.
If there is an inconsistency, it is resolved using Read Repair.

Read latency: max (R, RW)
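
The digest comparison in the read path can be sketched as follows in Python. SHA-256 is used only as a convenient stand-in for whatever digest the system computes, and `digest` and `quorum_read` are hypothetical names; the fallback for an inconsistent read (re-reading and Read Repair) is left to the caller.

```python
import hashlib


def digest(value):
    """Digest of a record's bytes (SHA-256 is an assumption, used for illustration)."""
    return hashlib.sha256(value).hexdigest()


def quorum_read(data_reply, digest_replies, read_agreements=2):
    """Compare one full record with digests from other replicas.

    data_reply is (node, value) from the R or RW node; digest_replies is a list
    of (node, digest) in arrival order. Returns (value, True) once enough
    replicas agree, or (None, False) if the caller must re-read and repair.
    """
    _node, value = data_reply
    expected = digest(value)
    agreeing = 1  # the node that returned the full record counts as one agreement
    for _other, other_digest in digest_replies:
        if other_digest == expected:
            agreeing += 1
        if agreeing >= read_agreements:
            return value, True
    return None, False
```

Only one replica ships the full record; the others ship a short digest, which keeps the consistency check cheap on the network.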

[Figure: maximum throughput (query/sec) for 40 clients, Cassandra vs. MyCassandra Cluster, for Write-Only [100:0], Write-Heavy [50:50], Read-Heavy [5:95], and Read-Only [0:100] workloads ([write:read]). The y-axis runs from 0 to 20000; higher is better. The bars are annotated with the ratios ×0.90, ×6.53, ×1.54, and ×0.93.]
•  Benchmark: YCSB with a Zipfian distribution
•  Throughput was up to 6.53 times as high as that of Cassandra.
•  In the Write-Heavy workload, multiple read repairs occur.

 MyCassandra-0.2.2
•  secondaryIndex
  Apply to MySQL and MongoDB
 MyCassandra-0.3.0
•  Based on Cassandra-0.8
•  Atomic counter
•  Brisk (Hadoop + Cassandra)…

1.  Asynchronous deletion
2.  Engine failure detection
3.  Support for ad hoc query

  Cassandra’s delete/expire operation
•  Logical deletion using a tombstone
•  Actual deletion during SSTable compaction
This approach depends on the Bigtable engine.

  MyCassandra (MySQL, Redis, …)
•  Synchronous deletion (now)
•  The expire function works, but the expired data continues to exist.
•  Asynchronous deletion is a heavy operation
  It requires I/O over one big table, unlike an SSTable, which is only a subset of the data.
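
The two-phase deletion described for Cassandra (a tombstone first, physical removal at compaction) can be illustrated with a toy Python store. `TombstoneStore` and its methods are hypothetical names; a real engine would also track timestamps and a grace period, which this sketch omits.

```python
TOMBSTONE = object()  # marker meaning "logically deleted"


class TombstoneStore:
    """Toy store: delete() writes a marker, compact() removes marked rows."""

    def __init__(self):
        self._rows = {}

    def put(self, key, value):
        self._rows[key] = value

    def delete(self, key):
        # Logical deletion: cheap, the row is only marked.
        self._rows[key] = TOMBSTONE

    def get(self, key):
        value = self._rows.get(key)
        return None if value is TOMBSTONE else value

    def compact(self):
        """Physical deletion: drop every marked row and return how many were dropped."""
        dead = [key for key, value in self._rows.items() if value is TOMBSTONE]
        for key in dead:
            del self._rows[key]
        return len(dead)

    def stored_rows(self):
        return len(self._rows)
```

Readers stop seeing a row as soon as it is marked, while the space is reclaimed later in a batch; this is the step that has no direct counterpart when the engine is a single large MySQL table.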

  When only the storage engine fails:
how to detect the failure, and how the instance should behave

  With several storage engines and a partial failure:
how the instance should behave

[Figure: each instance detects the failure of its engine by periodic polling. On detection, what should the instance do? Treat it as an overall instance failure (node down)? Have another node take over?]

  Ad hoc queries and data models
•  If a feature does not depend on the distributed architecture, it can
be added easily.
  Data model of Redis (List, Set, ..)
  Document data model and ad hoc queries of MongoDB
•  If it does depend on it, it cannot be supported.
  Atomic queries across multiple keys
  Join

  It is important to determine whether a query
depends on the distributed mechanism.

 github
•  https://github.com/sunsuk7tp/MyCassandra/
 Twitter
•  @MyCassandraJP
•  @_MyCassandra # @MyCassandra had already been taken!!
•  @sunsuk7tp # my private account

 Google Groups
•  https://groups.google.com/group/my-cassandra
