Add a bit of ACID to Cassandra. Cassandra Summit EU 2014

Add a Bit of ACID to Cassandra
Oleg Anastasyev
Lead Platform Developer
ok.ru

ok.ru
* 45M daily, 80M monthly audience
* Top 4 social networking site
* Top 7 in the world by total time on site (comScore data, July 2014, desktops, users aged 15+)
* ~ 500,000 HTTP requests/sec
* > 400 Gbps out
* > 8000 physical servers in 5 DCs, ~1 ms ping

Cassandra at ok.ru
* Since 2010
- 0.6-ok, 1.2, 2.0
* In 2014
- 33 clusters
- > 600 storage nodes
- 330 TB
* Fastest: 1.5M ops (48 nodes)
* Largest: 130 TB (96 nodes)

SQL Server 2005
* Consistent (ACID) OLTP data
* 200 servers, 50 TB of data
* Sharding
• F(Entity_Id) -> Token -> SQL Server Node
• F(Master_Id) === F(Detail_Id)
* Local node commit only
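The sharding rule above can be sketched in Java as follows. This is only an illustration under stated assumptions: the class name `ShardRouter`, the 1024-slot token space, and the modulo placement are hypothetical and not taken from the talk. The one property it is meant to show is that a detail row is routed by its master's id, so master and detail always land on the same node and a local commit suffices.

```java
import java.util.List;

/** Hypothetical router: entity id -> token -> SQL Server node. */
final class ShardRouter {
    private final List<String> nodes;

    ShardRouter(List<String> nodes) {
        if (nodes.isEmpty()) throw new IllegalArgumentException("no nodes");
        this.nodes = List.copyOf(nodes);
    }

    /** F(Entity_Id) -> Token. The 1024-slot token space is an arbitrary choice. */
    static int token(long entityId) {
        return (int) Math.floorMod(entityId, 1024L);
    }

    /** Token -> node. Plain modulo placement, chosen only for brevity. */
    String nodeForToken(int token) {
        return nodes.get(token % nodes.size());
    }

    /** A master row is routed by its own id. */
    String nodeForMaster(long masterId) {
        return nodeForToken(token(masterId));
    }

    /** A detail row is routed by its master's id, so F(Master_Id) === F(Detail_Id). */
    String nodeForDetail(long masterId, long detailId) {
        return nodeForToken(token(masterId));
    }
}
```

With three nodes, `nodeForMaster(42)` and `nodeForDetail(42, 9001)` return the same node, which is what lets a transaction touching an album and its photos commit on one server.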

Fast SQL Server 2005
* DB JOIN
* Foreign key constraints
* Stored Procs, Triggers
* Read uncommitted (noTx)
* Short lived transactions <100ms
* No massive UPDATEs, DELETEs
* Always query on indexed data

Usual SQL shortcomings
* Manual “scale out” with downtime
* Downtime on maintenance
* Write performance
* BSoD, swap outs, magic
* Expensive HA hardware (10x 1U server price)
* Fragile failover
- ~ 10% failovers fail
* Downtime on DC failure or partition

Simple transaction in SQL Server
TX.start("Albums", id);               // begin the transaction
Album album = albums.lock(id);        // lock the album record
Photo photo = photos.create(…);
if (photo.status == PUBLIC) {
    album.incPublicPhotosCount();     // modify the locked album
}
TX.commit();
* Read - modify - write
* Involves a few records, different tables
* Possibility of concurrent transactions on 1 key

Usual NoSQL problems
* Learning curve
* Sophisticated development
- Often rewrite from scratch, data model and UI
- Often with omission of functionality
* Distributed programming means
- (A lot of) app-specific code around consistency, conflict resolution, retries and rollbacks
* Ad-hoc, fragile and buggy ACID implementation

We need a New Storage
* Fast to learn and develop
- ACID
- SQL
* Easy to operate and maintain:
- Read and modify on DC failure
- Automatic scale out w/o downtime
- Commodity hardware
* Fixable codebase (open source, Java)

TODO: SQL or NoSQL ?
SQL, still to do:
* Scale out
* Availability
- Cluster
- Conflict resolution
- SQL
- OR -
NoSQL, still to do:
* ACID
* SQL
* Cassandra 2 CQL

Cassandra 2.0
* Implements out of the box
- CQL
- Automatic scale out
- Good write perf
- Quorums, speculative retry ( see also CASSANDRA-6866 )
- Logged Batch
- “Lightweight” transactions ?
• Read - modify - write
• Possibility of concurrent transactions on 1 key
• Involves a few records, different tables
• “3 phase commit” -> slow

Cassandra 2.0
* Implements out of the box
- CQL
- Automatic scale out
- Good write perf ( https://github.com/jbellis/YCSB )
- Quorums, speculative retry ( see also CASSANDRA-6866 )
- Logged Batch
- “Lightweight” transactions
- Secondary indexes ?

C*One
* ACID transactions
- No SpOF, DC failure resistant
- Across multiple tables and partitions
- Commits and rollbacks
* First class indexes
- No additional coding
- Online build on existing data

C*One
[Diagram: clients, C*One update services, and C* storage nodes, all connected through Cassandra Gossip & Messaging, which carries the “Heartbeat”, schema, partitioner, and cluster topology.]

Clients
[Diagram: > 800 clients (all Java)]
* Fat client mode
* Client is its own coordinator
* Faster
* One fewer point of failure -> more reliable

Clients
[Diagram: two client paths, one labeled NoTx and one labeled In Tx; the In Tx path goes through the C*One update services.]

C*One Update Services
* Manages pessimistic locks
* Generates monotonic timestamps for cells (Lamport timestamps, http://en.wikipedia.org/wiki/Lamport_timestamps)
* Manages transactions
* Failure management
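A Lamport clock, as referenced in the link above, can be sketched in a few lines of Java. This is a generic illustration of the technique, not the C*One implementation; the class name `LamportClock` and its method names are hypothetical.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Generic Lamport clock sketch; names are illustrative only. */
final class LamportClock {
    private final AtomicLong counter = new AtomicLong();

    /** Local event (for example, stamping a cell): strictly greater than any earlier value. */
    long tick() {
        return counter.incrementAndGet();
    }

    /** On seeing a timestamp from another node: move past both the local and the remote value. */
    long observe(long remote) {
        return counter.updateAndGet(local -> Math.max(local, remote) + 1);
    }
}
```

Every value handed out is larger than the previous one, and after `observe(remote)` it is also larger than anything the remote node had issued, which is the monotonicity the cell timestamps rely on.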

Locks mgmt
[Diagram: ring of C*One update services labeled 00, 10, 20, 30, 40, 50.]
* Transaction Group Masters
* Simple in-memory locking

Failure detection: heartbeat and quorum
[Diagram: update services 00, 10, 20, 30, 40, 50 spread across DC-1, DC-2, DC-3.]
* Each-to-every heartbeat
* Quorum cluster view (I am dead if a quorum says so)
* 50 ms tick
* G1 GC
* 200 ms till failure detection
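The quorum rule can be sketched as a small single-process simulation in Java. Only the 200 ms detection window comes from the slide; the class `QuorumFailureDetector`, its methods, and the choice of a simple majority as the quorum are assumptions made for illustration. In the real system each node would hold only its own observations and exchange views over the network.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Simulation sketch: a node is dead once a majority of members stopped hearing from it. */
final class QuorumFailureDetector {
    static final long FAILURE_AFTER_MS = 200; // from the slide; heartbeats tick every 50 ms

    private final Set<String> members;
    // lastSeen.get(observer).get(target) = time the observer last heard a heartbeat from target
    private final Map<String, Map<String, Long>> lastSeen = new HashMap<>();

    QuorumFailureDetector(Set<String> members) {
        this.members = Set.copyOf(members);
    }

    /** Records that node `to` received a heartbeat from node `from` at time nowMs. */
    void heartbeat(String from, String to, long nowMs) {
        lastSeen.computeIfAbsent(to, k -> new HashMap<>()).put(from, nowMs);
    }

    /** One observer's local opinion about one target. */
    boolean suspects(String observer, String target, long nowMs) {
        Long seen = lastSeen.getOrDefault(observer, Map.of()).get(target);
        return seen == null || nowMs - seen >= FAILURE_AFTER_MS;
    }

    /** Quorum view: dead only if a majority of all members suspect the target. */
    boolean isDead(String target, long nowMs) {
        long votes = members.stream()
                .filter(m -> !m.equals(target))
                .filter(m -> suspects(m, target, nowMs))
                .count();
        return votes >= members.size() / 2 + 1;
    }
}
```

With six members the quorum is four, so a node that only two or three peers have lost contact with (for example, during a partial network fault) is not declared dead.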

Failure management
[Diagram: update service 50 with substitutes 50’ and 50”; > 800 clients send start Tx.]
* Master election protocol
* Speculative transaction start

Unborn transactions
* Transaction start request queue
- (in substitute’s memory)
- Thrown away after timeout
* On range master failure
- the queue is processed
- “started” replies are sent to clients (declined if the transaction is already open)

Tx start
[Diagram: clients and the update service RAM holding the locks table and the transaction state; row id=1, a=1, b=1.]
1. StartTx
2. Lock
3. Read
4. Cache

Tx write
[Diagram: clients and the update service RAM holding the locks table and the transaction state; row id=1, a=1, b=1 with new values 2, 2.]
1. UPDATE
2. File

Tx read
[Diagram: clients and the update service RAM holding the locks table and the transaction state; row id=1 with the values 1 and 2 overlaid for a and b.]
1. Read
2. Read ?
3. resolve()

Tx commit
[Diagram: clients and the update service RAM holding the locks table and the transaction state; row id=1, a=2, b=2.]
1. Commit
2-3. LOGGED BATCH
4. Ack

Tx rollback
[Diagram: clients and the update service RAM holding the locks table and the transaction state; row id=1, a=2, b=2.]
1. Rollback
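The start, write, read, commit, and rollback slides can be summarized in one small Java sketch. It is a deliberately simplified model, not the C*One code: the class `TxSketch`, its method names, and the use of an in-memory map as a stand-in for the Cassandra storage nodes are all assumptions. In particular, the commit here copies the pending rows into that map where the real system would issue a logged batch, and reads inside the transaction are answered from the cached state instead of the read-and-resolve() step shown on the slide.

```java
import java.util.HashMap;
import java.util.Map;

/** Simplified model of an update service; all names are hypothetical. */
final class TxSketch {
    /** Stand-in for C* storage: row key -> column -> value. */
    final Map<String, Map<String, Integer>> storage = new HashMap<>();
    /** Locks table: row key -> id of the transaction holding the lock. */
    private final Map<String, Long> locks = new HashMap<>();
    /** Transaction state: tx id -> row key -> cached and modified columns. */
    private final Map<Long, Map<String, Map<String, Integer>>> pending = new HashMap<>();
    private long nextTxId = 1;

    /** 1. StartTx, 2. Lock, 3. Read, 4. Cache. */
    synchronized long start(String rowKey) {
        if (locks.containsKey(rowKey)) throw new IllegalStateException("row is locked: " + rowKey);
        long tx = nextTxId++;
        locks.put(rowKey, tx);
        Map<String, Integer> cached = new HashMap<>(storage.getOrDefault(rowKey, Map.of()));
        pending.computeIfAbsent(tx, k -> new HashMap<>()).put(rowKey, cached);
        return tx;
    }

    /** UPDATE inside a transaction only changes the in-memory transaction state. */
    synchronized void update(long tx, String rowKey, String column, int value) {
        Long owner = locks.get(rowKey);
        if (owner == null || owner != tx) throw new IllegalStateException("lock not held");
        pending.get(tx).get(rowKey).put(column, value);
    }

    /** The owning transaction sees its own writes; everyone else sees committed data. */
    synchronized Map<String, Integer> read(Long tx, String rowKey) {
        if (tx != null && tx.equals(locks.get(rowKey))) {
            return Map.copyOf(pending.get(tx).get(rowKey));
        }
        return Map.copyOf(storage.getOrDefault(rowKey, Map.of()));
    }

    /** Commit: write all pending rows at once (stand-in for a LOGGED BATCH), then unlock. */
    synchronized void commit(long tx) {
        Map<String, Map<String, Integer>> rows = pending.remove(tx);
        if (rows == null) throw new IllegalStateException("unknown transaction");
        rows.forEach((key, cols) -> storage.put(key, new HashMap<>(cols)));
        locks.values().removeIf(owner -> owner == tx);
    }

    /** Rollback: drop the transaction state and release the locks; storage is untouched. */
    synchronized void rollback(long tx) {
        pending.remove(tx);
        locks.values().removeIf(owner -> owner == tx);
    }
}
```

Because nothing reaches storage before commit, rollback is just forgetting the in-memory state, and readers outside the transaction never observe half-applied changes.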

ACID
* Atomicity
- logged batch or nothing
* Consistency
- application, rollback
* Isolation
- Locks
- Read Committed
* Durability
- quorum reads and writes to Cassandra

Indexes in Cassandra 2
CREATE TABLE photos (
    id bigint PRIMARY KEY,
    owner bigint,
    modified timestamp
);
SELECT * FROM photos
WHERE owner=?
AND modified>?;
* CREATE INDEX (owner, modified) ?
- No composite index support
- Poor fit for high-cardinality columns
- Doesn’t scale (synchronous full cluster scan on read)
- Max 100K tombstones per index

Global Indexes in C*One
Table photos (primary key: id)
id  owner  modified   caption      access  …
1   111    9.10.2014  “kitty cat”  PUB     …
INDEX i1 ON photos (owner, modified)
VALUES (caption,access,…);
Index table i1_photo (primary key: owner, modified, id; partition key first, then clustering key)
owner  modified   id  caption      access  …
111    9.10.2014  1   “kitty cat”  PUB     …
The query
SELECT *
WHERE owner=?
AND modified>?
is served from the index table:
SELECT * FROM i1_photo
WHERE owner=?
AND modified>?

UPDATE
[Diagram: clients, the update service RAM holding the transaction state (row id=1, a and b changing from 1 to 2) and the schema, and the index entry idx: a=2, b=2, id=1.]
2. idxwrites()
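The idea of deriving index writes from a row update can be sketched in Java with in-memory maps. This is an illustration only: the class `PhotoIndexSketch` and its methods are hypothetical, timestamps are reduced to plain `long` values, and the maps stand in for the `photos` table and the `i1_photo` index table. The index is keyed by owner (partition) and ordered by modified (clustering), so the range query touches a single partition.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory stand-in for a base table plus its global index; names are hypothetical. */
final class PhotoIndexSketch {
    /** Base table photos: id -> {owner, modified}. */
    private final Map<Long, long[]> photos = new HashMap<>();
    /** Index i1: owner (partition key) -> modified (clustering key) -> photo ids. */
    private final Map<Long, TreeMap<Long, List<Long>>> i1 = new HashMap<>();

    /** Writes the row and derives the matching index writes from old and new values. */
    void upsert(long id, long owner, long modified) {
        long[] old = photos.put(id, new long[] {owner, modified});
        if (old != null) {
            // Remove the stale index entry that pointed at the old (owner, modified).
            TreeMap<Long, List<Long>> partition = i1.get(old[0]);
            List<Long> ids = partition.get(old[1]);
            ids.remove(Long.valueOf(id));
            if (ids.isEmpty()) partition.remove(old[1]);
        }
        i1.computeIfAbsent(owner, k -> new TreeMap<>())
          .computeIfAbsent(modified, k -> new ArrayList<>())
          .add(id);
    }

    /** SELECT ... WHERE owner=? AND modified>? answered from one index partition. */
    List<Long> idsByOwnerModifiedAfter(long owner, long modifiedAfter) {
        List<Long> result = new ArrayList<>();
        TreeMap<Long, List<Long>> partition = i1.get(owner);
        if (partition != null) {
            partition.tailMap(modifiedAfter, false).values().forEach(result::addAll);
        }
        return result;
    }
}
```

Since the index entry is computed while the row is locked inside the transaction, it can be committed together with the row itself, which is what keeps the index consistent with the data.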

ACID
* Indexes “a la SQL”
- Consistent
- On more than 1 column
- Scalable and fast
- Built into CQL
- No additional coding required
- Very little penalty (+1 write)

Production: Photos
* 11 billion photos
* 80k reads/sec, 2k-8k tx/sec
* SQL
- RF=1 (+1 on RAID 10, +3 in backups)
- 32 MS SQL + 16 standby + 10 backup = 58 servers
- load = 100%
* C*One
- RF=3 (in each DC)
- 63 C* + 6 update services = 69 servers, 1/3 of the price
- load = 30%

Photos: numbers
* Tx failures: 8500/day -> 85/day
* Avg Tx timespan: < 40 ms
* Commit latency avg: < 2 ms
* Read and write latency: avg < 2 ms, 99th percentile ~ 3 ms

C*
* 22 patches to issues.apache.org
- range tombstone and query fixes, optimizations, etc.
* Commit log on-the-fly compression (CASSANDRA-7994)
* Reliable always-retry policy (CASSANDRA-6866)
* Night of the Living Dead (CASSANDRA-7872)

THANK YOU !
Oleg Anastasyev
oa@ok.ru
ok.ru/oa
@m0nstermind
slideshare.net/m0nstermind
http://v.ok.ru
