OK.ru is one of the largest social networks for Russian-speaking audiences, with 80+ million unique visitors monthly. OK.ru has used Cassandra since 2010 and has made a number of improvements to the C* 2.0 and 2.1 codebase. Until recently, more than 50 TB of data in OK.ru OLTP systems was managed by Microsoft SQL Server. It is very expensive, hard to scale, and cannot save us from an outage if one of our data centers fails. We wanted new, fast, scalable and reliable storage for this data. The data requires ACID transaction support, so that we don’t have to rewrite all application code from scratch. C* does not support such transactions, only lightweight ones, so we implemented a new storage with ACID and selected features of the SQL world ourselves. Still, it has C* at its heart. We’ll discuss the internals of the new storage, which features of C* we had to alter and which we had to rewrite from scratch. We’ll also talk about our operational experience with it in production.
Add a bit of ACID to Cassandra. Cassandra Summit EU 2014
1. Add a Bit of ACID to Cassandra
Oleg Anastasyev
Lead Platform Developer
ok.ru
2. ok.ru
* 45M daily, 80M monthly audience
* Top 4 social networking site
* Top 7 in the world by total time on site*
* comScore data for July 2014, desktops, users aged 15+
* ~ 500,000 http reqs/sec
* > 400 Gbps out
* > 8000 physical servers in 5 DCs, ~1ms ping
3. Cassandra at ok.ru
* Since 2010
- 0.6-ok, 1.2, 2.0
* In 2014
- 33 clusters
- > 600 storage nodes
- 330 TB
* Fastest: 1.5M ops (48 nodes)
* Largest: 130TB (96 nodes)
4. SQL Server 2005
* Consistent (ACID) OLTP data
* 200 servers, 50 TB of data
* Sharding
- F(Entity_Id) -> Token -> SQL Server Node
- F(Master_Id) === F(Detail_Id)
* Local node commit only
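The routing rule on this slide can be sketched roughly as follows. This is a minimal illustration, not ok.ru's code: the class `ShardRouter`, the hash used for `F`, and the modulo mapping from token to node are all assumptions. The point it shows is that a detail row is routed by its master's id, so master and detail always land on the same node and a transaction can commit locally.

```java
import java.util.List;

// Hypothetical sketch of F(Entity_Id) -> Token -> SQL Server Node routing.
final class ShardRouter {
    private final List<String> nodes; // node names, one token bucket each

    ShardRouter(List<String> nodes) {
        if (nodes.isEmpty()) throw new IllegalArgumentException("no nodes");
        this.nodes = List.copyOf(nodes);
    }

    // F: maps an entity id to a non-negative token (an assumed hash, not ok.ru's).
    static int token(long entityId) {
        return Long.hashCode(entityId * 0x9E3779B97F4A7C15L) & Integer.MAX_VALUE;
    }

    // Token -> node: here simply the token modulo the node count (assumption).
    String nodeFor(long entityId) {
        return nodes.get(token(entityId) % nodes.size());
    }

    // A detail row (e.g. a photo) is routed by its master id (e.g. the album),
    // so F(Master_Id) === F(Detail_Id) holds by construction.
    String nodeForDetail(long masterId, long detailId) {
        return nodeFor(masterId);
    }

    public static void main(String[] args) {
        ShardRouter router = new ShardRouter(List.of("sql-01", "sql-02", "sql-03"));
        long albumId = 42L, photoId = 1001L;
        System.out.println("album -> " + router.nodeFor(albumId));
        System.out.println("photo -> " + router.nodeForDetail(albumId, photoId));
    }
}
```

Because both lookups resolve to one node, the "local node commit only" restriction is enough for master/detail transactions.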
5. Fast SQL Server 2005
* DB JOIN
* Foreign key constraints
* Stored Procs, Triggers
* Read uncommitted (noTx)
* Short lived transactions <100ms
* No massive UPDATEs, DELETEs
* Always query on indexed data
6. Usual SQL shortcomings
* Manual “scale out” with downtime
* Downtime on maintenance
* Write performance
* BSoD, swap outs, magic
* Expensive HA hardware (10x 1U server price)
* Fragile failover
- ~ 10% failovers fail
* Downtime on DC failure or partition
7. Simple transaction in SQL Server
TX.start("Albums", id);
Album album = albums.lock(id);
Photo photo = photos.create(…);
if (photo.status == PUBLIC) {
    album.incPublicPhotosCount();
}
TX.commit();
* Read - modify - write
* Involves a few records, different tables
* Possibility of concurrent transactions on 1 key
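To illustrate why the lock matters when concurrent transactions hit one key, here is a small in-memory sketch of the same read-modify-write pattern. It is not the ok.ru API: `AlbumCounters`, `addPhoto` and `publicCount` are hypothetical names, and a per-album `ReentrantLock` stands in for `albums.lock(id)`.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical in-memory stand-in for the album table; not the ok.ru API.
final class AlbumCounters {
    private final Map<Long, ReentrantLock> locks = new ConcurrentHashMap<>();
    private final Map<Long, Integer> publicPhotos = new ConcurrentHashMap<>();

    // Read-modify-write under a per-album lock, like albums.lock(id) on the slide.
    void addPhoto(long albumId, boolean isPublic) {
        ReentrantLock lock = locks.computeIfAbsent(albumId, id -> new ReentrantLock());
        lock.lock();
        try {
            int current = publicPhotos.getOrDefault(albumId, 0); // read
            if (isPublic) {
                publicPhotos.put(albumId, current + 1);          // modify + write
            }
        } finally {
            lock.unlock(); // plays the role of TX.commit() releasing the lock
        }
    }

    int publicCount(long albumId) {
        return publicPhotos.getOrDefault(albumId, 0);
    }

    public static void main(String[] args) throws InterruptedException {
        AlbumCounters albums = new AlbumCounters();
        Runnable uploader = () -> {
            for (int i = 0; i < 1000; i++) albums.addPhoto(1L, true);
        };
        Thread a = new Thread(uploader), b = new Thread(uploader);
        a.start();
        b.start();
        a.join();
        b.join();
        System.out.println(albums.publicCount(1L));
    }
}
```

Without the lock, two uploaders could both read the same counter value and one increment would be lost, which is exactly the anomaly the transaction on the slide prevents.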
8. Usual NoSQL problems
* Learning curve
* Sophisticated development
- Often rewrite from scratch, data model and UI
- Often with omission of functionality
* Distributed programming means
- (A lot of) app specific code around consistency, conflicts resolution, retries and rollbacks
* Ad-hoc, fragile and buggy ACID implementation
9. We need a New Storage
* Fast to learn and develop
- ACID
- SQL
* Easy to operate and maintain:
- Read and modify on DC failure
- Automatic scale out w/o downtime
- Commodity hardware
* Fixable codebase (open source, Java)
11. Cassandra 2.0
* Implements out of the box
- CQL
- Automatic scale out
- Good write perf
- Quorums, speculative retry (see also CASSANDRA-6866)
- Logged Batch
- “Lightweight” transactions ?
Read - modify - write
Possibility of concurrent transactions on 1 key
Involves a few records, different tables
“3 phase commit” -> slow
12. Cassandra 2.0
* Implements out of the box
- CQL
- Automatic scale out
- Good write perf (https://github.com/jbellis/YCSB)
- Quorums, speculative retry (see also CASSANDRA-6866)
- Logged Batch
- “Lightweight” transactions
- Secondary indexes ?
13. C*One
* ACID transactions
- No SpOF, DC failure resistant
- Across multiple tables and partitions
- Commits and rollbacks
* First class indexes
- No additional coding
- Online build on existing data
19. [Diagram: nodes 00, 10, 20, 30, 40, 50 spread across DC-1, DC-2 and DC-3; labels: Heartbeat, Quorum, Failure detection]
* Each to every heartbeat
* Quorum cluster view (I am dead if Q say so)
* 50ms tick
* G1 GC
* 200ms till failure detection
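The "I am dead if Q say so" rule can be sketched as below. This is only an illustration under assumptions: `QuorumFailureDetector` and `isDead` are hypothetical names, and the slide does not say exactly how the 200ms threshold is applied, so the sketch simply treats a peer as silent for an observer once that observer has not seen its heartbeat for more than 200ms.

```java
import java.util.List;

// Hypothetical sketch: every node reports when it last saw a peer's heartbeat;
// the peer is declared dead only if a quorum of observers no longer hears it.
final class QuorumFailureDetector {
    static final long FAILURE_TIMEOUT_MS = 200; // "200ms till failure detection"

    // lastHeardMs: one entry per observer, the time it last saw the peer's heartbeat.
    static boolean isDead(List<Long> lastHeardMs, long nowMs) {
        int quorum = lastHeardMs.size() / 2 + 1;
        long silent = lastHeardMs.stream()
                .filter(t -> nowMs - t > FAILURE_TIMEOUT_MS)
                .count();
        return silent >= quorum;
    }

    public static void main(String[] args) {
        long now = 1_000;
        // Two of three observers lost the peer more than 200ms ago.
        System.out.println(isDead(List.of(700L, 750L, 990L), now));
        // Only one observer lost the peer; the quorum still hears it.
        System.out.println(isDead(List.of(700L, 980L, 990L), now));
    }
}
```

Deciding by quorum rather than by a single observer means that one node with a broken link cannot evict a healthy peer on its own.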
21. Unborn transactions
* Transaction start request queue
- (in the substitute’s memory)
- Thrown away after timeout
* On range master failure
- the queue is processed
- “started” replies are sent to clients (declined if already opened)
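A rough sketch of such a queue, under stated assumptions: the class `UnbornTransactions`, its `StartRequest` fields and the method names are hypothetical, and the "decline if already opened" check is left out because the slide does not describe how it is performed.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch of the start-request queue held in the substitute's memory.
final class UnbornTransactions {
    static final class StartRequest {
        final String clientId;
        final long key;
        final long receivedAtMs;

        StartRequest(String clientId, long key, long receivedAtMs) {
            this.clientId = clientId;
            this.key = key;
            this.receivedAtMs = receivedAtMs;
        }
    }

    private final Deque<StartRequest> queue = new ArrayDeque<>();
    private final long timeoutMs;

    UnbornTransactions(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    // While the range master is alive the substitute only remembers the request.
    void remember(StartRequest request) {
        queue.addLast(request);
    }

    // Requests older than the timeout are thrown away (oldest sit at the head).
    void expire(long nowMs) {
        while (!queue.isEmpty() && nowMs - queue.peekFirst().receivedAtMs > timeoutMs) {
            queue.removeFirst();
        }
    }

    // On range master failure: hand back the still-valid requests so the
    // substitute can start them and send "started" replies to the clients.
    List<StartRequest> takeOver(long nowMs) {
        expire(nowMs);
        List<StartRequest> pending = new ArrayList<>(queue);
        queue.clear();
        return pending;
    }

    public static void main(String[] args) {
        UnbornTransactions unborn = new UnbornTransactions(150);
        unborn.remember(new StartRequest("client-a", 1L, 0));
        unborn.remember(new StartRequest("client-b", 2L, 100));
        for (StartRequest r : unborn.takeOver(200)) {
            System.out.println("start transaction for " + r.clientId + " on key " + r.key);
        }
    }
}
```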
27. ACID
* Atomicity
- logged batch or nothing
* Consistency
- application, rollback
* Isolation
- Locks
- Read Committed
* Durability
- quorum reads and writes to Cassandra
28. Indexes in Cassandra 2
CREATE TABLE photos (
    id bigint PRIMARY KEY,
    owner bigint,
    modified timestamp
);
SELECT * FROM photos
WHERE owner=?
AND modified>?
* CREATE INDEX (owner, modified)?
- No composite index support
- High cardinality
- Doesn’t scale (synchronous full cluster scan on read)
- Max 100K tombstones per index
29. Global Indexes in C*One
Table photos (primary key: id)
id   owner   modified    caption       access   …
1    111     9.10.2014   “kitty cat”   PUB      …

INDEX i1 ON photos (owner, modified)
VALUES (caption,access,…);

Index table (primary key: owner, modified, id; partition key + clustering key)
owner   modified    id   caption       access   …
111     9.10.2014   1    “kitty cat”   PUB      …

SELECT *
WHERE owner=?
AND modified>?
becomes
SELECT * FROM i1_photo
WHERE owner=?
AND modified>?
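The idea of storing the index as its own table keyed by (owner, modified, id) can be modelled in memory as follows. This is a simplified sketch, not C*One code: `PhotoIndex` and its methods are hypothetical, timestamps are plain `long` values, and a sorted `TreeMap` stands in for a partition ordered by its clustering key.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeMap;

// Hypothetical in-memory model of the index table i1 on photos (owner, modified).
final class PhotoIndex {
    // Index row key: (owner, modified, id), the same column order as on the slide.
    static final class Key {
        final long owner, modified, id;

        Key(long owner, long modified, long id) {
            this.owner = owner;
            this.modified = modified;
            this.id = id;
        }
    }

    private static final Comparator<Key> ORDER = Comparator
            .comparingLong((Key k) -> k.owner)
            .thenComparingLong(k -> k.modified)
            .thenComparingLong(k -> k.id);

    // Value: the extra columns copied into the index (only the caption here).
    private final TreeMap<Key, String> rows = new TreeMap<>(ORDER);

    void put(long id, long owner, long modified, String caption) {
        rows.put(new Key(owner, modified, id), caption);
    }

    // Models: SELECT * FROM i1_photo WHERE owner=? AND modified>?
    List<Long> idsByOwnerModifiedAfter(long owner, long modifiedAfter) {
        Key from = new Key(owner, modifiedAfter, Long.MAX_VALUE); // exclusive lower bound
        Key to = new Key(owner, Long.MAX_VALUE, Long.MAX_VALUE);  // inclusive upper bound
        List<Long> ids = new ArrayList<>();
        for (Key k : rows.subMap(from, false, to, true).keySet()) {
            ids.add(k.id);
        }
        return ids;
    }

    public static void main(String[] args) {
        PhotoIndex index = new PhotoIndex();
        index.put(1, 111, 100, "kitty cat");
        index.put(2, 111, 50, "older photo");
        index.put(3, 222, 200, "someone else's photo");
        System.out.println(index.idsByOwnerModifiedAfter(111, 50));
    }
}
```

The query touches only one contiguous range of one owner's rows, which is why an index laid out this way does not need the cluster-wide scan that a built-in secondary index performs on read.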
30. UPDATE
[Diagram: clients; transaction state in RAM (iid=1,, a=12,, b=12); schema; index (idx: a=2, b=2, id=1); step 2. idxwrites()]
31. ACID
* Indexes “a la SQL”
- Consistent
- On more than 1 column
- Scalable and fast
- Built into CQL
- No additional coding required
- Very little penalty (+1 write)
32. Production: Photos
* 11 billion photos
* 80k reads/sec, 2k-8k tx/sec
* SQL
- RF=1 (+1 on RAID 10, +3 in backups)
- 32 MS SQL + 16 standby + 10 backup = 58
- load = 100%
* C*One
- RF=3 (in each DC)
- 63 C* + 6 upd = 69, 1/3 price
- load = 30%
34. C*
* 22 patches to issues.apache.org
- range tombstone and query fixes, optimizations, etc.
* Commit log on the fly compression (CASSANDRA-7994)
* Reliable always retry policy (CASSANDRA-6866)
* Night of the Living Dead (CASSANDRA-7872)