Add a Bit of ACID to Cassandra 
Oleg Anastasyev 
Lead Platform Developer 
ok.ru
ok.ru 
* 45M daily, 80M monthly audience 
* Top 4 social networking site 
* Top 7 in the world by total time on site* 
  *comScore data, July 2014, desktop users aged 15+ 
* ~ 500,000 HTTP requests/sec 
* > 400 Gbps outbound 
* > 8000 bare-metal servers in 5 DCs, ~1ms ping
Cassandra at ok.ru 
* Since 2010 
- 0.6-ok, 1.2, 2.0 
* In 2014 
- 33 clusters 
- > 600 storage nodes 
- 330 TB 
* Fastest: 1.5M ops (48 nodes) 
* Largest: 130 TB (96 nodes)
SQL Server 2005 
* Consistent (ACID) OLTP data 
* 200 servers, 50 TB of data 
* Sharding 
• F(Entity_Id) -> Token -> SQL Server Node 
• F(Master_Id) === F(Detail_Id) 
* Local node commit only
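The sharding rule above can be sketched as follows. This is only an illustration, not ok.ru's actual function: the hash, the token space size and every name here (`ShardRouter`, `tokenFor`, `nodeForMaster`, `nodeForDetail`) are assumptions. The point it shows is that a detail row is routed by its master's id, so master and details land on the same node and a transaction can commit locally.

```java
import java.util.List;

// Hypothetical sketch: route entities to SQL Server nodes through a token.
final class ShardRouter {
    private static final int TOKENS = 256;   // assumed size of the token space
    private final List<String> nodes;        // assumed node names

    ShardRouter(List<String> nodes) {
        this.nodes = List.copyOf(nodes);
    }

    // F(Entity_Id) -> Token (the hash function is an assumption)
    static int tokenFor(long entityId) {
        return Math.floorMod(Long.hashCode(entityId), TOKENS);
    }

    // Token -> SQL Server node
    String nodeFor(int token) {
        return nodes.get(token % nodes.size());
    }

    String nodeForMaster(long masterId) {
        return nodeFor(tokenFor(masterId));
    }

    // A detail row (e.g. a photo) is routed by its master's id (e.g. the album),
    // so F(Master_Id) == F(Detail_Id) holds by construction.
    String nodeForDetail(long masterId, long detailId) {
        return nodeFor(tokenFor(masterId));
    }
}
```

With this routing, `nodeForMaster(42)` and `nodeForDetail(42, anyPhotoId)` always name the same node.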
Fast SQL Server 2005 
* DB JOIN 
* Foreign key constraints 
* Stored Procs, Triggers 
* Read uncommitted (noTx) 
* Short lived transactions <100ms 
* No massive UPDATEs, DELETEs 
* Always query on indexed data
Usual SQL shortcomings 
* Manual “scale out” with downtime 
* Downtime on maintenance 
* Write performance 
* BSoD, swap outs, magic 
* Expensive HA hardware (10x 1U server price) 
* Fragile failover 
- ~ 10% failovers fail 
* Downtime on DC failure or partition
Simple transaction in SQL Server 
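TX.start("Albums", id);               // start a transaction keyed by the album id 
Album album = albums.lock(id);        // lock the master record 
Photo photo = photos.create(…);       // create the detail record 
if (photo.status == PUBLIC) { 
    album.incPublicPhotosCount();     // modify the locked master record 
} 
TX.commit(); 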
* Read - modify - write 
* Involves a few records, different tables 
* Possibility of concurrent transactions on 1 key
Usual NoSQL problems 
* Learning curve 
* Sophisticated development 
- Often rewrite from scratch, data model and UI 
- Often with omission of functionality 
* Distributed programming means 
- (A lot of) app-specific code around consistency, conflict resolution, retries and rollbacks 
* Ad-hoc, fragile and buggy ACID implementation
We need a New Storage 
* Fast to learn and develop 
- ACID 
- SQL 
* Easy to operate and maintain: 
- Read and modify on DC failure 
- Automatic scale out w/o downtime 
- Commodity hardware 
* Fixable codebase (open source, Java)
TODO: 
SQL 
* Scale out 
* Availability 
- Cluster 
- Conflict resolution 
- SQL 
- OR - 
NoSQL ? 
* ACID 
* SQL 
- Cassandra 2 CQL
Cassandra 2.0 
* Implements out of the box 
- CQL 
- Automatic scale out 
- Good write perf 
- Quorums, speculative retry ( see also CASSANDRA-6866 ) 
- Logged Batch 
- “Lightweight” transactions ? 
  Read - modify - write 
  Possibility of concurrent transactions on 1 key 
  Involves a few records, different tables 
  “3 phase commit” -> slow
Cassandra 2.0 
* Implements out of the box 
- CQL 
- Automatic scale out 
- Good write perf ( https://github.com/jbellis/YCSB ) 
- Quorums, speculative retry ( see also CASSANDRA-6866 ) 
- Logged Batch 
- “Lightweight” transactions 
- Secondary indexes ?
C*One 
* ACID transactions 
- No SPOF, DC failure resistant 
- Across multiple tables and partitions 
- Commits and rollbacks 
* First class indexes 
- No additional coding 
- Online build on existing data
C*One 
[Diagram: clients, C*One update services and C* storage nodes, connected by Cassandra gossip & messaging, which carries the “heartbeat”, schema, partitioner and cluster topology]
Clients 
* > 800 clients (all Java) 
* Fat client mode 
* Client is its own coordinator 
* Faster 
* One point of failure fewer -> more reliable
Clients 
[Diagram: clients, with a NoTx path and an In Tx path through the C*One update services]
C*One Update Services 
* Manage pessimistic locks 
* Generate monotonic timestamps for cells 
- Lamport timestamp: http://en.wikipedia.org/wiki/Lamport_timestamps 
* Manage transactions 
* Failure management
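A Lamport-style clock of the kind referenced above can be sketched in a few lines. The slides do not show how the update services implement it; the class name `LamportClock`, its methods, and the choice to seed the counter from a wall clock reading are all assumptions made for illustration.

```java
import java.util.concurrent.atomic.AtomicLong;

// Minimal Lamport-style clock: issued timestamps never go backwards,
// even if the wall clock does or a peer has already seen a later value.
final class LamportClock {
    private final AtomicLong last = new AtomicLong();

    // Next timestamp for a local event, e.g. a cell write.
    // Using the wall clock as a lower bound is an assumption of this sketch.
    long next(long wallClockMicros) {
        return last.updateAndGet(prev -> Math.max(prev + 1, wallClockMicros));
    }

    // Merge a timestamp received from another node, then tick.
    long observe(long remote) {
        return last.updateAndGet(prev -> Math.max(prev, remote) + 1);
    }
}
```

Because every issued value is strictly greater than the previous one, two writes to the same cell that pass through one clock are always ordered.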
Locks mgmt 
[Diagram: ring of C*One update services labelled 00, 10, 20, 30, 40, 50] 
* Transaction Group Masters 
* Simple in-memory locking
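As a rough illustration of simple in-memory locking, a group master could keep a map from lock key to owning transaction. This is a sketch under assumptions: `LockTable`, the string key format and the non-blocking `tryLock` are hypothetical, and a real pessimistic lock manager would probably also queue waiters and time locks out.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory lock table held by the master of a transaction group.
final class LockTable {
    // lock key (e.g. "Albums:42") -> id of the transaction holding the lock
    private final ConcurrentHashMap<String, Long> owners = new ConcurrentHashMap<>();

    // Returns true if txId holds the lock after the call (re-entrant for the same tx).
    boolean tryLock(String key, long txId) {
        Long owner = owners.putIfAbsent(key, txId);
        return owner == null || owner == txId;
    }

    // Releases the lock only if txId is the current owner.
    boolean unlock(String key, long txId) {
        return owners.remove(key, txId);
    }
}
```

Since the table lives only in RAM, it is lost when the master dies, which is why the deck pairs it with failure management.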
Heartbeat, Quorum, Failure detection 
[Diagram: update services 00, 10, 20, 30, 40, 50 spread across DC-1, DC-2 and DC-3] 
* Each-to-every heartbeat 
* Quorum cluster view (I am dead if a quorum says so) 
* 50ms tick 
* G1 GC 
* 200ms till failure detection
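The quorum rule ("I am dead if Q say so") can be sketched like this. Only the 200ms figure comes from the slide; the class `QuorumView`, the per-observer map and the strict-majority definition of a quorum are assumptions of the sketch.

```java
import java.util.Map;

// Hypothetical sketch of a quorum cluster view.
final class QuorumView {
    static final long TIMEOUT_MS = 200;   // failure detection budget from the slide

    // One observer's opinion: has the node been silent for longer than the timeout?
    static boolean suspects(long lastHeartbeatMs, long nowMs) {
        return nowMs - lastHeartbeatMs > TIMEOUT_MS;
    }

    // A node is declared dead when a strict majority of observers suspects it.
    // lastSeenByObserver maps observer name -> time it last heard from the node.
    static boolean isDead(Map<String, Long> lastSeenByObserver, long nowMs) {
        long votes = lastSeenByObserver.values().stream()
                .filter(seen -> suspects(seen, nowMs))
                .count();
        return votes > lastSeenByObserver.size() / 2;
    }
}
```

A single observer that loses its link to a node cannot evict it on its own; it takes a majority of views to agree.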
Failure management 
[Diagram: > 800 clients, start Tx, range master 50 with 50’ and 50”] 
* Master election protocol 
* Speculative transaction start
Unborn transactions 
* Transaction start requests queue 
- (in substitute’s memory) 
- Thrown away after timeout 
* On range master failure 
- queue is processed 
- “started” replies are sent to clients (declined if already opened)
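A substitute's queue of not-yet-started transactions might look roughly like the sketch below. Everything here is hypothetical (`UnbornTxQueue`, its method names, the millisecond clock passed in by the caller); it only mirrors the two behaviours listed above: requests expire after a timeout, and the remaining ones are drained when the range master fails.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the start-request queue kept in a substitute's memory.
final class UnbornTxQueue {
    private static final class Request {
        final long txId;
        final long arrivedMs;
        Request(long txId, long arrivedMs) { this.txId = txId; this.arrivedMs = arrivedMs; }
    }

    private final ArrayDeque<Request> queue = new ArrayDeque<>();
    private final long timeoutMs;

    UnbornTxQueue(long timeoutMs) { this.timeoutMs = timeoutMs; }

    // Remember a speculative start request; old requests are thrown away.
    void onStartRequest(long txId, long nowMs) {
        dropExpired(nowMs);
        queue.addLast(new Request(txId, nowMs));
    }

    private void dropExpired(long nowMs) {
        while (!queue.isEmpty() && nowMs - queue.peekFirst().arrivedMs > timeoutMs) {
            queue.pollFirst();
        }
    }

    // On range master failure: drain the queue and return the transaction ids
    // for which a "started" reply should be sent to clients.
    List<Long> onMasterFailure(long nowMs) {
        dropExpired(nowMs);
        List<Long> started = new ArrayList<>();
        while (!queue.isEmpty()) {
            started.add(queue.pollFirst().txId);
        }
        return started;
    }
}
```

In this sketch the client side is left out: a client that already got its transaction opened by the old master would simply decline the late reply.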
Tx start 
[Diagram: clients; update service RAM holding the locks table and the transaction state; row id=1, a=1, b=1] 
1. StartTx 
2. Lock 
3. Read 
4. Cache
Tx write 
[Diagram: clients; update service RAM holding the locks table and the transaction state; row id=1, a=1, b=1; new values 2, 2] 
1. UPDATE 
2. File
Tx read 
[Diagram: clients; update service RAM holding the locks table and the transaction state; row id=1, a=12, b=12] 
1. Read 
2. Read ? 
3. resolve()
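The slide does not spell out what resolve() does. One plausible reading, shown here purely as an assumption, is that a read inside a transaction overlays the transaction's pending (uncommitted) column values on the row read from storage, so the transaction sees its own writes. `TxRead` and its signature are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of read resolution inside a transaction.
final class TxRead {
    // Start from the stored row, then let pending values from the
    // transaction state override the stored ones column by column.
    static <V> Map<String, V> resolve(Map<String, V> stored, Map<String, V> pending) {
        Map<String, V> row = new HashMap<>(stored);
        row.putAll(pending);
        return row;
    }
}
```

For a stored row `id=1, a=1, b=1` and pending values `a=2, b=2`, this yields `id=1, a=2, b=2`, while readers outside the transaction still see the stored row.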
Tx commit 
[Diagram: clients; update service RAM holding the locks table and the transaction state; row id=1, a=2, b=2] 
1. Commit 
2-3. LOGGED BATCH 
4. Ack
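At commit, the changes accumulated in the transaction state go to Cassandra as one logged batch. The sketch below only shows how such a batch could be assembled as CQL text; the helper `CommitBatch.toLoggedBatch` and the table name `t` in the usage note are hypothetical. In CQL, `BEGIN BATCH` without the `UNLOGGED` keyword is a logged batch.

```java
import java.util.List;

// Hypothetical sketch: wrap a transaction's statements into one logged batch.
final class CommitBatch {
    static String toLoggedBatch(List<String> statements, long timestampMicros) {
        // The batch-level timestamp stands in for the one issued by the update service.
        StringBuilder cql = new StringBuilder("BEGIN BATCH USING TIMESTAMP ")
                .append(timestampMicros).append('\n');
        for (String statement : statements) {
            cql.append("  ").append(statement).append(";\n");
        }
        return cql.append("APPLY BATCH;").toString();
    }
}
```

For example, `toLoggedBatch(List.of("UPDATE t SET a = 2, b = 2 WHERE id = 1"), 42)` produces a three-line batch that either applies as a whole or not at all, which is what the Atomicity bullet later in the deck relies on.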
Tx rollback 
[Diagram: clients; update service RAM holding the locks table and the transaction state; row id=1, a=2, b=2] 
1. Rollback
ACID 
* Atomicity 
- logged batch or nothing 
* Consistency 
- application, rollback 
* Isolation 
- Locks 
- Read Committed 
* Durability 
- quorum reads and writes to Cassandra
Indexes in Cassandra 2 
CREATE TABLE photos ( 
    id bigint PRIMARY KEY, 
    owner bigint, 
    modified timestamp 
); 
SELECT * FROM photos 
WHERE owner=? 
AND modified>? 
* CREATE INDEX (owner, modified) ? 
- No composite index support 
- High cardinality 
- Doesn’t scale (synchronous full cluster scan on read) 
- Max 100K tombstones per index
Global Indexes in C*One 
Table photos (primary key: id) 
id | owner | modified  | caption     | access | … 
1  | 111   | 9.10.2014 | “kitty cat” | PUB    | … 
INDEX i1 ON photos (owner, modified) 
VALUES (caption, access, …); 
Index table (primary key = partition key + clustering key: owner, modified, id) 
owner | modified  | id | caption     | access | … 
111   | 9.10.2014 | 1  | “kitty cat” | PUB    | … 
The query 
SELECT * FROM photos 
WHERE owner=? 
AND modified>? 
is served as 
SELECT * FROM i1_photo 
WHERE owner=? 
AND modified>?
UPDATE 
[Diagram: clients; update service RAM holding the transaction state (id=1, a=12, b=12) and the schema; index entry idx: a=2, b=2, id=1] 
2. idxwrites()
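What an idxwrites() step has to produce can be sketched for the photos example. This is an assumption about the mechanics, not the C*One code: the class `IndexWrites`, the `Photo` holder and the generated statements are hypothetical, and only the index table name `i1_photo` and its key columns come from the slides. When an indexed column changes, the old index row is deleted and a new one is written; otherwise a single write refreshes the index row.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: derive index-table writes from a base-row update.
final class IndexWrites {
    // Simplified photo row; modified is epoch milliseconds in this sketch.
    static final class Photo {
        final long id, owner, modified;
        final String caption;
        Photo(long id, long owner, long modified, String caption) {
            this.id = id; this.owner = owner; this.modified = modified; this.caption = caption;
        }
    }

    // before is null for a newly created row.
    static List<String> idxWrites(Photo before, Photo after) {
        List<String> out = new ArrayList<>();
        boolean keyChanged = before != null
                && (before.owner != after.owner || before.modified != after.modified);
        if (keyChanged) {
            // The old index row sits under a different key, so it must be removed.
            out.add("DELETE FROM i1_photo WHERE owner = " + before.owner
                    + " AND modified = " + before.modified + " AND id = " + before.id);
        }
        out.add("INSERT INTO i1_photo (owner, modified, id, caption) VALUES ("
                + after.owner + ", " + after.modified + ", " + after.id
                + ", '" + after.caption + "')");
        return out;
    }
}
```

These statements would travel in the same logged batch as the base-table write, which keeps the index consistent with the data.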
ACID 
* Indexes “a la SQL” 
- Consistent 
- On more than 1 column 
- Scalable and fast 
- Built into CQL 
- No additional coding required 
- Very little penalty (+1 write)
Production: Photos 
* 11 billion photos 
* 80k reads/sec, 2k-8k tx/sec 
* SQL 
- RF=1 (+1 on RAID 10, +3 in backups) 
- 32 MS SQL + 16 standby + 10 backup = 58 servers 
- load = 100% 
* C*One 
- RF=3 (in each DC) 
- 63 C* + 6 update services = 69 servers, 1/3 of the price 
- load = 30%
Photos: numbers 
* Tx failures: 8500/day -> 85/day 
* Avg Tx timespan: <40ms 
* Avg commit latency: <2ms 
* Read and write latency: avg <2ms, 99th percentile ~3ms
C* 
* 22 patches to issues.apache.org 
- range tombstone and query fixes, optimizations, etc. 
* Commit log on-the-fly compression (CASSANDRA-7994) 
* Reliable always-retry policy (CASSANDRA-6866) 
* Night of the Living Dead (CASSANDRA-7872)
THANK YOU ! 
Oleg Anastasyev 
oa@ok.ru 
ok.ru/oa 
@m0nstermind 
slideshare.net/m0nstermind 
http://v.ok.ru

Add a bit of ACID to Cassandra. Cassandra Summit EU 2014

Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 

Add a bit of ACID to Cassandra. Cassandra Summit EU 2014

  • 1. Add a Bit of ACID to Cassandra Oleg Anastasyev Lead Platform Developer ok.ru
  • 2. ok.ru * 45M daily, 80M monthly audience * Top 4 social networking site * Top 7 on total time on site in the world* * comScore data on July 2014, desktops, users of 15+ age * ~ 500,000 http reqs/sec * > 400 Gbps out * > 8000 iron servers in 5 DCs, ~1ms ping
  • 3. Cassandra at ok.ru * Since 2010 - 0.6-ok, 1.2, 2.0 * In 2014 - 33 clusters - > 600 storage nodes - 330 TB * Fastest: 1.5M ops (48 nodes) * Largest: 130 TB (96 nodes)
  • 4. SQL Server 2005 * Consistent (ACID) OLTP data * 200 servers, 50 TB of data * Sharding • F(Entity_Id) -> Token -> SQL Server Node • F(Master_Id) === F(Detail_Id) * Local node commit only
  • 5. Fast SQL Server 2005 * DB JOIN * Foreign key constraints * Stored Procs, Triggers * Read uncommitted (noTx) * Short lived transactions <100ms * No massive UPDATEs, DELETEs * Always query on indexed data
  • 6. Usual SQL shortcomings * Manual “scale out” with downtime * Downtime on maintenance * Write performance * BSoD, swap outs, magic * Expensive HA hardware (10x 1U server price) * Fragile failover - ~ 10% failovers fail * Downtime on DC failure or partition
  • 7. Simple transaction in SQL Server TX.start(“Albums”, id); Album album = albums.lock(id); Photo photo = photos.create(…); if (photo.status == PUBLIC ) { album.incPublicPhotosCount(); } TX.commit(); * Read - modify - write * Involves a few records, different tables * Possibility of concurrent transactions on 1 key
  • 8. Usual NoSQL problems * Learning curve * Sophisticated development - Often rewrite from scratch, data model and UI - Often with omission of functionality * Distributed programming means - (A lot of) app-specific code around consistency, conflict resolution, retries and rollbacks * Ad-hoc, fragile and buggy ACID implementation
  • 9. We need a New Storage * Fast to learn and develop - ACID - SQL * Easy to operate and maintain: - Read and modify on DC failure - Automatic scale out w/o downtime - Commodity hardware * Fixable codebase (OpenSource,Java)
  • 10. TODO: SQL * Scale out * Availability - Cluster - Conflict resolution - SQL NoSQL ? * ACID * SQL * Cassandra 2 CQL - OR -
  • 11. Cassandra 2.0 * Implements out of the box - CQL - Automatic scale out - Good write perf - Quorums, speculative retry ( see also CASSANDRA-6866 ) - Logged Batch - “Lightweight” transactions ? Read - modify - write Possibility of concurrent transactions on 1 key Involves a few records, different tables “3 phase commit” -> slow
  • 12. Cassandra 2.0 * Implements out of the box - CQL - Automatic scale out - Good write perf ( https://github.com/jbellis/YCSB ) - Quorums, speculative retry ( see also CASSANDRA-6866 ) - Logged Batch - “Lightweight” transactions - Secondary indexes ?
  • 13. C*One * ACID transactions - No SpOF, DC failure resistant - Across multiple tables and partitions - Commits and rollbacks * First class indexes - No additional coding - Online build on existing data
  • 14. Cassandra Gossip & Messaging clients C* Storage nodes “Heartbeat” Schema Partitioner Cluster topology C*One Update services C*One
  • 15. clients > 800 (all java) Clients * Fat client mode * Client is its own coordinator * Faster * -1 point of failure -> more reliable
  • 16. clients NoTx C*One Update services In Tx Clients
  • 17. C*One Update Srvs * Manages pessimistic locks * Generates monotonic timestamp for cells Lamport Timestamp http://en.wikipedia.org/wiki/Lamport_timestamps * Manages transactions * Failure management
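The monotonic cell timestamps mentioned on slide 17 could be generated roughly as follows. This is a minimal sketch, not the C*One code: the class `MonotonicTimestamps` and its method names are hypothetical, and the caller is assumed to pass the wall-clock time in explicitly.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch of a Lamport-style monotonic timestamp source. */
final class MonotonicTimestamps {
    private final AtomicLong last = new AtomicLong();

    /** Returns a value strictly greater than anything returned or observed before. */
    long next(long wallClockMicros) {
        return last.updateAndGet(prev -> Math.max(prev + 1, wallClockMicros));
    }

    /** Lamport rule: after seeing a timestamp from elsewhere, never fall behind it. */
    void observe(long remoteTimestamp) {
        last.accumulateAndGet(remoteTimestamp, Math::max);
    }

    public static void main(String[] args) {
        MonotonicTimestamps clock = new MonotonicTimestamps();
        long first = clock.next(System.currentTimeMillis() * 1000);
        long second = clock.next(System.currentTimeMillis() * 1000);
        System.out.println(first + " < " + second);
    }
}
```

Because each new value is at least the previous one plus one, two writes to the same cell from one update service never share a timestamp, even if the wall clock stalls or steps backwards.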
  • 18. 00 C*One Update services 10 20 30 50 40 Locks mgmt * Transaction Group Masters * Simple in-memory locking
  • 19. DC-1 DC-2 DC-3 00 10 20 30 50 40 * Each to every heartbeat * Quorum cluster view (I am dead if Q say so) * 50ms tick * G1 GC * 200ms till failure detection Heartbeat Quorum Failure detection
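The quorum rule on slide 19 ("I am dead if Q say so") can be illustrated with a small sketch. It assumes a single in-process table of heartbeat observations, which is a simplification: in a real cluster each node would hold its own view and exchange it with the others. The class and method names are hypothetical; only the 200 ms timeout comes from the slide.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: a node is dead once a quorum of the cluster stops hearing it. */
final class QuorumFailureDetector {
    static final long TIMEOUT_MS = 200; // "200ms till failure detection" on the slide

    // observer -> (peer -> time the observer last received a heartbeat from peer)
    private final Map<String, Map<String, Long>> lastSeen = new HashMap<>();
    private final int clusterSize;

    QuorumFailureDetector(int clusterSize) {
        this.clusterSize = clusterSize;
    }

    /** Records that observer received a heartbeat from peer at nowMs. */
    void heartbeat(String observer, String peer, long nowMs) {
        lastSeen.computeIfAbsent(observer, k -> new HashMap<>()).put(peer, nowMs);
    }

    /** True if a majority of the whole cluster has not heard from peer within the timeout. */
    boolean isDead(String peer, long nowMs) {
        int quorum = clusterSize / 2 + 1;
        int suspecting = 0;
        for (Map.Entry<String, Map<String, Long>> e : lastSeen.entrySet()) {
            if (e.getKey().equals(peer)) continue; // a node does not vote on itself
            Long seen = e.getValue().get(peer);
            if (seen == null || nowMs - seen > TIMEOUT_MS) suspecting++;
        }
        return suspecting >= quorum;
    }

    public static void main(String[] args) {
        QuorumFailureDetector fd = new QuorumFailureDetector(3);
        fd.heartbeat("A", "B", 0);
        fd.heartbeat("B", "A", 0);
        System.out.println("C dead at t=300: " + fd.isDead("C", 300));
    }
}
```

Requiring a quorum rather than a single observer means that one node with a broken link cannot declare a healthy master dead on its own.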
  • 20. Failure management 50 * Master election protocol * Speculative transaction start 50’ 50” clients > 800 start Tx
  • 21. Unborn transactions * Transaction start request queue - (in substitute’s memory) - Thrown away after timeout * On range master failure - queue is being processed - send started replies to clients (declines if already opened)
  • 22. Tx start RAM clients Locks table 1. StartTx Transaction state id=1, a=1, b=1 2. Lock 3. Read 4. Cache
  • 23. Tx write RAM Locks table Transaction state 1. UPDATE id=1, a=1, b=1 2. File 2, 2 clients
  • 24. Tx read RAM Locks table Transaction state 1. Read id=1, a=12, b=12 2. Read ? 3. resolve() clients
  • 25. Locks table Transaction state 1. Commit id=1, a=2, b=2 RAM 2 LOGGED BATCH 3 4. Ack Tx commit clients
  • 26. 1. Rollback RAM Locks table Transaction state id=1, a=2, b=2 Tx rollback clients
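The transaction lifecycle on slides 22 to 26 (start, write, read, commit, rollback) can be sketched in a few lines. This is an illustration under loud assumptions, not the C*One implementation: a plain `Map` stands in for Cassandra, each transaction covers a single key, and the logged batch is simulated by publishing all buffered writes in one step. All names are hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical single-key sketch of an update service holding locks and tx state in RAM. */
final class UpdateServiceSketch {
    // Stand-in for Cassandra: row key -> column values.
    final Map<String, Map<String, Object>> storage = new HashMap<>();
    // Pessimistic lock table.
    private final Set<String> locks = new HashSet<>();
    // Transaction state: a cached copy of the row plus uncommitted writes.
    private final Map<String, Map<String, Object>> txState = new HashMap<>();

    /** Tx start: lock the key and cache the current row. Returns false if already locked. */
    synchronized boolean start(String key) {
        if (!locks.add(key)) return false;
        txState.put(key, new HashMap<>(storage.getOrDefault(key, Collections.emptyMap())));
        return true;
    }

    /** Tx write: buffered in RAM only, nothing reaches storage yet. */
    synchronized void write(String key, String column, Object value) {
        requireOpen(key).put(column, value);
    }

    /** Tx read: the transaction sees its own uncommitted writes. */
    synchronized Object read(String key, String column) {
        return requireOpen(key).get(column);
    }

    /** Tx commit: publish all buffered writes at once, then release the lock. */
    synchronized void commit(String key) {
        storage.put(key, requireOpen(key));
        release(key);
    }

    /** Tx rollback: drop the buffered state; storage was never touched. */
    synchronized void rollback(String key) {
        requireOpen(key);
        release(key);
    }

    private Map<String, Object> requireOpen(String key) {
        Map<String, Object> state = txState.get(key);
        if (state == null) throw new IllegalStateException("no open transaction on " + key);
        return state;
    }

    private void release(String key) {
        txState.remove(key);
        locks.remove(key);
    }

    public static void main(String[] args) {
        UpdateServiceSketch svc = new UpdateServiceSketch();
        svc.start("1");
        svc.write("1", "a", 2);
        svc.commit("1");
        System.out.println(svc.storage);
    }
}
```

Keeping uncommitted writes out of storage is what makes rollback cheap in this sketch: it only has to forget the in-memory state.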
  • 27. ACID * Atomicity - logged batch or nothing * Consistency - application, rollback * Isolation - Locks - Read Committed * Durability - quorum reads and writes to Cassandra
  • 28. Indexes in Cassandra 2 CREATE TABLE photos ( id bigint primary key, owner bigint, modified timestamp ); SELECT * WHERE owner=? AND modified>? * CREATE INDEX (owner, modified ) ? - No composite index support - High cardinality - Doesn’t scale (synchronous full cluster scan on read) - Max 100K tombstones per index
  • 29. Global Indexes in C*One Primary Key id owner modified caption access … 1 111 9.10.2014 “kitty cat” PUB … INDEX i1 ON photos (owner, modified) VALUES (caption,access,…); Primary Key owner modified id caption access … 111 9.10.2014 1 “kitty cat” PUB … Partition Key Clustering Key SELECT * WHERE owner=? AND modified>? SELECT * FROM i1_photo WHERE owner=? AND modified>?
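The index layout on slide 29 amounts to a second table keyed by (owner, modified, id) that is maintained on every write to the base table. The sketch below models that idea with in-memory maps; it is an assumption-laden illustration, not C*One code. `GlobalIndexSketch`, `Photo`, and the method names are hypothetical, and timestamps are plain `long` values.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory model of a base table plus a global index on (owner, modified). */
final class GlobalIndexSketch {
    static final class Photo {
        final long id, owner, modified;
        final String caption;
        Photo(long id, long owner, long modified, String caption) {
            this.id = id; this.owner = owner; this.modified = modified; this.caption = caption;
        }
    }

    // Base table: primary key id.
    private final Map<Long, Photo> photos = new HashMap<>();
    // Index table: partition key owner -> clustering key (modified, id) -> copied values.
    private final Map<Long, TreeMap<Long, Map<Long, Photo>>> i1 = new HashMap<>();

    /** Writes the base row and keeps the index row in step with it. */
    void upsert(Photo p) {
        Photo old = photos.put(p.id, p);
        if (old != null) {
            // Remove the stale index row so the index never points at old values.
            TreeMap<Long, Map<Long, Photo>> part = i1.get(old.owner);
            Map<Long, Photo> bucket = part.get(old.modified);
            bucket.remove(old.id);
            if (bucket.isEmpty()) part.remove(old.modified);
        }
        i1.computeIfAbsent(p.owner, k -> new TreeMap<>())
          .computeIfAbsent(p.modified, k -> new TreeMap<>())
          .put(p.id, p);
    }

    /** Equivalent of SELECT * WHERE owner=? AND modified>? served from one index partition. */
    List<Photo> byOwnerModifiedAfter(long owner, long modified) {
        List<Photo> out = new ArrayList<>();
        TreeMap<Long, Map<Long, Photo>> part = i1.get(owner);
        if (part != null) {
            for (Map<Long, Photo> bucket : part.tailMap(modified, false).values()) {
                out.addAll(bucket.values());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        GlobalIndexSketch db = new GlobalIndexSketch();
        db.upsert(new Photo(1, 111, 10, "kitty cat"));
        db.upsert(new Photo(2, 111, 20, "dog"));
        System.out.println(db.byOwnerModifiedAfter(111, 10).size());
    }
}
```

The query touches only the partition for one owner, which is the property that lets this kind of index scale where a cluster-wide scan does not.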
  • 30. UPDATE RAM Transaction state id=1, a=12, b=12 Schema idx: a=2, b=2, id=1 2. idxwrites() Index clients
  • 31. ACID * Indexes “a la SQL” - Consistent - On more than 1 column - Scalable and fast - Built into CQL - No additional coding required - Very little penalty (+1 write)
  • 32. Production: Photos * 11 billion photos * 80k reads/sec, 2k-8k tx/sec * SQL - RF=1 (+1 on RAID 10, +3 in backups) - 32 MS SQL + 16 standby + 10 backup = 58 - load = 100% * C*One - RF=3 (in each DC) - 63 C* + 6 upd = 69, 1/3 price - load = 30%
  • 33. Photos: numbers * Tx failures 8500 /day -> 85/day * Avg Tx timespan: <40ms * Commit latency avg: <2ms * Read, write, avg <2ms, 99% ~ 3ms
  • 34. C* * 22 patches to issues.apache.org - range tombstone and query fixes, optimizations, etc. * Commit log on the fly compression (CASSANDRA-7994) * Reliable always retry policy (CASSANDRA-6866) * Night of the Living Dead (CASSANDRA-7872)
  • 35. THANK YOU ! Oleg Anastasyev oa@ok.ru ok.ru/oa @m0nstermind slideshare.net/m0nstermind http://v.ok.ru