Nadav Har'El, ScyllaDB
The Generalist Engineer meetup, Tel-Aviv
Ides of March, 2016
Seastar
Or how we implemented a 10-times faster Cassandra
2
● Israeli but multi-national startup company
– 15 developers cherry-picked from 10 countries.
● Founded 2013 (“Cloudius Systems”)
– by Avi Kivity and Dor Laor of KVM fame.
● Fans of open-source: OSv, Seastar, ScyllaDB.
3
Your mission, should you choose to accept it:
Make Cassandra 10 times faster
4
“Make Cassandra 10 times faster”
● Why 10?
● Why Cassandra?
– Popular NoSQL database (2nd to MongoDB).
– Powerful and widely applicable.
– Example of a wider class of middleware.
● Why “mission impossible”?
– Cassandra is not considered particularly slow.
– Considered faster than MongoDB, HBase, et al.
– “disk is bottleneck” (no longer, with SSD!)
5
Our first attempt: OSv
● New OS design specifically for cloud VMs:
– Run a single application per VM (“unikernel”)
– Run existing Linux applications (Cassandra)
– Run these faster than Linux.
6
OSv
● Some of the many ideas we used in OSv:
– Single address space.
– System call is just a function call.
– Faster context switches.
– No spin locks.
– Smaller code.
– Redesigned network stack (Van Jacobson).
7
OSv
● Writing an entire OS from scratch was a really
fun exercise for our generalist engineers.
● Full description of OSv is beyond the scope of
this talk. Check out:
– “OSv—Optimizing the Operating System for Virtual
Machines”, Usenix ATC 2014.
8
Cassandra on OSv
● Cassandra-stress, READ, 4 vcpu:
On OSv, 34% faster than Linux
● Very nice, but not even close to our goal.
What are the remaining bottlenecks?
9
Bottlenecks: API locks
● In one profile, we saw 20% of run time spent in lock()
and unlock() operations, most of them uncontended.
– POSIX APIs allow threads to share
● file descriptors
● sockets
– As many as 20 lock/unlock operations for each network packet!
● Uncontended locks were efficient on uniprocessor (UP) systems
(just a flag to disable preemption),
but atomic operations are slow on many cores.
10
Bottlenecks: API copies
● Write/send system calls copy user data to the
kernel
– Even on OSv with no user-kernel separation
– Part of the socket API
● Similar for read
11
Bottlenecks: context switching
● One thread per CPU is optimal; more than one costs:
– Context switch time
– Stacks consume memory and pollute the CPU cache
– Thread imbalance
● Requires fully non-blocking APIs
– Cassandra uses mmap() for disk….
12
Bottlenecks:
unscalable applications
● Contended locks ruin scalability to many cores
– Memcached's counter and shared cache
● Solution: per-cpu data.
● Even lock-free atomic algorithms are unscalable
– Cache line bouncing
● Again, better to shard, not share, data.
– Becomes worse as core count grows
● NUMA
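The "shard, not share" idea can be sketched in standard C++ (this is an illustration, not code from OSv or Seastar). The names `padded_counter` and `count_sharded`, and the 64-byte cache-line size, are assumptions made for this example. Each thread increments only its own counter, so no lock or atomic operation is needed and no cache line bounces between cores:

```cpp
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// One counter per thread, padded to its own cache line so that threads
// never write to the same line (64 bytes is an assumed cache-line size).
struct alignas(64) padded_counter {
    std::uint64_t value = 0;
};

// Each thread increments only its own shard: no locks and no atomics.
// The shards are summed only after all threads have been joined.
std::uint64_t count_sharded(unsigned nthreads, std::uint64_t increments_per_thread) {
    assert(nthreads > 0);
    std::vector<padded_counter> shards(nthreads);
    std::vector<std::thread> threads;
    threads.reserve(nthreads);
    for (unsigned t = 0; t < nthreads; ++t) {
        threads.emplace_back([&shards, t, increments_per_thread] {
            for (std::uint64_t i = 0; i < increments_per_thread; ++i) {
                ++shards[t].value;  // private to thread t
            }
        });
    }
    for (auto& th : threads) {
        th.join();
    }
    std::uint64_t total = 0;
    for (const auto& s : shards) {
        total += s.value;
    }
    return total;
}
```

Calling `count_sharded(4, 100000)` returns 400000. A shared `std::atomic` counter would give the same total, but every increment would contend for one cache line.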
13
Therefore
● Need to provide better APIs for server
applications
– Not file descriptors, sockets, threads, etc.
● Need to write better applications.
14
Framework
● One thread per CPU
– Event-driven programming
– Everything (network & disk) is non-blocking
– How to write complex applications?
15
Framework
● Sharded (shared-nothing) applications
– Important!
16
Framework
● Language with no runtime overheads or built-in
data sharing
17
Seastar
● C++14 library
● For writing new high-performance server applications
● Share-nothing model, fully asynchronous
● Futures & Continuations based
– Unified API for all asynchronous operations
– Compose complex asynchronous operations
– The key to complex applications
● (Optionally) full zero-copy user-space TCP/IP (over DPDK)
● Open source: http://www.seastar-project.org/
18
Seastar linear scaling in #cores
19
Seastar linear scaling in #cores
20
Brief introduction to Seastar
21
Sharded application design
● One thread per CPU
● Each thread handles one shard of data
– No shared data (“share nothing”)
– Separate memory per CPU (NUMA aware)
– Message-passing between CPUs
– No locks or cache line bounces
● Reactor (event loop) per thread
● User-space network stack also sharded
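A minimal sketch of this design in standard C++ follows; it is not Seastar code. The `request` type and the functions `shard_of` and `run_sharded` are hypothetical names for this example. It also simplifies the message passing: requests are routed into per-shard inboxes before the threads start, whereas a real reactor would exchange messages between cores while running.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <thread>
#include <vector>

// A hypothetical request: add `delta` to the value stored under `key`.
struct request {
    std::string key;
    int delta;
};

// Every key is owned by exactly one shard.
std::size_t shard_of(const std::string& key, std::size_t nshards) {
    return std::hash<std::string>{}(key) % nshards;
}

// Route each request to the inbox of the owning shard, then let one thread
// per shard apply its own inbox to its own private map. No thread touches
// another shard's data, so no locks are needed.
std::map<std::string, int> run_sharded(const std::vector<request>& requests,
                                       std::size_t nshards) {
    assert(nshards > 0);
    std::vector<std::vector<request>> inbox(nshards);
    for (const auto& r : requests) {
        inbox[shard_of(r.key, nshards)].push_back(r);  // simplified message passing
    }
    std::vector<std::map<std::string, int>> data(nshards);  // private per-shard state
    std::vector<std::thread> threads;
    threads.reserve(nshards);
    for (std::size_t s = 0; s < nshards; ++s) {
        threads.emplace_back([&inbox, &data, s] {
            for (const auto& r : inbox[s]) {
                data[s][r.key] += r.delta;
            }
        });
    }
    for (auto& th : threads) {
        th.join();
    }
    // Merge the shards only so the caller can inspect the result.
    std::map<std::string, int> merged;
    for (const auto& shard : data) {
        merged.insert(shard.begin(), shard.end());
    }
    return merged;
}
```

Because a key always hashes to the same shard, all updates to that key are applied by one thread, in order, without synchronization on the data itself.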
22
Futures and continuations
● Futures and continuations are the building
blocks of asynchronous programming in
Seastar.
● Can be composed together into a large, complex,
asynchronous program.
23
Futures and continuations
● A future is a result which may not be available yet:
– Data buffer from the network
– Timer expiration
– Completion of a disk write
– The result of a computation which requires the values
from one or more other futures.
● future<int>
● future<>
24
Futures and continuations
● An asynchronous function (also “promise”) is
a function returning a future:
– future<> sleep(duration)
– future<temporary_buffer<char>> read()
● The function sets up for the future to be fulfilled
– sleep() sets a timer to fulfill the future it returns
25
Futures and continuations
● A continuation is a callback, typically a lambda
executed when a future becomes ready
– sleep(1s).then([] {
      std::cerr << "done";
  });
● A continuation can hold state (lambda capture)
– future<int> slow_incr(int i) {
      return sleep(10ms).then(
          [i] { return i+1; });
  }
26
Futures and continuations
● Continuations can be nested:
– future<int> get();
  future<> put(int);
  get().then([] (int value) {
      put(value+1).then([] {
          std::cout << "done";
      });
  });
● Or chained:
– get().then([] (int value) {
      return put(value+1);
  }).then([] {
      std::cout << "done";
  });
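To show what chaining does mechanically, here is a toy single-threaded future in standard C++17. It is only a sketch: `toy_promise`, `toy_future`, `shared_state` and `demo_chain` are hypothetical names, and this is not how Seastar implements futures. It omits void futures, exceptions, and continuations that themselves return futures. Each `then()` stores the callback if the value is not ready yet, or runs it at once if it is, and returns a new future for the callback's result:

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <optional>
#include <type_traits>
#include <utility>

// Hypothetical toy types for illustration only.
template <typename T>
struct shared_state {
    std::optional<T> value;               // set once the result is available
    std::function<void(T)> continuation;  // set once then() has been called
};

template <typename T>
class toy_future;

template <typename T>
class toy_promise {
    std::shared_ptr<shared_state<T>> state_ = std::make_shared<shared_state<T>>();
public:
    toy_future<T> get_future() { return toy_future<T>(state_); }

    // Fulfil the future; run the continuation if one is already attached.
    void set_value(T v) {
        assert(!state_->value && "a future is fulfilled only once");
        if (state_->continuation) {
            state_->continuation(std::move(v));
        } else {
            state_->value = std::move(v);
        }
    }
};

template <typename T>
class toy_future {
    std::shared_ptr<shared_state<T>> state_;
public:
    explicit toy_future(std::shared_ptr<shared_state<T>> s) : state_(std::move(s)) {}

    // Attach a continuation; returns a future for the continuation's result.
    template <typename F>
    auto then(F f) -> toy_future<std::invoke_result_t<F, T>> {
        using R = std::invoke_result_t<F, T>;
        toy_promise<R> next;
        auto result = next.get_future();
        auto run = [f = std::move(f), next](T v) mutable {
            next.set_value(f(std::move(v)));
        };
        if (state_->value) {
            run(std::move(*state_->value));         // already ready: run now
        } else {
            state_->continuation = std::move(run);  // not ready: store for later
        }
        return result;
    }
};

// Builds a two-step chain first, then fulfils the first future.
int demo_chain() {
    toy_promise<int> p;
    int seen = 0;
    p.get_future()
        .then([](int v) { return v + 1; })
        .then([&seen](int v) { seen = v; return v; });
    p.set_value(41);  // stands in for "the asynchronous operation completed"
    return seen;
}
```

In `demo_chain()`, nothing runs while the chain is being built. Both continuations run, in order, inside `set_value(41)`, and the function returns 42.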
27
Futures and continuations
● Parallelism is easy:
– sleep(100ms).then([] {
      std::cout << "100ms\n";
  });
  sleep(200ms).then([] {
      std::cout << "200ms\n";
  });
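The reason this is parallel is that `sleep()` only registers a timer and returns immediately, so both timers are pending at the same time. The toy event loop below sketches that behaviour in standard C++ with virtual time. `toy_reactor`, `sleep_then` and `demo_parallel_sleeps` are hypothetical names for this example and do not reflect Seastar's reactor.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy event loop with virtual time; not Seastar's reactor.
class toy_reactor {
    struct timer {
        long deadline_ms;
        std::function<void()> callback;
    };
    // Orders the heap so the earliest deadline is on top.
    struct later_first {
        bool operator()(const timer& a, const timer& b) const {
            return a.deadline_ms > b.deadline_ms;
        }
    };
    std::priority_queue<timer, std::vector<timer>, later_first> timers_;
    long now_ms_ = 0;
public:
    // Register a continuation to run `delay_ms` from now; returns immediately.
    void sleep_then(long delay_ms, std::function<void()> callback) {
        assert(delay_ms >= 0);
        timers_.push(timer{now_ms_ + delay_ms, std::move(callback)});
    }

    // Run callbacks in deadline order, advancing virtual time as we go.
    void run() {
        while (!timers_.empty()) {
            timer next = timers_.top();
            timers_.pop();
            now_ms_ = next.deadline_ms;
            next.callback();
        }
    }

    long now_ms() const { return now_ms_; }
};

// Registers the 200 ms timer first, then the 100 ms timer.
std::vector<std::string> demo_parallel_sleeps() {
    toy_reactor reactor;
    std::vector<std::string> log;
    reactor.sleep_then(200, [&log] { log.push_back("200ms"); });
    reactor.sleep_then(100, [&log] { log.push_back("100ms"); });
    reactor.run();
    return log;
}
```

Even though the 200 ms timer is registered first, the log reads "100ms" then "200ms", and the loop finishes at virtual time 200 ms rather than 300 ms, because the two waits overlap.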
28
Futures and continuations
● In Seastar, every asynchronous operation is a
future:
– Network read or write
– Disk read or write
– Timers
– …
– A complex combination of other futures
● Useful for everything from writing a network stack to
writing a full, complex application.
29
Network zero-copy
● future<temporary_buffer> input_stream::read()
– temporary_buffer points at driver-provided pages, if
possible.
– Automatically discarded after use (C++).
● future<> output_stream::write(temporary_buffer)
– Future becomes ready when TCP window allows further
writes (usually immediately).
– Buffer discarded after data is ACKed.
30
Two TCP/IP implementations
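[Diagram: a single networking API sits on top of two stacks. The POSIX (hosted) stack uses Linux kernel sockets. The Seastar (native) stack runs user-space TCP/IP over an interface layer with DPDK, Virtio and Xen back ends; igb and ixgb are listed beneath them.]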
31
Disk I/O
● Asynchronous and zero copy, using AIO and
O_DIRECT.
● Not implemented well by all filesystems
– XFS recommended
● Focusing on SSD
● Future thoughts:
– Direct NVMe support.
– Implement filesystem in Seastar.
32
More info on Seastar
● http://seastar-project.com
● https://github.com/scylladb/seastar
● http://docs.seastar-project.org/
● http://docs.seastar-project.org/master/md_doc_tutorial.html
33
ScyllaDB
● NoSQL database, implemented in Seastar.
● Fully compatible with Cassandra:
– Same CQL queries
– Copy over a complete Cassandra database
– Use existing drivers
– Use existing cassandra.yaml
– Use same nodetool or JMX console
– Can be clustered (of course...)
34
[Diagram, Cassandra vs. ScyllaDB: Cassandra has a key cache and a row cache (on-heap / off-heap) on top of the Linux page cache and SSTables; ScyllaDB has a single unified cache on top of SSTables.]
● Don't double-cache.
● Don't cache unrelated rows.
● Don't cache unparsed sstables.
● Can fit much more into cache.
● No page faults, threads, etc.
35
Scylla vs. Cassandra
● Single node benchmark:
– 2 x 12-core x 2 hyperthread Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz

cassandra-stress benchmark    ScyllaDB     Cassandra
Write                         1,871,556    251,785
Read                          1,585,416    95,874
Mixed                         1,372,451    108,947
36
Scylla vs. Cassandra
● We really got a 7x to 16x speedup!
● Reads sped up more:
– Cassandra writes are simpler
– Row-cache benefits further improve Scylla's reads
● Almost 2 million writes per second on a single
machine!
– Google reported in their blogs achieving 1 million writes
per second on 330 (!) machines
– (2 years ago, and RF=3… but still impressive).
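The speedup range follows directly from the benchmark table. The short C++ sketch below recomputes it. The struct and function names are made up for this example, and the figures are assumed to be operations per second, which matches the "writes per second" wording on this slide.

```cpp
#include <cassert>

// Throughput figures copied from the single-node benchmark table
// (assumed to be operations per second).
struct benchmark_row {
    const char* workload;
    double scylla_ops;
    double cassandra_ops;
};

const benchmark_row rows[] = {
    {"Write", 1871556.0, 251785.0},
    {"Read",  1585416.0,  95874.0},
    {"Mixed", 1372451.0, 108947.0},
};

// Speedup is the ratio of the two throughputs.
double speedup(const benchmark_row& r) {
    assert(r.cassandra_ops > 0.0);
    return r.scylla_ops / r.cassandra_ops;
}
```

This gives roughly 7.4x for writes, 16.5x for reads and 12.6x for the mixed workload, which is where the quoted range comes from.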
37
Scylla vs. Cassandra
3 node cluster, 2x12 cores each; RF=3, CL=quorum
38
Better latency, at all load levels
39
What will you do with 10x performance?
● Shrink your cluster by a factor of 10
● Use stronger (but slower) data models
● Run more queries - more value from your data
● Stop using caches in front of databases
40
41
Do we qualify?
In 3 years, our small team wrote:
● A complete kernel and library (OSv).
● An asynchronous programming framework
(Seastar).
● A complete Cassandra-compatible NoSQL
database (ScyllaDB).
42
43
This project has received funding from the European Union’s
Horizon 2020 research and innovation programme under grant
agreement No 645402.
