We have developed a new framework, Seastar, for high-throughput server applications, along with a key-value store capable of millions of transactions per second. Seastar, which runs on OSv and Linux, is completely asynchronous and built on shared-nothing data structures that eliminate costly locking between CPUs. It is event-driven and lets you write non-blocking, asynchronous server code in a straightforward style that eases debugging and reasoning about performance.
2. "How multifarious and how mutually complicated are the considerations which the working of such an engine involve. There are frequently several distinct sets of effects going on simultaneously; all in a manner independent of each other, and yet to a greater or less degree exercising a mutual influence. To adjust each to every other, and indeed even to perceive and trace them out with perfect correctness and success, entails difficulties whose nature partakes to a certain extent of those involved in every question where conditions are very numerous and inter-complicated."
4. Hardware outgrowing software
+ CPU clocks are not getting faster.
+ More cores, but they are hard to use.
+ Locks have costs even when there is no contention.
+ Data is allocated on one core, then copied and used on others.
+ Result: software can't keep up with new hardware (SSDs, 10 Gbps networking…).
[Diagram: the traditional stack. Application threads run over the kernel's TCP/IP stack and scheduler with per-thread queues; the NIC queues and memory live in the kernel.]
5. Workloads changing
+ Complex, multi-layered applications
+ NoSQL data stores
+ More users
+ Lower latencies needed
+ Microservices
- 81% of Redis processing is in the kernel.
- If 100 requests are needed to build a page, the "99% latency" affects 63% of pageviews (1 - 0.99^100 ≈ 0.63).
10. A new model
Threads
- Costly locking (example: POSIX requires multiple threads to be able to use the same socket)
+ Uses available skills/tools
Shared-nothing
+ Fewer wasted cycles
- Cross-core communication must be explicit, so it is harder to program
11. How
■ A single-threaded async engine running on each CPU
■ No threads
■ No shared data
■ All inter-CPU communication by message passing (sketched below)
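As a sketch of what explicit message passing looks like in this model, the snippet below uses Seastar's smp::submit_to; the header paths and the per-core counter are illustrative assumptions:

#include <seastar/core/smp.hh>      // header layout varies across Seastar versions
#include <seastar/core/future.hh>

static thread_local long counter = 0;   // each core owns its own copy: no locks

// Ask the owning core to bump its counter; the reply comes back as a future.
seastar::future<long> bump_on(unsigned cpu) {
    return seastar::smp::submit_to(cpu, [] {
        return ++counter;               // runs on `cpu`, touching only its own data
    });
}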
12. Linear scaling
+ One engine runs on each core
+ Shared-nothing per-core design
+ Fits the existing shared-nothing model of distributed applications
+ Full kernel bypass, supports zero-copy
+ No threads, no context switches, and no locks!
+ Instead, asynchronous lambda invocation
[Diagram: SeaStar's sharded stack, replicated on every core: application, userspace TCP/IP, task scheduler, and smp queues per core, each with its own NIC queue driven through DPDK; the kernel isn't involved.]
13. Comparison with old school
[Diagram: the traditional stack (kernel-side TCP/IP, scheduler, threads, NIC queues, kernel memory) next to SeaStar's sharded stack (per-core application, userspace TCP/IP, task scheduler, smp queues, and a DPDK NIC queue; the kernel isn't involved).]
14. Millions of connections
[Diagram: traditional stack vs. SeaStar's sharded stack. In the sharded stack, each CPU schedules many lightweight promise/task pairs; in the traditional stack, each CPU's scheduler juggles threads, each carrying its own stack.]
A promise is a pointer to an eventually computed value. A task is a pointer to a lambda function. A thread, by contrast, is a function pointer, and its stack is a byte array from 64 KB to megabytes.
15. But how can you program it?
■ Ada Lovelace's problem today
■ Need the maximum possible "easy" without giving up any "fast."
If the answer were "no", would this book be 467 pages long?
17. F-P-C defined: Future
A future is the result of a computation that may not be available yet:
■ a data buffer that we are reading from the network
■ the expiration of a timer
■ the completion of a disk write
■ the result of a computation that requires the values of one or more other futures.
18. F-P-C defined: Promise
A promise is an object or function that provides you with a future, with the expectation that it will fulfill the future.
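A minimal sketch of the two halves in code, assuming Seastar's promise<T>/future<T> (get_future and set_value are the core Seastar names):

seastar::promise<int> p;                       // the producer's half
auto done = p.get_future().then([] (int v) {   // the consumer's half
    std::cout << "got " << v << "\n";
});
p.set_value(42);   // fulfills the promise; the continuation above runs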
19. Basic future/promise
future<int> get();   // promises an int will be produced eventually
future<> put(int);   // promises to store an int

void f() {
    get().then([] (int value) {
        put(value + 1).then([] {
            std::cout << "value stored successfully\n";
        });
    });
}
20. Chaining
future<int> get();   // promises an int will be produced eventually
future<> put(int);   // promises to store an int

void f() {
    get().then([] (int value) {
        return put(value + 1);
    }).then([] {
        std::cout << "value stored successfully\n";
    });
}
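Chaining also gives failures a single place to land. A hedged sketch, assuming Seastar's handle_exception, which receives a std::exception_ptr for any exception thrown earlier in the chain:

void f() {
    get().then([] (int value) {
        return put(value + 1);             // a failure here skips the next then()
    }).then([] {
        std::cout << "value stored successfully\n";
    }).handle_exception([] (std::exception_ptr e) {
        std::cerr << "store failed\n";     // every error in the chain lands here
    });
}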
22. Zero copy friendly (2)
pair<future<size_t>, future<temporary_buffer>>
socket::write(temporary_buffer);
■ The first future becomes ready when the TCP window allows sending more data (usually immediately)
■ The second future becomes ready when the buffer can be discarded (after the TCP ACK)
■ May complete in any order
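An illustrative caller for the two-future write above; `sock` is an assumption, and the signature comes from the slide rather than from any particular Seastar release:

auto futs = sock.write(std::move(buf));
std::move(futs.first).then([] (size_t n) {
    // TCP window has room again: safe to queue the next buffer
});
std::move(futs.second).then([] (temporary_buffer buf) {
    // the peer ACKed: the buffer may now be recycled or freed
});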
24. Shared state: networking
■ No shared state except the index of net channels (one per CPU)
■ No migration of existing TCP connections
25. Handling shared state: block
■ Each CPU is responsible for handling specific files/directories/free blocks (assigned by hash)
■ Can delegate access to another CPU for locality, but not concurrent shared access (see the sketch after this list)
■ Flash optimized: no fancy layout
■ DMA only
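A minimal sketch of that hash-based ownership, reusing smp::submit_to from above; append_block and the modulo-hash placement are assumptions for illustration:

// Hypothetical per-core operation on a file that this core owns.
seastar::future<> append_block(std::string path, seastar::temporary_buffer<char> buf);

seastar::future<> write_block(std::string path, seastar::temporary_buffer<char> buf) {
    // Hash the path to find the single core allowed to touch this file.
    unsigned owner = std::hash<std::string>{}(path) % seastar::smp::count;
    return seastar::smp::submit_to(owner,
        [path = std::move(path), buf = std::move(buf)] () mutable {
            return append_block(std::move(path), std::move(buf));
        });
}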
28. Performance results
■ Linear scaling to 20 cores and beyond
■ 250,000 transactions/core (memcached)
■ Currently limited by the client; more client development is in progress.
29. Applications
■ HTTP server
■ NoSQL system
■ Distributed filesystem
■ Object store
■ Transparent proxy
■ Cache (Memcache, CDN, …)
■ NFV
[Chart data labels: 318,715 transactions/core at 2 cores, 274,114 transactions/core at 16 cores… roughly 250,000 transactions/core.]
Slide 7: locking is only part of the problem, and it is mostly eliminated by "lock-free" alternatives to locking. The other problems are cache-line bouncing, slow atomic operations, and memory barriers. A "shared nothing" design cannot eliminate all of these (we still communicate between cores), but it can minimize them by making it very explicit when these things happen.
If I understood Avi correctly, he also says that another problem of the thread model is that large stacks mean heavy cache pollution on context switches, while our tiny "task" switches don't pollute the cache. You even mention this later on, but I have to admit I'm not completely convinced this is the case (even if the stack is large, the threads use only a tiny portion of it?).
http://aws.amazon.com/ec2/pricing/
Promises and futures simplify asynchronous programming since they decouple the event producer (the promise) and the event consumer (whoever uses the future). Whether the promise is fulfilled before the future is consumed, or vice versa, does not change the outcome of the code.
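A small sketch of that order independence, assuming Seastar-style futures:

seastar::promise<int> p;
p.set_value(7);                              // producer fulfills first
auto done = p.get_future().then([] (int v) {
    std::cout << v << "\n";                  // the continuation still sees 7
});
// Attaching the continuation first and calling set_value() afterwards
// produces exactly the same output; only the scheduling differs.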