Next Generation Execution for Apache Storm
2. 2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Present: Storm 1.x
→ Has matured into a stable and reliable system
→ Widely deployed and holding up well in production
→ Scales well horizontally
→ Lots of new competition
  – Differentiating on features, performance, ease of use, etc.
Storm 2.x
→ High-performance execution engine
→ All-Java code (transitioning away from Clojure)
→ Improved backpressure and metrics subsystems
→ Lots more…
  – Streams API, UI improvements, RAS scheduler improvements, …
5. Use Cases - Latency Centric
→ 100 ms+: Factory automation
→ 10–100 ms: Real-time gaming, scoring shopping carts to print coupons
→ 0–10 ms: Network threat detection
→ Java-based high-frequency trading systems
  – Fast: under 100 microseconds 90% of the time, no GC during trading hours
  – Medium: under 1 ms 95% of the time, and rare minor GCs
  – Slow: under 10 ms 99–99.9% of the time, minor GC every few minutes
  – Cost of being slow
    • Better to turn the system off than lose money by leaving it running
7. Areas Critical to Performance
→ Messaging system
  – Needs bounded concurrent queues that operate as fast as the hardware allows
  – Lock-based queues are not an option
  – Lock-free queues, or preferably wait-free queues
→ Threading & execution model
  – Avoid unnecessary threads; less synchronization
  – Dedicated threads for spouts and bolts instead of pooled threads
  – CPU pinning
  – Reduce inter-thread, inter-process, and inter-host communication
→ Memory model
  – Lower GC pressure: recycle objects on the critical path
  – Reduce CPU cache misses: control object layout (contiguous allocation), avoid false sharing
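The lock-free queue requirement above can be illustrated with a minimal bounded single-producer/single-consumer queue. The point of the sketch is the ordered `lazySet()` stores, which avoid the full memory fence of a volatile write — the same trick behind the lazySet/JCTools numbers shown later in the deck. `SpscQueue` is an invented name for this example; it omits padding against false sharing and is not Storm's actual queue implementation.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReferenceArray;

// Minimal bounded SPSC queue sketch: one producer thread, one consumer thread.
final class SpscQueue<E> {
    private final AtomicReferenceArray<E> buffer;
    private final int mask;
    private final AtomicLong head = new AtomicLong(); // consumer position
    private final AtomicLong tail = new AtomicLong(); // producer position

    SpscQueue(int capacity) {                 // capacity must be a power of two
        buffer = new AtomicReferenceArray<>(capacity);
        mask = capacity - 1;
    }

    boolean offer(E e) {                      // producer thread only
        long t = tail.get();
        if (t - head.get() == buffer.length()) return false;  // queue full
        buffer.lazySet((int) (t & mask), e);  // ordered store, no full fence
        tail.lazySet(t + 1);                  // publish the slot to the consumer
        return true;
    }

    E poll() {                                // consumer thread only
        long h = head.get();
        if (h == tail.get()) return null;     // queue empty
        int idx = (int) (h & mask);
        E e = buffer.get(idx);
        buffer.lazySet(idx, null);            // release the slot for reuse
        head.lazySet(h + 1);
        return e;
    }
}
```

The producer's ordered stores (element first, then tail) guarantee that a consumer who observes the new tail also observes the element, without paying for a full fence on every write.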
9. Understanding “Fast”
Component                  Config            Throughput (mill/sec)
AKKA                       90-100 threads    50
Flink                      per core          1.5
Apex v3.0                  container local   4.3
Gear Pump                  4 nodes           18
InfoSphere Streams v3.0

Huge Gap!
Component                                 Config           Throughput (mill/sec)
ArrayDeque (not thread safe)              1 thread rd+wr   1063
ArrayBlockingQueue (lock based)           1 thread rd+wr   30
                                          1 prod, 1 cons   4
Disruptor 3.3.x (SleepingWaitStrategy,    1P, 1C           25
  ProducerMode=MULTI)
FastQ (lazySet)                           1P, 1C           31
JCTools MPSC                              1P, 1C           74
                                          2P, 1C           59
                                          3P, 1C           43
                                          4P, 1C           40
                                          6P, 1C           56
                                          8P, 1C           65
                                          10P, 1C          66
                                          15P, 1C          68
                                          20P, 1C          68
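The gap between the plain `ArrayDeque` and the lock-based `ArrayBlockingQueue` in the table above is easy to reproduce with a rough single-threaded probe doing paired offer/poll calls. Class and method names here are invented for the sketch, and absolute numbers will differ from the slide's hardware — the point is the relative cost of lock acquisition on every operation.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.ArrayBlockingQueue;

// Crude, machine-dependent throughput probe (not a rigorous benchmark).
public final class QueueProbe {
    // Push and immediately pop n items; returns elapsed nanoseconds.
    static long probe(Queue<Integer> q, int n) {
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) {
            q.offer(i);
            q.poll();
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        final int N = 2_000_000;
        probe(new ArrayDeque<>(1024), N);          // warm-up for the JIT
        probe(new ArrayBlockingQueue<>(1024), N);
        long plain  = probe(new ArrayDeque<>(1024), N);
        long locked = probe(new ArrayBlockingQueue<>(1024), N);
        System.out.printf("ArrayDeque:         %.0f M ops/sec%n", N * 1e3 / plain);
        System.out.printf("ArrayBlockingQueue: %.0f M ops/sec%n", N * 1e3 / locked);
    }
}
```

A serious comparison would use a harness like JMH; this sketch only demonstrates the shape of the measurement.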
11. Bolt/Spout Executor - Detailed
[Architecture diagram: Each bolt/spout executor thread runs the user logic and publishes through a batcher (one per publisher) consisting of an ArrayList holding the current batch plus a ConcurrentLinkedQueue for overflow. A flusher thread periodically moves batches into the executor's Disruptor-based send queue. A send thread drains the send queue and routes each batch of messages by destination ID: local destinations go straight into the target executor's receive queue; remote destinations go into the worker's outbound queue.]
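The batching path in the diagram above can be sketched as follows. Messages accumulate in an `ArrayList` (the current batch); full batches go to the downstream queue, and when that queue is full they park in a `ConcurrentLinkedQueue` overflow. The names are invented for this sketch, a `BlockingQueue` stands in for the Disruptor queue, and the executor/flusher-thread synchronization of the real design is omitted for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Per-publisher batcher: amortizes queue hand-off cost over a whole batch.
final class Batcher<T> {
    private final int batchSize;
    private final BlockingQueue<List<T>> downstream;                     // stands in for the Disruptor queue
    private final Queue<List<T>> overflow = new ConcurrentLinkedQueue<>(); // parks batches when downstream is full
    private List<T> current = new ArrayList<>();

    Batcher(int batchSize, BlockingQueue<List<T>> downstream) {
        this.batchSize = batchSize;
        this.downstream = downstream;
    }

    // Called by the executor thread on every emitted message.
    void publish(T msg) {
        current.add(msg);
        if (current.size() >= batchSize) flush();
    }

    // Also invoked periodically by a flusher thread so partial batches don't stall.
    void flush() {
        if (current.isEmpty()) return;
        if (!trySend(current)) overflow.add(current); // downstream full: park the batch
        current = new ArrayList<>();
    }

    // Drain parked batches first to preserve ordering, then send the new one.
    private boolean trySend(List<T> batch) {
        List<T> parked;
        while ((parked = overflow.peek()) != null) {
            if (!downstream.offer(parked)) return false;
            overflow.poll();
        }
        return downstream.offer(batch);
    }
}
```

Batching trades a little latency (bounded by the flusher interval) for far fewer queue operations per message.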
16. Observations
→ Latency: Dramatically improved.
→ Throughput: Discovered multiple bottlenecks preventing significantly higher throughput.
  – Grouping: If the bottlenecks in LocalShuffle & FieldsGrouping are addressed along with some
    others, throughput can reach ~7 million/sec.
  – TupleImpl: If the inefficiencies here are addressed, throughput can reach ~15 mill/sec.
  – ACK-ing: The ACKer bolt currently maxes out at ~2.5 million ACKs/sec. This is a limitation of
    the implementation, not the concept; there is room for ACKer-specific fixes that can
    substantially improve its throughput.
19. CPU Affinity
→ For inter-thread communication
  – Cache fault distance matters
  – Faster between cores on the same socket
    • ~20% latency hit when threads are pinned to different sockets
→ Pinning threads to CPUs
  – If done right, minimizes cache fault distance
  – Threads that migrate between cores need their caches refreshed
  – Unrelated threads running on the same core thrash each other's caches
→ Helps performance on NUMA machines
  – Pinning long-running tasks reduces NUMA effects
  – A NUMA-aware allocator was introduced in Java SE 6u2
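As a concrete illustration of the two points above, a worker JVM on Linux can be confined to one socket with `taskset` and started with the NUMA-aware allocator enabled. The core IDs and the class path/main class below are placeholders, not Storm's actual launch command; check your machine's layout with `lscpu` or `numactl --hardware` first.

```shell
# Pin the worker JVM to cores 0-3 (assumed here to be one socket) and enable the
# NUMA-aware allocator (-XX:+UseNUMA, available since Java SE 6u2 with the
# parallel collector). "worker.jar" and "com.example.StormWorker" are
# hypothetical stand-ins for the real worker launch command.
taskset -c 0-3 java -XX:+UseNUMA -XX:+UseParallelGC -cp worker.jar com.example.StormWorker
```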
20. CPU Pinning Strategy
→ Pin executors to physical cores
→ Pin each executor to a separate physical core
  – Worthwhile for high-throughput / very-low-latency topologies
  – Not economical for other topologies
→ Try to fit subsequent executor threads on the same socket
→ Logical cores, i.e. hyperthreading?
  – Avoid hyperthreading: keeps threads from thrashing each other's cache on the same core
  – Could be offered as an option in the future?
22. New Threading & Execution Model

[Architecture diagram: The worker process starts/stops/monitors executors, manages metrics, handles topology reconfiguration, and heartbeats. Each executor is a single thread owning one task (a spout or bolt) together with its grouper, input queue, and counters. Dedicated system tasks, each on its own executor thread, handle inter-host input, intra-host input, and outbound messages.]
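The "one dedicated thread per executor" model above can be sketched as a thread that owns one task and spins on its own receive queue — no thread pool, no handoff between pooled workers. Names here are invented, a `BlockingQueue` stands in for the executor's Disruptor-backed receive queue, and the bare busy-spin would be a proper wait strategy in a real implementation.

```java
import java.util.concurrent.BlockingQueue;

// One dedicated thread per executor: the task's user logic always runs on the
// same thread, which helps both CPU pinning and cache locality.
final class ExecutorThread implements Runnable {
    interface Task { void execute(Object msg); }  // stands in for a spout/bolt

    private final BlockingQueue<Object> receiveQ;
    private final Task task;
    private volatile boolean running = true;

    ExecutorThread(BlockingQueue<Object> receiveQ, Task task) {
        this.receiveQ = receiveQ;
        this.task = task;
    }

    @Override public void run() {               // body of the dedicated thread
        while (running) {
            Object msg = receiveQ.poll();       // non-blocking poll; real impl would back off when idle
            if (msg != null) task.execute(msg); // run user logic in-line, no handoff
        }
    }

    void stop() { running = false; }
}
```

Because the same thread always runs the same task, pinning that thread to a core (previous slides) keeps the task's working set hot in that core's cache.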