Next Generation Execution for Apache Storm
2. 2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Present: Storm 1.x
→ Has matured into a stable and reliable system
→ Widely deployed and holding up well in production
→ Scales well horizontally
→ Lots of new competition
  – Differentiating on features, performance, ease of use, etc.
Storm 2.x
→ High-performance execution engine
→ All-Java code (transitioning away from Clojure)
→ Improved backpressure and metrics subsystems
→ Lots more…
  – Streams API, UI improvements, RAS scheduler improvements, …
5. Use Cases - Latency Centric
→ 100 ms+: Factory automation
→ 10–100 ms: Real-time gaming, scoring shopping carts to print coupons
→ 0–10 ms: Network threat detection
→ Java-based high-frequency trading systems
  – Fast: under 100 microseconds 90% of the time, no GC during trading hours
  – Medium: under 1 ms 95% of the time, and rare minor GCs
  – Slow: under 10 ms 99–99.9% of the time, minor GC every few minutes
  – Cost of being slow
    • Better to turn the system off than lose money by leaving it running
7. Areas Critical to Performance
→ Messaging system
  – Needs bounded concurrent queues that operate as fast as the hardware allows
  – Lock-based queues are not an option
  – Lock-free queues, or preferably wait-free queues
→ Threading & execution model
  – Avoid unnecessary threads; less synchronization
  – Dedicated threads for spouts and bolts instead of pooled threads
  – CPU pinning
  – Reduce inter-thread, inter-process, and inter-host communication
→ Memory model
  – Lower GC pressure: recycle objects on the critical path
  – Reduce CPU cache misses: control object layout (contiguous allocation), avoid false sharing
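The lock-free queue requirement above can be illustrated with a minimal bounded single-producer/single-consumer queue. The point of the sketch is the ordered `lazySet()` stores, which avoid the full memory fence of a volatile write — the same trick behind the lazySet/JCTools numbers shown later in the deck. `SpscQueue` is an invented name for this example; it omits padding against false sharing and is not Storm's actual queue implementation.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReferenceArray;

// Minimal bounded SPSC queue sketch: one producer thread, one consumer thread.
final class SpscQueue<E> {
    private final AtomicReferenceArray<E> buffer;
    private final int mask;
    private final AtomicLong head = new AtomicLong(); // consumer position
    private final AtomicLong tail = new AtomicLong(); // producer position

    SpscQueue(int capacity) {                 // capacity must be a power of two
        buffer = new AtomicReferenceArray<>(capacity);
        mask = capacity - 1;
    }

    boolean offer(E e) {                      // producer thread only
        long t = tail.get();
        if (t - head.get() == buffer.length()) return false;  // queue full
        buffer.lazySet((int) (t & mask), e);  // ordered store, no full fence
        tail.lazySet(t + 1);                  // publish the slot to the consumer
        return true;
    }

    E poll() {                                // consumer thread only
        long h = head.get();
        if (h == tail.get()) return null;     // queue empty
        int idx = (int) (h & mask);
        E e = buffer.get(idx);
        buffer.lazySet(idx, null);            // release the slot for reuse
        head.lazySet(h + 1);
        return e;
    }
}
```

The producer's ordered stores (element first, then tail) guarantee that a consumer who observes the new tail also observes the element, without paying for a full fence on every write.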
9. Understanding “Fast”
Component                  Config            Throughput (mill/sec)
AKKA                       90-100 threads    50
Flink                      per core          1.5
Apex v3.0                  container local   4.3
Gear Pump                  4 nodes           18
InfoSphere Streams v3.0

Huge Gap!
Component                                 Config           Throughput (mill/sec)
ArrayDeque (not thread safe)              1 thread rd+wr   1063
ArrayBlockingQueue (lock based)           1 thread rd+wr   30
                                          1 prod, 1 cons   4
Disruptor 3.3.x (SleepingWaitStrategy,    1P, 1C           25
  ProducerMode=MULTI)
FastQ (lazySet)                           1P, 1C           31
JCTools MPSC                              1P, 1C           74
                                          2P, 1C           59
                                          3P, 1C           43
                                          4P, 1C           40
                                          6P, 1C           56
                                          8P, 1C           65
                                          10P, 1C          66
                                          15P, 1C          68
                                          20P, 1C          68
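The gap between the plain `ArrayDeque` and the lock-based `ArrayBlockingQueue` in the table above is easy to reproduce with a rough single-threaded probe doing paired offer/poll calls. Class and method names here are invented for the sketch, and absolute numbers will differ from the slide's hardware — the point is the relative cost of lock acquisition on every operation.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.ArrayBlockingQueue;

// Crude, machine-dependent throughput probe (not a rigorous benchmark).
public final class QueueProbe {
    // Push and immediately pop n items; returns elapsed nanoseconds.
    static long probe(Queue<Integer> q, int n) {
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) {
            q.offer(i);
            q.poll();
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        final int N = 2_000_000;
        probe(new ArrayDeque<>(1024), N);          // warm-up for the JIT
        probe(new ArrayBlockingQueue<>(1024), N);
        long plain  = probe(new ArrayDeque<>(1024), N);
        long locked = probe(new ArrayBlockingQueue<>(1024), N);
        System.out.printf("ArrayDeque:         %.0f M ops/sec%n", N * 1e3 / plain);
        System.out.printf("ArrayBlockingQueue: %.0f M ops/sec%n", N * 1e3 / locked);
    }
}
```

A serious comparison would use a harness like JMH; this sketch only demonstrates the shape of the measurement.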
11. Bolt/Spout Executor - Detailed
[Architecture diagram: Each bolt/spout executor thread runs the user logic and publishes through a batcher (one per publisher) consisting of an ArrayList holding the current batch plus a ConcurrentLinkedQueue for overflow. A flusher thread periodically moves batches into the executor's Disruptor-based send queue. A send thread drains the send queue and routes each batch of messages by destination ID: local destinations go straight into the target executor's receive queue; remote destinations go into the worker's outbound queue.]
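The batching path in the diagram above can be sketched as follows. Messages accumulate in an `ArrayList` (the current batch); full batches go to the downstream queue, and when that queue is full they park in a `ConcurrentLinkedQueue` overflow. The names are invented for this sketch, a `BlockingQueue` stands in for the Disruptor queue, and the executor/flusher-thread synchronization of the real design is omitted for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Per-publisher batcher: amortizes queue hand-off cost over a whole batch.
final class Batcher<T> {
    private final int batchSize;
    private final BlockingQueue<List<T>> downstream;                     // stands in for the Disruptor queue
    private final Queue<List<T>> overflow = new ConcurrentLinkedQueue<>(); // parks batches when downstream is full
    private List<T> current = new ArrayList<>();

    Batcher(int batchSize, BlockingQueue<List<T>> downstream) {
        this.batchSize = batchSize;
        this.downstream = downstream;
    }

    // Called by the executor thread on every emitted message.
    void publish(T msg) {
        current.add(msg);
        if (current.size() >= batchSize) flush();
    }

    // Also invoked periodically by a flusher thread so partial batches don't stall.
    void flush() {
        if (current.isEmpty()) return;
        if (!trySend(current)) overflow.add(current); // downstream full: park the batch
        current = new ArrayList<>();
    }

    // Drain parked batches first to preserve ordering, then send the new one.
    private boolean trySend(List<T> batch) {
        List<T> parked;
        while ((parked = overflow.peek()) != null) {
            if (!downstream.offer(parked)) return false;
            overflow.poll();
        }
        return downstream.offer(batch);
    }
}
```

Batching trades a little latency (bounded by the flusher interval) for far fewer queue operations per message.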
16. Observations
→ Latency: Dramatically improved.
→ Throughput: Discovered multiple bottlenecks preventing significantly higher throughput.
  – Grouping: If the bottlenecks in LocalShuffle & FieldsGrouping are addressed along with some
    others, throughput can reach ~7 million/sec.
  – TupleImpl: If the inefficiencies here are addressed, throughput can reach ~15 mill/sec.
  – ACK-ing: The ACKer bolt currently maxes out at ~2.5 million ACKs/sec. This is a limitation of
    the implementation, not the concept; there is room for ACKer-specific fixes that can
    substantially improve its throughput.
19. CPU Affinity
→ For inter-thread communication
  – Cache fault distance matters
  – Faster between cores on the same socket
    • ~20% latency hit when threads are pinned to different sockets
→ Pinning threads to CPUs
  – If done right, minimizes cache fault distance
  – Threads that migrate between cores need their caches refreshed
  – Unrelated threads running on the same core thrash each other's caches
→ Helps performance on NUMA machines
  – Pinning long-running tasks reduces NUMA effects
  – A NUMA-aware allocator was introduced in Java SE 6u2
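As a concrete illustration of the two points above, a worker JVM on Linux can be confined to one socket with `taskset` and started with the NUMA-aware allocator enabled. The core IDs and the class path/main class below are placeholders, not Storm's actual launch command; check your machine's layout with `lscpu` or `numactl --hardware` first.

```shell
# Pin the worker JVM to cores 0-3 (assumed here to be one socket) and enable the
# NUMA-aware allocator (-XX:+UseNUMA, available since Java SE 6u2 with the
# parallel collector). "worker.jar" and "com.example.StormWorker" are
# hypothetical stand-ins for the real worker launch command.
taskset -c 0-3 java -XX:+UseNUMA -XX:+UseParallelGC -cp worker.jar com.example.StormWorker
```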
20. CPU Pinning Strategy
→ Pin executors to physical cores
→ Pin each executor to a separate physical core
  – Worthwhile for high-throughput / very-low-latency topologies
  – Not economical for other topologies
→ Try to fit subsequent executor threads on the same socket
→ Logical cores, i.e. hyperthreading?
  – Avoid hyperthreading: keeps threads from thrashing each other's cache on the same core
  – Could be offered as an option in the future?
22. New Threading & Execution Model

[Architecture diagram: The worker process starts/stops/monitors executors, manages metrics, handles topology reconfiguration, and heartbeats. Each executor is a single thread owning one task (a spout or bolt) together with its grouper, input queue, and counters. Dedicated system tasks, each on its own executor thread, handle inter-host input, intra-host input, and outbound messages.]
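The "one dedicated thread per executor" model above can be sketched as a thread that owns one task and spins on its own receive queue — no thread pool, no handoff between pooled workers. Names here are invented, a `BlockingQueue` stands in for the executor's Disruptor-backed receive queue, and the bare busy-spin would be a proper wait strategy in a real implementation.

```java
import java.util.concurrent.BlockingQueue;

// One dedicated thread per executor: the task's user logic always runs on the
// same thread, which helps both CPU pinning and cache locality.
final class ExecutorThread implements Runnable {
    interface Task { void execute(Object msg); }  // stands in for a spout/bolt

    private final BlockingQueue<Object> receiveQ;
    private final Task task;
    private volatile boolean running = true;

    ExecutorThread(BlockingQueue<Object> receiveQ, Task task) {
        this.receiveQ = receiveQ;
        this.task = task;
    }

    @Override public void run() {               // body of the dedicated thread
        while (running) {
            Object msg = receiveQ.poll();       // non-blocking poll; real impl would back off when idle
            if (msg != null) task.execute(msg); // run user logic in-line, no handoff
        }
    }

    void stop() { running = false; }
}
```

Because the same thread always runs the same task, pinning that thread to a core (previous slides) keeps the task's working set hot in that core's cache.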