3. High level overview
Kat Chuang @katychuang
Batch
Streaming
Microbatching
                     Storm Trident    Spark Streaming
Released             2011             2010
Delivery semantics   Exactly once     Exactly once
State management     Yes              Yes
Latency              Seconds          Seconds
Output               MapState         Resilient Distributed Dataset (RDD)
Throughput           10k/node/sec?    400k/node/sec?
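One common way to get exactly-once state updates on top of replayable micro-batches is to tag each batch with an id and skip any batch that has already been applied. The snippet below is only a toy sketch of that general idea. It is not Storm Trident's or Spark Streaming's API; the class `BatchCounterState` and its method `apply_batch` are hypothetical names, and it assumes batches arrive in increasing id order.

```python
class BatchCounterState:
    """Toy in-memory state that applies each batch at most once (hypothetical)."""

    def __init__(self):
        self.counts = {}            # key -> running count
        self.last_batch_id = None   # id of the most recently applied batch

    def apply_batch(self, batch_id, keys):
        # A replayed batch carries an id we have already seen, so skip it.
        if self.last_batch_id is not None and batch_id <= self.last_batch_id:
            return False
        for key in keys:
            self.counts[key] = self.counts.get(key, 0) + 1
        self.last_batch_id = batch_id
        return True


state = BatchCounterState()
state.apply_batch(1, ["a", "b", "a"])
state.apply_batch(1, ["a", "b", "a"])  # replay after a simulated failure
state.apply_batch(2, ["b"])
```

In this sketch the replayed batch 1 is ignored, so each message is counted once even though it was delivered twice.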
4. Test Cases
Metrics
1. Does every message pass through the pipeline?
2. How long does each message take to process?
Data
1. Timestamps
6. 1. Does every message pass through the pipeline?
This is a scatterplot
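As a rough illustration of the first test, the check could compare the ids of messages sent into the pipeline with the ids seen at its output. This is a minimal sketch, assuming every message carries a unique id. The function names `find_missing` and `delivery_rate` and the sample data are made up for illustration and are not taken from the talk.

```python
def find_missing(sent_ids, received_ids):
    """Return ids that entered the pipeline but never came out, sorted."""
    return sorted(set(sent_ids) - set(received_ids))


def delivery_rate(sent_ids, received_ids):
    """Fraction of distinct sent messages that were seen at the output."""
    sent = set(sent_ids)
    if not sent:
        return 1.0
    return len(sent & set(received_ids)) / len(sent)


# Illustrative data only: message 3 is dropped, message 5 arrives twice.
sent = [1, 2, 3, 4, 5]
received = [1, 2, 4, 5, 5]

missing = find_missing(sent, received)
rate = delivery_rate(sent, received)
```

Because both sides are reduced to sets, duplicates at the output do not inflate the delivery rate.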
7. 2. How long does each message take to process?
This is a scatterplot
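For the second test, per-message latency can be derived from the timestamps: the time a message leaves the pipeline minus the time it entered. The following is a sketch under assumed inputs. The record layout `(message_id, t_in, t_out)`, the function name `latencies`, and the sample values are hypothetical.

```python
from statistics import mean, median


def latencies(records):
    """Map message id -> seconds spent in the pipeline.

    records: iterable of (message_id, t_in, t_out), times in seconds.
    """
    return {mid: t_out - t_in for mid, t_in, t_out in records}


# Illustrative data only.
records = [("m1", 0.0, 1.5), ("m2", 0.5, 1.5), ("m3", 1.0, 4.0)]

per_message = latencies(records)
summary = {
    "mean": mean(per_message.values()),
    "median": median(per_message.values()),
    "max": max(per_message.values()),
}
```

Plotting `per_message` against arrival time would give a scatterplot of latency per message.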
8. Storm Trident Vs Spark Streaming
Storm Trident
Stream processing framework that also does micro-batching.
Great for transforming or computing as data flows in.
Complex event processing (CEP), continuous computation.
Task-parallel computations, e.g. reading Twitter streams.
Spark Streaming
Batch processing framework that also does micro-batching.
Great for combining with historical data.
ML algorithms included. Requires an HDFS-backed data source.
Data-parallel computations, e.g. offering recommendations.
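Both frameworks are described above as doing micro-batching, which means cutting a continuous stream into small batches by time interval. The snippet below is a plain-Python sketch of that idea only. It uses neither framework's API, and the function name `micro_batches`, the `(timestamp, value)` event layout, and the one-second interval are assumptions for illustration.

```python
from collections import defaultdict


def micro_batches(events, interval):
    """Group (timestamp, value) events into fixed-width time windows.

    Returns the batches in window order; empty windows are omitted.
    """
    batches = defaultdict(list)
    for ts, value in events:
        window_start = int(ts // interval) * interval
        batches[window_start].append(value)
    return [batches[start] for start in sorted(batches)]


# Illustrative data only: timestamps in seconds.
events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d"), (2.7, "e")]
batches = micro_batches(events, interval=1)
```

Each resulting batch can then be processed as a unit, which is why latency for micro-batching systems is bounded below by the batch interval.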
9. Kat Chuang
Data Engineering Fellow
#DE-2015c
hello@katychuang.com
Github: katychuang
Twitter: katychuang
IG: katychuang.nyc