Contact:
https://www.linkedin.com/in/brandonjobrien
@hakczar
Introduces streaming data concepts, Storm cluster architecture, and Storm topology architecture, and demonstrates a working example of a WordCount topology for the SIGKDD Seattle chapter meetup.
Presented by Brandon O'Brien
Code example: https://github.com/OpenDataMining/brandonobrien
Meetup: http://www.meetup.com/seattlesigkdd/events/222955114/
2. Outline
Distributed Systems & Batch Processing
Streaming Processing. Introduce Storm
WordCount Demo & Setup
Storm Cluster Architecture
Storm Topology Architecture
WordCount Deep Dive
Discussion and Q&A: Storm Use Cases & Patterns
3. Distributed Systems
Distribute work across N nodes
Hadoop Ecosystem
Batch processing
Massively parallel (horizontal scale out)
Problems: data latency, 24-hour batching vs. a global client base
What's next? Increasing need to move to real time & streaming processing models
4. Streaming Processing
Provides near real time views into analytical data sets and system status. Allows for real time intervention & response to events
Streaming frameworks: Spark Streaming, Azure Stream Analytics, AWS Kinesis + Lambda, Storm
Storm was created by Nathan Marz and first used at Twitter
Storm: "Doing for realtime processing what Hadoop did for batch processing"
Stream definition: “unbounded sequence of tuples”
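The stream definition above can be sketched in plain Java. This is a minimal illustration, not Storm's API: the class name SentenceStream, the sample sentences, and the (sequence number, sentence) tuple layout are all assumptions made for the example. It shows that the source never runs dry, so a consumer can only ever work on a finite prefix.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Hypothetical stand-in for a stream: an iterator that never runs out of tuples. */
final class SentenceStream implements Iterator<List<Object>> {
    // Sample data, assumed for illustration only.
    private static final String[] SENTENCES = {
        "the quick brown fox",
        "jumps over the lazy dog"
    };
    private long emitted = 0;

    @Override
    public boolean hasNext() {
        return true; // unbounded: there is always another tuple
    }

    @Override
    public List<Object> next() {
        String sentence = SENTENCES[(int) (emitted % SENTENCES.length)];
        // A tuple is an ordered list of values; here (sequence number, sentence).
        List<Object> tuple = Arrays.<Object>asList(emitted, sentence);
        emitted++;
        return tuple;
    }

    /** Reads the first n tuples; a consumer only ever sees a finite prefix. */
    static List<List<Object>> take(Iterator<List<Object>> stream, int n) {
        List<List<Object>> prefix = new ArrayList<>();
        for (int i = 0; i < n && stream.hasNext(); i++) {
            prefix.add(stream.next());
        }
        return prefix;
    }

    public static void main(String[] args) {
        for (List<Object> tuple : take(new SentenceStream(), 3)) {
            System.out.println(tuple);
        }
    }
}
```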
5. Storm WordCount Demo
WordCount Storm Topology
Streams text blobs
Counts word occurrences
Reports results every 10 seconds
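The counting behaviour listed above can be sketched without Storm at all. The class WordCountSketch below is hypothetical and is not the repository's code; it assumes words are runs of ASCII letters and leaves the 10 second timer to the caller, which would invoke report() on each tick.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical word counter for illustration; not the repository's TokenCounter. */
final class WordCountSketch {
    private final Map<String, Long> counts = new TreeMap<>();

    /** Splits a text blob on non-letter characters and counts each lowercased word. */
    void accept(String text) {
        for (String word : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!word.isEmpty()) { // split can yield a leading empty string
                counts.merge(word, 1L, Long::sum);
            }
        }
    }

    /** Returns a snapshot of the counts so far; a timer would call this periodically. */
    Map<String, Long> report() {
        return new TreeMap<>(counts);
    }

    public static void main(String[] args) {
        WordCountSketch counter = new WordCountSketch();
        counter.accept("the cat and the hat");
        counter.accept("The end.");
        System.out.println(counter.report()); // {and=1, cat=1, end=1, hat=1, the=3}
    }
}
```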
Getting it running
https://github.com/OpenDataMining/brandonobrien
mvn clean install exec:java -Dexec.mainClass="dataclub.storm.TokenCountingTopology"
6. Storm Cluster Architecture
Core components:
Zookeeper
Nimbus
Supervisors
Workers/JVM
Executor/thread
Component/task (bolts & spouts)
Scalability: supervisors can be added while topologies are running, with no code change required
Supervisors run Worker JVMs
Workers run Executor Threads
Executors run Tasks (instances of Spouts and Bolts)
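The nesting above (workers hold executors, executors hold tasks) can be made concrete with a small placement sketch. The round-robin rule and the counts used here (8 tasks, 4 executors, 2 workers) are assumptions chosen for illustration; this is not Storm's scheduler, which makes its own assignment decisions.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical layout helper; illustrates the hierarchy, not Storm's real scheduling. */
final class ClusterLayoutSketch {
    /** Places item ids 0..items-1 into the given number of buckets, round-robin. */
    static List<List<Integer>> spread(int items, int buckets) {
        if (buckets <= 0) {
            throw new IllegalArgumentException("buckets must be positive");
        }
        List<List<Integer>> placement = new ArrayList<>();
        for (int b = 0; b < buckets; b++) {
            placement.add(new ArrayList<>());
        }
        for (int item = 0; item < items; item++) {
            placement.get(item % buckets).add(item);
        }
        return placement;
    }

    public static void main(String[] args) {
        // Each inner list is one executor (thread) and the task ids it runs.
        System.out.println("tasks per executor:   " + spread(8, 4));
        // Each inner list is one worker (JVM) and the executor ids it hosts.
        System.out.println("executors per worker: " + spread(4, 2));
    }
}
```

With more tasks than executors, each executor thread runs several tasks in turn, which is why the two counts can be chosen separately.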
7. Storm Topology Architecture
DAG Processing Model
Directed Acyclic Graph
Components: Spout & Bolt (benefit: decouple logic from scalability)
Tasks (instances of Spouts & Bolts)
Executors (run Tasks)
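To illustrate the DAG processing model, the sketch below stores components as nodes and stream connections as directed edges, then uses Kahn's algorithm to produce a processing order or reject a cycle. The class TopologyDagSketch and the component ids are hypothetical, and Storm itself is not involved.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of a topology as a directed graph of component ids. */
final class TopologyDagSketch {
    private final Map<String, List<String>> downstream = new LinkedHashMap<>();

    void addComponent(String id) {
        downstream.putIfAbsent(id, new ArrayList<>());
    }

    /** Adds a directed edge: tuples flow from one component to another. */
    void connect(String from, String to) {
        addComponent(from);
        addComponent(to);
        downstream.get(from).add(to);
    }

    /** Kahn's algorithm: returns a valid processing order, or throws on a cycle. */
    List<String> topologicalOrder() {
        Map<String, Integer> inDegree = new LinkedHashMap<>();
        for (String id : downstream.keySet()) {
            inDegree.put(id, 0);
        }
        for (List<String> targets : downstream.values()) {
            for (String target : targets) {
                inDegree.merge(target, 1, Integer::sum);
            }
        }
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> entry : inDegree.entrySet()) {
            if (entry.getValue() == 0) {
                ready.add(entry.getKey()); // no inputs: a source such as a spout
            }
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String id = ready.remove();
            order.add(id);
            for (String target : downstream.get(id)) {
                if (inDegree.merge(target, -1, Integer::sum) == 0) {
                    ready.add(target);
                }
            }
        }
        if (order.size() != downstream.size()) {
            throw new IllegalStateException("cycle detected: not a DAG");
        }
        return order;
    }

    public static void main(String[] args) {
        TopologyDagSketch topology = new TopologyDagSketch();
        topology.connect("sentence-spout", "tokenizer-bolt");
        topology.connect("tokenizer-bolt", "counter-bolt");
        System.out.println(topology.topologicalOrder());
    }
}
```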
8. Storm WordCount Deep Dive
Topology structure
Classes
Spout: SentenceProducer.java
Bolt: SentenceTokenizer.java
Bolt: TokenCounter.java
Putting it all together: TokenCountingTopology.java
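The three-stage structure above (sentence source, tokenizer, counter) can be approximated in plain Java to show how data moves between stages. This is a sketch under stated assumptions, not the repository's classes and not Storm's API: PipelineSketch and its methods are hypothetical, tokens are split on whitespace, and tokens are routed to counter tasks by hash so the same word always reaches the same task. That routing mirrors what Storm calls a fields grouping; the slides do not say which grouping the demo uses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical single-process imitation of a producer -> tokenizer -> counter pipeline. */
final class PipelineSketch {
    /** Tokenizer stage: splits a sentence into lowercase whitespace-separated tokens. */
    static List<String> tokenize(String sentence) {
        List<String> tokens = new ArrayList<>();
        for (String token : sentence.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    /** Routing: the same token always maps to the same counter task index. */
    static int taskFor(String token, int numTasks) {
        return Math.floorMod(token.hashCode(), numTasks);
    }

    /** Runs every sentence through the stages and returns the counts held by each task. */
    static List<Map<String, Integer>> run(List<String> sentences, int numTasks) {
        List<Map<String, Integer>> perTask = new ArrayList<>();
        for (int i = 0; i < numTasks; i++) {
            perTask.add(new TreeMap<>());
        }
        for (String sentence : sentences) {
            for (String token : tokenize(sentence)) {
                perTask.get(taskFor(token, numTasks)).merge(token, 1, Integer::sum);
            }
        }
        return perTask;
    }

    /** Combines per-task counts into one view, as a reporting step might. */
    static Map<String, Integer> merge(List<Map<String, Integer>> perTask) {
        Map<String, Integer> total = new TreeMap<>();
        for (Map<String, Integer> counts : perTask) {
            counts.forEach((word, count) -> total.merge(word, count, Integer::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        List<Map<String, Integer>> perTask = run(Arrays.asList("a b a", "b c"), 2);
        System.out.println(merge(perTask)); // {a=2, b=2, c=1}
    }
}
```

Because each word is owned by exactly one counter task, no two tasks ever hold partial counts for the same word, so the counter stage can be scaled out without coordination.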
9. Storm Use Cases & Patterns
Consume data from Kafka, Kinesis or other queue
Persist data to a high-write-performance datastore like Cassandra
Streaming map reduce, multi-stage map reduce
Storm is stateless & fail-fast. Externalize state using Redis or other cache for resiliency
Online learning / realtime model updates (using frameworks like WEKA or others)
Real world use cases: real time ad targeting, travel market analytics, user behavior analytics, system monitoring & SLA
Storm multi-lang API (Python, Ruby, Perl, JavaScript, Scala, and more)
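The "externalize state" pattern in the list above can be sketched as follows. The names CountStore, InMemoryCountStore, and StatelessCounter are hypothetical, and the in-memory map only stands in for Redis or another cache; a real deployment would swap in a client for that store. The point is that the processing component holds no counts itself, so a replacement started after a crash continues from the stored values.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical illustration of keeping state outside the processing component. */
final class ExternalStateSketch {
    /** Minimal store contract; an external cache such as Redis would sit behind this. */
    interface CountStore {
        long increment(String key);
        long get(String key);
    }

    /** In-memory stand-in for the external store (assumption for the example). */
    static final class InMemoryCountStore implements CountStore {
        private final Map<String, Long> data = new ConcurrentHashMap<>();

        @Override
        public long increment(String key) {
            return data.merge(key, 1L, Long::sum);
        }

        @Override
        public long get(String key) {
            return data.getOrDefault(key, 0L);
        }
    }

    /** Holds no counts of its own, so it can fail and be replaced freely. */
    static final class StatelessCounter {
        private final CountStore store;

        StatelessCounter(CountStore store) {
            this.store = store;
        }

        void process(String word) {
            store.increment(word);
        }
    }

    public static void main(String[] args) {
        CountStore store = new InMemoryCountStore();
        StatelessCounter first = new StatelessCounter(store);
        first.process("storm");
        first.process("storm");
        // Simulated crash: the first counter is discarded and a new one takes over.
        StatelessCounter replacement = new StatelessCounter(store);
        replacement.process("storm");
        System.out.println(store.get("storm")); // 3
    }
}
```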
10. Distributed Streaming Processing with Storm
Going Further
https://storm.apache.org/
http://storm.apache.org/documentation/Common-patterns.html
Frameworks: Trident, Summingbird
Stand up Storm cluster: http://www.michael-noll.com/tutorials/running-multi-node-storm-cluster/
Contact
Brandon O’Brien, Data Engineer @ Expedia
https://www.linkedin.com/in/brandonjobrien
Q&A
Editor's notes
1 A framework I've used extensively for real time processing of travel market analytics data. It underpins the analytics platform I'm building, so I wanted to share what I've learned about it for anyone interested in getting started with streaming processing and Storm.
2 Gauge audience. Engineers vs data scientists. Today’s talk is focused on Data Engineering. Domain = data engineering. Setting up analytics pipelines and services to realize the value of data science models.
3 For analytics processing that can't fit in memory on a single machine, we need to scale horizontally, that is, add more machines. Ideally, we'd like to scale out using cheap commodity hardware. For a variety of reasons, many analytics teams need to move to a streaming processing model.
4 Batch processing was great for reports, but if we can get real time views into markets & systems, then we can get real time alerts and updates. This unlocks whole new categories of use cases where we can see what's happening in systems and markets in real time, and respond or intervene in real time.
5 Hands on demo so you can see a concrete example of what we’re talking about