1. Apache Storm
Course Instructor: Dr. Zarifzadeh
Presented by: Pouyan Rezazadeh, Ali Rezaie
2. Introduction
Hadoop and related technologies have made it possible to store and process data at large scales.
Unfortunately, these data processing technologies are not realtime systems.
Hadoop does batch processing instead of realtime processing.
3. Introduction
Batch processing: jobs are processed in batches, and a batch job can take hours (e.g. a billing system).
Realtime processing: jobs are processed one by one, immediately (e.g. an airline system).
4. Introduction
Realtime data processing at massive scale is becoming more and more of a requirement for businesses.
The lack of a "Hadoop of realtime" has become the biggest hole in the data processing ecosystem.
There is no hack that will turn Hadoop into a realtime system.
The solution: Apache Storm.
5. Apache Storm
A distributed realtime computation system
Created in 2011
Implemented in Clojure (a dialect of Lisp), with some Java
6. Advantages
Free, simple and open source
Can be used with any programming language
Very fast
Scalable
Fault-tolerant
Guarantees your data will be processed
Integrates with any database technology
Extremely robust
8. Storm vs Hadoop
A Storm cluster is superficially similar to a Hadoop cluster.
Hadoop runs "MapReduce jobs", while Storm runs "topologies".
A MapReduce job eventually finishes, whereas a topology processes messages forever (or until you kill it).
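Stopping a topology is therefore an explicit action. With the Storm command line client this is done with the kill command (the topology name here is illustrative):

storm kill my-topology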
10. Spouts and Bolts
A stream is an unbounded sequence of tuples. A spout is a source of streams.
[Diagram: an example topology with two spouts (Spout 1, Spout 2) and four bolts (Bolt 1 to Bolt 4) connected by streams]
11. Spouts and Bolts
For example, a spout may read tuples off of a queue and emit them as a stream.
12. Spouts and Bolts
A bolt consumes any number of input streams, does some processing, and possibly emits new streams.
13. Spouts and Bolts
Each node (spout or bolt) in a Storm topology executes in parallel.
14. Architecture
A machine in a Storm cluster may run one or more worker processes.
Each topology has one or more worker processes.
Each worker process runs executors (threads) for a specific topology.
Each executor runs one or more tasks of the same component (spout or bolt).
[Diagram: a worker process containing executors, each executor running one or more tasks]
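How these levels map onto the API (a minimal sketch, not from the slides): the worker count is set on the topology's Config, the parallelism hint passed to setSpout/setBolt sets the number of executors, and setNumTasks sets how many tasks are spread across those executors. The spout and bolt classes are the word-count components shown later in this deck; the ids and numbers are illustrative.

import org.apache.storm.Config;
import org.apache.storm.topology.TopologyBuilder;

public class ParallelismSketch {
    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();
        // parallelism hint = 2 executors (threads) for the spout
        builder.setSpout("sentence-spout", new SentenceSpout(), 2);
        // 4 executors for the bolt, running 8 tasks spread across them
        builder.setBolt("split-bolt", new SplitSentenceBolt(), 4)
               .setNumTasks(8)
               .shuffleGrouping("sentence-spout");
        Config config = new Config();
        // 2 worker processes (JVMs) for this topology
        config.setNumWorkers(2);
    }
}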
16. Architecture
The Nimbus and Supervisors are stateless; all state is kept in ZooKeeper (one ZooKeeper instance per machine).
When the Nimbus or a Supervisor fails, it starts back up as if nothing happened.
storm jar all-my-code.jar org.apache.storm.MyTopology arg1 arg2
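The storm jar command runs the given main class, which builds the topology and hands it to the Nimbus. A minimal sketch of what such a main class might look like, assuming the word-count components from later slides (the topology name and component ids are illustrative):

import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;

public class MyTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("sentence-spout", new SentenceSpout());
        builder.setBolt("split-bolt", new SplitSentenceBolt())
               .shuffleGrouping("sentence-spout");
        Config config = new Config();
        // Nimbus distributes the jar to the Supervisors, which start the workers
        StormSubmitter.submitTopology("my-topology", config, builder.createTopology());
    }
}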
19. Stream Groupings
Shuffle grouping: randomized round-robin
Fields grouping: all tuples with the same field value(s) are always routed to the same task
Direct grouping: the producer of the tuple decides which task of the consumer will receive the tuple
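These groupings are declared when a bolt is wired into the topology. A minimal sketch (component ids are illustrative; the classes are the word-count components that appear later in this deck):

import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class GroupingSketch {
    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("spout", new SentenceSpout());
        // Shuffle grouping: tuples spread round-robin across the bolt's tasks
        builder.setBolt("split", new SplitSentenceBolt()).shuffleGrouping("spout");
        // Fields grouping: tuples with the same "word" always reach the same task
        builder.setBolt("count", new WordCountBolt()).fieldsGrouping("split", new Fields("word"));
        // A direct grouping would be declared with .directGrouping("producer-id");
        // it requires the producer to emit on a direct stream via emitDirect.
    }
}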
28. A Sample Word Count Topology
Sentence Spout emits: { "sentence": "my dog has fleas" }
Split Sentence Bolt emits: { "word" : "my" }, { "word" : "dog" }, { "word" : "has" }, { "word" : "fleas" }
Word Count Bolt emits: { "word" : "dog", "count" : 5 }
Report Bolt: prints the contents
[Diagram: Sentence Spout → Split Sentence Bolt → Word Count Bolt → Report Bolt]
29. A Sample Word Count Code
public class SentenceSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private String[] sentences = {
        "my dog has fleas", "i like cold beverages", "the dog ate my homework",
        "don't have a cow man", "i don't think i like fleas"
    };
    private int index = 0;

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("sentence"));
    }

    public void open(Map config, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
    }

    public void nextTuple() {
        this.collector.emit(new Values(sentences[index]));
        index++;
        if (index >= sentences.length) { index = 0; }
    }
}
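Storm calls nextTuple() repeatedly on the spout's executor thread, so emitting on every call as above makes the loop spin as fast as possible. A common variation (a sketch, not from the slides) throttles the spout with Storm's Utils.sleep helper:

public void nextTuple() {
    this.collector.emit(new Values(sentences[index]));
    index++;
    if (index >= sentences.length) { index = 0; }
    // small pause so the spout does not busy-loop
    org.apache.storm.utils.Utils.sleep(100);
}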
30. A Sample Word Count Code
public class SplitSentenceBolt extends BaseRichBolt {
    private OutputCollector collector;

    public void prepare(Map config, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    public void execute(Tuple tuple) {
        String sentence = tuple.getStringByField("sentence");
        String[] words = sentence.split(" ");
        for (String word : words) {
            this.collector.emit(new Values(word));
        }
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}
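The "guarantees your data will be processed" advantage relies on anchoring and acking, which this bolt (a BaseRichBolt) does not do. A hedged sketch of what an anchored, acked execute() could look like (the guarantee also requires the spout to emit its tuples with a message id):

public void execute(Tuple tuple) {
    String sentence = tuple.getStringByField("sentence");
    for (String word : sentence.split(" ")) {
        // anchor each emitted tuple to its input so failures are replayed
        this.collector.emit(tuple, new Values(word));
    }
    // tell Storm the input tuple has been fully processed
    this.collector.ack(tuple);
}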
31. A Sample Word Count Code
public class WordCountBolt extends BaseRichBolt {
    private OutputCollector collector;
    private HashMap<String, Long> counts = null;

    public void prepare(Map config, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        this.counts = new HashMap<String, Long>();
    }

    public void execute(Tuple tuple) {
        String word = tuple.getStringByField("word");
        Long count = this.counts.get(word);
        if (count == null) {
            count = 0L;
        }
        count++;
        this.counts.put(word, count);
        this.collector.emit(new Values(word, count));
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
    }
}
32. A Sample Word Count Code
public class ReportBolt extends BaseRichBolt {
    private HashMap<String, Long> counts = null;

    public void prepare(Map config, TopologyContext context, OutputCollector collector) {
        this.counts = new HashMap<String, Long>();
    }

    public void execute(Tuple tuple) {
        String word = tuple.getStringByField("word");
        Long count = tuple.getLongByField("count");
        this.counts.put(word, count);
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // this bolt does not emit anything
    }

    public void cleanup() {
        List<String> keys = new ArrayList<String>();
        keys.addAll(this.counts.keySet());
        Collections.sort(keys);
        for (String key : keys) {
            System.out.println(key + " : " + this.counts.get(key));
        }
    }
}
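The slides show the four components but not the code that wires them together. A minimal sketch (assuming the classes above; the component ids, topology name, and local-mode run are illustrative) that connects them and runs the topology locally; the fields grouping on "word" is what keeps each word's running count in a single WordCountBolt task:

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;
import org.apache.storm.utils.Utils;

public class WordCountTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("sentence-spout", new SentenceSpout());
        builder.setBolt("split-bolt", new SplitSentenceBolt())
               .shuffleGrouping("sentence-spout");
        builder.setBolt("count-bolt", new WordCountBolt())
               .fieldsGrouping("split-bolt", new Fields("word"));
        builder.setBolt("report-bolt", new ReportBolt())
               .globalGrouping("count-bolt");

        Config config = new Config();
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("word-count", config, builder.createTopology());

        Utils.sleep(10000);                  // let the topology run for a while
        cluster.killTopology("word-count");  // in local mode this gives ReportBolt.cleanup() a chance to print the counts
        cluster.shutdown();
    }
}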