Real-time Big Data Processing
with Storm: Using Twitter
Streaming as Example
Liang-Chi Hsieh
Hadoop in Taiwan 2013
In Today’s Talk
• Introduce stream computation in Big Data
• Introduce current stream computation
platforms
• Storm
• Architecture & concepts
• Use case: analysis of Twitter streaming data
Recap: the Four V's of Big Data
• To help us talk about 'big data', it is common to
break it down into four dimensions
• Volume: Scale of Data
• Velocity: Analysis of Streaming Data
• Variety: Different Forms of Data
• Veracity: Uncertainty of Data
http://dashburst.com/infographic/big-data-volume-variety-velocity/
• Velocity: Data in motion
• Requires real-time response to process and
analyze continuous data streams
http://www.intergen.co.nz/Global/Images/BlogImages/2013/Defining-big-data.png
Streaming Data
• Data coming from:
• Logs
• Sensors
• Stock trade
• Personal devices
• Network connections
• etc...
Batch Data Processing
Architecture
[Diagram: data flow, data store, Hadoop batch run, batch view, query]
• Views generated in batch may be
out of date
• Batch workflow is too slow
Data Processing Architecture:
Batch and Realtime
[Diagram: data flow, data store, Hadoop batch run and realtime processing, batch view and realtime view, query]
• Generate realtime views of data
by using stream computation
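As a rough illustration of the query path in this architecture, the sketch below answers a query by combining a batch view with a realtime view. The class name `MergedViewQuery`, the per-key count model, and the assumption that the two views cover disjoint time ranges are illustrative choices, not something the slides specify.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: answers a count query from a batch view plus a realtime view.
class MergedViewQuery {
    private final Map<String, Long> batchView;    // counts computed by the last batch run
    private final Map<String, Long> realtimeView; // counts for data that arrived since that run

    MergedViewQuery(Map<String, Long> batchView, Map<String, Long> realtimeView) {
        this.batchView = batchView;
        this.realtimeView = realtimeView;
    }

    // Assumes the two views cover disjoint time ranges, so their counts can simply be added.
    long count(String key) {
        return batchView.getOrDefault(key, 0L) + realtimeView.getOrDefault(key, 0L);
    }

    public static void main(String[] args) {
        Map<String, Long> batch = new HashMap<>();
        batch.put("taipei", 1200L);
        Map<String, Long> realtime = new HashMap<>();
        realtime.put("taipei", 35L);
        realtime.put("tainan", 4L);
        MergedViewQuery query = new MergedViewQuery(batch, realtime);
        System.out.println("taipei: " + query.count("taipei"));
        System.out.println("tainan: " + query.count("tainan"));
    }
}
```

The batch view stays authoritative for older data, while the realtime view only has to cover the gap since the last batch run.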
Current Stream Computation Platforms
• S4
• Storm
• Spark Streaming
• MillWheel
S4
• General-purpose, distributed, scalable, fault-tolerant,
pluggable platform for processing
data streams
• Initially released by Yahoo!
• Apache Incubator project since September
2011
• Written in Java
[Diagram: Adapter; PEs & Streams]
Storm
• Distributed and fault-tolerant realtime
computation
• Provides a set of general primitives for
doing realtime computation
http://storm-project.net/
Spark Streaming
• (Near) real-time processing of stream data
• New programming model
• Discretized streams (D-Streams)
• Built on Resilient Distributed Datasets (RDDs)
• Based on Spark
• Integrated with Spark batch and interactive
computation modes
Spark Streaming
• D-Streams
• Treat a streaming computation as a series of deterministic
batch computations on small time intervals
• Latencies can be as low as a second, supported by the fast
execution engine Spark
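// Create a streaming context with a 1-second batch interval
val ssc = new StreamingContext(sparkUrl, "Tutorial", Seconds(1), sparkHome, Seq(jarFile))
// Receive tweets as a D-Stream and keep only the status text
val tweets = ssc.twitterStream(twitterUsername, twitterPassword)
val statuses = tweets.map(status => status.getText())
// Print the first elements of each batch
statuses.print()
// Start the streaming computation
ssc.start()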
[Diagram: Twitter streaming data is split into D-Stream RDDs: batch@t, batch@t+1, batch@t+2]
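The discretization idea can be sketched in plain Java without Spark: events are bucketed by time interval, and each bucket can then be handled by an ordinary deterministic batch computation. The class `MicroBatcher` and its method are hypothetical names for illustration only and are not part of the Spark API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of discretizing a stream into fixed-interval batches.
class MicroBatcher {
    // Groups events into batches keyed by the start time of their interval.
    static Map<Long, List<String>> discretize(long[] timestampsMillis, String[] texts,
                                              long intervalMillis) {
        Map<Long, List<String>> batches = new TreeMap<>();
        for (int i = 0; i < timestampsMillis.length; i++) {
            long batchStart = Math.floorDiv(timestampsMillis[i], intervalMillis) * intervalMillis;
            batches.computeIfAbsent(batchStart, k -> new ArrayList<>()).add(texts[i]);
        }
        return batches;
    }

    public static void main(String[] args) {
        long[] timestamps = {100, 900, 1200, 2500};
        String[] texts = {"a", "b", "c", "d"};
        // With a 1-second interval, each map entry plays the role of one batch.
        discretize(timestamps, texts, 1000).forEach((start, batch) ->
                System.out.println("batch@" + start + " -> " + batch));
    }
}
```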
MillWheel
• Google’s computation framework for low-latency
stream data-processing applications
• Application logic is written as individual nodes in a
directed computation graph
• Fault tolerance
• Exactly-once delivery guarantees
• Low watermarks are used to prevent logical
inconsistencies caused by out-of-order data delivery
Storm: Distributed and Fault-Tolerant
Realtime Computation
• Guaranteed data processing
• Every tuple will be fully processed
• Exactly-once? Using Trident
• Horizontal scalability
• Fault-tolerance
• Easy to deploy and operate
• One click deploy on EC2
Storm Architecture
• A Storm cluster is similar to a Hadoop cluster
• Topologies vs. MapReduce jobs
• Running a topology:
storm jar all-my-code.jar backtype.storm.MyTopology arg1 arg2
• Killing a topology:
storm kill {topology name}
Storm Architecture
• Two kinds of nodes
• Master node runs a daemon called Nimbus
• Each worker node runs a daemon called Supervisor
• Each worker process executes a subset of a topology
https://github.com/nathanmarz/storm/wiki/images/storm-cluster.png
Topologies
• A topology is a graph of computation
• Each node contains processing logic
• Links between nodes represent the data flows between those
processing units
• Topology definitions are Thrift structs and Nimbus is a Thrift service
• You can create and submit topologies using any programming
language
Topologies: Concepts
• Stream: unbounded
sequence of tuples
• Primitives
• Spouts
• Bolts
• Interfaces can be
implemented to run
your logic
https://github.com/nathanmarz/storm/wiki/images/topology.png
Data Model
• Storm uses tuples as its data model
• A named list of values
• A field in a tuple can be an object of any type
• Storm supports all the primitive types, strings,
and byte arrays
• Implement a corresponding serializer to use a
custom type
Stream Grouping
• Define how streams are distributed to downstream
tasks
• Shuffle grouping: randomly distributed
• Fields grouping: partitioned by specified fields
• All grouping: replicated to all tasks
• Global grouping: the entire stream goes to the task with the lowest id
https://github.com/nathanmarz/storm/wiki/images/topology-tasks.png
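To make the four groupings concrete, the following plain-Java sketch shows one way a target task could be chosen for each tuple. It is a simplified model and not Storm's implementation: the class `GroupingSketch`, the use of task indices 0..n-1, and the hash-modulo rule for fields grouping are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical sketch of how each grouping could pick target task indices.
class GroupingSketch {
    // Shuffle grouping: one task chosen at random.
    static List<Integer> shuffle(int numTasks, Random rng) {
        return Collections.singletonList(rng.nextInt(numTasks));
    }

    // Fields grouping: tuples with equal field values always map to the same task.
    static List<Integer> fields(int numTasks, Object... fieldValues) {
        return Collections.singletonList(Math.floorMod(Arrays.hashCode(fieldValues), numTasks));
    }

    // All grouping: the tuple is replicated to every task.
    static List<Integer> all(int numTasks) {
        List<Integer> targets = new ArrayList<>();
        for (int i = 0; i < numTasks; i++) {
            targets.add(i);
        }
        return targets;
    }

    // Global grouping: every tuple goes to the task with the lowest id (index 0 here).
    static List<Integer> global(int numTasks) {
        return Collections.singletonList(0);
    }

    public static void main(String[] args) {
        System.out.println("fields  -> " + fields(12, 250330.0, 1215650.0));
        System.out.println("all     -> " + all(3));
        System.out.println("global  -> " + global(3));
        System.out.println("shuffle -> " + shuffle(3, new Random()));
    }
}
```

Fields grouping is what makes per-key state possible: all tuples for one key land on the same task, so that task can aggregate them locally.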
Simple Topology
TopologyBuilder builder = new TopologyBuilder();        
builder.setSpout("words", new TestWordSpout(), 10);        
builder.setBolt("exclaim1", new ExclamationBolt(), 3)
        .shuffleGrouping("words");
builder.setBolt("exclaim2", new ExclamationBolt(), 2)
        .shuffleGrouping("exclaim1");
[Diagram: "words" (TestWordSpout) → shuffleGrouping → "exclaim1" (ExclamationBolt) → shuffleGrouping → "exclaim2" (ExclamationBolt)]
Shuffle grouping: tuples are randomly distributed to the bolt's tasks
Submit Topology
Local mode:
// Run the topology in-process, let it run for 10 seconds, then shut it down
Config conf = new Config();
conf.setDebug(true);
conf.setNumWorkers(2);
LocalCluster cluster = new LocalCluster();
cluster.submitTopology("test", conf, builder.createTopology());
Utils.sleep(10000);
cluster.killTopology("test");
cluster.shutdown();
Distributed mode:
// Submit to a running cluster; topology is the result of builder.createTopology()
Config conf = new Config();
conf.setNumWorkers(20);
conf.setMaxSpoutPending(5000);
StormSubmitter.submitTopology("mytopology", conf, topology);
Guaranteeing Message Processing
• Every tuple will be fully processed
• Tuple tree
Fully processed: all messages in the tree must be processed.
Storm Reliability API
• A bolt that splits a tuple containing a sentence into
one tuple per word
public void execute(Tuple tuple) {
    String sentence = tuple.getString(0);
    for (String word : sentence.split(" ")) {
        // Anchor each word tuple to the input tuple
        _collector.emit(tuple, new Values(word));
    }
    // Mark the input tuple as processed
    _collector.ack(tuple);
}
“Anchoring” creates a new link in the tuple tree.
Calling “ack” (or “fail”) marks the tuple as complete (or failed).
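Storm's documentation describes tracking a tuple tree with a single value that is the XOR of the ids of all tuples created or acked in the tree. The sketch below models that idea in plain Java. It is a simplification for illustration, not Storm's acker: the class `AckTracker` and its method names are hypothetical, and the ids are assumed to be non-zero and unique within a tree.

```java
// Hypothetical sketch of XOR-based tuple-tree tracking for a single spout tuple.
class AckTracker {
    private long ackVal = 0L;
    private boolean started = false;
    private boolean failed = false;

    // Called when a tuple is added to the tree (the spout tuple or an anchored tuple).
    void anchored(long tupleId) {
        started = true;
        ackVal ^= tupleId;
    }

    // Called when a tuple in the tree is acked.
    void acked(long tupleId) {
        ackVal ^= tupleId;
    }

    // Called when any tuple in the tree is failed.
    void fail() {
        failed = true;
    }

    // Each id is XORed in once when anchored and once when acked,
    // so the value returns to 0 only when every tuple has been acked.
    boolean fullyProcessed() {
        return started && !failed && ackVal == 0L;
    }

    public static void main(String[] args) {
        AckTracker tracker = new AckTracker();
        tracker.anchored(11L);               // sentence tuple from the spout
        tracker.anchored(22L);               // first word tuple, anchored to the sentence
        tracker.anchored(33L);               // second word tuple, anchored to the sentence
        tracker.acked(11L);                  // the split bolt acks the sentence
        System.out.println(tracker.fullyProcessed());
        tracker.acked(22L);
        tracker.acked(33L);
        System.out.println(tracker.fullyProcessed());
    }
}
```

The appeal of this scheme is constant memory per spout tuple, regardless of how large the tuple tree grows.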
Storm on YARN
• Enable Storm clusters to be deployed on
Hadoop YARN
http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/yarn_architecture.gif
Use Case: Analysis of Twitter
Streaming Data
• Suppose we want to program a simple
visualization for Twitter streaming data
• Tweet visualization on map: heatmap
• Since there are too many tweets at the same
time, we would like to group tweets by their
geo-locations
Heatmap: Tweet Visualization on Map
• Graphical representation of tweet data
• Clear visualization of the intensity of
tweet count by geo-locations
• Static or dynamic
Batch Approach: Hadoop
• Generating static tweet heatmap
• Continuous data collecting
• Batch data processing using Hadoop Java
programs, Hive or Pig
[Diagram: Twitter, storage, batch processing by Hadoop]
Simple Geo-location-based
Tweet Grouping
• Goal
• To group geographically near tweets
together
• Using Hive
Data Store & Data Loading
• Simple data schema
CREATE EXTERNAL TABLE tweets (
  id_str STRING,
  geo STRUCT<
    type:STRING,
    coordinates:ARRAY<DOUBLE>>
)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/hduser/tweets';
• Loading data in Hive
load data local inpath '/mnt/tweets_2013_3.json' overwrite
into table tweets;
Hive Query
• Applying Hive query on collected tweets
data
insert overwrite local directory '/tmp/tweets_coords.txt' 
  select avg(geo.coordinates[0]),   
         avg(geo.coordinates[1]), 
         count(*) as tweet_count
  from tweets 
  group by floor(geo.coordinates[0] * 100000), 
           floor(geo.coordinates[1] * 100000)
  sort by tweet_count desc;
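For readers less familiar with Hive, the same grid-based grouping can be sketched in plain Java: each coordinate pair is mapped to a cell by scaling and flooring, and each cell keeps running sums and a count. The class `GridAggregation`, its `Cell` type, and the in-memory array input are hypothetical stand-ins for illustration, and the rows here are sorted globally by count.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory equivalent of the grid-based grouping query.
class GridAggregation {
    // One output row: sums for both coordinates and the number of tweets in the cell.
    static final class Cell {
        double sum0;
        double sum1;
        long count;

        double avg0() { return sum0 / count; }
        double avg1() { return sum1 / count; }
    }

    // coords[i] holds {coordinates[0], coordinates[1]} of tweet i; scale sets the cell size.
    static List<Cell> aggregate(double[][] coords, double scale) {
        Map<String, Cell> cells = new HashMap<>();
        for (double[] c : coords) {
            String key = (long) Math.floor(c[0] * scale) + ":" + (long) Math.floor(c[1] * scale);
            Cell cell = cells.computeIfAbsent(key, k -> new Cell());
            cell.sum0 += c[0];
            cell.sum1 += c[1];
            cell.count++;
        }
        List<Cell> rows = new ArrayList<>(cells.values());
        // Largest groups first, like ordering by tweet_count descending.
        rows.sort(Comparator.comparingLong((Cell r) -> r.count).reversed());
        return rows;
    }

    public static void main(String[] args) {
        double[][] coords = {{25.031, 121.561}, {25.039, 121.569}, {24.5, 120.5}};
        for (Cell row : aggregate(coords, 100)) {
            System.out.println(row.avg0() + "\t" + row.avg1() + "\t" + row.count);
        }
    }
}
```

A larger scale factor gives smaller cells and therefore a finer heatmap, at the cost of more groups.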
Static Tweet Heatmap
• Heatmap visualization of a subset of the tweets
collected in January 2013
Streaming Approach: Storm
• Generate realtime Twitter usage heatmap view
• Higher level Storm programming by using DSLs
• Scala DSL here
Bolt DSL:
class ExclamationBolt extends StormBolt(outputFields = List("word")) {
  def execute(t: Tuple) = {
    t emit (t.getString(0) + "!!!")
    t ack
  }
}
Spout DSL:
class MySpout extends StormSpout(outputFields = List("word", "author")) {
  def nextTuple = {}
}
Stream Computation Design
• Tweets: group geographically near tweets
• At each defined time slot, calculate some statistics,
e.g. average geo-locations, for each group
• Perform prediction tasks such as classification and
sentiment analysis
• Send/store results
Create Topology
val builder = new TopologyBuilder
builder.setSpout("tweetstream", new TweetStreamSpout, 1)
builder.setSpout("clock", new ClockSpout)
builder.setBolt("geogrouping", new GeoGrouping, 12)
.fieldsGrouping("tweetstream", new Fields("geo_lat", "geo_lng"))
.allGrouping("clock")
• Two spouts
• One produces the tweet stream
• One generates the time interval ticks needed to update tweet
statistics
• Only one bolt; the tweet stream is grouped by the geo_lat and geo_lng fields, and clock ticks are replicated to all of the bolt's tasks
Tweet Spout & Clock Spout
class TweetStreamSpout
  extends StormSpout(outputFields = List("geo_lat", "geo_lng", "lat", "lng", "txt")) {
  def nextTuple = {
    ...
    emit (math.floor(lat * 10000), math.floor(lng * 10000), lat, lng, txt)
    ...
  }
}
class ClockSpout extends StormSpout(outputFields = List("timestamp")) {
  def nextTuple {
    Thread sleep 1000 * 1
    emit (System.currentTimeMillis / 1000)
  }
}
GeoGrouping Bolt
class GeoGrouping extends StormBolt(List("geo_lat", "geo_lng", "lat", "lng", "txt")) {
  def execute(t: Tuple) = t matchSeq {
    case Seq(clockTime: Long) =>
      // Calculate statistics for each group of tweets
      // Perform classification tasks
      // Send/Store results
    case Seq(geo_lat: Double, geo_lng: Double, lat: Double, lng: Double, txt: String) =>
      // Group tweets by geo-locations
  }
}
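The commented-out steps in the bolt can be pictured with a small plain-Java model of the state each bolt task might keep: tweet tuples accumulate into grid cells, and a clock tuple turns the accumulated sums into per-cell averages and resets the state. This is a hypothetical sketch with no Storm dependency. The class `GeoGroupingState`, its method names, and the choice of averages as the statistic are assumptions, not the talk's actual implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the per-task state of a GeoGrouping-style bolt.
class GeoGroupingState {
    // cell key -> {sum of lat, sum of lng, tweet count}
    private final Map<String, double[]> cells = new HashMap<>();

    // Tweet tuple: accumulate the tweet into its grid cell.
    void onTweet(double geoLat, double geoLng, double lat, double lng) {
        String key = (long) geoLat + ":" + (long) geoLng;
        double[] acc = cells.computeIfAbsent(key, k -> new double[3]);
        acc[0] += lat;
        acc[1] += lng;
        acc[2] += 1;
    }

    // Clock tuple: return {average lat, average lng, count} per cell, then reset for the next slot.
    Map<String, double[]> onClockTick() {
        Map<String, double[]> result = new HashMap<>();
        for (Map.Entry<String, double[]> entry : cells.entrySet()) {
            double[] acc = entry.getValue();
            result.put(entry.getKey(), new double[] {acc[0] / acc[2], acc[1] / acc[2], acc[2]});
        }
        cells.clear();
        return result;
    }

    public static void main(String[] args) {
        GeoGroupingState state = new GeoGroupingState();
        state.onTweet(250330, 1215650, 25.03301, 121.56501);
        state.onTweet(250330, 1215650, 25.03303, 121.56503);
        state.onClockTick().forEach((cell, stats) ->
                System.out.println(cell + " -> count " + (long) stats[2]));
    }
}
```

Because the tweet stream uses fields grouping on the cell fields, each task only ever sees its own cells, while the all-grouped clock stream tells every task when to flush.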
Demo
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 

Real-time Big Data Processing with Storm

  • 1. Real-time Big Data Processing with Storm: Using Twitter Streaming as Example • Liang-Chi Hsieh • Hadoop in Taiwan 2013
  • 2. In Today’s Talk • Introduce stream computation in Big Data • Introduce current stream computation platforms • Storm • Architecture & concepts • Use case: analysis of Twitter streaming data
  • 3. Recap, the Four V’s of Big Data • To help us talk about ‘big data’, it is common to break it down into four dimensions • Volume: Scale of Data • Velocity: Analysis of Streaming Data • Variety: Different Forms of Data • Veracity: Uncertainty of Data • Source: http://dashburst.com/infographic/big-data-volume-variety-velocity/
  • 4. Velocity: Data in motion • Requires a realtime response to process and analyze continuous data streams • Image: http://www.intergen.co.nz/Global/Images/BlogImages/2013/Defining-big-data.png
  • 5. Streaming Data • Data coming from: • Logs • Sensors • Stock trades • Personal devices • Network connections • etc.
  • 6. Batch Data Processing Architecture • [Diagram labels: Data Flow, Data Store, Hadoop, Batch Run, Batch View, Query] • Views generated in batch may be out of date • Batch workflow is too slow
  • 7. Data Processing Architecture: Batch and Realtime • [Diagram labels: Data Flow, Data Store, Hadoop, Batch Run, Realtime Processing, Batch View, Realtime View, Query] • Generate realtime views of data by using stream computation
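One way to picture how a query could read both views is the minimal Java sketch below. It assumes, purely for illustration, that the batch view and the realtime view are count maps sharing the same keys; the class and method names are hypothetical and do not come from the talk.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: MergedView and merge() are illustrative names, not from the talk.
class MergedView {
    // Combine the (possibly stale) batch view with the realtime view.
    // A key missing from one view simply contributes nothing from that view.
    static Map<String, Long> merge(Map<String, Long> batchView, Map<String, Long> realtimeView) {
        Map<String, Long> result = new HashMap<>(batchView);
        for (Map.Entry<String, Long> e : realtimeView.entrySet()) {
            result.merge(e.getKey(), e.getValue(), Long::sum);
        }
        return result;
    }
}
```

The batch view covers everything up to the last batch run, and the realtime view only needs to cover the data that arrived since then.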
  • 8. Current Stream Computation Platforms • S4 • Storm • Spark Streaming • MillWheel
  • 9. S4 • General-purpose, distributed, scalable, fault-tolerant, pluggable platform for processing data streams • Initially released by Yahoo! • Apache Incubator project since September 2011 • Written in Java • [Diagram labels: Adapter, PEs & Streams]
  • 10. Storm • Distributed and fault-tolerant realtime computation • Provides a set of general primitives for doing realtime computation • http://storm-project.net/
  • 11. Spark Streaming • (Near) real-time processing of stream data • New programming model • Discretized streams (D-Streams) • Built on Resilient Distributed Datasets (RDDs) • Based on Spark • Integrated with Spark batch and interactive computation modes
  • 12. Spark Streaming • D-Streams • Treat a streaming computation as a series of deterministic batch computations on small time intervals • Latencies can be as low as a second, supported by the fast execution engine Spark • [Diagram: Twitter Streaming Data; D-Streams: RDDs (batch@t, batch@t+1, batch@t+2)]
      val ssc = new StreamingContext(sparkUrl, "Tutorial", Seconds(1), sparkHome, Seq(jarFile))
      val tweets = ssc.twitterStream(twitterUsername, twitterPassword)
      val statuses = tweets.map(status => status.getText())
      statuses.print()
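The discretization idea can be sketched without Spark. The Java sketch below assumes events are represented only by their timestamps in milliseconds and assigns each one to a fixed-length interval; `MicroBatcher` is a hypothetical name, and this is not how Spark Streaming is implemented internally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of discretizing a stream into small batches.
class MicroBatcher {
    // Map each batch start time to the event timestamps that fall in
    // [batchStart, batchStart + intervalMs).
    static Map<Long, List<Long>> batch(List<Long> timestampsMs, long intervalMs) {
        Map<Long, List<Long>> batches = new TreeMap<>();
        for (long ts : timestampsMs) {
            long batchStart = Math.floorDiv(ts, intervalMs) * intervalMs;
            batches.computeIfAbsent(batchStart, k -> new ArrayList<>()).add(ts);
        }
        return batches;
    }
}
```

Each resulting batch can then be processed with an ordinary deterministic batch computation, which is the property the slide attributes to D-Streams.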
  • 13. MillWheel • Google’s computation framework for low-latency stream data-processing applications • Application logic is written as individual nodes in a directed computation graph • Fault tolerance • Exactly-once delivery guarantees • Low watermarks are used to prevent logical inconsistencies caused by out-of-order data delivery
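As a rough illustration of the low-watermark idea, the Java sketch below treats the low watermark as the smallest timestamp among records that are still pending, and only closes a time window once the watermark has passed its end. This is a simplified assumption for a single node; the names are hypothetical and the sketch does not reproduce MillWheel's actual API.

```java
import java.util.List;

// Hypothetical, simplified sketch of a low watermark for one processing node.
class LowWatermarkSketch {
    // Smallest timestamp among pending records; Long.MAX_VALUE when nothing is pending.
    static long of(List<Long> pendingTimestamps) {
        long min = Long.MAX_VALUE;
        for (long ts : pendingTimestamps) {
            min = Math.min(min, ts);
        }
        return min;
    }

    // A window [start, windowEnd) can be finalized once no pending record
    // can still carry a timestamp earlier than windowEnd.
    static boolean canClose(long windowEnd, long lowWatermark) {
        return lowWatermark >= windowEnd;
    }
}
```

Waiting for the watermark is what keeps a late, out-of-order record from arriving after its window has already been reported.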
  • 14. Storm: Distributed and Fault-Tolerant Realtime Computation • Guaranteed data processing • Every tuple will be fully processed • Exactly-once? Using Trident • Horizontal scalability • Fault-tolerance • Easy to deploy and operate • One-click deploy on EC2
  • 15. Storm Architecture • A Storm cluster is similar to a Hadoop cluster • Topologies vs. MapReduce jobs • Running a topology: storm jar all-my-code.jar backtype.storm.MyTopology arg1 arg2 • Killing a topology: storm kill {topology name}
  • 16. Storm Architecture • Two kinds of nodes • Master node runs a daemon called Nimbus • Each worker node runs a daemon called Supervisor • Each worker process executes a subset of a topology • Image: https://github.com/nathanmarz/storm/wiki/images/storm-cluster.png
  • 17. Topologies • A topology is a graph of computation • Each node contains processing logic • Links between nodes represent the data flows between those processing units • Topology definitions are Thrift structs and Nimbus is a Thrift service • You can create and submit topologies using any programming language
  • 18. Topologies: Concepts • Stream: unbounded sequence of tuples • Primitives • Spouts • Bolts • Interfaces can be implemented to run your logic • Image: https://github.com/nathanmarz/storm/wiki/images/topology.png
  • 19. Data Model • Storm uses tuples as its data model • A tuple is a named list of values • A field in a tuple can be an object of any type • Storm supports all the primitive types, strings, and byte arrays • Implement a corresponding serializer to use a custom type
  • 20. Stream Grouping • Defines how streams are distributed to downstream tasks • Shuffle grouping: tuples are randomly distributed • Fields grouping: tuples are partitioned by the specified fields • All grouping: tuples are replicated to all tasks • Global grouping: tuples go to the task with the lowest id • Image: https://github.com/nathanmarz/storm/wiki/images/topology-tasks.png
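The four groupings can be thought of as different ways of choosing target tasks for a tuple. The Java sketch below is a standalone illustration under that assumption; `GroupingSketch` and its methods are hypothetical names, and Storm's own implementation may choose tasks differently (for example, with another hash function).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical sketch of how each grouping could pick target tasks.
class GroupingSketch {
    // Shuffle grouping: any task, chosen at random.
    static int shuffle(int numTasks, Random rng) {
        return rng.nextInt(numTasks);
    }

    // Fields grouping: equal field values always map to the same task.
    static int fields(int numTasks, Object... fieldValues) {
        return Math.floorMod(Arrays.hashCode(fieldValues), numTasks);
    }

    // All grouping: every task receives a copy of the tuple.
    static List<Integer> all(int numTasks) {
        List<Integer> targets = new ArrayList<>();
        for (int i = 0; i < numTasks; i++) {
            targets.add(i);
        }
        return targets;
    }

    // Global grouping: the single task with the lowest id.
    static int global(List<Integer> taskIds) {
        return Collections.min(taskIds);
    }
}
```

The property that matters for the use case later in the talk is the one shown by `fields`: tuples with the same grouping fields always reach the same task, so per-key state can live in one place.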
  • 23. Guaranteeing Message Processing • Every tuple will be fully processed • Tuple tree • Fully processed: all messages in the tree must be processed
  • 24. Storm Reliability API • A bolt that splits a tuple containing a sentence into tuples of words • “Anchoring” creates a new link in the tuple tree • Calling “ack” (or “fail”) marks the tuple as complete (or failed)
      public void execute(Tuple tuple) {
          String sentence = tuple.getString(0);
          for (String word : sentence.split(" ")) {
              // Anchor each word tuple to the input tuple
              _collector.emit(tuple, new Values(word));
          }
          // Acknowledge the input tuple after its words have been emitted
          _collector.ack(tuple);
      }
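To show how anchoring and acking together can detect that a tuple tree is fully processed, the Java sketch below tracks one value per spout tuple and XORs every anchored and acked tuple id into it; the tree is complete when the value returns to zero. Storm's acker is built on this XOR idea, but the class, method names, and id handling here are hypothetical simplifications.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of tuple-tree tracking by XOR-ing tuple ids.
class AckTrackerSketch {
    private final Map<Long, Long> ackVals = new HashMap<>();

    // Start tracking a tree whose root is the spout tuple with id tupleId.
    void track(long rootId, long tupleId) {
        ackVals.put(rootId, tupleId);
    }

    // Anchoring: a newly emitted tuple joins the tree.
    void anchor(long rootId, long newTupleId) {
        ackVals.put(rootId, ackVals.getOrDefault(rootId, 0L) ^ newTupleId);
    }

    // Ack: returns true once every tuple in the tree has been acked.
    // Ids are assumed to be random 64-bit values, so an accidental zero is unlikely.
    boolean ack(long rootId, long tupleId) {
        long v = ackVals.getOrDefault(rootId, 0L) ^ tupleId;
        if (v == 0L) {
            ackVals.remove(rootId);
            return true;
        }
        ackVals.put(rootId, v);
        return false;
    }
}
```

Every id enters the value once when the tuple is created and once when it is acked, so the two occurrences cancel and the tracker needs only constant space per spout tuple.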
  • 25. Storm on YARN • Enable Storm clusters to be deployed on Hadoop YARN • Image: http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/yarn_architecture.gif
  • 26. Use Case: Analysis of Twitter Streaming Data • Suppose we want to program a simple visualization for Twitter streaming data • Tweet visualization on map: heatmap • Because too many tweets arrive at the same time, we would like to group tweets by their geo-locations
  • 27. Heatmap: Tweet Visualization on Map • Graphical representation of tweet data • Clear visualization of the intensity of tweet count by geo-locations • Static or dynamic
  • 28. Batch Approach: Hadoop • Generating a static tweet heatmap • Continuous data collection • Batch data processing using Hadoop Java programs, Hive or Pig • [Diagram labels: Twitter, Storage, Batch Processing by Hadoop]
  • 29. Simple Geo-location-based Tweet Grouping • Goal • To group geographically near tweets together • Using Hive
  • 30. Data Store & Data Loading • Simple data schema
      CREATE EXTERNAL TABLE tweets (
        id_str STRING,
        geo STRUCT<
          type:STRING,
          coordinates:ARRAY<DOUBLE>>
      )
      ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
      LOCATION '/user/hduser/tweets';
    • Loading data in Hive
      load data local inpath '/mnt/tweets_2013_3.json' overwrite into table tweets;
  • 31. Hive Query • Applying a Hive query to the collected tweet data
      insert overwrite local directory '/tmp/tweets_coords.txt'
        select avg(geo.coordinates[0]),
               avg(geo.coordinates[1]),
               count(*) as tweet_count
        from tweets
        group by floor(geo.coordinates[0] * 100000),
                 floor(geo.coordinates[1] * 100000)
        sort by tweet_count desc;
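The grouping performed by the query can be mirrored in plain Java. The sketch below assumes coordinates arrive as two-element arrays and snaps each pair to a grid cell with `floor(value * scale)`, then reports per-cell averages and counts sorted by count. `GridAverages`, `Cell`, and the `scale` parameter are hypothetical names for illustration; the query above corresponds to a scale of 100000.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory equivalent of the group-by-grid-cell query.
class GridAverages {
    // Running totals for one grid cell.
    static final class Cell {
        double sum0, sum1;
        long count;
        double avg0() { return sum0 / count; }
        double avg1() { return sum1 / count; }
    }

    // Group coordinate pairs by grid cell and sort cells by descending count.
    static List<Cell> group(double[][] coords, double scale) {
        Map<String, Cell> cells = new HashMap<>();
        for (double[] c : coords) {
            String key = (long) Math.floor(c[0] * scale) + ":" + (long) Math.floor(c[1] * scale);
            Cell cell = cells.computeIfAbsent(key, k -> new Cell());
            cell.sum0 += c[0];
            cell.sum1 += c[1];
            cell.count++;
        }
        List<Cell> result = new ArrayList<>(cells.values());
        result.sort((a, b) -> Long.compare(b.count, a.count));
        return result;
    }
}
```

A larger scale gives smaller cells and therefore a finer heatmap, at the cost of more groups with fewer tweets each.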
  • 32. Static Tweet Heatmap • Heatmap visualization of a subset of the tweets collected in January 2013
  • 33. Streaming Approach: Storm • Generate a realtime Twitter usage heatmap view • Higher-level Storm programming by using DSLs • Scala DSL here
      Bolt DSL:
      class ExclamationBolt extends StormBolt(outputFields = List("word")) {
        def execute(t: Tuple) = {
          t emit (t.getString(0) + "!!!")
          t ack
        }
      }
      Spout DSL:
      class MySpout extends StormSpout(outputFields = List("word", "author")) {
        def nextTuple = {}
      }
  • 34. Stream Computation Design • [Diagram labels: Tweets; Defined Time Slot; Calculate some statistics, e.g. average geo-locations, for each group; Group geographically near tweets; Perform prediction tasks such as classification, sentiment analysis; Send/Store results]
  • 35. Create Topology • Two Spouts • One produces the tweet stream • One generates the time interval needed to update tweet statistics • Only one Bolt; stream grouping by lat, lng for the tweet stream
      val builder = new TopologyBuilder
      builder.setSpout("tweetstream", new TweetStreamSpout, 1)
      builder.setSpout("clock", new ClockSpout)
      builder.setBolt("geogrouping", new GeoGrouping, 12)
        .fieldsGrouping("tweetstream", new Fields("geo_lat", "geo_lng"))
        .allGrouping("clock")
  • 36. Tweet Spout & Clock Spout
      class TweetStreamSpout extends StormSpout(outputFields = List("geo_lat", "geo_lng", "lat", "lng", "txt")) {
        def nextTuple = {
          ...
          emit (math.floor(lat * 10000), math.floor(lng * 10000), lat, lng, txt)
          ...
        }
      }
      class ClockSpout extends StormSpout(outputFields = List("timestamp")) {
        def nextTuple {
          Thread sleep 1000 * 1
          emit (System.currentTimeMillis / 1000)
        }
      }
  • 37. GeoGrouping Bolt
      class GeoGrouping extends StormBolt(List("geo_lat", "geo_lng", "lat", "lng", "txt")) {
        def execute(t: Tuple) = t matchSeq {
          case Seq(clockTime: Long) =>
            // Calculate statistics for each group of tweets
            // Perform classification tasks
            // Send/Store results
          case Seq(geo_lat: Double, geo_lng: Double, lat: Double, lng: Double, txt: String) =>
            // Group tweets by geo-locations
        }
      }
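The two cases of the bolt can be filled in along the following lines. This Java sketch works outside Storm and assumes the bolt keeps running totals per grid cell, adds to them on every tweet tuple, and on every clock tuple emits the per-cell averages and starts over. The class and method names are hypothetical, and resetting the state on each tick is one possible design rather than something the slides specify.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the state a GeoGrouping bolt might keep.
class GeoGroupingSketch {
    // Running totals for one grid cell.
    static final class Stats {
        double sumLat, sumLng;
        long count;
    }

    private final Map<String, Stats> groups = new HashMap<>();

    // Tweet case: add the tweet to the group identified by its grid cell.
    void onTweet(double geoLat, double geoLng, double lat, double lng) {
        Stats s = groups.computeIfAbsent(geoLat + ":" + geoLng, k -> new Stats());
        s.sumLat += lat;
        s.sumLng += lng;
        s.count++;
    }

    // Clock case: return {average lat, average lng, tweet count} per cell,
    // then clear the state for the next time slot.
    Map<String, double[]> onClock() {
        Map<String, double[]> out = new HashMap<>();
        for (Map.Entry<String, Stats> e : groups.entrySet()) {
            Stats s = e.getValue();
            out.put(e.getKey(), new double[] { s.sumLat / s.count, s.sumLng / s.count, s.count });
        }
        groups.clear();
        return out;
    }
}
```

Because the topology uses a fields grouping on geo_lat and geo_lng, all tweets of one cell reach the same bolt task, so each task can keep its totals locally, while the all grouping on the clock stream makes every task flush at each tick.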