Real-time Big Data Processing with Storm
1. Real-time Big Data Processing with Storm: Using Twitter Streaming as Example
Liang-Chi Hsieh
Hadoop in Taiwan 2013
2. In Today's Talk
• Introduce stream computation in Big Data
• Introduce current stream computation platforms
• Storm
• Architecture & concepts
• Use case: analysis of Twitter streaming data
3. Recap: the Four V's of Big Data
• To help us talk about "big data", it is common to break it down into four dimensions
• Volume: scale of data
• Velocity: analysis of streaming data
• Variety: different forms of data
• Veracity: uncertainty of data
http://dashburst.com/infographic/big-data-volume-variety-velocity/
4. ⢠Velocity: Data in motion
⢠Require realtime response to process,
analyze continuous data stream
http://www.intergen.co.nz/Global/Images/BlogImages/2013/DeďŹning-big-data.png
4
5. Streaming Data
• Data coming from:
• Logs
• Sensors
• Stock trades
• Personal devices
• Network connections
• etc.
6. Batch Data Processing Architecture
[Diagram: Data Flow → Data Store → Hadoop Batch Run → Batch View → Query]
• Views generated in batch may be out of date
• The batch workflow is too slow
7. Data Processing Architecture: Batch and Realtime
[Diagram: Data Flow → Data Store → Hadoop Batch Run → Batch View, and Data Store → Realtime Processing → Realtime View; both views feed the Query]
• Generate realtime views of data by using stream computation
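The query step in this architecture has to combine both views: the batch view holds counts computed up to the last batch run, and the realtime view holds counts for data that arrived since. A minimal sketch of that query-time merge in plain Java, with illustrative class and method names not taken from the talk:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the query-time merge of a batch view and a realtime view.
// Both views map a key (e.g. a geo cell) to a count; the up-to-date
// answer for any key is the sum of the two.
public class ViewMerge {

    // Merge two key->count views into one up-to-date view.
    public static Map<String, Long> merge(Map<String, Long> batchView,
                                          Map<String, Long> realtimeView) {
        Map<String, Long> merged = new HashMap<>(batchView);
        for (Map.Entry<String, Long> e : realtimeView.entrySet()) {
            merged.merge(e.getKey(), e.getValue(), Long::sum);
        }
        return merged;
    }
}
```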
9. S4
• General-purpose, distributed, scalable, fault-tolerant, pluggable platform for processing data streams
• Initially released by Yahoo!
• Apache Incubator project since September 2011
• Written in Java
[Diagram: Adapter, PEs & Streams]
10. Storm
• Distributed and fault-tolerant realtime computation
• Provides a set of general primitives for doing realtime computation
http://storm-project.net/
11. Spark Streaming
• (Near) real-time processing of stream data
• New programming model
• Discretized streams (D-Streams)
• Built on Resilient Distributed Datasets (RDDs)
• Based on Spark
• Integrated with Spark's batch and interactive computation modes
12. Spark Streaming
• D-Streams
• Treat a streaming computation as a series of deterministic batch computations on small time intervals
• Latencies can be as low as a second, supported by Spark's fast execution engine

val ssc = new StreamingContext(sparkUrl, "Tutorial", Seconds(1), sparkHome, Seq(jarFile))
val tweets = ssc.twitterStream(twitterUsername, twitterPassword)
val statuses = tweets.map(status => status.getText())
statuses.print()

[Diagram: Twitter streaming data cut into batch@t, batch@t+1, batch@t+2; each D-Stream interval is an RDD]
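The core D-Stream idea (cutting a stream into small, deterministic batches keyed by time) can be sketched without any Spark dependency. `MicroBatcher` and its methods below are illustrative names, not the Spark Streaming API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Minimal sketch of micro-batching: events carrying a timestamp are
// assigned to fixed-length batches, and each batch can then be handed
// to an ordinary deterministic batch computation.
public class MicroBatcher {

    // Which batch an event timestamp (ms) falls into, for a given
    // batch length (ms).
    public static long batchIndex(long timestampMs, long batchLengthMs) {
        return timestampMs / batchLengthMs;
    }

    // Group event timestamps into batches, in batch order.
    public static Map<Long, List<Long>> toBatches(List<Long> timestamps,
                                                  long batchLengthMs) {
        Map<Long, List<Long>> batches = new TreeMap<>();
        for (long t : timestamps) {
            batches.computeIfAbsent(batchIndex(t, batchLengthMs),
                                    k -> new ArrayList<>()).add(t);
        }
        return batches;
    }
}
```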
13. MillWheel
• Google's computation framework for low-latency stream data-processing applications
• Application logic is written as individual nodes in a directed computation graph
• Fault tolerance
• Exactly-once delivery guarantees
• Low watermarks are used to prevent logical inconsistencies caused by out-of-order data delivery
14. Storm: Distributed and Fault-Tolerant Realtime Computation
• Guaranteed data processing
• Every tuple will be fully processed
• Exactly-once? Use Trident
• Horizontal scalability
• Fault tolerance
• Easy to deploy and operate
• One-click deploy on EC2
15. Storm Architecture
• A Storm cluster is similar to a Hadoop cluster
• Topologies vs. MapReduce jobs
• Running a topology:
storm jar all-my-code.jar backtype.storm.MyTopology arg1 arg2
• Killing a topology:
storm kill {topology name}
16. Storm Architecture
• Two kinds of nodes
• The master node runs a daemon called Nimbus
• Each worker node runs a daemon called Supervisor
• Each worker process executes a subset of a topology
https://github.com/nathanmarz/storm/wiki/images/storm-cluster.png
17. Topologies
• A topology is a graph of computation
• Each node contains processing logic
• Links between nodes represent the data flows between those processing units
• Topology definitions are Thrift structs and Nimbus is a Thrift service
• You can create and submit topologies using any programming language
18. Topologies: Concepts
• Stream: an unbounded sequence of tuples
• Primitives
• Spouts
• Bolts
• Interfaces can be implemented to run your logic
https://github.com/nathanmarz/storm/wiki/images/topology.png
19. Data Model
• Tuples are used by Storm as its data model
• A named list of values
• A field in a tuple can be an object of any type
• Storm supports all the primitive types, strings, and byte arrays
• Implement a corresponding serializer to use a custom type
20. Stream Grouping
• Defines how streams are distributed to downstream tasks
• Shuffle grouping: randomly distributed
• Fields grouping: partitioned by specified fields
• All grouping: replicated to all tasks
• Global grouping: sent to the task with the lowest id
https://github.com/nathanmarz/storm/wiki/images/topology-tasks.png
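As a rough sketch of the key contrast between these groupings: fields grouping must deterministically map the same field values to the same task (so per-key state stays on one task), while shuffle grouping just spreads tuples across tasks. The class below mirrors the idea with illustrative names; it is not Storm's internal implementation:

```java
import java.util.List;
import java.util.Random;

// Task selection for two stream groupings, in miniature.
public class Groupings {

    // Fields grouping: hash the selected field values, mod the task
    // count. Deterministic: equal values always reach the same task.
    public static int fieldsGrouping(List<Object> fieldValues, int numTasks) {
        return Math.floorMod(fieldValues.hashCode(), numTasks);
    }

    // Shuffle grouping: pick any task; here randomly (Storm actually
    // distributes round-robin, which is also uniform over time).
    public static int shuffleGrouping(Random rng, int numTasks) {
        return rng.nextInt(numTasks);
    }
}
```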
23. Guaranteeing Message Processing
• Every tuple will be fully processed
• Tuple tree
• Fully processed: all messages in the tree must be processed
24. Storm Reliability API
• A bolt that splits a tuple containing a sentence into tuples of individual words

public void execute(Tuple tuple) {
    String sentence = tuple.getString(0);
    for (String word : sentence.split(" ")) {
        // "Anchoring" to the input tuple creates a new link in the tuple tree.
        _collector.emit(tuple, new Values(word));
    }
    // Calling "ack" (or "fail") marks the tuple as complete (or failed).
    _collector.ack(tuple);
}
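Storm tracks tuple trees like this one with a constant-memory XOR trick: every tuple id is XORed into the tree's running value once when the tuple is anchored and once when it is acked, so the value returns to zero exactly when every tuple has been both emitted and acked. A minimal sketch of that bookkeeping, with illustrative class and method names:

```java
import java.util.HashMap;
import java.util.Map;

// Miniature version of Storm's acker bookkeeping. Because x ^ x == 0,
// one 64-bit value per tree suffices no matter how many tuples it has.
public class Acker {
    // root tuple id -> running XOR of all anchored/acked tuple ids
    private final Map<Long, Long> pending = new HashMap<>();

    // XOR a tuple id into a tree's running value. Called once when the
    // tuple is anchored (emitted) and once when it is acked.
    public void xorTuple(long rootId, long tupleId) {
        pending.merge(rootId, tupleId, (a, b) -> a ^ b);
    }

    // The tree is fully processed when the running XOR is back to zero.
    public boolean fullyProcessed(long rootId) {
        return pending.getOrDefault(rootId, 0L) == 0L;
    }
}
```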
25. Storm on YARN
• Enables Storm clusters to be deployed on Hadoop YARN
http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/yarn_architecture.gif
26. Use Case: Analysis of Twitter Streaming Data
• Suppose we want to program a simple visualization for Twitter streaming data
• Tweet visualization on a map: a heatmap
• Since there are too many tweets at the same time, we would like to group tweets by their geo-locations
27. Heatmap: Tweet Visualization on a Map
• Graphical representation of tweet data
• Clear visualization of the intensity of tweet counts by geo-location
• Static or dynamic
28. Batch Approach: Hadoop
• Generating a static tweet heatmap
• Continuous data collection
• Batch data processing using Hadoop Java programs, Hive, or Pig
[Diagram: Twitter → Storage → Batch Processing by Hadoop]
30. Data Store & Data Loading
• Simple data schema

CREATE EXTERNAL TABLE tweets (
  id_str STRING,
  geo STRUCT<
    type:STRING,
    coordinates:ARRAY<DOUBLE>>
)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/hduser/tweets';

• Loading data into Hive

load data local inpath '/mnt/tweets_2013_3.json' overwrite
into table tweets;
31. Hive Query
• Applying a Hive query to the collected tweet data

insert overwrite local directory '/tmp/tweets_coords.txt'
  select avg(geo.coordinates[0]),
         avg(geo.coordinates[1]),
         count(*) as tweet_count
  from tweets
  group by floor(geo.coordinates[0] * 100000),
           floor(geo.coordinates[1] * 100000)
  sort by tweet_count desc;
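The grid-cell bucketing this query performs (snapping each coordinate to floor(coord * 100000) and counting tweets per cell) can be sketched in plain Java; `GeoBuckets` and its methods are illustrative names:

```java
import java.util.HashMap;
import java.util.Map;

// In-memory version of the query's GROUP BY on snapped coordinates.
public class GeoBuckets {

    // Grid-cell key for a (lat, lng) pair, matching the query's
    // group by floor(coord * 100000).
    public static String cellKey(double lat, double lng) {
        return (long) Math.floor(lat * 100000) + ","
             + (long) Math.floor(lng * 100000);
    }

    // Count tweets per grid cell; coords is an array of {lat, lng} pairs.
    public static Map<String, Integer> countByCell(double[][] coords) {
        Map<String, Integer> counts = new HashMap<>();
        for (double[] c : coords) {
            counts.merge(cellKey(c[0], c[1]), 1, Integer::sum);
        }
        return counts;
    }
}
```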
33. Streaming Approach: Storm
• Generate a realtime Twitter usage heatmap view
• Higher-level Storm programming by using DSLs
• A Scala DSL here

Bolt DSL:
class ExclamationBolt extends StormBolt(outputFields = List("word")) {
  def execute(t: Tuple) = {
    t emit (t.getString(0) + "!!!")
    t ack
  }
}

Spout DSL:
class MySpout extends StormSpout(outputFields = List("word", "author")) {
  def nextTuple = {}
}
34. Stream Computation Design
[Diagram: tweets arrive within a defined time slot and flow through the following steps]
• Group geographically near tweets
• Calculate some statistics, e.g. average geo-locations, for each group
• Perform prediction tasks such as classification or sentiment analysis
• Send/store results
35. Create Topology

val builder = new TopologyBuilder
builder.setSpout("tweetstream", new TweetStreamSpout, 1)
builder.setSpout("clock", new ClockSpout)
builder.setBolt("geogrouping", new GeoGrouping, 12)
  .fieldsGrouping("tweetstream", new Fields("geo_lat", "geo_lng"))
  .allGrouping("clock")

• Two spouts
• One produces the tweet stream
• One generates the time intervals needed to update tweet statistics
• Only one bolt; the tweet stream is grouped by fields grouping on lat, lng
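The per-task logic such a geo-grouping bolt would need (accumulate tweets per grid cell, and on each clock tick flush each cell's average location and count, then reset for the next time slot) can be sketched without Storm dependencies; the class below and its cell-key scheme are assumptions for illustration:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Accumulates tweet locations per grid cell between clock ticks.
public class GeoGroupingLogic {
    // cell key -> {sum of lats, sum of lngs, count}
    private final Map<String, double[]> cells = new HashMap<>();

    // Called for every tweet tuple from the tweet stream.
    public void onTweet(double lat, double lng) {
        String key = (long) Math.floor(lat * 100000) + ","
                   + (long) Math.floor(lng * 100000);
        double[] acc = cells.computeIfAbsent(key, k -> new double[3]);
        acc[0] += lat;
        acc[1] += lng;
        acc[2] += 1;
    }

    // Called on each tick from the clock spout: emit {avgLat, avgLng,
    // count} per cell and reset state for the next time slot.
    public List<double[]> onTick() {
        List<double[]> out = new ArrayList<>();
        for (double[] acc : cells.values()) {
            out.add(new double[] {acc[0] / acc[2], acc[1] / acc[2], acc[2]});
        }
        cells.clear();
        return out;
    }
}
```

Because the topology uses fields grouping on (geo_lat, geo_lng), all tweets for a given cell reach the same bolt task, so this per-task map sees every tweet it needs.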