2. Rationale
• Hadoop scales, but offers no real-time data processing.
• Batch processing means stale data.
• Before Storm: hand-wired pipelines of messages, queues, and workers, which were
1. Tedious
2. Brittle
3. Hard to scale
Storm: Distributed Fault-Tolerant Real-Time Computation
3. Why Storm
• Real-time
• Fault-tolerant
• Extremely robust
• Scalable (processed 1,000,000 messages per second on a 10-node cluster)
5. Key Concepts
• Topology
• Tasks
• Tuple
• Stream
• Spout
• Bolt
A topology is a graph of computation. Tasks are the threads of execution that run the spouts and bolts.
[Figure: a simple topology — a spout emits a stream of tuples into a bolt]
6. Key Concepts
• Tuples and Streams
• Tuple: an ordered list of elements
• Stream: an unbounded sequence of tuples
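In code terms, a tuple can be pictured as an ordered, positionally indexed list of values. Below is a minimal plain-Java stand-in, not Storm's real Tuple interface (which also supports access by field name):

```java
import java.util.Arrays;
import java.util.List;

// A minimal stand-in for a Storm tuple: an ordered list of values,
// accessed by position. (Storm's real Tuple also allows access by
// declared field name.)
class MiniTuple {
    private final List<Object> values;

    MiniTuple(Object... values) {
        this.values = Arrays.asList(values);
    }

    Object getValue(int i) { return values.get(i); }
    String getString(int i) { return (String) values.get(i); }
    int size() { return values.size(); }
}
```

A stream is then just an unbounded sequence of such tuples flowing between components.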
7. Key Concepts
Spouts and Bolts
• Spout: the source of a stream
• Reads from queues, weblogs, API calls, event data
• Bolt: processes input streams and creates new streams
• Applies functions/transforms: filtering, aggregation, streaming joins, etc.
• Can produce multiple streams
8. Key Concepts
Stream groupings
• A stream grouping defines how a stream is partitioned among a bolt's tasks.
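How a grouping routes tuples can be sketched in plain Java (not the Storm API): a fields grouping hashes the grouping field so equal values always land on the same task, while a shuffle grouping spreads tuples evenly. The modulo-hash routing below is an illustrative assumption, not Storm's exact internal scheme:

```java
// Illustrative routing for a fields grouping: hash the grouping field
// modulo the number of bolt tasks, so equal field values always map
// to the same task. (Storm's internal scheme differs in detail.)
class FieldsGrouping {
    static int taskFor(String fieldValue, int numTasks) {
        // Math.floorMod keeps the result non-negative even when
        // hashCode() is negative.
        return Math.floorMod(fieldValue.hashCode(), numTasks);
    }
}
```

With this scheme, grouping on a "user-id" field guarantees that all tuples for one user reach the same task, which is what makes per-key aggregation in a bolt possible.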
9. A simple topology
[Figure: a "words" spout feeds an "exclaim1" bolt, which feeds an "exclaim2" bolt, connected by shuffle groupings; "mike" becomes "mike!!!" and then "mike!!!!!!"]
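The dataflow of this topology can be traced in plain Java (not the Storm API): each exclaim bolt appends "!!!", so a word emitted by the spout passes through two bolts and gains six exclamation marks:

```java
// Plain-Java trace of the words -> exclaim1 -> exclaim2 topology:
// each exclaim stage appends "!!!" to the incoming word.
class ExclaimPipeline {
    static String exclaim(String word) {
        return word + "!!!";
    }

    static String runPipeline(String word) {
        // exclaim1, then exclaim2
        return exclaim(exclaim(word));
    }
}
```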
10. Implementation of Spout
• The spout object implements the IRichSpout interface.
• The nextTuple() method emits the next tuple; in TestWordSpout, a random word.
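The behaviour of TestWordSpout (emit a random word from a fixed list) can be mimicked in plain Java; this is a sketch with the Storm collector replaced by a plain return value, not the real IRichSpout implementation:

```java
import java.util.Random;

// Plain-Java analogue of TestWordSpout: nextTuple() picks a random word
// from a fixed list. (The real spout implements IRichSpout and emits
// via a SpoutOutputCollector every 100 ms.)
class MiniWordSpout {
    private static final String[] WORDS =
        {"nathan", "mike", "jackson", "golda", "bertels"};
    private final Random random = new Random();

    String nextTuple() {
        return WORDS[random.nextInt(WORDS.length)];
    }
}
```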
11. Implementation of Bolt
• Implements the IRichBolt interface.
• The prepare method saves the OutputCollector as an instance variable.
• The execute method receives a tuple and appends exclamation marks.
• The cleanup method prevents resource leaks on bolt shutdown.
• declareOutputFields declares that the bolt emits tuples with a single field named "word".
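The bolt lifecycle above can be sketched in plain Java, with Storm's types (TopologyContext, OutputCollector, Tuple) replaced by simple stand-ins; this mirrors the tutorial's ExclamationBolt but is not the real IRichBolt code:

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of a bolt's lifecycle: prepare() saves a collector,
// execute() transforms a tuple and emits the result, cleanup() releases
// resources on shutdown. The real bolt implements IRichBolt and uses
// Storm's OutputCollector and Tuple types.
class MiniExclamationBolt {
    private List<String> collector; // stand-in for Storm's OutputCollector

    void prepare(List<String> collector) {
        this.collector = collector;   // saved as an instance variable
    }

    void execute(String word) {
        collector.add(word + "!!!");  // append exclamation marks and emit
    }

    void cleanup() {
        collector = null;             // release resources on shutdown
    }
}
```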
12. Conclusion
• Storm is a promising tool.
• It has a clean and elegant design.
• Excellent documentation for a young open-source tool.
• A great replacement for Hadoop for real-time computation.
14. Sources
• Dan Lynn (dan@fullcontact.com), "Storm: The Real-Time Layer", GlueCon 2012
• Nathan Marz, Storm Tutorial: http://storm.incubator.apache.org/documentation/Tutorial.html
• Mariusz Gil, "Streams Processing with Storm"
15. Questions
• What are the major issues with real-time stream processing, and how can they be solved? Specify algorithms or techniques.
• Are there any query languages for real-time stream processing?
16. Answers
• One strategy for dealing with streams is to maintain summaries of the streams, sufficient to answer the expected queries about the data, using sampling and filtering to extract the relevant subset.
• A second approach is to maintain a sliding window of the most recently arrived data.
• SQL-style stream query languages (such as StreamSQL) exist for querying streams.
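The sliding-window approach can be sketched in Java: keep only the N most recent elements in a deque and answer queries (here, a running sum) from that bounded window instead of the unbounded stream. A minimal sketch, assuming a count-based window rather than a time-based one:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Count-based sliding window: keeps only the N most recent values and
// maintains a running sum, so queries never scan the unbounded stream.
class SlidingWindowSum {
    private final int capacity;
    private final Deque<Integer> window = new ArrayDeque<>();
    private long sum = 0;

    SlidingWindowSum(int capacity) { this.capacity = capacity; }

    void add(int value) {
        window.addLast(value);
        sum += value;
        if (window.size() > capacity) {
            sum -= window.removeFirst(); // evict the oldest element
        }
    }

    long sum() { return sum; }
}
```

The same structure extends to other windowed queries (counts, averages, maxima), which is why sliding windows are a standard answer to the memory constraints of stream processing.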
Editor's notes
Real-time streaming computation has applications in machine learning, data analytics, and integration.
Hadoop uses batch processing. Pre-Storm pipelines were: 1. Tedious — deploying workers and queues, and deciding where to send messages. 2. Brittle — no fault tolerance. 3. Hard to scale — for high throughput you must partition the data and manage how it moves around; any piece can fail, forcing you to reconfigure the other workers.
1. Real-time in the sense that it can process messages and update databases, continuously querying a database and streaming the results to the client. 2. Fault-tolerant: if faults occur during a computation, Storm reassigns tasks, ensuring a computation can run forever. 3. Extremely robust: Storm clusters are easier to manage than Hadoop clusters; Storm aims for a painless user experience. 4. Scalable: massive numbers of messages per second; all you need to do is add machines and increase the parallelism settings of the topology.
1. Hadoop has MapReduce jobs, but Storm has topologies. A MapReduce job finishes; a Storm topology processes messages forever until you kill it. 2. Nimbus is a daemon, similar to the master node's JobTracker, responsible for distributing code around the cluster, assigning tasks, and monitoring for failures. 3. Each worker node runs a daemon called the Supervisor, which starts and stops worker processes based on the work assigned to it. 4. Nimbus and the Supervisors are stateless; all state is stored in ZooKeeper or on local disk. You can kill Nimbus or a Supervisor and they will start back up as if nothing happened. This provides the stability.
Each node in a topology contains processing logic, and links between nodes indicate how data should be passed around between nodes. Each task corresponds to one thread of execution, though the number of threads can be less than or equal to the number of tasks. Workers: topologies execute across one or more worker processes. Each worker process is a physical JVM and executes a subset of all the tasks for the topology. For example, if the combined parallelism of the topology is 300 and 50 workers are allocated, each worker will execute 6 tasks (as threads within the worker). Storm tries to spread the tasks evenly across all the workers.
A tuple can contain a list of values. Storm provides the primitives for transforming a stream into a new stream in a distributed and reliable way; for example, you may transform a stream of tweets into a stream of trending topics. Tuples can contain integers, longs, shorts, bytes, strings, doubles, floats, booleans, and byte arrays. You can also define your own serializers so that custom types can be used natively within tuples. Every stream is given an id when declared.
The basic primitives Storm provides for doing stream transformations are "spouts" and "bolts". Spouts and bolts have interfaces that you implement to run your application-specific logic. A spout may connect to the Twitter API and emit a stream of tweets; spouts are easily integrated with a new queuing system. Spouts can be reliable or unreliable; reliable spouts implement ack and fail. Bolts: complex stream transformations require multiple bolts, and a bolt can give out multiple streams. A topology runs forever, or until you kill it. Storm will automatically reassign any failed tasks. Additionally, Storm guarantees that there will be no data loss, even if machines go down and messages are dropped.
Part of defining a topology is specifying, for each bolt, which streams it should receive as input. Spouts and bolts execute as many tasks in parallel across the cluster. Shuffle grouping: tuples are randomly distributed across the bolt's tasks in a way such that each task is guaranteed to get an equal number of tuples. Fields grouping: the stream is partitioned by the fields specified in the grouping; for example, if the stream is grouped by the "user-id" field, tuples with the same "user-id" will always go to the same task. Global grouping: the entire stream goes to a single one of the bolt's tasks.
These methods take as input a user-specified id, an object containing the processing logic, and the amount of parallelism you want for the node. The last parameter, the parallelism, is optional; it indicates how many threads should execute that component across the cluster.
TestWordSpout in this topology emits a random word from the list ["nathan", "mike", "jackson", "golda", "bertels"] as a 1-tuple every 100 ms.
The prepare method provides the output collector that is used for emitting tuples. The execute method receives a tuple from one of the bolt's inputs and acknowledges it to prevent data loss. The cleanup method is called when a bolt is shut down and should clean up any resources that were opened. The declareOutputFields method declares that the ExclamationBolt emits 1-tuples with one field called "word". The getComponentConfiguration method allows you to configure various aspects of how this component runs.
Before proceeding to discuss algorithms, let us consider the constraints under which we work when dealing with streams. First, streams often deliver elements very rapidly. We must process elements in real time, or we lose the opportunity to process them at all, without accessing archival storage. Thus, it often is important that the stream-processing algorithm executes in main memory, without access to secondary storage, or with only rare accesses to it. Moreover, even when streams are "slow", as in the sensor-data example of Section 4.1.2, there may be many such streams. Even if each stream by itself can be processed using a small amount of main memory, the requirements of all the streams together can easily exceed the amount of available main memory. Thus, many problems about streaming data would be easy to solve if we had enough memory, but become rather hard and require the invention of new techniques in order to execute them at a realistic rate on a machine of realistic size. Two generalizations about stream algorithms are worth bearing in mind: (1) it is often much more efficient to get an approximate answer to a problem than an exact solution; (2) a variety of techniques related to hashing turn out to be useful. Generally, these techniques introduce useful randomness into the algorithm's behavior, in order to produce an approximate answer that is very close to the true result.