STORM
    COMPARISON – INTRODUCTION – CONCEPTS




PRESENTATION BY KASPER MADSEN
MARCH - 2012
HADOOP                               VS               STORM
     Batch processing                                 Real-time processing
     Jobs run to completion                           Topologies run forever
     JobTracker is a SPOF*                            No single point of failure
     Stateful nodes                                   Stateless nodes
     Scalable                                         Scalable
     Guarantees no data loss                          Guarantees no data loss
     Open source                                      Open source

* Hadoop 0.21 added some checkpointing
  SPOF: Single Point Of Failure
COMPONENTS
     Nimbus daemon is comparable to the Hadoop JobTracker. It is the master.
     Supervisor daemon spawns workers; it is comparable to the Hadoop TaskTracker.
     Worker is spawned by the supervisor, one per port defined in the storm.yaml configuration (see the sketch below).
     Task is run as a thread inside a worker.
     Zookeeper* is a distributed system used to store metadata. The Nimbus and
     Supervisor daemons are fail-fast and stateless; all state is kept in Zookeeper.
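
     As an illustration of the worker-per-port rule, a minimal storm.yaml fragment
     might look like the following; the port numbers are arbitrary example values
     (they happen to match Storm's documented defaults). Four listed ports means
     the supervisor can spawn at most four workers on that machine:

         # one worker slot per listed port
         supervisor.slots.ports:
             - 6700
             - 6701
             - 6702
             - 6703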




         Notice that all communication between Nimbus and
           the Supervisors is done through Zookeeper.

      On a cluster with 2k+1 Zookeeper nodes, the system
          can recover when at most k nodes fail.




* Zookeeper is an Apache top-level project
STREAMS
A stream is an unbounded sequence of tuples.
A topology is a graph where each node is a spout or a bolt, and the edges indicate
which bolts subscribe to which streams.
•   A spout is a source of a stream
•   A bolt consumes a stream (and possibly emits a new one)
•   An edge represents a grouping

(Diagram: two spouts are the sources of streams A and B. One bolt subscribes
to A and emits C; another bolt subscribes to A and emits D; a third bolt
subscribes to A & B; a fourth bolt subscribes to C & D.)
GROUPINGS
Each spout or bolt runs X instances in parallel (called tasks).
Groupings are used to decide which task in the subscribing bolt a tuple is sent to.
Shuffle grouping     is a random grouping
Fields grouping      groups by field value, so tuples with equal values go to the same task
All grouping         replicates to all tasks
Global grouping      makes all tuples go to one task
None grouping        makes the bolt run in the same thread as the bolt/spout it subscribes to
Direct grouping      the producer (the task that emits) controls which consumer task will receive the tuple
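
A word-count style bolt is the classic use of a fields grouping: every occurrence
of the same word must reach the same task. A minimal sketch, reusing TestWordSpout
from the example that follows; the WordCountBolt class is hypothetical, not part
of this deck:

     TopologyBuilder builder = new TopologyBuilder();
     builder.setSpout("words", new TestWordSpout(), 10);

     // Group on the "word" field: tuples carrying the same word value
     // are always routed to the same one of the 4 counter tasks.
     builder.setBolt("counter", new WordCountBolt(), 4)   // hypothetical bolt
            .fieldsGrouping("words", new Fields("word"));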
(Diagram: an example topology whose spouts and bolts run 2, 2, 4 and 3 tasks,
with the grouping on each edge deciding which of the subscribing tasks
receives each tuple.)
EXAMPLE
(Topology: TestWordSpout → ExclamationBolt → ExclamationBolt)
     TopologyBuilder builder = new TopologyBuilder();

     // Create a stream called "words", run 10 tasks
     builder.setSpout("words", new TestWordSpout(), 10);

     // Create a stream called "exclaim1", run 3 tasks,
     // subscribe to stream "words" using shuffle grouping
     builder.setBolt("exclaim1", new ExclamationBolt(), 3)
            .shuffleGrouping("words");

     // Create a stream called "exclaim2", run 2 tasks,
     // subscribe to stream "exclaim1" using shuffle grouping
     builder.setBolt("exclaim2", new ExclamationBolt(), 2)
            .shuffleGrouping("exclaim1");



        A bolt can subscribe to an unlimited number of
      streams by chaining groupings, as sketched below.
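
A hedged illustration of such chaining; the second spout "signals" is made up
purely to give the bolt two streams to subscribe to:

     builder.setSpout("signals", new TestWordSpout(), 2);   // hypothetical second stream

     // One bolt subscribing to two streams by chaining groupings
     builder.setBolt("merge", new ExclamationBolt(), 2)
            .shuffleGrouping("words")
            .shuffleGrouping("signals");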



The source code for this example is part of the storm-starter project on GitHub.
EXAMPLE – 1
TestWordSpout
public void nextTuple() {
     Utils.sleep(100);    // emit at most one tuple per 100 ms
     final String[] words = new String[] {"nathan", "mike", "jackson", "golda", "bertels"};
     final Random rand = new Random();
     final String word = words[rand.nextInt(words.length)];
     _collector.emit(new Values(word));
}



The TestWordSpout emits a random string from the
     words array every 100 milliseconds.
EXAMPLE – 2
ExclamationBolt

OutputCollector _collector;

// prepare is called once, when the bolt is created
public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
      _collector = collector;
}

// execute is called for each tuple
public void execute(Tuple tuple) {
     _collector.emit(tuple, new Values(tuple.getString(0) + "!!!"));
     _collector.ack(tuple);
}

// declareOutputFields is called when the bolt is created
public void declareOutputFields(OutputFieldsDeclarer declarer) {
     declarer.declare(new Fields("word"));
}


declareOutputFields is used to declare streams and their schemas. It is
 possible to declare several streams and to specify which stream to use
          when emitting tuples, as sketched below.
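
A hedged sketch of declaring multiple streams and emitting to a chosen one;
the stream names "plain" and "shouted" are made up for illustration:

     public void declareOutputFields(OutputFieldsDeclarer declarer) {
         // Declare two named streams, each with its own schema
         declarer.declareStream("plain", new Fields("word"));
         declarer.declareStream("shouted", new Fields("word"));
     }

     public void execute(Tuple tuple) {
         String word = tuple.getString(0);
         // Choose the stream explicitly in the emit call
         _collector.emit("plain", tuple, new Values(word));
         _collector.emit("shouted", tuple, new Values(word + "!!!"));
         _collector.ack(tuple);
     }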
FAULT TOLERANCE
Zookeeper stores metadata in a very robust way.
Nimbus and Supervisor are stateless and only need the metadata from ZK to work/restart.
When a node dies
   • The tasks will time out and be reassigned to other workers by Nimbus.
When a worker dies
     • The supervisor will restart the worker.
     • Nimbus will reassign the worker to another supervisor if no heartbeats are sent.
     • If that is not possible (no free ports), the tasks will be run on other workers in
       the topology. If more capacity is added to the cluster later, STORM will
       automatically initialize a new worker and spread out the tasks.
When Nimbus or a Supervisor dies
     •   Workers will continue to run
     •   Workers cannot be reassigned without Nimbus
     •   Nimbus and Supervisor should be run under a process monitoring tool that
         restarts them automatically if they fail.
AT-LEAST-ONCE PROCESSING
STORM guarantees at-least-once processing of tuples.
Message id is assigned to a tuple when it is emitted from a spout or bolt. It is 64 bits long.
Tree of tuples is the set of tuples generated (directly and indirectly) from a spout tuple.
Ack is called on the spout when the tree of tuples for a spout tuple is fully processed.
Fail is called on the spout if one of the tuples in the tree fails, or if the tree of
tuples is not fully processed within a specified timeout (default is 30 seconds).
It is possible to specify the message id when emitting a tuple. This is useful for
replaying tuples from a queue, as sketched below.
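
A minimal sketch of a spout emitting with an explicit message id; the queue
object and its QueueMessage type are hypothetical stand-ins for whatever source
the spout reads from:

     public void nextTuple() {
         // Hypothetical source queue; the message id ties the tuple
         // back to the queue entry so it can be replayed on failure.
         QueueMessage msg = queue.poll();
         if (msg != null) {
             _collector.emit(new Values(msg.body()), msg.id());
         }
     }

     public void fail(Object msgId) {
         queue.requeue(msgId);   // hypothetical: put the message back for replay
     }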




(The spout's ack/fail method is called when the tree of tuples
has been fully processed, or has failed / timed out.)
AT-LEAST-ONCE PROCESSING – 2
Anchoring is used to copy the spout tuple message id(s) to the new tuples
generated. In this way, every tuple knows the message id(s) of all its spout tuples.
Multi-anchoring is when multiple tuples are anchored. If the tuple tree fails,
multiple spout tuples will be replayed. Useful for streaming joins and more.
Ack called from a bolt indicates the tuple has been processed as intended.
Fail called from a bolt replays the spout tuple(s).
Every tuple must be acked/failed, or the task will run out of memory at some point.




_collector.emit(tuple, new Values(word));    // uses anchoring

_collector.emit(new Values(word));           // does NOT use anchoring
AT-LEAST-ONCE PROCESSING – 3
Acker tasks track the tree of tuples for every spout tuple
     •   The acker task responsible for a given spout tuple is determined by taking the
         message id modulo the number of acker tasks. Since every tuple carries all of its
         spout tuple message ids, it is easy to call the correct acker tasks.
     •   The acker task stores a map in the format {spoutMsgId, {spoutTaskId, ”ack val”}}
     •   ”ack val” represents the state of the entire tree of tuples. It is the xor of
         all tuple message ids created and acked in the tree of tuples.
     •   When ”ack val” is 0, the tuple tree is fully processed.
     •   Since message ids are random 64-bit numbers, the chance of ”ack val”
         becoming 0 by accident is extremely small.




               It is important to set the number of acker tasks in the topology when
                  processing large amounts of tuples (it defaults to 1), as sketched below.
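
A hedged sketch of raising the acker count when submitting a topology, assuming
Storm's Config helper; the value 4 and the topology name are arbitrary examples:

     Config conf = new Config();
     conf.setNumAckers(4);   // run 4 acker tasks instead of the default 1
     StormSubmitter.submitTopology("exclamation", conf, builder.createTopology());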
AT-LEAST-ONCE PROCESSING – 4
Example

(Diagram: Spout task 1 emits ”hey” with msgId 10 to Bolt task 2. Task 2 emits
”h” (spoutIds: 10, msgId: 2) to Bolt task 3, and ”ey” (spoutIds: 10, msgId: 3)
to Bolt task 4.)

This shows what happens in the acker task for one spout tuple. The format is
{spoutMsgId, {spoutTaskId, ”ack val”}}. (4-bit ids are shown for readability;
in reality the ids are 64 bits.)
1. After emit ”hey”: {10, {1, 0000 XOR 1010 = 1010}}
2. After emit ”h”:   {10, {1, 1010 XOR 0010 = 1000}}
3. After emit ”ey”:  {10, {1, 1000 XOR 0011 = 1011}}
4. After ack ”hey”:  {10, {1, 1011 XOR 1010 = 0001}}
5. After ack ”h”:    {10, {1, 0001 XOR 0010 = 0011}}
6. After ack ”ey”:   {10, {1, 0011 XOR 0011 = 0000}}
7. Since ”ack val” is 0, the spout tuple with id 10 must be fully processed. Ack is called on the spout (task 1).
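
To make the bookkeeping concrete, here is a toy model of the xor trick in plain
Java (not Storm's actual acker code), using the same 4-bit ids as above:

     long ackVal = 0;

     // Each message id is xor'ed into ackVal once on emit and once on ack.
     // Because x ^ x = 0, every id cancels itself out when the tree completes.
     long hey = 0b1010, h = 0b0010, ey = 0b0011;
     ackVal ^= hey;        // emit "hey"
     ackVal ^= h;          // emit "h"
     ackVal ^= ey;         // emit "ey"
     ackVal ^= hey;        // ack  "hey"
     ackVal ^= h;          // ack  "h"
     ackVal ^= ey;         // ack  "ey"
     assert ackVal == 0;   // tree fully processed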
AT-LEAST-ONCE PROCESSING – 5
A tuple isn't acked because the task died:
The spout tuple(s) at the root of the tree of tuples will time out and be replayed.
Acker task dies:
All the spout tuples the acker was tracking will time out and be replayed.
Spout task dies:
In this case the source that the spout talks to is responsible for replaying the
messages. For example, queues like Kestrel and RabbitMQ will place all pending
messages back on the queue when a client disconnects.
AT-LEAST-ONCE PROCESSING – 6
At-least-once processing might process a tuple more than once.
Example

(Diagram: Spout task 1 sends tuples via an all grouping to Bolt tasks 2 and 3.)

1. A spout tuple is emitted to tasks 2 and 3
2. The worker responsible for task 3 fails
3. The supervisor restarts the worker
4. The spout tuple is replayed and emitted to tasks 2 and 3
5. Task 2 will now have executed the same bolt twice
Consider why the all grouping is not important in this example
EXACTLY-ONCE-PROCESSING
Transactional topologies (TT) are an abstraction built on STORM primitives.
TT guarantees exactly-once processing of tuples.
Acking is optimized in TT; there is no need to do anchoring or acking manually.
Bolts execute as new instances per attempt at processing a batch.

Example

(Diagram: Spout task 1 sends tuples via an all grouping to Bolt tasks 2 and 3.)

1. A spout tuple is emitted to tasks 2 and 3
2. The worker responsible for task 3 fails
3. The supervisor restarts the worker
4. The spout tuple is replayed and emitted to tasks 2 and 3
5. Tasks 2 and 3 initiate new bolt instances because of the new attempt
6. Now there is no problem
EXACTLY-ONCE-PROCESSING – 2
For efficiency, batch processing of tuples is introduced in TT.
A batch has two states: processing or committing.
Many batches can be in the processing state concurrently.
Only one batch can be in the committing state, and a strong ordering is imposed: batch 1
will always be committed before batch 2, and so on.
Types of bolts for TT: BasicBolt, BatchBolt, and BatchBolt marked as committer.
BasicBolt processes one tuple at a time.
BatchBolt processes batches; finishBatch is called when all tuples of a batch have been executed.
BatchBolt marked as committer has finishBatch called only when the batch is in the
committing state. A sketch follows below.
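
A hedged sketch of a committer batch bolt, modeled loosely on the storm-starter
transactional example of the 2012-era API; the class name, the counting logic
and the emitted fields are illustrative, not part of this deck:

     public static class GlobalCountBolt extends BaseTransactionalBolt implements ICommitter {
         TransactionAttempt _attempt;
         BatchOutputCollector _collector;
         int _count = 0;

         @Override
         public void prepare(Map conf, TopologyContext context,
                             BatchOutputCollector collector, TransactionAttempt attempt) {
             // A fresh instance is created per batch attempt, so _count starts at 0.
             _collector = collector;
             _attempt = attempt;
         }

         @Override
         public void execute(Tuple tuple) {
             _count++;   // processing state: accumulate over the batch's tuples
         }

         @Override
         public void finishBatch() {
             // Because this bolt is a committer, finishBatch runs only in the
             // committing state, in strict batch order, so the update is exactly-once.
             _collector.emit(new Values(_attempt, _count));
         }
     }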
EXACTLY-ONCE-PROCESSING – 3
The transactional spout has the capability to replay exact batches of tuples.

(Diagram: the transactional spout feeds bolt A; bolt A feeds bolts B and C;
bolt B also feeds bolt C; bolt C feeds bolt D. Bolts B and D are BatchBolts
marked as committers; A and C are regular BatchBolts.)
BATCH IS IN PROCESSING STATE
Bolt A:   execute is called for all tuples received from the spout.
          finishBatch is called when the whole batch has been received.
Bolt B:   execute is called for all tuples received from bolt A.
          finishBatch is NOT called, because the batch is in the processing state.
Bolt C:   execute is called for all tuples received from bolt A (and B).
          finishBatch is NOT called, because bolt B has not called finishBatch.
Bolt D:   execute is called for all tuples received from bolt C.
          finishBatch is NOT called, because the batch is in the processing state.
BATCH CHANGES TO COMMITTING STATE
Bolt B:   finishBatch is called.
Bolt C:   finishBatch is called, because we now know we have all tuples from bolt B.
Bolt D:   finishBatch is called, because we now know we have all tuples from bolt C.
EXACTLY-ONCE-PROCESSING – 4
(Diagram: inside a transactional spout. The coordinator is a regular spout with a
parallelism of 1; the emitters are a regular bolt with a parallelism of P. The
coordinator defines two streams, batch & commit, and the emitters subscribe to
the batch stream with an all grouping.)

When a batch should enter the processing state:
•  The coordinator emits a tuple with the TransactionAttempt and the metadata for that
   transaction to the "batch" stream.
•  All emitter tasks receive the tuple and begin to emit their portion of the tuples for
   the given batch.

When the processing phase of the batch is done (determined by the acker task):
•  Ack gets called on the coordinator.

When ack gets called on the coordinator and all prior transactions have committed:
•  The coordinator emits a tuple with the TransactionAttempt to the commit stream.
•  All bolts which are marked as committers subscribe to the commit stream of the
   coordinator using an all grouping.
•  Bolts marked as committers now know the batch is in the committing phase.

When the batch is fully processed again (determined by the acker task):
•  Ack gets called on the coordinator.
•  The coordinator knows the batch is now committed.
STORM LIBRARIES
STORM uses a lot of libraries. The most prominent are:
Clojure    a new Lisp programming language (crash-course follows)
Jetty      an embedded webserver, used to host the UI of Nimbus
Kryo       a fast serializer, used when sending tuples
Thrift     a framework for building services; Nimbus is a Thrift daemon
ZeroMQ     a very fast transport layer
Zookeeper  a distributed system for storing metadata
LEARN MORE
Wiki (https://github.com/nathanmarz/storm/wiki)
Storm-starter (https://github.com/nathanmarz/storm-starter)
Mailing list (http://groups.google.com/group/storm-user)
#storm-user room on freenode





