SlideShare une entreprise Scribd logo
1  sur  23
Télécharger pour lire hors ligne
Storm - pipes and
filters on steroids

      Andre Sprenger


    BigData Roundtable
   Hamburg 30. Nov 2011
My background
•   info@andresprenger.de

•   Studied Computer Science and Economics

•   Background: banking, ecommerce, online advertising

•   Freelancer

•   Java, Scala, Ruby, Rails

•   Hadoop, Pig, Hive, Cassandra
“Next click” problem
Raymie Strata (CTO,Yahoo):

“With the paths that go through Hadoop [at Yahoo!], the
latency is about fifteen minutes. … [I]t will never be true
real-time. It will never be what we call “next click,” where
I click and by the time the page loads, the semantic
implication of my decision is reflected in the page.”
“Next click” problem
                             (next)
 HTTP         HTTP           HTTP          HTTP
Request      Response       Request       Response


     max latency                  max latency
       80 ms                        80 ms

                                                     web server
              realtime   near realtime
              response     response

                                                     real time layer

      collect data                 process data


                         time
Example problems
•   Realtime statistics - counting, trends, moving average

•   Read Twitter stream and output images that are
    trending in the last 10 minutes

•   CTR calculation - read ad clicks/ad impressions and
    calculate new click through rate

•   ETL - transform format, filter duplicates / bot traffic,
    enrich from static data, persist

•   Search advertising
Pick your framework...
•   S4 - Yahoo, “real time map reduce”, actor model

•   Storm - Twitter

•   MapReduce Online - Yahoo

•   Cloud Map Reduce - Accenture

•   HStreaming - Startup, based on Hadoop

•   Brisk - DataStax, Cassandra
System requirements
•   Fault tolerance - system keeps running when a node
    fails

•   Horizontal scalability - should be easy, just add a
    node

•   Low latency

•   Reliable - does not loose data

•   High availability - well, if it’s down for an hour its not
    realtime
Storm in a nutshell
•   Written by Backtype (aquired by Twitter)

•   Open Source, Github

•   Runs on JVM

•   Clojure, Python, Zookeeper, ZeroMQ

•   Currently used by Twitter for real time statistics
Programming model
•   Tuple - name/value list

•   Stream - unbounded sequence of Tuples

•   Spout - source of Streams

•   Bolt - consumer / producer of Streams

•   Topology - network of Streams, Spouts and Bolts
Spout
        tuple tuple tuple tuple


Spout
        tuple tuple tuple tuple
Bolt
   Processes streams and generates new streams.



tuple tuple tuple tuple

                                  tuple tuple tuple tuple
                           Bolt
tuple tuple tuple tuple
Bolt
•   filtering

•   transformation

•   split / aggregate streams

•   counting, statistics

•   read from / write to database
Topology
Network of Streams, Spouts and Bolts

                    Bolt         Bolt
     Spout

                    Bolt

     Spout                       Bolt

                    Bolt
Task
Parallel processor inside Spouts and Bolts.
Each Spout / Bolt has a fixed number of Tasks.


      Spout                Bolt

      Task                 Task

      Task                 Task

      Task
Stream grouping
Which Task does a Tuple go to?

•   shuffle grouping - distribute randomly

•   field grouping - partition by field value

•   all grouping - send to all Tasks

•   custom grouping - implement your own logic
Word count example

                Sentence            Word    (“a”, 2)
                 Splitter           Count   (“b”, 2)
Spout
                  Bolt               Bolt   (“c”, 1)
                            (“a”)           (“d”, 1)
                            (“b”)
  (“a b c a b d”)           (“c”)
                            (“a”)
                            (“b”)
                            (“d”)
Guaranteed processing
                             (“a”)

                             (“b”)
                                             (“a”, 2)
                             (“c”)
                                             (“b”, 2)
Spout    (“a b c a b d”)
                                             (“c”, 1)
                             (“a”)
                                             (“d”, 1)
                             (“b”)

                             (“d”)

Topology has a timeout for processing of the tuple tree
Runtime view
Reliability
•   Nimbus / Supervisor are SPOF

•   both are stateless, easy to restart without data loss

•   Failure of master node (?)

•   Running Topologies should not be affected!

•   Failed Workers are restarted

•   Guaranteed message processing
Administration

•   Nimbus / Supervisor / Zookeeper need monitoring
    and supervisor (e.g. Monit)

•   Cluster nodes can be added at runtime

•   But: existing Topologies are not rebalanced (there is a
    ticket)

•   Administration web GUI
Community
•   Source is on Github - https://github.com/
    nathanmarz/storm.git

•   Wiki - https://github.com/nathanmarz/storm/wiki

•   Nice documentation

•   Google Group

•   People start to build add-ons: JRuby integration,
    adapters for JMS, AMQP
Storm summary
•   Nice programming model

•   Easy to deploy new topologies

•   Horizontal scalability

•   Low latency

•   Fault tolerance

•   Easy to setup on EC2
Questions?

Contenu connexe

Tendances

ソーシャルゲームログ解析基盤のHadoop活用事例
ソーシャルゲームログ解析基盤のHadoop活用事例ソーシャルゲームログ解析基盤のHadoop活用事例
ソーシャルゲームログ解析基盤のHadoop活用事例知教 本間
 
Introduction to Tokyo Products
Introduction to Tokyo ProductsIntroduction to Tokyo Products
Introduction to Tokyo ProductsMikio Hirabayashi
 
JavaOne 2012 - JVM JIT for Dummies
JavaOne 2012 - JVM JIT for DummiesJavaOne 2012 - JVM JIT for Dummies
JavaOne 2012 - JVM JIT for DummiesCharles Nutter
 
DEFCON 23 - Mike Sconzo - i am packer and so can you
DEFCON 23 - Mike Sconzo - i am packer and so can youDEFCON 23 - Mike Sconzo - i am packer and so can you
DEFCON 23 - Mike Sconzo - i am packer and so can youFelipe Prado
 

Tendances (8)

Storm
StormStorm
Storm
 
ソーシャルゲームログ解析基盤のHadoop活用事例
ソーシャルゲームログ解析基盤のHadoop活用事例ソーシャルゲームログ解析基盤のHadoop活用事例
ソーシャルゲームログ解析基盤のHadoop活用事例
 
Cwmg
CwmgCwmg
Cwmg
 
[BGOUG] Java GC - Friend or Foe
[BGOUG] Java GC - Friend or Foe[BGOUG] Java GC - Friend or Foe
[BGOUG] Java GC - Friend or Foe
 
Introduction to Tokyo Products
Introduction to Tokyo ProductsIntroduction to Tokyo Products
Introduction to Tokyo Products
 
JavaOne 2012 - JVM JIT for Dummies
JavaOne 2012 - JVM JIT for DummiesJavaOne 2012 - JVM JIT for Dummies
JavaOne 2012 - JVM JIT for Dummies
 
Kyotoproducts
KyotoproductsKyotoproducts
Kyotoproducts
 
DEFCON 23 - Mike Sconzo - i am packer and so can you
DEFCON 23 - Mike Sconzo - i am packer and so can youDEFCON 23 - Mike Sconzo - i am packer and so can you
DEFCON 23 - Mike Sconzo - i am packer and so can you
 

Similaire à Bigdata roundtable-storm

Learning Stream Processing with Apache Storm
Learning Stream Processing with Apache StormLearning Stream Processing with Apache Storm
Learning Stream Processing with Apache StormEugene Dvorkin
 
Apache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - VerisignApache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - VerisignMichael Noll
 
Storm distributed processing
Storm distributed processingStorm distributed processing
Storm distributed processingducquoc_vn
 
Real-Time Streaming with Apache Spark Streaming and Apache Storm
Real-Time Streaming with Apache Spark Streaming and Apache StormReal-Time Streaming with Apache Spark Streaming and Apache Storm
Real-Time Streaming with Apache Spark Streaming and Apache StormDavorin Vukelic
 
Reactor, Reactive streams and MicroServices
Reactor, Reactive streams and MicroServicesReactor, Reactive streams and MicroServices
Reactor, Reactive streams and MicroServicesStéphane Maldini
 
Data Platform at Twitter: Enabling Real-time & Batch Analytics at Scale
Data Platform at Twitter: Enabling Real-time & Batch Analytics at ScaleData Platform at Twitter: Enabling Real-time & Batch Analytics at Scale
Data Platform at Twitter: Enabling Real-time & Batch Analytics at ScaleSriram Krishnan
 
Natural Language Processing with CNTK and Apache Spark with Ali Zaidi
Natural Language Processing with CNTK and Apache Spark with Ali ZaidiNatural Language Processing with CNTK and Apache Spark with Ali Zaidi
Natural Language Processing with CNTK and Apache Spark with Ali ZaidiDatabricks
 
Realtime Computation with Storm
Realtime Computation with StormRealtime Computation with Storm
Realtime Computation with Stormboorad
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsthelabdude
 
Hadoop Summit Europe 2014: Apache Storm Architecture
Hadoop Summit Europe 2014: Apache Storm ArchitectureHadoop Summit Europe 2014: Apache Storm Architecture
Hadoop Summit Europe 2014: Apache Storm ArchitectureP. Taylor Goetz
 
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basicsnpinto
 
Scaling Big Data Mining Infrastructure Twitter Experience
Scaling Big Data Mining Infrastructure Twitter ExperienceScaling Big Data Mining Infrastructure Twitter Experience
Scaling Big Data Mining Infrastructure Twitter ExperienceDataWorks Summit
 
Springone2gx 2014 Reactive Streams and Reactor
Springone2gx 2014 Reactive Streams and ReactorSpringone2gx 2014 Reactive Streams and Reactor
Springone2gx 2014 Reactive Streams and ReactorStéphane Maldini
 
PHP Backends for Real-Time User Interaction using Apache Storm.
PHP Backends for Real-Time User Interaction using Apache Storm.PHP Backends for Real-Time User Interaction using Apache Storm.
PHP Backends for Real-Time User Interaction using Apache Storm.DECK36
 
Processing Big Data in Real-Time - Yanai Franchi, Tikal
Processing Big Data in Real-Time - Yanai Franchi, TikalProcessing Big Data in Real-Time - Yanai Franchi, Tikal
Processing Big Data in Real-Time - Yanai Franchi, TikalCodemotion Tel Aviv
 
Data-Intensive Computing for Competent Genetic Algorithms: A Pilot Study us...
Data-Intensive Computing for  Competent Genetic Algorithms:  A Pilot Study us...Data-Intensive Computing for  Competent Genetic Algorithms:  A Pilot Study us...
Data-Intensive Computing for Competent Genetic Algorithms: A Pilot Study us...Xavier Llorà
 
Real-Time Big Data with Storm, Kafka and GigaSpaces
Real-Time Big Data with Storm, Kafka and GigaSpacesReal-Time Big Data with Storm, Kafka and GigaSpaces
Real-Time Big Data with Storm, Kafka and GigaSpacesOleksii Diagiliev
 
Fixing twitter
Fixing twitterFixing twitter
Fixing twitterRoger Xia
 

Similaire à Bigdata roundtable-storm (20)

Learning Stream Processing with Apache Storm
Learning Stream Processing with Apache StormLearning Stream Processing with Apache Storm
Learning Stream Processing with Apache Storm
 
Yahoo compares Storm and Spark
Yahoo compares Storm and SparkYahoo compares Storm and Spark
Yahoo compares Storm and Spark
 
Apache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - VerisignApache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - Verisign
 
Storm distributed processing
Storm distributed processingStorm distributed processing
Storm distributed processing
 
Real-Time Streaming with Apache Spark Streaming and Apache Storm
Real-Time Streaming with Apache Spark Streaming and Apache StormReal-Time Streaming with Apache Spark Streaming and Apache Storm
Real-Time Streaming with Apache Spark Streaming and Apache Storm
 
Reactor, Reactive streams and MicroServices
Reactor, Reactive streams and MicroServicesReactor, Reactive streams and MicroServices
Reactor, Reactive streams and MicroServices
 
Data Platform at Twitter: Enabling Real-time & Batch Analytics at Scale
Data Platform at Twitter: Enabling Real-time & Batch Analytics at ScaleData Platform at Twitter: Enabling Real-time & Batch Analytics at Scale
Data Platform at Twitter: Enabling Real-time & Batch Analytics at Scale
 
Natural Language Processing with CNTK and Apache Spark with Ali Zaidi
Natural Language Processing with CNTK and Apache Spark with Ali ZaidiNatural Language Processing with CNTK and Apache Spark with Ali Zaidi
Natural Language Processing with CNTK and Apache Spark with Ali Zaidi
 
Realtime Computation with Storm
Realtime Computation with StormRealtime Computation with Storm
Realtime Computation with Storm
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applications
 
Hadoop Summit Europe 2014: Apache Storm Architecture
Hadoop Summit Europe 2014: Apache Storm ArchitectureHadoop Summit Europe 2014: Apache Storm Architecture
Hadoop Summit Europe 2014: Apache Storm Architecture
 
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics
 
Scaling Big Data Mining Infrastructure Twitter Experience
Scaling Big Data Mining Infrastructure Twitter ExperienceScaling Big Data Mining Infrastructure Twitter Experience
Scaling Big Data Mining Infrastructure Twitter Experience
 
Springone2gx 2014 Reactive Streams and Reactor
Springone2gx 2014 Reactive Streams and ReactorSpringone2gx 2014 Reactive Streams and Reactor
Springone2gx 2014 Reactive Streams and Reactor
 
PHP Backends for Real-Time User Interaction using Apache Storm.
PHP Backends for Real-Time User Interaction using Apache Storm.PHP Backends for Real-Time User Interaction using Apache Storm.
PHP Backends for Real-Time User Interaction using Apache Storm.
 
Processing Big Data in Real-Time - Yanai Franchi, Tikal
Processing Big Data in Real-Time - Yanai Franchi, TikalProcessing Big Data in Real-Time - Yanai Franchi, Tikal
Processing Big Data in Real-Time - Yanai Franchi, Tikal
 
Data-Intensive Computing for Competent Genetic Algorithms: A Pilot Study us...
Data-Intensive Computing for  Competent Genetic Algorithms:  A Pilot Study us...Data-Intensive Computing for  Competent Genetic Algorithms:  A Pilot Study us...
Data-Intensive Computing for Competent Genetic Algorithms: A Pilot Study us...
 
Real-Time Big Data with Storm, Kafka and GigaSpaces
Real-Time Big Data with Storm, Kafka and GigaSpacesReal-Time Big Data with Storm, Kafka and GigaSpaces
Real-Time Big Data with Storm, Kafka and GigaSpaces
 
Fixing twitter
Fixing twitterFixing twitter
Fixing twitter
 
Fixing_Twitter
Fixing_TwitterFixing_Twitter
Fixing_Twitter
 

Dernier

DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...apidays
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfOverkill Security
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024The Digital Insurer
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusZilliz
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Zilliz
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024The Digital Insurer
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelDeepika Singh
 

Dernier (20)

DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source Milvus
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 

Bigdata roundtable-storm

  • 1. Storm - pipes and filters on steroids Andre Sprenger BigData Roundtable Hamburg 30. Nov 2011
  • 2. My background • info@andresprenger.de • Studied Computer Science and Economics • Background: banking, ecommerce, online advertising • Freelancer • Java, Scala, Ruby, Rails • Hadoop, Pig, Hive, Cassandra
  • 3. “Next click” problem Raymie Strata (CTO,Yahoo): “With the paths that go through Hadoop [at Yahoo!], the latency is about fifteen minutes. … [I]t will never be true real-time. It will never be what we call “next click,” where I click and by the time the page loads, the semantic implication of my decision is reflected in the page.”
  • 4. “Next click” problem (next) HTTP HTTP HTTP HTTP Request Response Request Response max latency max latency 80 ms 80 ms web server realtime near realtime response response real time layer collect data process data time
  • 5. Example problems • Realtime statistics - counting, trends, moving average • Read Twitter stream and output images that are trending in the last 10 minutes • CTR calculation - read ad clicks/ad impressions and calculate new click through rate • ETL - transform format, filter duplicates / bot traffic, enrich from static data, persist • Search advertising
  • 6. Pick your framework... • S4 - Yahoo, “real time map reduce”, actor model • Storm - Twitter • MapReduce Online - Yahoo • Cloud Map Reduce - Accenture • HStreaming - Startup, based on Hadoop • Brisk - DataStax, Cassandra
  • 7. System requirements • Fault tolerance - system keeps running when a node fails • Horizontal scalability - should be easy, just add a node • Low latency • Reliable - does not loose data • High availability - well, if it’s down for an hour its not realtime
  • 8. Storm in a nutshell • Written by Backtype (aquired by Twitter) • Open Source, Github • Runs on JVM • Clojure, Python, Zookeeper, ZeroMQ • Currently used by Twitter for real time statistics
  • 9. Programming model • Tuple - name/value list • Stream - unbounded sequence of Tuples • Spout - source of Streams • Bolt - consumer / producer of Streams • Topology - network of Streams, Spouts and Bolts
  • 10. Spout tuple tuple tuple tuple Spout tuple tuple tuple tuple
  • 11. Bolt Processes streams and generates new streams. tuple tuple tuple tuple tuple tuple tuple tuple Bolt tuple tuple tuple tuple
  • 12. Bolt • filtering • transformation • split / aggregate streams • counting, statistics • read from / write to database
  • 13. Topology Network of Streams, Spouts and Bolts Bolt Bolt Spout Bolt Spout Bolt Bolt
  • 14. Task Parallel processor inside Spouts and Bolts. Each Spout / Bolt has a fixed number of Tasks. Spout Bolt Task Task Task Task Task
  • 15. Stream grouping Which Task does a Tuple go to? • shuffle grouping - distribute randomly • field grouping - partition by field value • all grouping - send to all Tasks • custom grouping - implement your own logic
  • 16. Word count example Sentence Word (“a”, 2) Splitter Count (“b”, 2) Spout Bolt Bolt (“c”, 1) (“a”) (“d”, 1) (“b”) (“a b c a b d”) (“c”) (“a”) (“b”) (“d”)
  • 17. Guaranteed processing (“a”) (“b”) (“a”, 2) (“c”) (“b”, 2) Spout (“a b c a b d”) (“c”, 1) (“a”) (“d”, 1) (“b”) (“d”) Topology has a timeout for processing of the tuple tree
  • 19. Reliability • Nimbus / Supervisor are SPOF • both are stateless, easy to restart without data loss • Failure of master node (?) • Running Topologies should not be affected! • Failed Workers are restarted • Guaranteed message processing
  • 20. Administration • Nimbus / Supervisor / Zookeeper need monitoring and supervisor (e.g. Monit) • Cluster nodes can be added at runtime • But: existing Topologies are not rebalanced (there is a ticket) • Administration web GUI
  • 21. Community • Source is on Github - https://github.com/ nathanmarz/storm.git • Wiki - https://github.com/nathanmarz/storm/wiki • Nice documentation • Google Group • People start to build add-ons: JRuby integration, adapters for JMS, AMQP
  • 22. Storm summary • Nice programming model • Easy to deploy new topologies • Horizontal scalability • Low latency • Fault tolerance • Easy to setup on EC2