SlideShare une entreprise Scribd logo
1  sur  11
Télécharger pour lire hors ligne
ONTOPIC 
Storm-crawler in the wild 
Jake K. Dodd 
jake@ontopic.io 
http://www.ontopic.io
Who We Are 
ONTOPIC 
• Ontopic is an early-stage FinTech startup 
located in Los Angeles, CA 
• We’re building an engine that empowers 
qualitative financial research by taming 
information overload
Our Requirements 
• Need to discover news as soon as it 
appears on the web 
• Involves monitoring several hundred 
thousand content sources 
• This is more adequately described as web 
monitoring than web crawling
What We Tried 
• + Perhaps the gold-standard for open source web crawling 
• + Capable of handling millions of pages per day 
• - We decided that we were trying to force Nutch to do 
something for which it wasn’t designed—specifically, real-time 
monitoring 
• + Open source python web scraping framework 
• + Incredibly simple to get started 
• + Processing pipelines are dead-simple to develop 
• - No built-in distributed mode. Building an in-house 
distributed and continuous-crawl framework for Scrapy 
seemed like a fragile solution 
• - Designed, and primarily used, as a web scraper (again, not 
precisely the same as web monitoring)
Storm Crawler at Ontopic 
• The storm-crawler project is our workhorse 
for web monitoring 
• Integrated with Apache Kafka, Redis, and 
several other technologies 
• Running on a cluster managed by 
Hortonworks HDP 2.2
High-Level Architecture 
Redis 
• Seed List 
• Domain locks 
• Outlink List 
• Logstash events 
manages 
URL 
Manager 
(Ruby app) 
storm-crawler 
• One topology 
• Seed stream and 
outlink stream 
kafka 
• One topic, two 
partitions 
Publishes seeds and 
outlinks to Kafka 
Elasticsearch 
indexing 
Kafka Spout with two executors 
(one for each topic partition) 
logstash
R&D Environment (AWS) 
Redis 
• Seed List 
• Domain locks 
• Outlink List 
• Logstash events 
manages 
URL 
Manager 
(Ruby app) 
Nimbus: 1 x r3.large 
Supervisors: 3 x c3.large 
(in a placement group) 
storm-crawler 
• One topology 
• Seed stream and 
outlink stream 
kafka 
• One topic, two 
partitions 
Publishes seeds and 
outlinks to Kafka 
Elasticsearch 
indexing 
Kafka Spout with two executors 
(one for each topic partition) 
logstash 
1 x m1.small instance 
(Redis and Ruby app) 
1 x r3.large 
1 x c3.large
Eye Candy (Ambari) 
Ambari Dashboard 
Cluster 
utilization 
only ~7%
Eye Candy (Storm) 
Storm UI 
~800,000 URLS per day
Eye Candy (Kibana) 
Kibana Dashboard 
Metrics from storm-crawler sent to Logstash 
enable easy real-time monitoring 
~800,000 URLS per day
Conclusion 
• Storm-crawler has enabled us to build a reliable, distributed, web-scale 
news monitoring solution 
• Screen grabs are from our R&D environment, in which we’ve been 
able to monitor ~2,000 sources with a revisit time of one minute, at 
10% utilization on a small cluster 
• We have zero scalability concerns with storm-crawler—upping the 
number of tasks and nodes has demonstrated the ability to fetch 
100,000s of pages per minute 
• Ontopic is committed to open-sourcing our work on top of storm-crawler 
and being a core contributor to the project 
• We’re working to generalize our integration points with Redis, 
Kafka, and Logstash, and providing tutorials so that storm-crawler 
users can easily leverage these technologies (or their equivalents) 
on projects using storm-crawler

Contenu connexe

Tendances

Interactive learning analytics dashboards with ELK (Elasticsearch Logstash Ki...
Interactive learning analytics dashboards with ELK (Elasticsearch Logstash Ki...Interactive learning analytics dashboards with ELK (Elasticsearch Logstash Ki...
Interactive learning analytics dashboards with ELK (Elasticsearch Logstash Ki...
Andrii Vozniuk
 

Tendances (20)

Log analysis using Logstash,ElasticSearch and Kibana
Log analysis using Logstash,ElasticSearch and KibanaLog analysis using Logstash,ElasticSearch and Kibana
Log analysis using Logstash,ElasticSearch and Kibana
 
Building a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache SolrBuilding a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache Solr
 
JOSA TechTalk: Realtime monitoring and alerts
JOSA TechTalk: Realtime monitoring and alerts JOSA TechTalk: Realtime monitoring and alerts
JOSA TechTalk: Realtime monitoring and alerts
 
Introduction to ELK
Introduction to ELKIntroduction to ELK
Introduction to ELK
 
H-Hypermap - Heatmap Analytics at Scale: Presented by David Smiley, D W Smile...
H-Hypermap - Heatmap Analytics at Scale: Presented by David Smiley, D W Smile...H-Hypermap - Heatmap Analytics at Scale: Presented by David Smiley, D W Smile...
H-Hypermap - Heatmap Analytics at Scale: Presented by David Smiley, D W Smile...
 
Elasticsearch
ElasticsearchElasticsearch
Elasticsearch
 
Elasticsearch presentation 1
Elasticsearch presentation 1Elasticsearch presentation 1
Elasticsearch presentation 1
 
Intro to elasticsearch
Intro to elasticsearchIntro to elasticsearch
Intro to elasticsearch
 
What I learnt: Elastic search & Kibana : introduction, installtion & configur...
What I learnt: Elastic search & Kibana : introduction, installtion & configur...What I learnt: Elastic search & Kibana : introduction, installtion & configur...
What I learnt: Elastic search & Kibana : introduction, installtion & configur...
 
Roaring with elastic search sangam2018
Roaring with elastic search sangam2018Roaring with elastic search sangam2018
Roaring with elastic search sangam2018
 
From Lucene to Elasticsearch, a short explanation of horizontal scalability
From Lucene to Elasticsearch, a short explanation of horizontal scalabilityFrom Lucene to Elasticsearch, a short explanation of horizontal scalability
From Lucene to Elasticsearch, a short explanation of horizontal scalability
 
ELK Ruminating on Logs (Zendcon 2016)
ELK Ruminating on Logs (Zendcon 2016)ELK Ruminating on Logs (Zendcon 2016)
ELK Ruminating on Logs (Zendcon 2016)
 
Introduction to Elasticsearch
Introduction to ElasticsearchIntroduction to Elasticsearch
Introduction to Elasticsearch
 
Interactive learning analytics dashboards with ELK (Elasticsearch Logstash Ki...
Interactive learning analytics dashboards with ELK (Elasticsearch Logstash Ki...Interactive learning analytics dashboards with ELK (Elasticsearch Logstash Ki...
Interactive learning analytics dashboards with ELK (Elasticsearch Logstash Ki...
 
You know, for search. Querying 24 Billion Documents in 900ms
You know, for search. Querying 24 Billion Documents in 900msYou know, for search. Querying 24 Billion Documents in 900ms
You know, for search. Querying 24 Billion Documents in 900ms
 
Data Engineering with Solr and Spark
Data Engineering with Solr and SparkData Engineering with Solr and Spark
Data Engineering with Solr and Spark
 
Monitoring with Graylog - a modern approach to monitoring?
Monitoring with Graylog - a modern approach to monitoring?Monitoring with Graylog - a modern approach to monitoring?
Monitoring with Graylog - a modern approach to monitoring?
 
Loading 350M documents into a large Solr cluster: Presented by Dion Olsthoorn...
Loading 350M documents into a large Solr cluster: Presented by Dion Olsthoorn...Loading 350M documents into a large Solr cluster: Presented by Dion Olsthoorn...
Loading 350M documents into a large Solr cluster: Presented by Dion Olsthoorn...
 
Why Is My Solr Slow?: Presented by Mike Drob, Cloudera
Why Is My Solr Slow?: Presented by Mike Drob, ClouderaWhy Is My Solr Slow?: Presented by Mike Drob, Cloudera
Why Is My Solr Slow?: Presented by Mike Drob, Cloudera
 
SQL for Elasticsearch
SQL for ElasticsearchSQL for Elasticsearch
SQL for Elasticsearch
 

En vedette

Detecting Events on the Web in Real Time with Java, Kafka and ZooKeeper - Jam...
Detecting Events on the Web in Real Time with Java, Kafka and ZooKeeper - Jam...Detecting Events on the Web in Real Time with Java, Kafka and ZooKeeper - Jam...
Detecting Events on the Web in Real Time with Java, Kafka and ZooKeeper - Jam...
JAXLondon2014
 
WSO2Con US 2015 Kubernetes: a platform for automating deployment, scaling, an...
WSO2Con US 2015 Kubernetes: a platform for automating deployment, scaling, an...WSO2Con US 2015 Kubernetes: a platform for automating deployment, scaling, an...
WSO2Con US 2015 Kubernetes: a platform for automating deployment, scaling, an...
Brian Grant
 

En vedette (10)

Detecting Events on the Web in Real Time with Java, Kafka and ZooKeeper - Jam...
Detecting Events on the Web in Real Time with Java, Kafka and ZooKeeper - Jam...Detecting Events on the Web in Real Time with Java, Kafka and ZooKeeper - Jam...
Detecting Events on the Web in Real Time with Java, Kafka and ZooKeeper - Jam...
 
Orchestrating Microservices with Kubernetes
Orchestrating Microservices with Kubernetes Orchestrating Microservices with Kubernetes
Orchestrating Microservices with Kubernetes
 
Business use of Social Media and Impact on Enterprise Architecture
Business use of Social Media and Impact on Enterprise ArchitectureBusiness use of Social Media and Impact on Enterprise Architecture
Business use of Social Media and Impact on Enterprise Architecture
 
WSO2Con US 2015 Kubernetes: a platform for automating deployment, scaling, an...
WSO2Con US 2015 Kubernetes: a platform for automating deployment, scaling, an...WSO2Con US 2015 Kubernetes: a platform for automating deployment, scaling, an...
WSO2Con US 2015 Kubernetes: a platform for automating deployment, scaling, an...
 
Deep-dive into Microservice Outer Architecture
Deep-dive into Microservice Outer ArchitectureDeep-dive into Microservice Outer Architecture
Deep-dive into Microservice Outer Architecture
 
Kubernetes and bluemix
Kubernetes  and  bluemixKubernetes  and  bluemix
Kubernetes and bluemix
 
A brief study on Kubernetes and its components
A brief study on Kubernetes and its componentsA brief study on Kubernetes and its components
A brief study on Kubernetes and its components
 
Frontera: open source, large scale web crawling framework
Frontera: open source, large scale web crawling frameworkFrontera: open source, large scale web crawling framework
Frontera: open source, large scale web crawling framework
 
Velocity NYC 2017: Building Resilient Microservices with Kubernetes, Docker, ...
Velocity NYC 2017: Building Resilient Microservices with Kubernetes, Docker, ...Velocity NYC 2017: Building Resilient Microservices with Kubernetes, Docker, ...
Velocity NYC 2017: Building Resilient Microservices with Kubernetes, Docker, ...
 
Kubernetes Colorado - Kubernetes metrics deep dive 10/25/2017
Kubernetes Colorado - Kubernetes metrics deep dive 10/25/2017Kubernetes Colorado - Kubernetes metrics deep dive 10/25/2017
Kubernetes Colorado - Kubernetes metrics deep dive 10/25/2017
 

Similaire à StormCrawler in the wild

Melbourne Big Data Meetup Talk: Scaling a Real-Time Anomaly Detection Applica...
Melbourne Big Data Meetup Talk: Scaling a Real-Time Anomaly Detection Applica...Melbourne Big Data Meetup Talk: Scaling a Real-Time Anomaly Detection Applica...
Melbourne Big Data Meetup Talk: Scaling a Real-Time Anomaly Detection Applica...
Paul Brebner
 
OpenStack monitoring - Unidata S.p.A. Case Report
OpenStack monitoring - Unidata S.p.A. Case ReportOpenStack monitoring - Unidata S.p.A. Case Report
OpenStack monitoring - Unidata S.p.A. Case Report
Davide Guerri
 

Similaire à StormCrawler in the wild (20)

Storm crawler apachecon_na_2015
Storm crawler apachecon_na_2015Storm crawler apachecon_na_2015
Storm crawler apachecon_na_2015
 
The Art of Container Monitoring
The Art of Container MonitoringThe Art of Container Monitoring
The Art of Container Monitoring
 
Triangle Devops Meetup 10/2015
Triangle Devops Meetup 10/2015Triangle Devops Meetup 10/2015
Triangle Devops Meetup 10/2015
 
Chirp 2010: Scaling Twitter
Chirp 2010: Scaling TwitterChirp 2010: Scaling Twitter
Chirp 2010: Scaling Twitter
 
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
 
AWS re:Invent presentation: Unmeltable Infrastructure at Scale by Loggly
AWS re:Invent presentation: Unmeltable Infrastructure at Scale by Loggly AWS re:Invent presentation: Unmeltable Infrastructure at Scale by Loggly
AWS re:Invent presentation: Unmeltable Infrastructure at Scale by Loggly
 
John adams talk cloudy
John adams   talk cloudyJohn adams   talk cloudy
John adams talk cloudy
 
Dev Ops without the Ops
Dev Ops without the OpsDev Ops without the Ops
Dev Ops without the Ops
 
Common crawlpresentation
Common crawlpresentationCommon crawlpresentation
Common crawlpresentation
 
Building a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopBuilding a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with Hadoop
 
Netflix Keystone—Cloud scale event processing pipeline
Netflix Keystone—Cloud scale event processing pipelineNetflix Keystone—Cloud scale event processing pipeline
Netflix Keystone—Cloud scale event processing pipeline
 
Using AWS WAF and Lambda for Automatic Protection
Using AWS WAF and Lambda for Automatic ProtectionUsing AWS WAF and Lambda for Automatic Protection
Using AWS WAF and Lambda for Automatic Protection
 
Melbourne Big Data Meetup Talk: Scaling a Real-Time Anomaly Detection Applica...
Melbourne Big Data Meetup Talk: Scaling a Real-Time Anomaly Detection Applica...Melbourne Big Data Meetup Talk: Scaling a Real-Time Anomaly Detection Applica...
Melbourne Big Data Meetup Talk: Scaling a Real-Time Anomaly Detection Applica...
 
Microservices with Spring Cloud
Microservices with Spring CloudMicroservices with Spring Cloud
Microservices with Spring Cloud
 
Distributed architecture in a cloud native microservices ecosystem
Distributed architecture in a cloud native microservices ecosystemDistributed architecture in a cloud native microservices ecosystem
Distributed architecture in a cloud native microservices ecosystem
 
An introduction to Node.js
An introduction to Node.jsAn introduction to Node.js
An introduction to Node.js
 
OpenStack monitoring - Unidata S.p.A. Case Report
OpenStack monitoring - Unidata S.p.A. Case ReportOpenStack monitoring - Unidata S.p.A. Case Report
OpenStack monitoring - Unidata S.p.A. Case Report
 
ICON: Intelligent Container Overlays
ICON: Intelligent Container OverlaysICON: Intelligent Container Overlays
ICON: Intelligent Container Overlays
 
A closer look to locaweb IaaS
A closer look to locaweb IaaSA closer look to locaweb IaaS
A closer look to locaweb IaaS
 
Node.js for beginner
Node.js for beginnerNode.js for beginner
Node.js for beginner
 

Dernier

Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...
Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...
Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...
chiefasafspells
 
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
masabamasaba
 
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
masabamasaba
 
The title is not connected to what is inside
The title is not connected to what is insideThe title is not connected to what is inside
The title is not connected to what is inside
shinachiaurasa2
 

Dernier (20)

%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
 
%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand
 
WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?
 
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park %in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
 
WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...
WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...
WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...
 
Artyushina_Guest lecture_YorkU CS May 2024.pptx
Artyushina_Guest lecture_YorkU CS May 2024.pptxArtyushina_Guest lecture_YorkU CS May 2024.pptx
Artyushina_Guest lecture_YorkU CS May 2024.pptx
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learn
 
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
 
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdfPayment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
 
WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...
WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...
WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
 
Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...
Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...
Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...
 
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
 
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
 
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
 
The title is not connected to what is inside
The title is not connected to what is insideThe title is not connected to what is inside
The title is not connected to what is inside
 
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
 

StormCrawler in the wild

  • 1. ONTOPIC Storm-crawler in the wild Jake K. Dodd jake@ontopic.io http://www.ontopic.io
  • 2. Who We Are ONTOPIC • Ontopic is an early-stage FinTech startup located in Los Angeles, CA • We’re building an engine that empowers qualitative financial research by taming information overload
  • 3. Our Requirements • Need to discover news as soon as it appears on the web • Involves monitoring several hundred thousand content sources • This is more adequately described as web monitoring than web crawling
  • 4. What We Tried • + Perhaps the gold-standard for open source web crawling • + Capable of handling millions of pages per day • - We decided that we were trying to force Nutch to do something for which it wasn’t designed—specifically, real-time monitoring • + Open source python web scraping framework • + Incredibly simple to get started • + Processing pipelines are dead-simple to develop • - No built-in distributed mode. Building an in-house distributed and continuous-crawl framework for Scrapy seemed like a fragile solution • - Designed, and primarily used, as a web scraper (again, not precisely the same as web monitoring)
  • 5. Storm Crawler at Ontopic • The storm-crawler project is our workhorse for web monitoring • Integrated with Apache Kafka, Redis, and several other technologies • Running on a cluster managed by Hortonworks HDP 2.2
  • 6. High-Level Architecture Redis • Seed List • Domain locks • Outlink List • Logstash events manages URL Manager (Ruby app) storm-crawler • One topology • Seed stream and outlink stream kafka • One topic, two partitions Publishes seeds and outlinks to Kafka Elasticsearch indexing Kafka Spout with two executors (one for each topic partition) logstash
  • 7. R&D Environment (AWS) Redis • Seed List • Domain locks • Outlink List • Logstash events manages URL Manager (Ruby app) Nimbus: 1 x r3.large Supervisors: 3 x c3.large (in a placement group) storm-crawler • One topology • Seed stream and outlink stream kafka • One topic, two partitions Publishes seeds and outlinks to Kafka Elasticsearch indexing Kafka Spout with two executors (one for each topic partition) logstash 1 x m1.small instance (Redis and Ruby app) 1 x r3.large 1 x c3.large
  • 8. Eye Candy (Ambari) Ambari Dashboard Cluster utilization only ~7%
  • 9. Eye Candy (Storm) Storm UI ~800,000 URLS per day
  • 10. Eye Candy (Kibana) Kibana Dashboard Metrics from storm-crawler sent to Logstash enable easy real-time monitoring ~800,000 URLS per day
  • 11. Conclusion • Storm-crawler has enabled us to build a reliable, distributed, web-scale news monitoring solution • Screen grabs are from our R&D environment, in which we’ve been able to monitor ~2,000 sources with a revisit time of one minute, at 10% utilization on a small cluster • We have zero scalability concerns with storm-crawler—upping the number of tasks and nodes has demonstrated the ability to fetch 100,000s of pages per minute • Ontopic is committed to open-sourcing our work on top of storm-crawler and being a core contributor to the project • We’re working to generalize our integration points with Redis, Kafka, and Logstash, and providing tutorials so that storm-crawler users can easily leverage these technologies (or their equivalents) on projects using storm-crawler