SlideShare une entreprise Scribd logo
1  sur  11
Télécharger pour lire hors ligne
ONTOPIC 
Storm-crawler in the wild 
Jake K. Dodd 
jake@ontopic.io 
http://www.ontopic.io
Who We Are 
ONTOPIC 
• Ontopic is an early-stage FinTech startup 
located in Los Angeles, CA 
• We’re building an engine that empowers 
qualitative financial research by taming 
information overload
Our Requirements 
• Need to discover news as soon as it 
appears on the web 
• Involves monitoring several hundred 
thousand content sources 
• This is more adequately described as web 
monitoring than web crawling
What We Tried 
• + Perhaps the gold-standard for open source web crawling 
• + Capable of handling millions of pages per day 
• - We decided that we were trying to force Nutch to do 
something for which it wasn’t designed—specifically, real-time 
monitoring 
• + Open source python web scraping framework 
• + Incredibly simple to get started 
• + Processing pipelines are dead-simple to develop 
• - No built-in distributed mode. Building an in-house 
distributed and continuous-crawl framework for Scrapy 
seemed like a fragile solution 
• - Designed, and primarily used, as a web scraper (again, not 
precisely the same as web monitoring)
Storm Crawler at Ontopic 
• The storm-crawler project is our workhorse 
for web monitoring 
• Integrated with Apache Kafka, Redis, and 
several other technologies 
• Running on a cluster managed by 
Hortonworks HDP 2.2
High-Level Architecture 
Redis 
• Seed List 
• Domain locks 
• Outlink List 
• Logstash events 
manages 
URL 
Manager 
(Ruby app) 
storm-crawler 
• One topology 
• Seed stream and 
outlink stream 
kafka 
• One topic, two 
partitions 
Publishes seeds and 
outlinks to Kafka 
Elasticsearch 
indexing 
Kafka Spout with two executors 
(one for each topic partition) 
logstash
R&D Environment (AWS) 
Redis 
• Seed List 
• Domain locks 
• Outlink List 
• Logstash events 
manages 
URL 
Manager 
(Ruby app) 
Nimbus: 1 x r3.large 
Supervisors: 3 x c3.large 
(in a placement group) 
storm-crawler 
• One topology 
• Seed stream and 
outlink stream 
kafka 
• One topic, two 
partitions 
Publishes seeds and 
outlinks to Kafka 
Elasticsearch 
indexing 
Kafka Spout with two executors 
(one for each topic partition) 
logstash 
1 x m1.small instance 
(Redis and Ruby app) 
1 x r3.large 
1 x c3.large
Eye Candy (Ambari) 
Ambari Dashboard 
Cluster 
utilization 
only ~7%
Eye Candy (Storm) 
Storm UI 
~800,000 URLS per day
Eye Candy (Kibana) 
Kibana Dashboard 
Metrics from storm-crawler sent to Logstash 
enable easy real-time monitoring 
~800,000 URLS per day
Conclusion 
• Storm-crawler has enabled us to build a reliable, distributed, web-scale 
news monitoring solution 
• Screen grabs are from our R&D environment, in which we’ve been 
able to monitor ~2,000 sources with a revisit time of one minute, at 
10% utilization on a small cluster 
• We have zero scalability concerns with storm-crawler—upping the 
number of tasks and nodes has demonstrated the ability to fetch 
100,000s of pages per minute 
• Ontopic is committed to open-sourcing our work on top of storm-crawler 
and being a core contributor to the project 
• We’re working to generalize our integration points with Redis, 
Kafka, and Logstash, and providing tutorials so that storm-crawler 
users can easily leverage these technologies (or their equivalents) 
on projects using storm-crawler

Contenu connexe

Tendances

Log analysis using Logstash,ElasticSearch and Kibana
Log analysis using Logstash,ElasticSearch and KibanaLog analysis using Logstash,ElasticSearch and Kibana
Log analysis using Logstash,ElasticSearch and KibanaAvinash Ramineni
 
Building a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache SolrBuilding a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache SolrRahul Jain
 
Introduction to ELK
Introduction to ELKIntroduction to ELK
Introduction to ELKYuHsuan Chen
 
H-Hypermap - Heatmap Analytics at Scale: Presented by David Smiley, D W Smile...
H-Hypermap - Heatmap Analytics at Scale: Presented by David Smiley, D W Smile...H-Hypermap - Heatmap Analytics at Scale: Presented by David Smiley, D W Smile...
H-Hypermap - Heatmap Analytics at Scale: Presented by David Smiley, D W Smile...Lucidworks
 
Elasticsearch presentation 1
Elasticsearch presentation 1Elasticsearch presentation 1
Elasticsearch presentation 1Maruf Hassan
 
Intro to elasticsearch
Intro to elasticsearchIntro to elasticsearch
Intro to elasticsearchJoey Wen
 
What I learnt: Elastic search & Kibana : introduction, installtion & configur...
What I learnt: Elastic search & Kibana : introduction, installtion & configur...What I learnt: Elastic search & Kibana : introduction, installtion & configur...
What I learnt: Elastic search & Kibana : introduction, installtion & configur...Rahul K Chauhan
 
Roaring with elastic search sangam2018
Roaring with elastic search sangam2018Roaring with elastic search sangam2018
Roaring with elastic search sangam2018Vinay Kumar
 
From Lucene to Elasticsearch, a short explanation of horizontal scalability
From Lucene to Elasticsearch, a short explanation of horizontal scalabilityFrom Lucene to Elasticsearch, a short explanation of horizontal scalability
From Lucene to Elasticsearch, a short explanation of horizontal scalabilityStéphane Gamard
 
ELK Ruminating on Logs (Zendcon 2016)
ELK Ruminating on Logs (Zendcon 2016)ELK Ruminating on Logs (Zendcon 2016)
ELK Ruminating on Logs (Zendcon 2016)Mathew Beane
 
Introduction to Elasticsearch
Introduction to ElasticsearchIntroduction to Elasticsearch
Introduction to ElasticsearchRuslan Zavacky
 
Interactive learning analytics dashboards with ELK (Elasticsearch Logstash Ki...
Interactive learning analytics dashboards with ELK (Elasticsearch Logstash Ki...Interactive learning analytics dashboards with ELK (Elasticsearch Logstash Ki...
Interactive learning analytics dashboards with ELK (Elasticsearch Logstash Ki...Andrii Vozniuk
 
You know, for search. Querying 24 Billion Documents in 900ms
You know, for search. Querying 24 Billion Documents in 900msYou know, for search. Querying 24 Billion Documents in 900ms
You know, for search. Querying 24 Billion Documents in 900msJodok Batlogg
 
Data Engineering with Solr and Spark
Data Engineering with Solr and SparkData Engineering with Solr and Spark
Data Engineering with Solr and SparkLucidworks
 
Monitoring with Graylog - a modern approach to monitoring?
Monitoring with Graylog - a modern approach to monitoring?Monitoring with Graylog - a modern approach to monitoring?
Monitoring with Graylog - a modern approach to monitoring?inovex GmbH
 
Loading 350M documents into a large Solr cluster: Presented by Dion Olsthoorn...
Loading 350M documents into a large Solr cluster: Presented by Dion Olsthoorn...Loading 350M documents into a large Solr cluster: Presented by Dion Olsthoorn...
Loading 350M documents into a large Solr cluster: Presented by Dion Olsthoorn...Lucidworks
 
Why Is My Solr Slow?: Presented by Mike Drob, Cloudera
Why Is My Solr Slow?: Presented by Mike Drob, ClouderaWhy Is My Solr Slow?: Presented by Mike Drob, Cloudera
Why Is My Solr Slow?: Presented by Mike Drob, ClouderaLucidworks
 
SQL for Elasticsearch
SQL for ElasticsearchSQL for Elasticsearch
SQL for ElasticsearchJodok Batlogg
 

Tendances (20)

Log analysis using Logstash,ElasticSearch and Kibana
Log analysis using Logstash,ElasticSearch and KibanaLog analysis using Logstash,ElasticSearch and Kibana
Log analysis using Logstash,ElasticSearch and Kibana
 
Building a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache SolrBuilding a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache Solr
 
JOSA TechTalk: Realtime monitoring and alerts
JOSA TechTalk: Realtime monitoring and alerts JOSA TechTalk: Realtime monitoring and alerts
JOSA TechTalk: Realtime monitoring and alerts
 
Introduction to ELK
Introduction to ELKIntroduction to ELK
Introduction to ELK
 
H-Hypermap - Heatmap Analytics at Scale: Presented by David Smiley, D W Smile...
H-Hypermap - Heatmap Analytics at Scale: Presented by David Smiley, D W Smile...H-Hypermap - Heatmap Analytics at Scale: Presented by David Smiley, D W Smile...
H-Hypermap - Heatmap Analytics at Scale: Presented by David Smiley, D W Smile...
 
Elasticsearch
ElasticsearchElasticsearch
Elasticsearch
 
Elasticsearch presentation 1
Elasticsearch presentation 1Elasticsearch presentation 1
Elasticsearch presentation 1
 
Intro to elasticsearch
Intro to elasticsearchIntro to elasticsearch
Intro to elasticsearch
 
What I learnt: Elastic search & Kibana : introduction, installtion & configur...
What I learnt: Elastic search & Kibana : introduction, installtion & configur...What I learnt: Elastic search & Kibana : introduction, installtion & configur...
What I learnt: Elastic search & Kibana : introduction, installtion & configur...
 
Roaring with elastic search sangam2018
Roaring with elastic search sangam2018Roaring with elastic search sangam2018
Roaring with elastic search sangam2018
 
From Lucene to Elasticsearch, a short explanation of horizontal scalability
From Lucene to Elasticsearch, a short explanation of horizontal scalabilityFrom Lucene to Elasticsearch, a short explanation of horizontal scalability
From Lucene to Elasticsearch, a short explanation of horizontal scalability
 
ELK Ruminating on Logs (Zendcon 2016)
ELK Ruminating on Logs (Zendcon 2016)ELK Ruminating on Logs (Zendcon 2016)
ELK Ruminating on Logs (Zendcon 2016)
 
Introduction to Elasticsearch
Introduction to ElasticsearchIntroduction to Elasticsearch
Introduction to Elasticsearch
 
Interactive learning analytics dashboards with ELK (Elasticsearch Logstash Ki...
Interactive learning analytics dashboards with ELK (Elasticsearch Logstash Ki...Interactive learning analytics dashboards with ELK (Elasticsearch Logstash Ki...
Interactive learning analytics dashboards with ELK (Elasticsearch Logstash Ki...
 
You know, for search. Querying 24 Billion Documents in 900ms
You know, for search. Querying 24 Billion Documents in 900msYou know, for search. Querying 24 Billion Documents in 900ms
You know, for search. Querying 24 Billion Documents in 900ms
 
Data Engineering with Solr and Spark
Data Engineering with Solr and SparkData Engineering with Solr and Spark
Data Engineering with Solr and Spark
 
Monitoring with Graylog - a modern approach to monitoring?
Monitoring with Graylog - a modern approach to monitoring?Monitoring with Graylog - a modern approach to monitoring?
Monitoring with Graylog - a modern approach to monitoring?
 
Loading 350M documents into a large Solr cluster: Presented by Dion Olsthoorn...
Loading 350M documents into a large Solr cluster: Presented by Dion Olsthoorn...Loading 350M documents into a large Solr cluster: Presented by Dion Olsthoorn...
Loading 350M documents into a large Solr cluster: Presented by Dion Olsthoorn...
 
Why Is My Solr Slow?: Presented by Mike Drob, Cloudera
Why Is My Solr Slow?: Presented by Mike Drob, ClouderaWhy Is My Solr Slow?: Presented by Mike Drob, Cloudera
Why Is My Solr Slow?: Presented by Mike Drob, Cloudera
 
SQL for Elasticsearch
SQL for ElasticsearchSQL for Elasticsearch
SQL for Elasticsearch
 

En vedette

Detecting Events on the Web in Real Time with Java, Kafka and ZooKeeper - Jam...
Detecting Events on the Web in Real Time with Java, Kafka and ZooKeeper - Jam...Detecting Events on the Web in Real Time with Java, Kafka and ZooKeeper - Jam...
Detecting Events on the Web in Real Time with Java, Kafka and ZooKeeper - Jam...JAXLondon2014
 
Orchestrating Microservices with Kubernetes
Orchestrating Microservices with Kubernetes Orchestrating Microservices with Kubernetes
Orchestrating Microservices with Kubernetes Weaveworks
 
Business use of Social Media and Impact on Enterprise Architecture
Business use of Social Media and Impact on Enterprise ArchitectureBusiness use of Social Media and Impact on Enterprise Architecture
Business use of Social Media and Impact on Enterprise ArchitectureNUS-ISS
 
WSO2Con US 2015 Kubernetes: a platform for automating deployment, scaling, an...
WSO2Con US 2015 Kubernetes: a platform for automating deployment, scaling, an...WSO2Con US 2015 Kubernetes: a platform for automating deployment, scaling, an...
WSO2Con US 2015 Kubernetes: a platform for automating deployment, scaling, an...Brian Grant
 
Deep-dive into Microservice Outer Architecture
Deep-dive into Microservice Outer ArchitectureDeep-dive into Microservice Outer Architecture
Deep-dive into Microservice Outer ArchitectureWSO2
 
Kubernetes and bluemix
Kubernetes  and  bluemixKubernetes  and  bluemix
Kubernetes and bluemixDuckDuckGo
 
A brief study on Kubernetes and its components
A brief study on Kubernetes and its componentsA brief study on Kubernetes and its components
A brief study on Kubernetes and its componentsRamit Surana
 
Frontera: open source, large scale web crawling framework
Frontera: open source, large scale web crawling frameworkFrontera: open source, large scale web crawling framework
Frontera: open source, large scale web crawling frameworkScrapinghub
 
Velocity NYC 2017: Building Resilient Microservices with Kubernetes, Docker, ...
Velocity NYC 2017: Building Resilient Microservices with Kubernetes, Docker, ...Velocity NYC 2017: Building Resilient Microservices with Kubernetes, Docker, ...
Velocity NYC 2017: Building Resilient Microservices with Kubernetes, Docker, ...Ambassador Labs
 
Kubernetes Colorado - Kubernetes metrics deep dive 10/25/2017
Kubernetes Colorado - Kubernetes metrics deep dive 10/25/2017Kubernetes Colorado - Kubernetes metrics deep dive 10/25/2017
Kubernetes Colorado - Kubernetes metrics deep dive 10/25/2017Bob Cotton
 

En vedette (10)

Detecting Events on the Web in Real Time with Java, Kafka and ZooKeeper - Jam...
Detecting Events on the Web in Real Time with Java, Kafka and ZooKeeper - Jam...Detecting Events on the Web in Real Time with Java, Kafka and ZooKeeper - Jam...
Detecting Events on the Web in Real Time with Java, Kafka and ZooKeeper - Jam...
 
Orchestrating Microservices with Kubernetes
Orchestrating Microservices with Kubernetes Orchestrating Microservices with Kubernetes
Orchestrating Microservices with Kubernetes
 
Business use of Social Media and Impact on Enterprise Architecture
Business use of Social Media and Impact on Enterprise ArchitectureBusiness use of Social Media and Impact on Enterprise Architecture
Business use of Social Media and Impact on Enterprise Architecture
 
WSO2Con US 2015 Kubernetes: a platform for automating deployment, scaling, an...
WSO2Con US 2015 Kubernetes: a platform for automating deployment, scaling, an...WSO2Con US 2015 Kubernetes: a platform for automating deployment, scaling, an...
WSO2Con US 2015 Kubernetes: a platform for automating deployment, scaling, an...
 
Deep-dive into Microservice Outer Architecture
Deep-dive into Microservice Outer ArchitectureDeep-dive into Microservice Outer Architecture
Deep-dive into Microservice Outer Architecture
 
Kubernetes and bluemix
Kubernetes  and  bluemixKubernetes  and  bluemix
Kubernetes and bluemix
 
A brief study on Kubernetes and its components
A brief study on Kubernetes and its componentsA brief study on Kubernetes and its components
A brief study on Kubernetes and its components
 
Frontera: open source, large scale web crawling framework
Frontera: open source, large scale web crawling frameworkFrontera: open source, large scale web crawling framework
Frontera: open source, large scale web crawling framework
 
Velocity NYC 2017: Building Resilient Microservices with Kubernetes, Docker, ...
Velocity NYC 2017: Building Resilient Microservices with Kubernetes, Docker, ...Velocity NYC 2017: Building Resilient Microservices with Kubernetes, Docker, ...
Velocity NYC 2017: Building Resilient Microservices with Kubernetes, Docker, ...
 
Kubernetes Colorado - Kubernetes metrics deep dive 10/25/2017
Kubernetes Colorado - Kubernetes metrics deep dive 10/25/2017Kubernetes Colorado - Kubernetes metrics deep dive 10/25/2017
Kubernetes Colorado - Kubernetes metrics deep dive 10/25/2017
 

Similaire à StormCrawler in the wild

Storm crawler apachecon_na_2015
Storm crawler apachecon_na_2015Storm crawler apachecon_na_2015
Storm crawler apachecon_na_2015ontopic
 
The Art of Container Monitoring
The Art of Container MonitoringThe Art of Container Monitoring
The Art of Container MonitoringDerek Chen
 
Triangle Devops Meetup 10/2015
Triangle Devops Meetup 10/2015Triangle Devops Meetup 10/2015
Triangle Devops Meetup 10/2015aspyker
 
Chirp 2010: Scaling Twitter
Chirp 2010: Scaling TwitterChirp 2010: Scaling Twitter
Chirp 2010: Scaling TwitterJohn Adams
 
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...Amazon Web Services
 
AWS re:Invent presentation: Unmeltable Infrastructure at Scale by Loggly
AWS re:Invent presentation: Unmeltable Infrastructure at Scale by Loggly AWS re:Invent presentation: Unmeltable Infrastructure at Scale by Loggly
AWS re:Invent presentation: Unmeltable Infrastructure at Scale by Loggly SolarWinds Loggly
 
John adams talk cloudy
John adams   talk cloudyJohn adams   talk cloudy
John adams talk cloudyJohn Adams
 
Building a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopBuilding a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopHadoop User Group
 
Netflix Keystone—Cloud scale event processing pipeline
Netflix Keystone—Cloud scale event processing pipelineNetflix Keystone—Cloud scale event processing pipeline
Netflix Keystone—Cloud scale event processing pipelineMonal Daxini
 
Using AWS WAF and Lambda for Automatic Protection
Using AWS WAF and Lambda for Automatic ProtectionUsing AWS WAF and Lambda for Automatic Protection
Using AWS WAF and Lambda for Automatic ProtectionAmazon Web Services
 
Melbourne Big Data Meetup Talk: Scaling a Real-Time Anomaly Detection Applica...
Melbourne Big Data Meetup Talk: Scaling a Real-Time Anomaly Detection Applica...Melbourne Big Data Meetup Talk: Scaling a Real-Time Anomaly Detection Applica...
Melbourne Big Data Meetup Talk: Scaling a Real-Time Anomaly Detection Applica...Paul Brebner
 
Microservices with Spring Cloud
Microservices with Spring CloudMicroservices with Spring Cloud
Microservices with Spring CloudDaniel Eichten
 
Distributed architecture in a cloud native microservices ecosystem
Distributed architecture in a cloud native microservices ecosystemDistributed architecture in a cloud native microservices ecosystem
Distributed architecture in a cloud native microservices ecosystemZhenzhong Xu
 
An introduction to Node.js
An introduction to Node.jsAn introduction to Node.js
An introduction to Node.jsKasey McCurdy
 
OpenStack monitoring - Unidata S.p.A. Case Report
OpenStack monitoring - Unidata S.p.A. Case ReportOpenStack monitoring - Unidata S.p.A. Case Report
OpenStack monitoring - Unidata S.p.A. Case ReportDavide Guerri
 
ICON: Intelligent Container Overlays
ICON: Intelligent Container OverlaysICON: Intelligent Container Overlays
ICON: Intelligent Container OverlaysNitinder Mohan
 
A closer look to locaweb IaaS
A closer look to locaweb IaaSA closer look to locaweb IaaS
A closer look to locaweb IaaSGleicon Moraes
 

Similaire à StormCrawler in the wild (20)

Storm crawler apachecon_na_2015
Storm crawler apachecon_na_2015Storm crawler apachecon_na_2015
Storm crawler apachecon_na_2015
 
The Art of Container Monitoring
The Art of Container MonitoringThe Art of Container Monitoring
The Art of Container Monitoring
 
Triangle Devops Meetup 10/2015
Triangle Devops Meetup 10/2015Triangle Devops Meetup 10/2015
Triangle Devops Meetup 10/2015
 
Chirp 2010: Scaling Twitter
Chirp 2010: Scaling TwitterChirp 2010: Scaling Twitter
Chirp 2010: Scaling Twitter
 
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
 
AWS re:Invent presentation: Unmeltable Infrastructure at Scale by Loggly
AWS re:Invent presentation: Unmeltable Infrastructure at Scale by Loggly AWS re:Invent presentation: Unmeltable Infrastructure at Scale by Loggly
AWS re:Invent presentation: Unmeltable Infrastructure at Scale by Loggly
 
John adams talk cloudy
John adams   talk cloudyJohn adams   talk cloudy
John adams talk cloudy
 
Dev Ops without the Ops
Dev Ops without the OpsDev Ops without the Ops
Dev Ops without the Ops
 
Building a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopBuilding a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with Hadoop
 
Common crawlpresentation
Common crawlpresentationCommon crawlpresentation
Common crawlpresentation
 
Netflix Keystone—Cloud scale event processing pipeline
Netflix Keystone—Cloud scale event processing pipelineNetflix Keystone—Cloud scale event processing pipeline
Netflix Keystone—Cloud scale event processing pipeline
 
Using AWS WAF and Lambda for Automatic Protection
Using AWS WAF and Lambda for Automatic ProtectionUsing AWS WAF and Lambda for Automatic Protection
Using AWS WAF and Lambda for Automatic Protection
 
Melbourne Big Data Meetup Talk: Scaling a Real-Time Anomaly Detection Applica...
Melbourne Big Data Meetup Talk: Scaling a Real-Time Anomaly Detection Applica...Melbourne Big Data Meetup Talk: Scaling a Real-Time Anomaly Detection Applica...
Melbourne Big Data Meetup Talk: Scaling a Real-Time Anomaly Detection Applica...
 
Microservices with Spring Cloud
Microservices with Spring CloudMicroservices with Spring Cloud
Microservices with Spring Cloud
 
Distributed architecture in a cloud native microservices ecosystem
Distributed architecture in a cloud native microservices ecosystemDistributed architecture in a cloud native microservices ecosystem
Distributed architecture in a cloud native microservices ecosystem
 
An introduction to Node.js
An introduction to Node.jsAn introduction to Node.js
An introduction to Node.js
 
OpenStack monitoring - Unidata S.p.A. Case Report
OpenStack monitoring - Unidata S.p.A. Case ReportOpenStack monitoring - Unidata S.p.A. Case Report
OpenStack monitoring - Unidata S.p.A. Case Report
 
ICON: Intelligent Container Overlays
ICON: Intelligent Container OverlaysICON: Intelligent Container Overlays
ICON: Intelligent Container Overlays
 
A closer look to locaweb IaaS
A closer look to locaweb IaaSA closer look to locaweb IaaS
A closer look to locaweb IaaS
 
Node.js for beginner
Node.js for beginnerNode.js for beginner
Node.js for beginner
 

Dernier

ERP For Electrical and Electronics manufecturing.pptx
ERP For Electrical and Electronics manufecturing.pptxERP For Electrical and Electronics manufecturing.pptx
ERP For Electrical and Electronics manufecturing.pptxAutus Cyber Tech
 
Generative AI for Cybersecurity - EC-Council
Generative AI for Cybersecurity - EC-CouncilGenerative AI for Cybersecurity - EC-Council
Generative AI for Cybersecurity - EC-CouncilVICTOR MAESTRE RAMIREZ
 
eAuditor Audits & Inspections - conduct field inspections
eAuditor Audits & Inspections - conduct field inspectionseAuditor Audits & Inspections - conduct field inspections
eAuditor Audits & Inspections - conduct field inspectionsNirav Modi
 
How Does the Epitome of Spyware Differ from Other Malicious Software?
How Does the Epitome of Spyware Differ from Other Malicious Software?How Does the Epitome of Spyware Differ from Other Malicious Software?
How Does the Epitome of Spyware Differ from Other Malicious Software?AmeliaSmith90
 
Mastering Kubernetes - Basics and Advanced Concepts using Example Project
Mastering Kubernetes - Basics and Advanced Concepts using Example ProjectMastering Kubernetes - Basics and Advanced Concepts using Example Project
Mastering Kubernetes - Basics and Advanced Concepts using Example Projectwajrcs
 
Optimizing Business Potential: A Guide to Outsourcing Engineering Services in...
Optimizing Business Potential: A Guide to Outsourcing Engineering Services in...Optimizing Business Potential: A Guide to Outsourcing Engineering Services in...
Optimizing Business Potential: A Guide to Outsourcing Engineering Services in...Jaydeep Chhasatia
 
Introduction-to-Software-Development-Outsourcing.pptx
Introduction-to-Software-Development-Outsourcing.pptxIntroduction-to-Software-Development-Outsourcing.pptx
Introduction-to-Software-Development-Outsourcing.pptxIntelliSource Technologies
 
Top Software Development Trends in 2024
Top Software Development Trends in  2024Top Software Development Trends in  2024
Top Software Development Trends in 2024Mind IT Systems
 
IA Generativa y Grafos de Neo4j: RAG time
IA Generativa y Grafos de Neo4j: RAG timeIA Generativa y Grafos de Neo4j: RAG time
IA Generativa y Grafos de Neo4j: RAG timeNeo4j
 
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/MLBig Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/MLAlluxio, Inc.
 
Kawika Technologies pvt ltd Software Development Company in Trivandrum
Kawika Technologies pvt ltd Software Development Company in TrivandrumKawika Technologies pvt ltd Software Development Company in Trivandrum
Kawika Technologies pvt ltd Software Development Company in TrivandrumKawika Technologies
 
online pdf editor software solutions.pdf
online pdf editor software solutions.pdfonline pdf editor software solutions.pdf
online pdf editor software solutions.pdfMeon Technology
 
20240330_고급진 코드를 위한 exception 다루기
20240330_고급진 코드를 위한 exception 다루기20240330_고급진 코드를 위한 exception 다루기
20240330_고급진 코드를 위한 exception 다루기Chiwon Song
 
Webinar - IA generativa e grafi Neo4j: RAG time!
Webinar - IA generativa e grafi Neo4j: RAG time!Webinar - IA generativa e grafi Neo4j: RAG time!
Webinar - IA generativa e grafi Neo4j: RAG time!Neo4j
 
Cybersecurity Challenges with Generative AI - for Good and Bad
Cybersecurity Challenges with Generative AI - for Good and BadCybersecurity Challenges with Generative AI - for Good and Bad
Cybersecurity Challenges with Generative AI - for Good and BadIvo Andreev
 
OpenChain Webinar: Universal CVSS Calculator
OpenChain Webinar: Universal CVSS CalculatorOpenChain Webinar: Universal CVSS Calculator
OpenChain Webinar: Universal CVSS CalculatorShane Coughlan
 
Why Choose Brain Inventory For Ecommerce Development.pdf
Why Choose Brain Inventory For Ecommerce Development.pdfWhy Choose Brain Inventory For Ecommerce Development.pdf
Why Choose Brain Inventory For Ecommerce Development.pdfBrain Inventory
 
Leveraging DxSherpa's Generative AI Services to Unlock Human-Machine Harmony
Leveraging DxSherpa's Generative AI Services to Unlock Human-Machine HarmonyLeveraging DxSherpa's Generative AI Services to Unlock Human-Machine Harmony
Leveraging DxSherpa's Generative AI Services to Unlock Human-Machine Harmonyelliciumsolutionspun
 
Growing Oxen: channel operators and retries
Growing Oxen: channel operators and retriesGrowing Oxen: channel operators and retries
Growing Oxen: channel operators and retriesSoftwareMill
 
JS-Experts - Cybersecurity for Generative AI
JS-Experts - Cybersecurity for Generative AIJS-Experts - Cybersecurity for Generative AI
JS-Experts - Cybersecurity for Generative AIIvo Andreev
 

Dernier (20)

ERP For Electrical and Electronics manufecturing.pptx
ERP For Electrical and Electronics manufecturing.pptxERP For Electrical and Electronics manufecturing.pptx
ERP For Electrical and Electronics manufecturing.pptx
 
Generative AI for Cybersecurity - EC-Council
Generative AI for Cybersecurity - EC-CouncilGenerative AI for Cybersecurity - EC-Council
Generative AI for Cybersecurity - EC-Council
 
eAuditor Audits & Inspections - conduct field inspections
eAuditor Audits & Inspections - conduct field inspectionseAuditor Audits & Inspections - conduct field inspections
eAuditor Audits & Inspections - conduct field inspections
 
How Does the Epitome of Spyware Differ from Other Malicious Software?
How Does the Epitome of Spyware Differ from Other Malicious Software?How Does the Epitome of Spyware Differ from Other Malicious Software?
How Does the Epitome of Spyware Differ from Other Malicious Software?
 
Mastering Kubernetes - Basics and Advanced Concepts using Example Project
Mastering Kubernetes - Basics and Advanced Concepts using Example ProjectMastering Kubernetes - Basics and Advanced Concepts using Example Project
Mastering Kubernetes - Basics and Advanced Concepts using Example Project
 
Optimizing Business Potential: A Guide to Outsourcing Engineering Services in...
Optimizing Business Potential: A Guide to Outsourcing Engineering Services in...Optimizing Business Potential: A Guide to Outsourcing Engineering Services in...
Optimizing Business Potential: A Guide to Outsourcing Engineering Services in...
 
Introduction-to-Software-Development-Outsourcing.pptx
Introduction-to-Software-Development-Outsourcing.pptxIntroduction-to-Software-Development-Outsourcing.pptx
Introduction-to-Software-Development-Outsourcing.pptx
 
Top Software Development Trends in 2024
Top Software Development Trends in  2024Top Software Development Trends in  2024
Top Software Development Trends in 2024
 
IA Generativa y Grafos de Neo4j: RAG time
IA Generativa y Grafos de Neo4j: RAG timeIA Generativa y Grafos de Neo4j: RAG time
IA Generativa y Grafos de Neo4j: RAG time
 
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/MLBig Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML
 
Kawika Technologies pvt ltd Software Development Company in Trivandrum
Kawika Technologies pvt ltd Software Development Company in TrivandrumKawika Technologies pvt ltd Software Development Company in Trivandrum
Kawika Technologies pvt ltd Software Development Company in Trivandrum
 
online pdf editor software solutions.pdf
online pdf editor software solutions.pdfonline pdf editor software solutions.pdf
online pdf editor software solutions.pdf
 
20240330_고급진 코드를 위한 exception 다루기
20240330_고급진 코드를 위한 exception 다루기20240330_고급진 코드를 위한 exception 다루기
20240330_고급진 코드를 위한 exception 다루기
 
Webinar - IA generativa e grafi Neo4j: RAG time!
Webinar - IA generativa e grafi Neo4j: RAG time!Webinar - IA generativa e grafi Neo4j: RAG time!
Webinar - IA generativa e grafi Neo4j: RAG time!
 
Cybersecurity Challenges with Generative AI - for Good and Bad
Cybersecurity Challenges with Generative AI - for Good and BadCybersecurity Challenges with Generative AI - for Good and Bad
Cybersecurity Challenges with Generative AI - for Good and Bad
 
OpenChain Webinar: Universal CVSS Calculator
OpenChain Webinar: Universal CVSS CalculatorOpenChain Webinar: Universal CVSS Calculator
OpenChain Webinar: Universal CVSS Calculator
 
Why Choose Brain Inventory For Ecommerce Development.pdf
Why Choose Brain Inventory For Ecommerce Development.pdfWhy Choose Brain Inventory For Ecommerce Development.pdf
Why Choose Brain Inventory For Ecommerce Development.pdf
 
Leveraging DxSherpa's Generative AI Services to Unlock Human-Machine Harmony
Leveraging DxSherpa's Generative AI Services to Unlock Human-Machine HarmonyLeveraging DxSherpa's Generative AI Services to Unlock Human-Machine Harmony
Leveraging DxSherpa's Generative AI Services to Unlock Human-Machine Harmony
 
Growing Oxen: channel operators and retries
Growing Oxen: channel operators and retriesGrowing Oxen: channel operators and retries
Growing Oxen: channel operators and retries
 
JS-Experts - Cybersecurity for Generative AI
JS-Experts - Cybersecurity for Generative AIJS-Experts - Cybersecurity for Generative AI
JS-Experts - Cybersecurity for Generative AI
 

StormCrawler in the wild

  • 1. ONTOPIC Storm-crawler in the wild Jake K. Dodd jake@ontopic.io http://www.ontopic.io
  • 2. Who We Are ONTOPIC • Ontopic is an early-stage FinTech startup located in Los Angeles, CA • We’re building an engine that empowers qualitative financial research by taming information overload
  • 3. Our Requirements • Need to discover news as soon as it appears on the web • Involves monitoring several hundred thousand content sources • This is more adequately described as web monitoring than web crawling
  • 4. What We Tried • + Perhaps the gold-standard for open source web crawling • + Capable of handling millions of pages per day • - We decided that we were trying to force Nutch to do something for which it wasn’t designed—specifically, real-time monitoring • + Open source python web scraping framework • + Incredibly simple to get started • + Processing pipelines are dead-simple to develop • - No built-in distributed mode. Building an in-house distributed and continuous-crawl framework for Scrapy seemed like a fragile solution • - Designed, and primarily used, as a web scraper (again, not precisely the same as web monitoring)
  • 5. Storm Crawler at Ontopic • The storm-crawler project is our workhorse for web monitoring • Integrated with Apache Kafka, Redis, and several other technologies • Running on a cluster managed by Hortonworks HDP 2.2
  • 6. High-Level Architecture Redis • Seed List • Domain locks • Outlink List • Logstash events manages URL Manager (Ruby app) storm-crawler • One topology • Seed stream and outlink stream kafka • One topic, two partitions Publishes seeds and outlinks to Kafka Elasticsearch indexing Kafka Spout with two executors (one for each topic partition) logstash
  • 7. R&D Environment (AWS) Redis • Seed List • Domain locks • Outlink List • Logstash events manages URL Manager (Ruby app) Nimbus: 1 x r3.large Supervisors: 3 x c3.large (in a placement group) storm-crawler • One topology • Seed stream and outlink stream kafka • One topic, two partitions Publishes seeds and outlinks to Kafka Elasticsearch indexing Kafka Spout with two executors (one for each topic partition) logstash 1 x m1.small instance (Redis and Ruby app) 1 x r3.large 1 x c3.large
  • 8. Eye Candy (Ambari) Ambari Dashboard Cluster utilization only ~7%
  • 9. Eye Candy (Storm) Storm UI ~800,000 URLS per day
  • 10. Eye Candy (Kibana) Kibana Dashboard Metrics from storm-crawler sent to Logstash enable easy real-time monitoring ~800,000 URLS per day
  • 11. Conclusion • Storm-crawler has enabled us to build a reliable, distributed, web-scale news monitoring solution • Screen grabs are from our R&D environment, in which we’ve been able to monitor ~2,000 sources with a revisit time of one minute, at 10% utilization on a small cluster • We have zero scalability concerns with storm-crawler—upping the number of tasks and nodes has demonstrated the ability to fetch 100,000s of pages per minute • Ontopic is committed to open-sourcing our work on top of storm-crawler and being a core contributor to the project • We’re working to generalize our integration points with Redis, Kafka, and Logstash, and providing tutorials so that storm-crawler users can easily leverage these technologies (or their equivalents) on projects using storm-crawler