An introduction to Storm Crawler
Julien Nioche
julien@digitalpebble.com
@digitalpebble
Java Meetup – Bristol 10/03/2015
2 / 31
About myself
▪ DigitalPebble Ltd, Bristol (UK)
▪ Specialised in Text Engineering
– Web Crawling
– Natural Language Processing
– Information Retrieval
– Machine Learning
▪ Strong focus on Open Source & Apache ecosystem
▪ User | Contributor | Committer
– Nutch
– Tika
– SOLR, Lucene
– GATE, UIMA
– Behemoth
3 / 31
Storm-Crawler: what is it?
▪ Collection of resources (SDK) for building web crawlers on Apache Storm
▪ https://github.com/DigitalPebble/storm-crawler
▪ Artefacts available from Maven Central
▪ Apache License v2
▪ Active and growing fast
=> Scalable
=> Low latency
=> Easily extensible
=> Polite
4 / 31
What it is not
▪ A ready-to-use, feature-complete web crawler
– Will provide that as a separate project using S/C later
▪ No PageRank or explicit ranking of pages, for example
– Build your own
▪ No fancy UI, dashboards, etc.
– Build your own or integrate into your existing ones
▪ Comparison with Apache Nutch follows later in the talk
5 / 31
Apache Storm
▪ Distributed real-time computation system
▪ http://storm.apache.org/
▪ Scalable, fault-tolerant, polyglot and fun!
▪ Implemented in Clojure + Java
▪ “Realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, and more”
6 / 31
Storm Cluster Layout
http://cdn.oreillystatic.com/en/assets/1/event/85/Realtime%20Processing%20with%20Storm%20Presentation.pdf
7 / 31
Topology = spouts + bolts + tuples + streams
8 / 31
Basic Crawl Topology
[Diagram: Some Spout → URLPartitioner → Fetcher → Parser → Indexer, connected by the default stream]
(1) Some Spout: pull URLs from an external source
(2) URLPartitioner: group per host / domain / IP
(3) Fetcher: fetch URLs
(4) Parser: parse content
(5) Indexer: index text and metadata
9 / 31
Basic Topology: Tuple perspective
[Diagram: Some Spout →(1)→ URLPartitioner →(2)→ Fetcher →(3)→ Parser →(4)→ Indexer, connected by the default stream]
(1) <String url, Metadata m>
(2) <String url, String key, Metadata m>
(3) <String url, byte[] content, Metadata m>
(4) <String url, byte[] content, Metadata m, String text>
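The move from tuple (1) to tuple (2) can be sketched in plain Java. This is a minimal illustration, assuming the key is the lower-cased hostname; the class and method names are hypothetical and this is not the URLPartitioner code from SC, which can also key on domain or IP.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

/** Hypothetical stand-in for step (2): derives a partition key from a URL. */
final class PartitionKeySketch {

    /** Returns the lower-cased hostname, or the raw URL when no host can be extracted. */
    static String hostKey(String url) {
        try {
            String host = new URI(url).getHost();
            return host == null ? url : host.toLowerCase(Locale.ROOT);
        } catch (URISyntaxException e) {
            // Unparsable URL: fall back to the raw string so the tuple still gets a key.
            return url;
        }
    }

    public static void main(String[] args) {
        String url = "http://Storm.Apache.org/index.html";
        // Tuple (1) carries <url, metadata>; tuple (2) adds the key.
        System.out.println(url + " -> key " + hostKey(url));
    }
}
```

Keying on the hostname means every URL of a site ends up with the same key, which is what the later grouping step relies on.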
10 / 31
Basic scenario
▪ Spout: simple queue, e.g. RabbitMQ, AWS SQS, etc.
▪ Indexer: any form of storage, but often an index, e.g. SOLR, ElasticSearch
▪ Components in grey: default ones from SC
▪ Use case: non-recursive crawls (i.e. no links discovered), URLs known in advance. Failures don't matter too much.
▪ “Great! What about recursive crawls and / or failure handling?”
11 / 31
Frontier expansion
▪ Manual “discovery”
– Adding new URLs by hand, “seeding”
▪ Automatic discovery of new resources (frontier expansion)
– Not all outlinks are equally useful, so some control is needed
– Requires content parsing and link extraction
[Diagram: the frontier growing outwards from a seed over iterations i = 1, i = 2, i = 3]
[Slide courtesy of A. Bialecki]
12 / 31
Recursive Topology
[Diagram: Some Spout, URLPartitioner, Fetcher, Parser and Indexer on the default stream; Status Updater on the status stream; Switch; DB]
13 / 31
Status stream
▪ Used for handling errors and updating information about URLs
▪ Also carries newly discovered URLs from the parser
public enum Status {
    DISCOVERED, FETCHED, FETCH_ERROR, REDIRECTION, ERROR;
}
▪ StatusUpdater: writes to storage
▪ Can/should extend AbstractStatusUpdaterBolt
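As a rough illustration of what a status updater does with these values, the sketch below keeps statuses in an in-memory map. It is a sketch under stated assumptions, not AbstractStatusUpdaterBolt: the class name, the rule that DISCOVERED never overwrites a known URL, and the retry delays are all hypothetical choices made for the example.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory status store; a real updater would write to external storage. */
final class StatusStoreSketch {

    /** Same values as the Status enum shown on the slide. */
    enum Status { DISCOVERED, FETCHED, FETCH_ERROR, REDIRECTION, ERROR }

    private final Map<String, Status> store = new HashMap<>();

    /** Records a status; a DISCOVERED update never overwrites an already known URL (assumed rule). */
    void update(String url, Status status) {
        if (status == Status.DISCOVERED) {
            store.putIfAbsent(url, status);
        } else {
            store.put(url, status);
        }
    }

    Status get(String url) {
        return store.get(url);
    }

    /** Hypothetical scheduling policy: minutes until the next fetch, or -1 for "never again". */
    static long minutesUntilNextFetch(Status status) {
        switch (status) {
            case DISCOVERED:  return 0;        // fetch as soon as possible
            case FETCH_ERROR: return 120;      // temporary problem: retry later
            case FETCHED:
            case REDIRECTION: return 24 * 60;  // revisit after a day
            default:          return -1;       // ERROR: fatal, do not retry
        }
    }

    public static void main(String[] args) {
        StatusStoreSketch store = new StatusStoreSketch();
        store.update("http://example.com/", Status.FETCHED);
        store.update("http://example.com/", Status.DISCOVERED); // ignored, URL already known
        System.out.println(store.get("http://example.com/"));
    }
}
```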
14 / 31
Error vs failure in SC
▪ Failure: structural issue, e.g. dead node
– Storm's mechanism, based on time-outs
– Spout methods ack() / fail() receive the message ID
• can decide whether to replay the tuple or not
• message ID only => limited info about what went wrong
▪ Error: data issue
– Some temporary, e.g. target server temporarily unavailable
– Some fatal, e.g. document not parsable
– Handled via the Status stream
– StatusUpdater can decide what to do with them
• Change status? Change scheduling?
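The failure side can be sketched without Storm on the classpath. The class below mimics the shape of a spout's ack(Object) / fail(Object) callbacks and shows why the message ID has to be mapped back to a URL before a replay decision can be made. The class name and the replay limit are hypothetical, and using the URL itself as message ID is an assumption made for brevity.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

/** Hypothetical spout-like class: replays failed tuples up to a fixed number of times. */
final class ReplayingSpoutSketch {
    private final Queue<String> toEmit = new ArrayDeque<>();
    private final Map<Object, String> pending = new HashMap<>();   // message ID -> URL
    private final Map<String, Integer> failures = new HashMap<>(); // URL -> failure count
    private final int maxReplays;

    ReplayingSpoutSketch(int maxReplays) {
        this.maxReplays = maxReplays;
    }

    void add(String url) {
        toEmit.add(url);
    }

    /** Emits the next URL, using the URL itself as message ID; returns null when idle. */
    String nextTuple() {
        String url = toEmit.poll();
        if (url != null) {
            pending.put(url, url);
        }
        return url;
    }

    /** Called when the tuple tree completed: forget the URL and its failure history. */
    void ack(Object msgId) {
        String url = pending.remove(msgId);
        if (url != null) {
            failures.remove(url);
        }
    }

    /** Called on time-out or explicit failure: only the message ID is known here. */
    void fail(Object msgId) {
        String url = pending.remove(msgId);
        if (url == null) {
            return; // unknown ID, nothing to replay
        }
        int count = failures.merge(url, 1, Integer::sum);
        if (count <= maxReplays) {
            toEmit.add(url); // replay
        }
    }

    public static void main(String[] args) {
        ReplayingSpoutSketch spout = new ReplayingSpoutSketch(1);
        spout.add("http://example.com/");
        String id = spout.nextTuple();
        spout.fail(id);                        // first failure: replayed
        System.out.println(spout.nextTuple()); // the same URL again
    }
}
```

Data errors such as an unparsable document would not go through fail() at all in this model; they travel on the status stream with a Status value attached.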
15 / 31
Topology in code
Grouping! Politeness and data locality
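Why grouping gives politeness and data locality can be shown with a small sketch. In Storm, a bolt is typically subscribed with something like fieldsGrouping("partitioner", new Fields("key")), where the component name "partitioner" is a hypothetical example; tuples with the same key value then reach the same task. The plain-Java code below only imitates that effect with a hash modulo and is not Storm's actual routing code.

```java
/** Hypothetical illustration of key-based routing, in the spirit of a fields grouping. */
final class FieldsGroupingSketch {

    /** Picks a task index for a key; equal keys always map to the same task. */
    static int taskFor(String key, int numTasks) {
        return Math.floorMod(key.hashCode(), numTasks);
    }

    public static void main(String[] args) {
        int fetcherTasks = 4;
        String[] keys = {"example.com", "storm.apache.org", "example.com"};
        for (String key : keys) {
            // Both "example.com" tuples land on the same fetcher task, so a single
            // task owns the politeness delay for that host.
            System.out.println(key + " -> task " + taskFor(key, fetcherTasks));
        }
    }
}
```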
16 / 31
Launching the Storm Crawler
▪ Modify resources in src/main/resources
– URLFilters, ParseFilters, etc.
▪ Build uber-jar with Maven
– mvn clean package
▪ Set the configuration
– YAML file (e.g. crawler-conf.yaml)
▪ Call Storm
storm jar target/storm-crawler-core-0.5-SNAPSHOT-jar-with-dependencies.jar com.digitalpebble.storm.crawler.CrawlTopology -conf crawler-conf.yaml
17 / 31
18 / 31
Comparison with Nutch
19 / 31
Comparison with Nutch
▪ Nutch is batch-driven: little control over when URLs are fetched
– Potential issue for use cases that need sessions
– latency++
▪ Fetching is only one of the steps in Nutch
– SC: 'always be fetching' (Ken Krugler); better use of resources
▪ SC is even more flexible
– Typical case: a few custom classes (at least a Topology); the rest are just dependencies and standard S/C components
▪ Not as ready-to-use as Nutch: it's an SDK
▪ SC would not have existed without Nutch
– Borrowed code and concepts
20 / 31
Overview of resources
https://www.flickr.com/photos/dipster1/1403240351/
21 / 31
Resources
▪ Separated into core vs external
▪ Core
– URLPartitioner
– Fetcher
– Parser
– SitemapParser
– …
▪ External
– Metrics-related: incl. connector for Librato
– ElasticSearch: Indexer, Spout, StatusUpdater, connector for metrics
▪ Also user-maintained external resources
22 / 31
FetcherBolt
▪ Multi-threaded
▪ Polite
– Puts incoming tuples into internal queues based on IP / domain / hostname
– Sets a delay between requests from the same queue
– Respects robots.txt
▪ Protocol-neutral
– Protocol implementations are pluggable
– HTTP implementation taken from Nutch
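The queueing idea behind the politeness bullets can be sketched as follows. This is a single-threaded simplification with hypothetical names, not the FetcherBolt implementation: it keeps one queue per key and refuses to hand out a URL from a queue until the configured delay has elapsed since that queue was last served. The clock is passed in explicitly to keep the example deterministic.

```java
import java.util.ArrayDeque;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Queue;

/** Hypothetical per-key fetch queues with a minimum delay between requests to the same key. */
final class PoliteQueuesSketch {

    private static final class KeyQueue {
        final Queue<String> urls = new ArrayDeque<>();
        long nextAllowedMs = 0; // earliest time the next URL of this queue may be fetched
    }

    private final Map<String, KeyQueue> queues = new LinkedHashMap<>();
    private final long delayMs;

    PoliteQueuesSketch(long delayMs) {
        this.delayMs = delayMs;
    }

    /** Adds a URL to the queue for its key (hostname, domain or IP). */
    void add(String key, String url) {
        queues.computeIfAbsent(key, k -> new KeyQueue()).urls.add(url);
    }

    /** Returns a URL from the first queue that may be served at nowMs, or null if none is ready. */
    String poll(long nowMs) {
        for (KeyQueue queue : queues.values()) {
            if (!queue.urls.isEmpty() && nowMs >= queue.nextAllowedMs) {
                queue.nextAllowedMs = nowMs + delayMs;
                return queue.urls.poll();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        PoliteQueuesSketch queues = new PoliteQueuesSketch(1000);
        queues.add("a.com", "http://a.com/1");
        queues.add("a.com", "http://a.com/2");
        queues.add("b.com", "http://b.com/1");
        System.out.println(queues.poll(0));    // a.com is served first
        System.out.println(queues.poll(0));    // a.com must wait, so b.com is served
        System.out.println(queues.poll(1000)); // delay elapsed, a.com is served again
    }
}
```

A multi-threaded fetcher would need synchronisation around these queues, and robots.txt handling is left out entirely.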
23 / 31
ParserBolt
▪ Based on Apache Tika
▪ Supports most commonly used document formats
– HTML, PDF, DOC, etc.
▪ Calls ParseFilters on the document
– e.g. scrape info with XPathFilter
▪ Calls URLFilters on outlinks
– e.g. normalize and / or blacklist URLs based on RegExps
Other resources
▪ URLPartitionerBolt
– Generates a key based on the hostname / domain / IP of the URL
– Output:
• String URL
• String key
• String metadata
– Useful for fieldsGrouping
▪ URLFilters
– Control crawl expansion
– Delete or rewrite URLs
– Configured in a JSON file
– Interface
• String filter(URL sourceUrl, Metadata sourceMetadata, String urlToFilter)
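A filter following the signature above might look like the sketch below. To stay self-contained it declares its own interface and swaps SC's Metadata class for a plain Map, so both types are stand-ins rather than the library's. It also assumes that returning null means "delete this URL"; the regular expression is only an example.

```java
import java.net.URL;
import java.util.Map;
import java.util.regex.Pattern;

/** Stand-in mirroring the filter signature on the slide; Metadata is replaced by a Map here. */
interface UrlFilterSketch {
    String filter(URL sourceUrl, Map<String, String> sourceMetadata, String urlToFilter);
}

/** Hypothetical filter: strips fragments (rewrite) and drops static assets (delete). */
final class RegexUrlFilterSketch implements UrlFilterSketch {

    private static final Pattern BLOCKED =
            Pattern.compile("(?i).*\\.(jpg|png|gif|css|js)$");

    @Override
    public String filter(URL sourceUrl, Map<String, String> sourceMetadata, String urlToFilter) {
        // Rewrite: remove the fragment, which never changes the fetched resource.
        int hash = urlToFilter.indexOf('#');
        String url = hash >= 0 ? urlToFilter.substring(0, hash) : urlToFilter;
        // Delete: null is assumed to signal that the URL should be discarded.
        return BLOCKED.matcher(url).matches() ? null : url;
    }

    public static void main(String[] args) {
        RegexUrlFilterSketch filter = new RegexUrlFilterSketch();
        System.out.println(filter.filter(null, null, "http://example.com/page#top"));
        System.out.println(filter.filter(null, null, "http://example.com/logo.PNG"));
    }
}
```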
25 / 31
Other resources
▪ ParseFilters
– Called from ParserBolt
– Extract information from the document
– Configured in a JSON file
– Interface
• filter(String URL, byte[] content, DocumentFragment doc, Metadata metadata)
– com.digitalpebble.storm.crawler.parse.filter.XpathFilter.java
• XPath expressions
• Info extracted is stored in metadata
• Used later for indexing
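The XPath-based extraction can be approximated with the JDK's own XML APIs. The sketch below is not the XpathFilter class: it parses a well-formed XHTML string instead of receiving a DocumentFragment from the parser, uses a plain Map in place of Metadata, and all names in it are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

/** Hypothetical XPath extraction: maps metadata keys to XPath expressions. */
final class XPathExtractSketch {

    /** Evaluates each expression against the document and keeps the non-empty results. */
    static Map<String, String> extract(String xhtml, Map<String, String> expressions) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            Map<String, String> metadata = new HashMap<>();
            for (Map.Entry<String, String> entry : expressions.entrySet()) {
                String value = xpath.evaluate(entry.getValue(), doc).trim();
                if (!value.isEmpty()) {
                    metadata.put(entry.getKey(), value); // stored for later indexing
                }
            }
            return metadata;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse or evaluate", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><head><title>Storm Crawler</title></head>"
                + "<body><h1>Hello</h1></body></html>";
        Map<String, String> expressions = new HashMap<>();
        expressions.put("title", "/html/head/title");
        expressions.put("heading", "//h1");
        System.out.println(extract(page, expressions));
    }
}
```

Real-world HTML is rarely well-formed XML, which is one reason the document is parsed upstream before filters see it.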
26 / 31
Integrate it!
▪ Write the Spout for your use case
– Will work fine with existing resources as long as it emits <URL, Metadata> tuples
▪ Typical scenario
– Group URLs to fetch into separate external queues based on host or domain (AWS SQS, Apache Kafka)
– Write a Spout for it and throttle with topology.max.spout.pending
– This enforces politeness without tuples timing out (→ fail)
– Parse and extract
– Send new URLs to the queues
▪ Can use various forms of persistence for URLs
– ElasticSearch, DynamoDB, HBase, etc.
27 / 31
Some use cases (prototype stage)
▪ Processing of streams of URLs (natural fit for Storm)
– http://www.weborama.com
▪ Monitoring of a finite set of URLs
– http://www.ontopic.io (more on them later)
– http://www.shopstyle.com: scraping + indexing
▪ One-off non-recursive crawling
– http://www.stolencamerafinder.com/: scraping + indexing
▪ Recursive crawler
– http://www.shopstyle.com: discovery of product pages
28 / 31
What's next?
▪ 0.5 released soon?
▪ All-in-one crawler based on ElasticSearch
– Also a good example of how to use SC
– Separate project
▪ Additional Parse/URLFilters
▪ Facilitate writing IndexingBolts
▪ More tests and documentation
▪ A nice logo or better name?
29 / 31
Resources
▪ https://storm.apache.org/
▪ “Storm Applied” http://www.manning.com/sallen/
▪ https://github.com/DigitalPebble/storm-crawler
▪ http://nutch.apache.org/
30 / 31
Questions?
31 / 31
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service Consultant
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
Test Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendTest Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and Backend
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about us
 
Active Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdfActive Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdf
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.js
 
Exploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the ProcessExploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the Process
 

An introduction to Storm Crawler

  • 1. An introduction to Storm Crawler
      Julien Nioche, julien@digitalpebble.com, @digitalpebble
      Java Meetup – Bristol, 10/03/2015
  • 2. About myself
      * DigitalPebble Ltd, Bristol (UK)
      * Specialised in Text Engineering
          – Web Crawling
          – Natural Language Processing
          – Information Retrieval
          – Machine Learning
      * Strong focus on Open Source & Apache ecosystem
      * User | Contributor | Committer
          – Nutch
          – Tika
          – SOLR, Lucene
          – GATE, UIMA
          – Behemoth
  • 3. Storm-Crawler: what is it?
      * Collection of resources (SDK) for building web crawlers on Apache Storm
      * https://github.com/DigitalPebble/storm-crawler
      * Artefacts available from Maven Central
      * Apache License v2
      * Active and growing fast
      => Scalable
      => Low latency
      => Easily extensible
      => Polite
  • 4. What it is not
      * A ready-to-use, feature-complete web crawler
          – Will provide that as a separate project using S/C later
      * e.g. no PageRank or explicit ranking of pages
          – Build your own
      * No fancy UI, dashboards, etc.
          – Build your own or integrate in your existing ones
      * Compare with Apache Nutch later
  • 5. Apache Storm
      * Distributed real time computation system
      * http://storm.apache.org/
      * Scalable, fault-tolerant, polyglot and fun!
      * Implemented in Clojure + Java
      * “Realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, and more”
  • 6. Storm Cluster Layout
      http://cdn.oreillystatic.com/en/assets/1/event/85/Realtime%20Processing%20with%20Storm%20Presentation.pdf
  • 7. Topology = spouts + bolts + tuples + streams
  • 8. Basic Crawl Topology
      Some Spout → URLPartitioner → Fetcher → Parser → Indexer (default stream)
      (1) Some Spout: pull URLs from an external source
      (2) URLPartitioner: group per host / domain / IP
      (3) Fetcher: fetch URLs
      (4) Parser: parse content
      (5) Indexer: index text and metadata
  • 9. Basic Topology: tuple perspective (default stream)
      (1) Some Spout → URLPartitioner: <String url, Metadata m>
      (2) URLPartitioner → Fetcher: <String url, String key, Metadata m>
      (3) Fetcher → Parser: <String url, byte[] content, Metadata m>
      (4) Parser → Indexer: <String url, byte[] content, Metadata m, String text>
  • 10. Basic scenario
      * Spout: simple queue, e.g. RabbitMQ, AWS SQS, etc.
      * Indexer: any form of storage, but often an index, e.g. SOLR, ElasticSearch
      * Components in grey: default ones from SC
      * Use case: non-recursive crawls (i.e. no links discovered), URLs known in advance. Failures don't matter too much.
      * “Great! What about recursive crawls and / or failure handling?”
  • 11. Frontier expansion
      * Manual “discovery”
          – Adding new URLs by hand, “seeding”
      * Automatic discovery of new resources (frontier expansion)
          – Not all outlinks are equally useful: control
          – Requires content parsing and link extraction
      [Figure: frontier growing from the seed over iterations i = 1, i = 2, i = 3. Slide courtesy of A. Bialecki]
  • 12. Recursive Topology
      [Diagram] Components: Some Spout, URLPartitioner, Fetcher, Parser, Indexer, Status Updater, Switch, DB
      Streams: default stream, status stream
  • 13. Status stream
      * Used for handling errors and updating info about URLs
      * Newly discovered URLs from parser
          public enum Status { DISCOVERED, FETCHED, FETCH_ERROR, REDIRECTION, ERROR; }
      * StatusUpdater: writes to storage
      * Can/should extend AbstractStatusUpdaterBolt
  • 14. Error vs failure in SC
      * Failure: structural issue, e.g. dead node
          – Storm's mechanism, based on time-out
          – Spout methods ack() / fail() get the message ID
              • can decide whether to replay the tuple or not
              • message ID only => limited info about what went wrong
      * Error: data issue
          – Some temporary, e.g. target server temporarily unavailable
          – Some fatal, e.g. document not parsable
          – Handled via Status stream
          – StatusUpdater can decide what to do with them
              • Change status? Change scheduling?
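To make the "change status? change scheduling?" question concrete, here is a minimal sketch of how a status updater might map a status to a refetch delay. The Status values are the ones listed on the Status stream slide; the class RefetchPolicySketch, its method and all delay values are hypothetical and are not part of Storm Crawler.

```java
import java.time.Duration;
import java.util.Optional;

// Status values as listed on the "Status stream" slide.
enum Status { DISCOVERED, FETCHED, FETCH_ERROR, REDIRECTION, ERROR }

// Hypothetical helper, not a Storm Crawler class: maps a status to the delay
// before the URL should be fetched again. An empty result means "never refetch".
final class RefetchPolicySketch {

    static Optional<Duration> nextFetchDelay(Status status) {
        switch (status) {
            case DISCOVERED:
                return Optional.of(Duration.ZERO);        // new URL: fetch as soon as possible
            case FETCHED:
                return Optional.of(Duration.ofDays(1));   // assumed revisit interval
            case REDIRECTION:
                return Optional.of(Duration.ofDays(1));   // assumed: treated like a fetched page
            case FETCH_ERROR:
                return Optional.of(Duration.ofHours(2));  // temporary problem: retry later
            case ERROR:
                return Optional.empty();                  // fatal problem: do not reschedule
            default:
                throw new IllegalArgumentException("Unknown status: " + status);
        }
    }
}
```

A real updater would also persist the status and the computed date to whatever storage backs the crawl.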
  • 15. Topology in code
      * Grouping! Politeness and data locality
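Why grouping matters for politeness can be shown with a small sketch. When tuples are grouped on the partition key, every URL with the same key reaches the same fetcher task, so that task alone controls the request rate for that host. The code below is a simplified stand-in for the idea, not Storm's actual implementation; the class and method names are hypothetical.

```java
// Simplified stand-in for grouping on a key field, not Storm's own code:
// the same key always maps to the same task index.
final class FieldsGroupingSketch {

    // Returns the index (0 .. numTasks-1) of the task that receives tuples with this key.
    static int taskFor(String key, int numTasks) {
        if (numTasks <= 0) {
            throw new IllegalArgumentException("numTasks must be positive");
        }
        // floorMod keeps the result non-negative even when hashCode() is negative.
        return Math.floorMod(key.hashCode(), numTasks);
    }
}
```

With four fetcher tasks, `taskFor("example.com", 4)` returns the same index on every call, which is what gives each host a single owner.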
  • 16. Launching the Storm Crawler
      * Modify resources in src/main/resources
          – URLFilters, ParseFilters, etc.
      * Build uber-jar with Maven
          – mvn clean package
      * Set the configuration
          – YAML file (e.g. crawler-conf.yaml)
      * Call Storm
          storm jar target/storm-crawler-core-0.5-SNAPSHOT-jar-with-dependencies.jar com.digitalpebble.storm.crawler.CrawlTopology -conf crawler-conf.yaml
  • 18. Comparison with Nutch
  • 19. Comparison with Nutch
      * Nutch is batch driven: little control over when URLs are fetched
          – Potential issue for use cases where sessions are needed
          – latency++
      * Fetching is only one of the steps in Nutch
          – SC: 'always be fetching' (Ken Krugler); better use of resources
      * Make it even more flexible
          – Typical case: a few custom classes (at least a Topology); the rest are just dependencies and standard S/C components
      * Not as ready-to-use as Nutch: it's an SDK
      * Would not have existed without it
          – Borrowed code and concepts
  • 20. Overview of resources
      https://www.flickr.com/photos/dipster1/1403240351/
  • 21. Resources
      * Separated in core vs external
      * Core
          – URLPartitioner
          – Fetcher
          – Parser
          – SitemapParser
          – …
      * External
          – Metrics-related: inc. connector for Librato
          – ElasticSearch: Indexer, Spout, StatusUpdater, connector for metrics
      * Also user-maintained external resources
  • 22. FetcherBolt
      * Multi-threaded
      * Polite
          – Puts incoming tuples into internal queues based on IP/domain/hostname
          – Sets delay between requests from same queue
          – Respects robots.txt
      * Protocol-neutral
          – Protocol implementations are pluggable
          – HTTP implementation taken from Nutch
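The "delay between requests from same queue" rule can be sketched in a few lines. This is only an illustration of the idea, assuming a fixed minimum delay per queue key; it does not model the fetcher threads, the internal queues themselves or robots.txt, and PolitenessSketch is a hypothetical name, not the FetcherBolt code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of per-queue politeness: each queue key (host, domain or IP)
// may be fetched from at most once every minDelayMillis.
final class PolitenessSketch {

    private final long minDelayMillis;
    // Earliest time (in ms) at which each queue may be used again.
    private final Map<String, Long> nextAllowed = new HashMap<>();

    PolitenessSketch(long minDelayMillis) {
        this.minDelayMillis = minDelayMillis;
    }

    // Returns true and reserves the slot if a request for queueKey may be sent at nowMillis;
    // returns false if the previous request to the same queue was too recent.
    synchronized boolean tryAcquire(String queueKey, long nowMillis) {
        long earliest = nextAllowed.getOrDefault(queueKey, Long.MIN_VALUE);
        if (nowMillis < earliest) {
            return false;
        }
        nextAllowed.put(queueKey, nowMillis + minDelayMillis);
        return true;
    }
}
```

Passing the clock value in as a parameter keeps the sketch deterministic; a fetcher thread would pass `System.currentTimeMillis()` and move on to another queue when the call returns false.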
  • 23. ParserBolt
      * Based on Apache Tika
      * Supports most commonly used doc formats
          – HTML, PDF, DOC, etc.
      * Calls ParseFilters on document
          – e.g. scrape info with XPathFilter
      * Calls URLFilters on outlinks
          – e.g. normalize and/or blacklist URLs based on RegExps
  • 24. Other resources
      * URLPartitionerBolt
          – Generates a key based on the hostname / domain / IP of URL
          – Output:
              • String URL
              • String key
              • String metadata
          – Useful for fieldGrouping
      * URLFilters
          – Control crawl expansion
          – Delete or rewrite URLs
          – Configured in JSON file
          – Interface
              • String filter(URL sourceUrl, Metadata sourceMetadata, String urlToFilter)
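Both ideas on this slide can be illustrated with JDK classes only. In the sketch below, hostKey derives a partition key from the hostname, and filter follows the shape of the interface above, with a plain Map standing in for Metadata. Returning null to delete a URL, the blacklist pattern and the fragment-stripping rule are assumptions made for the example; UrlToolsSketch is a hypothetical class, not Storm Crawler code.

```java
import java.net.URI;
import java.net.URL;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical stand-ins for the partitioner key and a URL filter.
final class UrlToolsSketch {

    // Assumed blacklist: common static resources, matched at the end of the path.
    private static final Pattern BLACKLIST =
            Pattern.compile("\\.(?:jpe?g|png|gif|css|js)(?:[?#]|$)", Pattern.CASE_INSENSITIVE);

    // Partition key based on the hostname option: the lower-cased host of the URL.
    static String hostKey(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            throw new IllegalArgumentException("URL has no host: " + url);
        }
        return host.toLowerCase(Locale.ROOT);
    }

    // Same shape as the slide's filter(...); a Map stands in for Metadata.
    // Returns null to delete the URL (assumed convention), otherwise the rewritten URL.
    static String filter(URL sourceUrl, Map<String, String> sourceMetadata, String urlToFilter) {
        if (BLACKLIST.matcher(urlToFilter).find()) {
            return null;
        }
        // Rewrite: drop the fragment so that page#a and page#b become the same URL.
        int hash = urlToFilter.indexOf('#');
        return hash >= 0 ? urlToFilter.substring(0, hash) : urlToFilter;
    }
}
```

The key produced by hostKey is the kind of value the fetcher would be grouped on, so all URLs of one host end up in the same place.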
  • 25. Other resources
      * ParseFilters
          – Called from ParserBolt
          – Extracts information from document
          – Configured in JSON file
          – Interface
              • filter(String URL, byte[] content, DocumentFragment doc, Metadata metadata)
          – com.digitalpebble.storm.crawler.parse.filter.XpathFilter.java
              • XPath expressions
              • Info extracted stored in metadata
              • Used later for indexing
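The XPath-based extraction described above can be sketched with the JDK's own XML and XPath APIs. This is an illustration under stated assumptions: the input must be well-formed XML (real-world HTML often is not, which is why the ParserBolt relies on Tika), a plain Map stands in for Metadata, and XPathFilterSketch is a hypothetical class, not the XpathFilter shipped with Storm Crawler.

```java
import java.io.ByteArrayInputStream;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

// Hypothetical sketch of an XPath-driven parse filter.
final class XPathFilterSketch {

    // expressions: metadata key -> XPath expression.
    // Returns the extracted values keyed by metadata key; empty matches are skipped.
    static Map<String, String> extract(byte[] content, Map<String, String> expressions) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(content));
            XPath xpath = XPathFactory.newInstance().newXPath();
            Map<String, String> metadata = new HashMap<>();
            for (Map.Entry<String, String> e : expressions.entrySet()) {
                // evaluate(...) returns the string value of the first matching node, or "".
                String value = xpath.evaluate(e.getValue(), doc).trim();
                if (!value.isEmpty()) {
                    metadata.put(e.getKey(), value);
                }
            }
            return metadata;
        } catch (Exception ex) {
            throw new IllegalArgumentException("Could not parse content or evaluate XPath", ex);
        }
    }
}
```

The returned map plays the role of the metadata that is later handed to the indexer.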
  • 26. Integrate it!
      * Write the Spout for your use case
          – Will work fine with existing resources as long as it generates URL, Metadata
      * Typical scenario
          – Group URLs to fetch into separate external queues based on host or domain (AWS SQS, Apache Kafka)
          – Write a Spout for it and throttle with topology.max.spout.pending
          – So that politeness can be enforced without tuples timing out (→ fail)
          – Parse and extract
          – Send new URLs to queues
      * Can use various forms of persistence for URLs
          – ElasticSearch, DynamoDB, HBase, etc.
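The effect of throttling with topology.max.spout.pending can be sketched as follows: no new URL is emitted while the number of un-acked tuples has reached the limit, so slow, polite fetching does not pile up tuples that would eventually time out. In Storm the limit is enforced by the framework rather than by the spout; the class below folds both roles into one hypothetical type purely for illustration, and replaying on fail() is an assumed policy.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Hypothetical model of a spout whose in-flight tuples are capped.
final class PendingLimitSketch {

    private final int maxPending;
    private final Queue<String> source = new ArrayDeque<>();
    private final Set<String> pending = new HashSet<>();

    PendingLimitSketch(int maxPending) {
        this.maxPending = maxPending;
    }

    // Stands in for the external queue feeding the spout.
    void add(String url) {
        source.add(url);
    }

    // Returns the next URL to emit, or null if the limit is reached or the source is empty.
    String nextTuple() {
        if (pending.size() >= maxPending) {
            return null;
        }
        String url = source.poll();
        if (url != null) {
            pending.add(url);
        }
        return url;
    }

    // The tuple was fully processed: free its slot.
    void ack(String url) {
        pending.remove(url);
    }

    // The tuple failed or timed out: free its slot and queue the URL for replay.
    void fail(String url) {
        if (pending.remove(url)) {
            source.add(url);
        }
    }
}
```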
  • 27. Some use cases (prototype stage)
      * Processing of streams of URLs (natural fit for Storm)
          – http://www.weborama.com
      * Monitoring of finite set of URLs
          – http://www.ontopic.io (more on them later)
          – http://www.shopstyle.com : scraping + indexing
      * One-off non-recursive crawling
          – http://www.stolencamerafinder.com/ : scraping + indexing
      * Recursive crawler
          – http://www.shopstyle.com : discovery of product pages
  • 28. What's next?
      * 0.5 released soon?
      * All-in-one crawler based on ElasticSearch
          – Also a good example of how to use SC
          – Separate project
      * Additional Parse/URLFilters
      * Facilitate writing IndexingBolts
      * More tests and documentation
      * A nice logo or better name?
  • 29. Resources
      * https://storm.apache.org/
      * “Storm Applied” http://www.manning.com/sallen/
      * https://github.com/DigitalPebble/storm-crawler
      * http://nutch.apache.org/