Introduction to Storm Crawler (https://github.com/DigitalPebble/storm-crawler), a collection of resources for building low-latency, large-scale web crawlers on Apache Storm, available under the Apache License.
Fast Feather talk given at ApacheCon EU 2014, Budapest
1. A quick introduction to Storm Crawler
Julien Nioche
julien@digitalpebble.com
@digitalpebble
ApacheCon EU 2014 - Budapest
2. About myself
DigitalPebble Ltd, Bristol (UK)
Specialised in Text Engineering
– Web Crawling
– Natural Language Processing
– Information Retrieval
– Machine Learning
Strong focus on Open Source & Apache ecosystem
PMC Chair Apache Nutch
User | Contributor | Committer
– Tika
– Solr, Lucene
– GATE, UIMA
– Mahout
– Behemoth
3. What is it?
Collection of resources (SDK) for building web crawlers on Apache Storm
https://github.com/DigitalPebble/storm-crawler
Artefacts available from Maven Central
Apache License v2
Scalable
Low latency
Easily extensible
4. What it is not
A ready-to-use, feature-complete, recursive web crawler
– Something like that might appear later as a separate project built on S/C
e.g. no PageRank or explicit ranking of pages
– Build your own
No fancy UI, dashboards, etc...
– Build your own
5. Comparison with Nutch
Nutch is batch-driven: little control over when URLs are fetched
– A potential issue for use cases that need sessions
– latency++
Fetching is only one of the steps in Nutch
– S/C: 'always be fetching' (Ken Krugler); better use of resources
S/C makes it even more flexible
– Typical case: a few custom classes (at least a Topology); the rest are just dependencies and standard S/C components
Not as ready-to-use as Nutch: it's an SDK
Would not have existed without it
– Borrowed code and concepts
6. Overview of resources
(Image: https://www.flickr.com/photos/dipster1/1403240351/)
7. FetcherBolt
Multi-threaded
Polite
– Puts incoming tuples into internal queues based on IP/domain/hostname
– Enforces a delay between requests from the same queue
– Respects robots.txt
Protocol-neutral
– Protocol implementations are pluggable
– HTTP implementation taken from Nutch
Output
– String URL
– byte[] content
– HashMap<String, String[]> metadata
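To make that output contract concrete, here is a minimal sketch of a downstream bolt consuming these three fields; the class itself and the field names ("url", "content", "metadata") are assumptions for illustration, not taken verbatim from the project:

    import java.util.HashMap;
    import java.util.Map;

    import backtype.storm.task.OutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseRichBolt;
    import backtype.storm.tuple.Tuple;

    // Hypothetical downstream bolt consuming the FetcherBolt output
    // described above; check declareOutputFields() of your version
    // for the actual field names.
    public class ContentSizeBolt extends BaseRichBolt {

        private OutputCollector collector;

        @Override
        public void prepare(Map conf, TopologyContext context,
                OutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void execute(Tuple tuple) {
            String url = tuple.getStringByField("url");
            byte[] content = tuple.getBinaryByField("content");
            @SuppressWarnings("unchecked")
            HashMap<String, String[]> metadata = (HashMap<String, String[]>) tuple
                    .getValueByField("metadata");
            // e.g. log the size of each fetched page
            System.out.println(url + " : " + content.length + " bytes");
            collector.ack(tuple);
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // terminal bolt in this sketch: nothing to declare
        }
    }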
8. ParserBolt
Based on Apache Tika
Supports most commonly used doc formats
– HTML, PDF, DOC etc...
Calls ParseFilters on document
– e.g. scrape info with XPathFilter
Calls URLFilters on outlinks
– e.g. normalize and/or blacklist URLs based on regexps (see the filter sketch below)
Output
– String URL
– byte[] content
– HashMap<String, String[]> metadata
– String text
– Set<String> outlinks
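As an illustration of the URLFilter idea mentioned above, a regex-based blacklist filter; the interface shown is an assumed Nutch-style contract (return the URL, possibly rewritten, or null to drop it), not necessarily the project's exact one:

    import java.util.regex.Pattern;

    // Assumed Nutch-style contract; the project's actual URLFilter
    // interface may differ.
    interface URLFilter {
        String filter(String url);
    }

    public class ImageBlacklistFilter implements URLFilter {

        private static final Pattern IMAGES =
                Pattern.compile("\\.(?:gif|jpe?g|png)$", Pattern.CASE_INSENSITIVE);

        @Override
        public String filter(String url) {
            // drop links to images, keep everything else unchanged
            return IMAGES.matcher(url).find() ? null : url;
        }
    }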
9. Other resources
ElasticSearchBolt
– Sends fields to Elasticsearch for indexing
– (deprecated by resources in elasticsearch-hadoop?)
URLPartitionerBolt
– Generates a key based on the hostname / domain / IP of the URL
– Output :
• String URL
• String key
• String metadata
– Useful for fieldsGrouping
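A sketch of how this fits together, with all URLs sharing a key routed to the same FetcherBolt task, which can then enforce per-host politeness; component names, parallelism and the FileSpout constructor are illustrative assumptions:

    import backtype.storm.topology.TopologyBuilder;
    import backtype.storm.tuple.Fields;
    // imports of the storm-crawler classes omitted: package names
    // vary across versions

    public class PartitionedWiring {

        public static TopologyBuilder build() {
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("spout", new FileSpout("seeds.txt"));
            builder.setBolt("partitioner", new URLPartitionerBolt())
                    .shuffleGrouping("spout");
            // fieldsGrouping on "key" sends all URLs with the same
            // host/domain/IP to the same one of the 4 fetcher tasks
            builder.setBolt("fetch", new FetcherBolt(), 4)
                    .fieldsGrouping("partitioner", new Fields("key"));
            return builder;
        }
    }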
10. Other resources
ConfigurableTopology
– Overrides the config with a local YAML file
– Simple switch for running in local mode
– Abstract class to be extended (see the sketch at the end of this slide)
Simple Spouts (for testing)
– FileSpout / RandomURLSpout
Various Metrics-related stuff
– Including a MetricsConsumer for https://www.librato.com/
FetchQueue package
– BlockingURLSpout and ShardedQueue abstraction
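Returning to ConfigurableTopology, a skeleton of a topology class built on it might look as follows; the start()/run()/submit() pattern and the conf field are assumptions based on the description above, so check the project for the exact signatures:

    import backtype.storm.topology.TopologyBuilder;
    // imports of the storm-crawler classes omitted: package names
    // vary across versions

    public class CrawlTopology extends ConfigurableTopology {

        public static void main(String[] args) throws Exception {
            ConfigurableTopology.start(new CrawlTopology(), args);
        }

        @Override
        protected int run(String[] args) {
            TopologyBuilder builder = new TopologyBuilder();
            // wire spout, partitioner, fetcher and parser here, as in
            // the wiring sketch shown earlier
            // submit() applies the config, including overrides from a
            // local YAML file, and can run in local mode instead of
            // submitting to a cluster
            return submit("crawl", conf, builder);
        }
    }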
11. Integrate it!
Write the Spout for your use case
– It will work fine with the existing resources as long as it emits (URL, metadata)
Typical scenario
– Group URLs to fetch into separate external queues based on host or domain (AWS SQS, Apache Kafka)
– Write a Spout for it and throttle with topology.max.spout.pending (sketch below)
– So that politeness can be enforced without tuples timing out → fail
– Parse and extract
– Send new URLs to queues
Can use various forms of persistence for URLs
– Elasticsearch, DynamoDB, HBase, etc...
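For the typical scenario above, a custom spout could look like this sketch; an in-memory queue stands in for SQS/Kafka, the (url, metadata) fields match what the existing bolts expect, and the class name and retry policy are assumptions:

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;

    import backtype.storm.spout.SpoutOutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseRichSpout;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Values;

    public class QueueSpout extends BaseRichSpout {

        private SpoutOutputCollector collector;
        // stand-in for an external queue such as SQS or Kafka
        private final Queue<String> queue = new ConcurrentLinkedQueue<String>();

        @Override
        public void open(Map conf, TopologyContext context,
                SpoutOutputCollector collector) {
            this.collector = collector;
            queue.add("http://example.com/"); // seed for the sketch
        }

        @Override
        public void nextTuple() {
            String url = queue.poll();
            if (url == null)
                return;
            // anchor the tuple on the URL so that fail() can requeue it
            collector.emit(new Values(url, new HashMap<String, String[]>()),
                    url);
        }

        @Override
        public void fail(Object msgId) {
            // naive retry policy: put the failed URL back on the queue
            queue.add((String) msgId);
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("url", "metadata"));
        }
    }

With topology.max.spout.pending set in the topology configuration, Storm caps the number of pending (un-acked) tuples per spout task, so politeness delays throttle the spout instead of producing tuple timeouts and fails.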
12. Some use cases (prototype stage)
Processing of streams of data (natural fit for Storm)
– http://www.weborama.com
Monitoring of finite set of URLs
– http://www.ontopic.io (more on them later)
– http://www.shopstyle.com : scraping + indexing
One-off non-recursive crawling
– http://www.stolencamerafinder.com/ : scraping + indexing
Recursive crawler
– WIP
13. What's next?
All-in-one crawler project built on S/C
– Also a good example of how to use SC
Additional Parse/URLFilters
More tests and documentation
A nice logo (this is an invitation)
A better name?