Storm-Crawler is a collection of resources for building low-latency, large-scale web crawlers on Apache Storm. We will compare it with similar projects such as Apache Nutch and present several use cases where Storm-Crawler is being used. In particular, we will see how Storm-Crawler can be used with ElasticSearch and Kibana for crawling and indexing web pages.
1. An introduction to Storm Crawler
Julien Nioche
julien@digitalpebble.com
@digitalpebble
Java Meetup – Bristol 10/03/2015
2. About myself
DigitalPebble Ltd, Bristol (UK)
Specialised in Text Engineering
– Web Crawling
– Natural Language Processing
– Information Retrieval
– Machine Learning
Strong focus on Open Source & Apache ecosystem
User | Contributor | Committer
– Nutch
– Tika
– SOLR, Lucene
– GATE, UIMA
– Behemoth
3. Storm-Crawler : what is it?
Collection of resources (SDK) for building web crawlers on
Apache Storm
https://github.com/DigitalPebble/storm-crawler
Artefacts available from Maven Central
Apache License v2
Active and growing fast
=> Scalable
=> Low latency
=> Easily extensible
=> Polite
4. What it is not
A ready-to-use, feature-complete web crawler
– Will provide that as a separate project using S/C later
e.g. No PageRank or explicit ranking of pages
– Build your own
No fancy UI, dashboards, etc...
– Build your own or integrate in your existing ones
Comparison with Apache Nutch later
5. Apache Storm
Distributed real-time computation system
http://storm.apache.org/
Scalable, fault-tolerant, polyglot and fun!
Implemented in Clojure + Java
“Realtime analytics, online machine learning, continuous
computation, distributed RPC, ETL, and more”
8. Basic Crawl Topology
[Diagram: Some Spout → URLPartitioner → Fetcher → Parser → Indexer, connected via the default stream]
(1) pull URLs from an external source (Some Spout)
(2) group per host / domain / IP (URLPartitioner)
(3) fetch URLs (Fetcher)
(4) parse content (Parser)
(5) index text and metadata (Indexer)
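As a rough sketch, wiring this topology with the Storm API of the time might look like the following. This is a minimal sketch, assuming the storm-crawler bolt class names from the slides; MyQueueSpout and MyIndexerBolt are hypothetical placeholders for your own spout and indexing bolt.

// Sketch of the basic, non-recursive topology described above (Storm 0.9-era API).
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;

public class BasicCrawlTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("spout", new MyQueueSpout());            // (1) pull URLs
        builder.setBolt("partitioner", new URLPartitionerBolt())  // (2) group per host / domain / IP
               .shuffleGrouping("spout");
        builder.setBolt("fetch", new FetcherBolt())               // (3) fetch URLs
               .fieldsGrouping("partitioner", new Fields("key")); // same key => same fetcher task
        builder.setBolt("parse", new ParserBolt())                // (4) parse content
               .localOrShuffleGrouping("fetch");
        builder.setBolt("index", new MyIndexerBolt())             // (5) index text and metadata
               .localOrShuffleGrouping("parse");
        new LocalCluster().submitTopology("crawl", new Config(), builder.createTopology());
    }
}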
10. Basic scenario
Spout : simple queue e.g. RabbitMQ, AWS SQS, etc...
Indexer : any form of storage, but often an index e.g. SOLR, ElasticSearch
Components in grey on the diagram : default ones from SC
Use case : non-recursive crawls (i.e. no links discovered), URLs known in advance. Failures don't matter too much.
“Great! What about recursive crawls and / or failure handling?”
11. Frontier expansion
Manual “discovery”
– Adding new URLs by hand, “seeding”
Automatic discovery of new resources (frontier expansion)
– Not all outlinks are equally useful - control
– Requires content parsing and link extraction
[Diagram: crawl frontier expanding outwards from a seed over iterations i = 1, 2, 3]
[Slide courtesy of A. Bialecki]
12. Recursive Topology
[Diagram: the basic topology (Some Spout → URLPartitioner → Fetcher → Parser → Indexer on the default stream) extended with a Status stream feeding a Status Updater, plus a Switch and a DB for persisting URL status]
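Wiring the extra stream could look roughly like this, reusing the builder from the earlier sketch and assuming the status stream is literally named "status"; MyStatusUpdaterBolt is a placeholder, sketched on the next slide.

// Subscribe a status updater bolt to the "status" stream emitted by the
// fetcher and the parser, in addition to the default-stream wiring above.
builder.setBolt("status", new MyStatusUpdaterBolt())
       .localOrShuffleGrouping("fetch", "status")
       .localOrShuffleGrouping("parse", "status");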
13. Status stream
Used for handling errors and updating info about URLs
Newly discovered URLs from parser
public enum Status {
    DISCOVERED, FETCHED, FETCH_ERROR, REDIRECTION, ERROR;
}
StatusUpdater : writes to storage
Can/should extend AbstractStatusUpdaterBolt
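A minimal sketch of such an updater, assuming AbstractStatusUpdaterBolt leaves a single store(...) method to implement (check the project for the exact signature and package names):

import com.digitalpebble.storm.crawler.Metadata;
import com.digitalpebble.storm.crawler.persistence.AbstractStatusUpdaterBolt;
import com.digitalpebble.storm.crawler.persistence.Status;

// Hypothetical updater that merely logs status changes; a real one would
// upsert the URL and its status into shared storage (ElasticSearch, HBase...)
public class MyStatusUpdaterBolt extends AbstractStatusUpdaterBolt {
    @Override
    public void store(String url, Status status, Metadata metadata) {
        System.out.println(status + "\t" + url);
    }
}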
14. Error vs failure in SC
Failure : structural issue e.g. dead node
– Storm's mechanism is based on time-outs
– Spout methods ack() / fail() receive the message ID
• can decide whether to replay the tuple or not
• message ID only => limited info about what went wrong
Error : data issue
– Some are temporary, e.g. target server temporarily unavailable
– Some are fatal, e.g. document not parsable
– Handled via the Status stream
– StatusUpdater can decide what to do with them
• Change status? Change scheduling?
19. Comparison with Nutch
Nutch is batch-driven : little control over when URLs are fetched
– Potential issue for use cases where sessions are needed
– latency++
Fetching is only one of the steps in Nutch
– SC : 'always be fetching' (Ken Krugler); better use of resources
Even more flexible
– Typical case : a few custom classes (at least a Topology), the rest are just dependencies and standard S/C components
Not as ready-to-use as Nutch : it's an SDK
Would not have existed without it
– Borrowed code and concepts
20. Overview of resources
https://www.flickr.com/photos/dipster1/1403240351/
21. Resources
Separated into core vs external
Core
– URLPartitioner
– Fetcher
– Parser
– SitemapParser
– …
External
– Metrics-related : inc. connector for Librato
– ElasticSearch : Indexer, Spout, StatusUpdater, connector for metrics
Also user-maintained external resources
22. FetcherBolt
Multi-threaded
Polite
– Puts incoming tuples into internal queues based on IP/domain/hostname
– Sets delay between requests from same queue
– Respects robots.txt
Protocol-neutral
– Protocol implementations are pluggable
– HTTP implementation taken from Nutch
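Politeness and the thread pool are driven by configuration. A sketch, with key names that are assumptions and should be checked against the project's default configuration file:

import backtype.storm.Config;

// Tuning the FetcherBolt via the topology configuration.
// The key names below are assumptions; verify them in the default config.
Config conf = new Config();
conf.put("fetcher.threads.number", 50);   // size of the fetching thread pool
conf.put("fetcher.server.delay", 1.0);    // seconds between requests to the same queue
conf.put("http.agent.name", "mycrawler"); // user agent, also checked against robots.txt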
23. ParserBolt
Based on Apache Tika
Supports most commonly used doc formats
– HTML, PDF, DOC etc...
Calls ParseFilters on document
– e.g. scrape info with XPathFilter
Calls URLFilters on outlinks
– e.g. normalize and / or blacklist URLs based on regexps
24. Other resources
URLPartitionerBolt
– Generates a key based on the hostname / domain / IP of the URL
– Output :
• String URL
• String key
• String metadata
– Useful for fieldsGrouping
URLFilters
– Control crawl expansion
– Delete or rewrite URLs
– Configured in JSON file
– Interface
• String filter(URL sourceUrl, Metadata sourceMetadata,String urlToFilter)
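A sketch of a custom URLFilter against that interface (class and package names are assumptions; this hypothetical filter keeps the crawl on the host of the source URL):

import java.net.MalformedURLException;
import java.net.URL;
import com.digitalpebble.storm.crawler.Metadata;
import com.digitalpebble.storm.crawler.filtering.URLFilter;

// Returning null removes a URL; returning a String keeps
// (a possibly rewritten version of) it.
public class SameHostURLFilter implements URLFilter {
    @Override
    public String filter(URL sourceUrl, Metadata sourceMetadata, String urlToFilter) {
        try {
            URL target = new URL(urlToFilter);
            return target.getHost().equalsIgnoreCase(sourceUrl.getHost()) ? urlToFilter : null;
        } catch (MalformedURLException e) {
            return null; // drop malformed URLs
        }
    }
}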
25. Other resources
ParseFilters
– Called from ParserBolt
– Extracts information from document
– Configured in JSON file
– Interface
• filter(String URL, byte[] content, DocumentFragment doc, Metadata metadata)
– com.digitalpebble.storm.crawler.parse.filter.XpathFilter.java
• XPath expressions
• Info extracted stored in metadata
• Used later for indexing
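A sketch of a custom ParseFilter against that interface (package names and the setValue(...) call on Metadata are assumptions, and the real interface may declare extra methods); this hypothetical filter stores the content length in the metadata for later indexing:

import org.w3c.dom.DocumentFragment;
import com.digitalpebble.storm.crawler.Metadata;
import com.digitalpebble.storm.crawler.parse.ParseFilter;

// Records the raw content length in the metadata, where an indexing
// bolt can later pick it up.
public class ContentLengthFilter implements ParseFilter {
    @Override
    public void filter(String url, byte[] content, DocumentFragment doc, Metadata metadata) {
        metadata.setValue("content.length", String.valueOf(content.length));
    }
}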
26. Integrate it!
Write the Spout for your use case (see the sketch below)
– Works fine with the existing resources as long as it emits URL, Metadata
Typical scenario
– Group URLs to fetch into separate external queues based on host or domain (AWS SQS, Apache Kafka)
– Write a Spout for it and throttle with topology.max.spout.pending
– So that politeness can be enforced without tuples timing out → fail
– Parse and extract
– Send new URLs to the queues
Can use various forms of persistence for URLs
– ElasticSearch, DynamoDB, HBase, etc...
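A minimal sketch of such a spout, assuming downstream components expect (url, metadata) tuples; the in-memory queue is a stand-in for SQS / Kafka / RabbitMQ, and the Metadata import's package name is assumed:

import java.util.Arrays;
import java.util.LinkedList;
import java.util.Map;
import java.util.Queue;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import com.digitalpebble.storm.crawler.Metadata;

public class MyQueueSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private final Queue<String> urls =
            new LinkedList<>(Arrays.asList("http://example.com/"));

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void nextTuple() {
        String url = urls.poll();
        if (url != null) {
            // the URL doubles as the message ID, so fail() knows what to replay
            collector.emit(new Values(url, new Metadata()), url);
        }
    }

    @Override
    public void fail(Object msgId) {
        urls.add((String) msgId); // naive replay on failure
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("url", "metadata"));
    }
}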
27. Some use cases (prototype stage)
Processing of streams of URLs (natural fit for Storm)
– http://www.weborama.com
Monitoring of finite set of URLs
– http://www.ontopic.io (more on them later)
– http://www.shopstyle.com : scraping + indexing
One-off non-recursive crawling
– http://www.stolencamerafinder.com/ : scraping + indexing
Recursive crawler
– http://www.shopstyle.com : discovery of product pages
28. What's next?
0.5 released soon?
All-in-one crawler based on ElasticSearch
– Also a good example of how to use SC
– Separate project
Additional Parse/URLFilters
Facilitate writing IndexingBolts
More tests and documentation
A nice logo or better name?