Storm-Crawler is a collection of resources for building low-latency, large-scale web crawlers on Apache Storm. We will compare it with similar projects such as Apache Nutch and present several use cases where Storm-Crawler is being used. In particular, we will see how Storm-Crawler can be used with ElasticSearch and Kibana for crawling and indexing web pages.
1. An introduction to Storm Crawler
Julien Nioche
julien@digitalpebble.com
@digitalpebble
Java Meetup – Bristol 10/03/2015
2. About myself
DigitalPebble Ltd, Bristol (UK)
Specialised in Text Engineering
– Web Crawling
– Natural Language Processing
– Information Retrieval
– Machine Learning
Strong focus on Open Source & Apache ecosystem
User | Contributor | Committer
– Nutch
– Tika
– SOLR, Lucene
– GATE, UIMA
– Behemoth
3. Storm-Crawler : what is it?
Collection of resources (SDK) for building web crawlers on
Apache Storm
https://github.com/DigitalPebble/storm-crawler
Artefacts available from Maven Central
Apache License v2
Active and growing fast
=> Scalable
=> Low latency
=> Easily extensible
=> Polite
4. What it is not
A ready-to-use, feature-complete web crawler
– Will provide that as a separate project using S/C later
e.g. No PageRank or explicit ranking of pages
– Build your own
No fancy UI, dashboards, etc...
– Build your own or integrate in your existing ones
Comparison with Apache Nutch later
5. Apache Storm
Distributed real-time computation system
http://storm.apache.org/
Scalable, fault-tolerant, polyglot and fun!
Implemented in Clojure + Java
“Realtime analytics, online machine learning, continuous
computation, distributed RPC, ETL, and more”
8. Basic Crawl Topology
[Diagram: Some Spout → URLPartitioner → Fetcher → Parser → Indexer, connected via the default stream]
(1) pull URLs from an external source (Some Spout)
(2) group per host / domain / IP (URLPartitioner)
(3) fetch URLs (Fetcher)
(4) parse content (Parser)
(5) index text and metadata (Indexer)
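As a rough sketch, wiring this topology with the Storm API of the time might look like the following. This is a minimal sketch, assuming the storm-crawler bolt class names from the slides; MyQueueSpout and MyIndexerBolt are hypothetical placeholders for your own spout and indexing bolt.

// Sketch of the basic, non-recursive topology described above (Storm 0.9-era API).
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;

public class BasicCrawlTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("spout", new MyQueueSpout());            // (1) pull URLs
        builder.setBolt("partitioner", new URLPartitionerBolt())  // (2) group per host / domain / IP
               .shuffleGrouping("spout");
        builder.setBolt("fetch", new FetcherBolt())               // (3) fetch URLs
               .fieldsGrouping("partitioner", new Fields("key")); // same key => same fetcher task
        builder.setBolt("parse", new ParserBolt())                // (4) parse content
               .localOrShuffleGrouping("fetch");
        builder.setBolt("index", new MyIndexerBolt())             // (5) index text and metadata
               .localOrShuffleGrouping("parse");
        new LocalCluster().submitTopology("crawl", new Config(), builder.createTopology());
    }
}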
10. Basic scenario
Spout : simple queue e.g. RabbitMQ, AWS SQS, etc...
Indexer : any form of storage, but often an index e.g. SOLR, ElasticSearch
Components in grey on the diagram : default ones from SC
Use case : non-recursive crawls (i.e. no links discovered), URLs known in advance. Failures don't matter too much.
“Great! What about recursive crawls and / or failure handling?”
11. Frontier expansion
Manual “discovery”
– Adding new URLs by hand, “seeding”
Automatic discovery of new resources (frontier expansion)
– Not all outlinks are equally useful - control
– Requires content parsing and link extraction
[Diagram: crawl frontier expanding outwards from a seed over iterations i = 1, 2, 3]
[Slide courtesy of A. Bialecki]
12. Recursive Topology
[Diagram: the basic topology (Some Spout → URLPartitioner → Fetcher → Parser → Indexer on the default stream) extended with a Status stream feeding a Status Updater, plus a Switch and a DB for persisting URL status]
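Wiring the extra stream could look roughly like this, reusing the builder from the earlier sketch and assuming the status stream is literally named "status"; MyStatusUpdaterBolt is a placeholder, sketched on the next slide.

// Subscribe a status updater bolt to the "status" stream emitted by the
// fetcher and the parser, in addition to the default-stream wiring above.
builder.setBolt("status", new MyStatusUpdaterBolt())
       .localOrShuffleGrouping("fetch", "status")
       .localOrShuffleGrouping("parse", "status");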
13. Status stream
Used for handling errors and updating info about URLs
Newly discovered URLs from parser
public enum Status {
    DISCOVERED, FETCHED, FETCH_ERROR, REDIRECTION, ERROR;
}
StatusUpdater : writes to storage
Can/should extend AbstractStatusUpdaterBolt
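A minimal sketch of such an updater, assuming AbstractStatusUpdaterBolt leaves a single store(...) method to implement (check the project for the exact signature and package names):

import com.digitalpebble.storm.crawler.Metadata;
import com.digitalpebble.storm.crawler.persistence.AbstractStatusUpdaterBolt;
import com.digitalpebble.storm.crawler.persistence.Status;

// Hypothetical updater that merely logs status changes; a real one would
// upsert the URL and its status into shared storage (ElasticSearch, HBase...)
public class MyStatusUpdaterBolt extends AbstractStatusUpdaterBolt {
    @Override
    public void store(String url, Status status, Metadata metadata) {
        System.out.println(status + "\t" + url);
    }
}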
14. Error vs failure in SC
Failure : structural issue e.g. dead node
– Storm's mechanism is based on time-outs
– Spout methods ack() / fail() receive the message ID
• can decide whether to replay the tuple or not
• message ID only => limited info about what went wrong
Error : data issue
– Some are temporary, e.g. target server temporarily unavailable
– Some are fatal, e.g. document not parsable
– Handled via the Status stream
– StatusUpdater can decide what to do with them
• Change status? Change scheduling?
19. Comparison with Nutch
Nutch is batch-driven : little control over when URLs are fetched
– Potential issue for use cases where sessions are needed
– latency++
Fetching is only one of the steps in Nutch
– SC : 'always be fetching' (Ken Krugler); better use of resources
Even more flexible
– Typical case : a few custom classes (at least a Topology), the rest are just dependencies and standard S/C components
Not as ready-to-use as Nutch : it's an SDK
Would not have existed without it
– Borrowed code and concepts
20. Overview of resources
https://www.flickr.com/photos/dipster1/1403240351/
21. Resources
Separated into core vs external
Core
– URLPartitioner
– Fetcher
– Parser
– SitemapParser
– …
External
– Metrics-related : inc. connector for Librato
– ElasticSearch : Indexer, Spout, StatusUpdater, connector for metrics
Also user-maintained external resources
22. FetcherBolt
Multi-threaded
Polite
– Puts incoming tuples into internal queues based on IP/domain/hostname
– Sets delay between requests from same queue
– Respects robots.txt
Protocol-neutral
– Protocol implementations are pluggable
– HTTP implementation taken from Nutch
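Politeness and the thread pool are driven by configuration. A sketch, with key names that are assumptions and should be checked against the project's default configuration file:

import backtype.storm.Config;

// Tuning the FetcherBolt via the topology configuration.
// The key names below are assumptions; verify them in the default config.
Config conf = new Config();
conf.put("fetcher.threads.number", 50);   // size of the fetching thread pool
conf.put("fetcher.server.delay", 1.0);    // seconds between requests to the same queue
conf.put("http.agent.name", "mycrawler"); // user agent, also checked against robots.txt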
23. ParserBolt
Based on Apache Tika
Supports most commonly used doc formats
– HTML, PDF, DOC etc...
Calls ParseFilters on document
– e.g. scrape info with XPathFilter
Calls URLFilters on outlinks
– e.g. normalize and / or blacklist URLs based on regexps
24. Other resources
URLPartitionerBolt
– Generates a key based on the hostname / domain / IP of the URL
– Output :
• String URL
• String key
• String metadata
– Useful for fieldsGrouping
URLFilters
– Control crawl expansion
– Delete or rewrite URLs
– Configured in JSON file
– Interface
• String filter(URL sourceUrl, Metadata sourceMetadata,String urlToFilter)
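A sketch of a custom URLFilter against that interface (class and package names are assumptions; this hypothetical filter keeps the crawl on the host of the source URL):

import java.net.MalformedURLException;
import java.net.URL;
import com.digitalpebble.storm.crawler.Metadata;
import com.digitalpebble.storm.crawler.filtering.URLFilter;

// Returning null removes a URL; returning a String keeps
// (a possibly rewritten version of) it.
public class SameHostURLFilter implements URLFilter {
    @Override
    public String filter(URL sourceUrl, Metadata sourceMetadata, String urlToFilter) {
        try {
            URL target = new URL(urlToFilter);
            return target.getHost().equalsIgnoreCase(sourceUrl.getHost()) ? urlToFilter : null;
        } catch (MalformedURLException e) {
            return null; // drop malformed URLs
        }
    }
}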
25. Other resources
ParseFilters
– Called from ParserBolt
– Extracts information from document
– Configured in JSON file
– Interface
• filter(String URL, byte[] content, DocumentFragment doc, Metadata metadata)
– com.digitalpebble.storm.crawler.parse.filter.XpathFilter.java
• XPath expressions
• Info extracted stored in metadata
• Used later for indexing
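A sketch of a custom ParseFilter against that interface (package names and the setValue(...) call on Metadata are assumptions, and the real interface may declare extra methods); this hypothetical filter stores the content length in the metadata for later indexing:

import org.w3c.dom.DocumentFragment;
import com.digitalpebble.storm.crawler.Metadata;
import com.digitalpebble.storm.crawler.parse.ParseFilter;

// Records the raw content length in the metadata, where an indexing
// bolt can later pick it up.
public class ContentLengthFilter implements ParseFilter {
    @Override
    public void filter(String url, byte[] content, DocumentFragment doc, Metadata metadata) {
        metadata.setValue("content.length", String.valueOf(content.length));
    }
}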
26. Integrate it!
Write the Spout for your use case (see the sketch below)
– Works fine with the existing resources as long as it emits URL, Metadata
Typical scenario
– Group URLs to fetch into separate external queues based on host or domain (AWS SQS, Apache Kafka)
– Write a Spout for it and throttle with topology.max.spout.pending
– So that politeness can be enforced without tuples timing out → fail
– Parse and extract
– Send new URLs to the queues
Can use various forms of persistence for URLs
– ElasticSearch, DynamoDB, HBase, etc...
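A minimal sketch of such a spout, assuming downstream components expect (url, metadata) tuples; the in-memory queue is a stand-in for SQS / Kafka / RabbitMQ, and the Metadata import's package name is assumed:

import java.util.Arrays;
import java.util.LinkedList;
import java.util.Map;
import java.util.Queue;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import com.digitalpebble.storm.crawler.Metadata;

public class MyQueueSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private final Queue<String> urls =
            new LinkedList<>(Arrays.asList("http://example.com/"));

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void nextTuple() {
        String url = urls.poll();
        if (url != null) {
            // the URL doubles as the message ID, so fail() knows what to replay
            collector.emit(new Values(url, new Metadata()), url);
        }
    }

    @Override
    public void fail(Object msgId) {
        urls.add((String) msgId); // naive replay on failure
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("url", "metadata"));
    }
}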
27. Some use cases (prototype stage)
Processing of streams of URLs (natural fit for Storm)
– http://www.weborama.com
Monitoring of finite set of URLs
– http://www.ontopic.io (more on them later)
– http://www.shopstyle.com : scraping + indexing
One-off non-recursive crawling
– http://www.stolencamerafinder.com/ : scraping + indexing
Recursive crawler
– http://www.shopstyle.com : discovery of product pages
28. What's next?
0.5 released soon?
All-in-one crawler based on ElasticSearch
– Also a good example of how to use SC
– Separate project
Additional Parse/URLFilters
Facilitate writing IndexingBolts
More tests and documentation
A nice logo or better name?