Introduction to Storm Crawler (https://github.com/DigitalPebble/storm-crawler), a collection of resources for building low-latency, large-scale web crawlers on Apache Storm, available under the Apache License.
Fast Feather talk given at ApacheCon EU 2014, Budapest
1. A quick introduction to Storm Crawler
Julien Nioche
julien@digitalpebble.com
@digitalpebble
ApacheCon EU 2014 - Budapest
2. About myself
DigitalPebble Ltd, Bristol (UK)
Specialised in Text Engineering
– Web Crawling
– Natural Language Processing
– Information Retrieval
– Machine Learning
Strong focus on Open Source & Apache ecosystem
PMC Chair Apache Nutch
User | Contributor | Committer
– Tika
– Solr, Lucene
– GATE, UIMA
– Mahout
– Behemoth
3. What is it?
Collection of resources (SDK) for building web crawlers on Apache Storm
https://github.com/DigitalPebble/storm-crawler
Artefacts available from Maven Central
Apache License v2
Scalable
Low latency
Easily extensible
4. What it is not
A ready-to-use, feature-complete, recursive web crawler
– Something like that might appear later as a separate project built on S/C
e.g. no PageRank or explicit ranking of pages
– Build your own
No fancy UI, dashboards, etc...
– Build your own
5. Comparison with Nutch
Nutch is batch-driven: little control over when URLs are fetched
– A potential issue for use cases that need sessions
– latency++
Fetching is only one of the steps in Nutch
– S/C: 'always be fetching' (Ken Krugler); better use of resources
S/C makes it even more flexible
– Typical case: a few custom classes (at least a Topology); the rest are just dependencies and standard S/C components
Not as ready-to-use as Nutch: it's an SDK
Would not have existed without it
– Borrowed code and concepts
6. Overview of resources
(Image: https://www.flickr.com/photos/dipster1/1403240351/)
7. FetcherBolt
Multi-threaded
Polite
– Puts incoming tuples into internal queues based on IP/domain/hostname
– Enforces a delay between requests from the same queue
– Respects robots.txt
Protocol-neutral
– Protocol implementations are pluggable
– HTTP implementation taken from Nutch
Output
– String URL
– byte[] content
– HashMap<String, String[]> metadata
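To make that output contract concrete, here is a minimal sketch of a downstream bolt consuming these three fields; the class itself and the field names ("url", "content", "metadata") are assumptions for illustration, not taken verbatim from the project:

    import java.util.HashMap;
    import java.util.Map;

    import backtype.storm.task.OutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseRichBolt;
    import backtype.storm.tuple.Tuple;

    // Hypothetical downstream bolt consuming the FetcherBolt output
    // described above; check declareOutputFields() of your version
    // for the actual field names.
    public class ContentSizeBolt extends BaseRichBolt {

        private OutputCollector collector;

        @Override
        public void prepare(Map conf, TopologyContext context,
                OutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void execute(Tuple tuple) {
            String url = tuple.getStringByField("url");
            byte[] content = tuple.getBinaryByField("content");
            @SuppressWarnings("unchecked")
            HashMap<String, String[]> metadata = (HashMap<String, String[]>) tuple
                    .getValueByField("metadata");
            // e.g. log the size of each fetched page
            System.out.println(url + " : " + content.length + " bytes");
            collector.ack(tuple);
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // terminal bolt in this sketch: nothing to declare
        }
    }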
8. ParserBolt
Based on Apache Tika
Supports most commonly used doc formats
– HTML, PDF, DOC etc...
Calls ParseFilters on document
– e.g. scrape info with XPathFilter
Calls URLFilters on outlinks
– e.g. normalize and/or blacklist URLs based on regexps (see the filter sketch below)
Output
– String URL
– byte[] content
– HashMap<String, String[]> metadata
– String text
– Set<String> outlinks
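As an illustration of the URLFilter idea mentioned above, a regex-based blacklist filter; the interface shown is an assumed Nutch-style contract (return the URL, possibly rewritten, or null to drop it), not necessarily the project's exact one:

    import java.util.regex.Pattern;

    // Assumed Nutch-style contract; the project's actual URLFilter
    // interface may differ.
    interface URLFilter {
        String filter(String url);
    }

    public class ImageBlacklistFilter implements URLFilter {

        private static final Pattern IMAGES =
                Pattern.compile("\\.(?:gif|jpe?g|png)$", Pattern.CASE_INSENSITIVE);

        @Override
        public String filter(String url) {
            // drop links to images, keep everything else unchanged
            return IMAGES.matcher(url).find() ? null : url;
        }
    }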
9. Other resources
ElasticSearchBolt
– Sends fields to Elasticsearch for indexing
– (deprecated by resources in elasticsearch-hadoop?)
URLPartitionerBolt
– Generates a key based on the hostname / domain / IP of the URL
– Output :
• String URL
• String key
• String metadata
– Useful for fieldsGrouping
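A sketch of how this fits together, with all URLs sharing a key routed to the same FetcherBolt task, which can then enforce per-host politeness; component names, parallelism and the FileSpout constructor are illustrative assumptions:

    import backtype.storm.topology.TopologyBuilder;
    import backtype.storm.tuple.Fields;
    // imports of the storm-crawler classes omitted: package names
    // vary across versions

    public class PartitionedWiring {

        public static TopologyBuilder build() {
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("spout", new FileSpout("seeds.txt"));
            builder.setBolt("partitioner", new URLPartitionerBolt())
                    .shuffleGrouping("spout");
            // fieldsGrouping on "key" sends all URLs with the same
            // host/domain/IP to the same one of the 4 fetcher tasks
            builder.setBolt("fetch", new FetcherBolt(), 4)
                    .fieldsGrouping("partitioner", new Fields("key"));
            return builder;
        }
    }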
10. Other resources
ConfigurableTopology
– Overrides the config with a local YAML file
– Simple switch for running in local mode
– Abstract class to be extended (see the sketch at the end of this slide)
Simple Spouts (for testing)
– FileSpout / RandomURLSpout
Various Metrics-related stuff
– Including a MetricsConsumer for https://www.librato.com/
FetchQueue package
– BlockingURLSpout and ShardedQueue abstraction
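Returning to ConfigurableTopology, a skeleton of a topology class built on it might look as follows; the start()/run()/submit() pattern and the conf field are assumptions based on the description above, so check the project for the exact signatures:

    import backtype.storm.topology.TopologyBuilder;
    // imports of the storm-crawler classes omitted: package names
    // vary across versions

    public class CrawlTopology extends ConfigurableTopology {

        public static void main(String[] args) throws Exception {
            ConfigurableTopology.start(new CrawlTopology(), args);
        }

        @Override
        protected int run(String[] args) {
            TopologyBuilder builder = new TopologyBuilder();
            // wire spout, partitioner, fetcher and parser here, as in
            // the wiring sketch shown earlier
            // submit() applies the config, including overrides from a
            // local YAML file, and can run in local mode instead of
            // submitting to a cluster
            return submit("crawl", conf, builder);
        }
    }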
11. Integrate it!
Write the Spout for your use case
– It will work fine with the existing resources as long as it emits (URL, metadata)
Typical scenario
– Group URLs to fetch into separate external queues based on host or domain (AWS SQS, Apache Kafka)
– Write a Spout for it and throttle with topology.max.spout.pending (sketch below)
– So that politeness can be enforced without tuples timing out → fail
– Parse and extract
– Send new URLs to queues
Can use various forms of persistence for URLs
– Elasticsearch, DynamoDB, HBase, etc...
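For the typical scenario above, a custom spout could look like this sketch; an in-memory queue stands in for SQS/Kafka, the (url, metadata) fields match what the existing bolts expect, and the class name and retry policy are assumptions:

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;

    import backtype.storm.spout.SpoutOutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseRichSpout;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Values;

    public class QueueSpout extends BaseRichSpout {

        private SpoutOutputCollector collector;
        // stand-in for an external queue such as SQS or Kafka
        private final Queue<String> queue = new ConcurrentLinkedQueue<String>();

        @Override
        public void open(Map conf, TopologyContext context,
                SpoutOutputCollector collector) {
            this.collector = collector;
            queue.add("http://example.com/"); // seed for the sketch
        }

        @Override
        public void nextTuple() {
            String url = queue.poll();
            if (url == null)
                return;
            // anchor the tuple on the URL so that fail() can requeue it
            collector.emit(new Values(url, new HashMap<String, String[]>()),
                    url);
        }

        @Override
        public void fail(Object msgId) {
            // naive retry policy: put the failed URL back on the queue
            queue.add((String) msgId);
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("url", "metadata"));
        }
    }

With topology.max.spout.pending set in the topology configuration, Storm caps the number of pending (un-acked) tuples per spout task, so politeness delays throttle the spout instead of producing tuple timeouts and fails.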
12. Some use cases (prototype stage)
Processing of streams of data (natural fit for Storm)
– http://www.weborama.com
Monitoring of finite set of URLs
– http://www.ontopic.io (more on them later)
– http://www.shopstyle.com : scraping + indexing
One-off non-recursive crawling
– http://www.stolencamerafinder.com/ : scraping + indexing
Recursive crawler
– WIP
13. What's next?
All-in-one crawler project built on S/C
– Also a good example of how to use SC
Additional Parse/URLFilters
More tests and documentation
A nice logo (this is an invitation)
A better name?