Large scale crawling with Apache Nutch

This talk will give an overview of Apache Nutch, its main components, how it fits with other Apache projects and its latest developments.

Apache Nutch was started exactly 10 years ago and was the starting point for what later became Apache Hadoop and also Apache Tika. Nutch is today the reference tool for large scale web crawling.

In this talk I will give an overview of Apache Nutch and describe its main components and how Nutch fits with other Apache projects such as Hadoop, SOLR or Tika.

The second part of the presentation will focus on the latest developments in Nutch and the changes introduced by the 2.x branch, with the use of Apache GORA as a front end to various NoSQL datastores.

Usage rights: CC Attribution License

Speaker notes

  • I'll be talking about large scale document processing and more specifically about Behemoth, which is an open source project based on Hadoop
  • A few words about myself just before I start... What I mean by Text Engineering is a variety of activities ranging from .... What makes the identity of DP is ... The main projects I am involved in are …
  • Note that I mention crawling and not web search → Nutch is used not only for search; it used to do indexing and search using Lucene but now delegates this to SOLR
  • Endpoints are called in various places: URL filters and normalisers in a lot of places; the same goes for scoring filters
  • Main steps in Nutch; more actions available; shell wrappers around Hadoop commands
  • Fetcher: multithreaded but polite
  • Writable object – CrawlDatum
  • What does this mean for Nutch?

Presentation Transcript

  • Large Scale Crawling with Apache Nutch – Julien Nioche – julien@digitalpebble.com – ApacheCon Europe 2012
  • About myself
    – DigitalPebble Ltd, Bristol (UK)
    – Specialised in Text Engineering: Web Crawling, Natural Language Processing, Information Retrieval, Data Mining
    – Strong focus on Open Source & the Apache ecosystem
    – Apache Nutch VP, Apache Tika committer
    – User / contributor: SOLR, Lucene, GATE, UIMA, Mahout, Behemoth
  • Objectives
    – Overview of the project
    – Nutch in a nutshell
    – Nutch 2.x
    – Future developments
  • Nutch? “Distributed framework for large scale web crawling”
    – but does not have to be large scale at all
    – or even on the web (file protocol)
    – Apache TLP since May 2010
    – Based on Apache Hadoop
    – Indexing and Search
  • Short history
    – 2002/2003: started by Doug Cutting & Mike Cafarella
    – 2004: sub-project of Lucene @ Apache
    – 2005: MapReduce implementation in Nutch
    – 2006: Hadoop becomes a sub-project of Lucene @ Apache
    – 2006/7: parser and MIME-type code moved into Tika
    – 2008: Tika becomes a sub-project of Lucene @ Apache
    – May 2010: Nutch becomes an Apache TLP
    – June 2012: Nutch 1.5.1
    – Oct 2012: Nutch 2.1
  • Recent releases (timeline): the 1.x line – 1.0, 1.1, 1.2, 1.3, 1.4, 1.5.1 (trunk) – from 06/09 to 06/12, alongside the 2.x branch with 2.0 and 2.1
  • Community
    – 6 active committers / PMC members, 4 of them joined within the last 18 months
    – Constant stream of new contributions & bug reports
    – Steady numbers of mailing list subscribers and steady traffic
    – Nutch is a very healthy 10-year-old
  • Why use Nutch?
    – Usual reasons: mature, business-friendly license, community, ...
    – Scalability: tried and tested on a very large scale; Hadoop cluster installation and skills
    – Features: e.g. indexing with SOLR, a PageRank implementation, can be extended with plugins
  • Not the best option when ...
    – Hadoop-based == batch processing == high latency: no guarantee that a page will be fetched / parsed / indexed within X minutes|hours
    – Javascript / Ajax not supported (yet)
  • Use cases
    – Crawl for IR: generic or vertical; index and search with SOLR; single node to large clusters on the Cloud
    – ... but also Data Mining, NLP (e.g. sentiment analysis) and ML – MAHOUT / UIMA / GATE
    – Use Behemoth as glueware (https://github.com/DigitalPebble/behemoth)
  • Customer cases, from specificity (verticality) to scale
    – Use case: BetterJobs.com – single server; aggregates content from job portals; extracts and normalizes structure (description, requirements, locations); ~1M pages total; feeds a SOLR index
    – Use case: SimilarPages.com – large cluster on Amazon EC2 (up to 400 nodes); fetched & parsed 3 billion pages; 10+ billion pages in the crawlDB (~100TB of data); 200+ million lists of similarities; no indexing / search involved
  • Typical Nutch steps – same in 1.x and 2.x, a sequence of batch operations (a driver sketch follows below)
    1) Inject → populates the CrawlDB from the seed list
    2) Generate → selects URLs to fetch in a segment
    3) Fetch → fetches URLs from the segment
    4) Parse → parses the content (text + metadata)
    5) UpdateDB → updates the CrawlDB (new URLs, new status, ...)
    6) InvertLinks → builds the Webgraph
    7) SOLRIndex → sends docs to SOLR
    8) SOLRDedup → removes duplicate docs based on their signature
    Repeat steps 2 to 8, or use the all-in-one crawl script
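    The same cycle can also be driven programmatically instead of through the shell wrappers. Below is a minimal sketch of one round in Java, assuming the Nutch 1.x tool classes (Injector, Generator, Fetcher, ParseSegment, CrawlDb, LinkDb, all Hadoop Tools); the segment path is hard-coded for illustration, whereas a real driver would list the segments directory to find the segment Generate just created.

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.util.ToolRunner;
        import org.apache.nutch.crawl.CrawlDb;
        import org.apache.nutch.crawl.Generator;
        import org.apache.nutch.crawl.Injector;
        import org.apache.nutch.crawl.LinkDb;
        import org.apache.nutch.fetcher.Fetcher;
        import org.apache.nutch.parse.ParseSegment;
        import org.apache.nutch.util.NutchConfiguration;

        public class CrawlRound {
          public static void main(String[] args) throws Exception {
            Configuration conf = NutchConfiguration.create();

            // 1) Inject: seed the CrawlDB from a directory of URL lists
            ToolRunner.run(conf, new Injector(), new String[] { "crawl/crawldb", "urls" });

            // 2) Generate: select URLs to fetch into a new, timestamped segment
            ToolRunner.run(conf, new Generator(), new String[] { "crawl/crawldb", "crawl/segments" });

            // The segment name is a timestamp; hard-coded here for illustration,
            // a real driver would list crawl/segments to find the new one
            String segment = "crawl/segments/20121105120000";

            // 3) Fetch and 4) Parse the segment
            ToolRunner.run(conf, new Fetcher(), new String[] { segment });
            ToolRunner.run(conf, new ParseSegment(), new String[] { segment });

            // 5) UpdateDB: merge new URLs and statuses back into the CrawlDB
            ToolRunner.run(conf, new CrawlDb(), new String[] { "crawl/crawldb", segment });

            // 6) InvertLinks: build the LinkDB (webgraph)
            ToolRunner.run(conf, new LinkDb(), new String[] { "crawl/linkdb", segment });
          }
        }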
  • Main steps (diagram): Seed List → CrawlDB → Segment (crawl_generate/, crawl_fetch/, content/, crawl_parse/, parse_data/, parse_text/) → LinkDB
  • Frontier expansion
    – Manual “discovery”: adding new URLs by hand, “seeding”
    – Automatic discovery of new resources (frontier expansion), growing iteration by iteration (i=1, i=2, i=3 in the diagram)
    – Not all outlinks are equally useful – control the seeds
    – Requires content parsing and link extraction
    [Slide courtesy of A. Bialecki]
  • An extensible framework (a plugin sketch follows below)
    – Plugins: activated with the parameter plugin.includes; implement one or more endpoints
    – Endpoints: Protocol, Parser, HtmlParseFilter (ParseFilter in Nutch 2.x), ScoringFilter (used in various places), URLFilter (ditto), URLNormalizer (ditto), IndexingFilter
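    To make the endpoint idea concrete, here is a minimal sketch of a URLFilter implementation. The interface is Nutch's (org.apache.nutch.net.URLFilter: return the URL to keep it, null to reject it, plus the Configurable methods); the class name and the rule it applies are invented for the example. A real plugin would also ship a plugin.xml descriptor and be listed in plugin.includes.

        import org.apache.hadoop.conf.Configuration;
        import org.apache.nutch.net.URLFilter;

        // Hypothetical filter: reject links to zip archives
        public class NoZipURLFilter implements URLFilter {

          private Configuration conf;

          // Return the URL to keep it, null to filter it out
          public String filter(String urlString) {
            if (urlString != null && urlString.toLowerCase().endsWith(".zip")) {
              return null;
            }
            return urlString;
          }

          public void setConf(Configuration conf) {
            this.conf = conf;
          }

          public Configuration getConf() {
            return conf;
          }
        }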
  • Features Fetcher – Multi-threaded fetcher – Follows robots.txt – Groups URLs per hostname / domain / IP – Limit the number of URLs for round of fetching – Default values are polite but can be made more aggressive Crawl Strategy – Breadth-first but can be depth-first – Configurable via custom scoring plugins Scoring – OPIC (On-line Page Importance Calculation) by default – LinkRank 18 / 37
  • Features (cont.) Protocols – Http, file, ftp, https Scheduling – Specified or adaptative URL filters – Regex, FSA, TLD, prefix, suffix URL normalisers – Default, regex 19 / 37
  • Features (cont.) Parsing with Apache Tika – Hundreds of formats supported – But some legacy parsers as well Other plugins – CreativeCommons – Feeds – Language Identification – Rel tags – Arbitrary Metadata Indexing to SOLR – Bespoke schema 20 / 37
  • Data structures in 1.x – MapReduce jobs, so I/O is Hadoop [Sequence|Map]Files
    – CrawlDB => status of known pages; a MapFile : <Text,CrawlDatum> where CrawlDatum holds:
        byte status; [fetched? unfetched? failed? redir?]
        long fetchTime;
        byte retries;
        int fetchInterval;
        float score = 1.0f;
        byte[] signature = null;
        long modifiedTime;
        org.apache.hadoop.io.MapWritable metaData;
    – Input of: generate, index. Output of: inject, update. (A reader sketch follows below.)
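    Since the CrawlDB is an ordinary Hadoop MapFile, its records can be inspected with the plain Hadoop I/O API. A sketch, assuming a CrawlDB part file at the usual crawldb/current/part-00000 location (the path is an assumption; bin/nutch readdb is the supported way to do this):

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.SequenceFile;
        import org.apache.hadoop.io.Text;
        import org.apache.nutch.crawl.CrawlDatum;

        public class DumpCrawlDb {
          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            // A MapFile is a directory holding "data" and "index" files;
            // read the data part as a SequenceFile (assumed local layout)
            Path data = new Path("crawldb/current/part-00000/data");
            SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
            Text url = new Text();
            CrawlDatum datum = new CrawlDatum();
            while (reader.next(url, datum)) {
              System.out.println(url + "\tstatus=" + CrawlDatum.getStatusName(datum.getStatus())
                  + "\tscore=" + datum.getScore());
            }
            reader.close();
          }
        }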
  • Data structures in 1.x – Segment => one round of fetching, identified by a timestamp
        /crawl_generate/ → SequenceFile<Text,CrawlDatum>
        /crawl_fetch/ → MapFile<Text,CrawlDatum>
        /content/ → MapFile<Text,Content>
        /crawl_parse/ → SequenceFile<Text,CrawlDatum>
        /parse_data/ → MapFile<Text,ParseData>
        /parse_text/ → MapFile<Text,ParseText>
    Can have multiple versions of a page in different segments
  • Data structures in 1.x – LinkDB => storage for the Web graph
        MapFile : <Text,Inlinks>
        Inlinks : HashSet<Inlink>
        Inlink : String fromUrl; String anchor
    Output of: invertlinks. Input of: SOLRIndex
  • Nutch 2.x
    – 2.0 released in July 2012, 2.1 in October 2012
    – Common features with 1.x: delegation to SOLR, Tika, MapReduce, etc.
    – Moved to a table-based architecture, drawing on the wealth of NoSQL projects of the last few years
    – Abstraction over the storage layer → Apache GORA
  • Apache GORA – http://gora.apache.org/
    – ORM for NoSQL databases, with limited SQL support and file-based storage as well
    – 0.2.1 released in August 2012
    – DataStore implementations: Accumulo, Avro, Cassandra, DynamoDB (soon), HBase, SQL
    – Serialization with Apache AVRO
    – Object-to-datastore mappings (backend-specific)
  • AVRO schema => Java code (a usage sketch follows below)
        {"name": "WebPage",
         "type": "record",
         "namespace": "org.apache.nutch.storage",
         "fields": [
            {"name": "baseUrl", "type": ["null", "string"] },
            {"name": "status", "type": "int"},
            {"name": "fetchTime", "type": "long"},
            {"name": "prevFetchTime", "type": "long"},
            {"name": "fetchInterval", "type": "int"},
            {"name": "retriesSinceFetch", "type": "int"},
            {"name": "modifiedTime", "type": "long"},
            {"name": "protocolStatus", "type": {
                "name": "ProtocolStatus",
                "type": "record",
                "namespace": "org.apache.nutch.storage",
                "fields": [
                   {"name": "code", "type": "int"},
                   {"name": "args", "type": {"type": "array", "items": "string"}},
                   {"name": "lastModified", "type": "long"}
                ]
            }},
            […]
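    For a sense of what the generated Java looks like in use, a small sketch with the WebPage bean that Gora's Avro compiler emits from the schema above; the setters mirror the Avro fields, Utf8 is Avro's string wrapper, and the status value used here is purely illustrative.

        import org.apache.avro.util.Utf8;
        import org.apache.nutch.storage.WebPage;

        public class WebPageExample {
          public static void main(String[] args) {
            // The generated bean mirrors the Avro record field by field
            WebPage page = new WebPage();
            page.setBaseUrl(new Utf8("http://nutch.apache.org/"));
            page.setStatus(1);                      // illustrative status code
            page.setFetchTime(System.currentTimeMillis());
            page.setFetchInterval(30 * 24 * 3600);  // refetch after ~30 days
            System.out.println(page.getBaseUrl() + " -> status " + page.getStatus());
          }
        }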
  • Mapping file (backend-specific – HBase)
        <gora-orm>
          <table name="webpage">
            <family name="p" maxVersions="1"/> <!-- This can also have params like compression, bloom filters -->
            <family name="f" maxVersions="1"/>
            <family name="s" maxVersions="1"/>
            <family name="il" maxVersions="1"/>
            <family name="ol" maxVersions="1"/>
            <family name="h" maxVersions="1"/>
            <family name="mtdt" maxVersions="1"/>
            <family name="mk" maxVersions="1"/>
          </table>
          <class table="webpage" keyClass="java.lang.String" name="org.apache.nutch.storage.WebPage">
            <!-- fetch fields -->
            <field name="baseUrl" family="f" qualifier="bas"/>
            <field name="status" family="f" qualifier="st"/>
            <field name="prevFetchTime" family="f" qualifier="pts"/>
            <field name="fetchTime" family="f" qualifier="ts"/>
            <field name="fetchInterval" family="f" qualifier="fi"/>
            <field name="retriesSinceFetch" family="f" qualifier="rsf"/>
  • DataStore operations (a sketch follows below)
    – Atomic operations: get(K key), put(K key, T obj), delete(K key)
    – Querying: execute(Query<K, T> query) → Result<K,T>; deleteByQuery(Query<K, T> query)
    – Wrappers for Apache Hadoop: GoraInputFormat / GoraOutputFormat, GoraRecordReader / GoraRecordWriter, GoraMapper / GoraReducer
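    A hedged sketch of these operations against the Gora 0.2-era API, assuming gora.properties points at a backend (e.g. HBase) and using Nutch's WebPage as the persistent class; the reversed-URL keys follow Nutch 2.x's convention, but the exact values are illustrative.

        import org.apache.avro.util.Utf8;
        import org.apache.gora.query.Query;
        import org.apache.gora.query.Result;
        import org.apache.gora.store.DataStore;
        import org.apache.gora.store.DataStoreFactory;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.nutch.storage.WebPage;

        public class GoraExample {
          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Backend chosen via gora.properties and the mapping file
            DataStore<String, WebPage> store =
                DataStoreFactory.getDataStore(String.class, WebPage.class, conf);

            // Atomic operations: put / get / delete by key
            WebPage page = new WebPage();
            page.setBaseUrl(new Utf8("http://nutch.apache.org/"));
            store.put("org.apache.nutch:http/", page);
            store.flush(); // make the write visible before reading it back
            WebPage stored = store.get("org.apache.nutch:http/");
            System.out.println("stored baseUrl: " + stored.getBaseUrl());
            store.delete("org.apache.nutch:http/");

            // Querying: scan a key range and iterate over the Result
            Query<String, WebPage> query = store.newQuery();
            query.setStartKey("org.apache");
            query.setEndKey("org.apache\uffff");
            Result<String, WebPage> result = store.execute(query);
            while (result.next()) {
              System.out.println(result.getKey() + " -> " + result.get().getBaseUrl());
            }
            result.close();
            store.close();
          }
        }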
  • GORA in Nutch
    – AVRO schema provided and Java code pre-generated
    – Mapping files provided for the backends; can be modified if necessary
    – Need to rebuild to pull in the dependencies for a given backend – there is no binary distribution of Nutch 2.x
    – http://wiki.apache.org/nutch/Nutch2Tutorial
  • Benefits
    – Storage is still distributed and replicated, but in one big table: status, metadata, content and text all in one place
    – Simplified logic in Nutch: simpler code for updating / merging information
    – More efficient (?): no need to read / write the entire structure to update a record; no comparison available yet, and these are early days for GORA
    – Easier interaction with other resources: third-party code just needs to use GORA and the schema
  • Drawbacks
    – More stuff to install and configure :-)
    – Not as stable as Nutch 1.x
    – Dependent on the success of GORA
  • 2.x work in progress
    – Stabilise the backend implementations – GORA-HBase is the most reliable
    – Synchronise features with 1.x – e.g. 2.x has ElasticSearch but is missing a LinkRank equivalent
    – Filter-enabled scans (GORA-119) – no need to deserialize the whole dataset
  • Future
    – Both 1.x and 2.x in parallel, but more frequent releases for 2.x
    – New functionalities: support for SOLRCloud, sitemaps (from the Crawler Commons library), the canonical tag, more indexers (e.g. ElasticSearch) + pluggable indexers?
  • More delegation
    – A great deal done in recent years (SOLR, Tika)
    – Share code with crawler-commons (http://code.google.com/p/crawler-commons/): fetcher / protocol handling, robots.txt parsing, URL normalisation / filtering
    – Move PageRank-like computations to a graph library, e.g. Apache Giraph – should be more efficient as well
  • Where to find out more?
    – Project page: http://nutch.apache.org/
    – Wiki: http://wiki.apache.org/nutch/
    – Mailing lists: user@nutch.apache.org, dev@nutch.apache.org
    – Chapter in “Hadoop: The Definitive Guide” (T. White) – understanding Hadoop is essential anyway...
    – Support / consulting: http://wiki.apache.org/nutch/Support
  • Questions?