SlideShare une entreprise Scribd logo
1  sur  27
Télécharger pour lire hors ligne
Web Crawling with Apache Nutch 
Sebastian Nagel 
snagel@apache.org 
ApacheCon EU 2014 
2014-11-18
About Me 
computational linguist 
software developer at Exorbyte (Konstanz, Germany) 
search and data matching 
prepare data for indexing, cleansing noisy data, 
web crawling 
Nutch user since 2008 
2012 Nutch committer and PMC
Nutch History 
2002 started by Doug Cutting and Mike Caffarella 
open source web-scale crawler and search engine 
2004/05 MapReduce and distributed file system in Nutch 
2005 Apache incubator, sub-project of Lucene 
2006 Hadoop split from Nutch, Nutch based on Hadoop 
2007 use Tika for MimeType detection, Tika Parser 2010 
2008 start NutchBase, …, 2012 released 2.0 based on Gora 0.2 
2009 use Solr for indexing 
2010 Nutch top-level project 
2012/13 ElasticSearch indexer (2.x), pluggable indexing back-ends (1.x) 
2.0 
graduate tlp 
@sourceforge incubator 
Tika 
2002 2009 201.5 
1.9 
2.2.1 
1.8 
2.2 
1.7 
2.1 
1.6 
1.5 
1.4 
1.3 
1.2 
1.1 
0.7 
0.8 0.9 1.0 
0.7.2 
0.5 
0.4 
0.6 
Hadoop 
Gora 
use Solr
What is Nutch now? 
“an extensible and scalable web crawler based on Hadoop” 
▶ runs on top of 
▶ scalable: billions of pages possible 
▶ some overhead (if scale is not a requirement) 
▶ not ideal for low latency 
▶ customizable / extensible plugin architecture 
▶ pluggable protocols (document access) 
▶ URL filters + normalizers 
▶ parsing: document formats + meta data extraction 
▶ indexing back-ends 
▶ mostly used to feed a search index … 
▶ …but also data mining
Nutch Community 
▶ mature Apache project 
▶ 6 active committers 
▶ maintain two branches (1.x and 2.x) 
▶ “friends” — (Apache) projects Nutch delegates work to 
▶ Hadoop: scalability, job execution, data serialization (1.x) 
▶ Tika: detecting and parsing multiple document formats 
▶ Solr, ElasticSearch: make crawled content searchable 
▶ Gora (and HBase, Cassandra, …): data storage (2.x) 
▶ crawler-commons: robots.txt parsing 
▶ steady user base, majority of users 
▶ uses 1.x 
▶ runs small scale crawls (< 1M pages) 
▶ uses Solr for indexing and search
Crawler Workflow 
0. initialize CrawlDb, inject seed URLs 
repeat generate-fetch-update cycle n times: 
1. generate fetch list: select URLs from CrawlDb for fetching 
2. fetch URLs from fetch list 
3. parse documents: extract content, metadata and links 
4. update CrawlDb status, score and signature, add new URLs 
inlined or at the end of one crawler run (once for multiple cycles): 
5. invert links: map anchor texts to documents the links point to 
6. (calculate link rank on web graph, update CrawlDb scores) 
7. deduplicate documents by signature 
8. index document content, meta data, and anchor texts
Crawler Workflow
Workflow Execution 
▶ every step is implemented as one (or more) MapReduce job 
▶ shell script to run the workflow (bin/crawl) 
▶ bin/nutch to run individual tools 
▶ inject, generate, fetch, parse, updatedb, invertlinks, index, … 
▶ many analysis and debugging tools 
▶ local mode 
▶ works out-of-the-box (bin package) 
▶ useful for testing and debugging 
▶ (pseudo-)distributed mode 
▶ parallelization, monitor crawls with MapReduce web UI 
▶ recompile and deploy job file with configuration changes
Features of selected Tools 
Fetcher 
▶ multi-threaded, high throughput 
▶ but always be polite 
▶ respect robots rules (robots.txt and robots directives) 
▶ limit load on crawled servers, 
guaranteed delays between access to same host, IP, or domain 
WebGraph 
▶ iterative link analysis 
▶ on the level of documents, sites, or domains 
… most other tools 
▶ are either quite trivial 
▶ and/or rely heavily on plugins
Extensible via Plugins 
Plugins basics 
▶ each Plugin implements one (or more) extension points 
▶ plugins are activated on demand (property plugin.includes) 
▶ multiple active plugins per ext. point 
▶ “chained”: sequentially applied (filters and normalizers) 
▶ automatically selected (protocols and parser) 
Extension points and available plugins 
▶ URL filter 
▶ include/exclude URLs from crawling 
▶ plugins: regex, prefix, suffix, domain 
▶ URL normalizer 
▶ canonicalize URL 
▶ regex, domain, querystring
Plugins (continued) 
▶ protocols 
▶ http, https, ftp, file 
▶ protocol plugins take care for robots.txt rules 
▶ parser 
▶ need to parse various document types 
▶ now mostly delegated to Tika (plugin parse-tika) 
▶ legacy: parse-html, feed, zip 
▶ parse filter 
▶ extract additional meta data 
▶ change extracted plain text 
▶ plugins: headings, metatags 
▶ indexing filter 
▶ add/fill indexed fields 
▶ url, title, content, anchor, boost, digest, type, host, … 
▶ often in combination with parse filter to extract field content 
▶ plugins: basic, more, anchor, metadata, …
Plugins (continued) 
▶ index writer 
▶ connect to indexing back-ends 
▶ send/write additions / updates / deletions 
▶ plugins: Solr, ElasticSearch 
▶ scoring filter 
▶ pass score to outlinks (in 1.x: quite a few hops) 
▶ change scores in CrawlDb via inlinks 
▶ can do magic: limit crawl by depth, focused crawling, … 
▶ OPIC (On-line Page Importance Computation, [1]) 
▶ online: good strategy to crawl relevant content first 
▶ scores also ok for ranking 
▶ but seeds (and docs “close” to them) are favored 
▶ scores get out of control for long-running continuous crawls 
▶ LinkRank 
▶ link rank calculation done separately on WebGraph 
▶ scores are fed back to CrawlDb 
Complete list of plugins: 
http://nutch.apache.org/apidocs/apidocs-1.9/
Pluggable classes 
Extensible interfaces (pluggable class implementations) 
▶ fetch schedule 
▶ decide whether to add URL to fetch list 
▶ set next fetch time and re-fetch intervals 
▶ default implementation: fixed re-fetch interval 
▶ available: interval adaptive to document change frequency 
▶ signature calculation 
▶ used for deduplication and not-modified detection 
▶ MD5 sum of binary document or plain text 
▶ TextProfile based on filtered word frequency list
Data Structures and Storage in Nutch 1.x 
All data is stored as map 〈url,someobject〉 in Hadoop map or 
sequence files 
Data structures (directories) and stored values: 
▶ CrawlDb: all information of URL/document to run and 
schedule crawling 
▶ current status (injected, linked, fetched, gone, …) 
▶ score, next fetch time, last modified time 
▶ signature (for deduplication) 
▶ meta data (container for arbitrary data) 
▶ LinkDb: incoming links (URLs and anchor texts) 
▶ WebGraph: map files to hold outlinks, inlinks, node scores
Data Structures (1.x): Segments 
Segments store all data related to fetch and parse of a single batch 
of URLs (one generate-fetch-update cycle) 
crawl_generate: list of URLs to be fetched 
with politeness requirements (host-level blocking): 
▶ all URLs of one host (or IP) must be in one partition to 
implement host-level blocking in one single JVM 
▶ inside one partition URLs are shuffled: URLs of one host are 
spread over whole partition to minimize blocking during fetch 
crawl_fetch: status from fetch (e.g., success, failed, robots denied) 
content: fetched binary content and meta data (HTTP header) 
parse_data: extracted meta data, outlinks and anchor texts 
parse_text: plain-text content 
crawl_parse: outlinks, scores, signatures, meta data, 
used to update CrawlDb
Storage and Data Flow in Nutch 1.x
Storage in Nutch 2.x 
One big table (WebTable) 
▶ all information about one URL/document in a single row: 
status, metadata, binary content, extracted plain-text, inlink 
anchors, … 
▶ inspired by BigTable paper [3] 
▶ …and rise of NoSQL data stores 
Apache Gora used for storage layer abstraction 
▶ OTD (object-to-datastore) for NoSQL data stores 
▶ HBase, Cassandra, DynamoDb, Avro, Accumulo, … 
▶ data serialization with Avro 
▶ back-end specific object-datastore mappings
Benefits of Nutch 2.x 
Benefits 
▶ less steps in crawler workflow 
▶ simplify Nutch code base 
▶ easy to access data from other applications 
▶ access storage directly 
▶ or via Gora and schema 
▶ plugins and MapReduce processing shared with 1.x
Performance and Efficiency 2.x? 
Objective: should be more performant (in terms of latency) 
▶ no need to rewrite whole CrawlDb for each smaller update 
▶ adaptable to more instant workflows (cf. [5]) 
…however 
▶ benchmark by Julien Nioche [4] shows that 
Nutch 2.x is slower than 1.x by a factor of 2.5 
▶ not a problem of HBase and Cassandra but due to Gora layer 
▶ will hopefully be improved with Nutch 2.3 and recent Gora 
version which supports filtered scans 
▶ still fast at scale
…and Drawbacks? 
Drawbacks 
▶ need to install and maintain datastore 
▶ higher hardware requirements 
Impact on Nutch as software project 
▶ 2.x planned as replacement, however 
▶ still not as stable as 1.x 
▶ 1.x has more users 
▶ need to maintain two branches 
▶ keep in sync, transfer patches
News 
Common Crawl moves to Nutch (2013) 
http://blog.commoncrawl.org/2014/02/common-crawl-move-to-nutch/ 
▶ public data on Amazon S3 
▶ billions of web pages 
Generic deduplication (1.x) 
▶ based on CrawlDb only 
▶ no need to pull document signatures and scores from index 
▶ no special implementations for indexing back-ends (Solr, etc.) 
Nutch participated in GSoC 2014 …
Nutch Web App 
. 
GSoC 2014 (Fjodor Vershinin): 
“Create a Wicket-based Web Application for Nutch” 
▶ Nutch Server and REST API 
▶ Web App client to run and schedule crawls 
▶ only Nutch 2.x (for now) 
▶ needs completion: change configuration, analytics (logs), …
What’s coming next? 
Nutch 1.x is still alive! 
…but we hope to release 2.3 soon 
▶ with performance improvements from recent Gora release 
Fix bugs, improve usability 
▶ our bottleneck: review patches, get them committed 
▶ try hard to keep 1.x and 2.x in sync 
Pending new features 
▶ sitemap protocol 
▶ canonical links
What’s coming next? (mid- and long-term) 
Move to Hadoop 2.x and new MapReduce API 
… complete Rest API and Web App 
… port it “back” to 1.x 
… use graph library (Giraph) for link rank computation 
… switch from batch processing to streaming 
▶ Storm / Spark / ? 
▶ parts of the workflow
Thanks and acknowledgments … 
… to Julien Nioche, Lewis John McGibbney, Andrzej Białecki, Chris 
Mattmann, Doug Cutting, Mike Cafarella – for inspirations and 
material from previous Nutch presentations 
… and to all Nutch committers for review and comments on early 
versions of theses slides.
References 
[1] Abiteboul, Serge; Preda, Mihai; Cobena, Gregory, 2003: Adaptive on-line 
page importance computation. In: Proceedings of the 12th International 
Conference on World Wide Web, WWW ’03, pp. 280–290, ACM, New 
York, NY, USA, 
http://procod.com/mihai_preda/paper/online_page_importance.pdf. 
[2] Białecki, Andrzej, 2012: Nutch search engine. In: White, Tom (ed.) 
Hadoop: The Definitive Guide, pp. 565–579, O’Reilly. 
[3] Chang, Fay; Dean, Jeffrey; Ghemawat, Sanjay; Hsieh, Wilson C.; Wallach, 
Deborah A.; Burrows, Mike; Chandra, Tushar; Fikes, Andrew; Gruber, 
Robert E., 2006: Bigtable: A distributed storage system for structured 
data. In: Proceedings of the 7th Conference on USENIX Symposium on 
Operating Systems Design and Implementation (OSDI ’06), vol. 7, pp. 
205–218, http://www.usenix.org/events/osdi06/tech/chang/chang.pdf. 
[4] Nioche, Julien, 2013, Nutch fight! 1.7 vs 2.2.1. Blog post, 
http://digitalpebble.blogspot.co.uk/2013/09/nutch-fight-17-vs-221.html. 
[5] Peng, Daniel Dabek, Frank, 2010: Large-scale incremental processing 
using distributed transactions and notifications. In: 9th USENIX 
Symposium on Operating Systems Design and Implementation, pp. 4–6, 
http://www.usenix.org/event/osdi10/tech/full_papers/Peng.pdf.
Questions?

Contenu connexe

Tendances

Introduction à ElasticSearch
Introduction à ElasticSearchIntroduction à ElasticSearch
Introduction à ElasticSearchFadel Chafai
 
Introduction to Tokyo Products
Introduction to Tokyo ProductsIntroduction to Tokyo Products
Introduction to Tokyo ProductsMikio Hirabayashi
 
왜 쿠버네티스는 systemd로 cgroup을 관리하려고 할까요
왜 쿠버네티스는 systemd로 cgroup을 관리하려고 할까요왜 쿠버네티스는 systemd로 cgroup을 관리하려고 할까요
왜 쿠버네티스는 systemd로 cgroup을 관리하려고 할까요Jo Hoon
 
[Pgday.Seoul 2020] SQL Tuning
[Pgday.Seoul 2020] SQL Tuning[Pgday.Seoul 2020] SQL Tuning
[Pgday.Seoul 2020] SQL TuningPgDay.Seoul
 
Apache Kafka 0.8 basic training - Verisign
Apache Kafka 0.8 basic training - VerisignApache Kafka 0.8 basic training - Verisign
Apache Kafka 0.8 basic training - VerisignMichael Noll
 
[OpenStack 하반기 스터디] Docker를 이용한 OpenStack 가상화
[OpenStack 하반기 스터디] Docker를 이용한 OpenStack 가상화[OpenStack 하반기 스터디] Docker를 이용한 OpenStack 가상화
[OpenStack 하반기 스터디] Docker를 이용한 OpenStack 가상화OpenStack Korea Community
 
Room 1 - 4 - Phạm Tường Chiến & Trần Văn Thắng - Deliver managed Kubernetes C...
Room 1 - 4 - Phạm Tường Chiến & Trần Văn Thắng - Deliver managed Kubernetes C...Room 1 - 4 - Phạm Tường Chiến & Trần Văn Thắng - Deliver managed Kubernetes C...
Room 1 - 4 - Phạm Tường Chiến & Trần Văn Thắng - Deliver managed Kubernetes C...Vietnam Open Infrastructure User Group
 
Kubernetes - A Comprehensive Overview
Kubernetes - A Comprehensive OverviewKubernetes - A Comprehensive Overview
Kubernetes - A Comprehensive OverviewBob Killen
 
Data warehouse on Kubernetes - gentle intro to Clickhouse Operator, by Robert...
Data warehouse on Kubernetes - gentle intro to Clickhouse Operator, by Robert...Data warehouse on Kubernetes - gentle intro to Clickhouse Operator, by Robert...
Data warehouse on Kubernetes - gentle intro to Clickhouse Operator, by Robert...Altinity Ltd
 
Kubernetes in Docker
Kubernetes in DockerKubernetes in Docker
Kubernetes in DockerDocker, Inc.
 
KFServing - Serverless Model Inferencing
KFServing - Serverless Model InferencingKFServing - Serverless Model Inferencing
KFServing - Serverless Model InferencingAnimesh Singh
 
차곡차곡 쉽게 알아가는 Elasticsearch와 Node.js
차곡차곡 쉽게 알아가는 Elasticsearch와 Node.js차곡차곡 쉽게 알아가는 Elasticsearch와 Node.js
차곡차곡 쉽게 알아가는 Elasticsearch와 Node.jsHeeJung Hwang
 
Altinity Quickstart for ClickHouse-2202-09-15.pdf
Altinity Quickstart for ClickHouse-2202-09-15.pdfAltinity Quickstart for ClickHouse-2202-09-15.pdf
Altinity Quickstart for ClickHouse-2202-09-15.pdfAltinity Ltd
 
Mvcc in postgreSQL 권건우
Mvcc in postgreSQL 권건우Mvcc in postgreSQL 권건우
Mvcc in postgreSQL 권건우PgDay.Seoul
 
Openstack pour les nuls
Openstack pour les nulsOpenstack pour les nuls
Openstack pour les nulsChris Cowley
 
Airflow presentation
Airflow presentationAirflow presentation
Airflow presentationIlias Okacha
 
Introduction to OpenStack Trove & Database as a Service
Introduction to OpenStack Trove & Database as a ServiceIntroduction to OpenStack Trove & Database as a Service
Introduction to OpenStack Trove & Database as a ServiceTesora
 
Découverte de Elastic search
Découverte de Elastic searchDécouverte de Elastic search
Découverte de Elastic searchJEMLI Fathi
 

Tendances (20)

Elk
Elk Elk
Elk
 
Introduction à ElasticSearch
Introduction à ElasticSearchIntroduction à ElasticSearch
Introduction à ElasticSearch
 
Introduction to Tokyo Products
Introduction to Tokyo ProductsIntroduction to Tokyo Products
Introduction to Tokyo Products
 
왜 쿠버네티스는 systemd로 cgroup을 관리하려고 할까요
왜 쿠버네티스는 systemd로 cgroup을 관리하려고 할까요왜 쿠버네티스는 systemd로 cgroup을 관리하려고 할까요
왜 쿠버네티스는 systemd로 cgroup을 관리하려고 할까요
 
[Pgday.Seoul 2020] SQL Tuning
[Pgday.Seoul 2020] SQL Tuning[Pgday.Seoul 2020] SQL Tuning
[Pgday.Seoul 2020] SQL Tuning
 
Apache Kafka 0.8 basic training - Verisign
Apache Kafka 0.8 basic training - VerisignApache Kafka 0.8 basic training - Verisign
Apache Kafka 0.8 basic training - Verisign
 
[OpenStack 하반기 스터디] Docker를 이용한 OpenStack 가상화
[OpenStack 하반기 스터디] Docker를 이용한 OpenStack 가상화[OpenStack 하반기 스터디] Docker를 이용한 OpenStack 가상화
[OpenStack 하반기 스터디] Docker를 이용한 OpenStack 가상화
 
Room 1 - 4 - Phạm Tường Chiến & Trần Văn Thắng - Deliver managed Kubernetes C...
Room 1 - 4 - Phạm Tường Chiến & Trần Văn Thắng - Deliver managed Kubernetes C...Room 1 - 4 - Phạm Tường Chiến & Trần Văn Thắng - Deliver managed Kubernetes C...
Room 1 - 4 - Phạm Tường Chiến & Trần Văn Thắng - Deliver managed Kubernetes C...
 
Kubernetes - A Comprehensive Overview
Kubernetes - A Comprehensive OverviewKubernetes - A Comprehensive Overview
Kubernetes - A Comprehensive Overview
 
Traitement distribue en BIg Data - KAFKA Broker and Kafka Streams
Traitement distribue en BIg Data - KAFKA Broker and Kafka StreamsTraitement distribue en BIg Data - KAFKA Broker and Kafka Streams
Traitement distribue en BIg Data - KAFKA Broker and Kafka Streams
 
Data warehouse on Kubernetes - gentle intro to Clickhouse Operator, by Robert...
Data warehouse on Kubernetes - gentle intro to Clickhouse Operator, by Robert...Data warehouse on Kubernetes - gentle intro to Clickhouse Operator, by Robert...
Data warehouse on Kubernetes - gentle intro to Clickhouse Operator, by Robert...
 
Kubernetes in Docker
Kubernetes in DockerKubernetes in Docker
Kubernetes in Docker
 
KFServing - Serverless Model Inferencing
KFServing - Serverless Model InferencingKFServing - Serverless Model Inferencing
KFServing - Serverless Model Inferencing
 
차곡차곡 쉽게 알아가는 Elasticsearch와 Node.js
차곡차곡 쉽게 알아가는 Elasticsearch와 Node.js차곡차곡 쉽게 알아가는 Elasticsearch와 Node.js
차곡차곡 쉽게 알아가는 Elasticsearch와 Node.js
 
Altinity Quickstart for ClickHouse-2202-09-15.pdf
Altinity Quickstart for ClickHouse-2202-09-15.pdfAltinity Quickstart for ClickHouse-2202-09-15.pdf
Altinity Quickstart for ClickHouse-2202-09-15.pdf
 
Mvcc in postgreSQL 권건우
Mvcc in postgreSQL 권건우Mvcc in postgreSQL 권건우
Mvcc in postgreSQL 권건우
 
Openstack pour les nuls
Openstack pour les nulsOpenstack pour les nuls
Openstack pour les nuls
 
Airflow presentation
Airflow presentationAirflow presentation
Airflow presentation
 
Introduction to OpenStack Trove & Database as a Service
Introduction to OpenStack Trove & Database as a ServiceIntroduction to OpenStack Trove & Database as a Service
Introduction to OpenStack Trove & Database as a Service
 
Découverte de Elastic search
Découverte de Elastic searchDécouverte de Elastic search
Découverte de Elastic search
 

En vedette

Large scale crawling with Apache Nutch
Large scale crawling with Apache NutchLarge scale crawling with Apache Nutch
Large scale crawling with Apache NutchJulien Nioche
 
Web Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache NutchWeb Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache NutchSteve Watt
 
Large Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and FriendsLarge Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and Friendslucenerevolution
 
Large Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and FriendsLarge Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and FriendsJulien Nioche
 
Nutch as a Web data mining platform
Nutch as a Web data mining platformNutch as a Web data mining platform
Nutch as a Web data mining platformabial
 
Introduction to apache nutch
Introduction to apache nutchIntroduction to apache nutch
Introduction to apache nutchSigmoid
 
An introduction to Storm Crawler
An introduction to Storm CrawlerAn introduction to Storm Crawler
An introduction to Storm CrawlerJulien Nioche
 
Apache Solr crash course
Apache Solr crash courseApache Solr crash course
Apache Solr crash courseTommaso Teofili
 
Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...
Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...
Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...hannonhill
 
Open source enterprise search and retrieval platform
Open source enterprise search and retrieval platformOpen source enterprise search and retrieval platform
Open source enterprise search and retrieval platformmteutelink
 
Populate your Search index, NEST 2016-01
Populate your Search index, NEST 2016-01Populate your Search index, NEST 2016-01
Populate your Search index, NEST 2016-01David Smiley
 
Apache Tika end-to-end
Apache Tika end-to-endApache Tika end-to-end
Apache Tika end-to-endgagravarr
 
Content Analysis with Apache Tika
Content Analysis with Apache TikaContent Analysis with Apache Tika
Content Analysis with Apache TikaPaolo Mottadelli
 
Search Engine Capabilities - Apache Solr(Lucene)
Search Engine Capabilities - Apache Solr(Lucene)Search Engine Capabilities - Apache Solr(Lucene)
Search Engine Capabilities - Apache Solr(Lucene)Manish kumar
 
Metadata Extraction and Content Transformation
Metadata Extraction and Content TransformationMetadata Extraction and Content Transformation
Metadata Extraction and Content TransformationAlfresco Software
 
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)dnaber
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache SolrAndy Jackson
 

En vedette (20)

Large scale crawling with Apache Nutch
Large scale crawling with Apache NutchLarge scale crawling with Apache Nutch
Large scale crawling with Apache Nutch
 
Web Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache NutchWeb Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache Nutch
 
Large Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and FriendsLarge Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and Friends
 
Large Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and FriendsLarge Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and Friends
 
Nutch as a Web data mining platform
Nutch as a Web data mining platformNutch as a Web data mining platform
Nutch as a Web data mining platform
 
Introduction to apache nutch
Introduction to apache nutchIntroduction to apache nutch
Introduction to apache nutch
 
An introduction to Storm Crawler
An introduction to Storm CrawlerAn introduction to Storm Crawler
An introduction to Storm Crawler
 
Apache Solr crash course
Apache Solr crash courseApache Solr crash course
Apache Solr crash course
 
Giraph+Gora in ApacheCon14
Giraph+Gora in ApacheCon14Giraph+Gora in ApacheCon14
Giraph+Gora in ApacheCon14
 
Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...
Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...
Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...
 
Open source enterprise search and retrieval platform
Open source enterprise search and retrieval platformOpen source enterprise search and retrieval platform
Open source enterprise search and retrieval platform
 
Populate your Search index, NEST 2016-01
Populate your Search index, NEST 2016-01Populate your Search index, NEST 2016-01
Populate your Search index, NEST 2016-01
 
Apache Tika end-to-end
Apache Tika end-to-endApache Tika end-to-end
Apache Tika end-to-end
 
Content Analysis with Apache Tika
Content Analysis with Apache TikaContent Analysis with Apache Tika
Content Analysis with Apache Tika
 
ProjectHub
ProjectHubProjectHub
ProjectHub
 
Search Engine Capabilities - Apache Solr(Lucene)
Search Engine Capabilities - Apache Solr(Lucene)Search Engine Capabilities - Apache Solr(Lucene)
Search Engine Capabilities - Apache Solr(Lucene)
 
Search engine
Search engineSearch engine
Search engine
 
Metadata Extraction and Content Transformation
Metadata Extraction and Content TransformationMetadata Extraction and Content Transformation
Metadata Extraction and Content Transformation
 
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache Solr
 

Similaire à Web Crawling with Apache Nutch

Integration Patterns for Big Data Applications
Integration Patterns for Big Data ApplicationsIntegration Patterns for Big Data Applications
Integration Patterns for Big Data ApplicationsMichael Häusler
 
PGConf APAC 2018 - A PostgreSQL DBAs Toolbelt for 2018
PGConf APAC 2018 - A PostgreSQL DBAs Toolbelt for 2018PGConf APAC 2018 - A PostgreSQL DBAs Toolbelt for 2018
PGConf APAC 2018 - A PostgreSQL DBAs Toolbelt for 2018PGConf APAC
 
Real time analytics at uber @ strata data 2019
Real time analytics at uber @ strata data 2019Real time analytics at uber @ strata data 2019
Real time analytics at uber @ strata data 2019Zhenxiao Luo
 
sudoers: Benchmarking Hadoop with ALOJA
sudoers: Benchmarking Hadoop with ALOJAsudoers: Benchmarking Hadoop with ALOJA
sudoers: Benchmarking Hadoop with ALOJANicolas Poggi
 
Bquery Reporting & Analytics Architecture
Bquery Reporting & Analytics ArchitectureBquery Reporting & Analytics Architecture
Bquery Reporting & Analytics ArchitectureCarst Vaartjes
 
Lupus Decoupled Drupal - Drupal Austria Meetup - 2023-04.pdf
Lupus Decoupled Drupal - Drupal Austria Meetup - 2023-04.pdfLupus Decoupled Drupal - Drupal Austria Meetup - 2023-04.pdf
Lupus Decoupled Drupal - Drupal Austria Meetup - 2023-04.pdfWolfgangZiegler6
 
Minerva: Drill Storage Plugin for IPFS
Minerva: Drill Storage Plugin for IPFSMinerva: Drill Storage Plugin for IPFS
Minerva: Drill Storage Plugin for IPFSBowenDing4
 
Enterprise guide to building a Data Mesh
Enterprise guide to building a Data MeshEnterprise guide to building a Data Mesh
Enterprise guide to building a Data MeshSion Smith
 
Drupalcon 2021 - Nuxt.js for drupal developers
Drupalcon 2021 - Nuxt.js for drupal developersDrupalcon 2021 - Nuxt.js for drupal developers
Drupalcon 2021 - Nuxt.js for drupal developersnuppla
 
Cloud Native Analysis Platform for NGS analysis
Cloud Native Analysis Platform for NGS analysisCloud Native Analysis Platform for NGS analysis
Cloud Native Analysis Platform for NGS analysisYaoyu Wang
 
Getting started with postgresql
Getting started with postgresqlGetting started with postgresql
Getting started with postgresqlbotsplash.com
 
Jethro data meetup index base sql on hadoop - oct-2014
Jethro data meetup    index base sql on hadoop - oct-2014Jethro data meetup    index base sql on hadoop - oct-2014
Jethro data meetup index base sql on hadoop - oct-2014Eli Singer
 
Michael stack -the state of apache h base
Michael stack -the state of apache h baseMichael stack -the state of apache h base
Michael stack -the state of apache h basehdhappy001
 
Complex realtime event analytics using BigQuery @Crunch Warmup
Complex realtime event analytics using BigQuery @Crunch WarmupComplex realtime event analytics using BigQuery @Crunch Warmup
Complex realtime event analytics using BigQuery @Crunch WarmupMárton Kodok
 

Similaire à Web Crawling with Apache Nutch (20)

Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
 
Integration Patterns for Big Data Applications
Integration Patterns for Big Data ApplicationsIntegration Patterns for Big Data Applications
Integration Patterns for Big Data Applications
 
PGConf APAC 2018 - A PostgreSQL DBAs Toolbelt for 2018
PGConf APAC 2018 - A PostgreSQL DBAs Toolbelt for 2018PGConf APAC 2018 - A PostgreSQL DBAs Toolbelt for 2018
PGConf APAC 2018 - A PostgreSQL DBAs Toolbelt for 2018
 
Mastro
MastroMastro
Mastro
 
Mastro
MastroMastro
Mastro
 
Real time analytics at uber @ strata data 2019
Real time analytics at uber @ strata data 2019Real time analytics at uber @ strata data 2019
Real time analytics at uber @ strata data 2019
 
Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop Administration
 
Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop Administration
 
sudoers: Benchmarking Hadoop with ALOJA
sudoers: Benchmarking Hadoop with ALOJAsudoers: Benchmarking Hadoop with ALOJA
sudoers: Benchmarking Hadoop with ALOJA
 
Suggest.js
Suggest.jsSuggest.js
Suggest.js
 
Bquery Reporting & Analytics Architecture
Bquery Reporting & Analytics ArchitectureBquery Reporting & Analytics Architecture
Bquery Reporting & Analytics Architecture
 
Lupus Decoupled Drupal - Drupal Austria Meetup - 2023-04.pdf
Lupus Decoupled Drupal - Drupal Austria Meetup - 2023-04.pdfLupus Decoupled Drupal - Drupal Austria Meetup - 2023-04.pdf
Lupus Decoupled Drupal - Drupal Austria Meetup - 2023-04.pdf
 
Minerva: Drill Storage Plugin for IPFS
Minerva: Drill Storage Plugin for IPFSMinerva: Drill Storage Plugin for IPFS
Minerva: Drill Storage Plugin for IPFS
 
Enterprise guide to building a Data Mesh
Enterprise guide to building a Data MeshEnterprise guide to building a Data Mesh
Enterprise guide to building a Data Mesh
 
Drupalcon 2021 - Nuxt.js for drupal developers
Drupalcon 2021 - Nuxt.js for drupal developersDrupalcon 2021 - Nuxt.js for drupal developers
Drupalcon 2021 - Nuxt.js for drupal developers
 
Cloud Native Analysis Platform for NGS analysis
Cloud Native Analysis Platform for NGS analysisCloud Native Analysis Platform for NGS analysis
Cloud Native Analysis Platform for NGS analysis
 
Getting started with postgresql
Getting started with postgresqlGetting started with postgresql
Getting started with postgresql
 
Jethro data meetup index base sql on hadoop - oct-2014
Jethro data meetup    index base sql on hadoop - oct-2014Jethro data meetup    index base sql on hadoop - oct-2014
Jethro data meetup index base sql on hadoop - oct-2014
 
Michael stack -the state of apache h base
Michael stack -the state of apache h baseMichael stack -the state of apache h base
Michael stack -the state of apache h base
 
Complex realtime event analytics using BigQuery @Crunch Warmup
Complex realtime event analytics using BigQuery @Crunch WarmupComplex realtime event analytics using BigQuery @Crunch Warmup
Complex realtime event analytics using BigQuery @Crunch Warmup
 

Dernier

Announcing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK SoftwareAnnouncing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK SoftwareJim McKeeth
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnAmarnathKambale
 
Architecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the pastArchitecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the pastPapp Krisztián
 
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension AidPhilip Schwarz
 
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrainmasabamasaba
 
WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?WSO2
 
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...masabamasaba
 
Harnessing ChatGPT - Elevating Productivity in Today's Agile Environment
Harnessing ChatGPT  - Elevating Productivity in Today's Agile EnvironmentHarnessing ChatGPT  - Elevating Productivity in Today's Agile Environment
Harnessing ChatGPT - Elevating Productivity in Today's Agile EnvironmentVictorSzoltysek
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplatePresentation.STUDIO
 
The title is not connected to what is inside
The title is not connected to what is insideThe title is not connected to what is inside
The title is not connected to what is insideshinachiaurasa2
 
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...masabamasaba
 
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...Bert Jan Schrijver
 
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...masabamasaba
 
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...WSO2
 
tonesoftg
tonesoftgtonesoftg
tonesoftglanshi9
 
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...masabamasaba
 
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024VictoriaMetrics
 
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park %in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park masabamasaba
 
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital TransformationWSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital TransformationWSO2
 
WSO2Con2024 - Enabling Transactional System's Exponential Growth With Simplicity
WSO2Con2024 - Enabling Transactional System's Exponential Growth With SimplicityWSO2Con2024 - Enabling Transactional System's Exponential Growth With Simplicity
WSO2Con2024 - Enabling Transactional System's Exponential Growth With SimplicityWSO2
 

Dernier (20)

Announcing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK SoftwareAnnouncing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK Software
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learn
 
Architecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the pastArchitecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the past
 
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
 
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
 
WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?
 
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
 
Harnessing ChatGPT - Elevating Productivity in Today's Agile Environment
Harnessing ChatGPT  - Elevating Productivity in Today's Agile EnvironmentHarnessing ChatGPT  - Elevating Productivity in Today's Agile Environment
Harnessing ChatGPT - Elevating Productivity in Today's Agile Environment
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation Template
 
The title is not connected to what is inside
The title is not connected to what is insideThe title is not connected to what is inside
The title is not connected to what is inside
 
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
 
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
 
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
 
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
 
tonesoftg
tonesoftgtonesoftg
tonesoftg
 
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
 
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
 
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park %in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
 
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital TransformationWSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
 
WSO2Con2024 - Enabling Transactional System's Exponential Growth With Simplicity
WSO2Con2024 - Enabling Transactional System's Exponential Growth With SimplicityWSO2Con2024 - Enabling Transactional System's Exponential Growth With Simplicity
WSO2Con2024 - Enabling Transactional System's Exponential Growth With Simplicity
 

Web Crawling with Apache Nutch

  • 1. Web Crawling with Apache Nutch Sebastian Nagel snagel@apache.org ApacheCon EU 2014 2014-11-18
  • 2. About Me computational linguist software developer at Exorbyte (Konstanz, Germany) search and data matching prepare data for indexing, cleansing noisy data, web crawling Nutch user since 2008 2012 Nutch committer and PMC
  • 3. Nutch History 2002 started by Doug Cutting and Mike Caffarella open source web-scale crawler and search engine 2004/05 MapReduce and distributed file system in Nutch 2005 Apache incubator, sub-project of Lucene 2006 Hadoop split from Nutch, Nutch based on Hadoop 2007 use Tika for MimeType detection, Tika Parser 2010 2008 start NutchBase, …, 2012 released 2.0 based on Gora 0.2 2009 use Solr for indexing 2010 Nutch top-level project 2012/13 ElasticSearch indexer (2.x), pluggable indexing back-ends (1.x) 2.0 graduate tlp @sourceforge incubator Tika 2002 2009 201.5 1.9 2.2.1 1.8 2.2 1.7 2.1 1.6 1.5 1.4 1.3 1.2 1.1 0.7 0.8 0.9 1.0 0.7.2 0.5 0.4 0.6 Hadoop Gora use Solr
  • 4. What is Nutch now? “an extensible and scalable web crawler based on Hadoop” ▶ runs on top of ▶ scalable: billions of pages possible ▶ some overhead (if scale is not a requirement) ▶ not ideal for low latency ▶ customizable / extensible plugin architecture ▶ pluggable protocols (document access) ▶ URL filters + normalizers ▶ parsing: document formats + meta data extraction ▶ indexing back-ends ▶ mostly used to feed a search index … ▶ …but also data mining
  • 5. Nutch Community ▶ mature Apache project ▶ 6 active committers ▶ maintain two branches (1.x and 2.x) ▶ “friends” — (Apache) projects Nutch delegates work to ▶ Hadoop: scalability, job execution, data serialization (1.x) ▶ Tika: detecting and parsing multiple document formats ▶ Solr, ElasticSearch: make crawled content searchable ▶ Gora (and HBase, Cassandra, …): data storage (2.x) ▶ crawler-commons: robots.txt parsing ▶ steady user base, majority of users ▶ uses 1.x ▶ runs small scale crawls (< 1M pages) ▶ uses Solr for indexing and search
  • 6. Crawler Workflow 0. initialize CrawlDb, inject seed URLs repeat generate-fetch-update cycle n times: 1. generate fetch list: select URLs from CrawlDb for fetching 2. fetch URLs from fetch list 3. parse documents: extract content, metadata and links 4. update CrawlDb status, score and signature, add new URLs inlined or at the end of one crawler run (once for multiple cycles): 5. invert links: map anchor texts to documents the links point to 6. (calculate link rank on web graph, update CrawlDb scores) 7. deduplicate documents by signature 8. index document content, meta data, and anchor texts
  • 8. Workflow Execution ▶ every step is implemented as one (or more) MapReduce job ▶ shell script to run the workflow (bin/crawl) ▶ bin/nutch to run individual tools ▶ inject, generate, fetch, parse, updatedb, invertlinks, index, … ▶ many analysis and debugging tools ▶ local mode ▶ works out-of-the-box (bin package) ▶ useful for testing and debugging ▶ (pseudo-)distributed mode ▶ parallelization, monitor crawls with MapReduce web UI ▶ recompile and deploy job file with configuration changes
  • 9. Features of selected Tools Fetcher ▶ multi-threaded, high throughput ▶ but always be polite ▶ respect robots rules (robots.txt and robots directives) ▶ limit load on crawled servers, guaranteed delays between access to same host, IP, or domain WebGraph ▶ iterative link analysis ▶ on the level of documents, sites, or domains … most other tools ▶ are either quite trivial ▶ and/or rely heavily on plugins
  • 10. Extensible via Plugins Plugins basics ▶ each Plugin implements one (or more) extension points ▶ plugins are activated on demand (property plugin.includes) ▶ multiple active plugins per ext. point ▶ “chained”: sequentially applied (filters and normalizers) ▶ automatically selected (protocols and parser) Extension points and available plugins ▶ URL filter ▶ include/exclude URLs from crawling ▶ plugins: regex, prefix, suffix, domain ▶ URL normalizer ▶ canonicalize URL ▶ regex, domain, querystring
  • 11. Plugins (continued) ▶ protocols ▶ http, https, ftp, file ▶ protocol plugins take care for robots.txt rules ▶ parser ▶ need to parse various document types ▶ now mostly delegated to Tika (plugin parse-tika) ▶ legacy: parse-html, feed, zip ▶ parse filter ▶ extract additional meta data ▶ change extracted plain text ▶ plugins: headings, metatags ▶ indexing filter ▶ add/fill indexed fields ▶ url, title, content, anchor, boost, digest, type, host, … ▶ often in combination with parse filter to extract field content ▶ plugins: basic, more, anchor, metadata, …
  • 12. Plugins (continued) ▶ index writer ▶ connect to indexing back-ends ▶ send/write additions / updates / deletions ▶ plugins: Solr, ElasticSearch ▶ scoring filter ▶ pass score to outlinks (in 1.x: quite a few hops) ▶ change scores in CrawlDb via inlinks ▶ can do magic: limit crawl by depth, focused crawling, … ▶ OPIC (On-line Page Importance Computation, [1]) ▶ online: good strategy to crawl relevant content first ▶ scores also ok for ranking ▶ but seeds (and docs “close” to them) are favored ▶ scores get out of control for long-running continuous crawls ▶ LinkRank ▶ link rank calculation done separately on WebGraph ▶ scores are fed back to CrawlDb Complete list of plugins: http://nutch.apache.org/apidocs/apidocs-1.9/
  • 13. Pluggable classes Extensible interfaces (pluggable class implementations) ▶ fetch schedule ▶ decide whether to add URL to fetch list ▶ set next fetch time and re-fetch intervals ▶ default implementation: fixed re-fetch interval ▶ available: interval adaptive to document change frequency ▶ signature calculation ▶ used for deduplication and not-modified detection ▶ MD5 sum of binary document or plain text ▶ TextProfile based on filtered word frequency list
  • 14. Data Structures and Storage in Nutch 1.x All data is stored as map 〈url,someobject〉 in Hadoop map or sequence files Data structures (directories) and stored values: ▶ CrawlDb: all information of URL/document to run and schedule crawling ▶ current status (injected, linked, fetched, gone, …) ▶ score, next fetch time, last modified time ▶ signature (for deduplication) ▶ meta data (container for arbitrary data) ▶ LinkDb: incoming links (URLs and anchor texts) ▶ WebGraph: map files to hold outlinks, inlinks, node scores
  • 15. Data Structures (1.x): Segments Segments store all data related to fetch and parse of a single batch of URLs (one generate-fetch-update cycle) crawl_generate: list of URLs to be fetched with politeness requirements (host-level blocking): ▶ all URLs of one host (or IP) must be in one partition to implement host-level blocking in one single JVM ▶ inside one partition URLs are shuffled: URLs of one host are spread over whole partition to minimize blocking during fetch crawl_fetch: status from fetch (e.g., success, failed, robots denied) content: fetched binary content and meta data (HTTP header) parse_data: extracted meta data, outlinks and anchor texts parse_text: plain-text content crawl_parse: outlinks, scores, signatures, meta data, used to update CrawlDb
  • 16. Storage and Data Flow in Nutch 1.x
  • 17. Storage in Nutch 2.x One big table (WebTable) ▶ all information about one URL/document in a single row: status, metadata, binary content, extracted plain-text, inlink anchors, … ▶ inspired by BigTable paper [3] ▶ …and rise of NoSQL data stores Apache Gora used for storage layer abstraction ▶ OTD (object-to-datastore) for NoSQL data stores ▶ HBase, Cassandra, DynamoDb, Avro, Accumulo, … ▶ data serialization with Avro ▶ back-end specific object-datastore mappings
  • 18. Benefits of Nutch 2.x Benefits ▶ less steps in crawler workflow ▶ simplify Nutch code base ▶ easy to access data from other applications ▶ access storage directly ▶ or via Gora and schema ▶ plugins and MapReduce processing shared with 1.x
  • 19. Performance and Efficiency 2.x? Objective: should be more performant (in terms of latency) ▶ no need to rewrite whole CrawlDb for each smaller update ▶ adaptable to more instant workflows (cf. [5]) …however ▶ benchmark by Julien Nioche [4] shows that Nutch 2.x is slower than 1.x by a factor of 2.5 ▶ not a problem of HBase and Cassandra but due to Gora layer ▶ will hopefully be improved with Nutch 2.3 and recent Gora version which supports filtered scans ▶ still fast at scale
  • 20. …and Drawbacks? Drawbacks ▶ need to install and maintain datastore ▶ higher hardware requirements Impact on Nutch as software project ▶ 2.x planned as replacement, however ▶ still not as stable as 1.x ▶ 1.x has more users ▶ need to maintain two branches ▶ keep in sync, transfer patches
  • 21. News Common Crawl moves to Nutch (2013) http://blog.commoncrawl.org/2014/02/common-crawl-move-to-nutch/ ▶ public data on Amazon S3 ▶ billions of web pages Generic deduplication (1.x) ▶ based on CrawlDb only ▶ no need to pull document signatures and scores from index ▶ no special implementations for indexing back-ends (Solr, etc.) Nutch participated in GSoC 2014 …
  • 22. Nutch Web App . GSoC 2014 (Fjodor Vershinin): “Create a Wicket-based Web Application for Nutch” ▶ Nutch Server and REST API ▶ Web App client to run and schedule crawls ▶ only Nutch 2.x (for now) ▶ needs completion: change configuration, analytics (logs), …
  • 23. What’s coming next? Nutch 1.x is still alive! …but we hope to release 2.3 soon ▶ with performance improvements from recent Gora release Fix bugs, improve usability ▶ our bottleneck: review patches, get them committed ▶ try hard to keep 1.x and 2.x in sync Pending new features ▶ sitemap protocol ▶ canonical links
  • 24. What’s coming next? (mid- and long-term) Move to Hadoop 2.x and new MapReduce API … complete Rest API and Web App … port it “back” to 1.x … use graph library (Giraph) for link rank computation … switch from batch processing to streaming ▶ Storm / Spark / ? ▶ parts of the workflow
  • 25. Thanks and acknowledgments … … to Julien Nioche, Lewis John McGibbney, Andrzej Białecki, Chris Mattmann, Doug Cutting, Mike Cafarella – for inspirations and material from previous Nutch presentations … and to all Nutch committers for review and comments on early versions of theses slides.
  • 26. References [1] Abiteboul, Serge; Preda, Mihai; Cobena, Gregory, 2003: Adaptive on-line page importance computation. In: Proceedings of the 12th International Conference on World Wide Web, WWW ’03, pp. 280–290, ACM, New York, NY, USA, http://procod.com/mihai_preda/paper/online_page_importance.pdf. [2] Białecki, Andrzej, 2012: Nutch search engine. In: White, Tom (ed.) Hadoop: The Definitive Guide, pp. 565–579, O’Reilly. [3] Chang, Fay; Dean, Jeffrey; Ghemawat, Sanjay; Hsieh, Wilson C.; Wallach, Deborah A.; Burrows, Mike; Chandra, Tushar; Fikes, Andrew; Gruber, Robert E., 2006: Bigtable: A distributed storage system for structured data. In: Proceedings of the 7th Conference on USENIX Symposium on Operating Systems Design and Implementation (OSDI ’06), vol. 7, pp. 205–218, http://www.usenix.org/events/osdi06/tech/chang/chang.pdf. [4] Nioche, Julien, 2013, Nutch fight! 1.7 vs 2.2.1. Blog post, http://digitalpebble.blogspot.co.uk/2013/09/nutch-fight-17-vs-221.html. [5] Peng, Daniel Dabek, Frank, 2010: Large-scale incremental processing using distributed transactions and notifications. In: 9th USENIX Symposium on Operating Systems Design and Implementation, pp. 4–6, http://www.usenix.org/event/osdi10/tech/full_papers/Peng.pdf.