Large Scale Crawling with Apache Nutch
Julien Nioche
julien@digitalpebble.com

ApacheCon Europe 2012
About myself
 DigitalPebble Ltd, Bristol (UK)
 Specialised in Text Engineering
    –   Web Crawling
    –   Natural Language Processing
    –   Information Retrieval
    –   Data Mining
   Strong focus on Open Source & Apache ecosystem
   Apache Nutch VP
   Apache Tika committer
   User | Contributor
    –   SOLR, Lucene
    –   GATE, UIMA
    –   Mahout
    –   Behemoth


Objectives

 Overview of the project

 Nutch in a nutshell

 Nutch 2.x

 Future developments




Nutch?
 “Distributed framework for large scale web crawling”
   – but does not have to be large scale at all
   – or even on the web (file-protocol)


  Apache TLP since May 2010


  Based on Apache Hadoop

  Indexing and Search



Short history
 2002/2003 : Started by Doug Cutting & Mike Cafarella
 2004 : sub-project of Lucene @Apache
 2005 : MapReduce implementation in Nutch
   – 2006 : Hadoop sub-project of Lucene @Apache

 2006/7 : Parser and MimeType in Tika
   – 2008 : Tika sub-project of Lucene @Apache

 May 2010 : Top-Level Project (TLP) at Apache
 June 2012 : Nutch 1.5.1
 Oct 2012 : Nutch 2.1

Recent Releases


 [Figure: release timeline, 06/09 to 06/12]
   – trunk : 1.0, 1.1, 1.2, 1.3, 1.4, 1.5.1
   – 2.x : 2.0, 2.1




Community

 6 active committers / PMC members
   – 4 within the last 18 months


 Constant stream of new contributions & bug reports

 Steady numbers of mailing list subscribers and traffic

 Nutch is a very healthy 10-year old




Why use Nutch?

 Usual reasons
   – Mature, business-friendly license, community, ...


 Scalability
   – Tried and tested on very large scale
   – Hadoop cluster : installation and skills

 Features
   – e.g. Index with SOLR
   – PageRank implementation
   – Can be extended with plugins




Not the best option when ...

 Hadoop based == batch processing == high latency
   – No guarantee that a page will be fetched / parsed / indexed within X
     minutes|hours


 JavaScript / Ajax not supported (yet)




Use cases

 Crawl for IR
   – Generic or vertical
   – Index and Search with SOLR
   – Single node to large clusters on Cloud

 … but also
   – Data Mining
   – NLP (e.g. Sentiment Analysis)
   – ML


   – MAHOUT / UIMA / GATE
   – Use Behemoth as glueware (https://github.com/DigitalPebble/behemoth)




Customer cases
 [Chart axes : Specificity (Verticality) vs. Scale]

 Use case : BetterJobs.com
   – Single server
   – Aggregates content from job portals
   – Extracts and normalizes structure (description, requirements, locations)
   – ~1M pages total
   – Feeds SOLR index

 Use case : SimilarPages.com
   – Large cluster on Amazon EC2 (up to 400 nodes)
   – Fetched & parsed 3 billion pages
   – 10+ billion pages in CrawlDB (~100TB data)
   – 200+ million lists of similarities
   – No indexing / search involved


Typical Nutch Steps
 Same in 1.x and 2.x
 Sequence of batch operations
      1)   Inject → populates CrawlDB from seed list
      2)   Generate → Selects URLs to fetch in segment
      3)   Fetch → Fetches URLs from segment
      4)   Parse → Parses content (text + metadata)
      5)   UpdateDB → Updates CrawlDB (new URLs, new status...)
      6)   InvertLinks → Builds web graph
      7)   SOLRIndex → Send docs to SOLR
      8)   SOLRDedup → Remove duplicate docs based on signature
 Repeat steps 2 to 8
 Or use the all-in-one crawl script
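
The batch sequence above can be sketched as a toy loop in Python. This is only an illustration under stated assumptions: the CrawlDB is a plain dict, the "web" is a hypothetical in-memory graph, the status strings and function names are invented for the example, and indexing and deduplication are left out. It is not how Nutch itself is implemented.

```python
# Toy "web": URL -> outlinks. Hypothetical data standing in for real fetching and parsing.
WEB = {
    "http://a.example/": ["http://a.example/1", "http://b.example/"],
    "http://a.example/1": ["http://a.example/"],
    "http://b.example/": ["http://c.example/"],
    "http://c.example/": [],
}


def inject(crawldb, seeds):
    """Inject: add seed URLs to the CrawlDB as unfetched."""
    for url in seeds:
        crawldb.setdefault(url, "unfetched")


def generate(crawldb, top_n):
    """Generate: select up to top_n unfetched URLs as the segment's fetch list."""
    return sorted(u for u, s in crawldb.items() if s == "unfetched")[:top_n]


def fetch_and_parse(segment):
    """Fetch + Parse: return {url: outlinks}; None marks a failed fetch."""
    return {url: WEB.get(url) for url in segment}


def update_db(crawldb, parsed):
    """UpdateDB: record new statuses and add newly discovered URLs."""
    for url, outlinks in parsed.items():
        if outlinks is None:
            crawldb[url] = "failed"
            continue
        crawldb[url] = "fetched"
        for link in outlinks:
            crawldb.setdefault(link, "unfetched")


def crawl(seeds, rounds, top_n):
    """Run inject once, then repeat generate / fetch / parse / update."""
    crawldb = {}
    inject(crawldb, seeds)
    for _ in range(rounds):
        segment = generate(crawldb, top_n)
        if not segment:
            break  # nothing left to fetch
        update_db(crawldb, fetch_and_parse(segment))
    return crawldb
```

Calling `crawl(["http://a.example/"], rounds=2, top_n=10)` leaves the page discovered in the second round still unfetched, which is why the real steps are repeated.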

Main steps


 [Diagram: Seed List, CrawlDB, Segment, LinkDB]

 Segment
      /crawl_fetch/
      /crawl_generate/
      /content/
      /crawl_parse/
      /parse_data/
      /parse_text/


Frontier expansion

 Manual “discovery”
   – Adding new URLs by hand, “seeding”

 Automatic discovery of new resources (frontier expansion)
   – Not all outlinks are equally useful - control
   – Requires content parsing and link extraction

 [Figure: frontier growing outward from the seed over iterations i=1, i=2, i=3]

  [Slide courtesy of A. Bialecki]
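
Frontier expansion as described above can be sketched as a layered breadth-first walk. In this minimal Python sketch, `expand_frontier`, the `keep` predicate (standing in for the "control" over which outlinks are followed) and the tiny graph are all hypothetical names and data chosen for illustration.

```python
def expand_frontier(seeds, outlinks_of, keep, iterations):
    """Return the sets of URLs discovered at each iteration (index 0 = seeds)."""
    seen = set(seeds)
    frontier = set(seeds)
    layers = [set(seeds)]
    for _ in range(iterations):
        discovered = set()
        for url in frontier:
            for link in outlinks_of(url):
                # Only follow links that are new and pass the filter.
                if link not in seen and keep(link):
                    discovered.add(link)
        if not discovered:
            break
        seen |= discovered
        layers.append(discovered)
        frontier = discovered
    return layers


# Hypothetical link graph: the "ads/" outlink is judged not useful and dropped.
GRAPH = {"seed": ["a", "b", "ads/x"], "a": ["c"], "b": ["c", "seed"], "c": ["d"], "d": []}
layers = expand_frontier(
    ["seed"],
    lambda url: GRAPH.get(url, []),
    lambda url: not url.startswith("ads/"),
    iterations=3,
)
```

Each entry of `layers` corresponds to one ring of the figure: the seed, then what is reachable at i=1, i=2 and i=3.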

An extensible framework
 Plugins
   – Activated with parameter 'plugin.includes'
   – Implement one or more endpoints

 Endpoints
   –   Protocol
   –   Parser
   –   HtmlParseFilter (ParseFilter in Nutch 2.x)
   –   ScoringFilter (used in various places)
   –   URLFilter (ditto)
   –   URLNormalizer (ditto)
   –   IndexingFilter
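
As a rough illustration of plugin activation, the sketch below selects plugins by matching their ids against a `plugin.includes` value. It assumes the value is a regular expression that must match the whole plugin id. The list of available ids and the function name `active_plugins` are hypothetical.

```python
import re

# Hypothetical plugin ids in Nutch's naming style; the list is illustrative only.
AVAILABLE = [
    "protocol-http", "protocol-file",
    "urlfilter-regex", "urlfilter-suffix",
    "parse-html", "parse-tika",
    "index-basic", "scoring-opic",
]


def active_plugins(plugin_includes, available):
    """Keep the plugin ids that fully match the plugin.includes expression."""
    pattern = re.compile(plugin_includes)
    return [pid for pid in available if pattern.fullmatch(pid)]


includes = "protocol-http|urlfilter-regex|parse-(html|tika)|index-basic|scoring-opic"
enabled = active_plugins(includes, AVAILABLE)
```

With this value, `protocol-file` and `urlfilter-suffix` stay inactive because no alternative in the expression matches them.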




Features

 Fetcher
   –   Multi-threaded fetcher
   –   Follows robots.txt
   –   Groups URLs per hostname / domain / IP
   –   Limits the number of URLs per round of fetching
   –   Default values are polite but can be made more aggressive

 Crawl Strategy
   – Breadth-first but can be depth-first
   – Configurable via custom scoring plugins

 Scoring
   – OPIC (On-line Page Importance Calculation) by default
   – LinkRank
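
Two of the fetcher behaviours listed above, grouping URLs per hostname and capping how many URLs are taken per round, can be sketched as follows. The function name and the `max_per_host` parameter are assumptions made for the example, not Nutch configuration names.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_host(urls, max_per_host):
    """Build one queue per hostname, keeping at most max_per_host URLs in each."""
    queues = defaultdict(list)
    for url in urls:
        host = urlsplit(url).hostname or ""
        if len(queues[host]) < max_per_host:
            queues[host].append(url)
    return dict(queues)


urls = [
    "http://a.example/1",
    "http://a.example/2",
    "http://a.example/3",
    "http://b.example/x",
]
queues = group_by_host(urls, max_per_host=2)
```

Keeping one queue per host is what makes politeness possible: a delay can be applied between requests to the same queue while other hosts are fetched in parallel.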


Features (cont.)

 Protocols
   – Http, file, ftp, https

 Scheduling
   – Specified or adaptive

 URL filters
   – Regex, FSA, TLD, prefix, suffix


 URL normalisers
   – Default, regex
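
A regex URL filter of the kind listed above can be sketched in a few lines. The sketch assumes a rule file where each line starts with `+` (accept) or `-` (reject) followed by a regular expression, that the first matching rule decides, and that a URL matching no rule is rejected. The rules themselves are illustrative and the function names are hypothetical.

```python
import re


def parse_rules(text):
    """Parse '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def url_filter(url, rules):
    """Return the URL if accepted, or None if rejected (first matching rule wins)."""
    for accept, pattern in rules:
        if pattern.search(url):
            return url if accept else None
    return None  # no rule matched


RULES = parse_rules(r"""
# skip images and stylesheets (illustrative rule)
-\.(gif|jpg|png|css)$
# stay on one hypothetical site
+^https?://([a-z0-9-]+\.)*example\.com/
""")
```

Returning the URL itself rather than a boolean lets filters be chained: each one passes its result to the next until one returns None.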




Features (cont.)

 Parsing with Apache Tika
   – Hundreds of formats supported
   – But some legacy parsers as well
 Other plugins
   –   CreativeCommons
   –   Feeds
   –   Language Identification
   –   Rel tags
   –   Arbitrary Metadata

 Indexing to SOLR
   – Bespoke schema




Data Structures in 1.x
 MapReduce jobs => I/O : Hadoop [Sequence|Map]Files
 CrawlDB => status of known pages
     MapFile : <Text,CrawlDatum>
     CrawlDatum :
         byte status; [fetched? unfetched? failed? redir?]
         long fetchTime;
         byte retries;
         int fetchInterval;
         float score = 1.0f;
         byte[] signature = null;
         long modifiedTime;
         org.apache.hadoop.io.MapWritable metaData;


 Input of : generate - index
 Output of : inject - update
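
To make the record layout above concrete, here is a Python mirror of the CrawlDatum fields with the CrawlDB modelled as a dict keyed by URL. The status codes, the helper `unfetched` and the snake_case field names are assumptions for the example. The real structure is a Hadoop MapFile of Java objects.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical status codes, chosen only for this example.
STATUS_UNFETCHED = 1
STATUS_FETCHED = 2


@dataclass
class CrawlDatum:
    status: int
    fetch_time: int = 0                 # long fetchTime
    retries: int = 0                    # byte retries
    fetch_interval: int = 0             # int fetchInterval
    score: float = 1.0                  # float score = 1.0f
    signature: Optional[bytes] = None   # byte[] signature = null
    modified_time: int = 0              # long modifiedTime
    meta_data: dict = field(default_factory=dict)  # MapWritable metaData


def unfetched(crawldb):
    """URLs a generate step could pick up: those still marked unfetched."""
    return sorted(u for u, d in crawldb.items() if d.status == STATUS_UNFETCHED)


crawldb = {
    "http://example.com/": CrawlDatum(status=STATUS_UNFETCHED),
    "http://example.com/about": CrawlDatum(status=STATUS_FETCHED, fetch_time=1351728000000),
}
```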


Data Structures 1.x

 Segment => round of fetching
 Identified by a timestamp

 Segment
      /crawl_generate/ → SequenceFile<Text,CrawlDatum>
      /crawl_fetch/ → MapFile<Text,CrawlDatum>
      /content/ → MapFile<Text,Content>
      /crawl_parse/ → SequenceFile<Text,CrawlDatum>
      /parse_data/ → MapFile<Text,ParseData>
      /parse_text/ → MapFile<Text,ParseText>


 Can have multiple versions of a page in different
  segments
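
The segment layout above can be illustrated by building the expected paths for one round of fetching. This sketch assumes the timestamp is rendered as `yyyyMMddHHmmss` and that segments live under a directory such as `crawl/segments`. Both are assumptions for the example, as is the function name.

```python
from datetime import datetime
from pathlib import PurePosixPath

SUBDIRS = [
    "crawl_generate", "crawl_fetch", "content",
    "crawl_parse", "parse_data", "parse_text",
]


def segment_paths(segments_dir, when):
    """Map each segment subdirectory name to its path for the given timestamp."""
    name = when.strftime("%Y%m%d%H%M%S")
    base = PurePosixPath(segments_dir) / name
    return {sub: str(base / sub) for sub in SUBDIRS}


paths = segment_paths("crawl/segments", datetime(2012, 11, 5, 14, 30, 0))
```

Because each round gets its own timestamped directory, the same page fetched in two rounds ends up in two segments, as noted above.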


Data Structures – 1.x

 linkDB => storage for Web Graph

     MapFile : <Text,Inlinks>
     Inlinks : HashSet <Inlink>
     Inlink :
         String fromUrl
         String anchor


 Output of : invertlinks
 Input of : SOLRIndex
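
The link inversion that produces this structure can be sketched as follows: outlinks recorded per source page are regrouped per target URL, each entry keeping the source URL and anchor text. The function name and the sample pages are hypothetical.

```python
from collections import defaultdict


def invert_links(outlinks_by_page):
    """Turn {from_url: [(to_url, anchor), ...]} into {to_url: {(from_url, anchor), ...}}."""
    linkdb = defaultdict(set)
    for from_url, outlinks in outlinks_by_page.items():
        for to_url, anchor in outlinks:
            linkdb[to_url].add((from_url, anchor))
    return dict(linkdb)


pages = {
    "http://a.example/": [("http://c.example/", "see C")],
    "http://b.example/": [("http://c.example/", "C page"), ("http://a.example/", "home")],
}
linkdb = invert_links(pages)
```

The anchors collected for a URL are what make the LinkDB useful at indexing time: they describe a page in the words other pages use for it.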




NUTCH 2.x

 2.0 released in July 2012

 2.1 in October 2012

 Same features as 1.x
   – delegation to SOLR, TIKA, MapReduce etc...


 Moved to table-based architecture
   – Wealth of NoSQL projects in last few years


 Abstraction over storage layer → Apache GORA


Apache GORA

 http://gora.apache.org/

 ORM for NoSQL databases
   – and limited SQL support + file based storage
 0.2.1 released in August 2012
 DataStore implementations
       ●   Accumulo             ●   Avro
       ●   Cassandra            ●   DynamoDB (soon)
       ●   HBase                ●   SQL

 Serialization with Apache AVRO
 Object-to-datastore mappings (backend-specific)

AVRO Schema => Java code
 {"name": "WebPage",
  "type": "record",
  "namespace": "org.apache.nutch.storage",
  "fields": [
        {"name": "baseUrl", "type": ["null", "string"] },
        {"name": "status", "type": "int"},
        {"name": "fetchTime", "type": "long"},
        {"name": "prevFetchTime", "type": "long"},
        {"name": "fetchInterval", "type": "int"},
        {"name": "retriesSinceFetch", "type": "int"},
        {"name": "modifiedTime", "type": "long"},
        {"name": "protocolStatus", "type": {
           "name": "ProtocolStatus",
           "type": "record",
           "namespace": "org.apache.nutch.storage",
           "fields": [
               {"name": "code", "type": "int"},
               {"name": "args", "type": {"type": "array", "items": "string"}},
               {"name": "lastModified", "type": "long"}
           ]
           }},
 […]
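
To show how such a schema reads, the Python sketch below loads a shortened copy (three fields taken from the slide, closed so that it parses) and reports each field's type and whether it is nullable. The helper names are hypothetical, and this only inspects the JSON. It does not generate Java code.

```python
import json

# Shortened copy of the schema above: three fields only.
SCHEMA_JSON = """
{"name": "WebPage",
 "type": "record",
 "namespace": "org.apache.nutch.storage",
 "fields": [
   {"name": "baseUrl", "type": ["null", "string"]},
   {"name": "status", "type": "int"},
   {"name": "fetchTime", "type": "long"}
 ]}
"""


def type_name(avro_type):
    """Name of a primitive type, or the kind ('record', 'array', ...) of a complex one."""
    if isinstance(avro_type, str):
        return avro_type
    if isinstance(avro_type, dict):
        return avro_type["type"]
    raise TypeError(f"unsupported type declaration: {avro_type!r}")


def describe_fields(schema):
    """Return {field_name: (type_name, nullable)} for a record schema."""
    described = {}
    for fld in schema["fields"]:
        declared = fld["type"]
        if isinstance(declared, list):  # a union such as ["null", "string"]
            names = [type_name(t) for t in declared if t != "null"]
            described[fld["name"]] = ("|".join(names), "null" in declared)
        else:
            described[fld["name"]] = (type_name(declared), False)
    return described


fields = describe_fields(json.loads(SCHEMA_JSON))
```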

Mapping file (backend specific – HBase)
<gora-orm>

  <table name="webpage">
     <family name="p" maxVersions="1"/> <!-- This can also have params like compression, bloom filters -->
     <family name="f" maxVersions="1"/>
     <family name="s" maxVersions="1"/>
     <family name="il" maxVersions="1"/>
     <family name="ol" maxVersions="1"/>
     <family name="h" maxVersions="1"/>
     <family name="mtdt" maxVersions="1"/>
     <family name="mk" maxVersions="1"/>
  </table>
  <class table="webpage" keyClass="java.lang.String" name="org.apache.nutch.storage.WebPage">

    <!-- fetch fields                        -->
    <field name="baseUrl" family="f" qualifier="bas"/>
    <field name="status" family="f" qualifier="st"/>
    <field name="prevFetchTime" family="f" qualifier="pts"/>
    <field name="fetchTime" family="f" qualifier="ts"/>
    <field name="fetchInterval" family="f" qualifier="fi"/>
    <field name="retriesSinceFetch" family="f" qualifier="rsf"/>
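
The effect of such a mapping can be illustrated by resolving each field to its HBase column, written as `family:qualifier`. The Python sketch below parses a shortened, closed copy of the mapping (one family, three fields from the slide). The function name `column_map` and the check that every field refers to a declared family are assumptions for the example.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed copy of the mapping above.
MAPPING = """
<gora-orm>
  <table name="webpage">
    <family name="f" maxVersions="1"/>
  </table>
  <class table="webpage" keyClass="java.lang.String" name="org.apache.nutch.storage.WebPage">
    <field name="baseUrl" family="f" qualifier="bas"/>
    <field name="status" family="f" qualifier="st"/>
    <field name="fetchTime" family="f" qualifier="ts"/>
  </class>
</gora-orm>
"""


def column_map(xml_text):
    """Return {field_name: 'family:qualifier'} for every mapped field."""
    root = ET.fromstring(xml_text.strip())
    families = {fam.get("name") for fam in root.iter("family")}
    columns = {}
    for fld in root.iter("field"):
        family = fld.get("family")
        if family not in families:
            raise ValueError(f"field {fld.get('name')!r} uses undeclared family {family!r}")
        columns[fld.get("name")] = f"{family}:{fld.get('qualifier')}"
    return columns


columns = column_map(MAPPING)
```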



DataStore operations

 Atomic operations
   – get(K key)
   – put(K key, T obj)
   – delete(K key)

 Querying
   – execute(Query<K, T> query) → Result<K,T>
   – deleteByQuery(Query<K, T> query)

 Wrappers for Apache Hadoop
   – GoraInput|OutputFormat
   – GoraRecordReader|Writer
   – GoraMapper|Reducer
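
The shape of these operations can be illustrated with a toy in-memory store in Python. It is only an analogy: the class `InMemoryDataStore`, its key-range form of querying and the sample keys are assumptions for the example and do not reproduce Gora's Java API.

```python
class InMemoryDataStore:
    """Toy key/value store mirroring get, put, delete, execute and deleteByQuery."""

    def __init__(self):
        self._rows = {}

    def get(self, key):
        return self._rows.get(key)

    def put(self, key, obj):
        self._rows[key] = obj

    def delete(self, key):
        """Remove one row; return True if it existed."""
        return self._rows.pop(key, None) is not None

    def execute(self, start_key=None, end_key=None):
        """Range query over sorted keys (bounds inclusive); yields (key, obj) pairs."""
        for key in sorted(self._rows):
            if start_key is not None and key < start_key:
                continue
            if end_key is not None and key > end_key:
                continue
            yield key, self._rows[key]

    def delete_by_query(self, start_key=None, end_key=None):
        """Delete every row in the key range; return how many were removed."""
        keys = [key for key, _ in self.execute(start_key, end_key)]
        for key in keys:
            del self._rows[key]
        return len(keys)


# Illustrative keys with the host reversed, so one domain's pages sort together.
store = InMemoryDataStore()
store.put("com.example:http/", {"status": 2})
store.put("com.example:http/a", {"status": 1})
store.put("org.other:http/", {"status": 1})
example_rows = list(store.execute("com.", "com.\uffff"))
```

Range queries over sorted keys are the reason key design matters in a table-based store: grouping related pages under a common prefix turns "all pages of a domain" into a single scan.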



GORA in Nutch

 AVRO schema provided and Java code pre-generated


 Mapping files provided for backends
   – can be modified if necessary

 Need to rebuild to get dependencies for backend
   – No binary distribution of Nutch 2.x

 http://wiki.apache.org/nutch/Nutch2Tutorial



Benefits

 Storage still distributed and replicated
 but one big table
   – status, metadata, content, text → one place
 Simplified logic in Nutch
   – Simpler code for updating / merging information
 More efficient (?)
   – No need to read / write entire structure to update records
   – No comparison available yet + early days for GORA
 Easier interaction with other resources
   – Third-party code just needs to use GORA and schema



Drawbacks

 More stuff to install and configure :-)

 Not as stable as Nutch 1.x

 Dependent on success of Gora




2.x Work in progress

 Stabilise backend implementations
   – GORA-HBase most reliable


 Synchronize features with 1.x
   – e.g. has ElasticSearch but missing LinkRank equivalent


 Filter enabled scans (GORA-119)
   – Don't need to de-serialize the whole dataset




Future

 Both 1.x and 2.x in parallel
   – but more frequent releases for 2.x


 New functionalities
   –   Support for SOLRCloud
   –   Sitemap (from Crawler Commons library)
   –   Canonical tag
   –   More indexers (e.g. ElasticSearch) + pluggable indexers?




More delegation
 Great deal done in recent years (SOLR, Tika)

 Share code with crawler-commons
  (http://code.google.com/p/crawler-commons/)
   – Fetcher / protocol handling
   – Robots.txt parsing
   – URL normalisation / filtering


 PageRank-like computations to graph library
   – e.g. Apache Giraph
   – Should be more efficient as well




Where to find out more?

  Project page : http://nutch.apache.org/
  Wiki : http://wiki.apache.org/nutch/
  Mailing lists :
    – user@nutch.apache.org
    – dev@nutch.apache.org


 Chapter in 'Hadoop: The Definitive Guide' (T. White)
   – Understanding Hadoop is essential anyway...


 Support / consulting :
   – http://wiki.apache.org/nutch/Support



Questions




            ?

New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 

Dernier (20)

Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 

Large scale crawling with Apache Nutch

  • 1. Large Scale Crawling with Apache Nutch
      Julien Nioche, julien@digitalpebble.com
      ApacheCon Europe 2012
  • 2. About myself
      DigitalPebble Ltd, Bristol (UK)
      Specialised in Text Engineering
        – Web Crawling
        – Natural Language Processing
        – Information Retrieval
        – Data Mining
      Strong focus on Open Source & Apache ecosystem
      Apache Nutch VP
      Apache Tika committer
      User | Contributor
        – SOLR, Lucene
        – GATE, UIMA
        – Mahout
        – Behemoth
  • 3. Objectives
      Overview of the project
      Nutch in a nutshell
      Nutch 2.x
      Future developments
  • 4. Nutch?
      “Distributed framework for large scale web crawling”
        – but does not have to be large scale at all
        – or even on the web (file-protocol)
      Apache TLP since May 2010
      Based on Apache Hadoop
      Indexing and Search
  • 5. Short history
      2002/2003 : Started by Doug Cutting & Mike Cafarella
      2004 : sub-project of Lucene @Apache
      2005 : MapReduce implementation in Nutch
        – 2006 : Hadoop sub-project of Lucene @Apache
      2006/7 : Parser and MimeType in Tika
        – 2008 : Tika sub-project of Lucene @Apache
      May 2010 : TLP project at Apache
      June 2012 : Nutch 1.5.1
      Oct 2012 : Nutch 2.1
  • 6. Recent Releases
      [Timeline figure, 06/09 to 06/12. trunk: 1.0, 1.1, 1.2, 1.3, 1.4, 1.5.1; 2.x: 2.0, 2.1]
  • 7. Community
      6 active committers / PMC members
        – 4 within the last 18 months
      Constant stream of new contributions & bug reports
      Steady numbers of mailing list subscribers and traffic
      Nutch is a very healthy 10-year old
  • 8. Why use Nutch?
      Usual reasons
        – Mature, business-friendly license, community, ...
      Scalability
        – Tried and tested on very large scale
        – Hadoop cluster : installation and skills
      Features
        – e.g. Index with SOLR
        – PageRank implementation
        – Can be extended with plugins
  • 9. Not the best option when ...
      Hadoop based == batch processing == high latency
        – No guarantee that a page will be fetched / parsed / indexed within X minutes|hours
      Javascript / Ajax not supported (yet)
  • 10. Use cases
      Crawl for IR
        – Generic or vertical
        – Index and Search with SOLR
        – Single node to large clusters on Cloud
      … but also
        – Data Mining
        – NLP (e.g. Sentiment Analysis)
        – ML
        – MAHOUT / UIMA / GATE
        – Use Behemoth as glueware (https://github.com/DigitalPebble/behemoth)
  • 11. Customer cases
      [Figure axes: Specificity (Verticality) vs. Scale]
      Usecase : BetterJobs.com
        – Single server
        – Aggregates content from job portals
        – Extracts and normalizes structure (description, requirements, locations)
        – ~1M pages total
        – Feeds SOLR index
      Usecase : SimilarPages.com
        – Large cluster on Amazon EC2 (up to 400 nodes)
        – Fetched & parsed 3 billion pages
        – 10+ billion pages in crawlDB (~100TB data)
        – 200+ million lists of similarities
        – No indexing / search involved
  • 12. Typical Nutch Steps
      Same in 1.x and 2.x
      Sequence of batch operations
        1) Inject → populates CrawlDB from seed list
        2) Generate → selects URLs to fetch in segment
        3) Fetch → fetches URLs from segment
        4) Parse → parses content (text + metadata)
        5) UpdateDB → updates CrawlDB (new URLs, new status...)
        6) InvertLinks → builds Webgraph
        7) SOLRIndex → sends docs to SOLR
        8) SOLRDedup → removes duplicate docs based on signature
      Repeat steps 2 to 8
      Or use the all-in-one crawl script
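
The inject / generate / fetch / parse / updatedb cycle listed on this slide can be illustrated with a toy, single-process sketch. It is not how Nutch is implemented (Nutch runs each step as a Hadoop batch job). The `TOY_WEB` dictionary, the status strings and all function names are assumptions made up for the illustration, and the link inversion and SOLR steps are left out.

```python
# Hypothetical in-memory "web": URL -> outlinks. It stands in for real
# fetching and parsing, which Nutch delegates to protocol and parser plugins.
TOY_WEB = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://c.example/"],
    "http://c.example/": ["http://a.example/", "http://d.example/"],
    "http://d.example/": [],
}


def inject(crawldb, seeds):
    """Step 1: populate the crawl DB from the seed list."""
    for url in seeds:
        crawldb.setdefault(url, "unfetched")


def generate(crawldb, top_n):
    """Step 2: select up to top_n unfetched URLs; the list plays the role of a segment."""
    return sorted(u for u, status in crawldb.items() if status == "unfetched")[:top_n]


def fetch_and_parse(segment):
    """Steps 3 and 4: return {url: outlinks}; None marks a failed fetch."""
    return {url: TOY_WEB.get(url) for url in segment}


def update_db(crawldb, parsed):
    """Step 5: record new statuses and add newly discovered URLs."""
    for url, outlinks in parsed.items():
        if outlinks is None:
            crawldb[url] = "failed"
            continue
        crawldb[url] = "fetched"
        for link in outlinks:
            crawldb.setdefault(link, "unfetched")


def crawl(seeds, rounds, top_n):
    """Inject once, then repeat generate / fetch / parse / update."""
    crawldb = {}
    inject(crawldb, seeds)
    for _ in range(rounds):
        segment = generate(crawldb, top_n)
        if not segment:
            break
        update_db(crawldb, fetch_and_parse(segment))
    return crawldb
```

Calling `crawl(["http://a.example/"], rounds=3, top_n=10)` walks the toy graph one hop per round, which mirrors the batch nature of the real cycle: a newly discovered URL is only fetched in a later round.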
  • 13. Main steps
      [Diagram showing: Seed List, CrawlDB, Segment (/crawl_generate/, /crawl_fetch/, /content/, /crawl_parse/, /parse_data/, /parse_text/), LinkDB]
  • 14. Frontier expansion
      Manual “discovery”
        – Adding new URLs by hand, “seeding”
      Automatic discovery of new resources (frontier expansion)
        – Not all outlinks are equally useful - control
        – Requires content parsing and link extraction
      [Figure: frontier growing outwards from the seed at iterations i=1, i=2, i=3. Slide courtesy of A. Bialecki]
  • 15. An extensible framework
      Plugins
        – Activated with parameter 'plugin.includes'
        – Implement one or more endpoints
      Endpoints
        – Protocol
        – Parser
        – HtmlParseFilter (ParseFilter in Nutch 2.x)
        – ScoringFilter (used in various places)
        – URLFilter (ditto)
        – URLNormalizer (ditto)
        – IndexingFilter
  • 16. Features
      Fetcher
        – Multi-threaded fetcher
        – Follows robots.txt
        – Groups URLs per hostname / domain / IP
        – Limits the number of URLs per round of fetching
        – Default values are polite but can be made more aggressive
      Crawl Strategy
        – Breadth-first but can be depth-first
        – Configurable via custom scoring plugins
      Scoring
        – OPIC (On-line Page Importance Calculation) by default
        – LinkRank
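
Grouping URLs per hostname and capping each round can be sketched as below. This is only an illustration of the idea, under the assumption that every candidate URL already carries a score. The function name and the `max_per_host` / `top_n` parameters are hypothetical and are not Nutch configuration keys.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def select_for_fetch(scored_urls, max_per_host, top_n):
    """Pick the highest-scoring URLs, at most max_per_host per hostname
    and at most top_n overall (hypothetical parameters)."""
    per_host = defaultdict(int)
    selected = []
    # Highest score first; ties are broken by URL so the result is deterministic.
    for url, _score in sorted(scored_urls.items(), key=lambda kv: (-kv[1], kv[0])):
        if len(selected) >= top_n:
            break
        host = urlsplit(url).hostname or ""
        if per_host[host] >= max_per_host:
            continue  # this host already has its share of the round
        selected.append(url)
        per_host[host] += 1
    return selected
```

The per-host cap is what keeps a round from being dominated by a single large site, which in turn keeps the fetcher polite towards that site.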
  • 17. Features (cont.)
      Protocols
        – Http, file, ftp, https
      Scheduling
        – Specified or adaptive
      URL filters
        – Regex, FSA, TLD, prefix, suffix
      URL normalisers
        – Default, regex
  • 18. Features (cont.)
      Parsing with Apache Tika
        – Hundreds of formats supported
        – But some legacy parsers as well
      Other plugins
        – CreativeCommons
        – Feeds
        – Language Identification
        – Rel tags
        – Arbitrary Metadata
      Indexing to SOLR
        – Bespoke schema
  • 19. Data Structures in 1.x
      MapReduce jobs => I/O : Hadoop [Sequence|Map]Files
      CrawlDB => status of known pages
        MapFile : <Text,CrawlDatum>
          byte status; [fetched? Unfetched? Failed? Redir?]
          long fetchTime;
          byte retries;
          int fetchInterval;
          float score = 1.0f;
          byte[] signature = null;
          long modifiedTime;
          org.apache.hadoop.io.MapWritable metaData;
      Input of : generate - index
      Output of : inject - update
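
As a rough picture of what one CrawlDB entry holds, the fields listed on this slide can be mirrored in a small record type. The field names follow the slide. The status codes, the plain dictionary standing in for the MapFile, and the `due_for_fetch` helper (which assumes `fetch_time` is the time at which a page is next due) are assumptions for illustration only.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical status codes; the real values are defined by Nutch.
STATUS_UNFETCHED = 1
STATUS_FETCHED = 2


@dataclass
class CrawlDatum:
    """Mirror of the per-URL fields shown on the slide."""
    status: int
    fetch_time: int = 0
    retries: int = 0
    fetch_interval: int = 0
    score: float = 1.0
    signature: Optional[bytes] = None
    modified_time: int = 0
    meta_data: dict = field(default_factory=dict)


def due_for_fetch(crawldb, now):
    """Return the URLs whose fetch_time has passed, in key order.

    crawldb is a plain dict of URL -> CrawlDatum standing in for
    MapFile<Text,CrawlDatum>.
    """
    return sorted(url for url, datum in crawldb.items() if datum.fetch_time <= now)
```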
  • 20. Data Structures 1.x
      Segment => round of fetching
      Identified by a timestamp
        /crawl_generate/ → SequenceFile<Text,CrawlDatum>
        /crawl_fetch/ → MapFile<Text,CrawlDatum>
        /content/ → MapFile<Text,Content>
        /crawl_parse/ → SequenceFile<Text,CrawlDatum>
        /parse_data/ → MapFile<Text,ParseData>
        /parse_text/ → MapFile<Text,ParseText>
      Can have multiple versions of a page in different segments
  • 21. Data Structures – 1.x
      linkDB => storage for Web Graph
        MapFile : <Text,Inlinks>
        Inlinks : HashSet <Inlink>
        Inlink : String fromUrl, String anchor
      Output of : invertlinks
      Input of : SOLRIndex
  • 22. NUTCH 2.x
      2.0 released in July 2012
      2.1 in October 2012
      Common features as 1.x
        – delegation to SOLR, TIKA, MapReduce etc...
      Moved to table-based architecture
        – Wealth of NoSQL projects in last few years
      Abstraction over storage layer → Apache GORA
  • 23. Apache GORA
      http://gora.apache.org/
      ORM for NoSQL databases
        – and limited SQL support + file based storage
      0.2.1 released in August 2012
      DataStore implementations
        ● Accumulo
        ● Avro
        ● Cassandra
        ● DynamoDB (soon)
        ● HBase
        ● SQL
      Serialization with Apache AVRO
      Object-to-datastore mappings (backend-specific)
  • 24. AVRO Schema => Java code
      {"name": "WebPage",
       "type": "record",
       "namespace": "org.apache.nutch.storage",
       "fields": [
         {"name": "baseUrl", "type": ["null", "string"] },
         {"name": "status", "type": "int"},
         {"name": "fetchTime", "type": "long"},
         {"name": "prevFetchTime", "type": "long"},
         {"name": "fetchInterval", "type": "int"},
         {"name": "retriesSinceFetch", "type": "int"},
         {"name": "modifiedTime", "type": "long"},
         {"name": "protocolStatus", "type": {
           "name": "ProtocolStatus",
           "type": "record",
           "namespace": "org.apache.nutch.storage",
           "fields": [
             {"name": "code", "type": "int"},
             {"name": "args", "type": {"type": "array", "items": "string"}},
             {"name": "lastModified", "type": "long"}
           ]
         }},
         […]
  • 25. Mapping file (backend specific – HBase)
      <gora-orm>
        <table name="webpage">
          <family name="p" maxVersions="1"/> <!-- This can also have params like compression, bloom filters -->
          <family name="f" maxVersions="1"/>
          <family name="s" maxVersions="1"/>
          <family name="il" maxVersions="1"/>
          <family name="ol" maxVersions="1"/>
          <family name="h" maxVersions="1"/>
          <family name="mtdt" maxVersions="1"/>
          <family name="mk" maxVersions="1"/>
        </table>
        <class table="webpage" keyClass="java.lang.String" name="org.apache.nutch.storage.WebPage">
          <!-- fetch fields -->
          <field name="baseUrl" family="f" qualifier="bas"/>
          <field name="status" family="f" qualifier="st"/>
          <field name="prevFetchTime" family="f" qualifier="pts"/>
          <field name="fetchTime" family="f" qualifier="ts"/>
          <field name="fetchInterval" family="f" qualifier="fi"/>
          <field name="retriesSinceFetch" family="f" qualifier="rsf"/>
  • 26. DataStore operations
      Atomic operations
        – get(K key)
        – put(K key, T obj)
        – delete(K key)
      Querying
        – execute(Query<K, T> query) → Result<K,T>
        – deleteByQuery(Query<K, T> query)
      Wrappers for Apache Hadoop
        – GORAInput|OutputFormat
        – GoraRecordReader|Writer
        – GORAMapper|Reducer
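
To make the shape of these operations concrete, here is a minimal in-memory stand-in written in Python. It is not the GORA API: the class name is made up, and the query is reduced to a hypothetical inclusive key range, which is only one of the things a real query can express.

```python
class InMemoryStore:
    """Toy key/value store exposing the operations named on the slide."""

    def __init__(self):
        self._rows = {}

    def get(self, key):
        """Return the object stored under key, or None."""
        return self._rows.get(key)

    def put(self, key, obj):
        """Insert or overwrite the object stored under key."""
        self._rows[key] = obj

    def delete(self, key):
        """Remove key; return True if it existed."""
        return self._rows.pop(key, None) is not None

    def _in_range(self, key, start_key, end_key):
        after_start = start_key is None or key >= start_key
        before_end = end_key is None or key <= end_key
        return after_start and before_end

    def execute(self, start_key=None, end_key=None):
        """Return (key, obj) pairs in key order for start_key <= key <= end_key."""
        return [
            (key, self._rows[key])
            for key in sorted(self._rows)
            if self._in_range(key, start_key, end_key)
        ]

    def delete_by_query(self, start_key=None, end_key=None):
        """Delete every row in the key range; return how many were removed."""
        doomed = [key for key, _ in self.execute(start_key, end_key)]
        for key in doomed:
            del self._rows[key]
        return len(doomed)
```

Because all state for a page sits under one key, an update is a `get` followed by a `put` on that key rather than a rewrite of a whole file.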
  • 27. GORA in Nutch
      AVRO schema provided and java code pre-generated
      Mapping files provided for backends
        – can be modified if necessary
      Need to rebuild to get dependencies for backend
        – No binary distribution of Nutch 2.x
      http://wiki.apache.org/nutch/Nutch2Tutorial
  • 28. Benefits
      Storage still distributed and replicated
      but one big table
        – status, metadata, content, text → one place
      Simplified logic in Nutch
        – Simpler code for updating / merging information
      More efficient (?)
        – No need to read / write entire structure to update records
        – No comparison available yet + early days for GORA
      Easier interaction with other resources
        – Third-party code just needs to use GORA and schema
  • 29. Drawbacks
      More stuff to install and configure :-)
      Not as stable as Nutch 1.x
      Dependent on success of Gora
  • 30. 2.x Work in progress
      Stabilise backend implementations
        – GORA-HBase most reliable
      Synchronize features with 1.x
        – e.g. has ElasticSearch but missing LinkRank equivalent
      Filter enabled scans (GORA-119)
        – Don't need to de-serialize the whole dataset
  • 31. Future
      Both 1.x and 2.x in parallel
        – but more frequent releases for 2.x
      New functionalities
        – Support for SOLRCloud
        – Sitemap (from Crawler Commons library)
        – Canonical tag
        – More indexers (e.g. ElasticSearch) + pluggable indexers?
  • 32. More delegation
      Great deal done in recent years (SOLR, Tika)
      Share code with crawler-commons (http://code.google.com/p/crawler-commons/)
        – Fetcher / protocol handling
        – Robots.txt parsing
        – URL normalisation / filtering
      PageRank-like computations to graph library
        – e.g. Apache Giraph
        – Should be more efficient as well
  • 33. Where to find out more?
      Project page : http://nutch.apache.org/
      Wiki : http://wiki.apache.org/nutch/
      Mailing lists :
        – user@nutch.apache.org
        – dev@nutch.apache.org
      Chapter in 'Hadoop the Definitive Guide' (T. White)
        – Understanding Hadoop is essential anyway...
      Support / consulting :
        – http://wiki.apache.org/nutch/Support
  • 34. Questions ?

Editor's notes

  1. I'll be talking about large scale document processing, and more specifically about Behemoth, which is an open source project based on Hadoop.
  2. A few words about myself just before I start... What I mean by Text Engineering is a variety of activities ranging from ... What makes the identity of DP is ... The main projects I am involved in are ...
  3. Note that I mention crawling and not web search → used not only for search. Nutch used to do indexing and search using Lucene but now delegates this to SOLR.
  4. Endpoints are called in various places: URL filters and normalisers in a lot of places, same for Scoring Filters.
  5. Main steps in Nutch. More actions available. Shell wrappers around Hadoop commands.
  6. Main steps in Nutch. More actions available. Shell wrappers around Hadoop commands.
  7. Endpoints are called in various places: URL filters and normalisers in a lot of places, same for Scoring Filters.
  8. Fetcher: multithreaded but polite.
  9. Fetcher: multithreaded but polite.
  10. Writable object: crawl datum.
  11. What does this mean for Nutch?
  12. What does this mean for Nutch?