SlideShare une entreprise Scribd logo
1  sur  35
Large Scale Crawling with

   Apache
Julien Nioche
julien@digitalpebble.com

ApacheCon Europe 2012
About myself
 DigitalPebble Ltd, Bristol (UK)
 Specialised in Text Engineering
    –   Web Crawling
    –   Natural Language Processing
    –   Information Retrieval
    –   Data Mining
   Strong focus on Open Source & Apache ecosystem
   Apache Nutch VP
   Apache Tika committer
   User | Contributor
    –   SOLR, Lucene
    –   GATE, UIMA
    –   Mahout
    –   Behemoth


                                                     2 / 37
Objectives

 Overview of the project

 Nutch in a nutshell

 Nutch 2.x

 Future developments




                            3 / 37
Nutch?
 “Distributed framework for large scale web crawling”
   – but does not have to be large scale at all
   – or even on the web (file-protocol)


  Apache TLP since May 2010


  Based on Apache Hadoop

  Indexing and Search



                                                     4 / 37
Short history
 2002/2003 : Started By Doug Cutting & Mike Caffarella
 2004 : sub-project of Lucene @Apache
 2005 : MapReduce implementation in Nutch
   – 2006 : Hadoop sub-project of Lucene @Apache

 2006/7 : Parser and MimeType in Tika
   – 2008 : Tika sub-project of Lucene @Apache

 May 2010 : TLP project at Apache
 June 2012 : Nutch 1.5.1
 Oct 2012 : Nutch 2.1

                                                    5 / 37
Recent Releases


        1.0            1.1 1.2   1.3     1.4 1.5.1
trunk


               2.x
                                               2.0 2.1



              06/09   06/10      06/11       06/12




                                                         7 / 37
Community

 6 active committers / PMC members
   – 4 within the last 18 months


 Constant stream of new contributions & bug reports

 Steady numbers of mailing list subscribers and traffic

 Nutch is a very healthy 10-year old




                                                       9 / 37
Why use Nutch?

 Usual reasons
   – Mature, business-friendly license, community, ...


 Scalability
   – Tried and tested on very large scale
   – Hadoop cluster : installation and skills

 Features
   – e.g. Index with SOLR
   – PageRank implementation
   – Can be extended with plugins




                                                         10 / 37
Not the best option when ...

 Hadoop based == batch processing == high latency
   – No guarantee that a page will be fetched / parsed / indexed within X
     minutes|hours


 Javascript / Ajax not supported (yet)




                                                                      11 / 37
Use cases

 Crawl for IR
   – Generic or vertical
   – Index and Search with SOLR
   – Single node to large clusters on Cloud

 … but also
   – Data Mining
   – NLP (e.g.Sentiment Analysis)
   – ML


   – MAHOUT / UIMA / GATE
   – Use Behemoth as glueware (https://github.com/DigitalPebble/behemoth)




                                                                   12 / 37
Customer cases
Specificity (Verticality)
                       Usecase : BetterJobs.com
                        –   Single server
                        –   Aggregates content from job portals
                        –   Extracts and normalizes structure (description,
                            requirements, locations)
                        –   ~1M pages total
                        –   Feeds SOLR index


          Usecase : SimilarPages.com
           –   Large cluster on Amazon EC2 (up to 400
               nodes)
           –   Fetched & parsed 3 billion pages
           –   10+ billion pages in crawlDB (~100TB data)
           –   200+ million lists of similarities
           –   No indexing / search involved
                                                                              Scale


                                                                                13 / 37
Typical Nutch Steps
 Same in 1.x and 2.x
 Sequence of batch operations
      1)   Inject → populates CrawlDB from seed list
      2)   Generate → Selects URLS to fetch in segment
      3)   Fetch → Fetches URLs from segment
      4)   Parse → Parses content (text + metadata)
      5)   UpdateDB → Updates CrawlDB (new URLs, new status...)
      6)   InvertLinks → Build Webgraph
      7)   SOLRIndex → Send docs to SOLR
      8)   SOLRDedup → Remove duplicate docs based on signature
 Repeat steps 2 to 8
 Or use the all-in-one crawl script

                                                          14 / 37
Main steps


 Seed
 List        CrawlDB                 Segment
                                /
                                /crawl_fetch/
                                crawl_generate/
                                /content/
                                /crawl_parse/
                                /parse_data/
                                /parse_text/




                       LinkDB


                                                  15 / 37
Frontier expansion

 Manual “discovery”
   – Adding new URLs by
     hand, “seeding”


 Automatic discovery
  of new resources
  (frontier expansion)
   – Not all outlinks are
     equally useful - control       seed
   – Requires content
                                           i=1
     parsing and link
     extraction
                                                 i=2
                                                        i=3

  [Slide courtesy of A. Bialecki]

                                                       16 / 37
An extensible framework
 Plugins
   – Activated with parameter 'plugin.includes'
   – Implement one or more endpoints

 Endpoints
   –   Protocol
   –   Parser
   –   HtmlParseFilter (ParseFilter in Nutch 2.x)
   –   ScoringFilter (used in various places)
   –   URLFilter (ditto)
   –   URLNormalizer (ditto)
   –   IndexingFilter




                                                    17 / 37
Features

 Fetcher
   –   Multi-threaded fetcher
   –   Follows robots.txt
   –   Groups URLs per hostname / domain / IP
   –   Limit the number of URLs for round of fetching
   –   Default values are polite but can be made more aggressive

 Crawl Strategy
   – Breadth-first but can be depth-first
   – Configurable via custom scoring plugins

 Scoring
   – OPIC (On-line Page Importance Calculation) by default
   – LinkRank


                                                                   18 / 37
Features (cont.)

 Protocols
   – Http, file, ftp, https

 Scheduling
   – Specified or adaptative

 URL filters
   – Regex, FSA, TLD, prefix, suffix


 URL normalisers
   – Default, regex




                                       19 / 37
Features (cont.)

 Parsing with Apache Tika
   – Hundreds of formats supported
   – But some legacy parsers as well
 Other plugins
   –   CreativeCommons
   –   Feeds
   –   Language Identification
   –   Rel tags
   –   Arbitrary Metadata

 Indexing to SOLR
   – Bespoke schema




                                       20 / 37
Data Structures in 1.x
 MapReduce jobs => I/O : Hadoop [Sequence|Map]Files
 CrawlDB => status of known pages
                       MapFile : <Text,CrawlDatum>
                       byte status; [fetched? Unfetched? Failed? Redir?]
                       long fetchTime;
                       byte retries;
     CrawlDB           int fetchInterval;
                       float score = 1.0f;
                       byte[] signature = null;
                       long modifiedTime;
                       org.apache.hadoop.io.MapWritable metaData;


 Input of : generate - index
 Output of : inject - update


                                                                   21 / 37
Data Structures 1.x

 Segment => round of fetching
 Identified by a timestamp

 Segment
      /crawl_generate/ → SequenceFile<Text,CrawlDatum>
      /crawl_fetch/ → MapFile<Text,CrawlDatum>
      /content/ → MapFile<Text,Content>
      /crawl_parse/ → SequenceFile<Text,CrawlDatum>
      /parse_data/ → MapFile<Text,ParseData>
      /parse_text/ → MapFile<Text,ParseText>


 Can have multiple versions of a page in different
  segments


                                                         22 / 37
Data Structures – 1.x

 linkDB => storage for Web Graph

                        MapFile : <Text,Inlinks>
                        Inlinks : HashSet <Inlink>
     LinkDB             Inlink :
                                 String fromUrl
                                 String anchor


 Output of : invertlinks
 Input of : SOLRIndex




                                                     23 / 37
NUTCH 2.x

 2.0 released in July 2012

 2.1 in October 2012

 Common features as 1.x
   – delegation to SOLR, TIKA, MapReduce etc...


 Moved to table-based architecture
   – Wealth of NoSQL projects in last few years


 Abstraction over storage layer → Apache GORA


                                                  24 / 37
Apache GORA

 http://gora.apache.org/

 ORM for NoSQL databases
   – and limited SQL support + file based storage
 0.2.1 released in August 2012
 DataStore implementations
       ●   Accumulo             ●   Avro
       ●   Cassandra            ●   DynamoDB (soon)
       ●   HBase                ●   SQL

 Serialization with Apache AVRO
 Object-to-datastore mappings (backend-specific)

                                                      25 / 37
AVRO Schema => Java code
 {"name": "WebPage",
  "type": "record",
  "namespace": "org.apache.nutch.storage",
  "fields": [
        {"name": "baseUrl", "type": ["null", "string"] },
        {"name": "status", "type": "int"},
        {"name": "fetchTime", "type": "long"},
        {"name": "prevFetchTime", "type": "long"},
        {"name": "fetchInterval", "type": "int"},
        {"name": "retriesSinceFetch", "type": "int"},
        {"name": "modifiedTime", "type": "long"},
        {"name": "protocolStatus", "type": {
           "name": "ProtocolStatus",
           "type": "record",
           "namespace": "org.apache.nutch.storage",
           "fields": [
               {"name": "code", "type": "int"},
               {"name": "args", "type": {"type": "array", "items": "string"}},
               {"name": "lastModified", "type": "long"}
           ]
           }},
 […]

                                                                                 26 / 37
Mapping file (backend specific – Hbase)
<gora-orm>

  <table name="webpage">
     <family name="p" maxVersions="1"/> <!-- This can also have params like compression, bloom filters -->
     <family name="f" maxVersions="1"/>
     <family name="s" maxVersions="1"/>
     <family name="il" maxVersions="1"/>
     <family name="ol" maxVersions="1"/>
     <family name="h" maxVersions="1"/>
     <family name="mtdt" maxVersions="1"/>
     <family name="mk" maxVersions="1"/>
  </table>
  <class table="webpage" keyClass="java.lang.String" name="org.apache.nutch.storage.WebPage">

    <!-- fetch fields                        -->
    <field name="baseUrl" family="f" qualifier="bas"/>
    <field name="status" family="f" qualifier="st"/>
    <field name="prevFetchTime" family="f" qualifier="pts"/>
    <field name="fetchTime" family="f" qualifier="ts"/>
    <field name="fetchInterval" family="f" qualifier="fi"/>
    <field name="retriesSinceFetch" family="f" qualifier="rsf"/>



                                                                                            27 / 37
DataStore operations

 Atomic operations
   – get(K key)
   – put(K key, T obj)
   – delete(K key)

 Querying
   – execute(Query<K, T> query) → Result<K,T>
   – deleteByQuery(Query<K, T> query)

 Wrappers for Apache Hadoop
   – GORAInput|OutputFormat
   – GoraRecordReader|Writer
   – GORAMapper|Reducer



                                                28 / 37
GORA in Nutch

 AVRO schema provided and java code pre-generated


 Mapping files provided for backends
   – can be modified if necessary

 Need to rebuild to get dependencies for backend
   – No binary distribution of Nutch 2.x

 http://wiki.apache.org/nutch/Nutch2Tutorial



                                                    29 / 37
Benefits

 Storage still distributed and replicated
 but one big table
   – status, metadata, content, text → one place
 Simplified logic in Nutch
   – Simpler code for updating / merging information
 More efficient (?)
   – No need to read / write entire structure to update records
   – No comparison available yet + early days for GORA
 Easier interaction with other resources
   – Third-party code just need to use GORA and schema



                                                                  30 / 37
Drawbacks

 More stuff to install and configure :-)

 Not as stable as Nutch 1.x

 Dependent on success of Gora




                                            31 / 37
2.x Work in progress

 Stabilise backend implementations
   – GORA-Hbase most reliable


 Synchronize features with 1.x
   – e.g. has ElasticSearch but missing LinkRank equivalent


 Filter enabled scans (GORA-119)
   – Don't need to de-serialize the whole dataset




                                                              32 / 37
Future

 Both 1.x and 2.x in parallel
   – but more frequent releases for 2.x


 New functionalities
   –   Support for SOLRCloud
   –   Sitemap (from Crawler Commons library)
   –   Canonical tag
   –   More indexers (e.g. ElasticSearch) + pluggable indexers?




                                                                  33 / 37
More delegation
 Great deal done in recent years (SOLR, Tika)

 Share code with crawler-commons
  (http://code.google.com/p/crawler-commons/)
   – Fetcher / protocol handling
   – Robots.txt parsing
   – URL normalisation / filtering


 PageRank-like computations to graph library
   – e.g. Apache Giraph
   – Should be more efficient as well




                                                 34 / 37
Where to find out more?

  Project page : http://nutch.apache.org/
  Wiki : http://wiki.apache.org/nutch/
  Mailing lists :
    – user@nutch.apache.org
    – dev@nutch.apache.org


 Chapter in 'Hadoop the Definitive Guide' (T. White)
   – Understanding Hadoop is essential anyway...


 Support / consulting :
   – http://wiki.apache.org/nutch/Support



                                                        35 / 37
Questions




            ?

                36 / 37
37 / 37

Contenu connexe

Tendances

Little Big Data #1. 바닥부터 시작하는 데이터 인프라
Little Big Data #1. 바닥부터 시작하는 데이터 인프라Little Big Data #1. 바닥부터 시작하는 데이터 인프라
Little Big Data #1. 바닥부터 시작하는 데이터 인프라Seongyun Byeon
 
Crawling, indexation & the impact on performance | Brighton SEO
Crawling, indexation & the impact on performance | Brighton SEOCrawling, indexation & the impact on performance | Brighton SEO
Crawling, indexation & the impact on performance | Brighton SEOMartin Sean Fennon
 
Mapping french open data actors on the web with common crawl
Mapping french open data actors on the web with common crawlMapping french open data actors on the web with common crawl
Mapping french open data actors on the web with common crawldata publica
 
Web_Mining_Overview_Nfaoui_El_Habib
Web_Mining_Overview_Nfaoui_El_HabibWeb_Mining_Overview_Nfaoui_El_Habib
Web_Mining_Overview_Nfaoui_El_HabibEl Habib NFAOUI
 
How Lyft Drives Data Discovery
How Lyft Drives Data DiscoveryHow Lyft Drives Data Discovery
How Lyft Drives Data DiscoveryNeo4j
 
E-commerce Search Engine with Apache Lucene/Solr
E-commerce Search Engine with Apache Lucene/SolrE-commerce Search Engine with Apache Lucene/Solr
E-commerce Search Engine with Apache Lucene/SolrVincenzo D'Amore
 
Google Tag Manager | Google Tag Manager Tutorial 2019 | Google Tag Manager Se...
Google Tag Manager | Google Tag Manager Tutorial 2019 | Google Tag Manager Se...Google Tag Manager | Google Tag Manager Tutorial 2019 | Google Tag Manager Se...
Google Tag Manager | Google Tag Manager Tutorial 2019 | Google Tag Manager Se...Simplilearn
 
How Search Works
How Search WorksHow Search Works
How Search WorksAhrefs
 
Technical SEO Presentation
Technical SEO PresentationTechnical SEO Presentation
Technical SEO PresentationJoe Robison
 
Web search engines ( Mr.Mirza )
Web search engines ( Mr.Mirza )Web search engines ( Mr.Mirza )
Web search engines ( Mr.Mirza )Ali Saif Mirza
 
stackconf 2022: Introduction to Vector Search with Weaviate
stackconf 2022: Introduction to Vector Search with Weaviatestackconf 2022: Introduction to Vector Search with Weaviate
stackconf 2022: Introduction to Vector Search with WeaviateNETWAYS
 
Keywords and Keyword Research.pdf
Keywords and Keyword Research.pdfKeywords and Keyword Research.pdf
Keywords and Keyword Research.pdfShristi Shrestha
 
Graph Structure in the Web - Revisited. WWW2014 Web Science Track
Graph Structure in the Web - Revisited. WWW2014 Web Science TrackGraph Structure in the Web - Revisited. WWW2014 Web Science Track
Graph Structure in the Web - Revisited. WWW2014 Web Science TrackChris Bizer
 
SEOkomm 2021: Interne Verlinkung
SEOkomm 2021: Interne VerlinkungSEOkomm 2021: Interne Verlinkung
SEOkomm 2021: Interne VerlinkungJohan Hülsen
 
Learning to Rank - From pairwise approach to listwise
Learning to Rank - From pairwise approach to listwiseLearning to Rank - From pairwise approach to listwise
Learning to Rank - From pairwise approach to listwiseHasan H Topcu
 
How to Automatically Subcategorise Your Website Automatically With Python
How to Automatically Subcategorise Your Website Automatically With PythonHow to Automatically Subcategorise Your Website Automatically With Python
How to Automatically Subcategorise Your Website Automatically With Pythonsearchsolved
 

Tendances (20)

Little Big Data #1. 바닥부터 시작하는 데이터 인프라
Little Big Data #1. 바닥부터 시작하는 데이터 인프라Little Big Data #1. 바닥부터 시작하는 데이터 인프라
Little Big Data #1. 바닥부터 시작하는 데이터 인프라
 
Seo for-content
Seo for-contentSeo for-content
Seo for-content
 
Crawling, indexation & the impact on performance | Brighton SEO
Crawling, indexation & the impact on performance | Brighton SEOCrawling, indexation & the impact on performance | Brighton SEO
Crawling, indexation & the impact on performance | Brighton SEO
 
Mapping french open data actors on the web with common crawl
Mapping french open data actors on the web with common crawlMapping french open data actors on the web with common crawl
Mapping french open data actors on the web with common crawl
 
Search Engine
Search EngineSearch Engine
Search Engine
 
Web_Mining_Overview_Nfaoui_El_Habib
Web_Mining_Overview_Nfaoui_El_HabibWeb_Mining_Overview_Nfaoui_El_Habib
Web_Mining_Overview_Nfaoui_El_Habib
 
How Lyft Drives Data Discovery
How Lyft Drives Data DiscoveryHow Lyft Drives Data Discovery
How Lyft Drives Data Discovery
 
E-commerce Search Engine with Apache Lucene/Solr
E-commerce Search Engine with Apache Lucene/SolrE-commerce Search Engine with Apache Lucene/Solr
E-commerce Search Engine with Apache Lucene/Solr
 
Google Tag Manager | Google Tag Manager Tutorial 2019 | Google Tag Manager Se...
Google Tag Manager | Google Tag Manager Tutorial 2019 | Google Tag Manager Se...Google Tag Manager | Google Tag Manager Tutorial 2019 | Google Tag Manager Se...
Google Tag Manager | Google Tag Manager Tutorial 2019 | Google Tag Manager Se...
 
Web Crawlers
Web CrawlersWeb Crawlers
Web Crawlers
 
How Search Works
How Search WorksHow Search Works
How Search Works
 
Technical SEO Presentation
Technical SEO PresentationTechnical SEO Presentation
Technical SEO Presentation
 
Web search engines ( Mr.Mirza )
Web search engines ( Mr.Mirza )Web search engines ( Mr.Mirza )
Web search engines ( Mr.Mirza )
 
Webcrawler
Webcrawler Webcrawler
Webcrawler
 
stackconf 2022: Introduction to Vector Search with Weaviate
stackconf 2022: Introduction to Vector Search with Weaviatestackconf 2022: Introduction to Vector Search with Weaviate
stackconf 2022: Introduction to Vector Search with Weaviate
 
Keywords and Keyword Research.pdf
Keywords and Keyword Research.pdfKeywords and Keyword Research.pdf
Keywords and Keyword Research.pdf
 
Graph Structure in the Web - Revisited. WWW2014 Web Science Track
Graph Structure in the Web - Revisited. WWW2014 Web Science TrackGraph Structure in the Web - Revisited. WWW2014 Web Science Track
Graph Structure in the Web - Revisited. WWW2014 Web Science Track
 
SEOkomm 2021: Interne Verlinkung
SEOkomm 2021: Interne VerlinkungSEOkomm 2021: Interne Verlinkung
SEOkomm 2021: Interne Verlinkung
 
Learning to Rank - From pairwise approach to listwise
Learning to Rank - From pairwise approach to listwiseLearning to Rank - From pairwise approach to listwise
Learning to Rank - From pairwise approach to listwise
 
How to Automatically Subcategorise Your Website Automatically With Python
How to Automatically Subcategorise Your Website Automatically With PythonHow to Automatically Subcategorise Your Website Automatically With Python
How to Automatically Subcategorise Your Website Automatically With Python
 

En vedette

Web Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache NutchWeb Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache NutchSteve Watt
 
Large Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and FriendsLarge Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and Friendslucenerevolution
 
Nutch as a Web data mining platform
Nutch as a Web data mining platformNutch as a Web data mining platform
Nutch as a Web data mining platformabial
 
An introduction to Storm Crawler
An introduction to Storm CrawlerAn introduction to Storm Crawler
An introduction to Storm CrawlerJulien Nioche
 
Challenges in Large-Scale Web Crawling
Challenges in Large-Scale Web CrawlingChallenges in Large-Scale Web Crawling
Challenges in Large-Scale Web CrawlingNate Murray
 
Large Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and FriendsLarge Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and FriendsJulien Nioche
 
Open source enterprise search and retrieval platform
Open source enterprise search and retrieval platformOpen source enterprise search and retrieval platform
Open source enterprise search and retrieval platformmteutelink
 
Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...
Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...
Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...hannonhill
 
Populate your Search index, NEST 2016-01
Populate your Search index, NEST 2016-01Populate your Search index, NEST 2016-01
Populate your Search index, NEST 2016-01David Smiley
 
Apache Tika end-to-end
Apache Tika end-to-endApache Tika end-to-end
Apache Tika end-to-endgagravarr
 
Content Analysis with Apache Tika
Content Analysis with Apache TikaContent Analysis with Apache Tika
Content Analysis with Apache TikaPaolo Mottadelli
 
Search Engine Capabilities - Apache Solr(Lucene)
Search Engine Capabilities - Apache Solr(Lucene)Search Engine Capabilities - Apache Solr(Lucene)
Search Engine Capabilities - Apache Solr(Lucene)Manish kumar
 
Metadata Extraction and Content Transformation
Metadata Extraction and Content TransformationMetadata Extraction and Content Transformation
Metadata Extraction and Content TransformationAlfresco Software
 
Current challenges in web crawling
Current challenges in web crawlingCurrent challenges in web crawling
Current challenges in web crawlingDenis Shestakov
 
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)dnaber
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache SolrAndy Jackson
 
Apache Solr crash course
Apache Solr crash courseApache Solr crash course
Apache Solr crash courseTommaso Teofili
 

En vedette (20)

Web Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache NutchWeb Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache Nutch
 
Large Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and FriendsLarge Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and Friends
 
Nutch as a Web data mining platform
Nutch as a Web data mining platformNutch as a Web data mining platform
Nutch as a Web data mining platform
 
An introduction to Storm Crawler
An introduction to Storm CrawlerAn introduction to Storm Crawler
An introduction to Storm Crawler
 
Giraph+Gora in ApacheCon14
Giraph+Gora in ApacheCon14Giraph+Gora in ApacheCon14
Giraph+Gora in ApacheCon14
 
Challenges in Large-Scale Web Crawling
Challenges in Large-Scale Web CrawlingChallenges in Large-Scale Web Crawling
Challenges in Large-Scale Web Crawling
 
Large Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and FriendsLarge Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and Friends
 
Open source enterprise search and retrieval platform
Open source enterprise search and retrieval platformOpen source enterprise search and retrieval platform
Open source enterprise search and retrieval platform
 
Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...
Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...
Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...
 
Populate your Search index, NEST 2016-01
Populate your Search index, NEST 2016-01Populate your Search index, NEST 2016-01
Populate your Search index, NEST 2016-01
 
Apache Tika end-to-end
Apache Tika end-to-endApache Tika end-to-end
Apache Tika end-to-end
 
Content Analysis with Apache Tika
Content Analysis with Apache TikaContent Analysis with Apache Tika
Content Analysis with Apache Tika
 
ProjectHub
ProjectHubProjectHub
ProjectHub
 
Search Engine Capabilities - Apache Solr(Lucene)
Search Engine Capabilities - Apache Solr(Lucene)Search Engine Capabilities - Apache Solr(Lucene)
Search Engine Capabilities - Apache Solr(Lucene)
 
Search engine
Search engineSearch engine
Search engine
 
Metadata Extraction and Content Transformation
Metadata Extraction and Content TransformationMetadata Extraction and Content Transformation
Metadata Extraction and Content Transformation
 
Current challenges in web crawling
Current challenges in web crawlingCurrent challenges in web crawling
Current challenges in web crawling
 
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache Solr
 
Apache Solr crash course
Apache Solr crash courseApache Solr crash course
Apache Solr crash course
 

Similaire à Large scale crawling with Apache Nutch

Hadoop and Hive Development at Facebook
Hadoop and Hive Development at  FacebookHadoop and Hive Development at  Facebook
Hadoop and Hive Development at FacebookS S
 
Hadoop and Hive Development at Facebook
Hadoop and Hive Development at FacebookHadoop and Hive Development at Facebook
Hadoop and Hive Development at Facebookelliando dias
 
Low latency scalable web crawling on Apache Storm
Low latency scalable web crawling on Apache StormLow latency scalable web crawling on Apache Storm
Low latency scalable web crawling on Apache StormJulien Nioche
 
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...Cloudera, Inc.
 
Harnessing the power of Nutch with Scala
Harnessing the power of Nutch with ScalaHarnessing the power of Nutch with Scala
Harnessing the power of Nutch with ScalaKnoldus Inc.
 
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosHadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosLester Martin
 
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...DataWorks Summit
 
Michael stack -the state of apache h base
Michael stack -the state of apache h baseMichael stack -the state of apache h base
Michael stack -the state of apache h basehdhappy001
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkRahul Jain
 
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...javier ramirez
 
The Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop EcosystemThe Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop EcosystemCloudera, Inc.
 
HBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget
HBaseCon 2012 | Building a Large Search Platform on a Shoestring BudgetHBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget
HBaseCon 2012 | Building a Large Search Platform on a Shoestring BudgetCloudera, Inc.
 
Apache Spark Introduction @ University College London
Apache Spark Introduction @ University College LondonApache Spark Introduction @ University College London
Apache Spark Introduction @ University College LondonVitthal Gogate
 

Similaire à Large scale crawling with Apache Nutch (20)

Hadoop and Hive Development at Facebook
Hadoop and Hive Development at  FacebookHadoop and Hive Development at  Facebook
Hadoop and Hive Development at Facebook
 
Hadoop and Hive Development at Facebook
Hadoop and Hive Development at FacebookHadoop and Hive Development at Facebook
Hadoop and Hive Development at Facebook
 
Low latency scalable web crawling on Apache Storm
Low latency scalable web crawling on Apache StormLow latency scalable web crawling on Apache Storm
Low latency scalable web crawling on Apache Storm
 
Hadoop ppt1
Hadoop ppt1Hadoop ppt1
Hadoop ppt1
 
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
 
Harnessing the power of Nutch with Scala
Harnessing the power of Nutch with ScalaHarnessing the power of Nutch with Scala
Harnessing the power of Nutch with Scala
 
HADOOP
HADOOPHADOOP
HADOOP
 
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosHadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
 
Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop Administration
 
Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop Administration
 
Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop Administration
 
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...
 
Hadoop pig
Hadoop pigHadoop pig
Hadoop pig
 
Michael stack -the state of apache h base
Michael stack -the state of apache h baseMichael stack -the state of apache h base
Michael stack -the state of apache h base
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
 
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
 
The Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop EcosystemThe Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop Ecosystem
 
HBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget
HBaseCon 2012 | Building a Large Search Platform on a Shoestring BudgetHBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget
HBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget
 
Apache Spark Introduction @ University College London
Apache Spark Introduction @ University College LondonApache Spark Introduction @ University College London
Apache Spark Introduction @ University College London
 
Apache Spark
Apache SparkApache Spark
Apache Spark
 

Dernier

How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 

Dernier (20)

How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 

Large scale crawling with Apache Nutch

  • 1. Large Scale Crawling with Apache Julien Nioche julien@digitalpebble.com ApacheCon Europe 2012
  • 2. About myself  DigitalPebble Ltd, Bristol (UK)  Specialised in Text Engineering – Web Crawling – Natural Language Processing – Information Retrieval – Data Mining  Strong focus on Open Source & Apache ecosystem  Apache Nutch VP  Apache Tika committer  User | Contributor – SOLR, Lucene – GATE, UIMA – Mahout – Behemoth 2 / 37
  • 3. Objectives  Overview of the project  Nutch in a nutshell  Nutch 2.x  Future developments 3 / 37
  • 4. Nutch?  “Distributed framework for large scale web crawling” – but does not have to be large scale at all – or even on the web (file-protocol)  Apache TLP since May 2010  Based on Apache Hadoop  Indexing and Search 4 / 37
  • 5. Short history  2002/2003 : Started By Doug Cutting & Mike Caffarella  2004 : sub-project of Lucene @Apache  2005 : MapReduce implementation in Nutch – 2006 : Hadoop sub-project of Lucene @Apache  2006/7 : Parser and MimeType in Tika – 2008 : Tika sub-project of Lucene @Apache  May 2010 : TLP project at Apache  June 2012 : Nutch 1.5.1  Oct 2012 : Nutch 2.1 5 / 37
  • 6. Recent Releases 1.0 1.1 1.2 1.3 1.4 1.5.1 trunk 2.x 2.0 2.1 06/09 06/10 06/11 06/12 7 / 37
  • 7. Community  6 active committers / PMC members – 4 within the last 18 months  Constant stream of new contributions & bug reports  Steady numbers of mailing list subscribers and traffic  Nutch is a very healthy 10-year old 9 / 37
  • 8. Why use Nutch?  Usual reasons – Mature, business-friendly license, community, ...  Scalability – Tried and tested on very large scale – Hadoop cluster : installation and skills  Features – e.g. Index with SOLR – PageRank implementation – Can be extended with plugins 10 / 37
  • 9. Not the best option when ...  Hadoop based == batch processing == high latency – No guarantee that a page will be fetched / parsed / indexed within X minutes|hours  Javascript / Ajax not supported (yet) 11 / 37
  • 10. Use cases  Crawl for IR – Generic or vertical – Index and Search with SOLR – Single node to large clusters on Cloud  … but also – Data Mining – NLP (e.g.Sentiment Analysis) – ML – MAHOUT / UIMA / GATE – Use Behemoth as glueware (https://github.com/DigitalPebble/behemoth) 12 / 37
  • 11. Customer cases Specificity (Verticality) Usecase : BetterJobs.com – Single server – Aggregates content from job portals – Extracts and normalizes structure (description, requirements, locations) – ~1M pages total – Feeds SOLR index Usecase : SimilarPages.com – Large cluster on Amazon EC2 (up to 400 nodes) – Fetched & parsed 3 billion pages – 10+ billion pages in crawlDB (~100TB data) – 200+ million lists of similarities – No indexing / search involved Scale 13 / 37
  • 12. Typical Nutch Steps  Same in 1.x and 2.x  Sequence of batch operations 1) Inject → populates CrawlDB from seed list 2) Generate → Selects URLS to fetch in segment 3) Fetch → Fetches URLs from segment 4) Parse → Parses content (text + metadata) 5) UpdateDB → Updates CrawlDB (new URLs, new status...) 6) InvertLinks → Build Webgraph 7) SOLRIndex → Send docs to SOLR 8) SOLRDedup → Remove duplicate docs based on signature  Repeat steps 2 to 8  Or use the all-in-one crawl script 14 / 37
  • 13. Main steps Seed List CrawlDB Segment / /crawl_fetch/ crawl_generate/ /content/ /crawl_parse/ /parse_data/ /parse_text/ LinkDB 15 / 37
  • 14. Frontier expansion  Manual “discovery” – Adding new URLs by hand, “seeding”  Automatic discovery of new resources (frontier expansion) – Not all outlinks are equally useful - control seed – Requires content i=1 parsing and link extraction i=2 i=3 [Slide courtesy of A. Bialecki] 16 / 37
  • 15. An extensible framework  Plugins – Activated with parameter 'plugin.includes' – Implement one or more endpoints  Endpoints – Protocol – Parser – HtmlParseFilter (ParseFilter in Nutch 2.x) – ScoringFilter (used in various places) – URLFilter (ditto) – URLNormalizer (ditto) – IndexingFilter 17 / 37
  • 16. Features  Fetcher – Multi-threaded fetcher – Follows robots.txt – Groups URLs per hostname / domain / IP – Limit the number of URLs for round of fetching – Default values are polite but can be made more aggressive  Crawl Strategy – Breadth-first but can be depth-first – Configurable via custom scoring plugins  Scoring – OPIC (On-line Page Importance Calculation) by default – LinkRank 18 / 37
  • 17. Features (cont.)  Protocols – Http, file, ftp, https  Scheduling – Specified or adaptative  URL filters – Regex, FSA, TLD, prefix, suffix  URL normalisers – Default, regex 19 / 37
  • 18. Features (cont.)  Parsing with Apache Tika – Hundreds of formats supported – But some legacy parsers as well  Other plugins – CreativeCommons – Feeds – Language Identification – Rel tags – Arbitrary Metadata  Indexing to SOLR – Bespoke schema 20 / 37
  • 19. Data Structures in 1.x  MapReduce jobs => I/O : Hadoop [Sequence|Map]Files  CrawlDB => status of known pages MapFile : <Text,CrawlDatum> byte status; [fetched? Unfetched? Failed? Redir?] long fetchTime; byte retries; CrawlDB int fetchInterval; float score = 1.0f; byte[] signature = null; long modifiedTime; org.apache.hadoop.io.MapWritable metaData;  Input of : generate - index  Output of : inject - update 21 / 37
  • 20. Data Structures 1.x  Segment => round of fetching  Identified by a timestamp Segment /crawl_generate/ → SequenceFile<Text,CrawlDatum> /crawl_fetch/ → MapFile<Text,CrawlDatum> /content/ → MapFile<Text,Content> /crawl_parse/ → SequenceFile<Text,CrawlDatum> /parse_data/ → MapFile<Text,ParseData> /parse_text/ → MapFile<Text,ParseText>  Can have multiple versions of a page in different segments 22 / 37
  • 21. Data Structures – 1.x  linkDB => storage for Web Graph MapFile : <Text,Inlinks> Inlinks : HashSet <Inlink> LinkDB Inlink : String fromUrl String anchor  Output of : invertlinks  Input of : SOLRIndex 23 / 37
  • 22. NUTCH 2.x  2.0 released in July 2012  2.1 in October 2012  Common features as 1.x – delegation to SOLR, TIKA, MapReduce etc...  Moved to table-based architecture – Wealth of NoSQL projects in last few years  Abstraction over storage layer → Apache GORA 24 / 37
  • 23. Apache GORA  http://gora.apache.org/  ORM for NoSQL databases – and limited SQL support + file based storage  0.2.1 released in August 2012  DataStore implementations ● Accumulo ● Avro ● Cassandra ● DynamoDB (soon) ● HBase ● SQL  Serialization with Apache AVRO  Object-to-datastore mappings (backend-specific) 25 / 37
  • 24. AVRO Schema => Java code {"name": "WebPage", "type": "record", "namespace": "org.apache.nutch.storage", "fields": [ {"name": "baseUrl", "type": ["null", "string"] }, {"name": "status", "type": "int"}, {"name": "fetchTime", "type": "long"}, {"name": "prevFetchTime", "type": "long"}, {"name": "fetchInterval", "type": "int"}, {"name": "retriesSinceFetch", "type": "int"}, {"name": "modifiedTime", "type": "long"}, {"name": "protocolStatus", "type": { "name": "ProtocolStatus", "type": "record", "namespace": "org.apache.nutch.storage", "fields": [ {"name": "code", "type": "int"}, {"name": "args", "type": {"type": "array", "items": "string"}}, {"name": "lastModified", "type": "long"} ] }}, […] 26 / 37
  • 25. Mapping file (backend specific – Hbase) <gora-orm> <table name="webpage"> <family name="p" maxVersions="1"/> <!-- This can also have params like compression, bloom filters --> <family name="f" maxVersions="1"/> <family name="s" maxVersions="1"/> <family name="il" maxVersions="1"/> <family name="ol" maxVersions="1"/> <family name="h" maxVersions="1"/> <family name="mtdt" maxVersions="1"/> <family name="mk" maxVersions="1"/> </table> <class table="webpage" keyClass="java.lang.String" name="org.apache.nutch.storage.WebPage"> <!-- fetch fields --> <field name="baseUrl" family="f" qualifier="bas"/> <field name="status" family="f" qualifier="st"/> <field name="prevFetchTime" family="f" qualifier="pts"/> <field name="fetchTime" family="f" qualifier="ts"/> <field name="fetchInterval" family="f" qualifier="fi"/> <field name="retriesSinceFetch" family="f" qualifier="rsf"/> 27 / 37
  • 26. DataStore operations  Atomic operations – get(K key) – put(K key, T obj) – delete(K key)  Querying – execute(Query<K, T> query) → Result<K,T> – deleteByQuery(Query<K, T> query)  Wrappers for Apache Hadoop – GORAInput|OutputFormat – GoraRecordReader|Writer – GORAMapper|Reducer 28 / 37
  • 27. GORA in Nutch  AVRO schema provided and java code pre-generated  Mapping files provided for backends – can be modified if necessary  Need to rebuild to get dependencies for backend – No binary distribution of Nutch 2.x  http://wiki.apache.org/nutch/Nutch2Tutorial 29 / 37
  • 28. Benefits  Storage still distributed and replicated  but one big table – status, metadata, content, text → one place  Simplified logic in Nutch – Simpler code for updating / merging information  More efficient (?) – No need to read / write entire structure to update records – No comparison available yet + early days for GORA  Easier interaction with other resources – Third-party code just need to use GORA and schema 30 / 37
  • 29. Drawbacks  More stuff to install and configure :-)  Not as stable as Nutch 1.x  Dependent on success of Gora 31 / 37
  • 30. 2.x Work in progress  Stabilise backend implementations – GORA-Hbase most reliable  Synchronize features with 1.x – e.g. has ElasticSearch but missing LinkRank equivalent  Filter enabled scans (GORA-119) – Don't need to de-serialize the whole dataset 32 / 37
  • 31. Future  Both 1.x and 2.x in parallel – but more frequent releases for 2.x  New functionalities – Support for SOLRCloud – Sitemap (from Crawler Commons library) – Canonical tag – More indexers (e.g. ElasticSearch) + pluggable indexers? 33 / 37
  • 32. More delegation  Great deal done in recent years (SOLR, Tika)  Share code with crawler-commons (http://code.google.com/p/crawler-commons/) – Fetcher / protocol handling – Robots.txt parsing – URL normalisation / filtering  PageRank-like computations to graph library – e.g. Apache Giraph – Should be more efficient as well 34 / 37
  • 33. Where to find out more?  Project page : http://nutch.apache.org/  Wiki : http://wiki.apache.org/nutch/  Mailing lists : – user@nutch.apache.org – dev@nutch.apache.org  Chapter in 'Hadoop the Definitive Guide' (T. White) – Understanding Hadoop is essential anyway...  Support / consulting : – http://wiki.apache.org/nutch/Support 35 / 37
  • 34. Questions ? 36 / 37

Notes de l'éditeur

  1. I&apos;ll be talking about large scale document processing and more specifically about Behemoth which is an open source project based on Hadoop
  2. A few words about myself just before I start... What I mean by Text Engineering is a variety of activities ranging from .... What makes the identity of DP is The main projects I am involved in are …
  3. Note that I mention crawling and not web search → used not only for search + used to do indexing and search using Lucene but now delegate this to SOLR
  4. Endpoints are called in various places URL filters and normalisers in a lot of places Same for Soring Filters
  5. Main steps in Nutch More actions available Shell Wrappers around hadoop commands
  6. Main steps in Nutch More actions available Shell Wrappers around hadoop commands
  7. Endpoints are called in various places URL filters and normalisers in a lot of places Same for Soring Filters
  8. Fetcher . multithreaded but polite
  9. Fetcher . multithreaded but polite
  10. Writable object – crawl datum
  11. What does this mean for Nutch?
  12. What does this mean for Nutch?