SlideShare une entreprise Scribd logo
1  sur  35
Large Scale Crawling with

   Apache
Julien Nioche
julien@digitalpebble.com

ApacheCon Europe 2012
About myself
 DigitalPebble Ltd, Bristol (UK)
 Specialised in Text Engineering
    –   Web Crawling
    –   Natural Language Processing
    –   Information Retrieval
    –   Data Mining
   Strong focus on Open Source & Apache ecosystem
   Apache Nutch VP
   Apache Tika committer
   User | Contributor
    –   SOLR, Lucene
    –   GATE, UIMA
    –   Mahout
    –   Behemoth


                                                     2 / 37
Objectives

 Overview of the project

 Nutch in a nutshell

 Nutch 2.x

 Future developments




                            3 / 37
Nutch?
 “Distributed framework for large scale web crawling”
   – but does not have to be large scale at all
   – or even on the web (file-protocol)


  Apache TLP since May 2010


  Based on Apache Hadoop

  Indexing and Search



                                                     4 / 37
Short history
 2002/2003 : Started By Doug Cutting & Mike Caffarella
 2004 : sub-project of Lucene @Apache
 2005 : MapReduce implementation in Nutch
   – 2006 : Hadoop sub-project of Lucene @Apache

 2006/7 : Parser and MimeType in Tika
   – 2008 : Tika sub-project of Lucene @Apache

 May 2010 : TLP project at Apache
 June 2012 : Nutch 1.5.1
 Oct 2012 : Nutch 2.1

                                                    5 / 37
Recent Releases


        1.0            1.1 1.2   1.3     1.4 1.5.1
trunk


               2.x
                                               2.0 2.1



              06/09   06/10      06/11       06/12




                                                         7 / 37
Community

 6 active committers / PMC members
   – 4 within the last 18 months


 Constant stream of new contributions & bug reports

 Steady numbers of mailing list subscribers and traffic

 Nutch is a very healthy 10-year old




                                                       9 / 37
Why use Nutch?

 Usual reasons
   – Mature, business-friendly license, community, ...


 Scalability
   – Tried and tested on very large scale
   – Hadoop cluster : installation and skills

 Features
   – e.g. Index with SOLR
   – PageRank implementation
   – Can be extended with plugins




                                                         10 / 37
Not the best option when ...

 Hadoop based == batch processing == high latency
   – No guarantee that a page will be fetched / parsed / indexed within X
     minutes|hours


 Javascript / Ajax not supported (yet)




                                                                      11 / 37
Use cases

 Crawl for IR
   – Generic or vertical
   – Index and Search with SOLR
   – Single node to large clusters on Cloud

 … but also
   – Data Mining
   – NLP (e.g.Sentiment Analysis)
   – ML


   – MAHOUT / UIMA / GATE
   – Use Behemoth as glueware (https://github.com/DigitalPebble/behemoth)




                                                                   12 / 37
Customer cases
Specificity (Verticality)
                       Usecase : BetterJobs.com
                        –   Single server
                        –   Aggregates content from job portals
                        –   Extracts and normalizes structure (description,
                            requirements, locations)
                        –   ~1M pages total
                        –   Feeds SOLR index


          Usecase : SimilarPages.com
           –   Large cluster on Amazon EC2 (up to 400
               nodes)
           –   Fetched & parsed 3 billion pages
           –   10+ billion pages in crawlDB (~100TB data)
           –   200+ million lists of similarities
           –   No indexing / search involved
                                                                              Scale


                                                                                13 / 37
Typical Nutch Steps
 Same in 1.x and 2.x
 Sequence of batch operations
      1)   Inject → populates CrawlDB from seed list
      2)   Generate → Selects URLS to fetch in segment
      3)   Fetch → Fetches URLs from segment
      4)   Parse → Parses content (text + metadata)
      5)   UpdateDB → Updates CrawlDB (new URLs, new status...)
      6)   InvertLinks → Build Webgraph
      7)   SOLRIndex → Send docs to SOLR
      8)   SOLRDedup → Remove duplicate docs based on signature
 Repeat steps 2 to 8
 Or use the all-in-one crawl script

                                                          14 / 37
Main steps


 Seed
 List        CrawlDB                 Segment
                                /
                                /crawl_fetch/
                                crawl_generate/
                                /content/
                                /crawl_parse/
                                /parse_data/
                                /parse_text/




                       LinkDB


                                                  15 / 37
Frontier expansion

 Manual “discovery”
   – Adding new URLs by
     hand, “seeding”


 Automatic discovery
  of new resources
  (frontier expansion)
   – Not all outlinks are
     equally useful - control       seed
   – Requires content
                                           i=1
     parsing and link
     extraction
                                                 i=2
                                                        i=3

  [Slide courtesy of A. Bialecki]

                                                       16 / 37
An extensible framework
 Plugins
   – Activated with parameter 'plugin.includes'
   – Implement one or more endpoints

 Endpoints
   –   Protocol
   –   Parser
   –   HtmlParseFilter (ParseFilter in Nutch 2.x)
   –   ScoringFilter (used in various places)
   –   URLFilter (ditto)
   –   URLNormalizer (ditto)
   –   IndexingFilter




                                                    17 / 37
Features

 Fetcher
   –   Multi-threaded fetcher
   –   Follows robots.txt
   –   Groups URLs per hostname / domain / IP
   –   Limit the number of URLs for round of fetching
   –   Default values are polite but can be made more aggressive

 Crawl Strategy
   – Breadth-first but can be depth-first
   – Configurable via custom scoring plugins

 Scoring
   – OPIC (On-line Page Importance Calculation) by default
   – LinkRank


                                                                   18 / 37
Features (cont.)

 Protocols
   – Http, file, ftp, https

 Scheduling
   – Specified or adaptative

 URL filters
   – Regex, FSA, TLD, prefix, suffix


 URL normalisers
   – Default, regex




                                       19 / 37
Features (cont.)

 Parsing with Apache Tika
   – Hundreds of formats supported
   – But some legacy parsers as well
 Other plugins
   –   CreativeCommons
   –   Feeds
   –   Language Identification
   –   Rel tags
   –   Arbitrary Metadata

 Indexing to SOLR
   – Bespoke schema




                                       20 / 37
Data Structures in 1.x
 MapReduce jobs => I/O : Hadoop [Sequence|Map]Files
 CrawlDB => status of known pages
                       MapFile : <Text,CrawlDatum>
                       byte status; [fetched? Unfetched? Failed? Redir?]
                       long fetchTime;
                       byte retries;
     CrawlDB           int fetchInterval;
                       float score = 1.0f;
                       byte[] signature = null;
                       long modifiedTime;
                       org.apache.hadoop.io.MapWritable metaData;


 Input of : generate - index
 Output of : inject - update


                                                                   21 / 37
Data Structures 1.x

 Segment => round of fetching
 Identified by a timestamp

 Segment
      /crawl_generate/ → SequenceFile<Text,CrawlDatum>
      /crawl_fetch/ → MapFile<Text,CrawlDatum>
      /content/ → MapFile<Text,Content>
      /crawl_parse/ → SequenceFile<Text,CrawlDatum>
      /parse_data/ → MapFile<Text,ParseData>
      /parse_text/ → MapFile<Text,ParseText>


 Can have multiple versions of a page in different
  segments


                                                         22 / 37
Data Structures – 1.x

 linkDB => storage for Web Graph

                        MapFile : <Text,Inlinks>
                        Inlinks : HashSet <Inlink>
     LinkDB             Inlink :
                                 String fromUrl
                                 String anchor


 Output of : invertlinks
 Input of : SOLRIndex




                                                     23 / 37
NUTCH 2.x

 2.0 released in July 2012

 2.1 in October 2012

 Common features as 1.x
   – delegation to SOLR, TIKA, MapReduce etc...


 Moved to table-based architecture
   – Wealth of NoSQL projects in last few years


 Abstraction over storage layer → Apache GORA


                                                  24 / 37
Apache GORA

 http://gora.apache.org/

 ORM for NoSQL databases
   – and limited SQL support + file based storage
 0.2.1 released in August 2012
 DataStore implementations
       ●   Accumulo             ●   Avro
       ●   Cassandra            ●   DynamoDB (soon)
       ●   HBase                ●   SQL

 Serialization with Apache AVRO
 Object-to-datastore mappings (backend-specific)

                                                      25 / 37
AVRO Schema => Java code
 {"name": "WebPage",
  "type": "record",
  "namespace": "org.apache.nutch.storage",
  "fields": [
        {"name": "baseUrl", "type": ["null", "string"] },
        {"name": "status", "type": "int"},
        {"name": "fetchTime", "type": "long"},
        {"name": "prevFetchTime", "type": "long"},
        {"name": "fetchInterval", "type": "int"},
        {"name": "retriesSinceFetch", "type": "int"},
        {"name": "modifiedTime", "type": "long"},
        {"name": "protocolStatus", "type": {
           "name": "ProtocolStatus",
           "type": "record",
           "namespace": "org.apache.nutch.storage",
           "fields": [
               {"name": "code", "type": "int"},
               {"name": "args", "type": {"type": "array", "items": "string"}},
               {"name": "lastModified", "type": "long"}
           ]
           }},
 […]

                                                                                 26 / 37
Mapping file (backend specific – Hbase)
<gora-orm>

  <table name="webpage">
     <family name="p" maxVersions="1"/> <!-- This can also have params like compression, bloom filters -->
     <family name="f" maxVersions="1"/>
     <family name="s" maxVersions="1"/>
     <family name="il" maxVersions="1"/>
     <family name="ol" maxVersions="1"/>
     <family name="h" maxVersions="1"/>
     <family name="mtdt" maxVersions="1"/>
     <family name="mk" maxVersions="1"/>
  </table>
  <class table="webpage" keyClass="java.lang.String" name="org.apache.nutch.storage.WebPage">

    <!-- fetch fields                        -->
    <field name="baseUrl" family="f" qualifier="bas"/>
    <field name="status" family="f" qualifier="st"/>
    <field name="prevFetchTime" family="f" qualifier="pts"/>
    <field name="fetchTime" family="f" qualifier="ts"/>
    <field name="fetchInterval" family="f" qualifier="fi"/>
    <field name="retriesSinceFetch" family="f" qualifier="rsf"/>



                                                                                            27 / 37
DataStore operations

 Atomic operations
   – get(K key)
   – put(K key, T obj)
   – delete(K key)

 Querying
   – execute(Query<K, T> query) → Result<K,T>
   – deleteByQuery(Query<K, T> query)

 Wrappers for Apache Hadoop
   – GORAInput|OutputFormat
   – GoraRecordReader|Writer
   – GORAMapper|Reducer



                                                28 / 37
GORA in Nutch

 AVRO schema provided and java code pre-generated


 Mapping files provided for backends
   – can be modified if necessary

 Need to rebuild to get dependencies for backend
   – No binary distribution of Nutch 2.x

 http://wiki.apache.org/nutch/Nutch2Tutorial



                                                    29 / 37
Benefits

 Storage still distributed and replicated
 but one big table
   – status, metadata, content, text → one place
 Simplified logic in Nutch
   – Simpler code for updating / merging information
 More efficient (?)
   – No need to read / write entire structure to update records
   – No comparison available yet + early days for GORA
 Easier interaction with other resources
   – Third-party code just need to use GORA and schema



                                                                  30 / 37
Drawbacks

 More stuff to install and configure :-)

 Not as stable as Nutch 1.x

 Dependent on success of Gora




                                            31 / 37
2.x Work in progress

 Stabilise backend implementations
   – GORA-Hbase most reliable


 Synchronize features with 1.x
   – e.g. has ElasticSearch but missing LinkRank equivalent


 Filter enabled scans (GORA-119)
   – Don't need to de-serialize the whole dataset




                                                              32 / 37
Future

 Both 1.x and 2.x in parallel
   – but more frequent releases for 2.x


 New functionalities
   –   Support for SOLRCloud
   –   Sitemap (from Crawler Commons library)
   –   Canonical tag
   –   More indexers (e.g. ElasticSearch) + pluggable indexers?




                                                                  33 / 37
More delegation
 Great deal done in recent years (SOLR, Tika)

 Share code with crawler-commons
  (http://code.google.com/p/crawler-commons/)
   – Fetcher / protocol handling
   – Robots.txt parsing
   – URL normalisation / filtering


 PageRank-like computations to graph library
   – e.g. Apache Giraph
   – Should be more efficient as well




                                                 34 / 37
Where to find out more?

  Project page : http://nutch.apache.org/
  Wiki : http://wiki.apache.org/nutch/
  Mailing lists :
    – user@nutch.apache.org
    – dev@nutch.apache.org


 Chapter in 'Hadoop the Definitive Guide' (T. White)
   – Understanding Hadoop is essential anyway...


 Support / consulting :
   – http://wiki.apache.org/nutch/Support



                                                        35 / 37
Questions




            ?

                36 / 37
37 / 37

Contenu connexe

Tendances

Mapping french open data actors on the web with common crawl
Mapping french open data actors on the web with common crawlMapping french open data actors on the web with common crawl
Mapping french open data actors on the web with common crawldata publica
 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceDatabricks
 
Building a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopBuilding a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopHadoop User Group
 
Grouping and Joining in Lucene/Solr
Grouping and Joining in Lucene/SolrGrouping and Joining in Lucene/Solr
Grouping and Joining in Lucene/Solrlucenerevolution
 
Graph Structure in the Web - Revisited. WWW2014 Web Science Track
Graph Structure in the Web - Revisited. WWW2014 Web Science TrackGraph Structure in the Web - Revisited. WWW2014 Web Science Track
Graph Structure in the Web - Revisited. WWW2014 Web Science TrackChris Bizer
 
Neo4j Introduction (Basics, Cypher, RDBMS to GRAPH)
Neo4j Introduction (Basics, Cypher, RDBMS to GRAPH) Neo4j Introduction (Basics, Cypher, RDBMS to GRAPH)
Neo4j Introduction (Basics, Cypher, RDBMS to GRAPH) David Fombella Pombal
 
Mining a Large Web Corpus
Mining a Large Web CorpusMining a Large Web Corpus
Mining a Large Web CorpusRobert Meusel
 
Observability for Data Pipelines With OpenLineage
Observability for Data Pipelines With OpenLineageObservability for Data Pipelines With OpenLineage
Observability for Data Pipelines With OpenLineageDatabricks
 
Current challenges in web crawling
Current challenges in web crawlingCurrent challenges in web crawling
Current challenges in web crawlingDenis Shestakov
 
Implementing Semantic Search
Implementing Semantic SearchImplementing Semantic Search
Implementing Semantic SearchPaul Wlodarczyk
 
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveApache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveSachin Aggarwal
 
Key-Value Stores: a practical overview
Key-Value Stores: a practical overviewKey-Value Stores: a practical overview
Key-Value Stores: a practical overviewMarc Seeger
 
Full text search
Full text searchFull text search
Full text searchdeleteman
 
Elasticsearch Tutorial | Getting Started with Elasticsearch | ELK Stack Train...
Elasticsearch Tutorial | Getting Started with Elasticsearch | ELK Stack Train...Elasticsearch Tutorial | Getting Started with Elasticsearch | ELK Stack Train...
Elasticsearch Tutorial | Getting Started with Elasticsearch | ELK Stack Train...Edureka!
 
Introduction to Apache Spark Developer Training
Introduction to Apache Spark Developer TrainingIntroduction to Apache Spark Developer Training
Introduction to Apache Spark Developer TrainingCloudera, Inc.
 
Incremental View Maintenance with Coral, DBT, and Iceberg
Incremental View Maintenance with Coral, DBT, and IcebergIncremental View Maintenance with Coral, DBT, and Iceberg
Incremental View Maintenance with Coral, DBT, and IcebergWalaa Eldin Moustafa
 
MongoDB Evenings Chicago - WindyGrid: Utilizing MongoDB for a Smarter & Safer...
MongoDB Evenings Chicago - WindyGrid: Utilizing MongoDB for a Smarter & Safer...MongoDB Evenings Chicago - WindyGrid: Utilizing MongoDB for a Smarter & Safer...
MongoDB Evenings Chicago - WindyGrid: Utilizing MongoDB for a Smarter & Safer...MongoDB
 

Tendances (20)

Mapping french open data actors on the web with common crawl
Mapping french open data actors on the web with common crawlMapping french open data actors on the web with common crawl
Mapping french open data actors on the web with common crawl
 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data Science
 
Building a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopBuilding a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with Hadoop
 
Grouping and Joining in Lucene/Solr
Grouping and Joining in Lucene/SolrGrouping and Joining in Lucene/Solr
Grouping and Joining in Lucene/Solr
 
Examples of Ontology Applications
Examples of Ontology ApplicationsExamples of Ontology Applications
Examples of Ontology Applications
 
Graph Structure in the Web - Revisited. WWW2014 Web Science Track
Graph Structure in the Web - Revisited. WWW2014 Web Science TrackGraph Structure in the Web - Revisited. WWW2014 Web Science Track
Graph Structure in the Web - Revisited. WWW2014 Web Science Track
 
Apache Solr
Apache SolrApache Solr
Apache Solr
 
Neo4j Introduction (Basics, Cypher, RDBMS to GRAPH)
Neo4j Introduction (Basics, Cypher, RDBMS to GRAPH) Neo4j Introduction (Basics, Cypher, RDBMS to GRAPH)
Neo4j Introduction (Basics, Cypher, RDBMS to GRAPH)
 
Mining a Large Web Corpus
Mining a Large Web CorpusMining a Large Web Corpus
Mining a Large Web Corpus
 
Observability for Data Pipelines With OpenLineage
Observability for Data Pipelines With OpenLineageObservability for Data Pipelines With OpenLineage
Observability for Data Pipelines With OpenLineage
 
Current challenges in web crawling
Current challenges in web crawlingCurrent challenges in web crawling
Current challenges in web crawling
 
Implementing Semantic Search
Implementing Semantic SearchImplementing Semantic Search
Implementing Semantic Search
 
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveApache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
 
Key-Value Stores: a practical overview
Key-Value Stores: a practical overviewKey-Value Stores: a practical overview
Key-Value Stores: a practical overview
 
Workshop Docker for DSpace
Workshop Docker for DSpaceWorkshop Docker for DSpace
Workshop Docker for DSpace
 
Full text search
Full text searchFull text search
Full text search
 
Elasticsearch Tutorial | Getting Started with Elasticsearch | ELK Stack Train...
Elasticsearch Tutorial | Getting Started with Elasticsearch | ELK Stack Train...Elasticsearch Tutorial | Getting Started with Elasticsearch | ELK Stack Train...
Elasticsearch Tutorial | Getting Started with Elasticsearch | ELK Stack Train...
 
Introduction to Apache Spark Developer Training
Introduction to Apache Spark Developer TrainingIntroduction to Apache Spark Developer Training
Introduction to Apache Spark Developer Training
 
Incremental View Maintenance with Coral, DBT, and Iceberg
Incremental View Maintenance with Coral, DBT, and IcebergIncremental View Maintenance with Coral, DBT, and Iceberg
Incremental View Maintenance with Coral, DBT, and Iceberg
 
MongoDB Evenings Chicago - WindyGrid: Utilizing MongoDB for a Smarter & Safer...
MongoDB Evenings Chicago - WindyGrid: Utilizing MongoDB for a Smarter & Safer...MongoDB Evenings Chicago - WindyGrid: Utilizing MongoDB for a Smarter & Safer...
MongoDB Evenings Chicago - WindyGrid: Utilizing MongoDB for a Smarter & Safer...
 

En vedette

Web Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache NutchWeb Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache NutchSteve Watt
 
Large Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and FriendsLarge Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and Friendslucenerevolution
 
Nutch as a Web data mining platform
Nutch as a Web data mining platformNutch as a Web data mining platform
Nutch as a Web data mining platformabial
 
An introduction to Storm Crawler
An introduction to Storm CrawlerAn introduction to Storm Crawler
An introduction to Storm CrawlerJulien Nioche
 
Challenges in Large-Scale Web Crawling
Challenges in Large-Scale Web CrawlingChallenges in Large-Scale Web Crawling
Challenges in Large-Scale Web CrawlingNate Murray
 
Large Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and FriendsLarge Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and FriendsJulien Nioche
 
Open source enterprise search and retrieval platform
Open source enterprise search and retrieval platformOpen source enterprise search and retrieval platform
Open source enterprise search and retrieval platformmteutelink
 
Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...
Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...
Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...hannonhill
 
Populate your Search index, NEST 2016-01
Populate your Search index, NEST 2016-01Populate your Search index, NEST 2016-01
Populate your Search index, NEST 2016-01David Smiley
 
Apache Tika end-to-end
Apache Tika end-to-endApache Tika end-to-end
Apache Tika end-to-endgagravarr
 
Content Analysis with Apache Tika
Content Analysis with Apache TikaContent Analysis with Apache Tika
Content Analysis with Apache TikaPaolo Mottadelli
 
Search Engine Capabilities - Apache Solr(Lucene)
Search Engine Capabilities - Apache Solr(Lucene)Search Engine Capabilities - Apache Solr(Lucene)
Search Engine Capabilities - Apache Solr(Lucene)Manish kumar
 
Metadata Extraction and Content Transformation
Metadata Extraction and Content TransformationMetadata Extraction and Content Transformation
Metadata Extraction and Content TransformationAlfresco Software
 
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)dnaber
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache SolrAndy Jackson
 
Apache Solr crash course
Apache Solr crash courseApache Solr crash course
Apache Solr crash courseTommaso Teofili
 
Indexing Text and HTML Files with Solr
Indexing Text and HTML Files with SolrIndexing Text and HTML Files with Solr
Indexing Text and HTML Files with SolrLucidworks (Archived)
 

En vedette (20)

Web Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache NutchWeb Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache Nutch
 
Large Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and FriendsLarge Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and Friends
 
Nutch as a Web data mining platform
Nutch as a Web data mining platformNutch as a Web data mining platform
Nutch as a Web data mining platform
 
An introduction to Storm Crawler
An introduction to Storm CrawlerAn introduction to Storm Crawler
An introduction to Storm Crawler
 
Giraph+Gora in ApacheCon14
Giraph+Gora in ApacheCon14Giraph+Gora in ApacheCon14
Giraph+Gora in ApacheCon14
 
Challenges in Large-Scale Web Crawling
Challenges in Large-Scale Web CrawlingChallenges in Large-Scale Web Crawling
Challenges in Large-Scale Web Crawling
 
Large Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and FriendsLarge Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and Friends
 
Open source enterprise search and retrieval platform
Open source enterprise search and retrieval platformOpen source enterprise search and retrieval platform
Open source enterprise search and retrieval platform
 
Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...
Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...
Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...
 
Populate your Search index, NEST 2016-01
Populate your Search index, NEST 2016-01Populate your Search index, NEST 2016-01
Populate your Search index, NEST 2016-01
 
Apache Tika end-to-end
Apache Tika end-to-endApache Tika end-to-end
Apache Tika end-to-end
 
Content Analysis with Apache Tika
Content Analysis with Apache TikaContent Analysis with Apache Tika
Content Analysis with Apache Tika
 
ProjectHub
ProjectHubProjectHub
ProjectHub
 
Search Engine Capabilities - Apache Solr(Lucene)
Search Engine Capabilities - Apache Solr(Lucene)Search Engine Capabilities - Apache Solr(Lucene)
Search Engine Capabilities - Apache Solr(Lucene)
 
Search engine
Search engineSearch engine
Search engine
 
Metadata Extraction and Content Transformation
Metadata Extraction and Content TransformationMetadata Extraction and Content Transformation
Metadata Extraction and Content Transformation
 
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache Solr
 
Apache Solr crash course
Apache Solr crash courseApache Solr crash course
Apache Solr crash course
 
Indexing Text and HTML Files with Solr
Indexing Text and HTML Files with SolrIndexing Text and HTML Files with Solr
Indexing Text and HTML Files with Solr
 

Similaire à Large scale crawling with Apache Nutch

Hadoop and Hive Development at Facebook
Hadoop and Hive Development at  FacebookHadoop and Hive Development at  Facebook
Hadoop and Hive Development at FacebookS S
 
Hadoop and Hive Development at Facebook
Hadoop and Hive Development at FacebookHadoop and Hive Development at Facebook
Hadoop and Hive Development at Facebookelliando dias
 
Low latency scalable web crawling on Apache Storm
Low latency scalable web crawling on Apache StormLow latency scalable web crawling on Apache Storm
Low latency scalable web crawling on Apache StormJulien Nioche
 
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...Cloudera, Inc.
 
Harnessing the power of Nutch with Scala
Harnessing the power of Nutch with ScalaHarnessing the power of Nutch with Scala
Harnessing the power of Nutch with ScalaKnoldus Inc.
 
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosHadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosLester Martin
 
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...DataWorks Summit
 
Michael stack -the state of apache h base
Michael stack -the state of apache h baseMichael stack -the state of apache h base
Michael stack -the state of apache h basehdhappy001
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkRahul Jain
 
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...javier ramirez
 
The Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop EcosystemThe Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop EcosystemCloudera, Inc.
 
HBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget
HBaseCon 2012 | Building a Large Search Platform on a Shoestring BudgetHBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget
HBaseCon 2012 | Building a Large Search Platform on a Shoestring BudgetCloudera, Inc.
 
Apache Spark Introduction @ University College London
Apache Spark Introduction @ University College LondonApache Spark Introduction @ University College London
Apache Spark Introduction @ University College LondonVitthal Gogate
 

Similaire à Large scale crawling with Apache Nutch (20)

Hadoop and Hive Development at Facebook
Hadoop and Hive Development at  FacebookHadoop and Hive Development at  Facebook
Hadoop and Hive Development at Facebook
 
Hadoop and Hive Development at Facebook
Hadoop and Hive Development at FacebookHadoop and Hive Development at Facebook
Hadoop and Hive Development at Facebook
 
Low latency scalable web crawling on Apache Storm
Low latency scalable web crawling on Apache StormLow latency scalable web crawling on Apache Storm
Low latency scalable web crawling on Apache Storm
 
Hadoop ppt1
Hadoop ppt1Hadoop ppt1
Hadoop ppt1
 
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
 
Harnessing the power of Nutch with Scala
Harnessing the power of Nutch with ScalaHarnessing the power of Nutch with Scala
Harnessing the power of Nutch with Scala
 
HADOOP
HADOOPHADOOP
HADOOP
 
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosHadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
 
Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop Administration
 
Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop Administration
 
Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop Administration
 
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...
 
Hadoop pig
Hadoop pigHadoop pig
Hadoop pig
 
Michael stack -the state of apache h base
Michael stack -the state of apache h baseMichael stack -the state of apache h base
Michael stack -the state of apache h base
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
 
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
 
The Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop EcosystemThe Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop Ecosystem
 
HBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget
HBaseCon 2012 | Building a Large Search Platform on a Shoestring BudgetHBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget
HBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget
 
Apache Spark Introduction @ University College London
Apache Spark Introduction @ University College LondonApache Spark Introduction @ University College London
Apache Spark Introduction @ University College London
 
Apache Spark
Apache SparkApache Spark
Apache Spark
 

Dernier

From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 

Dernier (20)

From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 

Large scale crawling with Apache Nutch

  • 1. Large Scale Crawling with Apache Julien Nioche julien@digitalpebble.com ApacheCon Europe 2012
  • 2. About myself  DigitalPebble Ltd, Bristol (UK)  Specialised in Text Engineering – Web Crawling – Natural Language Processing – Information Retrieval – Data Mining  Strong focus on Open Source & Apache ecosystem  Apache Nutch VP  Apache Tika committer  User | Contributor – SOLR, Lucene – GATE, UIMA – Mahout – Behemoth 2 / 37
  • 3. Objectives  Overview of the project  Nutch in a nutshell  Nutch 2.x  Future developments 3 / 37
  • 4. Nutch?  “Distributed framework for large scale web crawling” – but does not have to be large scale at all – or even on the web (file-protocol)  Apache TLP since May 2010  Based on Apache Hadoop  Indexing and Search 4 / 37
  • 5. Short history  2002/2003 : Started By Doug Cutting & Mike Caffarella  2004 : sub-project of Lucene @Apache  2005 : MapReduce implementation in Nutch – 2006 : Hadoop sub-project of Lucene @Apache  2006/7 : Parser and MimeType in Tika – 2008 : Tika sub-project of Lucene @Apache  May 2010 : TLP project at Apache  June 2012 : Nutch 1.5.1  Oct 2012 : Nutch 2.1 5 / 37
  • 6. Recent Releases 1.0 1.1 1.2 1.3 1.4 1.5.1 trunk 2.x 2.0 2.1 06/09 06/10 06/11 06/12 7 / 37
  • 7. Community  6 active committers / PMC members – 4 within the last 18 months  Constant stream of new contributions & bug reports  Steady numbers of mailing list subscribers and traffic  Nutch is a very healthy 10-year old 9 / 37
  • 8. Why use Nutch?  Usual reasons – Mature, business-friendly license, community, ...  Scalability – Tried and tested on very large scale – Hadoop cluster : installation and skills  Features – e.g. Index with SOLR – PageRank implementation – Can be extended with plugins 10 / 37
  • 9. Not the best option when ...  Hadoop based == batch processing == high latency – No guarantee that a page will be fetched / parsed / indexed within X minutes|hours  Javascript / Ajax not supported (yet) 11 / 37
  • 10. Use cases  Crawl for IR – Generic or vertical – Index and Search with SOLR – Single node to large clusters on Cloud  … but also – Data Mining – NLP (e.g.Sentiment Analysis) – ML – MAHOUT / UIMA / GATE – Use Behemoth as glueware (https://github.com/DigitalPebble/behemoth) 12 / 37
  • 11. Customer cases Specificity (Verticality) Usecase : BetterJobs.com – Single server – Aggregates content from job portals – Extracts and normalizes structure (description, requirements, locations) – ~1M pages total – Feeds SOLR index Usecase : SimilarPages.com – Large cluster on Amazon EC2 (up to 400 nodes) – Fetched & parsed 3 billion pages – 10+ billion pages in crawlDB (~100TB data) – 200+ million lists of similarities – No indexing / search involved Scale 13 / 37
  • 12. Typical Nutch Steps  Same in 1.x and 2.x  Sequence of batch operations 1) Inject → populates CrawlDB from seed list 2) Generate → Selects URLS to fetch in segment 3) Fetch → Fetches URLs from segment 4) Parse → Parses content (text + metadata) 5) UpdateDB → Updates CrawlDB (new URLs, new status...) 6) InvertLinks → Build Webgraph 7) SOLRIndex → Send docs to SOLR 8) SOLRDedup → Remove duplicate docs based on signature  Repeat steps 2 to 8  Or use the all-in-one crawl script 14 / 37
  • 13. Main steps Seed List CrawlDB Segment / /crawl_fetch/ crawl_generate/ /content/ /crawl_parse/ /parse_data/ /parse_text/ LinkDB 15 / 37
  • 14. Frontier expansion  Manual “discovery” – Adding new URLs by hand, “seeding”  Automatic discovery of new resources (frontier expansion) – Not all outlinks are equally useful - control seed – Requires content i=1 parsing and link extraction i=2 i=3 [Slide courtesy of A. Bialecki] 16 / 37
  • 15. An extensible framework  Plugins – Activated with parameter 'plugin.includes' – Implement one or more endpoints  Endpoints – Protocol – Parser – HtmlParseFilter (ParseFilter in Nutch 2.x) – ScoringFilter (used in various places) – URLFilter (ditto) – URLNormalizer (ditto) – IndexingFilter 17 / 37
  • 16. Features  Fetcher – Multi-threaded fetcher – Follows robots.txt – Groups URLs per hostname / domain / IP – Limit the number of URLs for round of fetching – Default values are polite but can be made more aggressive  Crawl Strategy – Breadth-first but can be depth-first – Configurable via custom scoring plugins  Scoring – OPIC (On-line Page Importance Calculation) by default – LinkRank 18 / 37
  • 17. Features (cont.)  Protocols – Http, file, ftp, https  Scheduling – Specified or adaptative  URL filters – Regex, FSA, TLD, prefix, suffix  URL normalisers – Default, regex 19 / 37
  • 18. Features (cont.)  Parsing with Apache Tika – Hundreds of formats supported – But some legacy parsers as well  Other plugins – CreativeCommons – Feeds – Language Identification – Rel tags – Arbitrary Metadata  Indexing to SOLR – Bespoke schema 20 / 37
  • 19. Data Structures in 1.x  MapReduce jobs => I/O : Hadoop [Sequence|Map]Files  CrawlDB => status of known pages MapFile : <Text,CrawlDatum> byte status; [fetched? Unfetched? Failed? Redir?] long fetchTime; byte retries; CrawlDB int fetchInterval; float score = 1.0f; byte[] signature = null; long modifiedTime; org.apache.hadoop.io.MapWritable metaData;  Input of : generate - index  Output of : inject - update 21 / 37
  • 20. Data Structures 1.x  Segment => round of fetching  Identified by a timestamp Segment /crawl_generate/ → SequenceFile<Text,CrawlDatum> /crawl_fetch/ → MapFile<Text,CrawlDatum> /content/ → MapFile<Text,Content> /crawl_parse/ → SequenceFile<Text,CrawlDatum> /parse_data/ → MapFile<Text,ParseData> /parse_text/ → MapFile<Text,ParseText>  Can have multiple versions of a page in different segments 22 / 37
  • 21. Data Structures – 1.x  linkDB => storage for Web Graph MapFile : <Text,Inlinks> Inlinks : HashSet <Inlink> LinkDB Inlink : String fromUrl String anchor  Output of : invertlinks  Input of : SOLRIndex 23 / 37
  • 22. NUTCH 2.x  2.0 released in July 2012  2.1 in October 2012  Common features as 1.x – delegation to SOLR, TIKA, MapReduce etc...  Moved to table-based architecture – Wealth of NoSQL projects in last few years  Abstraction over storage layer → Apache GORA 24 / 37
  • 23. Apache GORA  http://gora.apache.org/  ORM for NoSQL databases – and limited SQL support + file based storage  0.2.1 released in August 2012  DataStore implementations ● Accumulo ● Avro ● Cassandra ● DynamoDB (soon) ● HBase ● SQL  Serialization with Apache AVRO  Object-to-datastore mappings (backend-specific) 25 / 37
  • 24. AVRO Schema => Java code {"name": "WebPage", "type": "record", "namespace": "org.apache.nutch.storage", "fields": [ {"name": "baseUrl", "type": ["null", "string"] }, {"name": "status", "type": "int"}, {"name": "fetchTime", "type": "long"}, {"name": "prevFetchTime", "type": "long"}, {"name": "fetchInterval", "type": "int"}, {"name": "retriesSinceFetch", "type": "int"}, {"name": "modifiedTime", "type": "long"}, {"name": "protocolStatus", "type": { "name": "ProtocolStatus", "type": "record", "namespace": "org.apache.nutch.storage", "fields": [ {"name": "code", "type": "int"}, {"name": "args", "type": {"type": "array", "items": "string"}}, {"name": "lastModified", "type": "long"} ] }}, […] 26 / 37
  • 25. Mapping file (backend specific – Hbase) <gora-orm> <table name="webpage"> <family name="p" maxVersions="1"/> <!-- This can also have params like compression, bloom filters --> <family name="f" maxVersions="1"/> <family name="s" maxVersions="1"/> <family name="il" maxVersions="1"/> <family name="ol" maxVersions="1"/> <family name="h" maxVersions="1"/> <family name="mtdt" maxVersions="1"/> <family name="mk" maxVersions="1"/> </table> <class table="webpage" keyClass="java.lang.String" name="org.apache.nutch.storage.WebPage"> <!-- fetch fields --> <field name="baseUrl" family="f" qualifier="bas"/> <field name="status" family="f" qualifier="st"/> <field name="prevFetchTime" family="f" qualifier="pts"/> <field name="fetchTime" family="f" qualifier="ts"/> <field name="fetchInterval" family="f" qualifier="fi"/> <field name="retriesSinceFetch" family="f" qualifier="rsf"/> 27 / 37
  • 26. DataStore operations  Atomic operations – get(K key) – put(K key, T obj) – delete(K key)  Querying – execute(Query<K, T> query) → Result<K,T> – deleteByQuery(Query<K, T> query)  Wrappers for Apache Hadoop – GORAInput|OutputFormat – GoraRecordReader|Writer – GORAMapper|Reducer 28 / 37
  • 27. GORA in Nutch  AVRO schema provided and java code pre-generated  Mapping files provided for backends – can be modified if necessary  Need to rebuild to get dependencies for backend – No binary distribution of Nutch 2.x  http://wiki.apache.org/nutch/Nutch2Tutorial 29 / 37
  • 28. Benefits  Storage still distributed and replicated  but one big table – status, metadata, content, text → one place  Simplified logic in Nutch – Simpler code for updating / merging information  More efficient (?) – No need to read / write entire structure to update records – No comparison available yet + early days for GORA  Easier interaction with other resources – Third-party code just need to use GORA and schema 30 / 37
  • 29. Drawbacks  More stuff to install and configure :-)  Not as stable as Nutch 1.x  Dependent on success of Gora 31 / 37
  • 30. 2.x Work in progress  Stabilise backend implementations – GORA-Hbase most reliable  Synchronize features with 1.x – e.g. has ElasticSearch but missing LinkRank equivalent  Filter enabled scans (GORA-119) – Don't need to de-serialize the whole dataset 32 / 37
  • 31. Future  Both 1.x and 2.x in parallel – but more frequent releases for 2.x  New functionalities – Support for SOLRCloud – Sitemap (from Crawler Commons library) – Canonical tag – More indexers (e.g. ElasticSearch) + pluggable indexers? 33 / 37
  • 32. More delegation  Great deal done in recent years (SOLR, Tika)  Share code with crawler-commons (http://code.google.com/p/crawler-commons/) – Fetcher / protocol handling – Robots.txt parsing – URL normalisation / filtering  PageRank-like computations to graph library – e.g. Apache Giraph – Should be more efficient as well 34 / 37
  • 33. Where to find out more?  Project page : http://nutch.apache.org/  Wiki : http://wiki.apache.org/nutch/  Mailing lists : – user@nutch.apache.org – dev@nutch.apache.org  Chapter in 'Hadoop the Definitive Guide' (T. White) – Understanding Hadoop is essential anyway...  Support / consulting : – http://wiki.apache.org/nutch/Support 35 / 37
  • 34. Questions ? 36 / 37

Notes de l'éditeur

  1. I&apos;ll be talking about large scale document processing and more specifically about Behemoth which is an open source project based on Hadoop
  2. A few words about myself just before I start... What I mean by Text Engineering is a variety of activities ranging from .... What makes the identity of DP is The main projects I am involved in are …
  3. Note that I mention crawling and not web search → used not only for search + used to do indexing and search using Lucene but now delegate this to SOLR
  4. Endpoints are called in various places URL filters and normalisers in a lot of places Same for Soring Filters
  5. Main steps in Nutch More actions available Shell Wrappers around hadoop commands
  6. Main steps in Nutch More actions available Shell Wrappers around hadoop commands
  7. Endpoints are called in various places URL filters and normalisers in a lot of places Same for Soring Filters
  8. Fetcher . multithreaded but polite
  9. Fetcher . multithreaded but polite
  10. Writable object – crawl datum
  11. What does this mean for Nutch?
  12. What does this mean for Nutch?