SlideShare une entreprise Scribd logo
1  sur  35
Large Scale Crawling with

   Apache
Julien Nioche
julien@digitalpebble.com

ApacheCon Europe 2012
About myself
 DigitalPebble Ltd, Bristol (UK)
 Specialised in Text Engineering
    –   Web Crawling
    –   Natural Language Processing
    –   Information Retrieval
    –   Data Mining
   Strong focus on Open Source & Apache ecosystem
   Apache Nutch VP
   Apache Tika committer
   User | Contributor
    –   SOLR, Lucene
    –   GATE, UIMA
    –   Mahout
    –   Behemoth


                                                     2 / 37
Objectives

 Overview of the project

 Nutch in a nutshell

 Nutch 2.x

 Future developments




                            3 / 37
Nutch?
 “Distributed framework for large scale web crawling”
   – but does not have to be large scale at all
   – or even on the web (file-protocol)


  Apache TLP since May 2010


  Based on Apache Hadoop

  Indexing and Search



                                                     4 / 37
Short history
 2002/2003 : Started By Doug Cutting & Mike Caffarella
 2004 : sub-project of Lucene @Apache
 2005 : MapReduce implementation in Nutch
   – 2006 : Hadoop sub-project of Lucene @Apache

 2006/7 : Parser and MimeType in Tika
   – 2008 : Tika sub-project of Lucene @Apache

 May 2010 : TLP project at Apache
 June 2012 : Nutch 1.5.1
 Oct 2012 : Nutch 2.1

                                                    5 / 37
Recent Releases


        1.0            1.1 1.2   1.3     1.4 1.5.1
trunk


               2.x
                                               2.0 2.1



              06/09   06/10      06/11       06/12




                                                         7 / 37
Community

 6 active committers / PMC members
   – 4 within the last 18 months


 Constant stream of new contributions & bug reports

 Steady numbers of mailing list subscribers and traffic

 Nutch is a very healthy 10-year old




                                                       9 / 37
Why use Nutch?

 Usual reasons
   – Mature, business-friendly license, community, ...


 Scalability
   – Tried and tested on very large scale
   – Hadoop cluster : installation and skills

 Features
   – e.g. Index with SOLR
   – PageRank implementation
   – Can be extended with plugins




                                                         10 / 37
Not the best option when ...

 Hadoop based == batch processing == high latency
   – No guarantee that a page will be fetched / parsed / indexed within X
     minutes|hours


 Javascript / Ajax not supported (yet)




                                                                      11 / 37
Use cases

 Crawl for IR
   – Generic or vertical
   – Index and Search with SOLR
   – Single node to large clusters on Cloud

 … but also
   – Data Mining
   – NLP (e.g.Sentiment Analysis)
   – ML


   – MAHOUT / UIMA / GATE
   – Use Behemoth as glueware (https://github.com/DigitalPebble/behemoth)




                                                                   12 / 37
Customer cases
Specificity (Verticality)
                       Usecase : BetterJobs.com
                        –   Single server
                        –   Aggregates content from job portals
                        –   Extracts and normalizes structure (description,
                            requirements, locations)
                        –   ~1M pages total
                        –   Feeds SOLR index


          Usecase : SimilarPages.com
           –   Large cluster on Amazon EC2 (up to 400
               nodes)
           –   Fetched & parsed 3 billion pages
           –   10+ billion pages in crawlDB (~100TB data)
           –   200+ million lists of similarities
           –   No indexing / search involved
                                                                              Scale


                                                                                13 / 37
Typical Nutch Steps
 Same in 1.x and 2.x
 Sequence of batch operations
      1)   Inject → populates CrawlDB from seed list
      2)   Generate → Selects URLS to fetch in segment
      3)   Fetch → Fetches URLs from segment
      4)   Parse → Parses content (text + metadata)
      5)   UpdateDB → Updates CrawlDB (new URLs, new status...)
      6)   InvertLinks → Build Webgraph
      7)   SOLRIndex → Send docs to SOLR
      8)   SOLRDedup → Remove duplicate docs based on signature
 Repeat steps 2 to 8
 Or use the all-in-one crawl script

                                                          14 / 37
Main steps


 Seed
 List        CrawlDB                 Segment
                                /
                                /crawl_fetch/
                                crawl_generate/
                                /content/
                                /crawl_parse/
                                /parse_data/
                                /parse_text/




                       LinkDB


                                                  15 / 37
Frontier expansion

 Manual “discovery”
   – Adding new URLs by
     hand, “seeding”


 Automatic discovery
  of new resources
  (frontier expansion)
   – Not all outlinks are
     equally useful - control       seed
   – Requires content
                                           i=1
     parsing and link
     extraction
                                                 i=2
                                                        i=3

  [Slide courtesy of A. Bialecki]

                                                       16 / 37
An extensible framework
 Plugins
   – Activated with parameter 'plugin.includes'
   – Implement one or more endpoints

 Endpoints
   –   Protocol
   –   Parser
   –   HtmlParseFilter (ParseFilter in Nutch 2.x)
   –   ScoringFilter (used in various places)
   –   URLFilter (ditto)
   –   URLNormalizer (ditto)
   –   IndexingFilter




                                                    17 / 37
Features

 Fetcher
   –   Multi-threaded fetcher
   –   Follows robots.txt
   –   Groups URLs per hostname / domain / IP
   –   Limit the number of URLs for round of fetching
   –   Default values are polite but can be made more aggressive

 Crawl Strategy
   – Breadth-first but can be depth-first
   – Configurable via custom scoring plugins

 Scoring
   – OPIC (On-line Page Importance Calculation) by default
   – LinkRank


                                                                   18 / 37
Features (cont.)

 Protocols
   – Http, file, ftp, https

 Scheduling
   – Specified or adaptative

 URL filters
   – Regex, FSA, TLD, prefix, suffix


 URL normalisers
   – Default, regex




                                       19 / 37
Features (cont.)

 Parsing with Apache Tika
   – Hundreds of formats supported
   – But some legacy parsers as well
 Other plugins
   –   CreativeCommons
   –   Feeds
   –   Language Identification
   –   Rel tags
   –   Arbitrary Metadata

 Indexing to SOLR
   – Bespoke schema




                                       20 / 37
Data Structures in 1.x
 MapReduce jobs => I/O : Hadoop [Sequence|Map]Files
 CrawlDB => status of known pages
                       MapFile : <Text,CrawlDatum>
                       byte status; [fetched? Unfetched? Failed? Redir?]
                       long fetchTime;
                       byte retries;
     CrawlDB           int fetchInterval;
                       float score = 1.0f;
                       byte[] signature = null;
                       long modifiedTime;
                       org.apache.hadoop.io.MapWritable metaData;


 Input of : generate - index
 Output of : inject - update


                                                                   21 / 37
Data Structures 1.x

 Segment => round of fetching
 Identified by a timestamp

 Segment
      /crawl_generate/ → SequenceFile<Text,CrawlDatum>
      /crawl_fetch/ → MapFile<Text,CrawlDatum>
      /content/ → MapFile<Text,Content>
      /crawl_parse/ → SequenceFile<Text,CrawlDatum>
      /parse_data/ → MapFile<Text,ParseData>
      /parse_text/ → MapFile<Text,ParseText>


 Can have multiple versions of a page in different
  segments


                                                         22 / 37
Data Structures – 1.x

 linkDB => storage for Web Graph

                        MapFile : <Text,Inlinks>
                        Inlinks : HashSet <Inlink>
     LinkDB             Inlink :
                                 String fromUrl
                                 String anchor


 Output of : invertlinks
 Input of : SOLRIndex




                                                     23 / 37
NUTCH 2.x

 2.0 released in July 2012

 2.1 in October 2012

 Common features as 1.x
   – delegation to SOLR, TIKA, MapReduce etc...


 Moved to table-based architecture
   – Wealth of NoSQL projects in last few years


 Abstraction over storage layer → Apache GORA


                                                  24 / 37
Apache GORA

 http://gora.apache.org/

 ORM for NoSQL databases
   – and limited SQL support + file based storage
 0.2.1 released in August 2012
 DataStore implementations
       ●   Accumulo             ●   Avro
       ●   Cassandra            ●   DynamoDB (soon)
       ●   HBase                ●   SQL

 Serialization with Apache AVRO
 Object-to-datastore mappings (backend-specific)

                                                      25 / 37
AVRO Schema => Java code
 {"name": "WebPage",
  "type": "record",
  "namespace": "org.apache.nutch.storage",
  "fields": [
        {"name": "baseUrl", "type": ["null", "string"] },
        {"name": "status", "type": "int"},
        {"name": "fetchTime", "type": "long"},
        {"name": "prevFetchTime", "type": "long"},
        {"name": "fetchInterval", "type": "int"},
        {"name": "retriesSinceFetch", "type": "int"},
        {"name": "modifiedTime", "type": "long"},
        {"name": "protocolStatus", "type": {
           "name": "ProtocolStatus",
           "type": "record",
           "namespace": "org.apache.nutch.storage",
           "fields": [
               {"name": "code", "type": "int"},
               {"name": "args", "type": {"type": "array", "items": "string"}},
               {"name": "lastModified", "type": "long"}
           ]
           }},
 […]

                                                                                 26 / 37
Mapping file (backend specific – Hbase)
<gora-orm>

  <table name="webpage">
     <family name="p" maxVersions="1"/> <!-- This can also have params like compression, bloom filters -->
     <family name="f" maxVersions="1"/>
     <family name="s" maxVersions="1"/>
     <family name="il" maxVersions="1"/>
     <family name="ol" maxVersions="1"/>
     <family name="h" maxVersions="1"/>
     <family name="mtdt" maxVersions="1"/>
     <family name="mk" maxVersions="1"/>
  </table>
  <class table="webpage" keyClass="java.lang.String" name="org.apache.nutch.storage.WebPage">

    <!-- fetch fields                        -->
    <field name="baseUrl" family="f" qualifier="bas"/>
    <field name="status" family="f" qualifier="st"/>
    <field name="prevFetchTime" family="f" qualifier="pts"/>
    <field name="fetchTime" family="f" qualifier="ts"/>
    <field name="fetchInterval" family="f" qualifier="fi"/>
    <field name="retriesSinceFetch" family="f" qualifier="rsf"/>



                                                                                            27 / 37
DataStore operations

 Atomic operations
   – get(K key)
   – put(K key, T obj)
   – delete(K key)

 Querying
   – execute(Query<K, T> query) → Result<K,T>
   – deleteByQuery(Query<K, T> query)

 Wrappers for Apache Hadoop
   – GORAInput|OutputFormat
   – GoraRecordReader|Writer
   – GORAMapper|Reducer



                                                28 / 37
GORA in Nutch

 AVRO schema provided and java code pre-generated


 Mapping files provided for backends
   – can be modified if necessary

 Need to rebuild to get dependencies for backend
   – No binary distribution of Nutch 2.x

 http://wiki.apache.org/nutch/Nutch2Tutorial



                                                    29 / 37
Benefits

 Storage still distributed and replicated
 but one big table
   – status, metadata, content, text → one place
 Simplified logic in Nutch
   – Simpler code for updating / merging information
 More efficient (?)
   – No need to read / write entire structure to update records
   – No comparison available yet + early days for GORA
 Easier interaction with other resources
   – Third-party code just need to use GORA and schema



                                                                  30 / 37
Drawbacks

 More stuff to install and configure :-)

 Not as stable as Nutch 1.x

 Dependent on success of Gora




                                            31 / 37
2.x Work in progress

 Stabilise backend implementations
   – GORA-Hbase most reliable


 Synchronize features with 1.x
   – e.g. has ElasticSearch but missing LinkRank equivalent


 Filter enabled scans (GORA-119)
   – Don't need to de-serialize the whole dataset




                                                              32 / 37
Future

 Both 1.x and 2.x in parallel
   – but more frequent releases for 2.x


 New functionalities
   –   Support for SOLRCloud
   –   Sitemap (from Crawler Commons library)
   –   Canonical tag
   –   More indexers (e.g. ElasticSearch) + pluggable indexers?




                                                                  33 / 37
More delegation
 Great deal done in recent years (SOLR, Tika)

 Share code with crawler-commons
  (http://code.google.com/p/crawler-commons/)
   – Fetcher / protocol handling
   – Robots.txt parsing
   – URL normalisation / filtering


 PageRank-like computations to graph library
   – e.g. Apache Giraph
   – Should be more efficient as well




                                                 34 / 37
Where to find out more?

  Project page : http://nutch.apache.org/
  Wiki : http://wiki.apache.org/nutch/
  Mailing lists :
    – user@nutch.apache.org
    – dev@nutch.apache.org


 Chapter in 'Hadoop the Definitive Guide' (T. White)
   – Understanding Hadoop is essential anyway...


 Support / consulting :
   – http://wiki.apache.org/nutch/Support



                                                        35 / 37
Questions




            ?

                36 / 37
37 / 37

Contenu connexe

Tendances

HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...
Simplilearn
 
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Simplilearn
 

Tendances (20)

Apache flume - an Introduction
Apache flume - an IntroductionApache flume - an Introduction
Apache flume - an Introduction
 
Spark 의 핵심은 무엇인가? RDD! (RDD paper review)
Spark 의 핵심은 무엇인가? RDD! (RDD paper review)Spark 의 핵심은 무엇인가? RDD! (RDD paper review)
Spark 의 핵심은 무엇인가? RDD! (RDD paper review)
 
좌충우돌 Data Engineering 학습기
좌충우돌 Data Engineering 학습기좌충우돌 Data Engineering 학습기
좌충우돌 Data Engineering 학습기
 
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...
 
Introduction to Information Retrieval
Introduction to Information RetrievalIntroduction to Information Retrieval
Introduction to Information Retrieval
 
ORC improvement in Apache Spark 2.3
ORC improvement in Apache Spark 2.3ORC improvement in Apache Spark 2.3
ORC improvement in Apache Spark 2.3
 
Fundamentals of Web for Non-Developers
Fundamentals of Web for Non-DevelopersFundamentals of Web for Non-Developers
Fundamentals of Web for Non-Developers
 
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
 
Geospatial Options in Apache Spark
Geospatial Options in Apache SparkGeospatial Options in Apache Spark
Geospatial Options in Apache Spark
 
Web Crawler
Web CrawlerWeb Crawler
Web Crawler
 
[KAIST 채용설명회] 데이터 엔지니어는 무슨 일을 하나요?
[KAIST 채용설명회] 데이터 엔지니어는 무슨 일을 하나요?[KAIST 채용설명회] 데이터 엔지니어는 무슨 일을 하나요?
[KAIST 채용설명회] 데이터 엔지니어는 무슨 일을 하나요?
 
Apache Spark vs Apache Spark: An On-Prem Comparison of Databricks and Open-So...
Apache Spark vs Apache Spark: An On-Prem Comparison of Databricks and Open-So...Apache Spark vs Apache Spark: An On-Prem Comparison of Databricks and Open-So...
Apache Spark vs Apache Spark: An On-Prem Comparison of Databricks and Open-So...
 
Gaurav web mining
Gaurav web miningGaurav web mining
Gaurav web mining
 
[NDC18] 야생의 땅 듀랑고의 데이터 엔지니어링 이야기: 로그 시스템 구축 경험 공유 (2부)
[NDC18] 야생의 땅 듀랑고의 데이터 엔지니어링 이야기: 로그 시스템 구축 경험 공유 (2부)[NDC18] 야생의 땅 듀랑고의 데이터 엔지니어링 이야기: 로그 시스템 구축 경험 공유 (2부)
[NDC18] 야생의 땅 듀랑고의 데이터 엔지니어링 이야기: 로그 시스템 구축 경험 공유 (2부)
 
Dense Retrieval with Apache Solr Neural Search.pdf
Dense Retrieval with Apache Solr Neural Search.pdfDense Retrieval with Apache Solr Neural Search.pdf
Dense Retrieval with Apache Solr Neural Search.pdf
 
Hadoop Security Architecture
Hadoop Security ArchitectureHadoop Security Architecture
Hadoop Security Architecture
 
[NDC 2018] Spark, Flintrock, Airflow 로 구현하는 탄력적이고 유연한 데이터 분산처리 자동화 인프라 구축
[NDC 2018] Spark, Flintrock, Airflow 로 구현하는 탄력적이고 유연한 데이터 분산처리 자동화 인프라 구축[NDC 2018] Spark, Flintrock, Airflow 로 구현하는 탄력적이고 유연한 데이터 분산처리 자동화 인프라 구축
[NDC 2018] Spark, Flintrock, Airflow 로 구현하는 탄력적이고 유연한 데이터 분산처리 자동화 인프라 구축
 
Apache HBase™
Apache HBase™Apache HBase™
Apache HBase™
 
Apache spark 소개 및 실습
Apache spark 소개 및 실습Apache spark 소개 및 실습
Apache spark 소개 및 실습
 
Hadoop2.2
Hadoop2.2Hadoop2.2
Hadoop2.2
 

En vedette

Nutch as a Web data mining platform
Nutch as a Web data mining platformNutch as a Web data mining platform
Nutch as a Web data mining platform
abial
 
Challenges in Large-Scale Web Crawling
Challenges in Large-Scale Web CrawlingChallenges in Large-Scale Web Crawling
Challenges in Large-Scale Web Crawling
Nate Murray
 
Open source enterprise search and retrieval platform
Open source enterprise search and retrieval platformOpen source enterprise search and retrieval platform
Open source enterprise search and retrieval platform
mteutelink
 
Search Engine Capabilities - Apache Solr(Lucene)
Search Engine Capabilities - Apache Solr(Lucene)Search Engine Capabilities - Apache Solr(Lucene)
Search Engine Capabilities - Apache Solr(Lucene)
Manish kumar
 

En vedette (20)

Web Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache NutchWeb Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache Nutch
 
Large Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and FriendsLarge Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and Friends
 
Nutch as a Web data mining platform
Nutch as a Web data mining platformNutch as a Web data mining platform
Nutch as a Web data mining platform
 
Introduction to apache nutch
Introduction to apache nutchIntroduction to apache nutch
Introduction to apache nutch
 
An introduction to Storm Crawler
An introduction to Storm CrawlerAn introduction to Storm Crawler
An introduction to Storm Crawler
 
Giraph+Gora in ApacheCon14
Giraph+Gora in ApacheCon14Giraph+Gora in ApacheCon14
Giraph+Gora in ApacheCon14
 
Challenges in Large-Scale Web Crawling
Challenges in Large-Scale Web CrawlingChallenges in Large-Scale Web Crawling
Challenges in Large-Scale Web Crawling
 
Large Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and FriendsLarge Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and Friends
 
Open source enterprise search and retrieval platform
Open source enterprise search and retrieval platformOpen source enterprise search and retrieval platform
Open source enterprise search and retrieval platform
 
Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...
Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...
Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...
 
Populate your Search index, NEST 2016-01
Populate your Search index, NEST 2016-01Populate your Search index, NEST 2016-01
Populate your Search index, NEST 2016-01
 
Apache Tika end-to-end
Apache Tika end-to-endApache Tika end-to-end
Apache Tika end-to-end
 
Content Analysis with Apache Tika
Content Analysis with Apache TikaContent Analysis with Apache Tika
Content Analysis with Apache Tika
 
ProjectHub
ProjectHubProjectHub
ProjectHub
 
Search Engine Capabilities - Apache Solr(Lucene)
Search Engine Capabilities - Apache Solr(Lucene)Search Engine Capabilities - Apache Solr(Lucene)
Search Engine Capabilities - Apache Solr(Lucene)
 
Search engine
Search engineSearch engine
Search engine
 
Metadata Extraction and Content Transformation
Metadata Extraction and Content TransformationMetadata Extraction and Content Transformation
Metadata Extraction and Content Transformation
 
Current challenges in web crawling
Current challenges in web crawlingCurrent challenges in web crawling
Current challenges in web crawling
 
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache Solr
 

Similaire à Large scale crawling with Apache Nutch

Hadoop and Hive Development at Facebook
Hadoop and Hive Development at  FacebookHadoop and Hive Development at  Facebook
Hadoop and Hive Development at Facebook
S S
 
Hadoop and Hive Development at Facebook
Hadoop and Hive Development at FacebookHadoop and Hive Development at Facebook
Hadoop and Hive Development at Facebook
elliando dias
 
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...
DataWorks Summit
 

Similaire à Large scale crawling with Apache Nutch (20)

Hadoop and Hive Development at Facebook
Hadoop and Hive Development at  FacebookHadoop and Hive Development at  Facebook
Hadoop and Hive Development at Facebook
 
Hadoop and Hive Development at Facebook
Hadoop and Hive Development at FacebookHadoop and Hive Development at Facebook
Hadoop and Hive Development at Facebook
 
Low latency scalable web crawling on Apache Storm
Low latency scalable web crawling on Apache StormLow latency scalable web crawling on Apache Storm
Low latency scalable web crawling on Apache Storm
 
Hadoop ppt1
Hadoop ppt1Hadoop ppt1
Hadoop ppt1
 
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
 
Harnessing the power of Nutch with Scala
Harnessing the power of Nutch with ScalaHarnessing the power of Nutch with Scala
Harnessing the power of Nutch with Scala
 
HADOOP
HADOOPHADOOP
HADOOP
 
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosHadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
 
Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop Administration
 
Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop Administration
 
Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop Administration
 
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...
 
Hadoop pig
Hadoop pigHadoop pig
Hadoop pig
 
Michael stack -the state of apache h base
Michael stack -the state of apache h baseMichael stack -the state of apache h base
Michael stack -the state of apache h base
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
 
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
 
The Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop EcosystemThe Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop Ecosystem
 
HBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget
HBaseCon 2012 | Building a Large Search Platform on a Shoestring BudgetHBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget
HBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget
 
Apache Spark Introduction @ University College London
Apache Spark Introduction @ University College LondonApache Spark Introduction @ University College London
Apache Spark Introduction @ University College London
 
Apache Spark
Apache SparkApache Spark
Apache Spark
 

Dernier

CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
giselly40
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
Enterprise Knowledge
 

Dernier (20)

Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 

Large scale crawling with Apache Nutch

  • 1. Large Scale Crawling with Apache Julien Nioche julien@digitalpebble.com ApacheCon Europe 2012
  • 2. About myself  DigitalPebble Ltd, Bristol (UK)  Specialised in Text Engineering – Web Crawling – Natural Language Processing – Information Retrieval – Data Mining  Strong focus on Open Source & Apache ecosystem  Apache Nutch VP  Apache Tika committer  User | Contributor – SOLR, Lucene – GATE, UIMA – Mahout – Behemoth 2 / 37
  • 3. Objectives  Overview of the project  Nutch in a nutshell  Nutch 2.x  Future developments 3 / 37
  • 4. Nutch?  “Distributed framework for large scale web crawling” – but does not have to be large scale at all – or even on the web (file-protocol)  Apache TLP since May 2010  Based on Apache Hadoop  Indexing and Search 4 / 37
  • 5. Short history  2002/2003 : Started By Doug Cutting & Mike Caffarella  2004 : sub-project of Lucene @Apache  2005 : MapReduce implementation in Nutch – 2006 : Hadoop sub-project of Lucene @Apache  2006/7 : Parser and MimeType in Tika – 2008 : Tika sub-project of Lucene @Apache  May 2010 : TLP project at Apache  June 2012 : Nutch 1.5.1  Oct 2012 : Nutch 2.1 5 / 37
  • 6. Recent Releases 1.0 1.1 1.2 1.3 1.4 1.5.1 trunk 2.x 2.0 2.1 06/09 06/10 06/11 06/12 7 / 37
  • 7. Community  6 active committers / PMC members – 4 within the last 18 months  Constant stream of new contributions & bug reports  Steady numbers of mailing list subscribers and traffic  Nutch is a very healthy 10-year old 9 / 37
  • 8. Why use Nutch?  Usual reasons – Mature, business-friendly license, community, ...  Scalability – Tried and tested on very large scale – Hadoop cluster : installation and skills  Features – e.g. Index with SOLR – PageRank implementation – Can be extended with plugins 10 / 37
  • 9. Not the best option when ...  Hadoop based == batch processing == high latency – No guarantee that a page will be fetched / parsed / indexed within X minutes|hours  Javascript / Ajax not supported (yet) 11 / 37
  • 10. Use cases  Crawl for IR – Generic or vertical – Index and Search with SOLR – Single node to large clusters on Cloud  … but also – Data Mining – NLP (e.g.Sentiment Analysis) – ML – MAHOUT / UIMA / GATE – Use Behemoth as glueware (https://github.com/DigitalPebble/behemoth) 12 / 37
  • 11. Customer cases Specificity (Verticality) Usecase : BetterJobs.com – Single server – Aggregates content from job portals – Extracts and normalizes structure (description, requirements, locations) – ~1M pages total – Feeds SOLR index Usecase : SimilarPages.com – Large cluster on Amazon EC2 (up to 400 nodes) – Fetched & parsed 3 billion pages – 10+ billion pages in crawlDB (~100TB data) – 200+ million lists of similarities – No indexing / search involved Scale 13 / 37
  • 12. Typical Nutch Steps  Same in 1.x and 2.x  Sequence of batch operations 1) Inject → populates CrawlDB from seed list 2) Generate → Selects URLS to fetch in segment 3) Fetch → Fetches URLs from segment 4) Parse → Parses content (text + metadata) 5) UpdateDB → Updates CrawlDB (new URLs, new status...) 6) InvertLinks → Build Webgraph 7) SOLRIndex → Send docs to SOLR 8) SOLRDedup → Remove duplicate docs based on signature  Repeat steps 2 to 8  Or use the all-in-one crawl script 14 / 37
  • 13. Main steps Seed List CrawlDB Segment / /crawl_fetch/ crawl_generate/ /content/ /crawl_parse/ /parse_data/ /parse_text/ LinkDB 15 / 37
  • 14. Frontier expansion  Manual “discovery” – Adding new URLs by hand, “seeding”  Automatic discovery of new resources (frontier expansion) – Not all outlinks are equally useful - control seed – Requires content i=1 parsing and link extraction i=2 i=3 [Slide courtesy of A. Bialecki] 16 / 37
  • 15. An extensible framework  Plugins – Activated with parameter 'plugin.includes' – Implement one or more endpoints  Endpoints – Protocol – Parser – HtmlParseFilter (ParseFilter in Nutch 2.x) – ScoringFilter (used in various places) – URLFilter (ditto) – URLNormalizer (ditto) – IndexingFilter 17 / 37
  • 16. Features  Fetcher – Multi-threaded fetcher – Follows robots.txt – Groups URLs per hostname / domain / IP – Limit the number of URLs for round of fetching – Default values are polite but can be made more aggressive  Crawl Strategy – Breadth-first but can be depth-first – Configurable via custom scoring plugins  Scoring – OPIC (On-line Page Importance Calculation) by default – LinkRank 18 / 37
  • 17. Features (cont.)  Protocols – Http, file, ftp, https  Scheduling – Specified or adaptative  URL filters – Regex, FSA, TLD, prefix, suffix  URL normalisers – Default, regex 19 / 37
  • 18. Features (cont.)  Parsing with Apache Tika – Hundreds of formats supported – But some legacy parsers as well  Other plugins – CreativeCommons – Feeds – Language Identification – Rel tags – Arbitrary Metadata  Indexing to SOLR – Bespoke schema 20 / 37
  • 19. Data Structures in 1.x  MapReduce jobs => I/O : Hadoop [Sequence|Map]Files  CrawlDB => status of known pages MapFile : <Text,CrawlDatum> byte status; [fetched? Unfetched? Failed? Redir?] long fetchTime; byte retries; CrawlDB int fetchInterval; float score = 1.0f; byte[] signature = null; long modifiedTime; org.apache.hadoop.io.MapWritable metaData;  Input of : generate - index  Output of : inject - update 21 / 37
  • 20. Data Structures 1.x  Segment => round of fetching  Identified by a timestamp Segment /crawl_generate/ → SequenceFile<Text,CrawlDatum> /crawl_fetch/ → MapFile<Text,CrawlDatum> /content/ → MapFile<Text,Content> /crawl_parse/ → SequenceFile<Text,CrawlDatum> /parse_data/ → MapFile<Text,ParseData> /parse_text/ → MapFile<Text,ParseText>  Can have multiple versions of a page in different segments 22 / 37
  • 21. Data Structures – 1.x  linkDB => storage for Web Graph MapFile : <Text,Inlinks> Inlinks : HashSet <Inlink> LinkDB Inlink : String fromUrl String anchor  Output of : invertlinks  Input of : SOLRIndex 23 / 37
  • 22. NUTCH 2.x  2.0 released in July 2012  2.1 in October 2012  Common features as 1.x – delegation to SOLR, TIKA, MapReduce etc...  Moved to table-based architecture – Wealth of NoSQL projects in last few years  Abstraction over storage layer → Apache GORA 24 / 37
  • 23. Apache GORA  http://gora.apache.org/  ORM for NoSQL databases – and limited SQL support + file based storage  0.2.1 released in August 2012  DataStore implementations ● Accumulo ● Avro ● Cassandra ● DynamoDB (soon) ● HBase ● SQL  Serialization with Apache AVRO  Object-to-datastore mappings (backend-specific) 25 / 37
  • 24. AVRO Schema => Java code {"name": "WebPage", "type": "record", "namespace": "org.apache.nutch.storage", "fields": [ {"name": "baseUrl", "type": ["null", "string"] }, {"name": "status", "type": "int"}, {"name": "fetchTime", "type": "long"}, {"name": "prevFetchTime", "type": "long"}, {"name": "fetchInterval", "type": "int"}, {"name": "retriesSinceFetch", "type": "int"}, {"name": "modifiedTime", "type": "long"}, {"name": "protocolStatus", "type": { "name": "ProtocolStatus", "type": "record", "namespace": "org.apache.nutch.storage", "fields": [ {"name": "code", "type": "int"}, {"name": "args", "type": {"type": "array", "items": "string"}}, {"name": "lastModified", "type": "long"} ] }}, […] 26 / 37
  • 25. Mapping file (backend specific – Hbase) <gora-orm> <table name="webpage"> <family name="p" maxVersions="1"/> <!-- This can also have params like compression, bloom filters --> <family name="f" maxVersions="1"/> <family name="s" maxVersions="1"/> <family name="il" maxVersions="1"/> <family name="ol" maxVersions="1"/> <family name="h" maxVersions="1"/> <family name="mtdt" maxVersions="1"/> <family name="mk" maxVersions="1"/> </table> <class table="webpage" keyClass="java.lang.String" name="org.apache.nutch.storage.WebPage"> <!-- fetch fields --> <field name="baseUrl" family="f" qualifier="bas"/> <field name="status" family="f" qualifier="st"/> <field name="prevFetchTime" family="f" qualifier="pts"/> <field name="fetchTime" family="f" qualifier="ts"/> <field name="fetchInterval" family="f" qualifier="fi"/> <field name="retriesSinceFetch" family="f" qualifier="rsf"/> 27 / 37
  • 26. DataStore operations  Atomic operations – get(K key) – put(K key, T obj) – delete(K key)  Querying – execute(Query<K, T> query) → Result<K,T> – deleteByQuery(Query<K, T> query)  Wrappers for Apache Hadoop – GORAInput|OutputFormat – GoraRecordReader|Writer – GORAMapper|Reducer 28 / 37
  • 27. GORA in Nutch  AVRO schema provided and java code pre-generated  Mapping files provided for backends – can be modified if necessary  Need to rebuild to get dependencies for backend – No binary distribution of Nutch 2.x  http://wiki.apache.org/nutch/Nutch2Tutorial 29 / 37
  • 28. Benefits  Storage still distributed and replicated  but one big table – status, metadata, content, text → one place  Simplified logic in Nutch – Simpler code for updating / merging information  More efficient (?) – No need to read / write entire structure to update records – No comparison available yet + early days for GORA  Easier interaction with other resources – Third-party code just need to use GORA and schema 30 / 37
  • 29. Drawbacks  More stuff to install and configure :-)  Not as stable as Nutch 1.x  Dependent on success of Gora 31 / 37
  • 30. 2.x Work in progress  Stabilise backend implementations – GORA-Hbase most reliable  Synchronize features with 1.x – e.g. has ElasticSearch but missing LinkRank equivalent  Filter enabled scans (GORA-119) – Don't need to de-serialize the whole dataset 32 / 37
  • 31. Future  Both 1.x and 2.x in parallel – but more frequent releases for 2.x  New functionalities – Support for SOLRCloud – Sitemap (from Crawler Commons library) – Canonical tag – More indexers (e.g. ElasticSearch) + pluggable indexers? 33 / 37
  • 32. More delegation  Great deal done in recent years (SOLR, Tika)  Share code with crawler-commons (http://code.google.com/p/crawler-commons/) – Fetcher / protocol handling – Robots.txt parsing – URL normalisation / filtering  PageRank-like computations to graph library – e.g. Apache Giraph – Should be more efficient as well 34 / 37
  • 33. Where to find out more?  Project page : http://nutch.apache.org/  Wiki : http://wiki.apache.org/nutch/  Mailing lists : – user@nutch.apache.org – dev@nutch.apache.org  Chapter in 'Hadoop the Definitive Guide' (T. White) – Understanding Hadoop is essential anyway...  Support / consulting : – http://wiki.apache.org/nutch/Support 35 / 37
  • 34. Questions ? 36 / 37

Notes de l'éditeur

  1. I&apos;ll be talking about large scale document processing and more specifically about Behemoth which is an open source project based on Hadoop
  2. A few words about myself just before I start... What I mean by Text Engineering is a variety of activities ranging from .... What makes the identity of DP is The main projects I am involved in are …
  3. Note that I mention crawling and not web search → used not only for search + used to do indexing and search using Lucene but now delegate this to SOLR
  4. Endpoints are called in various places URL filters and normalisers in a lot of places Same for Soring Filters
  5. Main steps in Nutch More actions available Shell Wrappers around hadoop commands
  6. Main steps in Nutch More actions available Shell Wrappers around hadoop commands
  7. Endpoints are called in various places URL filters and normalisers in a lot of places Same for Soring Filters
  8. Fetcher . multithreaded but polite
  9. Fetcher . multithreaded but polite
  10. Writable object – crawl datum
  11. What does this mean for Nutch?
  12. What does this mean for Nutch?