Web Scraping Using Nutch and Solr
● A simple example of using open source code
● Scrape a single web site (ours)
● Environment and code
– CentOS 6.2 (Linux)
– Apache Nutch 1.6
– Solr 4.2.1
– Java 1.6
Nutch and Solr Architecture
● Nutch processes URLs and feeds content to Solr
● Solr indexes content
Where to get source code
● Nutch
– http://nutch.apache.org
● Solr
– http://lucene.apache.org/solr
● Java
– http://java.com
Installing Source - Nutch
● Nutch is delivered as
– apache-nutch-1.6-bin.tar ( 64M )
– apache-nutch-1.6-src.tar ( 20M )
● Copy each tar file to your desired location
● Unpack each tar file with
– tar xvf <tar file>
● The second (source) tar file is optional
Installing Source - Solr
● Solr is delivered as
– solr-4.2.1.zip ( 116M )
● Copy file to your desired location
● Unpack the zip file with
– unzip <zip file>
Configuring Nutch Part 1
● Assuming we will crawl a single web site
● Ensure that JAVA_HOME is set
● cd apache-nutch-1.6
● Edit agent name in conf/nutch-site.xml
<property>
<name>http.agent.name</name>
<value>Nutch Spider</value>
</property>
● mkdir -p urls ; cd urls ; touch seed.txt
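Nutch will not fetch anything if http.agent.name is empty, so it can be worth checking the setting before the first crawl. The following is a minimal sketch in Python, not part of the original slides. The helper name agent_name and the SAMPLE text are assumptions for illustration, and the sample assumes the property sits inside the usual <configuration> root element.

```python
import xml.etree.ElementTree as ET


def agent_name(xml_text):
    """Return the value of http.agent.name, or "" if it is missing or empty."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name") == "http.agent.name":
            return (prop.findtext("value") or "").strip()
    return ""


# Hypothetical sample mirroring the property shown on the slide.
SAMPLE = """<configuration>
  <property>
    <name>http.agent.name</name>
    <value>Nutch Spider</value>
  </property>
</configuration>"""

name = agent_name(SAMPLE)
print("agent name set" if name else "agent name missing", repr(name))
```

To check the real file, read conf/nutch-site.xml from the apache-nutch-1.6 directory and pass its contents to the same helper.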
Configuring Nutch Part 2
● Add the following URL (ours) to seed.txt
– http://www.semtech-solutions.co.nz
● Change the URL filtering in conf/regex-urlfilter.txt: replace the lines
– # accept anything else
– +.
● with
– +^http://([a-z0-9]*\.)*semtech-solutions\.co\.nz/
● This restricts the crawl to URLs found on our own site
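To see what the filter line lets through, here is a small sketch in Python that mimics how regex-urlfilter.txt is evaluated: rules are tried in order, the first pattern that matches decides (+ accepts, - rejects), and a URL that matches no rule is ignored. This is an illustration only, not Nutch code. The function name accepts and the test URLs are assumptions, and the pattern here escapes the dots in the host name so that they match literal dots only.

```python
import re

# Rules in the order they would appear in conf/regex-urlfilter.txt.
RULES = [
    r"+^http://([a-z0-9]*\.)*semtech-solutions\.co\.nz/",
]


def accepts(url, rules=RULES):
    """Return True if the first rule whose pattern matches is a '+' rule."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched, so the URL is ignored


for url in [
    "http://www.semtech-solutions.co.nz/",
    "http://semtech-solutions.co.nz/page.html",
    "http://nutch.apache.org/",
    "https://www.semtech-solutions.co.nz/",
]:
    print(url, accepts(url))
```

Note that the pattern only covers http:// URLs, so https:// links on the same site would be dropped unless the rule is widened.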
Configuring Solr Part 1
● cd solr-4.2.1/example/solr/collection1/conf
● Add some extra fields to schema.xml after _version_ field i.e.
Start Solr Server – Part 1
● Within solr-4.2.1/example, run the following command
– java -jar start.jar
● Now try to access the Solr admin web page
– http://localhost:8983/solr/admin
● You should now see the admin web site
– ( see next page )
Start Solr Server – Part 2
● Solr Admin web page
Run Nutch / Solr
● We are ready to crawl our first web site
● Go to apache-nutch-1.6 directory
● Run the following commands
– touch nutch_start.bash
– chmod 755 nutch_start.bash
– vi nutch_start.bash
● Add the following text to the file
#!/bin/bash
bin/nutch crawl urls -solr http://localhost:8983/solr/ \
-dir crawl -depth 3 -topN 3
Run Nutch / Solr
● Now run the nutch bash file
– ./nutch_start.bash
● Select the Logging option on the admin console
● Monitor for errors in Logging console
● The crawl should finish with no errors, and the crawl window should show the line
– Crawl finished: crawl
Check Crawled Data
● Now we check the data that we have crawled
● In the Admin Console window
– Set Core Selector to collection1
– Select the Query option
– Click the Execute Query button
● You should now see some of the data that you have crawled
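The same check can be made without the Admin Console by calling the core's select handler over HTTP. The sketch below is in Python and is not part of the original slides. The function names select_url and run_query are assumptions. The host, port and core name are the ones used above, and the query *:* matches every document.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

SOLR_BASE = "http://localhost:8983/solr"  # Solr address used in these slides
CORE = "collection1"                      # core selected in the Admin Console


def select_url(query="*:*", rows=10):
    """Build the URL of a select request that returns JSON."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{SOLR_BASE}/{CORE}/select?{params}"


def run_query(query="*:*", rows=10):
    """Send the request to a running Solr and return the 'response' part."""
    with urlopen(select_url(query, rows)) as resp:
        return json.load(resp)["response"]


# Only the URL is printed here, so the block runs without a Solr server.
print(select_url())
```

With Solr running, run_query()["numFound"] gives the number of indexed documents and run_query()["docs"] the first rows of crawled data.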
Crawled Data
● Crawled data in solr query
Crawled Data
● That's your first simple crawl completed
● Further reading at
– http://nutch.apache.org
– http://lucene.apache.org/solr
● Now you can
– Add more URLs to your seed.txt
– Widen your link search via the crawl options
● -depth
● -topN
– Modify your URL filtering
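In the Nutch 1.x crawl command, -depth is the number of generate/fetch/update rounds and -topN caps the number of URLs fetched in each round, so the script used earlier (-depth 3 -topN 3) fetches at most 3 x 3 = 9 pages. The toy model below, written in Python, shows how the two limits interact. It is a simplified sketch, not Nutch's real generator (which ranks URLs by score). The function crawl, the LINKS graph and the page paths are all hypothetical.

```python
# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "/": ["/a", "/b", "/c", "/d"],
    "/a": ["/e"],
    "/b": [],
    "/c": [],
    "/d": [],
    "/e": [],
}


def crawl(seeds, links, depth, top_n):
    """Run `depth` rounds, fetching at most `top_n` known URLs per round."""
    fetched = []
    seen = set(seeds)
    frontier = list(seeds)
    for _ in range(depth):
        batch, frontier = frontier[:top_n], frontier[top_n:]
        if not batch:
            break  # nothing left to fetch
        for url in batch:
            fetched.append(url)
            for out in links.get(url, []):
                if out not in seen:  # newly discovered link
                    seen.add(out)
                    frontier.append(out)
    return fetched


pages = crawl(["/"], LINKS, depth=3, top_n=3)
print(len(pages), "pages fetched:", pages)
```

Raising -depth lets the crawl reach pages further from the seed, while raising -topN lets each round take more of the URLs already discovered.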
Contact Us
● Feel free to contact us at
– www.semtech-solutions.co.nz
– info@semtech-solutions.co.nz
● We offer IT project consultancy
● We are happy to hear about your problems
● You pay only for the hours that you need to solve your problems
