Web Crawling
  Web Scraping

cuneytykaya

cuneyt.yesilkaya
Cüneyt Yeşilkaya

Agenda
●   Web Crawling
●   Web Scraping
●   Web Crawling Tools
●   Demo (Crawler4j & Jsoup)
●   Crawling - Where to Use
Web Crawling
Browsing the World Wide Web in a methodical, automated manner or in an orderly fashion.
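The "methodical, automated" part of the definition above can be sketched as a breadth-first traversal with a queue of pages to visit and a set of pages already seen. This is only an illustration: the class name `CrawlOrderSketch` and the in-memory `links` map are assumptions standing in for real HTTP fetching and link extraction.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

/** Breadth-first crawl over a hypothetical in-memory link graph. */
final class CrawlOrderSketch {

    /**
     * Returns pages in the order a breadth-first crawler would visit them.
     * The links map (page -> pages it links to) replaces real fetching.
     */
    static List<String> crawl(String seed, Map<String, List<String>> links) {
        Set<String> seen = new LinkedHashSet<>();     // pages already queued
        Queue<String> frontier = new ArrayDeque<>();  // pages waiting to be visited
        seen.add(seed);
        frontier.add(seed);

        List<String> visited = new ArrayList<>();
        while (!frontier.isEmpty()) {
            String page = frontier.remove();
            visited.add(page);
            for (String target : links.getOrDefault(page, List.of())) {
                // Set.add returns false when the page was queued before
                if (seen.add(target)) {
                    frontier.add(target);
                }
            }
        }
        return visited;
    }
}
```

Each page is visited once even when several pages link to it, which is what keeps the traversal orderly on a graph with cycles.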
Web Scraping
A software technique for extracting information from websites.
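As a rough illustration of "extracting information", the sketch below pulls link targets and link texts out of an HTML string with a regular expression. The class name `LinkScrapeSketch` is hypothetical, and the regex only copes with simple, well-formed markup; a real HTML parser is the more robust choice for production scraping.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Regex-based link extraction; a sketch, not a full HTML parser. */
final class LinkScrapeSketch {

    // Matches <a ... href="...">text</a> with a double-quoted href.
    private static final Pattern ANCHOR = Pattern.compile(
        "<a\\s[^>]*href=\"([^\"]*)\"[^>]*>(.*?)</a>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns one {href, text} pair per anchor found in the HTML. */
    static List<String[]> extractLinks(String html) {
        List<String[]> result = new ArrayList<>();
        Matcher m = ANCHOR.matcher(html);
        while (m.find()) {
            result.add(new String[] { m.group(1), m.group(2).trim() });
        }
        return result;
    }
}
```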
Web Crawling Tools
Selecting a Crawler
●   Multi-Threaded Structure
●   Max Pages to Fetch
●   Max Page Size
●   Max Depth to Crawl
●   Redundant Link Control
●   Politeness Time
●   Resumable
●   Well-Documented
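Three of the criteria above (max pages to fetch, max depth to crawl, redundant link control) can be sketched in a few lines. Everything here is an assumption for illustration: the class `CrawlLimitsSketch`, the in-memory `links` map keyed by normalized URL, and the normalization rule, which expects absolute URLs and treats URLs as equal when they differ only in scheme or host case, dot segments, or fragment.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

/** Crawl bounded by depth and page count, with duplicate URLs filtered out. */
final class CrawlLimitsSketch {

    private static final class Entry {
        final String url;
        final int depth;
        Entry(String url, int depth) { this.url = url; this.depth = depth; }
    }

    /** Canonical form for redundant link control (absolute URLs assumed). */
    static String normalize(String url) {
        URI u = URI.create(url).normalize();   // resolves "." and ".." segments
        String path = (u.getPath() == null || u.getPath().isEmpty()) ? "/" : u.getPath();
        String query = (u.getQuery() == null) ? "" : "?" + u.getQuery();
        // The fragment is dropped on purpose: it never changes the fetched page.
        return u.getScheme().toLowerCase() + "://"
            + u.getRawAuthority().toLowerCase() + path + query;
    }

    static List<String> crawl(String seed, Map<String, List<String>> links,
                              int maxDepth, int maxPages) {
        Set<String> seen = new HashSet<>();
        Queue<Entry> frontier = new ArrayDeque<>();
        String start = normalize(seed);
        seen.add(start);
        frontier.add(new Entry(start, 0));

        List<String> fetched = new ArrayList<>();
        while (!frontier.isEmpty() && fetched.size() < maxPages) {
            Entry current = frontier.remove();
            fetched.add(current.url);
            if (current.depth == maxDepth) {
                continue;                      // do not follow links any deeper
            }
            for (String raw : links.getOrDefault(current.url, List.of())) {
                String target = normalize(raw);
                if (seen.add(target)) {        // false for an already known URL
                    frontier.add(new Entry(target, current.depth + 1));
                }
            }
        }
        return fetched;
    }
}
```

Politeness, resumability and multi-threading are left out here; they need timing and persistent state rather than just a queue and a set.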
Crawler4j
Yasser Ganjisaffar
Microsoft Bing & Microsoft Live Search
Demo - Crawler4j (1/3)


myCrawler.java     myController.java
Demo - Crawler4j (2/3)
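myCrawler.java

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class myCrawler extends WebCrawler {

    // Follow only links that stay on the demo site.
    // Note: crawler4j 4.x declares this as shouldVisit(Page referringPage, WebURL url).
    @Override
    public boolean shouldVisit(WebURL url) {
      return url.getURL().startsWith("http://www.gdgistanbul.com");
    }

    // Called for each fetched page; process the content here.
    @Override
    public void visit(Page page) {
       String url = page.getWebURL().getURL();
    }
}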
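The decision made in `shouldVisit` can be tried without crawler4j on the classpath. The sketch below works on plain strings and, as an addition not present in the slide, also skips URLs that end in a static-resource extension. The class name `VisitFilterSketch` and the extension list are assumptions.

```java
import java.util.regex.Pattern;

/** Stand-alone version of a shouldVisit-style URL filter. */
final class VisitFilterSketch {

    // Hypothetical list of extensions to skip; adjust to the content you need.
    private static final Pattern STATIC_RESOURCE = Pattern.compile(
        ".*\\.(css|js|gif|jpe?g|png|mp3|mp4|zip|gz|pdf)$");

    /**
     * Accepts a URL when it starts with allowedPrefix (expected in lower case)
     * and does not end in one of the skipped extensions.
     */
    static boolean shouldVisit(String url, String allowedPrefix) {
        String href = url.toLowerCase();
        return href.startsWith(allowedPrefix)
            && !STATIC_RESOURCE.matcher(href).matches();
    }
}
```

Lower-casing the URL before matching keeps `logo.PNG` and `logo.png` from being treated differently.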
Demo - Crawler4j (3/3)
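myController.java

  // Fragment of a main method; the CrawlController constructor declares "throws Exception".
  int numberOfCrawlers = 4;   // number of concurrent crawler threads

  CrawlConfig config = new CrawlConfig();
  // crawler4j needs a folder for intermediate crawl data; this path is an assumed example.
  config.setCrawlStorageFolder("/tmp/crawler4j");
  config.setPolitenessDelay(250);   // milliseconds between requests
  config.setMaxPagesToFetch(100);
  PageFetcher pageFetcher = new PageFetcher(config);
  RobotstxtConfig robotstxtConfig = new RobotstxtConfig();

  RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
  CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

  controller.addSeed("http://www.gdgistanbul.com");
  controller.start(myCrawler.class, numberOfCrawlers);   // blocks until the crawl finishes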
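To make the politeness delay concrete, here is one way a delay such as the 250 ms above could be enforced per host when several crawler threads share a scheduler. This is not crawler4j's implementation: the class `PolitenessSketch` and its `reserve` method are hypothetical. Time is passed in as a parameter so the logic can be exercised without sleeping.

```java
import java.util.HashMap;
import java.util.Map;

/** Per-host politeness bookkeeping shared by all crawler threads. */
final class PolitenessSketch {

    private final long delayMillis;
    // Host -> time (in milliseconds) of the latest reserved fetch slot.
    private final Map<String, Long> lastSlot = new HashMap<>();

    PolitenessSketch(long delayMillis) {
        this.delayMillis = delayMillis;
    }

    /**
     * Reserves the next fetch slot for the host and returns how many
     * milliseconds the caller should wait, counted from nowMillis.
     */
    synchronized long reserve(String host, long nowMillis) {
        Long last = lastSlot.get(host);
        long slot = (last == null) ? nowMillis : Math.max(nowMillis, last + delayMillis);
        lastSlot.put(host, slot);
        return slot - nowMillis;
    }
}
```

A crawler thread would call `reserve(host, System.currentTimeMillis())`, sleep for the returned duration, and then fetch. Requests to different hosts do not delay each other.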
Demo - Jsoup (1/2)
       Jsoup: a convenient way to parse HTML in Java

● scrape and parse HTML from a URL, file, or string
● find and extract data, using DOM traversal or CSS selectors
● manipulate the HTML elements, attributes, and text
Demo - Jsoup (2/2)
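// Fetch a page and select elements with a CSS selector
Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements newsHeadlines = doc.select("#mp-itn b a");

// Parse HTML from a string
String html = "<html><head><title>First parse</title></head>"
    + "<body><p>Parsed HTML into a doc.</p></body></html>";
Document parsed = Jsoup.parse(html);

// DOM traversal: collect the links inside the element with id "content"
Element content = doc.getElementById("content");   // null if the page has no such element
Elements contentLinks = content.getElementsByTag("a");
for (Element link : contentLinks) {
    String linkHref = link.attr("href");
    String linkText = link.text();
}

// CSS selectors: every link with an href, every element with a src
Elements links = doc.select("a[href]");
Elements media = doc.select("[src]");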
Where to Use
● Search Engines (GoogleBot)
● Aggregators
  ○   Data aggregator
  ○   News aggregator
  ○   Review aggregator
  ○   Search aggregator
  ○   Social network aggregation
  ○   Video aggregator
● Kaarun Product Collector
www.kaarun.com
All Friends
Products for each Facebook Like
Thank you...




cyesilkaya.wordpress.com & @cuneytykaya & tr.linkedin/cuneyt.yesilkaya

GDG İstanbul Şubat Etkinliği - Sunum (GDG Istanbul February Event - Presentation)
