7. Selecting a Crawler
● Multi-Threaded Structure
● Max Pages to Fetch
● Max Page Size
● Max Depth to Crawl
● Redundant Link Control
● Politeness Time
● Resumable
● Well-Documented
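Several of the criteria above (max pages to fetch, max depth to crawl, redundant link control) can be illustrated without any crawler library. The following is only a minimal sketch under stated assumptions: the class name CrawlLimitsSketch, the method crawl, and the in-memory link graph that stands in for the web are all hypothetical, and nothing here reflects how a particular crawler implements these limits.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: breadth-first crawl over an in-memory link graph.
final class CrawlLimitsSketch {

    // links maps a URL to the URLs found on that page (a stand-in for fetching).
    static List<String> crawl(Map<String, List<String>> links, String seed,
                              int maxPages, int maxDepth) {
        List<String> visited = new ArrayList<>();
        // Depth of every URL seen so far; doubles as the "already seen" set.
        Map<String, Integer> depthOf = new HashMap<>();
        Deque<String> frontier = new ArrayDeque<>();
        depthOf.put(seed, 0);
        frontier.add(seed);

        // Max pages to fetch: stop once the budget is used up.
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();
            visited.add(url); // a real crawler would download the page here
            int depth = depthOf.get(url);
            if (depth >= maxDepth) {
                continue; // max depth to crawl: do not follow links any further
            }
            for (String next : links.getOrDefault(url, Collections.emptyList())) {
                // Redundant link control: enqueue a URL only the first time it is seen.
                if (depthOf.putIfAbsent(next, depth + 1) == null) {
                    frontier.add(next);
                }
            }
        }
        return visited;
    }
}
```

Calling crawl(graph, seed, 100, 2) on such a graph returns the pages in the order they were visited, with each URL appearing at most once.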
8. Crawler4j
Yasser Ganjisaffar
Microsoft Bing & Microsoft Live Search
10. Demo - Crawler4j (2/3)
myCrawler.java
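import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class myCrawler extends WebCrawler {

    // Follow only links that stay on the target site.
    // (crawler4j 4.x and later use shouldVisit(Page referringPage, WebURL url).)
    @Override
    public boolean shouldVisit(WebURL url) {
        return url.getURL().startsWith("http://www.gdgistanbul.com");
    }

    // Called once for every page that was fetched successfully.
    @Override
    public void visit(Page page) {
        String url = page.getWebURL().getURL();
    }
}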
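The prefix check in shouldVisit is often combined with a filter that skips links to non-HTML resources. The sketch below shows that idea with the standard library only, so it can be tried without crawler4j. The class name VisitFilterSketch and the list of file extensions are assumptions chosen for illustration, not part of the demo.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical sketch of a visit filter: same-site prefix plus an extension blacklist.
final class VisitFilterSketch {

    // Site prefix taken from the demo above.
    private static final String ALLOWED_PREFIX = "http://www.gdgistanbul.com";

    // Assumption: extensions treated as "not a page"; extend as needed.
    private static final Pattern NON_HTML = Pattern.compile(
            ".*\\.(css|js|gif|jpe?g|png|mp3|mp4|zip|gz|pdf)$");

    static boolean shouldVisit(String href) {
        // Compare case-insensitively so that "LOGO.PNG" is filtered as well.
        String url = href.toLowerCase(Locale.ROOT);
        return url.startsWith(ALLOWED_PREFIX) && !NON_HTML.matcher(url).matches();
    }
}
```

Inside a WebCrawler subclass the same logic would be applied to url.getURL(). A plain prefix test also accepts hosts that merely begin with the same text, so a stricter check would compare the parsed host name instead.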
11. Demo - Crawler4j (3/3)
myController.java
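int numberOfCrawlers = 4;  // number of concurrent crawler threads

CrawlConfig config = new CrawlConfig();
// Required by crawler4j for intermediate crawl data; this path is a placeholder.
config.setCrawlStorageFolder("/tmp/crawler4j");
config.setPolitenessDelay(250);   // milliseconds between requests
config.setMaxPagesToFetch(100);   // stop after 100 pages

PageFetcher pageFetcher = new PageFetcher(config);
RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

controller.addSeed("http://www.gdgistanbul.com");
controller.start(myCrawler.class, numberOfCrawlers);  // blocks until the crawl finishes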
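To make the politeness delay more concrete, here is a minimal sketch of one common interpretation: a minimum gap between two requests to the same host. It is not a description of crawler4j's internals. The class PolitenessSketch and its reserve method are hypothetical, and the current time is passed in as a parameter so the behaviour can be checked without sleeping.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: per-host politeness delay shared by several crawler threads.
final class PolitenessSketch {

    private final long delayMillis;
    // Earliest time (in milliseconds) at which each host may be contacted again.
    private final Map<String, Long> nextAllowed = new HashMap<>();

    PolitenessSketch(long delayMillis) {
        this.delayMillis = delayMillis;
    }

    // Reserves the next request slot for the host and returns how many
    // milliseconds the caller should wait before sending the request.
    synchronized long reserve(String host, long nowMillis) {
        long earliest = nextAllowed.getOrDefault(host, nowMillis);
        long start = Math.max(earliest, nowMillis);
        nextAllowed.put(host, start + delayMillis);
        return start - nowMillis;
    }
}
```

A crawler thread would call reserve(host, System.currentTimeMillis()) and sleep for the returned number of milliseconds before fetching. The method is synchronized because the demo runs four crawler threads against the same site.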
12. Demo - Jsoup (1/2)
Jsoup: a convenient way to parse HTML in Java
● scrape and parse HTML from a URL, file, or string
● find and extract data, using DOM traversal or CSS selectors
● manipulate the HTML elements, attributes, and text
13. Demo - Jsoup (2/2)
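// Four independent snippets; variable names repeat across them.

// 1. Fetch a page and select elements with a CSS selector
Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements newsHeadlines = doc.select("#mp-itn b a");

// 2. Parse HTML from a string
String html = "<html><head><title>First parse</title></head>"
    + "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);

// 3. Traverse the DOM (assumes the document has an element with id "content")
Element content = doc.getElementById("content");
Elements links = content.getElementsByTag("a");
for (Element link : links) {
    String linkHref = link.attr("href");
    String linkText = link.text();
}

// 4. Select by attribute
Elements links = doc.select("a[href]");  // every <a> that has an href
Elements media = doc.select("[src]");    // every element that has a src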
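The href values extracted above are often relative, while a crawler needs absolute URLs before it can queue them. Jsoup can return an absolute URL directly through link.attr("abs:href"). The sketch below shows the same resolution step with the standard library only, as an illustration of what happens. The class LinkResolverSketch and its method toAbsolute are hypothetical names.

```java
import java.net.URI;

// Hypothetical sketch: resolve an extracted href against the URL of the page it came from.
final class LinkResolverSketch {

    static String toAbsolute(String pageUrl, String href) {
        // resolve() handles absolute hrefs, root-relative paths, and "../" segments;
        // normalize() removes any "." or ".." segments that remain.
        return URI.create(pageUrl).resolve(href).normalize().toString();
    }
}
```

URI.create throws IllegalArgumentException for strings that are not valid URIs (for example, an href containing a space), so real crawl code would need to catch that and skip the link.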
14. Where to Use
● Search Engines (GoogleBot)
● Aggregators
○ Data aggregator
○ News aggregator
○ Review aggregator
○ Search aggregator
○ Social network aggregation
○ Video aggregator
● Kaarun Product Collector