15. Repetitive Crawling: once pages have been crawled, some systems require the process to be repeated periodically so that the indexes are kept up to date.
17. Crawling Policies
Selection Policy: states which pages to download.
Re-visit Policy: states when to check the pages for changes.
Politeness Policy: states how to avoid overloading Web sites.
Parallelization Policy: states how to coordinate distributed Web crawlers.
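As a rough illustration, the four policies can be pictured as tunable parameters of a crawler. The parameter names and values below are hypothetical, not taken from any particular system:

    from dataclasses import dataclass

    @dataclass
    class CrawlerPolicies:
        # Selection policy: which URLs are considered worth downloading.
        allowed_schemes: tuple = ("http", "https")
        # Re-visit policy: how often a previously crawled page is checked again.
        revisit_interval_days: int = 7
        # Politeness policy: minimum delay between two requests to the same host.
        per_host_delay_seconds: float = 2.0
        # Parallelization policy: how many crawler processes share the URL space.
        num_crawler_processes: int = 8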
22. Freshness and Age: commonly used cost functions.
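A common formulation of these cost functions, given here as a sketch (the exact variant may differ), is:

Freshness of a page p at time t: F_p(t) = 1 if the local copy of p is identical to the live page at time t, and 0 otherwise.
Age of a page p at time t: A_p(t) = 0 if the local copy of p is up to date at time t, and otherwise t minus the time at which p was last modified.

The re-visit policy then tries to maximize the average freshness, or minimize the average age, of the pages in the collection.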
29. A hash function can be used to transform each URL into a number that corresponds to the index of the crawling process responsible for it.
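A minimal sketch of this assignment, assuming a fixed number of processes (the function name and the choice of SHA-1 are illustrative):

    import hashlib

    def assign_to_process(url: str, num_processes: int) -> int:
        # Hash the URL to a large integer, then reduce it modulo the
        # number of crawling processes to obtain the process index.
        digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
        return int(digest, 16) % num_processes

    # e.g. assign_to_process("https://example.org/page.html", 8) -> a value in 0..7

A common variant hashes only the host name instead of the full URL, so that all pages of one site are handled by the same process, which also keeps the politeness policy local to that process.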
31. STRATEGIES OF FOCUSED CRAWLING
A focused crawler predicts the probability that a link leads to a relevant page before actually downloading that page. A possible predictor is the anchor text of the link. In another approach, the relevance of a page is determined after downloading its content: relevant pages are sent to content indexing and their contained URLs are added to the crawl frontier, while pages that fall below a relevance threshold are discarded.
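A minimal sketch of the second strategy; the fetch, relevance_score and extract_links helpers are placeholders for whatever downloading, scoring and link-extraction components a real crawler would use:

    from collections import deque

    def focused_crawl(seed_urls, fetch, relevance_score, extract_links,
                      threshold=0.5, max_pages=1000):
        frontier = deque(seed_urls)   # crawl frontier seeded with the start URLs
        seen = set(seed_urls)
        indexed = []
        while frontier and len(indexed) < max_pages:
            url = frontier.popleft()
            content = fetch(url)                    # download the page
            if relevance_score(content) < threshold:
                continue                            # discard pages below the threshold
            indexed.append(url)                     # send relevant pages to indexing
            for link in extract_links(content):     # add contained URLs to the frontier
                if link not in seen:
                    seen.add(link)
                    frontier.append(link)
        return indexed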
32. EXAMPLES
Yahoo! Slurp: Yahoo Search crawler.
Msnbot: Microsoft's Bing web crawler.
Googlebot: Google's web crawler.
WebCrawler: used to build the first publicly available full-text index of a subset of the Web.
World Wide Web Worm: used to build a simple index of document titles and URLs.
Web Fountain: distributed, modular crawler written in C++.
Slug: Semantic Web crawler.
33. CONCLUSION
Web crawlers are an important component of search engines, and high-performance crawling processes are basic building blocks of various Web services. Setting up such systems is not trivial:
1. The data manipulated by these crawlers cover a very wide area.
2. It is crucial to preserve a good balance between random-access memory and disk accesses.