INTELLIGENT WEB CRAWLING
WI-IAT 2013 Tutorial
WI-IAT 2013 Tutorial, Atlanta, USA, 20.11.2013
ver 1.8: 10.04.2015
Denis Shestakov
denshe at gmail
Department of Media Technology, Aalto University, Finland
Denis Shestakov
Intelligent Web Crawling
WI-IAT’13, Atlanta, USA, 20.11.2013
1/98
References to this tutorial
To cite please use:
D. Shestakov, "Intelligent Web Crawling," IEEE Intelligent
Informatics Bulletin, 14(1), pp. 5-7, 2013.
Speaker’s Bio
(2009-2013) Postdoc in
Web Services Group,
Aalto University, Finland
PhD thesis (2008) on
limited coverage of web
crawlers
Over ten years of
experience in the area
Tutorials on web crawling
given at SAC’12 and
ICWE’13
Web Services Group in 2011
Speaker’s Info
Current (as of 2013):
http://www.linkedin.com/in/dshestakov
http://www.mendeley.com/profiles/
denis-shestakov/
http://www.researchgate.net/profile/
Denis_Shestakov
https://mediatech.aalto.fi/~denis/
TUTORIAL OUTLINE
I. OVERVIEW
Web crawling in a nutshell
Web crawling applications
Web size and web link structure
II. INTELLIGENT WEB CRAWLING
Architecture of web crawler
Crawling strategies
Adaptive crawling approaches
III. OPEN CHALLENGES
Crawlers in Web ecosystem
Collaborative web crawling
Deep Web crawling
Crawling multimedia content
Links to Tutorial
Slides:
http://goo.gl/woVtQk
http://www.slideshare.net/denshe/presentations
Similar tutorials:
Tutorials on web crawling at ICWE’13 and SAC’12
Differences from this tutorial: they give a better overview of the topic (Parts I
and III) but do not cover crawling strategies (Part II)
Supporting materials:
http://www.mendeley.com/groups/531771/web-crawling/
PART I: OVERVIEW
Visualization of http://media.tkk.fi/webservices by aharef.info applet
Outline of Part I
Overview of Web Crawling
Web crawling in a nutshell
Web crawling applications
Web size and web link structure
Web Crawling in a Nutshell
Automatic harvesting of web content
Done by web crawlers (also known as robots, bots or
spiders)
Follow a link from a set of links (URL queue), download the
page, extract all links, eliminate those already visited, add the
rest to the queue
Then repeat
A set of policies is involved (like ’ignore links to images’, etc.)
Web Crawling in a Nutshell
Example:
1. Follow http://media.tkk.fi/webservices (visualization of its
HTML DOM tree below)
2. Extract URLs inside blue bubbles (designating <a> tags)
3. Remove already visited URLs
4. For each non-visited URL, start at Step 1
Web Crawling in a Nutshell
In essence: a simple and naive process
However, a number of imposed ’restrictions’ make it much
more complicated
Most complexities are due to the operating environment (the Web)
For example, do not overload web servers (challenging, as the
distribution of web pages over web servers is non-uniform)
Or avoid web spam (not only useless, but it consumes
resources and often spoils the collected content)
Web Crawling in a Nutshell
Crawler Agents
First in 1993: the Wanderer (written in Perl)
Over 1100 different crawler signatures (User-Agent string
in the HTTP request header) mentioned at
http://www.crawltrack.net/crawlerlist.php
Educated guess on the overall number of different crawlers:
at least several thousand
Write your own in a few dozen lines of code (using
libraries for URL fetching and HTML parsing)
Or use an existing agent: e.g., the wget tool (developed
since 1996; http://www.gnu.org/software/wget/)
Web Crawling in a Nutshell
Crawler Agents
For advanced needs, you may modify the code of existing
projects in your preferred programming language
Crawlers play a big role on the Web
Bring more traffic to certain web sites than human visitors do
Generate a sizeable portion of traffic to any (public) web site
Crawler traffic is important for emerging web sites
Web Crawling in a Nutshell
Classification
General/universal crawlers
Not so many of them, lots of resources required
Big web search engines
Topical/focused crawlers
Pages/sites on certain topic
Crawling everything in one specific (e.g., national) web segment is
rather general, though
Batch crawling
One or several (static) snapshots
Incremental/continuous crawling
Re-visiting
Resources divided between fetching newly discovered
pages and re-downloading previously crawled pages
Search engines
Applications of Web Crawling
Web Search Engines
Google, Microsoft Bing, (Yahoo), Baidu, Naver, Yandex,
Ask, ...
One of three underlying technology stacks
Applications of Web Crawling
Web Search Engines
One of three underlying technology stacks
BTW, what are the other two and which is the most
’crucial’?
Applications of Web Crawling
Web Search Engines
What are the other two and which is the most ’crucial’?
Indexing and query processing; the query processor
(particularly, ranking) is the most crucial
Applications of Web Crawling
Web Archiving
Digital preservation
“Librarian” look on the Web
The biggest: Internet Archive
Quite huge collections
Batch crawls
Primarily, collections of national web sites – web sites at
country-specific TLDs or physically hosted in a country
There are quite many initiatives and some are huge! See the list of
Web Archiving Initiatives at Wikipedia
Applications of Web Crawling
Vertical Search Engines
Aggregating data from many sources on a certain topic
E.g., apartment search, car search
Applications of Web Crawling
Web Data Mining
“To get data to be actually mined”
Usually using focused crawlers
For example, opinion mining
Or digests of current happenings on the Web (e.g., what
music people listen to now)
Applications of Web Crawling
Web Monitoring
Monitoring sites/pages for changes and updates
Applications of Web Crawling
Detection of malicious web sites
Typically part of an anti-virus, firewall, search engine, etc.
service
Building a list of such web sites and informing users about
the potential threat of visiting them
Applications of Web Crawling
Web site/application testing
Crawl a web site to check navigation through it, the validity
of links, etc.
Regression/security/... testing of a rich internet application
(RIA) via crawling
Checking different application states by simulating possible
user interaction events (e.g., mouse click, time-out)
Applications of Web Crawling
Copyright violation detection
Crawl to find (media) items under copyright, or links to them
Regularly re-visiting ’suspicious’ web sites, forums, etc.
Tasks like finding terrorist chat rooms also go here
Applications of Web Crawling
Web Scraping
Extracting particular pieces of information from a group of
typically similar pages
When an API to the data is not available
Interestingly, scraping might be preferable even with an
API available, as scraped data is often cleaner and more
up-to-date than data obtained via the API
Applications of Web Crawling
Web Mirroring
Copying of web sites
Hosting copies on different servers to ensure 24x7
accessibility
Industry vs. Academia Divide
In web crawling domain
Huge lag between industrial and academic web crawlers
Both research-wise and development-wise
Algorithms, techniques, and strategies used in industrial
crawlers (namely, those operated by search engines) are poorly
known
Industrial crawlers operate at web scale
That is, tens of billions of pages
Only a few academic crawlers have dealt with more than one
billion pages
Academic scale is rather hundreds of millions
Industry vs. Academia
Re-crawling
Batch crawls in academia
Regular re-crawls by industrial crawlers
Evaluation of crawled data
Crucial for corrections/improvements to crawlers
Direct evaluation by users of search engines
To some extent, artificial evaluation of academic crawls
Web Size and Structure
Some numbers
Number of pages per host is not uniform: most hosts
contain only a few pages, while others contain millions
Roughly 100 links per page
According to Google statistics (over 4 billion pages,
2010): fetching a page transfers about 320KB (textual content
plus all embedded resources)
A page has 10-100KB of textual (HTML) content on average
One trillion URLs known to Google/Yahoo in 2008
Web Size and Structure
Some numbers
20 million web pages in 1995 (indexed by AltaVista)
One trillion (10^12) URLs known to Google/Yahoo in 2008
- The ’independent’ search engine Majestic12
(P2P crawling) confirms one trillion items
Doesn’t mean one trillion indexed pages
Supposedly, the index has dozens of times fewer pages
Cool crawler fact: the IRLbot crawler (running on one server)
downloaded 6.4 billion pages over 2 months
Throughput: 1000-1500 pages per second
Over 30 billion discovered URLs
Web Size and Structure
Bow-tie model of the Web
Illustration taken from http://dx.doi.org/doi:10.1038/35012155
PART II: INTELLIGENT WEB CRAWLING
Outline of Part II
Intelligent Web Crawling
Architecture of web crawler
Crawling strategies
Adaptive crawling approaches
Architecture of Web Crawler
Crawler crawls the Web
Crawled
URLs
URL Frontier
Seed
URLs
Uncrawled Web
Architecture of Web Crawler
Typically in a distributed fashion
Seed
URLs
Crawled
URLs
URL Frontier
crawling thread
Uncrawled Web
Architecture of Web Crawler
URL Frontier
May include multiple pages from the same host
Must avoid trying to fetch them all at the same time
Must try to keep all crawling threads busy
Prioritization also helps
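These frontier requirements can be made concrete with a toy sketch. The class below (all names, and the one-second default delay, are illustrative choices, not part of the tutorial) keeps one FIFO queue per host and records the earliest next-fetch time per host, so no single host is hammered while other queues keep the threads busy:

```python
import collections
import time
import urllib.parse

class PoliteFrontier:
    """Toy URL frontier: one FIFO queue per host, plus a per-host
    'earliest next fetch' time to enforce a politeness delay."""

    def __init__(self, delay=1.0):
        self.delay = delay                       # seconds between hits to one host
        self.queues = collections.defaultdict(collections.deque)
        self.next_fetch = {}                     # host -> earliest allowed time

    def add(self, url):
        host = urllib.parse.urlsplit(url).netloc
        self.queues[host].append(url)

    def pop(self, now=None):
        """Return some URL whose host is currently allowed, or None."""
        now = time.monotonic() if now is None else now
        for host, q in self.queues.items():
            if q and self.next_fetch.get(host, 0.0) <= now:
                self.next_fetch[host] = now + self.delay
                return q.popleft()
        return None
```

A real frontier (e.g., Mercator-style) adds prioritization on top of this per-host discipline; here politeness alone already interleaves hosts.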
Architecture of Web Crawler
Crawler Architecture
Illustration taken from Introduction to Information Retrieval (Cambridge University Press, 2008) by Manning et al.
Architecture of Web Crawler
Content seen?
If the fetched page is already in the base/index, don’t process it
Document fingerprints (shingles)
Filtering
Filter out URLs due to ’politeness’ and restrictions on the crawl
Fetched robots.txt files are cached to avoid fetching them
repeatedly
Duplicate URL Elimination
Check if an extracted+filtered URL has already been
passed to the frontier (batch crawling)
More complicated in continuous crawling (different URL
frontier implementation)
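The duplicate-URL test in a batch crawl can be sketched as a set of compact fingerprints (the 8-byte truncation and trailing-slash normalization below are illustrative simplifications, not the tutorial's exact scheme):

```python
import hashlib

class UrlSeenTest:
    """Toy duplicate-URL eliminator: store a compact fingerprint of each
    normalized URL instead of the full string, as batch crawlers do."""

    def __init__(self):
        self.seen = set()

    @staticmethod
    def fingerprint(url):
        # Normalize trivially (drop a trailing slash), then keep 8 bytes
        # of a cryptographic hash as the fingerprint.
        return hashlib.sha1(url.rstrip("/").encode()).digest()[:8]

    def add_if_new(self, url):
        fp = self.fingerprint(url)
        if fp in self.seen:
            return False          # already passed to the frontier
        self.seen.add(fp)
        return True
```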
Architecture of Web Crawler
Distributed Crawling
Run multiple crawl threads, under different processes
(often at different nodes)
Nodes can be geographically distributed
Partition hosts being crawled among nodes
Architecture of Web Crawler
Host Splitter
Illustration taken from Introduction to Information Retrieval (Cambridge University Press, 2008) by Manning et al.
Architecture of Web Crawler
Implementation (in Perl)
Other popular languages: Java, Python, C/C++
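The Perl listing on the original slide is not preserved in this text version. As a rough stand-in, a minimal fetch-parse-enqueue loop might look like this in Python (the regex-based link extraction and the injected `fetch` function are deliberate simplifications for illustration, not a production design):

```python
import re
import urllib.parse
from collections import deque

LINK_RE = re.compile(r'href="([^"#]+)"')   # crude <a href="..."> extraction

def crawl(seed, fetch, max_pages=100):
    """Minimal batch crawler: FIFO frontier, visited set, link extraction.
    `fetch(url) -> html` is injected so the loop stays testable offline."""
    frontier, visited = deque([seed]), set()
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            html = fetch(url)
        except OSError:
            continue                        # unreachable page: skip it
        for href in LINK_RE.findall(html):
            link = urllib.parse.urljoin(url, href)
            if link not in visited:
                frontier.append(link)
    return visited
```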
Architecture of Web Crawler
Crawling objectives
High web coverage
High page freshness
High content quality
High download rate
Internal and External factors
Amount of hardware (I)
Network bandwidth (I)
Rate of web growth (E)
Rate of web change (E)
Amount of malicious content (i.e., spam, duplicates) (E)
Crawling Strategies
Download prioritization
Given a time period, only a subset of web pages can be
downloaded
“Important” pages first
Hence, the need for prioritization
Ordering the queue of URLs to be visited
Strategies (ordering metrics)
Breadth-First, Depth-First
Backlink count
Best-First
PageRank
Shark-Search
Crawling Strategies
Breadth-First, Depth-First
Breadth-First search
Implemented with
QUEUE (FIFO)
Pages with shortest
paths first
Depth-First search
Implemented with
STACK (LIFO)
Crawling Strategies
Pseudocode for Breadth-First
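The pseudocode listing itself is not preserved in this extraction; a compact sketch of the idea (a FIFO frontier gives breadth-first, shortest-path-first order, while the same loop with a LIFO frontier gives depth-first) could be:

```python
from collections import deque

def traverse(seed, neighbors, breadth_first=True, limit=1000):
    """Generic crawl ordering: popping from the front of the frontier
    (FIFO) yields breadth-first order; popping from the back (LIFO)
    yields depth-first order."""
    frontier, visited, order = deque([seed]), {seed}, []
    while frontier and len(order) < limit:
        url = frontier.popleft() if breadth_first else frontier.pop()
        order.append(url)
        for link in neighbors(url):
            if link not in visited:
                visited.add(link)
                frontier.append(link)
    return order
```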
Crawling Strategies
Backlink count
Use the link graph information
Count # of crawled pages that point to a page
Links with highest counts first
Crawling Strategies
Best-First
The best link is selected based on some criterion
E.g., lexical similarity between the topic’s keywords and the link’s
source page
A similarity score sim(topic, p) is assigned to the outgoing links of
page p
Cosine similarity is often used
Cosine similarity often used
where q is a topic, p is a crawled page, fkq,fkp are frequencies of term k
in q and p
Crawling Strategies
Pseudocode for Best-First
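The pseudocode listing is not preserved here; a simplified sketch (a max-heap frontier where each outgoing link inherits the cosine similarity of its source page to the topic — the variant where links inherit the source page's score, rather than being scored individually) could be:

```python
import heapq
from collections import Counter
from math import sqrt

def cosine(q, p):
    """sim(q, p) over term-frequency vectors, as in the cosine formula."""
    fq, fp = Counter(q.lower().split()), Counter(p.lower().split())
    dot = sum(fq[k] * fp[k] for k in fq)
    norm = sqrt(sum(v * v for v in fq.values())) * sqrt(sum(v * v for v in fp.values()))
    return dot / norm if norm else 0.0

def best_first(seed, topic, links, text, max_pages=50):
    """Best-First crawl: the frontier is a heap ordered by score
    (negated, since heapq is a min-heap)."""
    frontier = [(-1.0, seed)]
    visited, order = set(), []
    while frontier and len(order) < max_pages:
        _, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        order.append(url)
        score = cosine(topic, text(url))        # relevance of this page
        for link in links(url):
            if link not in visited:
                heapq.heappush(frontier, (-score, link))
    return order
```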
Crawling Strategies
PageRank
The pagerank of a page is the probability for a random
surfer (who follows links randomly) to be on this page at
any given time
A page’s score (rank) is defined by the scores of pages with links
to this page:
PR(p) = (1 − γ)/N + γ · Σ_{d in in(p)} PR(d) / |out(d)|
where p is a page, in(p) is the set of pages with links to p, out(d) is the set
of links out of d, γ is the damping factor, and N is the total number of pages
PageRank of pages is periodically recalculated using a data
structure with the crawled pages
Crawling Strategies
Pseudocode for PageRank
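The pseudocode listing is not preserved here; a sketch of the periodic recalculation as plain power iteration over the crawled link graph (the dangling-page handling is one common convention, not necessarily the slide's) could be:

```python
def pagerank(links, gamma=0.85, iters=50):
    """Power iteration of the PageRank formula over a link graph
    `links: page -> list of outgoing pages`."""
    pages = sorted(set(links) | {p for out in links.values() for p in out})
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1.0 - gamma) / n for p in pages}
        for d, out in links.items():
            if not out:                          # dangling page: spread evenly
                for p in pages:
                    new[p] += gamma * pr[d] / n
            else:
                for p in out:
                    new[p] += gamma * pr[d] / len(out)
        pr = new
    return pr
```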
Crawling Strategies
Shark-Search
More emphasis on web segments where relevant pages
were found
Penalizing segments yielding few relevant pages
A link’s score is defined by the link’s anchor text, the text
surrounding the link (link context), and a score inherited from
ancestor pages (pages pointing to the page with this link)
Parameters:
d - depth bound
r - relative importance of inherited score versus link
neighbourhood score
Crawling Strategies
Pseudocode for Shark-Search
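The full pseudocode is not preserved here; a sketch of just the link-scoring rule described above — a blend of inherited score and a neighbourhood score built from anchor text and link context — could be (the parameter values and the linear blend weights are illustrative choices, not the algorithm's canonical constants):

```python
def shark_score(inherited, anchor_sim, context_sim, r=0.5, anchor_weight=0.8):
    """Score of one link in a Shark-Search-style crawler.

    inherited   - score passed down from ancestor pages
    anchor_sim  - similarity of the anchor text to the topic
    context_sim - similarity of the surrounding text to the topic
    r           - relative importance of inherited vs neighbourhood score
    """
    neighbourhood = anchor_weight * anchor_sim + (1 - anchor_weight) * context_sim
    return r * inherited + (1 - r) * neighbourhood
```

Children of a relevant page inherit a (decayed) copy of this score, which is what boosts whole web segments around relevant pages; the depth bound d stops the crawl from descending too far in unproductive segments.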
Adaptive Crawling
Static vs. adaptive strategies
Strategies presented up to this point are static
They do not adjust in the course of the crawl
Adaptive (intelligent) crawling
InfoSpiders
Ant-based crawling
Adaptive Crawling
InfoSpiders
Independent agents crawling in parallel
[Agent architecture diagram: an HTML document is passed through an
HTML parser, noise-word remover, and stemmer into a compact document
representation; document relevance assessment feeds learning, link
assessment and selection, and reproduction or death; the agent
representation consists of a keyword vector, term weights, and
neural net weights]
Adaptive Crawling
InfoSpiders
Independent agents crawling in parallel
Each agent uses list of keywords (initialized with topic
keywords)
A neural network evaluates new links
Keywords in the vicinity of a link are used as input
More importance (weight) is given to keywords close to the link
Maximum weight goes to words in the anchor text
The output is a numerical quality estimate for the link
The link score is combined with a cosine similarity score (between
the agent’s keywords and the page containing the link)
Adaptive Crawling
InfoSpiders
Each agent has an energy level
An agent moves from the current page to a new page if a Boltzmann
function returns true
where δ is the difference between the similarity of the new page and
of the current page to the agent’s keywords
If the energy level passes some threshold, the agent
reproduces
The offspring gets half of the parent’s frontier
The offspring’s keywords are mutated (expanded) with the most
frequent terms in the parent’s current document
Adaptive Crawling
Pseudocode for InfoSpiders
Adaptive Crawling
Pseudocode for InfoSpiders (cont.)
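The pseudocode listings for InfoSpiders are not preserved in this extraction. A toy sketch of just the energy/reproduction mechanics described two slides back (the cost, threshold, and energy-splitting values below are arbitrary illustrative choices, and the neural-net link evaluation is omitted entirely) could be:

```python
class Spider:
    """Toy InfoSpiders agent: energy grows with relevant pages; an agent
    above the reproduction threshold splits, handing half of its frontier
    to the offspring; an agent that runs out of energy dies."""

    def __init__(self, keywords, frontier, energy=1.0):
        self.keywords, self.frontier, self.energy = keywords, list(frontier), energy

    def step(self, relevance, cost=0.1, threshold=2.0):
        """Visit one page of the given relevance; return offspring or None."""
        self.energy += relevance - cost
        if self.energy >= threshold:
            self.energy /= 2                      # split energy with the child
            half = len(self.frontier) // 2
            child = Spider(list(self.keywords), self.frontier[half:], self.energy)
            self.frontier = self.frontier[:half]  # parent keeps the first half
            return child
        if self.energy <= 0:
            self.frontier = []                    # agent dies
        return None
```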
Adaptive Crawling
Ant-based crawling
Motivation: allow crawling agents to communicate with
each other
Follows a model of social insect collective behaviour
Ants leave pheromone along the followed path
Other ants follow such pheromone trails
A crawler agent follows some path by visiting many URLs
At some moment, a certain amount of pheromone (weight)
can be assigned to the sequence of URLs on the followed path
The amount can depend on the similarity of the visited pages to a
given topic
Adaptive Crawling
Ant-based crawling
Ants (crawlers) operate in cycles
During each cycle, agents make a predefined number of
moves (visits of pages)
#moves = constant ∗ #cycle
At the end of each cycle, pheromone intensity values are
updated for the followed path
Agents-ants return to their starting positions
Adaptive Crawling
Ant-based crawling
The next link is selected based on a probability defined by
the corresponding pheromone intensity
If there is no pheromone information, an agent-ant moves
randomly
Adaptive Crawling
Ant-based crawling
Probability of selecting a link:
P_ij(t) = τ_ij(t) / Σ_{l: (i,l)} τ_il(t)
where t is the cycle number, τ_ij(t) is the pheromone value between p_i and
p_j, and (i, l) designates the presence of a link from p_i to p_l
During the cycle, each ant stores the list of visited URLs
If p_j was already visited, P_ij(t) = 0
At the end of the cycle, the list of visited URLs is emptied out
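The pheromone-proportional selection rule can be sketched as roulette-wheel sampling (the function names and the uniform fallback for the no-pheromone case are illustrative, matching the random-walk behaviour described above):

```python
import random

def pick_link(links, pheromone, rng=random.random):
    """Choose the next link with probability proportional to its
    pheromone value; with no pheromone anywhere, choose uniformly
    (the random walk of an uninformed ant)."""
    weights = [pheromone.get(link, 0.0) for link in links]
    total = sum(weights)
    if total == 0.0:
        return links[int(rng() * len(links))]     # uniform random move
    r, acc = rng() * total, 0.0
    for link, w in zip(links, weights):
        acc += w
        if r <= acc:
            return link
    return links[-1]                              # guard against rounding
```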
Adaptive Crawling
Implications
Strategies evaluating links based on their context (nearby
text) are not directly applicable to large-scale crawling
E.g., consider crawling 10^9 pages within one month
Crawl rate: around 400 documents per second
Around 40000 links per second
Every second, 10000-30000 “new” links must be evaluated
(scored) and added to the frontier
Too many even for evaluating only the link’s anchor text
PART III: OPEN CHALLENGES
Outline of Part III
Open Challenges
Crawlers in Web ecosystem
Collaborative web crawling
Deep Web crawling
Crawling multimedia content
Crawlers in Web ecosystem
Push vs. Pull model
Web pages accessed via pull model
- HTTP is a pull protocol
That is, a client requests a page from a server
If push, a server would send a page/info to a client
Why Pull?
Pull is just easier for both parties
No ’agreement’ needed between provider and aggregator
No specific protocols for content providers – serving
content is enough
Perhaps the pull model is the reason why the Web
succeeded while earlier hypertext systems failed
Crawlers in Web ecosystem
Why not Push?
Still, the pull model has several disadvantages
What are these?
Crawlers in Web ecosystem
Why not Push?
Still, the pull model has several disadvantages
Publishing/updating content is easier with push: no need for
redundant requests from crawlers
Better control over the content for providers: no need for
crawler politeness
Crawlers in Web ecosystem
Crawler politeness
Content providers possess some control over crawlers
Via special protocols defining access to parts of a site
Via directly banning agents that hit a site too often
Crawlers in Web ecosystem
Crawler politeness
Robots.txt says what can(not) be crawled
Sitemaps is a newer protocol that lists the URLs available for
crawling, together with metadata such as update frequency
Example: no agent should visit any URL starting with
“yoursite/notcrawldir”, except the agent called
“goodsearcher”
User-agent: *
Disallow: /notcrawldir
User-agent: goodsearcher
Disallow:
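A crawler can check such rules with Python's standard-library parser; the snippet below feeds it the example rules directly (the "yoursite" host is just the placeholder from the example above):

```python
import urllib.robotparser

ROBOTS = """\
User-agent: *
Disallow: /notcrawldir

User-agent: goodsearcher
Disallow:
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS.splitlines())

# Anonymous crawlers are kept out of /notcrawldir; goodsearcher is not.
print(rp.can_fetch("somebot", "http://yoursite/notcrawldir/page.html"))      # False
print(rp.can_fetch("goodsearcher", "http://yoursite/notcrawldir/page.html")) # True
```

In a real crawler one would call `rp.set_url(".../robots.txt")` and `rp.read()` instead of `parse()`, and cache the result per host.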
Collaborative Crawling
Main considerations
Lots of redundant crawling
To get data (often on a specific topic) one needs to crawl broadly
- Often a lack of expertise when a large crawl is required
- Often, a lot is crawled but only a small subset is used
Too many redundant requests for content providers
Idea: have one crawler doing a very broad and intensive
crawl, with many parties accessing the crawled data via an API
- Specify filters to select the required pages
Crawler as a common service
Collaborative Crawling
Some requirements
A filter language for specifying conditions
Efficient filter processing (millions of filters to process)
Efficient fetching (hundreds of pages per second)
Support for real-time requests
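A toy sketch of the filter-processing idea: each subscriber registers a conjunctive keyword filter, and each incoming document is matched against all filters through an inverted index keyed on one keyword per filter (the class and the anchor-keyword trick are illustrative, not the actual design of any service):

```python
from collections import defaultdict

class FilterIndex:
    """Toy filter index for a crawler-as-a-service: subscribers register
    conjunctive keyword filters; a stream of documents is matched against
    all filters via an inverted index on one keyword per filter."""

    def __init__(self):
        self.by_keyword = defaultdict(list)   # keyword -> [(filter_id, keywords)]

    def add_filter(self, fid, keywords):
        keywords = frozenset(keywords)
        anchor = min(keywords)                # index the filter under one keyword
        self.by_keyword[anchor].append((fid, keywords))

    def match(self, doc_terms):
        """Return ids of filters whose keywords all appear in the document."""
        doc_terms = set(doc_terms)
        hits = []
        for term in doc_terms:
            for fid, kws in self.by_keyword.get(term, ()):
                if kws <= doc_terms:          # all filter keywords present
                    hits.append(fid)
        return sorted(hits)
```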
Collaborative Crawling
New component
Process a stream of documents against a filter index
Collaborative Crawling
Filter processing architecture
Collaborative Crawling
Based on ’The Architecture and Implementation of an
Extensible Web Crawler’ by Hsieh, Gribble, and Levy, 2010
(illustrations on slides 61-62 are from Hsieh’s slides)
E.g., 80legs provides similar crawling services
In a way, this reconsiders the pull/push model of content
delivery on the Web
Deep Web Crawling
Visualization of http://amazon.com by aharef.info applet
Deep Web Crawling
In a nutshell
The problem is in the yellow nodes (designating web form elements)
Deep Web: the part of the Web not accessible through search
engines
My preferred definition: content behind web search forms on
publicly available pages
Content is hidden behind HTML forms
Pages with the forms themselves are typically accessible/searchable
(i.e., crawled)
Deep Web Crawling
Why is it important?
Large source of structured data
Forms present a search interface over backend databases
Significant gap in search engine coverage
Potentially more content than currently searchable
More than 10 million distinct HTML forms
Likely to increase as more data comes online
Size of the deep Web is unclear
The 500x figures are highly disputable
Number of resources is a bit simpler: ~450k databases on the Web
in 2004
Some part of deep web content is crawled/covered by search engines
Content can be both searched and browsed via links categorizing
content
Business-driven sites (e.g., shopping) typically provide both ways of
access
Deep Web Crawling
Why crawlers do not crawl the deep Web
Crawlers can’t pass through the forms (some values need to be
specified)
I.e., content is “hidden” behind search forms
Hence another name for the deep Web: hidden Web
To crawl/access the content behind a form, the following is
required:
Identify a search form on a page
Fill the form with proper values
Submit the form
Get the result pages
Extract links/data from them
Deep Web Crawling
Approaches to deep Web crawling
Google’s Deep Web Crawl (2008):
Identify search forms
Pre-compute all interesting form submissions for each
HTML form
Each form submission corresponds to a distinct URL
Add the URLs for each form submission into the search
engine index
Allows reuse of the existing search engine infrastructure
No aim for full coverage of a deep web resource
Not all forms covered (only GET forms)
Deep Web site identification
Task: identify a search form leading to content-rich
web pages
Surprisingly, quite a challenging task
One of the problems: detect if a form is searchable
Searchable forms
What are deep Web resources? E.g.:
store locations
used cars
radio stations
patents
recipes
Non-searchable: login forms, those that require user info
Depends: highly interactive forms, e.g., airline reservations
Deep Web site identification
Detect if a form is informational
Challenging for a human too: e.g., assume the form is in an
unknown language
Detection by building/training binary classifiers
Forms identified as searchable can then be classified into
domains (e.g., car search, apartment search, etc.)
Based on form structure (e.g., number of fields)
Based on form field labels
Slow process
Done by a specific component in offline mode
Crawling JavaScript-rich sites
Web pages became more responsive, interactive,
user-friendly, etc.
Thanks to the emergence of new web technologies
such as AJAX
Besides, they led to the wide spread of web applications
(RIAs)
Challenge for crawlers, as they do not
Manipulate the client-side state of a site
Take into account asynchronous communication
with the server
Crawling JavaScript-rich sites
Very similar to the deep Web crawling challenge
Content is hard to crawl
Direct problem: AJAX/JS-enabled forms are hard to
deal with (e.g., to detect and then generate meaningful
queries)
Web pages are designed for human beings, not for
automatic programs
JS code must be processed to get the actual content
Dynamically changing
Lots of additional resources required (the crawler must
be supplemented with a JS interpreter)
Crawling JavaScript-rich sites
Several techniques for AJAX crawling proposed since
2007/08
Focus is either on indexing and searching, or on testing
RIAs
Approach:
An AJAX-enabled web page/application is modeled using
states, events, and transitions
The crawler uses a breadth-first strategy:
Triggers the events on a page
If the DOM of the page changes, a new
state/transition is added to the transition graph
Back to the initial state to invoke the next event
Crawling Multimedia Content
The Web is now a multimedia platform
Images, video, and audio are an integral part of web pages (not
just supplementing them)
Almost all crawlers, however, treat the Web as a textual
repository
One reason: indexing techniques for multimedia have not
yet reached the maturity required by interesting use
cases/applications
Hence, no real need to harvest multimedia
But state-of-the-art multimedia retrieval/computer vision
techniques already provide adequate search quality
E.g., search for images with a cat and a man based on the
actual image content (not the text around/close to the image)
In the case of video: a set of frames plus audio (which can be
converted to textual form)
Crawling Multimedia Content
Challenges in crawling multimedia
Bigger load on web sites since the files are bigger
More apparent copyright issues
More resources (e.g., bandwidth, storage space) required
from the crawler
More complicated duplicate resolution
Re-visiting policy
Crawling Multimedia Content
Scalable Multimedia Web Observatory of the ARCOMEM
project (http://www.arcomem.eu)
Focus on web archiving issues
Uses several crawlers:
- ’Standard’ crawler for regular web pages
- API crawler to mine social media sources (e.g., Twitter,
Facebook, YouTube)
- Deep Web crawler able to extract information from
pre-defined web sites
Data can be exported in WARC (Web ARChive) files and in
RDF
Future Directions
Collaborative crawling, mixed pull-push model
Scalable adaptive strategies
Understanding site structure
Deep Web crawling
Semantic Web crawling
Media content crawling
Social network crawling
References: Crawl Datasets
Use for building your crawls, web graph analysis, web data
mining tasks, etc.
ClueWeb09 Dataset:
- http://lemurproject.org/clueweb09.php/
- One billion web pages, in ten languages
- 5TB compressed
- Hosted at several cloud services (free license required), or
a copy can be ordered on hard disks (pay for the disks)
ClueWeb12:
- Almost 900 million English web pages
Common Crawl Corpus:
- See http://commoncrawl.org/data/accessing-the-data/
and http://aws.amazon.com/datasets/41740
- Around six billion web pages
- Over 100TB uncompressed
- Available as Amazon Web Services’ public dataset (pay for
processing)
Internet Archive:
- See http://blog.archive.org/2012/10/26/
80-terabytes-of-archived-web-crawl-data-available-for-resea
- Crawl of 2011
- 80TB WARC files
- 2.7 billion pages
- Includes multimedia data
- Available by request
LAW Datasets:
- http://law.dsi.unimi.it/datasets.php
- Variety of web graph datasets (nodes, arcs, etc.), including
basic properties of recent Facebook graphs (!)
- Thoroughly studied in a number of publications
ICWSM 2011 Spinn3r Dataset:
- http://www.icwsm.org/data/
- 130 million blog posts and 230 million social media publications
- 2TB compressed
Academic Web Link Database Project:
- http://cybermetrics.wlv.ac.uk/database/
- Crawls of national universities’ web sites
References: Literature
For beginners: Udacity/CS101 course;
http://www.udacity.com/overview/Course/cs101
Intermediate: Chapter 20 of Introduction to Information
Retrieval book by Manning, Raghavan, Schütze;
http://nlp.stanford.edu/IR-book/pdf/20crawl.pdf
Intermediate: Current Challenges in Web Crawling tutorial
at ICWE 2013 by Shestakov; http://www.slideshare.
net/denshe/icwe13-tutorial-webcrawling
Advanced: Web Crawling by Olston and Najork;
http://www.nowpublishers.com/product.aspx?product=
INR&doi=1500000017
See relevant publications at Mendeley:
http://www.mendeley.com/groups/531771/web-crawling/
Feel free to join the group!
Check the ’Deep Web’ group too:
http://www.mendeley.com/groups/601801/deep-web/

Contenu connexe

Tendances

Implementing Semantic Search
Implementing Semantic SearchImplementing Semantic Search
Implementing Semantic SearchPaul Wlodarczyk
 
Artificial Intelligence in the Financial Industries
Artificial Intelligence in the Financial IndustriesArtificial Intelligence in the Financial Industries
Artificial Intelligence in the Financial IndustriesGerardo Salandra
 
Mike Sharples - Generative AI and Large Language Models in Digital Education....
Mike Sharples - Generative AI and Large Language Models in Digital Education....Mike Sharples - Generative AI and Large Language Models in Digital Education....
Mike Sharples - Generative AI and Large Language Models in Digital Education....EADTU
 
What is Web 3,0?
What is Web 3,0?What is Web 3,0?
What is Web 3,0?dWebGuide1
 
22 3 2022 - AI & Marketing - Commpass - Hugues Rey
22 3 2022 - AI & Marketing - Commpass - Hugues Rey 22 3 2022 - AI & Marketing - Commpass - Hugues Rey
22 3 2022 - AI & Marketing - Commpass - Hugues Rey Hugues Rey
 
Wikipedia Powerpoint
Wikipedia PowerpointWikipedia Powerpoint
Wikipedia PowerpointChad Balmuth
 
How Google search works ppt
How Google search works pptHow Google search works ppt
How Google search works pptHardik Mahant
 
Principle Of Accounts School Based Assessments 2017 Guide
Principle Of Accounts School Based Assessments 2017 GuidePrinciple Of Accounts School Based Assessments 2017 Guide
Principle Of Accounts School Based Assessments 2017 GuideDarien Guillen
 
Leading responsible AI - the role of librarians and information professionals
Leading responsible AI - the role of librarians and information professionalsLeading responsible AI - the role of librarians and information professionals
Leading responsible AI - the role of librarians and information professionalsNicholas Poole
 
Introduction of Data Science
Introduction of Data ScienceIntroduction of Data Science
Introduction of Data ScienceJason Geng
 
Visualization For Data Science
Visualization For Data ScienceVisualization For Data Science
Visualization For Data ScienceAngela Zoss
 

Tendances (20)

Data Analytics
Data AnalyticsData Analytics
Data Analytics
 
Implementing Semantic Search
Implementing Semantic SearchImplementing Semantic Search
Implementing Semantic Search
 
Web Mining
Web MiningWeb Mining
Web Mining
 
Artificial Intelligence in the Financial Industries
Artificial Intelligence in the Financial IndustriesArtificial Intelligence in the Financial Industries
Artificial Intelligence in the Financial Industries
 
Mike Sharples - Generative AI and Large Language Models in Digital Education....
Mike Sharples - Generative AI and Large Language Models in Digital Education....Mike Sharples - Generative AI and Large Language Models in Digital Education....
Mike Sharples - Generative AI and Large Language Models in Digital Education....
 
Web mining
Web miningWeb mining
Web mining
 
What is Web 3,0?
What is Web 3,0?What is Web 3,0?
What is Web 3,0?
 
22 3 2022 - AI & Marketing - Commpass - Hugues Rey
22 3 2022 - AI & Marketing - Commpass - Hugues Rey 22 3 2022 - AI & Marketing - Commpass - Hugues Rey
22 3 2022 - AI & Marketing - Commpass - Hugues Rey
 
Introduction to Data Analytics
Introduction to Data AnalyticsIntroduction to Data Analytics
Introduction to Data Analytics
 
Wikipedia Powerpoint
Wikipedia PowerpointWikipedia Powerpoint
Wikipedia Powerpoint
 
web mining
web miningweb mining
web mining
 
How Google search works ppt
How Google search works pptHow Google search works ppt
How Google search works ppt
 
Web 3.0.pptx
Web 3.0.pptxWeb 3.0.pptx
Web 3.0.pptx
 
Principle Of Accounts School Based Assessments 2017 Guide
Principle Of Accounts School Based Assessments 2017 GuidePrinciple Of Accounts School Based Assessments 2017 Guide
Principle Of Accounts School Based Assessments 2017 Guide
 
#ChatGPT #ResponsibleAI
#ChatGPT #ResponsibleAI#ChatGPT #ResponsibleAI
#ChatGPT #ResponsibleAI
 
Leading responsible AI - the role of librarians and information professionals
Leading responsible AI - the role of librarians and information professionalsLeading responsible AI - the role of librarians and information professionals
Leading responsible AI - the role of librarians and information professionals
 
Introduction of Data Science
Introduction of Data ScienceIntroduction of Data Science
Introduction of Data Science
 
Big Data Trends
Big Data TrendsBig Data Trends
Big Data Trends
 
Searching techniques
Searching techniquesSearching techniques
Searching techniques
 
Visualization For Data Science
Visualization For Data ScienceVisualization For Data Science
Visualization For Data Science
 

En vedette

Search Interfaces on the Web: Querying and Characterizing, PhD dissertation
Search Interfaces on the Web: Querying and Characterizing, PhD dissertationSearch Interfaces on the Web: Querying and Characterizing, PhD dissertation
Search Interfaces on the Web: Querying and Characterizing, PhD dissertationDenis Shestakov
 
Distro-independent Hadoop cluster management
Distro-independent Hadoop cluster managementDistro-independent Hadoop cluster management
Distro-independent Hadoop cluster managementDataWorks Summit
 
Search engine and web crawler
Search engine and web crawlerSearch engine and web crawler
Search engine and web crawlerishmecse13
 
Building a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopBuilding a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopHadoop User Group
 
How to Face the Challenges of Web Archiving? The Experiences of a Small Libra...
How to Face the Challenges of Web Archiving? The Experiences of a Small Libra...How to Face the Challenges of Web Archiving? The Experiences of a Small Libra...
How to Face the Challenges of Web Archiving? The Experiences of a Small Libra...Liber2012
 
Challenges in Large-Scale Web Crawling
Challenges in Large-Scale Web CrawlingChallenges in Large-Scale Web Crawling
Challenges in Large-Scale Web CrawlingNate Murray
 
The promises of web scrapping: Mining the web for relational data about artists
The promises of web scrapping: Mining the web for relational data about artistsThe promises of web scrapping: Mining the web for relational data about artists
The promises of web scrapping: Mining the web for relational data about artistsGuillaume Cabanac
 
Web crawler synopsis
Web crawler synopsisWeb crawler synopsis
Web crawler synopsisMayur Garg
 
How many citations are there in the Data Citation Index?
How many citations are there in the Data Citation Index?How many citations are there in the Data Citation Index?
How many citations are there in the Data Citation Index?Nicolas Robinson-Garcia
 
Exploring Citation Networks to Study Intertextuality in Classics
Exploring Citation Networks to Study Intertextuality in ClassicsExploring Citation Networks to Study Intertextuality in Classics
Exploring Citation Networks to Study Intertextuality in ClassicsMatteo Romanello
 
Cloud Deployments with Apache Hadoop and Apache HBase
Cloud Deployments with Apache Hadoop and Apache HBaseCloud Deployments with Apache Hadoop and Apache HBase
Cloud Deployments with Apache Hadoop and Apache HBaseDATAVERSITY
 
Efficient blocking method for a large scale citation matching
Efficient blocking method for a large scale citation matchingEfficient blocking method for a large scale citation matching
Efficient blocking method for a large scale citation matchingMateusz Fedoryszak
 
Large scale crawling with Apache Nutch
Large scale crawling with Apache NutchLarge scale crawling with Apache Nutch
Large scale crawling with Apache NutchJulien Nioche
 

En vedette (20)

Web Crawling & Crawler
Web Crawling & CrawlerWeb Crawling & Crawler
Web Crawling & Crawler
 
Web crawler
Web crawlerWeb crawler
Web crawler
 
Web crawler
Web crawlerWeb crawler
Web crawler
 
Search Interfaces on the Web: Querying and Characterizing, PhD dissertation
Search Interfaces on the Web: Querying and Characterizing, PhD dissertationSearch Interfaces on the Web: Querying and Characterizing, PhD dissertation
Search Interfaces on the Web: Querying and Characterizing, PhD dissertation
 
Distro-independent Hadoop cluster management
Distro-independent Hadoop cluster managementDistro-independent Hadoop cluster management
Distro-independent Hadoop cluster management
 
Web Crawler
Web CrawlerWeb Crawler
Web Crawler
 
Search engine and web crawler
Search engine and web crawlerSearch engine and web crawler
Search engine and web crawler
 
Building a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopBuilding a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with Hadoop
 
How to Face the Challenges of Web Archiving? The Experiences of a Small Libra...
How to Face the Challenges of Web Archiving? The Experiences of a Small Libra...How to Face the Challenges of Web Archiving? The Experiences of a Small Libra...
How to Face the Challenges of Web Archiving? The Experiences of a Small Libra...
 
Challenges in Large-Scale Web Crawling
Challenges in Large-Scale Web CrawlingChallenges in Large-Scale Web Crawling
Challenges in Large-Scale Web Crawling
 
Web Crawling
Web CrawlingWeb Crawling
Web Crawling
 
Nano
NanoNano
Nano
 
The promises of web scrapping: Mining the web for relational data about artists
The promises of web scrapping: Mining the web for relational data about artistsThe promises of web scrapping: Mining the web for relational data about artists
The promises of web scrapping: Mining the web for relational data about artists
 
Web crawler synopsis
Web crawler synopsisWeb crawler synopsis
Web crawler synopsis
 
Web Scraping : Crawling
Web Scraping : CrawlingWeb Scraping : Crawling
Web Scraping : Crawling
 
How many citations are there in the Data Citation Index?
How many citations are there in the Data Citation Index?How many citations are there in the Data Citation Index?
How many citations are there in the Data Citation Index?
 
Exploring Citation Networks to Study Intertextuality in Classics
Exploring Citation Networks to Study Intertextuality in ClassicsExploring Citation Networks to Study Intertextuality in Classics
Exploring Citation Networks to Study Intertextuality in Classics
 
Cloud Deployments with Apache Hadoop and Apache HBase
Cloud Deployments with Apache Hadoop and Apache HBaseCloud Deployments with Apache Hadoop and Apache HBase
Cloud Deployments with Apache Hadoop and Apache HBase
 
Efficient blocking method for a large scale citation matching
Efficient blocking method for a large scale citation matchingEfficient blocking method for a large scale citation matching
Efficient blocking method for a large scale citation matching
 
Large scale crawling with Apache Nutch
Large scale crawling with Apache NutchLarge scale crawling with Apache Nutch
Large scale crawling with Apache Nutch
 

Similaire à Intelligent web crawling

Web Data Extraction: A Crash Course
Web Data Extraction: A Crash CourseWeb Data Extraction: A Crash Course
Web Data Extraction: A Crash CourseGiorgio Orsi
 
Intelligent Web Crawling (WI-IAT 2013 Tutorial)
Intelligent Web Crawling (WI-IAT 2013 Tutorial)Intelligent Web Crawling (WI-IAT 2013 Tutorial)
Intelligent Web Crawling (WI-IAT 2013 Tutorial)Denis Shestakov
 
Deep Web: Databases on the Web
Deep Web: Databases on the WebDeep Web: Databases on the Web
Deep Web: Databases on the WebDenis Shestakov
 
The Internet as a Single Database
The Internet as a Single DatabaseThe Internet as a Single Database
The Internet as a Single DatabaseDatafiniti
 
Advance Frameworks for Hidden Web Retrieval Using Innovative Vision-Based Pag...
Advance Frameworks for Hidden Web Retrieval Using Innovative Vision-Based Pag...Advance Frameworks for Hidden Web Retrieval Using Innovative Vision-Based Pag...
Advance Frameworks for Hidden Web Retrieval Using Innovative Vision-Based Pag...IOSR Journals
 
Norfolk Intranet 2.0
Norfolk Intranet 2.0Norfolk Intranet 2.0
Norfolk Intranet 2.0djoneseaccess
 
Accessibility Geek Upv2
Accessibility Geek Upv2Accessibility Geek Upv2
Accessibility Geek Upv2philsmears
 
Introduction to web technology
Introduction to web technologyIntroduction to web technology
Introduction to web technologyPallawiBulakh1
 
Web Services Emissions 2006 Falke
Web Services Emissions 2006 FalkeWeb Services Emissions 2006 Falke
Web Services Emissions 2006 FalkeRudolf Husar
 
STUDY OF DEEP WEB AND A NEW FORM BASED CRAWLING TECHNIQUE
STUDY OF DEEP WEB AND A NEW FORM BASED CRAWLING TECHNIQUESTUDY OF DEEP WEB AND A NEW FORM BASED CRAWLING TECHNIQUE
STUDY OF DEEP WEB AND A NEW FORM BASED CRAWLING TECHNIQUEIAEME Publication
 
Web Accessibility Acronyms - Spring Break Conference 2008
Web Accessibility Acronyms - Spring Break Conference 2008Web Accessibility Acronyms - Spring Break Conference 2008
Web Accessibility Acronyms - Spring Break Conference 2008Andrea Hill
 
Analyzing Web Crawler as Feed Forward Engine for Efficient Solution to Search...
Analyzing Web Crawler as Feed Forward Engine for Efficient Solution to Search...Analyzing Web Crawler as Feed Forward Engine for Efficient Solution to Search...
Analyzing Web Crawler as Feed Forward Engine for Efficient Solution to Search...M. Atif Qureshi
 
Web Crawler For Mining Web Data
Web Crawler For Mining Web DataWeb Crawler For Mining Web Data
Web Crawler For Mining Web DataIRJET Journal
 
Semantic.edu, an introduction
Semantic.edu, an introductionSemantic.edu, an introduction
Semantic.edu, an introductionBryan Alexander
 
Security-Challenges-in-Implementing-Semantic-Web-Unifying-Logic
Security-Challenges-in-Implementing-Semantic-Web-Unifying-LogicSecurity-Challenges-in-Implementing-Semantic-Web-Unifying-Logic
Security-Challenges-in-Implementing-Semantic-Web-Unifying-LogicNana Kwame(Emeritus) Gyamfi
 
Improve your Tech Quotient
Improve your Tech QuotientImprove your Tech Quotient
Improve your Tech QuotientTarence DSouza
 
Web Archives and the dream of the Personal Search Engine
Web Archives and the dream of the Personal Search EngineWeb Archives and the dream of the Personal Search Engine
Web Archives and the dream of the Personal Search EngineArjen de Vries
 
Info2006 Web20 Taly Print
Info2006 Web20 Taly PrintInfo2006 Web20 Taly Print
Info2006 Web20 Taly PrintRam Srivastava
 

Similaire à Intelligent web crawling (20)

Webware Webinar
Webware WebinarWebware Webinar
Webware Webinar
 
Web Data Extraction: A Crash Course
Web Data Extraction: A Crash CourseWeb Data Extraction: A Crash Course
Web Data Extraction: A Crash Course
 
Intelligent Web Crawling (WI-IAT 2013 Tutorial)
Intelligent Web Crawling (WI-IAT 2013 Tutorial)Intelligent Web Crawling (WI-IAT 2013 Tutorial)
Intelligent Web Crawling (WI-IAT 2013 Tutorial)
 
Deep Web: Databases on the Web
Deep Web: Databases on the WebDeep Web: Databases on the Web
Deep Web: Databases on the Web
 
The Internet as a Single Database
The Internet as a Single DatabaseThe Internet as a Single Database
The Internet as a Single Database
 
Advance Frameworks for Hidden Web Retrieval Using Innovative Vision-Based Pag...
Advance Frameworks for Hidden Web Retrieval Using Innovative Vision-Based Pag...Advance Frameworks for Hidden Web Retrieval Using Innovative Vision-Based Pag...
Advance Frameworks for Hidden Web Retrieval Using Innovative Vision-Based Pag...
 
Norfolk Intranet 2.0
Norfolk Intranet 2.0Norfolk Intranet 2.0
Norfolk Intranet 2.0
 
Accessibility Geek Upv2
Accessibility Geek Upv2Accessibility Geek Upv2
Accessibility Geek Upv2
 
Introduction to web technology
Introduction to web technologyIntroduction to web technology
Introduction to web technology
 
Web Services Emissions 2006 Falke
Web Services Emissions 2006 FalkeWeb Services Emissions 2006 Falke
Web Services Emissions 2006 Falke
 
STUDY OF DEEP WEB AND A NEW FORM BASED CRAWLING TECHNIQUE
STUDY OF DEEP WEB AND A NEW FORM BASED CRAWLING TECHNIQUESTUDY OF DEEP WEB AND A NEW FORM BASED CRAWLING TECHNIQUE
STUDY OF DEEP WEB AND A NEW FORM BASED CRAWLING TECHNIQUE
 
Web Accessibility Acronyms - Spring Break Conference 2008
Web Accessibility Acronyms - Spring Break Conference 2008Web Accessibility Acronyms - Spring Break Conference 2008
Web Accessibility Acronyms - Spring Break Conference 2008
 
Analyzing Web Crawler as Feed Forward Engine for Efficient Solution to Search...
Analyzing Web Crawler as Feed Forward Engine for Efficient Solution to Search...Analyzing Web Crawler as Feed Forward Engine for Efficient Solution to Search...
Analyzing Web Crawler as Feed Forward Engine for Efficient Solution to Search...
 
Web Crawler For Mining Web Data
Web Crawler For Mining Web DataWeb Crawler For Mining Web Data
Web Crawler For Mining Web Data
 
Semantic.edu, an introduction
Semantic.edu, an introductionSemantic.edu, an introduction
Semantic.edu, an introduction
 
Security-Challenges-in-Implementing-Semantic-Web-Unifying-Logic
Security-Challenges-in-Implementing-Semantic-Web-Unifying-LogicSecurity-Challenges-in-Implementing-Semantic-Web-Unifying-Logic
Security-Challenges-in-Implementing-Semantic-Web-Unifying-Logic
 
Improve your Tech Quotient
Improve your Tech QuotientImprove your Tech Quotient
Improve your Tech Quotient
 
Web mining
Web miningWeb mining
Web mining
 
Web Archives and the dream of the Personal Search Engine
Web Archives and the dream of the Personal Search EngineWeb Archives and the dream of the Personal Search Engine
Web Archives and the dream of the Personal Search Engine
 
Info2006 Web20 Taly Print
Info2006 Web20 Taly PrintInfo2006 Web20 Taly Print
Info2006 Web20 Taly Print
 

Plus de Denis Shestakov

Lectio Praecursoria: Search Interfaces on the Web: Querying and Characterizin...
Lectio Praecursoria: Search Interfaces on the Web: Querying and Characterizin...Lectio Praecursoria: Search Interfaces on the Web: Querying and Characterizin...
Lectio Praecursoria: Search Interfaces on the Web: Querying and Characterizin...Denis Shestakov
 
Terabyte-scale image similarity search: experience and best practice
Terabyte-scale image similarity search: experience and best practiceTerabyte-scale image similarity search: experience and best practice
Terabyte-scale image similarity search: experience and best practiceDenis Shestakov
 
Scalable high-dimensional indexing with Hadoop
Scalable high-dimensional indexing with HadoopScalable high-dimensional indexing with Hadoop
Scalable high-dimensional indexing with HadoopDenis Shestakov
 
Sampling national deep Web
Sampling national deep WebSampling national deep Web
Sampling national deep WebDenis Shestakov
 
On building a search interface discovery system
On building a search interface discovery systemOn building a search interface discovery system
On building a search interface discovery systemDenis Shestakov
 
Biological Database Systems
Biological Database SystemsBiological Database Systems
Biological Database SystemsDenis Shestakov
 

Plus de Denis Shestakov (6)

Lectio Praecursoria: Search Interfaces on the Web: Querying and Characterizin...
Lectio Praecursoria: Search Interfaces on the Web: Querying and Characterizin...Lectio Praecursoria: Search Interfaces on the Web: Querying and Characterizin...
Lectio Praecursoria: Search Interfaces on the Web: Querying and Characterizin...
 
Terabyte-scale image similarity search: experience and best practice
Terabyte-scale image similarity search: experience and best practiceTerabyte-scale image similarity search: experience and best practice
Terabyte-scale image similarity search: experience and best practice
 
Scalable high-dimensional indexing with Hadoop
Scalable high-dimensional indexing with HadoopScalable high-dimensional indexing with Hadoop
Scalable high-dimensional indexing with Hadoop
 
Sampling national deep Web
Sampling national deep WebSampling national deep Web
Sampling national deep Web
 
On building a search interface discovery system
On building a search interface discovery systemOn building a search interface discovery system
On building a search interface discovery system
 
Biological Database Systems
Biological Database SystemsBiological Database Systems
Biological Database Systems
 

Dernier

Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 

Dernier (20)

Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 

Intelligent web crawling

  • 1. INTELLIGENT WEB CRAWLING WI-IAT 2013 Tutorial WI-IAT 2013 Tutorial, Atlanta, USA, 20.11.2013 ver 1.8: 10.04.2015 Denis Shestakov denshe at gmail Department of Media Technology, Aalto University, Finland
  • 2. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 1/98 References to this tutorial To cite please use: D. Shestakov, "Intelligent Web Crawling," IEEE Intelligent Informatics Bulletin, 14(1), pp. 5-7, 2013. [BibTeX]
  • 3. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 2/98 Speaker’s Bio (2009-2013) Postdoc in Web Services Group, Aalto University, Finland PhD thesis (2008) on limited coverage of web crawlers Over ten years of experience in the area Tutorials on web crawling given at SAC’12 and ICWE’13 Web Services Group in 2011
  • 4. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 3/98 Speaker’s Info As of 2013: Current: http://www.linkedin.com/in/dshestakov http://www.mendeley.com/profiles/ denis-shestakov/ http://www.researchgate.net/profile/ Denis_Shestakov https://mediatech.aalto.fi/~denis/
  • 5. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 4/98 TUTORIAL OUTLINE I. OVERVIEW Web crawling in a nutshell Web crawling applications Web size and web link structure II. INTELLIGENT WEB CRAWLING Architecture of web crawler Crawling strategies Adaptive crawling approaches III. OPEN CHALLENGES Crawlers in Web ecosystem Collaborative web crawling Deep Web crawling Crawling multimedia content
  • 6. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 5/98 Links to Tutorial Slides: http://goo.gl/woVtQk http://www.slideshare.net/denshe/presentations Similar tutorials: Tutorials on web crawling at ICWE’13 and SAC’12 How they differ from this tutorial: they overview the topic more broadly (parts I and III) but do not cover crawling strategies (part II) Supporting materials: http://www.mendeley.com/groups/531771/web-crawling/
  • 7. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 6/98 PART I: OVERVIEW Visualization of http://media.tkk.fi/webservices by aharef.info applet
  • 8. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 7/98 Outline of Part I Overview of Web Crawling Web crawling in a nutshell Web crawling applications Web size and web link structure
  • 9. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 8/98 Web Crawling in a Nutshell Automatic harvesting of web content Done by web crawlers (also known as robots, bots or spiders) Follow a link from a set of links (URL queue), download a page, extract all links, eliminate already visited, add the rest to the queue Then repeat Set of policies involved (like ’ignore links to images’, etc.)
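The loop above can be sketched in a few lines of Python. Here `fetch_links` stands in for the real HTTP download and link-extraction step, and `policy` for crawling policies such as ’ignore links to images’ (both names are illustrative, not from the tutorial):

```python
from collections import deque

def crawl(seed_urls, fetch_links, policy=lambda url: True):
    """Minimal crawl loop: follow a link from the URL queue, 'download' the
    page, extract its links, eliminate already visited ones, apply policies,
    and add the rest to the queue. fetch_links(url) -> list of URLs on that
    page; in a real crawler it would perform the HTTP fetch and HTML parse."""
    frontier = deque(seed_urls)      # URL queue
    visited = set(seed_urls)
    order = []                       # pages in download order
    while frontier:
        url = frontier.popleft()     # follow a link from the set of links
        order.append(url)
        for link in fetch_links(url):
            if link not in visited and policy(link):
                visited.add(link)
                frontier.append(link)
    return order
```

For example, with an in-memory link graph `{"a": ["b", "c"], "b": ["a", "c"], "c": ["d"], "d": []}` and seed `["a"]`, the crawl visits a, b, c, d; a policy like `lambda u: not u.endswith(".jpg")` would implement the ’ignore links to images’ rule.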
  • 10. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 9/98 Web Crawling in a Nutshell Example: 1. Follow http://media.tkk.fi/webservices (visualization of its HTML DOM tree below) 2. Extract URLs inside blue bubbles (designating <a> tags) 3. Remove already visited URLs 4. For each non-visited URL, start at Step 1
  • 11. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 10/98 Web Crawling in a Nutshell In essence: simple and naive process However, a number of ’restrictions’ imposed make it much more complicated Most complexities due to operating environment (Web) For example, do not overload web servers (challenging as distribution of web pages on web servers is non-uniform) Or avoiding web spam (not only useless but consumes resources and often spoils the collected content)
  • 12. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 11/98 Web Crawling in a Nutshell Crawler Agents First in 1993: the Wanderer (written in Perl) Over 1100 different crawler signatures (User-Agent string in HTTP request header) mentioned at http://www.crawltrack.net/crawlerlist.php Educated guess on the overall number of different crawlers – at least several thousand Write your own in a few dozen lines of code (using libraries for URL fetching and HTML parsing) Or use an existing agent: e.g., the wget tool (developed since 1996; http://www.gnu.org/software/wget/)
  • 13. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 12/98 Web Crawling in a Nutshell Crawler Agents For advanced tasks, you may modify the code of existing projects in your preferred programming language Crawlers play a big role on the Web Bring more traffic to certain web sites than human visitors Generate a sizeable portion of traffic to any (public) web site Crawler traffic is important for emerging web sites
  • 14. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 13/98 Web Crawling in a Nutshell Classification General/universal crawlers Not so many of them, lots of resources required Big web search engines Topical/focused crawlers Pages/sites on certain topic Crawling all in one specific (i.e., national) web segment is rather general, though Batch crawling One or several (static) snapshots Incremental/continuous crawling Re-visiting Resources divided between fetching newly discovered pages and re-downloading previously crawled pages Search engines
  • 15. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 14/98 Applications of Web Crawling Web Search Engines Google, Microsoft Bing, (Yahoo), Baidu, Naver, Yandex, Ask, ... One of three underlying technology stacks
  • 16. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 15/98 Applications of Web Crawling Web Search Engines One of three underlying technology stacks BTW, what are the other two and which is the most ’crucial’?
  • 17. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 16/98 Applications of Web Crawling Web Search Engines What are the other two and which is the most ’crucial’? Query processor (particularly, ranking)
  • 18. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 17/98 Applications of Web Crawling Web Archiving Digital preservation “Librarian” look on the Web The biggest: Internet Archive Quite huge collections Batch crawls Primarily, collection of national web sites – web sites at country-specific TLDs or physically hosted in a country There are quite many and some are huge! see the list of Web Archiving Initiatives at Wikipedia
  • 19. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 18/98 Applications of Web Crawling Vertical Search Engines Data aggregating from many sources on certain topic E.g., apartment search, car search
  • 20. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 19/98 Applications of Web Crawling Web Data Mining “To get data to be actually mined” Usually using focused crawlers For example, opinion mining Or digests of current happenings on the Web (e.g., what music people listen to now)
  • 21. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 20/98 Applications of Web Crawling Web Monitoring Monitoring sites/pages for changes and updates
  • 22. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 21/98 Applications of Web Crawling Detection of malicious web sites Typically a part of anti-virus, firewall, search engine, etc. service Building a list of such web sites and inform a user about potential threat of visiting such
  • 23. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 22/98 Applications of Web Crawling Web site/application testing Crawl a web site to check navigation through it, the validity of its links, etc. Regression/security/... testing of a rich internet application (RIA) via crawling Checking different application states by simulating possible user interaction events (e.g., mouse click, time-out)
  • 24. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 23/98 Applications of Web Crawling Copyright violation detection Crawl to find (media) items under copyright or links to them Regular re-visiting ’suspicious’ web sites, forums, etc. Tasks like finding terrorist chat rooms also go here
  • 25. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 24/98 Applications of Web Crawling Web Scraping Extracting particular pieces of information from a group of typically similar pages When an API to the data is not available Interestingly, scraping might be preferable even when an API is available, as scraped data is often cleaner and more up-to-date than data obtained via the API
  • 26. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 25/98 Applications of Web Crawling Web Mirroring Copying of web sites Hosting copies on different servers to ensure 24x7 accessibility
  • 27. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 26/98 Industry vs. Academia Divide In web crawling domain Huge lag between industrial and academic web crawlers Research-wise and development-wise Algorithms, techniques, strategies used in industrial crawlers (namely, operated by search engines) poorly known Industrial crawlers operate on a web scale That is, dozens of billions of pages Only a few academic crawlers dealt with more than one billion pages Academic scale is rather hundreds of millions
  • 28. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 27/98 Industry vs. Academia Re-crawling Batch crawls in academia Regular re-crawls by industrial crawlers Evaluation of crawled data Crucial for corrections/improvements into crawlers Direct evaluation by users of search engines To some extent, artificial evaluation of academic crawls
  • 29. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 28/98 Web Size and Structure Some numbers Number of pages per host is not uniform: most hosts contain only a few pages, others contain millions Roughly 100 links on a page According to Google statistics (over 4 billion pages, 2010): fetching a page transfers about 320KB (textual content plus all embeddings) A page has 10-100KB of textual (HTML) content on average One trillion URLs known by Google/Yahoo in 2008
  • 30. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 29/98 Web Size and Structure Some numbers 20 million web pages in 1995 (indexed by AltaVista) One trillion (10^12) URLs known by Google/Yahoo in 2008 - ’Independent’ search engine called Majestic12 (P2P-crawling) confirms one trillion items Doesn’t mean one trillion indexed pages Supposedly, the index has dozens of times fewer pages Cool crawler facts: IRLbot crawler (running on one server) downloaded 6.4 billion pages over 2 months Throughput: 1000-1500 pages per second Over 30 billion discovered URLs
  • 31. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 30/98 Web Size and Structure Bow-tie model of the Web Illustration taken from http://dx.doi.org/doi:10.1038/35012155
  • 32. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 31/98 PART II: INTELLIGENT WEB CRAWLING
  • 33. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 32/98 Outline of Part II Intelligent Web Crawling Architecture of web crawler Crawling strategies Adaptive crawling approaches
  • 34. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 33/98 Architecture of Web Crawler Crawler crawls the Web Crawled URLs URL Frontier Seed URLs Uncrawled Web
  • 35. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 34/98 Architecture of Web Crawler Typically in a distributed fashion Seed URLs Crawled URLs URL Frontier crawling thread Uncrawled Web
  • 36. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 35/98 Architecture of Web Crawler URL Frontier Include multiple pages from the same host Must avoid trying to fetch them all at the same time Must try to keep all crawling threads busy Prioritization also helps
  • 37. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 36/98 Architecture of Web Crawler Crawler Architecture Illustration taken from Introduction to Information Retrieval (Cambridge University Press, 2008) by Manning et al.
  • 38. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 37/98 Architecture of Web Crawler Content seen? If page fetched is already in the base/index, don’t process it Document fingerprints (shingles) Filtering Filter out URLs – due to ’politeness’, restrictions on crawl Fetched robots.txt are cached to avoid fetching them repeatedly Duplicate URL Elimination Check if an extracted+filtered URL has been already passed to frontier (batch crawling) More complicated in continuous crawling (different URL frontier implementation)
  • 39. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 38/98 Architecture of Web Crawler Distributed Crawling Run multiple crawl threads, under different processes (often at different nodes) Nodes can be geographically distributed Partition hosts being crawled into nodes
  • 40. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 39/98 Architecture of Web Crawler Host Splitter Illustration taken from Introduction to Information Retrieval (Cambridge University Press, 2008) by Manning et al.
  • 41. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 40/98 Architecture of Web Crawler Implementation (in Perl) Other popular languages: Java, Python, C/C++
  • 42. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 41/98 Architecture of Web Crawler Crawling objectives High web coverage High page freshness High content quality High download rate Internal and External factors Amount of hardware (I) Network bandwidth (I) Rate of web growth (E) Rate of web change (E) Amount of malicious content (i.e., spam, duplicates) (E)
  • 43. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 42/98 Crawling Strategies Download prioritization Given a period, only a subset of web pages can be downloaded “Important” pages first Hence the need for prioritization Ordering a queue of URLs to be visited Strategies (ordering metrics) Breadth-First, Depth-First Backlink count Best-First PageRank Shark-Search
  • 44. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 43/98 Crawling Strategies Breadth-First, Depth-First Breadth-First search Implemented with QUEUE (FIFO) Pages with shortest paths first Depth-First search Implemented with STACK (LIFO)
  • 45. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 44/98 Crawling Strategies Pseudocode for Breadth-First
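The two disciplines differ only in which end of the frontier the next URL is taken from, which can be sketched as follows (`get_links` is an illustrative stand-in for the fetch + parse step):

```python
from collections import deque

def traverse(seed, get_links, breadth_first=True):
    """Breadth-First (FIFO queue) vs Depth-First (LIFO stack) crawl order.
    A deque serves as both: popleft() gives the queue, pop() gives the stack.
    URLs are marked visited at insertion time, a common crawler simplification."""
    frontier = deque([seed])
    visited = {seed}
    order = []
    while frontier:
        url = frontier.popleft() if breadth_first else frontier.pop()
        order.append(url)
        for link in get_links(url):
            if link not in visited:
                visited.add(link)
                frontier.append(link)
    return order
```

Breadth-First reaches pages with the shortest link paths first; Depth-First dives along one chain of links before backtracking.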
  • 46. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 45/98 Crawling Strategies Backlink count Use the link graph information Count # of crawled pages that point to a page Links with highest counts first
  • 47. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 46/98 Crawling Strategies Best-First Best link selected based on some criterion E.g., lexical similarity between the topic’s keywords and a link’s source page Similarity score sim(topic, p) assigned to outgoing links of page p Cosine similarity often used: sim(q, p) = Σk fkq·fkp / √(Σk fkq² · Σk fkp²), where q is a topic, p is a crawled page, and fkq, fkp are frequencies of term k in q and p
  • 48. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 47/98 Crawling Strategies Pseudocode for Best-First
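The cosine similarity score used by Best-First can be sketched as below, over term-frequency vectors; simple whitespace tokenization is an assumption for illustration:

```python
import math
from collections import Counter

def cosine_sim(topic, page):
    """sim(q, p) = sum_k f_kq * f_kp / (||q|| * ||p||), where f_kq and f_kp
    are frequencies of term k in the topic q and the crawled page p."""
    q = Counter(topic.lower().split())
    p = Counter(page.lower().split())
    dot = sum(q[k] * p[k] for k in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * \
           math.sqrt(sum(v * v for v in p.values()))
    return dot / norm if norm else 0.0
```

In a Best-First crawler this score, computed between the topic keywords and a link's source page, would rank the outgoing links in the frontier.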
  • 49. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 48/98 Crawling Strategies PageRank The PageRank of a page is the probability for a random surfer (who follows links randomly) to be on this page at any given time A page’s score (rank) is defined by the scores of pages with links to this page: PR(p) = (1 − γ)/N + γ · Σd∈in(p) PR(d)/|out(d)|, where p is a page, in(p) is the set of pages with links to p, out(d) is the set of links out of d, N is the total number of pages, and γ is the damping factor PageRank of pages is periodically recalculated using a data structure with crawled pages
  • 50. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 49/98 Crawling Strategies Pseudocode for PageRank
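A minimal iterative sketch of the recalculation, using the (1 − γ)/N normalization variant of the formula; dangling pages simply lose their rank mass here, which is a simplification of production implementations:

```python
def pagerank(graph, gamma=0.85, iters=50):
    """Iterative PageRank: PR(p) = (1 - gamma)/N + gamma * sum over d in in(p)
    of PR(d)/|out(d)|. graph maps each page to the list of pages it links to."""
    pages = list(graph)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}          # uniform initial ranks
    for _ in range(iters):
        nxt = {p: (1.0 - gamma) / n for p in pages}
        for d, outs in graph.items():
            share = pr[d] / len(outs) if outs else 0.0
            for p in outs:
                nxt[p] += gamma * share        # d passes rank to its out-links
        pr = nxt
    return pr
```

In download prioritization, these ranks would be recomputed periodically over the crawled link graph and used to order the frontier.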
  • 51. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 50/98 Crawling Strategies Shark-Search More emphasis on web segments where relevant pages were found Penalizing segments yielding few relevant pages A link’s score is defined by the link’s anchor text, text surrounding the link (link context) and an inherited score from ancestor pages (pages pointing to a page with this link) Parameters: d - depth bound r - relative importance of inherited score versus link neighbourhood score
  • 52. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 51/98 Crawling Strategies Pseudocode for Shark-Search
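The score combination can be sketched as a pair of linear blends, following the usual Shark-Search formulation; the parameter `beta` weighing anchor text against link context is an illustrative addition not named on the slide:

```python
def shark_score(inherited, anchor_sim, context_sim, r=0.5, beta=0.8):
    """Shark-Search style link score: r balances the score inherited from
    ancestor pages against the link's own neighbourhood (anchor text and
    surrounding text); beta weighs anchor text against link context.
    All similarity inputs are assumed to lie in [0, 1]."""
    neighbourhood = beta * anchor_sim + (1 - beta) * context_sim
    return r * inherited + (1 - r) * neighbourhood
```

With `r` close to 1 the crawler keeps following segments that were relevant so far; with `r` close to 0 it trusts each link's immediate context instead.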
  • 53. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 52/98 Adaptive Crawling Static vs. adaptive strategies Strategies presented to this point are static Not adjust in the course of the crawl Adaptive (intelligent) crawling InfoSpiders Ant-based crawling
  • 54. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 53/98 Adaptive Crawling InfoSpiders Independent agents crawling in parallel HTML parser Noise word remover Stemmer Document relevance assessment Reproduction or death Learning Link assessment and selection HTML document Compact document representation Document assessment ########## $$$ ########## $$$ Term weights Neural net weights Keyword vector Agent representation
  • 55. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 54/98 Adaptive Crawling InfoSpiders Independent agents crawling in parallel Each agent uses a list of keywords (initialized with topic keywords) A neural network evaluates new links Keywords in the vicinity of a link are used as input More importance (weight) to those keywords close to a link Maximum weight to words in the anchor text Output is a numerical quality estimate for a link Link score combined with a cosine similarity score (between the agent’s keywords and the page with this link)
  • 56. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 55/98 Adaptive Crawling InfoSpiders Each agent has an energy level An agent moves from the current to a new page if the Boltzmann function returns true, where δ is the difference between the similarity of the new and the current page to the agent’s keywords If the energy level passes some threshold, an agent reproduces The offspring gets half of the parent’s frontier Offspring keywords are mutated (expanded) with the most frequent terms in the parent’s current document
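The Boltzmann move decision can be sketched as a stochastic sigmoid over δ; the temperature parameter controlling greediness is an illustrative default, not a value from the tutorial:

```python
import math
import random

def boltzmann_move(delta, temperature=0.1, rng=random.random):
    """Accept a move with probability 1 / (1 + e^(-delta/T)), where delta is
    the difference in similarity between the new and the current page with
    respect to the agent's keywords. Low temperature T makes the agent
    greedy; high T makes it explore more. rng is injectable for testing."""
    p = 1.0 / (1.0 + math.exp(-delta / temperature))
    return rng() < p
```

Large positive δ (the new page matches the keywords much better) makes the move almost certain; large negative δ makes it almost certain to be rejected.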
  • 57. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 56/98 Adaptive Crawling Pseudocode for InfoSpiders
  • 58. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 57/98 Adaptive Crawling Pseudocode for InfoSpiders (cont.)
  • 59. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 58/98 Adaptive Crawling Ant-based crawling Motivation: allow crawling agents to communicate with each other Follow a model of social insect collective behaviour Ants leave the pheromone along the followed path Other ants follow such pheromone trails A crawler agent follows some path by visiting many URLs At some moment, a certain amount of pheromone (weight) can be assigned to sequence of URLs on the followed path The amount can depend on similarity of visited pages to a given topic
  • 60. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 59/98 Adaptive Crawling Ant-based crawling Ants (crawlers) operate in cycles During each cycle, agents make a predefined number of moves (visits of pages) #moves = constant ∗ #cycle At the end of each cycle, pheromone intensity values are updated for the followed path Agents-ants return to their starting positions
  • 61. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 60/98 Adaptive Crawling Ant-based crawling Next link selected based on probability, which is defined by the corresponding pheromone intensity If no pheromone information, an agent-ant moves randomly
  • 62. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 61/98 Adaptive Crawling Ant-based crawling Probability of selecting a link: Pij(t) = τij(t) / Σ(i,l) τil(t), where t is the cycle number, τij(t) is the pheromone value between pi and pj, and (i, l) designates the presence of a link from pi to pl During the cycle, each ant stores the list of visited URLs If pj was already visited, Pij(t) = 0 At the end of the cycle, the list of visited URLs is emptied
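The selection probability above can be sketched as follows; the `pheromone` mapping from target page to τ value is an illustrative representation of one row of the pheromone matrix:

```python
def link_probabilities(pheromone, visited):
    """P_ij(t) = tau_ij(t) / sum over links (i,l) of tau_il(t), with visited
    pages getting probability 0. pheromone maps target page -> tau value.
    With no pheromone information at all, the ant moves randomly (uniform)."""
    candidates = [j for j in pheromone if j not in visited]
    probs = {j: 0.0 for j in pheromone}
    if not candidates:
        return probs
    total = sum(pheromone[j] for j in candidates)
    for j in candidates:
        probs[j] = pheromone[j] / total if total > 0 else 1.0 / len(candidates)
    return probs
```

An ant would then sample its next page from this distribution, and pheromone values on the followed path would be reinforced at the end of the cycle.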
  • 63. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 62/98 Adaptive Crawling Implications Strategies evaluating links based on their context (text close by) are not directly applicable to large-scale crawling E.g., consider crawling 10^9 pages within one month Crawl rate: around 400 documents per second Around 40000 links per second Every second 10000-30000 “new” links to be evaluated (scored) and added to the frontier Too many even for evaluating only a link’s anchor text
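The back-of-the-envelope rates above follow directly from the page count and the roughly-100-links-per-page figure:

```python
def required_rates(pages, days=30, links_per_page=100):
    """Crawl throughput needed to fetch `pages` pages in `days` days, and the
    resulting number of extracted links per second."""
    pages_per_sec = pages / (days * 24 * 3600)
    return pages_per_sec, pages_per_sec * links_per_page
```

For 10^9 pages in 30 days this gives roughly 386 pages per second and about 39000 extracted links per second, matching the slide's rounded figures.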
  • 64. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 63/98 PART III: OPEN CHALLENGES
  • 65. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 64/98 Outline of Part III Open Challenges Crawlers in Web ecosystem Collaborative web crawling Deep Web crawling Crawling multimedia content
  • 66. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 65/98 Crawlers in Web ecosystem Push vs. Pull model Web pages accessed via pull model - HTTP is a pull protocol That is, a client requests a page from a server If push, a server would send a page/info to a client Why Pull? Pull is just easier for both parties No ’agreement’ between provider and aggregator No specific protocols for content providers – serving content is enough Perhaps the pull model is the reason why the Web succeeded while earlier hypertext systems failed
  • 67. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 66/98 Crawlers in Web ecosystem Why not Push? Still pull model has several disadvantages What are these?
  • 68. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 67/98 Crawlers in Web ecosystem Why not Push? Still, the pull model has several disadvantages Publishing/updating content is easier with push: no need for redundant requests from crawlers Better control over the content for providers: no need for crawler politeness
  • 69. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 68/98 Crawlers in Web ecosystem Crawler politeness Content providers possess some control over crawlers Via special protocols to define access to parts of a site Via direct banning of agents hitting a site too often
  • 70. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 69/98 Crawlers in Web ecosystem Crawler politeness Robots.txt says what can(not) be crawled Sitemaps is a newer protocol that lists a site’s URLs along with metadata (e.g., update frequency) No agent should visit any URL starting with “yoursite/notcrawldir”, except an agent called “goodsearcher” Example User-agent: * Disallow: /notcrawldir User-agent: goodsearcher Disallow:
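Python's standard `urllib.robotparser` can evaluate such rules. Below, the slide's example rules are supplied as lines instead of being fetched from a site; the path `/notcrawldir` and host `yoursite` are the slide's illustrative names:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
# The slide's example robots.txt, fed in directly rather than fetched:
rp.parse([
    "User-agent: *",
    "Disallow: /notcrawldir",
    "",
    "User-agent: goodsearcher",
    "Disallow:",
])

# Any agent is barred from /notcrawldir, except "goodsearcher":
blocked = rp.can_fetch("somebot", "http://yoursite/notcrawldir/page.html")
allowed = rp.can_fetch("goodsearcher", "http://yoursite/notcrawldir/page.html")
```

A polite crawler would call `can_fetch` with its own User-Agent string before queueing every URL, and cache the parsed robots.txt per host to avoid fetching it repeatedly.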
  • 71. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 70/98 Collaborative Crawling Main considerations Lots of redundant crawling To get data (often on a specific topic) need to crawl broadly - Often lack of expertise when large crawl required - Often, crawl a lot, use only a small subset Too many redundant requests for content providers Idea: have one crawler doing very broad and intensive crawl and many parties accessing the crawled data via API - Specify filters to select required pages Crawler as a common service
  • 72. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 71/98 Collaborative Crawling Some requirements Filter language for specifying conditions Efficient filter processing (millions of filters to process) Efficient fetching (hundreds of pages per second) Support real-time requests
  • 73. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 72/98 Collaborative Crawling New component Process a stream of documents against a filter index
  • 74. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 73/98 Collaborative Crawling Filter processing architecture
  • 75. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 74/98 Collaborative Crawling Filter processing architecture
  • 76. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 75/98 Collaborative Crawling Based on ’The architecture and implementation of an extensible web crawler’ by Hsieh, Gribble, Levy, 2010 (illustrations on slides 61-62 from Hsieh’s slides) E.g., 80legs provides similar crawling services In a way, it is reconsidering pull/push model of content delivery on the Web
  • 77. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 76/98 Deep Web Crawling Visualization of http://amazon.com by aharef.info applet
  • 78. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 77/98 Deep Web Crawling In a nutshell Problem is in yellow nodes (designating web form elements)
  • 79. ● Deep Web – part of the Web not accessible through search engines ● My preferred: Deep Web - content behind web search forms on publicly available pages ● Pages with forms themselves are typically accessible/searchable (=crawled) 1 Content hidden behind HTML forms Denis Shestakov, Intelligent Web Crawling, WI-IAT'13, Atlanta, USA, 20.11.2013
  • 80. Why is it important? Large source of structured data ● Forms present a search interface over backend databases Significant gap in search engine coverage ● Potentially more content than is currently searchable ● More than 10 million distinct HTML forms ● Likely to increase as more data comes online Size of the deep Web is unclear ● 500x figures are highly disputable ● Number of resources is a bit simpler: ~450k databases on the Web in 2004 ● Some part of deep web content crawled/covered by search engines ● Content can be both searched and browsed via links categorizing content ● Business-driven sites (e.g., shopping) typically provide both ways of access 2Denis Shestakov, Intelligent Web Crawling, WI-IAT'13, Atlanta, USA, 20.11.2013
  • 81. Can’t pass through the forms (need to specify some values) I.e., content is “hidden” behind search forms ● Reason for another name for the deep Web: hidden Web To crawl/access the content behind a form, the following is required: ● Identify a search form on a page ● Fill the form with proper values ● Submit the form ● Get the result pages ● Extract links/data from them Why crawlers do not crawl the deep Web 3Denis Shestakov, Intelligent Web Crawling, WI-IAT'13, Atlanta, USA, 20.11.2013
  • 82. Approaches to deep Web crawling Google’s Deep Web Crawl (2008) ● Identify search forms ● Pre-compute all interesting form submissions to each HTML form ● Each form submission corresponds to a distinct URL ● Add URLs for each form submission into search engine index ● Allows to reuse existing search engine infrastructure ● No aim for full coverage of a deep web resource ● Not all forms (only GET forms) covered 4Denis Shestakov, Intelligent Web Crawling, WI-IAT'13, Atlanta, USA, 20.11.2013
  • 83. Deep Web site identification • Task: identify a search form leading to content-rich web pages • Surprisingly, quite challenging task • One of the problems: ● Detect if form is searchable 5Denis Shestakov, Intelligent Web Crawling, WI-IAT'13, Atlanta, USA, 20.11.2013
  • 84. Searchable forms Non-searchable: login forms, those that require user info Depends: Highly-interactive forms, e.g., airline reservations What are deep Web resources? store locations used cars radio stations patents recipes 6Denis Shestakov, Intelligent Web Crawling, WI-IAT'13, Atlanta, USA, 20.11.2013
  • 85. 7Denis Shestakov, Intelligent Web Crawling, WI-IAT'13, Atlanta, USA, 20.11.2013
  • 86. Deep Web site identification • Detect if form is informational ● Challenging for a human too: e.g., assume a form is in an unknown language • Detection by building/training binary classifiers • Forms identified as searchable can then be classified into domains (e.g., car search, apartment search, etc.) ● Based on form structure (e.g., number of fields) ● Based on form field labels • Slow process ● Done by a specific component in offline mode 8Denis Shestakov, Intelligent Web Crawling, WI-IAT'13, Atlanta, USA, 20.11.2013
  • 87. Crawling JavaScript-rich sites • Web pages became more responsive, interactive, user-friendly, etc. ● Thanks to the emergence of new web technologies such as AJAX • Besides, they led to the widespread adoption of web applications (RIAs) • Challenge for crawlers as they do not ● Manipulate the client-side state of a site ● Take into account asynchronous communication with the server 9Denis Shestakov, Intelligent Web Crawling, WI-IAT'13, Atlanta, USA, 20.11.2013
  • 88. Crawling JavaScript-rich sites • Very similar to the deep Web crawling challenge ● Content is hard to crawl ● Direct problem: AJAX/JS-enabled forms are hard to deal with (e.g., to detect and then generate meaningful queries) • Web pages designed for human beings, not for automatic programs • JS code must be executed to get the actual content ● Dynamically changing ● Lots of additional resources required (the crawler must be supplemented with a JS interpreter) 10Denis Shestakov, Intelligent Web Crawling, WI-IAT'13, Atlanta, USA, 20.11.2013
  • 89. Crawling JavaScript-rich sites • Several techniques for AJAX crawling proposed since 2007/08 ● Focus is either on indexing and searching or on testing RIAs • Approach: ● AJAX-enabled web page/application modeled using states, events, transitions ● Crawler uses breadth-first strategy: ● Triggers the events on a page ● If the DOM of a page changes then new state/transition is added to transition graph ● Back to initial state to invoke the next event 11Denis Shestakov, Intelligent Web Crawling, WI-IAT'13, Atlanta, USA, 20.11.2013
  • 90. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 89/98 Crawling Multimedia Content The web is now a multimedia platform Images, video, audio are integral parts of web pages (not just supplementing them) Almost all crawlers, however, treat the Web as a textual repository One reason: indexing techniques for multimedia have not yet reached the maturity required by interesting use cases/applications Hence, no real need to harvest multimedia But state-of-the-art multimedia retrieval/computer vision techniques already provide adequate search quality E.g., search for images with a cat and a man based on actual image content (not text around/close to the image) In case of video: a set of frames plus audio (can be converted to textual form)
  • 91. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 90/98 Crawling Multimedia Content Challenges in crawling multimedia Bigger load on web sites since files are bigger More apparent copyright issues More resources (e.g., bandwidth, storage place) required from a crawler More complicated duplicate resolving Re-visiting policy
  • 92. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 91/98 Crawling Multimedia Content Scalable Multimedia Web Observatory of ARCOMEM project (http://www.arcomem.eu) Focus on web archiving issues Uses several crawlers - ’Standard’ crawler for regular web pages - API crawler to mine social media sources (e.g., Twitter, Facebook, YouTube, etc.) - Deep Web crawler able to extract information from pre-defined web sites Data can be exported in WARC (Web ARChive) files and in RDF
  • 93. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 92/98 Future Directions Collaborative crawling, mixed pull-push model Scalable adaptive strategies Understanding site structure Deep Web crawling Semantic Web crawling Media content crawling Social network crawling
  • 94. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 93/98 References: Crawl Datasets Use for building your crawls, web graph analysis, web data mining tasks, etc. ClueWeb09 Dataset: - http://lemurproject.org/clueweb09.php/ - One billion web pages, in ten languages - 5TBs compressed - Hosted at several cloud services (free license required) or a copy can be ordered on hard disks (pay for disks) ClueWeb12: - Almost 900 million English web pages
  • 95. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 94/98 References: Crawl Datasets Use for building your crawls, web graph analysis, web data mining tasks, etc. Common Crawl Corpus: - See http://commoncrawl.org/data/accessing-the-data/ and http://aws.amazon.com/datasets/41740 - Around six billion web pages - Over 100TB uncompressed - Available as Amazon Web Services’ public dataset (pay for processing)
  • 96. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 95/98 References: Crawl Datasets Use for building your crawls, web graph analysis, web data mining tasks, etc. Internet Archive: - See http://blog.archive.org/2012/10/26/ 80-terabytes-of-archived-web-crawl-data-available-for-resea - Crawl of 2011 - 80TB WARC files - 2.7 billion pages - Includes multimedia data - Available by request
  • 97. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 96/98 References: Crawl Datasets LAW Datasets: - http://law.dsi.unimi.it/datasets.php - Variety of web graph datasets (nodes, arcs, etc.) including basic properties of recent Facebook graphs (!) - Thoroughly studied in a number of publications ICWSM 2011 Spinn3r Dataset: - http://www.icwsm.org/data/ - 130 million blog posts and 230 million social media publications - 2TB compressed Academic Web Link Database Project: - http://cybermetrics.wlv.ac.uk/database/ - Crawls of national universities’ web sites
  • 98. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 97/98 References: Literature For beginners: Udacity/CS101 course; http://www.udacity.com/overview/Course/cs101 Intermediate: Chapter 20 of Introduction to Information Retrieval book by Manning, Raghavan, Schütze; http://nlp.stanford.edu/IR-book/pdf/20crawl.pdf Intermediate: Current Challenges in Web Crawling tutorial at ICWE 2013 by Shestakov; http://www.slideshare. net/denshe/icwe13-tutorial-webcrawling Advanced: Web Crawling by Olston and Najork; http://www.nowpublishers.com/product.aspx?product= INR&doi=1500000017
  • 99. Denis Shestakov Intelligent Web Crawling WI-IAT’13, Atlanta, USA, 20.11.2013 98/98 References: Literature See relevant publications at Mendeley: http://www.mendeley.com/groups/531771/web-crawling/ Feel free to join the group! Check ’Deep Web’ group too http://www.mendeley.com/groups/601801/deep-web/