2. What is a Search Engine?
“A web search engine is a software system that
is designed to search for information on the
World Wide Web.”
3. Purpose of Search Engines
Helping people find what they’re looking for:
• Start with an “information need”
• Convert it to a query
• Get results
4. Types of Search Engines
• Search by Keywords
(e.g. AltaVista, Google)
• Search by categories
(e.g. Yahoo)
5. The Parts of a Search Engine
Spider (or “crawler”)
Index
Search software (an algorithm)
6. The “spider” or “crawler”
The spider visits a web page, reads it, and
then follows links to other pages within the
site. This is what it means when someone
refers to a site being "spidered" or
"crawled". This is also known as
“harvesting”. The spider returns to the site
on a regular basis, such as every month or
two, to look for changes.
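The visit, read, follow-links loop described above can be sketched in a few lines of Python. This is only an illustration, not how any particular search engine works: the `crawl` function, the `LinkParser` class, and the in-memory `SITE` dictionary are hypothetical names, and pages are looked up in a dictionary instead of being fetched over the network.

```python
from collections import deque
from html.parser import HTMLParser


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start, fetch, max_pages=100):
    """Visit pages breadth-first from `start`; return {url: html}."""
    seen = {start}
    queue = deque([start])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # page missing or unreachable
            continue
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:  # never queue the same page twice
                seen.add(link)
                queue.append(link)
    return pages


# Hypothetical three-page site held in memory, so no network is needed.
SITE = {
    "/": '<a href="/about">About</a> <a href="/news">News</a>',
    "/about": '<a href="/">Home</a>',
    "/news": '<a href="/about">About</a> <a href="/missing">Gone</a>',
}

crawled = crawl("/", SITE.get)
print(sorted(crawled))
```

The `seen` set is what keeps the spider from looping forever when pages link back to each other. A real spider would also re-run the crawl on a schedule to pick up changes.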
7. The Indexer
Everything the spider finds goes
into the second part of a search
engine, the index. The index,
sometimes called the catalog, is like
a giant book containing a copy of
every web page that the spider
finds. If a web page changes, then
this book is updated with the new
information.
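One common way to organise such a catalog is an inverted index, which maps each word to the pages that contain it. The slide does not name a specific structure, so treat this as an assumption. In the sketch below, `build_index`, `update_page`, and the sample pages are hypothetical names chosen for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each word to the set of page URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def update_page(index, pages, url, new_text):
    """Re-index one page after the spider reports a change."""
    # Remove the page from the entries of its old words first.
    for word in tokenize(pages.get(url, "")):
        index[word].discard(url)
    pages[url] = new_text
    for word in tokenize(new_text):
        index[word].add(url)


pages = {
    "/a": "Search engines crawl the web",
    "/b": "The web is large",
}
index = build_index(pages)
update_page(index, pages, "/b", "The index is large")
print(sorted(index["web"]))
```

Keeping the stored copy of each page in `pages` is what makes the update possible: the old words have to be known before they can be removed from the index.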
8. Search engine software
It is the third part of a search
engine. This is the program that
sifts through the millions of pages
recorded in the index to find
matches to a search and rank them
in order of what it believes is most
relevant.
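As a rough illustration of the match-and-rank step, the sketch below scores each page by how many times the query words occur in it and returns the best matches first. The scoring rule is a deliberately simple assumption and not the method of any real search engine; `search` and the sample pages are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def search(query, pages):
    """Return URLs that match at least one query word, best match first."""
    query_words = tokenize(query)
    scores = {}
    for url, text in pages.items():
        counts = Counter(tokenize(text))
        score = sum(counts[word] for word in query_words)
        if score > 0:  # pages with no matching word are left out
            scores[url] = score
    # Sort by descending score; break ties by URL for a stable order.
    return sorted(scores, key=lambda url: (-scores[url], url))


pages = {
    "/cats": "cats cats and more cats",
    "/pets": "cats and dogs",
    "/cars": "fast cars",
}
results = search("cats dogs", pages)
print(results)
```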
9. TF-IDF Ranking Algorithm
Variations of the tf–idf weighting
scheme are often used by search
engines as a central tool in scoring and
ranking a document's relevance given a
user query.
Term Frequency–Inverse Document
Frequency is a numerical statistic that is
intended to reflect how important a
word is to a document in a collection.
wij = tfij × log(N / n)
wij = weight of Term Tj in Document Di
tfij = frequency of Term Tj in Document Di
N = number of Documents in collection
n = number of Documents where term Tj occurs at least once
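A minimal sketch of the weighting, assuming the common form w = tf × log(N / n) with a natural logarithm and simple whitespace tokenization. Search engines use many variations, so this is only one of them. The function name `tf_idf` and the three sample documents are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document, using w = tf * log(N / n)."""
    tokenized = [doc.lower().split() for doc in docs]
    N = len(tokenized)
    # n[term] = number of documents where the term occurs at least once
    n = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw frequency of each term in this document
        weights.append({t: tf[t] * math.log(N / n[t]) for t in tf})
    return weights


docs = ["web search engine", "web crawler", "search index index"]
weights = tf_idf(docs)
for doc, w in zip(docs, weights):
    print(doc, w)
```

A term that appears in every document gets weight 0, because log(N / N) = 0. This is how the scheme discounts words that do not help tell documents apart.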
10. PageRank
• The equation:
PR(A) = (1-d) + d(PR(t1)/C(t1) + … + PR(tn)/C(tn))
• Used by WebQuery and Google
• Google simulates a user randomly following links to
rank documents.
• Google uses the citation graph (518 million links)
• Google computes PageRank for 26 million pages in a few hours.
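The equation can be evaluated by repeated substitution. Here t1 … tn are taken to be the pages that link to A, C(t) the number of links going out of page t, and d a damping factor. The sketch below assumes d = 0.85, a fixed number of iterations, and a small hypothetical graph in which every page has at least one outgoing link. It illustrates the formula only and is not Google's implementation.

```python
def pagerank(links, d=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    pr = {page: 1.0 for page in pages}  # initial guess for every page
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Sum PR(t) / C(t) over every page t that links to `page`.
            incoming = sum(
                pr[t] / len(links[t]) for t in pages if page in links[t]
            )
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr


# Hypothetical four-page graph; every page has at least one outgoing link.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Page D has no incoming links, so its rank settles at 1 - d. Page C, which receives links from three pages, ends up with the highest rank.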
11. PageRank works by counting
the number and quality of
links to a page to determine a
rough estimate of how
important the website is. The
underlying assumption is that
more important websites are
likely to receive more links
from other websites.