1. THE ANATOMY OF A LARGE-SCALE HYPERTEXTUAL
WEB SEARCH ENGINE
Presented by Asim, University of Peshawar
Authors: Sergey Brin, Lawrence Page
7. ABSTRACT
Google search engine as a prototype
Anatomy of a large-scale search engine
Web users: tens of millions of queries per day
Academic research
Building a large-scale search engine
Heavy use of hypertextual information
(anchor text, hyperlinks)
8. INTRODUCTION
The Web as a dynamic entity
Irrelevant search results
Human-maintained indices, tables of contents
Too many low-quality results
Addressing many user problems (PageRank)
9. CONT…
Google: scaling with the Web
Google’s fast crawling technology
Storage space availability
Indexing system processing hundreds of gigabytes
of data
Minimized query response time
10. DESIGN GOALS
Improved search quality.
Indexing alone does not guarantee relevant search results.
Keeping the percentage of junk results as low as possible.
Users show interest only in the top-ranked results.
The notion is to provide relevant results.
Google makes use of link structure & anchor text.
11. CONT…
Academic search engine research.
User accessibility & availability of the desired
results.
Supports novel research.
All problem-solving solutions given in a single
place.
12. SYSTEM FEATURES
The Google search engine has two important features:
Link structure of the web (PageRank).
Utilization of link (anchor) text to improve search
results.
<A href="http://www.yahoo.com/">Yahoo!</A>
The text of a hyperlink (anchor text) is associated
not only with the page that the link is on,
but also with the page the link points to.
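The idea above can be sketched in a few lines: words inside an anchor tag are credited to the *target* page, not just the page containing the link. The regex-based parser and the example page are illustrative assumptions, not Google's actual parser.

```python
# Toy sketch: collect anchor text and associate it with the target page.
import re

ANCHOR_RE = re.compile(r'<a\s+href="([^"]+)"[^>]*>(.*?)</a>', re.IGNORECASE)

def collect_anchor_text(pages):
    """pages: dict of url -> html. Returns target url -> list of anchor texts."""
    anchors = {}
    for html in pages.values():
        for target, text in ANCHOR_RE.findall(html):
            # Credit the anchor text to the page being pointed to.
            anchors.setdefault(target, []).append(text)
    return anchors

pages = {
    "http://example.com/": '<A href="http://www.yahoo.com/">Yahoo!</A>',
}
# The word "Yahoo!" is now searchable on www.yahoo.com's behalf,
# even though it appears on example.com.
```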
13. PAGE RANK
PageRank: bringing order to the web
The idea of academic citation counting is applied to
calculate PageRank
PR(A) = (1-d) + d(PR(t1)/C(t1) + ... + PR(tn)/C(tn))
In the equation 't1 - tn' are pages linking to page A,
'C' is the number of outbound links that a page has
and 'd' is a damping factor, usually set to 0.85.
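The formula can be computed iteratively: start every page at an initial value and re-apply the equation until the ranks settle. The three-page link graph below is an illustrative assumption, not data from the paper.

```python
# Iterative PageRank sketch matching the slide's formula:
# PR(A) = (1-d) + d*(PR(t1)/C(t1) + ... + PR(tn)/C(tn))

def pagerank(links, d=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    pr = {p: 1.0 for p in pages}  # initial guess
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Sum contributions from every page t that links to `page`,
            # each divided by t's outbound link count C(t).
            incoming = sum(
                pr[t] / len(targets)
                for t, targets in links.items()
                if page in targets
            )
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
# "C" ends up with the highest rank: it is pointed to by both A and B.
```

With this non-normalized form of the formula, the ranks converge so that their sum equals the number of pages.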
14. PAGE RANK (INTUITIVE JUSTIFICATION)
Many pages pointing to a single page raise its rank
A page with high PageRank passes more weight to the
pages it points to
Broken or low-quality links are rarely found on
high-PageRank sites
The text of a link describes the target page; Google
utilizes this information
This gives more accurate results for resources with
little text of their own, such as images, graphs, and
databases
17. SYSTEM ANATOMY
URL Server:
provides lists of URLs to the crawlers for fetching
pages from the web
Distributed crawlers (downloading web pages)
Store Server:
compression and storage in the repository
docIDs are used to distinguish web pages
Indexer:
indexing, sorting, decompressing, parsing
Hits:
record word occurrences, position, and text format information
in documents
Hits are organized into barrels, which creates a partially
sorted forward index
18. FORWARD INDEX
Document     Words
Document 1   the, cow, says, moo
Document 2   the, cat, and, the, hat
Document 3   the, dish, ran, away, with, the, spoon
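A forward index is simply a map from each document to the list of words it contains, as in the table above. A minimal sketch (real systems store compact wordIDs rather than strings):

```python
# Build a forward index: document -> list of words, mirroring the table above.

def build_forward_index(docs):
    return {doc_id: text.split(",") for doc_id, text in docs.items()}

docs = {
    1: "the,cow,says,moo",
    2: "the,cat,and,the,hat",
    3: "the,dish,ran,away,with,the,spoon",
}
forward = build_forward_index(docs)
# forward[1] == ["the", "cow", "says", "moo"]
```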
19. INVERTED INDEX
T0 = "it is what it is"
T1 = "what is it"
T2 = "it is a banana"
A term search for the terms "what", "is" and "it" would give documents 0 and 1.
If we run a phrase search for "what is it" we get hits for all the words in both document 0
and 1, but the terms occur consecutively only in document 1.
Word     Postings (document, position)
a        {(2, 2)}
banana   {(2, 3)}
is       {(0, 1), (0, 4), (1, 1), (2, 1)}
it       {(0, 0), (0, 3), (1, 2), (2, 0)}
what     {(0, 2), (1, 0)}
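The T0–T2 example can be worked end to end: with positions in the postings, a phrase search reduces to checking for consecutive positions. A minimal sketch:

```python
# Positional inverted index: word -> set of (document, position) pairs.

def build_inverted_index(texts):
    index = {}
    for doc_id, text in enumerate(texts):
        for pos, word in enumerate(text.lower().split()):
            index.setdefault(word, set()).add((doc_id, pos))
    return index

def phrase_search(index, phrase):
    words = phrase.lower().split()
    if not all(w in index for w in words):
        return set()
    # A document matches if the first word occurs at some position p
    # and each following word occurs at p+1, p+2, ...
    hits = set()
    for doc, pos in index[words[0]]:
        if all((doc, pos + i) in index[w] for i, w in enumerate(words)):
            hits.add(doc)
    return hits

texts = ["it is what it is", "what is it", "it is a banana"]
index = build_inverted_index(texts)
# phrase_search(index, "what is it") == {1}: only T1 has the words
# consecutively, even though T0 contains all three of them.
```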
20. CONT…
Indexer:
Produces anchor files during parsing, containing link
information (in- and out-links)
URL resolver:
Reads anchor files, converts relative URLs to absolute
URLs and in turn into docIDs
Puts anchor text into the forward index
Builds a database of links, necessary to compute PageRank
Sorter:
Takes the barrels, which are sorted by docID, and
resorts them by wordID to generate the inverted index.
It produces a list of wordIDs and offsets into the inverted
index.
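The sorter's core operation can be sketched directly: a docID-keyed forward barrel is re-grouped by wordID, yielding the inverted index. The barrel contents below are illustrative assumptions.

```python
# Invert one forward barrel: docID -> [(wordID, hits)] becomes
# wordID -> [(docID, hits)], ordered by wordID.

def invert_barrel(forward_barrel):
    inverted = {}
    for doc_id in sorted(forward_barrel):
        for word_id, hits in forward_barrel[doc_id]:
            inverted.setdefault(word_id, []).append((doc_id, hits))
    return dict(sorted(inverted.items()))

forward_barrel = {
    1: [(10, [0]), (12, [1])],  # doc 1 contains words 10 and 12
    2: [(10, [3])],             # doc 2 contains word 10
}
# invert_barrel(forward_barrel) ==
#   {10: [(1, [0]), (2, [3])], 12: [(1, [1])]}
```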
21. CONT…
DumpLexicon
A program DumpLexicon takes this list together with the
lexicon produced by the indexer and generates a new
lexicon to be used by the searcher.
Searcher:
The searcher is run by a web server and uses the
lexicon built by DumpLexicon together with the inverted
index and the PageRanks to answer queries.
22. CONT…
Major data structures
Data is stored in BigFiles, which are virtual files
spanning multiple file systems and supporting compression.
Half of the total storage is used by the raw HTML
repository, which holds the compressed HTML of every page
plus a small header.
The document index keeps information about each document.
It is an ISAM (Indexed Sequential Access Mode) index,
ordered by docID.
Each entry includes the current document status, a pointer
into the repository, a document checksum, and URL and title
information.
The lexicon is kept in memory as a hash table with varying
amounts of data attached to each word.
23. CONT…
Hit list encoding
Uses a compact, hand-optimized encoding
It requires less space and less bit manipulation.
It uses two bytes for every hit.
To save space, the length of a hit list is combined with
the wordID in the forward index and the docID in the
inverted index.
The forward index is stored in a number of barrels (64).
Each barrel holds a range of wordIDs.
If a document contains words that fall into a particular
barrel, the docID is recorded into the barrel, followed by
a list of wordIDs with hit lists corresponding to those
words.
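The paper's two-byte plain hit packs a capitalization bit, 3 bits of font size, and 12 bits of word position into 16 bits. A sketch of that packing (the exact bit order here is our assumption; the paper does not fix it):

```python
# Pack a plain hit into 16 bits: 1 capitalization bit, 3 font-size bits,
# 12 position bits (positions above 4095 would need an overflow scheme).

def pack_hit(capitalized, font_size, position):
    assert 0 <= font_size < 8 and 0 <= position < 4096
    return (int(capitalized) << 15) | (font_size << 12) | position

def unpack_hit(hit):
    return bool(hit >> 15), (hit >> 12) & 0x7, hit & 0xFFF

hit = pack_hit(True, 3, 42)
# hit fits in two bytes (hit < 2**16), and unpack_hit(hit) == (True, 3, 42)
```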
24. CONT…
The inverted index consists of the same barrels as
the forward index, after they have been processed by
the sorter.
For every valid wordID, the lexicon contains a pointer
into the barrel that the wordID falls into.
The pointer points to a doclist: a list of docIDs together
with their corresponding hit lists.
25. CRAWLING
Web crawling (downloading pages)
Crawlers (3 to 4 run in parallel)
Each crawler keeps roughly 300 connections open
at once
Social issues
Efficiency
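Keeping hundreds of fetches in flight is what makes the crawl fast. A minimal sketch of that idea using a thread pool; `fetch()` is a stub, and a real crawler would use non-blocking sockets or an async HTTP client rather than one thread per connection.

```python
# Sketch: many downloads in flight at once, as the crawlers do with
# ~300 open connections each.
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Stub standing in for an HTTP download.
    return f"<html>contents of {url}</html>"

def crawl(urls, max_connections=300):
    with ThreadPoolExecutor(max_workers=max_connections) as pool:
        return dict(zip(urls, pool.map(fetch, urls)))

pages = crawl(["http://a.example/", "http://b.example/"])
```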
27. DESCRIPTION OF THE PICTORIAL COMPONENTS
Component     Description
Crawlers      Several distributed crawlers parse the pages and extract links
              and keywords.
URL Server    Provides the crawlers with a list of URLs to scan. The crawlers
              send collected data to the store server.
Store Server  Compresses the pages and places them in the repository. Each
              page is stored with an identifier, a docID.
Repository    Contains a copy of the pages and images, allowing comparisons
              and caching.
Indexer       Decompresses documents and converts them into sets of words
              called "hits". It distributes hits among a set of "barrels",
              giving a partially sorted index. It also creates a list of the
              URLs on each page. A hit contains the following information:
              the word, its position in the document, font size, and
              capitalization.
Barrels       Databases that classify documents by docID. They are created
              by the indexer and used by the sorter.
Anchors       The bank of anchors created by the indexer contains internal
              links and the text associated with each link.
28. CONT…
Component     Description
URL Resolver  Takes the contents of anchors, converts relative URLs into
              absolute addresses, and finds or creates a docID. It builds
              an index of documents and a database of links.
Doc Index     Contains the text relative to each URL.
Links         The database of links associates each one with a docID (and
              so with a real document on the Web).
PageRank      The software uses the database of links to compute the
              PageRank of each page.
Sorter        Interacts with the barrels. It takes documents classified by
              docID and creates an inverted list sorted by wordID.
Lexicon       A program called DumpLexicon takes the list provided by the
              sorter (classified by wordID), combines it with the lexicon
              created by the indexer (the sets of keywords on each page),
              and produces a new lexicon for the searcher.
Searcher      Runs on a web server, uses the lexicon built by DumpLexicon
              in combination with the index classified by wordID, takes
              PageRank into account, and produces a results page.
29. RESULTS, PROBLEMS & CONCLUSION
The most important issue is the quality of search results
Google’s performance is better compared with other
commercial engines
Need for relevant and exact query results
Up-to-date information processing
Performing search queries
Crawling technologies
Google employs a number of techniques to improve
search quality, including PageRank, anchor text, and
proximity information.
“The ultimate search engine would understand exactly
what you mean and give back exactly what you want.”
— Larry Page
30. “The absolute search engine’s query generation would be
based on information, not on repository records; query results
will be real-time, and it will change the whole Internet and
web architecture.” — Asim