WT - Web & Working of Search Engine

Web & Working of Search Engine

Presented By:
Vinay Arora
Assistant Professor
CSED, Thapar University

Web Content

Web content (or a Web resource) is any content that is accessible on the Internet.

Invisible Web
Visible Web

Visible Web - The publicly indexable pages that have been picked up and
indexed by conventional search engines; these consist mainly of static HTML pages.

Invisible Web (also Deep Web or Hidden Web) - Information that cannot be indexed
or seen by the crawlers (spiders) of conventional search engines.

Types of Invisible Web
Truly Invisible Web
Opaque Web
Proprietary Web
Private Web

Types of Invisible Web & Reasons for Being Invisible

Truly Invisible Web: not accessible to search engines, mainly for technical
reasons, e.g. dynamically generated pages and pages in pdf, exe, or swf format.

Proprietary Web: databases that are mainly fee-based and are offered by
information providers. These databases give their users a search facility; however,
their contents are not searchable through search engines.

Private Web: technically indexable pages that have purposely been excluded from
search engines by means of password protection, robots.txt, or the noindex META tag.
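
As an illustration of the noindex META tag mentioned above, here is a minimal sketch of how a crawler might decide whether a page asks to be left out of the index. The class name `RobotsMetaParser`, the function `is_indexable`, and the two sample pages are assumptions made for this example; real crawlers also honour HTTP headers and other signals that are not covered here.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            # Directives are comma-separated, e.g. "noindex, nofollow".
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def is_indexable(html: str) -> bool:
    """Return False when the page carries a noindex directive."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" not in parser.directives


# Hypothetical sample pages.
private_page = (
    '<html><head><meta name="robots" content="noindex, nofollow"></head>'
    "<body>Members only</body></html>"
)
public_page = "<html><head><title>Home</title></head><body>Welcome</body></html>"

print(is_indexable(private_page))  # False
print(is_indexable(public_page))   # True
```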

Opaque Web: pages behind disconnected URLs, i.e. URLs that no other page links to.

The Invisible Web is estimated to be approximately 500 times larger than the Visible Web.

Crawling & Indexing

A search engine operates in the following order:

1. Web Crawling.

2. Indexing.

3. Searching.
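
The three steps above can be sketched on a toy scale. The example below is only an illustration under stated assumptions: the "Web" is a hypothetical in-memory dictionary `PAGES` rather than real HTTP traffic, and the function names `crawl`, `build_index`, and `search` are invented for this sketch. It also shows why a page that nothing links to stays invisible: the crawler never reaches it.

```python
import re
from collections import defaultdict, deque

# Hypothetical in-memory "Web": URL -> (page text, outgoing links).
PAGES = {
    "http://example.com/": ("welcome to the visible web",
                            ["http://example.com/a", "http://example.com/b"]),
    "http://example.com/a": ("search engines crawl static pages",
                             ["http://example.com/b"]),
    "http://example.com/b": ("an index maps words to pages", []),
    "http://example.com/orphan": ("no page links here", []),
}


def crawl(seed):
    """Step 1: follow links breadth-first, starting from the seed URL."""
    seen, queue, fetched = {seed}, deque([seed]), {}
    while queue:
        url = queue.popleft()
        text, links = PAGES[url]
        fetched[url] = text
        for link in links:
            if link in PAGES and link not in seen:
                seen.add(link)
                queue.append(link)
    return fetched


def build_index(fetched):
    """Step 2: build an inverted index mapping each word to the URLs containing it."""
    index = defaultdict(set)
    for url, text in fetched.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index


def search(index, query):
    """Step 3: return the URLs that contain every word of the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    return sorted(set.intersection(*(index.get(w, set()) for w in words)))


fetched = crawl("http://example.com/")
index = build_index(fetched)
print(search(index, "static pages"))  # ['http://example.com/a']
print(search(index, "links"))         # [] because the orphan page was never crawled
```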

Making Invisible Web Visible

Register the website with search engines.


Sitemap.xml - Sitemaps are an easy way for webmasters to inform search
engines about pages on their sites that are available for crawling. In its simplest
form, a Sitemap is an XML file that lists URLs for a site along with additional
metadata about each URL.
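
A minimal sketch of producing such a file with Python's standard library follows. The helper name `build_sitemap` and the example.com URLs are assumptions for illustration; the element names (`urlset`, `url`, `loc`, `lastmod`, `changefreq`, `priority`) and the namespace are those of the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Build a sitemap document from a list of dicts; only 'loc' is required."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = entry["loc"]
        # Optional metadata about each URL.
        for tag in ("lastmod", "changefreq", "priority"):
            if tag in entry:
                ET.SubElement(url, tag).text = str(entry[tag])
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Hypothetical site with two pages.
sitemap_xml = build_sitemap([
    {"loc": "http://www.example.com/", "lastmod": "2009-01-01", "changefreq": "monthly"},
    {"loc": "http://www.example.com/catalog", "priority": 0.8},
])
print(sitemap_xml)
```

The resulting text would typically be saved as sitemap.xml at the root of the site.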


Adding entries to the robots.txt file that allow robots to crawl, and changing
the META content accordingly.
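
To illustrate how such entries are read by a well-behaved robot, the sketch below parses a small robots.txt with Python's standard `urllib.robotparser` module. The file content, the user agent name `ExampleBot`, and the example.com URLs are assumptions made for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: open the catalog to all robots, keep /private/ closed.
ROBOTS_TXT = """\
User-agent: *
Allow: /catalog/
Disallow: /private/
"""

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())

catalog_ok = robots.can_fetch("ExampleBot", "http://www.example.com/catalog/item1.html")
private_ok = robots.can_fetch("ExampleBot", "http://www.example.com/private/report.html")

print(catalog_ok)  # True
print(private_ok)  # False
```

Removing the Disallow line (or replacing it with an Allow line) is the kind of change that lets crawlers reach pages that were previously excluded.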


Providing links to the desired website from other websites, so that it becomes
reachable from them and can be crawled.

[Figure: example with www.orkut.com and www.gmail.com]
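
Links matter because a crawler discovers new pages by extracting the links on pages it has already fetched. A minimal sketch of that extraction step, using Python's standard HTML parser, is given below; the class name `LinkExtractor`, the function `extract_links`, and the sample page are assumptions for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the absolute target of every <a href="..."> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the URL of the page itself.
                self.links.append(urljoin(self.base_url, href))


def extract_links(base_url, html):
    extractor = LinkExtractor(base_url)
    extractor.feed(html)
    return extractor.links


# Hypothetical page on one site that links to a page on another site.
page = '<p><a href="http://www.example.com/new">New site</a> <a href="/about">About</a></p>'
links = extract_links("http://www.example.org/index.html", page)
print(links)  # ['http://www.example.com/new', 'http://www.example.org/about']
```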

Changing the source code of Web crawlers - making crawlers efficient and
intelligent enough to accept files with extensions such as pdf and swf and to
list/index their entries properly.

The content of proprietary Web databases is not searchable through search
engines. It is assembled into Web pages as responses to queries
submitted through the “query interface” of an underlying database. Because
current search engines cannot effectively “crawl” databases, such data is
believed to be “invisible” and thus remains largely “hidden” from users.

User Form Interaction

With form-based search interfaces, a user rather than a crawler supplies the
input. The result is obtained by executing a query once the user fills in the
required form fields and presses the Submit button.
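
For a form that uses the GET method, pressing Submit amounts to requesting a URL whose query string carries the field values. The sketch below shows that encoding with Python's standard library; the function name `build_query_url`, the form address, and the field names `title` and `year` are hypothetical.

```python
from urllib.parse import urlencode


def build_query_url(action_url, fields):
    """Encode the filled-in form fields the way a browser does for a GET form."""
    return action_url + "?" + urlencode(fields)


# Hypothetical search form with two fields filled in by the user.
query_url = build_query_url(
    "http://www.example.com/search",
    {"title": "deep web", "year": "2001"},
)
print(query_url)  # http://www.example.com/search?title=deep+web&year=2001
```

The response page behind such a URL exists only after the query is executed, which is why a crawler that merely follows links does not see it.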

We want the response page to be listed in search engines,
so we have to make it visible.

Crawler Form Interaction & Steps for Hidden Web Crawler

Crawler at desired URL.

Form Analysis for Internal Form Representation.

Matching with the entries present in Task Specific Database.

Automatic FORM Processing and Submission.

Response Page from the Server.

Response Analysis of that Page.

Putting the results in the Repository.
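
The middle steps above (form analysis, matching against a task-specific database, and automatic form submission) can be sketched as follows. This is a simplified illustration, not the method of any particular crawler: it handles only GET forms, matches fields by exact name, and stops at generating the submission URLs instead of fetching and analysing the response pages. The names `FormAnalyzer`, `TASK_DB`, and `candidate_submissions`, as well as the sample form, are assumptions.

```python
from html.parser import HTMLParser
from itertools import product
from urllib.parse import urlencode, urljoin


class FormAnalyzer(HTMLParser):
    """Form analysis: build an internal representation of the first form on a page."""

    def __init__(self):
        super().__init__()
        self.action = ""
        self.method = "get"
        self.fields = []
        self._in_form = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "form" and not self._done:
            self._in_form = True
            self.action = attr.get("action") or ""
            self.method = (attr.get("method") or "get").lower()
        elif self._in_form and tag in ("input", "select") and attr.get("name"):
            # Skip buttons; keep the fields a user would fill in.
            if (attr.get("type") or "text").lower() not in ("submit", "button", "reset"):
                self.fields.append(attr["name"])

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True


# Hypothetical task-specific database: field name -> values worth trying.
TASK_DB = {"title": ["deep web", "invisible web"], "year": ["2001"]}


def candidate_submissions(page_url, html, task_db):
    """Match form fields against the database and generate one URL per value combination."""
    analyzer = FormAnalyzer()
    analyzer.feed(html)
    matched = [name for name in analyzer.fields if name in task_db]
    if not matched:
        return []
    target = urljoin(page_url, analyzer.action)
    return [
        target + "?" + urlencode(dict(zip(matched, values)))
        for values in product(*(task_db[name] for name in matched))
    ]


FORM_HTML = (
    '<form action="/search" method="get">'
    '<input type="text" name="title"><select name="year"></select>'
    '<input type="text" name="isbn"><input type="submit" name="go" value="Search">'
    "</form>"
)
urls = candidate_submissions("http://www.example.com/books/find.html", FORM_HTML, TASK_DB)
for url in urls:
    print(url)
```

Each generated URL would then be requested, and the response page analysed and stored in the repository, which completes the remaining steps.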

