Topics and Speakers
1. Introduction and Overview (Sayon Roy)
2. Indexing Techniques – Transition from Manual to Automated System (Kaustav Saha)
3. Usage in Modern Day Search Engines (Vikas Bhushan)
4. Current Trends and Applications (Debashis Naskar)
5. Conclusion (Sumanta Bag)
Indexing…. an Overview
• Indexing is a crucial part of any information retrieval system. It is a challenging task
that requires attention to many theoretical and practical issues. While the move
towards digital information systems and automated indexing is thought to have
reduced the need for indexers in some areas, professional indexers are still much
needed and, as a matter of fact, the electronic environment has posed new challenges
for indexers.
• Indexing is more a process of extraction than of content analysis.
• The terms in an index represent certain concepts.
Subject Indexing and Subject Retrieval
• Subject indexing can be described as a system of classifying without notation. It
is the core theme of information science.
• Today subject retrieval is facilitated through the use of structured databases.
• The items that are retrieved are listed in the index.
• In OPACs, indexing is done manually to determine what a
resource is about. Once identified, the aboutness is translated
into the language of the vocabulary.
Schematic Illustration…
Conception of Subject Analysis and Indexing | Type of Subject Information | Indexing Method
Simplistic Conception | Explicit Information | Extraction
Content-Oriented Conception | Implicit Information | Assignment
Requirement-Oriented Conception | |
Early use of computers for Information Retrieval
• In 1948 a “machine called the Univac” capable of searching for text references
associated with a subject code was created.
• The machine could process “at the rate of 120 words per minute”. It appears that
this is the first reference to a computer being used to search for content.
• The impact of computers in IR is highlighted when Hollywood drew public
attention to the innovation with the comedy “Desk Set”, which came out in 1957.
It centred on a group of reference librarians who were about to be replaced by a
computer.
• IR as a research discipline was starting to emerge at this time with two important
developments: how to index documents and how to retrieve them.
Indexing and Information Retrieval… A Chronology
•Mortimer Taube’s Uniterm system was essentially a proposal to
index items by a list of keywords. As simple an idea as this seems today, it
was a radical step at the time.
Ranked retrieval
•The ranked retrieval approach to search was taken up by IR researchers,
who over the following decades refined and revised the means by which documents
were sorted in relation to a query.
•The superior effectiveness of this approach over Boolean search was demonstrated
in many experiments over those years.
• Work in the 1950s established computers as the definitive tool for search.
1960s …
•The 1960s witnessed the formalization of algorithms to rank documents relative to a
query.
•Relevance feedback was introduced as a process to support iterative search, where
documents previously retrieved could be marked as relevant in an IR system.
•Versions of this process are used in modern search engines, such as the “Related articles”
link on Google Scholar.
1970s…
•One of the key developments of this period was that Luhn’s term frequency (tf) weights
(based on the occurrence of words within a document) were complemented by
Spärck Jones’s work on word occurrence across the documents of a collection, which
introduced the idea of inverse document frequency (idf).
•An alternative means of modelling IR systems involved extending Maron, Kuhns and
Ray’s idea of using probability theory.
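The combination of tf and idf described above can be sketched in a few lines of Python. This is a minimal illustration, assuming raw term counts for tf and idf = log(N / df); the function name `tf_idf` and the three sample documents are hypothetical and not taken from the slides.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document, with weight = tf * log(N / df)."""
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # term frequency within this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Hypothetical three-document collection.
docs = [
    "indexing supports retrieval",
    "retrieval of web pages",
    "indexing of indexing terms",
]
weights = tf_idf(docs)
```

A term that is frequent in one document but rare across the collection receives the highest weight, which is the intuition behind combining the two measures.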
1980s – mid 1990s
•Building on the developments of the 1970s, variations of tf idf weighting schemes were
produced and the formal models of retrieval were extended.
•The original probabilistic model did not include tf weights and a number of researchers
worked to incorporate them in an effective and principled way.
•Amongst other achievements, this work ultimately led to the ranking function BM25,
which has proven to be highly effective and is still commonly used.
•Advances on the basic vector space model were also developed and probably the most
well-known is Latent Semantic Indexing (LSI).
Mid 1990s – present
•The arrival of the web initiated the study of new problems in IR.
•Search engine developers quickly realised that they could use the links between
web pages to construct a crawler or robot to traverse and gather most web pages on
the internet
• The first full text search engine using a crawler was WebCrawler released in 1994.
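The idea of following links to gather pages can be sketched as a breadth-first traversal. The sketch below works on a small in-memory link graph instead of the live web; the graph, the page names and the function `crawl` are assumptions made for illustration only.

```python
from collections import deque

def crawl(link_graph, seed):
    """Breadth-first traversal of a link graph; returns pages in visit order."""
    visited = [seed]
    seen = {seed}
    queue = deque([seed])
    while queue:
        page = queue.popleft()
        # Follow every outgoing link that has not been seen yet.
        for target in link_graph.get(page, []):
            if target not in seen:
                seen.add(target)
                visited.append(target)
                queue.append(target)
    return visited

# Hypothetical web of four pages: each key links to the pages in its list.
link_graph = {
    "URL1": ["URL2", "URL3"],
    "URL2": ["URL3"],
    "URL3": ["URL1", "URL4"],
    "URL4": [],
}
order = crawl(link_graph, "URL1")
```

A real crawler would fetch each page over the network and extract its links, but the queue-and-seen-set structure stays the same.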
-Kaustav Saha
Indexing Techniques – Transition from Manual to Automated System
What is an index?
•A Database where information (after being collected, parsed and
processed) is stored to allow for quick retrieval.
•Association of descriptors (keywords, concepts, metadata) to documents in
view of future retrieval
•The knowledge / expectation / behavior of the searcher needs to be
anticipated
Example of Indexing using POPSI
A report on the treatment of infectious disease of lungs in India during 1982-85
Discipline: Medical Science
Entity: Lung
Property: Infectious disease
Action: Treatment
Space modifier: India
Time modifier: 1982-85
Form modifier: Report
Subject heading
MEDICAL SCIENCE, LUNG
infectious disease, treatment, India, 1982-85
INFECTIOUS DISEASE, TREATMENT
medical science, lung, India, 1982-85
Cross Reference
Therapeutics see Treatment
Therapy see Treatment
Manual and Automatic Indexing
•Manual
•Human indexers assign index terms to documents
•A computer system may be used to record the descriptors generated by
the human
•Automatic
•The system extracts “typical”/ “significant” terms
•The human may contribute by setting the parameters or thresholds, or
by choosing components or algorithms
•Semi-automatic
•The system’s contribution may be supported in terms of word lists,
thesauri, reference system, etc, following or not the automatic
processing of the text
Manual vs. Automatic Indexing
•Manual
•Slow and expensive
•Is based on intellectual judgment and semantic interpretation (concepts, themes)
•Low consistency
•Automatic
•Fast and inexpensive
•Mechanical execution of algorithms, with no intelligent interpretation (aboutness /
relevance)
•Consistent
Vocabulary
•Vocabulary (indexing language)
•The set of concepts (terms or phrases) that can be used to index
documents in a collection
•Controlled
•Specific for specialized domains
•Potential for increased consistency of indexing and precision of
retrieval
•Un-controlled (free)
•Potentially all the terms in the documents
•Potential for increased recall
Thesauri
•Capture relationships between indexing terms
•Hierarchical
•Synonymous
•Related
•Creation of thesauri
•Manual vs. automatic
•Use of thesauri
•In manual / semi-automatic / automatic fashion
•Syntagmatic co-ordination / thesaurus-based query expansion during
indexing / searching
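Thesaurus-based query expansion can be illustrated with a small lookup structure. This is a sketch under stated assumptions: the "see" references reuse the Therapeutics/Therapy example from the POPSI slide, while the narrower terms, the dictionary layout and the function `expand_query` are hypothetical.

```python
# Hypothetical thesaurus: "use" maps non-preferred terms to the preferred term,
# "narrower" lists more specific terms under a preferred term.
THESAURUS = {
    "use": {"therapeutics": "treatment", "therapy": "treatment"},
    "narrower": {"treatment": ["chemotherapy", "surgery"]},
}

def expand_query(terms, thesaurus=THESAURUS, include_narrower=True):
    """Map each query term to its preferred form and optionally add narrower terms."""
    expanded = []
    for term in terms:
        preferred = thesaurus["use"].get(term.lower(), term.lower())
        candidates = [preferred]
        if include_narrower:
            candidates += thesaurus["narrower"].get(preferred, [])
        for candidate in candidates:
            if candidate not in expanded:  # keep order, drop duplicates
                expanded.append(candidate)
    return expanded

expanded = expand_query(["Therapy", "lung"])
```

Mapping synonyms to one preferred term tends to improve consistency, while adding narrower terms trades some precision for recall.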
Steps of automatic indexing
[Diagram labels: collection/document structure; text representation (lexical analysis, stemming, stop word removal); data structure]
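The text-processing steps named in the diagram (lexical analysis, stop word removal, stemming) can be sketched as a small pipeline. The stop list and the suffix-stripping rule below are deliberately naive assumptions for illustration; they are not the Porter stemmer or any other standard algorithm.

```python
import re

# Small illustrative stop list; real systems use much longer lists.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "and", "is", "are", "to"}
# Naive suffix stripping (an assumption for this sketch, not the Porter algorithm).
SUFFIXES = ("ing", "es", "s")

def tokenize(text):
    """Lexical analysis: lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def index_terms(text):
    return [stem(t) for t in remove_stop_words(tokenize(text))]

terms = index_terms("A report on the treatment of lung infections in India")
```

The resulting term list is what would then be stored in the index data structure.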
Role of Indexing in Information Retrieval
[Diagram labels: population of documents; selected documents; indexing; document description; document store; system vocabulary; database in printed or electronic form; search strategy; information needs; population of database users]
Usage in Modern Day Search Engines
- Vikas Bhushan
Search Engines
Use of search engines
Types of Search Engines
Software Components in Search Engines
Pictorial representation of Components
How Search Engines Works with a Model
Post-coordinate Indexes
Search engines : An initiative towards correct
retrieval from a Labyrinth of Ideas
• Search engines do not search only for keywords; some
search for other things as well.
• They are not really “engines” in the classical sense,
but then a mouse is not a “mouse” either.
Rather, a search engine is a computer program that searches for
particular keywords and returns a list of the documents in which
they were found, especially a service that scans documents on
the Internet.
Types of Search Engines
Crawler Based – Google, AltaVista
Human Based – Yahoo directory, Open directory, LookSmart
Hybrid Models – Yahoo, Google
Meta Search Engines – Dogpile, MetaCrawler
Use of search engines
… among others
As WebCrawler founder Brian Pinkerton puts it, "Imagine walking up to a librarian and
saying, 'travel'. They're going to look at you with a blank face."
Components in the Back-end & Front-end process
Software components
Back-end: Crawler/Spider, Indexer, Index File Database
Front-end: Search Engine Interface, Query Parser, Ranking Mechanism
Ranking mechanisms: Google uses PageRank, Teoma uses ExpertRank, Yahoo uses TrustRank
Pictorial representation of Front-end & Back-end Process
[Figure labels: Search Engine Database; Your Browser]
How Search Engines Work
(Sherman 2003)
[Figure labels: The Web (URL1, URL2, URL3, URL4); Crawler; Indexer; Search Engine Database. The query "Eggs?" returns ranked matches: Eggs 90%, Eggo 81%, Ego 40%, Huh? 10%. The top result is "All About Eggs" by S. I. Am.]
Post-coordinate Indexes
An information retrieval system that allows the searcher to combine terms in
any way is frequently referred to as post-coordinate.
A modern computer-based system, operated online, can be considered a
direct descendant of the earlier manual systems.
The files of an online system comprise two major elements:
1. A complete set of document representations: bibliographic references or
similar, comparable to a search engine database.
2. A list of terms, sometimes referred to as an inverted file or a postings file.
Continued…
The subject matter discussed in a document, and represented by the index terms
assigned to it, is multidimensional in character.
Consider, for example, an article discussing
“Political Contenders in Assembly Polls of Karnataka”.
It might be indexed under the following terms:
• Political Contenders
• Constituencies
• Assembly Polls
• Karnataka
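The two files of a post-coordinate system, document representations and an inverted (postings) file, can be sketched as follows. Document 1 uses the index terms from the example above; documents 2 and 3 and the function names are hypothetical additions so that the Boolean combinations have something to distinguish.

```python
from collections import defaultdict

def build_inverted_file(records):
    """Map each index term to the sorted list of document ids it was assigned to."""
    postings = defaultdict(set)
    for doc_id, terms in records.items():
        for term in terms:
            postings[term.lower()].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

def search_and(inverted, *terms):
    """Post-coordinate AND: documents indexed under every given term."""
    sets = [set(inverted.get(t.lower(), [])) for t in terms]
    return sorted(set.intersection(*sets)) if sets else []

def search_or(inverted, *terms):
    """Post-coordinate OR: documents indexed under at least one given term."""
    sets = [set(inverted.get(t.lower(), [])) for t in terms]
    return sorted(set.union(*sets)) if sets else []

# Document representations: id -> assigned index terms (2 and 3 are hypothetical).
records = {
    1: ["Political Contenders", "Constituencies", "Assembly Polls", "Karnataka"],
    2: ["Assembly Polls", "Kerala"],
    3: ["Political Contenders", "Karnataka"],
}
inverted = build_inverted_file(records)
hits = search_and(inverted, "Political Contenders", "Karnataka")
```

Because terms are combined only at search time, the searcher is free to coordinate them in any way, which is what distinguishes a post-coordinate system from a pre-coordinate one.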
Post-coordinate Indexes…
The index terms mentioned previously actually represent a network of relationships.
[Figure: network linking Political Contenders, Constituencies, Assembly Polls and Karnataka]
Continued…
Information Retrieval System Represented as a Matrix
[Figure: matrix with rows A to H and columns 1 to 14; an X marks each row/column association. The positions of the marks did not survive extraction.]
-Debashis Naskar
Current Trends and Applications
Current trends and applications
• The web creates new challenges for information retrieval. The amount of information on the web is
growing rapidly, as is the number of new users inexperienced in the art of web research.
• Automated search engines that rely on keyword matching usually return too many low-quality matches.
• A large-scale search engine makes heavy use of the additional structure present in hypertext to
provide much higher quality search results.
What is XML Indexing?
• XML indexing is a form of embedded indexing in which
tags are inserted into an XML document to mark the
occurrences of indexable terms or topics.
• The client's publishing process automatically generates
an index from these index elements. Because
this automated process handles all layout and
formatting, the indexer does not need to be concerned
with those issues.
What makes it work?
Index entries in DocBook are encoded using the <indexterm> parent element and its five
child elements. These are summarized below:
• <indexterm> element: wrapper element for an index entry of any type.
• <primary> element: main entry.
• <secondary> element: subentry.
• <tertiary> element: sub-subentry.
• <see> element: ‘see’ references.
• <seealso> element: ‘see also’ references.
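As an illustration of how these elements nest, the following sketch builds two embedded index entries with Python's standard `xml.etree.ElementTree` module and then reads them back, roughly the way an index generator might. The helper `make_indexterm`, the paragraph text and the entries themselves are hypothetical; only the element names come from the list above.

```python
import xml.etree.ElementTree as ET

def make_indexterm(primary, secondary=None, see=None):
    """Build an <indexterm> element with an optional subentry and 'see' reference."""
    entry = ET.Element("indexterm")
    ET.SubElement(entry, "primary").text = primary
    if secondary is not None:
        ET.SubElement(entry, "secondary").text = secondary
    if see is not None:
        ET.SubElement(entry, "see").text = see
    return entry

# Hypothetical paragraph with two embedded index entries.
para = ET.Element("para")
para.text = "Treatment of lung infections"
para.append(make_indexterm("lung", secondary="infections"))
para.append(make_indexterm("therapy", see="treatment"))

xml_text = ET.tostring(para, encoding="unicode")

# Collect (primary, secondary) pairs from the serialized markup.
entries = [
    (term.findtext("primary"), term.findtext("secondary"))
    for term in ET.fromstring(xml_text).iter("indexterm")
]
```

Because the entries travel with the text, the index can be regenerated automatically whenever the document is repaginated or republished.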
Future hopes for Indexers
• Indexers should offer XML-based services, which are a prerequisite for joining
the digital publishing revolution.
• Indexers are good with structures, and the use of XML indexing in publishing is
about imposing structure on text.
Results and Performance
• The most important measure of a search
engine is the quality of its search results.
• Here we highlight the performance of and
experience with Google. It produces
better results than the major commercial
search engines for most searches.
Date     Google  Bing  Yahoo!  Baidu  Babylon  Others
2012-04  91.7    3.5   3.36    0.26   0        1.18
2012-05  92.04   3.36  3.26    0.22   0        1.12
2012-06  91.75   3.27  3.04    0.23   0.29     1.42
2012-07  91.17   3.22  2.95    0.45   0.54     1.67
2012-08  91.01   3.22  2.98    0.5    0.6      1.7
2012-09  91.04   3.16  2.91    0.49   0.6      1.8
2012-10  90.75   3.35  2.91    0.54   0.58     1.87
2012-11  90.75   3.32  2.84    0.58   0.6      1.92
2012-12  90.43   3.26  2.89    0.66   0.54     2.21
2013-01  90.47   3.19  2.88    0.63   0.48     2.35
2013-02  89.64   3.62  3.17    0.73   0.39     2.45
2013-03  89.89   3.59  3.2     0.93   0.29     2.11
2013-04  90.17   3.61  3.08    0.92   0.27     1.95
Models for Information Retrieval
• Boolean or vector space model of IR (Information Retrieval):
matching is done in a formally defined but semantically imprecise calculus of index terms.
• A number of retrieval models work on a probabilistic basis.
The Binary Independence Model (BIM) is the original and still the most influential of the
probabilistic retrieval models.
Contd…
OKAPI BM25: model for Information Retrieval
• The BIM was originally designed for short catalogue records and
abstracts of fairly consistent length.
• For modern full-text search collections, a model should pay attention to term frequency and
document length.
The BM25 weighting scheme, often called Okapi weighting after the system in which it was first
implemented, was developed as a way of building a probabilistic model sensitive to these quantities.
Contd…
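The score of any document under Okapi BM25 is determined through the following
equations:
Equation 1. The simplest score for document d is just idf weighting of the query terms present:
$$RSV_d = \sum_{t \in q} \log \frac{N}{df_t}$$
where N is the number of documents in the collection and df_t is the number of documents that contain term t.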
Equation 2. Sometimes an alternative version of idf is used. If we start with the probabilistic
term-weight formula but, in the absence of relevance feedback information, estimate that
S = s = 0, then we get an alternative idf formulation as follows:
$$RSV_d = \sum_{t \in q} \log \frac{N - df_t + \frac{1}{2}}{df_t + \frac{1}{2}}$$
Contd…
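Equation 3. We can improve on Equation 1 by factoring in the frequency of each term and
document length:
$$RSV_d = \sum_{t \in q} \log\left[\frac{N}{df_t}\right] \cdot \frac{(k_1 + 1)\, tf_{td}}{k_1\left((1 - b) + b \cdot (L_d / L_{ave})\right) + tf_{td}}$$
where tf_{td} is the frequency of term t in document d, L_d is the length of d, L_{ave} is the average
document length in the collection, and k_1 and b are tuning parameters.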
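Equation 4. If the query is long, then we might also use similar weighting for query terms. This
is appropriate if the queries are paragraph-long information needs, but unnecessary
for short queries:
$$RSV_d = \sum_{t \in q} \log\left[\frac{N}{df_t}\right] \cdot \frac{(k_1 + 1)\, tf_{td}}{k_1\left((1 - b) + b \cdot (L_d / L_{ave})\right) + tf_{td}} \cdot \frac{(k_3 + 1)\, tf_{tq}}{k_3 + tf_{tq}}$$
where tf_{tq} is the frequency of term t in the query q and k_3 is a further tuning parameter.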
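The document-side scoring of Equation 3 can be sketched in Python as below. This is an illustrative implementation under stated assumptions: documents are already tokenized, idf is taken as log(N / df), and the defaults k1 = 1.2 and b = 0.75 are commonly quoted values, not values given in the slides. The function name and the sample collection are hypothetical.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query (term frequency plus length normalization)."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency of every term in the collection.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in set(query):
            if tf[term] == 0:
                continue  # query terms absent from the document contribute nothing
            idf = math.log(n_docs / df[term])
            norm = k1 * ((1 - b) + b * len(d) / avg_len) + tf[term]
            score += idf * (k1 + 1) * tf[term] / norm
        scores.append(score)
    return scores

# Hypothetical tokenized collection and a one-term query.
docs = [
    ["eggs", "eggs", "breakfast"],
    ["eggs", "toast"],
    ["coffee", "toast", "breakfast", "menu"],
]
scores = bm25_scores(["eggs"], docs)
```

Setting b = 0 switches off length normalization, and letting k1 approach 0 reduces the score to the pure idf sum of Equation 1.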
-Sumanta Bag
Conclusion
• Individual indexers may prefer different approaches to implementing indexing services.
• The effectiveness of an index as a search tool will depend on the number of access points provided.
• Different factors influence the recall and precision measures for any retrieved information.
• Indexing and its usage can be made more sophisticated through the application of concepts such as:
• Weighted Indexing
• Linking of terms
• Role Indicators
• Subheadings
• Index Language Devices
Conclusion: Enhancement of Indexing Procedures
Weighted Indexing
• Many automatic systems include some form of weighting to allow ranking.
• Weighted indexing gives the searcher the freedom to vary the exhaustivity.
• It simplifies the process of indexing.
• Weighted indexing assigns a numerical value to individual terms.
• A weighted index offers two ways of retrieval from the database:
by major and by minor descriptors.
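A weighted index can be sketched as a mapping from documents to term weights. In this hypothetical example a weight of 2 marks a major descriptor and 1 a minor descriptor; the data, the weight scale and the function `weighted_search` are assumptions for illustration.

```python
# Hypothetical weighted index: document id -> {index term: weight}.
# Weight 2 marks a major descriptor, weight 1 a minor descriptor.
WEIGHTED_INDEX = {
    "doc1": {"indexing": 2, "search engines": 1},
    "doc2": {"indexing": 1, "thesauri": 2},
    "doc3": {"search engines": 2},
}

def weighted_search(index, terms, min_weight=1):
    """Return (doc_id, score) pairs ranked by the summed weights of matching terms."""
    results = []
    for doc_id, weights in index.items():
        matched = [weights[t] for t in terms if weights.get(t, 0) >= min_weight]
        if matched:
            results.append((doc_id, sum(matched)))
    # Highest score first; ties broken by document id for a stable order.
    return sorted(results, key=lambda pair: (-pair[1], pair[0]))

all_hits = weighted_search(WEIGHTED_INDEX, ["indexing"])
major_only = weighted_search(WEIGHTED_INDEX, ["indexing"], min_weight=2)
```

Raising `min_weight` restricts the search to major descriptors, which is one way the searcher can vary exhaustivity at query time.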
Linking of terms
• Linking supports efficient and timely retrieval of appropriate and correct information.
• Inappropriate or irrelevant responses can be avoided by reducing the exhaustivity of the index.
• Linking index terms removes unwanted or false associations.
Role Indicators
• Role indicators play an important part in the retrieval of accurate information.
• They use syntax to reduce ambiguity.
• Role indicators were introduced into retrieval systems in the early 1960s.
• The first of its kind was the Engineers Joint Council (EJC) set of role indicators.
• The document surrogate was a ‘telegraphic abstract’ built by means of a ‘semantic code dictionary’.
Subheadings
• With the advent of automated systems, the need to retrieve precise information gained importance.
• The problem of false or ambiguous associations is now less severe.
• Subheadings can be applied in many post-coordinate index systems.
• They are successful in reducing ambiguities when searching electronic databases.
Index Language Devices
Precision devices: weighting, links, role indicators, subheadings
Recall devices: synonym control, inverse relation
Before we Conclude…
• The entire discussion was based on the application of indexing techniques and principles to the design of search
engines.
• The aim is to develop software tools that allow the user to perform relatively specific subject searches
across resources of any type.
• Search engines operate by building ‘indexes’ to the network resources.
• Boolean logic is used for searching.
• Search engines use inverted indexes.
Conclusion
Today the internet has become versatile and is treated as a significant source of information. The
transition from traditional to electronic forms of information resources has paved the way for the creation
of various software tools.
These provide enhanced navigation among resources available in electronic form and within networked
environments. However, various studies indicate that there is much ground to be covered before machines
become intelligent enough to completely replace humans. As of now the role of the human indexer is
quite indispensable.
Thus, in days to come, upgraded indexing techniques and principles will surely be developed, thereby
ensuring efficient and timely retrieval of information from a digitized environment.
Thank
you

Indexing Techniques: Their Usage in Search Engines for Information Retrieval

  • 1.
  • 2. Topics Speakers 1. Introduction and Overview Sayon Roy 2. Indexing Techniques – Transition from Manual to Automated System Kaustav Saha 3. Usage in Modern Day Search Engines Vikas Bhushan 4. Currents Trends and Applications Debashis Naskar 5. Conclusion Sumanta Bag
  • 3. Indexing…. an Overview • Indexing is a crucial part of any information retrieval system. It is a challenging task that requires attention to many theoretical and practical issues. While the move towards digital information systems and automated indexing is thought to have reduced the need for indexers in some areas, professional indexers are still much needed, and the electronic environment has in fact posed new challenges for them. • Indexing is more a process of extraction than of content analysis. • The terms in an index represent certain concepts.
  • 4. Subject Indexing and Subject Retrieval • Subject indexing can be described as a system of classifying without notation. It is the core theme of information science. • Today subject retrieval is facilitated through the use of structured databases. • The items that are retrieved are listed in the index. • In OPACs, indexing is done manually to determine what a resource is about. After identification, the aboutness is translated into the language of the vocabulary.
  • 5. Schematic Illustration… Conception of Subject Analysis and Indexing Type of Subject Information Indexing Method Simplistic Conception Explicit Information Extraction Content- Oriented Conception Implicit Information Assignment Requirement -Oriented Conception
  • 6. Early use of computers for Information Retrieval • In 1948 a “machine called the Univac” capable of searching for text references associated with a subject code was created. • The machine could process “at the rate of 120 words per minute”. It appears that this is the first reference to a computer being used to search for content. • The impact of computers in IR is highlighted when Hollywood drew public attention to the innovation with the comedy “Desk Set”, which came out in 1957. It centred on a group of reference librarians who were about to be replaced by a computer. • IR as a research discipline was starting to emerge at this time with two important developments: how to index documents and how to retrieve them.
  • 7. Indexing and Information Retrieval… A Chronology • Mortimer Taube’s Uniterm system was essentially a proposal to index items by a list of keywords. As simple an idea as this seems today, it was at the time a radical step.
  • 8. Ranked retrieval •The ranked retrieval approach to search was taken up by IR researchers, who over the following decades refined and revised the means by which documents were sorted in relation to a query. •The superior effectiveness of this approach over Boolean search was demonstrated in many experiments over those years. • Work in the 1950s established computers as the definitive tool for search.
  • 9. 1960s … •The 1960s witnessed formalization of algorithms to rank documents relative to a Query. •This was a process to support iterative search, where documents previously retrieved could be marked as relevant in an IR system. •Versions of this process are used in modern search engines, such as the “Related articles” link on Google Scholar.
  • 10. 1970s… • One of the key developments of this period was that Luhn’s term frequency (tf) weights (based on the occurrence of words within a document) were complemented by Spärck Jones’s work on word occurrence across the documents of a collection, which introduced the idea of inverse document frequency (idf). • An alternative means of modelling IR systems involved extending Maron, Kuhns and Ray’s idea of using probability theory.
  • 11. 1980s – mid 1990s • Building on the developments of the 1970s, variations of tf-idf weighting schemes were produced and the formal models of retrieval were extended. • The original probabilistic model did not include tf weights, and a number of researchers worked to incorporate them in an effective and principled way. • Amongst other achievements, this work ultimately led to the ranking function BM25, which has proven to be highly effective and is still commonly used. • Advances on the basic vector space model were also developed; probably the best known is Latent Semantic Indexing (LSI).
  • 12. Mid 1990s – present • The arrival of the web initiated the study of new problems in IR. • Search engine developers quickly realised that they could use the links between web pages to construct a crawler or robot to traverse and gather most web pages on the internet. • The first full-text search engine using a crawler was WebCrawler, released in 1994.
  • 13. -Kaustav Saha Indexing Techniques – Transition from Manual to Automated System
  • 14. What is an index? •A Database where information (after being collected, parsed and processed) is stored to allow for quick retrieval. •Association of descriptors (keywords, concepts, metadata) to documents in view of future retrieval •The knowledge / expectation / behavior of the searcher needs to be anticipated
  • 15. Example of Indexing using POPSI. Title: A report on the treatment of infectious disease of lungs in India during 1982-85. Discipline: Medical Science. Entity: Lung. Property: Infectious disease. Action: Treatment. Space modifier: India. Time modifier: 1982-85. Form modifier: Report. Subject headings: MEDICAL SCIENCE, LUNG infectious disease, treatment, India, 1982-85; INFECTIOUS DISEASE, TREATMENT medical science, lung, India, 1982-85. Cross references: Therapeutics see Treatment; Therapy see Treatment.
  • 16. Manual and Automatic Indexing •Manual •Human indexers assign index terms to documents •A computer system may be used to record the descriptors generated by the human •Automatic •The system extracts “typical”/ “significant” terms •The human may contribute by setting the parameters or thresholds, or by choosing components or algorithms •Semi-automatic •The system’s contribution may be supported in terms of word lists, thesauri, reference system, etc, following or not the automatic processing of the text
  • 17. Manual vs. Automatic Indexing •Manual •Slow and expensive •Is based on intellectual judgment and semantic interpretation (concepts, themes) •Low consistency •Automatic •Fast and inexpensive •Mechanical execution of algorithms, with no intelligent interpretation (aboutness / relevance) •Consistent
  • 18. Vocabulary •Vocabulary (indexing language) •The set of concepts (terms or phrases) that can be used to index documents in a collection •Controlled •Specific for specialized domains •Potential for increased consistency of indexing and precision of retrieval •Un-controlled (free) •Potentially all the terms in the documents •Potential for increased recall
  • 19. Thesauri •Capture relationships between indexing terms •Hierarchical •Synonymous •Related •Creation of thesauri •Manual vs. automatic •Use of thesauri •In manual / semi-automatic / automatic fashion •Syntagmatic co-ordination / thesaurus-based query expansion during indexing / searching
  • 20. Steps of automatic indexing. [Figure: text representation is reached through lexical analysis, stemming and stop word removal; other labels shown: collection/document structure, data structure.]
  • 21. Role of Indexing in Information Retrieval. [Figure: flow diagram with the labels Population of Documents, Selected Documents, Indexing, Document Description, Document Store, System Vocabulary, Database in printed or electronic form, Search Strategy, Information Needs, Population of Database Users.]
  • 22. Usage in Modern Day Search Engines - Vikas Bhushan. Outline: Search Engines; Use of Search Engines; Types of Search Engines; Software Components in Search Engines; Pictorial Representation of Components; How Search Engines Work with a Model; Post-coordinate Indexes.
  • 23. Search engines: An initiative towards correct retrieval from a Labyrinth of Ideas • Search engines do not search only for keywords; some search for other things as well. • They are not really “engines” in the classical sense, but then a mouse is not a “mouse” either. Rather, they are computer programs that search for particular keywords and return a list of documents in which they were found, especially a service that scans documents on the Internet.
  • 24. Types of Search Engines Crawler Based – Google, AltaVista Human Based – Yahoo directory, Open directory, LookSmart Hybrid Models – Yahoo, Google Meta Search Engines – Dogpile, MetaCrawler
  • 25. Use of search engines. As WebCrawler founder Brian Pinkerton, among others, puts it: "Imagine walking up to a librarian and saying, 'travel'. They’re going to look at you with a blank face."
  • 26. Components in the Back-end & Front-end process. Software components, back-end: Crawler/Spider; Indexer; Index File Database. Software components, front-end: Search Engine Interface; Query Parser; Ranking Mechanism (Google uses PageRank, Teoma uses ExpertRank, Yahoo uses TrustRank).
  • 27. Pictorial representation of Front-end & Back-end Process Search Engine Database
  • 28. How Search Engines Work (Sherman 2003). [Figure: a crawler gathers URL1 to URL4 from the Web, the indexer builds the search engine database, and a browser query "Eggs?" returns ranked matches: Eggs 90%, Eggo 81%, Ego 40%, Huh? 10%; the example page is "All About Eggs" by S. I. Am.]
  • 29. Post-coordinate Indexes. An information retrieval system that allows the searcher to combine terms in any way is frequently referred to as post-coordinate. A modern computer-based system, operated online, can be considered a direct descendant of the earlier manual systems. The files of an online system comprise two major elements: 1. A complete set of document representations: bibliographic references, similar to a search engine database. 2. A list of terms, sometimes referred to as an inverted file or a postings file.
  • 30. Post-coordinate Indexes (continued). The subject matter discussed in a document, and represented by the index terms assigned to it, is multidimensional in character. Consider, for example, an article discussing “Political Contenders in Assembly Polls of Karnataka”. It has been indexed under the following terms: Political Contenders; Constituencies; Assembly Polls; Karnataka.
  • 31. The index terms mentioned previously represent a network of relationships. [Figure: network linking Political Contenders, Constituencies, Assembly Polls and Karnataka.]
  • 32. Information Retrieval System Represented as a Matrix. [Figure: matrix of documents 1 to 14 against index terms A to H, with an X marking each term assigned to a document.]
  • 34. Current trends and applications • The web creates new challenges for information retrieval. The amount of information on the web is growing rapidly, as is the number of new users inexperienced in the art of web research. • Automated search engines that rely on keyword matching usually return too many low-quality matches. • A large-scale search engine makes heavy use of the additional structure present in hypertext to provide much higher quality search results.
  • 35. What is XML Indexing? • XML indexing is a form of embedded indexing in which tags are inserted into an XML document to mark the occurrences of indexable terms or topics. • The client’s publishing process automatically generates an index from these index elements. Because this automated process handles all layout and formatting, the indexer does not need to deal with those issues.
  • 36. What makes it work? Index entries in DocBook are encoded using the <indexterm> parent element and its five child elements. These are summarized below: • <indexterm> element: wrapper element for an index entry of any type. • <primary> element: main entry. • <secondary> element: subentry. • <tertiary> element: sub-subentry. • <see> element: ‘see’ references. • <seealso> element: ‘seealso’ references.
  • 37. Future hopes for Indexers • Indexers should offer XML-based services, which are a prerequisite for joining the digital publishing revolution. • Indexers are good with structures, and the use of XML indexing in publishing is about the imposition of structure on text.
  • 38. Results and Performance • The most important measure of a search engine is the quality of its search results. • Here we highlight the performance of, and experience with, Google. It produces better results than the other major commercial search engines for most searches.
  • 39.
  • 40. Search engine usage share (%) by month, April 2012 to April 2013:
      Month    Google  bing  Yahoo!  Baidu  Babylon  Others
      2012-04  91.7    3.5   3.36    0.26   0        1.18
      2012-05  92.04   3.36  3.26    0.22   0        1.12
      2012-06  91.75   3.27  3.04    0.23   0.29     1.42
      2012-07  91.17   3.22  2.95    0.45   0.54     1.67
      2012-08  91.01   3.22  2.98    0.5    0.6      1.7
      2012-09  91.04   3.16  2.91    0.49   0.6      1.8
      2012-10  90.75   3.35  2.91    0.54   0.58     1.87
      2012-11  90.75   3.32  2.84    0.58   0.6      1.92
      2012-12  90.43   3.26  2.89    0.66   0.54     2.21
      2013-01  90.47   3.19  2.88    0.63   0.48     2.35
      2013-02  89.64   3.62  3.17    0.73   0.39     2.45
      2013-03  89.89   3.59  3.2     0.93   0.29     2.11
      2013-04  90.17   3.61  3.08    0.92   0.27     1.95
  • 41. Models for Information Retrieval • In the Boolean and vector space models of IR (Information Retrieval), matching is done in a formally defined but semantically imprecise calculus of index terms. • A number of retrieval models work on a probabilistic basis. The Binary Independence Model (BIM) is the original and still the most influential of the probabilistic retrieval models.
  • 42. OKAPI BM25: model for Information Retrieval • The BIM was originally designed for short catalogue records and abstracts of fairly consistent length. • For modern full-text search collections, a model should pay attention to term frequency and document length. The BM25 weighting scheme, often called Okapi weighting after the system in which it was first implemented, was developed as a way of building a probabilistic model sensitive to these quantities.
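  • 43. The score of a document under Okapi is determined through the following equations. Equation 1. The simplest score for document d is just idf weighting of the query terms present: $RSV_d = \sum_{t \in q} \log \frac{N}{df_t}$. Equation 2. Sometimes an alternative version of idf is used. If we start with the formula but, in the absence of relevance feedback information, estimate that S = s = 0, then we get an alternative idf formulation as follows: $RSV_d = \sum_{t \in q} \log \frac{N - df_t + \frac{1}{2}}{df_t + \frac{1}{2}}$.
  • 44. Equation 3. We can improve on Equation 1 by factoring in the frequency of each term and document length: $RSV_d = \sum_{t \in q} \log\left[\frac{N}{df_t}\right] \cdot \frac{(k_1 + 1)\, tf_{td}}{k_1\left((1 - b) + b \cdot (L_d / L_{avg})\right) + tf_{td}}$. Equation 4. If the query is long, then we might also use similar weighting for query terms. This is appropriate if the queries are paragraph-long information needs, but unnecessary for short queries: $RSV_d = \sum_{t \in q} \log\left[\frac{N}{df_t}\right] \cdot \frac{(k_1 + 1)\, tf_{td}}{k_1\left((1 - b) + b \cdot (L_d / L_{avg})\right) + tf_{td}} \cdot \frac{(k_3 + 1)\, tf_{tq}}{k_3 + tf_{tq}}$.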
  • 46. Conclusion: Enhancement of Indexing Procedures • For implementation of indexing services, individual indexers may prefer different approaches. • The effectiveness of an index as a search tool depends on the number of access points provided. • Different factors influence the recall and precision measures for any retrieved information. • Indexing and its usage can be made more sophisticated through the application of concepts such as: Weighted Indexing; Linking of Terms; Role Indicators; Subheadings; Index Language Devices.
  • 47. Weighted Indexing • Many automatic systems include some form of weighting to allow ranking. • Weighted indexing gives the searcher the freedom to vary exhaustivity. • It simplifies the process of indexing. • Weighted indexing assigns a numerical value to individual terms. • A weighted index offers two levels of retrieval from the database: major and minor descriptors.
  • 48. Linking of Terms • Linking supports efficient and timely retrieval of appropriate and correct information. • Inappropriate or irrelevant responses can be avoided by reducing the exhaustivity of the index. • Unwanted or false associations are removed. • False associations are avoided by linking index terms.
  • 49. Role Indicators • Role indicators play an important part in retrieval of accurate information. • They use syntax to reduce ambiguity. • Role indicators were introduced into retrieval systems in the early 1960s. • The first of its kind was the Engineers Joint Council (EJC) set of role indicators. • The document surrogate was a ‘telegraphic abstract’ by means of a ‘semantic code dictionary’.
  • 50. Subheadings • With the advent of automated systems, the need for retrieval of precise information gained importance. • The problem of false or ambiguous associations is now reduced. • Subheadings can be applied in many post-coordinate index systems. • They are successful in reducing ambiguities in the searching of electronic databases.
  • 51. Index Language Devices Precision Device Weighting Links Role indicators Recall Device Subheadings Synonym control Inverse Relation
  • 52. Before we Conclude… • The entire discussion was based on the application of indexing techniques and principles to the design of search engines. • The aim is to develop software tools that allow the user to perform relatively specific subject searches across resources of any type. • Search engines operate by building ‘indexes’ to the network resources. • The concept of Boolean logic is followed for searching purposes. • Search engines use inverted indexes.
  • 53. Conclusion. Today the internet has become versatile and is treated as a significant source of information. The transition from traditional to electronic forms of information resources has paved the way for the creation of various software tools. These provide enhanced navigation among resources available in electronic form and within a networked environment. However, various studies indicate that there is much ground to be covered before machines become intelligent enough to completely replace humans. As of now the role of the human indexer is quite indispensable. Thus, in days to come, upgraded indexing techniques and principles will surely be developed, thereby ensuring efficient and timely retrieval of information from a digitized environment.
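
The slides describe automatic indexing as lexical analysis, stop word removal and stemming, and describe post-coordinate searching as combining terms over an inverted (postings) file. A minimal Python sketch of those steps might look as follows. The stop word list, the suffix-stripping `crude_stem` helper, the function names and the three-document collection are all hypothetical and stand in for whatever a real system would use; this is an illustration, not the method of any particular search engine.

```python
import re
from collections import defaultdict

# Hypothetical stop word list; real systems use longer, curated lists.
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to"}


def crude_stem(token):
    """Strip a few common English suffixes (a stand-in for a real stemmer)."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def index_terms(text):
    """Lexical analysis, then stop word removal, then stemming."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


def build_inverted_file(documents):
    """Map each index term to the sorted list of document numbers containing it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for term in index_terms(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search_and(inverted_file, query):
    """Post-coordinate Boolean AND: intersect the postings of every query term."""
    terms = index_terms(query)
    if not terms:
        return []
    result = set(inverted_file.get(terms[0], []))
    for term in terms[1:]:
        result &= set(inverted_file.get(term, []))
    return sorted(result)


# Hypothetical three-document collection, loosely based on the Karnataka example.
documents = {
    1: "Political contenders in the assembly polls of Karnataka",
    2: "Constituencies of Karnataka",
    3: "Assembly polls and political contenders",
}
inverted_file = build_inverted_file(documents)

print(inverted_file["karnataka"])                    # [1, 2]
print(search_and(inverted_file, "assembly polls"))   # [1, 3]
print(search_and(inverted_file, "Karnataka polls"))  # [1]
```

Because the terms are only combined at search time, any single term or any combination of terms can be used to reach a document, which is the flexibility the post-coordinate slides describe.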

Editor's notes

  1. The 1960s saw a wide range of activities reflecting the move from simply asking if IR was possible on computers to determining means of improving IR systems. One of these areas is the formalization of algorithms to rank documents relative to a query. Another significant innovation at this time was the introduction of relevance feedback . This was a process to support iterative search, where documents previously retrieved could be marked as relevant in an IR system. A user’s query was automatically adjusted using information extracted from the relevant documents. Versions of this process are used in modern search engines, such as the “Related articles” link on Google Scholar.
  2. Spärck Jones proposed that the frequency of occurrence of a word in a document collection is inversely proportional to its significance in retrieval. Robertson defined the probability ranking principle, which determines how to rank documents optimally, based on probabilistic measures, with respect to defined evaluation measures.
  3. The arrival of the web initiated the study of new problems in IR. This point also marked a time when the interaction between the commercial and research oriented IR communities was much stronger than it had been before. Ideas developed in earlier years were pushed further and implemented in the commercial search sector. Search engine developers quickly realised that they could use the links between web pages to construct a crawler or robot to traverse and gather most web pages on the internet; thereby automating acquisition of content. The first full text search engine using a crawler was WebCrawler released in 1994. The applications of search and the field of information retrieval continue to evolve as the computing environment changes. The most obvious recent example of this type of change is the rapid growth of mobile devices and social media. One response from the IR community has been the development of social search, which deals with search involving communities of users and informal information exchange.
  4. Before I proceed with discussions on the usage of indexing principles in searching, a few words on what a search engine is and some of its functional features. Labyrinth (noun): a complicated irregular network of passages or paths in which it is difficult to find one's way; a maze. An intricate and confusing arrangement.
  5. Crawler-based search engines are good when you have a specific search topic in mind and can be very efficient in finding relevant information in this situation. However, when the search topic is general, crawler-based search engines may return hundreds of thousands of irrelevant responses to simple search requests, including lengthy documents in which your keyword appears only once. Human-powered directories are good when you are interested in a general topic of search. In this situation, a directory can guide you, help you narrow your search and return refined results. Search results found in a human-powered directory are therefore usually more relevant to the search topic and more accurate. However, this is not an efficient way to find information when a specific search topic is in mind. Hybrid search engines use a combination of both crawler-based results and directory results. Meta-search engines are good for saving time by searching only in one place and sparing the need to use and learn several separate search engines. But since meta-search engines do not allow for input of many search variables, their best use is to find hits on obscure items or to see if something can be found using the Internet.
  6. Unfortunately, search engines don't have the ability to ask a few questions to focus your search, as a librarian can. They also can't rely on judgment and past experience to rank web pages in the way humans can. So, how do crawler-based search engines go about determining relevancy when confronted with hundreds of millions of web pages to sort through? They follow a set of rules, known as an algorithm. Exactly how a particular search engine's algorithm works is a closely kept trade secret. One of the main rules in a ranking algorithm involves the location and frequency of keywords on a web page. Call it the location/frequency method. Remember the librarian mentioned above? They need to find books to match your request of "travel", so it makes sense that they first look at books with travel in the title. Search engines operate the same way. Pages with the search terms appearing in the HTML title tag are often assumed to be more relevant than others to the topic. Search engines will also check to see if the search keywords appear near the top of a web page, such as in the headline or in the first few paragraphs of text. They assume that any page relevant to the topic will mention those words right from the beginning. Frequency is the other major factor in how search engines determine relevancy. A search engine will analyze how often keywords appear in relation to other words in a web page. Pages with a higher frequency are often deemed more relevant than other web pages.
  7. Crawler/Spider: A crawler crawls around the web, visits web pages through their links, and downloads the page content to the local database of the search engine. Indexer: An indexer breaks a page into various components, typically keywords, and analyzes them. Elements such as titles, headings, links, text, constructs, bold, italic and other style portions of a page are separated and analyzed. Index file database: The index file database is a catalog containing a copy of every page that the crawler scans and the corresponding index information generated by the indexer. When a crawler returns to a website and scans it again, this catalog is updated. Query parser: This is the software used to analyze the queries input by users by converting the query into the same representation as the documents in the indexer. In this way, a comparison can be made between the query and each document. Ranking mechanism: This is the algorithm that matches the users’ keywords with the pages in the database and comes up with matches. These matches are then displayed on the users’ screen in decreasing order of their relevance to the input query. Search engine interface: This is the portion of a search engine a user interacts with when performing a search and viewing the returned results.
  8. The application of indexing techniques in information organization is directed towards efficient, precise and timely retrieval of information. For this, post-coordination is followed. Post-coordinate indexes provide flexibility when index terms are obtained from automated and digitized information systems. In present-day search engines the concept of post-coordination in indexing is applied. A complete set of document representations: bibliographic reference, usually accompanied by index terms or an abstract or both.
  9. The following illustration depicts a networked relationship among the four index terms identified in the previous example. A query can use any combination of the indexed terms, so a user should be able to retrieve this document in a search involving any single term or any combination of them: any two terms, any three, or all four.
  10. What happens in an online search is demonstrated through a reference matrix in this slide. Suppose the searcher enters the term POLITICAL CONTENDERS at a terminal, and this is represented by H in the diagram. The system responds by indicating that five items have been indexed under the term. The searcher enters ASSEMBLY POLLS (represented by E in the diagram) and is told that six items appear under this term. If the searcher now asks that E be combined with H, the system compares the document numbers on these two lists and indicates that four items satisfy the requirement. When told to do so by the searcher, the computer finds these records by their identifying numbers (1, 7, 8, 14) and displays them or prints them out. The procedure remains the same however many terms are involved and whatever the logical relationships specified by the searcher.
  11. Thank you, Vikas. I am the fourth speaker on this topic. My name is Debashis Naskar, a first-year DRTC student. Before proceeding with the slides, I wish a very good afternoon to everyone: our respected teacher, our research scholar and our friends.
  12. My portion covers the current status and applications of indexing principles in searching. I begin with an introduction to the web and how it is used for information retrieval. Nowadays the web creates new challenges for information retrieval. The amount of information on the web is growing rapidly, as is the number of new users inexperienced in the art of web research. People are likely to surf the web using its link graph, often starting with high-quality human-maintained indices such as Yahoo! or with search engines. Human-maintained lists cover popular topics effectively but are subjective and cannot cover all esoteric topics. Automated search engines that rely on keyword matching usually return too many low-quality matches. HTML is syntactically rich, whereas XML is semantically rich.
  13. Now a few words about XML indexing. XML indexing is a technique currently used to identify information resources in electronic form. The indexer should be able to provide correct identification, to facilitate better navigation and timely retrieval of appropriate information. XML indexing is a form of indexing that uses various tags. You can create indexes on your XML data to focus on the particular parts of it that you query often, and thus improve performance. The automated process handles formatting matters such as word-by-word or letter-by-letter ordering.
  14. The encoding scheme for DocBook is now described briefly, as an example of XML indexing. Index entries in DocBook are encoded using the <indexterm> element, which is the parent element, and its five child elements <primary>, <secondary>, <tertiary>, <see> and <seealso>. These elements and their attributes provide for all standard features, including main entries, subentries, alternate alphabetization order, single page locators and page ranges, and see and see also references. They are summarized below: <indexterm> element: wrapper element for an index entry of any type. <primary> element: used for the main entry. <secondary> element: used for a subentry. <tertiary> element: used for a sub-subentry. <see> element: used with <primary> or <secondary> for ‘see’ references. <seealso> element: used with <primary> or <secondary> for ‘seealso’ references.
  15. Here are a few possibilities open to indexers in the probable future. If indexers can offer XML-based services, publishers will use them. Learning XML is not hard for anyone who can think rigorously. Indexers are good with structures and, fundamentally, XML in publishing is about the imposition of structure on text. It is often believed that general text search will remove the need for digital indexing. However, this is not the case, and indexers should try to educate publishers to make sure they understand this. In years to come many books will be published that never appear in print, only in an electronic version.
  16. Indexing principles are widely applied nowadays in search engines. I now give a general talk on Google, which is the most widely used search engine. I mainly highlight the performance of, and experience with, Google, which produces better results than the other commercial search engines.
  17. This is a bar graph representation of the usage of various search engines. As you can see, Google is the most widely used search engine on a global scale.
  18. Here I give one year of statistics for the usage of search engines, as depicted in the previous slide. We can see that from April 2012 to April 2013 Google had a far larger share of usage than the other search engines.
  19. For various information needs, users formulate queries, which are translated into an appropriate representation. Documents are likewise converted into document representations. Based on these representations, a system tries to determine how well documents satisfy information needs. Probability theory provides a principled foundation for such reasoning. With this I introduce the Okapi BM25 weighting scheme, an example of a non-binary model that makes use of term counts.
  20. Okapi BM25 is a ranking function used by search engines to rank matching documents according to their relevance to a given search query. It is based on the probabilistic retrieval framework. The BIM (Binary Independence Model) was originally designed for short catalogue records and abstracts of fairly consistent length, whereas BM25 is intended for modern full-text search collections. The scheme is also called the BM25 weighting scheme or Okapi weighting. What follows is an overview of the series of forms that build up to the standard form currently used for document scoring.
  21. Here we can see some equations, which give the score for a document and help us retrieve the information the user needs. Now let's move to the equations. The first equation relates to the right-hand side of the log function. If we take ut to be dft/N, the first equation follows. Here idf means inverse document frequency, and RSVd means Retrieval Status Value, the score of document d. How do we get this value? dft is the number of documents that contain term t and N is the number of documents in the whole collection; taking the log of N divided by dft gives the idf weight of each term, and summing these weights over the query terms gives RSVd. Why do we use the log function? Because N/dft can vary over a very wide range, and the log compresses it so that rare terms weigh more than common ones without dominating the score. Let me explain with the help of an example. (Suppose you are searching for Barack Obama in connection with the White House. A search on the keyword Barack Obama returns a great deal of information, some relevant and some irrelevant, that is, about Barack Obama but not related to the White House. We need only the documents about Barack Obama that relate to the White House, and the term weighting helps rank that relevant information first.) The second equation is similar; it is the basic form and gives a more robust score. According to the Okapi model, if we set S = 0, that is, if we have no relevance information, the final equation is this RSVd equation. Why do we add the value of one half? To avoid the possibility of zeros. It is fairly standard to add one half to adjust the marginal counts; the half is a smoothing value.
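The equations themselves are on the slide and are not reproduced in these notes. Assuming the slide follows the standard textbook presentation of the Okapi model, the first two forms can be written roughly as below. The symbol names and the summation over terms shared by the query q and the document d are assumptions on my part, not taken from the slide.

```latex
% Assumed standard forms; not copied from the slide.
% With u_t = df_t / N, the right-hand log term becomes an idf weight:
\log\frac{1-u_t}{u_t} \;=\; \log\frac{N-\mathrm{df}_t}{\mathrm{df}_t} \;\approx\; \log\frac{N}{\mathrm{df}_t}

% With no relevance information (S = s = 0) and 1/2 added to each count for smoothing:
\mathrm{RSV}_d \;=\; \sum_{t \in q \cap d} \log\frac{N-\mathrm{df}_t+\tfrac{1}{2}}{\mathrm{df}_t+\tfrac{1}{2}}
```

The approximation in the first line holds when a term occurs in only a small fraction of the collection, so that N - df_t is close to N.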
  22. Now we can see the third equation, which builds on the first equation and improves it. Here tftd means the frequency of term t in document d, Ld is the length of the document and Lavg is the average document length in the whole collection. k1 is a tuning parameter calibrating the document term frequency scaling, and b is a tuning parameter for scaling by document length: b = 1 corresponds to fully scaling the term weight by document length and b = 0 to no length normalization. If we evaluate this equation properly, we get a well-calibrated score. The last equation is meant for long queries rather than keywords; the query might be a sentence or a paragraph. That is why it uses k3, another positive tuning parameter, which this time calibrates the term frequency scaling of the query. So this model is very helpful for scoring relevant information quickly.
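To make the third and fourth equations concrete, here is a minimal Python sketch of the scoring as I understand it from the standard form of BM25. The function name bm25_score, the tiny three-document collection and the default values k1 = 1.2 and b = 0.75 are illustrative assumptions, not values taken from the slides.

```python
import math
from collections import Counter


def bm25_score(query_terms, doc_terms, doc_freq, n_docs, avg_len,
               k1=1.2, b=0.75, k3=None):
    """Score one document against a query (sketch of the standard BM25 form).

    doc_freq maps each term to df_t, the number of documents containing it.
    k3=None skips the query term frequency factor (keyword-style queries).
    """
    tf_doc = Counter(doc_terms)
    tf_query = Counter(query_terms)
    # Length normalisation: b=1 scales fully by length, b=0 not at all.
    length_norm = (1 - b) + b * (len(doc_terms) / avg_len)
    score = 0.0
    for term, q_count in tf_query.items():
        tf = tf_doc.get(term, 0)
        df = doc_freq.get(term, 0)
        if tf == 0 or df == 0:
            continue  # the term contributes nothing to this document
        idf = math.log(n_docs / df)
        doc_part = ((k1 + 1) * tf) / (k1 * length_norm + tf)
        query_part = 1.0 if k3 is None else ((k3 + 1) * q_count) / (k3 + q_count)
        score += idf * doc_part * query_part
    return score


# Hypothetical toy collection, already tokenised.
docs = [
    ["obama", "white", "house", "speech"],
    ["obama", "biography", "book"],
    ["white", "house", "tour", "tickets", "tour"],
]
doc_freq = Counter(term for doc in docs for term in set(doc))
n_docs = len(docs)
avg_len = sum(len(doc) for doc in docs) / n_docs

query = ["obama", "white", "house"]
scores = [bm25_score(query, doc, doc_freq, n_docs, avg_len) for doc in docs]
ranking = sorted(range(n_docs), key=lambda i: scores[i], reverse=True)
print(ranking)  # prints [0, 2, 1]
```

The document that matches all three query terms comes first. Passing a value for k3 only matters when a term is repeated in the query, which is the long-query case mentioned above.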
  23. Several approaches have been observed throughout the previous discussion. One significant point is that an individual may prefer any of numerous approaches when implementing an indexing service; the choice largely depends on how the tool is to be used. For current awareness purposes, tools that use some form of classified approach will usually be superior to alphabetico-specific indexes. For example, someone interested in keeping current with new developments in parasitology in general would find BIOLOGICAL ABSTRACTS more useful than INDEX MEDICUS, because in the latter, references to the subject are likely to be scattered over a wide variety of subject headings. For someone whose current awareness interests are highly specific, the alphabetico-specific approach might actually be more convenient. In considering these tools as search and retrieval devices, performance factors should also be discussed. The effectiveness of an index as a search tool depends on the number of access points provided, the specificity of the vocabulary used to index, the quality and consistency of the indexing, and the extent to which the tool offers positive help to the searcher (e.g., by linking semantically related terms). Contd…
  24. Many automatic systems include some form of weighting to allow ranking of the output. Weighted indexing lets the searcher vary the exhaustivity of the indexing. Much subject indexing entails a simple binary decision: a term is either assigned to the document or it is not. While this simplifies the process of indexing, it creates problems for the user of a database, who cannot devise a search strategy that distinguishes items in which a topic receives substantial treatment from those in which it is dealt with only in a minor way. In weighted indexing, indexers assign to a term a numerical value that reflects their opinion of how important that term is in indicating what a particular document is about. The more central a topic is to the document, the higher the weight assigned to it. This is subjective, and different indexers may assign different weights. Weighted indexing of this type can be used in two ways in retrieval from a database. One way is simply to allow a searcher to specify that only items indexed under a term carrying a particular weight should be retrieved. The alternative is to use the weights to rank the items retrieved in a search. The assignment of numerical weights to terms was first advocated by Maron and Kuhns in 1960. Maron referred to this type of indexing as ‘probabilistic’. Some databases incorporate a weighting technique by distinguishing between ‘major’ and ‘minor’ descriptors, which is equivalent to adopting a numerical scale with two values. Minor descriptors are those that are associated only with the database in electronic form. This practice is followed at the National Library of Medicine (Index Medicus and the Medline database), the National Technical Information Service (NTIS) and the Educational Resources Information Centre (ERIC).
Suppose (M) denotes ‘major’ descriptors and (m) denotes ‘minor’ descriptors; then for an AND relationship a crude ranking of the output can be achieved in the order: M * M, then M * m, then m * m. These are the items in which the two terms used by the searcher appear with the respective designations. Many automatic systems include forms of weighting to allow ranking of the output. Most automatic processing systems weight by frequency criteria: the frequency of occurrence of a term in a text and/or its frequency of occurrence in the database as a whole. Other methods have also been tried, including the use of positional criteria. Contd…
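The crude major/minor ranking can be sketched in a few lines of Python. The record layout, the function name rank_by_descriptor_weight and the numeric values 2 and 1 standing in for M and m are assumptions for illustration only; any two values with M greater than m give the same order.

```python
# Hypothetical numeric stand-ins for 'major' (M) and 'minor' (m) descriptors.
WEIGHT = {"M": 2, "m": 1}


def rank_by_descriptor_weight(records, term_a, term_b):
    """Return ids of records indexed under both terms, ranked M*M, M*m, m*m.

    records maps a document id to a dict of {descriptor: "M" or "m"}.
    """
    hits = []
    for doc_id, descriptors in records.items():
        if term_a in descriptors and term_b in descriptors:  # AND relationship
            product = WEIGHT[descriptors[term_a]] * WEIGHT[descriptors[term_b]]
            hits.append((product, doc_id))
    # Highest product first; ties are broken by document id for a stable order.
    hits.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in hits]


# Hypothetical records.
records = {
    "d1": {"welding": "m", "aluminium": "m"},
    "d2": {"welding": "M", "aluminium": "M"},
    "d3": {"welding": "M", "aluminium": "m"},
    "d4": {"welding": "M"},
}
ranked = rank_by_descriptor_weight(records, "welding", "aluminium")
print(ranked)  # prints ['d2', 'd3', 'd1']
```

Record d4 is left out because it carries only one of the two terms, so it does not satisfy the AND relationship at all.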
  25. For efficient and timely retrieval of appropriate and correct information, much emphasis should be laid on document representation, and linking of terms is essential in this regard. In certain cases the documents retrieved for a search are not an appropriate and relevant response. This can be avoided by the use of weighted indexing or by reducing the exhaustivity of the indexing. Other unwanted retrievals are caused by false associations, cases in which the terms that cause an item to be retrieved are quite unrelated in the document. The probability of false associations increases with the length of the record (i.e., with the number of access points provided or with the exhaustivity of the indexing). One way of avoiding false associations is to link index terms. That is, the document is partitioned into several subdocuments, each dealing with a separate although closely related subject. All terms in each link are directly related, and certain terms may appear in several links. Each link is identified by some alphanumeric character that is carried in the database record itself. In an online retrieval system this would be attached to the document number in the inverted file. Thus a document numbered 1205 may be partitioned into subdocuments 1205/1, 1205/2, 1205/3 and so on. This gives the searcher the opportunity to specify that two terms should co-occur not only in the document record but also in a particular link within the document. This is how false associations can be avoided.
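As a rough illustration of links, the sketch below stores each posting in the inverted file as a (document number, link) pair, so 1205/1 becomes (1205, 1). The terms, the postings and the function name docs_cooccurring are made up for the example.

```python
# Hypothetical inverted file: term -> set of (document number, link) postings.
postings = {
    "welding":   {(1205, 1)},
    "copper":    {(1205, 1)},
    "cleaning":  {(1205, 2)},
    "aluminium": {(1205, 2)},
}


def docs_cooccurring(postings, term_a, term_b, same_link):
    """Return document numbers in which both terms occur.

    same_link=False: the terms only need to share a document record.
    same_link=True:  the terms must also share a link (subdocument).
    """
    post_a = postings.get(term_a, set())
    post_b = postings.get(term_b, set())
    if same_link:
        return sorted({doc for doc, _ in post_a & post_b})
    return sorted({doc for doc, _ in post_a} & {doc for doc, _ in post_b})


# Document-level search retrieves 1205 through a false association:
loose = docs_cooccurring(postings, "welding", "aluminium", same_link=False)
# Requiring the same link removes it:
strict = docs_cooccurring(postings, "welding", "aluminium", same_link=True)
print(loose, strict)  # prints [1205] []
```

In this made-up record, welding is linked with copper and cleaning with aluminium, so a search for welding AND aluminium matches the document only when links are ignored.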
  26. Role indicators play an important part in the retrieval of accurate information. The problem of unwanted retrievals can be reduced if proper linking is done. At times, however, items are retrieved in which the terms are directly related in the document but not in the way the searcher wants. The searcher means them in a different context, so such items, although the terms appear in the proper link, are undesirable. Such problems arise from incorrect term relationships. Thus, for purposes of disambiguation, it is necessary to introduce some syntax into the indexing. The traditional method is to use role indicators (or relational indicators). These are codes that establish explicit relationships between terms. Links and roles were both introduced into retrieval systems in the early 1960s, when post-coordinate systems were at a rudimentary stage and computer-based retrieval was in its infancy. The first of its kind was the Engineers Joint Council (EJC) set of role indicators. Even more elaborate than the EJC method of indexing using links and roles was the ‘semantic code’ approach to information retrieval. The document surrogate was a ‘telegraphic abstract’, and the terms in the telegraphic abstracts were encoded by means of a ‘semantic code dictionary’.
  27. The establishment of links, the implementation of role indicators and the semantic code all laid the foundation for a highly structured approach to indexing. This was during the 1960s, when computer-based systems were quite young. Gradually, with the advent of automated systems, the need for retrieval of precise information gained importance. The problems of false or ambiguous associations are now less severe than they were thirty or forty years ago because a higher level of precoordination now exists in most systems. In a postcoordinate system, subheadings can be applied in much the same way as they are used in traditional subject cataloguing. The main justification for the use of subheadings was to facilitate the use of Index Medicus; however, the concept also proved largely successful in reducing ambiguities in the searching of electronic databases. The inclusion of subheadings made things more comprehensible for users.
  28. When it comes to enhancing indexing principles, the factors affecting recall and precision need consideration. The recall and precision of any search depend on devices that have been categorized into recall devices and precision devices. Precision devices include weighting, links and role indicators. These devices are quite independent of the index language. Recall devices include subheadings and synonym control, which are integral components of the index language. The complete array of such devices is known as Index Language Devices.
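For reference, recall is the share of the relevant items that a search retrieves, and precision is the share of the retrieved items that are relevant. A small Python sketch with made-up document numbers shows the calculation; the function name recall_precision is only illustrative.

```python
def recall_precision(retrieved, relevant):
    """Return (recall, precision) for one search.

    recall    = relevant items retrieved / all relevant items
    precision = relevant items retrieved / all items retrieved
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical search: four items retrieved, three relevant items in the database.
recall, precision = recall_precision(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5])
print(round(recall, 2), precision)  # prints 0.67 0.5
```

A recall device is expected to raise the first number by retrieving more of the relevant items, and a precision device to raise the second by screening out irrelevant ones.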
  29. The entire discussion was based on the application of indexing techniques and principles to the design of search engines that facilitate efficient retrieval of relevant information. For a long time the main concern has been to develop software tools that allow the user to perform relatively specific subject searches on resources of any type. Search engines operate by building ‘indexes’ to network resources. This means extracting words or phrases from the text itself and creating files that allow efficient searching of such extracts. The engine first searches for resources to index. Search engines use inverted indexes, which are built automatically. When an appropriate resource is found, it is indexed. Other resources linked to the selected resource may also be located and indexed. The programs that do this are given various names, including ‘crawlers’, ‘spiders’ and ‘worms’.
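The crawl-and-index cycle described here can be sketched in Python over a small in-memory set of pages instead of the real network. The page names, their text and the function name crawl_and_index are all hypothetical, and a real crawler would also have to fetch pages, parse markup and respect site policies.

```python
from collections import defaultdict, deque


def crawl_and_index(pages, seed):
    """Follow links breadth-first from seed and build an inverted index.

    pages maps a page name to (text, list of linked page names).
    Returns a dict mapping each word to the set of pages containing it.
    """
    index = defaultdict(set)
    seen = {seed}
    queue = deque([seed])
    while queue:
        page = queue.popleft()
        text, links = pages[page]
        # Index the resource: extract words and record where they occur.
        for word in text.lower().split():
            index[word].add(page)
        # Locate other resources linked from this one.
        for link in links:
            if link in pages and link not in seen:
                seen.add(link)
                queue.append(link)
    return index


# Hypothetical miniature web.
pages = {
    "a": ("Indexing and retrieval", ["b"]),
    "b": ("Search engines use inverted indexing", ["a", "c"]),
    "c": ("Crawlers follow links", []),
    "d": ("Unlinked page", []),
}
index = crawl_and_index(pages, "a")
print(sorted(index["indexing"]))  # prints ['a', 'b']
```

Page d is never indexed because no page reachable from the seed links to it, which is the usual limitation of link-following crawlers.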