Indexing Techniques: Their Usage in Search Engines for Information Retrieval
1.
2. Topics Speakers
1. Introduction and Overview Sayon Roy
2. Indexing Techniques – Transition from Manual to Automated System Kaustav Saha
3. Usage in Modern Day Search Engines Vikas Bhushan
4. Current Trends and Applications Debashis Naskar
5. Conclusion Sumanta Bag
3. Indexing…. an Overview
• Indexing is a crucial part of any information retrieval system. It is a challenging task
that requires attention to many theoretical and practical issues. While the move
towards digital information systems and automated indexing is thought to have
reduced the need for indexers in some areas, professional indexers are still much
needed, and in fact the electronic environment has posed new challenges
for indexers.
• Indexing is more a process of extraction than of
content analysis.
• The terms in an index represent certain concepts.
4. Subject Indexing and Subject Retrieval
• Subject indexing can be described as a system of classifying without notation. It
is the core theme of information science.
• Today subject retrieval is facilitated through the use of structured databases.
• The items that are retrieved are listed in the index.
• In OPACs, indexing is done manually to determine what a
resource is about. Once identified, this aboutness is translated
into the language of the vocabulary.
5. Schematic Illustration…
Conception of Subject Analysis and Indexing | Type of Subject Information | Indexing Method
Simplistic Conception | Explicit Information | Extraction
Content-Oriented Conception | Implicit Information | Assignment
Requirement-Oriented Conception | |
6. Early use of computers for Information Retrieval
• In 1948 a “machine called the Univac”, capable of searching for text references
associated with a subject code, was described.
• The machine could process “at the rate of 120 words per minute”. It appears that
this is the first reference to a computer being used to search for content.
• The impact of computers in IR was highlighted when Hollywood drew public
attention to the innovation with the comedy “Desk Set”, which came out in 1957.
It centred on a group of reference librarians who were about to be replaced by a
computer.
• IR as a research discipline was starting to emerge at this time with two important
developments: how to index documents and how to retrieve them.
7. Indexing and Information Retrieval… A Chronology
• Mortimer Taube’s Uniterm system was essentially a proposal to
index items by a list of keywords. As simple an idea as this seems today, it
was a radical step at the time.
8. Ranked retrieval
•The ranked retrieval approach to search was taken up by IR researchers,
who over the following decades refined and revised the means by which documents
were sorted in relation to a query.
•The superior effectiveness of this approach over Boolean search was demonstrated
in many experiments over those years.
• Work in the 1950s established computers as the definitive tool for search.
9. 1960s …
• The 1960s witnessed the formalization of algorithms to rank documents relative to a
query.
• Relevance feedback was also introduced: a process to support iterative search, where documents previously retrieved
could be marked as relevant in an IR system.
• Versions of this process are used in modern search engines, such as the “Related articles”
link on Google Scholar.
10. 1970s…
• One of the key developments of this period was that Luhn’s term frequency (tf) weights
(based on the occurrence of words within a document) were complemented by
Spärck Jones’s work on word occurrence across a collection, which introduced the idea of
inverse document frequency (idf).
• An alternative means of modelling IR systems involved extending Maron, Kuhns and
Ray’s idea of using probability theory.
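The combination of tf and idf weights mentioned on this slide can be sketched as follows. This is a minimal illustration rather than any particular system's implementation: the function name `tf_idf`, the whitespace tokenizer, the plain `tf * log(N / df)` weighting and the three sample documents are all assumptions made for the example.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document, with weight = tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # term frequency within this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Hypothetical three-document collection.
docs = [
    "indexing of documents for retrieval",
    "retrieval of documents by ranking",
    "manual indexing by human indexers",
]
weights = tf_idf(docs)
```

In this toy collection the word "for" occurs in one document only and so receives a higher weight in the first document than "of", which occurs in two.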
11. 1980s – mid 1990s
•Building on the developments of the 1970s, variations of tf idf weighting schemes were
produced and the formal models of retrieval were extended.
•The original probabilistic model did not include tf weights and a number of researchers
worked to incorporate them in an effective and principled way.
• Amongst other achievements, this work ultimately led to the ranking function BM25,
which has proven to be highly effective and is still commonly used.
•Advances on the basic vector space model were also developed and probably the most
well-known is Latent Semantic Indexing (LSI).
12. Mid 1990s – present
•The arrival of the web initiated the study of new problems in IR.
• Search engine developers quickly realised that they could use the links between
web pages to construct a crawler or robot to traverse and gather most web pages on
the internet.
• The first full-text search engine using a crawler was WebCrawler, released in 1994.
14. What is an index?
• A database where information (after being collected, parsed and
processed) is stored to allow for quick retrieval.
•Association of descriptors (keywords, concepts, metadata) to documents in
view of future retrieval
•The knowledge / expectation / behavior of the searcher needs to be
anticipated
15. Example of Indexing using POPSI
A report on the treatment of infectious disease of the lungs in India during 1982-85
Discipline: Medical Science
Entity: Lung
Property: Infectious disease
Action: Treatment
Space modifier: India
Time modifier: 1982-85
Form modifier: Report
Subject heading
MEDICAL SCIENCE, LUNG
infectious disease, treatment, India, 1982-85
INFECTIOUS DISEASE, TREATMENT
medical science, lung, India, 1982-85
Cross Reference
Therapeutics see Treatment
Therapy see Treatment
16. Manual and Automatic Indexing
•Manual
•Human indexers assign index terms to documents
•A computer system may be used to record the descriptors generated by
the human
•Automatic
•The system extracts “typical”/ “significant” terms
•The human may contribute by setting the parameters or thresholds, or
by choosing components or algorithms
•Semi-automatic
•The system’s contribution may be supported in terms of word lists,
thesauri, reference system, etc, following or not the automatic
processing of the text
17. Manual vs. Automatic Indexing
•Manual
•Slow and expensive
•Is based on intellectual judgment and semantic interpretation (concepts, themes)
•Low consistency
•Automatic
•Fast and inexpensive
•Mechanical execution of algorithms, with no intelligent interpretation (aboutness /
relevance)
•Consistent
18. Vocabulary
•Vocabulary (indexing language)
•The set of concepts (terms or phrases) that can be used to index
documents in a collection
•Controlled
•Specific for specialized domains
•Potential for increased consistency of indexing and precision of
retrieval
•Un-controlled (free)
•Potentially all the terms in the documents
•Potential for increased recall
19. Thesauri
•Capture relationships between indexing terms
•Hierarchical
•Synonymous
•Related
•Creation of thesauri
•Manual vs. automatic
•Use of thesauri
•In manual / semi-automatic / automatic fashion
•Syntagmatic co-ordination / thesaurus-based query expansion during
indexing / searching
21. Role of Indexing in Information Retrieval
[Figure: flow diagram of the role of indexing in information retrieval. Labels: Population of Documents; Selected documents; Indexing; Document Description; Document Store; System Vocabulary; Database in printed or electronic form; Search Strategy; Information Needs; Population of database users.]
22. Usage in Modern Day Search Engines
- Vikas Bhushan
Search Engines
Use of search engines
Types of Search Engines
Software Components in Search Engines
Pictorial representation of Components
How Search Engines Work, with a Model
Post-coordinate Indexes
23. Search engines: An initiative towards correct
retrieval from a Labyrinth of Ideas
Search engines do not search only for keywords; some
search for other things as well,
and they are not really “engines” in the classical sense,
but then a mouse is not a “mouse” either.
Rather, they are computer programs that search for
particular keywords and return a list of documents in which
they were found, especially services that scan documents on
the Internet.
24. Types of Search Engines
Crawler Based – Google, AltaVista
Human Based – Yahoo directory, Open directory, LookSmart
Hybrid Models – Yahoo, Google
Meta Search Engines – Dogpile, MetaCrawler
25. Use of search engines
… among others
As WebCrawler founder Brian Pinkerton puts it, "Imagine walking up to a librarian and
saying, 'travel'. They’re going to look at you with a blank face."
26. Components in the Back-end & Front-end process
Software Components
Back-end: Crawler/Spider, Indexer, Index File Database
Front-end: Search Engine Interface, Query Parser, Ranking Mechanism
Ranking mechanisms:
Google uses PageRank
Teoma uses ExpertRank
Yahoo uses TrustRank
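28. How Search Engines Work
(Sherman 2003)
[Figure: the crawler gathers pages (URL1 to URL4) from the Web and passes them to the indexer, which builds the search engine database. A query from your browser (“Eggs?”) is matched against the database and ranked results are returned: Eggs - 90%, Eggo - 81%, Ego - 40%, Huh? - 10%. Top result: “All About Eggs” by S. I. Am.]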
29. Post-coordinate Indexes
An information retrieval system that allows the searcher to combine terms in
any way is frequently referred to as post-coordinate.
A modern computer-based system, operated online, can be considered a
direct descendant of the earlier manual systems.
The files of an online system comprise two major elements:
1. A complete set of document representations: bibliographic references, or
something similar to a search engine database.
2. A list of terms, sometimes referred to as an inverted file or a postings file.
Continued…
30. The subject matter discussed in a document, and represented by the index terms
assigned to it, is multidimensional in character.
Consider, for example, an article discussing
“Political Contenders in Assembly Polls of Karnataka”.
It has been indexed under the following terms:
Political Contenders
Constituencies
Assembly Polls
Karnataka
Post-coordinate Indexes…
32. Information Retrieval System Represented as a Matrix
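[Matrix figure: index terms A to H (rows) against document numbers 1 to 14 (columns); an X marks each document indexed under a term. The column positions of the marks were lost in extraction. Marks per row: A 2, B 4, C 3, D 2, E 6, F 1, G 4, H 5.]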
34. Current trends and applications
The web creates new challenges for information retrieval. The amount of information on the web is
growing rapidly, as well as the number of new users inexperienced in the art of web research.
Automated search engines that rely on keyword matching usually return too many low quality matches.
A large-scale search engine makes heavy use of the additional structure present in hypertext to
provide much higher quality search results.
35. What is XML Indexing?
XML indexing is a form of embedded indexing in which
tags are inserted into an XML document to mark the
occurrences of indexable terms or topics.
The client’s publishing process automatically generates
an index from these index elements. Because
this automated process handles all layout and
formatting, the indexer does not need to be
concerned with these issues.
36. What makes it work?
Index entries in DocBook are encoded using the <indexterm> parent element and its five
child elements. These are summarized below:
• <indexterm> element: wrapper element for an index entry of any type.
• <primary> element: main entry.
• <secondary> element: subentry.
• <tertiary> element: sub-subentry.
• <see> element: ‘see’ references.
• <seealso> element: ‘see also’ references.
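As an illustration of how a publishing process might turn embedded index elements into an index, the sketch below parses a small DocBook-style fragment with Python's standard `xml.etree.ElementTree` module. The sample fragment, the function name `build_index` and the output layout are assumptions made for this example; a real DocBook toolchain also handles tertiary entries, page locators and sorting rules.

```python
import xml.etree.ElementTree as ET

# Hypothetical DocBook-style fragment with embedded <indexterm> elements.
SAMPLE = """
<section>
  <para>Indexing languages may be controlled
    <indexterm><primary>vocabulary</primary><secondary>controlled</secondary></indexterm>
    or free.
    <indexterm><primary>vocabulary</primary><secondary>uncontrolled</secondary></indexterm>
  </para>
  <para>A thesaurus records term relationships.
    <indexterm><primary>thesauri</primary></indexterm>
    <indexterm><primary>therapy</primary><see>treatment</see></indexterm>
  </para>
</section>
"""

def build_index(xml_text):
    """Collect {main entry: sorted list of subentries and 'see' references}."""
    index = {}
    root = ET.fromstring(xml_text)
    for entry in root.iter("indexterm"):
        primary = entry.findtext("primary")
        if primary is None:
            continue  # skip malformed entries that lack a main heading
        subs = index.setdefault(primary, set())
        secondary = entry.findtext("secondary")
        if secondary:
            subs.add(secondary)
        see = entry.findtext("see")
        if see:
            subs.add("see " + see)
    return {term: sorted(subs) for term, subs in sorted(index.items())}

index = build_index(SAMPLE)
for term, subs in index.items():
    print(term, subs)
```

The indexer only marks up the terms; ordering and layout are left to the code that generates the index.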
37. Future hopes for Indexers
Indexers should offer XML-based services, which are a prerequisite for joining
the digital publishing revolution.
Indexers are good with structures, and the use of XML indexing in publishing is
about imposing structure on text.
38. Results and Performance
The most important measure of a search
engine is the quality of its search results.
Here we highlight the performance and
experience with Google. It produces
better results than the major commercial
search engines for most searches.
41. Models for Information Retrieval
Boolean or vector space model of IR (Information Retrieval)
- In these models, matching is done in a formally defined but semantically imprecise calculus of index terms.
There are a number of retrieval models that operate on a probabilistic basis.
The Binary Independence Model (BIM) is the original and still the most influential of the
probabilistic retrieval models.
Contd…
42. OKAPI BM25: model for Information Retrieval
The BIM was originally designed for short catalogue records and
abstracts of fairly consistent length.
For modern full-text search collections, a model should pay attention to term frequency and
document length.
The BM25 weighting scheme, often called Okapi weighting after the system in which it was first
implemented, was developed as a way of building a probabilistic model sensitive to these quantities.
Contd…
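43. The score of any document under OKAPI is determined through the following
equations, where N is the number of documents in the collection and df_t is the number of documents that contain term t:
Equation 1. The simplest score for document d is just idf weighting of the query terms present:
$$\mathrm{RSV}_d = \sum_{t \in q} \log \frac{N}{\mathrm{df}_t}$$
Equation 2. Sometimes, an alternative version of idf is used. If we start with the formula but, in the
absence of relevance feedback information, estimate that S = s = 0, then we get an
alternative idf formulation as follows:
$$\mathrm{RSV}_d = \sum_{t \in q} \log \frac{N - \mathrm{df}_t + \frac{1}{2}}{\mathrm{df}_t + \frac{1}{2}}$$
Contd…
44. Equation 3. We can improve on Equation 1 by factoring in the frequency of each term and
document length, where tf_td is the frequency of term t in document d, L_d is the length of document d, L_avg is the average document length in the collection, and k_1 and b are tuning parameters:
$$\mathrm{RSV}_d = \sum_{t \in q} \log\left[\frac{N}{\mathrm{df}_t}\right] \cdot \frac{(k_1 + 1)\,\mathrm{tf}_{td}}{k_1\left((1 - b) + b\,(L_d / L_{avg})\right) + \mathrm{tf}_{td}}$$
Equation 4. If the query is long, then we might also use similar weighting for query terms, where tf_tq is the frequency of term t in the query and k_3 is a further tuning parameter. This
is appropriate if the queries are paragraph-long information needs, but unnecessary
for short queries:
$$\mathrm{RSV}_d = \sum_{t \in q} \log\left[\frac{N}{\mathrm{df}_t}\right] \cdot \frac{(k_1 + 1)\,\mathrm{tf}_{td}}{k_1\left((1 - b) + b\,(L_d / L_{avg})\right) + \mathrm{tf}_{td}} \cdot \frac{(k_3 + 1)\,\mathrm{tf}_{tq}}{k_3 + \mathrm{tf}_{tq}}$$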
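The Equation 3 form of scoring (idf multiplied by a term frequency factor normalized for document length) can be sketched in Python as below. This is an illustrative sketch only: the function name `bm25_scores`, the whitespace tokenizer, the sample documents and the parameter values k1 = 1.2 and b = 0.75 are assumptions chosen for the example, not values given in the slides.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document against the query using idf, tf and length scaling."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Length normalization: b = 0 switches it off, b = 1 applies it fully.
        length_norm = (1 - b) + b * (len(tokens) / avg_len)
        score = 0.0
        for term in set(query.lower().split()):
            if tf[term] == 0:
                continue  # query term absent from this document
            idf = math.log(n_docs / df[term])
            score += idf * ((k1 + 1) * tf[term]) / (k1 * length_norm + tf[term])
        scores.append(score)
    return scores

# Hypothetical three-document collection.
docs = [
    "indexing supports retrieval",
    "retrieval retrieval ranking models",
    "thesauri capture term relationships",
]
scores = bm25_scores("retrieval ranking", docs)
```

With these sample documents the second one scores highest because it contains both query terms, and the third scores zero because it contains neither.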
46. • For the implementation of indexing services, individual indexers may prefer different approaches.
• The effectiveness of an index as a search tool will depend on the number of access points provided.
• Different factors influence the recall and precision measures for any retrieved information.
• Indexing and its usage can be made more sophisticated through the application of certain concepts, such as:
Weighted Indexing
Linking of terms
Role Indicators
Subheading
Index Language Device
Conclusion: Enhancement of Indexing Procedures
47. • Many automatic systems include some form of weighting to allow ranking of results.
• Weighted indexing gives the searcher the freedom to vary the exhaustivity of the index.
• Simple binary assignment of terms simplifies the process of indexing, but it cannot show how substantially a topic is treated.
• Weighted indexing assigns a numerical value to individual terms.
• A weighted index offers two ways of retrieval from the database: by major and by minor descriptors.
Weighted Indexing
48. • Linking of terms supports efficient and timely retrieval of appropriate and correct information.
• Inappropriate or irrelevant responses can be avoided by reducing the exhaustivity of the index.
• Unwanted or false associations are removed.
• False associations are avoided by linking index terms.
Linking of terms
49. • Role indicators play an important part in the retrieval of accurate information.
• They use syntax to reduce ambiguity.
• Role indicators were introduced into retrieval systems in the early 1960s.
• The first of its kind was the Engineers Joint Council (EJC) set of role indicators.
• The document surrogate was a ‘telegraphic abstract’ prepared by means of a ‘semantic code dictionary’.
Role Indicators
50. Subheadings
• With the advent of automated systems, the need for retrieval of precise information gained importance.
• The problem of false or ambiguous associations is now less severe.
• Subheadings can be applied in much of post-coordinate indexing.
• They are successful in reducing ambiguities in the searching of electronic databases.
51. Index Language Devices
Precision Device
Weighting
Links
Role indicators
Recall Device
Subheadings
Synonym control
Inverse Relation
52. Before we Conclude…
• The entire discussion was based on the application of indexing techniques and principles to the design of search
engines.
• The aim is to develop software tools that allow the user to perform relatively specific subject searches related
to resources of any type.
• Search engines operate by building ‘indexes’ to the network resources.
• The concept of Boolean logic is followed for searching purposes.
• Search engines use inverted indexes.
53. Conclusion
Today the internet has become versatile and is treated as a significant source of information. The
transition from traditional to electronic forms of information resources has paved the way for the creation
of various software tools.
These provide enhanced navigation among resources available in electronic form and within a networked
environment. However, various studies indicate that there is much ground to be covered before machines
become intelligent enough to completely replace humans. As of now the role of the human indexer remains
indispensable.
Thus, in days to come, improved indexing techniques and principles will surely be developed, thereby
ensuring efficient and timely retrieval of information from a digitized environment.
The 1960s saw a wide range of activities reflecting the move from simply asking if IR
was possible on computers to determining means of improving IR systems. One of
these areas is the formalization of algorithms to rank documents relative to a query.
Another significant innovation at this time was the introduction of relevance feedback.
This was a process to support iterative search, where documents previously retrieved
could be marked as relevant in an IR system. A user’s query was automatically adjusted
using information extracted from the relevant documents.
Versions of this process are used in modern search engines, such as the “Related articles”
link on Google Scholar.
Spärck Jones observed that the frequency of occurrence
of a word in a document collection was inversely proportional to its significance in
retrieval.
Robertson defined the probability ranking principle,
which determined how to optimally rank documents based on probabilistic measures with
respect to defined evaluation measures.
The arrival of the web initiated the study of new problems in IR. This point also
marked a time when the interaction between the commercial and research oriented
IR communities was much stronger than it had been before. Ideas developed in earlier
years were pushed further and implemented in the commercial search sector.
Search engine developers quickly realised that they could use the links between
web pages to construct a crawler or robot to traverse and gather most web pages on
the internet; thereby automating acquisition of content. The first full text search engine
using a crawler was WebCrawler released in 1994.
The applications of search and the field of information retrieval continue to evolve as
the computing environment changes. The most obvious recent example of this type of
change is the rapid growth of mobile devices and social media. One response from the
IR community has been the development of social search, which deals with search
involving communities of users and informal information exchange.
Before I proceed with discussions on usage of indexing principles in searching, a few words on what a search engine is and some of its functional features.
Labyrinth (noun):
1. A complicated irregular network of passages or paths in which it is difficult to find one's way; a maze.
2. An intricate and confusing arrangement.
Crawler-based search engines are good when you have a specific search topic in mind and can be very efficient in finding relevant information in this situation. However, when the search topic is general, crawler-based search engines may return hundreds of thousands of irrelevant responses to simple search requests, including lengthy documents in which your keyword appears only once.
Human-powered directories are good when you are interested in a general topic of search. In this situation, a directory can guide you and help you narrow your search and get refined results. Search results found in a human-powered directory are therefore usually more relevant to the search topic and more accurate. However, this is not an efficient way to find information when a specific search topic is in mind.
Hybrid search engines use a combination of both crawler-based results and directory results.
Meta-search engines are good for saving time by searching only in one place and sparing the need to use and learn several separate search engines. But since meta-search engines do not allow for input of many search variables, their best use is to find hits on obscure items or to see if something can be found using the Internet.
Unfortunately, search engines don't have the ability to ask a few questions to focus your search, as a librarian can. They also can't rely on judgment and past experience to rank web pages in the way humans can. So, how do crawler-based search engines go about determining relevancy when confronted with hundreds of millions of web pages to sort through? They follow a set of rules, known as an algorithm. Exactly how a particular search engine's algorithm works is a closely kept trade secret.
One of the main rules in a ranking algorithm involves the location and frequency of keywords on a web page. Call it the location/frequency method.
Remember the librarian mentioned above? They need to find books to match your request of "travel," so it makes sense that they first look at books with travel in the title. Search engines operate the same way. Pages with the search terms appearing in the HTML title tag are often assumed to be more relevant than others to the topic.
Search engines will also check to see if the search keywords appear near the top of a web page, such as in the headline or in the first few paragraphs of text. They assume that any page relevant to the topic will mention those words right from the beginning.
Frequency is the other major factor in how search engines determine relevancy. A search engine will analyze how often keywords appear in relation to other words in a web page. Those with a higher frequency are often deemed more relevant than other web pages.
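The location/frequency method described above can be sketched as a toy scoring function. Real search engine algorithms are trade secrets, so everything here is an assumption made for illustration: the function name, the boost values, the 50-word cut-off for "near the top" and the two sample pages.

```python
def location_frequency_score(keyword, title, body,
                             title_boost=2.0, early_boost=1.0, early_words=50):
    """Toy relevance score from keyword location and frequency (illustrative weights)."""
    keyword = keyword.lower()
    title_words = title.lower().split()
    body_words = body.lower().split()
    score = 0.0
    if keyword in title_words:
        score += title_boost   # keyword appears in the page title
    if keyword in body_words[:early_words]:
        score += early_boost   # keyword appears near the top of the page
    if body_words:
        # Frequency of the keyword relative to all words on the page.
        score += body_words.count(keyword) / len(body_words)
    return score

# Hypothetical pages: name -> (title, body text).
pages = {
    "guide": ("Travel guide to India", "travel tips and travel routes for first time visitors"),
    "recipes": ("Egg recipes", "cooking eggs at home before you travel"),
}
ranked = sorted(
    pages,
    key=lambda name: location_frequency_score("travel", *pages[name]),
    reverse=True,
)
print(ranked)  # ['guide', 'recipes']
```

The page with the keyword in its title, near the top and at a higher relative frequency is ranked first.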
• Crawler/Spider: A crawler moves around the web, visits web pages by following their links, and downloads the page content to the local database of the search engine.
• Indexer: An indexer breaks a page into its components, typically keywords, and analyzes them. Elements such as titles, headings, links, text, constructs, bold, italic, and other styled portions of a page are separated and analyzed.
• Index file database: The index file database is a catalog containing a copy of every page that the crawler scans and the corresponding index information generated by the indexer. When a crawler returns to your website and scans it again, this catalog is updated.
• Query parser: This is the software used to analyze the queries input by users, by converting each query into the same representation as the documents in the index. In this way, a comparison can be made between the query and each document.
• Ranking mechanism: This is the algorithm that matches the users’ keywords with the pages in the database and comes up with matches. These matches are then displayed on the users’ screen in decreasing order of their relevance to the input query.
• Search engine interface: This is the portion of a search engine that a user interacts with when performing a search and receiving the returned results.
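How these components fit together can be shown with a very small sketch. It assumes a shared word-count representation for pages and queries and a ranking by the number of matched term occurrences; the function names and the three sample pages are hypothetical, and the crawler step is replaced by a hard-coded dictionary.

```python
from collections import Counter

def to_terms(text):
    """Shared representation: lowercase word counts, used for pages and queries alike."""
    return Counter(text.lower().split())

def build_index_db(pages):
    """Indexer step: map each URL to the term counts of its page text."""
    return {url: to_terms(text) for url, text in pages.items()}

def search(query, index_db):
    """Query parser plus ranking: score pages by occurrences of the query terms."""
    query_terms = to_terms(query)
    scored = []
    for url, terms in index_db.items():
        score = sum(terms[t] for t in query_terms)
        if score > 0:
            scored.append((score, url))
    # Highest score first; ties are broken by URL to keep the order stable.
    return [url for score, url in sorted(scored, key=lambda item: (-item[0], item[1]))]

# Hypothetical pages standing in for what a crawler would download.
pages = {
    "url1": "all about eggs and how to cook eggs",
    "url2": "eggo waffles",
    "url3": "boiled eggs",
}
results = search("eggs", build_index_db(pages))
print(results)  # ['url1', 'url3']
```

Because the query goes through the same `to_terms` function as the pages, the two can be compared directly, which is the role given to the query parser in the list above.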
The application of indexing techniques in information organization is directed towards efficient, precise and timely retrieval of information. For this, post-coordination is followed. Post-coordinate indexes provide flexibility when index terms are obtained from automated and digitized information systems. Present-day search engines apply the concept of post-coordination in indexing.
A complete set of document representations : Bibliographic reference, usually accompanied by index terms or abstract or both.
The following illustration depicts a networked relationship among the four index terms identified in the previous example. A query may use any of the indexed terms in any combination, so a user should be able to retrieve this document in a search involving any single term or any combination of them: any two terms, any three, or all four.
What happens in an online search is demonstrated through a reference matrix in this slide. Suppose the searcher enters the term POLITICAL CONTENDERS at a terminal, and this is represented by H in the diagram. The system responds by indicating that five items have been indexed under the term. The searcher enters ASSEMBLY POLLS (represented by E in the diagram) and is told that six items appear under this term.
If the searcher now asks that E be combined with H, the system compares the document numbers on these two lists and indicates that four items satisfy the requirement. When told to do so by the searcher, the computer finds these records by their identifying numbers
(1, 7, 8, 14) and displays them or prints them out. This procedure remains the same however many terms are involved and whatever the logical
relationships specified by the searcher.
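The comparison of two lists of document numbers described above can be sketched as a merge of two sorted postings lists. Only the list sizes (six for E, five for H) and the overlap (1, 7, 8, 14) come from the example; the other document numbers and the function name `intersect` are made up for illustration.

```python
# Hypothetical postings file: term -> sorted list of document numbers.
# Only the sizes of the two lists and their overlap (1, 7, 8, 14) come from
# the example in the text; the remaining document numbers are invented.
postings = {
    "ASSEMBLY POLLS": [1, 3, 7, 8, 10, 14],     # term E, six items
    "POLITICAL CONTENDERS": [1, 5, 7, 8, 14],   # term H, five items
}

def intersect(a, b):
    """Merge two sorted postings lists, keeping numbers present in both (Boolean AND)."""
    result, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result

hits = intersect(postings["ASSEMBLY POLLS"], postings["POLITICAL CONTENDERS"])
print(hits)  # [1, 7, 8, 14]
```

Further terms can be combined by intersecting the result with the next postings list, which is why the procedure stays the same however many terms are involved.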
Thank you, Vikas. I am the fourth speaker on this topic. My name is Debashis Naskar, a first-year DRTC student. Before proceeding with the slides, I wish a very good afternoon to everyone: our respected teachers, our research scholars and our friends.
In my portion, the discussion concerns the current status and applications of indexing principles in searching. I begin with an introduction to the web and how it is being used for information retrieval. Nowadays the web creates new challenges for information retrieval. The amount of information on the web is growing rapidly, as is the number of new users inexperienced in the art of web research. People are likely to surf the web using its link graph, often starting with high-quality human-maintained indices such as Yahoo! or with search engines. Human-maintained lists cover popular topics effectively but are subjective and cannot cover all esoteric topics. Automated search engines that rely on keyword matching usually return too many low-quality matches. HTML is syntactically rich, whereas XML is semantically rich.
Now I will say a few words about XML indexing. XML indexing is a technique currently used to provide identification for information resources in electronic form. The indexer should be able to provide correct identification to facilitate better navigation and timely retrieval of appropriate information. XML indexing is a form of indexing that uses various tags. You can create indexes on your XML data to focus on the particular parts of it that you query often, and thus improve performance. The automated process handles formatting matters such as word-by-word or letter-by-letter alphabetization.
Now the encoding scheme for DocBook is described briefly. This is an example of XML indexing. Index entries in DocBook are encoded using the <indexterm> element, which is the parent element, and its five child elements: <primary>, <secondary>, <tertiary>, <see> and <seealso>.
These elements and their attributes provide for all standard features, including main entries, subentries, alternate alphabetization order, single page locators and page ranges, and see and see also references. They are summarized below:
<indexterm> element: wrapper element for an index entry of any type.
<primary> element: used for the main entry.
<secondary> element: used for a subentry.
<tertiary> element: used for a sub-subentry.
<see> element: used with <primary> or <secondary> for ‘see’ references.
<seealso> element: used with <primary> or <secondary> for ‘see also’ references.
Here a few possibilities that are open to indexers in the near future are discussed. If indexers can offer XML-based services, publishers will use them.
Learning XML is not hard for anyone who can think rigorously. Indexers are good with structures and, fundamentally, XML in publishing is about the imposition of
structure on text. It is often believed that general text search will remove the need for digital indexing.
However, this is not the case, and indexers should try to educate publishers to make sure they understand this. In years to come, many books will be published that never appear in print; they will exist only in electronic versions.
Indexing principles are widely applied nowadays in search engines. I now talk in general terms about Google, which is the most widely used search engine. I have mainly highlighted the performance of and experience with Google, which produces better results than the other commercial
search engines.
This is a bar graph representing the usage of various search engines. As you can see, Google is the most widely used search engine on a global scale.
Here I show one year of statistics on search engine usage, as depicted in the previous slide. We can see that from April 2012 to April 2013 Google was used far more than the other search engines.
For their various information needs, users formulate queries, which are translated into an appropriate representation. Documents are likewise converted into document representations. Based on these representations, a system tries to determine how well documents satisfy information needs. Probability theory provides a principled foundation for such reasoning.
With this I introduce the OKAPI BM25 weighting scheme, which is an example of a non-binary model, that is, one that uses term counts.
Okapi BM25 is a ranking function used by search engines to rank matching documents according to their relevance to a given search query. It is based on the probabilistic retrieval framework. The BIM (Binary Independence Model) was originally designed for short catalogue records and abstracts of fairly consistent length, whereas BM25 is meant for full-text search collections. It is also called the BM25 weighting scheme or Okapi weighting. Here an overview is given of the series of forms that build up to the standard form currently used for document scoring.
Here we can see a set of equations that give the score of a document. The score helps us retrieve the information the user actually needs. Now let us move to the equations.

The first equation relates to the right-hand side of the log function. If we take u_t to be df_t/N, the first equation follows. Here idf means inverse document frequency, and RSV_d means the Retrieval Status Value, which is the score of document d and is built from the idf values. How do we get this value? N is the number of documents in the whole collection and df_t is the number of documents that contain the term t. We divide N by df_t, take the log, and sum over the query terms to get RSV_d.

Why do we use the log function? Google returns a huge amount of information, and the log function is needed to compress the range of the weights so that the scores stay manageable. Let me explain with an example. (Suppose you are searching for Barack Obama in connection with the White House. Searching on the keyword Barack Obama alone returns a lot of information, some of it relevant and some of it irrelevant, that is, about Barack Obama but not related to the White House. We need only the relevant information, and this log-based score helps rank it first.)

The second equation is the basic form of the same idea. According to the Okapi model, if we put 0 in place of S, the final equation reduces to this RSV_d equation. Why do we add one half? To avoid the possibility of zeros. Adding 1/2 to each count is fairly standard: it is a smoothing value that adjusts the marginal counts.

The third equation builds on the first and improves it. Here tf_td is the frequency of term t in document d, L_d is the length of the document, and L_avg is the average document length in the whole collection. k1 is a tuning parameter that calibrates the term frequency scaling, and b is a tuning parameter for scaling by document length: b = 1 means fully scaling the term weight by document length, and b = 0 means no length normalization. Evaluated properly, this equation gives a more accurate score.

The last equation is used for long queries, such as a sentence or a paragraph, rather than a few keywords. That is why it uses k3, another positive tuning parameter, which this time calibrates the term frequency scaling of the query.

So this model is very useful for scoring the relevant information in good time.
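As a rough illustration of the third equation, the sketch below scores documents with the length-normalized form that is commonly given for Okapi BM25. It assumes the slide's equations follow that usual textbook form. The function name `bm25_score`, the toy collection, and the default values k1 = 1.2 and b = 0.75 are illustrative assumptions, not values taken from the presentation. The k3 query factor for long queries is left out.

```python
import math

def bm25_score(query_terms, doc_terms, doc_freq, n_docs, avg_len, k1=1.2, b=0.75):
    """Sum, over the query terms, idf times a length-normalized tf factor."""
    doc_len = len(doc_terms)
    score = 0.0
    for term in set(query_terms):
        df = doc_freq.get(term, 0)      # number of documents containing the term
        tf = doc_terms.count(term)      # frequency of the term in this document
        if df == 0 or tf == 0:
            continue                    # the term contributes nothing
        idf = math.log(n_docs / df)     # inverse document frequency
        length_norm = k1 * ((1 - b) + b * doc_len / avg_len)
        score += idf * ((k1 + 1) * tf) / (length_norm + tf)
    return score

# Toy collection (hypothetical data), already tokenized.
docs = {
    "d1": "obama white house press briefing".split(),
    "d2": "obama book tour".split(),
    "d3": "white house garden".split(),
    "d4": "football scores".split(),
}
n_docs = len(docs)
avg_len = sum(len(terms) for terms in docs.values()) / n_docs

# Document frequency: in how many documents each term occurs.
doc_freq = {}
for terms in docs.values():
    for term in set(terms):
        doc_freq[term] = doc_freq.get(term, 0) + 1

query = ["obama", "white", "house"]
ranking = sorted(
    docs,
    key=lambda d: bm25_score(query, docs[d], doc_freq, n_docs, avg_len),
    reverse=True,
)
print(ranking)
```

With b = 0 and a term that occurs once in the document, the tf factor becomes 1 and the score reduces to the plain idf sum of the first equation.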
Throughout the previous discussion several approaches have been observed. One significant point
is that an indexing service can be implemented using any of numerous approaches. Which one is
preferable depends largely on how the tool is to be used.
For current awareness purposes, those tools that use some form of classified approach will usually
be superior to the alphabetico-specific indexes. For example, someone interested in keeping current
with new developments in parasitology in general would find BIOLOGICAL ABSTRACTS more useful
than INDEX MEDICUS. This is because in the latter, references to the subject are likely to be scattered
over a wide variety of subject headings. For someone whose current awareness interests are highly
specific, the alphabetico-specific approach might actually be more convenient.
In considering these various tools as search and retrieval devices, the performance factors should
also be discussed. The effectiveness of an index as a search tool will depend on the number
of access points provided, the specificity of the vocabulary used to index, the quality and consistency
of the indexing, and the extent to which the tool offers positive help to the searcher (e.g., by linking
semantically related terms).
Contd…
Many automatic systems include some form of weighting to allow the ranking of output. Weighted
indexing gives the searcher the freedom to vary the exhaustivity of the index. Much subject indexing
entails a simple binary decision: a term is either assigned to the document or it is not. While this
simplifies the process of indexing, it does create some problems for the user of a database, who cannot
devise a search strategy that will distinguish items in which a topic receives substantial treatment
from those in which it is dealt with only in a minor way.
In weighted indexing, indexers can assign to a term a numerical value that reflects their opinion on how
important that term is in indicating what a particular document is about. The more intensively a subject
is treated, the higher its weight, and the highest weights go to the central theme. This is subjective, and
different indexers may assign different weights.
Weighted indexing of this type can be used in two ways in retrieval from a database. One way is simply
to allow a searcher to specify that only items indexed under a term carrying a particular weight should be
retrieved. The alternative application is to use the weights to rank the items retrieved in a search.
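The two uses of weights described above can be sketched as follows. This is a minimal illustration only. The index contents, the 1 to 5 weight scale, and the function names are hypothetical and are not taken from any real database.

```python
# Hypothetical weighted index: term -> {document id: weight assigned by the indexer}.
weighted_index = {
    "parasitology": {"D1": 5, "D2": 2, "D3": 4},
    "malaria": {"D1": 1, "D3": 5},
}

def search_min_weight(index, term, min_weight):
    """First use: retrieve only items indexed under the term with at least this weight."""
    postings = index.get(term, {})
    return sorted(doc for doc, weight in postings.items() if weight >= min_weight)

def search_ranked(index, term):
    """Second use: retrieve every item under the term, highest weight first."""
    postings = index.get(term, {})
    return sorted(postings, key=lambda doc: (-postings[doc], doc))

print(search_min_weight(weighted_index, "parasitology", 4))
print(search_ranked(weighted_index, "parasitology"))
```

The first function trades recall for precision by cutting off low-weight items. The second keeps every item and leaves the cut-off to the searcher.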
The assignment of numerical weights to terms was first advocated by Maron and Kuhns in 1960.
Maron referred to this type of indexing as ‘probabilistic’.
Some databases incorporate a weighting technique by distinguishing between ‘major’ and ‘minor’
descriptors, which is equivalent to adopting a numerical scale having two values.
Minor descriptors are those that are associated only with the database in electronic form. This
practice is followed at the National Library of Medicine (Index Medicus and the Medline database),
the National Technical Information Service (NTIS) and the Educational Resources Information
Centre (ERIC). Suppose (M) denotes ‘major’ descriptors and (m) denotes ‘minor’ descriptors;
then in an AND relationship a crude ranking of the output can be achieved as:
M * M
M * m
m * m
That is, items in which both of the searcher's terms are major descriptors come first, then items in
which one term is major and the other minor, and finally items in which both are minor.
Many automatic systems include forms of weighting to allow the ranking of output.
Most automatic processing systems weight by frequency criteria: the frequency of a term's occurrence
in a text and/or its frequency of occurrence in the database as a whole. Other methods have also been
tried, including the use of positional criteria.
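The crude M * M, M * m, m * m ranking can be sketched as below. This is an illustration under assumed data. The records, descriptor names, and the function `rank_and_search` are hypothetical.

```python
# Hypothetical records: each document lists its major (M) and minor (m) descriptors.
records = {
    "A": {"major": {"malaria", "vaccines"}, "minor": set()},
    "B": {"major": {"malaria"}, "minor": {"vaccines"}},
    "C": {"major": set(), "minor": {"malaria", "vaccines"}},
    "D": {"major": {"malaria"}, "minor": set()},
}

def rank_and_search(records, term_a, term_b):
    """AND search on two terms, ranked M*M first, then M*m, then m*m."""
    hits = []
    for doc_id, rec in records.items():
        all_terms = rec["major"] | rec["minor"]
        if term_a in all_terms and term_b in all_terms:
            # Count how many of the two search terms are major descriptors (2, 1 or 0).
            n_major = (term_a in rec["major"]) + (term_b in rec["major"])
            hits.append((n_major, doc_id))
    hits.sort(key=lambda hit: (-hit[0], hit[1]))
    return [doc_id for _, doc_id in hits]

print(rank_and_search(records, "malaria", "vaccines"))
```

Document D is not retrieved at all, because it carries only one of the two terms and so fails the AND condition.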
Contd…
For efficient and timely retrieval of appropriate and correct information, much emphasis should be laid on
document representation. Linking of terms is essential in this regard. In certain cases the documents
retrieved for a search might not be an appropriate and relevant response. This can be
avoided by the use of weighted indexing or by reducing the exhaustivity of the indexing.
Other unwanted retrievals can be caused by false associations, cases in which the terms that cause an item
to be retrieved are quite unrelated in the document. The probability of false associations occurring
increases with the length of the record (i.e. with the number of access points provided or with the
exhaustivity of the indexing).
One way of avoiding false associations is by linking index terms. That is, the document is partitioned
into several subdocuments, each dealing with a separate although closely related subject. All terms
in each link are directly related, and certain terms may appear in several links. Each link is identified by some
alphanumeric character that is carried in the database itself. In an online retrieval system this would be
associated with the document number in the inverted file. Thus a document with 1205 as its number
may be partitioned into subdocuments 1205/1, 1205/2, 1205/3 and so on. This gives the searcher the
opportunity to specify that two terms should co-occur not only in the document record but also in a
particular link within the document. This is how false associations can be avoided.
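The effect of links can be shown with a small sketch. The inverted file below is hypothetical. Only the document number 1205 and the 1205/1, 1205/2 notation come from the text above; the terms and the second document are invented for illustration.

```python
# Hypothetical inverted file: term -> set of (document number, link number) postings.
inverted = {
    "welding": {(1205, 1), (1301, 1)},
    "copper": {(1205, 2), (1301, 1)},
    "aluminium": {(1205, 1)},
}

def search_document(inverted, term_a, term_b):
    """Co-occurrence at document level only; this can produce false associations."""
    docs_a = {doc for doc, _ in inverted.get(term_a, set())}
    docs_b = {doc for doc, _ in inverted.get(term_b, set())}
    return sorted(docs_a & docs_b)

def search_link(inverted, term_a, term_b):
    """Co-occurrence within the same link (subdocument) of a document."""
    common = inverted.get(term_a, set()) & inverted.get(term_b, set())
    return sorted(f"{doc}/{link}" for doc, link in common)

print(search_document(inverted, "welding", "copper"))
print(search_link(inverted, "welding", "copper"))
```

In this toy data, document 1205 mentions welding in link 1 and copper in link 2, so the document-level search retrieves it although the two terms are unrelated there. The link-level search drops it.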
Role indicators play an important part in the retrieval of accurate information. The problem of unwanted
retrievals can be reduced if proper linking is done. At times, however, items are retrieved in which
the terms are genuinely related in the document but not in the way that interests the searcher. The
searcher is looking for them in a different context, so such terms, though they appear in the proper link,
are undesirable. Such problems arise from incorrect term relationships.
Thus for purposes of disambiguation it is necessary to introduce some syntax into the indexing. The
traditional method is to use role indicators (or relational indicators). These are codes that establish
explicit relationships.
Links and roles were both introduced into retrieval systems in the early 1960s. Post-coordinate
systems were then at a rudimentary stage and computer-based retrieval was in its infancy. The
first of its kind was the Engineers Joint Council (EJC) set of role indicators.
Even more elaborate than the EJC method of indexing using links and roles was the ‘semantic code’
approach to information retrieval.
The document surrogate was a ‘telegraphic abstract’. The terms in the telegraphic abstracts were encoded
by means of a ‘semantic code dictionary’.
The establishment of links, the implementation of role indicators and the semantic code all led to the
foundation of an approach to indexing that was highly structured. This was during the 1960s, when
computer-based systems were still very young. Gradually, with the advent of automated systems, the need for
retrieval of precise information gained importance.
The problems of false or ambiguous associations are now less severe than they were thirty or forty years
ago because a higher level of precoordination now exists in most systems.
In a postcoordinate system, subheadings can be applied in much the same way in which they are used for
traditional subject cataloguing.
The main justification for the use of subheadings was to facilitate the use of the Index Medicus. However,
the concept of subheadings was also largely successful in reducing ambiguities in the searching of
electronic databases. The inclusion of subheadings made things more comprehensible for the
users.
When it comes to the enhancement of indexing, the principal factors affecting recall and precision need
consideration. The recall and precision of any search depend on devices that have been categorized
into recall devices and precision devices. Precision devices include weighting, links, and role indicators.
These devices are quite independent of the index language. Recall devices include subheadings and
synonym control, which are integral components of the index language.
The complete array of such devices is known as Index Language Devices.
The entire discussion was based on the application of indexing techniques and principles to the design of
search engines that would facilitate efficient retrieval of relevant information. For a long time the main
concern has been to develop software tools that would allow the user to perform relatively specific
subject searches on resources of any type.
Search engines operate by building ‘indexes’ to the network resources. This means the extraction of words
or phrases from the text itself and the creation of files that allow efficient searching of such extracts.
Search engines use inverted indexes, which are built automatically. They also search for resources to
index. When an appropriate resource is found it is indexed. Other resources linked to the resource selected
may also be located and indexed. The techniques used are given various names including ‘crawlers’, ‘spiders’,
‘worms’.
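A minimal sketch of how an inverted index might be built automatically from extracted words is given below. The resource names and texts are hypothetical, the tokenizer is deliberately simple, and a real search engine would also crawl linked resources, which is not shown here.

```python
import re
from collections import defaultdict

def build_inverted_index(resources):
    """Map every extracted word to the set of resources in which it occurs."""
    index = defaultdict(set)
    for resource_id, text in resources.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(resource_id)
    return index

def search_all(index, query):
    """Return the resources that contain every word of the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    postings = [index.get(word, set()) for word in words]
    return sorted(set.intersection(*postings))

# Hypothetical resources standing in for pages found on the network.
resources = {
    "page1": "Indexing techniques for search engines",
    "page2": "Search engines use inverted indexes",
    "page3": "Manual indexing in libraries",
}
index = build_inverted_index(resources)
print(search_all(index, "search engines"))
```

Because the index maps words to resources rather than resources to words, a query only has to look up its own words instead of scanning every text.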