ISSN: 2277 – 9043
                              International Journal of Advanced Research in Computer Science and Electronics Engineering
                                                                                          Volume 1, Issue 6, August 2012



              Maximizing the Search Engine Efficiency
                   V. KISHORE, V.SARITHA, G. RAMA MOHAN RAO, M. RAMAKRISHNA
   Abstract— Due to the massive amount of heterogeneous information on the web, insufficient and vague user queries, and the use of different queries by different users for the same aims, the information retrieval process deals with a great deal of uncertainty and doubt. Under such circumstances, designing an efficient retrieval function and ranking algorithm that provides the most relevant results is of the greatest importance.
   Nowadays, search engines have become vital in retrieving information about unknown entities. An entity may be recognized by more than one name (alias); for example, many celebrities and experts from various fields are referred to on the web not only by their original names but also by their aliases. Aliases are very important in information retrieval: to retrieve complete information about a personal name from the web, the pages that refer to the person by an alias must also be found.
   In recent years, researchers have been developing algorithms for automatic information retrieval that aim to retrieve almost all relevant data; the web search engine automatically expands a search query on a person name by tagging his aliases for complete information retrieval, thereby improving recall in the relation detection task and achieving a significant Mean Reciprocal Rank (MRR) for the search engine. To achieve a further substantial improvement in recall and MRR over previously proposed methods, our method orders the aliases by their association with the name, using the definition of anchor texts-based co-occurrences between name and aliases, in order to help the search engine tag the aliases according to the order of associations. The association orders are discovered automatically by creating an anchor texts-based co-occurrence graph between name and aliases. A Ranking Support Vector Machine (SVM) is used to create connections between name and aliases in the graph by ranking anchor texts-based co-occurrence measures. The hop distances between nodes in the graph, found by mining the graph, yield the associations between name and aliases. The contribution presented in this paper takes human intervention one step further down by outperforming previously proposed methods, achieving a substantial improvement in recall and MRR.
   Index Terms: Search engine, Relation Detection Task, Word Co-occurrences, Anchor Text Mining, Information Retrieval.

   Manuscript received July 15, 2012.
   V. KISHORE, Department of Computer Science & Engineering, SANA Engineering College, Kodad, Nalgonda, Andhra Pradesh, India. Mobile: 9010092225.
   V. SARITHA, Department of Computer Science & Engineering, SANA Engineering College, Kodad, Nalgonda, Andhra Pradesh, India. Mobile: 8978157316.
   G. RAMAMOHAN RAO, Department of Computer Science & Engineering, SANA Engineering College, Kodad, Nalgonda, Andhra Pradesh, India. Mobile: 9676339884.
   M. RAMAKRISHNA, Department of Computer Science & Engineering, SANA Engineering College, Kodad, Nalgonda, Andhra Pradesh, India. Mobile: 8978652777.

                         I. INTRODUCTION
   There is a huge quantity of text, audio, video, and other documents available on the Internet, on almost any subject. Users need to be able to find relevant information to satisfy their particular information needs. There are two ways of searching for information: to use a search engine or to browse directories organized by categories. There is still a large part of the Internet (private databases and intranets) that is not accessible.
   The organization of this paper is as follows. We briefly cover search engine history, features, and services. We present the generic architecture of a search engine. We discuss its Web crawling component, which has the task of collecting web pages to be indexed. Then we focus on the Information Retrieval component, which has the task of retrieving documents (mainly text documents) that answer a user query. We describe information retrieval techniques, focusing on the challenges faced by search engines. One particular challenge is the large scale, given by the huge number of web pages available on the Internet. Another challenge is inherent to any information retrieval system that deals with text: the ambiguity of natural language (English or other languages), which makes it difficult to have perfect matches between documents and user queries. We propose methods to evaluate the performance of the Information Retrieval component by overcoming the ambiguity of name aliases. The proposed method works on the aliases and obtains the association orders between name and aliases, helping the search engine tag those aliases according to their orders (first order associations, second order associations, etc.) so as to substantially increase the recall and MRR of the search engine when searches are made on person names.

                  II. OVERVIEW OF SEARCH ENGINES
2.1 Search Engines
   There are many general-purpose search engines available on the Web. A resource containing up-to-date information on the most used search engines is www.searchenginewatch.com. Some popular search engines are AltaVista, Google, Yahoo!, and MSN Search. Meta-search engines combine several existing search engines in order to provide documents relevant to a user query. Their task is reduced to ranking results from the different search engines and eliminating duplicates. Some examples are www.metacrawler.com, www.mamma.com, and www.dogpile.com.
2.2 Search Engine History
   The very first tool used for searching on the Internet was called Archie (the name stands for "archive"). It was created in 1990 by Alan Emtage, a student at McGill University in Montreal. The program downloaded the directory listings of all the files located on public anonymous FTP sites, creating a searchable database of filenames. Gopher was created in 1991 by Mark McCahill at the University of Minnesota. While Archie indexed file names, Gopher indexed plain text documents.
   Soon after, many search engines appeared and became popular. These included Excite, Infoseek, Inktomi, Northern Light, and AltaVista. In some ways, they competed with

                                                                                                                                       58
                                                  All Rights Reserved © 2012 IJARCSEE
popular directories such as Yahoo. Later, the directories integrated or added search engine technology for greater functionality.
   Around 2001, the Google search engine rose to prominence (Page and Brin, 1998). Its success was based in part on the concept of link popularity and PageRank, which uses the premise that good or desirable pages are pointed to by more pages than others. Google's minimalist user interface was very popular with users and has since spawned a number of imitators. In 2005, it indexed approximately 8 billion pages, more than any other search engine. It also offers a growing range of Web services, such as Google Maps and online automatic translation tools.
   In 2002, Yahoo acquired Inktomi and, in 2003, Overture, which owned AlltheWeb and AltaVista. Despite owning its own search engine, Yahoo initially kept using Google to provide its users with search results. In 2004, Yahoo launched its own search engine based on the combined technologies of its acquisitions, providing a service that gave preeminence to the Web search engine over its manually-maintained subject directory.
   MSN Search is a search engine owned by Microsoft, which previously relied on others for its search engine listings. In early 2005 it started showing its own results, collected by its own crawler. Many other search engines tend to be portals that merely show the results from another company's search engine.
2.3 Search Engine Architectures
   The components of a search engine are: Web crawling (gathering web pages), indexing (representing and storing the information), retrieval (being able to retrieve documents relevant to user queries), and ranking the results in their order of relevance. Figure 1 presents a simplified view of the components of a search engine.
      Figure 1: The simplified architecture of a search engine.
2.4 Web Crawler
   Web crawling is an integral piece of infrastructure for search engines. The web crawler is a program that automatically traverses the web by downloading pages and following the links from page to page [Koster 1999]. The general purpose of a web crawler is to download any web page that can be accessed through the links.
   Generic crawlers [1, 2] crawl documents and links belonging to a variety of topics, whereas focused crawlers [3, 4, 5] use some specialized knowledge to limit the crawl to pages pertaining to specific topics. For web crawling, issues like freshness and efficient resource usage have previously been addressed [6, 7, 8]. However, the problem of eliminating near-duplicate web documents and returning all the pages related to a user query (including aliases) in a generic crawl has not received attention.
   Documents that are exact duplicates of each other (due to mirroring and plagiarism) are easy to identify by standard checksumming techniques. A more difficult problem is the identification of near-duplicate documents. Two such documents are identical in terms of content but differ in a small portion of the document, such as advertisements, counters, and timestamps. These differences are irrelevant for web search. So if a newly-crawled page Pduplicate is deemed a near-duplicate of an already-crawled page P, the crawl engine should ignore Pduplicate and all of its out-going links (intuition suggests that these are probably near-duplicates of pages reachable from P). Elimination of near-duplicates saves network bandwidth, reduces storage costs, and improves the quality of search indexes. It also reduces the load on the remote host that is serving such web pages.
   A system for detection of near-duplicate pages faces a number of challenges. First and foremost is the issue of scale: search engines index billions of web pages, which amounts to a multi-terabyte database. Second, the crawl engine should be able to crawl billions of web pages per day, so the decision to mark a newly-crawled page as a near-duplicate of an existing page must be made quickly. Finally, the system should use as few machines as possible.

          III. OVERVIEW OF INFORMATION RETRIEVAL
3.1 Information Retrieval
   The last few decades have witnessed the birth and explosive growth of the web. It is indisputable that the exponential growth of the web has made it into a huge ocean. Finding high-quality web pages is one of the important purposes of search engines. The quality of pages is defined based on the request and preferences of the user. Usually, there are millions of relevant pages for each query.
   Information storage and retrieval systems make large volumes of text accessible to people with information needs (van Rijsbergen, 1979; Salton and McGill, 1983). A person (hereafter referred to as the user) approaches such a system with some idea of what they want to find out, and the goal of the system is to fulfill that need. The person provides an outline of their requirements: perhaps a list of keywords relating to the topic in question, or even an example document. The system searches its database for documents that are related to the user's query, and presents those that are most relevant. Nevertheless, users typically consider only 10 to 20 of those pages. The basic architecture is as shown in Figure 2.
   Figure 2: Information Retrieval System - Basic Architecture
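The exact-duplicate versus near-duplicate distinction above can be made concrete. The following is a minimal sketch, not the paper's algorithm: exact duplicates are caught with a checksum, and near-duplicates with word-shingle overlap; the shingle size (4 words) and the 0.9 similarity threshold are illustrative assumptions.

```python
import hashlib

def checksum(text: str) -> str:
    """Exact-duplicate fingerprint: byte-identical pages hash identically."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def shingles(text: str, k: int = 4) -> set:
    """Set of k-word shingles; a small edit changes only a few shingles."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a: set, b: set) -> float:
    """Set overlap in [0, 1]."""
    return len(a & b) / len(a | b) if (a | b) else 1.0

def is_near_duplicate(p: str, q: str, threshold: float = 0.9) -> bool:
    """Cheap exact check first, then shingle-overlap near-duplicate check."""
    if checksum(p) == checksum(q):
        return True
    return jaccard(shingles(p), shingles(q)) >= threshold

# Two pages identical except for a trailing advertisement tag.
page = ("the quick brown fox jumps over the lazy dog while the cat "
        "watches from the window and the birds sing in the garden")
near = page + " advertisement 42"
print(is_near_duplicate(page, near))  # True: shingle overlap is about 0.91
```

A crawl engine applying such a test would discard Pduplicate and its out-going links once the test fires; at the scale discussed above, production systems replace this pairwise comparison with sketching techniques such as MinHash.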


   Queries transform the user's information need into a form that correctly represents the user's underlying information requirement and is suitable for the matching process. Query formatting depends on the underlying model of retrieval used: Boolean models [Bookstein, 1985], Vector Space models [Salton & McGill, 1983], Probabilistic models [Maron & Kuhns, 1960; Robertson, 1977], or models based on artificial intelligence techniques [Maaeng, 1992; Evans 1993].
   A matching algorithm matches a user's requests (in terms of queries) with the document representations and retrieves the documents that are most likely to be relevant to the user. A matching algorithm addresses two issues:
  1. How to decide how well documents match a user's information request. Blair & Maron [1985] showed that it is very difficult for users to predict the exact words or phrases used by authors in desired documents. Hence, if a document term does not match the search terms, a relevant document may not be retrieved.
  2. How to decide the order in which the documents are to be shown to the user. Typically, matching algorithms calculate a matching score for each document and retrieve the documents in decreasing order of this score.
   The user rates the documents presented as either relevant or non-relevant to his/her information need. The basic problem facing any IR system is how to retrieve only the relevant documents for the user's information requirements, while not retrieving non-relevant ones. System performance criteria like precision and recall have been used to gauge the effectiveness of the system in meeting users' information requirements.
   Precision is defined as the ratio of the number of relevant retrieved documents to the total number of retrieved documents:

        P = Number of relevant documents retrieved / Total number of retrieved documents

   Recall is the ratio of the number of relevant retrieved documents to the total number of relevant documents available in the document collection:

        R = Number of relevant documents retrieved / Total number of relevant documents

   Using this information, a retrieval system is able to rank the matching documents and present the most relevant ones to the user first, a major advantage over the standard Boolean approach, which does not order retrieved documents in any significant way. From a list of documents sorted by predicted relevance, a user can select several from the top as being most pertinent. In a system using the Boolean model, users would have to search through all the retrieved documents, determining each one's importance manually.
   Systems that employ simple query methods usually provide other facilities to aid users in their search. For example, a thesaurus can match words in a query with those in the index that have similar meaning. The most effective tool exploited by information retrieval systems is user feedback. Instead of having users re-engineer their queries in an ad hoc fashion, a system that measures their interest in retrieved documents can adjust the queries for them. By weighting terms in the query according to their performance in selecting appropriate documents, the system can refine a query to reflect the user's needs more precisely. The user is required only to indicate the relevance of each document presented. Finally, an information retrieval system's conceptual model defines how documents in its database are compared.
3.2 Evaluation of Information Retrieval Systems
   To compare the performance of information retrieval systems, there is a need for standard test collections and benchmarks. The TREC forum (Text Retrieval Conference) has provided test collections and organized competitions between IR systems every year since 1992. In order to compute evaluation scores, we need to know the expected solution: relevance judgments produced by human judges are included in the standard test collections.
   CLEF (Cross-Language Evaluation Forum) is another evaluation forum; since the year 2000 it has organized competitions between IR systems that allow queries or documents in multiple languages.
   In order to evaluate the performance of an IR system, we need to measure how far down the ranked list of results a user must look to find some or all of the relevant documents.
3.3 IR Paradigms
   This section briefly describes various research paradigms prevalent in IR and where our work fits in. At a broad level, research in IR can be categorized [Chen, 1995] into three categories: Probabilistic IR, Knowledge-based IR, and IR based on machine learning techniques.
3.3.1 Probabilistic IR
   Probabilistic retrieval is based on estimating the probability of relevance of a document to the user for a given user query. Typically, relevance feedback from a few documents is used to establish the probability of relevance for the other documents in the collection [Fuhr et al., 1991; Gordon, 1988]. There are three different learning strategies used in probabilistic retrieval. Estimation of probabilities of relevance is done for a set of sample documents [Robertson & Sparck Jones, 1976], or a set of sample queries [Maron & Kuhns, 1960], and extended to all the documents or queries. Inference networks [Turtle & Croft, 1990] use a document and query network that captures probabilistic dependencies among the nodes in the network.
3.3.2 Knowledge-based IR
   This approach focuses on modeling two areas. First, it tries to model the knowledge of an expert retriever in terms of the expert's domain knowledge, i.e. user search strategies and feedback heuristics. An example of such an approach is the Unified Medical Language System. The other area that has been modeled is the user of the system; this typically follows the way a librarian develops a client profile. Although knowledge-based approaches might be effective in certain domains, they may not be applicable in all domains [Chen et al., 1991].
3.3.3 Learning-systems-based IR
   This approach is based on algorithmic extraction of knowledge or identification of patterns in the data. There are three broad areas within this approach: Symbolic Learning, Neural Networks, and Evolution-based algorithms. In the Symbolic Learning approach, knowledge discovery is typically done by inductive learning, creating a hierarchical arrangement of concepts and producing IF-THEN type production rules. The ID3 decision-tree algorithm [Quinlan, 1986] is one such popular algorithm.
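The precision and recall ratios defined earlier in this section, and the reciprocal-rank measure underlying the MRR that the proposed method targets, translate directly into code. The following is a minimal sketch; the document identifiers and relevance judgments are invented for illustration.

```python
def precision(retrieved: set, relevant: set) -> float:
    """Relevant retrieved documents / total retrieved documents."""
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

def recall(retrieved: set, relevant: set) -> float:
    """Relevant retrieved documents / total relevant documents in the collection."""
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0

def reciprocal_rank(ranked: list, relevant: set) -> float:
    """1/r_i, where r_i is the 1-based rank of the first relevant result."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0  # no relevant document was retrieved for this query

def mean_reciprocal_rank(queries: list) -> float:
    """Average reciprocal rank over (ranked list, relevant set) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)

# One hypothetical query: 4 documents retrieved, 5 relevant in the collection.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = {"d2", "d3", "d5", "d6", "d7"}
print(precision(set(retrieved), relevant))   # 2 of 4 retrieved are relevant: 0.5
print(recall(set(retrieved), relevant))      # 2 of 5 relevant were retrieved: 0.4
print(reciprocal_rank(retrieved, relevant))  # first relevant hit at rank 2: 0.5
```

Over a sample of queries, mean_reciprocal_rank averages these per-query values, which matches the MRR definition used in the proposed method of Section V.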


                     IV. RELATED WORK
4.1 Keyword Extraction Algorithm
   A keyword extraction algorithm that applies to a single document without using a corpus was proposed by Matsuo and Ishizuka [10]. Frequent terms are extracted first, and then a set of co-occurrences between each term and the frequent terms, i.e., occurrences in the same sentences, is generated. The co-occurrence distribution indicates the importance of a term in the document. However, this method only extracts keywords from a single document; it does not correlate documents using anchor texts-based co-occurrence frequency.
4.2 Transitive Translation Approach
   A transitive translation approach to find translation equivalents of query terms and construct multilingual lexicons through the mining of web anchor texts and link structures was proposed by Lu, Chien, and Lee [11]. The translation equivalents of a query term can be extracted via its translation in an intermediate language. However, this method did not associate anchor texts using the definition of co-occurrences.
4.3 Feature Selection Method
   A novel feature selection method based on part-of-speech and word co-occurrence was proposed by Liu, Yu, Deng, Wang, and Bian [12]. Based on the components of Chinese document text, they utilized the words' part-of-speech attributes to filter out many meaningless terms. They then defined and used co-occurrence words, selected by their part-of-speech, as features. The results showed that their method selects better features and achieves better clustering performance. However, this method does not use anchor texts-based co-occurrences between words.
4.4 Data Treatment Strategy
   A data treatment strategy to generate new discriminative features, called compound-features (c-features), for text classification was proposed by Figueiredo et al. [13]. These c-features are composed of terms that co-occur in documents, without any restrictions on order or distance between terms within a document. This strategy precedes the classification task, in order to enhance documents with discriminative c-features. This method extracts only keywords from a document; it does not correlate documents using anchor texts.

      Anchor Texts     x        C - {x}          C
      p                k        n - k            n
      V - {p}          K - k    N - n - K + k    N - n
      V                K        N - K            N
      Table 1: Contingency table for anchor texts 'p' and 'x'.

4.5 Alias Extraction Method
   A method to extract aliases from the web for a given personal name was first proposed by Bollegala, Matsuo, and Ishizuka [9]. They used a lexical pattern approach to extract candidate aliases. Incorrect aliases were removed using page counts, anchor text co-occurrence frequency, and lexical pattern frequency. However, this method considered only the first order co-occurrences on aliases to rank them; it did not use the second order co-occurrences to improve recall and achieve a substantial MRR for the web search engine.

                     V. PROPOSED METHOD
   The proposed method will work on the aliases and obtain the association orders between name and aliases, helping the search engine tag those aliases according to their orders (first order associations, second order associations, etc.) so as to substantially increase the Recall and Mean Reciprocal Rank (MRR) of the search engine when searches are made on person names.
   The MRR applies to the binary relevance task: for a given query, any returned document is labeled "relevant" or "not relevant". If ri is the rank of the highest ranking relevant document for the i-th query, then the reciprocal rank measure for that query is 1/ri, and the MRR of the search engine for a given sample of queries is the average of the reciprocal ranks over those queries.
   The term word co-occurrence refers to two words occurring on the same web page or in the same document on the web.
   Anchor text is the clickable text on a web page that points to a particular web document. Anchor texts are used by search engine algorithms to provide relevant documents for search results because they point to the web pages that are relevant to user queries. Therefore, anchor texts are helpful for finding the strength of association between two words on the web.
   Anchor texts-based co-occurrence means that two anchor texts from different web pages point to the same URL. Anchor texts which point to the same URL are called inbound anchor texts [9]. The proposed method will find the anchor texts-based co-occurrences between name and aliases using co-occurrence statistics and will rank the name and aliases with a Support Vector Machine (SVM) according to the co-occurrence measures, in order to obtain connections among name and aliases for drawing the word co-occurrence graph.
                Figure 3: Proposed method - overview
   Then a word co-occurrence graph will be created and mined by a graph mining algorithm to obtain the hop distance between name and aliases, which leads to the association orders of aliases with the name. The search engine can now expand the search query on a name by


tagging the aliases according to their association orders to retrieve all relevant pages, which in turn will increase the recall and achieve a substantial MRR.
   The proposed method is outlined in Figure 3 and comprises four main components, namely:
    Computation of word co-occurrence statistics
    Ranking anchor texts
    Creation of the anchor text co-occurrence graph
    Discovery of association orders
   To compute anchor text-based co-occurrence measures, nine co-occurrence statistics [9] are used in anchor text mining to measure the associations between anchor texts:
    Co-occurrence Frequency (CF)
    Term Frequency–Inverse Document Frequency (tfidf)
    Chi Square (CS)
    Log Likelihood Ratio (LLR)
    Pointwise Mutual Information (PMI)
    Hypergeometric Distribution (HG)
    Cosine
    Overlap
    Dice
   A Ranking Support Vector Machine will be used to rank the anchor texts with respect to each anchor text, so as to identify the highest-ranking anchor text for making first-order associations among anchor texts.

5.1 Co-occurrences in Anchor Texts
   The proposed method will first retrieve from a search engine all corresponding URLs for the anchor texts in which the name and aliases appear. Most search engines provide operators for searching within anchor texts on the web. For example, Google provides the Inanchor and Allinanchor search operators to retrieve URLs that are pointed to by the anchor text given as a query; the query "Allinanchor: Sachin Tendulkar" will return all URLs pointed to by the Sachin Tendulkar anchor text on the web.

   Figure 4: A picture of Sachin Tendulkar being linked by different anchor texts on the web (the anchor texts shown are Little Master, Master Blaster, Sachin, God of Cricket, Tendulkar, and Tendlya)

   Next, a contingency table will be created, as described in Table 1, for each pair of anchor texts to measure the strength of their association. Therein:
   x and p : the two input anchor texts
   C       : the set of input anchor texts except p
   V       : the set of all words that appear in anchor texts
   C−{x}   : the anchor texts except x
   V−{p}   : the words except p
   k       : the co-occurrence frequency between p and x
   n       : the sum of the co-occurrence frequencies between p and all anchor texts in C
   K       : the sum of the co-occurrence frequencies between all words in V and x
   N       : the sum of the co-occurrence frequencies between all words in V and all anchor texts in C

5.1.1 Role of Anchor Texts
   The main objective of a search engine is to provide the most relevant documents for a user's query. Anchor text plays a vital role in search engine algorithms because it is clickable text that points to a particular relevant page on the web; hence search engines consider anchor text a main factor in retrieving documents relevant to the user's query. Anchor texts are used in synonym extraction, in ranking and classification of web pages, and in query translation in cross-language information retrieval systems.

5.1.2 Anchor Texts Co-occurrence Frequency
   Two anchor texts appearing on different web pages are called inbound anchor texts [9] if they point to the same URL. The anchor text co-occurrence frequency [9] between two anchor texts is the number of different URLs on which they co-occur: if p and x are two co-occurring anchor texts, then p and x point to the same URL, and if their co-occurrence frequency is k, then p and x co-occur on k different URLs. For example, the picture of Sachin Tendulkar in Figure 4 is linked by six different anchor texts; according to this definition of co-occurrence, Little Master and Master Blaster are co-occurring, as are Sachin and Tendulkar.

5.1.3 Word Co-occurrence Statistics
   To measure the association between anchor texts, nine popular measures will be used, each calculated from Table 1.

5.1.3.1 Co-occurrence Frequency (CF)
   CF [9] is the simplest of the measures; it is simply the value of k in Table 1.

5.1.3.2 Term Frequency–Inverse Document Frequency (tfidf)
   CF is biased towards highly frequent words; tfidf [9, 14] resolves this bias by reducing the weight assigned to such words in anchor texts. The tfidf score for the anchor texts p and x is calculated from Table 1 as

      tfidf(p, x) = k log( N / (K + 1) )                                (1)

5.1.3.3 Chi Square (CS)
   The chi square test [9] is used to test the dependence between two words in natural language processing tasks. Given the contingency table in Table 1, the X^2 measure compares the observed frequencies with the frequencies expected under independence: if the difference between the observed and expected frequencies is large, the anchor texts p and x are likely to be dependent. The X^2 measure sums the squared differences between the observed and expected frequencies, scaled by the expected values:

      X^2 = Σ_ij (O_ij − E_ij)^2 / E_ij                                 (2)

   where O_ij and E_ij are the observed and expected frequencies, respectively. Using Equation (2), the X^2 score for the anchor texts p and x from Table 1 is

      CS(p, x) = N ( k(N − K − n + k) − (n − k)(K − k) )^2 / ( nK(N − K)(N − n) )    (3)

5.1.3.4 Log Likelihood Ratio (LLR)
   LLR [9, 15] is the ratio between the likelihoods of two alternative hypotheses: that the texts p and x are independent
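For concreteness, the contingency-based statistics can be sketched in Python. The sketch below is illustrative, not the authors' implementation: it computes Equations (1)–(3) above, together with the six further measures defined in Sections 5.1.3.4–5.1.3.9, directly from the Table 1 counts (k, n, K, N). Two assumptions are made explicit in comments: the LLR logarithm base is taken as natural (the paper writes only "log"), and zero cells in the LLR sum are treated as contributing zero.

```python
import math

def cf(k, n, K, N):
    # Section 5.1.3.1: CF is simply the co-occurrence count k.
    return k

def tfidf(k, n, K, N):
    # Eq. (1): k * log(N / (K + 1))
    return k * math.log(N / (K + 1))

def chi_square(k, n, K, N):
    # Eq. (3): closed form of the 2x2 chi-square statistic.
    num = N * (k * (N - K - n + k) - (n - k) * (K - k)) ** 2
    return num / (n * K * (N - K) * (N - n))

def llr(k, n, K, N):
    # Eq. (4); natural log assumed, and 0 * log(...) terms taken as 0.
    def term(a, b):
        return a * math.log(b) if a > 0 and b > 0 else 0.0
    return (term(k, k * N / (n * K))
            + term(n - k, (n - k) * N / (n * (N - K)))
            + term(K - k, N * (K - k) / (K * (N - n)))
            + term(N - K - n + k,
                   N * (N - K - n + k) / ((N - K) * (N - n))))

def pmi(k, n, K, N):
    # Eq. (6): log2(kN / (Kn))
    return math.log2(k * N / (K * n))

def hg(k, n, K, N):
    # Eq. (8): -log2 of the probability of observing k or more
    # co-occurrences under the hypergeometric distribution of Eq. (7).
    prob = sum(math.comb(K, l) * math.comb(N - K, n - l)
               for l in range(k, min(n, K) + 1)) / math.comb(N, n)
    return -math.log2(prob)

def cosine(k, n, K, N):
    return k / math.sqrt(n * K)        # Eq. (10)

def overlap(k, n, K, N):
    return k / min(n, K)               # Eq. (12)

def dice(k, n, K, N):
    return 2 * k / (n + K)             # Eq. (14)
```

For example, with k = 2, n = 4, K = 5, N = 20, PMI gives log2(40/20) = 1.0 and overlap gives 2/4 = 0.5; all nine functions share the same (k, n, K, N) interface so they can be swapped freely when building the feature vectors used later for ranking.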
or they are dependent. LLR is calculated using Table 1 as follows:

      LLR(p, x) = k log( kN / (nK) ) + (n − k) log( (n − k)N / (n(N − K)) )
                  + (K − k) log( N(K − k) / (K(N − n)) )
                  + (N − K − n + k) log( N(N − K − n + k) / ((N − K)(N − n)) )    (4)

5.1.3.5 Pointwise Mutual Information (PMI)
   PMI [9, 16] reflects the dependence between two probabilistic events. For two events y and z it is defined as

      PMI(y, z) = log2( P(y, z) / (P(y) P(z)) )                         (5)

   where P(y) and P(z) are the probabilities of the events y and z, and P(y, z) is their joint probability. From Table 1, the PMI of the anchor texts p and x is calculated as

      PMI(p, x) = log2( kN / (Kn) )                                     (6)

5.1.3.6 Hypergeometric Distribution (HG)
   The hypergeometric distribution [9, 16] is a discrete probability distribution that represents the number of successes in a sequence of draws from a finite population without replacement. For example, the probability of the event that "k red balls are contained among n balls arbitrarily selected from N balls containing K red balls" is given by the hypergeometric distribution hg(N, K, n, k):

      hg(N, K, n, k) = C(K, k) C(N − K, n − k) / C(N, n)                (7)

   where C(a, b) denotes the binomial coefficient. Applying the hypergeometric distribution to the values of Table 1, HG(p, x) is computed as the probability of observing k or more co-occurrences of p and x:

      HG(p, x) = −log2( Σ_{l ≥ k} hg(N, K, n, l) ),
      where max(0, K + n − N) ≤ l ≤ min(n, K)                           (8)

5.1.3.7 Cosine
   The cosine measure [9] computes the association between anchor texts. The association between the elements of two sets X and Y is computed as

      cosine(X, Y) = |X ∩ Y| / ( √|X| √|Y| )                            (9)

   where |X| is the number of elements in set X. Taking X to be the co-occurrences of anchor text x and Y to be the co-occurrences of anchor text p, the cosine measure from Table 1 is

      cosine(p, x) = k / √(nK)                                          (10)

5.1.3.8 Overlap
   The overlap [9] between two sets X and Y is defined as

      overlap(X, Y) = |X ∩ Y| / min(|X|, |Y|)                           (11)

   Assuming that X and Y, respectively, represent the occurrences of anchor texts p and x, the overlap of (p, x) is

      overlap(p, x) = k / min(n, K)                                     (12)

5.1.3.9 Dice
   The Dice coefficient [9, 17] is used to retrieve collocations from large textual corpora. It is defined over two sets X and Y as

      Dice(X, Y) = 2|X ∩ Y| / ( |X| + |Y| )                             (13)

   so that, from Table 1,

      Dice(p, x) = 2k / (n + K)                                         (14)

5.2 Ranking Anchor Texts
   A Ranking SVM [9, 18] will be used for ranking the aliases. The Ranking SVM will be trained on samples of names and aliases: all the co-occurrence measures for the anchor texts of the training samples will be computed and normalized into the range [0, 1], and the normalized values, termed feature vectors, will be used to train the SVM and obtain the ranking function for testing the given anchor texts of a name and its aliases. Then, for each anchor text, the trained SVM will use the ranking function to rank the other anchor texts with respect to their co-occurrence measures with it. The highest-ranking anchor text will be elected to make a first-order association with the anchor text for which the ranking was performed. Next, the word co-occurrence graph will be drawn for the name and aliases according to the first-order associations between them.

   Figure 5: Word co-occurrence graph for the personal name "Sachin Tendulkar" (nodes: Sachin Tendulkar, Little Master, Master Blaster, Member of Parliament, Cricket, Mumbai Indians, and Sports)

5.3 Word Co-occurrence Graph
   The word co-occurrence graph is an undirected graph whose nodes represent words that appear in anchor texts on the web; for each word in an anchor text, a node is created in the graph. By the definition of co-occurrence, if two anchor texts co-occur in pointing to the same URL, an undirected edge is drawn between them to denote their co-occurrence. A word co-occurrence graph like the one shown in Figure 5 will be created for the name and aliases according to the first-order associations among them: each name and alias will be represented by a node in the graph, and two nodes will be connected if they make a first-order association, the edge denoting that the anchor texts the nodes bear co-occur according to the definition of anchor text co-occurrence. Next, the hop distances between nodes will be identified by a graph mining algorithm, in order to obtain first-, second-, and higher-order associations between the name and aliases.
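The graph construction just described can be sketched briefly: two anchor texts co-occur, and are joined by an undirected edge, when they point to a common URL (Section 5.1.2), and a breadth-first search then yields the hop distances that define the association orders. The `url_map` data and the function names below are illustrative assumptions, not taken from the paper.

```python
from collections import deque

def build_graph(url_map):
    """url_map: anchor text -> set of URLs it points to.
    Two anchor texts co-occur (first-order association) if they
    share at least one URL; an undirected edge joins them."""
    texts = list(url_map)
    graph = {t: set() for t in texts}
    for i, a in enumerate(texts):
        for b in texts[i + 1:]:
            if url_map[a] & url_map[b]:   # point to a common URL
                graph[a].add(b)
                graph[b].add(a)
    return graph

def association_orders(graph, name):
    """BFS from the name node: hop distance n = n-order association."""
    orders = {name: 0}
    queue = deque([name])
    while queue:
        node = queue.popleft()
        for neigh in graph[node]:
            if neigh not in orders:
                orders[neigh] = orders[node] + 1
                queue.append(neigh)
    return orders

# Illustrative data loosely modeled on Figure 5 (URLs are placeholders)
url_map = {
    "Sachin Tendulkar": {"u1", "u2"},
    "Little Master":    {"u1", "u3"},
    "Master Blaster":   {"u2", "u4"},
    "Mumbai Indians":   {"u3"},
    "Cricket":          {"u4"},
}
orders = association_orders(build_graph(url_map), "Sachin Tendulkar")
# e.g., "Little Master" -> 1 (first order), "Mumbai Indians" -> 2
```

With this sketch, the hop count returned for each alias is exactly the association order used below to tag aliases when expanding a query on the name.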
5.4 Discovery of Association Orders
   Using the graph mining algorithm [19, 20], the word co-occurrence graph will be mined to find the hop distances
between nodes in the graph. The hop distance between two nodes will be measured by counting the number of edges between the corresponding nodes, and the number of edges yields the association order between them. According to the definition, a node that lies n hops away from p has an n-order co-occurrence with p. Hence the first-, second-, and higher-order associations between the name and aliases will be identified by finding the hop distances between them. The search engine can then expand a query on a person's name by tagging the aliases according to their association orders with the name. Thereby the recall in the relation detection task will be substantially improved, by 40%, and the search engine will moreover achieve a substantial MRR for a sample of queries by giving relevant search results.

5.5 Data Set
   To train and evaluate the proposed method, there are two data sets: a personal names data set and a place names data set. The personal names data set includes people from various fields, among them cinema, sports, politics, science, and mass media. The place names data set contains aliases for US states.

                        VI. CONCLUSION
   The proposed method will compute anchor text-based co-occurrences among a given personal name and its aliases, and will create a word co-occurrence graph by connecting the nodes representing the name and aliases according to their first-order associations with each other. A graph mining algorithm that finds the hop distances between nodes will be used to identify the association orders between the name and aliases, and a Ranking SVM will be used to rank the anchor texts according to the co-occurrence statistics in order to identify the anchor texts in first-order associations. The web search engine can then expand a query on a personal name by tagging the aliases in the order of their associations with the name, so as to retrieve all relevant results, thereby improving recall and achieving a substantial MRR compared with previously proposed methods.

                          REFERENCES
1) A. Arasu, J. Cho, H. Garcia-Molina, A. Paepcke, and S. Raghavan, "Searching the Web," ACM Transactions on Internet Technology, vol. 1, no. 1, pp. 2-43, 2001.
2) S. Brin and L. Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine," Computer Networks and ISDN Systems, vol. 30, no. 1-7, pp. 107-117, 1998.
3) M. Diligenti, F. Coetzee, S. Lawrence, C. L. Giles, and M. Gori, "Focused Crawling Using Context Graphs," Proc. 26th International Conference on Very Large Databases (VLDB 2000), pp. 527-534, Sep. 2000.
4) F. Menczer, G. Pant, P. Srinivasan, and M. E. Ruiz, "Evaluating Topic-Driven Web Crawlers," Proc. 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 241-249, 2001.
5) S. Pandey and C. Olston, "User-Centric Web Crawling," Proc. WWW 2005, pp. 401-411, 2005.
6) A. Z. Broder, M. Najork, and J. L. Wiener, "Efficient URL Caching for World Wide Web Crawling," Proc. International Conference on World Wide Web, 2003.
7) S. Chakrabarti, Mining the Web: Discovering Knowledge from Hypertext Data, Morgan Kaufmann, 2002.
8) J. Cho, H. Garcia-Molina, and L. Page, "Efficient Crawling Through URL Ordering," Computer Networks and ISDN Systems, vol. 30, no. 1-7, pp. 161-172, 1998.
9) D. Bollegala, Y. Matsuo, and M. Ishizuka, "Automatic Discovery of Personal Name Aliases from the Web," IEEE Transactions on Knowledge and Data Engineering, vol. 23, no. 6, June 2011.
10) Y. Matsuo and M. Ishizuka, "Keyword Extraction from a Single Document Using Word Co-occurrence Statistical Information," International Journal on Artificial Intelligence Tools, 2004.
11) W. Lu, L. Chien, and H. Lee, "Anchor Text Mining for Translation of Web Queries: A Transitive Translation Approach," ACM Transactions on Information Systems, vol. 22, no. 2, pp. 242-269, April 2004.
12) Z. Liu, W. Yu, Y. Deng, Y. Wang, and Z. Bian, "A Feature Selection Method for Document Clustering Based on Part-of-Speech and Word Co-occurrence," Proc. 7th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD 10), pp. 2331-2334, Aug. 2010.
13) F. Figueiredo, L. Rocha, T. Couto, T. Salles, M. A. Goncalves, and W. Meira Jr., "Word Co-occurrence Features for Text Classification," vol. 36, no. 5, pp. 843-858, July 2011.
14) G. Salton and C. Buckley, "Term-Weighting Approaches in Automatic Text Retrieval," Information Processing and Management, vol. 24, pp. 513-523, 1988.
15) T. Dunning, "Accurate Methods for the Statistics of Surprise and Coincidence," Computational Linguistics, vol. 19, pp. 61-74, 1993.
16) T. Hisamitsu and Y. Niwa, "Topic-Word Selection Based on Combinatorial Probability," Proc. Natural Language Processing Pacific-Rim Symp. (NLPRS '01), pp. 289-296, 2001.
17) F. Smadja, "Retrieving Collocations from Text: Xtract," Computational Linguistics, vol. 19, no. 1, pp. 143-177, 1993.
18) T. Joachims, "Optimizing Search Engines Using Clickthrough Data," Proc. ACM SIGKDD '02, 2002.
19) D. Chakrabarti and C. Faloutsos, "Graph Mining: Laws, Generators, and Algorithms," ACM Computing Surveys, vol. 38, Article 2, March 2006.
20) C. C. Aggarwal and H. Wang, "Graph Data Management and Mining: A Survey of Algorithms and Applications," DOI 10.1007/978-1-4419-6045-0_2, Springer Science+Business Media, LLC, 2010.

   V. KISHORE, M.Tech, (Ph.D), is working as Associate Professor and Head of the Department of Computer Science & Engineering, SANA Engineering College, Kodad, Nalgonda. He has published 4 international journal papers; his areas of interest include network security and information retrieval systems.

   V. SARITHA is pursuing her M.Tech (CSE) at SANA Engineering College, Kodad, Nalgonda, Andhra Pradesh, India.

   G. RAMA MOHAN RAO, M.Tech, is working as Assistant Professor in the Department of Computer Science and Engineering, SANA Engineering College, Kodad, Nalgonda.

   M. RAMAKRISHNA is pursuing his M.Tech (CSE) at SANA Engineering College, Kodad, Nalgonda, Andhra Pradesh, India.


                                               All Rights Reserved © 2012 IJARCSEE

 
A novel method to search information through multi agent search and retrie
A novel method to search information through multi agent search and retrieA novel method to search information through multi agent search and retrie
A novel method to search information through multi agent search and retrieIAEME Publication
 
Spivack Blogtalk 2008
Spivack Blogtalk 2008Spivack Blogtalk 2008
Spivack Blogtalk 2008Blogtalk 2008
 
IRJET - Re-Ranking of Google Search Results
IRJET - Re-Ranking of Google Search ResultsIRJET - Re-Ranking of Google Search Results
IRJET - Re-Ranking of Google Search ResultsIRJET Journal
 
The Revolution Of Cloud Computing
The Revolution Of Cloud ComputingThe Revolution Of Cloud Computing
The Revolution Of Cloud ComputingCarmen Sanborn
 
Perception Determined Constructing Algorithm for Document Clustering
Perception Determined Constructing Algorithm for Document ClusteringPerception Determined Constructing Algorithm for Document Clustering
Perception Determined Constructing Algorithm for Document ClusteringIRJET Journal
 

Similaire à 58 64 (20)

Aardvark Final Www2010
Aardvark Final Www2010Aardvark Final Www2010
Aardvark Final Www2010
 
UProRevs-User Profile Relevant Results
UProRevs-User Profile Relevant ResultsUProRevs-User Profile Relevant Results
UProRevs-User Profile Relevant Results
 
Pf3426712675
Pf3426712675Pf3426712675
Pf3426712675
 
Web Mining
Web MiningWeb Mining
Web Mining
 
Introduction to internet.
Introduction to internet.Introduction to internet.
Introduction to internet.
 
An imperative focus on semantic
An imperative focus on semanticAn imperative focus on semantic
An imperative focus on semantic
 
IMPLEMENTATION OF FOLKSONOMY BASED TAG CLOUD MODEL FOR INFORMATION RETRIEVAL ...
IMPLEMENTATION OF FOLKSONOMY BASED TAG CLOUD MODEL FOR INFORMATION RETRIEVAL ...IMPLEMENTATION OF FOLKSONOMY BASED TAG CLOUD MODEL FOR INFORMATION RETRIEVAL ...
IMPLEMENTATION OF FOLKSONOMY BASED TAG CLOUD MODEL FOR INFORMATION RETRIEVAL ...
 
Implementation of Folksonomy Based Tag Cloud Model for Information Retrieval ...
Implementation of Folksonomy Based Tag Cloud Model for Information Retrieval ...Implementation of Folksonomy Based Tag Cloud Model for Information Retrieval ...
Implementation of Folksonomy Based Tag Cloud Model for Information Retrieval ...
 
Intelligent Semantic Web Search Engines: A Brief Survey
Intelligent Semantic Web Search Engines: A Brief Survey  Intelligent Semantic Web Search Engines: A Brief Survey
Intelligent Semantic Web Search Engines: A Brief Survey
 
Searchland: Search quality for Beginners
Searchland: Search quality for BeginnersSearchland: Search quality for Beginners
Searchland: Search quality for Beginners
 
TEXT ANALYZER
TEXT ANALYZER TEXT ANALYZER
TEXT ANALYZER
 
Enhanced Web Usage Mining Using Fuzzy Clustering and Collaborative Filtering ...
Enhanced Web Usage Mining Using Fuzzy Clustering and Collaborative Filtering ...Enhanced Web Usage Mining Using Fuzzy Clustering and Collaborative Filtering ...
Enhanced Web Usage Mining Using Fuzzy Clustering and Collaborative Filtering ...
 
Introduction abstract
Introduction abstractIntroduction abstract
Introduction abstract
 
Design Issues for Search Engines and Web Crawlers: A Review
Design Issues for Search Engines and Web Crawlers: A ReviewDesign Issues for Search Engines and Web Crawlers: A Review
Design Issues for Search Engines and Web Crawlers: A Review
 
A novel method to search information through multi agent search and retrie
A novel method to search information through multi agent search and retrieA novel method to search information through multi agent search and retrie
A novel method to search information through multi agent search and retrie
 
In3415791583
In3415791583In3415791583
In3415791583
 
Spivack Blogtalk 2008
Spivack Blogtalk 2008Spivack Blogtalk 2008
Spivack Blogtalk 2008
 
IRJET - Re-Ranking of Google Search Results
IRJET - Re-Ranking of Google Search ResultsIRJET - Re-Ranking of Google Search Results
IRJET - Re-Ranking of Google Search Results
 
The Revolution Of Cloud Computing
The Revolution Of Cloud ComputingThe Revolution Of Cloud Computing
The Revolution Of Cloud Computing
 
Perception Determined Constructing Algorithm for Document Clustering
Perception Determined Constructing Algorithm for Document ClusteringPerception Determined Constructing Algorithm for Document Clustering
Perception Determined Constructing Algorithm for Document Clustering
 

Plus de Ijarcsee Journal (20)

122 129
122 129122 129
122 129
 
109 115
109 115109 115
109 115
 
104 108
104 108104 108
104 108
 
99 103
99 10399 103
99 103
 
93 98
93 9893 98
93 98
 
82 87
82 8782 87
82 87
 
78 81
78 8178 81
78 81
 
73 77
73 7773 77
73 77
 
65 72
65 7265 72
65 72
 
52 57
52 5752 57
52 57
 
46 51
46 5146 51
46 51
 
41 45
41 4541 45
41 45
 
36 40
36 4036 40
36 40
 
24 27
24 2724 27
24 27
 
19 23
19 2319 23
19 23
 
16 18
16 1816 18
16 18
 
12 15
12 1512 15
12 15
 
6 11
6 116 11
6 11
 
134 138
134 138134 138
134 138
 
125 131
125 131125 131
125 131
 

Dernier

A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 

Dernier (20)

A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 

58 64

ISSN: 2277 – 9043
International Journal of Advanced Research in Computer Science and Electronics Engineering, Volume 1, Issue 6, August 2012

Maximizing the Search Engine Efficiency

V. KISHORE, V. SARITHA, G. RAMA MOHAN RAO, M. RAMAKRISHNA

Abstract— Due to the massive amount of heterogeneous information on the web, insufficient and vague user queries, and the use of different queries by different users for the same aims, the information retrieval process deals with a great deal of uncertainty. Under such circumstances, designing an efficient retrieval function and ranking algorithm that returns the most relevant results is of the greatest importance. Search engines have become vital for retrieving information about unknown entities, and an entity may be known by more than one name (alias). For example, many celebrities and experts from various fields are referred to on the web not only by their original names but also by their aliases. Aliases are therefore important in information retrieval: to retrieve complete information about a personal name, the web pages that refer to the person only by an alias must also be found. In recent years, researchers have been developing algorithms for automatic information retrieval to meet the demand of retrieving almost all relevant data; the web search engine automatically expands the search query on a person name by tagging his aliases, thereby improving recall in the relation detection task and achieving a significant Mean Reciprocal Rank (MRR). To further improve recall and MRR over previously proposed methods, our proposed method orders the aliases by the strength of their association with the name, using anchor texts-based co-occurrences between the name and its aliases, so that the search engine can tag the aliases according to the order of association. The association orders are discovered automatically by creating an anchor texts-based co-occurrence graph between the name and its aliases. A Ranking Support Vector Machine (SVM) creates the connections between name and aliases in the graph by ranking anchor texts-based co-occurrence measures, and the hop distances between nodes, found by mining the graph, yield the associations between name and aliases. The method presented in this paper takes human intervention one step further down and outperforms previously proposed methods, achieving a substantial improvement in recall and MRR.

Index Terms: Search engine, relation detection task, word co-occurrences, anchor text mining, information retrieval.

I. INTRODUCTION

There is a huge quantity of text, audio, video, and other documents available on the Internet, on almost any subject. Users need to be able to find relevant information to satisfy their particular information needs. There are two ways of searching for information: to use a search engine, or to browse directories organized by categories. There is still a large part of the Internet (private databases and intranets) that is not accessible.

The organization of this paper is as follows. We briefly cover the history, features, and services of search engines, and present the generic architecture of a search engine. We discuss its Web crawling component, whose task is to collect web pages to be indexed. Then we focus on the Information Retrieval component, whose task is to retrieve documents (mainly text documents) that answer a user query. We describe information retrieval techniques, focusing on the challenges faced by search engines. One particular challenge is the large scale, given by the huge number of web pages available on the Internet. Another challenge is inherent to any information retrieval system that deals with text: the ambiguity of natural language (English or other languages), which makes it difficult to obtain perfect matches between documents and user queries. We propose methods to evaluate the performance of the Information Retrieval component when the ambiguity of name aliases is overcome. The proposed method works on the aliases and obtains the association orders between a name and its aliases (first order associations, second order associations, and so on) to help the search engine tag those aliases accordingly, so as to substantially increase the recall and MRR of the search engine for searches on person names.

Manuscript received July 15, 2012. V. KISHORE (Mobile: 9010092225), V. SARITHA (Mobile: 8978157316), G. RAMA MOHAN RAO (Mobile: 9676339884), and M. RAMAKRISHNA (Mobile: 8978652777) are with the Department of Computer Science & Engineering, SANA Engineering College, Kodad, Nalgonda, Andhra Pradesh, India.

II. OVERVIEW OF SEARCH ENGINES

2.1 Search engines:
There are many general-purpose search engines available on the Web; a resource containing up-to-date information on the most used search engines is www.searchenginewatch.com. Some popular search engines are AltaVista, Google, Yahoo!, and MSN Search. Meta-search engines combine several existing search engines in order to provide documents relevant to a user query; their task is reduced to ranking results from the different search engines and eliminating duplicates. Some examples are www.metacrawler.com, www.mamma.com, and www.dogpile.com.

2.2 Search Engine History
The very first tool used for searching on the Internet was called Archie (the name stands for "archive"). It was created in 1990 by Alan Emtage, a student at McGill University in Montreal. The program downloaded the directory listings of all the files located on public anonymous FTP sites, creating a searchable database of filenames. Gopher was created in 1991 by Mark McCahill at the University of Minnesota. While Archie indexed file names, Gopher indexed plain text documents.

Soon after, many search engines appeared and became popular. These included Excite, Infoseek, Inktomi, Northern Light, and AltaVista. In some ways, they competed with
popular directories such as Yahoo. Later, the directories integrated or added search engine technology for greater functionality.

Around 2001, the Google search engine rose to prominence (Page and Brin, 1998). Its success was based in part on the concept of link popularity and PageRank, which rests on the premise that good or desirable pages are pointed to by more pages than others. Google's minimalist user interface was very popular with users and has since spawned a number of imitators. In 2005, it indexed approximately 8 billion web pages, more than any other search engine. It also offers a growing range of Web services, such as Google Maps and online automatic translation tools.

In 2002, Yahoo acquired Inktomi and, in 2003, Overture, which owned AlltheWeb and AltaVista. Despite owning its own search engine, Yahoo initially kept using Google to provide its users with search results. In 2004, Yahoo launched its own search engine, based on the combined technologies of its acquisitions, providing a service that gave preeminence to the Web search engine over its manually maintained subject directory.

MSN Search is a search engine owned by Microsoft, which previously relied on others for its search engine listings. In early 2005 it started showing its own results, collected by its own crawler. Many other search engines tend to be portals that merely show the results from another company's search engine.

2.3 Search Engine Architectures
The components of a search engine are: Web crawling (gathering web pages), indexing (representing and storing the information), retrieval (finding documents relevant to user queries), and ranking the results in their order of relevance. Figure 1 presents a simplified view of the components of a search engine.

Figure 1: The simplified architecture of a search engine.

2.4 Web crawler
Web crawling is an integral piece of infrastructure for search engines. The web crawler is a program that automatically traverses the web by downloading pages and following the links from page to page [Koster 1999]. The general purpose of a web crawler is to download any web page that can be accessed through the links. Generic crawlers [1, 2] crawl documents and links belonging to a variety of topics, whereas focused crawlers [3, 4, 5] use some specialized knowledge to limit the crawl to pages pertaining to specific topics. For web crawling, issues like freshness and efficient resource usage have previously been addressed [6, 7, 8]. However, the problems of eliminating near-duplicate web documents and of returning all the pages related to a user query (including aliases) in a generic crawl have not received attention.

Documents that are exact duplicates of each other (due to mirroring and plagiarism) are easy to identify by standard checksumming techniques. A more difficult problem is the identification of near-duplicate documents: two such documents are identical in terms of content but differ in a small portion of the document, such as advertisements, counters, and timestamps. These differences are irrelevant for web search. So if a newly crawled page Pduplicate is deemed a near-duplicate of an already crawled page P, the crawl engine should ignore Pduplicate and all its out-going links (intuition suggests that these are probably near-duplicates of pages reachable from P). Elimination of near-duplicates saves network bandwidth, reduces storage costs, and improves the quality of search indexes. It also reduces the load on the remote host that is serving such web pages.

A system for detection of near-duplicate pages faces a number of challenges. First and foremost is the issue of scale: search engines index billions of web pages, which amounts to a multi-terabyte database. Second, the crawl engine should be able to crawl billions of web pages per day, so the decision to mark a newly crawled page as a near-duplicate of an existing page must be made quickly. Finally, the system should use as few machines as possible.

III. OVERVIEW OF INFORMATION RETRIEVAL

3.1 Information Retrieval:
The last few decades have witnessed the birth and explosive growth of the web; its exponential growth has turned it into a huge ocean of content. Finding web pages of high quality is one of the important purposes of search engines, and the quality of pages is defined by the requests and preferences of the user. Usually, there are millions of related pages for each query. Information storage and retrieval systems make large volumes of text accessible to people with information needs (van Rijsbergen, 1979; Salton and McGill, 1983). A person (hereafter referred to as the user) approaches such a system with some idea of what they want to find out, and the goal of the system is to fulfill that need. The user provides an outline of their requirements, perhaps a list of keywords relating to the topic in question, or even an example document. The system searches its database for documents that are related to the user's query and presents those that are most relevant. Nevertheless, users typically consider only 10 to 20 of those pages. The basic architecture is shown in Figure 2.

Figure 2: Information Retrieval System - Basic Architecture
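The crawler discussion above distinguishes exact duplicates, caught by standard checksumming, from near-duplicates that differ only in small portions such as advertisements or timestamps. The paper does not prescribe a specific near-duplicate test; the sketch below is one common choice, comparing word-shingle sets with Jaccard resemblance. The shingle size `k = 4` and threshold `0.9` are illustrative assumptions, not values from the paper.

```python
import hashlib

def checksum(text: str) -> str:
    # Exact duplicates: byte-identical pages hash to the same digest.
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

def shingles(text: str, k: int = 4) -> set:
    # The set of contiguous k-word sequences (shingles) in the page text.
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a: set, b: set) -> float:
    # Resemblance of two shingle sets; 1.0 means the shingle sets coincide.
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def is_near_duplicate(p1: str, p2: str, threshold: float = 0.9) -> bool:
    # Cheap exact-duplicate check first, then shingle resemblance.
    if checksum(p1) == checksum(p2):
        return True
    return jaccard(shingles(p1), shingles(p2)) >= threshold
```

A small change such as an appended advertisement leaves almost all shingles shared, so the resemblance stays near 1.0, while unrelated pages share few or no shingles. A production crawler would avoid pairwise comparison at this scale (for example by fingerprinting shingle sets), but the resemblance test itself is the one sketched here.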
Queries transform the user's information need into a form that correctly represents the user's underlying information requirement and is suitable for the matching process. Query formatting depends on the underlying model of retrieval used (Boolean models [Bookstein, 1985], Vector Space models [Salton & McGill, 1983], Probabilistic models [Maron & Kuhns, 1960; Robertson, 1977], or models based on artificial intelligence techniques [Maaeng, 1992; Evans, 1993]).
   A matching algorithm matches a user's requests (in terms of queries) with the document representations and retrieves the documents that are most likely to be relevant to the user. A matching algorithm addresses two issues:
   1. How to decide how well documents match a user's information request. Blair & Maron [1985] showed that it is very difficult for users to predict the exact words or phrases used by authors in desired documents. Hence, if a document term does not match the search terms, a relevant document may not be retrieved.
   2. How to decide the order in which the documents are to be shown to the user. Typically, matching algorithms calculate a matching score for each document and retrieve the documents in decreasing order of this score.
   The user rates the documents presented as either relevant or non-relevant to his or her information need. The basic problem facing any IR system is how to retrieve only the documents relevant to the user's information requirements while not retrieving non-relevant ones. System performance criteria such as precision and recall have been used to gauge the effectiveness of a system in meeting users' information requirements.
   Precision is defined as the ratio of the number of relevant retrieved documents to the total number of retrieved documents:

      P = Number of relevant documents retrieved / Total number of retrieved documents

   Recall is the ratio of the number of relevant retrieved documents to the total number of relevant documents available in the document collection:

      R = Number of relevant documents retrieved / Total number of relevant documents

   Using this information, a retrieval system is able to rank the documents that match and present the most relevant ones to the user first, a major advantage over the standard Boolean approach, which does not order retrieved documents in any significant way. From a list of documents sorted by predicted relevance, a user can select several from the top as being the most pertinent. In a system using the Boolean model, users would have to search through all the documents retrieved, determining each one's importance manually.
   Systems that employ simple query methods usually provide other facilities to aid users in their search. For example, a thesaurus can match words in a query with those in the index that have a similar meaning. The most effective tool exploited by information retrieval systems is user feedback. Instead of having users re-engineer their queries in an ad hoc fashion, a system that measures their interest in retrieved documents can adjust the queries for them. By weighting terms in the query according to their performance in selecting appropriate documents, the system can refine a query to reflect the user's needs more precisely. The user is required only to indicate the relevance of each document presented. Finally, an information retrieval system's conceptual model defines how documents in its database are compared.
3.2 Evaluation of Information Retrieval Systems
   To compare the performance of information retrieval systems, there is a need for standard test collections and benchmarks. The TREC forum (Text REtrieval Conference) has provided test collections and organized competitions between IR systems every year since 1992. In order to compute evaluation scores, we need to know the expected solution; relevance judgments are produced by human judges and included in the standard test collections. CLEF (Cross-Language Evaluation Forum) is another evaluation forum, which since the year 2000 has organized competitions between IR systems that allow queries or documents in multiple languages.
   In order to evaluate the performance of an IR system, we need to measure how far down the ranked list of results a user will need to look to find some or all of the relevant documents.
3.3 IR Paradigms
   This section briefly describes various research paradigms prevalent in IR and where our work fits in. At a broad level, research in IR can be categorized [Chen, 1995] into three categories: probabilistic IR, knowledge-based IR, and IR based on machine learning techniques.
3.3.1. Probabilistic IR:
   Probabilistic retrieval is based on estimating the probability of relevance of a document to the user for a given user query. Typically, relevance feedback from a few documents is used to establish the probability of relevance for other documents in the collection [Fuhr et al., 1991; Gordon, 1988]. There are three different learning strategies used in probabilistic retrieval. Estimation of probabilities of relevance is done for a set of sample documents [Robertson & Sparck Jones, 1976] or a set of sample queries [Maron & Kuhns, 1960] and extended to all the documents or queries. Inference networks [Turtle & Croft, 1990] use a document and query network that captures probabilistic dependencies among the nodes in the network.
3.3.2. Knowledge based IR:
   This approach focuses on modeling two areas. First, it tries to model the knowledge of an expert retriever in terms of the expert's domain knowledge, i.e., user search strategies and feedback heuristics. An example of such an approach is the Unified Medical Language System. The other area that has been modeled is the user of the system; this typically follows the way a librarian develops a client profile. Although knowledge-based approaches might be effective in certain domains, they may not be applicable in all domains [Chen et al., 1991].
3.3.3. Learning systems based IR:
   This approach is based on the algorithmic extraction of knowledge or the identification of patterns in the data. There are three broad areas within this approach: symbolic learning, neural networks, and evolution-based algorithms. In the symbolic learning approach, knowledge discovery is typically done by inductive learning, creating a hierarchical arrangement of concepts and producing IF-THEN type production rules. The ID3 decision-tree algorithm [Quinlan, 1986] is one such popular algorithm.
60 All Rights Reserved © 2012 IJARCSEE
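The precision and recall measures defined above can be sketched as follows. This is a minimal illustration, not part of the paper's method; the document identifiers are hypothetical:

```python
def precision(relevant, retrieved):
    """Fraction of the retrieved documents that are relevant."""
    relevant = set(relevant)
    if not retrieved:
        return 0.0
    return sum(1 for d in retrieved if d in relevant) / len(retrieved)

def recall(relevant, retrieved):
    """Fraction of all relevant documents that were retrieved."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return sum(1 for d in set(retrieved) if d in relevant) / len(relevant)

# Hypothetical collection: d1, d2, d5 are the relevant documents,
# and the system returned four documents.
relevant_docs = {"d1", "d2", "d5"}
retrieved_docs = ["d1", "d3", "d5", "d7"]

print(precision(relevant_docs, retrieved_docs))  # 2 relevant of 4 retrieved -> 0.5
print(recall(relevant_docs, retrieved_docs))     # 2 of 3 relevant found
```

A ranked-retrieval system would apply these measures at successive cutoffs of the ranked list rather than to a single unordered result set.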
IV. RELATED WORK
4.1 Keyword Extraction Algorithm:
   A keyword extraction algorithm that applies to a single document without using a corpus was proposed by Matsuo and Ishizuka [10]. Frequent terms are extracted first, and then a set of co-occurrences between each term and the frequent terms, i.e., occurrences in the same sentences, is generated. The co-occurrence distribution indicates the importance of a term in the document. However, this method only extracts keywords from a single document; it does not correlate further documents using anchor texts-based co-occurrence frequency.
4.2 Transitive Translation Approach:
   A transitive translation approach that finds translation equivalents of query terms and constructs multilingual lexicons by mining web anchor texts and link structures was proposed by Lu, Chien, and Lee [11]. The translation equivalents of a query term can be extracted via its translation in an intermediate language. However, this method did not associate anchor texts using the definition of co-occurrences.
4.3 Feature Selection Method:
   A novel feature selection method based on part-of-speech and word co-occurrence was proposed by Liu, Yu, Deng, Wang, and Bian [12]. According to the components of Chinese document text, they utilized the words' part-of-speech attributes to filter out many meaningless terms. They then defined and used co-occurrence words, selected by their part-of-speech, as features. The results showed that their method can select better features and obtain better clustering performance. However, this method does not use anchor texts-based co-occurrences on words.
4.4 Data Treatment Strategy:
   A data treatment strategy that generates new discriminative features, called compound features (c-features), for text classification was proposed by Figueiredo et al. [13]. These c-features are composed of terms that co-occur in documents, without any restrictions on the order of or distance between terms within a document. This strategy precedes the classification task, in order to enhance documents with discriminative c-features. However, this method does not correlate documents using anchor texts.
4.5 Alias Extraction Method:
   A method to extract aliases from the web for a given personal name was first proposed by Bollegala, Matsuo, and Ishizuka [9]. They used a lexical pattern approach to extract candidate aliases. Incorrect aliases were removed using page counts, anchor text co-occurrence frequency, and lexical pattern frequency. However, this method considered only the first order co-occurrences on aliases to rank them; it did not exploit second order co-occurrences to improve recall and achieve a substantial MRR for the web search engine.
V. PROPOSED METHOD
   The proposed method will work on aliases and obtain the association orders between a name and its aliases, helping the search engine tag those aliases according to those orders (first order associations, second order associations, etc.) so as to substantially increase the Recall and Mean Reciprocal Rank (MRR) of the search engine when searches are made on person names.
   The MRR applies to the binary relevance task: for a given query, every returned document is labeled "relevant" or "not relevant". If r_i is the rank of the highest ranking relevant document for the i-th query, then the reciprocal rank measure for that query is 1/r_i, and the MRR of the search engine for a given sample of queries is the average of the reciprocal ranks over those queries.
   The term word co-occurrence refers to the property of two words occurring on the same web page or in the same document on the web.
   The anchor text is the clickable text on a web page which points to a particular web document. Anchor texts are used by search engine algorithms to provide relevant documents in search results because they point to web pages that are relevant to user queries. Therefore, anchor texts are helpful for finding the strength of association between two words on the web.
   An anchor texts-based co-occurrence means that two anchor texts from different web pages point to the same URL on the web. The anchor texts which point to the same URL are called inbound anchor texts [9]. The proposed method will find the anchor texts-based co-occurrences between the name and its aliases using co-occurrence statistics and will rank the name and aliases with a Support Vector Machine (SVM) according to the co-occurrence measures, in order to obtain connections among the name and aliases for drawing the word co-occurrence graph.

Table 1. Contingency table for anchor texts p and x

             |   x     | C - {x}       | Total
   p         |   k     | n - k         | n
   V - {p}   |  K - k  | N - n - K + k | N - n
   Total     |   K     | N - K         | N

Figure 3: Proposed method - overview

   Then a word co-occurrence graph will be created and mined by a graph mining algorithm to obtain the hop distances between the name and aliases, which yield the association orders of the aliases with the name. The search engine can now expand the search query on a name by
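The MRR defined above can be sketched as follows. This is a minimal illustration under the binary relevance task; the query rankings are hypothetical:

```python
def mean_reciprocal_rank(rankings):
    """rankings: for each query, a list of 0/1 relevance labels in ranked order.
    The reciprocal rank of a query is 1/r_i, where r_i is the 1-based rank of
    the highest-ranking relevant document; it is 0 if no result is relevant.
    The MRR is the average of the reciprocal ranks over the queries."""
    reciprocal_ranks = []
    for labels in rankings:
        score = 0.0
        for rank, is_relevant in enumerate(labels, start=1):
            if is_relevant:
                score = 1.0 / rank
                break
        reciprocal_ranks.append(score)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

# Two hypothetical queries: first relevant hit at rank 1, and at rank 4.
print(mean_reciprocal_rank([[1, 0, 1], [0, 0, 0, 1]]))  # (1 + 0.25) / 2 = 0.625
```

Tagging aliases by association order aims to push a relevant page higher in the ranking, i.e., to shrink r_i and thereby raise the MRR.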
tagging the aliases according to their association orders to retrieve all relevant pages, which in turn will increase the Recall and achieve a substantial MRR.
   The proposed method is outlined in Figure 3 and comprises four main components, namely:
    Computation of word co-occurrence statistics
    Ranking anchor texts
    Creation of the anchor text co-occurrence graph
    Discovery of association orders
   To compute anchor texts-based co-occurrence measures, there are nine co-occurrence statistics [9] used in anchor text mining to measure the associations between anchor texts:
    Co-occurrence Frequency (CF)
    Term Frequency-Inverse Document Frequency (tfidf)
    Chi Square (CS)
    Log Likelihood Ratio (LLR)
    Pointwise Mutual Information (PMI)
    Hypergeometric Distribution (HG)
    Cosine
    Overlap
    Dice
   A Ranking Support Vector Machine will be used to rank the anchor texts with respect to each anchor text, in order to identify the highest ranking anchor text for making first order associations among anchor texts.
5.1 Co-occurrences in Anchor Texts
   The proposed method will first retrieve from the search engine all corresponding URLs for all anchor texts in which the name and aliases appear. Most search engines provide search operators to search in anchor texts on the web. For example, Google provides the inanchor: and allinanchor: search operators to retrieve URLs that are pointed to by the anchor text given as a query. For example, the query "allinanchor: Sachin Tendulkar" will cause Google to return all URLs pointed to by the anchor text Sachin Tendulkar on the web.
5.1.1 Role of Anchor Texts
   The main objective of a search engine is to provide the most relevant documents for a user's query. Anchor text plays a vital role in search engine algorithms because it is clickable text which points to a particular relevant page on the web. Hence, search engines consider anchor text a main factor in retrieving documents relevant to the user's query. Anchor texts are used in synonym extraction, in the ranking and classification of web pages, and in query translation in cross-language information retrieval systems.
5.1.2 Anchor Texts Co-occurrence Frequency
   Two anchor texts appearing on different web pages are called inbound anchor texts [9] if they point to the same URL. The anchor texts co-occurrence frequency [9] between anchor texts refers to the number of different URLs on which they co-occur. For example, if p and x are two co-occurring anchor texts, then p and x point to the same URL; if the co-occurrence frequency between p and x is, say, k, then p and x co-occur on k different URLs. For example, Figure 4 shows a picture of Sachin Tendulkar which is linked by six different anchor texts (Master Blaster, Little Master, Sachin, God of Cricket, Tendulkar, and Tendlya). According to the definition of co-occurrences on anchor texts, Little Master and Master Blaster are co-occurring. Likewise, Sachin and Tendulkar are also co-occurring.

Figure 4: A picture of Sachin Tendulkar being linked by different anchor texts on the web

   Next, the contingency table described in Table 1 will be created for each pair of anchor texts to measure their strength of association. Therein:
   x and p : the two input anchor texts
   C : the set of input anchor texts except p
   V : the set of all words that appear in anchor texts
   C - {x} : the anchor texts except x
   V - {p} : the anchor texts except p
   k : the co-occurrence frequency between p and x
   n : the sum of the co-occurrence frequencies between p and all anchor texts in C
   K : the sum of the co-occurrence frequencies between all words in V and x
   N : the sum of the co-occurrence frequencies between all words in V and all anchor texts in C
5.1.3 Word Co-occurrence Statistics
   To measure the association between anchor texts, nine popular measures will be used, calculated from Table 1.
5.1.3.1 Co-occurrence Frequency (CF)
   CF [9] is the simplest measure of all; it denotes the value of k in Table 1.
5.1.3.2 Term Frequency-Inverse Document Frequency (tfidf)
   CF is biased towards highly frequent words. tfidf [9, 14] resolves this bias by reducing the weight assigned to frequent words on anchor texts. The tfidf score for the anchor texts p and x is calculated from Table 1 as

      tfidf(p, x) = k log(N / (K + 1))   ...(1)

5.1.3.3 Chi Square (CS)
   The Chi Square test [9] is used to test the dependence between two words in natural language processing tasks. Given the contingency table in Table 1, the X² measure compares the observed frequencies in Table 1 with the expected frequencies under independence. The anchor texts p and x are likely to be dependent if the difference between the observed and expected frequencies is large. The X² measure sums the squared differences between the observed and expected frequencies, scaled by the expected values:

      X² = Σ_ij (O_ij - E_ij)² / E_ij   ...(2)

where O_ij and E_ij are the observed and expected frequencies, respectively. Using Equation (2), the X² score for the anchor texts p and x from Table 1 is

      CS(p, x) = N (k(N - K - n + k) - (n - k)(K - k))² / (nK(N - K)(N - n))   ...(3)

5.1.3.4 Log Likelihood Ratio (LLR)
   LLR [9, 15] is the ratio between the likelihoods of two alternative hypotheses: that the anchor texts p and x are independent
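The tfidf and chi-square measures above can be sketched directly from the Table 1 marginals. This is a minimal illustration, assuming hypothetical helper names `tfidf_score` and `chi_square` and hypothetical counts k, n, K, N:

```python
import math

def tfidf_score(k, K, N):
    # Equation (1): tfidf(p, x) = k * log(N / (K + 1))
    return k * math.log(N / (K + 1))

def chi_square(k, n, K, N):
    # Equation (3): the chi-square score for the 2x2 contingency table of
    # anchor texts p and x, with cell counts k, n-k, K-k, N-n-K+k.
    numerator = N * (k * (N - K - n + k) - (n - k) * (K - k)) ** 2
    denominator = n * K * (N - K) * (N - n)
    return numerator / denominator

# Hypothetical counts: p and x co-occur on 10 URLs (k), p co-occurs 20 times
# in total (n), x 30 times (K), and there are 100 co-occurrences overall (N).
print(tfidf_score(10, 30, 100))
print(chi_square(10, 20, 30, 100))
```

Equation (3) is the standard closed form of Equation (2) for a 2x2 table with row margins n, N - n and column margins K, N - K, so no explicit expected-frequency table is needed.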
or that they are dependent. LLR is calculated using Table 1 as follows:

      LLR(p, x) = k log(kN / (nK)) + (n - k) log((n - k)N / (n(N - K)))
                + (K - k) log((K - k)N / (K(N - n)))
                + (N - K - n + k) log((N - K - n + k)N / ((N - K)(N - n)))   ...(4)

5.1.3.5 Pointwise Mutual Information (PMI)
   PMI [9, 16] reflects the dependence between two probabilistic events. The PMI for two events y and z is defined as

      PMI(y, z) = log₂( P(y, z) / (P(y) P(z)) )   ...(5)

where P(y) and P(z) represent the probabilities of the events y and z, respectively, and P(y, z) is their joint probability. The PMI is calculated from Table 1 as

      PMI(p, x) = log₂( kN / (Kn) )   ...(6)

5.1.3.6 Hypergeometric Distribution (HG)
   The hypergeometric distribution [9, 16] is a discrete probability distribution that represents the number of successes in a sequence of draws from a finite population without replacement. For example, the probability of the event that "k red balls are contained among n balls, which are arbitrarily selected from among N balls containing K red balls" is given by the hypergeometric distribution hg(N, K, n, k) as

      hg(N, K, n, k) = C(K, k) C(N - K, n - k) / C(N, n)   ...(7)

where C(a, b) denotes the binomial coefficient. The hypergeometric distribution is applied to the values of Table 1, and HG(p, x) is computed as the probability of observing k or more co-occurrences of p and x:

      HG(p, x) = -log₂( Σ_{l=k}^{min(n, K)} hg(N, K, n, l) )   ...(8)

5.1.3.7 Cosine
   Cosine [9] computes the association between anchor texts. The association between the elements of two sets X and Y is computed as

      cosine(X, Y) = |X ∩ Y| / (√|X| √|Y|)   ...(9)

where |X| represents the number of elements in the set X. Taking X to be the co-occurrences of the anchor text x and Y to be the co-occurrences of the anchor text p, the cosine measure is computed from Table 1 as

      cosine(p, x) = k / (√n √K)   ...(10)

5.1.3.8 Overlap
   The overlap [9] between two sets X and Y is defined as

      overlap(X, Y) = |X ∩ Y| / min(|X|, |Y|)   ...(11)

Assuming that X and Y represent the occurrences of the anchor texts p and x, respectively, the overlap of (p, x) used to evaluate the appropriateness is defined as

      overlap(p, x) = k / min(n, K)   ...(12)

5.1.3.9 Dice
   Dice [9, 17] retrieves collocations from large textual corpora. The Dice coefficient is defined over two sets X and Y as

      Dice(X, Y) = 2|X ∩ Y| / (|X| + |Y|)   ...(13)

      Dice(p, x) = 2k / (n + K)   ...(14)

5.2 Ranking Anchor Texts
   A Ranking SVM [9, 18] will be used for ranking the aliases. The ranking SVM will be trained with training samples of names and aliases. All the co-occurrence measures for the anchor texts of the training samples will be computed and normalized into the range [0, 1]. The normalized values, termed feature vectors, will be used to train the SVM to obtain a ranking function with which to test the given anchor texts of the name and aliases. Then, for each anchor text, the trained SVM will use the ranking function to rank the other anchor texts with respect to their co-occurrence measures with it. The highest ranking anchor text will be elected to make a first order association with the anchor text for which the ranking was performed. Next, the word co-occurrence graph will be drawn for the name and aliases according to the first order associations between them.

Figure 5: Word co-occurrence graph for the personal name "Sachin Tendulkar"

5.3 Word Co-occurrence Graph
   The word co-occurrence graph is an undirected graph in which the nodes represent words that appear in anchor texts on the web. For each word in an anchor text, a node is created in the graph. According to the definition of co-occurrences, if two anchor texts co-occur in pointing to the same URL, an undirected edge is drawn between them to denote their co-occurrence. A word co-occurrence graph like that shown in Figure 5 will be created for the name and aliases according to the first order associations among them. The name and each alias will be represented by a node in the graph. Two nodes will be connected if they make a first order association between them; the edge between two nodes denotes that the anchor texts they bear co-occur according to the definition of anchor texts co-occurrences. Next, the hop distances between nodes will be identified by a graph mining algorithm in order to obtain the first, second, and higher order associations between the name and aliases.
5.4 Discovery of Association Orders
   Using a graph mining algorithm [19, 20], the word co-occurrence graph will be mined to find the hop distances
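The hop-distance step on the word co-occurrence graph can be sketched with a breadth-first search. This is a minimal illustration, not the paper's graph mining algorithm [19, 20]; the edge list and node names are hypothetical, loosely following Figure 5:

```python
from collections import deque

def association_orders(edges, name):
    """Breadth-first search over an undirected word co-occurrence graph.
    Returns the hop distance (association order) from `name` to every
    reachable node: 1 = first order association, 2 = second order, etc."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    orders = {name: 0}
    queue = deque([name])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour not in orders:
                orders[neighbour] = orders[node] + 1
                queue.append(neighbour)
    return orders

# Hypothetical first order edges around the name "Sachin Tendulkar".
edges = [("Sachin Tendulkar", "Little Master"),
         ("Sachin Tendulkar", "Master Blaster"),
         ("Master Blaster", "Tendlya")]
print(association_orders(edges, "Sachin Tendulkar"))
# "Little Master" and "Master Blaster" are first order; "Tendlya" second order.
```

The search engine would then expand a query on the name with its aliases, weighted or tagged according to these orders.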
between nodes in the graph. The hop distance between two nodes is measured by counting the number of edges between the corresponding two nodes; the number of edges yields the association order between the two nodes. According to the definition, a node that lies n hops away from p has an n-order co-occurrence with p. Hence, the first, second, and higher order associations between the name and aliases will be identified by finding the hop distances between them. The search engine can now expand the query on person names by tagging the aliases according to their association orders with the name. Thereby the recall will be substantially improved, by 40% in the relation detection task. Moreover, the search engine will obtain a substantial MRR for a sample of queries by giving relevant search results.
5.5 Data Set
   To train and evaluate the proposed method, there are two data sets: a personal names data set and a place names data set. The personal names data set includes people from the fields of cinema, sports, politics, science, and the mass media. The place names data set contains aliases for US states.
VI. CONCLUSION
   The proposed method will compute anchor texts-based co-occurrences among a given personal name and its aliases, and will create a word co-occurrence graph by making connections between the nodes representing the name and aliases based on their first order associations with each other. A graph mining algorithm that finds the hop distances between nodes will be used to identify the association orders between the name and aliases. A Ranking SVM will be used to rank the anchor texts according to the co-occurrence statistics in order to identify the anchor texts in the first order associations. The web search engine can expand the query on a personal name by tagging aliases in the order of their associations with the name to retrieve all relevant results, thereby improving recall and achieving a substantial MRR compared to that of previously proposed methods.
                              REFERENCES
1) A. Arasu, J. Cho, H. Garcia-Molina, A. Paepcke, and S. Raghavan, "Searching the web," ACM Transactions on Internet Technology, vol. 1, no. 1, pp. 2-43, 2001.
2) S. Brin and L. Page, "The anatomy of a large-scale hypertextual Web search engine," Computer Networks and ISDN Systems, vol. 30, no. 1-7, pp. 107-117, 1998.
3) M. Diligenti, F. Coetzee, S. Lawrence, C. L. Giles, and M. Gori, "Focused crawling using context graphs," in Proc. 26th International Conference on Very Large Databases (VLDB 2000), pp. 527-534, Sep. 2000.
4) F. Menczer, G. Pant, P. Srinivasan, and M. E. Ruiz, "Evaluating topic-driven web crawlers," in Proc. 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 241-249, 2001.
5) S. Pandey and C. Olston, "User-centric web crawling," in Proc. WWW 2005, pp. 401-411, 2005.
6) A. Z. Broder, M. Najork, and J. L. Wiener, "Efficient URL caching for World Wide Web crawling," in Proc. International Conference on World Wide Web, 2003.
7) S. Chakrabarti, Mining the Web: Discovering Knowledge from Hypertext Data, Morgan Kaufmann, 2002.
8) J. Cho, H. Garcia-Molina, and L. Page, "Efficient crawling through URL ordering," Computer Networks and ISDN Systems, vol. 30, no. 1-7, pp. 161-172, 1998.
9) D. Bollegala, Y. Matsuo, and M. Ishizuka, "Automatic Discovery of Personal Name Aliases from the Web," IEEE Transactions on Knowledge and Data Engineering, vol. 23, no. 6, June 2011.
10) Y. Matsuo and M. Ishizuka, "Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information," International Journal on Artificial Intelligence Tools, 2004.
11) W. Lu, L. Chien, and H. Lee, "Anchor Text Mining for Translation of Web Queries: A Transitive Translation Approach," ACM Transactions on Information Systems, vol. 22, no. 2, pp. 242-269, April 2004.
12) Z. Liu, W. Yu, Y. Deng, Y. Wang, and Z. Bian, "A Feature Selection Method for Document Clustering based on Part-of-Speech and Word Co-occurrence," in Proc. 7th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD 10), pp. 2331-2334, Aug. 2010.
13) F. Figueiredo, L. Rocha, T. Couto, T. Salles, M. A. Goncalves, and W. Meira Jr., "Word Co-occurrence Features for Text Classification," vol. 36, no. 5, pp. 843-858, July 2011.
14) G. Salton and C. Buckley, "Term-Weighting Approaches in Automatic Text Retrieval," Information Processing and Management, vol. 24, pp. 513-523, 1988.
15) T. Dunning, "Accurate Methods for the Statistics of Surprise and Coincidence," Computational Linguistics, vol. 19, pp. 61-74, 1993.
16) T. Hisamitsu and Y. Niwa, "Topic-Word Selection Based on Combinatorial Probability," in Proc. Natural Language Processing Pacific-Rim Symposium (NLPRS '01), pp. 289-296, 2001.
17) F. Smadja, "Retrieving Collocations from Text: Xtract," Computational Linguistics, vol. 19, no. 1, pp. 143-177, 1993.
18) T. Joachims, "Optimizing Search Engines using Clickthrough Data," in Proc. ACM SIGKDD '02, 2002.
19) D. Chakrabarti and C. Faloutsos, "Graph Mining: Laws, Generators, and Algorithms," ACM Computing Surveys, vol. 38, article 2, March 2006.
20) C. C. Aggarwal and H. Wang, "Graph Data Management and Mining: A Survey of Algorithms and Applications," DOI 10.1007/978-1-4419-6045-0_2, Springer Science+Business Media, LLC, 2010.

V. KISHORE, M.Tech, (Ph.D), is working as Associate Professor and Head of the Department of Computer Science & Engineering, SANA Engineering College, Kodad, Nalgonda. He has published 4 international journal papers. His areas of interest include network security and information retrieval systems.

V. SARITHA is pursuing the M.Tech (CSE) at SANA Engineering College, Kodad, Nalgonda, Andhra Pradesh, India.

G. RAMAMOHAN RAO, M.Tech, is working as Assistant Professor in the Department of Computer Science and Engineering, SANA Engineering College, Kodad, Nalgonda.

M. RAMAKRISHNA is pursuing the M.Tech (CSE) at SANA Engineering College, Kodad, Nalgonda, Andhra Pradesh, India.