
International Journal of Computer Engineering and Technology (IJCET), ISSN 0976 – 6367(Print), ISSN 0976 – 6375(Online) Volume 4, Issue 3, May – June (2013), © IAEME
A NOVEL APPROACH TOWARDS DEVELOPING A STATISTICAL
DEPENDENT AND RANKING MEASURE FOR KEYWORD SEARCH
OVER XML DATA
Dayananda P (1), Dr. Rajashree Shettar (2)
(1) Assistant Professor, Department of Information Science and Engg, MSRIT, Bangalore-54
(2) Professor, Department of Computer Science and Engg, RVCE, Bangalore-59
ABSTRACT
Extensible Markup Language (XML) defines a set of conventions for encoding documents in a format that is both human-readable and machine-readable. XML is widely used to represent arbitrary data structures. Since XML is widely accepted as a standard for data representation, it is a natural target for keyword search. In this paper, a statistical dependent and ranking measure for keyword search over XML data is proposed. The proposed method consists of the following steps: 1) Indexing, 2) Selecting the exact T-type node, 3) Data search and ranking of search results. A node type T is considered the desired search-for node type if XML nodes of that type are informative enough to contain relevant information and T relates to every keyword in the query. First, the input XML data is passed to an indexing process that converts it into an indexed format to make search easier. Then, the corresponding T-type node is selected through our proposed statistical dependent formulae. Once the T-type node is selected, the relevant data is obtained by sorting the node type paths. Finally, the search results obtained from the previous steps are ranked with our designed ranking measure. This work addresses the two challenges tackled by the TF*IDF strategy and improves the effectiveness of both the search for node type and the ranking of search results.
Keywords: XML keyword search, Indexing, Search for node type, Data search, Ranking measure.
INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING
& TECHNOLOGY (IJCET)
ISSN 0976 – 6367(Print)
ISSN 0976 – 6375(Online)
Volume 4, Issue 3, May-June (2013), pp. 229-247
© IAEME: www.iaeme.com/ijcet.asp
Journal Impact Factor (2013): 6.1302 (Calculated by GISI)
www.jifactor.com
1. INTRODUCTION
The Internet is a repository for large amounts of information, and the quantity of XML data shared over the World Wide Web is growing rapidly. Although the large majority of this XML data is data-centric, text-centric XML document collections are becoming more and more common. As a result, it has become useful to provide means to manage these collections. This can be done with document-clustering methods that automatically arrange very large collections into smaller sub-collections. Unfortunately, the majority of the research on structured document processing [1], [3] is still focused on data-centric XML. The processing and management of XML documents [4] have already become popular research issues, with the major difficulty being the need to index the documents optimally for storage and retrieval. Several search methods have been developed in the IR research community that depend on a set of weighted keywords in a search query to decide the proximity of the query and a document in the feature space. However, searching XML documents departs from conventional information retrieval, because XML documents have nested elements and tags that indicate the semantics of the values they contain. As a result, the notion of keyword proximity used in IR [13] is too simple to be effective for XML search.
Keyword search is a convenient way to query XML documents, since it lets users issue keyword queries without knowledge of complex query languages or the structure of the underlying data. Most research efforts in XML keyword search focus on keyword proximity search in either the tree model or the general digraph model. Both approaches commonly assume that a smaller sub-structure of the XML document that contains all query keywords indicates a better result. Smallest Lowest Common Ancestor (SLCA) is a simple and effective semantics in the tree model for XML keyword proximity search [15, 8]. Every SLCA result of a keyword query is a smallest XML node that 1) covers all keywords in its descendants and 2) has no single proper descendant that covers all query keywords. Being based on the tree model, however, the SLCA semantics does not capture ID reference information, which is commonly available and significant in XML databases. As a result, it may return a large tree containing irrelevant data. XML documents may instead be modeled as digraphs to take ID reference edges into account. The main concept in the digraph model, which looks for minimal connected subtrees in the graph, is called reduced subtrees [14]. However, the problem of finding all reduced subtrees and enumerating results in increasing order of reduced subtree size is NP-hard [17, 10].
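The two SLCA conditions can be illustrated with a small brute-force sketch. It is not the algorithm of [15, 8]: it assumes that each node is identified by a hypothetical Dewey-style label (a tuple giving the path from the root) and that a node matching a keyword also counts as covering it.

```python
def ancestors_or_self(label):
    """All prefixes of a Dewey-style label, i.e. the node and its ancestors."""
    return {label[:i] for i in range(1, len(label) + 1)}


def slca(matches):
    """Return the SLCA nodes for a keyword query.

    matches: dict mapping each keyword to the list of Dewey labels (tuples)
    of the nodes that contain it. The labels are an illustrative assumption.
    """
    covering = None
    for nodes in matches.values():
        cover = set()
        for node in nodes:
            cover |= ancestors_or_self(node)
        # Keep only nodes that cover every keyword seen so far.
        covering = cover if covering is None else covering & cover
    covering = covering or set()
    # Condition 2: drop any node that has a proper descendant covering all keywords.
    return sorted(
        c for c in covering
        if not any(o != c and o[:len(c)] == c for o in covering)
    )


# Hypothetical match lists for the query "XML twig".
example = {"xml": [(1, 1, 1), (1, 2, 1)], "twig": [(1, 1, 2), (1, 3)]}
print(slca(example))
```

Here both the root `(1,)` and node `(1, 1)` cover the two keywords, but only `(1, 1)` is reported because the root has a proper descendant that already covers them.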
Current XML keyword and natural language query answering approaches depend on heuristics that assume certain properties of the database schema. Though these heuristics are intuitively reasonable, they are sufficiently ad hoc that they are often violated in practice, even in the highest-quality XML schemas. Thus, present approaches suffer from low precision, low recall, or both [19]. Attention is now turning to the effectiveness of such search systems for end-user queries. Traditional IR similarity metrics have been ported to the new domain and combined with domain-specific structural features. There is also evidence of significant improvements in effectiveness, both through developing new methods and through tuning existing ones [20].
The motivation of our research is to design and develop a technique for keyword search over XML data. The work presented in [10] on XML search, which uses a TF*IDF strategy to address two challenges, is our direct motivation. When analyzing the existing work [10], we found that the term frequency-based score computation was not very effective in selecting the exact T-type node. Incorporating other features along with frequency can lead to a more effective T-type search in XML data. In addition, when the number of search results returned to a user is large, the ranking of the results becomes more important. This problem can be addressed by an effective ranking mechanism.
The two challenges mentioned above are addressed by the proposed methodology. In addition, this work addresses effectiveness and efficiency in terms of result relevance by tackling the challenges addressed in [10]: identifying the user's search intention, resolving keyword ambiguity, and ranking the search results effectively. The proposed method consists of the following steps:
1) Indexing: The input XML data is passed to an indexing process that converts it into two indices (data index and node index), which make search easier.
2) Selecting the exact T-type node: The corresponding T-type nodes are selected through our designed statistical dependent formulae, Dscore and Tscore.
3) Data search and ranking of search results: Once the T-type nodes are selected, the relevant data are obtained by sorting the node type paths. Finally, the search results obtained from the previous steps are ranked with our designed ranking measure, which uses a correlation measure.
The rest of the paper is organized as follows. Section 2 reviews the literature on keyword search over XML data, and Section 3 presents the research methodology. Section 4 discusses the proposed method, and Section 5 discusses the experiments and results. Section 6 concludes the paper.
2. RELATED WORK
Jianhua Feng and Guoliang Li et al. [5] presented fuzzy type-ahead search in XML data, an information-access paradigm in which the system searches XML data on the fly as the user types query keywords. It allows users to explore data as they type, even in the presence of minor errors in their keywords. Their approach has the following features: 1) Search as you type: it extends autocomplete by supporting queries with multiple keywords in XML data. 2) Fuzzy: it can find high-quality answers whose keywords match the query keywords approximately. 3) Efficient: effective index structures and top-k search algorithms achieve a very high interactive speed. They also examined effective ranking functions and early termination techniques to progressively identify the top-k relevant answers. Their implementation achieved high search efficiency and result quality.
Wei Wa et al. [6] presented a multidimensional search approach that allows users to perform fuzzy searches on structure and metadata conditions in addition to keyword conditions. Their techniques score each dimension individually and integrate the three dimension scores into a meaningful unified score. They also designed indexes and algorithms to efficiently identify the most relevant files that match multidimensional queries. Experimental evaluation showed that their relaxation and scoring framework for fuzzy query conditions in non-content dimensions can significantly improve ranking accuracy.
Ziyang Liu et al. [7] presented an XML search engine, Target Search, that addresses an open problem in XML keyword search: given relevant matches to keywords, how to compose query results properly so that they can be effectively ranked and easily digested by users. Intuitively, each query has a search target, and each result should contain exactly one instance of the search target along with its evidence. Target Search composes atomic and intact query results driven by users' search targets.
Chunxiao Liu et al. [8] presented a user-friendly top-k keyword search approach based on the relationships among keywords. The SLCA of a keyword search is first obtained by the LISA II algorithm. Then the structure of the SLCA is used to infer the relationships among keywords, i.e., the keyword search is translated into twig queries. Next, the relationships among keywords are estimated from the structure of the twig queries, and the twig queries are ranked accordingly. Finally, all results of the ordered twig queries are obtained by the TJFast algorithm.
Yiqun Chen and Jinyin Cao [9] presented an approach to type-ahead keyword search in XML data called Take XIR. The IR-style approach uses the statistics of the underlying XML data to address the following challenges in an XML IR system: (1) identifying the user's search intention, i.e., identifying the keywords that express the user's interests and the nodes the user wants to search for and search via; (2) resolving keyword ambiguity problems: synonyms and polysemy exist in natural language, and a keyword can appear as the text value or tag name of different XML nodes and carry different meanings. They modeled XML data as a graph, analyzed the identification of user search intention and result ranking in the presence of keyword ambiguities, and used the related definitions and formulae to build a query prediction technique that improves search efficiency.
Jiang Li and Junhu Wang [11] observed that XML keyword search provides a simple and user-friendly way of retrieving data from XML databases, but that the ambiguities of keywords make it difficult to answer keyword queries effectively. XReal uses the statistics of the underlying data to resolve keyword ambiguity problems. However, they found that its formula for inferring the search-for node type suffers from inconsistency and abnormality problems. They proposed a dynamic reduction factor scheme and an algorithm, Dynamic Infer, to resolve these two problems, and provided experimental results to verify their effectiveness.
Liang Jeff Chen and Yannis Papakonstantinou [12] presented a series of algorithms that incorporate both efficient semantic pruning and top-K processing to support top-K keyword search [23]. They presented a join-based algorithm that processes nodes bottom-up and reduces keyword query evaluation to relational joins. Several optimizations were presented to further improve its efficiency. They then incorporated the idea of the top-K join from relational databases and presented a join-based top-K algorithm to compute the top-K results. Extensive experimental results confirmed the advantages of their algorithms over previous algorithms in both efficiency and top-K processing.
Zhifeng Bao et al. [10] studied the problem of effective XML keyword search, which includes the identification of user search intention and result ranking in the presence of keyword ambiguities. They used statistics to infer user search intention and rank the query results. In particular, they defined XML TF and XML DF, based on which they designed formulae to compute the confidence level of each candidate node type to be a search-for or search-via node, and further proposed an XML TF*IDF similarity ranking scheme to capture the hierarchical structure of XML data. Finally, the popularity of a query result (captured by ID reference relationships) was considered to handle the case in which multiple results have comparable relevance scores.
As an extension of [10], this work makes several major updates: 1) our ranking framework uses the correlation concept described in Section 4, which outperforms the ranking concepts in [10]; 2) the selection of the exact T-type node is taken into consideration in Section 4; 3) a new index and algorithm are designed in Section 4.
3. RESEARCH METHODOLOGY
Definition 3.1 (Structural Node) An XML node labeled with a tag name is called a structural node. A structural node that has other structural nodes as children is called an internal node; otherwise, it is called a leaf node.
Definition 3.2 (T-type node) A node type T is considered the desired search-for node type if it is intuitively related to every query keyword, XML nodes of type T are informative enough to contain enough relevant information, and XML nodes of type T are not so overwhelming that they contain too much irrelevant information.
Definition 3.3 (Data Node) A leaf node of the XML data that contains a text value and has no tag name is called a data node.
The primary intention of our research is to design and develop a technique for keyword search over XML data. The work is motivated by the XML search technique given in [10], which uses a TF*IDF strategy to address two challenges. Our analysis of [10] found that the term frequency-based score computation was not very effective in selecting the exact T-type node. Incorporating other features along with frequency can lead to a more effective T-type search in XML data. Also, the ranking of the search results is important for users when the number of results is large. This problem can be addressed by an effective ranking mechanism.
These two challenges are addressed by the proposed methodology. The proposed method consists of three major steps: 1) Indexing, 2) Selecting the exact T-type node, 3) Data search and ranking of search results. First, the input XML data is passed to an indexing process that converts it into an indexed format to make search easier. Then, the corresponding T-type nodes are selected through our designed statistical dependent formulae. Once the T-type nodes are selected, the relevant data are obtained by similarity matching with the input query. Finally, the search results obtained from the previous steps are ranked with our designed ranking measure. The proposed algorithm is implemented in Java, and its performance is compared with that of the existing algorithm in terms of precision, recall and ranking measure with two different datasets.

4. PROPOSED METHOD
1. Indexing
The approach presented in [10] builds two indices for data processing: a keyword inverted list and a frequency table. For an input keyword, the keyword inverted list retrieves a list of data nodes, in document order, whose values contain that keyword. A B+-tree index is built on top of each inverted list. The second index, called the frequency table, stores only the frequency (the number of T-typed nodes that contain keyword k in their subtrees in the XML data) for each combination of keyword k and node type T in the XML document. When a query keyword is searched, the approach presented in [10] does not identify whether the keyword is a node name or a data value, which leads to more complex query processing.
To overcome these drawbacks, a specific indexing method is proposed that builds two indices, a node index and a data index, for structural nodes and data nodes respectively. These two indices are shown in Table 1 and Table 2 for the DBLP XML document. In contrast to the indices presented in [10], the node index stores the node name of each structural node, the frequency of occurrence of each structural node either in T-typed nodes or in their subtrees, and the prefix path of the corresponding T-typed nodes. The data index stores the value of each data node, the corresponding node name, and the frequency of occurrence of each data node in the XML document. The data index is linked to the node index through the node name. Scores computed from the two indices are used to determine the exact T-typed node for a given keyword query. Thus, the proposed indexing approach handles structural nodes and data separately in the XML database, which results in effective query processing. Fig 1 shows the partial structure of the DBLP XML database, and Fig 2 shows a partial data subtree of the DBLP XML database.
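A minimal sketch of how two such indices might be built is given below. It is an illustration under stated assumptions, not the authors' Java implementation: the function name `build_indices`, the tiny `SAMPLE` document, and the choice to index only leaf elements are all hypothetical. The node index is keyed by (node name, prefix path) and the data index by (text value, node name), mirroring the columns of Table 1 and Table 2.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Tiny stand-in document; the real DBLP data set is far larger.
SAMPLE = """<dblp>
  <article><author>Frank Manola</author><year>1995</year></article>
  <article><author>Sandra Heiler</author><year>1993</year></article>
  <phdthesis><author>Tolga Yurek</author><year>1997</year></phdthesis>
</dblp>"""


def build_indices(xml_text):
    """Return (node_index, data_index) as frequency counters."""
    root = ET.fromstring(xml_text)
    node_index = Counter()  # (node name, prefix path) -> frequency
    data_index = Counter()  # (text value, node name) -> frequency

    def walk(elem, prefix):
        path = prefix + (elem.tag,)
        for child in elem:
            is_leaf = len(child) == 0
            text = (child.text or "").strip()
            if is_leaf and text:
                node_index[(child.tag, ",".join(path))] += 1
                data_index[(text, child.tag)] += 1
            walk(child, path)

    walk(root, ())
    return node_index, data_index


node_index, data_index = build_indices(SAMPLE)
for (name, path), freq in sorted(node_index.items()):
    print(name, freq, path)
```

Keeping node names and data values in separate counters is what lets a query keyword be classified as a node name or a data value with a single lookup in each index.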
Fig.1. Partial data tree structure for ‘DBLP’ XML database
[Figure: the root node dblp has the children inproceedings, phdthesis, article and mastersthesis, whose child elements include author, title, year, school, ee, cdrom, url, month, pages and booktitle, each with a sample text value.]
Sr. no.   Node        Frequency   Path
300       author      212898      dblp,article
302       url         106805      dblp,article
303       publisher   4           dblp,article
307       year        72          dblp,phdthesis
311       publisher   3           dblp,phdthesis
319       author      14          dblp,www
320       editor      21          dblp,www
321       booktitle   1           dblp,www
324       title       2609        dblp,proceedings
326       series      1955        dblp,proceedings
Table 1: Node index
Sr. no.   Data                                        Node     Frequency
30        db/labs/gte/index.html#TR-0169-12-91-165    url      1
32        db/labs/gte/TR-0231-08-93-165.html          ee       1
33        Sandra Heiler                               author   7
35        TR-0231-08-93-165                           volume   8
36        1993                                        year     4144
38        GTE/MANO93c.pdf                             cdrom    1
42        June                                        month    5
44        db/labs/gte/index.html#TM-0014-06-88-165    url      1
45        GTE/MANO88.pdf                              cdrom    1
46        db/labs/gte/TM-0332-11-90-165.html          ee       1
Table 2: Data index
Fig 2. Partial data sub tree structure for ‘DBLP’ XML database
3. SEARCH FOR NODE TYPE-T
When selecting the exact T-type node for a given keyword query, a tag that matches a keyword may occur many times in different T-type nodes and their subtrees, which makes the search for node type more complex. To overcome this drawback, we propose two mathematical scores from which the optimal T-type nodes are selected: 1) Dscore and 2) Tscore. Dscore is the ratio of the depths of the ancestor nodes of the keywords in a given query, and Tscore gives the percentage score of each node type having the best depth score (Dscore).
a) Dscore
For a given input query ‘q’, the depth of the lowest common ancestor (LCA) node of all the keywords in the query and the depth of the highest common ancestor (HCA) node of the same keywords are computed first. The ratio of these two depths is the Dscore:
$$ D_{score} = \frac{\text{depth of LCA node}}{\text{depth of HCA node}} \qquad (1) $$
The LCA nodes with the lowest Dscore values are selected as the probable node types for the given query ‘q’. From this set of candidates, the best node is selected as the T-type node for the given query keywords. To do so, a Tscore percentage is estimated.
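The Dscore filter can be sketched as follows, reading equation (1) as the depth of the LCA node divided by the depth of the HCA node. The candidate paths and their depth values below are hypothetical, and the function names are chosen only for this illustration.

```python
def d_score(lca_depth, hca_depth):
    """Equation (1), read as depth of LCA node over depth of HCA node."""
    if hca_depth <= 0:
        raise ValueError("HCA depth must be positive")
    return lca_depth / hca_depth


def lowest_d_score_candidates(candidates):
    """Keep the candidate paths whose Dscore is the minimum.

    candidates: dict mapping a node type path to (lca_depth, hca_depth).
    """
    scored = {path: d_score(l, h) for path, (l, h) in candidates.items()}
    best = min(scored.values())
    return sorted(path for path, score in scored.items() if score == best)


# Hypothetical depths, with the root counted as depth 1.
candidates = {
    "dblp,book": (2, 2),
    "dblp,article": (2, 2),
    "dblp,www": (3, 2),
}
print(lowest_d_score_candidates(candidates))
```

Several paths may tie on the lowest Dscore, which is why a second score is needed to choose among them.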
b) Tscore
The Tscore percentage is estimated by defining the score, for a keyword query, as the chance that a keyword ‘k’ occurs at node type T. This can be expressed as a conditional probability. If ‘q’ and ‘T’ are events, the conditional probability of ‘q’ given ‘T’ is denoted by P(q/T). With respect to the above definition and notation, the conditional probability is expressed as:
$$ P(q/T) = \frac{P(q \cap T)}{P(T)} \qquad (2) $$
where P(q/T) is the probability of event ‘q’ given that event ‘T’ has occurred, P(q ∩ T) is the probability that events ‘q’ and ‘T’ occur together, and P(T) is the probability of occurrence of event ‘T’.
[Fig 2 node labels: dblp; article; month: “November”; ee: “db/labs/gte/TR-0310-11-95-165.html”; author: “Frank Manola”; cdrom: “GTE/MAN095 pdf”]
With reference to the conditional probability P(q/T), that is, the probability of ‘q’ given ‘T’, equation (2) can be written as the sum, over the query keywords, of the probability of occurrence of each keyword at node type T:
$$ P(q/T) = \sum_{k \in (q \cap T)} \frac{P(k)}{P(T)} \qquad (3) $$
Since P(T) is constant for the keywords (‘k’ = 1 to n) in the query,
$$ P(q/T) = \frac{1}{P(T)} \times \sum_{k \in (q \cap T)} P(k) \qquad (4) $$
$$ P(q/T) = \alpha \times \sum_{k=1}^{n} P(k), \quad \alpha = \frac{1}{P(T)} \qquad (5) $$
Thus, to estimate the best node type T, the percentage frequency of occurrence of ‘k’ at that node type is essential. It is taken as the Tscore% of a particular node, and the node with the highest Tscore% is the relevant node type. Therefore,
$$ T_{score} = \alpha \times \sum_{k=1}^{n} P(k) \qquad (6) $$
P(k) can also be defined as the frequency of occurrence of ‘k’ at node type ‘T’, f(k), and P(T) as the frequency of node type T, f(T). Equation (6) then becomes
$$ T_{score} = \alpha \times \sum_{k=1}^{n} f(k), \quad \text{for } \alpha = \frac{1}{f(T)} \qquad (7) $$
Thus the Tscore percentage is defined as
$$ T_{score\%} = \alpha \times \sum_{k=1}^{n} f(k) \times 100 \qquad (8) $$
The percentage score of the optimal node type, Tscore%, is thus the frequency of occurrence of the query keywords at a particular node type relative to the frequency of occurrence of that node type, expressed as a percentage, as defined in equation (8).
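Equation (8) can be checked with a few lines of code. The sketch below assumes that the keyword frequencies f(k) and the node type frequency f(T) have already been read from the indices; the function names and the sample numbers are hypothetical.

```python
def t_score_percent(keyword_freqs, node_type_freq):
    """Equation (8): (sum of f(k) over the query keywords) / f(T) * 100."""
    if node_type_freq <= 0:
        raise ValueError("node type frequency must be positive")
    return sum(keyword_freqs) / node_type_freq * 100


def best_node_type(candidates):
    """Pick the node type with the highest Tscore%.

    candidates: dict mapping a node type to (list of f(k), f(T)).
    """
    return max(candidates, key=lambda t: t_score_percent(*candidates[t]))


# Hypothetical frequencies for a two-keyword query.
candidates = {
    "book": ([3, 2], 10),      # (3 + 2) / 10 * 100
    "article": ([1, 1], 20),   # (1 + 1) / 20 * 100
}
for node_type, (freqs, f_t) in candidates.items():
    print(node_type, t_score_percent(freqs, f_t))
print(best_node_type(candidates))
```

With these numbers the node type book scores higher than article, so it would be chosen as the search-for node type.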
4. DATA SEARCH AND RANKING
Consider an input keyword query containing ‘n’ keywords. After pre-processing the XML document with the proposed indexing technique, we consult two indices for each keyword in the query: the data index and the node index. The data index holds each data value with its frequency and node type information, whereas the node index holds each node name with its frequency and path information. The proposed XML keyword search is carried out in the following steps:
1. Identify the search intent of the user. To identify the desired search-for node type, we first estimate the Dscore of the LCA nodes in the XML document using equation (1) and choose the nodes with the least Dscore.
2. For each node type with a valid Dscore, evaluate its Tscore% using equation (8) and choose the node type with the maximum Tscore% as the best search-for node type.
3. For the desired search-for node type T computed from the valid Tscore%, sort the prefix paths of the node type. The sorted prefix paths of the search-for node type are then ranked by the correlation between them.
Algorithm 1:
Input: Query q; Node_index; Data_index
// Keyword matching against the two indices
index()
{
    if (q matches a node name and Node_index != null)
        for (each entry i in Node_index)
        {
            q = keyword[Node_index[i]];
            f = get_nodefrequency(q);
        }
    else if (q matches a data value and Data_index != null)
        for (each entry i in Data_index)
        {
            q = keyword[Data_index[i]];
            f = get_datafrequency(q);
        }
}
// Search for node type
get_Dscore()
{
    if (Dscore() == min) then
    {
        get_Tscore();
        node_type = max[Tscore()];
    }
}
// Ranking
get_corr()
{
    if (sum_corr() == max) then
        Ry = max[sum_corr()];
    check_threshold()
    {
        if (Rank1 - Rank2) < Threshold
            then select lowest Tscore
        else select Rank1
    }
}
In Algorithm 1, the function get_nodefrequency calculates the frequency of T-type nodes containing all the query keywords, and the function get_datafrequency retrieves the number of data nodes present under each T-type node. The function get_Dscore retrieves the list of paths with the lowest Dscore value; from the output of get_Dscore, the path with the highest Tscore is selected. Finally, ranking is done through the get_corr function, by finding the correlation between all paths.

In general, any statistical relationship between two random variables or two sets of data is referred to as dependence, and correlation refers to any of a broad class of statistical relationships involving dependence. There are several correlation coefficients that measure the degree of correlation. The most commonly used is Pearson’s correlation coefficient, which is obtained by dividing the covariance of the two variables by the product of their standard deviations. Since we have two series of n sorted paths, X and Y, written as Xi and Yi where i = 1, 2, ..., n, the sample correlation coefficient is used to estimate the population Pearson correlation ‘r’ between X and Y. The sample correlation coefficient used for ranking is written as:
$$ r_{xy} = \frac{\sum_{i=1}^{n} (x_i - x')(y_i - y')}{\sqrt{\sum_{i=1}^{n} (x_i - x')^2 \times \sum_{i=1}^{n} (y_i - y')^2}} \qquad (9) $$
$$ r_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - x'}{S_x} \right) \times \left( \frac{y_i - y'}{S_y} \right) \qquad (10) $$
Here $(x_i - x')/S_x$ is the standard score, x' is the sample mean, and S_x is the sample standard deviation (and likewise for y); equation (10) expresses the coefficient of equation (9) for a sample in terms of standard scores. After determining the correlation for each combination of paths for the search-for node type, the sum of the correlations of a path with itself and with the other paths of the node type is used to rank the node type path.
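The two forms of the sample correlation coefficient can be compared numerically. The sketch below implements the standard Pearson formula (with the square root in the denominator) and the standard-score form; returning 0.0 when one series is constant is an assumption made here to avoid a division by zero, not something stated in the text.

```python
import math


def pearson(xs, ys):
    """Sample Pearson correlation, covariance over product of deviations."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    denom = math.sqrt(sxx * syy)
    return cov / denom if denom else 0.0  # assumption: constant series -> 0.0


def pearson_standard_scores(xs, ys):
    """Same coefficient written with standard scores, as in equation (10)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / (n - 1))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / (n - 1))
    if sx == 0 or sy == 0:
        return 0.0  # same assumption as above
    total = sum(((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys))
    return total / (n - 1)


xs, ys = [1, 2, 4, 7], [1, 3, 2, 9]
print(pearson(xs, ys), pearson_standard_scores(xs, ys))
```

Both functions return the same value up to floating-point rounding, which is the sense in which the two equations are equivalent.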
Correlation map
X \ Y   P1                     P2                     P3                     P4                     P5
P1      Corr(P1,P1)            Corr(P1,P2)            Corr(P1,P3)            Corr(P1,P4)            Corr(P1,P5)
Rank    Σx=1to5 corr(Px,P1)    Σx=1to5 corr(Px,P2)    Σx=1to5 corr(Px,P3)    Σx=1to5 corr(Px,P4)    Σx=1to5 corr(Px,P5)
The correlation map shows that the correlation between each pair of paths determines the ranking. The rank of path Py is defined as:
$$ R_y = \sum_{x=1}^{5} corr(P_x, P_y) \qquad (11) $$
The path of the search-for node type whose ‘Ry’ value, given by equation (11), is the highest is ranked as the best search intention if the difference between the correlation sums of the first two ranked paths is greater than or equal to the threshold value. If the difference is less than the threshold, the node type with the lowest Tscore% is selected as the desired search-for node type, as given in equation (12):
$$ R = \begin{cases} \text{node type with lowest } T_{score\%} & \text{if } (Rank1 - Rank2) < threshold \\ \max(Rank) = Rank1 & \text{otherwise} \end{cases} \qquad (12) $$
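The ranking rule of equations (11) and (12) can be sketched as below. The correlation values, Tscore% values and thresholds are hypothetical, and the tie-break is interpreted here as choosing the lower Tscore% among the two top-ranked paths only, which is an assumption because the text does not say which paths take part in that comparison.

```python
def rank_paths(corr):
    """Equation (11): rank each path by its column sum in the correlation map.

    corr: dict of dicts, corr[x][y] = correlation between paths x and y.
    Returns (path, sum) pairs sorted from highest to lowest sum.
    """
    sums = {y: sum(corr[x][y] for x in corr) for y in corr}
    return sorted(sums.items(), key=lambda item: item[1], reverse=True)


def select_path(corr, t_scores, threshold):
    """Equation (12): take Rank1 unless the top two sums are too close."""
    ranked = rank_paths(corr)
    if len(ranked) == 1:
        return ranked[0][0]
    (p1, r1), (p2, r2) = ranked[0], ranked[1]
    if r1 - r2 < threshold:
        # Assumption: the tie-break compares only the two top-ranked paths.
        return min((p1, p2), key=lambda p: t_scores[p])
    return p1


# Hypothetical symmetric correlation map for three paths.
corr = {
    "A": {"A": 1.0, "B": 0.8, "C": 0.1},
    "B": {"A": 0.8, "B": 1.0, "C": 0.2},
    "C": {"A": 0.1, "B": 0.2, "C": 1.0},
}
t_scores = {"A": 10.0, "B": 20.0, "C": 5.0}
print(rank_paths(corr))
print(select_path(corr, t_scores, threshold=0.05))
print(select_path(corr, t_scores, threshold=0.5))
```

With the small threshold the top-ranked path B is kept; with the larger one the gap between B and A counts as a tie and A is selected because of its lower Tscore%.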
5. RESULTS AND COMPARISON
Our proposed statistical dependent and ranking measure for keyword search over XML data was implemented in Java (JDK 1.6) and evaluated on a 3.20 GHz Intel(R) Pentium(R) D machine with 1.00 GB RAM running 32-bit Windows 7 Professional. The experimental results are tabulated and compared with those of the existing method XReal. The results are generated on the real datasets DBLP, WSU, and eBay [10, 2], and are discussed in terms of effectiveness and efficiency.
Effectiveness test: This consists of two tests: 1.1) inferring the desired search-for node type and 1.2) quality measurement using the metrics Precision, Recall and F-measure.
Efficiency test: This test measures the query response time of the proposed method against XReal for all three real datasets.
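For reference, the three quality metrics can be computed as in the sketch below, which uses the standard set-based definitions. The result identifiers are hypothetical, and returning 0.0 for empty sets is a convention assumed here.

```python
def precision_recall_f(retrieved, relevant):
    """Standard precision, recall and F-measure over sets of result ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical result ids: 4 returned, 3 relevant, 2 in common.
p, r, f = precision_recall_f({1, 2, 3, 4}, {2, 3, 5})
print(p, r, f)
```

In this example precision is 2/4, recall is 2/3, and the F-measure is their harmonic mean, 4/7.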
Note: Queries under test
Notation   Query
DBLP dataset
QD1        “Java book”
QD2        “author Chen Lei”
QD3        “Jim Gray article”
QD4        “XML twig”
QD5        “Ling tok wang twig”
QD6        “vldb 2000”
QD7        “Philip Bernstein”
QD8        “WISE”
QD9        “ER 2005”
QD10       “LATIN 2006”
WSU dataset
QW1        “230”
QW2        “CAC 101”
QW3        “ECON”
QW4        “Biology”
QW5        “place TODD”
QW6        “days TU TH”
eBay dataset
QE1        “2 days”
QE2        “cpu 933”
QE3        “Hard drive CA”
5.1 Effectiveness test
The effectiveness of our approach is assessed by how well it identifies the user's search intention and resolves the ambiguity issues. The accuracy of our approach is tested by evaluating the inferred search-for node type for the queries listed in Table 3, of which a couple have both ambiguity 1 and ambiguity 2 and a few have ambiguity 2.

5.1.1 Inferring the desired search for node type
The queries used in Table 3 such as QD1 and QD3 have both ambiguity 1 (a keyword appears as an XML tag name and as a text value) and ambiguity 2 (a keyword appears as text values of different types of XML nodes), whereas QD2, QD6 and QW1 have ambiguity 2. As Table 3 shows, for the DBLP dataset the search intention inferred by our method and by XReal is ideal compared with SLCA/XSeek. For the WSU and eBay datasets, our method almost infers the desired search-for node type; because these datasets are small, the root node occurs alongside the search intention. For example, for query QE1 the search intention is auction_info and our approach outputs auction_info; listing.
An example of inferring the desired search-for node type with our proposed method follows. We consider one query and present the complete search for node type.
Input Query: “java book”
==========================================
1) Dscore
Tag frequency path Dscore
author 413010 dblp,inproceedings 1.0
author 212898 dblp,article 1.0
title 179060 dblp,inproceedings 1.0
url 179058 dblp,inproceedings 1.0
booktitle 179058 dblp,inproceedings 1.0
title 106834 dblp,article 1.0
url 106805 dblp,article 1.0
ee 73560 dblp,inproceedings 1.0
ee 23442 dblp,article 1.0
title 2609 dblp,proceedings 1.0
url 2491 dblp,proceedings 1.0
booktitle 2293 dblp,proceedings 1.0
author 1996 dblp,incollection 1.0
author 1153 dblp,book 1.0
title 1009 dblp,incollection 1.0
booktitle 1009 dblp,incollection 1.0
url 1006 dblp,incollection 1.0
title 845 dblp,book 1.0
book 845 dblp,book 1.0
url 128 dblp,book 1.0
ee 107 dblp,incollection 1.0
title 72 dblp,phdthesis 1.0
author 72 dblp,phdthesis 1.0
url 38 dblp,www 1.0
title 38 dblp,www 1.0
author 14 dblp,www 1.0
ee 6 dblp,proceedings 1.0
title 5 dblp,mastersthesis 1.0
ee 5 dblp,book 1.0
author 5 dblp,mastersthesis 1.0
ee 1 dblp,phdthesis 1.0
booktitle 1 dblp,www 1.0

2) Tscore
Tag Name Tscore path
booktitle 182361.0 dblp,www
author 125829.6 dblp,mastersthesis
ee 97121.0 dblp,phdthesis
title 58094.4 dblp,mastersthesis
author 44939.142857142855 dblp,www
ee 19424.2 dblp,book
ee 16186.833333333332 dblp,proceedings
author 8738.166666666666 dblp,phdthesis
title 7644.0 dblp,www
url 7619.105263157894 dblp,www
title 4034.333333333333 dblp,phdthesis
url 2261.921875 dblp,book
ee 907.6728971962616 dblp,incollection
author 545.661751951431 dblp,book
title 343.75384615384615 dblp,book
author 315.2044088176352 dblp,incollection
title 287.8810703666997 dblp,incollection
url 287.79920477137176 dblp,incollection
booktitle 180.73439048562932 dblp,incollection
url 116.228823765556 dblp,proceedings
title 111.33461096205444 dblp,proceedings
booktitle 79.52943741822939 dblp,proceedings
ee 4.143033870830134 dblp,article
author 2.9551616266944736 dblp,article
title 2.7189097103918227 dblp,article
url 2.710790693319601 dblp,article
title 1.6222048475371385 dblp,inproceedings
url 1.6169397625350446 dblp,inproceedings
author 1.5233238904627007 dblp,inproceedings
ee 1.3202963567156063 dblp,inproceedings
booktitle 1.0184465368763194 dblp,inproceedings
book 0.0 dblp,book
=================================================
3) Ranking
Example: correlation of dblp,proceedings and dblp,incollection
corr(dblp,proceedings ; dblp,incollection) = 0.1221784083384564
Ranked Sum of correlation:
Path Rank
P1=dblp,book 3.2727014742218543
P2=dblp,phdthesis 3.1869696826431175
P3=dblp,incollection 3.0431260287060002
P4=dblp,www 2.0916351992181195
P5=dblp,article 1.8924147256281627
P6=dblp,inproceedings 1.8924147256281627
P7=dblp,proceedings 0.13822919060961375
P8=dblp,mastersthesis 0.0
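Selection rule, where Rank1 and Rank2 are the two highest ranked sums and d is the threshold:
\[
R =
\begin{cases}
\text{node type with the lowest Tscore \%}, & \text{if } (Rank_1 - Rank_2) < d \\
\max(Rank) = Rank_1, & \text{otherwise}
\end{cases}
\]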
Selected Path is dblp, book
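The selection step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `select_path`, the use of each path's smallest Tscore as its "lowest Tscore", and the threshold values are all hypothetical. The rank sums and Tscores are copied from the listings above.

```python
from typing import Dict


def select_path(rank: Dict[str, float],
                min_tscore: Dict[str, float],
                threshold: float) -> str:
    """Pick a node type path from the ranked sums of correlation.

    Assumption: when the two best ranks differ by less than `threshold`,
    the candidate with the lower Tscore wins; otherwise the top rank wins.
    """
    ordered = sorted(rank, key=rank.get, reverse=True)
    first, second = ordered[0], ordered[1]
    if rank[first] - rank[second] < threshold:
        # Near tie: fall back to the lowest Tscore among the two candidates.
        return min((first, second), key=lambda path: min_tscore[path])
    return first


# Ranked sums of correlation from the example above.
rank = {
    "dblp,book": 3.2727014742218543,
    "dblp,phdthesis": 3.1869696826431175,
    "dblp,incollection": 3.0431260287060002,
    "dblp,www": 2.0916351992181195,
    "dblp,article": 1.8924147256281627,
    "dblp,inproceedings": 1.8924147256281627,
    "dblp,proceedings": 0.13822919060961375,
    "dblp,mastersthesis": 0.0,
}

# Smallest Tscore listed for the two top-ranked paths (assumed tie-break key).
min_tscore = {"dblp,book": 0.0, "dblp,phdthesis": 4034.333333333333}

selected = select_path(rank, min_tscore, threshold=0.1)  # hypothetical threshold
```

With these inputs both branches of the rule lead to dblp,book, which agrees with the selected path reported in the example.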

Table 3. Effectiveness test on inferring the desired search for node type
Query | Intention | XReal | SLCA/XSeek | Our
DBLP (370MB)
QD1: Java, book | book | book | book; title/book; article | book
QD2: author, Chen, Lei | inproceedings | inproceedings | author | inproceedings
QD3: Jim, Gray, article | article | article | article | article
QD4: XML, twig | inproceedings | inproceedings | title/inproceedings | inproceedings
QD5: Ling, tok, wang, twig | inproceedings | inproceedings | Inproceedings | inproceedings
QD6: vldb, 2000 | inproceedings | inproceedings | inproceedings | inproceedings
WSU (16.5MB)
QW1: 230 | place | course;place | room; crs/course | Place;course
QW2: CAC, 101 | course | course | course | Course
QW3: ECON | course | course | prefix/course | Course
QW4: Biology | course | course | title/course | course
QW5: place, TODD | course | course | place/course | Place;course
QW6: days, TU, TH | course | course | days/course | Place
eBay (0.36MB)
QE1: 2, days | auction_info | listing | time_left/listing | auction_info;listing
QE2: cpu, 933 | listing | listing | cpu/listing | Item_info;listing
QE3: Hard, drive, CA | listing | listing | description/listing | listing
5.1.2 Quality measure (Precision, Recall & F-measure)
The quality measure also addresses the effectiveness of our approach: all the queries under
test are evaluated and summarized with three metrics, namely precision, recall and F-measure.
Precision is the percentage of the output subtrees that are desired; recall is the
percentage of the desired subtrees that are output; and F-measure is a weighted
mean of precision and recall. Most of the queries on DBLP have more than 100 results;
the precision, recall and F-measure values for XReal are therefore those reported in [10],
and likewise for each query issued on WSU and eBay. The results are shown in Figures 3 and 4.
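The two definitions above can be made concrete with a short Python sketch. It treats the output and desired subtrees of one query as sets of identifiers; the function name `precision_recall` and the sample identifiers are illustrative assumptions and do not come from the paper.

```python
from typing import Set, Tuple


def precision_recall(output: Set[str], desired: Set[str]) -> Tuple[float, float]:
    """Return (precision, recall) in percent for one query."""
    hits = len(output & desired)  # subtrees that are both output and desired
    precision = 100.0 * hits / len(output) if output else 0.0
    recall = 100.0 * hits / len(desired) if desired else 0.0
    return precision, recall


# Hypothetical subtree identifiers for a single query.
output = {"n1", "n2", "n3", "n4"}
desired = {"n2", "n3", "n5"}
p, r = precision_recall(output, desired)
```

Here two of the four output subtrees are desired, and two of the three desired subtrees are output, so precision is 50% and recall is about 66.7%.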

Fig. 3. Precision comparison (percent) of XReal and the proposed method: (a) DBLP, (b) WSU and (c) eBay
Fig. 4. Recall comparison (percent) of XReal and the proposed method: (a) DBLP, (b) WSU and (c) eBay
Table 4: F-Measure (%)
Dataset | XReal | Proposed
DBLP | 47.48 | 48.48
WSU | 49.67 | 37.5
eBay | 40.02 | 44.44
Figure 3 shows that the average precision of our proposed approach is higher than that of
XReal for the queries on the DBLP dataset. Figure 4 shows the recall for all three real
datasets; the recall of our approach outperforms XReal. Further, the F-measure is computed
with the formula F = (precision * recall) / (precision + recall), using the average precision
and recall of all the queries under test, and is reported in Table 4. The F-measure of our
method is 48.48% on DBLP and 44.44% on eBay, whereas for XReal it is 47.48% on DBLP and
40.02% on eBay. On WSU, XReal scores higher (49.67% against 37.5%).
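The formula as printed can be written as a small Python function. The sketch below is illustrative only and the function names are assumptions. Note that the printed formula omits the factor of 2 found in the commonly used F1 score, 2 * P * R / (P + R), so with precision and recall given as percentages it cannot exceed 50; the second function shows the common form for comparison.

```python
def f_measure_as_stated(precision: float, recall: float) -> float:
    """F = (precision * recall) / (precision + recall), as printed in the paper."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when nothing relevant is returned
    return precision * recall / (precision + recall)


def f1_standard(precision: float, recall: float) -> float:
    """Common F1 score (harmonic mean), shown only for comparison."""
    return 2.0 * f_measure_as_stated(precision, recall)


# Illustrative inputs, not values from the paper.
example_stated = f_measure_as_stated(50.0, 50.0)
example_f1 = f1_standard(50.0, 50.0)
```

For equal precision and recall of 50%, the printed formula gives 25 and the common F1 gives 50.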
5.2 Efficiency test
The efficiency test evaluates the query response time of our proposed method, which uses
the indices for keyword information discussed in Section 4. It measures the time taken to
search for the node type of a given query. The proposed method is compared with XReal's
DupTypeNorm. On the DBLP, WSU and eBay real datasets, our approach is observed to be faster
than even DupTypeNorm (three-level information indexing). Fig. 5 shows the response time in
seconds of the individual queries under test on the DBLP, WSU and eBay databases.
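A response-time measurement of this kind can be sketched in Python as follows. This is only an illustration of the timing procedure: `time_queries` and `dummy_search` are hypothetical names, and the stand-in search function does not implement the proposed method or DupTypeNorm.

```python
import time
from typing import Callable, Dict


def time_queries(search: Callable[[str], object],
                 queries: Dict[str, str]) -> Dict[str, float]:
    """Return the wall-clock response time in seconds for each query id."""
    timings = {}
    for query_id, keywords in queries.items():
        start = time.perf_counter()
        search(keywords)  # run the node type search for this query
        timings[query_id] = time.perf_counter() - start
    return timings


def dummy_search(keywords: str) -> list:
    """Stand-in for a real search routine (assumption for illustration)."""
    return keywords.split()


timings = time_queries(dummy_search, {"QD1": "java book", "QE1": "2 days"})
```

In practice each query would be run several times and the times averaged, since a single measurement is sensitive to caching and system load.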
Fig. 5. Response time in seconds on individual queries for DupTypeNorm and the proposed method: (a) DBLP, (b) WSU and (c) eBay
6. CONCLUSION
In this paper, a statistical dependent and ranking measure for keyword search over
XML data is designed, and the approach is analyzed over several real XML datasets. We have
also performed a broad analysis of the different approaches to keyword search on XML data
available in the literature. We developed representations for identifying the user's
search intention, resolving keyword ambiguity, and ranking the desired search intention.
This was done by introducing a Node index and a Data index; based on their information,
the Dscore and Tscore measures were developed to infer the search for node type, and a
Correlation Ranking mechanism was developed to rank the search intention. The results
obtained for the queries under test on the different datasets, in terms of effectiveness and
efficiency, indicate that the proposed approach outperforms the existing techniques of XML
keyword search.
7. REFERENCES
[1] D. Guillaume and F. Murtagh, “Clustering of XML Documents”, Computer Physics
Communications, Vol. 127, pp. 215-227, 2000.
[2] N. Sundaresan, “A classifier for semi-structured documents”, in Proceedings of the
Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining, pp. 340-344, 2000.
[3] Antoine Doucet and Helena Ahonen-Myka, “Naive clustering of a large XML
document collection”, in Proceedings of the 1st INEX, Germany, 2002.
[4] S. Abiteboul, P. Buneman and D. Suciu, “Data on the Web”, Morgan Kaufmann,
2000.
[5] Jianhua Feng and Guoliang Li, “Efficient Fuzzy Type-Ahead Search in XML
Data”, IEEE Transactions on Knowledge and Data Engineering,
Vol. 24, No. 5, May 2012.
[6] Wei Wang, Christopher Peery, Amélie Marian and Thu D. Nguyen, “Efficient
Multidimensional Fuzzy Search for Personal Information Management Systems”,
IEEE Transactions on Knowledge and Data Engineering, Vol. 24, No. 9,
September 2012.
[7] Ziyang Liu, Yichuan Cai and Yi Chen, “TargetSearch: A Ranking Friendly XML
Keyword Search Engine”, International Conference on Data Engineering,
pp. 1101-1104, 2010.
[8] Chunxiao Liu, Xiangfu Meng and Ke Wei, “A Top-k Keywords Searching Approach
based on the Relationship of Keywords”, IEEE International Conference on Systems,
Man, and Cybernetics, October 2012.
[9] Yiqun Chen and Jinyin Cao, “TakeXIR: a Type-Ahead Keyword Search XML
Information Retrieval System”, I.J. Education and Management Engineering, Vol. 8,
pp. 1-5, 2012.
[10] Zhifeng Bao, Jiaheng Lu, Tok Wang Ling and Bo Chen, “Towards an Effective XML
Keyword Search”, IEEE Transactions on Knowledge and Data Engineering, Vol. 22,
No. 8, pp. 1077-1092, 2010.
[11] Jiang Li and Junhu Wang, “Effectively Inferring the Search-for Node Type in XML
Keyword Search”, Database Systems for Advanced Applications, pp. 110-124, 2010.
[12] Liang Jeff Chen and Yannis Papakonstantinou, “Supporting Top-K Keyword Search in
XML Databases”, Data Mining Workshops (ICDMW), pp. 805-812, 2012.

[13] Wilfred Ng and Lau Ho Lam, “A Co-Training Framework for Searching XML
Documents”, Journal Information Systems, Vol. 32, No. 3, 2007.
[14] B. Kimelfeld and Y. Sagiv, “Efficiently enumerating results of keyword search”, in
Proceedings of DBPL Conference, pp. 58-73, 2005.
[15] Y. Li, C. Yu and H. V. Jagadish, “Schema-free XQuery”, in VLDB, pp. 72-83, 2004.
[16] A. Schmidt, M. L. Kersten and M. Windhouwer, “Querying XML documents made
easy: Nearest concept queries”, in ICDE, pp. 321-329, 2001.
[17] Ralf Schenkel and Martin Theobald, “Structural Feedback for Keyword-Based XML
Retrieval”, ECIR, pp. 326-337, 2006.
[18] Bo Chen, Jiaheng Lu and Tok Wang Ling, “Exploiting ID References for Effective
Keyword Search in XML Documents”, in Proceedings of DASFAA, pp. 529-537,
2008.
[19] Arash Termehchy and Marianne Winslett, “Using Structural Information in XML
Keyword Search Effectively”, ACM Transactions on Database Systems, Vol. 36, No. 1,
2011.
[20] William Webber, “Evaluating the Effectiveness of Keyword Search”, IEEE Data Eng.
Bull., Vol. 33, No. 1, pp. 54-59, 2010.
[21] Junfeng Zhou, Zhifeng Bao, Wei Wang, Tok Wang Ling, Ziyang Chen, Xudong Lin
and Jingfeng Guo, “Fast SLCA and ELCA Computation for XML Keyword Queries
based on Set Intersection”, Data Engineering (ICDE), pp. 905-916, April 2012.
[22] Jia-Jian Jiang, Zhi-Hong Deng, Ning Gao and Sheng-Long Lv, “Guess What I Want:
Inferring the Semantics of Keyword Queries Using Evidence Theory”,
Springer-Verlag Berlin Heidelberg, pp. 388-398, 2012.
[23] Dayananda P and Dr. Rajashree Shettar, “Survey on Information Retrieval in Semi
Structured Data”, International Journal of Computer Applications, 32(8):1-5, October
2011.
[24] Y. Swapna and S. Ravi Sankar, “A Frame Work For Clustering Time Evolving Data
Using Sliding Window Technique”, International Journal of Computer Engineering &
Technology (IJCET), Volume 3, Issue 3, 2012, pp. 377-383, ISSN Print: 0976 – 6367,
ISSN Online: 0976 – 6375.
