INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & TECHNOLOGY (IJCET)
ISSN 0976 – 6367 (Print), ISSN 0976 – 6375 (Online)
Volume 4, Issue 6, November - December (2013), pp. 70-77
© IAEME: www.iaeme.com/ijcet.asp
Journal Impact Factor (2013): 6.1302 (Calculated by GISI)
www.jifactor.com
AUTHENTICATED INDEXING FOR THE QUERY DEPENDENT
K-NEAREST NEIGHBOURS IN SPATIAL DATABASE
K. Padmapriya1, Research Scholar, Department of Computer Science and Engineering, Sathyabama University, Chennai, India
Dr. S. Sridhar2, Research Supervisor, Department of Computer Science and Engineering, Sathyabama University, Chennai, India
ABSTRACT
Various indexing models have been proposed in the areas of information retrieval and
artificial intelligence, but most existing algorithms do not consider the significant differences
among queries; they attempt to solve the problem with a single model for all queries. In this paper,
we propose different indexing models for multiple queries, which we call Indexing for Multiple
Queries (IMQ). As a first step, we use the k-Nearest Neighbour (kNN) method to index queries. We
classify our method into online and offline variants. In the online method, we create an indexing
model for a given query by maintaining the labelled queries and then index the documents with
respect to the query. We then consider two offline methods that create indexing models in advance
for more efficient indexing. Experiments on different datasets show that both the proposed online
and offline methods perform better than the baseline method, which uses a single indexing model.
Keywords – Information retrieval, indexing for multiple queries, k-Nearest Neighbor.
1. INTRODUCTION
As search and information retrieval continue to grow rapidly, indexing will remain an
important research topic. In search, indexing is typically performed as follows: given a query, the
documents related to the query are retrieved from the document repository and sorted by their
relevance to the query using an indexing model; the list of top-indexed documents is then presented
to the user. The central problem in this method is developing a suitable indexing model that provides
the best relevance.
Many models have been proposed for indexing, such as the vector space model [24, 25], the
Boolean model [3], BM25 [22], and language models for IR [14, 19]. Recently, machine learning
techniques for learning to index have been applied to automatic indexing model construction
[6, 7, 8, 12, 18, 29, 30]. By applying machine learning algorithms to labelled training data, such
methods can build indexing models effectively. The training data consist of queries, their relevant
documents, and labels indicating the association between queries and documents. In this paper, we
also base our model on index learning.
Previously, a single indexing function was used to handle all queries. This may not be
appropriate, especially in web search. Web queries differ in their semantics, the user objectives they
represent, how they appear, and how many relevant documents they have in the document repository.
Queries may be informational, transactional, or navigational; they may be phrases, combinations of
phrases, or natural-language sentences; they may be product names, personal names, or terminology;
and they can be long or short, popular or unpopular. Hence a single indexing function does not give
appropriate results and yields lower accuracy in relevance indexing.
The importance of conducting query-dependent indexing is well understood in the IR
community. However, much effort has gone into query classification [4, 5, 13, 15, 23] but not into
indexing model learning and construction. Kang and Kim [13] classified queries into two categories
based on search intention and trained two different indexing models for the different categories.
Following previous work [9], we propose query-dependent indexing model construction
based on k-Nearest Neighbours. We use training queries, each denoted by a point in a query feature
space. During indexing, we retrieve the k nearest training queries for a given test query, learn an
indexing model from them, and then index the documents relevant to the test query with that model.
The advantages of our proposed methods are: 1. query indexing exploits the useful information of
similar queries while neglecting dissimilar queries; 2. classification of queries is done dynamically,
and the similar queries are selected. Our experimental results demonstrate its advantage over both
the single indexing model and the query classification method.
Since kNN requires online training of the indexing model for each query, it is expensive in
practice. Hence we propose two methods that move the training offline. We show that our methods
remain accurate, with little loss in prediction quality, provided the learning algorithm is stable under
minor changes in the training examples.
2. PREVIOUS WORK
Not much work has been done on query-dependent ranking; most prior work concerns query
classification and learning to index. Many methodologies have been proposed for query
classification. In [4, 5, 26], queries are classified according to topic, for instance computers,
information, and entertainment, as in the KDD Cup 2005 task. In [13, 15, 23, 27], queries are
classified according to the user's search need, for instance topic distillation, home-page finding, and
named-page finding. Support vector machines have also been applied to the classification.
Many studies have addressed learning to index and extended its application to information
retrieval. Existing approaches are categorised as: 1. the point-wise approach [18], which transforms
indexing into classification or regression on single documents; 2. the pair-wise approach [6, 9, 12],
which performs ranking as classification on pairs of documents; and 3. the list-wise approach
[7, 29, 30], which minimises a loss function defined on lists of documents.
3. INDEXING USING k-NEAREST NEIGHBORS
In practice, we categorise queries into two types: 1. popular queries, which have many related
documents and whose popularity features are important for indexing; 2. rare queries, which have very
few related documents and for which popularity features are not necessary. Hence we use different
indexing models for different queries. A direct approach is to apply a hard classification model to
classify the queries into categories and train an indexing model for each category. However, it is not
easy to achieve good performance with this approach.
Fig. 1. Sample distribution of the data
When we examine the data, we find that it is not easy to draw clear boundaries among the
queries of different categories. We represented each query with the 27-dimensional features defined
in [27] and then reduced the space to two dimensions by applying Principal Component Analysis
(PCA). Plotting the queries in this reduced space gives the graph in Fig. 1. Since queries from
different categories are mixed together, they cannot be separated by hard classification. However,
the neighbours of a query belong to the same category with high probability; we call this the
"Queries Locality Property" (QLP).
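The PCA reduction described above can be sketched as follows. This is an illustrative sketch only: random data stands in for the 27-dimensional query features of [27], and a minimal SVD-based PCA replaces whatever tool the authors actually used.

```python
import numpy as np

def pca_2d(X):
    """Project feature vectors onto their first two principal components."""
    Xc = X - X.mean(axis=0)                  # centre each feature
    # SVD of the centred data: rows of Vt are the principal directions
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T                     # coordinates in the 2-D subspace

# Example: 100 hypothetical queries with 27-dimensional features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 27))
Z = pca_2d(X)
print(Z.shape)  # (100, 2)
```

Scatter-plotting the two columns of `Z` reproduces the kind of visualisation shown in Fig. 1.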
3.1 kNN Online method
We use the kNN method for query-dependent indexing. Each training query qi has a feature
vector and a corresponding data set Sqi, where i = 1, 2, ..., m, and queries are represented as points in
a Euclidean query feature space. Given a test query q, we find its k nearest training queries by
Euclidean distance, train a local indexing model online using these nearest training queries Ck(q),
and index the test query's documents using the locally trained model (an SVM).
The working principle of our algorithm is illustrated in Fig. 2, where the red circle represents
the test query q, the blue circles represent training queries, and the large circle encloses the
neighbours of q.
Fig.2. Representation of kNN online method
For each query, we apply a reference model (BM25) to find its top T indexed documents, and
take the mean feature values of those documents as the query features.
Algorithm 1: kNN Online method
Step 1: Use the reference model hr to find the top T indexed documents for query q, and define the
query features from those documents.
Step 2: Find the kNNs of q from the training data, Ck(q), with Euclidean distance calculated in the
query feature space.
Step 3: Learn a local model hq using the training set SCk(q) = ∪qi′∈Ck(q) Sqi′.
Step 4: Apply hq to the documents related to the query to obtain the indexed list.
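The steps of Algorithm 1 can be sketched as follows. This is a hedged illustration, not the authors' implementation: a linear least-squares scorer stands in for the SVM of Step 3, and the query and document features are random placeholders.

```python
import numpy as np

def knn_online_rank(q_feat, train_q_feats, train_data, doc_feats, k=3):
    """Sketch of Algorithm 1: train a local model on the data of the k
    nearest training queries, then score the test query's documents."""
    # Step 2: k nearest training queries by Euclidean distance
    dists = np.linalg.norm(train_q_feats - q_feat, axis=1)
    neighbours = np.argsort(dists)[:k]
    # Step 3: pool the neighbours' labelled data S_Ck(q) and fit a
    # linear least-squares scorer (stand-in for the SVM in the paper)
    X = np.vstack([train_data[i][0] for i in neighbours])
    y = np.concatenate([train_data[i][1] for i in neighbours])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    # Step 4: index the test query's documents by descending score
    return np.argsort(doc_feats @ w)[::-1]

rng = np.random.default_rng(1)
train_q_feats = rng.normal(size=(20, 5))           # 20 training queries
train_data = [(rng.normal(size=(10, 4)),           # 10 labelled docs each
               rng.integers(0, 5, size=10).astype(float))
              for _ in range(20)]
ranking = knn_online_rank(rng.normal(size=5), train_q_feats,
                          train_data, rng.normal(size=(8, 4)))
print(ranking)  # a permutation of the 8 document indices
```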
3.2 kNN Offline method – 1:
To achieve better efficiency, we move the training offline. First, for each training query qi, we
find its k nearest queries Ck(qi) in the query feature space and learn a local model hqi from SCk(qi).
During testing, we find the kNNs Ck(q) of a new query q; we then compare SCk(q) with every
SCk(qi) to find the training query qi′ whose neighbourhood is most similar to that of q, and use its
model hqi′ to index the query's documents. In Fig. 3, the circled solid dot represents the selected
training query qi′.
Fig. 3. Representation of kNN Offline method – 1
For each training query qi, compute the kNNs of qi from the training data in the query feature
space, denote them Ck(qi), and use the training data set SCk(qi) to learn a model hqi.
Algorithm 2: kNN Offline method – 1
Step 1: Use the reference model hr to compute the top T indexed documents for query q, and define
the query features from those documents.
Step 2: Find the kNNs of q from the training data, Ck(q), with Euclidean distance calculated in the
query feature space.
Step 3: Select the training query qi′ whose neighbourhood set SCk(qi′) is most similar to SCk(q).
Step 4: Apply the pre-trained model hqi′ to the documents related to the query q to obtain the
indexed list.
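The offline precomputation and the model-selection step of Algorithm 2 can be sketched as below. The paper does not specify the similarity measure used to compare SCk(q) with SCk(qi); this sketch assumes neighbourhood overlap as a simple proxy, and all features are random placeholders.

```python
import numpy as np

def knn_neighbourhoods(q_feats, k=3):
    """Offline step: for each training query, its k nearest other queries."""
    d = np.linalg.norm(q_feats[:, None] - q_feats[None, :], axis=2)
    # skip position 0 of each sorted row, i.e. the query itself
    return [set(np.argsort(row)[1:k + 1].tolist()) for row in d]

def select_model(q_feat, train_q_feats, neighbourhoods, k=3):
    """Online step of Algorithm 2: find Ck(q), then the training query qi'
    whose precomputed neighbourhood overlaps it most (assumed similarity)."""
    dists = np.linalg.norm(train_q_feats - q_feat, axis=1)
    ck_q = set(np.argsort(dists)[:k].tolist())
    overlaps = [len(ck_q & n) for n in neighbourhoods]
    return int(np.argmax(overlaps))   # index of qi'; its model h_qi' is used

rng = np.random.default_rng(2)
train_q_feats = rng.normal(size=(15, 5))
hoods = knn_neighbourhoods(train_q_feats)          # precomputed offline
chosen = select_model(rng.normal(size=5), train_q_feats, hoods)
print(chosen)  # index of the selected training query
```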
3.3. kNN Offline method -2
Instead of finding the kNNs of the test query q, we compute only its single closest training
query qi′ in the query feature space and directly apply the pre-trained model hqi′, learned from
SCk(qi′), to the test query. Searching for a single nearest neighbour rather than k of them further
reduces the time complexity.
Fig.4 Representation of kNN Offline method -2
For each training query qi, compute the kNNs of qi from the training data in the query feature
space, denote them Ck(qi), and use the training data set SCk(qi) to learn a model hqi.
Algorithm 3: kNN Offline method – 2
Step 1: Use the reference model hr to compute the top T indexed documents for query q, and define
the query features from those documents.
Step 2: Find the single nearest training neighbour of q, denoted qi′.
Step 3: Apply the pre-trained model hqi′ to the documents related to the query q to obtain the
indexed list.
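The online part of Algorithm 3 reduces to a single nearest-neighbour lookup, sketched below with placeholder random features; the model reuse itself is omitted since it is identical to Algorithm 2.

```python
import numpy as np

def nearest_model(q_feat, train_q_feats):
    """Algorithm 3's online step: pick the single nearest training query
    and reuse its pre-trained local model h_qi'."""
    dists = np.linalg.norm(train_q_feats - q_feat, axis=1)
    return int(np.argmin(dists))      # index of qi'

rng = np.random.default_rng(3)
train_q_feats = rng.normal(size=(15, 5))
# A slightly perturbed copy of training query 7 maps back to query 7
idx = nearest_model(train_q_feats[7] + 0.01, train_q_feats)
print(idx)  # 7
```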
4. EXPERIMENTAL RESULTS
We used two data sets: 1. Dataset1, containing 1500 training queries and 500 test queries;
2. Dataset2, containing 3000 training queries and 600 test queries. Each query is associated with its
labelled relevant documents, and the relevance of those documents is scored as Perfect = 4,
Excellent = 3, Good = 2, Fair = 1, Bad = 0. A feature vector is defined for each query-document pair.
In our experiments, we used Ranking SVM [12] as the baseline algorithm; it has a single
parameter n representing the trade-off between model complexity and empirical loss, and we set
n = 0.01 for all methods. In kNN, we used BM25 as the reference model to index the documents,
selected the top T = 50 documents, and then created the query features.
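The BM25 reference model used to build the query features can be sketched as follows. This is the standard BM25 formula with commonly used parameter values (k1 = 1.2, b = 0.75), which the paper does not specify, applied to a toy tokenised corpus.

```python
import math

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each document (a list of tokens) against the query terms
    with standard BM25; k1 and b are assumed defaults."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N        # average document length
    scores = []
    for doc in docs:
        s = 0.0
        for t in query_terms:
            df = sum(1 for d in docs if t in d)  # document frequency of t
            idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
            tf = doc.count(t)                    # term frequency in doc
            s += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(s)
    return scores

docs = [["spatial", "index", "query"],
        ["nearest", "neighbour", "query", "index"],
        ["database", "systems"]]
scores = bm25_scores(["query", "index"], docs)
print(max(range(3), key=scores.__getitem__))  # index of the top-scored doc
```

Sorting documents by these scores and keeping the top T = 50 yields the candidate set from which the query features are averaged.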
(Figure: accuracy curves for SM, kNN Online, kNN Offline-1, and kNN Offline-2 over five settings; y-axis from 0.56 to 0.72.)
Fig. 5a. Indexing accuracies in terms of Dataset1
(Figure: accuracy curves for SM, kNN Online, kNN Offline-1, and kNN Offline-2 over five settings; y-axis from 0.54 to 0.72.)
Fig. 5b. Indexing accuracies in terms of Dataset2
We compared our methods with the single-model approach (SM). From Fig. 5a and Fig. 5b
we can see that the proposed methods perform comparably with one another and all outperform the
baseline algorithm. We conducted t-tests, and the results show that the improvements of the kNN
methods over SM are statistically significant on both Dataset1 and Dataset2. For SM, we observed
that errors in query classification are the main source of damage to the document indexing results,
which also shows how difficult it is to build a query-dependent indexing method that beats
conventional indexing methods in that way. In contrast, the kNN methods successfully exploit the
indexing patterns of similar queries and attain better indexing performance.
5. CONCLUSION AND FUTURE WORK
In this paper, we have discussed indexing documents for search using different models for
different types of queries. We defined a kNN model for learning indexing functions and introduced
two offline variants to improve the efficiency of the method. Our experimental results show that the
proposed models outperform the baseline algorithm.
However, when a small number of neighbours is used, the kNN methods perform poorly
because the training data are inadequate. As the number of neighbours increases, performance
gradually improves owing to the use of more information. Conversely, if too many neighbours are
used (approaching all 1500 training queries, as in SM), performance begins to worsen. Hence the
best performance of this model is achieved when the number of neighbours takes values in a
relatively large range, i.e. from 300 to 700.
In future work, we will try to reduce the complexity of the online method by using k-d trees,
and to further reduce the complexity of the offline methods by clustering the training queries. We
will also investigate metrics other than Euclidean distance to check whether they perform better for
this task.
6. REFERENCES
1. S. Agarwal and P. Niyogi, "Stability and generalization of bipartite ranking algorithms", Proceedings of COLT 2005, pp 32–47.
2. M. Richardson, A. Prakash, and E. Brill, "Beyond PageRank: machine learning for static ranking", WWW '06: Proceedings of the 15th international conference on World Wide Web, New York, NY, USA, 2006, pp 707–715.
3. R. Baeza-Yates and B. Ribeiro-Neto, "Modern Information Retrieval", Addison Wesley, May 1999.
4. S. M. Beitzel, E. C. Jensen, A. Chowdhury, and O. Frieder, "Varying approaches to topical web query classification", SIGIR '07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, New York, NY, USA, 2007, pp 783–784.
5. S. M. Beitzel, E. C. Jensen, O. Frieder, D. Grossman, D. D. Lewis, A. Chowdhury, and A. Kolcz, "Automatic web query classification using labeled and unlabeled training data", SIGIR '05: Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, New York, NY, USA, 2005, pp 581–582.
6. C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender, "Learning to rank using gradient descent", ICML '05: Proceedings of the 22nd international conference on Machine learning, New York, NY, USA, 2005, pp 89–96.
7. Z. Cao, T. Qin, T.-Y. Liu, M.-F. Tsai, and H. Li, "Learning to rank: from pairwise approach to listwise approach", ICML '07, volume 227 of ACM International Conference Proceeding Series, 2007, pp 129–136.
8. Y. Freund, R. Iyer, R. E. Schapire, and Y. Singer, "An efficient boosting algorithm for combining preferences", J. Mach. Learn. Res., 4:933–969, 2003.
9. D. S. Guru and H. S. Nagendraswamy, "Clustering of interval-valued symbolic patterns based on mutual similarity value and the concept of k-mutual nearest neighbourhood", ACCV (2), 2006, pp 234–243.
10. K. Jarvelin and J. Kekalainen, "Cumulated gain-based evaluation of IR techniques", ACM Trans. Inf. Syst., 20(4):422–446, 2002.
11. T. Joachims, "Making large-scale support vector machine learning practical", Advances in Kernel Methods: Support Vector Machines.
12. T.-Y. Liu, J. Xu, T. Qin, W. Xiong, and H. Li, "LETOR: Benchmark dataset for research on learning to rank for information retrieval", SIGIR '07: Proceedings of the Learning to Rank workshop at the 30th annual international ACM SIGIR conference on Research and development in information retrieval, 2007.
13. T. Joachims, "Optimizing search engines using clickthrough data", Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD), ACM, 2002.
14. I. Kang and G. Kim, "Query type classification for web document retrieval", SIGIR '03: Proceedings of the annual international ACM SIGIR conference on Research and development in information retrieval, 2003.
15. J. Lafferty and C. Zhai, "Document language models, query models, and risk minimization for information retrieval", SIGIR '01: Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, New York, NY, USA, 2001, pp 111–119.
16. U. Lee, Z. Liu, and J. Cho, "Automatic identification of user goals in web search", WWW '05: Proceedings of the 14th international conference on World Wide Web, New York, NY, USA, 2005, pp 391–400.
17. T.-Y. Liu, Y. Yang, H. Wan, H.-J. Zeng, Z. Chen, and W.-Y. Ma, “Support vector machines
classification with a very large-scale taxonomy”. SIGKDD Explor. Newsl., 7(1):36–43, 2005.
18. R. Nallapati, “Discriminative models for information retrieval”. SIGIR ’04: Proceedings of
the 27th annual international ACM SIGIR conference on Research and development in
information retrieval, New York, NY, USA, 2004, pp 64–71.
19. J. M. Ponte and W. B. Croft, “A language modeling approach to information retrieval”.
Research and Development in Information Retrieval, 1998, pp 275–281.
20. F. P. Preparata and M. I. Shamos, "Computational Geometry: An Introduction (Monographs in
Computer Science)". Springer, August 1985.
21. S. Robertson, “Overview of the okapi projects”. Journal of Documentation, 1998, pp 275–
281.
22. D. E. Rose and D. Levinson, “Understanding user goals in web search”. WWW ’04:
Proceedings of the 13th international conference on World Wide Web, New York, NY, USA,
2004, pp 13–19.
23. G. Salton, “The SMART Retrieval System-Experiments in Automatic Document Processing”.
Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1971.
24. G. Salton and M. E. Lesk, “Computer evaluation of indexing and text processing”. J. ACM,
15(1):8–36, 1968.
25. D. Shen, J.-T. Sun, Q. Yang, and Z. Chen, “Building bridges for web query classification”.
SIGIR ’06: Proceedings of the 29th annual international ACM SIGIR conference on Research
and development in information retrieval, , New York, NY, USA, 2006 pp 131–138.
26. J. Xu and H. Li, “Adarank: a boosting algorithm for information retrieval”. SIGIR ’07:
Proceedings of the 30th annual international ACM SIGIR conference on Research and
development in information retrieval, New York, NY, USA, 2007, pp 391–398.
27. R. Song, J.-R. Wen, S. Shi, G. Xin, T.-Y. Liu, T. Qin, X. Zheng, J. Zhang, G. Xue, and W.-Y.
Ma, “Microsoft research asia at web track and terabyte track of trec 2004”. Proceedings of the
Thirteenth Text REtrieval Conference Proceedings (TREC-2004), 2004.
28. E. Xing, A. Ng, M. Jordan, and S. Russell, “Distance metric learning, with application to
clustering with side-information”. Advances in NIPS, number vol. 15, 2003.
29. Y. Yue, T. Finley, F. Radlinski, and T. Joachims, "A Support Vector Method for Optimizing
Average Precision”. SIGIR ’07: Proceedings of the 30th annual international ACM SIGIR
conference on Research and development in information retrieval, New York, NY, USA,
2007, pp 271–278.
30. Y. Ioannidis and Y. Kang, "Randomized Algorithms for Optimizing Large Join Queries". ACM
SIGMOD, 1990.
31. Y. Angeline Christobel and P. Sivaprakasam, “Improving the Performance of K-Nearest
Neighbor Algorithm for the Classification of Diabetes Dataset with Missing Values”,
International Journal of Computer Engineering & Technology (IJCET), Volume 3, Issue 3,
2012, pp. 155 - 167, ISSN Print: 0976 – 6367, ISSN Online: 0976 – 6375.
32. Mousmi Chaurasia and Dr. Sushil Kumar, “Natural Language Processing Based Information
Retrieval for the Purpose of Author Identification”, International Journal of Information
Technology and Management Information Systems (IJITMIS), Volume 1, Issue 1, 2010,
pp. 45 - 54, ISSN Print: 0976 – 6405, ISSN Online: 0976 – 6413.
33. Prakasha S, Shashidhar Hr and Dr. G T Raju, “A Survey on Various Architectures, Models
and Methodologies for Information Retrieval”, International Journal of Computer Engineering
& Technology (IJCET), Volume 4, Issue 1, 2013, pp. 182 - 194, ISSN Print: 0976 – 6367,
ISSN Online: 0976 – 6375.