International Journal of Computer Engineering and Technology (IJCET)
ISSN 0976-6367 (Print), ISSN 0976-6375 (Online)
Volume 4, Issue 5, September - October (2013), pp. 84-90
© IAEME: www.iaeme.com/ijcet.asp
Journal Impact Factor (2013): 6.1302 (Calculated by GISI), www.jifactor.com
GENETIC ALGORITHM WITH A RANKING BASED OBJECTIVE
FUNCTION AND INVERSE INDEX REPRESENTATION FOR WEB DATA
MINING
Suresh Subramanian (1), Dr. Sivaprakasam (2)
(1) Research Scholar, Karpagam University, Coimbatore, India
(2) Department of Computer Science, Sri Vasavi College, Erode, India
ABSTRACT
The internet has become part of everyday life, and the amount of information available on the
World Wide Web (WWW) has grown drastically. The WWW is a potential resource for any kind of
Information Retrieval (IR); however, extracting relevant information has become a major issue for
everyone. This paper describes the viability of using evolutionary algorithms in web mining to
provide the best results for a user query. The proposed Genetic Algorithm with a Ranking Based
Objective Function is used to determine the documents which best match the user query based on
relevance, combined with the inverse indexing model. Modifying the fitness function presented in
GAHWM shows a significant improvement in determining the relevant files, and the execution time
is reduced.
Keywords: Genetic Algorithm, Inverted Index, Information Retrieval, Web Mining.
1. INTRODUCTION
Perhaps one of the most used features of the internet is the search engine. Since the World
Wide Web contains a large amount of information, this service helps users find the pages most
relevant to their query. Web Mining is a new area of study developed especially for this purpose. It
collects different data across the internet and learns its contents for further processing. The data
collected can be used for searching, indexing, information retrieval, and application services, among
many others.
Genetic Algorithm (GA) is one of the search algorithms based on evolutionary theory. It
is an example of a stochastic search algorithm which simulates natural selection based on the
theory of Charles Darwin. A genetic algorithm is composed of individuals, or chromosomes, each of
which represents a solution. Each solution is evaluated through a fitness function which determines
how well it solves the problem.
Studies on evolutionary algorithms show that they can be used in Information Retrieval methods, as
retrieval can be seen as an optimization problem. An optimization problem does not require a solution
that is absolutely correct, but one that is close to the correct solution. A short description of
preprocessing is presented in Section 2, followed by the Genetic Algorithm process in Section 3.
Genetic operators are discussed in Section 4, Section 5 discusses the implementation of the
proposed function, and the comparative results are discussed in Section 6.
Section 7 provides recommendations for future improvement, and Section 8 presents the
conclusion.
2. PREPROCESSING
To optimize the searching algorithm, the documents are first modeled into a data structure
from which the method can easily read them and determine whether they contain keywords that
match the user input [1].
The basic web searching method involves the terms in a user query and a database containing
the files. To avoid the load of searching each document individually, a model is used to
represent these documents. Inverse index data mapping is a form of data structure where the contents
of the files are mapped to the filenames themselves. Each word in this collection is called a term, and
corresponding to each term we maintain a list, called an inverted list, of all the documents in which
this word appears [2]. For this application, inverse-index mapping is used: each unique term found in
the documents is mapped to the files which contain it. The map also contains the frequency of the
term in each document and the weight as determined by Table 1. Using this model allows the
algorithm to determine all the files which contain a term in constant time [3].
Parameter    Weight
Title        6
Headers      5
Anchor       4
Bold         3
Body         1
Table 1. Parameters used for indexing and their weights
Figure 1. Inverted-index model data structure for documents
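The inverse-index mapping described above can be sketched as follows. This is a minimal illustration in Python, not the Java implementation used in the paper. The input format (a list of (term, tag) pairs per document), the function name build_inverted_index, and the rule of keeping the highest tag weight seen for a term are all assumptions made for the example; only the tag weights come from Table 1.

```python
from collections import defaultdict

# Tag weights taken from Table 1.
TAG_WEIGHTS = {"title": 6, "header": 5, "anchor": 4, "bold": 3, "body": 1}


def build_inverted_index(documents):
    """Map each term to {doc_id: {"freq": int, "weight": int}}.

    `documents` is a hypothetical pre-parsed input of the form
    {doc_id: [(term, tag), ...]}, where each tag is a key of TAG_WEIGHTS.
    """
    index = defaultdict(dict)
    for doc_id, tokens in documents.items():
        for term, tag in tokens:
            entry = index[term.lower()].setdefault(doc_id, {"freq": 0, "weight": 0})
            entry["freq"] += 1
            # Assumption: a term keeps the highest tag weight seen in the document.
            entry["weight"] = max(entry["weight"], TAG_WEIGHTS[tag])
    return index


docs = {
    "d1": [("database", "title"), ("systems", "body"), ("database", "body")],
    "d2": [("topics", "header"), ("database", "bold")],
}
index = build_inverted_index(docs)
# Looking up a term is a single dictionary access (constant time on average).
matching_docs = sorted(index["database"])
```

A dictionary keyed by term gives the average constant-time lookup that the model relies on; the per-document entry carries the frequency and weight that the fitness function needs later.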
3. GENETIC ALGORITHM
3.1 Data Representation
Data representation is an encoding which represents the solution to the problem. In this
representation, each chromosome contains a set of genes, and each gene holds a reference to a
document number. In the proposed method, chromosomes have variable lengths which are
randomly generated, and the length may change with every operation applied to the chromosome. The
genes of a chromosome are then sorted according to the fitness value of the documents they reference.
The first document on the list has the highest fitness value and the last document has the lowest
fitness value [3].
Figure 2. Sample chromosome generated by the algorithm
3.2 Initialization
The first step of the algorithm is to initialize the first generation. For the first
generation, a population of 50 chromosomes is randomly generated. This population size is
kept until the solution is found. Each chromosome must have at least five unique, randomly selected
genes, and its length must not exceed the total number of documents.
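A minimal sketch of this initialization step is given below, assuming documents are identified by the integers 0 to num_documents - 1. The function names and the seeded random generator are illustrative choices, not part of the described implementation.

```python
import random


def random_chromosome(num_documents, rng, min_genes=5):
    """Return a variable-length list of unique document numbers."""
    # The length is random: at least `min_genes`, at most every document.
    length = rng.randint(min_genes, num_documents)
    # sample() draws without replacement, so all genes are unique.
    return rng.sample(range(num_documents), length)


def initial_population(num_documents, rng, size=50):
    """Generate the first generation of `size` random chromosomes."""
    return [random_chromosome(num_documents, rng) for _ in range(size)]


rng = random.Random(42)  # seeded only to make the example repeatable
population = initial_population(num_documents=100, rng=rng)
```

Sampling without replacement enforces the uniqueness constraint directly, so no separate duplicate check is needed at this stage.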
3.3 Evaluation
According to the work of Cao, Xu, and their colleagues, the ranking of documents is of
great importance in document retrieval. The documents are ranked according to their relevance to the
query, which makes the returned documents more accurate with respect to the user request [4].
The fitness function is used to evaluate whether each chromosome is suited to be a
solution. The function is a modified version of the fitness function presented in GAHWM
[1]. The added functionality is based on the GA experiments by Chang and Kwok, which improve
the function by adding a ranking method to each chromosome so that the most relevant file appears
at the top of the list. The added utility function allows the algorithm to modify the fitness value of a
document based on its rank. Using this function, the weight of a document decreases as it goes down
the list [5].
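$$F(c) = \frac{1}{N} \times \sum_{j=1}^{N} \left( f(d_j) \times \frac{1}{j} \right) \tag{1}$$

$$f(d_j) = \sum_{i=1}^{t} w_i \tag{2}$$

$$w_i = \frac{K_i}{K} \times \frac{F_j}{F} \times \frac{1}{t} \times h_j \times \log\left(\frac{T}{T_i}\right) \times \log\left(\frac{N}{df}\right) \tag{3}$$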
Variable    Description
i           Term in the document
j           Current document
w_i         Weight of term in the document
K           Total number of terms in the user query
N           Total number of documents in the chromosome
t           Unique terms in the current document
d_j         Current document in the chromosome
F_j         Frequency of the term in the document
F           Total number of terms in the document
h_j         Weight of the term in the document
T           Total frequency of terms in the index database
T_i         Total frequency of the current term in the index database
c           Current chromosome
K_i         Frequency of term i in the user query
Table 1: Terms and variables description
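To illustrate the effect of the ranking-based evaluation, the sketch below sorts the documents of a chromosome by their fitness value, divides each value by its rank, and averages over the chromosome, so that documents lower in the list contribute less. The per-document scores in doc_score are hypothetical placeholders for the document fitness f(d); the function names are illustrative, and the unranked variant is included only for comparison with a plain sum of document scores.

```python
def ranked_fitness(chromosome, doc_score):
    """Average of document scores, each discounted by its rank (1, 2, 3, ...)."""
    # Sort scores in descending order so the best document gets rank 1.
    scores = sorted((doc_score[d] for d in chromosome), reverse=True)
    if not scores:
        return 0.0
    discounted = sum(score / rank for rank, score in enumerate(scores, start=1))
    return discounted / len(scores)


def unranked_fitness(chromosome, doc_score):
    """Plain sum of document scores, with no rank discount."""
    return sum(doc_score[d] for d in chromosome)


# Hypothetical document scores standing in for f(d).
doc_score = {0: 6.0, 1: 3.0, 2: 1.0}
chromosome = [2, 0, 1]
with_ranking = ranked_fitness(chromosome, doc_score)       # (6/1 + 3/2 + 1/3) / 3
without_ranking = unranked_fitness(chromosome, doc_score)  # 6 + 3 + 1
```

Because of the average and the rank discount, adding many weakly related documents no longer raises the fitness of a chromosome, which is one plausible reading of why the ranked variant favors short, precise result lists.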
3.4 Breeding
To set up the next generation, 10 chromosomes are randomly chosen from the current
population, and the two fittest individuals among them are chosen as parents for a chromosome in
the next generation. This step is repeated until the next generation holds the required number of
chromosomes.
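The selection step can be sketched as a tournament of size 10, as shown below. The function name select_parents and the use of chromosome length as a stand-in fitness are assumptions made only to keep the example self-contained; any fitness function taking a chromosome could be passed instead.

```python
import random


def select_parents(population, fitness, rng, tournament_size=10):
    """Pick `tournament_size` random chromosomes and return the two fittest."""
    contenders = rng.sample(population, tournament_size)
    ranked = sorted(contenders, key=fitness, reverse=True)
    return ranked[0], ranked[1]


# Toy population: 50 chromosomes with distinct lengths 5..54.
population = [list(range(n)) for n in range(5, 55)]
rng = random.Random(1)  # seeded only to make the example repeatable
# Stand-in fitness (assumption): longer chromosomes count as fitter.
first, second = select_parents(population, len, rng)
```

Restricting the choice to a random subset keeps some selection pressure while still giving weaker chromosomes a chance to reproduce.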
3.5 End
Since there is no assurance that a GA will reach the optimal solution in a finite amount of
time, the algorithm stops after a specified number of generations [6]. The algorithm returns the
chromosome with the highest fitness value in the current population. The fittest chromosome is
considered found when its genes no longer change until the end of the execution.
4. OPERATIONS
4.1 Crossover
The algorithm uses the “cut and splice” approach for the crossover operation. Each parent
chromosome selected from the current population has its own crossover point.
This approach suits the crossover operation because the parent chromosomes have
varying lengths and the genes have specific positions in the chromosomes [7]. The head of
the chromosome also plays an important role in the fitness value, as it holds the most relevant genes.
Each parent is divided at its crossover point into a head and a tail, and two
children are produced for each pair of parents. The first child combines the head of
the first parent with the tail of the second parent, and the second child combines the head of the
second parent with the tail of the first parent. Duplicate genes found in a child are discarded [7].
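The crossover just described can be sketched as follows. The helper names (dedupe, cut_and_splice, random_cut) are illustrative, and the choice of a cut point that leaves a non-empty head and tail is an assumption, since the text does not state how the point is drawn.

```python
import random


def dedupe(genes):
    """Drop repeated genes, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for gene in genes:
        if gene not in seen:
            seen.add(gene)
            unique.append(gene)
    return unique


def cut_and_splice(parent_a, parent_b, cut_a, cut_b):
    """Each parent is cut at its own point; heads and tails are swapped."""
    child_1 = dedupe(parent_a[:cut_a] + parent_b[cut_b:])  # head of A + tail of B
    child_2 = dedupe(parent_b[:cut_b] + parent_a[cut_a:])  # head of B + tail of A
    return child_1, child_2


def random_cut(parent, rng):
    # Assumption: the cut leaves a non-empty head and a non-empty tail.
    return rng.randrange(1, len(parent))


rng = random.Random(7)  # seeded only to make the example repeatable
a = [1, 2, 3, 4, 5]
b = [2, 6, 1, 7]
children = cut_and_splice(a, b, random_cut(a, rng), random_cut(b, rng))
fixed = cut_and_splice(a, b, 3, 2)  # ([1, 2, 3, 7], [2, 6, 4, 5])
```

Because the two cut points are independent, the children generally differ in length from both parents, which is how the variable chromosome length is maintained across generations.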
4.2 Mutation
For the mutation function, a random gene is selected from a chromosome and removed. A
new gene is then randomly generated from the search space and examined for
uniqueness. If the new gene does not have a duplicate in the chromosome, it is added to the list of
genes; otherwise the process is repeated. The mutation function has a probability of 0.05 percent of
occurring in each child.
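A minimal sketch of this mutation step is shown below, assuming documents are numbered 0 to num_documents - 1. The default rate of 0.0005 is a literal reading of "0.05 percent" and should be treated as an assumption; the function name and the seeded generator are illustrative.

```python
import random


def mutate(chromosome, num_documents, rng, rate=0.0005):
    """With probability `rate`, replace one random gene by a new unique gene."""
    mutated = list(chromosome)
    if not mutated or rng.random() >= rate:
        return mutated  # no mutation for this child
    # Remove a randomly selected gene.
    del mutated[rng.randrange(len(mutated))]
    # Draw new genes until one is not already in the chromosome.
    while True:
        gene = rng.randrange(num_documents)
        if gene not in mutated:
            mutated.append(gene)
            return mutated


rng = random.Random(0)  # seeded only to make the example repeatable
child = [3, 7, 11, 20, 42]
forced = mutate(child, num_documents=100, rng=rng, rate=1.0)     # always mutates
untouched = mutate(child, num_documents=100, rng=rng, rate=0.0)  # never mutates
```

The retry loop always terminates, because the gene that was just removed is itself a valid candidate even when the chromosome covers every other document.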
4.3 Modifier Operator
Since each solution can contain documents that are significantly unrelated to the user query, a
modifier operator is added. This function removes a document from a chromosome if its fitness value
is below a threshold. The threshold is equal to eighty percent of the average fitness value of the
chromosome.
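The modifier operator can be sketched as a simple filter. Here the "average fitness value of the chromosome" is read as the mean of the fitness values of its documents, which is an interpretation rather than something the text states explicitly; doc_score and the function name prune are hypothetical.

```python
def prune(chromosome, doc_score, factor=0.8):
    """Drop documents scoring below `factor` times the chromosome's mean score."""
    if not chromosome:
        return []
    # Assumption: the average is taken over the documents in this chromosome.
    average = sum(doc_score[d] for d in chromosome) / len(chromosome)
    threshold = factor * average
    return [d for d in chromosome if doc_score[d] >= threshold]


# Hypothetical document scores: mean is 6.0, so the threshold is 4.8.
doc_score = {0: 10.0, 1: 8.0, 2: 1.0, 3: 5.0}
kept = prune([0, 1, 2, 3], doc_score)  # document 2 falls below the threshold
```

Since the threshold is relative to each chromosome's own average, the operator adapts to the score range of the current candidate instead of relying on a fixed cutoff.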
5. IMPLEMENTATION
The algorithm is developed using the Java programming language. The dataset used for the
program is a collection of web pages from different universities, taken from the World Wide Web
Knowledge Base Project, containing 8276 files. The program is developed on the Windows Vista
platform and executed in Eclipse SDK 3.3.1.1. Fifty chromosomes are used for every
generation, and the algorithm runs for fifty generations. The sample query used in the
algorithm is the string “topics to be covered in database systems”.
6. RESULTS
This section compares the results of the proposed algorithm with those of
GAHWM.
The fitness function for GAHWM is
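$$F(c) = \sum_{j=1}^{N} f(d_j) \tag{1}$$

$$f(d_j) = \sum_{i=1}^{t} w_i \tag{2}$$

$$w_i = \frac{K_i}{K} \times \frac{F_j}{F} \times \frac{1}{t} \times h_j \times \log\left(\frac{T}{T_i}\right) \times \log\left(\frac{N}{df}\right) \tag{3}$$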
The performance of both algorithms is measured in terms of recall and precision. Recall is
the number of relevant retrieved documents divided by the number of all documents relevant
to the user query. Precision is the number of relevant retrieved
documents divided by the number of retrieved documents. Both are formulated as follows:
$$\text{Recall} = \frac{\text{Relevant Retrieved}}{\text{Relevant}}$$

$$\text{Precision} = \frac{\text{Relevant Retrieved}}{\text{Retrieved}}$$
A document is said to be relevant if the number of terms it contains is greater than or equal to
the number of terms in the user query.
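As a small worked example of these two measures, the sketch below computes recall and precision from two sets of document identifiers. The identifiers and the function name are made up for illustration and are unrelated to the dataset used in the experiments.

```python
def recall_precision(retrieved, relevant):
    """Return (recall, precision) for two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical ids: 4 documents retrieved, 5 relevant, 2 in common.
recall, precision = recall_precision(retrieved=[1, 2, 3, 4], relevant=[2, 4, 6, 8, 10])
# recall = 2/5, precision = 2/4
```

Returning fewer documents tends to raise precision while lowering recall, which is the trade-off to keep in mind when reading the comparison of the two fitness functions.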
                                           GAHWM                 GAHWM with ranking
8276 Files (Chromosomes of length 8276)
  Execution Time                           41.3548078 seconds    25.6544668 seconds
  Recall                                   0.99                  0.78
  Precision                                0.0113                1.0
4512 Files (Chromosomes of length 4512)
  Execution Time                           18.2564585 seconds    11.4751356 seconds
  Recall                                   0.96                  0.63
  Precision                                0.0189                1.0
Table 2: Comparative Results of the fitness functions used in the algorithm
7. RECOMMENDATION
Further improvements to the fitness function may be used in the genetic algorithm, such as
adding a proximity parameter. This parameter would measure the distance between the terms in the
documents and may contribute greatly to evaluating the fitness value. For real-life datasets, the name
of the file and the URL may hold a high value for the fitness. Further tests on different values of the
parameters in the algorithm, such as the size of the population and the number of parents for
selection, may also improve the search performance, as different values may lead to different results.
8. CONCLUSION
This paper introduces a new method of evaluating the fitness value of an HTML document.
A genetic algorithm is used as a search method for determining relevant files. The new method uses a
ranking system to determine the relevant files. The solution shows a noticeable improvement in
precision, returning only files that are relevant to the user query, as well as an improvement in
execution time. The recall of the new method is lower than that of GAHWM, but the method is still
able to return a good number of documents. Based on the results, the new fitness function performs
better than that of GAHWM in terms of precision and execution time.
9. REFERENCES
[1] Al-Dallal, A., & Shaker, R. (2009a). Proceedings from GCC Conference & Exhibition 5th
    IEEE: Genetic algorithm in web search using inverted index representation. Kuwait: GCC.
[2] Manish Patil, Sharma V. Thankachan, Rahul Shah, Wing-Kai Hon, Jeffrey Scott Vitter,
    Sabrina Chandrasekaran. Inverted Indexes for Phrases and Strings. SIGIR’11, July 24–28,
    2011, Beijing, China.
[3] Al-Dallal, A., & Shaker, R. (2009b). Genetic algorithm based mining for HTML document.
    Retrieved from
    http://wwwis.win.tue.nl/bnaic2009/papers/junk/bnaic2009_submission_87.pdf
[4] Cao, Y., Xu, J., Liu, T-Y., Li, H., Huang, Y., & Hon, H-W. (2006). Adapting Ranking SVM to
    Document Retrieval. Retrieved from
    http://research.microsoft.com/en-us/people/tyliu/cao-et-al-sigir2006.pdf
[5] Fan, W., Fox, E., Pathak, P., & Wu, H. (2004a). The effects of fitness functions on genetic
    programming-based ranking discovery for web. Journal of the American Society for
    Information Science and Technology, 55 (7), 628-636.
[6] Bokar, P., & Patil, L. (2013). Web information retrieval using genetic algorithm-particle
    swarm optimization. International Journal of Future Computer and Communication, 2 (6),
    595-599.
[7] Hutt, B., & Warwick, K. (2003). Synapsing Variable Length Crossover: Biologically Inspired
    Crossover for Variable Length Genomes. Artificial Neural Nets and Genetic Algorithms:
    Proceedings of the International Conference in Roanne, France. 198-215.
[8] Mashagbal, E., Mashagbal, F., & Nassar, M. (2011). Query optimization using genetic
    algorithms in the vector space model. International Journal of Computer Science, Issue 8 (3),
    457-450.
[9] Vizine, André., de Castro, L., & Gudwin, R. (2005). An Evolutionary Algorithm to Optimize
    Web Document Retrieval. Retrieved from
    http://www.dca.fee.unicamp.br/~gudwin/ftp/publications/Kimas05-2.pdf
[10] Sathya, S., & Simon, P. (2009). Review on Applicability of Genetic Algorithm to Web
    Search. International Journal of Computer Theory and Engineering, Vol. 1, (4), 450-455.
[11] Joshi, A., & Todwal, S. (2003). Evolutionary Machine Learning for Web Mining. Retrieved
    from http://www-cs-students.stanford.edu/~amrutaj/work/papers/tencon03.pdf
[12] Prof. Sindhu P Menon and Dr. Nagaratna P Hegde, “Research on Classification Algorithms
    and its Impact on Web Mining”, International Journal of Computer Engineering &
    Technology (IJCET), Volume 4, Issue 4, 2013, pp. 495 - 504, ISSN Print: 0976 – 6367,
    ISSN Online: 0976 – 6375.
[13] Priti Bhardwaj and Rahul Johari, “Routing in Delay Tolerant Network using Genetic
    Algorithm”, International Journal of Computer Engineering & Technology (IJCET),
    Volume 4, Issue 2, 2013, pp. 590 - 597, ISSN Print: 0976 – 6367, ISSN Online: 0976 – 6375.
[14] Mousmi Chaurasia and Dr. Sushil Kumar, “Natural Language Processing Based Information
    Retrieval for the Purpose of Author Identification”, International Journal of Information
    Technology and Management Information Systems (IJITMIS), Volume 1, Issue 1, 2010,
    pp. 45 - 54, ISSN Print: 0976 – 6405, ISSN Online: 0976 – 6413.
[15] Prakasha S, Shashidhar Hr and Dr. G T Raju, “A Survey on Various Architectures, Models
    and Methodologies for Information Retrieval”, International Journal of Computer
    Engineering & Technology (IJCET), Volume 4, Issue 1, 2013, pp. 182 - 194, ISSN Print:
    0976 – 6367, ISSN Online: 0976 – 6375.