AN EXTENSIVE LITERATURE SURVEY ON COMPREHENSIVE RESEARCH ACTIVITIES OF WEB USAGE MINING

International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367(Print),
ISSN 0976 - 6375(Online), Volume 5, Issue 12, December (2014), pp. 275-289 © IAEME
Dr. V.V.R. Maheswara Rao (1), Dr. V. Valli Kumari (2)
(1) Professor, Department of CSE, Shri Vishnu Engg. College for Women, AP, INDIA
(2) Professor, Department of CS & SE, Andhra University, AP, INDIA
ABSTRACT
The growing use of the World Wide Web (WWW) has produced a vast repository of data on users'
interactions with websites. This data, recorded in weblogs, is unstructured, unlabeled, highly
redundant and not very reliable. In addition, the non-linear, incomplete, heterogeneous and
transient nature of the data makes the weblog still more complex, so extracting the desired
knowledge from it becomes increasingly challenging. Web usage mining is a methodology that
blends traditional mining techniques with sophisticated algorithms to capture, model and analyze
the behavioral patterns in weblogs. The knowledge derived from such patterns adds great value to
any organization that uses it in its decision-making process.
It is therefore necessary to strengthen web usage mining techniques so that they are compatible
with the incremental nature of weblogs. This calls for applying new approaches comprehensively at
all stages of web usage mining. To design and develop a comprehensive model for investigating web
usage mining, the authors of the present paper conduct an extensive literature survey of the
research activities in the area of web usage mining.
Keywords: Web Mining, Pre-Processing, Storage Models, Pattern Discovery, Optimization
Techniques, Pattern Analysis, Knowledge Representation.
1. INTRODUCTION
The unexpectedly widespread use of the WWW and the dynamically growing nature of the web
create new challenges for web mining, since the data on the web is inherently unlabelled,
incomplete, and heterogeneous. At the same time, the web has turned into a gold mine of extremely
dynamic and interrelated data on which web miners can perform web mining.
1.1. Web Mining
Web Mining is the application of data mining techniques [5] to discover and retrieve useful
information and patterns from the WWW documents and services. The web mining can be used to
discover hidden patterns and relationships within the web data. The web mining task can be divided
into three general types, known as Web Content Mining (WCM), Web Structure Mining (WSM) and
Web Usage Mining (WUM) as shown in figure 1.
Figure 1: Types of web mining
Web content mining is a mining technique that extracts knowledge from the content
published on the internet [8], usually in the form of semi-structured (HTML), unstructured (plain
text) and structured (XML) documents. The content of a web page includes text, images, HTML,
tables and forms. Web content mining allows search engine results to be tuned so as to maximize
the flow of customer clicks to a website, or so that particular pages of a site are accessed more
often in response to relevant search queries. The main uses of this type of web mining are to
gather, categorize, organize and provide the best possible information available on the web.
Web structure mining is a mining technique that extracts knowledge from the
hyperlinks between documents in the WWW. Mining the structure [4] of the web involves
extracting knowledge from the interconnections of the hypertext documents in the WWW. This
results in the discovery of web communities and of pages that are authoritative. Moreover, the
nature of the relationship between neighboring pages can be discovered by structure mining. The
main purpose of structure mining is to extract interesting relationships between web pages.
Web usage mining, also known as weblog mining, is the process of automatic discovery and
investigation of patterns in click streams and associated data collected or generated as a result of user
interactions with web resources on websites. The main goal of web usage mining is to capture, model
and analyze the behavioral patterns [3] and profiles of users interacting with websites. The
discovered patterns are usually represented as collections of pages, objects or resources that are
frequently accessed by groups of users with common needs or interests [4]. The primary data
resources used in web usage mining are log files generated by web and application servers. In
addition, it provides detailed information on web user behavior that can be useful for detecting
intrusion and fraud.
1.2 Web Usage Mining
The overall web usage mining process can be divided into five interdependent
stages: pre-processing, storage models, pattern discovery, optimization techniques and pattern
analysis, as shown in figure 2.

Figure 2: Stages of web usage mining
Pre-processing is the initial stage [4] of web usage mining; it creates a suitable target
dataset to which mining algorithms can be applied. The inputs to the pre-processing stage are the
web server logs, referral logs, registration files and index server logs [7]. In the
pre-processing stage, the click stream data is cleaned and partitioned into a set of user
transactions representing the activities of each user during different visits to the site.
Because it provides the data on which all further stages of web usage mining depend, the
importance of pre-processing should not be underestimated.
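As an illustration of the cleaning step, the following minimal Python sketch parses server log lines and discards entries that are usually irrelevant to usage analysis. It assumes the logs follow the Common Log Format; the names LOG_PATTERN, IGNORED_SUFFIXES, parse_line and clean_log, the list of ignored file suffixes and the rule of keeping only 2xx responses are illustrative assumptions and do not come from any of the surveyed methods.

```python
import re
from datetime import datetime

# Assumed input layout: Common Log Format, e.g.
# host ident user [time] "METHOD url PROTOCOL" status bytes
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" (?P<status>\d{3}) \S+'
)

# Hypothetical list of resource types treated as noise for usage analysis.
IGNORED_SUFFIXES = (".gif", ".jpg", ".png", ".css", ".js", ".ico")


def parse_line(line):
    """Return a dict for one log line, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    return {
        "host": match.group("host"),
        "time": datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z"),
        "url": match.group("url"),
        "status": int(match.group("status")),
    }


def clean_log(lines):
    """Keep successful page requests; drop malformed lines, errors and embedded resources."""
    records = []
    for line in lines:
        record = parse_line(line)
        if record is None:
            continue  # malformed entry
        if not 200 <= record["status"] < 300:
            continue  # failed or redirected request
        if record["url"].lower().endswith(IGNORED_SUFFIXES):
            continue  # image, stylesheet or script
        records.append(record)
    return records


SAMPLE_LOG = [
    '10.0.0.1 - - [10/Dec/2014:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 1043',
    '10.0.0.1 - - [10/Dec/2014:10:00:01 +0000] "GET /logo.png HTTP/1.1" 200 512',
    '10.0.0.2 - - [10/Dec/2014:10:05:00 +0000] "GET /missing.html HTTP/1.1" 404 0',
    'not a log line',
]
cleaned = clean_log(SAMPLE_LOG)
print([record["url"] for record in cleaned])
```

Only the first sample line survives the filter; the image request, the 404 response and the malformed line are dropped.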
Web mining applications rely on monolithic storage models, which take responsibility for
data storage, retrieval and maintenance. The storage model creates a solid platform of
consolidated, qualified data for analysis. Storage models can be divided into two types based on
the nature of their growth, namely static storage models and incremental storage models.
Pattern discovery takes the output generated by the pre-processing stage. The goal of pattern
discovery is to learn general concepts from the pre-processed data. In this phase,
statistical, database and machine learning techniques such as classification, clustering and
association rule mining are applied to the extracted information.
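To make the notion of association rule mining on usage data concrete, the sketch below computes the support and confidence of a rule over a handful of sessions, each treated as a set of visited pages. The session data and the function names support and confidence are hypothetical and serve only as an illustration.

```python
def support(sessions, pages):
    """Fraction of sessions that contain every page in `pages`."""
    pages = set(pages)
    hits = sum(1 for session in sessions if pages <= set(session))
    return hits / len(sessions)


def confidence(sessions, antecedent, consequent):
    """Estimated probability of seeing `consequent` in a session that contains `antecedent`."""
    base = support(sessions, antecedent)
    if base == 0:
        return 0.0
    return support(sessions, set(antecedent) | set(consequent)) / base


# Hypothetical sessions, each reduced to the set of pages visited.
sessions = [
    {"/home", "/products", "/cart"},
    {"/home", "/products"},
    {"/home", "/about"},
    {"/products", "/cart"},
]

rule_support = support(sessions, {"/products", "/cart"})
rule_confidence = confidence(sessions, {"/products"}, {"/cart"})
print(rule_support, rule_confidence)
```

Here the rule /products => /cart holds in two of four sessions (support 0.5) and in two of the three sessions that contain /products (confidence 2/3).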
The soft optimization techniques are characterized by their ability for granular computation in
avoiding the concept of approximation. These models provide the foundation for
computational intelligence systems and outline the basis of future generation computing
systems. They closely resemble human-like decision making and are used for modeling
highly non-linear data, where pattern discovery, rule generation and learnability are typical.
Pattern analysis is the final stage of usage mining [4]; it extracts interesting patterns
from the output of pattern discovery. The goal of pattern analysis is to understand,
visualize and interpret the discovered patterns and statistics. The output generated by pattern
analysis is used as input to various applications such as recommendation engines, visualization
tools and web analytics.
1.3 Web Log Characteristics
Data collection is the primary step in the web usage mining process [15]; it is the process of
extracting the task-relevant data from diverse weblogs. Obtaining detailed web usage data from
weblogs is an important but difficult task, since weblogs are unstructured, incremental in nature
and rapidly growing. Care has to be taken in collecting the data from weblogs, which normally
include web content data, web structure data and web usage data.
Web usage data stores general access logs and user profiles, which consist of general access
patterns and customized access patterns respectively. Web usage mining is the task of applying
data mining techniques to extract useful patterns from such web data through various stages. These
patterns can be used to investigate interesting characteristics of web users.
The data collected at web servers is the richest, but in practice it is very difficult to obtain
web server data. Data can also be collected at web clients by means of JavaScript, Java applets
and modified browsers, yet these methods require user participation to enable their functionality
[16]. Proxy servers are more suitable and reliable for collecting web usage data since they are
placed at the client location and act as real servers. They capture all requests made by clients
to the original server, store them automatically in weblogs and, further, improve navigation
speed through caching.
The weblog collected at a proxy server is unstructured since it contains different
types of entries. These entries do not have a definite number of attributes, an identifiable
structure or a defined relation. This results in ambiguities and makes the weblog difficult for
computer programs to interpret.
Weblog data is also heterogeneous, distributed, composed of different data types, dynamic in
content, voluminous / non-scalable, incremental in nature and exponentially growing, and it has
a time dimension.
2. AN EXTENSIVE AND COMPREHENSIVE LITERATURE SURVEY
Knowledge discovery in data has been used to analyze the data collected on the web and
to extract useful knowledge. This effort was named Web Mining by Etzioni in 1996, and from
then onwards research on web mining took root through the efforts of Robert Armstrong and
his colleagues. Several approaches [1] have been proposed by many authors in the vein of web
mining to take the work to the next level. The rapidly growing nature of weblogs strongly
motivates [1] next-generation researchers to take an interest in the emerging field of web mining.
Recent efforts [7] have focused on issues related to the feasibility, scalability, usability
and efficiency of web mining techniques.
The survey conducted by various authors [4] and their research contributions identified three
broad categories of web mining, namely web structure mining, web usage mining and web content
mining. The literature survey of web usage mining is as shown in figure 3.
Figure 3: Roadmap of Literature Survey

Some authors [2, 5] recognized that web usage mining can be utilized for
different tasks, namely Personalization, System Improvement, Usage Characterization, Site
Modification and Business Intelligence. Other authors [2, 9] have dealt with the
problem of web personalization, a task that comprises simple functions, advanced functions and
intelligent functions which can perform certain tasks on behalf of the user without explicit
requests.
The work of the authors of [5, 6] shows that system improvement has emerged as
another focus of web usage mining, concentrating on improving techniques to mine
information and knowledge on the web quickly and effectively. Performance and other service
quality attributes are crucial to user satisfaction with services.
The survey conducted by various authors [1, 3] acknowledged that usage characterization is
an interesting task of web usage mining which focuses on the techniques that predict the user
behavior while the user interacts with the website. Some authors [10] have mostly focused on the
approaches that are utilized in site modification using web usage mining. The site modification is a
crucial issue for many applications in terms of both usage and structure.
Recent work [11] has made it evident that emerging e-services of the web era such as
e-commerce, e-learning, e-banking and so on are radically changing the
usage of the internet, turning websites into businesses, and thus business intelligence has
become one of the tasks of web usage mining.
2.1 Pre-Processing Techniques
In the recent past web usage mining has attracted the attention of many researchers [15, 17], yet
pre-processing in knowledge discovery has received less attention than it deserves. Many
researchers [12] are working on pre-processing, which involves user identification, session
identification, path completion and transaction identification.
In 2000, Brodley and Kohavi discussed clipping once per session, in which a single session is
truncated at some point within the session. The implicit unit of analysis here is a fragment of a
session. Martin Arlitt [21] studied numerous user session characteristics including the number of
requests per session, the number of pages requested per session, session length and inter-session
times. Both of these techniques are time consuming and inaccurate as the web server log captures
cache hits.
During 2001, Berendt, B., B. Mobasher, M. Spiliopoulou, J. Wiltshire [13] provided a formal
framework that uses sessionizing heuristics, which partition the user activity log into a set of
constructed sessions using time-oriented and navigation-oriented heuristics, to improve
the accuracy of sessionization.
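A time-oriented heuristic of this general kind can be sketched as follows. The sketch starts a new session whenever the gap between two consecutive requests of the same user exceeds a threshold; the 30-minute default, the (user, timestamp, url) tuple layout and the function name sessionize are assumptions made for illustration and are not taken from the cited framework.

```python
from datetime import datetime, timedelta


def sessionize(requests, max_gap=timedelta(minutes=30)):
    """Split each user's requests into sessions using a page-gap heuristic.

    `requests` is an iterable of (user_id, timestamp, url) tuples.
    Returns a list of (user_id, [urls]) pairs, one per reconstructed session.
    """
    by_user = {}
    for user, timestamp, url in sorted(requests, key=lambda r: (r[0], r[1])):
        by_user.setdefault(user, []).append((timestamp, url))

    sessions = []
    for user, visits in by_user.items():
        current = [visits[0][1]]
        for (previous_time, _), (timestamp, url) in zip(visits, visits[1:]):
            if timestamp - previous_time > max_gap:
                # Gap too long: close the current session and start a new one.
                sessions.append((user, current))
                current = []
            current.append(url)
        sessions.append((user, current))
    return sessions


start = datetime(2014, 12, 10, 10, 0)
requests = [
    ("u1", start, "/a"),
    ("u1", start + timedelta(minutes=10), "/b"),
    ("u2", start + timedelta(minutes=5), "/a"),
    ("u1", start + timedelta(minutes=60), "/c"),
]
sessions = sessionize(requests)
print(sessions)
```

With the assumed threshold, user u1 ends up with two sessions (the 50-minute pause before /c splits them) and user u2 with one.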
In 2002, Tan P N and Kumar V [24] studied web robots, which can perform many tasks
automatically and make the web administrators' task easy. Web robots are used by many
business organizations to collect email addresses, monitor product prices, corporate news and so
on. This has led to a large proliferation of weblog data.
In 2003, M. Spiliopoulou, B. Mobasher, B. Berendt [22] have expressed that the reliability of
web usage mining results depends heavily on the proper preparation of the input datasets. In
particular, errors in the reconstruction of sessions and incomplete tracing of users’ activities in a site
can easily result in invalid patterns.
In 2004, J. Zhang and A. A. Ghorbani [19] described improved statistics-based
time-oriented heuristics for the reconstruction of user sessions from a server log. Even so,
some of the usage mining results in their experiments showed lower performance. They
suggested that a combination of time-oriented heuristics might achieve better performance.

In 2005, Show-Jane Yen, Yue-Shi Lee [52] applied a bit-string database generation technique
as a part of pre-processing to improve the efficiency of finding interesting association rules.
However, they noted that the bit-string database generation technique costs extra memory space
in the transformation process.
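The general idea behind a bit-string representation can be illustrated with the sketch below, which stores one bit string per item (bit i is set when transaction i contains the item) and counts the support of an itemset with a bitwise AND. This is a generic bit-vector illustration written for this survey, not the specific technique of [52]; the function names and the sample transactions are hypothetical.

```python
def build_bit_strings(transactions):
    """Map each item to an integer whose bit i is 1 if transaction i contains the item."""
    bits = {}
    for tid, items in enumerate(transactions):
        for item in items:
            bits[item] = bits.get(item, 0) | (1 << tid)
    return bits


def support_count(bits, itemset, n_transactions):
    """Count the transactions containing every item of `itemset` with bitwise ANDs."""
    mask = (1 << n_transactions) - 1  # start with all transactions selected
    for item in itemset:
        mask &= bits.get(item, 0)
    return bin(mask).count("1")


# Hypothetical transactions (for weblogs, the items would be page identifiers).
transactions = [{"A", "B", "C"}, {"A", "C"}, {"B", "C"}, {"A", "B"}]
bits = build_bit_strings(transactions)

print(support_count(bits, {"A", "C"}, len(transactions)))
print(support_count(bits, {"A", "B", "C"}, len(transactions)))
```

The trade-off mentioned above is visible here: the bit strings are an additional copy of the database held in memory, in exchange for support counting without rescanning the transactions.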
In the subsequent year 2006, Natheer Khasawneh and Chien-Chung Chan [23] introduced a
fast active user based user identification algorithm and they also presented an ontology based method
that utilizes functionalities to identify sessions.
After that in year 2007, Renata Ivancsy and Sandor Juhasz [26] attempted a novel approach
that uses a complex cookie based method to identify web users. Furthermore, they also explained
steps towards identifying individuals behind impersonal web users. Their approach is demonstrated
by implementing a web activity tracking system that aims at a more precise distinction of web
users based on log data.
Then in the year 2008, G T Raju, P S Satyanarayana [18] implemented a complete
pre-processing methodology that utilizes several heuristics for cleaning web usage data. This
methodology allows the analyst to merge any collection of weblogs into a single file and
to analyze multiple weblogs jointly. Yet the relational data model is not
suitable for the present huge weblogs, which creates scope for further research.
Within the year 2009, K. R. Suneetha, Dr. R. Krishnamoorthi [20] summarized the weblog
locations as server side logs, proxy side logs and client side logs, along with their types such
as transfer log, agent log, error log and referrer log. They presented the weblog structure and
its attributes, and indicated as future work that data mining techniques can be applied to the
pre-processed weblog to find frequently accessed patterns in less time with high accuracy.
In 2010, V. Chitraa, A.S.D. [27] reviewed the existing work done in the pre-processing stage,
which includes data collection and pre-processing. Data collection takes place at the server
level, proxy level and client side. They also concluded that the weblog is the best source for
learning usage behavior, but the raw weblogs contain unnecessary details which affect the
accuracy of pattern discovery and analysis.
In 2011, Ma Shuyue, Liu Wencai, Wang Shuo [28] reviewed the existing work done in the
pre-processing stage and noted that embedded animations in web pages and other page elements
which meet the new standards can be combined with the logs and become a concern. Therefore,
pre-processing before web mining on weblogs should become a more important research topic.
During 2012, P. Nithya, Dr. P. Sumathi [25] proposed a novel pre-processing technique that
can remove local and global noise and web robot data. As a future direction, however, they noted
that intelligent techniques are required to discard noise and data accessed by web robots
automatically.
Recently in 2013, Chintan R. Varnagar, Nirali N. Madhak, Trupti M. Kodinariya, Jayesh N.
Rathod [14] provided a detailed survey of work done so far on data collection and pre-processing
stage of web usage mining. They affirmed that weblog data pre-processing is a very important and
crucial task in the entire process. This phase can be strengthened by choosing and applying
various intelligent techniques.
2.2 Storage Models
Valtchev P. and Missaoui R. [37] proposed a framework in the year 2001 that defines a
generic procedure for lattice building. This new lattice building approach, which is built on
lattice theory, is more general than previous ones. They noted as future work that the features
of growing data call for the development of incremental algorithms. Yet the major
known incremental techniques lack a theoretical foundation.
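For readers unfamiliar with concept lattices, the sketch below enumerates the formal concepts (extent, intent pairs) of a tiny context in which sessions are the objects and visited pages are the attributes. It is a brute-force illustration that tries every subset of objects, so it is only usable on toy data; it is not the generic procedure of [37], and the context and function names are hypothetical.

```python
from itertools import combinations


def common_attributes(context, objects):
    """Attributes shared by all given objects (all attributes if `objects` is empty)."""
    sets = [context[obj] for obj in objects]
    if not sets:
        return set().union(*context.values())
    return set.intersection(*sets)


def common_objects(context, attributes):
    """Objects that have every attribute in `attributes`."""
    return {obj for obj, attrs in context.items() if attributes <= attrs}


def formal_concepts(context):
    """Enumerate all (extent, intent) pairs by closing every subset of objects."""
    concepts = set()
    objects = list(context)
    for size in range(len(objects) + 1):
        for subset in combinations(objects, size):
            intent = common_attributes(context, subset)
            extent = common_objects(context, intent)
            concepts.add((frozenset(extent), frozenset(intent)))
    return concepts


# Hypothetical context: session id -> set of pages visited.
context = {
    "s1": {"home", "cart"},
    "s2": {"home"},
    "s3": {"home", "cart", "pay"},
}
concepts = formal_concepts(context)
for extent, intent in sorted(concepts, key=lambda c: len(c[0])):
    print(sorted(extent), sorted(intent))
```

Ordered by extent inclusion, the three resulting concepts form a chain, which is the concept lattice of this small context. An incremental algorithm would update this structure as new sessions arrive instead of recomputing it from scratch.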

During the year 2002, JL. Pfaltz and CM. Taylor [33] derived concept lattices using universally
quantified expressions. They presented the concept of incremental lattices and showed how their
associated logical implications, along with scientific observations, are generated. They
stressed that incremental discovery is well suited to growing data such as weblogs.
In the subsequent year 2003, F. Masseglia, P. Poncelet, and M. Teisseire [31] considered the
problem of incremental mining and presented a new algorithm for mining frequent sequences in the
updated database. They specifically expressed in their future avenue that incremental mining is also
appropriate for web usage mining, where the modifications need to be taken into account in order to
save storage space as previous information is no longer of interest or becomes invalid.
In 2004, Show-Jane Yen, Yue-Shi Lee and Chung-Wen Cho [36] implemented an
incremental updating technique to maintain the discovered frequent traversal patterns when the user
sequences are inserted into the database. The experimental results have shown that the algorithm is
efficient for the maintenance of mining frequent traversal patterns.
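The benefit of incremental maintenance can be illustrated with the simplified sketch below, which keeps running counts of consecutive page pairs and updates them as new user sequences are inserted, without rescanning earlier sequences. It is a deliberately naive illustration (it keeps a count for every pair ever seen, which practical incremental algorithms avoid) and is not the algorithm of [36]; the class name and the sample sequences are hypothetical.

```python
from collections import Counter


class IncrementalPatternCounter:
    """Maintain frequent consecutive page pairs as new sequences arrive."""

    def __init__(self, min_support):
        self.min_support = min_support  # minimum fraction of sequences
        self.n_sequences = 0
        self.counts = Counter()

    def insert(self, sequences):
        """Add new user sequences; earlier sequences are not rescanned."""
        for sequence in sequences:
            self.n_sequences += 1
            # Count each consecutive pair at most once per sequence.
            self.counts.update(set(zip(sequence, sequence[1:])))

    def frequent(self):
        """Return the pairs whose count meets the current support threshold."""
        threshold = self.min_support * self.n_sequences
        return {pair for pair, count in self.counts.items() if count >= threshold}


counter = IncrementalPatternCounter(min_support=0.5)
counter.insert([["A", "B", "C"], ["A", "B"]])
first = counter.frequent()

counter.insert([["A", "C"], ["A", "C"]])
second = counter.frequent()
print(first, second)
```

After the second batch the pair (B, C) is no longer frequent and (A, C) has become frequent, which shows why discovered patterns have to be maintained as the weblog grows.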
Shao M W. presented the approaches of attribute reduction and object reduction for two kinds
of generalized concept lattices in the year 2005, in which they removed the attributes and objects that
are not essential to the generalized lattices.
During 2006, Li H R, Zhang W X, Wang H [34] investigated the attribute classification and
reduction of lattices using binary relations. They also presented two kinds of recognition methods of
attribute classification based on the properties of irreducible elements and its congruence. According
to the classification a reduction method of lattices is obtained.
In the year 2007, Graham Cormode, Flip Korn, S. Muthukrishnan and Divesh Srivastava [32]
proposed an algorithm based on the product of hierarchical dimensions from mathematical lattice
structure. They highlighted the importance of hierarchical multidimensional summarization of
data in emerging data stream applications. However, there is no clear method for discarding nodes
that do not meet the minimum threshold.
Ben Martin and Peter Eklund [39] proposed a border algorithm for making the covering
relations of concepts explicit for iceberg lattices in the year 2008. Empirical testing was
performed to compare the border algorithm with a traditional algorithm based on the covering edges
algorithm from Concept Data Analysis.
An incremental data mining algorithm was proposed in 2009 by Yue-Shi Lee [39]
for website redesign to improve web services based on navigational patterns. Moreover,
the technique can be used not only for website design but also to analyze
user behavior.
In 2010, Eklund, Peter, Villerd, Jean [30] proposed hybrid visual representation techniques for
concept lattices, concentrating on line diagrams. However, combining too many value attributes
resulted in complex nested diagrams, useful for deep analysis but not suitable for first-glance
navigation.
In 2011, Andreas Lubcke, Veit Koppen, and Gunter Saake [38] introduced a design process, based
on a decision approach, for different storage architectures. Recently in 2012, Santo
Lombardo, Elisabetta Di Nitto and Danilo Ardagna expressed the need to develop a
hybrid architecture for storage models as their future work.
2.3 Pattern Discovery Techniques
In the year 2001, Lu, S., Hu, H., and Li, F. [48] presented vertical and mixed weighted
association rules on static datasets to determine correlations between items. Their experiments
showed better predictive ability. The method was also demonstrated on
static and synthetic datasets.

During 2002, C. Ezeife, Y. Su [40] presented a frequent pattern tree structure to reduce the
required number of database scans. The proposed DB Tree and PotFP Tree algorithms
were applied to large databases. The discovery of closed patterns on multi-dimensional patterns
is an interesting future direction from their work.
In 2003, J. Fong, H.K. Wong, S.M. Huang [45] introduced a frame metadata
model to facilitate continuous association rule generation in data mining. With this model, a
new set of association rules can be derived when the source databases are updated, using static
and active classes. Mining association rules on weblog data is the future direction.
An efficient method was presented in 2004 by Harms, S.K. and Deogun [43] for mining frequent
association rules from multiple datasets. This work highlighted the importance of user
constraints by introducing antecedent and consequent patterns in the mining of association rules.
In the next year 2005, F.C Tseng, C.C. Hsu and K.S. Fu [42] established a simpler and more
efficient data structure for representing the frequent pattern list. The technique partitioned both the
search space and solution space so as to apply divide and conquer approach in mining frequent
patterns.
During 2006, Huang, Y.M., Kuo, Y-H., Chen, J.N. and Jeng, Y.L [44] developed a
navigational pattern tree, NP-Miner, for discovering sequential patterns. Most authors have
discussed the dynamic maintenance and updating of the knowledge base and comprehensive evaluation
methods in the knowledge discovery process.
In the subsequent year 2007, Dalamagas, T., Bouros, P., Galanis, T., Eirinaki, M. and Sellis,
T.K [41] provided a set of mining tasks intended for user navigation patterns, with a focus on
personalizing topic directories according to the navigational behavior of the users. The efforts made
by many of the researchers concluded that among all the data mining algorithms of association rules,
incremental algorithms fit better for growing large databases.
In 2008, J L Balcazar [47] studied the concept of redundancy
among association rules from a fundamental perspective. The work discussed several existing
alternative definitions of redundancy between association rules and provided new
characterizations and relationships among them. It also provided a sound and complete calculus
for constructing a deduction scheme for redundancy rules, and analyzed the risk degree of lost
rules based on incremental mining.
In 2009, Przemysław Kazienko [51] reviewed indirect association rules and presented a new
approach to discover indirect associations existing between pages that rarely occur together.
The experimental results revealed the usefulness of indirect rules in the weblog scenario.
In the year 2010, Priyanka Makkar, Payal Gulati, Dr. A.K. Sharma [50] presented
an approach for predicting user behavior to improve web performance. They also noted
that web pre-fetching has become an attractive solution, wherein forthcoming page accesses of a
user are predicted based on the weblog.
In 2011, Liu Jian, Wang Yan-Qing [49] presented a research framework that makes a
contribution to web mining. Their experimental results show that there is still space for improvement
to optimize the solution by employing advanced techniques.
During 2012, Mahendra Pratap Yadav, Pankaj Kumar Keserwani, Shefalika Ghosh Samaddar
proposed an adaptive algorithm for incremental mining of association rules. The algorithm is a
highly efficient incremental mining technique and the authors expressed it can be applied in the
weblog scenario.
Recently in 2013, Johannes K. Chiang, Rui-Han Yang [46] proposed an approach which
includes a novel data structure and an efficient algorithm for mining association rules on various
granularities. Their test results showed its performance, efficiency and scalability to be better
than the current approaches. However, the effects of perceived issues and the potential
development of data mining and concept description are worthy of further investigation.
2.4 Optimization Techniques
In the year 2002, H. K. Tsai, J. M. Yang, and C. Y. Kao. [56] demonstrated the use of genetic
algorithms (GA) in finding optimal global strategies by using a clustering technique on
biological datasets. The extensive bibliography they provide is evidence of the relevance of
GAs to web mining.
Then in 2003, B. Minaei-Bidgoli, William F. Punch [53] presented an approach for
classifying the students in order to predict their final grade based on features extracted from log data
of an education web based system. To minimize the prediction error rate they used genetic
algorithms by weighting the features. They also provided comparison study of several crossover
operators.
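The idea of weighting features with a genetic algorithm can be sketched as below. The sketch is a generic GA with tournament selection, uniform crossover, Gaussian mutation and elitism; it is not the implementation of [53]. The toy fitness function, which rewards weights close to a hypothetical TARGET vector, merely stands in for the prediction error of a real classifier, and all names and parameter values are assumptions.

```python
import random


def evolve_weights(fitness, n_features, pop_size=20, generations=30, seed=0):
    """Search for feature weights in [0, 1] that maximize `fitness`.

    Returns the best weight vector and the best fitness seen at each generation.
    """
    rng = random.Random(seed)
    population = [[rng.random() for _ in range(n_features)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        history.append(fitness(ranked[0]))
        next_generation = [ranked[0][:]]  # elitism: keep the best individual
        while len(next_generation) < pop_size:
            # Tournament selection of size 2 for each parent.
            parent_a = max(rng.sample(population, 2), key=fitness)
            parent_b = max(rng.sample(population, 2), key=fitness)
            # Uniform crossover: each gene comes from either parent.
            child = [a if rng.random() < 0.5 else b for a, b in zip(parent_a, parent_b)]
            # Gaussian mutation with probability 0.1 per gene, clipped to [0, 1].
            for i in range(n_features):
                if rng.random() < 0.1:
                    child[i] = min(1.0, max(0.0, child[i] + rng.gauss(0, 0.05)))
            next_generation.append(child)
        population = next_generation
    return max(population, key=fitness), history


# Hypothetical "ideal" weights, used only to build a toy fitness function.
TARGET = [0.9, 0.1, 0.5]


def toy_fitness(weights):
    """Higher is better: negative squared distance to the assumed target."""
    return -sum((w - t) ** 2 for w, t in zip(weights, TARGET))


best, history = evolve_weights(toy_fitness, n_features=3)
print(best, history[0], toy_fitness(best))
```

Because the best individual is always carried over, the best fitness never decreases from one generation to the next. Swapping the crossover step for another operator is the kind of comparison the cited study reports.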
In the next year 2004, B. Minaei-Bidgoli, G. Kortemeyer, and W. F. Punch extended their
previous work [53] and presented a new approach for predicting students' performance based on
extracting the average of feature values over all of the problems.
S. Y. Wang and K. Tai [60] in 2005, implemented a bit array representation method for
structural topology optimization using the genetic algorithm. An identical initialization method is
also proposed to improve the genetic algorithm performance in dealing with problems with narrow
design domains. Their results specified that bit array representation is suitable in handling the design
connectivity problem.
In the next year 2006, S. Y. Wang, K. Tai, and M. Y. Wang [61] presented a versatile, robust
and enhanced genetic algorithm for structural topology optimization using problem specific
knowledge. In their implementation they specifically stressed the importance of choosing
appropriate representation techniques, genetic operators and evaluation methods.
During 2008, S. Ventura, C. Romero, A. Zafra, J. A. Delgado, and C. Hervas [59] designed a
framework that can be applied to maximize the reusability and availability of evolutionary
computation with minimum effort in web mining. The heavy computational demand is an open
problem earmarked for their future research work.
In 2009, Hyunchul Ahn, Kyoung-jae Kim [57] reviewed prior studies on optimization techniques
for several systems. They further examined a genetic approach for the optimization of feature
weights and relevant instances for similarity calculations. They also noted among their
limitations that the population size and the number of generations required by the genetic
algorithm are very large. Thus, reducing the population size and number of generations for the
genetic algorithm is an open challenge.
In the year 2010, Mehmet Kaya [58] proposed a novel method using a multi-objective
evolutionary algorithm that extracts patterns automatically. The method was applied to a dataset
with a sequential character. The methodology of automatic extraction is a promising direction for
future research, as noted in the conclusions.
In 2011, Diana Martın, Alejandro Rosete, Jesus Alcala-Fdez and Francisco Herrera [54]
extended the well-known multi-objective evolutionary algorithms to perform learning of the intervals
of attributes and a condition selection in order to mine a set of optimum association rules with
accuracy.
During 2012, Xiaoyan Sun, Lei Yang, Dunwei Gong and Ming Li [62] found that collective
intelligence extracted from multiple users enhances the performance of GA. They regard designing
evolutionary algorithms as a promising research direction in the knowledge discovery process, as
mentioned in their future research directions.

Recently in 2013, Gaurav Dubey, Arvind Jaiswa [55] dealt with the
association rule mining problem of finding frequent itemsets using a GA-based method. However,
they noted that a more extensive empirical evaluation of their proposed method is left for
future research.
2.5 Pattern Analysis Techniques
During 2001, Hilderman, R. J. and Hamilton, H. J focused on classifying interestingness
measures and provided a general overview of the more successful and widely used interestingness
measures from the literature that have been employed in data mining applications. They noted that
extending the theory of interestingness for diversity measures is open for future research work.
In 2002, Keim, D.A. [69] proposed a classification of visualization techniques
based on the data type to be visualized, and proposed tight integration of
visualization techniques with traditional techniques as future work. Grinstein, G., Hoffman, P.,
& Pickett, R described a set of benchmarks for visualization approaches. Although benchmarking
has been carried out, further study and contribution from the research community and industry
are required.
In the year 2003, Brijs T., Vanhoof K. and Wets G [63] provided an overview of existing
measures of interestingness and divided them into objective and subjective measures. They focused
on objective measures by means of statistical criteria.
In the next year 2004, Jaroszewicz, S. and Simovici, D A [68] proposed a new definition of
the interestingness of an itemset as the absolute difference between its support estimated from
the data and from the Bayesian network. In addition, they presented an efficient algorithm based
on the new definition. Their experimental evaluation showed the usefulness of the algorithm for
finding interesting, unexpected patterns.
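Written as a formula, this definition reads as follows. The symbols are notation chosen here for illustration: X is an itemset, supp_D(X) is its support estimated from the data D, and supp_BN(X) is its support estimated from the Bayesian network BN that encodes the background knowledge.

```latex
\[
  \mathcal{I}(X) \;=\; \bigl|\, \mathrm{supp}_{D}(X) - \mathrm{supp}_{BN}(X) \,\bigr|
\]
```

An itemset is thus interesting when the data disagrees strongly with what the network predicts, regardless of the direction of the disagreement.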
During 2005, Furnkranz, J. and Flach, P. A [64] analyzed the behavior of covering
rule algorithms by visualizing their evaluation metrics and their dynamics in coverage space. They
described heuristics for evaluating rules as well as filtering and stopping criteria. Their
experimental results showed that the approach is suitable for understanding both the behavior of
heuristic functions and their dynamics.
In 2006, Padmanabhan, B. and Tuzhilin, A. [71] presented a new method for discovering a
minimal set of unexpected patterns by combining two independent concepts, minimality and
unexpectedness, both of which have been well studied in the KDD literature. They demonstrated the
strength of this approach experimentally.
In 2007, Heng-Soon Gan and Andrew Wirth [65] defined rescheduling stability
quantitatively and provided analytical means for various heuristics, arguing that the rescheduling
stability of a heuristic is important in addition to its effectiveness and efficiency. They
extended empirical and analytical work on heuristic robustness. As future work, they suggested that
Spearman's footrule, a measure of permutation disarray, may shed further light on heuristics.
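Spearman's footrule is the sum, over all items, of the absolute difference between an item's position in one permutation and its position in another. The sketch below computes it for two job sequences; the job names and the framing as an original versus a revised schedule are illustrative assumptions, not the experimental setup of [65].

```python
def spearman_footrule(order_a, order_b):
    """Sum of absolute position differences between two orderings of the same items."""
    if sorted(order_a) != sorted(order_b):
        raise ValueError("both orderings must contain exactly the same items")
    position_in_b = {item: index for index, item in enumerate(order_b)}
    return sum(abs(index - position_in_b[item]) for index, item in enumerate(order_a))


# Hypothetical schedules: the same three jobs before and after rescheduling.
original_schedule = ["j1", "j2", "j3"]
revised_schedule = ["j3", "j2", "j1"]

disarray = spearman_footrule(original_schedule, revised_schedule)  # 2 + 0 + 2 = 4
```

A value of 0 means the two sequences are identical, and larger values mean the revised sequence has moved jobs further from their original positions.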
In 2008, Vitaly Friedman [72] proposed the USER approach, which finds unexpected
sequences and implication rules in data with respect to user-defined beliefs, for mining unexpected
behaviors from weblogs. Because unexpected behaviors affect web usage analysis, and identifying
unexpectedness often depends on semantics and user beliefs, the measurement of unexpectedness
remains an open research problem.
In 2009, Michael Friendly [70] surveyed visualization techniques from their deep roots to
their current fruit. His work triggered interesting future research paths towards automation of the
process, quality, scalability and intelligence of the sensitivity measure.
In 2011, Izwan Nizal Mohd Shaharanee, Fedja Hadzic and Tharam S. Dillon [67] proposed
a strategy that combines data mining and statistical measurement techniques, including redundancy
analysis, sampling and multivariate statistical analysis, to discard non-significant rules. Their
experimental results show that the framework removes a large number of non-significant and
redundant rules while preserving relatively high accuracy.
In 2013, David H. Glass focused on a particular class of objective measures known
as confirmation measures. This class of objective measures provides a solid basis for
future research.
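As an illustration of what a confirmation measure looks like, the sketch below computes the difference measure P(B|A) - P(B) for a rule A -> B, one standard confirmation measure from the literature. Whether this particular measure is among those studied by Glass is not stated in this survey; it is chosen here as an assumption, and the session data and function names are hypothetical.

```python
def probability(sessions, pages):
    """Fraction of sessions that contain every page in `pages`."""
    pages = frozenset(pages)
    return sum(1 for s in sessions if pages <= s) / len(sessions)


def difference_confirmation(sessions, antecedent, consequent):
    """P(consequent | antecedent) - P(consequent); assumes the antecedent occurs at least once."""
    p_a = probability(sessions, antecedent)
    p_ab = probability(sessions, set(antecedent) | set(consequent))
    return p_ab / p_a - probability(sessions, consequent)


# Hypothetical sessions: each one is the set of pages visited by a user.
sessions = [
    frozenset({"home", "catalog", "cart"}),
    frozenset({"home", "catalog"}),
    frozenset({"home", "help"}),
    frozenset({"catalog", "cart"}),
]

confirms = difference_confirmation(sessions, {"catalog"}, {"cart"})
disconfirms = difference_confirmation(sessions, {"home"}, {"cart"})
```

A positive value means that seeing the antecedent raises the probability of the consequent, a negative value means it lowers it, and zero means the two are statistically independent in the data.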
3. CONCLUSIONS
The literature clearly shows that web usage mining is a promising and attractive task within
web mining. This survey emphasizes that usage characterization consists mainly of five
interdependent stages: pre-processing, storage models, pattern discovery, optimization techniques
and pattern analysis. The authors also observed the importance, criticality and efficiency of a
comprehensive approach to the process of web usage mining, which forms the formal basis for
future work. Furthermore, the literature survey recognized that implementing the interdependent
stages comprehensively is a promising and practical research area.
4. FUTURE WORK
As future work, a new approach will be designed and developed that concentrates
comprehensively on all stages of web usage mining and leverages the strengths of the individual
techniques. In addition, the comprehensive approach is planned to be tested with different weblogs
that cover a large spectrum of applications, such as web usage analysis for improvements in
fraud detection, product analysis and customer segmentation.
Further, an extension of the comprehensive model can enable an effective integration and
mining of content, usage and structure web log data. This promises to lead to the next
generation of useful and intelligent tools for web mining that can derive real-time knowledge from
user transactions on the web.
REFERENCES
[1] B. Masand, M. Spiliopoulou, J. Srivastava, O.R. Zaiane, Proceedings of WebKDD2002, “Web
Mining for Usage Patterns and Profiles”, Edmonton, CA, 2002.
[2] B. Mobasher, R. Cooley, J. Srivastava, “Automatic Personalization Based on Web Usage
Mining” Communications of the ACM, 43(8), pp: 142–151, 2000.
[3] E. Frias-Martinez, V. Karamcheti, “A customizable behavior model for temporal prediction of
web user sequences”, In WEBKDD, Explorations, 1(2), pp: 66–85, 2000.
[4] Geeta R. B., Prof. Shashikumar, G. Totad, Dr. Prasad Reddy, “Amalgamation of Web Usage
Mining and Web Structure Mining”, International Journal of Recent Trends in Engineering,
Vol. 1, No. 2, pp: 279-281, 2009.
[5] Guandong Xu, Yanchun Zhang, Xun Yi, “Modelling User Behaviour for Web
Recommendation Using LDA Model”, IEEE/WIC/ACM International Conference on Web
Intelligence and Intelligent Agent Technology, pp: 529-532, 2008.
[6] Han, J., Chang, K. C., “Data mining for Web intelligence”, IEEE Computer, 35(11), pp: 64-70,
2002.
[7] Joshi K. P., Joshi A., Yesha Y., Krishnapuram R., “Warehousing and mining web logs”, In
Proceedings of the 2nd ACM CIKM Workshop on Web Information and Data Management,
Kansas City, Missouri, USA 1999, pp: 63–8, 1999.

[8] Kao, H., Lin, S., Ho, J., Chen, M., “Mining Web Informative Structures and Contents Based on
Entropy Analysis”, IEEE Transactions on Knowledge and Data Engineering, Vol.16, Iss.1,
pp: 41 – 55, 2004.
[9] Massimiliano Albanese, Antonio Picariello, Carlo Sansone, Lucio Sansone, “Web Personalization
Based on Static Information and Dynamic User Behavior”, ACM 1-58113-978-0/04/0011,
2004.
[10] Qingtian Han, Xiaoyan Gao, Wenguo Wu, “Study on Web Mining Algorithm Based on Usage
Mining”, IEEE Xplore, pp: 1121-1124, 2008.
[11] R. Kohavi, M. Spiliopoulou, J. Srivastava, Proceedings of “WebKDD2000 Web Mining for
E-Commerce – Challenges & Opportunities”, Boston, MA, 2000.
[12] Berendt B. et al., “The Impact of Site Structure and User Environment on Session
Reconstruction in Web Usage Analysis”, Proc. WEBKDD 2002: Mining Web Data for
Discovery Usage Patterns and Profiles, LNCS 2703, Springer-Verlag, pp: 159–179, 2002.
[13] Berendt, B., B. Mobasher, M. Spiliopoulou, J. Wiltshire, “Measuring the accuracy of
sessionizers for web usage analysis”, Proc. of the Workshop on Web Mining, First SIAM
Internat. Conf. on Data Mining, Chicago, IL, pp: 7–14, 2001.
[14] Chintan R. Varnagar, Nirali N. Madhak, Trupti M. Kodinariya, Jayesh N. Rathod “Web Usage
Mining: A Review on Process, Methods and Techniques” IEEE Conference Publications,
pp: 40-46, 2013.
[15] Chungsheng Zhang, Liyan Zhuang, “New Path Filling Method on Data Preprocessing in Web
Mining”, Computer and Information Science, vol1(3), pp: 112-115, 2008.
[16] Fenstermacher, K.D., M. Ginsburg, “Client-side monitoring for web mining”, Journal of the
American Society for Information Science and Technology, Vol. 54, No. 7, pp: 625-637, 2003.
[17] G. Castellano, A. M. Fanelli, M. A. Torsello, “Log Data Preparation For Mining Web Usage
Patterns”, IADIS International Conference Applied Computing, 2007.
[18] G T Raju, P S Satyanarayana, “Knowledge Discovery from Web Usage Data: Complete
Preprocessing Methodology”, IJCSNS International Journal of Computer Science and Network
Security, VOL.8 No.1, pp: 179- 186, 2008.
[19] J. Zhang, A. A. Ghorbani, “The reconstruction of user sessions from a server log using
improved time oriented heuristics” in CNSR, IEEE Computer Society, pp: 315–322, 2004.
[20] K. R. Suneetha, Dr. R. Krishnamoorthi, “Identifying User Behavior by Analyzing Web Server
Access Log File”, IJCSNS International Journal of Computer Science and Network Security,
VOL.9 (4), pp: 327-332, 2009.
[21] Martin Arlitt, “Characterizing Web User Sessions” Internet and Mobile Systems Laboratory
HP Laboratories Palo Alto HPL- 2000-43, May, 2000.
[22] M. Spiliopoulou, B. Mobasher, B. Berendt, M. Nakagawa, “A Framework for the Evaluation of
Session Reconstruction Heuristics in Web Usage Analysis”. INFORMS Journal of Computing -
Special Issue on Mining Web-Based Data for E-Business Applications, 15 (2), pp: 171–190,
2003.
[23] Natheer Khasawneh, Chien-Chung Chan, “Active User-Based and Ontology-Based Web Log
Data Preprocessing for Web”, Proceedings of the 2006 IEEE/WIC/ACM International
Conference on Web Intelligence, 2006.
[24] Pang-Ning Tan, Vipin Kumar, “Discovery of Web Robot Sessions based on their Navigational
Patterns”, Data Mining and Knowledge Discovery, 6(1), pp: 9-35, 2002.
[25] P. Nithya, Dr. P. Sumathi “Novel Pre-Processing Technique for Web Log Mining by Removing
Global Noise and Web Robots”, IEEE Conference Publications, 2012.
[26] Renata Ivancsy, Sandor Juhasz, “Analysis of Web User Identification Methods”, World
Academy of Science, Engineering and Technology, pp: 338-345, 2007.

[27] V. Chitraa, A.S.D., "A Survey on Preprocessing Methods for Web Usage Data," (IJCSIS)
International Journal of Computer Science and Information Security, Vol. 7( 3), 2010.
[28] Ma Shuyue, Liu Wencai, Wang Shuo, “The Study on the Preprocessing in Web Log Mining”,
IEEE Conference Publications, pp: 1-5, 2011.
[29] Ben Martin, Peter Eklund, “From Concepts to Concept Lattice: A Border Algorithm for
Making Covers Explicit”, ICFCA 2008, Springer-Verlag Berlin Heidelberg, pp: 78-89, 2008.
[30] Eklund, Peter, Villerd, Jean, “A Survey of Hybrid Representations of Concept Lattices in
Conceptual Knowledge Processing”, Lecture Notes in Computer Science, Springer
Berlin/Heidelberg, pp: 296- 31, 2010.
[31] F. Masseglia, P. Poncelet, M. Teisseire., “Incremental mining of sequential patterns in large
databases”, Data Knowledge Engineering, 46(1), pp: 97-121, 2003.
[32] Graham Cormode, Flip Korn, S. Muthukrishnan, Divesh Srivastava, “Finding Hierarchical
Heavy Hitters in Streaming Data”, ACM Transactions on Database Systems, Vol. V, No. N,
2007.
[33] JL. Pfaltz, CM. Taylor, “Scientific discovery through iterative transformations of concept
lattices”, In Workshop on Discrete Applied Mathematics, in conjunction with the 2nd SIAM
International Conference on Data-Mining, pp: 65–74, 2002.
[34] Li H R, Zhang W X, Wang H., “Classification and reduction of attributes in concept lattices”,
Proc of IEEE International Conference on Granular Computing. Los Alamitos: IEEE Computer
Society, pp: 142-147, 2006.
[35] Shao, M W., “The reduction for two kind of generalized concept lattice”, Proceedings of the
4th International Conference on Machine Learning and Cybernetics, Berlin: Springer,
pp: 2217-2222, 2005.
[36] Show-Jane Yen, Yue-Shi Lee, Chung-Wen Cho, “An Efficient Approach for the Maintenance
of Path Traversal Patterns”, In Proceedings of IEEE International Conference on e-Technology,
e-Commerce and e-Service (EEE), pp: 207-214, 2004.
[37] Valtchev P., Missaoui R., “Building Concept (Galois) Lattices from Parts: Generalizing the
Incremental Methods”, In Proceedings of the 9th International Conference on Conceptual
Structures (ICCS 2001), USA, pp: 290-303, 2001.
[38] Andreas Lubcke, Veit Koppen, and Gunter Saake, “A Decision Model to Select the Optimal
Storage Architecture for Relational Databases”, IEEE Conference Publications, pp: 1-11, 2011.
[39] Yue-Shi Lee, “A Lattice-Based Framework for Interactively and Incrementally Mining Web
Traversal Patterns”, DOI: 10.4018/978-1-59904-990-8, ch027, 2009.
[40] C. Ezeife, Y. Su, “Mining incremental association rules with generalized FP-tree,” in Lecture
Notes in Computer Science, LNCS 2338, Springer- Verlag, pp: 147-160, 2002.
[41] Dalamagas, T., Bouros, P., Galanis, T., Eirinaki, M., Sellis, T.K., “Mining user navigation
patterns for personalizing topic directories”, Proc. 9th ACM International Workshop on Web
Information and Data Management, Lisbon, Portugal, pp: 81-88, 2007.
[42] F.C Tseng, C.C. Hsu, K.S. Fu, “The Frequent Pattern List: Another Framework for Mining
Frequent Patterns,” International Journal of Electronic Business Management, vol. 3, no. 2,
pp: 104-115, Feb, 2005.
[43] Harms, S.K., Deogun, J.S., “Sequential association rule mining with time lags”, Journal of
Intelligent Information Systems, Vol. 22, No. 1, pp: 7-22, 2004.
[44] Huang, Y.M., Kuo, Y-H., Chen, J.N., Jeng, Y.L., “NP-miner: A real-time recommendation
algorithm by using web usage mining”, Knowledge Based Systems, Vol. 19, No. 4,
pp: 272-286, 2006.
[45] J. Fong, H.K. Wong, S.M. Huang, “Continuous and incremental data mining association rules
using frame metadata model”, Knowledge-Based Systems 16, Elsevier, pp: 91-100, 2003.

[46] Johannes K. Chiang, Rui-Han Yang, “Multidimensional Data Mining for Discover Association
Rules in Various Granularities”, IEEE Conference Publications, pp: 1-6, 2013.
[47] J L Balcazar, “Redundancy, Deduction Schemes, and Minimum-Size Bases for Association
Rules” Pascal Report 4259, 2008.
[48] Livingstone, G., Rosenberg, J., Buchanan, B., “An agenda and justification based framework
for discovery systems”, Knowledge and Information Systems 5(2), pp: 133-161, 2003.
[49] Liu Jian, Wang Yan-Qing, “Web Log Data Mining Based on Association Rule”, Eighth
International Conference on Fuzzy Systems and Knowledge Discovery (FSKD),
pp: 1855-1859, 2011.
[50] Priyanka Makkar, Payal Gulati, Dr. A.K. Sharma, “A Novel Approach for Predicting User
Behavior for Improving Web Performance”, International Journal on Computer Science and
Engineering, VOL. 02, No. 04, pp: 1233-1236, 2010.
[51] Przemysław Kazienko, “Mining Indirect Association Rules For Web Recommendation”
International Journal of Applied Mathematics and Computer Science, Vol. 19, No. 1,
pp: 165-186, 2009.
[52] Show-Jane Yen, Yue-Shi Lee, “An efficient data mining approach for discovering interesting
knowledge from customer transactions”, Expert Systems with Applications, Elsevier, pp: 1-8,
2005.
[53] B. Minaei-Bidgoli, William F. Punch, “Using Genetic Algorithms for Data Mining
Optimization in an Educational Web-based System”, http://www.lon-capa.org, 2003.
[54] Diana Martín, Alejandro Rosete, Jesus Alcala-Fdez and Francisco Herrera, “A Multi-
Objective Evolutionary Algorithm for Mining Quantitative Association Rules”, IEEE
Conference Publications, pp: 1397-1402, 2011.
[55] Gaurav Dubey, Arvind Jaiswal, “Identifying Best Association Rules and Their Optimization
Using Genetic Algorithm”, International Journal of Emerging Science and Engineering
(IJESE), Volume-1, Issue-7, pp: 91-96, 2013.
[56] H. K. Tsai, J. M. Yang, C. Y. Kao., “Applying genetic algorithms to finding the optimal Gene
order in displaying the microarray data”, In Proceedings of the Genetic and Evolutionary
Computation Conference (GECCO), pp: 610-617, 2002.
[57] Hyunchul Ahn, Kyoung-jae Kim, “Bankruptcy prediction modeling with hybrid case-based
reasoning and genetic algorithms approach”, Applied Soft Computing, Volume 9, Issue 2,
pp: 599–607, 2009.
[58] Mehmet Kaya, “Automated extraction of extended structured motifs using multi-objective
genetic algorithm” Expert Systems with Applications, Volume 37, Issue 3, pp: 2421-2426,
2010.
[59] S. Ventura, C. Romero, A. Zafra, J. A. Delgado, C. Hervas, “JCLEC: A Java framework for
evolutionary computation”, Soft Computing, vol. 4, no. 12, pp: 381–392, 2008.
[60] S. Y. Wang, K. Tai. “Structural topology design optimization using genetic algorithms with a
bit-array representation”, Computer Methods in Applied Mechanics and Engineering, 194, pp:
3749-3770, 2005.
[61] S. Y. Wang, K. Tai, M. Y. Wang. “An enhanced genetic algorithm for structural topology
optimization”, International Journal for Numerical Methods in Engineering, 65, pp: 18-44,
2006.
[62] Xiaoyan Sun, Lei Yang, Dunwei Gong and Ming Li, “Interactive Genetic Algorithm Assisted
with Collective Intelligence from Group Decision Making”, IEEE World Congress on
Computational Intelligence, pp: 1-8, 2012.
[63] Brijs, T., Vanhoof, K., Wets, G., “Defining interestingness for association rules”, International
Journal of Information Theories and Applications, 10(4), pp: 370–376, 2003.

[64] Furnkranz, J., Flach, P. A., “ROC ‘n’ rule learning: Towards a better understanding of covering
algorithms” Mach. Learn. 58, (1), pp: 39–77, 2005.
[65] Heng-Soon Gan, Andrew Wirth, “Heuristic stability: A permutation disarray measure”,
Computers & Operations Research, Volume 34, Issue 11, pp: 3187-3208, 2007.
[66] Hilderman, R. J., Hamilton, H. J., “Evaluation of interestingness measures for ranking
discovered knowledge”, Lecture Notes in Computer Science 2035, pp: 247–259, 2001.
[67] Izwan Nizal Mohd Shaharanee, Fedja Hadzic, Tharam S. Dillon, “Interestingness measures for
association rules based on statistical validity”, Knowledge-Based Systems, pp: 386–392, 2011.
[68] Jaroszewicz, S., Simovici, D. A., “Interestingness of frequent itemsets using Bayesian networks
as background knowledge”, in Proceedings of the 2004 ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, Seattle, WA, USA, pp: 178–186, 2004.
[69] Keim, D.A., “Information visualization and visual data mining”, IEEE Transactions On
Visualization and Computer Graphics 7, pp: 100–107, 2002.
[70] Michael Friendly, “Milestones in the history of thematic cartography, statistical graphics, and
data visualization”, 2009.
[71] Padmanabhan, B., Tuzhilin, A., “On characterization and discovery of minimal unexpected
patterns in rule discovery”, IEEE Transactions on Knowledge and Data Engineering, Vol. 18,
No. 2, pp: 202–216, 2006.
[72] Vitaly Friedman, “Data Visualization and Infographics” in: Graphics, Monday Inspiration,
January 14th, 2008.
[73] Ravita Mishra, “Web Usage Mining Contextual Factor: Human Information Behavior”,
International Journal of Information Technology and Management Information Systems
(IJITMIS), Volume 5, Issue 1, 2014, pp. 12 - 29, ISSN Print: 0976 – 6405, ISSN Online:
0976 – 6413.
[74] Suresh Subramanian and Dr. Sivaprakasam, “Genetic Algorithm with a Ranking Based
Objective Function and Inverse Index Representation for Web Data Mining”, International
Journal of Computer Engineering & Technology (IJCET), Volume 4, Issue 5, 2013, pp. 84 - 90,
ISSN Print: 0976 – 6367, ISSN Online: 0976 – 6375.
[75] Jaykumar Jagani and Prof. Kamlesh Patel, “An Enhanced Algorithm for Classification of Web
Data for Web Usage Mining using Supervised Neural Network Algorithms”, International
Journal of Computer Engineering & Technology (IJCET), Volume 5, Issue 4, 2014, pp. 48 - 56,
ISSN Print: 0976 – 6367, ISSN Online: 0976 – 6375.
