SlideShare une entreprise Scribd logo
1  sur  6
International Journal of Engineering Inventions
e-ISSN: 2278-7461, p-ISSN: 2319-6491
Volume 2, Issue 5 (March 2013) PP: 16-21
www.ijeijournal.com Page | 16
Web Mining Patterns Discovery and Analysis Using Custom-Built
Apriori Algorithm
Latheefa.V1
, Rohini.V2
1, 2
Department of Computer Science, Bangalore, India,
Christ University,
Abstract: Mining web data in order to extract useful knowledge from it has become vital with the wide usage of
the World Wide Web. Web Usage Mining is the application of data mining techniques to discover interesting
usage patterns from web data. In this paper, a Custom-Built Apriori Algorithm is proposed for the discovery of
frequent patterns in the web log data. This study has also developed a tool to discover the frequent patterns,
association rules in the web log data. In this study, Web log data of Christ University web site is analysed. The
experiments conducted in this study proved that Custom-Built Apriori Algorithm is efficient than Classical
Apriori Algorithm as it takes less time.
Key words: Web Usage Mining, Custom-Built Apriori, FrequentPatterns, Association Rules.
I. Introduction
The web is an open medium. With this openness web has grown enormously. One can find any
information on web which is its main strength. As web and its usage continue to grow, there is an opportunity to
mine the web data and extract useful knowledge from it. Web mining is the application of data mining
techniques to find interesting and potentially useful knowledge from the web data. Web mining can be broadly
divided into 3 distinct categories based on the kinds of data to be mined: 1.Web Content Mining is the process
of extracting useful information from the contents of web documents. 2. Web Structure Mining can be regarded
as the process of discovering structure information from the web.3.Web Usage Mining is understanding user
behaviour in interacting with the web or with a web site[1]. One of the aims is to obtain information that may
assist web site reorganization or assist site adaptation to better suit the user needs.
The applications of Web Usage Mining are [2]:
 By determining the access behaviour of users, needed links can be identified to improve the overall
performance of future accesses
 Web usage patterns are used to gather business intelligence to improve customer attraction, and sales.
 Personalization for a user can be achieved by keeping track of previously accessed pages. These pages can
be identified to improve the overall performance of future accesses.
In this paper, web log file of size 70MB is preprocessed and thenanalysed. The contents of the paper are
organized as follows: Section IIbriefs the structure of web log file, Section III briefs the methodology, Section
IV discusses the experiment results and finally conclusion and future work are mentioned in Section V.
II. Web log file
A web log file is a file to which the webserver writes information each time a user requests a resource from that
particular website[3]. Web usage information takes the form of web server log files, or web logs. For each
request from a user’s browser to a web server, a response is generated automatically. This response takes the
form of a simple single-line transaction record that is appended to an ASCII text file on the web server. This file
is called as web server log.Figure 1 shows the fragment from webserver log of Christ University website.
115.119.146.168 - - [14/Dec/2011:16:37:59 +0530] "GET /block/block.html HTTP/1.1" 304 - "-" "Mozilla/5.0
(Windows NT 5.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1"122.167.80.114 - - [14/Dec/2011:16:37:59 +0530]
"GET /webadmin/depgallery/_654.jpeg HTTP/1.1" 200 283307
"http://www.christuniversity.in/Management%20Studies/deptcourses.php?division=Commerceand
Management&dept=23" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0;
SearchToolbar 1.2; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; MDDC;)
Fig.1 Fragment of web log file
A. Common Log Format
Web Mining Patterns Discovery and Analysis Using Custom-Built Apriori Algorithm
www.ijeijournal.com Page | 17
Web logs come in various formats, which vary depending on the configuration of the web server. The common
log format is supported by a variety of web server applications and includes the following seven fields:
 Remote host field
 Identification field
 Auth-user field
 Date/time field
 HTTP request
 Status code field
 Transfer volume field
III. Methodology
The methodology contains the following steps:
 Data Pre-processing
 Developing custom built apriori algorithm
 Identifying web patterns using custom built apriori algorithm
 Analysing the discovered web patterns.
 Developing a standalone application for web mining.
The general flow of project methodology is diagrammatically represented in the figure 2.
Fig.2Methodology flow
A. Data Preprocessing
An important task in any data mining application is the creation of a suitable target data set to which data
mining algorithms can be applied. This is particularly important in web usage mining due to characteristics of
clickstream data and its relationship to other related data collected from multiple sources and across multiple
channels.
In this analysis WUMPrep tool is used for web log data preparation. WUMPrep contains a set of Perl scripts for
cleaning web log file of irrelevant and automatic requests and creating sessions in it[3]. In this study web log
data is pre-processed to make it suitable for mining. Data preprocessing consists of sequence of following tasks:
 Data Cleaning.
 Sessionization.
 Robot Detection.
 Converting Web log File to ARFF (Attribute Relation File Format).
B. Further Data Preprocessing
In this process the requests that are made by the users are replaced with the folders to which they belong. This
process is implemented by a java application developed in this thesis. The fragment of weblog file generated by
the java application is shown in the figure 3
241393:19|180.215.27.123 - - [14/Dec/2011:16:38:41 +0530] "GET Commerce HTTP/1.1" 200 9589
241393:28|117.230.171.53 - - [14/Dec/2011:16:38:53 +0530] "GET Psychology HTTP/1.1" 200 11991
241393:29|59.145.142.36 - - [14/Dec/2011:16:39:27 +0530] "GET School%20of%20Education HTTP/1.1"
200 8140
Fig.3 Folder request
C. Web log to ARFF
The proposed apriori algorithm accepts input files in a format called ARFF (Attribute Relation File Format).
An ARFF (Attribute-Relation File Format) file is an ASCII text file that describes a list of instances sharing a
set of attributes. ARFF files have two distinct sections. The first section is the Header information, which is
followed the Data information. Lines that begin with a % are comments.
Web Mining Patterns Discovery and Analysis Using Custom-Built Apriori Algorithm
www.ijeijournal.com Page | 18
The Header of the ARFF file contains the name of the relation, a list of the attributes (the columns in the data),
and their types. The Header contains three parts @RELATION, @ATTRIBUTE and @DATA. All these
declarations are case insensitive.
The relation name is defined as the first line in the ARFF file. The format is:
@relation <relation-name>
Attribute declarations take the form of an ordered sequence of @attribute statements. Each attribute in the data
set has its own @attribute statement which uniquely defines the name of that attribute and its data type. The
order the attributes are declared indicates the column position in the data section of the file. The format for the
@attribute statement is:
@attribute <attribute-name><data type>
where the <attribute-name>must start with an alphabetic character,
<data type > is Boolean in the analysis.
The ARFF Data section of the file contains the data declaration line and the actual instance lines.
The @data declaration is a single line denoting the start of the data segment in the file. The format is: @data
Each instance is represented on a single line, with carriage returns denoting the end of the instance.
{ 0 t, 1 t, 2 t, 3 t }
There are two possible representatives of data in the Arff file – dense and sparse. In both formats a web page
must be considered as a binary attribute that takes the values true or false in each transaction, depending on
whether the page occurs in the transaction or not. Since the occurrence of web pages in transactions (user
sessions) is scarce, the sparse format is chosen for presenting the sessions in weblog data. Since WUMPrep does
not contain any script that converts web log data into either sparse or dense Arff file format, a utility application
is developed in java for this purpose. The code for this application is found in appendix. A fragment of Arff file
that is generated from this application is shown in fig 4.
@relation christweblog
@attribute Biotechnology {f,t}
@attribute Botany {f,t}
@attribute Chemistry {f,t}
@attribute Economics {f,t}
@data
{ 0 t, 1 t } { 0 t }
{ 0 t }
{ 0 t, 1 t, 2 t, 3 t }
{ 2 t, 3 t }
Fig.4 web log to ARFF
D. Apriori Algorithm
Apriori algorithm is proposed by R.Agrawal and R.Srikant for finding the association rules. This algorithm can
be considered to consist of two parts. In the first part, those itemsets that exceed the minimum support
requirement are found. Such itemsets are called frequent itemsets. In the second part, the association rules that
meet the minimum confidence requirement are found from the frequent itemsets[4].
Problems with Apriori Algorithm
 The number of candidate itemsets grows quickly and can result in huge candidate sets.
 The Apriori algorithm requires many scans of the database. If n is the length of the longest itemset, then
(n+1) scans are required.
 Many trivial rules are derived and it can often be difficult to extract the most interesting rules from all the
rules derived.
The Apriori algorithm assumes sparsity since the number of items in each transaction is small compared with
the total number of items. The algorithm works better with sparsity. Some applications produce dense data (i.e.
many items in each transaction) which may also have many frequently occurring items. Apriori is likely to be
inefficient for such applications.
E. Custom-Built Apriori Algorithm
In this research custom-built apriori is proposed to improve the performance of the Apriori algorithm. The
changes done in the proposed algorithm are
 Pruning operation is performed only on the candidate itemsets whose size > 2.
 For generation of frequent itemsets of size k only the transactions whose size >= k are considered.
Web Mining Patterns Discovery and Analysis Using Custom-Built Apriori Algorithm
www.ijeijournal.com Page | 19
Custom-Built Apriori Algorithm:-
Input:
 D, a database of transactions
 Minimum support count threshold
Output: L, Frequent itemsets in D
Method:
1. C1 := All the 1-item sets
2. Read the database to count the support of C1 to determine L1
3. L1:={frequent 1-itemsets}
4. K:=2
5. While(Lk-1 ≠ ϕ) do
begin
5.1 Ck:=gen_candidate_itemsets with the given Lk-1
5.2 If(k>2) then
5.3 Prune(Ck)
5.4 For all candidates in Ck do
5.5 count the number of transactions of at least length k that are common in each item ϕCk
5.6 Lk:=All candidates in Ck with minimum support
5.7 k:=k+1
end
6. Answer:=UkLk
F. Web Usage Association Rule Discovery Software
For the purpose of this research, an independent windows desktop application for the association rule
discovery in the web log data is developed. The application is implemented using java programming language.
It is an object oriented application specialized for the discovery of association rules in web log data, rather than
in the general relational database data. Figure 4.shows the user interface of the software.
When an input file containing web log usage data is opened, the web log data is transformed into
directories mentioned in the properties file. This file will be converted into ARFF format when the user clicks
on the generateARFFFile button. Once the ARFF is prepared, the web log data is ready for execution and obtain
the association rules. For Rules to obtain there is a need to satisfy constraints as support and confidence which
can be entered in the text fields.
Fig. 5 Custom built software to analyse association rules based on support and confidence
Fig. 6 Association Rules Generated By Custom-Built Apriori
IV. Results and Discussion
The analysis has been conducted on the web log data of size ~70MB. The first step in the web log mining
process is to clean web log data of irrelevant and automatic web requests prior to running association rule
discovery algorithm, as well as to group web requests into visitor sessions. The WUM Prep tool has been used
for web log data.
Web Mining Patterns Discovery and Analysis Using Custom-Built Apriori Algorithm
www.ijeijournal.com Page | 20
Perl Script Resultant
File Size
% of File
Reduction
logFilter.pl 6230KB 89
sessionized.pl 7550KB 79
detectRobots.pl 5858KB 83
Fig. 7WUMPrepTool Statistics
A. Itemset Level Generations
In Apriori Algorithm, the number of Candidate itemsets grows quickly and can result in huge candidate sets.
The proposed apriori algorithm will reduce the number of transactions that need to be verified at each level of
large item set generation.
Fig. 8 Transactions compared at each level of Large Item generation
A number of tests have been conducted with various support levels to find the range of support levels in which
the satisfactory and meaningful association rules are generated.
Fig 9 Large Items Generation Based On Support Level
Fig.10Behaviour of Accepted candidate Items over different Supports
V. Conclusion
The analysis was performed with an objective of knowledge discovery in web data by applying data mining
techniques. The analysis was carried out in four steps: data collection, data preprocessing, pattern discovery
using custom-built apriori algorithm, pattern analysis.
In this analysis Custom-built apriori algorithm is developed. The proposed algorithm has successfully
discovered the frequent patterns from web data.A tool has been developed in java for implementing the custom-
built apriori algorithm.This tool is specialized for the discovery of association rules in web log data rather than
in the general relational database data.For converting preprocessed web log data to ARFF java program has been
Web Mining Patterns Discovery and Analysis Using Custom-Built Apriori Algorithm
www.ijeijournal.com Page | 21
developed.The generated patterns are analysed by executing the custom-built apriori algorithm with different
support and confidence values
A.Future Work: This analysis can be extended by implementing other interesting measures such as lift in order
to fine tune the results. The proposed work can also be implemented by the adoption of more advanced
techniques/tools such as fuzzy associations, rule mining algorithm using the interesting measures such as fuzzy
support, fuzzy confidence, fuzzy interest and fuzzy conviction etc.
References
[1] R. Cooley, B. Mobasher and J. Srivastava, "Web Mining: Information and Pattern Discovery on the World Wide Web," IEEE,
1997.
[2] K. R. Suneetha and Dr. R. Krishnamoorthi, "Identifying User Behavior by Analyzing Web ServerAccess Log File,
" International Journal of Computer Science and Network Security, , vol. 9, no. 4, 2009.
[3] M. Dimitrijevic and Z. Bošnjak, "Discovering Interesting Association Rules in the Web Log Usage Data," Interdisciplinary Journal
of Information, Knowledge, and Management, vol. 5, 2010.
[4] G.K.Gupta, Introduction to Data Mining with Case Studies, New Delhi: EEE PHI , 2011.
[5] C. A. R. C. Goswami D.N., "An Algorithm for Frequent Pattern Mining Based On Apriori," International Journal on Computer
Science and Engineering, vol. 02, 2010.
[6] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules," in 20th Int. Conf. Very Large Data Bases.
[7] Han and Kamber, Data Mining Concepts and Techniques, Secoond ed., New Delhi: Elsevier, 2006.

Contenu connexe

Tendances

A Novel Approach for Web Page Set Mining
A Novel Approach for Web Page Set Mining  A Novel Approach for Web Page Set Mining
A Novel Approach for Web Page Set Mining dannyijwest
 
XML Retrieval: A Survey
XML Retrieval: A SurveyXML Retrieval: A Survey
XML Retrieval: A Surveyijceronline
 
tutorial2.docx - Fedora Tutorial
tutorial2.docx - Fedora Tutorialtutorial2.docx - Fedora Tutorial
tutorial2.docx - Fedora Tutorialbutest
 
Paper id 25201463
Paper id 25201463Paper id 25201463
Paper id 25201463IJRAT
 
An Efficient Annotation of Search Results Based on Feature Ranking Approach f...
An Efficient Annotation of Search Results Based on Feature Ranking Approach f...An Efficient Annotation of Search Results Based on Feature Ranking Approach f...
An Efficient Annotation of Search Results Based on Feature Ranking Approach f...Computer Science Journals
 
Identifying the Number of Visitors to improve Website Usability from Educatio...
Identifying the Number of Visitors to improve Website Usability from Educatio...Identifying the Number of Visitors to improve Website Usability from Educatio...
Identifying the Number of Visitors to improve Website Usability from Educatio...Editor IJCATR
 
WEB LOG PREPROCESSING BASED ON PARTIAL ANCESTRAL GRAPH TECHNIQUE FOR SESSION ...
WEB LOG PREPROCESSING BASED ON PARTIAL ANCESTRAL GRAPH TECHNIQUE FOR SESSION ...WEB LOG PREPROCESSING BASED ON PARTIAL ANCESTRAL GRAPH TECHNIQUE FOR SESSION ...
WEB LOG PREPROCESSING BASED ON PARTIAL ANCESTRAL GRAPH TECHNIQUE FOR SESSION ...cscpconf
 
web technology and soical networking
web technology and soical networking web technology and soical networking
web technology and soical networking Vijay Bansal
 
Migration of application schema to windows azure
Migration of application schema to windows azureMigration of application schema to windows azure
Migration of application schema to windows azureeSAT Publishing House
 
MULTIFACTOR NAÏVE BAYES CLASSIFICATION FOR THE SLOW LEARNER PREDICTION OVER M...
MULTIFACTOR NAÏVE BAYES CLASSIFICATION FOR THE SLOW LEARNER PREDICTION OVER M...MULTIFACTOR NAÏVE BAYES CLASSIFICATION FOR THE SLOW LEARNER PREDICTION OVER M...
MULTIFACTOR NAÏVE BAYES CLASSIFICATION FOR THE SLOW LEARNER PREDICTION OVER M...ijcsa
 
Implemenation of Enhancing Information Retrieval Using Integration of Invisib...
Implemenation of Enhancing Information Retrieval Using Integration of Invisib...Implemenation of Enhancing Information Retrieval Using Integration of Invisib...
Implemenation of Enhancing Information Retrieval Using Integration of Invisib...iosrjce
 
A Survey on Web Page Recommendation and Data Preprocessing
A Survey on Web Page Recommendation and Data PreprocessingA Survey on Web Page Recommendation and Data Preprocessing
A Survey on Web Page Recommendation and Data PreprocessingIJCERT
 
The Linked Data Lifecycle
The Linked Data LifecycleThe Linked Data Lifecycle
The Linked Data Lifecyclegeoknow
 

Tendances (16)

320 324
320 324320 324
320 324
 
A Novel Approach for Web Page Set Mining
A Novel Approach for Web Page Set Mining  A Novel Approach for Web Page Set Mining
A Novel Approach for Web Page Set Mining
 
XML Retrieval: A Survey
XML Retrieval: A SurveyXML Retrieval: A Survey
XML Retrieval: A Survey
 
Sree saranya
Sree saranyaSree saranya
Sree saranya
 
tutorial2.docx - Fedora Tutorial
tutorial2.docx - Fedora Tutorialtutorial2.docx - Fedora Tutorial
tutorial2.docx - Fedora Tutorial
 
Paper id 25201463
Paper id 25201463Paper id 25201463
Paper id 25201463
 
An Efficient Annotation of Search Results Based on Feature Ranking Approach f...
An Efficient Annotation of Search Results Based on Feature Ranking Approach f...An Efficient Annotation of Search Results Based on Feature Ranking Approach f...
An Efficient Annotation of Search Results Based on Feature Ranking Approach f...
 
Identifying the Number of Visitors to improve Website Usability from Educatio...
Identifying the Number of Visitors to improve Website Usability from Educatio...Identifying the Number of Visitors to improve Website Usability from Educatio...
Identifying the Number of Visitors to improve Website Usability from Educatio...
 
WEB LOG PREPROCESSING BASED ON PARTIAL ANCESTRAL GRAPH TECHNIQUE FOR SESSION ...
WEB LOG PREPROCESSING BASED ON PARTIAL ANCESTRAL GRAPH TECHNIQUE FOR SESSION ...WEB LOG PREPROCESSING BASED ON PARTIAL ANCESTRAL GRAPH TECHNIQUE FOR SESSION ...
WEB LOG PREPROCESSING BASED ON PARTIAL ANCESTRAL GRAPH TECHNIQUE FOR SESSION ...
 
web technology and soical networking
web technology and soical networking web technology and soical networking
web technology and soical networking
 
Migration of application schema to windows azure
Migration of application schema to windows azureMigration of application schema to windows azure
Migration of application schema to windows azure
 
MULTIFACTOR NAÏVE BAYES CLASSIFICATION FOR THE SLOW LEARNER PREDICTION OVER M...
MULTIFACTOR NAÏVE BAYES CLASSIFICATION FOR THE SLOW LEARNER PREDICTION OVER M...MULTIFACTOR NAÏVE BAYES CLASSIFICATION FOR THE SLOW LEARNER PREDICTION OVER M...
MULTIFACTOR NAÏVE BAYES CLASSIFICATION FOR THE SLOW LEARNER PREDICTION OVER M...
 
Implemenation of Enhancing Information Retrieval Using Integration of Invisib...
Implemenation of Enhancing Information Retrieval Using Integration of Invisib...Implemenation of Enhancing Information Retrieval Using Integration of Invisib...
Implemenation of Enhancing Information Retrieval Using Integration of Invisib...
 
A Survey on Web Page Recommendation and Data Preprocessing
A Survey on Web Page Recommendation and Data PreprocessingA Survey on Web Page Recommendation and Data Preprocessing
A Survey on Web Page Recommendation and Data Preprocessing
 
The Linked Data Lifecycle
The Linked Data LifecycleThe Linked Data Lifecycle
The Linked Data Lifecycle
 
Chemread – a chemical informant
Chemread – a chemical informantChemread – a chemical informant
Chemread – a chemical informant
 

En vedette

ArccoMagazine - Nº5 Junio 2013
ArccoMagazine - Nº5 Junio 2013ArccoMagazine - Nº5 Junio 2013
ArccoMagazine - Nº5 Junio 2013Arcco Psicología
 

En vedette (9)

B02010711
B02010711B02010711
B02010711
 
Characteristics Optimization of Different Welding Processes on Duplex Stainle...
Characteristics Optimization of Different Welding Processes on Duplex Stainle...Characteristics Optimization of Different Welding Processes on Duplex Stainle...
Characteristics Optimization of Different Welding Processes on Duplex Stainle...
 
Fragrance trends for 2011
Fragrance trends for 2011Fragrance trends for 2011
Fragrance trends for 2011
 
Equirs: Explicitly Query Understanding Information Retrieval System Based on Hmm
Equirs: Explicitly Query Understanding Information Retrieval System Based on HmmEquirs: Explicitly Query Understanding Information Retrieval System Based on Hmm
Equirs: Explicitly Query Understanding Information Retrieval System Based on Hmm
 
ArccoMagazine - Nº5 Junio 2013
ArccoMagazine - Nº5 Junio 2013ArccoMagazine - Nº5 Junio 2013
ArccoMagazine - Nº5 Junio 2013
 
Maximum Magnitudes and Accelerations Determination in the Rif Mountain Belt, ...
Maximum Magnitudes and Accelerations Determination in the Rif Mountain Belt, ...Maximum Magnitudes and Accelerations Determination in the Rif Mountain Belt, ...
Maximum Magnitudes and Accelerations Determination in the Rif Mountain Belt, ...
 
On the Zeros of Analytic Functions inside the Unit Disk
On the Zeros of Analytic Functions inside the Unit DiskOn the Zeros of Analytic Functions inside the Unit Disk
On the Zeros of Analytic Functions inside the Unit Disk
 
Performance Evaluation of Coded Adaptive OFDM System Over AWGN Channel
Performance Evaluation of Coded Adaptive OFDM System Over AWGN ChannelPerformance Evaluation of Coded Adaptive OFDM System Over AWGN Channel
Performance Evaluation of Coded Adaptive OFDM System Over AWGN Channel
 
On The Efficacy of Activated Carbon Derived From Bamboo in the Adsorption of ...
On The Efficacy of Activated Carbon Derived From Bamboo in the Adsorption of ...On The Efficacy of Activated Carbon Derived From Bamboo in the Adsorption of ...
On The Efficacy of Activated Carbon Derived From Bamboo in the Adsorption of ...
 

Similaire à Web Mining Patterns Discovery and Analysis Using Custom-Built Apriori Algorithm

A Novel Method for Data Cleaning and User- Session Identification for Web Mining
A Novel Method for Data Cleaning and User- Session Identification for Web MiningA Novel Method for Data Cleaning and User- Session Identification for Web Mining
A Novel Method for Data Cleaning and User- Session Identification for Web MiningIJMER
 
A Novel Data Extraction and Alignment Method for Web Databases
A Novel Data Extraction and Alignment Method for Web DatabasesA Novel Data Extraction and Alignment Method for Web Databases
A Novel Data Extraction and Alignment Method for Web DatabasesIJMER
 
Web personalization using clustering of web usage data
Web personalization using clustering of web usage dataWeb personalization using clustering of web usage data
Web personalization using clustering of web usage dataijfcstjournal
 
Recommendation Based On Comparative Analysis of Apriori and BW-Mine Algorithm
Recommendation Based On Comparative Analysis of Apriori and BW-Mine AlgorithmRecommendation Based On Comparative Analysis of Apriori and BW-Mine Algorithm
Recommendation Based On Comparative Analysis of Apriori and BW-Mine AlgorithmIJAEMSJORNAL
 
Review on an automatic extraction of educational digital objects and metadata...
Review on an automatic extraction of educational digital objects and metadata...Review on an automatic extraction of educational digital objects and metadata...
Review on an automatic extraction of educational digital objects and metadata...IRJET Journal
 
IDENTIFY NAVIGATIONAL PATTERNS OF WEB USERS
IDENTIFY NAVIGATIONAL PATTERNS OF WEB USERS IDENTIFY NAVIGATIONAL PATTERNS OF WEB USERS
IDENTIFY NAVIGATIONAL PATTERNS OF WEB USERS ijcax
 
IDENTIFY NAVIGATIONAL PATTERNS OF WEB USERS
IDENTIFY NAVIGATIONAL PATTERNS OF WEB USERS IDENTIFY NAVIGATIONAL PATTERNS OF WEB USERS
IDENTIFY NAVIGATIONAL PATTERNS OF WEB USERS ijcax
 
IDENTIFY NAVIGATIONAL PATTERNS OF WEB USERS
IDENTIFY NAVIGATIONAL PATTERNS OF WEB USERS IDENTIFY NAVIGATIONAL PATTERNS OF WEB USERS
IDENTIFY NAVIGATIONAL PATTERNS OF WEB USERS ijcax
 
A Comparative Study of Recommendation System Using Web Usage Mining
A Comparative Study of Recommendation System Using Web Usage Mining A Comparative Study of Recommendation System Using Web Usage Mining
A Comparative Study of Recommendation System Using Web Usage Mining Editor IJMTER
 
IRJET- On-AIR Based Information Retrieval System for Semi-Structure Data
IRJET-  	  On-AIR Based Information Retrieval System for Semi-Structure DataIRJET-  	  On-AIR Based Information Retrieval System for Semi-Structure Data
IRJET- On-AIR Based Information Retrieval System for Semi-Structure DataIRJET Journal
 
Multikeyword Hunt on Progressive Graphs
Multikeyword Hunt on Progressive GraphsMultikeyword Hunt on Progressive Graphs
Multikeyword Hunt on Progressive GraphsIRJET Journal
 
An Implementation of a New Framework for Automatic Generation of Ontology and...
An Implementation of a New Framework for Automatic Generation of Ontology and...An Implementation of a New Framework for Automatic Generation of Ontology and...
An Implementation of a New Framework for Automatic Generation of Ontology and...IJCSIS Research Publications
 
Web Content Mining Based on Dom Intersection and Visual Features Concept
Web Content Mining Based on Dom Intersection and Visual Features ConceptWeb Content Mining Based on Dom Intersection and Visual Features Concept
Web Content Mining Based on Dom Intersection and Visual Features Conceptijceronline
 
IRJET- Classification of Pattern Storage System and Analysis of Online Shoppi...
IRJET- Classification of Pattern Storage System and Analysis of Online Shoppi...IRJET- Classification of Pattern Storage System and Analysis of Online Shoppi...
IRJET- Classification of Pattern Storage System and Analysis of Online Shoppi...IRJET Journal
 
Hardware enhanced association rule mining
Hardware enhanced association rule miningHardware enhanced association rule mining
Hardware enhanced association rule miningStudsPlanet.com
 

Similaire à Web Mining Patterns Discovery and Analysis Using Custom-Built Apriori Algorithm (20)

A Novel Method for Data Cleaning and User- Session Identification for Web Mining
A Novel Method for Data Cleaning and User- Session Identification for Web MiningA Novel Method for Data Cleaning and User- Session Identification for Web Mining
A Novel Method for Data Cleaning and User- Session Identification for Web Mining
 
L017418893
L017418893L017418893
L017418893
 
A Novel Data Extraction and Alignment Method for Web Databases
A Novel Data Extraction and Alignment Method for Web DatabasesA Novel Data Extraction and Alignment Method for Web Databases
A Novel Data Extraction and Alignment Method for Web Databases
 
Web personalization using clustering of web usage data
Web personalization using clustering of web usage dataWeb personalization using clustering of web usage data
Web personalization using clustering of web usage data
 
G017334248
G017334248G017334248
G017334248
 
Recommendation Based On Comparative Analysis of Apriori and BW-Mine Algorithm
Recommendation Based On Comparative Analysis of Apriori and BW-Mine AlgorithmRecommendation Based On Comparative Analysis of Apriori and BW-Mine Algorithm
Recommendation Based On Comparative Analysis of Apriori and BW-Mine Algorithm
 
Review on an automatic extraction of educational digital objects and metadata...
Review on an automatic extraction of educational digital objects and metadata...Review on an automatic extraction of educational digital objects and metadata...
Review on an automatic extraction of educational digital objects and metadata...
 
By
ByBy
By
 
IDENTIFY NAVIGATIONAL PATTERNS OF WEB USERS
IDENTIFY NAVIGATIONAL PATTERNS OF WEB USERS IDENTIFY NAVIGATIONAL PATTERNS OF WEB USERS
IDENTIFY NAVIGATIONAL PATTERNS OF WEB USERS
 
IDENTIFY NAVIGATIONAL PATTERNS OF WEB USERS
IDENTIFY NAVIGATIONAL PATTERNS OF WEB USERS IDENTIFY NAVIGATIONAL PATTERNS OF WEB USERS
IDENTIFY NAVIGATIONAL PATTERNS OF WEB USERS
 
IDENTIFY NAVIGATIONAL PATTERNS OF WEB USERS
IDENTIFY NAVIGATIONAL PATTERNS OF WEB USERS IDENTIFY NAVIGATIONAL PATTERNS OF WEB USERS
IDENTIFY NAVIGATIONAL PATTERNS OF WEB USERS
 
50120140505007
5012014050500750120140505007
50120140505007
 
A Comparative Study of Recommendation System Using Web Usage Mining
A Comparative Study of Recommendation System Using Web Usage Mining A Comparative Study of Recommendation System Using Web Usage Mining
A Comparative Study of Recommendation System Using Web Usage Mining
 
EXECUTION OF ASSOCIATION RULE MINING WITH DATA GRIDS IN WEKA 3.8
EXECUTION OF ASSOCIATION RULE MINING WITH DATA GRIDS IN WEKA 3.8EXECUTION OF ASSOCIATION RULE MINING WITH DATA GRIDS IN WEKA 3.8
EXECUTION OF ASSOCIATION RULE MINING WITH DATA GRIDS IN WEKA 3.8
 
IRJET- On-AIR Based Information Retrieval System for Semi-Structure Data
IRJET-  	  On-AIR Based Information Retrieval System for Semi-Structure DataIRJET-  	  On-AIR Based Information Retrieval System for Semi-Structure Data
IRJET- On-AIR Based Information Retrieval System for Semi-Structure Data
 
Multikeyword Hunt on Progressive Graphs
Multikeyword Hunt on Progressive GraphsMultikeyword Hunt on Progressive Graphs
Multikeyword Hunt on Progressive Graphs
 
An Implementation of a New Framework for Automatic Generation of Ontology and...
An Implementation of a New Framework for Automatic Generation of Ontology and...An Implementation of a New Framework for Automatic Generation of Ontology and...
An Implementation of a New Framework for Automatic Generation of Ontology and...
 
Web Content Mining Based on Dom Intersection and Visual Features Concept
Web Content Mining Based on Dom Intersection and Visual Features ConceptWeb Content Mining Based on Dom Intersection and Visual Features Concept
Web Content Mining Based on Dom Intersection and Visual Features Concept
 
IRJET- Classification of Pattern Storage System and Analysis of Online Shoppi...
IRJET- Classification of Pattern Storage System and Analysis of Online Shoppi...IRJET- Classification of Pattern Storage System and Analysis of Online Shoppi...
IRJET- Classification of Pattern Storage System and Analysis of Online Shoppi...
 
Hardware enhanced association rule mining
Hardware enhanced association rule miningHardware enhanced association rule mining
Hardware enhanced association rule mining
 

Plus de International Journal of Engineering Inventions www.ijeijournal.com

Plus de International Journal of Engineering Inventions www.ijeijournal.com (20)

H04124548
H04124548H04124548
H04124548
 
G04123844
G04123844G04123844
G04123844
 
F04123137
F04123137F04123137
F04123137
 
E04122330
E04122330E04122330
E04122330
 
C04121115
C04121115C04121115
C04121115
 
B04120610
B04120610B04120610
B04120610
 
A04120105
A04120105A04120105
A04120105
 
F04113640
F04113640F04113640
F04113640
 
E04112135
E04112135E04112135
E04112135
 
D04111520
D04111520D04111520
D04111520
 
C04111114
C04111114C04111114
C04111114
 
B04110710
B04110710B04110710
B04110710
 
A04110106
A04110106A04110106
A04110106
 
I04105358
I04105358I04105358
I04105358
 
H04104952
H04104952H04104952
H04104952
 
G04103948
G04103948G04103948
G04103948
 
F04103138
F04103138F04103138
F04103138
 
E04102330
E04102330E04102330
E04102330
 
D04101822
D04101822D04101822
D04101822
 
C04101217
C04101217C04101217
C04101217
 

Dernier

80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...Nguyen Thanh Tu Collection
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxDenish Jangid
 
Plant propagation: Sexual and Asexual propapagation.pptx
Plant propagation: Sexual and Asexual propapagation.pptxPlant propagation: Sexual and Asexual propapagation.pptx
Plant propagation: Sexual and Asexual propapagation.pptxUmeshTimilsina1
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.pptRamjanShidvankar
 
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...pradhanghanshyam7136
 
How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17Celine George
 
ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.MaryamAhmad92
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxheathfieldcps1
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfPoh-Sun Goh
 
Wellbeing inclusion and digital dystopias.pptx
Wellbeing inclusion and digital dystopias.pptxWellbeing inclusion and digital dystopias.pptx
Wellbeing inclusion and digital dystopias.pptxJisc
 
Single or Multiple melodic lines structure
Single or Multiple melodic lines structureSingle or Multiple melodic lines structure
Single or Multiple melodic lines structuredhanjurrannsibayan2
 
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...Pooja Bhuva
 
Towards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptxTowards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptxJisc
 
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptxHMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptxmarlenawright1
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.christianmathematics
 
COMMUNICATING NEGATIVE NEWS - APPROACHES .pptx
COMMUNICATING NEGATIVE NEWS - APPROACHES .pptxCOMMUNICATING NEGATIVE NEWS - APPROACHES .pptx
COMMUNICATING NEGATIVE NEWS - APPROACHES .pptxannathomasp01
 
How to Add New Custom Addons Path in Odoo 17
How to Add New Custom Addons Path in Odoo 17How to Add New Custom Addons Path in Odoo 17
How to Add New Custom Addons Path in Odoo 17Celine George
 
Graduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - EnglishGraduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - Englishneillewis46
 
Python Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxPython Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxRamakrishna Reddy Bijjam
 
How to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POSHow to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POSCeline George
 

Dernier (20)

80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
 
Plant propagation: Sexual and Asexual propapagation.pptx
Plant propagation: Sexual and Asexual propapagation.pptxPlant propagation: Sexual and Asexual propapagation.pptx
Plant propagation: Sexual and Asexual propapagation.pptx
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.ppt
 
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
 
How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17
 
ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdf
 
Wellbeing inclusion and digital dystopias.pptx
Wellbeing inclusion and digital dystopias.pptxWellbeing inclusion and digital dystopias.pptx
Wellbeing inclusion and digital dystopias.pptx
 
Single or Multiple melodic lines structure
Single or Multiple melodic lines structureSingle or Multiple melodic lines structure
Single or Multiple melodic lines structure
 
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
 
Towards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptxTowards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptx
 
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptxHMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.
 
COMMUNICATING NEGATIVE NEWS - APPROACHES .pptx
COMMUNICATING NEGATIVE NEWS - APPROACHES .pptxCOMMUNICATING NEGATIVE NEWS - APPROACHES .pptx
COMMUNICATING NEGATIVE NEWS - APPROACHES .pptx
 
How to Add New Custom Addons Path in Odoo 17
How to Add New Custom Addons Path in Odoo 17How to Add New Custom Addons Path in Odoo 17
How to Add New Custom Addons Path in Odoo 17
 
Graduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - EnglishGraduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - English
 
Python Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxPython Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docx
 
How to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POSHow to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POS
 

Web Mining Patterns Discovery and Analysis Using Custom-Built Apriori Algorithm

  • 1. International Journal of Engineering Inventions e-ISSN: 2278-7461, p-ISSN: 2319-6491 Volume 2, Issue 5 (March 2013) PP: 16-21 www.ijeijournal.com Page | 16 Web Mining Patterns Discovery and Analysis Using Custom-Built Apriori Algorithm Latheefa.V1 , Rohini.V2 1, 2 Department of Computer Science, Bangalore, India, Christ University, Abstract: Mining web data in order to extract useful knowledge from it has become vital with the wide usage of the World Wide Web. Web Usage Mining is the application of data mining techniques to discover interesting usage patterns from web data. In this paper, a Custom-Built Apriori Algorithm is proposed for the discovery of frequent patterns in the web log data. This study has also developed a tool to discover the frequent patterns, association rules in the web log data. In this study, Web log data of Christ University web site is analysed. The experiments conducted in this study proved that Custom-Built Apriori Algorithm is efficient than Classical Apriori Algorithm as it takes less time. Key words: Web Usage Mining, Custom-Built Apriori, FrequentPatterns, Association Rules. I. Introduction The web is an open medium. With this openness web has grown enormously. One can find any information on web which is its main strength. As web and its usage continue to grow, there is an opportunity to mine the web data and extract useful knowledge from it. Web mining is the application of data mining techniques to find interesting and potentially useful knowledge from the web data. Web mining can be broadly divided into 3 distinct categories based on the kinds of data to be mined: 1.Web Content Mining is the process of extracting useful information from the contents of web documents. 2. Web Structure Mining can be regarded as the process of discovering structure information from the web.3.Web Usage Mining is understanding user behaviour in interacting with the web or with a web site[1]. One of the aims is to obtain information that may assist web site reorganization or assist site adaptation to better suit the user needs. The applications of Web Usage Mining are [2]:  By determining the access behaviour of users, needed links can be identified to improve the overall performance of future accesses  Web usage patterns are used to gather business intelligence to improve customer attraction, and sales.  Personalization for a user can be achieved by keeping track of previously accessed pages. These pages can be identified to improve the overall performance of future accesses. In this paper, web log file of size 70MB is preprocessed and thenanalysed. The contents of the paper are organized as follows: Section IIbriefs the structure of web log file, Section III briefs the methodology, Section IV discusses the experiment results and finally conclusion and future work are mentioned in Section V. II. Web log file A web log file is a file to which the webserver writes information each time a user requests a resource from that particular website[3]. Web usage information takes the form of web server log files, or web logs. For each request from a user’s browser to a web server, a response is generated automatically. This response takes the form of a simple single-line transaction record that is appended to an ASCII text file on the web server. This file is called as web server log.Figure 1 shows the fragment from webserver log of Christ University website. 115.119.146.168 - - [14/Dec/2011:16:37:59 +0530] "GET /block/block.html HTTP/1.1" 304 - "-" "Mozilla/5.0 (Windows NT 5.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1"122.167.80.114 - - [14/Dec/2011:16:37:59 +0530] "GET /webadmin/depgallery/_654.jpeg HTTP/1.1" 200 283307 "http://www.christuniversity.in/Management%20Studies/deptcourses.php?division=Commerceand Management&dept=23" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SearchToolbar 1.2; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; MDDC;) Fig.1 Fragment of web log file A. Common Log Format
  • 2. Web Mining Patterns Discovery and Analysis Using Custom-Built Apriori Algorithm www.ijeijournal.com Page | 17 Web logs come in various formats, which vary depending on the configuration of the web server. The common log format is supported by a variety of web server applications and includes the following seven fields:  Remote host field  Identification field  Auth-user field  Date/time field  HTTP request  Status code field  Transfer volume field III. Methodology The methodology contains the following steps:  Data Pre-processing  Developing custom built apriori algorithm  Identifying web patterns using custom built apriori algorithm  Analysing the discovered web patterns.  Developing a standalone application for web mining. The general flow of project methodology is diagrammatically represented in the figure 2. Fig.2Methodology flow A. Data Preprocessing An important task in any data mining application is the creation of a suitable target data set to which data mining algorithms can be applied. This is particularly important in web usage mining due to characteristics of clickstream data and its relationship to other related data collected from multiple sources and across multiple channels. In this analysis WUMPrep tool is used for web log data preparation. WUMPrep contains a set of Perl scripts for cleaning web log file of irrelevant and automatic requests and creating sessions in it[3]. In this study web log data is pre-processed to make it suitable for mining. Data preprocessing consists of sequence of following tasks:  Data Cleaning.  Sessionization.  Robot Detection.  Converting Web log File to ARFF (Attribute Relation File Format). B. Further Data Preprocessing In this process the requests that are made by the users are replaced with the folders to which they belong. This process is implemented by a java application developed in this thesis. The fragment of weblog file generated by the java application is shown in the figure 3 241393:19|180.215.27.123 - - [14/Dec/2011:16:38:41 +0530] "GET Commerce HTTP/1.1" 200 9589 241393:28|117.230.171.53 - - [14/Dec/2011:16:38:53 +0530] "GET Psychology HTTP/1.1" 200 11991 241393:29|59.145.142.36 - - [14/Dec/2011:16:39:27 +0530] "GET School%20of%20Education HTTP/1.1" 200 8140 Fig.3 Folder request C. Web log to ARFF The proposed apriori algorithm accepts input files in a format called ARFF (Attribute Relation File Format). An ARFF (Attribute-Relation File Format) file is an ASCII text file that describes a list of instances sharing a set of attributes. ARFF files have two distinct sections. The first section is the Header information, which is followed the Data information. Lines that begin with a % are comments.
  • 3. Web Mining Patterns Discovery and Analysis Using Custom-Built Apriori Algorithm www.ijeijournal.com Page | 18 The Header of the ARFF file contains the name of the relation, a list of the attributes (the columns in the data), and their types. The Header contains three parts @RELATION, @ATTRIBUTE and @DATA. All these declarations are case insensitive. The relation name is defined as the first line in the ARFF file. The format is: @relation <relation-name> Attribute declarations take the form of an ordered sequence of @attribute statements. Each attribute in the data set has its own @attribute statement which uniquely defines the name of that attribute and its data type. The order the attributes are declared indicates the column position in the data section of the file. The format for the @attribute statement is: @attribute <attribute-name><data type> where the <attribute-name>must start with an alphabetic character, <data type > is Boolean in the analysis. The ARFF Data section of the file contains the data declaration line and the actual instance lines. The @data declaration is a single line denoting the start of the data segment in the file. The format is: @data Each instance is represented on a single line, with carriage returns denoting the end of the instance. { 0 t, 1 t, 2 t, 3 t } There are two possible representatives of data in the Arff file – dense and sparse. In both formats a web page must be considered as a binary attribute that takes the values true or false in each transaction, depending on whether the page occurs in the transaction or not. Since the occurrence of web pages in transactions (user sessions) is scarce, the sparse format is chosen for presenting the sessions in weblog data. Since WUMPrep does not contain any script that converts web log data into either sparse or dense Arff file format, a utility application is developed in java for this purpose. The code for this application is found in appendix. A fragment of Arff file that is generated from this application is shown in fig 4. @relation christweblog @attribute Biotechnology {f,t} @attribute Botany {f,t} @attribute Chemistry {f,t} @attribute Economics {f,t} @data { 0 t, 1 t } { 0 t } { 0 t } { 0 t, 1 t, 2 t, 3 t } { 2 t, 3 t } Fig.4 web log to ARFF D. Apriori Algorithm Apriori algorithm is proposed by R.Agrawal and R.Srikant for finding the association rules. This algorithm can be considered to consist of two parts. In the first part, those itemsets that exceed the minimum support requirement are found. Such itemsets are called frequent itemsets. In the second part, the association rules that meet the minimum confidence requirement are found from the frequent itemsets[4]. Problems with Apriori Algorithm  The number of candidate itemsets grows quickly and can result in huge candidate sets.  The Apriori algorithm requires many scans of the database. If n is the length of the longest itemset, then (n+1) scans are required.  Many trivial rules are derived and it can often be difficult to extract the most interesting rules from all the rules derived. The Apriori algorithm assumes sparsity since the number of items in each transaction is small compared with the total number of items. The algorithm works better with sparsity. Some applications produce dense data (i.e. many items in each transaction) which may also have many frequently occurring items. Apriori is likely to be inefficient for such applications. E. Custom-Built Apriori Algorithm In this research custom-built apriori is proposed to improve the performance of the Apriori algorithm. The changes done in the proposed algorithm are  Pruning operation is performed only on the candidate itemsets whose size > 2.  For generation of frequent itemsets of size k only the transactions whose size >= k are considered.
  • 4. Web Mining Patterns Discovery and Analysis Using Custom-Built Apriori Algorithm www.ijeijournal.com Page | 19 Custom-Built Apriori Algorithm:- Input:  D, a database of transactions  Minimum support count threshold Output: L, Frequent itemsets in D Method: 1. C1 := All the 1-item sets 2. Read the database to count the support of C1 to determine L1 3. L1:={frequent 1-itemsets} 4. K:=2 5. While(Lk-1 ≠ ϕ) do begin 5.1 Ck:=gen_candidate_itemsets with the given Lk-1 5.2 If(k>2) then 5.3 Prune(Ck) 5.4 For all candidates in Ck do 5.5 count the number of transactions of at least length k that are common in each item ϕCk 5.6 Lk:=All candidates in Ck with minimum support 5.7 k:=k+1 end 6. Answer:=UkLk F. Web Usage Association Rule Discovery Software For the purpose of this research, an independent windows desktop application for the association rule discovery in the web log data is developed. The application is implemented using java programming language. It is an object oriented application specialized for the discovery of association rules in web log data, rather than in the general relational database data. Figure 4.shows the user interface of the software. When an input file containing web log usage data is opened, the web log data is transformed into directories mentioned in the properties file. This file will be converted into ARFF format when the user clicks on the generateARFFFile button. Once the ARFF is prepared, the web log data is ready for execution and obtain the association rules. For Rules to obtain there is a need to satisfy constraints as support and confidence which can be entered in the text fields. Fig. 5 Custom built software to analyse association rules based on support and confidence Fig. 6 Association Rules Generated By Custom-Built Apriori IV. Results and Discussion The analysis has been conducted on the web log data of size ~70MB. The first step in the web log mining process is to clean web log data of irrelevant and automatic web requests prior to running association rule discovery algorithm, as well as to group web requests into visitor sessions. The WUM Prep tool has been used for web log data.
  • 5. Web Mining Patterns Discovery and Analysis Using Custom-Built Apriori Algorithm www.ijeijournal.com Page | 20 Perl Script Resultant File Size % of File Reduction logFilter.pl 6230KB 89 sessionized.pl 7550KB 79 detectRobots.pl 5858KB 83 Fig. 7WUMPrepTool Statistics A. Itemset Level Generations In Apriori Algorithm, the number of Candidate itemsets grows quickly and can result in huge candidate sets. The proposed apriori algorithm will reduce the number of transactions that need to be verified at each level of large item set generation. Fig. 8 Transactions compared at each level of Large Item generation A number of tests have been conducted with various support levels to find the range of support levels in which the satisfactory and meaningful association rules are generated. Fig 9 Large Items Generation Based On Support Level Fig.10Behaviour of Accepted candidate Items over different Supports V. Conclusion The analysis was performed with an objective of knowledge discovery in web data by applying data mining techniques. The analysis was carried out in four steps: data collection, data preprocessing, pattern discovery using custom-built apriori algorithm, pattern analysis. In this analysis Custom-built apriori algorithm is developed. The proposed algorithm has successfully discovered the frequent patterns from web data.A tool has been developed in java for implementing the custom- built apriori algorithm.This tool is specialized for the discovery of association rules in web log data rather than in the general relational database data.For converting preprocessed web log data to ARFF java program has been
  • 6. Web Mining Patterns Discovery and Analysis Using Custom-Built Apriori Algorithm www.ijeijournal.com Page | 21 developed.The generated patterns are analysed by executing the custom-built apriori algorithm with different support and confidence values A.Future Work: This analysis can be extended by implementing other interesting measures such as lift in order to fine tune the results. The proposed work can also be implemented by the adoption of more advanced techniques/tools such as fuzzy associations, rule mining algorithm using the interesting measures such as fuzzy support, fuzzy confidence, fuzzy interest and fuzzy conviction etc. References [1] R. Cooley, B. Mobasher and J. Srivastava, "Web Mining: Information and Pattern Discovery on the World Wide Web," IEEE, 1997. [2] K. R. Suneetha and Dr. R. Krishnamoorthi, "Identifying User Behavior by Analyzing Web ServerAccess Log File, " International Journal of Computer Science and Network Security, , vol. 9, no. 4, 2009. [3] M. Dimitrijevic and Z. Bošnjak, "Discovering Interesting Association Rules in the Web Log Usage Data," Interdisciplinary Journal of Information, Knowledge, and Management, vol. 5, 2010. [4] G.K.Gupta, Introduction to Data Mining with Case Studies, New Delhi: EEE PHI , 2011. [5] C. A. R. C. Goswami D.N., "An Algorithm for Frequent Pattern Mining Based On Apriori," International Journal on Computer Science and Engineering, vol. 02, 2010. [6] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules," in 20th Int. Conf. Very Large Data Bases. [7] Han and Kamber, Data Mining Concepts and Techniques, Secoond ed., New Delhi: Elsevier, 2006.