DB 297C Data Analytics – Project Report
Term I (2013-14)

Group Information
GROUP NO: 11
TEAM MEMBERS:
Bisen Vikratsingh Mohansingh - MT2012036
Kodamasimham Pridhvi - MT2012066
Vaibhav Singh Rajput - MT2012145

Dataset Description
Blue Martini Software approached several clients using its Customer Interaction System to
volunteer their data; the data used here comes from a small dot-com company called Gazelle.com,
a legwear and legcare retailer. The data was made available in two formats: original data and
aggregated data. Among the data collected by the Blue Martini application server, the following
three categories are the relevant data:
 Customer information, which includes customer ID, registration information, and
registration form questionnaire responses.
 Order information, such as the order header, which includes date/time, discount, tax, total
amount, payment, shipping, status and session ID, and the order line, which includes quantity,
price, product, date/time, assortment, and status.
 Clickstream information: the session, which includes starting and ending date/time, cookie,
browser, referrer, visit count, and user agent, and the page view, which includes date/time,
sequence number, URL, processing time, product, and assortment.

Initial dataset: a 25MB .csv file (14000 rows x 296 columns). Using SQL queries we removed
columns containing only NULLs, leaving 208 columns. We then removed crawler records and records
with just one page view (i.e. Session_time_elasped = 0.0), giving a new row count of ~5000. We
manually removed a few more irrelevant columns (browser/OS, day, date info, etc.), bringing the
count to 128 columns, and, keeping only the most frequently visited pages, removed columns whose
sum(visit) < 10, leaving 113 columns.

Final size:
6.5MB, ~5000 x 113

Summary of top 5 observations
Rule Based Classification:
Rule-based methods, rule discovery or rule extraction from data, are data mining techniques
aimed at understanding data structures, providing comprehensible description instead of only
black-box prediction. Sets of rules are useful if rules are not too numerous, comprehensible,
and have sufficiently high accuracy.
From the result of the experiment we can see the rules being generated; some sample rules are
shown in the documentation, and a total of 182 rules were generated. The class to which a rule
belongs is shown at the end of each rule, along with the number of rows covered / number of
misclassifications for that particular rule.
Association Rules:
Association rules were selected based on two factors, lift and support: rules having lift greater
than 1 and at least the minimum support of 0.1 used in the apriori run (see the Association Rules
appendix). A total of 377564 rules were generated, out of which we applied filters and selected a
few rules that showed some interesting patterns.

Result 1
The rule-based classification generates a set of rules on which classification takes place; some
of the rules from the generated model are shown below:
q2_part
O/p: PART decision list
Num_EllenTracy_Product_Views <= 1 AND Num_main_assortment_Template_Views > 1 AND
Num_main_lifestyles_Template_Views > 1: FALSE (38.0/3.0)
Num_main_assortment_Template_Views > 0 AND Num_main_shopping_cart_Template_Views <= 0
AND Num_LifeStyles_Assortment_Views <= 1: TRUE (7.0)
Num_UniqueBoutiques_Assortment_Views <= 0 AND Num_main_vendor_Template_Views <= 1 AND
Num_Brands_Assortment_Views <= 1: FALSE (145.0/71.0)
Num_articles_dpt_about_Template_Views > 0 AND Num_BrandOrder_Assortment_Views <= 1 AND
Num_articles_Template_Views <= 1: TRUE (73.0/22.0)

These are some of the sample rules generated by the model, which was then used for prediction on
the test data. A successful rule-based model was built, with an accuracy above 70%, for
identifying whether a user will continue his session or not.
Results can be found at Rule Based Classification.

Result 2
Five clusters of users plotted against whether their average amount spent per order is above $12.
It is clearly observable from the figure that most of the users in cluster 0 are high spenders,
whereas cluster 2 contains the fewest high spenders.

Result 3
Another striking observation is that high-spending users are the least interested in offers/free
gifts. The image below shows a plot of clusters against the number of visits to the offer/free gift page.

Result 4
The highest-spending users (cluster 0, as concluded in Result 2) were found to be mostly
interested in branded products, which is why their average spending is more than $12. The image
below shows a plot of clusters against DonnaKaran (a fashion design brand) product views. The
opposite holds for our lowest-spending cluster, cluster 2.

Result 5
After applying association rules, some interesting rules were generated which were very useful for
predicting which users are likely to continue and which are not, and what the killer pages were in
most cases. Due to the large number of rules generated, we were unable to go through all of them.
Some of the graphs were:

Scatter plot of rules generated based on support, confidence and lift.
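A scatter plot like this can be generated with the arulesViz package; the lines below are only a
minimal sketch, assuming 'rules' holds the rule set produced by the apriori run described in the
Association Rules appendix:

library(arules)
library(arulesViz)
# Scatter plot of all rules: support on the x-axis, confidence on the y-axis, lift as colour shading.
plot(rules, measure = c("support", "confidence"), shading = "lift")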
Some of the interesting patterns were:
2  {Num_Women_Product_Views=Yes, CONTINUE=YES} => {LEAVE=NO}   support 0.110, confidence 1, lift 2.352941
17 {Num_Women_Product_Views=Yes, Num_Men_Product_Views=No, CONTINUE=YES} => {LEAVE=NO}   support 0.100, confidence 1, lift 2.352941
18 {Num_MAS_Category_Views=No, Num_Women_Product_Views=Yes, CONTINUE=YES} => {LEAVE=NO}   support 0.105, confidence 1, lift 2.352941
19 {Num_MDS_Category_Views=No, Num_Women_Product_Views=Yes, CONTINUE=YES} => {LEAVE=NO}   support 0.105, confidence 1, lift 2.352941

From the rules we can see that women's products were the most viewed, while pages like the MDS
category and MAS category were killer pages in which no one was interested.
Results can be observed here: Association Rule Results.

APPENDIX – collection of individual experiment reports (not more than
15)

Data Cleaning/Filtering
DATA PRE-PROCESSING
Description of Dataset:
Blue Martini Software approached several clients using its Customer Interaction System to volunteer their
data; the data used here comes from a small dot-com company called Gazelle.com, a legwear and legcare
retailer. The data was made available in two formats: original data and aggregated data.
Among the data collected by the Blue Martini application server, the following three categories are the
relevant data:
 Customer information, which includes customer ID, registration information, and registration form
questionnaire responses.
 Order information like Order header, which includes date/time, discount, tax, total amount, payment,
shipping, status, session ID, order line, which includes quantity, price, product, date/time, assortment, and
status.
 Clickstream information session, which includes starting and ending date/time, cookie, browser, referrer,
visit count, and user agent, page view, which includes date/time, sequence number, URL, processing time,
product, and assortment.

Steps for Pre-processing:
 Initial dataset .csv file -> 25MB (14000 rows x 296 columns)
 Loaded the .csv into an RDBMS using the Python scripts discussed below
 With SQL queries removed columns containing only NULLs
o total no. of columns = 208
 Removed records of crawlers
 Removed records with just one page view, i.e. Session_time_elasped = 0.0
o new row count = ~5000
 Manually removed a few more irrelevant columns like browser/os, day, date info, etc.
o now no. of columns = 108
 Removed columns whose sum(visit) < 10
o new no. of columns = 103

Final size:
6.5MB, ~5000 x 103
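For reference, the same NULL-column, single-page-view and low-visit filtering could also be
performed directly in R once the raw CSV is available. The lines below are only an illustrative
sketch: the file names are assumptions, and the Num_ prefix is taken from the column names used
later in this report.

raw <- read.csv("raw_dataset.csv")                          # assumed name of the raw export
raw <- raw[, colSums(is.na(raw)) < nrow(raw)]               # drop columns that are entirely NULL/NA
raw <- raw[raw$Session_time_elasped > 0.0, ]                # drop sessions with only one page view
visit_cols <- grep("^Num_", names(raw))                     # page-view count columns
low <- visit_cols[colSums(raw[, visit_cols], na.rm = TRUE) < 10]
if (length(low) > 0) raw <- raw[, -low]                     # drop rarely visited pages (sum(visit) < 10)
write.csv(raw, "dataset.csv", row.names = FALSE)            # file read back in the experiments below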

import sys

# Read the Names File and write one cleaned column name per line to the output file.
def readfile(fname, oname):
    f = file(fname)
    while True:
        line = f.readline()
        if len(line) == 0:
            break
        stri = line.split(':')
        outfile = file(oname, 'a')
        if not stri[0].startswith("|"):              # skip comment lines in the Names File
            mystr = stri[0].replace(" ", "_")
            if any((c in mystr) for c in '*&/'):     # replace special characters with underscores
                for i in ['*', '&', '/']:
                    mystr = mystr.replace(i, "_")
            if not any((c in mystr) for c in '*&|'):
                mystr = mystr.replace(".", "")
                outfile.write(mystr.strip())
                outfile.write("\n")
        outfile.close()
    f.close()

if len(sys.argv) < 3:
    print 'No file specified'
    sys.exit()
else:
    readfile(sys.argv[1], sys.argv[2])

Python Scripts:

The dataset was given in two files:
o Data File
o Names File
In the above script we take the Names File from the user and retrieve the column names for our
dataset by removing all special characters and writing them to another file for further processing of the names.
>> python read.py names_file_name output_file_name
In output_file_name we will get the column names, one per line.

import sys

# Build an SQL script that creates the database and a table with one VARCHAR column
# per name listed in the file produced by read.py.
def makeScript(fname, oname):
    with open(fname, 'r') as file_contents:
        contents = file_contents.read()
    my_data = contents.split("\n")
    outfile = file(oname, 'a')
    input_db = raw_input("Enter Database Name: ")
    createDatabase = "CREATE DATABASE " + input_db + ";\n"
    useDatabase = "USE " + input_db + ";\n"
    input_table = raw_input("Enter Table Name: ")
    columns = ""
    count = 0
    for data in my_data:
        count = count + 1
        if data:
            columns += data + " VARCHAR(100) DEFAULT NULL ,\n"
    createTable = "CREATE TABLE " + input_table + "(" + columns + ") ;\n"
    createTable = createTable.replace("/", "")
    k = createTable.rfind(',')                      # drop the trailing comma after the last column
    createTable = createTable[:k] + " " + createTable[k+1:]
    print createTable
    print count
    outfile.write(createDatabase)
    outfile.write(useDatabase)
    outfile.write(createTable)
    outfile.close()

if len(sys.argv) < 3:
    print 'No file specified'
    sys.exit()
else:
    makeScript(sys.argv[1], sys.argv[2])

In the above Python file, we use the file created earlier by read.py as the input for this script.
Here we create an SQL script for building a table with the respective columns, each with a data
type of VARCHAR, so that we can load the file into the DBMS for further processing.
>> python Createscript.py output_file_name script.sql
We will get an SQL script for creating the table with the retrieved columns.

import MySQLdb

# Read the Data File line by line and insert each row into the table created above.
myfile = open("path where required csv is there", 'r')
db = MySQLdb.connect(host="localhost",   # your host, usually localhost
                     user="root",        # your username
                     passwd="root",      # your password
                     db="da1")           # name of the database
cur = db.cursor()
for line in myfile:
    print line
    my_line_list = line.split(',')
    string = ""
    for value in my_line_list:
        string = string + "'" + str(value) + "',"
    query_string = string[:-1]           # drop the trailing comma
    final_query = "insert into question1 values" + "(" + query_string + ");"
    cur.execute(final_query)
db.commit()                              # make the inserts permanent

Here we open a connection to the database, read data from the Data File, and write it into the
table we created earlier in the database; the rows are inserted directly while reading from the file.
>> python da.py
Here we need to specify, inside the file, the path where our .csv Data File exists; the script will
then read from the csv and insert the rows into the database.
After entering the data into the DBMS we remove the columns with all NULL values using simple
SQL queries, as mentioned above.
After performing the data pre-processing on the given dataset, we export the table into a .csv file
for performing analysis using R.
Now we will analyze the data using R.

Classification
Random Forest
Objective:
To generate a model for building decision tree and to identify important features using random forest.

Description:
Random forests are an ensemble learning method for classification (and regression) that operate by
constructing a multitude of decision trees at training time and outputting the class that is the mode of the
classes output by individual trees.

Procedure:
1. After the data preprocessing, the dataset is loaded into the R environment using:
question1 <- read.csv("dataset.csv")
dim(question1) # 5220 103 ----- number of rows and columns

2. After loading the dataset we divide it into 70% as trainDataset and 30% as testDataset, as below:
div <- sample(2, nrow(question1), replace=T, prob=c(0.7,0.3))
This generates two samples from our dataset having 70% and 30% of the rows, with no overlap of
rows. 'nrow' gives the number of rows in our dataset, 'prob' decides the division ratio, and
'sample' assigns the number 1 or 2 to each row according to the sample that row belongs to. To
create trainData the command is:
trainData <- question1[div==1,]
dim(trainData) # 3670 103 --- dimensions of trainDataset
This copies into trainData all the rows of the dataset that are marked 1 by sample. Similarly for testData:
testData <- question1[div==2,]
dim(testData) # 1550 103 --- dimensions of testDataset
3. After generating trainData and testData we load the required package 'randomForest' into R:
library(randomForest)
4. Defining the target variable and independent variables in the formula to be used in the
generation of the model, as below:
myformula <- Session_Continues ~ .
Here 'Session_Continues' is the target variable, with classes 'true' and 'false', and all
remaining columns are the independent variables on whose basis the target variable is classified,
as represented by '~ .'.

5. After the formula is decided we apply it to generate the model based on our trainData, as
below, using the function 'randomForest', and store the model in 'rf':
rf <- randomForest(myformula, data=trainData, ntree=100, proximity=T)
'ntree' specifies how many trees the algorithm has to grow to get an accurate model;
'proximity=T' asks randomForest to also compute the proximity matrix between observations, i.e.
how often pairs of cases end up in the same terminal node.
6. We can see the classification result by:
--> rf
output:
Call:
 randomForest(formula = myformula, data = trainData, ntree = 100, proximity = T)
               Type of random forest: classification
                     Number of trees: 100
No. of variables tried at each split: 10

        OOB estimate of error rate: 34.17%
Confusion matrix:
      False True class.error
False  2280  173  0.07052589
True   1081  136  0.88824979
From the result we can see that we are getting an OOB error of about 34%.
7. For seeing a generated tree of the forest:
--> getTree(rf, 1)
Output:
  left daughter right daughter split var status prediction
1             2              3         3      1          0
2             4              5        29      1          0
3             6              7       105      1          0
4             8              9        98      1          0
5            10             11        36      1          0
6            12             13        34      1          0
7             0              0         0     -1          2

If the status is -1 then that node is a leaf of the decision tree, and the prediction (1 or 2)
gives the class into which it is classified. We can get the information of any tree using the
above command, just by specifying the randomForest object 'rf' and the tree number n, which in our
case is 1 <= n <= 100.
8. We can plot the error rates of the generated trees by:
plot(rf)
We will get a graph as shown in figure (1) in the observations.
9. We can also find the features that contribute most to the decision trees using:
importance(rf)
It gives each feature and its mean decrease in Gini index, from which we can see and decide which
are the essential features that affect our decision trees.
10. We can access the many attributes generated by randomForest:
attributes(rf)
output:
$names
 [1] "call"            "type"            "predicted"       "err.rate"
 [5] "confusion"       "votes"           "oob.times"       "classes"
 [9] "importance"      "importanceSD"    "localImportance" "proximity"
[13] "ntree"           "mtry"            "forest"          "y"
[17] "test"            "inbag"           "terms"

$class
[1] "randomForest.formula" "randomForest"

11. Now, using the model generated from trainData, we apply that model to testData for prediction,
as below:
testpredict <- predict(rf, newdata=testData)
output (the predictions cross-tabulated against the actual classes):
testpredict False True
      False   952  473
      True     52   73
Here we use the model 'rf' with the dataset 'testData' and store the result in a variable as above.
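The table above can be turned into an overall accuracy figure directly in R; a small sketch (it
simply re-computes the cross-tabulation and the fraction of matching predictions):

conf <- table(testpredict, testData$Session_Continues)   # predictions vs. actual classes
sum(diag(conf)) / sum(conf)                               # overall accuracy on the test set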

Observations:
The plot of the model error rates is:
We can see that as the number of trees increases, the error rate decreases and the model is able to classify more accurately.

Result of importance(rf):
Column                                   MeanGiniIndex
Num_BrandOrder_Assortment_Views          2.226234e+01
Num_UniqueBoutiques_Assortment_Views     3.372542e+01
Num_Brands_Assortment_Views              2.324481e+01
Num_Departments_Assortment_Views         2.224833e+01
Num_LifeStyles_Assortment_Views          1.265545e+01
Num_main_Template_Views                  4.466076e+01
Num_products_Template_Views              1.143197e+01
Num_articles_Template_Views              1.943811e+01
Num_main_home_Template_Views             2.347768e+01
So from the result above we can observe that, out of the 103 features, only some contribute
strongly to the model, so we can further reduce the dataset and achieve better results.
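For reference, this ranked view can be produced directly from the fitted model; a minimal sketch
using helpers from the randomForest package (the top-10 cut-off is arbitrary):

imp <- importance(rf)                                               # matrix with a MeanDecreaseGini column
imp_sorted <- imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), , drop = FALSE]
head(imp_sorted, 10)                                                # ten most influential features
varImpPlot(rf)                                                      # dot chart of variable importance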
WEKA OUTPUT:
We also ran attribute selection on the above dataset in Weka; the result is:
Num_Hanes_Product_Views
Num_Cotton_Product_Views
Num_Nylon_Product_Views
Num_BrandOrder_Assortment_Views
Num_UniqueBoutiques_Assortment_Views
Num_LifeStyles_Assortment_Views
Num_main_Template_Views
Num_articles_Template_Views
Num_main_home_Template_Views
Num_main_vendor_Template_Views
Num_articles_dpt_about_mgmtteam_Template_Views
Num_main_cust_serv_Template_Views
The two outputs match closely, so we were able to find the factors that most influence our
decision trees.

Conclusion:
We will use the above features, which have the major influence on the decision trees of the random
forest, as the independent variables for the actual construction of a decision tree using the
'party' package, for better results and classification.
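Since this feature list becomes the right-hand side of the model formula in the next experiment,
the formula can also be built programmatically instead of being typed out by hand; a minimal
sketch (only a few of the names are shown in the vector, the remaining ones follow the same pattern):

selected <- c("Num_Hanes_Product_Views", "Num_Cotton_Product_Views",
              "Num_Nylon_Product_Views")                 # ...plus the other selected features above
myformula_reduced <- reformulate(selected, response = "Session_Continues")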

Party Decision Tree
Objective:
To build a decision tree using the features identified by Random Forest using the “party” package.

Description:
A computational toolbox for recursive partitioning. The core of the package is ctree(), an
implementation of conditional inference trees which embed tree-structured regression models into
a well-defined theory of conditional inference procedures. This non-parametric class of regression
trees is applicable to all kinds of regression problems, including nominal, ordinal, numeric, censored
as well as multivariate response variables and arbitrary measurement scales of the covariates.

Procedure:
1. Loading the dataset into R:
question1_reduced <- read.csv("q2_reduced.csv")
dim(question1_reduced) # 5220 103
2. Dividing the dataset into a training dataset and a test dataset:
div <- sample(2, nrow(question1_reduced), replace=T, prob=c(0.70,0.30))
3. Storing the trainData and testData in variables for analysis:
trainData_reduced <- question1_reduced[div==1,]
dim(trainData_reduced) # 3675 103
testData_reduced <- question1_reduced[div==2,]
dim(testData_reduced) # 1545 103
4. Defining the formula, based on the features identified by the random forest, for the target variable:
myformula_reduced <- Session_Continues ~ Num_Hanes_Product_Views +
Num_Cotton_Product_Views + Num_Nylon_Product_Views +
Num_BrandOrder_Assortment_Views + Num_UniqueBoutiques_Assortment_Views +
Num_LifeStyles_Assortment_Views + Num_main_Template_Views +
Num_articles_Template_Views + Num_main_home_Template_Views +
Num_main_vendor_Template_Views + Num_articles_dpt_about_mgmtteam_Template_Views
+ Num_main_cust_serv_Template_Views

The features are the result of the importance factor found during the randomForest.
5. Now loading the 'party' package for analysis:
library(party)
6. Applying the recursive partitioning (ctree) algorithm to the training data based on the above formula:
trainData_ctree <- ctree(myformula_reduced, data=trainData_reduced)
7. Now, to see the generated model:
print(trainData_ctree)

This will show the features used in building the decision tree and also how the decision tree is
built, as below:
output:
Conditional inference tree with 3 terminal nodes
Response: Session_Continues
Inputs: Num_Hanes_Product_Views, Num_Cotton_Product_Views, Num_Nylon_Product_Views,
Num_BrandOrder_Assortment_Views, Num_UniqueBoutiques_Assortment_Views,
Num_LifeStyles_Assortment_Views, Num_main_Template_Views,
Num_articles_Template_Views, Num_main_home_Template_Views,
Num_main_vendor_Template_Views, Num_articles_dpt_about_mgmtteam_Template_Views,
Num_main_cust_serv_Template_Views
Number of observations: 3675
1) Num_main_home_Template_Views <= 1; criterion = 0.999, statistic = 16.455
2) Num_articles_dpt_about_mgmtteam_Template_Views <= 0; criterion = 0.984, statistic =
10.226
3)* weights = 2607
2) Num_articles_dpt_about_mgmtteam_Template_Views > 0
4)* weights = 245
1) Num_main_home_Template_Views > 1
5)* weights = 823
8. For visualizing the tree graphically:
plot(trainData_ctree, type="simple")
We will get the graph of the decision tree as shown below in the observation.
9. Now applying the model to the testData set:
testpred_reduced <- predict(trainData_ctree, newdata=testData_reduced)
10. For checking the accuracy of the prediction:
table(testpred_reduced, testData_reduced$Session_Continues)
The output shows the prediction rate; as the prediction has many errors, this is not a suitable
method for decision tree generation here.
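The misclassification rate quoted in the conclusion below can be read off this table; a small sketch:

conf_ctree <- table(testpred_reduced, testData_reduced$Session_Continues)
1 - sum(diag(conf_ctree)) / sum(conf_ctree)      # overall misclassification rate on the test set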

Observation:
The graph of the decision tree obtained is:

From the graph we can see that only two factors are being considered by this ctree() algorithm.

WEKA:
The decision tree generated in Weka based on the above factors is:

Conclusion
Results were not satisfactory; there was a large rate of error and misclassification of the data.
Around 50% of the data was not correctly classified using the "party" package.

Rule Based Classification
Objective
To generate a model for building rules and to classify data based on the rules being satisfied.

Description
Rule-based methods, rule discovery or rule extraction from data, are data mining techniques aimed
at understanding data structures, providing a comprehensible description instead of only black-box
prediction. Rule-based systems should expose in a comprehensible way the knowledge hidden in data,
providing logical justification for drawing conclusions, showing possible inconsistencies and
avoiding unpredictable conclusions that black-box predictors may generate in atypical situations.
Sets of rules are useful if the rules are not too numerous, comprehensible, and have sufficiently
high accuracy.

Procedure
1. The data has already been loaded into R and divided into a training dataset and a test dataset,
so we can apply the rule-based classification directly on the training dataset.
2. For applying rule-based classification we have to install the package "RWeka", which makes the
algorithms of the Weka tool available in R.
3. We will be using the "PART" rule-based classifier from Weka for generating the rules of our
training dataset, based on which we will classify our test dataset.
library(RWeka)
The above command loads the "RWeka" package into the R environment.
4. Now we will apply the PART algorithm on the training Dataset for obtaining the rules,
q2_part <- PART(Session_Continues ~ ., data = q2_train)
The above command takes the training dataset "q2_train" and applies the "PART" algorithm with
Session_Continues as the target variable and all remaining columns as independent variables.
5. A model is built by the previous command, which is then used for classifying the test dataset as below:
q2_pre <- evaluate_Weka_classifier(q2_part, newdata=q2_test)
Here we use the model generated from the training data to classify the test data;
"evaluate_Weka_classifier", a function provided by the RWeka interface to Weka, applies the model
to the test data and summarises the classification.
6. For seeing the result,
q2_pre
OUTPUT:

=== Summary ===
Correctly Classified Instances          880               67.433 %
Incorrectly Classified Instances        425               32.567 %
Kappa statistic                           0
Mean absolute error                       0.4447
Root mean squared error                   0.4689
Relative absolute error                  99.9541 %
Root relative squared error              99.9958 %
Coverage of cases (0.95 level)          100 %
Mean rel. region size (0.95 level)      100 %
Total Number of Instances              1305

=== Confusion Matrix ===
   a   b   <-- classified as
 734 124 |  a = FALSE
 301 136 |  b = TRUE
From the result we can see that we are getting a classification rate of around 68% - 72%, which is
a better rate than the decision tree.
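Besides the summary produced by evaluate_Weka_classifier, per-instance predictions can also be
obtained from the fitted PART model through the usual predict() interface; a minimal sketch:

q2_pred_class <- predict(q2_part, newdata = q2_test)       # predicted class for each test session
table(q2_pred_class, q2_test$Session_Continues)            # confusion matrix comparable to the one above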

Observation
The rule-based classification generates a set of rules on which classification takes place; we can
see the set of rules from the generated model as below:
q2_part
O/p: PART decision list
Num_EllenTracy_Product_Views <= 1 AND Num_main_assortment_Template_Views > 1 AND
Num_main_lifestyles_Template_Views > 1: FALSE (38.0/3.0)
Num_main_assortment_Template_Views > 0 AND Num_main_shopping_cart_Template_Views <= 0
AND Num_LifeStyles_Assortment_Views <= 1: TRUE (7.0)
Num_UniqueBoutiques_Assortment_Views <= 0 AND Num_main_vendor_Template_Views <= 1 AND
Num_Brands_Assortment_Views <= 1: FALSE (145.0/71.0)
Num_articles_dpt_about_Template_Views > 0 AND Num_BrandOrder_Assortment_Views <= 1 AND
Num_articles_Template_Views <= 1: TRUE (73.0/22.0)
From the result you can see the rules being generated; we have shown some sample rules here, and a
total of 182 rules were generated. At the end of each rule the class it belongs to is shown, along
with the number of rows covered / number of misclassifications for that particular rule.

Conclusion
So from the above observations and results we can see that a successful rule-based model was
built, with an accuracy above 70%, for identifying whether a user will continue his session or not.

Clustering
Objective:
To group website visitors whose page-view patterns are similar and identify their interests.

Approach:
Clustering is a methodology in data analysis which can be used to group objects based on their
similarities. We are making use of the WEKA tool for this analysis.

Preprocessing:
1. Remove all spam data by deleting records with just one page view
2. There are about 500+ dimensions, which is not feasible to analyze, so for dimensionality reduction:
a. Go to the Select Attributes tab of WEKA
b. Manual – remove all session data, browser information, and the most common pages
c. Auto – calculate information gain and select the top 25 attributes

Process:
Experiment I
In this experiment we make 5 clusters of the given instances (users) and analyze their purchase habits.
K-means
Steps:
1. Import the reduced dataset into Weka
2. Select SimpleKMeans
3. Specify the number of clusters
4. Set the distance function to Euclidean
5. Specify k (the number of clusters)
6. Click on Start to generate the clusters
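The same experiment can also be reproduced outside the Weka GUI, for example in R with kmeans();
the sketch below is illustrative only and assumes the reduced 26-attribute dataset has been
exported to a CSV (the file name is an assumption) and that only the numeric page-view columns are used:

q3 <- read.csv("q3_reduced.csv")                    # assumed export of the reduced dataset
num_cols <- sapply(q3, is.numeric)
X <- scale(na.omit(q3[, num_cols]))                 # standardise the numeric page-view counts
set.seed(10)                                        # analogous to Weka's -S 10 seed
km <- kmeans(X, centers = 5, iter.max = 500, nstart = 10)
table(km$cluster)                                   # cluster sizes, comparable to Weka's output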

Results:
(A) Using Euclidean distance
=== Run information ===
Scheme:
weka.clusterers.SimpleKMeans -N 5 -A "weka.core.EuclideanDistance -R first-last" -I 500 num-slots 1 -S 10
Relation: q3_added-weka.filters.unsupervised.attribute.Remove-VR7,10,9,102,127,112,66,87,56,70,111,106,26,59,20,18,104,113,17,1,54,14,115,63,73,15
Instances: 1781
Attributes: 26
City
Customer_ID
US_State
Num_Sheer_Look_Product_Views
Num_CT_Waist_Control_Views
Num_PH_Category_Views
Num_main_shopping_cart_Template_Views
Num_Replenishable_Stock_Views
Num_account_Template_Views
Num_main_login2_Template_Views
Num_Sandal_Foot_Views
Num_HasDressingRoom_True_Views
Num_Legwear_Product_Views
Num_products_productDetailLegwear_Template_Views
Num_DonnaKaran_Product_Views
Num_AmericanEssentials_Product_Views
Num_Basic_Product_Views
Num_WDCS_Category_Views
Num_Oroblu_Product_Views
WhichDoYouWearMostFrequent
Num_products_Template_Views
Home_Market_Value
Num_WAS_Category_Views
Num_main_vendor_Template_Views
Num_main_freegift_Template_Views
Spend_over_$12_per_order_on_average
Test mode: evaluate on training data

=== Clustering model (full training set) ===

kMeans
======
Number of iterations: 12
Within cluster sum of squared errors: 4789.607514501406
Missing values globally replaced with mean/mode
Time taken to build model (full training data) : 0.14 seconds
=== Model and evaluation on training set ===
Clustered Instances
0    263 ( 15%)
1    273 ( 15%)
2    368 ( 21%)
3    484 ( 27%)
4    393 ( 22%)
Observation:
 Cluster 0
o High income
o Spend avg >12$ => potential customer (value)
o Purchase nylon more than cotton (nylon is costlier than cotton)
o Mostly view brands assortment page (DonnaKaran – costly fashion design brand)
o Less visit to free gift category
o More visit to sandal foot
 Cluster 1
o More men's product views than in the other clusters => this cluster might contain more men
o Frequent use of the search bar
o Rich visitors, most of them having above-average home/asset value
 Cluster 2
o General visitor
o Buy cheap products
 Cluster 3
o Interested mostly in offers/free gift products
o Highest visit to checkout page => potential customer (frequency)
 Cluster 4
o No special pattern observed
Experiment II
In this experiment we validate whether page-view data can be used to identify potential customers
using clustering. We have labelled data on whether the average purchase exceeds $12, with 413
instances labelled true and 1368 labelled false.
=== Run information ===
Scheme:
weka.clusterers.SimpleKMeans -N 2 -A "weka.core.EuclideanDistance -R first-last" -I 500 num-slots 1 -S 10
Relation: q3_added-weka.filters.unsupervised.attribute.Remove-VR7,10,9,102,127,112,66,87,56,70,111,106,26,59,20,18,104,113,17,1,54,14,115,63,73,15
Instances: 1781
Attributes: 26
City
Customer_ID
US_State
Num_Sheer_Look_Product_Views
Num_CT_Waist_Control_Views
Num_PH_Category_Views
Num_main_shopping_cart_Template_Views
Num_Replenishable_Stock_Views
Num_account_Template_Views
Num_main_login2_Template_Views
Num_Sandal_Foot_Views
Num_HasDressingRoom_True_Views
Num_Legwear_Product_Views
Num_products_productDetailLegwear_Template_Views
Num_DonnaKaran_Product_Views
Num_AmericanEssentials_Product_Views
Num_Basic_Product_Views
Num_WDCS_Category_Views
Num_Oroblu_Product_Views
WhichDoYouWearMostFrequent
Num_products_Template_Views
Home_Market_Value
Num_WAS_Category_Views
Num_main_vendor_Template_Views
Num_main_freegift_Template_Views
Ignored:
Spend_over_$12_per_order_on_average
Test mode: Classes to clusters evaluation on training data
=== Clustering model (full training set) ===
kMeans
======
Number of iterations: 5
Within cluster sum of squared errors: 4913.0928856548035
Missing values globally replaced with mean/mode
Time taken to build model (full training data) : 0.05 seconds
=== Model and evaluation on training set ===
Clustered Instances
0     561 ( 31%)
1    1220 ( 69%)

Class attribute: Spend_over_$12_per_order_on_average
Classes to Clusters:
0 1 <-- assigned to cluster
402 966 | False
159 254 | True
Cluster 0 <-- True
Cluster 1 <-- False
Incorrectly clustered instances : 656.0   (36.8332 %)

Observation:
Only about 63% of the data is correctly clustered, as the data is heavily biased toward the False
(< $12 average spending) class. But clustering gives us good insight into purchase/page-view patterns.

Association Rules
Objective:
To identify some interesting patterns in the users' page views and also the killer pages.

Description:
Association rule learning is a popular and well researched method for discovering interesting
relations between variables in large databases. It is intended to identify strong rules discovered
in databases using different measures of interestingness. Measures used in our analysis are lift,
confidence and support.

Procedure:
1. To generate the association rules we first needed to convert the dataset into a binary matrix
indicating, for each session, whether he/she visited a given page or not (a sketch of this
conversion is given after this procedure).
2. For mining the association rules the "arules" package is available:
library(arules)
3. Now loading the converted dataset into R for the generation of rules; we used the important
columns based on the mean Gini index obtained from the randomForest result.
4. After loading the data, we convert it into transactions with the following command:
dataTrans <- as(assoc, "transactions")
5. Now we apply the "apriori" algorithm to generate the rules, where we can pass a parameter list
giving the support, confidence and minimum length of every rule:
rules <- apriori(dataTrans)
This generates all rules using the defaults of min-support = 0.1 and min-confidence = 0.8. It also
generates all the subset rules based on the frequent itemsets of attributes.
6. To know how many rules were generated we can inspect:
rules
Around 377564 rules were generated, out of which we were interested only in rules having LEAVE or
CONTINUE on the RHS, to check whether a person will continue or leave after seeing certain pages.
7. We retrieved a subset of rules, from all the generated rules, which showed some interesting patterns.
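The conversion mentioned in step 1 and a parameterised version of the apriori call from step 5 are
sketched below; this is illustrative only: the input file name is an assumption, and it presumes
that all remaining columns (such as CONTINUE and LEAVE) are already categorical.

library(arules)
assoc <- read.csv("q4_sessions.csv", stringsAsFactors = TRUE)    # assumed per-session export
view_cols <- grep("^Num_", names(assoc))
assoc[view_cols] <- lapply(assoc[view_cols],
                           function(x) factor(ifelse(x > 0, "Yes", "No")))   # visited / not visited
dataTrans <- as(assoc, "transactions")                            # one transaction per session
rules <- apriori(dataTrans, parameter = list(support = 0.1, confidence = 0.8, minlen = 2))
rules                                                             # prints how many rules were generated
rulesLeave <- subset(rules, subset = rhs %pin% "LEAVE")           # rules whose RHS involves LEAVE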

Observation:
We were able to see some interesting patterns in the generated rules. For example, most of the
persons in our dataset were female, so most of the rules contained "Num_Women_Product_Views" in
possibly every transaction. Some of the brands were rarely or never visited according to the
rules. We were also able to identify some of the killer pages based on user preferences, i.e.
pages after visiting which users withdrew at the same point every time.

Results:
Some of the rules sorted based on the “lift” values are as below:
rulesLeave <- subset(rules, subset = rhs %pin% "LEAVE")   # for getting the rules
inspect(head(sort(rulesLeave,by="lift"),20))
O/p:
1  {CONTINUE=YES} => {LEAVE=NO}   support 0.425, confidence 1, lift 2.352941
2  {Num_Women_Product_Views=Yes, CONTINUE=YES} => {LEAVE=NO}   support 0.110, confidence 1, lift 2.352941
17 {Num_Women_Product_Views=Yes, Num_Men_Product_Views=No, CONTINUE=YES} => {LEAVE=NO}   support 0.100, confidence 1, lift 2.352941
18 {Num_MAS_Category_Views=No, Num_Women_Product_Views=Yes, CONTINUE=YES} => {LEAVE=NO}   support 0.105, confidence 1, lift 2.352941
19 {Num_MDS_Category_Views=No, Num_Women_Product_Views=Yes, CONTINUE=YES} => {LEAVE=NO}   support 0.105, confidence 1, lift 2.352941
20 {Num_MCS_Category_Views=No, Num_Women_Product_Views=Yes, CONTINUE=YES} => {LEAVE=NO}   support 0.110, confidence 1, lift 2.352941

Some of the interesting rules are shown above. Some randomly chosen ones are:
inspect(head(rulesLeave, 6))
O/p:
3 {Num_Women_Product_Views=Yes, CONTINUE=YES} => {LEAVE=NO}    support 0.110, confidence 1, lift 2.352941
4 {Num_Women_Product_Views=Yes, CONTINUE=NO} => {LEAVE=YES}    support 0.100, confidence 1, lift 1.739130
5 {Num_Women_Product_Views=No, CONTINUE=YES} => {LEAVE=NO}     support 0.315, confidence 1, lift 2.352941
6 {Num_CT_Waist_Control_Views=No, CONTINUE=YES} => {LEAVE=NO}  support 0.360, confidence 1, lift 2.352941

For CONTINUE some examples are:
inspect(head(rulesContinue, 4))
O/p:
3 {Num_Women_Product_Views=Yes, LEAVE=NO} => {CONTINUE=YES}    support 0.110, confidence 1, lift 2.352941
4 {Num_Women_Product_Views=Yes, LEAVE=YES} => {CONTINUE=NO}    support 0.100, confidence 1, lift 1.739130
5 {Num_Women_Product_Views=No, LEAVE=NO} => {CONTINUE=YES}     support 0.315, confidence 1, lift 2.352941
6 {Num_CT_Waist_Control_Views=No, LEAVE=NO} => {CONTINUE=YES}  support 0.360, confidence 1, lift 2.352941

Conclusion:
We were able to find some interesting patterns in users' page views and to identify some of the
killer pages, such as the "Num_CT_Waist_Control_Views" and "Num_MAS_Category_Views" pages.

Contenu connexe

En vedette

Visual Mapping of Clickstream Data
Visual Mapping of Clickstream DataVisual Mapping of Clickstream Data
Visual Mapping of Clickstream Data
DataWorks Summit
 

En vedette (9)

Click Stream Analysis
Click Stream AnalysisClick Stream Analysis
Click Stream Analysis
 
Visual Mapping of Clickstream Data
Visual Mapping of Clickstream DataVisual Mapping of Clickstream Data
Visual Mapping of Clickstream Data
 
clickstream analysis
 clickstream analysis clickstream analysis
clickstream analysis
 
Clickstream Analysis with Spark
Clickstream Analysis with Spark Clickstream Analysis with Spark
Clickstream Analysis with Spark
 
Clickstream & Social Media Analysis using Apache Spark
Clickstream & Social Media Analysis using Apache SparkClickstream & Social Media Analysis using Apache Spark
Clickstream & Social Media Analysis using Apache Spark
 
AWS re:Invent 2016: What’s New with AWS Lambda (SVR202)
AWS re:Invent 2016: What’s New with AWS Lambda (SVR202)AWS re:Invent 2016: What’s New with AWS Lambda (SVR202)
AWS re:Invent 2016: What’s New with AWS Lambda (SVR202)
 
AWS re:Invent 2016: The State of Serverless Computing (SVR311)
AWS re:Invent 2016: The State of Serverless Computing (SVR311)AWS re:Invent 2016: The State of Serverless Computing (SVR311)
AWS re:Invent 2016: The State of Serverless Computing (SVR311)
 
AWS re:Invent 2016: Serverless Computing Patterns at Expedia (SVR306) )
AWS re:Invent 2016: Serverless Computing Patterns at Expedia (SVR306) )AWS re:Invent 2016: Serverless Computing Patterns at Expedia (SVR306) )
AWS re:Invent 2016: Serverless Computing Patterns at Expedia (SVR306) )
 
Clickstream data with spark
Clickstream data with sparkClickstream data with spark
Clickstream data with spark
 

Similaire à Experiments and Results on Click stream analysis using R

Week10 Analysing Client Requirements
Week10 Analysing Client RequirementsWeek10 Analysing Client Requirements
Week10 Analysing Client Requirements
hapy
 
Week11 Determine Technical Requirements
Week11 Determine Technical RequirementsWeek11 Determine Technical Requirements
Week11 Determine Technical Requirements
hapy
 
A Data Warehouse And Business Intelligence Application
A Data Warehouse And Business Intelligence ApplicationA Data Warehouse And Business Intelligence Application
A Data Warehouse And Business Intelligence Application
Kate Subramanian
 
McKinsey Big Data Trinity for self-learning culture
McKinsey Big Data Trinity for self-learning cultureMcKinsey Big Data Trinity for self-learning culture
McKinsey Big Data Trinity for self-learning culture
Matt Ariker
 

Similaire à Experiments and Results on Click stream analysis using R (20)

Data Mining to Classify Telco Churners
Data Mining to Classify Telco ChurnersData Mining to Classify Telco Churners
Data Mining to Classify Telco Churners
 
Strategies to Monetize Energy Data - How Utilities Can Increase Their 'Earnin...
Strategies to Monetize Energy Data - How Utilities Can Increase Their 'Earnin...Strategies to Monetize Energy Data - How Utilities Can Increase Their 'Earnin...
Strategies to Monetize Energy Data - How Utilities Can Increase Their 'Earnin...
 
Week10 Analysing Client Requirements
Week10 Analysing Client RequirementsWeek10 Analysing Client Requirements
Week10 Analysing Client Requirements
 
Week11 Determine Technical Requirements
Week11 Determine Technical RequirementsWeek11 Determine Technical Requirements
Week11 Determine Technical Requirements
 
A Data Warehouse And Business Intelligence Application
A Data Warehouse And Business Intelligence ApplicationA Data Warehouse And Business Intelligence Application
A Data Warehouse And Business Intelligence Application
 
Churn in the Telecommunications Industry
Churn in the Telecommunications IndustryChurn in the Telecommunications Industry
Churn in the Telecommunications Industry
 
Finger Gesture Based Rating System
Finger Gesture Based Rating SystemFinger Gesture Based Rating System
Finger Gesture Based Rating System
 
IRJET- E-Commerce Recommender System using Data Mining Algorithms
IRJET-  	  E-Commerce Recommender System using Data Mining AlgorithmsIRJET-  	  E-Commerce Recommender System using Data Mining Algorithms
IRJET- E-Commerce Recommender System using Data Mining Algorithms
 
IRJET- E-Commerce Recommendation System: Problems and Solutions
IRJET- E-Commerce Recommendation System: Problems and SolutionsIRJET- E-Commerce Recommendation System: Problems and Solutions
IRJET- E-Commerce Recommendation System: Problems and Solutions
 
IRJET- Recommendation System for Electronic Products using BigData
IRJET- Recommendation System for Electronic Products using BigDataIRJET- Recommendation System for Electronic Products using BigData
IRJET- Recommendation System for Electronic Products using BigData
 
IRJET - Customer Churn Analysis in Telecom Industry
IRJET - Customer Churn Analysis in Telecom IndustryIRJET - Customer Churn Analysis in Telecom Industry
IRJET - Customer Churn Analysis in Telecom Industry
 
Big Data Analytics for Predicting Consumer Behaviour
Big Data Analytics for Predicting Consumer BehaviourBig Data Analytics for Predicting Consumer Behaviour
Big Data Analytics for Predicting Consumer Behaviour
 
Gc3310851089
Gc3310851089Gc3310851089
Gc3310851089
 
Gc3310851089
Gc3310851089Gc3310851089
Gc3310851089
 
McKinsey Big Data Trinity for self-learning culture
McKinsey Big Data Trinity for self-learning cultureMcKinsey Big Data Trinity for self-learning culture
McKinsey Big Data Trinity for self-learning culture
 
Modern trends in information systems
Modern trends in information systemsModern trends in information systems
Modern trends in information systems
 
data collection, data integration, data management, data modeling.pptx
data collection, data integration, data management, data modeling.pptxdata collection, data integration, data management, data modeling.pptx
data collection, data integration, data management, data modeling.pptx
 
Data analysis step by step guide
Data analysis   step by step guideData analysis   step by step guide
Data analysis step by step guide
 
Information system infrastructure
Information system infrastructureInformation system infrastructure
Information system infrastructure
 
Lecture 01 mis
Lecture 01 misLecture 01 mis
Lecture 01 mis
 

Plus de Pridhvi Kodamasimham

DevanagiriOCR on CELL BROADBAND ENGINE
DevanagiriOCR on CELL BROADBAND ENGINEDevanagiriOCR on CELL BROADBAND ENGINE
DevanagiriOCR on CELL BROADBAND ENGINE
Pridhvi Kodamasimham
 

Plus de Pridhvi Kodamasimham (6)

Open Data Convergence
Open Data ConvergenceOpen Data Convergence
Open Data Convergence
 
Thirst
Thirst Thirst
Thirst
 
In-Memory DataBase
In-Memory DataBaseIn-Memory DataBase
In-Memory DataBase
 
DevanagiriOCR on CELL BROADBAND ENGINE
DevanagiriOCR on CELL BROADBAND ENGINEDevanagiriOCR on CELL BROADBAND ENGINE
DevanagiriOCR on CELL BROADBAND ENGINE
 
Data Convergence
Data ConvergenceData Convergence
Data Convergence
 
Reactive Data System
Reactive Data SystemReactive Data System
Reactive Data System
 

Dernier

1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
QucHHunhnh
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
ciinovamais
 

Dernier (20)

General Principles of Intellectual Property: Concepts of Intellectual Proper...
General Principles of Intellectual Property: Concepts of Intellectual  Proper...General Principles of Intellectual Property: Concepts of Intellectual  Proper...
General Principles of Intellectual Property: Concepts of Intellectual Proper...
 
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptxSKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
 
Magic bus Group work1and 2 (Team 3).pptx
Magic bus Group work1and 2 (Team 3).pptxMagic bus Group work1and 2 (Team 3).pptx
Magic bus Group work1and 2 (Team 3).pptx
 
Asian American Pacific Islander Month DDSD 2024.pptx
Asian American Pacific Islander Month DDSD 2024.pptxAsian American Pacific Islander Month DDSD 2024.pptx
Asian American Pacific Islander Month DDSD 2024.pptx
 
How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17
 
SOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning PresentationSOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning Presentation
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
 
Third Battle of Panipat detailed notes.pptx
Third Battle of Panipat detailed notes.pptxThird Battle of Panipat detailed notes.pptx
Third Battle of Panipat detailed notes.pptx
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.ppt
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.
 
psychiatric nursing HISTORY COLLECTION .docx
psychiatric  nursing HISTORY  COLLECTION  .docxpsychiatric  nursing HISTORY  COLLECTION  .docx
psychiatric nursing HISTORY COLLECTION .docx
 
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
 
Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
Unit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxUnit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptx
 
Dyslexia AI Workshop for Slideshare.pptx
Dyslexia AI Workshop for Slideshare.pptxDyslexia AI Workshop for Slideshare.pptx
Dyslexia AI Workshop for Slideshare.pptx
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
 

Experiments and Results on Click stream analysis using R

  • 1. DB 297C Data Analytics – Project Report Term I (2013-14) Group Information GROUP NO: 11 TEAM MEMBERS: Bisen Vikratsingh Mohansingh - MT2012036 Kodamasimham Pridhvi - MT2012066 Vaibhav Singh Rajput - MT2012145 Dataset Description Blue Martini Software approached several clients using its Customer Interaction System to volunteer their data, and a small dot-com company called Gazelle.com, a leg wear and leg care retailer. Data was made available to in two formats: original data and aggregated data. Among the data collected by the Blue Martini application server, the following three categories are the relevant data:  Customer information, which includes customer ID, registration information, and registration form questionnaire responses.  Order information like Order header, which includes date/time, discount, tax, total amount, payment, shipping, status, session ID, order line, which includes quantity, price, product, date/time, assortment, and status.  Click stream information session, which includes starting and ending date/time, cookie, browser, referrer, visit count, and user agent, page view, which includes date/time, sequence number, URL, processing time, product, and assortment. Initial dataset .csv file -> 25MB (14000 row x 296 column), With SQL query removed columns with NULL making total no. of columns = 208. Removed record of Crawlers and Record with just one page view i.e. Session_time_elasped =0.0 making new rows count = ~5000. Manually removed few more irrelevant columns like browser/os, day, date info, etc, now no. of columns =128.Considering the most frequent visitors removed Columns whose sum (visit) < 10 making new number of columns =113. Final size: 6.5MB, ~5000 x 113
  • 2. DB 297C Data Analytics – Project Report Term I (2013-14) Summary of top 5 observations Rule Based Classification: Rule-based methods, rule discovery or rule extraction from data, are data mining techniques aimed at understanding data structures, providing comprehensible description instead of only black-box prediction. Sets of rules are useful if rules are not too numerous, comprehensible, and have sufficiently high accuracy. From the result of the experiment we can see rules being generated, we have show some sample rules in the documentation, there were totally 182 rules generated. To which class the rule belongs to is shown at the end of each rule with actual number of rows / number of misclassifications that belong to that particular rule. Association Rules: Association rules were taken based on two factors, lift and support. Rules having lift greater than 1 and min-support > 0.5. A total of 377564 rules were generated out of which we applied filters and selected few rules which showed some interesting patterns . Result 1 The Rule based classification will generate a set of rules on which classification takes place; we can see the set of rules from the model as generated below q2_part O/p: PART decision list Num_EllenTracy_Product_Views <= 1 AND Num_main_assortment_Template_Views > 1 AND Num_main_lifestyles_Template_Views > 1: FALSE (38.0/3.0) Num_main_assortment_Template_Views > 0 AND Num_main_shopping_cart_Template_Views <= 0 AND Num_LifeStyles_Assortment_Views <= 1: TRUE (7.0) Num_UniqueBoutiques_Assortment_Views <= 0 AND Num_main_vendor_Template_Views <= 1 AND Num_Brands_Assortment_Views <= 1: FALSE (145.0/71.0) Num_articles_dpt_about_Template_Views > 0 AND Num_BrandOrder_Assortment_Views <= 1 AND
  • 3. DB 297C Data Analytics – Project Report Term I (2013-14) Num_articles_Template_Views <= 1: TRUE (73.0/22.0) These are some of the sample rules that are generated based on the model using which the prediction on the test data was done. A successful rule based model was build with an accuracy of above 70% for identifying whether a user will continue his session or not. Results can be found at Rule Based Classification. Result 2 5 Cluster of users plotted against their average amount spend>12$. It is clearly observable from the figure that the most of users in cluster 0 are high spender where as cluster 2 consist of least highest spender Result 3 One more strange observation is high spender people are least interested in offers/free gift. Below image shows a plot of cluster against number of visits to offer/free gift page.
  • 4. DB 297C Data Analytics – Project Report Term I (2013-14) Result 4 Highest spender people (cluster 0 as concluded from result 2nd ) as found to be mostly interested in branded product that’s why their average spending is more than 12$. Below image shows a plot of cluster against DonnaKaran(Fashion design brand) product views. And opposite for our least spender cluster 2. Result 5 After applying association rules , some interesting rules were generated which were very useful to predict which users like to continue and who don’t and what were the killer pages in most cases. Due to large number of rules generation we were unable to go through all rules. Some of the graphs were:
  • 5. DB 297C Data Analytics – Project Report Term I (2013-14) Scatter plot of rules generated based on support, confidence and lift. Some of the interesting patterns were: 2 {Num_Women_Product_Views=Yes, CONTINUE=YES} => {LEAVE=NO} 0.110 1 2.352941 17 {Num_Women_Product_Views=Yes, Num_Men_Product_Views=No, CONTINUE=YES} => {LEAVE=NO} 0.100 1 2.352941 18 {Num_MAS_Category_Views=No, Num_Women_Product_Views=Yes, CONTINUE=YES} => {LEAVE=NO} 0.105 1 2.352941 19 {Num_MDS_Category_Views=No, Num_Women_Product_Views=Yes, CONTINUE=YES} => {LEAVE=NO} 0.105 1 2.352941 From the rules we can see that Women products were the most viewed and pages like MDS category, MAS category were the killer pages where no one was interested. Results can be observed here Association Rule Results.
  • 6. DB 297C Data Analytics – Project Report Term I (2013-14) APPENDIX – collection of individual experiment reports (not more than 15) Data Cleaning/Filtering DATA PRE-PROCESSING Description of Dataset: Blue Martini Software approached several clients using its Customer Interaction System to volunteer their data, and a small dot-com company called Gazelle.com, a legwear and legcare retailer. Data was made available to in two formats: original data and aggregated data. Among the data collected by the Blue Martini application server, the following three categories are the relevant data:  Customer information, which includes customer ID, registration information, and registration form questionnaire responses.  Order information like Order header, which includes date/time, discount, tax, total amount, payment, shipping, status, session ID, order line, which includes quantity, price, product, date/time, assortment, and status.  Clickstream information session, which includes starting and ending date/time, cookie, browser, referrer, visit count, and user agent, page view, which includes date/time, sequence number, URL, processing time, product, and assortment. Steps for Pre-processing:        Initial dataset .csv file -> 25MB (14000 row x 296 col) Loaded .csv into RDBMS using python scripts as discussed below With SQL query removed columns with NULL o total no. of col = 208 Removed record of Crawlers Record with just one page view i.e Session_time_elasped =0.0 o new rows count = ~5000 Manually removed few more irrelevant col like browser/os,day,date info, etc, o Now no. of col =108 Removed Col whose sum(visit) < 10 o New no of col =103 Final size: 6.5MB, ~5000 x 103
  • 7. DB 297C Data Analytics – Project Report Term I (2013-14) import sys defreadfile(fname,oname): f = file(fname) whileTrue: line = f.readline() stri=line.split(':'); if len(line)==0: break outfile = file(oname,'a') ifnot stri[0].startswith("|"): mystr=stri[0].replace(" ","_") if((c in mystr)for c in'*&'): m=['*','&','/'] for i in m: mystr=mystr.replace(i,"_") ifnot any((c in mystr)for c in'*&|'): # if any(c in stri[0] for c in '*$&|'): mystr=mystr.replace(".","") outfile.write(mystr.strip()) outfile.write("n") outfile.close() f.close() if len(sys.argv)<2: print'No file specified' sys.exit() else: readfile(sys.argv[1],sys.argv[2]) Python Scripts: Dataset was given in two files: o Data File o Names File In the above script we are taking the Names File from the user and retrieving the column names for our dataset by removing all special characters and writing it to another file for further processing of names. >> python read.py names_file_name output_file_name In the output_file_name we will get the columns names individually .
  • 8. DB 297C Data Analytics – Project Report Term I (2013-14) import sys defmakeScript(fname,oname): with open(fname,'r')as file_contents: contents = file_contents.read() my_data=contents.split("n") outfile=file(oname,'a') input_db = raw_input("Enter Database Name: ") createDatabase ="CREATE DATABASE "+ input_db +";n" useDatabase ="USE "+input_db +";n" input_table = raw_input("Enter Table Name: ") columns ="" count=0 print columns for data in my_data: count=count+1 if data: columns += data +" VARCHAR(100) DEFAULT NULL ,n" createTable="CREATE TABLE "+input_table +"("+ columns +") ;n" createTable=createTable.replace("/","") k = createTable.rfind(',') createTable=createTable[:k]+" "+createTable[k+1:] print createTable print count outfile.write(createDatabase) outfile.write(useDatabase) outfile.write(createTable) outfile.close() file_contents.close() if len(sys.argv)<2: print'No file specified' sys.exit() else: makeScript(sys.argv[1],sys.argv[2]) In the above python file, we are using the file created before using read.py, which is the input file for the above script. Here we are creating a sql script for creating a table with respective columns having a data type of varchar, so that we can load the file into DBMS for further processing. >> python Createscript.py output_file_name script.sql We will get a sql script for creating the table with the columns retrieved.
da.py:

   import MySQLdb

   # path to the pre-processed .csv Data File
   myfile = open("path where required csv is there", 'r')

   db = MySQLdb.connect(host="localhost",   # your host, usually localhost
                        user="root",        # your username
                        passwd="root",      # your password
                        db="da1")           # name of the database
   cur = db.cursor()

   for line in myfile:
       print line
       my_line_list = line.split(',')
       string = ""
       for value in my_line_list:
           string = string + "'" + str(value) + "',"
       query_string = string[:-1]
       final_query = "insert into question1 values" + "(" + query_string + ");"
       cur.execute(final_query)
   db.commit()   # commit the inserts so they are persisted

Here we open a connection to the database, read the Data File line by line, and insert each row directly into the table created earlier.

   >> python da.py

The path of the .csv Data File has to be set inside the script; the script then reads the file and inserts the rows into the database. After the data is inside the DBMS, we remove the columns whose values are all NULL using simple SQL queries, as mentioned above. After performing the data pre-processing on the given dataset, we export the table into a .csv file for performing the analysis in R. Now we will analyse the data using R.
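The NULL-column removal itself was done with SQL queries in the DBMS. As a cross-check once the table has been exported, the same condition can be re-tested in R before the analysis starts. The following is a minimal sketch, assuming the exported file is named dataset.csv; the file name and the empty-string handling are illustrative assumptions, not the exact commands used:

   # re-check in R that no all-NULL (all-NA) columns survived the SQL clean-up
   question1 <- read.csv("dataset.csv")

   # columns in which every value is NA (or the empty string sometimes left by NULL exports)
   all_null <- sapply(question1, function(col) all(is.na(col) | col == ""))
   names(question1)[all_null]      # should be empty after pre-processing

   # drop them if any remain
   question1 <- question1[, !all_null]
   dim(question1)                  # expected: ~5000 x 103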
Classification

Random Forest

Objective: To generate a model for building a decision tree and to identify important features using random forest.

Description: Random forests are an ensemble learning method for classification (and regression) that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes output by the individual trees.

Procedure:
1. After the data pre-processing, the dataset is loaded into the R environment:
   question1 <- read.csv("dataset.csv")
   dim(question1)   # 5220 103 ----- number of rows and columns
2. After loading the dataset we divide it into 70% training data and 30% test data:
   div <- sample(2, nrow(question1), replace=T, prob=c(0.7,0.3))
   This generates two samples from our dataset containing roughly 70% and 30% of the rows, with no overlap of rows. 'nrow' gives the number of rows in our dataset, 'prob' decides the division ratio, and 'sample' assigns the number 1 or 2 to each row according to which sample that row belongs to.
   To create trainData the command is:
   trainData <- question1[div==1,]
   dim(trainData)   # 3670 103 --- dimensions of trainData
   This copies all rows of the dataset marked as 1 into trainData. Similarly for testData:
   testData <- question1[div==2,]
   dim(testData)    # 1550 103 --- dimensions of testData
3. After generating trainData and testData we load the required package 'randomForest' into R:
   library(randomForest)
4. We define the target variable and the independent variables in the formula to be used for generating the model:
   myformula <- Session_Continues ~ .
   Here 'Session_Continues' is the target variable, with classes 'true' and 'false', and all remaining columns are used as the independent variables on which the target variable is classified, as represented by '~ .'.
5. Once the formula is decided, we apply it to generate the model from trainData using the function 'randomForest' and store the model in 'rf':
   rf <- randomForest(myformula, data=trainData, ntree=100, proximity=T)
   'ntree' specifies how many trees the algorithm has to grow to get an accurate model; 'proximity=T' asks randomForest to also compute the proximity matrix (how often pairs of observations end up in the same terminal node), which can be inspected together with the out-of-bag (OOB) estimates when analysing error rates.
6. We can see the classification result by:
   > rf
   Output:
   Call:
    randomForest(formula = myformula, data = trainData, ntree = 100, proximity = T)
                  Type of random forest: classification
                        Number of trees: 100
   No. of variables tried at each split: 10

           OOB estimate of error rate: 34.17%
   Confusion matrix:
          False True class.error
   False   2280  173  0.07052589
   True    1081  136  0.88824979

   From the result we can see that the OOB error rate is about 34%.
7. To see one of the generated trees of the forest:
   > getTree(rf, 1)
   Output (split point values are not reproduced here):
      left daughter  right daughter  split var  status  prediction
   1              2               3          3       1           0
   2              4               5         29       1           0
   3              6               7        105       1           0
   4              8               9         98       1           0
   5             10              11         36       1           0
   6             12              13         34       1           0
   7              0               0          0      -1           2

   If status is -1 then that node is a leaf node of the decision tree, and prediction (1 or 2) indicates the class to which it is classified. We can get the information of any tree with the above command just by specifying the randomForest object 'rf' and the tree number n, where in our case 1 <= n <= 100.
8. We can plot the error rates of the generated trees by:
   plot(rf)
   This produces the graph shown in figure (1) in the observations.
9. We can also find the features that contribute most to the decision trees using:
   importance(rf)
   It gives each feature with its mean decrease in Gini index, from which we can decide which features have the greatest effect on the model.
10. We can inspect the components available on the randomForest object:
   attributes(rf)
   Output:
   $names
    [1] "call"            "type"            "predicted"       "err.rate"
    [5] "confusion"       "votes"           "oob.times"       "classes"
    [9] "importance"      "importanceSD"    "localImportance" "proximity"
   [13] "ntree"           "mtry"            "forest"          "y"
   [17] "test"            "inbag"           "terms"
   $class
   [1] "randomForest.formula" "randomForest"
11. Now, using the model generated from trainData, we apply it to testData for prediction and tabulate the result:
   testpredict <- predict(rf, newdata=testData)
   table(testpredict, testData$Session_Continues)
   Output:
   testpredict False True
        False    952  473
        True      52   73
   Here we use the model 'rf' on the dataset 'testData' and store the predicted classes in 'testpredict'.

Observations:
The plot of the model error rates is shown below.
From the plot we can see that as the number of trees increases the error rate decreases and stabilises, and the forest is able to classify more accurately.

Result of importance(rf):

   Column                                   MeanDecreaseGini
   Num_BrandOrder_Assortment_Views          2.226234e+01
   Num_UniqueBoutiques_Assortment_Views     3.372542e+01
   Num_Brands_Assortment_Views              2.324481e+01
   Num_Departments_Assortment_Views         2.224833e+01
   Num_LifeStyles_Assortment_Views          1.265545e+01
   Num_main_Template_Views                  4.466076e+01
   Num_products_Template_Views              1.143197e+01
   Num_articles_Template_Views              1.943811e+01
   Num_main_home_Template_Views             2.347768e+01

From the result above we can observe that, out of the 103 features, only a few contribute substantially to the model, so we can further reduce the dataset and achieve better results.

WEKA OUTPUT:
We also ran attribute selection on the same dataset in WEKA; the selected attributes are:
   Num_Hanes_Product_Views
   Num_Cotton_Product_Views
   Num_Nylon_Product_Views
   Num_BrandOrder_Assortment_Views
   Num_UniqueBoutiques_Assortment_Views
   Num_LifeStyles_Assortment_Views
   Num_main_Template_Views
   Num_articles_Template_Views
   Num_main_home_Template_Views
   Num_main_vendor_Template_Views
   Num_articles_dpt_about_mgmtteam_Template_Views
   Num_main_cust_serv_Template_Views

The two outputs largely agree, so we were able to find the factors that most influence our decision tree.

Conclusion: We will use the features that have the greatest influence in the random forest as the independent variables for the actual construction of a decision tree using the 'party' package, in order to get better results and classification.
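For completeness, the hold-out accuracy and the feature ranking used in the conclusion can be obtained directly in R. This is a minimal sketch using the objects defined in the procedure (rf, testpredict, testData); the cut-off of ten features is an illustrative assumption, not part of the original experiment:

   # confusion table of the random forest on the 30% hold-out set
   tab <- table(testpredict, testData$Session_Continues)
   accuracy <- sum(diag(tab)) / sum(tab)
   accuracy        # with the counts reported above, (952 + 73) / 1550 is about 0.66

   # rank features by mean decrease in Gini and keep the strongest ones
   imp <- importance(rf)
   ranked <- sort(imp[, "MeanDecreaseGini"], decreasing = TRUE)
   top_features <- names(head(ranked, 10))   # cut-off of 10 is an assumption
   top_features

   # formula restricted to the selected features, to be reused with ctree()
   myformula_reduced <- reformulate(top_features, response = "Session_Continues")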
Party Decision Tree

Objective: To build a decision tree, using the features identified by random forest, with the "party" package.

Description: "party" is a computational toolbox for recursive partitioning. The core of the package is ctree(), an implementation of conditional inference trees which embed tree-structured regression models into a well-defined theory of conditional inference procedures. This non-parametric class of regression trees is applicable to all kinds of regression problems, including nominal, ordinal, numeric, censored as well as multivariate response variables and arbitrary measurement scales of the covariates.

Procedure:
1. Loading the dataset into R:
   question1_reduced <- read.csv("q2_reduced.csv")
   dim(question1_reduced)   # 5220 103
2. Dividing the dataset into a training dataset and a test dataset:
   div <- sample(2, nrow(question1_reduced), replace=T, prob=c(0.70,0.30))
3. Storing the trainData and testData in variables for analysis:
   trainData_reduced <- question1_reduced[div==1,]
   dim(trainData_reduced)   # 3675 103
   testData_reduced <- question1_reduced[div==2,]
   dim(testData_reduced)    # 1545 103
4. Defining the formula on the target variable, based on the features identified by random forest:
   myformula_reduced <- Session_Continues ~ Num_Hanes_Product_Views + Num_Cotton_Product_Views +
       Num_Nylon_Product_Views + Num_BrandOrder_Assortment_Views + Num_UniqueBoutiques_Assortment_Views +
       Num_LifeStyles_Assortment_Views + Num_main_Template_Views + Num_articles_Template_Views +
       Num_main_home_Template_Views + Num_main_vendor_Template_Views +
       Num_articles_dpt_about_mgmtteam_Template_Views + Num_main_cust_serv_Template_Views
   The features are the result of the importance ranking found during the random forest experiment.
5. Loading the "party" package for analysis:
   library(party)
6. Applying the recursive partitioning algorithm to the training data based on the above formula:
   trainData_ctree <- ctree(myformula_reduced, data=trainData_reduced)
7. To see the generated model:
   print(trainData_ctree)
This will show the features used in building the decision tree and also how the decision tree is built:

   Output:
   Conditional inference tree with 3 terminal nodes

   Response: Session_Continues
   Inputs: Num_Hanes_Product_Views, Num_Cotton_Product_Views, Num_Nylon_Product_Views,
   Num_BrandOrder_Assortment_Views, Num_UniqueBoutiques_Assortment_Views,
   Num_LifeStyles_Assortment_Views, Num_main_Template_Views, Num_articles_Template_Views,
   Num_main_home_Template_Views, Num_main_vendor_Template_Views,
   Num_articles_dpt_about_mgmtteam_Template_Views, Num_main_cust_serv_Template_Views
   Number of observations: 3675

   1) Num_main_home_Template_Views <= 1; criterion = 0.999, statistic = 16.455
     2) Num_articles_dpt_about_mgmtteam_Template_Views <= 0; criterion = 0.984, statistic = 10.226
       3)* weights = 2607
     2) Num_articles_dpt_about_mgmtteam_Template_Views > 0
       4)* weights = 245
   1) Num_main_home_Template_Views > 1
     5)* weights = 823

8. For visualising the tree graphically:
   plot(trainData_ctree, type="simple")
   This gives the graph of the decision tree shown in the observation below.
9. Now applying the model to the test dataset:
   testpred_reduced <- predict(trainData_ctree, newdata=testData_reduced)
10. For checking the accuracy of the prediction:
   table(testpred_reduced, testData_reduced$Session_Continues)
   The output shows the prediction counts; since the prediction has many errors, this is not a suitable method of decision tree generation for this dataset.
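Step 10 only prints the raw counts; the misclassification rate discussed in the conclusion of this experiment can be derived from that same table. A minimal sketch, assuming the objects created in steps 1-9 above:

   # confusion table of the ctree predictions against the true labels
   tab_ctree <- table(testpred_reduced, testData_reduced$Session_Continues)

   accuracy_ctree <- sum(diag(tab_ctree)) / sum(tab_ctree)   # share of correctly classified rows
   error_ctree    <- 1 - accuracy_ctree                      # misclassification rate (~0.5 reported below)
   accuracy_ctree
   error_ctree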
Observation:
The graph of the decision tree obtained is shown in the figure. From the graph we can see that only two factors are actually used for splitting by the ctree() algorithm.
WEKA:
The decision tree generated in WEKA based on the same factors is shown in the figure.

Conclusion: The results were not satisfactory; there was a large rate of error and misclassification of data. Around 50% of the data was not correctly classified using the "party" package.
Rule Based Classification

Objective: To generate a model for building rules and to classify data based on the rules being satisfied.

Description: Rule-based methods (rule discovery or rule extraction from data) are data mining techniques aimed at understanding data structures, providing a comprehensible description instead of only black-box prediction. Rule-based systems should expose, in a comprehensible way, knowledge hidden in data, providing logical justification for drawing conclusions, showing possible inconsistencies and avoiding the unpredictable conclusions that black-box predictors may generate in untypical situations. Sets of rules are useful if the rules are not too numerous, comprehensible, and have sufficiently high accuracy.

Procedure:
1. The data has already been loaded into R and divided into a training dataset and a test dataset, so we can apply rule-based classification directly on the training dataset.
2. For rule-based classification we have to install the package "RWeka", which makes the algorithms of the WEKA tool available in R.
3. We use the "PART" rule-based classifier from WEKA to generate the rules from our training dataset, based on which we classify our test dataset.
   library(RWeka)
   The above command loads the "RWeka" package into the R environment.
4. Now we apply the PART algorithm on the training dataset to obtain the rules:
   q2_part <- PART(Session_Continues ~ ., data = q2_train)
   The above command takes the training dataset "q2_train" and applies the "PART" algorithm with Session_Continues as the target variable and all remaining columns as independent variables.
5. The model built by the previous command is used for classifying the test dataset:
   q2_pre <- evaluate_Weka_classifier(q2_part, newdata = q2_test)
   Here we use the model generated from the training data to classify the test data; "evaluate_Weka_classifier" is the RWeka function that applies the model to newdata and evaluates it.
6. To see the result:
   q2_pre
   OUTPUT:
   === Summary ===
   Correctly Classified Instances       880               67.433  %
   Incorrectly Classified Instances     425               32.567  %
   Kappa statistic                        0
   Mean absolute error                    0.4447
   Root mean squared error                0.4689
   Relative absolute error               99.9541 %
   Root relative squared error           99.9958 %
   Coverage of cases (0.95 level)       100      %
   Mean rel. region size (0.95 level)   100      %
   Total Number of Instances           1305

   === Confusion Matrix ===
      a   b   <-- classified as
    734 124 |  a = FALSE
    301 136 |  b = TRUE

From the result we can see that we get a classification rate of around 67%, which is a better rate than the decision tree.

Observation:
The rule-based classification generates a set of rules on which the classification takes place; we can see the set of rules from the generated model:
   q2_part
O/p:
   PART decision list

   Num_EllenTracy_Product_Views <= 1 AND
   Num_main_assortment_Template_Views > 1 AND
   Num_main_lifestyles_Template_Views > 1: FALSE (38.0/3.0)

   Num_main_assortment_Template_Views > 0 AND
   Num_main_shopping_cart_Template_Views <= 0 AND
   Num_LifeStyles_Assortment_Views <= 1: TRUE (7.0)

   Num_UniqueBoutiques_Assortment_Views <= 0 AND
   Num_main_vendor_Template_Views <= 1 AND
   Num_Brands_Assortment_Views <= 1: FALSE (145.0/71.0)

   Num_articles_dpt_about_Template_Views > 0 AND
   Num_BrandOrder_Assortment_Views <= 1 AND
   Num_articles_Template_Views <= 1: TRUE (73.0/22.0)

From the result you can see the rules being generated; we show only a few sample rules here, out of a total of 182 rules generated. At the end of each rule the class it predicts is given, together with the actual number of covered rows / number of misclassifications for that particular rule.

Conclusion: From the above observations and results we can see that a successful rule-based model was built, with an accuracy of about 67%, for identifying whether a user will continue his session or not.
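The estimate above comes from a single 70/30 split; RWeka can also cross-validate the same PART model on the training data. A minimal sketch, assuming the q2_train object from the procedure (the number of folds and the seed are illustrative choices):

   library(RWeka)

   # 10-fold cross-validation of the PART rule learner on the training data
   q2_part <- PART(Session_Continues ~ ., data = q2_train)
   q2_cv   <- evaluate_Weka_classifier(q2_part, numFolds = 10, seed = 1, class = TRUE)
   q2_cv   # prints the cross-validated summary and confusion matrix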
Clustering

Objective: To group visitors of the website whose page view patterns are similar and to identify their interests.

Approach: Clustering is a standard methodology in data analysis which can be used to group objects based on their similarities. We make use of the WEKA tool for this analysis.

Preprocessing:
1. Remove all spam data by deleting records with just one page view.
2. There are about 500+ dimensions, which is not feasible to analyse, so for dimensionality reduction:
   a. Go to the "Select attributes" tab of WEKA.
   b. Manually: remove all session data, browser information, and the most common pages.
   c. Auto: calculate information gain and select the top 25 attributes.

Process:

Experiment I
In this experiment we make 5 clusters of the given instances (users) and analyse their purchase habits.

K-means steps:
1. Import the reduced dataset into WEKA.
2. Select SimpleKMeans.
3. Specify the number of clusters.
4. Set the distance function to Euclidean.
5. Specify k (number of clusters).
6. Click Start to generate the clusters.

Results:
(A) Using Euclidean distance

=== Run information ===
Scheme: weka.clusterers.SimpleKMeans -N 5 -A "weka.core.EuclideanDistance -R first-last" -I 500 -num-slots 1 -S 10
Relation: q3_added-weka.filters.unsupervised.attribute.Remove-V-R7,10,9,102,127,112,66,87,56,70,111,106,26,59,20,18,104,113,17,1,54,14,115,63,73,15
Instances: 1781
Attributes: 26
   City
   Customer_ID
   US_State
   Num_Sheer_Look_Product_Views
   Num_CT_Waist_Control_Views
   Num_PH_Category_Views
   Num_main_shopping_cart_Template_Views
   Num_Replenishable_Stock_Views
   Num_account_Template_Views
   Num_main_login2_Template_Views
   Num_Sandal_Foot_Views
   Num_HasDressingRoom_True_Views
   Num_Legwear_Product_Views
   Num_products_productDetailLegwear_Template_Views
   Num_DonnaKaran_Product_Views
   Num_AmericanEssentials_Product_Views
   Num_Basic_Product_Views
   Num_WDCS_Category_Views
   Num_Oroblu_Product_Views
   WhichDoYouWearMostFrequent
   Num_products_Template_Views
   Home_Market_Value
   Num_WAS_Category_Views
   Num_main_vendor_Template_Views
   Num_main_freegift_Template_Views
   Spend_over_$12_per_order_on_average
Test mode: evaluate on training data

=== Clustering model (full training set) ===

kMeans
======
Number of iterations: 12
Within cluster sum of squared errors: 4789.607514501406
Missing values globally replaced with mean/mode

Time taken to build model (full training data): 0.14 seconds

=== Model and evaluation on training set ===

Clustered Instances
0   263 ( 15%)
1   273 ( 15%)
2   368 ( 21%)
3   484 ( 27%)
4   393 ( 22%)
Observation:
- Cluster 0
  o High income
  o Spend on average more than $12 per order => potential customers (value)
  o Purchase nylon more than cotton (nylon is costlier than cotton)
  o Mostly view brand assortment pages (DonnaKaran – a costly fashion design brand)
  o Fewer visits to the free gift category
  o More visits to sandal foot products
- Cluster 1
  o More men's products than the other clusters => the cluster might contain more men
  o Frequent use of the search bar
  o Rich visitors; most of them have above-average home/asset value
- Cluster 2
  o General visitors
  o Buy cheap products
- Cluster 3
  o Interested mostly in offers/free gift products
  o Highest number of visits to the checkout page => potential customers (frequency)
- Cluster 4
  o No special pattern observed

Experiment II
We validate whether we can use page view data for identifying potential customers using clustering. We have labelled data on average purchase over $12, with 1368 instances labelled False and 413 instances labelled True.

=== Run information ===
Scheme: weka.clusterers.SimpleKMeans -N 2 -A "weka.core.EuclideanDistance -R first-last" -I 500 -num-slots 1 -S 10
Relation: q3_added-weka.filters.unsupervised.attribute.Remove-V-R7,10,9,102,127,112,66,87,56,70,111,106,26,59,20,18,104,113,17,1,54,14,115,63,73,15
Instances: 1781
Attributes: 26
   City
   Customer_ID
   US_State
   Num_Sheer_Look_Product_Views
   Num_CT_Waist_Control_Views
   Num_PH_Category_Views
   Num_main_shopping_cart_Template_Views
   Num_Replenishable_Stock_Views
   Num_account_Template_Views
   Num_main_login2_Template_Views
   Num_Sandal_Foot_Views
   Num_HasDressingRoom_True_Views
   Num_Legwear_Product_Views
   Num_products_productDetailLegwear_Template_Views
   Num_DonnaKaran_Product_Views
   Num_AmericanEssentials_Product_Views
   Num_Basic_Product_Views
   Num_WDCS_Category_Views
   Num_Oroblu_Product_Views
   WhichDoYouWearMostFrequent
   Num_products_Template_Views
   Home_Market_Value
   Num_WAS_Category_Views
   Num_main_vendor_Template_Views
   Num_main_freegift_Template_Views
Ignored:
   Spend_over_$12_per_order_on_average
Test mode: Classes to clusters evaluation on training data

=== Clustering model (full training set) ===

kMeans
======
Number of iterations: 5
Within cluster sum of squared errors: 4913.0928856548035
Missing values globally replaced with mean/mode

Time taken to build model (full training data): 0.05 seconds

=== Model and evaluation on training set ===

Clustered Instances
0    561 ( 31%)
1   1220 ( 69%)

Class attribute: Spend_over_$12_per_order_on_average
Classes to Clusters:
    0    1   <-- assigned to cluster
  402  966 | False
  159  254 | True

Cluster 0 <-- True
Cluster 1 <-- False

Incorrectly clustered instances: 656.0   36.8332 %

Observation: Only about 63% of the instances are correctly clustered, as the data is heavily biased toward the False class (less than $12 average spending). Still, clustering gives us good insight into purchase/page view patterns.
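The clustering above was carried out in the WEKA GUI; the same grouping can be reproduced in R, which is used elsewhere in this report. The following is a minimal sketch, assuming the reduced dataset has been exported from WEKA as q3_reduced.csv; the file name, the standardisation step and the seed are illustrative choices, not the settings of the original experiment:

   q3 <- read.csv("q3_reduced.csv")

   # keep only the numeric page-view counts (drop the ID column) and standardise them
   # so that no single attribute dominates the Euclidean distance
   num_cols  <- setdiff(names(q3)[sapply(q3, is.numeric)], "Customer_ID")
   q3_scaled <- scale(q3[, num_cols])

   set.seed(10)                              # analogous to WEKA's -S 10
   km <- kmeans(q3_scaled, centers = 5, iter.max = 500, nstart = 10)

   table(km$cluster)                         # cluster sizes, cf. "Clustered Instances"
   km$tot.withinss                           # within-cluster sum of squared errors

   # cross-tabulate clusters against the purchase label, cf. Experiment II;
   # read.csv turns "$" in the column name into "." (assumed label column present)
   table(km$cluster, q3[["Spend_over_.12_per_order_on_average"]])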
Association Rules

Objective: To identify some interesting patterns in the users' page views and also the killer pages.

Description: Association rule learning is a popular and well-researched method for discovering interesting relations between variables in large databases. It is intended to identify strong rules discovered in databases using different measures of interestingness. The measures used in our analysis are lift, confidence and support.

Procedure:
1. For mining association rules we first converted the dataset into a binary matrix indicating, for each session, whether the visitor viewed a given page or not.
2. The "arules" package provides the association rule algorithms:
   library(arules)
3. We then loaded the converted dataset into R for rule generation; we used only the important columns, based on the mean Gini index obtained from the randomForest result.
4. After loading the data, we convert it into transactions with the following command:
   dataTrans <- as(assoc, "transactions")
5. Now we apply the "apriori" algorithm to generate the rules; a parameter list with support, confidence and minimum length of every rule can be passed (a sketch with an explicit parameter list is given after the observation below):
   rules <- apriori(dataTrans)
   This generates all rules using the default minimum support of about 0.1 and minimum confidence of about 0.8. It also generates all the subset rules based on the frequent itemsets of attributes.
6. To know how many rules were generated:
   rules
   Around 377564 rules were generated, out of which we were interested only in rules having LEAVE or CONTINUE on the right-hand side, to check whether a person will continue or leave after seeing certain pages.
7. We retrieved a subset of rules from all the generated rules which showed some interesting patterns.

Observation:
We were able to see some interesting patterns in the generated rules. Since most of the visitors in our dataset were female, most of the rules contained "NUM_OF_WOMEN_PRODUCT_VIEWS" in nearly every transaction. Some brands were rarely or never visited according to the rules. We were also able to identify some of the killer pages based on user preferences, i.e. pages after which users tended to withdraw every time.
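For reference, the conversion in step 1 and the parameter list mentioned in step 5 can be written out explicitly. This is a minimal sketch, not the exact commands used in the experiment: the object names, the page-view column selection, the encoding of LEAVE/CONTINUE, and the threshold values are illustrative assumptions.

   library(arules)

   # Step 1 (sketch): binarise page-view counts per session, "Yes" = page visited at least once.
   # 'question1' is assumed to hold the pre-processed dataset with numeric view counts;
   # selecting columns whose name contains "Views" is an illustrative assumption.
   view_cols <- grep("Views", names(question1), value = TRUE)
   assoc <- as.data.frame(lapply(question1[, view_cols],
                                 function(x) factor(ifelse(x > 0, "Yes", "No"))))

   # LEAVE / CONTINUE items as they appear in the Results below; the exact derivation
   # from Session_Continues is an assumption
   assoc$CONTINUE <- factor(ifelse(question1$Session_Continues == "TRUE", "YES", "NO"))
   assoc$LEAVE    <- factor(ifelse(assoc$CONTINUE == "YES", "NO", "YES"))

   # Step 4: coerce to the transactions class used by arules
   dataTrans <- as(assoc, "transactions")

   # Step 5 with an explicit parameter list instead of the package defaults
   rules <- apriori(dataTrans, parameter = list(supp = 0.1, conf = 0.8, minlen = 2))

   # rules predicting LEAVE with lift > 1, sorted by lift (cf. Results)
   rulesLeave <- subset(rules, subset = rhs %pin% "LEAVE" & lift > 1)
   inspect(head(sort(rulesLeave, by = "lift"), 20))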
Results:
Some of the rules, sorted by their "lift" values, are shown below:

   rulesLeave <- subset(rules, subset = rhs %pin% "LEAVE")   # for getting the LEAVE rules
   inspect(head(sort(rulesLeave, by="lift"), 20))
O/p:
      lhs                                      rhs          support confidence lift
   1  {CONTINUE=YES}                        => {LEAVE=NO}   0.425   1          2.352941
   2  {Num_Women_Product_Views=Yes,
       CONTINUE=YES}                        => {LEAVE=NO}   0.110   1          2.352941
   17 {Num_Women_Product_Views=Yes,
       Num_Men_Product_Views=No,
       CONTINUE=YES}                        => {LEAVE=NO}   0.100   1          2.352941
   18 {Num_MAS_Category_Views=No,
       Num_Women_Product_Views=Yes,
       CONTINUE=YES}                        => {LEAVE=NO}   0.105   1          2.352941
   19 {Num_MDS_Category_Views=No,
       Num_Women_Product_Views=Yes,
       CONTINUE=YES}                        => {LEAVE=NO}   0.105   1          2.352941
   20 {Num_MCS_Category_Views=No,
       Num_Women_Product_Views=Yes,
       CONTINUE=YES}                        => {LEAVE=NO}   0.110   1          2.352941

Some of the interesting rules are shown above. Some randomly chosen rules are:

   inspect(head(rulesLeave, 6))
O/p:
      lhs                                      rhs          support confidence lift
   3  {Num_Women_Product_Views=Yes,
       CONTINUE=YES}                        => {LEAVE=NO}   0.110   1          2.352941
   4  {Num_Women_Product_Views=Yes,
       CONTINUE=NO}                         => {LEAVE=YES}  0.100   1          1.739130
   5  {Num_Women_Product_Views=No,
       CONTINUE=YES}                        => {LEAVE=NO}   0.315   1          2.352941
   6  {Num_CT_Waist_Control_Views=No,
       CONTINUE=YES}                        => {LEAVE=NO}   0.360   1          2.352941

For CONTINUE some examples are:

   inspect(head(rulesContinue, 4))
O/p:
      lhs                                      rhs             support confidence lift
   3  {Num_Women_Product_Views=Yes,
       LEAVE=NO}                            => {CONTINUE=YES} 0.110   1          2.352941
   4  {Num_Women_Product_Views=Yes,
       LEAVE=YES}                           => {CONTINUE=NO}  0.100   1          1.739130
   5  {Num_Women_Product_Views=No,
       LEAVE=NO}                            => {CONTINUE=YES} 0.315   1          2.352941
   6  {Num_CT_Waist_Control_Views=No,
       LEAVE=NO}                            => {CONTINUE=YES} 0.360   1          2.352941

Conclusion: We were able to find some interesting patterns in the users' page views and were able to identify some of the killer pages, such as "Num_CT_Waist_Control_Views" and "Num_MAS_Category_Views".