DB 297C Data Analytics – Project Report
Term I (2013-14)

Group Information
GROUP NO: 11
TEAM MEMBERS:
Bisen Vikratsingh Mohansingh - MT2012036
Kodamasimham Pridhvi - MT2012066
Vaibhav Singh Rajput - MT2012145

Dataset Description
Blue Martini Software approached several clients using its Customer Interaction System to
volunteer their data; the data used here comes from a small dot-com company called Gazelle.com,
a legwear and legcare retailer. The data was made available in two formats: original data and
aggregated data. Among the data collected by the Blue Martini application server, the following
three categories are the relevant data:
 Customer information, which includes customer ID, registration information, and
registration form questionnaire responses.
 Order information, such as the order header, which includes date/time, discount, tax, total
amount, payment, shipping, status and session ID, and the order line, which includes quantity,
price, product, date/time, assortment, and status.
 Clickstream information: the session, which includes starting and ending date/time, cookie,
browser, referrer, visit count, and user agent, and the page view, which includes date/time,
sequence number, URL, processing time, product, and assortment.

Initial dataset: a 25MB .csv file (14000 rows x 296 columns). Using SQL queries we removed
columns containing only NULLs, leaving 208 columns. We then removed crawler records and records
with just one page view (i.e. Session_time_elasped = 0.0), giving a new row count of ~5000. We
manually removed a few more irrelevant columns (browser/OS, day, date info, etc.), bringing the
count to 128 columns, and, keeping only the most frequently visited pages, removed columns whose
sum(visit) < 10, leaving 113 columns.

Final size:
6.5MB, ~5000 x 113

Summary of top 5 observations
Rule Based Classification:
Rule-based methods, rule discovery or rule extraction from data, are data mining techniques
aimed at understanding data structures, providing comprehensible description instead of only
black-box prediction. Sets of rules are useful if rules are not too numerous, comprehensible,
and have sufficiently high accuracy.
From the result of the experiment we can see the rules being generated; some sample rules are
shown in the documentation, and a total of 182 rules were generated. The class to which a rule
belongs is shown at the end of each rule, along with the number of rows covered / number of
misclassifications for that particular rule.
Association Rules:
Association rules were selected based on two factors, lift and support: rules having lift greater
than 1 and at least the minimum support of 0.1 used in the apriori run (see the Association Rules
appendix). A total of 377564 rules were generated, out of which we applied filters and selected a
few rules that showed some interesting patterns.

Result 1
The rule-based classification generates a set of rules on which classification takes place; some
of the rules from the generated model are shown below:
q2_part
O/p: PART decision list
Num_EllenTracy_Product_Views <= 1 AND Num_main_assortment_Template_Views > 1 AND
Num_main_lifestyles_Template_Views > 1: FALSE (38.0/3.0)
Num_main_assortment_Template_Views > 0 AND Num_main_shopping_cart_Template_Views <= 0
AND Num_LifeStyles_Assortment_Views <= 1: TRUE (7.0)
Num_UniqueBoutiques_Assortment_Views <= 0 AND Num_main_vendor_Template_Views <= 1 AND
Num_Brands_Assortment_Views <= 1: FALSE (145.0/71.0)
Num_articles_dpt_about_Template_Views > 0 AND Num_BrandOrder_Assortment_Views <= 1 AND
Num_articles_Template_Views <= 1: TRUE (73.0/22.0)

These are some of the sample rules generated by the model, which was then used for prediction on
the test data. A successful rule-based model was built, with an accuracy above 70%, for
identifying whether a user will continue his session or not.
Results can be found at Rule Based Classification.

Result 2
Five clusters of users plotted against whether their average amount spent per order is above $12.
It is clearly observable from the figure that most of the users in cluster 0 are high spenders,
whereas cluster 2 contains the fewest high spenders.

Result 3
Another striking observation is that high-spending users are the least interested in offers/free
gifts. The image below shows a plot of clusters against the number of visits to the offer/free gift page.

Result 4
The highest-spending users (cluster 0, as concluded in Result 2) were found to be mostly
interested in branded products, which is why their average spending is more than $12. The image
below shows a plot of clusters against DonnaKaran (a fashion design brand) product views. The
opposite holds for our lowest-spending cluster, cluster 2.

Result 5
After applying association rules, some interesting rules were generated which were very useful for
predicting which users are likely to continue and which are not, and what the killer pages were in
most cases. Due to the large number of rules generated, we were unable to go through all of them.
Some of the graphs were:

Scatter plot of rules generated based on support, confidence and lift.
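A scatter plot like this can be generated with the arulesViz package; the lines below are only a
minimal sketch, assuming 'rules' holds the rule set produced by the apriori run described in the
Association Rules appendix:

library(arules)
library(arulesViz)
# Scatter plot of all rules: support on the x-axis, confidence on the y-axis, lift as colour shading.
plot(rules, measure = c("support", "confidence"), shading = "lift")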
Some of the interesting patterns were:
2  {Num_Women_Product_Views=Yes, CONTINUE=YES} => {LEAVE=NO}   support 0.110, confidence 1, lift 2.352941
17 {Num_Women_Product_Views=Yes, Num_Men_Product_Views=No, CONTINUE=YES} => {LEAVE=NO}   support 0.100, confidence 1, lift 2.352941
18 {Num_MAS_Category_Views=No, Num_Women_Product_Views=Yes, CONTINUE=YES} => {LEAVE=NO}   support 0.105, confidence 1, lift 2.352941
19 {Num_MDS_Category_Views=No, Num_Women_Product_Views=Yes, CONTINUE=YES} => {LEAVE=NO}   support 0.105, confidence 1, lift 2.352941

From the rules we can see that women's products were the most viewed, while pages like the MDS
category and MAS category were killer pages in which no one was interested.
Results can be observed here: Association Rule Results.

APPENDIX – collection of individual experiment reports (not more than
15)

Data Cleaning/Filtering
DATA PRE-PROCESSING
Description of Dataset:
Blue Martini Software approached several clients using its Customer Interaction System to volunteer their
data; the data used here comes from a small dot-com company called Gazelle.com, a legwear and legcare
retailer. The data was made available in two formats: original data and aggregated data.
Among the data collected by the Blue Martini application server, the following three categories are the
relevant data:
 Customer information, which includes customer ID, registration information, and registration form
questionnaire responses.
 Order information like Order header, which includes date/time, discount, tax, total amount, payment,
shipping, status, session ID, order line, which includes quantity, price, product, date/time, assortment, and
status.
 Clickstream information session, which includes starting and ending date/time, cookie, browser, referrer,
visit count, and user agent, page view, which includes date/time, sequence number, URL, processing time,
product, and assortment.

Steps for Pre-processing:
 Initial dataset .csv file -> 25MB (14000 rows x 296 columns)
 Loaded the .csv into an RDBMS using the Python scripts discussed below
 With SQL queries removed columns containing only NULLs
o total no. of columns = 208
 Removed records of crawlers
 Removed records with just one page view, i.e. Session_time_elasped = 0.0
o new row count = ~5000
 Manually removed a few more irrelevant columns like browser/os, day, date info, etc.
o now no. of columns = 108
 Removed columns whose sum(visit) < 10
o new no. of columns = 103

Final size:
6.5MB, ~5000 x 103
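For reference, the same NULL-column, single-page-view and low-visit filtering could also be
performed directly in R once the raw CSV is available. The lines below are only an illustrative
sketch: the file names are assumptions, and the Num_ prefix is taken from the column names used
later in this report.

raw <- read.csv("raw_dataset.csv")                          # assumed name of the raw export
raw <- raw[, colSums(is.na(raw)) < nrow(raw)]               # drop columns that are entirely NULL/NA
raw <- raw[raw$Session_time_elasped > 0.0, ]                # drop sessions with only one page view
visit_cols <- grep("^Num_", names(raw))                     # page-view count columns
low <- visit_cols[colSums(raw[, visit_cols], na.rm = TRUE) < 10]
if (length(low) > 0) raw <- raw[, -low]                     # drop rarely visited pages (sum(visit) < 10)
write.csv(raw, "dataset.csv", row.names = FALSE)            # file read back in the experiments below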

import sys

# Read the Names File and write one cleaned column name per line to the output file.
def readfile(fname, oname):
    f = file(fname)
    while True:
        line = f.readline()
        if len(line) == 0:
            break
        stri = line.split(':')
        outfile = file(oname, 'a')
        if not stri[0].startswith("|"):              # skip comment lines in the Names File
            mystr = stri[0].replace(" ", "_")
            if any((c in mystr) for c in '*&/'):     # replace special characters with underscores
                for i in ['*', '&', '/']:
                    mystr = mystr.replace(i, "_")
            if not any((c in mystr) for c in '*&|'):
                mystr = mystr.replace(".", "")
                outfile.write(mystr.strip())
                outfile.write("\n")
        outfile.close()
    f.close()

if len(sys.argv) < 3:
    print 'No file specified'
    sys.exit()
else:
    readfile(sys.argv[1], sys.argv[2])

Python Scripts:

The dataset was given in two files:
o Data File
o Names File
In the above script we take the Names File from the user and retrieve the column names for our
dataset by removing all special characters and writing them to another file for further processing of the names.
>> python read.py names_file_name output_file_name
In output_file_name we will get the column names, one per line.

import sys

# Build an SQL script that creates the database and a table with one VARCHAR column
# per name listed in the file produced by read.py.
def makeScript(fname, oname):
    with open(fname, 'r') as file_contents:
        contents = file_contents.read()
    my_data = contents.split("\n")
    outfile = file(oname, 'a')
    input_db = raw_input("Enter Database Name: ")
    createDatabase = "CREATE DATABASE " + input_db + ";\n"
    useDatabase = "USE " + input_db + ";\n"
    input_table = raw_input("Enter Table Name: ")
    columns = ""
    count = 0
    for data in my_data:
        count = count + 1
        if data:
            columns += data + " VARCHAR(100) DEFAULT NULL ,\n"
    createTable = "CREATE TABLE " + input_table + "(" + columns + ") ;\n"
    createTable = createTable.replace("/", "")
    k = createTable.rfind(',')                      # drop the trailing comma after the last column
    createTable = createTable[:k] + " " + createTable[k+1:]
    print createTable
    print count
    outfile.write(createDatabase)
    outfile.write(useDatabase)
    outfile.write(createTable)
    outfile.close()

if len(sys.argv) < 3:
    print 'No file specified'
    sys.exit()
else:
    makeScript(sys.argv[1], sys.argv[2])

In the above Python file, we use the file created earlier by read.py as the input for this script.
Here we create an SQL script for building a table with the respective columns, each with a data
type of VARCHAR, so that we can load the file into the DBMS for further processing.
>> python Createscript.py output_file_name script.sql
We will get an SQL script for creating the table with the retrieved columns.

import MySQLdb

# Read the Data File line by line and insert each row into the table created above.
myfile = open("path where required csv is there", 'r')
db = MySQLdb.connect(host="localhost",   # your host, usually localhost
                     user="root",        # your username
                     passwd="root",      # your password
                     db="da1")           # name of the database
cur = db.cursor()
for line in myfile:
    print line
    my_line_list = line.split(',')
    string = ""
    for value in my_line_list:
        string = string + "'" + str(value) + "',"
    query_string = string[:-1]           # drop the trailing comma
    final_query = "insert into question1 values" + "(" + query_string + ");"
    cur.execute(final_query)
db.commit()                              # make the inserts permanent

Here we open a connection to the database, read data from the Data File, and write it into the
table we created earlier in the database; the rows are inserted directly while reading from the file.
>> python da.py
Here we need to specify, inside the file, the path where our .csv Data File exists; the script will
then read from the csv and insert the rows into the database.
After entering the data into the DBMS we remove the columns with all NULL values using simple
SQL queries, as mentioned above.
After performing the data pre-processing on the given dataset, we export the table into a .csv file
for performing analysis using R.
Now we will analyze the data using R.

Classification
Random Forest
Objective:
To generate a model for building decision tree and to identify important features using random forest.

Description:
Random forests are an ensemble learning method for classification (and regression) that operate by
constructing a multitude of decision trees at training time and outputting the class that is the mode of the
classes output by individual trees.

Procedure:
1. After the data preprocessing, the dataset is loaded into the R environment using:
question1 <- read.csv("dataset.csv")
dim(question1) # 5220 103 ----- number of rows and columns

2. After loading the dataset we divide it into 70% as trainDataset and 30% as testDataset, as below:
div <- sample(2, nrow(question1), replace=T, prob=c(0.7,0.3))
This generates two samples from our dataset having 70% and 30% of the rows, with no overlap of
rows. 'nrow' gives the number of rows in our dataset, 'prob' decides the division ratio, and
'sample' assigns the number 1 or 2 to each row according to the sample that row belongs to. To
create trainData the command is:
trainData <- question1[div==1,]
dim(trainData) # 3670 103 --- dimensions of trainDataset
This copies into trainData all the rows of the dataset that are marked 1 by sample. Similarly for testData:
testData <- question1[div==2,]
dim(testData) # 1550 103 --- dimensions of testDataset
3. After generating trainData and testData we load the required package 'randomForest' into R:
library(randomForest)
4. Defining the target variable and independent variables in the formula to be used in the
generation of the model, as below:
myformula <- Session_Continues ~ .
Here 'Session_Continues' is the target variable, with classes 'true' and 'false', and all
remaining columns are the independent variables on whose basis the target variable is classified,
as represented by '~ .'.

5. After the formula is decided we apply it to generate the model based on our trainData, as
below, using the function 'randomForest', and store the model in 'rf':
rf <- randomForest(myformula, data=trainData, ntree=100, proximity=T)
'ntree' specifies how many trees the algorithm has to grow to get an accurate model;
'proximity=T' asks randomForest to also compute the proximity matrix between observations, i.e.
how often pairs of cases end up in the same terminal node.
6. We can see the classification result by:
--> rf
output:
Call:
 randomForest(formula = myformula, data = trainData, ntree = 100, proximity = T)
               Type of random forest: classification
                     Number of trees: 100
No. of variables tried at each split: 10

        OOB estimate of error rate: 34.17%
Confusion matrix:
      False True class.error
False  2280  173  0.07052589
True   1081  136  0.88824979
From the result we can see that we are getting an OOB error of about 34%.
7. For seeing a generated tree of the forest:
--> getTree(rf, 1)
Output:
  left daughter right daughter split var status prediction
1             2              3         3      1          0
2             4              5        29      1          0
3             6              7       105      1          0
4             8              9        98      1          0
5            10             11        36      1          0
6            12             13        34      1          0
7             0              0         0     -1          2

If the status is -1 then that node is a leaf of the decision tree, and the prediction (1 or 2)
gives the class into which it is classified. We can get the information of any tree using the
above command, just by specifying the randomForest object 'rf' and the tree number n, which in our
case is 1 <= n <= 100.
8. We can plot the error rates of the generated trees by:
plot(rf)
We will get a graph as shown in figure (1) in the observations.
9. We can also find the features that contribute most to the decision trees using:
importance(rf)
It gives each feature and its mean decrease in Gini index, from which we can see and decide which
are the essential features that affect our decision trees.
10. We can access the many attributes generated by randomForest:
attributes(rf)
output:
$names
 [1] "call"            "type"            "predicted"       "err.rate"
 [5] "confusion"       "votes"           "oob.times"       "classes"
 [9] "importance"      "importanceSD"    "localImportance" "proximity"
[13] "ntree"           "mtry"            "forest"          "y"
[17] "test"            "inbag"           "terms"

$class
[1] "randomForest.formula" "randomForest"

11. Now, using the model generated from trainData, we apply that model to testData for prediction,
as below:
testpredict <- predict(rf, newdata=testData)
output (the predictions cross-tabulated against the actual classes):
testpredict False True
      False   952  473
      True     52   73
Here we use the model 'rf' with the dataset 'testData' and store the result in a variable as above.
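The table above can be turned into an overall accuracy figure directly in R; a small sketch (it
simply re-computes the cross-tabulation and the fraction of matching predictions):

conf <- table(testpredict, testData$Session_Continues)   # predictions vs. actual classes
sum(diag(conf)) / sum(conf)                               # overall accuracy on the test set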

Observations:
The plot of the model error rates is:
We can see that as the number of trees increases, the error rate decreases and the model is able to classify more accurately.

Result of importance(rf):
Column                                   MeanGiniIndex
Num_BrandOrder_Assortment_Views          2.226234e+01
Num_UniqueBoutiques_Assortment_Views     3.372542e+01
Num_Brands_Assortment_Views              2.324481e+01
Num_Departments_Assortment_Views         2.224833e+01
Num_LifeStyles_Assortment_Views          1.265545e+01
Num_main_Template_Views                  4.466076e+01
Num_products_Template_Views              1.143197e+01
Num_articles_Template_Views              1.943811e+01
Num_main_home_Template_Views             2.347768e+01
So from the result above we can observe that, out of the 103 features, only some contribute
strongly to the model, so we can further reduce the dataset and achieve better results.
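For reference, this ranked view can be produced directly from the fitted model; a minimal sketch
using helpers from the randomForest package (the top-10 cut-off is arbitrary):

imp <- importance(rf)                                               # matrix with a MeanDecreaseGini column
imp_sorted <- imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), , drop = FALSE]
head(imp_sorted, 10)                                                # ten most influential features
varImpPlot(rf)                                                      # dot chart of variable importance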
WEKA OUTPUT:
We also ran attribute selection on the above dataset in Weka; the result is:
Num_Hanes_Product_Views
Num_Cotton_Product_Views
Num_Nylon_Product_Views
Num_BrandOrder_Assortment_Views
Num_UniqueBoutiques_Assortment_Views
Num_LifeStyles_Assortment_Views
Num_main_Template_Views
Num_articles_Template_Views
Num_main_home_Template_Views
Num_main_vendor_Template_Views
Num_articles_dpt_about_mgmtteam_Template_Views
Num_main_cust_serv_Template_Views
The two outputs match closely, so we were able to find the factors that most influence our
decision trees.

Conclusion:
We will use the above features, which have the major influence on the decision trees of the random
forest, as the independent variables for the actual construction of a decision tree using the
'party' package, for better results and classification.
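Since this feature list becomes the right-hand side of the model formula in the next experiment,
the formula can also be built programmatically instead of being typed out by hand; a minimal
sketch (only a few of the names are shown in the vector, the remaining ones follow the same pattern):

selected <- c("Num_Hanes_Product_Views", "Num_Cotton_Product_Views",
              "Num_Nylon_Product_Views")                 # ...plus the other selected features above
myformula_reduced <- reformulate(selected, response = "Session_Continues")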

Party Decision Tree
Objective:
To build a decision tree using the features identified by Random Forest using the “party” package.

Description:
A computational toolbox for recursive partitioning. The core of the package is ctree(), an
implementation of conditional inference trees which embed tree-structured regression models into
a well-defined theory of conditional inference procedures. This non-parametric class of regression
trees is applicable to all kinds of regression problems, including nominal, ordinal, numeric, censored
as well as multivariate response variables and arbitrary measurement scales of the covariates.

Procedure:
1. Loading the dataset into R:
question1_reduced <- read.csv("q2_reduced.csv")
dim(question1_reduced) # 5220 103
2. Dividing the dataset into a training dataset and a test dataset:
div <- sample(2, nrow(question1_reduced), replace=T, prob=c(0.70,0.30))
3. Storing the trainData and testData in variables for analysis:
trainData_reduced <- question1_reduced[div==1,]
dim(trainData_reduced) # 3675 103
testData_reduced <- question1_reduced[div==2,]
dim(testData_reduced) # 1545 103
4. Defining the formula, based on the features identified by the random forest, for the target variable:
myformula_reduced <- Session_Continues ~ Num_Hanes_Product_Views +
Num_Cotton_Product_Views + Num_Nylon_Product_Views +
Num_BrandOrder_Assortment_Views + Num_UniqueBoutiques_Assortment_Views +
Num_LifeStyles_Assortment_Views + Num_main_Template_Views +
Num_articles_Template_Views + Num_main_home_Template_Views +
Num_main_vendor_Template_Views + Num_articles_dpt_about_mgmtteam_Template_Views
+ Num_main_cust_serv_Template_Views

The features are the result of the importance factor found during the randomForest.
5. Now loading the 'party' package for analysis:
library(party)
6. Applying the recursive partitioning (ctree) algorithm to the training data based on the above formula:
trainData_ctree <- ctree(myformula_reduced, data=trainData_reduced)
7. Now, to see the generated model:
print(trainData_ctree)

This will show the features used in building the decision tree and also how the decision tree is
built, as below:
output:
Conditional inference tree with 3 terminal nodes
Response: Session_Continues
Inputs: Num_Hanes_Product_Views, Num_Cotton_Product_Views, Num_Nylon_Product_Views,
Num_BrandOrder_Assortment_Views, Num_UniqueBoutiques_Assortment_Views,
Num_LifeStyles_Assortment_Views, Num_main_Template_Views,
Num_articles_Template_Views, Num_main_home_Template_Views,
Num_main_vendor_Template_Views, Num_articles_dpt_about_mgmtteam_Template_Views,
Num_main_cust_serv_Template_Views
Number of observations: 3675
1) Num_main_home_Template_Views <= 1; criterion = 0.999, statistic = 16.455
2) Num_articles_dpt_about_mgmtteam_Template_Views <= 0; criterion = 0.984, statistic =
10.226
3)* weights = 2607
2) Num_articles_dpt_about_mgmtteam_Template_Views > 0
4)* weights = 245
1) Num_main_home_Template_Views > 1
5)* weights = 823
8. For visualizing the tree graphically:
plot(trainData_ctree, type="simple")
We will get the graph of the decision tree as shown below in the observation.
9. Now applying the model to the testData set:
testpred_reduced <- predict(trainData_ctree, newdata=testData_reduced)
10. For checking the accuracy of the prediction:
table(testpred_reduced, testData_reduced$Session_Continues)
The output shows the prediction rate; as the prediction has many errors, this is not a suitable
method for decision tree generation here.
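The misclassification rate quoted in the conclusion below can be read off this table; a small sketch:

conf_ctree <- table(testpred_reduced, testData_reduced$Session_Continues)
1 - sum(diag(conf_ctree)) / sum(conf_ctree)      # overall misclassification rate on the test set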

Observation:
The graph of the decision tree obtained is:

From the graph we can see that only two factors are being considered by this ctree() algorithm.

WEKA:
The decision tree generated in Weka based on the above factors is:

Conclusion
Results were not satisfactory; there was a large rate of error and misclassification of the data.
Around 50% of the data was not correctly classified using the "party" package.

Rule Based Classification
Objective
To generate a model for building rules and to classify data based on the rules being satisfied.

Description
Rule-based methods, rule discovery or rule extraction from data, are data mining techniques aimed
at understanding data structures, providing a comprehensible description instead of only black-box
prediction. Rule-based systems should expose in a comprehensible way the knowledge hidden in data,
providing logical justification for drawing conclusions, showing possible inconsistencies and
avoiding unpredictable conclusions that black-box predictors may generate in atypical situations.
Sets of rules are useful if the rules are not too numerous, comprehensible, and have sufficiently
high accuracy.

Procedure
1. The data has already been loaded into R and divided into a training dataset and a test dataset,
so we can apply the rule-based classification directly on the training dataset.
2. For applying rule-based classification we have to install the package "RWeka", which makes the
algorithms of the Weka tool available in R.
3. We will be using the "PART" rule-based classifier from Weka for generating the rules of our
training dataset, based on which we will classify our test dataset.
library(RWeka)
The above command loads the "RWeka" package into the R environment.
4. Now we will apply the PART algorithm on the training Dataset for obtaining the rules,
q2_part <- PART(Session_Continues ~ ., data = q2_train)
The above command takes the training dataset "q2_train" and applies the "PART" algorithm with
Session_Continues as the target variable and all remaining columns as independent variables.
5. A model is built by the previous command, which is then used for classifying the test dataset as below:
q2_pre <- evaluate_Weka_classifier(q2_part, newdata=q2_test)
Here we use the model generated from the training data to classify the test data;
"evaluate_Weka_classifier", a function provided by the RWeka interface to Weka, applies the model
to the test data and summarises the classification.
6. For seeing the result,
q2_pre
OUTPUT:

=== Summary ===
Correctly Classified Instances          880               67.433 %
Incorrectly Classified Instances        425               32.567 %
Kappa statistic                           0
Mean absolute error                       0.4447
Root mean squared error                   0.4689
Relative absolute error                  99.9541 %
Root relative squared error              99.9958 %
Coverage of cases (0.95 level)          100 %
Mean rel. region size (0.95 level)      100 %
Total Number of Instances              1305

=== Confusion Matrix ===
   a   b   <-- classified as
 734 124 |  a = FALSE
 301 136 |  b = TRUE
From the result we can see that we are getting a classification rate of around 68% - 72%, which is
a better rate than the decision tree.
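Besides the summary produced by evaluate_Weka_classifier, per-instance predictions can also be
obtained from the fitted PART model through the usual predict() interface; a minimal sketch:

q2_pred_class <- predict(q2_part, newdata = q2_test)       # predicted class for each test session
table(q2_pred_class, q2_test$Session_Continues)            # confusion matrix comparable to the one above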

Observation
The rule-based classification generates a set of rules on which classification takes place; we can
see the set of rules from the generated model as below:
q2_part
O/p: PART decision list
Num_EllenTracy_Product_Views <= 1 AND Num_main_assortment_Template_Views > 1 AND
Num_main_lifestyles_Template_Views > 1: FALSE (38.0/3.0)
Num_main_assortment_Template_Views > 0 AND Num_main_shopping_cart_Template_Views <= 0
AND Num_LifeStyles_Assortment_Views <= 1: TRUE (7.0)
Num_UniqueBoutiques_Assortment_Views <= 0 AND Num_main_vendor_Template_Views <= 1 AND
Num_Brands_Assortment_Views <= 1: FALSE (145.0/71.0)
Num_articles_dpt_about_Template_Views > 0 AND Num_BrandOrder_Assortment_Views <= 1 AND
Num_articles_Template_Views <= 1: TRUE (73.0/22.0)
From the result you can see the rules being generated; we have shown some sample rules here, and a
total of 182 rules were generated. At the end of each rule the class it belongs to is shown, along
with the number of rows covered / number of misclassifications for that particular rule.

Conclusion
So from the above observations and results we can see that a successful rule-based model was
built, with an accuracy above 70%, for identifying whether a user will continue his session or not.

Clustering
Objective:
To group website visitors whose page-view patterns are similar and identify their interests.

Approach:
Clustering is a methodology in data analysis which can be used to group objects based on their
similarities. We are making use of the WEKA tool for this analysis.

Preprocessing:
1. Remove all spam data by deleting records with just one page view
2. There are about 500+ dimensions, which is not feasible to analyze, so for dimensionality reduction:
a. Go to the Select Attributes tab of WEKA
b. Manual – remove all session data, browser information, and the most common pages
c. Auto – calculate information gain and select the top 25 attributes

Process:
Experiment I
In this experiment we make 5 clusters of the given instances (users) and analyze their purchase habits.
K-means
Steps:
1. Import the reduced dataset into Weka
2. Select SimpleKMeans
3. Specify the number of clusters
4. Set the distance function to Euclidean
5. Specify k (the number of clusters)
6. Click on Start to generate the clusters
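The same experiment can also be reproduced outside the Weka GUI, for example in R with kmeans();
the sketch below is illustrative only and assumes the reduced 26-attribute dataset has been
exported to a CSV (the file name is an assumption) and that only the numeric page-view columns are used:

q3 <- read.csv("q3_reduced.csv")                    # assumed export of the reduced dataset
num_cols <- sapply(q3, is.numeric)
X <- scale(na.omit(q3[, num_cols]))                 # standardise the numeric page-view counts
set.seed(10)                                        # analogous to Weka's -S 10 seed
km <- kmeans(X, centers = 5, iter.max = 500, nstart = 10)
table(km$cluster)                                   # cluster sizes, comparable to Weka's output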

Results:
(A) Using Euclidean distance
=== Run information ===
Scheme:
weka.clusterers.SimpleKMeans -N 5 -A "weka.core.EuclideanDistance -R first-last" -I 500 num-slots 1 -S 10
Relation: q3_added-weka.filters.unsupervised.attribute.Remove-VR7,10,9,102,127,112,66,87,56,70,111,106,26,59,20,18,104,113,17,1,54,14,115,63,73,15
Instances: 1781
Attributes: 26
City
Customer_ID
US_State
Num_Sheer_Look_Product_Views
Num_CT_Waist_Control_Views
Num_PH_Category_Views
Num_main_shopping_cart_Template_Views
Num_Replenishable_Stock_Views
Num_account_Template_Views
Num_main_login2_Template_Views
Num_Sandal_Foot_Views
Num_HasDressingRoom_True_Views
Num_Legwear_Product_Views
Num_products_productDetailLegwear_Template_Views
Num_DonnaKaran_Product_Views
Num_AmericanEssentials_Product_Views
Num_Basic_Product_Views
Num_WDCS_Category_Views
Num_Oroblu_Product_Views
WhichDoYouWearMostFrequent
Num_products_Template_Views
Home_Market_Value
Num_WAS_Category_Views
Num_main_vendor_Template_Views
Num_main_freegift_Template_Views
Spend_over_$12_per_order_on_average
Test mode: evaluate on training data

=== Clustering model (full training set) ===

kMeans
======
Number of iterations: 12
Within cluster sum of squared errors: 4789.607514501406
Missing values globally replaced with mean/mode
Time taken to build model (full training data) : 0.14 seconds
=== Model and evaluation on training set ===
Clustered Instances
0    263 ( 15%)
1    273 ( 15%)
2    368 ( 21%)
3    484 ( 27%)
4    393 ( 22%)
Observation:
 Cluster 0
o High income
o Spend avg >12$ => potential customer (value)
o Purchase nylon more than cotton (nylon is costlier than cotton)
o Mostly view brands assortment page (DonnaKaran – costly fashion design brand)
o Less visit to free gift category
o More visit to sandal foot
 Cluster 1
o More men's product views than in the other clusters => this cluster might contain more men
o Frequent use of the search bar
o Rich visitors, most of them having above-average home/asset value
 Cluster 2
o General visitor
o Buy cheap products
 Cluster 3
o Interested mostly in offers/free gift products
o Highest visit to checkout page => potential customer (frequency)
 Cluster 4
o No special pattern observed
Experiment II
In this experiment we validate whether page-view data can be used to identify potential customers
using clustering. We have labelled data on whether the average purchase exceeds $12, with 413
instances labelled true and 1368 labelled false.
=== Run information ===
Scheme:
weka.clusterers.SimpleKMeans -N 2 -A "weka.core.EuclideanDistance -R first-last" -I 500 num-slots 1 -S 10
Relation: q3_added-weka.filters.unsupervised.attribute.Remove-VR7,10,9,102,127,112,66,87,56,70,111,106,26,59,20,18,104,113,17,1,54,14,115,63,73,15
Instances: 1781
Attributes: 26
City
Customer_ID
US_State
Num_Sheer_Look_Product_Views
Num_CT_Waist_Control_Views
Num_PH_Category_Views
Num_main_shopping_cart_Template_Views
Num_Replenishable_Stock_Views
Num_account_Template_Views
Num_main_login2_Template_Views
Num_Sandal_Foot_Views
Num_HasDressingRoom_True_Views
Num_Legwear_Product_Views
Num_products_productDetailLegwear_Template_Views
Num_DonnaKaran_Product_Views
Num_AmericanEssentials_Product_Views
Num_Basic_Product_Views
Num_WDCS_Category_Views
Num_Oroblu_Product_Views
WhichDoYouWearMostFrequent
Num_products_Template_Views
Home_Market_Value
Num_WAS_Category_Views
Num_main_vendor_Template_Views
Num_main_freegift_Template_Views
Ignored:
Spend_over_$12_per_order_on_average
Test mode: Classes to clusters evaluation on training data
=== Clustering model (full training set) ===
kMeans
======
Number of iterations: 5
Within cluster sum of squared errors: 4913.0928856548035
Missing values globally replaced with mean/mode
Time taken to build model (full training data) : 0.05 seconds
=== Model and evaluation on training set ===
Clustered Instances
0     561 ( 31%)
1    1220 ( 69%)

Class attribute: Spend_over_$12_per_order_on_average
Classes to Clusters:
0 1 <-- assigned to cluster
402 966 | False
159 254 | True
Cluster 0 <-- True
Cluster 1 <-- False
Incorrectly clustered instances : 656.0   (36.8332 %)

Observation:
Only about 63% of the data is correctly clustered, as the data is heavily biased toward the False
(< $12 average spending) class. But clustering gives us good insight into purchase/page-view patterns.

Association Rules
Objective:
To identify some interesting patterns in the users' page views and also the killer pages.

Description:
Association rule learning is a popular and well researched method for discovering interesting
relations between variables in large databases. It is intended to identify strong rules discovered
in databases using different measures of interestingness. Measures used in our analysis are lift,
confidence and support.

Procedure:
1. To generate the association rules we first needed to convert the dataset into a binary matrix
indicating, for each session, whether he/she visited a given page or not (a sketch of this
conversion is given after this procedure).
2. For mining the association rules the "arules" package is available:
library(arules)
3. Now loading the converted dataset into R for the generation of rules; we used the important
columns based on the mean Gini index obtained from the randomForest result.
4. After loading the data, we convert it into transactions with the following command:
dataTrans <- as(assoc, "transactions")
5. Now we apply the "apriori" algorithm to generate the rules, where we can pass a parameter list
giving the support, confidence and minimum length of every rule:
rules <- apriori(dataTrans)
This generates all rules using the defaults of min-support = 0.1 and min-confidence = 0.8. It also
generates all the subset rules based on the frequent itemsets of attributes.
6. To know how many rules were generated we can inspect:
rules
Around 377564 rules were generated, out of which we were interested only in rules having LEAVE or
CONTINUE on the RHS, to check whether a person will continue or leave after seeing certain pages.
7. We retrieved a subset of rules, from all the generated rules, which showed some interesting patterns.
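The conversion mentioned in step 1 and a parameterised version of the apriori call from step 5 are
sketched below; this is illustrative only: the input file name is an assumption, and it presumes
that all remaining columns (such as CONTINUE and LEAVE) are already categorical.

library(arules)
assoc <- read.csv("q4_sessions.csv", stringsAsFactors = TRUE)    # assumed per-session export
view_cols <- grep("^Num_", names(assoc))
assoc[view_cols] <- lapply(assoc[view_cols],
                           function(x) factor(ifelse(x > 0, "Yes", "No")))   # visited / not visited
dataTrans <- as(assoc, "transactions")                            # one transaction per session
rules <- apriori(dataTrans, parameter = list(support = 0.1, confidence = 0.8, minlen = 2))
rules                                                             # prints how many rules were generated
rulesLeave <- subset(rules, subset = rhs %pin% "LEAVE")           # rules whose RHS involves LEAVE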

Observation:
We were able to see some interesting patterns in the generated rules. For example, most of the
persons in our dataset were female, so most of the rules contained "Num_Women_Product_Views" in
possibly every transaction. Some of the brands were rarely or never visited according to the
rules. We were also able to identify some of the killer pages based on user preferences, i.e.
pages after visiting which users withdrew at the same point every time.

Results:
Some of the rules sorted based on the “lift” values are as below:
rulesLeave <- subset(rules, subset = rhs %pin% "LEAVE")   # for getting the rules
inspect(head(sort(rulesLeave,by="lift"),20))
O/p:
1  {CONTINUE=YES} => {LEAVE=NO}   support 0.425, confidence 1, lift 2.352941
2  {Num_Women_Product_Views=Yes, CONTINUE=YES} => {LEAVE=NO}   support 0.110, confidence 1, lift 2.352941
17 {Num_Women_Product_Views=Yes, Num_Men_Product_Views=No, CONTINUE=YES} => {LEAVE=NO}   support 0.100, confidence 1, lift 2.352941
18 {Num_MAS_Category_Views=No, Num_Women_Product_Views=Yes, CONTINUE=YES} => {LEAVE=NO}   support 0.105, confidence 1, lift 2.352941
19 {Num_MDS_Category_Views=No, Num_Women_Product_Views=Yes, CONTINUE=YES} => {LEAVE=NO}   support 0.105, confidence 1, lift 2.352941
20 {Num_MCS_Category_Views=No, Num_Women_Product_Views=Yes, CONTINUE=YES} => {LEAVE=NO}   support 0.110, confidence 1, lift 2.352941

Some of the interesting rules are shown above. Some randomly chosen ones are:
inspect(head(rulesLeave, 6))
O/p:
3 {Num_Women_Product_Views=Yes, CONTINUE=YES} => {LEAVE=NO}    support 0.110, confidence 1, lift 2.352941
4 {Num_Women_Product_Views=Yes, CONTINUE=NO} => {LEAVE=YES}    support 0.100, confidence 1, lift 1.739130
5 {Num_Women_Product_Views=No, CONTINUE=YES} => {LEAVE=NO}     support 0.315, confidence 1, lift 2.352941
6 {Num_CT_Waist_Control_Views=No, CONTINUE=YES} => {LEAVE=NO}  support 0.360, confidence 1, lift 2.352941

For CONTINUE some examples are:
inspect(head(rulesContinue, 4))
O/p:
3 {Num_Women_Product_Views=Yes, LEAVE=NO} => {CONTINUE=YES}    support 0.110, confidence 1, lift 2.352941
4 {Num_Women_Product_Views=Yes, LEAVE=YES} => {CONTINUE=NO}    support 0.100, confidence 1, lift 1.739130
5 {Num_Women_Product_Views=No, LEAVE=NO} => {CONTINUE=YES}     support 0.315, confidence 1, lift 2.352941
6 {Num_CT_Waist_Control_Views=No, LEAVE=NO} => {CONTINUE=YES}  support 0.360, confidence 1, lift 2.352941

Conclusion:
We were able to find some interesting patterns in users' page views and to identify some of the
killer pages, such as the "Num_CT_Waist_Control_Views" and "Num_MAS_Category_Views" pages.

Contenu connexe

En vedette

Visual Mapping of Clickstream Data
Visual Mapping of Clickstream DataVisual Mapping of Clickstream Data
Visual Mapping of Clickstream Data
DataWorks Summit
 

En vedette (9)

Click Stream Analysis
Click Stream AnalysisClick Stream Analysis
Click Stream Analysis
 
Visual Mapping of Clickstream Data
Visual Mapping of Clickstream DataVisual Mapping of Clickstream Data
Visual Mapping of Clickstream Data
 
clickstream analysis
 clickstream analysis clickstream analysis
clickstream analysis
 
Clickstream Analysis with Spark
Clickstream Analysis with Spark Clickstream Analysis with Spark
Clickstream Analysis with Spark
 
Clickstream & Social Media Analysis using Apache Spark
Clickstream & Social Media Analysis using Apache SparkClickstream & Social Media Analysis using Apache Spark
Clickstream & Social Media Analysis using Apache Spark
 
AWS re:Invent 2016: What’s New with AWS Lambda (SVR202)
AWS re:Invent 2016: What’s New with AWS Lambda (SVR202)AWS re:Invent 2016: What’s New with AWS Lambda (SVR202)
AWS re:Invent 2016: What’s New with AWS Lambda (SVR202)
 
AWS re:Invent 2016: The State of Serverless Computing (SVR311)
AWS re:Invent 2016: The State of Serverless Computing (SVR311)AWS re:Invent 2016: The State of Serverless Computing (SVR311)
AWS re:Invent 2016: The State of Serverless Computing (SVR311)
 
AWS re:Invent 2016: Serverless Computing Patterns at Expedia (SVR306) )
AWS re:Invent 2016: Serverless Computing Patterns at Expedia (SVR306) )AWS re:Invent 2016: Serverless Computing Patterns at Expedia (SVR306) )
AWS re:Invent 2016: Serverless Computing Patterns at Expedia (SVR306) )
 
Clickstream data with spark
Clickstream data with sparkClickstream data with spark
Clickstream data with spark
 

Similaire à Experiments and Results on Click stream analysis using R

Week10 Analysing Client Requirements
Week10 Analysing Client RequirementsWeek10 Analysing Client Requirements
Week10 Analysing Client Requirements
hapy
 
Week11 Determine Technical Requirements
Week11 Determine Technical RequirementsWeek11 Determine Technical Requirements
Week11 Determine Technical Requirements
hapy
 
A Data Warehouse And Business Intelligence Application
A Data Warehouse And Business Intelligence ApplicationA Data Warehouse And Business Intelligence Application
A Data Warehouse And Business Intelligence Application
Kate Subramanian
 
McKinsey Big Data Trinity for self-learning culture
McKinsey Big Data Trinity for self-learning cultureMcKinsey Big Data Trinity for self-learning culture
McKinsey Big Data Trinity for self-learning culture
Matt Ariker
 

Similaire à Experiments and Results on Click stream analysis using R (20)

Data Mining to Classify Telco Churners
Data Mining to Classify Telco ChurnersData Mining to Classify Telco Churners
Data Mining to Classify Telco Churners
 
Strategies to Monetize Energy Data - How Utilities Can Increase Their 'Earnin...
Strategies to Monetize Energy Data - How Utilities Can Increase Their 'Earnin...Strategies to Monetize Energy Data - How Utilities Can Increase Their 'Earnin...
Strategies to Monetize Energy Data - How Utilities Can Increase Their 'Earnin...
 
Week10 Analysing Client Requirements
Week10 Analysing Client RequirementsWeek10 Analysing Client Requirements
Week10 Analysing Client Requirements
 
Week11 Determine Technical Requirements
Week11 Determine Technical RequirementsWeek11 Determine Technical Requirements
Week11 Determine Technical Requirements
 
A Data Warehouse And Business Intelligence Application
A Data Warehouse And Business Intelligence ApplicationA Data Warehouse And Business Intelligence Application
A Data Warehouse And Business Intelligence Application
 
Churn in the Telecommunications Industry
Churn in the Telecommunications IndustryChurn in the Telecommunications Industry
Churn in the Telecommunications Industry
 
Finger Gesture Based Rating System
Finger Gesture Based Rating SystemFinger Gesture Based Rating System
Finger Gesture Based Rating System
 
IRJET- E-Commerce Recommender System using Data Mining Algorithms
IRJET-  	  E-Commerce Recommender System using Data Mining AlgorithmsIRJET-  	  E-Commerce Recommender System using Data Mining Algorithms
IRJET- E-Commerce Recommender System using Data Mining Algorithms
 
IRJET- E-Commerce Recommendation System: Problems and Solutions
IRJET- E-Commerce Recommendation System: Problems and SolutionsIRJET- E-Commerce Recommendation System: Problems and Solutions
IRJET- E-Commerce Recommendation System: Problems and Solutions
 
IRJET- Recommendation System for Electronic Products using BigData
IRJET- Recommendation System for Electronic Products using BigDataIRJET- Recommendation System for Electronic Products using BigData
IRJET- Recommendation System for Electronic Products using BigData
 
IRJET - Customer Churn Analysis in Telecom Industry
IRJET - Customer Churn Analysis in Telecom IndustryIRJET - Customer Churn Analysis in Telecom Industry
IRJET - Customer Churn Analysis in Telecom Industry
 
Big Data Analytics for Predicting Consumer Behaviour
Big Data Analytics for Predicting Consumer BehaviourBig Data Analytics for Predicting Consumer Behaviour
Big Data Analytics for Predicting Consumer Behaviour
 
Gc3310851089
Gc3310851089Gc3310851089
Gc3310851089
 
Gc3310851089
Gc3310851089Gc3310851089
Gc3310851089
 
McKinsey Big Data Trinity for self-learning culture
McKinsey Big Data Trinity for self-learning cultureMcKinsey Big Data Trinity for self-learning culture
McKinsey Big Data Trinity for self-learning culture
 
Modern trends in information systems
Modern trends in information systemsModern trends in information systems
Modern trends in information systems
 
data collection, data integration, data management, data modeling.pptx
data collection, data integration, data management, data modeling.pptxdata collection, data integration, data management, data modeling.pptx
data collection, data integration, data management, data modeling.pptx
 
Data analysis step by step guide
Data analysis   step by step guideData analysis   step by step guide
Data analysis step by step guide
 
Information system infrastructure
Information system infrastructureInformation system infrastructure
Information system infrastructure
 
Lecture 01 mis
Lecture 01 misLecture 01 mis
Lecture 01 mis
 

Plus de Pridhvi Kodamasimham

DevanagiriOCR on CELL BROADBAND ENGINE
DevanagiriOCR on CELL BROADBAND ENGINEDevanagiriOCR on CELL BROADBAND ENGINE
DevanagiriOCR on CELL BROADBAND ENGINE
Pridhvi Kodamasimham
 

Plus de Pridhvi Kodamasimham (6)

Open Data Convergence
Open Data ConvergenceOpen Data Convergence
Open Data Convergence
 
Thirst
Thirst Thirst
Thirst
 
In-Memory DataBase
In-Memory DataBaseIn-Memory DataBase
In-Memory DataBase
 
DevanagiriOCR on CELL BROADBAND ENGINE
DevanagiriOCR on CELL BROADBAND ENGINEDevanagiriOCR on CELL BROADBAND ENGINE
DevanagiriOCR on CELL BROADBAND ENGINE
 
Data Convergence
Data ConvergenceData Convergence
Data Convergence
 
Reactive Data System
Reactive Data SystemReactive Data System
Reactive Data System
 

Dernier

1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
QucHHunhnh
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
ciinovamais
 

Dernier (20)

General Principles of Intellectual Property: Concepts of Intellectual Proper...
General Principles of Intellectual Property: Concepts of Intellectual  Proper...General Principles of Intellectual Property: Concepts of Intellectual  Proper...
General Principles of Intellectual Property: Concepts of Intellectual Proper...
 
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptxSKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
 
Magic bus Group work1and 2 (Team 3).pptx
Magic bus Group work1and 2 (Team 3).pptxMagic bus Group work1and 2 (Team 3).pptx
Magic bus Group work1and 2 (Team 3).pptx
 
Asian American Pacific Islander Month DDSD 2024.pptx
Asian American Pacific Islander Month DDSD 2024.pptxAsian American Pacific Islander Month DDSD 2024.pptx
Asian American Pacific Islander Month DDSD 2024.pptx
 
How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17
 
SOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning PresentationSOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning Presentation
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
 
Third Battle of Panipat detailed notes.pptx
Third Battle of Panipat detailed notes.pptxThird Battle of Panipat detailed notes.pptx
Third Battle of Panipat detailed notes.pptx
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.ppt
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.
 
psychiatric nursing HISTORY COLLECTION .docx
psychiatric  nursing HISTORY  COLLECTION  .docxpsychiatric  nursing HISTORY  COLLECTION  .docx
psychiatric nursing HISTORY COLLECTION .docx
 
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
 
Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
Unit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxUnit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptx
 
Dyslexia AI Workshop for Slideshare.pptx
Dyslexia AI Workshop for Slideshare.pptxDyslexia AI Workshop for Slideshare.pptx
Dyslexia AI Workshop for Slideshare.pptx
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
 

Experiments and Results on Click stream analysis using R

  • 1. DB 297C Data Analytics – Project Report Term I (2013-14) Group Information GROUP NO: 11 TEAM MEMBERS: Bisen Vikratsingh Mohansingh - MT2012036 Kodamasimham Pridhvi - MT2012066 Vaibhav Singh Rajput - MT2012145 Dataset Description Blue Martini Software approached several clients using its Customer Interaction System to volunteer their data, and a small dot-com company called Gazelle.com, a leg wear and leg care retailer. Data was made available to in two formats: original data and aggregated data. Among the data collected by the Blue Martini application server, the following three categories are the relevant data:  Customer information, which includes customer ID, registration information, and registration form questionnaire responses.  Order information like Order header, which includes date/time, discount, tax, total amount, payment, shipping, status, session ID, order line, which includes quantity, price, product, date/time, assortment, and status.  Click stream information session, which includes starting and ending date/time, cookie, browser, referrer, visit count, and user agent, page view, which includes date/time, sequence number, URL, processing time, product, and assortment. Initial dataset .csv file -> 25MB (14000 row x 296 column), With SQL query removed columns with NULL making total no. of columns = 208. Removed record of Crawlers and Record with just one page view i.e. Session_time_elasped =0.0 making new rows count = ~5000. Manually removed few more irrelevant columns like browser/os, day, date info, etc, now no. of columns =128.Considering the most frequent visitors removed Columns whose sum (visit) < 10 making new number of columns =113. Final size: 6.5MB, ~5000 x 113
  • 2. DB 297C Data Analytics – Project Report Term I (2013-14) Summary of top 5 observations Rule Based Classification: Rule-based methods, rule discovery or rule extraction from data, are data mining techniques aimed at understanding data structures, providing comprehensible description instead of only black-box prediction. Sets of rules are useful if rules are not too numerous, comprehensible, and have sufficiently high accuracy. From the result of the experiment we can see rules being generated, we have show some sample rules in the documentation, there were totally 182 rules generated. To which class the rule belongs to is shown at the end of each rule with actual number of rows / number of misclassifications that belong to that particular rule. Association Rules: Association rules were taken based on two factors, lift and support. Rules having lift greater than 1 and min-support > 0.5. A total of 377564 rules were generated out of which we applied filters and selected few rules which showed some interesting patterns . Result 1 The Rule based classification will generate a set of rules on which classification takes place; we can see the set of rules from the model as generated below q2_part O/p: PART decision list Num_EllenTracy_Product_Views <= 1 AND Num_main_assortment_Template_Views > 1 AND Num_main_lifestyles_Template_Views > 1: FALSE (38.0/3.0) Num_main_assortment_Template_Views > 0 AND Num_main_shopping_cart_Template_Views <= 0 AND Num_LifeStyles_Assortment_Views <= 1: TRUE (7.0) Num_UniqueBoutiques_Assortment_Views <= 0 AND Num_main_vendor_Template_Views <= 1 AND Num_Brands_Assortment_Views <= 1: FALSE (145.0/71.0) Num_articles_dpt_about_Template_Views > 0 AND Num_BrandOrder_Assortment_Views <= 1 AND
  • 3. DB 297C Data Analytics – Project Report Term I (2013-14) Num_articles_Template_Views <= 1: TRUE (73.0/22.0) These are some of the sample rules that are generated based on the model using which the prediction on the test data was done. A successful rule based model was build with an accuracy of above 70% for identifying whether a user will continue his session or not. Results can be found at Rule Based Classification. Result 2 5 Cluster of users plotted against their average amount spend>12$. It is clearly observable from the figure that the most of users in cluster 0 are high spender where as cluster 2 consist of least highest spender Result 3 One more strange observation is high spender people are least interested in offers/free gift. Below image shows a plot of cluster against number of visits to offer/free gift page.
  • 4. DB 297C Data Analytics – Project Report Term I (2013-14) Result 4 Highest spender people (cluster 0 as concluded from result 2nd ) as found to be mostly interested in branded product that’s why their average spending is more than 12$. Below image shows a plot of cluster against DonnaKaran(Fashion design brand) product views. And opposite for our least spender cluster 2. Result 5 After applying association rules , some interesting rules were generated which were very useful to predict which users like to continue and who don’t and what were the killer pages in most cases. Due to large number of rules generation we were unable to go through all rules. Some of the graphs were:
  • 5. DB 297C Data Analytics – Project Report Term I (2013-14) Scatter plot of rules generated based on support, confidence and lift. Some of the interesting patterns were: 2 {Num_Women_Product_Views=Yes, CONTINUE=YES} => {LEAVE=NO} 0.110 1 2.352941 17 {Num_Women_Product_Views=Yes, Num_Men_Product_Views=No, CONTINUE=YES} => {LEAVE=NO} 0.100 1 2.352941 18 {Num_MAS_Category_Views=No, Num_Women_Product_Views=Yes, CONTINUE=YES} => {LEAVE=NO} 0.105 1 2.352941 19 {Num_MDS_Category_Views=No, Num_Women_Product_Views=Yes, CONTINUE=YES} => {LEAVE=NO} 0.105 1 2.352941 From the rules we can see that Women products were the most viewed and pages like MDS category, MAS category were the killer pages where no one was interested. Results can be observed here Association Rule Results.
  • 6. DB 297C Data Analytics – Project Report Term I (2013-14) APPENDIX – collection of individual experiment reports (not more than 15) Data Cleaning/Filtering DATA PRE-PROCESSING Description of Dataset: Blue Martini Software approached several clients using its Customer Interaction System to volunteer their data, and a small dot-com company called Gazelle.com, a legwear and legcare retailer. Data was made available to in two formats: original data and aggregated data. Among the data collected by the Blue Martini application server, the following three categories are the relevant data:  Customer information, which includes customer ID, registration information, and registration form questionnaire responses.  Order information like Order header, which includes date/time, discount, tax, total amount, payment, shipping, status, session ID, order line, which includes quantity, price, product, date/time, assortment, and status.  Clickstream information session, which includes starting and ending date/time, cookie, browser, referrer, visit count, and user agent, page view, which includes date/time, sequence number, URL, processing time, product, and assortment. Steps for Pre-processing:        Initial dataset .csv file -> 25MB (14000 row x 296 col) Loaded .csv into RDBMS using python scripts as discussed below With SQL query removed columns with NULL o total no. of col = 208 Removed record of Crawlers Record with just one page view i.e Session_time_elasped =0.0 o new rows count = ~5000 Manually removed few more irrelevant col like browser/os,day,date info, etc, o Now no. of col =108 Removed Col whose sum(visit) < 10 o New no of col =103 Final size: 6.5MB, ~5000 x 103
  • 7. DB 297C Data Analytics – Project Report Term I (2013-14) import sys defreadfile(fname,oname): f = file(fname) whileTrue: line = f.readline() stri=line.split(':'); if len(line)==0: break outfile = file(oname,'a') ifnot stri[0].startswith("|"): mystr=stri[0].replace(" ","_") if((c in mystr)for c in'*&'): m=['*','&','/'] for i in m: mystr=mystr.replace(i,"_") ifnot any((c in mystr)for c in'*&|'): # if any(c in stri[0] for c in '*$&|'): mystr=mystr.replace(".","") outfile.write(mystr.strip()) outfile.write("n") outfile.close() f.close() if len(sys.argv)<2: print'No file specified' sys.exit() else: readfile(sys.argv[1],sys.argv[2]) Python Scripts: Dataset was given in two files: o Data File o Names File In the above script we are taking the Names File from the user and retrieving the column names for our dataset by removing all special characters and writing it to another file for further processing of names. >> python read.py names_file_name output_file_name In the output_file_name we will get the columns names individually .
  • 8. DB 297C Data Analytics – Project Report Term I (2013-14) import sys defmakeScript(fname,oname): with open(fname,'r')as file_contents: contents = file_contents.read() my_data=contents.split("n") outfile=file(oname,'a') input_db = raw_input("Enter Database Name: ") createDatabase ="CREATE DATABASE "+ input_db +";n" useDatabase ="USE "+input_db +";n" input_table = raw_input("Enter Table Name: ") columns ="" count=0 print columns for data in my_data: count=count+1 if data: columns += data +" VARCHAR(100) DEFAULT NULL ,n" createTable="CREATE TABLE "+input_table +"("+ columns +") ;n" createTable=createTable.replace("/","") k = createTable.rfind(',') createTable=createTable[:k]+" "+createTable[k+1:] print createTable print count outfile.write(createDatabase) outfile.write(useDatabase) outfile.write(createTable) outfile.close() file_contents.close() if len(sys.argv)<2: print'No file specified' sys.exit() else: makeScript(sys.argv[1],sys.argv[2]) In the above python file, we are using the file created before using read.py, which is the input file for the above script. Here we are creating a sql script for creating a table with respective columns having a data type of varchar, so that we can load the file into DBMS for further processing. >> python Createscript.py output_file_name script.sql We will get a sql script for creating the table with the columns retrieved.
da.py:

   import MySQLdb

   # path to the pre-processed .csv Data File
   myfile = open("path where required csv is there", 'r')

   db = MySQLdb.connect(host="localhost",   # your host, usually localhost
                        user="root",        # your username
                        passwd="root",      # your password
                        db="da1")           # name of the database
   cur = db.cursor()

   for line in myfile:
       print line
       my_line_list = line.split(',')
       string = ""
       for value in my_line_list:
           string = string + "'" + str(value) + "',"
       query_string = string[:-1]
       final_query = "insert into question1 values" + "(" + query_string + ");"
       cur.execute(final_query)
   db.commit()   # commit the inserts so they are persisted

Here we open a connection to the database, read the Data File line by line, and insert each row directly into the table created earlier.

   >> python da.py

The path of the .csv Data File has to be set inside the script; the script then reads the file and inserts the rows into the database. After the data is inside the DBMS, we remove the columns whose values are all NULL using simple SQL queries, as mentioned above. After performing the data pre-processing on the given dataset, we export the table into a .csv file for performing the analysis in R. Now we will analyse the data using R.
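The NULL-column removal itself was done with SQL queries in the DBMS. As a cross-check once the table has been exported, the same condition can be re-tested in R before the analysis starts. The following is a minimal sketch, assuming the exported file is named dataset.csv; the file name and the empty-string handling are illustrative assumptions, not the exact commands used:

   # re-check in R that no all-NULL (all-NA) columns survived the SQL clean-up
   question1 <- read.csv("dataset.csv")

   # columns in which every value is NA (or the empty string sometimes left by NULL exports)
   all_null <- sapply(question1, function(col) all(is.na(col) | col == ""))
   names(question1)[all_null]      # should be empty after pre-processing

   # drop them if any remain
   question1 <- question1[, !all_null]
   dim(question1)                  # expected: ~5000 x 103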
Classification

Random Forest

Objective: To generate a model for building a decision tree and to identify important features using random forest.

Description: Random forests are an ensemble learning method for classification (and regression) that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes output by the individual trees.

Procedure:
1. After the data pre-processing, the dataset is loaded into the R environment:
   question1 <- read.csv("dataset.csv")
   dim(question1)   # 5220 103 ----- number of rows and columns
2. After loading the dataset we divide it into 70% training data and 30% test data:
   div <- sample(2, nrow(question1), replace=T, prob=c(0.7,0.3))
   This generates two samples from our dataset containing roughly 70% and 30% of the rows, with no overlap of rows. 'nrow' gives the number of rows in our dataset, 'prob' decides the division ratio, and 'sample' assigns the number 1 or 2 to each row according to which sample that row belongs to.
   To create trainData the command is:
   trainData <- question1[div==1,]
   dim(trainData)   # 3670 103 --- dimensions of trainData
   This copies all rows of the dataset marked as 1 into trainData. Similarly for testData:
   testData <- question1[div==2,]
   dim(testData)    # 1550 103 --- dimensions of testData
3. After generating trainData and testData we load the required package 'randomForest' into R:
   library(randomForest)
4. We define the target variable and the independent variables in the formula to be used for generating the model:
   myformula <- Session_Continues ~ .
   Here 'Session_Continues' is the target variable, with classes 'true' and 'false', and all remaining columns are used as the independent variables on which the target variable is classified, as represented by '~ .'.
5. Once the formula is decided, we apply it to generate the model from trainData using the function 'randomForest' and store the model in 'rf':
   rf <- randomForest(myformula, data=trainData, ntree=100, proximity=T)
   'ntree' specifies how many trees the algorithm has to grow to get an accurate model; 'proximity=T' asks randomForest to also compute the proximity matrix (how often pairs of observations end up in the same terminal node), which can be inspected together with the out-of-bag (OOB) estimates when analysing error rates.
6. We can see the classification result by:
   > rf
   Output:
   Call:
    randomForest(formula = myformula, data = trainData, ntree = 100, proximity = T)
                  Type of random forest: classification
                        Number of trees: 100
   No. of variables tried at each split: 10

           OOB estimate of error rate: 34.17%
   Confusion matrix:
          False True class.error
   False   2280  173  0.07052589
   True    1081  136  0.88824979

   From the result we can see that the OOB error rate is about 34%.
7. To see one of the generated trees of the forest:
   > getTree(rf, 1)
   Output (split point values are not reproduced here):
      left daughter  right daughter  split var  status  prediction
   1              2               3          3       1           0
   2              4               5         29       1           0
   3              6               7        105       1           0
   4              8               9         98       1           0
   5             10              11         36       1           0
   6             12              13         34       1           0
   7              0               0          0      -1           2

   If status is -1 then that node is a leaf node of the decision tree, and prediction (1 or 2) indicates the class to which it is classified. We can get the information of any tree with the above command just by specifying the randomForest object 'rf' and the tree number n, where in our case 1 <= n <= 100.
8. We can plot the error rates of the generated trees by:
   plot(rf)
   This produces the graph shown in figure (1) in the observations.
9. We can also find the features that contribute most to the decision trees using:
   importance(rf)
   It gives each feature with its mean decrease in Gini index, from which we can decide which features have the greatest effect on the model.
10. We can inspect the components available on the randomForest object:
   attributes(rf)
   Output:
   $names
    [1] "call"            "type"            "predicted"       "err.rate"
    [5] "confusion"       "votes"           "oob.times"       "classes"
    [9] "importance"      "importanceSD"    "localImportance" "proximity"
   [13] "ntree"           "mtry"            "forest"          "y"
   [17] "test"            "inbag"           "terms"
   $class
   [1] "randomForest.formula" "randomForest"
11. Now, using the model generated from trainData, we apply it to testData for prediction and tabulate the result:
   testpredict <- predict(rf, newdata=testData)
   table(testpredict, testData$Session_Continues)
   Output:
   testpredict False True
        False    952  473
        True      52   73
   Here we use the model 'rf' on the dataset 'testData' and store the predicted classes in 'testpredict'.

Observations:
The plot of the model error rates is shown below.
From the plot we can see that as the number of trees increases the error rate decreases and stabilises, and the forest is able to classify more accurately.

Result of importance(rf):

   Column                                   MeanDecreaseGini
   Num_BrandOrder_Assortment_Views          2.226234e+01
   Num_UniqueBoutiques_Assortment_Views     3.372542e+01
   Num_Brands_Assortment_Views              2.324481e+01
   Num_Departments_Assortment_Views         2.224833e+01
   Num_LifeStyles_Assortment_Views          1.265545e+01
   Num_main_Template_Views                  4.466076e+01
   Num_products_Template_Views              1.143197e+01
   Num_articles_Template_Views              1.943811e+01
   Num_main_home_Template_Views             2.347768e+01

From the result above we can observe that, out of the 103 features, only a few contribute substantially to the model, so we can further reduce the dataset and achieve better results.

WEKA OUTPUT:
We also ran attribute selection on the same dataset in WEKA; the selected attributes are:
   Num_Hanes_Product_Views
   Num_Cotton_Product_Views
   Num_Nylon_Product_Views
   Num_BrandOrder_Assortment_Views
   Num_UniqueBoutiques_Assortment_Views
   Num_LifeStyles_Assortment_Views
   Num_main_Template_Views
   Num_articles_Template_Views
   Num_main_home_Template_Views
   Num_main_vendor_Template_Views
   Num_articles_dpt_about_mgmtteam_Template_Views
   Num_main_cust_serv_Template_Views

The two outputs largely agree, so we were able to find the factors that most influence our decision tree.

Conclusion: We will use the features that have the greatest influence in the random forest as the independent variables for the actual construction of a decision tree using the 'party' package, in order to get better results and classification.
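For completeness, the hold-out accuracy and the feature ranking used in the conclusion can be obtained directly in R. This is a minimal sketch using the objects defined in the procedure (rf, testpredict, testData); the cut-off of ten features is an illustrative assumption, not part of the original experiment:

   # confusion table of the random forest on the 30% hold-out set
   tab <- table(testpredict, testData$Session_Continues)
   accuracy <- sum(diag(tab)) / sum(tab)
   accuracy        # with the counts reported above, (952 + 73) / 1550 is about 0.66

   # rank features by mean decrease in Gini and keep the strongest ones
   imp <- importance(rf)
   ranked <- sort(imp[, "MeanDecreaseGini"], decreasing = TRUE)
   top_features <- names(head(ranked, 10))   # cut-off of 10 is an assumption
   top_features

   # formula restricted to the selected features, to be reused with ctree()
   myformula_reduced <- reformulate(top_features, response = "Session_Continues")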
Party Decision Tree

Objective: To build a decision tree, using the features identified by random forest, with the "party" package.

Description: "party" is a computational toolbox for recursive partitioning. The core of the package is ctree(), an implementation of conditional inference trees which embed tree-structured regression models into a well-defined theory of conditional inference procedures. This non-parametric class of regression trees is applicable to all kinds of regression problems, including nominal, ordinal, numeric, censored as well as multivariate response variables and arbitrary measurement scales of the covariates.

Procedure:
1. Loading the dataset into R:
   question1_reduced <- read.csv("q2_reduced.csv")
   dim(question1_reduced)   # 5220 103
2. Dividing the dataset into a training dataset and a test dataset:
   div <- sample(2, nrow(question1_reduced), replace=T, prob=c(0.70,0.30))
3. Storing the trainData and testData in variables for analysis:
   trainData_reduced <- question1_reduced[div==1,]
   dim(trainData_reduced)   # 3675 103
   testData_reduced <- question1_reduced[div==2,]
   dim(testData_reduced)    # 1545 103
4. Defining the formula on the target variable, based on the features identified by random forest:
   myformula_reduced <- Session_Continues ~ Num_Hanes_Product_Views + Num_Cotton_Product_Views +
       Num_Nylon_Product_Views + Num_BrandOrder_Assortment_Views + Num_UniqueBoutiques_Assortment_Views +
       Num_LifeStyles_Assortment_Views + Num_main_Template_Views + Num_articles_Template_Views +
       Num_main_home_Template_Views + Num_main_vendor_Template_Views +
       Num_articles_dpt_about_mgmtteam_Template_Views + Num_main_cust_serv_Template_Views
   The features are the result of the importance ranking found during the random forest experiment.
5. Loading the "party" package for analysis:
   library(party)
6. Applying the recursive partitioning algorithm to the training data based on the above formula:
   trainData_ctree <- ctree(myformula_reduced, data=trainData_reduced)
7. To see the generated model:
   print(trainData_ctree)
This will show the features used in building the decision tree and also how the decision tree is built:

   Output:
   Conditional inference tree with 3 terminal nodes

   Response: Session_Continues
   Inputs: Num_Hanes_Product_Views, Num_Cotton_Product_Views, Num_Nylon_Product_Views,
   Num_BrandOrder_Assortment_Views, Num_UniqueBoutiques_Assortment_Views,
   Num_LifeStyles_Assortment_Views, Num_main_Template_Views, Num_articles_Template_Views,
   Num_main_home_Template_Views, Num_main_vendor_Template_Views,
   Num_articles_dpt_about_mgmtteam_Template_Views, Num_main_cust_serv_Template_Views
   Number of observations: 3675

   1) Num_main_home_Template_Views <= 1; criterion = 0.999, statistic = 16.455
     2) Num_articles_dpt_about_mgmtteam_Template_Views <= 0; criterion = 0.984, statistic = 10.226
       3)* weights = 2607
     2) Num_articles_dpt_about_mgmtteam_Template_Views > 0
       4)* weights = 245
   1) Num_main_home_Template_Views > 1
     5)* weights = 823

8. For visualising the tree graphically:
   plot(trainData_ctree, type="simple")
   This gives the graph of the decision tree shown in the observation below.
9. Now applying the model to the test dataset:
   testpred_reduced <- predict(trainData_ctree, newdata=testData_reduced)
10. For checking the accuracy of the prediction:
   table(testpred_reduced, testData_reduced$Session_Continues)
   The output shows the prediction counts; since the prediction has many errors, this is not a suitable method of decision tree generation for this dataset.
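Step 10 only prints the raw counts; the misclassification rate discussed in the conclusion of this experiment can be derived from that same table. A minimal sketch, assuming the objects created in steps 1-9 above:

   # confusion table of the ctree predictions against the true labels
   tab_ctree <- table(testpred_reduced, testData_reduced$Session_Continues)

   accuracy_ctree <- sum(diag(tab_ctree)) / sum(tab_ctree)   # share of correctly classified rows
   error_ctree    <- 1 - accuracy_ctree                      # misclassification rate (~0.5 reported below)
   accuracy_ctree
   error_ctree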
Observation:
The graph of the decision tree obtained is shown in the figure. From the graph we can see that only two factors are actually used for splitting by the ctree() algorithm.
WEKA:
The decision tree generated in WEKA based on the same factors is shown in the figure.

Conclusion: The results were not satisfactory; there was a large rate of error and misclassification of data. Around 50% of the data was not correctly classified using the "party" package.
Rule Based Classification

Objective: To generate a model for building rules and to classify data based on the rules being satisfied.

Description: Rule-based methods (rule discovery or rule extraction from data) are data mining techniques aimed at understanding data structures, providing a comprehensible description instead of only black-box prediction. Rule-based systems should expose, in a comprehensible way, knowledge hidden in data, providing logical justification for drawing conclusions, showing possible inconsistencies and avoiding the unpredictable conclusions that black-box predictors may generate in untypical situations. Sets of rules are useful if the rules are not too numerous, comprehensible, and have sufficiently high accuracy.

Procedure:
1. The data has already been loaded into R and divided into a training dataset and a test dataset, so we can apply rule-based classification directly on the training dataset.
2. For rule-based classification we have to install the package "RWeka", which makes the algorithms of the WEKA tool available in R.
3. We use the "PART" rule-based classifier from WEKA to generate the rules from our training dataset, based on which we classify our test dataset.
   library(RWeka)
   The above command loads the "RWeka" package into the R environment.
4. Now we apply the PART algorithm on the training dataset to obtain the rules:
   q2_part <- PART(Session_Continues ~ ., data = q2_train)
   The above command takes the training dataset "q2_train" and applies the "PART" algorithm with Session_Continues as the target variable and all remaining columns as independent variables.
5. The model built by the previous command is used for classifying the test dataset:
   q2_pre <- evaluate_Weka_classifier(q2_part, newdata = q2_test)
   Here we use the model generated from the training data to classify the test data; "evaluate_Weka_classifier" is the RWeka function that applies the model to newdata and evaluates it.
6. To see the result:
   q2_pre
   OUTPUT:
   === Summary ===
   Correctly Classified Instances       880               67.433  %
   Incorrectly Classified Instances     425               32.567  %
   Kappa statistic                        0
   Mean absolute error                    0.4447
   Root mean squared error                0.4689
   Relative absolute error               99.9541 %
   Root relative squared error           99.9958 %
   Coverage of cases (0.95 level)       100      %
   Mean rel. region size (0.95 level)   100      %
   Total Number of Instances           1305

   === Confusion Matrix ===
      a   b   <-- classified as
    734 124 |  a = FALSE
    301 136 |  b = TRUE

From the result we can see that we get a classification rate of around 67%, which is a better rate than the decision tree.

Observation:
The rule-based classification generates a set of rules on which the classification takes place; we can see the set of rules from the generated model:
   q2_part
O/p:
   PART decision list

   Num_EllenTracy_Product_Views <= 1 AND
   Num_main_assortment_Template_Views > 1 AND
   Num_main_lifestyles_Template_Views > 1: FALSE (38.0/3.0)

   Num_main_assortment_Template_Views > 0 AND
   Num_main_shopping_cart_Template_Views <= 0 AND
   Num_LifeStyles_Assortment_Views <= 1: TRUE (7.0)

   Num_UniqueBoutiques_Assortment_Views <= 0 AND
   Num_main_vendor_Template_Views <= 1 AND
   Num_Brands_Assortment_Views <= 1: FALSE (145.0/71.0)

   Num_articles_dpt_about_Template_Views > 0 AND
   Num_BrandOrder_Assortment_Views <= 1 AND
   Num_articles_Template_Views <= 1: TRUE (73.0/22.0)

From the result you can see the rules being generated; we show only a few sample rules here, out of a total of 182 rules generated. At the end of each rule the class it predicts is given, together with the actual number of covered rows / number of misclassifications for that particular rule.

Conclusion: From the above observations and results we can see that a successful rule-based model was built, with an accuracy of about 67%, for identifying whether a user will continue his session or not.
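The estimate above comes from a single 70/30 split; RWeka can also cross-validate the same PART model on the training data. A minimal sketch, assuming the q2_train object from the procedure (the number of folds and the seed are illustrative choices):

   library(RWeka)

   # 10-fold cross-validation of the PART rule learner on the training data
   q2_part <- PART(Session_Continues ~ ., data = q2_train)
   q2_cv   <- evaluate_Weka_classifier(q2_part, numFolds = 10, seed = 1, class = TRUE)
   q2_cv   # prints the cross-validated summary and confusion matrix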
Clustering

Objective: To group visitors of the website whose page view patterns are similar and to identify their interests.

Approach: Clustering is a standard methodology in data analysis which can be used to group objects based on their similarities. We make use of the WEKA tool for this analysis.

Preprocessing:
1. Remove all spam data by deleting records with just one page view.
2. There are about 500+ dimensions, which is not feasible to analyse, so for dimensionality reduction:
   a. Go to the "Select attributes" tab of WEKA.
   b. Manually: remove all session data, browser information, and the most common pages.
   c. Auto: calculate information gain and select the top 25 attributes.

Process:

Experiment I
In this experiment we make 5 clusters of the given instances (users) and analyse their purchase habits.

K-means steps:
1. Import the reduced dataset into WEKA.
2. Select SimpleKMeans.
3. Specify the number of clusters.
4. Set the distance function to Euclidean.
5. Specify k (number of clusters).
6. Click Start to generate the clusters.

Results:
(A) Using Euclidean distance

=== Run information ===
Scheme: weka.clusterers.SimpleKMeans -N 5 -A "weka.core.EuclideanDistance -R first-last" -I 500 -num-slots 1 -S 10
Relation: q3_added-weka.filters.unsupervised.attribute.Remove-V-R7,10,9,102,127,112,66,87,56,70,111,106,26,59,20,18,104,113,17,1,54,14,115,63,73,15
Instances: 1781
Attributes: 26
   City
   Customer_ID
   US_State
   Num_Sheer_Look_Product_Views
   Num_CT_Waist_Control_Views
   Num_PH_Category_Views
   Num_main_shopping_cart_Template_Views
   Num_Replenishable_Stock_Views
   Num_account_Template_Views
   Num_main_login2_Template_Views
   Num_Sandal_Foot_Views
   Num_HasDressingRoom_True_Views
   Num_Legwear_Product_Views
   Num_products_productDetailLegwear_Template_Views
   Num_DonnaKaran_Product_Views
   Num_AmericanEssentials_Product_Views
   Num_Basic_Product_Views
   Num_WDCS_Category_Views
   Num_Oroblu_Product_Views
   WhichDoYouWearMostFrequent
   Num_products_Template_Views
   Home_Market_Value
   Num_WAS_Category_Views
   Num_main_vendor_Template_Views
   Num_main_freegift_Template_Views
   Spend_over_$12_per_order_on_average
Test mode: evaluate on training data

=== Clustering model (full training set) ===

kMeans
======
Number of iterations: 12
Within cluster sum of squared errors: 4789.607514501406
Missing values globally replaced with mean/mode

Time taken to build model (full training data): 0.14 seconds

=== Model and evaluation on training set ===

Clustered Instances
0   263 ( 15%)
1   273 ( 15%)
2   368 ( 21%)
3   484 ( 27%)
4   393 ( 22%)
Observation:
- Cluster 0
  o High income
  o Spend on average more than $12 per order => potential customers (value)
  o Purchase nylon more than cotton (nylon is costlier than cotton)
  o Mostly view brand assortment pages (DonnaKaran – a costly fashion design brand)
  o Fewer visits to the free gift category
  o More visits to sandal foot products
- Cluster 1
  o More men's products than the other clusters => the cluster might contain more men
  o Frequent use of the search bar
  o Rich visitors; most of them have above-average home/asset value
- Cluster 2
  o General visitors
  o Buy cheap products
- Cluster 3
  o Interested mostly in offers/free gift products
  o Highest number of visits to the checkout page => potential customers (frequency)
- Cluster 4
  o No special pattern observed

Experiment II
We validate whether we can use page view data for identifying potential customers using clustering. We have labelled data on average purchase over $12, with 1368 instances labelled False and 413 instances labelled True.

=== Run information ===
Scheme: weka.clusterers.SimpleKMeans -N 2 -A "weka.core.EuclideanDistance -R first-last" -I 500 -num-slots 1 -S 10
Relation: q3_added-weka.filters.unsupervised.attribute.Remove-V-R7,10,9,102,127,112,66,87,56,70,111,106,26,59,20,18,104,113,17,1,54,14,115,63,73,15
Instances: 1781
Attributes: 26
   City
   Customer_ID
   US_State
   Num_Sheer_Look_Product_Views
   Num_CT_Waist_Control_Views
   Num_PH_Category_Views
   Num_main_shopping_cart_Template_Views
   Num_Replenishable_Stock_Views
   Num_account_Template_Views
   Num_main_login2_Template_Views
   Num_Sandal_Foot_Views
   Num_HasDressingRoom_True_Views
   Num_Legwear_Product_Views
   Num_products_productDetailLegwear_Template_Views
   Num_DonnaKaran_Product_Views
   Num_AmericanEssentials_Product_Views
   Num_Basic_Product_Views
   Num_WDCS_Category_Views
   Num_Oroblu_Product_Views
   WhichDoYouWearMostFrequent
   Num_products_Template_Views
   Home_Market_Value
   Num_WAS_Category_Views
   Num_main_vendor_Template_Views
   Num_main_freegift_Template_Views
Ignored:
   Spend_over_$12_per_order_on_average
Test mode: Classes to clusters evaluation on training data

=== Clustering model (full training set) ===

kMeans
======
Number of iterations: 5
Within cluster sum of squared errors: 4913.0928856548035
Missing values globally replaced with mean/mode

Time taken to build model (full training data): 0.05 seconds

=== Model and evaluation on training set ===

Clustered Instances
0    561 ( 31%)
1   1220 ( 69%)

Class attribute: Spend_over_$12_per_order_on_average
Classes to Clusters:
    0    1   <-- assigned to cluster
  402  966 | False
  159  254 | True

Cluster 0 <-- True
Cluster 1 <-- False

Incorrectly clustered instances: 656.0   36.8332 %

Observation: Only about 63% of the instances are correctly clustered, as the data is heavily biased toward the False class (less than $12 average spending). Still, clustering gives us good insight into purchase/page view patterns.
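The clustering above was carried out in the WEKA GUI; the same grouping can be reproduced in R, which is used elsewhere in this report. The following is a minimal sketch, assuming the reduced dataset has been exported from WEKA as q3_reduced.csv; the file name, the standardisation step and the seed are illustrative choices, not the settings of the original experiment:

   q3 <- read.csv("q3_reduced.csv")

   # keep only the numeric page-view counts (drop the ID column) and standardise them
   # so that no single attribute dominates the Euclidean distance
   num_cols  <- setdiff(names(q3)[sapply(q3, is.numeric)], "Customer_ID")
   q3_scaled <- scale(q3[, num_cols])

   set.seed(10)                              # analogous to WEKA's -S 10
   km <- kmeans(q3_scaled, centers = 5, iter.max = 500, nstart = 10)

   table(km$cluster)                         # cluster sizes, cf. "Clustered Instances"
   km$tot.withinss                           # within-cluster sum of squared errors

   # cross-tabulate clusters against the purchase label, cf. Experiment II;
   # read.csv turns "$" in the column name into "." (assumed label column present)
   table(km$cluster, q3[["Spend_over_.12_per_order_on_average"]])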
Association Rules

Objective: To identify some interesting patterns in the users' page views and also the killer pages.

Description: Association rule learning is a popular and well-researched method for discovering interesting relations between variables in large databases. It is intended to identify strong rules discovered in databases using different measures of interestingness. The measures used in our analysis are lift, confidence and support.

Procedure:
1. For mining association rules we first converted the dataset into a binary matrix indicating, for each session, whether the visitor viewed a given page or not.
2. The "arules" package provides the association rule algorithms:
   library(arules)
3. We then loaded the converted dataset into R for rule generation; we used only the important columns, based on the mean Gini index obtained from the randomForest result.
4. After loading the data, we convert it into transactions with the following command:
   dataTrans <- as(assoc, "transactions")
5. Now we apply the "apriori" algorithm to generate the rules; a parameter list with support, confidence and minimum length of every rule can be passed (a sketch with an explicit parameter list is given after the observation below):
   rules <- apriori(dataTrans)
   This generates all rules using the default minimum support of about 0.1 and minimum confidence of about 0.8. It also generates all the subset rules based on the frequent itemsets of attributes.
6. To know how many rules were generated:
   rules
   Around 377564 rules were generated, out of which we were interested only in rules having LEAVE or CONTINUE on the right-hand side, to check whether a person will continue or leave after seeing certain pages.
7. We retrieved a subset of rules from all the generated rules which showed some interesting patterns.

Observation:
We were able to see some interesting patterns in the generated rules. Since most of the visitors in our dataset were female, most of the rules contained "NUM_OF_WOMEN_PRODUCT_VIEWS" in nearly every transaction. Some brands were rarely or never visited according to the rules. We were also able to identify some of the killer pages based on user preferences, i.e. pages after which users tended to withdraw every time.
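For reference, the conversion in step 1 and the parameter list mentioned in step 5 can be written out explicitly. This is a minimal sketch, not the exact commands used in the experiment: the object names, the page-view column selection, the encoding of LEAVE/CONTINUE, and the threshold values are illustrative assumptions.

   library(arules)

   # Step 1 (sketch): binarise page-view counts per session, "Yes" = page visited at least once.
   # 'question1' is assumed to hold the pre-processed dataset with numeric view counts;
   # selecting columns whose name contains "Views" is an illustrative assumption.
   view_cols <- grep("Views", names(question1), value = TRUE)
   assoc <- as.data.frame(lapply(question1[, view_cols],
                                 function(x) factor(ifelse(x > 0, "Yes", "No"))))

   # LEAVE / CONTINUE items as they appear in the Results below; the exact derivation
   # from Session_Continues is an assumption
   assoc$CONTINUE <- factor(ifelse(question1$Session_Continues == "TRUE", "YES", "NO"))
   assoc$LEAVE    <- factor(ifelse(assoc$CONTINUE == "YES", "NO", "YES"))

   # Step 4: coerce to the transactions class used by arules
   dataTrans <- as(assoc, "transactions")

   # Step 5 with an explicit parameter list instead of the package defaults
   rules <- apriori(dataTrans, parameter = list(supp = 0.1, conf = 0.8, minlen = 2))

   # rules predicting LEAVE with lift > 1, sorted by lift (cf. Results)
   rulesLeave <- subset(rules, subset = rhs %pin% "LEAVE" & lift > 1)
   inspect(head(sort(rulesLeave, by = "lift"), 20))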
Results:
Some of the rules, sorted by their "lift" values, are shown below:

   rulesLeave <- subset(rules, subset = rhs %pin% "LEAVE")   # for getting the LEAVE rules
   inspect(head(sort(rulesLeave, by="lift"), 20))
O/p:
      lhs                                      rhs          support confidence lift
   1  {CONTINUE=YES}                        => {LEAVE=NO}   0.425   1          2.352941
   2  {Num_Women_Product_Views=Yes,
       CONTINUE=YES}                        => {LEAVE=NO}   0.110   1          2.352941
   17 {Num_Women_Product_Views=Yes,
       Num_Men_Product_Views=No,
       CONTINUE=YES}                        => {LEAVE=NO}   0.100   1          2.352941
   18 {Num_MAS_Category_Views=No,
       Num_Women_Product_Views=Yes,
       CONTINUE=YES}                        => {LEAVE=NO}   0.105   1          2.352941
   19 {Num_MDS_Category_Views=No,
       Num_Women_Product_Views=Yes,
       CONTINUE=YES}                        => {LEAVE=NO}   0.105   1          2.352941
   20 {Num_MCS_Category_Views=No,
       Num_Women_Product_Views=Yes,
       CONTINUE=YES}                        => {LEAVE=NO}   0.110   1          2.352941

Some of the interesting rules are shown above. Some randomly chosen rules are:

   inspect(head(rulesLeave, 6))
O/p:
      lhs                                      rhs          support confidence lift
   3  {Num_Women_Product_Views=Yes,
       CONTINUE=YES}                        => {LEAVE=NO}   0.110   1          2.352941
   4  {Num_Women_Product_Views=Yes,
       CONTINUE=NO}                         => {LEAVE=YES}  0.100   1          1.739130
   5  {Num_Women_Product_Views=No,
       CONTINUE=YES}                        => {LEAVE=NO}   0.315   1          2.352941
   6  {Num_CT_Waist_Control_Views=No,
       CONTINUE=YES}                        => {LEAVE=NO}   0.360   1          2.352941

For CONTINUE some examples are:

   inspect(head(rulesContinue, 4))
O/p:
      lhs                                      rhs             support confidence lift
   3  {Num_Women_Product_Views=Yes,
       LEAVE=NO}                            => {CONTINUE=YES} 0.110   1          2.352941
   4  {Num_Women_Product_Views=Yes,
       LEAVE=YES}                           => {CONTINUE=NO}  0.100   1          1.739130
   5  {Num_Women_Product_Views=No,
       LEAVE=NO}                            => {CONTINUE=YES} 0.315   1          2.352941
   6  {Num_CT_Waist_Control_Views=No,
       LEAVE=NO}                            => {CONTINUE=YES} 0.360   1          2.352941

Conclusion: We were able to find some interesting patterns in the users' page views and were able to identify some of the killer pages, such as "Num_CT_Waist_Control_Views" and "Num_MAS_Category_Views".