Clustering and Regression using WEKA
VGSOM
WEKA – Data Mining Techniques
Clustering and Regression
BY
M.P.Vijaya Prabhu
10BM60097
Contents
1. INTRODUCTION
2. CLUSTERING
2.1 Data Visualization
3. Regression Analysis
3.1 Pricing the house
4. References
WEKA – DATA MINING TECHNIQUES
1. INTRODUCTION
Weka, “Data Mining Software in Java”, is an acronym for Waikato Environment for Knowledge
Analysis. It is a collection of state-of-the-art machine learning algorithms and data preprocessing
tools written in Java, developed at the University of Waikato, New Zealand. It is free software that
runs on almost any platform and is available under the GNU General Public License.
Weka lets the user carry out complex analyses interactively and visualize the results effectively.
[Figure: The WEKA GUI]
Advantages of using WEKA
1) Built-in advanced algorithms
2) Effective visualization of results
3) Easy-to-use GUI
Let us demonstrate the use of WEKA with two examples, one on CLUSTERING (k-means) and one on
Regression.
2. CLUSTERING
The data is a sample bank data set taken from an online source. It contains the following attributes:
1) age numeric
2) sex {FEMALE,MALE}
3) region {INNER_CITY,TOWN,RURAL,SUBURBAN}
4) income numeric
5) married {NO,YES}
6) children {0,1,2,3}
7) car {NO,YES}
8) save_act {NO,YES}
9) current_act {NO,YES}
10) mortgage {NO,YES}
11) pep {YES,NO}
Based on this data we need to CLUSTER the users into 6 groups and find out the characteristics
of each group.
The sample data contains 600 instances. The objective is to cluster them using the k-means algorithm.
First, the data is loaded into WEKA and preprocessed as shown below. Once the preprocessing is
done, we can start clustering the data.
The WEKA SimpleKMeans algorithm automatically handles a mixture of categorical and
numerical attributes. When computing distances, as in our case, the built-in algorithm
automatically normalizes the numerical attributes. Euclidean distance is used as the measure of
distance between instances and cluster centroids.
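The distance computation described above can be sketched roughly as follows. This is a minimal illustration, not WEKA's own code: it assumes min-max scaling of numeric attributes to the range 0 to 1 and a difference of 0 or 1 for categorical attributes. The function names, the attribute ranges, and the two records are hypothetical.

```python
import math

def normalize(value, lo, hi):
    """Min-max scale a numeric value into [0, 1]; constant columns map to 0."""
    return 0.0 if hi == lo else (value - lo) / (hi - lo)

def mixed_distance(a, b, numeric_ranges):
    """Euclidean-style distance over records with numeric and categorical fields.

    a, b: dicts mapping attribute name -> value.
    numeric_ranges: dict mapping each numeric attribute -> (min, max).
    Attributes not listed in numeric_ranges are treated as categorical.
    """
    total = 0.0
    for name in a:
        if name in numeric_ranges:
            lo, hi = numeric_ranges[name]
            diff = normalize(a[name], lo, hi) - normalize(b[name], lo, hi)
        else:
            # Categorical attributes contribute 0 when equal, 1 when different.
            diff = 0.0 if a[name] == b[name] else 1.0
        total += diff * diff
    return math.sqrt(total)

# Hypothetical ranges and records, for illustration only.
ranges = {"age": (18, 67), "income": (5000.0, 63000.0)}
x = {"age": 30, "income": 20000.0, "sex": "FEMALE", "married": "YES"}
y = {"age": 45, "income": 35000.0, "sex": "MALE", "married": "YES"}
d = mixed_distance(x, y, ranges)
```

Without the scaling step, income (tens of thousands) would dominate age (tens) in the sum, which is why normalization matters for data of this kind.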
After selecting SimpleKMeans we can open the advanced settings of the algorithm. We
have changed the number of clusters from 2 to 6, to get 6 different clusters from the given data.
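To make the role of the number of clusters concrete, here is a minimal sketch of the basic k-means loop on purely numeric data. It is an illustration under simplifying assumptions, not WEKA's SimpleKMeans implementation (which also handles categorical attributes). The default seed of 10 and the limit of 500 iterations simply mirror the -S and -I values in the run information; the sample points are hypothetical.

```python
import random

def kmeans(points, k, seed=10, max_iter=500):
    """Plain k-means on numeric tuples; returns (centroids, assignments)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # pick k starting points at random
    assignments = [None] * len(points)
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        new_assignments = [
            min(range(k),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(pt, centroids[c])))
            for pt in points
        ]
        if new_assignments == assignments:     # nothing moved, so we have converged
            break
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [pt for pt, a in zip(points, assignments) if a == c]
            if members:                        # keep the old centroid if a cluster empties
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignments

# Hypothetical 2-D points, for illustration only.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
centroids, labels = kmeans(points, k=2)
```

The result depends on the random starting points, which is why a fixed seed is useful for reproducing a run.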
After the required details are given, “Use Training Set” is checked. Then we can click “Start”.
The result is given below.
================================================================================================
OUTPUT :
=== Run information ===
Scheme: weka.clusterers.SimpleKMeans -N 6 -A "weka.core.EuclideanDistance -R first-last" -I 500 -S 10
Relation: bank-data
Instances: 600
Attributes: 12
id
age
sex
region
income
married
children
car
save_act
current_act
mortgage
pep
Test mode: evaluate on training data
=== Clustering model (full training set) ===
kMeans
======
Number of iterations: 18
Within cluster sum of squared errors: 1955.4146634784236
Missing values globally replaced with mean/mode
Cluster centroids:
Cluster#
Attribute Full Data 0 1 2 3 4 5
(600) (74) (164) (71) (58) (99) (134)
==========================================================================================
id ID12101 ID12107 ID12103 ID12101 ID12104 ID12102 ID12108
age 42.395 42.9324 43.7744 39.0282 37.3103 38.404 47.3433
sex FEMALE FEMALE FEMALE FEMALE FEMALE MALE MALE
region INNER_CITY RURAL INNER_CITY INNER_CITY TOWN INNER_CITY TOWN
income 27524.0312 28838.7605 28586.4063 20463.1273 20600.8528 25720.037 33568.3929
married YES NO YES YES YES YES NO
children 1.0117 1.973 0.628 0.6901 1.6207 0.899 0.9403
car NO NO NO NO NO YES YES
save_act YES YES YES NO NO NO YES
current_act YES YES YES YES YES YES YES
mortgage NO NO NO NO NO YES NO
pep NO NO NO YES NO YES YES
Time taken to build model (full training data) : 0.16 seconds
=== Model and evaluation on training set ===
Clustered Instances
0 74 ( 12%)
1 164 ( 27%)
2 71 ( 12%)
3 58 ( 10%)
4 99 ( 17%)
5 134 ( 22%)
================================================================================================
The result window shows the centroid of each cluster as well as statistics on the number and
percentage of instances assigned to each cluster.
0 74 ( 12%)
1 164 ( 27%)
2 71 ( 12%)
3 58 ( 10%)
4 99 ( 17%)
5 134 ( 22%)
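As a quick check, the percentages can be recomputed from the cluster sizes. The sketch below is only an illustration; the helper name is hypothetical, and rounding halves up is an assumption that happens to reproduce the figures shown.

```python
def rounded_percentages(counts):
    """Return each count as a whole-number percentage of the total (halves round up)."""
    total = sum(counts)
    # Integer arithmetic avoids floating-point surprises at exact .5 boundaries.
    return [(200 * c + total) // (2 * total) for c in counts]

cluster_sizes = [74, 164, 71, 58, 99, 134]   # cluster sizes reported above
percentages = rounded_percentages(cluster_sizes)
```

The sizes add up to the 600 instances in the data set, so every instance has been assigned to exactly one cluster.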
The output of this clustering can be summarized in the form of cluster centroids:
Attribute    Full Data   0           1           2           3           4           5
age          42.395      42.9324     43.7744     39.0282     37.3103     38.404      47.3433
sex          FEMALE      FEMALE      FEMALE      FEMALE      FEMALE      MALE        MALE
region       INNER_CITY  RURAL       INNER_CITY  INNER_CITY  TOWN        INNER_CITY  TOWN
income       27524.0312  28838.7605  28586.4063  20463.1273  20600.8528  25720.037   33568.3929
married      YES         NO          YES         YES         YES         YES         NO
children     1.0117      1.973       0.628       0.6901      1.6207      0.899       0.9403
car          NO          NO          NO          NO          NO          YES         YES
save_act     YES         YES         YES         NO          NO          NO          YES
current_act  YES         YES         YES         YES         YES         YES         YES
mortgage     NO          NO          NO          NO          NO          YES         NO
pep          NO          NO          NO          YES         NO          YES         YES
For example, the centroid for cluster 0 shows that this is a segment of cases representing middle-aged
(approx. 43) females living in rural areas with an average income of approx. $28,800, who are
unmarried with two children, etc. Furthermore, this group has on average said NO to the PEP product.
2.1 Data Visualization
The result can be viewed more intuitively using the VISUALIZATION tools built into WEKA.
The distribution of males and females in each cluster can be visualized with the following
steps.
Step 1 : Right-click on the entry in the result list and select “Visualize cluster assignments”.
Step 2 : Select the cluster attribute as the X axis.
Step 3 : Select Instance_Nbr as the Y axis.
Step 4 : Select “sex” as the colour, so that the points are coloured by sex.
This will result in a visualization of the distribution of males and females in each cluster.
3. Regression Analysis
Regression can be done effectively, with many options, in WEKA. Let us explain it using a
simple “LinearRegression” example.
3.1 Pricing the house
The data is taken from an online source. The selling price of the house needs to be determined
based on the data given. The data contains the following attributes.
1) houseSize NUMERIC
2) lotSize NUMERIC
3) bedrooms NUMERIC
4) granite NUMERIC
5) bathroom NUMERIC
6) sellingPrice NUMERIC
So, based on the size of the house, the lot size, the number of bedrooms, whether it is furnished
with granite, and the number of bathrooms, we need to predict the DEPENDENT VARIABLE, i.e. the
SELLING PRICE.
First, the data is loaded into WEKA and any necessary preprocessing is done. Since our data is already
processed, we proceed to selecting the type of REGRESSION.
In the picture given above, choose “LinearRegression” (under functions) as the classifier. Then select
“Use Training Set” in the Test Options.
There are three other test options available when doing simple linear regression:
Supplied test set: a separate set of test data is supplied to evaluate the model.
Cross-validation: WEKA splits the supplied data into subsets (folds), builds a model on all but one
fold, evaluates it on the held-out fold, and averages the results over all folds.
Percentage split: WEKA builds the model on a given percentage of the data and evaluates it on the rest.
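The difference between a percentage split and cross-validation can be sketched in terms of row indices. This is only an illustration of the idea, not WEKA's procedure: it skips any shuffling of the data, the function names are hypothetical, and the 66% split and the 10 folds are assumed example values.

```python
def percentage_split(n, train_percent):
    """Split indices 0..n-1 into a leading training part and a trailing test part."""
    cut = n * train_percent // 100
    return list(range(cut)), list(range(cut, n))

def k_folds(n, k):
    """Yield (train_indices, test_indices) pairs; each index is tested exactly once."""
    for fold in range(k):
        test = list(range(fold, n, k))        # every k-th index, offset by the fold number
        test_set = set(test)
        train = [i for i in range(n) if i not in test_set]
        yield train, test

# 700 is the number of instances in the house data; 66 and 10 are assumed example values.
train_idx, test_idx = percentage_split(700, 66)
folds = list(k_folds(700, 10))
```

With a percentage split each row is used either for training or for testing, whereas with cross-validation every row is used for testing once and for training in all the other folds.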
Here the column “sellingPrice” is chosen as the attribute to predict. This means that with the
available data we are going to predict the DEPENDENT VARIABLE (Selling Price).
Then click on the “Start” button to build a model using WEKA.
OUTPUT:
================================================================================================
=== Run information ===
Scheme: weka.classifiers.functions.LinearRegression -S 0 -R 1.0E-8
Relation: house
Instances: 700
Attributes: 6
houseSize
lotSize
bedrooms
granite
bathroom
sellingPrice
Test mode: evaluate on training data
=== Classifier model (full training set) ===
Linear Regression Model
sellingPrice =
22.6582 * houseSize +
9.1242 * lotSize +
42145.0767 * bedrooms +
42562.0901 * bathroom +
-20981.3142
Time taken to build model: 0.04 seconds
=== Evaluation on training set ===
=== Summary ===
Correlation coefficient 0.9945
Mean absolute error 4790.821
Root mean squared error 4245.4125
Relative absolute error 11.9082 %
Root relative squared error 11.21 %
Total Number of Instances 700
================================================================================================
According to the output, the selling price is given by
sellingPrice = (22.6582 * houseSize) + (9.1242 * lotSize) + (42145.0767 * bedrooms) +
(42562.0901 * bathroom) - 20981.3142.
To determine the selling price of a house, we simply plug its values into this equation.
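Plugging values into the model can be sketched as below. The coefficients and the intercept are copied from the WEKA output above; the function name and the example house are hypothetical and are not taken from the data set.

```python
# Coefficients and intercept copied from the fitted model in the output above.
COEFFICIENTS = {
    "houseSize": 22.6582,
    "lotSize": 9.1242,
    "bedrooms": 42145.0767,
    "bathroom": 42562.0901,
}
INTERCEPT = -20981.3142

def predict_selling_price(house):
    """Evaluate the linear model for one house given as a dict of attribute values."""
    return INTERCEPT + sum(coef * house[name] for name, coef in COEFFICIENTS.items())

# Hypothetical house, for illustration only (not a row from the data set).
example = {"houseSize": 3000, "lotSize": 10000, "bedrooms": 4, "bathroom": 2}
price = predict_selling_price(example)
```

For this made-up house the model gives a price of roughly 391,940.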
The attribute “granite” does not appear in the fitted equation: WEKA left it out of the model, which
indicates that it does not matter much for the SELLING PRICE of the house.
4. References
http://www.stat.yale.edu/Courses/1997-98/101/linreg.htm
http://www.cs.waikato.ac.nz/ml/weka/
http://www.laits.utexas.edu/~norman/BUS.FOR/course.mat/Alex/
http://maya.cs.depaul.edu/classes/ect584/weka/k-means.html
http://www.cs.utexas.edu/users/ml/tutorials/Weka-tut/