Prepared as part of the course requirements for the subject IT for Business Intelligence at Vinod Gupta School of Management, IIT Kharagpur. This paper discusses data mining techniques using examples in the WEKA software.
IT for Business Intelligence
Term Paper on Data Mining Techniques
Prepared By:
Niloy Ghosh
Roll No: 10BM60054
Second Year, MBA
Vinod Gupta School of Management (VGSOM)
IIT Kharagpur
Introduction
The purpose of this term paper is to demonstrate data mining techniques using the software tool
WEKA. Data mining aims at transforming large amounts of data into meaningful patterns and rules.
The derivation of meaning from the vast amounts of data has numerous business applications and is
generating a tremendous amount of interest.
Waikato Environment for Knowledge Analysis (WEKA) is free, open-source software that can be
used to mine data and generate useful information. To use WEKA, the data must be in the
Attribute-Relation File Format (ARFF), a flat-file format in which the attributes of the data
are declared first, followed by the data itself.
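For illustration, a minimal ARFF file might look like the sketch below. The attribute names follow the housing dataset used later in this paper, but the snippet and its values are invented for illustration, not taken from the actual file:

```
@relation housing

@attribute RM numeric
@attribute CHAS {0, 1}
@attribute MEDV numeric

@data
6.575, 0, 24.0
6.421, 1, 21.6
```

Numeric attributes are declared as `numeric`, nominal attributes list their allowed values in braces, and each line after `@data` is one instance.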
In this paper two techniques, Linear Regression and Decision Tree, are discussed with examples. The
source of the data used to demonstrate the techniques is provided in the reference section.
Technique I
Linear Regression
Linear regression is used to predict the value of an unknown dependent variable based on the values
of a number of independent variables. In this example, the model tries to predict the housing prices
in the Boston area.
Description of dataset
The dataset contains details about housing in the Boston area. It has 14 variables, which are
defined as follows.
1. CRIM: per capita crime rate by town
2. ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
3. INDUS: proportion of non-retail business acres per town
4. CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
5. NOX: nitric oxides concentration (parts per 10 million)
6. RM: average number of rooms per dwelling
7. AGE: proportion of owner-occupied units built prior to 1940
8. DIS: weighted distances to five Boston employment centres
9. RAD: index of accessibility to radial highways
10. TAX: full-value property-tax rate per $10,000
11. PTRATIO: pupil-teacher ratio by town
12. B: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
13. LSTAT: Percentage of lower status of the population
14. MEDV: Median value of owner-occupied homes in $1000's
The objective is to predict the housing values (i.e. the variable MEDV) using Linear Regression.
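Under the hood, linear regression fits the weights by ordinary least squares. A minimal sketch of that idea in Python, using a tiny synthetic dataset (not the Boston housing data) so the result is easy to check:

```python
import numpy as np

# Toy illustration of ordinary least squares, the method behind
# linear regression: fit y = w1*x1 + w2*x2 + intercept.
# The data is synthetic: y was generated exactly as 2*x1 + 3*x2.
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0]])
y = np.array([8.0, 7.0, 18.0, 17.0])

# Append a column of ones so the intercept is estimated too.
X1 = np.hstack([X, np.ones((len(X), 1))])
w, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(np.round(w, 6))  # weights for x1, x2 and the intercept: [2. 3. 0.]
```

WEKA's LinearRegression does more than this (attribute selection, ridge regularisation via the `-R` parameter seen in the output below), but the least-squares fit is the core of it.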
Output
On running the model in WEKA, the following output was obtained.
=== Run information ===
Scheme: weka.classifiers.functions.LinearRegression -S 0 -R 1.0E-8
Relation: housing
Instances: 506
Attributes: 14
CRIM
ZN
INDUS
CHAS
NOX
RM
AGE
DIS
RAD
TAX
PTRATIO
B
LSTAT
CLASS
Test mode: split 70.0% train, remainder test
=== Classifier model (full training set) ===
Linear Regression Model
CLASS = -0.1084 * CRIM + 0.0458 * ZN + 2.7188 * CHAS + -17.3768 * NOX + 3.8016 * RM + -1.4927 * DIS + 0.2996 * RAD +
-0.0118 * TAX + -0.9466 * PTRATIO + 0.0093 * B + -0.5225 * LSTAT + 36.342
Time taken to build model: 0.05 seconds
=== Evaluation on test split ===
=== Summary ===
Correlation coefficient 0.8547
Mean absolute error 3.3219
Root mean squared error 4.6107
Relative absolute error 52.2759 %
Root relative squared error 51.9447 %
Total Number of Instances 152
The experiment was conducted using a 70-30 split of the data (70% used to form the model, 30%
used to test the same).
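The fitted model above can be transcribed directly as a Python function (CLASS is the renamed MEDV attribute, the median home value in $1000s). The input values below are illustrative, not instances from the dataset:

```python
# The WEKA linear regression model above, transcribed coefficient by
# coefficient. INDUS and AGE do not appear: WEKA dropped them.
def predict_medv(crim, zn, chas, nox, rm, dis, rad, tax, ptratio, b, lstat):
    """Predicted median home value (in $1000s) from the fitted coefficients."""
    return (-0.1084 * crim + 0.0458 * zn + 2.7188 * chas
            - 17.3768 * nox + 3.8016 * rm - 1.4927 * dis
            + 0.2996 * rad - 0.0118 * tax - 0.9466 * ptratio
            + 0.0093 * b - 0.5225 * lstat + 36.342)

# Hypothetical town, then the same town with a much higher crime rate.
base = predict_medv(crim=0.1, zn=12.5, chas=0, nox=0.5, rm=6.0,
                    dis=4.0, rad=4, tax=300, ptratio=18, b=390, lstat=10)
high_crime = predict_medv(crim=10.0, zn=12.5, chas=0, nox=0.5, rm=6.0,
                          dis=4.0, rad=4, tax=300, ptratio=18, b=390, lstat=10)
print(high_crime < base)  # True: the negative CRIM coefficient lowers the prediction
```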
Interpretation
The correlation coefficient of 0.85 indicates a reasonably good fit, so the model is
acceptable. Although the error values are fairly high, other methods have yielded only
slightly better results.
The following conclusions can be drawn:
- The proportion of non-retail business (INDUS) and the age of the buildings (AGE) are not
factors in the valuation; WEKA dropped them from the model.
- As expected, crime rates, air pollution and high tax rates have a negative effect on
house values.
- The proportion of lower-status population has a negative effect. Thus, low-income
neighbourhoods have lower house prices than affluent neighbourhoods.
- Interestingly, the pupil-teacher ratio has a quite prominent negative effect. This
suggests that educational facilities are a major concern when looking for a home, and that
people are ready to pay more in areas with better schools.
Technique II
Decision Tree
In data mining, a decision tree is a predictive model that maps observations about an item to
conclusions about the item's target value. In a classification tree, the leaves represent
class labels and the branches represent conjunctions of features that lead to those class labels.
The WEKA classifier used in this example is J48, an implementation of the C4.5 algorithm. The
model tries to diagnose diseases of the urinary system.
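C4.5 (and hence J48) chooses each split by how much it reduces the uncertainty of the class, measured with entropy. A small self-contained sketch of that criterion, on invented records shaped like the diagnosis dataset (the attribute values here are illustrative, not the real data):

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr, target):
    """Reduction in entropy of `target` after splitting the rows on `attr`."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Invented toy records: urine pushing separates the classes perfectly,
# nausea carries no information about the class at all.
rows = [
    {"urine_pushing": "yes", "nausea": "yes", "inflammation": "yes"},
    {"urine_pushing": "yes", "nausea": "no",  "inflammation": "yes"},
    {"urine_pushing": "no",  "nausea": "yes", "inflammation": "no"},
    {"urine_pushing": "no",  "nausea": "no",  "inflammation": "no"},
]
print(info_gain(rows, "urine_pushing", "inflammation"))  # 1.0: perfect split
print(info_gain(rows, "nausea", "inflammation"))         # 0.0: useless split
```

C4.5 actually uses the gain *ratio* (gain normalised by the split's own entropy) and adds pruning, but the gain computation above is the heart of how the trees in the output below get their root attributes.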
Description of dataset
The dataset contains the following variables.
1. Temperature of patient
2. Occurrence of nausea { yes, no }
3. Lumbar pain { yes, no }
4. Urine pushing (continuous need for urination) { yes, no }
5. Micturition pains { yes, no }
6. Burning of urethra, itch, swelling of urethra outlet { yes, no }
7. Decision: Inflammation of urinary bladder { yes, no }
8. Decision: Nephritis of renal pelvis origin { yes, no }
For the purpose of the demonstration, the variable 'Nephritis of renal pelvis origin' was
first removed, and a decision tree was built to predict inflammation of the urinary bladder.
Next, the variable 'Inflammation of urinary bladder' was removed instead, and a new decision
tree was built to predict nephritis of renal pelvis origin.
Output
The WEKA output for prediction of the inflammation of urinary bladder was obtained as follows.
Model 1
=== Run information ===
Scheme: weka.classifiers.trees.J48 -C 0.25 -M 2
Relation: diagnosis-weka.filters.unsupervised.attribute.Remove-R8
Instances: 120
Attributes: 7
temperature
nausea
Lumbar_pain
Urine_pushing
Micturition_pains
Burning_of_urethra
Inflammation_of_urinary_bladder
Test mode: 10-fold cross-validation
=== Classifier model (full training set) ===
J48 pruned tree
------------------
Urine_pushing = yes
| Micturition_pains = yes: yes (49.0)
| Micturition_pains = no
| | Lumbar_pain = yes: no (21.0)
| | Lumbar_pain = no: yes (10.0)
Urine_pushing = no: no (40.0)
Number of Leaves : 4
Size of the tree : 7
Time taken to build model: 0.01 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances 120 100 %
Incorrectly Classified Instances 0 0 %
Kappa statistic 1
Mean absolute error 0
Root mean squared error 0
Relative absolute error 0 %
Root relative squared error 0 %
Total Number of Instances 120
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure ROC Area Class
1 0 1 1 1 1 yes
1 0 1 1 1 1 no
Weighted Avg. 1 0 1 1 1 1
=== Confusion Matrix ===
a b <-- classified as
59 0 a = yes
0 61 b = no
The tree is visualised as shown below.
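The pruned tree reads as a short set of if/then rules, transcribed here as a Python function (the function name is ours; the logic is exactly the Model 1 tree):

```python
# The pruned J48 tree for Model 1, transcribed rule by rule.
def inflammation(urine_pushing, micturition_pains, lumbar_pain):
    """Predict 'Inflammation of urinary bladder' ('yes'/'no') per the tree."""
    if urine_pushing == "no":
        return "no"
    if micturition_pains == "yes":
        return "yes"
    # urine_pushing == "yes" and micturition_pains == "no":
    # lumbar pain decides the final branch.
    return "no" if lumbar_pain == "yes" else "yes"

print(inflammation("yes", "no", "no"))  # -> yes
```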
The same experiment was repeated for predicting the occurrence of Nephritis of renal pelvis origin.
The following results were obtained.
Model 2
=== Run information ===
Scheme: weka.classifiers.trees.J48 -C 0.25 -M 2
Relation: diagnosis-weka.filters.unsupervised.attribute.Remove-R7
Instances: 120
Attributes: 7
temperature
nausea
Lumbar_pain
Urine_pushing
Micturition_pains
Burning_of_urethra
Nephritis_of_renal_pelvis_origin
Test mode: evaluate on training data
=== Classifier model (full training set) ===
J48 pruned tree
------------------
temperature <= 37.9: no (60.0)
temperature > 37.9
| Lumbar_pain = yes: yes (50.0)
| Lumbar_pain = no: no (10.0)
Number of Leaves : 3
Size of the tree : 5
Time taken to build model: 0 seconds
=== Evaluation on training set ===
=== Summary ===
Correctly Classified Instances 120 100 %
Incorrectly Classified Instances 0 0 %
Kappa statistic 1
Mean absolute error 0
Root mean squared error 0
Relative absolute error 0 %
Root relative squared error 0 %
Total Number of Instances 120
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure ROC Area Class
1 0 1 1 1 1 yes
1 0 1 1 1 1 no
Weighted Avg. 1 0 1 1 1 1
=== Confusion Matrix ===
a b <-- classified as
50 0 a = yes
0 70 b = no
The visual tree is shown below.
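The Model 2 tree is even shorter when written as rules; a direct Python transcription (the function name is ours, the logic is the tree above):

```python
# The pruned J48 tree for Model 2, transcribed rule by rule.
def nephritis(temperature, lumbar_pain):
    """Predict 'Nephritis of renal pelvis origin' ('yes'/'no') per the tree."""
    if temperature <= 37.9:
        return "no"
    return "yes" if lumbar_pain == "yes" else "no"

print(nephritis(40.0, "yes"))  # -> yes
```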
Interpretation
In both models, 100% of the instances were classified correctly. Note, however, that Model 1
was evaluated with 10-fold cross-validation while Model 2 was evaluated on the training data
itself, so Model 2's accuracy figure is optimistic.
In Model 1, the differentiating factors were urine pushing, micturition pains and lumbar pain.
In Model 2, the differentiating factors were temperature and lumbar pain.
Since lumbar pain appears in both trees, it is an important factor in diagnosing diseases of
the urinary system.
Conclusion
This paper barely scratches the surface of the possible applications of data mining. This
powerful technique has unique applications in business as well as academic research, and may
provide clues to numerous questions by helping us make sense of the ever-growing volume of
data.