Prepared as part of the course requirements for the subject IT for Business Intelligence at Vinod Gupta School of Management, IIT Kharagpur. This paper discusses data mining techniques using examples in the WEKA software.
IT for Business Intelligence
Term Paper on Data Mining Techniques
Prepared By:
Niloy Ghosh
Roll No: 10BM60054
Second Year, MBA
Vinod Gupta School of Management (VGSOM)
IIT Kharagpur
Introduction
The purpose of this term paper is to demonstrate data mining techniques using the software tool
WEKA. Data mining aims at transforming large amounts of data into meaningful patterns and rules.
The derivation of meaning from the vast amounts of data has numerous business applications and is
generating a tremendous amount of interest.
Waikato Environment for Knowledge Analysis (WEKA) is free, open-source software that can be
used to mine data and generate useful information. To use WEKA, the data must be in the
Attribute-Relation File Format (ARFF), a flat-file format in which the attributes of the data
are declared first, followed by the data itself.
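For illustration, a minimal ARFF file might look like the sketch below. The attribute names follow the housing dataset used later in this paper, but the snippet and its values are invented for illustration, not taken from the actual file:

```
@relation housing

@attribute RM numeric
@attribute CHAS {0, 1}
@attribute MEDV numeric

@data
6.575, 0, 24.0
6.421, 1, 21.6
```

Numeric attributes are declared as `numeric`, nominal attributes list their allowed values in braces, and each line after `@data` is one instance.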
In this paper two techniques, Linear Regression and Decision Tree, are discussed with examples. The
source of the data used to demonstrate the techniques is provided in the reference section.
Technique I
Linear Regression
Linear regression is used to predict the value of an unknown dependent variable based on the values
of a number of independent variables. In this example, the model tries to predict the housing prices
in the Boston area.
Description of dataset
The dataset contains details about housing in the Boston area. It has 14 variables, which are
defined as follows.
1. CRIM: per capita crime rate by town
2. ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
3. INDUS: proportion of non-retail business acres per town
4. CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
5. NOX: nitric oxides concentration (parts per 10 million)
6. RM: average number of rooms per dwelling
7. AGE: proportion of owner-occupied units built prior to 1940
8. DIS: weighted distances to five Boston employment centres
9. RAD: index of accessibility to radial highways
10. TAX: full-value property-tax rate per $10,000
11. PTRATIO: pupil-teacher ratio by town
12. B: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
13. LSTAT: Percentage of lower status of the population
14. MEDV: Median value of owner-occupied homes in $1000's
The objective is to predict the housing values (i.e. the variable MEDV) using Linear Regression.
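Under the hood, linear regression fits the weights by ordinary least squares. A minimal sketch of that idea in Python, using a tiny synthetic dataset (not the Boston housing data) so the result is easy to check:

```python
import numpy as np

# Toy illustration of ordinary least squares, the method behind
# linear regression: fit y = w1*x1 + w2*x2 + intercept.
# The data is synthetic: y was generated exactly as 2*x1 + 3*x2.
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0]])
y = np.array([8.0, 7.0, 18.0, 17.0])

# Append a column of ones so the intercept is estimated too.
X1 = np.hstack([X, np.ones((len(X), 1))])
w, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(np.round(w, 6))  # weights for x1, x2 and the intercept: [2. 3. 0.]
```

WEKA's LinearRegression does more than this (attribute selection, ridge regularisation via the `-R` parameter seen in the output below), but the least-squares fit is the core of it.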
Output
On running the model in WEKA, the following output was obtained.
=== Run information ===
Scheme: weka.classifiers.functions.LinearRegression -S 0 -R 1.0E-8
Relation: housing
Instances: 506
Attributes: 14
CRIM
ZN
INDUS
CHAS
NOX
RM
AGE
DIS
RAD
TAX
PTRATIO
B
LSTAT
CLASS
Test mode: split 70.0% train, remainder test
=== Classifier model (full training set) ===
Linear Regression Model
CLASS = -0.1084 * CRIM + 0.0458 * ZN + 2.7188 * CHAS + -17.3768 * NOX + 3.8016 * RM + -1.4927 * DIS + 0.2996 * RAD +
-0.0118 * TAX + -0.9466 * PTRATIO + 0.0093 * B + -0.5225 * LSTAT + 36.342
Time taken to build model: 0.05 seconds
=== Evaluation on test split ===
=== Summary ===
Correlation coefficient 0.8547
Mean absolute error 3.3219
Root mean squared error 4.6107
Relative absolute error 52.2759 %
Root relative squared error 51.9447 %
Total Number of Instances 152
The experiment was conducted using a 70-30 split of the data (70% used to form the model, 30%
used to test the same).
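The fitted model above can be transcribed directly as a Python function (CLASS is the renamed MEDV attribute, the median home value in $1000s). The input values below are illustrative, not instances from the dataset:

```python
# The WEKA linear regression model above, transcribed coefficient by
# coefficient. INDUS and AGE do not appear: WEKA dropped them.
def predict_medv(crim, zn, chas, nox, rm, dis, rad, tax, ptratio, b, lstat):
    """Predicted median home value (in $1000s) from the fitted coefficients."""
    return (-0.1084 * crim + 0.0458 * zn + 2.7188 * chas
            - 17.3768 * nox + 3.8016 * rm - 1.4927 * dis
            + 0.2996 * rad - 0.0118 * tax - 0.9466 * ptratio
            + 0.0093 * b - 0.5225 * lstat + 36.342)

# Hypothetical town, then the same town with a much higher crime rate.
base = predict_medv(crim=0.1, zn=12.5, chas=0, nox=0.5, rm=6.0,
                    dis=4.0, rad=4, tax=300, ptratio=18, b=390, lstat=10)
high_crime = predict_medv(crim=10.0, zn=12.5, chas=0, nox=0.5, rm=6.0,
                          dis=4.0, rad=4, tax=300, ptratio=18, b=390, lstat=10)
print(high_crime < base)  # True: the negative CRIM coefficient lowers the prediction
```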
Interpretation
The correlation coefficient of 0.85 indicates a reasonably good fit, so the model is
acceptable. Although the error values are fairly high, other methods have yielded only
slightly better results.
The following conclusions can be drawn:
- The proportion of non-retail business (INDUS) and the age of the buildings (AGE) are not
factors in the valuation; WEKA dropped them from the model.
- As expected, crime rates, air pollution and high tax rates have a negative effect on
house values.
- The proportion of lower-status population has a negative effect. Thus, low-income
neighbourhoods have lower house prices than affluent neighbourhoods.
- Interestingly, the pupil-teacher ratio has a quite prominent negative effect. This
suggests that educational facilities are a major concern when looking for a home, and that
people are ready to pay more in areas with better schools.
Technique II
Decision Tree
In data mining, a decision tree is a predictive model that maps observations about an item to
conclusions about the item's target value. In a classification tree, the leaves represent
class labels and the branches represent conjunctions of features that lead to those class labels.
The WEKA classifier used in this example is J48, an implementation of the C4.5 algorithm. The
model tries to diagnose diseases of the urinary system.
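C4.5 (and hence J48) chooses each split by how much it reduces the uncertainty of the class, measured with entropy. A small self-contained sketch of that criterion, on invented records shaped like the diagnosis dataset (the attribute values here are illustrative, not the real data):

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr, target):
    """Reduction in entropy of `target` after splitting the rows on `attr`."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Invented toy records: urine pushing separates the classes perfectly,
# nausea carries no information about the class at all.
rows = [
    {"urine_pushing": "yes", "nausea": "yes", "inflammation": "yes"},
    {"urine_pushing": "yes", "nausea": "no",  "inflammation": "yes"},
    {"urine_pushing": "no",  "nausea": "yes", "inflammation": "no"},
    {"urine_pushing": "no",  "nausea": "no",  "inflammation": "no"},
]
print(info_gain(rows, "urine_pushing", "inflammation"))  # 1.0: perfect split
print(info_gain(rows, "nausea", "inflammation"))         # 0.0: useless split
```

C4.5 actually uses the gain *ratio* (gain normalised by the split's own entropy) and adds pruning, but the gain computation above is the heart of how the trees in the output below get their root attributes.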
Description of dataset
The dataset contains the following variables.
1. Temperature of patient
2. Occurrence of nausea { yes, no }
3. Lumbar pain { yes, no }
4. Urine pushing (continuous need for urination) { yes, no }
5. Micturition pains { yes, no }
6. Burning of urethra, itch, swelling of urethra outlet { yes, no }
7. Decision: Inflammation of urinary bladder { yes, no }
8. Decision: Nephritis of renal pelvis origin { yes, no }
For the purpose of the demonstration, the variable 'Nephritis of renal pelvis origin' was
first removed, and a decision tree was built to predict inflammation of the urinary bladder.
Next, the variable 'Inflammation of urinary bladder' was removed instead, and a new decision
tree was built to predict nephritis of renal pelvis origin.
Output
The WEKA output for prediction of the inflammation of urinary bladder was obtained as follows.
Model 1
=== Run information ===
Scheme: weka.classifiers.trees.J48 -C 0.25 -M 2
Relation: diagnosis-weka.filters.unsupervised.attribute.Remove-R8
Instances: 120
Attributes: 7
temperature
nausea
Lumbar_pain
Urine_pushing
Micturition_pains
Burning_of_urethra
Inflammation_of_urinary_bladder
Test mode: 10-fold cross-validation
=== Classifier model (full training set) ===
J48 pruned tree
------------------
Urine_pushing = yes
| Micturition_pains = yes: yes (49.0)
| Micturition_pains = no
| | Lumbar_pain = yes: no (21.0)
| | Lumbar_pain = no: yes (10.0)
Urine_pushing = no: no (40.0)
Number of Leaves : 4
Size of the tree : 7
Time taken to build model: 0.01 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances 120 100 %
Incorrectly Classified Instances 0 0 %
Kappa statistic 1
Mean absolute error 0
Root mean squared error 0
Relative absolute error 0 %
Root relative squared error 0 %
Total Number of Instances 120
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure ROC Area Class
1 0 1 1 1 1 yes
1 0 1 1 1 1 no
Weighted Avg. 1 0 1 1 1 1
=== Confusion Matrix ===
a b <-- classified as
59 0 a = yes
0 61 b = no
The tree is visualised as shown below.
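The pruned tree reads as a short set of if/then rules, transcribed here as a Python function (the function name is ours; the logic is exactly the Model 1 tree):

```python
# The pruned J48 tree for Model 1, transcribed rule by rule.
def inflammation(urine_pushing, micturition_pains, lumbar_pain):
    """Predict 'Inflammation of urinary bladder' ('yes'/'no') per the tree."""
    if urine_pushing == "no":
        return "no"
    if micturition_pains == "yes":
        return "yes"
    # urine_pushing == "yes" and micturition_pains == "no":
    # lumbar pain decides the final branch.
    return "no" if lumbar_pain == "yes" else "yes"

print(inflammation("yes", "no", "no"))  # -> yes
```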
The same experiment was repeated for predicting the occurrence of Nephritis of renal pelvis origin.
The following results were obtained.
Model 2
=== Run information ===
Scheme: weka.classifiers.trees.J48 -C 0.25 -M 2
Relation: diagnosis-weka.filters.unsupervised.attribute.Remove-R7
Instances: 120
Attributes: 7
temperature
nausea
Lumbar_pain
Urine_pushing
Micturition_pains
Burning_of_urethra
Nephritis_of_renal_pelvis_origin
Test mode: evaluate on training data
=== Classifier model (full training set) ===
J48 pruned tree
------------------
temperature <= 37.9: no (60.0)
temperature > 37.9
| Lumbar_pain = yes: yes (50.0)
| Lumbar_pain = no: no (10.0)
Number of Leaves : 3
Size of the tree : 5
Time taken to build model: 0 seconds
=== Evaluation on training set ===
=== Summary ===
Correctly Classified Instances 120 100 %
Incorrectly Classified Instances 0 0 %
Kappa statistic 1
Mean absolute error 0
Root mean squared error 0
Relative absolute error 0 %
Root relative squared error 0 %
Total Number of Instances 120
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure ROC Area Class
1 0 1 1 1 1 yes
1 0 1 1 1 1 no
Weighted Avg. 1 0 1 1 1 1
=== Confusion Matrix ===
a b <-- classified as
50 0 a = yes
0 70 b = no
The visual tree is shown below.
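The Model 2 tree is even shorter when written as rules; a direct Python transcription (the function name is ours, the logic is the tree above):

```python
# The pruned J48 tree for Model 2, transcribed rule by rule.
def nephritis(temperature, lumbar_pain):
    """Predict 'Nephritis of renal pelvis origin' ('yes'/'no') per the tree."""
    if temperature <= 37.9:
        return "no"
    return "yes" if lumbar_pain == "yes" else "no"

print(nephritis(40.0, "yes"))  # -> yes
```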
Interpretation
In both models, 100% of the instances were classified correctly. Note, however, that Model 1
was evaluated with 10-fold cross-validation while Model 2 was evaluated on the training data
itself, so Model 2's accuracy figure is optimistic.
In Model 1, the differentiating factors were urine pushing, micturition pains and lumbar pain.
In Model 2, the differentiating factors were temperature and lumbar pain.
Since lumbar pain appears in both trees, it is an important factor in diagnosing diseases of
the urinary system.
Conclusion
This paper barely scratches the surface of the possible applications of data mining. This
powerful technique has unique applications in business as well as academic research, and may
provide clues to numerous questions by helping us make sense of the ever-growing volume of
data.