IT for Business Intelligence
Term Paper on Data Mining Techniques




 Prepared By:
 Niloy Ghosh

 Roll No: 10BM60054

 Second Year, MBA

 Vinod Gupta School of Management (VGSOM)

 IIT Kharagpur
Introduction
The purpose of this term paper is to demonstrate data mining techniques using the software tool
WEKA. Data mining aims at transforming large amounts of data into meaningful patterns and rules.
The derivation of meaning from the vast amounts of data has numerous business applications and is
generating a tremendous amount of interest.

Waikato Environment for Knowledge Analysis (WEKA) is free, open-source software that can be used to
mine data and generate useful information. WEKA's native input format is the Attribute-Relation File
Format (ARFF), a flat file format in which the type of each attribute is declared first, followed by
the data itself.
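
As a rough illustration of that layout, the sketch below builds a small ARFF document as a string: a `@relation` line, one `@attribute` line per column, then `@data` followed by comma-separated rows. The helper name `to_arff` and the two sample rows are hypothetical and are not taken from the datasets used in this paper.

```python
def to_arff(relation, attributes, rows):
    """Return ARFF text: type declarations first, then the data rows.

    attributes: list of (name, type) pairs, where type is either the string
    "numeric" or a list of allowed nominal values.
    """
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        if isinstance(kind, str):
            lines.append(f"@attribute {name} {kind}")
        else:
            # Nominal attributes list their allowed values in braces.
            lines.append(f"@attribute {name} {{{','.join(kind)}}}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


# Hypothetical two-row sample, only to show the shape of the file.
sample = to_arff(
    "diagnosis",
    [("temperature", "numeric"), ("nausea", ["yes", "no"])],
    [(36.8, "no"), (40.1, "yes")],
)
print(sample)
```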

In this paper two techniques, Linear Regression and Decision Tree, are discussed with examples. The
source of the data used to demonstrate the techniques is provided in the reference section.



Technique I
Linear Regression

Linear regression is used to predict the value of an unknown dependent variable based on the values
of a number of independent variables. In this example, the model tries to predict the housing prices
in the Boston area.
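
To make the idea concrete, here is a minimal sketch of least-squares fitting with a single independent variable, using only the Python standard library. It is not WEKA's implementation (the run reported later uses many variables and WEKA's own options); the function name `fit_line` and the toy numbers are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy data lying exactly on y = 2x + 1 (made up for illustration).
slope, intercept = fit_line([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
print(slope, intercept)
```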



Description of dataset

The dataset contains details about housing in the Boston area. It has 14 variables, which are
defined as follows.

    1.    CRIM: per capita crime rate by town
    2.    ZN:    proportion of residential land zoned for lots over 25,000 sq.ft.
    3.    INDUS: proportion of non-retail business acres per town
    4.    CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
    5.    NOX: nitric oxides concentration (parts per 10 million)
    6.    RM:    average number of rooms per dwelling
    7.    AGE:   proportion of owner-occupied units built prior to 1940
    8.    DIS:   weighted distances to five Boston employment centres
    9.    RAD: index of accessibility to radial highways
    10.   TAX: full-value property-tax rate per $10,000
    11.   PTRATIO: pupil-teacher ratio by town
    12.   B:     1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
    13.   LSTAT: Percentage of lower status of the population
    14.   MEDV: Median value of owner-occupied homes in $1000's

The objective is to predict the housing values (i.e. the variable MEDV, which is labelled CLASS in
the WEKA output) using Linear Regression.
Output

On running the model in WEKA, the following output was obtained.



=== Run information ===

Scheme:       weka.classifiers.functions.LinearRegression -S 0 -R 1.0E-8
Relation:     housing
Instances:    506
Attributes:   14
              CRIM
              ZN
              INDUS
              CHAS
              NOX
              RM
              AGE
              DIS
              RAD
              TAX
              PTRATIO
              B
              LSTAT
              CLASS
Test mode:    split 70.0% train, remainder test



=== Classifier model (full training set) ===

Linear Regression Model

CLASS =

     -0.1084 * CRIM +
      0.0458 * ZN +
      2.7188 * CHAS +
    -17.3768 * NOX +
      3.8016 * RM +
     -1.4927 * DIS +
      0.2996 * RAD +
     -0.0118 * TAX +
     -0.9466 * PTRATIO +
      0.0093 * B +
     -0.5225 * LSTAT +
     36.342

Time taken to build model: 0.05 seconds



=== Evaluation on test split ===
=== Summary ===

Correlation coefficient                  0.8547
Mean absolute error                      3.3219
Root mean squared error                  4.6107
Relative absolute error                 52.2759 %
Root relative squared error             51.9447 %
Total Number of Instances              152



The experiment was conducted using a 70-30 split of the data: 70% (354 instances) was used to build
the model and the remaining 30% (152 instances) was used to test it.
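
The split and the summary figures above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not WEKA's code: the function names are hypothetical, the shuffle seed is arbitrary, and the relative errors here use the mean of the supplied actual values as the baseline predictor, which is a simplification.

```python
import math
import random


def split_70_30(rows, seed=0):
    """Shuffle a copy of rows and return (train, test) with a 70/30 split."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(0.7 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def regression_metrics(actual, predicted):
    """Return the summary figures for paired actual/predicted values."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    abs_err = sum(abs(p - a) for a, p in zip(actual, predicted))
    sq_err = sum((p - a) ** 2 for a, p in zip(actual, predicted))
    # Baseline: always predict the mean of the actual values (assumption).
    base_abs = sum(abs(a - mean_a) for a in actual)
    base_sq = sum((a - mean_a) ** 2 for a in actual)
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return {
        "correlation": cov / math.sqrt(base_sq * var_p),
        "mae": abs_err / n,
        "rmse": math.sqrt(sq_err / n),
        "rae_percent": 100 * abs_err / base_abs,
        "rrse_percent": 100 * math.sqrt(sq_err / base_sq),
    }


# 506 instances split 70/30 leave 152 instances for testing.
train, test = split_70_30(range(506))
print(len(train), len(test))

# Toy values, invented for illustration only.
print(regression_metrics([10.0, 20.0, 30.0], [12.0, 18.0, 33.0]))
```

A relative absolute error of about 52%, as reported above, means the model's total absolute error is roughly half of what the baseline predictor would incur.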



Interpretation

The correlation coefficient between the predicted and actual values on the test split is 0.8547
(about 85%), so the model is reasonably acceptable. The error values are quite high (a relative
absolute error of about 52%), though other methods have yielded only slightly better results.

The following conclusions can be drawn:

    - The proportion of non-retail business acres (INDUS) and the age of the buildings (AGE) do
      not appear in the model, so they are not factors in the valuation.
    - As expected, crime rates, air pollution and (high) tax rates have a negative effect on the
      house value.
    - The proportion of lower-status population has a negative effect. Thus, low-income
      neighbourhoods will have lower house values than affluent neighbourhoods.
    - Interestingly, the pupil-teacher ratio has a negative effect, and quite a prominent one.
      This suggests that educational facilities are a big concern when looking for a home and
      that people are ready to pay more for areas with better educational facilities.
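
The sign of each effect can be checked by writing the reported equation as a function. In the sketch below the coefficients are copied from the WEKA model in the Output section; the function name `predict_medv` and the attribute values of the sample tract are hypothetical and chosen only for illustration.

```python
# Coefficients copied from the WEKA linear regression model reported above.
COEFFICIENTS = {
    "CRIM": -0.1084, "ZN": 0.0458, "CHAS": 2.7188, "NOX": -17.3768,
    "RM": 3.8016, "DIS": -1.4927, "RAD": 0.2996, "TAX": -0.0118,
    "PTRATIO": -0.9466, "B": 0.0093, "LSTAT": -0.5225,
}
INTERCEPT = 36.342


def predict_medv(tract):
    """Predicted median home value in $1000's for a dict of attribute values."""
    return INTERCEPT + sum(c * tract[name] for name, c in COEFFICIENTS.items())


# Hypothetical tract; these attribute values are invented for illustration.
tract = {"CRIM": 0.1, "ZN": 0.0, "CHAS": 0, "NOX": 0.5, "RM": 6.0,
         "DIS": 4.0, "RAD": 4, "TAX": 300, "PTRATIO": 18.0, "B": 390.0,
         "LSTAT": 10.0}
crowded = dict(tract, PTRATIO=tract["PTRATIO"] + 1)

# One more pupil per teacher shifts the prediction by the PTRATIO coefficient.
print(round(predict_medv(crowded) - predict_medv(tract), 4))
```

Because the model is linear, the change in the prediction for a one-unit change in any attribute equals that attribute's coefficient, regardless of the other values.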
Technique II
Decision Tree

In data mining, a decision tree is a predictive model which maps observations about an item to
conclusions about the item's target value. In such trees, also known as classification trees, the
leaves represent class labels and the branches represent conjunctions of features that lead to
those class labels.

The WEKA classifier used in the example is J48, WEKA's implementation of the C4.5 algorithm. The
model tries to diagnose diseases of the urinary system.



Description of dataset

The dataset contains the following variables.

    1.    Temperature of patient
    2.   Occurrence of nausea { yes, no }
    3.   Lumbar pain { yes, no }
    4.   Urine pushing (continuous need for urination) { yes, no }
    5.   Micturition pains { yes, no }
    6.   Burning of urethra, itch, swelling of urethra outlet { yes, no }
    7.   Decision: Inflammation of urinary bladder { yes, no }
    8.   Decision: Nephritis of renal pelvis origin { yes, no }

For the purpose of the demonstration, the variable ‘Nephritis of renal pelvis origin’ was first
removed, and a decision tree was created for the prediction of inflammation of the urinary bladder.

Next, the variable ‘Inflammation of urinary bladder’ was removed instead, and a new decision tree
was created for the prediction of nephritis of renal pelvis origin.
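
The relation names in the outputs (Remove-R8 and Remove-R7) suggest that WEKA's Remove filter was applied with 1-based attribute indices. The same preparation step can be sketched in Python as dropping one column from each record; the helper `drop_column` and the two sample records are hypothetical.

```python
def drop_column(rows, index):
    """Return copies of rows without the column at the given 0-based index."""
    return [row[:index] + row[index + 1:] for row in rows]


# Hypothetical records with the eight attributes in the order listed above.
rows = [
    (35.5, "no", "yes", "no", "no", "no", "no", "no"),
    (41.0, "yes", "yes", "yes", "yes", "yes", "yes", "yes"),
]

bladder_view = drop_column(rows, 7)    # drop Nephritis, predict Inflammation
nephritis_view = drop_column(rows, 6)  # drop Inflammation, predict Nephritis
print(len(bladder_view[0]), bladder_view[0][-1], nephritis_view[1][-1])
```

In each view the remaining decision variable ends up as the last column, which is the position WEKA treats as the class attribute by default.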
Output

The WEKA output for prediction of the inflammation of urinary bladder was obtained as follows.



Model 1



=== Run information ===

Scheme:       weka.classifiers.trees.J48 -C 0.25 -M 2
Relation:     diagnosis-weka.filters.unsupervised.attribute.Remove-R8
Instances:    120
Attributes:   7
              temperature
              nausea
              Lumbar_pain
              Urine_pushing
              Micturition_pains
              Burning_of_urethra
              Inflammation_of_urinary_bladder
Test mode:    10-fold cross-validation



=== Classifier model (full training set) ===

J48 pruned tree
------------------

Urine_pushing = yes
|   Micturition_pains = yes: yes (49.0)
|   Micturition_pains = no
|   |   Lumbar_pain = yes: no (21.0)
|   |   Lumbar_pain = no: yes (10.0)
Urine_pushing = no: no (40.0)

Number of Leaves : 4

Size of the tree : 7

Time taken to build model: 0.01 seconds



=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances         120              100      %
Incorrectly Classified Instances         0                0      %
Kappa statistic                          1
Mean absolute error                      0
Root mean squared error                  0
Relative absolute error                  0      %
Root relative squared error              0      %
Total Number of Instances              120

=== Detailed Accuracy By Class ===

               TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
                 1         0          1          1         1           1        yes
                 1         0          1          1         1           1        no
Weighted Avg.    1         0          1          1         1           1

=== Confusion Matrix ===

  a   b   <-- classified as
 59   0   a = yes
  0  61   b = no




The tree is visualised as shown below.
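
In text form, the pruned tree printed in the Model 1 output reads as a set of nested rules. A minimal Python sketch of those rules follows; the function name is hypothetical, and the code only restates the tree WEKA reported rather than reproducing how J48 learned it.

```python
def predict_bladder_inflammation(urine_pushing, micturition_pains, lumbar_pain):
    """Follow the Model 1 J48 tree; each argument is "yes" or "no"."""
    if urine_pushing == "no":
        return "no"
    if micturition_pains == "yes":
        return "yes"
    # Urine pushing without micturition pains: lumbar pain decides.
    return "no" if lumbar_pain == "yes" else "yes"


print(predict_bladder_inflammation("yes", "no", "no"))  # yes
```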




The same experiment was repeated for predicting the occurrence of Nephritis of renal pelvis origin.

The following results were obtained.
Model 2



=== Run information ===

Scheme:       weka.classifiers.trees.J48 -C 0.25 -M 2
Relation:     diagnosis-weka.filters.unsupervised.attribute.Remove-R7
Instances:    120
Attributes:   7
              temperature
              nausea
              Lumbar_pain
              Urine_pushing
              Micturition_pains
              Burning_of_urethra
              Nephritis_of_renal_pelvis_origin
Test mode:    evaluate on training data



=== Classifier model (full training set) ===

J48 pruned tree
------------------

temperature <= 37.9: no (60.0)
temperature > 37.9
|   Lumbar_pain = yes: yes (50.0)
|   Lumbar_pain = no: no (10.0)

Number of Leaves : 3

Size of the tree : 5

Time taken to build model: 0 seconds



=== Evaluation on training set ===
=== Summary ===

Correctly Classified Instances         120              100      %
Incorrectly Classified Instances         0                0      %
Kappa statistic                          1
Mean absolute error                      0
Root mean squared error                  0
Relative absolute error                  0      %
Root relative squared error              0      %
Total Number of Instances              120

=== Detailed Accuracy By Class ===

               TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
                 1         0          1          1         1           1        yes
                 1         0          1          1         1           1        no
Weighted Avg.    1         0          1          1         1           1

=== Confusion Matrix ===

  a   b   <-- classified as
 50   0   a = yes
  0  70   b = no

The tree is visualised as shown below.
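
In text form, the Model 2 tree printed in the output reduces to two tests. The sketch below restates it in Python; the function name is hypothetical, and the temperature is assumed to be in degrees Celsius, which the output itself does not state.

```python
def predict_nephritis(temperature, lumbar_pain):
    """Follow the Model 2 J48 tree; temperature in degrees Celsius (assumed)."""
    if temperature <= 37.9:
        return "no"
    # Above the 37.9 threshold, lumbar pain decides.
    return "yes" if lumbar_pain == "yes" else "no"


print(predict_nephritis(40.0, "yes"))  # yes
```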




Interpretation

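As can be seen, both models classify 100% of the instances correctly. Note that Model 1 was
evaluated with 10-fold cross-validation, whereas Model 2 was evaluated on its own training data,
which is a less demanding test.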
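
The accuracy and kappa figures follow directly from the confusion matrices reported in the outputs. A minimal sketch of that arithmetic is given below; the function name is hypothetical, and kappa is computed with the standard Cohen's kappa formula, which is assumed to match the statistic WEKA reports.

```python
def accuracy_and_kappa(matrix):
    """matrix[i][j] = number of class-i instances classified as class j."""
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(len(matrix))) / total
    # Agreement expected by chance, from row and column totals.
    expected = sum(
        sum(matrix[i]) * sum(row[i] for row in matrix)
        for i in range(len(matrix))
    ) / total ** 2
    return observed, (observed - expected) / (1 - expected)


model_1 = [[59, 0], [0, 61]]  # confusion matrix reported for Model 1
model_2 = [[50, 0], [0, 70]]  # confusion matrix reported for Model 2
print(accuracy_and_kappa(model_1))
print(accuracy_and_kappa(model_2))
```

With no off-diagonal entries, both the accuracy and kappa come out as 1, in line with the summaries above.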

In Model 1, the differentiating factors were Urine pushing, Micturition pains and Lumbar pain.

In Model 2, the differentiating factors were Temperature and Lumbar Pain.



As both results show, lumbar pain is an important factor in diagnosing urinary infections in this
dataset.



Conclusion
This paper barely scratches the surface of the possible applications of data mining. This powerful
technique can have unique applications in business as well as in academic research. It may provide
clues to numerous questions by allowing us to make sense of the ever-growing volume of data.
Reference
  1. http://www.liaad.up.pt/~ltorgo/Regression/DataSets.html

  2. http://archive.ics.uci.edu/ml/datasets/Acute+Inflammations

  3. http://www.ibm.com/developerworks/opensource/library/os-weka1/index.html

  4. http://en.wikipedia.org/wiki/Decision_tree_learning

 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
 

IT for Business Intelligence Term Paper

  • 1. IT for Business Intelligence: Term Paper on Data Mining Techniques
    Prepared by: Niloy Ghosh
    Roll No: 10BM60054
    Second Year, MBA
    Vinod Gupta School of Management (VGSOM)
    IIT Kharagpur
  • 2. Introduction

    The purpose of this term paper is to demonstrate data mining techniques using the software tool WEKA. Data mining aims to transform large amounts of data into meaningful patterns and rules. Deriving meaning from large volumes of data has numerous business applications and is generating a great deal of interest.

    Waikato Environment for Knowledge Analysis (WEKA) is free, open-source software that can be used to mine data and generate useful information. WEKA's native input format is the Attribute-Relation File Format (ARFF), a flat file format in which the type of data being loaded is defined first, followed by the data itself.

    In this paper two techniques, Linear Regression and Decision Tree, are discussed with examples. The sources of the data used to demonstrate the techniques are listed in the Reference section.

    Technique I: Linear Regression

    Linear regression is used to predict the value of an unknown dependent variable from the values of a number of independent variables. In this example, the model tries to predict housing prices in the Boston area.

    Description of dataset

    The dataset contains details about housing in the Boston area. It contains 14 variables, defined as follows.

     1. CRIM: per capita crime rate by town
     2. ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
     3. INDUS: proportion of non-retail business acres per town
     4. CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
     5. NOX: nitric oxides concentration (parts per 10 million)
     6. RM: average number of rooms per dwelling
     7. AGE: proportion of owner-occupied units built prior to 1940
     8. DIS: weighted distances to five Boston employment centres
     9. RAD: index of accessibility to radial highways
    10. TAX: full-value property-tax rate per $10,000
    11. PTRATIO: pupil-teacher ratio by town
    12. B: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
    13. LSTAT: percentage of lower status of the population
    14. MEDV: median value of owner-occupied homes in $1000's

    The objective is to predict the housing values (i.e. the variable MEDV, which appears as CLASS in the WEKA output) using Linear Regression.
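To make the ARFF layout described in the introduction concrete, the sketch below builds a small ARFF document as a string: a header that declares the relation and the attribute types, followed by an `@data` section holding the rows. Only three of the housing attributes are used, the two data rows are made-up placeholder values rather than records from the dataset, and the helper name `to_arff` is hypothetical.

```python
def to_arff(relation, attributes, rows):
    """Render a relation as ARFF text: header declarations first, then data.

    attributes: list of (name, type) pairs, e.g. ("RM", "numeric") or
                ("CHAS", "{0,1}") for a nominal attribute.
    rows:       list of value tuples, one value per attribute.
    """
    lines = [f"@relation {relation}", ""]
    for name, arff_type in attributes:
        lines.append(f"@attribute {name} {arff_type}")
    lines += ["", "@data"]
    for row in rows:
        if len(row) != len(attributes):
            raise ValueError("row length does not match attribute count")
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


# Placeholder values for illustration only (not taken from the dataset).
attributes = [("CHAS", "{0,1}"), ("RM", "numeric"), ("MEDV", "numeric")]
rows = [(0, 6.5, 24.0), (1, 5.9, 21.5)]
arff_text = to_arff("housing", attributes, rows)
print(arff_text)
```

The type declarations come before any data, which is the "type first, data second" ordering the introduction refers to.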
  • 3. Output

    On running the model in WEKA, the following output was obtained.

    === Run information ===

    Scheme:       weka.classifiers.functions.LinearRegression -S 0 -R 1.0E-8
    Relation:     housing
    Instances:    506
    Attributes:   14
                  CRIM
                  ZN
                  INDUS
                  CHAS
                  NOX
                  RM
                  AGE
                  DIS
                  RAD
                  TAX
                  PTRATIO
                  B
                  LSTAT
                  CLASS
    Test mode:    split 70.0% train, remainder test

    === Classifier model (full training set) ===
  • 4. Linear Regression Model

    CLASS =

         -0.1084 * CRIM +
          0.0458 * ZN +
          2.7188 * CHAS +
        -17.3768 * NOX +
          3.8016 * RM +
         -1.4927 * DIS +
          0.2996 * RAD +
         -0.0118 * TAX +
         -0.9466 * PTRATIO +
          0.0093 * B +
         -0.5225 * LSTAT +
         36.342

    Time taken to build model: 0.05 seconds

    === Evaluation on test split ===
    === Summary ===

    Correlation coefficient                  0.8547
    Mean absolute error                      3.3219
    Root mean squared error                  4.6107
    Relative absolute error                 52.2759 %
    Root relative squared error             51.9447 %
    Total Number of Instances              152

    The experiment was conducted using a 70-30 split of the data (70% used to build the model, 30% used to test it).

    Interpretation

    The correlation coefficient between predicted and actual values on the test split is about 0.85, so the model is reasonably acceptable. Though the error values are quite high (a relative absolute error of about 52%), other methods have yielded only slightly better results. The following conclusions can be drawn:

    - The proportion of non-retail business (INDUS) and the age of the buildings (AGE) do not appear in the fitted model, so they are not factors in the valuation.
    - As expected, crime rates, air pollution and (high) tax rates have a negative effect on the house value.
    - The proportion of lower status population has a negative effect. Thus, low-income neighbourhoods will have lower house values than affluent neighbourhoods.
    - Interestingly, the pupil-teacher ratio has a negative and quite prominent effect. Thus, it is evident that educational facilities are a big concern when looking for a home, and people are ready to pay more for areas with better educational facilities.
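The fitted equation can be applied by hand to see how each coefficient moves the prediction. The sketch below copies the coefficients from the WEKA output and evaluates them on a single record; the record itself is a hypothetical tract invented for illustration, and the function names are not part of WEKA. Two of the error measures from the summary are included as plain definitions.

```python
import math

# Coefficients copied from the WEKA output; INDUS and AGE are absent
# because the fitted model does not include them.
COEFFICIENTS = {
    "CRIM": -0.1084, "ZN": 0.0458, "CHAS": 2.7188, "NOX": -17.3768,
    "RM": 3.8016, "DIS": -1.4927, "RAD": 0.2996, "TAX": -0.0118,
    "PTRATIO": -0.9466, "B": 0.0093, "LSTAT": -0.5225,
}
INTERCEPT = 36.342


def predict_medv(record):
    """Apply the fitted linear model to one record (dict of attribute values)."""
    return INTERCEPT + sum(c * record[name] for name, c in COEFFICIENTS.items())


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Square root of the average squared difference."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


# Hypothetical tract, for illustration only (not a record from the dataset).
tract = {"CRIM": 0.1, "ZN": 0.0, "CHAS": 0, "NOX": 0.5, "RM": 6.0, "DIS": 4.0,
         "RAD": 4, "TAX": 300, "PTRATIO": 18.0, "B": 390.0, "LSTAT": 10.0}
baseline = predict_medv(tract)

# Raising the pupil-teacher ratio by one changes the prediction by the
# PTRATIO coefficient, i.e. about -0.95 in units of $1000.
worse_schools = dict(tract, PTRATIO=tract["PTRATIO"] + 1)
print(round(predict_medv(worse_schools) - baseline, 4))
```

Because the model is linear, the effect of a one-unit change in any attribute is simply its coefficient, which is what the interpretation of the signs relies on.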
  • 5. Technique II: Decision Tree

    In data mining, a decision tree is a predictive model which maps observations about an item to conclusions about the item's target value. In these models, also known as classification trees, the leaves represent class labels and the branches represent conjunctions of features that lead to those class labels. The WEKA classifier used in the example is J48. The model tries to make a diagnosis of urinary system disease.

    Description of dataset

    The dataset contains the following variables.

    1. Temperature of patient
    2. Occurrence of nausea { yes, no }
    3. Lumbar pain { yes, no }
    4. Urine pushing (continuous need for urination) { yes, no }
    5. Micturition pains { yes, no }
    6. Burning of urethra, itch, swelling of urethra outlet { yes, no }
    7. Decision: Inflammation of urinary bladder { yes, no }
    8. Decision: Nephritis of renal pelvis origin { yes, no }

    For the purpose of the demonstration, the variable ‘Nephritis of renal pelvis origin’ was first removed, and a decision tree was created for the prediction of inflammation of urinary bladder. Next, the variable ‘Inflammation of urinary bladder’ was removed instead, and a new decision tree was created for the prediction of nephritis of renal pelvis origin.
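The attribute removal described above amounts to dropping one column so that the remaining decision variable ends up as the last attribute. A minimal sketch follows, using the attribute names that appear in the WEKA run information and a 1-based position as in the `Remove-R8` and `Remove-R7` relation names; the helper `remove_attribute` and the single record are made up for illustration.

```python
ATTRIBUTES = [
    "temperature", "nausea", "Lumbar_pain", "Urine_pushing",
    "Micturition_pains", "Burning_of_urethra",
    "Inflammation_of_urinary_bladder", "Nephritis_of_renal_pelvis_origin",
]


def remove_attribute(names, rows, position):
    """Drop the attribute at a 1-based position from the header and every row."""
    if not 1 <= position <= len(names):
        raise IndexError("attribute position out of range")
    i = position - 1
    kept_names = names[:i] + names[i + 1:]
    kept_rows = [row[:i] + row[i + 1:] for row in rows]
    return kept_names, kept_rows


# One made-up record, for illustration only.
rows = [(37.2, "no", "yes", "no", "no", "no", "no", "no")]

# Model 1: drop attribute 8 so that bladder inflammation is the last column.
names_1, rows_1 = remove_attribute(ATTRIBUTES, rows, 8)
# Model 2: drop attribute 7 so that nephritis is the last column.
names_2, rows_2 = remove_attribute(ATTRIBUTES, rows, 7)
print(names_1[-1], names_2[-1])
```

Each call leaves seven attributes, matching the attribute count reported for both models.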
  • 6. Output

    The WEKA output for prediction of the inflammation of urinary bladder was obtained as follows.

    Model 1

    === Run information ===

    Scheme:       weka.classifiers.trees.J48 -C 0.25 -M 2
    Relation:     diagnosis-weka.filters.unsupervised.attribute.Remove-R8
    Instances:    120
    Attributes:   7
                  temperature
                  nausea
                  Lumbar_pain
                  Urine_pushing
                  Micturition_pains
                  Burning_of_urethra
                  Inflammation_of_urinary_bladder
    Test mode:    10-fold cross-validation

    === Classifier model (full training set) ===

    J48 pruned tree
    ------------------

    Urine_pushing = yes
    |   Micturition_pains = yes: yes (49.0)
    |   Micturition_pains = no
  • 7.
    |   |   Lumbar_pain = yes: no (21.0)
    |   |   Lumbar_pain = no: yes (10.0)
    Urine_pushing = no: no (40.0)

    Number of Leaves  :  4
    Size of the tree  :  7

    Time taken to build model: 0.01 seconds

    === Stratified cross-validation ===
    === Summary ===

    Correctly Classified Instances         120              100 %
    Incorrectly Classified Instances         0                0 %
    Kappa statistic                          1
    Mean absolute error                      0
    Root mean squared error                  0
    Relative absolute error                  0 %
    Root relative squared error              0 %
    Total Number of Instances              120

    === Detailed Accuracy By Class ===

                   TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
                      1        0         1        1         1         1      yes
                      1        0         1        1         1         1      no
    Weighted Avg.     1        0         1        1         1         1
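The pruned tree printed for Model 1 reads as a short chain of nested tests. The sketch below writes it out as a plain function and tallies the leaf sizes that WEKA prints in parentheses; the function name is hypothetical, and the inputs are the "yes"/"no" strings used in the dataset.

```python
def predict_bladder_inflammation(urine_pushing, micturition_pains, lumbar_pain):
    """Follow the Model 1 tree; each argument is "yes" or "no"."""
    if urine_pushing == "no":
        return "no"
    if micturition_pains == "yes":
        return "yes"
    # Urine pushing without micturition pains: lumbar pain decides.
    return "no" if lumbar_pain == "yes" else "yes"


# Leaf sizes as printed by WEKA: (predicted class, number of instances).
LEAVES = [("yes", 49), ("no", 21), ("yes", 10), ("no", 40)]
total = sum(count for _, count in LEAVES)
per_class = {
    label: sum(count for leaf_label, count in LEAVES if leaf_label == label)
    for label in ("yes", "no")
}
print(total, per_class)
```

The four leaves account for all 120 instances, 59 labelled yes and 61 labelled no, which agrees with the class totals in the confusion matrix reported for this model.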
  • 8.
    === Confusion Matrix ===

      a  b   <-- classified as
     59  0   a = yes
      0 61   b = no

    The tree is visualised as shown below.

    [Figure: J48 tree for Model 1]

    The same experiment was repeated for predicting the occurrence of Nephritis of renal pelvis origin. The following results were obtained.
  • 9. Model 2

    === Run information ===

    Scheme:       weka.classifiers.trees.J48 -C 0.25 -M 2
    Relation:     diagnosis-weka.filters.unsupervised.attribute.Remove-R7
    Instances:    120
    Attributes:   7
                  temperature
                  nausea
                  Lumbar_pain
                  Urine_pushing
                  Micturition_pains
                  Burning_of_urethra
                  Nephritis_of_renal_pelvis_origin
    Test mode:    evaluate on training data

    === Classifier model (full training set) ===

    J48 pruned tree
    ------------------

    temperature <= 37.9: no (60.0)
    temperature > 37.9
    |   Lumbar_pain = yes: yes (50.0)
    |   Lumbar_pain = no: no (10.0)

    Number of Leaves  :  3
  • 10.
    Size of the tree  :  5

    Time taken to build model: 0 seconds

    === Evaluation on training set ===
    === Summary ===

    Correctly Classified Instances         120              100 %
    Incorrectly Classified Instances         0                0 %
    Kappa statistic                          1
    Mean absolute error                      0
    Root mean squared error                  0
    Relative absolute error                  0 %
    Root relative squared error              0 %
    Total Number of Instances              120

    === Detailed Accuracy By Class ===

                   TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
                      1        0         1        1         1         1      yes
                      1        0         1        1         1         1      no
    Weighted Avg.     1        0         1        1         1         1

    === Confusion Matrix ===

      a  b   <-- classified as
     50  0   a = yes
      0 70   b = no
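The Model 2 tree and the summary figures derived from its confusion matrix can be reproduced in a few lines. The sketch below encodes the tree as a function and computes accuracy and Cohen's kappa from a confusion matrix; the function names are hypothetical, and the kappa formula is the standard one (observed agreement corrected for chance agreement), assumed here to correspond to the "Kappa statistic" line in the WEKA summary.

```python
def predict_nephritis(temperature, lumbar_pain):
    """Follow the Model 2 tree; lumbar_pain is "yes" or "no"."""
    if temperature <= 37.9:
        return "no"
    return "yes" if lumbar_pain == "yes" else "no"


def accuracy_and_kappa(matrix):
    """Accuracy and Cohen's kappa from a square confusion matrix (rows = actual)."""
    size = len(matrix)
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(size)) / total
    # Chance agreement: product of row and column totals for each class.
    expected = sum(
        sum(matrix[i]) * sum(row[i] for row in matrix) for i in range(size)
    ) / total ** 2
    return observed, (observed - expected) / (1 - expected)


# Confusion matrix reported for Model 2 (rows: actual yes, actual no).
accuracy, kappa = accuracy_and_kappa([[50, 0], [0, 70]])
print(accuracy, kappa)
```

With no off-diagonal entries, both values come out as 1, matching the summary. The same helper applied to a matrix with misclassifications would give a kappa below 1 even when accuracy stays high.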
  • 11. The tree is visualised below.

    [Figure: J48 tree for Model 2]

    Interpretation

    In both models, 100% of the instances were classified correctly. Note that Model 1 was evaluated with 10-fold cross-validation, whereas Model 2 was evaluated on its own training data, which is the more optimistic of the two test modes. In Model 1, the differentiating factors were urine pushing, micturition pains and lumbar pain. In Model 2, the differentiating factors were temperature and lumbar pain. Lumbar pain appears in both trees and is therefore an important factor in diagnosing these urinary conditions.

    Conclusion

    This paper barely scratches the surface of the possible applications of data mining. The technique can have unique applications in business as well as in academic research. It may provide clues to numerous questions by allowing us to make sense of the ever-growing volume of data.
  • 12. Reference

    1. http://www.liaad.up.pt/~ltorgo/Regression/DataSets.html
    2. http://archive.ics.uci.edu/ml/datasets/Acute+Inflammations
    3. http://www.ibm.com/developerworks/opensource/library/os-weka1/index.html
    4. http://en.wikipedia.org/wiki/Decision_tree_learning