VINOD GUPTA SCHOOL OF MANAGEMENT, IIT KHARAGPUR




 Data Mining using Weka
A Paper on Data Mining techniques using Weka
                  software



                        MBA 2010-2012


           IT FOR BUSINESS INTELLIGENCE – TERM PAPER

             INSTRUCTOR – PROF. PRITHWIS MUKERJEE




                                                         SUBMITTED BY
                                                       SATHISHWARAN.R
                                                            10BM60079
                                                         MBA 2010-2012
Data Mining using WEKA                      2



Table of Contents
  1. INTRODUCTION
  2. CLASSIFICATION
       2.1 DATA
       2.2 SCREENS
       2.3 OUTPUT
       2.4 INTERPRETATION
  3. ASSOCIATION RULES
       3.1 DATA
       3.2 SCREENS
       3.3 OUTPUT
       3.4 INTERPRETATION
  4. REFERENCES


1. INTRODUCTION

Widespread use of computers has made life easier for business executives. However, it has also
led to a proliferation of data, which makes it difficult to extract meaning from that data. The
sheer amount of data generated in the world today makes decision making difficult. Data
mining is one approach that identifies patterns in data and supports decision making by
analysing this huge ocean of data. Weka (Waikato Environment for Knowledge Analysis) is free
software developed at the University of Waikato in New Zealand and is available under the GNU
General Public License. The software can be used for research, education and applications. It has
a graphical user interface and a comprehensive set of tools for analysing data. In this paper I
apply data mining techniques using the Weka software.


2. CLASSIFICATION

2.1 Data

The raw data used for this analysis was obtained from the website http://tunedit.org/ and was
originally gathered from census data. There are 14 original attributes (features), including age,
work class, education, marital status, occupation and native country. The data
contains continuous, binary and categorical features. I have used the data for a two-class
classification problem. The task is to identify high-income people from the census data and
to use cross-validation to check how accurately the data is classified.

Link: http://tunedit.org/repo/Data/Agnostic-vs-Prior/Training/ada_prior_train.arff
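The linked file is in Weka's ARFF format: a header of @relation and @attribute declarations followed by comma-separated rows after @data. As a rough illustration of that layout, the sketch below parses a tiny inline sample. The three sample rows are hypothetical and are not taken from the real file, and the parser ignores quoting, sparse rows and other features of the full format.

```python
def parse_arff(text):
    """Parse a small, simple ARFF document into (relation, attributes, rows)."""
    relation, attributes, rows = None, [], []
    in_data = False
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("%"):  # skip blank lines and comments
            continue
        if in_data:
            rows.append([value.strip() for value in line.split(",")])
            continue
        keyword = line.split(None, 1)[0].lower()
        if keyword == "@relation":
            relation = line.split(None, 1)[1]
        elif keyword == "@attribute":
            _, name, kind = line.split(None, 2)
            attributes.append((name, kind))
        elif keyword == "@data":
            in_data = True
    return relation, attributes, rows


# Hypothetical sample: attribute names follow the paper, values are invented.
SAMPLE = """\
% three invented rows for illustration only
@relation ADA_Prior
@attribute age numeric
@attribute sex {Male,Female}
@attribute label {-1,1}
@data
39,Male,-1
50,Female,1
28,?,-1
"""

relation, attributes, rows = parse_arff(SAMPLE)
print(relation, len(attributes), "attributes,", len(rows), "rows")
```

A numeric declaration marks a continuous feature and a brace-enclosed list marks a categorical (or binary) one; a question mark stands for a missing value.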

2.2 Screens

Step 1: Launch Weka


Step 2: Click Explorer




Step 3: Click Open file


Step 4: The data is loaded into Weka




Step 5: On the Classify tab, choose the Decision Table classifier and the Cross-validation test
option, then click Start
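The cross-validation option selected above splits the data into 10 folds and, in turn, tests on each fold a model trained on the other nine. The sketch below shows one simple way to build stratified folds, that is, folds that keep roughly the same class proportions as the full data set. It is an illustration only and does not reproduce Weka's own shuffling; the function name and the seed are assumptions, and the class counts (3118 and 1029) are the row totals of the confusion matrix reported in the output.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=1):
    """Split instance indices into k folds with similar class proportions."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    offset = 0  # continue dealing where the previous class stopped
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[(offset + position) % k].append(index)
        offset += len(indices)
    return folds


labels = [-1] * 3118 + [1] * 1029
folds = stratified_folds(labels)
for number, test_fold in enumerate(folds, start=1):
    positives = sum(1 for i in test_fold if labels[i] == 1)
    print(f"fold {number}: {len(test_fold)} instances, {positives} of class 1")
```

Each fold serves once as the test set, so every one of the 4147 instances is classified exactly once by a model that never saw it during training.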


2.3 Output

Cross-validation

       === Run information ===

       Scheme: weka.classifiers.rules.DecisionTable -X 1 -S "weka.attributeSelection.BestFirst -D 1 -N 5"
       Relation: ADA_Prior
       Instances: 4147
       Attributes: 15
              age
              workclass
              fnlwgt
              education
              educationNum
              maritalStatus
              occupation
              relationship
              race
              sex
              capitalGain
              capitalLoss
              hoursPerWeek
              nativeCountry
              label
       Test mode:10-fold cross-validation

       === Classifier model (full training set) ===

       Decision Table:

       Number of training instances: 4147
       Number of Rules: 130
       Non matches covered by Majority class.
              Best first.
              Start set: no attributes
              Search direction: forward
              Stale search after 5 node expansions
              Total number of subsets evaluated: 96
              Merit of best subset found: 83.82
       Evaluation (for feature selection): CV (leave one out)
       Feature set: 5, 8,11,12,15

       Time taken to build model: 0.98 seconds

       === Stratified cross-validation ===


       === Summary ===

       Correctly Classified Instances     3461      83.4579 %
       Incorrectly Classified Instances    686      16.5421 %
       Kappa statistic              0.5073
       Mean absolute error              0.2353
       Root mean squared error             0.339
       Relative absolute error          63.0518 %
       Root relative squared error        78.4907 %
       Total Number of Instances         4147

       === Detailed Accuracy By Class ===

                       TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
                        0.939     0.483      0.855      0.939     0.895       0.873     -1
                        0.517     0.061      0.738      0.517     0.608       0.873     1
       Weighted Avg.    0.835     0.378      0.826      0.835     0.824       0.873

       === Confusion Matrix ===

               a     b   <-- classified as
            2929   189 |   a = -1
             497   532 |   b = 1
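Most of the summary figures can be recomputed from the confusion matrix alone. The sketch below derives accuracy, the kappa statistic and the per-class precision, recall and F-measure from the four counts above; the function name is an assumption made for this illustration, and the error measures and ROC area cannot be obtained this way because they depend on the predicted class probabilities.

```python
def summarise(matrix, labels):
    """Compute accuracy, kappa and per-class precision, recall and F-measure.

    matrix[i][j] is the number of instances of actual class i predicted as class j.
    """
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    accuracy = correct / total
    row_totals = [sum(row) for row in matrix]        # actual class counts
    col_totals = [sum(col) for col in zip(*matrix)]  # predicted class counts
    # Agreement expected by chance, given the row and column totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    kappa = (accuracy - expected) / (1 - expected)
    per_class = {}
    for i, label in enumerate(labels):
        precision = matrix[i][i] / col_totals[i]
        recall = matrix[i][i] / row_totals[i]
        f_measure = 2 * precision * recall / (precision + recall)
        per_class[label] = (precision, recall, f_measure)
    return accuracy, kappa, per_class


matrix = [[2929, 189], [497, 532]]
accuracy, kappa, per_class = summarise(matrix, ["-1", "1"])
print(f"accuracy = {accuracy:.4%}, kappa = {kappa:.4f}")
for label, (p, r, f) in per_class.items():
    print(f"class {label}: precision {p:.3f}, recall {r:.3f}, F-measure {f:.3f}")
```

After rounding, these values agree with the accuracy, kappa and per-class figures in the Weka output.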

2.4 Interpretation

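      Under 10-fold cross-validation, 83.46 % of the instances (3461 of 4147) are classified
       correctly and 16.54 % (686) are classified incorrectly.
      The kappa statistic is 0.5073. It measures agreement between predicted and actual classes
       beyond what chance would give and is commonly read as moderate agreement; it is not an
       accuracy percentage.
      The mean absolute error is 0.2353 and the root mean squared error is 0.339.
      The classifier recognises class -1 far better (recall 0.939) than class 1 (recall 0.517):
       497 of the 1029 class 1 instances are misclassified as -1.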
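The model behind these figures is a decision table: it looks up an instance by the values of a few selected attributes and returns the majority class seen for that combination, falling back to the overall majority class when there is no match ("Non matches covered by Majority class" in the output). Assuming the indices in the reported feature set 5, 8, 11, 12, 15 are 1-based positions in the attribute list, they correspond to educationNum, relationship, capitalGain, capitalLoss and label. The sketch below shows the lookup idea on hypothetical toy rows with two simplified features; it is not Weka's implementation and omits the best-first feature search.

```python
from collections import Counter, defaultdict


def build_decision_table(rows, feature_names, class_name):
    """Map each combination of selected feature values to its majority class."""
    counts = defaultdict(Counter)
    overall = Counter()
    for row in rows:
        key = tuple(row[name] for name in feature_names)
        counts[key][row[class_name]] += 1
        overall[row[class_name]] += 1
    table = {key: c.most_common(1)[0][0] for key, c in counts.items()}
    default = overall.most_common(1)[0][0]  # used when no table entry matches
    return table, default


def classify(table, default, row, feature_names):
    """Look up the row's feature combination, falling back to the majority class."""
    key = tuple(row[name] for name in feature_names)
    return table.get(key, default)


# Hypothetical training rows; the values are invented for illustration.
features = ["relationship", "education"]
training = [
    {"relationship": "Husband", "education": "high", "label": "1"},
    {"relationship": "Husband", "education": "high", "label": "1"},
    {"relationship": "Husband", "education": "low", "label": "-1"},
    {"relationship": "Own-child", "education": "low", "label": "-1"},
    {"relationship": "Own-child", "education": "high", "label": "-1"},
]
table, default = build_decision_table(training, features, "label")
seen = classify(table, default, {"relationship": "Husband", "education": "high"}, features)
unseen = classify(table, default, {"relationship": "Wife", "education": "high"}, features)
print(seen, unseen)
```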

3. ASSOCIATION RULES


3.1 Data

The data set records the votes of each member of the U.S. House of Representatives on the 16
key votes identified by the Congressional Quarterly Almanac (CQA). The CQA lists nine different
types of votes: voted for, paired for, and announced for (these three simplified to yea), voted
against, paired against, and announced against (these three simplified to nay), voted present,
voted present to avoid conflict of interest, and did not vote or otherwise make a position known
(these three simplified to an unknown disposition).
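The simplification described above amounts to a lookup from the nine vote types to three codes. A minimal sketch follows, assuming the codes y and n listed under Attribute Information and a question mark for the unknown disposition; the dictionary and function names are hypothetical.

```python
# Nine CQA vote types collapsed to yea (y), nay (n) or unknown (?).
VOTE_CODES = {
    "voted for": "y",
    "paired for": "y",
    "announced for": "y",
    "voted against": "n",
    "paired against": "n",
    "announced against": "n",
    "voted present": "?",
    "voted present to avoid conflict of interest": "?",
    "did not vote or otherwise make a position known": "?",
}


def simplify(vote_type):
    """Return the simplified code for one of the nine CQA vote types."""
    return VOTE_CODES[vote_type.strip().lower()]


print(simplify("Paired for"), simplify("announced against"), simplify("voted present"))
```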

       Number of Instances: 435 (267 democrats, 168 republicans)
       Number of Attributes: 16 + class name = 17 (all Boolean valued)


Attribute Information:

      Class Name: 2 (democrat, republican)
      handicapped-infants: 2 (y,n)
      water-project-cost-sharing: 2 (y,n)
      adoption-of-the-budget-resolution: 2 (y,n)
      physician-fee-freeze: 2 (y,n)
      el-salvador-aid: 2 (y,n)
      religious-groups-in-schools: 2 (y,n)
      anti-satellite-test-ban: 2 (y,n)
      aid-to-nicaraguan-contras: 2 (y,n)
      mx-missile: 2 (y,n)
      immigration: 2 (y,n)
      synfuels-corporation-cutback: 2 (y,n)
      education-spending: 2 (y,n)
      superfund-right-to-sue: 2 (y,n)
      crime: 2 (y,n)
      duty-free-exports: 2 (y,n)
      export-administration-act-south-africa: 2 (y,n)

Link: http://tunedit.org/repo/UCI/vote.arff

3.2 Screens

Step 1: Launch Weka


Step 2: Click Explorer




Step 3: Click Open file… and choose respective file


Step 4: Click Associate and choose Apriori




Step 5: Click Start




3.3 Output

=== Run information ===
Scheme:     weka.associations.Apriori -N 10 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S -1.0 -c -1
Relation: vote
Instances: 435
Attributes: 17
       handicapped-infants
      water-project-cost-sharing
      adoption-of-the-budget-resolution
      physician-fee-freeze
      el-salvador-aid
      religious-groups-in-schools
      anti-satellite-test-ban
      aid-to-nicaraguan-contras
      mx-missile
      immigration
      synfuels-corporation-cutback
      education-spending
      superfund-right-to-sue
      crime
      duty-free-exports
      export-administration-act-south-africa
      Class
=== Associator model (full training set) ===

Apriori
=======

Minimum support: 0.45 (196 instances)
Minimum metric <confidence>: 0.9
Number of cycles performed: 11

Generated sets of large itemsets:

Size of set of large itemsets L(1): 20
Size of set of large itemsets L(2): 17
Size of set of large itemsets L(3): 6
Size of set of large itemsets L(4): 1

Best rules found:

1. adoption-of-the-budget-resolution=y physician-fee-freeze=n 219 ==> Class=democrat 219 conf:(1)
2. adoption-of-the-budget-resolution=y physician-fee-freeze=n aid-to-nicaraguan-contras=y 198 ==> Class=democrat 198 conf:(1)
3. physician-fee-freeze=n aid-to-nicaraguan-contras=y 211 ==> Class=democrat 210 conf:(1)
4. physician-fee-freeze=n education-spending=n 202 ==> Class=democrat 201 conf:(1)
5. physician-fee-freeze=n 247 ==> Class=democrat 245 conf:(0.99)
6. el-salvador-aid=n Class=democrat 200 ==> aid-to-nicaraguan-contras=y 197 conf:(0.99)
7. el-salvador-aid=n 208 ==> aid-to-nicaraguan-contras=y 204 conf:(0.98)
8. adoption-of-the-budget-resolution=y aid-to-nicaraguan-contras=y Class=democrat 203 ==> physician-fee-freeze=n 198 conf:(0.98)
9. el-salvador-aid=n aid-to-nicaraguan-contras=y 204 ==> Class=democrat 197 conf:(0.97)
10. aid-to-nicaraguan-contras=y Class=democrat 218 ==> physician-fee-freeze=n 210 conf:(0.96)
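In each rule, the number before ==> counts the instances that match the left-hand side and the number after it counts those that also match the right-hand side; the confidence is the ratio of the two. The sketch below recomputes confidence and support for a few of the rules from the counts reported above. The function name is an assumption for this illustration.

```python
import math


def confidence(antecedent_count, rule_count):
    """Confidence of A ==> B: share of instances matching A that also match B."""
    return rule_count / antecedent_count


instances = 435
# Minimum support of 0.45 expressed as a whole number of instances.
min_support_count = math.ceil(0.45 * instances)

# rule number: (instances matching the left side, instances matching both sides)
reported = {
    1: (219, 219),
    5: (247, 245),
    7: (208, 204),
    9: (204, 197),
    10: (218, 210),
}
print("minimum support count:", min_support_count)
for number, (antecedent, both) in reported.items():
    print(f"rule {number}: support {both / instances:.2f}, "
          f"confidence {confidence(antecedent, both):.2f}")
```

Every rule must cover at least the minimum support count of instances and reach the minimum confidence of 0.9 to be listed.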

3.4 Interpretation

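With a minimum support of 0.45 (196 instances) and a minimum confidence of 0.9, Apriori produced
the ten rules listed in the output. Six of them have Class=democrat as the consequent. The
strongest single indicator is physician-fee-freeze=n: 245 of the 247 members who voted against
the physician fee freeze are democrats (confidence 0.99). Because only 168 of the 435 instances
are republicans, no itemset containing Class=republican can reach the minimum support of 196
instances, so every rule that involves the class refers to democrats.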
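Apriori finds its large itemsets level by level: it counts single items, keeps those that meet the minimum support, joins the survivors into candidates one item larger, and repeats until no candidates remain. The sketch below illustrates that search on five hypothetical transactions with shortened item names. It is a simplified version of the general technique, not Weka's implementation, and it stops at the frequent itemsets without generating rules.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return {itemset: count} for itemsets found in at least min_count transactions."""
    transactions = [frozenset(t) for t in transactions]
    items = {item for t in transactions for item in t}
    current = {frozenset([item]) for item in items}
    frequent = {}
    size = 1
    while current:
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        size += 1
        # Join step: merge frequent itemsets into candidates one item larger.
        candidates = {a | b for a in level for b in level if len(a | b) == size}
        # Prune step: every subset one item smaller must itself be frequent.
        current = {c for c in candidates
                   if all(frozenset(s) in level for s in combinations(c, size - 1))}
    return frequent


# Hypothetical transactions; item names are shortened and the votes are invented.
transactions = [
    {"fee-freeze=n", "budget=y", "Class=democrat"},
    {"fee-freeze=n", "budget=y", "Class=democrat"},
    {"fee-freeze=n", "Class=democrat"},
    {"fee-freeze=y", "budget=n", "Class=republican"},
    {"fee-freeze=y", "Class=republican"},
]
frequent = apriori(transactions, min_count=2)
for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

The prune step is what keeps the search manageable: an itemset can only be frequent if all of its smaller subsets are, so most combinations are never counted.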

4. REFERENCES:

      Book: Data Mining – Practical Machine Learning Tools and Techniques, Ian H. Witten,
       Eibe Frank, Mark A. Hall

      http://www.cs.waikato.ac.nz/ml/weka/

      http://www.tunedit.org/repo/Data/Agnostic-vs-Prior/Training/ada_prior_train.arff

      http://tunedit.org/repo/UCI/vote.arff

Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 

Weka project - Classification & Association Rule Generation

  • 1. VINOD GUPTA SCHOOL OF MANAGEMENT, IIT KHARAGPUR Data Mining using Weka A Paper on Data Mining techniques using Weka software MBA 2010-2012 IT FOR BUSINESS INTELLIGENCE – TERM PAPER INSTRUCTOR – PROF. PRITHWIS MUKERJEE SUBMITTED BY SATHISHWARAN.R 10BM60079 MBA 2010-2012
  • 2. Table of Contents
       1. INTRODUCTION
       2. CLASSIFICATION
          2.1 DATA
          2.2 SCREENS
          2.3 OUTPUT
          2.4 INTERPRETATION
       3. ASSOCIATION RULES
          3.1 DATA
          3.2 SCREENS
          3.3 OUTPUT
          3.4 INTERPRETATION
       4. REFERENCES
  • 3. 1. INTRODUCTION
       Widespread use of computers has made life easier for business executives. However, it has also led to a proliferation of data, which makes it difficult to extract meaning from that data, and the sheer volume generated today complicates decision making. Data mining is one approach that identifies patterns in data and supports decisions by analysing this ocean of data.
       Weka (Waikato Environment for Knowledge Analysis) is free software developed at the University of Waikato in New Zealand and available under the GNU General Public License. It can be used for research, education and applications, and it offers a graphical user interface and a comprehensive set of tools for analysing data. In this paper I apply two data mining techniques, classification and association rule generation, using the Weka software.
       2. CLASSIFICATION
       2.1 Data
       The raw data used for this analysis was obtained from http://tunedit.org/ and was originally gathered from census data. There are 14 original attributes (features), including age, work class, education, marital status, occupation and native country. The data contain continuous, binary and categorical features. I have used the data for a two-class classification problem: the task is to discover high-revenue people from the census data and to check by cross-validation how accurately the instances are classified.
       Link: http://tunedit.org/repo/Data/Agnostic-vs-Prior/Training/ada_prior_train.arff
       2.2 Screens
       Step 1: Launch Weka
  • 4. Step 2: Click Explorer
       Step 3: Click Open file
  • 5. Step 4: The data is loaded into Weka
       Step 5: On the Classify tab, choose the DecisionTable classifier, select Cross-validation as the test option and click Start
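The cross-validation option selected in the last step above holds out each part of the data once for testing while the rest is used for training. The following is a minimal Python sketch of a stratified 10-fold split, not Weka's own procedure: the function name `stratified_folds`, the seed and the round-robin dealing are assumptions made for illustration, and the class counts (3118 and 1029) are read from the rows of the confusion matrix in section 2.3.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=1):
    """Return k lists of row indices whose class proportions are roughly equal.

    Hypothetical helper for illustration; Weka's internal stratification may differ.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(k)]
    position = 0
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the shuffled indices round-robin, continuing where the
        # previous class stopped so that fold sizes stay balanced.
        for index in indices:
            folds[position % k].append(index)
            position += 1
    return folds


# Class counts taken from the confusion matrix rows (class -1 and class 1).
labels = [-1] * 3118 + [1] * 1029
folds = stratified_folds(labels)

for number, test_fold in enumerate(folds, start=1):
    positives = sum(1 for i in test_fold if labels[i] == 1)
    # A classifier would be trained on the other nine folds and tested on this one.
    print(f"fold {number:2d}: {len(test_fold)} test instances, {positives} of class 1")
```

Each instance lands in exactly one test fold, so the cross-validated totals in the output cover all 4147 instances once.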
  • 6. 2.3 Output
       Cross-validation

       === Run information ===

       Scheme:       weka.classifiers.rules.DecisionTable -X 1 -S "weka.attributeSelection.BestFirst -D 1 -N 5"
       Relation:     ADA_Prior
       Instances:    4147
       Attributes:   15
                     age
                     workclass
                     fnlwgt
                     education
                     educationNum
                     maritalStatus
                     occupation
                     relationship
                     race
                     sex
                     capitalGain
                     capitalLoss
                     hoursPerWeek
                     nativeCountry
                     label
       Test mode:    10-fold cross-validation

       === Classifier model (full training set) ===

       Decision Table:

       Number of training instances: 4147
       Number of Rules: 130
       Non matches covered by Majority class.
       Best first.
       Start set: no attributes
       Search direction: forward
       Stale search after 5 node expansions
       Total number of subsets evaluated: 96
       Merit of best subset found: 83.82
       Evaluation (for feature selection): CV (leave one out)
       Feature set: 5,8,11,12,15

       Time taken to build model: 0.98 seconds

       === Stratified cross-validation ===
  • 7. === Summary ===

       Correctly Classified Instances        3461        83.4579 %
       Incorrectly Classified Instances       686        16.5421 %
       Kappa statistic                          0.5073
       Mean absolute error                      0.2353
       Root mean squared error                  0.339
       Relative absolute error                 63.0518 %
       Root relative squared error             78.4907 %
       Total Number of Instances             4147

       === Detailed Accuracy By Class ===

                      TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
                      0.939    0.483    0.855      0.939   0.895      0.873     -1
                      0.517    0.061    0.738      0.517   0.608      0.873     1
       Weighted Avg.  0.835    0.378    0.826      0.835   0.824      0.873

       === Confusion Matrix ===

           a    b   <-- classified as
        2929  189 |    a = -1
         497  532 |    b = 1

       2.4 Interpretation
         • Of the 4147 instances, 3461 (83.46 %) were classified correctly and 686 (16.54 %) were classified incorrectly.
         • The kappa statistic is 0.5073. Kappa is not an accuracy figure: it measures agreement between predicted and actual classes beyond what chance alone would produce (0 is chance level, 1 is perfect agreement). With an observed agreement of 0.8346 and a chance agreement of 0.6643 computed from the row and column totals of the confusion matrix, kappa = (0.8346 - 0.6643) / (1 - 0.6643), which is about 0.507 and is commonly read as moderate agreement.
         • The mean absolute error is 0.2353 and the root mean squared error is 0.339.
         • The errors are unevenly spread over the two classes: recall is 0.939 for class -1 but only 0.517 for class 1, because 497 of the 1029 class 1 instances were classified as -1.
         • The feature set 5,8,11,12,15 shows that the decision table uses four attributes (educationNum, relationship, capitalGain and capitalLoss) together with the class label.

       3. ASSOCIATION RULES
       3.1 Data
       The data set records the votes of each member of the U.S. House of Representatives on the 16 key votes identified by the Congressional Quarterly Almanac (CQA). The CQA lists nine types of vote: voted for, paired for and announced for (these three simplified to yea); voted against, paired against and announced against (these three simplified to nay); and voted present, voted present to avoid conflict of interest, and did not vote or otherwise make a position known (these three simplified to an unknown disposition).
       Number of Instances: 435 (267 democrats, 168 republicans)
       Number of Attributes: 16 + class name = 17 (all Boolean valued)
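The simplification of the nine CQA vote types into three values can be written as a lookup table. This is only an illustrative Python sketch: the names `SIMPLIFIED_VOTE` and `simplify` are hypothetical, and using "?" for the unknown disposition is an assumption based on the ARFF convention for missing values.

```python
# The nine vote types named in the text, each mapped to its simplified value.
# "?" for an unknown disposition is an assumption (ARFF's missing-value marker).
SIMPLIFIED_VOTE = {
    "voted for": "y",
    "paired for": "y",
    "announced for": "y",
    "voted against": "n",
    "paired against": "n",
    "announced against": "n",
    "voted present": "?",
    "voted present to avoid conflict of interest": "?",
    "did not vote or otherwise make a position known": "?",
}


def simplify(vote_type: str) -> str:
    """Return 'y', 'n' or '?' for one of the nine CQA vote types."""
    return SIMPLIFIED_VOTE[vote_type.strip().lower()]


print(simplify("Paired for"), simplify("announced against"), simplify("voted present"))
```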
  • 8. Attribute Information:
         • Class Name: 2 (democrat, republican)
         • handicapped-infants: 2 (y,n)
         • water-project-cost-sharing: 2 (y,n)
         • adoption-of-the-budget-resolution: 2 (y,n)
         • physician-fee-freeze: 2 (y,n)
         • el-salvador-aid: 2 (y,n)
         • religious-groups-in-schools: 2 (y,n)
         • anti-satellite-test-ban: 2 (y,n)
         • aid-to-nicaraguan-contras: 2 (y,n)
         • mx-missile: 2 (y,n)
         • immigration: 2 (y,n)
         • synfuels-corporation-cutback: 2 (y,n)
         • education-spending: 2 (y,n)
         • superfund-right-to-sue: 2 (y,n)
         • crime: 2 (y,n)
         • duty-free-exports: 2 (y,n)
         • export-administration-act-south-africa: 2 (y,n)
       Link: http://tunedit.org/repo/UCI/vote.arff
       3.2 Screens
       Step 1: Launch Weka
  • 9. Step 2: Click Explorer
       Step 3: Click Open file… and choose the data file (vote.arff)
  • 10. Step 4: Click Associate and choose Apriori
        Step 5: Click Start
        3.3 Output

        === Run information ===

        Scheme:       weka.associations.Apriori -N 10 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S -1.0 -c -1
        Relation:     vote
        Instances:    435
        Attributes:   17
                      handicapped-infants
  • 11.               water-project-cost-sharing
                      adoption-of-the-budget-resolution
                      physician-fee-freeze
                      el-salvador-aid
                      religious-groups-in-schools
                      anti-satellite-test-ban
                      aid-to-nicaraguan-contras
                      mx-missile
                      immigration
                      synfuels-corporation-cutback
                      education-spending
                      superfund-right-to-sue
                      crime
                      duty-free-exports
                      export-administration-act-south-africa
                      Class

        === Associator model (full training set) ===

        Apriori
        =======

        Minimum support: 0.45 (196 instances)
        Minimum metric <confidence>: 0.9
        Number of cycles performed: 11

        Generated sets of large itemsets:
        Size of set of large itemsets L(1): 20
        Size of set of large itemsets L(2): 17
        Size of set of large itemsets L(3): 6
        Size of set of large itemsets L(4): 1

        Best rules found:

         1. adoption-of-the-budget-resolution=y physician-fee-freeze=n 219 ==> Class=democrat 219    conf:(1)
         2. adoption-of-the-budget-resolution=y physician-fee-freeze=n aid-to-nicaraguan-contras=y 198 ==> Class=democrat 198    conf:(1)
         3. physician-fee-freeze=n aid-to-nicaraguan-contras=y 211 ==> Class=democrat 210    conf:(1)
         4. physician-fee-freeze=n education-spending=n 202 ==> Class=democrat 201    conf:(1)
         5. physician-fee-freeze=n 247 ==> Class=democrat 245    conf:(0.99)
         6. el-salvador-aid=n Class=democrat 200 ==> aid-to-nicaraguan-contras=y 197    conf:(0.99)
         7. el-salvador-aid=n 208 ==> aid-to-nicaraguan-contras=y 204    conf:(0.98)
         8. adoption-of-the-budget-resolution=y aid-to-nicaraguan-contras=y Class=democrat 203 ==> physician-fee-freeze=n 198    conf:(0.98)
         9. el-salvador-aid=n aid-to-nicaraguan-contras=y 204 ==> Class=democrat 197    conf:(0.97)
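  • 12.   10. aid-to-nicaraguan-contras=y Class=democrat 218 ==> physician-fee-freeze=n 210    conf:(0.96)

        3.4 Interpretation
        Apriori produced the ten association rules listed in the output. Each rule has a confidence of at least 0.9 and is supported by at least 196 of the 435 instances (minimum support 0.45). In each rule, the number before ==> counts the instances that match the left-hand side, the number after it counts those that also match the right-hand side, and the confidence is the ratio of the two. For example, all 219 members who voted y on adoption-of-the-budget-resolution and n on physician-fee-freeze are democrats (rule 1), and 245 of the 247 members who voted n on physician-fee-freeze are democrats (rule 5, confidence 0.99). Six of the ten rules have Class=democrat as their consequent, so a vote of n on physician-fee-freeze is the strongest single indicator of party in this output.

        4. REFERENCES
          • Ian H. Witten, Eibe Frank, Mark A. Hall, Data Mining: Practical Machine Learning Tools and Techniques
          • http://www.cs.waikato.ac.nz/ml/weka/
          • http://www.tunedit.org/repo/Data/Agnostic-vs-Prior/Training/ada_prior_train.arff
          • http://tunedit.org/repo/UCI/vote.arff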
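The support and confidence figures in the Apriori output of section 3.3 can be checked by hand. The Python sketch below recomputes them from the counts printed in the ten rules. It assumes that each of the 11 reported cycles lowered the support threshold by the -D delta of 0.05 from the -U upper bound of 1.0, and that the instance threshold is obtained by rounding up; the names `RULES`, `min_support` and `min_count` are introduced here for illustration only.

```python
import math

NUM_INSTANCES = 435

# (left-hand-side count, count also matching the right-hand side, confidence as printed),
# copied from the ten rules in the Apriori output.
RULES = [
    (219, 219, 1.00),
    (198, 198, 1.00),
    (211, 210, 1.00),
    (202, 201, 1.00),
    (247, 245, 0.99),
    (200, 197, 0.99),
    (208, 204, 0.98),
    (203, 198, 0.98),
    (204, 197, 0.97),
    (218, 210, 0.96),
]

# Assumption: support starts at the upper bound 1.0 and drops by 0.05 per cycle.
min_support = round(1.0 - 11 * 0.05, 2)
# Assumption: the instance threshold is the support fraction rounded up.
min_count = math.ceil(min_support * NUM_INSTANCES)
print(f"minimum support {min_support} -> {min_count} instances")

for number, (lhs_count, rule_count, printed) in enumerate(RULES, start=1):
    confidence = rule_count / lhs_count      # share of LHS matches that also match the RHS
    support = rule_count / NUM_INSTANCES     # share of all instances covered by the rule
    print(f"rule {number:2d}: confidence {confidence:.4f} (printed {printed}), "
          f"support {support:.3f}")
```

Rules 3 and 4 are printed with conf:(1) although one instance in each violates the rule; their exact confidences are 210/211 and 201/202, which round to 1 at two decimal places.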