VGSOM




WEKA – Data Mining
   Techniques
  Clustering and Regression




                 BY
         M.P.Vijaya Prabhu
           10BM60097
Contents
1. INTRODUCTION
2. CLUSTERING
     2.1 Data Visualization
3. Regression Analysis
     3.1 Pricing the house
4. References
WEKA – DATA MINING TECHNIQUES
    1. INTRODUCTION
        Weka (Waikato Environment for Knowledge Analysis) is “data mining software in Java”: a
collection of state-of-the-art machine learning algorithms and data preprocessing tools written in
Java and developed at the University of Waikato, New Zealand. It is free software that runs on
almost any platform and is available under the GNU General Public License.

        Weka lets users carry out complex analyses interactively and visualize the results
effectively.

WEKA GUI appears like this




Advantages of using WEKA

    1) Built-in advanced algorithms
    2) Effective visualization of results
    3) Easy-to-use GUI

Let us demonstrate the use of WEKA with two examples, one on clustering (k-means) and one on
regression.


    2. CLUSTERING
The data is a sample bank data set taken from an online source. It contains the following attributes:
        1) age numeric
        2) sex {FEMALE,MALE}
        3) region {INNER_CITY,TOWN,RURAL,SUBURBAN}
        4) income numeric
        5) married {NO,YES}
        6) children {0,1,2,3}
        7) car {NO,YES}
        8) save_act {NO,YES}
        9) current_act {NO,YES}
        10) mortgage {NO,YES}
        11) pep {YES,NO}
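
WEKA usually reads such data from an ARFF file, whose header declares each attribute in much the same notation as the list above. The following Python sketch builds such a header. It is an illustration only: the helper name arff_header is hypothetical, the relation name is taken from the run output, and the id attribute that appears in the run output is omitted because it is not in the list.

```python
# Attribute declarations copied from the list above; "numeric" marks a
# numeric attribute, a list of strings marks a nominal one.
BANK_ATTRIBUTES = [
    ("age", "numeric"),
    ("sex", ["FEMALE", "MALE"]),
    ("region", ["INNER_CITY", "TOWN", "RURAL", "SUBURBAN"]),
    ("income", "numeric"),
    ("married", ["NO", "YES"]),
    ("children", ["0", "1", "2", "3"]),
    ("car", ["NO", "YES"]),
    ("save_act", ["NO", "YES"]),
    ("current_act", ["NO", "YES"]),
    ("mortgage", ["NO", "YES"]),
    ("pep", ["YES", "NO"]),
]


def arff_header(relation, attributes):
    """Return the ARFF header text for the given relation and attributes."""
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        if kind == "numeric":
            lines.append(f"@attribute {name} numeric")
        else:
            # Nominal attributes list their allowed values in braces.
            lines.append(f"@attribute {name} {{{','.join(kind)}}}")
    lines += ["", "@data"]
    return "\n".join(lines)


if __name__ == "__main__":
    print(arff_header("bank-data", BANK_ATTRIBUTES))
```

The data rows, one comma-separated instance per line, would follow the @data line.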


        Based on these data we need to cluster the customers into 6 groups and find the
        characteristics of each group.

The sample data contains 600 instances. The objective is to cluster them with the k-means algorithm.
Once the data has been preprocessed, we can start clustering.
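
As background, the k-means idea can be sketched in a few lines of Python. This is a minimal illustration on numeric points only, not WEKA's SimpleKMeans implementation. The function name kmeans and the toy points are hypothetical, and the caller supplies the starting centroids, whereas WEKA picks them using a random seed.

```python
import math


def kmeans(points, centroids, max_iterations=500):
    """Lloyd's k-means on numeric tuples, starting from the given centroids."""
    centroids = [tuple(c) for c in centroids]
    assignment = []
    for _ in range(max_iterations):
        # Assignment step: attach every point to its nearest centroid.
        assignment = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        updated = []
        for j, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                updated.append(old)  # keep an empty cluster's centroid in place
        if updated == centroids:  # converged: no centroid moved
            break
        centroids = updated
    return centroids, assignment


if __name__ == "__main__":
    # Hypothetical toy data with two obvious groups.
    toy = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
    print(kmeans(toy, [(1, 1), (8, 8)]))
```

The loop stops as soon as an update leaves every centroid unchanged, or after max_iterations passes.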


First, the data is loaded into WEKA, where any preprocessing can be done.
WEKA's SimpleKMeans algorithm automatically handles a mixture of categorical and
numerical attributes. When computing distances, as in our case, the built-in algorithm
automatically normalizes numerical attributes. Euclidean distance is used as the measure of
distance between instances and cluster centroids.
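
To make the distance computation concrete, here is a small Python sketch of a Euclidean-style distance over mixed attributes. It assumes numeric values are rescaled to the range 0 to 1 using the attribute's minimum and maximum, and that nominal values contribute 0 when equal and 1 when different. The function name mixed_distance and the example ranges are hypothetical, and the code is an illustration rather than WEKA's own.

```python
import math


def mixed_distance(a, b, numeric_ranges):
    """Euclidean-style distance over dicts with numeric and nominal values.

    numeric_ranges maps each numeric attribute to its (min, max) in the data;
    any attribute not listed there is treated as nominal.
    """
    total = 0.0
    for name in a:
        if name in numeric_ranges:
            low, high = numeric_ranges[name]
            span = (high - low) or 1.0  # avoid dividing by zero
            diff = (a[name] - low) / span - (b[name] - low) / span
        else:
            # Nominal attribute: 0 if the values match, 1 otherwise.
            diff = 0.0 if a[name] == b[name] else 1.0
        total += diff * diff
    return math.sqrt(total)


if __name__ == "__main__":
    # Hypothetical customers and ranges, not taken from the bank data.
    ranges = {"age": (20, 60), "income": (5000, 65000)}
    x = {"age": 30, "income": 20000, "sex": "FEMALE"}
    y = {"age": 50, "income": 50000, "sex": "MALE"}
    print(mixed_distance(x, y, ranges))
```

Without the rescaling, an attribute measured in large units such as income would dominate the distance and the other attributes would hardly matter.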




        After selecting k-means, we can adjust its advanced settings. We changed the number of
clusters from 2 to 6, to get 6 different clusters from the given data.
Once the required details are given, “Use training set” is selected. Then we can click “Start”.




The result is available as given below.
================================================================================================
OUTPUT :
=== Run information ===

Scheme:       weka.clusterers.SimpleKMeans -N 6 -A "weka.core.EuclideanDistance -R first-last" -I 500 -S 10
Relation:     bank-data
Instances:    600
Attributes:   12
       id
       age
       sex
       region
       income
       married
       children
       car
       save_act
       current_act
       mortgage
       pep
Test mode: evaluate on training data


=== Clustering model (full training set) ===


kMeans
======

Number of iterations: 18
Within cluster sum of squared errors: 1955.4146634784236
Missing values globally replaced with mean/mode

Cluster centroids:
                                         Cluster#
Attribute     Full Data           0           1           2           3           4           5
                  (600)        (74)       (164)        (71)        (58)        (99)       (134)
===============================================================================================
id              ID12101     ID12107     ID12103     ID12101     ID12104     ID12102     ID12108
age              42.395     42.9324     43.7744     39.0282     37.3103      38.404     47.3433
sex              FEMALE      FEMALE      FEMALE      FEMALE      FEMALE        MALE        MALE
region       INNER_CITY       RURAL  INNER_CITY  INNER_CITY        TOWN  INNER_CITY        TOWN
income       27524.0312  28838.7605  28586.4063  20463.1273  20600.8528   25720.037  33568.3929
married             YES          NO         YES         YES         YES         YES          NO
children         1.0117       1.973       0.628      0.6901      1.6207       0.899      0.9403
car                  NO          NO          NO          NO          NO         YES         YES
save_act            YES         YES         YES          NO          NO          NO         YES
current_act         YES         YES         YES         YES         YES         YES         YES
mortgage             NO          NO          NO          NO          NO         YES          NO
pep                  NO          NO          NO         YES          NO         YES         YES




Time taken to build model (full training data) : 0.16 seconds

=== Model and evaluation on training set ===

Clustered Instances

0   74 ( 12%)
1 164 ( 27%)
2   71 ( 12%)
3   58 ( 10%)
4   99 ( 17%)
5 134 ( 22%)
================================================================================================
The result window shows the centroid of each cluster as well as statistics on the number and
      percentage of instances assigned to different clusters.
                0   74 ( 12%)
                1   164 ( 27%)
                2   71 ( 12%)
                3   58 ( 10%)
                4   99 ( 17%)
                5   134 ( 22%)


      The output of this clustering can be read from the cluster centroids. The first column
      gives the values for the full data set, and the remaining columns give clusters 0 to 5.

  Attribute     Full Data           0           1           2           3           4           5
  age              42.395     42.9324     43.7744     39.0282     37.3103      38.404     47.3433
  sex              FEMALE      FEMALE      FEMALE      FEMALE      FEMALE        MALE        MALE
  region       INNER_CITY       RURAL  INNER_CITY  INNER_CITY        TOWN  INNER_CITY        TOWN
  income       27524.0312  28838.7605  28586.4063  20463.1273  20600.8528   25720.037  33568.3929
  married             YES          NO         YES         YES         YES         YES          NO
  children         1.0117       1.973       0.628      0.6901      1.6207       0.899      0.9403
  car                  NO          NO          NO          NO          NO         YES         YES
  save_act            YES         YES         YES          NO          NO          NO         YES
  current_act         YES         YES         YES         YES         YES         YES         YES
  mortgage             NO          NO          NO          NO          NO         YES          NO
  pep                  NO          NO          NO         YES          NO         YES         YES


      For example, the centroid for cluster 0 shows that this segment represents middle-aged
      (approx. 43) females living in rural areas with an average income of approx. $28,800, who are
      unmarried with about two children, etc. Furthermore, this group has on average said NO to the
      pep product.
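
A centroid row of this kind can be read as a per-attribute summary of the cluster's members: the mean for numeric attributes and the most frequent value for nominal ones. The Python sketch below shows that summary under those assumptions. The function name centroid and the three example rows are hypothetical and are not taken from the bank data.

```python
from collections import Counter


def centroid(rows):
    """Summarise a cluster: mean of numeric columns, mode of nominal ones."""
    summary = {}
    for name in rows[0]:
        values = [row[name] for row in rows]
        if all(isinstance(v, (int, float)) for v in values):
            summary[name] = sum(values) / len(values)
        else:
            # Most frequent value; ties go to the value seen first.
            summary[name] = Counter(values).most_common(1)[0][0]
    return summary


if __name__ == "__main__":
    # Hypothetical cluster members.
    members = [
        {"age": 40, "sex": "FEMALE", "married": "NO"},
        {"age": 44, "sex": "FEMALE", "married": "YES"},
        {"age": 45, "sex": "MALE", "married": "NO"},
    ]
    print(centroid(members))
```

This is why a centroid can show a fractional number of children while still showing a single category such as FEMALE or RURAL.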


              2.1 Data Visualization

      The result can be viewed more intuitively with WEKA's built-in visualization.

                The distribution of males and females in each cluster can be visualized using the
      following steps.

Step 1: Right-click on the result and select “Visualize cluster assignments”.

Step 2: Select Cluster as the X axis.

Step 3: Select Instance_number as the Y axis.

Step 4: Select “sex” as the colour, so that the points are coloured by sex.

This results in a visualization of the distribution of males and females in each cluster.
3. Regression Analysis
  Regression can be done effectively, with many options, in WEKA. Let us explain it using a
  simple “LinearRegression” example.

3.1 Pricing the house

   The data is taken from an online source. The selling price of a house needs to be determined
  based on the data given. The data contains the following attributes.


  1) houseSize NUMERIC
  2) lotSize NUMERIC
  3) bedrooms NUMERIC
  4) granite NUMERIC
  5) bathroom NUMERIC
  6) sellingPrice NUMERIC


  So, based on the size of the house, the lot size, the number of bedrooms, whether it is
  furnished with granite, and the number of bathrooms, we need to predict the dependent
  variable, i.e. the selling price.


  First, the data is loaded into WEKA and any necessary preprocessing is done. Since our data is
  already processed, we proceed to selecting the type of regression.
In the Classify tab, choose “LinearRegression”. Then select “Use training set” in
the Test options.




There are three other choices available under Test options:
    Supplied test set: supply a separate data set on which the model is evaluated
    Cross-validation: WEKA builds models on subsets (folds) of the supplied data, tests each on
         the held-out fold, and averages the results
    Percentage split: WEKA builds the model on a given percentage of the data and tests it on
         the remainder
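
The difference between the last two options can be sketched in Python by looking only at which instance indices end up in the training and test parts. This is a simplified illustration with hypothetical function names. It keeps the instances in their original order and does not reproduce WEKA's own shuffling or fold assignment.

```python
def percentage_split(n_instances, train_percent):
    """Return (train_indices, test_indices) for a simple ordered split."""
    cut = round(n_instances * train_percent / 100)
    indices = list(range(n_instances))
    return indices[:cut], indices[cut:]


def k_fold_splits(n_instances, k):
    """Yield (train_indices, test_indices) once per fold."""
    indices = list(range(n_instances))
    for fold in range(k):
        test = indices[fold::k]  # every k-th instance, offset by the fold number
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test


if __name__ == "__main__":
    train, test = percentage_split(10, 66)
    print(len(train), len(test))
    for train, test in k_fold_splits(10, 5):
        print(test)
```

With cross-validation every instance is used for testing exactly once, whereas a percentage split tests only on the part that was held back.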


Here the column “sellingPrice” is chosen as the attribute to predict. This means that, with the
available data, we are going to predict the dependent variable (selling price).


Then click on the “Start” button to build a model using WEKA.
OUTPUT:
================================================================================================
=== Run information ===
Scheme:      weka.classifiers.functions.LinearRegression -S 0 -R 1.0E-8
Relation: house
Instances: 700
Attributes: 6
        houseSize
        lotSize
        bedrooms
        granite
        bathroom
        sellingPrice
Test mode: evaluate on training data
=== Classifier model (full training set) ===
Linear Regression Model
sellingPrice =
   22.6582 * houseSize +
    9.1242 * lotSize +
  42145.0767 * bedrooms +
  42562.0901 * bathroom +
 -20981.3142

Time taken to build model: 0.04 seconds

=== Evaluation on training set ===
=== Summary ===

Correlation coefficient      0.9945
Mean absolute error         4790.821
Root mean squared error       4245.4125
Relative absolute error      11.9082 %
Root relative squared error    11.21 %
Total Number of Instances       700
================================================================================================
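
For readers who want to see how summary figures of this kind are defined, the Python sketch below computes a correlation coefficient, a mean absolute error and a root mean squared error from lists of actual and predicted values. It uses the standard textbook definitions. The function name regression_summary and the three-value example are hypothetical, and the code is not WEKA's evaluation code.

```python
import math


def regression_summary(actual, predicted):
    """Return correlation, mean absolute error and root mean squared error."""
    n = len(actual)
    errors = [p - a for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    # Pearson correlation between actual and predicted values.
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    correlation = cov / math.sqrt(var_a * var_p)
    return {"correlation": correlation, "mae": mae, "rmse": rmse}


if __name__ == "__main__":
    # Hypothetical values, not taken from the house data.
    print(regression_summary([1, 2, 3], [1, 2, 4]))
```

With these definitions the root mean squared error is never smaller than the mean absolute error, which gives a quick sanity check when reading a summary.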


According to the model, the selling price is
sellingPrice = (22.6582 * houseSize) + (9.1242 * lotSize) + (42145.0767 * bedrooms) +
  (42562.0901 * bathroom) - 20981.3142.


  To determine the selling price of a house from given data, just plug the values into this
  equation.
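
Plugging in values can be sketched in Python as follows. The coefficients are copied from the model in the WEKA output. The function name predict_selling_price and the example house are hypothetical and are not taken from the data set.

```python
# Coefficients copied from the Linear Regression Model in the output above.
COEFFICIENTS = {
    "houseSize": 22.6582,
    "lotSize": 9.1242,
    "bedrooms": 42145.0767,
    "bathroom": 42562.0901,
}
INTERCEPT = -20981.3142


def predict_selling_price(house):
    """Apply the fitted linear model to a dict of attribute values."""
    return INTERCEPT + sum(
        weight * house[name] for name, weight in COEFFICIENTS.items()
    )


if __name__ == "__main__":
    # Hypothetical house, not taken from the data set.
    example = {"houseSize": 2000, "lotSize": 8000, "bedrooms": 3, "bathroom": 2}
    print(round(predict_selling_price(example), 2))
```

Any extra keys in the dict, such as granite, are simply ignored because the model has no coefficient for them.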


  The granite attribute does not appear in the model: WEKA's attribute selection left it out,
  which suggests that granite does not matter much for the selling price of the house.




4. References

  http://www.stat.yale.edu/Courses/1997-98/101/linreg.htm
  www.cs.waikato.ac.nz/ml/weka/
  http://www.laits.utexas.edu/~norman/BUS.FOR/course.mat/Alex/
  http://maya.cs.depaul.edu/classes/ect584/weka/k-means.html
  http://www.cs.utexas.edu/users/ml/tutorials/Weka-tut/

MichaelStarkes_UncutGemsProjectSummary.pdfmstarkes24
 
Presentation4 (2) survey responses clearly labelled
Presentation4 (2) survey responses clearly labelledPresentation4 (2) survey responses clearly labelled
Presentation4 (2) survey responses clearly labelledCaitlinCummins3
 

Dernier (20)

بروفايل شركة ميار الخليج للاستشارات الهندسية.pdf
بروفايل شركة ميار الخليج للاستشارات الهندسية.pdfبروفايل شركة ميار الخليج للاستشارات الهندسية.pdf
بروفايل شركة ميار الخليج للاستشارات الهندسية.pdf
 
Falcon Invoice Discounting Setup for Small Businesses
Falcon Invoice Discounting Setup for Small BusinessesFalcon Invoice Discounting Setup for Small Businesses
Falcon Invoice Discounting Setup for Small Businesses
 
Exploring-Pipe-Flanges-Applications-Types-and-Benefits.pptx
Exploring-Pipe-Flanges-Applications-Types-and-Benefits.pptxExploring-Pipe-Flanges-Applications-Types-and-Benefits.pptx
Exploring-Pipe-Flanges-Applications-Types-and-Benefits.pptx
 
stock price prediction using machine learning
stock price prediction using machine learningstock price prediction using machine learning
stock price prediction using machine learning
 
LinkedIn Masterclass Techweek 2024 v4.1.pptx
LinkedIn Masterclass Techweek 2024 v4.1.pptxLinkedIn Masterclass Techweek 2024 v4.1.pptx
LinkedIn Masterclass Techweek 2024 v4.1.pptx
 
Blinkit: Revolutionizing the On-Demand Grocery Delivery Service.pptx
Blinkit: Revolutionizing the On-Demand Grocery Delivery Service.pptxBlinkit: Revolutionizing the On-Demand Grocery Delivery Service.pptx
Blinkit: Revolutionizing the On-Demand Grocery Delivery Service.pptx
 
Series A Fundraising Guide (Investing Individuals Improving Our World) by Accion
Series A Fundraising Guide (Investing Individuals Improving Our World) by AccionSeries A Fundraising Guide (Investing Individuals Improving Our World) by Accion
Series A Fundraising Guide (Investing Individuals Improving Our World) by Accion
 
Future of Trade 2024 - Decoupled and Reconfigured - Snapshot Report
Future of Trade 2024 - Decoupled and Reconfigured - Snapshot ReportFuture of Trade 2024 - Decoupled and Reconfigured - Snapshot Report
Future of Trade 2024 - Decoupled and Reconfigured - Snapshot Report
 
Powerpoint showing results from tik tok metrics
Powerpoint showing results from tik tok metricsPowerpoint showing results from tik tok metrics
Powerpoint showing results from tik tok metrics
 
tekAura | Desktop Procedure Template (2016)
tekAura | Desktop Procedure Template (2016)tekAura | Desktop Procedure Template (2016)
tekAura | Desktop Procedure Template (2016)
 
Expert Cross-Border Financial Planning Advisors
Expert Cross-Border Financial Planning AdvisorsExpert Cross-Border Financial Planning Advisors
Expert Cross-Border Financial Planning Advisors
 
Inside the Black Box of Venture Capital (VC)
Inside the Black Box of Venture Capital (VC)Inside the Black Box of Venture Capital (VC)
Inside the Black Box of Venture Capital (VC)
 
Hyundai capital 2024 1q Earnings release
Hyundai capital 2024 1q Earnings releaseHyundai capital 2024 1q Earnings release
Hyundai capital 2024 1q Earnings release
 
WAM Corporate Presentation May 2024_w.pdf
WAM Corporate Presentation May 2024_w.pdfWAM Corporate Presentation May 2024_w.pdf
WAM Corporate Presentation May 2024_w.pdf
 
How to Maintain Healthy Life style.pptx
How to Maintain  Healthy Life style.pptxHow to Maintain  Healthy Life style.pptx
How to Maintain Healthy Life style.pptx
 
Innomantra Viewpoint - Building Moonshots : May-Jun 2024.pdf
Innomantra Viewpoint - Building Moonshots : May-Jun 2024.pdfInnomantra Viewpoint - Building Moonshots : May-Jun 2024.pdf
Innomantra Viewpoint - Building Moonshots : May-Jun 2024.pdf
 
zidauu _business communication.pptx /pdf
zidauu _business  communication.pptx /pdfzidauu _business  communication.pptx /pdf
zidauu _business communication.pptx /pdf
 
Aptar Closures segment - Corporate Overview-India.pdf
Aptar Closures segment - Corporate Overview-India.pdfAptar Closures segment - Corporate Overview-India.pdf
Aptar Closures segment - Corporate Overview-India.pdf
 
MichaelStarkes_UncutGemsProjectSummary.pdf
MichaelStarkes_UncutGemsProjectSummary.pdfMichaelStarkes_UncutGemsProjectSummary.pdf
MichaelStarkes_UncutGemsProjectSummary.pdf
 
Presentation4 (2) survey responses clearly labelled
Presentation4 (2) survey responses clearly labelledPresentation4 (2) survey responses clearly labelled
Presentation4 (2) survey responses clearly labelled
 

Clustering and Regression using WEKA

  • 1. VGSOM WEKA – Data Mining Techniques Clustering and Regression BY M.P.Vijaya Prabhu 10BM60097
  • 3. WEKA – DATA MINING TECHNIQUES
    1. INTRODUCTION
    Weka (“Data Mining Software in Java”) is an acronym for Waikato Environment for Knowledge Analysis. It is a collection of state-of-the-art machine learning algorithms and data preprocessing tools written in Java, developed at the University of Waikato, New Zealand. It is free software that runs on almost any platform and is available under the GNU General Public License.
    Weka lets users carry out complex analyses interactively and visualize the results effectively.
    [Figure: the WEKA GUI]
    Advantages of using WEKA
    1) Built-in advanced algorithms
    2) Effective visualization of results
    3) Easy-to-use GUI
  • 4. Let us demonstrate the use of WEKA with two examples, one on clustering (k-means) and one on regression.
    2. CLUSTERING
    The data is a sample bank data set taken from an online source. It contains the following attributes:
    1) age numeric
    2) sex {FEMALE,MALE}
    3) region {INNER_CITY,TOWN,RURAL,SUBURBAN}
    4) income numeric
    5) married {NO,YES}
    6) children {0,1,2,3}
    7) car {NO,YES}
    8) save_act {NO,YES}
    9) current_act {NO,YES}
    10) mortgage {NO,YES}
    11) pep {YES,NO}
    Based on these data, we need to cluster the customers into 6 groups and find the characteristics of each group. The sample data contains 600 instances. The objective is to cluster them using the k-means algorithm. First, the data is loaded into WEKA and preprocessed as shown below. Once the preprocessing is done, we can start clustering the data.
    [Figure: data loaded into WEKA for preprocessing]
  • 5. WEKA's SimpleKMeans algorithm automatically handles a mixture of categorical and numerical attributes. When computing distances, as in our case, the built-in algorithm automatically normalizes the numerical attributes. Euclidean distance is used as the general measure of distance between instances and cluster centroids. After selecting SimpleKMeans, we can open the advanced settings of the k-means algorithm. We changed the number of clusters from the default of 2 to 6, to get 6 different clusters from the given data.
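The distance computation described above can be illustrated with a minimal Python sketch. It assumes min-max scaling for numeric attributes and a 0/1 mismatch for categorical ones, which is a common way to combine the two. It is not WEKA's actual code, and the function names and the three sample customers are hypothetical.

```python
import math

def min_max_ranges(rows, numeric_keys):
    """Collect (min, max) per numeric attribute so values can be scaled to [0, 1]."""
    return {k: (min(r[k] for r in rows), max(r[k] for r in rows)) for k in numeric_keys}

def mixed_distance(a, b, ranges):
    """Euclidean-style distance over mixed attributes.

    Numeric attributes are min-max normalized before differencing;
    categorical attributes contribute 0 if equal and 1 if different.
    """
    total = 0.0
    for key in a:
        if key in ranges:
            lo, hi = ranges[key]
            span = (hi - lo) or 1.0  # avoid division by zero for constant columns
            diff = (a[key] - b[key]) / span
        else:
            diff = 0.0 if a[key] == b[key] else 1.0
        total += diff * diff
    return math.sqrt(total)

# Hypothetical customers using a few of the bank-data attribute names.
customers = [
    {"age": 20, "income": 10000.0, "sex": "FEMALE", "married": "NO"},
    {"age": 40, "income": 30000.0, "sex": "MALE", "married": "YES"},
    {"age": 60, "income": 50000.0, "sex": "FEMALE", "married": "YES"},
]
ranges = min_max_ranges(customers, ["age", "income"])
d01 = mixed_distance(customers[0], customers[1], ranges)
d02 = mixed_distance(customers[0], customers[2], ranges)
print(round(d01, 4), round(d02, 4))
```

Without the normalization step, income (in the tens of thousands) would dominate age and the categorical attributes in every distance, which is why the scaling matters for this data set.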
  • 6. After the required details are given, “Use training set” is checked. Then we can click “Start”. The result is shown below.
    ================================================================================================
    OUTPUT:
    === Run information ===
    Scheme: weka.clusterers.SimpleKMeans -N 6 -A "weka.core.EuclideanDistance -R first-last" -I 500 -S 10
    Relation: bank-data
    Instances: 600
    Attributes: 12
        id
  • 7.     age
        sex
        region
        income
        married
        children
        car
        save_act
        current_act
        mortgage
        pep
    Test mode: evaluate on training data
    === Clustering model (full training set) ===
    kMeans
    ======
    Number of iterations: 18
    Within cluster sum of squared errors: 1955.4146634784236
    Missing values globally replaced with mean/mode
    Cluster centroids:
                                         Cluster#
    Attribute    Full Data   0           1           2           3           4           5
                 (600)       (74)        (164)       (71)        (58)        (99)        (134)
    ==========================================================================================
    id           ID12101     ID12107     ID12103     ID12101     ID12104     ID12102     ID12108
    age          42.395      42.9324     43.7744     39.0282     37.3103     38.404      47.3433
    sex          FEMALE      FEMALE      FEMALE      FEMALE      FEMALE      MALE        MALE
    region       INNER_CITY  RURAL       INNER_CITY  INNER_CITY  TOWN        INNER_CITY  TOWN
    income       27524.0312  28838.7605  28586.4063  20463.1273  20600.8528  25720.037   33568.3929
    married      YES         NO          YES         YES         YES         YES         NO
    children     1.0117      1.973       0.628       0.6901      1.6207      0.899       0.9403
    car          NO          NO          NO          NO          NO          YES         YES
    save_act     YES         YES         YES         NO          NO          NO          YES
    current_act  YES         YES         YES         YES         YES         YES         YES
    mortgage     NO          NO          NO          NO          NO          YES         NO
    pep          NO          NO          NO          YES         NO          YES         YES
    Time taken to build model (full training data) : 0.16 seconds
    === Model and evaluation on training set ===
    Clustered Instances
    0     74 ( 12%)
    1    164 ( 27%)
    2     71 ( 12%)
    3     58 ( 10%)
    4     99 ( 17%)
    5    134 ( 22%)
    ================================================================================================
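The percentages in the “Clustered Instances” summary follow directly from the cluster sizes and the 600 instances. The short Python sketch below recomputes them from the counts in the output. The half-up rounding rule is an assumption that happens to reproduce the reported figures, and the helper name is hypothetical.

```python
import math

# Cluster sizes copied from the "Clustered Instances" section of the output.
cluster_sizes = {0: 74, 1: 164, 2: 71, 3: 58, 4: 99, 5: 134}

total = sum(cluster_sizes.values())

def rounded_percent(count, total):
    """Percentage of the total, rounded half up to a whole number (assumed rule)."""
    return math.floor(100 * count / total + 0.5)

percentages = {c: rounded_percent(n, total) for c, n in cluster_sizes.items()}
largest = max(cluster_sizes, key=cluster_sizes.get)

for c, n in cluster_sizes.items():
    print(f"cluster {c}: {n:3d} instances ({percentages[c]}%)")
print("largest cluster:", largest)
```

A check like this is a quick way to confirm that every instance was assigned to a cluster and to spot very small or very dominant segments.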
  • 8. The result window shows the centroid of each cluster as well as statistics on the number and percentage of instances assigned to the different clusters.
    0     74 ( 12%)
    1    164 ( 27%)
    2     71 ( 12%)
    3     58 ( 10%)
    4     99 ( 17%)
    5    134 ( 22%)
    The output of this clustering can be found in the form of cluster centroids:
    Attribute    Full Data   0           1           2           3           4           5
    age          42.395      42.9324     43.7744     39.0282     37.3103     38.404      47.3433
    sex          FEMALE      FEMALE      FEMALE      FEMALE      FEMALE      MALE        MALE
    region       INNER_CITY  RURAL       INNER_CITY  INNER_CITY  TOWN        INNER_CITY  TOWN
    income       27524.0312  28838.7605  28586.4063  20463.1273  20600.8528  25720.037   33568.3929
    married      YES         NO          YES         YES         YES         YES         NO
    children     1.0117      1.973       0.628       0.6901      1.6207      0.899       0.9403
    car          NO          NO          NO          NO          NO          YES         YES
    save_act     YES         YES         YES         NO          NO          NO          YES
    current_act  YES         YES         YES         YES         YES         YES         YES
    mortgage     NO          NO          NO          NO          NO          YES         NO
    pep          NO          NO          NO          YES         NO          YES         YES
    For example, the centroid for cluster 0 shows that this is a segment of cases representing middle-aged (approx. 43) females living in rural areas with an average income of approx. $28,800, who are unmarried with about two children, etc. Furthermore, this group has on average said NO to the PEP product.
    2.1 Data Visualization
    The result can be viewed more intuitively with the visualization tools built into WEKA. The distribution of males and females in each cluster can be visualized with the following steps.
    Step 1: Right-click on the result and select “Visualize cluster assignments”.
  • 9. Step 2: Select “Cluster” as the X axis.
    Step 3: Select “Instance_number” as the Y axis.
    Step 4: Select “sex” as the colour. This means the points will be coloured by sex.
    This will result in a visualization of the distribution of males and females in each cluster.
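The plot described in these steps is, in effect, a count of males and females per cluster. A minimal Python sketch of the same tabulation is given below. The (cluster, sex) pairs are hypothetical toy values, since the clustered bank data itself is not reproduced here, and the helper name is an assumption.

```python
from collections import Counter

# Hypothetical (cluster, sex) pairs; in practice these would come from the
# clustered bank data, which is not reproduced here.
assignments = [
    (0, "FEMALE"), (0, "FEMALE"), (0, "MALE"),
    (4, "MALE"), (4, "MALE"), (4, "FEMALE"),
    (5, "MALE"),
]

counts = Counter(assignments)
clusters = sorted({c for c, _ in assignments})

def sex_distribution(cluster):
    """Return {sex: count} for one cluster."""
    return {s: n for (c, s), n in counts.items() if c == cluster}

for c in clusters:
    print(c, sex_distribution(c))
```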
  • 10. 3. Regression Analysis
    WEKA offers several options for regression. Let us explain it using a simple “LinearRegression” model.
    3.1 Pricing the house
    The data is taken from an online source. The selling price of the house needs to be determined based on the data given. The data contains the following attributes:
    1) houseSize NUMERIC
    2) lotSize NUMERIC
    3) bedrooms NUMERIC
    4) granite NUMERIC
    5) bathroom NUMERIC
    6) sellingPrice NUMERIC
    So, based on the size of the house, the lot size, the number of bedrooms, whether it is furnished with granite, and the number of bathrooms, we need to predict the dependent variable, i.e. the selling price. First, the data is loaded into WEKA and any necessary preprocessing is done. Since our data is already processed, we proceed to selecting the type of regression.
    [Figure: selecting the regression type in WEKA]
  • 11. In the picture given above, select “LinearRegression”. Then select “Use training set” in the Test options. There are three other choices available when doing simple linear regression:
     Supplied test set: a separate set of test data is supplied to evaluate the model.
  • 12.  Cross-validation: WEKA builds models on subsets (folds) of the supplied data, tests each one on the held-out fold, and averages the results to estimate how well the model performs.
     Percentage split: WEKA builds the model on a given percentage of the supplied data and tests it on the remainder.
    Here the column “sellingPrice” is chosen. This means that with the available data we are going to predict the dependent variable (selling price). Then click on the “Start” button to build a model using WEKA.
    OUTPUT:
    ================================================================================================
    === Run information ===
    Scheme: weka.classifiers.functions.LinearRegression -S 0 -R 1.0E-8
    Relation: house
    Instances: 700
    Attributes: 6
        houseSize
        lotSize
        bedrooms
        granite
        bathroom
        sellingPrice
    Test mode: evaluate on training data
    === Classifier model (full training set) ===
    Linear Regression Model
    sellingPrice =
        22.6582 * houseSize +
        9.1242 * lotSize +
        42145.0767 * bedrooms +
        42562.0901 * bathroom +
        -20981.3142
    Time taken to build model: 0.04 seconds
    === Evaluation on training set ===
    === Summary ===
    Correlation coefficient          0.9945
    Mean absolute error              4790.821
    Root mean squared error          4245.4125
    Relative absolute error          11.9082 %
    Root relative squared error      11.21 %
    Total Number of Instances        700
    ================================================================================================
    The output predicts that the selling price will be
  • 13. sellingPrice = (22.6582 * houseSize) + (9.1242 * lotSize) + (42145.0767 * bedrooms) + (42562.0901 * bathroom) - 20981.3142.
    To determine the selling price of a house from given data, just plug the values into this equation. The granite attribute does not appear in the model, which indicates that granite does not matter much to the selling price of the house.
    4. References
    http://www.stat.yale.edu/Courses/1997-98/101/linreg.htm
    www.cs.waikato.ac.nz/ml/weka/
    http://www.laits.utexas.edu/~norman/BUS.FOR/course.mat/Alex/
    http://maya.cs.depaul.edu/classes/ect584/weka/k-means.html
    http://www.cs.utexas.edu/users/ml/tutorials/Weka-tut/
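As a worked illustration of “plugging in” values into the fitted model from Section 3.1, the Python sketch below evaluates the regression equation for one house. The coefficients are copied from the WEKA output. The example house is hypothetical and not taken from the data set, and the function name is an assumption.

```python
# Coefficients copied from the Linear Regression Model in the WEKA output.
COEFFICIENTS = {
    "houseSize": 22.6582,
    "lotSize": 9.1242,
    "bedrooms": 42145.0767,
    "bathroom": 42562.0901,
}
INTERCEPT = -20981.3142

def predict_selling_price(house):
    """Evaluate the fitted linear model for one house (dict of attribute values)."""
    return INTERCEPT + sum(coef * house[name] for name, coef in COEFFICIENTS.items())

# Hypothetical house, not taken from the data set.
example = {"houseSize": 3000, "lotSize": 10000, "bedrooms": 4, "granite": 1, "bathroom": 2}
price = predict_selling_price(example)
print(f"Predicted selling price: {price:,.2f}")

# granite has no coefficient in the model, so changing it leaves the prediction unchanged.
without_granite = predict_selling_price({**example, "granite": 0})
```

Because granite is absent from the coefficient table, the two predictions are identical, which is the sense in which the model says granite does not affect the price.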