USING WEKA FOR CLUSTERING AND
     REGRESSION ANALYSIS
                 ( ITB PAPER )




          ANURADHA CHAKRABORTY
              ROLL NO: 10BM60014




  VINOD GUPTA SCHOOL OF MANAGEMENT, IIT KHARAGPUR
WEKA (Waikato Environment for Knowledge Analysis) is a popular suite of machine
learning software written in Java, developed at the University of Waikato, New Zealand. WEKA
is free software available under the GNU General Public License. Compared with MS Excel,
WEKA makes it straightforward to run multivariate regression: its output reports the fitted
equation for the dependent variable along with supporting statistics.

Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can
either be applied directly to a dataset or called from your own Java code. Weka contains tools for
data pre-processing, classification, regression, clustering, association rules, and visualization. It
is also well-suited for developing new machine learning schemes.

The initial versions of WEKA read only Attribute-Relation File Format (ARFF) files, saved
as *.arff. Newer versions also support other formats, including XRFF, binary serialized
instances, LIBSVM, SVM Light, CSV and C4.5.

USING WEKA:

The WEKA GUI Chooser has the four following options:
   1. Weka Explorer
   2. Weka Experimenter
   3. Weka Knowledge Flow
   4. Simple CLI




Weka Explorer has the following tabs:
  1. Preprocess
  2. Classify
  3. Cluster
  4. Associate
  5. Select Attributes
  6. Visualize
Apart from running these operations, the data can be visualized graphically and filtered as
required.




Weka Experimenter:
There are several algorithms for each task, so much of the work lies in identifying the most
suitable one. For regression and classification, the Experimenter compares algorithms by
statistical analysis to help pick the best. Unfortunately, no such option is available for
clustering algorithms.
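
A statistical comparison of two algorithms can be illustrated with a paired t statistic computed on their scores over the same folds. The sketch below is a simplified illustration only: the per-fold error rates are hypothetical, and WEKA's own tester may apply a correction for overlapping training sets that is omitted here.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """t statistic for paired scores, e.g. per-fold errors of two algorithms."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation of the differences
    return mean_diff / (sd_diff / math.sqrt(n))


# Hypothetical per-fold error rates for two algorithms on the same five folds.
errors_a = [0.20, 0.22, 0.19, 0.24, 0.21]
errors_b = [0.25, 0.26, 0.22, 0.27, 0.26]
t = paired_t_statistic(errors_a, errors_b)
print(round(t, 3))
```

A strongly negative value here would suggest that the first algorithm has consistently lower error on these folds; the statistic would still have to be compared against a t distribution with n - 1 degrees of freedom.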

Import of data:
Data is imported as a CSV file, which WEKA converts to ARFF format automatically on
import. The data is imported through the Preprocess tab of WEKA as shown in the picture above.
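
To give an idea of what the conversion produces, here is a minimal sketch that turns CSV text into ARFF text. It assumes every column is numeric, and the relation name and sample rows are made up for illustration; it is not WEKA's own converter.

```python
import csv
import io


def csv_to_arff(csv_text, relation):
    """Convert CSV text with a header row and numeric columns to ARFF text."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    lines = ["@relation " + relation, ""]
    # Assumption: every column is numeric. A nominal column would need an
    # explicit value list such as {yes,no} instead of the word "numeric".
    for name in header:
        lines.append("@attribute " + name + " numeric")
    lines += ["", "@data"]
    for row in data:
        lines.append(",".join(row))
    return "\n".join(lines) + "\n"


# Hypothetical sample rows, not taken from the data sets used in this paper.
sample_csv = "Sales,Price,Promotion\n4100,60,150\n3900,65,120\n"
arff_text = csv_to_arff(sample_csv, "sales_example")
print(arff_text)
```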



                                      CLUSTERING
Definition: “Cluster analysis is a class of statistical techniques that can be applied to data that
exhibit “natural” groupings. Cluster analysis sorts through the raw data and groups them into
clusters. A cluster is a group of relatively homogeneous cases or observations. Objects in a
cluster are similar to each other. They are also dissimilar to objects outside the cluster,
particularly objects in other clusters.”
DATA SET USED FOR CLUSTERING

The example used is a survey of consumer attitudes towards traditional sweets. It had:
Instances: 76
Attributes: 33

The questions or attributes were as follows:
Age
Profession
Diabetesstop
Obesitystop
Otherstop
Cadburynchocl
Homemadesweets
Sweetfrmshop
Cakepastry
Sugarcube
Celebration
Gifts
Beginningauspicious
Yummyfood
Healthconcern
Lunchdinnerafter
Tastytraditn
Abroad
Frequencyeating
Inflnearby
Inflfrndrelative
Inflblogonline
Advert
Quality
Packaging
Ambience
Price
Imptraditonsweet
Newexperimentswt
Newvariety
Homedeliveryimp
Impchitchatplace
Packagdsweetslngtime
PROCEDURE AND RESULT:

The data set is taken from my AMRP project survey on the interest and motivation of
consumers towards traditional sweets.

The SimpleKMeans algorithm was used to cluster the data set into two clusters.

The output is as follows:

Attribute              Full Data   0         1
                       (76)        (44)      (32)
=======================================================
Age                    1.6711      1.6364    1.7188
Profession             1.7632      1.6818    1.875
Diabetesstop           2.3553      2.3636    2.3438
Obesitystop            1.9605      1.9545    1.9688
otherstop              1.9474      1.8636    2.0625
Cadburynchocl          4.2895      4.25      4.3438
homemadesweets         4.3421      4.3636    4.3125
sweetfrmshop           4.0395      4.1136    3.9375
cakepastry             3.9342      4.0455    3.7813
sugarcube              2.4605      2.5       2.4063
celebration            4.1447      4.3409    3.875
gifts                  3.7632      3.7955    3.7188
beginningauspicious    3.7763      3.8636    3.6563
yummyfood              3.8158      3.9318    3.6563
healthconcern          2.9868      3         2.9688
lunchdinnerafter       3.9737      4.0909    3.8125
tastytraditn           3.7632      4.0227    3.4063
abroad                 1.8684      1.8864    1.8438
frequencyeating        2.5658      2.4318    2.75
inflnearby             3.0         4.0       3.0
inflfrndrelative       4.0         4.0       3.0
inflblogonline         3.0         3.0       2.0
advert                 3.0         3.0       2.0
quality                5.0         5.0       5.0
packaging              3.0         3.0       4.0
ambience               3.0         3.0       4.0
price                  3.0         4.0       3.0
imptraditonsweet       5.0         5.0       3.0
newexperimentswt       3.0         3.0       3.0
newvariety             3.0         3.0       4.0
homedeliveryimp        2.8158      2.8409    2.7813
impchitchatplace       3.3421      3.3182    3.375
packagdsweetslngtime   3.1579      3         3.375

Note: The significant values in the above table, on which the cluster characteristics are formed,
are marked with red.
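
One simple way to spot the values that separate the two clusters is to rank attributes by the gap between their centroid values. The sketch below does this for a subset of rows copied from the table; using the raw gap, without scaling each attribute, is an assumption made for illustration and not necessarily how the significant values were chosen.

```python
# Centroid values for cluster 0 and cluster 1, copied from a subset of table rows.
centroids = {
    "Age": (1.6364, 1.7188),
    "celebration": (4.3409, 3.875),
    "tastytraditn": (4.0227, 3.4063),
    "frequencyeating": (2.4318, 2.75),
    "quality": (5.0, 5.0),
    "packaging": (3.0, 4.0),
    "ambience": (3.0, 4.0),
    "price": (4.0, 3.0),
    "imptraditonsweet": (5.0, 3.0),
    "newvariety": (3.0, 4.0),
}


def rank_by_gap(centroids):
    """Return (attribute, absolute centroid gap) pairs, largest gap first."""
    gaps = [(name, abs(c0 - c1)) for name, (c0, c1) in centroids.items()]
    return sorted(gaps, key=lambda item: item[1], reverse=True)


ranked = rank_by_gap(centroids)
for name, gap in ranked:
    print(f"{name:18s} {gap:.4f}")
```

Attributes with a gap near zero, such as quality, do not help to tell the clusters apart even when both clusters rate them highly.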

Clustered Instances

0    44 ( 58%)
1    32 ( 42%)
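
For readers unfamiliar with how k-means arrives at centroids and cluster sizes like those above, here is a minimal sketch of the algorithm. The data points are hypothetical answers on a 1-to-5 scale, and the initial centroids are fixed by hand for reproducibility, whereas WEKA picks them from a random seed; this is an illustration, not WEKA's implementation.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, initial_centroids, max_iterations=100):
    centroids = [list(c) for c in initial_centroids]
    assignment = []
    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_assignment = [
            min(range(len(centroids)),
                key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the result is stable
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignment


# Hypothetical 1-to-5 answers to two questions (not the survey data).
points = [(5, 2), (4, 2), (5, 3), (2, 5), (1, 4), (2, 4)]
centroids, assignment = k_means(points, initial_centroids=[points[0], points[3]])
sizes = [assignment.count(j) for j in range(len(centroids))]
print(centroids, sizes)
```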


INTERPRETATION:

ASPECTS                          CLUSTER ‘0’                     CLUSTER ‘1’
Traditionality                   Loves traditional sweets.       Loves experiments and newer
                                 Considers sweets a symbol of    varieties of sweets.
                                 tradition. Wants a sweet
                                 after lunch or dinner.
Frequency of consumption         High                            Medium
Price                            More price sensitive            Less price sensitive
Influence of friends,            High                            Medium. Generally tries a new
relatives or advertisements                                      shop on own instinct.
on trying a new shop
Ambience of shop and packaging   Matters less                    Matters significantly
Food court for chatting          Preferred                       Preferred
(like Haldiram)
Packaged/tinned sweets           Medium demand                   Good demand


INFERENCE AND SUGGESTION DERIVED FROM THE CLUSTERING:

There are two distinct clusters of consumers in the sweet industry.

Cluster ‘0’ (58%) considers sweets the “symbol of tradition”, typically savored
after lunch and dinner. They enjoy the most traditional sweets and prefer not to try new
variants. They stick to old shops unless prompted by external agents (friends, relatives,
blogs, advertisements etc.) to try otherwise. Quality is an important factor, but ambience and
packaging do not play a major role. So shops like Nokur or Girish Dey will be their typical
favorites.

Cluster ‘1’ (42%) are the true connoisseurs of sweets. They appreciate both traditional and
experimental sweets (the new variants). They often prefer trying out new shops and
brands. Packaged sweets, which can be savored later, are also preferred. Apart from quality,
ambience and packaging play a vital role, whereas price is of medium importance. This
cluster seems to consist of more impulsive consumers, who would probably not mind paying a
premium for new and creative sweets. So brands like K.C. Das will be their preferred choice.


                               REGRESSION
The next procedure is regression analysis.

We obtain data from stores on monthly sales of a celebration chocolate pack, together with its
price and the amount spent on its promotion (posters used around the block or any other effort).

We then select all attributes, go to the Classify tab and run the linear regression function.
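
To show what a linear regression function computes, the sketch below fits a least-squares model with two predictors by solving the normal equations. The data are hypothetical and follow a made-up exact line, and the sketch leaves out extras such as the ridge term and attribute selection that WEKA's scheme can apply, so it illustrates the idea rather than WEKA's procedure.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_linear_regression(rows, targets):
    """Least-squares fit; returns the coefficients followed by the intercept."""
    design = [list(r) + [1.0] for r in rows]  # trailing 1.0 is the intercept column
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical (price, promotion) pairs; sales follow a made-up exact line.
rows = [(50, 100), (55, 200), (60, 150), (65, 300), (70, 250)]
targets = [-50 * p + 4 * m + 6000 for p, m in rows]
price_coef, promo_coef, intercept = fit_linear_regression(rows, targets)
print(round(price_coef, 3), round(promo_coef, 3), round(intercept, 3))
```

Because the made-up sales contain no noise, the fit should recover the coefficients used to generate them; real data would leave residual error.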




OUTPUT

The output obtained is given below:
=== Run information ===

Scheme:    weka.classifiers.functions.LinearRegression -S 0 -R 1.0E-8
Relation: Problem_2-weka.filters.unsupervised.attribute.Remove-R1
Instances: 46
Attributes: 3
         Sales
         Price
         Promotion
Test mode: split 80.0% train, remainder test

=== Classifier model (full training set) ===


Linear Regression Model

Sales = -53.2173 * Price +       3.6131 * Promotion + 5837.5208

Time taken to build model: 0 seconds

=== Evaluation on test split ===
=== Summary ===

Correlation coefficient        0.8066
Mean absolute error          543.6332
Root mean squared error         711.4575
Relative absolute error       48.288 %
Root relative squared error     59.6886 %
Total Number of Instances         5
Ignored Class Unknown Instances          4
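
The first three figures in the summary can be reproduced from pairs of actual and predicted values. The sketch below shows the standard formulas for the correlation coefficient, mean absolute error and root mean squared error; the five actual and predicted sales values are hypothetical, since the individual test predictions are not listed in the output.

```python
import math


def evaluate(actual, predicted):
    """Return (correlation coefficient, mean absolute error, RMSE)."""
    n = len(actual)
    errors = [p - a for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    correlation = cov / math.sqrt(var_a * var_p)
    return correlation, mae, rmse


# Hypothetical actual and predicted sales for five test instances.
actual = [4000.0, 3500.0, 5000.0, 4500.0, 3000.0]
predicted = [4200.0, 3300.0, 4600.0, 4700.0, 3400.0]
correlation, mae, rmse = evaluate(actual, predicted)
print(round(correlation, 4), round(mae, 2), round(rmse, 2))
```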


INTERPRETATION


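The test split shows a correlation coefficient of 0.8066 between predicted and actual sales; its
square, about 0.65, suggests that the model accounts for roughly 65% of the variation in sales.
This is an indication of fit rather than an accuracy figure, and it rests on only 5 test instances.
As expected, sales decrease as price increases and increase with the promotion budget.
This shows how WEKA can be used for multivariate regression.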
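
The fitted equation can be applied directly to see the effect of each variable. The sketch below uses the coefficients from the model output; the price and promotion values passed in are hypothetical and assumed to be in the same units as the data set.

```python
def predict_sales(price, promotion):
    # Coefficients copied from the linear regression model in the output.
    return -53.2173 * price + 3.6131 * promotion + 5837.5208


# Hypothetical inputs, chosen only to show the direction of each effect.
baseline = predict_sales(price=60, promotion=200)
higher_price = predict_sales(price=61, promotion=200)
more_promotion = predict_sales(price=60, promotion=201)

print(round(higher_price - baseline, 4))    # change in sales per unit of price
print(round(more_promotion - baseline, 4))  # change in sales per unit of promotion

# Squaring the reported correlation coefficient gives the share of variation.
r = 0.8066
r_squared = r ** 2
print(round(r_squared, 4))
```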



REFERENCE

http://en.wikipedia.org/wiki/Weka_(machine_learning)

http://www.cs.waikato.ac.nz/ml/weka/

http://en.wikipedia.org/wiki/Cluster_analysis_(in_marketing)

Contenu connexe

Similaire à Weka for clustering and regression itb vgsom

The projectAboveWay Sandwich - ProjectYou are a Master Black Belt .docx
The projectAboveWay Sandwich - ProjectYou are a Master Black Belt .docxThe projectAboveWay Sandwich - ProjectYou are a Master Black Belt .docx
The projectAboveWay Sandwich - ProjectYou are a Master Black Belt .docxssusera34210
 
Practical Data Science: Data Modelling and Presentation
Practical Data Science: Data Modelling and PresentationPractical Data Science: Data Modelling and Presentation
Practical Data Science: Data Modelling and PresentationHariniMS1
 
Insights from Sensory Research - How this Leads to Fresh Ideas and Innovation...
Insights from Sensory Research - How this Leads to Fresh Ideas and Innovation...Insights from Sensory Research - How this Leads to Fresh Ideas and Innovation...
Insights from Sensory Research - How this Leads to Fresh Ideas and Innovation...Merlien Institute
 
Soft And Handling
Soft And HandlingSoft And Handling
Soft And Handlinghiratufail
 
Strategic Tools- Walmart
Strategic Tools- WalmartStrategic Tools- Walmart
Strategic Tools- WalmartSara Abdelaal
 
Barga Data Science lecture 9
Barga Data Science lecture 9Barga Data Science lecture 9
Barga Data Science lecture 9Roger Barga
 
6MODULE 2Module 2 Problem SetEXAMPLEGrand .docx
6MODULE 2Module 2 Problem SetEXAMPLEGrand .docx6MODULE 2Module 2 Problem SetEXAMPLEGrand .docx
6MODULE 2Module 2 Problem SetEXAMPLEGrand .docxblondellchancy
 
Process Mining - Chapter 3 - Data Mining
Process Mining - Chapter 3 - Data MiningProcess Mining - Chapter 3 - Data Mining
Process Mining - Chapter 3 - Data MiningWil van der Aalst
 
Process mining chapter_03_data_mining
Process mining chapter_03_data_miningProcess mining chapter_03_data_mining
Process mining chapter_03_data_miningMuhammad Ajmal
 
ITB tutorial WEKA Prabhat Agarwal
ITB tutorial WEKA Prabhat AgarwalITB tutorial WEKA Prabhat Agarwal
ITB tutorial WEKA Prabhat AgarwalPrabhat Agarwal
 
Ingredients based - Recipe recommendation engine
Ingredients based - Recipe recommendation engineIngredients based - Recipe recommendation engine
Ingredients based - Recipe recommendation engineBharat Gandhi
 
ADVANCED SPREADSHEET SKILLS.pptx
ADVANCED SPREADSHEET SKILLS.pptxADVANCED SPREADSHEET SKILLS.pptx
ADVANCED SPREADSHEET SKILLS.pptxROWELTREYES
 
Lecture 7 guidelines_and_assignment
Lecture 7 guidelines_and_assignmentLecture 7 guidelines_and_assignment
Lecture 7 guidelines_and_assignmentDaria Bogdanova
 
What is Naïve Bayes Classification and How is it Used for Enterprise Analysis?
What is Naïve Bayes Classification and How is it Used for Enterprise Analysis?What is Naïve Bayes Classification and How is it Used for Enterprise Analysis?
What is Naïve Bayes Classification and How is it Used for Enterprise Analysis?Smarten Augmented Analytics
 
Weka project - Classification & Association Rule Generation
Weka project - Classification & Association Rule GenerationWeka project - Classification & Association Rule Generation
Weka project - Classification & Association Rule Generationrsathishwaran
 
Power line business overview
Power line business overviewPower line business overview
Power line business overviewbestwebsite2008
 
Less is more: Household milk allocation response to price change in peri-urba...
Less is more: Household milk allocation response to price change in peri-urba...Less is more: Household milk allocation response to price change in peri-urba...
Less is more: Household milk allocation response to price change in peri-urba...ILRI
 
Machine_Learning_Trushita
Machine_Learning_TrushitaMachine_Learning_Trushita
Machine_Learning_TrushitaTrushita Redij
 
Existing and new approaches for analysing data from Check All That Apply ques...
Existing and new approaches for analysing data from Check All That Apply ques...Existing and new approaches for analysing data from Check All That Apply ques...
Existing and new approaches for analysing data from Check All That Apply ques...Compusense Inc.
 
Barga Data Science lecture 4
Barga Data Science lecture 4Barga Data Science lecture 4
Barga Data Science lecture 4Roger Barga
 

Similaire à Weka for clustering and regression itb vgsom (20)

The projectAboveWay Sandwich - ProjectYou are a Master Black Belt .docx
The projectAboveWay Sandwich - ProjectYou are a Master Black Belt .docxThe projectAboveWay Sandwich - ProjectYou are a Master Black Belt .docx
The projectAboveWay Sandwich - ProjectYou are a Master Black Belt .docx
 
Practical Data Science: Data Modelling and Presentation
Practical Data Science: Data Modelling and PresentationPractical Data Science: Data Modelling and Presentation
Practical Data Science: Data Modelling and Presentation
 
Insights from Sensory Research - How this Leads to Fresh Ideas and Innovation...
Insights from Sensory Research - How this Leads to Fresh Ideas and Innovation...Insights from Sensory Research - How this Leads to Fresh Ideas and Innovation...
Insights from Sensory Research - How this Leads to Fresh Ideas and Innovation...
 
Soft And Handling
Soft And HandlingSoft And Handling
Soft And Handling
 
Strategic Tools- Walmart
Strategic Tools- WalmartStrategic Tools- Walmart
Strategic Tools- Walmart
 
Barga Data Science lecture 9
Barga Data Science lecture 9Barga Data Science lecture 9
Barga Data Science lecture 9
 
6MODULE 2Module 2 Problem SetEXAMPLEGrand .docx
6MODULE 2Module 2 Problem SetEXAMPLEGrand .docx6MODULE 2Module 2 Problem SetEXAMPLEGrand .docx
6MODULE 2Module 2 Problem SetEXAMPLEGrand .docx
 
Process Mining - Chapter 3 - Data Mining
Process Mining - Chapter 3 - Data MiningProcess Mining - Chapter 3 - Data Mining
Process Mining - Chapter 3 - Data Mining
 
Process mining chapter_03_data_mining
Process mining chapter_03_data_miningProcess mining chapter_03_data_mining
Process mining chapter_03_data_mining
 
ITB tutorial WEKA Prabhat Agarwal
ITB tutorial WEKA Prabhat AgarwalITB tutorial WEKA Prabhat Agarwal
ITB tutorial WEKA Prabhat Agarwal
 
Ingredients based - Recipe recommendation engine
Ingredients based - Recipe recommendation engineIngredients based - Recipe recommendation engine
Ingredients based - Recipe recommendation engine
 
ADVANCED SPREADSHEET SKILLS.pptx
ADVANCED SPREADSHEET SKILLS.pptxADVANCED SPREADSHEET SKILLS.pptx
ADVANCED SPREADSHEET SKILLS.pptx
 
Lecture 7 guidelines_and_assignment
Lecture 7 guidelines_and_assignmentLecture 7 guidelines_and_assignment
Lecture 7 guidelines_and_assignment
 
What is Naïve Bayes Classification and How is it Used for Enterprise Analysis?
What is Naïve Bayes Classification and How is it Used for Enterprise Analysis?What is Naïve Bayes Classification and How is it Used for Enterprise Analysis?
What is Naïve Bayes Classification and How is it Used for Enterprise Analysis?
 
Weka project - Classification & Association Rule Generation
Weka project - Classification & Association Rule GenerationWeka project - Classification & Association Rule Generation
Weka project - Classification & Association Rule Generation
 
Power line business overview
Power line business overviewPower line business overview
Power line business overview
 
Less is more: Household milk allocation response to price change in peri-urba...
Less is more: Household milk allocation response to price change in peri-urba...Less is more: Household milk allocation response to price change in peri-urba...
Less is more: Household milk allocation response to price change in peri-urba...
 
Machine_Learning_Trushita
Machine_Learning_TrushitaMachine_Learning_Trushita
Machine_Learning_Trushita
 
Existing and new approaches for analysing data from Check All That Apply ques...
Existing and new approaches for analysing data from Check All That Apply ques...Existing and new approaches for analysing data from Check All That Apply ques...
Existing and new approaches for analysing data from Check All That Apply ques...
 
Barga Data Science lecture 4
Barga Data Science lecture 4Barga Data Science lecture 4
Barga Data Science lecture 4
 

Dernier

Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilV3cube
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 

Dernier (20)

Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 

Weka for clustering and regression itb vgsom

  • 1. USING WEKA TO CLUSTERING AND REGRESSION ANALYSIS ( ITB PAPER ) ANURADHA CHAKRABORTY ROLL NO: 10BM60014 VINOD GUPTA SCHOOL OF MANAGEMENT, IIT KHARAGPUR
  • 2. WEKA (Waikato Environment for Knowledge Analysis) is a popular suite of machine learning software written in Java, developed at the University of Waikato, New Zealand. WEKA is free software available under the GNU General Public License. WEKA is a unique software compared to MS –EXCEL because it can be used to run multivariate regression without any hassles. It also gives output showing dependent variable equation and other statistical data. Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes. The initial versions of WEKA used only Attribute Relationship File Format (ARFF) files, saved as *.arff. But newer versions provide an option for multiple versions like: XRFF, Binary serial files, LIBSVM, SVM Light, CSV, C4.5 among others. USING WEKA: The WEKA GUI Chooser has the four following options: 1. Weka Explorer 2. Weka Experimenter 3. Weka Knowledge Flow 4. Simple CLI Weka Explorer has the following options in each tabs: 1. Preprocess 2. Classify 3. Cluster 4. Associate 5. Select Attributes 6. Visualize
  • 3. Apart from doing these statistical operations, each of the data can be visualized graphically and filtered according to requirement. Weka Experimenter: There are several algorithms for each process. Thus the criticality of the software lies in identifying the optimal algorithm. For Regression and classification, Experimenter gives a comparisn of the best algorithm by statistical analysis. Unfortunately, such an option is not there for Clustering algorithms. Import of data: Data is imported in form of CSV file which is converted into arff format automatically while importing. The data is imported through Preprocess tab of WEKA as shown in picture above. CLUSTERING Definition: Cluster analysis is a class of statistical techniques that can be applied to data that exhibit “natural” groupings. Cluster analysis sorts through the raw data and groups them into clusters. A cluster is a group of relatively homogeneous cases or observations. Objects in a cluster are similar to each other. They are also dissimilar to objects outside the cluster, particularly objects in other clusters.”
  • 4. DATA SET USED FOR CLUSTERING The example used is a survey report on instant noodles. It had: Instances: 76 Attribute: 33 The questions or attributes were as follows: Age Profession Diabetesstop Obesitystop Otherstop Cadburynchocl Homemadesweets Sweetfrmshop Cakepastry Sugarcube Celebration Gifts Beginningauspicious Yummyfood Healthconcern Lunchdinnerafter Tastytraditn Abroad Frequencyeating Inflnearby Inflfrndrelative Inflblogonline Advert Quality Packaging Ambience Price Imptraditonsweet Newexperimentswt Newvariety Homedeliveryimp Impchitchatplace Packagdsweetslngtime
  • 5. PROCEDURE AND RESULT: Data-set is taken from my AMRP project survey, regarding the interest and motivation of consumers towards traditional sweets. Simple K-Mean Algorithm was used to cluster the data set. The output is as follows: Attribute Full Data 0 1 (76) (44) (32) ======================================================= Age 1.6711 1.6364 1.7188 Profession 1.7632 1.6818 1.875 Diabetesstop 2.3553 2.3636 2.3438 Obesitystop 1.9605 1.9545 1.9688 otherstop 1.9474 1.8636 2.0625 Cadburynchocl 4.2895 4.25 4.3438 homemadesweets 4.3421 4.3636 4.3125 sweetfrmshop 4.0395 4.1136 3.9375 cakepastry 3.9342 4.0455 3.7813 sugarcube 2.4605 2.5 2.4063 celebration 4.1447 4.3409 3.875 gifts 3.7632 3.7955 3.7188 beginningauspicious 3.7763 3.8636 3.6563 yummyfood 3.8158 3.9318 3.6563 healthconcern 2.9868 3 2.9688 lunchdinnerafter 3.9737 4.0909 3.8125 tastytraditn 3.7632 4.0227 3.4063 abroad 1.8684 1.8864 1.8438 frequencyeating 2.5658 2.4318 2.75 inflnearby 3.0 4.0 3.0 inflfrndrelative 4.0 4.0 3.0 inflblogonline 3.0 3.0 2.0 advert 3.0 3.0 2.0 quality 5.0 5.0 5.0 packaging 3.0 3.0 4.0 ambience 3.0 3.0 4.0 price 3.0 4.0 3.0 imptraditonsweet 5.0 5.0 3.0 newexperimentswt 3.0 3.0 3.0 newvariety 3.0 3.0 4.0 homedeliveryimp 2.8158 2.8409 2.7813 impchitchatplace 3.3421 3.3182 3.375
  • 6. packagdsweetslngtime   3.1579          3      3.375

    Note: The significant values in the above table, on which the cluster characteristics are based, are marked in red.

    Clustered Instances
    0      44 ( 58%)
    1      32 ( 42%)

    INTERPRETATION:
    ASPECTS                        CLUSTER ‘0’                      CLUSTER ‘1’
    Traditionality                 Loves traditional sweets.        Loves experiments and newer
                                   Considers sweet a traditional    varieties of sweets.
                                   symbol. Wants sweet after
                                   lunch or dinner.
    Frequency of consumption       High                             Medium
    Price                          More price sensitive             Less price sensitive
    Influence of friends,          High                             Medium. Generally tries a
    relatives or advertisements                                     new shop on own instinct.
    to try a new shop
    Ambience of shop and           Matters less                     Matters significantly
    packaging
    Food court for chatting        Preferred                        Preferred
    (like Haldiram)
    Packaged/tinned sweets         Medium                           Good demand

    INFERENCE AND SUGGESTION DERIVED FROM THE CLUSTERING:
    There are two distinct clusters of consumers in the sweet industry.

    Cluster ‘0’ (58%) considers sweets the “symbol of tradition”, typically savored after lunch and dinner. These consumers enjoy the most traditional sweets and prefer not to try new variants. They stick to old shops unless prompted by external agents (friends, relatives, blogs, advertisements etc.) to try otherwise. Quality is an important factor, but ambience and packaging do not play a major role. So shops like Nokur or Girish Dey will typically be their favorites.

    Cluster ‘1’ (42%) are the true connoisseurs of sweets. They appreciate traditional as well as experimental sweets (the new variants) and often prefer trying out new shops and brands. They also like packaged sweets, which can be savored later. Apart from quality, ambience and packaging play a vital role, whereas price is of medium importance. This
  • 7. cluster seems to consist of more impulsive consumers, who would probably not mind paying a premium for new and creative sweets. So brands like K.C. Das will be their preferred choice.

    REGRESSION
    The next procedure is regression analysis. We obtain data from stores on the monthly sales of a celebration chocolate pack as a function of its price and of the amount spent on its promotion, in terms of posters used around the block or any other effort. We then select all attributes, go to the Classify tab and run the linear regression function.

    OUTPUT
    The output obtained is given below:

    === Run information ===
    Scheme:       weka.classifiers.functions.LinearRegression -S 0 -R 1.0E-8
    Relation:     Problem_2-weka.filters.unsupervised.attribute.Remove-R1
    Instances:    46
  • 8. Attributes:   3
                  Sales
                  Price
                  Promotion
    Test mode:    split 80.0% train, remainder test

    === Classifier model (full training set) ===

    Linear Regression Model

    Sales = -53.2173 * Price + 3.6131 * Promotion + 5837.5208

    Time taken to build model: 0 seconds

    === Evaluation on test split ===
    === Summary ===

    Correlation coefficient                  0.8066
    Mean absolute error                    543.6332
    Root mean squared error                711.4575
    Relative absolute error                 48.288  %
    Root relative squared error             59.6886 %
    Total Number of Instances                5
    Ignored Class Unknown Instances          4

    INTERPRETATION
    The model shows a correlation coefficient of 0.8066 on the test split. Squaring it gives about 0.65, so the model accounts for roughly 65% of the variation in sales. As expected, sales decrease as price increases and increase with the promotion budget. This illustrates how WEKA can be used for multivariate regression.
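The fitted equation and the 65% figure can be checked with a few lines of arithmetic. The Python sketch below evaluates the reported equation and squares the reported correlation coefficient. The function name `predict_sales` and the input values for price and promotion are hypothetical, chosen only to show the direction of each effect. The units of the original data are not stated in the source.

```python
def predict_sales(price, promotion):
    """Evaluate the fitted equation reported in the WEKA output."""
    return -53.2173 * price + 3.6131 * promotion + 5837.5208

# Hypothetical inputs: raise price or promotion by one unit, other held fixed.
base = predict_sales(price=60, promotion=400)
dearer = predict_sales(price=61, promotion=400)
promoted = predict_sales(price=60, promotion=401)

print(round(dearer - base, 4))    # change in predicted sales per unit of price
print(round(promoted - base, 4))  # change per unit of promotion

r = 0.8066               # correlation coefficient on the test split
print(round(r ** 2, 4))  # squared correlation, the source of the 65% figure
```

Each one-unit change reproduces the corresponding coefficient: negative for price, positive for promotion.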