Itb weka nikhil
IT FOR BUSINESS INTELLIGENCE
Data Analysis Techniques Using WEKA: Classification and Regression
Nikhil Yagnic (07AG3801)
Introduction
Weka (Waikato Environment for Knowledge Analysis) is a popular suite of machine learning software
written in Java, developed at the University of Waikato, New Zealand. Weka is free software
available under the GNU General Public License.
The Weka workbench[1] contains a collection of visualization tools and algorithms for data analysis
and predictive modelling, together with graphical user interfaces for easy access to this functionality.
The original non-Java version of Weka was a Tcl/Tk front-end to (mostly third-party) modelling
algorithms implemented in other programming languages, plus data pre-processing utilities in C, and
a Makefile-based system for running machine learning experiments. This original version was
primarily designed as a tool for analyzing data from agricultural domains,[2][3] but the more recent
fully Java-based version (Weka 3), for which development started in 1997, is now used in many
different application areas, in particular for educational purposes and research.
Advantages of Weka include:
- free availability under the GNU General Public License
- portability, since it is fully implemented in the Java programming language and thus runs on
  almost any modern computing platform
- a comprehensive collection of data pre-processing and modelling techniques
- ease of use due to its graphical user interfaces
Weka supports several standard data mining tasks, more specifically, data pre-processing, clustering,
classification, regression, visualization, and feature selection. All of Weka's techniques are
predicated on the assumption that the data is available as a single flat file or relation, where each
data point is described by a fixed number of attributes (normally, numeric or nominal attributes, but
some other attribute types are also supported). Weka provides access to SQL databases using Java
Database Connectivity and can process the result returned by a database query. It is not capable of
multi-relational data mining, but there is separate software for converting a collection of linked
database tables into a single table that is suitable for processing using Weka.[4] Another important
area that is currently not covered by the algorithms included in the Weka distribution is sequence
modelling.
Classification via decision trees using WEKA
Problem:
A bank is introducing a new financial product and wants to predict whether each new customer
will buy it. The bank already has data on its existing clients, including whether each of them
was interested in buying the new product.
Classification is a statistical technique that assigns a new client to one of the existing
groups. It builds a model from the available training data (here, the existing client records) and
then classifies new data using that model.
Steps to do classification in WEKA
Step 1: Create a data file in ARFF or CSV format. Weka reads both formats. This example uses a
CSV file, bank.csv.
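The two formats differ mainly in that ARFF declares each attribute's name and type in a header before the data rows. The sketch below illustrates that relationship using only the Python standard library. The column names and values (age, income, buys) are hypothetical and are not taken from the bank data, and the type inference (numeric if every value parses as a number, nominal otherwise) is a simplifying assumption rather than what Weka itself does when it loads a CSV file.

```python
import csv
import io


def _is_number(text):
    """Return True if the text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def csv_to_arff(csv_text, relation):
    """Convert CSV text with a header row into ARFF text.

    Assumption: a column is numeric if every value parses as a float;
    otherwise it is nominal, with the observed values as its labels.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    lines = ["@relation " + relation, ""]
    for i, name in enumerate(header):
        values = [row[i] for row in data]
        if all(_is_number(v) for v in values):
            lines.append("@attribute %s numeric" % name)
        else:
            labels = ",".join(sorted(set(values)))
            lines.append("@attribute %s {%s}" % (name, labels))
    lines += ["", "@data"]
    lines += [",".join(row) for row in data]
    return "\n".join(lines) + "\n"


# Hypothetical sample; these columns are illustrative only.
sample = "age,income,buys\n35,42000,YES\n52,18000,NO\n"
arff = csv_to_arff(sample, "bank")
print(arff)
```

The printed text has an `@relation` line, one `@attribute` line per column, and the unchanged rows under `@data`.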
Step 2: Open the Weka application. This shows the following screen.
Step 3: Load the data into WEKA.
Click the “Open file” button and browse for the bank.csv file. Weka then lists all the
attributes, as shown in the figure below.
Step 4: View the data
In the Selected attribute panel you can see the values of the selected attribute, along with its
type, name, etc.
You can also visualize the frequency distributions of all the attributes at once by clicking the
“Visualize All” button. It shows the following screen.
This view shows the range of data for each attribute, along with its mean, median and
frequency. For example, age in our case ranges from 18 to 67, with an average of 42.5.
Step 5: Classify the data
Click the Classify tab, which shows the following screen.
Then click the Choose button and select the J48 algorithm, which is listed under the trees node.
This shows the following screen.
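J48 is Weka's implementation of the C4.5 decision tree learner, which chooses the attribute to split on using an entropy-based measure. C4.5 ranks splits by gain ratio, a normalized form of information gain; the sketch below shows only plain information gain, as a simplified illustration of the idea and not J48's actual code. The customer records and attribute names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy reduction from splitting rows on one nominal attribute."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in set(r[attribute] for r in rows):
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


# Hypothetical customers; attribute names and values are made up.
customers = [
    {"married": "yes", "has_car": "yes", "buys": "yes"},
    {"married": "yes", "has_car": "no", "buys": "yes"},
    {"married": "no", "has_car": "yes", "buys": "no"},
    {"married": "no", "has_car": "no", "buys": "no"},
]
gain_married = information_gain(customers, "married", "buys")
gain_car = information_gain(customers, "has_car", "buys")
print(gain_married, gain_car)
```

In this toy data the married attribute separates buyers from non-buyers perfectly, so it has the higher gain and would be preferred as the first split.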
Step 6: Run the classification algorithm
Select the dependent variable that should be classified and click Start.
The Classifier output panel then shows the result, including an ASCII version of the tree.
The ASCII tree is difficult to read. To view the output as a graphical tree, right-click the
trees.J48 entry in the result list and select the “Visualize tree” option. Right-clicking on the
output again and selecting the full screen option shows the following screen.
Step 7: Analyze the model built from the existing data
The Classifier output shows that the classification accuracy of the model is 89%.
This means the model predicted the correct class for 89% of the instances it was evaluated on. If
new customers resemble the existing ones, we can therefore expect a probability of roughly 0.89
that the predicted buying decision for a new customer is correct.
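Classification accuracy is the number of correctly classified instances divided by the total number of instances. A minimal sketch of that calculation follows; the label lists are hypothetical and are not the bank data.

```python
def accuracy(actual, predicted):
    """Fraction of instances whose predicted class equals the actual class."""
    if len(actual) != len(predicted):
        raise ValueError("label lists must have the same length")
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


# Hypothetical labels for ten customers, not the bank data.
actual = ["YES", "NO", "NO", "YES", "NO", "YES", "NO", "NO", "YES", "NO"]
predicted = ["YES", "NO", "NO", "NO", "NO", "YES", "NO", "NO", "YES", "NO"]
score = accuracy(actual, predicted)
print(score)  # 9 of the 10 predictions match, so this prints 0.9
```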
Step 8: Test the new customer data
Create your new customer data in ARFF or CSV format with the same attributes as the training data.
Then select the “Supplied test set” radio button and click “Set” to browse for
the new data set.
Then click the Start button, which generates a new tree.
Save the classification result as an ARFF file. This file contains a copy of the new instances
along with an additional column for the predicted value. The result looks like the following.
Regression Using WEKA
Problem:
The idea is to find out how CPU performance is related to attributes such as machine cycle
time, minimum main memory, cache memory, etc.
Regression is a statistical tool that helps determine how the dependent variable (CPU
performance) is related to the independent attributes.
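To make the idea concrete, the sketch below fits a straight line to a single independent attribute by ordinary least squares. This is a simplified, one-attribute illustration; Weka's linear regression works with several attributes at once. The numbers are hypothetical and were chosen so that the relationship is exactly y = 2x + 1.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical values, not the CPU data; here y = 2x + 1 exactly.
cache_memory = [0, 8, 16, 32]
performance = [1, 17, 33, 65]
slope, intercept = fit_line(cache_memory, performance)
print(slope, intercept)
```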
Steps to do Regression in WEKA
Step 1: Create the data file and open WEKA in the same way as for classification.
Step 2: Load the regression data file CPU.arff into WEKA.
Click “Open file” and browse for the file. This shows the following screen.
Step 3: Run the regression
Click on the Classify tab and choose “Linear Regression” under the functions node. This shows
the following screen.
Click Start. The Classifier output panel then shows the result, which includes a regression equation.
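A linear regression equation is a weighted sum of the attribute values plus a constant. The sketch below shows how such an equation is evaluated for one machine. The coefficients and the intercept are placeholders invented for illustration, not the values Weka reports for the CPU data; only the attribute names CHMAX and CACHE come from this tutorial.

```python
def predict(coefficients, intercept, instance):
    """Evaluate a linear regression equation for one instance."""
    return intercept + sum(weight * instance[name]
                           for name, weight in coefficients.items())


# Placeholder coefficients for illustration only; substitute the values
# from the regression equation printed in the classifier output.
coefficients = {"CHMAX": 1.5, "CACHE": 0.5}
intercept = -10.0

# Hypothetical attribute values for a new machine.
new_machine = {"CHMAX": 32, "CACHE": 64}
estimate = predict(coefficients, intercept, new_machine)
print(estimate)  # -10.0 + 1.5 * 32 + 0.5 * 64 = 70.0
```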
Interpretation of the output:
- CPU performance depends most strongly on CHMAX, followed by CACHE.
- The correlation coefficient of 0.912 is very high, which suggests that the dependent
  variable is strongly associated with the independent variables.
- Given the attribute values of a new machine, we can also estimate its CPU performance
  using the regression equation.
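The correlation coefficient can be reproduced by hand. The sketch below computes a Pearson correlation between actual and predicted values, on the assumption that this is the quantity reported in the output. The numbers are hypothetical and are not the CPU data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical actual and predicted performance values.
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]
r = pearson(actual, predicted)
print(round(r, 3))
```

A value close to 1 means the predictions rise and fall almost in step with the actual values; a value near 0 means there is little linear association.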