WEKA




                    CS 595
       Knowledge Discovery and Datamining




                   Assignment # 1
            Evaluation Report for WEKA
       (Waikato Environment for Knowledge Analysis)




                     Presented By:
                     Manoj Wartikar
                     Sameer Sagade
                        Date:
                   14th March, 2000.






                 Weka Machine Learning Project.
Machine Learning:
      An exciting and potentially far-reaching development in contemporary
computer science is the invention and application of methods of Machine
Learning. These enable a computer program to automatically analyze a large
body of data and decide what information is most relevant. This crystallized
information can then be used to help people make decisions faster and more
accurately.

       One of the central problems of the information age is dealing with the
enormous explosion in the amount of raw information that is available. Machine
learning (ML) has the potential to sift through this mass of information and
convert it into knowledge that people can use. So far, however, it has been used
mainly on small problems under well-controlled conditions.

       The aim of the Weka Project is to bring the technology out of the
laboratory and provide solutions that can make a difference to people. The
overall goal of this research programme is to build a state-of-the-art facility for
the development of ML techniques.

Objectives:
       The team at Waikato has incorporated several standard ML techniques
into a software “workbench” called WEKA (Waikato Environment for
Knowledge Analysis). With the use of WEKA, a specialist in a particular field is
able to use ML and derive useful knowledge from databases that are far too large
to be analyzed by hand. The main objectives of WEKA are to
       • Make Machine Learning (ML) techniques generally available;
       • Apply them to practical problems, such as those in agriculture;
       • Develop new machine learning algorithms;
       • Design a theoretical framework for the field.


Documented Features:
       WEKA presents a collection of algorithms for solving real-world data
mining problems. The software is written in Java 2 and includes a uniform
interface to the standard techniques in machine learning. The following
techniques in data mining are implemented in WEKA:
       1. Attribute Selection.
       2. Clustering.
       3. Classifiers (both numeric and non-numeric).
       4. Association Rules.
       5. Filters.
       6. Estimators.






       Out of these options, only classifiers, association rules and filters are
available as direct executables. All the remaining functions are available as
APIs. The data required by the software must be in the ARFF (“.arff”) format.
Sample databases are also provided with the software.
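
       For readers unfamiliar with the format, the following is a minimal, hand-
written ARFF sketch (our own illustration, not one of the bundled sample files):
a file names a relation, declares each attribute with its nominal values or type,
and lists the data rows.

@relation weather.symbolic

@attribute outlook {sunny, overcast, rainy}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,FALSE,no
overcast,FALSE,yes
rainy,TRUE,no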


Features:
       The WEKA package comprises a number of classes organized in an
inheritance hierarchy. To execute a technique, we create an instance of the
corresponding class. The functionality of WEKA is organized according to the
steps of the machine learning process.
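
       As a rough sketch of this programmatic usage (our own illustration built
around the class names listed below; the method names are our assumptions
about the WEKA 3 Java API rather than something exercised in this report):

import java.io.FileReader;
import weka.core.Instances;
import weka.classifiers.j48.J48;

public class BuildTree {
    public static void main(String[] args) throws Exception {
        // Load one of the sample ARFF datasets shipped with WEKA.
        Instances data = new Instances(new FileReader("data/iris.arff"));
        // By convention the class attribute is the last one.
        data.setClassIndex(data.numAttributes() - 1);
        // Instantiate the C4.5 learner and induce a tree, much as the
        // command-line J48 executable does.
        J48 tree = new J48();
        tree.buildClassifier(data);
        System.out.println(tree);
    }
}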

Classifiers:
       Each classifier class prints out the classifier it learns (for example, a
decision tree) for the dataset given as input; a ten-fold cross-validation estimate
of its performance is also calculated. The Classifiers package implements the
most common techniques separately for categorical and numerical values.

       a) Classifiers for categorical prediction:

1.  weka.classifiers.IBk                K-nearest neighbor learner
2.  weka.classifiers.j48.J48            C4.5 decision trees
3.  weka.classifiers.j48.PART           Rule learner
4.  weka.classifiers.NaiveBayes         Naive Bayes with/without kernels
5.  weka.classifiers.OneR               Holte's OneR
6.  weka.classifiers.KernelDensity      Kernel density classifier
7.  weka.classifiers.SMO                Support vector machines
8.  weka.classifiers.Logistic           Logistic regression
9.  weka.classifiers.AdaBoostM1         AdaBoost
10. weka.classifiers.LogitBoost         LogitBoost
11. weka.classifiers.DecisionStump      Decision stumps (for boosting)






 Sample Executions of the various categorical CLASSIFIER Algorithms:


K Nearest Neighbour Algorithm:

>java weka.classifiers.IBk -t data/iris.arff

IB1 instance-based classifier
using 1 nearest neighbour(s) for classification


=== Error on training data ===

Correctly Classified Instances        150         100   %
Incorrectly Classified Instances        0         0   %
Mean absolute error                  0.0085
Root mean squared error                 0.0091
Total Number of Instances             150

=== Confusion Matrix ===

a b c <-- classified as
50 0 0 | a = Iris-setosa
0 50 0 | b = Iris-versicolor
0 0 50 | c = Iris-virginica

=== Stratified cross-validation ===

Correctly Classified Instances        144         96  %
Incorrectly Classified Instances        6         4  %
Mean absolute error                  0.0356
Root mean squared error                 0.1618
Total Number of Instances             150

=== Confusion Matrix ===

a b c <-- classified as
50 0 0 | a = Iris-setosa
0 47 3 | b = Iris-versicolor
0 3 47 | c = Iris-virginica
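
       In these confusion matrices each row gives the actual class and each
column the predicted class: under cross-validation, 3 versicolor instances are
misclassified as virginica and 3 virginica instances as versicolor, accounting for
the 6 errors (4 %) reported above.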






J48 Pruned Tree Algorithm:

>java weka.classifiers.j48.J48 -t data/iris.arff

J48 pruned tree
------------------

petalwidth <= 0.6: Iris-setosa (50.0)
petalwidth > 0.6
| petalwidth <= 1.7
| | petallength <= 4.9: Iris-versicolor (48.0/1.0)
| | petallength > 4.9
| | | petalwidth <= 1.5: Iris-virginica (3.0)
| | | petalwidth > 1.5: Iris-versicolor (3.0/1.0)
| petalwidth > 1.7: Iris-virginica (46.0/1.0)

Number of Leaves :           5

Size of the tree :    9


=== Error on training data ===

Correctly Classified Instances       147             98  %
Incorrectly Classified Instances       3             2  %
Mean absolute error                 0.0233
Root mean squared error                0.108
Total Number of Instances            150

=== Confusion Matrix ===

a b c <-- classified as
50 0 0 | a = Iris-setosa
0 49 1 | b = Iris-versicolor
0 2 48 | c = Iris-virginica

=== Stratified cross-validation ===
Correctly Classified Instances      143              95.3333 %
Incorrectly Classified Instances      7              4.6667 %
Mean absolute error                0.0391
Root mean squared error               0.1707
Total Number of Instances           150

=== Confusion Matrix ===
 a b c <-- classified as
49 1 0 | a = Iris-setosa
 0 47 3 | b = Iris-versicolor





=== Error on training data ===

Correctly Classified Instances      144        96  %
Incorrectly Classified Instances      6        4  %
Mean absolute error                0.0324
Root mean squared error               0.1495
 50 0 0 | a = Iris-setosa
 0 48 2 | b = Iris-versicolor
 0 4 46 | c = Iris-virginica


        SMO (support vector machines) and logistic regression can handle only
 two-class data sets, so they are not evaluated here.
        AdaBoostM1 and LogitBoost are boosting algorithms that improve the
 performance of a base classifier: the base algorithm is run repeatedly inside the
 booster, which tracks the errors made so far, reweights the training instances
 accordingly, and combines the resulting models into a single, stronger classifier.
 DecisionStump provides the simple one-level decision tree that is typically used
 as the base classifier for boosting.
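
        As an illustration (a configuration sketched here rather than taken from
 our runs, and assuming the same -W sub-classifier convention that
 RegressionByDiscretization uses below), boosting decision stumps on the iris
 data would be invoked as:

> java weka.classifiers.AdaBoostM1 -t data/iris.arff -W weka.classifiers.DecisionStump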






       b) Classifiers for numerical prediction:

1.   weka.classifiers.LinearRegression                Linear regression
2.   weka.classifiers.m5.M5Prime                      Model trees
3.   weka.classifiers.IBk                             K-nearest neighbor learner
4.   weka.classifiers.LWR                             Locally weighted regression
5.   weka.classifiers.RegressionByDiscretization      Uses categorical classifiers

Sample Executions of the various numerical CLASSIFIER Algorithms:

Linear Regression Model:

> java weka.classifiers.LinearRegression -t data/cpu.arff

Linear Regression Model

class =

  -152.7641 * vendor=microdata,formation,prime,harris,dec,wang,perkin-
elmer,nixdorf,bti,sratus,dg,burroughs,cambex,magnuson,honeywell,ipl,ibm,cdc,n
cr,basf,gould,siemens,nas,adviser,sperry,amdahl +
   141.8644 * vendor=formation,prime,harris,dec,wang,perkin-
elmer,nixdorf,bti,sratus,dg,burroughs,cambex,magnuson,honeywell,ipl,ibm,cdc,n
cr,basf,gould,siemens,nas,adviser,sperry,amdahl +
   -38.2268 *
vendor=burroughs,cambex,magnuson,honeywell,ipl,ibm,cdc,ncr,basf,gould,siem
ens,nas,adviser,sperry,amdahl +
    39.4748 *
vendor=cambex,magnuson,honeywell,ipl,ibm,cdc,ncr,basf,gould,siemens,nas,ad
viser,sperry,amdahl +
   -39.5986 *
vendor=honeywell,ipl,ibm,cdc,ncr,basf,gould,siemens,nas,adviser,sperry,amdahl
+
    21.4119 *
vendor=ipl,ibm,cdc,ncr,basf,gould,siemens,nas,adviser,sperry,amdahl +
   -41.2396 * vendor=gould,siemens,nas,adviser,sperry,amdahl +
    32.0545 * vendor=siemens,nas,adviser,sperry,amdahl +
  -113.6927 * vendor=adviser,sperry,amdahl +
   176.5204 * vendor=sperry,amdahl +
   -51.2583 * vendor=amdahl +
     0.0616 * MYCT +
     0.0171 * MMIN +
     0.0054 * MMAX +
     0.6654 * CACH +
    -1.4159 * CHMIN +
     1.5538 * CHMAX +




  -41.4854
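
       As we read this output, each vendor=... term is an indicator variable
created by WEKA’s conversion of the nominal vendor attribute into binary
attributes: its coefficient is added whenever the instance’s vendor appears in the
listed set of values, and the final constant, -41.4854, is the intercept of the
model.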

=== Error on training data ===

Correlation coefficient          0.963
Mean absolute error              28.4042
Root mean squared error             41.6084
Relative absolute error          32.5055 %
Root relative squared error        26.9508 %
Total Number of Instances          209

=== Cross-validation ===

Correlation coefficient          0.9328
Mean absolute error              35.014
Root mean squared error             55.6291
Relative absolute error          39.9885 %
Root relative squared error        35.9513 %
Total Number of Instances          209







Pruned Training Model Tree:

> java weka.classifiers.m5.M5Prime -t data/cpu.arff


Pruned training model tree:

MMAX <= 14000 : LM1 (141/4.18%)
MMAX > 14000 : LM2 (68/51.8%)

Models at the leaves:

 Smoothed (complex):

   LM1: class = 4.15
           -
2.05vendor=honeywell,ipl,ibm,cdc,ncr,basf,gould,siemens,nas,adviser,sperry,am
dahl
           + 5.43vendor=adviser,sperry,amdahl - 5.78vendor=amdahl
           + 0.00638MYCT + 0.00158MMIN + 0.00345MMAX + 0.552CACH
           + 1.14CHMIN + 0.0945CHMAX
   LM2: class = -113
           -
56.1vendor=honeywell,ipl,ibm,cdc,ncr,basf,gould,siemens,nas,adviser,sperry,am
dahl
           + 10.2vendor=adviser,sperry,amdahl - 10.9vendor=amdahl
           + 0.012MYCT + 0.0145MMIN + 0.0089MMAX + 0.808CACH +
1.29CHMAX

Number of Leaves : 2

=== Error on training data ===

Correlation coefficient          0.9853
Mean absolute error              13.4072
Root mean squared error             26.3977
Relative absolute error          15.3431 %
Root relative squared error        17.0985 %
Total Number of Instances          209

=== Cross-validation ===

Correlation coefficient          0.9767
Mean absolute error              13.1239
Root mean squared error             33.4455
Relative absolute error          14.9884 %




Root relative squared error   21.6147 %
Total Number of Instances     209







K Nearest Neighbour classifier Algorithm:

> java weka.classifiers.IBk -t data/cpu.arff


IB1 instance-based classifier
using 1 nearest neighbour(s) for classification


=== Error on training data ===

Correlation coefficient            1
Mean absolute error                 0
Root mean squared error                  0
Relative absolute error            0        %
Root relative squared error             0    %
Total Number of Instances               209

=== Cross-validation ===

Correlation coefficient            0.9475
Mean absolute error                20.8589
Root mean squared error               53.8162
Relative absolute error            23.8223 %
Root relative squared error          34.7797 %
Total Number of Instances            209







Locally Weighted Regression:

> java weka.classifiers.LWR -t data/cpu.arff


Locally weighted regression
===========================
Using linear weighting kernels
Using all neighbours

=== Error on training data ===

Correlation coefficient          0.9967
Mean absolute error               8.9683
Root mean squared error             12.6133
Relative absolute error          10.2633 %
Root relative squared error         8.1699 %
Total Number of Instances          209

=== Cross-validation ===

Correlation coefficient          0.9808
Mean absolute error              14.9006
Root mean squared error             31.0836
Relative absolute error          17.0176 %
Root relative squared error        20.0884 %
Total Number of Instances          209







Regression by Discretization:

> java weka.classifiers.RegressionByDiscretization -t data/cpu.arff -W
weka.classifiers.IBk

// The -W option selects the (categorical) sub-classifier applied to the discretized class

Regression by discretization

Class attribute discretized into 10 values

Subclassifier: weka.classifiers.IBk

IB1 instance-based classifier
using 1 nearest neighbour(s) for classification


=== Error on training data ===

Correlation coefficient            0.9783
Mean absolute error                32.0353
Root mean squared error               35.6977
Relative absolute error            36.6609 %
Root relative squared error          23.1223 %
Total Number of Instances            209

=== Cross-validation ===

Correlation coefficient            0.9244
Mean absolute error                41.5572
Root mean squared error               64.7253
Relative absolute error            47.4612 %
Root relative squared error          41.8299 %
Total Number of Instances            209






Association rules:
       Association rule mining finds interesting association or correlation
relationships among a large set of data items. With massive amounts of data
continuously being collected and stored in databases, many industries are
becoming interested in mining association rules from their databases. For
example, the discovery of interesting association relationships among huge
amounts of business transaction records can help with catalog design, cross-
marketing, loss-leader analysis, and other business decision-making processes.

       A typical example of association rule mining is market basket analysis.
This process analyzes customer buying habits by finding associations between
the different items that customers place in their “shopping baskets”. The
discovery of such associations can help retailers develop marketing strategies by
gaining insight into which items are frequently purchased together by customers.
For instance, if customers are buying milk, how likely are they to also buy bread
(and what kind of bread) on the same trip to the supermarket? Such information
can lead to increased sales.

       The WEKA software efficiently produces association rules for the given
data set. The Apriori algorithm is used as the foundation of the package. It
reports all the large (frequent) itemsets and the resulting rules for the specified
minimum support and confidence.

      A typical output of the Association package is shown below.

Apriori Algorithm:

> java weka.associations.Apriori -t data/weather.nominal.arff -I yes


Apriori
=======

Minimum support: 0.2
Minimum confidence: 0.9
Number of cycles performed: 17

Generated sets of large itemsets:

Size of set of large itemsets L(1): 12

Large Itemsets L(1):
outlook=sunny 5
outlook=overcast 4
outlook=rainy 5
temperature=hot 4
temperature=mild 6





temperature=cool 4
humidity=high 7
humidity=normal 7
windy=TRUE 6
windy=FALSE 8
play=yes 9
play=no 5

Size of set of large itemsets L(2): 47

Large Itemsets L(2):
outlook=sunny temperature=hot 2
outlook=sunny temperature=mild 2
outlook=sunny humidity=high 3
outlook=sunny humidity=normal 2
outlook=sunny windy=TRUE 2
outlook=sunny windy=FALSE 3
outlook=sunny play=yes 2
outlook=sunny play=no 3
outlook=overcast temperature=hot 2
outlook=overcast humidity=high 2
outlook=overcast humidity=normal 2
outlook=overcast windy=TRUE 2
outlook=overcast windy=FALSE 2
outlook=overcast play=yes 4
outlook=rainy temperature=mild 3
outlook=rainy temperature=cool 2
outlook=rainy humidity=high 2
outlook=rainy humidity=normal 3
outlook=rainy windy=TRUE 2
outlook=rainy windy=FALSE 3
outlook=rainy play=yes 3
outlook=rainy play=no 2
temperature=hot humidity=high 3
temperature=hot windy=FALSE 3
temperature=hot play=yes 2
temperature=hot play=no 2
temperature=mild humidity=high 4
temperature=mild humidity=normal 2
temperature=mild windy=TRUE 3
temperature=mild windy=FALSE 3
temperature=mild play=yes 4
temperature=mild play=no 2
temperature=cool humidity=normal 4
temperature=cool windy=TRUE 2
temperature=cool windy=FALSE 2





temperature=cool play=yes 3
humidity=high windy=TRUE 3
humidity=high windy=FALSE 4
humidity=high play=yes 3
humidity=high play=no 4
humidity=normal windy=TRUE 3
humidity=normal windy=FALSE 4
humidity=normal play=yes 6
windy=TRUE play=yes 3
windy=TRUE play=no 3
windy=FALSE play=yes 6
windy=FALSE play=no 2

Size of set of large itemsets L(3): 39

Large Itemsets L(3):
outlook=sunny temperature=hot humidity=high 2
outlook=sunny temperature=hot play=no 2
outlook=sunny humidity=high windy=FALSE 2
outlook=sunny humidity=high play=no 3
outlook=sunny humidity=normal play=yes 2
outlook=sunny windy=FALSE play=no 2
outlook=overcast temperature=hot windy=FALSE 2
outlook=overcast temperature=hot play=yes 2
outlook=overcast humidity=high play=yes 2
outlook=overcast humidity=normal play=yes 2
outlook=overcast windy=TRUE play=yes 2
outlook=overcast windy=FALSE play=yes 2
outlook=rainy temperature=mild humidity=high 2
outlook=rainy temperature=mild windy=FALSE 2
outlook=rainy temperature=mild play=yes 2
outlook=rainy temperature=cool humidity=normal 2
outlook=rainy humidity=normal windy=FALSE 2
outlook=rainy humidity=normal play=yes 2
outlook=rainy windy=TRUE play=no 2
outlook=rainy windy=FALSE play=yes 3
temperature=hot humidity=high windy=FALSE 2
temperature=hot humidity=high play=no 2
temperature=hot windy=FALSE play=yes 2
temperature=mild humidity=high windy=TRUE 2
temperature=mild humidity=high windy=FALSE 2
temperature=mild humidity=high play=yes 2
temperature=mild humidity=high play=no 2
temperature=mild humidity=normal play=yes 2
temperature=mild windy=TRUE play=yes 2
temperature=mild windy=FALSE play=yes 2





temperature=cool humidity=normal windy=TRUE 2
temperature=cool humidity=normal windy=FALSE 2
temperature=cool humidity=normal play=yes 3
temperature=cool windy=FALSE play=yes 2
humidity=high windy=TRUE play=no 2
humidity=high windy=FALSE play=yes 2
humidity=high windy=FALSE play=no 2
humidity=normal windy=TRUE play=yes 2
humidity=normal windy=FALSE play=yes 4

Size of set of large itemsets L(4): 6

Large Itemsets L(4):
outlook=sunny temperature=hot humidity=high play=no 2
outlook=sunny humidity=high windy=FALSE play=no 2
outlook=overcast temperature=hot windy=FALSE play=yes 2
outlook=rainy temperature=mild windy=FALSE play=yes 2
outlook=rainy humidity=normal windy=FALSE play=yes 2
temperature=cool humidity=normal windy=FALSE play=yes 2

Best rules found:

 1. humidity=normal windy=FALSE 4 ==> play=yes 4 (1)
 2. temperature=cool 4 ==> humidity=normal 4 (1)
 3. outlook=overcast 4 ==> play=yes 4 (1)
 4. temperature=cool play=yes 3 ==> humidity=normal 3 (1)
 5. outlook=rainy windy=FALSE 3 ==> play=yes 3 (1)
 6. outlook=rainy play=yes 3 ==> windy=FALSE 3 (1)
 7. outlook=sunny humidity=high 3 ==> play=no 3 (1)
 8. outlook=sunny play=no 3 ==> humidity=high 3 (1)
 9. temperature=cool windy=FALSE 2 ==> humidity=normal play=yes 2 (1)
10. temperature=cool humidity=normal windy=FALSE 2 ==> play=yes 2 (1)
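
       Each rule can be read directly from the itemset counts above: in rule 1,
the antecedent humidity=normal windy=FALSE occurs in 4 instances and
play=yes holds in all 4 of them, so the confidence (the figure in parentheses) is
4/4 = 1. All ten rules therefore meet the 0.9 minimum confidence.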






Advantages, Disadvantages and Future Upgrades:

       The WEKA system covers the entire machine learning (knowledge
discovery) process. Although a research project, the WEKA system has been
able to implement and evaluate a number of different algorithms for the different
steps in the machine learning process.

       The output and the information provided by the package are sufficient for
an expert in machine learning and related topics. The results displayed by the
system give a detailed description of the flow and the steps involved in the
entire machine learning process. The outputs produced by the different
algorithms are easy to compare, which makes analysis easier.

       The ARFF format is one of the most widely used data storage formats for
research databases, which makes the system well suited to research-oriented
projects.

      This package provides a number of application program interfaces (APIs)
which help novice data miners build their own systems on top of the core WEKA
system.

      Since the system provides a number of switches and options, we can
customize the output of the system to suit our needs.
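
      For example (a sketch based on the conventions in the runs above; only
the -t switch appears in this report, and the test file named here is hypothetical),
evaluation on a separate held-out set instead of the default cross-validation
would look like:

> java weka.classifiers.j48.J48 -t data/iris.arff -T data/iris-test.arff

where -t names the training file and -T an optional test file.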

        The first major disadvantage is that the system is Java based and
requires a Java Virtual Machine for its execution. Since the system is driven
entirely by command-line parameters and switches, it is difficult for an amateur
to use the system efficiently. The purely textual interface and output make it all
the more difficult to interpret and visualize the results.

       Important results such as pruned trees and other hierarchy-based outputs
cannot be displayed graphically, making the results difficult to visualize.

     Although it is a commonly used format, ARFF is the only data format that
the WEKA system supports.

       While the current version (3.0.1) has some bugs and shortcomings, the
developers are working on a better system and have come up with a new version
with a graphical user interface, making the system more complete.








                        Appendix
       (Sample executions for other algorithms covered)






PART Decision List Algorithm

>java weka.classifiers.j48.PART -t data/iris.arff

PART decision list
------------------

petalwidth <= 0.6: Iris-setosa (50.0)

petalwidth <= 1.7 AND
petallength <= 4.9: Iris-versicolor (48.0/1.0)

: Iris-virginica (52.0/3.0)

Number of Rules : 3


=== Error on training data ===

Correctly Classified Instances       146            97.3333 %
Incorrectly Classified Instances       4            2.6667 %
Mean absolute error                 0.0338
Root mean squared error                0.1301
Total Number of Instances            150

=== Confusion Matrix ===

a b c <-- classified as
50 0 0 | a = Iris-setosa
0 47 3 | b = Iris-versicolor
0 1 49 | c = Iris-virginica

=== Stratified cross-validation ===

Correctly Classified Instances       142            94.6667 %
Incorrectly Classified Instances       8            5.3333 %
Mean absolute error                 0.0454
Root mean squared error                0.1805
Total Number of Instances            150

=== Confusion Matrix ===

a b c <-- classified as
49 1 0 | a = Iris-setosa
0 47 3 | b = Iris-versicolor
0 4 46 | c = Iris-virginica





Naïve Bayes Classifier Algorithm:

> java weka.classifiers.NaiveBayes -t data/iris.arff


Naive Bayes Classifier

Class Iris-setosa: Prior probability = 0.33

sepallength: Normal Distribution. Mean = 4.9913 StandardDev = 0.355
WeightSum = 50 Precision = 0.10588235294117648
sepalwidth: Normal Distribution. Mean = 3.4015 StandardDev = 0.3925
WeightSum = 50 Precision = 0.10909090909090911
petallength: Normal Distribution. Mean = 1.4694 StandardDev = 0.1782
WeightSum = 50 Precision = 0.14047619047619048
petalwidth: Normal Distribution. Mean = 0.2743 StandardDev = 0.1096
WeightSum = 50 Precision = 0.11428571428571428


Class Iris-versicolor: Prior probability = 0.33

sepallength: Normal Distribution. Mean = 5.9379 StandardDev = 0.5042
WeightSum = 50 Precision = 0.10588235294117648
sepalwidth: Normal Distribution. Mean = 2.7687 StandardDev = 0.3038
WeightSum = 50 Precision = 0.10909090909090911
petallength: Normal Distribution. Mean = 4.2452 StandardDev = 0.4712
WeightSum = 50 Precision = 0.14047619047619048
petalwidth: Normal Distribution. Mean = 1.3097 StandardDev = 0.1915
WeightSum = 50 Precision = 0.11428571428571428


Class Iris-virginica: Prior probability = 0.33

sepallength: Normal Distribution. Mean = 6.5795 StandardDev = 0.6353
WeightSum = 50 Precision = 0.10588235294117648
sepalwidth: Normal Distribution. Mean = 2.9629 StandardDev = 0.3088
WeightSum = 50 Precision = 0.10909090909090911
petallength: Normal Distribution. Mean = 5.5516 StandardDev = 0.5529
WeightSum = 50 Precision = 0.14047619047619048
petalwidth: Normal Distribution. Mean = 2.0343 StandardDev = 0.2646
WeightSum = 50 Precision = 0.11428571428571428






OneR Classifier Algorithm:

> java weka.classifiers.OneR -t data/iris.arff


petallength:
       < 2.45 -> Iris-setosa
       < 4.75 -> Iris-versicolor
       >= 4.75        -> Iris-virginica
(143/150 instances correct)


=== Error on training data ===

Correctly Classified Instances        143        95.3333 %
Incorrectly Classified Instances        7        4.6667 %
Mean absolute error                  0.0311
Root mean squared error                 0.1764
Total Number of Instances             150

=== Confusion Matrix ===

a b c <-- classified as
50 0 0 | a = Iris-setosa
0 44 6 | b = Iris-versicolor
0 1 49 | c = Iris-virginica

=== Stratified cross-validation ===

Correctly Classified Instances        142        94.6667 %
Incorrectly Classified Instances        8        5.3333 %
Mean absolute error                  0.0356
Root mean squared error                 0.1886
Total Number of Instances             150

=== Confusion Matrix ===

a b c <-- classified as
50 0 0 | a = Iris-setosa
0 44 6 | b = Iris-versicolor
0 2 48 | c = Iris-virginica






Kernel Density Algorithm:

> java weka.classifiers.KernelDensity -t data/iris.arff


Kernel Density Estimator

=== Error on training data ===

Correctly Classified Instances       148           98.6667 %
Incorrectly Classified Instances       2           1.3333 %
Mean absolute error                 0.0313
Root mean squared error                0.0944
Total Number of Instances            150

=== Confusion Matrix ===

a b c <-- classified as
50 0 0 | a = Iris-setosa
0 49 1 | b = Iris-versicolor
0 1 49 | c = Iris-virginica

=== Stratified cross-validation ===

Correctly Classified Instances       144           96  %
Incorrectly Classified Instances       6           4  %
Mean absolute error                 0.0466
Root mean squared error                0.1389
Total Number of Instances            150

=== Confusion Matrix ===

a b c <-- classified as
50 0 0 | a = Iris-setosa
0 48 2 | b = Iris-versicolor
0 4 46 | c = Iris-virginica




                                                               23 of 23

Contenu connexe

En vedette

Chapter01.ppt
Chapter01.pptChapter01.ppt
Chapter01.ppt
butest
 
Microsoft PowerPoint - Chapter01
Microsoft PowerPoint - Chapter01Microsoft PowerPoint - Chapter01
Microsoft PowerPoint - Chapter01
butest
 
Machine Learning 1 - Introduction
Machine Learning 1 - IntroductionMachine Learning 1 - Introduction
Machine Learning 1 - Introduction
butest
 
Slide 1
Slide 1Slide 1
Slide 1
butest
 
final.doc
final.docfinal.doc
final.doc
butest
 
thesis_background.ppt
thesis_background.pptthesis_background.ppt
thesis_background.ppt
butest
 
拿回自己的鑰匙
拿回自己的鑰匙拿回自己的鑰匙
拿回自己的鑰匙
花東宏宣
 
12-Multistrategy-learning.doc
12-Multistrategy-learning.doc12-Multistrategy-learning.doc
12-Multistrategy-learning.doc
butest
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
butest
 
An Analysis of Graph Cut Size for Transductive Learning
An Analysis of Graph Cut Size for Transductive LearningAn Analysis of Graph Cut Size for Transductive Learning
An Analysis of Graph Cut Size for Transductive Learning
butest
 
Emulating Human Essay Scoring With Machine Learning Methods
Emulating Human Essay Scoring With Machine Learning MethodsEmulating Human Essay Scoring With Machine Learning Methods
Emulating Human Essay Scoring With Machine Learning Methods
butest
 
FayinLi_CV_Full.doc
FayinLi_CV_Full.docFayinLi_CV_Full.doc
FayinLi_CV_Full.doc
butest
 
by Warren Jin
by Warren Jin by Warren Jin
by Warren Jin
butest
 
Practical Knowledge Representation
Practical Knowledge RepresentationPractical Knowledge Representation
Practical Knowledge Representation
butest
 

En vedette (20)

Chapter01.ppt
Chapter01.pptChapter01.ppt
Chapter01.ppt
 
Microsoft PowerPoint - Chapter01
Microsoft PowerPoint - Chapter01Microsoft PowerPoint - Chapter01
Microsoft PowerPoint - Chapter01
 
Machine Learning 1 - Introduction
Machine Learning 1 - IntroductionMachine Learning 1 - Introduction
Machine Learning 1 - Introduction
 
Slide 1
Slide 1Slide 1
Slide 1
 
Machine learning Lecture 1
Machine learning Lecture 1Machine learning Lecture 1
Machine learning Lecture 1
 
final.doc
final.docfinal.doc
final.doc
 
thesis_background.ppt
thesis_background.pptthesis_background.ppt
thesis_background.ppt
 
拿回自己的鑰匙
拿回自己的鑰匙拿回自己的鑰匙
拿回自己的鑰匙
 
12-Multistrategy-learning.doc
12-Multistrategy-learning.doc12-Multistrategy-learning.doc
12-Multistrategy-learning.doc
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
 
An Analysis of Graph Cut Size for Transductive Learning
An Analysis of Graph Cut Size for Transductive LearningAn Analysis of Graph Cut Size for Transductive Learning
An Analysis of Graph Cut Size for Transductive Learning
 
Emulating Human Essay Scoring With Machine Learning Methods
Emulating Human Essay Scoring With Machine Learning MethodsEmulating Human Essay Scoring With Machine Learning Methods
Emulating Human Essay Scoring With Machine Learning Methods
 
FayinLi_CV_Full.doc
FayinLi_CV_Full.docFayinLi_CV_Full.doc
FayinLi_CV_Full.doc
 
by Warren Jin
by Warren Jin by Warren Jin
by Warren Jin
 
W.doc
W.docW.doc
W.doc
 
Enterprising Donegal Business Week 2010
Enterprising Donegal Business Week 2010Enterprising Donegal Business Week 2010
Enterprising Donegal Business Week 2010
 
univerteam actualizada 2015
univerteam actualizada 2015univerteam actualizada 2015
univerteam actualizada 2015
 
Practical Knowledge Representation
Practical Knowledge RepresentationPractical Knowledge Representation
Practical Knowledge Representation
 
A
AA
A
 
Availability for Dummies
Availability for DummiesAvailability for Dummies
Availability for Dummies
 

Similaire à MS Word.doc

Open06
Open06Open06
Open06
butest
 
Data mining techniques using weka
Data mining techniques using wekaData mining techniques using weka
Data mining techniques using weka
Prashant Menon
 
Comparison of Top Data Mining(Final)
Comparison of Top Data Mining(Final)Comparison of Top Data Mining(Final)
Comparison of Top Data Mining(Final)
Sanghun Kim
 
Data Mining - Classification Of Breast Cancer Dataset using Decision Tree Ind...
Data Mining - Classification Of Breast Cancer Dataset using Decision Tree Ind...Data Mining - Classification Of Breast Cancer Dataset using Decision Tree Ind...
Data Mining - Classification Of Breast Cancer Dataset using Decision Tree Ind...
Sunil Nair
 
MicroManager_MATLAB_Implementation
MicroManager_MATLAB_ImplementationMicroManager_MATLAB_Implementation
MicroManager_MATLAB_Implementation
Philip Mohun
 

Similaire à MS Word.doc (20)

Weka project - Classification & Association Rule Generation
Weka project - Classification & Association Rule GenerationWeka project - Classification & Association Rule Generation
Weka project - Classification & Association Rule Generation
 
Classification and Prediction Based Data Mining Algorithm in Weka Tool
Classification and Prediction Based Data Mining Algorithm in Weka ToolClassification and Prediction Based Data Mining Algorithm in Weka Tool
Classification and Prediction Based Data Mining Algorithm in Weka Tool
 
BioWeka
BioWekaBioWeka
BioWeka
 
IRJET- Evaluation of Classification Algorithms with Solutions to Class Imbala...
IRJET- Evaluation of Classification Algorithms with Solutions to Class Imbala...IRJET- Evaluation of Classification Algorithms with Solutions to Class Imbala...
IRJET- Evaluation of Classification Algorithms with Solutions to Class Imbala...
 
Open06
Open06Open06
Open06
 
Performance Evaluation: A Comparative Study of Various Classifiers
Performance Evaluation: A Comparative Study of Various ClassifiersPerformance Evaluation: A Comparative Study of Various Classifiers
Performance Evaluation: A Comparative Study of Various Classifiers
 
IRJET - Rainfall Forecasting using Weka Data Mining Tool
IRJET - Rainfall Forecasting using Weka Data Mining ToolIRJET - Rainfall Forecasting using Weka Data Mining Tool
IRJET - Rainfall Forecasting using Weka Data Mining Tool
 
Introduction of Feature Hashing
Introduction of Feature HashingIntroduction of Feature Hashing
Introduction of Feature Hashing
 
Mahout part2
Mahout part2Mahout part2
Mahout part2
 
Data mining techniques using weka
Data mining techniques using wekaData mining techniques using weka
Data mining techniques using weka
 
Wek1
Wek1Wek1
Wek1
 
IRJET- Machine Learning and Deep Learning Methods for Cybersecurity
IRJET- Machine Learning and Deep Learning Methods for CybersecurityIRJET- Machine Learning and Deep Learning Methods for Cybersecurity
IRJET- Machine Learning and Deep Learning Methods for Cybersecurity
 
Comparison of Top Data Mining(Final)
Comparison of Top Data Mining(Final)Comparison of Top Data Mining(Final)
Comparison of Top Data Mining(Final)
 
Analysis of Database Issues using AHF and Machine Learning v2 - AOUG2022
Analysis of Database Issues using AHF and Machine Learning v2 -  AOUG2022Analysis of Database Issues using AHF and Machine Learning v2 -  AOUG2022
Analysis of Database Issues using AHF and Machine Learning v2 - AOUG2022
 
Data Mining - Classification Of Breast Cancer Dataset using Decision Tree Ind...
Data Mining - Classification Of Breast Cancer Dataset using Decision Tree Ind...Data Mining - Classification Of Breast Cancer Dataset using Decision Tree Ind...
Data Mining - Classification Of Breast Cancer Dataset using Decision Tree Ind...
 
Data mining weka
Data mining wekaData mining weka
Data mining weka
 
Analysis of Database Issues using AHF and Machine Learning v2 - SOUG
Analysis of Database Issues using AHF and Machine Learning v2 -  SOUGAnalysis of Database Issues using AHF and Machine Learning v2 -  SOUG
Analysis of Database Issues using AHF and Machine Learning v2 - SOUG
 
Mining attributes
Mining attributesMining attributes
Mining attributes
 
MicroManager_MATLAB_Implementation
MicroManager_MATLAB_ImplementationMicroManager_MATLAB_Implementation
MicroManager_MATLAB_Implementation
 
Driverless Machine Learning Web App
Driverless Machine Learning Web AppDriverless Machine Learning Web App
Driverless Machine Learning Web App
 

Plus de butest

EL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEEL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBE
butest
 
1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同
butest
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
butest
 
Timeline: The Life of Michael Jackson
Timeline: The Life of Michael JacksonTimeline: The Life of Michael Jackson
Timeline: The Life of Michael Jackson
butest
 
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
butest
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
butest
 
Com 380, Summer II
Com 380, Summer IICom 380, Summer II
Com 380, Summer II
butest
 
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet JazzThe MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
butest
 
MICHAEL JACKSON.doc
MICHAEL JACKSON.docMICHAEL JACKSON.doc
MICHAEL JACKSON.doc
butest
 
Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1
butest
 
Facebook
Facebook Facebook
Facebook
butest
 
Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...
butest
 
Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...
butest
 
NEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTNEWS ANNOUNCEMENT
NEWS ANNOUNCEMENT
butest
 
C-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docC-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.doc
butest
 
MAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docMAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.doc
butest
 
Mac OS X Guide.doc
Mac OS X Guide.docMac OS X Guide.doc
Mac OS X Guide.doc
butest
 
WEB DESIGN!
WEB DESIGN!WEB DESIGN!
WEB DESIGN!
butest
 

Plus de butest (20)

EL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEEL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBE
 
1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
 
Timeline: The Life of Michael Jackson
Timeline: The Life of Michael JacksonTimeline: The Life of Michael Jackson
Timeline: The Life of Michael Jackson
 
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
 
Com 380, Summer II
Com 380, Summer IICom 380, Summer II
Com 380, Summer II
 
PPT
PPTPPT
PPT
 
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet JazzThe MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
 
MICHAEL JACKSON.doc
MICHAEL JACKSON.docMICHAEL JACKSON.doc
MICHAEL JACKSON.doc
 
Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1
 
Facebook
Facebook Facebook
Facebook
 
Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...
 
Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...
 
NEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTNEWS ANNOUNCEMENT
NEWS ANNOUNCEMENT
 
C-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docC-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.doc
 
MAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docMAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.doc
 
Mac OS X Guide.doc
Mac OS X Guide.docMac OS X Guide.doc
Mac OS X Guide.doc
 
hier
hierhier
hier
 
WEB DESIGN!
WEB DESIGN!WEB DESIGN!
WEB DESIGN!
 

MS Word.doc

  • 1. WEKA CS 595 Knowledge Discovery and Datamining Assignment # 1 Evaluation Report for WEKA (Waikato Environment for Knowledge Analysis) Presented By: Manoj Wartikar Sameer Sagade Date: th 14 March, 2000. 1 of 23
  • 2. WEKA Weka Machine Learning Project. Machine Learning: An exciting and potentially far-reaching development in contemporary computer science is the invention and application of methods of Machine Learning. These enable a computer program to automatically analyze a large body of data and decide what information is most relevant. This crystallized information can then be used to help people make decision faster and more accurately. One of the central problems of the information age is dealing with the enormous explosion in the amount of raw information that is available. Machine learning (ML) has the potential to sift through this mass of information and convert it into knowledge that people can use. So far, however, it has been used mainly on small problems under well-controlled conditions. The aim of the Weka Project is to bring the technology out of the laboratory and provide solutions that can make a difference to people. The overall goal of this research programme is to build a state-of-the art facility for development of techniques of ML. Objectives: The team at Waikato has incorporated several standard ML techniques into software “Workbench” abbreviated WEKA (Waikato Environment for Knowledge Analysis). With the use of WEKA, a specialist in a particular field is able to use ML and derive useful knowledge from databases that are far too large to be analyzed by hand. The main objectives of WEKA are to • Make Machine Learning (ML) techniques generally available; • Apply them to practical problems as in agriculture; • Develop new machine learning algorithms; • Design a theoretical framework for the field. Documented Features: The WEKA presents a collection of algorithms for solving real-world data mining problems. The software is written in Java 2 and includes a uniform interface to the standard techniques in machine learning. The following techniques in Data mining are implemented in WEKA. 1. Attribute Selection. 2. Clustering. 3. Classifiers (both numeric and non-numeric). 4. Association Rules. 5. Filters. 6. Estimators. 2 of 23
  • 3. WEKA Out of these options, only Classifiers, association rules and Filters are available as direct executables. All the remaining functions are available as API’s. The data required by the software is in the “.Arff” format. Sample databases are also provided with the software. Features: The WEKA package is comprised of a number of classes and inheritances. We have to create an instance of any class to execute it. The functionality of WEKA is classified based on the steps of Machine learning. Classifiers: The Classifiers class prints out a decision tree classifier for the dataset given as input. Also A ten-fold cross-validation estimation of its performance is also calculated. The Classifiers package implements the most common techniques separately for categorical and numerical values a) Classifiers for categorical prediction: 1. Weka.classifiers.IBk K-nearest neighbor learner 2. Weka.classifiers.j48.J48 C4.5 decision trees 3. Weka.classifiers.j48.PART Rule learner 4. Weka.classifiers.NaiveBayes Naive Bayes with/without kernels 5. Weka.classifiers.OneR Holte's oner 6. Weka.classifiers.KernelDensity Kernel density classifier 7. Weka.classifiers.SMO Support vector machines 8. Weka.classifiers.Logistic Logistic regression 9. Weka.classifiers.AdaBoostM1 Adaboost 10. Weka.classifiers.LogitBoost Logit boost 11. Weka.classifiers.DecisionStump Decision stumps (for boosting) 3 of 23
  • 4. WEKA Sample Executions of the various categorical CLASSIFIER Algorithms: K Nearest Neighbour Algorithm: >java weka.classifiers.IBk -t data/iris.arff IB1 instance-based classifier using 1 nearest neighbour(s) for classification === Error on training data === Correctly Classified Instances 150 100 % Incorrectly Classified Instances 0 0 % Mean absolute error 0.0085 Root mean squared error 0.0091 Total Number of Instances 150 === Confusion Matrix === a b c <-- classified as 50 0 0 | a = Iris-setosa 0 50 0 | b = Iris-versicolor 0 0 50 | c = Iris-virginica === Stratified cross-validation === Correctly Classified Instances 144 96 % Incorrectly Classified Instances 6 4 % Mean absolute error 0.0356 Root mean squared error 0.1618 Total Number of Instances 150 === Confusion Matrix === a b c <-- classified as 50 0 0 | a = Iris-setosa 0 47 3 | b = Iris-versicolor 0 3 47 | c = Iris-virginica 4 of 23
  • 5. WEKA J48 Pruned Tree Algorithm: >java weka.classifiers.j48.J48 -t data/iris.arff J48 pruned tree ------------------ petalwidth <= 0.6: Iris-setosa (50.0) petalwidth > 0.6 | petalwidth <= 1.7 | | petallength <= 4.9: Iris-versicolor (48.0/1.0) | | petallength > 4.9 | | | petalwidth <= 1.5: Iris-virginica (3.0) | | | petalwidth > 1.5: Iris-versicolor (3.0/1.0) | petalwidth > 1.7: Iris-virginica (46.0/1.0) Number of Leaves : 5 Size of the tree : 9 === Error on training data === Correctly Classified Instances 147 98 % Incorrectly Classified Instances 3 2 % Mean absolute error 0.0233 Root mean squared error 0.108 Total Number of Instances 150 === Confusion Matrix === a b c <-- classified as 50 0 0 | a = Iris-setosa 0 49 1 | b = Iris-versicolor 0 2 48 | c = Iris-virginica === Stratified cross-validation === Correctly Classified Instances 143 95.3333 % Incorrectly Classified Instances 7 4.6667 % Mean absolute error 0.0391 Root mean squared error 0.1707 Total Number of Instances 150 === Confusion Matrix === a b c <-- classified as 49 1 0 | a = Iris-setosa 0 47 3 | b = Iris-versicolor 5 of 23
  • 6. WEKA === Error on training data === Correctly Classified Instances 144 96 % Incorrectly Classified Instances 6 4 % Mean absolute error 0.0324 Root mean squared error 0.1495 50 0 0 | a = Iris-setosa 0 48 2 | b = Iris-versicolor 0 4 46 | c = Iris-virginica SMO (support vector machines) and logistic regression algorithms can handle only two class data sets so are not evaluated. AdaBoost, Logit Boost,Decision Stump are algorithms which boost the performance of the two classifier algorithms. The boosted algorithms are run inside these booster algorithms. These booster algorithms monitor the execution and applies appropriate boosting patches to the them. 6 of 23
  • 7. WEKA b) Classifiers for numerical prediction: 1. weka.classifiers.LinearRegression Linear regression 2. weka.classifiers.m5.M5Prime Model trees 3. weka.classifiers.Ibk K-nearest neighbor learner 4. weka.classifiers.LWR Locally weighted regression 5. weka.classifiers.RegressionByDiscretization Uses categorical classifiers Sample Executions of the various categorical CLASSIFIER Algorithms: Linear Regression Model: > java weka.classifiers.LinearRegression -t data/cpu.arff Linear Regression Model class = -152.7641 * vendor=microdata,formation,prime,harris,dec,wang,perkin- elmer,nixdorf,bti,sratus,dg,burroughs,cambex,magnuson,honeywell,ipl,ibm,cdc,n cr,basf,gould,siemens,nas,adviser,sperry,amdahl + 141.8644 * vendor=formation,prime,harris,dec,wang,perkin- elmer,nixdorf,bti,sratus,dg,burroughs,cambex,magnuson,honeywell,ipl,ibm,cdc,n cr,basf,gould,siemens,nas,adviser,sperry,amdahl + -38.2268 * vendor=burroughs,cambex,magnuson,honeywell,ipl,ibm,cdc,ncr,basf,gould,siem ens,nas,adviser,sperry,amdahl + 39.4748 * vendor=cambex,magnuson,honeywell,ipl,ibm,cdc,ncr,basf,gould,siemens,nas,ad viser,sperry,amdahl + -39.5986 * vendor=honeywell,ipl,ibm,cdc,ncr,basf,gould,siemens,nas,adviser,sperry,amdahl + 21.4119 * vendor=ipl,ibm,cdc,ncr,basf,gould,siemens,nas,adviser,sperry,amdahl + -41.2396 * vendor=gould,siemens,nas,adviser,sperry,amdahl + 32.0545 * vendor=siemens,nas,adviser,sperry,amdahl + -113.6927 * vendor=adviser,sperry,amdahl + 176.5204 * vendor=sperry,amdahl + -51.2583 * vendor=amdahl + 0.0616 * MYCT + 0.0171 * MMIN + 0.0054 * MMAX + 0.6654 * CACH + -1.4159 * CHMIN + 1.5538 * CHMAX + 7 of 23
  • 8. WEKA -41.4854 === Error on training data === Correlation coefficient 0.963 Mean absolute error 28.4042 Root mean squared error 41.6084 Relative absolute error 32.5055 % Root relative squared error 26.9508 % Total Number of Instances 209 === Cross-validation === Correlation coefficient 0.9328 Mean absolute error 35.014 Root mean squared error 55.6291 Relative absolute error 39.9885 % Root relative squared error 35.9513 % Total Number of Instances 209 8 of 23
  • 9. WEKA Pruned Training Model Tree: > java weka.classifiers.m5.M5Prime -t data/cpu.arff Pruned training model tree: MMAX <= 14000 : LM1 (141/4.18%) MMAX > 14000 : LM2 (68/51.8%) Models at the leaves: Smoothed (complex): LM1: class = 4.15 - 2.05vendor=honeywell,ipl,ibm,cdc,ncr,basf,gould,siemens,nas,adviser,sperry,am dahl + 5.43vendor=adviser,sperry,amdahl - 5.78vendor=amdahl + 0.00638MYCT + 0.00158MMIN + 0.00345MMAX + 0.552CACH + 1.14CHMIN + 0.0945CHMAX LM2: class = -113 - 56.1vendor=honeywell,ipl,ibm,cdc,ncr,basf,gould,siemens,nas,adviser,sperry,am dahl + 10.2vendor=adviser,sperry,amdahl - 10.9vendor=amdahl + 0.012MYCT + 0.0145MMIN + 0.0089MMAX + 0.808CACH + 1.29CHMAX Number of Leaves : 2 === Error on training data === Correlation coefficient 0.9853 Mean absolute error 13.4072 Root mean squared error 26.3977 Relative absolute error 15.3431 % Root relative squared error 17.0985 % Total Number of Instances 209 === Cross-validation === Correlation coefficient 0.9767 Mean absolute error 13.1239 Root mean squared error 33.4455 Relative absolute error 14.9884 % 9 of 23
  • 10. WEKA Root relative squared error 21.6147 % Total Number of Instances 209 10 of 23
  • 11. WEKA K Nearest Neighbour classifier Algorithm: > java weka.classifiers.IBk -t data/cpu.arff IB1 instance-based classifier using 1 nearest neighbour(s) for classification === Error on training data === Correlation coefficient 1 Mean absolute error 0 Root mean squared error 0 Relative absolute error 0 % Root relative squared error 0 % Total Number of Instances 209 === Cross-validation === Correlation coefficient 0.9475 Mean absolute error 20.8589 Root mean squared error 53.8162 Relative absolute error 23.8223 % Root relative squared error 34.7797 % Total Number of Instances 209 11 of 23
  • 12. WEKA Locally Weighted Regression: > java weka.classifiers.LWR -t data/cpu.arff Locally weighted regression =========================== Using linear weighting kernels Using all neighbours === Error on training data === Correlation coefficient 0.9967 Mean absolute error 8.9683 Root mean squared error 12.6133 Relative absolute error 10.2633 % Root relative squared error 8.1699 % Total Number of Instances 209 === Cross-validation === Correlation coefficient 0.9808 Mean absolute error 14.9006 Root mean squared error 31.0836 Relative absolute error 17.0176 % Root relative squared error 20.0884 % Total Number of Instances 209 12 of 23
  • 13. WEKA Regression by Descretization: > java weka.classifiers.RegressionByDiscretization -t data/cpu.arff -W weka.classifiers.Ibk // Sub classifier is selected by categorical classification Regression by discretization Class attribute discretized into 10 values Subclassifier: weka.classifiers.Ibk IB1 instance-based classifier using 1 nearest neighbour(s) for classification === Error on training data === Correlation coefficient 0.9783 Mean absolute error 32.0353 Root mean squared error 35.6977 Relative absolute error 36.6609 % Root relative squared error 23.1223 % Total Number of Instances 209 === Cross-validation === Correlation coefficient 0.9244 Mean absolute error 41.5572 Root mean squared error 64.7253 Relative absolute error 47.4612 % Root relative squared error 41.8299 % Total Number of Instances 209 13 of 23
  • 14. WEKA Association rules: Association rule mining finds interesting association or correlation relationships among a large set of data items. With massive amounts of data continuously being collected and stored in databases, many industries are becoming interested in mining association rules from their databases. For example, the discovery of interesting association relationships among huge amounts of business transaction records can help catalog design, cross marketing, loss-leader analysis, and other business decision making processes. A typical example of association rule mining is market basket analysis. This process analyzes customer-buying habits by finding associations between the different items that customer’s place in their “shopping baskets". The discovery of such associations can help retailers develop marketing strategies by gaining insight into which items are frequently purchased together by customers. For instance, if customers are buying milk, how likely are they to also buy bread (and what kind of bread) on the same trip to the supermarket? Such information can lead to increased sales. The WEKA software efficiently produces association rules for the given data set. The Apriori algorithm is used as the foundation of the package. It gives all the itemsets and the subsequent frequent sets for the specified minimal support and confidence. A typical output of the Association package is : Apriori Principle: > java weka.associations.Apriori -t data/weather.nominal.arff -I yes Apriori ======= Minimum support: 0.2 Minimum confidence: 0.9 Number of cycles performed: 17 Generated sets of large itemsets: Size of set of large itemsets L(1): 12 Large Itemsets L(1): outlook=sunny 5 outlook=overcast 4 outlook=rainy 5 temperature=hot 4 temperature=mild 6 14 of 23
  • 15. WEKA temperature=cool 4 humidity=high 7 humidity=normal 7 windy=TRUE 6 windy=FALSE 8 play=yes 9 play=no 5 Size of set of large itemsets L(2): 47 Large Itemsets L(2): outlook=sunny temperature=hot 2 outlook=sunny temperature=mild 2 outlook=sunny humidity=high 3 outlook=sunny humidity=normal 2 outlook=sunny windy=TRUE 2 outlook=sunny windy=FALSE 3 outlook=sunny play=yes 2 outlook=sunny play=no 3 outlook=overcast temperature=hot 2 outlook=overcast humidity=high 2 outlook=overcast humidity=normal 2 outlook=overcast windy=TRUE 2 outlook=overcast windy=FALSE 2 outlook=overcast play=yes 4 outlook=rainy temperature=mild 3 outlook=rainy temperature=cool 2 outlook=rainy humidity=high 2 outlook=rainy humidity=normal 3 outlook=rainy windy=TRUE 2 outlook=rainy windy=FALSE 3 outlook=rainy play=yes 3 outlook=rainy play=no 2 temperature=hot humidity=high 3 temperature=hot windy=FALSE 3 temperature=hot play=yes 2 temperature=hot play=no 2 temperature=mild humidity=high 4 temperature=mild humidity=normal 2 temperature=mild windy=TRUE 3 temperature=mild windy=FALSE 3 temperature=mild play=yes 4 temperature=mild play=no 2 temperature=cool humidity=normal 4 temperature=cool windy=TRUE 2 temperature=cool windy=FALSE 2 15 of 23
  • 16. WEKA temperature=cool play=yes 3 humidity=high windy=TRUE 3 humidity=high windy=FALSE 4 humidity=high play=yes 3 humidity=high play=no 4 humidity=normal windy=TRUE 3 humidity=normal windy=FALSE 4 humidity=normal play=yes 6 windy=TRUE play=yes 3 windy=TRUE play=no 3 windy=FALSE play=yes 6 windy=FALSE play=no 2 Size of set of large itemsets L(3): 39 Large Itemsets L(3): outlook=sunny temperature=hot humidity=high 2 outlook=sunny temperature=hot play=no 2 outlook=sunny humidity=high windy=FALSE 2 outlook=sunny humidity=high play=no 3 outlook=sunny humidity=normal play=yes 2 outlook=sunny windy=FALSE play=no 2 outlook=overcast temperature=hot windy=FALSE 2 outlook=overcast temperature=hot play=yes 2 outlook=overcast humidity=high play=yes 2 outlook=overcast humidity=normal play=yes 2 outlook=overcast windy=TRUE play=yes 2 outlook=overcast windy=FALSE play=yes 2 outlook=rainy temperature=mild humidity=high 2 outlook=rainy temperature=mild windy=FALSE 2 outlook=rainy temperature=mild play=yes 2 outlook=rainy temperature=cool humidity=normal 2 outlook=rainy humidity=normal windy=FALSE 2 outlook=rainy humidity=normal play=yes 2 outlook=rainy windy=TRUE play=no 2 outlook=rainy windy=FALSE play=yes 3 temperature=hot humidity=high windy=FALSE 2 temperature=hot humidity=high play=no 2 temperature=hot windy=FALSE play=yes 2 temperature=mild humidity=high windy=TRUE 2 temperature=mild humidity=high windy=FALSE 2 temperature=mild humidity=high play=yes 2 temperature=mild humidity=high play=no 2 temperature=mild humidity=normal play=yes 2 temperature=mild windy=TRUE play=yes 2 temperature=mild windy=FALSE play=yes 2 16 of 23
temperature=cool humidity=normal windy=TRUE 2
temperature=cool humidity=normal windy=FALSE 2
temperature=cool humidity=normal play=yes 3
temperature=cool windy=FALSE play=yes 2
humidity=high windy=TRUE play=no 2
humidity=high windy=FALSE play=yes 2
humidity=high windy=FALSE play=no 2
humidity=normal windy=TRUE play=yes 2
humidity=normal windy=FALSE play=yes 4

Size of set of large itemsets L(4): 6

Large Itemsets L(4):
outlook=sunny temperature=hot humidity=high play=no 2
outlook=sunny humidity=high windy=FALSE play=no 2
outlook=overcast temperature=hot windy=FALSE play=yes 2
outlook=rainy temperature=mild windy=FALSE play=yes 2
outlook=rainy humidity=normal windy=FALSE play=yes 2
temperature=cool humidity=normal windy=FALSE play=yes 2

Best rules found:

 1. humidity=normal windy=FALSE 4 ==> play=yes 4 (1)
 2. temperature=cool 4 ==> humidity=normal 4 (1)
 3. outlook=overcast 4 ==> play=yes 4 (1)
 4. temperature=cool play=yes 3 ==> humidity=normal 3 (1)
 5. outlook=rainy windy=FALSE 3 ==> play=yes 3 (1)
 6. outlook=rainy play=yes 3 ==> windy=FALSE 3 (1)
 7. outlook=sunny humidity=high 3 ==> play=no 3 (1)
 8. outlook=sunny play=no 3 ==> humidity=high 3 (1)
 9. temperature=cool windy=FALSE 2 ==> humidity=normal play=yes 2 (1)
10. temperature=cool humidity=normal windy=FALSE 2 ==> play=yes 2 (1)
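       The same rules can be produced from a Java program instead of the command line. The following is a minimal sketch, assuming the version 3.0.1 class interfaces: the Instances(Reader) constructor and the buildAssociations() method are taken from the documented API, the file path refers to the sample database shipped with the package, and the wrapper class AprioriDemo is our own illustrative name.

import java.io.BufferedReader;
import java.io.FileReader;

import weka.associations.Apriori;
import weka.core.Instances;

public class AprioriDemo {
    public static void main(String[] args) throws Exception {
        // Load the nominal weather sample database shipped with WEKA.
        Instances data = new Instances(
                new BufferedReader(new FileReader("data/weather.nominal.arff")));

        // Build the association rules with the default minimum support
        // (0.2) and confidence (0.9); printing the object produces a
        // report of the same form as the command-line output above.
        Apriori apriori = new Apriori();
        apriori.buildAssociations(data);
        System.out.println(apriori);
    }
}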
Advantages, Disadvantages and Future Upgrades:
       The WEKA system covers the entire machine learning (knowledge discovery) process. Although a research project, the WEKA system has implemented and evaluated a number of different algorithms for the different steps in the machine learning process. The output and the information provided by the package are sufficient for an expert in machine learning and related topics. The results displayed by the system give a detailed description of the flow and the steps involved in the entire machine learning process. The outputs produced by the different algorithms are easy to compare, which makes analysis easier. The ARFF dataset format is one of the most widely used data storage formats for research databases, which makes the system well suited to research-oriented projects. The package also provides a number of application program interfaces (APIs) that help novice data miners build their own systems on top of the "core WEKA system". Since the system provides a number of switches and options, the output can be customized to suit particular needs.

       The first major disadvantage is that the system is Java based and requires a Java Virtual Machine to be installed for its execution. Since the system is driven entirely by command-line parameters and switches, it is difficult for an amateur to use efficiently. The textual interface and output make the results all the more difficult to interpret and visualize; important results such as pruned trees and other hierarchy-based outputs cannot be displayed graphically. Although it is a commonly used dataset format, ARFF is the only format that the WEKA system supports (a sample ARFF file is sketched below). While the current version, 3.0.1, still has these bugs and disadvantages, the developers are working on a better system and have come up with a new version with a graphical user interface, making the system more complete.
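       For reference, the ARFF format itself is simple: a relation declaration, one attribute declaration per column, and the data as comma-separated rows. The following is a hand-reconstructed sketch of the weather.nominal.arff sample database used in the Apriori run above; the attribute values and their counts agree with the L(1) itemsets in that output, though the exact relation name and ordering in the shipped file may differ.

@relation weather.symbolic

@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no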
Appendix (Sample executions for other algorithms covered)
PART Decision List Algorithm:

> java weka.classifiers.j48.PART -t data/iris.arff

PART decision list
------------------

petalwidth <= 0.6: Iris-setosa (50.0)

petalwidth <= 1.7 AND
petallength <= 4.9: Iris-versicolor (48.0/1.0)

: Iris-virginica (52.0/3.0)

Number of Rules : 3

=== Error on training data ===

Correctly Classified Instances       146           97.3333 %
Incorrectly Classified Instances       4            2.6667 %
Mean absolute error                    0.0338
Root mean squared error                0.1301
Total Number of Instances            150

=== Confusion Matrix ===

  a  b  c   <-- classified as
 50  0  0 |  a = Iris-setosa
  0 47  3 |  b = Iris-versicolor
  0  1 49 |  c = Iris-virginica

=== Stratified cross-validation ===

Correctly Classified Instances       142           94.6667 %
Incorrectly Classified Instances       8            5.3333 %
Mean absolute error                    0.0454
Root mean squared error                0.1805
Total Number of Instances            150

=== Confusion Matrix ===

  a  b  c   <-- classified as
 49  1  0 |  a = Iris-setosa
  0 47  3 |  b = Iris-versicolor
  0  4 46 |  c = Iris-virginica
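       The same report can also be generated programmatically. A minimal sketch, assuming the static Evaluation.evaluateModel(Classifier, String[]) entry point that the command line itself delegates to; the class name PartDemo is our own illustrative name.

import weka.classifiers.Evaluation;
import weka.classifiers.j48.PART;

public class PartDemo {
    public static void main(String[] args) throws Exception {
        // Train and evaluate PART on the iris sample database; the
        // returned string is the same report printed by the command line.
        String report = Evaluation.evaluateModel(
                new PART(), new String[] { "-t", "data/iris.arff" });
        System.out.println(report);
    }
}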
Naïve Bayes Classifier Algorithm:

> java weka.classifiers.NaiveBayes -t data/iris.arff

Naive Bayes Classifier

Class Iris-setosa: Prior probability = 0.33

sepallength:  Normal Distribution. Mean = 4.9913 StandardDev = 0.355  WeightSum = 50 Precision = 0.10588235294117648
sepalwidth:   Normal Distribution. Mean = 3.4015 StandardDev = 0.3925 WeightSum = 50 Precision = 0.10909090909090911
petallength:  Normal Distribution. Mean = 1.4694 StandardDev = 0.1782 WeightSum = 50 Precision = 0.14047619047619048
petalwidth:   Normal Distribution. Mean = 0.2743 StandardDev = 0.1096 WeightSum = 50 Precision = 0.11428571428571428

Class Iris-versicolor: Prior probability = 0.33

sepallength:  Normal Distribution. Mean = 5.9379 StandardDev = 0.5042 WeightSum = 50 Precision = 0.10588235294117648
sepalwidth:   Normal Distribution. Mean = 2.7687 StandardDev = 0.3038 WeightSum = 50 Precision = 0.10909090909090911
petallength:  Normal Distribution. Mean = 4.2452 StandardDev = 0.4712 WeightSum = 50 Precision = 0.14047619047619048
petalwidth:   Normal Distribution. Mean = 1.3097 StandardDev = 0.1915 WeightSum = 50 Precision = 0.11428571428571428

Class Iris-virginica: Prior probability = 0.33

sepallength:  Normal Distribution. Mean = 6.5795 StandardDev = 0.6353 WeightSum = 50 Precision = 0.10588235294117648
sepalwidth:   Normal Distribution. Mean = 2.9629 StandardDev = 0.3088 WeightSum = 50 Precision = 0.10909090909090911
petallength:  Normal Distribution. Mean = 5.5516 StandardDev = 0.5529 WeightSum = 50 Precision = 0.14047619047619048
petalwidth:   Normal Distribution. Mean = 2.0343 StandardDev = 0.2646 WeightSum = 50 Precision = 0.11428571428571428
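       Each line in the report defines a normal density; to score a class for a new instance, the classifier multiplies the class prior by the density of each attribute value and predicts the class with the largest product. As an illustration (not WEKA's exact code, which also rounds values to the reported precision), the density of petalwidth = 0.2 under Iris-setosa follows directly from the reported mean and standard deviation:

public class GaussianDensity {
    // Normal density N(x; mean, sd), as used for each numeric attribute.
    static double density(double x, double mean, double sd) {
        double z = (x - mean) / sd;
        return Math.exp(-0.5 * z * z) / (sd * Math.sqrt(2.0 * Math.PI));
    }

    public static void main(String[] args) {
        // Mean and StandardDev for petalwidth under Iris-setosa, as
        // reported above; the class score is the prior (0.33) times the
        // densities of all four attributes, and the largest score wins.
        System.out.println(density(0.2, 0.2743, 0.1096));
    }
}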
OneR Classifier Algorithm:

> java weka.classifiers.OneR -t data/iris.arff

petallength:
    < 2.45  -> Iris-setosa
    < 4.75  -> Iris-versicolor
    >= 4.75 -> Iris-virginica
(143/150 instances correct)

=== Error on training data ===

Correctly Classified Instances       143           95.3333 %
Incorrectly Classified Instances       7            4.6667 %
Mean absolute error                    0.0311
Root mean squared error                0.1764
Total Number of Instances            150

=== Confusion Matrix ===

  a  b  c   <-- classified as
 50  0  0 |  a = Iris-setosa
  0 44  6 |  b = Iris-versicolor
  0  1 49 |  c = Iris-virginica

=== Stratified cross-validation ===

Correctly Classified Instances       142           94.6667 %
Incorrectly Classified Instances       8            5.3333 %
Mean absolute error                    0.0356
Root mean squared error                0.1886
Total Number of Instances            150

=== Confusion Matrix ===

  a  b  c   <-- classified as
 50  0  0 |  a = Iris-setosa
  0 44  6 |  b = Iris-versicolor
  0  2 48 |  c = Iris-virginica
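       OneR's idea is simple: for each attribute, build a one-level rule that maps each attribute value (or, for numeric attributes such as petallength above, each discretized interval) to the majority class of the instances having that value, then keep only the attribute whose rule makes the fewest errors on the training data. The following is a minimal sketch of the nominal case, written as a hypothetical helper rather than WEKA's actual implementation:

import java.util.HashMap;
import java.util.Map;

public class OneRSketch {
    // Return the index of the attribute whose one-level rule (each value
    // -> its majority class) misclassifies the fewest training instances.
    static int bestAttribute(String[][] values, String[] classes) {
        int best = -1;
        int bestErrors = Integer.MAX_VALUE;
        for (int a = 0; a < values[0].length; a++) {
            // Class frequency table for each value of attribute a.
            Map<String, Map<String, Integer>> counts = new HashMap<>();
            for (int i = 0; i < values.length; i++) {
                counts.computeIfAbsent(values[i][a], k -> new HashMap<>())
                      .merge(classes[i], 1, Integer::sum);
            }
            // Instances not in their value's majority class are errors.
            int errors = 0;
            for (Map<String, Integer> perValue : counts.values()) {
                int total = 0, max = 0;
                for (int c : perValue.values()) {
                    total += c;
                    max = Math.max(max, c);
                }
                errors += total - max;
            }
            if (errors < bestErrors) {
                bestErrors = errors;
                best = a;
            }
        }
        return best;
    }
}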
Kernel Density Algorithm:

> java weka.classifiers.KernelDensity -t data/iris.arff

Kernel Density Estimator

=== Error on training data ===

Correctly Classified Instances       148           98.6667 %
Incorrectly Classified Instances       2            1.3333 %
Mean absolute error                    0.0313
Root mean squared error                0.0944
Total Number of Instances            150

=== Confusion Matrix ===

  a  b  c   <-- classified as
 50  0  0 |  a = Iris-setosa
  0 49  1 |  b = Iris-versicolor
  0  1 49 |  c = Iris-virginica

=== Stratified cross-validation ===

Correctly Classified Instances       144           96      %
Incorrectly Classified Instances       6            4      %
Mean absolute error                    0.0466
Root mean squared error                0.1389
Total Number of Instances            150

=== Confusion Matrix ===

  a  b  c   <-- classified as
 50  0  0 |  a = Iris-setosa
  0 48  2 |  b = Iris-versicolor
  0  4 46 |  c = Iris-virginica
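       The Kernel Density classifier produces no human-readable model, which is why only the evaluation statistics are printed. Conceptually, it scores each class by placing a kernel function on every training instance of that class and averaging, and it predicts the class with the highest score at the query point. The following is a minimal sketch with a Gaussian kernel and a single fixed bandwidth; WEKA's own kernel and bandwidth choices may differ, and the class name is our own illustrative name.

public class KernelDensitySketch {
    // Average of Gaussian kernels centred on one class's training points;
    // the predicted class is the one with the highest score at x.
    static double classScore(double[][] classPoints, double[] x, double h) {
        double sum = 0.0;
        for (double[] p : classPoints) {
            double d2 = 0.0;
            for (int i = 0; i < x.length; i++) {
                double diff = (x[i] - p[i]) / h; // h is the bandwidth
                d2 += diff * diff;
            }
            sum += Math.exp(-0.5 * d2);
        }
        return sum / classPoints.length;
    }
}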