SlideShare une entreprise Scribd logo
1  sur  6
Télécharger pour lire hors ligne
CMP: Data Mining and Statistics within the Health Services                                                                                                                                        19/02/2010




        Data Mining and Statistics
                                                                                                               Content
        Within the Health Services
                                                                                                                    1.      Introduction to Weka
                                     Tutorial for Weka                                                              2.      Data Mining Functions and Tools
                                                                                                                    3.      Data Format
                                             a data mining tool                                                     4.      Hands-on Demos
                                                                                                                         4.1 Weka Explorer
                                              Dr. Wenjia Wang                                                            • Classification
                                                                                                                         • Attribute( feature) Selection
                                        School of Computing Sciences                                                     4.2 Weka Experimenter
                                          University of East Anglia                                                      4.3 Weka KnowledgeFlow
                                                                                                                    5. Summary
            Data                  Pre-processing                 Data Mining                   Knowledge


       Data Mining & Statistics within the Health Services
                                                                                                               Data Mining & Statistics within the Health Services   Weka Tutorial (Dr. Wenjia Wang)      2




        1. Introduction to WEKA                                                                                Weka Main Features
        • A collection of open source of many data                                                              • 49 data preprocessing tools
          mining and machine learning algorithms,                                                               • 76 classification/regression algorithms
          including                                                                                             • 8 clustering algorithms
              – pre-processing on data                                                                          • 15 attribute/subset evaluators + 10 search
              – Classification:                                                                                   algorithms for feature selection.
              – clustering                                                                                      • 3 algorithms for finding association rules
              – association rule extraction                                                                     • 3 graphical user interfaces
                                                                                                                      – “The Explorer” (exploratory data analysis)
        • Created by researchers at the University of
                                                                                                                      – “The Experimenter” (experimental environment)
          Waikato in New Zealand                                                                                      – “The KnowledgeFlow” (new process model inspired
        • Java based (also open source).                                                                                interface)
       Data Mining & Statistics within the Health Services   Weka Tutorial (Dr. Wenjia Wang)               3   Data Mining & Statistics within the Health Services   Weka Tutorial (Dr. Wenjia Wang)      4




        Weka: Download and Installation                                                                        Start the Weka

        • Download Weka (the stable version) from                                                              • From windows desktop,
               http://www.cs.waikato.ac.nz/ml/weka/                                                                   – click “Start”, choose “All programs”,
               – Choose a self-extracting executable (including Java VM)                                              – Choose “Weka 3.6” to start Weka
                                                                                                                      – Then the first interface
               – (If you are interested in modifying/extending weka there                                               window appears:
                 is a developer version that includes the source code)
                                                                                                                         Weka GUI Chooser.
        • After download is completed, run the self-
          extracting file to install Weka, and use the default
          set-ups.

       Data Mining & Statistics within the Health Services   Weka Tutorial (Dr. Wenjia Wang)               5   Data Mining & Statistics within the Health Services   Weka Tutorial (Dr. Wenjia Wang)      6




Dr. Wenjia Wang: Tutorial for DM tool Weka                                                                                                                                                                    1
CMP: Data Mining and Statistics within the Health Services                                                                                                                             19/02/2010




       WEKA Application Interfaces                                                                   Weka Application Interfaces
                                                                                                    • Explorer
                                                                                                        – preprocessing, attribute selection, learning, visualiation
                                                                                                    • Experimenter
                                                                                                        – testing and evaluating machine learning algorithms
                                                                                                    • Knowledge Flow
                                                                                                        – visual design of KDD process
                                                                                                        – Explorer
                                                                                                    • Simple Command-line
                                                                                                        – A simple interface for typing commands


       Data Mining & Statistics within the Health Services   Weka Tutorial (Dr. Wenjia Wang)    7   Data Mining & Statistics within the Health Services   Weka Tutorial (Dr. Wenjia Wang)       8




                                                                                                     Load data file and
        2. Weka Functions and Tools
                                                                                                     Preprocessing
        •    Preprocessing Filters                                                                   • Load data file in formats: ARFF, CSV, C4.5,
                                                                                                       binary
        •    Attribute selection
                                                                                                     • Import from URL or SQL database (using JDBC)
        •    Classification/Regression
                                                                                                     • Preprocessing filters
        •    Clustering                                                                                    –    Adding/removing attributes
        •    Association discovery                                                                         –    Attribute value substitution
                                                                                                           –    Discretization
        •    Visualization
                                                                                                           –    Time series filters (delta, shift)
                                                                                                           –    Sampling, randomization
                                                                                                           –    Missing value management
                                                                                                           –    Normalization and other numeric transformations
       Data Mining & Statistics within the Health Services   Weka Tutorial (Dr. Wenjia Wang)    9   Data Mining & Statistics within the Health Services   Weka Tutorial (Dr. Wenjia Wang)      10




        Feature Selection                                                                            Classification
       • Very flexible: arbitrary combination of search and                                         • Predicted target must be categorical
         evaluation methods                                                                         • Implemented methods
       • Search methods                                                                                  –    decision trees(J48, etc.) and rules
            – best-first                                                                                 –    Naïve Bayes
            – genetic                                                                                    –    neural networks
            – ranking ...
                                                                                                         –    instance-based classifiers …
       • Evaluation measures
                                                                                                    • Evaluation methods
            – ReliefF
                                                                                                         – test data set
            – information gain
            – gain ratio                                                                                 – crossvalidation
       • Demo data: weather_nominal.arff                                                            • Demo data: iris, contact lenses, labor, soybeans,
                                                                                                      etc.
       Data Mining & Statistics within the Health Services   Weka Tutorial (Dr. Wenjia Wang)   11   Data Mining & Statistics within the Health Services   Weka Tutorial (Dr. Wenjia Wang)      12




Dr. Wenjia Wang: Tutorial for DM tool Weka                                                                                                                                                          2
CMP: Data Mining and Statistics within the Health Services                                                                                                                                  19/02/2010




        Clustering                                                                                    Regression
       • Implemented methods
           –    k-Means                                                                               • Predicted target is continuous
           –    EM                                                                                    • Methods
           –    Cobweb
           –    X-means                                                                                     – linear regression
           –    FarthestFirst…                                                                              – neural networks
       • Clusters can be visualized and compared to “true”                                                  – regression trees …
         clusters (if given)
       • Demo data:                                                                                   • Demo data: cpu.arff,
           – any classification data may be used for clustering when
             its class attribute is filtered out.



       Data Mining & Statistics within the Health Services   Weka Tutorial (Dr. Wenjia Wang)   13    Data Mining & Statistics within the Health Services       Weka Tutorial (Dr. Wenjia Wang)          14




        Weka: Pros and cons                                                                           3. WEKA data formats
        • pros                                                                                        • Data can be imported from a file in various
               – Open source,                                                                           formats:
                     • Free                                                                                 – ARFF (Attribute Relation File Format) has two sections:
                     • Extensible                                                                                  • the Header information defines attribute name, type and
                     • Can be integrated into other java packages                                                    relations.
               – GUIs (Graphic User Interfaces)                                                                    • the Data section lists the data records.
                     • Relatively easier to use                                                             – CSV: Comma Separated Values (text file)
               – Features                                                                                   – C4.5: A format used by a decision induction algorithm
                     • Run individual experiment, or                                                          C4.5, requires two separated files
                     • Build KDD phases                                                                            • Name file: defines the names of the attributes
        • Cons                                                                                                     • Date file: lists the records (samples)
               – Lack of proper and adequate documentations                                                 – binary
               – Systems are updated constantly (Kitchen Sink Syndrome)                               • Data can also be read from a URL or from an
                                                                                                        SQL database (using JDBC)
       Data Mining & Statistics within the Health Services   Weka Tutorial (Dr. Wenjia Wang)   15    Data Mining & Statistics within the Health Services       Weka Tutorial (Dr. Wenjia Wang)          16




        Attribute Relation File Format (arff)                                                         Breast Cancer data in ARFF
                                                                                                    % Breast Cancer data*: 286 instances (no-recurrence-events: 201, recurrence-
        An ARFF file consists of two distinct sections:                                               events: 85)
                                                                                                    % Part 1: Definitions of attribute name, types and relations
        • the Header section defines attribute name, type                                           @relation breast-cancer
                                                                                                       @attribute age {'10-19','20-29','30-39','40-49','50-59','60-69','70-79','80-89','90-99'}
                                                                                                       @attribute menopause {'lt40','ge40','premeno'}
          and relations, start with a keyword.                                                         @attribute tumor-size {'0-4','5-9','10-14','15-19','20-24','25-29','30-34','35-39','40-44','45-
                                                                                                       49','50-54','55-59'}
               @Relation <data-name>                                                                   @attribute inv-nodes {'0-2','3-5','6-8','9-11','12-14','15-17','18-20','21-23','24-26','27-29','30-
                                                                                                       32','33-35','36-39'}
               @attribute <attribute-name> <type> or {range}                                           @attribute node-caps {'yes','no'}
                                                                                                       @attribute deg-malig {'1','2','3'}
                                                                                                       @attribute breast {'left','right'}
        • the Data section lists the data records, starts with                                         @attribute breast-quad {'left_up','left_low','right_up','right_low','central'}
                                                                                                       @attribute 'irradiat' {'yes','no'}
               @Data                                                                                   @attribute 'Class' {'no-recurrence-events','recurrence-events'}

               list of data instances                                                               % Part 2: data section
                                                                                                    @data
        • Any line start with % is the comments.                                                       '40-49','premeno','15-19','0-2','yes','3','right','left_up','no','recurrence-events'
                                                                                                       '50-59','ge40','15-19','0-2','no','1','right','central','no','no-recurrence-events'
                                                                                                       '50-59','ge40','35-39','0-2','no','2','left','left_low','no','recurrence-events'
                                                                                                       ……
                                                                                                     * source: http://archive.ics.uci.edu/ml/datasets/Breast+Cancer

       Data Mining & Statistics within the Health Services   Weka Tutorial (Dr. Wenjia Wang)   17    Data Mining & Statistics within the Health Services       Weka Tutorial (Dr. Wenjia Wang)          18




Dr. Wenjia Wang: Tutorial for DM tool Weka                                                                                                                                                                   3
CMP: Data Mining and Statistics within the Health Services                                                                                                                                 19/02/2010




        4.1 WEKA Explorer                                                                               Weka Explorer: open data file
                                                                                                    •       Open
            • Click the Explorer on Weka GUI Chooser                                                        Breast
                                                                                                            Cancer
            • On the Explorer window,                                                                       data
                  – click button “Open File” to open a data file                                    •       Click an
                    from                                                                                    attribute,
                                                                                                            e.g. age,
                         • the folder where your data files stored.                                         then its
                           e.g. Breast Cancer data: breast_cancer.arff                                      distributio
                                                                                                            n will be
                         Or (if you don’t have this data set),                                              displayed
                         • the data folder provided by the weka package:                                    in a
                                                                                                            histogra
                           e.g. C:Program FilesWeka-3-6data                                              m.
                                 using “iris.arff” or “weather_nominal.arff”


       Data Mining & Statistics within the Health Services   Weka Tutorial (Dr. Wenjia Wang)   19       Data Mining & Statistics within the Health Services   Weka Tutorial (Dr. Wenjia Wang)      20




      Weka Explorer: training classifiers                                                           Results

        After loaded a data file, click “Classify”                                                      • Testing
        • Choose a classifier,                                                                            results:
              – Under “Classifier”: click “choose”, then a drop-down                                    • 97 cases
                                                                                                          used in
                menu appears,
                                                                                                          test.
              – Click “trees” and select “J48” – a decision tree                                        Correct:
                algorithm                                                                                66 (68%)
        • Select a test option                                                                          Wrong:
                                                                                                         31 (32%)
              – Select “percentage split”
                     • with default ratio 66% for training and 34% for testing
        • Click “Start” to train and test the classifier.
              – The training and testing information will be displayed
                in classifier output window.
       Data Mining & Statistics within the Health Services   Weka Tutorial (Dr. Wenjia Wang)   21       Data Mining & Statistics within the Health Services   Weka Tutorial (Dr. Wenjia Wang)      22




        Options for results and model                                                                   View the tree
        •    Point to                                                                                   •     Point to
             result                                                                                           result list
             list                                                                                             window,
             window,                                                                                          and right
             and
             right
                                                                                                              click
             click                                                                                            mouse,
             mouse.                                                                                     •     Choose
                                                                                                              “visualiz
        •    A menu                                                                                           e tree”,
             will pop                                                                                         then the
             out to                                                                                           tree will
             show all                                                                                         be
             the                                                                                              displayed
             options                                                                                          in
             availabl                                                                                         another
             e about
             the
                                                                                                              window.
             model.
       Data Mining & Statistics within the Health Services   Weka Tutorial (Dr. Wenjia Wang)   23       Data Mining & Statistics within the Health Services   Weka Tutorial (Dr. Wenjia Wang)      24




Dr. Wenjia Wang: Tutorial for DM tool Weka                                                                                                                                                              4
CMP: Data Mining and Statistics within the Health Services                                                                                                                                  19/02/2010




         View classifier errors                                                                          Save the model and results
         •    right click the                                                                            •     Right
              result list,                                                                                     click on
                                                                                                               the
         •    Choose                                                                                           result
              “visualize                                                                                       list
              classifier
              error”, then a                                                                             •     Choose
              new window will                                                                                  “save
                                                                                                               model”
              be popped out                                                                                    and
              to display the                                                                                   “save
              classifier’s error.                                                                              result
                                                                                                               buffer”
               – Correctly                                                                                     to save
                 predicted                                                                                     the
                                                                                                               classifie
                 cases                                                                                         r and
               – Wrong                                                                                         the
                                                                                                               results
                 cases                                                                                         to the
                                                                                                               disk
                                                                                                               folder.
        Data Mining & Statistics within the Health Services   Weka Tutorial (Dr. Wenjia Wang)   25       Data Mining & Statistics within the Health Services   Weka Tutorial (Dr. Wenjia Wang)      26




         Train a neural net                                                                              View the model’s ROC curve
      Click “Choose”
          to select
                                                                                                         •     Right click
          another                                                                                              the result:
          function,                                                                                            “Multiplaye
      e.g. “Multilayer                                                                                         rPerceptro
          Perceptron”
          - a type of                                                                                          n”
          neural net.                                                                                    •     Choose
      Then click “Start”                                                                                       “visualize
          to train and
          test it. (note:                                                                                      threshold
          the training                                                                                         curve” and
          may take
          much longer
                                                                                                               “recurrent
          time.)                                                                                               events”;
                                                                                                         •     The ROC
      The results
         seem better
                                                                                                               curve will
         than the tree                                                                                         be
         classifier.                                                                                           displayed.
        Data Mining & Statistics within the Health Services   Weka Tutorial (Dr. Wenjia Wang)   27       Data Mining & Statistics within the Health Services   Weka Tutorial (Dr. Wenjia Wang)      28




         Select Attributes                                                                               4.2 Weka Experimenter
       • Click “Select                                                                               •  you can use Experimenter
         Attributes”                                                                                    to carry out experiments
                                                                                                        for multiple data sets
       • Choose an                                                                                      using multiple methods,
         “attribute                                                                                  e.g. classifying
         evaluator”                                                                                  • two data sets
             – e.g. chiSquare                                                                                – Breast cancer
       • Choose a                                                                                            – Iris
         “Search                                                                                     •    Using two methods
         Method”                                                                                             – Decision Tree: J48
       • Then click                                                                                          – Logistic
         “Start”                                                                                     •    The experiment is “Setup”
                                                                                                          as shown in the
       • The selected                                                                                     screenshot.
         attributes are                                                                              •    Then click “Run”
         listed.
        Data Mining & Statistics within the Health Services   Weka Tutorial (Dr. Wenjia Wang)   29       Data Mining & Statistics within the Health Services   Weka Tutorial (Dr. Wenjia Wang)      30




Dr. Wenjia Wang: Tutorial for DM tool Weka                                                                                                                                                               5
CMP: Data Mining and Statistics within the Health Services                                                                                                                                     19/02/2010




        Analysis of the results                                                                             4.3 KnowledgeFlow
        •  Click                                                                                        • Click KnowledgeFlow on Weka GUI Chooser
           “analysis” to
           analyse the                                                                                  • A new window opened for buidling KDD process.
           results,
        E.g.
           paired t-test
           significance
        • Click
           “Experiment”
        • Configure
           test: choosing
           appropriate
           test and
           parameters
        • Click
           “Perform test”
           and the test
           results are
           listed.
       Data Mining & Statistics within the Health Services     Weka Tutorial (Dr. Wenjia Wang)     31       Data Mining & Statistics within the Health Services   Weka Tutorial (Dr. Wenjia Wang)         32




        Steps for building a KDD                                                                            A KDD process for Breast
        process                                                                                             Cancer
       Major steps for building a process
       1. Adding required nodes
            1) Add nodes
            2) Add a data source node from “DataSources”
                   1) Right click to configure it with a data set
            3)   Add a classAssigner node from “Evaluation” and a CrossValidationFoldmaker node
            4)   Add a classifier, e.g. J48, from Classifiers
            5)   Add a classiferPerformanceEvaluator node from “Evaluation”
            6)   Add a text viewer from “Visualisation”
       2. Connect the nodes
            – Right click “DataSource” node and choose DataSet, then connect it to the
              ClassAssigner node,
            – do the same or similar for connecting between the other nodes.
       3. Run the process (using the default setups for each node)
            – Right click DataSource node and choose “Start loading”, the process should run and
              “Status” window should indicate if the run is correct and completed.
       4. View the results:
            – If the run is correctly completed, right click “Text Viewer” node and choose “Show
              results”, then another window pops out to show the results.



       Data Mining & Statistics within the Health Services     Weka Tutorial (Dr. Wenjia Wang)     33       Data Mining & Statistics within the Health Services   Weka Tutorial (Dr. Wenjia Wang)         34




        Results of the KDD process                                                                          5. Weka Tutorial Summary
                                                                                                        Weka is open source data mining software that offers
        •    right click
             “Text                                                                                      • Some GUI interfaces for data mining
             Viewer”                                                                                            – Explorer
             node and                                                                                           – Experimenter
             choose                                                                                             – KnowledgeFlow
             “Show                                                                                      •      Many functions and tools that include
             results”,                                                                                          – Methods for classification:
             then                                                                                                      decision trees, rule learners, naive Bayes, decision tables, locally weighted
                                                                                                                         regression, SVMs, instance-based learners, logistic regression, multi-layer
             another                                                                                                     perceptron
             window                                                                                             – methods for regression/prediction:
             pops out                                                                                                  linear regression, model tree generators, locally weighted regression, instance-
             to show                                                                                                      based learners, decision tables, multi-layer perceptron
             the                                                                                                – Ensemble schemes
                                                                                                                       • Bagging, boosting, stacking, RandomFrest
             results.
                                                                                                                – Methods for clustering:
                                                                                                                       • K-means, EM and Cobweb
                                                                                                                – Methods for feature selection

       Data Mining & Statistics within the Health Services     Weka Tutorial (Dr. Wenjia Wang)     35       Data Mining & Statistics within the Health Services   Weka Tutorial (Dr. Wenjia Wang)         36




Dr. Wenjia Wang: Tutorial for DM tool Weka                                                                                                                                                                     6

Contenu connexe

Similaire à Wekatutorial

Big data and its impact on SOA
Big data and its impact on SOABig data and its impact on SOA
Big data and its impact on SOADemed L'Her
 
Trends in Use of Scientific Workflows: Insights from a Public Repository and ...
Trends in Use of Scientific Workflows: Insights from a Public Repository and ...Trends in Use of Scientific Workflows: Insights from a Public Repository and ...
Trends in Use of Scientific Workflows: Insights from a Public Repository and ...Richard Littauer
 
Microsoft PowerPoint - weka [Read-Only]
Microsoft PowerPoint - weka [Read-Only]Microsoft PowerPoint - weka [Read-Only]
Microsoft PowerPoint - weka [Read-Only]butest
 
M&amp;R Poster P1 Event Manager Website Rajopadhye Mhatre Nimmagadda
M&amp;R Poster P1 Event Manager Website Rajopadhye Mhatre NimmagaddaM&amp;R Poster P1 Event Manager Website Rajopadhye Mhatre Nimmagadda
M&amp;R Poster P1 Event Manager Website Rajopadhye Mhatre Nimmagaddarajopadhye
 
Data repositories -- Xiamen University 2012 06-08
Data repositories -- Xiamen University 2012 06-08Data repositories -- Xiamen University 2012 06-08
Data repositories -- Xiamen University 2012 06-08Jian Qin
 
Systems Lifecycle workbook
Systems Lifecycle workbookSystems Lifecycle workbook
Systems Lifecycle workbookMISY
 
Machine Learning Application Development
Machine Learning Application DevelopmentMachine Learning Application Development
Machine Learning Application DevelopmentLARCA UPC
 
Infrastructure and Workflow for the Formal Evaluation of Semantic Search Tech...
Infrastructure and Workflow for the Formal Evaluation of Semantic Search Tech...Infrastructure and Workflow for the Formal Evaluation of Semantic Search Tech...
Infrastructure and Workflow for the Formal Evaluation of Semantic Search Tech...Stuart Wrigley
 
Weka : A machine learning algorithms for data mining
Weka : A machine learning algorithms for data miningWeka : A machine learning algorithms for data mining
Weka : A machine learning algorithms for data miningKeshab Kumar Gaurav
 
The Research Process
The Research ProcessThe Research Process
The Research ProcessZain Mushtaq
 
EdgarDB -- the simple, powerful database for scientific research
EdgarDB -- the simple, powerful database for scientific researchEdgarDB -- the simple, powerful database for scientific research
EdgarDB -- the simple, powerful database for scientific researchMark Khoury
 
EdgarDB overview
EdgarDB overviewEdgarDB overview
EdgarDB overviewMark Khoury
 
2013 02 data portal science group update -v smith
2013 02 data portal science group update -v smith2013 02 data portal science group update -v smith
2013 02 data portal science group update -v smithVince Smith
 
Distributed Database practicals
Distributed Database practicals Distributed Database practicals
Distributed Database practicals Vrushali Lanjewar
 
DataONE_cobb_hubbub2012_20120924_v05
DataONE_cobb_hubbub2012_20120924_v05DataONE_cobb_hubbub2012_20120924_v05
DataONE_cobb_hubbub2012_20120924_v05John Cobb
 
2012 10 bigdata_overview
2012 10 bigdata_overview2012 10 bigdata_overview
2012 10 bigdata_overviewjdijcks
 
Pain points for preservation services / workflows in repositories
Pain points for preservation services /  workflows in repositories Pain points for preservation services /  workflows in repositories
Pain points for preservation services / workflows in repositories prwheatley
 

Similaire à Wekatutorial (20)

Big data and its impact on SOA
Big data and its impact on SOABig data and its impact on SOA
Big data and its impact on SOA
 
Trends in Use of Scientific Workflows: Insights from a Public Repository and ...
Trends in Use of Scientific Workflows: Insights from a Public Repository and ...Trends in Use of Scientific Workflows: Insights from a Public Repository and ...
Trends in Use of Scientific Workflows: Insights from a Public Repository and ...
 
Weka
WekaWeka
Weka
 
Microsoft PowerPoint - weka [Read-Only]
Microsoft PowerPoint - weka [Read-Only]Microsoft PowerPoint - weka [Read-Only]
Microsoft PowerPoint - weka [Read-Only]
 
M&amp;R Poster P1 Event Manager Website Rajopadhye Mhatre Nimmagadda
M&amp;R Poster P1 Event Manager Website Rajopadhye Mhatre NimmagaddaM&amp;R Poster P1 Event Manager Website Rajopadhye Mhatre Nimmagadda
M&amp;R Poster P1 Event Manager Website Rajopadhye Mhatre Nimmagadda
 
Data repositories -- Xiamen University 2012 06-08
Data repositories -- Xiamen University 2012 06-08Data repositories -- Xiamen University 2012 06-08
Data repositories -- Xiamen University 2012 06-08
 
Systems Lifecycle workbook
Systems Lifecycle workbookSystems Lifecycle workbook
Systems Lifecycle workbook
 
Data mining weka
Data mining wekaData mining weka
Data mining weka
 
Machine Learning Application Development
Machine Learning Application DevelopmentMachine Learning Application Development
Machine Learning Application Development
 
Infrastructure and Workflow for the Formal Evaluation of Semantic Search Tech...
Infrastructure and Workflow for the Formal Evaluation of Semantic Search Tech...Infrastructure and Workflow for the Formal Evaluation of Semantic Search Tech...
Infrastructure and Workflow for the Formal Evaluation of Semantic Search Tech...
 
Weka : A machine learning algorithms for data mining
Weka : A machine learning algorithms for data miningWeka : A machine learning algorithms for data mining
Weka : A machine learning algorithms for data mining
 
The Research Process
The Research ProcessThe Research Process
The Research Process
 
EdgarDB -- the simple, powerful database for scientific research
EdgarDB -- the simple, powerful database for scientific researchEdgarDB -- the simple, powerful database for scientific research
EdgarDB -- the simple, powerful database for scientific research
 
EdgarDB overview
EdgarDB overviewEdgarDB overview
EdgarDB overview
 
2013 02 data portal science group update -v smith
2013 02 data portal science group update -v smith2013 02 data portal science group update -v smith
2013 02 data portal science group update -v smith
 
Search Methods for Multidimensional Data
Search Methods for Multidimensional Data Search Methods for Multidimensional Data
Search Methods for Multidimensional Data
 
Distributed Database practicals
Distributed Database practicals Distributed Database practicals
Distributed Database practicals
 
DataONE_cobb_hubbub2012_20120924_v05
DataONE_cobb_hubbub2012_20120924_v05DataONE_cobb_hubbub2012_20120924_v05
DataONE_cobb_hubbub2012_20120924_v05
 
2012 10 bigdata_overview
2012 10 bigdata_overview2012 10 bigdata_overview
2012 10 bigdata_overview
 
Pain points for preservation services / workflows in repositories
Pain points for preservation services /  workflows in repositories Pain points for preservation services /  workflows in repositories
Pain points for preservation services / workflows in repositories
 

Dernier

TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 

Dernier (20)

TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 

Wekatutorial

  • 1. CMP: Data Mining and Statistics within the Health Services 19/02/2010 Data Mining and Statistics Content Within the Health Services 1. Introduction to Weka Tutorial for Weka 2. Data Mining Functions and Tools 3. Data Format a data mining tool 4. Hands-on Demos 4.1 Weka Explorer Dr. Wenjia Wang • Classification • Attribute( feature) Selection School of Computing Sciences 4.2 Weka Experimenter University of East Anglia 4.3 Weka KnowledgeFlow 5. Summary Data Pre-processing Data Mining Knowledge Data Mining & Statistics within the Health Services Data Mining & Statistics within the Health Services Weka Tutorial (Dr. Wenjia Wang) 2 1. Introduction to WEKA Weka Main Features • A collection of open source of many data • 49 data preprocessing tools mining and machine learning algorithms, • 76 classification/regression algorithms including • 8 clustering algorithms – pre-processing on data • 15 attribute/subset evaluators + 10 search – Classification: algorithms for feature selection. – clustering • 3 algorithms for finding association rules – association rule extraction • 3 graphical user interfaces – “The Explorer” (exploratory data analysis) • Created by researchers at the University of – “The Experimenter” (experimental environment) Waikato in New Zealand – “The KnowledgeFlow” (new process model inspired • Java based (also open source). interface) Data Mining & Statistics within the Health Services Weka Tutorial (Dr. Wenjia Wang) 3 Data Mining & Statistics within the Health Services Weka Tutorial (Dr. Wenjia Wang) 4 Weka: Download and Installation Start the Weka • Download Weka (the stable version) from • From windows desktop, http://www.cs.waikato.ac.nz/ml/weka/ – click “Start”, choose “All programs”, – Choose a self-extracting executable (including Java VM) – Choose “Weka 3.6” to start Weka – Then the first interface – (If you are interested in modifying/extending weka there window appears: is a developer version that includes the source code) Weka GUI Chooser. • After download is completed, run the self- extracting file to install Weka, and use the default set-ups. Data Mining & Statistics within the Health Services Weka Tutorial (Dr. Wenjia Wang) 5 Data Mining & Statistics within the Health Services Weka Tutorial (Dr. Wenjia Wang) 6 Dr. Wenjia Wang: Tutorial for DM tool Weka 1
  • 2. CMP: Data Mining and Statistics within the Health Services 19/02/2010 WEKA Application Interfaces Weka Application Interfaces • Explorer – preprocessing, attribute selection, learning, visualiation • Experimenter – testing and evaluating machine learning algorithms • Knowledge Flow – visual design of KDD process – Explorer • Simple Command-line – A simple interface for typing commands Data Mining & Statistics within the Health Services Weka Tutorial (Dr. Wenjia Wang) 7 Data Mining & Statistics within the Health Services Weka Tutorial (Dr. Wenjia Wang) 8 Load data file and 2. Weka Functions and Tools Preprocessing • Preprocessing Filters • Load data file in formats: ARFF, CSV, C4.5, binary • Attribute selection • Import from URL or SQL database (using JDBC) • Classification/Regression • Preprocessing filters • Clustering – Adding/removing attributes • Association discovery – Attribute value substitution – Discretization • Visualization – Time series filters (delta, shift) – Sampling, randomization – Missing value management – Normalization and other numeric transformations Data Mining & Statistics within the Health Services Weka Tutorial (Dr. Wenjia Wang) 9 Data Mining & Statistics within the Health Services Weka Tutorial (Dr. Wenjia Wang) 10 Feature Selection Classification • Very flexible: arbitrary combination of search and • Predicted target must be categorical evaluation methods • Implemented methods • Search methods – decision trees(J48, etc.) and rules – best-first – Naïve Bayes – genetic – neural networks – ranking ... – instance-based classifiers … • Evaluation measures • Evaluation methods – ReliefF – test data set – information gain – gain ratio – crossvalidation • Demo data: weather_nominal.arff • Demo data: iris, contact lenses, labor, soybeans, etc. Data Mining & Statistics within the Health Services Weka Tutorial (Dr. Wenjia Wang) 11 Data Mining & Statistics within the Health Services Weka Tutorial (Dr. Wenjia Wang) 12 Dr. Wenjia Wang: Tutorial for DM tool Weka 2
  • 3. CMP: Data Mining and Statistics within the Health Services 19/02/2010 Clustering Regression • Implemented methods – k-Means • Predicted target is continuous – EM • Methods – Cobweb – X-means – linear regression – FarthestFirst… – neural networks • Clusters can be visualized and compared to “true” – regression trees … clusters (if given) • Demo data: • Demo data: cpu.arff, – any classification data may be used for clustering when its class attribute is filtered out. Data Mining & Statistics within the Health Services Weka Tutorial (Dr. Wenjia Wang) 13 Data Mining & Statistics within the Health Services Weka Tutorial (Dr. Wenjia Wang) 14 Weka: Pros and cons 3. WEKA data formats • pros • Data can be imported from a file in various – Open source, formats: • Free – ARFF (Attribute Relation File Format) has two sections: • Extensible • the Header information defines attribute name, type and • Can be integrated into other java packages relations. – GUIs (Graphic User Interfaces) • the Data section lists the data records. • Relatively easier to use – CSV: Comma Separated Values (text file) – Features – C4.5: A format used by a decision induction algorithm • Run individual experiment, or C4.5, requires two separated files • Build KDD phases • Name file: defines the names of the attributes • Cons • Date file: lists the records (samples) – Lack of proper and adequate documentations – binary – Systems are updated constantly (Kitchen Sink Syndrome) • Data can also be read from a URL or from an SQL database (using JDBC) Data Mining & Statistics within the Health Services Weka Tutorial (Dr. Wenjia Wang) 15 Data Mining & Statistics within the Health Services Weka Tutorial (Dr. Wenjia Wang) 16 Attribute Relation File Format (arff) Breast Cancer data in ARFF % Breast Cancer data*: 286 instances (no-recurrence-events: 201, recurrence- An ARFF file consists of two distinct sections: events: 85) % Part 1: Definitions of attribute name, types and relations • the Header section defines attribute name, type @relation breast-cancer @attribute age {'10-19','20-29','30-39','40-49','50-59','60-69','70-79','80-89','90-99'} @attribute menopause {'lt40','ge40','premeno'} and relations, start with a keyword. @attribute tumor-size {'0-4','5-9','10-14','15-19','20-24','25-29','30-34','35-39','40-44','45- 49','50-54','55-59'} @Relation <data-name> @attribute inv-nodes {'0-2','3-5','6-8','9-11','12-14','15-17','18-20','21-23','24-26','27-29','30- 32','33-35','36-39'} @attribute <attribute-name> <type> or {range} @attribute node-caps {'yes','no'} @attribute deg-malig {'1','2','3'} @attribute breast {'left','right'} • the Data section lists the data records, starts with @attribute breast-quad {'left_up','left_low','right_up','right_low','central'} @attribute 'irradiat' {'yes','no'} @Data @attribute 'Class' {'no-recurrence-events','recurrence-events'} list of data instances % Part 2: data section @data • Any line start with % is the comments. '40-49','premeno','15-19','0-2','yes','3','right','left_up','no','recurrence-events' '50-59','ge40','15-19','0-2','no','1','right','central','no','no-recurrence-events' '50-59','ge40','35-39','0-2','no','2','left','left_low','no','recurrence-events' …… * source: http://archive.ics.uci.edu/ml/datasets/Breast+Cancer Data Mining & Statistics within the Health Services Weka Tutorial (Dr. Wenjia Wang) 17 Data Mining & Statistics within the Health Services Weka Tutorial (Dr. Wenjia Wang) 18 Dr. Wenjia Wang: Tutorial for DM tool Weka 3
  • 4. CMP: Data Mining and Statistics within the Health Services 19/02/2010 4.1 WEKA Explorer Weka Explorer: open data file • Open • Click the Explorer on Weka GUI Chooser Breast Cancer • On the Explorer window, data – click button “Open File” to open a data file • Click an from attribute, e.g. age, • the folder where your data files stored. then its e.g. Breast Cancer data: breast_cancer.arff distributio n will be Or (if you don’t have this data set), displayed • the data folder provided by the weka package: in a histogra e.g. C:Program FilesWeka-3-6data m. using “iris.arff” or “weather_nominal.arff” Data Mining & Statistics within the Health Services Weka Tutorial (Dr. Wenjia Wang) 19 Data Mining & Statistics within the Health Services Weka Tutorial (Dr. Wenjia Wang) 20 Weka Explorer: training classifiers Results After loaded a data file, click “Classify” • Testing • Choose a classifier, results: – Under “Classifier”: click “choose”, then a drop-down • 97 cases used in menu appears, test. – Click “trees” and select “J48” – a decision tree Correct: algorithm 66 (68%) • Select a test option Wrong: 31 (32%) – Select “percentage split” • with default ratio 66% for training and 34% for testing • Click “Start” to train and test the classifier. – The training and testing information will be displayed in classifier output window. Data Mining & Statistics within the Health Services Weka Tutorial (Dr. Wenjia Wang) 21 Data Mining & Statistics within the Health Services Weka Tutorial (Dr. Wenjia Wang) 22 Options for results and model View the tree • Point to • Point to result result list list window, window, and right and right click click mouse, mouse. • Choose “visualiz • A menu e tree”, will pop then the out to tree will show all be the displayed options in availabl another e about the window. model. Data Mining & Statistics within the Health Services Weka Tutorial (Dr. Wenjia Wang) 23 Data Mining & Statistics within the Health Services Weka Tutorial (Dr. Wenjia Wang) 24 Dr. Wenjia Wang: Tutorial for DM tool Weka 4
  • 5. CMP: Data Mining and Statistics within the Health Services 19/02/2010 View classifier errors Save the model and results • right click the • Right result list, click on the • Choose result “visualize list classifier error”, then a • Choose new window will “save model” be popped out and to display the “save classifier’s error. result buffer” – Correctly to save predicted the classifie cases r and – Wrong the results cases to the disk folder. Data Mining & Statistics within the Health Services Weka Tutorial (Dr. Wenjia Wang) 25 Data Mining & Statistics within the Health Services Weka Tutorial (Dr. Wenjia Wang) 26 Train a neural net View the model’s ROC curve Click “Choose” to select • Right click another the result: function, “Multiplaye e.g. “Multilayer rPerceptro Perceptron” - a type of n” neural net. • Choose Then click “Start” “visualize to train and test it. (note: threshold the training curve” and may take much longer “recurrent time.) events”; • The ROC The results seem better curve will than the tree be classifier. displayed. Data Mining & Statistics within the Health Services Weka Tutorial (Dr. Wenjia Wang) 27 Data Mining & Statistics within the Health Services Weka Tutorial (Dr. Wenjia Wang) 28 Select Attributes 4.2 Weka Experimenter • Click “Select • you can use Experimenter Attributes” to carry out experiments for multiple data sets • Choose an using multiple methods, “attribute e.g. classifying evaluator” • two data sets – e.g. chiSquare – Breast cancer • Choose a – Iris “Search • Using two methods Method” – Decision Tree: J48 • Then click – Logistic “Start” • The experiment is “Setup” as shown in the • The selected screenshot. attributes are • Then click “Run” listed. Data Mining & Statistics within the Health Services Weka Tutorial (Dr. Wenjia Wang) 29 Data Mining & Statistics within the Health Services Weka Tutorial (Dr. Wenjia Wang) 30 Dr. Wenjia Wang: Tutorial for DM tool Weka 5
  • 6. CMP: Data Mining and Statistics within the Health Services 19/02/2010 Analysis of the results 4.3 KnowledgeFlow • Click • Click KnowledgeFlow on Weka GUI Chooser “analysis” to analyse the • A new window opened for buidling KDD process. results, E.g. paired t-test significance • Click “Experiment” • Configure test: choosing appropriate test and parameters • Click “Perform test” and the test results are listed. Data Mining & Statistics within the Health Services Weka Tutorial (Dr. Wenjia Wang) 31 Data Mining & Statistics within the Health Services Weka Tutorial (Dr. Wenjia Wang) 32 Steps for building a KDD A KDD process for Breast process Cancer Major steps for building a process 1. Adding required nodes 1) Add nodes 2) Add a data source node from “DataSources” 1) Right click to configure it with a data set 3) Add a classAssigner node from “Evaluation” and a CrossValidationFoldmaker node 4) Add a classifier, e.g. J48, from Classifiers 5) Add a classiferPerformanceEvaluator node from “Evaluation” 6) Add a text viewer from “Visualisation” 2. Connect the nodes – Right click “DataSource” node and choose DataSet, then connect it to the ClassAssigner node, – do the same or similar for connecting between the other nodes. 3. Run the process (using the default setups for each node) – Right click DataSource node and choose “Start loading”, the process should run and “Status” window should indicate if the run is correct and completed. 4. View the results: – If the run is correctly completed, right click “Text Viewer” node and choose “Show results”, then another window pops out to show the results. Data Mining & Statistics within the Health Services Weka Tutorial (Dr. Wenjia Wang) 33 Data Mining & Statistics within the Health Services Weka Tutorial (Dr. Wenjia Wang) 34 Results of the KDD process 5. Weka Tutorial Summary Weka is open source data mining software that offers • right click “Text • Some GUI interfaces for data mining Viewer” – Explorer node and – Experimenter choose – KnowledgeFlow “Show • Many functions and tools that include results”, – Methods for classification: then decision trees, rule learners, naive Bayes, decision tables, locally weighted regression, SVMs, instance-based learners, logistic regression, multi-layer another perceptron window – methods for regression/prediction: pops out linear regression, model tree generators, locally weighted regression, instance- to show based learners, decision tables, multi-layer perceptron the – Ensemble schemes • Bagging, boosting, stacking, RandomFrest results. – Methods for clustering: • K-means, EM and Cobweb – Methods for feature selection Data Mining & Statistics within the Health Services Weka Tutorial (Dr. Wenjia Wang) 35 Data Mining & Statistics within the Health Services Weka Tutorial (Dr. Wenjia Wang) 36 Dr. Wenjia Wang: Tutorial for DM tool Weka 6