CMP: Data Mining and Statistics within the Health Services                                                                                                                                        19/02/2010




        Data Mining and Statistics
        Within the Health Services

        Tutorial for Weka
        a data mining tool

        Dr. Wenjia Wang
        School of Computing Sciences
        University of East Anglia

        Data    Pre-processing    Data Mining    Knowledge

        Content
        1. Introduction to Weka
        2. Data Mining Functions and Tools
        3. Data Format
        4. Hands-on Demos
             4.1 Weka Explorer
                 • Classification
                 • Attribute (feature) Selection
             4.2 Weka Experimenter
             4.3 Weka KnowledgeFlow
        5. Summary


       Data Mining & Statistics within the Health Services
                                                                                                               Data Mining & Statistics within the Health Services   Weka Tutorial (Dr. Wenjia Wang)      2




        1. Introduction to WEKA
        • An open-source collection of data mining and machine learning
          algorithms, including
              – data pre-processing
              – classification
              – clustering
              – association rule extraction
        • Created by researchers at the University of Waikato in New Zealand
        • Java based (also open source).

        Weka Main Features
        • 49 data preprocessing tools
        • 76 classification/regression algorithms
        • 8 clustering algorithms
        • 15 attribute/subset evaluators + 10 search algorithms for feature selection
        • 3 algorithms for finding association rules
        • 3 graphical user interfaces
              – “The Explorer” (exploratory data analysis)
              – “The Experimenter” (experimental environment)
              – “The KnowledgeFlow” (new process-model-inspired interface)




        Weka: Download and Installation
        • Download Weka (the stable version) from
               http://www.cs.waikato.ac.nz/ml/weka/
               – Choose a self-extracting executable (including Java VM)
               – (If you are interested in modifying or extending Weka, there
                 is a developer version that includes the source code)
        • After the download is complete, run the self-extracting file to
          install Weka, using the default settings.

        Start Weka
        • From the Windows desktop,
               – click “Start”, choose “All Programs”,
               – choose “Weka 3.6” to start Weka
               – The first interface window then appears: the Weka GUI Chooser.




Dr. Wenjia Wang: Tutorial for DM tool Weka                                                                                                                                                                    1




        Weka Application Interfaces
        • Explorer
            – preprocessing, attribute selection, learning, visualization
        • Experimenter
            – testing and evaluating machine learning algorithms
        • Knowledge Flow
            – visual design of KDD process
            – Explorer
        • Simple Command-line
            – A simple interface for typing commands




        2. Weka Functions and Tools
        •    Preprocessing filters
        •    Attribute selection
        •    Classification/Regression
        •    Clustering
        •    Association discovery
        •    Visualization

        Load data file and Preprocessing
        • Load a data file in one of the formats: ARFF, CSV, C4.5, binary
        • Import from a URL or an SQL database (using JDBC)
        • Preprocessing filters
              –    Adding/removing attributes
              –    Attribute value substitution
              –    Discretization
              –    Time series filters (delta, shift)
              –    Sampling, randomization
              –    Missing value management
              –    Normalization and other numeric transformations
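As a rough illustration of two of the filter types listed above (normalization and discretization), the following Python sketch rescales a numeric attribute to the range [0, 1] and splits it into equal-width bins. It is not Weka's implementation; the function names and the sample values are hypothetical and chosen only for illustration.

```python
def normalize(values):
    """Min-max scale a list of numbers into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute carries no spread; map everything to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def discretize(values, bins):
    """Assign each value to one of `bins` equal-width intervals (0-based)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / bins
    # The maximum value would land in bin number `bins`, so clamp it
    # into the last valid bin.
    return [min(int((v - lo) / width), bins - 1) for v in values]


ages = [20, 35, 50, 65, 80]  # made-up values, for illustration only
print(normalize(ages))
print(discretize(ages, 3))
```

Equal-width binning is only one possible discretization scheme; it is used here because it needs no class information.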




        Feature Selection
        • Very flexible: arbitrary combination of search and evaluation methods
        • Search methods
            – best-first
            – genetic
            – ranking ...
        • Evaluation measures
            – ReliefF
            – information gain
            – gain ratio
        • Demo data: weather_nominal.arff

        Classification
        • Predicted target must be categorical
        • Implemented methods
            – decision trees (J48, etc.) and rules
            – Naïve Bayes
            – neural networks
            – instance-based classifiers …
        • Evaluation methods
            – test data set
            – cross-validation
        • Demo data: iris, contact lenses, labor, soybeans, etc.
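To make the "information gain" evaluation measure combined with a "ranking" search more concrete, here is a small Python sketch that scores each attribute of the 14-instance nominal weather data and ranks them. It is a sketch under assumptions, not Weka's code: the rows below were typed in by hand and should be checked against weather_nominal.arff, and the function names are hypothetical.

```python
import math
from collections import Counter

# Columns: outlook, temperature, humidity, windy, play (the class).
# Hand-typed copy of the nominal weather data; verify against the ARFF file.
WEATHER = [
    ("sunny", "hot", "high", "FALSE", "no"),
    ("sunny", "hot", "high", "TRUE", "no"),
    ("overcast", "hot", "high", "FALSE", "yes"),
    ("rainy", "mild", "high", "FALSE", "yes"),
    ("rainy", "cool", "normal", "FALSE", "yes"),
    ("rainy", "cool", "normal", "TRUE", "no"),
    ("overcast", "cool", "normal", "TRUE", "yes"),
    ("sunny", "mild", "high", "FALSE", "no"),
    ("sunny", "cool", "normal", "FALSE", "yes"),
    ("rainy", "mild", "normal", "FALSE", "yes"),
    ("sunny", "mild", "normal", "TRUE", "yes"),
    ("overcast", "mild", "high", "TRUE", "yes"),
    ("overcast", "hot", "normal", "FALSE", "yes"),
    ("rainy", "mild", "high", "TRUE", "no"),
]
ATTRIBUTES = ["outlook", "temperature", "humidity", "windy"]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, attr_index):
    """Class entropy minus the weighted entropy after splitting on one attribute."""
    labels = [row[-1] for row in rows]
    remainder = 0.0
    for value in set(row[attr_index] for row in rows):
        subset = [row[-1] for row in rows if row[attr_index] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder


def rank_attributes(rows, names):
    """Attribute names ordered from highest to lowest information gain."""
    gains = [(information_gain(rows, i), name) for i, name in enumerate(names)]
    return [name for _, name in sorted(gains, reverse=True)]


for i, name in enumerate(ATTRIBUTES):
    print(name, round(information_gain(WEATHER, i), 4))
print(rank_attributes(WEATHER, ATTRIBUTES))
```

A ranking search simply sorts attributes by their individual scores; best-first or genetic search would instead evaluate subsets of attributes.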








        Clustering
        • Implemented methods
            – k-Means
            – EM
            – Cobweb
            – X-means
            – FarthestFirst …
        • Clusters can be visualized and compared to “true” clusters (if given)
        • Demo data:
            – any classification data may be used for clustering when
              its class attribute is filtered out.

        Regression
        • Predicted target is continuous
        • Methods
            – linear regression
            – neural networks
            – regression trees …
        • Demo data: cpu.arff
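For the regression case, where the predicted target is continuous, the simplest method on the list is linear regression. The sketch below fits a single-attribute least-squares line in plain Python. It only illustrates the idea and is not how Weka implements it; the function name and the four data points are made up.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x for one numeric attribute."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variance sums (the common 1/n factor cancels out).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up points that lie exactly on y = 1 + 2x.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
intercept, slope = fit_line(xs, ys)
print(intercept, slope)
```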




        Weka: Pros and cons
        • Pros
               – Open source
                     • Free
                     • Extensible
                     • Can be integrated into other Java packages
               – GUIs (Graphical User Interfaces)
                     • Relatively easy to use
               – Features
                     • Run individual experiments, or
                     • Build KDD phases
        • Cons
               – Lack of proper and adequate documentation
               – Systems are updated constantly (Kitchen Sink Syndrome)

        3. WEKA data formats
        • Data can be imported from a file in various formats:
               – ARFF (Attribute Relation File Format), which has two sections:
                     • the Header section defines the attribute names, types and relations.
                     • the Data section lists the data records.
               – CSV: Comma Separated Values (text file)
               – C4.5: the format used by the decision tree induction algorithm
                 C4.5; it requires two separate files
                     • Names file: defines the names of the attributes
                     • Data file: lists the records (samples)
               – binary
        • Data can also be read from a URL or from an SQL database (using JDBC)
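To show how the CSV and ARFF formats relate, the following Python sketch turns a small CSV text (with a header row) into ARFF text. It rests on simplifying assumptions that are not part of the tutorial: every column is treated as nominal, values are not quoted, and missing values are not handled. The function name and the sample data are hypothetical.

```python
import csv
import io


def csv_to_arff(csv_text, relation):
    """Convert CSV text with a header row into ARFF text (nominal attributes only)."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, records = rows[0], rows[1:]

    # Header section: relation name, then one @attribute line per column.
    lines = ["@relation " + relation, ""]
    for i, name in enumerate(header):
        # The nominal range is simply the set of values seen in that column.
        values = sorted({rec[i] for rec in records})
        lines.append("@attribute %s {%s}" % (name, ",".join(values)))

    # Data section: one comma-separated record per line.
    lines += ["", "@data"]
    lines += [",".join(rec) for rec in records]
    return "\n".join(lines)


sample = "outlook,play\nsunny,no\novercast,yes\nrainy,yes\n"
print(csv_to_arff(sample, "weather"))
```

Weka can also open CSV files directly, so a conversion like this is only needed when the ARFF text itself is wanted.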




        Attribute Relation File Format (ARFF)
        An ARFF file consists of two distinct sections:
        • the Header section defines the attribute names, types and
          relations; each declaration starts with a keyword:
               @relation <data-name>
               @attribute <attribute-name> <type> or {range}
        • the Data section lists the data records and starts with
               @data
               list of data instances
        • Any line starting with % is a comment.

        Breast Cancer data in ARFF
        % Breast Cancer data*: 286 instances (no-recurrence-events: 201, recurrence-events: 85)
        % Part 1: Definitions of attribute name, types and relations
        @relation breast-cancer
        @attribute age {'10-19','20-29','30-39','40-49','50-59','60-69','70-79','80-89','90-99'}
        @attribute menopause {'lt40','ge40','premeno'}
        @attribute tumor-size {'0-4','5-9','10-14','15-19','20-24','25-29','30-34','35-39','40-44','45-49','50-54','55-59'}
        @attribute inv-nodes {'0-2','3-5','6-8','9-11','12-14','15-17','18-20','21-23','24-26','27-29','30-32','33-35','36-39'}
        @attribute node-caps {'yes','no'}
        @attribute deg-malig {'1','2','3'}
        @attribute breast {'left','right'}
        @attribute breast-quad {'left_up','left_low','right_up','right_low','central'}
        @attribute 'irradiat' {'yes','no'}
        @attribute 'Class' {'no-recurrence-events','recurrence-events'}

        % Part 2: data section
        @data
        '40-49','premeno','15-19','0-2','yes','3','right','left_up','no','recurrence-events'
        '50-59','ge40','15-19','0-2','no','1','right','central','no','no-recurrence-events'
        '50-59','ge40','35-39','0-2','no','2','left','left_low','no','recurrence-events'
        ……
        * source: http://archive.ics.uci.edu/ml/datasets/Breast+Cancer
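The two-section ARFF layout described above can be read with very little code. The Python sketch below is a minimal reader, not a full ARFF parser: it assumes one declaration or record per line, comma-separated data with single-quoted values, and no sparse or relational attributes. The function name is hypothetical, and the sample is a cut-down version of the breast cancer file with only three attributes.

```python
import csv


def parse_arff(text):
    """Return (relation, attributes, data) from simple ARFF text."""
    relation, attributes, data = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and % comments
        if in_data:
            # Data section: one record per line, values may be single-quoted.
            data.append(next(csv.reader([line], quotechar="'")))
            continue
        keyword = line.split(None, 1)[0].lower()  # keywords are case-insensitive
        if keyword == "@relation":
            relation = line.split(None, 1)[1]
        elif keyword == "@attribute":
            _, name, spec = line.split(None, 2)
            attributes.append((name.strip("'"), spec))
        elif keyword == "@data":
            in_data = True
    return relation, attributes, data


SAMPLE = """% cut-down breast cancer example (three attributes only)
@relation breast-cancer
@attribute menopause {'lt40','ge40','premeno'}
@attribute node-caps {'yes','no'}
@attribute 'Class' {'no-recurrence-events','recurrence-events'}
@data
'premeno','yes','recurrence-events'
'ge40','no','no-recurrence-events'
"""

relation, attributes, data = parse_arff(SAMPLE)
print(relation)
print([name for name, _ in attributes])
print(data)
```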








        4.1 WEKA Explorer
        • Click “Explorer” on the Weka GUI Chooser
        • In the Explorer window,
              – click the “Open File” button to open a data file from
                     • the folder where your data files are stored,
                       e.g. Breast Cancer data: breast_cancer.arff
                     or (if you don’t have this data set),
                     • the data folder provided by the Weka package,
                       e.g. C:\Program Files\Weka-3-6\data
                       using “iris.arff” or “weather_nominal.arff”

        Weka Explorer: open data file
        • Open the Breast Cancer data
        • Click an attribute, e.g. age; its distribution will then be
          displayed in a histogram.




        Weka Explorer: training classifiers
        After loading a data file, click “Classify”
        • Choose a classifier
              – Under “Classifier”, click “Choose”; a drop-down menu appears
              – Click “trees” and select “J48”, a decision tree algorithm
        • Select a test option
              – Select “Percentage split”
                     • with the default ratio: 66% for training and 34% for testing
        • Click “Start” to train and test the classifier.
              – The training and testing information will be displayed
                in the classifier output window.

        Results
        • Testing results:
        • 97 cases used in the test.
              Correct: 66 (68%)
              Wrong: 31 (32%)
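The numbers on the Results slide can be checked with a few lines of arithmetic. The Python sketch below assumes the 286-instance breast cancer data and assumes that the training set size is the rounded 66% share; that rounding rule is an assumption, chosen because it reproduces the 97 test cases reported above. The function names are hypothetical.

```python
def percentage_split(n_instances, train_percent):
    """Return (training size, test size) for a percentage split."""
    # Assumption: the training share is rounded to the nearest whole instance.
    n_train = round(n_instances * train_percent / 100)
    return n_train, n_instances - n_train


def accuracy(correct, total):
    """Percentage of correctly classified test cases."""
    return 100.0 * correct / total


n_train, n_test = percentage_split(286, 66)
print(n_train, n_test)                  # training and test set sizes
print(round(accuracy(66, n_test)))      # correct, in percent
print(round(accuracy(31, n_test)))      # wrong, in percent
```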




        Options for results and model
        • Point to the result list window and right-click the mouse.
        • A menu will pop up showing all the options available for the model.

        View the tree
        • Point to the result list window and right-click the mouse.
        • Choose “Visualize tree”; the tree will then be displayed in
          another window.








        View classifier errors
        • Right-click the result list.
        • Choose “Visualize classifier errors”; a new window will then pop up
          to display the classifier’s errors:
              – correctly predicted cases
              – wrong cases

        Save the model and results
        • Right-click on the result list.
        • Choose “Save model” and “Save result buffer” to save the classifier
          and the results to a folder on disk.




        Train a neural net
        Click “Choose” to select another function,
        e.g. “Multilayer Perceptron”, a type of neural net.
        Then click “Start” to train and test it. (Note: the training may
        take much longer.)
        The results seem better than those of the tree classifier.

        View the model’s ROC curve
        • Right-click the result: “MultilayerPerceptron”
        • Choose “Visualize threshold curve” and “recurrence-events”;
        • The ROC curve will be displayed.
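For readers who want to see what the threshold (ROC) curve plots, the Python sketch below builds the curve points from predicted scores for the positive class and computes the area under it. It is a simplified illustration, not Weka's routine: it assumes all scores are distinct, and the scores, labels and function names are made up.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs, one per threshold.

    `labels` holds 1 for the positive class and 0 for the negative class.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Lower the threshold one instance at a time, from the highest score down.
    for score, label in sorted(zip(scores, labels), reverse=True):
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve, by the trapezoid rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


# Made-up scores for five test cases; 1 marks the positive class.
scores = [0.9, 0.8, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0]
curve = roc_points(scores, labels)
print(curve)
print(round(auc(curve), 4))
```

A curve that hugs the top-left corner (area close to 1) indicates a classifier that ranks positive cases above negative ones.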




         Select Attributes
       • Click “Select attributes”
       • Choose an “Attribute Evaluator”
             – e.g. ChiSquaredAttributeEval
       • Choose a “Search Method”
       • Then click “Start”
       • The selected attributes are listed.

         4.2 Weka Experimenter
       • You can use the Experimenter to carry out experiments on multiple data sets using multiple methods, e.g. classifying
       • Two data sets
             – Breast cancer
             – Iris
       • Using two methods
             – Decision tree: J48
             – Logistic
       • The experiment is set up in the “Setup” tab, as shown in the screenshot.
       • Then click “Run”
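The idea behind a chi-squared attribute evaluator can be sketched as follows: each nominal attribute is scored by the chi-squared statistic of its contingency table against the class, and attributes are then ranked by score. This is the textbook statistic in plain Python, assumed here for illustration only; it is not Weka's implementation, and the names `chi_squared`, `rank_attributes` and the toy columns are hypothetical.

```python
from collections import Counter


def chi_squared(attribute_values, class_values):
    """Chi-squared statistic of one nominal attribute against the class."""
    n = len(attribute_values)
    observed = Counter(zip(attribute_values, class_values))
    attr_totals = Counter(attribute_values)
    class_totals = Counter(class_values)
    statistic = 0.0
    for a, a_count in attr_totals.items():
        for c, c_count in class_totals.items():
            # Expected count if attribute and class were independent.
            expected = a_count * c_count / n
            statistic += (observed[(a, c)] - expected) ** 2 / expected
    return statistic


def rank_attributes(columns, class_values):
    """Return attribute names sorted by decreasing chi-squared score."""
    scores = {name: chi_squared(values, class_values)
              for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy data: one attribute tracks the class, the other does not.
classes = ["yes", "yes", "yes", "yes", "no", "no", "no", "no"]
columns = {
    "noise":       ["a", "b", "a", "b", "a", "b", "a", "b"],
    "informative": ["a", "a", "a", "a", "b", "b", "b", "b"],
}
ranking = rank_attributes(columns, classes)
```

A higher score means the attribute's values are more strongly associated with the class, so it appears earlier in the ranked list.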




        Analysis of the results
        • Click the “Analyse” tab to analyse the results, e.g. with a paired t-test for significance
        • Click “Experiment” (to load the results of the current experiment)
        • Configure the test: choose an appropriate test and parameters
        • Click “Perform test”; the test results are listed.

        4.3 KnowledgeFlow
        • Click KnowledgeFlow on the Weka GUI Chooser
        • A new window opens for building a KDD process.
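For readers unfamiliar with the paired t-test, the sketch below computes the plain textbook statistic from the per-run results of two methods on the same data splits. It is an illustration under stated assumptions: the accuracy figures are made up, the function name `paired_t_statistic` is hypothetical, and the Experimenter may apply a corrected variant of the test rather than this exact formula.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(results_a, results_b):
    """Return (t, degrees_of_freedom) for paired results of two methods."""
    if len(results_a) != len(results_b):
        raise ValueError("both methods need one result per run")
    diffs = [a - b for a, b in zip(results_a, results_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two paired runs")
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1


# Hypothetical percent-correct figures for five runs (not from the tutorial).
method_a = [75, 72, 78, 74, 76]
method_b = [70, 71, 73, 72, 71]
t, df = paired_t_statistic(method_a, method_b)
```

With 4 degrees of freedom, the two-sided 5% critical value in standard t tables is 2.776, so a statistic above that would be reported as a significant difference at that level.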




        Steps for building a KDD process
       Major steps for building a process
       1. Add the required nodes
            1) Add a data source node from “DataSources”
                   – Right click it to configure it with a data set
            2) Add a ClassAssigner node and a CrossValidationFoldMaker node from “Evaluation”
            3) Add a classifier, e.g. J48, from “Classifiers”
            4) Add a ClassifierPerformanceEvaluator node from “Evaluation”
            5) Add a TextViewer node from “Visualization”
       2. Connect the nodes
            – Right click the DataSource node and choose “dataSet”, then connect it to the ClassAssigner node,
            – do the same or similar to connect the other nodes.
       3. Run the process (using the default setups for each node)
            – Right click the DataSource node and choose “Start loading”; the process should run and the “Status” window should indicate whether the run completed correctly.
       4. View the results:
            – If the run completed correctly, right click the TextViewer node and choose “Show results”; another window pops up to show the results.

        A KDD process for Breast Cancer
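The flow built from these nodes can be read as a small program: load the data, mark the class attribute, split into cross-validation folds, train a classifier on each training part, evaluate on the held-out part, and print a text report. The sketch below mirrors that flow in plain Python under loud assumptions: a majority-class predictor stands in for J48, the folds are not stratified, and the data rows and all function names are hypothetical.

```python
import random
from collections import Counter


def make_folds(instances, k, seed=1):
    """Shuffle the instances and yield (train, test) pairs for k folds."""
    shuffled = instances[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]  # every instance lands in one fold
    for i in range(k):
        test = folds[i]
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        yield train, test


def train_majority(train, class_index):
    """Stand-in classifier (not J48): always predicts the most common class."""
    majority = Counter(row[class_index] for row in train).most_common(1)[0][0]
    return lambda row: majority


def cross_validate(instances, class_index, k=10):
    """Return the fraction of instances classified correctly over k folds."""
    correct = 0
    for train, test in make_folds(instances, k):
        model = train_majority(train, class_index)
        correct += sum(model(row) == row[class_index] for row in test)
    return correct / len(instances)


# Hypothetical rows (age band, class); the class is the last column.
data = ([("40-49", "no-recurrence-events")] * 14
        + [("50-59", "recurrence-events")] * 6)
accuracy = cross_validate(data, class_index=-1, k=5)
report = f"Correctly classified: {accuracy:.1%}"
print(report)
```

Here `class_index` plays the role of the ClassAssigner node, `make_folds` that of the CrossValidationFoldMaker, `cross_validate` that of the performance evaluator, and the final `print` that of the text viewer.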







        Results of the KDD process
        • Right click the “Text Viewer” node and choose “Show results”; another window pops up to show the results.

        5. Weka Tutorial Summary
        Weka is open source data mining software that offers
        • Some GUIs for data mining
              – Explorer
              – Experimenter
              – KnowledgeFlow
        • Many functions and tools, including
              – Methods for classification:
                     decision trees, rule learners, naive Bayes, decision tables, locally weighted regression, SVMs, instance-based learners, logistic regression, multi-layer perceptron
              – Methods for regression/prediction:
                     linear regression, model tree generators, locally weighted regression, instance-based learners, decision tables, multi-layer perceptron
              – Ensemble schemes:
                     bagging, boosting, stacking, RandomForest
              – Methods for clustering:
                     k-means, EM and Cobweb
              – Methods for feature selection


Wekatutorial

  • 1. CMP: Data Mining and Statistics within the Health Services 19/02/2010 Data Mining and Statistics Content Within the Health Services 1. Introduction to Weka Tutorial for Weka 2. Data Mining Functions and Tools 3. Data Format a data mining tool 4. Hands-on Demos 4.1 Weka Explorer Dr. Wenjia Wang • Classification • Attribute( feature) Selection School of Computing Sciences 4.2 Weka Experimenter University of East Anglia 4.3 Weka KnowledgeFlow 5. Summary Data Pre-processing Data Mining Knowledge Data Mining & Statistics within the Health Services Data Mining & Statistics within the Health Services Weka Tutorial (Dr. Wenjia Wang) 2 1. Introduction to WEKA Weka Main Features • A collection of open source of many data • 49 data preprocessing tools mining and machine learning algorithms, • 76 classification/regression algorithms including • 8 clustering algorithms – pre-processing on data • 15 attribute/subset evaluators + 10 search – Classification: algorithms for feature selection. – clustering • 3 algorithms for finding association rules – association rule extraction • 3 graphical user interfaces – “The Explorer” (exploratory data analysis) • Created by researchers at the University of – “The Experimenter” (experimental environment) Waikato in New Zealand – “The KnowledgeFlow” (new process model inspired • Java based (also open source). interface) Data Mining & Statistics within the Health Services Weka Tutorial (Dr. Wenjia Wang) 3 Data Mining & Statistics within the Health Services Weka Tutorial (Dr. Wenjia Wang) 4 Weka: Download and Installation Start the Weka • Download Weka (the stable version) from • From windows desktop, http://www.cs.waikato.ac.nz/ml/weka/ – click “Start”, choose “All programs”, – Choose a self-extracting executable (including Java VM) – Choose “Weka 3.6” to start Weka – Then the first interface – (If you are interested in modifying/extending weka there window appears: is a developer version that includes the source code) Weka GUI Chooser. 
• After download is completed, run the self- extracting file to install Weka, and use the default set-ups. Data Mining & Statistics within the Health Services Weka Tutorial (Dr. Wenjia Wang) 5 Data Mining & Statistics within the Health Services Weka Tutorial (Dr. Wenjia Wang) 6 Dr. Wenjia Wang: Tutorial for DM tool Weka 1
  • 2. CMP: Data Mining and Statistics within the Health Services 19/02/2010 WEKA Application Interfaces Weka Application Interfaces • Explorer – preprocessing, attribute selection, learning, visualiation • Experimenter – testing and evaluating machine learning algorithms • Knowledge Flow – visual design of KDD process – Explorer • Simple Command-line – A simple interface for typing commands Data Mining & Statistics within the Health Services Weka Tutorial (Dr. Wenjia Wang) 7 Data Mining & Statistics within the Health Services Weka Tutorial (Dr. Wenjia Wang) 8 Load data file and 2. Weka Functions and Tools Preprocessing • Preprocessing Filters • Load data file in formats: ARFF, CSV, C4.5, binary • Attribute selection • Import from URL or SQL database (using JDBC) • Classification/Regression • Preprocessing filters • Clustering – Adding/removing attributes • Association discovery – Attribute value substitution – Discretization • Visualization – Time series filters (delta, shift) – Sampling, randomization – Missing value management – Normalization and other numeric transformations Data Mining & Statistics within the Health Services Weka Tutorial (Dr. Wenjia Wang) 9 Data Mining & Statistics within the Health Services Weka Tutorial (Dr. Wenjia Wang) 10 Feature Selection Classification • Very flexible: arbitrary combination of search and • Predicted target must be categorical evaluation methods • Implemented methods • Search methods – decision trees(J48, etc.) and rules – best-first – Naïve Bayes – genetic – neural networks – ranking ... – instance-based classifiers … • Evaluation measures • Evaluation methods – ReliefF – test data set – information gain – gain ratio – crossvalidation • Demo data: weather_nominal.arff • Demo data: iris, contact lenses, labor, soybeans, etc. Data Mining & Statistics within the Health Services Weka Tutorial (Dr. Wenjia Wang) 11 Data Mining & Statistics within the Health Services Weka Tutorial (Dr. Wenjia Wang) 12 Dr. 
Wenjia Wang: Tutorial for DM tool Weka 2
  • 3. CMP: Data Mining and Statistics within the Health Services 19/02/2010 Clustering Regression • Implemented methods – k-Means • Predicted target is continuous – EM • Methods – Cobweb – X-means – linear regression – FarthestFirst… – neural networks • Clusters can be visualized and compared to “true” – regression trees … clusters (if given) • Demo data: • Demo data: cpu.arff, – any classification data may be used for clustering when its class attribute is filtered out. Data Mining & Statistics within the Health Services Weka Tutorial (Dr. Wenjia Wang) 13 Data Mining & Statistics within the Health Services Weka Tutorial (Dr. Wenjia Wang) 14 Weka: Pros and cons 3. WEKA data formats • pros • Data can be imported from a file in various – Open source, formats: • Free – ARFF (Attribute Relation File Format) has two sections: • Extensible • the Header information defines attribute name, type and • Can be integrated into other java packages relations. – GUIs (Graphic User Interfaces) • the Data section lists the data records. • Relatively easier to use – CSV: Comma Separated Values (text file) – Features – C4.5: A format used by a decision induction algorithm • Run individual experiment, or C4.5, requires two separated files • Build KDD phases • Name file: defines the names of the attributes • Cons • Date file: lists the records (samples) – Lack of proper and adequate documentations – binary – Systems are updated constantly (Kitchen Sink Syndrome) • Data can also be read from a URL or from an SQL database (using JDBC) Data Mining & Statistics within the Health Services Weka Tutorial (Dr. Wenjia Wang) 15 Data Mining & Statistics within the Health Services Weka Tutorial (Dr. 
Wenjia Wang) 16 Attribute Relation File Format (arff) Breast Cancer data in ARFF % Breast Cancer data*: 286 instances (no-recurrence-events: 201, recurrence- An ARFF file consists of two distinct sections: events: 85) % Part 1: Definitions of attribute name, types and relations • the Header section defines attribute name, type @relation breast-cancer @attribute age {'10-19','20-29','30-39','40-49','50-59','60-69','70-79','80-89','90-99'} @attribute menopause {'lt40','ge40','premeno'} and relations, start with a keyword. @attribute tumor-size {'0-4','5-9','10-14','15-19','20-24','25-29','30-34','35-39','40-44','45- 49','50-54','55-59'} @Relation <data-name> @attribute inv-nodes {'0-2','3-5','6-8','9-11','12-14','15-17','18-20','21-23','24-26','27-29','30- 32','33-35','36-39'} @attribute <attribute-name> <type> or {range} @attribute node-caps {'yes','no'} @attribute deg-malig {'1','2','3'} @attribute breast {'left','right'} • the Data section lists the data records, starts with @attribute breast-quad {'left_up','left_low','right_up','right_low','central'} @attribute 'irradiat' {'yes','no'} @Data @attribute 'Class' {'no-recurrence-events','recurrence-events'} list of data instances % Part 2: data section @data • Any line start with % is the comments. '40-49','premeno','15-19','0-2','yes','3','right','left_up','no','recurrence-events' '50-59','ge40','15-19','0-2','no','1','right','central','no','no-recurrence-events' '50-59','ge40','35-39','0-2','no','2','left','left_low','no','recurrence-events' …… * source: http://archive.ics.uci.edu/ml/datasets/Breast+Cancer Data Mining & Statistics within the Health Services Weka Tutorial (Dr. Wenjia Wang) 17 Data Mining & Statistics within the Health Services Weka Tutorial (Dr. Wenjia Wang) 18 Dr. Wenjia Wang: Tutorial for DM tool Weka 3
  • 4. CMP: Data Mining and Statistics within the Health Services 19/02/2010

    4.1 WEKA Explorer
    • Click "Explorer" on the Weka GUI Chooser
    • In the Explorer window,
      – click the "Open File" button to open a data file from
        • the folder where your data files are stored,
          e.g. Breast Cancer data: breast_cancer.arff
        or (if you don't have this data set),
        • the data folder provided by the Weka package,
          e.g. C:\Program Files\Weka-3-6\data, using "iris.arff" or "weather.nominal.arff"

    Weka Explorer: open data file
    • Open the Breast Cancer data
    • Click an attribute, e.g. age, and its distribution will be displayed in a histogram.

    Weka Explorer: training classifiers
    After loading a data file, click "Classify"
    • Choose a classifier
      – Under "Classifier", click "Choose"; a drop-down menu appears
      – Click "trees" and select "J48" (a decision tree algorithm)
    • Select a test option
      – Select "Percentage split"
        • with the default ratio of 66% for training and 34% for testing
    • Click "Start" to train and test the classifier.
      – The training and testing information will be displayed in the classifier output window.

    Results
    • Testing results:
      • 97 cases used in the test.
        Correct: 66 (68%)
        Wrong: 31 (32%)

    Options for results and model
    • Point to the result list window and right-click.
    • A menu will pop up showing all the options available for the model.

    View the tree
    • Point to the result list window and right-click.
    • Choose "Visualize tree"; the tree will be displayed in another window.
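The figures on the Results slide follow from the percentage split and can be checked by hand. The sketch below assumes the training-set size is the instance count times the percentage, rounded to the nearest integer, with the remainder used for testing; the function name `percentage_split` is hypothetical. The instance count (286) and the correct/wrong counts (66 and 31) are taken from the slides.

```python
def percentage_split(n_instances, train_percent):
    """Return (n_train, n_test) for a percentage split.

    Assumption: the training size is rounded to the nearest integer
    and every remaining instance goes to the test set.
    """
    n_train = round(n_instances * train_percent / 100)
    return n_train, n_instances - n_train


# Breast cancer data: 286 instances, default split of 66% for training.
n_train, n_test = percentage_split(286, 66)

# Counts reported on the Results slide.
correct, wrong = 66, 31
accuracy = 100 * correct / n_test
error_rate = 100 * wrong / n_test

print(n_train, n_test, round(accuracy), round(error_rate))
```

Under this assumption the split leaves 97 test cases, which matches the slide, and 66 of 97 correct gives the reported 68% (with 32% wrong).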
  • 5. CMP: Data Mining and Statistics within the Health Services 19/02/2010

    View classifier errors
    • Right-click the result list
    • Choose "Visualize classifier errors"; a new window will pop up to display the classifier's errors.
      – Correctly predicted cases
      – Wrong cases

    Save the model and results
    • Right-click the result list
    • Choose "Save model" and "Save result buffer" to save the classifier and the results to a folder on disk.

    Train a neural net
    Click "Choose" to select another function, e.g. "MultilayerPerceptron" (a type of neural net).
    Then click "Start" to train and test it. (Note: the training may take much longer.)
    The results appear better than those of the tree classifier.

    View the model's ROC curve
    • Right-click the result: "MultilayerPerceptron"
    • Choose "Visualize threshold curve" and then the class "recurrence-events"
    • The ROC curve will be displayed.

    Select Attributes
    • Click "Select Attributes"
    • Choose an "Attribute Evaluator"
      – e.g. chi-squared
    • Choose a "Search Method"
    • Then click "Start"
    • The selected attributes are listed.

    4.2 Weka Experimenter
    • You can use the Experimenter to carry out experiments on multiple data sets using multiple methods, e.g. classifying
      • two data sets
        – Breast cancer
        – Iris
      • using two methods
        – Decision tree: J48
        – Logistic
    • The experiment is configured under "Setup", as shown in the screenshot.
    • Then click "Run"
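For the Select Attributes step above, a chi-squared evaluator scores each attribute by how strongly its values are associated with the class. The sketch below computes the standard Pearson chi-squared statistic for a nominal attribute against the class; it is not Weka's implementation, the function name `chi_squared` is hypothetical, and the four-instance toy data are invented purely for illustration (they are not rows of the breast cancer data set).

```python
from collections import Counter


def chi_squared(attribute_values, class_labels):
    """Pearson chi-squared statistic between a nominal attribute and the class."""
    n = len(class_labels)
    observed = Counter(zip(attribute_values, class_labels))
    attr_totals = Counter(attribute_values)
    class_totals = Counter(class_labels)
    statistic = 0.0
    for value, value_count in attr_totals.items():
        for label, label_count in class_totals.items():
            # Count expected in this cell if attribute and class were independent.
            expected = value_count * label_count / n
            statistic += (observed[(value, label)] - expected) ** 2 / expected
    return statistic


# Hypothetical toy data: four instances, two candidate attributes.
labels = ["recurrence", "recurrence", "no-recurrence", "no-recurrence"]
candidates = {
    "informative": ["yes", "yes", "no", "no"],        # tracks the class exactly
    "uninformative": ["left", "right", "left", "right"],  # unrelated to the class
}

scores = {name: chi_squared(values, labels) for name, values in candidates.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores, ranking)
```

A higher score means a stronger association with the class, so ranking attributes by this statistic puts the more predictive ones first.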
  • 6. CMP: Data Mining and Statistics within the Health Services 19/02/2010

    Analysis of the results
    • Click "Analyse" to analyse the results, e.g. with a paired t-test for significance
    • Click "Experiment"
    • Configure the test: choose an appropriate test and its parameters
    • Click "Perform test"; the test results are listed.

    4.3 KnowledgeFlow
    • Click "KnowledgeFlow" on the Weka GUI Chooser
    • A new window opens for building a KDD process.

    Steps for building a KDD process
    Major steps for building a process
    1. Add the required nodes
       1) Add nodes
       2) Add a data source node from "DataSources"
          1) Right-click to configure it with a data set
       3) Add a ClassAssigner node from "Evaluation" and a CrossValidationFoldMaker node
       4) Add a classifier, e.g. J48, from "Classifiers"
       5) Add a ClassifierPerformanceEvaluator node from "Evaluation"
       6) Add a text viewer from "Visualisation"
    2. Connect the nodes
       – Right-click the "DataSource" node and choose "DataSet", then connect it to the ClassAssigner node
       – Do the same or similar to connect the other nodes.
    3. Run the process (using the default setup for each node)
       – Right-click the DataSource node and choose "Start loading"; the process should run, and the "Status" window should indicate whether the run completed correctly.
    4. View the results
       – If the run completed correctly, right-click the "Text Viewer" node and choose "Show results"; another window pops up to show the results.

    A KDD process for Breast Cancer
    (screenshot of the assembled KnowledgeFlow process)

    Results of the KDD process
    • Right-click the "Text Viewer" node and choose "Show results"; another window pops up to show the results.

    5. Weka Tutorial Summary
    Weka is open source data mining software that offers
    • Several graphical interfaces for data mining
      – Explorer
      – Experimenter
      – KnowledgeFlow
    • Many functions and tools, including
      – Methods for classification:
        decision trees, rule learners, naive Bayes, decision tables, locally weighted regression, SVMs, instance-based learners, logistic regression, multi-layer perceptron
      – Methods for regression/prediction:
        linear regression, model tree generators, locally weighted regression, instance-based learners, decision tables, multi-layer perceptron
      – Ensemble schemes
        • Bagging, boosting, stacking, RandomForest
      – Methods for clustering:
        • K-means, EM and Cobweb
      – Methods for feature selection
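The paired t-test mentioned under "Analysis of the results" compares two methods on the same runs by testing whether their mean difference is distinguishable from zero. The sketch below computes the plain paired t statistic; it does not reproduce any corrected variant that the Experimenter may apply, the function name `paired_t_statistic` is hypothetical, and the accuracy values are invented for illustration rather than taken from an actual Weka run.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired samples of equal length."""
    differences = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(differences)
    # Standard error of the mean difference, using the sample standard deviation.
    standard_error = stdev(differences) / math.sqrt(n)
    return mean(differences) / standard_error, n - 1


# Hypothetical accuracies (%) of two classifiers on the same four runs.
method_a = [71, 74, 73, 76]
method_b = [70, 72, 70, 72]

t_value, dof = paired_t_statistic(method_a, method_b)
print(t_value, dof)
```

The resulting statistic is then compared with the critical value of the t distribution for the given degrees of freedom at the chosen significance level.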