Wekatutorial

CMP: Data Mining and Statistics within the Health Services 19/02/2010

Data Mining and Statistics
Content
Within the Health Services
1. Introduction to Weka
Tutorial for Weka 2. Data Mining Functions and Tools
3. Data Format
a data mining tool 4. Hands-on Demos
4.1 Weka Explorer
Dr. Wenjia Wang • Classification
• Attribute( feature) Selection
School of Computing Sciences 4.2 Weka Experimenter
University of East Anglia 4.3 Weka KnowledgeFlow
5. Summary
Data Pre-processing Data Mining Knowledge

Data Mining & Statistics within the Health Services
Data Mining & Statistics within the Health Services Weka Tutorial (Dr. Wenjia Wang) 2

1. Introduction to WEKA Weka Main Features
• A collection of open source of many data • 49 data preprocessing tools
mining and machine learning algorithms, • 76 classification/regression algorithms
including • 8 clustering algorithms
– pre-processing on data • 15 attribute/subset evaluators + 10 search
– Classification: algorithms for feature selection.
– clustering • 3 algorithms for finding association rules
– association rule extraction • 3 graphical user interfaces
– “The Explorer” (exploratory data analysis)
• Created by researchers at the University of
– “The Experimenter” (experimental environment)
Waikato in New Zealand – “The KnowledgeFlow” (new process model inspired
• Java based (also open source). interface)
Data Mining & Statistics within the Health Services Weka Tutorial (Dr. Wenjia Wang) 3 Data Mining & Statistics within the Health Services Weka Tutorial (Dr. Wenjia Wang) 4

Weka: Download and Installation Start the Weka

• Download Weka (the stable version) from • From windows desktop,
http://www.cs.waikato.ac.nz/ml/weka/ – click “Start”, choose “All programs”,
– Choose a self-extracting executable (including Java VM) – Choose “Weka 3.6” to start Weka
– Then the first interface
– (If you are interested in modifying/extending weka there window appears:
is a developer version that includes the source code)
Weka GUI Chooser.
• After download is completed, run the self-
extracting file to install Weka, and use the default
set-ups.


Dr. Wenjia Wang: Tutorial for DM tool Weka 1


WEKA Application Interfaces Weka Application Interfaces
• Explorer
– preprocessing, attribute selection, learning, visualiation
• Experimenter
– testing and evaluating machine learning algorithms
• Knowledge Flow
– visual design of KDD process
– Explorer
• Simple Command-line
– A simple interface for typing commands


Load data file and
2. Weka Functions and Tools
Preprocessing
• Preprocessing Filters • Load data file in formats: ARFF, CSV, C4.5,
binary
• Attribute selection
• Import from URL or SQL database (using JDBC)
• Classification/Regression
• Preprocessing filters
• Clustering – Adding/removing attributes
• Association discovery – Attribute value substitution
– Discretization
• Visualization
– Time series filters (delta, shift)
– Sampling, randomization
– Missing value management
– Normalization and other numeric transformations

Feature Selection Classification
• Very flexible: arbitrary combination of search and • Predicted target must be categorical
evaluation methods • Implemented methods
• Search methods – decision trees(J48, etc.) and rules
– best-first – Naïve Bayes
– genetic – neural networks
– ranking ...
– instance-based classifiers …
• Evaluation measures
• Evaluation methods
– ReliefF
– test data set
– information gain
– gain ratio – crossvalidation
• Demo data: weather_nominal.arff • Demo data: iris, contact lenses, labor, soybeans,
etc.



Clustering Regression
• Implemented methods
– k-Means • Predicted target is continuous
– EM • Methods
– Cobweb
– X-means – linear regression
– FarthestFirst… – neural networks
• Clusters can be visualized and compared to “true” – regression trees …
clusters (if given)
• Demo data: • Demo data: cpu.arff,
– any classification data may be used for clustering when
its class attribute is filtered out.


Weka: Pros and cons 3. WEKA data formats
• pros • Data can be imported from a file in various
– Open source, formats:
• Free – ARFF (Attribute Relation File Format) has two sections:
• Extensible • the Header information defines attribute name, type and
• Can be integrated into other java packages relations.
– GUIs (Graphic User Interfaces) • the Data section lists the data records.
• Relatively easier to use – CSV: Comma Separated Values (text file)
– Features – C4.5: A format used by a decision induction algorithm
• Run individual experiment, or C4.5, requires two separated files
• Build KDD phases • Name file: defines the names of the attributes
• Cons • Date file: lists the records (samples)
– Lack of proper and adequate documentations – binary
– Systems are updated constantly (Kitchen Sink Syndrome) • Data can also be read from a URL or from an
SQL database (using JDBC)

Attribute Relation File Format (arff) Breast Cancer data in ARFF
% Breast Cancer data*: 286 instances (no-recurrence-events: 201, recurrence-
An ARFF file consists of two distinct sections: events: 85)
% Part 1: Definitions of attribute name, types and relations
• the Header section defines attribute name, type @relation breast-cancer
@attribute age {'10-19','20-29','30-39','40-49','50-59','60-69','70-79','80-89','90-99'}
@attribute menopause {'lt40','ge40','premeno'}
and relations, start with a keyword. @attribute tumor-size {'0-4','5-9','10-14','15-19','20-24','25-29','30-34','35-39','40-44','45-
49','50-54','55-59'}
@Relation <data-name> @attribute inv-nodes {'0-2','3-5','6-8','9-11','12-14','15-17','18-20','21-23','24-26','27-29','30-
32','33-35','36-39'}
@attribute <attribute-name> <type> or {range} @attribute node-caps {'yes','no'}
@attribute deg-malig {'1','2','3'}
@attribute breast {'left','right'}
• the Data section lists the data records, starts with @attribute breast-quad {'left_up','left_low','right_up','right_low','central'}
@attribute 'irradiat' {'yes','no'}
@Data @attribute 'Class' {'no-recurrence-events','recurrence-events'}

list of data instances % Part 2: data section
@data
• Any line start with % is the comments. '40-49','premeno','15-19','0-2','yes','3','right','left_up','no','recurrence-events'
'50-59','ge40','15-19','0-2','no','1','right','central','no','no-recurrence-events'
'50-59','ge40','35-39','0-2','no','2','left','left_low','no','recurrence-events'
……
* source: http://archive.ics.uci.edu/ml/datasets/Breast+Cancer




4.1 WEKA Explorer Weka Explorer: open data file
• Open
• Click the Explorer on Weka GUI Chooser Breast
Cancer
• On the Explorer window, data
– click button “Open File” to open a data file • Click an
from attribute,
e.g. age,
• the folder where your data files stored. then its
e.g. Breast Cancer data: breast_cancer.arff distributio
n will be
Or (if you don’t have this data set), displayed
• the data folder provided by the weka package: in a
histogra
e.g. C:Program FilesWeka-3-6data m.
using “iris.arff” or “weather_nominal.arff”


Weka Explorer: training classifiers Results

After loaded a data file, click “Classify” • Testing
• Choose a classifier, results:
– Under “Classifier”: click “choose”, then a drop-down • 97 cases
used in
menu appears,
test.
– Click “trees” and select “J48” – a decision tree Correct:
algorithm 66 (68%)
• Select a test option Wrong:
31 (32%)
– Select “percentage split”
• with default ratio 66% for training and 34% for testing
• Click “Start” to train and test the classifier.
– The training and testing information will be displayed
in classifier output window.

Options for results and model View the tree
• Point to • Point to
result result list
list window,
window, and right
and
right
click
click mouse,
mouse. • Choose
“visualiz
• A menu e tree”,
will pop then the
out to tree will
show all be
the displayed
options in
availabl another
e about
the
window.
model.



View classifier errors Save the model and results
• right click the • Right
result list, click on
the
• Choose result
“visualize list
classifier
error”, then a • Choose
new window will “save
model”
be popped out and
to display the “save
classifier’s error. result
buffer”
– Correctly to save
predicted the
classifie
cases r and
– Wrong the
results
cases to the
disk
folder.

Train a neural net View the model’s ROC curve
Click “Choose”
to select
• Right click
another the result:
function, “Multiplaye
e.g. “Multilayer rPerceptro
Perceptron”
- a type of n”
neural net. • Choose
Then click “Start” “visualize
to train and
test it. (note: threshold
the training curve” and
may take
much longer
“recurrent
time.) events”;
• The ROC
The results
seem better
curve will
than the tree be
classifier. displayed.

Select Attributes 4.2 Weka Experimenter
• Click “Select • you can use Experimenter
Attributes” to carry out experiments
for multiple data sets
• Choose an using multiple methods,
“attribute e.g. classifying
evaluator” • two data sets
– e.g. chiSquare – Breast cancer
• Choose a – Iris
“Search • Using two methods
Method” – Decision Tree: J48
• Then click – Logistic
“Start” • The experiment is “Setup”
as shown in the
• The selected screenshot.
attributes are • Then click “Run”
listed.



Analysis of the results 4.3 KnowledgeFlow
• Click • Click KnowledgeFlow on Weka GUI Chooser
“analysis” to
analyse the • A new window opened for buidling KDD process.
results,
E.g.
paired t-test
significance
• Click
“Experiment”
• Configure
test: choosing
appropriate
test and
parameters
• Click
“Perform test”
and the test
results are
listed.

Steps for building a KDD A KDD process for Breast
process Cancer
Major steps for building a process
1. Adding required nodes
1) Add nodes
2) Add a data source node from “DataSources”
1) Right click to configure it with a data set
3) Add a classAssigner node from “Evaluation” and a CrossValidationFoldmaker node
4) Add a classifier, e.g. J48, from Classifiers
5) Add a classiferPerformanceEvaluator node from “Evaluation”
6) Add a text viewer from “Visualisation”
2. Connect the nodes
– Right click “DataSource” node and choose DataSet, then connect it to the
ClassAssigner node,
– do the same or similar for connecting between the other nodes.
3. Run the process (using the default setups for each node)
– Right click DataSource node and choose “Start loading”, the process should run and
“Status” window should indicate if the run is correct and completed.
4. View the results:
– If the run is correctly completed, right click “Text Viewer” node and choose “Show
results”, then another window pops out to show the results.


Results of the KDD process 5. Weka Tutorial Summary
Weka is open source data mining software that offers
• right click
“Text • Some GUI interfaces for data mining
Viewer” – Explorer
node and – Experimenter
choose – KnowledgeFlow
“Show • Many functions and tools that include
results”, – Methods for classification:
then decision trees, rule learners, naive Bayes, decision tables, locally weighted
regression, SVMs, instance-based learners, logistic regression, multi-layer
another perceptron
window – methods for regression/prediction:
pops out linear regression, model tree generators, locally weighted regression, instance-
to show based learners, decision tables, multi-layer perceptron
the – Ensemble schemes
• Bagging, boosting, stacking, RandomFrest
results.
– Methods for clustering:
• K-means, EM and Cobweb
– Methods for feature selection



Wekatutorial

Recommandé

Recommandé

Contenu connexe

Similaire à Wekatutorial

Similaire à Wekatutorial (20)

Dernier

Dernier (20)

Wekatutorial