8. Will John play golf?
Date     Weather   Temperature   Sally going?   Did John golf?
Sept 1   Sunny     92°F          Yes            Yes
Sept 2   Cloudy    84°F          No             No
Sept 3   Raining   84°F          No             Yes
Sept 4   Sunny     95°F          Yes            Yes

Date     Weather   Temperature   Sally going?   Will John golf?
Sept 5   Cloudy    87°F          No             ?
We want a model based on John’s past behavior to predict
what he will do in the future. Can we use ML?
10. ZeroR – establishes a baseline
Naïve Bayes – probabilistic model
OneR – single rule
J48 / C4.5 – decision tree
11. Upgrade our example
Data Set
• 319 instances (people)
• 25 attributes (variables): age, blood pressure, specific gravity, albumin, sugar, red blood cells, pus cell, pus cell clumps, potassium, blood glucose, blood urea, serum creatinine, sodium, hemoglobin, packed cell volume, white blood cell count, red blood cell count, hypertension, diabetes mellitus, coronary artery disease, appetite, pedal edema, anemia, stage
Machine Learning
• ZeroR
• OneR
• Naïve Bayes
• J48 / C4.5
Model
• Input: blood test data for new individuals with unknown disease status
• Output: predict whether the individual has CKD and, if so, the stage of their disease
12. ZeroR
• Build a frequency table of the classes in the past data (known outcome)
• Choose the 'most popular', i.e. most frequent, class
• Classify every new instance as that most popular class
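The ZeroR rule above fits in a few lines. A minimal sketch, using made-up class labels for illustration:

```python
from collections import Counter

def zero_r(labels):
    """ZeroR: build a frequency table of the classes and always
    predict the most frequent ('most popular') one."""
    return Counter(labels).most_common(1)[0][0]

# Hypothetical training labels for a CKD-style data set:
train_labels = ["stage 3", "stage 3", "stage 2", "healthy", "stage 3"]
rule = zero_r(train_labels)
# Every new instance, whatever its attributes, is classified as `rule`.
```

Note the attributes of a new instance are never consulted, which is exactly why ZeroR is only useful as a baseline.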
13. How did ZeroR do?
• Correctly classified 28.2% of the time
• Rule: always guess that a new instance (person) has stage 3 kidney disease
• The 28.2% correct classification rate is our baseline
• Correct classification rates above 28.2% are better than guessing
14. OneR
• Build a frequency table for each attribute; this generates a rule for each value of each attribute
• Choose the attribute whose rule has the highest correct classification rate
• Classify a new instance by applying that single rule
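The steps above can be sketched in a few lines, assuming categorical attributes; the data is the golf table from slide 8 (attribute names `weather` and `sally` are my shorthand):

```python
from collections import Counter, defaultdict

def one_r(rows, labels):
    """OneR sketch: build a frequency table for each attribute, derive one
    rule per attribute (each value -> its majority class), and keep the
    attribute whose rule classifies the most training rows correctly."""
    best_attr, best_rule, best_acc = None, None, -1.0
    for attr in rows[0]:
        table = defaultdict(Counter)            # value -> class counts
        for row, label in zip(rows, labels):
            table[row[attr]][label] += 1
        rule = {v: c.most_common(1)[0][0] for v, c in table.items()}
        correct = sum(c[rule[v]] for v, c in table.items())
        acc = correct / len(rows)
        if acc > best_acc:
            best_attr, best_rule, best_acc = attr, rule, acc
    return best_attr, best_rule, best_acc

# The golf example from slide 8:
rows = [{"weather": "Sunny",   "sally": "Yes"},
        {"weather": "Cloudy",  "sally": "No"},
        {"weather": "Raining", "sally": "No"},
        {"weather": "Sunny",   "sally": "Yes"}]
labels = ["Yes", "No", "Yes", "Yes"]
attr, rule, acc = one_r(rows, labels)
```

On this toy table the weather rule classifies all four days correctly, so OneR picks weather. (Real OneR implementations also discretize numeric attributes such as serum creatinine into ranges, which this sketch omits.)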
15. How did OneR do?
• Correctly classified 80.2% of the time
• Rule based on serum creatinine:
• < 0.85 is healthy
• < 1.15 is stage 2
• < 2.25 is stage 3
• >= 2.25 is stage 5
• A single rule is created and is responsible for all classification
• A high classification rate indicates that a single attribute has high influence in predicting the class
16. Naïve Bayes
• Build a frequency table for each attribute
• Determine the probability of each value of each attribute
• Determine the conditional probability of each value of each attribute given each class
• For a new instance, multiply the conditional probabilities of its attribute values by the prior probability of each class
• Choose the most probable class
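A minimal sketch of that calculation, again on the slide 8 golf table (attribute names are my shorthand; real implementations add Laplace smoothing so an unseen value does not zero out the whole product, which this sketch omits):

```python
from collections import Counter

def naive_bayes(rows, labels, new_row):
    """Naive Bayes sketch (categorical attributes, no smoothing):
    score(class) = P(class) * product over attributes of P(value | class);
    return the most probable class."""
    n = len(labels)
    class_counts = Counter(labels)
    best_class, best_score = None, -1.0
    for c, c_count in class_counts.items():
        score = c_count / n                      # prior P(class)
        for attr, value in new_row.items():
            # conditional P(value | class) from the frequency table
            match = sum(1 for row, lab in zip(rows, labels)
                        if lab == c and row[attr] == value)
            score *= match / c_count
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Predict Sept 5: Cloudy, Sally not going.
rows = [{"weather": "Sunny",   "sally": "Yes"},
        {"weather": "Cloudy",  "sally": "No"},
        {"weather": "Raining", "sally": "No"},
        {"weather": "Sunny",   "sally": "Yes"}]
labels = ["Yes", "No", "Yes", "Yes"]
prediction = naive_bayes(rows, labels, {"weather": "Cloudy", "sally": "No"})
```

Here the only cloudy day in the training data was a "No" day, so the model predicts John will not golf on Sept 5.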
17. How did Naïve Bayes do?
• Correctly classified 56.6% of the time
• The conditional and overall probabilities constitute the rule
• A high classification rate indicates the attributes have roughly equal influence
• No iterative process, so it is faster on larger data sets
18. J48 / C4.5
• Top-down recursive algorithm determines splitting points based on information gain
• Classify a new instance by following the decision tree to a leaf, i.e. a class
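The splitting criterion can be sketched as entropy and information gain; the tree builder recursively picks the highest-gain split. (C4.5 proper uses the gain *ratio* and handles numeric attributes and pruning, which this sketch omits; the data is again the slide 8 golf table with my shorthand attribute names.)

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Entropy of a class distribution, in bits."""
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Gain of splitting on `attr`: parent entropy minus the
    size-weighted entropy of the child nodes after the split."""
    children = defaultdict(list)
    for row, label in zip(rows, labels):
        children[row[attr]].append(label)
    weighted = sum(len(g) / len(labels) * entropy(g)
                   for g in children.values())
    return entropy(labels) - weighted

rows = [{"weather": "Sunny",   "sally": "Yes"},
        {"weather": "Cloudy",  "sally": "No"},
        {"weather": "Raining", "sally": "No"},
        {"weather": "Sunny",   "sally": "Yes"}]
labels = ["Yes", "No", "Yes", "Yes"]
# The attribute with the highest gain becomes the top split of the tree.
gain_weather = information_gain(rows, labels, "weather")
gain_sally = information_gain(rows, labels, "sally")
```

Splitting on weather leaves every child node pure, so it has the higher gain and would be chosen as the root.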
20. How did J48 do?
• Correctly classified 88.4% of the time
• A decision tree is generated
• Balances the discrimination of OneR against the fairness of Naïve Bayes
• Decision trees are popular, intuitive, easy to create, and easy to interpret
• People like decision trees; they tell a nice story
21. ZeroR
• Correct classification rate – 28.2%
• Established the baseline accuracy
• Always guess stage 3 CKD
Naïve Bayes
• Correct classification rate – 56.6%
• Uses overall and conditional probabilities to pick the most probable class
OneR
• Correct classification rate – 80.2%
• Serum creatinine:
• < 0.85 – Healthy
• < 1.15 – Stage 2
• < 2.25 – Stage 3
• >= 2.25 – Stage 5
J48 / C4.5
• Correct classification rate – 88.4%
24. Cross Validation
• Split the data into ten slices
• Hold out one slice and build the model on the other nine slices
• Test on the 'held out' slice
• Repeat, holding out a different slice each time, until every slice has been held out once
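The hold-out rotation above amounts to the following index bookkeeping, a minimal sketch of 10-fold splitting (real toolkits such as Weka also shuffle and stratify the slices, which this omits):

```python
def ten_fold_splits(n, k=10):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation:
    each slice is held out exactly once while the other k-1 slices train."""
    folds = [list(range(i, n, k))              # slice the indices into k folds
             for i in range(k)]
    for i in range(k):
        test = folds[i]                        # the 'held out' slice
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

# 319 instances, as in the CKD data set above:
splits = list(ten_fold_splits(319))
```

Averaging the ten test-slice accuracies gives an estimate of how the model will do on data it has never seen, which is what the reported classification rates measure.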
25. Overfitting
• A classification rule that is 'overfit' is so specific to the training data set that it does not generalize to the broader population
• Limiting the complexity of rules can help prevent overfitting
• Large, representative data sets help fight overfitting
• A persistent problem in machine learning
• Be a suspicious data scientist