2. What is Machine Learning?
Adapt to / learn from data
To optimize a performance function
Can be used to:
Extract knowledge from data
Learn tasks that are difficult to formalise
Create software that improves over time
5. When to learn
Human expertise does not exist (navigating on Mars)
Humans are unable to explain their expertise (speech
recognition)
Solution changes in time (routing on a computer network)
Solution needs to be adapted to particular cases (user biometrics)
Learning involves
Learning general models from data
Data is cheap and abundant. Knowledge is expensive and scarce
Customer transactions to consumer behaviour
Build a model that is a good and useful approximation to the data
6. Applications
Speech and hand-writing recognition
Autonomous robot control
Data mining and bioinformatics: motifs, alignment, …
Playing games
Fault detection
Clinical diagnosis
Spam email detection
Credit scoring, fraud detection
Web mining: search engines
Market basket analysis,
Applications are diverse but methods are generic
7. Generic methods
Learning from labelled data (supervised learning)
E.g. classification, regression, prediction, function approximation
Learning from unlabelled data (unsupervised learning)
E.g. clustering, visualisation, dimensionality reduction
Learning from sequential data
E.g. speech recognition, DNA data analysis
Associations
Reinforcement Learning
8. Statistical Learning
Machine learning methods can be unified within the
framework of statistical learning:
Data is considered to be a sample from a probability
distribution.
Typically, we don’t expect perfect learning but only
“probably correct” learning.
Statistical concepts are the key to measuring our expected
performance on novel problem instances.
9. Induction and inference
Induction: Generalizing from specific examples.
Inference: Drawing conclusions from possibly incomplete
knowledge.
Learning machines need to do both.
10. Inductive learning
Data produced by “target”.
Hypothesis learned from data in order to “explain”, “predict”, “model”
or “control” the target.
Generalisation ability is essential.
Inductive learning hypothesis:
“If the hypothesis works for enough data
then it will work on new examples.”
11. Example 1: Hand-written digits
Data representation: Greyscale images
Task: Classification (0,1,2,3…..9)
Problem features:
Highly variable inputs from same class including some
“weird” inputs,
imperfect human classification,
high cost associated with errors so “don’t know” may be
useful.
13. Example 2: Speech recognition
Data representation: features from spectral analysis of
speech signals (two in this simple example).
Problem features:
Highly variable data with same classification.
Good feature selection is very important.
Speech recognition is often broken into a number of
smaller tasks like this.
15. Example 3: DNA microarrays
DNA from ~10000 genes attached to a glass slide (the
microarray).
Green and red labels attached to mRNA from two
different samples.
mRNA is hybridized (stuck) to the DNA on the chip and
green/red ratio is used to measure relative abundance of
gene products.
17. DNA microarrays
Data representation: ~10000 Green/red intensity levels ranging
from 10-10000.
Tasks: Sample classification, gene classification, visualisation and
clustering of genes/samples.
Problem features:
High-dimensional data but relatively small number of examples.
Extremely noisy data (noise ~ signal).
Lack of good domain knowledge.
18. Projection of the 10000-dimensional data onto 2D using PCA
effectively separates cancer subtypes.
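A minimal sketch of this kind of projection using scikit-learn's PCA; the array shape (60 samples × 10000 gene intensities) and the random data are illustrative assumptions, not the actual microarray data.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 10000))    # stand-in for 60 samples x ~10000 gene intensities

pca = PCA(n_components=2)           # keep the two directions of largest variance
X_2d = pca.fit_transform(X)         # shape (60, 2): coordinates for a 2-D scatter plot

print(X_2d.shape, pca.explained_variance_ratio_)
```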
19. Probabilistic models
A large part of the module will deal with methods
that have an explicit probabilistic interpretation:
Good for dealing with uncertainty
e.g. is a handwritten digit a three or an eight?
Provides interpretable results
Unifies methods from different fields
20. Face Detection
1. Image pyramid used to locate faces of different sizes
2. Image lighting compensation
3. Neural Network detects rotation of face candidate
4. Final face candidate de-rotated ready for detection
21. Face Detection (cont’d)
5. Submit image to Neural Network
a. Break image into segments
b. Each segment is a unique input to the network
c. The network looks for certain patterns in each segment (eyes,
mouth, etc.)
6. Output is likelihood of a face
22. Supervised Learning: Uses
Prediction of future cases
Knowledge extraction
Compression of Data & knowledge
23. Unsupervised Learning
Clustering: grouping similar instances
Example applications
Customer segmentation in CRM
Learning patterns in bioinformatics
Clustering items based on similarity
Clustering users based on interests
24. Reinforcement Learning
Learning a policy: A sequence of outputs
No supervised output but delayed reward
Credit assignment problem
Game playing
Robot in a maze
Multiple agents, partial observability
25. ID3 Decision Tree
It is particularly interesting for
Its representation of learned knowledge
Its approach to the management of complexity
Its heuristic for selecting candidate concepts
Its potential for handling noisy data
27. ID3 Decision Tree
The previous table can be represented as the following
decision tree:
28. ID3 Decision Tree
In a decision tree, each internal node represents a test on some property
Each possible value of that property corresponds to a branch of the tree
Leaf nodes represent classifications, such as low or moderate risk
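A minimal sketch of this representation in Python; the class names and the split values below are illustrative, not the exact tree learned from the credit table.

```python
class Leaf:
    """Leaf node: holds a classification such as 'low', 'moderate' or 'high' risk."""
    def __init__(self, classification):
        self.classification = classification

class Node:
    """Internal node: tests one property; each possible value maps to a branch."""
    def __init__(self, prop, branches):
        self.prop = prop            # property tested at this node, e.g. 'income'
        self.branches = branches    # dict: property value -> subtree

def classify(tree, example):
    """Follow the branch matching the example's value at each test until a leaf."""
    while isinstance(tree, Node):
        tree = tree.branches[example[tree.prop]]
    return tree.classification

# Illustrative tree: test income first, then credit history on one branch.
tree = Node("income", {
    "0-15k": Leaf("high"),
    "15-35k": Node("credit history", {"bad": Leaf("high"),
                                      "unknown": Leaf("moderate"),
                                      "good": Leaf("moderate")}),
    "over-35k": Leaf("low"),
})
print(classify(tree, {"income": "over-35k"}))   # -> low
```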
30. ID3 Decision Tree
ID3 constructs decision trees in a top-down fashion.
ID3 selects a property to test at the current node of the
tree and uses this test to partition the set of examples
The algorithm recursively constructs a sub-tree for each
partition
This continues until all members of the partition are in
the same class
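A hedged sketch of that top-down construction; examples are assumed to be dicts of property → value with a "class" key, and information_gain is the helper sketched under slide 36 below.

```python
from collections import Counter

def id3(examples, properties):
    classes = [e["class"] for e in examples]
    if len(set(classes)) == 1:          # all members of the partition in the same class
        return classes[0]
    if not properties:                  # nothing left to test: fall back to majority class
        return Counter(classes).most_common(1)[0][0]
    # select the property giving the greatest information gain (slide 36)
    best = max(properties, key=lambda p: information_gain(examples, p))
    remaining = [p for p in properties if p != best]
    branches = {}
    for value in {e[best] for e in examples}:       # one branch per value of the property
        subset = [e for e in examples if e[best] == value]
        branches[value] = id3(subset, remaining)    # recursively build each sub-tree
    return {"test": best, "branches": branches}
```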
31. ID3 Decision Tree
For example, ID3 selects income as the root property for
the first step
33. ID3 Decision Tree
How do we select the first node (and the following
nodes)?
ID3 measures the information gained by making
each property the root of the current subtree
It picks the property that provides the greatest
information gain
34. ID3 Decision Tree
If we assume that all the examples in the table occur
with equal probability, then:
P(risk is high)=6/14
P(risk is moderate)=3/14
P(risk is low)=5/14
35. ID3 Decision Tree
Based on the general definition of information content,
I(M) = −Σ_{i=1}^{n} p(m_i) log2 p(m_i)
the information in the table is
Info(D) = I[6,3,5] = −(6/14) log2(6/14) − (3/14) log2(3/14) − (5/14) log2(5/14) ≈ 1.531
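This tiny snippet just reproduces the arithmetic above with Python's math.log2.

```python
import math

probs = [6/14, 3/14, 5/14]                       # P(high), P(moderate), P(low)
info = -sum(p * math.log2(p) for p in probs)     # I[6,3,5]
print(round(info, 3))                            # -> 1.531
```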
36. ID3 Decision Tree
The information gain from income is:
Gain(income) = I[6,3,5] − E[income] = 1.531 − 0.564 = 0.967
Similarly,
Gain(credit history) = 0.266
Gain(debt) = 0.063
Gain(collateral) = 0.206
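A hedged sketch of how these gains could be computed: entropy is the I(...) measure from slide 35, and the dict-of-properties representation of examples is an assumption of this sketch.

```python
import math
from collections import Counter

def entropy(labels):
    """I(M) = -sum_i p(m_i) log2 p(m_i), estimated from a list of class labels."""
    counts, total = Counter(labels), len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(examples, prop):
    """Gain(prop) = I(examples) - sum over values v of |S_v|/|S| * I(S_v)."""
    labels = [e["class"] for e in examples]
    total = len(examples)
    expected = 0.0                                       # E[prop]
    for value in {e[prop] for e in examples}:
        subset = [e["class"] for e in examples if e[prop] == value]
        expected += len(subset) / total * entropy(subset)
    return entropy(labels) - expected
```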
37. ID3 Decision Tree
Since income provides the greatest information gain, ID3
will select it as the root of the tree
39. Unsupervised Learning
The learning algorithms discussed so far implement
forms of supervised learning
They assume the existence of a teacher, some fitness
measure, or other external method of classifying training
instances
Unsupervised Learning eliminates the teacher and
requires that the learners form and evaluate concepts
on their own
40. Unsupervised Learning
Science is perhaps the best example of unsupervised
learning in humans
Scientists do not have the benefit of a teacher.
Instead, they propose hypotheses to explain
observations.
41. Unsupervised Learning
The result of this algorithm is a Binary Tree whose leaf
nodes are instances and whose internal nodes are
clusters of increasing size
We may also extend this algorithm to objects
represented as sets of symbolic features.
43. Machine Learning
Up till now: how to search or reason using a model
Machine learning: how to select a model on the basis of
data / experience
Learning parameters (e.g. probabilities)
Learning hidden concepts (e.g. clustering)
44. Classification
In classification, we learn to predict labels (classes) for
inputs
Examples:
Spam detection (input: document, classes: spam / ham)
OCR (input: images, classes: characters)
Medical diagnosis (input: symptoms, classes: diseases)
Automatic essay grader (input: document, classes: grades)
Fraud detection (input: account activity, classes: fraud / no fraud)
Customer service email routing
… many more
Classification is an important commercial technology!
45. Classification
Data:
Inputs x, class labels y
We imagine that x is something that has a lot of structure, like an
image or document
In the basic case, y is a simple N-way choice
Basic Setup:
Training data: D = bunch of <x,y> pairs
Feature extractors: functions fi which provide attributes of an
example x
Test data: more x’s, we must predict y’s
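A minimal sketch of this setup; the spam/ham task and the single word feature are illustrative assumptions, not part of the slides.

```python
# Training data D: a bunch of <x, y> pairs (here x = text, y = label).
training_data = [
    ("win money now", "spam"),
    ("meeting at noon", "ham"),
    ("cheap money offer", "spam"),
]

def f_contains_money(x):
    """One feature extractor f_i: an attribute of an example x."""
    return "money" in x

# Test data: more x's, and we must predict the y's (here with a trivial one-feature rule).
for x in ["free money", "lunch tomorrow"]:
    y_hat = "spam" if f_contains_money(x) else "ham"
    print(x, "->", y_hat)
```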
46. Bayes Nets for Classification
One method of classification:
Features are values for observed variables
Y is a query variable
Use probabilistic inference to compute most likely Y
47. Simple Classification
Simple example: two binary features S and F with a class variable M
This is a naïve Bayes model
(The slide's diagram contrasts the direct estimate, the Bayes estimate with no assumptions, and the conditional-independence factorisation used by naïve Bayes.)
48. General Naïve Bayes
A general naïve Bayes model: a class variable C with evidence variables E1 … En
Modelling the full joint P(C, E1, …, En) directly needs |C| × |E|^n parameters
The naïve Bayes factorisation needs only |C| parameters for the prior P(C) plus n × |E| × |C| parameters for the conditionals P(Ei | C)
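To make the comparison concrete, here is the count for assumed sizes |C| = 2, |E| = 2 and n = 20 (these numbers are illustrative only).

```python
C, E, n = 2, 2, 20                 # assumed sizes, for illustration only

full_joint = C * E**n              # |C| x |E|^n: modelling P(C, E1..En) directly
naive_bayes = C + n * E * C        # |C| prior parameters plus n x |E| x |C| conditionals

print(full_joint)                  # 2097152
print(naive_bayes)                 # 82
```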
49. Inference for Naïve Bayes
Goal: compute posterior over causes
Step 1: get joint probability of causes and evidence
Step 2: get probability of evidence
Step 3: renormalize
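A hedged sketch of those three steps for a tiny naïve Bayes model; the spam example and every probability value are made up for illustration.

```python
priors = {"spam": 0.4, "ham": 0.6}                      # P(Y)
likelihoods = {                                         # P(feature observed | Y)
    "spam": {"money": 0.7, "meeting": 0.1},
    "ham":  {"money": 0.05, "meeting": 0.5},
}
evidence = ["money"]                                    # observed features

# Step 1: joint probability of each cause with the evidence: P(Y) * prod_i P(e_i | Y)
joint = {}
for y, p_y in priors.items():
    p = p_y
    for e in evidence:
        p *= likelihoods[y][e]
    joint[y] = p

# Step 2: probability of the evidence: P(e) = sum over Y of the joint
p_evidence = sum(joint.values())

# Step 3: renormalise to get the posterior P(Y | e)
posterior = {y: p / p_evidence for y, p in joint.items()}
print(posterior)    # spam comes out around 0.90 with these made-up numbers
```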
52. Parameter Estimation
Estimating the distribution of a random variable X or X|Y
Empirically: use training data
For each value x, look at the empirical rate of that value: P(x) = count(x) / (total samples)
This estimate maximizes the likelihood of the data
Elicitation: ask a human!
Usually need domain experts, and sophisticated ways of eliciting
probabilities (e.g. betting games)
Trouble calibrating
Example sample: r, g, g → P(r) = 1/3, P(g) = 2/3
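Applied to that small r, g, g sample, the empirical (maximum-likelihood) estimate is just a normalised count:

```python
from collections import Counter

observations = ["r", "g", "g"]          # the sample shown on the slide
counts = Counter(observations)
total = len(observations)

estimates = {x: c / total for x, c in counts.items()}
print(estimates)                         # {'r': 0.333..., 'g': 0.666...}
```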
56. Expectation Maximization (EM)
when to use
data is only partially observable
unsupervised clustering: target value unobservable
supervised learning: some instance attributes
unobservable
applications
training Bayesian Belief Networks
unsupervised clustering
learning hidden Markov models
57. Generating Data from Mixture of
Gaussians
Each instance x is generated by
choosing one of the k Gaussians at random,
then generating an instance according to that Gaussian
58. EM for Estimating k Means
Given:
instances from X generated by mixture of k Gaussians
unknown means <m1,…,mk> of the k Gaussians
don’t know which instance xi was generated by which
Gaussian
Determine:
maximum likelihood estimates of <m1,…,mk>
Think of full description of each instance as yi=<xi,zi1,zi2>
zij is 1 if xi generated by j-th Gaussian
xi observable
zij unobservable
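A hedged sketch of this EM procedure for one-dimensional data with k = 2 Gaussians of known, equal variance; the synthetic data, sigma and iteration count are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 100), rng.normal(3, 1, 100)])  # mixture sample
k, sigma = 2, 1.0
means = rng.normal(0.0, 1.0, k)                  # initial guesses for <m1, ..., mk>

for _ in range(50):
    # E step: E[z_ij] = P(j-th Gaussian generated x_i), given the current means
    resp = np.exp(-0.5 * ((x[:, None] - means[None, :]) / sigma) ** 2)
    resp /= resp.sum(axis=1, keepdims=True)
    # M step: re-estimate each mean as the responsibility-weighted average of the data
    means = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)

print(np.sort(means))                            # should end up near -2 and 3
```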
59. EM Algorithm
Converges to local maximum likelihood and provides
estimates of hidden variables zij.
In fact, this is a local maximum of E[ln P(Y|h)]
Y is the complete data (observable plus unobservable
variables)
The expected value is taken over the possible values of the
unobserved variables in Y
60. General EM Problem
Given:
observed data X = {x1,…,xm}
unobserved data Z = {z1,…,zm}
parameterized probability distribution P(Y|h) where
Y = {y1,…,ym} is the full data yi=<xi,zi>
h are the parameters
Determine:
h that (locally) maximizes E[ln P(Y|h)]
Applications:
train Bayesian Belief Networks
unsupervised clustering
hidden Markov models
61. General EM Method
Define a likelihood function Q(h’|h) over the full data
Y = X ∪ Z, using the observed X and the current parameters h
to estimate Z
Q(h’|h) = E[ln P(Y|h’) | h, X]
EM algorithm:
Estimation (E) step: Calculate Q(h’|h) using the current
hypothesis h and the observed data X to estimate the
probability distribution over Y.
Q(h’|h) = E[ln P(Y|h’) | h, X]
Maximization (M) step: Replace hypothesis h by the
hypothesis h’ that maximizes this Q function.
h ← argmax_{h’ ∈ H} Q(h’|h)