2. What is Machine Learning?
Adapt to / learn from data
To optimize a performance function
Can be used to:
Extract knowledge from data
Learn tasks that are difficult to formalise
Create software that improves over time
5. When to learn
Human expertise does not exist (navigating on Mars)
Humans are unable to explain their expertise (speech
recognition)
Solution changes in time (routing on a computer network)
Solution needs to be adapted to particular cases (user biometrics)
Learning involves
Learning general models from data
Data is cheap and abundant. Knowledge is expensive and scarce
Customer transactions to consumer behaviour
Build a model that is a good and useful approximation to the data
6. Applications
Speech and hand-writing recognition
Autonomous robot control
Data mining and bioinformatics: motifs, alignment, …
Playing games
Fault detection
Clinical diagnosis
Spam email detection
Credit scoring, fraud detection
Web mining: search engines
Market basket analysis,
Applications are diverse but methods are generic
7. Generic methods
Learning from labelled data (supervised learning)
E.g. classification, regression, prediction, function approximation
Learning from unlabelled data (unsupervised learning)
E.g. clustering, visualisation, dimensionality reduction
Learning from sequential data
E.g. speech recognition, DNA data analysis
Associations
Reinforcement Learning
8. Statistical Learning
Machine learning methods can be unified within the
framework of statistical learning:
Data is considered to be a sample from a probability
distribution.
Typically, we don’t expect perfect learning but only
“probably correct” learning.
Statistical concepts are the key to measuring our expected
performance on novel problem instances.
9. Induction and inference
Induction: Generalizing from specific examples.
Inference: Drawing conclusions from possibly incomplete
knowledge.
Learning machines need to do both.
10. Inductive learning
Data produced by “target”.
Hypothesis learned from data in order to “explain”, “predict”, “model”
or “control” the target.
Generalisation ability is essential.
Inductive learning hypothesis:
“If the hypothesis works for enough data
then it will work on new examples.”
11. Example 1: Hand-written digits
Data representation: Greyscale images
Task: Classification (0,1,2,3…..9)
Problem features:
Highly variable inputs from same class including some
“weird” inputs,
imperfect human classification,
high cost associated with errors so “don’t know” may be
useful.
13. Example 2: Speech recognition
Data representation: features from spectral analysis of
speech signals (two in this simple example).
Problem features:
Highly variable data with same classification.
Good feature selection is very important.
Speech recognition is often broken into a number of
smaller tasks like this.
15. Example 3: DNA microarrays
DNA from ~10000 genes attached to a glass slide (the
microarray).
Green and red labels attached to mRNA from two
different samples.
mRNA is hybridized (stuck) to the DNA on the chip and
green/red ratio is used to measure relative abundance of
gene products.
17. DNA microarrays
Data representation: ~10000 Green/red intensity levels ranging
from 10-10000.
Tasks: Sample classification, gene classification, visualisation and
clustering of genes/samples.
Problem features:
High-dimensional data but relatively small number of examples.
Extremely noisy data (noise ~ signal).
Lack of good domain knowledge.
18. Projection of the 10000-dimensional data onto 2D using PCA
effectively separates cancer subtypes.
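A minimal sketch of this kind of projection using scikit-learn's PCA; the array shape (60 samples × 10000 gene intensities) and the random data are illustrative assumptions, not the actual microarray data.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 10000))    # stand-in for 60 samples x ~10000 gene intensities

pca = PCA(n_components=2)           # keep the two directions of largest variance
X_2d = pca.fit_transform(X)         # shape (60, 2): coordinates for a 2-D scatter plot

print(X_2d.shape, pca.explained_variance_ratio_)
```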
19. Probabilistic models
A large part of the module will deal with methods
that have an explicit probabilistic interpretation:
Good for dealing with uncertainty
e.g. is a handwritten digit a three or an eight?
Provides interpretable results
Unifies methods from different fields
20. Face Detection
1. Image pyramid used to locate faces of different sizes
2. Image lighting compensation
3. Neural Network detects rotation of face candidate
4. Final face candidate de-rotated ready for detection
21. Face Detection (cont’d)
5. Submit image to Neural Network
a. Break image into segments
b. Each segment is a unique input to the network
c. The network looks for certain patterns in each segment (eyes,
mouth, etc.)
6. Output is likelihood of a face
22. Supervised Learning: Uses
Prediction of future cases
Knowledge extraction
Compression of Data & knowledge
23. Unsupervised Learning
Clustering: grouping similar instances
Example applications
Customer segmentation in CRM
Learning patterns in bioinformatics
Clustering items based on similarity
Clustering users based on interests
24. Reinforcement Learning
Learning a policy: A sequence of outputs
No supervised output but delayed reward
Credit assignment problem
Game playing
Robot in a maze
Multiple agents, partial observability
25. ID3 Decision Tree
It is particularly interesting for
Its representation of learned knowledge
Its approach to the management of complexity
Its heuristic for selecting candidate concepts
Its potential for handling noisy data
27. ID3 Decision Tree
The previous table can be represented as the following
decision tree:
28. ID3 Decision Tree
In a decision tree, each internal node represents a test on some property
Each possible value of that property corresponds to a branch of the tree
Leaf nodes represent classifications, such as low or moderate risk
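A minimal sketch of this representation in Python; the class names and the split values below are illustrative, not the exact tree learned from the credit table.

```python
class Leaf:
    """Leaf node: holds a classification such as 'low', 'moderate' or 'high' risk."""
    def __init__(self, classification):
        self.classification = classification

class Node:
    """Internal node: tests one property; each possible value maps to a branch."""
    def __init__(self, prop, branches):
        self.prop = prop            # property tested at this node, e.g. 'income'
        self.branches = branches    # dict: property value -> subtree

def classify(tree, example):
    """Follow the branch matching the example's value at each test until a leaf."""
    while isinstance(tree, Node):
        tree = tree.branches[example[tree.prop]]
    return tree.classification

# Illustrative tree: test income first, then credit history on one branch.
tree = Node("income", {
    "0-15k": Leaf("high"),
    "15-35k": Node("credit history", {"bad": Leaf("high"),
                                      "unknown": Leaf("moderate"),
                                      "good": Leaf("moderate")}),
    "over-35k": Leaf("low"),
})
print(classify(tree, {"income": "over-35k"}))   # -> low
```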
30. ID3 Decision Tree
ID3 constructs decision trees in a top-down fashion.
ID3 selects a property to test at the current node of the
tree and uses this test to partition the set of examples
The algorithm recursively constructs a sub-tree for each
partition
This continues until all members of the partition are in
the same class
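A hedged sketch of that top-down construction; examples are assumed to be dicts of property → value with a "class" key, and information_gain is the helper sketched under slide 36 below.

```python
from collections import Counter

def id3(examples, properties):
    classes = [e["class"] for e in examples]
    if len(set(classes)) == 1:          # all members of the partition in the same class
        return classes[0]
    if not properties:                  # nothing left to test: fall back to majority class
        return Counter(classes).most_common(1)[0][0]
    # select the property giving the greatest information gain (slide 36)
    best = max(properties, key=lambda p: information_gain(examples, p))
    remaining = [p for p in properties if p != best]
    branches = {}
    for value in {e[best] for e in examples}:       # one branch per value of the property
        subset = [e for e in examples if e[best] == value]
        branches[value] = id3(subset, remaining)    # recursively build each sub-tree
    return {"test": best, "branches": branches}
```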
31. ID3 Decision Tree
For example, ID3 selects income as the root property for
the first step
33. ID3 Decision Tree
How do we select the first node (and the following
nodes)?
ID3 measures the information gained by making
each property the root of the current subtree
It picks the property that provides the greatest
information gain
34. ID3 Decision Tree
If we assume that all the examples in the table occur
with equal probability, then:
P(risk is high)=6/14
P(risk is moderate)=3/14
P(risk is low)=5/14
35. ID3 Decision Tree
Based on the general definition of information content,
I(M) = −Σ_{i=1}^{n} p(m_i) log2 p(m_i)
the information in the table is
Info(D) = I[6,3,5] = −(6/14) log2(6/14) − (3/14) log2(3/14) − (5/14) log2(5/14) ≈ 1.531
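This tiny snippet just reproduces the arithmetic above with Python's math.log2.

```python
import math

probs = [6/14, 3/14, 5/14]                       # P(high), P(moderate), P(low)
info = -sum(p * math.log2(p) for p in probs)     # I[6,3,5]
print(round(info, 3))                            # -> 1.531
```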
36. ID3 Decision Tree
The information gain from income is:
Gain(income) = I[6,3,5] − E[income] = 1.531 − 0.564 = 0.967
Similarly,
Gain(credit history) = 0.266
Gain(debt) = 0.063
Gain(collateral) = 0.206
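A hedged sketch of how these gains could be computed: entropy is the I(...) measure from slide 35, and the dict-of-properties representation of examples is an assumption of this sketch.

```python
import math
from collections import Counter

def entropy(labels):
    """I(M) = -sum_i p(m_i) log2 p(m_i), estimated from a list of class labels."""
    counts, total = Counter(labels), len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(examples, prop):
    """Gain(prop) = I(examples) - sum over values v of |S_v|/|S| * I(S_v)."""
    labels = [e["class"] for e in examples]
    total = len(examples)
    expected = 0.0                                       # E[prop]
    for value in {e[prop] for e in examples}:
        subset = [e["class"] for e in examples if e[prop] == value]
        expected += len(subset) / total * entropy(subset)
    return entropy(labels) - expected
```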
37. ID3 Decision Tree
Since income provides the greatest information gain, ID3
will select it as the root of the tree
39. Unsupervised Learning
The learning algorithms discussed so far implement
forms of supervised learning
They assume the existence of a teacher, some fitness
measure, or other external method of classifying training
instances
Unsupervised Learning eliminates the teacher and
requires that the learners form and evaluate concepts
on their own
40. Unsupervised Learning
Science is perhaps the best example of unsupervised
learning in humans
Scientists do not have the benefit of a teacher.
Instead, they propose hypotheses to explain
observations.
41. Unsupervised Learning
The result of this algorithm is a Binary Tree whose leaf
nodes are instances and whose internal nodes are
clusters of increasing size
We may also extend this algorithm to objects
represented as sets of symbolic features.
43. Machine Learning
Up till now: how to search or reason using a model
Machine learning: how to select a model on the basis of
data / experience
Learning parameters (e.g. probabilities)
Learning hidden concepts (e.g. clustering)
44. Classification
In classification, we learn to predict labels (classes) for
inputs
Examples:
Spam detection (input: document, classes: spam / ham)
OCR (input: images, classes: characters)
Medical diagnosis (input: symptoms, classes: diseases)
Automatic essay grader (input: document, classes: grades)
Fraud detection (input: account activity, classes: fraud / no fraud)
Customer service email routing
… many more
Classification is an important commercial technology!
45. Classification
Data:
Inputs x, class labels y
We imagine that x is something that has a lot of structure, like an
image or document
In the basic case, y is a simple N-way choice
Basic Setup:
Training data: D = bunch of <x,y> pairs
Feature extractors: functions fi which provide attributes of an
example x
Test data: more x’s, we must predict y’s
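A minimal sketch of this setup; the spam/ham task and the single word feature are illustrative assumptions, not part of the slides.

```python
# Training data D: a bunch of <x, y> pairs (here x = text, y = label).
training_data = [
    ("win money now", "spam"),
    ("meeting at noon", "ham"),
    ("cheap money offer", "spam"),
]

def f_contains_money(x):
    """One feature extractor f_i: an attribute of an example x."""
    return "money" in x

# Test data: more x's, and we must predict the y's (here with a trivial one-feature rule).
for x in ["free money", "lunch tomorrow"]:
    y_hat = "spam" if f_contains_money(x) else "ham"
    print(x, "->", y_hat)
```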
46. Bayes Nets for Classification
One method of classification:
Features are values for observed variables
Y is a query variable
Use probabilistic inference to compute most likely Y
47. Simple Classification
Simple example: two binary features S and F with a class variable M
This is a naïve Bayes model
(The slide's diagram contrasts the direct estimate, the Bayes estimate with no assumptions, and the conditional-independence factorisation used by naïve Bayes.)
48. General Naïve Bayes
A general naïve Bayes model: a class variable C with evidence variables E1 … En
Modelling the full joint P(C, E1, …, En) directly needs |C| × |E|^n parameters
The naïve Bayes factorisation needs only |C| parameters for the prior P(C) plus n × |E| × |C| parameters for the conditionals P(Ei | C)
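To make the comparison concrete, here is the count for assumed sizes |C| = 2, |E| = 2 and n = 20 (these numbers are illustrative only).

```python
C, E, n = 2, 2, 20                 # assumed sizes, for illustration only

full_joint = C * E**n              # |C| x |E|^n: modelling P(C, E1..En) directly
naive_bayes = C + n * E * C        # |C| prior parameters plus n x |E| x |C| conditionals

print(full_joint)                  # 2097152
print(naive_bayes)                 # 82
```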
49. Inference for Naïve Bayes
Goal: compute posterior over causes
Step 1: get joint probability of causes and evidence
Step 2: get probability of evidence
Step 3: renormalize
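A hedged sketch of those three steps for a tiny naïve Bayes model; the spam example and every probability value are made up for illustration.

```python
priors = {"spam": 0.4, "ham": 0.6}                      # P(Y)
likelihoods = {                                         # P(feature observed | Y)
    "spam": {"money": 0.7, "meeting": 0.1},
    "ham":  {"money": 0.05, "meeting": 0.5},
}
evidence = ["money"]                                    # observed features

# Step 1: joint probability of each cause with the evidence: P(Y) * prod_i P(e_i | Y)
joint = {}
for y, p_y in priors.items():
    p = p_y
    for e in evidence:
        p *= likelihoods[y][e]
    joint[y] = p

# Step 2: probability of the evidence: P(e) = sum over Y of the joint
p_evidence = sum(joint.values())

# Step 3: renormalise to get the posterior P(Y | e)
posterior = {y: p / p_evidence for y, p in joint.items()}
print(posterior)    # spam comes out around 0.90 with these made-up numbers
```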
52. Parameter Estimation
Estimating the distribution of a random variable X or X|Y
Empirically: use training data
For each value x, look at the empirical rate of that value: P(x) = count(x) / (total samples)
This estimate maximizes the likelihood of the data
Elicitation: ask a human!
Usually need domain experts, and sophisticated ways of eliciting
probabilities (e.g. betting games)
Trouble calibrating
Example sample: r, g, g → P(r) = 1/3, P(g) = 2/3
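Applied to that small r, g, g sample, the empirical (maximum-likelihood) estimate is just a normalised count:

```python
from collections import Counter

observations = ["r", "g", "g"]          # the sample shown on the slide
counts = Counter(observations)
total = len(observations)

estimates = {x: c / total for x, c in counts.items()}
print(estimates)                         # {'r': 0.333..., 'g': 0.666...}
```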
56. Expectation Maximization (EM)
when to use
data is only partially observable
unsupervised clustering: target value unobservable
supervised learning: some instance attributes
unobservable
applications
training Bayesian Belief Networks
unsupervised clustering
learning hidden Markov models
57. Generating Data from Mixture of
Gaussians
Each instance x is generated by
choosing one of the k Gaussians at random,
then generating an instance according to that Gaussian
58. EM for Estimating k Means
Given:
instances from X generated by mixture of k Gaussians
unknown means <m1,…,mk> of the k Gaussians
don’t know which instance xi was generated by which
Gaussian
Determine:
maximum likelihood estimates of <m1,…,mk>
Think of full description of each instance as yi=<xi,zi1,zi2>
zij is 1 if xi generated by j-th Gaussian
xi observable
zij unobservable
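A hedged sketch of this EM procedure for one-dimensional data with k = 2 Gaussians of known, equal variance; the synthetic data, sigma and iteration count are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 100), rng.normal(3, 1, 100)])  # mixture sample
k, sigma = 2, 1.0
means = rng.normal(0.0, 1.0, k)                  # initial guesses for <m1, ..., mk>

for _ in range(50):
    # E step: E[z_ij] = P(j-th Gaussian generated x_i), given the current means
    resp = np.exp(-0.5 * ((x[:, None] - means[None, :]) / sigma) ** 2)
    resp /= resp.sum(axis=1, keepdims=True)
    # M step: re-estimate each mean as the responsibility-weighted average of the data
    means = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)

print(np.sort(means))                            # should end up near -2 and 3
```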
59. EM Algorithm
Converges to local maximum likelihood and provides
estimates of hidden variables zij.
In fact, this is a local maximum of E[ln P(Y|h)]
Y is the complete data (observable plus unobservable
variables)
The expected value is taken over the possible values of the
unobserved variables in Y
60. General EM Problem
Given:
observed data X = {x1,…,xm}
unobserved data Z = {z1,…,zm}
parameterized probability distribution P(Y|h) where
Y = {y1,…,ym} is the full data yi=<xi,zi>
h are the parameters
Determine:
h that (locally) maximizes E[ln P(Y|h)]
Applications:
train Bayesian Belief Networks
unsupervised clustering
hidden Markov models
61. General EM Method
Define a likelihood function Q(h’|h) over the full data
Y = X ∪ Z, using the observed X and the current parameters h
to estimate Z
Q(h’|h) = E[ln P(Y|h’) | h, X]
EM algorithm:
Estimation (E) step: Calculate Q(h’|h) using the current
hypothesis h and the observed data X to estimate the
probability distribution over Y.
Q(h’|h) = E[ln P(Y|h’) | h, X]
Maximization (M) step: Replace hypothesis h by the
hypothesis h’ that maximizes this Q function.
h ← argmax_{h’ ∈ H} Q(h’|h)