Machine Learning Foundations
2018/02/05-06 @ Taiwan AI Academy
Albert Y. C. Chen, Ph.D.
Vice President, R&D
Viscovery
Albert Y. C. Chen, Ph.D.
陳彥呈 博士
albert@viscovery.com
http://www.linkedin.com/in/aycchen
http://slideshare.net/albertycchen
• Experience
2017-present: Vice President of R&D @ Viscovery
2015-2017: Chief Scientist @ Viscovery
2015-2015: Principal Scientist @ Nervve Technologies
2013-2014: Senior Scientist @ Tandent Vision Science
2011-2012: Computer Vision Lab @ GE Global Research
• Education
Ph.D. in Computer Science, SUNY-Buffalo
M.S. in Computer Science, NTNU
B.S. in Computer Science, NTHU
When something is important enough,
you do it even if the odds are not in your favor.
Elon Musk
Falcon 9
takeoff
Falcon 9
decelerate
Falcon 9
vertical
touchdown
What is “Machine Learning”?
• Machine Learning (ML):
• Human Learning:
• Manual Programming: rules
• Deterministic problems: repeat 1B
times, still get the same answer,
• problems lacking data,
• problems with easily separable data.
Manual Programming vs Machine Learning
• Data with noise,
• data of high dimension,
• data of large volume,
• data that changes over time.
When to manual program?
When to use machine learning?
• Important concepts (lessons learned) from classical machine learning still matter: dimensionality, sampling, distance measures, error metrics, and generalization issues.
• Understand how things work, why things worked
in the past, and why previously unattainable
problems are solved by Deep Learning.
Deep Learning, directly?
Where should we start?
We present you,
a simple & usable map for ML!
Dimension
Reduction
Clustering
Regression Classification
continuous
(predicting a quantity)
discrete
(predicting a category)
supervised / unsupervised
ML Roadmap, in more detail
• Before we start, we need to estimate the data distribution and develop sampling strategies,
• figure out how to measure/quantify the data, or, in other words, represent it as features,
• figure out how to split the data into training and validation sets.
• After we learn a model, we need to measure the fit, i.e., the error on the validation set.
• Finally, we evaluate how well our trained model generalizes.
Steps for Supervised Learning
Sampling & Distributions
(Figure: the population, drawn as faces with many different expressions.)
The importance of good sampling & distribution estimation.
Population with attribute modeled by a function f : X → Y.
Learn from a sample D = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)},  x ∈ X, y ∈ Y.
(Figure: a biased sample containing only crazy-smiling faces leads to a learned f' that incorrectly predicts that everyone else "smiles crazily".)
• The chance of getting a "perfect" sample of the population on the first try is very small; when the population is huge, this problem worsens.
• Noise during the measurement process adds additional uncertainty.
• As a result, it is natural to try multiple times, and
formulate the problem in a probabilistic way.
Sampling & Distributions
• Joint probability of X taking
the value xi and Y taking the
value yi :
• Marginalizing: probability
that X takes the value xi
irrespective of Y:
Probability Theory
(Figure: N samples arranged in a grid of cells; n_{ij} samples fall in cell (X = x_i, Y = y_j), with column sums c_i and row sums r_j.)
p(X = x_i, Y = y_j) = n_{ij} / N
p(X = x_i) = c_i / N,   where  c_i = Σ_j n_{ij}
• Conditional Probability: the
fraction of instances where Y
= yj given that X = xi.
• Product Rule:
Probability Theory
(Figure: the same grid of counts n_{ij}, with column sums c_i and row sums r_j.)
p(Y = y_j | X = x_i) = n_{ij} / c_i
p(X = x_i, Y = y_j) = n_{ij} / N = (n_{ij} / c_i) · (c_i / N) = p(Y = y_j | X = x_i) p(X = x_i)
• Bayes' Rule plays a central
role in pattern recognition
and machine learning.
• From the product rule,
together with the symmetric
property
we get:
Bayes' Rule
(Figure: the same grid of counts.)
p(X, Y) = p(Y, X)
p(Y|X) = p(X|Y) p(Y) / p(X),   where  p(X) = Σ_Y p(X|Y) p(Y)
• p(Y = a) = 1/4, p(Y = b) = 3/4
• p(X = blue | Y = a) = 3/5
• p(X = green | Y = a) = 2/5
When we randomly draw a ball that is blue, the
probability that it comes from Y=a is?
Bayes' Rule Example 1
(Figure: two boxes of balls, Y = a and Y = b.)
p(Y = a | X = blue)
 = p(X = blue | Y = a) p(Y = a) / p(X = blue)
 = p(X = blue | Y = a) p(Y = a) / [ p(X = blue | Y = a) p(Y = a) + p(X = blue | Y = b) p(Y = b) ]
 = (3/5 · 1/4) / (3/5 · 1/4 + 2/5 · 3/4)
 = (3/20) / (3/20 + 6/20)
 = (3/20) / (9/20)
 = 1/3
• Monty Hall problem
• Prize behind one of the three
doors. After choosing door 1,
the host opens empty door 3
and asks if you want to switch
your choice. Should you switch?
Bayes' Rule Example 2
(Figure: three doors, 1, 2, 3; the prize is behind one of them.)

Behind door 1 | Behind door 2 | Behind door 3 | Result if staying at door 1 | Result if switching to the door offered
Car           | Goat          | Goat          | Wins car                    | Wins goat
Goat          | Car           | Goat          | Wins goat                   | Wins car
Goat          | Goat          | Car           | Wins goat                   | Wins car
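The 2/3 advantage of switching can also be checked by simulation. Below is a minimal Python sketch (not part of the original slides); the helper name monty_hall and the trial count are made up.

import random

def monty_hall(switch, trials=100_000):
    wins = 0
    for _ in range(trials):
        car = random.randrange(3)            # door hiding the car
        choice = random.randrange(3)         # player's initial pick
        # host opens a goat door that is neither the car nor the player's pick
        opened = next(d for d in range(3) if d != car and d != choice)
        if switch:
            choice = next(d for d in range(3) if d != choice and d != opened)
        wins += (choice == car)
    return wins / trials

print("stay  :", monty_hall(switch=False))   # ~1/3
print("switch:", monty_hall(switch=True))    # ~2/3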
When we measure the wrong
features, we’ll need very
complicated classifiers, and
the results are still not ideal.
Features
baseball tennis ball
vs
There’s always “exceptions”
that would ruin our perfect
assumptions yellow
baseball?
we learn the best features from data with deep learning.
• More features ≠ better: as the number of features grows to N, the feature space grows exponentially (∝ ^N), and the number of samples needed for ML grows accordingly.
The curse of dimensionality
• Most of the volume of an n-D sphere is
concentrated in a thin shell near the surface!!!
• For a D-dimensional sphere of radius r = 1, the fraction of the volume lying between r = 1 − ε and r = 1 is 1 − (1 − ε)^D, which approaches 1 as D grows.
The curse of dimensionality
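The shell fraction 1 − (1 − ε)^D can be evaluated directly to see how fast it approaches 1. A minimal Python sketch, assuming ε = 0.01:

eps = 0.01
for D in (1, 2, 10, 100, 1000):
    shell = 1 - (1 - eps) ** D   # fraction of volume within eps of the surface
    print(f"D={D:5d}  shell fraction={shell:.5f}")
# D=1000 gives ~0.99996: almost all of the volume lies in the thin outer shell.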
• The curse of dimensionality affects not just the feature space, but also the input, the output, and more.
• It is much more challenging to train a good n-class classifier, e.g., face recognition: 1-to-1 verification vs 1-to-n identification.
• Many more issues arise from using a general-purpose 1M-class classifier vs a problem-specific 1k-class classifier.
High-dim. issue is prevalent
Recognition
Accuracy:
• 1 to 1: 99%+
• 1 to 100: 90%
• 1 to 10,000:
50%-70%.
• 1 to 1M: 30%.
LFW dataset, common FN↑, FP↓
Prevalent high-dim issue, eg.1
• 1-to-N face identification, in the wild!
Prevalent high-dim issue, eg.2
• Smart photo album, with Google Cloud Vision
Distance between histograms of 1M bins is very close to 0 most of the time.
• Real data will often be confined to a region of
the space having lower effective dimensionality.
• Data will typically exhibit some smoothness
properties (at least locally).
Living with high dimensions
E.g., Low-dimensional
“manifold” of faces,
embedded within a
high-dim space.
Keywords:
• dimension reduction,
• learned features,
• manifold learning.
• k-fold cross validation
Splitting data
(Figure: the smiley-face icons are reused here to represent the set of annotated data.)
Randomly split into k groups
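A minimal sketch of such a random k-fold split, shown here with scikit-learn's KFold as one possible tool; the toy array X stands in for the annotated data.

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)        # 10 toy samples, 2 features each
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
    # train on k-1 folds, validate on the held-out fold
    print(f"fold {fold}: train={train_idx}, val={val_idx}")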
• Minimizing the misclassification rate
• Minimizing the expected loss
• The reject option
Decision Theory
• Decision boundary, or simply, in 1D, a threshold, s.t. anything larger than the threshold is classified as one class, and anything smaller as the other class.
Decision Boundary
• Different metrics & names used in different fields
for measuring ML performance; however, the
common cornerstones are:
• True positive (TP): sample is an apple,
classified as an apple.
• False positive (FP): sample is not an apple, but
classified as an apple.
• True negative (TN): sample is not an apple,
classified as not an apple.
• False negative (FN): sample is an apple, but misclassified as "not an apple".
True/False, Positive/Negative
• Precision = TP / (TP + FP): the classifier identified (TP + FP) apples, of which only TP are apples. (a.k.a. positive predictive value)
• Recall = TP / (TP + FN): of the (TP + FN) apples in total, the classifier identified TP. (a.k.a. hit rate, sensitivity, true positive rate)
Precision vs Recall
• F-measure: the harmonic mean of precision and recall,
  F = 2 · precision · recall / (precision + recall).
  The F-measure is criticized outside the Information Retrieval field for neglecting the true negatives.
• Accuracy: ACC = (TP + TN) / (TP + TN + FP + FN),
  a weighted arithmetic mean of precision and inverse precision, as well as of recall and inverse recall.
A single balanced metric?
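A quick numeric check of these definitions on made-up predictions (a minimal Python sketch, not from the original slides):

y_true = [1, 1, 1, 0, 0, 0, 1, 0]   # 1 = apple, 0 = not apple (toy labels)
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

TP = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
FP = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
TN = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
FN = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

precision = TP / (TP + FP)
recall    = TP / (TP + FN)
f1        = 2 * precision * recall / (precision + recall)
accuracy  = (TP + TN) / (TP + TN + FP + FN)
print(precision, recall, f1, accuracy)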
Multi-objective Optimization
e.g., micro air vehicle wing design
• Different types of errors are weighted differently;
e.g., medical examinations, minimize false
negative but can tolerate false positive.
• Reformulate objectives from maximizing
probability to minimizing weighted loss
functions.
• The reject option: refrain from making decisions
on difficult cases (e.g., for samples within a
certain region inside the decision boundary.)
Minimizing the expected loss
• Minimizing Training and Validation Error, v.s.
minimizing Testing Error.
• Memorizing every “practice exam” question ≠
doing well on new questions. Avoid overfitting.
Generalization
E.g., training a classifier
that recognizes trees
Odd trees of the world
Odd trees of the world
Odd trees of the world
• Bias:
• Difference between the expected (or
averaged) prediction of our model and the
correct value.
• Error due to inaccurate assumptions/
simplifications.
• Variance:
• Amount that the estimate of the target function
will change if different training data was used.
Generalization Error
Bias/variance trade-off
Scott Fortmann-Roe
• Model is too simple to represent all the relevant
class characteristics.
• High bias (few degrees of freedom, DoF) and
low variance.
• High training error and high test error.
Underfitting
• Model is too complex and fits
irrelevant noise in the data
• Low bias, high variance
• Low training error, high test error
Overfitting
Error (mean squared error, MSE) = noise² + bias² + variance
Bias-Variance Trade-off
unavoidable
error
error due to incorrect
assumptions made
about the data
error due to variance
of training samples
Model Complexity
Slide credit: D. Hoiem
Training Sample vs Model Complexity
Slide credit: D. Hoiem
Effect of Training Sample Size
Slide credit: D. Hoiem
• Models: describe relationship between variables
• Deterministic models: hypothesize exact
relationships, OK when noise is negligible
• Probabilistic models: deterministic part +
random error. For example:
• Regression models: one dependent
variable + one or more numerical or
categorical independent (explanatory)
variable.
• Correlation models: multiple independent
variables.
How do we learn models?
Generative vs Discriminative Models
Discriminative Model:
directly learn the data
boundary
Generative Model:
represent the data
and boundary
• Learn to directly predict labels from the data
• Often uses simpler boundaries (e.g., linear) for
hopes of better generalization.
• Often easier to predict a label from the data than
to model the data.
• E.g.,
• Logistic Regression
• Support Vector Machines
• Max Entropy Markov Model
• Conditional Random Fields
Discriminative Models
• Represent both the data and the boundary.
• Often use conditional independence and priors.
• Modeling data is challenging; need to make and
verify assumptions about data distribution
• Modeling data aids prediction & generalization.
• E.g.,
• Naive Bayes
• Gaussian Mixture Model (GMM)
• Hidden Markov Model
• Generative Adversarial Networks (GAN)
Generative Models
• Bernoulli Distribution
• Uniform Distribution
• Binomial Distribution
• Normal Distribution
• Poisson Distribution
• Exponential Distribution
Distributions
Dimension Reduction
Machine Learning Roadmap
Dimension
Reduction
Clustering
Regression Classification
continuous
(predicting a quantity)
discrete
(predicting a category)
supervised / unsupervised
• Goal: try to find a more compact
representation of the data
• Assume that the high
dimensional data actually
reside in an inherent low-
dimensional space.
• Additional dimensions are just random noise.
• Goal is to recover these
inherent dimensions and
discard noise.
Unsupervised Dimension Reduction
• Create a basis where
the axes represent the
dimensions of variance,
from high to low.
• Finds correlations in the data dimensions to produce the best possible lower-dimensional representation based on linear projections.
Principal Component Analysis (PCA)
PCA
PCA algorithm, conceptual steps
• Find a line s.t. when data is projected onto the
line, it has the maximum variance.
• Find new line orthogonal to the first that has the
maximum projected variance.
PCA algorithm, conceptual steps
• Repeat until d lines are found. The projected position of a point on these lines gives its coordinates in the reduced (d-dimensional) space.
• Computing this set of lines is achieved by eigen-decomposition of the covariance matrix.
PCA algorithm, conceptual steps
• Given n data points: x1, ..., xn
• Consider a linear projection specified by v
• The projection of x onto v is
• The variance of the projected data is
• The 1st Principal Component maximizes the
variance subject to the constraint
PCA, maximizing variance
z = v^T x
var(z) = var(v^T x) = v^T var(x) v = v^T S v
• Maximize v^T S v, subject to v^T v = 1.
• Lagrangian:  v^T S v − λ (v^T v − 1);   d/dv = 0  →  S v = λ v
• v is an eigenvector of S with eigenvalue λ.
• Sample variance of the projected data:  v^T S v = λ v^T v = λ.
• Each eigenvalue equals the amount of variance captured by its eigenvector.
PCA, maximizing variance
• View PCA as minimizing the reconstruction error
of using a low-dimensional approximation of the
original data:
Alternative view of PCA
x_1 ≈ x_0 + z_1 u,   x_2 ≈ x_0 + z_2 u,   ...
(each point is approximated by the mean point x_0 plus a coordinate z_n along the principal direction u)
• Calculate the covariance matrix of the data S
• Calculate the eigen-vectors/eigen-values of S
• Rank the eigen-values in decreasing order
• Select eigen-vectors that retain a fixed % of the
variance, e.g., 80%, s.t.,
Dimension Reduction using PCA
( Σ_{i=1}^{d} λ_i ) / ( Σ_i λ_i )  ≥  80%
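A minimal numpy sketch of the steps above: center the data, eigen-decompose the covariance matrix, and keep enough components to retain about 80% of the variance. The data matrix X is a made-up example.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))           # 200 samples, 5 dimensions (toy data)

Xc = X - X.mean(axis=0)                 # center the data
S = np.cov(Xc, rowvar=False)            # covariance matrix
eigvals, eigvecs = np.linalg.eigh(S)    # eigh returns ascending order for symmetric S
order = np.argsort(eigvals)[::-1]       # rank eigenvalues in decreasing order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained = np.cumsum(eigvals) / eigvals.sum()
d = np.searchsorted(explained, 0.80) + 1    # smallest d retaining >= 80% of the variance
Z = Xc @ eigvecs[:, :d]                 # project onto the top-d principal components
print(d, Z.shape)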
PCA example: Eigenfaces
Mean face
Basis of variance (eigenvectors)
M. Turk; A. Pentland (1991). "Face recognition using eigenfaces".
Proc. IEEE Conference on Computer Vision and Pattern Recognition. pp. 586–591.
The AT&T face database (formerly the ORL database): 10 pictures each of 40 subjects.
• The covariance matrix of image data is large, and finding the eigenvectors of large matrices is slow.
• Singular Value Decomposition (SVD) can be used to compute the principal components instead.
• SVD steps:
  • Create the centered data matrix X
  • Solve X = USVᵀ
  • The columns of V are the eigenvectors of the covariance matrix Σ, sorted from largest to smallest eigenvalue.
PCA, scaling up
Singular Value Decomposition
Singular Value Decomposition
• Useful preprocessing for easing the "curse of
dimensionality" problem.
• Reduced dimension: simpler hypothesis
space
• Smaller VC dimension: less overfitting
• PCA can also be seen as noise reduction
• Fails when data consists of multiple separate
clusters
PCA discussion
• Also named Fisher Discriminant Analysis
• It can be viewed as
• a dimension reduction method,
• a generative classifier p(x|y): a Gaussian with a distinct mean µ for each class but a shared covariance Σ.
Linear Discriminant Analysis (LDA)
(Figure: one projection where the classes are mixed vs another with better separation.)
• Find a project direction so that the separation
between classes is maximized.
• Objective 1: maximize the distance between the
projected means of different classes
LDA Objectives
original means:   m_1 = (1/N_1) Σ_{x∈C_1} x,      m_2 = (1/N_2) Σ_{x∈C_2} x
projected means:  m'_1 = (1/N_1) Σ_{x∈C_1} w^T x,   m'_2 = (1/N_2) Σ_{x∈C_2} w^T x
• Objective 2: minimize scatter (variance within
class)
LDA Objectives
Total within-class scatter for projected class i:   s_i² = Σ_{x∈C_i} (w^T x − m'_i)²
Total within-class scatter:  s_1² + s_2²
• There are a number of different ways to combine
the two objectives.
• LDA seeks to optimize the following objective:
LDA Objective
The LDA Objective
LDA for two classes
w = S_W^{-1} (m_1 − m_2)
• The objective remains the same, with a slightly different definition of the between-class scatter:
  J(w) = (w^T S_B w) / (w^T S_W w),   where   S_B = (1/k) Σ_{i=1}^{k} (m_i − m)(m_i − m)^T
• Solution: the k−1 leading eigenvectors of S_W^{-1} S_B.
LDA for Multi-Classes
• Data often lies on
or near a nonlinear
low-dimensional
curve.
• We call such a low-d structure a manifold.
• Algorithms include:
ICA, LLE, Isomap.
Nonlinear Dimension Reduction
swiss roll data
• A non-linear method for dimensionality reduction
• Preserves the global, nonlinear geometry of the
data by preserving the geodesic distances.
• Geodesic: shortest route between two points on
the surface of a manifold.
ISOMAP: Isometric Feature Mapping
1. Approximate the geodesic distance between
every pair of points in the data.
• The manifold is locally linear
• Euclidean distance works well for points that
are close enough.
• For points that are far apart, their geodesic
distance can be approximated by summing
up local Euclidean distances.
2. Find a Euclidean mapping of the data that
preserves the geodesic distance.
ISOMAP algorithm
• Construct a graph by connecting i and j if:
  • d(i, j) < ε (ε-isomap), or
  • i is one of j's k nearest neighbors (k-isomap).
• Set the edge weight equal to d(i, j), the Euclidean distance.
• Compute the geodesic distance between any two points as the shortest-path distance in the graph.
Geodesic Distance
• We can use Multi-Dimensional Scaling (MDS), a
class of statistical techniques that:
• Given:
• n x n matrix of dissimilarities between n
objects
• Outputs:
• a coordinate configuration of the data in low-d
space Rd whose Euclidean distances closely
match given dissimilarities.
Compute low-dimensional mapping
ISOMAP on Swiss Roll Data
ISOMAP Examples
ISOMAP Examples
Regression
Machine Learning Roadmap
Dimension
Reduction
Clustering
Regression Classification
continuous
(predicting a quantity)
discrete
(predicting a category)
supervised / unsupervised
• Unit-less, normalized between [-1, 1]
Pearson’s Correlation Coefficient
(Figure: scatter plots illustrating r = −1, −0.6, 0, +0.3, +1, and a curved pattern with r = 0. Modified from: Statistics for Managers Using Microsoft® Excel, 4th Edition, 2004, Prentice-Hall.)
r = cov(x, y) / ( √var(x) · √var(y) )
Linear Correlations
(Figure: scatter plots of linear relationships, curvilinear relationships, strong relationships, weak relationships, and no relationship.)
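The coefficient r = cov(x, y) / (√var(x)·√var(y)) is a one-liner in numpy; a small sketch on toy data (not from the original slides):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + np.array([0.1, -0.2, 0.0, 0.3, -0.1])   # roughly linear relationship

r = np.cov(x, y)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))
print(r)                         # close to +1
print(np.corrcoef(x, y)[0, 1])   # same value via the built-in helper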
• In correlation, two variables are treated as
independent.
• In regression, one variable (x) is independent,
while the other (y) is dependent.
• Goal: if you know something about x, this would
help you predict something about y.
Regression
• Expected value at a given level of x (fixed exactly on the line):  y = w_0 + w_1 x
• Predicted value for a new x:  y' = w_0 + w_1 x + ε,
  where ε is a random error that follows a normal distribution with mean 0 and variance σ².
Simple Linear Regression
(Figure: a fitted regression line with intercept w_0 and slope w_1.)
Multiple Linear Regression
y(x, w) = w_0 + w_1 x_1 + · · · + w_D x_D
• A function that is linear in the parameters w_0, ..., w_D and also linear in the input variables x_i has very restricted modeling power (it can't even fit curves).
• Assumes that:
• The relationship between X and Y is linear.
• Y is distributed normally at each value of X.
• The variance of Y at each value of X is the
same.
• The observations are independent.
• Before going further, let’s take a look at
polynomial line fitting (polynomial regression.)
Linear Regression
Given N = 10 blue dots, try to find the function sin(2πx) that was used to generate the data points.
• Polynomial line fitting:  y(x, w) = w_0 + w_1 x + w_2 x² + · · · + w_M x^M + ε
  • M is the order of the polynomial
  • a linear function of the coefficients w
  • a nonlinear function of x
• Objective: minimize the error between the predictions y(x_n, w) and the target values t_n:
  E(w) = ½ Σ_{n=1}^{N} { y(x_n, w) − t_n }²
  or, the root-mean-square error  E_RMS = √( 2 E(w*) / N )
Polynomial Regression
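A minimal sketch of polynomial fitting with varying order M on the sin(2πx) toy problem above, using numpy's polyfit; N and the noise level are assumptions.

import numpy as np

rng = np.random.default_rng(1)
N = 10
x = np.linspace(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=N)   # noisy targets

for M in (0, 1, 3, 9):
    w = np.polyfit(x, t, deg=M)                 # least-squares polynomial fit of order M
    pred = np.polyval(w, x)
    E = 0.5 * np.sum((pred - t) ** 2)           # E(w) as defined above
    E_rms = np.sqrt(2 * E / N)                  # root-mean-square error
    print(f"M={M}: E_RMS on training data = {E_rms:.3f}")
# M=9 drives the training error to ~0, i.e., it memorizes the 10 points.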
Polynomial regression w. var. M
• There are only 10 data points, i.e., 9 degrees of freedom; we can get 0 training error when M = 9.
• Food for thought: make sure your deep neural network is not just "memorizing" the training data when its M >> the data's DoF.
Polynomial regression w. var. M
• With M=9, but N=15 (left) and N=100, the over-
fitting problem is greatly reduced.
• ML is all about balancing M and N. One rough heuristic is that N should be 5x-10x of M (model complexity, not necessarily the number of parameters).
What happens with more data?
• Regularization: used for controlling over-fitting.
• E.g., discourage coefficients from reaching
large values:







where
Regularization
Ẽ(w) = ½ Σ_{n=1}^{N} { y(x_n, w) − t_n }²  +  (λ/2) ||w||²
where  ||w||² = w^T w = w_0² + w_1² + · · · + w_M²
• Extending linear regression to linear combinations of fixed nonlinear basis functions:
  y(x, w) = Σ_{j=0}^{M−1} w_j φ_j(x),   where   w = (w_0, ..., w_{M−1})^T,   φ = (φ_0, ..., φ_{M−1})^T
• Basis functions φ_j(x) act as "features" in ML:
  • linear basis function:  φ_j(x) = x_j
  • polynomial basis function:  φ_j(x) = x^j
  • Gaussian basis function
  • sigmoidal basis function
Linear Models for Regression
• Global functions of
the input variable,
s.t. changes in one
region of input
space affect all
other regions.
Polynomial Basis Functions
φ_j(x) = x^j
• Local functions: a small change in x only affects nearby basis functions.
• µ_j and s control the location and scale (width).
Gaussian Basis Functions
φ_j(x) = exp{ −(x − µ_j)² / (2s²) }
• Local functions: a small change in x only affects nearby basis functions.
• µ_j and s control the location and scale (slope).
Sigmoidal Basis Functions
φ_j(x) = σ( (x − µ_j) / s ),   where   σ(a) = 1 / (1 + exp(−a))
where
• Adding a regularization term to an error function:
• One of the simplest regularizers is the sum of squares of the weight-vector elements:
  E_W(w) = ½ w^T w,   giving the total error   E_D(w) + λ E_W(w)
• This type of weight-decay regularizer (in ML), a.k.a. parameter shrinkage (in statistics), encourages weight values to decay towards zero, unless supported by the data.
Regularized Least Squares
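For this quadratic regularizer the solution has the closed form w = (λI + ΦᵀΦ)⁻¹ Φᵀ t. A minimal numpy sketch with a polynomial design matrix; the data and the value of λ are assumptions.

import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

M, lam = 9, 1e-3
Phi = np.vander(x, M + 1, increasing=True)     # design matrix: phi_j(x) = x**j

# regularized least squares: w = (lam*I + Phi^T Phi)^-1 Phi^T t
w = np.linalg.solve(lam * np.eye(M + 1) + Phi.T @ Phi, Phi.T @ t)
print(w)   # coefficients stay small compared to the unregularized M=9 fit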
• A more general regularizer has the form (sum-of-squares error + generalized regularizer):
  ½ Σ_{n=1}^{N} { t_n − w^T φ(x_n) }²  +  (λ/2) Σ_{j=1}^{M} |w_j|^q
• q = 2 is the quadratic regularizer (last page).
• q = 1 is known as the lasso in statistics.
Regularized Least Squares
• LASSO: least absolute shrinkage and selection
operator
• When λ is sufficiently large, some of the coefficients w_j are driven to zero, leading to a sparse model.
LASSO
The Bias-Variance Trade-off
• Large values of λ: small variance but large bias.
• Small values of λ: large variance but small bias.
The Bias-Variance Tradeoff
Clustering
Machine Learning Roadmap
Dimension
Reduction
Clustering
Regression Classification
continuous
(predicting a quantity)
discrete
(predicting a category)
supervised / unsupervised
• Group together similar points and represent
them with a single token.
• Issues:
• How do we define two points/images/patches
being "similar"?
• How do we compute an overall grouping from
pairwise similarity?
Clustering
• Grouping pixels of similar appearance and spatial proximity together; there are many ways to do it, yet none is perfect.
Clustering Example
Clustering Example
• Summarizing Data
• Look at large amounts of data
• Patch-based compression or denoising
• Represent a large continuous vector with the
cluster number
• Counting
• Histograms of texture, color, SIFT vectors
• Segmentation
• Separate the image into different regions
• Prediction
• Images in the same cluster may have the same
labels
Why do we cluster?
• K-means
• Iteratively re-assign points to the nearest cluster
center
• Gaussian Mixture Model (GMM) Clustering
• Mean-shift clustering
• Estimate modes of pdf
• Hierarchical clustering
• Start with each point as its own cluster and
iteratively merge the closest clusters
• Spectral clustering
• Split the nodes in a graph based on assigned
links with similarity weights
How do we cluster?
• Goal: cluster to minimize variance in data given
clusters while preserving information.
Clustering for Summarization
c*, δ* = argmin_{c, δ}  (1/N) Σ_{j=1}^{N} Σ_{i=1}^{K} δ_{ij} (c_i − x_j)²
where c_i is a cluster center, x_j a data point, and δ_{ij} indicates whether x_j is assigned to c_i.
• Euclidean Distance:
• Cosine similarity:
How do we measure similarity?
Euclidean distance:
  ||y − x|| = √( (y − x) · (y − x) )
  distance(x, y) = √( (y_1 − x_1)² + (y_2 − x_2)² + · · · + (y_n − x_n)² ) = √( Σ_{i=1}^{n} (y_i − x_i)² )
Cosine similarity:
  x · y = ||x||₂ ||y||₂ cos θ,   θ = arccos( (x · y) / (|x| |y|) )
  similarity(x, y) = cos(θ) = (x · y) / ( ||x||₂ ||y||₂ )
• Compare distance of closest (NN1) and second
closest (NN2) feature vector neighbor.
• If NN1≈NN2, ratio NN1/NN2 will be ≈1 →
matches too close.
• As NN1 << NN2, ratio NN1/NN2 tends to 0.
• Sorting by this ratio puts matches in order of
confidence.
Nearest Neighbor Distance Ratio
• How to threshold the nearest neighbor ratio?
Nearest Neighbor Distance Ratio
(Figure: Lowe, IJCV 2004, on 40,000 points. The threshold depends on the data and the specific application.)
1. Randomly select k initial cluster centers
2. Assign each point to nearest center



3. Update cluster centers as the mean of the points



4. repeat 2-3 until no points are re-assigned.
k-means clustering
δ^t = argmin_δ  (1/N) Σ_{j=1}^{N} Σ_{i=1}^{K} δ_{ij} ( c_i^{t−1} − x_j )²
c^t = argmin_c  (1/N) Σ_{j=1}^{N} Σ_{i=1}^{K} δ_{ij}^t ( c_i − x_j )²
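A minimal numpy sketch of the loop above (assign each point to its nearest center, then recompute each center as the mean of its points); the toy data and k are made up, and empty clusters are not handled.

import numpy as np

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(m, 0.3, size=(50, 2)) for m in ((0, 0), (3, 3), (0, 3))])
k = 3
centers = X[rng.choice(len(X), size=k, replace=False)]   # step 1: random initial centers

for _ in range(100):
    # step 2: assign each point to the nearest center
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    assign = d.argmin(axis=1)
    # step 3: update each center as the mean of its assigned points (no empty-cluster handling)
    new_centers = np.array([X[assign == i].mean(axis=0) for i in range(k)])
    if np.allclose(new_centers, centers):   # step 4: stop when centers no longer move
        break
    centers = new_centers

print(centers)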
k-means convergence example
• Initialization
• Randomly select K points as initial cluster
center
• Greedily choose K points to minimize residual
• Distance measures
• Euclidean or others?
• Optimization
• Will converge to local minimum
• May want to use the best out of multiple trials
k-means: design choices
• Cluster on one set, use another (reserved) set to
test K.
• Minimum Description Length (MDL) principle for model comparison.
• Minimize the Schwarz Criterion, a.k.a. the Bayesian Information Criterion (BIC).
• (When building dictionaries, more clusters
typically work better.)
How to choose k
• Generative
• How well are points reconstructed from the
cluster?
• Discriminative
• How well do the clusters correspond to labels
(purity)
How to evaluate clusters?
• Pros
• Finds cluster center that minimize conditional
variance (good representation of data)
• simple and fast
• easy to implement
k-means pros & cons
• Cons
• Need to choose K
• Sensitive to outliers
• Prone to local minima
• All clusters have the same parameters
• Can be slow. Each iteration is O(KNd) for N d-
dimensional points
k-means pros & cons
• Clusters are spherical
• Clusters are well separated
• Clusters are of similar volumes
• Clusters have similar number of points
k-means works if
• Hard assignments, or probabilistic assignments?
• Case against hard assignments:
• Clusters may overlap
• Clusters may be wider than others
• Can use a probabilistic model, P(X|Y) P(Y).
• Challenge: need to estimate the model parameters without labeled Ys.
GMM Clustering
• Assume m-dimensional data points.
• P(Y) is still multinomial, with k classes.
• P(X|Y = i), i = 1, ..., k, are k multivariate Gaussians:
Gaussian Mixture Models
P(X = x | Y = i) = ( 1 / √((2π)^m |Σ_i|) ) exp( −½ (x − µ_i)^T Σ_i^{−1} (x − µ_i) )
where µ_i is the mean (an m-dimensional vector), Σ_i the covariance (an m×m matrix), and |Σ_i| its determinant.
• Estimating parameters when the data labels Y are given:
  • Solve the optimization problem  argmax_θ Π_j P(y_j, x_j; θ),
    where θ collects all model parameters (means, covariances, mixing weights, etc.).
  • The MLE has a closed-form solution:
    µ_ML = (1/n) Σ_{i=1}^{n} x_i,    Σ_ML = (1/n) Σ_{i=1}^{n} (x_i − µ_ML)(x_i − µ_ML)^T
• Estimating parameters without the labels y_j, solve:
  argmax_θ Π_j P(x_j; θ) = argmax_θ Π_j Σ_{i=1}^{k} P(y_j = i, x_j; θ)
Maximum Likelihood Estimation (MLE)
• Maximize the marginal likelihood:
  argmax_θ Π_j P(x_j; θ) = argmax_θ Π_j Σ_{i=1}^{k} P(y_j = i, x_j; θ)
• This is almost always a hard problem:
  • usually no closed-form solution;
  • even when P(X, Y; θ) is convex, P(X; θ) generally isn't;
  • for all but the simplest P(X; θ), we have to do gradient ascent in a big, messy space with lots of local optima.
Solving MLE for GMM Clustering
P(X; ✓)
• Simple example: a GMM with 1D data, k = 2 Gaussians, unit variance, and a uniform distribution over classes; we only need to estimate µ_1 and µ_2.
Solving MLE for GMM Clustering
Π_{j=1}^{n} Σ_{i=1}^{k} P(X = x_j, Y = i)   ∝   Π_{j=1}^{n} Σ_{i=1}^{k} exp( −(1/(2σ²)) (x_j − µ_i)² )
• Skipping the derivations: we still need to differentiate and solve for µ_i, Σ_i, and P(Y = i) for i = 1, ..., k. There is still no closed-form solution, and the gradient is complex, with lots of local optima.
• Expectation Maximization
• Objective:
• Data:
• E-step: For all examples j and values i for y,
compute:
• M-step: re-estimate the parameters with
weighted MLE estimates, set:
Solving MLE for GMM Clustering
Objective:  argmax_θ Π_j Σ_{i=1}^{k} P(y_j = i, x_j | θ),   i.e., maximize  Σ_j log Σ_{i=1}^{k} P(y_j = i, x_j | θ)
Data:  { x_j | j = 1 ... n }
E-step:  for all examples j and values i of y, compute  P(y_j = i | x_j, θ)
M-step:  θ = argmax_θ  Σ_j Σ_{i=1}^{k} P(y_j = i | x_j, θ) · log P(y_j = i, x_j | θ)
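A minimal sketch of these E and M steps for the simple 1D case above (two Gaussians with unit variance and a uniform class prior, so only µ1 and µ2 are updated); the toy data is made up.

import numpy as np

rng = np.random.default_rng(4)
x = np.concatenate([rng.normal(-2, 1, 100), rng.normal(2, 1, 100)])   # toy 1D data

mu = np.array([-0.5, 0.5])                      # initial guesses for mu_1, mu_2
for _ in range(50):
    # E-step: responsibilities P(y=i | x_j, theta), unit variance, uniform prior
    lik = np.exp(-0.5 * (x[:, None] - mu[None, :]) ** 2)
    resp = lik / lik.sum(axis=1, keepdims=True)
    # M-step: weighted MLE update of the two means
    mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)

print(mu)   # close to the true means -2 and +2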
EM for GMM MLE example
1 2 3
4 5 6
• EM after 20 iterations
EM for GMM MLE example
• GMM for some bio assay data
EM for GMM MLE example
EM for GMM MLE example
• GMM for some bio assay data, fitted separately for three different compounds.
• With hard assignments and unit variance, EM for a GMM is equivalent to the k-means clustering algorithm!
• EM, like k-means, uses coordinate ascent, and can get stuck in a local optimum.
EM for GMM Clustering, notes
• mean-shift seeks modes of a given set of points
1. Choose kernel and bandwidth
2. For each point:
1. center a window on that point
2. compute the mean of the data in the
search window
3. center the search window at the new
mean location, repeat 2,3 until converge.
3. Assign points that lead to nearby modes to
the same cluster.
Mean-Shift Clustering
• Try to find modes of a non-parametric density
Mean-shift algorithm
(Figure: pixels mapped into color space and the resulting clusters.)
• Attraction basin: the region for which all
trajectories lead to the same mode.
• Cluster: all data points in the attraction basin of
a mode.
Attraction Basin
Slides by Y. Ukrainitz & B. Sarel
Mean Shift
(Figure: a region of interest, its center of mass, and the mean-shift vector, iterated until the window converges on a mode.)
• Kernel density estimation function
• Gaussian kernel
Kernel Density Estimation
f̂_h(x) = (1/(nh)) Σ_{i=1}^{n} K( (x − x_i) / h )
Gaussian kernel:  K( (x − x_i) / h ) = (1/√(2π)) e^{ −(x − x_i)² / (2h²) }
• Compute the mean shift vector m(x).
• Iteratively translate the kernel window by m(x) until convergence.
Computing the Mean Shift
m(x) = [ Σ_{i=1}^{n} x_i g( ||x − x_i||² / h ) / Σ_{i=1}^{n} g( ||x − x_i||² / h ) ] − x
• Mean-shift can also be used as clustering-based
image segmentation.
Mean-Shift Segmentation
D. Comaniciu and P. Meer, Mean Shift: A Robust
Approach toward Feature Space Analysis, PAMI 2002.
• Compute features for each pixel (color, gradients, texture, etc.).
• Set the kernel sizes K_f for the features and K_s for the position.
• Initialize windows at individual pixel locations.
• Run mean shift for each window until convergence.
• Merge windows that are within the widths of K_f and K_s.
Mean-Shift Segmentation
(Figure: pixels in color space and the resulting clusters.)
• Speedups:
• binned estimation
• fast neighbor search
• update each window in each iteration
• Other tricks
• Use kNN to determine window sizes
adaptively
Mean-Shift
• Pros
• Good general-practice segmentation
• Flexible in number and shape of regions
• robust to outliers
• Cons
• Have to choose kernel size in advance
• Not suitable for high-dimensional features
Mean-Shift pros & cons
• DBSCAN: Density-based spatial
clustering of applications with noise.
• Density: number of points within a
specified radius (ε-Neighborhood)
• Core point: a point with more than
a specified number of points
(MinPts) within ε.
• Border point: has fewer than
MinPts within ε, but is in the
neighborhood of a core point.
• Noise point: any point that is not a
core point or border point.
DBSCAN
(Figure: with MinPts = 4 and radius ε: p is a core point, q is a border point, and o is a noise point.)
• Density-reachable: p is density-reachable from q w.r.t. ε and MinPts if there is a chain of objects p_1, ..., p_n with p_1 = q and p_n = p, s.t. p_{i+1} is directly density-reachable from p_i w.r.t. ε and MinPts for all 1 ≤ i < n.
• Density-connectivity: p is density-connected to q w.r.t. ε and MinPts if there is an object o s.t. both p and q are density-reachable from o w.r.t. ε and MinPts.
DBSCAN
• Cluster: a cluster C in a set of objects D w.r.t. ε
and MinPts is a non-empty subset of D satisfying
• Maximality: for all p, q: if p ∈ C and q is density-reachable from p w.r.t. ε and MinPts, then q ∈ C.
• Connectivity: for all p,q ∈ C, p is density-
connected to q w.r.t. ε and MinPts in D.
• Note: cluster contains core & border points.
• Noise: objects which are not directly density-
reachable from at least one core object.
DBSCAN clustering
1. Select a point p
2. Retrieve all points density-reachable from p
w.r.t. ε and MinPts.
1. if p is a core point, a cluster is formed
2. if p is a border point, no points are density
reachable from p and DBSCAN visits the
next point of the database
3. continue 1,2, until all points are processed.
(result independent of process ordering)
DBSCAN clustering algorithm
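A minimal sketch using scikit-learn's DBSCAN as one implementation of the algorithm above; the toy data and the eps / min_samples values are assumptions.

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(5)
blobs = np.vstack([rng.normal(c, 0.2, size=(80, 2)) for c in ((0, 0), (3, 3))])
noise = rng.uniform(-2, 5, size=(20, 2))
X = np.vstack([blobs, noise])

labels = DBSCAN(eps=0.4, min_samples=4).fit_predict(X)   # eps ~ epsilon, min_samples ~ MinPts
print(set(labels))   # cluster ids; -1 marks noise points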
• Heuristic: for points in a cluster, their kth nearest
neighbors are at roughly the same distance.
• Noise points have the kth nearest neighbor at
farthest distance.
• So, plot sorted distance of every point to its kth
nearest neighbor.
DBSCAN parameters
sharp change;
good candidate
for ε and MinPts.
• Pros
• No need to decide K beforehand,
• Robust to noise, since it doesn't require every
point being assigned nor partition the data.
• Scales well to large datasets.
• Stable across runs and different data ordering.
• Cons
• Trouble when clusters have different densities.
• ε may be hard to choose.
DBSCAN pros & cons
• Agglomerative clustering v.s. Divisive clustering
Hierarchical Clustering
• Method:
1. Every point is its own cluster
2. Find closest pair of clusters, merge into one
3. repeat
• The definition of closest is what differentiates
various flavors of agglomerative clustering
algorithms.
Agglomerative Clustering
• How to define the linkage/cluster similarity?
• Maximum or complete-linkage clustering
(a.k.a., farthest neighbor clustering)
• Minimum or single-linkage clustering
(a.k.a. nearest neighbor clustering)
• Centroid linkage clustering (UPGMC)
• Minimum Energy Clustering
• Sum of all intra-cluster variance
• Increase in variance for clusters being merged
Agglomerative Clustering
single linkage complete linkage average linkage centroid linkage
• How many clusters?
• Clustering creates a dendrogram (a tree)
• Threshold based on max number of clusters or
based on distance between merges.
Agglomerative Clustering
• Pros
• Simple to implement, widespread application
• Clusters have adaptive shapes
• Provides a hierarchy of clusters
• Cons
• May have imbalanced clusters
• Still have to choose the number of clusters or
thresholds
• Need to use an ultrametric to get a meaningful
hierarchy
Agglomerative Clustering
• Group points based on links in a graph
Spectral Clustering
A
B
• Normalized Cut
• A cut in a graph that penalizes large
segments
• Fix by normalizing for the size of the segments:
  NormalizedCut(A, B) = cut(A, B) / volume(A) + cut(A, B) / volume(B),
  where volume(A) = sum of the costs of all edges that touch A.
Spectral Clustering
• Determining importance by random walk
• What's the probability of visiting a given node?
• Create adjacency matrix based on visual similarity
• Edge weights determine probability of transition
Visual Page Rank
Jing Baluja 2008
• Quantization/Summarization: K-means
• aims to preserve variance of original data
• can easily assign new point to a cluster
Which Clustering Algorithm to use?
Quantization for computing
histograms
Summary of 20,000 photos of Rome using “greedy k-means”
http://grail.cs.washington.edu/projects/canonview/
• Image segmentation: agglomerative clustering
• More flexible with distance measures (e.g., can be based on boundary prediction)
• adapts better to specific data
• hierarchy can be useful
Which Clustering Algorithm to use?
http://www.cs.berkeley.edu/~arbelaez/UCM.html
• K-means useful for
summarization, building
dictionaries of patches,
general clustering.
• Agglomerative clustering
useful for segmentation,
general clustering.
• Spectral clustering useful for
determining relevance,
summarization, segmentation.
Which Clustering Algorithm to use?
• Synthetic dataset
Clustering algo. compared
http://hdbscan.readthedocs.io/en/latest/comparing_clustering_algorithms.html
• K-means, k=6
Clustering algo. compared
http://hdbscan.readthedocs.io/en/latest/comparing_clustering_algorithms.html
• Meanshift
Clustering algo. compared
http://hdbscan.readthedocs.io/en/latest/comparing_clustering_algorithms.html
• DBSCAN, ε=0.025
Clustering algo. compared
http://hdbscan.readthedocs.io/en/latest/comparing_clustering_algorithms.html
• Agglomerative Clustering, k=6, linkage=ward
Clustering algo. compared
http://hdbscan.readthedocs.io/en/latest/comparing_clustering_algorithms.html
• Spectral Clustering, k=6
Clustering algo. compared
http://hdbscan.readthedocs.io/en/latest/comparing_clustering_algorithms.html
Classification
Machine Learning Roadmap
Dimension
Reduction
Clustering
Regression Classification
continuous
(predicting a quantity)
discrete
(predicting a category)
supervised / unsupervised
• Given a set of samples x_i ∈ X and their ground-truth annotations y_i, learn a function y = f(x) that minimizes the prediction error E(y_j, f(x_j)) for new samples x_j ∉ X.
• The function y = f(x) is a classifier. Classifiers divide the input space into decision regions separated by decision boundaries.
Supervised Learning
(Figure: a 2D feature space (x_1, x_2) partitioned into decision regions R_1, R_2, R_3 by decision boundaries.)
• Spam detection:
• X = { characters and words in the email }
• Y = { spam, not spam}
• Digit recognition:
• X = cut out, normalized images of digits
• Y = {0,1,2,3,4,5,6,7,8,9}
• Medical diagnosis
• X = set of all symptoms
• Y = set of all diseases
Supervised Learning Examples
• Find a linear function to separate the classes
Linear Classifiers
• Logistic Regression
• Naïve Bayes
• Linear SVM
• Using a probabilistic approach to model data,
the distribution of P(X,Y): given data X, find the Y
that maximizes the posterior probability p(Y|X).
• Problem: we need to model all p(X|Y) and p(Y).
If | X | = n, there are 2n possible values for X.
• The Naïve Bayes assumption is that the x_i's are conditionally independent given Y.
Naïve Bayes Classifier
p(Y|X) = p(X|Y) p(Y) / p(X),   where  p(X) = Σ_Y p(X|Y) p(Y)
p(X_1, ..., X_n | Y) = Π_i p(X_i | Y)
• Given:
• Prior p(Y)
• n conditionally independent features,
represented by the vector X, given the class Y
• For each Xi, we have likelihood p(Xi | Y)
• Decision rule:
Naïve Bayes Classifier
Y* = argmax_Y  p(Y) p(X_1, ..., X_n | Y) = argmax_Y  p(Y) Π_i p(X_i | Y)
• For discrete Naïve Bayes, simply count:
• Prior:
• Likelihood:
• Naïve Bayes Model:
Maximum Likelihood for Naïve Bayes
Prior:  p(Y = y') = Count(Y = y') / Σ_y Count(Y = y)
Likelihood:  p(X_i = x' | Y = y') = Count(X_i = x', Y = y') / Σ_x Count(X_i = x, Y = y')
Naïve Bayes model:  p(Y|X) ∝ p(Y) Π_i p(X_i|Y)
• Conditional probability model over:
• Classifier:
Naïve Bayes Classifier
p(C_k | x_1, ..., x_n) = (1/Z) p(C_k) Π_{i=1}^{n} p(x_i | C_k)
ŷ = argmax_{k ∈ {1,...,K}}  p(C_k) Π_{i=1}^{n} p(x_i | C_k)
• Features X are entire document. Xi for ith word in
article. X is huge! NB assumption helps a lot!
Naïve Bayes for Text Classification
• Typical additional assumption: Xi's position in
document doesn't matter: bag of words.
aardvark 0
about 2
all 2
Africa 1
apple 0
...
gas 1
...
oil 1
...
Zaire 0
Naïve Bayes for Text Classification
• Learning Phase:
• Prior: p(Y), count how many documents in
each topic (prior).
• Likelihood: p(Xi|Y), for each topic, count how
many times a word appears in documents of
this topic.
• Testing Phase: for each document, use Naïve
Bayes' decision rule:
argmax_y  p(y) Π_{i=1}^{#words} p(x_i | y)
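A minimal bag-of-words sketch of the learning and testing phases above, on made-up toy documents; counting gives the prior p(y) and likelihood p(x_i|y), and the add-one smoothing is an extra assumption to avoid zero counts.

from collections import Counter, defaultdict
import math

train = [("win the vote", "politics"), ("win the game", "sports"),
         ("the game tonight", "sports"), ("vote for the bill", "politics")]

prior = Counter(y for _, y in train)                       # p(Y): document counts per topic
word_counts = defaultdict(Counter)                         # p(X_i|Y): word counts per topic
for doc, y in train:
    word_counts[y].update(doc.split())

vocab = {w for doc, _ in train for w in doc.split()}

def predict(doc):
    scores = {}
    for y in prior:
        total = sum(word_counts[y].values())
        score = math.log(prior[y] / len(train))
        for w in doc.split():                              # NB decision rule, in log space
            score += math.log((word_counts[y][w] + 1) / (total + len(vocab)))
        scores[y] = score
    return max(scores, key=scores.get)

print(predict("win the vote"))   # -> "politics" on this toy data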
Naïve Bayes for Text Classification
• Given 1000 training documents from each
group, learn to classify new documents
according to which newsgroup it came from.
• comp.graphics,
• comp.os.ms-windows.misc
• ...
• soc.religion.christian
• talk.religion.misc
• ...
• misc.forsale
• ...
Naïve Bayes for Text Classification
Naïve Bayes for Text Classification
• Usually, features are not conditionally independent:
• Actual probabilities p(Y|X) often bias towards 0 or 1
• Nonetheless, Naïve Bayes is the single most used
classifier.
• Naïve Bayes performs well, even when
assumptions are violated.
• Know its assumptions and when to use it.
Naïve Bayes Classifier Issues
p(X_1, ..., X_n | Y)  ≠  Π_i p(X_i | Y)
• Regression model for which the dependent
variable is categorical.
• Binomial/Binary Logistic Regression
• Multinomial Logistic Regression
• Ordinal Logistic Regression (categorical, but
ordered)
• Substituting x̃ = w_0 + w_1 x into the logistic function f(x̃) = 1 / (1 + e^{−x̃}), we get:
Logistic Regression
y(x, w) = 1 / (1 + e^{−(w_0 + w_1 x)})
• E.g., for predicting:
• mortality of injured patients,
• risk of developing a certain disease based on
observations of the patient,
• whether an American voter would vote
Democratic or Republican,
• probability of failure of a given process, system or
product,
• customer's propensity to purchase a product or
halt a subscription,
• likelihood of homeowner defaulting on mortgage.
When to use logistic regression?
• Hours studied vs passing the exam
Logistic Regression Example
P_pass(h) = 1 / (1 + e^{−(−4.0777 + 1.5046·h)})
• Learn p(Y|X) directly. Reuse
ideas from regression, but let y-
intercept define the probability.
• With normalization
Logistic Regression Classifier
p(Y = 1|X, w) ∝ exp(w_0 + Σ_i w_i X_i)   (exponential function)
With normalization using the logistic function y = 1 / (1 + exp(−x)):
p(Y = 0|X, w) = 1 / (1 + exp(w_0 + Σ_i w_i X_i))
p(Y = 1|X, w) = exp(w_0 + Σ_i w_i X_i) / (1 + exp(w_0 + Σ_i w_i X_i))
• Prediction: output the Y with the highest p(Y|X). For binary Y, output Y = 1 if:
Logistic Regression: decision boundary
p(Y = 0|X, w) = 1 / (1 + exp(w_0 + Σ_i w_i X_i)),   p(Y = 1|X, w) = exp(w_0 + Σ_i w_i X_i) / (1 + exp(w_0 + Σ_i w_i X_i))
1 < P(Y = 1|X) / P(Y = 0|X)
1 < exp(w_0 + Σ_{i=1}^{n} w_i X_i)
0 < w_0 + Σ_{i=1}^{n} w_i X_i
i.e., the decision boundary is the hyperplane  w_0 + w · X = 0
• Decision boundary: p(Y = 0 | X, w) = 0.5.
• The slope of the line defines how quickly the probabilities go to 0 or 1 around the decision boundary.
Visualizing p(Y = 0|X, w) = 1 / (1 + exp(w_0 + w_1 x_1))
• The decision boundary is defined by the y = 0 hyperplane.
Visualizing p(Y = 0|X, w) = 1 / (1 + exp(w_0 + w_1 x_1 + w_2 x_2))
• Generative (Naïve Bayes) loss function:
• Data likelihood
• Discriminative (logistic regression) loss function:
• Conditional Data likelihood
• Maximize conditional log likelihood!
Logistic Regression Param. Estimation
ln p(D|w) = Σ_{j=1}^{N} ln p(x_j, y_j | w) = Σ_{j=1}^{N} ln p(y_j | x_j, w) + Σ_{j=1}^{N} ln p(x_j | w)
ln p(D_Y | D_X, w) = Σ_{j=1}^{N} ln p(y_j | x_j, w)
• Maximize conditional log likelihood (Maximum
Likelihood Estimation, MLE):
• No closed-form solution.
• Concave function of w → no need to worry
about local optima; easy to optimize.
l(w) ≡ ln Π_j p(y_j | x_j, w) = Σ_j [ y_j (w_0 + Σ_i w_i x_i^j) − ln(1 + exp(w_0 + Σ_i w_i x_i^j)) ]
Logistic Regression Param. Estimation
• The conditional log-likelihood for logistic regression is concave in w!
• Gradient:  ∇_w l(w) = [ ∂l(w)/∂w_0, ..., ∂l(w)/∂w_n ]
• Gradient ascent update rule:  Δw = η ∇_w l(w),  i.e.,  w_i^{(t+1)} ← w_i^{(t)} + η ∂l(w)/∂w_i
• Simple, powerful, and used in many places (see the sketch below).
Logistic Regression Param. Estimation
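A minimal numpy sketch of gradient ascent on the conditional log-likelihood above; the toy 2D data, the learning rate η, and the iteration count are assumptions.

import numpy as np

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
y = np.concatenate([np.zeros(50), np.ones(50)])
Xb = np.hstack([np.ones((100, 1)), X])          # prepend a column of 1s for w0

w = np.zeros(3)
eta = 0.1
for _ in range(1000):
    p = 1.0 / (1.0 + np.exp(-Xb @ w))           # p(Y=1 | x, w)
    grad = Xb.T @ (y - p)                       # gradient of the conditional log-likelihood
    w += eta * grad / len(y)                    # gradient ascent step
print(w)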
• MLE tends to prefer large weights
• Higher likelihood of properly classified
examples close to decision boundary.
• Larger influence of corresponding features on
decision.
• Can cause overfitting!!!
Logistic Regression Param. Estimation
• Regularization to avoid large weights, overfitting.
• Add priors on w and formulate as Maximum a
Posteriori (MAP) optimization problem.
• Define the prior p(w) as a normal distribution with zero mean (and identity covariance); this pushes the parameters towards zero.
• MAP estimate:
Logistic Regression Param. Estimation
p(w | Y, X) ∝ p(Y | X, w) p(w)
w* = argmax_w  ln [ p(w) Π_{j=1}^{N} p(y_j | x_j, w) ]
• Logistic Regression in more general case, where
Y = { y1, ..., yR}. Define a weight vector wi for
each yi, i=1,...,R-1.
Logistic Regression for Discrete Classification
p(Y = 1|X) ∝ exp(w_{10} + Σ_i w_{1i} X_i)
p(Y = 2|X) ∝ exp(w_{20} + Σ_i w_{2i} X_i)
...
p(Y = r|X) = 1 − Σ_{j=1}^{r−1} p(Y = j|X)
• E.g., Y={0,1}, X = <X1, ..., Xn>, Xi continuous.
Naïve Bayes vs Logistic Regression
                                          Naïve Bayes (generative)   Logistic Regression (discriminative)
Number of parameters                      4n + 1                     n + 1
Parameter estimation                      uncoupled                  coupled
# training samples → ∞, model correct     good classifier            good classifier
# training samples → ∞, model incorrect   biased classifier          less-biased classifier
Training samples needed                   O(log N)                   O(N)
Training convergence speed                faster                     slower
Naïve Bayes vs Logistic Regression
• Examples from UCI Machine Learning dataset
Perceptron
• Invented in 1957 at the Cornell Aeronautical
Lab. Intended to be a machine instead of a
program that is capable of recognition.
• A linear (binary) classifier.
(Figure: the Mark I perceptron machine; inputs i_1, i_2, ..., i_n are weighted, summed, and passed through f to produce the output o.)
o = f( Σ_{k=1}^{n} i_k · w_k )
• Start with zero weights: w=0
• For t=1...T (T passes over data)
• For i=1...n (each training sample)
• Classify with current weights

(sign(x) is +1 if x>0, else -1)
• If correct, (i.e., y=yi), no change!
• If wrong, update
Binary Perceptron Algorithm
y = sign(w · x_i)
w ← w + y_i · x_i
(Figure: when the prediction is wrong, w is moved toward or away from x_i, e.g., w ← w + (−1) x_i.)
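A minimal numpy sketch of the binary perceptron update rule above, on made-up linearly separable data with ±1 labels.

import numpy as np

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(-2, 0.5, (30, 2)), rng.normal(2, 0.5, (30, 2))])
y = np.array([-1] * 30 + [+1] * 30)
Xb = np.hstack([np.ones((60, 1)), X])          # bias term folded into the weights

w = np.zeros(3)                                 # start with zero weights
for t in range(10):                             # T passes over the data
    for xi, yi in zip(Xb, y):
        if np.sign(w @ xi) != yi:               # classify with current weights
            w = w + yi * xi                     # if wrong, update w <- w + y_i x_i
print(w)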
Binary Perceptron example
(Figure: snapshots of the decision boundary after 0, 1, 2, 3, 5, 10, and 20 weight updates.)
• If we have more than two classes:
• Have a weight vector for each class wy
• Calculate an activation function for each class
• Highest activation wins
Multiclass Perceptron
activation_w(x, y) = w_y · x
y* = argmax_y  activation_w(x, y)
• Starts with zero weights
• For t=1, ..., T, i=1, ..., n (T times over data)
• Classify with current weights
• If correct (y=yi), no change!
• If wrong: subtract features xi from weights for
predicted class wy and add them to weights
for correct class wyi.
Multiclass Perceptron
y = argmax_y  w_y · x_i
If wrong:  w_y ← w_y − x_i  (predicted class),   w_{y_i} ← w_{y_i} + x_i  (correct class)
• Text classification example:
x = "win the vote" sentence
Multiclass Perceptron Example
                 BIAS   win   game   vote   the   ...
x                1      1     0      1      1
w_sports         -2     4     4      0      0
w_politics       1      2     0      4      0
w_tech           2      0     2      0      0

x · w_sports = 2,   x · w_politics = 7,   x · w_tech = 2   →   classified as "politics"
• The data is linearly separable with margin if:  ∃w s.t. ∀t,  y_t (w · x_t) > 0
Linearly separable (binary)
(Figure: two classes in the (x_1, x_2) plane separated by a line.)
• Assume the data is separable with margin γ:  ∃w* s.t. ||w*||₂ = 1 and ∀t, y_t (w* · x_t) ≥ γ
• Also assume there is a number R such that ∀t, ||x_t||₂ ≤ R.
• Theorem: the number of mistakes (parameter updates) made by the perceptron is bounded:  mistakes ≤ R² / γ²
Mistake Bound for Perceptron
• Noise: if the data isn't separable,
weights might thrash (averaging
weight vectors over time can help).
• Mediocre generalization: finds a
barely separating solution.
• Overtraining: test / hold-out
accuracy usually rises then falls.
Issues with Perceptrons
(Figure: separable vs non-separable cases; weight thrashing and a barely separating solution.)
• Find a linear function to separate the classes
Linear SVM Classifier
f(x) = g(w · x + b)
• Define the hyperplane tX − b = 0, where t is the normal to the hyperplane and X is the matrix of all data points. Minimize ||t|| s.t. tX − b produces the correct label for all X.
(Figure: two classes in the (x_1, x_2) plane separated by a line.)
• Find a linear function to separate the classes
Linear SVM Classifier
f(x) = g(w · x + b)
• Define the hyperplane tX − b = 0 as before; minimize ||t|| s.t. tX − b produces the correct label for all X.
(Figure: the separating line with the closest points highlighted as support vectors.)
• Some data sets are not linearly separable!
• Option 1:
• Use non-linear features, e.g., polynomial basis
functions
• Learn linear classifiers in a transformed, non-linear feature space
• Option 2:
• Use non-linear classifiers (decision trees,
neural networks, nearest neighbors)
Nonlinear Classifiers
• Assign label of nearest training data point to
each test data point.
Nearest Neighbor Classifier
Duda, Hart and Stork, Pattern Classification
K-Nearest Neighbor Classifier
(Figure: a 2D feature space (x_1, x_2) with 'x' and 'o' training points and '+' query points, showing the labels assigned under the 1-nearest, 3-nearest, and 5-nearest neighbor rules.)
• Data that are linearly separable work out great:
• But what if the dataset is just too hard?
• We can map it to a higher-dimensional space!
Nonlinear SVMs
(Figure: 1D data on the x axis is not separable; mapping each point to (x, x²) makes it separable.)
• Map the input space to some higher dimensional
feature space where the training set is
separable:
Nonlinear SVMs
φ : x → φ(x)
• The kernel trick: instead of explicitly computing
the lifting transformation
• This gives a non-linear decision boundary in the
original feature space:
• Common kernel function: Radial basis function
kernel.
Nonlinear SVMs
K(x_i, x_j) = φ(x_i) · φ(x_j)
Σ_i α_i y_i φ(x_i) · φ(x) + b  =  Σ_i α_i y_i K(x_i, x) + b
• Consider the mapping:
Nonlinear kernel example
φ(x) = (x, x²)
φ(x) · φ(y) = (x, x²) · (y, y²) = xy + x²y²
K(x, y) = xy + x²y²
• Histogram intersection kernel:  I(h_1, h_2) = Σ_{i=1}^{N} min(h_1(i), h_2(i))
• Generalized Gaussian kernel:  K(h_1, h_2) = exp( −(1/A) · D(h_1, h_2)² ),
  where D can be the (inverse) L1 distance, the Euclidean distance, the χ² distance, etc.
Kernels for bags of features
• Combine multiple two-class SVMs
• One vs others:
• Training: learn an SVM for each class vs the others.
• Testing: apply each SVM to test example and
assign it to the class of the SVM that returns the
highest decision value.
• One vs one:
• Training: learn an SVM for each pair of classes
• Testing: each learned SVM votes for a class to
assign to the test example.
Multi-class SVM
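A minimal sketch using scikit-learn's SVC, which trains the one-vs-one scheme internally for multi-class problems; the RBF kernel, C value, and toy data are assumptions.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(8)
X = np.vstack([rng.normal(c, 0.5, (40, 2)) for c in ((0, 0), (3, 0), (0, 3))])
y = np.repeat([0, 1, 2], 40)

clf = SVC(kernel="rbf", C=1.0, decision_function_shape="ovo")  # one-vs-one voting
clf.fit(X, y)
print(clf.predict([[0.2, 0.1], [2.8, 0.3], [0.1, 3.2]]))       # -> [0 1 2]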
• Pros:
• SVMs work very well in practice, even with very
small training sample sizes.
• Cons:
• No direct multi-class SVM; must combine two-class
SVMs.
• Computation and memory usage:
• Must compute matrix of kernel values for each
pair of examples.
• Learning can take a long time for large problems.
SVMs: Pros & Cons
• Prediction is done by sending the example down
the tree until a class assignment is reached.
Decision Tree Classifier
• Internal Nodes: each test a feature
• Leaf nodes: each assign a classification
• Decision Trees divide the feature space into axis-
parallel rectangles and label each rectangle with
one of the K classes.
Decision Tree Classifier
• Goal: find a decision tree that achieves minimum
misclassification errors on the training data.
• Brute-force solution: create a tree with one path
from root to leaf for each training sample.

(problem: just memorizing, won't generalize.)
• Find the smallest tree that minimizes error.

(problem: this is NP-hard.)
Training Decision Trees
1. Choose the best feature a* for the root of the tree.
2. Split training set S into subsets {S1, S2, ..., Sk}
where each subset Si contains examples having
the same value for a*.
3. Recursively apply the algorithm on each new
subset until all examples have the same class
label.
The problem is, what defines the "best" feature?
Top-down induction of Decision Tree
• Decision Tree feature selection based on
classification error.
Choosing Best Feature
Does not work well, since it doesn't reflect progress
towards a good tree.
• Choose the feature that gives the highest information gain (the X_j with the highest mutual information with Y):
  argmax_j I(X_j; Y) = argmax_j [ H(Y) − H(Y|X_j) ] = argmin_j H(Y|X_j)
• Define J̃(j) = H(Y|X_j) = Σ_x p(X_j = x) H(Y | X_j = x) to be the expected remaining uncertainty about Y after testing X_j (see the sketch below).
Choosing Best Feature
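A minimal sketch of this feature-scoring rule on a tiny made-up dataset, computing H(Y), H(Y|X_j), and the information gain by counting.

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

# toy data: each row is (x1, x2, y)
data = [(0, 0, 'no'), (0, 1, 'no'), (1, 0, 'yes'), (1, 1, 'yes'), (1, 1, 'yes'), (0, 0, 'no')]
Y = [y for *_, y in data]

for j in range(2):                                   # score each feature X_j
    H_cond = 0.0
    for v in set(row[j] for row in data):
        subset = [row[-1] for row in data if row[j] == v]
        H_cond += len(subset) / len(data) * entropy(subset)   # H(Y | X_j)
    print(f"feature x{j+1}: information gain = {entropy(Y) - H_cond:.3f}")
# x1 perfectly predicts y here, so it has the largest gain and becomes the root test.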
Ensembles: Combining Classifiers
1. Create T bootstrap samples, {S1, ..., ST} of S as
follows:
• For each Si, randomly draw |S| examples from
S with replacement.
• With large |S|, each Si will contain 1 - 1/e =
63.2% unique examples.
2. For each i=1, ..., T, hi = Learn (Si)
3. Output H = <{h1, ..., hT}, majority vote >
Bootstrap Aggregating (Bagging)
Leo Breiman, "Bagging Predictors", Machine Learning, 24, 123-140 (1996)
• A learning algorithm is unstable if small changes
in the training data produces large changes in
the output hypothesis.
• Bagging will have little benefit when used with
stable learning algorithms.
• Bagging works best when used with unstable
yet relatively accurate classifiers.
Learning Algorithm Stability
100 bagged decision trees
• Bagging: individual classifiers are independent
• Boosting: classifiers are learned iteratively
• Look at errors from previous classifiers to
decide what to focus on for the next iteration
over data.
• Successive classifiers depends upon its
predecessors.
• Result: more weights on "hard" examples, i.e.,
the ones classified incorrectly in the previous
iterations.
Boosting
• Consider E = <{h_1, h_2, h_3}, majority vote>.
• If h_1, h_2, h_3 each have an error rate less than e, the error rate of E is upper-bounded by g(e) = 3e² − 2e³ < e (for e < 0.5).
Error Upper Bound
(Figure: g(e) = 3e² − 2e³ plotted against e.)
• Hypothesis: a classifier ensemble of arbitrary accuracy can be built from weak classifiers.
Arbitrary Accuracy from Weak Classifiers
The original formulation of boosting learns too slowly.
Empirical studies show that Adaboost is highly effective.
• Adaboost works by learning many times on
different distributions over the training data.
• Modify learner to take distribution as input.
1. For each boosting round, learn on data set S
with distribution Dj to produce jth ensemble
member hj.
2. Compute the j+1th round distribution Dj+1 by
putting more weight on instances that hj made
mistake on.
3. Compute a voting weight wj for hj.
Adaboost
Adaboost Example
Credit: "A tutorial on boosting" by Yoav Freund and Rob Schapire.
Adaboost Example
Credit: "A tutorial on boosting" by Yoav Freund and Rob Schapire.
Adaboost Example
Credit: "A tutorial on boosting" by Yoav Freund and Rob Schapire.
Adaboost Example
Credit: "A tutorial on boosting" by Yoav Freund and Rob Schapire.
Adaboost Example
• Suppose the base learner L is a weak learner,
with error rate slightly less than 0.5 (better than
random guess)
• Training error goes to zero exponentially fast!!!
Adaboost Properties
Semi-supervised Learning
Machine Learning Roadmap
Dimension
Reduction
Clustering
Regression Classification
continuous
(predicting a quantity)
discrete
(predicting a category)
supervised / unsupervised
• Assume that class boundary should go through
low density areas.
• Having unlabeled data helps getting better
decision boundary.
Why can unlabeled data help?
supervised learning
semi-supervised learning
• Assume that each
class contains a
coherent group of
points (e.g., Gaussian)
• Having unlabeled data
points can help learn
the distribution more
accurately.
Why can unlabeled data help?
• Generative models:
• Use unlabeled data to more accurately
estimate the models.
• Discriminative models:
• Assume that p(y|x) is locally smooth
• Graph/manifold regularization
• Multi-view approach: multiple independent
learners that agree on unlabeled data
• Cotraining
Semi-Supervised Learning (SSL)
SSL Bayes Gaussian Classifier
Without SSL: optimize  p(X_l, Y_l | θ)
With SSL: optimize  p(X_l, Y_l, X_u | θ)
• In SSL, the learned θ needs to explain the unlabeled data well, too.
• Find the MLE or MAP estimate of the joint and marginal likelihood:
  p(X_l, Y_l, X_u | θ) = Σ_{Y_u} p(X_l, Y_l, X_u, Y_u | θ)
• Common mixture models used in SSL:
  • GMM
  • Mixture of Multinomials
SSL Bayes Gaussian Classifier
• Binary classification with GMM using MLE
• Using labeled data only, MLE is trivial:
• With both labeled and unlabeled data, MLE is
harder---use EM:
Estimating SSL GMM params
Labeled data only:
  log p(X_l, Y_l | θ) = Σ_{i=1}^{l} log [ p(y_i|θ) p(x_i|y_i, θ) ]
Labeled and unlabeled data:
  log p(X_l, Y_l, X_u | θ) = Σ_{i=1}^{l} log [ p(y_i|θ) p(x_i|y_i, θ) ] + Σ_{i=l+1}^{l+u} log ( Σ_{y=1}^{2} p(y|θ) p(x_i|y, θ) )
• Start with the MLE θ = {w, µ, Σ}_{1:2} on (X_l, Y_l):
  • w_c = proportion of class c
  • µ_c = sample mean of class c
  • Σ_c = sample covariance of class c
• The E-step: compute the expected label
  p(y | x, θ) = p(x, y | θ) / Σ_{y'} p(x, y' | θ)
  for all x ∈ X_u.
• The M-step: update the MLE θ with the (now labeled) X_u.
Semi-Supervised EM for GMM
• SSL is sensitive to assumptions!!!
• Cases when the assumption is wrong:
SSL GMM Discussions
博觀而約取，厚積而薄發
("Read widely but take in selectively; accumulate richly and release sparingly.")
-- 蘇軾《稼說》(Su Shi)

  • 1. Machine Learning Foundations 2018/02/05-06 @ Taiwan AI Academy Albert Y. C. Chen, Ph.D. Vice President, R&D Viscovery
  • 2. Albert Y. C. Chen, Ph.D. 陳彥呈博⼠士 albert@viscovery.com http://www.linkedin.com/in/aycchen http://slideshare.net/albertycchen • Experience 2017-present: Vice President of R&D @ Viscovery 2015-2017: Chief Scientist @ Viscovery 2015-2015: Principal Scientist @ Nervve Technologies 2013-2014 Senior Scientist @ Tandent Vision Science 2011-2012 @ GE Global Research, Computer Vision Lab • Education Ph.D. in Computer Science, SUNY-Buffalo M.S. in Computer Science, NTNU B.S. in Computer Science, NTHU
  • 3. When something is important enough, you do it even if the odds are not in your favor. Elon Musk Falcon 9 takeoff Falcon 9 decelerate Falcon 9 vertical touchdown
  • 4. What is “Machine Learning”? • Machine Learning (ML): • Human Learning: • Manual Programming: rules
  • 5. • Deterministic problems: repeat 1B times, still get the same answer, • problems lacking data, • problems with easily separable data. Manual Programming vs Machine Learning • Data with noise, • data of high dimension, • data of large volume, • data that changes over time. When to manual program? When to use machine learning?
  • 6. • Important concepts (lessons learned) from classical machine learning are still very important, from dimensionality, sampling, distance measures, error metrics, and generalization issues. • Understand how things work, why things worked in the past, and why previously unattainable problems are solved by Deep Learning. Deep Learning, directly?
  • 8. We present you, a simple & usable map for ML! Dimension Reduction Clustering Regression Classification continuous (predicting a quantity) discrete (predicting a category) supervisedunsupervised
  • 9. ML Roadmap, in more detail
  • 10. • Before we start, we need to estimate data distribution and develop sampling strategies, • figure out how to measure/quantify data, or, in other words, represent them as features, • figure out how to split data to training and validation set. • After we learn a model, we need to measure the fit, or the error on validation set. • Finally, how do we evaluate how well our trained model generalize. Steps for Supervised Learning
  • 11. Sampling & Distributions 😄 😃 🤪 😀 🤣 😂 😅😆 😁 ☺ 😊 😇 🙂 🙃 😉😌 😍 🤓 😎 🤩 😏 😬 🤠 😋 The importance of good sampling & distribution estimation. Population with attribute modeled by functionf : X ! Y X Y Learn from D = 😄 😃 🤪🤣 😂 🤩 😋 sample x 2 X, y 2 Y {(x1, y1), (x2, y2), ..., (xN , yN )} f0 incorrectly predicts that everyone else “smiles crazily” f0
  • 12. • The chances of getting a "perfect" sample of the population at first try is very very small. When the population is huge, this problem worsens. • Noise during the measurement process adds additional uncertainties. • As a result, it is natural to try multiple times, and formulate the problem in a probabilistic way. Sampling & Distributions
  • 13. • Joint probability of X taking the value xi and Y taking the value yi : • Marginalizing: probability that X takes the value xi irrespective of Y: Probability Theory yj nij xi } rj } ci p(X = xi, Y = yi) = nij N p(X = xi) = ci N , where ci = X j nij
  • 14. • Conditional Probability: the fraction of instances where Y = yj given that X = xi. • Product Rule: Probability Theory yj nij xi } rj } ci p(Y = yj|X = xi) = nij ci p(X = xi, Y = yj) = nij N = nij ci · ci N = p(Y = yj|X = xi)p(X = xi)
  • 15. • Bayes' Rule plays a central role in pattern recognition and machine learning. • From the product rule, together with the symmetric property we get: Bayes' Rule yj nij xi } rj } ci p(X, Y ) = p(Y, X) p(Y |X) = p(X|Y )p(Y ) p(X) , where p(X) = X Y p(X|Y )p(Y )
  • 16. • p(Y = a) = 1/4, p(Y = b) = 3/4 • p(X = blue | Y = a) = 3/5 • p(X = green | Y = a) = 2/5 When we randomly draw a ball that is blue, the probability that it comes from Y=a is? Bayes' Rule Example 1 Y=a Y=b p(Y = a|X = blue) = p(X = blue|Y = a)p(Y = a) p(X = blue) = p(X = blue|Y = a)p(Y = a) (p(X = blue|Y = a)p(Y = a) + (p(X = blue|Y = b)p(Y = b) = 3 5 · 1 4 3 5 · 1 4 + 2 5 · 3 4 = 3 20 3 20 + 6 20 = 3 20 9 20 = 1 3
  • 17. • Monty Hall problem • Prize behind one of the three doors. After choosing door 1, the host opens empty door 3 and asks if you want to switch your choice. Should you switch? Bayes' Rule Example 2 1 2 3 ? Behind door 1 Behind door 2 Behind door 3 Result if staying at door 1 Result if switching to the door offered Car Goat Goat Wins car Wins goat Goat Car Goat Wins goat Wins car Goat Goat Car Wins goat Wins car
  • 18. When we measure the wrong features, we’ll need very complicated classifiers, and the results are still not ideal. Features baseball tennis ball vs There’s always “exceptions” that would ruin our perfect assumptions yellow baseball? we learn the best features from data with deep learning.
  • 19. • More features ≠ better: number of features*N, feature space grows by ^N, the number of samples needed for ML grows proportionally as well. The curse of dimensionality
  • 20. • Most of the volume of an n-D sphere is concentrated in a thin shell near the surface!!! • nD sphere of , the volume of sphere between and is: The curse of dimensionality r = 1 r = 1 ✏ r = 1 1 (1 ✏)D
  • 21. • The curse of dimensionality not just effects the feature space, but also input, output, and others. • Much more challenging to train a good n-class classifier, e.g., face recognition, 1-to-1 verification vs 1-to-n identification. • Much more issues arise from using a general purpose 1M-class classifier vs problem specific 1k-class classifier. High-dim. issue is prevalent
  • 22. Recognition Accuracy: • 1 to 1: 99%+ • 1 to 100: 90% • 1 to 10,000: 50%-70%. • 1 to 1M: 30%. LFW dataset, common FN↑, FP↓ Prevalent high-dim issue, eg.1 • 1-to-N face identification, in the wild!
  • 23. Prevalent high-dim issue, eg.2 • Smart photo album, with Google Cloud Vision Distance between histograms of 1M bins is very close to 0 for most of the time.
  • 24. • Real data will often be confined to a region of the space having lower effective dimensionality. • Data will typically exhibit some smoothness properties (at least locally). Living with high dimensions E.g., Low-dimensional “manifold” of faces, embedded within a high-dim space. Keywords: • dimension reduction, • learned features, • manifold learning.
  • 25. • k-fold cross validation Splitting data 😄😃 🤪😀 🤣 😂😅😆 😁 ☺😊 😇🙂🙃 😉😌 😍🤓 😎🤩 😏 😬🤠 😋 Repurposing the smily faces figures to represent the set of annotated data. 😄 😃 🤪 😀 🤣 😂 😅😆 😁 ☺ 😊 😇 🙂 🙃 😉😌 😍 🤓 😎 🤩 😏 😬 🤠 😋 Randomly split into k groups
  • 26. • Minimizing the misclassification rate • Minimizing the expected loss • The reject option Decision Theory
  • 27. • Decision boundary, or simply, in 1D, a threshold, s.t. anything larger than the threshold are classified as a class, and smaller than the threshold as another class. Decision Boundary
  • 28. • Different metrics & names used in different fields for measuring ML performance; however, the common cornerstones are: • True positive (TP): sample is an apple, classified as an apple. • False positive (FP): sample is not an apple, but classified as an apple. • True negative (TN): sample is not an apple, classified as not an apple. • False negative (FN): sample is an apple, but misclassified as "not an apple. True/False, Positive/Negative
  • 29. • Precision: 
 
 Classifier identified (TP+FP) apples, only TP are apples. (aka positive predictive value.) • Recall:
 
 Total (TP+FN) apples, classifier identified TP. 
 (aka, hit rate, sensitivity, true positive rate) Precision vs Recall TP TP + FP TP TP + FN
  • 30. • F-measure: 
 
 harmonic mean of precision and recall. F- measure is criticized outside Information Retrieval field for neglecting the true negative. • Accuracy (ACC): 
 
 a weighted arithmetic mean of precision and inverse precision, as well as the weighted arithmetic mean of recall and inverse recall. A single balanced metric? TP + TN TP + TN + FP + FN 2 · precision · recall precision + recall
  • 31. Multi-objective Optimization e.g., micro air vehicle wing design
  • 32. • Different types of errors are weighted differently; e.g., medical examinations, minimize false negative but can tolerate false positive. • Reformulate objectives from maximizing probability to minimizing weighted loss functions. • The reject option: refrain from making decisions on difficult cases (e.g., for samples within a certain region inside the decision boundary.) Minimizing the expected loss
  • 33. • Minimizing Training and Validation Error, v.s. minimizing Testing Error. • Memorizing every “practice exam” question ≠ doing well on new questions. Avoid overfitting. Generalization E.g., training a classifier that recognizes trees
  • 34. Odd trees of the world
  • 35. Odd trees of the world
  • 36. Odd trees of the world
  • 37. • Bias: • Difference between the expected (or averaged) prediction of our model and the correct value. • Error due to inaccurate assumptions/ simplifications. • Variance: • Amount that the estimate of the target function will change if different training data was used. Generalization Error
  • 39. • Model is too simple to represent all the relevant class characteristics. • High bias (few degrees of freedom, DoF) and low variance. • High training error and high test error. Underfitting
  • 40. • Model is too complex and fits irrelevant noise in the data • Low bias, high variance • Low training error, high test error Overfitting
  • 41. Error (mean square error, MSE) 
 = noise2 + bias2 + variance Bias-Variance Trade-off unavoidable error error due to incorrect assumptions made about the data error due to variance of training samples
  • 43. Training Sample vs Model Complexity Slide credit: D. Hoiem
  • 44. Effect of Training Sample Size Slide credit: D. Hoiem
  • 45. • Models: describe relationship between variables • Deterministic models: hypothesize exact relationships, OK when noise is negligible • Probabilistic models: deterministic part + random error. For example: • Regression models: one dependent variable + one or more numerical or categorical independent (explanatory) variable. • Correlation models: multiple independent variables. How do we learn models?
  • 46. Generative vs Discriminative Models Discriminative Model: directly learn the data boundary Generative Model: represent the data and boundary
  • 47. • Learn to directly predict labels from the data • Often uses simpler boundaries (e.g., linear) for hopes of better generalization. • Often easier to predict a label from the data than to model the data. • E.g., • Logistic Regression • Support Vector Machines • Max Entropy Markov Model • Conditional Random Fields Discriminative Models
  • 48. • Represent both the data and the boundary. • Often use conditional independence and priors. • Modeling data is challenging; need to make and verify assumptions about data distribution • Modeling data aids prediction & generalization. • E.g., • Naive Bayes • Gaussian Mixture Model (GMM) • Hidden Markov Model • Generative Adversarial Networks (GAN) Generative Models
  • 49. • Bernoulli Distribution • Uniform Distribution • Binomial Distribution • Normal Distribution • Poisson Distribution • Exponential Distribution Distributions
  • 50. Dimension Reduction Machine Learning Roadmap Dimension Reduction Clustering Regression Classification continuous (predicting a quantity) discrete (predicting a category) supervisedunsupervised
  • 51. • Goal: try to find a more compact representation of the data • Assume that the high dimensional data actually reside in an inherent low- dimensional space. • Additional dimensions are
 just random noise • Goal is to recover these inherent dimensions and discard noise. Unsupervised Dimension Reduction
  • 52. • Create a basis where the axes represent the dimensions of variance, from high to low. • Finds correlations in data dimensions to product best possible lower-dimensional representation based on linear projections. Principal Component Analysis (PCA)
  • 53. PCA
  • 54. PCA algorithm, conceptual steps • Find a line s.t. when data is projected onto the line, it has the maximum variance.
  • 55. • Find new line orthogonal to the first that has the maximum projected variance. PCA algorithm, conceptual steps
  • 56. • Repeated until d lines. The projected position of a point on these lines gives the coordinates in the m-dimensional reduced space. • Computing these set of lines is achieved by eigen-decomposition of the covariance matrix. PCA algorithm, conceptual steps
  • 57. • Given n data points: x1, ..., xn • Consider a linear projection specified by v • The projection of x onto v is • The variance of the projected data is • The 1st Principal Component maximizes the variance subject to the constraint PCA, maximizing variance z = vT x var(z) = var(vT xv) = vT var(x)v = vT Sv
  • 58. • Maximize , subject to • Lagrange: • is the eigen-vector of S with eigen-value • Sample variance of the projected data • The eigen-values equals the amount of variance captured by each eigen-vector PCA, maximizing variance vT Sv vT v = 1 vT Sv (vT v 1) d dv = 0 ! Sv = v v vT Sv = vT v =
  • 59. • View PCA as minimizing the reconstruction error of using a low-dimensional approximation of the original data: Alternative view of PCA x1 ⇡ x0 + z1 u x2 ⇡ x0 + z2 u
  • 60. • Calculate the covariance matrix of the data S • Calculate the eigen-vectors/eigen-values of S • Rank the eigen-values in decreasing order • Select eigen-vectors that retain a fixed % of the variance, e.g., 80%, s.t., Dimension Reduction using PCA Pd i=1 i P i i 80%
  • 61. PCA example: Eigenfaces Mean face Basis of variance (eigenvectors) M. Turk; A. Pentland (1991). "Face recognition using eigenfaces". Proc. IEEE Conference on Computer Vision and Pattern Recognition. pp. 586–591.
  • 62. The ATT face database (formerly the ORL database), 10 pictures of 40 subjects each
  • 63. • Covariance of the image data is big. Finding eigenvector of large matrices is slow. • Singular Value Decomposition (SVD) can be used to compute principal components. • SVD steps: • Create centered data matrix X • Solve: X = USVT • Columns of V are the eigenvectors of sorted from largest to smallest eigenvalues. PCA, scaling up ⌃
  • 66. • Useful preprocessing for easing the "curse of dimensionality" problem. • Reduced dimension: simpler hypothesis space • Smaller VC dimension: less overfitting • PCA can also be seen as noise reduction • Fails when data consists of multiple separate clusters PCA discussion
  • 67. • Also named Fisher Discriminant Analysis • It can be viewed as • a dimension reduction method, • a generative classifier p(x|y), Gaussian with distinct for each class but shared . Linear Discriminant Analysis (LDA) µ ⌃ classes mixed better separation
  • 68. • Find a project direction so that the separation between classes is maximized. • Objective 1: maximize the distance between the projected means of different classes LDA Objectives m1 = 1 N1 X x2C1 x m2 = 1 N2 X x2C2 x original means: projected means: m0 1 = 1 N1 X x2C1 wT x m0 2 = 1 N2 X x2C2 wT x
  • 69. • Objective 2: minimize scatter (variance within class) LDA Objectives s2 i = X x2Ci (wT x m0 i)2Total within class scatter for projected class i: Total within class scatter: s2 1 + s2 2
  • 70. • There are a number of different ways to combine the two objectives. • LDA seeks to optimize the following objective: LDA Objective
  • 72. LDA for two classes w = S 1 w (m1 m2)
  • 73. • Objective remains the same, with slightly different definition for between-class scatter: • Solution: k-1 eigenvectors of LDA for Multi-Classes J(w) = wT SBw wTSww SB = 1 k kX i=1 (mi m)(mi m)T S 1 w SB
  • 74. • Data often lies on or near a nonlinear low-dimensional curve. • We call such a low-d structure manifolds • Algorithms include: ICA, LLE, Isomap. Nonlinear Dimension Reduction swiss roll data
  • 75. • A non-linear method for dimensionality reduction • Preserves the global, nonlinear geometry of the data by preserving the geodesic distances. • Geodesic: shortest route between two points on the surface of a manifold. ISOMAP: Isometric Feature Mapping
  • 76. 1. Approximate the geodesic distance between every pair of points in the data. • The manifold is locally linear • Euclidean distance works well for points that are close enough. • For points that are far apart, their geodesic distance can be approximated by summing up local Euclidean distances. 2. Find a Euclidean mapping of the data that preserves the geodesic distance. ISOMAP algorithm
  • 77. • Construct a graph by: • Connecting i and j if: • d(i,j) < (if computing -isomap), or • i is one of j's k nearest neighbors (k-isomap) • Set the edge weight equal d(i,j) - Euclidean distance • Compute the Geodesic distance between any two points as the shortest path distance. Geodesic Distance " "
  • 78. • We can use Multi-Dimensional Scaling (MDS), a class of statistical techniques that: • Given: • n x n matrix of dissimilarities between n objects • Outputs: • a coordinate configuration of the data in low-d space Rd whose Euclidean distances closely match given dissimilarities. Compute low-dimensional mapping
  • 79. ISOMAP on Swiss Roll Data
  • 82. Regression Machine Learning Roadmap Dimension Reduction Clustering Regression Classification continuous (predicting a quantity) discrete (predicting a category) supervisedunsupervised
  • 83. • Unit-less, normalized between [-1, 1] Pearson’s Correlation Coefficient Y X Y X Y X Y X Y X r = -1 r = -.6 r = 0 r = +.3r = +1 Y X r = 0 Figures modified from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall r = cov(x, y) p var(x) p var(y)
  • 84. Linear Correlations Y X Y X Linear relationships Y Y X X Curvilinear relationships Y X Y X Strong relationships Y Y X X Weak relationships Y X No relationship Y X
  • 85. • In correlation, two variables are treated as independent. • In regression, one variable (x) is independent, while the other (y) is dependent. • Goal: if you know something about x, this would help you predict something about y. Regression
  • 86. • Expected value at a given level of x: • Predicted value for a new x: Simple Linear Regression y x random error that follows a normal distribution with 0 mean and variance " 2 fixed exactly on the line y = w0 + w1x y0 = w0 + w1x + " w0 w0/w1
  • 87. Multiple Linear Regression y(x, w) = w0 + w1x1 + · · · + wDxD w0, ..., wD xi • Linear function of parameters , also a linear function of the input variables , has very restricted modeling power (can't even fit curves). • Assumes that: • The relationship between X and Y is linear. • Y is distributed normally at each value of X. • The variance of Y at each value of X is the same. • The observations are independent.
  • 88. • Before going further, let’s take a look at polynomial line fitting (polynomial regression.) Linear Regression Given N=10 blue dots, try to find the function that is used for generating the data points. sin(2⇡x)
  • 89. • Polynomial line fitting: • M is the order of the polynomial • linear function of the coefficients • nonlinear function of • Objective: minimize the error between the predictions and the target value of Polynomial Regression x w y(xn, w) tn xn ERMS = p 2E(w⇤)/Nor, the root-mean-square error E(w) = 1 2 NX n=1 {y(xn, w) tn} 2 y(x, w) = w0 + w1x + w2x2 + · · · + wM xM + "
  • 91. • There's only 10 data points, i.e., 9 degrees of freedom; we can get 0 training error when M=9. • Food for thought: make sure your deep neural network's is not just "memorizing the training data when its M >> data's DoF. Polynomial regression w. var. M
  • 92. • With M=9, but N=15 (left) and N=100, the over- fitting problem is greatly reduced. • ML is all about balancing M and N. One rough heuristic is that N should be 5x-10x of M (model complexity, not necessarily the number of param.) What happens with more data?
  • 93. • Regularization: used for controlling over-fitting. • E.g., discourage coefficients from reaching large values:
 
 
 
 where Regularization ˜E(w) = 1 2 NX n=1 {y(xn, w) tn} 2 + 2 ||w||2 ||w||2 = wT w = w2 0 + w2 1 + · · · + w2 M
  • 94. • Extending linear regression to linear combinations of fixed nonlinear functions:
 
 
 
 where • Basis functions: act as "features" in ML. • Linear basis function: • Polynomial basis function: • Gaussian basis function • Sigmoid basis function Linear Models for Regression y(x, w) = M 1X j=0 wj (x) w = (w0, . . . , wM 1)T , = ( 0, . . . , M 1)T { j(x)} j(x) = xj j(x) = xj
  • 95. • Global functions of the input variable, s.t. changes in one region of input space affect all other regions. Polynomial Basis Functions j(x) = xj
  • 96. • Local functions, a small change in x only affect nearby basis functions. • and control the location and scale (width). Gaussian Basis Functions j(x) = exp ⇢ (x µj)2 2s2 µj s
  • 97. • Local functions, a small change in x only affect nearby basis functions. • and control the location and scale (slope). Sigmoidal Basis Functions µj s j(x) = ✓ x µj s ◆ (a) = 1 1 + exp( a) where
  • 98. • Adding a regularization term to an error function: • One of simplest forms of regularizer is sum-of- squares of the weight vector elements: • This type of weight decay regularizer (in ML), a.k.a., parameter shrinkage (in statistics) encourages weight values to decay towards zero, unless supported by the data. Regularized Least Squares EW (w) = 1 2 wT w ED(w) + EW (w)
  • 99. • A more general regularizer in the form of: • q=2 is the quadratic regularizer (last page). • q=1 is known as lasso in statistics. Regularized Least Squares 1 2 NX n=1 tn wT (xn) 2 + 2 MX j=1 |wj|q sum of squared error generalized regularizer,
  • 100. • LASSO: least absolute shrinkage and selection operator • When is sufficiently large, some of the coefficients are driven to zero, leading to a sparse model LASSO wj
  • 102. • Large values of : small variance but large bias • Small values of : large variance, small bias The Bias-Variance Tradeoff
  • 103. Clustering Machine Learning Roadmap Dimension Reduction Clustering Regression Classification continuous (predicting a quantity) discrete (predicting a category) supervisedunsupervised
  • 104. • Group together similar points and represent them with a single token. • Issues: • How do we define two points/images/patches being "similar"? • How do we compute an overall grouping from pairwise similarity? Clustering
  • 105. • Grouping pixels of similar appearance and spatial proximity together; there's so many ways to do it, yet none are perfect. Clustering Example
  • 107. • Summarizing Data • Look at large amounts of data • Patch-based compression or denoising • Represent a large continuous vector with the cluster number • Counting • Histograms of texture, color, SIFT vectors • Segmentation • Separate the image into different regions • Prediction • Images in the same cluster may have the same labels Why do we cluster?
  • 108. • K-means • Iteratively re-assign points to the nearest cluster center • Gaussian Mixture Model (GMM) Clustering • Mean-shift clustering • Estimate modes of pdf • Hierarchical clustering • Start with each point as its own cluster and iteratively merge the closest clusters • Spectral clustering • Split the nodes in a graph based on assigned links with similarity weights How do we cluster?
  • 109. • Goal: cluster to minimize variance in data given clusters while preserving information. Clustering for Summarization c⇤ , ⇤ = argmin c, 1 N NX j=0 KX i=0 i,j(ci xj)2 cluster center data Whether is assigned toxj ci
  • 110. • Euclidean Distance: • Cosine similarity: How do we measure similarity? ✓ = arccos ✓ xy |x||y| ◆ x y ||y x|| = p (y x) · (y x) distance(x, y) = p (y1 x1)2 + (y2 x2)2 + · · · + (yn xn)2 = v u u t nX i=1 (yi xi)2 x · y = ||x||2 ||y||2 cos ✓ similarity(x, y) = cos(✓) = x · y ||x||2 ||y||2
  • 111. • Compare distance of closest (NN1) and second closest (NN2) feature vector neighbor. • If NN1≈NN2, ratio NN1/NN2 will be ≈1 → matches too close. • As NN1 << NN2, ratio NN1/NN2 tends to 0. • Sorting by this ratio puts matches in order of confidence. Nearest Neighbor Distance Ratio
  • 112. • How to threshold the nearest neighbor ratio? Nearest Neighbor Distance Ratio Lowe IJCV 2004 on 40,000 points. Threshold depends on data and specific applications
  • 113. 1. Randomly select k initial cluster centers 2. Assign each point to nearest center
 
 3. Update cluster centers as the mean of the points
 
 4. repeat 2-3 until no points are re-assigned. k-means clustering t = argmin 1 N NX j=1 KX i=1 i,j ct 1 i xj 2 ct = argmin c 1 N NX j=1 KX i=1 t i,j (ci xj) 2
  • 115. • Initialization • Randomly select K points as initial cluster center • Greedily choose K points to minimize residual • Distance measures • Euclidean or others? • Optimization • Will converge to local minimum • May want to use the best out of multiple trials k-means: design choices
  • 116. • Cluster on one set, use another (reserved) set to test K. • Minimum Description Length (MDL) principal for model comparison. • Minimize Schwarz Criterion, a.k.a. Bayes Information Criteria (BIC) • (When building dictionaries, more clusters typically work better.) How to choose k
  • 117. • Generative • How well are points reconstructed from the cluster? • Discriminative • How well do the clusters correspond to labels (purity) How to evaluate clusters?
  • 118. • Pros • Finds cluster center that minimize conditional variance (good representation of data) • simple and fast • easy to implement k-means pros & cons
  • 119. • Cons • Need to choose K • Sensitive to outliers • Prone to local minima • All clusters have the same parameters • Can be slow. Each iteration is O(KNd) for N d- dimensional points k-means pros & cons
  • 120. • Clusters are spherical • Clusters are well separated • Clusters are of similar volumes • Clusters have similar number of points k-means works if
  • 121. • Hard assignments, or probabilistic assignments? • Case against hard assignments: • Clusters may overlap • Clusters may be wider than others • Can use a probabilistic model, • Challenge: need to estimate model parameters without labeled Ys. GMM Clustering P(X|Y )P(Y )
  • 122. • Assume m-dimensional data points • still multinomial, with k classes • are k multivariate Gaussians Gaussian Mixture Models P(Y ) P(X|Y = i), i = 1, · · · , k P(X = x|Y = i) = 1 p (2⇡)m|⌃i| exp ✓ 1 2 (x µi)T ⌃ 1 (x µi) ◆ mean (m-dim vector) variance (m*m matrix) determinant of matrix
  • 123. • Estimating parameters (when given data label Y) • Solve optimization problem: • MLE has closed form solution: • i.e., solve • Estimating parameters (without ), solve: : all model param including mean, variance, etc. Maximum Likelihood Estimation (MLE) P(X = x|Y = i) = 1p (2⇡)m|⌃i| exp 1 2 (x µi)T ⌃ 1 (x µi) µML = 1 n Pn i=1 xi ⌃ML = 1 n Pn i=1(xi µML)(xi µML)T argmax✓ Q j P(yj , xj ; ✓) yj argmax✓ Q j P(xj , ✓) = argmax Q j Pk i=1 P(yj = i, xj ; ✓) ✓
  • 124. • Maximize marginal likelihood • Almost always a hard problem • Usually no closed form solution • Even when is convex, generally isn't • For all but the simplest , we will have to do gradient ascent, in a big messy space with lots of local optimum. Solving MLE for GMM Clustering argmax✓ Q j P(xj , ✓) = argmax Q j Pk i=1 P(yj = i, xj ; ✓) P(X, Y ; ✓) P(X; ✓) P(X; ✓)
  • 125. • Simple example: GMM with 1D data, k=2 Gaussians, variance=1, distribution over classes is uniform, only need to estimate , . Solving MLE for GMM Clustering µ1 µ2 nY j=1 kX i=1 P(X = xj , Y = i) / nY j=1 kX i=1 exp ✓ 1 2 2 (xj µi)2 ◆ • Skipping the derivations.... still need to differentiate and solve for , and P(Y=1) for i=1...k. There are still no closed form solution, gradient is complex with lots of local optimum. µi ⌃i
  • 126. • Expectation Maximization • Objective: • Data: • E-step: For all examples j and values i for y, compute: • M-step: re-estimate the parameters with weighted MLE estimates, set: Solving MLE for GMM Clustering argmax ✓ Y j kX i=1 P(yj = i, xj |✓) = X j log kX i=1 P(yj = i, xj |✓) {xj |j = 1 . . . n} P(yj = i|xj , ✓) ✓ = argmax✓ P j Pk i=1 P(yj = i|xj , ✓) log P(yj = i, xj |✓)
  • 127. EM for GMM MLE example 1 2 3 4 5 6
  • 128. • EM after 20 iterations EM for GMM MLE example
  • 129. • GMM for some bio assay data EM for GMM MLE example
  • 130. EM for GMM MLE example • GMM for some bio assay data, fitted separately for three diffrent compounds.
  • 131. • GMM with hard assignments and unit variance, EM is equivalent to k-means clustering algorithm!!! • EM, like k-NN, uses coordinate ascent, and can get stuck in local optimum. EM for GMM Clustering, notes
  • 132. • mean-shift seeks modes of a given set of points 1. Choose kernel and bandwidth 2. For each point: 1. center a window on that point 2. compute the mean of the data in the search window 3. center the search window at the new mean location, repeat 2,3 until converge. 3. Assign points that lead to nearby modes to the same cluster. Mean-Shift Clustering
  • 133. • Try to find modes of a non-parametric density Mean-shift algorithm Color space Color space clusters
  • 134. • Attraction basin: the region for which all trajectories lead to the same mode. • Cluster: all data points in the attraction basin of a mode. Attraction Basin Slides by Y. Ukrainitz & B. Sarel
  • 135. Mean Shift region of interest mean-shift vector center of mass
  • 139. • Kernel density estimation function • Gaussian kernel Kernel Density Estimation ˆfh(x) = 1 nh nX i=1 K ✓ x xi h ◆ K ✓ x xi h ◆ = 1 p 2⇡ e (x xi)2 2h2
  • 140. • Compute mean shift vector m(x) • Iteratively translate the kernel window y m(x) until convergence Computing the Mean Shift m(x) = 2 4 Pn i=1 xig ⇣ ||x xi||2 h ⌘ Pn i=1 g ⇣ ||x xi||2 h ⌘ x 3 5
  • 141. • Mean-shift can also be used as clustering-based image segmentation. Mean-Shift Segmentation D. Comaniciu and P. Meer, Mean Shift: A Robust Approach toward Feature Space Analysis, PAMI 2002.
  • 142. • Compute features for each pixel (color, gradients, texture, etc.). • Set kernel size for features and position . • Initialize windows at individual pixel locations. • Run mean shift for each window until convergence. • Merge windows that are within width of and . Mean-Shift Segmentation Color space Color space clusters Kf Ks Kf Ks
  • 143. • Speedups: • binned estimation • fast neighbor search • update each window in each iteration • Other tricks • Use kNN to determine window sizes adaptively Mean-Shift
  • 144. • Pros • Good general-practice segmentation • Flexible in number and shape of regions • robust to outliers • Cons • Have to choose kernel size in advance • Not suitable for high-dimensional features Mean-Shift pros & cons
  • 145. • DBSCAN: Density-based spatial clustering of applications with noise. • Density: number of points within a specified radius (ε-Neighborhood) • Core point: a point with more than a specified number of points (MinPts) within ε. • Border point: has fewer than MinPts within ε, but is in the neighborhood of a core point. • Noise point: any point that is not a core point or border point. DBSCAN MinPts=4 p is core point q is border point o is noise point q p " " o
  • 146. • Density-reachable: p is density- reachable from q w.r.t. ε and MinPts if there is a chain of objects p1, ..., pn with p1=q and pn=p, s.t. pi+1 is directly density- reachable from pi w.r.t. ε and MinPts for all • Density-connectivity: p is density-connected to q w.r.t. ε and MinPts if there is an object o, s.t. both p and q are density- reachable from o w.r.t. ε and MinPts. DBSCAN 1  i  n
  • 147. • Cluster: a cluster C in a set of objects D w.r.t. ε and MinPts is a non-empty subset of D satisfying • Maximality: for all p,q, if p ∈ C and if q is density reachable from p w.r.t. ε. • Connectivity: for all p,q ∈ C, p is density- connected to q w.r.t. ε and MinPts in D. • Note: cluster contains core & border points. • Noise: objects which are not directly density- reachable from at least one core object. DBSCAN clustering
  • 148. 1. Select a point p 2. Retrieve all points density-reachable from p w.r.t. ε and MinPts. 1. if p is a core point, a cluster is formed 2. if p is a border point, no points are density reachable from p and DBSCAN visits the next point of the database 3. continue 1,2, until all points are processed. (result independent of process ordering) DBSCAN clustering algorithm
  • 149. • Heuristic: for points in a cluster, their kth nearest neighbors are at roughly the same distance. • Noise points have the kth nearest neighbor at farthest distance. • So, plot sorted distance of every point to its kth nearest neighbor. DBSCAN parameters sharp change; good candidate for ε and MinPts.
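  A minimal sketch of this k-distance heuristic (toy data and k = MinPts = 4 are our own choices); the sharp bend in the sorted curve suggests a value for ε:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.2, (100, 2)), rng.normal(2, 0.2, (100, 2))])

k = 4                                           # use k = MinPts
nn = NearestNeighbors(n_neighbors=k + 1).fit(X) # +1: the query point itself comes back at distance 0
dists, _ = nn.kneighbors(X)
kth = np.sort(dists[:, -1])                     # each point's k-th nearest-neighbor distance, sorted

plt.plot(kth)
plt.xlabel("points sorted by k-th NN distance")
plt.ylabel("k-th nearest neighbor distance")
plt.show()                                      # the "elbow" is a good candidate for eps
```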
  • 150. DBSCAN pros & cons • Pros • No need to decide K beforehand. • Robust to noise, since it doesn't require every point to be assigned or the data to be fully partitioned. • Scales well to large datasets (roughly O(n log n) on average with a spatial index). • Stable across runs and different data orderings. • Cons • Trouble when clusters have different densities. • ε may be hard to choose.
  • 151. • Agglomerative clustering vs. divisive clustering Hierarchical Clustering
  • 152. • Method: 1. Every point is its own cluster 2. Find closest pair of clusters, merge into one 3. repeat • The definition of closest is what differentiates various flavors of agglomerative clustering algorithms. Agglomerative Clustering
  • 153. Agglomerative Clustering • How to define the linkage/cluster similarity? • Maximum or complete-linkage clustering (a.k.a. farthest-neighbor clustering) • Minimum or single-linkage clustering (a.k.a. nearest-neighbor clustering) • Average linkage clustering (UPGMA) • Centroid linkage clustering (UPGMC) • Minimum energy clustering • Sum of all intra-cluster variance • Increase in variance for the clusters being merged (Figure: single linkage, complete linkage, average linkage, centroid linkage)
  • 154. • How many clusters? • Clustering creates a dendrogram (a tree) • Threshold based on max number of clusters or based on distance between merges. Agglomerative Clustering
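  A minimal sketch of building and cutting the dendrogram with SciPy (toy data, the complete-linkage choice, and the thresholds are our own):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(3, 0.3, (30, 2))])

Z = linkage(X, method="complete")                    # complete-linkage merge sequence
k_labels = fcluster(Z, t=2, criterion="maxclust")    # cut the tree at k = 2 clusters
d_labels = fcluster(Z, t=1.5, criterion="distance")  # or cut at a merge-distance threshold

dendrogram(Z)                                        # visualize the hierarchy
plt.show()
```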
  • 155. • Pros • Simple to implement, widespread application • Clusters have adaptive shapes • Provides a hierarchy of clusters • Cons • May have imbalanced clusters • Still have to choose the number of clusters or thresholds • Need to use an ultrametric to get a meaningful hierarchy Agglomerative Clustering
  • 156. • Group points based on links in a graph Spectral Clustering A B
  • 157. Spectral Clustering • Normalized Cut: a cut in a graph that penalizes large segments • Fix by normalizing for the size of the segments: Normalized Cut(A, B) = cut(A, B)/volume(A) + cut(A, B)/volume(B), where volume(A) = sum of the costs of all edges that touch A.
  • 158. • Determining importance by random walk • What's the probability of visiting a given node? • Create adjacency matrix based on visual similarity • Edge weights determine probability of transition Visual Page Rank Jing Baluja 2008
  • 159. • Quantization/Summarization: K-means • aims to preserve variance of original data • can easily assign new point to a cluster Which Clustering Algorithm to use? Quantization for computing histograms Summary of 20,000 photos of Rome using “greedy k-means” http://grail.cs.washington.edu/projects/canonview/
  • 160. • Image segmentation: agglomerative clustering • More flexible with distance measures (e.g., can be based on boundary prediction) • Adapts better to specific data • The hierarchy can be useful Which Clustering Algorithm to use? http://www.cs.berkeley.edu/~arbelaez/UCM.html
  • 161. • K-means useful for summarization, building dictionaries of patches, general clustering. • Agglomerative clustering useful for segmentation, general clustering. • Spectral clustering useful for determining relevance, summarization, segmentation. Which Clustering Algorithm to use?
  • 162. • Synthetic dataset Clustering algo. compared http://hdbscan.readthedocs.io/en/latest/comparing_clustering_algorithms.html
  • 163. • K-means, k=6 Clustering algo. compared http://hdbscan.readthedocs.io/en/latest/comparing_clustering_algorithms.html
  • 164. • Meanshift Clustering algo. compared http://hdbscan.readthedocs.io/en/latest/comparing_clustering_algorithms.html
  • 165. • DBSCAN, ε=0.025 Clustering algo. compared http://hdbscan.readthedocs.io/en/latest/comparing_clustering_algorithms.html
  • 166. • Agglomerative Clustering, k=6, linkage=ward Clustering algo. compared http://hdbscan.readthedocs.io/en/latest/comparing_clustering_algorithms.html
  • 167. • Spectral Clustering, k=6 Clustering algo. compared http://hdbscan.readthedocs.io/en/latest/comparing_clustering_algorithms.html
  • 168. Classification Machine Learning Roadmap Dimension Reduction Clustering Regression Classification continuous (predicting a quantity) discrete (predicting a category) supervisedunsupervised
  • 169. Supervised Learning • Given a set of samples xi ∈ X and their ground-truth annotations yi, learn a function y = f(x) that minimizes the prediction error E(yj, f(xj)) for new samples xj ∉ X. • The function y = f(x) is a classifier. Classifiers divide the input space into decision regions (R1, R2, R3, ...) separated by decision boundaries.
  • 170. • Spam detection: • X = { characters and words in the email } • Y = { spam, not spam} • Digit recognition: • X = cut out, normalized images of digits • Y = {0,1,2,3,4,5,6,7,8,9} • Medical diagnosis • X = set of all symptoms • Y = set of all diseases Supervised Learning Examples
  • 171. • Find a linear function to separate the classes Linear Classifiers • Logistic Regression • Naïve Bayes • Linear SVM
  • 172. Naïve Bayes Classifier • Use a probabilistic approach to model the data distribution P(X, Y): given data X, find the Y that maximizes the posterior probability p(Y|X) = p(X|Y)p(Y)/p(X), where p(X) = Σ_Y p(X|Y)p(Y). • Problem: we need to model all p(X|Y) and p(Y). If |X| = n binary features, there are 2^n possible values for X. • The Naïve Bayes assumption is that the Xi's are conditionally independent given Y: p(X1, ..., Xn|Y) = Π_i p(Xi|Y).
  • 173. Naïve Bayes Classifier • Given: • Prior p(Y) • n conditionally independent features, represented by the vector X, given the class Y • For each Xi, the likelihood p(Xi|Y) • Decision rule: Y* = argmax_Y p(Y) p(X1, ..., Xn|Y) = argmax_Y p(Y) Π_i p(Xi|Y)
  • 174. Maximum Likelihood for Naïve Bayes • For discrete Naïve Bayes, simply count: • Prior: p(Y = y') = Count(Y = y') / Σ_y Count(Y = y) • Likelihood: p(Xi = x'|Y = y') = Count(Xi = x', Y = y') / Σ_x Count(Xi = x, Y = y') • Naïve Bayes model: p(Y|X) ∝ p(Y) Π_i p(Xi|Y)
  • 175. Naïve Bayes Classifier • Conditional probability model: p(Ck|x1, ..., xn) = (1/Z) p(Ck) Π_{i=1..n} p(xi|Ck) • Classifier: ŷ = argmax_{k ∈ {1,...,K}} p(Ck) Π_{i=1..n} p(xi|Ck)
  • 176. • The features X are the entire document; Xi is the i-th word in the article. X is huge! The NB assumption helps a lot! Naïve Bayes for Text Classification
  • 177. • Typical additional assumption: Xi's position in the document doesn't matter — bag of words. (Example word counts: aardvark 0, about 2, all 2, Africa 1, apple 0, ..., gas 1, ..., oil 1, ..., Zaire 0) Naïve Bayes for Text Classification
  • 178. • Learning Phase: • Prior p(Y): count how many documents are in each topic. • Likelihood p(Xi|Y): for each topic, count how many times each word appears in documents of this topic. • Testing Phase: for each document, use the Naïve Bayes decision rule: argmax_y p(y) Π_{i=1..#words} p(xi|y) Naïve Bayes for Text Classification
  • 179. • Given 1000 training documents from each group, learn to classify new documents according to which newsgroup it came from. • comp.graphics, • comp.os.ms-windows.misc • ... • soc.religion.christian • talk.religion.misc • ... • misc.forsale • ... Naïve Bayes for Text Classification
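  A hedged sketch of this newsgroup workflow with scikit-learn (the library, the subset of categories, and the pipeline are our own choices; the slides do not prescribe an implementation):

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

cats = ["comp.graphics", "soc.religion.christian", "misc.forsale"]
train = fetch_20newsgroups(subset="train", categories=cats)
test = fetch_20newsgroups(subset="test", categories=cats)

# bag-of-words counts -> per-class word likelihoods + class priors
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(train.data, train.target)
print("accuracy:", clf.score(test.data, test.target))
```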
  • 180. Naïve Bayes for Text Classification
  • 181. Naïve Bayes Classifier Issues • Usually, features are not conditionally independent: p(X1, ..., Xn|Y) ≠ Π_i p(Xi|Y) • The actual probabilities p(Y|X) often bias towards 0 or 1. • Nonetheless, Naïve Bayes is the single most used classifier. • Naïve Bayes performs well, even when its assumptions are violated. • Know its assumptions and when to use it.
  • 182. Logistic Regression • A regression model for which the dependent variable is categorical. • Binomial/binary logistic regression • Multinomial logistic regression • Ordinal logistic regression (categorical, but ordered) • Substituting x̃ = w0 + w1x into the logistic function f(x̃) = 1/(1 + e^(−x̃)), we get: y(x, w) = 1/(1 + e^(−(w0 + w1x)))
  • 183. • E.g., for predicting: • mortality of injured patients, • risk of developing a certain disease based on observations of the patient, • whether an American voter would vote Democratic or Republican, • probability of failure of a given process, system or product, • customer's propensity to purchase a product or halt a subscription, • likelihood of homeowner defaulting on mortgage. When to use logistic regression?
  • 184. Logistic Regression Example • Hours studied vs. passing the exam: Ppass(h) = 1/(1 + e^(−(−4.0777 + 1.5046·h)))
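  A quick numeric check of this curve (a sketch; the coefficients are the ones on the slide, the helper name is ours):

```python
import math

def p_pass(hours, w0=-4.0777, w1=1.5046):
    """Logistic model for the probability of passing vs. hours studied."""
    return 1.0 / (1.0 + math.exp(-(w0 + w1 * hours)))

for h in (1, 2, 3, 4, 5):
    print(h, round(p_pass(h), 2))   # ~0.07, 0.26, 0.61, 0.87, 0.97
```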
  • 185. Logistic Regression Classifier • Learn p(Y|X) directly. Reuse ideas from linear regression, but pass the linear score through the logistic function y = 1/(1 + exp(−x)) so the output is a probability: p(Y = 1|X, w) ∝ exp(w0 + Σ_i wiXi) • With normalization: p(Y = 0|X, w) = 1/(1 + exp(w0 + Σ_i wiXi)), p(Y = 1|X, w) = exp(w0 + Σ_i wiXi)/(1 + exp(w0 + Σ_i wiXi))
  • 186. Logistic Regression: decision boundary • Prediction: output the Y with the highest p(Y|X). For binary Y, output Y = 1 if: 1 < p(Y = 1|X)/p(Y = 0|X) ⇔ 1 < exp(w0 + Σ_{i=1..n} wiXi) ⇔ 0 < w0 + Σ_{i=1..n} wiXi • The decision boundary is the hyperplane w0 + w·X = 0.
  • 187. Visualizing p(Y = 0|X, w) = 1/(1 + exp(w0 + w1x1)) • Decision boundary: p(Y = 0|X, w) = 0.5 • The slope of the line defines how quickly the probabilities go to 0 or 1 around the decision boundary.
  • 188. Visualizing p(Y = 0|X, w) = 1/(1 + exp(w0 + w1x1 + w2x2)) • The decision boundary is the hyperplane where the linear score w0 + w1x1 + w2x2 = 0.
  • 189. Logistic Regression Param. Estimation • Generative (Naïve Bayes) loss function: data likelihood ln p(D|w) = Σ_{j=1..N} ln p(xj, yj|w) = Σ_{j=1..N} ln p(yj|xj, w) + Σ_{j=1..N} ln p(xj|w) • Discriminative (logistic regression) loss function: conditional data likelihood ln p(DY|DX, w) = Σ_{j=1..N} ln p(yj|xj, w) • Maximize the conditional log likelihood!
  • 190. Logistic Regression Param. Estimation • Maximize the conditional log likelihood (Maximum Likelihood Estimation, MLE): l(w) ≡ ln Π_j p(yj|xj, w) = Σ_j [ yj (w0 + Σ_i wi xj_i) − ln(1 + exp(w0 + Σ_i wi xj_i)) ] • No closed-form solution. • Concave function of w → no need to worry about local optima; easy to optimize.
  • 191. Logistic Regression Param. Estimation • The conditional likelihood for logistic regression is concave! • Gradient: ∇w l(w) = [∂l(w)/∂w0, ..., ∂l(w)/∂wn] • Gradient ascent update rule: Δw = η ∇w l(w), i.e., wi(t+1) ← wi(t) + η ∂l(w)/∂wi • Simple, powerful, used in many places.
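  A minimal NumPy sketch of this gradient-ascent MLE (the toy data, learning rate, and iteration count are our own, not the lecturer's code):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 2 * X[:, 1] > 0).astype(float)   # toy binary labels
Xb = np.hstack([np.ones((len(X), 1)), X])       # prepend a bias column for w0

w = np.zeros(Xb.shape[1])
eta = 0.1
for _ in range(1000):
    p = 1.0 / (1.0 + np.exp(-Xb @ w))           # p(Y=1 | x, w)
    grad = Xb.T @ (y - p)                       # gradient of the conditional log likelihood
    w += eta * grad / len(y)                    # gradient ascent step
```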
  • 192. • MLE tends to prefer large weights • Higher likelihood of properly classified examples close to decision boundary. • Larger influence of corresponding features on decision. • Can cause overfitting!!! Logistic Regression Param. Estimation
  • 193. Logistic Regression Param. Estimation • Regularization to avoid large weights and overfitting. • Add a prior on w and formulate as a Maximum a Posteriori (MAP) optimization problem: p(w|Y, X) ∝ p(Y|X, w) p(w) • Define the prior as a normal distribution with zero mean and identity covariance; this pushes the parameters towards zero. • MAP estimate: w* = argmax_w ln [ p(w) Π_{j=1..N} p(yj|xj, w) ]
  • 194. Logistic Regression for Discrete Classification • Logistic regression in the more general case, where Y = {y1, ..., yR}: define a weight vector wi for each yi, i = 1, ..., R−1: p(Y = 1|X) ∝ exp(w10 + Σ_i w1iXi), p(Y = 2|X) ∝ exp(w20 + Σ_i w2iXi), ..., p(Y = R|X) = 1 − Σ_{j=1..R−1} p(Y = j|X)
  • 195. Naïve Bayes vs Logistic Regression • E.g., Y = {0, 1}, X = <X1, ..., Xn>, Xi continuous. Comparing Naïve Bayes (generative) vs. Logistic Regression (discriminative):
  • Number of parameters: 4n+1 vs. n+1
  • Parameter estimation: uncoupled vs. coupled
  • # training samples → infinite & model correct: good classifier vs. good classifier
  • # training samples → infinite & model incorrect: biased classifier vs. less-biased classifier
  • Training samples needed: O(log N) vs. O(N)
  • Training convergence speed: faster vs. slower
  • 196. Naïve Bayes vs Logistic Regression • Examples from UCI Machine Learning dataset
  • 197. Perceptron • Invented in 1957 at the Cornell Aeronautical Lab. Intended to be a machine, rather than a program, capable of recognition. • A linear (binary) classifier: o = f(Σ_{k=1..n} ik · wk) (Mark I perceptron machine)
  • 198. Binary Perceptron Algorithm • Start with zero weights: w = 0 • For t = 1...T (T passes over the data) • For i = 1...n (each training sample) • Classify with the current weights: y = sign(w · xi), where sign(x) is +1 if x > 0, else −1 • If correct (i.e., y = yi), no change! • If wrong, update: w = w + yi xi
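  A minimal sketch of this update loop in NumPy (the toy data and the number of passes T are our own choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.where(X[:, 0] - X[:, 1] > 0, 1, -1)   # labels in {+1, -1}
Xb = np.hstack([np.ones((len(X), 1)), X])    # bias feature

w = np.zeros(Xb.shape[1])                    # start with zero weights
for t in range(10):                          # T passes over the data
    for xi, yi in zip(Xb, y):
        if np.sign(w @ xi) != yi:            # misclassified (sign(0) counts as wrong here)
            w += yi * xi                     # perceptron update
```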
  • 201.–205. Binary Perceptron example (figure sequence: the decision boundary after 1, 2, 3, 5, 10, and 20 updates)
  • 206. Multiclass Perceptron • If we have more than two classes: • Keep a weight vector wy for each class y • Calculate an activation for each class: activationw(x, y) = wy · x • Highest activation wins: y* = argmax_y activationw(x, y)
  • 207. Multiclass Perceptron • Start with zero weights • For t = 1, ..., T, i = 1, ..., n (T passes over the data) • Classify with the current weights: y = argmax_y wy · xi • If correct (y = yi), no change! • If wrong: subtract the features xi from the weights of the predicted class (wy ← wy − xi) and add them to the weights of the correct class (wyi ← wyi + xi).
  • 208. Multiclass Perceptron Example • Text classification example: x = the sentence "win the vote" • Feature vector x: BIAS 1, win 1, game 0, vote 1, the 1, ... • wsports: BIAS −2, win 4, game 4, vote 0, the 0, ... • wpolitics: BIAS 1, win 2, game 0, vote 4, the 0, ... • wtech: BIAS 2, win 0, game 2, vote 0, the 0, ... • Activations: x · wsports = 2, x · wpolitics = 7, x · wtech = 2 → classified as "politics"
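  A compact sketch of the multiclass update on numeric features (toy blobs and the pass count are our own; the text example above would use word-count features instead):

```python
import numpy as np

def multiclass_perceptron(X, y, n_classes, T=10):
    """Learn one weight vector per class with the perceptron update."""
    W = np.zeros((n_classes, X.shape[1]))
    for _ in range(T):
        for xi, yi in zip(X, y):
            pred = int(np.argmax(W @ xi))   # highest activation wins
            if pred != yi:
                W[pred] -= xi               # demote the predicted class
                W[yi] += xi                 # promote the correct class
    return W

# toy usage: three Gaussian blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, (30, 2)) for c in ((0, 0), (3, 0), (0, 3))])
y = np.repeat([0, 1, 2], 30)
Xb = np.hstack([np.ones((len(X), 1)), X])   # bias feature
W = multiclass_perceptron(Xb, y, n_classes=3)
```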
  • 209. Linearly separable (binary) • The data is linearly separable (with margin) if: ∃w ∀t: yt (w · xt) > 0
  • 210. Mistake Bound for Perceptron • Assume the data is separable with margin γ: ∃w* s.t. ‖w*‖2 = 1 and ∀t: yt (w* · xt) ≥ γ • Also assume there is a number R such that ∀t: ‖xt‖2 ≤ R • Theorem: the number of mistakes (parameter updates) made by the perceptron is bounded: mistakes ≤ R²/γ²
  • 211. Issues with Perceptrons • Noise: if the data isn't separable, the weights might thrash (averaging the weight vectors over time can help). • Mediocre generalization: finds a barely separating solution. • Overtraining: test / hold-out accuracy usually rises, then falls. (Figure: separable vs. non-separable cases.)
  • 212. Linear SVM Classifier • Find a linear function to separate the classes: f(x) = g(w · x + b) • Define the hyperplane tX − b = 0, where t is the tangent to the hyperplane and X is the matrix of all data points. Minimize ‖t‖ s.t. tX − b produces the correct label for all X.
  • 213. Linear SVM Classifier • Same formulation: f(x) = g(w · x + b), minimize ‖t‖ s.t. tX − b produces the correct label for all X. (Figure: the data points closest to the separating hyperplane are the support vectors.)
  • 214. Nonlinear Classifiers • Some data sets are not linearly separable! • Option 1: • Use non-linear features, e.g., polynomial basis functions • Learn linear classifiers in a transformed, non-linear feature space • Option 2: • Use non-linear classifiers (decision trees, neural networks, nearest neighbors)
  • 215. • Assign label of nearest training data point to each test data point. Nearest Neighbor Classifier Duda, Hart and Stork, Pattern Classification
  • 216. K-Nearest Neighbor Classifier (Figure: the same query point + classified against points of classes x and o using its 1-nearest, 3-nearest, and 5-nearest neighbors.)
  • 217. Nonlinear SVMs • Data that are linearly separable work out great. • But what if the dataset is just too hard? • We can map it to a higher-dimensional space! (Figure: 1-D data on the x axis becomes separable after mapping to (x, x²).)
  • 218. Nonlinear SVMs • Map the input space to some higher-dimensional feature space where the training set is separable: Φ: x → φ(x)
  • 219. Nonlinear SVMs • The kernel trick: instead of explicitly computing the lifting transformation, use K(xi, xj) = φ(xi) · φ(xj) • This gives a non-linear decision boundary in the original feature space: Σ_i αi yi φ(xi) · φ(x) + b = Σ_i αi yi K(xi, x) + b • Common kernel function: the radial basis function (RBF) kernel.
  • 220. Nonlinear kernel example • Consider the mapping φ(x) = (x, x²): φ(x) · φ(y) = (x, x²) · (y, y²) = xy + x²y², so K(x, y) = xy + x²y²
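  A tiny numeric sanity check of this identity (a sketch with arbitrary sample values; the helper names are ours):

```python
import numpy as np

def phi(x):
    """Explicit lifting map phi(x) = (x, x^2)."""
    return np.array([x, x ** 2])

def K(x, y):
    """Kernel value computed without ever lifting the points."""
    return x * y + (x ** 2) * (y ** 2)

x, y = 1.5, -2.0
assert np.isclose(phi(x) @ phi(y), K(x, y))   # same value, cheaper to compute
```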
  • 221. Kernels for bags of features • Histogram intersection kernel: I(h1, h2) = Σ_{i=1..N} min(h1(i), h2(i)) • Generalized Gaussian kernel: K(h1, h2) = exp(−(1/A) D(h1, h2)²), where D can be the (inverse) L1 distance, Euclidean distance, χ² distance, etc.
  • 222. • Combine multiple two-class SVMs • One vs others: • Training: learn an SVM for each class vs the others. • Testing: apply each SVM to test example and assign it to the class of the SVM that returns the highest decision value. • One vs one: • Training: learn an SVM for each pair of classes • Testing: each learned SVM votes for a class to assign to the test example. Multi-class SVM
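  For reference, both strategies appear in scikit-learn (a hedged sketch; the iris data is our own choice): SVC combines pairwise one-vs-one SVMs internally, while OneVsRestClassifier wraps a one-vs-others scheme around any base SVM.

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

ovo = SVC(kernel="rbf")                       # trains one SVM per pair of classes
ovr = OneVsRestClassifier(SVC(kernel="rbf"))  # trains one SVM per class vs. the rest

print(ovo.fit(X, y).score(X, y), ovr.fit(X, y).score(X, y))
```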
  • 223. • Pros: • SVMs work very well in practice, even with very small training sample sizes. • Cons: • No direct multi-class SVM; must combine two-class SVMs. • Computation and memory usage: • Must compute matrix of kernel values for each pair of examples. • Learning can take a long time for large problems. SVMs: Pros & Cons
  • 224. • Prediction is done by sending the example down the tree until a class assignment is reached. Decision Tree Classifier
  • 225. • Internal Nodes: each test a feature • Leaf nodes: each assign a classification • Decision Trees divide the feature space into axis- parallel rectangles and label each rectangle with one of the K classes. Decision Tree Classifier
  • 226. Training Decision Trees • Goal: find a decision tree that achieves minimum misclassification error on the training data. • Brute-force solution: create a tree with one path from root to leaf for each training sample. (Problem: this just memorizes the data and won't generalize.) • Alternative: find the smallest tree that minimizes error. (Problem: this is NP-hard.)
  • 227. 1. Choose the best feature a* for the root of the tree. 2. Split training set S into subsets {S1, S2, ..., Sk} where each subset Si contains examples having the same value for a*. 3. Recursively apply the algorithm on each new subset until all examples have the same class label. The problem is, what defines the "best" feature? Top-down induction of Decision Tree
  • 228. • Decision Tree feature selection based on classification error. Choosing Best Feature Does not work well, since it doesn't reflect progress towards a good tree.
  • 229. Choosing Best Feature • Choose the feature that gives the highest information gain (the Xj with the highest mutual information with Y): argmax_j I(Xj; Y) = argmax_j [H(Y) − H(Y|Xj)] = argmin_j H(Y|Xj) • Define J̃(j) = H(Y|Xj) = Σ_x p(Xj = x) H(Y|Xj = x) to be the expected remaining uncertainty about Y after testing Xj.
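  A small sketch of evaluating this criterion for one binary candidate feature (the toy labels and counts are our own):

```python
import numpy as np

def entropy(p):
    """Entropy (in bits) of a discrete distribution given as probabilities."""
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

# toy data: labels y and one candidate binary feature x
y = np.array([1, 1, 1, 0, 0, 0, 1, 0])
x = np.array([1, 1, 1, 1, 0, 0, 0, 0])

H_y = entropy(np.bincount(y) / len(y))
H_y_given_x = sum(
    (x == v).mean() * entropy(np.bincount(y[x == v]) / (x == v).sum())
    for v in np.unique(x)
)
info_gain = H_y - H_y_given_x   # split on the feature with the largest gain
```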
  • 231. 1. Create T bootstrap samples, {S1, ..., ST} of S as follows: • For each Si, randomly draw |S| examples from S with replacement. • With large |S|, each Si will contain 1 - 1/e = 63.2% unique examples. 2. For each i=1, ..., T, hi = Learn (Si) 3. Output H = <{h1, ..., hT}, majority vote > Bootstrap Aggregating (Bagging) Leo Breiman, "Bagging Predictors", Machine Learning, 24, 123-140 (1996)
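  As an illustration (not from the slides), scikit-learn's BaggingClassifier implements this bootstrap-and-vote procedure around a base learner; the dataset and T = 25 below are our own choices:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# T = 25 bootstrap samples, each the same size as S, drawn with replacement
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=25,
                        max_samples=1.0, bootstrap=True, random_state=0)
print(cross_val_score(bag, X, y, cv=5).mean())
```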
  • 232. • A learning algorithm is unstable if small changes in the training data produces large changes in the output hypothesis. • Bagging will have little benefit when used with stable learning algorithms. • Bagging works best when used with unstable yet relatively accurate classifiers. Learning Algorithm Stability
  • 234. • Bagging: individual classifiers are independent • Boosting: classifiers are learned iteratively • Look at errors from previous classifiers to decide what to focus on for the next iteration over data. • Successive classifiers depends upon its predecessors. • Result: more weights on "hard" examples, i.e., the ones classified incorrectly in the previous iterations. Boosting
  • 235. Error Upper Bound • Consider E = <{h1, h2, h3}, majority vote> • If h1, h2, h3 have error rates less than e, the error rate of E is upper-bounded by g(e) = 3e² − 2e³ < e (for e < 0.5). (Figure: plot of 3e² − 2e³ against e.)
  • 236. Arbitrary Accuracy from Weak Classifiers • Hypothesis: a classifier ensemble of arbitrary accuracy can be built from weak classifiers. • The original formulation of boosting learns too slowly; empirical studies show that Adaboost is highly effective.
  • 237. Adaboost • Adaboost works by learning many times on different distributions over the training data. • Modify the learner to take a distribution as input. 1. For each boosting round, learn on data set S with distribution Dj to produce the j-th ensemble member hj. 2. Compute the (j+1)-th round distribution Dj+1 by putting more weight on the instances that hj made mistakes on. 3. Compute a voting weight wj for hj.
  • 238. Adaboost Example Credit: "A tutorial on boosting" by Yoav Freund and Rob Schapire.
  • 239. Adaboost Example Credit: "A tutorial on boosting" by Yoav Freund and Rob Schapire.
  • 240. Adaboost Example Credit: "A tutorial on boosting" by Yoav Freund and Rob Schapire.
  • 241. Adaboost Example Credit: "A tutorial on boosting" by Yoav Freund and Rob Schapire.
  • 243. • Suppose the base learner L is a weak learner, with error rate slightly less than 0.5 (better than random guess) • Training error goes to zero exponentially fast!!! Adaboost Properties
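  A hedged sketch of boosting weak learners (depth-1 decision "stumps") with scikit-learn's AdaBoostClassifier; the synthetic dataset and hyperparameters are our own choices:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

# weak learner: a depth-1 tree, only slightly better than a random guess
stump = DecisionTreeClassifier(max_depth=1)
ada = AdaBoostClassifier(stump, n_estimators=100, random_state=0).fit(Xtr, ytr)
print("train:", ada.score(Xtr, ytr), "test:", ada.score(Xte, yte))
```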
  • 244. Semi-supervised Learning Machine Learning Roadmap Dimension Reduction Clustering Regression Classification continuous (predicting a quantity) discrete (predicting a category) supervisedunsupervised
  • 245. • Assume that class boundary should go through low density areas. • Having unlabeled data helps getting better decision boundary. Why can unlabeled data help? supervised learning semi-supervised learning
  • 246. • Assume that each class contains a coherent group of points (e.g., Gaussian) • Having unlabeled data points can help learn the distribution more accurately. Why can unlabeled data help?
  • 247. • Generative models: • Use unlabeled data to more accurately estimate the models. • Discriminative models: • Assume that p(y|x) is locally smooth • Graph/manifold regularization • Multi-view approach: multiple independent learners that agree on unlabeled data • Cotraining Semi-Supervised Learning (SSL)
  • 248. SSL Bayes Gaussian Classifier • Without SSL: optimize p(Xl, Yl|θ) • With SSL: optimize p(Xl, Yl, Xu|θ)
  • 249. SSL Bayes Gaussian Classifier • In SSL, the learned θ needs to explain the unlabeled data well, too. • Find the MLE or MAP estimate of the joint and marginal likelihood: p(Xl, Yl, Xu|θ) = Σ_{Yu} p(Xl, Yl, Xu, Yu|θ) • Common mixture models used in SSL: • GMM • Mixture of multinomials
  • 250. Estimating SSL GMM params • Binary classification with a GMM using MLE. • Using labeled data only, MLE is trivial: log p(Xl, Yl|θ) = Σ_{i=1..l} log p(yi|θ) p(xi|yi, θ) • With both labeled and unlabeled data, MLE is harder — use EM: log p(Xl, Yl, Xu|θ) = Σ_{i=1..l} log p(yi|θ) p(xi|yi, θ) + Σ_{i=l+1..l+u} log ( Σ_{y=1..2} p(y|θ) p(xi|y, θ) )
  • 251. Semi-Supervised EM for GMM • Start with the MLE of θ = {w, µ, Σ}1:2 on (Xl, Yl): • wc = proportion of class c • µc = sample mean of class c • Σc = sample covariance of class c • The E-step: compute the expected label p(y|x, θ) = p(x, y|θ) / Σ_{y'} p(x, y'|θ) for all x ∈ Xu. • The M-step: update the MLE θ with the (now labeled) Xu.
  • 252. • SSL is sensitive to assumptions!!! • Cases when the assumption is wrong: SSL GMM Discussions