Machine Learning for Language Technology
Uppsala University
Department of Linguistics and Philology
Slides borrowed from previous courses.
Thanks to Ryan McDonald (Google Research)
and Prof. Joakim Nivre
Machine Learning for Language Technology 1(55)
Introduction
Linear Classifiers
Classifiers covered so far:
Decision trees
Nearest neighbors
Next two lectures: Linear classifiers
Statistics from Google Scholar (Sept 2013):
“Maximum Entropy” & “NLP”: 11,400 hits (2013), 2,660 (2009), 141 before 2000
“SVM” & “NLP”: 10,900 hits (2013), 2,210 (2009), 16 before 2000
“Perceptron” & “NLP”: 3,160 hits (2013), 947 (2009), 118 before 2000
All are linear classifiers that have become important tools in
Language Technology in the past 10 years or so.
Machine Learning for Language Technology 2(55)
Introduction
Outline
Today:
Preliminaries: input/output, features, etc.
Linear classifiers
Perceptron
Large-margin classifiers (SVMs, MIRA)
Logistic regression
Next time:
Structured prediction with linear classifiers
Structured perceptron
Structured large-margin classifiers (SVMs, MIRA)
Conditional random fields
Case study: Dependency parsing
Machine Learning for Language Technology 3(55)
Preliminaries
Inputs and Outputs
Input: x ∈ X
e.g., document or sentence with some words x = w1 . . . wn, or
a series of previous actions
Output: y ∈ Y
e.g., parse tree, document class, part-of-speech tags,
word-sense
Input/output pair: (x, y) ∈ X × Y
e.g., a document x and its label y
Sometimes x is explicit in y, e.g., a parse tree y will contain
the sentence x
Machine Learning for Language Technology 4(55)
Preliminaries
Feature Representations
We assume a mapping from input-output pairs (x, y) to a high-dimensional feature vector
f(x, y) : X × Y → R^m
For some cases, e.g., binary classification with Y = {−1, +1}, we can map only from the input to the feature space
f(x) : X → R^m
However, most problems in NLP require more than two classes, so we focus on the multi-class case
For any vector v ∈ R^m, let v_j be the j-th value
Machine Learning for Language Technology 5(55)
Preliminaries
Features and Classes
All features must be numerical
Numerical features are represented directly as f_i(x, y) ∈ R
Binary (boolean) features are represented as f_i(x, y) ∈ {0, 1}
Multinomial (categorical) features must be binarized
Instead of: f_i(x, y) ∈ {v_0, . . . , v_p}
We have: f_{i+0}(x, y) ∈ {0, 1}, . . . , f_{i+p}(x, y) ∈ {0, 1}
Such that: f_{i+j}(x, y) = 1 iff f_i(x, y) = v_j
We need distinct features for distinct output classes (see the sketch after this slide)
Instead of: f_i(x) (1 ≤ i ≤ m)
We have: f_{i+0m}(x, y), . . . , f_{i+Nm}(x, y) for Y = {0, . . . , N}
Such that: f_{i+jm}(x, y) = f_i(x) iff y = y_j
Machine Learning for Language Technology 6(55)
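To make the binarization step concrete, here is a minimal Python sketch (not from the slides; the attribute name "pos" and its value set are invented for illustration):

```python
# Sketch: binarizing a multinomial (categorical) feature into p+1 binary features,
# f_{i+j}(x) = 1 iff the original feature takes value v_j.
POS_VALUES = ["Noun", "Verb", "Adj"]   # the possible values v_0 ... v_p (illustrative)

def binarize(x):
    return [1 if x["pos"] == v else 0 for v in POS_VALUES]

print(binarize({"pos": "Verb"}))  # [0, 1, 0]
```

The per-class duplication of input features is shown in the block-feature sketch a few slides below.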
Preliminaries
Examples
x is a document and y is a label
f_j(x, y) = 1 if x contains the word “interest” and y = “financial”; 0 otherwise
f_j(x, y) = % of words in x with punctuation, if y = “scientific”; 0 otherwise
x is a word and y is a part-of-speech tag
f_j(x, y) = 1 if x = “bank” and y = Verb; 0 otherwise
Machine Learning for Language Technology 7(55)
Preliminaries
Examples
x is a name, y is a label classifying the name
f_0(x, y) = 1 if x contains “George” and y = “Person”; 0 otherwise
f_1(x, y) = 1 if x contains “Washington” and y = “Person”; 0 otherwise
f_2(x, y) = 1 if x contains “Bridge” and y = “Person”; 0 otherwise
f_3(x, y) = 1 if x contains “General” and y = “Person”; 0 otherwise
f_4(x, y) = 1 if x contains “George” and y = “Object”; 0 otherwise
f_5(x, y) = 1 if x contains “Washington” and y = “Object”; 0 otherwise
f_6(x, y) = 1 if x contains “Bridge” and y = “Object”; 0 otherwise
f_7(x, y) = 1 if x contains “General” and y = “Object”; 0 otherwise
x=General George Washington, y=Person → f(x, y) = [1 1 0 1 0 0 0 0]
x=George Washington Bridge, y=Object → f(x, y) = [0 0 0 0 1 1 1 0]
x=George Washington George, y=Object → f(x, y) = [0 0 0 0 1 1 0 0]
Machine Learning for Language Technology 8(55)
Preliminaries
Block Feature Vectors
x=General George Washington, y=Person → f(x, y) = [1 1 0 1 0 0 0 0]
x=George Washington Bridge, y=Object → f(x, y) = [0 0 0 0 1 1 1 0]
x=George Washington George, y=Object → f(x, y) = [0 0 0 0 1 1 0 0]
One equal-size block of the feature vector for each label
Input features duplicated in each block
Non-zero values allowed only in one block
Machine Learning for Language Technology 9(55)
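A minimal Python sketch that builds exactly these block feature vectors (the cue words and labels are taken from the example above; everything else is illustrative):

```python
# Block feature vectors: 4 input features ("George", "Washington", "Bridge", "General"),
# duplicated once per label ("Person", "Object"); non-zero only in the block of label y.
CUES = ["George", "Washington", "Bridge", "General"]
LABELS = ["Person", "Object"]

def f(x, y):
    base = [1 if cue in x.split() else 0 for cue in CUES]
    vec = [0] * (len(CUES) * len(LABELS))
    b = LABELS.index(y)
    vec[b * len(CUES):(b + 1) * len(CUES)] = base
    return vec

print(f("General George Washington", "Person"))  # [1, 1, 0, 1, 0, 0, 0, 0]
print(f("George Washington Bridge", "Object"))   # [0, 0, 0, 0, 1, 1, 1, 0]
print(f("George Washington George", "Object"))   # [0, 0, 0, 0, 1, 1, 0, 0]
```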
Linear Classifiers
Linear Classifiers
Linear classifier: score (or probability) of a particular
classification is based on a linear combination of features and
their weights
Let w ∈ R^m be a high-dimensional weight vector
If we assume that w is known, then we define our classifier as
Multiclass classification, Y = {0, 1, . . . , N}:
y = argmax_y w · f(x, y) = argmax_y Σ_{j=0}^{m} w_j × f_j(x, y)
Binary classification is just a special case of multiclass (a small sketch of this decision rule follows below)
Machine Learning for Language Technology 10(55)
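A minimal sketch of the decision rule in Python, reusing the block feature function from the earlier sketch (the weights below are invented for illustration):

```python
CUES = ["George", "Washington", "Bridge", "General"]
LABELS = ["Person", "Object"]

def f(x, y):
    # same block feature vector as in the earlier sketch
    base = [1 if cue in x.split() else 0 for cue in CUES]
    vec = [0] * (len(CUES) * len(LABELS))
    b = LABELS.index(y)
    vec[b * len(CUES):(b + 1) * len(CUES)] = base
    return vec

def predict(w, x):
    """Return argmax_y w . f(x, y) over the label set."""
    return max(LABELS, key=lambda y: sum(wj * fj for wj, fj in zip(w, f(x, y))))

w = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0]   # illustrative weights
print(predict(w, "George Washington Bridge"))   # -> Object
```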
Linear Classifiers
Linear Classifiers - Bias Terms
Often linear classifiers are presented as
y = argmax_y Σ_{j=0}^{m} w_j × f_j(x, y) + b_y
where b_y is a bias or offset term (one per label)
But this can be folded into f:
x=General George Washington, y=Person → f(x, y) = [1 1 0 1 1 0 0 0 0 0]
x=General George Washington, y=Object → f(x, y) = [0 0 0 0 0 1 1 0 1 1]
f_4(x, y) = 1 if y = “Person”; 0 otherwise
f_9(x, y) = 1 if y = “Object”; 0 otherwise
w_4 and w_9 are now the bias terms for the labels
Machine Learning for Language Technology 11(55)
Linear Classifiers
Binary Linear Classifier
Divides all points into two half-spaces (figure omitted)
Machine Learning for Language Technology 12(55)
Linear Classifiers
Multiclass Linear Classifier
Defines regions of space (figure omitted)
i.e., the region marked + contains all points (x, y) for which + = argmax_y w · f(x, y)
Machine Learning for Language Technology 13(55)
Linear Classifiers
Separability
A set of points is separable if there exists a w such that classification is perfect
(Figure: a separable point set vs. a non-separable one)
This can also be defined mathematically (and we will shortly)
Machine Learning for Language Technology 14(55)
Linear Classifiers
Supervised Learning – how to find w
Input: training examples T = {(x_t, y_t)}_{t=1}^{|T|}
Input: feature representation f
Output: w that maximizes/minimizes some important
function on the training set
minimize error (Perceptron, SVMs, Boosting)
maximize likelihood of data (Logistic Regression)
Assumption: The training data is separable
Not necessary, just makes life easier
There is a lot of good work in machine learning to tackle the
non-separable case
Machine Learning for Language Technology 15(55)
Linear Classifiers
Perceptron
... The resulting hyperplane is called -perceptron- ...
(Witten and Frank, 2005:126)
Machine Learning for Language Technology 16(55)
Linear Classifiers
Perceptron
Choose a w that minimizes error:
w = argmin_w Σ_t ( 1 − 1[y_t = argmax_y w · f(x_t, y)] )
where 1[p] = 1 if p is true; 0 otherwise
This is a 0-1 loss function
Aside: when minimizing error people tend to use hinge-loss or
other smoother loss functions
Machine Learning for Language Technology 17(55)
Linear Classifiers
Perceptron Learning Algorithm
Training data: T = {(x_t, y_t)}_{t=1}^{|T|}
1. w(0) = 0; i = 0
2. for n : 1..N
3.   for t : 1..|T| (random order: shuffle the training set beforehand!)
4.     let y' = argmax_y w(i) · f(x_t, y)
5.     if y' ≠ y_t
6.       w(i+1) = w(i) + f(x_t, y_t) − f(x_t, y')
7.       i = i + 1
8. return w(i)
See also Daumé (2012: 38-41). A Python sketch of this loop follows below.
Machine Learning for Language Technology 18(55)
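A compact Python version of the algorithm above (a sketch: the feature function f, the label set, the feature dimension, and the training data are assumed to be supplied by the caller):

```python
import random

def perceptron_train(data, labels, f, dim, epochs=10, seed=0):
    """data: list of (x, gold_label) pairs; f(x, y): feature vector of length dim.
    Multiclass perceptron: on a mistake, w += f(x, y_gold) - f(x, y_pred)."""
    rng = random.Random(seed)
    w = [0.0] * dim
    for _ in range(epochs):                       # for n : 1..N
        rng.shuffle(data)                         # random order over the training set
        for x, y_gold in data:                    # for t : 1..|T|
            y_pred = max(labels, key=lambda y: sum(wj * fj for wj, fj in zip(w, f(x, y))))
            if y_pred != y_gold:                  # mistake: update towards gold, away from prediction
                for j, (fg, fp) in enumerate(zip(f(x, y_gold), f(x, y_pred))):
                    w[j] += fg - fp
    return w
```

Shuffling at the start of each epoch corresponds to the "random order" remark in step 3.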
Linear Classifiers
Perceptron: Separability and Margin
Given a training instance (x_t, y_t), define:
Ȳ_t = Y − {y_t}
i.e., Ȳ_t is the set of incorrect labels for x_t
A training set T is separable with margin γ > 0 if there exists a vector w with ||w|| = 1 such that:
w · f(x_t, y_t) − w · f(x_t, y') ≥ γ
for all y' ∈ Ȳ_t, where ||w|| = √(Σ_j w_j²) (Euclidean or L2 norm)
Assumption: the training set is separable with margin γ
Machine Learning for Language Technology 19(55)
Linear Classifiers
Perceptron: Main Theorem
Theorem: For any training set separable with a margin of γ, the following holds for the perceptron algorithm:
mistakes made during training ≤ R² / γ²
where R ≥ ||f(x_t, y_t) − f(x_t, y')|| for all (x_t, y_t) ∈ T and y' ∈ Ȳ_t
Thus, after a finite number of training iterations, the error on the training set will converge to zero
For proof, see the Appendix to these slides
Machine Learning for Language Technology 20(55)
Linear Classifiers
Perceptron Summary
Learns a linear classifier that minimizes error
Guaranteed to find a separating w in a finite amount of time (given a separable training set)
Perceptron is an example of an online learning algorithm
w is updated based on a single training instance in isolation:
w(i+1) = w(i) + f(x_t, y_t) − f(x_t, y')
Compare decision trees that perform batch learning
All training instances are used to find best split
Machine Learning for Language Technology 21(55)
Linear Classifiers
Margin
Machine Learning for Language Technology 22(55)
Linear Classifiers
Margin
(Figure: the margin on training vs. testing data; denote the value of the margin by γ)
Machine Learning for Language Technology 23(55)
Linear Classifiers
Maximizing Margin (i)
For a training set T, the margin of a weight vector w is the largest γ such that
w · f(x_t, y_t) − w · f(x_t, y') ≥ γ
for every training instance (x_t, y_t) ∈ T and y' ∈ Ȳ_t
Machine Learning for Language Technology 24(55)
Linear Classifiers
Maximizing Margin (ii)
Intuitively, maximizing the margin makes sense
More importantly, the generalization error on unseen test data is proportional to the inverse of the margin (for the proof, see Daumé, 2012: 45-46):
error ∝ R² / (γ² × |T|)
Perceptron: we have shown that:
If a training set is separable by some margin, the perceptron
will find a w that separates the data
However, the perceptron does not pick w to maximize the
margin!
Machine Learning for Language Technology 25(55)
Linear Classifiers
Maximizing Margin (iii)
Let γ > 0
max_{||w||≤1} γ
such that:
w · f(x_t, y_t) − w · f(x_t, y') ≥ γ   ∀(x_t, y_t) ∈ T and y' ∈ Ȳ_t
Note: the algorithm still minimizes error
||w|| is bounded, since scaling w trivially produces a larger margin:
β(w · f(x_t, y_t) − w · f(x_t, y')) ≥ βγ, for any β ≥ 1
Machine Learning for Language Technology 26(55)
Linear Classifiers
Max Margin = Min Norm
Let γ > 0
Max Margin:
max_{||w||≤1} γ
such that:
w · f(x_t, y_t) − w · f(x_t, y') ≥ γ   ∀(x_t, y_t) ∈ T and y' ∈ Ȳ_t
is equivalent to
Min Norm:
min_w (1/2) ||w||²
such that:
w · f(x_t, y_t) − w · f(x_t, y') ≥ 1   ∀(x_t, y_t) ∈ T and y' ∈ Ȳ_t
Instead of fixing ||w|| we fix the margin γ = 1
Technically γ ∝ 1/||w||
Machine Learning for Language Technology 27(55)
Linear Classifiers
Support Vector Machines
a.k.a. SVM(s)
Machine Learning for Language Technology 28(55)
Linear Classifiers
Support Vector Machines (i)
min_w (1/2) ||w||²
such that:
w · f(x_t, y_t) − w · f(x_t, y') ≥ 1   ∀(x_t, y_t) ∈ T and y' ∈ Ȳ_t
Quadratic programming problem – a well known convex
optimization problem
Can be solved with out-of-the-box algorithms
Batch learning algorithm – w set w.r.t. all training points
Machine Learning for Language Technology 29(55)
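As one concrete example of such an out-of-the-box solver (not part of the slides): scikit-learn's LinearSVC fits a linear multiclass SVM directly from input feature vectors, handling the per-class weight blocks internally. A hedged sketch on invented toy data:

```python
# Assumes scikit-learn is installed; X holds input feature vectors f(x),
# y the labels. The data below is the toy name example, not real training data.
from sklearn.svm import LinearSVC

X = [[1, 1, 0, 1],   # "General George Washington"
     [1, 1, 1, 0],   # "George Washington Bridge"
     [1, 1, 0, 0]]   # "George Washington George"
y = ["Person", "Object", "Object"]

clf = LinearSVC()            # solves a (regularized, soft-margin) max-margin problem
clf.fit(X, y)
print(clf.predict([[1, 1, 1, 0]]))   # expected: ['Object'] on this toy data
```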
Linear Classifiers
Support Vector Machines (ii)
Problem: Sometimes |T | is far too large
Thus the number of constraints might make solving the
quadratic programming problem very difficult
Common technique: Sequential Minimal Optimization (SMO)
Sparse: solution depends only on features in support vectors
Machine Learning for Language Technology 30(55)
Linear Classifiers
MIRA
Margin Infused Relaxed Algorithm
Machine Learning for Language Technology 31(55)
Linear Classifiers
Margin Infused Relaxed Algorithm (MIRA)
Another option – maximize margin using an online algorithm
Batch vs. Online
Batch – update parameters based on entire training set (SVM)
Online – update parameters based on a single training instance
at a time (Perceptron)
MIRA can be thought of as a max-margin perceptron or an
online SVM
Machine Learning for Language Technology 32(55)
Linear Classifiers
MIRA
Batch (SVMs):
min_w (1/2) ||w||²
such that:
w · f(x_t, y_t) − w · f(x_t, y') ≥ 1   ∀(x_t, y_t) ∈ T and y' ∈ Ȳ_t
Online (MIRA):
Training data: T = {(x_t, y_t)}_{t=1}^{|T|}
1. w(0) = 0; i = 0
2. for n : 1..N
3.   for t : 1..|T|
4.     w(i+1) = argmin_{w*} ||w* − w(i)||
         such that: w* · f(x_t, y_t) − w* · f(x_t, y') ≥ 1   ∀y' ∈ Ȳ_t
5.     i = i + 1
6. return w(i)
MIRA solves much smaller optimization problems, each with only |Ȳ_t| constraints (a simplified sketch follows below)
Cost: the overall optimization is sub-optimal
Machine Learning for Language Technology 33(55)
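A hedged Python sketch of a simplified, 1-best variant of the MIRA update: instead of enforcing all |Ȳ_t| constraints as on the slide, it enforces only the constraint for the single best incorrect label, which admits the closed-form step size used below.

```python
def mira_update(w, f_gold, f_pred):
    """One 1-best MIRA step: change w as little as possible so that the gold
    label beats the (incorrect) predicted label by a margin of at least 1.
    w, f_gold, f_pred are equal-length lists of floats."""
    diff = [g - p for g, p in zip(f_gold, f_pred)]
    loss = 1.0 - sum(wj * dj for wj, dj in zip(w, diff))   # violation of the margin-1 constraint
    norm_sq = sum(d * d for d in diff)
    if loss > 0 and norm_sq > 0:
        tau = loss / norm_sq                               # closed-form step size
        w = [wj + tau * dj for wj, dj in zip(w, diff)]
    return w

# After one update from zero weights, the gold label wins by exactly 1:
w = mira_update([0.0] * 4, [1, 1, 0, 1], [0, 0, 1, 0])
margin = (sum(wj * fj for wj, fj in zip(w, [1, 1, 0, 1])) -
          sum(wj * fj for wj, fj in zip(w, [0, 0, 1, 0])))
print(margin)   # -> 1.0
```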
Linear Classifiers
Interim Summary
What we have covered
Linear classifiers:
Perceptron
SVMs
MIRA
All are trained to minimize error
With or without maximizing margin
Online or batch
What is next
Logistic Regression
Train linear classifiers to maximize likelihood
Machine Learning for Language Technology 34(55)
Linear Classifiers
Logistic Regression
Machine Learning for Language Technology 35(55)
Linear Classifiers
Logistic Regression (i)
Define a conditional probability:
P(y|x) = e^{w·f(x,y)} / Z_x,   where Z_x = Σ_{y'∈Y} e^{w·f(x,y')}
Note: still a linear classifier
argmax_y P(y|x) = argmax_y e^{w·f(x,y)} / Z_x
                = argmax_y e^{w·f(x,y)}
                = argmax_y w · f(x, y)
(a softmax sketch follows below)
Machine Learning for Language Technology 36(55)
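This conditional distribution is just a softmax over the linear scores; a minimal Python sketch (the labels, weights and feature vectors in the example are invented):

```python
import math

def probs(w, feats_by_label):
    """P(y|x) = exp(w . f(x,y)) / Z_x for each label y;
    feats_by_label maps each label to its feature vector f(x, y)."""
    scores = {y: sum(wj * fj for wj, fj in zip(w, fv))
              for y, fv in feats_by_label.items()}
    m = max(scores.values())                        # subtract max for numerical stability
    exp_scores = {y: math.exp(s - m) for y, s in scores.items()}
    Z = sum(exp_scores.values())
    return {y: e / Z for y, e in exp_scores.items()}

w = [0.5, 0.0, -0.5, 1.0]                           # illustrative weights
print(probs(w, {"Person": [1, 1, 0, 1], "Object": [1, 1, 1, 0]}))
# -> roughly {'Person': 0.82, 'Object': 0.18}; the argmax agrees with argmax_y w . f(x, y)
```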
Linear Classifiers
Logistic Regression (ii)
P(y|x) = e^{w·f(x,y)} / Z_x
Q: How do we learn the weights w?
A: Set the weights to maximize the log-likelihood of the training data:
w = argmax_w Π_t P(y_t|x_t) = argmax_w Σ_t log P(y_t|x_t)
In a nutshell, we set the weights w so as to assign as much probability as possible to the correct label y_t for each x_t in the training set
Machine Learning for Language Technology 37(55)
Linear Classifiers
Logistic Regression
P(y|x) = e^{w·f(x,y)} / Z_x,   where Z_x = Σ_{y'∈Y} e^{w·f(x,y')}
w = argmax_w Σ_t log P(y_t|x_t)   (*)
The objective function (*) is concave
Therefore there is a global maximum
No closed-form solution, but lots of numerical techniques
Gradient methods (gradient ascent, iterative scaling)
Newton methods (limited-memory quasi-Newton)
Machine Learning for Language Technology 38(55)
Linear Classifiers
Logistic Regression Summary
Define the conditional probability:
P(y|x) = e^{w·f(x,y)} / Z_x
Set the weights to maximize the log-likelihood of the training data:
w = argmax_w Σ_t log P(y_t|x_t)
Can find the gradient and run gradient ascent (or any gradient-based optimization algorithm):
∇F(w) = (∂F(w)/∂w_0, ∂F(w)/∂w_1, . . . , ∂F(w)/∂w_m)
∂F(w)/∂w_i = Σ_t f_i(x_t, y_t) − Σ_t Σ_{y'∈Y} P(y'|x_t) f_i(x_t, y')
(a sketch of this gradient follows below)
Machine Learning for Language Technology 39(55)
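A minimal sketch of that gradient (empirical feature counts minus expected counts under the model) in Python; the training data, label set and feature function f are assumed to be supplied by the caller:

```python
import math

def log_likelihood_gradient(w, data, labels, f):
    """Gradient of F(w) = sum_t log P(y_t|x_t):
    dF/dw_i = sum_t f_i(x_t, y_t) - sum_t sum_y' P(y'|x_t) f_i(x_t, y')."""
    grad = [0.0] * len(w)
    for x, y_gold in data:
        for i, fi in enumerate(f(x, y_gold)):          # empirical counts
            grad[i] += fi
        scores = {y: sum(wj * fj for wj, fj in zip(w, f(x, y))) for y in labels}
        m = max(scores.values())
        Z = sum(math.exp(s - m) for s in scores.values())
        for y in labels:                                # expected counts under P(y|x)
            p = math.exp(scores[y] - m) / Z
            for i, fi in enumerate(f(x, y)):
                grad[i] -= p * fi
    return grad
```

A gradient-ascent step is then w_i += α · grad_i for some small α > 0, as derived in the Appendix.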
Linear Classifiers
Linear Classification: Summary
Basic form of a (multiclass) classifier:
y = argmax_y w · f(x, y)
Different learning methods:
Perceptron – separate data (0-1 loss, online)
Support vector machine – maximize margin (hinge loss, batch)
Logistic regression – maximize likelihood (log loss, batch)
All three methods are widely used in NLP
Machine Learning for Language Technology 40(55)
Linear Classifiers
Aside: Min error versus max log-likelihood (i)
Highly related but not identical
Example: consider a training set T with 1001 points
1000 points (x_i, y = 0) with f(x_i, 0) = [−1, 1, 0, 0] for i = 1 . . . 1000
1 point (x_1001, y = 1) with f(x_1001, 1) = [0, 0, 3, 1]
Now consider w = [−1, 0, 1, 0]
The error in this case is 0, so w minimizes error:
[−1, 0, 1, 0] · [−1, 1, 0, 0] = 1 > [−1, 0, 1, 0] · [0, 0, −1, 1] = −1
[−1, 0, 1, 0] · [0, 0, 3, 1] = 3 > [−1, 0, 1, 0] · [3, 1, 0, 0] = −3
However, the log-likelihood is −126.9 (calculation omitted)
Machine Learning for Language Technology 41(55)
Linear Classifiers
Aside: Min error versus max log-likelihood (ii)
Highly related but not identical
Example: consider the same training set T with 1001 points
1000 points (x_i, y = 0) with f(x_i, 0) = [−1, 1, 0, 0] for i = 1 . . . 1000
1 point (x_1001, y = 1) with f(x_1001, 1) = [0, 0, 3, 1]
Now consider w = [−1, 7, 1, 0]
The error in this case is 1, so w does not minimize error:
[−1, 7, 1, 0] · [−1, 1, 0, 0] = 8 > [−1, 7, 1, 0] · [0, 0, −1, 1] = −1
[−1, 7, 1, 0] · [0, 0, 3, 1] = 3 < [−1, 7, 1, 0] · [3, 1, 0, 0] = 4
However, the log-likelihood is −1.4
Better log-likelihood and worse error (a small numeric check follows below)
Machine Learning for Language Technology 42(55)
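The two log-likelihood values quoted above are easy to verify; a small Python check (the feature vectors are read off the slides):

```python
import math

def log_likelihood(w, instances):
    """instances: list of (f_gold, f_other) feature-vector pairs for a binary task."""
    total = 0.0
    for f_gold, f_other in instances:
        s_gold = sum(wj * fj for wj, fj in zip(w, f_gold))
        s_other = sum(wj * fj for wj, fj in zip(w, f_other))
        total += s_gold - math.log(math.exp(s_gold) + math.exp(s_other))
    return total

data = 1000 * [([-1, 1, 0, 0], [0, 0, -1, 1])] + [([0, 0, 3, 1], [3, 1, 0, 0])]
print(round(log_likelihood([-1, 0, 1, 0], data), 1))   # -> -126.9
print(round(log_likelihood([-1, 7, 1, 0], data), 1))   # -> -1.4
```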
Linear Classifiers
Aside: Min error versus max log-likelihood (iii)
Max likelihood ≠ min error
Max likelihood pushes as much probability as possible onto the correct labeling of each training instance
Even at the cost of mislabeling a few examples
Min error forces all training instances to be correctly classified
SVMs with slack variables allow some examples to be classified wrongly if the resulting margin is improved on other examples
Machine Learning for Language Technology 43(55)
Linear Classifiers
Aside: Max margin versus max log-likelihood
Let's re-write the max-likelihood objective function:
w = argmax_w Σ_t log P(y_t|x_t)
  = argmax_w Σ_t log [ e^{w·f(x_t,y_t)} / Σ_{y'∈Y} e^{w·f(x_t,y')} ]
  = argmax_w Σ_t [ w · f(x_t, y_t) − log Σ_{y'∈Y} e^{w·f(x_t,y')} ]
Pick w to maximize the score difference between the correct labeling and every possible labeling
Margin: maximize the difference between the correct and all incorrect labelings
The above formulation is often referred to as the soft-margin
Machine Learning for Language Technology 44(55)
Linear Classifiers
Aside: Logistic Regression = Maximum
Entropy
Well known equivalence
Max Ent: maximize entropy subject to constraints on features
Empirical feature counts must equal expected counts
Quick intuition
Partial derivative in logistic regression
∂F(w)/∂w_i = Σ_t f_i(x_t, y_t) − Σ_t Σ_{y'∈Y} P(y'|x_t) f_i(x_t, y')
The first term is the empirical feature count and the second term is the expected count
Setting the derivative to zero maximizes the function
Therefore, when the two counts are equal, we have optimized the logistic regression objective!
Machine Learning for Language Technology 45(55)
Appendix
Proofs and Derivations
Machine Learning for Language Technology 46(55)
Convergence Proof for Perceptron
Perceptron Learning Algorithm
Training data: T = {(x_t, y_t)}_{t=1}^{|T|}
1. w(0) = 0; i = 0
2. for n : 1..N
3.   for t : 1..|T|
4.     let y' = argmax_y w(i) · f(x_t, y)
5.     if y' ≠ y_t
6.       w(i+1) = w(i) + f(x_t, y_t) − f(x_t, y')
7.       i = i + 1
8. return w(i)
Let u be a weight vector with ||u|| = 1 that separates the data with margin γ (it exists by the separability assumption), and let w(k−1) be the weights before the kth mistake
Suppose the kth mistake is made at the tth example, (x_t, y_t):
y' = argmax_y w(k−1) · f(x_t, y),   y' ≠ y_t
w(k) = w(k−1) + f(x_t, y_t) − f(x_t, y')
Now: u · w(k) = u · w(k−1) + u · (f(x_t, y_t) − f(x_t, y')) ≥ u · w(k−1) + γ
Now: w(0) = 0 and u · w(0) = 0, so by induction on k, u · w(k) ≥ kγ
Now: since u · w(k) ≤ ||u|| × ||w(k)|| and ||u|| = 1, then ||w(k)|| ≥ kγ
Now:
||w(k)||² = ||w(k−1)||² + ||f(x_t, y_t) − f(x_t, y')||² + 2 w(k−1) · (f(x_t, y_t) − f(x_t, y'))
||w(k)||² ≤ ||w(k−1)||² + R²
(since R ≥ ||f(x_t, y_t) − f(x_t, y')|| and w(k−1) · f(x_t, y_t) − w(k−1) · f(x_t, y') ≤ 0)
Machine Learning for Language Technology 47(55)
Convergence Proof for Perceptron
Perceptron Learning Algorithm
We have just shown that ||w(k)|| ≥ kγ and ||w(k)||² ≤ ||w(k−1)||² + R²
By induction on k, and since w(0) = 0 and ||w(0)||² = 0:
||w(k)||² ≤ kR²
Therefore,
k²γ² ≤ ||w(k)||² ≤ kR²
and solving for k:
k ≤ R² / γ²
Therefore the number of errors is bounded!
Machine Learning for Language Technology 48(55)
Gradient Ascent for Logistic Regression
Gradient Ascent
Let F(w) = Σ_t log [ e^{w·f(x_t,y_t)} / Z_{x_t} ]
We want to find argmax_w F(w)
Set w(0) = 0^m (the zero vector in R^m)
Iterate until convergence:
w(i) = w(i−1) + α ∇F(w(i−1))
with α > 0 set so that F(w(i)) > F(w(i−1))
∇F(w) is the gradient of F w.r.t. w
A gradient is all partial derivatives over the variables w_i, i.e.,
∇F(w) = (∂F(w)/∂w_0, ∂F(w)/∂w_1, . . . , ∂F(w)/∂w_m)
Gradient ascent will always find a w that maximizes F (since F is concave); a sketch follows below
Machine Learning for Language Technology 49(55)
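A minimal sketch of this loop in Python, reusing the log_likelihood_gradient function from the earlier sketch; a fixed step size α and a gradient-norm stopping test are used here for simplicity (the slide instead asks for α chosen so that F increases at every step):

```python
def gradient_ascent(data, labels, f, dim, alpha=0.1, tol=1e-6, max_iters=1000):
    """Maximize F(w) = sum_t log P(y_t|x_t) via w <- w + alpha * grad F(w)."""
    w = [0.0] * dim                                           # w(0) = 0^m
    for _ in range(max_iters):
        grad = log_likelihood_gradient(w, data, labels, f)    # from the earlier sketch
        w = [wj + alpha * gj for wj, gj in zip(w, grad)]
        if max(abs(g) for g in grad) < tol:                   # gradient ~ 0: (global) maximum reached
            break
    return w
```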
Gradient Ascent for Logistic Regression
The partial derivatives
We need to find all the partial derivatives ∂F(w)/∂w_i, where
F(w) = Σ_t log P(y_t|x_t)
     = Σ_t log [ e^{w·f(x_t,y_t)} / Σ_{y'∈Y} e^{w·f(x_t,y')} ]
     = Σ_t log [ e^{Σ_j w_j × f_j(x_t,y_t)} / Σ_{y'∈Y} e^{Σ_j w_j × f_j(x_t,y')} ]
Machine Learning for Language Technology 50(55)
Gradient Ascent for Logistic Regression
Partial derivatives - some reminders
1. ∂/∂x log F = (1/F) · ∂F/∂x
   (we always assume log is the natural logarithm, log_e)
2. ∂/∂x e^F = e^F · ∂F/∂x
3. ∂/∂x Σ_t F_t = Σ_t ∂F_t/∂x
4. ∂/∂x (F/G) = ( G · ∂F/∂x − F · ∂G/∂x ) / G²
Machine Learning for Language Technology 51(55)
Gradient Ascent for Logistic Regression
The partial derivatives
∂F(w)/∂w_i = ∂/∂w_i Σ_t log [ e^{Σ_j w_j × f_j(x_t,y_t)} / Σ_{y'∈Y} e^{Σ_j w_j × f_j(x_t,y')} ]
           = Σ_t ∂/∂w_i log [ e^{Σ_j w_j × f_j(x_t,y_t)} / Σ_{y'∈Y} e^{Σ_j w_j × f_j(x_t,y')} ]
           = Σ_t ( Σ_{y'∈Y} e^{Σ_j w_j × f_j(x_t,y')} / e^{Σ_j w_j × f_j(x_t,y_t)} ) · ∂/∂w_i ( e^{Σ_j w_j × f_j(x_t,y_t)} / Σ_{y'∈Y} e^{Σ_j w_j × f_j(x_t,y')} )
           = Σ_t ( Z_{x_t} / e^{Σ_j w_j × f_j(x_t,y_t)} ) · ∂/∂w_i ( e^{Σ_j w_j × f_j(x_t,y_t)} / Z_{x_t} )
Machine Learning for Language Technology 52(55)
Gradient Ascent for Logistic Regression
The partial derivatives
Now,
∂/∂w_i [ e^{Σ_j w_j × f_j(x_t,y_t)} / Z_{x_t} ]
  = [ Z_{x_t} · ∂/∂w_i e^{Σ_j w_j × f_j(x_t,y_t)} − e^{Σ_j w_j × f_j(x_t,y_t)} · ∂/∂w_i Z_{x_t} ] / Z_{x_t}²
  = [ Z_{x_t} · e^{Σ_j w_j × f_j(x_t,y_t)} · f_i(x_t, y_t) − e^{Σ_j w_j × f_j(x_t,y_t)} · ∂/∂w_i Z_{x_t} ] / Z_{x_t}²
  = ( e^{Σ_j w_j × f_j(x_t,y_t)} / Z_{x_t}² ) · ( Z_{x_t} f_i(x_t, y_t) − ∂/∂w_i Z_{x_t} )
  = ( e^{Σ_j w_j × f_j(x_t,y_t)} / Z_{x_t}² ) · ( Z_{x_t} f_i(x_t, y_t) − Σ_{y'∈Y} e^{Σ_j w_j × f_j(x_t,y')} f_i(x_t, y') )
because
∂/∂w_i Z_{x_t} = ∂/∂w_i Σ_{y'∈Y} e^{Σ_j w_j × f_j(x_t,y')} = Σ_{y'∈Y} e^{Σ_j w_j × f_j(x_t,y')} f_i(x_t, y')
Machine Learning for Language Technology 53(55)
Gradient Ascent for Logistic Regression
The partial derivatives
From before,
∂/∂w_i [ e^{Σ_j w_j × f_j(x_t,y_t)} / Z_{x_t} ]
  = ( e^{Σ_j w_j × f_j(x_t,y_t)} / Z_{x_t}² ) · ( Z_{x_t} f_i(x_t, y_t) − Σ_{y'∈Y} e^{Σ_j w_j × f_j(x_t,y')} f_i(x_t, y') )
Substituting this in,
∂F(w)/∂w_i = Σ_t ( Z_{x_t} / e^{Σ_j w_j × f_j(x_t,y_t)} ) · ∂/∂w_i ( e^{Σ_j w_j × f_j(x_t,y_t)} / Z_{x_t} )
           = Σ_t (1 / Z_{x_t}) · ( Z_{x_t} f_i(x_t, y_t) − Σ_{y'∈Y} e^{Σ_j w_j × f_j(x_t,y')} f_i(x_t, y') )
           = Σ_t f_i(x_t, y_t) − Σ_t Σ_{y'∈Y} ( e^{Σ_j w_j × f_j(x_t,y')} / Z_{x_t} ) f_i(x_t, y')
           = Σ_t f_i(x_t, y_t) − Σ_t Σ_{y'∈Y} P(y'|x_t) f_i(x_t, y')
Machine Learning for Language Technology 54(55)
Gradient Ascent for Logistic Regression
FINALLY!!!
After all that,
∂F(w)/∂w_i = Σ_t f_i(x_t, y_t) − Σ_t Σ_{y'∈Y} P(y'|x_t) f_i(x_t, y')
And the gradient is:
∇F(w) = (∂F(w)/∂w_0, ∂F(w)/∂w_1, . . . , ∂F(w)/∂w_m)
So we can now use gradient ascent to find w!
Machine Learning for Language Technology 55(55)
Contenu connexe

Tendances

Ada boost brown boost performance with noisy data
Ada boost brown boost performance with noisy dataAda boost brown boost performance with noisy data
Ada boost brown boost performance with noisy dataShadhin Rahman
 
Lecture3 linear svm_with_slack
Lecture3 linear svm_with_slackLecture3 linear svm_with_slack
Lecture3 linear svm_with_slackStéphane Canu
 
Interconnections of hybrid systems
Interconnections of hybrid systemsInterconnections of hybrid systems
Interconnections of hybrid systemsMKosmykov
 
CVPR2010: higher order models in computer vision: Part 1, 2
CVPR2010: higher order models in computer vision: Part 1, 2CVPR2010: higher order models in computer vision: Part 1, 2
CVPR2010: higher order models in computer vision: Part 1, 2zukun
 
Lecture 2: linear SVM in the Dual
Lecture 2: linear SVM in the DualLecture 2: linear SVM in the Dual
Lecture 2: linear SVM in the DualStéphane Canu
 
Submodularity slides
Submodularity slidesSubmodularity slides
Submodularity slidesdragonthu
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine LearningBig_Data_Ukraine
 
Efficient end-to-end learning for quantizable representations
Efficient end-to-end learning for quantizable representationsEfficient end-to-end learning for quantizable representations
Efficient end-to-end learning for quantizable representationsNAVER Engineering
 
Neural Processes Family
Neural Processes FamilyNeural Processes Family
Neural Processes FamilyKota Matsui
 
Ensembles of Many Diverse Weak Defenses can be Strong: Defending Deep Neural ...
Ensembles of Many Diverse Weak Defenses can be Strong: Defending Deep Neural ...Ensembles of Many Diverse Weak Defenses can be Strong: Defending Deep Neural ...
Ensembles of Many Diverse Weak Defenses can be Strong: Defending Deep Neural ...Pooyan Jamshidi
 
Low Complexity Regularization of Inverse Problems - Course #3 Proximal Splitt...
Low Complexity Regularization of Inverse Problems - Course #3 Proximal Splitt...Low Complexity Regularization of Inverse Problems - Course #3 Proximal Splitt...
Low Complexity Regularization of Inverse Problems - Course #3 Proximal Splitt...Gabriel Peyré
 
March12 natarajan
March12 natarajanMarch12 natarajan
March12 natarajanBBKuhn
 
IVR - Chapter 1 - Introduction
IVR - Chapter 1 - IntroductionIVR - Chapter 1 - Introduction
IVR - Chapter 1 - IntroductionCharles Deledalle
 
Lecture8 multi class_svm
Lecture8 multi class_svmLecture8 multi class_svm
Lecture8 multi class_svmStéphane Canu
 
Image Processing 3
Image Processing 3Image Processing 3
Image Processing 3jainatin
 
Additive model and boosting tree
Additive model and boosting treeAdditive model and boosting tree
Additive model and boosting treeDong Guo
 
Machine Learning: The Bare Math Behind Libraries
Machine Learning: The Bare Math Behind LibrariesMachine Learning: The Bare Math Behind Libraries
Machine Learning: The Bare Math Behind LibrariesJ On The Beach
 

Tendances (20)

Lecture5 kernel svm
Lecture5 kernel svmLecture5 kernel svm
Lecture5 kernel svm
 
Ada boost brown boost performance with noisy data
Ada boost brown boost performance with noisy dataAda boost brown boost performance with noisy data
Ada boost brown boost performance with noisy data
 
Midterm sols
Midterm solsMidterm sols
Midterm sols
 
Lecture3 linear svm_with_slack
Lecture3 linear svm_with_slackLecture3 linear svm_with_slack
Lecture3 linear svm_with_slack
 
Interconnections of hybrid systems
Interconnections of hybrid systemsInterconnections of hybrid systems
Interconnections of hybrid systems
 
CVPR2010: higher order models in computer vision: Part 1, 2
CVPR2010: higher order models in computer vision: Part 1, 2CVPR2010: higher order models in computer vision: Part 1, 2
CVPR2010: higher order models in computer vision: Part 1, 2
 
Lecture 2: linear SVM in the Dual
Lecture 2: linear SVM in the DualLecture 2: linear SVM in the Dual
Lecture 2: linear SVM in the Dual
 
Submodularity slides
Submodularity slidesSubmodularity slides
Submodularity slides
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 
Efficient end-to-end learning for quantizable representations
Efficient end-to-end learning for quantizable representationsEfficient end-to-end learning for quantizable representations
Efficient end-to-end learning for quantizable representations
 
Neural Processes Family
Neural Processes FamilyNeural Processes Family
Neural Processes Family
 
Ensembles of Many Diverse Weak Defenses can be Strong: Defending Deep Neural ...
Ensembles of Many Diverse Weak Defenses can be Strong: Defending Deep Neural ...Ensembles of Many Diverse Weak Defenses can be Strong: Defending Deep Neural ...
Ensembles of Many Diverse Weak Defenses can be Strong: Defending Deep Neural ...
 
Low Complexity Regularization of Inverse Problems - Course #3 Proximal Splitt...
Low Complexity Regularization of Inverse Problems - Course #3 Proximal Splitt...Low Complexity Regularization of Inverse Problems - Course #3 Proximal Splitt...
Low Complexity Regularization of Inverse Problems - Course #3 Proximal Splitt...
 
March12 natarajan
March12 natarajanMarch12 natarajan
March12 natarajan
 
Image denoising
Image denoisingImage denoising
Image denoising
 
IVR - Chapter 1 - Introduction
IVR - Chapter 1 - IntroductionIVR - Chapter 1 - Introduction
IVR - Chapter 1 - Introduction
 
Lecture8 multi class_svm
Lecture8 multi class_svmLecture8 multi class_svm
Lecture8 multi class_svm
 
Image Processing 3
Image Processing 3Image Processing 3
Image Processing 3
 
Additive model and boosting tree
Additive model and boosting treeAdditive model and boosting tree
Additive model and boosting tree
 
Machine Learning: The Bare Math Behind Libraries
Machine Learning: The Bare Math Behind LibrariesMachine Learning: The Bare Math Behind Libraries
Machine Learning: The Bare Math Behind Libraries
 

En vedette

Programming languages
Programming languagesProgramming languages
Programming languagesAsmasum
 
JavaScript: The Machine Language of the Ambient Computing Era
JavaScript: The Machine Language of the Ambient Computing EraJavaScript: The Machine Language of the Ambient Computing Era
JavaScript: The Machine Language of the Ambient Computing EraAllen Wirfs-Brock
 
Machine Learning with Applications in Categorization, Popularity and Sequence...
Machine Learning with Applications in Categorization, Popularity and Sequence...Machine Learning with Applications in Categorization, Popularity and Sequence...
Machine Learning with Applications in Categorization, Popularity and Sequence...Nicolas Nicolov
 
Computer Languages....ppt
Computer Languages....pptComputer Languages....ppt
Computer Languages....ppthashgeneration
 
Lect 1. introduction to programming languages
Lect 1. introduction to programming languagesLect 1. introduction to programming languages
Lect 1. introduction to programming languagesVarun Garg
 

En vedette (8)

Programming languages
Programming languagesProgramming languages
Programming languages
 
JavaScript: The Machine Language of the Ambient Computing Era
JavaScript: The Machine Language of the Ambient Computing EraJavaScript: The Machine Language of the Ambient Computing Era
JavaScript: The Machine Language of the Ambient Computing Era
 
Machine Learning with Applications in Categorization, Popularity and Sequence...
Machine Learning with Applications in Categorization, Popularity and Sequence...Machine Learning with Applications in Categorization, Popularity and Sequence...
Machine Learning with Applications in Categorization, Popularity and Sequence...
 
Perceptron
PerceptronPerceptron
Perceptron
 
Computer languages 11
Computer languages 11Computer languages 11
Computer languages 11
 
Computer Languages....ppt
Computer Languages....pptComputer Languages....ppt
Computer Languages....ppt
 
Computer Languages.
Computer Languages.Computer Languages.
Computer Languages.
 
Lect 1. introduction to programming languages
Lect 1. introduction to programming languagesLect 1. introduction to programming languages
Lect 1. introduction to programming languages
 

Similaire à Lecture 03: Machine Learning for Language Technology - Linear Classifiers

NTC_Tensor flow 深度學習快速上手班_Part1 -機器學習
NTC_Tensor flow 深度學習快速上手班_Part1 -機器學習NTC_Tensor flow 深度學習快速上手班_Part1 -機器學習
NTC_Tensor flow 深度學習快速上手班_Part1 -機器學習NTC.im(Notch Training Center)
 
TensorFlow 深度學習快速上手班--機器學習
TensorFlow 深度學習快速上手班--機器學習TensorFlow 深度學習快速上手班--機器學習
TensorFlow 深度學習快速上手班--機器學習Mark Chang
 
Scala - where objects and functions meet
Scala - where objects and functions meetScala - where objects and functions meet
Scala - where objects and functions meetMario Fusco
 
Fp in scala part 2
Fp in scala part 2Fp in scala part 2
Fp in scala part 2Hang Zhao
 
Neural networks with python
Neural networks with pythonNeural networks with python
Neural networks with pythonSimone Piunno
 
Flink Forward Berlin 2017: David Rodriguez - The Approximate Filter, Join, an...
Flink Forward Berlin 2017: David Rodriguez - The Approximate Filter, Join, an...Flink Forward Berlin 2017: David Rodriguez - The Approximate Filter, Join, an...
Flink Forward Berlin 2017: David Rodriguez - The Approximate Filter, Join, an...Flink Forward
 
Support Vector Machine
Support Vector MachineSupport Vector Machine
Support Vector MachineLucas Xu
 
Fosdem 2013 petra selmer flexible querying of graph data
Fosdem 2013 petra selmer   flexible querying of graph dataFosdem 2013 petra selmer   flexible querying of graph data
Fosdem 2013 petra selmer flexible querying of graph dataPetra Selmer
 
Machine Learning: Make Your Ruby Code Smarter
Machine Learning: Make Your Ruby Code SmarterMachine Learning: Make Your Ruby Code Smarter
Machine Learning: Make Your Ruby Code SmarterAstrails
 
Introduction
IntroductionIntroduction
Introductionbutest
 
Machine Learning for Trading
Machine Learning for TradingMachine Learning for Trading
Machine Learning for TradingLarry Guo
 
support vector machine algorithm in machine learning
support vector machine algorithm in machine learningsupport vector machine algorithm in machine learning
support vector machine algorithm in machine learningSamGuy7
 
Lucio Floretta - TensorFlow and Deep Learning without a PhD - Codemotion Mila...
Lucio Floretta - TensorFlow and Deep Learning without a PhD - Codemotion Mila...Lucio Floretta - TensorFlow and Deep Learning without a PhD - Codemotion Mila...
Lucio Floretta - TensorFlow and Deep Learning without a PhD - Codemotion Mila...Codemotion
 

Similaire à Lecture 03: Machine Learning for Language Technology - Linear Classifiers (20)

Claire98
Claire98Claire98
Claire98
 
"Let us talk about output features! by Florence d’Alché-Buc, LTCI & Full Prof...
"Let us talk about output features! by Florence d’Alché-Buc, LTCI & Full Prof..."Let us talk about output features! by Florence d’Alché-Buc, LTCI & Full Prof...
"Let us talk about output features! by Florence d’Alché-Buc, LTCI & Full Prof...
 
Introduction to Machine Learning
Introduction to Machine Learning Introduction to Machine Learning
Introduction to Machine Learning
 
Intro to Python
Intro to PythonIntro to Python
Intro to Python
 
NTC_Tensor flow 深度學習快速上手班_Part1 -機器學習
NTC_Tensor flow 深度學習快速上手班_Part1 -機器學習NTC_Tensor flow 深度學習快速上手班_Part1 -機器學習
NTC_Tensor flow 深度學習快速上手班_Part1 -機器學習
 
TensorFlow 深度學習快速上手班--機器學習
TensorFlow 深度學習快速上手班--機器學習TensorFlow 深度學習快速上手班--機器學習
TensorFlow 深度學習快速上手班--機器學習
 
Scala - where objects and functions meet
Scala - where objects and functions meetScala - where objects and functions meet
Scala - where objects and functions meet
 
Fp in scala part 2
Fp in scala part 2Fp in scala part 2
Fp in scala part 2
 
Neural networks with python
Neural networks with pythonNeural networks with python
Neural networks with python
 
Flink Forward Berlin 2017: David Rodriguez - The Approximate Filter, Join, an...
Flink Forward Berlin 2017: David Rodriguez - The Approximate Filter, Join, an...Flink Forward Berlin 2017: David Rodriguez - The Approximate Filter, Join, an...
Flink Forward Berlin 2017: David Rodriguez - The Approximate Filter, Join, an...
 
Support Vector Machine
Support Vector MachineSupport Vector Machine
Support Vector Machine
 
Fosdem 2013 petra selmer flexible querying of graph data
Fosdem 2013 petra selmer   flexible querying of graph dataFosdem 2013 petra selmer   flexible querying of graph data
Fosdem 2013 petra selmer flexible querying of graph data
 
Machine Learning: Make Your Ruby Code Smarter
Machine Learning: Make Your Ruby Code SmarterMachine Learning: Make Your Ruby Code Smarter
Machine Learning: Make Your Ruby Code Smarter
 
Introduction
IntroductionIntroduction
Introduction
 
Machine Learning for Trading
Machine Learning for TradingMachine Learning for Trading
Machine Learning for Trading
 
Support Vector Machine.ppt
Support Vector Machine.pptSupport Vector Machine.ppt
Support Vector Machine.ppt
 
svm.ppt
svm.pptsvm.ppt
svm.ppt
 
support vector machine algorithm in machine learning
support vector machine algorithm in machine learningsupport vector machine algorithm in machine learning
support vector machine algorithm in machine learning
 
Lucio Floretta - TensorFlow and Deep Learning without a PhD - Codemotion Mila...
Lucio Floretta - TensorFlow and Deep Learning without a PhD - Codemotion Mila...Lucio Floretta - TensorFlow and Deep Learning without a PhD - Codemotion Mila...
Lucio Floretta - TensorFlow and Deep Learning without a PhD - Codemotion Mila...
 
The Perceptron (D1L1 Insight@DCU Machine Learning Workshop 2017)
The Perceptron (D1L1 Insight@DCU Machine Learning Workshop 2017)The Perceptron (D1L1 Insight@DCU Machine Learning Workshop 2017)
The Perceptron (D1L1 Insight@DCU Machine Learning Workshop 2017)
 

Plus de Marina Santini

Can We Quantify Domainhood? Exploring Measures to Assess Domain-Specificity i...
Can We Quantify Domainhood? Exploring Measures to Assess Domain-Specificity i...Can We Quantify Domainhood? Exploring Measures to Assess Domain-Specificity i...
Can We Quantify Domainhood? Exploring Measures to Assess Domain-Specificity i...Marina Santini
 
Towards a Quality Assessment of Web Corpora for Language Technology Applications
Towards a Quality Assessment of Web Corpora for Language Technology ApplicationsTowards a Quality Assessment of Web Corpora for Language Technology Applications
Towards a Quality Assessment of Web Corpora for Language Technology ApplicationsMarina Santini
 
A Web Corpus for eCare: Collection, Lay Annotation and Learning -First Results-
A Web Corpus for eCare: Collection, Lay Annotation and Learning -First Results-A Web Corpus for eCare: Collection, Lay Annotation and Learning -First Results-
A Web Corpus for eCare: Collection, Lay Annotation and Learning -First Results-Marina Santini
 
An Exploratory Study on Genre Classification using Readability Features
An Exploratory Study on Genre Classification using Readability FeaturesAn Exploratory Study on Genre Classification using Readability Features
An Exploratory Study on Genre Classification using Readability FeaturesMarina Santini
 
Lecture: Semantic Word Clouds
Lecture: Semantic Word CloudsLecture: Semantic Word Clouds
Lecture: Semantic Word CloudsMarina Santini
 
Lecture: Ontologies and the Semantic Web
Lecture: Ontologies and the Semantic WebLecture: Ontologies and the Semantic Web
Lecture: Ontologies and the Semantic WebMarina Santini
 
Lecture: Summarization
Lecture: SummarizationLecture: Summarization
Lecture: SummarizationMarina Santini
 
Lecture: Question Answering
Lecture: Question AnsweringLecture: Question Answering
Lecture: Question AnsweringMarina Santini
 
IE: Named Entity Recognition (NER)
IE: Named Entity Recognition (NER)IE: Named Entity Recognition (NER)
IE: Named Entity Recognition (NER)Marina Santini
 
Lecture: Vector Semantics (aka Distributional Semantics)
Lecture: Vector Semantics (aka Distributional Semantics)Lecture: Vector Semantics (aka Distributional Semantics)
Lecture: Vector Semantics (aka Distributional Semantics)Marina Santini
 
Lecture: Word Sense Disambiguation
Lecture: Word Sense DisambiguationLecture: Word Sense Disambiguation
Lecture: Word Sense DisambiguationMarina Santini
 
Semantic Role Labeling
Semantic Role LabelingSemantic Role Labeling
Semantic Role LabelingMarina Santini
 
Semantics and Computational Semantics
Semantics and Computational SemanticsSemantics and Computational Semantics
Semantics and Computational SemanticsMarina Santini
 
Lecture 9: Machine Learning in Practice (2)
Lecture 9: Machine Learning in Practice (2)Lecture 9: Machine Learning in Practice (2)
Lecture 9: Machine Learning in Practice (2)Marina Santini
 
Lecture 8: Machine Learning in Practice (1)
Lecture 8: Machine Learning in Practice (1) Lecture 8: Machine Learning in Practice (1)
Lecture 8: Machine Learning in Practice (1) Marina Santini
 
Lecture 5: Interval Estimation
Lecture 5: Interval Estimation Lecture 5: Interval Estimation
Lecture 5: Interval Estimation Marina Santini
 
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain Ratio
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain RatioLecture 4 Decision Trees (2): Entropy, Information Gain, Gain Ratio
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain RatioMarina Santini
 

Plus de Marina Santini (20)

Can We Quantify Domainhood? Exploring Measures to Assess Domain-Specificity i...
Can We Quantify Domainhood? Exploring Measures to Assess Domain-Specificity i...Can We Quantify Domainhood? Exploring Measures to Assess Domain-Specificity i...
Can We Quantify Domainhood? Exploring Measures to Assess Domain-Specificity i...
 
Towards a Quality Assessment of Web Corpora for Language Technology Applications
Towards a Quality Assessment of Web Corpora for Language Technology ApplicationsTowards a Quality Assessment of Web Corpora for Language Technology Applications
Towards a Quality Assessment of Web Corpora for Language Technology Applications
 
A Web Corpus for eCare: Collection, Lay Annotation and Learning -First Results-
A Web Corpus for eCare: Collection, Lay Annotation and Learning -First Results-A Web Corpus for eCare: Collection, Lay Annotation and Learning -First Results-
A Web Corpus for eCare: Collection, Lay Annotation and Learning -First Results-
 
An Exploratory Study on Genre Classification using Readability Features
An Exploratory Study on Genre Classification using Readability FeaturesAn Exploratory Study on Genre Classification using Readability Features
An Exploratory Study on Genre Classification using Readability Features
 
Lecture: Semantic Word Clouds
Lecture: Semantic Word CloudsLecture: Semantic Word Clouds
Lecture: Semantic Word Clouds
 
Lecture: Ontologies and the Semantic Web
Lecture: Ontologies and the Semantic WebLecture: Ontologies and the Semantic Web
Lecture: Ontologies and the Semantic Web
 
Lecture: Summarization
Lecture: SummarizationLecture: Summarization
Lecture: Summarization
 
Relation Extraction
Relation ExtractionRelation Extraction
Relation Extraction
 
Lecture: Question Answering
Lecture: Question AnsweringLecture: Question Answering
Lecture: Question Answering
 
IE: Named Entity Recognition (NER)
IE: Named Entity Recognition (NER)IE: Named Entity Recognition (NER)
IE: Named Entity Recognition (NER)
 
Lecture: Vector Semantics (aka Distributional Semantics)
Lecture: Vector Semantics (aka Distributional Semantics)Lecture: Vector Semantics (aka Distributional Semantics)
Lecture: Vector Semantics (aka Distributional Semantics)
 
Lecture: Word Sense Disambiguation
Lecture: Word Sense DisambiguationLecture: Word Sense Disambiguation
Lecture: Word Sense Disambiguation
 
Lecture: Word Senses
Lecture: Word SensesLecture: Word Senses
Lecture: Word Senses
 
Sentiment Analysis
Sentiment AnalysisSentiment Analysis
Sentiment Analysis
 
Semantic Role Labeling
Semantic Role LabelingSemantic Role Labeling
Semantic Role Labeling
 
Semantics and Computational Semantics
Semantics and Computational SemanticsSemantics and Computational Semantics
Semantics and Computational Semantics
 
Lecture 9: Machine Learning in Practice (2)
Lecture 9: Machine Learning in Practice (2)Lecture 9: Machine Learning in Practice (2)
Lecture 9: Machine Learning in Practice (2)
 
Lecture 8: Machine Learning in Practice (1)
Lecture 8: Machine Learning in Practice (1) Lecture 8: Machine Learning in Practice (1)
Lecture 8: Machine Learning in Practice (1)
 
Lecture 5: Interval Estimation
Lecture 5: Interval Estimation Lecture 5: Interval Estimation
Lecture 5: Interval Estimation
 
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain Ratio
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain RatioLecture 4 Decision Trees (2): Entropy, Information Gain, Gain Ratio
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain Ratio
 

Dernier

4.16.24 Poverty and Precarity--Desmond.pptx
4.16.24 Poverty and Precarity--Desmond.pptx4.16.24 Poverty and Precarity--Desmond.pptx
4.16.24 Poverty and Precarity--Desmond.pptxmary850239
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPCeline George
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxHumphrey A Beña
 
4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptxmary850239
 
Millenials and Fillennials (Ethical Challenge and Responses).pptx
Millenials and Fillennials (Ethical Challenge and Responses).pptxMillenials and Fillennials (Ethical Challenge and Responses).pptx
Millenials and Fillennials (Ethical Challenge and Responses).pptxJanEmmanBrigoli
 
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptxQ4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptxlancelewisportillo
 
4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptxmary850239
 
TEACHER REFLECTION FORM (NEW SET........).docx
TEACHER REFLECTION FORM (NEW SET........).docxTEACHER REFLECTION FORM (NEW SET........).docx
TEACHER REFLECTION FORM (NEW SET........).docxruthvilladarez
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17Celine George
 
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfGrade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfJemuel Francisco
 
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...JojoEDelaCruz
 
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfVirtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfErwinPantujan2
 
Expanded definition: technical and operational
Expanded definition: technical and operationalExpanded definition: technical and operational
Expanded definition: technical and operationalssuser3e220a
 
Concurrency Control in Database Management system
Concurrency Control in Database Management systemConcurrency Control in Database Management system
Concurrency Control in Database Management systemChristalin Nelson
 
ClimART Action | eTwinning Project
ClimART Action    |    eTwinning ProjectClimART Action    |    eTwinning Project
ClimART Action | eTwinning Projectjordimapav
 
Measures of Position DECILES for ungrouped data
Measures of Position DECILES for ungrouped dataMeasures of Position DECILES for ungrouped data
Measures of Position DECILES for ungrouped dataBabyAnnMotar
 
Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designMIPLM
 
Textual Evidence in Reading and Writing of SHS
Textual Evidence in Reading and Writing of SHSTextual Evidence in Reading and Writing of SHS
Textual Evidence in Reading and Writing of SHSMae Pangan
 
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfInclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfTechSoup
 

Dernier (20)

FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptxFINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
 
4.16.24 Poverty and Precarity--Desmond.pptx
4.16.24 Poverty and Precarity--Desmond.pptx4.16.24 Poverty and Precarity--Desmond.pptx
4.16.24 Poverty and Precarity--Desmond.pptx
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERP
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
 
4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx
 
Millenials and Fillennials (Ethical Challenge and Responses).pptx
Millenials and Fillennials (Ethical Challenge and Responses).pptxMillenials and Fillennials (Ethical Challenge and Responses).pptx
Millenials and Fillennials (Ethical Challenge and Responses).pptx
 
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptxQ4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
 
4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx
 
TEACHER REFLECTION FORM (NEW SET........).docx
TEACHER REFLECTION FORM (NEW SET........).docxTEACHER REFLECTION FORM (NEW SET........).docx
TEACHER REFLECTION FORM (NEW SET........).docx
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17
 
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfGrade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
 
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
 
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfVirtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
 
Expanded definition: technical and operational
Expanded definition: technical and operationalExpanded definition: technical and operational
Expanded definition: technical and operational
 
Concurrency Control in Database Management system
Concurrency Control in Database Management systemConcurrency Control in Database Management system
Concurrency Control in Database Management system
 
ClimART Action | eTwinning Project
ClimART Action    |    eTwinning ProjectClimART Action    |    eTwinning Project
ClimART Action | eTwinning Project
 
Measures of Position DECILES for ungrouped data
Measures of Position DECILES for ungrouped dataMeasures of Position DECILES for ungrouped data
Measures of Position DECILES for ungrouped data
 
Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-design
 
Textual Evidence in Reading and Writing of SHS
Textual Evidence in Reading and Writing of SHSTextual Evidence in Reading and Writing of SHS
Textual Evidence in Reading and Writing of SHS
 
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfInclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
 

Lecture 03: Machine Learning for Language Technology - Linear Classifiers

  • 1. Machine Learning for Language Technology Uppsala University Department of Linguistics and Philology Slides borrowed from previous courses. Thanks to Ryan McDonald (Google Research) and Prof. Joakim Nivre Machine Learning for Language Technology 1(55)
  • 2. Introduction Linear Classifiers Classifiers covered so far: Decision trees Nearest neighbors Next two lectures: Linear classifiers Statistics from Google Scholar (Sept 2013): “Maximum Entropy”&“NLP”; 11,400 hits (2013), 2,660 (2009), 141 before 2000 “SVM”&“NLP”: 10,900 hits (2013), 2,210 (2009), 16 before 2000 “Perceptron”&“NLP”: 3,160 hits (2013), 947 (2009), 118 before 2000 All are linear classifiers that have become important tools in Language Technology in the past 10 years or so. Machine Learning for Language Technology 2(55)
  • 3. Introduction Outline Today: Preliminaries: input/output, features, etc. Linear classifiers Perceptron Large-margin classifiers (SVMs, MIRA) Logistic regression Next time: Structured prediction with linear classifiers Structured perceptron Structured large-margin classifiers (SVMs, MIRA) Conditional random fields Case study: Dependency parsing Machine Learning for Language Technology 3(55)
  • 4. Preliminaries Inputs and Outputs Input: x ∈ X e.g., document or sentence with some words x = w1 . . . wn, or a series of previous actions Output: y ∈ Y e.g., parse tree, document class, part-of-speech tags, word-sense Input/output pair: (x, y) ∈ X × Y e.g., a document x and its label y Sometimes x is explicit in y, e.g., a parse tree y will contain the sentence x Machine Learning for Language Technology 4(55)
  • 5. Preliminaries Feature Representations We assume a mapping from input-output pairs (x, y) to a high dimensional feature vector f(x, y) : X × Y → Rm For some cases, i.e., binary classification Y = {−1, +1}, we can map only from the input to the feature space f(x) : X → Rm However, most problems in NLP require more than two classes, so we focus on the multi-class case For any vector v ∈ Rm, let vj be the jth value Machine Learning for Language Technology 5(55)
  • 6. Preliminaries Features and Classes All features must be numerical Numerical features are represented directly as fi (x, y) ∈ R Binary (boolean) features are represented as fi (x, y) ∈ {0, 1} Multinomial (categorical) features must be binarized Instead of: fi (x, y) ∈ {v0, . . . , vp} We have: fi+0(x, y) ∈ {0, 1}, . . . , fi+p(x, y) ∈ {0, 1} Such that: fi+j (x, y) = 1 iff fi (x, y) = vj We need distinct features for distinct output classes Instead of: fi (x) (1 ≤ i ≤ m) We have: fi+0m(x, y), . . . , fi+Nm(x, y) for Y = {0, . . . , N} Such that: fi+jm(x, y) = fi (x) iff y = yj Machine Learning for Language Technology 6(55)
  • 7. Preliminaries Examples x is a document and y is a label fj (x, y) =    1 if x contains the word“interest” and y =“financial” 0 otherwise fj (x, y) = % of words in x with punctuation and y =“scientific” x is a word and y is a part-of-speech tag fj (x, y) = 1 if x = “bank”and y = Verb 0 otherwise Machine Learning for Language Technology 7(55)
  • 8. Preliminaries Examples x is a name, y is a label classifying the name f0(x, y) = 8 < : 1 if x contains “George” and y = “Person” 0 otherwise f1(x, y) = 8 < : 1 if x contains “Washington” and y = “Person” 0 otherwise f2(x, y) = 8 < : 1 if x contains “Bridge” and y = “Person” 0 otherwise f3(x, y) = 8 < : 1 if x contains “General” and y = “Person” 0 otherwise f4(x, y) = 8 < : 1 if x contains “George” and y = “Object” 0 otherwise f5(x, y) = 8 < : 1 if x contains “Washington” and y = “Object” 0 otherwise f6(x, y) = 8 < : 1 if x contains “Bridge” and y = “Object” 0 otherwise f7(x, y) = 8 < : 1 if x contains “General” and y = “Object” 0 otherwise x=General George Washington, y=Person → f(x, y) = [1 1 0 1 0 0 0 0] x=George Washington Bridge, y=Object → f(x, y) = [0 0 0 0 1 1 1 0] x=George Washington George, y=Object → f(x, y) = [0 0 0 0 1 1 0 0] Machine Learning for Language Technology 8(55)
  • 9. Preliminaries Block Feature Vectors x=General George Washington, y=Person → f(x, y) = [1 1 0 1 0 0 0 0] x=George Washington Bridge, y=Object → f(x, y) = [0 0 0 0 1 1 1 0] x=George Washington George, y=Object → f(x, y) = [0 0 0 0 1 1 0 0] One equal-size block of the feature vector for each label Input features duplicated in each block Non-zero values allowed only in one block Machine Learning for Language Technology 9(55)
  • 10. Linear Classifiers Linear Classifiers Linear classifier: score (or probability) of a particular classification is based on a linear combination of features and their weights Let w ∈ Rm be a high dimensional weight vector If we assume that w is known, then we define our classifier as Multiclass Classification: Y = {0, 1, . . . , N} y = arg max y w · f(x, y) = arg max y m j=0 wj × fj (x, y) Binary Classification just a special case of multiclass Machine Learning for Language Technology 10(55)
  • 11. Linear Classifiers Linear Classifiers - Bias Terms Often linear classifiers presented as y = arg max y m j=0 wj × fj (x, y) + by Where b is a bias or offset term But this can be folded into f x=General George Washington, y=Person → f(x, y) = [1 1 0 1 1 0 0 0 0 0] x=General George Washington, y=Object → f(x, y) = [0 0 0 0 0 1 1 0 1 1] f4(x, y) =  1 y =“Person” 0 otherwise f9(x, y) =  1 y =“Object” 0 otherwise w4 and w9 are now the bias terms for the labels Machine Learning for Language Technology 11(55)
  • 12. Linear Classifiers Binary Linear Classifier Divides all points: Machine Learning for Language Technology 12(55)
  • 13. Linear Classifiers Multiclass Linear Classifier Defines regions of space: i.e., + are all points (x, y) where + = arg maxy w · f(x, y) Machine Learning for Language Technology 13(55)
  • 14. Linear Classifiers Separability A set of points is separable, if there exists a w such that classification is perfect Separable Not Separable This can also be defined mathematically (and we will shortly) Machine Learning for Language Technology 14(55)
  • 15. Linear Classifiers Supervised Learning – how to find w Input: training examples T = {(xt, yt)} |T | t=1 Input: feature representation f Output: w that maximizes/minimizes some important function on the training set minimize error (Perceptron, SVMs, Boosting) maximize likelihood of data (Logistic Regression) Assumption: The training data is separable Not necessary, just makes life easier There is a lot of good work in machine learning to tackle the non-separable case Machine Learning for Language Technology 15(55)
  • 16. Linear Classifiers: Perceptron
    "... The resulting hyperplane is called perceptron ..." (Witten and Frank, 2005:126)
    Machine Learning for Language Technology 16(55)
  • 17. Linear Classifiers: Perceptron
    Choose a w that minimizes error:
      w = argmin_w Σ_t ( 1 − 1[y_t = argmax_y w · f(x_t, y)] )
      where 1[p] = 1 if p is true, 0 otherwise
    This is a 0-1 loss function
    Aside: when minimizing error, people tend to use hinge loss or other smoother loss functions
    Machine Learning for Language Technology 17(55)
  • 18. Linear Classifiers: Perceptron Learning Algorithm
    Training data: T = {(x_t, y_t)}, t = 1 ... |T|
    1. w^(0) = 0; i = 0
    2. for n : 1..N
    3.   for t : 1..|T| (random order: shuffle the training set beforehand!)
    4.     Let y′ = argmax_y w^(i) · f(x_t, y)
    5.     if y′ ≠ y_t
    6.       w^(i+1) = w^(i) + f(x_t, y_t) − f(x_t, y′)
    7.       i = i + 1
    8. return w^(i)
    See also Daumé (2012:38-41).
    Machine Learning for Language Technology 18(55)
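  A sketch of this algorithm in Python/numpy, under the same toy setup as the earlier feature examples (helper names are assumptions, not from the slides):

    import random

    def train_perceptron(data, labels, feat, dim, epochs=10, seed=0):
        """data: list of (x, y) pairs; feat(x, y) returns a vector of length dim."""
        rng = random.Random(seed)
        data = list(data)
        w = np.zeros(dim)
        for _ in range(epochs):                       # step 2: N passes
            rng.shuffle(data)                         # step 3: random order
            for x, y_true in data:
                scores = {y: np.dot(w, feat(x, y)) for y in labels}
                y_pred = max(scores, key=scores.get)  # step 4: argmax
                if y_pred != y_true:                  # step 5: mistake?
                    w += feat(x, y_true) - feat(x, y_pred)   # step 6: update
        return w

    train = [("General George Washington", "Person"),
             ("George Washington Bridge", "Object")]
    w = train_perceptron(train, LABELS, block_features, dim=8)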
  • 19. Linear Classifiers: Perceptron: Separability and Margin
    Given a training instance (x_t, y_t), define:
      Ȳ_t = Y − {y_t}
      i.e., Ȳ_t is the set of incorrect labels for x_t
    A training set T is separable with margin γ > 0 if there exists a vector w with ||w|| = 1 such that:
      w · f(x_t, y_t) − w · f(x_t, y′) ≥ γ  for all y′ ∈ Ȳ_t
    where ||w|| = sqrt(Σ_j w_j²) (Euclidean or L2 norm)
    Assumption: the training set is separable with margin γ
    Machine Learning for Language Technology 19(55)
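  A sketch that computes the margin of a given w on a training set under this definition (again with the hypothetical helpers above); a positive value means w separates the data:

    def margin(w, data, labels, feat):
        # smallest gap between the correct label's score and the best incorrect score
        w = w / np.linalg.norm(w)            # the definition assumes ||w|| = 1
        gaps = []
        for x, y_true in data:
            correct = np.dot(w, feat(x, y_true))
            best_wrong = max(np.dot(w, feat(x, y)) for y in labels if y != y_true)
            gaps.append(correct - best_wrong)
        return min(gaps)

    print(margin(w, train, LABELS, block_features))   # > 0 if the perceptron separated the data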
  • 20. Linear Classifiers: Perceptron: Main Theorem
    Theorem: for any training set separable with a margin of γ, the following holds for the perceptron algorithm:
      mistakes made during training ≤ R²/γ²
    where R ≥ ||f(x_t, y_t) − f(x_t, y′)|| for all (x_t, y_t) ∈ T and y′ ∈ Ȳ_t
    Thus, after a finite number of training iterations, the error on the training set will converge to zero
    For the proof, see the Appendix to these slides
    Machine Learning for Language Technology 20(55)
  • 21. Linear Classifiers: Perceptron Summary
    Learns a linear classifier that minimizes error
    Guaranteed to find a w in a finite amount of time
    Perceptron is an example of an online learning algorithm: w is updated based on a single training instance in isolation
      w^(i+1) = w^(i) + f(x_t, y_t) − f(x_t, y′)
    Compare decision trees, which perform batch learning: all training instances are used to find the best split
    Machine Learning for Language Technology 21(55)
  • 22. Linear Classifiers Margin Machine Learning for Language Technology 22(55)
  • 23. Linear Classifiers: Margin
    (figure: the margin illustrated on training and on testing data)
    Denote the value of the margin by γ
    Machine Learning for Language Technology 23(55)
  • 24. Linear Classifiers: Maximizing Margin (i)
    For a training set T, the margin of a weight vector w is the smallest γ such that
      w · f(x_t, y_t) − w · f(x_t, y′) ≥ γ
    for every training instance (x_t, y_t) ∈ T and y′ ∈ Ȳ_t
    Machine Learning for Language Technology 24(55)
  • 25. Linear Classifiers: Maximizing Margin (ii)
    Intuitively, maximizing the margin makes sense
    More importantly, generalization error on unseen test data is proportional to the inverse of the margin (for the proof, see Daumé, 2012: 45-46):
      ∝ R² / (γ² × |T|)
    Perceptron: we have shown that if a training set is separable by some margin, the perceptron will find a w that separates the data
    However, the perceptron does not pick w to maximize the margin!
    Machine Learning for Language Technology 25(55)
  • 26. Linear Classifiers: Maximizing Margin (iii)
    Let γ > 0
      max_{||w||≤1} γ
      such that: w · f(x_t, y_t) − w · f(x_t, y′) ≥ γ  ∀(x_t, y_t) ∈ T and y′ ∈ Ȳ_t
    Note: the algorithm still minimizes error
    ||w|| is bounded, since scaling trivially produces a larger margin:
      β(w · f(x_t, y_t) − w · f(x_t, y′)) ≥ βγ, for some β ≥ 1
    Machine Learning for Language Technology 26(55)
  • 27. Linear Classifiers: Max Margin = Min Norm
    Let γ > 0
    Max Margin:
      max_{||w||≤1} γ
      such that: w · f(x_t, y_t) − w · f(x_t, y′) ≥ γ  ∀(x_t, y_t) ∈ T and y′ ∈ Ȳ_t
    =
    Min Norm:
      min_w (1/2)||w||²
      such that: w · f(x_t, y_t) − w · f(x_t, y′) ≥ 1  ∀(x_t, y_t) ∈ T and y′ ∈ Ȳ_t
    Instead of fixing ||w|| we fix the margin γ = 1
    Technically γ ∝ 1/||w||
    Machine Learning for Language Technology 27(55)
  • 28. Linear Classifiers Support Vector Machines a.k.a. SVM(s) Machine Learning for Language Technology 28(55)
  • 29. Linear Classifiers: Support Vector Machines (i)
      min (1/2)||w||²
      such that: w · f(x_t, y_t) − w · f(x_t, y′) ≥ 1  ∀(x_t, y_t) ∈ T and y′ ∈ Ȳ_t
    Quadratic programming problem – a well-known convex optimization problem
    Can be solved with out-of-the-box algorithms
    Batch learning algorithm – w is set w.r.t. all training points
    Machine Learning for Language Technology 29(55)
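  A minimal sketch of this min-norm problem as a generic constrained optimization, assuming scipy is available and reusing the toy data and feature helpers from before; real SVM implementations solve the dual and scale far better, so this only illustrates the objective and the constraints:

    from scipy.optimize import minimize

    def svm_min_norm(data, labels, feat, dim):
        # one margin constraint per (instance, incorrect label) pair
        diffs = [feat(x, y_true) - feat(x, y)
                 for x, y_true in data for y in labels if y != y_true]
        constraints = [{"type": "ineq", "fun": lambda w, d=d: np.dot(w, d) - 1.0}
                       for d in diffs]
        objective = lambda w: 0.5 * np.dot(w, w)     # (1/2) ||w||^2
        result = minimize(objective, x0=np.zeros(dim), method="SLSQP",
                          constraints=constraints)
        return result.x

    w_svm = svm_min_norm(train, LABELS, block_features, dim=8)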
  • 30. Linear Classifiers: Support Vector Machines (ii)
    Problem: sometimes |T| is far too large
    Thus the number of constraints might make solving the quadratic programming problem very difficult
    Common technique: Sequential Minimal Optimization (SMO)
    Sparse: the solution depends only on the features in the support vectors
    Machine Learning for Language Technology 30(55)
  • 31. Linear Classifiers MIRA Margin Infused Relaxed Algorithm Machine Learning for Language Technology 31(55)
  • 32. Linear Classifiers: Margin Infused Relaxed Algorithm (MIRA)
    Another option – maximize the margin using an online algorithm
    Batch vs. Online:
      Batch – update parameters based on the entire training set (SVM)
      Online – update parameters based on a single training instance at a time (Perceptron)
    MIRA can be thought of as a max-margin perceptron or an online SVM
    Machine Learning for Language Technology 32(55)
  • 33. Linear Classifiers: MIRA
    Batch (SVMs):
      min (1/2)||w||²
      such that: w · f(x_t, y_t) − w · f(x_t, y′) ≥ 1  ∀(x_t, y_t) ∈ T and y′ ∈ Ȳ_t
    Online (MIRA):
      Training data: T = {(x_t, y_t)}, t = 1 ... |T|
      1. w^(0) = 0; i = 0
      2. for n : 1..N
      3.   for t : 1..|T|
      4.     w^(i+1) = argmin_{w*} ||w* − w^(i)||
                such that: w* · f(x_t, y_t) − w* · f(x_t, y′) ≥ 1  ∀y′ ∈ Ȳ_t
      5.     i = i + 1
      6. return w^(i)
    MIRA has much smaller optimizations, with only |Ȳ_t| constraints each
    Cost: sub-optimal optimization
    Machine Learning for Language Technology 33(55)
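  A sketch of a single online step, using the common single-best simplification of step 4 (only the highest-scoring incorrect label contributes a constraint, rather than all of Ȳ_t), for which the optimization has a closed-form solution:

    def mira_update(w, x, y_true, labels, feat):
        # single-best MIRA / passive-aggressive style step
        y_pred = max((y for y in labels if y != y_true),
                     key=lambda y: np.dot(w, feat(x, y)))
        diff = feat(x, y_true) - feat(x, y_pred)
        loss = 1.0 - np.dot(w, diff)                 # margin violation, if positive
        if loss <= 0 or not np.any(diff):
            return w                                 # constraint already satisfied
        tau = loss / np.dot(diff, diff)              # smallest change to w that satisfies it
        return w + tau * diff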
  • 34. Linear Classifiers: Interim Summary
    What we have covered – linear classifiers:
      Perceptron
      SVMs
      MIRA
    All are trained to minimize error, with or without maximizing the margin, online or batch
    What is next: Logistic Regression – training linear classifiers to maximize likelihood
    Machine Learning for Language Technology 34(55)
  • 35. Linear Classifiers Logistic Regression Machine Learning for Language Technology 35(55)
  • 36. Linear Classifiers: Logistic Regression (i)
    Define a conditional probability:
      P(y|x) = e^{w·f(x,y)} / Z_x, where Z_x = Σ_{y′∈Y} e^{w·f(x,y′)}
    Note: still a linear classifier
      argmax_y P(y|x) = argmax_y e^{w·f(x,y)} / Z_x
                      = argmax_y e^{w·f(x,y)}
                      = argmax_y w · f(x, y)
    Machine Learning for Language Technology 36(55)
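  A small sketch of P(y|x) with the hypothetical feature helpers from before; subtracting the max score is a standard numerical-stability trick that cancels in the ratio:

    def label_probs(w, x, labels, feat):
        scores = np.array([np.dot(w, feat(x, y)) for y in labels])
        scores -= scores.max()                       # stability only; the shift cancels in Z_x
        exp_scores = np.exp(scores)
        probs = exp_scores / exp_scores.sum()        # e^{w.f(x,y)} / Z_x
        return dict(zip(labels, probs))

    print(label_probs(w_svm, "George Washington Bridge", LABELS, block_features))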
  • 37. Linear Classifiers: Logistic Regression (ii)
      P(y|x) = e^{w·f(x,y)} / Z_x
    Q: How do we learn the weights w?
    A: Set the weights to maximize the log-likelihood of the training data:
      w = argmax_w Π_t P(y_t|x_t) = argmax_w Σ_t log P(y_t|x_t)
    In a nutshell, we set the weights w so that we assign as much probability as possible to the correct label y for each x in the training set
    Machine Learning for Language Technology 37(55)
  • 38. Linear Classifiers: Logistic Regression
      P(y|x) = e^{w·f(x,y)} / Z_x, where Z_x = Σ_{y′∈Y} e^{w·f(x,y′)}
      w = argmax_w Σ_t log P(y_t|x_t)    (*)
    The objective function (*) is concave
      Therefore there is a global maximum
    No closed-form solution, but lots of numerical techniques:
      Gradient methods (gradient ascent, iterative scaling)
      Newton methods (limited-memory quasi-Newton)
    Machine Learning for Language Technology 38(55)
  • 39. Linear Classifiers: Logistic Regression Summary
    Define the conditional probability:
      P(y|x) = e^{w·f(x,y)} / Z_x
    Set the weights to maximize the log-likelihood of the training data:
      w = argmax_w Σ_t log P(y_t|x_t)
    Can find the gradient and run gradient ascent (or any gradient-based optimization algorithm):
      ∇F(w) = (∂F(w)/∂w_0, ∂F(w)/∂w_1, ..., ∂F(w)/∂w_m)
      ∂F(w)/∂w_i = Σ_t f_i(x_t, y_t) − Σ_t Σ_{y′∈Y} P(y′|x_t) f_i(x_t, y′)
    Machine Learning for Language Technology 39(55)
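  A sketch of this gradient (empirical feature counts minus expected feature counts), reusing the hypothetical label_probs helper above:

    def log_likelihood_grad(w, data, labels, feat, dim):
        grad = np.zeros(dim)
        for x, y_true in data:
            grad += feat(x, y_true)                  # empirical counts
            for y, p in label_probs(w, x, labels, feat).items():
                grad -= p * feat(x, y)               # expected counts under P(y|x)
        return grad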
  • 40. Linear Classifiers: Linear Classification: Summary
    Basic form of a (multiclass) classifier:
      y = argmax_y w · f(x, y)
    Different learning methods:
      Perceptron – separate the data (0-1 loss, online)
      Support vector machine – maximize the margin (hinge loss, batch)
      Logistic regression – maximize the likelihood (log loss, batch)
    All three methods are widely used in NLP
    Machine Learning for Language Technology 40(55)
  • 41. Linear Classifiers: Aside: Min error versus max log-likelihood (i)
    Highly related, but not identical
    Example: consider a training set T with 1001 points
      1000 × (x_i, y = 0) = [−1, 1, 0, 0] for i = 1 ... 1000
      1 × (x_1001, y = 1) = [0, 0, 3, 1]
    Now consider w = [−1, 0, 1, 0]
    Error in this case is 0 – so w minimizes error:
      [−1, 0, 1, 0] · [−1, 1, 0, 0] = 1 > [−1, 0, 1, 0] · [0, 0, −1, 1] = −1
      [−1, 0, 1, 0] · [0, 0, 3, 1] = 3 > [−1, 0, 1, 0] · [3, 1, 0, 0] = −3
    However, the log-likelihood = −126.9 (calculation omitted)
    Machine Learning for Language Technology 41(55)
  • 42. Linear Classifiers: Aside: Min error versus max log-likelihood (ii)
    Highly related, but not identical
    Example: consider a training set T with 1001 points
      1000 × (x_i, y = 0) = [−1, 1, 0, 0] for i = 1 ... 1000
      1 × (x_1001, y = 1) = [0, 0, 3, 1]
    Now consider w = [−1, 7, 1, 0]
    Error in this case is 1 – so w does not minimize error:
      [−1, 7, 1, 0] · [−1, 1, 0, 0] = 8 > [−1, 7, 1, 0] · [0, 0, −1, 1] = −1
      [−1, 7, 1, 0] · [0, 0, 3, 1] = 3 < [−1, 7, 1, 0] · [3, 1, 0, 0] = 4
    However, the log-likelihood = −1.4
    Better log-likelihood and worse error
    Machine Learning for Language Technology 42(55)
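  A short numerical check of the two asides; reading the bracketed vectors as the correct-label and incorrect-label feature blocks (e.g. [−1, 1, 0, 0] vs. [0, 0, −1, 1] for the first 1000 points) is the interpretation implicit in the dot products above:

    def loglik(w, grouped):
        # grouped: list of (f_correct, f_incorrect, count) triples
        total = 0.0
        for f_good, f_bad, count in grouped:
            s_good, s_bad = np.dot(w, f_good), np.dot(w, f_bad)
            total += count * (s_good - np.log(np.exp(s_good) + np.exp(s_bad)))
        return total

    points = [([-1, 1, 0, 0], [0, 0, -1, 1], 1000),   # the 1000 instances with y = 0
              ([0, 0, 3, 1], [3, 1, 0, 0], 1)]        # the single instance with y = 1

    print(round(loglik(np.array([-1, 0, 1, 0]), points), 1))   # about -126.9
    print(round(loglik(np.array([-1, 7, 1, 0]), points), 1))   # about -1.4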
  • 43. Linear Classifiers: Aside: Min error versus max log-likelihood (iii)
    Max likelihood ≠ min error
    Max likelihood pushes as much probability as possible onto the correct labeling of each training instance
      Even at the cost of mislabeling a few examples
    Min error forces all training instances to be correctly classified
    SVMs with slack variables – allow some examples to be classified wrong if the resulting margin is improved on other examples
    Machine Learning for Language Technology 43(55)
  • 44. Linear Classifiers: Aside: Max margin versus max log-likelihood
    Let's re-write the max likelihood objective function:
      w = argmax_w Σ_t log P(y_t|x_t)
        = argmax_w Σ_t log ( e^{w·f(x_t,y_t)} / Σ_{y′∈Y} e^{w·f(x_t,y′)} )
        = argmax_w Σ_t [ w · f(x_t, y_t) − log Σ_{y′∈Y} e^{w·f(x_t,y′)} ]
    Pick w to maximize the score difference between the correct labeling and every possible labeling
    Margin: maximize the difference between the correct and all incorrect labelings
    The above formulation is often referred to as the soft-margin
    Machine Learning for Language Technology 44(55)
  • 45. Linear Classifiers: Aside: Logistic Regression = Maximum Entropy
    Well-known equivalence
    Max Ent: maximize entropy subject to constraints on the features
      Empirical feature counts must equal expected counts
    Quick intuition: the partial derivative in logistic regression is
      ∂F(w)/∂w_i = Σ_t f_i(x_t, y_t) − Σ_t Σ_{y′∈Y} P(y′|x_t) f_i(x_t, y′)
    The first term is the empirical feature count and the second term is the expected count
    Setting the derivative to zero maximizes the function
    Therefore, when both counts are equal, we optimize the logistic regression objective!
    Machine Learning for Language Technology 45(55)
  • 46. Appendix Proofs and Derivations Machine Learning for Language Technology 46(55)
  • 47. Convergence Proof for Perceptron: Perceptron Learning Algorithm
    Training data: T = {(x_t, y_t)}, t = 1 ... |T|
    1. w^(0) = 0; i = 0
    2. for n : 1..N
    3.   for t : 1..|T|
    4.     Let y′ = argmax_y w^(i) · f(x_t, y)
    5.     if y′ ≠ y_t
    6.       w^(i+1) = w^(i) + f(x_t, y_t) − f(x_t, y′)
    7.       i = i + 1
    8. return w^(i)
    Let u, with ||u|| = 1, be a weight vector that separates the training data with margin γ (such a u exists by the separability assumption)
    Let w^(k−1) be the weights before the kth mistake
    Suppose the kth mistake is made at the tth example, (x_t, y_t):
      y′ = argmax_y w^(k−1) · f(x_t, y), with y′ ≠ y_t
      w^(k) = w^(k−1) + f(x_t, y_t) − f(x_t, y′)
    Now: u · w^(k) = u · w^(k−1) + u · (f(x_t, y_t) − f(x_t, y′)) ≥ u · w^(k−1) + γ
    Now: w^(0) = 0 and u · w^(0) = 0, so by induction on k, u · w^(k) ≥ kγ
    Now: since u · w^(k) ≤ ||u|| × ||w^(k)|| and ||u|| = 1, then ||w^(k)|| ≥ kγ
    Now: ||w^(k)||² = ||w^(k−1)||² + ||f(x_t, y_t) − f(x_t, y′)||² + 2 w^(k−1) · (f(x_t, y_t) − f(x_t, y′))
      so ||w^(k)||² ≤ ||w^(k−1)||² + R²
      (since R ≥ ||f(x_t, y_t) − f(x_t, y′)|| and w^(k−1) · f(x_t, y_t) − w^(k−1) · f(x_t, y′) ≤ 0)
    Machine Learning for Language Technology 47(55)
  • 48. Convergence Proof for Perceptron: Perceptron Learning Algorithm
    We have just shown that ||w^(k)|| ≥ kγ and ||w^(k)||² ≤ ||w^(k−1)||² + R²
    By induction on k, and since w^(0) = 0 and ||w^(0)||² = 0:
      ||w^(k)||² ≤ kR²
    Therefore:
      k²γ² ≤ ||w^(k)||² ≤ kR²
    and solving for k:
      k ≤ R²/γ²
    Therefore the number of errors is bounded!
    Machine Learning for Language Technology 48(55)
  • 49. Gradient Ascent for Logistic Regression: Gradient Ascent
    Let F(w) = Σ_t log ( e^{w·f(x_t,y_t)} / Z_{x_t} )
    Want to find argmax_w F(w)
    Set w^(0) = 0^m
    Iterate until convergence:
      w^(i) = w^(i−1) + α ∇F(w^(i−1))
      with α > 0, set so that F(w^(i)) > F(w^(i−1))
    ∇F(w) is the gradient of F w.r.t. w
      A gradient is the vector of all partial derivatives over the variables w_i
      i.e., ∇F(w) = (∂F(w)/∂w_0, ∂F(w)/∂w_1, ..., ∂F(w)/∂w_m)
    Since F is concave, gradient ascent will always find w to maximize F
    Machine Learning for Language Technology 49(55)
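  A sketch of this loop with the hypothetical helpers defined earlier; a small fixed step size α stands in for the condition that each step must increase F:

    def train_logreg(data, labels, feat, dim, alpha=0.1, iters=200):
        w = np.zeros(dim)
        for _ in range(iters):                       # "iterate until convergence"
            w = w + alpha * log_likelihood_grad(w, data, labels, feat, dim)
        return w

    w_lr = train_logreg(train, LABELS, block_features, dim=8)
    print(label_probs(w_lr, "General George Washington", LABELS, block_features))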
  • 50. Gradient Ascent for Logistic Regression: The partial derivatives
    Need to find all partial derivatives ∂F(w)/∂w_i
      F(w) = Σ_t log P(y_t|x_t)
           = Σ_t log ( e^{w·f(x_t,y_t)} / Σ_{y′∈Y} e^{w·f(x_t,y′)} )
           = Σ_t log ( e^{Σ_j w_j × f_j(x_t,y_t)} / Σ_{y′∈Y} e^{Σ_j w_j × f_j(x_t,y′)} )
    Machine Learning for Language Technology 50(55)
  • 51. Gradient Ascent for Logistic Regression: Partial derivatives – some reminders
    1. ∂/∂x log F = (1/F) ∂F/∂x
       (we always assume log is the natural logarithm log_e)
    2. ∂/∂x e^F = e^F ∂F/∂x
    3. ∂/∂x Σ_t F_t = Σ_t ∂F_t/∂x
    4. ∂/∂x (F/G) = ( G ∂F/∂x − F ∂G/∂x ) / G²
    Machine Learning for Language Technology 51(55)
  • 52. Gradient Ascent for Logistic Regression: The partial derivatives
    ∂F(w)/∂w_i = ∂/∂w_i Σ_t log ( e^{Σ_j w_j f_j(x_t,y_t)} / Σ_{y′∈Y} e^{Σ_j w_j f_j(x_t,y′)} )
               = Σ_t ∂/∂w_i log ( e^{Σ_j w_j f_j(x_t,y_t)} / Σ_{y′∈Y} e^{Σ_j w_j f_j(x_t,y′)} )
               = Σ_t ( Σ_{y′∈Y} e^{Σ_j w_j f_j(x_t,y′)} / e^{Σ_j w_j f_j(x_t,y_t)} ) × ∂/∂w_i ( e^{Σ_j w_j f_j(x_t,y_t)} / Σ_{y′∈Y} e^{Σ_j w_j f_j(x_t,y′)} )
               = Σ_t ( Z_{x_t} / e^{Σ_j w_j f_j(x_t,y_t)} ) × ∂/∂w_i ( e^{Σ_j w_j f_j(x_t,y_t)} / Z_{x_t} )
    Machine Learning for Language Technology 52(55)
  • 53. Gradient Ascent for Logistic Regression: The partial derivatives
    Now,
      ∂/∂w_i ( e^{Σ_j w_j f_j(x_t,y_t)} / Z_{x_t} )
        = ( Z_{x_t} ∂/∂w_i e^{Σ_j w_j f_j(x_t,y_t)} − e^{Σ_j w_j f_j(x_t,y_t)} ∂/∂w_i Z_{x_t} ) / Z_{x_t}²
        = ( Z_{x_t} e^{Σ_j w_j f_j(x_t,y_t)} f_i(x_t, y_t) − e^{Σ_j w_j f_j(x_t,y_t)} ∂/∂w_i Z_{x_t} ) / Z_{x_t}²
        = ( e^{Σ_j w_j f_j(x_t,y_t)} / Z_{x_t}² ) ( Z_{x_t} f_i(x_t, y_t) − ∂/∂w_i Z_{x_t} )
        = ( e^{Σ_j w_j f_j(x_t,y_t)} / Z_{x_t}² ) ( Z_{x_t} f_i(x_t, y_t) − Σ_{y′∈Y} e^{Σ_j w_j f_j(x_t,y′)} f_i(x_t, y′) )
    because
      ∂/∂w_i Z_{x_t} = ∂/∂w_i Σ_{y′∈Y} e^{Σ_j w_j f_j(x_t,y′)} = Σ_{y′∈Y} e^{Σ_j w_j f_j(x_t,y′)} f_i(x_t, y′)
    Machine Learning for Language Technology 53(55)
  • 54. Gradient Ascent for Logistic Regression: The partial derivatives
    From before,
      ∂/∂w_i ( e^{Σ_j w_j f_j(x_t,y_t)} / Z_{x_t} )
        = ( e^{Σ_j w_j f_j(x_t,y_t)} / Z_{x_t}² ) ( Z_{x_t} f_i(x_t, y_t) − Σ_{y′∈Y} e^{Σ_j w_j f_j(x_t,y′)} f_i(x_t, y′) )
    Substituting this in,
      ∂F(w)/∂w_i = Σ_t ( Z_{x_t} / e^{Σ_j w_j f_j(x_t,y_t)} ) × ∂/∂w_i ( e^{Σ_j w_j f_j(x_t,y_t)} / Z_{x_t} )
                 = Σ_t (1/Z_{x_t}) ( Z_{x_t} f_i(x_t, y_t) − Σ_{y′∈Y} e^{Σ_j w_j f_j(x_t,y′)} f_i(x_t, y′) )
                 = Σ_t f_i(x_t, y_t) − Σ_t Σ_{y′∈Y} ( e^{Σ_j w_j f_j(x_t,y′)} / Z_{x_t} ) f_i(x_t, y′)
                 = Σ_t f_i(x_t, y_t) − Σ_t Σ_{y′∈Y} P(y′|x_t) f_i(x_t, y′)
    Machine Learning for Language Technology 54(55)
  • 55. Gradient Ascent for Logistic Regression: FINALLY!!!
    After all that,
      ∂F(w)/∂w_i = Σ_t f_i(x_t, y_t) − Σ_t Σ_{y′∈Y} P(y′|x_t) f_i(x_t, y′)
    And the gradient is:
      ∇F(w) = (∂F(w)/∂w_0, ∂F(w)/∂w_1, ..., ∂F(w)/∂w_m)
    So we can now use gradient ascent to find w!!
    Machine Learning for Language Technology 55(55)