2. Today
Algorithms for Classification
Binary classification
Perceptron
Winnow
Support Vector Machines (SVM)
Kernel Methods
Multi-Class classification
Decision Trees
Naïve Bayes
K nearest neighbor
3. Binary Classification: examples
Spam filtering (spam, not spam)
Customer service message classification (urgent, not urgent)
Information retrieval (relevant, not relevant)
Sentiment classification (positive, negative)
Sometimes it can be convenient to treat a multi-way problem as a binary one: one class versus all the others, repeated for each class (a minimal sketch of this one-versus-all reduction follows)
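A sketch of the one-versus-all reduction, not from the slides; train_binary and score are hypothetical stand-ins for any of the binary learners covered below (Perceptron, Winnow, SVM):

    # One-versus-all: reduce an M-class problem to M binary problems.
    # `train_binary` and `score` are hypothetical stand-ins for any
    # binary learner (Perceptron, Winnow, SVM).
    def one_vs_rest_train(train_binary, X, y, classes):
        # One binary model per class: +1 for the class, -1 for all the others
        return {c: train_binary(X, [+1 if label == c else -1 for label in y])
                for c in classes}

    def one_vs_rest_predict(models, score, x):
        # Pick the class whose binary model gives x the highest score
        return max(models, key=lambda c: score(models[c], x))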
4. Binary Classification
Given: some data items that belong to a positive (+1) or a negative (-1) class
Task: train the classifier and predict the class for a new data item
Geometrically: find a separator
5. Linear versus Non-Linear Algorithms
Linearly separable data: all the data points can be correctly classified by a linear (hyperplanar) decision boundary
9. Linear versus Non-Linear Algorithms
Linearly or non-linearly separable data?
We can find out only empirically
Linear algorithms (algorithms that find a linear decision boundary)
Use when we think the data is linearly separable
Advantages
– Simpler, fewer parameters
Disadvantages
– High-dimensional data (as in NLP) is usually not linearly separable
Examples: Perceptron, Winnow, SVM
Note: we can use linear algorithms for non-linear problems too (see Kernel methods)
10. Linear versus Non-Linear Algorithms
Non-linear algorithms
Use when the data is not linearly separable
Advantages
– More accurate
Disadvantages
– More complicated, more parameters
Example: Kernel methods
Note: the distinction between linear and non-linear also applies to multi-class classification (we'll see this later)
11. Simple linear algorithms
Perceptron and Winnow algorithms
Linear
Binary classification
Online (process the data sequentially, one data point at a time)
Mistake-driven
Simple single-layer neural networks
12. Linear binary classification
Data: {(x_i, y_i)}, i = 1…n
x in R^d: the feature vector (a vector in d-dimensional space)
y in {-1, +1}: the label (class, category)
Question:
Design a linear decision boundary wx + b (the equation of a hyperplane) such that the classification rule associated with it has minimal probability of error
Classification rule:
– y = sign(wx + b), which means:
– if wx + b > 0 then y = +1
– if wx + b < 0 then y = -1
From Gert Lanckriet, Statistical Learning Theory Tutorial
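As a sketch, this classification rule translates directly into code (plain Python; the names are ours):

    # Classification rule y = sign(wx + b), as plain Python.
    def decision(x, w, b):
        score = sum(wj * xj for wj, xj in zip(w, x)) + b   # wx + b
        return +1 if score > 0 else -1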
13. Linear binary classification
Find a good hyperplane (w, b) in R^(d+1) that correctly classifies as many data points as possible
In online fashion: one data point at a time, updating the weights as necessary
The hyperplane: wx + b = 0
Classification rule: y = sign(wx + b)
From Gert Lanckriet, Statistical Learning Theory Tutorial
14. Perceptron algorithm
Initialize: w_1 = 0
Update rule, for each data point x_i:
If class(x_i) != decision(x_i, w_k)
then
– w_{k+1} <- w_k + y_i x_i
– k <- k+1
else
– w_{k+1} <- w_k
Function decision(x, w):
If wx + b > 0 return +1
Else return -1
[Figure: a mistake on a point moves the boundary from w_k x + b = 0 to w_{k+1} x + b = 0, with the +1 and -1 regions on either side]
From Gert Lanckriet, Statistical Learning Theory Tutorial
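A runnable sketch of the update rule above, on toy data; folding the bias b into w via a constant feature is a standard trick, not something shown on the slide:

    # Perceptron: additive, mistake-driven updates w <- w + y_i * x_i.
    def perceptron_train(X, y, epochs=100):
        w = [0.0] * (len(X[0]) + 1)            # w_1 = 0; last slot is the bias b
        for _ in range(epochs):
            mistakes = 0
            for xi, yi in zip(X, y):
                xi = list(xi) + [1.0]          # constant feature standing in for b
                score = sum(wj * xj for wj, xj in zip(w, xi))
                if yi * score <= 0:            # mistake: update and count it
                    w = [wj + yi * xj for wj, xj in zip(w, xi)]
                    mistakes += 1
            if mistakes == 0:                  # converged (data was separable)
                return w
        return w

    # Toy data, separable by x1 + x2 > 0
    X = [(2, 1), (1, 3), (-1, -2), (-3, -1)]
    y = [+1, +1, -1, -1]
    w = perceptron_train(X, y)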
15. Perceptron algorithm
Online: can adjust to a changing target over time
Advantages
Simple and computationally efficient
Guaranteed to learn a linearly separable problem (convergence to a global optimum)
Limitations
Only linear separations
Only converges for linearly separable data
Not very efficient when there are many features
From Gert Lanckriet, Statistical Learning Theory Tutorial
16. Winnow algorithm
Another online algorithm for learning perceptron-style weights:
f(x) = sign(wx + b)
Linear, binary classification
Update rule: again mistake-driven, but multiplicative (instead of additive)
From Gert Lanckriet, Statistical Learning Theory Tutorial
17. Winnow algorithm
Initialize: w_1 = 1 (all ones, not zero: a multiplicative update cannot move weights off zero)
Update rule, for each data point x_i:
If class(x_i) != decision(x_i, w_k)
then
– w_{k+1} <- w_k + y_i x_i (Perceptron: additive)
– w_{k+1} <- w_k * exp(y_i x_i) (Winnow: multiplicative, element-wise)
– k <- k+1
else
– w_{k+1} <- w_k
Function decision(x, w):
If wx + b > 0 return +1
Else return -1
[Figure: as for the Perceptron, a mistake moves the boundary from w_k x + b = 0 to w_{k+1} x + b = 0]
From Gert Lanckriet, Statistical Learning Theory Tutorial
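A sketch of the Winnow variant, using the slide's exponential update with an assumed learning rate eta; the all-ones initialization is ours, since the multiplicative update cannot leave zero:

    import math

    # Winnow-style training: same mistake-driven loop as the Perceptron,
    # but the update is multiplicative: w_j <- w_j * exp(eta * y_i * x_ij).
    def winnow_train(X, y, b=0.0, eta=0.5, epochs=100):
        w = [1.0] * len(X[0])                  # all-ones start (not zero)
        for _ in range(epochs):
            mistakes = 0
            for xi, yi in zip(X, y):
                score = sum(wj * xj for wj, xj in zip(w, xi)) + b
                if (1 if score > 0 else -1) != yi:
                    w = [wj * math.exp(eta * yi * xj)   # element-wise update
                         for wj, xj in zip(w, xi)]
                    mistakes += 1
            if mistakes == 0:
                return w
        return w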
18. Perceptron vs. Winnow
Assume
N available features
only K relevant features, with K << N
Perceptron: number of mistakes O(K·N)
Winnow: number of mistakes O(K log N)
Winnow is more robust to high-dimensional feature spaces
From Gert Lanckriet, Statistical Learning Theory Tutorial
19. Perceptron vs. Winnow
Perceptron
Online: can adjust to a changing target over time
Advantages
– Simple and computationally efficient
– Guaranteed to learn a linearly separable problem
Limitations
– Only linear separations
– Only converges for linearly separable data
– Not very efficient when there are many features

Winnow
Online: can adjust to a changing target over time
Advantages
– Simple and computationally efficient
– Guaranteed to learn a linearly separable problem
– Suitable for problems with many irrelevant attributes
Limitations
– Only linear separations
– Only converges for linearly separable data
– Not very efficient when there are many features
Used in NLP
From Gert Lanckriet, Statistical Learning Theory Tutorial
21. Large margin classifier
Another family of linear algorithms
Intuition (Vapnik, 1965): if the classes are linearly separable:
Separate the data
Place the hyperplane "far" from the data: large margin
Statistical results guarantee good generalization
[Figure: a hyperplane that separates the data but passes close to it (BAD)]
From Gert Lanckriet, Statistical Learning Theory Tutorial
22. Large margin classifier
Intuition (Vapnik, 1965): if linearly separable:
Separate the data
Place the hyperplane "far" from the data: large margin
Statistical results guarantee good generalization
[Figure: a maximal margin classifier, with the hyperplane far from both classes (GOOD)]
From Gert Lanckriet, Statistical Learning Theory Tutorial
23. Large margin classifier
If not linearly separable:
Allow some errors
Still, try to place the hyperplane "far" from each class
From Gert Lanckriet, Statistical Learning Theory Tutorial
25. Support Vector Machine (SVM)
Large Margin Classifier, linearly separable case
Goal: find the hyperplane that maximizes the margin M
[Figure: decision hyperplane w^T x + b = 0 midway between the margin hyperplanes w^T x_a + b = 1 and w^T x_b + b = -1; the data points lying on the margins are the support vectors]
From Gert Lanckriet, Statistical Learning Theory Tutorial
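The slides do not prescribe an implementation; as a sketch, scikit-learn (a library choice of ours) finds this maximum-margin hyperplane on toy separable data:

    from sklearn import svm

    # Toy linearly separable data
    X = [[2, 1], [1, 3], [-1, -2], [-3, -1]]
    y = [+1, +1, -1, -1]

    # Linear SVM: a large C approximates the hard-margin (separable) case
    clf = svm.SVC(kernel="linear", C=1e6)
    clf.fit(X, y)

    print(clf.coef_, clf.intercept_)   # the w and b of the learned hyperplane
    print(clf.support_vectors_)        # the points that determine the margin
    print(clf.predict([[0.5, 0.5]]))   # classify a new data item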
26. Support Vector Machine (SVM)
Applications:
Text classification
Handwriting recognition
Computational biology (e.g., microarray data)
Face detection
Facial expression recognition
Time series prediction
From Gert Lanckriet, Statistical Learning Theory Tutorial
29. Non-linear problems
Kernel methods
A family of non-linear algorithms
Transform the non-linear problem into a linear one (in a different feature space)
Use linear algorithms to solve the linear problem in the new space
From Gert Lanckriet, Statistical Learning Theory Tutorial
32. Basic principle of kernel methods
Linear separability: more likely in high dimensions
Mapping: Φ maps the input into a high-dimensional feature space
Classifier: construct a linear classifier in the high-dimensional feature space
Motivation: an appropriate choice of Φ leads to linear separability
We can do this efficiently!
From Gert Lanckriet, Statistical Learning Theory Tutorial
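The efficiency comes from the kernel trick: inner products in the feature space can be computed without ever constructing Φ(x). A sketch for the quadratic map in two dimensions (the example is ours):

    import math

    # Explicit quadratic feature map for x = (x1, x2):
    # phi(x) = (x1^2, x2^2, sqrt(2) * x1 * x2)
    def phi(x):
        x1, x2 = x
        return (x1 * x1, x2 * x2, math.sqrt(2) * x1 * x2)

    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))

    # The polynomial kernel k(x, z) = (x . z)^2 yields the same inner
    # product in the 3-dimensional feature space without building phi(x).
    def kernel(x, z):
        return dot(x, z) ** 2

    x, z = (1.0, 2.0), (3.0, 0.5)
    assert abs(dot(phi(x), phi(z)) - kernel(x, z)) < 1e-9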
33. Basic principle of kernel methods
We can use the linear algorithms seen before (Perceptron, SVM) for classification in the higher-dimensional space
34. Multi-class classification
Given: some data items that belong to one of M possible classes
Task: train the classifier and predict the class for a new data item
Geometrically: a harder problem; there is no longer a single simple separator
37. (Some) Algorithms for Multi-class Classification
Linear
Parallel class separators: Decision Trees
Non-parallel class separators: Naïve Bayes
Non-linear
K-nearest neighbors
41. Decision Trees
A decision tree is a classifier in the form of a tree structure, where each node is either:
Leaf node: indicates the value of the target attribute (class) of examples, or
Decision node: specifies some test to be carried out on a single attribute value, with one branch and sub-tree for each possible outcome of the test.
A decision tree can be used to classify an example by starting at the root of the tree and moving down until a leaf node is reached; the leaf provides the classification of the instance.
http://dms.irb.hr/tutorial/tut_dtrees.php
42. Training Examples
Goal: learn when we can play Tennis and when we cannot
Day | Outlook | Temp. | Humidity | Wind | Play Tennis
D1 | Sunny | Hot | High | Weak | No
D2 | Sunny | Hot | High | Strong | No
D3 | Overcast | Hot | High | Weak | Yes
D4 | Rain | Mild | High | Weak | Yes
D5 | Rain | Cool | Normal | Weak | Yes
D6 | Rain | Cool | Normal | Strong | No
D7 | Overcast | Cool | Normal | Weak | Yes
D8 | Sunny | Mild | High | Weak | No
D9 | Sunny | Cool | Normal | Weak | Yes
D10 | Rain | Mild | Normal | Strong | Yes
D11 | Sunny | Mild | Normal | Strong | Yes
D12 | Overcast | Mild | High | Strong | Yes
D13 | Overcast | Hot | Normal | Weak | Yes
D14 | Rain | Mild | High | Strong | No
43. Decision Tree for PlayTennis
Outlook
  Sunny -> Humidity
    High -> No
    Normal -> Yes
  Overcast -> Yes
  Rain -> Wind
    Strong -> No
    Weak -> Yes
www.math.tau.ac.il/~nin/Courses/ML04/DecisionTreesCLS.pp
44. Decision Tree for PlayTennis
The same tree:
Outlook
  Sunny -> Humidity
    High -> No
    Normal -> Yes
  Overcast -> Yes
  Rain -> Wind
    Strong -> No
    Weak -> Yes
Each internal node tests an attribute
Each branch corresponds to an attribute value
Each leaf node assigns a classification
www.math.tau.ac.il/~nin/Courses/ML04/DecisionTreesCLS.pp
45. Decision Tree for PlayTennis
A new instance: Outlook = Sunny, Temperature = Hot, Humidity = High, Wind = Weak. PlayTennis? -> No
Outlook
  Sunny -> Humidity
    High -> No
    Normal -> Yes
  Overcast -> Yes
  Rain -> Wind
    Strong -> No
    Weak -> Yes
www.math.tau.ac.il/~nin/Courses/ML04/DecisionTreesCLS.pp
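A sketch of this tree and the classification walk as code (plain Python, nested dicts; the representation is ours):

    # The PlayTennis tree above as nested structures: internal nodes map
    # attribute values to subtrees, leaves are class labels.
    tree = ("Outlook", {
        "Sunny":    ("Humidity", {"High": "No", "Normal": "Yes"}),
        "Overcast": "Yes",
        "Rain":     ("Wind", {"Strong": "No", "Weak": "Yes"}),
    })

    def classify(node, example):
        while isinstance(node, tuple):         # descend until we reach a leaf
            attribute, branches = node
            node = branches[example[attribute]]
        return node

    x = {"Outlook": "Sunny", "Temperature": "Hot",
         "Humidity": "High", "Wind": "Weak"}
    print(classify(tree, x))                   # -> "No"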
46. Decision Tree for Reuters classification
[Figure: decision tree for a Reuters topic, from Foundations of Statistical Natural Language Processing, Manning and Schuetze]
47. Decision Tree for Reuters classification
[Figure: decision tree for a Reuters topic (continued), from Foundations of Statistical Natural Language Processing, Manning and Schuetze]
48. Building Decision Trees
Given training data, how do we construct them?
The central focus of the tree-growing algorithm is selecting which attribute to test at each node in the tree. The goal is to select the attribute that is most useful for classifying examples.
Top-down, greedy search through the space of possible decision trees
That is, it picks the best attribute and never looks back to reconsider earlier choices
49. Building Decision Trees
Splitting criterion
Finding the features and the values to split on
– for example, why test "cts" first and not "vs"?
– why test on "cts < 2" and not "cts < 5"?
Choose the split that gives the maximum information gain (the maximum reduction of uncertainty); a worked example follows this slide
Stopping criterion
When all the elements at one node have the same class, there is no need to split further
In practice, one first builds a large tree and then prunes it back (to avoid overfitting)
See Foundations of Statistical Natural Language Processing, Manning and Schuetze, for a good introduction
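As a worked example of information gain: testing Outlook first on the PlayTennis data (9 Yes, 5 No overall) gains about 0.247 bits. A sketch of the computation:

    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    play = ["Yes"] * 9 + ["No"] * 5            # the 14 PlayTennis labels

    # Partition of the labels induced by testing Outlook (see the table above)
    outlook = {
        "Sunny":    ["No", "No", "No", "Yes", "Yes"],
        "Overcast": ["Yes", "Yes", "Yes", "Yes"],
        "Rain":     ["Yes", "Yes", "Yes", "No", "No"],
    }
    remainder = sum(len(s) / len(play) * entropy(s) for s in outlook.values())
    print(round(entropy(play) - remainder, 3)) # 0.247 bits gained by Outlook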
50. Decision Trees: Strengths
Decision trees are able to generate understandable rules.
Decision trees perform classification without requiring much computation.
Decision trees are able to handle both continuous and categorical variables.
Decision trees provide a clear indication of which features are most important for prediction or classification.
http://dms.irb.hr/tutorial/tut_dtrees.php
51. Decision Trees: weaknesses
Decision trees are prone to errors in classification problems with many classes and a relatively small number of training examples.
Decision trees can be computationally expensive to train.
– Need to compare all possible splits
– Pruning is also expensive
Most decision-tree algorithms only examine a single field at a time. This leads to rectangular classification boxes that may not correspond well with the actual distribution of records in the decision space.
http://dms.irb.hr/tutorial/tut_dtrees.php
54. Naïve Bayes Models
Graphical models: graph theory plus probability theory
Nodes are variables
Edges are conditional probabilities
[Figure: a graph with root A and children B and C, annotated P(A), P(B|A), P(C|A)]
55. Naïve Bayes Models
Graphical models: graph theory plus probability theory
Nodes are variables
Edges are conditional probabilities
The absence of an edge between nodes implies independence between the variables of the nodes
[Figure: the same graph with root A and children B and C, annotated P(A), P(B|A), P(C|A); with no edge between B and C, P(C|A,B) reduces to P(C|A)]
56. Naïve Bayes for text classification
[Figure: Naïve Bayes model for text classification, from Foundations of Statistical Natural Language Processing, Manning and Schuetze]
57. Naïve Bayes for text classification
[Figure: Naïve Bayes model for the topic "earn", with word nodes "shr", "34", "cts", "vs", "per", "shr"]
58. Naïve Bayes for text classification
[Figure: topic node with word nodes w1, w2, …, wn]
The words depend on the topic: P(w_i | Topic)
P(cts | earn) > P(tennis | earn)
Naïve Bayes assumption: all words are independent given the topic
From the training set we learn the probabilities P(w_i | Topic) for each word and each topic
59. Naïve Bayes for text classification
[Figure: topic node with word nodes w1, w2, …, wn]
To classify a new example:
Calculate P(Topic | w1, w2, …, wn) for each topic
Bayes decision rule: choose the topic T' for which
P(T' | w1, w2, …, wn) > P(T | w1, w2, …, wn) for each T ≠ T'
60. Naïve Bayes: Math
Naïve Bayes defines a joint probability distribution:
P(Topic, w1, w2, …, wn) = P(Topic) ∏_i P(w_i | Topic)
We learn P(Topic) and P(w_i | Topic) in training
Test: we need P(Topic | w1, w2, …, wn)
P(Topic | w1, w2, …, wn) = P(Topic, w1, w2, …, wn) / P(w1, w2, …, wn)
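A sketch of training and classification on a tiny made-up corpus; the add-one smoothing is our addition, since the slides do not specify an estimator:

    import math
    from collections import Counter, defaultdict

    # Made-up toy corpus: (topic, tokenized document)
    train = [
        ("earn",   ["shr", "34", "cts", "vs", "per", "shr"]),
        ("earn",   ["shr", "12", "cts", "vs", "10", "cts"]),
        ("tennis", ["match", "set", "serve", "cts"]),
    ]

    # Training: estimate P(Topic) and P(w | Topic), with add-one smoothing
    topic_counts = Counter(t for t, _ in train)
    word_counts = defaultdict(Counter)
    for topic, words in train:
        word_counts[topic].update(words)
    vocab = {w for _, words in train for w in words}

    def log_p(w, topic):
        c = word_counts[topic]
        return math.log((c[w] + 1) / (sum(c.values()) + len(vocab)))

    # Bayes decision rule: argmax over topics of P(T) * prod_i P(w_i | T),
    # computed in log space to avoid underflow
    def classify(words):
        def score(t):
            return math.log(topic_counts[t] / len(train)) + \
                   sum(log_p(w, t) for w in words)
        return max(topic_counts, key=score)

    print(classify(["shr", "cts", "vs"]))      # -> "earn"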
61. Naïve Bayes: Strengths
Very simple model
Easy to understand
Very easy to implement
Very efficient: fast training and classification
Modest storage space
Widely used because it works really well for text categorization
Linear, but non-parallel decision boundaries
62. Naïve Bayes: weaknesses
The Naïve Bayes independence assumption has two consequences:
The linear ordering of words is ignored (bag-of-words model)
The words are assumed independent of each other given the class, which is false
– "President" is more likely to occur in a context that contains "election" than in a context that contains "poet"
The Naïve Bayes assumption is inappropriate if there are strong conditional dependencies between the variables
(But even if the model is not "right", Naïve Bayes models do well in a surprisingly large number of cases, because often we are interested in classification accuracy and not in accurate probability estimates)
64. k Nearest Neighbor Classification
Nearest-neighbor classification rule: to classify a new object, find the object in the training set that is most similar, then assign the category of this nearest neighbor
K nearest neighbors (KNN): consult the k nearest neighbors and decide based on the majority category of these neighbors; more robust than k = 1
An example of a similarity measure often used in NLP is cosine similarity (see the sketch below)
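A brute-force sketch of KNN with cosine similarity (scanning the whole training set for each query, which is exactly what makes the method computationally expensive; the toy data is ours):

    import math
    from collections import Counter

    def cosine(u, v):
        num = sum(a * b for a, b in zip(u, v))
        return num / (math.sqrt(sum(a * a for a in u)) *
                      math.sqrt(sum(b * b for b in v)))

    def knn_classify(train, x, k=3):
        # train: list of (feature_vector, category) pairs
        nearest = sorted(train, key=lambda item: cosine(item[0], x),
                         reverse=True)[:k]
        # Majority vote among the k most similar training items
        return Counter(cat for _, cat in nearest).most_common(1)[0][0]

    train = [((1.0, 0.0), "A"), ((0.9, 0.1), "A"), ((0.0, 1.0), "B")]
    print(knn_classify(train, (1.0, 0.2)))     # -> "A"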
68. 3-Nearest Neighbor
Assign the category of the majority of the neighbors
[Figure: a 3-NN example where the single closest neighbor belongs to a different class than the majority]
Since one neighbor can be much closer than the others, we can weight neighbors according to their similarity
69. k Nearest Neighbor Classification
Strengths
Robust
Conceptually simple
Often works well
Powerful (arbitrary decision boundaries)
Weaknesses
Performance is very dependent on the similarity measure used (and, to a lesser extent, on the number of neighbors k)
Finding a good similarity measure can be difficult
Computationally expensive
70. Summary
Algorithms for Classification
Linear versus non-linear classification
Binary classification
Perceptron
Winnow
Support Vector Machines (SVM)
Kernel Methods
Multi-Class classification
Decision Trees
Naïve Bayes
K nearest neighbor
On Wednesday: Weka