2. Today
Algorithms for Classification
Binary classification
Perceptron
Winnow
Support Vector Machines (SVM)
Kernel Methods
Multi-Class classification
Decision Trees
Naïve Bayes
K nearest neighbor
3. Binary Classification: examples
Spam filtering (spam, not spam)
Customer service message classification (urgent, not urgent)
Information retrieval (relevant, not relevant)
Sentiment classification (positive, negative)
Sometimes it can be convenient to treat a multi-way problem as a binary one: one class versus all the others, repeated for each class (a minimal sketch of this one-versus-all reduction follows)
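A sketch of the one-versus-all reduction, not from the slides; train_binary and score are hypothetical stand-ins for any of the binary learners covered below (Perceptron, Winnow, SVM):

    # One-versus-all: reduce an M-class problem to M binary problems.
    # `train_binary` and `score` are hypothetical stand-ins for any
    # binary learner (Perceptron, Winnow, SVM).
    def one_vs_rest_train(train_binary, X, y, classes):
        # One binary model per class: +1 for the class, -1 for all the others
        return {c: train_binary(X, [+1 if label == c else -1 for label in y])
                for c in classes}

    def one_vs_rest_predict(models, score, x):
        # Pick the class whose binary model gives x the highest score
        return max(models, key=lambda c: score(models[c], x))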
4. Binary Classification
Given: some data items that belong to a positive (+1) or a negative (-1) class
Task: train the classifier and predict the class for a new data item
Geometrically: find a separator
5. Linear versus Non-Linear Algorithms
Linearly separable data: all the data points can be correctly classified by a linear (hyperplanar) decision boundary
9. Linear versus Non-Linear Algorithms
Linearly or non-linearly separable data?
We can find out only empirically
Linear algorithms (algorithms that find a linear decision boundary)
Use when we think the data is linearly separable
Advantages
– Simpler, fewer parameters
Disadvantages
– High-dimensional data (as in NLP) is usually not linearly separable
Examples: Perceptron, Winnow, SVM
Note: we can use linear algorithms for non-linear problems too (see Kernel methods)
10. Linear versus Non-Linear Algorithms
Non-linear algorithms
Use when the data is not linearly separable
Advantages
– More accurate
Disadvantages
– More complicated, more parameters
Example: Kernel methods
Note: the distinction between linear and non-linear also applies to multi-class classification (we'll see this later)
11. Simple linear algorithms
Perceptron and Winnow algorithms
Linear
Binary classification
Online (process the data sequentially, one data point at a time)
Mistake-driven
Simple single-layer neural networks
12. Linear binary classification
Data: {(x_i, y_i)}, i = 1…n
x in R^d: the feature vector (a vector in d-dimensional space)
y in {-1, +1}: the label (class, category)
Question:
Design a linear decision boundary wx + b (the equation of a hyperplane) such that the classification rule associated with it has minimal probability of error
Classification rule:
– y = sign(wx + b), which means:
– if wx + b > 0 then y = +1
– if wx + b < 0 then y = -1
From Gert Lanckriet, Statistical Learning Theory Tutorial
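As a sketch, this classification rule translates directly into code (plain Python; the names are ours):

    # Classification rule y = sign(wx + b), as plain Python.
    def decision(x, w, b):
        score = sum(wj * xj for wj, xj in zip(w, x)) + b   # wx + b
        return +1 if score > 0 else -1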
13. Linear binary classification
Find a good hyperplane (w, b) in R^(d+1) that correctly classifies as many data points as possible
In online fashion: one data point at a time, updating the weights as necessary
The hyperplane: wx + b = 0
Classification rule: y = sign(wx + b)
From Gert Lanckriet, Statistical Learning Theory Tutorial
14. Perceptron algorithm
Initialize: w_1 = 0
Update rule, for each data point x_i:
If class(x_i) != decision(x_i, w_k)
then
– w_{k+1} <- w_k + y_i x_i
– k <- k+1
else
– w_{k+1} <- w_k
Function decision(x, w):
If wx + b > 0 return +1
Else return -1
[Figure: a mistake on a point moves the boundary from w_k x + b = 0 to w_{k+1} x + b = 0, with the +1 and -1 regions on either side]
From Gert Lanckriet, Statistical Learning Theory Tutorial
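A runnable sketch of the update rule above, on toy data; folding the bias b into w via a constant feature is a standard trick, not something shown on the slide:

    # Perceptron: additive, mistake-driven updates w <- w + y_i * x_i.
    def perceptron_train(X, y, epochs=100):
        w = [0.0] * (len(X[0]) + 1)            # w_1 = 0; last slot is the bias b
        for _ in range(epochs):
            mistakes = 0
            for xi, yi in zip(X, y):
                xi = list(xi) + [1.0]          # constant feature standing in for b
                score = sum(wj * xj for wj, xj in zip(w, xi))
                if yi * score <= 0:            # mistake: update and count it
                    w = [wj + yi * xj for wj, xj in zip(w, xi)]
                    mistakes += 1
            if mistakes == 0:                  # converged (data was separable)
                return w
        return w

    # Toy data, separable by x1 + x2 > 0
    X = [(2, 1), (1, 3), (-1, -2), (-3, -1)]
    y = [+1, +1, -1, -1]
    w = perceptron_train(X, y)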
15. Perceptron algorithm
Online: can adjust to a changing target over time
Advantages
Simple and computationally efficient
Guaranteed to learn a linearly separable problem (convergence to a global optimum)
Limitations
Only linear separations
Only converges for linearly separable data
Not very efficient when there are many features
From Gert Lanckriet, Statistical Learning Theory Tutorial
16. Winnow algorithm
Another online algorithm for learning perceptron-style weights:
f(x) = sign(wx + b)
Linear, binary classification
Update rule: again mistake-driven, but multiplicative (instead of additive)
From Gert Lanckriet, Statistical Learning Theory Tutorial
17. Winnow algorithm
Initialize: w_1 = 1 (all ones, not zero: a multiplicative update cannot move weights off zero)
Update rule, for each data point x_i:
If class(x_i) != decision(x_i, w_k)
then
– w_{k+1} <- w_k + y_i x_i (Perceptron: additive)
– w_{k+1} <- w_k * exp(y_i x_i) (Winnow: multiplicative, element-wise)
– k <- k+1
else
– w_{k+1} <- w_k
Function decision(x, w):
If wx + b > 0 return +1
Else return -1
[Figure: as for the Perceptron, a mistake moves the boundary from w_k x + b = 0 to w_{k+1} x + b = 0]
From Gert Lanckriet, Statistical Learning Theory Tutorial
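A sketch of the Winnow variant, using the slide's exponential update with an assumed learning rate eta; the all-ones initialization is ours, since the multiplicative update cannot leave zero:

    import math

    # Winnow-style training: same mistake-driven loop as the Perceptron,
    # but the update is multiplicative: w_j <- w_j * exp(eta * y_i * x_ij).
    def winnow_train(X, y, b=0.0, eta=0.5, epochs=100):
        w = [1.0] * len(X[0])                  # all-ones start (not zero)
        for _ in range(epochs):
            mistakes = 0
            for xi, yi in zip(X, y):
                score = sum(wj * xj for wj, xj in zip(w, xi)) + b
                if (1 if score > 0 else -1) != yi:
                    w = [wj * math.exp(eta * yi * xj)   # element-wise update
                         for wj, xj in zip(w, xi)]
                    mistakes += 1
            if mistakes == 0:
                return w
        return w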
18. Perceptron vs. Winnow
Assume
N available features
only K relevant features, with K << N
Perceptron: number of mistakes O(K·N)
Winnow: number of mistakes O(K log N)
Winnow is more robust to high-dimensional feature spaces
From Gert Lanckriet, Statistical Learning Theory Tutorial
19. Perceptron vs. Winnow
Perceptron
Online: can adjust to a changing target over time
Advantages
– Simple and computationally efficient
– Guaranteed to learn a linearly separable problem
Limitations
– Only linear separations
– Only converges for linearly separable data
– Not very efficient when there are many features

Winnow
Online: can adjust to a changing target over time
Advantages
– Simple and computationally efficient
– Guaranteed to learn a linearly separable problem
– Suitable for problems with many irrelevant attributes
Limitations
– Only linear separations
– Only converges for linearly separable data
– Not very efficient when there are many features
Used in NLP
From Gert Lanckriet, Statistical Learning Theory Tutorial
21. Large margin classifier
Another family of linear algorithms
Intuition (Vapnik, 1965): if the classes are linearly separable:
Separate the data
Place the hyperplane "far" from the data: large margin
Statistical results guarantee good generalization
[Figure: a hyperplane that separates the data but passes close to it (BAD)]
From Gert Lanckriet, Statistical Learning Theory Tutorial
22. Large margin classifier
Intuition (Vapnik, 1965): if linearly separable:
Separate the data
Place the hyperplane "far" from the data: large margin
Statistical results guarantee good generalization
[Figure: a maximal margin classifier, with the hyperplane far from both classes (GOOD)]
From Gert Lanckriet, Statistical Learning Theory Tutorial
23. Large margin classifier
If not linearly separable:
Allow some errors
Still, try to place the hyperplane "far" from each class
From Gert Lanckriet, Statistical Learning Theory Tutorial
25. Support Vector Machine (SVM)
Large Margin Classifier, linearly separable case
Goal: find the hyperplane that maximizes the margin M
[Figure: decision hyperplane w^T x + b = 0 midway between the margin hyperplanes w^T x_a + b = 1 and w^T x_b + b = -1; the data points lying on the margins are the support vectors]
From Gert Lanckriet, Statistical Learning Theory Tutorial
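The slides do not prescribe an implementation; as a sketch, scikit-learn (a library choice of ours) finds this maximum-margin hyperplane on toy separable data:

    from sklearn import svm

    # Toy linearly separable data
    X = [[2, 1], [1, 3], [-1, -2], [-3, -1]]
    y = [+1, +1, -1, -1]

    # Linear SVM: a large C approximates the hard-margin (separable) case
    clf = svm.SVC(kernel="linear", C=1e6)
    clf.fit(X, y)

    print(clf.coef_, clf.intercept_)   # the w and b of the learned hyperplane
    print(clf.support_vectors_)        # the points that determine the margin
    print(clf.predict([[0.5, 0.5]]))   # classify a new data item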
26. Support Vector Machine (SVM)
Applications:
Text classification
Handwriting recognition
Computational biology (e.g., microarray data)
Face detection
Facial expression recognition
Time series prediction
From Gert Lanckriet, Statistical Learning Theory Tutorial
29. Non-linear problems
Kernel methods
A family of non-linear algorithms
Transform the non-linear problem into a linear one (in a different feature space)
Use linear algorithms to solve the linear problem in the new space
From Gert Lanckriet, Statistical Learning Theory Tutorial
32. Basic principle of kernel methods
Linear separability: more likely in high dimensions
Mapping: Φ maps the input into a high-dimensional feature space
Classifier: construct a linear classifier in the high-dimensional feature space
Motivation: an appropriate choice of Φ leads to linear separability
We can do this efficiently!
From Gert Lanckriet, Statistical Learning Theory Tutorial
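The efficiency comes from the kernel trick: inner products in the feature space can be computed without ever constructing Φ(x). A sketch for the quadratic map in two dimensions (the example is ours):

    import math

    # Explicit quadratic feature map for x = (x1, x2):
    # phi(x) = (x1^2, x2^2, sqrt(2) * x1 * x2)
    def phi(x):
        x1, x2 = x
        return (x1 * x1, x2 * x2, math.sqrt(2) * x1 * x2)

    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))

    # The polynomial kernel k(x, z) = (x . z)^2 yields the same inner
    # product in the 3-dimensional feature space without building phi(x).
    def kernel(x, z):
        return dot(x, z) ** 2

    x, z = (1.0, 2.0), (3.0, 0.5)
    assert abs(dot(phi(x), phi(z)) - kernel(x, z)) < 1e-9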
33. Basic principle of kernel methods
We can use the linear algorithms seen before (Perceptron, SVM) for classification in the higher-dimensional space
34. Multi-class classification
Given: some data items that belong to one of M possible classes
Task: train the classifier and predict the class for a new data item
Geometrically: a harder problem; there is no longer a single simple separator
37. (Some) Algorithms for Multi-class Classification
Linear
Parallel class separators: Decision Trees
Non-parallel class separators: Naïve Bayes
Non-linear
K-nearest neighbors
41. Decision Trees
A decision tree is a classifier in the form of a tree structure, where each node is either:
Leaf node: indicates the value of the target attribute (class) of examples, or
Decision node: specifies some test to be carried out on a single attribute value, with one branch and sub-tree for each possible outcome of the test.
A decision tree can be used to classify an example by starting at the root of the tree and moving down until a leaf node is reached; the leaf provides the classification of the instance.
http://dms.irb.hr/tutorial/tut_dtrees.php
42. Training Examples
Goal: learn when we can play Tennis and when we cannot
Day | Outlook | Temp. | Humidity | Wind | Play Tennis
D1 | Sunny | Hot | High | Weak | No
D2 | Sunny | Hot | High | Strong | No
D3 | Overcast | Hot | High | Weak | Yes
D4 | Rain | Mild | High | Weak | Yes
D5 | Rain | Cool | Normal | Weak | Yes
D6 | Rain | Cool | Normal | Strong | No
D7 | Overcast | Cool | Normal | Weak | Yes
D8 | Sunny | Mild | High | Weak | No
D9 | Sunny | Cool | Normal | Weak | Yes
D10 | Rain | Mild | Normal | Strong | Yes
D11 | Sunny | Mild | Normal | Strong | Yes
D12 | Overcast | Mild | High | Strong | Yes
D13 | Overcast | Hot | Normal | Weak | Yes
D14 | Rain | Mild | High | Strong | No
43. Decision Tree for PlayTennis
Outlook
  Sunny -> Humidity
    High -> No
    Normal -> Yes
  Overcast -> Yes
  Rain -> Wind
    Strong -> No
    Weak -> Yes
www.math.tau.ac.il/~nin/Courses/ML04/DecisionTreesCLS.pp
44. Decision Tree for PlayTennis
The same tree:
Outlook
  Sunny -> Humidity
    High -> No
    Normal -> Yes
  Overcast -> Yes
  Rain -> Wind
    Strong -> No
    Weak -> Yes
Each internal node tests an attribute
Each branch corresponds to an attribute value
Each leaf node assigns a classification
www.math.tau.ac.il/~nin/Courses/ML04/DecisionTreesCLS.pp
45. Decision Tree for PlayTennis
A new instance: Outlook = Sunny, Temperature = Hot, Humidity = High, Wind = Weak. PlayTennis? -> No
Outlook
  Sunny -> Humidity
    High -> No
    Normal -> Yes
  Overcast -> Yes
  Rain -> Wind
    Strong -> No
    Weak -> Yes
www.math.tau.ac.il/~nin/Courses/ML04/DecisionTreesCLS.pp
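A sketch of this tree and the classification walk as code (plain Python, nested dicts; the representation is ours):

    # The PlayTennis tree above as nested structures: internal nodes map
    # attribute values to subtrees, leaves are class labels.
    tree = ("Outlook", {
        "Sunny":    ("Humidity", {"High": "No", "Normal": "Yes"}),
        "Overcast": "Yes",
        "Rain":     ("Wind", {"Strong": "No", "Weak": "Yes"}),
    })

    def classify(node, example):
        while isinstance(node, tuple):         # descend until we reach a leaf
            attribute, branches = node
            node = branches[example[attribute]]
        return node

    x = {"Outlook": "Sunny", "Temperature": "Hot",
         "Humidity": "High", "Wind": "Weak"}
    print(classify(tree, x))                   # -> "No"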
46. Decision Tree for Reuters classification
[Figure: decision tree for a Reuters topic, from Foundations of Statistical Natural Language Processing, Manning and Schuetze]
47. Decision Tree for Reuters classification
[Figure: decision tree for a Reuters topic (continued), from Foundations of Statistical Natural Language Processing, Manning and Schuetze]
48. Building Decision Trees
Given training data, how do we construct them?
The central focus of the tree-growing algorithm is selecting which attribute to test at each node in the tree. The goal is to select the attribute that is most useful for classifying examples.
Top-down, greedy search through the space of possible decision trees
That is, it picks the best attribute and never looks back to reconsider earlier choices
49. Building Decision Trees
Splitting criterion
Finding the features and the values to split on
– for example, why test "cts" first and not "vs"?
– why test on "cts < 2" and not "cts < 5"?
Choose the split that gives the maximum information gain (the maximum reduction of uncertainty); a worked example follows this slide
Stopping criterion
When all the elements at one node have the same class, there is no need to split further
In practice, one first builds a large tree and then prunes it back (to avoid overfitting)
See Foundations of Statistical Natural Language Processing, Manning and Schuetze, for a good introduction
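As a worked example of information gain: testing Outlook first on the PlayTennis data (9 Yes, 5 No overall) gains about 0.247 bits. A sketch of the computation:

    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    play = ["Yes"] * 9 + ["No"] * 5            # the 14 PlayTennis labels

    # Partition of the labels induced by testing Outlook (see the table above)
    outlook = {
        "Sunny":    ["No", "No", "No", "Yes", "Yes"],
        "Overcast": ["Yes", "Yes", "Yes", "Yes"],
        "Rain":     ["Yes", "Yes", "Yes", "No", "No"],
    }
    remainder = sum(len(s) / len(play) * entropy(s) for s in outlook.values())
    print(round(entropy(play) - remainder, 3)) # 0.247 bits gained by Outlook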
50. Decision Trees: Strengths
Decision trees are able to generate understandable rules.
Decision trees perform classification without requiring much computation.
Decision trees are able to handle both continuous and categorical variables.
Decision trees provide a clear indication of which features are most important for prediction or classification.
http://dms.irb.hr/tutorial/tut_dtrees.php
51. Decision Trees: weaknesses
Decision trees are prone to errors in classification problems with many classes and a relatively small number of training examples.
Decision trees can be computationally expensive to train.
– Need to compare all possible splits
– Pruning is also expensive
Most decision-tree algorithms only examine a single field at a time. This leads to rectangular classification boxes that may not correspond well with the actual distribution of records in the decision space.
http://dms.irb.hr/tutorial/tut_dtrees.php
54. Naïve Bayes Models
Graphical models: graph theory plus probability theory
Nodes are variables
Edges are conditional probabilities
[Figure: a graph with root A and children B and C, annotated P(A), P(B|A), P(C|A)]
55. Naïve Bayes Models
Graphical models: graph theory plus probability theory
Nodes are variables
Edges are conditional probabilities
The absence of an edge between nodes implies independence between the variables of the nodes
[Figure: the same graph with root A and children B and C, annotated P(A), P(B|A), P(C|A); with no edge between B and C, P(C|A,B) reduces to P(C|A)]
56. Naïve Bayes for text classification
[Figure: Naïve Bayes model for text classification, from Foundations of Statistical Natural Language Processing, Manning and Schuetze]
57. Naïve Bayes for text classification
[Figure: Naïve Bayes model for the topic "earn", with word nodes "shr", "34", "cts", "vs", "per", "shr"]
58. Naïve Bayes for text classification
[Figure: topic node with word nodes w1, w2, …, wn]
The words depend on the topic: P(w_i | Topic)
P(cts | earn) > P(tennis | earn)
Naïve Bayes assumption: all words are independent given the topic
From the training set we learn the probabilities P(w_i | Topic) for each word and each topic
59. Naïve Bayes for text classification
[Figure: topic node with word nodes w1, w2, …, wn]
To classify a new example:
Calculate P(Topic | w1, w2, …, wn) for each topic
Bayes decision rule: choose the topic T' for which
P(T' | w1, w2, …, wn) > P(T | w1, w2, …, wn) for each T ≠ T'
60. Naïve Bayes: Math
Naïve Bayes defines a joint probability distribution:
P(Topic, w1, w2, …, wn) = P(Topic) ∏_i P(w_i | Topic)
We learn P(Topic) and P(w_i | Topic) in training
Test: we need P(Topic | w1, w2, …, wn)
P(Topic | w1, w2, …, wn) = P(Topic, w1, w2, …, wn) / P(w1, w2, …, wn)
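A sketch of training and classification on a tiny made-up corpus; the add-one smoothing is our addition, since the slides do not specify an estimator:

    import math
    from collections import Counter, defaultdict

    # Made-up toy corpus: (topic, tokenized document)
    train = [
        ("earn",   ["shr", "34", "cts", "vs", "per", "shr"]),
        ("earn",   ["shr", "12", "cts", "vs", "10", "cts"]),
        ("tennis", ["match", "set", "serve", "cts"]),
    ]

    # Training: estimate P(Topic) and P(w | Topic), with add-one smoothing
    topic_counts = Counter(t for t, _ in train)
    word_counts = defaultdict(Counter)
    for topic, words in train:
        word_counts[topic].update(words)
    vocab = {w for _, words in train for w in words}

    def log_p(w, topic):
        c = word_counts[topic]
        return math.log((c[w] + 1) / (sum(c.values()) + len(vocab)))

    # Bayes decision rule: argmax over topics of P(T) * prod_i P(w_i | T),
    # computed in log space to avoid underflow
    def classify(words):
        def score(t):
            return math.log(topic_counts[t] / len(train)) + \
                   sum(log_p(w, t) for w in words)
        return max(topic_counts, key=score)

    print(classify(["shr", "cts", "vs"]))      # -> "earn"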
61. Naïve Bayes: Strengths
Very simple model
Easy to understand
Very easy to implement
Very efficient: fast training and classification
Modest storage space
Widely used because it works really well for text categorization
Linear, but non-parallel decision boundaries
62. Naïve Bayes: weaknesses
The Naïve Bayes independence assumption has two consequences:
The linear ordering of words is ignored (bag-of-words model)
The words are assumed independent of each other given the class, which is false
– "President" is more likely to occur in a context that contains "election" than in a context that contains "poet"
The Naïve Bayes assumption is inappropriate if there are strong conditional dependencies between the variables
(But even if the model is not "right", Naïve Bayes models do well in a surprisingly large number of cases, because often we are interested in classification accuracy and not in accurate probability estimates)
64. k Nearest Neighbor Classification
Nearest-neighbor classification rule: to classify a new object, find the object in the training set that is most similar, then assign the category of this nearest neighbor
K nearest neighbors (KNN): consult the k nearest neighbors and decide based on the majority category of these neighbors; more robust than k = 1
An example of a similarity measure often used in NLP is cosine similarity (see the sketch below)
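A brute-force sketch of KNN with cosine similarity (scanning the whole training set for each query, which is exactly what makes the method computationally expensive; the toy data is ours):

    import math
    from collections import Counter

    def cosine(u, v):
        num = sum(a * b for a, b in zip(u, v))
        return num / (math.sqrt(sum(a * a for a in u)) *
                      math.sqrt(sum(b * b for b in v)))

    def knn_classify(train, x, k=3):
        # train: list of (feature_vector, category) pairs
        nearest = sorted(train, key=lambda item: cosine(item[0], x),
                         reverse=True)[:k]
        # Majority vote among the k most similar training items
        return Counter(cat for _, cat in nearest).most_common(1)[0][0]

    train = [((1.0, 0.0), "A"), ((0.9, 0.1), "A"), ((0.0, 1.0), "B")]
    print(knn_classify(train, (1.0, 0.2)))     # -> "A"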
68. 3-Nearest Neighbor
Assign the category of the majority of the neighbors
[Figure: a 3-NN example where the single closest neighbor belongs to a different class than the majority]
Since one neighbor can be much closer than the others, we can weight neighbors according to their similarity
69. k Nearest Neighbor Classification
Strengths
Robust
Conceptually simple
Often works well
Powerful (arbitrary decision boundaries)
Weaknesses
Performance is very dependent on the similarity measure used (and, to a lesser extent, on the number of neighbors k)
Finding a good similarity measure can be difficult
Computationally expensive
70. Summary
Algorithms for Classification
Linear versus non-linear classification
Binary classification
Perceptron
Winnow
Support Vector Machines (SVM)
Kernel Methods
Multi-Class classification
Decision Trees
Naïve Bayes
K nearest neighbor
On Wednesday: Weka