SIMS 290-2:
Applied Natural Language Processing

Barbara Rosario
October 4, 2004

1
Today
Algorithms for Classification
Binary classification

Perceptron
Winnow
Support Vector Machines (SVM)
Kernel Methods

Multi-Class classification
Decision Trees
Naïve Bayes
K nearest neighbor

2
Binary Classification: examples
Spam filtering (spam, not spam)
Customer service message classification (urgent vs.
not urgent)
Information retrieval (relevant, not relevant)
Sentiment classification (positive, negative)
Sometimes it can be convenient to treat a multi-way
problem as a binary one: one class versus all the
others, repeated for each class

3
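To make the one-versus-all reduction concrete, here is a minimal Python sketch (an illustration added to these notes, not from the original slides): train_binary stands for any binary learner that returns a scoring function, and the centroid-based stand-in below is only there so the example runs end to end.

# One-vs-rest reduction of a multi-class problem to binary classifiers.
# `train_binary` is any binary trainer returning a scoring function score(x);
# the names and data format here are assumptions for illustration.

def one_vs_rest_train(examples, classes, train_binary):
    scorers = {}
    for c in classes:
        # Relabel: +1 for class c, -1 for all the other classes.
        binary_data = [(x, +1 if label == c else -1) for x, label in examples]
        scorers[c] = train_binary(binary_data)
    return scorers

def one_vs_rest_predict(x, scorers):
    # Pick the class whose binary scorer is most confident about x.
    return max(scorers, key=lambda c: scorers[c](x))

# Hypothetical stand-in trainer: score by dot product with the centroid of
# the positive examples (only so the sketch is runnable).
def train_binary(binary_data):
    pos = [x for x, y in binary_data if y == +1]
    centroid = [sum(col) / len(pos) for col in zip(*pos)]
    return lambda x: sum(ci * xi for ci, xi in zip(centroid, x))

examples = [([1.0, 0.0], "sports"), ([0.9, 0.2], "sports"), ([0.0, 1.0], "earn")]
scorers = one_vs_rest_train(examples, {"sports", "earn"}, train_binary)
print(one_vs_rest_predict([0.8, 0.1], scorers))   # expected: sports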
Binary Classification
Given: some data items that belong to a positive (+1
) or a negative (-1 ) class
Task: Train the classifier and predict the class for a
new data item
Geometrically: find a separator

4
Linear versus Non Linear
algorithms
Linearly separable data: if all the data points can
be correctly classified by a linear (hyperplanar)
decision boundary

5
Linearly separable data

Linear Decision boundary

Class1
Class2
6
Non linearly separable data

Class1
Class2
7
Non linearly separable data

Non Linear Classifier

Class1
Class2
8
Linear versus Non Linear
algorithms
Linear or Non linear separable data?
We can find out only empirically

Linear algorithms (algorithms that find a linear decision
boundary)
When we think the data is linearly separable
Advantages
– Simpler, fewer parameters

Disadvantages
– High-dimensional data (as in NLP) is usually not linearly separable

Examples: Perceptron, Winnow, SVM
Note: we can use linear algorithms also for non linear problems
(see Kernel methods)

9
Linear versus Non Linear
algorithms
Non Linear
When the data is non linearly separable
Advantages
– More accurate

Disadvantages
– More complicated, more parameters

Example: Kernel methods

Note: the distinction between linear and non linear
applies also for multi-class classification (we’ll see
this later)
10
Simple linear algorithms
Perceptron and Winnow algorithm
Linear
Binary classification
Online (process data sequentially, one data point at a
time)
Mistake driven
Simple single layer Neural Networks

11
Linear binary classification
Data: {(x_i, y_i)}, i = 1...n
x in R^d (a vector in d-dimensional space) → feature vector
y in {-1, +1} → label (class, category)

Question:
Design a linear decision boundary: wx + b (equation of hyperplane) such
that the classification rule associated with it has minimal probability of error
classification rule :
– y = sign(w x + b) which means:
– if wx + b > 0 then y = +1
– if wx + b < 0 then y = -1

From Gert Lanckriet, Statistical Learning Theory Tutorial

12
Linear binary classification
Find a good hyperplane
(w, b) in R^(d+1)
that correctly classifies data
points as much as possible
In online fashion: one data
point at a time, updating the
weights as necessary

wx + b = 0
Classification Rule:
y = sign(wx + b)

From Gert Lanckriet, Statistical Learning Theory Tutorial

13
Perceptron algorithm
Initialize: w1 = 0
Updating rule: for each data point xi
If class(xi) != decision(xi, wk)
then
  wk+1 ← wk + yi xi
  k ← k + 1
else
  wk+1 ← wk

Function decision(x, w)
If wx + b > 0 return +1
Else return -1

[Figure: the update moves the separating hyperplane from wk x + b = 0 to wk+1 x + b = 0, with the +1 and -1 half-spaces on either side]

From Gert Lanckriet, Statistical Learning Theory Tutorial

14
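As a concrete companion to this slide, here is a minimal Python sketch of the perceptron (an addition to these notes, not part of the original slides); the bias b is handled by appending a constant 1 feature to every x, and the data format is an assumption.

# Minimal perceptron sketch. Assumes: data is a list of (x, y) pairs, x a
# list of floats whose last entry is the constant 1.0 (so b lives inside w),
# and y in {-1, +1}. Illustrative only.

def decision(x, w):
    # sign(w.x); with the constant feature this is sign(wx + b)
    return +1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1

def perceptron_train(data, epochs=10):
    w = [0.0] * len(data[0][0])          # Initialize: w1 = 0
    for _ in range(epochs):              # online passes over the data
        for x, y in data:
            if decision(x, w) != y:      # mistake-driven, additive update
                w = [wi + y * xi for wi, xi in zip(w, x)]
    return w

# Toy, linearly separable usage (the last feature 1.0 plays the role of b):
data = [([2.0, 1.0, 1.0], +1), ([-1.0, -2.0, 1.0], -1)]
w = perceptron_train(data)
print(decision([1.5, 0.5, 1.0], w))      # expected: +1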
Perceptron algorithm
Online: can adjust to changing target, over time
Advantages
Simple and computationally efficient
Guaranteed to learn a linearly separable problem
(convergence, global optimum)

Limitations
Only linear separations
Only converges for linearly separable data
Not really “efficient with many features”

From Gert Lanckriet, Statistical Learning Theory Tutorial

15
Winnow algorithm
Another online algorithm for learning perceptron
weights:
f(x) = sign(wx + b)
Linear, binary classification
Update-rule: again error-driven, but multiplicative
(instead of additive)

From Gert Lanckriet, Statistical Learning Theory Tutorial

16
Winnow algorithm
Initialize: w1 = 0
Updating rule: for each data point xi
If class(xi) != decision(xi, wk)
then
  wk+1 ← wk + yi xi          (Perceptron)
  wk+1 ← wk * exp(yi xi)     (Winnow)
  k ← k + 1
else
  wk+1 ← wk

Function decision(x, w)
If wx + b > 0 return +1
Else return -1

[Figure: as for the perceptron, the update moves the hyperplane from wk x + b = 0 to wk+1 x + b = 0]

From Gert Lanckriet, Statistical Learning Theory Tutorial

17
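A matching sketch of the multiplicative update (again an illustration added to these notes): the per-coordinate rule w ← w * exp(eta * y * x) follows the slide, but the weights start at 1 rather than 0, since a multiplicative update cannot move off an all-zero vector; the slide's "Initialize: w1 = 0" appears to be carried over from the perceptron slide.

import math

# Winnow-style multiplicative update sketch. Same (x, y) data format as the
# perceptron sketch above; eta is a learning rate (an assumption, not on
# the slide).

def decision(x, w):
    return +1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1

def winnow_train(data, epochs=10, eta=1.0):
    w = [1.0] * len(data[0][0])          # nonzero start for multiplicative updates
    for _ in range(epochs):
        for x, y in data:
            if decision(x, w) != y:      # mistake-driven, multiplicative update
                w = [wi * math.exp(eta * y * xi) for wi, xi in zip(w, x)]
    return w

data = [([2.0, 1.0, 1.0], +1), ([-1.0, -2.0, 1.0], -1)]
print(decision([1.5, 0.5, 1.0], winnow_train(data)))   # expected: +1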
Perceptron vs. Winnow
Assume
N available features
only K relevant features, with K << N

Perceptron: number of mistakes: O(K N)
Winnow: number of mistakes: O(K log N)
Winnow is more robust to high-dimensional feature spaces

From Gert Lanckriet, Statistical Learning Theory Tutorial

18
Perceptron vs. Winnow
Perceptron
Online: can adjust to changing
target, over time
Advantages

Simple and computationally
efficient
Guaranteed to learn a linearly
separable problem

Limitations

only linear separations
only converges for linearly
separable data
not really “efficient with many
features”

Winnow
Online: can adjust to changing
target, over time
Advantages

Simple and computationally
efficient
Guaranteed to learn a linearly
separable problem
Suitable for problems with
many irrelevant attributes

Limitations

only linear separations
only converges for linearly
separable data
not really “efficient with many
features”

Used in NLP

From Gert Lanckriet, Statistical Learning Theory Tutorial

19
Weka
Winnow in Weka

20
Large margin classifier
Another family of linear
algorithms
Intuition (Vapnik, 1965)
If the classes are linearly separable:
Separate the data
Place hyper-plane “far” from the
data: large margin
Statistical results guarantee
good generalization
[Figure: hyperplane placed close to the data (BAD)]

From Gert Lanckriet, Statistical Learning Theory Tutorial

21
Large margin classifier
Intuition (Vapnik, 1965) if linearly
separable:
Separate the data
Place hyperplane “far” from the
data: large margin
Statistical results guarantee
good generalization
[Figure: hyperplane far from both classes (GOOD)]

⇒ Maximal Margin Classifier

From Gert Lanckriet, Statistical Learning Theory Tutorial

22
Large margin classifier
If not linearly separable
Allow some errors
Still, try to place hyperplane
“far” from each class

From Gert Lanckriet, Statistical Learning Theory Tutorial

23
Large Margin Classifiers
Advantages
Theoretically better (better error bounds)

Limitations
Computationally more expensive, large quadratic
programming

24
Support Vector Machine (SVM)
Large Margin Classifier
Linearly separable case
Goal: find the hyperplane that maximizes the margin

[Figure: margin M between the classes; the separating hyperplane wT x + b = 0 lies midway between the margin hyperplanes wT xa + b = 1 and wT xb + b = -1, which pass through the support vectors]
From Gert Lanckriet, Statistical Learning Theory Tutorial

25
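For readers who want to try this, a linear SVM can be trained with an off-the-shelf library; the sketch below (not from the slides) assumes scikit-learn is installed and uses a toy 2-D dataset.

# Hedged sketch: linear, large-margin classifier via scikit-learn (assumed
# available). C trades off margin size against training errors, which is how
# the non-separable case of slide 23 is handled in practice.
from sklearn.svm import SVC

X = [[0.0, 0.0], [0.2, 0.3], [1.0, 1.0], [1.2, 0.9]]
y = [-1, -1, +1, +1]

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print(clf.support_vectors_)           # the points that determine the margin
print(clf.predict([[0.9, 1.1]]))      # expected: [1]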
Support Vector Machine (SVM)
Text classification
Hand-writing recognition
Computational biology (e.g., micro-array data)
Face detection
Face expression recognition
Time series prediction

From Gert Lanckriet, Statistical Learning Theory Tutorial

26
Non Linear problem

27
Non Linear problem

28
Non Linear problem
Kernel methods
A family of non-linear algorithms
Transform the non-linear problem into a linear one (in a
different feature space)
Use linear algorithms to solve the linear problem in
the new space

From Gert Lanckriet, Statistical Learning Theory Tutorial

29
Main intuition of Kernel methods
(Copy here from black board)

30
Basic principle kernel methods
Φ : R^d → R^D   (D >> d)

Example: map X = [x z] to Φ(X) = [x^2 z^2 xz]
Linear classifier in the new space: w^T Φ(x) + b = 0
i.e. f(x) = sign(w1 x^2 + w2 z^2 + w3 xz + b)
From Gert Lanckriet, Statistical Learning Theory Tutorial

31
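A small self-contained sketch of this idea (added for illustration): the explicit quadratic map from the slide, with a plain perceptron as the linear learner in the mapped space; the toy data and the extra constant feature are assumptions.

# Explicit feature map from the slide, [x, z] -> [x^2, z^2, x*z], with an
# extra constant 1.0 so the linear learner can also fit the bias b.
def phi(point):
    x, z = point
    return [x * x, z * z, x * z, 1.0]

def predict(x, w):
    return +1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1

def train(data, epochs=20):              # plain perceptron in the mapped space
    w = [0.0] * len(data[0][0])
    for _ in range(epochs):
        for x, y in data:
            if predict(x, w) != y:
                w = [wi + y * xi for wi, xi in zip(w, x)]
    return w

# "Close to the origin" (-1) vs. "far from the origin" (+1): the positives
# surround the negatives, so no line separates them in (x, z), but the
# mapped points are linearly separable.
raw = [([0.1, 0.1], -1), ([-0.2, 0.1], -1),
       ([2.0, 0.0], +1), ([-2.0, 0.0], +1), ([0.0, 2.0], +1), ([0.0, -2.0], +1)]
mapped = [(phi(p), y) for p, y in raw]
w = train(mapped)
print(predict(phi([1.8, 0.1]), w))       # expected: +1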
Basic principle kernel methods
Linear separability : more likely in high dimensions
Mapping: Φ maps input into high-dimensional
feature space
Classifier: construct a linear classifier in the high-dimensional feature space
Motivation: appropriate choice of Φ leads to linear
separability
We can do this efficiently!

From Gert Lanckriet, Statistical Learning Theory Tutorial

32
Basic principle kernel methods
We can use the linear algorithms seen before
(Perceptron, SVM) for classification in the higher
dimensional space

33
Multi-class classification
Given: some data items that belong to one of M
possible classes
Task: Train the classifier and predict the class for a
new data item
Geometrically: a harder problem, with no simple
separating geometry

34
Multi-class classification

35
Multi-class classification: Examples
Author identification
Language identification
Text categorization (topics)

36
(Some) Algorithms for Multi-class
classification
Linear
Parallel class separators: Decision Trees
Non parallel class separators: Naïve Bayes

Non Linear
K-nearest neighbors

37
Linear, parallel class separators
(ex: Decision Trees)

38
Linear, NON parallel class separators
(ex: Naïve Bayes)

39
Non Linear (ex: k Nearest Neighbor)

40
Decision Trees
Decision tree is a classifier in the form of a tree
structure, where each node is either:
Leaf node - indicates the value of the target attribute (class)
of examples, or
Decision node - specifies some test to be carried out on a
single attribute-value, with one branch and sub-tree for each
possible outcome of the test.

A decision tree can be used to classify an example
by starting at the root of the tree and moving through
it until a leaf node, which provides the classification of
the instance.
http://dms.irb.hr/tutorial/tut_dtrees.php

41
Training Examples
Goal: learn when we can play Tennis and when we cannot
Day  Outlook   Temp.  Humidity  Wind    Play Tennis
D1   Sunny     Hot    High      Weak    No
D2   Sunny     Hot    High      Strong  No
D3   Overcast  Hot    High      Weak    Yes
D4   Rain      Mild   High      Weak    Yes
D5   Rain      Cool   Normal    Weak    Yes
D6   Rain      Cool   Normal    Strong  No
D7   Overcast  Cool   Normal    Weak    Yes
D8   Sunny     Mild   High      Weak    No
D9   Sunny     Cool   Normal    Weak    Yes
D10  Rain      Mild   Normal    Strong  Yes
D11  Sunny     Mild   Normal    Strong  Yes
D12  Overcast  Mild   High      Strong  Yes
D13  Overcast  Hot    Normal    Weak    Yes
D14  Rain      Mild   High      Strong  No
42
Decision Tree for PlayTennis

Outlook
  Sunny → Humidity
    High → No
    Normal → Yes
  Overcast → Yes
  Rain → Wind
    Strong → No
    Weak → Yes

www.math.tau.ac.il/~nin/Courses/ML04/DecisionTreesCLS.pp
43
Decision Tree for PlayTennis

[Same tree as the previous slide, annotated:]
Each internal node tests an attribute
Each branch corresponds to an attribute value
Each leaf node assigns a classification

www.math.tau.ac.il/~nin/Courses/ML04/DecisionTreesCLS.pp
44
Decision Tree for PlayTennis

New example:  Outlook = Sunny, Temperature = Hot, Humidity = High, Wind = Weak
PlayTennis?  No  (Outlook = Sunny, then Humidity = High, leads to the leaf No)

Outlook
  Sunny → Humidity
    High → No
    Normal → Yes
  Overcast → Yes
  Rain → Wind
    Strong → No
    Weak → Yes

www.math.tau.ac.il/~nin/Courses/ML04/DecisionTreesCLS.pp
45
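As a sketch of how such a tree classifies an instance, the PlayTennis tree above can be written as nested ifs (the encoding below is one possible representation added to these notes, not from the slides).

# The PlayTennis tree from the slide as nested ifs. Temperature is not
# tested by this tree, so it does not appear as a parameter.
def play_tennis(outlook, humidity, wind):
    if outlook == "Sunny":
        return "Yes" if humidity == "Normal" else "No"
    if outlook == "Overcast":
        return "Yes"
    if outlook == "Rain":
        return "Yes" if wind == "Weak" else "No"
    raise ValueError("unknown Outlook value: " + outlook)

# The new example from the slide: Sunny / Hot / High / Weak
print(play_tennis("Sunny", "High", "Weak"))   # expected: No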
Decision Tree for Reuter classification

Foundations of Statistical Natural Language Processing,
Manning and Schuetze

46
Decision Tree for Reuter classification

Foundations of Statistical Natural Language Processing,
Manning and Schuetze

47
Building Decision Trees
Given training data, how do we construct them?
The central focus of the decision tree growing
algorithm is selecting which attribute to test at each
node in the tree. The goal is to select the attribute
that is most useful for classifying examples.
Top-down, greedy search through the space of
possible decision trees.
That is, it picks the best attribute and never looks back to
reconsider earlier choices.

48
Building Decision Trees
Splitting criterion
Finding the features and the values to split on
– for example, why test first “cts” and not “vs”?
– Why test on “cts < 2” and not “cts < 5” ?

Split that gives us the maximum information gain (or the
maximum reduction of uncertainty)

Stopping criterion
When all the elements at one node have the same class,
no need to split further

In practice, one first builds a large tree and then one prunes it
back (to avoid overfitting)
See Foundations of Statistical Natural Language Processing ,
Manning and Schuetze for a good introduction

49
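To make the splitting criterion concrete, here is a small sketch (added for illustration) that computes the information gain of an attribute on a labelled dataset; the entropy-based definition matches the "maximum reduction of uncertainty" above, while the row format and attribute names are assumptions.

import math
from collections import Counter

def entropy(labels):
    # H = -sum p * log2(p) over the class distribution of `labels`
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(rows, attr, target="PlayTennis"):
    # gain = entropy(parent) - weighted entropy of the children after the split
    parent = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return parent - remainder

# Tiny usage with two of the rows from slide 42:
rows = [
    {"Outlook": "Sunny",    "Wind": "Weak", "PlayTennis": "No"},
    {"Outlook": "Overcast", "Wind": "Weak", "PlayTennis": "Yes"},
]
print(information_gain(rows, "Outlook"))   # 1.0: Outlook separates these rows perfectly
print(information_gain(rows, "Wind"))      # 0.0: Wind gives no information here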
Decision Trees: Strengths
Decision trees are able to generate understandable
rules.
Decision trees perform classification without requiring
much computation.
Decision trees are able to handle both continuous
and categorical variables.
Decision trees provide a clear indication of which
features are most important for prediction or
classification.

http://dms.irb.hr/tutorial/tut_dtrees.php

50
Decision Trees: weaknesses
Decision trees are prone to errors in classification
problems with many classes and relatively few
training examples.
Decision trees can be computationally expensive to
train.
Need to compare all possible splits
Pruning is also expensive

Most decision-tree algorithms only examine a single
field at a time. This leads to rectangular classification
boxes that may not correspond well with the actual
distribution of records in the decision space.

http://dms.irb.hr/tutorial/tut_dtrees.php

51
Decision Trees
Decision Trees in Weka

52
Naïve Bayes
More powerful than Decision Trees

[Figure: decision boundaries of Decision Trees vs. Naïve Bayes]

53
Naïve Bayes Models
Graphical Models:
graph theory plus
probability theory
Nodes are variables
Edges are conditional
probabilities

[Figure: node A with edges to B and C, annotated P(A), P(B|A), P(C|A)]
54
Naïve Bayes Models
Graphical Models:
graph theory plus
probability theory
Nodes are variables
Edges are conditional
probabilities
Absence of an edge
between nodes implies
independence between
the variables of the
nodes

[Figure: node A with edges to B and C, annotated P(A), P(B|A), P(C|A); there is no edge between B and C, so P(C|A,B) = P(C|A)]
55
Naïve Bayes for text classification

Foundations of Statistical Natural Language Processing,
Manning and Schuetze

56
Naïve Bayes for text classification

[Figure: graphical model for the topic "earn" with child word nodes Shr, 34, cts, vs, per, shr]
57
Naïve Bayes for text classification
[Figure: Topic node with child word nodes w1, w2, w3, w4, …, wn-1, wn]

The words depend on the topic: P(wi| Topic)
P(cts|earn) > P(tennis| earn)

Naïve Bayes assumption: all words are independent given the topic
From the training set we learn the probabilities P(wi | Topic) for each word
and for each topic
58
Naïve Bayes for text classification
[Figure: Topic node with child word nodes w1, w2, …, wn]

To classify a new example:
Calculate P(Topic | w1, w2, … wn) for each topic
Bayes decision rule:
Choose the topic T’ for which
P(T’ | w1, w2, … wn) > P(T | w1, w2, … wn) for each T≠ T’
59
Naïve Bayes: Math
Naïve Bayes defines a joint probability distribution:
P(Topic , w1, w2, … wn) = P(Topic)∏ P(wi| Topic)
We learn P(Topic) and P(wi| Topic) in training
Test: we need P(Topic | w1, w2, … wn)
P(Topic | w1, w2, … wn) = P(Topic , w1, w2, … wn) / P(w1, w2, … wn)

60
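A sketch of this computation in code (an illustration added to these notes): the training step estimates P(Topic) and P(wi | Topic) by counting, and classification compares log P(Topic) + sum of log P(wi | Topic) across topics; the add-one smoothing for unseen words is an assumption the slides do not discuss.

import math
from collections import Counter, defaultdict

# docs: list of (topic, list_of_words)
def train_nb(docs):
    topic_counts = Counter(topic for topic, _ in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for topic, words in docs:
        word_counts[topic].update(words)
        vocab.update(words)
    priors = {t: c / len(docs) for t, c in topic_counts.items()}   # P(Topic)
    return priors, word_counts, vocab

def classify_nb(words, priors, word_counts, vocab):
    def log_posterior(topic):
        # log P(Topic) + sum_i log P(wi | Topic), with add-one smoothing
        total = sum(word_counts[topic].values()) + len(vocab)
        score = math.log(priors[topic])
        for w in words:
            score += math.log((word_counts[topic][w] + 1) / total)
        return score
    return max(priors, key=log_posterior)     # Bayes decision rule

docs = [("earn", ["shr", "cts", "vs"]), ("sports", ["tennis", "match"])]
priors, counts, vocab = train_nb(docs)
print(classify_nb(["cts", "shr"], priors, counts, vocab))   # expected: earn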
Naïve Bayes: Strengths
Very simple model
Easy to understand
Very easy to implement

Very efficient, fast training and classification
Modest space storage
Widely used because it works really well for text
categorization
Linear, but non parallel decision boundaries

61
Naïve Bayes: weaknesses
Naïve Bayes independence assumption has two consequences:
The linear ordering of words is ignored (bag of words model)
The words are independent of each other given the class: False
– President is more likely to occur in a context that contains election than
in a context that contains poet

Naïve Bayes assumption is inappropriate if there are strong
conditional dependencies between the variables
(But even if the model is not “right”, Naïve Bayes models do well
in a surprisingly large number of cases because often we are
interested in classification accuracy and not in accurate
probability estimations)

62
Naïve Bayes
Naïve Bayes in Weka

63
k Nearest Neighbor Classification
Nearest Neighbor classification rule: to classify a new
object, find the object in the training set that is most
similar. Then assign the category of this nearest
neighbor
K Nearest Neighbor (KNN): consult k nearest
neighbors. Decision based on the majority category
of these neighbors. More robust than k = 1
An example of a similarity measure often used in NLP is cosine
similarity

64
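A sketch of k-NN with cosine similarity (added for illustration; vectors are plain Python lists, and ties or zero-length vectors are not handled).

import math
from collections import Counter

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms

# training: list of (vector, label). Classify x by majority vote among the
# k training vectors most similar to it.
def knn_classify(x, training, k=3):
    neighbors = sorted(training, key=lambda item: cosine(x, item[0]), reverse=True)[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

training = [([1.0, 0.0], "A"), ([0.9, 0.1], "A"), ([0.0, 1.0], "B")]
print(knn_classify([0.8, 0.2], training, k=3))   # expected: A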
1-Nearest Neighbor

65
1-Nearest Neighbor

66
3-Nearest Neighbor

67
3-Nearest Neighbor
But this one is closer...
We can weight neighbors
according to their similarity

Assign the category of the majority of the neighbors
68
k Nearest Neighbor Classification
Strengths
Robust
Conceptually simple
Often works well
Powerful (arbitrary decision boundaries)

Weaknesses
Performance is very dependent on the similarity measure
used (and to a lesser extent on the number of neighbors k
used)
Finding a good similarity measure can be difficult
Computationally expensive
69
Summary
Algorithms for Classification
Linear versus non linear classification
Binary classification
Perceptron
Winnow
Support Vector Machines (SVM)
Kernel Methods

Multi-Class classification
Decision Trees
Naïve Bayes
K nearest neighbor

On Wednesday: Weka

70
