Machine Learning for Language Technology
Lecture 8: Decision Trees and k-Nearest Neighbors
Marina Santini
Department of Linguistics and Philology
Uppsala University, Uppsala, Sweden
Autumn 2014
Acknowledgement: Thanks to Prof. Joakim Nivre for course design and materials
Supervised Classification
• Divide instances into (two or more) classes
  – Instance (feature vector): x = (x_1, ..., x_m)
    • Features may be categorical or numerical
  – Class (label): y
  – Training data: X = \{x^t, y^t\}_{t=1}^{N}
• Classification in Language Technology
  – Spam filtering (spam vs. non-spam)
  – Spelling error detection (error vs. no error)
  – Text categorization (news, economy, culture, sport, ...)
  – Named entity classification (person, location, organization, ...)
Models for Classification
• Generative probabilistic models:
  – Model of P(x, y)
  – Naive Bayes
• Conditional probabilistic models:
  – Model of P(y | x)
  – Logistic regression
• Discriminative models:
  – No explicit probability model
  – Decision trees, nearest neighbor classification
  – Perceptron, support vector machines, MIRA
Repeating…
• Noise
  – Data cleaning is expensive and time consuming
• Margin
• Inductive bias
Types of inductive biases:
• Minimum cross-validation error
• Maximum margin
• Minimum description length
• [...]
DECISION TREES
Decision Trees
• Hierarchical tree structure for classification
  – Each internal node specifies a test of some feature
  – Each branch corresponds to a value for the tested feature
  – Each leaf node provides a classification for the instance
• Represents a disjunction of conjunctions of constraints
  – Each path from root to leaf specifies a conjunction of tests
  – The tree itself represents the disjunction of all paths
Decision Tree
(example decision tree figure)
Divide and Conquer
• Internal decision nodes
  – Univariate: uses a single attribute, x_i
    • Numeric x_i: binary split: x_i > w_m
    • Discrete x_i: n-way split for n possible values
  – Multivariate: uses all attributes, x
• Leaves
  – Classification: class labels (or proportions)
  – Regression: average r value (or local fit)
• Learning:
  – Greedy recursive algorithm (see the sketch below)
  – Find the best split X = (X_1, ..., X_p), then induce a tree for each X_i
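As a minimal sketch (in Python, not from the original slides), the greedy recursive scheme for categorical features can be written as follows: at each node the attribute whose n-way split gives the lowest weighted entropy is chosen, and a subtree is induced for each of its values.

from collections import Counter
import math

def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def grow_tree(X, y, features):
    """Greedy recursive induction for categorical features (ID3-style sketch).
    X: list of tuples, y: list of labels, features: list of feature indices."""
    # Stop at a pure node (or when no features remain): return the majority class.
    if len(set(y)) == 1 or not features:
        return Counter(y).most_common(1)[0][0]
    # Find the best split: the feature whose n-way split has the lowest weighted impurity.
    def split_impurity(f):
        branches = {}
        for xi, yi in zip(X, y):
            branches.setdefault(xi[f], []).append(yi)
        return sum(len(b) / len(y) * entropy(b) for b in branches.values())
    best = min(features, key=split_impurity)
    rest = [f for f in features if f != best]
    # Induce one subtree per observed value of the chosen feature.
    node = {"feature": best, "branches": {}}
    for value in {xi[best] for xi in X}:
        sub = [(xi, yi) for xi, yi in zip(X, y) if xi[best] == value]
        node["branches"][value] = grow_tree([s[0] for s in sub], [s[1] for s in sub], rest)
    return node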
Classification Trees (ID3, CART, C4.5)
• For node m, N_m instances reach m, and N_m^i of them belong to class C_i:
  \hat{P}(C_i \mid x, m) \equiv p_m^i = N_m^i / N_m
• Node m is pure if p_m^i is 0 or 1 for all i
• The measure of impurity is entropy:
  I_m = -\sum_{i=1}^{K} p_m^i \log_2 p_m^i
Example: Entropy
• Assume two classes (C1, C2) and four instances (x1, x2, x3, x4)
• Case 1:
  – C1 = {x1, x2, x3, x4}, C2 = { }
  – I_m = –(1 log2 1 + 0 log2 0) = 0
• Case 2:
  – C1 = {x1, x2, x3}, C2 = {x4}
  – I_m = –(0.75 log2 0.75 + 0.25 log2 0.25) ≈ 0.81
• Case 3:
  – C1 = {x1, x2}, C2 = {x3, x4}
  – I_m = –(0.5 log2 0.5 + 0.5 log2 0.5) = 1
(The short computation below reproduces these three values.)
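A few lines of Python check the three cases (a sketch, using the standard convention 0·log 0 = 0):

import math

def entropy(probs):
    """Entropy of a class distribution, with the convention 0*log(0) = 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([1.0, 0.0]))    # Case 1: 0.0
print(entropy([0.75, 0.25]))  # Case 2: ~0.81
print(entropy([0.5, 0.5]))    # Case 3: 1.0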
Best Split
• If node m is pure, generate a leaf and stop; otherwise split with test t and continue recursively
• Find the test that minimizes impurity
• Impurity after a split with test t:
  – N_{mj} of the N_m instances take branch j
  – N_{mj}^i of them belong to class C_i:
    \hat{P}(C_i \mid x, m, j) \equiv p_{mj}^i = N_{mj}^i / N_{mj}
  – I_m^t = -\sum_{j=1}^{n} \frac{N_{mj}}{N_m} \sum_{i=1}^{K} p_{mj}^i \log_2 p_{mj}^i
Information Gain
• We want to determine which attribute in a given set of training feature vectors is most useful for discriminating between the classes to be learned.
• Information gain tells us how important a given attribute of the feature vectors is.
• We will use it to decide the ordering of attributes in the nodes of a decision tree.
Information Gain and Gain Ratio
• Choosing the test that minimizes impurity maximizes the information gain (IG):
  IG_m^t = I_m - I_m^t
  where I_m = -\sum_{i=1}^{K} p_m^i \log_2 p_m^i and I_m^t = -\sum_{j=1}^{n} \frac{N_{mj}}{N_m} \sum_{i=1}^{K} p_{mj}^i \log_2 p_{mj}^i
• Information gain prefers features with many values
• The normalized version is called the gain ratio (GR):
  GR_m^t = IG_m^t / V_m^t, with V_m^t = -\sum_{j=1}^{n} \frac{N_{mj}}{N_m} \log_2 \frac{N_{mj}}{N_m}
(A small numeric sketch follows below.)
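The toy data below is hypothetical (not from the slides); the sketch computes both quantities and shows why the gain ratio penalizes a feature that takes a distinct value for every instance, even though such a feature has maximal information gain.

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def ig_and_gr(values, labels):
    """Information gain and gain ratio of an n-way split of `labels` by `values`."""
    n = len(labels)
    branches = {}
    for v, y in zip(values, labels):
        branches.setdefault(v, []).append(y)
    impurity_after = sum(len(b) / n * entropy(b) for b in branches.values())
    ig = entropy(labels) - impurity_after
    split_info = -sum(len(b) / n * math.log2(len(b) / n) for b in branches.values())
    return ig, (ig / split_info if split_info > 0 else 0.0)

labels = ["A", "A", "B", "B"]
print(ig_and_gr(["x", "x", "y", "y"], labels))  # two values:  IG = 1.0, GR = 1.0
print(ig_and_gr(["p", "q", "r", "s"], labels))  # four values: IG = 1.0, GR = 0.5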
Pruning Trees
• Decision trees are susceptible to overfitting
• Remove subtrees for better generalization:
  – Prepruning: early stopping (e.g., with an entropy threshold)
  – Postpruning: grow the whole tree, then prune subtrees
• Prepruning is faster; postpruning is more accurate (requires a separate validation set)
Rule Extraction from Trees
C4.5Rules (Quinlan, 1993)
Learning Rules
• Rule induction is similar to tree induction, but
  – tree induction is breadth-first
  – rule induction is depth-first (one rule at a time)
• Rule learning:
  – A rule is a conjunction of terms (cf. a tree path)
  – A rule covers an example if all terms of the rule evaluate to true for the example (cf. a sequence of tests)
  – Sequential covering: generate rules one at a time until all positive examples are covered (see the sketch below)
  – IREP (Fürnkranz and Widmer, 1994), Ripper (Cohen, 1995)
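The sketch below is a simplified illustration of sequential covering over categorical features, with hypothetical helper names; it is not IREP or Ripper themselves. Each rule is grown one test at a time until it covers no negative examples, and new rules are added until all positive examples are covered.

def covers(rule, x):
    """A rule is a conjunction of (feature_index, value) tests."""
    return all(x[i] == v for i, v in rule)

def learn_one_rule(pos, neg, n_features):
    """Greedily add the test with the best precision until no negatives are covered."""
    rule = []
    while any(covers(rule, x) for x in neg):
        candidates = []
        for i in range(n_features):
            if any(j == i for j, _ in rule):
                continue  # each feature is tested at most once
            for v in {x[i] for x in pos if covers(rule, x)}:
                cand = rule + [(i, v)]
                p = sum(covers(cand, x) for x in pos)
                n = sum(covers(cand, x) for x in neg)
                candidates.append((p / (p + n), cand))
        if not candidates:
            break
        rule = max(candidates)[1]
    return rule

def sequential_covering(pos, neg, n_features):
    """Generate rules one at a time until all positive examples are covered."""
    rules, remaining = [], list(pos)
    while remaining:
        rule = learn_one_rule(remaining, neg, n_features)
        covered = [x for x in remaining if covers(rule, x)]
        if not covered:
            break
        rules.append(rule)
        remaining = [x for x in remaining if not covers(rule, x)]
    return rules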
Properties of Decision Trees
• Decision trees are appropriate for classification when:
  – Features can be both categorical and numeric
  – Disjunctive descriptions may be required
  – Training data may be noisy (missing values, incorrect labels)
  – Interpretation of the learned model is important (rules)
• Inductive bias of (most) decision tree learners:
  1. Prefers trees with informative attributes close to the root
  2. Prefers smaller trees over bigger ones (with pruning)
  3. Preference bias (incomplete search of the complete space)
K-NEAREST NEIGHBORS
Nearest Neighbor Classification
• An old idea
• Key components:
  – Storage of old instances
  – Similarity-based reasoning to new instances

This "rule of nearest neighbor" has considerable elementary intuitive appeal and probably corresponds to practice in many situations. For example, it is possible that much medical diagnosis is influenced by the doctor's recollection of the subsequent history of an earlier patient whose symptoms resemble in some way those of the current patient. (Fix and Hodges, 1952)
k-Nearest Neighbour
• Learning:
  – Store the training instances in memory
• Classification:
  – Given a new test instance x,
    • Compare it to all stored instances
    • Compute a distance between x and each stored instance x^t
    • Keep track of the k closest instances (nearest neighbors)
  – Assign to x the majority class of the k nearest neighbours (see the sketch below)
• A geometric view of learning
  – Proximity in (feature) space → same class
  – The smoothness assumption
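As a sketch, the whole classification step fits in a few lines of Python, assuming some distance function is supplied:

from collections import Counter

def knn_classify(x, training, k, distance):
    """Classify x by the majority class among its k nearest stored instances.
    `training` is a list of (instance, label) pairs; `distance` is any metric."""
    neighbours = sorted(training, key=lambda pair: distance(x, pair[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]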
Eager and Lazy Learning
• Eager learning (e.g., decision trees)
  – Learning: induce an abstract model from the data
  – Classification: apply the model to new data
• Lazy learning (a.k.a. memory-based learning)
  – Learning: store the data in memory
  – Classification: compare new data to the data in memory
  – Properties:
    • Retains all the information in the training set (no abstraction)
    • Complex hypothesis space: suitable for natural language?
    • Main drawback: classification can be very inefficient
Dimensions of a k-NN Classifier
• Distance metric
  – How do we measure distance between instances?
  – Determines the layout of the instance space
• The k parameter
  – How large a neighborhood should we consider?
  – Determines the complexity of the hypothesis space
Distance Metric 1
• Overlap = count of mismatching features (see the sketch below)
  \Delta(x, z) = \sum_{i=1}^{m} \delta(x_i, z_i)
  \delta(x_i, z_i) = \begin{cases} \dfrac{|x_i - z_i|}{\max_i - \min_i} & \text{if numeric} \\ 0 & \text{else, if } x_i = z_i \\ 1 & \text{else, if } x_i \neq z_i \end{cases}
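A sketch of the metric in Python; the `ranges` argument (an assumption of this sketch, not part of the slides) holds the per-feature (min, max) pair used to normalize numeric differences:

def delta(xi, zi, lo=None, hi=None):
    """Per-feature difference: range-normalized absolute difference for numeric
    features, 0/1 mismatch (overlap) for categorical ones."""
    if isinstance(xi, (int, float)) and isinstance(zi, (int, float)):
        if lo is not None and hi is not None and hi > lo:
            return abs(xi - zi) / (hi - lo)
        return abs(xi - zi)
    return 0 if xi == zi else 1

def overlap_distance(x, z, ranges=None):
    """Delta(x, z): sum of per-feature differences."""
    ranges = ranges or [(None, None)] * len(x)
    return sum(delta(xi, zi, lo, hi)
               for (xi, zi), (lo, hi) in zip(zip(x, z), ranges))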
Distance Metric 2
• MVDM = Modified Value Difference Metric (see the sketch below)
  \Delta(x, z) = \sum_{i=1}^{m} \delta(x_i, z_i)
  \delta(x_i, z_i) = \sum_{j=1}^{K} |P(C_j \mid x_i) - P(C_j \mid z_i)|
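One way the per-value class probabilities could be estimated from training counts and plugged into the metric is sketched below; the helper names are illustrative, and relative-frequency estimation of P(C_j | v) is an assumption of this sketch.

from collections import Counter, defaultdict

def value_class_probs(values, labels):
    """Estimate P(C_j | v) for each observed value v of one feature."""
    counts = defaultdict(Counter)
    for v, y in zip(values, labels):
        counts[v][y] += 1
    classes = sorted(set(labels))
    return {v: {c: counts[v][c] / sum(counts[v].values()) for c in classes}
            for v in counts}

def mvdm_delta(xi, zi, probs):
    """delta(x_i, z_i) = sum_j |P(C_j | x_i) - P(C_j | z_i)|."""
    return sum(abs(probs[xi][c] - probs[zi][c]) for c in probs[xi])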
The k Parameter
• Tunes the complexity of the hypothesis space
  – If k = 1, every instance has its own neighborhood
  – If k = N, the whole feature space is one neighborhood
• k can be chosen by minimizing the error on a validation set V:
  \hat{E} = E(h \mid V) = \sum_{t=1}^{M} 1\left(h(x^t) \neq r^t\right)
(figure: k-NN decision boundaries for k = 1 and k = 15)
A Simple Example
Distance: overlap, \Delta(x, z) = \sum_{i=1}^{m} \delta(x_i, z_i)
Training set:
1. (a, b, a, c) → A
2. (a, b, c, a) → B
3. (b, a, c, c) → C
4. (c, a, b, c) → A
New instance:
5. (a, b, b, a)
Distances (overlap):
Δ(1, 5) = 2
Δ(2, 5) = 1
Δ(3, 5) = 4
Δ(4, 5) = 3
k-NN classification:
1-NN(5) = B
2-NN(5) = A/B
3-NN(5) = A
4-NN(5) = A
(The code below reproduces these distances and votes.)
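A few lines of Python reproduce the distances and votes of this example (a sketch; the 2-NN tie between A and B is left visible rather than broken):

from collections import Counter

train = [(("a", "b", "a", "c"), "A"), (("a", "b", "c", "a"), "B"),
         (("b", "a", "c", "c"), "C"), (("c", "a", "b", "c"), "A")]
new = ("a", "b", "b", "a")

def overlap(x, z):
    """Number of mismatching (categorical) features."""
    return sum(xi != zi for xi, zi in zip(x, z))

dists = sorted((overlap(new, x), y) for x, y in train)
print(dists)  # [(1, 'B'), (2, 'A'), (3, 'A'), (4, 'C')]
for k in range(1, 5):
    votes = Counter(label for _, label in dists[:k])
    print(k, votes.most_common())  # 1-NN: B; 2-NN: A/B tie; 3-NN and 4-NN: A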
Further Variations on k-NN
• Feature weights:
  – The overlap metric gives all features equal weight
  – Features can be weighted by IG or GR
• Weighted voting:
  – The normal decision rule gives all neighbors equal weight
  – Instances can be weighted by (inverse) distance (see the sketch below)
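A sketch of distance-weighted voting: each neighbor votes with weight 1/distance, and the small epsilon (an assumption of this sketch) guards against division by zero when a stored instance matches the query exactly.

from collections import defaultdict

def weighted_knn_classify(x, training, k, distance, eps=1e-9):
    """k-NN with inverse-distance-weighted voting instead of a plain majority vote."""
    neighbours = sorted(((distance(x, inst), label) for inst, label in training))[:k]
    scores = defaultdict(float)
    for d, label in neighbours:
        scores[label] += 1.0 / (d + eps)
    return max(scores, key=scores.get)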
Properties of k-NN
• Nearest neighbor classification is appropriate when:
  – Features can be both categorical and numeric
  – Disjunctive descriptions may be required
  – Training data may be noisy (missing values, incorrect labels)
  – Fast classification is not crucial
• Inductive bias of k-NN:
  1. Nearby instances should have the same label (smoothness assumption)
  2. All features are equally important (without feature weights)
  3. Complexity is tuned by the k parameter
End of Lecture 8