Introduction to Pattern Analysis
Ricardo Gutierrez-Osuna
Texas A&M University
LECTURE 7: Kernel Density Estimation
g Non-parametric Density Estimation
g Histograms
g Parzen Windows
g Smooth Kernels
g Product Kernel Density Estimation
g The Naïve Bayes Classifier
Non-parametric density estimation
g In the previous two lectures we have assumed that either
n The likelihoods p(x|ωi) were known (Likelihood Ratio Test) or
n At least the parametric form of the likelihoods was known (Parameter
Estimation)
g The methods that will be presented in the next two lectures do not
afford such luxuries
n Instead, they attempt to estimate the density directly from the data without
making any parametric assumptions about the underlying distribution
n Sounds challenging? You bet!
[Figure: scatter plot of labeled training data in the (x1, x2) plane and the corresponding non-parametric estimate of the density P(x1, x2|ωi)]
The histogram
g The simplest form of non-parametric D.E. is the familiar histogram
n Divide the sample space into a number of bins and approximate the density at the center of
each bin by the fraction of points in the training data that fall into the corresponding bin
g The histogram requires two “parameters” to be defined: bin width and starting position of the first bin
g The histogram is a very simple form of D.E., but it has several drawbacks
n The final shape of the density estimate depends on the starting position of the bins
g For multivariate data, the final shape of the density is also affected by the orientation of the bins
n The discontinuities of the estimate are not due to the underlying density; they are only an
artifact of the chosen bin locations
g These discontinuities make it very difficult, without experience, to grasp the structure of the data
n A much more serious problem is the curse of dimensionality, since the number of bins grows
exponentially with the number of dimensions
g In high dimensions we would require a very large number of examples or else most of the bins would
be empty
g All these drawbacks make the histogram unsuitable for most practical
applications except for rapid visualization of results in one or two dimensions
n Therefore, we will not spend more time looking at the histogram
$$P_H(x) = \frac{1}{N}\,\frac{\left[\text{number of } x^{(k} \text{ in the same bin as } x\right]}{\left[\text{width of the bin containing } x\right]}$$
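As a concrete illustration (not part of the original slides), the following minimal Python sketch computes the histogram estimate above for a small made-up one-dimensional data set; note how the value at the same point changes when the starting position of the first bin is shifted, which is the first drawback listed above.

```python
import numpy as np

def histogram_density(x, data, bin_width, start):
    """Histogram density estimate P_H(x): fraction of training points in the
    bin that contains x, divided by the bin width."""
    data = np.asarray(data, dtype=float)
    bin_idx = int(np.floor((x - start) / bin_width))   # bin containing x
    lo = start + bin_idx * bin_width
    hi = lo + bin_width
    k = np.sum((data >= lo) & (data < hi))             # points in the same bin as x
    return k / (len(data) * bin_width)

# Made-up 1-D data set; shifting the first bin changes the estimate at x=5
data = [4, 5, 5, 6, 12, 14, 15, 15, 16, 17]
print(histogram_density(5.0, data, bin_width=4.0, start=0.0))  # bin [4,8): 4/40 = 0.100
print(histogram_density(5.0, data, bin_width=4.0, start=1.0))  # bin [5,9): 3/40 = 0.075
```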
Non-parametric density estimation, general formulation (1)
g Let us return to the basic definition of probability to get a solid idea of
what we are trying to accomplish
n The probability that a vector x, drawn from a distribution p(x), will fall in a given
region ℜ of the sample space is
n Suppose now that N vectors {x(1, x(2, …, x(N} are drawn from the distribution. The
probability that k of these N vectors fall in ℜ is given by the binomial distribution
n It can be shown (from the properties of the binomial p.m.f.) that the mean and
variance of the ratio k/N are
n Therefore, as N→∞, the distribution becomes sharper (the variance gets
smaller) so we can expect that a good estimate of the probability P can be
obtained from the mean fraction of the points that fall within ℜ
$$P = \int_{\Re} p(x')\,dx'$$

$$P(k) = \binom{N}{k}\, P^{k}\,(1-P)^{N-k}$$

$$E\!\left[\frac{k}{N}\right] = P \qquad \text{and} \qquad \mathrm{Var}\!\left[\frac{k}{N}\right] = E\!\left[\left(\frac{k}{N}-P\right)^{2}\right] = \frac{P\,(1-P)}{N}$$

$$P \cong \frac{k}{N}$$
From [Bishop, 1995]
Non-parametric density estimation, general formulation (2)
n On the other hand, if we assume that ℜ is so small that p(x) does not vary
appreciably within it, then
g where V is the volume enclosed by region ℜ
n Merging with the previous result we obtain
n This estimate becomes more accurate as we increase the number of sample
points N and shrink the volume V
g In practice the value of N (the total number of examples) is fixed
n In order to improve the accuracy of the estimate p(x) we could let V approach
zero, but the region ℜ would then become so small that it would enclose no
examples
n This means that, in practice, we will have to find a compromise value for the
volume V
g Large enough to include enough examples within ℜ
g Small enough to support the assumption that p(x) is constant within ℜ
$$P = \int_{\Re} p(x')\,dx' \cong p(x)\,V$$

$$\left.\begin{array}{l} P \cong p(x)\,V \\ P \cong k/N \end{array}\right\} \;\Rightarrow\; p(x) \cong \frac{k}{NV}$$
From [Bishop, 1995]
Non-parametric density estimation, general formulation (3)
g In conclusion, the general expression for non-parametric density
estimation becomes
g When applying this result to practical density estimation problems,
two basic approaches can be adopted
n We can choose a fixed value of the volume V and determine k from the data.
This leads to methods commonly referred to as Kernel Density Estimation
(KDE), which are the subject of this lecture
n We can choose a fixed value of k and determine the corresponding volume V
from the data. This gives rise to the k Nearest Neighbor (kNN) approach, which
will be covered in the next lecture
g It can be shown that both kNN and KDE converge to the true
probability density as N→∞, provided that V shrinks with N, and k
grows with N appropriately
$$p(x) \cong \frac{k}{NV} \qquad \text{where} \quad \begin{cases} V \text{ is the volume surrounding } x \\ N \text{ is the total number of examples} \\ k \text{ is the number of examples inside } V \end{cases}$$
From [Bishop, 1995]
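A quick numerical sanity check of this expression (an illustrative sketch with made-up parameters, not from the lecture): draw N points from a known univariate Gaussian, count how many fall in a small region of volume V around x, and compare k/(NV) with the true density.

```python
import numpy as np

rng = np.random.default_rng(0)

N = 100_000                        # number of samples drawn from p(x)
x = 1.0                            # estimation point
V = 0.2                            # "volume" (interval width) of the region R around x

samples = rng.standard_normal(N)   # true density is the standard Gaussian N(0,1)
k = np.sum(np.abs(samples - x) < V / 2)   # number of samples falling inside R

estimate = k / (N * V)
true_p = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
print(f"k/(NV) = {estimate:.4f}   true p(x) = {true_p:.4f}")   # both close to 0.242
```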
Parzen windows (1)
g Suppose that the region ℜ that encloses the
k examples is a hypercube with sides of
length h centered at the estimation point x
n Then its volume is given by V=hD, where D is the
number of dimensions
g To find the number of examples that fall
within this region we define a kernel
function K(u)
n This kernel, which corresponds to a unit
hypercube centered at the origin, is known as a
Parzen window or the naïve estimator
n The quantity K((x-x(n)/h) is then equal to unity if
the point x(n is inside a hypercube of side h
centered on x, and zero otherwise
$$K(u) = \begin{cases} 1 & \left|u_{j}\right| < 1/2, \;\; \forall\, j = 1,\dots,D \\ 0 & \text{otherwise} \end{cases}$$

[Figure: a hypercube of side h centered at the estimation point x]
From [Bishop, 1995]
Parzen windows (2)
g The total number of points inside the
hypercube is then
g Substituting back into the expression
for the density estimate
n Notice that the Parzen window density
estimate resembles the histogram, with
the exception that the bin locations are
determined by the data points
$$k = \sum_{n=1}^{N} K\!\left(\frac{x - x^{(n}}{h}\right)$$

$$p_{KDE}(x) = \frac{1}{N h^{D}} \sum_{n=1}^{N} K\!\left(\frac{x - x^{(n}}{h}\right)$$

[Figure: one-dimensional illustration with four points x(1 … x(4 and a window of volume V centered at the estimation point x: K(x-x(1)=K(x-x(2)=K(x-x(3)=1 and K(x-x(4)=0]
From [Bishop, 1995]
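The two expressions above can be turned into a short Python sketch (illustrative only; the data points and bandwidth are made up): a unit-hypercube kernel K(u) and the resulting Parzen window estimate p_KDE(x).

```python
import numpy as np

def parzen_kernel(u):
    """Unit hypercube kernel: 1 if every component satisfies |u_j| < 1/2, else 0."""
    u = np.atleast_2d(u)                               # shape (N, D)
    return np.all(np.abs(u) < 0.5, axis=1).astype(float)

def parzen_kde(x, data, h):
    """p_KDE(x) = 1/(N h^D) * sum_n K((x - x^(n)/h)."""
    data = np.atleast_2d(data)                         # N training points in D dimensions
    N, D = data.shape
    k = np.sum(parzen_kernel((x - data) / h))          # points inside the hypercube
    return k / (N * h**D)

# Made-up 2-D example: three of the four points fall inside the unit window at the origin
data = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [-0.1, 0.3]])
print(parzen_kde(np.array([0.0, 0.0]), data, h=1.0))   # 3/(4*1^2) = 0.75
```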
Parzen windows (3)
g To understand the role of the kernel function we compute the
expectation of the probability estimate p(x)
n where we have assumed that the vectors x(n are drawn independently from the
true density p(x)
g We can see that the expectation of the estimated density pKDE(x) is a
convolution of the true density p(x) with the kernel function
n The width h of the kernel plays the role of a smoothing parameter: the wider the
kernel function, the smoother the estimate pKDE(x)
g For h→0, the kernel approaches a Dirac delta function and pKDE(x)
approaches the true density
n However, in practice we have a finite number of points, so h cannot be made
arbitrarily small, since the density estimate pKDE(x) would then degenerate to a set
of impulses located at the training data points
$$E\!\left[p_{KDE}(x)\right] = \frac{1}{N h^{D}} \sum_{n=1}^{N} E\!\left[K\!\left(\frac{x - x^{(n}}{h}\right)\right] = \frac{1}{h^{D}}\, E\!\left[K\!\left(\frac{x - x^{(n}}{h}\right)\right] = \frac{1}{h^{D}} \int K\!\left(\frac{x - x'}{h}\right) p(x')\,dx'$$
From [Bishop, 1995]
Numeric exercise
g Given the dataset below, use Parzen windows to estimate the density
p(x) at y=3,10,15. Use a bandwidth of h=4
n X = {x(1, x(2,…x(N} = {4, 5, 5, 6, 12, 14, 15, 15, 16, 17}
g Solution
n Let’s first draw the dataset to get an idea of what numerical results we should
expect
n Let’s now estimate p(y=3):
n Similarly
$$p_{KDE}(y=3) = \frac{1}{Nh}\sum_{n=1}^{N} K\!\left(\frac{y - x^{(n}}{h}\right) = \frac{1}{10\cdot 4}\left[K\!\left(\tfrac{3-4}{4}\right) + K\!\left(\tfrac{3-5}{4}\right) + K\!\left(\tfrac{3-5}{4}\right) + K\!\left(\tfrac{3-6}{4}\right) + \dots + K\!\left(\tfrac{3-17}{4}\right)\right]$$

$$= \frac{1}{10\cdot 4}\left[1 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 0\right] = 0.025$$

[Figure: the ten data points plotted on the x axis, with the estimation points y=3, y=10 and y=15 marked]

$$p_{KDE}(y=10) = \frac{1}{10\cdot 4}\left[0 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 0\right] = 0$$

$$p_{KDE}(y=15) = \frac{1}{10\cdot 4}\left[0 + 0 + 0 + 0 + 0 + 1 + 1 + 1 + 1 + 0\right] = \frac{4}{40} = 0.1$$
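These numbers can be verified with a few lines of Python (a sketch that assumes the strict inequality |u| < 1/2 of the Parzen window, which reproduces the values above):

```python
import numpy as np

X = np.array([4, 5, 5, 6, 12, 14, 15, 15, 16, 17], dtype=float)
h, N = 4.0, len(X)

def K(u):
    # one-dimensional Parzen window: 1 if |u| < 1/2, 0 otherwise
    return (np.abs(u) < 0.5).astype(float)

for y in (3.0, 10.0, 15.0):
    p = np.sum(K((y - X) / h)) / (N * h)
    print(f"p_KDE(y={y:g}) = {p}")     # expected: 0.025, 0.0, 0.1
```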
Smooth kernels (1)
g The Parzen window has several drawbacks
n Yields density estimates that have discontinuities
n Weights all the points x(n equally, regardless of their distance to the estimation
point x
g It is easy to overcome some of these difficulties by generalizing the
Parzen window with a smooth kernel function K(u) which satisfies the
condition
n Usually, but not always, K(u) will be a radially symmetric and unimodal
probability density function, such as the multivariate Gaussian density function
n where the expression of the density estimate remains the same as with Parzen
windows
$$\int_{R^{D}} K(x)\,dx = 1$$

$$K(x) = \frac{1}{(2\pi)^{D/2}} \exp\!\left(-\tfrac{1}{2}\, x^{T} x\right)$$

$$p_{KDE}(x) = \frac{1}{N h^{D}} \sum_{n=1}^{N} K\!\left(\frac{x - x^{(n}}{h}\right)$$
From [Bishop, 1995]
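A minimal sketch of the smooth kernel estimate with the Gaussian kernel above (illustrative; the data set and bandwidth are made up):

```python
import numpy as np

def gaussian_kernel(u):
    """Multivariate Gaussian kernel K(u) = (2*pi)^(-D/2) * exp(-u'u/2)."""
    u = np.atleast_2d(u)                               # shape (N, D)
    D = u.shape[1]
    return np.exp(-0.5 * np.sum(u**2, axis=1)) / (2 * np.pi) ** (D / 2)

def kde(x, data, h):
    """Smooth kernel estimate p_KDE(x) = 1/(N h^D) * sum_n K((x - x^(n)/h)."""
    data = np.atleast_2d(data)
    N, D = data.shape
    return np.sum(gaussian_kernel((x - data) / h)) / (N * h**D)

# Made-up 1-D data and bandwidth: unlike the Parzen window, the estimate is smooth
data = np.array([[4.], [5.], [5.], [6.], [12.], [14.], [15.], [15.], [16.], [17.]])
for y in (3.0, 10.0, 15.0):
    print(f"p_KDE(y={y:g}) = {kde(np.array([y]), data, h=3.0):.4f}")
```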
Smooth kernels (2)
g Just as the Parzen window estimate can be considered a sum of boxes
centered at the observations, the smooth kernel estimate is a sum of
“bumps” placed at the data points
n The kernel function determines the shape of the bumps
n The parameter h, also called the smoothing parameter or bandwidth,
determines their width
[Figure: kernel density estimate P_KDE(x) with h=3, showing the data points, the individual kernel "bumps" placed at each point, and the resulting density estimate (their sum)]
Choosing the bandwidth: univariate case (1)
[Figure: kernel density estimates P_KDE(x) of the same data set for bandwidths h=1.0, h=2.5, h=5.0 and h=10.0]
g The problem of choosing the bandwidth is crucial in density estimation
n A large bandwidth will over-smooth the density and mask the structure in the data
n A small bandwidth will yield a density estimate that is spiky and very hard to interpret
Choosing the bandwidth: univariate case (2)
g We would like to find a value of the smoothing parameter that minimizes the error between
the estimated density and the true density
n A natural measure is the mean square error at the estimation point x, defined by
g This expression is an example of the bias-variance tradeoff that we saw earlier in the
course: the bias can be reduced at the expense of the variance, and vice versa
n The bias of an estimate is the systematic error incurred in the estimation
n The variance of an estimate is the random error incurred in the estimation
g The bias-variance dilemma applied to bandwidth selection simply means that
n A large bandwidth will reduce the differences among the estimates of pKDE(x) for different data sets (the
variance) but it will increase the bias of pKDE(x) with respect to the true density p(x)
n A small bandwidth will reduce the bias of pKDE(x), at the expense of a larger variance in the estimates pKDE(x)
$$MSE_{x}\!\left(p_{KDE}\right) = E\!\left[\left(p_{KDE}(x) - p(x)\right)^{2}\right] = \underbrace{\left(E\!\left[p_{KDE}(x)\right] - p(x)\right)^{2}}_{\text{bias}^2} + \underbrace{\mathrm{var}\!\left\{p_{KDE}(x)\right\}}_{\text{variance}}$$

[Figure: multiple kernel density estimates of the same true density; with h=0.1 the individual estimates fluctuate widely from one data set to another (VARIANCE), while with h=2.0 they agree with each other but systematically deviate from the true density (BIAS)]
Bandwidth selection methods, univariate case (3)
g Subjective choice
n The natural way for choosing the smoothing parameter is to plot out several curves and
choose the estimate that is most in accordance with one’s prior (subjective) ideas
n However, this method is not practical in pattern recognition since we typically have high-
dimensional data
g Reference to a standard distribution
n Assume a standard density function and find the value of the bandwidth that minimizes the
integral of the square error (MISE)
n If we assume that the true distribution is a Gaussian density and we use a Gaussian kernel,
it can be shown that the optimal value of the bandwidth becomes
g where σ is the sample standard deviation and N is the number of training examples
$$h_{opt} = \underset{h}{\operatorname{argmin}}\left\{ MISE\!\left(p_{KDE}(x)\right)\right\} = \underset{h}{\operatorname{argmin}}\left\{ E\!\left[\int \left(p_{KDE}(x) - p(x)\right)^{2} dx\right]\right\}$$

$$h_{opt} = 1.06\,\sigma\,N^{-1/5}$$
From [Silverman, 1986]
Bandwidth selection methods, univariate case (4)
n Better results can be obtained if we use a robust measure of the spread instead of the
sample standard deviation and we reduce the coefficient 1.06 to better cope with multimodal densities.
The optimal bandwidth then becomes
$$h_{opt} = 0.9\,A\,N^{-1/5} \qquad \text{where} \quad A = \min\!\left(\sigma,\; \frac{IQR}{1.34}\right)$$
g IQR is the interquartile range, a robust estimate of the spread, computed as the difference between the
75th percentile (Q3) and the 25th percentile (Q1): IQR = Q3 − Q1 (one half of this difference is the
semi-interquartile range)
n A percentile rank is the proportion of examples in a distribution that a specific example is greater than or equal to
g Likelihood cross-validation
n The ML estimate of h is degenerate since it yields hML=0, a density estimate with Dirac delta
functions at each training data point
n A practical alternative is to maximize the “pseudo-likelihood” computed using leave-one-out
cross-validation
$$h_{MLCV} = \underset{h}{\operatorname{argmax}}\left\{\frac{1}{N}\sum_{n=1}^{N} \log\, p_{-n}\!\left(x^{(n}\right)\right\} \qquad \text{where} \quad p_{-n}\!\left(x^{(n}\right) = \frac{1}{(N-1)\,h}\sum_{\substack{m=1 \\ m\neq n}}^{N} K\!\left(\frac{x^{(n} - x^{(m}}{h}\right)$$
From [Silverman, 1986]
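The normal reference rule, the robust rule and the likelihood cross-validation criterion can be sketched as follows (illustrative Python only; the data set and the grid of candidate bandwidths are made up):

```python
import numpy as np

def h_normal_reference(x):
    """Gaussian reference rule: h = 1.06 * sigma * N^(-1/5)."""
    return 1.06 * np.std(x, ddof=1) * len(x) ** (-1 / 5)

def h_robust(x):
    """Robust rule: h = 0.9 * A * N^(-1/5) with A = min(sigma, IQR/1.34)."""
    iqr = np.percentile(x, 75) - np.percentile(x, 25)
    A = min(np.std(x, ddof=1), iqr / 1.34)
    return 0.9 * A * len(x) ** (-1 / 5)

def h_mlcv(x, candidates):
    """Pick the bandwidth maximizing the leave-one-out pseudo-likelihood
    (1/N) * sum_n log p_{-n}(x^(n), using a Gaussian kernel."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    def loo_loglik(h):
        d = (x[:, None] - x[None, :]) / h                  # pairwise scaled differences
        K = np.exp(-0.5 * d**2) / np.sqrt(2 * np.pi)
        np.fill_diagonal(K, 0.0)                           # leave x^(n out of its own estimate
        p = K.sum(axis=1) / ((N - 1) * h)
        return np.mean(np.log(p))
    scores = [loo_loglik(h) for h in candidates]
    return candidates[int(np.argmax(scores))]

x = np.array([4, 5, 5, 6, 12, 14, 15, 15, 16, 17], dtype=float)
print(h_normal_reference(x), h_robust(x), h_mlcv(x, np.linspace(0.5, 5.0, 50)))
```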
Multivariate density estimation
g The derived expression of the estimate PKDE(x) for multiple dimensions was
n Notice that the bandwidth h is the same for all the axes, so this density estimate will
weight all the axes equally
g However, if the spread of the data is much greater in one of the coordinate
directions than the others, we should use a vector of smoothing parameters or
even a full covariance matrix, which complicates the procedure
g There are two basic alternatives to solve the scaling problem without having to
use a more general kernel density estimate
n Pre-scale each axis (normalize to unit variance, for instance)
n Pre-whiten the data (linearly transform to have unit covariance matrix), estimate the
density, and then transform back [Fukunaga]
g The whitening transform is simply y = Λ^{-1/2}M^T x, where Λ and M are the eigenvalue and eigenvector
matrices of the sample covariance of x
g Fukunaga’s method is equivalent to using a
hyper-ellipsoidal kernel
$$p_{KDE}(x) = \frac{1}{N h^{D}} \sum_{n=1}^{N} K\!\left(\frac{x - x^{(n}}{h}\right)$$
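A sketch of the pre-whitening alternative (illustrative, with made-up data; note that when mapping the density back to the original coordinates the estimate must be multiplied by the Jacobian |det W| of the linear transform):

```python
import numpy as np

def whiten(X):
    """Whitening transform y = Lambda^(-1/2) M^T x, with Lambda and M the eigenvalue
    and eigenvector matrices of the sample covariance of X."""
    Sigma = np.cov(X, rowvar=False)
    evals, M = np.linalg.eigh(Sigma)
    W = np.diag(evals ** -0.5) @ M.T        # whitening matrix
    return X @ W.T, W

def gaussian_kde(x, data, h):
    """Standard KDE with a single bandwidth h and a Gaussian kernel."""
    N, D = data.shape
    u = (x - data) / h
    K = np.exp(-0.5 * np.sum(u**2, axis=1)) / (2 * np.pi) ** (D / 2)
    return K.sum() / (N * h**D)

def whitened_kde(x, X, h):
    """Estimate the density in whitened coordinates and transform back:
    p_x(x) = p_y(W x) * |det W|  (Jacobian of the linear transform)."""
    Y, W = whiten(X)
    return gaussian_kde(W @ x, Y, h) * abs(np.linalg.det(W))

# Made-up example: strongly elongated 2-D Gaussian cloud
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2)) @ np.array([[5.0, 0.0], [0.0, 0.5]])
print(whitened_kde(np.array([0.0, 0.0]), X, h=0.5))   # roughly 1/(2*pi*5*0.5) ~ 0.064
```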
Product kernels
g A very popular method for performing multivariate density estimation is the
product kernel, defined as
n The product kernel consists of the product of one-dimensional kernels
g Typically the same kernel function is used in each dimension ( Kd(x)=K(x) ), and
only the bandwidths are allowed to differ
n Bandwidth selection can then be performed with any of the methods presented for univariate
density estimation
g It is important to notice that although the expression of K(x,x(n,h1,…hD) uses
kernel independence, this does not imply that any type of feature independence
is being assumed
n A density estimation method that assumed feature independence would have the following
expression
n Notice how the order of the summation and product are reversed compared to the product
kernel
$$p_{PKDE}(x) = \frac{1}{N}\sum_{n=1}^{N} K\!\left(x, x^{(n}, h_{1},\dots,h_{D}\right) \qquad \text{where} \quad K\!\left(x, x^{(n}, h_{1},\dots,h_{D}\right) = \frac{1}{h_{1}\cdot h_{2}\cdots h_{D}}\prod_{d=1}^{D} K_{d}\!\left(\frac{x(d) - x^{(n}(d)}{h_{d}}\right)$$

$$p_{FEAT-IND}(x) = \prod_{d=1}^{D}\left[\frac{1}{N h_{d}}\sum_{n=1}^{N} K\!\left(\frac{x(d) - x^{(n}(d)}{h_{d}}\right)\right]$$
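A minimal sketch of the product kernel estimate (illustrative only; a univariate Gaussian kernel is used in every dimension and the data and bandwidths are made up):

```python
import numpy as np

def product_kernel_kde(x, data, h):
    """p_PKDE(x) = 1/N * sum_n [ 1/(h_1*...*h_D) * prod_d K((x(d) - x^(n(d))/h_d) ]."""
    data = np.atleast_2d(data)                            # shape (N, D)
    h = np.asarray(h, dtype=float)                        # one bandwidth per dimension
    u = (x - data) / h                                    # broadcast over the D axes
    K1d = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)        # univariate Gaussian kernels
    return np.mean(np.prod(K1d, axis=1)) / np.prod(h)

# Made-up 2-D data with very different spreads per axis, hence different bandwidths
rng = np.random.default_rng(1)
data = np.column_stack([rng.normal(0, 1, 100), rng.normal(5, 0.2, 100)])
print(product_kernel_kde(np.array([0.0, 5.0]), data, h=[0.4, 0.08]))
```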
Product kernel, example 1
[Figure: three contour plots over the (x1, x2) plane corresponding to the true density and the two estimates described below]
g This example shows the product kernel density estimate of a bivariate unimodal Gaussian
distribution
n 100 data points were drawn from the distribution
n The figures show the true density (left) and the estimates using h=1.06σN^{-1/5} (middle) and h=0.9AN^{-1/5} (right)
Product kernel, example 2
[Figure: three contour plots over the (x1, x2) plane corresponding to the true density and the two estimates described below]
g This example shows the product kernel density estimate of a bivariate bimodal Gaussian
distribution
n 100 data points were drawn from the distribution
n The figures show the true density (left) and the estimates using h=1.06σN^{-1/5} (middle) and h=0.9AN^{-1/5} (right)
Naïve Bayes classifier
g Recall that the Bayes classifier is given by the following family of discriminant functions
g Using Bayes rule, these discriminant functions can be expressed as
n where P(ωi) is our prior knowledge and P(x|ωi) is obtained through density estimation
g Although we have presented density estimation methods that allow us to estimate the
multivariate likelihood P(x|ωi), the curse of dimensionality makes it a very tough problem!
g One highly practical simplification of the Bayes classifier is the so-called Naïve Bayes
classifier
n The Naïve Bayes classifier makes the assumption that the features are class-conditionally independent
g It is important to notice that this assumption is not as rigid as assuming independent features
n Merging this expression into the discriminant function yields the decision rule for the Naïve Bayes classifier
g The main advantage of the Naïve Bayes classifier is that we only need to compute the
univariate densities P(x(d)|ωi), which is a much easier problem than estimating the
multivariate density P(x|ωi)
n Despite its simplicity, the Naïve Bayes has been shown to have comparable performance to artificial neural
networks and decision tree learning in some domains
$$\text{choose } \omega_{i} \;\; \text{if} \;\; g_{i}(x) > g_{j}(x) \;\; \forall\, j \neq i \qquad \text{where} \quad g_{i}(x) = P(\omega_{i} \mid x)$$

$$g_{i}(x) = P(\omega_{i} \mid x) \propto P(x \mid \omega_{i})\,P(\omega_{i})$$

Naïve Bayes Classifier (class-conditional independence):
$$P(x \mid \omega_{i}) = \prod_{d=1}^{D} P\!\left(x(d) \mid \omega_{i}\right) \qquad \left(\text{as opposed to fully independent features, } P(x) = \prod_{d=1}^{D} P(x(d))\right)$$

$$g_{i,NB}(x) = P(\omega_{i}) \prod_{d=1}^{D} P\!\left(x(d) \mid \omega_{i}\right)$$
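To connect density estimation with classification, here is an illustrative sketch (not from the lecture) of a Naïve Bayes classifier whose univariate class-conditional densities P(x(d)|ωi) are obtained with one-dimensional Gaussian kernel density estimates; the two-class data set is made up.

```python
import numpy as np

def kde_1d(x, data, h):
    """Univariate Gaussian kernel density estimate evaluated at the scalar x."""
    u = (x - data) / h
    return np.mean(np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)) / h

def naive_bayes_kde(x, X_by_class, priors, h):
    """g_i(x) = P(w_i) * prod_d P(x(d)|w_i), each factor estimated with a 1-D KDE.
    Returns the index of the class with the largest discriminant (log scale)."""
    scores = []
    for X_i, P_i in zip(X_by_class, priors):
        log_g = np.log(P_i)
        for d in range(X_i.shape[1]):
            log_g += np.log(kde_1d(x[d], X_i[:, d], h))
        scores.append(log_g)
    return int(np.argmax(scores))

# Made-up two-class, two-dimensional problem
rng = np.random.default_rng(0)
X1 = rng.normal([0.0, 0.0], 1.0, size=(100, 2))           # class omega_1
X2 = rng.normal([3.0, 3.0], 1.0, size=(100, 2))           # class omega_2
print(naive_bayes_kde(np.array([2.5, 2.8]), [X1, X2], priors=[0.5, 0.5], h=0.5))  # -> 1
```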