In this talk, I explain feature selection and extraction with emphasis on image processing. Methods such as Principal Component Analysis and Canonical Analysis are explained with numerical examples.
1. Feature Selection and Extraction:
An Introduction with Emphasis on
Principal Component Analysis
Dr. N. B. Venkateswarlu, AITAM, Tekkali
2. What am I going to cover?
• What is feature selection/extraction
• Need and discussion
• Methodologies
• PCA
4. What Is Feature Selection?
• Selecting the most “relevant” subset of attributes
according to some selection criteria.
5. Why Feature Selection?
• High-dimensional data often contain irrelevant or
redundant features, which
– reduce the accuracy of data mining algorithms
– slow down the mining process
– cause problems in storage and retrieval
– make the results hard to interpret
6. Why is feature selection important?
• May improve performance of learning
algorithm
• Learning algorithm may not scale up to the
size of the full feature set either in sample
or time
• Allows us to better understand the domain
• Cheaper to collect a reduced set of features
7. What is feature selection?
Example 1 – Task: classify whether a document is about cats. Data: word counts in the document.
Full feature vector X: cat 2, and 35, it 20, kitten 8, electric 2, trouble 4, then 5, several 9, feline 2, while 4, …, lemon 2
Reduced X: cat 2, kitten 8, feline 2
Example 2 – Task: predict chances of lung disease. Data: medical history survey.
Full feature vector X: Vegetarian No, Plays video games Yes, Family history No, Athletic No, Smoker Yes, Sex Male, Lung capacity 5.8 L, Hair color Red, Car Audi, …, Weight 185 lbs
Reduced X: Family history No, Smoker Yes
(See the code sketch below for this column-subsetting view of feature selection.)
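The following is a minimal sketch of the idea illustrated on this slide: feature selection simply keeps a subset of the original columns. The word counts mirror the cat-document example above; the "relevant" subset is chosen by hand purely for illustration.

```python
# Feature selection as picking a subset of columns (illustrative only).
doc_features = {"cat": 2, "and": 35, "it": 20, "kitten": 8, "electric": 2,
                "trouble": 4, "then": 5, "several": 9, "feline": 2, "while": 4}

selected = {"cat", "kitten", "feline"}            # hypothetical relevant subset
reduced = {w: c for w, c in doc_features.items() if w in selected}
print(reduced)                                    # {'cat': 2, 'kitten': 8, 'feline': 2}
```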
8. Characterising features
• Generally, features are characterized as:
– Relevant: These are features which have an influence on
the output and whose role cannot be assumed by the rest
– Irrelevant: Irrelevant features are defined as those features
not having any influence on the output, and whose values
are generated at random for each example.
– Redundant: A redundancy exists whenever a feature can
take the role of another (perhaps the simplest way to
model redundancy).
12. Challenges in Feature Selection (1)
• Dealing with ultra-high dimensional data and feature
interactions
Traditional feature selection encounters two major problems when the
dimensionality runs into tens or hundreds of thousands:
1. curse of dimensionality
2. the relative shortage of instances.
13. Challenges in Feature Selection (2)
• Dealing with active instances (Liu et al., 2005)
When the dataset is huge, feature selection performed on
the whole dataset is inefficient,
so instance selection is necessary:
– Random sampling (pure random sampling without
exploiting any data characteristics)
– Active feature selection (selective sampling using
data characteristics achieves better or equally good
results with a significantly smaller number of
instances).
14. Challenges in Feature Selection (3)
• Dealing with new data types (Liu et al., 2005)
– traditional data type: an N*M data matrix
Due to the growth of computer and Internet/Web techniques,
new data types are emerging:
– text-based data (e.g., e-mails, online news, newsgroups)
– semi-structured data (e.g., HTML, XML)
– data streams.
15. Challenges in Feature Selection (4)
• Unsupervised feature selection
– Feature selection vs classification: almost
every classification algorithm
– Subspace method with the curse of
dimensionality in classification
– Subspace clustering.
16. Challenges in Feature Selection (5)
• Dealing with predictive-but-unpredictable
attributes in noisy data
– Attribute noise is difficult to process, and removing
noisy instances is dangerous
– Predictive attributes: essential to classification
– Unpredictable attributes: cannot be predicted by the
class and other attributes
• Noise identification, cleansing, and
measurement need special attention [Yang et
al., 2004]
17. Feature Selection Methods
• Feature selection is an optimization problem.
– Search the space of possible feature subsets.
– Pick the one that is optimal or near-optimal with respect to a
certain criterion.
Search strategies: Optimal, Heuristic, Randomized
Evaluation strategies: Filter methods, Wrapper methods
18. Evaluation Strategies
• Filter Methods
– Evaluation is independent of the classification algorithm or its
error criteria.
• Wrapper Methods
– Evaluation uses a criterion related to the classification
algorithm.
• Wrapper methods provide more accurate solutions
than filter methods, but in general are more
computationally expensive.
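As a concrete contrast, here is a hedged sketch of the two evaluation styles on a random toy dataset. The filter score used (absolute correlation with the label) and the wrapper classifier (a trivial nearest-class-mean rule with a single train/test split standing in for cross-validation) are my own illustrative choices, not methods prescribed by the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = (X[:, 0] + 0.5 * X[:, 2] + 0.3 * rng.normal(size=100) > 0).astype(int)

# Filter criterion: score each feature independently of any classifier.
filter_scores = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]

def wrapper_score(subset, X, y):
    """Accuracy of a nearest-class-mean classifier using only `subset`,
    with a simple 50/50 split standing in for cross-validation."""
    Xs = X[:, subset]
    tr, te = slice(0, 50), slice(50, 100)
    means = [Xs[tr][y[tr] == c].mean(axis=0) for c in (0, 1)]
    d = np.stack([np.linalg.norm(Xs[te] - m, axis=1) for m in means])
    return float((d.argmin(axis=0) == y[te]).mean())

print("filter scores:", np.round(filter_scores, 2))
print("wrapper score for subset [0, 2]:", wrapper_score([0, 2], X, y))
```

The filter scores are cheap and classifier-independent; the wrapper score retrains and tests an actual classifier for every subset it evaluates, which is why wrapper methods tend to be more accurate but more expensive.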
19. Typical Feature Selection – First Step: Generation
[Flowchart: Original Feature Set → (1) Generation → Subset → (2) Evaluation → Goodness of the subset → (3) Stopping Criterion (No → back to Generation / Yes) → (4) Validation]
The Generation step produces a subset of features for evaluation. It can start with:
• no features
• all features
• a random subset of features
20. Typical Feature Selection – Second Step: Evaluation
[Same flowchart as above]
The Evaluation step measures the goodness of the subset and compares it with the previous best subset; if the new subset is found to be better, it replaces the previous best subset.
21. Typical Feature Selection – Third Step: Stopping Criterion
[Same flowchart as above]
Based on the generation procedure:
• pre-defined number of features
• pre-defined number of iterations
Based on the evaluation function:
• whether addition or deletion of a feature produces a better subset
• whether an optimal subset based on some evaluation function has been achieved
22. Typical Feature Selection – Fourth Step: Validation
[Same flowchart as above]
Validation is basically not part of the feature selection process itself: it compares the results with already established results or with results from competing feature selection methods.
23. Exhaustive Search
• Assuming m features, an exhaustive search would
require:
– Examining all C(m, n) possible subsets of size n.
– Selecting the subset that performs best according to the
criterion function.
• The number of subsets grows combinatorially, making
exhaustive search impractical.
• Iterative procedures are often used, but they cannot
guarantee the selection of the optimal subset.
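Here is a minimal sketch of exhaustive subset search. The criterion function (class-mean separation over pooled variance) is only a placeholder I chose for illustration; any criterion of the kind discussed on the previous slides could be plugged in.

```python
import numpy as np
from itertools import combinations

def criterion(X, y, subset):
    # Placeholder criterion: squared distance between class means,
    # normalized by the pooled variance of the selected features.
    Xs = X[:, list(subset)]
    m0, m1 = Xs[y == 0].mean(axis=0), Xs[y == 1].mean(axis=0)
    return float(np.sum((m0 - m1) ** 2) / (Xs.var(axis=0).sum() + 1e-12))

def exhaustive_search(X, y, n):
    m = X.shape[1]
    # Evaluates all C(m, n) subsets -- feasible only for small m.
    return max(combinations(range(m), n), key=lambda s: criterion(X, y, s))
```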
24. Naïve Search
• Sort the given d features in order of their probability of
correct recognition.
• Select the top m features from this sorted list.
• Disadvantage
– Feature correlation is not considered.
– Best pair of features may not even contain the best individual
feature.
25. Sequential forward selection
(SFS)
(heuristic search)
• First, the best single feature is selected
(i.e., using some criterion function).
• Then, pairs of features are formed using
one of the remaining features and this
best feature, and the best pair is
selected.
• Next, triplets of features are formed
using one of the remaining features and
these two best features, and the best
triplet is selected.
• This procedure continues until a
predefined number of features are
selected.
SFS performs
best when the
optimal subset is
small.
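A hedged sketch of SFS follows, using a criterion function J with the same signature as the placeholder sketched for exhaustive search above; the function name and interface are my own choices.

```python
def sfs(X, y, target, J):
    """Greedy sequential forward selection: repeatedly add the single feature
    that most improves J(subset) until `target` features have been chosen."""
    selected, remaining = [], set(range(X.shape[1]))
    while len(selected) < target and remaining:
        best = max(remaining, key=lambda f: J(X, y, selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected
```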
26. Example
Results of sequential forward feature selection for classification of a
satellite image using 28 features. x-axis shows the classification accuracy
(%) and y-axis shows the features added at each iteration (the first iteration
is at the bottom). The highest accuracy value is shown with a star.
27. Sequential backward selection
(SBS)
(heuristic search)
• First, the criterion function is computed
for all d features.
• Then, each feature is deleted one at a
time, the criterion function is computed
for all subsets with d − 1 features, and
the worst feature is discarded.
• Next, each feature among the remaining
d − 1 is deleted one at a time, and the
worst feature is discarded to form a
subset with d − 2 features.
• This procedure continues until a
predefined number of features are left.
SBS performs
best when the
optimal subset is
large.
28. Example
Results of sequential backward feature selection for classification of a
satellite image using 28 features. x-axis shows the classification
accuracy (%) and y-axis shows the features removed at each iteration
(the first iteration is at the bottom). The highest accuracy value is
shown with a star.
29. Plus-L minus-R selection (LRS)
• A generalization of SFS and SBS
– If L>R, LRS starts from the empty set and
repeatedly adds L features and removes R
features.
– If L<R, LRS starts from the full set and
repeatedly removes R features and adds L
features.
• Comments
– LRS attempts to compensate for the
weaknesses of SFS and SBS with some
backtracking capabilities.
– How to choose the optimal values of L and
R?
30. Bidirectional Search (BDS)
• BDS applies SFS and SBS
simultaneously:
– SFS is performed from the empty
set
– SBS is performed from the full set
• To guarantee that SFS and SBS
converge to the same solution
– Features already selected by SFS
are not removed by SBS
– Features already removed by SBS
are not selected by SFS
31. Sequential floating selection
(SFFS and SFBS)
• An extension to LRS with flexible backtracking
capabilities
– Rather than fixing the values of L and R, floating methods
determine these values from the data.
– The dimensionality of the subset during the search can be
thought to be “floating” up and down
• There are two floating methods:
– Sequential floating forward selection (SFFS)
– Sequential floating backward selection (SFBS)
P. Pudil, J. Novovicova, J. Kittler, Floating search methods in feature
selection, Pattern Recognition Lett. 15 (1994) 1119–1125.
32. Sequential floating selection
(SFFS and SFBS)
• SFFS
– Sequential floating forward selection (SFFS) starts from
the empty set.
– After each forward step, SFFS performs backward steps
as long as the objective function increases.
• SFBS
– Sequential floating backward selection (SFBS) starts from
the full set.
– After each backward step, SFBS performs forward steps
as long as the objective function increases.
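Below is a minimal sketch of SFFS under the same assumed criterion interface J(X, y, subset) used in the earlier sketches. The full algorithm of Pudil et al. also remembers the best subset of each size to avoid cycling; that bookkeeping is omitted here for brevity.

```python
def sffs(X, y, target, J):
    """Sequential floating forward selection (sketch)."""
    selected = []
    while len(selected) < target:
        # Forward step: add the single best remaining feature.
        remaining = [f for f in range(X.shape[1]) if f not in selected]
        added = max(remaining, key=lambda f: J(X, y, selected + [f]))
        selected.append(added)
        # Conditional backward steps: drop the least significant feature while
        # doing so improves J (never the feature just added).
        while len(selected) > 2:
            candidates = [f for f in selected if f != added]
            worst = max(candidates,
                        key=lambda f: J(X, y, [g for g in selected if g != f]))
            without = [g for g in selected if g != worst]
            if J(X, y, without) > J(X, y, selected):
                selected = without
            else:
                break
    return selected
```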
33. Argument for wrapper methods
• The estimated accuracy of the learning
algorithm is the best available heuristic for
measuring the values of features.
• Different learning algorithms may perform
better with different feature sets, even if
they are using the same training set.
34. Wrapper selection algorithms (1)
• The simplest method is forward selection
(FS). It starts with the empty set and
greedily adds features one at a time
(without backtracking).
• Backward stepwise selection (BS) starts
with all features in the feature set and
greedily removes them one at a time
(without backtracking).
35. Wrapper selection algorithms (2)
• The Best First search starts with an empty set of features and
generates all possible single feature expansions. The subset with
the highest evaluation is chosen and is expanded in the same
manner by adding single features (with backtracking). The Best First
search (BFFS) can be combined with forward or backward selection
(BFBS).
• Genetic algorithm selection. A solution is typically a fixed length
binary string representing a feature subset—the value of each
position in the string represents the presence or absence of a
particular feature. The algorithm is an iterative process where each
successive generation is produced by applying genetic operators
such as crossover and mutation to the members of the current
generation.
36. Disadvantages of Support Vector Machines
"Perhaps the biggest limitation of the support vector approach
lies in choice of the kernel."
Burges (1998)
"A second limitation is speed and size, both in training and
testing."
Burges (1998)
"Discrete data presents another problem..."
Burges (1998)
"...the optimal design for multiclass SVM classifiers is a
further area for research."
Burges (1998)
37. "Although SVMs have good generalization performance, they can be
abysmally slow in test phase, a problem addressed in (Burges, 1996;
Osuna and Girosi, 1998)."
Burges (1998)
"Besides the advantages of SVMs - from a practical point of view - they
have some drawbacks. An important practical question that is not entirely
solved, is the selection of the kernel function parameters - for Gaussian
kernels the width parameter [sigma] - and the value of [epsilon] in the
[epsilon]-insensitive loss function...[more]"
Horváth (2003) in Suykens et al.
"However, from a practical point of view perhaps the most serious
problem with SVMs is the high algorithmic complexity and extensive
memory requirements of the required quadratic programming in large-
scale tasks."
Horváth (2003) in Suykens et al. p 392
39. • Other names for PCA:
1) Karhunen-Loève Transformation (KLT);
2) Hotelling Transformation;
3) Eigenvector Analysis.
• Properties of PCA:
1) Data decorrelation;
2) Dimensionality reduction.
40. Important Objective
• The goal of principal component analysis is
to take n variables x1, x2,…, xn and find
linear combinations of these variables to
produce a new set of variables y1, y2, …, yn
that are uncorrelated. The transformed
variables are indexed or ordered so that y1
shows the largest amount of variation, y2
has the second largest amount of variation,
and so on.
41. PCA – the general idea
• PCA finds an orthogonal basis that best represents the given data set.
[Figure: 2-D point set shown in the original axes x, y and in the rotated principal axes x', y'.]
42. PCA – the general idea
• PCA finds an orthogonal basis that best represents the given data set.
• PCA finds the best approximating plane (again, in terms of Σ distances²).
[Figure: 3-D point set in the standard basis x, y, z.]
43. PCA – the general idea
• PCA finds an orthogonal basis that best represents the given data set.
• PCA finds the best approximating plane (again, in terms of Σ distances²).
[Figure: the same 3-D point set in the standard basis.]
45. Usage of bounding boxes (bounding volumes)
• Serve as a very simple "approximation" of the object
• Fast collision detection, visibility queries
• Useful whenever we need to know the dimensions (size) of the object
• The models consist of thousands of polygons; to quickly test that they don't intersect, the bounding boxes are tested
• Sometimes a hierarchy of BBs is used
• The tighter the BB, the fewer "false alarms" we have
46. Centered data points x in n-dimensional space:
x = (x₁, …, xₙ)',  with mean  μ = E{x} = (0, …, 0)'.
μ is the mean value of the vector x (zero after centering).
Covariance matrix C for the centered data:
C = E{x x'},  with elements  c_{i,j} = E{x_i x_j}.
Here E{f(x)} is the expectation value of f(x).
47. Projection of the data point x onto the direction w:
P_w(x) = ⟨x, w⟩ = x'w.
The variance of the projection onto the direction w:
σ_w² = E{P_w(x)²} = E{(x'w)(x'w)} = E{w'x x'w} = w' E{x x'} w = w'Cw.
[Figure: a 2-D data point x and its projection P_w(x) onto the direction w, in coordinates x₁, x₂.]
48. So,
σ_w² = w'Cw.
The vector w should be normalized:
‖w‖² = w'w = 1.
Hence, finding the normalized direction of maximal variance reduces to the following computation.
Maximizing variance: the normalized direction w that maximizes the variance can be found by solving the problem
max_w { w'Cw }   subject to   w'w = ‖w‖² = 1.
49. The constrained optimization problem is reduced to an
unconstrained one using the method of Lagrange multipliers:
max_w { w'Cw − λ(w'w − 1) }.
Condition for a maximum of the function:
∂/∂w [ w'Cw − λ w'w ] = Cw − λw = 0.
We have to solve the equation
Cw = λw
and find the eigenvalues λ_i and eigenvectors w_i of the covariance matrix C.
50. The covariance matrix C is symmetric, so the equation Cw = λw has n solutions:
• n eigenvectors (w₁, w₂, …, wₙ) that form an orthonormal basis in n-dimensional space:
w_i'w_j = 1 if i = j, and 0 if i ≠ j;
• n non-negative eigenvalues that are the data variances along the corresponding eigenvectors:
λ₁ ≥ λ₂ ≥ … ≥ λₙ ≥ 0.
51. • The direction of maximum variance is given by the eigenvector w₁ corresponding to the largest eigenvalue λ₁, and the variance of the projection onto this direction is equal to λ₁.
• The direction w₁ is called the first principal axis.
• The direction w₂ is called the second principal axis, and so on.
[Figure: 2-D point cloud with principal axes w₁ and w₂.]
52. 1) Decorrelation property of PCA
• Let's represent the vector x as a linear combination of the n eigenvectors with coefficients a_i:
x = a₁w₁ + a₂w₂ + … + aₙwₙ,
where the coefficients a_i ≡ P_{w_i}(x) = x'w_i = (w_i)'x are computed as the projections of the vector x onto the basis vectors w_i.
53. 1) Decorrelation property of PCA
• Calculate the correlation of the coefficients a_i and a_j:
E{a_i a_j} = E{w_i'x x'w_j} = w_i' E{x x'} w_j = w_i'C w_j.
• From the main property of the eigenvalues and eigenvectors it follows that any pair of coefficients is uncorrelated:
E{a_i a_j} = w_i'C w_j = λ_j w_i'w_j = λ_j δ_{i,j},
where δ_{i,j} is the Kronecker symbol: δ_{i,j} = 1 if i = j, and 0 if i ≠ j.
• The variance of each projection is the corresponding eigenvalue: E{a_i²} = λ_i.
• The projections are pairwise uncorrelated: E{a_i a_j} = 0 for i ≠ j.
54. 2) PCA dimensionality reduction:
• The objective of PCA is to perform dimensionality reduction while preserving as much of the data in the high-dimensional space as possible:
* for visualization,
* for compression,
* to discard data containing little or no information.
Demonstrations with 1-D and 2-D data points on transparencies in 3-D space.
55. 2) PCA dimensionality reduction: Main idea
• Find the first m eigenvectors corresponding to the m largest eigenvalues.
• Project the data points onto the subspace spanned by the first m eigenvectors:
P_{w₁,…,w_m}(x) = Σ_{k=1}^{m} (x'w_k) w_k.
56. • Input data:
x = Σ_{k=1}^{n} (x'w_k) w_k.
• Data projected onto the subspace spanned by the first m eigenvectors:
P_{w₁,…,w_m}(x) = Σ_{k=1}^{m} (x'w_k) w_k.
• Error caused by the dimensionality reduction:
δx = x − P_{w₁,…,w_m}(x) = Σ_{k=m+1}^{n} (x'w_k) w_k.
• Variance of the error:
σ_m² = E{‖x − P_{w₁,…,w_m}(x)‖²} = Σ_{k=m+1}^{n} λ_k.
57. • The variance of the error is equal to the sum of the eigenvalues for the dropped dimensions:
σ_m² = E{‖x − P_{w₁,…,w_m}(x)‖²} = Σ_{k=m+1}^{n} λ_k.
• PCA thus gives, in this sense, an optimal representation of the data in a low-dimensional subspace of the original high-dimensional pattern space: it provides the minimal mean squared error among all linear transformations.
• This subspace is spanned by the first m eigenvectors of the covariance matrix C corresponding to the m largest eigenvalues.
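The claim above is easy to check numerically. The sketch below generates random correlated data (purely for illustration), keeps the top m eigenvectors, and compares the mean squared reconstruction error with the sum of the dropped eigenvalues; the two numbers should agree up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 5)) @ rng.normal(size=(5, 5))   # correlated toy data
X -= X.mean(axis=0)                                        # center

C = (X.T @ X) / len(X)
lam, W = np.linalg.eigh(C)            # eigenvalues in ascending order
lam, W = lam[::-1], W[:, ::-1]        # sort descending

m = 2
P = W[:, :m] @ W[:, :m].T             # projector onto the first m eigenvectors
err = X - X @ P
print(np.mean(np.sum(err ** 2, axis=1)))   # mean squared reconstruction error
print(lam[m:].sum())                       # sum of dropped eigenvalues
```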
58. Data "whitening"
Let's introduce a new basis by scaling the eigenvectors:
u_k = w_k / √λ_k.
• The new basis vectors u_k remain mutually orthogonal: u_i'u_j = δ_{i,j} / λ_i.
• The variances of the projections onto the new basis u are equal to unity:
σ_u² = u_k'C u_k = 1.
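A short sketch of this whitening step, on random data generated only for illustration: after scaling each eigenvector direction by 1/√λ_k, the covariance of the transformed coordinates should be (approximately) the identity matrix.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(5000, 3)) @ np.array([[2.0, 0, 0], [1, 1, 0], [0, 0, 0.5]])
X -= X.mean(axis=0)

C = (X.T @ X) / len(X)
lam, W = np.linalg.eigh(C)
U = W / np.sqrt(lam)               # columns u_k = w_k / sqrt(lambda_k)

Y = X @ U                          # whitened coordinates
print(np.round(np.cov(Y, rowvar=False, bias=True), 3))   # ~ identity matrix
```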
59. Scatter matrix – eigendecomposition
• S is symmetric ⇒ S has an eigendecomposition: S = VΛVᵀ
[Figure: S = [v₁ v₂ … vₙ] diag(λ₁, λ₂, …, λₙ) [v₁ v₂ … vₙ]ᵀ; the eigenvectors form an orthogonal basis.]
60. Principal components
• S measures the “scatterness” of the data.
• Eigenvectors that correspond to big
eigenvalues are the directions in which the
data has strong components.
• If the eigenvalues are more or less the same,
there is no preferred direction.
61. Principal components
• If there is no preferable direction: S looks like λI (both eigenvalues roughly equal to λ), and any vector is an eigenvector.
• If there is a clear preferable direction: S looks like VΛVᵀ with Λ = diag(λ, µ), where µ is close to zero, much smaller than λ.
62. How to use what we got
• For finding oriented bounding box – we
simply compute the bounding box with
respect to the axes defined by the
eigenvectors. The origin is at the mean
point m.
[Figure: oriented bounding box aligned with the eigenvectors v₁, v₂, v₃.]
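A minimal sketch of this oriented-bounding-box construction; the function name and the toy point cloud are my own choices for illustration.

```python
import numpy as np

def oriented_bounding_box(points):
    """Oriented bounding box of a point cloud via PCA of its covariance."""
    m = points.mean(axis=0)
    C = np.cov(points - m, rowvar=False)
    _, V = np.linalg.eigh(C)          # columns of V are the principal axes
    local = (points - m) @ V          # coordinates in the eigenvector basis
    return m, V, local.min(axis=0), local.max(axis=0)

pts = np.random.default_rng(3).normal(size=(100, 3)) * [5.0, 1.0, 0.2]
center, axes, lo, hi = oriented_bounding_box(pts)
print("box extents along principal axes:", hi - lo)
```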
64. For approximation
• In general dimension d, the eigenvalues are
sorted in descending order:
λ1 ≥ λ2 ≥ … ≥ λd
• The eigenvectors are sorted accordingly.
• To get an approximation of dimension d’ <
d, we take the d’ first eigenvectors and look
at the subspace they span (d’ = 1 is a line,
d’ = 2 is a plane…)
65. For approximation
• To get an approximating set, we project the original data points onto the chosen subspace:
x_i = m + α₁v₁ + α₂v₂ + … + α_{d'}v_{d'} + … + α_d v_d
Projection:
x_i' = m + α₁v₁ + α₂v₂ + … + α_{d'}v_{d'} + 0·v_{d'+1} + … + 0·v_d
66. Optimality of approximation
• The approximation is optimal in the least-squares sense: it minimizes
Σ_{k=1}^{n} ‖x_k − x_k'‖².
• The projected points have maximal variance.
[Figure: original point set, its projection onto an arbitrary line, and its projection onto the v₁ axis.]
67. • A line graph of the (ordered) eigenvalues is
called a scree plot
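A quick sketch of a scree plot on synthetic data (the diagonal scaling is made up so that the first few eigenvalues dominate and produce a visible elbow):

```python
import numpy as np
import matplotlib.pyplot as plt

X = np.random.default_rng(4).normal(size=(500, 6)) @ np.diag([3, 2, 1, 0.3, 0.2, 0.1])
lam = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]   # descending eigenvalues

plt.plot(range(1, len(lam) + 1), lam, "o-")
plt.xlabel("component")
plt.ylabel("eigenvalue")
plt.title("Scree plot")
plt.show()
```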
68. How to work in practice?
• We have a training set X of size l (l n-dimensional vectors):
X = [ x_{1,1} … x_{1,n} ; … ; x_{l,1} … x_{l,n} ].
• Evaluate the covariance matrix using the training set X:
C = (1/l) X'X,  where  c_{k,j} = (1/l) Σ_{i=1}^{l} x_{i,k} x_{i,j}.
• Find the eigenvalues and eigenvectors of the covariance matrix.
69. Principal Component Analysis (PCA)
takes an initial subset of the principal axes of the training data and projects the data (both training and test) into the space spanned by this set of eigenvectors.
• The data is projected onto the subspace spanned by the first m eigenvectors of the covariance matrix. The new coordinates are known as principal coordinates, with the eigenvectors referred to as principal axes.
70. Algorithm:
Input: dataset X = {x₁, x₂, …, x_l} ⊆ ℝⁿ.
Process:
μ = (1/l) Σ_{i=1}^{l} x_i
C = (1/l) Σ_{i=1}^{l} (x_i − μ)(x_i − μ)'
[W, Λ] = eig(C)
x̃_i = W · x_i,  i = 1, 2, …, l  (keeping the first k principal coordinates, x̃ = (x̃₁, …, x̃_k))
Output: transformed data S̃ = {x̃₁, x̃₂, …, x̃_l}.
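Below is a minimal NumPy sketch of this algorithm. The function name pca_transform and the output dimensionality k are my own choices; the projection is written as Xc @ W, i.e. W'x per vector, which matches the column-eigenvector convention of numpy.linalg.eigh.

```python
import numpy as np

def pca_transform(X, k):
    """X: (l, n) array of l n-dimensional vectors. Returns (l, k) projections."""
    mu = X.mean(axis=0)                     # sample mean
    Xc = X - mu                             # centered data
    C = (Xc.T @ Xc) / len(X)                # covariance matrix
    lam, W = np.linalg.eigh(C)              # eigenvalues in ascending order
    W = W[:, ::-1][:, :k]                   # first k principal axes
    return Xc @ W                           # transformed data
```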
72. Example: 8 vectors in 2-D space
X = [1,2; 3,3; 3,5; 5,4; 5,6; 6,5; 8,7; 9,8];
Covariance matrix C:
C = [ 6.25  4.25
      4.25  3.50 ]
Find the eigenvalues and eigenvectors of the covariance matrix C:
Cw = λw,  i.e.  [ 6.25 4.25 ; 4.25 3.50 ] [w₁ ; w₂] = λ [w₁ ; w₂].
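The slide's numerical example can be reproduced with a few lines of NumPy; the eigenvalues should come out to roughly 9.34 and 0.41, with the leading eigenvector pointing along the elongated direction of the point set.

```python
import numpy as np

X = np.array([[1, 2], [3, 3], [3, 5], [5, 4], [5, 6], [6, 5], [8, 7], [9, 8]], float)
Xc = X - X.mean(axis=0)
C = (Xc.T @ Xc) / len(X)              # [[6.25, 4.25], [4.25, 3.5]]
lam, W = np.linalg.eigh(C)
print("eigenvalues:", lam[::-1])      # descending order
print("eigenvectors (columns):", W[:, ::-1])
```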
81. Case Study 1: Gender Classification
• Determine the gender of a subject from facial images.
– Race, age, facial expression, hair style, etc.
Z. Sun, G. Bebis, X. Yuan, and S. Louis, "Genetic Feature Subset
Selection for Gender Classification: A Comparison Study", IEEE
Workshop on Applications of Computer Vision, pp. 165-170,
Orlando, December 2002.
82. Feature Extraction Using PCA
• PCA maps the data in a lower-dimensional space
using a linear transformation.
• The columns of the projection matrix are the “best”
eigenvectors (i.e., eigenfaces) of the covariance
matrix of the data.
84. Dataset
• 400 frontal images from 400 different people
– 200 male, 200 female
– Different races, lighting conditions, and facial expressions
• Images were registered and normalized
– No hair information
– Account for different lighting conditions
85. Experiments
• Classifiers
– LDA
– Bayes classifier
– Neural Networks (NNs)
– Support Vector Machines (SVMs)
• Comparison with SFBS
• Three-fold cross validation
– Training set: 75% of the data
– Validation set: 12.5% of the data
– Test set: 12.5% of the data
86. Error Rates
ERM: error rate using the top eigenvectors
ERG: error rate using GA-selected eigenvectors
[Bar chart comparing ERM and ERG for each classifier; reported values include 17.7%, 11.3%, 22.4%, 13.3%, 14.2%, 9%, 8.9%, 4.7%, and 6.7%.]
87. Ratio of Features – Information Kept
RN: percentage of eigenvectors in the selected feature subset.
RI: percentage of information contained in the selected eigenvector subset.
[Bar chart of RN and RI per classifier; reported values include 17.6%, 38%, 13.3%, 31%, 36.4%, 61.2%, 8.4%, 32.4%, 42.8%, and 69.0%.]
89. Reconstructed Images
Reconstructed faces using GA-selected EVs do not contain information
about identity but disclose strong gender information!
Original images
Using top 30 EVs
Using EVs selected
by SVM+GA
Using EVs selected
by NN+GA
91. Case Study 2: Vehicle Detection
• Low-light camera, rear views
• Non-vehicle class much larger than vehicle class
Z. Sun, G. Bebis, and R. Miller, "Object Detection Using
Feature Subset Selection", Pattern Recognition,
vol. 37, pp. 2165-2176, 2004.
Ford Motor Company
93. Experiments
• Training data set (collected in Fall 2001)
2102 images (1051 vehicles and 1051 non-vehicles)
• Test data sets (collected in Summer 2001)
231 images (vehicles and non-vehicles)
• Comparison with SFBS
• Three-fold cross-validation
• SVM for classification
96. Vehicle Detection
Original
Top 50 EVs
EVs selected
by SFBS
EVs selected
by GAs
Reconstructed images using the selected feature subsets.
- Lighting differences have been disregarded by the GA approach.
Editor's notes
1. Curse of dimensionality: as most existing feature selection algorithms have quadratic or higher time complexity in N, it is difficult to scale up with high dimensionality.
2. The relative shortage of instances: the dimensionality N can sometimes greatly exceed the number of instances I.
Random sampling: pure random sampling without exploiting any data characteristics.
Active feature selection: selective sampling using data characteristics achieves better or equally good results with a significantly smaller number of instances.