CVPR 2010: Advanced IT in CVPR in a Nutshell: Part 3: Feature Selection
1. Advanced Information Theory in CVPR "in a Nutshell"
CVPR Tutorial
June 13-18, 2010
San Francisco, CA
High-Dimensional Feature Selection:
Images, Genes & Graphs
Francisco Escolano
2. High-Dimensional Feature Selection
Background. Feature selection in CVPR is mainly motivated by: (i)
dimensionality reduction for improving the performance of classifiers,
and (ii) feature-subset pursuit for a better understanding/description
of the patterns at hand (either using generative or discriminative
methods).
Exploring the domain of images [Zhu et al.,97] and other patterns
(e.g. micro-array/gene expression data) [Wang and Gotoh,09] implies
dealing with thousands of features [Guyon and Elisseeff, 03]. In terms
of Information Theory (IT) this task demands pursuing the most
informative features, not only individually but also capturing their
statistical interactions, beyond taking into account pairs of features.
This has been a big challenge since the early 1970s, due to its
intrinsic combinatorial nature.
3. Feature Classification
Strong and Weak Relevance
1. Strong relevance: a feature f_i is strongly relevant if its removal
degrades the performance of the Bayes optimum classifier.
2. Weak relevance: f_i is weakly relevant if it is not strongly
relevant and there exists a subset of features F such that the
performance on F ∪ {f_i} is better than the performance on F alone.
Redundancy and Noise
1. Redundant features: those which can be removed because another
feature or subset of features already supplies the same information
about the class.
2. Noisy features: neither redundant nor informative.
5. Wrappers and Filters
Wrappers vs Filters
1. Wrapper feature selection (WFS) consists of selecting features
according to the classification results that these features yield
(e.g. using cross validation (CV)). Therefore wrapper feature
selection is a classifier-dependent approach.
2. Filter feature selection (FFS) is classifier-independent, as it is
based on statistical analysis of the input variables (features),
given the classification labels of the samples.
3. Wrappers build classifiers each time a feature set has to be
evaluated. This makes them more prone to overfitting than
filters.
4. In FFS the classifier itself is built and tested after FS.
6. FS is a Complex Task
The Complexity of FS
The only way to assure that a feature set is optimal (following what
criterion?) is exhaustive search among feature combinations.
The curse of dimensionality limits this search, as the complexity is
$$O(z), \qquad z = \sum_{i=1}^{n} \binom{n}{i} = 2^n - 1.$$
Filter design is desirable in order to avoid overfitting, but filters
must be as multivariate as wrappers. However, this implies designing
and estimating a cost function that captures the high-order
interactions of many variables. In the following we will focus on FFS.
7. Mutual Information Criterion
A Criterion is Needed
The primary problem of feature selection is the criterion which
evaluates a feature set. It must decide whether a feature subset is
suitable for the classification problem, or not. The optimal criterion
for such purpose would be the Bayesian error rate for the subset of
selected features:
$$E(S) = \int_S p(S)\left(1 - \max_i\, p(c_i \mid S)\right) dS,$$
where S is the vector of selected features and $c_i \in C$ is a class from
all the possible classes C existing in the data.
8. Mutual Information Criterion (2)
Bounding Bayes Error
An upper bound of the Bayesian error, which was obtained by
Hellman and Raviv (1970) is:
$$E(S) \leq \frac{H(C \mid S)}{2}.$$
This bound is related to mutual information (MI), because mutual
information can be expressed as
$$I(S; C) = H(C) - H(C \mid S),$$
and H(C) is the entropy of the class labels, which does not depend on
the feature subspace S.
10. Mutual Information Criterion (4)
MI and KL Divergence
From the definition of MI we have that:
$$I(X; Y) = \sum_{y \in Y} \sum_{x \in X} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}
= \sum_{y \in Y} p(y) \sum_{x \in X} p(x \mid y) \log \frac{p(x \mid y)}{p(x)}
= \mathbb{E}_Y\left[\mathrm{KL}\left(p(x \mid y)\,\|\,p(x)\right)\right].$$
Then, maximizing MI is equivalent to maximizing the expectation of
the KL divergence between the class-conditional densities P(S|C )
and the density of the feature subset P(S).
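As a quick numerical check of this identity, the following sketch (ours; the toy joint table is arbitrary) computes I(X;Y) both directly and as the expectation over Y of the KL divergence:

```python
import numpy as np

# Toy joint distribution p(x, y): 3 feature values (rows) x 2 classes (columns).
p_xy = np.array([[0.20, 0.10],
                 [0.15, 0.25],
                 [0.05, 0.25]])   # entries sum to 1

p_x = p_xy.sum(axis=1)            # marginal p(x)
p_y = p_xy.sum(axis=0)            # marginal p(y)

# Direct definition: I(X;Y) = sum_{x,y} p(x,y) log[ p(x,y) / (p(x) p(y)) ]
mi_direct = np.sum(p_xy * np.log(p_xy / np.outer(p_x, p_y)))

# Expected-KL form: E_Y[ KL( p(x|y) || p(x) ) ]
p_x_given_y = p_xy / p_y          # column y holds p(x|y)
kl_per_y = np.sum(p_x_given_y * np.log(p_x_given_y / p_x[:, None]), axis=0)
mi_expected_kl = np.sum(p_y * kl_per_y)

print(mi_direct, mi_expected_kl)  # identical (~0.082 nats for this table)
```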
11. Mutual Information Criterion (5)
The Bottleneck
Estimating mutual information between a high-dimensional
continuous set of features, and the class labels, is not
straightforward, due to the entropy estimation.
Simplifying assumptions: (i) dependency is non-informative
about the class; (ii) redundancies between features are
independent of the class [Vasconcelos & Vasconcelos, 04].
The basic idea is to simplify the computation of the MI. For
instance:
$$I(S^*; C) \approx \sum_{x_i \in S} I(x_i; C).$$
More realistic assumptions involving interactions are needed!
12. The mRMR Criterion
Definition
Maximize the relevance I(x_j; C) of each individual feature x_j ∈ F
and simultaneously minimize the redundancy between x_j and the
rest of the selected features x_i ∈ S, i ≠ j. This criterion is known as
the min-Redundancy Max-Relevance (mRMR) criterion and its
formulation for the selection of the m-th feature is [Peng et al., 2005]:
$$\max_{x_j \in F - S_{m-1}} \left[ I(x_j; C) - \frac{1}{m-1} \sum_{x_i \in S_{m-1}} I(x_j; x_i) \right].$$
Thus, second-order interactions are considered!
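A minimal sketch of the greedy mRMR loop. The function name mrmr and the incremental redundancy cache are our choices (this is not Peng et al.'s code), and scikit-learn's kNN-based estimators mutual_info_classif and mutual_info_regression stand in for I(x_j; C) and I(x_j; x_i):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

def mrmr(X, y, m):
    """Greedily select m features from X (N x n) given class labels y."""
    n = X.shape[1]
    relevance = mutual_info_classif(X, y)     # I(x_j; C) for every feature
    selected = [int(np.argmax(relevance))]    # start with the most relevant one
    red_sum = np.zeros(n)                     # running sum of I(x_j; x_i), x_i in S
    while len(selected) < m:
        # Add the redundancies against the feature selected last.
        red_sum += mutual_info_regression(X, X[:, selected[-1]])
        score = relevance - red_sum / len(selected)
        score[selected] = -np.inf             # never re-select a feature
        selected.append(int(np.argmax(score)))
    return selected
```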
13. The mRMR Criterion (2)
Properties
Can be embedded in a forward greedy search, that is, at each
iteration of the algorithm the next feature optimizing the
criterion is selected.
Doing so, this criterion is equivalent to a first-order incremental
search under the Maximum Dependency (MD) criterion:
$$\max_{S \subseteq F} I(S; C).$$
Then, the m-th feature is selected according to:
$$\max_{x_j \in F - S_{m-1}} I(S_{m-1}, x_j; C).$$
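A sketch of this greedy MD search (names are ours; mutual_information(S, labels) stands for any estimator of I(S; C) over a growing feature block S, which is precisely the bottleneck discussed next):

```python
import numpy as np

def forward_md(X, labels, mutual_information, m):
    """Greedily add the feature x_j maximizing I({S_{m-1}, x_j}; C)."""
    selected, candidates = [], list(range(X.shape[1]))
    for _ in range(m):
        scores = [mutual_information(X[:, selected + [j]], labels)
                  for j in candidates]
        selected.append(candidates.pop(int(np.argmax(scores))))
    return selected
```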
14. Efficient MD
Need of an Entropy Estimator
Whilst in mRMR the mutual information is incrementally
estimated by estimating it between two one-dimensional
variables, in MD the estimation of I(S; C) is not trivial
because S could consist of a large number of features.
If we express the MI in terms of the conditional entropy
H(S|C) we have:
$$I(S; C) = H(S) - H(S \mid C) = H(S) - \sum_{c \in C} p(C = c)\, H(S \mid C = c).$$
Thus the class-conditional entropies H(S|C = c), weighted by the
priors p(C = c), have to be calculated. Either way, we must
calculate multi-dimensional entropies!
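In code, the decomposition reads as follows (a sketch; mutual_information and entropy are our names, the latter standing for any multi-dimensional entropy estimator, such as the one introduced on the next slides):

```python
import numpy as np

def mutual_information(S, labels, entropy):
    """I(S; C) = H(S) - sum_c p(C=c) H(S | C=c), for a feature block S (N x d)."""
    h_conditional = sum(
        np.mean(labels == c) * entropy(S[labels == c])   # p(C=c) * H(S|C=c)
        for c in np.unique(labels))
    return entropy(S) - h_conditional
```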
15. Leonenko’s Estimator
Theoretical Bases
Recently, Leonenko et al. published an extensive study [Leonenko et
al, 08] about Rényi and Tsallis entropy estimation, also considering
the limit α → 1 for obtaining the Shannon entropy.
Their construction relies on the integral
$$I_\alpha = \mathbb{E}\{f^{\alpha-1}(X)\} = \int_{\mathbb{R}^d} f^\alpha(x)\, dx,$$
where f(·) refers to the density of a set of N i.i.d. samples
X = {X_1, X_2, ..., X_N}. The integral is valid for α ≠ 1;
however, the limits for α → 1 are also calculated in order to cover
the Shannon entropy estimation.
16. Leonenko’s Estimator (2)
Entropy Maximization Distributions
The α-entropy maximizing distributions are only defined for
0 < α < 1, where the entropy Hα is a concave function. The
maximizing distributions are defined under some constraints.
The uniform distribution maximizes α-entropy under the
constraint that the distribution has a finite support. For
distributions with a given covariance matrix the maximizing
distribution is Student-t, if d/(d + 2) < α < 1, for any number
of dimensions d ≥ 1.
This is a generalization of the property that the Gaussian
distribution maximizes the Shannon entropy H.
17. Leonenko’s Estimator (3)
Rényi, Tsallis and Shannon Entropy Estimates
For α ≠ 1, the estimate of I_α is
$$\hat{I}_{N,k,\alpha} = \frac{1}{N} \sum_{i=1}^{N} \left(\zeta_{N,i,k}\right)^{1-\alpha},$$
with
$$\zeta_{N,i,k} = (N-1)\, C_k\, V_d\, \left(\rho^{(i)}_{k,N-1}\right)^d,$$
where $\rho^{(i)}_{k,N-1}$ is the Euclidean distance from X_i to its k-th nearest
neighbour among the remaining N − 1 samples,
$V_d = \pi^{d/2} / \Gamma(d/2 + 1)$ is the volume of the unit ball B(0, 1) in $\mathbb{R}^d$,
and $C_k = \left[\Gamma(k)/\Gamma(k+1-\alpha)\right]^{\frac{1}{1-\alpha}}$.
18. Leonenko’s Estimator (4)
Rényi, Tsallis and Shannon Entropy Estimates (cont.)
The estimator $\hat{I}_{N,k,\alpha}$ is asymptotically unbiased, which is to say
that $\mathbb{E}[\hat{I}_{N,k,\alpha}] \to I_\alpha$ as N → ∞. It is also consistent under mild
conditions.
Given these conditions, the estimated Rényi entropy $H_\alpha$ of f is
$$\hat{H}_{N,k,\alpha} = \frac{\log \hat{I}_{N,k,\alpha}}{1 - \alpha}, \quad \alpha \neq 1,$$
and for the Tsallis entropy $S_\alpha = \frac{1}{\alpha - 1}\left(1 - \int f^\alpha(x)\, dx\right)$ the estimator is
$$\hat{S}_{N,k,\alpha} = \frac{1 - \hat{I}_{N,k,\alpha}}{\alpha - 1}, \quad \alpha \neq 1.$$
As N → ∞, both estimators are asymptotically unbiased
and consistent.
19. Leonenko’s Estimator (5)
Shannon Entropy Estimate
The limit of the Tsallis entropy estimator as α → 1 gives the
Shannon entropy H estimator:
$$\hat{H}_{N,k,1} = \frac{1}{N} \sum_{i=1}^{N} \log \xi_{N,i,k}, \qquad \xi_{N,i,k} = (N-1)\, e^{-\Psi(k)}\, V_d\, \left(\rho^{(i)}_{k,N-1}\right)^d,$$
where Ψ(k) is the digamma function: $\Psi(1) = -\gamma$, with
$\gamma \approx 0.5772$ the Euler–Mascheroni constant, and
$\Psi(k) = -\gamma + A_{k-1}$, where $A_0 = 0$ and $A_j = \sum_{i=1}^{j} \frac{1}{i}$.
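A sketch of both estimators (the function names, the default k = 5, and the use of scipy's cKDTree for the k-NN distances are our choices):

```python
import numpy as np
from scipy.special import gamma, digamma
from scipy.spatial import cKDTree

def _knn_distances(X, k):
    """rho^{(i)}_{k,N-1}: distance from each X_i to its k-th nearest neighbour."""
    # k + 1 because each point is returned as its own zero-distance neighbour.
    return cKDTree(X).query(X, k=k + 1)[0][:, k]

def shannon_entropy(X, k=5):
    """H_{N,k,1} of Leonenko et al. (2008), in nats; X is an (N, d) sample."""
    N, d = X.shape
    V_d = np.pi ** (d / 2) / gamma(d / 2 + 1)   # volume of the unit ball in R^d
    xi = (N - 1) * np.exp(-digamma(k)) * V_d * _knn_distances(X, k) ** d
    return np.mean(np.log(xi))

def renyi_entropy(X, k=5, alpha=0.9):
    """H_{N,k,alpha} for alpha != 1, via the estimate of I_alpha."""
    N, d = X.shape
    V_d = np.pi ** (d / 2) / gamma(d / 2 + 1)
    C_k = (gamma(k) / gamma(k + 1 - alpha)) ** (1 / (1 - alpha))
    zeta = (N - 1) * C_k * V_d * _knn_distances(X, k) ** d
    return np.log(np.mean(zeta ** (1 - alpha))) / (1 - alpha)
```

As a sanity check, for samples from a 1-D standard Gaussian, shannon_entropy(X) should approach (1/2) log(2πe) ≈ 1.419 nats as N grows.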
22. Image Categorization
Problem Statement
Design a minimal-complexity and maximal-accuracy classifier
based on forward MI-FS for coarse localization [Lozano et al, 09].
We have a bag of filters (say we have K different filters):
Harris detector, Canny, horizontal and vertical gradient, gradient
magnitude, 12 color filters based on Gaussians over the hue range,
and sparse depth.
We take the statistics of each filter response and compose a
histogram with, say, B bins.
Therefore, the total number of features NF per image is
NF = K × (B − 1).
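A sketch of this feature composition (ours; we assume each normalized B-bin histogram contributes B − 1 free values because its bins sum to one, which matches NF = K × (B − 1)):

```python
import numpy as np

def image_features(filter_responses, B=4):
    """Build the per-image feature vector from K filter response maps."""
    feats = []
    for resp in filter_responses:            # one 2-D response map per filter
        hist, _ = np.histogram(resp.ravel(), bins=B)
        hist = hist / hist.sum()             # normalize: bins sum to 1
        feats.extend(hist[:-1])              # keep the B-1 free bins
    return np.asarray(feats)                 # length K * (B - 1)
```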
23. Image Categorization (2)
Figure: Examples of filter responses. From top to bottom and left to
right: input image, depth, vertical gradient, gradient magnitude, and
five colour filters.
24. Image Categorization (3)
Figure: Greedy feature selection. Left: finding the optimal number of
bins B for K = 16 filters (10-fold CV error (%) vs. number of selected
features, for 16 filters × 2, 4 and 6 bins). Right: testing whether
|C| = 6 is a correct choice (4-bin histograms, for 2, 4, 6 and 8 classes).
31. Image Categorization (10)
Results
With mRMR, the lowest CV error (8.05%) is achieved with a
set of 17 features, while with MD a CV error of 4.16% is
achieved with a set of 13 features.
The CV errors yielded by MmD are very similar to those of MD.
On the other hand, test errors present a slightly different
evolution.
The best error is given by MD, and the worst one is given by
mRMR.
This may be explained by the fact that the entropy estimator
we use does not degrade its accuracy as the number of
dimensions increases.
32. Gene Selection
Microarray Analysis
The basic setup of a microarray platform is a glass slide which
is spotted with DNA fragments or oligonucleotides representing
some specific gene coding regions. Then, purified RNA is
labelled fluorescently or radioactively and it is hybridized to the
slide.
The hybridization can be done simultaneously with reference
RNA to enable comparison of data with multiple experiments.
Then washing is performed and the raw data are obtained by
scanning with a laser or by autoradiographic imaging. The
obtained expression levels of tens of thousands of genes are
numerically quantified and used for statistical studies.
33. Gene Selection (2)
Figure: An illustration of the basic idea of a microarray platform:
sample RNA and reference cDNA are fluorescently labeled, combined,
and hybridized to the slide.
34. Gene Selection (3)
MD Feature Selection on Microarray
Figure: MD-FS performance on the NCI microarray data set with 6,380
features (LOOCV error (%) and I(S;C) vs. number of features). Lowest
LOOCV error with a set of 39 features [Bonev et al.,08].
35. Gene Selection (4)
Comparison of FS criteria
Figure: Performance of FS criteria (MD, MmD, mRMR and a LOOCV
wrapper) on the NCI microarray, as LOOCV error (%) vs. number of
features. Only the selection of the first 36 features (out of 6,380)
is represented.
36. Gene Selection (5)
Figure: Feature selection on the NCI DNA microarray data: class
(disease) vs. selected genes, for MD (left) and mRMR (right) feature
selection. Features (genes) selected by both criteria (e.g. genes 135,
2080 and 6145) are marked with an arrow.
37. Gene Selection (6)
Results
Leukemia: 38 bone marrow samples, 27 Acute Lymphoblastic
Leukemia (ALL) and 11 Acute Myeloid Leukemia (AML), over 7,129
probes (features) from 6,817 human genes. Test data with 34 further
samples (20 ALL and 14 AML) are also provided. We obtained a 2.94%
test error with 7 features, while other authors report the same error
selecting 49 features via individualized markers. Other recent works
report test errors higher than 5% for the Leukemia data set.
Central Nervous System Embryonal Tumors: 60 patient samples,
21 survivors and 39 failures. There are 7,129 genes in the data
set. We selected 9 genes, achieving a 1.67% LOOCV error. The best
result in the literature is an 8.33% CV error with 100 features.
38. Gene Selection (7)
Results (cont)
Colon: 62 samples collected from colon-cancer patients. Among
them, 40 biopsies are from tumors and 22 are from healthy parts of
the colons of the same patients. From 6,500 genes, 2,000 were
selected for this data set, based on the confidence in the measured
expression levels. In our feature selection experiments, 15 of them
were selected, resulting in a 0% LOOCV error, while recent works
report errors higher than 12%.
Prostate: 25 tumor and 9 normal samples, with 12,600 genes.
Our experiments achieved the best error with only 5 features,
with a test error of 5.88%. Other works report test errors
higher than 6%.
39. Structural Recognition
Reeb Graphs
Given a surface S and a real function f : S → R, the Reeb
graph (RG) represents the topology of S through a graph
structure whose nodes correspond to the critical points of f.
The Extended Reeb Graph (ERG) [Biasotti, 05] is an
approximation of the RG obtained by using a fixed number of
level sets (63 in this work) that divide the surface into a set of
regions; critical regions, rather than critical points, are
identified according to the behaviour of f along the level sets; ERG
nodes correspond to critical regions, while the arcs are detected
by tracking the evolution of the level sets.
40. Structural Recognition (2)
Figure: Extended Reeb Graphs. Image by courtesy of Daniela Giorgi and
Silvia Biasotti. a) geodesic, b) distance to center, c) sphere.
41. Structural Recognition (3)
Catalog of Features
Subgraph node centrality: quantifies the degree of participation
of a node i in a subgraph (see the sketch after this list):
$$C_S(i) = \sum_{k=1}^{n} \varphi_k(i)^2\, e^{\lambda_k},$$
where n = |V|, λ_k is the k-th eigenvalue of the adjacency
matrix A, and φ_k its corresponding eigenvector.
Perron-Frobenius eigenvector: φn (the eigenvector
corresponding to the largest eigenvalue of A). The components
of this vector denote the importance of each node.
Adjacency spectra: the magnitudes |λ_k| of the (leading)
eigenvalues of A have been experimentally validated for
graph embedding [Luo et al.,03].
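A sketch of these three adjacency-based features from a single eigendecomposition (the function name and return layout are ours):

```python
import numpy as np

def adjacency_features(A):
    """Subgraph centrality, Perron-Frobenius eigenvector and spectrum of A."""
    lam, phi = np.linalg.eigh(A)          # ascending eigenvalues; phi[:, k] = eigenvector k
    # Subgraph centrality: C_S(i) = sum_k phi_k(i)^2 * exp(lambda_k)
    centrality = (phi ** 2) @ np.exp(lam)
    # Perron-Frobenius eigenvector (largest eigenvalue); fix the sign so the
    # components, i.e. the node importances, are non-negative.
    perron = phi[:, -1] * np.sign(phi[:, -1].sum())
    spectrum = np.sort(np.abs(lam))[::-1] # eigenvalue magnitudes, leading first
    return centrality, perron, spectrum
```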
42. Structural Recognition (4)
Catalog of Features (cont.)
Spectra from Laplacians: both from the un-normalized and
normalized ones. The Laplacian spectrum plays a fundamental
role in the development of regularization kernels for graphs.
Fiedler vector: the eigenvector corresponding to the first
non-trivial eigenvalue of the Laplacian (φ_2 in connected
graphs). It encodes the connectivity structure of the graph
(it is actually the core of graph-cut methods).
Commute times, coming from either the un-normalized or the
normalized Laplacian. They encode the path-length distribution
of the graph.
The Heat-Flow Complexity trace as seen in the section above.
43. Structural Recognition (5)
Backwards Filtering
The entropies H(·) of a set with a large number n of features
can be efficiently estimated using the k-NN-based method
developed by Leonenko et al.
Thus, we take the data set with all its features and determine
which feature to discard in order to produce the smallest
decrease of I (Sn−1 ; C ).
We then repeat the process for the features of the remaining
feature set, until only one feature is left [Bonev et al.,10].
Then, the subset yielding the minimum 10-fold CV error is selected.
Most of the spectral features are histogrammed into a variable
number of bins: 9 · 3 · (2 + 4 + 6 + 8) = 540 features.
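A sketch of this backward elimination loop (ours; mutual_information is any estimator of I(S; C), e.g. the entropy-based one built on Leonenko's estimator above):

```python
import numpy as np

def backward_elimination(X, labels, mutual_information):
    """Drop, at each step, the feature whose removal least decreases I(S; C).

    Returns the nested subsets, from all n features down to one; the subset
    with the minimum 10-fold CV error is then picked among them.
    """
    remaining = list(range(X.shape[1]))
    subsets = [list(remaining)]
    while len(remaining) > 1:
        # I(S_{n-1}; C) for each candidate reduced set (one feature removed).
        scores = [mutual_information(np.delete(X[:, remaining], j, axis=1), labels)
                  for j in range(len(remaining))]
        remaining.pop(int(np.argmax(scores)))   # keep the most informative set
        subsets.append(list(remaining))
    return subsets
```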
44. Experimental Results
Elements
SHREC database (15 classes × 20 objects). Each of the 300
samples is characterized by 540 features, and has a class label
l ∈ {human, cup, glasses, airplane, chair, octopus, table, hand,
fish, bird, spring, armadillo, buste, mechanic, four-leg }
The errors are measured by 10-fold cross validation (10-fold
CV). MI is maximized as the number of selected features grows.
A high number of features degrades the classification
performance.
However, neither the MI curve nor the selected features depend
on the classifier, since MI is a purely information-theoretic
measure.
46. Structural Recognition (8)
Figure: Feature selection on the 15-class experiment (left) and the feature
statistics for the best-error feature set (right).
47. Structural Recognition (9)
Analysis
For the 15-class problem, the minimal error (23.3%) is achieved
with a set of 222 features.
The coloured areas in the plot represent how much a feature is
used with respect to the remaining ones (the height on the Y
axis is arbitrary).
For the 15-class experiment, in the feature sets smaller than
100 features the most important feature is the Fiedler vector, in
combination with the remaining features. Commute time is also
an important feature. Features that are not relevant include
node centrality and the complexity flow. Turning our attention
to the graph types, all three appear relevant.
48. Structural Recognition (10)
Analysis (cont)
We can see that the four different binnings of the features do
have importance for graph characterization.
These conclusions concerning the relevance of each feature
cannot be drawn without performing some additional
experiments with different groups of graph classes.
We perform different 3-class experiments. The classes share
some structural similarities; for example, the 3 classes of the
first experiment have a head and limbs.
Although in each experiment the minimum error is achieved
with very different numbers of features, the participation of
each feature is highly consistent with the 15-class experiment.
51. Structural Recognition (14)
Comparison of forward and backward F.S.
Figure: Forward vs. backward FS (CV error (%) vs. number of
features, for forward selection and backward elimination).
52. Structural Recognition (15)
Figure: Bootstrapping experiments on forward (left) and backward
(right) feature selection, with 25 bootstraps. Top: test error (%)
and its standard deviation vs. number of features. Bottom: number of
feature coincidences among bootstraps vs. number of selected features.
53. Structural Recognition (16)
Conclusions
The main difference among experiments is that node centrality
seems to be more important for discerning among elongated
objects with long, sharp characteristics. Although all three
graph types are relevant, the sphere graph performs best for
blob-shaped objects.
Bootstrapping shows that the backward method is clearly better.
The IT approach not only yields good classification but also
reveals the role of each spectral feature in it!
Given other points of view (size functions, eigenfunctions, ...),
it is desirable to exploit them!
54. References
[Zhu et al.,97] Zhu, S.C., Wu, Y.N. and Mumford, D. (1997). Filters,
Random Fields And Maximum Entropy (FRAME): towards a unified
theory for texture modeling. IJCV, 27(2):107–126.
[Wang and Gotoh, 09] Wang, X. and Gotoh, O. (2009). Accurate
molecular classification of cancer using simple rules. BMC Medical
Genomics, 64(2).
[Guyon and Elisseeff, 03] Guyon, I. and Elisseeff, A. (2003). An
introduction to variable and feature selection. Journal of Machine
Learning Research, 3:1157–1182.
[Vasconcelos and Vasconcelos, 04] Vasconcelos, N. and Vasconcelos,
M. (2004). Scalable discriminant feature selection for image retrieval
and recognition. In CVPR 2004, pages 770–775.
55. References (2)
[Peng et al., 2005] Peng, H., Long, F., and Ding, C. (2005). Feature
selection based on mutual information: criteria of max-dependency,
max-relevance, and min-redundancy. IEEE Trans. on PAMI,
27(8):1226–1238.
[Leonenko et al, 08] Leonenko, N., Pronzato, L., and Savani, V.
(2008). A class of Rényi information estimators for multidimensional
densities. The Annals of Statistics, 36(5):2153–2182.
[Lozano et al, 09] Lozano, M.A., Escolano, F., Bonev, B., Suau, P.,
Aguilar, W., Sáez, J.M., Cazorla, M. (2009). Region and constell.
based categorization of images with unsupervised graph learning.
Image and Vision Computing, 27(7):960–978.
56. References (3)
[Bonev et al.,08] Bonev, B., Escolano, F., and Cazorla, M. (2008).
Feature selection, mutual information, and the classification of
high-dimensional patterns. Pattern Analysis and Applications,
11(3-4):304–319.
[Biasotti, 05] Biasotti, S. (2005). Topological coding of surfaces with
boundary using Reeb graphs. Computer Graphics and Geometry,
7(1):31–45.
[Luo et al.,03] Luo, B., Wilson, R., and Hancock, E. (2003). Spectral
embedding of graphs. Pattern Recognition, 36(10):2213–2223.
[Bonev et al.,10] Bonev, B., Escolano, F., Giorgi, D., Biasotti, S.
(2010). High-dimensional spectral feature selection for 3D object
recognition based on Reeb graphs. SSPR 2010 (accepted).