Learning a structured model for visual category recognition

Ashish Gupta
University of Surrey
a.gupta@surrey.ac.uk
July 5, 2013

Introduction
Introduction : What is Category Recognition?
Feature vector Embedding : Information in Sub-Manifold.
Feature vector distribution: Fuzzy Visual Model.
Estimating semantic structure: Co-clustering.
Sparse Models: Semantically structured.
Summary & Future Work

Motivation
Visual Category?
Robot interacts with physical objects.
Object taxonomy is based on physical properties.
Robot recognizes objects using visual appearance.

Motivation
Visual Category Model
Appearance variation → scatter of semantically related descriptors in feature space.
Can this scatter distribution be estimated?
Can this structure be used to improve the learnt visual model?
Visual category model ≈ Visual object model + Estimated structure of visual category variation

Approach
Visual Classification Pipeline
Structure in sub-spaces → groups of sub-spaces → dictionary
Structure in dictionary → groups of prototypes → encoding

Approach
Feature Descriptor Matrix
Figure: matrix of 500 D-SIFT feature descriptors from Scene-15, each of 128 dimensions (axes: feature vectors, dimensions).

Approach
Encoded Feature Matrix
Conceptual illustration of encoded feature matrix, occurrence
histogram of visual words in images.
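
As a rough illustration of the occurrence histogram described above, the following sketch hard-assigns each descriptor to its nearest visual word and counts the occurrences. The dictionary and descriptors are random stand-ins, and the function name `bow_histogram` is hypothetical rather than taken from the slides.

```python
import numpy as np

def bow_histogram(descriptors, dictionary):
    """Normalized occurrence histogram of visual words for one image."""
    # Squared Euclidean distances, shape (n_descriptors, n_words).
    d2 = ((descriptors[:, None, :] - dictionary[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)  # index of the closest visual word
    hist = np.bincount(nearest, minlength=dictionary.shape[0]).astype(float)
    return hist / hist.sum()

rng = np.random.default_rng(0)
dictionary = rng.normal(size=(8, 128))      # 8 hypothetical visual words
descriptors = rng.normal(size=(500, 128))   # stand-in for the descriptors of one image
h = bow_histogram(descriptors, dictionary)
```

Stacking one such histogram per image gives the encoded feature matrix (images by visual words).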

Approach
Conceptual Interpretation
Structure estimation can be interpreted as estimation of
semantically related rows or columns of data matrix. These are
projected to a lower dimensional space such that mutual separation
between equivalent feature vectors is reduced.

Sub-space Embedding
Feature descriptor space is high dimensional.
Relevant information is embedded in a lower dimensional
sub-manifold.
What is the appropriate lower dimensionality?
How can the efficacy of a sub-space embedding method be measured?
Measure the information in the embedded feature vectors.

Intrinsic Dimensionality
Estimating the intrinsic dimensionality p
Correlation Dimension
Number of feature vectors in a hypersphere of radius r is proportional to r^p.
Maximum Likelihood Estimate
Expectation of the number of feature vectors covered by a hypersphere of growing radius r.
Eigenvalue Estimate
Number of eigenvalues greater than a small threshold value.
Geodesic Minimum Spanning Tree
Based on the length of the GMST of k descriptors in a neighbourhood graph.
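
Two of the estimators listed above can be sketched in a few lines. This is a minimal illustration, not the implementation used for the reported results: the correlation dimension is taken as the slope of log C(r) between two radii, the eigenvalue estimate counts normalized covariance eigenvalues above a threshold, and the radii, the threshold value 0.025 and all function names are assumptions.

```python
import numpy as np

def pairwise_distances(X):
    """Euclidean distances between all distinct pairs of rows of X."""
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    iu = np.triu_indices(X.shape[0], k=1)
    return np.sqrt(np.maximum(d2[iu], 0.0))

def correlation_dimension(X, r_small, r_large):
    """Slope of log C(r) versus log r, where C(r) is the fraction of pairs closer than r."""
    d = pairwise_distances(X)
    c_small = np.mean(d < r_small)
    c_large = np.mean(d < r_large)
    return (np.log(c_large) - np.log(c_small)) / (np.log(r_large) - np.log(r_small))

def eigenvalue_dimension(X, threshold=0.025):
    """Number of normalized covariance eigenvalues above a small threshold."""
    eig = np.linalg.eigvalsh(np.cov(X, rowvar=False))
    eig = eig / eig.sum()
    return int(np.sum(eig > threshold))

# Synthetic check: a 2-D plane embedded in a 10-D space.
rng = np.random.default_rng(0)
basis, _ = np.linalg.qr(rng.normal(size=(10, 2)))   # orthonormal columns
X = rng.uniform(size=(1000, 2)) @ basis.T
p_corr = correlation_dimension(X, 0.05, 0.2)
p_eig = eigenvalue_dimension(X)
```

On this synthetic plane both estimates should land near 2; on real descriptors the result depends on the chosen radii and threshold.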

Estimated Intrinsic Dimensionality

Subspace Embedding Methods
Global Methods
Principal Components
Multi-Dimensional Scaling
Stochastic Proximity Embedding
Isomap
Diffusion Maps
Local Methods
Locally Linear Embedding
Locality Preserving Projection
Neighbourhood Preserving Projection
Landmark Isomap
t-Stochastic Neighbourhood Embedding
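
For orientation, the simplest of the global methods listed above (principal components) can be sketched as follows. The data are random stand-ins, the target dimensionality of 14 follows the estimate quoted later in this deck, and `pca_embed` is a hypothetical name.

```python
import numpy as np

def pca_embed(X, p):
    """Project the rows of X onto their first p principal components."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:p].T

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 128))   # stand-in for 500 descriptors of 128 dimensions
Y = pca_embed(X, 14)
```

The local methods in the list replace this single global projection with one derived from neighbourhood graphs.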

Entropic Measure
Entropy Measure Intuition
Figure: 3-D scatter plots (axes X, Y, Z) of the 'swiss' synthetic data, the 'intersect' synthetic data and the 'VOC2006,car' data, together with the distribution of pair-wise distances in each data set (axes: bin index, normalized frequency). Legend: swiss, H=−25.3355; intersect, H=−19.3150; VOC2006,car, H=−33.0302.
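
The intuition behind the measure can be sketched as the entropy of the histogram of pair-wise distances. The exact definition that produced the H values above is not given in the slides (those values are negative), so the Shannon entropy used here, the 100 bins and the function name are assumptions.

```python
import numpy as np

def pairwise_distance_entropy(X, n_bins=100):
    """Shannon entropy (in nats) of the normalized histogram of pair-wise distances."""
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    d = np.sqrt(np.maximum(d2[np.triu_indices(len(X), k=1)], 0.0))
    counts, _ = np.histogram(d, bins=n_bins)
    p = counts / counts.sum()
    p = p[p > 0]                      # empty bins contribute nothing
    return float(-(p * np.log(p)).sum())

rng = np.random.default_rng(0)
cloud = rng.normal(size=(300, 3))                             # unstructured cloud
clusters = np.vstack([rng.normal(0.0, 0.01, size=(150, 3)),   # two tight clusters
                      rng.normal(10.0, 0.01, size=(150, 3))])
H_cloud = pairwise_distance_entropy(cloud)
H_clusters = pairwise_distance_entropy(clusters)
```

Strongly structured data concentrate their pair-wise distances in a few bins, which lowers the entropy relative to an unstructured cloud.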

Empirical Results
Comparison of Embedded Entropy

Empirical Results
Computational Time Complexity

Empirical Results
Classification Performance

Empirical Results
Conclusion
The estimated intrinsic dimensionality of the 128-dimensional descriptor was in the neighbourhood of 14.
The performance of LPP in comparison to other embedding methods underlines the importance of modelling structure in local distributions.

Fuzzy Visual Model
Structure in distribution of descriptors in feature space?
Issues with K-means clustering in the Bag-of-Words model.
Visual model incorporating Fuzzy logic framework.

Visual Ambiguity
Descriptor assignment has issues of uncertainty and
plausibility.
Kernel Codebook uses soft-assignment to resolve the
ambiguity.
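
A minimal sketch of soft assignment with a Gaussian kernel, as used by kernel codebooks, is given below. The kernel width `sigma`, the random data and the function names are assumptions for illustration only.

```python
import numpy as np

def squared_distances(Z, D):
    """Squared Euclidean distances between descriptors Z and dictionary elements D."""
    return ((Z[:, None, :] - D[None, :, :]) ** 2).sum(axis=2)

def kernel_codebook_encode(Z, D, sigma):
    """Average of Gaussian-kernel soft assignments over all descriptors."""
    d2 = squared_distances(Z, D)
    d2 = d2 - d2.min(axis=1, keepdims=True)    # shift for numerical stability
    w = np.exp(-d2 / (2.0 * sigma ** 2))
    w /= w.sum(axis=1, keepdims=True)          # each descriptor's weights sum to 1
    return w.mean(axis=0)

def hard_encode(Z, D):
    """Bag-of-Words hard assignment, for comparison."""
    nearest = squared_distances(Z, D).argmin(axis=1)
    return np.bincount(nearest, minlength=len(D)) / len(Z)

rng = np.random.default_rng(0)
D = rng.normal(size=(5, 16))      # hypothetical dictionary of 5 elements
Z = rng.normal(size=(200, 16))    # hypothetical descriptors
soft = kernel_codebook_encode(Z, D, sigma=2.0)
nearly_hard = kernel_codebook_encode(Z, D, sigma=1e-3)
hard = hard_encode(Z, D)
```

As `sigma` shrinks, the soft assignment approaches the hard Bag-of-Words histogram; larger values spread each descriptor over several dictionary elements.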

Fuzzy Models
Visual Dictionary
Figure: three partitions of the Motorcycle data (axes: time and acceleration, both on a normalized scale): K-means hard partition, Fuzzy K-means partition, and Gustafson-Kessel fuzzy partition.
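K-means:
\[ L(Z; \mu_C) = \sum_{j=1}^{r} \sum_{i \in C_j} \| z_i - \mu_{C_j} \|^2 \]
Fuzzy K-means:
\[ L(Z; D, A) = \sum_{i=1}^{r} \sum_{j=1}^{n} (\alpha_{ij})^m \, \| z_j - \mu_{C_i} \|^2_{\Sigma} \]
Gustafson-Kessel:
\[ L(Z; D, A, \{\Sigma_i\}) = \sum_{i=1}^{r} \sum_{j=1}^{n} (\alpha_{ij})^m \, \| z_j - d_i \|^2_{\Sigma_i} \]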
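
The fuzzy K-means objective can be minimized by alternating membership and centre updates. The sketch below uses the standard fuzzy c-means updates with a plain Euclidean norm; the farthest-point initialisation, the fuzzifier m = 2 and the function names are assumptions, not details from the slides.

```python
import numpy as np

def memberships(Z, D, m):
    """Membership matrix A of shape (r, n); each column sums to 1."""
    d2 = ((Z[None, :, :] - D[:, None, :]) ** 2).sum(axis=2) + 1e-12
    inv = d2 ** (-1.0 / (m - 1.0))
    return inv / inv.sum(axis=0, keepdims=True)

def fuzzy_kmeans(Z, r, m=2.0, n_iter=100, seed=0):
    """Alternate membership and centre updates for r fuzzy clusters."""
    rng = np.random.default_rng(seed)
    # Farthest-point initialisation (an assumption made for this sketch).
    centres = [Z[rng.integers(len(Z))]]
    for _ in range(1, r):
        d2 = np.min([((Z - c) ** 2).sum(axis=1) for c in centres], axis=0)
        centres.append(Z[d2.argmax()])
    D = np.array(centres)
    for _ in range(n_iter):
        W = memberships(Z, D, m) ** m               # weights (alpha_ij)^m
        D = (W @ Z) / W.sum(axis=1, keepdims=True)  # weighted means
    return D, memberships(Z, D, m)

rng = np.random.default_rng(1)
Z = np.vstack([rng.normal(0.0, 0.3, size=(100, 2)),
               rng.normal(5.0, 0.3, size=(100, 2))])
D, A = fuzzy_kmeans(Z, r=2)
```

Each column of `A` holds the graded memberships of one feature vector, in contrast to the single hard label produced by K-means.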

Fuzzy Models
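\[ d^2_{\Sigma}(z, \mu_C) = (z - \mu_C)^T \, \Sigma \, (z - \mu_C) \]
\[ \Sigma = \begin{pmatrix} (1/\sigma_1)^2 & 0 & \cdots & 0 \\ 0 & (1/\sigma_2)^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & (1/\sigma_n)^2 \end{pmatrix} \]
\[ d^2_{\Sigma_i}(z_j, \mu_{C_i}) = (z_j - \mu_{C_i})^T \, \Sigma_i \, (z_j - \mu_{C_i}) \]
\[ F_i = \frac{\sum_{j=1}^{n} (\alpha_{ij})^m (z_j - d_i)(z_j - d_i)^T}{\sum_{j=1}^{n} (\alpha_{ij})^m} \]
\[ \Sigma_i = (\rho_i \det(F_i))^{1/p} \, F_i^{-1} \]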
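
The cluster-specific norm of the Gustafson-Kessel model can be sketched as follows. The sketch uses the standard Gustafson-Kessel form, in which the norm-inducing matrix is the scaled inverse of the fuzzy covariance F_i, so that its determinant equals the volume constant rho_i. The memberships, the data and the function names are hypothetical.

```python
import numpy as np

def gk_norm_matrix(Z, d_i, alpha_i, m=2.0, rho=1.0):
    """Fuzzy covariance F_i and norm-inducing matrix Sigma_i of one cluster."""
    w = alpha_i ** m                              # weights (alpha_ij)^m
    diff = Z - d_i
    F = (w[:, None] * diff).T @ diff / w.sum()    # weighted covariance
    p = Z.shape[1]                                # descriptor dimensionality
    Sigma = (rho * np.linalg.det(F)) ** (1.0 / p) * np.linalg.inv(F)
    return F, Sigma

def gk_distance2(z, d_i, Sigma):
    """Squared distance (z - d_i)^T Sigma (z - d_i)."""
    diff = z - d_i
    return float(diff @ Sigma @ diff)

rng = np.random.default_rng(0)
Z = rng.normal(size=(500, 2)) * np.array([3.0, 0.3])   # cluster elongated along x
d_i = Z.mean(axis=0)
alpha_i = np.ones(len(Z))                              # hypothetical memberships
F, Sigma = gk_norm_matrix(Z, d_i, alpha_i)
along = gk_distance2(d_i + np.array([3.0, 0.0]), d_i, Sigma)
across = gk_distance2(d_i + np.array([0.0, 3.0]), d_i, Sigma)
```

A displacement along the elongated direction of the cluster counts as much closer than an equal displacement across it, which is how the model adapts to the local distribution.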

Empirical Results
FKM Classification Performance
Figure: per-category classification accuracy (Acc) on Scene15 and VOC2006, comparing Bag-of-Words with Fuzzy K-means.
Empirical Results
GK Classification Performance
Figure: per-category classification accuracy (Acc) on Scene15 and VOC2006, comparing Bag-of-Words with Gustafson-Kessel.

Empirical Results
Dictionary Size
Figure: classification accuracy (Acc) on Caltech101 for dictionary sizes 32, 64, 128, 256 and 512; one panel compares Bag-of-Words with Fuzzy K-means, the other Bag-of-Words with Gustafson-Kessel.
Comparison of BoW with FKM and GK for different sizes of dictionary.

Empirical Results
Aggregate Performance
Figure: classification accuracy (Acc) of Bag-of-Words, Fuzzy K-means and Gustafson-Kessel on (a) VOC datasets (VOC2006, VOC2010) and (b) Caltech datasets (Caltech101, Caltech256).

Visual Model   VOC-2006   VOC-2010   Caltech-101   Caltech-256
BoW            0.50825    0.52446    0.60111       0.67606
FKM            0.52635    0.53736    0.61928       0.68357
G-K            0.52885    0.54224    0.62413       0.68623

Empirical Results
Conclusion
A visual model learnt within the framework of fuzzy logic adapts to the local distribution of feature vectors.
Learning a better fuzzy membership function is an effective alternative to learning increasingly large dictionaries to adapt to the increasing complexity of visual categories.

Co-clustering for Structure Estimation
What is co-clustering?
Co-clustering for structure in descriptor data matrix.
Co-clustering for structure in encoded feature matrix.

Co-clustering Methods
Co-clustering
Co-clustering is the simultaneous, alternating clustering of the rows and columns of a data matrix.
At each step of the optimization routine, the groups of rows guide column clustering and vice versa.
\[ C_X : \{x_1, \ldots, x_m\} \to \{\hat{x}_1, \ldots, \hat{x}_k\} \]
\[ C_Y : \{y_1, \ldots, y_n\} \to \{\hat{y}_1, \ldots, \hat{y}_l\} \]

Co-clustering methods
Information-Theoretic Co-Clustering
The data matrix is considered a joint probability distribution. Minimizes the KL-divergence between the original and the co-clustered matrices.
Sum-Squared Residue Co-Clustering
Alternating k-means clustering of rows and columns. Minimizes the squared Euclidean distance of rows and columns from the row and column means respectively.
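
One common form of the sum-squared residue objective measures how far each entry lies from the mean of its co-cluster. The sketch below only evaluates that objective for given row and column assignments; it omits the alternating optimization, and the toy matrix, labels and function names are hypothetical.

```python
import numpy as np

def block_mean_approximation(A, row_labels, col_labels, k, l):
    """Replace every entry of A by the mean of the co-cluster it belongs to."""
    R = np.eye(k)[row_labels]                     # (m, k) one-hot row memberships
    C = np.eye(l)[col_labels]                     # (n, l) one-hot column memberships
    sums = R.T @ A @ C                            # per-co-cluster sums
    counts = np.outer(R.sum(axis=0), C.sum(axis=0))
    means = sums / np.maximum(counts, 1)          # guard against empty co-clusters
    return R @ means @ C.T

def sum_squared_residue(A, row_labels, col_labels, k, l):
    """Squared Frobenius distance between A and its block-mean approximation."""
    approx = block_mean_approximation(A, row_labels, col_labels, k, l)
    return float(((A - approx) ** 2).sum())

# Toy matrix with an exact 2 x 2 block structure.
block_values = np.array([[1.0, 2.0], [3.0, 4.0]])
true_rows = np.array([0, 0, 1, 1])
true_cols = np.array([0, 1, 1])
A = block_values[true_rows][:, true_cols]
res_true = sum_squared_residue(A, true_rows, true_cols, 2, 2)
res_wrong = sum_squared_residue(A, np.array([0, 1, 0, 1]), true_cols, 2, 2)
```

A co-clustering algorithm of this family would alternate row and column reassignments to drive this residue down.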

Information-Theoretic Co-clustering
\[ I(X; Y) - I(\hat{X}; \hat{Y}) = d_{KL}\big(p(X, Y),\, q(X, Y)\big) \]
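
The left-hand side of this identity, the loss in mutual information caused by a co-clustering, can be evaluated directly. The sketch below normalizes a non-negative data matrix into a joint distribution and compares the mutual information before and after aggregation into co-clusters; the toy matrix and function names are hypothetical.

```python
import numpy as np

def mutual_information(P):
    """Mutual information (in nats) of a joint distribution given as a matrix."""
    px = P.sum(axis=1, keepdims=True)
    py = P.sum(axis=0, keepdims=True)
    mask = P > 0                                  # 0 log 0 is taken as 0
    return float((P[mask] * np.log(P[mask] / (px @ py)[mask])).sum())

def itcc_loss(A, row_labels, col_labels, k, l):
    """I(X;Y) - I(X_hat;Y_hat) for given row and column cluster assignments."""
    P = A / A.sum()                               # joint distribution p(X, Y)
    R = np.eye(k)[row_labels]
    C = np.eye(l)[col_labels]
    P_hat = R.T @ P @ C                           # joint distribution of the clusters
    return mutual_information(P) - mutual_information(P_hat)

block_values = np.array([[4.0, 1.0], [1.0, 4.0]])
rows = np.array([0, 0, 1, 1])
cols = np.array([0, 0, 1, 1])
A = block_values[rows][:, cols]
loss_true = itcc_loss(A, rows, cols, 2, 2)
loss_wrong = itcc_loss(A, np.array([0, 1, 0, 1]), cols, 2, 2)
```

The loss is never negative, and it vanishes when the co-clusters capture the block structure of the matrix exactly.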

Multiple Sub-spaces
Multiple Sub-spaces Intuition
\[ \sum_{i,j} d_E\left(z^{\bullet}_{i|S_l},\, z^{\bullet}_{j|S_q}\right) > \sum_{i,j} d_E\left(z^{\bullet}_i,\, z^{\bullet}_j\right), \quad l = q \]

Multiple Sub-spaces
Co-clustering descriptor data matrix
Figure: the Scene-15 D-SIFT descriptor matrix (500 feature vectors of 128 dimensions) before and after Information-Theoretic Co-Clustering into 10 row and 10 column clusters (axes: feature vectors, dimensions).

Multiple Sub-spaces
Dictionary on single and multiple sub-spaces
Figure: two universal dictionaries of 500 elements in 10 dimensions, computed on VOC-2006 D-SIFT descriptors: PCA + K-means (Universal PCA Dictionary) and SSRCC + K-means (Universal CC Dictionary).

Multiple Sub-spaces
Classification performance
Figure: F1 score on VOC2006 and VOC2007 in two panels. Legends: Dict: 10x1000, MSSD:(i): 5x1000, MSSD:(r): 5x1000; and Dict: 10x1000, MSSD:(i): 10x1000, MSSD:(r): 10x1000.
Comparison of classification performance of single and multiple sub-space dictionaries.

Multiple Sub-spaces
Dictionary projected to multiple sub-spaces
Figure: universal dictionary of 500 elements over 128 dimensions (VOC-2006, D-SIFT, K-means), and the universal sub-manifold dictionary with the 128 dimensions grouped into 10 sub-manifolds (SSRCC + K-means).

Multiple Sub-spaces
Classification performance
Figure: F1(5) and F1(50) on VOC2006 and VOC2007. Legend: Dict: 128x1000, SSSD:(i): 128x1000, SSSD:(r): 128x1000.
Comparison of classification performance of dictionary projected to multiple sub-spaces.

Topic Dictionary
Structure in Dictionary Intuition
Estimating groups of non-contiguous partitions of feature space
that are semantically related.

Topic Dictionary
Topic Dictionary Concept

Topic Dictionary
Classification Performance
Comparison of classification performance of dictionaries using BoW and ITCC, for VOC2006 and Scene15 datasets.

Topic Dictionary
Dictionary sizes
Figure: F1 score on VOC2006, VOC2007, VOC2010, Scene15 and Caltech101 in three panels, comparing BoW with CC:i for dictionary sizes 100, 500 and 1000.
Comparative classification performance for different dictionary sizes.

Topic Dictionary
Conclusion
Groups of sub-spaces computed using co-clustering yielded dictionaries with better classification performance.
Groups of feature space partitions (dictionary elements) yielded improved classification results.
These estimated groups can be used in learning a semantically structured visual model.

Sparse Decomposition

Sparse Visual Model
A sparse model approximates a feature vector as a combination of a sub-set of an over-complete basis set.
Sparsity is induced by adding a regularization constraint on the coefficients to the loss function.
Degree of sparsity is determined empirically.
Each basis element is considered individually.
Possible structure amongst basis elements is disregarded.

Structured Sparse Model
SSPCA (structure in sub-spaces)
Co-clustered groups of sub-spaces are used to augment Sparse-PCA to compute a Structured Sparse-PCA dictionary.
Group Lasso (structure in dictionary)
Co-clustered groups of dictionary elements are used to augment Lasso to compute a group Lasso feature encoding.

Sparse Regularization
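Sparse regularization:
\[ \min_{\alpha} \frac{1}{n} \sum_{i=1}^{n} L(z_i, D\alpha_i) + \lambda \Omega(\alpha) \]
Lasso:
\[ \min_{\alpha} \frac{1}{n} \sum_{i=1}^{n} \left( \| z_i - D\alpha_i \|^2 + \lambda \| \alpha_i \|_1 \right) \]
Group sparsity:
\[ \min_{\alpha} \frac{1}{n} \sum_{i=1}^{n} \left( \| z_i - D\alpha_i \|^2 + \lambda \sum_{j=1}^{k} \| \alpha_{i,G_j} \| \right) \]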
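
The difference between the Lasso and group-sparsity penalties can be illustrated with proximal gradient descent (ISTA) for a single feature vector. This is a sketch under stated assumptions: the loss carries a factor of 1/2 (which only rescales lambda), the group norm is taken to be Euclidean, and the dictionary, the three groups and all function names are hypothetical stand-ins for co-clustered groups of dictionary elements.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||v||_1: shrinks each coefficient individually."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def group_soft_threshold(v, groups, t):
    """Proximal operator of t * sum_j ||v_Gj||_2: shrinks each group as a whole."""
    out = np.zeros_like(v)
    for g in groups:
        norm = np.linalg.norm(v[g])
        if norm > t:
            out[g] = (1.0 - t / norm) * v[g]
    return out

def ista(z, D, lam, prox, n_iter=500):
    """Minimize 0.5 * ||z - D a||^2 + lam * penalty(a) by proximal gradient steps."""
    step = 1.0 / np.linalg.norm(D, 2) ** 2        # 1 / Lipschitz constant
    alpha = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ alpha - z)
        alpha = prox(alpha - step * grad, step * lam)
    return alpha

def objective(z, D, alpha, lam, penalty):
    return 0.5 * np.sum((z - D @ alpha) ** 2) + lam * penalty(alpha)

rng = np.random.default_rng(0)
D = rng.normal(size=(20, 12))                      # hypothetical dictionary
groups = [np.arange(0, 4), np.arange(4, 8), np.arange(8, 12)]
alpha_true = np.zeros(12)
alpha_true[4:8] = [1.0, -2.0, 1.5, 0.5]            # only the second group is active
z = D @ alpha_true

l1_penalty = lambda a: np.abs(a).sum()
group_penalty = lambda a: sum(np.linalg.norm(a[g]) for g in groups)
alpha_l1 = ista(z, D, 0.5, soft_threshold)
alpha_group = ista(z, D, 0.5, lambda v, t: group_soft_threshold(v, groups, t))
```

The only difference between the two encoders is the proximal step: the Lasso zeroes coefficients one at a time, whereas the group penalty keeps or discards whole groups of dictionary elements together.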

Structured Sub-space
Structured Sub-space Dictionary using ITCC
Figure: per-category mAP on VOC2006 and VOC2007, comparing Sparse Subspace with Structured Subspace.

Structured Sub-space Dictionary using SSRCC
Figure: per-category mAP on VOC2006 and VOC2007, comparing Sparse Subspace with Structured Subspace.

Data Set   Sparse Subspace   Structured Sparse Subspace (ITCC)   Structured Sparse Subspace (SSRCC)
VOC2006    67.5941           70.8295                             68.5808
VOC2007    67.9971           68.0783                             68.3718
Sparse selection of semantically related set of sub-spaces
performs better than sparse individual selection of sub-spaces.

Structured Sparse Dictionary
Structured Sparse Encoding using ITCC
Figure: per-category mAP on Scene15 and VOC2006 using ITCC, comparing Sparse Encoding with Structured Encoding.
Structured Sparse Encoding using SSRCC
Figure: per-category mAP on Scene15 and VOC2006 using SSRCC, comparing Sparse Encoding with Structured Encoding.

Data Set   Sparse Encoding   Structured Sparse Encoding (ITCC)   Structured Sparse Encoding (SSRCC)
VOC-2006   72.8386           73.3977                             72.7738
Scene-15   68.5737           79.8794                             72.1155
Sparse selection of semantically related sets of dictionary elements performs better than sparse individual selection of dictionary elements.

Summary
Learning semantically relevant structure in feature space was used to compute better visual models.
Analysis of sub-space embedding emphasized the importance of modelling local distributions.
Incorporating a fuzzy logic framework to learn dictionary kernels that adapt to local distributions yielded better visual models.
Co-clustering was successful in grouping semantically related sub-spaces and feature space partitions.
Estimated groups of sub-spaces and dictionary elements were used to compute structured sparse visual models, improving upon regular sparse models.

Future Work
Future Work
Visual models using Fisher Kernel coding, which uses a Gaussian kernel, have been very successful. Combining the approach in Fisher Kernels with the learnt fuzzy membership functions could potentially improve the visual model.
Fuzzy logic based learning algorithms that are more advanced than Gustafson-Kessel could be explored to learn better membership functions.
Co-clustering creates a block factorization of the data matrix. Partial membership of rows and columns in the co-clusters would be the natural next step.
Explore ways of using semantic structure to improve feature generation techniques, such as hierarchical models that aim to learn category-specific descriptors.

Future Work
End
Questions...

Appendices
BoW Partitioning
Figure: Bag-of-Words partition of image '000017' in the VOC-2006 dataset (axes: x, y). The dictionary of size 25 is computed using K-means clustering. The feature vectors are projected to 2 dimensions using PCA.

Appendices
FKM Partitioning
Figure: Fuzzy K-means fuzzy partition of image '000017' in the VOC-2006 dataset (axes: x, y).

Appendices
GK Partitioning
Figure: Gustafson-Kessel fuzzy partition of image '000017' in the VOC-2006 dataset (axes: x, y).
