1. Sparse Kernel Learning for Image Annotation
Sean Moran and Victor Lavrenko
Institute of Language, Cognition and Computation
School of Informatics
University of Edinburgh
ICMR’14 Glasgow, April 2014
5. Previous work
Topic models: latent Dirichlet allocation (LDA) [Barnard et al. '03], Machine Translation [Duygulu et al. '02]
Mixture models: Continuous Relevance Model (CRM) [Lavrenko et al. '03], Multiple Bernoulli Relevance Model (MBRM) [Feng et al. '04]
Discriminative models: Support Vector Machine (SVM) [Verma and Jawahar '13], Passive Aggressive Classifier [Grangier '08]
Local learning models: Joint Equal Contribution (JEC) [Makadia et al. '08], Tag Propagation (Tagprop) [Guillaumin et al. '09], Two-pass KNN (2PKNN) [Verma et al. '12]
6. Combining different feature types
Previous work: linear combination of feature distances in a weighted summation with "default" kernels:
[Figure: kernel shapes GG(x; p) for p = 1 (Laplacian), p = 2 (Gaussian), p = 15 (near-uniform)]
Standard kernel assignment: Gaussian for Gist, Laplacian for colour features, χ2 for SIFT
7. Data-adaptive visual kernels
Our contribution: permit the visual kernels themselves to adapt to the data:
[Figure: kernel shapes GG(x; p) for p = 1 (Laplacian), p = 2 (Gaussian), p = 15 (near-uniform), on Corel 5K]
Hypothesis: the optimal kernels for GIST, SIFT etc. depend on the image dataset itself
8. Data-adaptive visual kernels
Our contribution: permit the visual kernels themselves to adapt to the data:
[Figure: kernel shapes GG(x; p) for p = 1 (Laplacian), p = 2 (Gaussian), p = 15 (near-uniform), on IAPR TC12]
Hypothesis: the optimal kernels for GIST, SIFT etc. depend on the image dataset itself
10. Continuous Relevance Model (CRM)
CRM estimates the joint distribution of image features (f) and words (w) [Lavrenko et al. '03]:

P(w, f) = Σ_{J ∈ T} P(J) ∏_{j=1}^{N} P(w_j | J) ∏_{i=1}^{M} P(f_i | J)

P(J): uniform prior over training images J
P(f_i | J): Gaussian non-parametric kernel density estimate
P(w_j | J): multinomial with word smoothing
Estimate the marginal probability distribution over individual tags:

P(w | f) = P(w, f) / Σ_{w'} P(w', f)

The top (e.g. 5) words with highest P(w | f) are used as the annotation
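The CRM annotation step above can be sketched in code. This is a minimal, illustrative implementation under simplifying assumptions (isotropic Gaussian KDE for P(f_i | J), precomputed smoothed word distributions); the function and argument names are hypothetical, not from the paper's code:

```python
import numpy as np

def crm_annotate(test_feats, train_feats, train_word_dists, vocab,
                 n_words=5, beta=1.0):
    """Sketch of CRM annotation: score P(w, f) by summing over training images J.

    test_feats: (M, D) region feature vectors for the test image.
    train_feats: list of (M_J, D) arrays, one per training image J.
    train_word_dists: (T, V) smoothed multinomial P(w | J) per training image.
    """
    T = len(train_feats)
    joint = np.zeros(train_word_dists.shape[1])
    for J in range(T):
        # P(f_i | J): Gaussian kernel density estimate over J's region features
        diffs = test_feats[:, None, :] - train_feats[J][None, :, :]
        k = np.exp(-np.sum(diffs ** 2, axis=2) / (2 * beta ** 2))
        p_f = np.prod(k.mean(axis=1))                   # product over test regions i
        joint += (1.0 / T) * p_f * train_word_dists[J]  # uniform prior P(J)
    p_w_given_f = joint / joint.sum()                   # marginal P(w | f)
    top = np.argsort(p_w_given_f)[::-1][:n_words]
    return [vocab[t] for t in top]
```

In practice the per-word multinomials would be estimated from the training captions with smoothing, and the feature kernel bandwidth β tuned on validation data.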
11. Sparse Kernel Learning CRM (SKL-CRM)
Introduce a binary kernel-feature alignment matrix Ψ_{u,v}:

P(I | J) = ∏_{i=1}^{M} Σ_{j=1}^{R} exp( −(1/β) Σ_{u,v} Ψ_{u,v} k_v(f_i^u, f_j^u) )

k_v(f_i^u, f_j^u): v-th kernel function applied to the u-th feature type
β: kernel bandwidth parameter
Goal: learn Ψ_{u,v} by directly maximising annotation F1 score on a held-out validation dataset
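The masked kernel sum inside the exponent can be written down directly. A minimal sketch with a hypothetical interface (dicts keyed by feature type and kernel name; the kernel callables here return distances, which the exponent turns into similarities):

```python
import numpy as np

def skl_crm_image_likelihood(test_feats, train_feats, psi, kernels, beta=1.0):
    """Sketch of the SKL-CRM image likelihood P(I|J) with a binary
    kernel-feature alignment matrix psi.

    test_feats / train_feats: dict feature-type u -> (M, D_u) / (R, D_u) array.
    psi: dict (u, v) -> 0/1, selecting kernel v for feature type u.
    kernels: dict v -> callable k(x, y) on single feature vectors.
    """
    feature_types = list(test_feats)
    M = next(iter(test_feats.values())).shape[0]
    R = next(iter(train_feats.values())).shape[0]
    likelihood = 1.0
    for i in range(M):
        s = 0.0
        for j in range(R):
            # sum only the (feature, kernel) pairs switched on in psi
            total = sum(psi[(u, v)] * k(test_feats[u][i], train_feats[u][j])
                        for u in feature_types for v, k in kernels.items())
            s += np.exp(-total / beta)
        likelihood *= s
    return likelihood
```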
12. Generalised Gaussian Kernel
Shape factor p traces out an infinite family of kernels:

P(f_i | f_j) = [ p^{1 − 1/p} / (2 β Γ(1/p)) ] exp( −|f_i − f_j|^p / (p β^p) )

Γ: Gamma function
β: kernel bandwidth parameter
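The density above is a one-liner; a small sketch (function name is illustrative). Setting p = 2 recovers the standard Gaussian density and p = 1 the Laplacian:

```python
import math

def gen_gaussian(fi, fj, p=2.0, beta=1.0):
    """Generalised Gaussian kernel density P(f_i | f_j) with shape factor p.

    p = 2 gives a Gaussian kernel, p = 1 a Laplacian,
    and large p approaches a uniform (box) kernel.
    """
    norm = p ** (1.0 - 1.0 / p) / (2.0 * beta * math.gamma(1.0 / p))
    return norm * math.exp(-abs(fi - fj) ** p / (p * beta ** p))
```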
13–15. Generalised Gaussian Kernel
The same generalised Gaussian density, illustrated for shape factors p = 2 (Gaussian), p = 1 (Laplacian) and p = 15 (near-uniform).
[Figure: GG(x; p) kernel shape for each value of p]
16. Multinomial Kernel
Multinomial kernel optimised for count-based features:

P(f_i | f_j) = [ (Σ_d f_{i,d})! / ∏_d (f_{i,d}!) ] ∏_d (p_{j,d})^{f_{i,d}}

f_{i,d}: count for bin d in the unlabelled image i
f_{j,d}: count for bin d in the training image j
Jelinek-Mercer smoothing is used to estimate p_{j,d}:

p_{j,d} = λ · f_{j,d} / Σ_{d'} f_{j,d'} + (1 − λ) · Σ_{j'} f_{j',d} / Σ_{j',d'} f_{j',d'}

We also consider the standard χ2 and Hellinger kernels
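The smoothing and the kernel can be sketched together; log-space avoids overflow in the factorials. A minimal illustration with hypothetical function names:

```python
import math

def jm_smooth(fj, all_counts, lam=0.9):
    """Jelinek-Mercer smoothing of a training image's bin distribution:
    mixes the image's own relative counts with collection-wide bin frequencies."""
    total_j = sum(fj)
    coll = [sum(col) for col in zip(*all_counts)]  # per-bin collection counts
    total = sum(coll)
    return [lam * c / total_j + (1 - lam) * g / total
            for c, g in zip(fj, coll)]

def multinomial_log_kernel(fi, pj):
    """Log of the multinomial kernel P(f_i | f_j) for a count vector fi
    under smoothed bin probabilities pj."""
    n = sum(fi)
    log_p = math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in fi)
    log_p += sum(c * math.log(p) for c, p in zip(fi, pj) if c > 0)
    return log_p
```

The (1 − λ) collection term guarantees p_{j,d} > 0 for every bin, so the log never diverges on bins the training image happens to miss.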
17. Greedy kernel-feature alignment
Iteration 0: F1 = 0.00
Ψ_{v,u} (rows: Laplacian, Gaussian, Uniform; columns: GIST, SIFT, LAB, HAAR):
0 0 0 0
0 0 0 0
0 0 0 0
[Figure: testing- and training-image feature vectors X1–X6 per feature type (GIST, SIFT, LAB, HAAR), with candidate kernel shapes GG(x; p) for p = 1, 2, 15]
18. Greedy kernel-feature alignment
Iteration 1: F1 = 0.25 (Gaussian kernel selected for GIST)
Ψ_{v,u} (rows: Laplacian, Gaussian, Uniform; columns: GIST, SIFT, LAB, HAAR):
0 0 0 0
1 0 0 0
0 0 0 0
19. Greedy kernel-feature alignment
Iteration 2: F1 = 0.34 (Uniform kernel added for HAAR)
Ψ_{v,u} (rows: Laplacian, Gaussian, Uniform; columns: GIST, SIFT, LAB, HAAR):
0 0 0 0
1 0 0 0
0 0 0 1
20. Greedy kernel-feature alignment
Iteration 3: F1 = 0.38 (Gaussian kernel added for SIFT)
Ψ_{v,u} (rows: Laplacian, Gaussian, Uniform; columns: GIST, SIFT, LAB, HAAR):
0 0 0 0
1 1 0 0
0 0 0 1
21. Greedy kernel-feature alignment
Iteration 4: F1 = 0.42 (Laplacian kernel added for LAB)
Ψ_{v,u} (rows: Laplacian, Gaussian, Uniform; columns: GIST, SIFT, LAB, HAAR):
0 0 1 0
1 1 0 0
0 0 0 1
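The iterations above are plain greedy forward selection over entries of Ψ. A minimal sketch, where `f1_on_validation` is a hypothetical hook that annotates the held-out set under a given alignment and returns its F1 score:

```python
def greedy_align(feature_types, kernel_names, f1_on_validation, max_iters=50):
    """Greedy forward selection of the binary alignment matrix Psi:
    at each iteration switch on the single (kernel, feature) entry that
    most improves validation F1; stop when no remaining entry helps.

    f1_on_validation(psi) -> float is an assumed external evaluator.
    """
    psi = {(v, u): 0 for v in kernel_names for u in feature_types}
    best_f1 = f1_on_validation(psi)
    for _ in range(max_iters):
        best_entry = None
        for entry in psi:
            if psi[entry]:
                continue                       # already selected
            psi[entry] = 1                     # tentatively switch on
            f1 = f1_on_validation(psi)
            psi[entry] = 0
            if f1 > best_f1:
                best_f1, best_entry = f1, entry
        if best_entry is None:                 # no entry improves F1: stop
            break
        psi[best_entry] = 1
    return psi, best_f1
```

Each iteration costs one validation-set annotation per unselected entry, which is what makes the direct-F1 objective tractable despite being non-differentiable.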
23. Datasets/Features
Standard evaluation datasets:
Corel 5K: 5,000 images (landscapes, cities), 260 keywords
IAPR TC12: 19,627 images (tourism, sports), 291 keywords
ESP Game: 20,768 images (drawings, graphs), 268 keywords
Standard “Tagprop” feature set [Guillaumin et al. ’09]:
Bag-of-words histograms: SIFT [Lowe ’04] and Hue [van de
Weijer & Schmid ’06]
Global colour histograms: RGB, HSV, LAB
Global GIST descriptor [Oliva & Torralba ’01]
Descriptors, except GIST, also computed in a 3x1 spatial
arrangement [Lazebnik et al. ’06]
24. Evaluation Metrics
Standard evaluation metrics [Guillaumin et al. ’09]:
Mean per word Recall (R)
Mean per word Precision (P)
F1 Measure
Number of words with recall > 0 (N+)
Fixed annotation length of 5 keywords
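These per-word metrics are easy to get subtly wrong (the averaging is over words, not images). A small sketch of one reasonable reading of the protocol, with illustrative names:

```python
def per_word_metrics(predicted, ground_truth, vocab):
    """Mean per-word precision (P), recall (R), F1 and N+ for image annotation.

    predicted / ground_truth: dicts image_id -> set of keywords.
    Precision and recall are computed per word over all images,
    then averaged across the vocabulary.
    """
    precisions, recalls, n_plus = [], [], 0
    for w in vocab:
        pred_imgs = {i for i, tags in predicted.items() if w in tags}
        true_imgs = {i for i, tags in ground_truth.items() if w in tags}
        tp = len(pred_imgs & true_imgs)
        precisions.append(tp / len(pred_imgs) if pred_imgs else 0.0)
        recalls.append(tp / len(true_imgs) if true_imgs else 0.0)
        if recalls[-1] > 0:
            n_plus += 1                     # word recalled at least once
    P = sum(precisions) / len(vocab)
    R = sum(recalls) / len(vocab)
    F1 = 2 * P * R / (P + R) if P + R else 0.0
    return P, R, F1, n_plus
```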
25–28. F1 score of CRM model variants
[Bar chart: F1 (0.00–0.45) on Corel 5K, IAPR TC12 and ESP Game for three variants: the original CRM with Duygulu et al. features; CRM with the 15 Tagprop features (+71% over the original CRM); and SKL-CRM with the 15 Tagprop features (a further +45%)]
29–33. F1 score of SKL-CRM on Corel 5K
[Chart: SKL-CRM validation and test F1, and Tagprop test F1 (range 0.31–0.45), as feature types are greedily added: HSV_V3H1, DS, HS_V3H1, HSV, HS, HH_V3H1, GIST, LAB_V3H1, RGB_V3H1, RGB, DH_V3H1, DH, HH, LAB, DS_V3H1]
34. Optimal kernel-feature alignments on Corel 5K
Optimal alignments¹:
HSV: Multinomial (λ = 0.99)
HSV V3H1: Generalised Gaussian (p = 0.9)
Harris Hue (HH V3H1): Generalised Gaussian (p = 0.1) ≈ Dirac spike!
Harris SIFT (HS): Gaussian
HS V3H1: Generalised Gaussian (p = 0.7)
DenseSift (DS): Laplacian
Our data-driven kernels are more effective than the standard kernels
No learnt alignment agrees with the literature's default assignment, i.e. Gaussian for Gist, Laplacian for colour histograms, χ2 for SIFT
¹ V3H1 denotes descriptors computed in a spatial arrangement
35. SKL-CRM Results vs. Literature (Precision & Recall)
[Bar chart: mean per-word Recall (R) and Precision (P), range 0.20–0.50, on Corel 5K and IAPR TC12 for MBRM, JEC, Tagprop, GS and SKL-CRM]
38. Conclusions and Future Work
Proposed a sparse kernel model for image annotation
Key experimental findings:
Default kernel-feature alignment suboptimal
Data-adaptive kernels are superior to standard kernels
Sparse set of features just as effective as much larger set
Greedy forward selection as effective as gradient ascent
Future work: superposition of kernels per feature type
39. Thank you for your attention
Sean Moran
sean.moran@ed.ac.uk
www.seanjmoran.com