Lec-08 Feature Aggregation II: Fisher Vector, AKULA and Super Vector

Image Analysis & Retrieval
CS/EE 5590 Special Topics (Class Ids: 44873,44874)
Fall 2016,M/W 4-5:15pm@Bloch0012
Lec 08
Feature Aggregation II:
Fisher Vector, Super Vector and AKULA
Zhu Li
Dept of CSEE, UMKC
Office: FH560E, Email: lizhu@umkc.edu, Ph: x 2346.
http://l.web.umkc.edu/lizhu
Z. Li, Image Analysis & Retrv. 2016

Outline
 ReCap of Lecture 07
 Image Retrieval System
 BoW
 VLAD
 Dense SIFT
 Fisher Vector Aggregation
 AKULA
 Summary

Precision, Recall, F-measure
Precision = TP/(TP + FP)
Recall (TPR) = TP/(TP + FN)
FPR = FP/(FP + TN)
F-measure = 2*(precision*recall)/(precision + recall)
Precision: the probability that a retrieved document is relevant.
Recall: the probability that a relevant document is retrieved in a search.
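As a quick numerical check of these measures, here is a minimal sketch using the standard confusion-matrix definitions (FPR taken as FP/(FP + TN)). The function name and the counts are hypothetical, chosen only for illustration.

```python
def retrieval_metrics(tp, fp, fn, tn):
    """Return precision, recall (TPR), FPR and F-measure from raw counts."""
    precision = tp / (tp + fp)   # fraction of retrieved items that are relevant
    recall = tp / (tp + fn)      # fraction of relevant items that are retrieved
    fpr = fp / (fp + tn)         # fraction of irrelevant items that are retrieved
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, fpr, f_measure

# Hypothetical query: 8 relevant hits, 2 false alarms, 2 misses, 88 correct rejections.
precision, recall, fpr, f_measure = retrieval_metrics(tp=8, fp=2, fn=2, tn=88)
```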

Why Aggregation ?
 Curse of Dimensionality
Decision Boundary / Indexing

Bag-of-Words: Histogram Coding
Codebook:
 Feature space: $\mathbb{R}^d$, k-means to get k centroids, $\{\mu_1, \mu_2, \ldots, \mu_k\}$
 BoW Hard Encoding:
 For n feature points $\{x_1, x_2, \ldots, x_n\}$, the assignment matrix is k x n, with exactly one non-zero entry per column
 Aggregated dimension: k
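The hard encoding can be sketched in a few lines. This is a minimal illustration, assuming the codebook has already been learned (by k-means or otherwise); the function names and the toy 2-D data are hypothetical.

```python
def nearest_centroid(x, centroids):
    """Index of the centroid closest to x in squared Euclidean distance."""
    dists = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in centroids]
    return dists.index(min(dists))

def bow_histogram(features, centroids):
    """Hard-assignment BoW: count how many features fall into each cell."""
    hist = [0] * len(centroids)
    for x in features:
        hist[nearest_centroid(x, centroids)] += 1
    return hist

# Toy codebook with k = 3 centroids and n = 4 feature points in R^2.
centroids = [(0.0, 0.0), (1.0, 1.0), (4.0, 0.0)]
features = [(0.1, -0.1), (0.9, 1.2), (1.1, 0.8), (3.7, 0.2)]
hist = bow_histogram(features, centroids)
```

Each feature contributes to exactly one bin, which corresponds to the single non-zero entry in its column of the assignment matrix.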

Kernel Code Book Soft Encoding
 Kernel Affinity: $K(x_j, \mu_k) = e^{-\gamma \|x_j - \mu_k\|^2}$, with kernel scale $\gamma > 0$
 Assignment Matrix: $A_{j,k} = K(x_j, \mu_k) / \sum_{k'} K(x_j, \mu_{k'})$
 Encoding, k-dimensional: $X(k) = \frac{1}{n} \sum_j A_{j,k}$
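A minimal sketch of the soft encoding, assuming a Gaussian kernel with a hand-picked scale (the parameter name `gamma`, the function name, and the toy data are illustrative assumptions, not part of the slides):

```python
import math

def soft_bow(features, centroids, gamma=1.0):
    """Kernel codebook encoding: average of row-normalized kernel affinities."""
    k = len(centroids)
    code = [0.0] * k
    for x in features:
        # Gaussian kernel affinity of x to every centroid.
        aff = [math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, c)))
               for c in centroids]
        total = sum(aff)
        for i in range(k):
            code[i] += aff[i] / total   # soft assignment of x to centroid i
    return [v / len(features) for v in code]

centroids = [(0.0, 0.0), (1.0, 1.0), (4.0, 0.0)]
features = [(0.1, -0.1), (0.9, 1.2), (1.1, 0.8), (3.7, 0.2)]
code = soft_bow(features, centroids, gamma=1.0)
```

Because every feature spreads a total weight of one over the centroids, the entries of the code sum to one.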

VLAD- Vector of Locally Aggregated Descriptors
 Aggregate feature differences from the codebook
 Hard assignment by finding the NN of each feature $x_j$ among the centroids $\{\mu_k\}$
 Compute aggregated differences: $v_k = \sum_{\forall j\ \mathrm{s.t.}\ NN(x_j) = \mu_k} (x_j - \mu_k)$
 L2 normalize: $v_k = v_k / \|v_k\|_2$
 Final feature: k x d
① assign descriptors
② compute $x - \mu_i$
③ $v_i$ = sum of $x - \mu_i$ for cell i
[Figure: five cells with centroids μ1..μ5, a descriptor x, and the aggregated vectors v1..v5]
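The three steps can be sketched as follows. This is a minimal illustration with per-cell L2 normalization as in the formula above; the function name and the toy data are hypothetical.

```python
import math

def vlad(features, centroids):
    """Return the k*d VLAD vector: per-cell residual sums, each L2-normalized."""
    k, d = len(centroids), len(centroids[0])
    v = [[0.0] * d for _ in range(k)]
    for x in features:
        # Step 1: assign the descriptor to its nearest centroid.
        dists = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in centroids]
        i = dists.index(min(dists))
        # Steps 2 and 3: accumulate the residual x - mu_i in cell i.
        for t in range(d):
            v[i][t] += x[t] - centroids[i][t]
    out = []
    for row in v:
        norm = math.sqrt(sum(a * a for a in row))
        out.extend([a / norm if norm > 0 else 0.0 for a in row])
    return out

centroids = [(0.0, 0.0), (10.0, 10.0)]
features = [(3.0, 0.0), (0.0, 4.0), (13.0, 10.0), (10.0, 14.0)]
v = vlad(features, centroids)   # both cells accumulate the residual (3, 4)
```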

VLAD on SIFT
 Example of aggregating SIFT with VLAD
 K=16 codebook entries
 Each cell is a SIFT visualized as centroids in blue, and VLAD
difference in red
 Top row: left image, bottom row: right image, red: codebook, blue: encoded VLAD


One more trick
 Recall that SIFT is a powerful descriptor
 VL_FEAT: vl_dsift
 A dense description of the image, computed as SIFT descriptors (no spatial-scale space extrema detection) on a predetermined grid
 Supplements HoG as an alternative texture descriptor

VL_FEAT: vl_dsift
 Compute dense SIFT as a texture descriptor for the image
 [f, dsift] = vl_dsift(single(rgb2gray(im)), 'step', 2);
 There's also a FAST option
 [f, dsift] = vl_dsift(single(rgb2gray(im)), 'fast', 'step', 2);
 A huge amount of SIFT data will be generated

Fisher Vector
 Fisher Vector and variations:
 Winning in image classification:
 Winning in the MPEG object re-identification:
o SCFV (Scalable Compressed Fisher Vector) in CDVS

Codebook: Gaussian Mixture Model (GMM)
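 GMM is a generative model to express data
 Assuming data is generated from a mixture of K Gaussians with parameters $\{w_k, \mu_k, \sigma_k\}$:
$x \sim \sum_{k=1}^{K} w_k N(\mu_k, \sigma_k)$
$N(\mu_k, \sigma_k) = \frac{1}{(2\pi)^{d/2} |\Sigma_k|^{1/2}} e^{-\frac{1}{2}(x-\mu_k)' \Sigma_k^{-1} (x-\mu_k)}$
where $\Sigma_k$ is the covariance matrix of component k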
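To make the mixture density concrete, here is a minimal sketch that evaluates it for diagonal covariances (an assumption made here to keep the code short; the function names and the toy parameters are hypothetical).

```python
import math

def diag_gaussian(x, mu, var):
    """Density of N(mu, diag(var)) evaluated at x."""
    d = len(x)
    log_det = sum(math.log(v) for v in var)
    maha = sum((a - m) ** 2 / v for a, m, v in zip(x, mu, var))
    return math.exp(-0.5 * (d * math.log(2 * math.pi) + log_det + maha))

def gmm_density(x, weights, means, variances):
    """Weighted sum of the component densities."""
    return sum(w * diag_gaussian(x, m, v)
               for w, m, v in zip(weights, means, variances))

# Toy 2-component GMM in R^2.
weights = [0.5, 0.5]
means = [(0.0, 0.0), (3.0, 3.0)]
variances = [(1.0, 1.0), (1.0, 1.0)]
p = gmm_density((0.0, 0.0), weights, means, variances)
```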

A bit of Theory: Fisher Kernel
Encode the deviation from the generative model
 Observed feature set, $\{x_1, x_2, \ldots, x_n\}$ in $\mathbb{R}^d$, e.g., d=128 for SIFT.
 How do these observations deviate from the given GMM model with a set of parameters $\lambda = \{w_k, \mu_k, \sigma_k\}$?
o i.e., how should the parameters, e.g., the means, move to best fit the observations?
[Figure: observation X1 among the GMM component means μ1, μ2, μ3, μ4]

A bit of Theory: Fisher Kernel
Score function w.r.t. the likelihood function $u_\lambda(X)$
 $G_\lambda^X = \nabla_\lambda \log u_\lambda(X)$: derivative of the log likelihood
 The dimension of the score function is m, where m is the number of generative model parameters; a GMM has three parameter types (weight, mean, variance) per component
 Given the observed data X, the score function indicates how the likelihood function parameters (e.g., the means) should move to better fit the data.
Distance/deviation of two observations X, Y w.r.t. the generative model
 Fisher Info Matrix (roughly the covariance in the Mahalanobis distance)
$F_\lambda = E_X\left[G_\lambda^X \, {G_\lambda^X}'\right]$
 Fisher Kernel Distance: normalized by the Fisher Info Matrix:
$K_{FK}(X, Y) = {G_\lambda^X}' \, F_\lambda^{-1} \, G_\lambda^Y$
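A worked one-parameter case may help fix the definitions. The toy model below is an assumption chosen for illustration (it is not on the slides): a univariate Gaussian with known variance, where the only parameter is the mean.

```latex
% Assumed toy model: u_\mu(x) = N(x; \mu, \sigma^2), with \sigma^2 known
\begin{aligned}
\log u_\mu(x) &= -\frac{(x-\mu)^2}{2\sigma^2} + \text{const} \\
G_\mu^x &= \frac{\partial}{\partial \mu} \log u_\mu(x) = \frac{x-\mu}{\sigma^2} \\
F_\mu &= E_x\left[(G_\mu^x)^2\right] = \frac{E\left[(x-\mu)^2\right]}{\sigma^4} = \frac{1}{\sigma^2} \\
K_{FK}(x, y) &= G_\mu^x \, F_\mu^{-1} \, G_\mu^y = \frac{(x-\mu)(y-\mu)}{\sigma^2}
\end{aligned}
```

Two observations on the same side of the mean get a positive kernel value, and the scale is set by the model variance, which is the Mahalanobis-style normalization mentioned above.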

Fisher Vector
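 $K_{FK}(X, Y)$ is a measure of similarity, w.r.t. the generative model
 Similar to the Mahalanobis distance case, we can decompose this kernel using $F_\lambda^{-1} = L_\lambda' L_\lambda$:
$K_{FK}(X, Y) = {G_\lambda^X}' \, F_\lambda^{-1} \, G_\lambda^Y = {G_\lambda^X}' \, L_\lambda' L_\lambda \, G_\lambda^Y$
 That gives us a kernel feature mapping of X to the Fisher Vector, $L_\lambda G_\lambda^X$
 For observed image features $\{x_t\}$, assumed independent, it can be computed as $L_\lambda G_\lambda^X = \sum_t L_\lambda \nabla_\lambda \log u_\lambda(x_t)$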

GMM Fisher Vector
Encode the deviation from the generative model
 Observed feature set, $\{x_1, x_2, \ldots, x_n\}$ in $\mathbb{R}^d$, e.g., d=128 (!) for SIFT.
 How do these observations deviate from the given GMM model with a set of parameters $\theta = \{a_k, \mu_k, \sigma_k\}$?
 GMM Log Likelihood Gradient
 Let $w_k = e^{a_k} / \sum_j e^{a_j}$, then we have one gradient each for the weight, mean, and variance parameters
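The gradient formulas themselves did not survive in the text. As a hedged reconstruction, the standard log-likelihood gradients for a diagonal-covariance GMM under the soft-max weight parameterization are given below, written per dimension. The posterior symbol $\gamma_t(k)$ and the shorthand $\mathcal{L}$ are notation introduced here, not taken from the slides.

```latex
% Notation (assumed): \mathcal{L} = \sum_t \log u_\theta(x_t),
% \gamma_t(k) = w_k N(x_t; \mu_k, \sigma_k) / \sum_j w_j N(x_t; \mu_j, \sigma_j)
\begin{aligned}
\text{weight:}\quad & \frac{\partial \mathcal{L}}{\partial a_k} = \sum_t \left[\gamma_t(k) - w_k\right] \\
\text{mean:}\quad & \frac{\partial \mathcal{L}}{\partial \mu_k} = \sum_t \gamma_t(k)\, \frac{x_t - \mu_k}{\sigma_k^2} \\
\text{variance:}\quad & \frac{\partial \mathcal{L}}{\partial \sigma_k} = \sum_t \gamma_t(k) \left[\frac{(x_t - \mu_k)^2}{\sigma_k^3} - \frac{1}{\sigma_k}\right]
\end{aligned}
```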

GMM Fisher Vector VL_FEAT implementation
 GMM codebook
 For a K-component GMM, we only allow 3K parameters, $\{\pi_k, \mu_k, \sigma_k \mid k = 1..K\}$, i.e., each Gaussian component has independent dimensions (diagonal covariance): $\Sigma_k = \mathrm{diag}(\sigma_k)$
 Posterior prob. of feature point $x_i$ to GMM component k

GMM Fisher Vector VL_FEAT implementation
 FV encoding
 Gradient on the mean, for GMM component k, j=1..D
 In the end, we have a 2K x D aggregation of the deviation w.r.t. the means and variances
$FV = [u_1, u_2, \ldots, u_K, v_1, v_2, \ldots, v_K]$
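The encoding can be sketched in plain Python. This is a minimal illustration rather than the VL_FEAT code itself: it assumes the usual diagonal-covariance form, with posteriors $q_{ik}$ proportional to $\pi_k N(x_i; \mu_k, \Sigma_k)$, mean deviations $u_{jk} = \frac{1}{n\sqrt{\pi_k}} \sum_i q_{ik} \frac{x_{ji} - \mu_{jk}}{\sigma_{jk}}$ and variance deviations $v_{jk} = \frac{1}{n\sqrt{2\pi_k}} \sum_i q_{ik} \left[\left(\frac{x_{ji} - \mu_{jk}}{\sigma_{jk}}\right)^2 - 1\right]$. The function name and toy data are hypothetical, and no further normalization is applied.

```python
import math

def fisher_vector(features, priors, means, variances):
    """Return [u_1..u_K, v_1..v_K] for a diagonal-covariance GMM."""
    n, K, D = len(features), len(priors), len(means[0])
    u = [[0.0] * D for _ in range(K)]
    v = [[0.0] * D for _ in range(K)]
    for x in features:
        # Log of (prior * Gaussian density) for each component.
        logp = []
        for k in range(K):
            maha = sum((x[j] - means[k][j]) ** 2 / variances[k][j] for j in range(D))
            log_det = sum(math.log(2 * math.pi * s) for s in variances[k])
            logp.append(math.log(priors[k]) - 0.5 * (log_det + maha))
        top = max(logp)                          # subtract max for stability
        post = [math.exp(l - top) for l in logp]
        z = sum(post)
        for k in range(K):
            q = post[k] / z                      # posterior of component k
            for j in range(D):
                r = (x[j] - means[k][j]) / math.sqrt(variances[k][j])
                u[k][j] += q * r                 # deviation w.r.t. the mean
                v[k][j] += q * (r * r - 1.0)     # deviation w.r.t. the variance
    fv = []
    for k in range(K):
        fv += [a / (n * math.sqrt(priors[k])) for a in u[k]]
    for k in range(K):
        fv += [a / (n * math.sqrt(2 * priors[k])) for a in v[k]]
    return fv

priors = [0.5, 0.5]
means = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
features = [[0.2, -0.1], [2.8, 3.3], [1.0, 1.0]]
fv = fisher_vector(features, priors, means, variances)   # length 2*K*D = 8
```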

VL_FEAT GMM/FV API
 Compute GMM model with VL_FEAT
 Prepare data:
numPoints = 1000 ; dimension = 2 ;
data = rand(dimension, numPoints) ;  % one data point per column
 Call vl_gmm:
numClusters = 30 ;
[means, covariances, priors] = vl_gmm(data, numClusters) ;
 Visualize:
figure ;
hold on ;
plot(data(1,:),data(2,:),'r.') ;
for i=1:numClusters
  % ellipse frame [center, S11, S12, S22] built from the diagonal covariance
  vl_plotframe([means(:,i)' covariances(1,i) 0 covariances(2,i)]);
end

VL_FEAT API
 FV encoding
encoding = vl_fisher(datatoBeEncoded, means, covariances, priors);
 Bonus points:
 Encode HoG features with Fisher Vector ?
 randomly collect 2~3 images from each class
 Stack all HoG features together into an n x 36 data matrix
 Compute its GMM
 Use this GMM to encode all image HoG features (other than
average)

Super Vector Aggregation – Speaker ID
 Fisher Vector: Aggregates Features against a GMM
 Super Vector: Aggregates GMM against GMM
 Ref:
o William M. Campbell, Douglas E. Sturim, Douglas A. Reynolds: Support vector machines using GMM supervectors for speaker verification. IEEE Signal Process. Lett. 13(5): 308-311 (2006)
[Figure: speech sample "Yes, We Can!", speaker to be identified]

Super Vector from MFCC
 Motivated from Speaker ID work
 Speech is a continuous evolution of the vocal tract
 Need to extract a sequence of spectra or a sequence of spectral coefficients
 Use a sliding window: 25 ms window, 10 ms shift
[Figure: MFCC pipeline: |X(ω)| → Log → DCT → MFCC]
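The sliding-window step can be sketched as below. The 16 kHz sample rate and the function name are assumptions made for illustration; only the 25 ms window and 10 ms shift come from the slide.

```python
def frame_bounds(num_samples, sample_rate=16000, win_ms=25, shift_ms=10):
    """Return (start, end) sample indices of each full analysis window."""
    win = sample_rate * win_ms // 1000      # window length in samples
    shift = sample_rate * shift_ms // 1000  # hop between windows in samples
    return [(s, s + win) for s in range(0, num_samples - win + 1, shift)]

# One second of audio at the assumed 16 kHz sample rate.
frames = frame_bounds(16000)
```

Each window would then go through the spectrum, log, and DCT stages to give one MFCC vector, so an utterance becomes a sequence of such vectors.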

GMM Model from MFCC
 GMM on MFCC feature
• The acoustic vectors (MFCC) of speaker s are modeled by a prob. density function parameterized by $\lambda^{(s)} = \{w_j^{(s)}, \mu_j^{(s)}, \Sigma_j^{(s)}\}_{j=1}^{M}$
• Gaussian mixture model (GMM) for speaker s:
$p(x \mid \lambda^{(s)}) = \sum_{j=1}^{M} w_j^{(s)} \, p(x \mid \mu_j^{(s)}, \Sigma_j^{(s)})$

Universal Background Model
 UBM GMM Model:
• The acoustic vectors of a general population are modeled by another GMM called the universal background model (UBM):
$p(x \mid \lambda^{(ubm)}) = \sum_{j=1}^{M} w_j^{(ubm)} \, p(x \mid \mu_j^{(ubm)}, \Sigma_j^{(ubm)})$
• Parameters of the UBM: $\lambda^{(ubm)} = \{w_j^{(ubm)}, \mu_j^{(ubm)}, \Sigma_j^{(ubm)}\}_{j=1}^{M}$

MAP Adaptation
 Given the UBM GMM, how does the new observation deviate from it?
 The adapted mean is given by:
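The formula for the adapted mean is not preserved in the text. A common choice, and the one assumed in this sketch, is relevance MAP adaptation: with soft count $n_j = \sum_t \Pr(j \mid x_t)$ and posterior-weighted average $E_j(x) = \frac{1}{n_j}\sum_t \Pr(j \mid x_t)\, x_t$, the adapted mean is $\hat{\mu}_j = \alpha_j E_j(x) + (1 - \alpha_j)\, \mu_j^{(ubm)}$ with $\alpha_j = n_j / (n_j + r)$. The relevance factor `relevance` (r), the function name, and the toy numbers are assumptions.

```python
def map_adapt_mean(frames, posteriors, ubm_mean, relevance=16.0):
    """Adapt one UBM component mean toward the frames it is responsible for."""
    n = sum(posteriors)                       # soft count of frames for this component
    d = len(ubm_mean)
    if n == 0:
        return list(ubm_mean)                 # no evidence: keep the UBM mean
    # Posterior-weighted average of the observed frames.
    ex = [sum(p * x[j] for p, x in zip(posteriors, frames)) / n for j in range(d)]
    alpha = n / (n + relevance)               # data-dependent mixing weight
    return [alpha * e + (1 - alpha) * m for e, m in zip(ex, ubm_mean)]

# Toy case: two frames fully assigned to the component, relevance factor 2.
adapted = map_adapt_mean([[2.0], [4.0]], [1.0, 1.0], [0.0], relevance=2.0)
```

With more frames assigned to a component, alpha approaches one and the adapted mean moves closer to the data average; with few frames it stays near the UBM mean.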

Supervector Distance
 Assuming we have the UBM GMM model $\lambda_{UBM} = \{P_k, \mu_k, \Sigma_k\}$, with identical prior and covariance
Then for two utterance samples a and b, with GMM models
 $\lambda_a = \{P_k, \mu_k^a, \Sigma_k\}$,
 $\lambda_b = \{P_k, \mu_k^b, \Sigma_k\}$,
The SV distance is,
$K(\lambda_a, \lambda_b) = \sum_k \left(\sqrt{P_k}\, \Sigma_k^{-1/2} \mu_k^a\right)^T \left(\sqrt{P_k}\, \Sigma_k^{-1/2} \mu_k^b\right)$
It means the means of the two models are normalized by the Mahalanobis distance metric induced by the UBM covariance
This is also a linear kernel function scaled by the UBM covariances
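Since the kernel is linear in the scaled means, it can be computed by stacking the scaled means into one long supervector per utterance and taking a dot product. The sketch below assumes diagonal covariances and follows the square-root weight scaling of the cited Campbell et al. paper; the function names and toy numbers are hypothetical.

```python
import math

def supervector(weights, variances, means):
    """Stack sqrt(P_k) * Sigma_k^(-1/2) * mu_k over all components."""
    sv = []
    for p, var, mu in zip(weights, variances, means):
        sv += [math.sqrt(p) * m / math.sqrt(s) for m, s in zip(mu, var)]
    return sv

def sv_kernel(weights, variances, means_a, means_b):
    """Linear kernel between the two scaled supervectors."""
    a = supervector(weights, variances, means_a)
    b = supervector(weights, variances, means_b)
    return sum(x * y for x, y in zip(a, b))

# Shared UBM priors and diagonal covariances, utterance-specific means.
weights = [0.5, 0.5]
variances = [[1.0, 4.0], [1.0, 1.0]]
means_a = [[1.0, 2.0], [0.0, 1.0]]
means_b = [[2.0, 2.0], [1.0, 1.0]]
k_ab = sv_kernel(weights, variances, means_a, means_b)
```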

Supervector Performance in NIST Speaker ID
 System 5: Gaussian SV
 DCF (Detection Cost Function)

m31491
AKULA – Adaptive KLUster Aggregation
2013/10/25
Abhishek Nagar, Zhu Li, Gaurav Srivastava and Kyungmo Park

Outline
Motivation
Adaptive Aggregation
Results with TM7
Summary

Motivation
Better Aggregation
 Fisher Vector and VLAD type aggregations depend on a global model
 AKULA removes this dependence, and directly codes the cluster centroids and SIFT count
 SCFV and RVD both have situations where clusters are turned off due to no assignment; this can be avoided in AKULA
[Figure: SIFT detection & selection → K-means → AKULA description]

Motivation
Better Subspace Choice
 Both SCFV and RVD do fixed normalization and PCA projection based on heuristics.
 What is the best possible subspace to do the aggregation?
 Use a boosting scheme to keep adding subspaces and aggregations in an iterative fashion, and tune TPR-FPR to the desired operating points on FPR.

CE2: AKULA – Adaptive KLUster Aggregation
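AKULA Descriptor: cluster centroids + SIFT count
$A_1 = \{yc^1_1, yc^1_2, \ldots, yc^1_k;\ pc^1_1, pc^1_2, \ldots, pc^1_k\}$,
$A_2 = \{yc^2_1, yc^2_2, \ldots, yc^2_k;\ pc^2_1, pc^2_2, \ldots, pc^2_k\}$
Distance metric:
 Min centroids distance, weighted by SIFT count
$d(A_1, A_2) = \frac{1}{k} \sum_{j=1}^{k} d^1_{min}(j)\, w^1_{min}(j) + \frac{1}{k} \sum_{i=1}^{k} d^2_{min}(i)\, w^2_{min}(i)$
$d^1_{min}(j) = \min_i d_{j,i}$, $d^2_{min}(i) = \min_j d_{j,i}$
$w^1_{min}(j) = w_{j,i^*}$, $i^* = \arg\min_i d_{j,i}$
$w^2_{min}(i) = w_{j^*,i}$, $j^* = \arg\min_j d_{j,i}$
where $d_{j,i}$ is the distance between centroid j of $A_1$ and centroid i of $A_2$, and $w_{j,i}$ is the SIFT-count weight of that pair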
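The two-way metric can be sketched as follows. The slides do not specify how $d_{j,i}$ and $w_{j,i}$ are computed, so this sketch makes two labeled assumptions: Euclidean distance between centroids, and a weight equal to the product of the two clusters' SIFT-count fractions. The function name and toy descriptors are hypothetical.

```python
import math

def akula_distance(cent1, cnt1, cent2, cnt2):
    """Two-way min-centroid distance, weighted by SIFT counts."""
    k1, k2 = len(cent1), len(cent2)
    # ASSUMPTION: d[j][i] is the Euclidean distance between centroids.
    d = [[math.dist(a, b) for b in cent2] for a in cent1]
    # ASSUMPTION: w[j][i] is the product of the two SIFT-count fractions.
    t1, t2 = sum(cnt1), sum(cnt2)
    w = [[(cnt1[j] / t1) * (cnt2[i] / t2) for i in range(k2)] for j in range(k1)]
    term1 = 0.0
    for j in range(k1):                      # each centroid of A1 to its NN in A2
        i_star = min(range(k2), key=lambda i: d[j][i])
        term1 += d[j][i_star] * w[j][i_star]
    term2 = 0.0
    for i in range(k2):                      # each centroid of A2 to its NN in A1
        j_star = min(range(k1), key=lambda j: d[j][i])
        term2 += d[j_star][i] * w[j_star][i]
    return term1 / k1 + term2 / k2

cent1, cnt1 = [(0.0, 0.0), (1.0, 0.0)], [3, 1]
cent2, cnt2 = [(0.0, 2.0), (1.0, 0.0)], [2, 2]
d12 = akula_distance(cent1, cnt1, cent2, cnt2)
```

Using only the first term gives the 1-way variant; the two-way sum makes the measure symmetric in its arguments.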

AKULA implementation in TM7
Inner loop aggregation
 Dimension is fixed at 8
 Number of clusters, nc = 8, 16, 32, to hit 64, 128, and 256 bytes
 Quantization: scale by ½ and quantize to int8; SIFT count is 8 bits; total (nc+1)*dim bytes per aggregation

Outer loop subspace optimization by boosting
 Initial set of subspace models {Ak} computed from MIR FLICKR data set SIFT extractions, by k-means clustering the space into 4096 clusters
 Iterative search on subspaces to generate AKULA aggregations that can improve precision-recall performance
 Notice that aggregation is de-coupled in the subspace iteration, to allow more DoF in aggregation and to find subspaces that provide complementary info.
The algorithm is still being debugged, so only 1st iteration results are available in TM7

Indexing/Hashing is required for AKULA; matching involves nc x dim multiplications and additions at this time. A binarization scheme will be considered once performance is optimized in non-binary form.

GD Only TPR-FPR: AKULA vs SCFV
Data set 1:
 AKULA (128 bytes, dim=8, nc=16) distance is just the 1-way dmin1.*wt
 Forcing a weighted sum on SCFV (512 bytes) Hamming distances without 2D decision fitting, i.e., count the Hamming distance between common active clusters, and sum up their distances

GD Only TPR-FPR: AKULA vs SCFV
Data set 2, 3:
 AKULA distance is just 1-way dmin1.*wt
 AKULA = 128 bytes, SCFV = 512 bytes.

3D object set: 4, 5
Data set 4, 5:

AKULA in PM
FPR performance:
AKULA rates:
PM rate    m     AKULA rate (bytes)
512        8     64
1K         16    128
2K         16    128
1K_4K      16    128
2K_4K      16    128
4K         16    128
8K         32    256
16K        32    256

TPR@1% FPR
[Figure: bar charts of TPR (%) at 1% FPR, TM7 vs. AKULA, on data sets 1a, 1b, 1c, 2, 3, 4, 5; one panel per bitrate: 512, 1k, 2k, 1k-4k, 2k-4k, 4k, 8k, 16k]

AKULA Localization
Notable improvement: 2.7%

AKULA Summary
Benefits:
 Allows more DoF in aggregation optimization,
o by an outer loop boosting scheme for subspace projection optimization
o and an inner loop adaptive clustering without the constraint of the global GMM model
 Simple weighted distance sum metric, with no need to tune a multi-dimensional decision boundary
 The overall pairwise matching matches TM7 SCFV with its 2-dimensional decision boundary
 In GD-only matching, outperforms the TM7 GD
 Good improvements to the localization accuracy
 Light in extraction, but still heavy in pairwise matching; needs a binarization scheme and/or indexing scheme to work for retrieval
 Future Improvements:
 Supervector AKULA?

Lec 08 Summary
 Fisher Vector
 Aggregate features $\{x_k\}$ in $\mathbb{R}^D$ against GMM
 Super Vector
 Aggregate GMM against a global GMM (UBM)
 AKULA
 Direct Aggregation
