P02: Sparse Coding – CVPR 2012 tutorial on deep learning methods for vision
1. CVPR12 Tutorial on Deep Learning: Sparse Coding
Sparse Coding
Kai Yu
yukai@baidu.com
Department of Multimedia, Baidu
2. Relentless research on visual recognition
Benchmark datasets: Caltech 101, 80 Million Tiny Images, ImageNet, PASCAL VOC
3. The pipeline of machine visual perception
Pipeline: low-level sensing → pre-processing → feature extraction → feature selection → inference (prediction, recognition).
Most machine-learning effort goes into the inference stage, yet the feature extraction/selection stages are:
• Most critical for accuracy
• Account for most of the computation at test time
• Most time-consuming in the development cycle
• Often hand-crafted in practice
5. Learning features from data
Pipeline: low-level sensing → pre-processing → feature extraction → feature selection → inference (prediction, recognition).
Feature learning: instead of designing features, let's design feature learners.
6. Learning features from data via sparse coding
Pipeline: low-level sensing → pre-processing → feature extraction → feature selection → inference (prediction, recognition).
Sparse coding offers an effective building block to learn useful features.
7. Outline
1. Sparse coding for image classification
2. Understanding sparse coding
3. Hierarchical sparse coding
4. Other topics: e.g. structured model, scale-up,
discriminative training
5. Summary
8. “BoW representation + SPM” Paradigm - I
Bag-of-visual-words representation
(BoW) based on VQ coding
Figure credit: Fei-Fei Li
9. “BoW representation + SPM” Paradigm - II
Spatial pyramid matching:
pooling in different scales and locations
Figure credit: Svetlana Lazebnik
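As a concrete reference point, the sketch below shows the VQ coding + average pooling step that underlies the BoW histogram; the spatial pyramid simply repeats this pooling over sub-regions at several scales. It is a minimal illustration with made-up shapes and names, not the tutorial's implementation.

```python
# Minimal BoW-by-VQ sketch: hard-assign each local descriptor to its nearest
# visual word, then average-pool the assignments into a normalized histogram.
import numpy as np
from sklearn.cluster import KMeans

def bow_histogram(descriptors, codebook):
    """descriptors: (n, d) local features; codebook: a fitted KMeans model."""
    words = codebook.predict(descriptors)                      # VQ coding (1-of-k)
    hist = np.bincount(words, minlength=codebook.n_clusters)   # average pooling
    return hist / max(len(descriptors), 1)

# Toy usage: 128-d SIFT-like descriptors and a 64-word codebook.
rng = np.random.default_rng(0)
codebook = KMeans(n_clusters=64, n_init=4, random_state=0).fit(rng.normal(size=(2000, 128)))
print(bow_histogram(rng.normal(size=(300, 128)), codebook).shape)   # (64,)
```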
12. “BoW+SPM” has two coding+pooling layers
Pipeline: local gradients → pooling (e.g. SIFT, HOG) → VQ coding → average pooling (obtain histogram) → SVM
The SIFT feature itself follows a coding + pooling operation.
13. Develop better coding methods
Pipeline: better coding → better pooling → better coding → better pooling → better classifier
- Coding: a nonlinear mapping of the data into another feature space
- Better coding methods: sparse coding, RBMs, auto-encoders
14. What is sparse coding
Sparse coding (Olshausen & Field, 1996) was originally developed to explain early visual processing in the brain (edge detection).
Training: given a set of random patches x, learn a dictionary of bases [Φ1, Φ2, …].
Coding: for a data vector x, solve a LASSO problem to find the sparse coefficient vector a.
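The formulas themselves are not preserved in this export; the standard formulation behind these two steps, written here in the usual notation (a sketch, not necessarily the slide's exact symbols), is:

```latex
% Dictionary learning (training): jointly fit bases and sparse codes
\min_{\Phi,\,\{a_i\}} \;\sum_{i=1}^{m} \Big\| x_i - \sum_{k} a_{i,k}\,\phi_k \Big\|_2^2
  \;+\; \lambda \sum_{i=1}^{m} \|a_i\|_1

% Coding (LASSO): with the dictionary fixed, encode a single data vector x
a^\ast \;=\; \arg\min_{a} \;\tfrac{1}{2}\,\| x - \Phi a \|_2^2 \;+\; \lambda \|a\|_1
```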
15. Sparse coding: training time
Input: images x1, x2, …, xm (each in R^d)
Learn: a dictionary of bases φ1, φ2, …, φk (also in R^d)
Alternating optimization:
1. Fix the dictionary φ1, φ2, …, φk and optimize the activations a (a standard LASSO problem)
2. Fix the activations a and optimize the dictionary φ1, φ2, …, φk (a convex QP problem)
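A compact sketch of this alternating scheme (scikit-learn based; the function name, the plain least-squares dictionary update, and the renormalization are illustrative simplifications, not the tutorial's exact algorithm):

```python
# Alternating optimization for dictionary learning, following the two steps above.
import numpy as np
from sklearn.decomposition import sparse_encode

def learn_dictionary(X, k, lam=0.1, n_iter=10, seed=0):
    """X: (m, d) patches. Returns a dictionary D with k unit-norm rows in R^d."""
    rng = np.random.default_rng(seed)
    D = rng.normal(size=(k, X.shape[1]))
    D /= np.linalg.norm(D, axis=1, keepdims=True)
    for _ in range(n_iter):
        # Step 1: fix D, solve a LASSO problem per patch for the codes A (m, k)
        A = sparse_encode(X, D, algorithm='lasso_lars', alpha=lam)
        # Step 2: fix A, refit D by least squares, then renormalize the bases
        D = np.linalg.lstsq(A, X, rcond=None)[0]
        D /= np.maximum(np.linalg.norm(D, axis=1, keepdims=True), 1e-12)
    return D
```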
16. Sparse coding: testing time
Input: a novel image patch x (in R^d) and the previously learned φi's
Output: representation [a_{i,1}, a_{i,2}, …, a_{i,k}] of image patch x_i
x_i ≈ 0.8 × [basis patch] + 0.3 × [basis patch] + 0.5 × [basis patch]
Represent x_i as: a_i = [0, 0, …, 0, 0.8, 0, …, 0, 0.3, 0, …, 0, 0.5, …]
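At test time only the LASSO step is run. A minimal, self-contained illustration (the dictionary here is a random stand-in; in practice it comes from the training step above):

```python
# Coding a novel patch against a fixed dictionary: only a LASSO solve is needed.
import numpy as np
from sklearn.decomposition import sparse_encode

rng = np.random.default_rng(1)
D = rng.normal(size=(128, 64))
D /= np.linalg.norm(D, axis=1, keepdims=True)      # stand-in for learned bases phi_i
x_new = rng.normal(size=(1, 64))                   # a novel image patch

a = sparse_encode(x_new, D, algorithm='lasso_lars', alpha=0.2)
print(np.count_nonzero(a), "of", a.size, "coefficients are non-zero")
```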
18. Self-taught Learning [Raina, Lee, Battle, Packer & Ng, ICML 07]
Motorcycles Not motorcycles
Testing: what is this?
…
Unlabeled images (Slide credit: Andrew Ng)
19. Classification Result on Caltech 101
9K images, 101 classes
• SIFT + VQ coding + nonlinear SVM: 64%
• Pixel-level sparse coding + linear SVM: ~50%
20. Sparse Coding on SIFT – ScSPM algorithm
[Yang, Yu, Gong & Huang, CVPR09]
Pipeline: local gradients → pooling (e.g. SIFT, HOG) → sparse coding → max pooling → linear classifier
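A rough sketch of this pipeline on precomputed local descriptors (a single 2×2 pyramid level for brevity; descriptor extraction, the multi-level pyramid, the absolute-value pooling, and all names/shapes are illustrative simplifications, not the authors' code):

```python
# ScSPM-style features: sparse-code each descriptor, then max-pool the absolute
# codes inside spatial cells; a linear SVM is trained on the pooled features.
import numpy as np
from sklearn.decomposition import sparse_encode
from sklearn.svm import LinearSVC

def scspm_feature(desc, pos, D, alpha=0.15, grid=2):
    """desc: (n, d) descriptors; pos: (n, 2) positions scaled to [0, 1)."""
    codes = np.abs(sparse_encode(desc, D, algorithm='lasso_lars', alpha=alpha))
    cells = np.minimum((pos * grid).astype(int), grid - 1)     # spatial cell index
    feat = np.zeros((grid, grid, codes.shape[1]))
    for gx in range(grid):
        for gy in range(grid):
            mask = (cells[:, 0] == gx) & (cells[:, 1] == gy)
            if mask.any():
                feat[gx, gy] = codes[mask].max(axis=0)         # max pooling per cell
    return feat.ravel()

# Toy usage: random dictionary/descriptors for six "images", linear SVM on top.
rng = np.random.default_rng(0)
D = rng.normal(size=(256, 128))
D /= np.linalg.norm(D, axis=1, keepdims=True)
X = np.stack([scspm_feature(rng.normal(size=(200, 128)), rng.random((200, 2)), D)
              for _ in range(6)])
clf = LinearSVC().fit(X, [0, 1, 0, 1, 0, 1])
```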
21. Sparse Coding on SIFT – the ScSPM algorithm
[Yang, Yu, Gong & Huang, CVPR09]
Caltech-101:
• SIFT + VQ coding + nonlinear SVM: 64%
• SIFT + sparse coding + linear SVM (ScSPM): 73%
22. Summary: Accuracies on Caltech 101
• Sparse coding → spatial max pooling → linear SVM (on raw pixels): 50%
• Dense SIFT → VQ coding → spatial pooling → nonlinear SVM: 64%
• Dense SIFT → sparse coding → spatial max pooling → linear SVM: 73%
Key message:
- Sparse coding is a better building block
- Deep models are preferred
23. Outline
1. Sparse coding for image classification
2. Understanding sparse coding
3. Hierarchical sparse coding
4. Other topics: e.g. structured model, scale-up,
discriminative training
5. Summary
24. Outline
1. Sparse coding for image classification
2. Understanding sparse coding
– Connections to RBMs, autoencoders, …
– Sparse activations vs. sparse models, ..
– Sparsity vs. locality
– local sparse coding methods
3. Hierarchical sparse coding
4. Other topics: e.g. structured model, scale-up, discriminative training
5. Summary
25. Classical sparse coding
- a is sparse
- a is often of higher dimension than x
- the activation a = f(x) is a nonlinear, implicit function of x
- the reconstruction x' = g(a) is linear and explicit
[Diagram: x → f(x) → a (encoding); a → g(a) → x' (decoding)]
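A small illustration of this asymmetry (random stand-in dictionary and sizes): the encoder has to solve an optimization problem, while the decoder is just a matrix product.

```python
# Classical sparse coding: f(x) is implicit (an optimization), g(a) is explicit.
import numpy as np
from sklearn.decomposition import sparse_encode

rng = np.random.default_rng(0)
D = rng.normal(size=(128, 64))                      # 128 bases for 64-d data
D /= np.linalg.norm(D, axis=1, keepdims=True)
x = rng.normal(size=(1, 64))

a = sparse_encode(x, D, algorithm='lasso_lars', alpha=0.2)   # encoding: solve LASSO
x_rec = a @ D                                                # decoding: linear map
print("non-zeros in a:", np.count_nonzero(a),
      "reconstruction error:", float(np.linalg.norm(x - x_rec)))
```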
26. RBM & autoencoders
- also involve activation and reconstruction
- but they have an explicit f(x)
- they do not necessarily enforce sparsity on a
- but if sparsity is imposed on a, results often improve [e.g. sparse RBM, Lee et al. NIPS 08]
[Diagram: x → f(x) → a (encoding); a → g(a) → x' (decoding)]
27. Sparse coding: A broader view
Any feature mapping from x to a, i.e. a = f(x),
where
- a is sparse (and often of higher dimension than x)
- f(x) is nonlinear
- a reconstruction x' = g(a) exists such that x' ≈ x
[Diagram: x → f(x) → a (encoding); a → g(a) → x' (decoding)]
Therefore, sparse RBMs, sparse autoencoders, and even VQ can be viewed as forms of sparse coding.
28. Outline
1. Sparse coding for image classification
2. Understanding sparse coding
– Connections to RBMs, autoencoders, …
– Sparse activations vs. sparse models, …
– Sparsity vs. locality
– local sparse coding methods
3. Hierarchical sparse coding
4. Other topics: e.g. structured model, scale-up, discriminative training
5. Summary
29. Sparse activations vs. sparse models
For a general function learning problem a = f(x):
1. Sparse model: f(x)'s parameters are sparse
- example: LASSO f(x) = <w, x>, where w is sparse
- the goal is feature selection: all data points select a common subset of features
- a hot topic in machine learning
2. Sparse activations: f(x)'s outputs are sparse
- example: sparse coding a = f(x), where a is sparse
- the goal is feature learning: different data points activate different feature subsets
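A hedged illustration of the distinction (toy data and parameters chosen only to make the contrast visible): a sparse model is one global weight vector with few non-zeros, while sparse activations give every sample its own sparse code.

```python
# Sparse *model* (LASSO regression weights) vs. sparse *activations* (per-sample codes).
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.decomposition import sparse_encode

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
y = 0.2 * X[:, 1] + 0.1 * X[:, 3] + 0.01 * rng.normal(size=200)

# 1) sparse model: one global weight vector w, most entries zero
w = Lasso(alpha=0.02).fit(X, y).coef_
print("non-zero weights (shared by all data):", np.flatnonzero(w))

# 2) sparse activations: every sample gets its own sparse code a
D = rng.normal(size=(100, 50))
D /= np.linalg.norm(D, axis=1, keepdims=True)
A = sparse_encode(X[:3], D, algorithm='lasso_lars', alpha=0.3)
for i, a in enumerate(A):
    print(f"sample {i}: active dimensions", np.flatnonzero(a))
```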
30. Example of sparse models
f(x)=<w,x>, where
w=[0, 0.2, 0, 0.1, 0, 0]
[Illustration: the data vectors x1, x2, …, xm are dense, but after applying w only their 2nd and 4th dimensions survive]
• because the 2nd and 4th elements of w are non-zero, these are the two selected features in x
• a globally-aligned sparse representation
31. Example of sparse activations (sparse coding)
[Illustration: sparse codes a1, a2, …, am, each with a small and different set of non-zero dimensions]
• different x's have different dimensions activated
• a locally-shared sparse representation: similar x's tend to have similar non-zero dimensions
32. Example of sparse activations (sparse coding)
[Illustration: sparse codes whose non-zero patterns vary smoothly along the data manifold]
• another example: preserving manifold structure
• more informative in highlighting richer data structures, e.g. clusters and manifolds
33. Outline
1. Sparse coding for image classification
2. Understanding sparse coding
– Connections to RBMs, autoencoders, …
– Sparse activations vs. sparse models, …
– Sparsity vs. locality
– Local sparse coding methods
3. Hierarchical sparse coding
4. Other topics: e.g. structured model, scale-up, discriminative training
5. Summary
34. Sparsity vs. Locality
• Intuition: similar data should get similar activated features
• Local sparse coding:
  • data in the same neighborhood tend to have shared activated features;
  • data in different neighborhoods tend to have different features activated.
[Figure: sparse coding vs. local sparse coding]
35. Sparse coding is not always local: example
Case 1: independent subspaces
• Each basis is a "direction"
• Sparsity: each datum is a linear combination of only several bases.
Case 2: data manifold (or clusters)
• Each basis is an "anchor point"
• Sparsity: each datum is a linear combination of neighboring anchors.
• Sparsity is caused by locality.
36. Two approaches to local sparse coding
Approach 1: coding via local anchor points. Approach 2: coding via local subspaces.
37. Classical sparse coding is empirically local
When it works best for classification, the codes are often found to be local:
it is preferable to let similar data have similar non-zero dimensions in their codes.
38. MNIST Experiment: Classification using SC
Try different values of λ:
• 60K training images, 10K test images
• dictionary size k = 512
• linear SVM on the sparse codes
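A runnable sketch of this setup (heavily subsampled so it finishes quickly; the dictionary-learning hyper-parameters and the mapping of `alpha` to the slide's λ are illustrative, not the experiment's actual configuration):

```python
# MNIST: learn a k=512 dictionary on raw pixels, sparse-code the digits for a
# few values of lambda, and train a linear SVM on the codes.
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.decomposition import MiniBatchDictionaryLearning, sparse_encode
from sklearn.svm import LinearSVC

X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X, y = X[:5000] / 255.0, y[:5000]                  # small subset for speed

for lam in (0.0005, 0.005, 0.05):
    dico = MiniBatchDictionaryLearning(n_components=512, alpha=lam,
                                       batch_size=256, random_state=0).fit(X[:2000])
    A = sparse_encode(X, dico.components_, algorithm="lasso_cd", alpha=lam)
    clf = LinearSVC().fit(A[:4000], y[:4000])
    print(lam, clf.score(A[4000:], y[4000:]))
```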
39. MNIST Experiment: Lambda = 0.0005
Each basis is like a part or direction.
40. MNIST Experiment: Lambda = 0.005
Again, each basis is like a part or direction.
41. MNIST Experiment: Lambda = 0.05
Now, each basis is more like a digit!
43. Geometric view of sparse coding
Error rates: 4.54%, 3.75%, 2.64% (one per λ setting above)
• When sparse coding achieves the best classification accuracy, the learned bases are like digits – each basis has a clear local class association.
44. Distribution of coefficients (MNIST)
Neighboring bases tend to get non-zero coefficients.
45. Distribution of coefficients (SIFT, Caltech-101)
Similar observation here!
46. Outline
1. Sparse coding for image classification
2. Understanding sparse coding
– Connections to RBMs, autoencoders, …
– Sparse activations vs. sparse models, …
– Sparsity vs. locality
– Local sparse coding methods
3. Other topics: e.g. structured model, scale-up, discriminative training
4. Summary
47. Why develop local sparse coding methods
Since locality is a preferred property in sparse coding, let's explicitly ensure locality.
The new algorithms can be well justified theoretically.
The new algorithms have computational advantages over classical sparse coding.
48. Two approaches to local sparse coding
Approach 1: coding via local anchor points – local coordinate coding
• Nonlinear learning using local coordinate coding. Kai Yu, Tong Zhang, and Yihong Gong. NIPS 2009.
• Learning locality-constrained linear coding for image classification. Jinjun Wang, Jianchao Yang, Kai Yu, Fengjun Lv, Thomas Huang. CVPR 2010.
Approach 2: coding via local subspaces – super-vector coding
• Image Classification using Super-Vector Coding of Local Image Descriptors. Xi Zhou, Kai Yu, Tong Zhang, and Thomas Huang. ECCV 2010.
• Large-scale Image Classification: Fast Feature Extraction and SVM Training. Yuanqing Lin, Fengjun Lv, Shenghuo Zhu, Ming Yang, Timothee Cour, Kai Yu, Liangliang Cao, Thomas Huang. CVPR 2011.
49. A function approximation framework to understand coding
• Assumption: image patches x follow a nonlinear manifold, and f(x) is smooth on the manifold.
• Coding: a nonlinear mapping x → a; typically a is high-dimensional and sparse.
• Nonlinear learning: f(x) = <w, a>.
52. Local Coordinate Coding (LCC): connect coding to nonlinear function learning
If f(x) is (alpha, beta)-Lipschitz smooth
The key message:
A good coding scheme should
1. have a small coding error,
2. and also be sufficiently local
Function approximation error ≤ coding error + locality term
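The inequality itself did not survive the export; a reconstruction of the bound in the NIPS'09 paper's notation (γ_v(x) are the coding weights on anchor points v ∈ C, and γ(x) = Σ_v γ_v(x) v is the physical approximation of x; treat the exact constants as indicative) is:

```latex
% (alpha, beta)-Lipschitz smooth f, coding (gamma, C), gamma(x) = sum_v gamma_v(x) v
\Big| f(x) - \sum_{v \in C} \gamma_v(x)\, f(v) \Big|
  \;\le\; \underbrace{\alpha \,\big\| x - \gamma(x) \big\|}_{\text{coding error}}
  \;+\; \underbrace{\beta \sum_{v \in C} |\gamma_v(x)|\,\big\| v - \gamma(x) \big\|^{2}}_{\text{locality term}}
```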
53. Local Coordinate Coding (LCC) Yu, Zhang & Gong, NIPS 09
Wang, Yang, Yu, Lv, Huang CVPR 10
• Dictionary learning: k-means (or hierarchical k-means)
• Coding for x, to obtain its sparse representation a:
  Step 1 – ensure locality: find the K nearest bases
  Step 2 – ensure low coding error: fit x using only those bases
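A minimal sketch of this two-step coding (K-nearest-bases restriction plus a small regularized least-squares fit, in the spirit of the LLC approximation; the sum-to-one constraint and the regularizer are illustrative choices, not necessarily the exact formulation on the slide):

```python
# LLC-style local coding: restrict to the K nearest bases, then fit x locally.
import numpy as np

def llc_code(x, B, K=5, reg=1e-4):
    """x: (d,) descriptor; B: (k, d) codebook from k-means. Returns a (k,) code."""
    d2 = ((B - x) ** 2).sum(axis=1)
    idx = np.argsort(d2)[:K]                     # Step 1: K nearest bases (locality)
    Z = B[idx] - x                               # shift the chosen bases to x's frame
    C = Z @ Z.T + reg * np.eye(K)                # local covariance, regularized
    w = np.linalg.solve(C, np.ones(K))
    w /= w.sum()                                 # Step 2: local fit, sum-to-one weights
    code = np.zeros(B.shape[0])
    code[idx] = w
    return code

rng = np.random.default_rng(0)
B = rng.normal(size=(256, 128))
print(np.flatnonzero(llc_code(rng.normal(size=128), B)))   # only K entries are active
```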
55. Function approximation via super-vector coding: piecewise local linear (first-order)
Zhou, Yu, Zhang, and Huang, ECCV 10
[Figure legend: local tangent, data points, cluster centers]
57. Super-Vector Coding (SVC)
Zhou, Yu, Zhang, and Huang, ECCV 10
• Dictionary learning: k-means (or hierarchical k-means)
• Coding for x, to obtain its sparse representation a:
  Step 1 – find the nearest basis of x and obtain its VQ coding, e.g. [0, 0, 1, 0, …]
  Step 2 – form the super-vector coding, e.g. [0, 0, 1, 0, …, 0, 0, (x − m3), 0, …]
  (zero-order part | local tangent part)
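A sketch of the two steps for a single descriptor (the constant weighting the zero-order part is a detail of the paper and is treated here simply as a parameter `s`):

```python
# Super-vector coding of one descriptor: VQ indicator plus a local-tangent block.
import numpy as np

def super_vector(x, centers, s=1.0):
    """x: (d,); centers: (k, d) k-means centroids. Returns a k*(d+1) vector."""
    k, d = centers.shape
    j = np.argmin(((centers - x) ** 2).sum(axis=1))   # Step 1: nearest centroid (VQ)
    sv = np.zeros(k * (d + 1))
    block = j * (d + 1)
    sv[block] = s                                     # zero-order part (VQ indicator)
    sv[block + 1: block + 1 + d] = x - centers[j]     # first-order part (local tangent)
    return sv

rng = np.random.default_rng(0)
centers = rng.normal(size=(64, 128))
print(super_vector(rng.normal(size=128), centers).shape)   # (64 * 129,)
```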
58. Results on ImageNet Challenge Dataset
ImageNet Challenge: 1.4 million images, 1000 classes
• VQ coding + intersection-kernel SVM: 40%
• LCC + linear SVM: 62%
• SVC + linear SVM: 65%
59. Summary: local sparse coding
Approach 1: local coordinate coding. Approach 2: super-vector coding.
- Sparsity achieved by explicitly ensuring locality
- Sound theoretical justifications
- Much simpler to implement and compute
- Strong empirical success
60. Outline
1. Sparse coding for image classification
2. Understanding sparse coding
3. Hierarchical sparse coding
4. Other topics: e.g. structured model, scale-up,
discriminative training
5. Summary
61. Hierarchical sparse coding
Yu, Lin, & Lafferty, CVPR 11
Matthew D. Zeiler, Graham W. Taylor, and Rob Fergus, ICCV 11
Learning from unlabeled data
Pipeline: sparse coding → pooling → sparse coding → pooling
62. A two-layer sparse coding formulation
Yu, Lin, & Lafferty, CVPR 11
The two layers are trained jointly: with first-layer codes W = (w_1, …, w_n) ∈ R^{p×n}, a second-layer code α ∈ R^q with α ≥ 0, and Ω(α) = Σ_{k=1}^{q} α_k φ_k, the objective has the form

\min_{W,\,\alpha \ge 0}\; L(W,\alpha) \;+\; \lambda \|W\|_1 \;+\; \gamma \|\alpha\|_1

where the loss couples each patch reconstruction with a covariance-style penalty through Ω(α):

L(W,\alpha) \;=\; \frac{1}{n}\sum_{i=1}^{n} \tfrac{1}{2}\,\|x_i - B\,w_i\|_2^2 \;+\; \mathrm{tr}\!\big(S(W)\,\Omega(\alpha)^{-1}\big),
\qquad S(W) = \frac{1}{n}\sum_{i=1}^{n} w_i w_i^{\top}
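A purely structural sketch of a two-layer coding + pooling stack (independent LASSO coding at each layer with made-up sizes; the CVPR'11 model additionally couples the layers through Ω(α), which is omitted here):

```python
# Two stacked "sparse coding + pooling" stages, layer 2 operating on pooled layer-1 codes.
import numpy as np
from sklearn.decomposition import sparse_encode

def code_and_pool(inputs, D, alpha, group=4):
    """Sparse-code the rows of `inputs`, then max-pool |codes| over groups of `group` rows."""
    A = np.abs(sparse_encode(inputs, D, algorithm='lasso_cd', alpha=alpha))
    n = (len(A) // group) * group
    return A[:n].reshape(-1, group, A.shape[1]).max(axis=1)

rng = np.random.default_rng(0)
D1 = rng.normal(size=(64, 25));  D1 /= np.linalg.norm(D1, axis=1, keepdims=True)
D2 = rng.normal(size=(128, 64)); D2 /= np.linalg.norm(D2, axis=1, keepdims=True)

patches = rng.normal(size=(256, 25))              # e.g. flattened 5x5 pixel patches
layer1 = code_and_pool(patches, D1, alpha=0.2)    # first coding + pooling stage
layer2 = code_and_pool(layer1, D2, alpha=0.2)     # second coding + pooling stage
print(layer1.shape, layer2.shape)
```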
63. MNIST Results - classification
Yu, Lin, & Lafferty, CVPR 11
HSC vs. CNN: HSC provides even better performance than CNN;
more remarkably, HSC learns its features in an unsupervised manner!
64. MNIST results -- learned dictionary
Yu, Lin, & Lafferty, CVPR 11
A hidden unit in the second layer is connected to a group of units in the 1st layer: invariance to translation, rotation, and deformation.
66. Adaptive Deconvolutional Networks for Mid and High Level Feature Learning
Matthew D. Zeiler, Graham W. Taylor, and Rob Fergus, ICCV 2011
• Hierarchical convolutional sparse coding.
• Trained with respect to the image from all layers (L1–L4).
• Pooling both spatially and among features.
• Learns invariant mid-level features.
[Figure: image → L1 features → L1 feature maps → select L2 feature groups → L2 feature maps → select L3 feature groups → L3 feature maps → select L4 features → L4 feature maps]
67. Outline
1. Sparse coding for image classification
2. Understanding sparse coding
3. Hierarchical sparse coding
4. Other topics: e.g. structured model, scale-up,
discriminative training
5. Summary
68. Other topics of sparse coding
Structured sparse coding, for example
– Group sparse coding [Bengio et al, NIPS 09]
– Learning hierarchical dictionary [Jenatton, Mairal et al, 2010]
Scale-up sparse coding, for example
– Feature-sign algorithm [Lee et al, NIPS 07]
– Feed-forward approximation [Gregor & LeCun, ICML 10]
– Online dictionary learning [Mairal et al, ICML 2009]
Discriminative training, for example
– Backprop algorithms [Bradley & Bagnell, NIPS 08; Yang et al., CVPR 10]
– Supervised dictionary training [Mairal et al., NIPS 08]
69. Summary of Sparse Coding
Sparse coding is an effective way to do (unsupervised) feature learning.
A building block for deep models
Sparse coding and its local variants (LCC, SVC) have
pushed the boundary of accuracies on Caltech101,
PASCAL VOC, ImageNet, …
Challenge: discriminative training is not straightforward
Speaker notes
Let’s further check what’s happening when best classification performance is achieved.
Hi, I'm here to talk about my poster "Adaptive Deconvolutional Networks for Mid and High Level Feature Learning". This is an extension of previous work on Deconvolutional Networks released at CVPR last year. Deconv Nets are a hierarchical convolutional sparse coding model used to learn features from images in an unsupervised fashion. We added two important extensions:
- The first novel extension is to train the model with respect to the input image from any given layer. Other hierarchical models build on the layer below; this allows stacking the model while still maintaining strong reconstructions.
- The second extension is a pooling operation that acts both spatially and amongst features. We store the switch locations within each pooling region in order to preserve sharp reconstructions through multiple layers; pooling over features provides more competition amongst features and therefore forces groupings to occur.