P02: Sparse Coding – CVPR 2012 tutorial on deep learning methods for vision
1. CVPR12 Tutorial on Deep Learning: Sparse Coding
Sparse Coding
Kai Yu
yukai@baidu.com
Department of Multimedia, Baidu
2. Relentless research on visual recognition
Benchmark datasets: Caltech 101, 80 Million Tiny Images, ImageNet, PASCAL VOC
3. The pipeline of machine visual perception
Pipeline: low-level sensing → pre-processing → feature extraction → feature selection → inference (prediction, recognition).
Most machine-learning effort goes into the inference stage, yet the feature extraction/selection stages are:
• Most critical for accuracy
• Account for most of the computation at test time
• Most time-consuming in the development cycle
• Often hand-crafted in practice
5. Learning features from data
Pipeline: low-level sensing → pre-processing → feature extraction → feature selection → inference (prediction, recognition).
Feature learning: instead of designing features, let's design feature learners.
6. Learning features from data via sparse coding
Pipeline: low-level sensing → pre-processing → feature extraction → feature selection → inference (prediction, recognition).
Sparse coding offers an effective building block to learn useful features.
7. Outline
1. Sparse coding for image classification
2. Understanding sparse coding
3. Hierarchical sparse coding
4. Other topics: e.g. structured model, scale-up,
discriminative training
5. Summary
8. “BoW representation + SPM” Paradigm - I
Bag-of-visual-words representation
(BoW) based on VQ coding
Figure credit: Fei-Fei Li
9. “BoW representation + SPM” Paradigm - II
Spatial pyramid matching:
pooling in different scales and locations
Figure credit: Svetlana Lazebnik
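As a concrete reference point, the sketch below shows the VQ coding + average pooling step that underlies the BoW histogram; the spatial pyramid simply repeats this pooling over sub-regions at several scales. It is a minimal illustration with made-up shapes and names, not the tutorial's implementation.

```python
# Minimal BoW-by-VQ sketch: hard-assign each local descriptor to its nearest
# visual word, then average-pool the assignments into a normalized histogram.
import numpy as np
from sklearn.cluster import KMeans

def bow_histogram(descriptors, codebook):
    """descriptors: (n, d) local features; codebook: a fitted KMeans model."""
    words = codebook.predict(descriptors)                      # VQ coding (1-of-k)
    hist = np.bincount(words, minlength=codebook.n_clusters)   # average pooling
    return hist / max(len(descriptors), 1)

# Toy usage: 128-d SIFT-like descriptors and a 64-word codebook.
rng = np.random.default_rng(0)
codebook = KMeans(n_clusters=64, n_init=4, random_state=0).fit(rng.normal(size=(2000, 128)))
print(bow_histogram(rng.normal(size=(300, 128)), codebook).shape)   # (64,)
```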
12. “BoW+SPM” has two coding+pooling layers
Pipeline: local gradients → pooling (e.g. SIFT, HOG) → VQ coding → average pooling (obtain histogram) → SVM
The SIFT feature itself follows a coding + pooling operation.
13. Develop better coding methods
Pipeline: better coding → better pooling → better coding → better pooling → better classifier
- Coding: a nonlinear mapping of the data into another feature space
- Better coding methods: sparse coding, RBMs, auto-encoders
14. What is sparse coding
Sparse coding (Olshausen & Field, 1996) was originally developed to explain early visual processing in the brain (edge detection).
Training: given a set of random patches x, learn a dictionary of bases [Φ1, Φ2, …].
Coding: for a data vector x, solve a LASSO problem to find the sparse coefficient vector a.
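The formulas themselves are not preserved in this export; the standard formulation behind these two steps, written here in the usual notation (a sketch, not necessarily the slide's exact symbols), is:

```latex
% Dictionary learning (training): jointly fit bases and sparse codes
\min_{\Phi,\,\{a_i\}} \;\sum_{i=1}^{m} \Big\| x_i - \sum_{k} a_{i,k}\,\phi_k \Big\|_2^2
  \;+\; \lambda \sum_{i=1}^{m} \|a_i\|_1

% Coding (LASSO): with the dictionary fixed, encode a single data vector x
a^\ast \;=\; \arg\min_{a} \;\tfrac{1}{2}\,\| x - \Phi a \|_2^2 \;+\; \lambda \|a\|_1
```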
15. Sparse coding: training time
Input: images x1, x2, …, xm (each in R^d)
Learn: a dictionary of bases φ1, φ2, …, φk (also in R^d)
Alternating optimization:
1. Fix the dictionary φ1, φ2, …, φk and optimize the activations a (a standard LASSO problem)
2. Fix the activations a and optimize the dictionary φ1, φ2, …, φk (a convex QP problem)
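A compact sketch of this alternating scheme (scikit-learn based; the function name, the plain least-squares dictionary update, and the renormalization are illustrative simplifications, not the tutorial's exact algorithm):

```python
# Alternating optimization for dictionary learning, following the two steps above.
import numpy as np
from sklearn.decomposition import sparse_encode

def learn_dictionary(X, k, lam=0.1, n_iter=10, seed=0):
    """X: (m, d) patches. Returns a dictionary D with k unit-norm rows in R^d."""
    rng = np.random.default_rng(seed)
    D = rng.normal(size=(k, X.shape[1]))
    D /= np.linalg.norm(D, axis=1, keepdims=True)
    for _ in range(n_iter):
        # Step 1: fix D, solve a LASSO problem per patch for the codes A (m, k)
        A = sparse_encode(X, D, algorithm='lasso_lars', alpha=lam)
        # Step 2: fix A, refit D by least squares, then renormalize the bases
        D = np.linalg.lstsq(A, X, rcond=None)[0]
        D /= np.maximum(np.linalg.norm(D, axis=1, keepdims=True), 1e-12)
    return D
```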
16. Sparse coding: testing time
Input: a novel image patch x (in R^d) and the previously learned φi's
Output: representation [a_{i,1}, a_{i,2}, …, a_{i,k}] of image patch x_i
x_i ≈ 0.8 × [basis patch] + 0.3 × [basis patch] + 0.5 × [basis patch]
Represent x_i as: a_i = [0, 0, …, 0, 0.8, 0, …, 0, 0.3, 0, …, 0, 0.5, …]
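At test time only the LASSO step is run. A minimal, self-contained illustration (the dictionary here is a random stand-in; in practice it comes from the training step above):

```python
# Coding a novel patch against a fixed dictionary: only a LASSO solve is needed.
import numpy as np
from sklearn.decomposition import sparse_encode

rng = np.random.default_rng(1)
D = rng.normal(size=(128, 64))
D /= np.linalg.norm(D, axis=1, keepdims=True)      # stand-in for learned bases phi_i
x_new = rng.normal(size=(1, 64))                   # a novel image patch

a = sparse_encode(x_new, D, algorithm='lasso_lars', alpha=0.2)
print(np.count_nonzero(a), "of", a.size, "coefficients are non-zero")
```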
18. Self-taught Learning [Raina, Lee, Battle, Packer & Ng, ICML 07]
Motorcycles Not motorcycles
Testing: what is this?
…
Unlabeled images (Slide credit: Andrew Ng)
19. Classification Result on Caltech 101
9K images, 101 classes
• SIFT + VQ coding + nonlinear SVM: 64%
• Pixel-level sparse coding + linear SVM: ~50%
20. Sparse Coding on SIFT – ScSPM algorithm
[Yang, Yu, Gong & Huang, CVPR09]
Pipeline: local gradients → pooling (e.g. SIFT, HOG) → sparse coding → max pooling → linear classifier
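A rough sketch of this pipeline on precomputed local descriptors (a single 2×2 pyramid level for brevity; descriptor extraction, the multi-level pyramid, the absolute-value pooling, and all names/shapes are illustrative simplifications, not the authors' code):

```python
# ScSPM-style features: sparse-code each descriptor, then max-pool the absolute
# codes inside spatial cells; a linear SVM is trained on the pooled features.
import numpy as np
from sklearn.decomposition import sparse_encode
from sklearn.svm import LinearSVC

def scspm_feature(desc, pos, D, alpha=0.15, grid=2):
    """desc: (n, d) descriptors; pos: (n, 2) positions scaled to [0, 1)."""
    codes = np.abs(sparse_encode(desc, D, algorithm='lasso_lars', alpha=alpha))
    cells = np.minimum((pos * grid).astype(int), grid - 1)     # spatial cell index
    feat = np.zeros((grid, grid, codes.shape[1]))
    for gx in range(grid):
        for gy in range(grid):
            mask = (cells[:, 0] == gx) & (cells[:, 1] == gy)
            if mask.any():
                feat[gx, gy] = codes[mask].max(axis=0)         # max pooling per cell
    return feat.ravel()

# Toy usage: random dictionary/descriptors for six "images", linear SVM on top.
rng = np.random.default_rng(0)
D = rng.normal(size=(256, 128))
D /= np.linalg.norm(D, axis=1, keepdims=True)
X = np.stack([scspm_feature(rng.normal(size=(200, 128)), rng.random((200, 2)), D)
              for _ in range(6)])
clf = LinearSVC().fit(X, [0, 1, 0, 1, 0, 1])
```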
21. Sparse Coding on SIFT – the ScSPM algorithm
[Yang, Yu, Gong & Huang, CVPR09]
Caltech-101:
• SIFT + VQ coding + nonlinear SVM: 64%
• SIFT + sparse coding + linear SVM (ScSPM): 73%
22. Summary: Accuracies on Caltech 101
• Sparse coding → spatial max pooling → linear SVM (on raw pixels): 50%
• Dense SIFT → VQ coding → spatial pooling → nonlinear SVM: 64%
• Dense SIFT → sparse coding → spatial max pooling → linear SVM: 73%
Key message:
- Sparse coding is a better building block
- Deep models are preferred
23. Outline
1. Sparse coding for image classification
2. Understanding sparse coding
3. Hierarchical sparse coding
4. Other topics: e.g. structured model, scale-up,
discriminative training
5. Summary
24. Outline
1. Sparse coding for image classification
2. Understanding sparse coding
– Connections to RBMs, autoencoders, …
– Sparse activations vs. sparse models, ..
– Sparsity vs. locality
– local sparse coding methods
3. Hierarchical sparse coding
4. Other topics: e.g. structured model, scale-up, discriminative training
5. Summary
25. Classical sparse coding
- a is sparse
- a is often of higher dimension than x
- the activation a = f(x) is a nonlinear, implicit function of x
- the reconstruction x' = g(a) is linear and explicit
[Diagram: x → f(x) → a (encoding); a → g(a) → x' (decoding)]
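A small illustration of this asymmetry (random stand-in dictionary and sizes): the encoder has to solve an optimization problem, while the decoder is just a matrix product.

```python
# Classical sparse coding: f(x) is implicit (an optimization), g(a) is explicit.
import numpy as np
from sklearn.decomposition import sparse_encode

rng = np.random.default_rng(0)
D = rng.normal(size=(128, 64))                      # 128 bases for 64-d data
D /= np.linalg.norm(D, axis=1, keepdims=True)
x = rng.normal(size=(1, 64))

a = sparse_encode(x, D, algorithm='lasso_lars', alpha=0.2)   # encoding: solve LASSO
x_rec = a @ D                                                # decoding: linear map
print("non-zeros in a:", np.count_nonzero(a),
      "reconstruction error:", float(np.linalg.norm(x - x_rec)))
```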
26. RBM & autoencoders
- also involve activation and reconstruction
- but they have an explicit f(x)
- they do not necessarily enforce sparsity on a
- but if sparsity is imposed on a, results often improve [e.g. sparse RBM, Lee et al. NIPS 08]
[Diagram: x → f(x) → a (encoding); a → g(a) → x' (decoding)]
27. Sparse coding: A broader view
Any feature mapping from x to a, i.e. a = f(x),
where
- a is sparse (and often of higher dimension than x)
- f(x) is nonlinear
- a reconstruction x' = g(a) exists such that x' ≈ x
[Diagram: x → f(x) → a (encoding); a → g(a) → x' (decoding)]
Therefore, sparse RBMs, sparse autoencoders, and even VQ can be viewed as forms of sparse coding.
28. Outline
1. Sparse coding for image classification
2. Understanding sparse coding
– Connections to RBMs, autoencoders, …
– Sparse activations vs. sparse models, …
– Sparsity vs. locality
– local sparse coding methods
3. Hierarchical sparse coding
4. Other topics: e.g. structured model, scale-up, discriminative training
5. Summary
29. Sparse activations vs. sparse models
For a general function learning problem a = f(x):
1. Sparse model: f(x)'s parameters are sparse
- example: LASSO f(x) = <w, x>, where w is sparse
- the goal is feature selection: all data points select a common subset of features
- a hot topic in machine learning
2. Sparse activations: f(x)'s outputs are sparse
- example: sparse coding a = f(x), where a is sparse
- the goal is feature learning: different data points activate different feature subsets
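A hedged illustration of the distinction (toy data and parameters chosen only to make the contrast visible): a sparse model is one global weight vector with few non-zeros, while sparse activations give every sample its own sparse code.

```python
# Sparse *model* (LASSO regression weights) vs. sparse *activations* (per-sample codes).
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.decomposition import sparse_encode

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
y = 0.2 * X[:, 1] + 0.1 * X[:, 3] + 0.01 * rng.normal(size=200)

# 1) sparse model: one global weight vector w, most entries zero
w = Lasso(alpha=0.02).fit(X, y).coef_
print("non-zero weights (shared by all data):", np.flatnonzero(w))

# 2) sparse activations: every sample gets its own sparse code a
D = rng.normal(size=(100, 50))
D /= np.linalg.norm(D, axis=1, keepdims=True)
A = sparse_encode(X[:3], D, algorithm='lasso_lars', alpha=0.3)
for i, a in enumerate(A):
    print(f"sample {i}: active dimensions", np.flatnonzero(a))
```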
30. Example of sparse models
f(x)=<w,x>, where
w=[0, 0.2, 0, 0.1, 0, 0]
[Illustration: the data vectors x1, x2, …, xm are dense, but after applying w only their 2nd and 4th dimensions survive]
• because the 2nd and 4th elements of w are non-zero, these are the two selected features in x
• a globally-aligned sparse representation
31. Example of sparse activations (sparse coding)
[Illustration: sparse codes a1, a2, …, am, each with a small and different set of non-zero dimensions]
• different x's have different dimensions activated
• a locally-shared sparse representation: similar x's tend to have similar non-zero dimensions
32. Example of sparse activations (sparse coding)
[Illustration: sparse codes whose non-zero patterns vary smoothly along the data manifold]
• another example: preserving manifold structure
• more informative in highlighting richer data structures, e.g. clusters and manifolds
33. Outline
1. Sparse coding for image classification
2. Understanding sparse coding
– Connections to RBMs, autoencoders, …
– Sparse activations vs. sparse models, …
– Sparsity vs. locality
– Local sparse coding methods
3. Hierarchical sparse coding
4. Other topics: e.g. structured model, scale-up, discriminative training
5. Summary
34. Sparsity vs. Locality
• Intuition: similar data should get similar activated features
• Local sparse coding:
  • data in the same neighborhood tend to have shared activated features;
  • data in different neighborhoods tend to have different features activated.
[Figure: sparse coding vs. local sparse coding]
35. Sparse coding is not always local: example
Case 1: independent subspaces
• Each basis is a "direction"
• Sparsity: each datum is a linear combination of only several bases.
Case 2: data manifold (or clusters)
• Each basis is an "anchor point"
• Sparsity: each datum is a linear combination of neighboring anchors.
• Sparsity is caused by locality.
36. Two approaches to local sparse coding
Approach 1: coding via local anchor points. Approach 2: coding via local subspaces.
37. Classical sparse coding is empirically local
When it works best for classification, the codes are often found to be local:
it is preferable to let similar data have similar non-zero dimensions in their codes.
38. MNIST Experiment: Classification using SC
Try different values of λ:
• 60K training images, 10K test images
• dictionary size k = 512
• linear SVM on the sparse codes
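A runnable sketch of this setup (heavily subsampled so it finishes quickly; the dictionary-learning hyper-parameters and the mapping of `alpha` to the slide's λ are illustrative, not the experiment's actual configuration):

```python
# MNIST: learn a k=512 dictionary on raw pixels, sparse-code the digits for a
# few values of lambda, and train a linear SVM on the codes.
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.decomposition import MiniBatchDictionaryLearning, sparse_encode
from sklearn.svm import LinearSVC

X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X, y = X[:5000] / 255.0, y[:5000]                  # small subset for speed

for lam in (0.0005, 0.005, 0.05):
    dico = MiniBatchDictionaryLearning(n_components=512, alpha=lam,
                                       batch_size=256, random_state=0).fit(X[:2000])
    A = sparse_encode(X, dico.components_, algorithm="lasso_cd", alpha=lam)
    clf = LinearSVC().fit(A[:4000], y[:4000])
    print(lam, clf.score(A[4000:], y[4000:]))
```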
39. MNIST Experiment: Lambda = 0.0005
Each basis is like a part or direction.
40. MNIST Experiment: Lambda = 0.005
Again, each basis is like a part or direction.
41. MNIST Experiment: Lambda = 0.05
Now, each basis is more like a digit!
43. Geometric view of sparse coding
Error rates: 4.54%, 3.75%, 2.64% (one per λ setting above)
• When sparse coding achieves the best classification accuracy, the learned bases are like digits – each basis has a clear local class association.
44. Distribution of coefficients (MNIST)
Neighboring bases tend to get non-zero coefficients.
45. Distribution of coefficients (SIFT, Caltech-101)
Similar observation here!
46. Outline
1. Sparse coding for image classification
2. Understanding sparse coding
– Connections to RBMs, autoencoders, …
– Sparse activations vs. sparse models, …
– Sparsity vs. locality
– Local sparse coding methods
3. Other topics: e.g. structured model, scale-up, discriminative training
4. Summary
47. Why develop local sparse coding methods
Since locality is a preferred property in sparse coding, let's explicitly ensure locality.
The new algorithms can be well justified theoretically.
The new algorithms have computational advantages over classical sparse coding.
48. Two approaches to local sparse coding
Approach 1: coding via local anchor points – local coordinate coding
• Nonlinear learning using local coordinate coding. Kai Yu, Tong Zhang, and Yihong Gong. NIPS 2009.
• Learning locality-constrained linear coding for image classification. Jinjun Wang, Jianchao Yang, Kai Yu, Fengjun Lv, Thomas Huang. CVPR 2010.
Approach 2: coding via local subspaces – super-vector coding
• Image Classification using Super-Vector Coding of Local Image Descriptors. Xi Zhou, Kai Yu, Tong Zhang, and Thomas Huang. ECCV 2010.
• Large-scale Image Classification: Fast Feature Extraction and SVM Training. Yuanqing Lin, Fengjun Lv, Shenghuo Zhu, Ming Yang, Timothee Cour, Kai Yu, Liangliang Cao, Thomas Huang. CVPR 2011.
49. A function approximation framework to understand coding
• Assumption: image patches x follow a nonlinear manifold, and f(x) is smooth on the manifold.
• Coding: a nonlinear mapping x → a; typically a is high-dimensional and sparse.
• Nonlinear learning: f(x) = <w, a>.
52. Local Coordinate Coding (LCC): connect coding to nonlinear function learning
If f(x) is (alpha, beta)-Lipschitz smooth
The key message:
A good coding scheme should
1. have a small coding error,
2. and also be sufficiently local
Function approximation error ≤ coding error + locality term
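The inequality itself did not survive the export; a reconstruction of the bound in the NIPS'09 paper's notation (γ_v(x) are the coding weights on anchor points v ∈ C, and γ(x) = Σ_v γ_v(x) v is the physical approximation of x; treat the exact constants as indicative) is:

```latex
% (alpha, beta)-Lipschitz smooth f, coding (gamma, C), gamma(x) = sum_v gamma_v(x) v
\Big| f(x) - \sum_{v \in C} \gamma_v(x)\, f(v) \Big|
  \;\le\; \underbrace{\alpha \,\big\| x - \gamma(x) \big\|}_{\text{coding error}}
  \;+\; \underbrace{\beta \sum_{v \in C} |\gamma_v(x)|\,\big\| v - \gamma(x) \big\|^{2}}_{\text{locality term}}
```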
53. Local Coordinate Coding (LCC) Yu, Zhang & Gong, NIPS 09
Wang, Yang, Yu, Lv, Huang CVPR 10
• Dictionary learning: k-means (or hierarchical k-means)
• Coding for x, to obtain its sparse representation a:
  Step 1 – ensure locality: find the K nearest bases
  Step 2 – ensure low coding error: fit x using only those bases
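A minimal sketch of this two-step coding (K-nearest-bases restriction plus a small regularized least-squares fit, in the spirit of the LLC approximation; the sum-to-one constraint and the regularizer are illustrative choices, not necessarily the exact formulation on the slide):

```python
# LLC-style local coding: restrict to the K nearest bases, then fit x locally.
import numpy as np

def llc_code(x, B, K=5, reg=1e-4):
    """x: (d,) descriptor; B: (k, d) codebook from k-means. Returns a (k,) code."""
    d2 = ((B - x) ** 2).sum(axis=1)
    idx = np.argsort(d2)[:K]                     # Step 1: K nearest bases (locality)
    Z = B[idx] - x                               # shift the chosen bases to x's frame
    C = Z @ Z.T + reg * np.eye(K)                # local covariance, regularized
    w = np.linalg.solve(C, np.ones(K))
    w /= w.sum()                                 # Step 2: local fit, sum-to-one weights
    code = np.zeros(B.shape[0])
    code[idx] = w
    return code

rng = np.random.default_rng(0)
B = rng.normal(size=(256, 128))
print(np.flatnonzero(llc_code(rng.normal(size=128), B)))   # only K entries are active
```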
55. Function approximation via super-vector coding: piecewise local linear (first-order)
Zhou, Yu, Zhang, and Huang, ECCV 10
[Figure legend: local tangent, data points, cluster centers]
57. Super-Vector Coding (SVC)
Zhou, Yu, Zhang, and Huang, ECCV 10
• Dictionary learning: k-means (or hierarchical k-means)
• Coding for x, to obtain its sparse representation a:
  Step 1 – find the nearest basis of x and obtain its VQ coding, e.g. [0, 0, 1, 0, …]
  Step 2 – form the super-vector coding, e.g. [0, 0, 1, 0, …, 0, 0, (x − m3), 0, …]
  (zero-order part | local tangent part)
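A sketch of the two steps for a single descriptor (the constant weighting the zero-order part is a detail of the paper and is treated here simply as a parameter `s`):

```python
# Super-vector coding of one descriptor: VQ indicator plus a local-tangent block.
import numpy as np

def super_vector(x, centers, s=1.0):
    """x: (d,); centers: (k, d) k-means centroids. Returns a k*(d+1) vector."""
    k, d = centers.shape
    j = np.argmin(((centers - x) ** 2).sum(axis=1))   # Step 1: nearest centroid (VQ)
    sv = np.zeros(k * (d + 1))
    block = j * (d + 1)
    sv[block] = s                                     # zero-order part (VQ indicator)
    sv[block + 1: block + 1 + d] = x - centers[j]     # first-order part (local tangent)
    return sv

rng = np.random.default_rng(0)
centers = rng.normal(size=(64, 128))
print(super_vector(rng.normal(size=128), centers).shape)   # (64 * 129,)
```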
58. Results on ImageNet Challenge Dataset
ImageNet Challenge: 1.4 million images, 1000 classes
• VQ coding + intersection-kernel SVM: 40%
• LCC + linear SVM: 62%
• SVC + linear SVM: 65%
59. Summary: local sparse coding
Approach 1: local coordinate coding. Approach 2: super-vector coding.
- Sparsity achieved by explicitly ensuring locality
- Sound theoretical justifications
- Much simpler to implement and compute
- Strong empirical success
60. Outline
1. Sparse coding for image classification
2. Understanding sparse coding
3. Hierarchical sparse coding
4. Other topics: e.g. structured model, scale-up,
discriminative training
5. Summary
61. Hierarchical sparse coding
Yu, Lin, & Lafferty, CVPR 11
Matthew D. Zeiler, Graham W. Taylor, and Rob Fergus, ICCV 11
Learning from unlabeled data
Pipeline: sparse coding → pooling → sparse coding → pooling
62. A two-layer sparse coding formulation
Yu, Lin, & Lafferty, CVPR 11
The two layers are trained jointly: with first-layer codes W = (w_1, …, w_n) ∈ R^{p×n}, a second-layer code α ∈ R^q with α ≥ 0, and Ω(α) = Σ_{k=1}^{q} α_k φ_k, the objective has the form

\min_{W,\,\alpha \ge 0}\; L(W,\alpha) \;+\; \lambda \|W\|_1 \;+\; \gamma \|\alpha\|_1

where the loss couples each patch reconstruction with a covariance-style penalty through Ω(α):

L(W,\alpha) \;=\; \frac{1}{n}\sum_{i=1}^{n} \tfrac{1}{2}\,\|x_i - B\,w_i\|_2^2 \;+\; \mathrm{tr}\!\big(S(W)\,\Omega(\alpha)^{-1}\big),
\qquad S(W) = \frac{1}{n}\sum_{i=1}^{n} w_i w_i^{\top}
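A purely structural sketch of a two-layer coding + pooling stack (independent LASSO coding at each layer with made-up sizes; the CVPR'11 model additionally couples the layers through Ω(α), which is omitted here):

```python
# Two stacked "sparse coding + pooling" stages, layer 2 operating on pooled layer-1 codes.
import numpy as np
from sklearn.decomposition import sparse_encode

def code_and_pool(inputs, D, alpha, group=4):
    """Sparse-code the rows of `inputs`, then max-pool |codes| over groups of `group` rows."""
    A = np.abs(sparse_encode(inputs, D, algorithm='lasso_cd', alpha=alpha))
    n = (len(A) // group) * group
    return A[:n].reshape(-1, group, A.shape[1]).max(axis=1)

rng = np.random.default_rng(0)
D1 = rng.normal(size=(64, 25));  D1 /= np.linalg.norm(D1, axis=1, keepdims=True)
D2 = rng.normal(size=(128, 64)); D2 /= np.linalg.norm(D2, axis=1, keepdims=True)

patches = rng.normal(size=(256, 25))              # e.g. flattened 5x5 pixel patches
layer1 = code_and_pool(patches, D1, alpha=0.2)    # first coding + pooling stage
layer2 = code_and_pool(layer1, D2, alpha=0.2)     # second coding + pooling stage
print(layer1.shape, layer2.shape)
```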
63. MNIST Results - classification
Yu, Lin, & Lafferty, CVPR 11
HSC vs. CNN: HSC provides even better performance than CNN;
more remarkably, HSC learns its features in an unsupervised manner!
64. MNIST results -- learned dictionary
Yu, Lin, & Lafferty, CVPR 11
A hidden unit in the second layer is connected to a group of units in the 1st layer: invariance to translation, rotation, and deformation.
66. Adaptive Deconvolutional Networks for Mid and High Level Feature Learning
Matthew D. Zeiler, Graham W. Taylor, and Rob Fergus, ICCV 2011
• Hierarchical convolutional sparse coding.
• Trained with respect to the image from all layers (L1–L4).
• Pooling both spatially and among features.
• Learns invariant mid-level features.
[Figure: image → L1 features → L1 feature maps → select L2 feature groups → L2 feature maps → select L3 feature groups → L3 feature maps → select L4 features → L4 feature maps]
67. Outline
1. Sparse coding for image classification
2. Understanding sparse coding
3. Hierarchical sparse coding
4. Other topics: e.g. structured model, scale-up,
discriminative training
5. Summary
68. Other topics of sparse coding
Structured sparse coding, for example
– Group sparse coding [Bengio et al, NIPS 09]
– Learning hierarchical dictionary [Jenatton, Mairal et al, 2010]
Scale-up sparse coding, for example
– Feature-sign algorithm [Lee et al, NIPS 07]
– Feed-forward approximation [Gregor & LeCun, ICML 10]
– Online dictionary learning [Mairal et al, ICML 2009]
Discriminative training, for example
– Backprop algorithms [Bradley & Bagnell, NIPS 08; Yang et al., CVPR 10]
– Supervised dictionary training [Mairal et al., NIPS 08]
69. Summary of Sparse Coding
Sparse coding is an effective way to do (unsupervised) feature learning.
A building block for deep models
Sparse coding and its local variants (LCC, SVC) have
pushed the boundary of accuracies on Caltech101,
PASCAL VOC, ImageNet, …
Challenge: discriminative training is not straightforward
Speaker notes
Let’s further check what’s happening when best classification performance is achieved.
Hi, I'm here to talk about my poster "Adaptive Deconvolutional Networks for Mid and High Level Feature Learning". This is an extension of previous work on Deconvolutional Networks released at CVPR last year. Deconv Nets are a hierarchical convolutional sparse coding model used to learn features from images in an unsupervised fashion. We added two important extensions:
- The first novel extension is to train the model with respect to the input image from any given layer. Other hierarchical models build on the layer below; this allows stacking the model while still maintaining strong reconstructions.
- The second extension is a pooling operation that acts both spatially and amongst features. We store the switch locations within each pooling region in order to preserve sharp reconstructions through multiple layers; pooling over features provides more competition amongst features and therefore forces groupings to occur.