This presentation accompanies Eren Golge's Master of Science dissertation. It proposes two new procedures for learning visual concept models from noisy image sources without any human annotation.
Slide 1
Mining Web Images for Concept Learning
A work by Eren Golge
Advisor: Asst. Prof. Dr. Pinar Duygulu
Slide 2
Motivation
● Problem
  – It is hard to obtain large collections of annotated images.
● Solution
  – Use weakly labeled images from the Internet.
● But
  – Polysemy and irrelevancy in Internet images.
  – Visual variations of the targeted concepts (sub-grouping).
● Then
  – Use our methods, CMAP and AME :)
Slide 4
Meta Pipeline
Weakly labeled images → Gather data → Refine data (against polysemy, irrelevancy, sub-grouping) → Learn classifiers → High-quality concept models
Slide 5
Short Retrospective
● Use an annotated control set as a starting point.
  – Fergus et al. [1]; OPTIMOL, Li and Fei-Fei [2].
  – We use a fully autonomous framework.
● Use textual captions.
  – Berg et al. [3].
  – We use only visual content.
● Discriminative image cues.
  – Singh, Gupta and Efros [4], "Discriminative Patches"; Q. Li et al. [5].
  – We use a single computer, with faster and better results.
● CMAP and AME have broader possible applications.
[1] Fergus, R., Fei-Fei, L., Perona, P., Zisserman, A.: Learning object categories from Google's image search. In: ICCV 2005.
[2] Li, L.J., Fei-Fei, L.: OPTIMOL: automatic online picture collection via incremental model learning. International Journal of Computer Vision 88(2) (2010) 147–168.
[3] Berg, T.L., Berg, A.C., Edwards, J., Maire, M., White, R., Teh, Y.W., Learned-Miller, E.G., Forsyth, D.A.: Names and faces in the news. In: CVPR 2004, Vol. 2, 848–854.
[4] Singh, S., Gupta, A., Efros, A.A.: Unsupervised discovery of mid-level discriminative patches. In: ECCV 2012.
[5] Li, Q., Wu, J., Tu, Z.: Harvesting mid-level visual concepts from large-scale Internet images. In: CVPR 2013.
Slide 7
Method #1: CMAP
Clustering + outlier detection → Concept Map (CMAP)
Clustering addresses polysemy and sub-grouping; outlier detection addresses irrelevancy.
Accepted for
Draft version: http://arxiv.org/abs/1312.4384
Slide 8
CMAP's Motivation
● A generic method, applicable to other domains as well (textual, biological, etc.).
● An extension of SOM (a.k.a. Kohonen's Map) *.
● Inspired by biological phenomena **.
● Able to cluster data and detect outliers:
  – Outlier clusters.
  – Outlier instances in salient clusters.
● Irrelevancy and Sub-Grouping SOLVED!!
* Kohonen, T.: Self-organizing maps. Springer (1997).
** Hubel, D.H., Wiesel, T.N.: Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. The Journal of Physiology 160(1) (1962) 106.
Slide 9
CMAP cont'd: finding outlier units
● Examine the activation statistics of each SOM unit during the learning phase.
● Later learning iterations are more reliable.
IF a unit is activated
  RARELY → OUTLIER
  FREQUENTLY → SALIENT
(Both winner activations and neighbor activations are counted.)
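The rarity rule above can be sketched in a few lines. This is an illustrative toy, not the thesis implementation: the function name, the binary winner matrix, and the 0.2 threshold are my assumptions, and the actual method also weighs neighbor activations.

```python
import numpy as np

def find_outlier_units(activations, threshold=0.2):
    """Flag SOM units that win too rarely during learning.

    activations: (n_iterations, n_units) binary matrix; entry [t, u] is 1
    if unit u was the winner at learning iteration t. Later iterations
    get larger weights, since they are more reliable.
    """
    n_iter, _ = activations.shape
    weights = np.arange(1, n_iter + 1, dtype=float)   # later iterations count more
    weighted_counts = weights @ activations           # per-unit weighted win count
    scores = weighted_counts / weighted_counts.sum()  # normalized activation share
    return scores < threshold                         # True => outlier unit

# Toy run: 3 units over 10 iterations; unit 2 wins only once.
acts = np.zeros((10, 3))
for t, u in enumerate([0, 1, 0, 1, 0, 1, 0, 1, 2, 0]):
    acts[t, u] = 1
print(find_outlier_units(acts))  # unit 2 is flagged as an outlier
```

Units flagged this way yield outlier clusters; the same statistic applied within a salient cluster can flag individual outlier instances.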
Slide 13
Learning Models
● Learn L1 linear SVM models:
  – Easier to train.
  – Better for high-dimensional data (wide data matrix).
  – Implicit feature selection via the L1 norm.
● Learn one linear model from each salient cluster.
● Each concept has multiple models.
  – Polysemy SOLVED!!
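A minimal sketch of per-cluster model learning, assuming scikit-learn's `LinearSVC` for the L1 linear SVM; the helper names and toy data are illustrative, not the thesis code.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_cluster_models(salient_clusters, negatives, C=1.0):
    """Train one L1-regularized linear SVM per salient cluster.

    salient_clusters: list of (n_i, d) arrays, one per cluster of a concept.
    negatives: (m, d) array (e.g. outliers plus samples of other concepts).
    """
    models = []
    for pos in salient_clusters:
        X = np.vstack([pos, negatives])
        y = np.concatenate([np.ones(len(pos)), np.zeros(len(negatives))])
        # penalty='l1' yields sparse weights, i.e. implicit feature selection.
        models.append(LinearSVC(penalty='l1', dual=False, C=C).fit(X, y))
    return models

def concept_score(models, x):
    # Score a test point by the best of the concept's per-cluster models.
    return max(m.decision_function(x.reshape(1, -1))[0] for m in models)

# Toy concept with two visual sub-groups, plus background negatives.
rng = np.random.default_rng(0)
c1 = rng.normal([5, 0], 0.3, size=(20, 2))
c2 = rng.normal([0, 5], 0.3, size=(20, 2))
neg = rng.normal([0, 0], 0.3, size=(40, 2))
models = train_cluster_models([c1, c2], neg)
print(concept_score(models, np.array([5.0, 0.0])))  # high: matches sub-group 1
```

Taking the maximum over the per-cluster models is what lets one concept keep multiple visual senses.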
Slide 21
Implementation Details
● Visual features:
  – BoW SIFT with 4000 words (for texture attributes, objects and faces).
  – 3D 10x20x20 Lab histograms (for attributes).
  – 256-dimensional LBP [1] (for objects and faces).
● Preprocessing:
  – Attribute: extract random 100x100 non-overlapping image patches from each image.
  – Scene: represent each image with the confidence scores of the attribute classifiers, in a Spatial Pyramid sense.
  – Face: apply face detection [2] to each image and keep the single highest-scoring patch.
  – Object: apply unsupervised saliency detection [3] to the images and keep the single highest-activation region.
● Model learning:
  – Use the outliers plus a sample of instances from other concepts as the negative set.
  – Apply hard mining [4].
  – Tune all hyperparameters (classifier and RSOM parameters) via cross-validation.
● NOTICE:
  – We use Google Images to train the concept models, and so must deal with DOMAIN ADAPTATION.
[1] Ojala, T., Pietikainen, M., Maenpaa, T.: Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. Pattern Analysis and
Machine Intelligence, IEEE Transactions on 24(7) (2002) 971–987
[2] Zhu, X., Ramanan, D.: Face detection, pose estimation, and landmark localization in the wild. In: Computer Vision and Pattern Recognition (CVPR), 2012
IEEE Conference on, IEEE (2012) 2879–2886
[3] Erdem, E., Erdem, A.: Visual saliency estimation by nonlinearly integrating features using region covariances. Journal of Vision 13(4) (2013) 1–20
[4] Felzenszwalb, Pedro F., et al. "Object detection with discriminatively trained part-based models." Pattern Analysis and Machine Intelligence, IEEE
Transactions on 32.9 (2010): 1627-1645.
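The hard-mining step can be sketched as follows: an illustrative re-implementation of the idea from Felzenszwalb et al. on toy 2-D data. The function name, round count, and pool sizes are my assumptions, and the L1 penalty and real image features are omitted for brevity.

```python
import numpy as np
from sklearn.svm import LinearSVC

def hard_negative_mining(pos, neg_pool, n_rounds=3, init_neg=50):
    """Start from a small random negative set, then repeatedly retrain
    and add the pool negatives the current model misclassifies."""
    rng = np.random.default_rng(0)
    neg = neg_pool[rng.choice(len(neg_pool), size=init_neg, replace=False)]
    clf = None
    for _ in range(n_rounds):
        X = np.vstack([pos, neg])
        y = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])
        clf = LinearSVC(dual=False).fit(X, y)
        # "Hard" negatives: pool points the model scores as positive.
        hard = neg_pool[clf.decision_function(neg_pool) > 0]
        if len(hard) == 0:
            break
        neg = np.vstack([neg, hard])
    return clf

# Toy run: positives near (3, 3), a large negative pool near the origin.
pos = np.random.default_rng(1).normal([3, 3], 0.4, size=(60, 2))
neg_pool = np.random.default_rng(2).normal([0, 0], 1.0, size=(500, 2))
clf = hard_negative_mining(pos, neg_pool)
print(clf.predict([[3, 3], [0, 0]]))
```

The point of mining is that the full negative pool never has to be held in one training problem; only the informative negatives accumulate.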
Slide 23
Attribute Learning
                     Ours   State of the art
Attribute ImageNet   0.37   0.36 [4]
Attribute EBAY       0.81   0.79 [3]
Attribute Bing       0.82   -
[3] Van De Weijer, J., Schmid, C., Verbeek, J., Larlus, D.: Learning color names for real-world applications. IEEE Transactions on Image Processing (2009).
[4] Russakovsky, O., Fei-Fei, L.: Attribute learning in large-scale datasets. In: Trends and Topics in Computer Vision. Springer (2012).
Slide 24
Scene Learning
                     MIT-Indoor   Scene-15
CMAP-A               46.2%        82.7%
CMAP-S               40.8%        80.7%
CMAP-S+HM            41.7%        81.3%
Li et al. [1]        47.6%        82.1%
Pandey et al. [2]    43.1%        -
Kwitt et al. [3]     44%          82.3%
Lazebnik et al. [4]  -            81%
Singh et al. [5]     38%          77%
CMAP-A: attribute-based scene learning.
CMAP-S: scene learning directly from CMAP.
CMAP-S+HM: scene learning from CMAP with hard mining.
[1] Q. Li, J. Wu, and Z. Tu, "Harvesting mid-level visual concepts from large-scale internet images," CVPR, 2013.
[2] M. Pandey and S. Lazebnik, "Scene recognition and weakly supervised object localization with deformable part-based models," ICCV, 2011.
[3] R. Kwitt, N. Vasconcelos, and N. Rasiwasia, "Scene recognition on the semantic manifold," ECCV, 2012.
[4] S. Lazebnik, C. Schmid, and J. Ponce, "Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories," CVPR, 2006.
[5] S. Singh, A. Gupta, and A. A. Efros, "Unsupervised discovery of mid-level discriminative patches," ECCV, 2012.
Slide 25
Object Learning
           CMAP   [1]    [2]
airplane   0.63   0.51   0.76
car        0.97   0.98   0.94
face       0.67   0.52   0.82
guitar     0.89   0.81   0.60
leopard    0.76   0.74   0.89
motorbike  0.98   0.98   0.67
watch      0.55   0.48   0.53
overall    0.78   0.72   0.75
→ Dataset provided by [1].
[1] R. Fergus, L. Fei-Fei, P. Perona, and A. Zisserman, "Learning object categories from Google's image search," ICCV, 2005.
[2] L.-J. Li and L. Fei-Fei, "OPTIMOL: automatic online picture collection via incremental model learning," International Journal of Computer Vision, 88(2):147–168, 2010.
Slide 26
Face Learning
       GBC+CF (half) [1]   CMAP-1   CMAP-2   Baseline
EASY   0.58                0.63     0.66     0.31
HARD   0.32                0.34     0.38     0.18
→ Face learning results, with faces detected using the OpenCV detector.
[1] M. Ozcan, J. Luo, V. Ferrari, and B. Caputo, "A large-scale database of images and captions for automatic face naming," BMVC, 2011.
Slide 27
Selective Search
From Uijlings, Jasper R.R., et al., "Selective search for object recognition," International Journal of Computer Vision, 104(2):154–171, 2013.
Slide 28
Selective Search with CMAP
● Remove outlier candidate regions from the detection tree of Selective Search [1].
● ~3,500 fewer candidate regions per image, with better Recall and MABO*.
                          MABO   Recall   No. of Windows
Objectness [2]            0.69   0.94     1,853
Selective Search [1]      0.87   0.99     10,097
Selective Search + CMAP   0.89   0.99     6,753
* MABO: Mean Average Best Overlap.
[1] Uijlings, Jasper R.R., et al., "Selective search for object recognition," International Journal of Computer Vision, 104(2):154–171, 2013.
[2] B. Alexe, T. Deselaers, and V. Ferrari, "Measuring the objectness of image windows," IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(11):2189–2202, 2012.
Slide 29
Results Summary
                     Ours   State of the art
Face                 0.66   0.58 [1]
Scene                0.47   0.48 [5]
Object               0.78   0.75 [2]
Attribute ImageNet   0.37   0.36 [3]
Attribute EBAY       0.81   0.79 [4]
Attribute Bing       0.82   -
[1] Ozcan, M., Luo, J., Ferrari, V., Caputo, B.: A large-scale database of images and captions for automatic face naming. BMVC (2011).
[2] Fergus, R., Fei-Fei, L., Perona, P., Zisserman, A.: Learning object categories from Google's image search. ICCV (2005).
[3] Russakovsky, O., Fei-Fei, L.: Attribute learning in large-scale datasets. In: Trends and Topics in Computer Vision. Springer (2012).
[4] Van De Weijer, J., Schmid, C., Verbeek, J., Larlus, D.: Learning color names for real-world applications. IEEE Transactions on Image Processing (2009).
[5] Li, Q., Wu, J., Tu, Z.: Harvesting mid-level visual concepts from large-scale internet images. CVPR (2013).
Slide 31
Method #2: AME (Association through Model Evolution)
● Iterative data cleansing.
● Measures discriminativeness and representativeness.
● Discriminates category instances from random instances.
Slide 32
AME's Motivation
● Another annotation-agnostic data-refining method, targeting irrelevancy.
● Makes use of abundant random instances, as opposed to limited annotated instances.
● Evades sub-grouping by using very high-dimensional representations.
Slide 33
AME's Method Overview
● First, discern the category candidates (CC) from a random set (RS).
● Define the category references (CR).
● Second, discern the CR from the CC.
● Define spurious instances (SI) against the CR and eliminate them.
● Re-iterate.
Irrelevancy SOLVED!!
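The overview above can be sketched as the following loop. This is an illustrative toy, not the authors' implementation (which uses Gauss-Seidel L1 logistic regression and an L1 SVM with grafting); all names, sizes, and hyper-parameters here are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ame_refine(candidates, random_set, n_iters=5, n_elim=5):
    """AME-style iterative cleansing of category candidates (CC)."""
    cc = candidates.copy()
    for _ in range(n_iters):
        # Step 1: model M1 separating CC from the random set (RS);
        # the most confidently classified candidates become the CR.
        X = np.vstack([cc, random_set])
        y = np.concatenate([np.ones(len(cc)), np.zeros(len(random_set))])
        m1 = LogisticRegression(penalty='l1', solver='liblinear').fit(X, y)
        conf = m1.decision_function(cc)
        cr_idx = np.argsort(conf)[-max(len(cc) // 4, 2):]
        rest_idx = np.setdiff1d(np.arange(len(cc)), cr_idx)
        if len(rest_idx) <= n_elim:
            break
        # Step 2: model separating CR from the rest of CC; the candidates
        # it scores lowest are the spurious instances (SI) to eliminate.
        X2 = np.vstack([cc[cr_idx], cc[rest_idx]])
        y2 = np.concatenate([np.ones(len(cr_idx)), np.zeros(len(rest_idx))])
        m2 = LogisticRegression(penalty='l1', solver='liblinear').fit(X2, y2)
        scores = m2.decision_function(cc)
        cc = cc[np.argsort(scores)[n_elim:]]  # drop the n_elim lowest
    return cc

# Toy run: 40 true category images near (4, 4) plus 10 spurious ones that
# look like the random set; AME should drop the spurious ones.
rng = np.random.default_rng(0)
cands = np.vstack([rng.normal([4, 4], 0.5, size=(40, 2)),
                   rng.normal([0, 0], 0.5, size=(10, 2))])
rs = rng.normal([0, 0], 2.0, size=(200, 2))
clean = ame_refine(cands, rs, n_iters=2, n_elim=5)
print(clean.shape)
```

The random set here plays the role of the abundant, unlabeled negatives that AME exploits instead of annotated data.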
Slide 34
Step 1
● Discerning the category from the random set:
  – Learn a linear model M1 between CC and RS.
  – Take the most confidently classified instances as the CR.
Slide 40
High-Dimensional Representation
● Our problem is discrimination.
● Therefore:
  – High-dimensional representations make any category linearly separable from the others, despite category multi-modality.
Sub-grouping SOLVED!!
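A toy illustration of this point (my example, not from the thesis): XOR-like data with two sub-groups per class is not linearly separable in 2-D, but becomes separable after lifting it to a higher-dimensional space, here by simply appending a product feature.

```python
import numpy as np
from sklearn.svm import LinearSVC

# XOR-like data: each class consists of two sub-groups.
X = np.array([[0, 0], [1, 1], [0, 1], [1, 0]], dtype=float)
y = np.array([0, 0, 1, 1])
X_rep, y_rep = np.tile(X, (25, 1)), np.tile(y, 25)

# In the original 2-D space no linear model can separate the classes.
low = LinearSVC(dual=False).fit(X_rep, y_rep).score(X_rep, y_rep)

# Lift to 3-D by appending the product feature x1 * x2.
X_hi = np.hstack([X_rep, X_rep[:, :1] * X_rep[:, 1:]])
high = LinearSVC(dual=False).fit(X_hi, y_rep).score(X_hi, y_rep)

print(low, high)  # low < 1.0, high == 1.0
```

Very high-dimensional image representations act like this lift at scale, which is why AME can ignore sub-grouping and still use a single linear model per category.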
Slide 41
Feature Learning
● Learn frequent patterns in the data.
● Learning pipeline (similar to [1]):
  1. Sample random n×n patches from the images.
  Over the collected patches:
  2. Contrast normalization.
  3. ZCA whitening.
  4. K-means for C words → the learned visual words.
  Over the whole image:
  5. Spatial (max or avg) pooling over the C words.
→ Each image is represented by a {5 × C}-dimensional vector.
[1] Coates, Adam, Andrew Y. Ng, and Honglak Lee. "An analysis of single-layer networks in unsupervised feature learning." AISTATS, 2011.
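The dictionary-learning part of the pipeline (steps 1–4) can be sketched end-to-end in NumPy. This is a minimal version with arbitrary toy parameters: the patch size, word count, and iteration counts are my assumptions; only the ε = 0.5 whitening constant follows the implementation details given later.

```python
import numpy as np

def learn_dictionary(images, n_patches=2000, p=6, C=50, seed=0):
    """Learn C visual words from random p x p patches (Coates-style)."""
    rng = np.random.default_rng(seed)
    # 1. Sample random p x p patches from the gray-level images.
    patches = []
    for _ in range(n_patches):
        img = images[rng.integers(len(images))]
        i = rng.integers(img.shape[0] - p)
        j = rng.integers(img.shape[1] - p)
        patches.append(img[i:i + p, j:j + p].ravel())
    X = np.array(patches, dtype=float)
    # 2. Contrast normalization: zero mean, unit variance per patch.
    X = (X - X.mean(1, keepdims=True)) / (X.std(1, keepdims=True) + 1e-8)
    # 3. ZCA whitening (eps = 0.5, as in the implementation details).
    X -= X.mean(0)
    vals, vecs = np.linalg.eigh(X.T @ X / len(X))
    zca = vecs @ np.diag(1.0 / np.sqrt(vals + 0.5)) @ vecs.T
    X = X @ zca
    # 4. K-means for C words (a few Lloyd iterations).
    words = X[rng.choice(len(X), C, replace=False)]
    for _ in range(10):
        assign = ((X[:, None, :] - words[None]) ** 2).sum(-1).argmin(1)
        for k in range(C):
            if (assign == k).any():
                words[k] = X[assign == k].mean(0)
    return words, zca

# Toy run on random "images".
rng = np.random.default_rng(3)
imgs = [rng.random((60, 80)) for _ in range(3)]
words, zca = learn_dictionary(imgs, n_patches=300, p=6, C=20)
print(words.shape)  # (20, 36)
```

Step 5 would then encode every image patch against these words and spatially pool the responses to produce the {5 × C}-dimensional image vector.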
Slide 43
Implementation Details
● AME:
  – L1 logistic regression with the Gauss-Seidel algorithm [1].
  – Final model: L1 linear SVM with Grafting [2].
  – At each iteration, 5 images are eliminated.
● Feature learning:
  – Use horizontally flipped images.
  – Resize each gray-level image to 60 px height.
  – Contrast normalization on the random patches.
  – ZCA whitening with ε = 0.5.
  – Receptive field size of 6x6 pixels.
  – 1 px stride, with 2400 words.
[1] Shirish Krishnaj Shevade and S. Sathiya Keerthi. A simple and efficient algorithm for gene selection using sparse logistic regression. Bioinformatics, 19(17):2246–2253, 2003.
[2] Simon Perkins, Kevin Lacker, and James Theiler. Grafting: Fast, incremental feature selection by gradient descent in function space. The Journal of Machine Learning Research, 3:1333–1356, 2003.
Slide 45
Datasets
● FAN-Large [2]:
  – EASY subset: faces larger than 60x70 px, 138 categories.
  – ALL: no constraint, 365 categories.
● PubFig83 [1]:
  – Subset of PubFig with 83 celebrities, each with at least 100 images.
[1] N. Pinto, Z. Stone, T. Zickler, and D. Cox, "Scaling up biologically-inspired computer vision: A case study in unconstrained face recognition on Facebook," CVPRW, 2011.
[2] M. Ozcan, J. Luo, V. Ferrari, and B. Caputo, "A large-scale database of images and captions for automatic face naming," BMVC, 2011.
Slide 46
Classification Pipeline
● No data refining.
● Models are trained on the training set of the given dataset.
● Results on PubFig83:
  – ~5% improvement over the state of the art.
  – Better results with more words.
[1] N. Pinto, Z. Stone, T. Zickler, and D. Cox, "Scaling up biologically-inspired computer vision: A case study in unconstrained face recognition on Facebook," CVPRW, 2011.
[2] B. C. Becker and E. G. Ortiz, "Evaluating open-universe face identification on the web," CVPRW, 2013.
Slide 47
AME Results
● The baseline is the same classification pipeline without any data refining.
● All models are learned from web images.
Slide 48
AME: False vs. True Eliminations
→ Cumulative plot of correct versus false outlier detections, until AME finds all the outliers for all classes. Each iteration's values are aggregated with those of the previous iteration.
Slide 49
AME: Cross-Validation Accuracies
→ Cross-validation (final-model) and M1 accuracies as the algorithm proceeds. This shows a strong correlation between the cross-validation classifier and the M1 models, without the M1 models incurring over-fitting.
Slide 50
AME: Number of Eliminations vs. Accuracy
→ Effect of the number of outliers removed at each iteration on the final test accuracy. Elimination beyond some limit degrades final performance; eliminating 1 instance per iteration is the safest choice in the absence of any sanity check.
Slide 52
Which One to Choose?
● Polysemy + irrelevancy in the data → CMAP
● Only irrelevancy in the data → AME
● Another option:
  – Use AME first, then CMAP.
  – Not tested!
Slide 53
At the End
● We propose two novel algorithms, CMAP and AME.
● Compelling results against state-of-the-art methods on a variety of vision tasks.
● We learn complex visual concepts from a simple query.