2. Contents
• Introduction
• Related work
• Overview: Object recognition system
• Object classification & detection
• Conclusions
• Future work
3. Introduction
Research Topic: Visual object category recognition using
weakly supervised learning.
DIPLECS: Artificial cognitive system for autonomous systems.
• Interested in object interactions determined by their functional properties.
• All objects in the same category share the same functional properties.
• Recognition is based on an object’s visual properties.
4. Introduction
Research Topic: Visual object category recognition using
weakly supervised learning.
• A very large training set is required to learn the large appearance variation within a category.
• We therefore utilize huge image collections such as Flickr® and Google™ Images.
• Such images are noisy and incompletely labelled.
• Weakly supervised learning is therefore utilized, as it can handle corrupt and noisy training data.
10. Object model: bag-of-visual words
Creating a visual codebook
The occurrence frequency of visual words is characteristic of the object.
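The codebook-creation step is typically a clustering of local descriptors; a minimal sketch, assuming plain k-means with farthest-point initialisation and toy 2-D "descriptors" in place of real SIFT-like features:

```python
import numpy as np

def build_codebook(descriptors, k, iters=10):
    """Cluster local descriptors into k visual words (toy k-means).
    Farthest-point initialisation, then plain Lloyd iterations."""
    centers = [descriptors[0]]
    for _ in range(k - 1):
        # next center: the descriptor farthest from all chosen centers
        d = np.min([((descriptors - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(descriptors[d.argmax()])
    centers = np.array(centers)
    for _ in range(iters):
        # assign each descriptor to its nearest center, then re-estimate
        labels = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            pts = descriptors[labels == j]
            if len(pts):
                centers[j] = pts.mean(0)
    return centers

def quantize(descriptors, centers):
    """Assign each descriptor to its nearest visual word (hard assignment)."""
    return ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)

# Two well-separated descriptor clouds should yield two visual words.
rng = np.random.default_rng(1)
descs = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(5, 0.1, (50, 2))])
codebook = build_codebook(descs, k=2)
words = quantize(descs, codebook)
```

The occurrence counts of these word indices form the histogram used by the following slides.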
11. Object model: bag-of-visual words
A test image is classified by the distance of its normalized codebook (visual-word histogram) from the codebooks of the positive and negative training samples.
[Figure: visual-word histograms for the positive samples, the negative samples, and the test image]
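The nearest-histogram rule described above can be sketched in a few lines of pure Python; word indices and the vocabulary size below are toy values:

```python
def bow_histogram(word_ids, vocab_size):
    """L1-normalised histogram of visual-word occurrences."""
    h = [0.0] * vocab_size
    for w in word_ids:
        h[w] += 1.0
    total = sum(h) or 1.0
    return [v / total for v in h]

def classify(test_words, pos_words, neg_words, vocab_size):
    """Label the test image by whichever training histogram is closer (L1)."""
    t = bow_histogram(test_words, vocab_size)
    p = bow_histogram(pos_words, vocab_size)
    n = bow_histogram(neg_words, vocab_size)
    d_pos = sum(abs(a - b) for a, b in zip(t, p))
    d_neg = sum(abs(a - b) for a, b in zip(t, n))
    return "positive" if d_pos < d_neg else "negative"
```

The L1 distance is one plausible choice; chi-squared or histogram-intersection distances are common alternatives for bag-of-words histograms.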
12. Object model : bag-of-visual words
Visual codebooks for positive and negative samples of ‘car’ category in
PASCAL VOC 2006
13. Object model : bag-of-visual words
Visual codebooks for ‘car’ and ‘cow’ categories in PASCAL VOC 2009 dataset
15. Improve Classification
Larger visual codebook:
• More representative of the category
• Higher computational cost
ROC curves for the ‘car’ category in PASCAL VOC 2006 for codebook sizes from 20 to 20,000 visual words.
17. Improve Classification
Training and test images in the dataset scaled down by the same factor.
Training and test images scaled down by different factors.
19. Improve Classification
ROC for 20 visual categories in
PASCAL VOC 2009
The PASCAL VOC 2009 dataset is larger and more challenging than the 2006 dataset.
20. Improve Classification
ROC for PASCAL VOC 2009 training and test images scaled down by a factor of 2
ROC for PASCAL VOC 2009 using a
universal visual vocabulary
21. Object localization using sliding window
The poor localization results are due to:
• Lack of structural information in the bag-of-words object model
• The classifier learning the object’s background rather than the object itself
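A sliding-window scorer of the kind used here can be sketched as follows; this is a toy version in which key-points are (x, y, word) triples and the "classifier" is simply the fraction of window words drawn from an assumed set of object words:

```python
def sliding_window_detect(keypoints, object_words, win, step, extent):
    """Slide a win x win window over the image in steps, score each
    position by the fraction of its visual words that are object words,
    and return the best-scoring box.  keypoints: (x, y, word) triples."""
    best_box, best_score = None, -1.0
    width, height = extent
    for x in range(0, width - win + 1, step):
        for y in range(0, height - win + 1, step):
            inside = [w for (px, py, w) in keypoints
                      if x <= px < x + win and y <= py < y + win]
            if not inside:
                continue
            score = sum(w in object_words for w in inside) / len(inside)
            if score > best_score:
                best_box, best_score = (x, y, win, win), score
    return best_box, best_score

# Object words (id 1) cluster top-left; background words (id 0) elsewhere.
kps = [(2, 2, 1), (3, 3, 1), (8, 8, 0), (9, 9, 0)]
box, score = sliding_window_detect(kps, {1}, win=4, step=2, extent=(12, 12))
```

Because the window histogram carries no geometry, any window dense in "object" words scores highly, which is exactly the weakness the slide points out.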
22. Visual codebook
[Diagram: training images with vs. without bounding-boxes; positive background different from vs. similar to the negative images. With bounding-boxes, a good codebook with an equal population of positive and negative visual words results.]
With no bounding-box utilized, the codebook consists of a majority of negative visual words.
23. Visual codebook
[Diagram repeated from the previous slide: training images with vs. without bounding-boxes; positive background different from vs. similar to the negative images]
Classification is then based on object context (background) rather than object features.
24. Improve Classification
The detection at each iteration estimates a bounding box, which provides a better visual codebook, which in turn leads to better detection.
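The alternating scheme can be sketched as a toy loop; the simplifying assumptions here are that the "codebook" is just the set of words seen inside the current box and that "detection" is the bounding box of key-points carrying those words:

```python
def refine(keypoints, seed_box, iters=3):
    """Alternate between (a) building a positive 'codebook' from the
    words inside the current box and (b) re-estimating the box as the
    bounding box of key-points carrying those words."""
    box = seed_box
    for _ in range(iters):
        x0, y0, x1, y1 = box
        # codebook: visual words observed inside the current box
        pos = {w for (x, y, w) in keypoints
               if x0 <= x <= x1 and y0 <= y <= y1}
        # detection: bounding box of all key-points with a positive word
        hits = [(x, y) for (x, y, w) in keypoints if w in pos]
        xs, ys = zip(*hits)
        box = (min(xs), min(ys), max(xs), max(ys))
    return box

# Word 7 marks the object; a rough seed box grows to cover all of it.
kps = [(4, 4, 7), (5, 5, 7), (6, 6, 7), (0, 0, 0), (9, 9, 0)]
final = refine(kps, seed_box=(4, 4, 5, 5))
```

In this toy setting the loop reaches a fixed point once the box covers every key-point whose word appears inside it.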
25. Object detection
• Key-point configurations form a discriminative object feature set.
• A configuration of visual words appends structural information to the bag-of-words model.
• Harvest frequent and discriminative configurations.
• Configurations are encoded as transaction vectors.
• The association between a transaction vector and the training type is an association rule.
• The Apriori algorithm finds association rules with high confidence in a support-confidence framework.
[Figure: transaction vector encoding a key-point configuration]
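One plausible encoding, assumed here purely for illustration, builds one transaction per key-point from the visual words of its k nearest neighbours:

```python
def keypoint_transactions(keypoints, k=2):
    """One transaction per key-point: the visual word of the key-point
    plus the words of its k nearest neighbours (a local configuration).
    keypoints: list of (x, y, word) triples."""
    out = []
    for i, (x, y, w) in enumerate(keypoints):
        # neighbours ordered by squared Euclidean distance
        others = sorted((j for j in range(len(keypoints)) if j != i),
                        key=lambda j: (keypoints[j][0] - x) ** 2
                                      + (keypoints[j][1] - y) ** 2)
        out.append(frozenset([w] + [keypoints[j][2] for j in others[:k]]))
    return out

# Three key-points: the two close ones share a configuration.
trans = keypoint_transactions([(0, 0, 1), (1, 0, 2), (5, 5, 3)], k=1)
```

Each transaction is a set of visual-word ids, which is exactly the input format the Apriori sketch below the next slide expects.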
26. Apriori algorithm
• Uses breadth-first search and a tree structure to count candidate item-sets.
• Longer configurations have lower support, as they are infrequent, but higher confidence, as they are more discriminative.
• Downward closure lemma: prune configurations with infrequent sub-sets.
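A minimal sketch of Apriori over such transactions, with downward-closure pruning and rule confidence; the transactions are toy data, not the authors' implementation:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Frequent item-set mining.  Downward closure: a k-item-set is only
    a candidate if every (k-1)-subset is itself frequent."""
    n = len(transactions)
    freq = {}
    level = [frozenset([i]) for i in sorted({i for t in transactions for i in t})]
    while level:
        counts = {c: sum(c <= t for t in transactions) for c in level}
        kept = {c: v / n for c, v in counts.items() if v / n >= min_support}
        freq.update(kept)
        # breadth-first candidate generation with downward-closure pruning
        level = []
        for a, b in combinations(list(kept), 2):
            cand = a | b
            if (len(cand) == len(a) + 1 and cand not in level
                    and all(frozenset(s) in freq
                            for s in combinations(cand, len(cand) - 1))):
                level.append(cand)
    return freq

def confidence(freq, antecedent, consequent):
    """conf(A -> B) = support(A | B) / support(A)."""
    return freq[antecedent | consequent] / freq[antecedent]

# Toy transactions over visual-word ids.
T = [frozenset(t) for t in ({1, 2, 3}, {1, 2}, {1, 3}, {2, 3}, {1, 2, 3})]
freq = apriori(T, min_support=0.6)
```

With `min_support=0.6` the triple {1, 2, 3} is generated as a candidate (all its pairs are frequent) but discarded for low support, illustrating both halves of the support-confidence framework.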
27. Object localization
[Pipeline: training data set → generate transactions → Apriori data mining → association rules; test image → generate transactions → confidence for each transaction → threshold confidence]
• A confidence is assigned to every
key-point in the image.
• Key-points with sufficiently high
confidence are retained.
• Key-points which occur on
common background objects like
doors and windows can have high
confidence.
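The thresholding step can be sketched as follows; `rules` is assumed to be a map from an antecedent item-set to its confidence for the object class:

```python
def filter_keypoints(transactions, rules, thresh):
    """Retain key-points whose transaction matches at least one
    association rule with confidence >= thresh.
    rules: {antecedent item-set: confidence for the object class}."""
    kept = []
    for i, t in enumerate(transactions):
        # a rule fires when its antecedent is a subset of the transaction
        conf = max((c for a, c in rules.items() if a <= t), default=0.0)
        if conf >= thresh:
            kept.append(i)
    return kept

# Key-point 0 matches a high-confidence rule; key-point 1 does not.
trans = [frozenset({1, 2}), frozenset({3})]
rules = {frozenset({1}): 0.9, frozenset({3}): 0.3}
kept = filter_keypoints(trans, rules, thresh=0.5)
```

As the slide notes, this filter keeps any key-point matching a confident rule, including key-points on frequent background structures such as doors and windows.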
28. Object classification using Apriori
[Pipeline: training data set → generate transactions → Apriori data mining → association rules; test images → generate transactions → confidence for each transaction → sum confidences]
ROC for the ‘car’ category in PASCAL VOC 2006.
The summed confidence score depends
upon object scale in the image, which
explains the comparatively poor
performance of this approach.
29. Conclusions
• The ‘bag-of-words’ model is good for classification but poor for localization.
• Separating foreground from background yields better visual codebooks.
• The good classification on the PASCAL VOC 2006 dataset is attributed to recognition of object context rather than object features.
• The dataset utilized should have sufficient variation in the appearance of the object and its background.
• A larger visual vocabulary gives slightly better classification but is computationally more expensive.
• The visual vocabulary built has a majority of background visual words, since bounding-boxes are not utilized during training.
30. Conclusions
• Improving the proportion of visual words representing the object in the vocabulary is vital for good classification.
• Incorporate the object’s boundary contour into the descriptor.
• Frequent and discriminative key-point configurations are a promising approach for object localization.
• A low-quality dataset results in a weak visual codebook and classifiers biased to the training data.
• Classification using key-point configurations was poor compared to ‘bag-of-words’ on PASCAL VOC 2006.
31. Future Work
• Improve the visual codebook by increasing the proportion of visual words pertaining to object features. Combine Apriori-based localization and clustering for visual word selection in an iterative approach.
• Model visual scene information (e.g. using the GIST descriptor by Torralba). Learn co-occurrence statistics of a scene and a visual category: recognition of the scene then serves as a prior for object presence and improves recognition performance.
• Improve object localization by using context priming.
• Model object contextual information to aid foreground-background disambiguation for better object localization.
32. Future Work
• Share feature information between visual categories; the size of a universal visual vocabulary should then grow sub-linearly with the number of visual categories.
• Combine image segmentation and classification to improve the object model and provide better classification performance.
• Build a hierarchical framework for visual categorization:
• Representation: combine local and global features.
• Model: combine semantic and structural object models.
• Classification: combine generative and discriminative approaches.