1. Beyond Nouns: Exploiting Prepositions and Comparative Adjectives for Learning Visual Classifiers Abhinav Gupta and Larry S. Davis University of Maryland, College Park Proceedings of ECCV 2008 Presented by: Debaleena Chattopadhyay
2. Presentation Outline - The Problem Definition - The Novelty - The Problem Solution - The Results
3. The Problem Definition To learn visual classifiers for object recognition from weakly labeled data Input: an image with the labels city, mountain, sky, sun Expected output: the image regions labeled sun, sky, mountain, city
4. Novelty To learn visual classifiers for object recognition from weakly labeled data, utilizing additional language constructs Input: an image with labels: (Nouns) city, mountain, sky, sun (Relations) below(mountain, sky), below(mountain, sun), above(sky, city), above(sun, city), brighter(sun, mountain), brighter(sun, city), behind(mountain, city), convex(sun, city), in(sun, sky), smaller(sun, sky) Expected output: the image regions labeled sun, sky, mountain, city
6. Overview Nouns: SEA, SKY, SUN Pairs of nouns: (SEA, SUN), (SEA, SKY), (SKY, SEA), (SKY, SUN), (SUN, SKY), (SUN, SEA) Relationships: in, above, below
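Since relationships such as "above" or "in" are not symmetric, every ordered pair of distinct nouns is considered. A minimal sketch of this enumeration (the function name is illustrative):

```python
from itertools import permutations

def noun_pairs(nouns):
    """Enumerate all ordered pairs of distinct nouns.

    Relationships like 'above' or 'in' are not symmetric, so both
    (SKY, SEA) and (SEA, SKY) appear in the output.
    """
    return list(permutations(nouns, 2))

# Three nouns yield 3 * 2 = 6 ordered pairs, matching the slide.
pairs = noun_pairs(["SEA", "SKY", "SUN"])
```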
14. E-step: Update the assignments of nouns to image regions, given the current classifier parameters CA and CR
16. Learning the Model EM approach: simultaneously solve the correspondence problem and learn the parameters of the classifiers (noun and relationship). E-step: Compute the noun assignments using the parameters from the previous iteration: P(noun i assigned to region j) = Σ_{A ∈ Aij} P(A | I, N, R), where Aij is the subset of all possible assignments for an image in which noun i is assigned to region j, I denotes the image features, and N and R the annotated nouns and relationships.
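The marginalization in the E-step can be sketched as follows. This is a toy illustration, not the paper's exact formulation: the score arrays and the tuple representation of assignments are assumptions.

```python
import numpy as np

def e_step(noun_scores, rel_scores, assignments):
    """Sketch of the E-step: P(noun i -> region j) by marginalizing
    over full assignments.

    noun_scores[i, j] : likelihood of noun i under region j's classifier
    rel_scores[k]     : likelihood of the annotated relationships under
                        assignments[k]
    assignments       : candidate assignments; assignments[k][i] is the
                        region that noun i maps to
    """
    n_nouns, n_regions = noun_scores.shape
    # Unnormalized posterior of each full assignment A: the product of
    # its noun likelihoods times its relationship likelihood.
    post = np.array([
        rel_scores[k] * np.prod([noun_scores[i, a[i]] for i in range(n_nouns)])
        for k, a in enumerate(assignments)
    ])
    post /= post.sum()
    # Marginalize: sum the posterior over the subset Aij of assignments
    # in which noun i is assigned to region j.
    p = np.zeros((n_nouns, n_regions))
    for k, a in enumerate(assignments):
        for i in range(n_nouns):
            p[i, a[i]] += post[k]
    return p
```

With uniform noun scores, the relationship likelihoods alone decide which assignment is favored, which is exactly the leverage the paper gets from prepositions and comparative adjectives.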
18. Learning the Model EM approach: simultaneously solve the correspondence problem and learn the parameters of the classifiers (noun and relationship). M-step: Update the model parameters given the updated assignments from the E-step. The maximum-likelihood parameters depend on the classifier used. To utilize contextual information for labeling test images, priors on relationships, P(r | ns, np), are also learned from a co-occurrence table after the relationship annotations are generated.
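For a Gaussian noun classifier, the update could look like this sketch; weighting the update by the E-step posteriors is an assumption about how the soft assignments enter the maximum-likelihood estimate.

```python
import numpy as np

def m_step_gaussian(features, resp):
    """Sketch of the M-step update for a Gaussian noun classifier.

    features : (n_regions, d) array of region feature vectors
    resp     : (n_regions,) soft assignments P(noun -> region)
               from the E-step
    Returns the responsibility-weighted ML mean and variance.
    """
    w = resp / resp.sum()
    mean = (w[:, None] * features).sum(axis=0)
    var = (w[:, None] * (features - mean) ** 2).sum(axis=0)
    return mean, var
```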
31. Range of semantics identified: both algorithms give similar performance (L)
32. Frequency correct: the latter algorithm performs better in the number of times a noun is identified (R). Compared: Nouns only; Nouns & Relationships (human); Nouns & Relationships (learned); the proposed EM algorithm bootstrapped by IBM Model 1; and the proposed EM algorithm bootstrapped by Duygulu et al.
37. Experimental Results Precision-Recall: Precision ratio: the ratio of the number of images correctly annotated with a word to the number of images the algorithm annotated with that word (with respect to human observers). Recall ratio: the ratio of the number of images correctly annotated with a word by the algorithm to the number of images that should have been annotated with that word (with respect to the Corel annotations).
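The two ratios reduce to simple counts per annotation word, as in this sketch (names are illustrative):

```python
def precision_recall(n_predicted, n_correct, n_should_have):
    """Sketch of the precision and recall ratios for one annotation word.

    n_predicted   : images the algorithm annotated with the word
    n_correct     : of those, images judged correct by human observers
    n_should_have : images that should carry the word, per the Corel
                    ground-truth annotations
    """
    precision = n_correct / n_predicted if n_predicted else 0.0
    recall = n_correct / n_should_have if n_should_have else 0.0
    return precision, recall
```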
39. This paper proposes an EM-based method to simultaneously learn visual classifiers for nouns, prepositions, and comparative adjectives.
The task is to determine the correspondence between image regions and semantic object classes. Problem: there are significant ambiguities in the correspondence between visual features and object classes.
Instead of using only the co-occurrence of nouns and image features over large image databases to determine the correspondence, additional language constructs are considered, namely "prepositions" and "comparative adjectives". The paper simultaneously learns the visual features defining "nouns" and the differential visual features defining "binary relationships" using an EM approach.
This is not applicable to binary relationships if models for the nouns are not given. Related work has used spatial relationships between image patches for scene recognition: a feature-mining approach extracts discriminative image patches, and the relationships between them are interpreted as adjectives or prepositions. The authors mined relationships among more than two image patches as well. An SVM is trained on the mined data, with different types of adjectives and prepositions encoded; the encoding is based on an image representation of multi-scale local patches and the spatial pyramid representation. SIFT descriptors represent each appearance patch. First, visual code words are recognized in an image, and then relationships are extracted using the Apriori mining algorithm. Other related work introduces an approach to jointly learn detectors for object classes and attributes (color and texture) based on a co-training algorithm. Object-to-attribute is a one-way association here, i.e. a red table or a metallic table, but not both. Here, too, the image is divided into a number of windows, and joint multiple-instance learning forces the learners for both the object class and the attribute class to cooperate in labeling windows that must contain both the object and the attribute. The method focuses on windows that are salient and homogeneous to select candidate windows. In most cases the object-detection average precision is better than with the separate-learning approach, and moreover an attribute-object combination not in the training set can also be detected by combining visual-attribute and object detectors learned from the other categories.
Visual features are based on appearance and shape. Initialization is with random assignments.
Word sense disambiguation is not taken into account.
Aij refers to the subset of the set of all possible assignments for an image in which noun i is assigned to region j.
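Assuming a one-to-one assignment of nouns to distinct regions, the subset Aij can be enumerated directly, as in this toy sketch (the paper's assignment space may differ):

```python
from itertools import permutations

def subset_Aij(n_nouns, n_regions, i, j):
    """Enumerate Aij: all assignments (noun index -> region index)
    in which noun i is mapped to region j.  Assignments are taken to
    be one-to-one, i.e. distinct nouns go to distinct regions."""
    return [a for a in permutations(range(n_regions), n_nouns) if a[i] == j]
```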
For a Gaussian classifier we estimate the mean and variance. Initialization is random; the authors use the result of Barnard's translation-based model, though any image annotation approach with localization would work. After learning the maximum-likelihood parameters, the relationship classifier and the assignment are used to find possible relationships between all pairs of words. From these generated relationship annotations a co-occurrence table is formed, which is used to compute P(r | ns, np).
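The relationship prior could be estimated from the generated annotations as normalized co-occurrence counts, roughly as in this sketch (the triple representation is an assumption):

```python
from collections import Counter

def relationship_prior(annotations):
    """Sketch: estimate P(r | ns, np) from generated relationship
    annotations, given as (relation, subject_noun, object_noun)
    triples, by normalizing co-occurrence counts per noun pair."""
    pair_counts = Counter((ns, np_) for _, ns, np_ in annotations)
    triple_counts = Counter(annotations)
    return {
        (r, ns, np_): c / pair_counts[(ns, np_)]
        for (r, ns, np_), c in triple_counts.items()
    }
```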
For each region, we have two nodes corresponding to the noun and the image features from that region. For all possible pairs of regions, we have another two nodes representing a relationship word and differential features from that pair of regions. An example of a Bayesian network with 3 regions: the rjk represent the possible words for the relationship between regions (j, k). Due to the non-symmetric nature of relationships, we consider both (j, k) and (k, j) pairs (in the figure only one is shown). The magenta blocks in the image represent differential features (Ijk).
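The node layout described above can be sketched as follows (node naming is illustrative):

```python
from itertools import permutations

def network_nodes(n_regions):
    """Sketch of the Bayesian network's node layout: one (noun, image
    features) node pair per region, plus one (relationship word,
    differential features) node pair per ordered region pair (j, k),
    since relationships are not symmetric."""
    region_nodes = ([("noun", j) for j in range(n_regions)]
                    + [("image_features", j) for j in range(n_regions)])
    pair_nodes = ([("relationship", j, k)
                   for j, k in permutations(range(n_regions), 2)]
                  + [("diff_features", j, k)
                     for j, k in permutations(range(n_regions), 2)])
    return region_nodes, pair_nodes
```

With 3 regions this yields 6 per-region nodes and, over the 6 ordered region pairs, 12 pair nodes.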
The relationship model is based on differential features. The parameter-learning M-step therefore also involves feature selection for the relationship classifiers.
The first measure counts the number of words that are labeled properly by the algorithm; each word has equal importance regardless of the frequency with which it occurs. In the second case, a word which occurs more frequently is given higher importance. Using the first measure, both algorithms have similar performance because each correctly labels one word. However, using the second measure the latter algorithm is better, as "sky" is more common and hence the number of correctly identified regions is higher for the latter algorithm. A co-occurrence-based translation model (IBM Model 1) and a translation-based model with mixing probabilities (Duygulu et al.) form the baseline algorithms.
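The two evaluation measures can be sketched as follows (names are illustrative):

```python
def evaluation_measures(correct_words, word_freq):
    """Sketch of the two evaluation measures.

    correct_words : set of words the algorithm labels correctly
    word_freq     : dict word -> number of occurrences in the corpus

    The first measure counts each correctly labeled word once; the
    second weights each word by its frequency, so getting a common
    word like 'sky' right contributes more.
    """
    range_correct = len(correct_words)
    frequency_correct = sum(word_freq[w] for w in correct_words)
    return range_correct, frequency_correct
```

Two algorithms that each label one word correctly tie on the first measure, but the one that gets the more frequent word wins on the second, as in the sky-versus-rarer-word example above.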