Internet Video Search


Arnold W.M. Smeulders & Cees Snoek



             CWI & UvA
Overview Image and Video Search
Lecture 1   visual search, the problem
            color-spatial-textural-temporal features
            measures and invariances
Lecture 2   descriptors
            words and similarity
            where and what
Lecture 3   data and metadata
            performance
            speed
1 Visual search, the problem
A brief history of television

From broadcasting to narrowcasting (~1955, ~1985, ~2005) … to thin casting (2008, 2010).
Any other purpose than TV?

 Surveillance   to alert events
 Forensics      to find evidence / to protect against misuse
 Social media   to sort responses
 Safety         to prevent terrorism
 Agriculture    to sort fruit
 News           to reuse archived footage
 Business       to have efficient access
 eBusiness      to mine consumer data
 Science        to understand visual cognition
 Family         “I have it somewhere on this disk”
How big? The answer from the web

  The web is video
How big? The answer from




                           …as of May 2011
How big? Answer from the archive



Yearly influx                 Next 6 years
   15,000 hours of video        137,200 hours of video
   1 petabyte per year          22,510 hours of film
                                2,900,000 photos
Crowd-given search

What others say is in the video.




We focus on what digital content says is in the video.
Problem 1: The variation

  So many images of one thing:   illumination
                                 background
                                 occlusion
                                 viewpoint, …




  This is the sensory gap.
Problem 2: What defines things?
[Figure: every concept — Tree, Suit, Basketball, US flag, Building, Table, Aircraft, Fire, Dog, Tennis, Mountain — is stored as the same kind of bit string (1101011011011 …); the machine sees bits, while the language of multimedia archives speaks in concepts.]
Problem 3: The many things

This is the model gap
Problem 4: The story of a video




This is the narrative gap
Problem 5: No shared intuition
Query modes: query-by-keyword, query-by-concept, query-by-example.

[Diagram: a query such as "Find shots of people shaking hands" must be turned into a prediction over the available sources.]
This is the query-context gap
System 1: histogram matching




Histogram as a summary of color characteristics.




                                            Swain and Ballard, IJCV 1991
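
To make the idea concrete, here is a minimal sketch of color-histogram matching by histogram intersection, in the spirit of Swain and Ballard; the bin count and normalization are illustrative assumptions, not the settings of the original system.

    import numpy as np

    def color_histogram(image_rgb, bins=8):
        """Normalized 3D RGB histogram as a summary of color characteristics."""
        pixels = image_rgb.reshape(-1, 3)
        hist, _ = np.histogramdd(pixels, bins=(bins, bins, bins),
                                 range=((0, 256), (0, 256), (0, 256)))
        return hist / hist.sum()

    def histogram_intersection(h_query, h_image):
        """Swain-Ballard style match score: 1.0 for identical histograms."""
        return np.minimum(h_query, h_image).sum()

    # Ranking: score every database image against the query histogram.
    # hq = color_histogram(query_image)
    # scores = [histogram_intersection(hq, color_histogram(img)) for img in database]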
1 Conclusion

As content grows, many applications of image search.
Deep cognitive and computer science problems.
With simple means one gets visually simple results.
2 Features
Source × reflection


Light source            e(λ )


Object                  ρ (λ )


Result
                      e( λ ) ρ (λ )
(R,G,B)

\begin{pmatrix} R \\ G \\ B \end{pmatrix} =
\begin{pmatrix}
  \int_\lambda e(\lambda)\,\rho(\lambda)\, f_R(\lambda)\, d\lambda \\
  \int_\lambda e(\lambda)\,\rho(\lambda)\, f_G(\lambda)\, d\lambda \\
  \int_\lambda e(\lambda)\,\rho(\lambda)\, f_B(\lambda)\, d\lambda
\end{pmatrix}
(r, g, b) in (R,G,B)

\begin{pmatrix} r \\ g \\ b \end{pmatrix} = \frac{1}{R + G + B} \begin{pmatrix} R \\ G \\ B \end{pmatrix}




    Independent of shadow!
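
As a worked example, the chromaticity normalization above is a one-liner per pixel; the small epsilon guarding against division by zero is my own addition.

    import numpy as np

    def normalized_rgb(image_rgb, eps=1e-8):
        """(R,G,B) -> (r,g,b): each channel divided by R+G+B, removing intensity/shadow."""
        rgb = image_rgb.astype(np.float64)
        return rgb / (rgb.sum(axis=-1, keepdims=True) + eps)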
The sensation of spectra
Hue:          dominant wavelength          λ(E_H)
Saturation:   purity of the colour         (E_H − E_W)/E_H
Intensity:    brightness of the colour     E_W

[Figure: spectra of a "white" and a "green" stimulus, marking E_H and E_W.]
The sensation of spectra: opponent

Human perception combines the (R,G,B) response of the eye in opponent colors:

\begin{pmatrix} \text{Luminance} \\ \text{BlueYellow} \\ \text{PurpleGreen} \end{pmatrix} =
\begin{pmatrix} R + G + B \\ \tfrac{1}{2}(R - G) \\ \tfrac{1}{4}(2B - R - G) \end{pmatrix}



Maximizes perceived contrast!
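
A small sketch of the opponent transform exactly as written above; the channel naming follows the slide's labels.

    import numpy as np

    def opponent_colors(image_rgb):
        """Luminance plus two chromatic opponent channels, per pixel."""
        R, G, B = [image_rgb[..., i].astype(np.float64) for i in range(3)]
        luminance    = R + G + B
        blue_yellow  = 0.5 * (R - G)              # as labeled on the slide
        purple_green = 0.25 * (2.0 * B - R - G)   # as labeled on the slide
        return np.stack([luminance, blue_yellow, purple_green], axis=-1)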
Color Gaussian space

\begin{pmatrix} E \\ E_\lambda \\ E_{\lambda\lambda} \end{pmatrix} =
\begin{pmatrix} 0.06 & 0.63 & 0.27 \\ 0.30 & 0.04 & -0.35 \\ 0.34 & -0.60 & 0.17 \end{pmatrix}
\begin{pmatrix} R \\ G \\ B \end{pmatrix}




Maximizes information content!
                                         Geusebroek PAMI 2002
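
The Gaussian color model is just a fixed 3x3 linear map applied per pixel; a minimal sketch (variable names are mine).

    import numpy as np

    # Rows map (R,G,B) to E, E_lambda, E_lambda_lambda respectively.
    GAUSSIAN_COLOR_MATRIX = np.array([
        [0.06,  0.63,  0.27],
        [0.30,  0.04, -0.35],
        [0.34, -0.60,  0.17],
    ])

    def rgb_to_gaussian_color(image_rgb):
        """Apply the 3x3 transform to every pixel of an (..., 3) RGB array."""
        rgb = image_rgb.astype(np.float64)
        return np.einsum('kj,...j->...k', GAUSSIAN_COLOR_MATRIX, rgb)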
Color Gaussian space
[Figure: (R,G,B)-pdf versus (E_0, E_λ, E_λλ)-pdf.]
Matter body reflectance in (R,G,B)
Taxonomy of differential image structure
  T-junction                               Junction




  Highlight
                                           Corner




 These junctions will later enable recognition
Gabor texture

The 2D Gabor function is:
h(x, y) = \frac{1}{2\pi\sigma^2}\, e^{-\frac{x^2 + y^2}{2\sigma^2}}\, e^{2\pi j (ux + vy)}
Tuning parameters: u, v, σ
Manjunath and Ma on Gabor for texture in Fourier-space
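
A sketch of the 2D Gabor kernel above, sampled on a discrete grid; the kernel size is an arbitrary choice and the bank of (u, v, σ) settings is left to the caller.

    import numpy as np

    def gabor_kernel(u, v, sigma, size=31):
        """Complex 2D Gabor: Gaussian envelope times a plane wave at frequency (u, v)."""
        half = size // 2
        y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(np.float64)
        envelope = np.exp(-(x**2 + y**2) / (2.0 * sigma**2)) / (2.0 * np.pi * sigma**2)
        carrier = np.exp(2j * np.pi * (u * x + v * y))
        return envelope * carrier

    # A texture feature is typically the mean and variance of the filtered
    # response magnitude, computed for a bank of (u, v, sigma) settings.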
Gabor texture




[Figure: segmentation by k-means clustering of RGB values versus k-means clustering of opponent Gabor features.]

                              Hoang ECCV 2002
Gabor GIST descriptor

    Calculate Gabor responses locally
    Create histograms as before
    Distinguishes things like naturalness, openness,
         roughness, expansion, and ruggedness




Slide credit: James Hays and Alexei Efros       Oliva IJCV 2001
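
A rough sketch of a GIST-style descriptor: Gabor response magnitudes averaged over a coarse spatial grid. The 4x4 grid and the filter bank passed in are illustrative assumptions, not the exact settings of Oliva and Torralba.

    import numpy as np
    from scipy.signal import fftconvolve

    def gist_descriptor(gray_image, kernels, grid=4):
        """Mean Gabor response magnitude per cell of a grid x grid layout."""
        h, w = gray_image.shape
        features = []
        for kernel in kernels:   # e.g. gabor_kernel(...) at several scales/orientations
            response = np.abs(fftconvolve(gray_image, kernel, mode='same'))
            for i in range(grid):
                for j in range(grid):
                    cell = response[i * h // grid:(i + 1) * h // grid,
                                    j * w // grid:(j + 1) * w // grid]
                    features.append(cell.mean())
        return np.array(features)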
Receptive field in f(x,t)

Gaussian equivalent over x and t:




[Figure: the zero-order and the first-order-in-t temporal receptive fields.]




                                    Burghouts TIP 2006
Gaussians measure differentials

                             Taylor expansion at x

For a discretely sampled signal, use the Gaussian derivatives.



The preferred brand of filters: separable by dimension
                                rotation symmetric
                                no new maxima
                                fast implementations.
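
In practice the Gaussian derivative measurements can be taken with separable filters; a minimal sketch using scipy, with the scale σ chosen arbitrarily.

    from scipy.ndimage import gaussian_filter

    def gaussian_observables(image, sigma=3.0):
        """Zeroth, first and second order Gaussian derivatives along x (separable filtering)."""
        L    = gaussian_filter(image, sigma, order=0)        # smoothed signal
        L_x  = gaussian_filter(image, sigma, order=(0, 1))   # first derivative along x
        L_xx = gaussian_filter(image, sigma, order=(0, 2))   # second derivative along x
        return L, L_x, L_xx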
Receptive fields: overview




             All observables up to first order color,
             second order spatial scales, eight
             frequency bands & first order in t.
System 2: Blobworld, textured world

Group blobs based on color and Tamura texture
User specifies query blob and features
System returns images with similar regions




                                         Carson PAMI 2002
2 Conclusion

Powerful features capture uniqueness.
A large set is needed for open-ended search.
The Gauss family is the preferred brand of filters.
Fast recursive implementation:
Geusebroek, Van de Weijer & Smeulders 2002
3 Measures and invariances
The need for invariance

There are a million appearances to one object




The same part of the same shoe does not have the same
appearance in the image. This is the sensory gap.
Remove unwanted variance as early as you can.
Invariance: definition

A feature g is invariant under a condition (transform) caused by accidental
circumstances at the time of recording iff g, observed on equal objects under
different instances of that condition, is constant.
Quiz: scale invariant detection

 What properties are invariant to observation scale?
Color invariance

C = m_b(\vec{n}, \vec{s}) \int_\lambda e(\lambda)\, c_b(\lambda)\, f_C(\lambda)\, d\lambda
  + m_s(\vec{n}, \vec{s}, \vec{v}) \int_\lambda e(\lambda)\, c_s(\lambda)\, f_C(\lambda)\, d\lambda

c_b(λ)     surface albedo            scene & viewpoint invariant
e(λ)       illumination              scene dependent
\vec{n}    object surface normal     object shape variant
\vec{s}    illumination direction    scene dependent
\vec{v}    viewer's direction        viewpoint variant
f_C(λ)     sensor sensitivity        scene dependent
Matter body reflectance in E
C is viewpoint invariant

c_1(R, G, B) = \arctan\frac{R}{\max\{G, B\}}, \quad
c_2(R, G, B) = \arctan\frac{G}{\max\{R, B\}}, \quad
c_3(R, G, B) = \arctan\frac{B}{\max\{R, G\}}

[Figure: the same scene shown in E space and in C space.]
                                                        Gevers TIP 2000
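
The c invariants are a per-pixel computation; a sketch (the epsilon guard is my addition, and c3 follows the pattern of c1 and c2).

    import numpy as np

    def c1c2c3(image_rgb, eps=1e-8):
        """Viewpoint-invariant color features c1, c2, c3."""
        R, G, B = [image_rgb[..., i].astype(np.float64) for i in range(3)]
        c1 = np.arctan(R / (np.maximum(G, B) + eps))
        c2 = np.arctan(G / (np.maximum(R, B) + eps))
        c3 = np.arctan(B / (np.maximum(R, G) + eps))
        return np.stack([c1, c2, c3], axis=-1)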
Hue is viewpoint invariant

H = \arctan\left( \frac{\sqrt{3}\,(G - B)}{(R - G) + (R - B)} \right), \qquad H \text{ is a scalar}
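
A small sketch of the hue computation; arctan2 keeps the angle defined when the denominator vanishes, which is my choice rather than anything stated on the slide.

    import numpy as np

    def hue(image_rgb):
        """Viewpoint-invariant hue angle per pixel."""
        R, G, B = [image_rgb[..., i].astype(np.float64) for i in range(3)]
        return np.arctan2(np.sqrt(3.0) * (G - B), (R - G) + (R - B))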
Differential invariants C’, W’, M’

C’ is for matte objects and uneven white light:

    C_\lambda = \frac{E_\lambda}{E}, \qquad
    C_{\lambda\lambda} = \frac{E_{\lambda\lambda}}{E}, \qquad
    C_{\lambda x} = \frac{E_{\lambda x} E - E_\lambda E_x}{E^2}

W’ is for matte planar objects and even white light:

    W_x = \frac{E_x}{E}, \qquad
    W_{\lambda x} = \frac{E_{\lambda x}}{E}

M’ is for matte objects and monochromatic light:

    N_{\lambda x} = \frac{E_{\lambda x} E - E_\lambda E_x}{E^2}

                                                        Geusebroek PAMI 2002
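
A sketch combining the Gaussian color model with spatial Gaussian derivatives to obtain the C’ quantities above; the scale σ, the scipy filters, and the epsilon guard are assumptions of mine.

    import numpy as np
    from scipy.ndimage import gaussian_filter

    # Same 3x3 (R,G,B) -> (E, E_lambda, E_lambda_lambda) transform as in the
    # Color Gaussian space sketch above.
    M = np.array([[0.06, 0.63, 0.27], [0.30, 0.04, -0.35], [0.34, -0.60, 0.17]])

    def c_prime(image_rgb, sigma=3.0, eps=1e-8):
        """C_lambda, C_lambda_lambda and C_lambda_x for matte objects."""
        gauss = np.einsum('kj,...j->...k', M, image_rgb.astype(np.float64))
        E     = gaussian_filter(gauss[..., 0], sigma)
        E_l   = gaussian_filter(gauss[..., 1], sigma)
        E_ll  = gaussian_filter(gauss[..., 2], sigma)
        E_x   = gaussian_filter(gauss[..., 0], sigma, order=(0, 1))  # x-derivative of E
        E_l_x = gaussian_filter(gauss[..., 1], sigma, order=(0, 1))  # x-derivative of E_lambda
        c_lambda        = E_l / (E + eps)
        c_lambda_lambda = E_ll / (E + eps)
        c_lambda_x      = (E_l_x * E - E_l * E_x) / (E**2 + eps)
        return c_lambda, c_lambda_lambda, c_lambda_x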
Retained discrimination

            shadows   shading   highlights   ill. intensity   ill. color
 E             -         -          -              -               -
 H             +         +          +              +               -
 W & W’        -         +          -              +               -
 C & C’        +         +          -              +               -
 M & M’        +         +          -              +               +
 L             +         +          +              +               -

Retained from 1000 colors at σ = 3:   E 990, H 315, W’ 995, C’ 850, M’ 900

Geusebroek PAMI 2003
3 Conclusion

Know your variances and invariants.
Good invariant features make algorithms simple.
