Internet Video Search


Arnold W.M. Smeulders & Cees Snoek



             CWI & UvA
Overview Image and Video Search
Lecture 1   visual search, the problem
            color-spatial-textural-temporal features
            measures and invariances
Lecture 2   descriptors
            words and similarity
            where and what
Lecture 3   data and metadata
            performance
            speed
1 Visual search, the problem
A brief history of television

From broadcasting to narrowcasting (~1955, ~1985, ~2005) … to thin casting (2008, 2010).
Any other purpose than TV?

 Surveillance   to alert events
 Forensics      to find evidence / to protect against misuse
 Social media   to sort responses
 Safety         to prevent terrorism
 Agriculture    to sort fruit
 News           to reuse archived footage
 Business       to have efficient access
 eBusiness      to mine consumer data
 Science        to understand visual cognition
 Family         “I have it somewhere on this disk”
How big? The answer from the web

  The web is video
How big? The answer from




                           …as of May 2011
How big? Answer from the archive



Yearly influx                 Next 6 years
   15,000 hours of video        137,200 hours of video
   1 petabyte per year          22,510 hours of film
                                2,900,000 photos
Crowd-given search

What others say is in the video.




We focus on what digital content says is in the video.
Problem 1: The variation

  So many images of one thing:   illumination
                                 background
                                 occlusion
                                 viewpoint, …




  This is the sensory gap.
Problem 2: What defines things?
[Figure: every concept — Tree, Suit, Basketball, US flag, Building, Table, Aircraft, Fire, Dog, Tennis, Mountain — is stored as the same kind of bit string (1101011011011 …); the machine sees bits, while the language of multimedia archives speaks in concepts.]
Problem 3: The many things

This is the model gap
Problem 4: The story of a video




This is the narrative gap
Problem 5: No shared intuition
Query modes: query-by-keyword, query-by-concept, query-by-example.

[Diagram: a query such as "Find shots of people shaking hands" must be turned into a prediction over the available sources.]
This is the query-context gap
System 1: histogram matching




Histogram as a summary of color characteristics.




                                            Swain and Ballard, IJCV 1991
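
To make the idea concrete, here is a minimal sketch of color-histogram matching by histogram intersection, in the spirit of Swain and Ballard; the bin count and normalization are illustrative assumptions, not the settings of the original system.

    import numpy as np

    def color_histogram(image_rgb, bins=8):
        """Normalized 3D RGB histogram as a summary of color characteristics."""
        pixels = image_rgb.reshape(-1, 3)
        hist, _ = np.histogramdd(pixels, bins=(bins, bins, bins),
                                 range=((0, 256), (0, 256), (0, 256)))
        return hist / hist.sum()

    def histogram_intersection(h_query, h_image):
        """Swain-Ballard style match score: 1.0 for identical histograms."""
        return np.minimum(h_query, h_image).sum()

    # Ranking: score every database image against the query histogram.
    # hq = color_histogram(query_image)
    # scores = [histogram_intersection(hq, color_histogram(img)) for img in database]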
1 Conclusion

As content grows, many applications of image search.
Deep cognitive and computer science problems.
With simple means one gets visually simple results.
2 Features
Source × reflection


Light source            e(λ )


Object                  ρ (λ )


Result
                      e( λ ) ρ (λ )
(R,G,B)

\begin{pmatrix} R \\ G \\ B \end{pmatrix} =
\begin{pmatrix}
  \int_\lambda e(\lambda)\,\rho(\lambda)\, f_R(\lambda)\, d\lambda \\
  \int_\lambda e(\lambda)\,\rho(\lambda)\, f_G(\lambda)\, d\lambda \\
  \int_\lambda e(\lambda)\,\rho(\lambda)\, f_B(\lambda)\, d\lambda
\end{pmatrix}
(r, g, b) in (R,G,B)

\begin{pmatrix} r \\ g \\ b \end{pmatrix} = \frac{1}{R + G + B} \begin{pmatrix} R \\ G \\ B \end{pmatrix}




    Independent of shadow!
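
As a worked example, the chromaticity normalization above is a one-liner per pixel; the small epsilon guarding against division by zero is my own addition.

    import numpy as np

    def normalized_rgb(image_rgb, eps=1e-8):
        """(R,G,B) -> (r,g,b): each channel divided by R+G+B, removing intensity/shadow."""
        rgb = image_rgb.astype(np.float64)
        return rgb / (rgb.sum(axis=-1, keepdims=True) + eps)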
The sensation of spectra
Hue:          dominant wavelength          λ(E_H)
Saturation:   purity of the colour         (E_H − E_W)/E_H
Intensity:    brightness of the colour     E_W

[Figure: spectra of a "white" and a "green" stimulus, marking E_H and E_W.]
The sensation of spectra: opponent

Human perception combines the (R,G,B) response of the eye in opponent colors:

\begin{pmatrix} \text{Luminance} \\ \text{BlueYellow} \\ \text{PurpleGreen} \end{pmatrix} =
\begin{pmatrix} R + G + B \\ \tfrac{1}{2}(R - G) \\ \tfrac{1}{4}(2B - R - G) \end{pmatrix}



Maximizes perceived contrast!
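
A small sketch of the opponent transform exactly as written above; the channel naming follows the slide's labels.

    import numpy as np

    def opponent_colors(image_rgb):
        """Luminance plus two chromatic opponent channels, per pixel."""
        R, G, B = [image_rgb[..., i].astype(np.float64) for i in range(3)]
        luminance    = R + G + B
        blue_yellow  = 0.5 * (R - G)              # as labeled on the slide
        purple_green = 0.25 * (2.0 * B - R - G)   # as labeled on the slide
        return np.stack([luminance, blue_yellow, purple_green], axis=-1)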
Color Gaussian space

\begin{pmatrix} E \\ E_\lambda \\ E_{\lambda\lambda} \end{pmatrix} =
\begin{pmatrix} 0.06 & 0.63 & 0.27 \\ 0.30 & 0.04 & -0.35 \\ 0.34 & -0.60 & 0.17 \end{pmatrix}
\begin{pmatrix} R \\ G \\ B \end{pmatrix}




Maximizes information content!
                                         Geusebroek PAMI 2002
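
The Gaussian color model is just a fixed 3x3 linear map applied per pixel; a minimal sketch (variable names are mine).

    import numpy as np

    # Rows map (R,G,B) to E, E_lambda, E_lambda_lambda respectively.
    GAUSSIAN_COLOR_MATRIX = np.array([
        [0.06,  0.63,  0.27],
        [0.30,  0.04, -0.35],
        [0.34, -0.60,  0.17],
    ])

    def rgb_to_gaussian_color(image_rgb):
        """Apply the 3x3 transform to every pixel of an (..., 3) RGB array."""
        rgb = image_rgb.astype(np.float64)
        return np.einsum('kj,...j->...k', GAUSSIAN_COLOR_MATRIX, rgb)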
Color Gaussian space
[Figure: (R,G,B)-pdf versus (E_0, E_λ, E_λλ)-pdf.]
Matter body reflectance in (R,G,B)
Taxonomy of differential image structure
  T-junction                               Junction




  Highlight
                                           Corner




 These junctions will later enable recognition
Gabor texture

The 2D Gabor function is:
h(x, y) = \frac{1}{2\pi\sigma^2}\, e^{-\frac{x^2 + y^2}{2\sigma^2}}\, e^{2\pi j (ux + vy)}
Tuning parameters: u, v, σ
Manjunath and Ma on Gabor for texture in Fourier-space
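
A sketch of the 2D Gabor kernel above, sampled on a discrete grid; the kernel size is an arbitrary choice and the bank of (u, v, σ) settings is left to the caller.

    import numpy as np

    def gabor_kernel(u, v, sigma, size=31):
        """Complex 2D Gabor: Gaussian envelope times a plane wave at frequency (u, v)."""
        half = size // 2
        y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(np.float64)
        envelope = np.exp(-(x**2 + y**2) / (2.0 * sigma**2)) / (2.0 * np.pi * sigma**2)
        carrier = np.exp(2j * np.pi * (u * x + v * y))
        return envelope * carrier

    # A texture feature is typically the mean and variance of the filtered
    # response magnitude, computed for a bank of (u, v, sigma) settings.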
Gabor texture




[Figure: segmentation by k-means clustering of RGB values versus k-means clustering of opponent Gabor features.]

                              Hoang ECCV 2002
Gabor GIST descriptor

    Calculate Gabor responses locally
    Create histograms as before
    Distinguishes things like naturalness, openness,
         roughness, expansion, and ruggedness




Slide credit: James Hays and Alexei Efros       Oliva IJCV 2001
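
A rough sketch of a GIST-style descriptor: Gabor response magnitudes averaged over a coarse spatial grid. The 4x4 grid and the filter bank passed in are illustrative assumptions, not the exact settings of Oliva and Torralba.

    import numpy as np
    from scipy.signal import fftconvolve

    def gist_descriptor(gray_image, kernels, grid=4):
        """Mean Gabor response magnitude per cell of a grid x grid layout."""
        h, w = gray_image.shape
        features = []
        for kernel in kernels:   # e.g. gabor_kernel(...) at several scales/orientations
            response = np.abs(fftconvolve(gray_image, kernel, mode='same'))
            for i in range(grid):
                for j in range(grid):
                    cell = response[i * h // grid:(i + 1) * h // grid,
                                    j * w // grid:(j + 1) * w // grid]
                    features.append(cell.mean())
        return np.array(features)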
Receptive field in f(x,t)

Gaussian equivalent over x and t:




[Figure: the zero-order and the first-order-in-t temporal receptive fields.]




                                    Burghouts TIP 2006
Gaussians measure differentials

                             Taylor expansion at x

For a discretely sampled signal, use the Gaussian derivatives.



The preferred brand of filters: separable by dimension
                                rotation symmetric
                                no new maxima
                                fast implementations.
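
In practice the Gaussian derivative measurements can be taken with separable filters; a minimal sketch using scipy, with the scale σ chosen arbitrarily.

    from scipy.ndimage import gaussian_filter

    def gaussian_observables(image, sigma=3.0):
        """Zeroth, first and second order Gaussian derivatives along x (separable filtering)."""
        L    = gaussian_filter(image, sigma, order=0)        # smoothed signal
        L_x  = gaussian_filter(image, sigma, order=(0, 1))   # first derivative along x
        L_xx = gaussian_filter(image, sigma, order=(0, 2))   # second derivative along x
        return L, L_x, L_xx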
Receptive fields: overview




             All observables up to first order color,
             second order spatial scales, eight
             frequency bands & first order in t.
System 2: Blobworld, textured world

Group blobs based on color and Tamura texture
User specifies query blob and features
System returns images with similar regions




                                         Carson PAMI 2002
2 Conclusion

Powerful features capture uniqueness.
A large set is needed for open-ended search.
The Gauss family is the preferred brand of filters.
Fast recursive implementation:
Geusebroek, Van de Weijer & Smeulders 2002
3 Measures and invariances
The need for invariance

There are a million appearances to one object




The same part of the same shoe does not have the same
appearance in the image. This is the sensory gap.
Remove unwanted variance as early as you can.
Invariance: definition

A feature g is invariant under a condition (transform) caused by accidental
circumstances at the time of recording iff g, observed on equal objects under
different instances of that condition, is constant.
Quiz: scale invariant detection

 What properties are invariant to observation scale?
Color invariance

C = m_b(\vec{n}, \vec{s}) \int_\lambda e(\lambda)\, c_b(\lambda)\, f_C(\lambda)\, d\lambda
  + m_s(\vec{n}, \vec{s}, \vec{v}) \int_\lambda e(\lambda)\, c_s(\lambda)\, f_C(\lambda)\, d\lambda

c_b(λ)     surface albedo            scene & viewpoint invariant
e(λ)       illumination              scene dependent
\vec{n}    object surface normal     object shape variant
\vec{s}    illumination direction    scene dependent
\vec{v}    viewer's direction        viewpoint variant
f_C(λ)     sensor sensitivity        scene dependent
Matter body reflectance in E
C is viewpoint invariant

c_1(R, G, B) = \arctan\frac{R}{\max\{G, B\}}, \quad
c_2(R, G, B) = \arctan\frac{G}{\max\{R, B\}}, \quad
c_3(R, G, B) = \arctan\frac{B}{\max\{R, G\}}

[Figure: the same scene shown in E space and in C space.]
                                                        Gevers TIP 2000
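
The c invariants are a per-pixel computation; a sketch (the epsilon guard is my addition, and c3 follows the pattern of c1 and c2).

    import numpy as np

    def c1c2c3(image_rgb, eps=1e-8):
        """Viewpoint-invariant color features c1, c2, c3."""
        R, G, B = [image_rgb[..., i].astype(np.float64) for i in range(3)]
        c1 = np.arctan(R / (np.maximum(G, B) + eps))
        c2 = np.arctan(G / (np.maximum(R, B) + eps))
        c3 = np.arctan(B / (np.maximum(R, G) + eps))
        return np.stack([c1, c2, c3], axis=-1)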
Hue is viewpoint invariant

H = \arctan\left( \frac{\sqrt{3}\,(G - B)}{(R - G) + (R - B)} \right), \qquad H \text{ is a scalar}
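
A small sketch of the hue computation; arctan2 keeps the angle defined when the denominator vanishes, which is my choice rather than anything stated on the slide.

    import numpy as np

    def hue(image_rgb):
        """Viewpoint-invariant hue angle per pixel."""
        R, G, B = [image_rgb[..., i].astype(np.float64) for i in range(3)]
        return np.arctan2(np.sqrt(3.0) * (G - B), (R - G) + (R - B))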
Differential invariants C’, W’, M’

C’ is for matte objects and uneven white light:

    C_\lambda = \frac{E_\lambda}{E}, \qquad
    C_{\lambda\lambda} = \frac{E_{\lambda\lambda}}{E}, \qquad
    C_{\lambda x} = \frac{E_{\lambda x} E - E_\lambda E_x}{E^2}

W’ is for matte planar objects and even white light:

    W_x = \frac{E_x}{E}, \qquad
    W_{\lambda x} = \frac{E_{\lambda x}}{E}

M’ is for matte objects and monochromatic light:

    N_{\lambda x} = \frac{E_{\lambda x} E - E_\lambda E_x}{E^2}

                                                        Geusebroek PAMI 2002
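
A sketch combining the Gaussian color model with spatial Gaussian derivatives to obtain the C’ quantities above; the scale σ, the scipy filters, and the epsilon guard are assumptions of mine.

    import numpy as np
    from scipy.ndimage import gaussian_filter

    # Same 3x3 (R,G,B) -> (E, E_lambda, E_lambda_lambda) transform as in the
    # Color Gaussian space sketch above.
    M = np.array([[0.06, 0.63, 0.27], [0.30, 0.04, -0.35], [0.34, -0.60, 0.17]])

    def c_prime(image_rgb, sigma=3.0, eps=1e-8):
        """C_lambda, C_lambda_lambda and C_lambda_x for matte objects."""
        gauss = np.einsum('kj,...j->...k', M, image_rgb.astype(np.float64))
        E     = gaussian_filter(gauss[..., 0], sigma)
        E_l   = gaussian_filter(gauss[..., 1], sigma)
        E_ll  = gaussian_filter(gauss[..., 2], sigma)
        E_x   = gaussian_filter(gauss[..., 0], sigma, order=(0, 1))  # x-derivative of E
        E_l_x = gaussian_filter(gauss[..., 1], sigma, order=(0, 1))  # x-derivative of E_lambda
        c_lambda        = E_l / (E + eps)
        c_lambda_lambda = E_ll / (E + eps)
        c_lambda_x      = (E_l_x * E - E_l * E_x) / (E**2 + eps)
        return c_lambda, c_lambda_lambda, c_lambda_x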
Retained discrimination

            shadows   shading   highlights   ill. intensity   ill. color
 E             -         -          -              -               -
 H             +         +          +              +               -
 W & W’        -         +          -              +               -
 C & C’        +         +          -              +               -
 M & M’        +         +          -              +               +
 L             +         +          +              +               -

Retained from 1000 colors at σ = 3:   E 990, H 315, W’ 995, C’ 850, M’ 900

Geusebroek PAMI 2003
3 Conclusion

Know your variances and invariants.
Good invariant features make algorithms simple.
