Vdfp audio and video fingerprinting

audio and video fingerprinting

John Schavemaker, Werner Bailer, Peter-Jan Doets, Jaap Blom

The technology in brief:

duplicate detection (video fingerprinting)
• does a video exist in our databases?

categorization
• what category of video is it? News, sports, film?

object and logo recognition
• does an object or logo (image) exist in our databases?

See also our online state-of-the-art report:

http://research.imagesforthefuture.org/index.php/video-fingerprinting-state-of-the-art-report/


duplicate detection

QUESTION: does a video exist in our databases?

video fingerprints account for changes in:
• resolution
• codec
• noise
• color


SWOT video fingerprinting

STRENGTHS
• mature technology
• very good performance on produced material
• many commercial packages available on the market

WEAKNESSES
• many competing vendors: which software package to choose?
• suitability for video material that is not produced?

OPPORTUNITIES
• larger video databases
• non-produced material
• open standard for video fingerprints
• combination with audio

THREATS
• closed standards for video fingerprints
• video encryption
• clever “users”


video categorization

QUESTION: what category of video is it?
Close-up of a face, indoor sports, outdoor sports?

images: UvA
http://www.science.uva.nl/research/mediamill/


SWOT video categorization

STRENGTHS
• promising technique
• generic recognition possible
• complements duplicate and object recognition
• bridges the ‘semantic gap’

WEAKNESSES
• immature technique
• performance (strongly) dependent on the training examples used
• training the system for new categories takes relatively long

OPPORTUNITIES
• combination of categories
• faster and better learning
• automatic annotation

THREATS
• variety too large for a single category
• choice of categories
• dependent on the annotation of training examples


object and logo recognition

QUESTION: does an object or logo
exist in our databases?

picture from http://www.omniperception.com/


SWOT object and logo recognition

STRENGTHS
• good, robust performance
• commercial packages
• fast learning and recognition
• revolution in computer vision

WEAKNESSES
• only 2D objects (logos)
• true duplicate detection
• computationally intensive

OPPORTUNITIES
• larger video databases
• open standard
• 3D object recognition

THREATS
• pre-processing of all material necessary
• patents


video fingerprinting


Use of FP: identification

[Diagram: fingerprint-based identification in two phases]
• Training phase: labeled multimedia items → audio/visual signal → fingerprint extraction → database of fingerprints and metadata
• Identification phase: unlabeled multimedia items → audio/visual signal → fingerprint extraction → match against the database → which item? (metadata)
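The two phases in the diagram can be sketched in a few lines of Python. This is only an illustration, not the implementation used in the pilot: the class `FingerprintIndex`, its methods `enroll` and `identify`, the representation of a fingerprint as a list of 32-bit integers, and the bit-error-rate threshold of 0.35 are all assumptions made for the example.

```python
from typing import Dict, List, Optional, Tuple

BITS_PER_SUBPRINT = 32  # assumption: each sub-fingerprint is a 32-bit integer

# (bit error rate, item id, offset in reference, metadata)
Match = Tuple[float, str, int, dict]


class FingerprintIndex:
    """Hypothetical in-memory database of labeled fingerprints."""

    def __init__(self) -> None:
        self._items: Dict[str, Tuple[List[int], dict]] = {}

    def enroll(self, item_id: str, fingerprint: List[int], metadata: dict) -> None:
        """Training phase: store the fingerprint and metadata of a labeled item."""
        self._items[item_id] = (list(fingerprint), dict(metadata))

    def identify(self, query: List[int], max_ber: float = 0.35) -> Optional[Match]:
        """Identification phase: slide the query over every reference and
        return the best match whose bit error rate is at most max_ber."""
        if not query:
            raise ValueError("query fingerprint must not be empty")
        best: Optional[Match] = None
        for item_id, (ref, meta) in self._items.items():
            for offset in range(len(ref) - len(query) + 1):
                window = ref[offset:offset + len(query)]
                errors = sum(bin(q ^ r).count("1") for q, r in zip(query, window))
                ber = errors / (BITS_PER_SUBPRINT * len(query))
                if ber <= max_ber and (best is None or ber < best[0]):
                    best = (ber, item_id, offset, meta)
        return best


if __name__ == "__main__":
    index = FingerprintIndex()
    index.enroll("archive-001", [1, 2, 3, 4, 5], {"title": "example item"})
    print(index.identify([3, 4]))
```

The exhaustive scan is linear in the size of the database; a real system would need an index structure to scale to a large archive.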


Sound & Vision Pilot
• Observations
• Problem harder than expected
• Transformations
• Crop & scale
• Brightness/contrast
• Logos, captions
• PIP (picture-in-picture): very difficult
• many matching sequences of black frames


Sound & Vision Pilot – results ZiuZ

• TNO has used the ZiuZ video fingerprinting tool on the dataset
• ZiuZ video fingerprinting is optimized for child-abuse material:
• short clips
• low resolution
• low image quality
• Preliminary results on the Sound & Vision dataset show
• material is very challenging
• some but limited recall performance
• application domain differs
• queries containing multiple clips of reference material were
not enabled by this version of the tool


Sound & Vision Pilot – Results JRS
• Recall: 36% (min: 16%, max. 55%)
• Precision: difficult to determine, many black
sequences matching, needs manual checking


Sound & Vision Pilot - Results
• Transformations our system handles


Sound & Vision Pilot - Results
• False positives


Experiments with SIFT (1)
• we do not have a SIFT based fingerprinting
solution in the consortium
• JRS has SIFT-based interactive tool to locate
recurring objects in video
• created video from episode + source clips and
performed analysis and search


• Conclusion
• SIFT can handle cases of scaling and cropping
reliably
• even PIP with distortions
• Scalability issues
• time for extraction and esp. matching
• not sure if ranking of matches is still reliable on
huge datasets


Characteristics of the data set - audio

• Not all archive fragments contain audio
• Often the original audio is used – just cut-and-paste, no serious
distortions
• Sometimes the audio is replaced or combined with a voice over
• Time segmentation of the audio in the episode differs from that of
the video used: the audio is not always used with the
corresponding video fragments. The example on the next slide
illustrates this. The reverse case and other variations also occur.


Characteristics of the data set – audio example

[Figure: two timelines, one for an archive video and one for an Andere Tijden episode, each showing a video track and an audio track]

Continuous audio fragment, with several shorter video fragments


Characteristics of the data set - audio

• Limitations of the use of audio
• the reference material must contain audio
• the audio track might not originate from the same material as
the video track; this is dependent on the video material used.
• the playout speed must not be changed too much (less than
+/- 2%)

• Advantages of the use of audio
• Highly robust algorithms
• Usually audio is undistorted; video is cropped, scaled, etc.
• Audio usually is used continuously, while video fragments are
cut-and-paste from different sections of the reference video,
and ‘glued together’.


Identification results - audio

• Only checked if the correct archive file name is returned
Episode                          Correct  Missed  False positive
Liggadjati                             8       3               0
Veertig jaar STER-reclame             10       4               1
75 jaar afsluitdijk                    0       5               2
Strijd tegen de file                   9       1               6
Kronkels van de Maas                   1       9               1
Op zoek naar Nederland                 2       6               1
Modderen in de polder: Lelystad        3       1               2
Burgemeesters in oorlogstijd           6      10               0
De wording van Paars                   8       1               0
Pim en zijn volk                       7       3               0

silent parts in the video
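The per-episode counts can be aggregated into overall figures. The sketch below assumes that "Correct" counts true positives, "Missed" counts false negatives, and "False Positive" counts false positives at the level of archive file names; the slides do not define recall or precision for the audio experiment, so the function names and this interpretation are ours.

```python
from typing import List, Tuple

# (episode, correct, missed, false positives), copied from the table above
Row = Tuple[str, int, int, int]

RESULTS: List[Row] = [
    ("Liggadjati", 8, 3, 0),
    ("Veertig jaar STER-reclame", 10, 4, 1),
    ("75 jaar afsluitdijk", 0, 5, 2),
    ("Strijd tegen de file", 9, 1, 6),
    ("Kronkels van de Maas", 1, 9, 1),
    ("Op zoek naar Nederland", 2, 6, 1),
    ("Modderen in de polder: Lelystad", 3, 1, 2),
    ("Burgemeesters in oorlogstijd", 6, 10, 0),
    ("De wording van Paars", 8, 1, 0),
    ("Pim en zijn volk", 7, 3, 0),
]


def totals(rows: List[Row]) -> Tuple[int, int, int]:
    """Sum the correct, missed and false-positive columns."""
    correct = sum(r[1] for r in rows)
    missed = sum(r[2] for r in rows)
    false_pos = sum(r[3] for r in rows)
    return correct, missed, false_pos


def recall(correct: int, missed: int) -> float:
    """Fraction of fragments that should have been found and were found."""
    expected = correct + missed
    return correct / expected if expected else 0.0


def precision(correct: int, false_pos: int) -> float:
    """Fraction of returned file names that were correct."""
    returned = correct + false_pos
    return correct / returned if returned else 0.0


if __name__ == "__main__":
    c, m, fp = totals(RESULTS)
    print(f"correct={c} missed={m} false_pos={fp}")
    print(f"recall={recall(c, m):.2f} precision={precision(c, fp):.2f}")
```

Under these assumptions the table sums to 54 correct, 43 missed and 13 false positives, which corresponds to a recall of roughly 56% and a precision of roughly 81% over all ten episodes.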

Fingerprinting – audio algorithm

• Algorithm well-known from literature:
• Haitsma, Kalker, “A Highly Robust Audio Fingerprinting
System”, in Proceedings of the 3rd International Conference
on Music Information Retrieval (ISMIR), October 2002.
• Features: energy in 33 audio frequency bands
• Every 11.6 ms a 32-bit sub-fingerprint is computed, consisting of
coarsely quantized differences between these energy samples
• Fingerprint consists of a time series of sub-fingerprints
• The implementation returns the best matching fragments only
(settings to return no false positives)
• Algorithm is highly robust, and highly discriminative
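The bit derivation summarized above can be sketched as follows. The example assumes that the 33 band energies per frame have already been computed (framing, windowing and spectral analysis are left out), and it uses the sign of the energy difference between adjacent bands, compared with the previous frame, as the "coarse quantization", which is our reading of the cited paper. The function names are hypothetical.

```python
from typing import List, Sequence

NUM_BANDS = 33            # from the slide: energy in 33 frequency bands
NUM_BITS = NUM_BANDS - 1  # 33 bands give 32 adjacent-band differences


def sub_fingerprints(band_energies: Sequence[Sequence[float]]) -> List[int]:
    """Convert per-frame band energies into 32-bit sub-fingerprints.

    band_energies[n][m] is the energy of band m in frame n. Bit m of the
    sub-fingerprint for frame n is 1 when the difference between bands m
    and m+1 is larger than it was in frame n-1, otherwise 0. The first
    frame has no predecessor, so it produces no sub-fingerprint.
    """
    prints: List[int] = []
    for n in range(1, len(band_energies)):
        prev, cur = band_energies[n - 1], band_energies[n]
        if len(prev) != NUM_BANDS or len(cur) != NUM_BANDS:
            raise ValueError(f"each frame needs {NUM_BANDS} band energies")
        value = 0
        for m in range(NUM_BITS):
            diff = (cur[m] - cur[m + 1]) - (prev[m] - prev[m + 1])
            value = (value << 1) | (1 if diff > 0 else 0)
        prints.append(value)
    return prints


def bit_error_rate(a: Sequence[int], b: Sequence[int]) -> float:
    """Fraction of differing bits between two equally long fingerprints."""
    if len(a) != len(b) or not a:
        raise ValueError("fingerprints must be non-empty and equally long")
    errors = sum(bin(x ^ y).count("1") for x, y in zip(a, b))
    return errors / (NUM_BITS * len(a))


if __name__ == "__main__":
    flat = [0.0] * NUM_BANDS
    falling = [float(NUM_BANDS - m) for m in range(NUM_BANDS)]
    print([hex(p) for p in sub_fingerprints([flat, falling, flat])])
```

Because only the sign of each difference is kept, moderate changes in level or equalization tend to leave most bits unchanged, which is one way to understand the robustness claimed on the slide.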


Future improvements on current results

• Trailing parts contain silence and black frames (no content). The
silences give rise to false positives and irrelevant detections. A
silence/activity detector is needed to exclude these parts.
• Our current implementation from literature allows for only one
fragment per reference file to be returned.
• Our current implementation has only coarse time localization.
• Combination of audio and video fingerprinting


Consortium

http://instituut.beeldengeluid.nl/

http://www.joanneum.at/en/digital.html

http://www.ziuz.com

http://hs-art.com/

http://www.tno.nl

