Fast object re-detection and localization in video for spatio-temporal fragment creation. Talk given by Vasileios Mezaris, July 2013, San Jose, California, USA.
1. Fast object re-detection and localization in video for spatio-temporal fragment creation
Evlampios Apostolidis, Vasileios Mezaris, Ioannis Kompatsiaris
Information Technologies Institute / Centre for Research and Technology Hellas
ICME MMIX 2013, San Jose, CA, USA, July 2013
2. Overview
• Introduction – problem formulation
• Related work
• Baseline approach
• Proposed approach
  – GPU-based processing
  – Video-structure-based sampling of video frames
  – Robustness to scale variations
• Experiments and results
• Conclusions
3. Introduction – problem formulation
• Object re-detection: a particular case of image matching
• Main goal: find instances of a specific object within a single video or a
collection of videos
– Input: object of interest + video file
– Processing: similarity estimation by means of image matching
– Output: detected instances of the object of interest
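As a sketch of this input/processing/output contract in code (Python; the function name redetect and the Fragment structure are hypothetical, chosen only to make the formulation concrete):

```python
from dataclasses import dataclass

@dataclass
class Fragment:
    """A detected spatio-temporal fragment: a frame range plus per-frame boxes."""
    start_frame: int
    end_frame: int
    boxes: list  # one (x, y, w, h) bounding box per frame in the range

def redetect(object_image, video_path):
    """Input: an image of the object of interest plus a video file.
    Processing: per-frame similarity estimation by image matching.
    Output: the detected instances as a list of Fragment objects."""
    raise NotImplementedError  # filled in by the pipeline of the next slides
```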
4. Introduction – problem formulation
Extension for interactive and linked TV
• Semi-automatic identification and annotation of object-specific spatiotemporal media fragments
– Annotate the object of interest
– Run the object re-detection algorithm
– Automatically get instance-based annotated video fragments
– Find related content fragments and establish links between them
[Figure: assign a label to the object of interest → instance-based annotated video fragment → links to related content]
5. Related work
• Extraction and matching of scale- and rotation-invariant local descriptors is one of the most popular state-of-the-art approaches for estimating the similarity between pairs of images
– Local feature extraction
• Edge detectors (e.g. Canny), corner detectors (e.g. Harris-Laplace)
– Local feature description
• SIFT or extensions of it, SURF, BRISK, binary descriptors such as BRIEF, …
– Matching of local descriptors
• k-Nearest Neighbor search between descriptor pairs using brute-force or hashing
– Filtering of erroneous matches
• Symmetry test between the pairs of matched descriptors
• Ratio test regarding the distances of the calculated nearest neighbors
• Geometric verification between the pair of images using RANSAC
– Extensions
• Combined use of keypoints and motion information (tracking)
• Bag-of-Words (BoW) matching for pruning
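As a concrete illustration of this pipeline, the sketch below strings the steps together with OpenCV (Python). It is a minimal version under assumptions not in the slides: SIFT stands in for SURF (SIFT ships with stock OpenCV builds), the ratio-test threshold of 0.75 and the RANSAC reprojection error of 5.0 are common defaults, and the symmetry test is omitted:

```python
import cv2
import numpy as np

def match_object(obj_gray, frame_gray, ratio=0.75):
    """Return a homography mapping the object into the frame, or None."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(obj_gray, None)    # object features
    kp2, des2 = sift.detectAndCompute(frame_gray, None)  # frame features
    if des1 is None or des2 is None:
        return None

    # 2-NN brute-force matching followed by the ratio test on the two distances.
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    knn = matcher.knnMatch(des1, des2, k=2)
    good = [p[0] for p in knn if len(p) == 2 and p[0].distance < ratio * p[1].distance]
    if len(good) < 4:  # a homography needs at least 4 correspondences
        return None

    # Geometric verification with RANSAC filters the remaining erroneous matches.
    src = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    H, inliers = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    return H if H is not None and inliers.sum() >= 4 else None
```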
6. Proposed approach
• Starting from a baseline approach,
– Improve detection accuracy
– Reduce the needed processing time
• Work directions:
– GPU-based processing
– Video-structure-based sampling of frames
– Enhancing robustness to scale variations
7. GPU-based processing
Accelerated parts of the overall pipeline:
• Video decompression into frames
• Keypoint detection and description
• Brute-force matching and 2-NN search
• Drawing of the calculated bounding boxes (optional)
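A rough sketch of how these stages map onto the GPU with OpenCV's CUDA module (Python). This assumes an opencv-contrib build compiled with CUDA; the binding names (e.g. cv2.cuda.SURF_CUDA_create) follow OpenCV 4.x and may differ between versions:

```python
import cv2

# Assumption: OpenCV was compiled with CUDA support, so cv2.cuda is available;
# obj_gray and frame_gray are grayscale images from the earlier pipeline.
gpu_obj, gpu_frame = cv2.cuda_GpuMat(), cv2.cuda_GpuMat()
gpu_obj.upload(obj_gray)      # one-off upload of the object image
gpu_frame.upload(frame_gray)  # per-frame upload (or decode directly on the GPU)

# GPU SURF: keypoint detection and description in a single call.
surf = cv2.cuda.SURF_CUDA_create(_hessianThreshold=400)
kp_obj, des_obj = surf.detectWithDescriptors(gpu_obj, None)
kp_frm, des_frm = surf.detectWithDescriptors(gpu_frame, None)

# Brute-force matching with a 2-NN search, also executed on the GPU.
matcher = cv2.cuda.DescriptorMatcher_createBFMatcher(cv2.NORM_L2)
knn = matcher.knnMatch(des_obj, des_frm, k=2)
```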
8. Video-structure-based sampling
• Sequential processing of video frames is replaced by a structure-based
one, using the analysis results of a shot segmentation method
[Figure: example – the key-frames of shot 1 are checked; no detection, so the algorithm moves to the next shot; a detection in a key-frame of shot 2 triggers checking all frames of this shot, where the object of interest is detected and highlighted]
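The strategy can be summarized in a short sketch; detect stands for the matcher from the earlier slides, and shots, keyframes, and read_frame are hypothetical stand-ins for the shot-segmentation output and the video decoder:

```python
def scan_video(shots, keyframes, detect, read_frame):
    """shots: list of (first_frame, last_frame) pairs, one per shot;
    keyframes[i]: the representative key-frames of shot i;
    detect(image) returns a bounding box or None."""
    detections = []
    for i, (first, last) in enumerate(shots):
        # Cheap test: is the object visible in any key-frame of this shot?
        if not any(detect(kf) is not None for kf in keyframes[i]):
            continue  # no detection: move on to the next shot
        # Full pass over the shot's frames only where the cheap test fired.
        for idx in range(first, last + 1):
            box = detect(read_frame(idx))
            if box is not None:
                detections.append((idx, box))
    return detections
```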
9. Robustness to scale variations
Problem
• Major changes in scale may lead to detection failure, due to the significant reduction of the image area that is available for matching
• Zoom-in case: the middle image (b) corresponds to a small upper right
area of the object O in the left one (a)
• Zoom-out case: in the right image (c) the object O occupies a very small
part of the frame
• Both cases lead to a considerable reduction of the number of matched
pairs of descriptors, and thus often to detection failure
[Figure: (a) the object O, (b) a zoomed-in view covering a small upper-right area of O, (c) a zoomed-out view where O occupies a small part of the frame]
10. Robustness to scale variations
Solution
• We automatically generate a zoomed-out and a centralized zoomed-in instance of the object O and utilize them in the matching procedure
• Zoomed-in instance:
– selection of a center-aligned sub-area of the original object O and
enlargement to the actual size of O by applying bilinear interpolation
– 70% of the original image area → 140% zoom-in factor
• Zoomed-out instance:
– shrink the original image O into a smaller one using nearest neighbor
interpolation
– the maximum zoom-out factor is determined by the restrictions of the GPU-based implementation of SURF
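A minimal sketch of the two instance generators with OpenCV (Python). The 70% center crop and the interpolation modes follow the slide; the function names and the illustrative zoom-out factor of 0.5 are assumptions:

```python
import cv2

def zoomed_in_instance(obj, keep=0.7):
    """Center-aligned crop of 70% of the original extent, enlarged back to the
    actual size of O with bilinear interpolation (~140% effective zoom-in)."""
    h, w = obj.shape[:2]
    ch, cw = int(h * keep), int(w * keep)
    y0, x0 = (h - ch) // 2, (w - cw) // 2
    crop = obj[y0:y0 + ch, x0:x0 + cw]
    return cv2.resize(crop, (w, h), interpolation=cv2.INTER_LINEAR)

def zoomed_out_instance(obj, factor=0.5):
    """Shrink the original image with nearest-neighbor interpolation; the paper
    bounds the maximum factor by the limits of the GPU SURF implementation."""
    h, w = obj.shape[:2]
    return cv2.resize(obj, (int(w * factor), int(h * factor)),
                      interpolation=cv2.INTER_NEAREST)
```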
11. Experiments and Results
• System specifications
– Intel Core i7 processor at 3.4 GHz
– 8 GB RAM
– CUDA-enabled NVIDIA GeForce GTX560 GPU
• Dataset
– 6 videos* of 273 minutes total duration
– 30 manually selected objects
• Ground truth (generated via manual annotation)
– 75,632 frames contain at least one of these objects
– 333,455 frames do not include any of the selected objects
[Figure: examples of sought objects]
* The videos are episodes from the “Antiques Roadshow” of the Dutch public broadcaster AVRO (http://avro.nl/)
12. Experiments and Results
• Aim: quantify the improvement contributed by each extension of the baseline approach
• Four experimental configurations:
– C1: baseline implementation
– C2: GPU-accelerated implementation
– C3: GPU-accelerated implementation with video-structure-based sampling
– C4: complete proposed approach, which includes GPU-based processing, video-structure-based sampling, and robustness to scale variations
13. Experiments and Results
• Detection accuracy is expressed in terms of Precision, Recall and F-Score
• Evaluation was performed on a per-frame basis, i.e., considering the 30 selected objects and counting the number of frames where each was correctly detected, missed, etc.
• Time efficiency was evaluated by expressing the processing time of each
configuration as a factor of the actual duration of the processed videos
• Robustness to scale variations was quantified using two specific sets of
frames where the object of interest was observed from:
– a very close viewing position (2,940 frames) and
– a very distant viewing position (4,648 frames)
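For reference, the per-frame scores reduce to simple counts over the ground truth; a minimal sketch with hypothetical counter names:

```python
def per_frame_scores(tp, fp, fn):
    """tp: frames where a present object was correctly detected;
    fp: frames with a false detection;
    fn: frames where a present object was missed."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f_score
```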
14. Experiments and Results
Evaluation results for configurations C1 to C4:

Config   Precision   Recall   F-Score   Processing Time (x Real-Time)
C1       0.999       0.868    0.929     2.98-5.26
C2       0.999       0.850    0.918     0.35-1.24
C3       0.999       0.849    0.918     0.03-0.13
C4       0.999       0.872    0.931     0.03-0.19
Evaluation results for highly zoomed-out instances:

Config   Precision   Recall   F-Score
C1       0.999       0.856    0.922
C2       0.999       0.856    0.922
C3       1.000       0.852    0.920
C4       1.000       0.992    0.996

Evaluation results for highly zoomed-in instances:

Config   Precision   Recall   F-Score
C1       0.999       0.831    0.907
C2       0.999       0.831    0.907
C3       1.000       0.799    0.888
C4       1.000       0.914    0.955
15. Experiments and Results
Detection accuracy
• All versions exhibited very good results in terms of detection accuracy
• Version C4 (complete proposed approach) achieved the best results
• The algorithm performed well across a range of different scales and orientations, and under partial visibility or partial occlusion
Processing time
• The video-structure-based sampling strategy greatly reduced the required processing time
• The algorithm needs only about 10% of the video's duration, while preserving the same high detection accuracy as the slower configurations
Online demo available at: http://www.youtube.com/watch?v=0IeVkXRTYu8
16. Extensions, ideas and plans
• Recent extension: Multiple instances of an object of interest can be used
as input for more efficient re-detection of 3D objects
• Future ideas: test the algorithm’s performance as a tool for chapter
segmentation in videos where the chapters are temporally demarcated by
the presence of a specific object (e.g. a painting in a video about art)
• Future plans: evaluate the extended algorithm’s performance (detection
accuracy and time efficiency) in a new set of videos
[Figure: multiple input instances of a 3D object and the resulting output detections]
17. Conclusions
• The proposed method can be used for fast and accurate re-detection of
pre-defined objects in videos
• The time performance of the implemented algorithm allows for real-time processing of multimedia content
• Extended by a prior object labeling step, this technique can be seen as:
– A reliable tool for creating instance-based annotated spatio-temporal fragments in videos
– A key enabling technology for finding similar content and establishing links between related media fragments, thus contributing to the realization of interactive and linked TV
Speaker notes
• Five representative key-frames are used per shot (the starting frame, the ending frame, and 3 intermediate frames).
• By applying this sampling strategy, the algorithm analyses in full only the parts (i.e., the shots) of the video where the object appears (being visible in at least one of the key-frames of these shots) and quickly rejects all remaining parts by performing a small number of comparisons.
• E.g., a highly zoomed-in instance of the object may correspond to a very small portion of the searched object O, while an instance seen from a very distant viewing position may occupy only a small part of the overall image.
• For the last column (Processing Time) of the configurations table, a range is reported for each configuration, since processing times can vary significantly depending on the video structure and the percentage of frames in which the sought object appears.