1. Technicolor / INRIA / Imperial College London at the MediaEval 2012 Violent Scene Detection Task
PENET Cédric – Technicolor, INRIA
DEMARTY Claire-Hélène – Technicolor
SOLEYMANI Mohammad – Imperial College London
GRAVIER Guillaume – CNRS, IRISA
GROS Patrick – INRIA
MediaEval 2012 Pisa Workshop
October 4th, 2012
2. Outline
Introduction
Systems description
Results and conclusion
2 10/7/2012
3. Outline
Introduction
Systems description
Results and conclusion
4. Introduction
Joint effort between Technicolor / INRIA / Imperial College London
5 runs, 5 different systems
Re-use of last year's systems with a few differences
Bayesian networks structure learning (Technicolor/INRIA)
Naive Bayesian classifier (ICL)
Two new systems from Technicolor/INRIA
Exploiting similarity
Bag-of-Audio words
Fusion of three systems (Technicolor/INRIA – ICL)
5. Outline
Introduction
Systems description
Results and conclusion
6. Run 1: Exploiting Similarity
Idea: can we get the same results as last year using only similarity
measures?
Video features for each frame
Motion activity
Three color harmonisation features: harmonisation template, angle and
energy
Decision: KNN using only closest neighbour
10 movies used to populate the KNN
Test frames labelled according to their closest neighbour
If one frame of a shot is labelled violent, the whole shot is labelled violent
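The decision rule above (label each test frame by its single closest neighbour, then mark a shot violent if any of its frames is) can be sketched as follows; the feature values and the two-entry bank are invented for illustration, not taken from the actual system.

```python
import math

def nearest_label(frame, bank):
    """Return the label of the closest training frame (1-NN, Euclidean distance)."""
    _, label = min(bank, key=lambda item: math.dist(frame, item[0]))
    return label

def label_shot(shot_frames, bank):
    """A shot is labelled violent if at least one of its frames is."""
    return any(nearest_label(f, bank) == "violent" for f in shot_frames)

# Toy bank of (feature vector, label) pairs: motion activity plus the three
# colour harmonisation features, with made-up values.
bank = [((0.9, 0.2, 0.1, 0.8), "violent"),
        ((0.1, 0.7, 0.6, 0.2), "nonviolent")]

shot = [(0.15, 0.65, 0.55, 0.25),   # closest to the nonviolent example
        (0.85, 0.25, 0.15, 0.75)]   # closest to the violent example
print(label_shot(shot, bank))       # one violent frame suffices -> True
```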
7. Run 2: Bag-of-Audio words
Audio features extraction
Extraction of MFCC audio features (with Δ and ΔΔ) – 20 ms windows, 10 ms overlap
Extraction of silence segments with SPro
Extraction of coherent audio segments – André-Obrecht 1988
K-Means on non-silent audio segments for vocabulary (of size 128)
Each audio segment replaced by closest centroid
Construction of TF-IDF histograms
Each shot is a document
Classification using SVM
χ² and histogram intersection kernels
Weight applied to the SVM cost parameter
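The vocabulary and histogram steps above can be sketched with the standard library alone; the 2-word vocabulary below stands in for the 128 K-means centroids of the real system, and all values are invented. Only quantisation and TF-IDF construction are shown, not the K-means training or the SVM.

```python
import math
from collections import Counter

def quantise(segment, centroids):
    """Replace an audio segment's feature vector by the index of its closest centroid."""
    return min(range(len(centroids)), key=lambda i: math.dist(segment, centroids[i]))

def tf_idf_histograms(shots, centroids):
    """One TF-IDF histogram per shot; each shot is treated as a document."""
    docs = [Counter(quantise(seg, centroids) for seg in shot) for shot in shots]
    df = Counter()                      # document frequency of each audio word
    for doc in docs:
        df.update(doc.keys())
    hists = []
    for doc in docs:
        total = sum(doc.values())
        hists.append({word: (count / total) * math.log(len(docs) / df[word])
                      for word, count in doc.items()})
    return hists

# Toy 2-word vocabulary; each shot is a list of segment feature vectors.
centroids = [(0.0, 0.0), (1.0, 1.0)]
shots = [[(0.1, 0.1), (0.2, 0.0)],
         [(0.9, 1.0), (0.1, 0.2), (1.1, 0.9)]]
hists = tf_idf_histograms(shots, centroids)
```

Audio word 0 appears in both shots, so its IDF (and hence its TF-IDF weight) is zero, which is exactly the down-weighting of ubiquitous words that TF-IDF is meant to provide.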
8. Run 3: Bayesian Networks structure learning
Re-use of last year's Technicolor system with additional features
Audio features: energy, asymmetry, centroid, ZCR, flatness and roll-off at 90%
Video features: shot length, flashes, blood, activity, color coherence, average
luminance, fire and color harmonisation features
Features are averaged over a video shot
Graphical model for modeling conditional probability distributions along
with contextual features and temporal smoothing
Naive Bayesian network (NB)
Bayesian network example
Graph structure learning
Forest augmented naive Bayesian network (FAN)
K2
Late modalities fusion using simple rule
Source: https://controls.engin.umich.edu/wiki/index.php/Bayesian_network_theory
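As a minimal illustration of the naive Bayesian network baseline above (features assumed conditionally independent given the class), here is the posterior computation P(class | features) ∝ P(class) · ∏ P(featureᵢ | class); the discrete likelihood tables and priors are invented, and the structure-learning variants (FAN, K2) are not shown.

```python
def naive_bayes_posterior(features, priors, likelihoods):
    """Posterior over classes under the naive (conditional independence) assumption."""
    scores = {}
    for cls, prior in priors.items():
        score = prior
        for name, value in features.items():
            score *= likelihoods[cls][name][value]   # P(feature = value | class)
        scores[cls] = score
    total = sum(scores.values())
    return {cls: s / total for cls, s in scores.items()}

# Invented tables for two of the shot-level video features listed above.
priors = {"violent": 0.2, "nonviolent": 0.8}
likelihoods = {
    "violent":    {"blood": {True: 0.6, False: 0.4},
                   "flashes": {True: 0.5, False: 0.5}},
    "nonviolent": {"blood": {True: 0.05, False: 0.95},
                   "flashes": {True: 0.1, False: 0.9}},
}
post = naive_bayes_posterior({"blood": True, "flashes": True}, priors, likelihoods)
print(post)  # blood + flashes push the posterior strongly toward "violent"
```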
9. Run 4: Naïve Bayesian classifier
Audio modality
Classical low level features extracted from non-silent segments
RMS Energy, pitch, MFCC, ZCR, spectrum flux, Spectral RollOff
Averaged over shots
Video modality
Shot duration, luminance, Average activity, motion component
Averaged over shots
Text features
Simple features such as number of spoken words and the average valence and arousal
per shot (from the dictionary of affect in language)
The results were poor, so we decided not to include them in the final submission
A Naïve Bayesian classifier on each modality
Modality fusion using a weighted sum of posterior probabilities.
0.95 × audio score + 0.05 × visual score
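The weighted-sum fusion above is a one-liner; the scores below are invented.

```python
def fuse_scores(audio_score, visual_score, w_audio=0.95):
    """Late fusion by weighted sum of per-modality posterior probabilities."""
    return w_audio * audio_score + (1.0 - w_audio) * visual_score

fused = fuse_scores(0.8, 0.2)  # 0.95 * 0.8 + 0.05 * 0.2 = 0.77
```

With such a skewed weighting, the visual modality acts only as a small tie-breaker on top of the audio score.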
10. Run 5: Systems fusion
Simple fusion of three systems
Run 2: Bag-of-Audio words
Run 3: Bayesian networks structure learning
Run 4: Naive Bayesian classifier
Fusion by multiplication of probabilities
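Product fusion is equally simple to sketch (scores invented): each system's violence probability multiplies into the final score, so a single low score vetoes a shot, which is consistent with the precision-oriented behaviour reported in the conclusions.

```python
import math

def product_fusion(probabilities):
    """Fuse system outputs by multiplying their violence probabilities."""
    return math.prod(probabilities)

fused = product_fusion([0.9, 0.8, 0.7])   # all systems agree -> score stays high
veto  = product_fusion([0.9, 0.8, 0.05])  # one dissenting system pulls it down
```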
11. Outline
Introduction
Systems description
Results and conclusions
12. Results
N°  Technique   MAP@100 (%)  AP-1 (%)  AP-2 (%)  AP-3 (%)  STD (%)  MediaEval Cost
1   Similarity  13.89         0.00     12.91     28.77     14.41    2.29
2   BoAW        40.54        10.85     52.98     57.77     25.82    2.50
3   BN-SL       61.82        60.56     53.15     71.76      9.37    3.57
4   NBN         46.27        40.03     22.97     75.82     26.97    3.64
5   Fusion      57.47        64.52     37.21     70.69     17.82    4.60
Average Precision (AP) for Dead Poet Society (AP-1), Fight Club (AP-2) and Independence Day (AP-3)
STD: standard deviation of AP over the three test movies
High variation between movies
Best results on Independence Day (similar to Armageddon)
Needs more movies to compute MAP
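For reference, MAP@100 above is the mean over the test movies of the average precision on each movie's top-100 ranked shots; a minimal sketch of one common AP@k variant (precision summed at each hit, normalised by the hits retrieved):

```python
def average_precision(ranked_labels, k=100):
    """AP@k over a ranked list of ground-truth labels (1 = violent shot)."""
    hits, precision_sum = 0, 0.0
    for rank, label in enumerate(ranked_labels[:k], start=1):
        if label:
            hits += 1
            precision_sum += hits / rank   # precision at this hit's rank
    return precision_sum / hits if hits else 0.0

def mean_average_precision(per_movie_rankings, k=100):
    """Mean of per-movie AP@k values."""
    return sum(average_precision(r, k) for r in per_movie_rankings) / len(per_movie_rankings)

# Toy ranking: hits at ranks 1 and 3 -> AP = (1/1 + 2/3) / 2
ap = average_precision([1, 0, 1, 0])
```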
13. Conclusion & perspectives
Similarity search
MAP is poor, but the MediaEval Cost is among the best (6th out of 35)
Adding features and merging decisions from different KNNs might improve the
results
Fusion
4th best run overall (out of 35)
Results not as good as expected
Improves precision at the cost of recall (false alarms reduced by a factor of
two)
Test smarter fusion techniques
Bayesian Networks – Structure Learning
3rd best run overall (out of 35)
Very low standard deviation over three movies
Bayesian networks for intermediate concepts
14. Conclusion & perspectives
Bag-of-Audio words
MAP is not bad (11th out of 35)
False alarms and missed detections are pretty low too
Simple tests proved efficient – more investigation needed
Naive Bayesian classifier
A simple classifier with audio features can achieve moderately good results
(10th out of 35)
Text features don’t work
Use a classifier that can learn temporal dynamics