1. Technicolor / INRIA / Imperial College London at the MediaEval 2012 Violent Scene Detection Task
PENET Cédric – Technicolor, INRIA
DEMARTY Claire-Hélène – Technicolor
SOLEYMANI Mohammad – Imperial College London
GRAVIER Guillaume – CNRS, IRISA
GROS Patrick – INRIA
MediaEval 2012 Pisa Workshop
October 4th, 2012
2. Outline
Introduction
Systems description
Results and conclusion
2 10/7/2012
3. Outline
Introduction
Systems description
Results and conclusion
4. Introduction
Joint effort between Technicolor / INRIA / Imperial College London
5 runs, 5 different systems
Re-use of last year's systems with a few differences
Bayesian networks structure learning (Technicolor/INRIA)
Naive Bayesian classifier (ICL)
Two new systems from Technicolor/INRIA
Exploiting similarity
Bag-of-Audio words
Fusion of three systems (Technicolor/INRIA – ICL)
5. Outline
Introduction
Systems description
Results and conclusion
6. Run 1: Exploiting Similarity
Idea: can we get the same results as last year using only similarity
measures?
Video features for each frame
Motion activity
Three color harmonisation features: harmonisation template, angle and
energy
Decision: KNN using only closest neighbour
10 movies used to populate the KNN
Test frames labelled according to their closest neighbour
If one frame of a shot is labelled violent, the whole shot is labelled violent
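The decision rule above (label each test frame by its single closest neighbour, then mark a shot violent if any of its frames is) can be sketched as follows; the feature values and the two-entry bank are invented for illustration, not taken from the actual system.

```python
import math

def nearest_label(frame, bank):
    """Return the label of the closest training frame (1-NN, Euclidean distance)."""
    _, label = min(bank, key=lambda item: math.dist(frame, item[0]))
    return label

def label_shot(shot_frames, bank):
    """A shot is labelled violent if at least one of its frames is."""
    return any(nearest_label(f, bank) == "violent" for f in shot_frames)

# Toy bank of (feature vector, label) pairs: motion activity plus the three
# colour harmonisation features, with made-up values.
bank = [((0.9, 0.2, 0.1, 0.8), "violent"),
        ((0.1, 0.7, 0.6, 0.2), "nonviolent")]

shot = [(0.15, 0.65, 0.55, 0.25),   # closest to the nonviolent example
        (0.85, 0.25, 0.15, 0.75)]   # closest to the violent example
print(label_shot(shot, bank))       # one violent frame suffices -> True
```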
7. Run 2: Bag-of-Audio words
Audio features extraction
Extraction of MFCC audio features (with Δ and ΔΔ) – 20 ms windows, 10 ms overlap
Extraction of silence segments with SPro
Extraction of coherent audio segments – André-Obrecht 1988
K-Means on non-silent audio segments for vocabulary (of size 128)
Each audio segment replaced by closest centroid
Construction of TF-IDF histograms
Each shot is a document
Classification using SVM
χ² and histogram intersection kernels
Weight applied to the SVM cost parameter
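The vocabulary and histogram steps above can be sketched with the standard library alone; the 2-word vocabulary below stands in for the 128 K-means centroids of the real system, and all values are invented. Only quantisation and TF-IDF construction are shown, not the K-means training or the SVM.

```python
import math
from collections import Counter

def quantise(segment, centroids):
    """Replace an audio segment's feature vector by the index of its closest centroid."""
    return min(range(len(centroids)), key=lambda i: math.dist(segment, centroids[i]))

def tf_idf_histograms(shots, centroids):
    """One TF-IDF histogram per shot; each shot is treated as a document."""
    docs = [Counter(quantise(seg, centroids) for seg in shot) for shot in shots]
    df = Counter()                      # document frequency of each audio word
    for doc in docs:
        df.update(doc.keys())
    hists = []
    for doc in docs:
        total = sum(doc.values())
        hists.append({word: (count / total) * math.log(len(docs) / df[word])
                      for word, count in doc.items()})
    return hists

# Toy 2-word vocabulary; each shot is a list of segment feature vectors.
centroids = [(0.0, 0.0), (1.0, 1.0)]
shots = [[(0.1, 0.1), (0.2, 0.0)],
         [(0.9, 1.0), (0.1, 0.2), (1.1, 0.9)]]
hists = tf_idf_histograms(shots, centroids)
```

Audio word 0 appears in both shots, so its IDF (and hence its TF-IDF weight) is zero, which is exactly the down-weighting of ubiquitous words that TF-IDF is meant to provide.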
8. Run 3: Bayesian Networks structure learning
Re-use of last year's Technicolor system with additional features
Audio features: energy, asymmetry, centroid, ZCR, flatness and roll-off at 90%
Video features: shot length, flashes, blood, activity, color coherence, average
luminance, fire and color harmonisation features
Features are averaged over a video shot
Graphical model for modeling conditional probability distributions along
with contextual features and temporal smoothing
Naive Bayesian network (NB)
Bayesian network example
Graph structure learning
Forest augmented naive Bayesian network (FAN)
K2
Late modalities fusion using simple rule
Source: https://controls.engin.umich.edu/wiki/index.php/Bayesian_network_theory
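As a minimal illustration of the naive Bayesian network baseline above (features assumed conditionally independent given the class), here is the posterior computation P(class | features) ∝ P(class) · ∏ P(featureᵢ | class); the discrete likelihood tables and priors are invented, and the structure-learning variants (FAN, K2) are not shown.

```python
def naive_bayes_posterior(features, priors, likelihoods):
    """Posterior over classes under the naive (conditional independence) assumption."""
    scores = {}
    for cls, prior in priors.items():
        score = prior
        for name, value in features.items():
            score *= likelihoods[cls][name][value]   # P(feature = value | class)
        scores[cls] = score
    total = sum(scores.values())
    return {cls: s / total for cls, s in scores.items()}

# Invented tables for two of the shot-level video features listed above.
priors = {"violent": 0.2, "nonviolent": 0.8}
likelihoods = {
    "violent":    {"blood": {True: 0.6, False: 0.4},
                   "flashes": {True: 0.5, False: 0.5}},
    "nonviolent": {"blood": {True: 0.05, False: 0.95},
                   "flashes": {True: 0.1, False: 0.9}},
}
post = naive_bayes_posterior({"blood": True, "flashes": True}, priors, likelihoods)
print(post)  # blood + flashes push the posterior strongly toward "violent"
```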
9. Run 4: Naïve Bayesian classifier
Audio modality
Classical low level features extracted from non-silent segments
RMS Energy, pitch, MFCC, ZCR, spectrum flux, Spectral RollOff
Averaged over shots
Video modality
Shot duration, luminance, Average activity, motion component
Averaged over shots
Text features
Simple features such as number of spoken words and the average valence and arousal
per shot (from the dictionary of affect in language)
The results were poor, so we decided not to include them in the final submission
A Naïve Bayesian classifier on each modality
Modality fusion using a weighted sum of posterior probabilities.
0.95 × audio score + 0.05 × visual score
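The weighted-sum fusion above is a one-liner; the scores below are invented.

```python
def fuse_scores(audio_score, visual_score, w_audio=0.95):
    """Late fusion by weighted sum of per-modality posterior probabilities."""
    return w_audio * audio_score + (1.0 - w_audio) * visual_score

fused = fuse_scores(0.8, 0.2)  # 0.95 * 0.8 + 0.05 * 0.2 = 0.77
```

With such a skewed weighting, the visual modality acts only as a small tie-breaker on top of the audio score.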
10. Run 5: Systems fusion
Simple fusion of three systems
Run 2: Bag-of-Audio words
Run 3: Bayesian networks structure learning
Run 4: Naive Bayesian classifier
Fusion by multiplication of probabilities
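Product fusion is equally simple to sketch (scores invented): each system's violence probability multiplies into the final score, so a single low score vetoes a shot, which is consistent with the precision-oriented behaviour reported in the conclusions.

```python
import math

def product_fusion(probabilities):
    """Fuse system outputs by multiplying their violence probabilities."""
    return math.prod(probabilities)

fused = product_fusion([0.9, 0.8, 0.7])   # all systems agree -> score stays high
veto  = product_fusion([0.9, 0.8, 0.05])  # one dissenting system pulls it down
```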
11. Outline
Introduction
Systems description
Results and conclusions
12. Results
N°  Technique   MAP@100 (%)  AP-1 (%)  AP-2 (%)  AP-3 (%)  STD (%)  MediaEval Cost
1   Similarity  13.89         0.00     12.91     28.77     14.41    2.29
2   BoAW        40.54        10.85     52.98     57.77     25.82    2.50
3   BN-SL       61.82        60.56     53.15     71.76      9.37    3.57
4   NBN         46.27        40.03     22.97     75.82     26.97    3.64
5   Fusion      57.47        64.52     37.21     70.69     17.82    4.60
Average Precision (AP) for Dead Poet Society (AP-1), Fight Club (AP-2) and Independence Day (AP-3)
STD: standard deviation of AP over the three test movies
High variation between movies
Best results on Independence Day (similar to Armageddon)
Needs more movies to compute MAP
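For reference, MAP@100 above is the mean over the test movies of the average precision on each movie's top-100 ranked shots; a minimal sketch of one common AP@k variant (precision summed at each hit, normalised by the hits retrieved):

```python
def average_precision(ranked_labels, k=100):
    """AP@k over a ranked list of ground-truth labels (1 = violent shot)."""
    hits, precision_sum = 0, 0.0
    for rank, label in enumerate(ranked_labels[:k], start=1):
        if label:
            hits += 1
            precision_sum += hits / rank   # precision at this hit's rank
    return precision_sum / hits if hits else 0.0

def mean_average_precision(per_movie_rankings, k=100):
    """Mean of per-movie AP@k values."""
    return sum(average_precision(r, k) for r in per_movie_rankings) / len(per_movie_rankings)

# Toy ranking: hits at ranks 1 and 3 -> AP = (1/1 + 2/3) / 2
ap = average_precision([1, 0, 1, 0])
```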
13. Conclusion & perspectives
Similarity search
MAP is poor, but the MediaEval Cost is among the best (6th out of 35)
Adding features and merging decisions from different KNNs might improve the
results
Fusion
4th best run overall (out of 35)
Results not as good as expected
Improves precision at the cost of recall (false alarms reduced by a factor of
two)
Test smarter fusion techniques
Bayesian Networks – Structure Learning
3rd best run overall (out of 35)
Very low standard deviation over three movies
Bayesian networks for intermediate concepts
14. Conclusion & perspectives
Bag-of-Audio words
MAP is not bad (11th out of 35)
False alarms and missed detections are pretty low too
Simple tests proved efficient – more investigation needed
Naive Bayesian classifier
A simple classifier with audio features can achieve moderately good results
(10th out of 35)
Text features don’t work
Use a classifier that can learn temporal dynamics