Random Forests for Laughter Detection
1. RANDOM FORESTS FOR LAUGHTER DETECTION
Heysem Kaya, A. Mehdi Ercetin, A. Ali Salah, S. Fikret Gurgen
Department of Computer Engineering
Bogazici University / Istanbul
WASSS '13, Grenoble, 23.08.2013
2. Presentation Outline
● Introduction and Motivation
● A Brief Review of Literature
● The Challenge Corpus
● Methodology
  – Methods & Features Utilized
  – Experimental Results
● Conclusions
● Questions
3. Introduction
● New directions in ASR: emotional and social cues
● Social Signal Sub-Challenge of Interspeech 2013: detection of laughter and fillers
● Challenges: the garbage class dominates and the number of samples is very large
● Best results are obtained with fusion (e.g. speech + vision)
● TUM baseline feature set
4. Brief Review of Related Work
| Work | Classifier | Features | Other Notes |
|---|---|---|---|
| Ito et al., 2005 | GMM | MFCC, Δ-MFCC | Classifier fusion with the AND operator was found to increase precision; Δ-MFCC outdid raw MFCC |
| Truong and van Leeuwen, 2007 | GMM & SVM | PLP, prosody, voicing, modulation spectrum | Fusion with diverse meta classifiers was found to outperform fusion of classifiers with the same algorithm |
| Knox and Mirghafori, 2007 | MLP | MFCC, prosody along with first and second Δ | Class probabilities stacked to ANN |
| Petridis and Pantic, 2008 | MLP, Adaboost (feat. sel.) | PLP, prosody & their Δs as well as video features | Multi-modal fusion was superior to single-modal models; best results obtained by stacking to ANN |
| Vinciarelli et al., 2009 | (survey) | – | Survey recommends utilization of classifier fusion for SSP |
5. The SSP Challenge Dataset - Corpus
● Based on the SSPNet Vocalization Corpus
● Collected from microphone recordings of 60 phone calls
● Contains 2763 clips, each 11 s long
● At least one laughter or filler event in every clip
● Fillers are vocalizations used to hold the floor (e.g. “uhm”, “eh”, “ah”)
● To provide speaker independence:
  – calls #1-35 served as training,
  – calls #36-45 as development and the rest as test
6. The SSP Challenge Dataset – Statistics
| Property | Statistic |
|---|---|
| # of Clips | 2763 |
| Clip Duration | 11 sec. |
| # of Phone Calls | 60 |
| # of Subjects | 120 |
| # Male Subjects | 57 |
| # Female Subjects | 63 |
| # of Filler Events | 3.0 k |
| # of Laughter Events | 1.2 k |
7. Methodology
● Classifiers:
  – Random Forests (RF)
  – Support Vector Machines (SVM)
● Challenge Features (TUM Baseline)
● Feature Selection:
  – minimum Redundancy Maximum Relevance (mRMR)
● Post-processing:
  – Gaussian Window Smoothing
8. The SSP Challenge Dataset - Features
● Non-overlapping frames of 10 ms length (over 3 × 10⁶ frames)
● Considering memory limitations, a small set of (141) affectively potent features is extracted (Schuller et al., 2013):
  – MFCC 1-12, logarithmic energy, voicing probability, HNR, F0 and zero crossing rate, along with their first order Δ
  – second order Δ also for MFCC and log. energy
  – frame-wise LLDs are augmented with the mean and std. of the frame and its 8 neighbors
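The neighbor-statistics augmentation above can be sketched as follows. This is a minimal illustration only: the function name, the `k=4` window radius (giving 8 neighbors in total) and the edge padding at clip boundaries are our assumptions, not details taken from the challenge feature extractor.

```python
import numpy as np

def augment_with_context(lld, k=4):
    """Append to each frame's LLD vector the mean and std computed over a
    window of the frame and its 2k neighbors (k on each side).
    lld: (n_frames, n_features) -> (n_frames, 3 * n_features)."""
    n = len(lld)
    # repeat the first/last frame at the clip boundaries (an assumption)
    padded = np.pad(lld, ((k, k), (0, 0)), mode="edge")
    # collect the 2k+1 shifted copies: shape (n_frames, 2k+1, n_features)
    windows = np.stack([padded[i:i + n] for i in range(2 * k + 1)], axis=1)
    return np.hstack([lld, windows.mean(axis=1), windows.std(axis=1)])
```

Each of the 47 frame-wise LLDs thus contributes a raw value, a local mean and a local std, which matches the small 141-dimensional feature set described above (3 × 47).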
9. Random Forests
● A Random Forest is a fusion of decision tree predictors¹
● Each decision tree is grown with
  – a set of randomly selected instances (sampled with replacement) and
  – a subset of features which are also randomly selected
● Sampling with replacement leaves on average 1/3 of the instances 'out of the bag'
● RFs are shown to be superior to many current algorithms in accuracy and to perform well on large databases
¹ L. Breiman, “Random Forests”, University of California, Berkeley, USA, 2001
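The two sources of randomness above (bootstrap-sampled instances plus a random feature subset per tree) can be sketched as follows. This is an illustration only: the trees are reduced to single-split stumps for brevity and all names are hypothetical; it is not the WEKA implementation used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_forest(X, y, T=20, d=4):
    """Grow T trees; each tree sees a bootstrap sample of the instances
    and a random subset of d features (stumps stand in for full trees)."""
    n, D = X.shape
    forest = []
    for _ in range(T):
        idx = rng.integers(0, n, n)                   # bootstrap: sample with replacement
        feats = rng.choice(D, size=d, replace=False)  # random feature subset
        Xb, yb = X[idx], y[idx]
        best = None
        for f in feats:
            thr = np.median(Xb[:, f])
            pred = (Xb[:, f] > thr).astype(int)
            acc = (pred == yb).mean()
            flip = acc < 0.5                          # predict the majority side
            if flip:
                acc = 1.0 - acc
            if best is None or acc > best[0]:
                best = (acc, f, thr, flip)
        forest.append(best[1:])
    return forest

def predict_forest(forest, X):
    """Majority vote over the trees."""
    votes = np.zeros(len(X))
    for f, thr, flip in forest:
        p = (X[:, f] > thr).astype(int)
        votes += 1 - p if flip else p
    return (votes / len(forest) > 0.5).astype(int)
```

The roughly 1/3 of instances left out of each bootstrap sample (the 'out of the bag' set) can serve as a built-in validation set for each tree.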
10. mRMR Based Feature Selection
● mRMR was introduced by Peng et al. (2005) as a feature ranking algorithm based on mutual information (MI)
● MI quantifies the amount of shared information between two random variables
● A candidate feature is selected having
  – max MI with the target variable
  – min MI with the already selected features
● For mRMR we used the authors' original implementation*
* http://penglab.janelia.org/proj/mRMR/
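The greedy mRMR rule (max relevance to the target, min redundancy with already selected features) can be sketched for discrete features as follows; this is an illustrative re-implementation of the idea, not the authors' original code linked above.

```python
import numpy as np
from collections import Counter

def mutual_information(a, b):
    """MI between two discrete variables, in nats."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    mi = 0.0
    for (x, y), c in pab.items():
        pxy = c / n
        mi += pxy * np.log(pxy / ((pa[x] / n) * (pb[y] / n)))
    return mi

def mrmr_rank(X, y, k):
    """Greedily pick k features: at each step take the feature maximizing
    MI with the target minus the mean MI with the already selected set."""
    D = X.shape[1]
    relevance = [mutual_information(X[:, f], y) for f in range(D)]
    selected = [int(np.argmax(relevance))]
    while len(selected) < k:
        best, best_score = None, -np.inf
        for f in range(D):
            if f in selected:
                continue
            redundancy = np.mean([mutual_information(X[:, f], X[:, s])
                                  for s in selected])
            if relevance[f] - redundancy > best_score:
                best, best_score = f, relevance[f] - redundancy
        selected.append(best)
    return selected
```

Note how a feature that duplicates an already selected one is penalized: its redundancy term roughly cancels its relevance, so a less redundant feature is ranked ahead of it.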
11. Experimental Results
● We used the WEKA* implementations of SVM and RF.
● For the Social Signal Sub-Challenge we consider the Area Under the ROC Curve (AUC) of the laughter and filler classes and their unweighted average (UAAUC)
*M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, I. H. Witten,
“The WEKA Data Mining Software”, 2009 (http://www.cs.waikato.ac.nz/ml/weka/)
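The AUC of a class can be computed as the Mann-Whitney rank statistic over one-vs-rest scores, and UAAUC is then the plain mean of the two class AUCs; a small sketch (the function names are ours, not WEKA's):

```python
def auc(scores, labels):
    """AUC as the Mann-Whitney statistic: the probability that a randomly
    chosen positive sample is scored above a randomly chosen negative one
    (ties count half)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def uaauc(per_class_auc):
    """Unweighted average of the per-class (laughter, filler) AUCs."""
    return sum(per_class_auc) / len(per_class_auc)
```

For example, the challenge paper's 86.2% (laughter) and 89.0% (filler) average to a UAAUC of 87.6%.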
12. Baseline Results with Linear SVM (%AUC)
SVM Baseline on Development Set¹

| C² | Laughter | Filler | UAAUC |
|---|---|---|---|
| 0.1 | 81.3 | 83.6 | 82.5 |
| 1 | 81.2 | 83.7 | 82.5 |
| 10 | 81.2 | 83.7 | 82.5 |

¹ The challenge paper reports 86.2% and 89.0% for laughter and filler, respectively
² SVM complexity parameter

SVM Performance with mRMR Features on Development Set (C=0.1)

| # of mRMR Features | Laughter | Filler | UAAUC |
|---|---|---|---|
| 90 | 80.7 | 83.3 | 82.0 |
| 70 | 79.9 | 83.0 | 81.5 |
| 50 | 78.8 | 82.4 | 80.6 |
13. Experiments with Random Forests
● The hyper-parameters of an RF are
  – the number of (randomly selected) features per decision tree (d) and
  – the number of trees (T) forming the forest
● We also investigated the effect of the number of mRMR-ranked features used (D)
● The tested values for the hyper-parameters are d = {8, 16, 32}, T = {10, 20, 30} and D = {50, 70, 90, All}
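The exhaustive search over d, T and D amounts to a plain grid search; a generic sketch, where the `eval_fn` argument stands in for the actual train-and-score-on-the-development-set step:

```python
from itertools import product

def grid_search(eval_fn, grid):
    """Evaluate every hyper-parameter combination in `grid` (a dict mapping
    parameter name -> list of candidate values); return the best one."""
    keys = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[k] for k in keys)):
        params = dict(zip(keys, values))
        score = eval_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# the grid from the slide: 3 * 3 * 4 = 36 configurations
grid = {"d": [8, 16, 32], "T": [10, 20, 30], "D": [50, 70, 90, "All"]}

def toy_dev_uaauc(d, T, D):
    # hypothetical stand-in for training an RF and measuring dev-set UAAUC
    return d / 32 + T / 30 + (0.5 if D == 90 else 0.0)

best, _ = grid_search(toy_dev_uaauc, grid)
```

With the toy scoring function above, the search returns d=32, T=30, D=90, which happens to coincide with the best setting reported on the following slides.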
14. Random Forest Performance with T=20, Varying d and D on Development Set (% AUC)

| Local Dim. (d) | #mRMR (D) | Laughter | Filler | UAAUC |
|---|---|---|---|---|
| 8 | All | 88.9 | 90.7 | 89.8 |
| 8 | 90 | 89.3 | 90.8 | 90.1 |
| 8 | 70 | 88.3 | 90.3 | 89.3 |
| 8 | 50 | 88.0 | 89.7 | 88.9 |
| 16 | All | 89.0 | 90.6 | 89.8 |
| 16 | 90 | 89.3 | 90.7 | 90.0 |
| 16 | 70 | 88.8 | 90.7 | 89.8 |
| 16 | 50 | 88.3 | 90.0 | 89.2 |
| 32 | All | 89.5 | 90.9 | 90.2 |
| 32 | 90 | 89.6 | 90.9 | 90.3 |
| 32 | 70 | 89.1 | 90.8 | 90.0 |
| 32 | 50 | 88.2 | 90.0 | 89.1 |
15. Random Forest Performance with Varying T, d and D on Development Set (% UAAUC)

| Local Dim. (d) | #mRMR (D) | 10 Trees | 20 Trees | 30 Trees |
|---|---|---|---|---|
| 8 | All | 87.4 | 89.8 | 89.7 |
| 8 | 90 | 88.0 | 90.1 | 90.1 |
| 8 | 70 | 87.9 | 89.3 | 89.8 |
| 8 | 50 | 87.5 | 88.9 | 89.4 |
| 16 | All | 88.2 | 89.8 | 90.4 |
| 16 | 90 | 88.6 | 90.0 | 90.5 |
| 16 | 70 | 88.5 | 89.8 | 90.2 |
| 16 | 50 | 87.8 | 89.2 | 89.6 |
| 32 | All | 88.8 | 90.2 | 90.4 |
| 32 | 90 | 89.0 | 90.3 | 90.8 |
| 32 | 70 | 88.7 | 90.0 | 90.4 |
| 32 | 50 | 87.9 | 89.1 | 89.6 |
| Avg | – | 88.2 | 89.7 | 90.0 |

Development set baseline reported in the challenge paper: 87.6%; reproduced baseline: 82.5%
16. Gaussian Smoothing
● We further applied Gaussian window smoothing on the posteriors
● The posterior of each frame was re-calculated as a weighted sum over its 2K neighbors (K before & K after)
● The Gaussian weight function used to smooth frame i with a neighboring frame j is given as

    w_ij = (2πB)^(−1/2) · exp(−(i − j)² / (2B))

● We tested development set accuracy for K = 1, ..., 10
● Since the increase from K=8 to K=9 was less than 0.05%, we chose K=8
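The smoothing step can be sketched as follows. Note that the slides do not state the bandwidth B or how clip boundaries are handled, so the default B and the edge padding below are both assumptions.

```python
import numpy as np

def gaussian_smooth(posteriors, K=8, B=None):
    """Smooth a 1-D sequence of frame posteriors with a Gaussian window of
    K neighbors on each side; B plays the role of the variance."""
    if B is None:
        B = (K / 2.0) ** 2                    # assumed default, not from the slides
    offsets = np.arange(-K, K + 1)
    w = (2 * np.pi * B) ** -0.5 * np.exp(-offsets ** 2 / (2 * B))
    w /= w.sum()                              # normalize: output stays a weighted average
    padded = np.pad(posteriors, K, mode="edge")  # repeat edge frames at clip boundaries
    return np.convolve(padded, w, mode="valid")
```

An isolated posterior spike is spread over its neighbors while a constant posterior track is left unchanged, which is exactly the effect one wants against frame-level jitter.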
18. The Challenge Test Set Result
● We re-trained on the training and development sets together, using the best setting (T=30, D=90, d=32)
● We applied Gaussian smoothing with K=8
● Overall we attained
  – 89.6% and 87.3% AUC for laughter and fillers, respectively
  – a UAAUC of 88.4%, outperforming the challenge test set baseline (83.3%) by 5.1% (absolute)
19. Conclusions
● We proposed the use of RFs for laughter detection
● The results indicate superior detection accuracy
● We observed that RFs benefit from feature reduction via mRMR, while SVM performance deteriorates with it
● Together with Gaussian smoothing of the posteriors, we attained an absolute increase of 5.1% over the challenge test set baseline