Random Forests for Laughter Detection
1. RANDOM FORESTS FOR LAUGHTER DETECTION
Heysem Kaya, A. Mehdi Ercetin, A. Ali Salah, S. Fikret Gurgen
Department of Computer Engineering
Bogazici University / Istanbul
WASSS '13, Grenoble, 23.08.2013
2. Presentation Outline
● Introduction and Motivation
● A Brief Review of Literature
● The Challenge Corpus
● Methodology
  – Methods & Features Utilized
  – Experimental Results
● Conclusions
● Questions
3. Introduction
● New directions in ASR: emotional and social cues
● Social Signal Sub-Challenge of Interspeech 2013: detection of laughter and fillers
● Challenges: the garbage class dominates and the number of samples is very large
● Best results are obtained with fusion (e.g. speech + vision)
● TUM baseline feature set
4. Brief Review of Related Work
| Work | Classifier | Features | Other Notes |
|---|---|---|---|
| Ito et al., 2005 | GMM | MFCC, Δ-MFCC | Classifier fusion with the AND operator was found to increase precision; Δ-MFCC outdid raw MFCC |
| Truong and van Leeuwen, 2007 | GMM & SVM | PLP, prosody, voicing, modulation spectrum | Fusion with diverse meta classifiers was found to outperform fusion of classifiers with the same algorithm |
| Knox and Mirghafori, 2007 | MLP | MFCC, prosody along with first and second Δ | Class probabilities stacked to ANN |
| Petridis and Pantic, 2008 | MLP, Adaboost (feat. sel.) | PLP, prosody & their Δs as well as video features | Multi-modal fusion was superior to single-modal models; best results obtained by stacking to ANN |
| Vinciarelli et al., 2009 | (survey) | – | Survey recommends utilization of classifier fusion for SSP |
5. The SSP Challenge Dataset - Corpus
● Based on the SSPNet Vocalization Corpus
● Collected from microphone recordings of 60 phone calls
● Contains 2763 clips, each 11 s long
● At least one laughter or filler event in every clip
● Fillers are vocalizations used to hold the floor (e.g. “uhm”, “eh”, “ah”)
● To provide speaker independence:
  – calls #1-35 served as training,
  – calls #36-45 as development and the rest as test
6. The SSP Challenge Dataset – Statistics
| Property | Statistic |
|---|---|
| # of Clips | 2763 |
| Clip Duration | 11 sec. |
| # of Phone Calls | 60 |
| # of Subjects | 120 |
| # Male Subjects | 57 |
| # Female Subjects | 63 |
| # of Filler Events | 3.0 k |
| # of Laughter Events | 1.2 k |
7. Methodology
● Classifiers:
  – Random Forests (RF)
  – Support Vector Machines (SVM)
● Challenge Features (TUM Baseline)
● Feature Selection:
  – minimum Redundancy Maximum Relevance (mRMR)
● Post-processing:
  – Gaussian Window Smoothing
8. The SSP Challenge Dataset - Features
● Non-overlapping frames of 10 ms length (over 3 × 10⁶ frames)
● Considering memory limitations, a small set of (141) affectively potent features is extracted (Schuller et al., 2013):
  – MFCC 1-12, logarithmic energy, voicing probability, HNR, F0 and zero crossing rate, along with their first order Δ
  – second order Δ also for MFCC and log. energy
  – frame-wise LLDs are augmented with the mean and std. of the frame and its 8 neighbors
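The neighbor-statistics augmentation above can be sketched as follows. This is a minimal illustration only: the function name, the `k=4` window radius (giving 8 neighbors in total) and the edge padding at clip boundaries are our assumptions, not details taken from the challenge feature extractor.

```python
import numpy as np

def augment_with_context(lld, k=4):
    """Append to each frame's LLD vector the mean and std computed over a
    window of the frame and its 2k neighbors (k on each side).
    lld: (n_frames, n_features) -> (n_frames, 3 * n_features)."""
    n = len(lld)
    # repeat the first/last frame at the clip boundaries (an assumption)
    padded = np.pad(lld, ((k, k), (0, 0)), mode="edge")
    # collect the 2k+1 shifted copies: shape (n_frames, 2k+1, n_features)
    windows = np.stack([padded[i:i + n] for i in range(2 * k + 1)], axis=1)
    return np.hstack([lld, windows.mean(axis=1), windows.std(axis=1)])
```

Each of the 47 frame-wise LLDs thus contributes a raw value, a local mean and a local std, which matches the small 141-dimensional feature set described above (3 × 47).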
9. Random Forests
● A Random Forest is a fusion of decision tree predictors¹
● Each decision tree is grown with
  – a set of randomly selected instances (sampled with replacement) and
  – a subset of features which are also randomly selected
● Sampling with replacement leaves on average 1/3 of the instances 'out of the bag'
● RFs are shown to be superior to many current algorithms in accuracy and to perform well on large databases
¹ L. Breiman, “Random Forests”, University of California, Berkeley, USA, 2001
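The two sources of randomness above (bootstrap-sampled instances plus a random feature subset per tree) can be sketched as follows. This is an illustration only: the trees are reduced to single-split stumps for brevity and all names are hypothetical; it is not the WEKA implementation used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_forest(X, y, T=20, d=4):
    """Grow T trees; each tree sees a bootstrap sample of the instances
    and a random subset of d features (stumps stand in for full trees)."""
    n, D = X.shape
    forest = []
    for _ in range(T):
        idx = rng.integers(0, n, n)                   # bootstrap: sample with replacement
        feats = rng.choice(D, size=d, replace=False)  # random feature subset
        Xb, yb = X[idx], y[idx]
        best = None
        for f in feats:
            thr = np.median(Xb[:, f])
            pred = (Xb[:, f] > thr).astype(int)
            acc = (pred == yb).mean()
            flip = acc < 0.5                          # predict the majority side
            if flip:
                acc = 1.0 - acc
            if best is None or acc > best[0]:
                best = (acc, f, thr, flip)
        forest.append(best[1:])
    return forest

def predict_forest(forest, X):
    """Majority vote over the trees."""
    votes = np.zeros(len(X))
    for f, thr, flip in forest:
        p = (X[:, f] > thr).astype(int)
        votes += 1 - p if flip else p
    return (votes / len(forest) > 0.5).astype(int)
```

The roughly 1/3 of instances left out of each bootstrap sample (the 'out of the bag' set) can serve as a built-in validation set for each tree.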
10. mRMR Based Feature Selection
● mRMR was introduced by Peng et al. (2005) as a feature ranking algorithm based on mutual information (MI)
● MI quantifies the amount of shared information between two random variables
● A candidate feature is selected having
  – max MI with the target variable
  – min MI with the already selected features
● For mRMR we used the authors' original implementation*
* http://penglab.janelia.org/proj/mRMR/
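The greedy mRMR rule (max relevance to the target, min redundancy with already selected features) can be sketched for discrete features as follows; this is an illustrative re-implementation of the idea, not the authors' original code linked above.

```python
import numpy as np
from collections import Counter

def mutual_information(a, b):
    """MI between two discrete variables, in nats."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    mi = 0.0
    for (x, y), c in pab.items():
        pxy = c / n
        mi += pxy * np.log(pxy / ((pa[x] / n) * (pb[y] / n)))
    return mi

def mrmr_rank(X, y, k):
    """Greedily pick k features: at each step take the feature maximizing
    MI with the target minus the mean MI with the already selected set."""
    D = X.shape[1]
    relevance = [mutual_information(X[:, f], y) for f in range(D)]
    selected = [int(np.argmax(relevance))]
    while len(selected) < k:
        best, best_score = None, -np.inf
        for f in range(D):
            if f in selected:
                continue
            redundancy = np.mean([mutual_information(X[:, f], X[:, s])
                                  for s in selected])
            if relevance[f] - redundancy > best_score:
                best, best_score = f, relevance[f] - redundancy
        selected.append(best)
    return selected
```

Note how a feature that duplicates an already selected one is penalized: its redundancy term roughly cancels its relevance, so a less redundant feature is ranked ahead of it.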
11. Experimental Results
● We used the WEKA* implementations of SVM and RF.
● For the Social Signal Sub-Challenge we consider the Area Under the ROC Curve (AUC) of the laughter and filler classes and their unweighted average (UAAUC)
*M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, I. H. Witten,
“The WEKA Data Mining Software”, 2009 (http://www.cs.waikato.ac.nz/ml/weka/)
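The AUC of a class can be computed as the Mann-Whitney rank statistic over one-vs-rest scores, and UAAUC is then the plain mean of the two class AUCs; a small sketch (the function names are ours, not WEKA's):

```python
def auc(scores, labels):
    """AUC as the Mann-Whitney statistic: the probability that a randomly
    chosen positive sample is scored above a randomly chosen negative one
    (ties count half)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def uaauc(per_class_auc):
    """Unweighted average of the per-class (laughter, filler) AUCs."""
    return sum(per_class_auc) / len(per_class_auc)
```

For example, the challenge paper's 86.2% (laughter) and 89.0% (filler) average to a UAAUC of 87.6%.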
12. Baseline Results with Linear SVM (%AUC)
SVM Baseline on Development Set¹

| C² | Laughter | Filler | UAAUC |
|---|---|---|---|
| 0.1 | 81.3 | 83.6 | 82.5 |
| 1 | 81.2 | 83.7 | 82.5 |
| 10 | 81.2 | 83.7 | 82.5 |

¹ The challenge paper reports 86.2% and 89.0% for laughter and filler, respectively
² SVM complexity parameter

SVM Performance with mRMR Features on Development Set (C=0.1)

| # of mRMR Features | Laughter | Filler | UAAUC |
|---|---|---|---|
| 90 | 80.7 | 83.3 | 82.0 |
| 70 | 79.9 | 83.0 | 81.5 |
| 50 | 78.8 | 82.4 | 80.6 |
13. Experiments with Random Forests
● The hyper-parameters of an RF are
  – the number of (randomly selected) features per decision tree (d) and
  – the number of trees (T) forming the forest
● We also investigated the effect of the number of mRMR-ranked features used (D)
● The tested values for the hyper-parameters are d = {8, 16, 32}, T = {10, 20, 30} and D = {50, 70, 90, All}
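The exhaustive search over d, T and D amounts to a plain grid search; a generic sketch, where the `eval_fn` argument stands in for the actual train-and-score-on-the-development-set step:

```python
from itertools import product

def grid_search(eval_fn, grid):
    """Evaluate every hyper-parameter combination in `grid` (a dict mapping
    parameter name -> list of candidate values); return the best one."""
    keys = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[k] for k in keys)):
        params = dict(zip(keys, values))
        score = eval_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# the grid from the slide: 3 * 3 * 4 = 36 configurations
grid = {"d": [8, 16, 32], "T": [10, 20, 30], "D": [50, 70, 90, "All"]}

def toy_dev_uaauc(d, T, D):
    # hypothetical stand-in for training an RF and measuring dev-set UAAUC
    return d / 32 + T / 30 + (0.5 if D == 90 else 0.0)

best, _ = grid_search(toy_dev_uaauc, grid)
```

With the toy scoring function above, the search returns d=32, T=30, D=90, which happens to coincide with the best setting reported on the following slides.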
14. Random Forest Performance with T=20, Varying d and D on Development Set (% AUC)

| Local Dim. (d) | #mRMR (D) | Laughter | Filler | UAAUC |
|---|---|---|---|---|
| 8 | All | 88.9 | 90.7 | 89.8 |
| 8 | 90 | 89.3 | 90.8 | 90.1 |
| 8 | 70 | 88.3 | 90.3 | 89.3 |
| 8 | 50 | 88.0 | 89.7 | 88.9 |
| 16 | All | 89.0 | 90.6 | 89.8 |
| 16 | 90 | 89.3 | 90.7 | 90.0 |
| 16 | 70 | 88.8 | 90.7 | 89.8 |
| 16 | 50 | 88.3 | 90.0 | 89.2 |
| 32 | All | 89.5 | 90.9 | 90.2 |
| 32 | 90 | 89.6 | 90.9 | 90.3 |
| 32 | 70 | 89.1 | 90.8 | 90.0 |
| 32 | 50 | 88.2 | 90.0 | 89.1 |
15. Random Forest Performance with Varying T, d and D on Development Set (% UAAUC)

| Local Dim. (d) | #mRMR (D) | 10 Trees | 20 Trees | 30 Trees |
|---|---|---|---|---|
| 8 | All | 87.4 | 89.8 | 89.7 |
| 8 | 90 | 88.0 | 90.1 | 90.1 |
| 8 | 70 | 87.9 | 89.3 | 89.8 |
| 8 | 50 | 87.5 | 88.9 | 89.4 |
| 16 | All | 88.2 | 89.8 | 90.4 |
| 16 | 90 | 88.6 | 90.0 | 90.5 |
| 16 | 70 | 88.5 | 89.8 | 90.2 |
| 16 | 50 | 87.8 | 89.2 | 89.6 |
| 32 | All | 88.8 | 90.2 | 90.4 |
| 32 | 90 | 89.0 | 90.3 | 90.8 |
| 32 | 70 | 88.7 | 90.0 | 90.4 |
| 32 | 50 | 87.9 | 89.1 | 89.6 |
| Avg | – | 88.2 | 89.7 | 90.0 |

Development set baseline reported in the challenge paper: 87.6%; reproduced baseline: 82.5%
16. Gaussian Smoothing
● We further applied Gaussian window smoothing on the posteriors
● The posterior of each frame was re-calculated as a weighted sum over its 2K neighbors (K before & K after)
● The Gaussian weight function used to smooth frame i with a neighboring frame j is given as

    w_ij = (2πB)^(−1/2) · exp(−(i − j)² / (2B))

● We tested development set accuracy for K = 1, ..., 10
● Since the increase from K=8 to K=9 was less than 0.05%, we chose K=8
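The smoothing step can be sketched as follows. Note that the slides do not state the bandwidth B or how clip boundaries are handled, so the default B and the edge padding below are both assumptions.

```python
import numpy as np

def gaussian_smooth(posteriors, K=8, B=None):
    """Smooth a 1-D sequence of frame posteriors with a Gaussian window of
    K neighbors on each side; B plays the role of the variance."""
    if B is None:
        B = (K / 2.0) ** 2                    # assumed default, not from the slides
    offsets = np.arange(-K, K + 1)
    w = (2 * np.pi * B) ** -0.5 * np.exp(-offsets ** 2 / (2 * B))
    w /= w.sum()                              # normalize: output stays a weighted average
    padded = np.pad(posteriors, K, mode="edge")  # repeat edge frames at clip boundaries
    return np.convolve(padded, w, mode="valid")
```

An isolated posterior spike is spread over its neighbors while a constant posterior track is left unchanged, which is exactly the effect one wants against frame-level jitter.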
18. The Challenge Test Set Result
● We re-trained on the training and development sets together, using the best setting (T=30, D=90, d=32)
● We applied Gaussian smoothing with K=8
● Overall we attained
  – 89.6% and 87.3% AUC for laughter and fillers, respectively
  – a UAAUC of 88.4%, outperforming the challenge test set baseline (83.3%) by 5.1% (absolute)
19. Conclusions
● We proposed the use of RFs for laughter detection
● The results indicate superior detection accuracy
● We observed that RFs benefit from feature reduction via mRMR, while SVM performance deteriorates with it
● Together with Gaussian smoothing of the posteriors, we attained an absolute increase of 5.1% over the challenge test set baseline