presentation_Diarization_MIT

Speaker Independent Diarization for Child Language
Environment Analysis Using Deep Neural Networks
By Maryam Najafian
Supervisor Prof. John Hansen
University of Texas at Dallas, US
4th October 2016
Email: m.najafian@utdallas.edu

• This study investigates the language environments of young children using location tracking and speech processing (child-adult speaker diarization)
• Audio recordings are gathered using LENA units
• Location information is gathered using Ubisense units
• Labeled audio is gathered from 32 children, aged 2.5 to 5 years, each wearing a LENA unit over a typical day at three different time points
Introduction
[Images: UBISENSE and LENA units]
1/16

Child-Adult speaker diarization
2/16

LIUM GMM-HMMs with bottom-up clustering
[Diagram: MFCC extraction -> audio segmentation (GLR distance, BIC distance) -> agglomerative clustering (BIC distance) -> Viterbi re-segmentation]
GLR: Generalized Likelihood Ratio
BIC: Bayesian Information Criterion

[Diagram: MFCC extraction -> UBM, MAP-adapted to each class]
UBM: Universal Background Model
MAP: Maximum A Posteriori
Classes: 1-primary child, 2-secondary child, 3-adult, 4-music, 5-crowd noise, 6-silence
3/16
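
The BIC distance named in the segmentation and clustering stages can be illustrated with a minimal sketch. It assumes one diagonal-covariance Gaussian per segment, which is a simplification chosen here for brevity and not necessarily what LIUM uses. The function names and the `penalty_weight` parameter are hypothetical.

```python
import math

def diag_logdet(frames):
    """Log-determinant of a diagonal covariance estimated from frames."""
    n = len(frames)
    dim = len(frames[0])
    total = 0.0
    for d in range(dim):
        column = [f[d] for f in frames]
        mean = sum(column) / n
        var = sum((x - mean) ** 2 for x in column) / n
        total += math.log(max(var, 1e-10))  # floor avoids log(0)
    return total

def delta_bic(seg_a, seg_b, penalty_weight=1.0):
    """Delta-BIC between two segments of feature frames (e.g. MFCCs).

    A positive value favours keeping the segments apart; a negative
    value favours merging them. Assumes diagonal-covariance Gaussians.
    """
    n_a, n_b = len(seg_a), len(seg_b)
    n = n_a + n_b
    dim = len(seg_a[0])
    # GLR-style term: one Gaussian for the merged data vs. one per segment.
    glr = 0.5 * (n * diag_logdet(seg_a + seg_b)
                 - n_a * diag_logdet(seg_a)
                 - n_b * diag_logdet(seg_b))
    # Penalty for the extra model: dim means + dim variances.
    penalty = 0.5 * penalty_weight * (2 * dim) * math.log(n)
    return glr - penalty
```

In bottom-up clustering, the pair of clusters with the lowest delta-BIC would be merged repeatedly until no pair scores below zero.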

Diarization with i-Vector SVM
i-Vector based child-adult turn-taking detection system
• TO-Combo-SAD: threshold-optimized speech activity detection using Combo-SAD features [3,4]
• Combo features: the mean- and variance-normalized Harmonicity, Clarity, Prediction, Periodicity, and Spectral Flux features are linearly mapped to a 1-dimensional 'COMBO' feature
• i-Vector [5] SVM-based classifier
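
The normalize-then-project step for the combo feature can be sketched as follows. This is an illustration only: the weights and threshold are hypothetical placeholders, whereas the cited work [3,4] derives the projection and the decision threshold from data.

```python
import math

def znorm(values):
    """Mean and variance normalize one feature track."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var) or 1.0  # guard against constant tracks
    return [(v - mean) / std for v in values]

def combo_feature(tracks, weights):
    """Linearly map normalized feature tracks to one value per frame.

    tracks: one list of per-frame values for each voicing/flux feature.
    weights: hypothetical projection weights, one per feature.
    """
    normalized = [znorm(t) for t in tracks]
    n_frames = len(tracks[0])
    return [sum(w * track[i] for w, track in zip(weights, normalized))
            for i in range(n_frames)]

def speech_frames(combo, threshold):
    """Label a frame as speech when its combo value exceeds the threshold."""
    return [c > threshold for c in combo]
```

Frames labeled as speech would then be grouped into segments and passed to the i-Vector SVM classifier.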
4/16

System Comparison
• i-Vector-SVM TO-Combo-SAD system, 1.5 s segments, on 4.5 hrs [1]
• 27.3% relative error reduction compared to LIUM on 4.5 hrs [1]
[Pie chart: distribution of 4 acoustic classes in our database (4.5 hours), from manually labeled data: Adult 22%, Primary Child 10%, Secondary Child 16%, Non-speech 52%]
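
For readers unfamiliar with the metric, relative error reduction compares the error of a new system with that of a baseline. The sketch below shows the arithmetic. The slides do not give absolute error rates, so the values in the usage note are hypothetical and chosen only to illustrate the formula.

```python
def relative_error_reduction(baseline_error, system_error):
    """Relative error reduction in percent, with errors given in percent."""
    return 100.0 * (baseline_error - system_error) / baseline_error

def error_after_reduction(baseline_error, relative_reduction_pct):
    """Error rate implied by a baseline and a relative reduction."""
    return baseline_error * (1.0 - relative_reduction_pct / 100.0)
```

For example, with a hypothetical baseline error of 40%, a 27.3% relative reduction corresponds to an error of about 29.1%, not 12.7%.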
5/16

System Comparison
• 28.5% relative error reduction compared to LIUM, on 7.2 hrs
[Pie chart (7.2 hours): Adult 24%, Primary Child 20%, Secondary Child 23%, Non-speech 33%]
7/16

Parallel Asynchronous DNN-HMMs
4
8/16

System Comparison
• 37.11% relative error reduction compared to LIUM, on 7.2 hrs
[Pie chart (7.2 hours): Adult 24%, Primary Child 20%, Secondary Child 23%, Non-speech 33%]
10/16

Case study
• 3 classroom time points: compares the level of interaction between the child and other children and adults
12/16

Case study
• 3 classroom time points: compares % time spent in each of 7 learning/activity areas (art, blocks, books, dramatic play, cubbies, manipulation, science)
13/16

Case study
• The case study aims to collect statistics that enable a wider perspective on child communication with teachers and peers in classrooms across different activity areas, i.e., which areas are hot language spaces?
[Figure: speech produced by adults, primary and secondary children across 7 activity areas in a 33-minute green window]
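
One way to obtain per-area speech statistics of this kind is to intersect diarization output with location tracks. The sketch below assumes both are available as time intervals in seconds on a shared clock. The tuple layouts and the function name are hypothetical, not the authors' data format.

```python
from collections import defaultdict

def speech_time_by_area(speech_segments, location_intervals):
    """Seconds of speech per (area, speaker) pair.

    speech_segments: (start, end, speaker) tuples from diarization.
    location_intervals: (start, end, area) tuples from location tracking.
    """
    totals = defaultdict(float)
    for s_start, s_end, speaker in speech_segments:
        for l_start, l_end, area in location_intervals:
            # Length of the time overlap between the two intervals.
            overlap = min(s_end, l_end) - max(s_start, l_start)
            if overlap > 0:
                totals[(area, speaker)] += overlap
    return dict(totals)
```

The nested loop is quadratic in the number of intervals, which is acceptable for a single recording window. Sorted sweeps would scale better for full-day data.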
14/16

Case study
• The case study aims to collect statistics that enable a wider perspective on child communication with teachers and peers in classrooms across different activity areas, i.e., which areas are hot language spaces?
[Figure: heat map of adult word count, in vocalizations per minute]
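
A heat map of this kind could be built by binning vocalization events onto a floor grid. The sketch below assumes each event carries an (x, y) position taken from location tracking at the time of the vocalization, and it normalizes by the total observation time. Both choices, along with the `cell_size` parameter, are assumptions made for illustration.

```python
from collections import Counter

def vocalization_rate_grid(events, cell_size, duration_minutes):
    """Vocalizations per minute for each occupied grid cell.

    events: (x, y) positions, in metres, of detected vocalizations.
    cell_size: side length of a square grid cell, in metres.
    duration_minutes: total observation time used for normalization.
    """
    counts = Counter()
    for x, y in events:
        cell = (int(x // cell_size), int(y // cell_size))
        counts[cell] += 1
    return {cell: n / duration_minutes for cell, n in counts.items()}
```

Cells with no events are absent from the result and would be drawn as zero when plotted.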
15/16

Summary
• Explored LOCATION & LANGUAGE interactions via diarization
• Proposed DNN-HMM and diarization solutions to assess child-adult interaction in naturalistic learning spaces
• The fused DNN-HMM-based system yields a considerable relative DER reduction, on average, compared to LIUM's GMM-based system with bottom-up clustering
• Analysis plots derived from this work support our ability to:
  - Determine which children are less engaged in voice communication
  - Determine how much talk teachers direct at each child
  - Assess how much communication children have with other children in specific learning/activity areas
  - Determine which learning/activity areas stimulate greater voice communication between child-teacher and child-child
  - Determine in which activity areas individual children, or all children within a given classroom on average, spend their time
16/16

References
[1] M. Najafian, D. Irvin, Y. Luo, B.S. Rous, and J.H.L. Hansen, "Automatic measurement and analysis of the child verbal communication using classroom acoustics within a child care center," in WOCCI, 2016.
[2] M. Najafian and J.H.L. Hansen, "Speaker independent diarization for child language environment analysis using Deep Neural Networks," submitted to IEEE SLT-2016.
[3] S. O. Sadjadi and J.H.L. Hansen, "Unsupervised speech activity detection using voicing measures and perceptual spectral flux," IEEE Signal Processing Letters, vol. 20, no. 3, pp. 197-200, March 2013.
[4] A. Ziaei, L. Kaushik, A. Sangwan, J.H.L. Hansen, and D. Oard, "Speech activity detection for NASA Apollo space missions: challenges and solutions," ISCA INTERSPEECH-2014, Paper #994, Singapore, Sept. 14-18, 2014.
[5] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-end factor analysis for speaker verification," INTERSPEECH, 2011.
