1. Speaker Independent Diarization for Child Language
Environment Analysis Using Deep Neural Networks
By Maryam Najafian
Supervisor Prof. John Hansen
University of Texas at Dallas, US
4th October 2016
Email: m.najafian@utdallas.edu
2. Introduction
This study investigates the language environments of young children
using location tracking and speech processing (child-adult speaker
diarization).
Audio recordings are gathered using LENA units.
Location information is gathered using Ubisense units.
Labeled audio is gathered from 32 children, aged 2.5 to 5 years, each
wearing a LENA unit over a typical day at three different time points.
4. LIUM GMM-HMMs with bottom-up clustering
[Figure: block diagram of the LIUM baseline. Blocks: MFCC extraction,
audio segmentation (GLR distance), BIC distance, agglomerative
clustering (BIC distance), Viterbi re-segmentation. A second branch
contains MFCC extraction and a UBM that is MAP-adapted to each class.]
Classes: 1-primary child, 2-secondary child, 3-adult, 4-music,
5-crowd noise, 6-silence
GLR: Generalized Likelihood Ratio
BIC: Bayesian Information Criterion
UBM: Universal Background Model
MAP: Maximum A Posteriori
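As a rough illustration of the BIC distance used for clustering, the sketch below scores whether two segments of feature frames are better modeled by one Gaussian or by two. It is a minimal sketch, not the LIUM implementation: the diagonal covariance, the `penalty_weight` parameter, and the synthetic frames standing in for MFCCs are all assumptions made for this example.

```python
import math
import random


def diag_logdet(frames):
    """Log-determinant of a diagonal covariance estimated from the frames."""
    n = len(frames)
    total = 0.0
    for column in zip(*frames):
        mean = sum(column) / n
        var = sum((x - mean) ** 2 for x in column) / n
        total += math.log(max(var, 1e-10))  # floor avoids log(0)
    return total


def delta_bic(seg_a, seg_b, penalty_weight=1.0):
    """Delta-BIC between two segments (assumed diagonal Gaussians).

    Positive values favor keeping the segments separate; negative
    values favor merging them into one cluster.
    """
    n_a, n_b = len(seg_a), len(seg_b)
    n = n_a + n_b
    dim = len(seg_a[0])
    gain = 0.5 * (n * diag_logdet(seg_a + seg_b)
                  - n_a * diag_logdet(seg_a)
                  - n_b * diag_logdet(seg_b))
    n_params = 2 * dim  # one mean and one variance per dimension
    penalty = 0.5 * penalty_weight * n_params * math.log(n)
    return gain - penalty


# Synthetic 2-dimensional frames (hypothetical stand-ins for MFCCs).
rng = random.Random(0)


def synth(mean, n=200, dim=2):
    return [[rng.gauss(mean, 1.0) for _ in range(dim)] for _ in range(n)]


same_a, same_b, other = synth(0.0), synth(0.0), synth(5.0)
delta_same = delta_bic(same_a, same_b)  # same source: expected to favor merging
delta_diff = delta_bic(same_a, other)   # different source: expected to stay split
```

In bottom-up clustering, the pair of clusters with the lowest distance is merged repeatedly until no pair scores below zero.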
5. Diarization with i-Vector SVM
i-Vector based child-adult turn-taking detection system
TO-Combo-SAD: Threshold-optimized speech activity detection using
Combo-SAD features [3,4]
Combo features: the mean- and variance-normalized Harmonicity,
Clarity, Prediction, Periodicity, and Spectral Flux features are linearly
mapped to a 1-dimensional 'COMBO' feature
I-Vector [5] SVM-based classifier
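The Combo feature construction described above can be sketched as follows: each of the five per-frame measures is mean and variance normalized, the normalized values are combined with a linear map into one number per frame, and a threshold turns that number into a speech/non-speech decision. The slides do not state how the mapping weights or the threshold are chosen, so the equal weights, the zero threshold, and the toy feature values below are assumptions for illustration only.

```python
import statistics


def zscore(column):
    """Mean and variance normalize one feature across all frames."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    return [(x - mean) / std if std > 0 else 0.0 for x in column]


def combo_feature(frames, weights):
    """Linearly map normalized per-frame features to one value per frame.

    frames: list of feature vectors, one per frame, in the order
            (harmonicity, clarity, prediction, periodicity, spectral flux).
    weights: one weight per feature (hypothetical values).
    """
    columns = [zscore(list(col)) for col in zip(*frames)]
    return [sum(w * v for w, v in zip(weights, row)) for row in zip(*columns)]


def speech_frames(combo, threshold):
    """Mark a frame as speech when its combo value exceeds the threshold."""
    return [value > threshold for value in combo]


# Toy input: two speech-like frames followed by two noise-like frames.
frames = [
    [0.9, 0.8, 0.7, 0.9, 0.6],
    [0.8, 0.9, 0.8, 0.8, 0.7],
    [0.1, 0.2, 0.1, 0.2, 0.1],
    [0.2, 0.1, 0.2, 0.1, 0.2],
]
combo = combo_feature(frames, weights=[0.2] * 5)
decisions = speech_frames(combo, threshold=0.0)
```

A threshold-optimized variant would tune `threshold` on held-out labeled audio rather than fixing it at zero.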
6. System Comparison (4.5 hours)
I-Vector-SVM TO-Combo-SAD system, 1.5 s segments, on 4.5 hrs [1]
27.3% relative error reduction compared to LIUM on 4.5 hrs [1]
Distribution of the 4 acoustic classes in our database
(from manually labeled data):
Adult: 22%
Primary child: 10%
Secondary child: 16%
Non-speech: 52%
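For readers unfamiliar with the metric, a relative error reduction expresses the drop in error as a fraction of the baseline error. The sketch below shows the arithmetic. The absolute error rates in the example are hypothetical, chosen only so that the result matches a 27.3% relative reduction; the slides do not report absolute values.

```python
def relative_error_reduction(baseline_error, system_error):
    """Relative error reduction in percent, measured against the baseline."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return 100.0 * (baseline_error - system_error) / baseline_error


# Hypothetical error rates (percent), not taken from the slides.
example = relative_error_reduction(baseline_error=40.0, system_error=29.08)
```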
8. System Comparison (7.2 hours)
28.5% relative error reduction compared to LIUM, on 7.2 hrs
Distribution of the 4 acoustic classes in our database
(from manually labeled data):
Adult: 24%
Primary child: 20%
Secondary child: 23%
Non-speech: 33%
10. System Comparison
37.11% Relative error reduction compared to LIUM, on 7.2 hrs
12. Case study
3 classroom time points:
Compares the level of interaction between the child and other children
and adults
13. Case study
3 classroom time points:
Compares % time spent in each of 7 learning/activity areas
(art, blocks, books, dramatic play, cubbies, manipulation, science)
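The percentage of time spent in each activity area can be derived from the location track. The sketch below assumes the tracker output has already been reduced to one area label per fixed-rate sample; that representation, and the sample values, are assumptions made for this example rather than the Ubisense output format.

```python
from collections import Counter

AREAS = ["art", "blocks", "books", "dramatic play",
         "cubbies", "manipulation", "science"]


def percent_time_per_area(area_samples):
    """Share of samples (in percent) that fall in each activity area.

    area_samples: one area label per location sample, taken at a fixed
    rate, so sample counts are proportional to time spent.
    """
    counts = Counter(area_samples)
    total = sum(counts[area] for area in AREAS)
    if total == 0:
        return {area: 0.0 for area in AREAS}
    return {area: 100.0 * counts[area] / total for area in AREAS}


# Hypothetical track: three samples in art, one in books.
shares = percent_time_per_area(["art", "art", "books", "art"])
```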
14. Case study
Case study aims to collect statistics that enable a wider perspective of child
communication between teachers and peers in classrooms across different
activity areas, i.e., which areas are hot language spaces?
Speech produced by adults, primary and secondary children
across 7 activity areas in a 33-minute green window
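One way to obtain speech totals per speaker class and activity area is to intersect diarization segments with the time intervals the tracked child spent in each area. The sketch below is a minimal illustration under that assumption; the tuple layouts and the example times are hypothetical and are not the data format used in the study.

```python
from collections import defaultdict


def speech_seconds_by_area(speech_segments, location_intervals):
    """Sum overlapping speech time per (area, speaker class) pair.

    speech_segments: (start_s, end_s, speaker_class) from diarization.
    location_intervals: (start_s, end_s, area) from location tracking.
    """
    totals = defaultdict(float)
    for seg_start, seg_end, speaker in speech_segments:
        for loc_start, loc_end, area in location_intervals:
            overlap = min(seg_end, loc_end) - max(seg_start, loc_start)
            if overlap > 0:
                totals[(area, speaker)] += overlap
    return dict(totals)


# Hypothetical two-minute excerpt.
locations = [(0, 60, "art"), (60, 120, "books")]
speech = [
    (10, 20, "adult"),
    (55, 65, "primary child"),      # straddles the area change
    (100, 110, "secondary child"),
]
totals = speech_seconds_by_area(speech, locations)
```

A segment that straddles an area change is split between the two areas in proportion to its overlap with each.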
15. Case study
Case study aims to collect statistics that enable a wider perspective of child
communication between teachers and peers in classrooms across different
activity areas, i.e., which areas are hot language spaces?
Heat map: adult word count, vocalizations per minute
16. Summary
Explored LOCATION & LANGUAGE interactions via diarization
Proposed DNN-HMM and diarization solutions to assess child-adult
interaction in naturalistic learning spaces
Using the fused DNN-HMM based system leads to a considerable
relative DER reduction on average compared to LIUM's GMM-based
system with bottom-up clustering.
Analysis plots derived from this work support our ability to:
Determine which children are less engaged in voice communication
Determine how much talk teachers direct at each child
Assess how much communication children have with other children
in specific learning/activity areas
Determine which learning/activity areas stimulate greater voice
communication between child-teacher and child-child
Determine in which activity areas individual children, or all children
within a given classroom on average, spend their time
17. References
[1] M. Najafian, D. Irvin, Y. Luo, B.S. Rous, and J.H.L. Hansen, "Automatic
measurement and analysis of the child verbal communication using classroom
acoustics within a child care center," in WOCCI, 2016.
[2] M. Najafian and J.H.L. Hansen, "Speaker independent diarization for child
language environment analysis using Deep Neural Networks," submitted to IEEE
SLT-2016.
[3] S. O. Sadjadi and J.H.L. Hansen, "Unsupervised speech activity detection using
voicing measures and perceptual spectral flux," IEEE Signal Processing Letters,
vol. 20, no. 3, pp. 197-200, March 2013.
[4] A. Ziaei, L. Kaushik, A. Sangwan, J.H.L. Hansen, and D. Oard, "Speech activity
detection for NASA Apollo space missions: challenges and solutions," ISCA
INTERSPEECH-2014, Paper #994, Singapore, Sept. 14-18, 2014.
[5] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-end factor
analysis for speaker verification," INTERSPEECH, 2011.