Speaker Independent Diarization for Child Language
Environment Analysis Using Deep Neural Networks
By Maryam Najafian
Supervisor Prof. John Hansen
University of Texas at Dallas, US
4th October 2016
Email: m.najafian@utdallas.edu
Introduction
• This study investigates the language environments of young children by combining location tracking with speech processing (child-adult speaker diarization).
• Audio recordings are gathered using LENA units.
• Location information is gathered using Ubisense units.
• Labeled audio is gathered from 32 children, aged 2.5 to 5 years, each wearing a LENA unit over a typical day at three different time points.
Child-Adult speaker diarization
LIUM GMM-HMMs with bottom-up clustering
• MFCC extraction
• Audio segmentation (GLR distance)
• Agglomerative clustering (BIC distance)
• Viterbi re-segmentation
• UBM, MAP-adapted to each class, on MFCC features
• Classes: 1) primary child, 2) secondary child, 3) adult, 4) music, 5) crowd noise, 6) silence
Abbreviations: GLR = Generalized Likelihood Ratio; BIC = Bayesian Information Criterion; UBM = Universal Background Model; MAP = Maximum A Posteriori
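To make the BIC distance used in the clustering step concrete, here is a minimal sketch of the standard ΔBIC merge criterion. It assumes each segment is modeled by a single 1-dimensional Gaussian, whereas real systems use multi-dimensional MFCC vectors. The function name `delta_bic`, the penalty weight `lam`, and the variance floor are illustrative assumptions, not taken from the slides or from LIUM.

```python
import math
from statistics import pvariance


def delta_bic(x, y, lam=1.0, floor=1e-8):
    """Delta-BIC between two 1-D segments, one Gaussian per segment.

    Positive values favor keeping two clusters; negative values favor merging.
    Simplifying assumption: scalar features (d = 1) instead of MFCC vectors.
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    # Variances are floored so that log() stays finite for constant segments.
    var1 = max(pvariance(x), floor)
    var2 = max(pvariance(y), floor)
    var = max(pvariance(list(x) + list(y)), floor)
    # Likelihood gain of two separate Gaussians over one merged Gaussian.
    gain = 0.5 * (n * math.log(var) - n1 * math.log(var1) - n2 * math.log(var2))
    # Complexity penalty: d mean parameters plus d(d+1)/2 covariance parameters.
    d = 1
    penalty = 0.5 * lam * (d + d * (d + 1) / 2) * math.log(n)
    return gain - penalty


segment_a = [0.0, 0.1, -0.1, 0.05, -0.05, 0.2, -0.2, 0.0]
segment_b = [v + 5.0 for v in segment_a]  # same spread, shifted mean
print("same source:", delta_bic(segment_a, segment_a))
print("different source:", delta_bic(segment_a, segment_b))
```

In bottom-up clustering, the pair of clusters with the lowest ΔBIC is typically merged repeatedly until no pair has a negative ΔBIC.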
Diarization with i-Vector SVM
i-Vector based child-adult turn-taking detection system
• TO-Combo-SAD: threshold-optimized speech activity detection using Combo-SAD features [3,4]
• Combo features: the mean- and variance-normalized harmonicity, clarity, prediction, periodicity, and spectral flux features are linearly mapped to a 1-dimensional 'COMBO' feature
• i-Vector [5] SVM-based classifier
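The Combo feature construction described above can be sketched as follows: each of the five per-frame features is mean- and variance-normalized, then a linear map collapses them to one value per frame, which is thresholded. The equal weights and the threshold value are hypothetical placeholders; the slides do not state how the linear map or the optimized threshold is obtained. The sketch also assumes that larger values indicate speech.

```python
from statistics import fmean, pstdev


def zscore(column):
    """Mean and variance normalization of one feature track."""
    mu = fmean(column)
    sigma = pstdev(column) or 1.0  # avoid division by zero on constant tracks
    return [(v - mu) / sigma for v in column]


def combo_feature(frames, weights):
    """Map 5-D frames to a 1-D feature with a linear projection.

    frames: list of tuples (harmonicity, clarity, prediction,
            periodicity, spectral_flux), one tuple per frame.
    weights: one weight per feature (hypothetical values here).
    """
    columns = [zscore(col) for col in zip(*frames)]
    return [sum(w * v for w, v in zip(weights, row)) for row in zip(*columns)]


def speech_frames(combo, threshold):
    """Frame-level speech decision: True where combo exceeds the threshold."""
    return [c > threshold for c in combo]


frames = [(1, 1, 1, 1, 1), (2, 2, 2, 2, 2), (3, 3, 3, 3, 3)]  # toy values
combo = combo_feature(frames, weights=[0.2] * 5)
print(speech_frames(combo, threshold=0.5))
```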
System Comparison
• i-Vector-SVM TO-Combo-SAD system, 1.5 s segments, on 4.5 hrs [1]
• 27.3% relative error reduction compared to LIUM on 4.5 hrs [1]
4.5 hours: distribution of 4 acoustic classes in our database (from manually labeled data)
Adult: 22%
Primary child: 10%
Secondary child: 16%
Non-speech: 52%
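For readers unfamiliar with the metric, relative error reduction expresses the improvement as a fraction of the baseline error. A minimal sketch follows; the baseline error of 40% is a made-up number used only to show the arithmetic, since the slides report relative reductions but no absolute error rates.

```python
def relative_error_reduction(baseline_error, system_error):
    """Relative error reduction in percent: 100 * (baseline - system) / baseline."""
    if baseline_error <= 0:
        raise ValueError("baseline error must be positive")
    return 100.0 * (baseline_error - system_error) / baseline_error


def implied_error(baseline_error, reduction_percent):
    """Error rate implied by a given relative reduction over a baseline."""
    return baseline_error * (1.0 - reduction_percent / 100.0)


# Hypothetical baseline: if LIUM had a 40% error rate, a 27.3% relative
# reduction would correspond to roughly a 29.1% error rate.
print(implied_error(40.0, 27.3))
print(relative_error_reduction(40.0, 30.0))
```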
Synchronous DNN-HMMs
System Comparison
• 28.5% relative error reduction compared to LIUM, on 7.2 hrs
7.2 hours: distribution of 4 acoustic classes in our database (from manually labeled data)
Adult: 24%
Primary child: 20%
Secondary child: 23%
Non-speech: 33%
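A class distribution like the one above can be derived from manually labeled segments by summing durations per class. The sketch below assumes a hypothetical label format of `(class, start_seconds, end_seconds)` tuples; the toy segments are chosen only so that the shares match the slide's percentages.

```python
from collections import defaultdict


def class_distribution(segments):
    """Percentage of total labeled time per acoustic class.

    segments: iterable of (label, start_seconds, end_seconds) tuples.
    """
    totals = defaultdict(float)
    for label, start, end in segments:
        totals[label] += end - start
    grand_total = sum(totals.values())
    if grand_total <= 0:
        raise ValueError("no labeled time in segments")
    return {label: 100.0 * t / grand_total for label, t in totals.items()}


# Toy labels, not real data.
labeled = [
    ("adult", 0, 24),
    ("primary_child", 24, 44),
    ("secondary_child", 44, 67),
    ("non_speech", 67, 100),
]
for label, share in class_distribution(labeled).items():
    print(f"{label}: {share:.0f}%")
```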
Parallel Asynchronous DNN-HMMs
System Comparison
• 37.11% relative error reduction compared to LIUM, on 7.2 hrs
System Comparison
Case study
• 3 classroom time points: compares the level of interaction between the child and other children and adults
Case study
• 3 classroom time points: compares % time spent in each of 7 learning/activity areas (art, blocks, books, dramatic play, cubbies, manipulation, science)
Case study
• The case study aims to collect statistics that enable a wider perspective of child communication with teachers and peers in classrooms across different activity areas, i.e., which areas are "hot language spaces"?
Figure: speech produced by adults, primary and secondary children across 7 activity areas in a 33-minute green window
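One way to obtain per-area speech statistics of this kind is to intersect the diarization output with the location track. The sketch below assumes hypothetical formats: diarized segments as `(speaker_class, start, end)` and location visits as `(area, start, end)`, both in seconds on a shared clock. The actual alignment between LENA audio and Ubisense positions is not described in the slides.

```python
from collections import defaultdict


def speech_by_area(segments, visits):
    """Seconds of speech per (area, speaker_class) pair.

    segments: (speaker_class, start, end) tuples from diarization.
    visits:   (area, start, end) tuples from location tracking.
    """
    totals = defaultdict(float)
    for speaker, seg_start, seg_end in segments:
        for area, visit_start, visit_end in visits:
            # Length of the time interval shared by the segment and the visit.
            overlap = min(seg_end, visit_end) - max(seg_start, visit_start)
            if overlap > 0:
                totals[(area, speaker)] += overlap
    return dict(totals)


# Toy data, not from the study.
visits = [("blocks", 0, 60), ("books", 60, 120)]
segments = [("adult", 50, 70), ("primary_child", 100, 110)]
for (area, speaker), seconds in sorted(speech_by_area(segments, visits).items()):
    print(area, speaker, seconds)
```

Areas with consistently high totals across children would be candidates for the "hot language spaces" the case study asks about.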
Case study
Figure: heat map of adult word count, vocalizations per minute
Summary
• Explored LOCATION & LANGUAGE interactions via diarization
• Proposed DNN-HMM based diarization solutions to assess child-adult interaction in naturalistic learning spaces
• Using the fused DNN-HMM based system leads to a considerable relative DER reduction on average compared to LIUM's GMM-based system with bottom-up clustering
• Analysis plots derived from this work support our ability to:
  • Determine which children are less engaged in voice communication
  • Determine how much talk teachers direct at each child
  • Assess how much communication children have with other children in specific learning/activity areas
  • Determine which learning/activity areas stimulate greater voice communication between child-teacher and child-child
  • Determine in which activity areas individual children, or all children within a given classroom on average, spend their time
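Since the summary is stated in terms of DER (diarization error rate), a simplified frame-level version of the metric may help: missed speech, false alarms, and class confusions are summed and divided by the amount of reference speech. This sketch assumes class-level labels per frame and a `"non_speech"` marker, both hypothetical. It omits the speaker mapping and forgiveness collar used in standard scoring tools, and the slides do not say how DER was scored.

```python
def frame_level_der(reference, hypothesis, non_speech="non_speech"):
    """Simplified DER in percent over aligned per-frame class labels."""
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")
    missed = false_alarm = confusion = speech = 0
    for ref, hyp in zip(reference, hypothesis):
        if ref != non_speech:
            speech += 1
            if hyp == non_speech:
                missed += 1        # speech labeled as non-speech
            elif hyp != ref:
                confusion += 1     # speech assigned to the wrong class
        elif hyp != non_speech:
            false_alarm += 1       # non-speech labeled as speech
    if speech == 0:
        raise ValueError("reference contains no speech frames")
    return 100.0 * (missed + false_alarm + confusion) / speech


# Toy labels: A = adult, C = child, N = non-speech.
A, C, N = "adult", "child", "non_speech"
ref = [A, A, C, C, N, N, A, A, C, C]
hyp = [A, N, C, A, N, A, A, A, C, C]
print(frame_level_der(ref, hyp))
```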
References
• [1] M. Najafian, D. Irvin, Y. Luo, B.S. Rous, and J.H.L. Hansen, "Automatic measurement and analysis of the child verbal communication using classroom acoustics within a child care center," in WOCCI, 2016.
• [2] M. Najafian and J.H.L. Hansen, "Speaker independent diarization for child language environment analysis using Deep Neural Networks," submitted to IEEE SLT-2016.
• [3] S.O. Sadjadi and J.H.L. Hansen, "Unsupervised speech activity detection using voicing measures and perceptual spectral flux," IEEE Signal Processing Letters, vol. 20, no. 3, pp. 197-200, March 2013.
• [4] A. Ziaei, L. Kaushik, A. Sangwan, J.H.L. Hansen, and D. Oard, "Speech activity detection for NASA Apollo space missions: challenges and solutions," ISCA INTERSPEECH-2014, Paper #994, Singapore, Sept. 14-18, 2014.
• [5] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-end factor analysis for speaker verification," INTERSPEECH, 2011.
