
Comparison of LRs based on MVKD and GMM-UBM

Hughes, V., French, J. P., Foulkes, P., Gold, E., Harrison, P. and Watt, D. (2014) Comparison of LRs based on MVKD and GMM-UBM. 2nd International Workshop of the 2011 Monopoly Project 'Methodological Guidelines for Semi-Automatic and Automatic Speaker Recognition for Case Assessment and Interpretation', Wiesbaden, Germany. 18 November 2014. (INVITED TALK)


  1. 2nd International Expert Workshop of the 2011 Monopoly Project: Methodological guidelines for semi-automatic and automatic speaker recognition for case assessment and interpretation. 18th November 2014. Peter French, Paul Foulkes, Erica Gold, Philip Harrison, Vincent Hughes, Dominic Watt.
  2. OUTLINE
     0. Motivation
     1. Variables
     2. Modelling
     3. Previous Work
     4. Empirical Studies
     5. Conclusion
  3. 0. MOTIVATION
     • modelling = essential methodological decision in computing LRs
     • ling-phon speaker recognition = MVKD
       § token-based nature of the data
     • (S)ASR = GMM-UBM
       § stream-based nature of the data
     • limited research comparing the MVKD and GMM-UBM for ling-phon variables
       § but no guidance available in selecting appropriate modelling techniques for (S)ASR
     • this presentation considers the effects that the modelling of data has on system performance
  4. 1. VARIABLES
     • automatic (ASR):
       § cepstral coefficients (CCs) and derivatives (deltas, delta-deltas)
     • semi-automatic (SASR):
       § more researchers now examining semi-automatic speaker recognition variables (e.g. LTFDs)
       § these types of data sit in the grey area between traditional automatic data and linguistic-phonetic data
     • linguistic-phonetic variables analysed as a stream
     • “automatic” = dealing with the signal as a stream rather than as discrete linguistic units (e.g. phonemes; token-based data)
  5. 2. MODELLING
     • (S)ASR variables: easy to model
       § continuous
       § produce lots of data
       § orthogonal (in the case of CCs; Rose 2013)
     • but still necessary to determine which model is most appropriate
       § MVKD (Aitken and Lucy 2004) vs. GMM-UBM (Reynolds et al. 2000)
       § which produces the best system performance?
  6. 2. MODELLING: MVKD
     • developed for forensic glass fragment analysis
     • suspect data: Gaussian distribution
     • reference data: speaker-dependent Gaussian kernel density estimation
     • generally preferred for token-based linguistic-phonetic speaker recognition/comparison (Morrison 2011; Jessen & Enzinger 2014)
       § assumption of normality within speakers
       § multivariate data with a relatively small number of correlated features (Nair et al. 2014)
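The similarity/typicality logic behind the approach on this slide can be sketched in a deliberately simplified, univariate form: a Gaussian fitted to the suspect's tokens supplies the similarity term, and a kernel density over reference-speaker means supplies the typicality term. This is only an illustration of the idea; the actual Aitken and Lucy (2004) MVKD formula is multivariate and models within- and between-speaker variance jointly, and the function name `simple_kernel_lr` is invented here.

```python
import numpy as np
from scipy.stats import norm, gaussian_kde

def simple_kernel_lr(suspect_tokens, questioned_tokens, reference_means):
    """Simplified univariate illustration of the MVKD idea (NOT the full
    Aitken & Lucy 2004 formula): similarity from a Gaussian fitted to the
    suspect's tokens, typicality from a kernel density estimate over
    reference-speaker means."""
    q_mean = np.mean(questioned_tokens)
    # similarity: how well the questioned mean fits the suspect model
    similarity = norm.pdf(q_mean, loc=np.mean(suspect_tokens),
                          scale=np.std(suspect_tokens, ddof=1))
    # typicality: how common such a value is in the reference population
    typicality = gaussian_kde(reference_means)(q_mean)[0]
    return similarity / typicality
```

With well-separated synthetic formant values, a same-speaker comparison yields an LR above 1 and a mismatched one an LR far below 1.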
  7. 2. MODELLING: GMM-UBM
     • developed for ASR
     • UBM = speaker-independent GMM based on data from appropriate reference speakers
     • suspect model = GMM
       § built using raw data (Becker et al. 2008) or…
       § built using MAP adaptation based on the UBM
     • N Gaussians dependent on the amount of data and dimensionality
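The GMM-UBM pipeline on this slide can be sketched with scikit-learn. This sketch takes the raw-data route for the suspect model (the Becker et al. 2008 option) rather than MAP adaptation, and the component count and helper name `gmm_ubm_score` are illustrative choices, not the settings used in the studies reported here.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_ubm_score(suspect_feats, reference_feats, questioned_feats,
                  n_components=4):
    """Sketch of GMM-UBM scoring. The UBM is a speaker-independent GMM over
    pooled reference data; the suspect model is fitted on raw suspect data
    (MAP adaptation of the UBM is the common alternative). The score is the
    mean per-frame log-likelihood ratio of the questioned material."""
    ubm = GaussianMixture(n_components=n_components, covariance_type="diag",
                          random_state=0).fit(reference_feats)
    suspect = GaussianMixture(n_components=n_components, covariance_type="diag",
                              random_state=0).fit(suspect_feats)
    return np.mean(suspect.score_samples(questioned_feats)
                   - ubm.score_samples(questioned_feats))
```

On synthetic 2-D features, questioned material from the suspect's region scores higher than material from a distant speaker.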
  8. 3. PREVIOUS WORK
     • small amount of work: MVKD vs. GMM-UBM for linguistic-phonetic variables
       § Rose & Winter (2010): marginally better performance of MVKD over GMM-UBM using fusion of F1~F3 of 5 AusEng vowels (by around 4% EER, 0.16 Cllr)
       § Morrison (2011): marginally better performance of GMM-UBM over MVKD using F2 trajectories of AusEng diphthongs
     • but limited work on different models for (S)ASR
  9. 4. EMPIRICAL STUDIES: ASR (HUGHES 2014)
     • TIMIT (DR3): database of American English (Garofolo et al. 1993)
     • narrowly controlled read speech (10 sentences per speaker)
     • MFCCs and derivatives extracted from the speech-active portion of the recordings
       § extracted using HTK (12 coefficients)
     • speakers: 28 training/ 25 test/ 28 reference
     • feature-to-score: MVKD & GMM-UBM
       § GMMs created using 32 Gaussians (based on Reynolds 1995)
     • score-to-LR: logistic regression calibration (Brümmer & du Preez 2006)
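The score-to-LR step named above can be sketched as follows: logistic-regression calibration (Brümmer & du Preez 2006) learns an affine map from raw scores to calibrated log-likelihood-ratios, trained on scores from known same-speaker and different-speaker comparisons. The helper name `train_calibration` is invented, and strictly the learned log-odds include a prior-odds offset (which vanishes for equal-sized score sets); the toolkit actually used is not specified here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_calibration(ss_scores, ds_scores):
    """Sketch of logistic-regression score-to-LLR calibration: fit a
    logistic regression on labelled same-speaker (1) and different-speaker
    (0) scores, then apply its affine map score -> w*score + b as the
    calibrated LLR. With equal-sized score sets the log-prior-odds offset
    is zero, so the log-odds approximate the LLR directly."""
    x = np.concatenate([ss_scores, ds_scores]).reshape(-1, 1)
    y = np.concatenate([np.ones(len(ss_scores)), np.zeros(len(ds_scores))])
    clf = LogisticRegression().fit(x, y)
    w, b = clf.coef_[0, 0], clf.intercept_[0]
    return lambda s: w * np.asarray(s) + b  # calibrated log-odds ~ LLR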
  10. [Figure: system performance for GMM-UBM vs. MVKD with MFCC, MFCC+delta, and MFCC+delta+delta-delta input; axes: log-LR cost (Cllr, 0.0–1.0) and EER (%, 0.00–0.10)]
  11. 4. EMPIRICAL STUDIES: ASR (HUGHES 2014)
     • marginally better EER and Cllr using GMM-UBM compared with MVKD
       § across all forms of MFCC input
       § but the differences are extremely small (max 0.3% EER/ 0.04 Cllr)
       § ceiling effect - use of read text and limited data?
     • no obvious improvement with the addition of derivatives
     • magnitude of the LRs themselves considerably greater for MVKD
       § but not reflected in system performance
       § better decorrelation of individual features using MVKD?
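The two metrics compared throughout (EER and Cllr) can be computed from sets of same-speaker and different-speaker LLRs as follows; this is a straightforward implementation of the standard definitions, assuming natural-log LLRs as input.

```python
import numpy as np

def eer(ss_llrs, ds_llrs):
    """Equal error rate: the operating point where the miss rate
    (same-speaker LLRs below threshold) equals the false-alarm rate
    (different-speaker LLRs at or above threshold)."""
    ss, ds = np.asarray(ss_llrs), np.asarray(ds_llrs)
    thresholds = np.sort(np.concatenate([ss, ds]))
    best = min(thresholds,
               key=lambda t: abs(np.mean(ss < t) - np.mean(ds >= t)))
    return (np.mean(ss < best) + np.mean(ds >= best)) / 2

def cllr(ss_llrs, ds_llrs):
    """Log-LR cost (Brummer): penalises both discrimination errors and
    miscalibration; 0 = perfect, 1 = an uninformative system. Inputs are
    natural-log LLRs, converted to base 2 inside as the definition requires."""
    ss, ds = np.asarray(ss_llrs), np.asarray(ds_llrs)
    return 0.5 * (np.mean(np.log2(1 + np.exp(-ss))) +
                  np.mean(np.log2(1 + np.exp(ds))))
```

For perfectly separated, well-calibrated scores the EER is 0 and the Cllr is close to 0.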
  12. 4. EMPIRICAL STUDIES: SASR (GOLD ET AL. 2013)
     • Dynamic Variability of Speech Corpus (DyViS; Nolan et al. 2009)
       § 100 male speakers
       § Southern Standard British English (SSBE)
       § aged 18-25
       § Task 2: spontaneous speech
     • recordings auto-segmented to obtain a minimum of 50 secs of vowels per speaker
       § iCAbS (iterative cepstral analysis by synthesis) formant tracker used to automatically extract and measure F1-F4 (5 ms shift)
  13. 4. METHODOLOGY: SASR
     • raw formant data averaged over different lengths of time
       § ‘packages’ (Moos 2010)
       § 100-284 (LTFD1-4) measurements per speaker (mode = 100)
     • speakers: 50 test/ 50 reference (Gold et al. 2013)
       § feature-to-score: MVKD
     • French et al. (2012):
       § LR scores for LTFDs computed using GMM-UBM
       § MFCCs extracted from the same data; LR scores computed using GMM-UBM
     • results compared across Gold et al. (2013)/ French et al. (2012)/ Becker et al. (2008)
       § MVKD vs. GMM-UBM (using LTFD)
       § LTFDs vs. MFCCs
  14. 4. INDIVIDUAL FORMANT RESULTS

      | Comparisons | % Correct | Mean LLR | Min LLR  | Max LLR | EER    | Cllr   |
      |-------------|-----------|----------|----------|---------|--------|--------|
      | LTFD1 SS    | 72        | 0.224    | -2.158   | 1.902   | 28.06% | 0.8840 |
      | LTFD1 DS    | 71.7      | -4.858   | -68.768  | 1.993   |        |        |
      | LTFD2 SS    | 70        | 0.162    | -1.077   | 1.259   | 31.65% | 0.8119 |
      | LTFD2 DS    | 67.5      | -1.939   | -27.814  | 1.602   |        |        |
      | LTFD3 SS    | 88        | 0.288    | -8.373   | 3.743   | 17.00% | 1.0731 |
      | LTFD3 DS    | 80.6      | -11.857  | -139.273 | 1.734   |        |        |
      | LTFD4 SS    | 68        | 0.238    | -2.258   | 1.378   | 22.14% | 0.8085 |
      | LTFD4 DS    | 80.2      | -11.574  | -124.808 | 1.301   |        |        |
  16. 4. FORMANT COMBINATION RESULTS

      | Comparisons    | % Correct | Mean LLR | Min LLR  | Max LLR | EER    | Cllr   |
      |----------------|-----------|----------|----------|---------|--------|--------|
      | LTFD1+2 SS     | 70        | 0.417    | -2.472   | 2.761   | 20.41% | 0.7648 |
      | LTFD1+2 DS     | 85        | -7.477   | -76.391  | 1.996   |        |        |
      | LTFD2+3 SS     | 76        | 0.334    | -7.828   | 3.768   | 13.92% | 0.9630 |
      | LTFD2+3 DS     | 89.9      | -14.173  | -156.130 | 1.956   |        |        |
      | LTFD1+2+3 SS   | 74        | 0.625    | -7.632   | 3.676   | 11.47% | 1.0161 |
      | LTFD1+2+3 DS   | 94.3      | -19.307  | -155.807 | 3.007   |        |        |
      | LTFD1+2+3+4 SS | 84        | 1.160    | -5.292   | 5.466   | 4.14%  | 0.5411 |
      | LTFD1+2+3+4 DS | 97.43     | -29.228  | -162.931 | 2.854   |        |        |
  18. 4. PACKAGE LENGTH RESULTS

      | Package Length | SS % Correct | DS % Correct | Mean SS LLR | Mean DS LLR | Min SS LLR | Min DS LLR | Max SS LLR | Max DS LLR | EER   | Cllr   |
      |----------------|--------------|--------------|-------------|-------------|------------|------------|------------|------------|-------|--------|
      | .25 sec        | 76           | 97.96        | 0.90        | -35.58      | -6.51      | -199.61    | 5.64       | 2.93       | 4.29% | 0.7745 |
      | .5 sec         | 84           | 97.43        | 1.16        | -29.23      | -5.29      | -162.93    | 5.47       | 2.85       | 4.14% | 0.5411 |
      | 1 sec          | 88           | 96.73        | 1.34        | -24.17      | -4.18      | -134.33    | 5.29       | 2.79       | 4.22% | 0.4001 |
      | 2.5 sec        | 94           | 95.76        | 1.52        | -17.67      | -3.17      | -98.41     | 4.99       | 2.88       | 4.33% | 0.2813 |
      | 5 sec          | 96           | 94.82        | 1.60        | -13.85      | -2.45      | -84.60     | 4.77       | 2.90       | 4.22% | 0.2393 |
      | 10 sec         | 98           | 92.78        | 1.55        | -9.27       | -2.59      | -62.94     | 4.41       | 2.68       | 5.61% | 0.2568 |
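The 'package' averaging behind these results (raw 5 ms-shift formant measurements averaged into windows of .25 to 10 seconds) can be sketched as below. How the studies handled partial trailing windows is not stated; dropping them is an assumption of this sketch, and `package_means` is an invented name.

```python
import numpy as np

def package_means(formant_track, frame_shift=0.005, package_len=0.5):
    """Average a raw formant track (one value per analysis frame, e.g. a
    5 ms shift) into non-overlapping 'packages' (Moos 2010) of a given
    length in seconds. Trailing frames that do not fill a whole package
    are dropped (an assumption; the original handling is unspecified)."""
    frames_per_pkg = int(round(package_len / frame_shift))
    n_pkgs = len(formant_track) // frames_per_pkg
    trimmed = np.asarray(formant_track[:n_pkgs * frames_per_pkg])
    return trimmed.reshape(n_pkgs, frames_per_pkg).mean(axis=1)
```

With a 5 ms shift, a .5 sec package averages 100 consecutive measurements into one feature value.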
  20. 4. MVKD VS. GMM-UBM

      |                              | SS LLRs > 0 | DS LLRs < 0 | EER (%) |
      |------------------------------|-------------|-------------|---------|
      | MVKD (Gold et al. 2013)      | 84%         | 97.4%       | 4.14%   |
      | GMM-UBM (French et al. 2012) | 94%         | 97.4%       | -       |
  21. 4. MFCC VS. LTFD

      | Study                | LTFD1+2+3 (SS/DS/EER) | LTFD1+2+3+4 (SS/DS/EER) | MFCC (SS/DS/EER) |
      |----------------------|-----------------------|-------------------------|------------------|
      | Gold et al. (2013)   | 74% / 94.3% / 11.47%  | 84% / 97.4% / 4.14%     | - / - / -        |
      | French et al. (2012) | - / - / -             | 94% / 97.4% / -         | 100% / 95% / -   |
      | Becker et al. (2008) | - / - / 5.3%          | - / - / -               | - / - / -        |
  22. 5. CONCLUSION
     ASR
     • GMM-UBM = marginally better than MVKD
       § true for all forms of MFCC input
       § but differences are small (and there are issues with the material itself - read speech etc.)
     SASR
     • GMM-UBM = marginally better than MVKD overall
       § MVKD, however, better at DS comparisons
       § limited number of SS comparisons
       § LTFD in Gold et al. (2013) not calibrated
  23. THANK YOU. QUESTIONS?
