BUT2012 APPROACHES FOR SPOKEN WEB SEARCH - MEDIAEVAL 2012
1. BUT2012
Brno University of Technology
Faculty of Information Technology
Speech@FIT
Igor Szöke, Michal Fapšo, Karel Veselý
MediaEval 2012 workshop – SWS task, October 4-5, 2012, Pisa
2. Outline
Systems overview & Underlying technologies
PhnRec, R-AKWS, AKWS – primary system
DTW
(GMM/HMM) – not submitted
Calibration
Results and discussion
MediaEval SWS 2012 BUT2012 2
workshop - 4.-5.10. Pisa
3. System overview
Our internal task was
− to build a simple, minimalistic, language-dependent
Query-by-Example (QbE) system.
Ingredients
− Development data, Neural net classifier,
Phoneme recognizer, Acoustic keyword
spotting, DTW, Calibration
4. System overview
Sentence mean normalization
Neural network based features
− bottle-necks
− three state phone posteriors
Query detector
− AKWS
− DTW
− (GMM/HMM) – not submitted to the evals

             Bottle-Neck   Posteriors
AKWS         -             X
DTW          X             X
(GMM/HMM)    X             -
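
The slides name DTW as one of the query detectors without showing it. As a rough illustration only, the sketch below matches a query against an utterance with subsequence DTW over posterior vectors. The function names, the -log inner-product frame distance, the three-way step pattern, and the normalization by query length are all assumptions made for this example; they are not taken from the BUT system.

```python
import numpy as np

def local_distance(query, utterance, eps=1e-10):
    """Frame-pair distance -log(q . u) between posterior vectors (assumed choice)."""
    return -np.log(np.clip(query @ utterance.T, eps, None))

def subsequence_dtw(query, utterance):
    """Find the best match of `query` anywhere inside `utterance`.

    Both inputs are (frames x classes) arrays of posteriors.
    Returns (cost, end_frame): the accumulated distance divided by the
    query length, and the utterance frame where the best path ends.
    """
    dist = local_distance(query, utterance)
    n, m = dist.shape
    acc = np.full((n, m), np.inf)
    acc[0, :] = dist[0, :]  # a path may start at any utterance frame
    for i in range(1, n):
        acc[i, 0] = acc[i - 1, 0] + dist[i, 0]
        for j in range(1, m):
            best_prev = min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
            acc[i, j] = dist[i, j] + best_prev
    end_frame = int(np.argmin(acc[-1, :]))  # a path may end at any frame
    return acc[-1, end_frame] / n, end_frame

# Toy usage: one-hot "posteriors" over 3 classes.
utterance = np.eye(3)[[0, 0, 1, 2, 0, 0]]
query = np.eye(3)[[1, 2]]
cost, end_frame = subsequence_dtw(query, utterance)
```

In the toy usage the query classes (1, 2) occur at utterance frames 2 and 3, so the best path ends at frame 3. A real detector would turn the cost into a detection score and also recover the start frame by backtracking.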
5. Underlying technologies
Universal context, bottle-neck neural network based classifier
devC state re-alignment, Reduced phone set (50 phonemes)
Trained by Tnet – our tool, publicly available
8. GMM/HMM
Inspired by AKWS, not submitted due to bad results.
           MTWV    MTWVcalib   UBTWV
R-AKWS     0.739   0.786       0.859
AKWS       0.452   0.493       0.600
DTW        0.400   0.468       0.552
GMM/HMM    0.011   -           0.336
9. Calibration
TWV: pooled (one threshold for all terms). UBTWV: non-pooled TWV (each term has its own best threshold).
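
To make the pooled versus non-pooled distinction concrete, here is a minimal sketch assuming the usual term-weighted value form, 1 - (P_miss + beta * P_FA), averaged over terms. The function names, the toy data, and the values of `beta` and `n_nontarget` are illustrative assumptions, not the settings of the SWS 2012 scoring.

```python
def term_twv(detections, n_true, n_nontarget, threshold, beta):
    """TWV of one term; `detections` is a list of (score, is_hit) pairs."""
    kept = [is_hit for score, is_hit in detections if score >= threshold]
    hits = sum(kept)
    false_alarms = len(kept) - hits
    p_miss = 1.0 - hits / n_true
    p_fa = false_alarms / n_nontarget
    return 1.0 - (p_miss + beta * p_fa)

def mtwv(terms, n_nontarget, beta):
    """Pooled: one threshold shared by all terms, chosen to maximise mean TWV."""
    candidates = {score for dets, _ in terms for score, _ in dets}
    candidates.add(float("inf"))  # rejecting every detection is also allowed
    return max(
        sum(term_twv(d, n, n_nontarget, t, beta) for d, n in terms) / len(terms)
        for t in candidates
    )

def ubtwv(terms, n_nontarget, beta):
    """Non-pooled: every term is scored at its own best threshold."""
    total = 0.0
    for dets, n_true in terms:
        candidates = [score for score, _ in dets] + [float("inf")]
        total += max(term_twv(dets, n_true, n_nontarget, t, beta) for t in candidates)
    return total / len(terms)

# Toy data: each term is (detections, number of true occurrences).
# The two terms produce scores on very different scales.
terms = [
    ([(0.9, True), (0.8, False)], 1),
    ([(0.3, True), (0.1, False)], 1),
]
pooled = mtwv(terms, n_nontarget=100, beta=10.0)
per_term = ubtwv(terms, n_nontarget=100, beta=10.0)
```

Because the two toy terms score on different scales, no single threshold keeps both hits while rejecting both false alarms, so the pooled value stays below the per-term upper bound. Closing that gap is what score calibration aims at.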
Calibration of scores: linear combination of 12 parameters (6 features,
each in linear and quadratic form). Trained on UBTWV thresholds.
− Query length (w/o outer sil), Length of inner sil,
− Score average global, Score average by phonemes
− Phonemes count, Detections count
We found (after the evals) that Detections count and Length of inner sil
work best for AKWS.
Parameter                      Training error AKWS   Training error DTW
Detections count               0.1272                0.002115
Length of inner sil            0.1577                0.002687
Query length (w/o outer sil)   0.1626                0.002773
Score average global           0.1635                0.002530
Phonemes count                 0.1656                0.002779
Score average by phonemes      0.1660                0.002746
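
One possible reading of the calibration described on this slide can be sketched as follows: each query's six features are stacked with their squares (12 columns), a weight vector is fitted so that the combination predicts the query's best (UBTWV) threshold, and the prediction is subtracted from that query's detection scores. The least-squares fit, the absence of a bias term, the subtraction step, and all function names are assumptions made for illustration; the slides do not specify them.

```python
import numpy as np

def design_matrix(features):
    """Stack each feature with its square: 6 features -> 12 columns."""
    features = np.asarray(features, dtype=float)
    return np.hstack([features, features ** 2])

def fit_calibration(features, best_thresholds):
    """Least-squares weights mapping query features to per-query thresholds."""
    weights, *_ = np.linalg.lstsq(
        design_matrix(features), np.asarray(best_thresholds, dtype=float), rcond=None
    )
    return weights

def calibrate(scores, query_features, weights):
    """Shift one query's detection scores by its predicted threshold."""
    predicted = design_matrix(np.atleast_2d(query_features)) @ weights
    return np.asarray(scores, dtype=float) - predicted[0]

# Synthetic usage: 50 queries with 6 features each (random stand-ins for
# query length, inner silence length, score averages, counts, ...).
rng = np.random.default_rng(0)
train_features = rng.normal(size=(50, 6))
true_weights = rng.normal(size=12)
train_thresholds = design_matrix(train_features) @ true_weights

weights = fit_calibration(train_features, train_thresholds)
shifted = calibrate([1.0, 2.0], train_features[0], weights)
```

After such a shift, a score of 0 corresponds to each query's predicted best threshold, so a single global threshold can be applied across queries.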
11. Conclusion
        devQ-devC                    evalQ-evalC
        ATWV     MTWV     UBTWV      ATWV     MTWV     UBTWV
AKWS    0.488    0.502    0.600      0.522    0.553    0.672
        (0.488)  (0.452)             (0.492)  (0.530)
DTW     0.443    0.468    0.552      0.448    0.488    0.599
• AKWS with new calibration (submitted in brackets)
• Good and consistent data, enough to train a good PhnRec
• GMM/HMM does not perform well in the in-language condition
with 1 example per query (it was our best system last year)
• Number of detections is an important calibration feature (due
to TWV)
• Future work: detections calibration, system fusion
12. Like / Dislike / Next evals?
Like:
− Adapted TWV, real KWS scoring
− Phone alignment provided
− Good data, great work of organizers
"Dislike":
− No test data alignment
− No speaker information
Next evals:
− More examples per query?
− Provide query and the query sentence (adaptation issue)?
− Non-pooled scoring metric?
− We would like to share our features – more at the poster
session
13. Thank you for your attention.