Mediaeval 2013 Spoken Web Search results slides
1. Spoken Web Search at Mediaeval 2013
Xavier Anguera, Florian Metze, Andi Buzo, Igor Szoke and Luis Javier Rodriguez-Fuentes
2. Spoken Audio Search (or Query-by-Example Spoken-Term Detection)
Given a spoken query, we search for instances at the lexical level within spoken documents.
It is similar to Spoken Term Detection (NIST STD2006, OpenKWS 2013) but…
– Queries are spoken
– Different speakers
– Different acoustic conditions
– No prior knowledge of the language(s) might be available
3. SWS history in Mediaeval
• SWS 2011 had 5 finishing participants and
focused on 4 Indian languages
• SWS 2012 had 9 finishing participants and
focused on 4 African Languages
• SWS 2013 has 13 finishing (18 registered)
participants and contains 9 languages
[Figure: number of teams (#teams) and database size per year, 2011 to 2013]
4. SWS 2013 evaluation setup
• 1 single search corpus with ~20 hours of
data, collected from contributions of 9
languages
– No transcription or language information is given
to participants
• 500 queries for dev and 500 queries for eval
– For each query, participants need to return all
instances of that query in the search corpus
5. Mediaeval SWS 2013
• 9 languages in different acoustic contexts: 4 African languages (isixhosa, isizulu, sepedi, setswana), Albanian, Basque, Czech, non-native English, Romanian
               | #utts | time      | Avg. length/utt.
Search corpus  | 10762 | 19:57:55h | 6.67s
Dev Queries    | 505   | 0:11:26h  | 1.35s
Extended dev*  | 1046  | 0:08:42h  | 0.49s
Eval Queries   | 503   | 0:11:37h  | 1.38s
Extended eval* | 1037  | 0:08:57h  | 0.51s
Total          | 13853 | 20:38:37h |
*Only Basque (3x) and Czech (10x) queries have extended versions
6. Database distribution per language
Language | Number of utterances / total duration | Number of queries (dev / eval) | Speech quality (original sampling rate) | Recording environment
African - isixhosa | 395 / 60 min. | 25 / 25 | Telephone speech, 8 kHz | Field recordings, read speech
African - isizulu | 395 / 60 min. | 25 / 25 | Telephone speech, 8 kHz | Field recordings, read speech
African - sepedi | 395 / 60 min. | 25 / 25 | Telephone speech, 8 kHz | Field recordings, read speech
African - setswana | 395 / 60 min. | 25 / 25 | Telephone speech, 8 kHz | Field recordings, read speech
Albanian | 968 / 127 min. | 50 / 50 | PC microphone, 16 kHz | Lab environment, read speech
Basque | 1841 / 192 min. | 100 / 100 (recorded by mobile phone) | TV broadcast news, 16 kHz | Studio, read speech
Czech | 3667 / 252 min. | 94 / 93 | Telephone speech, 8 kHz | Telephone calls into radio broadcasts, spontaneous speech
Non-native English | 434 / 141 min. | 61 / 60 | High quality mic, 44 kHz | Conference lectures, spontaneous speech
Romanian | 2272 / 244 min. | 100 / 100 | PC microphone, 16 kHz | Lab environment, read speech
7. SWS 2013 participants
Team name | Country | Group
Dto. Electricidad y Electrónica, Universidad del País Vasco | Spain | organizers
Speech@FIT, Brno University of Technology | Czech Republic | organizers
Telefonica Research | Spain | organizers
University Politehnica of Bucharest | Romania | organizers
School of Electrical and Computer Engineering, Georgia Institute of Technology | USA |
L2F - INESC-ID | Portugal |
Departament de sistemes informàtics i Computació, Universitat Politècnica de València | Spain |
Audiolab, University of Zilina | Slovakia |
LIA, University of Avignon | France |
Technical University of Kosice | Slovakia |
Universitat Pompeu Fabra | Spain |
DSP-STL, Dept. of EE, The Chinese University of Hong Kong | Hong Kong |
International Institute of Information Technology, Hyderabad | India |
IAIS, Fraunhofer Institute | Germany | Non-finishing
TATA Consultancy Services Ltd. | India | Non-finishing
Indian Statistical Institute | India | Non-finishing
Northwestern Polytechnical University of Xi’an | China | Non-finishing
Toyota Technological Institute at Chicago | USA | Non-finishing
8. Possible approaches to QbE-STD
[Diagram: Pattern based, Lattice based, Word-based; labels: Language spoken, Acoustic models +, Language models +]
9. Followed approaches
Team name (table columns: DTW-like, AKWS)
• Dto. Electricidad y Electrónica, Universidad del País Vasco
• Speech@FIT, Brno University of Technology
• Telefonica Research
• University Politehnica of Bucharest
• School of Electrical and Computer Engineering, Georgia Institute of Technology
• L2F - INESC-ID
• Dept. de sistemes informàtics i Computació, Universitat Politècnica de València
• Audiolab, University of Zilina
• LIA, University of Avignon
• Technical University of Kosice
• Universitat Pompeu Fabra
• DSP-STL, Dept. of EE, The Chinese University of Hong Kong
• International Institute of Information Technology, Hyderabad
10. Scoring metrics
• PRIMARY: Actual Term Weighted Value (ATWV) /
Maximum Term Weighted Value (MTWV)
• Actual/minimum Cnxe
• Real-time factor
• Memory usage
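As a rough illustration of the primary metric, the sketch below computes a Term Weighted Value in the usual form TWV(thr) = 1 - mean over queries of [P_miss + beta * P_FA]. ATWV is the value at the threshold the system itself chose; MTWV is the best value over all thresholds. The names `Detection`, `twv` and `mtwv`, the per-query non-target trial counts, and the value of `beta` are assumptions made for this example; the actual weights and trial definitions come from the evaluation plan and are not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    term: str      # query identifier
    score: float   # system confidence for this detection
    is_hit: bool   # True if it matches a reference occurrence


def twv(detections, n_true, n_nontarget, threshold, beta):
    """Term Weighted Value at one decision threshold.

    n_true[term]      : number of reference occurrences of the term
    n_nontarget[term] : number of non-target trials (assumed given)
    beta              : false-alarm weight (assumed; set by the eval plan)
    """
    total = 0.0
    for term, n_ref in n_true.items():
        kept = [d for d in detections
                if d.term == term and d.score >= threshold]
        n_correct = sum(d.is_hit for d in kept)
        n_spurious = len(kept) - n_correct
        p_miss = 1.0 - n_correct / n_ref
        p_fa = n_spurious / n_nontarget[term]
        total += p_miss + beta * p_fa
    return 1.0 - total / len(n_true)


def mtwv(detections, n_true, n_nontarget, beta):
    """Maximum TWV over all candidate thresholds.

    Includes +inf, i.e. returning nothing at all, which yields TWV = 0.
    """
    thresholds = {d.score for d in detections} | {float("inf")}
    return max(twv(detections, n_true, n_nontarget, t, beta)
               for t in thresholds)


# Illustrative data only: one query with two true occurrences.
example = [
    Detection("q1", 0.9, True),
    Detection("q1", 0.5, False),
    Detection("q1", 0.3, True),
]
atwv = twv(example, {"q1": 2}, {"q1": 100}, threshold=0.6, beta=10.0)
best = mtwv(example, {"q1": 2}, {"q1": 100}, beta=10.0)
print(atwv, best)
```

A large gap between ATWV and MTWV indicates that the system's detections are reasonable but its decision threshold is poorly calibrated.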
20. DET dev
[Figure: DET curves, primary systems (development). x-axis: False Alarm probability (in %); y-axis: Miss probability (in %)]
Legend:
Random Performance
GTTS (MTWV=0.417, Thr=5.204)
L2F (MTWV=0.390, Thr=3.428)
CUHK (MTWV=0.368, Thr=0.530)
BUT (MTWV=0.371, Thr=0.930)
CMTECHETAL (MTWV=0.264, Thr=16.535)
IIITH (MTWV=0.253, Thr=2.130)
ELIRF (MTWV=0.170, Thr=2.697)
TID (MTWV=0.116, Thr=4.085)
GTC (MTWV=0.116, Thr=3.248)
SPEED (MTWV=0.083, Thr=0.960)
LIA-Late (MTWV=0.005, Thr=13.065)
UNIZA-Late (MTWV=0.000, Thr=1.000)
TUKE-Late (MTWV=0.000, Thr=3.000)
21. DET eval
[Figure: DET curves, primary systems (evaluation). x-axis: False Alarm probability (in %); y-axis: Miss probability (in %)]
Legend:
Random Performance
GTTS (MTWV=0.399, Thr=5.243)
L2F (MTWV=0.342, Thr=3.551)
CUHK (MTWV=0.306, Thr=0.618)
BUT (MTWV=0.297, Thr=0.914)
CMTECHETAL (MTWV=0.257, Thr=18.153)
IIITH (MTWV=0.224, Thr=2.721)
ELIRF (MTWV=0.159, Thr=2.759)
TID (MTWV=0.093, Thr=5.051)
GTC (MTWV=0.084, Thr=3.341)
SPEED (MTWV=0.059, Thr=0.923)
LIA-Late (MTWV=0.000, Thr=1079.003)
UNIZA-Late (MTWV=0.001, Thr=1.000)
TUKE-Late (MTWV=0.000, Thr=3.000)
22. Cnxe metric
[Figure: Cnxe for primary systems. Series: Min Cnxe (development), Act Cnxe (development), Min Cnxe (evaluation), Act Cnxe (evaluation). y-axis: Cnxe, 0 to 3. Systems: GTTS, L2F, CUHK, BUT, CMTECHETAL, IIITH, ELIRF, TID, GTC, SpeeD, LIA, UNIZA, TUKE]
23. Extended Queries
• 4 teams submitted 4 extended systems, making use of 3
repetitions of Basque queries and 10 repetitions of Czech
queries available
– TID: computes each query individually and then puts together all
results
– GTTS: DTW-aligns all queries above a minimum duration and searches
with the resulting query
– GeorgiaTech: builds a graphical keyword model using more than one
instance
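The simplest of these strategies, searching with each repetition separately and then pooling the results, can be sketched as follows. This is only an illustration under stated assumptions: the `Hit` record, the `merge_hits` name, the rule that overlapping detections in the same utterance are fused, and the choice of keeping the maximum score are all hypothetical and are not taken from any team's system description.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    utt: str      # utterance identifier in the search corpus
    start: float  # detection start time in seconds
    end: float    # detection end time in seconds
    score: float  # detection confidence


def merge_hits(per_repetition_hits):
    """Pool detections from several repetitions of the same query.

    Overlapping detections within one utterance are fused into a single
    detection spanning both, keeping the higher score (assumed rule).
    """
    pooled = sorted(
        (h for hits in per_repetition_hits for h in hits),
        key=lambda h: (h.utt, h.start),
    )
    merged = []
    for h in pooled:
        last = merged[-1] if merged else None
        if last is not None and last.utt == h.utt and h.start < last.end:
            merged[-1] = Hit(last.utt, last.start,
                             max(last.end, h.end),
                             max(last.score, h.score))
        else:
            merged.append(h)
    return merged


# Illustrative data only: two repetitions of one query.
rep1 = [Hit("utt_a", 1.0, 2.0, 0.7)]
rep2 = [Hit("utt_a", 1.5, 2.2, 0.9), Hit("utt_b", 0.0, 0.5, 0.4)]
for hit in merge_hits([rep1, rep2]):
    print(hit)
```

Other fusion rules (averaging scores, or rewarding detections found by several repetitions) would fit the same structure by changing only the line that builds the fused `Hit`.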
30. Take home messages
• The task was more complicated than in 2012
– GTTS got MTWV-13 = 0.39, MTWV-12 = 0.51 (on 2013 data)
– CUHK MTWV-12 = 0.74 (on 2012 data)
• It is possible to do QbE-STD on unknown/low-resource data
31. New things to watch out for in the posters session
• BUT:
– Fusion of 26 systems (13 AKWS + 13 DTW)
– M-norm normalization
• IIIT:
– Articulatory Bottleneck features
• CUHK:
– Tokenizer construction using Gaussian Component clustering
– Query expansion using PSOLA
• L2F:
– DTW candidate pre-selection
• GTTS:
– Distance matrix normalization in DTW
• GeorgiaTech:
– Low-resource speech modeling using EHMM Models
• LIA:
– Use of I-vectors in SWS
• ARF:
– DTW string matching algorithm with a novel scoring
33. System presentations
• 16:30-16:45 "GTTS Systems for the SWS Task at
MediaEval 2013", Luis Javier Rodriguez-Fuentes, DEE,
Universidad del País Vasco
• 16:45-17:00 "The L2F Spoken Web Search system for
Mediaeval 2013", Alberto Abad, L2F, INESC-ID
• 17:00-17:15 "BUT SWS 2013 - MASSIVE PARALLEL
APPROACH", Lucas Ondel, Speech@BUT, Brno
University of Technology
• 17:15-17:30 "The CMTECH Spoken Web Search System
for MediaEval 2013", Ciro Gracia, UPF
• 17:30-17:45 Discussion and SWS 2014 teaser, Xavier
Anguera
Editor's notes
AKWS means they use some sort of Viterbi algorithm. DTW-like means they use DTW algorithms to match different sorts of features.
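To make the "DTW-like" category concrete, here is a minimal sketch of subsequence dynamic time warping, in which a short query may start and end anywhere inside a longer document. It is a textbook variant, not any participant's system: the Euclidean frame distance, the three-way step pattern, the lack of path-length normalization, and the function name `subsequence_dtw` are all assumptions made for this example.

```python
import math


def subsequence_dtw(query, document):
    """Find the document span that best matches the query.

    query, document: lists of feature vectors (lists of floats).
    Returns (accumulated_cost, start_frame, end_frame) in the document.
    """
    n, m = len(query), len(document)

    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    inf = float("inf")
    # cost[i][j]: best accumulated cost of aligning query[0..i]
    # with a document span that ends at frame j.
    cost = [[inf] * m for _ in range(n)]
    # start[i][j]: document frame where that best path began.
    start = [[0] * m for _ in range(n)]

    for j in range(m):
        # The first query frame may align to any document frame for free.
        cost[0][j] = dist(query[0], document[j])
        start[0][j] = j

    for i in range(1, n):
        for j in range(m):
            candidates = [(cost[i - 1][j], start[i - 1][j])]
            if j > 0:
                candidates.append((cost[i][j - 1], start[i][j - 1]))
                candidates.append((cost[i - 1][j - 1], start[i - 1][j - 1]))
            best_cost, best_start = min(candidates)
            cost[i][j] = best_cost + dist(query[i], document[j])
            start[i][j] = best_start

    # The match may end at any document frame.
    end = min(range(m), key=lambda j: cost[n - 1][j])
    return cost[n - 1][end], start[n - 1][end], end


# Illustrative one-dimensional "features": the query occurs at frames 2 to 4.
query = [[1.0], [2.0], [3.0]]
document = [[0.0], [0.0], [1.0], [2.0], [3.0], [0.0], [0.0]]
print(subsequence_dtw(query, document))
```

In practice the frames would be acoustic features or phone posteriors rather than single numbers, and the accumulated cost would usually be normalized by path length before being turned into a detection score.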
UPF has very good regularization for finding the optimal score across all queries. TID and IIIT have a poor match between ATWV and MTWV. Only the positive scores were plotted.