1. Multimedia analysis for the poor
(in training resources)
Xavier Anguera
Telefonica Research
Dagstuhl Seminar 13451 - Inspirational talk
2. Does this affect me?
• You work in areas where there is not much
training data available
– Maybe it exists in domains other than your test data.
• The task you are pursuing does not have a well
annotated corpus for training
– E.g. finding structure in signals
• It is difficult / you do not know how to define
training “units” in your task
• You like working on complicated problems
3. Typical Speech paper diagram
Labeled training data → My favorite ML technique → “I am a model”
Testing data + “I am a model” → My favorite decoding technique → My result
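The pipeline above can be sketched in a few lines. This is a hypothetical stand-in, not any system from the talk: a nearest-centroid classifier plays the role of “my favorite ML technique”, and nearest-centroid assignment plays the role of “my favorite decoding technique”.

```python
import numpy as np

def train(features, labels):
    """'My favorite ML technique': fit one centroid per class
    from labeled training data. Returns the 'I am a model' dict."""
    classes = sorted(set(labels))
    return {c: features[np.array(labels) == c].mean(axis=0) for c in classes}

def decode(model, test_features):
    """'My favorite decoding technique': assign each test vector
    to the nearest class centroid."""
    classes = list(model)
    centroids = np.stack([model[c] for c in classes])
    # Distance from every test vector to every centroid.
    d = np.linalg.norm(test_features[:, None, :] - centroids[None], axis=2)
    return [classes[i] for i in d.argmin(axis=1)]
```

The point of the slide is precisely that every box in this sketch assumes labeled training data exists; the rest of the talk asks what to do when it does not.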
11. Resource-free technologies
• Summarization
– Acoustic word cloud of most repeated acoustic items
– Repetition-based summarization (MODIS software @
INRIA-Rennes)
• Structure analysis in music
• Audio-visual unsupervised learning (e.g. the
Google cats)
• Acquisition of unknown sounds (e.g. Tuomo’s
talk)
• Exemplar-based ASR (Leuven Univ.)
12. EXAMPLE: Spoken Audio Search (or Query-by-Example Spoken Term Detection)
Given a single spoken query, we search for matches at the
lexical level within spoken documents
It is similar to Spoken Term Detection (NIST
STD2006, OpenKWS 2013) but…
Queries are spoken
Different speakers
Different acoustic conditions
No prior knowledge of the
language(s) may be available
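With no language knowledge, query-by-example search is typically done by matching frame-level acoustic features directly. A minimal sketch using subsequence DTW follows; the feature extraction, Euclidean frame distance, and step pattern are illustrative assumptions, not a description of any of the systems evaluated here.

```python
import numpy as np

def subsequence_dtw(query, doc):
    """Find the best-matching region of `doc` for `query`.

    query: (n, d) feature matrix (e.g. MFCC frames of the spoken query)
    doc:   (m, d) feature matrix of the spoken document
    Returns (cost, end): length-normalized match cost and the document
    frame index where the best match ends.
    """
    n, m = len(query), len(doc)
    # Frame-pair distances (Euclidean here; cosine is also common).
    dist = np.linalg.norm(query[:, None, :] - doc[None, :, :], axis=2)
    # Accumulated cost. The first row is the raw distance so the match
    # may start at any document frame -- the "subsequence" part.
    acc = np.full((n, m), np.inf)
    acc[0] = dist[0]
    for i in range(1, n):
        for j in range(m):
            best_prev = acc[i - 1, j]
            if j > 0:
                best_prev = min(best_prev, acc[i - 1, j - 1], acc[i, j - 1])
            acc[i, j] = dist[i, j] + best_prev
    end = int(np.argmin(acc[-1]))
    return acc[-1, end] / n, end
```

Ranking documents by this cost (and thresholding) yields a detection list of the kind scored in the evaluation on the next slides.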
13. Mediaeval SWS 2013
• 9 languages in different acoustic contexts: 4 African
languages (isiXhosa, isiZulu, Sepedi, Setswana),
Albanian, Basque, Czech, non-native English,
Romanian
                 #utts    time        Avg. length/utt.
Search corpus    10762    19:57:55    6.67s
Dev Queries        505    0:11:26h    1.35s
Extended dev*     1046    0:08:42h    0.49s
Eval Queries       503    0:11:37h    1.38s
Extended eval*    1037    0:08:57h    0.51s
Total            13853    20:38:37h
*Only Basque (3x) and Czech (10x) queries have extended versions
14. Mediaeval SWS 2013
[DET plot: Miss probability (%) vs. False Alarm probability (%), primary systems (evaluation); diagonal line marks random performance]
GTTS (MTWV=0.399, Thr=5.243)
L2F (MTWV=0.342, Thr=3.551)
CUHK (MTWV=0.306, Thr=0.618)
BUT (MTWV=0.297, Thr=0.914)
CMTECHETAL (MTWV=0.257, Thr=18.153)
IIITH (MTWV=0.224, Thr=2.721)
ELIRF (MTWV=0.159, Thr=2.759)
TID (MTWV=0.093, Thr=5.051)
GTC (MTWV=0.084, Thr=3.341)
SPEED (MTWV=0.059, Thr=0.923)
LIA-Late (MTWV=0.000, Thr=1079.003)
UNIZA-Late (MTWV=0.001, Thr=1.000)
TUKE-Late (MTWV=0.000, Thr=3.000)
17. How do children learn?
(from someone who is not a parent…)
1. They hear their environment and identify/isolate
particular audio-visual stimuli they do not know.
2. An expert (parent/grandparent) tells them the
“meaning” of those stimuli.
– If a stimulus appears in different forms (or the child is
not sharp), it may need to be repeated a few times…
3. The child learns and can identify these stimuli
from then on.
20. • How to incorporate acoustic modeling into
dynamic programming techniques?
• How to describe the acoustic space (or
whatever space) in an unsupervised (but
robust) manner?
• How do we discriminate between
“interesting/relevant” and “filler” events?
• Does it all make any sense? (or could we
assume we will always have enough training
data?)
Editor's notes
Speech recognition is difficult because of the many acoustic environments we need to account for when training models.
In image recognition we also have problems defining the variability of some concepts.