1. Multimedia analysis for the poor
(in training resources)
Xavier Anguera
Telefonica Research
Dagstuhl Seminar 13451 - Inspirational talk
2. Does this affect me?
• You work in areas where there is not much
training data available
– Maybe it exists in domains other than your test data.
• The task you are pursuing does not have a well
annotated corpus for training
– E.g. finding structure in signals
• It is difficult / you do not know how to define
training “units” in your task
• You like working on complicated problems
3. Typical Speech paper diagram
Labeled training data → My favorite ML technique → “I am a model”
Testing data + “I am a model” → My favorite decoding technique → My result
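The pipeline above can be sketched in a few lines. This is a hypothetical stand-in, not any system from the talk: a nearest-centroid classifier plays the role of “my favorite ML technique”, and nearest-centroid assignment plays the role of “my favorite decoding technique”.

```python
import numpy as np

def train(features, labels):
    """'My favorite ML technique': fit one centroid per class
    from labeled training data. Returns the 'I am a model' dict."""
    classes = sorted(set(labels))
    return {c: features[np.array(labels) == c].mean(axis=0) for c in classes}

def decode(model, test_features):
    """'My favorite decoding technique': assign each test vector
    to the nearest class centroid."""
    classes = list(model)
    centroids = np.stack([model[c] for c in classes])
    # Distance from every test vector to every centroid.
    d = np.linalg.norm(test_features[:, None, :] - centroids[None], axis=2)
    return [classes[i] for i in d.argmin(axis=1)]
```

The point of the slide is precisely that every box in this sketch assumes labeled training data exists; the rest of the talk asks what to do when it does not.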
11. Resource-free technologies
• Summarization
– Acoustic word cloud of most repeated acoustic items
– Repetition-based summarization (MODIS software @
INRIA-Rennes)
• Structure analysis in music
• Audio-visual unsupervised learning (e.g. the
Google cats)
• Acquisition of unknown sounds (e.g. Tuomo’s
talk)
• Exemplar-based ASR (Leuven Univ.)
12. EXAMPLE: Spoken Audio Search (or Query-by-Example Spoken Term Detection)
Given a single spoken query, we search for matches at the
lexical level within spoken documents
It is similar to Spoken Term Detection (NIST
STD2006, OpenKWS 2013) but…
Queries are spoken
Different speakers
Different acoustic conditions
No prior knowledge of the
language(s) may be available
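With no language knowledge, query-by-example search is typically done by matching frame-level acoustic features directly. A minimal sketch using subsequence DTW follows; the feature extraction, Euclidean frame distance, and step pattern are illustrative assumptions, not a description of any of the systems evaluated here.

```python
import numpy as np

def subsequence_dtw(query, doc):
    """Find the best-matching region of `doc` for `query`.

    query: (n, d) feature matrix (e.g. MFCC frames of the spoken query)
    doc:   (m, d) feature matrix of the spoken document
    Returns (cost, end): length-normalized match cost and the document
    frame index where the best match ends.
    """
    n, m = len(query), len(doc)
    # Frame-pair distances (Euclidean here; cosine is also common).
    dist = np.linalg.norm(query[:, None, :] - doc[None, :, :], axis=2)
    # Accumulated cost. The first row is the raw distance so the match
    # may start at any document frame -- the "subsequence" part.
    acc = np.full((n, m), np.inf)
    acc[0] = dist[0]
    for i in range(1, n):
        for j in range(m):
            best_prev = acc[i - 1, j]
            if j > 0:
                best_prev = min(best_prev, acc[i - 1, j - 1], acc[i, j - 1])
            acc[i, j] = dist[i, j] + best_prev
    end = int(np.argmin(acc[-1]))
    return acc[-1, end] / n, end
```

Ranking documents by this cost (and thresholding) yields a detection list of the kind scored in the evaluation on the next slides.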
13. Mediaeval SWS 2013
• 9 languages in different acoustic contexts: 4 African
languages (isiXhosa, isiZulu, Sepedi, Setswana),
Albanian, Basque, Czech, non-native English,
Romanian
                 #utts    time        Avg. length/utt.
Search corpus    10762    19:57:55    6.67s
Dev Queries        505    0:11:26h    1.35s
Extended dev*     1046    0:08:42h    0.49s
Eval Queries       503    0:11:37h    1.38s
Extended eval*    1037    0:08:57h    0.51s
Total            13853    20:38:37h
*Only Basque (3x) and Czech (10x) queries have extended versions
14. Mediaeval SWS 2013
[DET plot: Miss probability (%) vs. False Alarm probability (%), primary systems (evaluation); diagonal line marks random performance]
GTTS (MTWV=0.399, Thr=5.243)
L2F (MTWV=0.342, Thr=3.551)
CUHK (MTWV=0.306, Thr=0.618)
BUT (MTWV=0.297, Thr=0.914)
CMTECHETAL (MTWV=0.257, Thr=18.153)
IIITH (MTWV=0.224, Thr=2.721)
ELIRF (MTWV=0.159, Thr=2.759)
TID (MTWV=0.093, Thr=5.051)
GTC (MTWV=0.084, Thr=3.341)
SPEED (MTWV=0.059, Thr=0.923)
LIA-Late (MTWV=0.000, Thr=1079.003)
UNIZA-Late (MTWV=0.001, Thr=1.000)
TUKE-Late (MTWV=0.000, Thr=3.000)
17. How do children learn?
(from someone who is not a parent…)
1. They hear their environment and identify/isolate
particular audio-visual stimuli they do not know.
2. An expert (parent/grandparent) tells them the
“meaning” of those stimuli.
– If a stimulus appears in different forms (or the child is
not sharp), it may need to be repeated a few times…
3. The child learns and can identify these stimuli
from then on.
20. • How to incorporate acoustic modeling into
dynamic programming techniques?
• How to describe the acoustic space (or
whatever space) in an unsupervised (but
robust) manner?
• How do we discriminate between
“interesting/relevant” and “filler” events?
• Does it all make any sense? (or could we
assume we will always have enough training
data?)
Editor's notes
Speech recognition is difficult because of the many acoustic environments we need to account for when training models.
In image recognition we also have problems defining the variability of some concepts.