Creating a Data Collection for Evaluating Rich Speech Retrieval

                 Maria Eskevich (1), Gareth J.F. Jones (1),
                 Martha Larson (2), Roeland Ordelman (3)

(1) Centre for Digital Video Processing, Centre for Next Generation Localisation,
    School of Computing, Dublin City University, Dublin, Ireland
(2) Delft University of Technology, Delft, The Netherlands
(3) University of Twente, The Netherlands
Outline

          MediaEval benchmark
          MediaEval 2011 Rich Speech Retrieval Task
          What is crowdsourcing?
          Crowdsourcing in the Development of Speech and Language Resources
          Development of an effective crowdsourcing task
          Comments on results
          Conclusions
          Future Work: Brave New Task at MediaEval 2012
MediaEval
Multimedia Evaluation benchmarking initiative

         Evaluates new algorithms for multimedia access and
         retrieval.
         Emphasizes the "multi" in multimedia: speech, audio,
         visual content, tags, users, context.
         Innovates new tasks and techniques focusing on the
         human and social aspects of multimedia content.
MediaEval 2011
Rich Speech Retrieval (RSR) Task

       Task Goal:
         Information to be found: a combination of the required
         audio and visual content and the speaker's intention

         Transcript 1    =    Transcript 2
         Meaning 1       =    Meaning 2
               Conventional retrieval

         Transcript 1    =    Transcript 2
         Meaning 1       =    Meaning 2
         Speech act 1    =    Speech act 2
               Extended speech retrieval
               (a small sketch contrasting the two follows below)
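To make the contrast above concrete, here is a minimal illustrative sketch (not part of the original task definition): in conventional retrieval a segment is ranked only by how well its transcript matches the query, while in rich (extended) speech retrieval a required speech act must also be realized in the segment. The Segment fields and the toy scoring function below are assumptions for illustration only.

    from dataclasses import dataclass

    @dataclass
    class Segment:
        video_id: str
        start: float        # segment start, in seconds
        end: float          # segment end, in seconds
        transcript: str     # ASR or manual transcript of the segment
        speech_act: str     # e.g. "apology", "definition", "opinion", "promise", "warning"

    def text_score(query: str, transcript: str) -> float:
        # Toy lexical-overlap score; a real system would use a proper IR model.
        q, t = set(query.lower().split()), set(transcript.lower().split())
        return len(q & t) / (len(q) or 1)

    def conventional_retrieval(query: str, segments: list[Segment], k: int = 10) -> list[Segment]:
        # Rank purely on how well the transcript matches the query.
        return sorted(segments, key=lambda s: text_score(query, s.transcript), reverse=True)[:k]

    def rich_speech_retrieval(query: str, required_act: str,
                              segments: list[Segment], k: int = 10) -> list[Segment]:
        # Same ranking, but only segments realizing the required speech act are eligible.
        eligible = [s for s in segments if s.speech_act == required_act]
        return sorted(eligible, key=lambda s: text_score(query, s.transcript), reverse=True)[:k]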
       ME10WWW dataset:
         Videos from the Internet video sharing platform blip.tv
         (1974 episodes, 350 hours)
         Automatic Speech Recognition (ASR) transcripts provided
         by LIMSI and Vocapia Research
         No queries or relevant items

      -> Collect for the retrieval experiment:
         user-generated queries
         user-generated relevant items
      -> Collect via crowdsourcing
What is crowdsourcing?

        Crowdsourcing is a form of human computation.
        Human computation is a method of having people do
      things that we might consider assigning to a computing
      device, e.g. a language translation task.
        A crowdsourcing system facilitates a crowdsourcing
      process.

        Factors to take into account:
          Sufficient number of workers
          Level of payment
          Clear instructions
          Possible cheating
Crowdsourcing in the Development of Speech and Language Resources

       Suitability of crowdsourcing for simple/straightforward
     natural language processing tasks:
         Work by non-expert crowdsource workers is of a similar
         standard to that performed by expert workers:
              translation / translation assessment
              transcription of native language
              word sense disambiguation
              temporal annotation
                                                                    [Snow et al., 2008]
       Research question at the collection creation stage:
         Can untrained crowdsource workers undertake
         extended tasks which require them to be creative?
Crowdsourcing with Amazon Mechanical Turk

     A task is referred to as a 'Human Intelligence Task' or HIT.
     Crowdsourcing procedure:
         HIT initiation: the requester uploads a HIT.
         Work: workers carry out the HIT.
         Review: the requester reviews the completed work and
         confirms payment to the worker at the previously set rate.
         * The requester also has the option of paying extra (a "bonus").
     (a sketch of this requester-side cycle follows below)
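The requester-side cycle above can be sketched in code. The 2011 RSR collection predates today's AWS SDKs, so the snippet below is purely illustrative: it uses the current boto3 MTurk client, and the external URL, reward values and feedback strings are placeholders, not the settings actually used for the task.

    import boto3

    # Illustrative requester-side MTurk cycle: create a HIT, review submissions,
    # approve them, and optionally pay a bonus.
    mturk = boto3.client("mturk", region_name="us-east-1")

    # The HIT form itself (video player plus questions) is hosted externally.
    question_xml = """
    <ExternalQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2006-07-14/ExternalQuestion.xsd">
      <ExternalURL>https://example.org/rsr-hit</ExternalURL>
      <FrameHeight>600</FrameHeight>
    </ExternalQuestion>
    """

    hit = mturk.create_hit(
        Title="Find and describe an interesting video segment",
        Description="Watch a short video, mark a segment, label its speech act and write queries.",
        Keywords="video, annotation, speech",
        Reward="0.19",                      # base payment in USD (placeholder)
        MaxAssignments=1,
        LifetimeInSeconds=7 * 24 * 3600,
        AssignmentDurationInSeconds=3600,
        Question=question_xml,
    )

    # Review: inspect submitted work, confirm the set payment, add a bonus if deserved.
    for a in mturk.list_assignments_for_hit(
            HITId=hit["HIT"]["HITId"], AssignmentStatuses=["Submitted"])["Assignments"]:
        mturk.approve_assignment(AssignmentId=a["AssignmentId"],
                                 RequesterFeedback="Thank you!")
        mturk.send_bonus(WorkerId=a["WorkerId"], BonusAmount="0.10",
                         AssignmentId=a["AssignmentId"],
                         Reason="Extra effort on the transcript and queries.")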
Information expected from the worker
to create a test collection for the RSR Task

        Speech act type:
          'expressives': apology, opinion
          'assertives': definition
          'directives': warning
          'commissives': promise
        Time of the labelled speech act: beginning and end
        Accurate transcript of the labelled speech act
        Queries to re-find this speech act:
          a full-sentence query
          a short web-style query
        (an illustrative record layout follows below)
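For concreteness, the information collected per accepted HIT could be stored as one record like the following. The field names and example values are hypothetical; they are not the actual schema or data of the RSR collection.

    from dataclasses import dataclass

    # Illustrative record for one accepted HIT; all field names are hypothetical.
    @dataclass
    class RsrJudgement:
        video_id: str          # blip.tv episode identifier
        speech_act: str        # one of: apology, opinion, definition, warning, promise
        start_time: float      # beginning of the labelled speech act, in seconds
        end_time: float        # end of the labelled speech act, in seconds
        transcript: str        # worker's accurate transcript of the speech act
        sentence_query: str    # full-sentence query to re-find the segment
        web_query: str         # short web-style query to re-find the segment
        worker_id: str         # useful for checking worker overlap between dev and test sets

    # Entirely made-up example values, for illustration only.
    example = RsrJudgement(
        video_id="episode_0042",
        speech_act="promise",
        start_time=312.0,
        end_time=328.5,
        transcript="I promise we will release the new version next month.",
        sentence_query="Where does the speaker promise to release the new version next month?",
        web_query="promise release new version next month",
        worker_id="W_EXAMPLE",
    )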
Data management for Amazon MTurk

     ME10WWW videos vary in length:

     -> For longer videos, starting points approximately
     7 minutes apart are calculated:

                 Data set           Episodes              Starting points
                  Dev                 247                       562
                  Test               1727                      3278

     (a sketch of the computation follows below)
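A minimal sketch of how such starting points could be generated, assuming each video's duration in seconds is known; only the roughly 7-minute spacing comes from the slide, the rest is illustrative.

    def starting_points(duration_s: float, spacing_s: float = 7 * 60) -> list[float]:
        """Return playback starting points spaced approximately 7 minutes apart."""
        points = [0.0]
        t = spacing_s
        while t < duration_s:
            points.append(float(t))
            t += spacing_s
        return points

    # A 25-minute episode yields four starting points: 0, 7, 14 and 21 minutes.
    print(starting_points(25 * 60))   # [0.0, 420.0, 840.0, 1260.0]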
Crowdsourcing experiment

      Pilot HIT (requester uploads the HIT):
         Pilot wording
         0.11 $ + bonus per speech act type

      Worker expectations:
         Reward vs work
         Per-hour rate

      Workers' feedback on the pilot:
         Reward is not worth the work
         Task is too complicated

      Requester updates the HIT:
         Rewording
         Examples
         0.19 $ + bonus (0-21$); workers suggest the bonus size
         (we mention that we are a non-profit organization)

      Outcome:
         Reward is worth the work
         Task is comprehensible
         Workers are not greedy!
HIT example

        Pilot:
     "Please watch the video and find a short portion of the
     video (a segment) that contains an interesting quote. The
     quote must fall into one of these six categories."

        Revised:
     "Imagine that you are watching videos on YouTube.
     When you come across something interesting you might
     want to share it on Facebook, Twitter or your favorite
     social network. Now please watch this video and search
     for an interesting video segment that you would like to
     share with others because it is (an apology, a definition,
     an opinion, a promise, a warning)."
Results:
Number of collected queries per speech act

       Prices:
         Dev set: 40 $ per 30 queries
         Test set: 80 $ per 50 queries
Results assessment

       Number of accepted HITs = number of collected queries

       No overlap of workers between the dev and test sets
       Creative work - creative cheating:
         Copy and paste the provided examples
        -> Examples should be pictures, not text
         Choose the option that no speech act was found in the video
        -> Manual assessment by the requester is needed
         (a sketch of simple automatic pre-filters follows below)
       Workers rarely find noteworthy content later than the
     third minute after the starting playback point in the video
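The two cheating patterns above suggest simple automatic pre-filters that could be run before the requester's manual assessment. The sketch below is illustrative only; the example texts and the filtering logic are assumptions, not the checks actually applied to the RSR submissions.

    # Hypothetical pre-filter flagging suspicious submissions before manual review.
    PROVIDED_EXAMPLES = {
        "i am sorry that the sound quality is so poor in this recording",
        "a podcast is an audio programme that you can subscribe to on the internet",
    }

    def normalise(text: str) -> str:
        return " ".join(text.lower().split())

    def flag_submission(transcript: str, no_act_found: bool) -> list[str]:
        """Return a list of reasons why a submission needs close manual checking."""
        flags = []
        if no_act_found:
            flags.append("worker chose the 'no speech act found' option")
        if normalise(transcript) in PROVIDED_EXAMPLES:
            flags.append("transcript copied verbatim from a provided example")
        return flags

    # Example: a copied transcript is flagged, a genuine one is not.
    print(flag_submission("A podcast is an audio programme that you can subscribe to on the Internet", False))
    print(flag_submission("I promise to upload the slides tomorrow.", False))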
Conclusions

        It is possible to crowdsource extensive and complex
        tasks to support speech and language resources
        Use concepts and vocabulary familiar to the workers
        Pay attention to technical issues of watching the video
        Preprocess videos into smaller segments
        Creative work demands a higher reward level, or simply
        a more flexible payment system
        High level of wastage due to task complexity
MediaEval 2012 Brave New Task:
Search and Hyperlinking

        Use scenario: a user is searching for a known segment
     in a video collection. Because the information in the
     segment might not be sufficient for their information need,
     they also want links to other related video segments, which
     may help to satisfy the information need related to this
     video.

       Sub-tasks:
         Search: finding suitable video segments based on a short
         natural language query
         Linking: defining links to other relevant video segments in
         the collection
MediaEval 2012

           Thank you for your attention!

     Welcome to MediaEval 2012! http://multimediaeval.org

Contenu connexe

Similaire à Creating a Data Collection for Evaluating Rich Speech Retrieval (LREC 2012)

NLP in Practice - Part I
NLP in Practice - Part INLP in Practice - Part I
NLP in Practice - Part IDelip Rao
 
Cross language information retrieval (clir)slide
Cross language information retrieval (clir)slideCross language information retrieval (clir)slide
Cross language information retrieval (clir)slideMohd Iqbal Al-farabi
 
Natural Language Processing for Irish
Natural Language Processing for IrishNatural Language Processing for Irish
Natural Language Processing for IrishTeresa Lynn
 
Mastering the Art of Conversation
Mastering the Art of ConversationMastering the Art of Conversation
Mastering the Art of ConversationOliver Cox
 
What are the basics of Analysing a corpus? chpt.10 Routledge
What are the basics of Analysing a corpus? chpt.10 RoutledgeWhat are the basics of Analysing a corpus? chpt.10 Routledge
What are the basics of Analysing a corpus? chpt.10 RoutledgeRajpootBhatti5
 
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...Simplilearn
 
A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...
A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...
A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...kevig
 
Gareth Jones 1209
Gareth Jones 1209Gareth Jones 1209
Gareth Jones 1209Yandex
 
Umd draft-2010 jun22
Umd draft-2010 jun22Umd draft-2010 jun22
Umd draft-2010 jun22Ed Bice
 
"Updates on Semantic Fingerprinting", Francisco Webber, Inventor and Co-Found...
"Updates on Semantic Fingerprinting", Francisco Webber, Inventor and Co-Found..."Updates on Semantic Fingerprinting", Francisco Webber, Inventor and Co-Found...
"Updates on Semantic Fingerprinting", Francisco Webber, Inventor and Co-Found...Dataconomy Media
 
Automatic and rapid generation of massive knowledge repositories from data
Automatic and rapid generation of massive knowledge repositories from dataAutomatic and rapid generation of massive knowledge repositories from data
Automatic and rapid generation of massive knowledge repositories from dataSIKM
 
Towards Universal Semantic Understanding of Natural Languages
Towards Universal Semantic Understanding of Natural LanguagesTowards Universal Semantic Understanding of Natural Languages
Towards Universal Semantic Understanding of Natural LanguagesYunyao Li
 
Forum Tal 2014: Celi company presentation
Forum Tal 2014: Celi company presentationForum Tal 2014: Celi company presentation
Forum Tal 2014: Celi company presentationCELI
 
Conversational AI from CAT I to III
Conversational AI  from CAT I to IIIConversational AI  from CAT I to III
Conversational AI from CAT I to IIIHuangmao(Homer) Quan
 
AWS Artificial Intelligence Day - Toronto
AWS Artificial Intelligence Day - TorontoAWS Artificial Intelligence Day - Toronto
AWS Artificial Intelligence Day - TorontoAmazon Web Services
 

Similaire à Creating a Data Collection for Evaluating Rich Speech Retrieval (LREC 2012) (20)

NLP in Practice - Part I
NLP in Practice - Part INLP in Practice - Part I
NLP in Practice - Part I
 
Artificial Intelligence on AWS
Artificial Intelligence on AWS Artificial Intelligence on AWS
Artificial Intelligence on AWS
 
NISO Forum, Denver, September 24, 2012: ResourceSync: Web-Based Resource Sync...
NISO Forum, Denver, September 24, 2012: ResourceSync: Web-Based Resource Sync...NISO Forum, Denver, September 24, 2012: ResourceSync: Web-Based Resource Sync...
NISO Forum, Denver, September 24, 2012: ResourceSync: Web-Based Resource Sync...
 
Cross language information retrieval (clir)slide
Cross language information retrieval (clir)slideCross language information retrieval (clir)slide
Cross language information retrieval (clir)slide
 
Natural Language Processing for Irish
Natural Language Processing for IrishNatural Language Processing for Irish
Natural Language Processing for Irish
 
Mastering the Art of Conversation
Mastering the Art of ConversationMastering the Art of Conversation
Mastering the Art of Conversation
 
What are the basics of Analysing a corpus? chpt.10 Routledge
What are the basics of Analysing a corpus? chpt.10 RoutledgeWhat are the basics of Analysing a corpus? chpt.10 Routledge
What are the basics of Analysing a corpus? chpt.10 Routledge
 
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
 
A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...
A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...
A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...
 
Gareth Jones 1209
Gareth Jones 1209Gareth Jones 1209
Gareth Jones 1209
 
2015 CIC: #EdTech Forum - LRMI
2015 CIC: #EdTech Forum - LRMI2015 CIC: #EdTech Forum - LRMI
2015 CIC: #EdTech Forum - LRMI
 
Umd draft-2010 jun22
Umd draft-2010 jun22Umd draft-2010 jun22
Umd draft-2010 jun22
 
"Updates on Semantic Fingerprinting", Francisco Webber, Inventor and Co-Found...
"Updates on Semantic Fingerprinting", Francisco Webber, Inventor and Co-Found..."Updates on Semantic Fingerprinting", Francisco Webber, Inventor and Co-Found...
"Updates on Semantic Fingerprinting", Francisco Webber, Inventor and Co-Found...
 
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
 
Icwl2015 wahl
Icwl2015 wahlIcwl2015 wahl
Icwl2015 wahl
 
Automatic and rapid generation of massive knowledge repositories from data
Automatic and rapid generation of massive knowledge repositories from dataAutomatic and rapid generation of massive knowledge repositories from data
Automatic and rapid generation of massive knowledge repositories from data
 
Towards Universal Semantic Understanding of Natural Languages
Towards Universal Semantic Understanding of Natural LanguagesTowards Universal Semantic Understanding of Natural Languages
Towards Universal Semantic Understanding of Natural Languages
 
Forum Tal 2014: Celi company presentation
Forum Tal 2014: Celi company presentationForum Tal 2014: Celi company presentation
Forum Tal 2014: Celi company presentation
 
Conversational AI from CAT I to III
Conversational AI  from CAT I to IIIConversational AI  from CAT I to III
Conversational AI from CAT I to III
 
AWS Artificial Intelligence Day - Toronto
AWS Artificial Intelligence Day - TorontoAWS Artificial Intelligence Day - Toronto
AWS Artificial Intelligence Day - Toronto
 

Plus de Maria Eskevich

Video Hyperlinking (LNK) Task at TRECVid 2016
Video Hyperlinking (LNK) Task at TRECVid 2016Video Hyperlinking (LNK) Task at TRECVid 2016
Video Hyperlinking (LNK) Task at TRECVid 2016Maria Eskevich
 
Search and Hyperlinking Overview @MediaEval2014
Search and Hyperlinking Overview @MediaEval2014Search and Hyperlinking Overview @MediaEval2014
Search and Hyperlinking Overview @MediaEval2014Maria Eskevich
 
Focus on spoken content in multimedia retrieval
Focus on spoken content in multimedia retrievalFocus on spoken content in multimedia retrieval
Focus on spoken content in multimedia retrievalMaria Eskevich
 
Audio/Video Search: Why? What? How?
Audio/Video Search: Why? What? How?Audio/Video Search: Why? What? How?
Audio/Video Search: Why? What? How?Maria Eskevich
 
DCU at the NTCIR-9 SpokenDoc Passage Retrieval Task
DCU at the NTCIR-9 SpokenDoc Passage Retrieval TaskDCU at the NTCIR-9 SpokenDoc Passage Retrieval Task
DCU at the NTCIR-9 SpokenDoc Passage Retrieval TaskMaria Eskevich
 
Towards Methods for Efficient Access to Spoken Content in the AMI Corpus (SSC...
Towards Methods for Efficient Access to Spoken Content in the AMI Corpus (SSC...Towards Methods for Efficient Access to Spoken Content in the AMI Corpus (SSC...
Towards Methods for Efficient Access to Spoken Content in the AMI Corpus (SSC...Maria Eskevich
 
Search and Hyperlinking Task at MediaEval 2012
Search and Hyperlinking Task at MediaEval 2012Search and Hyperlinking Task at MediaEval 2012
Search and Hyperlinking Task at MediaEval 2012Maria Eskevich
 
New Metrics for Meaningful Evaluation of Informally Structured Speech Retriev...
New Metrics for Meaningful Evaluation of Informally Structured Speech Retriev...New Metrics for Meaningful Evaluation of Informally Structured Speech Retriev...
New Metrics for Meaningful Evaluation of Informally Structured Speech Retriev...Maria Eskevich
 

Plus de Maria Eskevich (8)

Video Hyperlinking (LNK) Task at TRECVid 2016
Video Hyperlinking (LNK) Task at TRECVid 2016Video Hyperlinking (LNK) Task at TRECVid 2016
Video Hyperlinking (LNK) Task at TRECVid 2016
 
Search and Hyperlinking Overview @MediaEval2014
Search and Hyperlinking Overview @MediaEval2014Search and Hyperlinking Overview @MediaEval2014
Search and Hyperlinking Overview @MediaEval2014
 
Focus on spoken content in multimedia retrieval
Focus on spoken content in multimedia retrievalFocus on spoken content in multimedia retrieval
Focus on spoken content in multimedia retrieval
 
Audio/Video Search: Why? What? How?
Audio/Video Search: Why? What? How?Audio/Video Search: Why? What? How?
Audio/Video Search: Why? What? How?
 
DCU at the NTCIR-9 SpokenDoc Passage Retrieval Task
DCU at the NTCIR-9 SpokenDoc Passage Retrieval TaskDCU at the NTCIR-9 SpokenDoc Passage Retrieval Task
DCU at the NTCIR-9 SpokenDoc Passage Retrieval Task
 
Towards Methods for Efficient Access to Spoken Content in the AMI Corpus (SSC...
Towards Methods for Efficient Access to Spoken Content in the AMI Corpus (SSC...Towards Methods for Efficient Access to Spoken Content in the AMI Corpus (SSC...
Towards Methods for Efficient Access to Spoken Content in the AMI Corpus (SSC...
 
Search and Hyperlinking Task at MediaEval 2012
Search and Hyperlinking Task at MediaEval 2012Search and Hyperlinking Task at MediaEval 2012
Search and Hyperlinking Task at MediaEval 2012
 
New Metrics for Meaningful Evaluation of Informally Structured Speech Retriev...
New Metrics for Meaningful Evaluation of Informally Structured Speech Retriev...New Metrics for Meaningful Evaluation of Informally Structured Speech Retriev...
New Metrics for Meaningful Evaluation of Informally Structured Speech Retriev...
 

Dernier

Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024The Digital Insurer
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...apidays
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbuapidays
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusZilliz
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelDeepika Singh
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 

Dernier (20)

Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source Milvus
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 

Creating a Data Collection for Evaluating Rich Speech Retrieval (LREC 2012)

  • 1. Creating a Data Collection for Evaluating Rich Speech Retrieval Creating a Data Collection for Evaluating Rich Speech Retrieval Maria Eskevich1 , Gareth J.F. Jones1 Martha Larson 2 , Roeland Ordelman 3 1 Centre for Digital Video Processing, Centre for Next Generation Localisation School of Computing, Dublin City University, Dublin, Ireland 2 Delft University of Technology, Delft, The Netherlands 3 University of Twente, The Netherlands
  • 2. Creating a Data Collection for Evaluating Rich Speech Retrieval Outline MediaEval benchmark MediaEval 2011 Rich Speech Retrieval Task What is crowdsourcing? Crowdsourcing in Development of Speech and Language Resources Development of effective crowdsourcing task Comments on results Conclusion Future Work: Brave New Task at MediaEval 2012
  • 3. Creating a Data Collection for Evaluating Rich Speech Retrieval ediaEval Multimedia Evaluation benchmarking inititative Evaluate new algorithms for multimedia access and retrieval. Emphasize the ”multi” in multimedia: speech, audio, visual content, tags, users, context. Innovates new tasks and techniques focusing on the human and social aspects of multimedia content.
  • 4. Creating a Data Collection for Evaluating Rich Speech Retrieval ediaEval 2011 Rich Speech Retrieval (RSR) Task Task Goal: Information to be found - combination of required audio and visual content, and speaker’s intention
  • 5. Creating a Data Collection for Evaluating Rich Speech Retrieval ediaEval 2011 Rich Speech Retrieval (RSR) Task Task Goal: Information to be found - combination of required audio and visual content, and speaker’s intention
  • 6. Creating a Data Collection for Evaluating Rich Speech Retrieval ediaEval 2011 Rich Speech Retrieval (RSR) Task Task Goal: Information to be found - combination of required audio and visual content, and speaker’s intention
  • 7. Creating a Data Collection for Evaluating Rich Speech Retrieval ediaEval 2011 Rich Speech Retrieval (RSR) Task Task Goal: Information to be found - combination of required audio and visual content, and speaker’s intention Transcript 1 Transcript 2
  • 8. Creating a Data Collection for Evaluating Rich Speech Retrieval ediaEval 2011 Rich Speech Retrieval (RSR) Task Task Goal: Information to be found - combination of required audio and visual content, and speaker’s intention Transcript 1 Transcript 2 Meaning 1 Meaning 2
  • 9. Creating a Data Collection for Evaluating Rich Speech Retrieval ediaEval 2011 Rich Speech Retrieval (RSR) Task Task Goal: Information to be found - combination of required audio and visual content, and speaker’s intention Transcript 1 = Transcript 2 Meaning 1 = Meaning 2
  • 10. Creating a Data Collection for Evaluating Rich Speech Retrieval ediaEval 2011 Rich Speech Retrieval (RSR) Task Task Goal: Information to be found - combination of required audio and visual content, and speaker’s intention Transcript 1 = Transcript 2 Meaning 1 = Meaning 2 Conventional retrieval
  • 11. Creating a Data Collection for Evaluating Rich Speech Retrieval ediaEval 2011 Rich Speech Retrieval (RSR) Task Task Goal: Information to be found - combination of required audio and visual content, and speaker’s intention Transcript 1 = Transcript 2 Meaning 1 = Meaning 2
  • 12. Creating a Data Collection for Evaluating Rich Speech Retrieval ediaEval 2011 Rich Speech Retrieval (RSR) Task Task Goal: Information to be found - combination of required audio and visual content, and speaker’s intention Transcript 1 = Transcript 2 Meaning 1 = Meaning 2 Speech act 1 = Speech act 2
  • 13. Creating a Data Collection for Evaluating Rich Speech Retrieval ediaEval 2011 Rich Speech Retrieval (RSR) Task Task Goal: Information to be found - combination of required audio and visual content, and speaker’s intention Transcript 1 = Transcript 2 Meaning 1 = Meaning 2 Speech act 1 = Speech act 2 Extended speech retrieval
  • 14. Creating a Data Collection for Evaluating Rich Speech Retrieval ediaEval 2011 Rich Speech Retrieval (RSR) Task ME10WWW dataset: Videos from Internet video sharing platform blip.tv (1974 episodes, 350 hours)
  • 15. Creating a Data Collection for Evaluating Rich Speech Retrieval ediaEval 2011 Rich Speech Retrieval (RSR) Task ME10WWW dataset: Videos from Internet video sharing platform blip.tv (1974 episodes, 350 hours) Automatic Speech Recognition (ASR) transcript provided by LIMSI and Vocapia Research
  • 16. Creating a Data Collection for Evaluating Rich Speech Retrieval ediaEval 2011 Rich Speech Retrieval (RSR) Task ME10WWW dataset: Videos from Internet video sharing platform blip.tv (1974 episodes, 350 hours) Automatic Speech Recognition (ASR) transcript provided by LIMSI and Vocapia Research No queries and relevant items
  • 17. Creating a Data Collection for Evaluating Rich Speech Retrieval ediaEval 2011 Rich Speech Retrieval (RSR) Task ME10WWW dataset: Videos from Internet video sharing platform blip.tv (1974 episodes, 350 hours) Automatic Speech Recognition (ASR) transcript provided by LIMSI and Vocapia Research No queries and relevant items − > Collect for Retrieval Experiment: user-generated queries user-generated relevant items
  • 18. Creating a Data Collection for Evaluating Rich Speech Retrieval ediaEval 2011 Rich Speech Retrieval (RSR) Task ME10WWW dataset: Videos from Internet video sharing platform blip.tv (1974 episodes, 350 hours) Automatic Speech Recognition (ASR) transcript provided by LIMSI and Vocapia Research No queries and relevant items − > Collect for Retrieval Experiment: user-generated queries user-generated relevant items − > Collect via crowdsourcing technology
  • 19. Creating a Data Collection for Evaluating Rich Speech Retrieval What is crowdsourcing? Crowdsourcing is a form of human computation. Human computation is a method of having people do things that we might consider assigning to a computing device, e.g. a language translation task. A crowdsourcing system facilitates a crowdsourcing process.
  • 20. Creating a Data Collection for Evaluating Rich Speech Retrieval What is crowdsourcing? Crowdsourcing is a form of human computation. Human computation is a method of having people do things that we might consider assigning to a computing device, e.g. a language translation task. A crowdsourcing system facilitates a crowdsourcing process. Factors to take into account:
  • 21. Creating a Data Collection for Evaluating Rich Speech Retrieval What is crowdsourcing? Crowdsourcing is a form of human computation. Human computation is a method of having people do things that we might consider assigning to a computing device, e.g. a language translation task. A crowdsourcing system facilitates a crowdsourcing process. Factors to take into account: Sufficient number of workers
  • 22. Creating a Data Collection for Evaluating Rich Speech Retrieval What is crowdsourcing? Crowdsourcing is a form of human computation. Human computation is a method of having people do things that we might consider assigning to a computing device, e.g. a language translation task. A crowdsourcing system facilitates a crowdsourcing process. Factors to take into account: Sufficient number of workers Level of payment
  • 23. Creating a Data Collection for Evaluating Rich Speech Retrieval What is crowdsourcing? Crowdsourcing is a form of human computation. Human computation is a method of having people do things that we might consider assigning to a computing device, e.g. a language translation task. A crowdsourcing system facilitates a crowdsourcing process. Factors to take into account: Sufficient number of workers Level of payment Clear instructions
  • 24. Creating a Data Collection for Evaluating Rich Speech Retrieval What is crowdsourcing? Crowdsourcing is a form of human computation. Human computation is a method of having people do things that we might consider assigning to a computing device, e.g. a language translation task. A crowdsourcing system facilitates a crowdsourcing process. Factors to take into account: Sufficient number of workers Level of payment Clear instructions Possible cheating
• 28. Creating a Data Collection for Evaluating Rich Speech Retrieval Crowdsourcing in Development of Speech and Language Resources Suitability of crowdsourcing for simple/straightforward natural language processing tasks: work by non-expert crowdsource workers is of a similar standard to that performed by expert workers: translation/translation assessment, transcription of native language, word sense disambiguation, temporal annotation [Snow et al., 2008]. Research question at the collection creation stage: can untrained crowdsource workers undertake extended tasks which require them to be creative?
• 33. Creating a Data Collection for Evaluating Rich Speech Retrieval Crowdsourcing with Amazon Mechanical Turk A task is referred to as a 'Human Intelligence Task' or HIT. Crowdsourcing procedure: HIT initiation: the Requester uploads a HIT; Work: Workers carry out the HIT; Review: the Requester reviews the completed work and confirms payment to the worker at the previously set rate. *The Requester has the option of paying more ("Bonus").
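For readers who want to automate this loop, the steps above map onto the Mechanical Turk requester API. The following is a minimal sketch using the boto3 MTurk client; the title, reward, lifetime and the question XML file name are illustrative assumptions, not the settings used for the RSR collection, which was built with the MTurk tooling available in 2011.

```python
# Minimal sketch of the requester-side MTurk workflow: HIT initiation, review, bonus.
# Values (reward, lifetime, question file) are illustrative assumptions.
from typing import Optional
import boto3

mturk = boto3.client(
    "mturk",
    region_name="us-east-1",
    # Sandbox endpoint lets a requester rehearse the cycle without paying real workers.
    endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",
)

# HIT initiation: the Requester uploads a HIT.
hit = mturk.create_hit(
    Title="Find and describe an interesting video segment",
    Description="Watch a short video, mark an interesting segment and write queries for it.",
    Keywords="video, annotation, speech",
    Reward="0.19",                       # base payment in USD (illustrative)
    MaxAssignments=1,
    LifetimeInSeconds=7 * 24 * 3600,
    AssignmentDurationInSeconds=3600,
    Question=open("rsr_hit_question.xml").read(),  # hypothetical ExternalQuestion/HTMLQuestion XML
)
print("HIT id:", hit["HIT"]["HITId"])

# Review: approve the completed work and, optionally, grant a bonus.
def review(assignment_id: str, worker_id: str, bonus: Optional[str] = None) -> None:
    mturk.approve_assignment(
        AssignmentId=assignment_id,
        RequesterFeedback="Thank you for the careful annotation.",
    )
    if bonus:  # the Requester has the option of paying more ("Bonus")
        mturk.send_bonus(
            WorkerId=worker_id,
            AssignmentId=assignment_id,
            BonusAmount=bonus,
            Reason="Complete and accurate segment description.",
        )
```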
• 38. Creating a Data Collection for Evaluating Rich Speech Retrieval Information expected from the worker to create a test collection for the RSR Task: speech act type ('expressives': apology, opinion; 'assertives': definition; 'directives': warning; 'commissives': promise); time of the labeled speech act: beginning and end; an accurate transcript of the labeled speech act; queries to refind this speech act: a full-sentence query and a short web-style query.
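One way to keep these fields consistent across submissions is to validate each completed HIT against a small record type. The sketch below is an assumed schema for illustration, not the exact format used for the RSR ground truth; field names such as start_sec or web_query are hypothetical.

```python
# A minimal sketch of the record each completed HIT is expected to yield.
from dataclasses import dataclass

# Speech act labels and their categories, as listed on the slide above.
SPEECH_ACTS = {
    "apology": "expressives",
    "opinion": "expressives",
    "definition": "assertives",
    "warning": "directives",
    "promise": "commissives",
}

@dataclass
class RSRAnnotation:
    video_id: str         # blip.tv episode identifier
    speech_act: str       # one of SPEECH_ACTS
    start_sec: float      # beginning of the labeled speech act
    end_sec: float        # end of the labeled speech act
    transcript: str       # accurate manual transcript of the segment
    sentence_query: str   # full-sentence query to refind the segment
    web_query: str        # short web-style query

    def __post_init__(self):
        if self.speech_act not in SPEECH_ACTS:
            raise ValueError(f"unknown speech act: {self.speech_act}")
        if self.end_sec <= self.start_sec:
            raise ValueError("segment end must come after its start")
```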
• 40. Creating a Data Collection for Evaluating Rich Speech Retrieval Data management for Amazon Mechanical Turk ME10WWW videos vary in length -> starting points approximately 7 minutes apart are calculated for the longer videos: Dev set: 247 episodes, 562 starting points; Test set: 1727 episodes, 3278 starting points.
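A simple way to derive such playback starting points is to step through each episode's duration in fixed intervals. The snippet below assumes a plain 7-minute spacing and a single starting point for short episodes; the exact rule used to produce the 562 and 3278 starting points is not spelled out on the slide, so treat this as an approximation.

```python
# Sketch: derive playback starting points roughly 7 minutes apart (assumed spacing rule).
SPACING_SEC = 7 * 60

def starting_points(duration_sec: float, spacing: int = SPACING_SEC) -> list[int]:
    """Return offsets (in seconds) at which a worker starts watching the episode."""
    if duration_sec <= spacing:
        return [0]                      # short episodes get a single starting point
    return list(range(0, int(duration_sec), spacing))

# Example: a 25-minute episode yields four starting points.
print(starting_points(25 * 60))        # [0, 420, 840, 1260]
```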
• 48. Creating a Data Collection for Evaluating Rich Speech Retrieval Crowdsourcing experiment Worker expectations: Reward vs Work; Per-hour rate. Requester uploads the HIT: Pilot wording; 0.11 $ + bonus per speech act type.
• 49. Creating a Data Collection for Evaluating Rich Speech Retrieval Crowdsourcing experiment Workers' feedback on the pilot HIT (Pilot wording; 0.11 $ + bonus per speech act type): Reward is not worth the Work; Task is too complicated.
• 52. Creating a Data Collection for Evaluating Rich Speech Retrieval Crowdsourcing experiment Requester updates the HIT: Rewording; Examples; 0.19 $ + bonus (0-21$); Workers suggest the bonus size; Mention that we are a non-profit organization.
• 53. Creating a Data Collection for Evaluating Rich Speech Retrieval Crowdsourcing experiment Workers' feedback on the updated HIT: Reward is worth the Work; Task is comprehensible; Workers are not greedy!
  • 55. Creating a Data Collection for Evaluating Rich Speech Retrieval HIT example Pilot: “Please watch the video and find a short portion of the video (a segment) that contains an interesting quote. The quote must fall into one of these six categories” Revised: “Imagine that you are watching videos on YouTube. When you come across something interesting you might want to share it on Facebook, Twitter or your favorite social network. Now please watch this video and search for an interesting video segment that you would like to share with others because it is (an apology, a definition, an opinion, a promise, a warning)”.
• 56. Creating a Data Collection for Evaluating Rich Speech Retrieval Results: number of collected queries per speech act. Prices: Dev set: 40 $ for 30 queries; Test set: 80 $ for 50 queries.
• 66. Creating a Data Collection for Evaluating Rich Speech Retrieval Results assessment Number of accepted HITs = number of collected queries. No overlap of workers between dev and test sets. Creative work invites creative cheating: copying and pasting the provided examples -> examples should be pictures, not text; choosing the option that no speech act was found in the video -> manual assessment by the requester is needed. Workers rarely find noteworthy content later than the third minute after the playback starting point in the video.
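These observations suggest a few automatic checks before manual review. The sketch below flags two of the cheating patterns noted above: transcripts that closely match the instruction examples, and "no speech act found" answers. The similarity threshold, the example text and the submission field names are all assumptions for illustration.

```python
# Sketch of simple quality checks over submitted HITs; thresholds and field names are assumed.
from difflib import SequenceMatcher

EXAMPLE_TEXTS = [
    "I would like to apologise for the delay of this video ...",  # hypothetical example shown in the HIT
]

def looks_copied_from_examples(transcript: str, threshold: float = 0.9) -> bool:
    """Flag transcripts that closely match the examples shown in the HIT instructions."""
    return any(
        SequenceMatcher(None, transcript.lower(), ex.lower()).ratio() >= threshold
        for ex in EXAMPLE_TEXTS
    )

def needs_manual_review(submission: dict) -> bool:
    """Route suspicious submissions to the requester for manual assessment."""
    if submission.get("no_speech_act_found"):   # "nothing found" answers are easy to abuse
        return True
    if looks_copied_from_examples(submission.get("transcript", "")):
        return True
    return False

# Example usage on one (hypothetical) submission:
print(needs_manual_review({
    "no_speech_act_found": False,
    "transcript": "The speaker promises to release the source code next month.",
}))
```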
• 72. Creating a Data Collection for Evaluating Rich Speech Retrieval Conclusions It is possible to crowdsource extensive and complex tasks to support speech and language resources. Use concepts and vocabulary familiar to the workers. Pay attention to the technical issues of watching the video. Preprocess videos into smaller segments. Creative work demands a higher reward level, or simply a more flexible system. There is a high level of wastage due to task complexity.
• 76. Creating a Data Collection for Evaluating Rich Speech Retrieval MediaEval 2012 Brave New Task: Search and Hyperlinking Use scenario: a user is searching for a known segment in a video collection. Because the information in the segment might not be sufficient for their information need, they also want links to other related video segments that may help satisfy the information need connected to this video. Sub-tasks: Search: finding suitable video segments based on a short natural language query; Linking: defining links to other relevant video segments in the collection.
• 77. Creating a Data Collection for Evaluating Rich Speech Retrieval MediaEval 2012 Thank you for your attention! Welcome to MediaEval 2012! http://multimediaeval.org