Research on Tagging Photos with Text vs. Speech Input

This document summarizes a research study that compared text and speech as input modalities for tagging photos on camera phones. The study tested three hypotheses: 1) speech is preferred over text for tagging, 2) the advantage of speech grows with tag length, and 3) retrieving photos with speech is not faster than with text. A user study compared three conditions: speech only, text only, and both. Results showed that speech was not clearly better than text for tagging or retrieving photos. The implications are that systems should support multiple input modalities, enable reviewing audio tags, and allow combining modalities to address their separate strengths and weaknesses.

Research & Development

Text vs. Speech
A Comparison of Tagging Input Modalities
for Camera Phones

Mauro Cherubini, Xavier Anguera,
Nuria Oliver, and Rodrigo de Oliveira

people do not want to tag
their pictures
intro → hypotheses → methodology → results → implications

research question:

Assuming that users are willing to
input at least one tag, which input
modality can help the production and
retrieval of the pictures?

hypothesis 1

Speech is preferred to text as an
annotation mechanism on mobile
phones (objective measure)

Support:
- Mitchard and Winkles (2002)

hypothesis 1-bis

Speech annotations are preferred by
users even if this means spending more
time on the task (subjective measure)

Support:
- Perakakis and Potamianos (2008)

hypothesis 2

The longer the tag, the larger the
advantage of voice over text for
annotating pictures on mobile phones

Support:
- Hauptmann and Rudnicky (1990)

hypothesis 3

Retrieving pictures on mobile phones
with speech is not faster than with text
(objective measure)

Support:
- Mills et al. (2000)

the user study

1. field study (4 weeks)
2. controlled experiment: T1 - T2 - T3 - T4

3 experimental conditions:
a. Speech only
b. Text only
c. Speech and Text

MAMI

features of MAMI

•  processing is done entirely on the mobile phone
•  speech is not transcribed
•  to compare the waveforms of the audio tags, MAMI uses the Dynamic Time Warping algorithm
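
Because the audio tags are not transcribed, retrieval has to match a spoken query against stored recordings directly. The sketch below shows textbook Dynamic Time Warping on two 1-D feature sequences. It is an illustration only, not MAMI's implementation: the feature representation, the absolute-difference local cost, and the names `dtw_distance` and `best_match` are assumptions made for this example.

```python
from math import inf


def dtw_distance(a, b):
    """Textbook DTW cost between two 1-D sequences (assumed features)."""
    n, m = len(a), len(b)
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])  # assumed local distance
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats its sample
                cost[i][j - 1],      # b advances, a repeats its sample
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


def best_match(query, stored_tags):
    """Return the name of the stored tag closest to the query under DTW."""
    return min(stored_tags, key=lambda name: dtw_distance(query, stored_tags[name]))


# Hypothetical feature sequences standing in for recorded audio tags.
tags = {"garden": [1, 2, 2, 3], "stairs": [5, 5, 5]}
print(best_match([1, 2, 3], tags))
```

The warping step is what lets a query spoken faster or slower than the stored tag still align with it, since one sample may be matched against several consecutive samples of the other sequence.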

task 1: remember the tag
stimulus
retrieval

Pictures taken during the field trial

task 2: remember the context
stimulus
retrieval

TASK 2
PICTURE 1

three little bushes
Garden
Tree
Stairs

task 3: remember the picture
stimulus
retrieval

Text
Audio tags were converted into
textual tags and vice versa

task 4: remember the sequence
assignment
retrieval

TASK 4

Three pictures among
the oldest and three
pictures among the
newest.

metrics

•  time to completion
•  false positives
•  retrieval errors

results H1

results H1-bis
All participants in the BOTH group felt that tagging
with text was more effective than tagging with voice.

Voice: 3.33 [0.81], Text: 4.34 [0.81] (Mean [SD])
1 = completely agree; 5 = completely disagree

results H2

results H3

take away 1:
speech is not a given

the advantage of audio as an input modality for tagging
pictures on mobile phones is not a given

why?
1. retrieval precision
2. privacy

take away 2:
input mistakes
we address text input mistakes immediately.
by contrast, mistakes in audio recordings are
addressed less frequently

take away 3:
memory

speech does not help users memorize the tags

implication 1:
allow multiple modalities

© Pixar, 2008

implication 2:
enable audio inspection

implication 3:
enable modality synesthesia

© Disney, 1940

end
thanks

martigan@gmail.com
mauro@tid.es

http://www.i-cherubini.it/mauro/blog/
http://research.tid.es/multimedia/
