This document summarizes a research study that compared text and speech as input modalities for tagging photos on camera phones. The study tested three hypotheses: 1) speech is preferred over text for tagging, 2) the advantage of speech grows with longer tags, and 3) retrieving photos with speech is not faster than with text. A user study was conducted with three conditions: speech only, text only, and both speech and text. Results showed that speech was not clearly better than text for tagging or retrieving photos. The implications are that systems should support multiple input modalities, enable users to review audio tags, and allow modalities to be combined to offset their respective strengths and weaknesses.
Research on Tagging Photos with Text vs. Speech Input
1. Research & Development
Text vs. Speech
A Comparison of Tagging Input Modalities
for Camera Phones
Mauro Cherubini, Xavier Anguera,
Nuria Oliver, and Rodrigo de Oliveira
2. people do not want to tag
their pictures
intro → hypotheses → methodology → results → implications
3. research question:
Assuming that users are willing to
input at least one tag, which input
modality can help the production and
retrieval of the pictures?
4. hypothesis 1
Speech is preferred to text as an
annotation mechanism on mobile
phones (objective measure)
Support:
- Mitchard and Winkles (2002)
5. hypothesis 1-bis
Speech annotations are preferred by
users even if this means spending more
time on the task (subjective measure)
Support:
- Perakakis and Potamianos (2008)
6. hypothesis 2
The longer the tag the larger the
advantage of voice over text for
annotating pictures on mobile phones
Support:
- Hauptmann and Rudnicky (1990)
7. hypothesis 3
Retrieving pictures on mobile phones
with speech is not faster than with text
(objective measure)
Support:
- Mills et al. (2000)
8. the user study
field study (4 weeks)
controlled experiment: T1 - T2 - T3 - T4
3 experimental conditions:
a. Speech only
b. Text only
c. Speech and Text
10. features of MAMI
• processing is done entirely on the mobile
phone
• speech is not transcribed
• to compare the waveforms of the audio tags,
MAMI uses the Dynamic Time Warping algorithm
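The matching idea named above can be illustrated with a minimal sketch of Dynamic Time Warping. The slides do not say which audio features MAMI extracts or how it scores matches, so this example assumes each audio tag has already been reduced to a short 1-D sequence of numbers; the function names `dtw_distance` and `closest_tag` and the toy data are hypothetical and are not part of MAMI.

```python
import math


def dtw_distance(a, b):
    """Classic dynamic time warping cost between two 1-D numeric sequences.

    Assumption: absolute difference is used as the local distance; a real
    system would compare acoustic feature vectors rather than single numbers.
    """
    n, m = len(a), len(b)
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = local + min(
                cost[i - 1][j],      # a[i-1] also matched to b[j-1]'s successor path
                cost[i][j - 1],      # b[j-1] also matched along a[i-1]
                cost[i - 1][j - 1],  # both sequences advance together
            )
    return cost[n][m]


def closest_tag(query, stored):
    """Return the name of the stored sequence with the lowest DTW cost to query."""
    return min(stored, key=lambda name: dtw_distance(query, stored[name]))


# Hypothetical toy "audio tags": the query is a time-stretched copy of "garden".
stored_tags = {"garden": [0, 1, 3, 1, 0], "stairs": [2, 2, 2, 2, 2]}
query_tag = [0, 1, 1, 3, 3, 1, 0]
best_match = closest_tag(query_tag, stored_tags)
```

Because DTW lets one sample align with several samples of the other sequence, a tag spoken more slowly can still match its stored recording without any transcription step.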
11. task 1: remember the tag
stimulus
retrieval
Pictures taken during the field trial
12. task 2: remember the context
stimulus
retrieval
TASK 2
PICTURE 1
three little bushes
Garden
Tree
Stairs
13. task 3: remember the picture
stimulus
retrieval
Text
Audio tags were converted into
textual tags and vice versa
14. task 4: remember the sequence
assignment
retrieval
TASK 4
Three pictures among the oldest and three
pictures among the newest.
17. results H1-bis
All participants in the BOTH group felt that tagging
with text was more effective than tagging with voice.
Voice: 3.33 [0.81], Text: 4.34 [0.81] (Mean [SD])
1 = completely agree; 5 = completely disagree
21. take away 1:
speech is not a given
the advantage of audio as an input modality for tagging
pictures on mobile phones is not a given
why?
1. retrieval precision
2. privacy
22. take away 2:
input mistakes
we address text input mistakes immediately;
mistakes in audio recordings, by contrast, are
addressed less frequently
23. take away 3:
memory
speech does not help users memorize the tags