Visual-speech to text conversion applicable to telephone communication
1. Visual-speech to text conversion applicable to telephone communication for deaf individuals
30th April 2013
2. INTRODUCTION
In lip reading, speech is understood by interpreting the movements of the lips, face, and tongue.
However, the mapping between lip movements and phonemes is not one-to-one, so it is impossible to distinguish all phonemes using visual information alone.
3. the Cued Speech system
Cued Speech, developed by Cornett, contains two manual components:
the hand shape, and the hand position relative to the face.
Hand shapes code consonant phonemes; hand positions code vowel phonemes.
Cueing improves speech perception to a large extent.
4.
the Cued Speech system
5. AIM OF NEW SYSTEM
To investigate the design of a system able to automatically recognize Cued Speech and convert it to text.
Such a system would make it possible for deaf or speech-impaired individuals to communicate with each other, and also with normal-hearing persons, using gestures captured by camera-equipped devices.
6. METHODS
Corpus, feature extraction, and statistical modeling.
The data were derived from a video recording of cuers pronouncing and coding words in Cued Speech.
The speakers' lips were painted blue, and landmarks of different colors were placed on the speakers' fingers.
7.
The colored marks allow a faster and more accurate image processing stage.
The audio part of the video recording was synchronized with the image.
An automatic image processing method was applied to the video to extract the lip shape parameters:
lip width (A),
lip aperture (B),
lip area (S),
pinching of the upper lip (Bsup) and of the lower lip (Binf).
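As a rough illustration of this stage, the sketch below derives the width, aperture, and area parameters from a binary lip mask. The `lip_parameters` helper is hypothetical, not the study's code: the actual method segments the blue-painted lips from video and also measures Bsup and Binf from the lip contour, which this toy version omits.

```python
def lip_parameters(mask):
    """Derive simple lip shape parameters from a binary lip mask:
    a 2-D grid with 1 where the (blue-painted) lips were segmented.
    Toy stand-in for the paper's image processing stage."""
    coords = [(y, x) for y, row in enumerate(mask)
              for x, v in enumerate(row) if v]
    if not coords:
        return {"A": 0, "B": 0, "S": 0}
    ys = [y for y, _ in coords]
    xs = [x for _, x in coords]
    return {
        "A": max(xs) - min(xs) + 1,  # lip width (A), in pixels
        "B": max(ys) - min(ys) + 1,  # lip aperture (B), in pixels
        "S": len(coords),            # lip area (S), in pixels
    }

# Toy 5x7 mask standing in for one segmented video frame.
mask = [[0, 0, 0, 0, 0, 0, 0],
        [0, 1, 1, 1, 1, 1, 0],
        [0, 1, 1, 1, 1, 1, 0],
        [0, 1, 1, 1, 1, 1, 0],
        [0, 0, 0, 0, 0, 0, 0]]
print(lip_parameters(mask))  # {'A': 5, 'B': 3, 'S': 15}
```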
8. Concatenative feature fusion
The system tracks and extracts the xy coordinates of the hand landmarks at each time frame and uses those values as features in the HMM modeling.
It uses the concatenation of the synchronous lip shape and hand features as the joint feature vector, given by:
9.
O_t^(LH) = [O_t^(L), O_t^(H)]
where O_t^(LH) is the joint lip-hand feature vector, O_t^(L) the lip shape feature vector, O_t^(H) the hand feature vector, and D the dimensionality of the joint feature vector.
(Table: parameters used for lip shape modeling.)
10. RESULTS
Isolated word recognition
1. Recognition in the normal-hearing subject
11.
2. Recognition in the deaf subject
12.
3. Multi-speaker isolated word recognition:
The aim was to investigate whether it is possible to train speaker-independent HMMs for Cued Speech recognition.
The training data consisted of 750 words from the normal-hearing subject and 750 words from the deaf subject.
For testing, 700 words from the normal-hearing subject and 700 words from the deaf subject were used, respectively.
Each state was modeled with a mixture of 4 Gaussian distributions.
For lip shape and hand shape integration, concatenative feature fusion was used.
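Each HMM state emission here is a 4-component Gaussian mixture. Below is a minimal sketch of how one state scores a joint feature vector, assuming diagonal covariances; a toolkit such as HTK computes this internally during training and decoding, and all numbers in the example are made up.

```python
import math

def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of one observation x under a diagonal-covariance
    Gaussian mixture -- the emission density of a single HMM state."""
    log_terms = []
    for w, mu, var in zip(weights, means, variances):
        ll = math.log(w)
        for xi, mi, vi in zip(x, mu, var):
            # log of a univariate Gaussian density, per feature dimension
            ll += -0.5 * (math.log(2 * math.pi * vi) + (xi - mi) ** 2 / vi)
        log_terms.append(ll)
    top = max(log_terms)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(t - top) for t in log_terms))

# A toy 4-component mixture over a 2-D feature vector (made-up numbers).
weights = [0.4, 0.3, 0.2, 0.1]
means = [[0.0, 0.0], [1.0, 1.0], [-1.0, 0.5], [2.0, -2.0]]
variances = [[1.0, 1.0]] * 4
score = gmm_log_likelihood([0.2, 0.1], weights, means, variances)
```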
13.
14.
4. Continuous phoneme recognition
Phoneme correct rate for continuous phoneme recognition in the case of the normal-hearing subject.
15.
Phoneme correct rate for continuous phoneme recognition in the case of the deaf subject.
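"Phoneme correct" is conventionally 100 * (N - D - S) / N, where N is the number of reference phonemes and D and S are the deletions and substitutions in a minimum-edit-distance alignment of the recognized string against the reference (insertions are not penalized, unlike "accuracy"). A minimal sketch with unit edit costs follows; HTK's HResults uses slightly different alignment penalties, so its counts can differ marginally.

```python
def phoneme_correct(ref, hyp):
    """Percent phoneme correct: 100 * (N - D - S) / N, equivalently
    100 * H / N where H is the number of matched reference phonemes
    in a minimum-edit-distance alignment."""
    n, m = len(ref), len(hyp)
    # cost[i][j]: minimum edit operations turning ref[:i] into hyp[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(sub, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Backtrack through the alignment, counting hits (matched phonemes).
    i, j, hits = n, m, 0
    while i > 0 and j > 0:
        if ref[i - 1] == hyp[j - 1] and cost[i][j] == cost[i - 1][j - 1]:
            hits += 1
            i, j = i - 1, j - 1          # match
        elif cost[i][j] == cost[i - 1][j - 1] + 1:
            i, j = i - 1, j - 1          # substitution
        elif cost[i][j] == cost[i - 1][j] + 1:
            i -= 1                       # deletion
        else:
            j -= 1                       # insertion
    return 100.0 * hits / n

# One substitution in four reference phonemes -> 75% correct.
print(phoneme_correct(list("siks"), list("sits")))  # 75.0
```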
16. CONCLUSION
Hand shapes and lip shapes were integrated using concatenative feature fusion, and HMM-based automatic recognition was conducted.
For continuous phoneme recognition, an 86% phoneme correct rate was achieved for the normal-hearing cuer and an 82.7% phoneme correct rate for the deaf cuer, respectively.
Isolated word recognition experiments with both the normal-hearing and the deaf subject were also conducted, obtaining 94.9% and 89% accuracy, respectively.
17. CONCLUSION
A multi-speaker experiment using data from both the normal-hearing and the deaf subject showed an 89.6% word accuracy on average.
This result indicates that training speaker-independent HMMs for Cued Speech using a large number of subjects should not face particular difficulties.
18. REFERENCES
G. Potamianos, C. Neti, G. Gravier, A. Garg, and A. W. Senior, "Recent advances in the automatic recognition of audiovisual speech," Proceedings of the IEEE, vol. 91, no. 9, pp. 1306–1326, 2003.
S. Nakamura, K. Kumatani, and S. Tamura, "Multi-modal temporal asynchronicity modeling by product HMMs for robust audio-visual speech recognition," in Proceedings of the Fourth IEEE International Conference on Multimodal Interfaces (ICMI'02), p. 305, 2002.
R. O. Cornett, "Cued speech," American Annals of the Deaf, vol. 112, pp. 3–13, 1967.
J. Leybaert, "Phonology acquired through the eyes and spelling in deaf children," Journal of Experimental Child Psychology, vol. 75, pp. 291–318, 2000.