Peter Muschick MSc thesis
Universitat Politècnica de Catalunya, 2020
Sign language recognition and translation has been an active research field in recent years, with most approaches using deep neural networks to extract information from sign language data. This work investigates a largely disregarded approach: human keypoint estimation from image and video data with OpenPose, combined with a transformer network architecture. Individual signs could be recognized reliably (4.5% word error rate (WER)). Continuous sign language recognition was considerably more error-prone (77.3% WER), and sign language translation was not possible with the proposed methods. Possible causes are low accuracy of the human keypoint estimation by OpenPose, with an accompanying loss of information, or insufficient capacity of the transformer model used. Results may improve with datasets that contain more repetitions of individual signs, or with more precise keypoint extraction for the hands.
2. Were my hands visible? Was the background not distracting? Did my clothes contrast with my skin color? Was the video quality sufficient?
3. • Problem
• Communication issues of sign language speakers (in digital environments) [DFG+]
• Proposed solutions
• Automatically generated subtitles and translations of sign languages
• Speech2Signs: spoken-to-sign-language translation using neural networks, by Prof. Xavier Giró and Amanda Duarte (PhD candidate) at Universitat Politècnica de Catalunya, Barcelona
• Here: research on sign language translation with a new dataset called How2Sign and with OpenPose
Motivation
University of Stuttgart 06.11.2020
4. • Introduction
• Sign language research
• Current state
• Related works
• Methods
• Results
• Discussion & Summary
Content
5. • Sign languages are individual and independent languages
• Sign languages are spoken on multiple parallel channels [Dam11]
• Written text cannot capture all the information conveyed in sign languages [Sut95] [Sto05] [Pri90]
• Research on sign language translation depends on the translation direction
Characteristics of neural sign language translation research
Introduction
6. • Research on sign language translation: sign language to spoken language
Translation direction
Introduction
Input: image/video → Output: text/audio (‘Hi my name is ...’ / audio) [DPG+20]
7. • Research on sign language translation: spoken language to sign language
Translation direction
Introduction
Input: text/audio (‘Hi my name is ...’ / audio) → Output: animated avatar or generated videos (GAN) [DPG+20]
GAN = Generative Adversarial Network
8. • Research on sign language translation: sign language to sign language
Translation direction
Introduction
Input: image/video → ‘Hi my name is ...’ / audio → Output: animated avatar or generated videos (GAN) [DPG+20]
GAN = Generative Adversarial Network
9. • Sign language to sign language: no known publications
• Spoken language to sign language: Saunders et al., 2020 [SCB20]; Stoll et al., 2018 [STL+18]
• Sign language to spoken language:
• Sign Recognition (Zafrulla et al., 2011 [ZAH+11])
• Continuous Sign Recognition (Koller et al., 2015 [KFN15])
• Sign Language Translation (Camgöz et al., 2018 [CHK+18]; Camgöz et al., 2020 [CKHB20])
Current state of research
Introduction
10. Sign language to spoken language tasks
Introduction
Task | Sign Recognition | Continuous Sign Recognition | Sign Language Translation
Sign language representation | Images | Videos | Videos
Spoken language representation | Classes | Signs | Text
Example | “A” | “HI ME SARAH” | “Hi my name is Sarah”
11. • Enable use of sign language with sign language translation
• Issues with current sign language datasets
• Limited range of topics, vocabulary, and number of speakers [DPG+20]
→ Collection and creation of the How2Sign dataset [DPG+20]
Sign language to spoken language translation
Introduction
12. Proposed solution - Sign language into spoken language translation
Introduction
Task | Sign Recognition | Continuous Sign Recognition | Sign Language Translation
Dataset | SLR [GB] | PHOENIX14T [CHK+18] | PHOENIX14T, How2Sign [DPG+20]
Extraction | OpenPose [CHS+18]
Model | Transformer [VSP+17]
Evaluation | R, M, B, W
R = Rouge [Lin04], M = Meteor [BL02], B = BLEU [PRWZ02], W = Word Error Rate [KP02]
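Word error rate is the metric used for the recognition tasks in this work. As a minimal sketch, WER can be computed as the word-level edit distance between hypothesis and reference, divided by the number of reference words. The function name `word_error_rate` is hypothetical and not taken from the thesis code; the sketch only illustrates the standard definition.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


# One of three reference glosses is missing in the hypothesis.
wer_example = word_error_rate("HI ME SARAH", "HI SARAH")
```

Because insertions count as errors, WER can exceed 1.0 (100%) when the hypothesis is much longer than the reference.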
14. Dataset
Methods
Task | Sign Recognition | Continuous Sign Recognition | Sign Language Translation | Sign Language Translation
Dataset | SLR | PHOENIX14T (Glosses) | PHOENIX14T (German) | How2Sign (English)
Type | Images | Videos | Videos | Videos
Annotation | Classes | Glosses | German | English
Hours | - | 10.5 | 10.5 | 80
Utterances | 5 000 | 8 200 | 8 200 | 35 000
Vocab | 24 | 1 000 | 3 000 | 16 000
16. • Human keypoint estimation with pretrained convolutional networks [CHS+18]
OpenPose - Human Keypoint Estimation
Methods
[Figure: panels “Input” and “Output”]
17. • OpenPose returns 137 estimated keypoints (body, face, hands) per frame
• Each keypoint consists of x- and y-coordinates and a confidence score
• Data normalization [KKJC19]
OpenPose - Human Keypoint Estimation
Methods
x = {x ∈ R | 0 ≤ x ≤ max(frame x-axis)}
n = {n ∈ N | 0 ≤ n ≤ #keypoints}
f = {f ∈ N | 0 < f ≤ #frames}
u = {u ∈ N | 0 < u ≤ #utterances}
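The keypoint layout above can be sketched in code. OpenPose writes one JSON file per frame, with flat `[x, y, confidence, ...]` lists per keypoint group; with the BODY_25 body model, 70 face points and 21 points per hand, this gives the 137 keypoints mentioned on the slide. The helper names `frame_keypoints` and `normalize` are hypothetical, and the per-frame standardization shown is only one plausible normalization, assumed here for illustration; the exact scheme used in the thesis is not given on the slide.

```python
import json
from statistics import mean, pstdev

# Keypoint groups in OpenPose's per-frame JSON output and their sizes.
GROUPS = {
    "pose_keypoints_2d": 25,
    "face_keypoints_2d": 70,
    "hand_left_keypoints_2d": 21,
    "hand_right_keypoints_2d": 21,
}


def frame_keypoints(frame_json: str):
    """Return 137 (x, y, confidence) triples for the first detected person."""
    person = json.loads(frame_json)["people"][0]
    triples = []
    for key, count in GROUPS.items():
        flat = person[key]
        assert len(flat) == 3 * count, f"unexpected length for {key}"
        triples += [tuple(flat[i:i + 3]) for i in range(0, len(flat), 3)]
    return triples


def normalize(triples):
    """Standardize x and y within one frame (assumed scheme, see lead-in)."""
    xs = [t[0] for t in triples]
    ys = [t[1] for t in triples]
    mx, my = mean(xs), mean(ys)
    sx, sy = pstdev(xs) or 1.0, pstdev(ys) or 1.0  # guard against zero spread
    return [((x - mx) / sx, (y - my) / sy, c) for x, y, c in triples]


# Synthetic frame standing in for a real OpenPose output file.
fake_person = {
    key: [v for i in range(n) for v in (float(i), 2.0 * i, 0.5)]
    for key, n in GROUPS.items()
}
keypoints = frame_keypoints(json.dumps({"people": [fake_person]}))
normalized = normalize(keypoints)
```

The confidence score is passed through unchanged, so a downstream model can still use it as a signal for unreliable keypoints.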
18. • Transformer models from “Attention Is All You Need” [VSP+17], based on self-attention
• Schematic structure of the Transformer model used [Ala18]
Models
Methods
Figure legend:
N = Normalization layer
MLP = Multilayer perceptron
C = Classification layer
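The core operation of the Transformer in [VSP+17] is scaled dot-product self-attention, softmax(QKᵀ / √d_k) V. The following is a minimal single-head sketch in plain Python, assuming each frame is a small feature vector; the function names and the identity projection matrices are illustrative assumptions, and the normalization, MLP and classification layers from the legend are omitted.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a sequence x."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]  # one distribution per frame
    return matmul(weights, v), weights


# Three frames with two features each; identity projections for readability.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
attended, attn_weights = self_attention(frames, identity, identity, identity)
```

Each output frame is a weighted average of all frames in the sequence, which is how the model can relate a keypoint configuration to earlier and later frames of the same sign.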
19. Proposed solution - Sign language into spoken language translation
Methods - Overview
Task | Sign Recognition | Continuous Sign Recognition | Sign Language Translation
Dataset | SLR [GB] | PHOENIX14T [CHK+18] | PHOENIX14T, How2Sign [DPG+20]
Extraction | OpenPose [CHS+18]
Model | Transformer [VSP+17]
Evaluation | R, M, B, W
R = Rouge [Lin04], M = Meteor [BL02], B = BLEU [PRWZ02], W = Word Error Rate [KP02]
20. SLR - Sign Recognition
Results
Work | Our study | Gupta et al. [GB]
Dataset | SLR | SLR
Extraction | OpenPose | CNN
Model | Transformer | MLP
Evaluation | W | W
R = Rouge [Lin04], M = Meteor [BL02], B = BLEU [PRWZ02], W = Word Error Rate [KP02]
21. SLR - Sign Recognition
Results
Column | Meaning
Experiment | Number
Hidden size | MLP size of transformer layer
#Layer | Number of transformer layers
Dropout | Dropout in transformer layer
LR | Learning rate
#Heads | Number of attention heads
WER (%) | Result
23. PHOENIX14T - Continuous Sign Recognition
Results
Work | Our study | Camgöz et al., 2020 [CKHB20]
Dataset | PHOENIX14T | PHOENIX14T
Extraction | OpenPose | CNN
Model | Transformer | Transformer
Evaluation | W | W
R = Rouge [Lin04], M = Meteor [BL02], B = BLEU [PRWZ02], W = Word Error Rate [KP02]
24. PHOENIX14T - Continuous Sign Recognition
Results
Experiment | Hidden size | #Layer | Dropout | LR | #Heads | WER (%) Val | WER (%) Test
1 | 128 | 1 | 0.2 | 10^-4 | 1 | 93.3 | 94.1
2 | 512 | 2 | 0.2 | 10^-4 | 4 | 85.5 | 84.4
3 | 2048 | 4 | 0.2 | 10^-4 | 8 | 79.3 | 81.2
Camgöz et al., 2020 | - | - | - | - | - | 24.88 | 24.59
25. PHOENIX14T - Sign Language Translation
Results
Work | Our study | Ko et al., 2019 [KKJC19] | Camgöz et al., 2020 [CKHB20]
Dataset | PHOENIX14T, How2Sign | KETI (na) | PHOENIX14T
Extraction | OpenPose | OpenPose | CNN
Model | Transformer | Seq2Seq | Transformer
Evaluation | R, M, B, W | R, M, B, C | B, W
na = not available
R = Rouge [Lin04], M = Meteor [BL02], B = BLEU [PRWZ02], W = Word Error Rate [KP02]
27. How2Sign - Sign Language Translation
Results
Exp | #Hid | #Lay | Drop | LR | #H | B1 | B2 | B3 | B4 | M | R
1 | 1024 | 4 | 0.4 | 10^-5 | 32 | 1.0 | 0.0 | 0.0 | 0.0 | 2.0 | 3.0
2 | 2048 | 6 | 0.4 | 10^-5 | 16 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2.0
oom | 2048 | 4 | 0.4 | 10^-5 | 64 | - | - | - | - | - | -
oom | 2048 | 8 | 0.4 | 10^-5 | 32 | - | - | - | - | - | -
oom = out of memory error
B1 to B4 = BLEU-1 to BLEU-4, M = Meteor, R = Rouge
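The pattern of a small BLEU-1 with BLEU-2 to BLEU-4 at 0.0 is what BLEU produces when a hypothesis shares only isolated words with the reference and no longer word sequences. The sketch below is a simplified sentence-level BLEU [PRWZ02] with a single reference and no smoothing, written only to illustrate this effect; it is an assumption-laden illustration, not the evaluation script used in the thesis, and the function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(reference, hypothesis, n):
    """Clipped n-gram precision of the hypothesis against one reference."""
    hyp_counts = ngrams(hypothesis, n)
    ref_counts = ngrams(reference, n)
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
    return clipped / total


def bleu(reference, hypothesis, max_n):
    """BLEU-max_n with uniform weights and brevity penalty, no smoothing."""
    precisions = [modified_precision(reference, hypothesis, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # the geometric mean is zero if any precision is zero
    if len(hypothesis) > len(reference):
        penalty = 1.0
    else:
        penalty = math.exp(1 - len(reference) / len(hypothesis))
    return penalty * math.exp(sum(math.log(p) for p in precisions) / max_n)


reference = "hi my name is sarah".split()
hypothesis = "my sarah hi".split()  # correct words, wrong order
bleu_1 = bleu(reference, hypothesis, 1)
bleu_2 = bleu(reference, hypothesis, 2)
```

In this example every hypothesis word occurs in the reference, so BLEU-1 is positive, but no bigram matches, so BLEU-2 and all higher orders are zero.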
28. Translation results
Discussion
Task | Dataset | Translation/recognition quality
Sign Recognition | SLR | High
Continuous Sign Recognition | PHOENIX14T | Low
Sign Language Translation | PHOENIX14T | Low
Sign Language Translation | How2Sign | Not possible
→ Bigger and more complex datasets could not be translated
29. • Keypoint estimation accuracy of OpenPose might be too low
Limitations
Discussion
30. • Confidence scores of a video of ~2800 frames displaying a sign language speaker
OpenPose - How2Sign: face & body confidence scores
Discussion
31. • Confidence scores of a video of ~2800 frames displaying a sign language speaker
OpenPose - How2Sign: left & right hand confidence scores
Discussion
32. • Keypoint estimation accuracy of OpenPose might be too low
• Models with larger hyperparameter values exceeded the server memory
• Complexity of the models used might be too low
Limitations
Discussion
33. • OpenPose and the transformer model are suited to sign recognition
• The proposed methods did not yield satisfactory results for continuous sign recognition and sign language translation
Summary
34. • Run OpenPose with different datasets and examine accuracy
• Datasets with more repetitions of single signs
• Focus on hand recognition
• Continue with transformer models
• Use pre-defined transformer models from libraries
• Use OpenPose for facial recognition
Outlook
35. [Jac96] R. Jacobs. “Just how hard is it to learn ASL? The case for ASL as a truly foreign language.” In: Multicultural aspects of sociolinguistics in deaf communities 2 (1996), pp. 183–226.
[Dam11] S. Damian. “Spoken vs. Sign Languages - What’s the Difference?” In: Cognition, Brain, Behavior 15.2 (2011), p. 251.
[DFG+] P. Dreuw, J. Forster, Y. Gweth, D. Stein, H. Ney, G. Martinez, J. V. Llahi, O. Crasborn, E. Ormel, W. Du, T. Hoyoux, J. Piater, J. M. Moya, M. Wheatley. “SignSpeak – Understanding, Recognition, and Translation of Sign Languages.” p. 8.
[ACH+13] M. Adams, C. Castaneda, H. W. Hackman, M. L. Peters, X. Zuniga, W. J. Blumenfeld. Readings for diversity and social justice. Third edition. New York: Routledge, Taylor & Francis Group, 2013.
[Sut95] V. Sutton. Lessons in sign writing. SignWriting, 1995.
[Sto05] W. Stokoe. “Sign language structure: an outline of the visual communication systems of the American deaf. 1960.” In: Journal of deaf studies and deaf education 10.1 (2005), pp. 3–37.
[Pri90] S. Prillwitz. “Hamburger Notations-System - Entwicklung einer Gebärdenschrift mit Computeranwendung.” In: Gebärde, Laut und graphisches Zeichen: Schrifterwerb im Problemfeld von Mehrsprachigkeit. Ed. by G. List, G. List. Wiesbaden: VS Verlag für Sozialwissenschaften, 1990, pp. 60–82.
[DPG+20] A. Duarte, S. Palaskar, D. Ghadiyaram, K. DeHaan, F. Metze, J. Torres, X. Giro-i-Nieto. “How2Sign: A Large-scale Multimodal Dataset for Continuous American Sign Language.”
[SCB20] B. Saunders, N. C. Camgoz, R. Bowden. “Progressive Transformers for End-to-End Sign Language Production.” (Apr. 2020).
Sources I
36. [CHS+18] Z. Cao, G. Hidalgo, T. Simon, S.-E. Wei, Y. Sheikh. OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields. 2018.
[GB] R. Gupta, V. Behl. imRishabhGupta/Indian-Sign-Language-Recognition. URL: https://github.com/imRishabhGupta/Indian-Sign-Language-Recognition
[CHK+18] N. C. Camgoz, S. Hadfield, O. Koller, H. Ney, R. Bowden. “Neural Sign Language Translation.” In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2018.
[CKHB20] N. C. Camgoz, O. Koller, S. Hadfield, R. Bowden. “Sign Language Transformers: Joint End-to-end Sign Language Recognition and Translation.” (Mar. 2020).
[VSP+17] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin. “Attention Is All You Need.” (Dec. 2017).
[KP02] D. Klakow, J. Peters. “Testing the correlation of word error rate and perplexity.” In: Speech Communication 38.1 (2002), pp. 19–28. ISSN: 0167-6393.
[PRWZ02] K. Papineni, S. Roukos, T. Ward, W. J. Zhu. “BLEU: a Method for Automatic Evaluation of Machine Translation.” (Oct. 2002).
[BL02] S. Banerjee, A. Lavie. “METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments.” (2002).
[Lin04] C.-Y. Lin. “Rouge: A package for automatic evaluation of summaries.” In: Text summarization branches out. 2004.
Sources II
37. [STL+18] S. Stoll, N. C. Camgoz, S. Hadfield, R. Bowden. Text2Sign: Towards Sign Language Production Using Neural Machine Translation and Generative Adversarial Networks. 2018.
[KKJC19] S.-K. Ko, C. J. Kim, H. Jung, C. Cho. “Neural Sign Language Translation based on Human Keypoint Estimation.” (June 2019).
[Ala18] J. Alammar. The Illustrated Transformer. June 2018. URL: http://jalammar.github.io/illustrated-transformer/
[KFN15] O. Koller, J. Forster, H. Ney. “Continuous sign language recognition: Towards large vocabulary statistical recognition systems handling multiple signers.” In: Computer Vision and Image Understanding 141 (Dec. 2015).
[ZAH+11] Z. Zafrulla, H. Brashear, T. Starner, H. Hamilton, P. Presti. American Sign Language Recognition with the Kinect. 2011.
Sources III
38. Thank you!
Peter Muschick, University of Stuttgart
e-mail: swt89259@stud.uni-stuttgart.de
www: github.com/asdf11x/stt
Photo by Louisa Schaad on Unsplash
41. • How hard is it actually to learn sign language (for native English speakers)? [Jac96]
• American Sign Language is as hard to learn as Japanese or Arabic
Sentence structure: Time + Theme + Comment + Speaker
• Time = grammatical tense
• Theme = object of the sentence
• Comment = additional information about the subject
• Speaker = subject of the sentence
“I went to the university yesterday” → YESTERDAY UNIVERSITY GO I
Sign language
43. • Average confidence scores of OpenPose
OpenPose
Results
Keypoint group | SLR | PHOENIX14T | How2Sign
body | - | 0.31 | 0.40
face | - | 0.77 | 0.84
left hand | 0.55 | 0.31 | 0.47
right hand | - | 0.29 | 0.43
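Averages like those in the table can be obtained by taking the mean of the confidence value over all keypoints of a group and all frames. The sketch below assumes each frame is a list of 137 (x, y, confidence) triples ordered as body (25), face (70), left hand (21), right hand (21); this ordering, the name `average_confidence` and the synthetic data are assumptions for illustration, not the thesis code.

```python
from statistics import mean

# Assumed position of each keypoint group within a 137-keypoint frame.
GROUP_SLICES = {
    "body": slice(0, 25),
    "face": slice(25, 95),
    "left hand": slice(95, 116),
    "right hand": slice(116, 137),
}


def average_confidence(frames):
    """Mean confidence per keypoint group over all frames.

    frames: list of frames, each a list of 137 (x, y, confidence) triples.
    """
    return {
        name: mean(c for frame in frames for _, _, c in frame[sl])
        for name, sl in GROUP_SLICES.items()
    }


# Two identical synthetic frames with one constant confidence per group.
synthetic_frame = (
    [(0.0, 0.0, 0.4)] * 25
    + [(0.0, 0.0, 0.8)] * 70
    + [(0.0, 0.0, 0.5)] * 21
    + [(0.0, 0.0, 0.3)] * 21
)
averages = average_confidence([synthetic_frame, synthetic_frame])
```

Reporting the groups separately matters here because the hands carry most of the sign information, and the table shows them with much lower average confidence than the face.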
46. • Confidence scores of 242 images showing a left hand signing the letter A from different angles
OpenPose - SLR
Results
47. • Confidence scores of 120 frames displaying a sign language speaker
OpenPose - PHOENIX14T
Results