Being multilingual with EMMA
Jorge Civera
EMMA Summer School
jcivera@dsic.upv.es
Tuesday 7th July, 2015
Index
1. Presentation
2. Multilingual access to MOOCs
3. Video subtitling
• Transcription
• Translation
4. Document translation
5. Conclusions and Discussion
UPV - Being multilingual with EMMA 2 / 21
Presentation
• Lecturer at the Department of Computer Systems and Computation
• Machine Learning and Language Processing (MLLP) group (mllp.upv.es)
• Automatic Speech Recognition:
– Already supported: English (En), Spanish (Es), Italian (It), Dutch (Nl), Estonian (Et),
Portuguese (Pt), French (Fr) and Catalan (Ca)
– In progress: German (De) and Slovene (Sl)
• Machine Translation:
– Language pairs available: En → {Es, It, Fr, Ca} and {Es, It, Nl, Et, Pt, Fr, Ca} → En
• Speech Synthesis:
– Already supported: English (En) and Spanish (Es)
• Experience in EU projects providing multilingual access to educational content:
– transLectures and EMMA
Presentation
• transLectures (Nov 2011 - Oct 2014)
– Lowering the language barrier to video repositories by providing multilingual subtitles
– Improving subtitles by massive adaptation and intelligent interaction
– VideoLectures.NET (VL) and poliMedia (pM) video repositories with thousands of hours
– Source languages: English and Slovene in VL and Spanish in pM
– Target languages: Spanish, French, German, Slovene and English
• EMMA (Feb 2014 - Jul 2016)
– Providing multilingual access to MOOCs (videos and documents)
– A few hours of video in 7 languages: En, Es, It, Nl, Et, Pt and Fr
– Source language is the national language of the MOOC provider
– Target languages: English, Spanish and Italian
Multilingual access to MOOCs
• Most MOOCs are offered in only a few languages
– English (45%), Spanish (32%), French (14%) and other languages (9%)
• The language barrier keeps millions of potential learners from taking MOOCs
• What components in a MOOC need to be translated?
– Texts
– Images
– Videos
– Conversations (Forums)
• EMMA currently tackles the translation of texts and videos
• Videos are translated by providing subtitles in the target language
Cost of translating MOOCs
Texts
• Manual translation rate is approximately 2,500 words per day
• A 6-week course with 75,000 words takes 1.5 person-months (PM) to translate
Videos
• Before translating, videos are manually transcribed (Real Time Factor, RTF, of 10: ten times the video duration)
• Then, transcriptions are translated into the desired language (30 RTF)
• A course including 2 hours of video takes 0.5 PM to translate
Solutions to lower costs
• Crowdsourcing (TED talks)
• Speech Recognition and Machine Translation to generate draft translations
– User effort to translate a course is reduced to 30% - 50% (0.6 - 1 PM)
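The figures on this slide can be checked with a few lines of arithmetic. The sketch below is only a worked example: the slide does not say how a person-month is converted, so it assumes 20 working days of 8 hours each, and the function names are made up for illustration.

```python
# Assumed conversion factors (not stated on the slide): one person-month (PM)
# is taken as 20 working days of 8 hours each.
WORKING_DAYS_PER_PM = 20
HOURS_PER_DAY = 8


def text_cost_pm(words, words_per_day=2500):
    """Person-months needed to translate `words` words manually."""
    return words / words_per_day / WORKING_DAYS_PER_PM


def video_cost_pm(video_hours, transcription_rtf=10, translation_rtf=30):
    """Person-months needed to transcribe and then translate a video manually."""
    work_hours = video_hours * (transcription_rtf + translation_rtf)
    return work_hours / (HOURS_PER_DAY * WORKING_DAYS_PER_PM)


text_pm = text_cost_pm(75_000)   # 75,000 words at 2,500 words/day = 30 days
video_pm = video_cost_pm(2)      # 2 hours of video at 10 + 30 RTF = 80 hours
total_pm = text_pm + video_pm

# Reviewing automatic drafts is quoted as 30% - 50% of the manual effort.
assisted_low = 0.3 * total_pm
assisted_high = 0.5 * total_pm

print(f"text: {text_pm:.1f} PM, video: {video_pm:.1f} PM, total: {total_pm:.1f} PM")
print(f"with automatic drafts: {assisted_low:.1f} - {assisted_high:.1f} PM")
```

Under these assumptions the totals match the slide: 1.5 PM for texts, 0.5 PM for videos, and 0.6 - 1 PM once automatic drafts are reviewed instead of translating from scratch.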
Overview of automatic video subtitling
• Step-by-step process:
1. Generation of automatic transcriptions from video
2. Manual review of automatic transcriptions to correct transcription errors
3. Generation of automatic translations from manually reviewed transcription
4. Manual review of automatic translations to generate final subtitles
• State-of-the-art technology cannot provide perfect automatic subtitles
• However, it significantly reduces the effort to generate multilingual subtitles
• User effort saving depends on automatic transcription+translation accuracy
• You can help improve transcription and translation accuracy
How to improve transcription accuracy
• Transcription systems learn to transcribe from examples
– At least 50 hours of videos (audio) previously transcribed to learn the acoustic model
– Texts in millions of words to learn the language model
Language      Videos (hours)   Text (Mwords)
Dutch                    532             628
English                  620          464000
Estonian                 130             410
French                    88            1800
German                    36             135
Portuguese                54             573
Italian                   54             868
Slovene                   27             224
Spanish                  128             654
• Adaptation of transcription systems to the specific videos is key for high accuracy
– Availability of videos manually transcribed with similar acoustic conditions
– Availability of text resources related to the video in question
∗ Title is used to retrieve related documents from Google
∗ Slides contain most of the words uttered by the lecturer
∗ Documents: text content from the course, additional text resources (bibliography)
Why automatic transcriptions
• Quality of automatic transcription can be impressive, but it greatly depends on:
– Availability of transcribed videos and related text materials
– Sound quality of the video
– Complexity of language involved (phonetics and grammar)
• All in all, high-accuracy fully automatic transcription is not possible
• Automatic transcriptions need to be manually reviewed
• Reviewing automatic transcription is much faster than doing it from scratch
• Transcriptions are not only needed to generate automatic translations:
– Accessibility for non-native speakers and hearing-impaired people
– Text searchability and analysis
– Summarisation
– Video recommendation and relation
• Reviewed transcriptions are important to generate usable draft automatic translations
Reviewing automatic transcriptions
• Once a video is ingested into the system, a draft transcription is automatically generated
• Transcribed videos are available for review using a web interface
• One more slide, then hands-on practice reviewing an automatic transcription
Evaluating transcription review process
• Review of automatic transcriptions is evaluated from two viewpoints:
– Transcription accuracy
– Time spent to review automatic transcriptions measured as Real Time Factor (RTF)
Language      Accuracy (92%)     RTF (10)
Spanish       Excellent (86%)           3
Estonian      Good (70%)                3
Portuguese    Average (57%)             5
Italian       Good (82%)                5
English       Good (81%)                6
Catalan       Good (83%)                6
Dutch         Good (75%)                6
French        Good (75%)                6
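As a minimal sketch of how the RTF column can be read: RTF is the time spent on the task divided by the duration of the video. The saving below assumes that the 10 in the table header is the from-scratch transcription baseline quoted on the cost slide; the function names and the example timings are hypothetical.

```python
def real_time_factor(review_seconds, video_seconds):
    """RTF: time spent on the task divided by the duration of the video."""
    if video_seconds <= 0:
        raise ValueError("video duration must be positive")
    return review_seconds / video_seconds


def effort_saving(review_rtf, baseline_rtf=10):
    """Fraction of effort saved relative to transcribing from scratch."""
    return 1 - review_rtf / baseline_rtf


# Hypothetical example: a 12-minute video whose draft took 36 minutes to review.
rtf = real_time_factor(review_seconds=36 * 60, video_seconds=12 * 60)
print(f"RTF: {rtf:.1f}")
print(f"saving vs. 10 RTF baseline: {effort_saving(rtf):.0%}")
```

With these made-up timings the review runs at 3 RTF, the same value as the Spanish row, which would correspond to roughly 70% less time than transcribing at 10 RTF.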
Demo on transcription
1. Overview of the Transcription and Translation Platform (ttp.mllp.upv.es)
2. Uploading a video
3. Reviewing video transcription
4. Reviewing video translation
5. Reviewing document translation
How to improve translation accuracy
• Translation systems learn to translate from parallel texts
– Millions of sentences previously translated to learn the translation model
– Texts in millions of words to learn the language model
• Parallel texts are collected from public multilingual organisations (EU, UN, TED, etc.)
• Not all available parallel text is useful for translating your MOOC: domain adaptation is needed
Language pairs        All (Msents)   Selection (Msents)
Dutch-English                 27.3                  1.7
English-Spanish               14.0                  3.2
English-Italian               24.5                  6.4
English-French                28.8                  3.2
Estonian-English              10.5                 10.5
French-English                28.8                  0.5
Portuguese-English            27.5                  6.4
Italian-English               24.5                  6.4
Spanish-English               14.0                  6.4
• Adaptation of translation systems to the domain of the MOOC
– Text of the course to be translated
– Domain-related materials previously translated
– Bibliography of the course in the target language
Reviewing automatic translations
• Speech Recognition technology is in a more mature stage than Machine Translation
• Machine Translation has improved in recent years, but it is still far from perfect
• Quality of automatic translation depends on:
– Proximity between source and target languages
– Complexity of grammar structures used by the speaker
– How specific the vocabulary employed is
– Availability of parallel texts in the same field
• Evaluating translation is cumbersome, since there is no single correct translation
• Translations need to be manually reviewed before they are published
• Reviewing translations is faster than producing them from scratch
Reviewing automatic video translations
• Reviewed video transcriptions are automatically translated into the desired languages
• The same web interface allows you to review source and target subtitles in parallel
• Reviewed subtitles can be exported as SRT files
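For readers unfamiliar with SRT, the sketch below shows what such a file contains: numbered blocks, each with a start and end timecode (HH:MM:SS,mmm) and the subtitle text, separated by blank lines. This is not the platform's export code; the function names, the tuple layout and the example subtitles are assumptions made for illustration.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as the SRT timecode HH:MM:SS,mmm."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(segments):
    """Serialise (start, end, text) tuples, with times in seconds, as SRT text."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        )
    # Joining with a newline leaves one blank line between consecutive blocks.
    return "\n".join(blocks)


# Hypothetical reviewed subtitles, for illustration only.
segments = [
    (0.0, 2.5, "Welcome to this course."),
    (2.5, 6.04, "Today we talk about multilingual MOOCs."),
]
print(to_srt(segments))
```

Because SRT is plain text, the exported files can be loaded by most video players and uploaded alongside the video on common hosting platforms.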
Reviewing automatic document translations
• Text included in the course is ingested into the translation system
• A similar web interface allows you to review source and target texts in parallel
• A preview of source and target texts is also available
• Translated text is imported back into the EMMA platform
Evaluating translation review process
• Review of translations is evaluated from two viewpoints:
– Translation accuracy, automatically computed from a single reference translation
– Time spent to review automatic translations (in RTF)
Language pairs        Accuracy          RTF (30)
Spanish → English     Good (64%)               7
Spanish → Catalan     Excellent (73%)          9
English → Italian     Good (59%)              10
Dutch → English       Good (52%)              13
Italian → English     Good (53%)              14
Estonian → English    Poor (13%)              16
English → Spanish     Good (62%)              17
French → English      Average (22%)           26
Demo on translation
1. Overview of the Transcription and Translation Platform
2. Uploading a video
3. Reviewing video transcription
4. Reviewing video translation
5. Reviewing document translation
Conclusions and Discussion
• Multilingual access to your course boosts visibility
• The cost of manually translating your course is high (2 PM)
• Automatic translation can reduce the time cost to 30% - 50% of the manual effort
• Accuracy of automatic translation depends on several factors:
– Languages involved
– Availability of annotated data resources related to your course
– Specificity of the course
• Designing a multilingual MOOC should also take into account:
– Slides
– Images
– Application interfaces (demos)
– Bibliography
– In general, language-dependent content that is hard or too costly to edit
Thank you for your attention!
Comparative results with YouTube/Google
• Comparison with YouTube in terms of Word Error Rate
Word Error Rate
Language      EMMA    YouTube
Dutch         25.7       38.6
English       39.2       70.8
Italian       28.9       31.6
Portuguese    49.8       62.3
Spanish       14.4       34.3
• Comparison with Google Translate in terms of BLEU
Quality - BLEU
Language pairs          EMMA    Google
Dutch → English         41.6      33.4
English → Spanish       42.5      39.0
Italian → English       46.9      27.9
Portuguese → English    47.6      45.4
Spanish → English       28.2      27.6
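When reading the two comparisons, note that a lower Word Error Rate is better, whereas a higher BLEU score is better. As a minimal sketch of the first metric, WER is the word-level edit distance (substitutions, deletions and insertions) between the automatic transcription and a reference, divided by the number of reference words. The code below is a generic implementation with made-up example sentences, not the evaluation script behind the figures above, and it applies no text normalisation.

```python
def word_error_rate(reference, hypothesis):
    """WER in percent: word-level edit distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # At the start of each outer iteration, previous[j] is the edit distance
    # between the reference words processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return 100.0 * previous[-1] / len(ref)


# Hypothetical example: one substituted word and one missing word out of six.
reference = "the language barrier keeps learners away"
hypothesis = "the language barrier keep learners"
print(f"WER: {word_error_rate(reference, hypothesis):.1f}%")
```

In this example two of the six reference words are wrong or missing, so the WER is about 33%.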