Language use and
preservation online

    Tadej Gregorčič
“Minor” languages

• 6912+ languages altogether
• 3500 spoken by 0.2% of the world’s speakers
• 40% endangered
• Only 600 non-extinct within 100 years?
Endangered languages
Internet

• 90% of content in just 12 languages
• How big an issue is extinction?
• Language transformation vs. transformation
  of old media (TV, newspapers, radio)
• Unicode - first major breakthrough
Slovenian (my language)

• Roughly 2 million speakers
• More speakers than 96% of languages
• Official EU language - enforcement policies
• Endangerment?
Use of foreign words in scientific text where
 appropriate Slovenian counterparts exist.
Preservation of language
The Rosetta Project

• http://rosettaproject.org/
• Publicly accessible digital library
• Aiming to preserve information about
  eventually all human languages
Preservation of knowledge
   contained in a language
• Smithsonian Institution
• Rosetta Project
• UNESCO
• Revitalization (non-extinct)
• Resurrection (extinct)
 • Only successful known example: Hebrew
Keeping use of a language
   viable/economical

• Consistent use
• Dictionaries, tools
• Translation tools
• Advanced language software (TTS, SR)
Language technologies
• Machine translation
• Speech synthesis
• Speech recognition
• ...
• Advance in one field accelerates advances
  in others through increased feasibility
2005

• Systran (fr.)
• Yahoo!, Altavista Babelfish
• Google
• Rule based + statistical approach
Live translation
• Done in 2005 as Ethnocon project
  (presented at MS Imagine Cup)
• Speech recognition (language 1)
• Text machine translation (Systran API)
• Speech synthesis (language 2)
• MT quality poor
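
The three stages above chain into a simple pipeline. The following is a minimal sketch of that chaining only, not the Ethnocon code: the stage functions (`fake_recognize`, `fake_translate`, `fake_synthesize`) and the toy dictionary are hypothetical stand-ins for a real speech recognizer, the Systran API, and a speech synthesizer.

```python
from typing import Callable

# Hypothetical stage signatures; a real system would plug in actual engines.
Recognizer = Callable[[bytes], str]    # audio in language 1 -> text
Translator = Callable[[str], str]      # text in language 1 -> text in language 2
Synthesizer = Callable[[str], bytes]   # text in language 2 -> audio


def live_translate(audio: bytes,
                   recognize: Recognizer,
                   translate: Translator,
                   synthesize: Synthesizer) -> bytes:
    """Run speech recognition, machine translation, then speech synthesis."""
    source_text = recognize(audio)
    target_text = translate(source_text)
    return synthesize(target_text)


# Toy stand-ins so the sketch runs without any speech or MT engine.
def fake_recognize(audio: bytes) -> str:
    # Pretend the "audio" is just UTF-8 text.
    return audio.decode("utf-8")


TOY_DICTIONARY = {"dober": "good", "dan": "day"}  # invented word list


def fake_translate(text: str) -> str:
    # Word-by-word lookup; unknown words pass through unchanged.
    return " ".join(TOY_DICTIONARY.get(w, w) for w in text.lower().split())


def fake_synthesize(text: str) -> bytes:
    # Pretend synthesis just encodes the text.
    return text.encode("utf-8")


result = live_translate("Dober dan".encode("utf-8"),
                        fake_recognize, fake_translate, fake_synthesize)
```

Because each stage feeds the next, an error in recognition or translation carries through to the spoken output, which is one reason poor MT quality limits the whole chain.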
2006+
• Google Translate moves away from Systran
• Google obtained United Nations parallel
  corpora
• Words = data, grammar = code
• Purely statistical approach (a huge amount
  of data, code )
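
One way to picture the purely statistical approach is to estimate word translation probabilities from sentence pairs alone, with no grammar rules written by hand. The sketch below uses IBM Model 1, a classic textbook technique chosen here for illustration; it is not claimed to be what Google Translate runs. The three-sentence Slovenian-English corpus is invented toy data.

```python
from collections import defaultdict

# Toy parallel corpus (invented for illustration): (source words, target words).
CORPUS = [
    ("ta hiša".split(), "this house".split()),
    ("ta knjiga".split(), "this book".split()),
    ("ena knjiga".split(), "one book".split()),
]


def train_ibm1(pairs, iterations=10):
    """Estimate t[(target_word, source_word)] with expectation-maximization."""
    tgt_vocab = {w for _, tgt in pairs for w in tgt}
    # Start from a uniform distribution over target words.
    t = defaultdict(lambda: 1.0 / len(tgt_vocab))
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for src, tgt in pairs:
            for tw in tgt:
                # E-step: share each target word among the source words.
                norm = sum(t[(tw, sw)] for sw in src)
                for sw in src:
                    frac = t[(tw, sw)] / norm
                    count[(tw, sw)] += frac
                    total[sw] += frac
        # M-step: renormalize the expected counts per source word.
        for (tw, sw), c in count.items():
            t[(tw, sw)] = c / total[sw]
    return t


def best_translation(t, src_word):
    """Return the target word with the highest probability for src_word."""
    candidates = {tw: p for (tw, sw), p in t.items() if sw == src_word}
    return max(candidates, key=candidates.get)


model = train_ibm1(CORPUS)
```

Calling `best_translation(model, "knjiga")` picks out the word that co-occurs with "knjiga" most consistently. The only "code" is the counting loop; everything language-specific lives in the data, so more parallel text gives better estimates without changing the program.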
Parallel corpus

• evrokorpus.gov.si
• Translation memory (Trados etc.)
• TM from governmental institutions
• Open TM projects
• Example: the Bible
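
A translation memory can be thought of as a store of previously translated segments, searched for the closest match to a new segment. The sketch below is a minimal illustration under that assumption: the entries, the `tm_lookup` helper, and the 0.7 threshold are all hypothetical, and tools such as Trados use their own matching methods.

```python
from difflib import SequenceMatcher

# Hypothetical translation memory: (source segment, stored translation).
TRANSLATION_MEMORY = [
    ("Dober dan", "Good day"),
    ("Hvala lepa", "Thank you very much"),
    ("Lahko noč", "Good night"),
]


def tm_lookup(segment, memory, threshold=0.7):
    """Return (source, translation, score) for the closest stored segment,
    or None when nothing reaches the similarity threshold."""
    best = None
    best_score = 0.0
    for source, translation in memory:
        # Character-level similarity between 0.0 and 1.0.
        score = SequenceMatcher(None, segment.lower(), source.lower()).ratio()
        if score > best_score:
            best, best_score = (source, translation, score), score
    return best if best_score >= threshold else None


exact = tm_lookup("Hvala lepa", TRANSLATION_MEMORY)
fuzzy = tm_lookup("Hvala lepa!", TRANSLATION_MEMORY)
miss = tm_lookup("zzz", TRANSLATION_MEMORY)
```

An exact hit reuses the stored translation as is, a fuzzy hit gives the translator a draft to correct, and a miss falls through to manual or machine translation.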
Google Translate
Crowdsourcing


• It works (Wikipedia)
• An incorrect translation is a natural
  motivator
• Relatively fast improvement of data
• But: unprofessional
June, 2009
Google Translator Toolkit

• June, 2009 (200+ languages in October)
• “Open Trados”
• Global parallel TM
• Google TT + Google Translate
• 345 languages, 10,664 language pairs
Google Translator Toolkit

• Incentive for professionals: productivity
• Motivated to contribute to global TM
• GT pre-translates text with
• Huge parallel corpora
• Professional translation!
Professional translations are fed into the
 crowdsourced Google Translate parallel
                corpora.

Like Wikipedia with professional editors.

 Huge quality gains over time if Google
      Translator Toolkit takes off.
Results today:
Automatic subtitling




(think hearing impaired users)
Results soon:
AR, “augmented reality”
November 2009




Thank you!

Tadej Gregorčič
Software developer, entrepreneur and amateur linguist




twitter.com/tadej   linkedin.com/in/tadejgregorcic   www.facebook.com/tadej
