Google Summer of Code 2011: UOC & Apertium
1. Pre- and post-editing environment for Apertium
Lluís Villarejo
Learning Technologies
March 2012
2. What is GSoC?
• It's a global program that offers student developers stipends
to write code for various open source software projects.
• Since 2005
• Inspire young developers to participate in OSS projects.
• Give students more exposure to real-world software
development scenarios.
• Get more open source code created and released.
• Help open source projects identify and bring in new developers.
3. Some participants
• Apache Software Foundation
• Debian
• Facebook
• Drupal
• Creative Commons
• DocBook project
• GCC
• GNOME
• Sakai Foundation
• Mozilla
• Inclusive Design Institute
• The Linux Foundation
• The GNU project
• Wikimedia Foundation
• WordPress
• ...
4. How does it work?
• Orgs present themselves as mentoring agents.
• Orgs present a list of potential projects and mentors.
• Accepted orgs should try to attract students' interest.
• Students build project proposals.
• Google funds a number of slots for each org (5,000 + 500 USD per slot).
• Each project community decides which students are assigned to its slots.
• Projects run between the end of May and the end of August.
5. GSoC'11 statistics
• $7.2M budget
• 1115 students accepted from 68 countries
• 2096 mentors and co-mentors from 55 countries
• 175 Open Source organizations
• 18.1% of students have participated in previous years
• 97 countries with student applicants
• 88% overall success rate
7. Why participate with Apertium?
• Strategically:
– Apertium is a strategic agent inside UOC.
– Developing Apertium means further developing
internationalization aids for UOC.
– Attract and onboard new developers for Apertium.
– Collaboration with Google's Open Source initiatives.
• Functionally:
– Opportunity to address specific UOC needs with
external funding.
– Capitalize on specific user feedback on translation quality.
8. The Apertium case
• 20 proposed tasks
• 17 tasks attracted interest from students [1-9]
– The pre- and post-editing environment attracts 11 interested
students.
• Apertium community ranks the 17 tasks
– Pre and post-editing environment ranks 4th
• Google assigns 9 slots to Apertium (49,500 USD)
– Our task is accepted, and Camille Mougey, from the
Grenoble Institute of Technology, is selected.
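As a quick check of the funding figure, the total for Apertium follows from the per-slot amount quoted on the "How does it work?" slide. The sketch below only redoes that arithmetic; the variable names are illustrative, and what each part of the per-slot amount pays for is not stated in these slides.

```python
# Figures taken from the slides; variable names are illustrative only.
BASE_USD = 5_000      # first component of the per-slot amount
EXTRA_USD = 500       # second component of the per-slot amount
APERTIUM_SLOTS = 9    # slots Google assigned to Apertium in 2011

per_slot = BASE_USD + EXTRA_USD
total = APERTIUM_SLOTS * per_slot
print(f"{per_slot} USD per slot, {total} USD in total")
```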
9. Pre- and post-editing, why?
• A significant share of the errors you get when translating a
document is due to deficiencies in the original.
• The integration of existing resources can help to ease this
burden:
– Digital knowledge sources (digital dictionaries... )
– Automatic tools (spell-checker, grammar checker, translation
memory generation, search & replace...)
• These processes should be integrated naturally in the
translation workflow → the need for an integrated web interface
to Apertium.
• To improve the system we need to have access to the human
post-editing process.
10. Pre- and post-editing, features
• Pre- and post-editing web interface integrated with the Apertium translation toolbox.
• Spell checking on source and target languages. Integration with Aspell.
• Grammar checking on source and target languages. Integration with
LanguageTool.
• Integration with several external dictionaries.
• Search & replace functionality on source and target languages.
• Ability to deal with formatted text.
• Logging system. All events are logged as they happen, i.e. at the very moment
the user inserts or deletes text. This allows a later data mining process to
be run on the logs to detect commonly modified structures or vocabulary.
• Translation memory generation. Integration of Maligna.
• PDF translation through pdftohtml.
• Image translation through Tesseract.
Final report 2010
Final report 2011
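The logging feature listed above can be illustrated with a small sketch. It is not the project's actual implementation: the `EditEvent` record layout and the `replay`, `word_replacements` and `mine` helpers are hypothetical names chosen for this example. The idea is to log each insertion or deletion as it happens, replay the log to rebuild the post-edited text, and then count which word spans of the machine translation are replaced most often.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class EditEvent:
    """One logged edit (hypothetical record layout, not Apertium's)."""
    timestamp: float  # seconds since the session started
    kind: str         # "insert" or "delete"
    position: int     # character offset in the text at the time of the edit
    text: str         # inserted text, or the text that was deleted


def replay(original: str, events: list[EditEvent]) -> str:
    """Apply the logged events in time order to rebuild the edited text."""
    text = original
    for ev in sorted(events, key=lambda e: e.timestamp):
        if ev.kind == "insert":
            text = text[:ev.position] + ev.text + text[ev.position:]
        elif ev.kind == "delete":
            text = text[:ev.position] + text[ev.position + len(ev.text):]
        else:
            raise ValueError(f"unknown event kind: {ev.kind!r}")
    return text


def word_replacements(before: str, after: str) -> list[tuple[str, str]]:
    """Return (old words, new words) for every replaced word span."""
    a, b = before.split(), after.split()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (" ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag == "replace"
    ]


def mine(sessions: list[tuple[str, list[EditEvent]]]) -> Counter:
    """Count how often each replacement occurs across logged sessions."""
    counts: Counter = Counter()
    for original, events in sessions:
        counts.update(word_replacements(original, replay(original, events)))
    return counts


if __name__ == "__main__":
    # Invented example: a post-editor fixes a literal translation.
    log = [
        EditEvent(1.0, "delete", 2, "have hunger"),
        EditEvent(2.0, "insert", 2, "am hungry"),
    ]
    print(mine([("I have hunger", log), ("I have hunger", log)]).most_common(1))
```

Replacements that recur across many sessions are candidates for the "commonly modified structures or vocabulary" that the slide mentions; a real log would also need user, language pair and document identifiers, which are left out here.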
11. Results & lessons learned
• Fully functional environment, goals accomplished.
• Feedback on human post-editing behaviour is available
automatically.
• Jointly defined task (flexible framework provided).
• Value of developing strong empathy with the student.
• Motivated and pro-active student.
• Student engagement.
• Very frequent feedback.
• Mentoring team with access to ABSOLUTELY ALL the
information regarding the project.
12. Further work
• Proof of concept accomplished.
• Base platform developed so further work can be easily
added.
• Integration of other resources (more external dictionaries).
• Extension of currently used resources (additional
grammar rules, improved dictionaries, support for more
formats).
• Mining of the logged information to gain deeper insight into
the human post-editing process.
• Use of this mining process to improve the Apertium translation
engine.
13. GSoC 2012
• Mining of the logged information to gain deeper insight into
the human post-editing process.
• Use of this mining process to improve the Apertium translation
engine.
• Post-editing of formatted text.