Language Processing Techniques for Statistical Machine Translation
Contact: Diego Bartolome – dbc@tauyou.com
C/ Les Planes 39, 1o 2a – 08201 Sabadell – Spain
Tel. +34 93 711 29 96
To start ...
… you choose Moses ...
Translation memories + linguistic assets
Cleaning and training following tutorials
BLEU score seems ok in training
… but ...
the results are awful!
Why?
Not enough data
Unclean translation memories
Misalignments
Spelling and grammar errors
Difficult language pairs
Selection of wrong parameters
Application of suboptimal techniques
So many things … what can you do?
Some steps
Maximum exploitation of existing assets
Source content optimization
Data selection and cleaning
Improvement of the models
Linguistic processing
Continuous improvement
Existing assets: increase TM leverage
Translation memory sharing
Clients, Partners, Competitors, EU, UN, TAUS
Relevant on-line data retrieval
Advanced TM techniques
Sub-segment matching
Parts of Speech replacement
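One way to picture sub-segment matching: when no full TM segment matches a new sentence, look for the longest run of consecutive tokens it shares with a TM source segment. The sketch below is only an illustration built on Python's standard difflib module. The function name, the sample TM entries and the min_tokens threshold are assumptions made for this example, not the method used in the talk.

```python
from difflib import SequenceMatcher


def longest_shared_run(new_segment, tm_sources, min_tokens=3):
    """Return (tm_source, shared_tokens) for the longest token run shared
    between new_segment and any TM source, or None if no run reaches
    min_tokens tokens."""
    new_tokens = new_segment.lower().split()
    best = None
    for source in tm_sources:
        tm_tokens = source.lower().split()
        matcher = SequenceMatcher(None, new_tokens, tm_tokens, autojunk=False)
        match = matcher.find_longest_match(0, len(new_tokens), 0, len(tm_tokens))
        if match.size >= min_tokens and (best is None or match.size > len(best[1])):
            best = (source, new_tokens[match.a:match.a + match.size])
    return best


# Hypothetical TM source segments, invented for this example.
tm_sources = [
    "Press the red button to stop the machine",
    "Store the device in a dry place",
]
result = longest_shared_run("Press the red button to start the engine", tm_sources)
no_match = longest_shared_run("Hello world", tm_sources)
```

The shared run could then be pre-translated from the TM while the rest of the sentence goes to the MT engine. How the target side of the run is located is left out of this sketch.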
Source optimization (I): Pre-editing
Spell check
Grammar check
Style check
Terminology check
Client checklist
[Diagram: new doc → proposed doc + HTML report]
Source optimization (II): Summarization
% to reduce
Use translation memories
Project
Client
All
[Diagram: new doc → proposed doc + HTML report]
Summarization example
http://www.translationautomation.com/press-releases/free-open-source-machine-translation-tutorial-is-made-available-by-taus
Data selection and cleaning – a sample
Clean translation memories
Length, punctuation, terminology, repetitions …
Segment splitting
Optimize weight of most frequent n-grams in corpus
Validate their translations
Add out-of-domain data for irrelevant n-grams
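The cleaning checks listed above (length, punctuation, repetitions) can be sketched as a simple filter over (source, target) pairs. This is a minimal illustration, not the pipeline used in the talk: the function name, the thresholds and the sample pairs are assumptions, and terminology checks are left out because they need a client glossary.

```python
END_PUNCT = ".?!:;"


def end_mark(text):
    """Return the final punctuation mark of text, or '' if there is none."""
    return text[-1] if text[-1] in END_PUNCT else ""


def clean_tm(pairs, max_ratio=2.0, max_tokens=80):
    """Drop empty, overlong, length-mismatched, punctuation-mismatched
    and repeated (source, target) pairs. Thresholds are illustrative."""
    seen = set()
    kept = []
    for source, target in pairs:
        source, target = source.strip(), target.strip()
        src_len, tgt_len = len(source.split()), len(target.split())
        if src_len == 0 or tgt_len == 0:
            continue  # one side is empty
        if src_len > max_tokens or tgt_len > max_tokens:
            continue  # very long segments tend to align poorly
        if max(src_len, tgt_len) / min(src_len, tgt_len) > max_ratio:
            continue  # suspicious length ratio, possible misalignment
        if end_mark(source) != end_mark(target):
            continue  # final punctuation differs between the two sides
        if (source, target) in seen:
            continue  # exact repetition
        seen.add((source, target))
        kept.append((source, target))
    return kept


# Invented sample pairs, one per kind of problem.
pairs = [
    ("Press the button.", "Pulse el botón."),
    ("Press the button.", "Pulse el botón."),
    ("Warning", "Advertencia: no abra la tapa mientras el motor está en marcha."),
    ("Is the device on?", "El dispositivo está encendido."),
    ("", "Vacío"),
]
clean = clean_tm(pairs)
```

Only the first pair survives here. Suitable thresholds depend on the language pair, so the defaults above should be treated as starting points.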
Model optimization
Filter the translation tables
Remove the garbage + tune the weights if necessary
Optimize language models
Adapt them to the translation purpose
Tune parameters correctly
Tune set, test set, optimization parameters …
Improve recasing
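As an illustration of "remove the garbage" from a translation table, the sketch below prunes a Moses-style text phrase table. It assumes each line has the form "source ||| target ||| scores ..." and that the third score is the direct probability p(target|source). Both points should be checked against the actual model before use, and the function name, thresholds and sample entries are hypothetical.

```python
def filter_phrase_table(lines, min_direct_prob=0.01, top_n=20):
    """Keep, per source phrase, at most top_n entries whose assumed direct
    probability p(target|source) is at least min_direct_prob."""
    by_source = {}
    for line in lines:
        fields = [field.strip() for field in line.split("|||")]
        if len(fields) < 3:
            continue  # malformed line
        scores = [float(score) for score in fields[2].split()]
        if len(scores) < 3:
            continue  # not enough scores to read the assumed column
        direct_prob = scores[2]  # ASSUMPTION: third score is p(target|source)
        if direct_prob >= min_direct_prob:
            by_source.setdefault(fields[0], []).append((direct_prob, line))
    kept = []
    for entries in by_source.values():
        entries.sort(key=lambda entry: entry[0], reverse=True)
        kept.extend(line for _, line in entries[:top_n])
    return kept


# Invented phrase table entries for one source phrase.
table = [
    "the house ||| la casa ||| 0.8 0.7 0.9 0.6",
    "the house ||| el hogar ||| 0.1 0.2 0.05 0.1",
    "the house ||| , ||| 0.0001 0.0001 0.0002 0.0001",
]
filtered = filter_phrase_table(table)
best_only = filter_phrase_table(table, top_n=1)
```

Pruning of this kind shrinks the table and removes unlikely options. The weights would normally be re-tuned afterwards, as the slide notes.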
Linguistic processing
In the source and/or target language
Grammar checking
Entity detection
proper nouns, alphanumeric words, numbers, ...
Compound word splitting
Sentence reordering
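Entity detection for numbers and alphanumeric words can be sketched as masking before translation and restoring afterwards, so that the engine never has to translate them. The regular expression, the placeholder format and the function names below are assumptions for illustration. The sketch also assumes the engine passes placeholders through unchanged, which would have to be arranged separately.

```python
import re

# Alphanumeric words containing at least one digit, e.g. "335102" or "444mg".
ENTITY_RE = re.compile(r"\b\w*\d\w*\b")


def mask_entities(text):
    """Replace digit-bearing tokens with numbered placeholders.
    Returns the masked text and the list of original tokens."""
    entities = []

    def replace(match):
        entities.append(match.group(0))
        return f"__ENT{len(entities) - 1}__"

    return ENTITY_RE.sub(replace, text), entities


def unmask_entities(text, entities):
    """Put the original tokens back in place of their placeholders."""
    for index, entity in enumerate(entities):
        text = text.replace(f"__ENT{index}__", entity)
    return text


original = "XXX 335102 doses at 33, 88 and 444mg/kg/day"
masked, entities = mask_entities(original)
restored = unmask_entities(masked, entities)
```

Proper nouns would need a different detector, such as a name list or a tagger, which is outside the scope of this sketch.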
Continuous improvement
Qualitative feedback from translators
Reports
Automatic post-processing with machine translation + post-edited segments
An example from
Source
XXX 335102 doses are calculated as a free acid of the sodium salt (NA).
The potential toxicity of XXX 335102 was studied in a number of acute toxicity studies in mouse and rat
and repeat dose toxicity studies of 8 and 32 weeks each in rat and monkeys.
XXX 335102 was negative in a panel of in vivo and in vitro tests to assess mutagenicity and
clastogenicity identifying no genotoxic risks for human subjects.
An in vitro assay for phototoxic potential suggested that XXX 335102 is photoxic/photosensitive.
In the 8-week studies in monkeys, increases in unconjugated bilirubin were noted at the doses tested
(33, 88, 192 and 444mg/kg/day); the greatest increases occurring at Week 4 and declining or returning
to control levels by Week 8.
Reference
Las dosis de XXX 335102 se calculan como la sal sódica sin ácido (AS).
La toxicidad potencial de XXX 335102 se estudió en varios estudios de toxicidad aguda en ratones y
ratas y en estudios de toxicidad con administración repetida de 8 y 32 semanas en ratas y monos.
Se obtuvieron resultados negativos en un grupo de pruebas in vivo e in vitro para evaluar su mutagenia
y clastogenia, sin identificarse riesgos genotóxicos para el ser humano.
En un estudio in vitro de su potencial fototóxico se sugirió que XXX 335102 es fototóxico o
fotosensibilizador.
En los estudios de 8 semanas en monos se apreció el aumento de la bilirrubina no conjugada con las
dosis estudiadas (33, 88, 192 y 444 mg/kg/día), produciéndose el mayor incremento en la semana 4 y
disminuyendo o volviendo a los niveles de control en la semana 8.
Generic engine
XXX 335102 se calculan en forma de dosis de ácido libre del sodio sal (NA).
La Toxicidad potencial de XXX 335102 fue estudiado en una serie de estudios de toxicidad aguda en
ratón y rata y vuelva a dosis estudios de toxicidad, de 8 y de 32 semanas en rata y cada uno de los
monos.
XXX 335102 era negativo en un grupo de in vivo y pruebas in vitro para evaluar mutagenicidad y
genotóxicas clastogenicity no identificar los riesgos para los participantes humanos.
Un para fines de ensayo in vitro phototoxic potencial se sugirió que XXX 335102
photoxic/Photosensitive.
En Los 8 -week estudios en los monos, aumentos en unconjugated bilirrubina salieron a las dosis
analizada (33, 88, 192 y 444 mg/kg/día); los mayores incrementos habidos En la semana 4 y la
reducción o devolver a nivel de control de 8 Por semana.
Medical engine with improvements
Las dosis XXX 335102 se calculan como ácido libre de la sal sódica (AS).
La toxicidad potencial de XXX 335102 se estudió en varios estudios de toxicidad aguda en ratones y
ratas y en estudios de toxicidad con administración repetida de 8 y 32 semanas en ratas y monos.
XXX 335102 dio negativo en un grupo de pruebas in vivo e in vitro para evaluar su mutagenia y
clastogenia, sin identificarse riesgos genotóxicos para el ser humano.
En un estudio in vitro de su potencial fototóxico se sugirió que XXX 335102 es fototóxico o
fotosensibilizador.
En los estudios de 8 semanas en monos, el aumento de la bilirrubina no conjugada con las dosis
estudiadas (33, 88, 192 y 444 mg/kg/día); el mayor incremento en la semana 4 y disminuyendo o
volviendo a los niveles de control en la semana 8.
Conclusions
MT can be combined with other advanced techniques
Creating and improving an engine requires time
You can also get lucky on the first try!
Optimum results require translators
Implementation of linguistic knowledge
Continuous improvement

Source deck: 2012 MosesCore LocWorld Seattle: Language Processing Techniques for Statistical Machine Translation
 
Guía de Registro slideshare paso a paso 1
Guía de Registro slideshare paso a paso 1Guía de Registro slideshare paso a paso 1
Guía de Registro slideshare paso a paso 1
 
David_Gallegos - tarea de la sesión 11.pptx
David_Gallegos - tarea de la sesión 11.pptxDavid_Gallegos - tarea de la sesión 11.pptx
David_Gallegos - tarea de la sesión 11.pptx
 
PLANEACION DE CLASES TEMA TIPOS DE FAMILIA.docx
PLANEACION DE CLASES TEMA TIPOS DE FAMILIA.docxPLANEACION DE CLASES TEMA TIPOS DE FAMILIA.docx
PLANEACION DE CLASES TEMA TIPOS DE FAMILIA.docx
 
Trabajando con Formasy Smart art en power Point
Trabajando con Formasy Smart art en power PointTrabajando con Formasy Smart art en power Point
Trabajando con Formasy Smart art en power Point
 
Inteligencia Artificial. Matheo Hernandez Serrano USCO 2024
Inteligencia Artificial. Matheo Hernandez Serrano USCO 2024Inteligencia Artificial. Matheo Hernandez Serrano USCO 2024
Inteligencia Artificial. Matheo Hernandez Serrano USCO 2024
 
LAS_TIC_COMO_HERRAMIENTAS_EN_LA_INVESTIGACIÓN.pptx
LAS_TIC_COMO_HERRAMIENTAS_EN_LA_INVESTIGACIÓN.pptxLAS_TIC_COMO_HERRAMIENTAS_EN_LA_INVESTIGACIÓN.pptx
LAS_TIC_COMO_HERRAMIENTAS_EN_LA_INVESTIGACIÓN.pptx
 
LINEA DE TIEMPO LITERATURA DIFERENCIADO LITERATURA.pptx
LINEA DE TIEMPO LITERATURA DIFERENCIADO LITERATURA.pptxLINEA DE TIEMPO LITERATURA DIFERENCIADO LITERATURA.pptx
LINEA DE TIEMPO LITERATURA DIFERENCIADO LITERATURA.pptx
 
certificado de oracle academy cetrificado.pdf
certificado de oracle academy cetrificado.pdfcertificado de oracle academy cetrificado.pdf
certificado de oracle academy cetrificado.pdf
 
Slideshare y Scribd - Noli Cubillan Gerencia
Slideshare y Scribd - Noli Cubillan GerenciaSlideshare y Scribd - Noli Cubillan Gerencia
Slideshare y Scribd - Noli Cubillan Gerencia
 
Agencia Marketing Branding Google Workspace Deployment Services Credential Fe...
Agencia Marketing Branding Google Workspace Deployment Services Credential Fe...Agencia Marketing Branding Google Workspace Deployment Services Credential Fe...
Agencia Marketing Branding Google Workspace Deployment Services Credential Fe...
 

2012 MosesCore LocWorld Seattle: Language Processing Techniques for Statistical Machine Translation

  • 1. Language Processing Techniques for Statistical Machine Translation Contact: Diego Bartolome – dbc@tauyou.com C/ Les Planes 39, 1o 2a – 08201 Sabadell – Spain Tel. +34 93 711 29 96
  • 2. To start ...
  • 3. … you choose Moses ... Translation memories + linguistic assets; cleaning and training following the tutorials; the BLEU score seems OK in training … but ... the results are awful!
  • 4. Why? Not enough data; unclean translation memories; misalignments; spelling and grammar errors; difficult language pairs; selection of wrong parameters; application of suboptimal techniques. So many things … what can you do?
  • 6. Some steps: maximum exploitation of existing assets; source content optimization; data selection and cleaning; improvement of the models; linguistic processing; continuous improvement.
  • 7. Existing assets: increase TM leverage. Translation memory sharing (clients, partners, competitors, EU, UN, TAUS); relevant on-line data retrieval; advanced TM techniques (sub-segment matching, part-of-speech replacement).
  • 8. Source optimization (I): Pre-editing. Spell check; grammar check; style check; terminology check; client checklist. Workflow: new doc → proposed doc + HTML report.
  • 9. Source optimization (II): Summarization. % to reduce; use translation memories (project, client, all). Workflow: new doc → proposed doc + HTML report.
  • 11. Data selection and cleaning – a sample. Clean translation memories (length, punctuation, terminology, repetitions …); segment splitting; optimize the weight of the most frequent n-grams in the corpus (validate their translations); add out-of-domain data for irrelevant n-grams.
  • 12. Models optimization. Filter the translation tables (remove the garbage + tune the weights if necessary); optimize language models (adapt them to the translation purpose); tune parameters correctly (tune set, test set, optimization parameters …); improve recasing.
  • 13. Linguistic processing, in the source and/or target language. Grammar checking; entity detection (proper nouns, alphanumeric words, numbers, ...); compound word splitting; sentence reordering.
  • 14. Continuous improvement. Qualitative feedback from translators; reports; automatic post-processing with machine translation + post-edited segments.
  • 15. An example from
    Source: XXX 335102 doses are calculated as a free acid of the sodium salt (NA). The potential toxicity of XXX 335102 was studied in a number of acute toxicity studies in mouse and rat and repeat dose toxicity studies of 8 and 32 weeks each in rat and monkeys. XXX 335102 was negative in a panel of in vivo and in vitro tests to assess mutagenicity and clastogenicity identifying no genotoxic risks for human subjects. An in vitro assay for phototoxic potential suggested that XXX 335102 is photoxic/photosensitive. In the 8-week studies in monkeys, increases in unconjugated bilirubin were noted at the doses tested (33, 88, 192 and 444mg/kg/day); the greatest increases occurring at Week 4 and declining or returning to control levels by Week 8.
    Reference: Las dosis de XXX 335102 se calculan como la sal sódica sin ácido (AS). La toxicidad potencial de XXX 335102 se estudió en varios estudios de toxicidad aguda en ratones y ratas y en estudios de toxicidad con administración repetida de 8 y 32 semanas en ratas y monos. Se obtuvieron resultados negativos en un grupo de pruebas in vivo e in vitro para evaluar su mutagenia y clastogenia, sin identificarse riesgos genotóxicos para el ser humano. En un estudio in vitro de su potencial fototóxico se sugirió que XXX 335102 es fototóxico o fotosensibilizador. En los estudios de 8 semanas en monos se apreció el aumento de la bilirrubina no conjugada con las dosis estudiadas (33, 88, 192 y 444 mg/kg/día), produciéndose el mayor incremento en la semana 4 y disminuyendo o volviendo a los niveles de control en la semana 8.
  • 16. Generic engine: XXX 335102 se calculan en forma de dosis de ácido libre del sodio sal (NA). La Toxicidad potencial de XXX 335102 fue estudiado en una serie de estudios de toxicidad aguda en ratón y rata y vuelva a dosis estudios de toxicidad, de 8 y de 32 semanas en rata y cada uno de los monos. XXX 335102 era negativo en un grupo de in vivo y pruebas in vitro para evaluar mutagenicidad y genotóxicas clastogenicity no identificar los riesgos para los participantes humanos. Un para fines de ensayo in vitro phototoxic potencial se sugirió que XXX 335102 photoxic/Photosensitive. En Los 8 -week estudios en los monos, aumentos en unconjugated bilirrubina salieron a las dosis analizada (33, 88, 192 y 444 mg/kg/día); los mayores incrementos habidos En la semana 4 y la reducción o devolver a nivel de control de 8 Por semana.
    Medical engine with improvements: Las dosis XXX 335102 se calculan como ácido libre de la sal sódica (AS). La toxicidad potencial de XXX 335102 se estudió en varios estudios de toxicidad aguda en ratones y ratas y en estudios de toxicidad con administración repetida de 8 y 32 semanas en ratas y monos. XXX 335102 dio negativo en un grupo de pruebas in vivo e in vitro para evaluar su mutagenia y clastogenia, sin identificarse riesgos genotóxicos para el ser humano. En un estudio in vitro de su potencial fototóxico se sugirió que XXX 335102 es fototóxico o fotosensibilizador. En los estudios de 8 semanas en monos, el aumento de la bilirrubina no conjugada con las dosis estudiadas (33, 88, 192 y 444 mg/kg/día); el mayor incremento en la semana 4 y disminuyendo o volviendo a los niveles de control en la semana 8.
  • 17. Reference vs. Medical engine with improvements, shown side by side (the same two texts as on slides 15 and 16).
  • 18. Conclusions. MT can be combined with other advanced techniques; creating and improving an engine requires time (you can also be lucky at the first try!); the optimum results require translators (implementation of the linguistic knowledge, continuous improvement).
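
Slide 11 lists length, punctuation and repetitions among the criteria for cleaning translation memories. The slides do not say how the checks are implemented, so the following is only a minimal sketch of what such a filter might look like. The function name `clean_tm`, the thresholds `max_tokens` and `max_ratio`, and the specific punctuation rule are all assumptions made for illustration, not the author's method.

```python
def clean_tm(pairs, max_tokens=80, max_ratio=3.0):
    """Keep (source, target) pairs that pass simple length,
    punctuation and repetition checks. Thresholds are illustrative."""
    end_marks = ".?!:;"
    seen = set()
    kept = []
    for src, tgt in pairs:
        # Normalise whitespace before any comparison.
        src = " ".join(src.split())
        tgt = " ".join(tgt.split())
        n_src, n_tgt = len(src.split()), len(tgt.split())
        # Length: drop empty or overly long segments.
        if n_src == 0 or n_tgt == 0:
            continue
        if n_src > max_tokens or n_tgt > max_tokens:
            continue
        # Length ratio: a large mismatch often signals a misalignment.
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
            continue
        # Punctuation: both sides should agree on ending with a mark.
        if (src[-1] in end_marks) != (tgt[-1] in end_marks):
            continue
        # Repetitions: keep only the first copy of each pair.
        key = (src.lower(), tgt.lower())
        if key in seen:
            continue
        seen.add(key)
        kept.append((src, tgt))
    return kept
```

A terminology check, also named on the slide, would need a client glossary and is left out here. In practice the thresholds would be tuned per language pair, since typical length ratios differ between languages.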
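
Slide 12 mentions filtering the translation tables to "remove the garbage". One common way to do this, sketched below, is to drop low-probability entries and keep only the best few target phrases per source phrase. The sketch assumes a text phrase table with fields separated by `|||` and the direct phrase translation probability as the third score; check this against the table your own training run produced. The function name and the `min_score` and `top_n` defaults are hypothetical, and this is not one of the filtering tools shipped with Moses.

```python
from collections import defaultdict


def filter_phrase_table(lines, score_index=2, min_score=0.01, top_n=20):
    """Keep, per source phrase, at most top_n entries whose chosen
    score is at least min_score. Malformed lines are skipped."""
    by_source = defaultdict(list)
    for line in lines:
        fields = [f.strip() for f in line.split("|||")]
        if len(fields) < 3:
            continue  # not a phrase-table entry
        try:
            scores = [float(s) for s in fields[2].split()]
        except ValueError:
            continue  # unreadable score field
        if score_index >= len(scores):
            continue
        score = scores[score_index]
        if score >= min_score:
            by_source[fields[0]].append((score, line.rstrip("\n")))
    kept = []
    # Sources are emitted in first-seen order, best entries first.
    for entries in by_source.values():
        best = sorted(entries, key=lambda e: e[0], reverse=True)[:top_n]
        kept.extend(entry for _, entry in best)
    return kept
```

After filtering, the decoder weights may need retuning, which is what the slide's "tune the weights if necessary" appears to refer to.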
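
Slide 13 lists entity detection for alphanumeric words and numbers. A frequent use of such detection is to replace those tokens with placeholders before translation and restore them afterwards, so that values like the doses in the example on slide 15 pass through unchanged. The slides do not describe the mechanism, so this is a sketch under that assumption; the regular expression, the `__ENTn__` placeholder format and both function names are hypothetical.

```python
import re

# Any token containing a digit, optionally followed by decimal groups.
_ENTITY = re.compile(r"\b\w*\d\w*(?:[.,]\d+)*")
_PLACEHOLDER = re.compile(r"__ENT(\d+)__")


def mask_entities(text):
    """Replace digit-bearing tokens with numbered placeholders.
    Returns the masked text and the list of original tokens."""
    entities = []

    def _store(match):
        entities.append(match.group(0))
        return f"__ENT{len(entities) - 1}__"

    return _ENTITY.sub(_store, text), entities


def unmask_entities(text, entities):
    """Put the original tokens back. Unknown placeholders are left
    untouched rather than raising an error."""

    def _restore(match):
        index = int(match.group(1))
        return entities[index] if index < len(entities) else match.group(0)

    return _PLACEHOLDER.sub(_restore, text)
```

Because the placeholders are numbered, they can be restored correctly even when the translation reorders them. Proper nouns, also named on the slide, would need a separate detector.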