The NLP pipeline in
Machine Translation
Mārcis Pinnis
Overview
• Short introduction
• The NLP Pipeline in Machine Translation
• Selected tasks that are relevant for others (not MT developers)
• Example of data pre-processing using publicly available tools
• NLP pipelines for English and Latvian
Who am I?
• My name is Mārcis Pinnis
• I am a researcher at Tilde
• I have worked on language
technologies for over 10 years
• Currently, my research focus is
(neural) machine translation (MT)
• You can find out more about my research and
what we do at Tilde on:
Tilde MT, Neural MT
The “term” cloud summarises the
topics of my last five publications
What is machine translation?
And ... what are its main goals?
Machine Translation (MT) is «the use of computers to
automate translation from one language to another»
/Jurafsky & Martin, 2009/
What is Machine Translation?
Statistical MT
Neural MT
Bing
Google
Language technologies
are important in human-
computer interaction.
Valodas tehnoloģijas ir svarīga
cilvēka-datora mijiedarbībā.
Valodu tehnoloģijas ir svarīgs
cilvēka-datora mijiedarbībā.
Valodu tehnoloģijas ir svarīgas
cilvēka un datora mijiedarbībai.
Valodas tehnoloģijas ir svarīgas
human-computer interaction.
What are the main goals of machine
translation?
• To lower language barriers in communication
• To provide access to information written in an unknown language
• To increase productivity, e.g., of professional translators
The NLP Pipeline
From the very basic to the more challenging tasks
What to Start with?
• Imagine that you have to translate the following text
• What will you do first?
The European Union (EU) is a political and economic union of 28
member states that are located primarily in Europe. It has an
area of 4,475,757 km2 (1,728,099 sq mi), and an estimated
population of over 510 million.
Sentence Breaking
• First, we will split the text into sentences.
• Most basic NLP tools work with individual sentences, so this is
a mandatory step.
The European Union (EU) is a political and economic union of 28
member states that are located primarily in Europe. It has an
area of 4,475,757 km2 (1,728,099 sq mi), and an estimated
population of over 510 million.
The European Union (EU) is a political and economic union of 28
member states that are located primarily in Europe.
It has an area of 4,475,757 km2 (1,728,099 sq mi), and an
estimated population of over 510 million.
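A minimal sketch of rule-based sentence breaking, assuming that a sentence ends with «.», «!» or «?» followed by whitespace and an ASCII uppercase letter. The function name, the abbreviation list and the rule itself are illustrative assumptions, not the tool used in this talk; production sentence breakers handle many more cases.

```python
import re

# Assumed, non-exhaustive list of abbreviations that end with a period
# but do not end a sentence.
ABBREVIATIONS = {"e.g.", "i.e.", "etc.", "mr.", "mrs.", "dr.", "sq."}


def split_sentences(text):
    """Split text into sentences with one simple punctuation rule."""
    text = " ".join(text.split())  # undo hard line wraps
    sentences, start = [], 0
    # Candidate boundary: '.', '!' or '?' + whitespace + ASCII uppercase letter.
    for match in re.finditer(r"[.!?]\s+(?=[A-Z])", text):
        candidate = text[start:match.start() + 1]
        last_word = candidate.split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue  # the period belongs to an abbreviation
        sentences.append(candidate)
        start = match.end()
    if start < len(text):
        sentences.append(text[start:])  # whatever remains is the last sentence
    return sentences


if __name__ == "__main__":
    example = (
        "The European Union (EU) is a political and economic union of 28 "
        "member states that are located primarily in Europe. It has an "
        "area of 4,475,757 km2 (1,728,099 sq mi), and an estimated "
        "population of over 510 million."
    )
    for sentence in split_sentences(example):
        print(sentence)
```

The uppercase check only covers A-Z, so a sentence that starts with, e.g., «Ā» would be missed; a real tool needs language-specific rules.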
Tokenisation
• Then, we will split the text into
primitive textual units - tokens.
The European Union (EU) is a political and economic union of 28
member states that are located primarily in Europe.
It has an area of 4,475,757 km2 (1,728,099 sq mi), and an
estimated population of over 510 million.
The European Union ( EU ) is a political and economic union of 28
member states that are located primarily in Europe .
It has an area of 4,475,757 km 2 ( 1,728,099 sq mi ) , and an
estimated population of over 510 million .
Token types: WORD, PUNCTUATION, NUMERAL, ID, SYMBOL, URL, XML, DATE, TIME, EMAIL, SMILEY, HASHTAG, CASHTAG, MENTION, RETWEET, OTHER
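A rough sketch of a regular-expression tokeniser that also assigns a token type. It covers only a subset of the types listed above, and all patterns are simplified assumptions made for illustration; it is not the tokeniser that produced the example.

```python
import re

# Ordered (type, pattern) pairs: earlier patterns win.
# The patterns are simplified assumptions, not a complete specification.
TOKEN_PATTERNS = [
    ("URL", r"https?://\S+"),
    ("EMAIL", r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    ("HASHTAG", r"#\w+"),
    ("MENTION", r"@\w+"),
    ("NUMERAL", r"\d+(?:[.,]\d+)*"),
    ("WORD", r"[^\W\d_]+(?:-[^\W\d_]+)*"),
    ("PUNCTUATION", r"[.,;:!?()\[\]-]"),
    ("SYMBOL", r"[^\w\s]"),
    ("OTHER", r"\S"),
]
TOKEN_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_PATTERNS)
)


def tokenise(sentence):
    """Return a list of (token, type) pairs for one sentence."""
    return [(m.group(), m.lastgroup) for m in TOKEN_RE.finditer(sentence)]


if __name__ == "__main__":
    sentence = (
        "It has an area of 4,475,757 km2 (1,728,099 sq mi), and an "
        "estimated population of over 510 million."
    )
    for token, token_type in tokenise(sentence):
        print(f"{token}\t{token_type}")
```

Like the example above, this sketch splits «km2» into «km» and «2», because a word pattern without digits is tried separately from the numeral pattern.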
Now that the text is broken into tokens, let us
look at some examples...
Can you guess the translations of the following phrases?
control system
dry clothes
moving man
The phrases are ambiguous!
Possible translations may be...
control system = kontroles sistēma (noun phrase) or pārbaudi sistēmu («check the system»)
dry clothes = sausas drēbes (adjective + noun) or žāvē drēbes (verb + noun)
moving man = aizkustinošs vīrs (an emotionally touching man) or kustīgs cilvēks (an active person)
Note: the phrases are way more ambiguous
than the two examples given!
The context is important!
To control system parameters, open Settings
Dry clothes can be taken out of the drier
He was a very moving man
In some cases, morphological disambiguation
helps us to choose better translations.
Morphological Analysis
• Gives insight into the morphological ambiguity of words
• Lists «all» possible (morphological) analyses of a word
• Is often limited to a fixed vocabulary
control: Noun, singular; Verb, first person, simple present; Verb, second person, simple present; ...
dry: Verb, second person, simple present; Adjective, positive; Verb, first person, simple present; ...
moving: Noun, singular; Adjective; Verb; ...
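A toy sketch of vocabulary-based morphological analysis: a dictionary lookup that returns every analysis stored for a word form. The lexicon holds only the (incomplete) analyses from the example above, and the names LEXICON and analyse are hypothetical.

```python
# Toy lexicon: the analyses are copied from the example above and are
# deliberately incomplete (the original lists end with "...").
LEXICON = {
    "control": [
        "Noun, singular",
        "Verb, first person, simple present",
        "Verb, second person, simple present",
    ],
    "dry": [
        "Verb, second person, simple present",
        "Adjective, positive",
        "Verb, first person, simple present",
    ],
    "moving": ["Noun, singular", "Adjective", "Verb"],
}


def analyse(word):
    """Return all stored analyses of a word form.

    Out-of-vocabulary words yield an empty list, which illustrates why
    an analyser that is limited to a vocabulary cannot cover every word.
    """
    return LEXICON.get(word.lower(), [])


if __name__ == "__main__":
    for word in ["control", "Dry", "moving", "tablet"]:
        print(word, "->", analyse(word) or "unknown word")
```

The analyser does not choose between the readings; that is the job of the tagger, which uses the context.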
Morphological Tagging
• Performs morphological disambiguation of
words using the context in which they occur
• We have solved the morphological ambiguity of «dry»
• When selecting translation equivalents for the word,
we will be able to take the disambiguated data into
account
• E.g., «dry» = «sauss» and «dry» != «žāvēt»
Dry/JJ clothes/NNS can/MD be/VB taken/VBN out/RP of/IN the/DT drier/NN
JJ = adjective
NNS = noun, plural
MD = modal verb
VB = verb
VBN = verb, past participle
RP = particle
IN = preposition
DT = determiner
NN = noun, singular
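A deliberately simple sketch of context-based disambiguation. It assumes a hand-written table of candidate tags and a single heuristic: a word that may be JJ or VB is tagged JJ when it is followed by a noun and then by a modal or a verb. Real taggers are statistical or neural; the table, the rule and the function names here are assumptions made for illustration.

```python
# Hypothetical candidate tags per word, using the tag names from the
# legend above. Only "dry" is kept ambiguous to keep the sketch short.
CANDIDATES = {
    "dry": {"JJ", "VB"},
    "clothes": {"NNS"},
    "can": {"MD"},
    "be": {"VB"},
    "taken": {"VBN"},
    "out": {"RP"},
    "of": {"IN"},
    "the": {"DT"},
    "drier": {"NN"},
}


def options_at(tokens, i):
    """Candidate tags of token i; unknown words default to NN (an assumption)."""
    if 0 <= i < len(tokens):
        return CANDIDATES.get(tokens[i].lower(), {"NN"})
    return set()


def tag(tokens):
    """Assign one tag per token using the context of the next two tokens."""
    tags = []
    for i in range(len(tokens)):
        options = options_at(tokens, i)
        if len(options) == 1:
            tags.append(next(iter(options)))
        elif options == {"JJ", "VB"}:
            noun_follows = bool(options_at(tokens, i + 1) & {"NN", "NNS"})
            verb_after_noun = bool(options_at(tokens, i + 2) & {"MD", "VB"})
            tags.append("JJ" if noun_follows and verb_after_noun else "VB")
        else:
            tags.append(sorted(options)[0])  # arbitrary but deterministic fallback
    return tags


if __name__ == "__main__":
    sentence = "Dry clothes can be taken out of the drier".split()
    print(" ".join(f"{w}/{t}" for w, t in zip(sentence, tag(sentence))))
```

With this rule, the bare phrase «Dry clothes» keeps the verb reading, while the full sentence gets the adjective reading, which mirrors why the context matters.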
The context continues to be more important!
• Often it is not enough to perform morphological disambiguation
• We need to understand the words in a context
• We need to figure out which word modifies or depends on which
other word in order to:
• translate words in the correct order
• translate words in the correct inflected forms
Syntactic Parsing
• We use a syntactic parser to tell us
which words depend on which other
words in a sentence and how phrases
are structured
• We now know that «dry» is an
adjectival modifier of «clothes»
• From this we can conclude that,
e.g.:
«dry» = «sausas» or «sausās»
«dry» != «sausām» or «sauso»
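A small sketch of how a dependency parse can be stored and queried. Only the adjectival-modifier arc between «Dry» and «clothes» comes from the example; the other arcs and the relation names are assumptions in the style of Stanford dependencies, and Arc and modifiers_of are hypothetical names. Words are identified by their surface form, which only works when no word is repeated.

```python
from typing import NamedTuple


class Arc(NamedTuple):
    dependent: str
    relation: str
    head: str


# Partial, hand-written parse of "Dry clothes can be taken out of the drier".
# Only the amod arc is stated in the text above; the remaining arcs are
# assumed for illustration.
ARCS = [
    Arc("Dry", "amod", "clothes"),
    Arc("clothes", "nsubjpass", "taken"),
    Arc("the", "det", "drier"),
]


def modifiers_of(head, arcs, relation=None):
    """Return the dependents of a head word, optionally filtered by relation."""
    return [
        arc.dependent
        for arc in arcs
        if arc.head == head and (relation is None or arc.relation == relation)
    ]


if __name__ == "__main__":
    # Which adjectives modify "clothes"? Their translations have to agree
    # with the translation of "clothes".
    print(modifiers_of("clothes", ARCS, relation="amod"))
```

Once the head of «dry» is known, the MT system can restrict the translation candidates to forms that agree with the translated head noun.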
The example has been parsed with the Stanford Parser. You can try it here: http://corenlp.run/
[Figure: constituency tree and dependency tree of the example sentence]
What is missing?
• We solved:
• the morphological ambiguity
• the syntactic ambiguity
• What next?
• How would you translate
this sentence?
He took a tablet
Terms and named entities
• There is actually context missing to identify the intended meaning, right?
• Term recognition (TR) and named entity recognition (NER) tools allow us
to perform semantic disambiguation
Stīvs Gulbis vinnēja loterijā
With NER: Stīvs Gulbis [PERSON] vinnēja loterijā = Stivs Gulbis won the lottery
Without NER: Stīvs Gulbis vinnēja loterijā = A stiff swan won the lottery
He took a tablet
With TR (tablet = tablete): He took a tablet = Viņš iedzēra tableti
Without TR: He took a tablet = Viņš paņēma planšeti
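One common way to use NER output in MT is to replace recognised entities with placeholders before translation and to put them back afterwards. The sketch below assumes that approach; it is not necessarily what is done in the systems discussed here. The entity list is supplied by hand instead of coming from an NER tool, and the function names and the placeholder format are hypothetical.

```python
def protect_entities(sentence, entities):
    """Replace each listed entity with a placeholder.

    Returns the masked sentence and a placeholder-to-entity mapping.
    """
    mapping = {}
    for i, entity in enumerate(entities):
        placeholder = f"__ENT{i}__"
        if entity in sentence:
            sentence = sentence.replace(entity, placeholder)
            mapping[placeholder] = entity
    return sentence, mapping


def restore_entities(translation, mapping):
    """Put the original entities back into the translated sentence."""
    for placeholder, entity in mapping.items():
        translation = translation.replace(placeholder, entity)
    return translation


if __name__ == "__main__":
    masked, mapping = protect_entities(
        "Stīvs Gulbis vinnēja loterijā", ["Stīvs Gulbis"]
    )
    print(masked)
    # Stand-in for the MT system: a hard-coded translation of the masked text.
    translated = "__ENT0__ won the lottery"
    print(restore_entities(translated, mapping))
```

Because the name never reaches the translation step, it cannot be translated word by word as «a stiff swan».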
Semantic Analysis
• For some tasks (e.g., question answering), natural language understanding
requires semantic parsing.
• E.g., shallow semantic parsing (a.k.a. semantic role labelling) allows us to
analyse meaning by identifying predicates and their arguments in a sentence
• However, in MT, we tend not to go this deep in text analysis (one reason is that
these tools are not widely available for many languages)
The example has been created with the Semantic Role Labeling Demo of University of Illinois at Urbana-
Champaign. You can try it here: http://cogcomp.cs.illinois.edu/page/demo_view/srl
The Building Blocks of NLP
Pragmatics
Semantics
Syntax
Morphology
Were these all tasks of
NLP?
• Obviously, not!
• Other tasks that were not discussed are, e.g.:
• Anaphora resolution
• Coreference analysis
• Detokenisation
• Language identification
• Semantic role labelling (a.k.a. shallow semantic parsing)
• Sentiment analysis
• Stemming
• Truecasing and recasing
• Word segmentation (e.g., for Arabic, Japanese)
• Word sense disambiguation
• Word splitting (e.g., into sub-word units or compound splitting)
• Etc.
Does this address
all issues of MT?
• Obviously, not!
• This just barely touches the surface!
• Other topics that have to be addressed:
• Discourse-related phenomena - anaphora resolution, coreference
• Document translation (handling of formatting tags)
• Domain adaptation
• Interactive translation
• Localisation (e.g., correct formatting of numbers, dates, punctuation, units of measurement, etc.)
• Named entities
• Online learning
• Post-editing and computer-assisted translation
• Quality estimation
• Robustness to training data noise
• Rule-based vs. statistical vs. neural vs. hybrid machine translation
• Terminology
• Truecasing and recasing
• Etc.
Do we have time left for a short
demo?
If not, try it yourself: https://github.com/pmarcis/nlp-example
Genome sequencing,shotgun sequencing.pptxGenome sequencing,shotgun sequencing.pptx
Genome sequencing,shotgun sequencing.pptx
 
Reboulia: features, anatomy, morphology etc.
Reboulia: features, anatomy, morphology etc.Reboulia: features, anatomy, morphology etc.
Reboulia: features, anatomy, morphology etc.
 

NLP pipeline in machine translation

  • 1. The NLP pipeline in Machine Translation Mārcis Pinnis
  • 2. Overview • Short introduction • The NLP Pipeline in Machine Translation • Selected tasks that are relevant for others (not MT developers) • Example of data pre-processing using publicly available tools • NLP pipelines for English and Latvian
  • 3. Who am I? • My name is Mārcis Pinnis • I am a researcher at Tilde • I have worked on language technologies for over 10 years • Currently, my research focus is (neural) machine translation (MT) • You can find more about my research and what we do at Tilde on: Tilde MT, Neural MT • The “term” cloud summarises the topics of my last five publications
  • 4. What is machine translation? And ... what are its main goals?
  • 5. What is Machine Translation? • Machine Translation (MT) is «the use of computers to automate translation from one language to another» /Jurafsky & Martin, 2009/ • Example source sentence: «Language technologies are important in human-computer interaction.» • Outputs shown on the slide for four systems (Statistical MT, Neural MT, Bing, Google): «Valodas tehnoloģijas ir svarīga cilvēka-datora mijiedarbībā.» / «Valodu tehnoloģijas ir svarīgs cilvēka-datora mijiedarbībā.» / «Valodu tehnoloģijas ir svarīgas cilvēka un datora mijiedarbībai.» / «Valodas tehnoloģijas ir svarīgas human-computer interaction.»
  • 6. What are the main goals of machine translation? • To lower language barriers in communication • To provide access to information written in an unknown language • To increase productivity, e.g., of professional translators
  • 7. The NLP Pipeline From the very basic to the more challenging tasks
  • 8. What to Start with? • Imagine that you have to translate the following text • What will you do first? The European Union (EU) is a political and economic union of 28 member states that are located primarily in Europe. It has an area of 4,475,757 km2 (1,728,099 sq mi), and an estimated population of over 510 million.
  • 9. Sentence Breaking • First, we will split the text into sentences. • Most basic NLP tools work with individual sentences, so this is a mandatory step. • Input: The European Union (EU) is a political and economic union of 28 member states that are located primarily in Europe. It has an area of 4,475,757 km2 (1,728,099 sq mi), and an estimated population of over 510 million. • Output: [1] The European Union (EU) is a political and economic union of 28 member states that are located primarily in Europe. [2] It has an area of 4,475,757 km2 (1,728,099 sq mi), and an estimated population of over 510 million.
  • 10. Tokenisation • Then, we will split the text into primitive textual units (tokens). • Input: The European Union (EU) is a political and economic union of 28 member states that are located primarily in Europe. It has an area of 4,475,757 km2 (1,728,099 sq mi), and an estimated population of over 510 million. • Output: The European Union ( EU ) is a political and economic union of 28 member states that are located primarily in Europe . It has an area of 4,475,757 km 2 ( 1,728,099 sq mi ) , and an estimated population of over 510 million . • Token types: WORD, PUNCTUATION, NUMERAL, ID, SYMBOL, URL, XML, DATE, TIME, EMAIL, SMILEY, HASHTAG, CASHTAG, MENTION, RETWEET, OTHER
  • 11. Now that the text is broken into tokens, let us look at some examples... • Can you guess the translations of the following phrases? «control system», «dry clothes», «moving man» • The phrases are ambiguous!
  • 12. Possible translations may be... • control system: «kontroles sistēma» or «pārbaudi sistēmu» • dry clothes: «sausas drēbes» or «žāvē drēbes» • moving man: «aizkustinošs vīrs» or «kustīgs cilvēks» • Note: the phrases are far more ambiguous than the two examples given for each!
  • 13. The context is important! • «To control system parameters, open Settings» • «Dry clothes can be taken out of the drier» • «He was a very moving man» • In some cases, morphological disambiguation helps us to choose better translations.
  • 14. Morphological Analysis • Gives insight into the morphological ambiguity of words • Lists «all» possible (morphological) analyses of a word • Is often limited to a vocabulary • control: Noun, singular; Verb, first person, simple present; Verb, second person, simple present; ... • dry: Verb, second person, simple present; Adjective, positive; Verb, first person, simple present; ... • moving: Noun, singular; Adjective; Verb; ...
  • 15. Morphological Tagging • Performs morphological disambiguation of words using the context they are found in • Example: Dry/JJ clothes/NNS can/MD be/VB taken/VBN out/RP of/IN the/DT drier/NN • Tags: JJ = adjective, NNS = noun, plural, MD = modal verb, VB = verb, VBN = verb, past participle, RP = particle, IN = preposition, DT = determiner, NN = noun, singular • We have resolved the morphological ambiguity of «dry» • When selecting translation equivalents for the word, we will be able to take the disambiguated data into account • E.g., «dry» = «sauss» and «dry» != «žāvēt»
  • 16. The context continues to be more important! • Often it is not enough to perform morphological disambiguation • We need to understand the words in a context • We need to figure out which word modifies or depends on which other word in order to: • translate words in the correct order • translate words in the correct inflected forms
  • 17. Syntactic Parsing • We use a syntactic parser to tell us which words depend on which other words in a sentence and how phrases are structured • We now know that «dry» is an adjectival modifier of «clothes» • From this we can conclude that, e.g.: «dry» = «sausas» or «sausās» and «dry» != «sausām» or «sauso» • The example has been parsed with the Stanford Parser. You can try it here: http://corenlp.run/ • [Figures: constituency tree; dependency tree]
  • 18. What is missing? • We solved: • the morphological ambiguity • the syntactic ambiguity • What next? • How would you translate this sentence? «He took a tablet»
  • 19. Terms and named entities • There is actually context missing to identify the intended meaning, right? • Term recognition (TR) and named entity recognition (NER) tools allow us to perform semantic disambiguation • NER example: «Stīvs Gulbis vinnēja loterijā», where «Stīvs Gulbis» is tagged as PERSON. With NER: «Stivs Gulbis won the lottery». Without NER: «A stiff swan won the lottery» • TR example: «He took a tablet», where the term is tablet = tablete. With TR: «Viņš iedzēra tableti». Without TR: «Viņš paņēma planšeti»
  • 20. Semantic Analysis • For some tasks (e.g., question answering), natural language understanding requires semantic parsing. • E.g., shallow semantic parsing (a.k.a. semantic role labelling) allows us to analyse meaning by identifying predicates and their arguments in a sentence • However, in MT, we tend not to go this deep in text analysis (one reason: these tools are not widely available for many languages) • The example has been created with the Semantic Role Labeling Demo of the University of Illinois at Urbana-Champaign. You can try it here: http://cogcomp.cs.illinois.edu/page/demo_view/srl
  • 21. The Building Blocks of NLP Pragmatics Semantics Syntax Morphology
  • 22. Were these all tasks of NLP? • Obviously, not! • Other tasks that were not discussed are, e.g.: • Anaphora resolution • Coreference analysis • Detokenisation • Language identification • Semantic role labelling (a.k.a., shallow semantic parsing) • Sentiment analysis • Stemming • Truecasing and recasing • Word segmentation (e.g., for Arabic, Japanese) • Word sense disambiguation • Word splitting (e.g., in sub-word units or compound splitting) • Etc. Does this address all issues of MT? • Obviously, not! • This just barely touches the surface! • Other topics that have to be addressed: • Discourse related phenomena (anaphora resolution, coreferences) • Document translation (handling of formatting tags) • Domain adaptation • Interactive translation • Localisation (e.g., correct formatting of numbers, dates, punctuation, units of measurement, etc.) • Named entities • Online learning • Post-editing and computer-assisted translation • Quality estimation • Robustness to training data noise • Rule-based vs. statistical vs. neural vs. hybrid machine translation • Terminology • Truecasing and recasing • Etc.
  • 23. Do we have time left for a short demo? If not, try it yourself: https://github.com/pmarcis/nlp-example
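
The first two pipeline steps (sentence breaking and tokenisation) can be sketched on the slides' European Union example. This is a minimal illustration using only Python's standard `re` module; the function names and both regular expressions are assumptions made for this sketch, not the tools used in the talk or in the linked repository, and the rules are far too naive for real text.

```python
import re

TEXT = (
    "The European Union (EU) is a political and economic union of 28 "
    "member states that are located primarily in Europe. It has an "
    "area of 4,475,757 km2 (1,728,099 sq mi), and an estimated "
    "population of over 510 million."
)

def split_sentences(text):
    """Naive sentence breaker (illustrative only).

    Breaks after '.', '!' or '?' when followed by whitespace and an
    uppercase ASCII letter. Abbreviations, quotations and badly
    formatted text would need much more careful handling.
    """
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())

# Numbers with thousands separators, runs of letters, or any other
# single non-space character (punctuation, symbols).
TOKEN_RE = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?|[^\W\d_]+|\S")

def tokenise(sentence):
    """Split a sentence into primitive textual units (tokens)."""
    return TOKEN_RE.findall(sentence)

if __name__ == "__main__":
    for sentence in split_sentences(TEXT):
        print(" ".join(tokenise(sentence)))
```

On this particular input, the sketch yields the same two sentences and the same token sequence as shown on slides 9 and 10 (including «km 2» as two tokens).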

Editor's notes

  1. Challenges: 1) Badly formatted texts (e.g., missing whitespace) 2) Quotations, multiple sentences within quotes or brackets 3) Mathematical or code-like text fragments 4) Language-specific issues (e.g., abbreviations, ordinal numerals in Latvian, etc.) 5) etc.
  2. Challenges: 1) Badly formatted texts (e.g., missing whitespace) 2) Compounds with dashes 3) Code-like text fragments (e.g., in «EU Regulation 10/2011 addresses plastic materials and articles intended to come into contact with food» the «10/2011» is a code that should perhaps be left as a single token, but in «In 10/2011 something important may have happened» the «10/2011» is clearly a date and it may be important to split it into separate tokens) 4) Mathematical expressions 5) Language-specific issues (e.g., different spacing guidelines, units of measurement, etc.) 6) etc.
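
The «10/2011» case from the tokenisation note can be sketched as a context-dependent rule. The cue-word list and the function name below are hypothetical, chosen only for illustration; a real tokeniser would need a much richer notion of context than the single preceding word.

```python
import re

# Hypothetical cue words: a preceding word from this set suggests that
# a slashed number is a document code rather than a date.
CODE_CUES = {"regulation", "directive", "decision"}

SLASHED = re.compile(r"^\d{1,2}/\d{4}$")

def tokenise_slashed(text):
    """Whitespace tokeniser that splits 'MM/YYYY'-like chunks unless the
    previous word marks them as a code (illustrative heuristic only)."""
    tokens = []
    previous = ""
    for chunk in text.split():
        if SLASHED.match(chunk) and previous.lower() not in CODE_CUES:
            # Treat as a date: emit month, slash and year separately.
            month, year = chunk.split("/")
            tokens.extend([month, "/", year])
        else:
            # Keep codes (and all other chunks) as single tokens.
            tokens.append(chunk)
        previous = chunk
    return tokens

if __name__ == "__main__":
    print(tokenise_slashed("EU Regulation 10/2011 addresses plastic materials"))
    print(tokenise_slashed("In 10/2011 something important may have happened"))
```

The same surface string ends up as one token in the first sentence and three tokens in the second, which is the behaviour the note argues for.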