2. Overview
• Short introduction
• The NLP Pipeline in Machine Translation
• Selected tasks that are relevant for others (not MT developers)
• Example of data pre-processing using publicly available tools
• NLP pipelines for English and Latvian
3. Who am I?
• My name is Mārcis Pinnis
• I am a researcher at Tilde
• I have worked on language
technologies for over 10 years
• Currently, my research focus is
(neural) machine translation (MT)
• You can find out more about my research and what we do at Tilde online: Tilde MT, Neural MT
The «term» cloud summarises the topics of my last five publications
4. What is machine translation?
And ... what are its main goals?
5. What is Machine Translation?
Machine Translation (MT) is «the use of computers to
automate translation from one language to another»
/Jurafsky & Martin, 2009/
Example: translations of «Language technologies are important in
human-computer interaction.» into Latvian by different systems
(Statistical MT, Neural MT, Bing, Google):
• Valodas tehnoloģijas ir svarīga cilvēka-datora mijiedarbībā.
• Valodu tehnoloģijas ir svarīgs cilvēka-datora mijiedarbībā.
• Valodu tehnoloģijas ir svarīgas cilvēka un datora mijiedarbībai.
• Valodas tehnoloģijas ir svarīgas human-computer interaction. (part left untranslated)
6. What are the main goals of machine translation?
• To lower language barriers in communication
• To provide access to information written in an unknown language
• To increase productivity, e.g., of professional translators
8. What to Start with?
• Imagine that you have to translate the following text
• What will you do first?
The European Union (EU) is a political and economic union of 28
member states that are located primarily in Europe. It has an
area of 4,475,757 km2 (1,728,099 sq mi), and an estimated
population of over 510 million.
9. Sentence Breaking
• First, we will split the text into sentences.
• Most basic NLP tools work with individual sentences; therefore, this is
a mandatory step.
The European Union (EU) is a political and economic union of 28
member states that are located primarily in Europe. It has an
area of 4,475,757 km2 (1,728,099 sq mi), and an estimated
population of over 510 million.
The European Union (EU) is a political and economic union of 28
member states that are located primarily in Europe.
It has an area of 4,475,757 km2 (1,728,099 sq mi), and an
estimated population of over 510 million.
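A sentence breaker can be sketched with a single rule: split after end-of-sentence punctuation that is followed by an uppercase letter. This is a minimal illustration (the function name is ours); production sentence breakers handle many more cases, as the editor's notes at the end point out.

```python
import re

# Minimal illustrative sentence splitter: split after ., ! or ? when the
# next non-space character is an uppercase letter. Real sentence breakers
# must also handle abbreviations, quotations, and language-specific rules.
def split_sentences(text):
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
    return [part.strip() for part in parts if part.strip()]

text = ("The European Union (EU) is a political and economic union of 28 "
        "member states that are located primarily in Europe. It has an "
        "area of 4,475,757 km2 (1,728,099 sq mi), and an estimated "
        "population of over 510 million.")
for sentence in split_sentences(text):
    print(sentence)
```

Note that the rule correctly ignores the full stops inside numbers and the brackets here, but it would break on, e.g., «Dr. Smith», which is exactly the abbreviation problem listed among the challenges.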
10. Tokenisation
• Then, we will split the text into
primitive textual units - tokens.
The European Union (EU) is a political and economic union of 28
member states that are located primarily in Europe.
It has an area of 4,475,757 km2 (1,728,099 sq mi), and an
estimated population of over 510 million.
The European Union ( EU ) is a political and economic union of 28
member states that are located primarily in Europe .
It has an area of 4,475,757 km 2 ( 1,728,099 sq mi ) , and an
estimated population of over 510 million .
Token types may include: WORD, PUNCTUATION, NUMERAL, ID, SYMBOL, URL,
XML, DATE, TIME, EMAIL, SMILEY, HASHTAG, CASHTAG, MENTION, RETWEET, OTHER
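The splitting of punctuation from words can be sketched with a regular expression. This is an illustrative toy, not the tokeniser used for the slide's output (it keeps «km2» whole, unlike the example above), and real tokenisers additionally classify tokens into types like the ones listed.

```python
import re

# Minimal illustrative tokeniser: numbers with internal commas stay
# whole, word characters group together, and every other non-space
# character (brackets, commas, full stops) becomes its own token.
TOKEN_RE = re.compile(r"\d+(?:,\d+)*|\w+|[^\w\s]")

def tokenise(sentence):
    return TOKEN_RE.findall(sentence)

print(tokenise("It has an area of 4,475,757 km2 (1,728,099 sq mi), "
               "and an estimated population of over 510 million."))
```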
11. Now that the text is broken into tokens, let us
look at some examples...
Can you guess the translations of the following phrases?
• control system
• dry clothes
• moving man
The phrases are ambiguous!
12. Possible translations may be...
• control system → «kontroles sistēma» or «pārbaudi sistēmu»
• dry clothes → «sausas drēbes» or «žāvē drēbes»
• moving man → «aizkustinošs vīrs» or «kustīgs cilvēks»
Note: the phrases are way more ambiguous
than the two examples given!
13. The context is important!
To control system parameters, open Settings
Dry clothes can be taken out of the drier
He was a very moving man
In some cases, morphological disambiguation
helps us to choose better translations.
14. Morphological Analysis
• Allows us to gain insight into the morphological ambiguity of words
• Lists «all» possible (morphological) analyses of a word
• Is often limited to a vocabulary
• control: noun, singular; verb, first person, simple present; verb, second person, simple present; ...
• dry: verb, second person, simple present; adjective, positive; verb, first person, simple present; ...
• moving: noun, singular; adjective; verb; ...
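A morphological analyser can be thought of as a lexicon lookup that returns every possible reading of a surface form. A toy sketch with a hand-written lexicon for just the three example words (the data structure and names are ours):

```python
# Toy morphological analyser: a hand-written lexicon that maps a surface
# form to all of its possible analyses, covering only the example words.
LEXICON = {
    "control": ["noun, singular",
                "verb, first person, simple present",
                "verb, second person, simple present"],
    "dry": ["adjective, positive",
            "verb, first person, simple present",
            "verb, second person, simple present"],
    "moving": ["noun, singular", "adjective", "verb"],
}

def analyse(word):
    # Out-of-vocabulary words get no analyses; this is the
    # «limited to a vocabulary» restriction from the slide.
    return LEXICON.get(word.lower(), [])

for word in ("control", "dry", "moving"):
    print(word, "->", analyse(word))
```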
15. Morphological Tagging
• Allows us to perform morphological disambiguation of
words using the context they are found in
• We have solved the morphological ambiguity of «dry»
• When selecting translation equivalents for the word,
we will be able to take the disambiguated data into
account
• E.g., «dry» = «sauss» and «dry» != «žāvēt»
Dry/JJ clothes/NNS can/MD be/VB taken/VBN out/RP of/IN the/DT drier/NN
JJ = adjective
NNS = noun, plural
MD = modal verb
VB = verb
VBN = verb, past participle
RP = particle
IN = preposition
DT = determiner
NN = noun, singular
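Contextual disambiguation in the spirit of a morphological tagger can be sketched as follows. The ambiguity table and the rules below are our own toy assumptions, not a real tagger: when a word such as «dry» can be either an adjective (JJ) or a verb (VB), prefer the adjective reading when the following word can be a noun.

```python
# Possible Penn Treebank tags per word (a toy ambiguity table).
POSSIBLE = {
    "dry": {"JJ", "VB"},
    "clothes": {"NNS"},
    "can": {"MD", "NN"},
}

def tag(tokens):
    """Disambiguate with toy rules: adjective-before-noun wins."""
    tags = []
    for i, token in enumerate(tokens):
        options = POSSIBLE.get(token.lower(), {"NN"})
        following = POSSIBLE.get(tokens[i + 1].lower(), set()) if i + 1 < len(tokens) else set()
        if "JJ" in options and following & {"NN", "NNS"}:
            tags.append("JJ")  # «dry clothes» -> dry is an adjective
        elif "MD" in options:
            tags.append("MD")  # prefer the modal reading of «can»
        else:
            tags.append(sorted(options)[0])
    return tags

print(list(zip(["Dry", "clothes", "can"], tag(["Dry", "clothes", "can"]))))
```

Real taggers learn such preferences statistically from annotated corpora instead of hard-coding them, but the principle is the same: the tag of a word is chosen using its neighbours.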
16. The context continues to be more important!
• Often it is not enough to perform morphological disambiguation
• We need to understand the words in a context
• We need to figure out which word modifies or depends on which
other word in order to:
• translate words in the correct order
• translate words in the correct inflected forms
17. Syntactic Parsing
• We use a syntactic parser to tell
us which words depend on
which other words in a
sentence and how phrases are
structured
• We now know that «dry» is an
adjectival modifier of «clothes»
• From this we can conclude that,
e.g.:
«dry» = «sausas» or «sausās»
«dry» != «sausām» or «sauso»
The example has been parsed with the Stanford Parser. You can try it here: http://corenlp.run/
(Figures: constituency tree and dependency tree of the example sentence)
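The dependency tree a parser returns can be represented as (relation, head, dependent) triples. Below is a hand-written, approximate sketch of such a structure for the example sentence; the relation labels follow the Universal Dependencies style, and this is our illustration, not the Stanford Parser's actual output format.

```python
from collections import namedtuple

# One (relation, head, dependent) triple per dependency arc.
Dependency = namedtuple("Dependency", ["relation", "head", "dependent"])

# Hand-written, approximate parse of «Dry clothes can be taken out of the drier».
parse = [
    Dependency("amod", "clothes", "Dry"),      # «dry» modifies «clothes»
    Dependency("nsubjpass", "taken", "clothes"),
    Dependency("aux", "taken", "can"),
    Dependency("auxpass", "taken", "be"),
    Dependency("compound:prt", "taken", "out"),
    Dependency("case", "drier", "of"),
    Dependency("det", "drier", "the"),
    Dependency("nmod", "taken", "drier"),
]

# Recover the fact the slide uses: «dry» is an adjectival modifier (amod).
adjectival_modifiers = [d for d in parse if d.relation == "amod"]
print(adjectival_modifiers)
```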
18. What is missing?
• We solved:
• the morphological ambiguity
• the syntactic ambiguity
• What next?
• How would you translate
this sentence?
«He took a tablet»
19. Terms and named entities
• There is actually context missing to identify the intended meaning, right?
• Term recognition (TR) and named entity recognition (NER) tools allow us
to perform semantic disambiguation
Example (NER): «Stīvs Gulbis vinnēja loterijā»
• With NER («Stīvs Gulbis» tagged as PERSON): «Stivs Gulbis won the lottery»
• Without NER: «A stiff swan won the lottery»
Example (TR): «He took a tablet»
• With TR (the term «tablet» = «tablete»): «Viņš iedzēra tableti» (he swallowed a pill)
• Without TR: «Viņš paņēma planšeti» (he picked up a tablet computer)
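A gazetteer (name list) lookup is the simplest possible sketch of NER; real recognisers are statistical and use context, and the names and functions below are our own illustration. Even this toy shows why «Stīvs Gulbis» should be kept together as a PERSON and transliterated rather than translated word by word («stīvs gulbis» literally means «a stiff swan»).

```python
# Toy gazetteer-based named entity recogniser.
PERSON_NAMES = {("Stīvs", "Gulbis")}

def tag_entities(tokens):
    entities = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in PERSON_NAMES:
            entities.append((" ".join(pair), "PERSON"))
            i += 2  # skip past the recognised two-token name
        else:
            i += 1
    return entities

print(tag_entities("Stīvs Gulbis vinnēja loterijā".split()))
```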
20. Semantic Analysis
• For some tasks (e.g., question answering), natural language understanding
requires semantic parsing.
• E.g., shallow semantic parsing (a.k.a. semantic role labelling) allows us to
analyse meaning by identifying predicates and their arguments in a sentence
• However, in MT, we tend not to go this deep in text analysis (one reason:
these tools are not widely available for many languages)
The example has been created with the Semantic Role Labeling Demo of the University of Illinois at Urbana-Champaign. You can try it here: http://cogcomp.cs.illinois.edu/page/demo_view/srl
22. Were these all tasks of NLP? Does this address all issues of MT?
• Obviously, not! This just barely touches the surface!
• Other MT topics that have to be addressed:
• Discourse-related phenomena - anaphora resolution, coreference
• Document translation (handling of formatting tags)
• Domain adaptation
• Interactive translation
• Localisation (e.g., correct formatting of numbers, dates, punctuation, units of measurement, etc.)
• Named entities
• Online learning
• Post-editing and computer-assisted translation
• Quality estimation
• Robustness to training data noise
• Rule-based vs. statistical vs. neural vs. hybrid machine translation
• Terminology
• Truecasing and recasing
• Etc.
• Other NLP tasks that were not discussed are, e.g.:
• Anaphora resolution
• Coreference analysis
• Detokenisation
• Language identification
• Semantic role labelling (a.k.a. shallow semantic parsing)
• Sentiment analysis
• Stemming
• Truecasing and recasing
• Word segmentation (e.g., for Arabic, Japanese)
• Word sense disambiguation
• Word splitting (e.g., into sub-word units or compound splitting)
• Etc.
23. Do we have time left for a short
demo?
If not, try it yourself: https://github.com/pmarcis/nlp-example
Editor's notes
Sentence breaking challenges:
1) Badly formatted texts (e.g., omitted whitespace)
2) Quotations, multiple sentences within quotes or brackets
3) Mathematical or code-like text fragments
4) Language-specific issues (e.g., abbreviations, ordinal numerals in Latvian, etc.)
5) Etc.
Tokenisation challenges:
1) Badly formatted texts (e.g., omitted whitespace)
2) Compounds with dashes
3) Code-like text fragments (e.g., in «EU Regulation 10/2011 addresses plastic materials and articles intended to come into contact with food» the «10/2011» is a code that should perhaps be left as a single token, but in «In 10/2011 something important may have happened» the «10/2011» is clearly a date and it may be important to split it into separate tokens)
4) Mathematical expressions
5) Language-specific issues (e.g., different spacing guidelines, units of measurement, etc.)
6) Etc.