Slides for the paper "Deep Learning-Based Morphological Taggers and Lemmatizers for Annotating Historical Texts" by Helmut Schmid, presented at DATeCH 2019, the 3rd International Conference on Digital Access to Textual Cultural Heritage.
1. Deep Learning-Based Morphological Taggers and
Lemmatizers for Annotating Historical Texts
Helmut Schmid
Centrum für Informations- und Sprachverarbeitung
Ludwig-Maximilian-Universität München
schmid@cis.lmu.de
2. POS Tagging and Lemmatization
word part-of-speech tag lemma
do AVD dô
begagenda VVFIN.Ind.Past.Sg.3 be-gègenen
imo PPER.Masc.Dat.Sg.3 ër
min DPOSA.Masc.Nom.Sg.* mîn
trohtin NA.Masc.Nom.Sg.st truhtîn
mit APPR mit
ſinero DPOSA.Fem.Dat.Sg.st sîn
arngrihte NA.Fem.Dat.Sg.st êre-grëhte
. $_
Example sentence from the Middle High German Reference Corpus
Helmut Schmid (CIS) Deep Learning-Based Morphological Taggers and Lemmatizers for Historical Texts 2 / 30
3. High Spelling Variation in Historical Texts
Variants of the Middle High German word tuon (to do)
Dialectal variation
tuon, dun, doyn
Spelling variation
tuon, thuon, tuen
Script variation
tuon, tvon, tûon, tuͦn, tvͦn, toͮn
4. Overview
Word representations
POS tagger architecture
Lemmatizer architecture
Evaluations
Conclusions
5. Tagging Methods
Statistical Taggers
the most widely used class of annotation systems
They learn from manually annotated data.
Unseen words cause problems:
If thuon was never seen, what should its POS tag be?
Guessing the POS tag based on the final letters does not always help:
The most frequent word with the same 3-letter suffix is the preposition uon (von).
6. Word Representations
Statistical systems represent words as atomic units.
We need a better word representation
which reflects word similarity
and provides good representations for unseen words
Solution
Representation of words with an n-dimensional vector (embedding)
(-0.7, 1.5, 2.8, -5.5, 0.2, ...)
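A useful property of such vectors is that word similarity becomes measurable. A minimal sketch with invented toy vectors (not real learned embeddings) compares words by cosine similarity:

```python
import math

def cosine(u, v):
    # cosine similarity between two embedding vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# toy 4-dimensional embeddings (invented values for illustration only)
emb = {
    "tuon": [0.9, 1.2, -0.3, 0.5],
    "thuon": [0.8, 1.1, -0.2, 0.6],   # spelling variant: close to "tuon"
    "mit": [-1.0, 0.2, 1.5, -0.7],    # unrelated word: far away
}

assert cosine(emb["tuon"], emb["thuon"]) > cosine(emb["tuon"], emb["mit"])
```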
8. Part-of-Speech Tagging
[Figure: word embeddings → bi-RNN → contextual representations → POS tags, e.g. "The quick brown ... dog" → DT JJ JJ ... NN]
The embedding sequence is processed with a bi-RNN
⇒ representation for each word in context
The output tag is predicted from this contextual representation.
The POS tagger is trained end-to-end on a manually annotated text corpus.
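The data flow above can be sketched end to end. This toy version substitutes a plain tanh RNN with random, untrained weights for the trained LSTMs, so the predicted tags are arbitrary; it only illustrates how embeddings pass through a bi-RNN to yield one tag per word:

```python
import math
import random

random.seed(0)

def rnn(inputs, dim, reverse=False):
    # minimal tanh RNN (a sketch; the real tagger uses LSTMs) with random weights
    seq = list(reversed(inputs)) if reverse else inputs
    in_dim = len(seq[0])
    W = [[random.uniform(-0.1, 0.1) for _ in range(dim + in_dim)] for _ in range(dim)]
    h = [0.0] * dim
    states = []
    for x in seq:
        hx = h + x                       # concatenate previous state and input
        h = [math.tanh(sum(w * v for w, v in zip(row, hx))) for row in W]
        states.append(h)
    return list(reversed(states)) if reverse else states

def birnn_tag(word_embeddings, tagset, dim=8):
    fwd = rnn(word_embeddings, dim)
    bwd = rnn(word_embeddings, dim, reverse=True)
    # contextual representation = concatenation of forward and backward states
    contextual = [f + b for f, b in zip(fwd, bwd)]
    # toy "classifier": random projection to tag scores (untrained, shapes only)
    W_out = [[random.uniform(-0.1, 0.1) for _ in range(2 * dim)] for _ in tagset]
    tags = []
    for c in contextual:
        scores = [sum(w * v for w, v in zip(row, c)) for row in W_out]
        tags.append(tagset[scores.index(max(scores))])
    return tags

# four 3-dimensional toy word embeddings → one tag per word
tags = birnn_tag([[0.1, 0.2, 0.3]] * 4, ["DT", "JJ", "NN"])
assert len(tags) == 4
```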
9. Dealing With Unseen Words
The word embeddings of unseen words cannot be learned from the annotated data.
Solution 1: Pre-train the embeddings on a large unannotated corpus
(e.g. by predicting the current word based on the left and right context)
Solution 2: Compute a word representation from the letter sequence.
The two methods can also be combined.
10. Letter-Based Word Representations
The letters are represented with letter embeddings.
The embedding sequence is processed with a bi-RNN.
The two final representations are concatenated.
⇒ letter-based word representation
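A minimal sketch of this idea, using a toy tanh RNN with random weights (the paper uses LSTMs, and in practice the two directions have separate parameters; here one weight matrix is reused for brevity):

```python
import math
import random

random.seed(1)

def final_state(letters, emb, W, reverse=False):
    # run a toy tanh RNN over letter embeddings and return its final state
    seq = list(reversed(letters)) if reverse else letters
    h = [0.0] * len(W)
    for ch in seq:
        hx = h + emb[ch]                 # concatenate state and letter embedding
        h = [math.tanh(sum(w * v for w, v in zip(row, hx))) for row in W]
    return h

dim, edim = 4, 2
letters = list("tuon")
# random letter embeddings and RNN weights (untrained, for illustration)
emb = {c: [random.uniform(-1, 1) for _ in range(edim)] for c in set(letters)}
W = [[random.uniform(-0.5, 0.5) for _ in range(dim + edim)] for _ in range(dim)]

# concatenate the final forward and backward states → word representation
word_rep = final_state(letters, emb, W) + final_state(letters, emb, W, reverse=True)
assert len(word_rep) == 2 * dim
```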
11. Full Network
[Figure: full network — letter-based word representations (and optional word embeddings) feed the word-level bi-RNN, which outputs POS tags such as DT JJ NN]
12. Advantages
The letter-based RNN
learns regular spelling variations such as u ↔ v, uo ↔ uͦ, etc.
generalizes from words to their unseen spelling variants: tuon → thvͦn
provides high quality word representations for unseen words
uses fewer parameters than an embedding table
is slower, but the word representations can be precomputed
13. Technical Details of the POS Tagger
Word representations
character-based representation computed with a biRNN
Forward-RNN processes a 10-letter suffix of the word
Backward-RNN processes a 10-letter prefix of the word
Prefixes/suffixes are padded with padding symbols if the word is shorter.
⇒ faster and simpler (parallel) processing
optionally: pre-trained word embeddings
concatenated with the character-based representation
most useful when the training corpus is small
RNN over words
uses two biRNN layers with residual connections
i.e. the input is added to the output
All RNNs are implemented with LSTMs.
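The fixed-length prefix/suffix trick can be sketched as follows; the padding symbol '#' and the padding direction are assumptions, since the slides do not specify them:

```python
PAD = "#"  # assumed padding symbol; the slides do not name one

def suffix10(word):
    # last 10 letters, left-padded, for the forward RNN
    return word[-10:].rjust(10, PAD)

def prefix10(word):
    # first 10 letters, right-padded, for the backward RNN
    return word[:10].ljust(10, PAD)

# short words are padded to the fixed length, long words truncated:
assert suffix10("tuon") == "######tuon"
assert prefix10("tuon") == "tuon######"
assert len(suffix10("begagenda-longword")) == 10
```

Because every word yields exactly 10 symbols per direction, whole batches can be processed in parallel with no per-word length bookkeeping, which is the speed/simplicity gain the slide mentions.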
14. Lemmatizer
[Figure: encoder-decoder with attention — the encoder reads the input letters "b a b i e s", the attention module computes a weighted summary (Σ), and the decoder emits the lemma letters one by one]
Encoder-decoder architecture with attention
The encoder computes a representation for each letter in context.
The decoder generates the output letters one by one.
The attention module computes an input summary conditioned on the current decoder state.
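A minimal sketch of the attention step, assuming dot-product scoring (the slides do not specify the scoring function):

```python
import math

def attention(encoder_states, decoder_state):
    # score each encoder state against the current decoder state (dot product),
    # normalize with softmax, and build the weighted input summary
    scores = [sum(e * d for e, d in zip(enc, decoder_state))
              for enc in encoder_states]
    m = max(scores)                                  # for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(encoder_states[0])
    summary = [sum(w * enc[i] for w, enc in zip(weights, encoder_states))
               for i in range(dim)]
    return weights, summary

# three 2-dimensional toy encoder states, one toy decoder state
enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, summary = attention(enc, [1.0, 0.0])
assert abs(sum(weights) - 1.0) < 1e-9   # softmax weights sum to 1
```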
15. POS Tagging Experiments on Tiger
Goal: Check if our tagger is state-of-the-art.
Tiger is the standard corpus for German POS tagging
Resulting accuracy for POS and morphological tagging
system           test
Heigold          93.23
our tagger       93.42
Heigold+emb      93.85
our tagger+emb   93.88
16. POS Tagging Experiments on the ReM corpus
Corpus Description
Reference Corpus for Middle High German
period 1050–1350
diplomatically transcribed
semi-automatically annotated with
normalized spelling
part of speech
lemma
morphological features
2.9 times more transcribed word types than normalized word types
Contracted word forms such as inhandon (in handon, in hand) are annotated with two tags.
7830 different fine-grained tags including these combined tags
17. POS Tagging Experiments on the ReM corpus
Resulting POS and morphological tagging accuracies
system                  test
TreeTagger (POS)        90.40
our tagger (POS)        95.88
our tagger (POS+morph)  89.45
POS+morph accuracy on unseen words: 79%
POS+morph accuracy on seen words with unseen tag: 57%
18. POS Tagging Experiments on Other Corpora
Corpus size tags dev test
Middle Dutch Gysseling corpus 1.5M 3864 91.10 91.01
Syntactic Reference Corpus of Medieval French 1.2M 68 96.45 96.28
York-Toronto-Helsinki Parsed C. of Old English 1.6M 298 97.37 97.13
Ancient Greek Dependency Treebank 2.1 548K 1041 91.35 91.29
Icelandic Parsed Historical Corpus 383K 319 93.88 89.87
Corpus Taurinese (Old Italian) 246K 239 98.46 98.43
GerManC historical German newspaper corpus 770K 62 97.25 98.23
GerManC historical German newspaper corpus 770K 2303 85.65 87.72
The accuracy is high if the tagset is small.
Exception: Icelandic
19. Lemmatization Experiments on ReM
Encoder-decoder system DL4MT by Kyunghyun Cho
originally developed for machine translation
used for morphological reinflection by Kann & Schütze
input: wordform + POS (split into a sequence of letters)
f r o g e t e # V V F I N . * . P a s t . S g . 3
output: lemma (sequence of letters)
v r â g e n
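Building this input format is a simple transformation:

```python
def lemmatizer_input(wordform, pos_tag):
    # split the wordform into letters, append '#' as a separator, then the
    # POS tag also split into single symbols (format shown on the slide)
    return " ".join(list(wordform) + ["#"] + list(pos_tag))

assert lemmatizer_input("frogete", "VVFIN.*.Past.Sg.3") == \
    "f r o g e t e # V V F I N . * . P a s t . S g . 3"
```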
20. Lemmatization Experiments on ReM
Test accuracy on unseen word-tag combinations
test
overall 91.87
unseen words 86.43
seen words with unseen tags 93.33
21. Conclusions
The annotation of historical corpora is difficult due to dialectal and spelling variations.
Neural networks are well suited to deal with this variation.
The letter-based word representations
are robust to spelling variations and
reflect the similarity of words.
NN-based POS taggers outperform the traditional POS taggers used in previous work on historical corpora by a large margin.
NN-based lemmatizers also produce accurate results on historical corpora.
The new tagger and lemmatizer (jointly released as RNNTagger) are freely available for non-commercial purposes via my homepage:
http://www.cis.lmu.de/~schmid
22. Thank you for your attention!
23. POS Tagger: Network Sizes
Letter LSTM layers: 1
Word LSTM layers: 2
Letter embeddings size: 100
Letter LSTM size: 400
Word LSTM size: 400
Dropout Rate: 0.5 (no dropout on embeddings)
Training algorithm: SGD
Initial learning rate: 1.0 (no average across batch)
Learning Rate decay: 5% after each epoch
Gradient clipping threshold: 1.0
Early Stopping
minimal character frequency: 2
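Two of these settings can be made concrete. The sketch below assumes multiplicative decay and L2-norm gradient clipping, which is the usual reading of these hyperparameters:

```python
def lr_schedule(initial_lr=1.0, decay=0.05, epochs=5):
    # learning rate decays by 5% after each epoch (values from the slide;
    # multiplicative decay is an assumption)
    lrs = []
    lr = initial_lr
    for _ in range(epochs):
        lrs.append(lr)
        lr *= 1.0 - decay
    return lrs

def clip_gradient(grad, threshold=1.0):
    # rescale the gradient vector if its L2 norm exceeds the threshold
    norm = sum(g * g for g in grad) ** 0.5
    if norm > threshold:
        grad = [g * threshold / norm for g in grad]
    return grad

lrs = lr_schedule(epochs=3)
assert abs(lrs[1] - 0.95) < 1e-12 and abs(lrs[2] - 0.9025) < 1e-12
clipped = clip_gradient([3.0, 4.0])          # norm 5.0 → rescaled to norm 1.0
assert abs(sum(g * g for g in clipped) ** 0.5 - 1.0) < 1e-9
```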
24. Lemmatizer: Network Sizes
Encoder layers: 2
Decoder layers: 1
Letter embeddings size: 100
Encoder LSTM size: 400
Decoder LSTM size: 400
Dropout Rate: 0.5 (no dropout on embeddings)
Training algorithm: Adam
Initial learning rate: 0.001
Learning Rate decay: 5% after each epoch
Gradient clipping threshold: 1.0
Early Stopping
tied decoder input and output embeddings
26. Most Frequent POS Tag Confusions
freq correct erroneous example words
202 NE FM Auguſti, Rome, Stephani
111 ADJA NA ceſue, zeſwe, vordern
105 NA PTKNEG niht, nicht, nit
102 NA ADJA weiſen, ríndeſpuch, oel
100 DRELS DDS der, ds, dv, die
82 DDS DDART der, daz, den, ds
76 NA AVD vil, iht, mer, luͦte
71 NA ADJD gut, weich, recht, guͦt
61 ADJA AVD mer, leſtin, mínr, mere
60 DRELS DDART der, ds, di
56 AVD NA vil, her, mer
56 DRELS KOUS daz, Dat, Du, dc
55 PPER VAFIN is, ſi, ſei
55 PTKNEG NA niht, nit, níht
53 ADJD AVD her, ſuͦze, ſtete
52 PTKVZ AVD vor, wids, wider, ane
51 VVFIN NA lant, zûg, zspriſt
28. Application: Extraction of Negation Clitics
Middle High German negation particles are often realized as clitics:
clitic prefix:
enſprecheſt (nicht sprichst, not speak+2sg)
nemochte (nicht mochte, not wanted)
niſt (nicht ist, not is)
clitic suffix:
erne (er nicht, he not)
wirn (wir nicht, we not)
double prefix:
ſinenthewalten (sich nicht entwalten, himself not restrain)
29. Application: Extraction of Negation Clitics
Goal: Automatic extraction of words with negation clitics
Steps:
Training of the tagger on the ReM training corpus
Annotation of the test corpus
Extraction of all words tagged with PTKNEG-..., ...-PTKNEG, or
...-PTKNEG-...
Resulting precision: 96.4%
Resulting recall: 97.8%
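The extraction step can be sketched as a filter over combined tags; the example words and their tags below are invented for illustration:

```python
def has_negation_clitic(tag):
    # a word carries a negation clitic if PTKNEG appears as one component of a
    # combined tag (PTKNEG-..., ...-PTKNEG, or ...-PTKNEG-...); a bare PTKNEG
    # tag marks the free-standing particle, not a clitic
    parts = tag.split("-")
    return len(parts) > 1 and "PTKNEG" in parts

# toy tagged text (tags invented for illustration)
tagged = [("niſt", "PTKNEG-VAFIN"), ("erne", "PPER-PTKNEG"),
          ("trohtin", "NA"), ("niht", "PTKNEG")]
extracted = [w for w, t in tagged if has_negation_clitic(t)]
assert extracted == ["niſt", "erne"]
```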
30. Recurrent Neural Network
[Figure: a simple NN vs. an RNN, which feeds its own state back as part of the next input]
ht = tanh(W (ht−1 ◦ x))
ht is the RNN state at time t
(= vector with the activations of all RNN neurons)
x is the input vector
The previous RNN state ht−1 and the input x are concatenated
The resulting vector is multiplied with the matrix W
(= linear projection to a new space)
Finally, the activation function (here tanh) is applied
(squashes the values of the vector to the range [-1,1])
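The update can be spelled out with toy numbers:

```python
import math

def rnn_step(W, h_prev, x):
    # ht = tanh(W (ht-1 ◦ x)): concatenate, project with W, squash with tanh
    hx = h_prev + x                       # concatenation of state and input
    return [math.tanh(sum(w * v for w, v in zip(row, hx))) for row in W]

# 2 state neurons, 1-dimensional input → W is 2 x 3 (toy values)
W = [[0.5, -0.3, 0.8],
     [0.1, 0.7, -0.2]]
h = rnn_step(W, [0.0, 0.0], [1.0])
assert len(h) == 2 and all(-1.0 <= v <= 1.0 for v in h)
```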