Slides for the paper "Deep Learning-Based Morphological Taggers and Lemmatizers for Annotating Historical Texts" by Helmut Schmid, presented at DATeCH 2019, the 3rd International Conference on Digital Access to Textual Cultural Heritage.
1. Deep Learning-Based Morphological Taggers and
Lemmatizers for Annotating Historical Texts
Helmut Schmid
Centrum für Informations- und Sprachverarbeitung
Ludwig-Maximilian-Universität München
schmid@cis.lmu.de
2. POS Tagging and Lemmatization
word part-of-speech tag lemma
do AVD dô
begagenda VVFIN.Ind.Past.Sg.3 be-gègenen
imo PPER.Masc.Dat.Sg.3 ër
min DPOSA.Masc.Nom.Sg.* mîn
trohtin NA.Masc.Nom.Sg.st truhtîn
mit APPR mit
ſinero DPOSA.Fem.Dat.Sg.st sîn
arngrihte NA.Fem.Dat.Sg.st êre-grëhte
. $_
Example sentence from the Middle High German Reference Corpus
Helmut Schmid (CIS) Deep Learning-Based Morphological Taggers and Lemmatizers for Historical Texts 2 / 30
3. High Spelling Variation in Historical Texts
Variants of the Middle High German word tuon (to do)
Dialectal variation
tuon, dun, doyn
Spelling variation
tuon, thuon, tuen
Script variation
tuon, tvon, tûon, tuͦn, tvͦn, toͮn
4. Overview
Word representations
POS tagger architecture
Lemmatizer architecture
Evaluations
Conclusions
5. Tagging Methods
Statistical Taggers
the most widely used class of annotation systems
They learn from manually annotated data.
Unseen words cause problems:
If thuon was never seen, what should its POS tag be?
Guessing the POS tag based on the final letters does not always help:
The most frequent word with the same 3-letter suffix is the preposition uon (von).
6. Word Representations
Statistical systems represent words as atomic units.
We need a better word representation
which reflects word similarity
and provides good representations for unseen words
Solution
Representation of words with an n-dimensional vector (embedding)
(-0.7, 1.5, 2.8, -5.5, 0.2, ...)
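A useful property of such vectors is that word similarity becomes measurable. A minimal sketch with invented toy vectors (not real learned embeddings) compares words by cosine similarity:

```python
import math

def cosine(u, v):
    # cosine similarity between two embedding vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# toy 4-dimensional embeddings (invented values for illustration only)
emb = {
    "tuon": [0.9, 1.2, -0.3, 0.5],
    "thuon": [0.8, 1.1, -0.2, 0.6],   # spelling variant: close to "tuon"
    "mit": [-1.0, 0.2, 1.5, -0.7],    # unrelated word: far away
}

assert cosine(emb["tuon"], emb["thuon"]) > cosine(emb["tuon"], emb["mit"])
```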
8. Part-of-Speech Tagging
[Figure: word embeddings → bi-RNN → contextual representations → POS tags, e.g. "The quick brown ... dog" → DT JJ JJ ... NN]
The embedding sequence is processed with a bi-RNN
⇒ representation for each word in context
The output tag is predicted from this contextual representation.
The POS tagger is trained end-to-end on a manually annotated text corpus.
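The data flow above can be sketched end to end. This toy version substitutes a plain tanh RNN with random, untrained weights for the trained LSTMs, so the predicted tags are arbitrary; it only illustrates how embeddings pass through a bi-RNN to yield one tag per word:

```python
import math
import random

random.seed(0)

def rnn(inputs, dim, reverse=False):
    # minimal tanh RNN (a sketch; the real tagger uses LSTMs) with random weights
    seq = list(reversed(inputs)) if reverse else inputs
    in_dim = len(seq[0])
    W = [[random.uniform(-0.1, 0.1) for _ in range(dim + in_dim)] for _ in range(dim)]
    h = [0.0] * dim
    states = []
    for x in seq:
        hx = h + x                       # concatenate previous state and input
        h = [math.tanh(sum(w * v for w, v in zip(row, hx))) for row in W]
        states.append(h)
    return list(reversed(states)) if reverse else states

def birnn_tag(word_embeddings, tagset, dim=8):
    fwd = rnn(word_embeddings, dim)
    bwd = rnn(word_embeddings, dim, reverse=True)
    # contextual representation = concatenation of forward and backward states
    contextual = [f + b for f, b in zip(fwd, bwd)]
    # toy "classifier": random projection to tag scores (untrained, shapes only)
    W_out = [[random.uniform(-0.1, 0.1) for _ in range(2 * dim)] for _ in tagset]
    tags = []
    for c in contextual:
        scores = [sum(w * v for w, v in zip(row, c)) for row in W_out]
        tags.append(tagset[scores.index(max(scores))])
    return tags

# four 3-dimensional toy word embeddings → one tag per word
tags = birnn_tag([[0.1, 0.2, 0.3]] * 4, ["DT", "JJ", "NN"])
assert len(tags) == 4
```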
9. Dealing With Unseen Words
The word embeddings of unseen words cannot be learned from the annotated data.
Solution 1: Pre-train the embeddings on a large unannotated corpus
(e.g. by predicting the current word based on the left and right context)
Solution 2: Compute a word representation from the letter sequence.
The two methods can also be combined.
10. Letter-Based Word Representations
The letters are represented with letter embeddings.
The embedding sequence is processed with a bi-RNN.
The two final representations are concatenated.
⇒ letter-based word representation
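A minimal sketch of this idea, using a toy tanh RNN with random weights (the paper uses LSTMs, and in practice the two directions have separate parameters; here one weight matrix is reused for brevity):

```python
import math
import random

random.seed(1)

def final_state(letters, emb, W, reverse=False):
    # run a toy tanh RNN over letter embeddings and return its final state
    seq = list(reversed(letters)) if reverse else letters
    h = [0.0] * len(W)
    for ch in seq:
        hx = h + emb[ch]                 # concatenate state and letter embedding
        h = [math.tanh(sum(w * v for w, v in zip(row, hx))) for row in W]
    return h

dim, edim = 4, 2
letters = list("tuon")
# random letter embeddings and RNN weights (untrained, for illustration)
emb = {c: [random.uniform(-1, 1) for _ in range(edim)] for c in set(letters)}
W = [[random.uniform(-0.5, 0.5) for _ in range(dim + edim)] for _ in range(dim)]

# concatenate the final forward and backward states → word representation
word_rep = final_state(letters, emb, W) + final_state(letters, emb, W, reverse=True)
assert len(word_rep) == 2 * dim
```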
11. Full Network
[Figure: full network — letter-based word representations (and optional word embeddings) feed the word-level bi-RNN, which outputs POS tags such as DT JJ NN]
12. Advantages
The letter-based RNN
learns regular spelling variations such as u ↔ v, uo ↔ uͦ, etc.
generalizes from words to their unseen spelling variants: tuon → thvͦn
provides high quality word representations for unseen words
uses fewer parameters than an embedding table
is slower, but the word representations can be precomputed
13. Technical Details of the POS Tagger
Word representations
character-based representation computed with a biRNN
Forward-RNN processes a 10-letter suffix of the word
Backward-RNN processes a 10-letter prefix of the word
Prefixes/suffixes are padded with padding symbols if the word is shorter.
⇒ faster and simpler (parallel) processing
optionally: pre-trained word embeddings
concatenated with the character-based representation
most useful when the training corpus is small
RNN over words
uses two biRNN layers with residual connections
i.e. the input is added to the output
All RNNs are implemented with LSTMs.
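The fixed-length prefix/suffix trick can be sketched as follows; the padding symbol '#' and the padding direction are assumptions, since the slides do not specify them:

```python
PAD = "#"  # assumed padding symbol; the slides do not name one

def suffix10(word):
    # last 10 letters, left-padded, for the forward RNN
    return word[-10:].rjust(10, PAD)

def prefix10(word):
    # first 10 letters, right-padded, for the backward RNN
    return word[:10].ljust(10, PAD)

# short words are padded to the fixed length, long words truncated:
assert suffix10("tuon") == "######tuon"
assert prefix10("tuon") == "tuon######"
assert len(suffix10("begagenda-longword")) == 10
```

Because every word yields exactly 10 symbols per direction, whole batches can be processed in parallel with no per-word length bookkeeping, which is the speed/simplicity gain the slide mentions.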
14. Lemmatizer
[Figure: encoder-decoder with attention — the encoder reads the input letters "b a b i e s", the attention module computes a weighted summary (Σ), and the decoder emits the lemma letters one by one]
Encoder-decoder architecture with attention
The encoder computes a representation for each letter in context.
The decoder generates the output letters one by one.
The attention module computes an input summary conditioned on the current decoder state.
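A minimal sketch of the attention step, assuming dot-product scoring (the slides do not specify the scoring function):

```python
import math

def attention(encoder_states, decoder_state):
    # score each encoder state against the current decoder state (dot product),
    # normalize with softmax, and build the weighted input summary
    scores = [sum(e * d for e, d in zip(enc, decoder_state))
              for enc in encoder_states]
    m = max(scores)                                  # for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(encoder_states[0])
    summary = [sum(w * enc[i] for w, enc in zip(weights, encoder_states))
               for i in range(dim)]
    return weights, summary

# three 2-dimensional toy encoder states, one toy decoder state
enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, summary = attention(enc, [1.0, 0.0])
assert abs(sum(weights) - 1.0) < 1e-9   # softmax weights sum to 1
```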
15. POS Tagging Experiments on Tiger
Goal: Check if our tagger is state-of-the-art.
Tiger is the standard corpus for German POS tagging
Resulting accuracy for POS and morphological tagging
system           test
Heigold          93.23
our tagger       93.42
Heigold+emb      93.85
our tagger+emb   93.88
16. POS Tagging Experiments on the ReM corpus
Corpus Description
Reference Corpus for Middle High German
period 1050–1350
diplomatically transcribed
semi-automatically annotated with
normalized spelling
part of speech
lemma
morphological features
2.9 times more transcribed word types than normalized word types
Contracted word forms such as inhandon (in handon, in hand) are annotated with two tags.
7830 different fine-grained tags including these combined tags
17. POS Tagging Experiments on the ReM corpus
Resulting POS and morphological tagging accuracies
system                  test
TreeTagger (POS)        90.40
our tagger (POS)        95.88
our tagger (POS+morph)  89.45
POS+morph accuracy on unseen words: 79%
POS+morph accuracy on seen words with unseen tag: 57%
18. POS Tagging Experiments on Other Corpora
Corpus size tags dev test
Middle Dutch Gysseling corpus 1.5M 3864 91.10 91.01
Syntactic Reference Corpus of Medieval French 1.2M 68 96.45 96.28
York-Toronto-Helsinki Parsed C. of Old English 1.6M 298 97.37 97.13
Ancient Greek Dependency Treebank 2.1 548K 1041 91.35 91.29
Icelandic Parsed Historical Corpus 383K 319 93.88 89.87
Corpus Taurinese (Old Italian) 246K 239 98.46 98.43
GerManC historical German newspaper corpus 770K 62 97.25 98.23
GerManC historical German newspaper corpus 770K 2303 85.65 87.72
The accuracy is high if the tagset is small.
Exception: Icelandic
19. Lemmatization Experiments on ReM
Encoder-decoder system DL4MT by Kyunghyun Cho
originally developed for machine translation
used for morphological reinflection by Kann & Schütze
input: wordform + POS (split into a sequence of letters)
f r o g e t e # V V F I N . * . P a s t . S g . 3
output: lemma (sequence of letters)
v r â g e n
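Building this input format is a simple transformation:

```python
def lemmatizer_input(wordform, pos_tag):
    # split the wordform into letters, append '#' as a separator, then the
    # POS tag also split into single symbols (format shown on the slide)
    return " ".join(list(wordform) + ["#"] + list(pos_tag))

assert lemmatizer_input("frogete", "VVFIN.*.Past.Sg.3") == \
    "f r o g e t e # V V F I N . * . P a s t . S g . 3"
```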
20. Lemmatization Experiments on ReM
Test accuracy on unseen word-tag combinations
test
overall 91.87
unseen words 86.43
seen words with unseen tags 93.33
21. Conclusions
The annotation of historical corpora is difficult due to dialectal and spelling variations.
Neural networks are well suited to deal with this variation.
The letter-based word representations
are robust to spelling variations and
reflect the similarity of words.
NN-based POS taggers outperform the traditional POS taggers used in previous work on historical corpora by a large margin.
NN-based lemmatizers also produce accurate results on historical corpora.
The new tagger and lemmatizer (jointly released as RNNTagger) are freely available for non-commercial purposes via my homepage:
http://www.cis.lmu.de/~schmid
22. Thank you for your attention!
23. POS Tagger: Network Sizes
Letter LSTM layers: 1
Word LSTM layers: 2
Letter embeddings size: 100
Letter LSTM size: 400
Word LSTM size: 400
Dropout Rate: 0.5 (no dropout on embeddings)
Training algorithm: SGD
Initial learning rate: 1.0 (no average across batch)
Learning Rate decay: 5% after each epoch
Gradient clipping threshold: 1.0
Early Stopping
minimal character frequency: 2
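Two of these settings can be made concrete. The sketch below assumes multiplicative decay and L2-norm gradient clipping, which is the usual reading of these hyperparameters:

```python
def lr_schedule(initial_lr=1.0, decay=0.05, epochs=5):
    # learning rate decays by 5% after each epoch (values from the slide;
    # multiplicative decay is an assumption)
    lrs = []
    lr = initial_lr
    for _ in range(epochs):
        lrs.append(lr)
        lr *= 1.0 - decay
    return lrs

def clip_gradient(grad, threshold=1.0):
    # rescale the gradient vector if its L2 norm exceeds the threshold
    norm = sum(g * g for g in grad) ** 0.5
    if norm > threshold:
        grad = [g * threshold / norm for g in grad]
    return grad

lrs = lr_schedule(epochs=3)
assert abs(lrs[1] - 0.95) < 1e-12 and abs(lrs[2] - 0.9025) < 1e-12
clipped = clip_gradient([3.0, 4.0])          # norm 5.0 → rescaled to norm 1.0
assert abs(sum(g * g for g in clipped) ** 0.5 - 1.0) < 1e-9
```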
24. Lemmatizer: Network Sizes
Encoder layers: 2
Decoder layers: 1
Letter embeddings size: 100
Encoder LSTM size: 400
Decoder LSTM size: 400
Dropout Rate: 0.5 (no dropout on embeddings)
Training algorithm: Adam
Initial learning rate: 0.001
Learning Rate decay: 5% after each epoch
Gradient clipping threshold: 1.0
Early Stopping
tied decoder input and output embeddings
26. Most Frequent POS Tag Confusions
freq correct erroneous example words
202 NE FM Auguſti, Rome, Stephani
111 ADJA NA ceſue, zeſwe, vordern
105 NA PTKNEG niht, nicht, nit
102 NA ADJA weiſen, ríndeſpuch, oel
100 DRELS DDS der, ds, dv, die
82 DDS DDART der, daz, den, ds
76 NA AVD vil, iht, mer, luͦte
71 NA ADJD gut, weich, recht, guͦt
61 ADJA AVD mer, leſtin, mínr, mere
60 DRELS DDART der, ds, di
56 AVD NA vil, her, mer
56 DRELS KOUS daz, Dat, Du, dc
55 PPER VAFIN is, ſi, ſei
55 PTKNEG NA niht, nit, níht
53 ADJD AVD her, ſuͦze, ſtete
52 PTKVZ AVD vor, wids, wider, ane
51 VVFIN NA lant, zûg, zspriſt
28. Application: Extraction of Negation Clitics
Middle High German negation particles are often realized as clitics:
clitic prefix:
enſprecheſt (nicht sprichst, not speak+2sg)
nemochte (nicht mochte, not wanted)
niſt (nicht ist, not is)
clitic suffix:
erne (er nicht, he not)
wirn (wir nicht, we not)
double prefix:
ſinenthewalten (sich nicht entwalten, himself not restrain)
29. Application: Extraction of Negation Clitics
Goal: Automatic extraction of words with negation clitics
Steps:
Training of the tagger on the ReM training corpus
Annotation of the test corpus
Extraction of all words tagged with PTKNEG-..., ...-PTKNEG, or
...-PTKNEG-...
Resulting precision: 96.4%
Resulting recall: 97.8%
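The extraction step can be sketched as a filter over combined tags; the example words and their tags below are invented for illustration:

```python
def has_negation_clitic(tag):
    # a word carries a negation clitic if PTKNEG appears as one component of a
    # combined tag (PTKNEG-..., ...-PTKNEG, or ...-PTKNEG-...); a bare PTKNEG
    # tag marks the free-standing particle, not a clitic
    parts = tag.split("-")
    return len(parts) > 1 and "PTKNEG" in parts

# toy tagged text (tags invented for illustration)
tagged = [("niſt", "PTKNEG-VAFIN"), ("erne", "PPER-PTKNEG"),
          ("trohtin", "NA"), ("niht", "PTKNEG")]
extracted = [w for w, t in tagged if has_negation_clitic(t)]
assert extracted == ["niſt", "erne"]
```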
30. Recurrent Neural Network
[Figure: a simple NN vs. an RNN, which feeds its own state back as part of the next input]
ht = tanh(W (ht−1 ◦ x))
ht is the RNN state at time t
(= vector with the activations of all RNN neurons)
x is the input vector
The previous RNN state ht−1 and the input x are concatenated
The resulting vector is multiplied with the matrix W
(= linear projection to a new space)
Finally, the activation function (here tanh) is applied
(squashes the values of the vector to the range [-1,1])
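The update can be spelled out with toy numbers:

```python
import math

def rnn_step(W, h_prev, x):
    # ht = tanh(W (ht-1 ◦ x)): concatenate, project with W, squash with tanh
    hx = h_prev + x                       # concatenation of state and input
    return [math.tanh(sum(w * v for w, v in zip(row, hx))) for row in W]

# 2 state neurons, 1-dimensional input → W is 2 x 3 (toy values)
W = [[0.5, -0.3, 0.8],
     [0.1, 0.7, -0.2]]
h = rnn_step(W, [0.0, 0.0], [1.0])
assert len(h) == 2 and all(-1.0 <= v <= 1.0 for v in h)
```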