Deep Learning-Based Morphological Taggers and
Lemmatizers for Annotating Historical Texts
Helmut Schmid
Centrum für Informations- und Sprachverarbeitung
Ludwig-Maximilians-Universität München
schmid@cis.lmu.de
POS Tagging and Lemmatization
word part-of-speech tag lemma
do AVD dô
begagenda VVFIN.Ind.Past.Sg.3 be-gègenen
imo PPER.Masc.Dat.Sg.3 ër
min DPOSA.Masc.Nom.Sg.* mîn
trohtin NA.Masc.Nom.Sg.st truhtîn
mit APPR mit
ſinero DPOSA.Fem.Dat.Sg.st sîn
arngrihte NA.Fem.Dat.Sg.st êre-grëhte
. $_
Example sentence from the Middle High German Reference Corpus
High Spelling Variation in Historical Texts
Variants of the Middle High German word tuon (to do)
Dialectal variation
tuon, dun, doyn
Spelling variation
tuon, thuon, tuen
Script variation
tuon, tvon, tûon, tuͦn, tvͦn, toͮn
Overview
Word representations
POS tagger architecture
Lemmatizer architecture
Evaluations
Conclusions
Tagging Methods
Statistical Taggers
the most widely used class of annotation systems
They learn from manually annotated data.
Unseen words cause problems:
If thuon was never seen, what should its POS tag be?
Guessing the POS tag based on the final letters does not always help:
The most frequent word with the same three-letter suffix is the preposition uon (von).
Word Representations
Statistical systems represent words as atomic units.
We need a better word representation
which reflects word similarity
and provides good representations for unseen words
Solution
Representation of words with an n-dimensional vector (embedding)
(-0.7, 1.5, 2.8, -5.5, 0.2, ...)
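As a minimal illustration (not from the talk; PyTorch and all sizes are assumptions), such an embedding is just a row of a trainable lookup table:

```python
import torch
import torch.nn as nn

vocab_size, emb_dim = 10000, 300               # hypothetical sizes
embedding = nn.Embedding(vocab_size, emb_dim)  # one trainable vector per word

word_ids = torch.tensor([42, 7, 911])          # hypothetical word IDs
vectors = embedding(word_ids)                  # shape: (3, 300), one embedding per word
```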
Word Embeddings
[Figure: word embeddings plotted in two dimensions; similar words cluster together, e.g. Brussels/London/Paris near Belgium/UK/France, orange/blue/pink, house/building, small/tiny, walk/run, one/two/three/five]
Embeddings are points in an n-dimensional space.
Embeddings of similar words are close to each other.
Higher-dimensional spaces can model different types of similarity.
syntactic, semantic etc.
Such embeddings can be learned with neural networks.
Part-of-Speech Tagging
[Figure: the input words "The quick brown ... dog" are mapped to embeddings, processed by a bi-RNN, and assigned the tags DT JJ JJ ... NN]
The embedding sequence is processed with a bi-RNN
⇒ representation for each word in context
The output tag is predicted from this contextual representation.
The POS tagger is trained end-to-end on a manually annotated text corpus.
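A minimal sketch of this tagger (PyTorch is an assumption; the sizes are illustrative, not the values used in the talk):

```python
import torch
import torch.nn as nn

class BiRNNTagger(nn.Module):
    """Embeddings -> bi-RNN -> per-word tag scores, as described above."""
    def __init__(self, vocab_size, num_tags, emb_dim=100, hidden=200):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.birnn = nn.LSTM(emb_dim, hidden, batch_first=True,
                             bidirectional=True)
        self.out = nn.Linear(2 * hidden, num_tags)  # tag prediction per word

    def forward(self, word_ids):                    # (batch, seq_len)
        contextual, _ = self.birnn(self.emb(word_ids))
        return self.out(contextual)                 # (batch, seq_len, num_tags)
```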
Dealing With Unseen Words
The word embeddings of unseen words cannot be learned from the
annotated data.
Solution 1: Pre-train the embeddings on a large unannotated corpus
(e.g. by predicting the current word based on the left and right context)
Solution 2: Compute a word representation from the letter sequence.
The two methods can also be combined.
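A sketch of Solution 1 in the CBOW style (the talk does not name a specific pre-training method, so this particular objective is an assumption): the current word is predicted from the averaged embeddings of its context words.

```python
import torch
import torch.nn as nn

class CBOW(nn.Module):
    """Predict the current word from the mean of its context embeddings;
    after training on a large unannotated corpus, self.emb holds the
    pre-trained word embeddings."""
    def __init__(self, vocab_size, emb_dim=100):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.out = nn.Linear(emb_dim, vocab_size)

    def forward(self, context_ids):      # (batch, window) of word IDs
        mean = self.emb(context_ids).mean(dim=1)
        return self.out(mean)            # scores over the vocabulary
```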
Letter-Based Word Representations
The letters are represented with letter
embeddings
The embedding sequence is processed with
a bi-RNN
The two final representations are
concatenated.
⇒ letter-based word representation
Full Network
[Figure: the full network; letter-level bi-RNNs turn the letters of each word (e.g. "t h e") into word representations, and the word-level bi-RNN predicts the tags DT, JJ, NN, ...]
Advantages
The letter-based RNN
learns regular spelling variations
such as u ↔ v, uo ↔ uͦ etc.
generalizes from words to their unseen spelling variants:
tuon → thvͦn
provides high quality word representations for unseen words
uses fewer parameters than an embedding table
is slower, but the word representations can be precomputed
Technical Details of the POS Tagger
Word representations
character-based representation computed with a biRNN
Forward-RNN processes a 10-letter suffix of the word
Backward-RNN processes a 10-letter prefix of the word
Prefixes/Suffixes are padded with padding symbols if the word is
shorter.
⇒ faster and simpler (parallel) processing
optionally: pre-trained word embeddings
concatenated with the character-based representation
most useful when the training corpus is small
RNN over words
uses two biRNN layers with residual connections
i.e. the input is added to the output
All RNNs are implemented with LSTMs.
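A sketch of the character-based word representation under these settings (PyTorch is an assumption; char2id is a hypothetical letter-to-ID mapping):

```python
import torch
import torch.nn as nn

PAD = 0  # assumed ID of the padding symbol

def suffix_ids(word, char2id, n=10):
    """IDs of the last n letters, left-padded if the word is shorter."""
    ids = [char2id[c] for c in word[-n:]]
    return [PAD] * (n - len(ids)) + ids

def prefix_ids_reversed(word, char2id, n=10):
    """IDs of the first n letters in reverse order, right-padded."""
    ids = [char2id[c] for c in reversed(word[:n])]
    return ids + [PAD] * (n - len(ids))

class CharWordRepr(nn.Module):
    """Forward LSTM over the 10-letter suffix, backward LSTM over the
    10-letter prefix (fed in reverse); the final states are concatenated.
    Sizes follow the slides: 100-dim letter embeddings, 400-dim LSTMs."""
    def __init__(self, num_chars, emb_dim=100, hidden=400):
        super().__init__()
        self.emb = nn.Embedding(num_chars, emb_dim, padding_idx=PAD)
        self.fwd = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.bwd = nn.LSTM(emb_dim, hidden, batch_first=True)

    def forward(self, suffix, prefix_rev):       # each: (batch, 10) letter IDs
        _, (h_f, _) = self.fwd(self.emb(suffix))
        _, (h_b, _) = self.bwd(self.emb(prefix_rev))
        return torch.cat([h_f[-1], h_b[-1]], dim=-1)   # (batch, 800)
```

The fixed 10-letter windows make every word the same length, which is what allows the faster, simpler parallel processing mentioned above.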
Lemmatizer
[Figure: encoder-decoder with attention; the encoder reads the letters "b a b i e s", the attention module (Σ) summarizes the encoder states for the current decoder state, and the decoder generates the output letters "b a b ..." following the start symbol <s>]
Encoder-decoder architecture with attention
The encoder computes a representation for each letter in context.
The decoder generates the output letters one by one.
The attention module computes an input summary conditioned on the
current decoder state.
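A sketch of the attention step (dot-product scoring is an assumption; the actual scoring function may differ):

```python
import torch
import torch.nn.functional as F

def attention_summary(decoder_state, encoder_states):
    """Score each encoder state against the current decoder state,
    normalize with softmax, and return the weighted sum as the summary.
    decoder_state:  (batch, dim); encoder_states: (batch, src_len, dim)"""
    scores = torch.bmm(encoder_states, decoder_state.unsqueeze(2))  # (batch, src_len, 1)
    weights = F.softmax(scores, dim=1)
    return (weights * encoder_states).sum(dim=1)   # (batch, dim) input summary
```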
POS Tagging Experiments on Tiger
Goal: Check if our tagger is state-of-the-art.
Tiger is the standard corpus for German POS tagging
Resulting accuracy for POS and morphological tagging
system           test
Heigold          93.23
our tagger       93.42
Heigold+emb      93.85
our tagger+emb   93.88
POS Tagging Experiments on the ReM corpus
Corpus Description
Reference Corpus for Middle High German
period 1050–1350
diplomatically transcribed
semi-automatically annotated with
normalized spelling
part of speech
lemma
morphological features
2.9 times more transcribed word types than normalized word types
Contracted word forms such as inhandon (in handon, in hand) are
annotated with two tags
7830 different fine-grained tags including these combined tags
POS Tagging Experiments on the ReM corpus
Resulting POS and morphological tagging accuracies
system                 test
TreeTagger POS         90.40
our tagger POS         95.88
our tagger POS+morph   89.45
POS+morph accuracy on unseen words: 79%
POS+morph accuracy on seen words with unseen tag: 57%
POS Tagging Experiments on Other Corpora
Corpus size tags dev test
Middle Dutch Gysseling corpus 1.5M 3864 91.10 91.01
Syntactic Reference Corpus of Medieval French 1.2M 68 96.45 96.28
York-Toronto-Helsinki Parsed C. of Old English 1.6M 298 97.37 97.13
Ancient Greek Dependency Treebank 2.1 548K 1041 91.35 91.29
Icelandic Parsed Historical Corpus 383K 319 93.88 89.87
Corpus Taurinese (Old Italian) 246K 239 98.46 98.43
GerManC historical German newspaper corpus 770K 62 97.25 98.23
GerManC historical German newspaper corpus 770K 2303 85.65 87.72
The accuracy is high if the tagset is small.
Exception: Icelandic
Lemmatization Experiments on ReM
Encoder-decoder system DL4MT by Kyunghyun Cho
originally developed for machine translation
used for morphological reinflection by Kann & Schütze
input: wordform + POS (split into a sequence of letters)
f r o g e t e # V V F I N . * . P a s t . S g . 3
output: lemma (sequence of letters)
v r â g e n
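A sketch of this encoding as a hypothetical helper (both parts are split into single symbols around a '#' separator):

```python
def lemmatizer_input(wordform, pos_tag):
    """Build the input symbol sequence from a wordform and its POS tag."""
    return list(wordform) + ["#"] + list(pos_tag)

print(lemmatizer_input("frogete", "VVFIN.*.Past.Sg.3"))
# ['f', 'r', 'o', 'g', 'e', 't', 'e', '#', 'V', 'V', 'F', 'I', 'N', '.', '*', '.', 'P', ...]
```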
Lemmatization Experiments on ReM
Test accuracy on unseen word-tag combinations
subset                        test
overall                       91.87
unseen words                  86.43
seen words with unseen tags   93.33
Conclusions
The annotation of historical corpora is difficult due to dialectal and
spelling variations.
Neural networks are well suited to deal with this variation.
The letter-based word representations
are robust to spelling variations and
reflect the similarity of words.
NN-based POS taggers outperform traditional POS taggers used in
previous work on historical corpora by a large margin.
NN-based lemmatizers also produce accurate results on historical
corpora.
The new tagger and lemmatizer (called RNNTagger) is freely
available for non-commercial purposes via my homepage
http://www.cis.lmu.de/~schmid
Thank you for your attention!
POS Tagger: Network Sizes
Letter LSTM layers: 1
Word LSTM layers: 2
Letter embeddings size: 100
Letter LSTM size: 400
Word LSTM size: 400
Dropout Rate: 0.5 (no dropout on embeddings)
Training algorithm: SGD
Initial learning rate: 1.0 (no average across batch)
Learning Rate decay: 5% after each epoch
Gradient clipping threshold: 1.0
Early Stopping
minimal character frequency: 2
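A sketch of this training setup (PyTorch is an assumption; model, loss_fn, batches, and max_epochs are hypothetical):

```python
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=1.0)
# 5% decay after each epoch = multiply the learning rate by 0.95
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)

for epoch in range(max_epochs):
    for words, tags in batches:
        optimizer.zero_grad()
        loss = loss_fn(model(words), tags)   # summed, not averaged, over the batch
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
    scheduler.step()
```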
Lemmatizer: Network Sizes
Encoder layers: 2
Decoder layers: 1
Letter embeddings size: 100
Encoder LSTM size: 400
Decoder LSTM size: 400
Dropout Rate: 0.5 (no dropout on embeddings)
Training algorithm: Adam
Initial learning rate: 0.001
Learning Rate decay: 5% after each epoch
Gradient clipping threshold: 1.0
Early Stopping
tied decoder input and output embeddings
Lemmatization Errors
POS tag word lemma predicted
NA.Masc.Akk.Sg.st karger karkære kargære
VVFIN.Ind.Pres.Pl.3 gan’lâzzont geantlâzen gantlâzen
NA.Masc,Fem.Akk.Pl.* wiſeblumen wisebluome wîsbluome
NA.Fem.Nom.Sg.* mæſſenîe massenîe mèssenîe
ADJD.Pos unkunde unkund unkünde
VVFIN.Ind.Pres.Sg.3 geith gân jëhen
NE.Nom.Sg Franzze Franze Franke
ADJA.Pos.Neut.Akk.Pl.st gemaliv gemâl mâl
NA.Masc.Akk.Sg.st buneiz punèiz bunèiz
PAVAP_VVPP inchomen in_komen în_komen
NA.Neut.Dat.Sg.st geſtiche gèstach gestifte
NA.Neut.Nom.Sg.st uber_zinder überzimber überzinter
NA.Masc.Dat.Pl.st rieterin rihtære rîtære
AVD ennot ennoète ennôte
NA.Neut.Nom.Sg.st vzkuchen ûzkûchen ûzkiuchen
AVD_VVIMP.Sg.2 anegenke ane_dènken ane_gènken
NE.Dat.Sg cōſtenopele Constantinopel Konstantînôpol
VVINF murmuren murmerièren murmeln
Most Frequent POS Tag Confusions
freq correct erroneous example words
202 NE FM Auguſti, Rome, Stephani
111 ADJA NA ceſue, zeſwe, vordern
105 NA PTKNEG niht, nicht, nit
102 NA ADJA weiſen, ríndeſpuch, oel
100 DRELS DDS der, dˢ, dv, die
82 DDS DDART der, daz, den, dˢ
76 NA AVD vil, iht, mer, luͦte
71 NA ADJD gut, weich, recht, guͦt
61 ADJA AVD mer, leſtin, mínr, mere
60 DRELS DDART der, dˢ, di
56 AVD NA vil, her, mer
56 DRELS KOUS daz, Dat, Du, dc
55 PPER VAFIN is, ſi, ſei
55 PTKNEG NA niht, nit, níht
53 ADJD AVD her, ſuͦze, ſtete
52 PTKVZ AVD vor, widˢ, wider, ane
51 VVFIN NA lant, zûg, zˢpriſt
Most Frequent Morphological Tag Confusions
freq correct erroneous
220 VVFIN.Ind.Past.Sg.3 VVFIN.*.Past.Sg.3
210 PPER.Masc.Nom.Pl.3 PPER.*.Nom.Pl.3
78 VAFIN.Ind.Past.Pl.3 VAFIN.*.Past.Pl.3
60 DDART.Masc.Nom.Pl.* DDART.*.Nom.Pl.*
58 NE.Gen.Sg NE.Nom.Sg
56 DDART.Fem.Akk.Sg.* DDART.Fem.Nom.Sg.*
53 VVFIN.Ind.Past.Pl.3 VVFIN.*.Past.Pl.3
51 PPER.Neut.Nom.Pl.3 PPER.*.Nom.Pl.3
51 NA.Fem.Akk.Sg.st NA.Fem.Nom.Sg.st
50 PPER.Masc,Neut.Nom.Pl.3 PPER.*.Nom.Pl.3
48 NA.Neut.Akk.Sg.st NA.Neut.Nom.Sg.st
47 PPER.Fem.Nom.Pl.3 PPER.*.Nom.Pl.3
47 NA.Neut.Akk.Pl.st NA.Neut.Akk.Sg.st
47 DRELS.Masc.Nom.Pl DRELS.*.Nom.Pl
46 NA.Fem.Dat.Sg.st NA.Fem.Gen.Sg.st
42 NA.Neut.Nom.Sg.st NA.Neut.Akk.Sg.st
42 DDS.Masc.Nom.Pl.* DDS.*.Nom.Pl.*
40 PPER.Neut.Akk.Sg.3 PPER.Neut.Gen.Sg.3
Application: Extraction of Negation Clitics
Middle High German negation particles are often realized as clitics:
clitic prefix:
enſprecheſt (nicht sprichst, not speak+2sg)
nemochte (nicht mochte, not wanted)
niſt (nicht ist, not is)
clitic suffix:
erne (er nicht, he not)
wirn (wir nicht, we not)
double prefix:
ſinenthewalten (sich nicht entwalten, himself not restrain)
Application: Extraction of Negation Clitics
Goal: Automatic extraction of words with negation clitics
Steps:
Training of the tagger on the ReM training corpus
Annotation of the test corpus
Extraction of all words tagged with PTKNEG-..., ...-PTKNEG, or
...-PTKNEG-...
Resulting precision: 96.4%
Resulting recall: 97.8%
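A sketch of the extraction and scoring steps (the data structures are hypothetical; '-' as the separator of combined tags follows the pattern above):

```python
def has_negation_clitic(tag):
    """True for combined tags such as PTKNEG-..., ...-PTKNEG, ...-PTKNEG-..."""
    parts = tag.split("-")
    return len(parts) > 1 and "PTKNEG" in parts

# auto_tags and gold_tags: hypothetical parallel lists, one tag per test token
pred = {i for i, t in enumerate(auto_tags) if has_negation_clitic(t)}
gold = {i for i, t in enumerate(gold_tags) if has_negation_clitic(t)}

precision = len(pred & gold) / len(pred)   # 96.4% in the talk
recall    = len(pred & gold) / len(gold)   # 97.8% in the talk
```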
Recurrent Neural Network
[Figure: a simple feed-forward NN compared with an RNN, whose state is fed back as part of the next input]
h_t = tanh(W (h_{t−1} ∘ x))
h_t is the RNN state at time t
(= vector with the activations of all RNN neurons)
x is the input vector
The previous RNN state h_{t−1} and the input x are concatenated
The resulting vector is multiplied with the matrix W
(= linear projection to a new space)
Finally, the activation function (here tanh) is applied
(squashes the values of the vector to the range [-1,1])
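A sketch of this update rule in NumPy (sizes are illustrative):

```python
import numpy as np

def rnn_step(W, h_prev, x):
    """One RNN step: concatenate state and input, project with W, apply tanh."""
    return np.tanh(W @ np.concatenate([h_prev, x]))

state_dim, input_dim = 4, 3
W = np.random.randn(state_dim, state_dim + input_dim)
h = np.zeros(state_dim)
for x in np.random.randn(5, input_dim):   # run the RNN over a 5-step input sequence
    h = rnn_step(W, h, x)
```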