The UniMorph Project and Morphological Reinflection Task: Past, Present, and Future
1. UniMorph and Morphological Inflection Task: Past, Present, and Future
Ekaterina Vylomova@
@
University of Melbourne
ekaterina.vylomova@unimelb.edu.au
20 August 2021
Ekaterina Vylomova UniMorph and Morphological Inflection Task 20 August 2021 1 / 115
2. PART I: The UniMorph Project
4. Speech is Special
Charles F. Hockett on Essential Properties of Human Languages
Displacement
Ability to refer to things in space and time and communicate about things that are not present
Productivity
Ability to create new and unique meanings of utterances from previously existing utterances and
sounds
5. Speech is Special
Charles F. Hockett on Essential Properties of Human Languages
Duality of Patterning
Meaningless phonic segments (phonemes) are combined to make meaningful words, etc.
Learnability
A speaker of a language can learn another language
6. Linguistic Diversity
Roman Jakobson on Differences between Languages
“Languages differ essentially in what they must convey and not in what they may convey”
7. Languages differ in many ways!
(1) Chinese (Isolating)
wǒmen xué le zhè xiē shēngcí.
I.PL.AN learn PAST this PL new.word
“We learned these new words.”
(2) Russian (Synthetic)
My vyučili eti novyje slova.
We.NOM learn.PAST.PL this.ACC.PL new.ACC.PL word.ACC.PL
“We learned these new words.”
8. Languages differ in many ways!
An example of West Greenlandic taken from Fortescue (2017):
(3) West Greenlandic (Polysynthetic)
Nannu-n-niuti-kkuminar-tu-rujussu-u-vuq.
Polar.bear-catch-instrument.for.achieving-something.good.for-PART-big-be-3SG.INDIC
“It (a dog) is good for catching polar bears with.”
9. Languages differ in many ways!
An example of Kunwinjku taken from Evans (2003):
(4) Kunwinjku (Polysynthetic)
Aban-yawoith-warrgah-marne-ganj-ginje-ng.
1/3PL-again-wrong-BEN-meat-cook-PP
“I cooked the wrong meat for them again”
Discussion of what should be considered a word:
John Mansfield’s “The word as a unit of internal predictability”
11. Languages differ in many ways!
Some exhibit rich grammatical case systems (e.g., 12 in Erzya and 24 in Veps)
Some mark possessiveness
Others might have complex verbal morphology (e.g., Oto-Manguean languages)
Even “decline” nouns for tense (e.g., Tupi–Guarani languages)
12. Languages differ in many ways!
Let’s Discuss The Following Dimensions:
Fusion
Inflectional Synthesis
Position of Case Affixes
14. Fusion (WALS 20A)
From isolating to concatenative
Concatenative morphology is the most common system
Non-linearities such as ablaut or tonal morphology can also be present
Isolating languages: the Sahel Belt in West Africa, Southeast Asia and the Pacific
Tonal–concatenative morphology can be found in Mesoamerican languages
15. Inflectional Synthesis of the Verb (WALS 22A)
Analytic expressions are common in Eurasia
Synthetic expressions are used to a high degree in the Americas
17. Position of Case Affixes (WALS 51A)
Can variably surface as prefixes, suffixes, infixes, or circumfixes
Suffixation: Most Eurasian and Australian languages
to a lesser extent in South American and New Guinean languages
Prefixation: Mesoamerican languages and African languages spoken south of the Sahara
19. The Earliest Approach to Morphology (Sanskrit)
Pāṇini’s kārakas
Formalize regularities in the words
Inflectional Morphology is Paradigmatic
20. ..or Russian Morphology
Morphological Inflection
Formalize regularities in the words
Formalizations differ: the number of cases may vary from 6 to 11 (Zaliznyak, 1967)
21. Inflectional Morphology: Paradigms (nouns)
Morphological Inflection
беглец “runner” + pos=N,case=ACC,num=SG → беглеца
ru-noun-table | b | беглец | a=an
23. Inflectional Morphology: Classes (nouns); *Differs in En/Ru Editions of Wiktionary*
Morphological Inflection
беглец + pos=N,case=ACC,num=SG → беглеца
EN Wiktionary: ru-noun-table | b | беглец | a=an
RU Wiktionary:сущ ru m a 5b|основа=беглец|основа1=беглец|слоги=по-слогам|бег|лец
24. Inflectional Morphology: Wiktionary Annotation is Not Cross-linguistically Consistent
Other Languages
Hungarian
Wiktionary: Inconsistent annotation across languages
Within a single language: across different editions (en, ru, de, etc.)
Many language-specific features
25. Linguistic Diversity and Universals
Universal Grammar
Evans and Levinson, 2009: The Myth of Language Universals
“Diversity can be found at almost every level of linguistic organization”
Languages vary greatly on phonological, morphological, semantic, and
syntactic levels
Typology: describe the limits of cross-linguistic variation
Haspelmath, 2010
Descriptive categories (specific to languages) vs. comparative concepts.
28. UniMorph – Universal Annotation
Universal Annotation (by John Sylak-Glassman and David Yarowsky)
1) 23 dimensions of meaning (TAM, case, number, animacy), 212 features
2) A-morphous (word-based) morphology (Anderson, 1992)
3) Initial paradigms were mainly extracted from the English edition of Wiktionary (Kirov et al., 2016)
https://unimorph.github.io/
[Sylak-Glassman, 2016]
32. PART II: SIGMORPHON Shared Tasks on Morphological (Re-)inflection
33. Morphological (Re-)Inflection
SIGMORPHON Shared Task 2016–2019
Inflection: PLAY + PRESENT PARTICIPLE → playing
Reinflection: played + PRESENT PARTICIPLE → playing
Lemma Tag Form
RUN PAST ran
RUN PRES;1SG run
RUN PRES;2SG run
RUN PRES;3SG runs
RUN PRES;PL run
RUN PART running
2018: ~96% accuracy on average in the high-resource setting, but much lower in the low-resource setting
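The lemma–tag–form table above follows UniMorph's three-column, tab-separated format (lemma, inflected form, feature bundle). A minimal loader sketch, with illustrative rows and tags:

```python
# Parse UniMorph-style "lemma<TAB>form<TAB>tags" triples into a
# (lemma -> tag -> form) lookup, i.e. the gold table for inflection.
from collections import defaultdict

def load_paradigms(lines):
    paradigms = defaultdict(dict)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        lemma, form, tags = line.split("\t")
        paradigms[lemma][tags] = form
    return paradigms

sample = [
    "run\tran\tV;PST",
    "run\truns\tV;PRS;3;SG",
    "run\trunning\tV;V.PTCP;PRS",
]
table = load_paradigms(sample)
print(table["run"]["V;PST"])  # → ran
```

The inflection task then asks a system to predict the third column given the other two; reinflection replaces the lemma in the input with another inflected form.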
35. SIGMORPHON 2016 Shared Task (Cotterell et al., 2016)
Morphological (Re-)Inflection (10 Languages): Neural encoder–decoders
1) Character-level input: <s> r u n OUT_POS=V OUT_NUM=SG OUT_TENSE=PRES </s>; output: <s> r u n s </s>
2) Ensembles of seq2seq (GRUs + soft attention (Bahdanau et al., 2015))
3) Enriching the data with combinations of other (non-lemma) forms
[Kann and Schuetze, 2016]
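The character-level input format in 1) can be sketched in a few lines; the OUT_* tag names follow the slide's example rather than any fixed scheme:

```python
# Build the source sequence fed to the character-level encoder:
# lemma characters plus target-tag symbols, wrapped in boundary markers.
def encode_input(lemma, tags):
    return ["<s>"] + list(lemma) + tags + ["</s>"]

src = encode_input("run", ["OUT_POS=V", "OUT_NUM=SG", "OUT_TENSE=PRES"])
# src == ["<s>", "r", "u", "n", "OUT_POS=V", "OUT_NUM=SG", "OUT_TENSE=PRES", "</s>"]
```

Treating each tag as a single symbol (rather than spelling it out character by character) keeps the source sequence short and lets the decoder attend to morphosyntactic features directly.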
39. SIGMORPHON 2016 Shared Task (Cotterell et al., 2016)
Morphological (Re-)Inflection (10 Languages): Neural encoder–decoders
1) Extract input–output string alignments; 2) train seq2seq (LSTM-based) models to learn a sequence of operations (hard monotonic attention)
[Aharoni and Goldberg, 2017]
41. SIGMORPHON 2016 Shared Task (Cotterell et al., 2016)
Morphological (Re-)Inflection (10 Languages): Neural
encoder–decoders
1) Extract input–output string alignments; 2) Train seq2seq (LSTM-based)
models to learn a sequence of operations (hard monotonic attention)
Errors
глядеть pos=V,tense=PRS,per=1,num=SG,aspect=IPFV gold: гляжу predicted: глядею
увлекаться pos=V,tense=PRS,per=1,num=SG,aspect=IPFV gold: увлекаюсь
predicted: увлеклюсь
звать pos=V,tense=PRS,per=3,num=SG,aspect=IPFV gold: зовёт predicted: звает
[Aharoni and Goldberg, 2017]
42. SIGMORPHON 2016 Shared Task (Cotterell et al., 2016)
Errors
зять pos=N,case=GEN,num=PL gold: зятьёв predicted: зятей
перстень pos=N,case=GEN,num=PL gold: перстней predicted: перстеее
телекамера pos=N,case=GEN,num=PL gold: телекамер predicted: телекаморо
[Aharoni and Goldberg, 2017]
43. SIGMORPHON 2016 Shared Task (Cotterell et al., 2016)
Errors
лоботряс pos=N,case=ACC,num=PL gold: лоботрясов predicted: лоботрясы
львица pos=N,case=ACC,num=PL gold: львиц predicted: львица
милиционер pos=N,case=ACC,num=PL gold: милиционеров predicted: милиционеры
светлячок pos=N,case=ACC,num=PL gold: светлячков predicted: светлячки
скот pos=N,case=ACC,num=PL gold: скотов predicted: скоты
счёт pos=N,case=ACC,num=PL gold: счета predicted: счеты
[Aharoni and Goldberg, 2017]
44. CoNLL–SIGMORPHON 2017 Shared Task (Cotterell et al., 2017)
Universal Morphological Reinflection (52 Languages)
Task1: Morphological Inflection
Task2: Paradigm Completion
45. CoNLL–SIGMORPHON 2017 Shared Task (Cotterell et al., 2017)
Universal Morphological Reinflection (52 Languages)
3 Settings: Low (100 samples), Medium (1000), High (10,000)
Samples drawn based on token frequency in a Wikipedia corpus (with resampling for syncretic slots)
[Cotterell et al., 2017]
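Frequency-based sampling can be sketched as below. This shows only the frequency-weighting idea; the shared task's actual procedure also resampled syncretic slots, and the entries and counts here are illustrative:

```python
import random

# Sample training entries proportionally to corpus token frequency,
# so frequent inflected forms are more likely to enter the split.
def sample_by_frequency(entries, freqs, k, seed=0):
    rng = random.Random(seed)
    weights = [freqs.get(form, 1) for (_, form, _) in entries]
    return rng.choices(entries, weights=weights, k=k)

entries = [("run", "ran", "V;PST"), ("run", "runs", "V;PRS;3;SG")]
freqs = {"ran": 500, "runs": 1500}
batch = sample_by_frequency(entries, freqs, k=100)
```

Weighting by corpus frequency biases the training data toward forms a speaker would actually encounter, which matters most in the Low (100-sample) setting.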
47. CoNLL–SIGMORPHON 2017 Shared Task (Cotterell et al., 2017)
Universal Morphological Reinflection (52 Languages): Neural encoder–decoders
(Align & Copy), based on Aharoni and Goldberg (2017):
1) Extract input–output string alignments (adding COPY/edit operations); 2) train seq2seq (LSTM-based) models to learn a sequence of operations (hard monotonic attention)
[Makarov et al., 2017]
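The COPY/edit-operation supervision can be illustrated with a crude character aligner. This is a simplification of the approach, not the authors' implementation; it uses difflib rather than a learned aligner:

```python
import difflib

# Derive a COPY/DEL/INS operation sequence transforming lemma -> form,
# the kind of action sequence a hard-monotonic transducer is trained on.
def edit_ops(lemma, form):
    ops = []
    sm = difflib.SequenceMatcher(a=lemma, b=form, autojunk=False)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":
            ops += ["COPY"] * (i2 - i1)
        else:
            ops += [f"DEL({c})" for c in lemma[i1:i2]]
            ops += [f"INS({c})" for c in form[j1:j2]]
    return ops

print(edit_ops("run", "runs"))  # → ['COPY', 'COPY', 'COPY', 'INS(s)']
```

Because most of a paradigm cell is usually the unchanged stem, COPY dominates the action sequence, which is exactly what makes this supervision data-efficient in low-resource settings.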
49. CoNLL–SIGMORPHON 2017 Shared Task (Cotterell et al., 2017)
Error taxonomy
What are common errors that neural systems make?
[Gorman et al., 2019]
50. CoNLL–SIGMORPHON 2017 Shared Task (Cotterell et al., 2017)
Error taxonomy
What are common errors that neural systems make?
Types of Errors
Free variation error: more than one acceptable form exists
Extraction errors: flaws in UniMorph’s parsing of Wiktionary
Wiktionary errors: errors in the Wiktionary data itself
Silly errors: “bizarre” errors which defy any purely linguistic characterization (“*membled” instead of “mailed”, or loops such as “ynawemaylmyylmyylmyylmyylmyylmyym...” instead of “ysnewem”)
Allomorphy errors: misapplication of existing allomorphic patterns
Spelling errors: forms that do not follow language-specific orthographic conventions
[Gorman et al., 2019]
52. CoNLL–SIGMORPHON 2017 Shared Task (Cotterell et al., 2017)
Error taxonomy
What are common errors that neural systems make?
Allomorphy Errors
Stem-final vowels (*pohjanpystykorvojen) and consonant gradation (*ei kiemurda) in Finnish
Ablaut in Dutch and German (*pront; *saufte)
Umlaut (*Einwohnerzähle, *Förmer), plural suffixes, verbal prefixes in German (*umkehre)
Linking vowels in Hungarian (*masszázsakból instead of masszázsokból)
Yers (*klęsek instead of klęsk) and genitive singular suffixes (*izotopa) in Polish
Animacy in Polish and Russian (грузин vs. магазин in ACC.SG )
Aspect in Russian (*будешь сорвать)
Internal inflection in Russian compounds (*государствах-донорах, *лёгких промышленности (ACC.PL))
[Gorman et al., 2019]
53. CoNLL–SIGMORPHON 2018 Shared Task (Cotterell et al., 2018)
Universal Morphological Reinflection (103 Languages)
Task1: Morphological Inflection (Low, Medium, High)
Task2: Inflection in Context (Vylomova et al., 2019)
[Cotterell et al., 2018]
Track 1: With morphosynt. annotation
Track 2: Without morphosynt. annotation
Requires capturing agreement and inferring inherent vs. contextual categories (Vylomova et al., 2019)
57. SIGMORPHON 2019 Shared Task (McCarthy et al., 2019)
Morphological Analysis in Context and Cross-Lingual Transfer for Inflection (100 Language Pairs)
Task1: Cross-lingual Transfer for Morphological Inflection (10K high-resource + 100 low-resource samples)
Task2: Morphological Analysis in Context
[McCarthy et al., 2019]
60. SIGMORPHON 2019 Shared Task (McCarthy et al., 2019)
[Anastasopoulos and Neubig, 2019]
63. So...
SIGMORPHON Shared Tasks 2016–2019
PLAY + PRESENT PARTICIPLE → playing
played + PRESENT PARTICIPLE → playing
Lemma Tag Form
RUN PAST ran
RUN PRES;1SG run
RUN PRES;2SG run
RUN PRES;3SG runs
RUN PRES;PL run
RUN PART running
2018: ~96% accuracy on average in the high-resource setting, but much lower in the low-resource setting
Also see Ling Liu’s 2021 Overview
“Computational Morphology with Neural Network Approaches”
65. PART III: Scaling up and increasing UniMorph Collaboration!
From Wiktionary to more linguistic resources: Including grammar books, Apertium data,
text/glossed corpora.
66. Language-Specific Biases
As Bender (2009, 2016) notes, architectures and training and tuning algorithms still present language-specific biases
67. SIGMORPHON 2020 SHARED TASK 0 (Vylomova et al., 2020)
Let’s focus on typological diversity and aim to investigate systems’ ability to
generalize across typologically distinct languages!
If a model works well for a sample of IE languages, should the same model
also work well for Tupi–Guarani languages?
69. SIGMORPHON 2020 SHARED TASK 0 (Vylomova et al., 2020)
90 languages from 13 language families
70. Three Phases
Development
2 months; train & dev: 45 languages from 5 families (Austronesian, Niger-Congo, Oto-Manguean,
Uralic, IE)
Generalization
1 week; train & dev: 45 languages from 10 families (Afro-Asiatic, Algic, Dravidian, Indo-European, Niger-Congo, Sino-Tibetan, Siouan, Songhay, Southern Daly, Tungusic, Turkic, Uralic, and Uto-Aztecan)
Evaluation
1 week; test: all 90 languages
71. Data
Preparation
Manually converted language-specific features (tags) into the UniMorph format
Canonicalized the converted language data (https://github.com/unimorph/um-canonicalize)
Splitting
Used only noun, verb, and adjective forms to construct training, development, and evaluation
sets.
Randomly sampled 70%, 10%, and 20% for train, development, and test, respectively.
Zarma, Tajik, Lingala, Ludian, Māori, Sotho, Võro, Anglo-Norman, and Zulu contain fewer than 400 training samples
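The splitting procedure above can be sketched directly; the 70/10/20 proportions follow the slide, while the seed and the integer-truncation details are assumptions:

```python
import random

# Shuffle paradigm entries and split them 70/10/20 into
# train/dev/test, mirroring the slide's splitting procedure.
def split_data(entries, seed=0):
    entries = list(entries)
    random.Random(seed).shuffle(entries)
    n = len(entries)
    n_train, n_dev = int(0.7 * n), int(0.1 * n)
    train_set = entries[:n_train]
    dev_set = entries[n_train:n_train + n_dev]
    test_set = entries[n_train + n_dev:]
    return train_set, dev_set, test_set

train_set, dev_set, test_set = split_data(range(1000))
# (len(train_set), len(dev_set), len(test_set)) == (700, 100, 200)
```

Note that splitting by entry (rather than by lemma) means forms of the same paradigm can appear in both train and test, which is one source of the inconsistency issues discussed later.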
73. Systems: Baselines
Neural
Neural transducer (Wu et al., 2019), which is essentially a hard monotonic attention model (mono-*)
Transformer adapted for character-level tasks (Wu et al., 2020; trm-*), SoTA on ST 2017
+ data augmentation technique used by Anastasopoulos et al. (2019; -aug-)
+ family-wise shared parameters (*-shared)
Baseline systems (wu2019exact): mono-single, mono-aug-single, mono-shared, mono-aug-shared
Baseline systems (wu2020applying): trm-single, trm-aug-single, trm-shared, trm-aug-shared
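Several baselines and submissions rely on data hallucination (Anastasopoulos and Neubig, 2019). A toy version of the idea, replacing the material shared between lemma and form with random characters while keeping the affixes, might look like this (stem detection via longest common substring and the Latin alphabet are simplifications):

```python
import random
from difflib import SequenceMatcher

# Hallucinate a training pair: find the longest substring shared by
# lemma and form (a crude stem proxy) and swap it for a random string.
def hallucinate(lemma, form, alphabet="abcdefghijklmnopqrstuvwxyz", seed=0):
    rng = random.Random(seed)
    m = SequenceMatcher(a=lemma, b=form, autojunk=False).find_longest_match(
        0, len(lemma), 0, len(form))
    if m.size < 3:  # too little shared material to swap safely
        return lemma, form
    fake = "".join(rng.choice(alphabet) for _ in range(m.size))
    new_lemma = lemma[:m.a] + fake + lemma[m.a + m.size:]
    new_form = form[:m.b] + fake + form[m.b + m.size:]
    return new_lemma, new_form

print(hallucinate("walk", "walked"))
```

Because the affixal material is preserved, the hallucinated pairs teach the model the inflectional transformation without memorizing particular stems, which is why the technique helps most in the low-resource settings.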
74. Systems: Teams
10 teams submitted 22 systems in total, out of which 19 were neural
CMU Tartan (Jayarao et al., 2020): cmu_tartan_00-0, cmu_tartan_00-1, cmu_tartan_01-0, cmu_tartan_01-1, cmu_tartan_02-1
CU7565 (Beemer et al., 2020): CU7565-01-0, CU7565-02-0
CULing (Liu et al., 2020): CULing-01-0
DeepSpin (Peters et al., 2020): deepspin-01-1, deepspin-02-1
ETH Zurich (Forster et al., 2020): ETHZ-00-1, ETHZ-02-1
Flexica (Scherbakov, 2020): flexica-01-0, flexica-02-1, flexica-03-1
IMS (Yu et al., 2020): IMS-00-0
LTI (Murikinati et al., 2020): LTI-00-1
NYU-CUBoulder (Singer et al., 2020): NYU-CUBoulder-01-0, NYU-CUBoulder-02-0, NYU-CUBoulder-03-0, NYU-CUBoulder-04-0
UIUC (Canby et al., 2020): uiuc-01-0
75. Systems: Description (* – winning system)
Improving neural baselines
*UIUC: transformers with synchronous bidirectional decoding technique (Zhou et al.,2019)
and family-wise fine-tuning
ETH Zurich: exact decoding strategy that uses Dijkstra’s search algorithm
Improving previous years’ models: Hard Monotonic Attention
IMS: L2R+R2L models with a genetic algorithm for ensemble search and data hallucination
Flexica: multilingual (family-wise) model with improved alignment strategy
+ new data hallucination technique based on phonotactic modelling
76. Systems: Description (* – winning system)
Improving their 2019 models
LTI: multi-source encoder–decoder with two-step attention architecture + cross-lingual transfer + data hallucination + romanization of scripts
*DeepSpin: massively multilingual (all languages) gated sparse two-headed attention model with sparsemax + 1.5-entmax
Transformer vs. LSTMs
CMU Tartan: compared transformer- and LSTM-based encoder–decoders trained mono- and multilingually with data hallucination
77. Systems: Description (* – winning system)
Ensembles of Transformers
NYU-CUBoulder: compared vanilla and pointer-generator (monolingual) transformers + ensembles of three and five pointer-generator transformers + data hallucination (for languages with fewer than 1,000 samples)
*CULing: ensemble of three (monolingual) transformers + augmented the initial input (that
only used the lemma as a source form) with entries corresponding to other (non-lemma) slots
(reinflection) to improve learning of principal parts of paradigm
78. Systems: Description (* – winning system)
Non-neural systems
CU7565: manually developed finite-state grammars for 25 languages
+ hierarchical paradigm clustering (based on similarity of string transformation rules)
Flexica: a method similar to Hulden (2014) but with transformation rules treated
independently and assigned a score based on their frequency, specificity and diversity of
surrounding characters
79. Evaluation
Per-language accuracy
Per-language Levenshtein distance
Takes into account the statistical significance of differences between systems
Ranking
Any system statistically indistinguishable from the best-performing one is also ranked 1st for that language.
For genus/family:
We aggregate the systems’ ranks and re-rank them based on the number of times they ranked 1st, 2nd, etc.
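The two per-language metrics can be sketched as follows (the significance testing used for ranking is omitted here):

```python
# Per-language accuracy and mean Levenshtein distance, the two
# evaluation metrics used in the shared task.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def evaluate(gold, predicted):
    acc = sum(g == p for g, p in zip(gold, predicted)) / len(gold)
    dist = sum(levenshtein(g, p) for g, p in zip(gold, predicted)) / len(gold)
    return acc, dist

print(levenshtein("kitten", "sitting"))  # → 3
```

Reporting edit distance alongside exact-match accuracy is informative for morphology: a prediction wrong by one character (a near-miss on an affix) is penalized far less than a "silly" looping error.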
81. Results: 4 winning systems (outperform baselines)
Columns: system, mean rank, mean accuracy (%)
uiuc-01-0 2.4 90.5
deepspin-02-1 2.9 90.9
BASE: trm-single 2.8 90.1
CULing-01-0 3.2 91.2
deepspin-01-1 3.8 90.5
BASE: trm-aug-single 3.7 90.3
NYU-CUBoulder-04-0 7.1 88.8
NYU-CUBoulder-03-0 8.9 88.8
NYU-CUBoulder-02-0 8.9 88.7
IMS-00-0 10.6 89.2
NYU-CUBoulder-01-0 9.6 88.6
BASE: trm-shared 10.3 85.9
BASE: mono-aug-single 7.5 88.8
cmu_tartan_00-0 8.7 87.1
BASE: mono-single 7.9 85.8
cmu_tartan_01-1 9.0 87.1
BASE: trm-aug-shared 12.5 86.5
BASE: mono-shared 10.8 86.0
cmu_tartan_00-1 9.4 86.5
LTI-00-1 12.0 86.6
BASE: mono-aug-shared 12.8 86.8
cmu_tartan_02-1 10.6 86.1
cmu_tartan_01-0 10.9 86.6
flexica-03-1 16.7 79.6
ETHZ-00-1 20.1 75.6
*CU7565-01-0 24.1 90.7
flexica-02-1 17.1 78.5
*CU7565-02-0 19.2 83.6
ETHZ-02-1 17.0 80.9
flexica-01-0 24.4 70.8
Oracle (Baselines) 96.1
Oracle (Submissions) 97.7
Oracle (All) 97.9
The baselines and the submissions are complementary: combining them increases the oracle score
The largest gaps in oracle systems are observed in the Algic, Oto-Manguean, Sino-Tibetan, Southern Daly, Tungusic, and Uto-Aztecan families
82. Accuracy by language averaged across all submissions
83. Accuracy by language averaged across all submissions
A significant effect of dataset size was observed
Relatively easy: Austronesian and Niger-Congo
Difficult: some Uralic and Oto-Manguean languages
Challenging: Ludic, Norwegian Nynorsk, Middle Low German, Evenki, and O’odham
84. Accuracy by Language
Has morphological inflection become a solved problem in certain scenarios?
We classified the test examples into four categories:
Very Easy: predicted correctly by all submitted systems
Easy: predicted correctly by at least 80% of systems
Hard: predicted correctly by at most 20% of systems
Very Hard: predicted correctly by no submitted system
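These bins can be computed directly from per-system correctness flags. A sketch (examples falling between 20% and 80% are outside the four named bins; the "medium" label below is my own placeholder, not from the shared task):

```python
def difficulty(correct_flags):
    """Classify a test example by the fraction of systems that got it right."""
    share = sum(correct_flags) / len(correct_flags)
    if share == 1.0:
        return "very easy"   # all systems correct
    if share >= 0.8:
        return "easy"
    if share == 0.0:
        return "very hard"   # no system correct
    if share <= 0.2:
        return "hard"
    return "medium"          # placeholder for the unnamed middle range
```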
88. Questions Addressed in Papers
Is developing morphological grammars manually worthwhile?
CU7565 manually designed finite-state grammars for 25 languages
The paradigms of some languages were relatively easy to describe, but neural networks also
performed quite well on them
For Ingrian and Tagalog (low-resource), the grammars demonstrated superior performance, but
at the expense of a significant number of person-hours
89. Questions Addressed in Papers
What is the best training strategy for low-resource languages?
Data hallucination proved useful for low-resource languages
Augmenting the data with tuples in which lemmas are replaced with non-lemma forms and their
tags
Multilingual training
Ensembles
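The non-lemma augmentation idea above can be sketched as follows: from every observed paradigm, generate extra training tuples whose source side is an inflected form together with its tags (a simplified sketch of the general strategy, not any team's exact recipe):

```python
def repair_augment(paradigm):
    """Generate extra training tuples (src_form, src_tags, tgt_tags, tgt_form)
    whose source side is an inflected form rather than the lemma.

    `paradigm` maps a UniMorph tag string to its surface form.
    """
    items = list(paradigm.items())
    extra = []
    for tags_src, form_src in items:
        for tags_tgt, form_tgt in items:
            if tags_src != tags_tgt:  # skip the identity mapping
                extra.append((form_src, tags_src, tags_tgt, form_tgt))
    return extra
```

Each observed paradigm of n forms thus yields n(n-1) directed training pairs instead of n-1 lemma-to-form pairs.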
90. Error Analysis
Systematic Errors:
Data Inconsistency
The train, development, and test sets contain 2%, 0.3%, and 0.6% inconsistent entries, respectively
Highest rates: Azerbaijani, Old English, Cree, Danish, Middle Low German, Kannada,
Norwegian Bokmål, Chichimec, and Veps
Dialectal variations in Finno-Ugric and Tungusic
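Inconsistencies of this kind (the same lemma and tag set mapped to different target forms) can be detected mechanically. A sketch assuming UniMorph-style (lemma, form, tags) triples:

```python
from collections import defaultdict


def inconsistent_entries(triples):
    """Return (lemma, tags) keys that map to more than one target form."""
    forms = defaultdict(set)
    for lemma, form, tags in triples:
        forms[(lemma, tags)].add(form)
    return {key: sorted(vals) for key, vals in forms.items() if len(vals) > 1}
```

Note that some hits are genuine free variation (e.g., dialectal doublets) rather than annotation errors, so the output needs manual review.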
91. Language-Specific Errors
Algic (Cree)
Mean accuracy across systems was 65.1% (41.5% to 73%)
Systems struggled with the choice of preverbal auxiliary (‘kitta’ could refer to future, imperfective, or
imperative)
The paradigms were very large, and there were very few lemmas (28 impersonal verbs and 14
transitive verbs)
92. Language-Specific Errors
Austronesian
Mean accuracy across systems was 80.5% (39.5% to 100%)
Baseline: Cebuano (84%) and Hiligaynon (96%)
Cebuano only has partial reduplication while Hiligaynon has full reduplication
The prefix choice for Cebuano is more irregular, making it more difficult to predict the correct
conjugation of the verb
In Maori, passive voice endings are difficult to predict, as the language has undergone a loss of
word-final consonants and there is no clear link between a stem and the passive suffix it
employs
93. Language-Specific Errors
Niger-Congo
Mean accuracy across systems was very good at 96.4% (62.8% to 100%)
Most languages in this family are considered low resource, and the resources used for data
gathering may have been biased towards the languages’ regular forms
94. Language-Specific Errors
Sino–Tibetan (Tibetan)
Mean accuracy across systems was average at 82.1% (67.9% to 85.1%)
The majority of errors are related to allomorphy
Nonce words and impossible combinations of component units (Di et al., 2019)
95. Language-Specific Errors
Siouan (Dakota)
Mean accuracy across systems was average at 89.4% (0% to 95.7%)
Variable prefixing and infixing of person morphemes, along with some complexities related to
fortition processes
Determining the factor(s) that governed variation in affix position was difficult from a
linguist’s perspective, though many systems were largely successful
Issues with first and second person singular allomorphy
96. Language-Specific Errors
Tungusic (Evenki)
Mean accuracy across systems was average at 53.8% (43.5% to 59.0%)
The dataset was created from oral speech samples in various dialects of the language; there
was little attempt at any standardization in the oral speech transcription
Annotation: various past tense forms are all annotated simply as PST, and several comitative
suffixes are all annotated as COM
Annotation: some features are present in the word form but receive no annotation at all
97. Language-Specific Errors
Uto-Aztecan (O’odham)
Mean accuracy across systems was average at 76.4% (54.8% to 82.5%)
Systems with higher accuracy may have benefited from better recall of suppletive forms
relative to lower accuracy systems.
98. SIGMORPHON 2020 Shared Task 0 (Vylomova et al., 2020): Conclusion
Submissions were able to make productive use of multilingual training
Data augmentation techniques such as hallucination helped
Combined with architecture tweaks such as sparsemax, these yielded excellent overall performance
on many languages
Some morphology types and language families (Tungusic, Oto-Manguean, Southern Daly) are
still challenging
In some languages (Ingrian, Tajik, Tagalog, Zarma, and Lingala), hand-encoding linguistic
knowledge in finite-state grammars resulted in the best performance
99. A Case Study on Nen (Papua New Guinea); Muradoglu et al., 2020
100. A Case Study on Nen (Papua New Guinea); Muradoglu et al., 2020
Spoken in the village of Bimadbn in the Western Province of PNG by approximately 400 people
Verbs: prefixing, middle, and ambifixing
Distributed Exponence (DE): “morphosyntactic feature values can only be
determined after unification of multiple structural positions”
101. A Case Study on Nen (Papua New Guinea); Muradoglu et al., 2020
102. A Case Study on Nen (Papua New Guinea); Muradoglu et al., 2020
Low accuracy on a small number of samples (<1000)
103. A Case Study on Nen (Papua New Guinea); Muradoglu et al., 2020
Allomorphy: vowel harmony
Variation in forms/spelling
Looping: *ynawemaylmyylmyylmyylmy-ylmyylmyymayamawemyymamya (Shcherbakov et al., 2020)
104. A Case Study on Nen (Papua New Guinea); Muradoglu et al., 2020
How well do the models generalize?
Syncretism test: all TAM categories exhibit syncretism across the second- and third-person
singular actor, with one exception: the past perfective slot, where the forms differ
Having not observed the past perfective forms, systems tend to predict them as syncretic
(generalizing from observed slots), mispredicting the actual, exceptional forms
105. SIGMORPHON 2021 Shared Task 0 (Pimentel, Ryskina et al., 2021): More
under-resourced languages!
106. SIGMORPHON 2021 Shared Task 0 (Pimentel, Ryskina et al., 2021): More
under-resourced languages!
107. SIGMORPHON 2021 Shared Task 0 (Pimentel, Ryskina et al., 2021)
108. SIGMORPHON 2021 Shared Task 0 (Pimentel, Ryskina et al., 2021)
Allomorphy
Spelling errors
Multi-Word Lemmas
Complex transformation patterns
109. SIGMORPHON 2021 Shared Task 0 (Pimentel, Ryskina et al., 2021)
Allomorphy
Spelling errors
Most errors are due to limited data
Very sparse data without complete paradigms (e.g., Eibela)
Mispredictions on unseen lemmas (also see Goldman et al., 2021)
Multi-Word Lemmas
Complex transformation patterns
110. Language-Specific Errors
Russian
Mean accuracy across systems was 97.4% (94.31% to 98.06%)
Incorrect prediction of instrumental case forms (even when other parts of the same paradigm
were observed for the same lemma)
Incorrect prediction of accusative forms: the forms differ for animate and inanimate nouns, and
animacy must be inferred (from observing other slots of the same case, such as PL or SG)
Errors in the inflection of multi-word lemmas, which requires inferring dependency information;
similarly to the above cases, this information could be inferred from other slots of the same
paradigm
111. Language-Specific Errors
Kunwinjku
Accuracy across systems ranges from 14.75% to 63.93%
Due to the limited amount of data, augmentation significantly improved performance
Systems mispredict *ngurriborlbme instead of ngurriborle.
Looping effects (Shcherbakov et al., 2020) are observed in RNN-based architectures:
*ngar-rrrrrrrrrrrrrmbbbijj (should be karribelbmerrinj), ngadjarridarrkddrrdddrrmerri (should be
karriyawoyhdjarrkbidyikarrmerrimeninj)
112. PART IV: Current Challenges and Future Directions
113. Challenges in Data Conversion/Annotation
Case compounding and stacking (e.g., Kayardild)
I gave the book to my brother’s wife: ‘wife+DAT+ABL, my+GEN+DAT+ABL,
brother+GEN+DAT+ABL’
Clitics: exponential growth of paradigm tables
Polysynthetic languages and paradigms
Derivation-inflection continuum: some paradigms contain
derivations (participle formation, masdars, etc.) and require multi-step transformations
(PL: similar to ‘to run’ → ‘runners’)
Multi-word lemmas that might require dependency information
Which features should be added (not language-specific)?
114. Future Directions
Develop a framework for error analysis, e.g. measuring the percentage of allomorphy errors by
providing a set of tasks specifically targeting allomorphy (e.g., following Elsner and Sims, 2019; Malouf et al., 2020)
Increase interpretability of the models, design a methodology to extract the patterns learned by
the model
Create more typologically plausible language samples
A pipeline to augment UniMorph with new morphosyntactic features
An approach to estimating how representative a paradigm sample is for a specific language
(an estimate of language coverage)
... And ST0 Part 2: Human-like generalization and WUGS!
115. Thank you! Questions?
Please join us: https://groups.google.com/g/unimorph