IMPACT Final Conference - Language Parallel Sessions - Erjavec

Resources for historical Slovene

Tomaž Erjavec
Department of Knowledge Technologies
Jožef Stefan Institute
Ljubljana

IMPACT Conference 2011
October 24-25, 2011, London


Background
• Pre-story: AHLib (2004–08)
(Deutsch-slowenische/kroatische Übersetzung 1848–1918; German-Slovene/Croatian translation 1848–1918)
• Corpus / DL of ger→slv books
• AAS: transcription correction and markup (TEI P4)
• JSI: automatic annotation and editing environment
• Story: EU IP IMPACT (ext. 2010–2011)
• Better OCR for historical texts
• NUK: GTD transcriptions (PAGE/Aletheia)
• JSI: (semi)manual lexicon construction
• Co-story: Google award (2011)
• Developing language models for historical Slovene
• ZRC SAZU: transcriptions of old texts (TEI P5)
• JSI: annotating a corpus of old Slovene


Methodology
[Diagram: Annotators, Texts, Corpus, Historical lexicon, ToTrTaLe, Contemporary models]
• Develop 3 resources:
• transcribed texts
• hand-annotated corpus
• lexicon of historical words
• Develop annotation tool, ToTrTaLe
• How to tag and lemmatise historical Slovene?
Little chance of developing training data comparable to that for contemporary Slovene
• Basic idea:
• modernise words, then use models for modern Slovene
• transcription is via fixed lexicon + transcription patterns
• patterns implemented via LMU Vaam
• mostly OK for XIX and XVIII century language
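
The "modernise, then reuse modern models" idea can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the lexicon entries, the two rewrite patterns, the stand-in lemma table and the function names are hypothetical, and the patterns are plain regular expressions rather than the Vaam pattern format. The word-forms are taken from the examples in these slides.

```python
import re

# Hypothetical fixed lexicon: historical word-form -> modern word-form.
LEXICON = {
    "lubesen": "ljubezen",
    "ljubezin": "ljubezen",
}

# Hypothetical toy rewrite patterns, applied in order to words that are
# not in the fixed lexicon. They are NOT the project's actual patterns.
PATTERNS = [
    (re.compile(r"^lub"), "ljub"),      # word-initial "lub" -> "ljub"
    (re.compile(r"[sſ](?=n)"), "z"),    # "s" or long s before "n" -> "z"
]

# Stand-in for a model of modern Slovene: modern word-form -> lemma.
MODERN_LEMMAS = {
    "ljubezen": "ljubezen",
    "ljubezni": "ljubezen",
}


def modernise(word: str) -> str:
    """Map a historical word-form to a modern one: lexicon first, then patterns."""
    key = word.lower()
    if key in LEXICON:
        return LEXICON[key]
    for pattern, replacement in PATTERNS:
        key = pattern.sub(replacement, key)
    return key


def annotate(word: str):
    """Return (modernised form, lemma or None) using the modern stand-in model."""
    modern = modernise(word)
    return modern, MODERN_LEMMAS.get(modern)


if __name__ == "__main__":
    for w in ["lubesen", "lubeſni", "ljubesni"]:
        print(w, annotate(w))
```

Looking up the fixed lexicon before applying patterns keeps irregular forms under manual control, while the patterns cover regular spelling changes for words the lexicon does not list.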


Issues
• Tokenisation: words were split differently in historical language:
• žnjo → z njo
• po noči → ponoči
• Variability:
• archaic forms:
ljubezen ← lubesen, ljubesen, lubeſn, ljubezin, ljubesin
• inflection:
ljubezen ← ljubezni, ljubeznijo
• both:
ljubezen ← ljubezni, ljubesni, lubesen, ljubesen, lubesni, lubeſn, ljubeznijo, ljubezin, lubeſne, lubeſni, lubesne, ljubesnijo, ljubesin
• Extinct words:
• zajhen / cajhen / znamenje
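
The tokenisation issue on this slide can be sketched in Python as a pass that splits and merges tokens before any further annotation. The two rule tables hold only the two examples given above; the table layout and the function name are hypothetical, not the ToTrTaLe implementation.

```python
# Hypothetical rule tables, filled only with the two examples from the slide.
SPLIT = {"žnjo": ["z", "njo"]}        # one historical token -> several modern tokens
MERGE = {("po", "noči"): "ponoči"}    # two historical tokens -> one modern token


def normalise_tokens(tokens):
    """Re-tokenise a list of historical tokens to modern word boundaries."""
    out = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in MERGE:
            # Two adjacent tokens are written as one word today.
            out.append(MERGE[pair])
            i += 2
        elif tokens[i] in SPLIT:
            # One token is written as several words today.
            out.extend(SPLIT[tokens[i]])
            i += 1
        else:
            out.append(tokens[i])
            i += 1
    return out


if __name__ == "__main__":
    print(normalise_tokens(["po", "noči", "žnjo"]))
```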


Transcribed historical texts
• AHLib corpus/DL:
90 books, 10,000 pages, 2M words (> 1850)
• NUK GTD:
5,000 pages, 1M words
• Google Books:
30 books, 10,000 pages, 2M words (in progress)
• WikiSource (Lj Uni):
200 books, 5M words (in progress)
Total: ~10M words

• most texts have associated facsimiles
• can be made freely available


Initial Lexicon
• Development of initial lexicon (2010), using the data and tools at hand
• AHLib collection (70 books > 1850)
• Transcription rules + FidaPLUS lexicon of contemporary slv
• LMU LeXtractor editing tool
• produced 3,000 entries (word-forms)


Reference corpus goo300k
• Page sampled
• Each word annotated with:
• Contemporary equivalent
• Modern lemma
• Part-of-speech tag
• First with ToTrTaLe
• Then manually correct
• INL Cobalt Lexicon Tool
• A team of annotators
• Also correcting errors in transcription
• Manual, cookbook, FAQ, mailing list, meetings…
• TEI P5 – bibliography, links to facsimiles & DL

Period       Units   Pages   Tokens
1584             1       8     6000
1695             1      27    10000
1751-1800        8     155    27000
1801-1850       12     206    74000
1851-1875       36     380   126000
1876-1900       23     224    51000
∑               81    1000   296000


INL Cobalt lexicon building tool


TEI corpus dump


Final lexicon
Composition:
• Initial LeXtractor lexicon (3k entries)
• Lexicon dump from goo300k
• Additional lexicon from full text collection
Format:
• TEI P5
• lemma oriented
• grammatical properties, glosses, historical spelling, (corpus) examples

goo300k         All     Historical
Lex. entries    56346   22849
Word-forms      53853   19627
Normalised      46996   15402
Modernised      37334   11396
Lemmas          19569    8605
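
A lemma-oriented lexicon can be derived from annotated corpus tokens roughly as in the Python sketch below. The record fields mirror the three annotations named for goo300k (contemporary equivalent, modern lemma, part-of-speech tag); the class, the function, the nested-dictionary layout and the tag value "N" are hypothetical placeholders, and the real lexicon is encoded in TEI P5 rather than as Python objects.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    word: str     # transcribed historical word-form
    modern: str   # contemporary equivalent
    lemma: str    # modern lemma
    tag: str      # part-of-speech tag (placeholder values in this sketch)


def build_lexicon(tokens):
    """Group historical spellings under modern lemma, then modern word-form."""
    lexicon = defaultdict(lambda: defaultdict(set))
    for t in tokens:
        lexicon[t.lemma][t.modern].add(t.word)
    # Convert to plain dicts with sorted lists for stable output.
    return {
        lemma: {modern: sorted(words) for modern, words in forms.items()}
        for lemma, forms in lexicon.items()
    }


if __name__ == "__main__":
    corpus = [
        Token("lubesen", "ljubezen", "ljubezen", "N"),
        Token("ljubesen", "ljubezen", "ljubezen", "N"),
        Token("lubesni", "ljubezni", "ljubezen", "N"),
    ]
    print(build_lexicon(corpus))
```

Grouping by lemma first keeps archaic spellings and inflected forms of the same word in one entry, which matches the "both" case on the Issues slide.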


Results
• Language resources for historical Slovene:
• Text Collection hs5M:
• facsimile + transcription, DL (+ automatic annotation)
• Annotated Corpus goo300k:
• page-sampled, hand-annotated
• Structured Lexicon imp20k:
• grammar + glosses + forms + attestations
• TEI P5, CC BY
• ToTrTaLe + resources for HS:
• tokenisation & transcription patterns
• Services: CUWI, (moderniser+archaiser)
• all still work in progress, available mid-2012


Further work
• Better IR for Digital Libraries: NUK
• Dictionary of historical Slovene: ZRC
• Beyond words: changes in syntax
• MT paradigm
• tweets & Croatian
