IMPACT Final Conference - Language Parallel Sessions - Erjavec
1. Resources for historical Slovene
Tomaž Erjavec
Department of Knowledge Technologies
Jožef Stefan Institute
Ljubljana
IMPACT Conference 2011
October 24-25, 2011, London
2. Tomaž Erjavec: Slovene language resources 2
Background
• Pre-story: AHLib (2004–08)
(Deutsch-slowenische/kroatische Übersetzung 1848–1918; German-Slovene/Croatian translation 1848–1918)
• Corpus / DL of ger→slv books
• AAS: transcription correction and markup (TEI P4)
• JSI: automatic annotation and editing environment
• Story: EU IP IMPACT (ext. 2010–2011)
• Better OCR for historical texts
• NUK: GTD transcriptions (PAGE/Aletheia)
• JSI: (semi)manual lexicon construction
• Co-story: Google award (2011)
• Developing language models for historical Slovene
• ZRC SAZU: transcriptions of old texts (TEI P5)
• JSI: annotating a corpus of old Slovene
3. Tomaž Erjavec: Slovene language resources 3
Methodology
[Diagram: Texts, Corpus, Historical lexicon, Annotators, ToTrTaLe, Contemporary models]
• Develop 3 resources:
• transcribed texts
• hand-annotated corpus
• lexicon of historical words
• Develop annotation tool, ToTrTaLe
• How to tag and lemmatise historical Slovene?
Little chance of developing training data comparable to that for
contemporary Slovene
• Basic idea:
• modernise words then use models for modern Slovene
• transcription is via fixed lexicon + transcription patterns
• patterns implemented via LMU Vaam
• mostly OK for XIX and XVIII century language
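The basic idea above (fixed lexicon first, then transcription patterns, then models for modern Slovene) can be sketched roughly as follows. This is a minimal illustration only, not the ToTrTaLe or Vaam implementation: the names `LEXICON`, `PATTERNS`, `MODERN_WORDS` and `modernise` are hypothetical, and the rewrite rules are toy rules chosen to cover the "ljubezen" variants shown in the slides. A pattern-derived candidate is accepted only if it is a known modern word.

```python
import re

# Hypothetical fixed lexicon: historical form -> modern form.
LEXICON = {
    "lubeſn": "ljubezen",
}

# Hypothetical list of known modern word-forms, used to validate candidates.
MODERN_WORDS = {"ljubezen", "ljubezni", "ljubeznijo"}

# Hypothetical transcription patterns, applied in order.
PATTERNS = [
    (re.compile("ſ"), "s"),          # long s -> s
    (re.compile(r"^lub"), "ljub"),   # word-initial lub- -> ljub-
    (re.compile(r"(?<=ube)s"), "z"), # -ubes- -> -ubez-
]


def modernise(word):
    """Return the modern form of a historical word, or None if unknown."""
    key = word.lower()
    if key in LEXICON:            # 1. fixed lexicon lookup
        return LEXICON[key]
    if key in MODERN_WORDS:       # 2. already a modern form
        return key
    candidate = key               # 3. apply transcription patterns
    for pattern, replacement in PATTERNS:
        candidate = pattern.sub(replacement, candidate)
    return candidate if candidate in MODERN_WORDS else None


print(modernise("lubesen"))   # ljubezen
print(modernise("lubeſni"))   # ljubezni
```

The modernised form, rather than the historical one, would then be passed to the tagger and lemmatiser trained on contemporary Slovene.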
4. Tomaž Erjavec: Slovene language resources 4
Issues
• Tokenisation - words were split differently in historical language:
• žnjo → z njo
• po noči → ponoči
• Variability:
• archaic forms:
ljubezen ← lubesen, ljubesen, lubeſn, ljubezin, ljubesin
• inflection:
ljubezen ← ljubezni, ljubeznijo
• both:
ljubezen ←
ljubezni, ljubesni, lubesen, ljubesen, lubesni, lubeſn, ljubeznijo, ljubezin,
lubeſne, lubeſni, lubesne, ljubesnijo, ljubesin
• Extinct words:
• zajhen / cajhen / znamenje
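The tokenisation issue above could be handled by a normalisation pass that splits and merges tokens before modernisation. The sketch below is an assumption about how such a pass might look, not the project's actual tokenisation patterns; the tables `SPLIT` and `MERGE` and the function `normalise_tokens` are hypothetical and contain only the two examples from the slide.

```python
# Hypothetical tables built from the two examples on the slide.
SPLIT = {"žnjo": ["z", "njo"]}           # one historical token -> several modern tokens
MERGE = {("po", "noči"): "ponoči"}       # two historical tokens -> one modern token


def normalise_tokens(tokens):
    """Re-tokenise a list of historical tokens to modern word boundaries."""
    result = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in MERGE:
            result.append(MERGE[pair])    # merge two tokens into one
            i += 2
        elif tokens[i] in SPLIT:
            result.extend(SPLIT[tokens[i]])  # split one token into several
            i += 1
        else:
            result.append(tokens[i])      # leave the token unchanged
            i += 1
    return result


print(normalise_tokens(["žnjo", "po", "noči"]))  # ['z', 'njo', 'ponoči']
```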
5. Tomaž Erjavec: Slovene language resources 5
Transcribed historical texts
• AHLib corpus/DL:
90 books, 10,000 pages, 2M words (> 1850)
• NUK GTD:
5,000 pages, 1M words
• Google Books:
30 books, 10,000 pages, 2M words (in progress)
• WikiSource (Lj Uni):
200 books, 5M words (in progress)
• Total: ~10M words
• most texts have associated facsimiles
• can be made freely available
6. Tomaž Erjavec: Slovene language resources 6
Initial Lexicon
• Development of initial lexicon (2010), using the data and tools at hand
• AHLib collection (70 books > 1850)
• Transcription rules + FidaPLUS lexicon of contemporary slv
• LMU LeXtractor editing tool
• produced 3,000 entries (word-forms)
7. Tomaž Erjavec: Slovene language resources 7
Reference corpus goo300k
Period     Units  Pages  Tokens
1584           1      8    6000
1695           1     27   10000
1751-1800      8    155   27000
1801-1850     12    206   74000
1851-1875     36    380  126000
1876-1900     23    224   51000
∑             81   1000  296000
• Page sampled
• Each word annotated with:
• Contemporary equivalent
• Modern lemma
• Part-of-speech tag
• First with ToTrTaLe
• Then manually correct
• INL Cobalt Lexicon Tool
• A team of annotators
• Also correcting errors in transcription
• Manual, cookbook, FAQ, mailing list, meetings…
• TEI P5 – bibliography, links to facsimiles & DL
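As a rough illustration of the per-word annotation described above, the sketch below stores the three annotation layers for one token and serialises them to a TEI-like element. It is not the actual goo300k encoding: the `Token` class, the `to_tei` function, the use of `choice`/`orig`/`reg` inside `w`, and the tag value `Ncfsn` are all assumptions made for the example.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET


@dataclass
class Token:
    orig: str    # transcribed historical word-form
    reg: str     # contemporary equivalent
    lemma: str   # modern lemma
    pos: str     # part-of-speech tag (tagset is an assumption)


def to_tei(token):
    """Serialise one annotated token as a TEI-like <w> element (illustrative)."""
    w = ET.Element("w", {"lemma": token.lemma, "ana": token.pos})
    choice = ET.SubElement(w, "choice")
    ET.SubElement(choice, "orig").text = token.orig
    ET.SubElement(choice, "reg").text = token.reg
    return ET.tostring(w, encoding="unicode")


token = Token(orig="lubeſn", reg="ljubezen", lemma="ljubezen", pos="Ncfsn")
print(to_tei(token))
```

Keeping the historical and modernised forms side by side means an annotator can correct either the transcription or the annotation without losing the other.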
10. Tomaž Erjavec: Slovene language resources 10
Final lexicon
goo300k       All    Historical
Lex. entries  56346  22849
Word-forms    53853  19627
Normalised    46996  15402
Modernised    37334  11396
Lemmas        19569   8605
Composition:
• Initial LeXtractor lexicon (3k entries)
• Lexicon dump from goo300k
• Additional lexicon from full text collection
Format:
• TEI P5
• lemma oriented
• grammatical properties, glosses, historical spelling, (corpus) examples
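A lemma-oriented lexicon of this kind could in principle be derived from corpus annotations by grouping attested historical spellings under their modern form and lemma. The sketch below shows one possible way to do this; the function `build_lexicon` and the sample `ANNOTATIONS` (using word-forms from the earlier slides) are hypothetical, and the real lexicon also carries grammatical properties, glosses and examples, which are omitted here.

```python
from collections import defaultdict

# Hypothetical corpus annotations: (historical form, modern form, modern lemma).
ANNOTATIONS = [
    ("lubesen", "ljubezen", "ljubezen"),
    ("lubeſn", "ljubezen", "ljubezen"),
    ("ljubesni", "ljubezni", "ljubezen"),
    ("ljubezen", "ljubezen", "ljubezen"),
]


def build_lexicon(annotations):
    """Group historical spellings by lemma, then by modern word-form."""
    lexicon = defaultdict(lambda: defaultdict(set))
    for historical, modern, lemma in annotations:
        lexicon[lemma][modern].add(historical)
    # Convert to plain dicts so the result is easy to inspect and serialise.
    return {lemma: dict(forms) for lemma, forms in lexicon.items()}


lexicon = build_lexicon(ANNOTATIONS)
for modern_form, spellings in sorted(lexicon["ljubezen"].items()):
    print(modern_form, sorted(spellings))
```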
11. Tomaž Erjavec: Slovene language resources 11
Results
• Language resources for historical Slovene:
• Text Collection hs5M:
• facsimile + transcription, DL (+ automatic annotation)
• Annotated Corpus goo300k:
• page-sampled, hand-annotated
• Structured Lexicon imp20k:
• grammar + glosses + forms + attestations
• TEI P5, CC BY
• ToTrTaLe + resources for HS:
• tokenisation & transcription patterns
• Services: CUWI, (moderniser+archaiser)
• all still work in progress, available mid-2012
12. Tomaž Erjavec: Slovene language resources 12
Further work
• Better IR for Digital Libraries: NUK
• Dictionary of historical Slovene: ZRC
• Beyond words: changes in syntax
• MT paradigm
• tweets & Croatian