Resources for historical Slovene

Tomaž Erjavec
Department of Knowledge Technologies
Jožef Stefan Institute
Ljubljana


IMPACT Conference 2011
October 24-25, 2011, London
Tomaž Erjavec: Slovene language resources   2




Background
• Pre-story: AHLib (2004–08)
  (Deutsch-slowenische/kroatische Übersetzung 1848–1918)
   • Corpus / DL of ger→slv books
   • AAS: transcription correction and markup (TEI P4)
   • JSI: automatic annotation and editing environment
• Story: EU IP IMPACT (ext. 2010–2011)
  • Better OCR for historical texts
  • NUK: GTD transcriptions (PAGE/Aletheia)
  • JSI: (semi)manual lexicon construction
• Co-story: Google award (2011)
  • Developing language models for historical Slovene
  • ZRC SAZU: transcriptions of old texts (TEI P5)
  • JSI: annotating a corpus of old Slovene


Methodology
[Diagram with labels: Texts, Annotators, ToTrTaLe, Contemporary models, Corpus, Historical lexicon]
• Develop 3 resources:
  • transcribed texts
  • hand-annotated corpus
  • lexicon of historical words
• Develop annotation tool, ToTrTaLe
  • How to tag and lemmatise historical Slovene?
    Little chance of developing training data comparable to that for
    contemporary Slovene
  • Basic idea:
     • modernise words, then use models for modern Slovene
     • transcription is via fixed lexicon + transcription patterns
     • patterns implemented via LMU Vaam
     • mostly OK for XIX and XVIII century language
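
The basic idea (fixed lexicon first, then transcription patterns) can be sketched roughly as below. This is a minimal illustration only, not ToTrTaLe or Vaam: the lexicon entries, the three rewrite patterns, the MODERN_WORDS set, the function name modernise, and the step of keeping only candidates found in a modern word list are all assumptions made for the example.

```python
import re

# Hypothetical hand-made lexicon: historical form -> modern form.
FIXED_LEXICON = {"lubeſn": "ljubezen"}

# Hypothetical rewrite patterns, applied in order (illustrative only).
PATTERNS = [
    (re.compile("s"), "z"),     # older spelling used s where modern has z
    (re.compile("ſ"), "s"),     # long s -> s
    (re.compile("^lu"), "lju"), # word-initial lu -> lju
]

# Stand-in for a lexicon of contemporary Slovene word-forms.
MODERN_WORDS = {"ljubezen", "ljubezni"}


def modernise(word):
    """Return a modern form for `word`, or `word` itself if none is found."""
    if word in FIXED_LEXICON:
        return FIXED_LEXICON[word]
    if word in MODERN_WORDS:
        return word
    # Each pattern is optional: keep both the rewritten and unrewritten forms.
    candidates = {word}
    for pattern, replacement in PATTERNS:
        candidates |= {pattern.sub(replacement, c) for c in candidates}
    # Accept only candidates attested in the modern word list.
    known = sorted(candidates & MODERN_WORDS)
    return known[0] if known else word


print(modernise("lubesen"))   # pattern route
print(modernise("lubeſn"))    # lexicon route
```

Words that no pattern maps onto a known modern form (for example extinct words) are returned unchanged in this sketch; a real system would need to flag them for the lexicon instead.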




Issues
• Tokenisation: words were split differently in historical language:
  • žnjo → z njo
  • po noči → ponoči
• Variability:
  • archaic forms:
    ljubezen ← lubesen, ljubesen, lubeſn, ljubezin, ljubesin
  • inflection:
    ljubezen ← ljubezni, ljubeznijo
  • both:
    ljubezen ← ljubezni, ljubesni, lubesen, ljubesen, lubesni, lubeſn,
    ljubeznijo, ljubezin, lubeſne, lubeſni, lubesne, ljubesnijo, ljubesin
• Extinct words:
  • zajhen / cajhen / znamenje
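
The tokenisation issue can be illustrated with a small retokenising pass that splits and merges tokens according to lookup tables. This is only a sketch under stated assumptions: the SPLIT and MERGE tables, the function name retokenise, and the sample sentence are hypothetical and contain just the two cases named on this slide.

```python
# Hypothetical tables, holding only the two examples from the slide.
SPLIT = {"žnjo": ["z", "njo"]}          # one historical token -> several modern ones
MERGE = {("po", "noči"): "ponoči"}      # two historical tokens -> one modern one


def retokenise(tokens):
    """Map a historical token sequence onto modern token boundaries."""
    out = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in MERGE:
            out.append(MERGE[pair])
            i += 2
        elif tokens[i] in SPLIT:
            out.extend(SPLIT[tokens[i]])
            i += 1
        else:
            out.append(tokens[i])
            i += 1
    return out


print(retokenise(["šel", "je", "žnjo", "po", "noči"]))
```

Running the pass before modernisation keeps the later tagging and lemmatisation steps aligned with modern word boundaries.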




Transcribed historical texts
• AHLib corpus/DL:
  90 books, 10,000 pages, 2M words (> 1850)
• NUK GTD:
  5,000 pages, 1M words
• Google Books:
  30 books, 10,000 pages, 2M words (in progress)
• WikiSource (Lj Uni):
  200 books, 5M words (in progress)
• Total: ~10M words

• most texts have associated facsimiles
• can be made freely available




Initial Lexicon
• Development of initial lexicon (2010), using the data and tools at hand
• AHLib collection (70 books > 1850)
• Transcription rules + FidaPLUS lexicon of contemporary slv
• LMU LeXtractor editing tool
• produced 3,000 entries (word-forms)


Reference corpus goo300k
• Page sampled
• Each word annotated with:
  • Contemporary equivalent
  • Modern lemma
  • Part-of-speech tag
• First with ToTrTaLe
• Then manually correct
  • INL Cobalt Lexicon Tool
  • A team of annotators
  • Also correcting errors in transcription
  • Manual, cookbook, FAQ, mailing list, meetings…
• TEI P5 – bibliography, links to facsimiles & DL

Period       Units   Pages   Tokens
1584             1       8     6000
1695             1      27    10000
1751-1800        8     155    27000
1801-1850       12     206    74000
1851-1875       36     380   126000
1876-1900       23     224    51000
∑               81    1000   296000



INL Cobalt lexicon building tool




TEI corpus dump




Final lexicon
Composition:
• Initial LeXtractor lexicon (3k entries)
• Lexicon dump from goo300k
• Additional lexicon from full text collection
Format:
• TEI P5
• lemma oriented
• grammatical properties, glosses, historical spelling, (corpus) examples

goo300k          All   Historical
Lex. entries   56346        22849
Word-forms     53853        19627
Normalised     46996        15402
Modernised     37334        11396
Lemmas         19569         8605
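
A lemma-oriented lexicon can be derived from an annotated corpus by grouping tokens under their lemma. The sketch below shows one possible in-memory shape for such a dump; it does not reproduce the actual TEI P5 encoding, and the Token fields, the "N" tag, and the function name build_lexicon are assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    form: str    # historical word-form as transcribed
    modern: str  # contemporary equivalent of the word-form
    lemma: str   # modern lemma
    pos: str     # part-of-speech tag (hypothetical tag values)


def build_lexicon(tokens):
    """Group tokens as (lemma, pos) -> modern form -> set of historical forms."""
    lexicon = defaultdict(lambda: defaultdict(set))
    for t in tokens:
        lexicon[(t.lemma, t.pos)][t.modern].add(t.form)
    return lexicon


corpus = [
    Token("lubesni", "ljubezni", "ljubezen", "N"),
    Token("ljubesni", "ljubezni", "ljubezen", "N"),
    Token("lubeſn", "ljubezen", "ljubezen", "N"),
]
entry = build_lexicon(corpus)[("ljubezen", "N")]
for modern, forms in sorted(entry.items()):
    print(modern, sorted(forms))
```

Keying on the lemma keeps archaic spellings and inflected forms of the same word together in a single entry, which matches the kind of variation shown for ljubezen earlier in the deck.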




Results
• Language resources for historical Slovene:
  • Text Collection hs5M:
    • facsimile + transcription, DL (+ automatic annotation)
  • Annotated Corpus goo300k:
    • page-sampled, hand-annotated
  • Structured Lexicon imp20k:
    • grammar + glosses + forms + attestations
  • TEI P5, CC BY
• ToTrTaLe + resources for HS:
  • tokenisation & transcription patterns
• Services: CUWI, (moderniser+archaiser)
• all still work in progress, available mid-2012




Further work
• Better IR for Digital Libraries: NUK
• Dictionary of historical Slovene: ZRC
• Beyond words: changes in syntax
• MT paradigm
• tweets & Croatian

IMPACT Final Conference - Language Parallel Sessions - Erjavec
