Resources for historical Slovene

Tomaž Erjavec
Department of Knowledge Technologies
Jožef Stefan Institute
Ljubljana


                  IMPACT Conference 2011
                 October 24-25, 2011, London
Tomaž Erjavec: Slovene language resources   2




Background
• Pre-story: AHLib (2004–08)
  (Deutsch-slowenische/kroatische Übersetzung 1848–1918)
   • Corpus / DL of ger→slv books
   • AAS: transcription correction and markup (TEI P4)
   • JSI: automatic annotation and editing environment
• Story: EU IP IMPACT (ext. 2010–2011)
  • Better OCR for historical texts
  • NUK: GTD transcriptions (PAGE/Aletheia)
  • JSI: (semi)manual lexicon construction
• Co-story: Google award (2011)
  • Developing language models for historical Slovene
  • ZRC SAZU: transcriptions of old texts (TEI P5)
  • JSI: annotating a corpus of old Slovene


Methodology
• Develop 3 resources:
  • transcribed texts
  • hand-annotated corpus
  • lexicon of historical words
• Develop annotation tool, ToTrTaLe
  • How to tag and lemmatise historical Slovene?
    Little chance of developing training data comparable to that for
    contemporary Slovene
  • Basic idea:
     • modernise words, then use models for modern Slovene
     • transcription is via fixed lexicon + transcription patterns
     • patterns implemented via LMU Vaam
     • mostly OK for XIX and XVIII century language
[Diagram: Texts, Annotators, Corpus, Historical lexicon, ToTrTaLe, Contemporary models]
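
The "fixed lexicon + transcription patterns" idea can be illustrated with a minimal Python sketch. It is not ToTrTaLe or LMU Vaam: the lexicon, the modern word list, and the four rewrite patterns are hypothetical and hand-picked to cover a few of the word forms quoted in these slides. A historical word is first looked up in the fixed lexicon; failing that, the patterns are applied in all combinations and a rewrite is accepted only if it is a known modern form.

```python
import re

# Hypothetical fixed lexicon: historical form -> modern form.
FIXED_LEXICON = {"žnjo": "z njo"}

# Hypothetical list of contemporary word-forms (a real system would use
# a large lexicon of modern Slovene).
MODERN_FORMS = {"ljubezen", "ljubezni", "ljubeznijo"}

# Hypothetical transcription patterns: (regex, replacement).
PATTERNS = [
    (r"ſ", "s"),       # long s -> s
    (r"^lu", "lju"),   # word-initial lu -> lju
    (r"s", "z"),       # s -> z
    (r"in$", "en"),    # word-final in -> en
]

def candidates(word):
    """Return the word plus every form reachable by applying the patterns."""
    seen = {word}
    frontier = [word]
    while frontier:
        current = frontier.pop()
        for pattern, replacement in PATTERNS:
            rewritten = re.sub(pattern, replacement, current)
            if rewritten not in seen:
                seen.add(rewritten)
                frontier.append(rewritten)
    return seen

def modernise(word):
    """Map a historical word to a modern form, or None if nothing matches."""
    key = word.lower()
    if key in FIXED_LEXICON:          # 1. fixed lexicon wins
        return FIXED_LEXICON[key]
    if key in MODERN_FORMS:           # 2. already a modern form
        return key
    hits = sorted(c for c in candidates(key) if c in MODERN_FORMS)
    return hits[0] if hits else None  # 3. pattern rewrites, filtered by lexicon

print(modernise("lubesen"))
print(modernise("ljubesin"))
print(modernise("zajhen"))
```

Filtering rewrites against a modern word list keeps over-eager patterns such as s → z from producing non-words; words with no modern counterpart simply return None and would need a lexicon entry.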




Issues
• Tokenisation - words were split differently in historical language:
  • žnjo → z njo
  • po noči → ponoči
• Variability:
  • archaic forms:
    ljubezen ← lubesen, ljubesen, lubeſn, ljubezin, ljubesin
  • inflection:
    ljubezen ← ljubezni, ljubeznijo
  • both:
    ljubezen ←
         ljubezni, ljubesni, lubesen, ljubesen, lubesni, lubeſn, ljubeznijo,
         ljubezin, lubeſne, lubeſni, lubesne, ljubesnijo, ljubesin
• Extinct words:
  • zajhen / cajhen / znamenje
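
The two tokenisation cases above (one historical token becoming two modern ones, and two becoming one) can be sketched as a small re-tokenisation pass. This is only an illustration under assumed data: the `SPLIT` and `MERGE` tables are hypothetical and contain just the two examples from this slide.

```python
# Hypothetical tables holding the two examples from the slide.
SPLIT = {"žnjo": ["z", "njo"]}          # one historical token -> several modern
MERGE = {("po", "noči"): "ponoči"}      # two historical tokens -> one modern

def retokenise(tokens):
    """Return the token list with historical splits and merges normalised."""
    out = []
    i = 0
    while i < len(tokens):
        pair = tuple(t.lower() for t in tokens[i:i + 2])
        if len(pair) == 2 and pair in MERGE:
            out.append(MERGE[pair])     # consume both tokens
            i += 2
        elif tokens[i].lower() in SPLIT:
            out.extend(SPLIT[tokens[i].lower()])
            i += 1
        else:
            out.append(tokens[i])
            i += 1
    return out

print(retokenise(["žnjo", "po", "noči"]))
```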




Transcribed historical texts
• AHLib corpus/DL:
  90 books, 10,000 pages, 2M words (> 1850)
• NUK GTD:
  5,000 pages, 1M words
• Google Books:
  30 books, 10,000 pages, 2M words (in progress)
• WikiSource (Lj Uni):
  200 books, 5M words (in progress)
~ 10M words

• most texts have associated facsimiles
• can be made freely available




Initial Lexicon
• Development of initial lexicon (2010), using the data and tools at hand
• AHLib collection (70 books > 1850)
• Transcription rules + FidaPLUS lexicon of contemporary slv
• LMU LeXtractor editing tool
• produced 3,000 entries (word-forms)


Reference corpus goo300k

Period       Units   Pages   Tokens
1584             1       8     6000
1695             1      27    10000
1751-1800        8     155    27000
1801-1850       12     206    74000
1851-1875       36     380   126000
1876-1900       23     224    51000
∑               81    1000   296000

• Page sampled
• Each word annotated with:
  • Contemporary equivalent
  • Modern lemma
  • Part-of-speech tag
• First with ToTrTaLe
• Then manually correct
  • INL Cobalt Lexicon Tool
  • A team of annotators
  • Also correcting errors in transcription
  • Manual, cookbook, FAQ, mailing list, meetings…
• TEI P5 – bibliography, links to facsimiles & DL
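
As a rough illustration of the three annotation layers and the automatic-then-manual workflow, a token can be modelled as a small record that an annotator's correction replaces. The field names, the `checked` flag, and the tag value are assumptions made for this sketch; the corpus itself is encoded in TEI P5, not in this form, and the "automatic" values below are invented to show a correction.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Token:
    form: str               # word as transcribed from the historical text
    modern: str             # contemporary equivalent
    lemma: str              # modern lemma
    pos: str                # part-of-speech tag (tag value is illustrative)
    checked: bool = False   # True once an annotator has verified the token

def correct(token, **fixes):
    """Return a copy with the annotator's fixes applied and marked as checked."""
    return replace(token, checked=True, **fixes)

# Invented automatic annotation that left the archaic spelling untouched.
auto = Token(form="lubesni", modern="ljubesni", lemma="ljubesen", pos="Noun")
# Manual correction of the contemporary equivalent and the lemma.
fixed = correct(auto, modern="ljubezni", lemma="ljubezen")
print(fixed)
```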



INL Cobalt lexicon building tool




TEI corpus dump




Final lexicon
Composition:
• Initial LeXtractor lexicon (3k entries)
• Lexicon dump from goo300k
• Additional lexicon from full text collection

goo300k           All   Historical
Lex. entries    56346        22849
Word-forms      53853        19627
Normalised      46996        15402
Modernised      37334        11396
Lemmas          19569         8605

Format:
• TEI P5
• lemma oriented
• grammatical properties, glosses, historical spelling, (corpus) examples
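
One way to picture a lemma-oriented "lexicon dump" from an annotated corpus is to group the annotated tokens by lemma, then by modern form, collecting the attested historical spellings. The sketch below does this over a handful of (historical form, modern form, lemma) triples built from word forms quoted earlier in these slides; the tuple layout and the nested-dictionary output are assumptions for illustration, not the actual TEI P5 lexicon format.

```python
from collections import defaultdict

# Hypothetical annotated tokens: (historical form, modern form, lemma).
ANNOTATED = [
    ("lubesen", "ljubezen", "ljubezen"),
    ("ljubesni", "ljubezni", "ljubezen"),
    ("lubesni", "ljubezni", "ljubezen"),
    ("ljubezni", "ljubezni", "ljubezen"),
]

def build_lexicon(tokens):
    """Group attested spellings as lemma -> modern form -> sorted spellings."""
    lexicon = defaultdict(lambda: defaultdict(set))
    for form, modern, lemma in tokens:
        lexicon[lemma][modern].add(form)
    return {
        lemma: {modern: sorted(forms) for modern, forms in by_modern.items()}
        for lemma, by_modern in lexicon.items()
    }

for lemma, by_modern in build_lexicon(ANNOTATED).items():
    print(lemma, by_modern)
```

A real entry would also carry the grammatical properties, glosses, and corpus examples listed above; those fields are left out here to keep the grouping step visible.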




Results
• Language resources for historical Slovene:
  • Text Collection hs5M:
    • facsimile + transcription, DL (+ automatic annotation)
  • Annotated Corpus goo300k:
    • page-sampled, hand-annotated
  • Structured Lexicon imp20k:
    • grammar + glosses + forms + attestations
  • TEI P5, CC BY
• ToTrTaLe + resources for HS:
  • tokenisation & transcription patterns
• Services: CUWI, (moderniser+archaiser)
• all still work in progress, available mid-2012




Further work
• Better IR for Digital Libraries: NUK
• Dictionary of historical Slovene: ZRC
• Beyond words: changes in syntax
• MT paradigm
• tweets & Croatian

IMPACT Final Conference - Language Parallel Sessions - Erjavec
