Innovations in Slovenian
(e-)lexicography:
from (semi-)automatic data
extraction to crowdsourcing
and beyond
Dr Iztok Kosem
Faculty of Arts, University of Ljubljana &
Centre for Applied Linguistics, Trojina Institute
Lexicographical process (Klosa, 2013)
Born-digital dictionaries
• ANW (Dictionary of Contemporary Dutch)
• 51079 entries (incl. partly complete entries)
• Innovative features (e.g. semagrams)
• Great Dictionary of Polish
• A great deal of manual work included (Zmigrodzki 2014)
• Immediate release of final entries
• 15,000 entries in 5 years (not many examples!)
• Estonian collocations dictionary (Kallas et al. 2015)
• Starting point: automatically extracted data
• Problems: examples extracted using a very general
configuration; missing collocation clustering etc.
• Publication of the entire dictionary at the end
Dictionary situation in Slovenia
• Last comprehensive dictionary of Slovene published in 1991
(with many entries older, from the 1970s and 1980s)
• Based on material from late 19th century to 1970s
• Dictionary database not accessible (also question marks about its
usefulness)
• Second edition published in 2014
• Minor updates to the first edition (also opposing the conceptual
framework of the first version; Krek 2014; Ahlin et al. 2014)
• Online version requires purchase of the printed version
• Database is not available
• Dictionary publishing in general:
• Commercial publishers closing dictionary departments (no new
projects)
• General monolingual projects publicly funded
Dictionary of Contemporary
Slovene Language
• Challenges:
• Compiling a corpus-based dictionary from scratch, using
state-of-the-art lexicographic methods and theoretical
underpinnings
• Meeting the needs of dictionary users (digital natives)
• Meeting the needs of NLP and language technology
communities
• Communication in Slovene (2008-2013)
• Gigafida corpus (1.2 billion words)
• New POS-tagger, parser and lexicon of word forms
• Slovene Lexical Database (Gantar et al. 2016)
• Testing new methods and approaches
Lexicography and automation
• Which parts of a dictionary entry can be
(semi-)automatically extracted:
• List of words (e.g. terms)
• New words (Cook et al. 2013)
• Definitions (e.g. Pearson 1998; Pollak 2014)
• Some types of labels (Rundell & Kilgarriff 2011)
• Grammatical relations, collocations, multi-word
expressions (PARSEME COST Action)
• Corpus examples (Kosem et al. 2013; Gantar et al. 2016;
Cook et al. 2014)
authority (“manual” Sketch Grammar)
35 gramrels
authority (automatic Sketch Grammar)
39 gramrels
19 gramrels with 92 multi-word links
(separate page)
“it is more efficient to edit out the
computer’s errors than to go through
the whole data-selection process from
the beginning”
(Rundell & Kilgarriff, 2011)
“too many choices early in the data-selection
process leave more room for error”
(Kosem, Gantar & Krek, 2013)
Main (unproven) criticisms
• Automatic tools cannot replace lexicographers
• Important information can be missed
• Analysis is not as detailed and reliable as with the
manual approach
• Etc.
• Evaluation (Kosem et al. 2015)
SLD entries   Coverage of syntactic structures   Coverage of collocates under structures
nouns         82.40%                             72.79%
adjectives    94.33%                             75.80%
adverbs       92.78%                             78.32%
• 100% coverage of all collocates:
• 12% of noun entries
• 8.4% of verb entries
• 16.4% of adjective entries
• 25% of adverb entries
• 100% coverage of collocates under syntactic structures:
• 9.7% of noun entries
• 18.5% of adjective entries
• 22.5% of adverb entries
• 100% coverage of syntactic structures:
• 35.4% of noun entries
• 81.1% of adjective entries
• 82.5% of adverb entries
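The coverage figures above compare what lexicographers recorded in SLD entries with what automatic extraction returned. A minimal sketch of how such percentages could be computed is given here; the function names and the sample collocates are hypothetical and are not taken from Kosem et al. (2015).

```python
def coverage(manual: set[str], automatic: set[str]) -> float:
    """Percentage of manually recorded items that also appear in the automatic data."""
    if not manual:
        return 100.0  # nothing to cover
    return 100.0 * len(manual & automatic) / len(manual)


def share_fully_covered(entries: list[tuple[set[str], set[str]]]) -> float:
    """Percentage of entries whose manual items are all found automatically."""
    full = sum(1 for manual, automatic in entries if manual <= automatic)
    return 100.0 * full / len(entries)


# Hypothetical entries: (collocates in the manual entry, collocates extracted automatically)
entry_a = ({"velik", "nov", "star", "lep"}, {"velik", "nov", "star", "majhen", "dober"})
entry_b = ({"hitro"}, {"hitro", "zelo"})

coverage_a = coverage(*entry_a)                     # 3 of 4 manual collocates found
full_share = share_fully_covered([entry_a, entry_b])  # 1 of 2 entries fully covered
```

The same two functions apply to syntactic structures if structure names are used as set members instead of collocates.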
Why not always 100%?
11.8.2015 Herstmonceux castle, eLex 2015
• Errors in SLD – a small amount (e.g. typos, wrong case
of collocate under certain syntactic structure)
• Different corpora and sketch grammars used
• Parameters for automatic extraction quite strict
• E.g. structure not exported if no collocates match the
minimum criteria → structure marked as not found by ADE
• On the other hand:
• Five to six times more collocates extracted
• Several syntactic structures in automatically extracted data,
which were not detected by lexicographers
• Several (good) examples match (more examples analysed)
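The effect of strict extraction parameters can be illustrated with a short sketch: a syntactic structure is exported only if at least one of its collocates passes the minimum criteria. The thresholds, the choice of frequency plus an association score as criteria, and the structure names are all assumptions made for illustration; the slides do not state the actual parameters.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    collocate: str
    frequency: int
    score: float  # association score; which measure was used is an assumption


MIN_FREQUENCY = 5  # hypothetical threshold
MIN_SCORE = 3.0    # hypothetical threshold


def export_structures(structures: dict[str, list[Candidate]]) -> dict[str, list[Candidate]]:
    """Keep only collocates that meet both thresholds; drop structures left empty."""
    exported = {}
    for name, candidates in structures.items():
        kept = [c for c in candidates
                if c.frequency >= MIN_FREQUENCY and c.score >= MIN_SCORE]
        if kept:  # a structure with no surviving collocates is not exported
            exported[name] = kept
    return exported


# Hypothetical extraction output for one headword
structures = {
    "adjective + noun": [Candidate("velik", 120, 7.2), Candidate("redek", 2, 1.1)],
    "noun + noun (genitive)": [Candidate("okno", 1, 0.4)],
}
exported = export_structures(structures)
```

In this sketch the second structure disappears from the export, which an evaluation would then count as a structure not found.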
Post-processing
• Tasks that are automated:
• Converting extracted data into the correct form (lemma
+ collocate)
• Removing duplicate examples
• Cleaning examples of noise (e.g. removing any extra
spaces before full stops and commas)
• Assigning IDs of lemmas from the lexicon of word forms
• Other issues:
• False collocates (e.g. tagging problems)
• Incorrect examples (i.e. where the collocation does not
match the grammatical relation it belongs to)
• Grouping collocates, attributing them under senses, etc.
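Two of the automated tasks listed above, cleaning noise and removing duplicate examples, can be sketched as follows. This is only an illustration under simple assumptions (examples are plain strings, and duplicates are compared after cleaning); the function names are hypothetical and the project's actual scripts are not shown in the slides.

```python
import re


def clean_example(text: str) -> str:
    """Remove spaces before full stops and commas and collapse repeated spaces."""
    text = re.sub(r"\s+([.,])", r"\1", text)
    text = re.sub(r"\s{2,}", " ", text)
    return text.strip()


def deduplicate(examples: list[str]) -> list[str]:
    """Clean each example and keep the first occurrence of every distinct one."""
    seen = set()
    unique = []
    for example in examples:
        cleaned = clean_example(example)
        if cleaned not in seen:
            seen.add(cleaned)
            unique.append(cleaned)
    return unique


raw = ["Dober primer .", "Dober primer.", "To je  primer , ki ima napako ."]
examples = deduplicate(raw)
```

Cleaning before comparing means that two examples differing only in spacing count as duplicates.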
"Crowdsourcing" in lexicography:
(improving) the final product
(Abel & Meyer, 2013)
Crowdsourcing – dividing a complex
task into a series of simple ones
• Why is crowdsourcing needed in lexicography:
• challenges:
• lexicographers are facing increasing time constraints
& amounts of data
• lexicographers are overqualified for routine
post-editing of automatic procedures
• potential:
• non-expert individuals are talented, creative &
productive enough to solve such tasks
• modern technology makes using the potential of the
crowd simple, affordable & effective
Crowdsourcing - caveats
• estimate of the required investment with regard to
time, money & personnel is crucial
(should not take up more time &
resources than conventional methods)
• if fully integrated in the project,
microtasks can be designed according to
the same principles, use the same pre- &
post-processing chains & platforms
(economizing the initial investment)
Lessons learned
• Instructions must be clearly formulated and simple;
answers must not allow grading (only YES, NO, I
DON’T KNOW)
• not all automatically extracted data is suitable for
crowdsourcing:
• e.g. some grammatical relations are too complex for
evaluation
• users need to focus on some other objective:
competition, credits, money (micro payments)
• Gamification:
• examples: language games such as ESP Game (von Ahn,
2006) and Phrase Detectives (Chamberlain et al., 2008)
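With answers restricted to YES, NO and I DON'T KNOW, the crowd's judgements on one microtask can be combined by a simple vote. The sketch below is one possible aggregation, not the procedure used in the project: the minimum number of votes, the agreement threshold and the REVIEW fallback (sending the item to a lexicographer) are assumptions.

```python
from collections import Counter


def aggregate(answers: list[str], min_votes: int = 3, min_agreement: float = 0.7) -> str:
    """Return YES or NO if the crowd agrees clearly enough, otherwise REVIEW."""
    # I DON'T KNOW answers carry no judgement, so only YES/NO are counted.
    decisive = [a for a in answers if a in ("YES", "NO")]
    if len(decisive) < min_votes:
        return "REVIEW"
    label, count = Counter(decisive).most_common(1)[0]
    if count / len(decisive) >= min_agreement:
        return label
    return "REVIEW"


clear = aggregate(["YES", "YES", "YES", "NO", "I DON'T KNOW"])
too_few = aggregate(["YES", "NO", "I DON'T KNOW"])
split = aggregate(["YES", "NO", "YES", "NO"])
```

Items that end up as REVIEW are the ones where expert time is still needed, which keeps routine post-editing away from lexicographers.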
Lexicographical process of DCSL
DCSL – implementation and
future
• Meeting the needs of users
• Release of entries at each stage (thus, dictionary is
available from the start)
• Making the database available to NLP community,
researchers etc.
• A parallel project for testing and improving the first
stages of the procedure: Collocations dictionary of
Slovene
Thank you!
• Funded by Slovenian Research Agency project:
Koncept madžarsko-slovenskega slovarja: od
jezikovnega vira do uporabnika (V6-1509)
