This document discusses innovations in Slovenian lexicography, including the use of automation, crowdsourcing, and born-digital dictionaries. It summarizes the challenges of compiling a new Dictionary of Contemporary Slovene Language from scratch using modern methods. Some key points discussed are the automated extraction of dictionary entries, evaluation of automatic tools, and the potential and caveats of using crowdsourcing for post-editing. The document also provides examples of born-digital dictionaries from other languages and lessons learned from projects testing new lexicographical approaches.
Innovations in Slovenian Lexicography: From Automation to Crowdsourcing
1. Innovations in Slovenian (e-)lexicography: from (semi-)automatic data extraction to crowdsourcing and beyond
Dr Iztok Kosem
Faculty of Arts, University of Ljubljana &
Centre for Applied Linguistics, Trojina Institute
3. Born-digital dictionaries
• ANW (Dictionary of Contemporary Dutch)
• 51,079 entries (incl. partly complete entries)
• Innovative features (e.g. semagrams)
• Great Dictionary of Polish
• A great deal of manual work included (Zmigrodzki 2014)
• Immediate release of final entries
• 15,000 entries in 5 years (not many examples!)
• Estonian collocations dictionary (Kallas et al. 2015)
• Starting point: automatically extracted data
• Problems: examples extracted using a very general
configuration; missing collocation clustering etc.
• Publication of the entire dictionary at the end
4. Dictionary situation in Slovenia
• Last comprehensive dictionary of Slovene published in 1991
(with many entries older, dating from the 1970s and 1980s)
• Based on material from the late 19th century to the 1970s
• dictionary database not accessible (and there are question marks
over its usefulness)
• Second edition published in 2014
• minor updates to the first edition (the conceptual framework of
the first version has also been contested; Krek 2014; Ahlin et al. 2014)
• online access requires purchase of the printed version
• database is not available
• Dictionary publishing in general:
• Commercial publishers closing dictionary departments (no new
projects)
• General monolingual projects publicly funded
5. Dictionary of Contemporary
Slovene Language
• Challenges:
• Compiling a corpus-based dictionary from scratch, using
state-of-the-art lexicographic methods and theoretical
underpinnings
• Meeting needs of dictionary users (digital natives)
• Meeting the needs of NLP and language technology
communities
• Communication in Slovene (2008-2013)
• Gigafida corpus (1.2 billion words)
• New POS-tagger, parser and lexicon of word forms
• Slovene Lexical Database (Gantar et al. 2016)
• Testing new methods and approaches
6. Lexicography and automation
• Which parts of dictionary entry can be
(semi-)automatically extracted:
• List of words (e.g. terms)
• New words (Cook et al. 2013)
• Definitions (e.g. Pearson 1998; Pollak 2014)
• Some types of labels (Rundell & Kilgarriff 2011)
• Grammatical relations, collocations, multi-word
expressions (PARSEME COST Action)
• Corpus examples (Kosem et al. 2013; Gantar et al. 2016;
Cook et al. 2014)
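The collocation extraction mentioned above can be illustrated with a minimal sketch. This is not the project's actual pipeline (which relies on Sketch Engine word sketches and sketch grammars); the function name, the toy corpus, and the minimum-frequency threshold are invented for illustration. It counts adjacent ADJ+NOUN pairs in a POS-tagged corpus and ranks candidates by the logDice association score used in Sketch Engine:

```python
import math
from collections import Counter

# Toy POS-tagged corpus: each sentence is a list of (lemma, POS) pairs.
# A real pipeline would read lemmatized, parsed corpus data (e.g. Gigafida).
sentences = [
    [("strong", "ADJ"), ("coffee", "NOUN")],
    [("strong", "ADJ"), ("tea", "NOUN")],
    [("strong", "ADJ"), ("coffee", "NOUN")],
    [("hot", "ADJ"), ("coffee", "NOUN")],
]

def extract_adj_noun(sentences, min_freq=2):
    """Count adjacent ADJ+NOUN bigrams and rank candidates by logDice.

    Candidates below min_freq are dropped, mirroring the strict
    minimum criteria mentioned in the talk.
    """
    pair_freq, word_freq = Counter(), Counter()
    for sent in sentences:
        for (l1, p1), (l2, p2) in zip(sent, sent[1:]):
            word_freq[l1] += 1
            if p1 == "ADJ" and p2 == "NOUN":
                pair_freq[(l1, l2)] += 1
        word_freq[sent[-1][0]] += 1  # count the final lemma of each sentence
    results = []
    for (adj, noun), f_xy in pair_freq.items():
        if f_xy < min_freq:
            continue
        # logDice = 14 + log2(2 * f(x,y) / (f(x) + f(y)))
        log_dice = 14 + math.log2(2 * f_xy / (word_freq[adj] + word_freq[noun]))
        results.append((adj, noun, f_xy, round(log_dice, 2)))
    return sorted(results, key=lambda r: -r[3])
```

On the toy corpus only "strong coffee" passes the frequency threshold; "strong tea" and "hot coffee" occur once each and are filtered out, just as strict extraction parameters caused structures to be missed in the evaluation described below.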
9. “it is more efficient to edit out the
computer’s errors than to go through
the whole data-selection process from
the beginning”
(Rundell & Kilgarriff, 2011)
“too many choices early in the data-selection process leave more
room for error”
(Kosem, Gantar & Krek, 2013)
10. Main (unproven) criticisms
• Automatic tools cannot replace lexicographers
• Important information can be missed
• Analysis is not as detailed and reliable as with the
manual approach
• Etc.
• Evaluation (Kosem et al. 2015)
12. • 100% coverage of all collocates:
• 12% of noun entries
• 8.4% of verb entries
• 16.4% of adjective entries
• 25% of adverb entries
• 100% coverage of collocates under syntactic structures:
• 9.7% of noun entries
• 18.5% of adjective entries
• 22.5% of adverb entries
• 100% coverage of syntactic structures
• 35.4% of noun entries
• 81.1% of adjective entries
• 82.5% of adverb entries.
13. Why not always 100%?
11.8.2015, Herstmonceux Castle, eLex 2015
• A small number of errors in the SLD (e.g. typos, wrong case of a
collocate under a certain syntactic structure)
• Different corpora and sketch grammars used
• Parameters for automatic extraction were quite strict
• E.g. a structure is not exported if no collocates meet the
minimum criteria, so ADE marks the structure as not found
• On the other hand:
• Five to six times more collocates extracted
• Several syntactic structures present in the automatically extracted
data that were not detected by lexicographers
• Several (good) examples matched (more examples were analysed)
14. Post-processing
• Tasks that are automated:
• Converting extracted data into the correct form (lemma
+ collocate)
• Removing duplicate examples
• Cleaning examples of noise (e.g. removing extra spaces
before full stops and commas)
• Assigning IDs of lemmas from the lexicon of word forms
• Other issues:
• False collocates (e.g. tagging problems)
• Incorrect examples (i.e. where the collocation does not
match the grammatical relation it belongs to)
• Grouping collocates, attributing them under senses, etc.
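The first three automated tasks above are simple text normalization. The following is a minimal sketch of how such post-processing might look; the function names and the exact cleaning rules are illustrative assumptions, not the project's actual code:

```python
import re

def clean_example(text):
    """Clean extraction noise from a corpus example:
    collapse runs of whitespace and drop stray spaces
    before punctuation such as full stops and commas."""
    text = re.sub(r"\s+", " ", text).strip()
    text = re.sub(r"\s+([.,;:!?])", r"\1", text)
    return text

def dedupe_examples(examples):
    """Remove duplicate examples, keeping the first occurrence.
    Examples are normalized first, so near-duplicates that differ
    only in spacing are also caught."""
    seen, kept = set(), []
    for ex in examples:
        norm = clean_example(ex)
        if norm not in seen:
            seen.add(norm)
            kept.append(norm)
    return kept
```

Normalizing before deduplication matters: two extracted examples that differ only in stray whitespace should count as one.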
16. Crowdsourcing – dividing a complex
task into a series of simple ones
• Why is crowdsourcing needed in lexicography:
• challenges:
• lexicographers are facing increasing time constraints
& amounts of data
• lexicographers are overqualified for routine post-
editing of automatic procedures
• potential:
• non-expert individuals are talented, creative &
productive enough to solve such tasks
• modern technology makes using the potential of the
crowd simple, affordable & effective
17. Crowdsourcing – caveats
• an estimate of the required investment in time, money &
personnel is crucial (crowdsourcing should not take up
more time & resources than conventional methods)
• if fully integrated into the project, microtasks can be
designed according to the same principles and use the
same pre- & post-processing chains & platforms
(economizing on the initial investment)
18. Lessons learned
• Instructions must be clearly formulated and simple; answers
must not allow grading (only YES, NO, I DON'T KNOW)
• not all automatically extracted data is suitable for
crowdsourcing:
• e.g. some grammatical relations are too complex for
evaluation
• users need to focus on some other objective:
competition, credits, money (micro payments)
• Gamification:
• examples: language games such as ESP Game (von Ahn,
2006) and Phrase Detectives (Chamberlain et al., 2008)
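Restricting answers to YES / NO / I DON'T KNOW makes crowd responses easy to aggregate. A minimal sketch of such aggregation, with invented thresholds (the source does not specify how votes were combined): take the majority of informative votes, and send the item back to a lexicographer when the crowd is too small or too divided:

```python
from collections import Counter

def aggregate_votes(votes, min_votes=3, min_agreement=0.6):
    """Aggregate YES / NO / I DON'T KNOW microtask answers.

    Returns the majority label, or None when there are too few
    informative votes or agreement is below the threshold
    (i.e. the item should go back to a lexicographer).
    """
    informative = [v for v in votes if v in ("YES", "NO")]
    if len(informative) < min_votes:
        return None
    label, count = Counter(informative).most_common(1)[0]
    if count / len(informative) < min_agreement:
        return None
    return label
```

For example, three YES votes out of four informative answers yield a confident YES, while a 2–2 split is returned to an expert for review.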
21. DCSL – implementation and
future
• Meeting the needs of users
• Release of entries at each stage (thus, dictionary is
available from the start)
• Making the database available to NLP community,
researchers etc.
• A parallel project for testing and improving the first
stages of the procedure: Collocations dictionary of
Slovene
22. Thank you!
• Funded by the Slovenian Research Agency project
Koncept madžarsko-slovenskega slovarja: od jezikovnega vira
do uporabnika (The Concept of a Hungarian-Slovenian Dictionary:
From Language Resource to the User; V6-1509)