The document discusses computer lexica and their uses in optical character recognition (OCR) and information retrieval from historical texts. It describes different types of lexica, including OCR lexica containing common words to improve OCR accuracy and IR lexica linking word forms to modern lemmas to aid search and retrieval. Tools are presented for building lexica from corpora and dictionaries, applying lexica in OCR and retrieval, and analyzing spelling variation patterns.
Language tools bne-5-10-2011
1. Computer Lexica in OCR and Retrieval Katrien Depuydt, Jesse de Does (Instituut voor Nederlandse Lexicologie, Leiden)
2. Can we handle ‘de wereld’ (‘the world’)? 4 March 2009 presentation, The Hague. werreid
3. IMPACT <Demo Day BL, 12 July 2011> OCR: Abbyy Finereader SDK with built-in standard Dutch dictionary. OCR: Abbyy Finereader SDK combining the built-in modern Dutch dictionary with the IMPACT external historical lexicon of Dutch: werreld
5. The long s problem: An example …. IMPACT workshop, Bratislava, May 7, 2010 OCR at start of project A. De eerde was de gevaarlykflti om de verlei¬ ding aan 't Hof; de tweede de ftillie en veiligde ; de derde de zwaarde , daar hy byna drie millioenen harde en onbefchaafde Menfchen beftieren moest. .
6. The long s problem: An example …. OCR at start of project: A. De eerde was de gevaarlykflti om de verlei¬ ding aan 't Hof; de tweede de ftillie en veiligde ; de derde de zwaarde , daar hy byna drie millioenen harde en onbefchaafde Menfchen beftieren moest. Results April 2010: A. De eerste was de gevaarlykste om de verlei- ding aan 't Hof; de tweede de stilste en veiligste; de derde de zwaarste, daar hy byna drie millioenen harde en onbeschaafde Menschen bestieren moest.
7. The long s problem: An example …. Workaround: “integrated postcorrection”. Tell the engine that “eerfte” is OK and postcorrect it afterwards with the lexicon. This keeps the engine from turning the word into “eerde” (earth) instead of “eerste” (first). OCR at start of project: A. De eerde was de gevaarlykflti om de verlei¬ ding aan 't Hof; de tweede de ftillie en veiligde ; de derde de zwaarde , daar hy byna drie millioenen harde en onbefchaafde Menfchen beftieren moest. Results April 2010: A. De eerste was de gevaarlykste om de verlei- ding aan 't Hof; de tweede de stilste en veiligste; de derde de zwaarste, daar hy byna drie millioenen harde en onbeschaafde Menschen bestieren moest.
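The workaround on slide 7 can be sketched roughly as follows. This is a minimal illustration, not the IMPACT implementation: it assumes the OCR engine reads every non-final long s as "f", and the function names and the tiny lexicon are hypothetical. The idea is to accept the "f" spellings during recognition and map them back to lexicon words afterwards.

```python
def long_s_form(word):
    """Spell every non-final 's' as 'f' (simplifying assumption about how
    the OCR engine reads the long s)."""
    if len(word) < 2:
        return word
    return word[:-1].replace("s", "f") + word[-1]


def build_postcorrection_table(lexicon):
    """Map each long-s misreading back to its lexicon word.

    Variants that are themselves lexicon words are skipped so that
    genuine words containing 'f' are never rewritten.
    """
    table = {}
    for word in lexicon:
        variant = long_s_form(word)
        if variant != word and variant not in lexicon:
            table[variant] = word
    return table


def postcorrect(tokens, table):
    """Replace accepted long-s forms with the intended lexicon word."""
    return [table.get(token, token) for token in tokens]


# Hypothetical mini-lexicon, using words from the slide.
LEXICON = {"eerste", "stilste", "veiligste", "zwaarste", "hof"}
TABLE = build_postcorrection_table(LEXICON)

# Word list offered to the OCR engine: real words plus tolerated long-s forms.
ENGINE_WORDS = LEXICON | set(TABLE)

print(postcorrect(["De", "eerfte", "was"], TABLE))
```

Because "eerfte" is in the engine's word list, the engine has no reason to prefer the unrelated dictionary word "eerde"; the lexicon lookup then restores "eerste".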
9. What is a computer lexicon?
17. OCR lexicon: example
1550-1750: song 820, rihte 818, theire 818, manye 818, sume 815, Do 814, Whiche 811, fyrst 811, while 811, Water 810, wt 809, shalbe 808, thingis 807, again 806, sona 806, wa 805, mode 804, work 802, between 801, law 799, moder 798, mis 798, softe 798
> 1900: television 418, electronic 375, video 194, hormone 176, jazz 162, eco 142, software 136, vitamin 128, movie 121, taxi 113, isotopic 108, electronics 95, radar 86, basically 71, sabotage 71, homozygote 70, psychedelic 67, phonemic 66, insulin 64, zap 64, antibody 61, fungicidal 61
19. IMPACT <Demo Day BL, 12 July 2011>
<?xml version='1.0'?>
<!DOCTYPE lexicon SYSTEM 'NL_Structure.dtd'>
<lexicon>
  <lexical_entry>
    <lemma_id>219490</lemma_id>
    <modern_lemma>aantuilen</modern_lemma>
    <gloss></gloss>
    <POS>VRB</POS>
    <ne_label></ne_label>
    <language_id></language_id>
    <portmanteau_lemma_id></portmanteau_lemma_id>
    <wordform>
      <form_representation>
        <wordform_id>850026</wordform_id>
        <written_form>tuyld</written_form>
        <attestation>
          <id>92141</id>
          <token_id></token_id>
          <quote>Verhael ick (<I>t.w. een als vrouw verkleede man</I>) haer mijn min in Vrouwelijcker schynen: Sy acht het boertery, en tuyld daer weer op an , Vermits een Vrou niet op een Vrou verlieven kan,</quote>
          <derivation_id>0</derivation_id>
          <document_id>204</document_id>
          <start_pos>119</start_pos>
          <end_pos>124</end_pos>
        </attestation>
      </form_representation>
    </wordform>
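As a rough illustration of how an entry like the one on slide 19 could be read by a program, the sketch below uses Python's standard `xml.etree.ElementTree`. The sample is a shortened copy of the slide's entry; the closing `</lexical_entry>` and `</lexicon>` tags are an assumption (the slide shows only an excerpt), the quote is abbreviated, and `read_entries` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the slide's entry. The two closing tags at the end are
# assumed, since the slide cuts off before them; the quote is abbreviated.
SAMPLE = """<lexicon>
  <lexical_entry>
    <lemma_id>219490</lemma_id>
    <modern_lemma>aantuilen</modern_lemma>
    <POS>VRB</POS>
    <wordform>
      <form_representation>
        <wordform_id>850026</wordform_id>
        <written_form>tuyld</written_form>
        <attestation>
          <id>92141</id>
          <quote>Sy acht het boertery, en tuyld daer weer op an</quote>
          <document_id>204</document_id>
          <start_pos>119</start_pos>
          <end_pos>124</end_pos>
        </attestation>
      </form_representation>
    </wordform>
  </lexical_entry>
</lexicon>"""


def read_entries(xml_text):
    """Return one dict per lexical entry: modern lemma, POS, written forms."""
    root = ET.fromstring(xml_text)
    entries = []
    for entry in root.iter("lexical_entry"):
        lemma = (entry.findtext("modern_lemma") or "").strip()
        pos = (entry.findtext("POS") or "").strip()
        forms = [(f.text or "").strip() for f in entry.iter("written_form")]
        entries.append({"lemma": lemma, "pos": pos, "forms": forms})
    return entries


ENTRIES = read_entries(SAMPLE)
print(ENTRIES)
```

The result links the historical word form "tuyld" to the modern lemma "aantuilen", which is the pairing a retrieval lexicon needs.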
20. Tools for lexicon building and application of lexica
31. IMPACT Attestation Tool. Tool screen: lemma (headword); quotations, sorted by uncertainty; up-to-date overview of what is done and what needs to be done; work done by this user so far.
A snippet from a Dutch magazine (De Denker, No. 4, 24 January 1763). OCR: improving access to text by improving the quality of the text. Retrieval: improving access to text by dealing with historical spelling variants. Used: the historical lexicon of Dutch. Can we handle ‘the world’? “Yes we can” ought to be our answer, especially when investing hugely in mass digitisation. Mass digitisation is the very reason for investing in lexicon building. Efforts in digitising huge quantities of historical text demand efforts in the quality of OCR as well as retrieval, and historical lexicon building for OCR and retrieval, as this little example shows, can contribute to that. An example: in a ground-truth corpus of Dutch texts from 1550 until 1950, containing approximately 150 million words, a search for the very common word ‘wereld’ yielded 23396 hits. Using a historical lexicon containing spelling and morphological variants of this word resulted in 34339 hits, that is 10943 more (about 47%).
This presentation is based on how the INL works with language. An electronic dictionary is not what we need for OCR and simple retrieval, but it is introduced anyway because we can (and do) use our dictionaries for lexicon construction.
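The retrieval gain described in this note comes from expanding a query with the variants the lexicon lists for the same lemma. A minimal sketch of that idea follows; the variant list and the sample sentence are illustrative assumptions, not data from the historical lexicon of Dutch, and the function names are hypothetical.

```python
from collections import defaultdict

# Illustrative (word form, modern lemma) pairs; not taken from the INL lexicon.
LEXICON_PAIRS = [
    ("wereld", "wereld"),
    ("werelt", "wereld"),
    ("werreld", "wereld"),
    ("weereld", "wereld"),
    ("werelden", "wereld"),
]


def build_tables(pairs):
    """Build form -> lemma and lemma -> forms lookups."""
    form_to_lemma = {}
    lemma_to_forms = defaultdict(set)
    for form, lemma in pairs:
        form_to_lemma[form] = lemma
        lemma_to_forms[lemma].add(form)
    return form_to_lemma, lemma_to_forms


def expand_query(term, form_to_lemma, lemma_to_forms):
    """Return the term plus every form that shares its lemma."""
    lemma = form_to_lemma.get(term, term)
    return lemma_to_forms.get(lemma, set()) | {term}


def count_hits(tokens, terms):
    """Count the tokens that match any of the query terms."""
    return sum(1 for token in tokens if token.lower() in terms)


FORM_TO_LEMMA, LEMMA_TO_FORMS = build_tables(LEXICON_PAIRS)
TOKENS = "De werelt is groot , de wereld is rond en de werreld draait".split()

PLAIN_HITS = count_hits(TOKENS, {"wereld"})
EXPANDED_HITS = count_hits(
    TOKENS, expand_query("wereld", FORM_TO_LEMMA, LEMMA_TO_FORMS)
)
print(PLAIN_HITS, EXPANDED_HITS)
```

On this toy sentence the plain query finds only the modern spelling, while the expanded query also finds the two historical spellings.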
This is what an XML-based electronic dictionary looks like.
This is the XML of the Oxford English Dictionary. The horizontal lines mark a place where part of the structure has been folded in.
Terms used here. Lemma: the headword, like the entry in an ordinary dictionary. Part of speech: the word class, such as noun or verb. Morphology: morphological analysis is done for compounds and derivatives and shows which parts are to be distinguished in a word, e.g. apple pie: apple + pie.
This is a small part of a computational lexicon (of a certain type; there are many types of computational lexica).
Lemma, again: be, was, am, is, etc. are all forms of the same word BE, and BE is an example of a lemma.
Two types of variation, examples for Dutch from the lexicon
To give an indication of possible spelling variants of the word ‘world’ for English, a screenshot from the OED online...
These are some of the ways in which we are using computer lexica as building blocks.
Explanation: semi-supervised approach. Match the word list from the corpus with the lexicon and find both the pairings of corpus words with lexicon words and the patterns needed for the transformation. This only works if corpus and lexicon are a good match.
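The matching step described in this note can be sketched as follows. This is only an illustration of the general idea, not the tool presented in the talk: it pairs each unknown corpus word with its closest lexicon word using Python's standard `difflib`, then records the character substitutions that turn one into the other. The word lists, the similarity cutoff of 0.7 and the function name are all assumptions.

```python
from collections import Counter
from difflib import SequenceMatcher, get_close_matches


def pair_and_extract(corpus_words, lexicon_words, cutoff=0.7):
    """Pair corpus words with lexicon words and count rewrite patterns.

    Returns (pairs, patterns): pairs is a list of (corpus word, lexicon
    word); patterns counts (historical substring, modern substring).
    """
    pairs = []
    patterns = Counter()
    for word in corpus_words:
        if word in lexicon_words:
            continue  # already a known word, nothing to learn
        match = get_close_matches(word, lexicon_words, n=1, cutoff=cutoff)
        if not match:
            continue  # no lexicon word is similar enough
        modern = match[0]
        pairs.append((word, modern))
        opcodes = SequenceMatcher(None, word, modern).get_opcodes()
        for tag, i1, i2, j1, j2 in opcodes:
            if tag != "equal":
                patterns[(word[i1:i2], modern[j1:j2])] += 1
    return pairs, patterns


# Hypothetical word lists for illustration only.
CORPUS_WORDS = ["werelt", "hant", "lant", "mens"]
LEXICON_WORDS = ["wereld", "hand", "land", "mens"]

PAIRS, PATTERNS = pair_and_extract(CORPUS_WORDS, LEXICON_WORDS)
print(PAIRS)
print(PATTERNS.most_common())
```

A pattern that recurs across many pairs is more likely to reflect a real spelling convention than an accidental match. When the corpus and the lexicon are a poor match, few words pair up and the pattern counts stay too low to be useful, which is the limitation the note points out.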
Note: applicable to other historical dictionaries with attestations. Tested on OED material!