IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.




A gentle introduction to lexicon building and lexicon application

Katrien Depuydt (Institute for Dutch Lexicology, Leiden)




Outline
       What is a lexicon
       Lexica in IMPACT
       Lexicon building and lexicon application tools
       Results so far with focus on Dutch




IMPACT workshop, Bratislava, May 7, 2010




                                          What is a lexicon?








Lexicon vs. electronic dictionary (1)
 An electronic dictionary has
Of course, digitized full text (not images)
Primarily for human use
Ideally searchable, with explicitly (XML) tagged information:
          lemma, part of speech, meaning, quotations etc.
Example: online Oxford English Dictionary








        Dictionary XML (example)








Lexicon vs. electronic dictionary (2)
 A computational lexicon is
 Of course, in structured digital format (XML, relational database)
 Primarily for use in computer applications
 Has explicitly coded information
  (e.g. lemma, part of speech, morphology, semantics, syntax…).
 Used (for instance) for:
 Linguistic annotation
 Enhanced retrieval (basic: inflected forms; advanced: synonyms etc.)
 Syntactic parsing, machine translation





                                          Lexica in IMPACT








The OCR lexicon
   An OCR lexicon is
   A verified list of words in a language
   Based on a corpus, dated to enable relevant selection
   Preferably with frequency information
   Preferably from same period/text type as the documents
   you want OCR’d (selection!)
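
The selection step can be illustrated with a minimal sketch. It assumes a hypothetical list of dated attestations (word form, year of source); the years and the helper name `select_ocr_lexicon` are invented for illustration and are not taken from the IMPACT tools.

```python
from collections import Counter

# Hypothetical dated attestations: (word form, year of the source text).
# The years are invented for illustration only.
ATTESTATIONS = [
    ("wechgevoert", 1637), ("wechgevoert", 1650), ("wech-gevoert", 1655),
    ("weggevoerd", 1890), ("weggevoerd", 1901), ("wechnemen", 1640),
]

def select_ocr_lexicon(attestations, first_year, last_year, min_freq=1):
    """Return {word form: frequency} for forms attested within the period."""
    counts = Counter(
        form for form, year in attestations if first_year <= year <= last_year
    )
    return {form: n for form, n in counts.items() if n >= min_freq}

# Word list for OCR of 17th-century material, with frequency information.
lexicon_17th = select_ocr_lexicon(ATTESTATIONS, 1600, 1699)
```

Because every form keeps its date, the same corpus can yield a different word list for each period to be OCR'd.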








      OCR lexicon example
          From WNT attestation lexicon      From DBNL historical corpus
          Word form       Freq.             Word form        Freq.
          absoluut        8                 wechgerukt       5
          absoluyt        2                 wechgeschickt    6
          absoluyter      1                 wechgeven        6
          absolveren      3                 wech-gevoerde    11
          absolverende    1                 wechgevoerde     14
          absorbeeren     1                 wech-gevoert     59
          absorbeert      1                 wechgevoert      98
          absorberen      1                 wechgeworpen     21
          absorptie       3                 wechghenomen     12
          absoute         2                 wechghevoert     7
          abstineeren     1                 wechginck        5
          abstinencie     1                 wechloopen       6
          abstinentie     2                 wechneemt        11
          abstineren      1                 wechneme         6
          abstrackheyt    1                 wech-nemen       20
          abstract        7                 wechnemen        74
          abstracta       1                 wechneminge      12
          abstracte       7                 wech-neminge     6
          abstracten      4                 wechrapen        6
          abstractheid    1                 wechrucken       6
          abstractie      1                 wechruiming      7
          abstractiën     1                 wecht            7





The IR lexicon
IR lexicon: Main information categories:
   wordforms (list of words) +
         - frequency information
         - quotations (dated sources) from corpora or
           electronic dictionaries
         - MODERN LEMMA (≈ dictionary entry) assigned to spelling
           variants and morphological variants of the same word
 The modern lemma forms are the main search keys for retrieval
 This is standard practice in corpus linguistics and modern historical
   lexicography
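
How the modern lemma can serve as a search key may be easier to see in a small sketch. The mapping below is a hypothetical mini lexicon built from word forms shown elsewhere in this talk; the function names are illustrative assumptions, not the IMPACT retrieval API.

```python
from collections import defaultdict

# Hypothetical mini IR lexicon: historical word form -> modern lemma.
FORM_TO_LEMMA = {
    "werelt": "wereld", "weerelt": "wereld", "waereld": "wereld",
    "wereld": "wereld",
    "uyterlick": "uiterlijk", "wterlijck": "uiterlijk",
    "uiterlijk": "uiterlijk",
}

def build_lemma_index(form_to_lemma):
    """Invert the lexicon: modern lemma -> set of attested word forms."""
    index = defaultdict(set)
    for form, lemma in form_to_lemma.items():
        index[lemma].add(form)
    return index

def expand_query(lemma, index):
    """Return all word forms to search for when the user types a modern lemma."""
    return sorted(index.get(lemma, {lemma}))

index = build_lemma_index(FORM_TO_LEMMA)
forms = expand_query("wereld", index)
```

A user who searches for the modern lemma then also retrieves texts containing the historical variants, without having to know them.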



<?xml version='1.0'?>
<!DOCTYPE lexicon SYSTEM 'NL_Structure.dtd'>
<lexicon>
<lexical_entry><lemma_id>219490</lemma_id>
<modern_lemma>aantuilen</modern_lemma>
<gloss></gloss>
<POS>VRB</POS>
<ne_label></ne_label>
<language_id></language_id>
<portmanteau_lemma_id></portmanteau_lemma_id>

<wordform><form_representation>
<wordform_id>850026</wordform_id>
<written_form>tuyld</written_form>
<attestation><id>92141</id>
<token_id></token_id>
<quote>Verhael ick (<I>t.w. een als vrouw verkleede man</I>) haer mijn min in Vrouwelijcker schynen:
Sy acht het boertery, en tuyld daer weer op an, Vermits een Vrou niet op een Vrou verlieven kan,</quote>
<derivation_id>0</derivation_id>
<document_id>204</document_id>
<start_pos>119</start_pos>
<end_pos>124</end_pos>
</attestation>
</form_representation>
</wordform>
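
A lexicon in this format could be read as in the sketch below. The sample is a shortened copy of the entry above; the closing `</lexical_entry>` and `</lexicon>` tags are assumed, since the slide shows only an excerpt, and `read_entries` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Shortened sample entry; the two closing tags at the end are an assumption.
SAMPLE = """<lexicon>
<lexical_entry><lemma_id>219490</lemma_id>
<modern_lemma>aantuilen</modern_lemma>
<POS>VRB</POS>
<wordform><form_representation>
<wordform_id>850026</wordform_id>
<written_form>tuyld</written_form>
<attestation><id>92141</id>
<quote>Sy acht het boertery, en tuyld daer weer op an</quote>
<document_id>204</document_id>
<start_pos>119</start_pos>
<end_pos>124</end_pos>
</attestation>
</form_representation>
</wordform>
</lexical_entry>
</lexicon>"""

def read_entries(xml_text):
    """Return (modern lemma, POS, written form, number of attestations) rows."""
    root = ET.fromstring(xml_text)
    rows = []
    for entry in root.iter("lexical_entry"):
        lemma = entry.findtext("modern_lemma")
        pos = entry.findtext("POS")
        for form in entry.iter("form_representation"):
            written = form.findtext("written_form")
            n_attestations = len(form.findall("attestation"))
            rows.append((lemma, pos, written, n_attestations))
    return rows

rows = read_entries(SAMPLE)
```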




                       How to build and apply these lexica?








Lexicon building
Build a lexicon with the aims of
• Benefiting OCR and OCR post-correction
• Improving retrieval by building a lexicon of variants with the modern
   lemma as the main entry key

     Tools for lexicon building
     Tools for using the lexicon (lexicon deployment) for enrichment
     Lexicon cookbook
     Best practice and tools for using lexica in OCR

!!! No lexicon will ever contain all variants found in historical text





     Types of variation (orthographical and other)
I            uytterlijcste uyterlijkste d'uyterlijke uiterlyke uyterlijcke uiterlijke uyterlijck uiterlyken
             uiterlijkste uiterlicke wterlicke wterlijcke ulterlijk uiterlyk uiterlijk uyterlick wterlicken
             d'uyterlijcke uiterlijken uiterlijks wterlijck uytterlicke uitterlijke ujterlijke uytterlijk uyterlycke
             uyterlicken uijterlicke d'uiterlijcke wtterlijcke wterlyke wtterlijk uuterlick uuterlic uyterlijke
             uyterlijcken uyterlicke d'uiterlyke wterlijke vuyterlijcke uuterlycke uuterlicke wterlijken
             uyterlijcksten uuyterlicke uuyterlick uuyterlycke uytterlijcke uytterlycke uytterlick vuytterlicke
             uiterlijker uyterlyck uterliek wterlijcken uiterlijkst uitterlijk uytterlijcken uyterlyk wterlick
             uutterlijck uuyterlicken uyttelijck uijterlijk uytterlijck uuterlijck uiterlick uitterlyk uuyterlic
             uuyterlyck uuyterlijck uiterlijck uytterlyck uterlyc wterlijk

                  (most of these can be dealt with by means of patterns)
II           werelt weerelt wereld weerelds wereldt werelden weereld werrelts waerelds weerlyt
             wereldts vveerelts waereld weerelden waerelden weerlt werlt werelds sweerels zwerlys
             swarels swerelts werelts swerrels weirelts tsweerelds werret vverelt werlts werrelt
             worreld werlden wareld weirelt weireld waerelt werreld werld vvereld weerelts werlde
             tswerels werreldts weereldt wereldje waereldje weurlt wald weëled

                   (some of these can be dealt with by patterns and/or fuzzy matching,
                   others can only be handled by explicit listing)
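
The three ways of handling variants (patterns, fuzzy matching, explicit listing) can be combined in a cascade. The sketch below is only an illustration: the two rewrite patterns, the similarity cutoff and the two-word modern lexicon are assumptions chosen to fit the examples above, and `difflib` stands in for whatever fuzzy matcher is actually used.

```python
import difflib

MODERN_LEXICON = ["uiterlijk", "wereld"]   # hypothetical modern word list
EXPLICIT = {"wald": "wereld"}              # variants that must be listed by hand
PATTERNS = [("uy", "ui"), ("ck", "k")]     # historical -> modern rewrites

def normalize(form):
    """Map a historical form to a modern word and report which method matched."""
    form = form.lower()
    if form in MODERN_LEXICON:
        return form, "exact"
    if form in EXPLICIT:
        return EXPLICIT[form], "listed"
    rewritten = form
    for old, new in PATTERNS:
        rewritten = rewritten.replace(old, new)
    if rewritten in MODERN_LEXICON:
        return rewritten, "pattern"
    close = difflib.get_close_matches(form, MODERN_LEXICON, n=1, cutoff=0.8)
    if close:
        return close[0], "fuzzy"
    return None, "unknown"
```

With these settings `uyterlijck` is resolved by patterns and `waereld` by fuzzy matching, while `wald` is too far from `wereld` for either and is only found because it is listed explicitly.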




The “hypothetical” vs. the witnessed lexicon (1)
Mechanisms
      - to extend the lexicon
      - to assess the plausibility of “hypothetical” words
        without previous attestations, i.e. words we have not
        seen before.








The “hypothetical” vs. the witnessed lexicon (2)

• Unknown inflected forms of registered lemmata: automatic
  expansion from the lemma to the full paradigm of word forms
  (paradigmatic expansion or reverse lemmatization)
• New spellings of known words can be dealt with by developing
  a good model of the historical spelling.
  (The database structure provides for the storage of
  orthographic variant patterns.)
• Previously unseen compounds can be dealt with by means of
  a good model of word formation (work scheduled for 2010).






[Diagram. Labels: Witnessed Modern Word; Hypothetical Modern Word (virtual lexicon of generated word forms); Transformation Patterns; Historical Variant 1; Historical Variant 2]








What is needed for lexicon building

• Build models of linguistic variation (inflection, orthography)
• Collect variants

Approach
• Cycle: the model helps to construct the lexicon, and vice versa (induction of
  rules/patterns)
• Combination of manual work and computational linguistics
• Lexicon building toolkit to support development, containing both
  computational linguistic tools and tools supporting manual work





          Cf. Computational Tools and Lexica to Improve Access to Text, Jesse de Does, Katrien Depuydt





Spelling variation tools (pattern-based)
• Language-independent approach:
    • Supervised rule (pattern) induction from pairs (“modern” word,
      historical word), yielding patterns like aa/ae, s/z, …
    • Pattern weights are computed from example material

Additional approaches possible:
• Use of aligned data (parallel historical text and modern version)
• Unsupervised pattern weighting (≈ text profiling from TR5)
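
Supervised pattern induction from word pairs can be sketched as follows. This is a simplified stand-in, not the IMPACT algorithm: it aligns each pair with `difflib.SequenceMatcher`, records the minimal differing substrings as patterns, and uses relative frequency as the weight. The example pairs are illustrative, and a real system would probably prefer context-sensitive patterns such as aa/ae over the bare a/e found here.

```python
from collections import Counter
from difflib import SequenceMatcher

def induce_patterns(pairs):
    """Induce weighted (modern substring, historical substring) patterns.

    pairs: iterable of (modern word, historical word).
    Returns {(modern, historical): relative frequency}.
    """
    counts = Counter()
    for modern, historical in pairs:
        matcher = SequenceMatcher(None, modern, historical, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag != "equal":
                counts[(modern[i1:i2], historical[j1:j2])] += 1
    total = sum(counts.values())
    return {pattern: n / total for pattern, n in counts.items()} if total else {}

# Illustrative training pairs (modern, historical).
weights = induce_patterns([("jaar", "jaer"), ("maar", "maer"), ("zee", "see")])
```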







Lemmatization
• Reduction of historical word forms to modern lemma
• Historical word → standard (“modern”) spelling → lemma form
  (step 1: pattern matching; step 2: lemmatizer)

        Dystels → (1) → distels → (2) → distel

• When we have a perfect or near-perfect modern full form
   lexicon, the second step is simply lexicon lookup.
         But:
1) We will not have full form information for many lemmata
   (especially the historical ones)
2) Even lemmata present in modern language may have historical
   inflected forms different from the present-day paradigm
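
The two-step reduction for the Dystels example can be written down in a few lines. The single spelling pattern and the tiny full form lexicon are assumptions made for this illustration.

```python
# Step 1 patterns: historical spelling -> modern spelling (illustrative only).
SPELLING_PATTERNS = [("y", "i")]

# Step 2: a tiny, hypothetical modern full form lexicon (word form -> lemma).
FULL_FORM_LEXICON = {"distel": "distel", "distels": "distel"}

def modernize(form):
    """Step 1: rewrite a historical form into standard modern spelling."""
    form = form.lower()
    for old, new in SPELLING_PATTERNS:
        form = form.replace(old, new)
    return form

def lemmatize(form):
    """Step 2: look the modernized form up; None if the form is not listed."""
    return FULL_FORM_LEXICON.get(modernize(form))
```

Here `lemmatize("Dystels")` yields `distel`, but a form whose modernized spelling is missing from the full form lexicon returns `None`, which is exactly the gap described in points 1) and 2).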




Lemmatization and reverse lemmatization
We also need a lemmatization process for these situations
• A typical lemmatizer assigns some standard form (infinitive,
   nominative, stem) to inflected forms, usually based on patterns
   relating the inflected form to the standard form.
But:
• Matching these patterns can be hard to combine with matching
   both spelling variation patterns and OCR errors
   (bok/bokken/bokkeu)
• We adopt the solution of actually constructing a “hypothetical
   modern full form lexicon” containing the most plausible
   paradigmatic expansions of lemmata
• This construction is carried out by means of a statistical reverse
   lemmatizer
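
The idea of reverse lemmatization can be sketched with a very simple statistical model. This is an assumed toy approach, not the IMPACT reverse lemmatizer: it learns suffix rewrite rules (with one character of left context) from known lemma/form pairs and ranks the expansions of a new lemma by relative frequency. The training pairs are illustrative.

```python
import os
from collections import Counter, defaultdict

def suffix_rule(lemma, form, context=1):
    """Return (lemma suffix, form suffix), keeping `context` shared characters."""
    prefix_len = len(os.path.commonprefix([lemma, form]))
    cut = max(prefix_len - context, 0)
    return lemma[cut:], form[cut:]

def train(pairs):
    """Count how often each lemma suffix is rewritten to each form suffix."""
    rules = defaultdict(Counter)
    for lemma, form in pairs:
        old, new = suffix_rule(lemma, form)
        rules[old][new] += 1
    return rules

def expand(lemma, rules):
    """Return hypothetical inflected forms of `lemma`, most plausible first."""
    candidates = {}
    for old, news in rules.items():
        if lemma.endswith(old):
            total = sum(news.values())
            stem = lemma[:len(lemma) - len(old)]
            for new, n in news.items():
                candidates[stem + new] = n / total
    return sorted(candidates.items(), key=lambda item: -item[1])

rules = train([("bok", "bokken"), ("rok", "rokken"),
               ("sok", "sokken"), ("boek", "boeken")])
plural_candidates = expand("stok", rules)
```

For the unseen lemma `stok` this ranks `stokken` above `stoken`. Both are stored as hypothetical forms, so that spelling variation and OCR errors can later be matched against full word forms rather than against inflection patterns.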




Attestation
• From hypothetical (non-witnessed) lexicon content to attested word
  forms in “real” text
• Automatic selection of candidate attestations
• Manual work: verification and correction

• Two approaches
    • Dictionary based (INL): Woordenboek der Nederlandsche Taal
    • Corpus based (LMU, INL): Dutch DBNL corpus







  IMPACT Dictionary Attestation Tool
   Lexicon building at work: Verifying attestations in historical dictionaries
Task
       Find the variants of a headword as they occur in the quotations

 Headword: work
 Quotations:
     • We are working on what works.
     • Depart from me, ye that worke iniquity.
     • She worcketh knittinge of stockings.
 Variants (as they occur in the quotations): working, works, worke, worcketh




             IMPACT Dictionary Attestation Tool
 Task
                 Find the variants of a headword as they occur in the quotations
 Input: electronic historical dictionary → database with lemmata and quotations
 Automatically (preprocessing)
     • match literally, e.g. work → work, Work
     • match using existing lexica and lists, e.g. work → works, worked, wrought
     • approximate matching, e.g. work → worke
 By hand (using the tool)
     • correct automatic mismatches, e.g. works → words, worms
     • find missed matches, e.g. work → worketh, wrowght
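
The automatic preprocessing cascade can be illustrated on the three example quotations. The sketch assumes a small hand-made list of known forms and uses `difflib` similarity with an arbitrary cutoff as the approximate matcher; the actual tool may work differently.

```python
import re
from difflib import SequenceMatcher

# Hypothetical list of known inflected forms per headword.
KNOWN_FORMS = {"work": {"works", "worked", "wrought"}}

def find_candidates(headword, quotation, cutoff=0.85):
    """Return (token, match type) for tokens that may attest the headword."""
    results = []
    for token in re.findall(r"[A-Za-z]+", quotation):
        low = token.lower()
        if low == headword:
            results.append((token, "literal"))
        elif low in KNOWN_FORMS.get(headword, set()):
            results.append((token, "lexicon"))
        elif SequenceMatcher(None, headword, low).ratio() >= cutoff:
            results.append((token, "approximate"))
    return results
```

With these settings `works` is found through the list and `worke` through approximate matching, while `working` and `worcketh` are missed and would have to be added by hand. Lowering the cutoff finds more variants but also produces mismatches that must be corrected manually.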





      IMPACT Attestation Tool
[Screenshot of the tool, showing:]
  - an up-to-date overview of what is done and needs to be done
  - what has been done by this user so far
  - the lemma headword
  - the quotations, sorted by uncertainty








  IMPACT Lexicon Tool
Task
       Find and verify attestations in a historical corpus
       Automatically (preprocessing = apply lemmatizer)
           • match literally, e.g. work → work, Work
           • match using existing lexica and lists, e.g. work → works, worked, wrought
           • match using the spelling variation module, e.g. uiterlijk → uyterlick
       By hand (using the tool)
           • assign the correct lemma, e.g. was (N) → zijn (V)
           • group tokens belonging together, e.g. konings zoon → koningszoon
           • select attestations




Corpus-based lexicon building: Impact Lexicon Tool








General vocabulary vs. Named entities
• Tools for lexicon building described so far: applicable to general
  lexicon
• Tools for NE recognition, classification and variant matching
  - library requirement
  - distinguish general vocabulary from NEs
  - avoid unpleasant mix-ups like Abimelech → apemelk!
     (b/p; i/e; e/0; k/ch)
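
Why such a mix-up can happen is easy to reproduce. The sketch below applies the four patterns from the example, each at most once and at any single position; reading them as rewrites from the name towards the general-vocabulary word is an assumption, as is the helper `variants`.

```python
# Patterns from the example, read here as (from, to); "" stands for deletion.
PATTERNS = [("b", "p"), ("i", "e"), ("e", ""), ("ch", "k")]

def variants(word, patterns):
    """All strings reachable by applying each pattern at most once, in order."""
    results = {word}
    for old, new in patterns:
        step = set(results)
        for candidate in results:
            start = candidate.find(old)
            while start != -1:
                step.add(candidate[:start] + new + candidate[start + len(old):])
                start = candidate.find(old, start + 1)
        results = step
    return results

reachable = variants("abimelech", PATTERNS)
```

The set `reachable` contains `apemelk`, so an unrestricted variant matcher would happily link the name to the general-vocabulary word. Keeping named entities in a separate lexicon avoids this.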








Improvement of state of the art / innovation

• We use existing computational linguistic approaches, but figure out
  how to apply them to historical language

• We develop a workflow to deal with the problems posed by historical
  language, figuring out how all pieces fit together
    • Data selection and acquisition
    • Manual work
    • Computational linguistics tools






          Some results so far with focus on Dutch








Measuring results for Dutch

  We use the ground truth data developed in the project
  Evaluation of EE tools
  Evaluation of lexicon coverage
  Evaluation of lexicon usage in IR (2010)
  Evaluation of OCR and lexicon usage in OCR (2010)
  Evaluation of benefit of lexicon building for OCR (for which type
  of material / quality of OCR does this make sense) (2010-11)







Dutch ground truth data
Type and genre                     # words
Gold Standard Book                 300k
Random Set Book                    340k
Random Set Staten Generaal         2.5M
Gold Standard Staten Generaal      500k
Gold Standard Newspapers 1         3.4M
Gold Standard Newspapers 2         170k
Random Set Newspapers              3.2M

total                              13.1M





Efficiency of lexicon building
Dictionary-based lexicon building using historical dictionary:
   Woordenboek der Nederlandsche Taal
• Lemmata: 220211, quotations: 1524366
• Tempo: 1725 quotations/hour; 231 lemmata/hour








Reverse lemmatization
• Reminder: build hypothetical (non-attested) word forms in a “quick
  and dirty” way to use in lemmatization and corpus-based lexicon
  building
• Using simple statistical algorithms and a simple approach to
  inflection
• Results:
                    Lexicon                                 Accuracy
                    Small Dutch lexicon (JVKlex)            96.6%
                    French lexicon (Morphalou)              99.4%
                    Polish lexicon, verbs (Morfologik)      98.7%





Lexicon coverage (1: ground truth books)
Lexicon                                                        Type coverage   Token coverage
Modern lexicon (e-Lex)                                         46%             76%
EE3.3                                                          56%             84%
1+2                                                            63%             89%
Type frequency list historical corpus, top 200K (freq >= 19)   70%             93%
Type frequency list historical corpus, top 500K (freq >= 5)    78%             95%
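
The difference between type coverage and token coverage can be made concrete with a small sketch. The sentence and the lexicon are invented for illustration; `coverage` is a hypothetical helper, not the evaluation code used in the project.

```python
from collections import Counter

def coverage(tokens, lexicon):
    """Return (type coverage, token coverage) of `lexicon` over `tokens`."""
    counts = Counter(token.lower() for token in tokens)
    if not counts:
        return 0.0, 0.0
    covered = [word for word in counts if word in lexicon]
    type_coverage = len(covered) / len(counts)
    token_coverage = sum(counts[word] for word in covered) / sum(counts.values())
    return type_coverage, token_coverage

# Invented example text and modern lexicon.
tokens = "de werelt is de wereld en de weerelt".split()
type_cov, token_cov = coverage(tokens, {"de", "is", "en", "wereld"})
```

Here 4 of the 6 distinct words (types) are in the lexicon, but they account for 6 of the 8 running words (tokens). Frequent words tend to be covered first, which is why token coverage is consistently higher than type coverage in these tables.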






Lexicon coverage (2: gt newspapers 18th-19th c.)
Lexicon                                                        Type coverage   Token coverage
Modern lexicon (e-Lex)                                         40%             83%
EE3.3                                                          41%             84%
1+2                                                            51%             89%
Type frequency list historical corpus, top 200K                52%             93%
Type frequency list historical corpus, top 500K                62%             95%






Lexicon coverage (3: ground truth Parliamentary Papers, 19th c.)
Lexicon                                                          Type coverage   Token coverage
Modern lexicon (e-Lex)                                           51%             89%
EE3.3                                                            47%             88%
1+2                                                              58%             93%
Type frequency list, historical corpus, top 200K                 59%             96%
Type frequency list, historical corpus, top 500K                 68%             97%






Lexicon coverage (4: ground truth Parliamentary Papers, 20th c.)
Lexicon                                                          Type coverage   Token coverage
Modern lexicon (e-Lex)                                           70%             93%
EE3.3                                                            66%             93%
1+2                                                              76%             96%
Type frequency list, historical corpus, top 200K                 74%             97%
Type frequency list, historical corpus, top 500K                 81%             98%






Lexicon coverage (5: Genesis, 1637 bible)
Lexicon                                                          Type coverage   Token coverage
Modern lexicon (e-Lex)                                           31%             61%
EE3.3                                                            62%             83%
1+2                                                              65%             89%
Type frequency list, historical corpus, top 200K                 76%             97%
Type frequency list, historical corpus, top 500K                 87%             98.6%






Lexicon coverage (6: Hooft, historiën)
Lexicon                                                          Type coverage   Token coverage
Modern lexicon (e-Lex)                                           26%             67%
EE3.3                                                            47%             88%
1+2                                                              50%             90%
Type frequency list, historical corpus, top 200K                 44%             93%
Type frequency list, historical corpus, top 500K                 58%             96%






Conclusion from this evaluation
• Evident next step for Dutch lexicon building is corpus-based work

• First target: cover the top 200000 types from the historical corpus.
   – Contains 97885 types not in the witnessed historical EE3.3 lexicon
   – Roughly 24% of these are covered by the modern lexicon
   – Roughly 25% are names
   – This leaves about 45000 common words to look into.
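The triage described on this slide can be sketched as follows. This is an illustration under assumptions only: the function name, the way names are recognised (a plain set) and the toy data are hypothetical, and the real input would be the top 200000 corpus types.

```python
def triage_missing_types(top_types, historical, modern, names):
    """Split top-frequency types absent from the historical lexicon into three groups."""
    missing = [t for t in top_types if t not in historical]
    in_modern = [t for t in missing if t in modern]
    name_types = [t for t in missing if t not in modern and t in names]
    # What remains are the common words that still need lexicographic work.
    to_review = [t for t in missing if t not in modern and t not in names]
    return in_modern, name_types, to_review

# Toy data (hypothetical)
top_types = ["de", "werelt", "wereld", "amsterdam", "wechgevoert", "absoluyt"]
historical = {"de", "werelt"}
modern = {"de", "wereld"}
names = {"amsterdam"}

in_modern, name_types, to_review = triage_missing_types(top_types, historical, modern, names)
print(in_modern, name_types, to_review)
```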






Measuring effect of lexicon use in IR
• Example: improved recall for retrieval in a historical corpus of about
  150 million tokens. Using only the modern lexicon, a query for wereld
  yields 23396 hits; using the current EE3.3 lexicon, we get 34339 hits.
• Simple IR will be part of the demonstrators
• Hard to evaluate IR results proper without special datasets
• So far we have measured the accuracy of either lemmatization or
  modern to historical word form matching
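The recall gain in the wereld example comes from expanding the modern lemma to its historical word forms before searching. A minimal sketch of that idea, assuming the IR lexicon can be viewed as a mapping from modern lemma to word forms; the mapping, the function name and the toy corpus are hypothetical.

```python
from collections import Counter

# Hypothetical excerpt of an IR lexicon: modern lemma -> attested word forms
ir_lexicon = {"wereld": {"wereld", "werelt", "weerelt", "waereld"}}

def count_hits(corpus_tokens, query, lexicon=None):
    """Count corpus tokens matching the query, optionally expanded via the lexicon."""
    forms = lexicon.get(query, {query}) if lexicon else {query}
    counts = Counter(corpus_tokens)
    return sum(counts[form] for form in forms)

corpus = "de werelt is groot en de waereld is oud maar de wereld draait".split()
hits_plain = count_hits(corpus, "wereld")
hits_expanded = count_hits(corpus, "wereld", ir_lexicon)
print(hits_plain, hits_expanded)
```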








Lemmatization
• Combination of lookup, matching of spelling variation, and reverse
  lemmatization
• As yet no good evaluation set for IMPACT (current work)
• Evaluation on “type” level
We will use other material here (1637 Genesis, 97144 tokens)
Approach
• Restrict to “ordinary words” (no names, numbers, clitic combinations)
• Ambiguous lemmatization (context is not used; avg. 5 suggestions per word)
• Ranking based on frequency and pattern weights
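The lookup and spelling-variation steps of the lemmatization approach can be sketched as below. This is a simplified illustration, not the IMPACT tool: reverse lemmatization is left out, and the lexicon entries, the patterns and their weights are all hypothetical.

```python
# Witnessed lexicon (hypothetical): historical word form -> {modern lemma: frequency}
WITNESSED = {"werelt": {"wereld": 120}, "waereld": {"wereld": 40}}
# Modern lexicon (hypothetical): lemma -> frequency
MODERN = {"wereld": 900, "uiterlijk": 300}
# Weighted rewrite patterns (hypothetical): (historical substring, modern substring, weight)
PATTERNS = [("uy", "ui", 0.9), ("ck", "k", 0.8), ("y", "ij", 0.7), ("ae", "e", 0.6)]

def approximate_candidates(word):
    """Apply the patterns in order, keeping both rewritten and unrewritten forms."""
    candidates = {word: 1.0}
    for old, new, weight in PATTERNS:
        for form, score in list(candidates.items()):
            if old in form:
                rewritten = form.replace(old, new)
                # Keep the best pattern weight seen for each rewritten form.
                candidates[rewritten] = max(candidates.get(rewritten, 0.0), score * weight)
    return candidates

def suggest_lemmas(word, max_approx=2):
    """Return ranked lemma suggestions: exact lookup first, then approximate matching."""
    if word in WITNESSED:
        lemmas = WITNESSED[word]
        return sorted(lemmas, key=lemmas.get, reverse=True)
    if word in MODERN:
        return [word]
    # Rank approximate matches by pattern weight times lemma frequency.
    scored = {form: score * MODERN[form]
              for form, score in approximate_candidates(word).items()
              if form in MODERN}
    return sorted(scored, key=scored.get, reverse=True)[:max_approx]

print(suggest_lemmas("werelt"), suggest_lemmas("uyterlijck"))
```

No context is used, so a word form can receive several suggestions; the ranking decides which one comes first.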




Result
• 6265 distinct types; 5991 (95.7%) had at least one correct suggestion
• Average rank of correct suggestions: 1.23
   – 5222 types found in current EE3.3 (83%)
   – 65 additional types in modern lexicon
   – 49 types without any match
   – 969 types (15%) identified with “approximate” matching using
     ~500 weighted patterns and returning at most 2 suggestions
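The two headline figures (share of types with at least one correct suggestion, and average rank of the correct suggestion) could be computed along these lines. The data layout and names are assumptions for illustration; the toy values are hypothetical and unrelated to the Genesis numbers.

```python
def evaluate(suggestions, gold):
    """Return (share of types with a correct suggestion, average rank of that suggestion)."""
    ranks = []
    for word, lemma in gold.items():
        ranked = suggestions.get(word, [])
        if lemma in ranked:
            ranks.append(ranked.index(lemma) + 1)  # ranks are 1-based
    hit_rate = len(ranks) / len(gold)
    avg_rank = sum(ranks) / len(ranks) if ranks else float("nan")
    return hit_rate, avg_rank

# Toy data (hypothetical): gold lemma per type, ranked suggestions per type
gold = {"werelt": "wereld", "uyterlijck": "uiterlijk", "swerelts": "wereld", "wald": "wereld"}
suggestions = {
    "werelt": ["wereld"],
    "uyterlijck": ["uiterlijk"],
    "swerelts": ["zwerelt", "wereld"],
}
hit_rate, avg_rank = evaluate(suggestions, gold)
print(f"{hit_rate:.0%} with a correct suggestion, average rank {avg_rank:.2f}")
```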






Real and hypothetical lexicon coverage (Hooft, historiën)
• Result (again restricting to ‘ordinary’ words)
• 36332 distinct types. Avg. rank of correct suggestions: 1.23
   – 20087 types found in current EE3.3 (55%)
   – 1061 additional types in modern lexicon
   – 2411 types without any match (7%)
   – 12773 types (35%) identified with “approximate” matching using
     ~500 weighted patterns and returning at most 2 suggestions
     (probably about 75% of the highest-ranking approximate matches
     are correct)






Evaluation of TR results
Using FineReader SDK (version 9)
• External dictionary interface for experimentation
• Not completely straightforward how to apply this
   – Translation of corpus frequencies to weights on a scale 0-100
   – Other details: hyphenated words, case sensitivity, …
   – Workaround to circumvent the long s problem
Lexicon data used
• Corpus-based type-frequency list
• EE3.2 deliverable lexicon
• FineReader internal lexicon
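The slides do not say how corpus frequencies were translated to weights on a 0-100 scale. One plausible option, shown here purely as an assumption, is logarithmic scaling against the highest frequency in the list. The function name is hypothetical and no FineReader API is involved; the sample frequencies are taken from the OCR lexicon example earlier in the talk.

```python
import math

def frequency_to_weight(freq, max_freq):
    """Map a corpus frequency to an integer weight in 0..100 on a log scale (assumed scheme)."""
    if freq <= 0:
        return 0
    scaled = 100 * math.log(1 + freq) / math.log(1 + max_freq)
    return min(100, round(scaled))

# Frequencies from the OCR lexicon example slide
freqs = {"wechgevoert": 98, "wechnemen": 74, "wecht": 7, "absoluyter": 1}
max_freq = max(freqs.values())
weights = {word: frequency_to_weight(freq, max_freq) for word, freq in freqs.items()}
print(weights)
```

A log scale keeps rare but attested forms from collapsing to weight 0, which a linear scale would do for most of a long-tailed frequency list.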




OCR evaluation
1. Character accuracy
2. Word accuracy
3. In case of block alignment problems, a simple alternative is
   bag-of-words accuracy

Measures 1 and 2 presuppose a good alignment of OCR with ground truth.

• We will use word accuracy, or the simpler alternative 3 when there
  are alignment problems
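The three measures can be sketched as follows. The exact definitions used in IMPACT are not given on the slide, so this assumes edit-distance-based accuracy for characters and words, and multiset overlap for bag of words; all function names are hypothetical. The sample strings are one line from the long s example later in the talk.

```python
from collections import Counter

def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(b) + 1))
    for i, item_a in enumerate(a, 1):
        cur = [i]
        for j, item_b in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                        # deletion
                           cur[j - 1] + 1,                     # insertion
                           prev[j - 1] + (item_a != item_b)))  # substitution
        prev = cur
    return prev[-1]

def character_accuracy(ocr, truth):
    return 1 - edit_distance(ocr, truth) / len(truth)

def word_accuracy(ocr, truth):
    truth_words = truth.split()
    return 1 - edit_distance(ocr.split(), truth_words) / len(truth_words)

def bag_of_words_accuracy(ocr, truth):
    """Order-insensitive: share of ground-truth words also present in the OCR output."""
    truth_words = truth.split()
    overlap = Counter(ocr.split()) & Counter(truth_words)
    return sum(overlap.values()) / len(truth_words)

truth = "de tweede de stilste en veiligste"
ocr = "de tweede de ftillie en veiligde"
char_acc = character_accuracy(ocr, truth)
word_acc = word_accuracy(ocr, truth)
bow_acc = bag_of_words_accuracy(ocr, truth)
print(f"char {char_acc:.2f}, word {word_acc:.2f}, bag of words {bow_acc:.2f}")
```

The bag-of-words variant ignores word order, which is why it still works when text blocks in the OCR output and the ground truth are ordered differently.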





OCR results
Lexicon configurations compared:
  (a) ABBYY internal Dutch lexicon
  (b) Combination of corpus-based historical lexicon and EE3.2 deliverable
      (case insensitive, taking hyphenation into account)
  (c) Combination of corpus-based historical lexicon and EE3.2 deliverable,
      improved deployment

Dataset                                                          (a)      (b)      (c)
DPO35 (word accuracy)                                            88.8%    90.9%    94.4%
Parliamentary papers, 1826-27 selection (bag of words recall)    90.9%    94.9%    94.9%




‘The Book’

“Kort begrip der waereld-historie voor de jeugd”
J.F. Martinet, minister at Zutphen, 1789.

Why this book?
Representative font and amount of spelling variation etc. for late 18th-century Dutch
It has the “long s problem”: [image: a word printed with long s] = stilste, not ftilfte
 ….






The long s problem: an example

OCR at start of project:
A. De eerde was de gevaarlykflti om de verlei¬
ding aan 't Hof; de tweede de ftillie en veiligde;
de derde de zwaarde, daar hy byna drie millioenen
harde en onbefchaafde Menfchen beftieren moest.

Results April 2010:
A. De eerste was de gevaarlykste om de verlei-
ding aan 't Hof; de tweede de stilste en veiligste;
de derde de zwaarste, daar hy byna drie millioenen
harde en onbeschaafde Menschen bestieren moest.

Workaround: “integrated postcorrection”. Tell the engine that “eerfte” is OK and
postcorrect it afterwards with the lexicon.
This keeps the engine from turning it into “eerde”.
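The workaround can be sketched in two steps: offer the engine lexicon forms in which long s is written as f, then map those forms back after recognition. This is only an illustration of the idea; the function names are hypothetical, and it assumes long s never occurs word-finally, so the last letter is left untouched.

```python
def long_s_variant(word):
    """Return the form an engine might read when long s is taken for f (final s kept)."""
    if len(word) < 2:
        return word
    return word[:-1].replace("s", "f") + word[-1]

def build_postcorrection_table(lexicon):
    """Map each f-variant back to its lexicon word, skipping variants that are real words."""
    table = {}
    for word in lexicon:
        variant = long_s_variant(word)
        if variant != word and variant not in lexicon:
            table[variant] = word
    return table

def postcorrect(tokens, table):
    return [table.get(token, token) for token in tokens]

lexicon = {"de", "en", "eerste", "stilste", "veiligste"}
table = build_postcorrection_table(lexicon)
# Lexicon offered to the OCR engine: real words plus their f-variants
ocr_lexicon = lexicon | set(table)
corrected = postcorrect(["de", "eerfte", "en", "ftilfte"], table)
print(corrected)
```

Skipping variants that are themselves lexicon words avoids "correcting" a genuine word that happens to contain f.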





Future work
• Compound analysis
• Irregular historical white space use (“impacttok++”) (cf. attestations)
• Corpus-based lexicon extension
• Testing and optimization with ground truth data
• Improve the TR lexicon by extending the IR lexicon and removing
  false friends from the DBNL-corpus-based TR lexicon
• Continue work on the best way to deploy lexica in OCR, with help from
  ABBYY



Bratislava WS - Depuydt - INL - lexicon building_pdf

  • 1. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. A gentle introduction to lexicon building and lexicon application Katrien Depuydt (Institute for Dutch Lexicology, Leiden)
  • 2. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Outline  What is a lexicon  Lexica in IMPACT  Lexicon building and lexicon application tools  Results so far with focus on Dutch IMPACT workshop, Bratislava, May 7, 2010 2
  • 3. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. What is a lexicon? IMPACT workshop, Bratislava, May 7, 2010 3
  • 4. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Lexicon vs. electronic dictionary (1) An electronic dictionary has Of course, digitized full text (no images) Primarily: for human use Ideally: searchable with explicitly (XML) tagged information lemma, Part of speech, meaning, quotations etc. Example:online Oxford English Dictionary IMPACT workshop, Bratislava, May 7, 2010 4
  • 5. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Dictionary XML (example) IMPACT workshop, Bratislava, May 7, 2010 5
  • 6. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Lexicon vs. electronic dictionary (2) A computational lexicon is Of course, in structured digital format (XML, relational database) Primarily for use in computer applications Has explicitly coded information (eg. lemma, part of speech, morphology, semantics, syntax…). Used (for instance): Linguistic annotation Enhanced retrieval (basic: inflected forms; advanced: synonyms etc.) Syntactic parsing, machine translation IMPACT workshop, Bratislava, May 7, 2010 6
  • 7. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. IMPACT workshop, Bratislava, May 7, 2010 7
  • 8. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Lexica in IMPACT IMPACT workshop, Bratislava, May 7, 2010 8
  • 9. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. The OCR lexicon An OCR lexicon is A verified list of words in a language Based on a corpus, dated to enable relevant selection Preferably with frequency information Preferably from same period/text type as the documents you want OCR’d (selection!) IMPACT workshop, Bratislava, May 7, 2010 9
  • 10. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. OCR lexicon example From WNT attestation lexicon From DBNL historical corpus absoluut 8 wechgerukt 5 absoluyt 2 wechgeschickt 6 absoluyter 1 wechgeven 6 absolveren 3 wech-gevoerde 11 absolverende 1 wechgevoerde 14 absorbeeren 1 wech-gevoert 59 absorbeert 1 wechgevoert 98 absorberen 1 wechgeworpen 21 absorptie 3 wechghenomen 12 absoute 2 wechghevoert 7 abstineeren 1 wechginck 5 abstinencie 1 wechloopen 6 abstinentie 2 wechneemt 11 abstineren 1 wechneme 6 abstrackheyt 1 wech-nemen 20 abstract 7 wechnemen 74 abstracta 1 wechneminge 12 abstracte 7 wech-neminge 6 abstracten 4 wechrapen 6 abstractheid 1 wechrucken 6 abstractie 1 wechruiming 7 abstractiën 1 wecht 7 IMPACT workshop, Bratislava, May 7, 2010 10
  • 11. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. The IR lexicon IR lexicon: Main information categories: wordforms (list of words) + - frequency information - quotations (dated sources) from corpora or electronic dictionaries - MODERN LEMMA (// dictionary entry) assigned to spelling variants and morphological variants of the same word  The modern lemma forms are the main search keys for retrieval  This is a standard practice in corpus linguistics and modern historical lexicography IMPACT workshop, Bratislava, May 7, 2010 11
  • 12. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. <?xml version='1.0'?> <!DOCTYPE lexicon SYSTEM 'NL_Structure.dtd'> <lexicon> <lexical_entry><lemma_id>219490</lemma_id> <modern_lemma>aantuilen</modern_lemma> <gloss></gloss> <POS>VRB</POS> <ne_label></ne_label> <language_id></language_id> <portmanteau_lemma_id></portmanteau_lemma_id> <wordform><form_representation> <wordform_id>850026</wordform_id> <written_form>tuyld</written_form> <attestation><id>92141</id> <token_id></token_id> <quote>Verhael ick (<I>t.w. een als vrouw verkleede man</I>) haer mijn min in Vrouwelijcker schynen: Sy acht het boertery, en tuyld daer weer op an, Vermits een Vrou niet op een Vrou verlieven kan,</quote> <derivation_id>0</derivation_id> <document_id>204</document_id> <start_pos>119</start_pos> <end_pos>124</end_pos> </attestation> </form_representation> </wordform> IMPACT workshop, Bratislava, May 7, 2010 12
  • 13. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. How to build and apply these lexica? IMPACT workshop, Bratislava, May 7, 2010 13
  • 14. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Lexicon building Build a lexicon with the aims of  Be profitable to OCR and OCR postcorrection  Improving retrieval by building a lexicon of variants with the modern lemma as a main entry key  Tools for lexicon building  Tools on how to use the lexicon (lexicon deployment) for enrichment  Lexicon cookbook  Best practice and tools to use lexica in OCR !!! No lexicon will ever contain all variants found in historical text IMPACT workshop, Bratislava, May 7, 2010 14
  • 15. Types of variation (orthographical and other)
      I. uytterlijcste uyterlijkste d'uyterlijke uiterlyke uyterlijcke uiterlijke uyterlijck uiterlyken uiterlijkste uiterlicke wterlicke wterlijcke ulterlijk uiterlyk uiterlijk uyterlick wterlicken d'uyterlijcke uiterlijken uiterlijks wterlijck uytterlicke uitterlijke ujterlijke uytterlijk uyterlycke uyterlicken uijterlicke d'uiterlijcke wtterlijcke wterlyke wtterlijk uuterlick uuterlic uyterlijke uyterlijcken uyterlicke d'uiterlyke wterlijke vuyterlijcke uuterlycke uuterlicke wterlijken uyterlijcksten uuyterlicke uuyterlick uuyterlycke uytterlijcke uytterlycke uytterlick vuytterlicke uiterlijker uyterlyck uterliek wterlijcken uiterlijkst uitterlijk uytterlijcken uyterlyk wterlick uutterlijck uuyterlicken uyttelijck uijterlijk uytterlijck uuterlijck uiterlick uitterlyk uuyterlic uuyterlyck uuyterlijck uiterlijck uytterlyck uterlyc wterlijk
         (most of these can be dealt with by means of patterns)
      II. werelt weerelt wereld weerelds wereldt werelden weereld werrelts waerelds weerlyt wereldts vveerelts waereld weerelden waerelden weerlt werlt werelds sweerels zwerlys swarels swerelts werelts swerrels weirelts tsweerelds werret vverelt werlts werrelt worreld werlden wareld weirelt weireld waerelt werreld werld vvereld weerelts werlde tswerels werreldts weereldt wereldje waereldje weurlt wald weëled
         (some of these can be dealt with by patterns and/or fuzzy matching, others can only be handled by explicit listing)
  • 16. The “hypothetical” vs. the witnessed lexicon (1)
      Mechanisms
      - to extend the lexicon
      - to assess the plausibility of “hypothetical” words without previous attestations, i.e. words we have not seen before
  • 17. The “hypothetical” vs. the witnessed lexicon (2)
      - Unknown inflected forms of registered lemmata: automatic expansion from the lemma to the full paradigm of word forms (paradigmatic expansion or reverse lemmatization)
      - New spellings of known words can be dealt with by developing a good model of the historical spelling. (The database structure provides for the storage of orthographic variant patterns.)
      - Previously unseen compounds can be dealt with by means of a good model of word formation. (Work scheduled for 2010.)
  • 18. Virtual lexicon of generated word forms
      [Diagram; labels: Witnessed Modern Word, Hypothetical Modern Word, Transformation Patterns, Historical Variant 1, Historical Variant 2]
  • 19. What is needed for lexicon building
      - Build models of linguistic variation (inflection, orthography)
      - Collect variants
      Approach
      - Cycle: the model helps to construct the lexicon, and vice versa (induction of rules/patterns)
      - Combination of manual work and computational linguistics
      - Lexicon building toolkit to support development, containing both computational linguistic tools and tools supporting manual work
  • 20. Cf. Computational Tools and Lexica to Improve Access to Text, Jesse de Does, Katrien Depuydt
  • 21. Spelling variation tools (pattern-based)
      Language-independent approach:
      - Supervised rule (pattern) induction from pairs (“modern” word, historical word), yielding patterns like aa/ae, s/z, ...
      - Pattern weights are computed from example material
      Additional approaches possible:
      - Use of aligned data (parallel historical text and modern version)
      - Unsupervised pattern weighting (roughly the text profiling from TR5)
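To make the idea of supervised pattern induction concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the IMPACT tool: it aligns each (modern, historical) pair with the standard library's `difflib.SequenceMatcher`, records the differing substrings as patterns, and uses relative frequency as the weight. The function name `induce_patterns` and the example pairs are hypothetical. Unlike the slide's aa/ae notation, this sketch records only the minimal differing substrings (for example a/e), without surrounding context.

```python
from collections import Counter
from difflib import SequenceMatcher


def induce_patterns(pairs):
    """Map (modern substring, historical substring) to a relative-frequency weight.

    `pairs` is an iterable of (modern word, historical word) tuples.
    """
    counts = Counter()
    for modern, historical in pairs:
        matcher = SequenceMatcher(None, modern, historical, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag != "equal":
                # A replace, insert or delete step becomes one rewrite pattern.
                counts[(modern[i1:i2], historical[j1:j2])] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {pattern: count / total for pattern, count in counts.items()}


EXAMPLE_PAIRS = [
    ("jaar", "jaer"),
    ("haar", "haer"),
    ("uiterlijk", "uyterlick"),
]
WEIGHTS = induce_patterns(EXAMPLE_PAIRS)
```

With more example material, frequent patterns receive higher weights, which can then be used to rank candidate matches.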
  • 22. Lemmatization
      Reduction of historical word forms to the modern lemma:
      historical word → (pattern matching) → standard (“modern”) spelling → (lemmatizer) → lemma form
      Dystels → (1) distels → (2) distel
      When we have a perfect or near-perfect modern full form lexicon, the second step is simply lexicon lookup. But:
      1) We will not have full form information for many lemmata (especially the historical ones)
      2) Even lemmata present in modern language may have historical inflected forms different from the present-day paradigm
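The two-step reduction can be sketched as follows, assuming a small hand-made pattern list and full form lexicon (both hypothetical; real pattern sets are induced and weighted). Step 1 rewrites the historical spelling into candidate modern spellings, step 2 looks each candidate up in the full form lexicon.

```python
# Hypothetical historical -> modern rewrite patterns and a toy full form lexicon.
PATTERNS = [("y", "i"), ("ae", "aa"), ("ck", "k")]
FULL_FORM_LEXICON = {"distels": "distel", "distel": "distel", "jaar": "jaar"}


def modern_candidates(word, patterns, max_steps=3):
    """Step 1: spellings reachable by applying up to `max_steps` single rewrites."""
    seen = {word}
    frontier = {word}
    for _ in range(max_steps):
        new_forms = set()
        for form in frontier:
            for hist, mod in patterns:
                start = form.find(hist)
                while start != -1:
                    candidate = form[:start] + mod + form[start + len(hist):]
                    if candidate not in seen:
                        new_forms.add(candidate)
                    start = form.find(hist, start + 1)
        if not new_forms:
            break
        seen |= new_forms
        frontier = new_forms
    return seen


def lemmatize(word, patterns, lexicon):
    """Step 2: look up every candidate modern spelling in the full form lexicon."""
    candidates = modern_candidates(word.lower(), patterns)
    return sorted({lexicon[c] for c in candidates if c in lexicon})
```

For the slide's example, `lemmatize("Dystels", PATTERNS, FULL_FORM_LEXICON)` goes through "distels" and arrives at the lemma "distel". The sketch ignores pattern weights and context, so it returns all matching lemmata unranked.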
  • 23. Lemmatization and reverse lemmatization
      We also need a lemmatization process for these situations.
      - A typical lemmatizer assigns some standard form (infinitive, nominative, stem) to inflected forms, usually based on patterns relating the inflected form to the standard form. But:
      - Matching these patterns can be hard to combine with matching both spelling variation patterns and OCR errors (bok/bokken/bokkeu)
      - We adopt the solution of actually building the “hypothetical modern full form lexicon”, containing the most plausible paradigmatic expansions of lemmata
      - This construction is carried out by means of a statistical reverse lemmatizer
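As an illustration of paradigmatic expansion, the sketch below learns suffix rewrite rules from known (lemma, inflected form) pairs and applies them to an unseen lemma, ranking the generated forms by how often the rule was observed. This is a deliberately simple stand-in, assuming suffix-only inflection; it is not the statistical reverse lemmatizer used in the project, and all names and training pairs are hypothetical.

```python
import os
from collections import Counter


def suffix_rule(lemma, form):
    """Return (lemma suffix, form suffix) after stripping the common prefix."""
    prefix_len = len(os.path.commonprefix([lemma, form]))
    return lemma[prefix_len:], form[prefix_len:]


def train(pairs):
    """Count how often each suffix rule occurs in the (lemma, form) pairs."""
    return Counter(suffix_rule(lemma, form) for lemma, form in pairs)


def expand(lemma, rules, top_n=3):
    """Generate hypothetical inflected forms, most frequent rule first."""
    total = sum(rules.values())
    scored = []
    for (old, new), count in rules.items():
        if lemma.endswith(old):
            form = lemma[: len(lemma) - len(old)] + new
            scored.append((count / total, form))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [form for _, form in scored[:top_n]]


TRAINING_PAIRS = [
    ("distel", "distels"),
    ("tafel", "tafels"),
    ("boek", "boeken"),
    ("werk", "werken"),
    ("wereld", "werelden"),
]
RULES = train(TRAINING_PAIRS)
```

The generated forms are hypothetical in the sense of the slides: `expand("stoel", RULES)` proposes both "stoelen" and "stoels", and only attestation in real text can confirm which of them occur.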
  • 24. Attestation
      - From hypothetical (non-witnessed) lexicon content to attested word forms in “real” text
      - Automatic selection of candidate attestations
      - Manual work: verification and correction
      Two approaches
      - Dictionary based (INL): Woordenboek der Nederlandsche Taal
      - Corpus based (LMU, INL): Dutch DBNL corpus
  • 25. IMPACT Dictionary Attestation Tool
      Lexicon building at work: verifying attestations in historical dictionaries
      Task: find the variants of a headword as they occur in the quotations
      Headword: work
      Quotations:
      - We are working on what works.
      - Depart from me, ye that worke iniquity.
      - She worcketh knittinge of stockings.
      Variants: the forms of the headword occurring in the quotations
  • 26. IMPACT Dictionary Attestation Tool
      Task: find the variants of a headword as they occur in the quotations
      [Diagram labels: electronic historical dictionary; database with lemmata and quotations]
      Automatically (preprocessing)
      - match literally, e.g. work → work, Work
      - match using existing lexica and lists, e.g. work → works, worked, wrought
      - approximate matching, e.g. work → worke
      By hand (using the tool)
      - correct automatic mismatches, e.g. works → words, worms
      - find missed matches, e.g. work → worketh, wrowght
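The automatic preprocessing step can be pictured as three matching passes over each quotation. The Python sketch below is only an illustration: the similarity measure (`difflib.SequenceMatcher.ratio`), the 0.8 threshold, and the function name `find_candidates` are assumptions, not the tool's actual matching logic.

```python
import re
from difflib import SequenceMatcher


def find_candidates(headword, known_forms, quotation, threshold=0.8):
    """Return (token, start, end, kind) for tokens that may attest the headword.

    kind is "literal", "lexicon" (listed in known_forms) or "approximate".
    """
    results = []
    for match in re.finditer(r"\w+", quotation):
        token = match.group(0)
        lowered = token.lower()
        if lowered == headword:
            kind = "literal"
        elif lowered in known_forms:
            kind = "lexicon"
        elif SequenceMatcher(None, headword, lowered).ratio() >= threshold:
            kind = "approximate"
        else:
            continue
        results.append((token, match.start(), match.end(), kind))
    return results


KNOWN_FORMS = {"works", "worked", "wrought"}  # hypothetical list for "work"
```

With these settings "worke" is found by approximate matching, while "worcketh" falls below the threshold and is missed, which is exactly the kind of case left for manual correction in the tool.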
  • 27. IMPACT Attestation Tool
      [Screenshot of the tool; callouts: up-to-date overview of what is done and what needs to be done; done by this user so far; lemma headword; quotations, sorted by uncertainty]
  • 28. IMPACT Lexicon Tool
      Task: find and verify attestations in a historical corpus
      Automatically (preprocessing = apply lemmatizer)
      - match literally, e.g. work → work, Work
      - match using existing lexica and lists, e.g. work → works, worked, wrought
      - matching using spelling variation module, e.g. uiterlijk → uyterlick
      By hand (using the tool)
      - assign correct lemma, e.g. was (N) → zijn (V)
      - group tokens belonging together, e.g. konings zoon → koningszoon
      - select attestations
  • 29. Corpus-based lexicon building: IMPACT Lexicon Tool
  • 30. General vocabulary vs. named entities
      - Tools for lexicon building described so far: applicable to the general lexicon
      - Tools for NE recognition, classification and variant matching
        - library requirement
        - distinguish general vocabulary from NEs
        - avoid unpleasant mix-ups like Abimelech → apemelk! (b/p; i/e; e/0; k/ch)
  • 31. Improvement of state of the art / innovation
      - We use existing computational linguistic approaches, but figure out how to apply them to historical language
      - We develop a workflow to deal with the problems posed by historical language, figuring out how all pieces fit together
        - Data selection and acquisition
        - Manual work
        - Computational linguistics tools
  • 32. Some results so far with focus on Dutch
  • 33. Measuring results for Dutch
      We use the ground truth data developed in the project.
      - Evaluation of EE tools
      - Evaluation of lexicon coverage
      - Evaluation of lexicon usage in IR (2010)
      - Evaluation of OCR and lexicon usage in OCR (2010)
      - Evaluation of benefit of lexicon building for OCR (for which type of material / quality of OCR does this make sense) (2010-11)
  • 34. Dutch ground truth data
      Type and genre                     # words
      Gold Standard Book                 300k
      Random Set Book                    340k
      Random Set Staten Generaal         2.5M
      Gold Standard Staten Generaal      500k
      Gold Standard Newspapers 1         3.4M
      Gold Standard Newspapers 2         170k
      Random Set Newspapers              3.2M
      total                              13.1M
  • 35. Efficiency of lexicon building
      Dictionary-based lexicon building using a historical dictionary: Woordenboek der Nederlandsche Taal
      - Lemmata: 220211, quotations: 1524366
      - Tempo: 1725 quotations/hour; 231 lemmata/hour
  • 36. Reverse lemmatization
      - Reminder: build hypothetical (non-attested) word forms in a “quick and dirty” way to use in lemmatization and corpus-based lexicon building
      - Using simple statistical algorithms and a simple approach to inflection
      - Results:
        Lexicon                               Accuracy
        Small Dutch lexicon (JVKlex)          96.6%
        French lexicon (Morphalou)            99.4%
        Polish lexicon, verbs (Morfologik)    98.7%
  • 37. Lexicon coverage (1: ground truth books)
      Lexicon                                                          Type coverage   Token coverage
      1. Modern lexicon (e-Lex)                                        46%             76%
      2. EE3.3                                                         56%             84%
      1+2                                                              63%             89%
      Type frequency list historical corpus, top 200K (freq >= 19)     70%             93%
      Type frequency list historical corpus, top 500K (freq >= 5)      78%             95%
  • 38. Lexicon coverage (2: gt newspapers 18th-19th c.)
      Lexicon                                                          Type coverage   Token coverage
      1. Modern lexicon (e-Lex)                                        40%             83%
      2. EE3.3                                                         41%             84%
      1+2                                                              51%             89%
      Type frequency list historical corpus, top 200K                  52%             93%
      Type frequency list historical corpus, top 500K                  62%             95%
  • 39. Lexicon coverage (3: gt Parl. Papers 19th c.)
      Lexicon                                                          Type coverage   Token coverage
      1. Modern lexicon (e-Lex)                                        51%             89%
      2. EE3.3                                                         47%             88%
      1+2                                                              58%             93%
      Type frequency historical corpus, top 200K                       59%             96%
      Type frequency historical corpus, top 500K                       68%             97%
  • 40. Lexicon coverage (4: gt Parl. Papers 20th c.)
      Lexicon                                                          Type coverage   Token coverage
      1. Modern lexicon (e-Lex)                                        70%             93%
      2. EE3.3                                                         66%             93%
      1+2                                                              76%             96%
      Type frequency historical corpus, top 200K                       74%             97%
      Type frequency historical corpus, top 500K                       81%             98%
  • 41. Lexicon coverage (5: Genesis, 1637 bible)
      Lexicon                                                          Type coverage   Token coverage
      1. Modern lexicon (e-Lex)                                        31%             61%
      2. EE3.3                                                         62%             83%
      1+2                                                              65%             89%
      Type frequency historical corpus, top 200K                       76%             97%
      Type frequency historical corpus, top 500K                       87%             98.6%
  • 42. Lexicon coverage (6: Hooft, historiën)
      Lexicon                                                          Type coverage   Token coverage
      1. Modern lexicon (e-Lex)                                        26%             67%
      2. EE3.3                                                         47%             88%
      1+2                                                              50%             90%
      Type frequency historical corpus, top 200K                       44%             93%
      Type frequency historical corpus, top 500K                       58%             96%
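The coverage slides report two measures: type coverage (the share of distinct word forms found in the lexicon) and token coverage (the share of running words found in it). A minimal sketch of how both can be computed, assuming case-insensitive comparison and a tokenized text (the function name and the toy data are hypothetical):

```python
from collections import Counter


def coverage(tokens, lexicon):
    """Return (type coverage, token coverage) of `lexicon` over `tokens`."""
    counts = Counter(token.lower() for token in tokens)
    if not counts:
        raise ValueError("no tokens to evaluate")
    covered = [word for word in counts if word in lexicon]
    type_coverage = len(covered) / len(counts)
    token_coverage = sum(counts[word] for word in covered) / sum(counts.values())
    return type_coverage, token_coverage


TOKENS = "de wereld de werelt de weerelt".split()
LEXICON = {"de", "wereld"}
TYPE_COV, TOKEN_COV = coverage(TOKENS, LEXICON)
```

In the toy example two of the four types are covered (50%) but four of the six tokens (about 67%), because the frequent word "de" is in the lexicon. The same effect explains why token coverage is consistently higher than type coverage in the tables.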
  • 43. Conclusion from this evaluation
      - The evident next step for Dutch lexicon building is corpus-based work
      - First target: cover the top 200000 from the historical corpus
        - Contains 97885 types not in the witnessed historical EE3.3 lexicon
        - Roughly 24% of these are covered by the modern lexicon
        - Roughly 25% are names
        - This leaves about 45000 common words to look into
  • 44. Measuring effect of lexicon use in IR
      - Example: improved recall for retrieval in a historical corpus of about 150 million tokens. Using only the modern lexicon, a search for wereld yields 23396 hits; using the current EE3.3 lexicon we get 34339 hits.
      - Simple IR will be part of the demonstrators
      - Hard to measure IR results proper without special datasets
      - Up to now we have measured either lemmatization or modern-to-historical word form matching accuracy
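The recall gain in the wereld example comes from expanding the modern lemma to its attested historical variants before searching. A minimal sketch of that query expansion, using a tiny hypothetical variant lexicon and token list rather than the 150-million-token corpus:

```python
def expand_query(lemma, variant_lexicon):
    """Return the lemma together with all variants the lexicon lists for it."""
    return {lemma} | set(variant_lexicon.get(lemma, ()))


def count_hits(tokens, forms):
    """Count the tokens that match any of the search forms (case-insensitive)."""
    return sum(1 for token in tokens if token.lower() in forms)


# Hypothetical data: three variants taken from the list of spellings of "wereld".
VARIANT_LEXICON = {"wereld": {"werelt", "weerelt", "waereld"}}
CORPUS_TOKENS = "De werelt en de wereld en de weerelt".split()

hits_modern_only = count_hits(CORPUS_TOKENS, {"wereld"})
hits_expanded = count_hits(CORPUS_TOKENS, expand_query("wereld", VARIANT_LEXICON))
```

Searching for the modern form alone finds one token here, while the expanded query finds three.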
  • 45. Lemmatization
      - Combination of lookup, matching of spelling variation, reverse lemmatization
      - As yet no good evaluation set for IMPACT (current work)
      - Evaluation on “type” level
      We will use other material here (1637 Genesis, 97144 tokens).
      Approach
      - Restrict to “ordinary words” (no names, numbers, clitic combinations)
      - Ambiguous lemmatization (context is not used) (avg. 5 suggestions per word)
      - Ranking based on frequency and pattern weights
  • 46. Result
      - 6265 distinct types. 5991 (95.7%) had at least one correct suggestion
      - Average rank of correct suggestions: 1.23
        - 5222 types found in current EE3.3 (83%)
        - 65 additional types in modern lexicon
        - 49 types without any match
        - 969 types (15%) identified with “approximate” matching using ~500 weighted patterns and returning at most 2 suggestions
  • 47. Real and hypothetical lexicon coverage (Hooft, historiën)
      - Result (again restricting to ‘ordinary’ words)
      - 36332 distinct types. Avg. rank of correct suggestions: 1.23
        - 20087 types found in current EE3.3 (55%)
        - 1061 additional types in modern lexicon
        - 2411 types without any match (7%)
        - 12773 types (35%) identified with “approximate” matching using ~500 weighted patterns and returning at most 2 suggestions
      (Probably about 75% of the highest-ranking approximate matches are correct)
  • 48. Evaluation of TR results
      Using Finereader SDK (version 9)
      - External dictionary interface for experimentation
      - Not completely straightforward how to apply this:
        - Translation of corpus frequencies to weights on a scale 0-100
        - Other details: hyphenated words, case-sensitivity, ...
        - Workaround to circumvent the long s problem
      Lexicon data used
      - Corpus-based type-frequency list
      - EE3.2 deliverable lexicon
      - Finereader internal lexicon
  • 49. OCR evaluation
      1. Character accuracy
      2. Word accuracy
      3. In case of block alignment problems, a simple alternative is bag-of-words accuracy
      Measures 1 and 2 presuppose a good alignment of OCR with ground truth.
      We will use word accuracy, or the simpler alternative 3 when there are alignment problems.
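The difference between word accuracy and the bag-of-words alternative can be shown in a few lines of Python. This is a sketch under simplifying assumptions: `word_accuracy` expects the OCR and ground truth tokens to be already aligned one-to-one, and `bag_of_words_recall` ignores word order entirely by comparing multisets. Both function names are hypothetical, and the project's exact definitions may differ.

```python
from collections import Counter


def word_accuracy(ocr_tokens, gt_tokens):
    """Share of positions where the aligned OCR token equals the ground truth."""
    if not gt_tokens or len(ocr_tokens) != len(gt_tokens):
        raise ValueError("token lists must be non-empty and aligned one-to-one")
    correct = sum(1 for ocr, gt in zip(ocr_tokens, gt_tokens) if ocr == gt)
    return correct / len(gt_tokens)


def bag_of_words_recall(ocr_tokens, gt_tokens):
    """Share of ground truth tokens also found in the OCR output, ignoring order."""
    gt_counts = Counter(gt_tokens)
    if not gt_counts:
        raise ValueError("ground truth is empty")
    matched = sum((Counter(ocr_tokens) & gt_counts).values())
    return matched / sum(gt_counts.values())


GROUND_TRUTH = "de eerste was de gevaarlykste".split()
OCR_OUTPUT = "de eerde was de gevaarlykflti".split()
```

For this pair both measures give 0.6, but only the bag-of-words variant stays unchanged when the OCR engine outputs text blocks in a different order.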
  • 50. OCR results
      Lexicon configurations:
      (a) ABBYY internal Dutch lexicon
      (b) combination of corpus-based historical lexicon and EE3.2 deliverable
      (c) combination of corpus-based historical lexicon and EE3.2 deliverable, improved deployment (case insensitive, taking hyphenation into account)
      Dataset                                                           (a)      (b)      (c)
      DPO35 (word accuracy)                                             88.8%    90.9%    94.4%
      Parliamentary papers, 1826-27 selection (bag of words recall)     90.9%    94.9%    94.9%
  • 51. ‘The Book’
      “Kort begrip der waereld-historie voor de jeugd”, J.F. Martinet, Predikant te Zutphen, from 1789.
      Why this book?
      - Representative font and amount of spelling variation etc. for late 18th century Dutch
      - It has the “long s problem”: [image of the printed word] = stilste, not ftilfte
      - ...
  • 52. The long s problem: an example
      OCR at start of project:
        A. De eerde was de gevaarlykflti om de verlei¬
        ding aan 't Hof; de tweede de ftillie en veiligde;
        de derde de zwaarde, daar hy byna drie millioenen
        harde en onbefchaafde Menfchen beftieren moest.
      Results April 2010:
        A. De eerste was de gevaarlykste om de verlei-
        ding aan 't Hof; de tweede de stilste en veiligste;
        de derde de zwaarste, daar hy byna drie millioenen
        harde en onbeschaafde Menschen bestieren moest.
      Workaround: “integrated postcorrection”. Tell the engine that “eerfte” is OK and postcorrect it afterwards with the lexicon. In this way we keep it from turning into “eerde”.
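The workaround can be sketched in Python as follows. The idea is to derive, for every lexicon word, the spellings in which a long s would be read as "f", offer those to the OCR engine as acceptable words, and map them back to the real word afterwards. The rule used here (any non-final "s" may surface as "f") is a simplifying assumption, and all function names are hypothetical; this is not the project's actual deployment code.

```python
from itertools import product


def long_s_variants(word):
    """Spellings of `word` with one or more non-final 's' replaced by 'f'."""
    positions = [i for i, ch in enumerate(word[:-1]) if ch == "s"]
    variants = set()
    for choice in product((False, True), repeat=len(positions)):
        if not any(choice):
            continue  # skip the unchanged spelling
        chars = list(word)
        for pos, use_f in zip(positions, choice):
            if use_f:
                chars[pos] = "f"
        variants.add("".join(chars))
    return variants


def build_postcorrection_map(lexicon):
    """Map each 'f' variant that is not itself a lexicon word to its source word."""
    mapping = {}
    for word in sorted(lexicon):
        for variant in long_s_variants(word):
            if variant not in lexicon:
                mapping.setdefault(variant, word)
    return mapping


def postcorrect(tokens, mapping):
    """Replace accepted 'f' variants in the OCR output by the lexicon word."""
    return [mapping.get(token, token) for token in tokens]


LEXICON = {"de", "eerste", "stilste", "was", "moest"}
MAPPING = build_postcorrection_map(LEXICON)
```

The keys of `MAPPING` are the extra words the engine would be told to accept (for example "eerfte" and "ftilfte"), and `postcorrect` restores the spelling with "s" in the final text.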
  • 53. Future work
      - Compound analysis
      - Irregular historical white space use (“impacttok++”) (cf. attestations)
      - Corpus-based lexicon extension
      - Testing and optimization with ground truth data
      - Improve the TR lexicon by extending the IR lexicon and removing false friends from the DBNL-corpus-based TR lexicon
      - Continue work on the best way to deploy lexica in OCR, with help from ABBYY