Computer Lexica in OCR and Retrieval Katrien Depuydt, Jesse de Does  (Instituut voor Nederlandse Lexicologie, Leiden)
Can we handle ‘de wereld’ (‘the world’)? 4 March 2009 presentation The Hague werreid
IMPACT <Demo Day BL, 12 July 2011>
OCR: Abbyy Finereader SDK with built-in standard Dutch dictionary
OCR: Abbyy Finereader SDK combining built-in modern Dutch dictionary with IMPACT external historical lexicon of Dutch: werreld
werelt weerelt wereld weerelds wereldt werelden weereld werrelts waerelds weerlyt wereldts vveerelts waereld weerelden waerelden weerlt werlt werelds sweerels zwerlys swarels swerelts werelts swerrels weirelts tsweerelds werret vverelt werlts werrelt worreld werlden wareld weirelt weireld waerelt werreld werld vvereld weerelts werlde tswerels werreldts weereldt wereldje waereldje weurlt wald weëled
RETRIEVAL: key in modern WERELD and find all
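The retrieval idea on this slide (key in the modern word, find every historical form) can be sketched as a lookup in a reverse index. This is a minimal sketch, not the IMPACT implementation: the dictionary `FORM_TO_LEMMA` and the helper names `build_reverse_index` and `expand_query` are assumptions made for illustration; only the word forms come from the slide.

```python
from collections import defaultdict

# Hypothetical mini-lexicon: historical word form -> modern lemma.
# The forms are copied from the slides; the data structure is an assumption.
FORM_TO_LEMMA = {
    "werelt": "wereld",
    "weerelt": "wereld",
    "waereld": "wereld",
    "vvereld": "wereld",
    "werreld": "wereld",
    "uyterlijck": "uiterlijk",
}


def build_reverse_index(form_to_lemma):
    """Group historical forms under their modern lemma."""
    index = defaultdict(set)
    for form, lemma in form_to_lemma.items():
        index[lemma].add(form)
    return index


def expand_query(modern_word, index):
    """Return the modern word plus every historical form listed for it."""
    key = modern_word.lower()
    return sorted({key} | index.get(key, set()))


index = build_reverse_index(FORM_TO_LEMMA)
print(expand_query("WERELD", index))
```

A search engine could then match any of the returned forms in the OCR text, so that a user who types only the modern spelling still reaches the historical ones.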
The long s problem: an example. IMPACT workshop, Bratislava, May 7, 2010
OCR at start of project:
A. De eerde was de gevaarlykflti om de verlei¬ ding aan 't Hof; de tweede de ftillie en veiligde; de derde de zwaarde, daar hy byna drie millioenen harde en onbefchaafde Menfchen beftieren moest.
Results April 2010:
A. De eerste was de gevaarlykste om de verlei- ding aan 't Hof; de tweede de stilste en veiligste; de derde de zwaarste, daar hy byna drie millioenen harde en onbeschaafde Menschen bestieren moest.
Workaround: “integrated postcorrection”. Tell the engine that “eerfte” is OK and postcorrect it afterwards with the lexicon. In this way we keep it from turning into “eerde” (earth) instead of “eerste” (first).
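The postcorrection half of this workaround can be sketched as follows. This is a minimal sketch under stated assumptions: the small `LEXICON` set and the function names `long_s_candidates` and `postcorrect` are hypothetical, and the step in which the OCR engine is told to accept forms such as “eerfte” is not modelled here.

```python
from itertools import product

# Hypothetical lexicon of accepted word forms (an assumption for this sketch).
LEXICON = {"eerste", "stilste", "veiligste", "onbeschaafde",
           "menschen", "bestieren", "hof"}


def long_s_candidates(token):
    """Yield every spelling obtained by reading some 'f' as 's'."""
    options = [("f", "s") if ch == "f" else (ch,) for ch in token]
    for combo in product(*options):
        yield "".join(combo)


def postcorrect(token, lexicon=LEXICON):
    """Return the lexicon form of an f-for-long-s reading if exactly one fits."""
    if token.lower() in lexicon:
        return token  # already a known word, leave it alone
    matches = [c for c in long_s_candidates(token) if c.lower() in lexicon]
    return matches[0] if len(matches) == 1 else token


for word in ["eerfte", "onbefchaafde", "Menfchen", "beftieren", "Hof"]:
    print(word, "->", postcorrect(word))
```

The candidate set doubles with every `f` in the token, which is acceptable for ordinary words but would need a cap for very long strings. Tokens with no unique lexicon match are returned unchanged rather than guessed.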
Overview
What is a computer lexicon?
Computer lexicon vs electronic dictionary (1)
An electronic dictionary is:
Dictionary XML (example)
Computer Lexicon vs Electronic Dictionary (2)
Lexica in IMPACT
The OCR lexicon
An OCR lexicon is
OCR lexicon: example

1550-1750           > 1900
song       820      television    418
rihte      818      electronic    375
theire     818      video         194
manye      818      hormone       176
sume       815      jazz          162
Do         814      eco           142
Whiche     811      software      136
fyrst      811      vitamin       128
while      811      movie         121
Water      810      taxi          113
wt         809      isotopic      108
shalbe     808      electronics    95
thingis    807      radar          86
again      806      basically      71
sona       806      sabotage       71
wa         805      homozygote     70
mode       804      psychedelic    67
work       802      phonemic       66
between    801      insulin        64
law        799      zap            64
moder      798      antibody       61
mis        798      fungicidal     61
softe      798
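One way such period-specific word lists could be used is to rank competing readings of a word by how common they are in the period of the document. The sketch below is an illustration only: it assumes the numbers on the slide are corpus counts, and the names `PERIOD_LEXICA` and `best_candidate` are hypothetical.

```python
# Hypothetical period-specific word lists. The values are copied from the
# slide and are read here as corpus counts, which is an assumption.
PERIOD_LEXICA = {
    "1550-1750": {"fyrst": 811, "shalbe": 808, "mode": 804, "moder": 798},
    ">1900": {"television": 418, "video": 194, "software": 136, "radar": 86},
}


def best_candidate(candidates, period, lexica=PERIOD_LEXICA):
    """Pick the candidate with the highest count in the period's word list."""
    counts = lexica[period]
    known = [c for c in candidates if c in counts]
    if not known:
        return None  # no candidate is attested for this period
    return max(known, key=lambda c: counts[c])


print(best_candidate(["fyrft", "fyrst"], "1550-1750"))
print(best_candidate(["video"], "1550-1750"))
```

The second call returns `None`, which shows why a lexicon matched to the period matters: a word that is common after 1900 gives no support to a reading in a text from 1550-1750.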
The IR lexicon
<?xml version='1.0'?>
<!DOCTYPE lexicon SYSTEM 'NL_Structure.dtd'>
<lexicon>
  <lexical_entry>
    <lemma_id>219490</lemma_id>
    <modern_lemma>aantuilen</modern_lemma>
    <gloss></gloss>
    <POS>VRB</POS>
    <ne_label></ne_label>
    <language_id></language_id>
    <portmanteau_lemma_id></portmanteau_lemma_id>
    <wordform>
      <form_representation>
        <wordform_id>850026</wordform_id>
        <written_form>tuyld</written_form>
        <attestation>
          <id>92141</id>
          <token_id></token_id>
          <quote>Verhael ick (<I>t.w. een als vrouw verkleede man</I>) haer mijn min in Vrouwelijcker schynen: Sy acht het boertery, en tuyld daer weer op an, Vermits een Vrou niet op een Vrou verlieven kan,</quote>
          <derivation_id>0</derivation_id>
          <document_id>204</document_id>
          <start_pos>119</start_pos>
          <end_pos>124</end_pos>
        </attestation>
      </form_representation>
    </wordform>
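An entry in this format can be read with Python's standard XML parser. The sketch below is a minimal illustration: the sample is a shortened copy of the slide's entry, the closing `</lexical_entry>` and `</lexicon>` tags are added as an assumption so that the fragment parses (the slide breaks off before them), and `read_entries` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the slide's entry. The closing </lexical_entry> and
# </lexicon> tags are added here (assumption) so the fragment parses.
SAMPLE = """<lexicon>
<lexical_entry>
  <lemma_id>219490</lemma_id>
  <modern_lemma>aantuilen</modern_lemma>
  <POS>VRB</POS>
  <wordform><form_representation>
    <wordform_id>850026</wordform_id>
    <written_form>tuyld</written_form>
    <attestation>
      <id>92141</id>
      <quote>Sy acht het boertery, en tuyld daer weer op an,</quote>
      <document_id>204</document_id>
    </attestation>
  </form_representation></wordform>
</lexical_entry>
</lexicon>"""


def read_entries(xml_text):
    """Yield (modern lemma, part of speech, written form, quote) tuples."""
    root = ET.fromstring(xml_text)
    for entry in root.iter("lexical_entry"):
        lemma = entry.findtext("modern_lemma", default="").strip()
        pos = entry.findtext("POS", default="").strip()
        for form in entry.iter("form_representation"):
            written = form.findtext("written_form", default="").strip()
            for att in form.iter("attestation"):
                quote_el = att.find("quote")
                # itertext() also collects text inside inline markup such as <I>
                quote = "".join(quote_el.itertext()).strip() if quote_el is not None else ""
                yield lemma, pos, written, quote


for row in read_entries(SAMPLE):
    print(row)
```

Each tuple links a historical written form to its modern lemma together with the quotation that attests it, which is the information a retrieval lexicon needs.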
Tools for lexicon building and application of lexica
Types variation (spelling, inflection…)
uytterlijcste uyterlijkste d'uyterlijke uiterlyke uyterlijcke uiterlijke uyterlijck uiterlyken uiterlijkste uiterlicke wterlicke wterlijcke ulterlijk uiterlyk uiterlijk uyterlick wterlicken d'uyterlijcke uiterlijken uiterlijks wterlijck uytterlicke uitterlijke ujterlijke uytterlijk uyterlycke uyterlicken uijterlicke d'uiterlijcke wtterlijcke wterlyke wtterlijk uuterlick uuterlic uyterlijke uyterlijcken uyterlicke d'uiterlyke wterlijke vuyterlijcke uuterlycke uuterlicke wterlijken uyterlijcksten uuyterlicke uuyterlick uuyterlycke uytterlijcke uytterlycke uytterlick vuytterlicke uiterlijker uyterlyck uterliek wterlijcken uiterlijkst uitterlijk uytterlijcken uyterlyk wterlick uutterlijck uuyterlicken uyttelijck uijterlijk uytterlijck uuterlijck uiterlick uitterlyk uuyterlic uuyterlyck uuyterlijck uiterlijck uytterlyck uterlyc wterlijk
I
werelt weerelt wereld weerelds wereldt werelden weereld werrelts waerelds weerlyt wereldts vveerelts waereld weerelden waerelden weerlt werlt werelds sweerels zwerlys swarels swerelts werelts swerrels weirelts tsweerelds werret vverelt werlts werrelt worreld werlden wareld weirelt weireld waerelt werreld werld vvereld weerelts werlde tswerels werreldts weereldt wereldje waereldje weurlt wald weëled
II
(patterns to predict variation)
(a number are predictable with patterns, others need to be taken from a lexicon)
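The point that some variants are predictable with patterns while others need a lexicon can be illustrated with a few rewrite rules. This is a sketch only: the rules in `PATTERNS` are assumptions chosen to fit a handful of forms from the slide, not the IMPACT rule set, and they are far too general for real use.

```python
import re

# Illustrative rewrite patterns (assumptions, not the IMPACT rule set).
# Each pair maps a historical spelling habit to a modern Dutch one and is
# applied in the order given.
PATTERNS = [
    (r"^d'", ""),    # drop the clitic article: d'uyterlijcke -> uyterlijcke
    (r"^wt", "uit"),  # word-initial "wt" written for "uit"
    (r"uy", "ui"),
    (r"y", "ij"),
    (r"tt", "t"),
    (r"ck", "k"),
]


def normalize(form):
    """Apply every rewrite pattern in order and return the result."""
    for pattern, replacement in PATTERNS:
        form = re.sub(pattern, replacement, form)
    return form


def predictable(form, modern):
    """True if the patterns alone turn the historical form into the modern one."""
    return normalize(form) == modern


for form in ["uyterlijck", "d'uyterlijcke", "wterlijcke", "uiterlyk", "uterliek"]:
    print(form, "->", normalize(form))
print(predictable("werelt", "wereld"))
```

With these rules "uyterlijck" and "uiterlyk" both come out as "uiterlijk", whereas "uterliek" and "werelt" are left unchanged. Forms like the last two are the ones that would have to be listed in a lexicon.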
Neil Fitzgerald, 7th July 2011
Computer lexica
Tools (more specific)
Spelling variation tools (pattern-based)
Lemmatization
Lemmatization and reverse lemmatization
Attestation
IMPACT Dictionary Attestation Tool
Lexicon building at work: verifying attestations in historical dictionaries
Task: find the variants of a headword as they occur in the quotations
Screen labels: headword, quotations, variants
IMPACT Dictionary Attestation Tool
Task: find the variants of a headword as they occur in the quotations
Electronic historical dictionary. Database with lemmata and quotations.
IMPACT Attestation Tool
Screen labels: Tool. Lemma headword. Quotations. Sorted by uncertainty. Up-to-date overview of what is done and needs to be done. Done by this user so far.
IMPACT Lexicon Tool
Task: find and verify attestations in a historical corpus
Corpus-based lexicon building: IMPACT Lexicon Tool
General vocabulary vs. named entities
Improvement of state of the art / innovation
Languages in IMPACT
OCR evaluation results (preliminary!)
1. Czech
2. Dutch
Precision: 0.8432889410216431, Recall: 0.843331934927516
 
3. English
4. French
5. German
6. Polish
7. Slovene
Retrieval demonstrator
Language tools bne-5-10-2011

  • 1. Computer Lexica in OCR and Retrieval Katrien Depuydt, Jesse de Does (Instituut voor Nederlandse Lexicologie, Leiden)
  • 2. Can we handle ‘de wereld’ (‘the world’)’? 4 March 2009 presentation The Hague werreid
  • 3. IMPACT <Demo Day BL, 12 July 2011> OCR: Abbyy Finereader SDK with built in standard Dutch dictionary OCR: Abbyy Finereader SDK combining built in modernDutch dictionary with IMPACT external historical lexicon of Dutch: werreld
  • 4. IMPACT <Demo Day BL, 12 July 2011> werelt weerelt wereld weerelds wereldt werelden weereld werrelts waerelds weerlyt wereldts vveerelts waereld weerelden waerelden weerlt werlt werelds sweerels zwerlys swarels swerelts werelts swerrels weirelts tsweerelds werret vverelt werlts werrelt worreld werlden wareld weirelt weireld waerelt werreld werld vvereld weerelts werlde tswerels werreldts weereldt wereldje waereldje weurlt wald weëled RETRIEVAL: key in modern WERELD and find all
  • 5. The long s problem: An example … IMPACT workshop, Bratislava, May 7, 2010 OCR at start of project: A. De eerde was de gevaarlykflti om de verlei¬ ding aan 't Hof; de tweede de ftillie en veiligde; de derde de zwaarde, daar hy byna drie millioenen harde en onbefchaafde Menfchen beftieren moest.
  • 6. The long s problem: An example … IMPACT workshop, Bratislava, May 7, 2010 OCR at start of project: A. De eerde was de gevaarlykflti om de verlei¬ ding aan 't Hof; de tweede de ftillie en veiligde; de derde de zwaarde, daar hy byna drie millioenen harde en onbefchaafde Menfchen beftieren moest. Results April 2010: A. De eerste was de gevaarlykste om de verlei- ding aan 't Hof; de tweede de stilste en veiligste; de derde de zwaarste, daar hy byna drie millioenen harde en onbeschaafde Menschen bestieren moest.
  • 7. The long s problem: An example … IMPACT workshop, Bratislava, May 7, 2010 Workaround: “integrated postcorrection”. Tell the engine that “eerfte” is OK and postcorrect it afterwards with the lexicon. This keeps the engine from turning the word into “eerde” (earth) instead of “eerste” (first).
  • 8.
  • 9. What is a computer lexicon? IMPACT <Demo Day BL, 12 July 2011>
  • 10.
  • 11. Dictionary XML (example) IMPACT <Demo Day BL, 12 July 2011>
  • 12. IMPACT <Demo Day BL, 12 July 2011>
  • 13.
  • 14. IMPACT <Demo Day BL, 12 July 2011>
  • 15. Lexica in IMPACT IMPACT <Demo Day BL, 12 July 2011>
  • 16.
  • 17. OCR lexicon: example IMPACT <Demo Day BL, 12 July 2011>
    Word frequencies, 1550-1750: song 820, rihte 818, theire 818, manye 818, sume 815, Do 814, Whiche 811, fyrst 811, while 811, Water 810, wt 809, shalbe 808, thingis 807, again 806, sona 806, wa 805, mode 804, work 802, between 801, law 799, moder 798, mis 798, softe 798
    Word frequencies, > 1900: television 418, electronic 375, video 194, hormone 176, jazz 162, eco 142, software 136, vitamin 128, movie 121, taxi 113, isotopic 108, electronics 95, radar 86, basically 71, sabotage 71, homozygote 70, psychedelic 67, phonemic 66, insulin 64, zap 64, antibody 61, fungicidal 61
  • 18.
  • 19. IMPACT <Demo Day BL, 12 July 2011>
    <?xml version='1.0'?>
    <!DOCTYPE lexicon SYSTEM 'NL_Structure.dtd'>
    <lexicon>
      <lexical_entry>
        <lemma_id>219490</lemma_id>
        <modern_lemma>aantuilen</modern_lemma>
        <gloss></gloss>
        <POS>VRB</POS>
        <ne_label></ne_label>
        <language_id></language_id>
        <portmanteau_lemma_id></portmanteau_lemma_id>
        <wordform>
          <form_representation>
            <wordform_id>850026</wordform_id>
            <written_form>tuyld</written_form>
            <attestation>
              <id>92141</id>
              <token_id></token_id>
              <quote>Verhael ick (<I>t.w. een als vrouw verkleede man</I>) haer mijn min in Vrouwelijcker schynen: Sy acht het boertery, en tuyld daer weer op an , Vermits een Vrou niet op een Vrou verlieven kan,</quote>
              <derivation_id>0</derivation_id>
              <document_id>204</document_id>
              <start_pos>119</start_pos>
              <end_pos>124</end_pos>
            </attestation>
          </form_representation>
        </wordform>
  • 20. Tools for lexicon building and application of lexica IMPACT <Demo Day BL, 12 July 2011>
  • 21. Types variation (spelling, inflection…) IMPACT <Demo Day BL, 12 July 2011>
    uytterlijcste uyterlijkste d'uyterlijke uiterlyke uyterlijcke uiterlijke uyterlijck uiterlyken uiterlijkste uiterlicke wterlicke wterlijcke ulterlijk uiterlyk uiterlijk uyterlick wterlicken d'uyterlijcke uiterlijken uiterlijks wterlijck uytterlicke uitterlijke ujterlijke uytterlijk uyterlycke uyterlicken uijterlicke d'uiterlijcke wtterlijcke wterlyke wtterlijk uuterlick uuterlic uyterlijke uyterlijcken uyterlicke d'uiterlyke wterlijke vuyterlijcke uuterlycke uuterlicke wterlijken uyterlijcksten uuyterlicke uuyterlick uuyterlycke uytterlijcke uytterlycke uytterlick vuytterlicke uiterlijker uyterlyck uterliek wterlijcken uiterlijkst uitterlijk uytterlijcken uyterlyk wterlick uutterlijck uuyterlicken uyttelijck uijterlijk uytterlijck uuterlijck uiterlick uitterlyk uuyterlic uuyterlyck uuyterlijck uiterlijck uytterlyck uterlyc wterlijk
    I
    werelt weerelt wereld weerelds wereldt werelden weereld werrelts waerelds weerlyt wereldts vveerelts waereld weerelden waerelden weerlt werlt werelds sweerels zwerlys swarels swerelts werelts swerrels weirelts tsweerelds werret vverelt werlts werrelt worreld werlden wareld weirelt weireld waerelt werreld werld vvereld weerelts werlde tswerels werreldts weereldt wereldje waereldje weurlt wald weëled
    II
    (patterns to predict variation) (a number are predictable with patterns, others need to be taken from a lexicon)
  • 22. Neil Fitzgerald, 7th July 2011
  • 23.
  • 24.
  • 25.
  • 26.
  • 27.
  • 28.
  • 29.
  • 30.
  • 31. IMPACT Attestation Tool IMPACT workshop, Bratislava, May 7, 2010 Tool; Lemma headword; Quotations; Sorted by uncertainty; Up-to-date overview of what is done and needs to be done; Done by this user so far
  • 32.
  • 33. Corpus-based lexicon building: Impact Lexicon Tool IMPACT workshop, Bratislava, May 7, 2010
  • 34.
  • 35.
  • 36.
  • 37. OCR evaluation results (preliminary!)
  • 38.
  • 39.  
  • 40.
  • 41. Precision: 0.8432889410216431, Recall: 0.843331934927516

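The “integrated postcorrection” workaround from slide 7 can be sketched roughly as follows. This is only an illustration, not the IMPACT implementation: the toy `LEXICON`, the function name `postcorrect_long_s` and the limit `MAX_F` are hypothetical, and the sketch assumes the OCR engine has been allowed to output long-s words with `f` (such as “eerfte”) instead of forcing them onto a different dictionary word.

```python
from itertools import product

# Hypothetical toy lexicon of accepted historical word forms (lower case).
LEXICON = {"eerste", "gevaarlykste", "stilste", "veiligste", "zwaarste",
           "onbeschaafde", "menschen", "bestieren", "hof"}

MAX_F = 8  # assumption: skip tokens with very many 'f's to bound the search


def postcorrect_long_s(token, lexicon=LEXICON):
    """Repair a token in which the long s was recognised as 'f'.

    Every 'f' is tried both as 'f' and as 's'. The token is replaced only
    if exactly one of the resulting candidates is in the lexicon;
    otherwise it is returned unchanged.
    """
    lower = token.lower()
    if lower in lexicon:
        return token  # already a known form, nothing to do
    positions = [i for i, ch in enumerate(lower) if ch == "f"]
    if len(positions) > MAX_F:
        return token
    matches = set()
    for choice in product("fs", repeat=len(positions)):
        chars = list(lower)
        for pos, ch in zip(positions, choice):
            chars[pos] = ch
        candidate = "".join(chars)
        if candidate in lexicon:
            matches.add(candidate)
    if len(matches) != 1:
        return token  # no match, or ambiguous: leave the OCR output alone
    corrected = matches.pop()
    # Keep an initial capital, as in "Menfchen" -> "Menschen".
    return corrected.capitalize() if token[0].isupper() else corrected
```

With this toy lexicon, `postcorrect_long_s("eerfte")` returns `"eerste"` and `postcorrect_long_s("Menfchen")` returns `"Menschen"`, while a token with no lexicon match is left as it is. Refusing to correct when more than one candidate matches is a design choice of this sketch; a real system would more likely rank candidates, for example by frequency.
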
Editor's Notes

  1. A snippet from a Dutch magazine (De Denker. No. 4. Den 24. January 1763). OCR: improving access to text by improving the quality of the text. RETRIEVAL: improving access to text by dealing with historical spelling variants. Used: HISTORICAL LEXICON OF DUTCH. Can we handle ‘the world’? “Yes, we can” ought to be our answer, especially when investing heavily in mass digitisation. Mass digitisation is the very reason for investing in lexicon building: digitising huge quantities of historical text demands effort on the quality of OCR as well as retrieval, and historical lexicon building, as this small example shows, can contribute to both. An example: in a ground-truth corpus of Dutch texts from 1550 until 1950, containing approximately 150 million words, a search for the very common word ‘wereld’ yielded 23396 hits. Using a historical lexicon containing spelling and morphological variants of this word resulted in 34339 hits.
  2. This presentation is based on how the INL works with language. An electronic dictionary is not what we need for OCR and simple retrieval, but it is introduced anyway because we can (and do) use our dictionaries for lexicon construction.
  3. This is what an XML-based electronic dictionary looks like.
  4. This is the XML of the Oxford English Dictionary. The horizontal lines mark places where part of the structure has been folded in.
  5. <ed> We need further explanation of what ‘lemma’, ‘part of speech’ and ‘morphology’ mean. Lemma: the headword, like the entry in an ordinary dictionary. Morphology: morphological analysis is done for compounds and derivatives, i.e. which parts can be distinguished in a word, e.g. apple pie: apple + pie.
  6. This is a small part of a computational lexicon (of a certain type; there are many types of computational lexica).
  7. <ed> Again, unsure of what LEMMA means. Be, was, am, is, etc. are all forms of the same word BE (and that is an example of a lemma).
  8. Two types of variation, examples for Dutch from the lexicon
  9. To give an indication of possible spelling variants of the word ‘world’ for English, a screenshot from the OED online...
  10. These are some of the ways in which we are using computer lexica as building blocks.
  11. Explanation: semi-supervised approach. Match the word list from the corpus with the lexicon and find both the pairings of corpus words with lexicon words and the patterns needed for the transformation. This only works if corpus and lexicon are a good match.
  12. Explanation: semi-supervised approach. Match the word list from the corpus with the lexicon and find both the pairings of corpus words with lexicon words and the patterns needed for the transformation. This only works if corpus and lexicon are a good match.
  13. Explanation: semi-supervised approach. Match the word list from the corpus with the lexicon and find both the pairings of corpus words with lexicon words and the patterns needed for the transformation. This only works if corpus and lexicon are a good match.
  14. Note: applicable to other historical dictionaries with attestations. Tested on OED material!
  15. Note: applicable to other historical dictionaries with attestations. Tested on OED material!
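
The matching step described in notes 11 to 13 (pair corpus words with lexicon words and collect the patterns that transform one into the other) could look something like the sketch below. It is an assumption-laden illustration rather than the IMPACT tooling: Python's generic `difflib` string similarity stands in for whatever matching the project actually uses, and the function names and the `cutoff` value are hypothetical. The last function shows how such pairings support the retrieval case from note 1, where keying in the modern form finds the historical variants.

```python
from collections import Counter
from difflib import SequenceMatcher, get_close_matches


def rewrite_patterns(historical, modern):
    """Return the (historical, modern) substring pairs where two words differ."""
    matcher = SequenceMatcher(None, historical, modern, autojunk=False)
    return [(historical[i1:i2], modern[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]


def pair_with_lexicon(corpus_words, lexicon, cutoff=0.7):
    """Pair each unknown corpus word with its most similar lexicon word.

    Returns the pairings and a count of the rewrite patterns they imply.
    Words already in the lexicon, or with no sufficiently similar lexicon
    word, are skipped.
    """
    known = sorted(set(lexicon))  # sorted for reproducible tie-breaking
    pairs = {}
    patterns = Counter()
    for word in corpus_words:
        if word in known:
            continue
        best = get_close_matches(word, known, n=1, cutoff=cutoff)
        if best:
            pairs[word] = best[0]
            patterns.update(rewrite_patterns(word, best[0]))
    return pairs, patterns


def expand_query(modern_form, pairs):
    """Return the modern form plus every corpus form paired with it."""
    return sorted({modern_form} |
                  {hist for hist, mod in pairs.items() if mod == modern_form})


pairs, patterns = pair_with_lexicon(
    ["waereld", "weerelt", "uyterlijck", "wereld"],
    ["wereld", "uiterlijk"],
)
print(pairs)
print(patterns.most_common())
print(expand_query("wereld", pairs))
```

As the notes warn, this kind of matching only gives sensible pairings when corpus and lexicon are a good match; with a poorly fitting lexicon, a similarity threshold alone will pair unrelated words.
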