Computer Lexica in OCR and Retrieval Katrien Depuydt (Instituut voor Nederlandse Lexicologie, Leiden)
Overview IMPACT <Demo Day BL, 12 July 2011>
What is a computer lexicon?
Computer lexicon vs electronic dictionary (1) An electronic dictionary is:
Dictionary XML (example)
Computer Lexicon vs Electronic Dictionary (2)
Lexica in IMPACT
The OCR lexicon An OCR lexicon is
OCR lexicon: example
1550-1750: song 820, rihte 818, theire 818, manye 818, sume 815, Do 814, Whiche 811, fyrst 811, while 811, Water 810, wt 809, shalbe 808, thingis 807, again 806, sona 806, wa 805, mode 804, work 802, between 801, law 799, moder 798, mis 798, softe 798
> 1900: television 418, electronic 375, video 194, hormone 176, jazz 162, eco 142, software 136, vitamin 128, movie 121, taxi 113, isotopic 108, electronics 95, radar 86, basically 71, sabotage 71, homozygote 70, psychedelic 67, phonemic 66, insulin 64, zap 64, antibody 61, fungicidal 61
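The slide does not say how such lists are produced. Assuming the numbers next to each word are frequency counts taken from period-specific text collections, a minimal sketch of the idea could look like this. The toy token lists and the helper names `build_frequency_lexicon` and `period_specific` are illustrative assumptions, not part of the IMPACT tooling.

```python
from collections import Counter


def build_frequency_lexicon(tokens):
    """Count how often each word form occurs in a list of tokens."""
    return Counter(tokens)


def period_specific(lexicon, other):
    """Word forms attested in one period's lexicon but absent from the other."""
    return sorted(w for w in lexicon if w not in other)


# Hypothetical toy corpora standing in for period-specific text collections.
early_tokens = "the song of manye men and the song of sume".split()
modern_tokens = "the television and the video and the software".split()

early_lexicon = build_frequency_lexicon(early_tokens)
modern_lexicon = build_frequency_lexicon(modern_tokens)

print(early_lexicon.most_common(3))
print(period_specific(modern_lexicon, early_lexicon))
```

Keeping one lexicon per period means a word such as "television" is only offered to the OCR engine for material in which it can plausibly occur.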
The IR lexicon
<?xml version='1.0'?>
<!DOCTYPE lexicon SYSTEM 'NL_Structure.dtd'>
<lexicon>
  <lexical_entry>
    <lemma_id>219490</lemma_id>
    <modern_lemma>aantuilen</modern_lemma>
    <gloss></gloss>
    <POS>VRB</POS>
    <ne_label></ne_label>
    <language_id></language_id>
    <portmanteau_lemma_id></portmanteau_lemma_id>
    <wordform>
      <form_representation>
        <wordform_id>850026</wordform_id>
        <written_form>tuyld</written_form>
        <attestation>
          <id>92141</id>
          <token_id></token_id>
          <quote>Verhael ick (<I>t.w. een als vrouw verkleede man</I>) haer mijn min in Vrouwelijcker schynen: Sy acht het boertery, en tuyld daer weer op an, Vermits een Vrou niet op een Vrou verlieven kan,</quote>
          <derivation_id>0</derivation_id>
          <document_id>204</document_id>
          <start_pos>119</start_pos>
          <end_pos>124</end_pos>
        </attestation>
      </form_representation>
    </wordform>
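For retrieval, the useful part of such an entry is the link from an attested historical form to its modern lemma. The sketch below reads a cut-down, well-formed version of the entry with Python's standard `xml.etree.ElementTree` and builds that mapping. The element names come from the slide; the trimmed XML string, the closing tags and the function name `variant_index` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Cut-down version of the slide's entry, closed off so that it parses.
ENTRY_XML = """
<lexicon>
  <lexical_entry>
    <lemma_id>219490</lemma_id>
    <modern_lemma>aantuilen</modern_lemma>
    <POS>VRB</POS>
    <wordform>
      <form_representation>
        <wordform_id>850026</wordform_id>
        <written_form>tuyld</written_form>
      </form_representation>
    </wordform>
  </lexical_entry>
</lexicon>
"""


def variant_index(xml_text):
    """Map each attested written form to the modern lemma of its entry."""
    root = ET.fromstring(xml_text)
    index = {}
    for entry in root.iter("lexical_entry"):
        lemma = entry.findtext("modern_lemma", default="").strip()
        for form in entry.iter("written_form"):
            index[(form.text or "").strip()] = lemma
    return index


index = variant_index(ENTRY_XML)
print(index)
```

With an index like this, a query for the modern lemma can be expanded to the historical forms that point to it, and a historical form found in a text can be reported under its modern lemma.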
Tools for lexicon building and application of lexica
Types of variation (spelling, inflection…)
uytterlijcste uyterlijkste d'uyterlijke uiterlyke uyterlijcke uiterlijke uyterlijck uiterlyken uiterlijkste uiterlicke wterlicke wterlijcke ulterlijk uiterlyk uiterlijk uyterlick wterlicken d'uyterlijcke uiterlijken uiterlijks wterlijck uytterlicke uitterlijke ujterlijke uytterlijk uyterlycke uyterlicken uijterlicke d'uiterlijcke wtterlijcke wterlyke wtterlijk uuterlick uuterlic uyterlijke uyterlijcken uyterlicke d'uiterlyke wterlijke vuyterlijcke uuterlycke uuterlicke wterlijken uyterlijcksten uuyterlicke uuyterlick uuyterlycke uytterlijcke uytterlycke uytterlick vuytterlicke uiterlijker uyterlyck uterliek wterlijcken uiterlijkst uitterlijk uytterlijcken uyterlyk wterlick uutterlijck uuyterlicken uyttelijck uijterlijk uytterlijck uuterlijck uiterlick uitterlyk uuyterlic uuyterlyck uuyterlijck uiterlijck uytterlyck uterlyc wterlijk
I
werelt weerelt wereld weerelds wereldt werelden weereld werrelts waerelds weerlyt wereldts vveerelts waereld weerelden waerelden weerlt werlt werelds sweerels zwerlys swarels swerelts werelts swerrels weirelts tsweerelds werret vverelt werlts werrelt worreld werlden wareld weirelt weireld waerelt werreld werld vvereld weerelts werlde tswerels werreldts weereldt wereldje waereldje weurlt wald weëled
II
Patterns to predict variation: a number of variants are predictable with patterns; others need to be taken from a lexicon.
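The split between predictable and unpredictable variants can be sketched as a two-step lookup: first apply rewrite patterns and check the result against a modern word list, then fall back to an explicit list of attested variants. The four rules, the two tiny lexica and the function name `normalise` below are illustrative assumptions only; they are not the patterns used in IMPACT.

```python
import re

# Hypothetical rewrite rules, applied in this order.
RULES = [
    (r"^wt", "uit"),  # wterlijck -> uiterlijck
    (r"uy", "ui"),    # uyterlijck -> uiterlijck
    (r"ck", "k"),     # uiterlijck -> uiterlijk
    (r"y", "ij"),     # uiterlyk -> uiterlijk
]

# Toy stand-ins for a modern word list and a lexicon of attested variants.
MODERN_LEXICON = {"uiterlijk", "wereld"}
VARIANT_LEXICON = {"werelt": "wereld", "weurlt": "wereld"}


def normalise(word):
    """Return the modern form of a historical word, or None if unknown."""
    candidate = word
    for pattern, replacement in RULES:
        candidate = re.sub(pattern, replacement, candidate)
    if candidate in MODERN_LEXICON:
        return candidate              # predictable with patterns
    return VARIANT_LEXICON.get(word)  # has to be taken from a lexicon


for form in ["uyterlijck", "wterlijck", "uiterlyk", "werelt", "weurlt"]:
    print(form, "->", normalise(form))
```

In this toy setup the "uiterlijk" variants are resolved by the rules alone, while "werelt" and "weurlt" are only resolved because they are listed explicitly.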
Neil Fitzgerald, 7th July 2011
Computer lexica
Tools (more specific)
Ordinary words vs Names (NEs)
A number of results for Dutch and German
Ground truth data: Dutch

| Type and genre | # words |
|---|---|
| Gold Standard Book | 300k |
| Random Set Books | 340k |
| Random Set Staten Generaal (Legal Papers) | 2.5M |
| Gold Standard Staten Generaal | 500k |
| Gold Standard Newspapers 1 | 3.4M |
| Gold Standard Newspapers 2 | 170k |
| Random Set Newspapers | 3.2M |
| total | 13.1M |
Lexicon coverage (1: ground truth books)

| Lexicon | Type coverage | Token coverage |
|---|---|---|
| Modern lexicon (e-Lex) | 46% | 76% |
| Core general lexicon | 56% | 84% |
| 1 + 2 | 63% | 89% |
| Expansion with corpus material | 78% | 95% |

Lexicon coverage (2: GT newspapers 18th-19th C.)

| Lexicon | Type coverage | Token coverage |
|---|---|---|
| Modern lexicon (e-Lex) | 40% | 83% |
| Core general lexicon | 41% | 84% |
| 1 + 2 | 51% | 89% |
| Expansion with corpus material | 62% | 95% |

Lexicon coverage (3: GT Staten Generaal 19th C.)

| Lexicon | Type coverage | Token coverage |
|---|---|---|
| Modern lexicon (e-Lex) | 51% | 89% |
| Core general lexicon | 47% | 88% |
| 1 + 2 | 58% | 93% |
| Expansion with corpus material | 68% | 97% |

Lexicon coverage (4: GT Staten Generaal 20th C.)

| Lexicon | Type coverage | Token coverage |
|---|---|---|
| Modern lexicon (e-Lex) | 70% | 93% |
| Core general lexicon | 66% | 93% |
| 1 + 2 | 76% | 96% |
| Expansion with corpus material | 81% | 98% |

Lexicon coverage (5: Genesis, 1637 bible)

| Lexicon | Type coverage | Token coverage |
|---|---|---|
| Modern lexicon (e-Lex) | 31% | 61% |
| Core lexicon | 62% | 83% |
| 1 + 2 | 65% | 89% |
| Expansion with corpus material | 87% | 98.6% |

Lexicon coverage (6: P.C. Hooft, histories)

| Lexicon | Type coverage | Token coverage |
|---|---|---|
| Modern lexicon (e-Lex) | 26% | 67% |
| Core lexicon | 47% | 88% |
| 1 + 2 | 50% | 90% |
| Expansion with corpus material | 58% | 96% |
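The slides do not define the two measures. Under the usual reading, type coverage is the share of distinct word forms found in the lexicon and token coverage is the share of running words found in it. A minimal sketch under that assumption, with a made-up token list and lexicon:

```python
def coverage(tokens, lexicon):
    """Return (type_coverage, token_coverage) as fractions between 0 and 1."""
    types = set(tokens)
    type_cov = sum(1 for t in types if t in lexicon) / len(types)
    token_cov = sum(1 for t in tokens if t in lexicon) / len(tokens)
    return type_cov, token_cov


# Hypothetical toy text: one very frequent word and a few rarer forms.
tokens = ["de", "de", "de", "werelt", "de", "wereld", "uyterlijck", "de"]
lexicon = {"de", "wereld"}

type_cov, token_cov = coverage(tokens, lexicon)
print(f"type coverage {type_cov:.0%}, token coverage {token_cov:.0%}")
```

Here two of the four distinct forms are covered but six of the eight running words are, which is the same pattern as in the tables: frequent words tend to be in the lexicon, so token coverage is higher than type coverage.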
Evaluation of OCR
OCR results: word recognition rate

| Dataset | With ABBYY internal Dutch lexicon | With IMPACT lexicon for Dutch (case hyphenation) | With IMPACT lexicon for Dutch (case hyphenation + long S problem) |
|---|---|---|---|
| DPO35 | 88.8% | 90.9% | 93.5% |
An example:
OCR at the beginning of the project: A. De eerde was de gevaarlykflti om de verlei¬ ding aan 't Hof; de tweede de ftillie en veiligde; de derde de zwaarde, daar hy byna drie millioenen harde en onbefchaafde Menfchen beftieren moest.
Results: A. De eerste was de gevaarlykste om de verlei- ding aan 't Hof; de tweede de stilste en veiligste; de derde de zwaarste, daar hy byna drie millioenen harde en onbeschaafde Menschen bestieren moest.
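Several errors in the first version, such as "onbefchaafde" and "beftieren", come from the long s being read as an f. One simple way a lexicon can help with that, sketched here as an assumption and not as the method used by ABBYY or IMPACT, is to try reading each f as an s and accept the result only if exactly one candidate is a known word. The toy lexicon and the function names are hypothetical.

```python
from itertools import product

# Toy lexicon of lower-cased word forms, standing in for a historical lexicon.
LEXICON = {"onbeschaafde", "menschen", "bestieren", "hof"}


def long_s_candidates(word):
    """Yield every spelling obtained by reading some 'f' characters as 's'."""
    options = [("f", "s") if ch == "f" else (ch,) for ch in word]
    for combo in product(*options):
        yield "".join(combo)


def correct_long_s(word, lexicon):
    """Replace f by s where that turns an unknown word into a known one."""
    if word in lexicon:
        return word
    matches = [c for c in long_s_candidates(word) if c in lexicon]
    # Only correct when the lexicon leaves no ambiguity.
    return matches[0] if len(matches) == 1 else word


for form in ["onbefchaafde", "menfchen", "beftieren", "hof", "gevaarlykflti"]:
    print(form, "->", correct_long_s(form, LEXICON))
```

A form like "gevaarlykflti", where other characters are misread as well, stays unchanged in this sketch; the number of candidates also doubles with every f in the word, which is acceptable for single words but not for longer strings.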
| Dictionary | 16th century: no. of word errors | 16th century: reduction of error rate | 18th century: no. of word errors | 18th century: reduction of error rate | 19th century: no. of word errors | 19th century: reduction of error rate |
|---|---|---|---|---|---|---|
| No Lexicon | 1306 | - | 827 | - | 2074 | - |
| Optimal Lexicon | 756 | 42% | 395 | 52% | 612 | 70% |
| Modern Lexicon | 1096 | 16% | 501 | 39% | 888 | 57% |
| W.Historical Lexicon | 938 | 28% | 481 | 42% | 856 | 59% |
| Modern + Virtual H.L. | 1011 | 25% | 480 | 42% | 849 | 59% |
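The slide does not spell out how the reduction of the error rate is computed. The Optimal Lexicon figures are consistent with a reduction measured relative to the No Lexicon error count, so the sketch below assumes that reading; the function name `error_reduction` is illustrative.

```python
def error_reduction(baseline_errors, errors):
    """Relative reduction in word errors compared with the baseline run."""
    return (baseline_errors - errors) / baseline_errors


# (No Lexicon, Optimal Lexicon) word-error counts per century, from the table.
counts = {
    "16th century": (1306, 756),
    "18th century": (827, 395),
    "19th century": (2074, 612),
}

for period, (baseline, with_lexicon) in counts.items():
    print(period, f"{error_reduction(baseline, with_lexicon):.0%}")
```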
Languages in IMPACT
English in IMPACT
An indemnity shall be granted to the surfer…. … bikini …
Retrieval demonstrator

Language Tools for OCR with Katrien Depuydt


Editor's notes

  1. This presentation is based on how the INL works with language. An electronic dictionary is not what we need for OCR and simple retrieval, but it is introduced anyway because we can (and do) use our dictionaries for lexicon construction.
  2. This is what an XML-based electronic dictionary looks like.
  3. This is the XML of the Oxford English Dictionary. The horizontal lines mark a place where part of the structure has been folded in.
  4. <ed> We need further explanation of what 'lemma', 'part of speech' and 'morphology' mean. Lemma: the headword, like the entry in an ordinary dictionary. Morphology: morphological analysis is done for compounds and derivatives, i.e. which parts are to be distinguished in a word, e.g. apple pie: apple + pie.
  5. This is a small part of a computational lexicon (of a certain type; there are many types of computational lexica).
  6. <ed> Again, unsure of what LEMMA means. Be, was, am, is, etc. are all forms of the same word BE (and that is an example of a lemma).
  7. Two types of variation, with examples for Dutch from the lexicon.
  8. To give an indication of possible spelling variants of the word 'world' for English, a screenshot from the OED online...
  9. These are some of the ways in which we are using computer lexica as building blocks.
  10. The
  11. The
  12. The
  13. The
  14. The
  15. The
  16. These are results with a rather limited historical lexicon of German.
  17. Computational Natural Language Learning
  18. 322445 (fourth column, in the middle) 424979