TR5 Profiler and Post-Correction System  Ludwig-Maximilians-Universität München Centrum für Informations- und Sprachverarbeitung
TR5 Post-Correction System
Customizable user interface: OCR and image fragments; correction candidates; special functions; complete image; font size
View: OCR and image clippings
View: Original image
Word by word correction of text
Batch correction: efficient postcorrection
Postcorrection system: Evaluation
Ulrich Reffle, 4 July 2011
Correction system
Why another postcorrection system?
Underlying language technology
Text and error profiles
Patterns; OCR errors
Historical variant and OCR error patterns
Historical variants: teil -> theil
OCR error patterns: theil -> iheil
Relative frequency: 2.9% of all ‘t’ are rewritten to ‘th’  Absolute frequency: Pattern was found 120 times in the current document.
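The two frequencies above can be illustrated with a small sketch. It assumes that each token has already been paired with its modern form and with the list of patterns that explain it; the function name `pattern_frequencies`, the input layout and the sample words are hypothetical and are not taken from the Profiler itself.

```python
from collections import Counter

def pattern_frequencies(tokens):
    """tokens: list of (modern_word, [(left, right), ...]) pairs.

    Returns (absolute, relative):
      absolute[(left, right)] = how often the pattern was applied
      relative[(left, right)] = applications / occurrences of `left`
                                in all modern words (assumed definition)
    """
    absolute = Counter()
    for _, patterns in tokens:
        absolute.update(patterns)

    relative = {}
    for (left, right), count in absolute.items():
        # Every occurrence of `left` is a place where the pattern could apply.
        opportunities = sum(modern.count(left) for modern, _ in tokens)
        relative[(left, right)] = count / opportunities if opportunities else 0.0
    return absolute, relative

# Hypothetical sample: only "teil" is written with the variant t -> th.
sample = [("teil", [("t", "th")]), ("tat", []), ("rot", [])]
absolute, relative = pattern_frequencies(sample)
print(absolute[("t", "th")], relative[("t", "th")])
```

In this toy sample the pattern is applied once and the letter 't' occurs four times in the modern forms, so the relative frequency is one quarter.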
Occurrences of spelling variant “i->y”: +0.999771
Occurrences of OCR error “i->y”: +0.000224948
Computation of profile: initialization
[Diagram: OCR result w0, w1, w2, w3, …; initial global profile]
Computation of profile: global to local
[Diagram: OCR result w0, w1, w2, w3, …; initial global profile; local profile with entries (… -> … -> …) for each token w0 to w3]
Computation of profile: local to global
[Diagram: OCR result w0, w1, w2, w3, …; local profile for each token w0 to w3; global profile]
Computation of profile: iteration
[Diagram: OCR result w0, w1, w2, w3, …; local profile and global profile]
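The four steps (initialization, global to local, local to global, iteration) can be sketched as a simple reweighting loop. This is only an illustration under strong assumptions, not the Profiler's actual algorithm: candidate interpretations are given up front, an interpretation is scored by the product of its pattern weights, and the names `local_profile`, `aggregate_global`, `compute_profile` and `FLOOR` are all hypothetical.

```python
from collections import defaultdict

FLOOR = 1e-6  # assumed weight for patterns not yet in the global profile

def local_profile(candidates, global_profile):
    """Global to local: weight each interpretation of one token."""
    scores = []
    for _, patterns in candidates:
        score = 1.0
        for pattern in patterns:
            score *= global_profile.get(pattern, FLOOR)
        scores.append(score)
    total = sum(scores)
    return [(word, patterns, score / total)
            for (word, patterns), score in zip(candidates, scores)]

def aggregate_global(local_profiles):
    """Local to global: sum the weighted pattern counts and normalize."""
    counts = defaultdict(float)
    for ranked in local_profiles:
        for _, patterns, weight in ranked:
            for pattern in patterns:
                counts[pattern] += weight
    total = sum(counts.values())
    return {p: c / total for p, c in counts.items()} if total else {}

def compute_profile(tokens, iterations=5):
    """Start from an empty global profile and alternate the two steps."""
    profile, local_profiles = {}, []
    for _ in range(iterations):
        local_profiles = [local_profile(c, profile) for c in tokens]
        profile = aggregate_global(local_profiles)
    return profile, local_profiles

# Hypothetical candidates: (modern word, patterns explaining the OCR token).
T_TH = ("hist", "t", "th")
H_TH = ("ocr", "h", "th")
tokens = [
    [("teil", (T_TH,)), ("heil", (H_TH,))],  # token "theil": two readings
    [("tun", (T_TH,))],                      # token "thun"
    [("rat", (T_TH,))],                      # token "rath"
]
profile, local_profiles = compute_profile(tokens)
print(profile)
```

In this toy input the ambiguous token starts with two equally weighted readings. Because the pattern t -> th also explains the other tokens, its global weight grows with each pass and the ambiguous token is increasingly read as a historical variant.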
Evaluation: Measures
(1) Global profiles: percentage of matches for the first 10 patterns in the ranked output lists. Two values: historical patterns, OCR patterns.
(2) OCR error detection: precision and recall for the OCR errors detected by the Profiler.
(3) Indirect evaluation (for instance, by means of the postcorrection system).
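Measures (1) and (2) can be sketched as follows. The slides do not define how a "match" among the first 10 patterns is counted, so `top_k_match` reflects one plausible reading (the share of the top-k gold patterns that also appear in the predicted top k); both function names and the sample data are hypothetical.

```python
def top_k_match(predicted_ranked, gold_ranked, k=10):
    """Percentage of the top-k gold patterns found in the predicted top k."""
    top_gold = set(gold_ranked[:k])
    if not top_gold:
        return 0.0
    hits = sum(1 for pattern in predicted_ranked[:k] if pattern in top_gold)
    return 100.0 * hits / len(top_gold)

def precision_recall(detected, actual):
    """Precision and recall for a set of token positions flagged as OCR errors."""
    detected, actual = set(detected), set(actual)
    true_positives = len(detected & actual)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall

# Hypothetical data: token positions flagged by a detector vs. real errors.
precision, recall = precision_recall({1, 2, 3, 4}, {2, 3, 5, 6, 7, 8})
print(precision, recall)
print(top_k_match(["t->th", "i->y", "u->v"], ["t->th", "u->v", "ei->ey"], k=3))
```

High precision with low recall, as reported in the results for the shallow evaluation, means that most flagged tokens are real errors while many real errors go unflagged.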
Evaluation: Data preparation
(1) Deep evaluation: for each token of the evaluation document, the historical interpretation and the OCR interpretation have been manually annotated.
    ++ fully accurate
    -- manual work
(2) Shallow evaluation: the OCR'ed document is automatically aligned with its re-typed ground truth; for each token of the evaluation document, the historical and the OCR interpretations are automatically assigned from the ground truth.
    ++ no manual work
    -- not completely accurate
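The automatic alignment in the shallow evaluation can be approximated with Python's standard `difflib` module. This is a minimal sketch assuming whitespace-tokenized text; the function name `align_tokens` and the sample sentence are hypothetical, and the slides do not say which alignment method was actually used.

```python
from difflib import SequenceMatcher

def align_tokens(ocr_tokens, truth_tokens):
    """Pair OCR tokens with ground-truth tokens.

    Returns (ocr_token, truth_token, is_error) triples. Only equal runs and
    same-length replacements are paired; insertions, deletions and uneven
    replacements are skipped, which is one reason such an alignment is not
    completely accurate.
    """
    matcher = SequenceMatcher(a=ocr_tokens, b=truth_tokens, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            pairs.extend((o, t, False)
                         for o, t in zip(ocr_tokens[i1:i2], truth_tokens[j1:j2]))
        elif tag == "replace" and (i2 - i1) == (j2 - j1):
            pairs.extend((o, t, True)
                         for o, t in zip(ocr_tokens[i1:i2], truth_tokens[j1:j2]))
    return pairs

ocr = "der iheil des buches".split()
truth = "der theil des buches".split()
for ocr_token, truth_token, is_error in align_tokens(ocr, truth):
    print(ocr_token, truth_token, is_error)
```

Each pair marked as an error gives the OCR interpretation of a token; the historical interpretation would still have to be derived from the ground-truth token, for example with a lexicon of modern forms.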
Evaluation: Data
Deep: Eckartshausen (100 pages), Briefkunst (40 pages)
Shallow: 5 books each from the 16th, 17th and 18th century
Evaluation: Eckartshausen
Graphical Evaluation: Eckartshausen
Graphical Evaluation: diacritics Hist. Var. OCR
Shallow Evaluation Results
                                              16th            17th            18th
HIST Patterns, first 10                       60%             74%             78%
OCR Patterns, first 10                        48%             70%             50%
Error Detection Precision                     95%             92%             81%
Error Detection Recall                        49%             43%             45%
Content Words Errors                          64%             44%             16%
Easy Interactive Correction per 10,000 words  ≈ 3000 words    ≈ 1892 words    ≈ 720 words

BL Demo Day - July2011 - (7) OCR Profiler and Post-Correction


Editor's notes

  1. DictModule name=“modern” File=“../dicts/modern.dic” max_ocr_errors=3 max_spelling_variants