Automatic OCR correction http://overproof.projectcomputing.com
Correcting noisy OCR
- Context beats Confusion
[ presentation viewable at http://goo.gl/n85gR6 ]
who are we?
● Australian software company
● developers John and Kent
● we put theory into practice
main target - newspapers
● the first draft of history
● popular if made available
● usually poorly digitized
● too extensive for full human correction
goals
● run on commodity cloud server
● optimal for noisy text
● at least 1000 words/sec
● correct at least 50% of errors
division of labour
[diagram labels: MANAGER, TRIAGE; CORE; models (twice); good; bad]
snippets for the core
● prefer triaged good words at start/end
● column aware
● some easy corrections applied
● some suggestions supplied
● bag of topic words available
● surrounding noise level indicated
error contexts
● spell: vowals or consonnants
● type: you jit teh wrng key
● OCR: roprcroiitativcs cf thc Coveriuient
● random: anygh<eg 0at7happen
confusion cost matrix
93: w ← w
155: e ← e
3750: c ← e
4451: m ← rn
6652: rn ← m
11065: E ← m
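A minimal sketch of how a cost table like this could score one word pair. The six costs are copied from the slide; reading each entry as (OCR output ← intended text), the two default costs, and the function names are assumptions made for illustration, not the Overproof implementation.

```python
from functools import lru_cache

# Costs from the slide, keyed as (left of arrow, right of arrow).
# Assumption: the left side is the OCR output, the right side the intended text.
CONFUSION_COST = {
    ("w", "w"): 93,
    ("e", "e"): 155,
    ("c", "e"): 3750,
    ("m", "rn"): 4451,
    ("rn", "m"): 6652,
    ("E", "m"): 11065,
}
DEFAULT_MATCH = 200    # hypothetical cost of an identity pair missing from the table
DEFAULT_EDIT = 12000   # hypothetical cost of any other single-character edit
MAX_SEG = 2            # longest segment considered on either side


def segment_cost(ocr_seg, true_seg):
    """Cost of reading true_seg as ocr_seg, or None if the pair is not allowed."""
    if (ocr_seg, true_seg) in CONFUSION_COST:
        return CONFUSION_COST[(ocr_seg, true_seg)]
    if len(ocr_seg) == 1 and ocr_seg == true_seg:
        return DEFAULT_MATCH
    if len(ocr_seg) <= 1 and len(true_seg) <= 1:
        return DEFAULT_EDIT  # substitution, insertion or deletion
    return None              # multi-character pairs must be listed explicitly


def error_cost(ocr, truth):
    """Cheapest segmentation of (ocr, truth) into confusion pairs."""
    @lru_cache(maxsize=None)
    def best(i, j):
        if i == len(ocr) and j == len(truth):
            return 0
        options = []
        for di in range(MAX_SEG + 1):
            for dj in range(MAX_SEG + 1):
                if di == dj == 0 or i + di > len(ocr) or j + dj > len(truth):
                    continue
                cost = segment_cost(ocr[i:i + di], truth[j:j + dj])
                if cost is not None:
                    options.append(cost + best(i + di, j + dj))
        return min(options)
    return best(0, 0)


# rn|m, then o r n i n as default matches, then y|g as a default edit
print(error_cost("rnorniny", "morning"))  # 6652 + 5*200 + 12000 = 19652
```

Allowing segments longer than one character is what lets "rn" be charged as a single confusion with "m" instead of a substitution plus a deletion.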
word cost (eg rnorniny|morning)
language cost
● lexicon frequency
● entity list
● rare word list
● character 4-gram
error cost
● edit sum
● visual correlation
● generator hint
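A rough sketch of how a language cost and an error cost might be combined into one word cost. The toy lexicon counts, the simple addition of the two costs, the weight, and the use of plain Levenshtein distance as a stand-in for the "edit sum" are all assumptions; entity lists, rare word lists, visual correlation and generator hints are left out.

```python
import math
from collections import Counter

# Toy lexicon counts, invented for illustration.
LEXICON = Counter({"morning": 900, "mourning": 40, "warning": 300, "moaning": 10})
TOTAL = sum(LEXICON.values())


def char_4grams(word):
    padded = "^^^" + word + "$"
    return [padded[i:i + 4] for i in range(len(padded) - 3)]


# Character 4-gram counts, weighted by word frequency.
GRAMS = Counter()
for _word, _n in LEXICON.items():
    for _gram in char_4grams(_word):
        GRAMS[_gram] += _n
GRAM_TOTAL = sum(GRAMS.values())


def language_cost(word):
    """Negative log probability: lexicon frequency if known, else 4-gram fallback."""
    if word in LEXICON:
        return -math.log(LEXICON[word] / TOTAL)
    vocab = len(GRAMS) + 1  # add-one smoothing over seen 4-grams plus "unseen"
    return sum(-math.log((GRAMS[g] + 1) / (GRAM_TOTAL + vocab))
               for g in char_4grams(word))


def edit_distance(a, b):
    """Plain Levenshtein distance (stand-in for a confusion-weighted edit sum)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


EDIT_WEIGHT = 1.5  # hypothetical trade-off between the two costs


def word_cost(ocr, candidate):
    return language_cost(candidate) + EDIT_WEIGHT * edit_distance(ocr, candidate)


def rank(ocr, candidates):
    return sorted(candidates, key=lambda c: word_cost(ocr, c))


print(rank("rnorniny", ["mourning", "warning", "morning"]))
```

With these toy numbers "morning" ranks first because it is both the most frequent candidate and the closest in edits.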
word character confusion
m  o r n i n g
rn o r n i n y
visual correlation
suggestion methods
● gift
● common, cached
● language
● entities
● split/join
● generated (magic)
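The split/join method is the easiest of these to illustrate. A minimal sketch, assuming a small invented lexicon and hypothetical function names: a token is split where both halves are known words, and adjacent tokens are joined when the result is a known word.

```python
# Tiny lexicon, invented for illustration.
LEXICON = {"the", "government", "of", "new", "south", "wales", "in"}


def split_suggestions(token, lexicon=LEXICON):
    """All ways to cut token into two lexicon words."""
    return [(token[:i], token[i:])
            for i in range(1, len(token))
            if token[:i] in lexicon and token[i:] in lexicon]


def join_suggestions(tokens, lexicon=LEXICON):
    """(index, joined word) for each adjacent pair that forms a lexicon word."""
    return [(i, tokens[i] + tokens[i + 1])
            for i in range(len(tokens) - 1)
            if tokens[i] + tokens[i + 1] in lexicon]


print(split_suggestions("ofthe"))                     # [('of', 'the')]
print(join_suggestions(["the", "govern", "ment"]))    # [(1, 'government')]
```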
searching for gold (A*)
[figure: A* search tree over candidate characters]
purple nodes: working priority queue
red nodes: output priority queue
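The slide names A* and two priority queues. Below is a minimal sketch of best-first candidate generation over a lexicon trie, assuming invented unit costs and a two-entry confusion list. The single heap plays the role of the working queue, and finished words are returned in cost order rather than held in a second queue.

```python
import heapq
from itertools import count

MATCH, SUB, DEL, INS = 1, 10, 10, 10            # hypothetical unit costs
CONFUSIONS = [("rn", "m", 3), ("y", "g", 3)]    # (ocr, truth, cost), hypothetical
END = ""                                        # trie key marking a complete word


def build_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = word
    return root


def walk(node, text):
    """Follow text down the trie; None if the path does not exist."""
    for ch in text:
        node = node.get(ch)
        if node is None:
            return None
    return node


def suggestions(ocr, trie, limit=3):
    """Return up to `limit` (word, cost) pairs, cheapest first."""
    tie = count()  # tie-breaker so the heap never compares trie nodes
    # Heuristic: every remaining OCR character costs at least MATCH to consume.
    heap = [(len(ocr) * MATCH, next(tie), 0, 0, trie)]
    done, out = set(), []
    while heap and len(out) < limit:
        _, _, g, i, node = heapq.heappop(heap)
        if (id(node), i) in done:
            continue
        done.add((id(node), i))
        if i == len(ocr) and END in node:
            out.append((node[END], g))

        def push(cost, ni, nnode):
            ng = g + cost
            heapq.heappush(heap, (ng + (len(ocr) - ni) * MATCH, next(tie), ng, ni, nnode))

        for ch, child in node.items():
            if ch == END:
                continue
            push(INS, i, child)                            # OCR dropped this character
            if i < len(ocr):
                push(MATCH if ch == ocr[i] else SUB, i + 1, child)
        if i < len(ocr):
            push(DEL, i + 1, node)                         # OCR added a spurious character
        for ocr_seg, true_seg, cost in CONFUSIONS:
            if ocr.startswith(ocr_seg, i):
                nxt = walk(node, true_seg)
                if nxt is not None:
                    push(cost, i + len(ocr_seg), nxt)
    return out


trie = build_trie(["morning", "mourning", "warning", "corn"])
print(suggestions("rnorniny", trie, limit=2))  # [('morning', 11), ('mourning', 21)]
```

Because every move that consumes k OCR characters costs at least k × MATCH, the heuristic never overestimates, so the first time a complete word is popped its cost is the cheapest one under these toy costs.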
amazing generated suggestions
Parhumuitar} ← Parliamentary
I.iulwuvB ← Railways
Itegtniont ← Regiment
niltfltory ← adultery
uj.rccu.eut ← agreement
couniutfc.o ← committee
cnuipuii ← company
dctoimiuatJOu ← determination
uiidcrtkikcr’a ← undertaker’s
selecting best combination
unsiejitlv: unsightly, unseemly, unsettle, unsteady, Unsightly, urgently
bohavlour: behaviour, behavour, behavior, Behaviour, behaviours, behaving
abonf: about, above, along, been
am: am, an, a, in, as
unsiejitlv: unsightly, unseemly, unsettle, unsteady, Unsightly, urgently
disgrie: disgrace, disagree, disguise, desire, degree, disease
[NOTE: word joins and splits are also supported]
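One common way to pick the best combination from candidate columns like these is a Viterbi-style dynamic programme over word costs and word-pair (bigram) costs. The sketch below assumes that approach; the deck does not say which method is used, and all costs and names here are invented.

```python
def best_combination(columns, bigram_cost):
    """columns: one list of (word, word_cost) per OCR token.
    Returns (total_cost, [chosen words]) with the lowest total cost."""
    # best[word] = (cheapest cost of any path ending in word, that path)
    best = {word: (cost, [word]) for word, cost in columns[0]}
    for column in columns[1:]:
        nxt = {}
        for word, cost in column:
            prev_word, (prev_cost, path) = min(
                best.items(),
                key=lambda kv: kv[1][0] + bigram_cost(kv[0], word),
            )
            nxt[word] = (prev_cost + bigram_cost(prev_word, word) + cost, path + [word])
        best = nxt
    return min(best.values(), key=lambda v: v[0])


# Toy costs: "unsightly" is the cheaper word on its own, but the pair
# "unseemly behaviour" is far more likely in context.
columns = [
    [("unsightly", 2), ("unseemly", 3)],
    [("behaviour", 1), ("behaving", 4)],
]
PAIR_COST = {("unseemly", "behaviour"): 0.5}


def bigram_cost(a, b):
    return PAIR_COST.get((a, b), 3)  # 3 = hypothetical cost of an unseen pair


print(best_combination(columns, bigram_cost))  # (4.5, ['unseemly', 'behaviour'])
```

The example is built so that context overrides the cheaper single-word choice, which is the point of the talk's subtitle.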
training
● 5-grams - subset selection
● corpus 1,2,3-grams - statistical build
● extra word lists - easy
● error model - bootstrap or new pairs
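As an illustration of building an error model from "pairs", the sketch below counts character confusions from (OCR, ground truth) word pairs using Python's `difflib.SequenceMatcher`. The use of difflib for alignment, the pair format and the function name are assumptions; the deck does not describe how its error model is trained.

```python
from collections import Counter
from difflib import SequenceMatcher


def confusion_counts(pairs):
    """Count (ocr_segment, true_segment) confusions over (ocr, truth) word pairs."""
    counts = Counter()
    for ocr, truth in pairs:
        matcher = SequenceMatcher(None, truth, ocr)
        for op, t0, t1, o0, o1 in matcher.get_opcodes():
            if op == "equal":
                for ch in truth[t0:t1]:
                    counts[(ch, ch)] += 1      # correctly recognised character
            else:
                # replace / delete / insert, kept as whole segments
                counts[(ocr[o0:o1], truth[t0:t1])] += 1
    return counts


counts = confusion_counts([("rnorning", "morning"), ("thc", "the")])
print(counts[("rn", "m")], counts[("c", "e")], counts[("n", "n")])  # 1 1 2
```

Counts like these could then be turned into costs, for example as scaled negative log frequencies, though that step is not shown here.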
testing
● 65,000 words of ground truth, including foreign (US) newspapers
● all measures exceeded goal:
○ search errors (article word types)
○ read errors (article word tokens)
○ entropy weighted term errors
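The first two measures differ in whether each distinct word counts once per article (types, which matter for search) or every occurrence counts (tokens, which matter for reading). The formulas below are one plausible reading of that distinction, not the evaluation code behind the reported figures; the entropy-weighted measure is not attempted.

```python
from collections import Counter


def type_recall(truth_tokens, ocr_tokens):
    """Share of distinct ground-truth words that also appear in the OCR text."""
    truth_types, ocr_types = set(truth_tokens), set(ocr_tokens)
    return len(truth_types & ocr_types) / len(truth_types)


def token_error_rate(truth_tokens, ocr_tokens):
    """Share of ground-truth word occurrences with no matching OCR occurrence."""
    missing = Counter(truth_tokens) - Counter(ocr_tokens)
    return sum(missing.values()) / len(truth_tokens)


truth = "the cat sat on the mat".split()
ocr = "the cat sat on thc rnat".split()
print(type_recall(truth, ocr))       # 4 of 5 distinct words found: 0.8
print(token_error_rate(truth, ocr))  # 2 of 6 occurrences lost
```

Note that "the" still counts as found for search even though one of its two occurrences was misread.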
SMH sample
                      Before   After
Recall                83.8%    94.1%   recall misses reduced 63.3%
Raw Error Rate        18.5%     5.5%   errors reduced 70.1%
Weighted Error Rate   16.2%     6.7%   weighted errors reduced 59.4%
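The "reduced" column reads as a relative reduction, (before - after) / before, with recall first converted to a miss rate. A small check under that assumption; recomputing from the rounded percentages gives values close to, but not identical with, the slide's figures, which may have been computed from unrounded counts.

```python
def relative_reduction(before, after):
    """Fraction by which an error rate fell."""
    return (before - after) / before


raw = relative_reduction(18.5, 5.5)                    # about 0.70
weighted = relative_reduction(16.2, 6.7)               # about 0.59
misses = relative_reduction(100 - 83.8, 100 - 94.1)    # about 0.64

print(f"{raw:.1%} {weighted:.1%} {misses:.1%}")
```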
questions?
Presentation viewable at http://goo.gl/n85gR6
National Library of Australia’s TROVE
● 1.4m distinct visitors/month
● 16m pageviews/month
● 80% of usage is old newspapers
○ 13m pages, over 600 titles
○ 85k lines corrected/day
Even this massive volunteer effort cannot keep up
● < 2% of errors have been corrected
● % corrected is declining
● Hence searching is unreliable, OCR’ed text is hard to read and reuse
● Trove’s accuracy is “typical”
159 randomly selected news articles from The Sydney Morning Herald
47.4K words hand-corrected to ground truth
SMH sample
                        Before   After
Recall                  83.8%    94.1%   recall misses reduced 63.3%
False positive recall   26.7%     9.1%   false positives reduced 65.8%
Raw Error Rate          18.5%     5.5%   errors reduced 70.1%
Weighted Error Rate     16.2%     6.7%   weighted errors reduced 59.4%
49 randomly selected news articles from LoC Chronicling America
18.1K words hand-corrected to ground truth
LOC sample
                        Before   After
Recall                  84.0%    93.1%   recall misses reduced 56.6%
False positive recall   23.6%     8.8%   false positives reduced 62.8%
Raw Error Rate          19.1%     6.4%   errors reduced 66.7%
Weighted Error Rate     16.0%     7.7%   weighted errors reduced 51.8%

Datech2014 - Session 3 - Correcting Noisy OCR: Context Beats Confusion
