2. Automatic OCR correction http://overproof.projectcomputing.com
who are we?
● Australian software company
● developers John and Kent
● we put theory into practice
main target - newspapers
● the first draft of history
● popular if made available
● usually poorly digitized
● too extensive for full human correction
goals
● run on commodity cloud server
● optimal for noisy text
● at least 1000 words/sec
● correct at least 50% of errors
division of labour
[Diagram labels: MANAGER, TRIAGE, CORE, models, models, good, bad]
snippets for the core
● prefer triaged good words at start/end
● column aware
● some easy corrections applied
● some suggestions supplied
● bag of topic words available
● surrounding noise level indicated
error contexts
● spell: vowals or consonnants
● type: you jit teh wrng key
● OCR: roprcroiitativcs cf thc Coveriuient
● random: anygh<eg 0at7happen
confusion cost matrix
93: w ← w
155: e ← e
3750: c ← e
4451: m ← rn
6652: rn ← m
11065: E ← m
word cost (eg rnorniny|morning)
language cost
● lexicon frequency
● entity list
● rare word list
● character 4-gram
error cost
● edit sum
● visual correlation
● generator hint
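The word cost above can be sketched as a sum of a language cost and an error cost. This is only an illustration under stated assumptions: the frequencies, the error costs, the unknown-word penalty and the `weight` parameter are all hypothetical, and only one language-cost component (lexicon frequency) is modelled.

```python
import math

# Hypothetical lexicon frequencies (occurrences per million words); not from the slides.
LEXICON_FREQ = {"morning": 250.0, "mourning": 12.0, "moaning": 3.0}
UNKNOWN_WORD_COST = 20.0  # assumed penalty for words missing from the lexicon


def language_cost(word: str) -> float:
    """Negative log relative frequency: frequent words are cheap."""
    freq = LEXICON_FREQ.get(word)
    if freq is None:
        return UNKNOWN_WORD_COST
    return -math.log(freq / 1_000_000)


def word_cost(candidate: str, error_cost: float, weight: float = 1.0) -> float:
    """Total cost of reading an OCR token as `candidate` (lower is better)."""
    return language_cost(candidate) + weight * error_cost


# Hypothetical error costs for turning the OCR token "rnorniny" into each candidate.
error_costs = {"morning": 2.0, "mourning": 5.0, "moaning": 6.5}
ranked = sorted(error_costs, key=lambda w: word_cost(w, error_costs[w]))
print(ranked)
```

With these made-up numbers "morning" ranks first because it is both the most frequent candidate and the cheapest edit.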
word character confusion
m   o  r  n  i  n  g
rn  o  r  n  i  n  y
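The alignment above (one intended character matched against one or two OCR characters) can be found with a small dynamic program. The sketch below is not the product's algorithm. It reuses the six costs from the confusion cost matrix slide, assumes each entry reads "intended ← observed", and adds hypothetical `MATCH_COST` and `MISMATCH_COST` defaults for pairs the slide does not list. Insertions and deletions are left out.

```python
from functools import lru_cache

# Costs from the confusion cost matrix slide, keyed (intended, observed).
# The direction of the arrow is an assumption.
CONFUSION_COST = {
    ("w", "w"): 93,
    ("e", "e"): 155,
    ("c", "e"): 3750,
    ("m", "rn"): 4451,
    ("rn", "m"): 6652,
    ("E", "m"): 11065,
}
MATCH_COST = 150      # hypothetical cost for an unlisted identical pair
MISMATCH_COST = 8000  # hypothetical cost for an unlisted 1:1 substitution
MAX_SEG = 2           # longest segment considered on either side


def segment_cost(intended, observed):
    """Cost of reading `intended` as `observed`, or None if not modelled."""
    if (intended, observed) in CONFUSION_COST:
        return CONFUSION_COST[(intended, observed)]
    if len(intended) == 1 and len(observed) == 1:
        return MATCH_COST if intended == observed else MISMATCH_COST
    return None  # unknown multi-character confusion


def error_cost(intended, observed):
    """Return (cost, alignment) of the cheapest segmentation, or None."""
    @lru_cache(maxsize=None)
    def best(i, j):
        if i == len(intended) and j == len(observed):
            return 0, ()
        options = []
        for di in range(1, MAX_SEG + 1):
            for dj in range(1, MAX_SEG + 1):
                if i + di > len(intended) or j + dj > len(observed):
                    continue
                a, b = intended[i:i + di], observed[j:j + dj]
                cost = segment_cost(a, b)
                rest = best(i + di, j + dj) if cost is not None else None
                if rest is not None:
                    options.append((cost + rest[0], ((a, b),) + rest[1]))
        return min(options) if options else None

    return best(0, 0)


cost, alignment = error_cost("morning", "rnorniny")
print(cost, alignment)
```

For this pair the only feasible alignment starts with ("m", "rn") and ends with ("g", "y"), matching the columns shown on the slide.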
searching for gold (A*)
[Search-tree diagram. Node labels: l, i, i, ne, r, h, hcii. Candidate rows: "h li b n ...", "c e r o ...", "i i 1 l n u ...", "i i 1 l ..."]
purple nodes: working priority queue
red nodes: output priority queue
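A best-first search with a working queue and an output queue, as in the legend above, might look like the following sketch. It makes several assumptions: the lexicon, the costs and the `MULTI` table are hypothetical, and the heuristic is zero, so this A* reduces to uniform-cost search. Partial hypotheses are restricted to prefixes of lexicon words.

```python
import heapq

# Hypothetical lexicon and costs; none of these values come from the slides.
LEXICON = {"morning", "mourning", "moaning", "warning"}
PREFIXES = {w[:k] for w in LEXICON for k in range(len(w) + 1)}
ALPHABET = sorted({ch for w in LEXICON for ch in w})
MATCH, MISMATCH = 1.0, 8.0
MULTI = {("m", "rn"): 4.0}  # intended "m" read by the OCR engine as "rn"


def expansions(prefix, observed, pos):
    """Yield (new_prefix, new_pos, step_cost) for one more aligned segment."""
    if pos < len(observed):
        for ch in ALPHABET:
            if prefix + ch in PREFIXES:
                step = MATCH if ch == observed[pos] else MISMATCH
                yield prefix + ch, pos + 1, step
    for (intended, seen), step in MULTI.items():
        if observed.startswith(seen, pos) and prefix + intended in PREFIXES:
            yield prefix + intended, pos + len(seen), step


def search(observed, k=3):
    """Return up to k (cost, word) pairs, cheapest first."""
    working = [(0.0, "", 0)]  # partial hypotheses: (cost, prefix, position)
    output = []               # completed words: (cost, word)
    visited = set()
    while working and len(output) < k:
        cost, prefix, pos = heapq.heappop(working)
        if (prefix, pos) in visited:
            continue  # already expanded via a cheaper path
        visited.add((prefix, pos))
        if pos == len(observed) and prefix in LEXICON:
            heapq.heappush(output, (cost, prefix))
            continue
        for new_prefix, new_pos, step in expansions(prefix, observed, pos):
            heapq.heappush(working, (cost + step, new_prefix, new_pos))
    return [heapq.heappop(output) for _ in range(len(output))]


print(search("rnorniny", k=2))
```

With these invented costs, "morning" is returned first for the OCR token "rnorniny". A non-zero admissible heuristic (for example, a lower bound on the cost of the remaining characters) would prune the working queue further.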
selecting best combination
unsiejitlv: unsightly, unseemly, unsettle, unsteady, Unsightly, urgently
bohavlour: behaviour, behavour, behavior, Behaviour, behaviours, behaving
abonf: about, above, along, been
am: am, an, a, in, as
unsiejitlv: unsightly, unseemly, unsettle, unsteady, Unsightly, urgently
disgrie: disgrace, disagree, disguise, desire, degree, disease
[NOTE: word joins and splits are also supported]
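Choosing one candidate per position so that the whole sequence is cheapest is a shortest-path problem over the candidate lists. The sketch below uses a Viterbi-style pass with a bigram cost between neighbours. The per-word costs, `BIGRAM_COST` and `DEFAULT_BIGRAM` are hypothetical, and word joins and splits are not handled.

```python
# Hypothetical candidate lists with per-word costs (lower is better).
lattice = [
    [("unsightly", 2.0), ("unseemly", 3.0), ("urgently", 6.0)],
    [("behaviour", 1.0), ("behaving", 4.0)],
    [("about", 2.0), ("above", 2.5)],
]

# Hypothetical costs for adjacent word pairs; unseen pairs get the default.
BIGRAM_COST = {
    ("unseemly", "behaviour"): 0.5,
    ("unsightly", "behaviour"): 2.5,
    ("behaviour", "about"): 1.0,
}
DEFAULT_BIGRAM = 4.0


def best_combination(columns):
    """Return (total_cost, words) for the cheapest path through the lattice."""
    # best[word] = (cost, path) of the cheapest path ending in `word`
    best = {word: (cost, [word]) for word, cost in columns[0]}
    for column in columns[1:]:
        nxt = {}
        for word, cost in column:
            nxt[word] = min(
                (prev_cost + BIGRAM_COST.get((prev, word), DEFAULT_BIGRAM) + cost,
                 path + [word])
                for prev, (prev_cost, path) in best.items()
            )
        best = nxt
    return min(best.values())


total, words = best_combination(lattice)
print(total, words)
```

In this made-up example "unseemly" wins the first position even though "unsightly" is cheaper on its own, because the neighbouring word changes the total.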
training
● 5-grams - subset selection
● corpus 1,2,3-grams - statistical build
● extra word lists - easy
● error model - bootstrap or new pairs
testing
● 65,000 words ground truth including foreign (US) newspapers
● all measures exceeded goal:
○ search errors (article word types)
○ read errors (article word tokens)
○ entropy weighted term errors
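The first two measures can be approximated as below and compared against the "correct at least 50% of errors" goal. The definitions are assumptions, not the authors' exact metrics: token errors are counted over pre-aligned word lists, type errors are ground-truth word types missing from the text, and the entropy-weighted measure is not attempted. The sample sentences are invented.

```python
def token_error_rate(truth, text):
    """Fraction of word tokens that differ (lists assumed aligned 1:1)."""
    wrong = sum(1 for t, o in zip(truth, text) if t != o)
    return wrong / len(truth)


def type_error_rate(truth, text):
    """Fraction of distinct ground-truth words absent from the text."""
    truth_types = set(truth)
    return len(truth_types - set(text)) / len(truth_types)


def error_reduction(before, after):
    """Share of the original errors that were removed."""
    return (before - after) / before


# Invented example: ground truth, raw OCR, and an automatically corrected version.
truth = "the cat sat on the mat near the door".split()
raw = "tbe cat sat on the rnat near tbe door".split()
corrected = "the cat sat on the rnat near the door".split()

token_gain = error_reduction(token_error_rate(truth, raw),
                             token_error_rate(truth, corrected))
print(f"token error reduction: {token_gain:.0%}")
print(f"type error rate (raw): {type_error_rate(truth, raw):.3f}")
```

Here two of three token errors are fixed, yet the type error rate is unchanged because "mat" is still unfindable. That is one reason to report the search and read measures separately.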
National Library of Australia’s Trove
● 1.4m distinct visitors/month
● 16m pageviews/month
● 80% of usage is old newspapers
○ 13m pages, over 600 titles
○ 85k lines corrected/day
Even this massive volunteer effort cannot keep up
● < 2% of errors have been corrected
● % corrected is declining
● Hence searching is unreliable, and OCR’ed text is hard to read and reuse
● Trove’s accuracy is “typical”
159 randomly selected news articles from The Sydney Morning Herald
47.4K words hand-corrected to ground truth
49 randomly selected news articles from LoC Chronicling America
18.1K words hand-corrected to ground truth