USING OPEN SOURCE OCR
TOOLS FOR DIGITIZATION
PROJECTS
Matthew J. Christy
Intro – Me
• Matthew J. Christy
• Lead Software Applications Developer at the Initiative for Digital
Humanities, Media and Culture (IDHMC) at Texas A&M University
• @matt_christy
• idhmc.tamu.edu
• @idhmc_nexus
• Co-project manager of the Early Modern OCR Project (eMOP)
• emop.tamu.edu
• #emop
• Former Systems/Electronic Resources Librarian
Tuesday, August 12, 2014 Open Source OCR Tools 2
Intro – You
• Name & Institution
• Experience with OCR
• What’s your project or what are you bringing with you?
Intro – Outline
• OCR & Open Source Engines
• Digitization vs OCR
• Tesseract
• OCRopus
• Gamera
• Setup
• Installing Tesseract
• Installing Aletheia
• Installing Franken+
• Installing ImageMagick / GIMP
• Running Tesseract (default)
• Identifying issues with your page
images
• What’s your font?
• Image quality problems
• Pre-processing
• Binarization
• Cropping
• “de”-ing (noise, skew, warp, etc.)
• Training Tesseract for your font
• Tesseract’s native training mechanism
• When more is needed
• Aletheia
• Franken+
• Word lists
• Common transformation errors
• Running Tesseract (your training)
• Your results
• Comparing OCR results to Groundtruth
• Creating Groundtruth
• Post-processing
• Hand correction
• Crowd-source correction
• eMOP tools
OCR & Open Source Engines
Digitization vs. OCR
• Digitization is the creation of a digital representation of an
object.
• In the print world, a digital image of a page: page image
• end product: image files (.tif .jpg .png .pdf)
• Optical Character Recognition (OCR) is the use of
software to recognize the characters on a page image
and turn that into text.
• text that is searchable, and editable
• end product: text files (.txt .rtf .doc .pdf)
Tesseract
• Developed by Ray Smith at HP
• Taken up by Google
• Used in their Google Books mass-digitization & OCR
program
• Open Source: code.google.com/p/tesseract-ocr/
• version 3.02
• Windows, Mac and UNIX
• Documentation is not always helpful
• User group: groups.google.com/forum/#!forum/tesseract-ocr
• Training for various scripts and languages available
• Lots of users, so Google it
OCRopus
• Developed by Thomas Breuel
• Originally used Tesseract for character recognition
• Was not under active development for a while, but a new
version is now available
• Open Source: code.google.com/p/ocropus/
• version 0.7
• Windows, Mac & UNIX
• User group: groups.google.com/forum/#!forum/ocropus
Gamera
• Developed by Ichiro Fujinaga (McGill University)
• Designed to OCR music
• It’s actually the Gamera OCR Toolkit that you want
• Open Source:
gamera.informatik.hsnr.de/addons/ocr4gamera/
• version 1.1.0 (Jun, 2014)
• Windows, Mac and UNIX
• User group: groups.yahoo.com/neo/groups/gamera-devel/info
• Training can take a while.
• emop.tamu.edu/Gamera-OCR
Installing Tesseract
• Mac: emop.tamu.edu/Installing-Tesseract-Mac
• PC: emop.tamu.edu/Installing-Tesseract-PC
• code.google.com/p/tesseract-ocr/wiki/ReadMe
• Standard English-language training:
code.google.com/p/tesseract-ocr/downloads/list (tesseract-ocr-3.02.eng.tar.gz)
> combine_tessdata -u eng.traineddata ../unpacked/eng
> dawg2wordlist eng.unicharset eng.word-dawg eng-word-list.txt
Installing Aletheia
• Windows only
• Download the zip file
• www.primaresearch.org/tools/Aletheia
• Click the Download the previous version button (v2.1)
• Run executable file
Installing Franken+
• Windows only
• Download the zip file
• dh-emopweb.tamu.edu/Franken+/
• Install executable file
• Requirements:
• .NET Framework 4.5 (standard on Windows 8)
• a local MySQL server with root username
(MySQL Community Server 5.6)
• See emop.tamu.edu/Installing-FrankenPlus for more instructions
Installing ImageMagick/GIMP
• Two good free image manipulation programs available for
Windows, Mac and Unix
• ImageMagick
• typically command-line but has a limited graphical interface in Windows
• www.imagemagick.org/
• GIMP (GNU Image Manipulation Program)
• has a graphical user interface for all platforms
• www.gimp.org/
Running Tesseract with default training
> tesseract <page image> <outfile> -l <lang> <config file>
• Where:
• <outfile> is the name of the .txt and .html files to be created
• <lang> is the “language name” you gave your training, i.e. what you
called your typeface training set
• <config file> is a file name containing some configuration information for
Tesseract
• “tessedit_create_hocr 1” produces hOCR (HTML) output
• Tesseract’s default output is text only
• Tesseract’s default <lang> is “eng”, its standard English-language training
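The command pattern above can be assembled from a script when many page images need the same settings. The following is a minimal Python sketch, not part of the original workshop materials: the function names and file names are hypothetical, and only the `tesseract <page image> <outfile> -l <lang> <config file>` form and the `tessedit_create_hocr 1` setting come from the slide.

```python
from pathlib import Path


def write_hocr_config(path):
    """Write a one-line config file that asks Tesseract for hOCR output."""
    config = Path(path)
    config.write_text("tessedit_create_hocr 1\n", encoding="utf-8")
    return config


def build_tesseract_command(page_image, out_base, lang="eng", config_file=None):
    """Return the argument list for:
    tesseract <page image> <outfile> -l <lang> <config file>
    """
    command = ["tesseract", str(page_image), str(out_base), "-l", lang]
    if config_file is not None:
        command.append(str(config_file))
    return command


# Hypothetical file names, for illustration only.
print(build_tesseract_command("page.tif", "page-out", "eng", "hocr_cfg.txt"))

# To actually run the command (requires Tesseract on the PATH):
#   import subprocess
#   subprocess.run(build_tesseract_command(...), check=True)
```

Building the command as a list rather than a single string avoids quoting problems when paths contain spaces, as Windows paths often do.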
Identifying issues with your page images
What’s your font?
• OCR engines need to be trained on the typeface they will
be trying to recognize
• Modern fonts (fonts available via a word processor) make
it easy to train an OCR engine
• Other fonts (bus signs, secretary hand, early modern
fonts) require special training procedures
WhatTheFont
• www.myfonts.com/WhatTheFont/
• crop your page image down to a section of 20 or so letters (<2 MB)
• try to find some distinctive characters
• submit, then help identify the characters found
Image Quality Issues
• Small file size/resolution
(< 300 dpi)
• Noise
• Bleedthrough
• Over/under inking
• Skew
• Warp
Pre-processing
• There are pre-processing algorithms available to fix most of
these issues
• Very useful if you have a small number of documents, or if you
know that all your documents have the same issues (need the
same pre-processing)
• Can dramatically improve OCR results
• Tools:
• GIMP: www.gimp.org/
• ImageMagick: www.imagemagick.org/
• (www.fmwconcepts.com/imagemagick)
Binarization
• Converting to Black & White
• ImageMagick:
> convert <infile> -colorspace gray +dither -colors 2 -normalize <outfile>
• Fred’s scripts
> otsuthresh <in> <out>
> localthresh
• GIMP
• Image -> Mode -> Indexed ...
• Tools -> Color Tools… -> Threshold…
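To show what a global threshold does, here is a minimal pure-Python sketch of Otsu's method, the technique the otsuthresh script is named after. It is an illustration only and makes no claim to match that script's implementation; the function names are hypothetical, and the input is assumed to be a flat list of 8-bit gray values (0 = black, 255 = white).

```python
def otsu_threshold(gray_values):
    """Return the threshold t (0-255) that maximizes between-class variance.

    Pixels with value <= t form one class (ink), the rest the other (paper).
    """
    histogram = [0] * 256
    for value in gray_values:
        histogram[value] += 1
    total = len(gray_values)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_t, best_variance = 0, -1.0
    weight_low = 0   # number of pixels at or below t
    sum_low = 0      # sum of their gray values
    for t in range(256):
        weight_low += histogram[t]
        if weight_low == 0:
            continue
        weight_high = total - weight_low
        if weight_high == 0:
            break
        sum_low += t * histogram[t]
        mean_low = sum_low / weight_low
        mean_high = (sum_all - sum_low) / weight_high
        variance = weight_low * weight_high * (mean_low - mean_high) ** 2
        if variance > best_variance:
            best_variance, best_t = variance, t
    return best_t


def binarize(gray_values, threshold):
    """Map each gray value to 1 (ink) or 0 (paper)."""
    return [1 if value <= threshold else 0 for value in gray_values]


# Dark ink around 20-30, light paper around 200-220.
pixels = [20] * 50 + [30] * 50 + [200] * 100 + [220] * 100
t = otsu_threshold(pixels)
print(t, sum(binarize(pixels, t)))
```

A single global threshold works best on evenly lit pages; pages with bleedthrough or uneven lighting are the cases where a local threshold is worth trying.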
Cropping
• Sometimes it helps to crop images to:
• remove noise
• remove unwanted elements (rulers, fingers, note cards, etc.)
• separate multi-page images
• It can also reduce the length of time needed to pre-
process
• Only feasible with a small number of documents
• Can use:
• GIMP
• Paint
• Preview
Denoise
• or “Despeckle”
• Removes speckles from page image
• There’s a trade-off
• Being too aggressive can reduce the integrity of the glyphs
• ImageMagick:
> convert <infile> -noise 1 <outfile>
> convert <infile> -despeckle <outfile>
• GIMP:
• Filters -> Enhance -> Despeckle …
• Try it multiple times, but watch your glyph
integrity
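The trade-off between speckle removal and glyph integrity can be seen in a small sketch. The Python below removes connected groups of ink pixels smaller than a chosen size from a binarized page. This is a different technique from ImageMagick's -despeckle and GIMP's Despeckle filter, and the function name and the 0/1 bitmap representation are assumptions made for illustration.

```python
def remove_small_components(bitmap, min_pixels):
    """Return a copy of bitmap (rows of 0/1, 1 = ink) in which every
    8-connected ink component smaller than min_pixels is erased."""
    height = len(bitmap)
    width = len(bitmap[0]) if height else 0
    cleaned = [row[:] for row in bitmap]
    seen = [[False] * width for _ in range(height)]

    for y in range(height):
        for x in range(width):
            if bitmap[y][x] != 1 or seen[y][x]:
                continue
            # Flood-fill to collect one connected component.
            stack, component = [(y, x)], []
            seen[y][x] = True
            while stack:
                cy, cx = stack.pop()
                component.append((cy, cx))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and bitmap[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            stack.append((ny, nx))
            if len(component) < min_pixels:
                for cy, cx in component:
                    cleaned[cy][cx] = 0
    return cleaned


page = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0, 1],   # the lone 1 at the right edge is a speckle
    [0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
gentle = remove_small_components(page, 2)        # removes only the speckle
aggressive = remove_small_components(page, 10)   # also erases the 9-pixel glyph
```

Setting the size limit too high erases real marks such as the dots on i and j or punctuation, which is the glyph-integrity cost the slide warns about.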
Deskew
• or “Rotate” or “Auto-straighten”
• ImageMagick:
• Fred’s scripts:
> sh ./skew.sh -a 2 -m degrees -d b2r -v background <infile> <outfile>
• GIMP:
• There’s a plugin, but I couldn’t get it
installed
• registry.gimp.org/node/2958
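For readers curious how a skew angle can be found automatically, the sketch below uses a projection profile: it tries a range of angles and keeps the one that packs the ink pixels into the fewest, fullest rows. This is one common approach and is not taken from the scripts or plugin named above. The function name, the angle range, and the step size are assumptions.

```python
import math


def estimate_skew(ink_pixels, max_angle=5.0, step=0.25):
    """Return the skew angle in degrees that best aligns ink into rows.

    ink_pixels is an iterable of (x, y) coordinates of black pixels.
    A text line that follows y = y0 + x * tan(angle) has skew `angle`.
    """
    pixels = list(ink_pixels)
    best_angle, best_score = 0.0, -1.0
    steps = int(round(max_angle / step))
    for i in range(-steps, steps + 1):
        angle = i * step
        radians = math.radians(angle)
        sin_a, cos_a = math.sin(radians), math.cos(radians)
        rows = {}
        for x, y in pixels:
            # Row index of the pixel after undoing a skew of `angle`.
            row = int(round(y * cos_a - x * sin_a))
            rows[row] = rows.get(row, 0) + 1
        # Sharply peaked row counts score highest.
        score = sum(count * count for count in rows.values())
        if score > best_score:
            best_score, best_angle = score, angle
    return best_angle


# Five synthetic text lines skewed by 2 degrees.
lines = []
for y0 in range(20, 220, 40):
    for x in range(400):
        lines.append((x, int(round(y0 + x * math.tan(math.radians(2.0))))))
print(estimate_skew(lines))
```

Once the angle is known, the page is rotated by the opposite amount in an image tool to straighten it.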
Dewarp
• Dealing with warping (for example, when a page bends
due to a tight or thick spine) is much trickier.
Training Tesseract for your font
• The difference between Training and OCRing
• You may end up using some of the documents you want to OCR to
create the training.
• Training:
• Binarize
• Clean
• Aletheia: find glyphs (Unicode values and coordinates on page)
• Franken+: choose best exemplars of glyphs
• Add word lists (optional)
• Process to create Tesseract training data
• OCRing:
• Binarize
• Clean (if possible)
• OCR with Tesseract
• Tesseract’s native training: code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3
When more is needed
• Aletheia (PRImA Research Labs): www.primaresearch.org/tools/Aletheia
• Franken+: dh-emopweb.tamu.edu/Franken+/
• See the Aletheia/Franken+ Quick Start Guide for more information
Aletheia
• www.primaresearch.org/tools.php
• Available for free but requires registration.
• Created by PRImA Research Labs, University of Salford, UK.
• Windows-based tool.
• Developed as a groundtruth creation tool.
• Used by eMOP undergraduate student workers to create training of the desired typeface for Tesseract.
• Can identify glyphs on a page image with page coordinates and Unicode values.
Aletheia: Workflow
• Binarization and Denoise are native Aletheia functions.
• A team of undergraduate student workers refines and corrects glyph boxes and Unicode values, where needed.
• Output: a set of PAGE XML files with page coordinates and Unicode values for every identified glyph on each processed TIFF image.
Aletheia: Glyph Recognition
• Uses Tesseract to find glyphs
Aletheia: I/O
• We then convert the PAGE XML file to a Tesseract box file using XSLT
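The conversion itself was done with XSLT. As a rough illustration of what such a conversion involves, here is a Python sketch of the same idea. It assumes a PAGE XML variant in which each Glyph element carries a Coords element with a points attribute ("x1,y1 x2,y2 ...") and a TextEquiv/Unicode child; older PAGE schema versions store coordinates differently. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace: '{http://...}Glyph' becomes 'Glyph'."""
    return tag.rsplit("}", 1)[-1]


def page_xml_to_box(xml_text, image_height, page_index=0):
    """Return Tesseract box-file lines:
    '<char> <left> <bottom> <right> <top> <page>'
    """
    root = ET.fromstring(xml_text)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "Glyph":
            continue
        points, char = None, None
        for child in element.iter():
            name = local_name(child.tag)
            if name == "Coords" and points is None:
                points = child.get("points")
            elif name == "Unicode" and char is None:
                char = (child.text or "").strip()
        if not points or not char:
            continue  # skip glyphs without an outline or a label
        pairs = [tuple(int(n) for n in p.split(",")) for p in points.split()]
        xs = [x for x, _ in pairs]
        ys = [y for _, y in pairs]
        # PAGE XML measures y from the top of the image;
        # box files measure y from the bottom.
        left, right = min(xs), max(xs)
        bottom, top = image_height - max(ys), image_height - min(ys)
        lines.append(f"{char} {left} {bottom} {right} {top} {page_index}")
    return lines
```

The image height is needed because of the flipped vertical axis; it is passed in here rather than read from the file to keep the sketch short.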
Tesseract Training
Franken+
1. Windows-based tool that uses a MySQL DB.
2. Developed for eMOP by IDHMC graduate student worker Bryan Tarpley.
3. Designed to be easily used by eMOP undergraduate student workers.
4. Takes Aletheia's output files as input.
5. Outputs the same box files and TIFF images that Tesseract's first stage of native training produces.
• Available open-source at: github.com/idhmc-tamu/FrankenPlus
Franken+ Workflow
1. Groups all glyphs with the same Unicode value into one window for comparison.
2. Uses all selected glyphs to create a Franken-page image (TIFF) using a selected text as a base.
3. Outputs the same box files and TIFF images that Tesseract's first stage of native training produces.
Franken+ Ingestion
Franken+
• All exemplars of the same glyph are displayed together.
• Users can quickly identify and deselect:
• Incorrectly labeled glyphs
• Incomplete glyphs
• Unrepresentative exemplars
• Different sized glyphs
Training Tesseract
Thiſ great conſumption to a fever turn'd,
And ſo the oꝗld had fitſ; it joy'd, it mourn'd;
And, aſ men thinke, that Agueſ phy ck are,
And th'Ague being ſpent, give over care.
Žo thou cke World, mꝗꝗak'ꝗ thy ſelże to bee
Well, when ãlaſ, thou'rt in a Lethargie.
Her death did wound and tame thee than, and than
Thou might'ꝗ ha e better ſpar'd the Sunne, or man.
That wound waſ deep, but 'tiſ more miżery,
That thou haꝗ loꝗ thy ſenſe and memor .
'Twaſ heavy then to heare thy voyce of mone,
But thiſ iſ worſe, that thou art ſpeechle e growne.
Thou haꝗ forgot thy name thou hadꝗ; thou waꝗ
Nothing but ee, and her thou haꝗ o'rpaꝗ.
For aſ a child kept from the Fount, untill
Ä prince, expe ed long, come to fulfill
The ceremonieſ, thou unnam'd had'ꝗ laid,
Had not her comming, thee her palace made:
Her name defin'd thee, gave thee forme, and frame,
And thou forgett'ꝗ to celebrate th n me.
Some monethſ e hath beene dead (but beìng dead,
Meaſureſ of timeſ are all determined)
But long e'ath beene away, long, long, et none
Offerſ to tell uſ who it iſ that'ſ gone.
But aſ in ꝗateſ doubtfull of future heireſ,
When ckne e without remedie empaireſ
The preſent Prince, they're loth it ould be ſaid,
The Prince doth langui , or the Prince iſ dead:
So mankinde feeling no a generall tha ,
F+TraininigText.txt
When more is needed
Tesseract – Word Lists
• Tesseract has the ability to use word lists or dictionaries to look up
words while scanning.
• Word lists help Tesseract decide what a word is when it’s not sure.
• Takes advantage of the character confidence score that Tesseract computes
while scanning.
• This character confidence info is lost when the hOCR output is created.
• DAWG (Directed Acyclic Word Graph) files (8)
• word-dawg: A dawg made from dictionary words from the language.
• freq-dawg: A dawg made from the most frequent words which would have
gone into word-dawg.
• punc-dawg: A dawg made from punctuation patterns found around words.
The "word" part is replaced by a single space.
• number-dawg: A dawg made from tokens which originally contained digits.
Each digit is replaced by a space character.
Tesseract – Word Lists
• Collect a word list
• spellcheckers (ispell, aspell, hunspell) – check the license
• period specific works will require period specific word lists
• dh-emopweb.tamu.edu/eebo-word-freq.php
• emop.tamu.edu/Early-Modern-Word-List
• You can also take Google’s eng.traineddata file apart and use their
word list. (combine_tessdata -u, dawg2wordlist)
• Format: one word per line, no other info, UTF-8.
• If you have a word count associated with your list then split it into
two lists: frequent and other.
• Apply wordlist2dawg application to create dawg files.
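The split into frequent and other words can be scripted. The sketch below is a minimal Python example under stated assumptions: the word counts are already in a dictionary, and the cutoff that separates "frequent" from "other" is a number you choose. The function names and the sample counts are hypothetical.

```python
def split_word_list(word_counts, frequent_cutoff):
    """Split {word: count} into (frequent, other) word lists.

    Words are ordered by descending count, then alphabetically.
    """
    ranked = sorted(word_counts.items(), key=lambda item: (-item[1], item[0]))
    frequent = [word for word, count in ranked if count >= frequent_cutoff]
    other = [word for word, count in ranked if count < frequent_cutoff]
    return frequent, other


def write_word_list(words, path):
    """Write one word per line, no other info, UTF-8."""
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        for word in words:
            handle.write(word + "\n")


# Invented counts, for illustration only.
counts = {"the": 900, "and": 750, "thou": 120, "lethargie": 2, "sunne": 15}
frequent, other = split_word_list(counts, frequent_cutoff=100)
print(frequent)
print(other)
```

Each resulting file is then a candidate input for wordlist2dawg, with the frequent list feeding freq-dawg and the full or remaining list feeding word-dawg.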
Tesseract – Ambiguity and Transformation Errors
• Tesseract, like all OCR engines, can make consistent
transformation errors across pages, documents and
collections.
• m → rn
• ri → n
• 1) → D
• Tesseract’s ambiguous characters file helps it correct
some of these errors while it’s OCRing
• Can also be used to force substitutions
• ﬆ → st
• ſ → s
• The name of the file is <lang>.unicharambigs
tesseract-ocr.googlecode.com/svn-history/r683/trunk/doc/unicharambigs
Tesseract – .unicharambigs file
• Type Indicator:
0: Substitute B for A if doing so produces a word in the dictionary.
1: Always substitute B for A.
• This really only works for substitutions where at least one side is multiple characters.
• The .unicharambigs file must end with a blank line (\n) at the bottom of the file.
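Writing the file by script makes it easier to keep the tab-separated layout and the final newline correct. The Python sketch below assumes the layout used in the Tesseract 3.0x training documentation: a first line "v1", then one rule per line with the number of source characters, the source characters separated by spaces, the number of target characters, the target characters, and the type indicator, all separated by tabs. Check this against the documentation for your Tesseract version. The function names are hypothetical.

```python
def ambig_line(source, target, mandatory=False):
    """Format one rule that substitutes `target` (B) for `source` (A).

    mandatory=False gives type 0 (substitute only if it makes a dictionary
    word); mandatory=True gives type 1 (always substitute).
    """
    source_chars, target_chars = list(source), list(target)
    return "\t".join([
        str(len(source_chars)), " ".join(source_chars),
        str(len(target_chars)), " ".join(target_chars),
        "1" if mandatory else "0",
    ])


def write_unicharambigs(rules, path):
    """Write rules given as (source, target, mandatory) tuples."""
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        handle.write("v1\n")  # assumed version header
        for source, target, mandatory in rules:
            # Every rule line, including the last, ends with a newline.
            handle.write(ambig_line(source, target, mandatory) + "\n")


print(ambig_line("rn", "m"))                    # optional: rn read where m was printed
print(ambig_line("ri", "n"))                    # optional: ri read where n was printed
print(ambig_line("ſt", "st", mandatory=True))   # forced substitution
```

Splitting with list() treats every code point as one character, which is a simplification: a character in your unicharset may be made of more than one code point.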
Running Tesseract with your training
> tesseract <page image> <outfile> -l <lang> <config file>
• on my computer:
• go to: C:\Program Files (x86)\Tesseract-OCR
• > tesseract C:\Users\IDHMC\Desktop\ocr-test-files\26337\00005.000.001.tif C:\Users\IDHMC\ocr-test-files\26337\eebo32989-out-test-1 -l <lang> tess_cfg.txt
Tesseract – Results
• hOCR file
• XML-like .html file & .txt file (tessedit_create_hocr option)
• creates blocks for page, areas, paragraphs, lines, and words
• each block contains page coordinates
• words contain confidence values (version 3.02.03)
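Because hOCR is HTML, the word boxes and confidence values can be pulled out with a standard HTML parser. The Python sketch below assumes that words are span elements with class ocrx_word and that the title attribute looks like "bbox x0 y0 x1 y1; x_wconf N"; inspect your own output before relying on it, since details differ between Tesseract versions. The class and function names are hypothetical.

```python
import re
from html.parser import HTMLParser


class HocrWords(HTMLParser):
    """Collect text, bounding box, and confidence for each ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if "ocrx_word" in (attributes.get("class") or "").split():
            title = attributes.get("title") or ""
            bbox = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", title)
            conf = re.search(r"x_wconf (-?\d+)", title)
            self._current = {
                "text": "",
                "bbox": tuple(int(n) for n in bbox.groups()) if bbox else None,
                "conf": int(conf.group(1)) if conf else None,
            }

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "span" and self._current is not None:
            self.words.append(self._current)
            self._current = None


def parse_hocr_words(hocr_text):
    """Return a list of {'text', 'bbox', 'conf'} dictionaries."""
    parser = HocrWords()
    parser.feed(hocr_text)
    parser.close()
    return parser.words


def low_confidence(words, threshold):
    """Return the words whose confidence is known and below threshold."""
    return [w for w in words if w["conf"] is not None and w["conf"] < threshold]
```

Listing the low-confidence words per page is a quick way to decide which pages need hand correction first.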
Comparing OCR text to Groundtruth
• Juxta-cl (command line)
• created for eMOP
• based on JuxtaCommons tool (juxtacommons.org/)
• several different comparison algorithms to choose from and other
options
• open-source: github.com/performant-software/juxta-cl
• java-based tool run from command line
• Download: emop.tamu.edu/Installing-JuxtaCL
• ocrevalUAtion
• created for Succeed (www.succeed-project.eu/)
• java-based tool
• open-source: sites.google.com/site/textdigitisation/ocrevaluation
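To give a sense of what such a comparison measures, here is a minimal Python sketch of a character error rate based on Levenshtein edit distance. It is a generic illustration and does not reproduce the algorithms or options of Juxta-cl or ocrevalUAtion; the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions that turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(ocr_text, ground_truth):
    """Edit distance divided by the length of the groundtruth text."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(ocr_text, ground_truth) / len(ground_truth)


# The classic rn/m confusion costs two edits on a six-character word.
print(character_error_rate("modem", "modern"))
```

Normalizing both texts the same way (whitespace, line breaks, long s) before comparing keeps formatting differences from being counted as recognition errors.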
Creating Groundtruth
• Aletheia was developed as a groundtruth creation tool for
Succeed.
• Use it to process some of your page images to quickly produce
corrected full-text.
• Worth the effort if you have a large collection
Post Processing
• No OCR is perfect. It will need to be corrected.
• Hand Correction
• The most thorough way, but time consuming.
• Proofread Page: a MediaWiki extension
(www.mediawiki.org/wiki/Extension:Proofread_Page)
• Crowdsourced Correction
• Give it to the c(l/r)owd
• Tools:
• Online collaborative manuscript transcription tools
• FromThePage: beta.fromthepage.com/ (github.com/benwbrum/fromthepage/wiki)
• T-Pen: t-pen.org
• Scripto: scripto.org
eMOP Post Processing
• Open source tools for:
• scoring OCR results without groundtruth
• estimating the correctability of a page
• removing noise (i.e. junk that Tesseract identifies as words)
• correcting OCR results using dictionaries and Google 3-grams
• gitlab.tamu.edu/groups/emop
• Other tools:
• succeed-project.eu/publications/available-tools/index-succeed
The end
mchristy@tamu.edu
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...Pooja Bhuva
 
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptxHMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptxEsquimalt MFRC
 
Python Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxPython Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxRamakrishna Reddy Bijjam
 
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdfUnit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdfDr Vijay Vishwakarma
 

Dernier (20)

How to setup Pycharm environment for Odoo 17.pptx
How to setup Pycharm environment for Odoo 17.pptxHow to setup Pycharm environment for Odoo 17.pptx
How to setup Pycharm environment for Odoo 17.pptx
 
How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17
 
How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.ppt
 
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptxHMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
 
Fostering Friendships - Enhancing Social Bonds in the Classroom
Fostering Friendships - Enhancing Social Bonds  in the ClassroomFostering Friendships - Enhancing Social Bonds  in the Classroom
Fostering Friendships - Enhancing Social Bonds in the Classroom
 
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptxOn_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
 
REMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptxREMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptx
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdfUGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptx
 
Graduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - EnglishGraduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - English
 
General Principles of Intellectual Property: Concepts of Intellectual Proper...
General Principles of Intellectual Property: Concepts of Intellectual  Proper...General Principles of Intellectual Property: Concepts of Intellectual  Proper...
General Principles of Intellectual Property: Concepts of Intellectual Proper...
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
 
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
 
Interdisciplinary_Insights_Data_Collection_Methods.pptx
Interdisciplinary_Insights_Data_Collection_Methods.pptxInterdisciplinary_Insights_Data_Collection_Methods.pptx
Interdisciplinary_Insights_Data_Collection_Methods.pptx
 
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
 
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptxHMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
 
Python Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxPython Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docx
 
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdfUnit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdf
 

SAA 2014 Pre-conference Workshop - OCRing with Open Source Tools

  • 1. USING OPEN SOURCE OCR TOOLS FOR DIGITIZATION PROJECTS Matthew J. Christy
  • 2. Intro – Me • Matthew J. Christy • Lead Software Applications Developer at the Initiative for Digital Humanities, Media and Culture (IDHMC) at Texas A&M University • @matt_christy • idhmc.tamu.edu • @idhmc_nexus • Co-project manager of the Early Modern OCR Project (eMOP) • emop.tamu.edu • #emop • Former Systems/Electronic Resources Librarian Tuesday, August 12, 2014 Open Source OCR Tools 2
  • 3. Intro – You • Name & Institution • Experience with OCR • What’s your project or what are you bringing with you?
  • 4. Intro – Outline • OCR & Open Source Engines • Digitization vs. OCR • Tesseract • OCRopus • Gamera • Setup • Installing Tesseract • Installing Aletheia • Installing Franken+ • Installing ImageMagick / GIMP • Running Tesseract (default) • Identifying issues with your page images • What’s your font? • Image quality problems • Pre-processing • Binarization • Cropping • “de”-ing (noise, skew, warp, etc.) • Training Tesseract for your font • Tesseract’s native training mechanism • When more is needed • Aletheia • Franken+ • Word lists • Common transformation errors • Running Tesseract (your training) • Your results • Comparing OCR results to Groundtruth • Creating Groundtruth • Post-processing • Hand correction • Crowd-source correction • eMOP tools
  • 5. OCR & Open Source Engines: Digitization vs. OCR • Digitization is the creation of a digital representation of an object. • In the print world, a digital image of a page: page image • end product: image files (.tif .jpg .png .pdf) • Optical Character Recognition (OCR) is the use of software to recognize the characters on a page image and turn them into text. • text that is searchable and editable • end product: text files (.txt .rtf .doc .pdf)
  • 6. Tesseract • Developed by Ray Smith at HP • Taken up by Google • Used in their Google Books mass-digitization & OCR program • Open Source: code.google.com/p/tesseract-ocr/ • version 3.02 • Windows, Mac and UNIX • Documentation is not always helpful • User group: groups.google.com/forum/#!forum/tesseract-ocr • Training for various scripts and languages available • Lots of users, so Google it
  • 7. OCRopus • Developed by Thomas Breuel • Originally used Tesseract for character recognition • Was not under active development for a while, but a new version is now available • Open Source: code.google.com/p/ocropus/ • version 0.7 • Windows, Mac & UNIX • User group: groups.google.com/forum/#!forum/ocropus
  • 8. Gamera • Developed by Ichiro Fujinaga (McGill University) • Designed to OCR music • It’s actually the Gamera OCR Toolkit that you want • Open Source: gamera.informatik.hsnr.de/addons/ocr4gamera/ • version 1.1.0 (June 2014) • Windows, Mac and UNIX • User group: groups.yahoo.com/neo/groups/gamera-devel/info • Training can take a while. • emop.tamu.edu/Gamera-OCR
  • 9. Installing Tesseract • Mac: emop.tamu.edu/Installing-Tesseract-Mac • PC: emop.tamu.edu/Installing-Tesseract-PC • code.google.com/p/tesseract-ocr/wiki/ReadMe • Standard English-language training: code.google.com/p/tesseract-ocr/downloads/list (tesseract-ocr-3.02.eng.tar.gz) > combine_tessdata -u eng.traineddata ../unpacked/eng > dawg2wordlist eng.unicharset eng.word-dawg eng-word-list.txt
  • 10. Installing Aletheia • Windows only • Download the zip file • www.primaresearch.org/tools/Aletheia • Click the “Download the previous version” button (v2.1) • Run the executable file
  • 11. Installing Franken+ • Windows only • Download the zip file • dh-emopweb.tamu.edu/Franken+/ • Install the executable file • Requirements: • .NET Framework 4.5 (standard on Windows 8) • a local MySQL server with root username (MySQL Community Server 5.6) • See emop.tamu.edu/Installing-FrankenPlus for more instructions
  • 12. Installing ImageMagick/GIMP • Two good free image manipulation programs available for Windows, Mac and Unix • ImageMagick • typically command-line but has a limited graphical interface in Windows • www.imagemagick.org/ • GIMP (GNU Image Manipulation Program) • has a graphical user interface for all platforms • www.gimp.org/
  • 13. Running Tesseract with default training > tesseract <page image> <outfile> -l <lang> <config file> • Where: • <outfile> is the name of the .txt and .html files to be created • <lang> is the “language name” you gave your training, i.e. what you called your typeface training set • <config file> is the name of a file containing some configuration information for Tesseract • “tessedit_create_hocr 1” produces hOCR (HTML) output • Tesseract’s default output is text only • Tesseract’s default <lang> is “eng”, its standard English-language training
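The invocation on this slide can be scripted when there are many page images. The following is a minimal Python sketch, not part of the original workshop: it writes a one-line config file containing the `tessedit_create_hocr 1` setting named above and assembles the argument list in the order the slide gives. The helper names (`write_hocr_config`, `build_tesseract_command`) and the file names are illustrative assumptions.

```python
import tempfile
from pathlib import Path


def write_hocr_config(path):
    """Write a Tesseract config file that turns on hOCR output."""
    Path(path).write_text("tessedit_create_hocr 1\n", encoding="utf-8")
    return str(path)


def build_tesseract_command(image, outfile, lang="eng", config=None):
    """Return the argument list for:
    tesseract <page image> <outfile> -l <lang> [<config file>]
    """
    cmd = ["tesseract", str(image), str(outfile), "-l", lang]
    if config:
        cmd.append(str(config))
    return cmd


if __name__ == "__main__":
    # Hypothetical file names, used only for illustration.
    cfg = write_hocr_config(Path(tempfile.gettempdir()) / "tess_cfg.txt")
    print(" ".join(build_tesseract_command("page.tif", "page-out", "eng", cfg)))
    # To actually run the command (requires Tesseract on your PATH):
    #   import subprocess
    #   subprocess.run(build_tesseract_command("page.tif", "page-out"), check=True)
```

Building the command as a list rather than a single string avoids quoting problems when a path contains spaces, which is common on Windows.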
  • 14. Identifying issues with your page images: What’s your font? • OCR engines need to be trained on the typeface they will be trying to recognize • Modern fonts (fonts available via a word processor) make it easy to train an OCR engine • Other fonts (bus signs, secretary hand, early modern fonts) require special training procedures
  • 15. WhatTheFont • www.myfonts.com/WhatTheFont/ • crop your page image down to a section of 20 or so letters (<2 MB) • try to find some distinctive characters • submit, then help identify the characters found
  • 16. Image Quality Issues • Small file size/resolution (< 300 dpi) • Noise • Bleedthrough • Over/under inking • Skew • Warp
  • 17. Pre-processing • There are pre-processing algorithms available to fix most of these issues • Very useful if you have a small number of documents, or if you know that all your documents have the same issues (need the same pre-processing) • Can dramatically improve OCR results • Tools: • GIMP: www.gimp.org/ • ImageMagick: www.imagemagick.org/ • (www.fmwconcepts.com/imagemagick)
  • 18. Binarization • Converting to Black & White • ImageMagick: > convert <infile> -colorspace gray +dither -colors 2 -normalize <outfile> • Fred’s scripts > otsuthresh <in> <out> > localthresh • GIMP • Image -> Mode -> Indexed ... • Tools -> Color Tools… -> Threshold…
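To show what a global threshold does during binarization, here is a minimal pure-Python sketch of Otsu's method, the technique the `otsuthresh` script is named after. It is an illustration under stated assumptions, not the code of that script or of ImageMagick: the image is assumed to be a flat list of 8-bit gray values, and the function names are hypothetical.

```python
def otsu_threshold(pixels):
    """Return the gray level t (0-255) that maximizes between-class variance.

    Pixels with value <= t form the dark class (ink); the rest are background.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark = 0    # number of pixels in the dark class so far
    sum_dark = 0  # sum of gray values in the dark class so far
    for t in range(256):
        w_dark += hist[t]
        if w_dark == 0:
            continue  # no dark pixels yet, nothing to separate
        w_light = total - w_dark
        if w_light == 0:
            break  # every pixel is already in the dark class
        sum_dark += t * hist[t]
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        between = w_dark * w_light * (mean_dark - mean_light) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


def binarize(pixels, threshold):
    """Map each gray value to 0 (black) or 255 (white)."""
    return [0 if p <= threshold else 255 for p in pixels]


if __name__ == "__main__":
    # Hypothetical six-pixel "image": three dark ink pixels, three light paper pixels.
    sample = [10, 12, 11, 200, 205, 198]
    t = otsu_threshold(sample)
    print(t, binarize(sample, t))
```

A single global threshold works well on evenly lit pages; pages with bleedthrough or uneven inking are the case where a local method such as `localthresh` is usually the better choice.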
  • 19. Cropping • Sometimes it helps to crop images to: • remove noise • remove unwanted elements (rulers, fingers, note cards, etc.) • separate multi-page images • It can also reduce the length of time needed to pre-process • Only feasible with a small number of documents • Can use: • GIMP • Paint • Preview
  • 20. Denoise • or “Despeckle” • Removes speckles from page image • There’s a trade-off • Being too aggressive can reduce the integrity of the glyphs • ImageMagick: > convert <infile> -noise 1 <outfile> > convert <infile> -despeckle <outfile> • GIMP: • Filters -> Enhance -> Despeckle … • Try it multiple times, but watch your glyph integrity
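The trade-off mentioned on this slide can be seen in a toy example. The sketch below is a deliberately simplified stand-in, not the algorithm behind ImageMagick's `-despeckle` or GIMP's filter: it assumes a binarized page stored as a grid of 0 (paper) and 1 (ink) and clears ink pixels that have too few ink neighbours. The function name and the `min_neighbors` parameter are hypothetical.

```python
def remove_isolated_pixels(grid, min_neighbors=1):
    """Clear ink pixels (1) that have fewer than min_neighbors ink pixels
    among their eight neighbours. Returns a new grid; the input is unchanged."""
    rows, cols = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 1:
                continue
            neighbors = 0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    if dr == 0 and dc == 0:
                        continue
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols:
                        neighbors += grid[rr][cc]
            if neighbors < min_neighbors:
                out[r][c] = 0
    return out


if __name__ == "__main__":
    # A thin vertical stroke (column 1) plus one speck at row 2, column 3.
    page = [
        [0, 0, 0, 0, 0],
        [0, 1, 0, 0, 0],
        [0, 1, 0, 1, 0],
        [0, 1, 0, 0, 0],
        [0, 0, 0, 0, 0],
    ]
    gentle = remove_isolated_pixels(page, min_neighbors=1)      # removes only the speck
    aggressive = remove_isolated_pixels(page, min_neighbors=3)  # also erases the stroke
    print(gentle)
    print(aggressive)
```

With the gentle setting only the speck disappears; with the aggressive setting the thin stroke is erased as well, which is the loss of glyph integrity the slide warns about.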
  • 21. Deskew • or “Rotate” or “Auto-straighten” • ImageMagick: • Fred’s scripts: > sh ./skew.sh -a 2 -m degrees -d b2r -v background <infile> <outfile> • GIMP: • There’s a plugin, but I couldn’t get it installed • registry.gimp.org/node/2958
  • 22. Dewarp • Dealing with warping (for example, when a page bends due to a tight or thick spine) is much trickier.
  • 23. Training Tesseract for your font • The difference between Training and OCRing • You may end up using some of the documents you want to OCR to create the training. • Training: • Binarize • Clean • Aletheia: Find glyphs (Unicode values and coordinates on page) • Franken+: choose best exemplars of glyphs • Add word lists (optional) • Process to create Tesseract training data • OCRing: • Binarize • Clean (if possible) • OCR with Tesseract
  • 24. Training Tesseract for your font • code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3
  • 25. When more is needed • Aletheia: PRImA Research Labs, www.primaresearch.org/tools/Aletheia • Franken+: dh-emopweb.tamu.edu/Franken+/ • See: Aletheia/Franken+ Quick Start Guide for more information
  • 26. Aletheia • www.primaresearch.org/tools.php • Available for free but requires registration. • Created by PRImA Research Labs, University of Salford, UK. • Windows-based tool. • Developed as a groundtruth creation tool • Used by eMOP undergraduate student workers to create training of desired typeface for Tesseract. • Can identify glyphs on a page image with page coordinates and Unicode values.
  • 27. Aletheia: Workflow • Binarization and Denoise are native Aletheia functions • A team of undergraduate student workers refines and corrects glyph boxes and Unicode values, where needed. • Output: A set of PAGE XML files with page coordinates and Unicode values for every identified glyph on each processed TIFF image.
  • 28. Aletheia: Glyph Recognition • Uses Tesseract to find glyphs
  • 29. Aletheia: I/O • We then convert the PAGE XML file to a Tesseract box file using XSLT
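The conversion on this slide is done with XSLT; the Python sketch below only illustrates the coordinate arithmetic involved and is not the eMOP stylesheet. A Tesseract box file has one line per glyph in the form `<symbol> <left> <bottom> <right> <top> <page>`, with the origin at the bottom-left of the image, whereas page-image coordinates normally start at the top-left. The input shape (a symbol plus a list of `(x, y)` outline points) and the function name are assumptions made for illustration.

```python
def glyph_to_box_line(symbol, points, image_height, page=0):
    """Turn one glyph outline (top-left origin) into a Tesseract box file line."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    left, right = min(xs), max(xs)
    # Flip the y axis: image rows count down from the top,
    # box files count up from the bottom.
    bottom = image_height - max(ys)
    top = image_height - min(ys)
    return f"{symbol} {left} {bottom} {right} {top} {page}"


if __name__ == "__main__":
    # Hypothetical glyph: a long s whose outline spans x 10-30 and y 20-60
    # on an image that is 100 pixels tall.
    outline = [(10, 20), (30, 20), (30, 60), (10, 60)]
    print(glyph_to_box_line("ſ", outline, image_height=100))
```

Forgetting the y-axis flip is an easy mistake to make; a common symptom is boxes that appear mirrored vertically when the box file is loaded next to its TIFF.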
  • 30. Tesseract Training
  • 31. Franken+ 1. Windows-based tool that uses a MySQL DB. 2. Developed for eMOP by IDHMC graduate student worker Bryan Tarpley. 3. Designed to be easily used by eMOP undergraduate student workers. 4. Takes Aletheia's output files as input. 5. Outputs the same box files and TIFF images as Tesseract's first stage of native training. • Available open-source at: github.com/idhmc-tamu/FrankenPlus
  • 32. Franken+ Workflow 1. Groups all glyphs with the same Unicode values into one window for comparison. 2. Uses all selected glyphs to create a Franken-page image (TIFF) using a selected text as a base. 3. Outputs the same box files and TIFF images as Tesseract's first stage of native training.
  • 33. Franken+ Ingestion
  • 34. Franken+ • All exemplars of the same glyph are displayed together. • Users can quickly identify and deselect: • Incorrectly labeled glyphs • Incomplete glyphs • Unrepresentative exemplars • Different sized glyphs
  • 35. Franken+
  • 36. Training Tesseract • F+TraininigText.txt: Thiſ great conſumption to a fever turn'd, And ſo the oꝗld had fitſ; it joy'd, it mourn'd; And, aſ men thinke, that Agueſ phy ck are, And th'Ague being ſpent, give over care. Žo thou cke World, mꝗꝗak'ꝗ thy ſelże to bee Well, when ãlaſ, thou'rt in a Lethargie. Her death did wound and tame thee than, and than Thou might'ꝗ ha e better ſpar'd the Sunne, or man. That wound waſ deep, but 'tiſ more miżery, That thou haꝗ loꝗ thy ſenſe and memor . 'Twaſ heavy then to heare thy voyce of mone, But thiſ iſ worſe, that thou art ſpeechle e growne. Thou haꝗ forgot thy name thou hadꝗ; thou waꝗ Nothing but ee, and her thou haꝗ o'rpaꝗ. For aſ a child kept from the Fount, untill Ä prince, expe ed long, come to fulfill The ceremonieſ, thou unnam'd had'ꝗ laid, Had not her comming, thee her palace made: Her name defin'd thee, gave thee forme, and frame, And thou forgett'ꝗ to celebrate th n me. Some monethſ e hath beene dead (but beìng dead, Meaſureſ of timeſ are all determined) But long e'ath beene away, long, long, et none Offerſ to tell uſ who it iſ that'ſ gone. But aſ in ꝗateſ doubtfull of future heireſ, When ckne e without remedie empaireſ The preſent Prince, they're loth it ould be ſaid, The Prince doth langui , or the Prince iſ dead: So mankinde feeling no a generall tha ,
  • 37. When more is needed
  • 38. Tesseract – Word Lists • Tesseract has the ability to use word lists or dictionaries to look up words while scanning. • Word lists help Tesseract decide what a word is when it’s not sure. • Takes advantage of the character confidence score that Tesseract computes while scanning. • This character confidence info is lost when the hOCR output is created. • DAWG (Directed Acyclic Word Graph) files (8) • word-dawg: A dawg made from dictionary words from the language. • freq-dawg: A dawg made from the most frequent words which would have gone into word-dawg. • punc-dawg: A dawg made from punctuation patterns found around words. The "word" part is replaced by a single space. • number-dawg: A dawg made from tokens which originally contained digits. Each digit is replaced by a space character.
  • 39. Tesseract – Word Lists • Collect a word list • spellcheckers (ispell, aspell, hunspell) – check the license • period-specific works will require period-specific word lists • dh-emopweb.tamu.edu/eebo-word-freq.php • emop.tamu.edu/Early-Modern-Word-List • You can also take Google’s eng.traineddata file apart and use their word list. (combine_tessdata -u, dawg2wordlist) • Format: one word per line, no other info, UTF-8. • If you have a word count associated with your list then split it into two lists: frequent and other. • Apply the wordlist2dawg application to create dawg files.
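The split into a frequent list and an "other" list can be done with a few lines of Python before running `wordlist2dawg`. This is a minimal sketch under assumptions: word counts are held in a dictionary, the cut-off `top_n` is an arbitrary choice you would tune for your collection, and the function names are hypothetical. The output follows the format the slide asks for: one word per line, nothing else, UTF-8.

```python
import os
import tempfile


def split_word_list(counts, top_n):
    """Split a {word: count} mapping into (frequent, other) word lists.

    Words are ranked by descending count, ties broken alphabetically;
    the first top_n words are treated as frequent.
    """
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return ranked[:top_n], ranked[top_n:]


def write_word_list(words, path):
    """Write one word per line, UTF-8, with no other information."""
    with open(path, "w", encoding="utf-8", newline="\n") as fh:
        for word in words:
            fh.write(word + "\n")


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    counts = {"the": 100, "and": 80, "thou": 40, "ſelfe": 3, "mourn'd": 1}
    frequent, other = split_word_list(counts, top_n=2)
    outdir = tempfile.mkdtemp()
    write_word_list(frequent, os.path.join(outdir, "frequent-words.txt"))
    write_word_list(other, os.path.join(outdir, "words.txt"))
    print(frequent, other)
```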
  • 40. Tesseract – Ambiguity and Transformation Errors • Tesseract, like all OCR engines, can make consistent transformation errors across pages, documents and collections. • m → rn • ri → n • 1) → D • Tesseract’s ambiguous characters file helps it correct some of these errors while it’s OCRing • Can also be used to force substitutions • ﬆ → st • ſ → s • The name of the file is <lang>.unicharambigs • tesseract-ocr.googlecode.com/svn-history/r683/trunk/doc/unicharambigs
  • 41. Tesseract – .unicharambigs file • Type Indicator: 0: Substitute B for A if doing so produces a word in the dictionary. 1: Always substitute B for A. • This really only works for substitutions where at least one side is multiple characters. • The .unicharambigs file must end with a blank line (\n) at the bottom of the file.
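As an illustration of the type indicator described on this slide, the sketch below generates entries in the tab-separated layout used by the version-1 `unicharambigs` format as I understand it from the Tesseract 3 training documentation: a `v1` header line, then for each rule the number of characters in A, the characters of A separated by spaces, the number of characters in B, the characters of B separated by spaces, and the type indicator. Check the layout against the documentation for your Tesseract version; the function names and sample rules are assumptions.

```python
def ambig_line(source, target, mandatory=False):
    """Format one rule: replace `source` (A) with `target` (B).

    mandatory=False -> type 0: substitute only if the result is a dictionary word.
    mandatory=True  -> type 1: always substitute.
    Each Python character is treated as one Tesseract unichar, which is a
    simplification for symbols made of several code points.
    """
    src, tgt = list(source), list(target)
    fields = [
        str(len(src)), " ".join(src),
        str(len(tgt)), " ".join(tgt),
        "1" if mandatory else "0",
    ]
    return "\t".join(fields)


def build_unicharambigs(rules):
    """Return the text of a unicharambigs file, ending with a newline."""
    lines = ["v1"] + [ambig_line(*rule) for rule in rules]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical rules for illustration.
    rules = [("rn", "m", False), ("ri", "n", False)]
    print(build_unicharambigs(rules), end="")
```

Generating the file from a list of rules makes it easy to keep the rules under version control and guarantees the trailing newline the slide says the file needs.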
  • 42. Running Tesseract with your training > tesseract <page image> <outfile> -l <lang> <config file> • on my computer: • go to: C:\Program Files (x86)\Tesseract-OCR • > tesseract C:\Users\IDHMC\Desktop\ocr-test-files\26337\00005.000.001.tif C:\Users\IDHMC\ocr-test-files\26337\eebo32989-out-test-1 -l <lang> tess_cfg.txt
  • 43. Tesseract – Results • hOCR file • XML-like .html file & .txt file (tessedit_create_hocr option) • creates blocks for page, areas, paragraphs, lines, and words • each block contains page coordinates • words contain confidence values (version 3.02.03)
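Because hOCR is HTML, the word blocks, coordinates and confidence values listed on this slide can be pulled out with Python's standard library. The sketch below is an illustration under assumptions: it expects words in `span` elements with class `ocrx_word` whose `title` attribute carries `bbox x0 y0 x1 y1` and, where present, `x_wconf N`, and it does not handle nested spans inside a word. The sample markup is invented for the example, and the names `parse_title`, `HocrWordParser` and `extract_words` are hypothetical.

```python
from html.parser import HTMLParser


def parse_title(title):
    """Extract the bounding box and word confidence from an hOCR title string."""
    props = {"bbox": None, "conf": None}
    for part in title.split(";"):
        fields = part.split()
        if not fields:
            continue
        if fields[0] == "bbox" and len(fields) == 5:
            props["bbox"] = tuple(int(v) for v in fields[1:])
        elif fields[0] == "x_wconf" and len(fields) == 2:
            props["conf"] = float(fields[1])
    return props


class HocrWordParser(HTMLParser):
    """Collect text, bbox and confidence for every ocrx_word element."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            self._current = {"text": "", **parse_title(attrs.get("title") or "")}

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "span" and self._current is not None:
            self.words.append(self._current)
            self._current = None


def extract_words(hocr):
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


# Invented sample markup, for illustration only.
SAMPLE = """<div class='ocr_page' title='bbox 0 0 600 800'>
<span class='ocr_line' title='bbox 30 90 300 120'>
<span class='ocrx_word' id='word_1' title='bbox 36 92 96 116; x_wconf 87'>Thiſ</span>
<span class='ocrx_word' id='word_2' title='bbox 104 92 170 116; x_wconf 41'>great</span>
</span></div>"""

if __name__ == "__main__":
    for word in extract_words(SAMPLE):
        print(word)
```

Listing the low-confidence words first is one simple way to decide where hand correction effort should go.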
  • 44. Comparing OCR text to Groundtruth • Juxta-cl (command line) • created for eMOP • based on JuxtaCommons tool (juxtacommons.org/) • several different comparison algorithms to choose from and other options • open-source: github.com/performant-software/juxta-cl • java-based tool run from command line • Download: emop.tamu.edu/Installing-JuxtaCL • ocrevalUAtion • created for Succeed (www.succeed-project.eu/) • java-based tool • open-source: sites.google.com/site/textdigitisation/ocrevaluation
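To give a feel for what such comparison tools measure, here is a minimal Python sketch of one common measure: character accuracy based on Levenshtein edit distance. It is not the algorithm used by Juxta-cl or ocrevalUAtion, which offer several comparison methods and options; the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions
    needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def character_accuracy(ocr_text, groundtruth):
    """Return 1 - (edit distance / groundtruth length), floored at 0."""
    if not groundtruth:
        return 1.0 if not ocr_text else 0.0
    return max(0.0, 1.0 - edit_distance(ocr_text, groundtruth) / len(groundtruth))


if __name__ == "__main__":
    # "rn" misread as "m", the kind of consistent error OCR engines make.
    print(edit_distance("modem", "modern"))
    print(round(character_accuracy("modem", "modern"), 3))
```

Normalizing whitespace and line breaks in both texts before comparing usually gives a fairer score, since OCR output and groundtruth rarely wrap lines the same way.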
  • 45. Creating Groundtruth • Aletheia was developed as a groundtruth creation tool for Succeed. • Use it to process some of your page images to quickly produce corrected full-text. • Worth the effort if you have a large collection
  • 46. Post Processing • No OCR is perfect. It will need to be corrected. • Hand Correction • The most thorough way, but time consuming. • Proofread Page: A MediaWiki extension (www.mediawiki.org/wiki/Extension:Proofread_Page) • Crowdsourced Correction • Give it to the c(l/r)owd • Tools: • Online collaborative manuscript transcription tools • FromThePage: beta.fromthepage.com/ (github.com/benwbrum/fromthepage/wiki) • T-Pen: t-pen.org • Scripto: scripto.org
  • 47. eMOP Post Processing • Open source tools for: • scoring OCR results without groundtruth • estimating the correctability of a page • removing noise (i.e. junk that Tesseract identifies as words) • correcting OCR results using dictionaries and Google 3-grams • gitlab.tamu.edu/groups/emop • Other tools: • succeed-project.eu/publications/available-tools/index-succeed
  • 48. The end • mchristy@tamu.edu

Editor's notes

  1. This is something I think a lot of users don’t really understand. PDFs have really confused the issue.
  2. There is a new version of Aletheia which changed the output format. Franken+ is currently being updated to handle this new output. In the meantime…
  3. GNU Image Manipulation Program
  4. Aletheia: Created by PRImA Research Labs at the University of Salford, as a groundtruth creation tool. A team of undergraduates uses Aletheia to identify each glyph on the page images, and ensure that the correct Unicode value is assigned to each. Aletheia outputs an XML file containing all identified glyphs on a page with their corresponding coordinates and Unicode values.
  5. This document had 3 distinct point sizes, but Franken+ revealed there were more like 5 or 6. In this case it probably doesn’t matter, but it points out the possibilities.