Navigating the Storm: eMOP, Big DH Projects, and Agile Steering Standards
1.
Navigating the Storm:
eMOP, Big DH Projects, and Agile
Steering Standards
Elizabeth Grumbach, Co-Project Manager, IDHMC
Laura Mandell, PI, IDHMC
Apostolos Antonacopoulos, PRImA Lab
Clemens Neudecker, Koninklijke Bibliotheek
Matthew Christy, Co-Project Manager, IDHMC
Loretta Auvil, SEASR Analytics
Todd Samuelson, Cushing Memorial Library
emop.tamu.edu
2.
Navigating the Storm:
eMOP, Big DH Projects, and Agile Steering Standards
Initial Goals
Challenges or Failures
Analysis
New Directions
Adaptability
Navigating the Storm | @EMGrumbach | emop.tamu.edu
3.
Straight from the grant proposal…
“Our overarching goals”
1) Train three open-access OCR engines to “read” early modern
fonts
2) Map specific font training onto specific sets of documents
3) Create error-evaluation mechanisms for failed documents
4) Use crowd-sourced correction tools specific to OCR errors
5) Identify pages that are too flawed to be “readable”
6) Share our workflow procedure and results, so that the
community can use them in digitizing and transcribing early
modern documents.
4.
Main Collaborators
• CIIR: UMass Amherst
• IDHMC + Cushing Memorial Library: Texas A&M
• Koninklijke Bibliotheek: Netherlands
• Performant Software Solutions: Charlottesville, Virginia
• PRImA Labs: University of Salford, Manchester
• PSI Labs: Texas A&M
• SEASR: U of Illinois, Urbana-Champaign
5.
Data Contributors + Collaborators
Early English Books Online (EEBO)
Eighteenth Century Collections Online (ECCO)
Text Creation Partnership (TCP)
Brazos Computing Cluster (Texas A&M)
Main Collaborators
6.
Navigating the Storm:
eMOP, Big DH Projects, and Agile
Steering Standards
Laura Mandell, Principal Investigator, eMOP
Director, IDHMC
@mandellc
idhmc@tamu.edu
7.
Early Modern
Printing
• Individual, hand-made
typefaces
• Worn and broken type
• Poor quality equipment/paper
• Inconsistent line bases
• Unusual page layouts and
decorative page elements
• Special characters & ligatures
• Spelling variations
• Mixed typefaces and languages
Slides by Matthew Christy
8.
• Irregular Layouts
• Print Bleedthrough
9.
Document/Image
Quality
• Torn and damaged
pages
• Noise introduced to
images of pages
• Skewed pages
• Warped pages
• Missing pages
• Inverted pages
• Incorrect metadata
• Extremely low quality
TIFFs (~50K)
11.
Dream
Training Tesseract in different
fonts and applying them to the
documents printed in those
particular fonts will improve OCR
quality.
Reality
There may be as much
difference between one letter
and another in a specific font
as there is between letters in
different fonts.
12.
Training Tesseract
Aletheia
Created by PRImA Research Labs. A team of undergraduates uses Aletheia to identify each glyph on
the page images, and ensure that the correct Unicode value is assigned to each. Aletheia outputs an
XML file containing all identified glyphs on a page with their corresponding coordinates and Unicode
values.
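As a rough illustration of what such a glyph file makes possible, the sketch below reads glyph entries and returns each Unicode value with its bounding box. The embedded sample is a simplified, hypothetical stand-in for Aletheia's output: the element names (Glyph, Coords, Unicode) and the points attribute are assumptions modelled loosely on PAGE XML, and real files carry namespaces and deeper nesting.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified sample; not a verbatim Aletheia export.
SAMPLE = """<Page>
  <Glyph id="g1">
    <Coords points="10,20 30,20 30,50 10,50"/>
    <TextEquiv><Unicode>ſ</Unicode></TextEquiv>
  </Glyph>
  <Glyph id="g2">
    <Coords points="32,20 48,20 48,50 32,50"/>
    <TextEquiv><Unicode>t</Unicode></TextEquiv>
  </Glyph>
</Page>"""


def local(tag):
    """Strip an XML namespace prefix such as '{uri}Glyph' down to 'Glyph'."""
    return tag.rsplit("}", 1)[-1]


def read_glyphs(xml_text):
    """Return a list of (unicode_value, (left, top, right, bottom)) pairs."""
    glyphs = []
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        if local(elem.tag) != "Glyph":
            continue
        points, text = None, None
        for child in elem.iter():
            name = local(child.tag)
            if name == "Coords":
                points = child.get("points")
            elif name == "Unicode":
                text = child.text
        if not points or text is None:
            continue  # skip glyphs with no outline or no assigned value
        # Each point is "x,y"; collect all x values and all y values.
        xs, ys = zip(*(map(int, p.split(",")) for p in points.split()))
        glyphs.append((text, (min(xs), min(ys), max(xs), max(ys))))
    return glyphs


glyphs = read_glyphs(SAMPLE)
```

Matching on the local tag name keeps the sketch independent of whichever schema version a given file declares.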
13.
Training Tesseract
Franken+
1. Takes Aletheia's output files as input.
2. Groups all glyphs with the same Unicode values
into one window for comparison.
3. Mistakenly coded glyphs are easily identified and
re-coded.
4. A user can quickly compare all exemplars of a
glyph and choose just the best subset, if desired.
5. Uses all selected glyphs to create a Franken-page
image (TIFF) using a selected text as a base.
6. Outputs the same box files and TIFF images that
Tesseract's first stage of native training produces.
7. Also allows users to complete Tesseract training
using newly created box/TIFF file pairs, and add
optional dictionary and other files.
8. Outputs a .traineddata file used by Tesseract
when OCRing page images.
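Steps 2 and 6 above can be sketched in a few lines. This is not Franken+'s code: the function names and the sample glyph list are hypothetical. The only outside fact relied on is the Tesseract box-file line layout (character, left, bottom, right, top, page), with coordinates measured from the bottom-left corner of the image.

```python
from collections import defaultdict

# Hypothetical glyphs as (unicode_value, (left, top, right, bottom)),
# with pixel coordinates measured from the top-left of the page image.
GLYPHS = [
    ("ſ", (10, 20, 30, 50)),
    ("t", (32, 20, 48, 50)),
    ("ſ", (60, 22, 80, 52)),
]


def group_by_unicode(glyphs):
    """Collect every exemplar of the same Unicode value side by side."""
    groups = defaultdict(list)
    for char, box in glyphs:
        groups[char].append(box)
    return dict(groups)


def box_line(char, box, image_height, page=0):
    """Format one glyph as a Tesseract box-file line.

    Box files count y from the bottom of the image, so the top-left
    based pixel rows are flipped using the image height.
    """
    left, top, right, bottom = box
    return f"{char} {left} {image_height - bottom} {right} {image_height - top} {page}"


groups = group_by_unicode(GLYPHS)
lines = [box_line(char, box, image_height=100) for char, box in GLYPHS]
```

Grouping first is what makes a mis-coded glyph stand out: an odd shape among the exemplars of one value is easy to spot and re-code before any box file is written.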
Slides by Matthew Christy 13
14.
Navigating the Storm:
eMOP, Big DH Projects, and Agile
Steering Standards
Clemens Neudecker, Koninklijke Bibliotheek
@cneudecker
15.
The case of IMPACT
• IMPACT = IMProving ACcess to Text
• EU FP7, 2008 – 2012
• €16.7 M budget
• 22 partners (libraries, universities, companies)
• Goal: Significantly improve OCR for historical
documents
16.
Issue 1
• Expectation: The "IMPACT OCR"
• Reality: A collection of very diverse tools,
algorithms, etc. Some prototypes, some
commercial tools, different programming
languages, different levels of maturity etc.
• No integrated product possible!
17.
Issue 1
• Solution: Interoperability rather than integration
• Change: Individual applications as pluggable
modules in a web-based framework
• Result: Flexible framework with additional
benefits for testing, transparency, provenance
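The idea of pluggable modules with provenance can be illustrated with a toy registry. This is a minimal sketch and not the IMPACT framework: the registry, the module names, and the two text-cleaning steps are all hypothetical, and a real deployment would wrap separate tools behind web services instead of in-process functions.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical module registry: each tool is registered under a name
# and exposes the same simple interface (text in, text out).
REGISTRY: Dict[str, Callable[[str], str]] = {}


def register(name: str):
    """Decorator that plugs a function into the registry under a name."""
    def wrap(func: Callable[[str], str]) -> Callable[[str], str]:
        REGISTRY[name] = func
        return func
    return wrap


@register("normalise-long-s")
def normalise_long_s(text: str) -> str:
    return text.replace("ſ", "s")


@register("collapse-whitespace")
def collapse_whitespace(text: str) -> str:
    return " ".join(text.split())


def run_pipeline(steps: List[str], text: str) -> Tuple[str, List[str]]:
    """Apply the named modules in order and record which ones ran."""
    provenance = []
    for name in steps:
        text = REGISTRY[name](text)
        provenance.append(name)
    return text, provenance


result, provenance = run_pipeline(
    ["normalise-long-s", "collapse-whitespace"], "Bleſſed   are"
)
```

Because every module shares one interface, a tool can be swapped or tested in isolation, and the provenance list records exactly which steps produced a given output.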
18.
Issue 2
• Diversity: Librarians, Computer Scientists,
Computational Linguists, Humanists
• Are we really talking the same language?
• Different focus points in the project: applicable
solutions vs. academic publications
19.
Issue 2
• Solution: Create bonding activities, foster
atmosphere for knowledge exchange
• Change: Buddy programme, social games,
quizzes about partners
• Result: Understand your partners' background and
their way of thinking;
enrich the experience for everyone
20.
Large Digitisation Projects:
Two Key Perspectives
Apostolos Antonacopoulos
PRImA Research Lab
21.
Background
Since 2002 the PRImA Lab has been involved in large digitisation
projects, creating software tools for all stages of the workflow
• From Image Enhancement to Layout Analysis to OCR
• Use-scenario based evaluation of extracted text quality
• Crowd/Scholar-sourcing
Two general points are routinely underestimated:
• (Really) Understanding stakeholders and their roles
• (Real) Understanding of problems, their extent and the
effectiveness/requirements of potential solutions
22.
Stakeholders and their
roles
This seems obvious and is often mentioned, but the significance of
understanding this point and its effects is vastly underestimated
Content holders
• Keen for their content to be widely available and used
• Do not know their content well, nor its potential uses
Computer scientists
• Have technical expertise to solve many of the problems
• Do not know the material and its use to prioritise problems well
DH researchers – the catalysts
• Very knowledgeable of material and potential use
• Have technical skills complementary to those of computer scientists
23.
Problem understanding
At the start of each project everyone is eager to deliver “big” results but
it is important to identify and understand a few key problems and solve
them well
“Improve OCR results” is an ill-defined and short-sighted goal
• Measured in terms of word-accuracy, OCR results are of little use
• Layout is very important
• Even if all the words are recognised correctly, the reading order is unlikely to be
correct, limiting potentially interesting uses.
• Page numbers, captions, running headers etc. should not be mixed with body text
• Graphical elements / illustrations are important too
Think: Useful data (investment) vs. just more of any data (instant
gratification)
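The point about reading order can be made concrete with a small, hypothetical example: a two-column page whose OCR output reads straight across both columns. Every word is recognised, so an order-blind word accuracy is perfect, while an order-sensitive comparison is not. The sample text and both function names are illustrative assumptions, not metrics used by the project.

```python
from collections import Counter
from difflib import SequenceMatcher

# Hypothetical two-column page. Correct reading order: left column, then right.
TRUTH = "In the beginning was the word and the word was with God"
# OCR output that reads line by line across both columns instead.
OCR = "In the beginning and the word was the word was with God"


def bag_of_words_accuracy(truth: str, ocr: str) -> float:
    """Share of ground-truth words found in the OCR text, ignoring order."""
    truth_counts, ocr_counts = Counter(truth.split()), Counter(ocr.split())
    total = sum(truth_counts.values())
    if total == 0:
        return 0.0
    matched = sum((truth_counts & ocr_counts).values())
    return matched / total


def order_similarity(truth: str, ocr: str) -> float:
    """Order-sensitive similarity of the two word sequences (0.0 to 1.0)."""
    return SequenceMatcher(None, truth.split(), ocr.split()).ratio()


word_accuracy = bag_of_words_accuracy(TRUTH, OCR)
sequence_score = order_similarity(TRUTH, OCR)
```

Here word_accuracy is 1.0 while sequence_score falls below 1.0, which is the gap that layout analysis is meant to close.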
24.
Navigating the Storm:
eMOP, Big DH Projects, and Agile
Steering Standards
Elizabeth Grumbach, Co-Project Manager, IDHMC
@EMGrumbach
egrumbac@tamu.edu
25.
“If an electronic scholarly project can’t fail and
doesn’t produce new ignorance, then it isn’t
worth a damn.”
- John Unsworth
“Documenting the Reinvention of Text: The Importance of Failure”
26.
Navigating the Storm:
eMOP, Big DH Projects, and Agile Steering Standards
Initial Goals
Challenges or Failures
Analysis
New Directions
Adaptability
27.
Navigating the Storm:
eMOP, Big DH Projects, and Agile Steering Standards
Challenges or Failures
Analysis
New Directions
Adaptability
Challenges + Failures should be constantly or consistently communicated.
Analysis + New Directions should lead to research and communication with similar projects.
Adaptability should allow for new possibilities, new questions.
Editor's notes
eMOP, the Early Modern OCR Project, was funded by the Mellon Foundation for 734,000 over two years. Our initial goals were the following.
This influenced us a lot when we were discussing putting together this paper and presentation, as we have all come to this international, interdisciplinary grant project from similar projects. We have faced challenges to our initial premises and have not met milestones in the grant, yet we have produced interesting results and raised new research questions.