eMOP Book History Tools
Book History and Software Tools: Examining Typefaces for OCR Training in eMOP
Matt Christy,
Todd Samuelson,
Katayoun Torabi,
Bryan Tarpley,
Elizabeth Grumbach
eMOP Info
eMOP Website
• emop.tamu.edu/
• DH2014 Presentation: emop.tamu.edu/book-history-tools
• eMOP Workflows: emop.tamu.edu/workflows
• Mellon Grant Proposal: idhmc.tamu.edu/projects/Mellon/eMOPPublic.pdf
More eMOP
• Facebook: Early Modern OCR Project
• Twitter: #emop, @IDHMC_Nexus, @matt_christy, @EMGrumbach
DH2014 - Book History and Software Tools: Examining Typefaces for OCR Training in eMOP
Early Modern OCR Project
• The Early Modern OCR Project (eMOP) is a grant project funded by the Andrew W. Mellon Foundation and run out of the Initiative for Digital Humanities, Media, and Culture (IDHMC) at Texas A&M University. It develops and tests tools and techniques for applying Optical Character Recognition (OCR) to early modern English documents from the hand press period, roughly 1475-1800.
• eMOP aims to improve the visibility of early modern texts by making their contents fully searchable. The current paradigm of searching special collections for early modern materials by either metadata alone or “dirty” OCR is insufficient for scholarly research.
Specifically, eMOP’s goal is to make machine readable, or improve the readability of, 305,000 documents (45 million pages) of text from two major proprietary databases: Eighteenth Century Collections Online (ECCO) and Early English Books Online (EEBO).
More generally, our aim is to use typeface and book history techniques to train modern OCR engines specifically on the typefaces in our collection of documents, and thereby improve the accuracy of the OCR results.
Training Tesseract
Aletheia
www.primaresearch.org/tools.php
Available for free but requires registration.
• Created by PRImA Research Labs, University of Salford, UK.
• Windows-based tool.
• Developed as a groundtruth creation tool.
• Used by eMOP undergraduate student workers to create Tesseract training for a desired typeface.
• Can identify glyphs on a page image with page coordinates and Unicode values.
Aletheia: Workflow
• Binarization and Denoise are native Aletheia functions.
• A team of undergraduate student workers refines and corrects glyph boxes and Unicode values where needed.
• Output: a set of PAGE XML files with page coordinates and Unicode values for every identified glyph on each processed TIFF image.
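As a rough illustration of what such a PAGE XML file carries, the Python sketch below reads glyph boxes and Unicode values from a small sample. The sample document, the file name in it, and the function names are hypothetical and are not eMOP data or eMOP code. The sketch assumes the common PAGE layout in which each Glyph element holds a Coords element and a TextEquiv/Unicode element, and it accepts both the newer `points` attribute and the older `Point` child elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-glyph sample in the PAGE 2013-07-15 layout (not eMOP data).
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="page_0001.tif" imageWidth="1200" imageHeight="1800">
    <TextRegion id="r0">
      <TextLine id="l0">
        <Word id="w0">
          <Glyph id="c0">
            <Coords points="100,200 130,200 130,250 100,250"/>
            <TextEquiv><Unicode>T</Unicode></TextEquiv>
          </Glyph>
          <Glyph id="c1">
            <Coords points="135,210 150,210 150,250 135,250"/>
            <TextEquiv><Unicode>ſ</Unicode></TextEquiv>
          </Glyph>
        </Word>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>
"""


def local(tag):
    """Strip the XML namespace so the code works across PAGE schema versions."""
    return tag.rsplit("}", 1)[-1]


def read_glyphs(xml_text):
    """Return (unicode_value, x_min, y_min, x_max, y_max) for every glyph."""
    root = ET.fromstring(xml_text)
    glyphs = []
    for elem in root.iter():
        if local(elem.tag) != "Glyph":
            continue
        points, text = [], ""
        for child in elem.iter():
            name = local(child.tag)
            if name == "Coords" and child.get("points"):
                # Newer schemas: points="x1,y1 x2,y2 ..."
                points = [tuple(int(v) for v in p.split(","))
                          for p in child.get("points").split()]
            elif name == "Point":
                # Older schemas: <Point x=".." y=".."/> children of Coords
                points.append((int(child.get("x")), int(child.get("y"))))
            elif name == "Unicode":
                text = child.text or ""
        if not points:
            continue  # glyph without coordinates: nothing to box
        xs = [x for x, _ in points]
        ys = [y for _, y in points]
        glyphs.append((text, min(xs), min(ys), max(xs), max(ys)))
    return glyphs


for glyph in read_glyphs(SAMPLE):
    print(glyph)
```

Reducing each glyph polygon to its bounding rectangle is a simplification chosen here because Tesseract box files store rectangles.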
Aletheia: Glyph Recognition
Uses Tesseract to find glyphs
Aletheia: I/O
We then convert each PAGE XML file to a Tesseract box file using XSLT.
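The conversion itself is done with XSLT in eMOP. Purely to show the shape of the target format, here is a minimal Python sketch (a swapped-in language, not the project's stylesheet). It assumes glyph rectangles have already been extracted with a top-left origin, as in PAGE XML, and writes Tesseract box lines of the form `char left bottom right top page`, where Tesseract measures y from the bottom of the image. The function name, sample glyphs, and image height are hypothetical.

```python
def to_box_lines(glyphs, image_height, page=0):
    """Convert top-left-origin glyph rectangles to Tesseract box-file lines.

    glyphs: iterable of (char, x_min, y_min, x_max, y_max) in image coordinates.
    """
    lines = []
    for char, x_min, y_min, x_max, y_max in glyphs:
        # Flip the y axis: box files count pixels up from the bottom edge.
        bottom = image_height - y_max
        top = image_height - y_min
        lines.append(f"{char} {x_min} {bottom} {x_max} {top} {page}")
    return lines


# Hypothetical glyphs on a page image that is 1800 pixels tall.
glyphs = [("T", 100, 200, 130, 250), ("ſ", 135, 210, 150, 250)]
box_lines = to_box_lines(glyphs, image_height=1800)
print("\n".join(box_lines))
```

The y flip is the step most easily missed: the same rectangle has different numbers in the two formats even though nothing on the page has moved.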
Tesseract Training
Franken+
1. Windows-based tool that uses a MySQL DB.
2. Developed for eMOP by IDHMC graduate student worker Bryan Tarpley.
3. Designed to be easily used by eMOP undergraduate student workers.
4. Takes Aletheia's output files as input.
5. Outputs the same box files and TIFF images as the first stage of Tesseract's native training.
• Available open-source at: github.com/idhmc-tamu/FrankenPlus
Franken+ Workflow
1. Groups all glyphs with the same Unicode value into one window for comparison.
2. Uses all selected glyphs to create a Franken-page image (TIFF) using a selected text as a base.
3. Outputs the same box files and TIFF images as the first stage of Tesseract's native training.
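The grouping and page-building steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Franken+'s implementation (which is a Windows tool backed by MySQL). Glyph exemplars are represented as hypothetical dictionaries with an id, a character, and a pixel size; exemplars for a character are reused in rotation; every glyph sits on a single baseline (descenders are ignored); and only the placement boxes are computed, without pasting any image data.

```python
from collections import defaultdict
from itertools import cycle


def group_by_unicode(glyphs):
    """Step 1: collect all exemplars that share a Unicode value."""
    groups = defaultdict(list)
    for glyph in glyphs:
        groups[glyph["char"]].append(glyph)
    return dict(groups)


def layout_line(text, groups, x=10, baseline_y=100, gap=2, space=12):
    """Step 2: place one exemplar per character of the base text.

    Returns (exemplar_id, char, left, top, right, bottom) tuples in image
    coordinates (origin at the top-left corner).
    """
    pickers = {ch: cycle(exemplars) for ch, exemplars in groups.items()}
    placements = []
    for ch in text:
        if ch == " ":
            x += space
            continue
        if ch not in pickers:
            continue  # no exemplar selected for this character: skip it
        glyph = next(pickers[ch])
        placements.append((glyph["id"], ch, x, baseline_y - glyph["h"],
                           x + glyph["w"], baseline_y))
        x += glyph["w"] + gap
    return placements


# Hypothetical exemplars: two for "a", one for "b".
exemplars = [
    {"id": "a1", "char": "a", "w": 10, "h": 12},
    {"id": "a2", "char": "a", "w": 11, "h": 12},
    {"id": "b1", "char": "b", "w": 10, "h": 18},
]
groups = group_by_unicode(exemplars)
placements = layout_line("ab a", groups)
for placement in placements:
    print(placement)
```

Because each placement records both the character and its rectangle, the same list could drive image compositing and box-file output, which is why the synthetic page and its box file stay in agreement.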
Franken+ Ingestion
Franken+
• All exemplars of the same glyph are displayed together.
• Users can quickly identify and deselect:
  • Incorrectly labeled glyphs
  • Incomplete glyphs
  • Unrepresentative exemplars
  • Different-sized glyphs
Franken+
Training Tesseract
Thiſ great conſumption to a fever turn'd,
And ſo the oꝗld had fitſ; it joy'd, it mourn'd;
And, aſ men thinke, that Agueſ phy ck are,
And th'Ague being ſpent, give over care.
Žo thou cke World, mꝗſtak'ſt thy ſelże to bee
Well, when ãlaſ, thou'rt in a Lethargie.
Her death did wound and tame thee than, and than
Thou might'ſt ha e better ſpar'd the Sunne, or man.
That wound waſ deep, but 'tiſ more miżery,
That thou haſt loſt thy ſenſe and memor .
'Twaſ heavy then to heare thy voyce of mone,
But thiſ iſ worſe, that thou art ſpeechle e growne.
Thou haſt forgot thy name thou hadſt; thou waſt
Nothing but ee, and her thou haſt o'rpaſt.
For aſ a child kept from the Fount, untill
Ä prince, expe ed long, come to fulfill
The ceremonieſ, thou unnam'd had'ſt laid,
Had not her comming, thee her palace made:
Her name defin'd thee, gave thee forme, and frame,
And thou forgett'ſt to celebrate th n me.
Some monethſ e hath beene dead (but beìng dead,
Meaſureſ of timeſ are all determined)
But long e'ath beene away, long, long, et none
Offerſ to tell uſ who it iſ that'ſ gone.
But aſ in ſtateſ doubtfull of future heireſ,
When ckne e without remedie empaireſ
The preſent Prince, they're loth it ould be ſaid,
The Prince doth langui , or the Prince iſ dead:
So mankinde feeling no a generall tha ,
Franken+ Results
AFTER
BEFORE
eMOP Tesseract Training
S-face / Y-face
Weiss, Adrian. “Font Analysis as a Bibliographical Method: the Elizabethan Play-Quarto Printers and Compositors.” Studies in Bibliography 43 (1990): 95-164.
• Weiss organized late 16th- and early 17th-century typefaces into these two general types (named for the first works in which they were identified):
  • Y-face, from an edition of The Malcontents
  • S-face, from Ben Jonson's Sejanus
S-face / Y-face
Other Applications
• A close examination of the typefaces used by a printer
• An investigation of the typefaces used in a work or in the same editions of a work
• A reexamination of typefaces classified via a system (Proctor-Haebler)
The end
For eMOP questions please contact us at:
mchristy@tamu.edu
egrumbac@tamu.edu



Editor's notes

  1. Aletheia: Created by PRImA Research Labs at the University of Salford, as a groundtruth creation tool. A team of undergraduates uses Aletheia to identify each glyph on the page images, and ensure that the correct Unicode value is assigned to each. Aletheia outputs an XML file containing all identified glyphs on a page with their corresponding coordinates and Unicode values.
  2. This is cheating: the result of scanning the same page we used to create the training.
  3. So we think that Franken+ can be a really useful tool for the close examination of typefaces and book history, and opens up some of the admittedly tedious work to non-experts.