eMOP’s Printers and Publishers:
Toward Crafting an Early Modern Print Database
Matthew Christy,
Elizabeth Grumbach
eMOP Info
• emop.tamu.edu
eMOP Resources
• eMOP ImprintDB: github.com/Early-Modern-OCR/ImprintDB
• Mellon Grant Proposal: idhmc.tamu.edu/projects/Mellon/eMOPPublic.pdf
More eMOP
• Facebook: The Early Modern OCR Project
• Twitter: #emop, @IDHMC_Nexus, @mandellc, @matt_christy, @EMGrumbach
Digital Frontiers 2015 - eMOP ImprintDB, Sept. 18, 2015
Goals
• The Early Modern OCR Project (eMOP) is an Andrew W. Mellon Foundation funded grant project running out of the Initiative for Digital Humanities, Media, and Culture (IDHMC) at Texas A&M University, to develop and test tools and techniques to apply Optical Character Recognition (OCR) to early modern English documents from the hand press period, roughly 1475-1800.
• eMOP aims to improve the visibility of early modern texts by making their contents fully searchable. The current paradigm of searching special collections for early modern materials by either metadata alone or “dirty” OCR is insufficient for scholarly research.
Wrangling Data
The Numbers
• EEBO: ~125,000 documents, ~13 million page images (1475-1700)
• ECCO: ~182,000 documents, ~32 million page images (1700-1800)
• TCP: ~46,000 double-keyed hand transcriptions (44,000 EEBO, 2,200 ECCO) – Groundtruth
• Total: >300,000 documents & ~45 million page images.
The Data
• ECCO page images (1 pg/image)
• ECCO original OCR results (doc-level XML files)
• ECCO TCP transcriptions (doc-level XML and text files)
• EEBO page images (2 pgs/image)
• EEBO TCP transcriptions (doc-level XML and text files)
eMOP DB
• Document metadata
• File locations
  • Page images
  • Pages text
  • Groundtruth text
• OCR Results
  • Pages text
  • Scores against Groundtruth
• Results of analysis
  • noise measure
  • skew measure
  • multiple column coords
  • corrections made
Early Modern Imprints: The Problems
• Missing
• Incorrect
  • accidentally by printer
  • accidentally by DB provider
  • purposefully
• No standard format or consistent inclusion of information
• Inconsistent spelling and use of initials
• Use of conversational language
• Use of non-English (Latin, Welsh)
  • or a mix of languages
Imprinted at London : by John Jugge, dwellyng at the north doore of Paules
Early Modern Imprints: The Solution
• Iterative application of regular expressions to cull out the data:
  • Who the work was Printed By
  • Who the work was Printed For
  • Who the work was Sold By
  • The Place of printing (London, Cambridge, Dublin, etc.)
  • The Location of printing (“the north doore of Paules”)
  • Date (gathered from separate metadata field)
Printed by: Iohn Iugge
Place: London
Location: the north doore of Paules
Date: 1580?
Terms to identify the printer:
• “printed”, sometimes also accompanied by “by”
• prynted
• reprinted or re-printed
• imprinted
• pressed
• brintwyd (Welsh)
• Typis, presso, pressare, excudebat, … (Latin)
• etc. etc.
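The cue list above can be sketched as a pair of regular expressions. This is a minimal Python illustration, not eMOP's actual patterns: the names `PRINTER_CUES`, `printer_cue`, and `printed_by` are hypothetical, the cue list is only as complete as the slide's (whose Latin list ends in "…"), and the name extraction assumes the English connective "by" within the same clause.

```python
import re

# Cue words copied from the slide; the list there is explicitly incomplete.
PRINTER_CUES = [
    r"re-?printed", "imprinted", "printed", "prynted", "pressed",
    "brintwyd", "typis", "presso", "pressare", "excudebat",
]
CUE = r"\b(?:" + "|".join(PRINTER_CUES) + r")\b"

# Finds any printer cue word, in any of the listed languages.
CUE_RE = re.compile(CUE, re.IGNORECASE)

# Finds a cue, then "by" later in the same clause, then a name that
# runs up to the next comma, semicolon, or full stop.
PRINTED_BY_RE = re.compile(
    CUE + r"[^;,.]*?\bby\s+(?P<name>[^;,.]+)", re.IGNORECASE
)


def printer_cue(imprint):
    """Return the first printer cue word found (lowercased), or None."""
    match = CUE_RE.search(imprint)
    return match.group(0).lower() if match else None


def printed_by(imprint):
    """Return the name following '<cue> ... by', or None if there is none."""
    match = PRINTED_BY_RE.search(imprint)
    return match.group("name").strip() if match else None


jugge = "Imprinted at London : by John Jugge, dwellyng at the north doore of Paules"
print(printer_cue(jugge), "|", printed_by(jugge))
```

A single pass like this misses many cases (Latin and Welsh connectives, names given as initials, "printed for ... and sold by ..." in one clause), which is presumably why the slide describes the process as iterative.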
Results
<work>
  <emopNO>140776</emopNO>
  <eccoNO>67101600</eccoNO>
  <tcpNO>NULL</tcpNO>
  <estcNO>T077294</estcNO>
  <imprintORIG>[London] : In the Savoy: printed by John Nutt; for John Walthoe, 1713.</imprintORIG>
  <date>1713</date>
  <imprintCLN>London : in the Savoy: printed by John Nutt; for John Walthoe,</imprintCLN>
  <place>London</place>
  <printedBy>John Nutt</printedBy>
  <printedFor>John Walthoe</printedFor>
  <location>in the Savoy</location>
</work>
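A record like the one above can be read with Python's standard `xml.etree.ElementTree`. This is a minimal sketch; the function name `work_to_dict` is hypothetical, and treating the literal string "NULL" as a missing value is an assumption based on the `<tcpNO>` element in this one sample.

```python
import xml.etree.ElementTree as ET

# The sample record from the slide, with wrapped lines rejoined.
SAMPLE = """<work>
  <emopNO>140776</emopNO>
  <eccoNO>67101600</eccoNO>
  <tcpNO>NULL</tcpNO>
  <estcNO>T077294</estcNO>
  <imprintORIG>[London] : In the Savoy: printed by John Nutt; for John Walthoe, 1713.</imprintORIG>
  <date>1713</date>
  <imprintCLN>London : in the Savoy: printed by John Nutt; for John Walthoe,</imprintCLN>
  <place>London</place>
  <printedBy>John Nutt</printedBy>
  <printedFor>John Walthoe</printedFor>
  <location>in the Savoy</location>
</work>"""


def work_to_dict(xml_text):
    """Map each child element of <work> to its text; 'NULL' or empty becomes None."""
    work = ET.fromstring(xml_text)
    record = {}
    for child in work:
        text = (child.text or "").strip()
        # Assumption: "NULL" marks an identifier that was not available.
        record[child.tag] = None if text in ("", "NULL") else text
    return record


record = work_to_dict(SAMPLE)
print(record["printedBy"], "/", record["printedFor"], "/", record["date"])
```

Keeping both `imprintORIG` and `imprintCLN` in the dictionary makes it easy to check the extracted fields against the original imprint line.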
source: http://bit.ly/1hXpVpd
eMOP Outcomes - Github
https://github.com/Early-Modern-OCR/ImprintDB
Outcomes – DB of EM Printers
source: http://blog.volkovlaw.com/2013/03/the-future-of-compliance-what-will-the-new-tools-look-like/
The end
For eMOP questions please contact us at:
mchristy@tamu.edu
egrumbac@tamu.edu
mandell@tamu.edu


Editor's Notes

  1. [LIZ: this is just my “find out more info about eMOP” slide]
  2. The Early Modern OCR Project (eMOP) is an Andrew W. Mellon Foundation funded grant project running out of the IDHMC at Texas A&M. Our goal is to develop and test tools and techniques to improve Optical Character Recognition (or OCR) outcomes for printed English documents from the hand press period, roughly 1475-1800. The basic premise of eMOP is to use typeface and book history techniques to train modern OCR engines specifically on the typefaces in our collection of documents, and thereby improve the accuracy of the OCR results. eMOP’s immediate goal is to make machine readable, or improve the readability of, 45 million pages of text from two major proprietary databases: Eighteenth Century Collections Online (ECCO) and Early English Books Online (EEBO). Generally, eMOP aims to improve the visibility of early modern texts by making their contents fully searchable. The current paradigm of searching special collections for early modern materials by either metadata alone or “dirty” OCR is inefficient for scholarly research.
  3. [LIZ: Not sure you want this in here or not, sometimes it’s useful] This slide gives you a good idea of the timeline of the documents we worked with on eMOP, which constitutes the “early modern” period. We worked with over 300,000 documents. You can also see information on the right about orthography (spelling/language conventions) and typefaces for this period. These were both issues that complicated eMOP. EEBO, for example, had quite a lot of blackletter typefaces which were quite distinct from each other, requiring more specific OCR training for those font faces. And spelling irregularity through this period made checking and correcting OCR output that much more difficult. (We used alternate spelling lists that we gathered from several places.)
  4. The data came to us in many different forms that had to be normalized, organized, and scraped for metadata. For example, some page images were 1 page per image and some were 2 pages per image, so we had to figure out a page number scheme that would be consistent and would match up with groundtruth files either way. Metadata also had to be generated from a variety of different formats and sources (XML files, Excel spreadsheets, Unix file info, etc., depending on the source) and then ingested to create the eMOP DB. That in itself was a major undertaking that took about the first 3 months of the project. Afterwards we continued to discover that some of the metadata supplied to us was incorrect, contradictory, or missing, so we had to do as much cleanup of the data as we could as we went along. We still have more cleanup to do, but we think we have one of the most comprehensive and correct collections of metadata for early modern printed documents.
  5. So that is an overview of the eMOP DB, and here’s an internal view of the eMOP DB. The details aren’t necessarily important, except to say that our DB contains all of this information on each individual page from EEBO & ECCO, which is a great deal. [ENTER] And from this eMOP DB, what we extracted to create the printers and publishers DB was actually from this single field. This wks_publisher field contained the imprint line from each individual work OCR’d by eMOP in the last two years (all 300,000+ documents). Our original intention was to use this data to identify which printers used which typefaces over the early modern period. We would then apply that research to our OCR font training and specifically apply, for example, Caslon font training to a document printed with the Caslon typeface, much like our “modern” OCR engines are trained to read modern fonts like Helvetica or Times New Roman. After much trial and error, we realized that 1) while we could connect some printers and typefaces, creating a database with this information would take much longer than the grant period and 2) these connections weren’t needed due to the way the Tesseract OCR engine worked. With that said, it’s still an eMOP goal to produce this DB, so we took the first steps to creating an “Imprint Database” containing all the information we could glean from the Imprint line of these books (confirming that information with other metadata on hand, in the eMOP DB).
  6. There are, however, a lot of issues involved with programmatically identifying publishing information from early modern imprints. Imprints can be missing or misleading, either intentionally (political pamphlets) or unintentionally, or the data could have been entered wrong into the DB by the provider. They use conversational English: there was no standard format for how this information was displayed, and it was usually stylized as a “sentence” rather than the more easily algorithmically identifiable publisher information in a modern book. There is also a good deal of non-English or mixed-language information in these imprint lines. The use of Latin place names was particularly common, but my favorite examples come from the Welsh language documents that we have in our DB. It’s relatively easy to find someone with experience in medieval Latin, but slightly harder to find an expert in early modern Welsh. In the image on the right we can see that the imprint (at the bottom) [hit Enter] contains [hit Enter]: conversational English, in the form of a sentence; the use of “I” in place of “J”, which complicates things; and “dwellyng” (is that the publishing location?), which is an example of inconsistent spelling and something that we didn’t take into account in our first pass at algorithmically pulling out publishing location information.
  7. To do this work we used a set of regular expressions applied over several iterations. The regular expressions looked for cues (key words, initials, punctuation, etc.) to break the phrase up and then identify the category: personal name; place name; role (who the work was printed for / who it was printed by); and location of printing, for which we looked for things like “at” or “by” (but not preceded by “sold”, “printed”, etc.). To give you an idea of how complicated this was [hit Enter], we discovered that the printer of a document could be identified by a number of keywords (or cues…or clues): “printed”, sometimes also accompanied by “by”; prynted; reprinted or re-printed; imprinted; pressed; brintwyd (Welsh); Typis, presso, pressare, excudebat, … (Latin).
  8. The result was a text file (one each for EEBO and ECCO separately), and those text files contained: the eMOP # of the work (from the eMOP DB); the original imprint line; a cleaned version of the imprint line; the date (from the eMOP DB; this is based on metadata created by the providers, Gale-Cengage Learning and ProQuest); and, when available, Place, Printed By, Printed For, Sold By, and Location. We kept the original and a cleaned/formatted version of the imprint line as a reference. We don’t expect that we got everything right with our regular expressions, so this is a way for scholars, or anyone that we collaborate with in the future, to see the original imprint line in order to double-check or correct the formatted imprint data. We then transformed the text files into XML files for easy use and portability to multiple formats. (We can use this to ingest into a DB, transform to HTML, turn into a spreadsheet, etc.) We also added ESTC and TCP numbers for each document, when we had them. We collected this information from various sources. And we also added ECCO and EEBO numbers as identifiers. For ECCO there is one number, but for EEBO there are two. It was the eMOP DB that allowed us to tie all these numbers together and to the Imprints.
  9. We’ve taken these XML files we created and made them available on Github for anyone to make use of, with a few caveats. Because these imprints came to us from proprietary sources, we can’t technically share them. However, we can share the imprint info from those works which are available via the ESTC (English Short Title Catalogue). So on the Github page the XML files contain only those imprints which have ESTC numbers: for EEBO that’s 115,789 out of ~139,000 works (84%), and for ECCO that’s 207,662 out of ~211,000 works (98%). We also included the schemas used to validate these files in Github. We would really like it if scholars who download these files let us know if they end up finding problems and making corrections. This Github page is also available via the eMOP website [hit Enter] at emop.tamu.edu. Just click on the Github Repo tab to see all of eMOP’s open resources in Github.
  10. So, going forward we want to implement a better solution for sharing this data and having some kind of centralized clearing-house for it. We are planning on implementing this as a single online database (eXistDB) to make it easily searchable. We also want to create an online mechanism, via some kind of form, which would allow users to identify errors and request corrections, keeping this work in a single location for everyone to take advantage of. Eventually, we can tie this DB to other available open, online DBs, using the identifying numbers (eMOP, ESTC, ECCO, EEBO, TCP).
  11. Please contact us with any questions.
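The text-to-XML step described in note 8 might look something like the following. This is a minimal Python sketch under stated assumptions, not the project's actual conversion script: the element names follow the sample record on the Results slide, `soldBy` is a hypothetical element name (the sample has no Sold By value), writing "NULL" for a missing identifier mirrors the sample's `<tcpNO>` but is otherwise a guess, and the two EEBO identifiers mentioned in the notes are not modelled.

```python
import xml.etree.ElementTree as ET

# Element names taken from the sample record; "soldBy" is hypothetical.
ID_FIELDS = ["emopNO", "eccoNO", "tcpNO", "estcNO"]
REQUIRED_FIELDS = ["imprintORIG", "date", "imprintCLN"]
OPTIONAL_FIELDS = ["place", "printedBy", "printedFor", "soldBy", "location"]


def build_work(record):
    """Build a <work> element from a dict of extracted imprint fields."""
    work = ET.Element("work")
    for tag in ID_FIELDS:
        # Assumption: identifiers are always written, with "NULL" if unknown.
        ET.SubElement(work, tag).text = str(record.get(tag) or "NULL")
    for tag in REQUIRED_FIELDS:
        ET.SubElement(work, tag).text = record[tag]
    for tag in OPTIONAL_FIELDS:
        # "When available" fields are omitted entirely if empty.
        if record.get(tag):
            ET.SubElement(work, tag).text = record[tag]
    return work


parsed = {
    "emopNO": "140776",
    "eccoNO": "67101600",
    "estcNO": "T077294",
    "imprintORIG": "[London] : In the Savoy: printed by John Nutt; for John Walthoe, 1713.",
    "date": "1713",
    "imprintCLN": "London : in the Savoy: printed by John Nutt; for John Walthoe,",
    "place": "London",
    "printedBy": "John Nutt",
    "printedFor": "John Walthoe",
    "location": "in the Savoy",
}
xml_text = ET.tostring(build_work(parsed), encoding="unicode")
print(xml_text)
```

Writing through an XML library rather than string concatenation means characters such as "&" in an imprint line are escaped automatically, which matters when the files are later validated against a schema.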