SlideShare une entreprise Scribd logo
1  sur  1
Télécharger pour lire hors ligne
USE CASE STUDY
We investigated OCR-related issues of concrete queries on a
digital newspaper archive. We tried to better understand
the corpus and technical details on how it was created,
how to estimate the impact of the OCR errors based on the
confidence values assigned by the OCR engine,
and what search strategies may yield better results.
We found that not all details of the digitization pipeline are known by the
collection owner. Scholars are therefore not given sufficient information to
assess the reliability of their results.
We believe that this is typical for large archives and libraries.
M. Traub, J. van Ossenbruggen, and L. Hardman, 

Impact Analysis of OCR Quality on Research Tasks in Digital Archives,
TPDL2015, September, 2015, Poznan, Poland
INTERVIEWS
Categorization of research tasks
T1 find the first mention of a concept
T2 find a subset with relevant documents
T3 investigate quantitative results over time
T3.a compare quantitative results for two terms
T3.b compare quantitative results from two corpora
T4 tasks using external tools on archive data
IMPACT ANALYSIS OF OCR QUALITY ON
RESEARCH TASKS IN DIGITAL ARCHIVES
Myriam C. Traub, Jacco van Ossenbruggen, and Lynda Hardman
Centrum Wiskunde & Informatica, Amsterdam
The shift in research methodology within the digital
humanities has the hidden cost of working with digitally
processed historical documents:
How much trust can a scholar place in noisy
representations of source texts?
To improve upon the current situation, we think the communities
involved should begin to approach the problem from the scholars’
perspective.
Humanity scholars need to transfer their valuable tradition of source
criticism into the digital realm, and more openly criticize the potential
limitations and biases of the digital tools we provide them with.
We interviewed four humanities scholars to find out what 

research tasks humanities scholars typically perform on digital
archives.
Our main findings were
that scholars use digital archives mainly for exploration or to select
items for close reading,
that their interest in quantitative analyses is limited because of the
perceived low OCR quality, and
a categorization of common research tasks (see table below).
LITERATURE STUDY
Quality in digitization projects of libraries is mostly studied from the point of
view of the collection owner, and not from the perspective of the scholar.
User-centric studies on digital libraries typically focus on user interface
design and other usability issues.
Ample research is done on how to reduce the error rates of OCRed 

text in a post-processing phase or how to improve OCR engines.
None of the approaches aiming at improving data quality can remove all
errors.
Scholars still need to be able to quantify the remaining errors and assess
their impact on their research tasks.
Confusion Matrix OCR Confidence
Values
Alternative
Confidence
Values
available: sample only full corpus not available
T1 find all queries for x,
impractical
estimated precision, not
helpful
improve recall
T2 as above estimated precision,
requires improved UI
improve recall
T3 pattern summarized over
set of alternative queries
estimates of corrected
precision
estimates of
corrected recall
T3.a warn for different
susceptibility to errors
as above, warn for
different distribution of
confidence values
as above
T3.b as above as above as above
The different types of tasks require different levels of quality. A confusion
matrix or confidence values can be used to create a larger query set.
Thereby, scholars may be able to better estimate the quality and also (to
some extent) to compensate low quality.

Contenu connexe

Dernier

Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptx
gindu3009
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Sérgio Sacani
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and Classifications
Areesha Ahmad
 
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
ssuser79fe74
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Lokesh Kothari
 
Conjugation, transduction and transformation
Conjugation, transduction and transformationConjugation, transduction and transformation
Conjugation, transduction and transformation
Areesha Ahmad
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disks
Sérgio Sacani
 

Dernier (20)

Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptx
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
 
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bNightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
 
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and Classifications
 
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verifiedConnaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
 
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
 
Site Acceptance Test .
Site Acceptance Test                    .Site Acceptance Test                    .
Site Acceptance Test .
 
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
 
Conjugation, transduction and transformation
Conjugation, transduction and transformationConjugation, transduction and transformation
Conjugation, transduction and transformation
 
Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.
 
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
 
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICESAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
 
American Type Culture Collection (ATCC).pptx
American Type Culture Collection (ATCC).pptxAmerican Type Culture Collection (ATCC).pptx
American Type Culture Collection (ATCC).pptx
 
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts ServiceJustdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
 
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticsPulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disks
 

En vedette

How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental Health
ThinkNow
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
Kurio // The Social Media Age(ncy)
 

En vedette (20)

2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot
 
Everything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTEverything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPT
 
Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage Engineerings
 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental Health
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
 
Skeleton Culture Code
Skeleton Culture CodeSkeleton Culture Code
Skeleton Culture Code
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 

Impact Analysis of OCR Quality on Research Tasks in Digital Archives

  • 1. USE CASE STUDY We investigated OCR-related issues of concrete queries on a digital newspaper archive. We tried to better understand the corpus and technical details on how it was created, how to estimate the impact of the OCR errors based on the confidence values assigned by the OCR engine, and what search strategies may yield better results. We found that not all details of the digitization pipeline are known by the collection owner. Scholars are therefore not given sufficient information to assess the reliability of their results. We believe that this is typical for large archives and libraries. M. Traub, J. van Ossenbruggen, and L. Hardman, Impact Analysis of OCR Quality on Research Tasks in Digital Archives, TPDL2015, September, 2015, Poznan, Poland INTERVIEWS Categorization of research tasks T1 find the first mention of a concept T2 find a subset with relevant documents T3 investigate quantitative results over time T3.a compare quantitative results for two terms T3.b compare quantitative results from two corpora T4 tasks using external tools on archive data IMPACT ANALYSIS OF OCR QUALITY ON RESEARCH TASKS IN DIGITAL ARCHIVES Myriam C. Traub, Jacco van Ossenbruggen, and Lynda Hardman Centrum Wiskunde & Informatica, Amsterdam The shift in research methodology within the digital humanities has the hidden cost of working with digitally processed historical documents: How much trust can a scholar place in noisy representations of source texts? To improve upon the current situation, we think the communities involved should begin to approach the problem from the scholars’ perspective. Humanity scholars need to transfer their valuable tradition of source criticism into the digital realm, and more openly criticize the potential limitations and biases of the digital tools we provide them with. We interviewed four humanities scholars to find out what 
 research tasks humanities scholars typically perform on digital archives. Our main findings were that scholars use digital archives mainly for exploration or to select items for close reading, that their interest in quantitative analyses is limited because of the perceived low OCR quality, and a categorization of common research tasks (see table below). LITERATURE STUDY Quality in digitization projects of libraries is mostly studied from the point of view of the collection owner, and not from the perspective of the scholar. User-centric studies on digital libraries typically focus on user interface design and other usability issues. Ample research is done on how to reduce the error rates of OCRed 
 text in a post-processing phase or how to improve OCR engines. None of the approaches aiming at improving data quality can remove all errors. Scholars still need to be able to quantify the remaining errors and assess their impact on their research tasks. Confusion Matrix OCR Confidence Values Alternative Confidence Values available: sample only full corpus not available T1 find all queries for x, impractical estimated precision, not helpful improve recall T2 as above estimated precision, requires improved UI improve recall T3 pattern summarized over set of alternative queries estimates of corrected precision estimates of corrected recall T3.a warn for different susceptibility to errors as above, warn for different distribution of confidence values as above T3.b as above as above as above The different types of tasks require different levels of quality. A confusion matrix or confidence values can be used to create a larger query set. Thereby, scholars may be able to better estimate the quality and also (to some extent) to compensate low quality.