Evaluation Infrastructure or: How do I know my digitised content is any good? 
Apostolos Antonacopoulos 
PRImA Research Lab
Why evaluate? What to evaluate? How? 
Content Holders - Why? 
•Objectively assess what can be expected from current best OCR 
•Prioritise different material 
•Re-scan / re-OCR existing content? 
•Specify precise service contracts 
•QA of results from service providers 
Developers / Contractors - Why? 
•Select best workflow components 
•For different batches of material 
•Identify performance bottlenecks 
•Tune performance of system components 
•Quality Assurance 
OK… What to evaluate? 
Isn’t Word Accuracy enough? 
•No: on its own it is of relatively little help. For anything other than querying a single word, it is important to first have accurate:
•Layout 
•Reading order 
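A minimal sketch of why word accuracy alone can mislead, assuming a made-up two-column page and two illustrative helper functions (`bag_of_words_accuracy` and `phrase_found` are hypothetical names, not part of any PRImA tool). Every word is recognised correctly, yet a reading-order error breaks phrase search:

```python
from collections import Counter


def bag_of_words_accuracy(ground_truth: str, ocr_text: str) -> float:
    """Fraction of ground-truth words found in the OCR text, ignoring order."""
    gt_counts = Counter(ground_truth.split())
    ocr_counts = Counter(ocr_text.split())
    # Multiset intersection: each ground-truth word is matched at most once.
    matched = sum((gt_counts & ocr_counts).values())
    total = sum(gt_counts.values())
    return matched / total if total else 1.0


def phrase_found(phrase: str, ocr_text: str) -> bool:
    """True if the words of `phrase` appear consecutively in the OCR text."""
    words, target = ocr_text.split(), phrase.split()
    return any(words[i:i + len(target)] == target
               for i in range(len(words) - len(target) + 1))


# Invented example page: the left column reads "the quick / brown fox",
# the right column reads "jumps over / the dog".
ground_truth = "the quick brown fox jumps over the dog"
# All words are correct, but the text was read across both columns line by
# line instead of column by column (a layout / reading-order error).
ocr_text = "the quick jumps over brown fox the dog"

print(bag_of_words_accuracy(ground_truth, ocr_text))  # 1.0
print(phrase_found("brown fox jumps", ocr_text))      # False
```

The order-insensitive score is perfect, while a phrase spanning the column break can no longer be found.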
What else? 
Documents also contain graphical elements
•An often-ignored fact!
And even for difficult-to-OCR documents
•Layout still provides useful information (location of headers, page numbers, etc.)
In a Nutshell 
As obvious as it may sound… 
Need to evaluate according to different Use Scenarios 
Use Scenario Examples 
•Keyword search 
•Phrase search 
•Newspaper article search 
•ToC / book structure extraction 
•Layout re-flowing for mobile browsing 
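One way to picture scenario-based evaluation is as a weighted combination of per-aspect scores, where each use scenario weights text, layout, and reading order differently. The sketch below is illustrative only: the `Scenario` class, the weights, and the page scores are all invented here and are not PRImA's actual evaluation profiles.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """Hypothetical use scenario: how much each quality aspect matters."""
    name: str
    text_weight: float
    layout_weight: float
    reading_order_weight: float

    def score(self, text: float, layout: float, reading_order: float) -> float:
        """Weighted mean of the three aspect scores (each in 0..1)."""
        total = self.text_weight + self.layout_weight + self.reading_order_weight
        weighted = (self.text_weight * text
                    + self.layout_weight * layout
                    + self.reading_order_weight * reading_order)
        return weighted / total


# Invented weights, chosen only to show that scenarios can disagree.
SCENARIOS = [
    Scenario("keyword search", 1.0, 0.0, 0.0),
    Scenario("phrase search", 1.0, 0.5, 1.0),
    Scenario("layout re-flowing", 0.5, 1.0, 1.0),
]

# Invented scores for one page: good text, weak layout and reading order.
page = {"text": 0.95, "layout": 0.60, "reading_order": 0.40}

for scenario in SCENARIOS:
    print(f"{scenario.name}: {scenario.score(**page):.2f}")
```

With these made-up numbers, the same OCR result rates well for keyword search and noticeably worse for the scenarios that depend on layout and reading order.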
How can I evaluate all that? 
PRImA Evaluation Infrastructure 
•In partnership with the IMPACT CoC 
①Comprehensive datasets 
②Ground truthing tools – Aletheia 
③Scenario-based evaluation tools 
•Layout, reading order, text accuracy 
•Results in several levels of detail 
Proven Use 
Several International Competitions (SUCCEED and at ICDAR conferences)
oHistorical book recognition
oHistorical newspaper layout analysis
Continuous evaluation challenge
oWorkflows and individual components
Wellcome Trust Library Case Study
oAssessment of material for prioritisation of digitisation
[Diagram: overview of the PRImA tools and data formats. Labels recoverable from the slide:]
Evaluation: Layout Quality; OCR Accuracy; Text Eval (Bag of Words, character and word accuracy); Layout Eval (layout correspondence, reading order); Dewarping Eval
Tools and components: Aletheia; Web Aletheia; JAletheia; Crowd Prototype; Tesseract Exporter; FineReader Exporter; SimplePageExporter C++; Typewritten OCR; Segmenter; Repositories; Converter; Validator; Dewarping; Image Tool; Metadata; Extractor; Exporter; Snippet; Serialised Text; Sandbox; PAGE to SVG; XSD; Optimiser
Functions: Validation; Conversion; Filtering; Threshold, Otsu, Sauvola binarisation; Image and PAGE XML snippets
Formats: PAGE XML (Layout, Text Content); ALTO XML; FineReader XML; Gamera XML (PAGE Scanner); Document Image
Legend: Tool; Prototype; Data; Java; Web; Command Line
For more: www.primaresearch.org

Succeed Evaluation Infrastructure - Apostolos Antonacopoulos
