IMPACT Interoperability and Evaluation Framework. Clemens Neudecker
1. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
IMPACT Interoperability and Evaluation Framework
Clemens Neudecker, National Library of the Netherlands
IMPACT Demo Day, Biblioteca Nacional de España
OCR: A multitude of challenges…
I. OCR challenges (gothic fonts, bleed-through, warping, etc.)
OCR: A multitude of challenges…
II. Language challenges (spelling variants, inflection, and many more!)
Example: historical variants of the Dutch word ‘wereld’ (world):
werelt weerelt wereld weerelds wereldt werelden weereld werrelts waerelds weerlyt
wereldts vveerelts waereld weerelden waerelden weerlt werlt werelds sweerels
zwerlys swarels swerelts werelts swerrels weirelts tsweerelds werret vverelt werlts
werrelt worreld werlden wareld weirelt weireld waerelt werreld werld vvereld weerelts
werlde tswerels werreldts weereldt wereldje waereldje weurlt wald weëled
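One way to picture how a lexicon can help with such variants is a simple lookup table from attested historical spellings to a modern lemma. The sketch below is illustrative only: the names `LEXICON`, `modern_lemma` and `expand_query` are assumptions, and the table holds just a handful of the variants listed above.

```python
# Hypothetical mini-lexicon: a few of the attested variants listed above,
# all mapped to the modern lemma "wereld".
VARIANTS_OF_WERELD = [
    "werelt", "weerelt", "waereld", "vvereld", "werlt", "weereldt", "worreld",
]
LEXICON = {variant: "wereld" for variant in VARIANTS_OF_WERELD}
LEXICON["wereld"] = "wereld"


def modern_lemma(token, lexicon=LEXICON):
    """Return the modern lemma for a historical token, or None if unknown."""
    return lexicon.get(token.lower())


def expand_query(lemma, lexicon=LEXICON):
    """Return every spelling recorded for a modern lemma, sorted."""
    return sorted(v for v, l in lexicon.items() if l == lemma)
```

With such a table, a search for the modern word could be expanded to all recorded spellings, and an OCR token could be mapped back to its modern form.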
And a multitude of solutions!
22 different ‘tools’ from diverse developers:
OCR (C++, C#),
Image Processing & Lexica (DLL),
Command Line Tools (Win/Linux),
Java, Ruby, PHP, Perl, etc.
+ 3rd party software!
“One ring to rule them all...”
→ IMPACT Interoperability Framework (IIF)
Main requirements
Behavioural:
Minimize integration effort
Minimize deployment effort
Maximize usability
Maximize scalability
Functional:
Modular
Transparent
Expandable
Open source
Platform independent
Architecture
IMPACT Interoperability Framework: Technologies
- Java 6
- Generic Web Service Wrapper
- Apache Ant/Maven
- Apache Tomcat/httpd
- Apache Axis2
- Apache Synapse
- Taverna Workflow Engine
IMPACT Evaluation Framework: Dataset
- approx. 5 TB raw data (images, text files, metadata) and growing
- Ground truth transcriptions
- Evaluation modules
Components I: IIF
Enterprise Service Bus: receives (SOAP) requests from users and distributes the load to the available worker nodes
Main effects: process parallelization, load distribution, failover
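The load distribution and failover behaviour described here can be sketched as a round-robin dispatcher. This is a minimal illustration, not the Apache Synapse configuration used in the project; `Dispatcher`, `WorkerUnavailable` and `make_worker` are hypothetical names, and the workers are plain Python callables standing in for remote nodes.

```python
from itertools import cycle


class WorkerUnavailable(Exception):
    """Raised by a worker that cannot take the request."""


class Dispatcher:
    """Hands requests to workers in turn and skips workers that fail."""

    def __init__(self, workers):
        self.workers = list(workers)
        self._next = cycle(range(len(self.workers)))

    def dispatch(self, request):
        # Try each worker at most once, starting from the next one in turn.
        for _ in range(len(self.workers)):
            worker = self.workers[next(self._next)]
            try:
                return worker(request)
            except WorkerUnavailable:
                continue  # fail over to the next worker
        raise RuntimeError("no worker node available")


def make_worker(name, alive=True):
    """Build a stand-in worker node (hypothetical, for demonstration)."""
    def worker(request):
        if not alive:
            raise WorkerUnavailable(name)
        return f"{name} processed {request}"
    return worker
```

For example, with nodes a, b (down) and c, three consecutive requests would be handled by a, c and a again.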
Framework integration
Easy to use generic command line wrapper (open source)
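The idea of a generic command line wrapper can be sketched as follows: any tool is started as a subprocess, and its exit code, output and log are returned in one uniform structure. This is an assumption-laden illustration, not the project's wrapper; `run_tool` and `DEMO_COMMAND` are hypothetical names, and a Python one-liner stands in for a real OCR or image processing tool.

```python
import subprocess
import sys


def run_tool(command, input_text="", timeout=60):
    """Run a command-line tool and return a uniform result dictionary."""
    completed = subprocess.run(
        command,
        input=input_text,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return {
        "success": completed.returncode == 0,
        "exit_code": completed.returncode,
        "output": completed.stdout,
        "log": completed.stderr,
    }


# Stand-in for a real tool: a Python one-liner that upper-cases standard input.
DEMO_COMMAND = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.write(sys.stdin.read().upper())",
]
```

Because every tool is reduced to the same call and the same result shape, a service layer on top needs no tool-specific code.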
Workflow development
OCR workflow = data pipeline
Building blocks = processing steps (nodes)
Integration = interaction between nodes (mashup)
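The view of a workflow as a data pipeline can be sketched in a few lines: each node is a function, and the workflow passes the output of one node to the next. The step functions below (`binarise`, `recognise`, `normalise`) are hypothetical placeholders, not IMPACT tools, and in the project itself workflows are built in Taverna rather than in code like this.

```python
from functools import reduce


def run_workflow(steps, document):
    """Feed the document through each processing step in order."""
    return reduce(lambda data, step: step(data), steps, document)


# Hypothetical placeholder steps; each takes and returns a dictionary.
def binarise(doc):
    return {**doc, "image": "binarised(" + doc["image"] + ")"}


def recognise(doc):
    return {**doc, "text": "werelt"}  # pretend OCR output


def normalise(doc):
    return {**doc, "text": doc["text"].replace("werelt", "wereld")}
```

Swapping, adding or removing a step only changes the list passed to `run_workflow`, which is what makes the building blocks interchangeable.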
Workflow management
Web 2.0 style registry: myExperiment
Local client: Taverna Workbench
Web client: project website
API: SOAP/REST
Community
Web 2.0 style workflow registry
Community of experts
Sharing of resources
Knowledge exchange
A central meeting point for users and researchers
Components II: Dataset
Database and front end, hosted at the PRIMA research group at University of Salford, School of Computing, United Kingdom
- more than 500,000 images from Digital Libraries
- more than 50,000 ground truth representations
- up to 10,000 direct access calls per month
- 4 TB of space and growing
Dataset
Access to a representative and annotated dataset of significant size,
with metadata, ground truth and search facilities
Evaluation features
Text-based comparison of the result with the ground truth, using the Levenshtein distance
Layout-based comparison of the result with the ground truth, using the Page Analysis and Ground Truth Elements (PAGE) framework
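The text-based comparison can be illustrated with a plain Levenshtein distance: the minimum number of character insertions, deletions and substitutions needed to turn the OCR result into the ground truth. The sketch below is a textbook implementation, not the project's evaluation module; `character_accuracy` is an assumed helper that turns the distance into a score relative to the ground truth length.

```python
def levenshtein(a, b):
    """Return the edit distance between strings a and b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        previous = current
    return previous[-1]


def character_accuracy(ocr_text, ground_truth):
    """Assumed score: 1 minus distance over ground truth length, floored at 0."""
    if not ground_truth:
        return 1.0 if not ocr_text else 0.0
    distance = levenshtein(ocr_text, ground_truth)
    return max(0.0, 1.0 - distance / len(ground_truth))
```

For instance, an OCR result "werelt" against the ground truth "wereld" differs by a single substitution.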
Ground-Truthing Tools
Aletheia
FineReader
PAGE Exporter
GT Validator
GT Normalizer
Measures – Segmentation Errors
[Figure: a segmentation result compared with the ground truth for caption and paragraph regions, illustrating the error types miss, partial miss, misclassification, merge and split.]
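The error types named on this slide can be sketched with axis-aligned rectangles. This is a simplified illustration under stated assumptions, not the PAGE evaluation method: real layouts may use arbitrary region outlines and weighted measures, result regions are assumed not to overlap each other, and `Region`, `overlap` and `segmentation_errors` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    kind: str  # e.g. "caption" or "paragraph"
    x0: int
    y0: int
    x1: int
    y1: int

    def area(self):
        return (self.x1 - self.x0) * (self.y1 - self.y0)


def overlap(a, b):
    """Return the area shared by two rectangles (0 if they do not intersect)."""
    w = min(a.x1, b.x1) - max(a.x0, b.x0)
    h = min(a.y1, b.y1) - max(a.y0, b.y0)
    return w * h if w > 0 and h > 0 else 0


def segmentation_errors(ground_truth, result):
    """List (error type, region) pairs found by comparing result to ground truth."""
    errors = []
    for gt in ground_truth:
        hits = [r for r in result if overlap(gt, r) > 0]
        if not hits:
            errors.append(("miss", gt))
            continue
        # Assumes result regions do not overlap each other.
        if sum(overlap(gt, r) for r in hits) < gt.area():
            errors.append(("partial miss", gt))
        if len(hits) > 1:
            errors.append(("split", gt))
        if any(r.kind != gt.kind for r in hits):
            errors.append(("misclassification", gt))
    for r in result:
        if sum(1 for gt in ground_truth if overlap(gt, r) > 0) > 1:
            errors.append(("merge", r))
    return errors
```

A single result region that covers both a caption and the paragraph below it would be reported as a merge, and as a misclassification of the caption if it is labelled as a paragraph.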
OCR Accuracy
Thank you! Questions?