An Experimental Workflow Development Platform for Historical Document Digitisation and Analysis
International Workshop on Historical Document Imaging and Processing (HIP).
ICDAR 2011, 16-17 September 2011, Beijing, China.
1. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
Clemens Neudecker, KB National Library of the Netherlands
International workshop on Historical Document Imaging and Processing, Beijing, 17 September 2011
Background
IMPACT – Improving Access to Text (2008 – 2011)
Large-scale integrating research project, funded by the EC
Main objectives:
- Innovate OCR technology
- Capacity building in mass-digitisation
From a technical perspective:
> 20 software toolkits for solving specific issues
Prototyping new algorithms
“One ring to rule them all…”
IMPACT Interoperability Framework (IIF)
Main requirements
Behavioural:
Minimize integration effort
Minimize deployment effort
Maximize usability
Maximize scalability
Functional:
Modular
Transparent
Expandable
Open source
Platform independent
Architecture
IMPACT Interoperability Framework: Technologies
- Java 6
- Generic Web Service Wrapper
- Apache Ant/Maven
- Apache Tomcat/httpd
- Apache Axis2
- Apache Synapse
- Taverna Workflow Engine
IMPACT Interoperability Framework: Dataset
- more than 500,000 images from digital libraries
- more than 25,000 ground truth transcriptions
So how does it work?
1. Digitisation/OCR challenges registered and tagged in database
2. Database contains reference results that are 99.99% correct: the “ground truth”
3. Researcher develops new method to tackle a problem
4. Research prototype is wrapped as a web service
5. Web service is integrated as a workflow module
6. Workflow module can be evaluated, combined, etc.
Framework integration
Easy-to-use, generic command-line wrapper (open source)
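To make the wrapper idea concrete, here is a minimal sketch of its core step only: filling a command template with request parameters, running the tool, and returning its output. It does not reproduce the IMPACT wrapper itself and leaves out the web service layer entirely. The names `ToolDescription`, `argv_template` and `run_tool` are hypothetical and chosen for illustration.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import List


@dataclass
class ToolDescription:
    """Hypothetical description of a wrapped command-line tool."""
    name: str
    # One entry per argument; entries such as "{text}" are filled per request.
    argv_template: List[str]


def run_tool(tool: ToolDescription, **params: str) -> str:
    """Fill the template, run the command, and return its standard output."""
    argv = [part.format(**params) for part in tool.argv_template]
    # check=True raises CalledProcessError if the tool exits with an error code.
    completed = subprocess.run(argv, capture_output=True, text=True, check=True)
    return completed.stdout


# Stand-in "research prototype": a one-line script that upper-cases its argument.
upper_tool = ToolDescription(
    name="upper",
    argv_template=[
        sys.executable,
        "-c",
        "import sys; print(sys.argv[1].upper())",
        "{text}",
    ],
)

if __name__ == "__main__":
    print(run_tool(upper_tool, text="hello world"))
```

Passing the arguments as a list, rather than as one shell string, avoids quoting problems when a parameter contains spaces.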
Workflow development
OCR workflow = data pipeline
Building blocks = processing steps (nodes)
Integration = interaction between nodes (mashup)
Collaboration with
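The view of a workflow as a data pipeline built from nodes can be sketched in a few lines. This is an illustration of the concept only, not of how Taverna executes workflows; the three text-processing functions are hypothetical placeholders for real processing steps such as binarisation, segmentation or OCR.

```python
from typing import Callable, List

# A node takes the output of the previous step and returns its own output.
Node = Callable[[str], str]


def build_workflow(nodes: List[Node]) -> Node:
    """Chain the nodes so that each one receives the previous node's output."""
    def run(data: str) -> str:
        for node in nodes:
            data = node(data)
        return data
    return run


# Hypothetical placeholder nodes.
def normalise_whitespace(text: str) -> str:
    return " ".join(text.split())


def dehyphenate(text: str) -> str:
    return text.replace("- ", "")


def lowercase(text: str) -> str:
    return text.lower()


workflow = build_workflow([normalise_whitespace, dehyphenate, lowercase])

if __name__ == "__main__":
    print(workflow("Histo-\nrical   Docu-\nments"))
```

Because every node has the same signature, nodes can be reordered, replaced or reused in other workflows without changing the surrounding code.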
Workflow management
Web 2.0 style registry: myExperiment
Local client: Taverna Workbench
Web client: project website
Compute cluster
Enterprise Service Bus receives requests from users and distributes the load to the available worker nodes
Main effects:
- Process parallelization
- Load distribution
- Failover
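Load distribution and failover can be illustrated with a small round-robin dispatcher. This is a sketch under simplifying assumptions, not the behaviour or configuration of the service bus used in the framework: workers are plain Python callables, requests are handled one at a time (so parallelization is not shown), and the class name `Dispatcher` is hypothetical.

```python
from itertools import cycle
from typing import Callable, Iterator, List, Optional

Worker = Callable[[str], str]


class Dispatcher:
    """Hand requests to workers in turn; on failure, try the next worker."""

    def __init__(self, workers: List[Worker]) -> None:
        if not workers:
            raise ValueError("at least one worker is required")
        self._count = len(workers)
        self._ring: Iterator[Worker] = cycle(workers)

    def submit(self, request: str) -> str:
        last_error: Optional[Exception] = None
        # Try each worker at most once per request.
        for _ in range(self._count):
            worker = next(self._ring)
            try:
                return worker(request)
            except Exception as error:  # failover: move on to the next worker
                last_error = error
        raise RuntimeError("all workers failed") from last_error


# Hypothetical worker nodes: one is broken, two are healthy.
def broken_node(request: str) -> str:
    raise ConnectionError("node unavailable")


def node_a(request: str) -> str:
    return "a:" + request


def node_b(request: str) -> str:
    return "b:" + request


if __name__ == "__main__":
    dispatcher = Dispatcher([broken_node, node_a, node_b])
    for page in ["p1", "p2", "p3"]:
        print(dispatcher.submit(page))
```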
Dataset
Access to a representative and annotated dataset of significant size,
with metadata, ground truth and search facilities
Evaluation features
Text-based comparison of result with ground truth, using the Levenshtein distance
Layout-based comparison of result with ground truth, using the Page Analysis and Ground truth Elements (PAGE) framework
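For the text-based comparison, the Levenshtein distance counts the minimum number of single-character insertions, deletions and substitutions needed to turn the OCR result into the ground truth. The sketch below is the standard dynamic-programming formulation. The `character_accuracy` helper is an assumption added for illustration and is not necessarily the measure reported by the IMPACT evaluation tools.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the edit distance between a and b, computed row by row."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def character_accuracy(result: str, ground_truth: str) -> float:
    """Assumed measure: 1 - distance / length of ground truth, floored at 0."""
    if not ground_truth:
        return 1.0 if not result else 0.0
    distance = levenshtein(result, ground_truth)
    return max(0.0, 1.0 - distance / len(ground_truth))


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
    print(character_accuracy("Tbe quick", "The quick"))
```

Keeping only two rows of the distance table limits memory use to the length of the ground truth line, which matters when comparing full pages.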
Community
Web 2.0 style workflow registry
Community of experts
Sharing of resources
Knowledge exchange
A central meeting point for users and researchers
Summary
Benefits:
- Availability of resources (images, ground truth and tools) to the international research community
- A common baseline for transparent evaluation and comparison
- Sharing of results and know-how
- Enable new research through scalable computing
- Consolidation of support and maintenance
Thank you!
Questions?