Bratislava WS - Gander - UIBK - The Functional Extension Parser_pdf
1. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
The Functional Extension Parser – a rule-based
system for flexible structural analysis
Lukas Gander
University of Innsbruck
Bratislava 07.05.2010
Overview
Objectives of the Functional Extension Parser
Concepts of the FEP
Workflow
FEP Core
Current status
Expected benefits
Vision
Objectives of the FEP
The Functional Extension Parser (FEP) is a software tool capable of
detecting and reconstructing some of the main features of a digitised
book.
These features are:
– Page numbers
– Print space
– Logical structural elements like
Footnotes
Headlines
Running titles
Marginalia
Signature Marks
– Detection and reconstruction of the table of contents
Concepts of the FEP
Human beings are able to identify the logical
structural elements of a book simply by
looking at the layout, without understanding
the language.
A person intuitively applies a set of rules.
OCR output provides much more than
simple full text:
– Coordinates of lines, blocks and strings
– Style information such as bold or italic
– Font size and font type
– Almost everything a user can see on the
image is available in some form within the OCR
output
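To illustrate how a layout rule could operate on this kind of OCR information, here is a minimal sketch. The `OcrLine` record, its field names and the `size_factor` threshold are hypothetical and chosen for illustration only; they are not the FEP's actual data model or rule set.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class OcrLine:
    """One text line as an OCR engine might report it (hypothetical fields)."""
    text: str
    left: int
    top: int
    right: int
    bottom: int
    font_size: float
    bold: bool = False
    italic: bool = False


def headline_candidates(lines, size_factor=1.2):
    """Return lines whose font is clearly larger than the page's typical font.

    The typical (body) font size is estimated as the median over all lines;
    size_factor is an assumed threshold, not a value from the FEP.
    """
    if not lines:
        return []
    body_size = median(line.font_size for line in lines)
    return [line for line in lines if line.font_size >= size_factor * body_size]


page = [
    OcrLine("Chapter I", 100, 80, 500, 120, 16.0, bold=True),
    OcrLine("It was a dark night.", 100, 160, 900, 185, 10.0),
    OcrLine("The rain fell steadily.", 100, 190, 900, 215, 10.0),
    OcrLine("Nobody was outside.", 100, 220, 900, 245, 10.0),
]
candidates = headline_candidates(page)
```

The rule uses only layout and style attributes, never the meaning of the text, which mirrors the way a person recognises a headline without reading it.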
FEP Workflow
FEP Core Architecture
Current Status
During the last year the whole infrastructure was set up. This
includes:
– The Visualizer and Editor application, which is available online at
http://dea-gulliver.uibk.ac.at/org.dea.impact.FEP_Prototype.FEP_Prototype/FEP_Prototype.html
– The FEP Core module, which uses a rule-based approach
The first rule sets were developed for page number detection and print
space reconstruction:
– 98.34% correctly detected page numbers
– 91.77% correctly reconstructed print spaces
Expected benefits
Page number detection
– Results of page number detection can be used for quality assurance of
the whole digitisation process.
Missing pages that were lost during the scan process are identified.
Duplicated pages can be detected.
– Page numbers are a prerequisite for users browsing through the book in
a digital library application.
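The quality-assurance idea can be sketched as a simple check over the detected page numbers in scan order. This is an illustrative sketch, not the FEP's implementation: the function name is hypothetical, and it assumes that every image carries a detected Arabic page number.

```python
def check_page_sequence(numbers):
    """Compare each detected page number with its predecessor in scan order.

    Returns (missing, duplicated):
      missing    - page numbers skipped between two consecutive images
      duplicated - page numbers that repeat on consecutive images
    """
    missing, duplicated = [], []
    for prev, cur in zip(numbers, numbers[1:]):
        if cur == prev:
            duplicated.append(cur)
        elif cur > prev + 1:
            missing.extend(range(prev + 1, cur))
    return missing, duplicated


# Detected page numbers for eight consecutive images of a scanned book.
missing, duplicated = check_page_sequence([1, 2, 3, 5, 6, 6, 7, 10])
```

In this example pages 4, 8 and 9 would be flagged as missing and page 6 as scanned twice. A production check would also have to cope with unnumbered pages and Roman numerals, which this sketch leaves out.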
Print space reconstruction
– The size of the page was always calculated on the basis of the print
space. During the digitisation process, information about the margins of
the document is lost. The margins needed for a reprint can be
calculated from the print space using well-known reconstruction
schemes.
Expected benefits (2)
Print space reconstruction
– All images can be cropped to the same size, which allows a pleasant
look and feel (with the content centered) in digital repositories (e.g.
Google Books).
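The cropping idea can be sketched as follows. The sketch assumes, for illustration only, that the print space is the bounding box of the text blocks belonging to the main text area and that every page is cropped to one fixed size centered on that box; the function names and the box format `(left, top, right, bottom)` are hypothetical.

```python
def print_space(blocks):
    """Bounding box enclosing all given text blocks.

    blocks: non-empty iterable of (left, top, right, bottom) boxes in pixels.
    """
    lefts, tops, rights, bottoms = zip(*blocks)
    return min(lefts), min(tops), max(rights), max(bottoms)


def centered_crop(space, width, height):
    """Crop box of fixed size whose centre coincides with the print space's centre."""
    left, top, right, bottom = space
    x0 = round((left + right) / 2 - width / 2)
    y0 = round((top + bottom) / 2 - height / 2)
    return x0, y0, x0 + width, y0 + height


blocks = [(200, 150, 1000, 400), (210, 420, 990, 1500)]
space = print_space(blocks)
crop = centered_crop(space, width=1000, height=1600)
```

Applying the same `width` and `height` to every page gives images of identical size with the content centered. The crop box may extend beyond the scanned image, so a real implementation would need to clamp or pad it.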
Logical structure reconstruction
– Improves knowledge discovery in digital repositories. Headlines,
for example, are more important than normal text or footnotes. A reliable
logical structure analysis allows these elements to be handled
appropriately during indexing (e.g. headlines should be
boosted, running titles and signature marks ignored).
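A minimal sketch of such structure-aware indexing is shown below. The element type names and the weight values are illustrative assumptions, not figures from the project; the only properties taken from the text are that headlines are boosted and that running titles and signature marks are ignored.

```python
# Assumed weights per logical element type (illustrative values only).
INDEX_WEIGHTS = {
    "headline": 3.0,        # boosted
    "paragraph": 1.0,
    "footnote": 1.0,
    "running_title": 0.0,   # ignored
    "signature_mark": 0.0,  # ignored
}


def weighted_term_counts(elements):
    """Accumulate a weight per term from (element_type, text) pairs."""
    counts = {}
    for element_type, text in elements:
        weight = INDEX_WEIGHTS.get(element_type, 1.0)
        if weight == 0.0:
            continue  # element type is excluded from the index
        for term in text.lower().split():
            counts[term] = counts.get(term, 0.0) + weight
    return counts


counts = weighted_term_counts([
    ("headline", "Rome"),
    ("paragraph", "rome was built"),
    ("running_title", "History of Rome"),
    ("signature_mark", "A2"),
])
```

Here the running title, which repeats on every page, no longer inflates the count for its terms, while a term appearing in a headline contributes more than the same term in body text.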
Expected benefits (3)
Reconstruction of the TOC eases
navigation in
– PDF
– EPUB
– Online repositories
It is a very challenging task
– Google Books shows good but not
perfect results
– Microsoft Serbia won the INEX book
structure 2008 competition with a
precision of 53%
Vision