IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.




The Functional Extension Parser
A Document Understanding Platform
Günter Mühlberger
University Innsbruck Library (ULB Tyrol)




Document understanding
 A book is more than just pure text – it contains a lot of structural
  metadata
 These metadata are (often) encoded in the layout of a document
 Size of characters, position on page, distance to other lines, etc. are
  used to express structural meaning
 FEP is designed to “understand” the meaning of the layout






 Headlines
 Footnotes

 Print space






 Running title
 Page number
 Signature mark






 Table of Contents
 Single entries
 Authors

 Titles

 Page numbers








Why structural tagging is important
– some examples
 Search & Retrieval
 References and links to other documents
 Reading: analogue and digital






 Search & retrieval
   – Ranking and scoring,
     noise reduction
            The same word
             appears in the running
             title of a journal on
             every page, e.g.
             “Alpenverein”
            Front matter, such as
             title pages, dedications,
             tables of contents,
             tables, etc.
            Back matter, such as
             indexes, ads, etc.
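
The noise-reduction idea above can be sketched as a score that weights each hit by the structural type of the line it occurs in. This is only a minimal illustration, not the ranking of FEP or of any search engine; the type labels, the weights in TYPE_WEIGHTS and the function weighted_term_score are assumptions made up for this example.

```python
# Hypothetical weights: how much a hit in each structural element counts.
TYPE_WEIGHTS = {
    "heading": 3.0,
    "running_text": 1.0,
    "footnote": 0.5,
    "index": 0.2,
    "running_title": 0.0,  # repeated on every page, so treated as noise
}

def weighted_term_score(lines, term):
    """Sum the weights of all lines that contain the term as a whole word.

    `lines` is a list of (structural_type, text) tuples.
    """
    term = term.lower()
    score = 0.0
    for line_type, text in lines:
        if term in text.lower().split():
            # Unknown types fall back to the weight of running text.
            score += TYPE_WEIGHTS.get(line_type, 1.0)
    return score

journal_pages = [
    ("running_title", "Mitteilungen Alpenverein"),
    ("running_text", "Die Sektion traf sich im Sommer"),
    ("running_title", "Mitteilungen Alpenverein"),
    ("heading", "Der Alpenverein in Tirol"),
]
print(weighted_term_score(journal_pages, "Alpenverein"))
```

With these assumed weights, the two running-title hits contribute nothing and only the heading counts, so a journal is no longer ranked highly just because its title is printed on every page.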








 Search & retrieval
   – Facets for full-text
            Currently facets are
             used for metadata such
             as author, year, text
             type, ...
            A user might be
             interested in facets
             such as headline,
             footnote, index, etc.
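
A facet over structural elements could be computed by counting hits per element type. The sketch below assumes that each search hit already carries a "type" label from structural tagging; the hit format and the function structural_facets are hypothetical and only serve as illustration.

```python
from collections import Counter

def structural_facets(hits):
    """Count search hits per structural element type."""
    return Counter(hit["type"] for hit in hits)

# Hypothetical hits for one query, each tagged with its structural type.
hits = [
    {"page": 12, "type": "heading"},
    {"page": 12, "type": "footnote"},
    {"page": 40, "type": "footnote"},
    {"page": 301, "type": "index"},
]

facets = structural_facets(hits)
for element_type, count in facets.most_common():
    print(f"{element_type}: {count}")
```

A user interface could show these counts next to the usual metadata facets and let the user restrict a search to, for example, footnotes only.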






 Citation index / cloud
    – Footnotes, reference
      lists, citations contain
      bibliographic links to
      books, journal articles,
      texts, etc.
    – Structural tagging
      supports detection of
      bibliographic references
    – May also be used for
      catalogue enrichment
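
How structural tagging can support the detection of references may be sketched as follows: the search for citation patterns is restricted to lines tagged as footnotes or reference lists. The regular expression only covers the simple "Surname, Initials (Year)" form and ASCII surnames; the pattern, the type labels and the function find_citations are assumptions for this example, not the method used in FEP.

```python
import re

# Assumed pattern: "Surname, I. I. (YYYY)", as in "Cawkell, A. E. (1971)".
CITATION = re.compile(r"\b([A-Z][a-z]+), ((?:[A-Z]\. ?)+)\((\d{4})\)")

def find_citations(lines):
    """Return (surname, initials, year) for citations in tagged lines.

    `lines` is a list of (structural_type, text) tuples. Only footnotes
    and reference lists are searched; other line types are skipped.
    """
    found = []
    for line_type, text in lines:
        if line_type not in ("footnote", "reference_list"):
            continue
        for surname, initials, year in CITATION.findall(text):
            found.append((surname, initials.strip(), int(year)))
    return found

page = [
    ("running_text", "Cawkell, A. E. (1971) is discussed in the text."),
    ("footnote", "1 See Cawkell, A. E. (1971), p. 3."),
]
print(find_citations(page))
```

Restricting the search to the tagged zones is the point of the sketch: the same string in the running text is ignored, which reduces false hits.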


                                                                                     Cawkell, A. E. (1971)






 Digital reading
   – Tablet computers as an
     alternative for reading
     historical books whose
     OCR is below reading
     quality
   – Expected features
            Nicely cropped pages
            Bookmarks
            ToC page linked with
             headings
 Advanced reading
   – eBooks for modern texts
     with satisfactory OCR
     quality
   – Structure can be
     encoded into ePUB etc.


 Analogue reading
   – Print on Demand
   – Print space as an old
     concept with new
     benefits
   – Reconstruction helps to
     semi-automate the
     standardized production
     of pre-press files
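
One simple way to approximate the print space of a page is to take the smallest rectangle that encloses the text lines on it. The sketch below assumes line coordinates are already available from the OCR; the Box type and the function print_space are hypothetical, and which lines to include (e.g. with or without running titles) is a choice the sketch leaves to the caller.

```python
from dataclasses import dataclass

@dataclass
class Box:
    left: int
    top: int
    right: int
    bottom: int

def print_space(line_boxes):
    """Return the smallest box that encloses all given line boxes."""
    if not line_boxes:
        raise ValueError("no lines on page")
    return Box(
        left=min(b.left for b in line_boxes),
        top=min(b.top for b in line_boxes),
        right=max(b.right for b in line_boxes),
        bottom=max(b.bottom for b in line_boxes),
    )

# Hypothetical line boxes of one page, in pixels.
page_lines = [
    Box(120, 200, 900, 230),
    Box(100, 240, 910, 270),
    Box(100, 280, 640, 310),
]
print(print_space(page_lines))
```

Cropping every page to such a box, or to one box shared by all pages of a book, is one way to obtain uniformly placed pages for pre-press files.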








Technical background
 Input
        – OCR text, which must contain at least word coordinates
        – E.g. ALTO files, ABBYY XML or Google Books (Tesseract) HTML
 Output
        – Annotations of structural elements with coordinates, e.g. page numbers,
          running titles, headings, footnotes, printspace, etc.
        – Output format: METS/ALTO, XML, etc.
 FEP System
        –    Images and/or OCR files are loaded via a web service
        –    OCR data are converted into an internal format
        –    Information is processed based on rules
        –    Results are stored in a database
        –    Quality control on the basis of “ground truth”, i.e. expected results
        –    Rules are manually encoded (expert knowledge) and/or based on
             machine learning (large document sets)
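
The pipeline above (OCR with word coordinates in, rule-based annotations out) can be illustrated with a small sketch. It reads word coordinates from an ALTO-like sample, using the standard ALTO element and attribute names (TextLine, String, CONTENT, VPOS, HEIGHT), and applies two hand-written rules. The sample data, the thresholds and the functions read_lines and classify are assumptions for this example; the actual FEP rules and internal format are not described in the slides.

```python
import statistics
import xml.etree.ElementTree as ET

# Tiny ALTO-like sample (namespace omitted; real files declare one).
SAMPLE_ALTO = """
<alto><Layout><Page HEIGHT="1000" WIDTH="700"><PrintSpace>
  <TextLine>
    <String CONTENT="Chapter" HPOS="200" VPOS="100" WIDTH="180" HEIGHT="40"/>
    <String CONTENT="One" HPOS="400" VPOS="100" WIDTH="90" HEIGHT="40"/>
  </TextLine>
  <TextLine>
    <String CONTENT="It" HPOS="100" VPOS="200" WIDTH="30" HEIGHT="20"/>
    <String CONTENT="was" HPOS="140" VPOS="200" WIDTH="50" HEIGHT="20"/>
  </TextLine>
  <TextLine>
    <String CONTENT="late." HPOS="100" VPOS="230" WIDTH="70" HEIGHT="20"/>
  </TextLine>
  <TextLine>
    <String CONTENT="17" HPOS="340" VPOS="950" WIDTH="30" HEIGHT="20"/>
  </TextLine>
</PrintSpace></Page></Layout></alto>
"""

def local_name(tag):
    """Drop an XML namespace prefix such as '{http://...}TextLine'."""
    return tag.rsplit("}", 1)[-1]

def read_lines(alto_xml):
    """Return (page_height, lines); each line has text, top and height."""
    root = ET.fromstring(alto_xml.strip())
    page = next(el for el in root.iter() if local_name(el.tag) == "Page")
    lines = []
    for el in root.iter():
        if local_name(el.tag) != "TextLine":
            continue
        words = [w for w in el if local_name(w.tag) == "String"]
        if not words:
            continue
        # Line geometry is derived from the word coordinates only.
        lines.append({
            "text": " ".join(w.get("CONTENT") for w in words),
            "top": min(float(w.get("VPOS")) for w in words),
            "height": max(float(w.get("HEIGHT")) for w in words),
        })
    return float(page.get("HEIGHT")), lines

def classify(lines, page_height):
    """Apply two example rules (assumed thresholds) to every line."""
    median_height = statistics.median(line["height"] for line in lines)
    labels = []
    for line in lines:
        near_edge = (line["top"] < 0.1 * page_height
                     or line["top"] > 0.9 * page_height)
        if line["text"].isdigit() and near_edge:
            labels.append("page_number")   # digits close to the page edge
        elif line["height"] > 1.3 * median_height:
            labels.append("heading")       # clearly larger than typical text
        else:
            labels.append("running_text")
    return labels

page_height, lines = read_lines(SAMPLE_ALTO)
for line, label in zip(lines, classify(lines, page_height)):
    print(f"{label:13} {line['text']}")
```

Rules of this kind use exactly the layout cues named earlier (character size, position on the page). In a real system the thresholds would be tuned by experts or learned from annotated pages rather than fixed by hand as here.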




Apart from books...
 FEP
        – IMPACT: A generic rule set for historical books has been developed
        – This rule set can be used as a basis for similar documents
                 Journals
                 Critical editions
                 etc.
        – Other rule sets can be developed from scratch
                 Manual and/or machine learning
 Other document types
        –    Index cards
        –    Title pages
        –    Journals
        –    Dissertations
        –    Printed catalogues and bibliographies
        –    ...





Results
 Basic rule set
        – General structural elements of books from e.g. 1700 to 2010
        – Data set: 155 books, 30,673 pages (141 training set, 41 evaluation set)
        – All pages were manually annotated (ground truth)
 Recall, Precision, F-Measure
        – Example: a book has 10 lines with headings. We find 12 lines, 8 of them
          correct and 4 false:
        – Recall               = 8 of 10                     = 0.80
        – Precision            = 8 of 12                     = 0.67
        – F-Measure            = 2*0.80*0.67/(0.80+0.67)     = 0.73
 More information
        – Important: We count lines, not structural entities!
                 E.g. if a heading has two lines one might be correct, the other one might not be
                  recognised
        – Differences between training and evaluation set are low
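
The line-based measures above can be reproduced with a few lines of code. The sketch treats the ground truth and the detected heading lines as sets of line identifiers; the identifiers and the function line_level_scores are made up for this example. Rounding only at the end gives a precision of 0.67 and an F-measure of 0.73 for the heading example.

```python
def line_level_scores(predicted, ground_truth):
    """Return (recall, precision, f_measure) for two sets of line ids."""
    true_positives = len(predicted & ground_truth)
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    precision = true_positives / len(predicted) if predicted else 0.0
    if precision + recall == 0:
        return recall, precision, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return recall, precision, f_measure

# 10 heading lines in the book (ids 1..10).
ground_truth = set(range(1, 11))
# 12 lines found: 8 correct (ids 1..8) and 4 false (ids 101..104).
predicted = set(range(1, 9)) | {101, 102, 103, 104}

recall, precision, f_measure = line_level_scores(predicted, ground_truth)
print(f"recall={recall:.2f} precision={precision:.2f} f={f_measure:.2f}")
# prints: recall=0.80 precision=0.67 f=0.73
```

Because the sets contain lines, not headings, a two-line heading of which only one line is found counts as one hit and one miss.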





Some results on the evaluation set

                    Recall    Precision    F-measure
Running text        0.99      0.98         0.98
Running titles      0.97      1            0.98
Page




Comment
 Research situation
        – Document analysis is a wide field with many applications
        – But there is very little research on (historical) books
        – Due to the lack of datasets, it is hard to compare our results with those of
          other research groups → the dataset will be published next year
 Detection of ToC pages and ToC entries
        – A rule set for ToCs was developed recently
        – Reasonable results compared with the INEX competition
        – Results are planned to be published in spring 2011
 Method
        – Combination of manual and machine learning methods using fuzzy logic
        – Application for a patent at the European Patent Office in September
          2011




How to deal with uncertainty and errors?
 Option 1: Leave it as it is
        – Accept the accuracy which can be provided automatically
        – Including ground truth in the database makes it possible to measure the
          quality of the automated processing exactly → one knows in advance what
          can be expected
 Pro
        – Maybe the only solution for really large document sets
        – It is much cheaper to develop better rule sets than to correct large
          numbers of documents
        – Good results for homogeneous sets are possible
        – Similar to OCR
 Con
        – You and your users need to accept errors
        – People want to contribute and to correct





How to deal with uncertainty and errors?
 Option 2: Correct it
        – Service providers or library staff need to make the corrections
        – Manual correction with automated support
 Pro
        – Batch correction + offshoring is relatively cheap and effective
        – Quick and standardized results
        – Users are satisfied
 Con
        – A reasonable investment is necessary
        – The complexity of the workflow must not be underestimated
        – It will probably be too expensive to correct all elements of interest, so
          you and your users still need to accept “some” errors
        – Users still want to contribute but do not get the chance





Option 3
 Provide a user interface for the crowd
        – Correcting OCR results may be only the start; interfaces for structural
          annotations could follow
        – Might be combined with some basic corrections carried out by service
          providers
 Pro
        – Satisfies the willingness of users to contribute
        – Users get immediate benefit, e.g. they are able to download structured
          PDFs for their iPad, or annotated full-text for further processing
        – Users are satisfied AND are able to contribute
        – Library gets correct and standardized data
 Con
        – A reasonable investment is necessary both for the user interface and
          for adapting the digital library application
        – User interfaces need to be powerful, self-explaining and simple
        – You and your users need to accept that there are always errors in the
          collection and that it will take decades to come to an end




FEP User Interface
 A concept study for a powerful, self-explaining and simple GUI
        – Currently a “general purpose interface” to display, edit and correct the
          structural elements of books
        – No optimisation for specific tasks and large amounts of documents
        – Has the potential to become a user interface for the crowd
        – Could look completely different!

 Based on Google Web Toolkit (GWT)
        –    Open source toolkit for complex browser-based development
        –    GWT allows for features previously seen mainly in Flash interfaces
        –    Growing community
        –    Good experience: GWT makes it possible to create interfaces in a
             relatively short time





Display of results








Rich interface








Recognized elements, e.g. headings








Display of ground truth








Page numbers








Page numbers control








ToC pages








ToC entries








Linking of entries with pages/headings








ToC hierarchy editor








Drag and drop of entries








Export from FEP web-interface
 METS/ALTO
        – XML Standard for digitised books and documents
 PDFs
        – Advanced PDFs for eBooks
                 Original version
                 FEP processed version
        – Pre-press files for Print on Demand
                 FEP prepress file
 ePUB
        – For modern documents with good OCR quality or corrected books








After the project
 General
        – Innovative projects with a research component will be carried out via the
          University of Innsbruck
        – Commercial projects via a spin-out of the University (transidee)
 FEP as a service
        – Currently there are no plans to create a product or stand-alone version;
          instead, web services for OCR/structural annotation and remote correction
          will be offered
        – Adaptation of the rule sets for specific documents
 Pilot
        – EOD Network: Digitisation on Demand carried out by more than 30
          libraries in Europe
        – FEP shall be integrated during 2012
        – Member libraries get the chance to use the FEP for producing enhanced
          PDFs for eBooks





               Thank you for your attention!




IMPACT Final Conference - Muehlberger - FEP

  • 1. The Functional Extension Parser: A Document Understanding Platform. Günter Mühlberger, University Innsbruck Library (ULB Tyrol). IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
  • 2. Document understanding
      - A book is more than just pure text: it contains a lot of structural metadata
      - These metadata are (often) encoded in the layout of a document
      - Size of characters, position on page, distance to other lines, etc. are used to express structural meaning
      - FEP is designed to “understand” the meaning of the layout
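To illustrate how layout features such as character size and distance to other lines can carry structural meaning, here is a minimal sketch of a heading heuristic. It is not the FEP rule set: the function name `guess_heading_lines`, the input shape (one dict per line with `text`, `y` and `height`) and both thresholds are assumptions made for this example only.

```python
from statistics import median


def guess_heading_lines(lines, height_ratio=1.3, gap_ratio=1.5):
    """Return the text of lines that look like headings.

    `lines` is a list of dicts with 'text', 'y' (top edge) and 'height',
    ordered from top to bottom of the page. Both ratios are illustrative
    thresholds, not values taken from FEP.
    """
    if not lines:
        return []
    median_height = median(line["height"] for line in lines)
    # Vertical white space between each line and the line above it.
    gaps = [
        below["y"] - (above["y"] + above["height"])
        for above, below in zip(lines, lines[1:])
    ]
    median_gap = median(gaps) if gaps else 0
    headings = []
    for index, line in enumerate(lines):
        is_large = line["height"] >= height_ratio * median_height
        # The first line has nothing above it, so it counts as set apart.
        is_set_apart = index == 0 or gaps[index - 1] >= gap_ratio * median_gap
        if is_large and is_set_apart:
            headings.append(line["text"])
    return headings


# Hypothetical page: one tall line followed by four body lines.
page = [
    {"text": "Chapter I", "y": 100, "height": 60},
    {"text": "It was a dark night", "y": 220, "height": 30},
    {"text": "and the rain fell", "y": 260, "height": 30},
    {"text": "on the empty road", "y": 300, "height": 30},
    {"text": "to the village.", "y": 340, "height": 30},
]
print(guess_heading_lines(page))
```

A real rule set would combine many more cues (position on the page, font style, numbering patterns); the sketch only shows that two of the cues named on the slide are easy to turn into a test.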
  • 3. Headlines; Footnotes; Print space
  • 4. Running title; Page number; Signature mark
  • 5. Table of Contents; Single entries; Authors; Titles; Page numbers
  • 6. Why structural tagging is important: some examples
      - Search & Retrieval
      - References and links to other documents
      - Reading: analogue and digital
  • 7. Search & retrieval
      - Ranking and scoring, noise reduction
          - The same word appears in the running title of a journal on every page, e.g. “Alpenverein”
          - Front matter, such as title pages, dedications, tables of contents, etc.
          - Back matter, such as indexes, ads, etc.
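The noise-reduction idea can be sketched as a scoring function that weights a query hit by the structural label of the line it occurs in, so that a word repeated in every running title no longer inflates the score. The labels, the weights in `STRUCTURE_WEIGHTS` and the function `score_page` are hypothetical and chosen only for illustration.

```python
# Illustrative weights per structural label; not values used by FEP.
STRUCTURE_WEIGHTS = {
    "heading": 3.0,
    "running_text": 1.0,
    "footnote": 0.5,
    "running_title": 0.0,  # repeated on every page, so treated as noise
}


def score_page(lines, query):
    """Sum weighted occurrences of `query` over (label, text) pairs."""
    term = query.lower()
    return sum(
        STRUCTURE_WEIGHTS.get(label, 1.0) * text.lower().split().count(term)
        for label, text in lines
    )


page_a = [
    ("running_title", "Zeitschrift des Alpenverein"),
    ("running_text", "Der Weg führt über den Pass"),
]
page_b = [
    ("running_title", "Zeitschrift des Alpenverein"),
    ("heading", "Der Alpenverein im Jahr 1895"),
]
print(score_page(page_a, "Alpenverein"), score_page(page_b, "Alpenverein"))
```

Without structural labels both pages would count as hits for “Alpenverein”; with them, only the page that mentions the word outside the running title is ranked.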
  • 8. Search & retrieval
      - Facets for full-text
          - Currently facets are used for metadata such as author, year, text type, ...
          - A user might be interested in facets such as headline, footnote, index, etc.
  • 9. Citation index / cloud
      - Footnotes, reference lists and citations contain bibliographic links to books, journal articles, texts, etc.
      - Structural tagging supports detection of bibliographic references
      - May also be used for catalogue enrichment
      - Cawkell, A. E. (1971)
  • 10.
      - Digital reading
          - Tablet computers as an alternative for reading historical books with OCR below reading quality
          - Expected features
              - Nicely cropped pages
              - Bookmarks
              - ToC page linked with headings
      - Advanced reading
          - eBooks for modern texts with satisfactory OCR quality
          - Structure can be encoded into ePUB etc.
  • 11. Analogue reading
      - Print on Demand
      - Print space as an old concept with new benefits
      - Reconstruction helps to semi-automate the standardized production of pre-press files
  • 12. Technical background
      - Input
          - OCR text, which needs to contain at least word coordinates
          - E.g. ALTO files, ABBYY XML or Google Books (Tesseract) HTML
      - Output
          - Annotations of structural elements with coordinates, e.g. page numbers, running titles, headings, footnotes, print space, etc.
          - Output format: METS/ALTO, XML, etc.
      - FEP system
          - Images and/or OCR files are loaded via a web service
          - OCR data are converted into an internal format
          - Information is processed based on rules
          - Results are stored in a database
          - Quality control on the basis of “ground truth”, i.e. expected results
          - Rules are either manually encoded (expert knowledge) and/or based on machine learning (large document sets)
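As an illustration of the input requirement (OCR text with word coordinates), the following sketch reads words and their bounding boxes from an ALTO file using only the Python standard library. The `String` element and its `CONTENT`, `HPOS`, `VPOS`, `WIDTH` and `HEIGHT` attributes are part of ALTO; the sample document, the function name `words_with_coordinates` and the returned dict layout are assumptions for this example and say nothing about FEP's internal format.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written ALTO sample, used only for this example.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="p1" WIDTH="2000" HEIGHT="3000">
      <PrintSpace>
        <TextBlock ID="b1">
          <TextLine ID="l1" HPOS="300" VPOS="120" WIDTH="570" HEIGHT="40">
            <String CONTENT="Alpenverein" HPOS="300" VPOS="120" WIDTH="400" HEIGHT="40"/>
            <SP WIDTH="20"/>
            <String CONTENT="1895" HPOS="720" VPOS="120" WIDTH="150" HEIGHT="40"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag):
    """Strip the '{namespace}' prefix so any ALTO version is accepted."""
    return tag.rsplit("}", 1)[-1]


def words_with_coordinates(alto_xml):
    """Return one dict per ALTO String element: text plus bounding box."""
    root = ET.fromstring(alto_xml)
    words = []
    for element in root.iter():
        if local_name(element.tag) == "String":
            words.append({
                "text": element.get("CONTENT"),
                "x": float(element.get("HPOS")),
                "y": float(element.get("VPOS")),
                "width": float(element.get("WIDTH")),
                "height": float(element.get("HEIGHT")),
            })
    return words


for word in words_with_coordinates(SAMPLE_ALTO):
    print(word)
```

Matching on the local element name instead of a fixed namespace keeps the sketch independent of the ALTO schema version a given OCR engine writes.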
  • 14. Apart from books...
      - FEP
          - IMPACT: a generic rule set for historical books has been developed
          - This rule set can be used as a basis for similar documents
              - Journals
              - Critical editions
              - etc.
          - Other rule sets can be developed from scratch
              - Manual and/or machine learning
      - Other document types
          - Index cards
          - Title pages
          - Journals
          - Dissertations
          - Printed catalogues and bibliographies
          - ...
  • 15. Results
      - Basic rule set
          - General structural elements of books from e.g. 1700 to 2010
          - Data set: 155 books, 30,673 pages (141 training set, 41 evaluation set)
          - All pages were manually annotated (ground truth)
      - Recall, Precision, F-Measure
          - Example: a book has 10 lines with headings. We find e.g. 12 lines, 8 of them correct, 4 false:
          - Recall = 8 of 10 = 0.8
          - Precision = 8 of 12 ≈ 0.67
          - F-Measure = 2*0.8*0.67/(0.8+0.67) ≈ 0.73
      - More information
          - Important: we count lines, not structural entities!
              - E.g. if a heading has two lines, one might be correct while the other is not recognised
          - Differences between training and evaluation set are low
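The heading example can be reproduced with a few lines of code. This is a minimal sketch of the standard recall, precision and F-measure definitions applied to line counts; the function name `line_level_scores` and its parameter names are chosen for this example, and the values are computed from the raw counts without intermediate rounding.

```python
def line_level_scores(relevant_lines, found_lines, correct_lines):
    """Return (recall, precision, f_measure) computed from line counts.

    relevant_lines: lines that carry the label in the ground truth
    found_lines:    lines the system labelled
    correct_lines:  labelled lines that are also in the ground truth
    """
    recall = correct_lines / relevant_lines if relevant_lines else 0.0
    precision = correct_lines / found_lines if found_lines else 0.0
    if recall + precision == 0:
        return recall, precision, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return recall, precision, f_measure


# 10 heading lines in the book, 12 lines found, 8 of them correct.
recall, precision, f_measure = line_level_scores(10, 12, 8)
print(f"recall={recall:.3f} precision={precision:.3f} f={f_measure:.3f}")
```

Because the unit is the line and not the structural entity, a two-line heading with only one recognised line contributes one correct and one missed line to these counts.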
  • 16. Some results on the evaluation set
                          Recall   Precision   F-measure
        Running text      0.99     0.98        0.98
        Running titles    0.97     1           0.98
        Page
  • 17. Comment
      - Research situation
          - Document analysis is a wide field with many applications
          - But only very little research on (historical) books
          - Due to the lack of datasets it is hard to compare our results with other research groups; the dataset will be published next year
      - Detection of ToC pages and ToC entries
          - Rule set for ToC was developed recently
          - Reasonable results compared with the INEX competition
          - Foreseen to publish results in spring 2011
      - Method
          - Combination of manual and machine learning methods using fuzzy logic
          - Application for a patent at the European Patent Office in September 2011
  • 18. How to deal with uncertainty and errors?
      - Option 1: Leave it as it is
          - Accept the accuracy which can be provided automatically
          - Inclusion of ground truth in the database makes it possible to measure the quality of the automated processing exactly, so one knows in advance what can be expected
      - Pro
          - Maybe the only solution for really large document sets
          - It is much cheaper to develop better rule sets than to correct large numbers of documents
          - Good results for homogeneous sets are possible
          - Similar to OCR
      - Con
          - You and your users need to accept errors
          - People want to contribute and to correct
  • 19. How to deal with uncertainty and errors?
      - Option 2: Correct it
          - Service providers or library staff need to correct
          - Manual correction with automated support
      - Pro
          - Batch correction + offshore is relatively cheap and effective
          - Quick and standardized results
          - Users are satisfied
      - Con
          - A reasonable investment is necessary
          - The complexity of the workflow must not be underestimated
          - Probably it will be too expensive to correct all interesting elements, therefore you and your users still need to accept “some” errors
          - Users still want to contribute but do not have a chance
  • 20. Option 3
      - Provide a user interface for the crowd
          - Correction of OCR results may only be the start for also providing interfaces for structural annotations
          - Might be combined with some basic corrections carried out by service providers
      - Pro
          - Satisfies the willingness of users to contribute
          - Users get immediate benefit, e.g. they are able to download structured PDFs for their iPad, or annotated full-text for further processing
          - Users are satisfied AND are able to contribute
          - Library gets correct and standardized data
      - Con
          - A reasonable investment is necessary both for the user interface and for adapting the digital library application
          - User interfaces need to be powerful, self-explaining and simple
          - You and your users need to accept that there are always errors in the collection and that it will take decades to come to an end
  • 21. FEP User Interface
      - A concept study for a powerful, self-explaining and simple GUI
          - Currently a “general purpose interface” to display, edit and correct the structural elements of books
          - No optimisation for specific tasks and large amounts of documents
          - Has the potential to become a user interface for the crowd
          - Could look completely different!
      - Based on Google Web Toolkit (GWT)
          - Open source toolkit for complex browser-based developments
          - GWT allows for features previously seen mainly in Flash interfaces
          - Growing community
          - Good experiences: GWT makes it possible to create interfaces in a relatively short time
  • 22. Display of results
  • 23. Rich interface
  • 24. Recognized elements, e.g. headings
  • 25. Display of ground truth
  • 26. Page numbers
  • 27. Page numbers control
  • 28. ToC pages
  • 29. ToC entries
  • 30. Linking of entries with pages/headings
  • 31. ToC hierarchy editor
  • 32. Drag and drop of entries
  • 33. Export from FEP web interface
      ▪ METS/ALTO
        – XML standards for digitised books and documents
      ▪ PDFs
        – Advanced PDFs for eBooks
            ▪ Original version
            ▪ FEP processed version
        – Pre-press files for Print on Demand
            ▪ FEP prepress file
      ▪ ePUB
        – For modern documents with good OCR quality or corrected books
  • 34. After the project
      ▪ General
        – Innovative projects with a research component will be carried out via the University of Innsbruck
        – Commercial projects will go through a spin-out of the University (transidee)
      ▪ FEP as a service
        – No product or stand-alone version is currently planned; instead, web services for OCR/structural annotation and remote correction will be offered
        – Adaptation of the rule sets for specific documents
      ▪ Pilot
        – EOD Network: Digitisation on Demand carried out by more than 30 libraries in Europe
        – FEP is to be integrated during 2012
        – Member libraries get the chance to use the FEP for producing enhanced PDFs for eBooks
  • 35. Thank you for your attention!