IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.




Structural analysis of documents
Functional Extension Parser (FEP)

Günter Mühlberger
University Innsbruck Library (UIBK)

Agenda
      Introduction
      Features
        – What do we recognise with the structural analysis?
      Benefits
        – Why is structural analysis useful?
      Architecture
        – How does it work?
      Results
        – How good are we?
      Roadmap
        – When will it come into being?
      Business
        – Which offers will be available?

Introduction
      Document understanding platform
      Try to enhance and exploit the logical structure of documents for
        – Display
        – Navigation
        – Retrieval
      Enhance OCR output with structural metadata
        – Fully automated processing
        – Interactive correction




IMPACT EVA/MINERVA 12th Nov. 2008

Features
      General
        – We are able to recognise all structural elements which have some layout
          representation: e.g. region, size, typeface, distance to other elements, etc.
        – Focus in IMPACT: Basic features which are typical for all documents
        – The rule set can be extended or specialised for other datasets
                      E.g. journals, dissertations, index cards, yearbooks, newspapers, etc.
        – The better the OCR, the better our structural analysis
      Basic features for books
        –     Page numbers
        –     Running titles (headers)
        –     Print space
        –     Footnotes
        –     Signature marks
        –     Headings (within the running text)
        –     Table of contents entries (additional to headings)
        –     Front/Body/Back
        –     Paragraphs

      Print space
      Headings
      Footnotes





      Running title (header)
      Page number
      Signature mark





      Table of contents
        – (linked with headings in
          the running text and
          with page numbers)





Benefits (1)
      Display
        – A correct print space allows page images to be displayed centred (no
          jumping when flipping between pages)
      Search & retrieval
        – Scoring of results
                      Could take into account structural data (headings, footnotes)
        – Noise reduction
                      Front, body, back are separated, text from the front is often misleading
                      Running titles repeat the same words
                      Footnotes can be included or excluded
        – Faceted search
                      Results can be displayed for running text, footnotes, headings
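
The search benefits above can be sketched in a few lines. This is only an illustration, assuming every OCR line already carries a structural label; the `Line` record, the label names, the `search` function and the sample text are all hypothetical and not part of FEP.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Line:
    page: int
    label: str  # hypothetical labels, e.g. "running_text", "heading", "footnote"
    text: str


# Labels treated as noise for retrieval (an assumption made for this sketch).
NOISE_LABELS = {"running_title", "signature_mark", "page_number"}


def search(lines, term, include_footnotes=True):
    """Return the pages of matching lines, grouped by structural label (facet)."""
    facets = defaultdict(list)
    needle = term.lower()
    for line in lines:
        if line.label in NOISE_LABELS:
            continue  # running titles repeat the same words on every page
        if line.label == "footnote" and not include_footnotes:
            continue  # footnotes can be included or excluded
        if needle in line.text.lower():
            facets[line.label].append(line.page)
    return dict(facets)


# Invented sample data, for illustration only.
lines = [
    Line(1, "running_title", "History of Tyrol"),
    Line(1, "heading", "Chapter 1: Tyrol before 1800"),
    Line(1, "running_text", "Tyrol was then part of the Habsburg lands."),
    Line(1, "footnote", "See also the older literature on Tyrol."),
    Line(2, "running_title", "History of Tyrol"),
]

hits = search(lines, "tyrol")
hits_without_footnotes = search(lines, "tyrol", include_footnotes=False)
```

Here `hits` has one facet each for headings, running text and footnotes, while the repeated running title never produces a hit.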




Benefits (2)
      Navigation
        – Page numbers allow usage of original table of contents
        – Original table of contents can be linked with headings/page numbers in
          the book
      Document editing
        –     Further mark-up (e.g. TEI) is supported
        –     Manual preparation for Print-on-Demand is eased (print space)
        –     Selective OCR correction can be applied,
              e.g. only headings, running text or footnotes could be fed to CONCERT
      Document matching
        – Contributions or footnotes can be matched with existing bibliographical
          databases
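
The navigation benefit (linking the original table of contents to pages through recognised page numbers) can be sketched as follows. This is a minimal illustration, not the FEP implementation; the function `link_toc`, the data layout and the sample values are assumptions.

```python
def link_toc(toc_entries, printed_to_image):
    """Map each (title, printed_page) ToC entry to an image index, or None."""
    links = []
    for title, printed_page in toc_entries:
        # An entry stays unlinked if its page number was not recognised.
        links.append((title, printed_to_image.get(printed_page)))
    return links


# Hypothetical recognition result: image index -> printed page number.
recognised = {5: 1, 6: 2, 7: 3, 9: 5}

# Invert it so a printed page number leads to the scanned image.
printed_to_image = {printed: image for image, printed in recognised.items()}

# Invented ToC entries: (title, printed page number).
toc = [("Preface", 1), ("Chapter I", 3), ("Chapter II", 4)]

links = link_toc(toc, printed_to_image)
```

In this sample, "Chapter II" stays unlinked because printed page 4 was not recognised on any image. That is one reason why page number recall matters for navigation.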



      Improved display on the
      web and in PDF





      Refinement of full-text
      search
      Facets for e.g.
        – Running text
        – Footnotes
        – Headings
      Less noise
        – Running titles,
          signature marks
          excluded from search





      Clickable table of contents
      entries
        – Google style
      Selective OCR correction
        – Correct only ToC,
          headings, footnotes, etc.





      Matching of documents
      with external sources
        – Match footnotes with
          library catalogues
          (bibliographies)
        – Match table of contents
          entries and headings with
          bibliographies





      Improved editing
        – Alternating print spaces
          for Print on Demand
        – Further processing for
          TEI editions etc.





Architecture
      Input
        – Results from OCR processing on word level (coordinates)
        – E.g. ALTO file, ABBYY XML file or Google HTML
      Output
        – Structural annotations for recognized text features, e.g. page numbers,
          running titles, headings, etc.
        – E.g. XML, ALTO, METS, TEI, etc.
      General workflow
        –     OCR result files are parsed (FEP general XML format)
        –     The rule set is applied to the dataset (rules are managed by a rules engine)
        –     Results are stored in a database
        –     Export on various levels is provided
      Optional
        – Online or offline correction (GUI)
        – Adaptation of the rule set
        – Quality assurance on the basis of ground truth
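
The first workflow step, reading word-level OCR results with coordinates, can be sketched for ALTO input. This is an illustration under assumptions: the embedded sample file is invented, and the list of word dictionaries only stands in for the FEP general XML format, whose actual structure is not described here. The element and attribute names (`TextLine`, `String`, `CONTENT`, `HPOS`, `VPOS`, `WIDTH`, `HEIGHT`) are those of ALTO.

```python
import xml.etree.ElementTree as ET

# Invented minimal ALTO document, for illustration only.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page WIDTH="1000" HEIGHT="1500"><PrintSpace>
    <TextBlock>
      <TextLine>
        <String CONTENT="17" HPOS="480" VPOS="60" WIDTH="40" HEIGHT="30"/>
      </TextLine>
      <TextLine>
        <String CONTENT="Structural" HPOS="100" VPOS="200" WIDTH="210" HEIGHT="32"/>
        <String CONTENT="analysis" HPOS="330" VPOS="200" WIDTH="170" HEIGHT="32"/>
      </TextLine>
    </TextBlock>
  </PrintSpace></Page></Layout>
</alto>"""


def local_name(tag):
    """Strip the XML namespace so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def parse_words(alto_xml):
    """Return one dict per word with its text, line index and bounding box."""
    root = ET.fromstring(alto_xml)
    words = []
    line_no = -1
    # iter() walks the tree in document order, so a TextLine is always
    # visited before the String elements it contains.
    for elem in root.iter():
        name = local_name(elem.tag)
        if name == "TextLine":
            line_no += 1
        elif name == "String":
            words.append({
                "text": elem.get("CONTENT"),
                "line": line_no,
                "x": float(elem.get("HPOS")),
                "y": float(elem.get("VPOS")),
                "width": float(elem.get("WIDTH")),
                "height": float(elem.get("HEIGHT")),
            })
    return words


words = parse_words(SAMPLE_ALTO)
```

Rules can then be evaluated on `words` without caring whether the input came from ALTO, ABBYY XML or another OCR format, which is the point of a common intermediate representation.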


The FEP Core
      Based on Jess, an expert-system-like rule engine for Java
      Both manually crafted rules and rules obtained by machine learning
      Uses fuzzy logic to deal with uncertainty

Typical rules:
      IF there is a numeral in the first line of the page AND this numeral is centred
      THEN this numeral may be the page number
      IF there is a numeral in the first line of the page AND this numeral is at the
      right hand side of the page AND this numeral is an odd number THEN this
      numeral may be the page number
      IF there is a numeral in the first line of the page AND this numeral is at the
      left hand side of the page AND this numeral is an even number THEN this
      numeral may be the page number.
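
The three rules above can be approximated in plain Python. This sketch is not the Jess rule base used by FEP: the position thresholds and confidence values are invented, and the fuzzy "may be" is modelled simply as a score between 0 and 1, with the strongest matching rule winning (fuzzy OR as maximum).

```python
def page_number_confidence(token, x_center, page_width, is_first_line):
    """Return a confidence in [0, 1] that `token` is the page number."""
    # All three rules require a numeral in the first line of the page.
    if not is_first_line or not token.isdecimal():
        return 0.0
    number = int(token)
    rel = x_center / page_width  # 0.0 = left edge, 1.0 = right edge

    scores = [0.0]
    if abs(rel - 0.5) < 0.1:
        scores.append(0.8)  # rule 1: centred numeral (assumed confidence)
    if rel > 0.8 and number % 2 == 1:
        scores.append(0.9)  # rule 2: odd number on the right-hand side
    if rel < 0.2 and number % 2 == 0:
        scores.append(0.9)  # rule 3: even number on the left-hand side
    return max(scores)


centred = page_number_confidence("17", 500, 1000, True)
odd_right = page_number_confidence("17", 900, 1000, True)
even_right = page_number_confidence("18", 900, 1000, True)
even_left = page_number_confidence("18", 100, 1000, True)
not_first_line = page_number_confidence("17", 500, 1000, False)
```

An even number on the right-hand side matches no rule and gets a score of 0, so the parity check helps to reject other numerals that happen to stand in the first line.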


Results
      Basic rule set
        – General features for books from 1700 to 2000
        – Dataset of 155 books, 30,673 pages (141 training set, 41 evaluation set)
        – All books were manually annotated (ground truth)
      Recall, Precision, F-Measure
        – E.g. a book contains 10 heading lines. We find 12 lines; 8 of them are
          correct, 4 are wrong.
        – Recall                = 8 of 10                   = 0.80
        – Precision             = 8 of 12                   = 0.67
        – F-Measure             = 2*0.80*0.67/(0.80+0.67)   = 0.73
      More explanations
        – Important: We are counting lines, not structural items!
                      E.g. if a heading consists of two lines (often set in different type sizes), we have
                      to find both to succeed
        – Differences between the training and evaluation sets are marginal
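
The heading example can be recomputed with a short sketch. The function name `line_scores` is an assumption; the formulas are the standard definitions of recall, precision and F-measure, applied to line counts as described above.

```python
def line_scores(correct, in_ground_truth, found):
    """Recall, precision and F-measure computed on line counts."""
    recall = correct / in_ground_truth if in_ground_truth else 0.0
    precision = correct / found if found else 0.0
    if recall + precision == 0:
        return recall, precision, 0.0
    f_measure = 2 * recall * precision / (recall + precision)
    return recall, precision, f_measure


# 10 heading lines in the book, 12 lines found, 8 of them correct.
recall, precision, f_measure = line_scores(correct=8, in_ground_truth=10, found=12)
```

Computed from the unrounded values, precision is 8/12 ≈ 0.667 and the F-measure is 16/22 ≈ 0.727.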


  Results on Evaluation Set

                        Recall    Precision    F-measure
      Running text      0.99      0.98         0.98
      Footnotes         0.83      0.89         0.86
      Page numbers      0.97      1.00         0.98
      Running titles    0.97      1.00         0.98
      Headings          0.85      0.80         0.82
      Signature marks   0.68      0.89         0.77

Roadmap
      Summer 2011: Beta version
        – Integration into IMPACT Interoperability Platform
        – Basic rule set: books from 1700 to 1900
      End of the year: Version 1.0
        – Full featured version
        – Enhanced online correction interface
        – FEP as a service, not as a product for local installation





Business offers
      Web-service for processing single volumes and correction
        – Will be integrated into eBooks-on-Demand EOD Network
        – 30 libraries already upload their images to the OCR server in
          Innsbruck
        – FEP will be an additional service for general material
        – Similar offers can be made to other libraries or networks as well
      Adaptation of the rule set
        – For specific datasets much more can be detected than just the basic
          features
        – E.g. journals with a fixed structure over many years or parliamentary
          papers, dissertations, research papers, etc.
      Onsite installations
        – Not our focus, but could be done for very large datasets or due to legal
          requirements (e.g. Google images)


Results: TOC

      25 TOC entries in total
      22 TOC entries are completely correct
      1 TOC entry was missed
      2 TOC entries are grouped incorrectly
      1 TOC entry has no link
      1 TOC entry has a wrong link





             Thank you for your attention!




CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 

Structural analysis of documents Functional Extension Parser (FEP). Günter Mühlberger

  • 1. Structural analysis of documents: Functional Extension Parser (FEP)
      Günter Mühlberger, University Innsbruck Library (UIBK)
      IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
  • 2. Agenda
      Introduction
      Features
        – What do we recognise with the structural analysis?
      Benefits
        – Why is structural analysis useful?
      Architecture
        – How does it work?
      Results
        – How good are we?
      Roadmap
        – When will it come into being?
      Business
        – Which offers will be available?
  • 3. Introduction
      Document understanding platform
      Try to enhance and exploit the logical structure of documents for
        – Display
        – Navigation
        – Retrieval
      Enhance OCR output with structural metadata
        – Fully automated processing
        – Interactive correction
      IMPACT EVA/MINERVA 12th Nov. 2008
  • 4. Features
      General
        – We can recognise all structural elements that have some layout representation, e.g. region, size, typeface, distance to other elements, etc.
        – Focus in IMPACT: basic features which are typical for all documents
        – The rule set can be extended or specialised for other datasets, e.g. journals, dissertations, index cards, yearbooks, newspapers, etc.
        – The better the OCR, the better our structural analysis
      Basic features for books
        – Page numbers
        – Running titles (headers)
        – Print space
        – Footnotes
        – Signature marks
        – Headings (within the running text)
        – Table of contents entries (additional to headings)
        – Front/Body/Back
        – Paragraphs
  • 5. [Figure labels: print space, headings, footnotes]
  • 6. [Figure labels: running title (header), page number, signature mark]
  • 7. Table of contents
        – Linked with the headings in the running text and with the page numbers
  • 8. Benefits (1)
      Display
        – A correct print space allows images to be displayed centred (no flipping between pages)
      Search & retrieval
        – Scoring of results: could take structural data (headings, footnotes) into account
        – Noise reduction: front, body and back are separated (text from the front is often misleading); running titles repeat the same words; footnotes can be included or excluded
        – Faceted search: results can be displayed for running text, footnotes, headings
  • 9. Benefits (2)
      Navigation
        – Page numbers allow the original table of contents to be used
        – The original table of contents can be linked with headings/page numbers in the book
      Document editing
        – Further markup (e.g. TEI) is supported
        – Manual preparation for Print-on-Demand is eased (print space)
        – Selective OCR correction can be applied, e.g. only headings, running text or footnotes could be fed to CONCERT
      Document matching
        – Contributions or footnotes can be matched with existing bibliographical databases
  • 10. Improved display on the Internet and in PDF
  • 11. Refinement of full-text search
      Facets for e.g.
        – Running text
        – Footnotes
        – Headings
      Less noise
        – Running titles and signature marks are excluded from search
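The facet and noise-reduction idea on this slide can be illustrated with a small Python sketch. It is not FEP code: the label names (`running_title`, `signature_mark`, etc.), the `build_index` and `search` helpers, and the sample page are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical structural labels; the slides do not give FEP's real label names.
NOISE_LABELS = {"running_title", "signature_mark"}


def build_index(lines):
    """Map each lowercased word to the set of structural labels it occurs in.

    Lines labelled as noise (running titles, signature marks) are skipped,
    so they can never produce a search hit.
    """
    index = defaultdict(set)
    for label, text in lines:
        if label in NOISE_LABELS:
            continue
        for word in text.lower().split():
            index[word].add(label)
    return index


def search(index, word, facet=None):
    """Return the facets in which `word` occurs, optionally limited to one facet."""
    facets = index.get(word.lower(), set())
    if facet is None:
        return sorted(facets)
    return [facet] if facet in facets else []


# Invented sample page: (structural label, line text)
page = [
    ("running_title", "History of Tyrol"),
    ("heading", "The Alps"),
    ("running_text", "The Alps cross Tyrol from east to west"),
    ("footnote", "See the history of the Alps"),
    ("signature_mark", "A2"),
]
index = build_index(page)
alps_facets = search(index, "alps")
```

Here a query for "alps" reports hits in three facets, while "a2" (a signature mark) finds nothing because noise lines never enter the index.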
  • 12. Clickable table of contents entries
        – Google style
      Selective OCR correction
        – Correct only ToC, headings, footnotes, etc.
  • 13. Matching of documents with external sources
        – Match footnotes with library catalogues (bibliographies)
        – Match table of contents entries and headings with bibliographies
  • 14. Improved editing
        – Alternating print spaces for Print on Demand
        – Further processing for TEI editions etc.
  • 15. Architecture
      Input
        – Results from OCR processing on word level (coordinates)
        – E.g. ALTO file, ABBYY XML file or Google HTML
      Output
        – Structural annotations for recognised text features, e.g. page numbers, running titles, headings, etc.
        – E.g. XML, ALTO, METS, TEI, etc.
      General workflow
        – OCR result files are parsed (FEP general XML format)
        – The rule set is applied to the dataset (rules are managed by a rules engine)
        – Results are stored in a database
        – Export on various levels is provided
      Optional
        – Online or offline correction (GUI)
        – Adaptation of the rule set
        – Quality assurance on the basis of ground truth
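The four workflow steps (parse, apply rules, store, export) can be sketched end to end in Python. This is only an illustration under stated assumptions: the XML below is a simplified, namespace-free fragment in the style of ALTO (real ALTO files declare a namespace and carry more structure), the single rule is a toy stand-in for the FEP rule set, and the function names and the SQLite schema are invented here.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Simplified ALTO-style sample: words (String) with coordinates inside lines.
ALTO_SAMPLE = """<alto><Layout><Page WIDTH="1000" HEIGHT="1500">
  <TextLine><String CONTENT="17" HPOS="480" VPOS="40" WIDTH="40" HEIGHT="30"/></TextLine>
  <TextLine><String CONTENT="Chapter" HPOS="100" VPOS="200" WIDTH="150" HEIGHT="30"/>
            <String CONTENT="One" HPOS="260" VPOS="200" WIDTH="70" HEIGHT="30"/></TextLine>
</Page></Layout></alto>"""


def parse_lines(xml_text):
    """Step 1: parse word-level OCR output into one record per text line."""
    root = ET.fromstring(xml_text)
    lines = []
    for n, line in enumerate(root.iter("TextLine")):
        words = list(line.iter("String"))
        lines.append({
            "line_no": n,
            "text": " ".join(w.get("CONTENT") for w in words),
            "hpos": min(int(w.get("HPOS")) for w in words),
            "vpos": min(int(w.get("VPOS")) for w in words),
        })
    return lines


def apply_rules(lines):
    """Step 2: toy rule (not an FEP rule): a digits-only first line is a page number."""
    for rec in lines:
        is_page_no = rec["line_no"] == 0 and rec["text"].isdecimal()
        rec["label"] = "page_number" if is_page_no else "running_text"
    return lines


def store(lines):
    """Step 3: keep the annotated lines in a database (in-memory SQLite here)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE line (line_no INTEGER, text TEXT, label TEXT)")
    db.executemany(
        "INSERT INTO line VALUES (?, ?, ?)",
        [(r["line_no"], r["text"], r["label"]) for r in lines],
    )
    return db


def export(db):
    """Step 4: export the annotations, here as plain (label, text) tuples."""
    return db.execute("SELECT label, text FROM line ORDER BY line_no").fetchall()


result = export(store(apply_rules(parse_lines(ALTO_SAMPLE))))
```

Keeping the stages as separate functions mirrors the slide's separation of parsing, rules, storage and export, so each stage could be swapped (another OCR format, another export target) without touching the others.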
  • 17. The FEP Core
      Based on an expert-system-like rule engine for Java (Jess)
      Both manually crafted rules and rules obtained by machine learning
      Uses fuzzy logic to deal with uncertainty
      Typical rules:
        – IF there is a numeral in the first line of the page AND this numeral is centred THEN this numeral may be the page number
        – IF there is a numeral in the first line of the page AND this numeral is at the right-hand side of the page AND this numeral is an odd number THEN this numeral may be the page number
        – IF there is a numeral in the first line of the page AND this numeral is at the left-hand side of the page AND this numeral is an even number THEN this numeral may be the page number
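The three page-number rules above can be approximated in a few lines of Python. This is a minimal sketch, not the Jess rule base: the `Token` type, the triangular membership functions and their 0.25 width, and the use of min/max as fuzzy AND/OR are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    line_index: int   # 0 means the first line of the page
    x_center: float   # horizontal centre of the token, same unit as page width


def position_memberships(x_center, page_width):
    """Fuzzy degrees (0..1) to which a token is left, centred or right.

    Assumed triangular memberships: full degree at the page edge or centre,
    falling to 0 at a distance of a quarter of the page width.
    """
    rel = x_center / page_width
    left = max(0.0, 1.0 - rel / 0.25)
    centred = max(0.0, 1.0 - abs(rel - 0.5) / 0.25)
    right = max(0.0, 1.0 - (1.0 - rel) / 0.25)
    return left, centred, right


def page_number_score(token, page_width):
    """Degree to which a token 'may be the page number' under the three rules."""
    # Shared precondition of all three rules: a numeral in the first line.
    if token.line_index != 0 or not token.text.isdecimal():
        return 0.0
    left, centred, right = position_memberships(token.x_center, page_width)
    odd = 1.0 if int(token.text) % 2 == 1 else 0.0
    even = 1.0 - odd
    # Fuzzy AND is min; combining the three alternative rules (OR) is max.
    return max(centred, min(right, odd), min(left, even))


centred_score = page_number_score(Token("17", 0, 500.0), 1000.0)
right_odd_score = page_number_score(Token("17", 0, 950.0), 1000.0)
right_even_score = page_number_score(Token("18", 0, 950.0), 1000.0)
```

A centred numeral gets the full score, an odd numeral near the right margin a high score, and an even numeral at the same position no score, which matches the odd-right / even-left convention expressed in the rules.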
  • 18. Results
      Basic rule set
        – General features for books from 1700 to 2000
        – Dataset of 155 books, 30,673 pages (141 training set, 41 evaluation set)
        – All books were manually annotated (ground truth)
      Recall, Precision, F-Measure
        – E.g. a book contains 10 heading lines. We find 12 lines; 8 of them are correct, 4 are false.
        – Recall = 8 of 10 = 0.80
        – Precision = 8 of 12 = 0.67
        – F-Measure = 2*0.80*0.67/(0.80+0.67) = 0.73
      More explanations
        – Important: we are counting lines, not structural items! E.g. if a heading consists of two lines (often with different typeface sizes), we have to find both to succeed
        – Differences between training and evaluation sets are marginal
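The worked example on this slide (10 true heading lines, 12 found, 8 correct) can be reproduced with a short Python sketch of line-level scoring. The function name and the use of integer line identifiers are assumptions for illustration; the formulas are the standard definitions of recall, precision and F-measure.

```python
def recall_precision_f(true_lines, found_lines):
    """Line-level recall, precision and F-measure for one structural label.

    Both arguments are collections of line identifiers: the ground-truth
    lines and the lines the recogniser assigned to the label.
    """
    true_set, found_set = set(true_lines), set(found_lines)
    correct = len(true_set & found_set)
    recall = correct / len(true_set) if true_set else 0.0
    precision = correct / len(found_set) if found_set else 0.0
    if correct == 0:
        return recall, precision, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return recall, precision, f_measure


# Slide example: 10 heading lines exist, 12 lines are found, 8 of them correct.
truth = range(10)
found = list(range(8)) + [100, 101, 102, 103]   # 8 hits plus 4 false positives
recall, precision, f_measure = recall_precision_f(truth, found)
```

Rounded to two decimals this gives recall 0.80, precision 0.67 and F-measure 0.73. Because lines are counted rather than structural items, a two-line heading contributes two identifiers to the ground truth and both must be found.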
  • 19. Results on Evaluation Set
                         Recall   Precision   F-measure
      Running text       0.99     0.98        0.98
      Footnotes          0.83     0.89        0.86
      Page numbers       0.97     1.00        0.98
      Running titles     0.97     1.00        0.98
      Heading            0.85     0.80        0.82
      Signature marks    0.68     0.89        0.77
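As a sanity check, the F-measure column can be recomputed from the recall and precision columns. The Python sketch below copies the values from the slide; the dictionary layout and function name are illustrative assumptions.

```python
# (recall, precision, reported F-measure) per structural label, from the slide.
RESULTS = {
    "Running text": (0.99, 0.98, 0.98),
    "Footnotes": (0.83, 0.89, 0.86),
    "Page numbers": (0.97, 1.00, 0.98),
    "Running titles": (0.97, 1.00, 0.98),
    "Heading": (0.85, 0.80, 0.82),
    "Signature marks": (0.68, 0.89, 0.77),
}


def f_measure(recall, precision):
    """Harmonic mean of recall and precision."""
    return 2 * recall * precision / (recall + precision)


# Recompute F from the two other columns and round like the slide does.
recomputed = {
    label: round(f_measure(r, p), 2) for label, (r, p, _) in RESULTS.items()
}
mismatches = [
    label for label, (_, _, reported) in RESULTS.items()
    if recomputed[label] != reported
]
```

With the values above `mismatches` stays empty, i.e. every reported F-measure agrees with its recall and precision at two decimals.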
  • 20. Roadmap
      Summer 2011: Beta version
        – Integration into the IMPACT Interoperability Platform
        – Basic rule set: books from 1700 to 1900
      End of the year: Version 1.0
        – Full-featured version
        – Enhanced online correction interface
        – FEP as a service, not as a product for local installation
  • 21. Business offers
      Web service for processing single volumes and correction
        – Will be integrated into the eBooks-on-Demand (EOD) network
        – 30 libraries are already uploading their images to the OCR server in Innsbruck
        – FEP will be an additional service for general material
        – Similar offers can be made to other libraries or networks as well
      Adaptation of the rule set
        – For specific datasets much more can be detected than just the basic features
        – E.g. journals with a fixed structure over many years, parliamentary papers, dissertations, research papers, etc.
      Onsite installations
        – Not our focus, but could be done for very large datasets or due to legal requirements (e.g. Google images)
  • 26. Results: TOC
        – 25 TOC entries in total
        – 22 TOC entries are completely correct
        – 1 TOC entry was missed
        – 2 TOC entries are grouped incorrectly
        – 1 TOC entry has no link
        – 1 TOC entry has a wrong link
  • 27. Thank you for your attention!