IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.




The Improving Access to Text (IMPACT) project
and other European initiatives
Michael Day
UKOLN, University of Bath
m.day@ukoln.ac.uk
http://www.ukoln.ac.uk/


JISC Workshop: OCR for the Mass Digitisation of Textual Materials,
University of Bath 24 September 2009




          Presentation outline
                 Contexts
                   – Some European digitisation activity
                   – Digitisation challenges
                 The IMPACT project
                   – The consortium and project structure
                   – Major project activities








          Digitisation activity in Europe (1)
                 European Commission
                   – i2010 digital libraries initiative
                         Launched September 2005
                         Bringing together European cultural heritage online
                           – Europeana portal
                 Many projects dealing with the digitisation of texts in Europe
                   – Many at large scale, with selectivity at collection level or higher
                     (industrial-scale mass digitisation)
                   – Content holders often work with commercial providers (e.g.,
                     outsourcing of conversion processes, partnering with Google Books)
                   – However, "Europe is facing a very important cultural and economic
                     challenge: Only some 1% of the books in Europe's national libraries
                     have been digitised so far, leaving an enormous task ahead of us"
                     (Viviane Reding and Charlie McCreevy, EU Commissioners,
                     September 2009)




          Digitisation activity in Europe (2)
                 Europeana - Europe's digital library
                   – Website: http://europeana.eu/portal/
                   – Launched in November 2008
                   – Hosted by the National Library of the Netherlands; run by the
                     European Digital Library Foundation
                   – Part-funded by the EU's eContentplus programme
                   – A portal providing access to ca. 4.6 million items
                   – Mixed content:
                         Books, newspapers, photographs, maps, film clips
                         Books included are mainly those in the public domain
                   – EC public consultation on Europeana and the digitisation of books,
                     open until 15 November 2009
                     http://ec.europa.eu/information_society/newsroom/cf/itemlongdetail.cfm?item_id=5181





          Digitisation challenges (1)
                 Large-scale digitisation
                   – Mostly based on the image front searching technique (pioneered by
                     projects like JSTOR)
                         Scan physical item to create digital images of pages
                         Subject those pages to OCR
                         Combine OCR output with the images: the OCR output is
                         considered good enough for searching, and any ambiguous
                         results can be checked against the page images
                         “The strategy of linking page images with OCR enables us to
                         make effective use of large corpora of relatively cheaply scanned
                         books and was, in large measure, effective because it points
                         backwards to the limitations of print: search gets human readers
                         to the page and leaves them to parse out its meaning” (Many
                         More than a Million seminar report, CLIR, November 2007:
                         http://www.clir.org/activities/digitalscholar/Nov28final.pdf)
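
The approach described above (search the OCR text, then send the reader to the page image) can be sketched roughly as follows. This is a minimal illustration and not the method of JSTOR or any particular project; the file names, sample texts and function names are all hypothetical.

```python
from collections import defaultdict
import re


def tokenize(text):
    """Lower-case the OCR text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each OCR token to the set of page-image files it occurs on.

    `pages` maps a page-image filename to the raw OCR text of that page.
    """
    index = defaultdict(set)
    for image_file, ocr_text in pages.items():
        for token in tokenize(ocr_text):
            index[token].add(image_file)
    return index


def search(index, query):
    """Return the page images whose OCR text contains every query word."""
    tokens = tokenize(query)
    if not tokens:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in tokens))
    return sorted(hits)


# Hypothetical sample data: two scanned pages, the second with OCR errors.
pages = {
    "page_001.tif": "The Improving Access to Text project",
    "page_002.tif": "Acccss to hiftorical text is improving",
}
index = build_index(pages)
result = search(index, "improving text")
```

The search only returns page images, so the reader resolves any doubtful OCR by looking at the page itself. In this sample a query for "access" would miss the second page, because the OCR misread the word there.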





          Digitisation challenges (2)
                 Current generations of OCR tools do not always provide
                 satisfactory results for historical documents
                   – Main focus of tools is on modern documents
                   – Not always fit for historic material with archaic fonts, obsolete
                     characters, complex layouts, warped or degraded pages, language
                     variation, etc.
                   – Manual post-correction has a role, but is slow and expensive
                 Example of OCR errors
                   – From Australian Newspapers (National Library of Australia):
                     http://newspapers.nla.gov.au/
                   – "The text in the left panel has been electronically translated by a
                     computer. Computers are not as good at reading as humans, and
                     often make mistakes"
                   – This system permits users to correct the OCR output
                   – Article by Rose Holley in D-Lib Magazine, March/April 2009:
                     http://www.dlib.org/dlib/march09/holley/03holley.html
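
One common way to quantify errors of this kind is the character error rate: the edit distance between the OCR output and a manually corrected transcription, divided by the length of the transcription. The sketch below is illustrative only; it is not presented as the metric used by IMPACT or by the Australian Newspapers service, and the function names are assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def character_error_rate(ocr_text, ground_truth):
    """Edit distance to the ground truth, normalised by its length."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(ocr_text, ground_truth) / len(ground_truth)


# Hypothetical example: a long s misread as "f" in a seven-letter word.
cer = character_error_rate("hiftory", "history")
```

Here a single substituted character in a seven-character word gives a rate of 1/7. The same calculation on word tokens instead of characters gives a word error rate, which is often closer to what matters for searching.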





          Extremely simplified text digitisation workflow
                 [Workflow diagram; stages shown, roughly left to right]
                   – Printed item
                   – Imaging
                   – Image processing + enhancement
                   – Structural analysis and segmentation
                   – OCR
                   – Post correction
                   – OCR output and image
                   – Access package
                 Also shown in the diagram
                   – Linguistic tools
                   – Named entity identification
                   – Metadata





          The IMPACT project
                 Research project funded by the European Commission
                   –     Large-scale Integrating Project
                   –     Funded from January 2008, for four years
                   –     Coordinated by the National Library of the Netherlands (KB)
                   –     Total budget: EUR 15.5M; EU funding: EUR 11.5M
                   –     Consortium of 15 partners
                             Libraries
                             Universities and research centres
                             Industrial partners








          The IMPACT consortium
           • Libraries
                   – National Library of the Netherlands (coordinator)
                   – The British Library
                   – Bibliothèque nationale de France
                   – German National Library
                   – Bavarian State Library
                   – Goettingen State and University Library
                   – Austrian National Library
                   – University of Innsbruck Library
           • Universities and research centres
                   – Dutch Institute for Lexicology
                   – National Centre for Scientific Research - Demokritos
                   – University of Salford
                   – University of Munich
                   – University of Innsbruck
                   – University of Bath (UKOLN)
           • Industrial partners
                   – ABBYY
                   – IBM Haifa Research Lab






          IMPACT project objectives
                 Aims to significantly improve the mass digitisation of historical
                 printed text by
                   – Innovating OCR software and language technology
                   – Sharing expertise and building capacity across Europe
                   – Ensuring that tools and services will be sustained after the end of the
                     project
                 Specific principles:
                   – Reduce effort and enhance speed and results of mass digitisation
                     (speed and scalability)
                   – Focus on the whole post-scanning workflow: image processing, OCR
                     processing (including dictionaries), OCR correction, and document
                     formatting
                   – All research and development to be grounded in the needs of libraries
                   – Working with other centres of competence





          IMPACT project approach (1)
                 Project structure
                   – 22 work packages
                 Four sub-projects
                   – Technical and research based:
                         TR (Text Recognition) focused on the extraction of text in a digital
                         form from an image (OCR)
                         EE (Enhancement and Enrichment) using linguistic technologies
                         to make the results of full-text digitisation more accurate and
                         accessible
                   – Strategic:
                         OC (Operational Context) guiding the direction of the project from
                         the libraries' perspective
                         CB (Capacity Building) stimulating the uptake of results in the
                         museums, libraries and archives communities





          IMPACT project approach (2)
                 Operational Context
                   – Requirements, Benchmarking and Metrics
                   – Best Practices and Guidelines
                   – Technical Framework and Interoperability
                 Text Recognition
                   – Pre-processing and segmentation
                   – Adaptive and experimental OCR
                   – Models and dictionaries
                 Enhancement and Enrichment
                   – Collaborative Correction
                   – Historical Lexica
                   – Structural Metadata
                 Capacity Building
                   – Packaging of resources
                   – Training and support
                   – Demonstration







          IMPACT tools and services (1)
                 Text Recognition
                   – Technologies for supporting the extraction of text from the page
                   – Adaptive OCR engine, integrating:
                        Image enhancement toolkit
                        Segmentation toolkit
                        Post-correction modules
                        Other OCR engines
                   – Experimental prototypes
                        Typewritten OCR
                        Wordspotting
                        Inventory extraction








          IMPACT tools and services (2)
                 Enhancement and enrichment
                   – Focus on making OCR results more accurate and accessible
                   – Collaborative correction
                         Web-based, linked to OCR engine
                   – Tools and content
                         General and named-entity lexica for Dutch, German and
                         English, general support for lexicon building in other languages
                         Dealing with historical languages
                         Collaborative environments for managing named entities
                   – Structural metadata
                         Functional Extension Parser, for the automatic detection and
                         tagging of structural metadata of scanned material
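
As a rough illustration of how a historical lexicon can make OCR text more accessible, the sketch below expands a modern search term into its historical spellings before searching. The lexicon entries and function names are invented for this example and are not taken from the IMPACT lexica or tools.

```python
# Hypothetical lexicon: historical spelling -> modern form.
HISTORICAL_LEXICON = {
    "musick": "music",
    "publick": "public",
    "shew": "show",
    "shewed": "showed",
}


def build_variant_table(lexicon):
    """Invert the lexicon: modern form -> set of all known spellings."""
    table = {}
    for historical, modern in lexicon.items():
        # Each set starts with the modern form itself.
        table.setdefault(modern, {modern}).add(historical)
    return table


def expand_query(word, variants):
    """Return the word plus any historical spellings listed for it."""
    key = word.lower()
    return sorted(variants.get(key, {key}))


variants = build_variant_table(HISTORICAL_LEXICON)
expanded = expand_query("Music", variants)
```

A search front end could then query the OCR text for every spelling in `expanded`, so that a reader typing the modern form also finds pages printed with the older one. Words without a lexicon entry are passed through unchanged.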







          IMPACT tools and services (3)
                 Strategic tools and services
                   – Website (http://www.impact-project.eu/)
                   – Decision support tools, to support the initiation, organisation and
                     management of mass-digitisation projects
                   – A set of learning resources providing guidance on the digitisation of
                     texts and the implementation of project tools
                   – Training and support
                         Helpdesk
                         Training programme (events)
                   – Demonstration of the tools (case studies)

                   – IMPACT Centre of Competence







          Thank you for your attention!
                 Any questions?

                 Additional information:
                   – The IMPACT project:
                        Website: http://www.impact-project.eu/
                        Project office: impact@kb.nl
                   – Europeana: http://www.europeana.eu/portal/




                   – Workshop Materials: http://www.ukoln.ac.uk/events/ocr-2009/


JISC Workshop: OCR for the Mass Digitisation of Textual Materials, University of Bath, 24 September 2009                                                            21

More Related Content

What's hot

Europeana. A Digital Library for the Humanities?
Europeana. A Digital Library for the Humanities?Europeana. A Digital Library for the Humanities?
Europeana. A Digital Library for the Humanities?AubreyMcFato
 
Bratislava WS - Schlarb - ONB - technical tools_pdf
Bratislava WS - Schlarb - ONB - technical tools_pdfBratislava WS - Schlarb - ONB - technical tools_pdf
Bratislava WS - Schlarb - ONB - technical tools_pdfIMPACT Centre of Competence
 
IMPACT Final Event 26-06-2012 - Overview of IMPACT tools by: ABBYY, NCSR Demo...
IMPACT Final Event 26-06-2012 - Overview of IMPACT tools by: ABBYY, NCSR Demo...IMPACT Final Event 26-06-2012 - Overview of IMPACT tools by: ABBYY, NCSR Demo...
IMPACT Final Event 26-06-2012 - Overview of IMPACT tools by: ABBYY, NCSR Demo...IMPACT Centre of Competence
 
ECLAP White paper, social network for Cultural Heritage on Peforming arts
ECLAP White paper, social network for Cultural Heritage on Peforming artsECLAP White paper, social network for Cultural Heritage on Peforming arts
ECLAP White paper, social network for Cultural Heritage on Peforming artsPaolo Nesi
 
Enhancing user access to european digital heritage
Enhancing user access to european digital heritageEnhancing user access to european digital heritage
Enhancing user access to european digital heritageEuropeanaConnect
 
Testbeds for CONTENT experimentations
Testbeds for CONTENT experimentationsTestbeds for CONTENT experimentations
Testbeds for CONTENT experimentationsexperimedia
 
EMPATIC Information Leaflet
EMPATIC Information LeafletEMPATIC Information Leaflet
EMPATIC Information LeafletEmpatic Project
 
Digitised Content: How we Make It Relevant to Researchers, Teachers and Students
Digitised Content: How we Make It Relevant to Researchers, Teachers and StudentsDigitised Content: How we Make It Relevant to Researchers, Teachers and Students
Digitised Content: How we Make It Relevant to Researchers, Teachers and StudentsLIBER Europe
 
Tel concertation meeting project presentations - 7-2-2014
Tel concertation meeting   project presentations - 7-2-2014Tel concertation meeting   project presentations - 7-2-2014
Tel concertation meeting project presentations - 7-2-2014munarmu
 
Eastern Europe Partnership Event - 002 ognjen prjnat
Eastern Europe Partnership Event - 002 ognjen prjnatEastern Europe Partnership Event - 002 ognjen prjnat
Eastern Europe Partnership Event - 002 ognjen prjnatTERENA
 
Ideas for Internationalisation@Home in Higher Education
Ideas for Internationalisation@Home in Higher EducationIdeas for Internationalisation@Home in Higher Education
Ideas for Internationalisation@Home in Higher EducationTon Koenraad
 
Digital Humanities @ Net7
Digital Humanities @ Net7Digital Humanities @ Net7
Digital Humanities @ Net7Net7
 
net lab Theme 2 DIGHUMLAB Launch 10 September 2012:1315
net lab Theme 2 DIGHUMLAB Launch 10 September 2012:1315net lab Theme 2 DIGHUMLAB Launch 10 September 2012:1315
net lab Theme 2 DIGHUMLAB Launch 10 September 2012:1315DIGHUMLAB
 

What's hot (16)

Europeana. A Digital Library for the Humanities?
Europeana. A Digital Library for the Humanities?Europeana. A Digital Library for the Humanities?
Europeana. A Digital Library for the Humanities?
 
Bratislava WS - Schlarb - ONB - technical tools_pdf
Bratislava WS - Schlarb - ONB - technical tools_pdfBratislava WS - Schlarb - ONB - technical tools_pdf
Bratislava WS - Schlarb - ONB - technical tools_pdf
 
IMPACT Final Conference - Stefan Pletschacher
IMPACT Final Conference - Stefan PletschacherIMPACT Final Conference - Stefan Pletschacher
IMPACT Final Conference - Stefan Pletschacher
 
IMPACT Final Event 26-06-2012 - Overview of IMPACT tools by: ABBYY, NCSR Demo...
IMPACT Final Event 26-06-2012 - Overview of IMPACT tools by: ABBYY, NCSR Demo...IMPACT Final Event 26-06-2012 - Overview of IMPACT tools by: ABBYY, NCSR Demo...
IMPACT Final Event 26-06-2012 - Overview of IMPACT tools by: ABBYY, NCSR Demo...
 
IMPACT Final Conference - Ulrich Reffle
IMPACT Final Conference - Ulrich ReffleIMPACT Final Conference - Ulrich Reffle
IMPACT Final Conference - Ulrich Reffle
 
ECLAP White paper, social network for Cultural Heritage on Peforming arts
ECLAP White paper, social network for Cultural Heritage on Peforming artsECLAP White paper, social network for Cultural Heritage on Peforming arts
ECLAP White paper, social network for Cultural Heritage on Peforming arts
 
Enhancing user access to european digital heritage
Enhancing user access to european digital heritageEnhancing user access to european digital heritage
Enhancing user access to european digital heritage
 
Testbeds for CONTENT experimentations
Testbeds for CONTENT experimentationsTestbeds for CONTENT experimentations
Testbeds for CONTENT experimentations
 
EMPATIC Information Leaflet
EMPATIC Information LeafletEMPATIC Information Leaflet
EMPATIC Information Leaflet
 
Digitised Content: How we Make It Relevant to Researchers, Teachers and Students
Digitised Content: How we Make It Relevant to Researchers, Teachers and StudentsDigitised Content: How we Make It Relevant to Researchers, Teachers and Students
Digitised Content: How we Make It Relevant to Researchers, Teachers and Students
 
Etok
EtokEtok
Etok
 
Tel concertation meeting project presentations - 7-2-2014
Tel concertation meeting   project presentations - 7-2-2014Tel concertation meeting   project presentations - 7-2-2014
Tel concertation meeting project presentations - 7-2-2014
 
Eastern Europe Partnership Event - 002 ognjen prjnat
Eastern Europe Partnership Event - 002 ognjen prjnatEastern Europe Partnership Event - 002 ognjen prjnat
Eastern Europe Partnership Event - 002 ognjen prjnat
 
Ideas for Internationalisation@Home in Higher Education
Ideas for Internationalisation@Home in Higher EducationIdeas for Internationalisation@Home in Higher Education
Ideas for Internationalisation@Home in Higher Education
 
Digital Humanities @ Net7
Digital Humanities @ Net7Digital Humanities @ Net7
Digital Humanities @ Net7
 
net lab Theme 2 DIGHUMLAB Launch 10 September 2012:1315
net lab Theme 2 DIGHUMLAB Launch 10 September 2012:1315net lab Theme 2 DIGHUMLAB Launch 10 September 2012:1315
net lab Theme 2 DIGHUMLAB Launch 10 September 2012:1315
 

Viewers also liked

Sindhi computing in Human Language Technology
Sindhi computing in Human Language TechnologySindhi computing in Human Language Technology
Sindhi computing in Human Language TechnologyFayaz Amar
 
Matlab based vehicle number plate identification system using ocr
Matlab based vehicle number plate identification system using ocrMatlab based vehicle number plate identification system using ocr
Matlab based vehicle number plate identification system using ocrGhanshyam Dusane
 
Optical character recognition (ocr) ppt
Optical character recognition (ocr) pptOptical character recognition (ocr) ppt
Optical character recognition (ocr) pptDeijee Kalita
 
LICENSE NUMBER PLATE RECOGNITION SYSTEM USING ANDROID APP
LICENSE NUMBER PLATE RECOGNITION SYSTEM USING ANDROID APPLICENSE NUMBER PLATE RECOGNITION SYSTEM USING ANDROID APP
LICENSE NUMBER PLATE RECOGNITION SYSTEM USING ANDROID APPAditya Mishra
 
Tesseract OCR Engine - OpenFest 2009
Tesseract OCR Engine - OpenFest 2009Tesseract OCR Engine - OpenFest 2009
Tesseract OCR Engine - OpenFest 2009Svetlin Nakov
 
Optical Character Recognition (OCR)
Optical Character Recognition (OCR)Optical Character Recognition (OCR)
Optical Character Recognition (OCR)Vidyut Singhania
 
Automatic Number Plate Recognition (ANPR)
Automatic Number Plate Recognition (ANPR)Automatic Number Plate Recognition (ANPR)
Automatic Number Plate Recognition (ANPR)Vidyut Singhania
 
optical character recognition system
optical character recognition systemoptical character recognition system
optical character recognition systemVijay Apurva
 
Number plate recognition system using matlab.
Number plate recognition system using matlab.Number plate recognition system using matlab.
Number plate recognition system using matlab.Namra Afzal
 
Vehicle Number Plate Recognition System
Vehicle Number Plate Recognition SystemVehicle Number Plate Recognition System
Vehicle Number Plate Recognition Systemprashantdahake
 
Digital Image Processing
Digital Image ProcessingDigital Image Processing
Digital Image ProcessingSahil Biswas
 
Introduction of Cloud computing
Introduction of Cloud computingIntroduction of Cloud computing
Introduction of Cloud computingRkrishna Mishra
 

Viewers also liked (14)

Sindhi computing in Human Language Technology
Sindhi computing in Human Language TechnologySindhi computing in Human Language Technology
Sindhi computing in Human Language Technology
 
Matlab based vehicle number plate identification system using ocr
Matlab based vehicle number plate identification system using ocrMatlab based vehicle number plate identification system using ocr
Matlab based vehicle number plate identification system using ocr
 
Optical character recognition (ocr) ppt
Optical character recognition (ocr) pptOptical character recognition (ocr) ppt
Optical character recognition (ocr) ppt
 
LICENSE NUMBER PLATE RECOGNITION SYSTEM USING ANDROID APP
LICENSE NUMBER PLATE RECOGNITION SYSTEM USING ANDROID APPLICENSE NUMBER PLATE RECOGNITION SYSTEM USING ANDROID APP
LICENSE NUMBER PLATE RECOGNITION SYSTEM USING ANDROID APP
 
Tesseract OCR Engine - OpenFest 2009
Tesseract OCR Engine - OpenFest 2009Tesseract OCR Engine - OpenFest 2009
Tesseract OCR Engine - OpenFest 2009
 
Optical Character Recognition (OCR)
Optical Character Recognition (OCR)Optical Character Recognition (OCR)
Optical Character Recognition (OCR)
 
Automatic Number Plate Recognition (ANPR)
Automatic Number Plate Recognition (ANPR)Automatic Number Plate Recognition (ANPR)
Automatic Number Plate Recognition (ANPR)
 
optical character recognition system
optical character recognition systemoptical character recognition system
optical character recognition system
 
Number plate recognition system using matlab.
Number plate recognition system using matlab.Number plate recognition system using matlab.
Number plate recognition system using matlab.
 
Text Detection and Recognition
Text Detection and RecognitionText Detection and Recognition
Text Detection and Recognition
 
Image processing ppt
Image processing pptImage processing ppt
Image processing ppt
 
Vehicle Number Plate Recognition System
Vehicle Number Plate Recognition SystemVehicle Number Plate Recognition System
Vehicle Number Plate Recognition System
 
Digital Image Processing
Digital Image ProcessingDigital Image Processing
Digital Image Processing
 
Introduction of Cloud computing
Introduction of Cloud computingIntroduction of Cloud computing
Introduction of Cloud computing
 

The Improving Access to Text (IMPACT) project and other European initiatives

  • 1. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. The Improving Access to Text (IMPACT) project and other European initiatives Michael Day UKOLN, University of Bath m.day@ukoln.ac.uk http://www.ukoln.ac.uk/ JISC Workshop: OCR for the Mass Digitisation of Textual Materials, University of Bath 24 September 2009
  • 2. Presentation outline
      Contexts
        – Some European digitisation activity
        – Digitisation challenges
      The IMPACT project
        – The consortium and project structure
        – Major project activities
  • 3. Digitisation activity in Europe (1)
      European Commission
        – i2010 digital libraries initiative
            Launched September 2005
            Bringing together European cultural heritage online
              – Europeana portal
      Many projects dealing with the digitisation of texts in Europe
        – Many at large scale, with selectivity at collection level or higher (industrial-scale mass digitisation)
        – Content holders often work with commercial providers (e.g., outsourcing of conversion processes, partnering with Google Books)
        – However, "Europe is facing a very important cultural and economic challenge: Only some 1% of the books in Europe's national libraries have been digitised so far, leaving an enormous task ahead of us" (Viviane Reding and Charlie McCreevy, EU Commissioners, September 2009)
  • 4. Digitisation activity in Europe (2)
      Europeana - Europe's digital library
        – Website: http://europeana.eu/portal/
        – Launched in November 2008
        – Hosted by the National Library of the Netherlands; run by the European Digital Library Foundation
        – Part funded by the EU's eContent plus programme
        – A portal providing access to ca. 4.6 million items
        – Mixed content:
            Books, newspapers, photographs, maps, film clips
            Books included are mainly those in the public domain
        – EC public consultation on Europeana and the digitisation of books, open until 15 November 2009: http://ec.europa.eu/information_society/newsroom/cf/itemlongdetail.cfm?item_id=5181
  • 8. Digitisation challenges (1)
      Large-scale digitisation
        – Mostly based on the image-front searching technique (pioneered by projects like JSTOR)
            Scan the physical item to create digital images of its pages
            Subject those pages to OCR
            Combine the OCR output with the images: the OCR output is considered good enough for searching, and any ambiguous results can be checked against the page images
      “The strategy of linking page images with OCR enables us to make effective use of large corpora of relatively cheaply scanned books and was, in large measure, effective because it points backwards to the limitations of print: search gets human readers to the page and leaves them to parse out its meaning” (Many More than a Million seminar report, CLIR, November 2007: http://www.clir.org/activities/digitalscholar/Nov28final.pdf)
  • 9. Digitisation challenges (2)
      Current generations of OCR tools do not always provide satisfactory results for historical documents
        – The main focus of the tools is on modern documents
        – They are not always fit for historic material with archaic fonts, obsolete characters, complex layouts, warped or degraded pages, language variation, etc.
        – Manual post-correction has a role, but is slow and expensive
      Example of OCR errors
        – From Australian Newspapers (National Library of Australia): http://newspapers.nla.gov.au/
        – "The text in the left panel has been electronically translated by a computer. Computers are not as good at reading as humans, and often make mistakes"
        – This system permits users to correct the OCR output
        – Article by Rose Holley in D-Lib Magazine, March/April 2009: http://www.dlib.org/dlib/march09/holley/03holley.html
  • 11. Extremely simplified text digitisation workflow
      [Diagram; box labels as extracted, order not recoverable: Printed item; Imaging; Image; Image processing; Structural analysis and segmentation; OCR; OCR output; Post correction + enhancement; Linguistic tools; Named entity identification; Metadata; Access package]
  • 12. The IMPACT project
      Research project funded by the European Commission
        – Large-scale Integrating Project
        – Funded from January 2008, for four years
        – Coordinated by the National Library of the Netherlands (KB)
        – Total budget: EUR 15.5M; EU funding: EUR 11.5M
        – Consortium of 15 partners
            Libraries
            Universities and research centres
            Industrial partners
  • 13. The IMPACT consortium
      Libraries
        – National Library of the Netherlands (coordinator)
        – The British Library
        – Bibliothèque nationale de France
        – German National Library
        – Bavarian State Library
        – Goettingen State and University Library
        – Austrian National Library
        – University of Innsbruck Library
      Universities and research centres
        – Dutch Institute for Lexicology
        – National Centre for Scientific Research - Demokritos
        – University of Salford
        – University of Munich
        – University of Innsbruck
        – University of Bath (UKOLN)
      Industrial partners
        – ABBYY
        – IBM Haifa Research Lab
  • 14. IMPACT project objectives
        Aims to significantly improve the mass digitisation of historical printed text by
          – Innovating OCR software and language technology
          – Sharing expertise and building capacity across Europe
          – Ensuring that tools and services will be sustained after the end of the project
        Specific principles:
          – Reduce effort and enhance speed and results of mass digitisation (speed and scalability)
          – Focus on the whole post scanning workflow: image processing, OCR processing (including dictionaries), OCR correction, and document formatting
          – All research and development to be grounded in the needs of libraries
          – Working with other centres of competence
  • 15. IMPACT project approach (1)
        Project structure
          – 22 work packages
        Four sub-projects
          – Technical and research based:
                TR (Text Recognition) focused on the extraction of text in a digital form from an image (OCR)
                EE (Enhancement and Enrichment) using linguistic technologies to make the results of full-text digitisation more accurate and accessible
          – Strategic:
                OC (Operational Context) guiding the direction of the project from the libraries' perspective
                CB (Capacity Building) stimulating the uptake of results in the museums, libraries and archives communities
  • 16. IMPACT project approach (2)
        Operational Context
          – Requirements, Benchmarking and Metrics
          – Best Practices and Guidelines
          – Technical Framework and Interoperability
        Text Recognition
          – Pre-processing and segmentation
          – Adaptive and experimental OCR
          – Models and dictionaries
        Enhancement and Enrichment
          – Collaborative Correction
          – Historical Lexica
          – Structural Metadata
        Capacity Building
          – Packaging of resources
          – Training and support
          – Demonstration
  • 17. IMPACT tools and services (1)
        Text Recognition
          – Technologies for supporting the extraction of text from the page
          – Adaptive OCR engine, integrating:
                Image enhancement toolkit
                Segmentation toolkit
                Post-correction modules
                Other OCR engines
          – Experimental prototypes
                Typewritten OCR
                Wordspotting
                Inventory extraction
  • 18. IMPACT tools and services (2)
        Enhancement and enrichment
          – Focus on making OCR results more accurate and accessible
          – Collaborative correction
                Web based, linked to OCR engine
          – Tools and content
                General and named entities lexica for Dutch, German and English, general support for lexicon building in other languages
                Dealing with historical languages
                Collaborative environments for managing named entities
          – Structural metadata
                Functional Extension Parser, for the automatic detection and tagging of structural metadata of scanned material
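The idea of using lexica to make OCR results more accurate can be illustrated with a small sketch. This is not the IMPACT software or its method; it is a minimal example under stated assumptions: the tiny lexicon, the function names `correct_token` and `modernise`, and the similarity cutoff are all hypothetical, and fuzzy matching from Python's standard `difflib` module stands in for whatever technique the project tools actually use.

```python
import difflib

# Hypothetical mini-lexicon mapping historical spellings to modern forms.
# The entries are illustrative only and are not taken from any IMPACT lexicon.
HISTORICAL_LEXICON = {
    "publick": "public",
    "musick": "music",
    "shew": "show",
}

# Every word form the sketch treats as valid (historical and modern).
KNOWN_WORDS = sorted(set(HISTORICAL_LEXICON) | set(HISTORICAL_LEXICON.values()))


def correct_token(token, cutoff=0.8):
    """Return the closest known word form for an OCR token.

    The token is returned unchanged when it is already known or when
    no candidate reaches the similarity cutoff (an assumed threshold).
    """
    lower = token.lower()
    if lower in KNOWN_WORDS:
        return token
    matches = difflib.get_close_matches(lower, KNOWN_WORDS, n=1, cutoff=cutoff)
    return matches[0] if matches else token


def modernise(token):
    """Map a historical spelling to its modern form, if the lexicon lists one."""
    return HISTORICAL_LEXICON.get(token.lower(), token)


if __name__ == "__main__":
    # "mufick" imitates a long s misread as "f" by an OCR engine.
    for raw in ["mufick", "publick", "xyzzy"]:
        fixed = correct_token(raw)
        print(raw, fixed, modernise(fixed))
```

Keeping correction (repairing an OCR misreading) separate from modernisation (linking a historical spelling to a modern form) means the original spelling can be preserved for display while the modern form is used for search.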
  • 19. IMPACT tools and services (3)
        Strategic tools and services
          – Website (http://www.impact-project.eu/)
          – Decision support tools, to support the initiation, organisation, management of mass-digitisation projects
          – A set of learning resources providing guidance on the digitisation of texts and the implementation of project tools
          – Training and support
                Helpdesk
                Training programme (events)
          – Demonstration of the tools (case studies)
          – IMPACT Centre of Competence
  • 21. Thank you for your attention! Any questions?
        Additional information:
          – The IMPACT project:
                Website: http://www.impact-project.eu/
                Project office: impact@kb.nl
          – Europeana: http://www.europeana.eu/portal/
          – Workshop Materials: http://www.ukoln.ac.uk/events/ocr-2009/