Europeana Newspapers -
Turkish Information Day
WP2 - Refinement
Ankara, 3 May 2013
Clemens Neudecker (@cneudecker)
This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the
Competitiveness and Innovation Framework Programme by the European Community
http://ec.europa.eu/ict_psp
Overview
• Objectives & Challenges
• Introduction to Refinement Dataset
• Overview of Refinement Workflow & Tools
• Refinement with OCR
• Refinement with OLR
• Refinement with NER
• Short summary
• Questions & Answers
Objectives & Challenges
Objectives
- Analysis of available digital newspaper collections at project partners and
selection of subsets suitable for refinement
- Definition of requirements and minimum quality of digitized newspapers for
refinement and advanced services in Europeana
- Coordinate timely processing of 10 million newspaper pages provided by
libraries with several refinement technologies
- Provide recommendations on best practices for refinement of digitized
newspaper collections for full-text ingest to Europeana
Challenges
• Processing quality vs. speed/throughput
• Volume of data requires focus on simple & strictly
followed workflows with checkpoints on progress
• Large number of partners supplying content with
different digitisation & access policies
• Large variety of content in terms of file formats,
fonts, languages
Refinement Dataset
Initial dataset
Master List
https://sp.uibk.ac.at/sites/eu-news/Refinement/Lists/MasterList/AllItems.aspx
Europeana Newspapers Dataset (1)
Europeana Newspapers Dataset (2)
Europeana Newspapers Dataset (3)
Europeana Newspapers Dataset (4)
Workflow & Tools
Refinement Workflow steps
Tools (BCT.1)
• BCT = Binarisation and Colour Reduction Tool
• Produced by UIBK as a Windows EXE-Tool with GUI
• Purpose: Convert grey/colour scans to bitonal using the adaptive
binarisation method of Gatos, Pratikakis and Perantonis (GPP)
• Background: Need to reduce total file size of master images
to guarantee feasibility and timing of data transfers
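The slides name the GPP method but do not describe its steps. As a rough illustration of what converting a grey scan to bitonal involves, the sketch below uses Otsu's global threshold, a much simpler technique swapped in here for GPP, and then compares raw (uncompressed) storage at 8 bits versus 1 bit per pixel. The function names and the list-of-rows image representation are assumptions made for the example and are not part of BCT.

```python
from typing import List, Tuple


def otsu_threshold(pixels: List[int]) -> int:
    """Return the grey level (0-255) that maximises between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # number of pixels at or below the candidate threshold
    sum_bg = 0  # sum of grey levels of those pixels
    for t in range(256):
        w_bg += hist[t]
        if w_bg == 0:
            continue
        w_fg = total - w_bg
        if w_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarise(image: List[List[int]]) -> List[List[int]]:
    """Map each grey pixel to 1 (ink, dark) or 0 (paper, light)."""
    threshold = otsu_threshold([p for row in image for p in row])
    return [[1 if p <= threshold else 0 for p in row] for row in image]


def raw_sizes(width: int, height: int) -> Tuple[int, int]:
    """Uncompressed size in bytes: 8-bit grey versus 1-bit bitonal."""
    grey = width * height
    bitonal = height * ((width + 7) // 8)  # each row padded to whole bytes
    return grey, bitonal
```

Before any compression, a bitonal image needs roughly one eighth of the storage of an 8-bit grey image and one twenty-fourth of a 24-bit colour image, which is the kind of saving that matters when master images have to be shipped between partners.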
Tools (BCT.2)
Tools (BCT.3)
• Internally wraps the GraphicsMagick tool to create
lower-resolution images for viewing in the content browser
• Integration of Kakadu for JPEG 2000 support is under discussion
• With the GPP method, next to no decrease in OCR accuracy was
observed when using bitonal rather than grey/colour images for
OCR (in one small test, accuracy even rose from 72% to 83%)
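The slides do not say how the accuracy figures were measured. One common convention, assumed here purely for illustration, is character accuracy: one minus the edit distance between the OCR output and a ground-truth transcription, divided by the length of the ground truth. The function names below are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_accuracy(ground_truth: str, ocr_text: str) -> float:
    """Share of ground-truth characters recognised correctly (0.0 to 1.0)."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    errors = edit_distance(ground_truth, ocr_text)
    return max(0.0, 1.0 - errors / len(ground_truth))
```

Running the same pages through OCR once from the grey/colour master and once from the bitonal derivative, and comparing both results against the same ground truth, would give a like-for-like comparison of the kind reported above.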
Tools (FRT.1)
• FRT = File Rename Tool
• Produced by UIBK as a Windows EXE-Tool with GUI
• Purpose: Support content holders in preparing their data in
the correct structure required for large-scale processing by
refinement partners
Tools (FRT.2)
Tools (FRT.3)
• Simplifies batch renaming of files and folders according to
the project delivery specification
• Visual checks in the tool interface help spot issues
that still have to be corrected
• Highlights possible errors and conflicts to the user
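The idea of planning a batch rename and flagging conflicts before touching any file can be sketched as follows. This is not FRT's actual logic: the target pattern (`<prefix>_<4-digit page number>`), the list of allowed extensions and the function names are all hypothetical stand-ins for the project delivery specification, which the slides do not reproduce.

```python
import os
import re
from typing import List, Tuple

# Hypothetical: image types accepted by the delivery specification.
ALLOWED_EXTENSIONS = {".tif", ".tiff", ".jp2", ".jpg"}


def natural_key(name: str) -> list:
    """Sort key that orders 'scan2' before 'scan10'."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]


def plan_renames(filenames: List[str], prefix: str
                 ) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Return (old, new) pairs plus a list of problems for the user to review."""
    problems: List[str] = []
    pages: List[Tuple[str, str]] = []
    for name in sorted(filenames, key=natural_key):
        ext = os.path.splitext(name)[1].lower()
        if ext in ALLOWED_EXTENSIONS:
            pages.append((name, ext))
        else:
            problems.append(f"unexpected file type: {name}")

    existing = set(filenames)
    plan: List[Tuple[str, str]] = []
    for seq, (name, ext) in enumerate(pages, start=1):
        target = f"{prefix}_{seq:04d}{ext}"
        # A target that is already taken by another file would be overwritten.
        if target != name and target in existing:
            problems.append(f"target already exists: {name} -> {target}")
        plan.append((name, target))
    return plan, problems
```

Nothing is renamed by this sketch; it only produces a plan and a problem list, which mirrors the slide's point that issues are shown to the user for correction first.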
Tools (FAT.1)
• FAT = File Analyzer Tool
• Produced by UIBK as a Windows EXE-Tool with GUI
• Purpose: Final quality check of data preparation
• FAT analyses the final data (images & metadata) prepared
by content holder for refinement and checks whether all
necessary data preparation steps have been successfully
completed
Tools (FAT.2)
Tools (FAT.3)
• Verifies metadata against data available in Master List
• Verifies file & folder structure against project specification
• Produces log and XML information about the data and
provenance about the processing
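A minimal sketch of the three checks listed above, under stated assumptions: the required file name `metadata.xml`, the image extensions, the report element names and the in-memory folder listing are all hypothetical, since the slides do not give the project specification or the FAT output schema.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Set

REQUIRED_FILES = {"metadata.xml"}    # hypothetical per-folder requirement
IMAGE_EXTENSIONS = (".tif", ".jp2")  # hypothetical accepted image types


def check_delivery(folders: Dict[str, List[str]],
                   master_list: Set[str]) -> ET.Element:
    """Check each delivered folder and return an XML report element."""
    report = ET.Element("report")
    for folder in sorted(folders):
        files = folders[folder]
        issues: List[str] = []
        # Metadata check: the title must be registered in the Master List.
        if folder not in master_list:
            issues.append("title not found in master list")
        # Structure checks: required files and at least one page image.
        for required in sorted(REQUIRED_FILES - set(files)):
            issues.append(f"missing {required}")
        if not any(f.lower().endswith(IMAGE_EXTENSIONS) for f in files):
            issues.append("no image files")
        entry = ET.SubElement(report, "folder", name=folder,
                              status="ok" if not issues else "error")
        for text in issues:
            ET.SubElement(entry, "issue").text = text
    return report


if __name__ == "__main__":
    delivery = {"title_a": ["metadata.xml", "0001.tif"],
                "title_b": ["0001.tif"]}
    result = check_delivery(delivery, master_list={"title_a"})
    print(ET.tostring(result, encoding="unicode"))
```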
Refinement: OCR
• OCR = Optical Character Recognition
• Executing organisation: University of Innsbruck (UIBK)
• Number of pages to be refined: 8 million
• Technologies: ABBYY FineReader SDK
• State-of-the-art OCR software, fully supports Fraktur/Arabic/Cyrillic fonts
OCR processing at UIBK
Refinement: OLR
• OLR = Optical Layout Recognition
• Executing organisation: Content Conversion Specialists (CCS)
• Number of pages to be refined: 2 million
• Technologies: docWorks
• Columns, articles, headlines, page classification
OLR processing at CCS
Three ways are offered to libraries for doing the OLR process with CCS:
1. Fully on-site at the library (requires local installation of docWorks)
2. Conversion off-shore, QA at the library via internet connection
3. Conversion off-shore, QA at the library via backup shipment
Refinement: NER
• NER = Named Entity Recognition
• Executing organisation: Koninklijke Bibliotheek
• Number of pages to be refined: > 2 million
• Technologies: Stanford CRF-NER
• Languages: German, Dutch, English, (French)
• Open source available: https://github.com/KBNLresearch/europeananp-ner
• Named entities: Person, Location, Organization
NER processing at KB
1. UIBK/CCS complete refinement with OCR/OLR
2. Data (OCR, images, metadata) are sent on hard disk to Europeana/TEL and KB in the ENMAP package format
3. The KB NER-Tool extracts references to the OCR files from the ENMAP package
4. The OCR files (ALTO) are processed with the Stanford CRF-NER algorithm
5. Detected named entities can be exported in a variety of output formats
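Steps 4 and 5 can be sketched as reading the words and their coordinates from an ALTO file, tagging them, and carrying the coordinates along with each detected entity. The ALTO `String` attributes (`CONTENT`, `HPOS`, `VPOS`, `WIDTH`, `HEIGHT`) are part of the ALTO schema; everything else is an assumption for illustration. In particular, the dictionary lookup below is a hypothetical stand-in for the Stanford CRF-NER classifier, the sample document is hand-written and abbreviated, and the function names are not those of the KB NER-Tool.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Hand-written, abbreviated ALTO fragment used only for this example.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock><TextLine>
    <String CONTENT="Ankara" HPOS="100" VPOS="200" WIDTH="80" HEIGHT="20"/>
    <SP/>
    <String CONTENT="1913" HPOS="190" VPOS="200" WIDTH="50" HEIGHT="20"/>
  </TextLine></TextBlock></PrintSpace></Page></Layout>
</alto>"""

# Hypothetical stand-in for a trained NER model.
GAZETTEER = {"Ankara": "LOCATION"}


def local_name(tag: str) -> str:
    """Strip the XML namespace so any ALTO version can be matched."""
    return tag.rsplit("}", 1)[-1]


def read_alto_words(alto_xml: str) -> List[Dict[str, object]]:
    """Collect every String element with its text and page coordinates."""
    root = ET.fromstring(alto_xml)
    words: List[Dict[str, object]] = []
    for el in root.iter():
        if local_name(el.tag) == "String":
            words.append({
                "text": el.get("CONTENT", ""),
                "hpos": float(el.get("HPOS", "0")),
                "vpos": float(el.get("VPOS", "0")),
                "width": float(el.get("WIDTH", "0")),
                "height": float(el.get("HEIGHT", "0")),
            })
    return words


def tag_words(words: List[Dict[str, object]],
              gazetteer: Dict[str, str]) -> List[Dict[str, object]]:
    """Return the entity words, each keeping its original coordinates."""
    return [{**word, "label": gazetteer[word["text"]]}
            for word in words if word["text"] in gazetteer]


if __name__ == "__main__":
    for entity in tag_words(read_alto_words(SAMPLE_ALTO), GAZETTEER):
        print(entity)
```

Because each entity is built from the word record rather than from plain text, the bounding box survives tagging and can be written to whichever output format is chosen.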
Issues encountered
1. Issue: Amount of data to be transferred from libraries to refinement partners
→ What will be done to address the problem?
• Reduction of file size by applying optimised GPP binarisation
2. Issue: Storage format for named entities needs to preserve coordinates,
but ALTO-XML cannot store semantic information
→ What will be done to address the problem?
• Several alternative storage formats have been implemented
• NER-Tool ensures the word coordinates are retained after processing
3. Issue: Ottoman language/script currently not supported in OCR software
→ What will be done to address the problem?
• Select only newspapers in Latin alphabet for refinement of NLT content
Thank you for your attention!
Questions?
clemens.neudecker@kb.nl

International Business Environments and Operations 16th Global Edition test b...ssuserf63bd7
 
Digital Transformation in the PLM domain - distrib.pdf
Digital Transformation in the PLM domain - distrib.pdfDigital Transformation in the PLM domain - distrib.pdf
Digital Transformation in the PLM domain - distrib.pdfJos Voskuil
 
Pitch Deck Teardown: Geodesic.Life's $500k Pre-seed deck
Pitch Deck Teardown: Geodesic.Life's $500k Pre-seed deckPitch Deck Teardown: Geodesic.Life's $500k Pre-seed deck
Pitch Deck Teardown: Geodesic.Life's $500k Pre-seed deckHajeJanKamps
 
Market Sizes Sample Report - 2024 Edition
Market Sizes Sample Report - 2024 EditionMarket Sizes Sample Report - 2024 Edition
Market Sizes Sample Report - 2024 EditionMintel Group
 
Ms Motilal Padampat Sugar Mills vs. State of Uttar Pradesh & Ors. - A Milesto...
Ms Motilal Padampat Sugar Mills vs. State of Uttar Pradesh & Ors. - A Milesto...Ms Motilal Padampat Sugar Mills vs. State of Uttar Pradesh & Ors. - A Milesto...
Ms Motilal Padampat Sugar Mills vs. State of Uttar Pradesh & Ors. - A Milesto...ShrutiBose4
 
8447779800, Low rate Call girls in Rohini Delhi NCR
8447779800, Low rate Call girls in Rohini Delhi NCR8447779800, Low rate Call girls in Rohini Delhi NCR
8447779800, Low rate Call girls in Rohini Delhi NCRashishs7044
 
Contemporary Economic Issues Facing the Filipino Entrepreneur (1).pptx
Contemporary Economic Issues Facing the Filipino Entrepreneur (1).pptxContemporary Economic Issues Facing the Filipino Entrepreneur (1).pptx
Contemporary Economic Issues Facing the Filipino Entrepreneur (1).pptxMarkAnthonyAurellano
 
8447779800, Low rate Call girls in Saket Delhi NCR
8447779800, Low rate Call girls in Saket Delhi NCR8447779800, Low rate Call girls in Saket Delhi NCR
8447779800, Low rate Call girls in Saket Delhi NCRashishs7044
 
Ten Organizational Design Models to align structure and operations to busines...
Ten Organizational Design Models to align structure and operations to busines...Ten Organizational Design Models to align structure and operations to busines...
Ten Organizational Design Models to align structure and operations to busines...Seta Wicaksana
 
Cybersecurity Awareness Training Presentation v2024.03
Cybersecurity Awareness Training Presentation v2024.03Cybersecurity Awareness Training Presentation v2024.03
Cybersecurity Awareness Training Presentation v2024.03DallasHaselhorst
 
Islamabad Escorts | Call 03070433345 | Escort Service in Islamabad
Islamabad Escorts | Call 03070433345 | Escort Service in IslamabadIslamabad Escorts | Call 03070433345 | Escort Service in Islamabad
Islamabad Escorts | Call 03070433345 | Escort Service in IslamabadAyesha Khan
 
MAHA Global and IPR: Do Actions Speak Louder Than Words?
MAHA Global and IPR: Do Actions Speak Louder Than Words?MAHA Global and IPR: Do Actions Speak Louder Than Words?
MAHA Global and IPR: Do Actions Speak Louder Than Words?Olivia Kresic
 
Buy gmail accounts.pdf Buy Old Gmail Accounts
Buy gmail accounts.pdf Buy Old Gmail AccountsBuy gmail accounts.pdf Buy Old Gmail Accounts
Buy gmail accounts.pdf Buy Old Gmail AccountsBuy Verified Accounts
 
8447779800, Low rate Call girls in Kotla Mubarakpur Delhi NCR
8447779800, Low rate Call girls in Kotla Mubarakpur Delhi NCR8447779800, Low rate Call girls in Kotla Mubarakpur Delhi NCR
8447779800, Low rate Call girls in Kotla Mubarakpur Delhi NCRashishs7044
 
(Best) ENJOY Call Girls in Faridabad Ex | 8377087607
(Best) ENJOY Call Girls in Faridabad Ex | 8377087607(Best) ENJOY Call Girls in Faridabad Ex | 8377087607
(Best) ENJOY Call Girls in Faridabad Ex | 8377087607dollysharma2066
 
Kenya’s Coconut Value Chain by Gatsby Africa
Kenya’s Coconut Value Chain by Gatsby AfricaKenya’s Coconut Value Chain by Gatsby Africa
Kenya’s Coconut Value Chain by Gatsby Africaictsugar
 
APRIL2024_UKRAINE_xml_0000000000000 .pdf
APRIL2024_UKRAINE_xml_0000000000000 .pdfAPRIL2024_UKRAINE_xml_0000000000000 .pdf
APRIL2024_UKRAINE_xml_0000000000000 .pdfRbc Rbcua
 
Organizational Structure Running A Successful Business
Organizational Structure Running A Successful BusinessOrganizational Structure Running A Successful Business
Organizational Structure Running A Successful BusinessSeta Wicaksana
 

Dernier (20)

International Business Environments and Operations 16th Global Edition test b...
International Business Environments and Operations 16th Global Edition test b...International Business Environments and Operations 16th Global Edition test b...
International Business Environments and Operations 16th Global Edition test b...
 
Digital Transformation in the PLM domain - distrib.pdf
Digital Transformation in the PLM domain - distrib.pdfDigital Transformation in the PLM domain - distrib.pdf
Digital Transformation in the PLM domain - distrib.pdf
 
No-1 Call Girls In Goa 93193 VIP 73153 Escort service In North Goa Panaji, Ca...
No-1 Call Girls In Goa 93193 VIP 73153 Escort service In North Goa Panaji, Ca...No-1 Call Girls In Goa 93193 VIP 73153 Escort service In North Goa Panaji, Ca...
No-1 Call Girls In Goa 93193 VIP 73153 Escort service In North Goa Panaji, Ca...
 
Pitch Deck Teardown: Geodesic.Life's $500k Pre-seed deck
Pitch Deck Teardown: Geodesic.Life's $500k Pre-seed deckPitch Deck Teardown: Geodesic.Life's $500k Pre-seed deck
Pitch Deck Teardown: Geodesic.Life's $500k Pre-seed deck
 
Market Sizes Sample Report - 2024 Edition
Market Sizes Sample Report - 2024 EditionMarket Sizes Sample Report - 2024 Edition
Market Sizes Sample Report - 2024 Edition
 
Corporate Profile 47Billion Information Technology
Corporate Profile 47Billion Information TechnologyCorporate Profile 47Billion Information Technology
Corporate Profile 47Billion Information Technology
 
Ms Motilal Padampat Sugar Mills vs. State of Uttar Pradesh & Ors. - A Milesto...
Ms Motilal Padampat Sugar Mills vs. State of Uttar Pradesh & Ors. - A Milesto...Ms Motilal Padampat Sugar Mills vs. State of Uttar Pradesh & Ors. - A Milesto...
Ms Motilal Padampat Sugar Mills vs. State of Uttar Pradesh & Ors. - A Milesto...
 
8447779800, Low rate Call girls in Rohini Delhi NCR
8447779800, Low rate Call girls in Rohini Delhi NCR8447779800, Low rate Call girls in Rohini Delhi NCR
8447779800, Low rate Call girls in Rohini Delhi NCR
 
Contemporary Economic Issues Facing the Filipino Entrepreneur (1).pptx
Contemporary Economic Issues Facing the Filipino Entrepreneur (1).pptxContemporary Economic Issues Facing the Filipino Entrepreneur (1).pptx
Contemporary Economic Issues Facing the Filipino Entrepreneur (1).pptx
 
8447779800, Low rate Call girls in Saket Delhi NCR
8447779800, Low rate Call girls in Saket Delhi NCR8447779800, Low rate Call girls in Saket Delhi NCR
8447779800, Low rate Call girls in Saket Delhi NCR
 
Ten Organizational Design Models to align structure and operations to busines...
Ten Organizational Design Models to align structure and operations to busines...Ten Organizational Design Models to align structure and operations to busines...
Ten Organizational Design Models to align structure and operations to busines...
 
Cybersecurity Awareness Training Presentation v2024.03
Cybersecurity Awareness Training Presentation v2024.03Cybersecurity Awareness Training Presentation v2024.03
Cybersecurity Awareness Training Presentation v2024.03
 
Islamabad Escorts | Call 03070433345 | Escort Service in Islamabad
Islamabad Escorts | Call 03070433345 | Escort Service in IslamabadIslamabad Escorts | Call 03070433345 | Escort Service in Islamabad
Islamabad Escorts | Call 03070433345 | Escort Service in Islamabad
 
MAHA Global and IPR: Do Actions Speak Louder Than Words?
MAHA Global and IPR: Do Actions Speak Louder Than Words?MAHA Global and IPR: Do Actions Speak Louder Than Words?
MAHA Global and IPR: Do Actions Speak Louder Than Words?
 
Buy gmail accounts.pdf Buy Old Gmail Accounts
Buy gmail accounts.pdf Buy Old Gmail AccountsBuy gmail accounts.pdf Buy Old Gmail Accounts
Buy gmail accounts.pdf Buy Old Gmail Accounts
 
8447779800, Low rate Call girls in Kotla Mubarakpur Delhi NCR
8447779800, Low rate Call girls in Kotla Mubarakpur Delhi NCR8447779800, Low rate Call girls in Kotla Mubarakpur Delhi NCR
8447779800, Low rate Call girls in Kotla Mubarakpur Delhi NCR
 
(Best) ENJOY Call Girls in Faridabad Ex | 8377087607
(Best) ENJOY Call Girls in Faridabad Ex | 8377087607(Best) ENJOY Call Girls in Faridabad Ex | 8377087607
(Best) ENJOY Call Girls in Faridabad Ex | 8377087607
 
Kenya’s Coconut Value Chain by Gatsby Africa
Kenya’s Coconut Value Chain by Gatsby AfricaKenya’s Coconut Value Chain by Gatsby Africa
Kenya’s Coconut Value Chain by Gatsby Africa
 
APRIL2024_UKRAINE_xml_0000000000000 .pdf
APRIL2024_UKRAINE_xml_0000000000000 .pdfAPRIL2024_UKRAINE_xml_0000000000000 .pdf
APRIL2024_UKRAINE_xml_0000000000000 .pdf
 
Organizational Structure Running A Successful Business
Organizational Structure Running A Successful BusinessOrganizational Structure Running A Successful Business
Organizational Structure Running A Successful Business
 

Turkish Information Day Refinement Overview

  • 1. Europeana Newspapers - Turkish Information Day: WP2 - Refinement. Ankara, 3 May 2013. Clemens Neudecker (@cneudecker)
  • 2. Overview
      • Objectives & Challenges
      • Introduction to Refinement Dataset
      • Overview of Refinement Workflow & Tools
      • Refinement with OCR
      • Refinement with OLR
      • Refinement with NER
      • Short summary
      • Questions & Answers
    This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the Competitiveness and Innovation Framework Programme by the European Community. http://ec.europa.eu/ict_psp
  • 3. Objectives & Challenges
  • 4. Objectives
      • Analysis of available digital newspaper collections at project partners and selection of subsets suitable for refinement
      • Definition of requirements and minimum quality of digitized newspapers for refinement and advanced services in Europeana
      • Coordinate timely processing of 10 million newspaper pages provided by libraries with several refinement technologies
      • Provide recommendations on best practices for refinement of digitized newspaper collections for full-text ingest to Europeana
  • 5. Challenges
      • Processing quality vs. speed/throughput
      • Volume of data requires focus on simple & strictly followed workflows with checkpoints on progress
      • Large number of partners supplying content with different digitisation & access policies
      • Large variety of content in terms of file formats, fonts, languages
  • 6. Refinement Dataset
  • 7. Initial dataset
  • 8. Master List: https://sp.uibk.ac.at/sites/eu-news/Refinement/Lists/MasterList/AllItems.aspx
  • 9. Europeana Newspapers Dataset (1)
  • 10. Europeana Newspapers Dataset (2)
  • 11. Europeana Newspapers Dataset (3)
  • 12. Europeana Newspapers Dataset (4)
  • 13. Workflow & Tools
  • 14. Refinement Workflow steps
  • 15. Tools (BCT.1)
      • BCT = Binarisation and Colour Reduction Tool
      • Produced by UIBK as a Windows EXE tool with GUI
      • Purpose: Convert grey/colour scans to bitonal using a special method from Gatos/Pratikakis/Perantonis (GPP)
      • Background: Need to reduce total file size of master images to guarantee feasibility and timing of data transfers
  • 16. Tools (BCT.2)
  • 17. Tools (BCT.3)
      • Internally wraps the GraphicsMagick tool to create lower-resolution images for viewing in the content browser
      • Integration of Kakadu for JPEG 2000 support being discussed
      • Using the GPP method, next to no decrease in OCR accuracy was observed when using bitonal images for OCR rather than grey/colour (in a small test, accuracy even went up from 72% to 83%)
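The grey-to-bitonal conversion described for BCT can be illustrated with a much simpler stand-in. The sketch below is not the GPP method (which is adaptive and not specified in these slides); it uses a single global Otsu threshold purely to show the basic idea of reducing 8-bit grey values to one bit per pixel. The function names and the sample pixel values are assumptions made for illustration.

```python
def otsu_threshold(pixels):
    """Return the grey level that best separates dark from light pixels.

    This is plain global Otsu thresholding, used here as a simple stand-in;
    it is NOT the GPP binarisation used by the BCT.
    """
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    weight_dark, sum_dark = 0, 0
    for level in range(256):
        weight_dark += histogram[level]
        sum_dark += level * histogram[level]
        if weight_dark == 0:
            continue  # no pixels at or below this level yet
        weight_light = total - weight_dark
        if weight_light == 0:
            break  # every pixel is already in the dark class
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        # Between-class variance (up to a constant factor).
        variance = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, level
    return best_threshold


def binarise(pixels, threshold):
    """Map each 8-bit grey value to 0 (ink) or 1 (paper)."""
    return [0 if value <= threshold else 1 for value in pixels]


# Made-up sample: three dark "ink" pixels and three light "paper" pixels.
sample = [10, 12, 11, 200, 210, 205]
threshold = otsu_threshold(sample)
bitonal = binarise(sample, threshold)
```

Storing one bit instead of eight (grey) or twenty-four (colour) per pixel is what yields the file size reduction the slides refer to, before any compression is applied.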
  • 18. Tools (FRT.1)
      • FRT = File Rename Tool
      • Produced by UIBK as a Windows EXE tool with GUI
      • Purpose: Support content holders in preparing their data in the correct structure required for large-scale processing by refinement partners
  • 19. Tools (FRT.2)
  • 20. Tools (FRT.3)
      • Simplifies batch renaming of files and folders according to the project delivery specification
      • Visual checks in the tool interface help spot issues that still have to be corrected
      • Highlights possible errors and conflicts to the user
  • 21. Tools (FAT.1)
      • FAT = File Analyzer Tool
      • Produced by UIBK as a Windows EXE tool with GUI
      • Purpose: Final quality check of data preparation
      • FAT analyses the final data (images & metadata) prepared by the content holder for refinement and checks whether all necessary data preparation steps have been successfully completed
  • 22. Tools (FAT.2)
  • 23. Tools (FAT.3)
      • Verifies metadata against data available in Master List
      • Verifies file & folder structure against project specification
      • Produces log and XML information about the data and provenance about the processing
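A structure check of the kind FAT performs might look roughly like the sketch below. The slides do not give the actual project delivery specification, so the folder layout used here (`<newspaper_id>/<YYYY-MM-DD>/<4-digit page>.tif`), the pattern, and the function names are all hypothetical and chosen only to show the principle of validating paths against a declared layout and producing a small report.

```python
import re

# Hypothetical layout: <newspaper_id>/<YYYY-MM-DD>/<4-digit page>.tif
# The real delivery specification is not stated in the slides.
ASSUMED_PATTERN = re.compile(r"^[A-Za-z0-9_]+/\d{4}-\d{2}-\d{2}/\d{4}\.tif$")


def find_structure_violations(relative_paths, pattern=ASSUMED_PATTERN):
    """Return the paths that do not match the assumed delivery layout."""
    return [path for path in relative_paths if not pattern.match(path)]


def build_report(relative_paths):
    """Summarise the check as a small dictionary, loosely mimicking a log entry."""
    violations = find_structure_violations(relative_paths)
    return {
        "checked": len(relative_paths),
        "violations": violations,
        "passed": not violations,
    }


report = build_report([
    "paper_a/1950-01-02/0001.tif",   # matches the assumed layout
    "paper_a/1950-01-02/page2.tif",  # page number is not four digits
])
```

Taking a list of relative paths rather than walking the file system keeps the check easy to test; a real tool would collect the paths from the delivery folder first.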
  • 24. Refinement: OCR
  • 25. Refinement: OCR
      • OCR = Optical Character Recognition
      • Executing organisation: University of Innsbruck (UIBK)
      • Number of pages to be refined: 8 million
      • Technologies: ABBYY FineReader SDK
      • State-of-the-art OCR software, fully supports Fraktur/Arabic/Cyrillic fonts
  • 26. OCR processing at UIBK
  • 27. Refinement: OLR
  • 28. Refinement: OLR
      • OLR = Optical Layout Recognition
      • Executing organisation: Content Conversion Specialists (CCS)
      • Number of pages to be refined: 2 million
      • Technologies: docWorks
      • Columns, articles, headlines, page classification
  • 29. OLR processing at CCS
    CCS offers libraries three ways to carry out the OLR process:
      1. Fully on-site at the library (requires local installation of docWorks)
      2. Conversion off-shore, QA at the library via internet connection
      3. Conversion off-shore, QA at the library via backup shipment
  • 30. Refinement: NER
  • 31. Refinement: NER
      • NER = Named Entity Recognition
      • Executing organisation: Koninklijke Bibliotheek
      • Number of pages to be refined: > 2 million
      • Technologies: Stanford CRF-NER
      • Languages: German, Dutch, English, (French)
      • Open source available: https://github.com/KBNLresearch/europeananp-ner
      • Named entities: Person, Location, Organization
  • 32. NER processing at KB
      1. UIBK/CCS complete refinement with OCR/OLR
      2. Data (OCR, images, metadata) sent via hard disk to Europeana/TEL and KB in the ENMAP package format
      3. KB NER-Tool extracts references to the OCR files from the ENMAP package
      4. OCR files (ALTO) are processed with the Stanford CRF-NER algorithm
      5. Detected named entities can be exported in a variety of output formats
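Step 4 can be sketched as follows: read the words and their page coordinates from an ALTO file, tag them, and keep the coordinates attached to each detected entity. This is a minimal illustration, not the KB NER-Tool. The tiny lookup table stands in for the trained Stanford CRF-NER model, and the sample ALTO fragment, function names, and dictionary keys are assumptions. Only the ALTO `String` attributes (`CONTENT`, `HPOS`, `VPOS`, `WIDTH`, `HEIGHT`) come from the ALTO format itself.

```python
import xml.etree.ElementTree as ET

# Made-up ALTO fragment for illustration only.
ALTO_SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock><TextLine>
    <String CONTENT="Clemens" HPOS="100" VPOS="200" WIDTH="80" HEIGHT="20"/>
    <String CONTENT="visited" HPOS="190" VPOS="200" WIDTH="70" HEIGHT="20"/>
    <String CONTENT="Ankara" HPOS="270" VPOS="200" WIDTH="75" HEIGHT="20"/>
  </TextLine></TextBlock></PrintSpace></Page></Layout>
</alto>"""

# Hypothetical stand-in for a trained CRF model: a tiny lookup table.
TOY_GAZETTEER = {"Clemens": "Person", "Ankara": "Location"}


def words_with_coordinates(alto_xml):
    """Extract every ALTO String element together with its bounding box."""
    root = ET.fromstring(alto_xml)
    words = []
    for element in root.iter():
        # Strip the namespace so the sketch works across ALTO versions.
        if element.tag.rsplit("}", 1)[-1] == "String":
            words.append({
                "text": element.get("CONTENT"),
                "hpos": float(element.get("HPOS")),
                "vpos": float(element.get("VPOS")),
                "width": float(element.get("WIDTH")),
                "height": float(element.get("HEIGHT")),
            })
    return words


def tag_entities(words):
    """Attach an entity label to known words, keeping their coordinates."""
    return [
        dict(word, entity=TOY_GAZETTEER[word["text"]])
        for word in words
        if word["text"] in TOY_GAZETTEER
    ]


entities = tag_entities(words_with_coordinates(ALTO_SAMPLE))
```

Because each entity record carries its own bounding box, it can be written to any output format without losing the link back to the position on the page image.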
  • 33. Issues encountered
      1. Issue: Amount of data to be transferred from libraries to refinement partners
         What will be done to address the problem? Reduction of file size by applying optimized GPP binarization
      2. Issue: Storage format for named entities needs to preserve coordinates, but ALTO-XML cannot store semantic information
         What will be done to address the problem? Several alternative storage formats have been implemented. The NER-Tool ensures the word coordinates are retained after processing
      3. Issue: Ottoman language/script currently not supported in OCR software
         What will be done to address the problem? Select only newspapers in Latin alphabet for refinement of NLT content
  • 34. Thank you for your attention! Questions? clemens.neudecker@kb.nl