SlideShare une entreprise Scribd logo
1  sur  20
Europeana Newspapers Workshop:
Refinement
WP2 – Introduction to Refinement
Munich, 26 June 2013
Clemens Neudecker (@cneudecker)
This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the
Competitiveness and Innovation Framework Programme by the European Community
http://ec.europa.eu/ict_psp
Overview
• Objectives & Challenges
• Overview of Refinement Dataset
• Introduction to Refinement: Workflow & Technologies
• Questions & Answers
2
This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the
Competitiveness and Innovation Framework Programme by the European Community
http://ec.europa.eu/ict_psp
Objectives
- Analysis of available digital newspaper collections of project partners
and identification of subsets suitable for refinement
- Definition of requirements and minimum quality of digitized newspapers
for refinement to enable advanced services in Europeana
- Coordination of the scalable processing of 10 million digitised newspaper
pages with several refinement technologies
- Providing recommendations on best practices for the refinement of
digitised newspaper collections with full-text (and ingest to Europeana)
This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the
Competitiveness and Innovation Framework Programme by the European Community
http://ec.europa.eu/ict_psp
Challenges
• Processing quality vs. speed/throughput
• Volume of data requires focus on simple &
standardised workflow with clear checkpoints
• Diverse partners supplying content with different
digitisation & access policies
• Large variety of content in terms of file formats,
fonts, languages, etc.
4
This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the
Competitiveness and Innovation Framework Programme by the European Community
http://ec.europa.eu/ict_psp
The data
This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the
Competitiveness and Innovation Framework Programme by the European Community
http://ec.europa.eu/ict_psp
Europeana Newspaper Dataset (1)
This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the
Competitiveness and Innovation Framework Programme by the European Community
http://ec.europa.eu/ict_psp
Europeana Newspaper Dataset (2)
This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the
Competitiveness and Innovation Framework Programme by the European Community
http://ec.europa.eu/ict_psp
Europeana Newspapers Dataset (3)
This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the
Competitiveness and Innovation Framework Programme by the European Community
http://ec.europa.eu/ict_psp
Europeana Newspapers Dataset (4)
This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the
Competitiveness and Innovation Framework Programme by the European Community
http://ec.europa.eu/ict_psp
Refinement Workflow steps
10
This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the
Competitiveness and Innovation Framework Programme by the European Community
http://ec.europa.eu/ict_psp
Tools (BCT)
• BCT = Binarisation and Colour Reduction Tool
• Purpose: Convert grey/colour
scans to bitonal using highly
optimised GPP method
• Background: Reduce total file
size of master images to
guarantee feasibility and
timing of data transfers
11
This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the
Competitiveness and Innovation Framework Programme by the European Community
http://ec.europa.eu/ict_psp
Tools (FRT)
• FRT = File Rename Tool
• Purpose: Support content
holders in preparing their
data in the correct format
• Background: Ensure folder
structure and file naming
requirements for automated
processing are met
12
This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the
Competitiveness and Innovation Framework Programme by the European Community
http://ec.europa.eu/ict_psp
Tools (FAT)
• FAT = File Analyzer Tool
• Purpose: Final quality check
of data before refinement
• Background: Ensure content
and refinement partners that
all preparation steps have
been executed successfully
13
This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the
Competitiveness and Innovation Framework Programme by the European Community
http://ec.europa.eu/ict_psp
Refinement: OCR@UIBK
• OCR = Optical Character Recognition
• Number of pages to be refined: 8 million
• Technologies: ABBYY FineReader SDK
• State-of-the-art OCR software, fully supports Fraktur/Latin/Cyrillic fonts
• Result: METS/ALTO package containing images, metadata & full text
14
This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the
Competitiveness and Innovation Framework Programme by the European Community
http://ec.europa.eu/ict_psp
OCR  Full text search
15
http://www.europeana-newspapers.eu/building-a-content-browser-for-digital-newspapers/
This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the
Competitiveness and Innovation Framework Programme by the European Community
http://ec.europa.eu/ict_psp
Refinement: OLR@CCS
• OLR = Optical Layout Recognition
• Number of pages to be refined: 2 million
• Technologies: docWorks
• Separation of columns, articles, headlines, page classes
• Result: METS/ALTO package containing images, metadata & full text
16
This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the
Competitiveness and Innovation Framework Programme by the European Community
http://ec.europa.eu/ict_psp
OLR  Article separation
17
This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the
Competitiveness and Innovation Framework Programme by the European Community
http://ec.europa.eu/ict_psp
Refinement: NER@KB
• NER = Named Entities Recognition
• Number of pages to be refined: 2 million
• Technologies: Stanford CRF-NER
• Languages supported: German, Dutch, English (+ French, Latvian)
• Open source: https://github.com/KBNLresearch/europeananp-ner
• Detection of Named entities: Person, Location, Organization
• Feedback cycle with manual training step  better results
18
This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the
Competitiveness and Innovation Framework Programme by the European Community
http://ec.europa.eu/ict_psp
NER  Browse by names or places
19
Thank you for your attention!
clemens.neudecker@kb.nl

Contenu connexe

Tendances

ENP Belgrade Workshop Project Overview
ENP Belgrade Workshop Project OverviewENP Belgrade Workshop Project Overview
ENP Belgrade Workshop Project OverviewEuropeana Newspapers
 
Realising the value of Europe's newspaper heritage
Realising the value of Europe's newspaper heritage Realising the value of Europe's newspaper heritage
Realising the value of Europe's newspaper heritage Europeana Newspapers
 
Presentation of Hans-Jörg Lieder, BnF Information Day
Presentation of Hans-Jörg Lieder, BnF Information DayPresentation of Hans-Jörg Lieder, BnF Information Day
Presentation of Hans-Jörg Lieder, BnF Information DayEuropeana Newspapers
 
Europeana Newspapers Amsterdam workshop introduction
Europeana Newspapers Amsterdam workshop introductionEuropeana Newspapers Amsterdam workshop introduction
Europeana Newspapers Amsterdam workshop introductionEuropeana Newspapers
 
Europeana Newspapers - the Gateway to European Newspapers Online
Europeana Newspapers - the Gateway to European Newspapers OnlineEuropeana Newspapers - the Gateway to European Newspapers Online
Europeana Newspapers - the Gateway to European Newspapers Onlinecneudecker
 
The challenges of making Europe's newspapers available online
The challenges of making Europe's newspapers available onlineThe challenges of making Europe's newspapers available online
The challenges of making Europe's newspapers available onlineLIBER Europe
 
Presentation of Clemens Neudecker, BnF Information Day
Presentation of Clemens Neudecker, BnF Information DayPresentation of Clemens Neudecker, BnF Information Day
Presentation of Clemens Neudecker, BnF Information DayEuropeana Newspapers
 
Large scale refinement of digital historical newspapers with named entities r...
Large scale refinement of digital historical newspapers with named entities r...Large scale refinement of digital historical newspapers with named entities r...
Large scale refinement of digital historical newspapers with named entities r...cneudecker
 
Europeana_Newspapers_ONB_infoday_HJLieder
Europeana_Newspapers_ONB_infoday_HJLiederEuropeana_Newspapers_ONB_infoday_HJLieder
Europeana_Newspapers_ONB_infoday_HJLiederEuropeana Newspapers
 
Europeana Newspapers Estonian Infoday Krista Kiisa
Europeana Newspapers Estonian Infoday Krista KiisaEuropeana Newspapers Estonian Infoday Krista Kiisa
Europeana Newspapers Estonian Infoday Krista KiisaEuropeana Newspapers
 
04 europeana newspapers
04 europeana newspapers04 europeana newspapers
04 europeana newspapersEuropeana
 
Open Research Data Pilot in Horizon 2020
Open Research Data Pilot in Horizon 2020Open Research Data Pilot in Horizon 2020
Open Research Data Pilot in Horizon 2020Elena Simukovic
 
Europeana Newspapers Polish Information Day
Europeana Newspapers Polish Information DayEuropeana Newspapers Polish Information Day
Europeana Newspapers Polish Information DayEuropeana Newspapers
 
Performance Evaluation and Quality Assessment
Performance Evaluation and Quality AssessmentPerformance Evaluation and Quality Assessment
Performance Evaluation and Quality AssessmentEuropeana Newspapers
 
High-Performance Computing Research in Europe
High-Performance Computing Research in EuropeHigh-Performance Computing Research in Europe
High-Performance Computing Research in EuropeGovnet Events
 
Supporting transition towards the Paris Agreement using TIMES: insights from ...
Supporting transition towards the Paris Agreement using TIMES: insights from ...Supporting transition towards the Paris Agreement using TIMES: insights from ...
Supporting transition towards the Paris Agreement using TIMES: insights from ...IEA-ETSAP
 

Tendances (20)

ENP Belgrade Workshop Project Overview
ENP Belgrade Workshop Project OverviewENP Belgrade Workshop Project Overview
ENP Belgrade Workshop Project Overview
 
Realising the value of Europe's newspaper heritage
Realising the value of Europe's newspaper heritage Realising the value of Europe's newspaper heritage
Realising the value of Europe's newspaper heritage
 
ENP Belgrade WS Introduction
ENP Belgrade WS IntroductionENP Belgrade WS Introduction
ENP Belgrade WS Introduction
 
Presentation of Hans-Jörg Lieder, BnF Information Day
Presentation of Hans-Jörg Lieder, BnF Information DayPresentation of Hans-Jörg Lieder, BnF Information Day
Presentation of Hans-Jörg Lieder, BnF Information Day
 
Europeana Newspapers Amsterdam workshop introduction
Europeana Newspapers Amsterdam workshop introductionEuropeana Newspapers Amsterdam workshop introduction
Europeana Newspapers Amsterdam workshop introduction
 
ENP_Dutch_Infoday_LWilms
ENP_Dutch_Infoday_LWilmsENP_Dutch_Infoday_LWilms
ENP_Dutch_Infoday_LWilms
 
Europeana Newspapers - the Gateway to European Newspapers Online
Europeana Newspapers - the Gateway to European Newspapers OnlineEuropeana Newspapers - the Gateway to European Newspapers Online
Europeana Newspapers - the Gateway to European Newspapers Online
 
The challenges of making Europe's newspapers available online
The challenges of making Europe's newspapers available onlineThe challenges of making Europe's newspapers available online
The challenges of making Europe's newspapers available online
 
Presentation of Clemens Neudecker, BnF Information Day
Presentation of Clemens Neudecker, BnF Information DayPresentation of Clemens Neudecker, BnF Information Day
Presentation of Clemens Neudecker, BnF Information Day
 
Large scale refinement of digital historical newspapers with named entities r...
Large scale refinement of digital historical newspapers with named entities r...Large scale refinement of digital historical newspapers with named entities r...
Large scale refinement of digital historical newspapers with named entities r...
 
Europeana_Newspapers_ONB_infoday_HJLieder
Europeana_Newspapers_ONB_infoday_HJLiederEuropeana_Newspapers_ONB_infoday_HJLieder
Europeana_Newspapers_ONB_infoday_HJLieder
 
Europeana Newspapers Estonian Infoday Krista Kiisa
Europeana Newspapers Estonian Infoday Krista KiisaEuropeana Newspapers Estonian Infoday Krista Kiisa
Europeana Newspapers Estonian Infoday Krista Kiisa
 
EurnewsLDN_Clemens_Neudecker
EurnewsLDN_Clemens_NeudeckerEurnewsLDN_Clemens_Neudecker
EurnewsLDN_Clemens_Neudecker
 
04 europeana newspapers
04 europeana newspapers04 europeana newspapers
04 europeana newspapers
 
Open Research Data Pilot in Horizon 2020
Open Research Data Pilot in Horizon 2020Open Research Data Pilot in Horizon 2020
Open Research Data Pilot in Horizon 2020
 
Europeana Newspapers Polish Information Day
Europeana Newspapers Polish Information DayEuropeana Newspapers Polish Information Day
Europeana Newspapers Polish Information Day
 
Performance Evaluation and Quality Assessment
Performance Evaluation and Quality AssessmentPerformance Evaluation and Quality Assessment
Performance Evaluation and Quality Assessment
 
High-Performance Computing Research in Europe
High-Performance Computing Research in EuropeHigh-Performance Computing Research in Europe
High-Performance Computing Research in Europe
 
Supporting transition towards the Paris Agreement using TIMES: insights from ...
Supporting transition towards the Paris Agreement using TIMES: insights from ...Supporting transition towards the Paris Agreement using TIMES: insights from ...
Supporting transition towards the Paris Agreement using TIMES: insights from ...
 
20090921 Expp Presentation
20090921 Expp Presentation20090921 Expp Presentation
20090921 Expp Presentation
 

Similaire à Europeana Newspapers wp2 liber2013

Europeana Newspapers in a nutshell
Europeana Newspapers in a nutshellEuropeana Newspapers in a nutshell
Europeana Newspapers in a nutshellcneudecker
 
Europeana Newspapers LFT Infoday Muehlberger
Europeana Newspapers LFT Infoday MuehlbergerEuropeana Newspapers LFT Infoday Muehlberger
Europeana Newspapers LFT Infoday MuehlbergerEuropeana Newspapers
 
Europeana Newspapers LIBER2013 Workshop intro
Europeana Newspapers LIBER2013 Workshop introEuropeana Newspapers LIBER2013 Workshop intro
Europeana Newspapers LIBER2013 Workshop introEuropeana Newspapers
 
Europeana Newspaper metadata LIBER2013
Europeana Newspaper metadata LIBER2013Europeana Newspaper metadata LIBER2013
Europeana Newspaper metadata LIBER2013Europeana Newspapers
 
Seminario IoT - Internet of Things
Seminario IoT - Internet of ThingsSeminario IoT - Internet of Things
Seminario IoT - Internet of ThingsLuiz Oliveira
 
Acatech.pptx
Acatech.pptxAcatech.pptx
Acatech.pptxFIWARE
 
Presentation of H2020 ICT-32-2017 Startup Europe for Growth & Innovation Rada...
Presentation of H2020 ICT-32-2017 Startup Europe for Growth & Innovation Rada...Presentation of H2020 ICT-32-2017 Startup Europe for Growth & Innovation Rada...
Presentation of H2020 ICT-32-2017 Startup Europe for Growth & Innovation Rada...Nathalie Danse
 
UK FP7 National Contact Point ICT, Peter Walters, FP7UK National Contact Poin...
UK FP7 National Contact Point ICT, Peter Walters, FP7UK National Contact Poin...UK FP7 National Contact Point ICT, Peter Walters, FP7UK National Contact Poin...
UK FP7 National Contact Point ICT, Peter Walters, FP7UK National Contact Poin...Invest Northern Ireland
 
Darko Fercej: Central European Living Lab for Territorial Innovation - Open d...
Darko Fercej: Central European Living Lab for Territorial Innovation - Open d...Darko Fercej: Central European Living Lab for Territorial Innovation - Open d...
Darko Fercej: Central European Living Lab for Territorial Innovation - Open d...Apulian ICT Living Labs
 
[DSC Europe 22] BI Toolkit Powered Natural Language Processing - Sanda Martin...
[DSC Europe 22] BI Toolkit Powered Natural Language Processing - Sanda Martin...[DSC Europe 22] BI Toolkit Powered Natural Language Processing - Sanda Martin...
[DSC Europe 22] BI Toolkit Powered Natural Language Processing - Sanda Martin...DataScienceConferenc1
 

Similaire à Europeana Newspapers wp2 liber2013 (15)

Europeana Newspapers in a nutshell
Europeana Newspapers in a nutshellEuropeana Newspapers in a nutshell
Europeana Newspapers in a nutshell
 
Europeana Newspapers LFT Infoday Muehlberger
Europeana Newspapers LFT Infoday MuehlbergerEuropeana Newspapers LFT Infoday Muehlberger
Europeana Newspapers LFT Infoday Muehlberger
 
Europeana Newspapers LIBER2013 Workshop intro
Europeana Newspapers LIBER2013 Workshop introEuropeana Newspapers LIBER2013 Workshop intro
Europeana Newspapers LIBER2013 Workshop intro
 
Europeana Newspaper metadata LIBER2013
Europeana Newspaper metadata LIBER2013Europeana Newspaper metadata LIBER2013
Europeana Newspaper metadata LIBER2013
 
Seminario IoT - Internet of Things
Seminario IoT - Internet of ThingsSeminario IoT - Internet of Things
Seminario IoT - Internet of Things
 
Barbato leit ict 15-16-17
Barbato leit ict 15-16-17Barbato leit ict 15-16-17
Barbato leit ict 15-16-17
 
Acatech.pptx
Acatech.pptxAcatech.pptx
Acatech.pptx
 
Presentation of H2020 ICT-32-2017 Startup Europe for Growth & Innovation Rada...
Presentation of H2020 ICT-32-2017 Startup Europe for Growth & Innovation Rada...Presentation of H2020 ICT-32-2017 Startup Europe for Growth & Innovation Rada...
Presentation of H2020 ICT-32-2017 Startup Europe for Growth & Innovation Rada...
 
Fiona ollerenshaw
Fiona ollerenshawFiona ollerenshaw
Fiona ollerenshaw
 
UK FP7 National Contact Point ICT, Peter Walters, FP7UK National Contact Poin...
UK FP7 National Contact Point ICT, Peter Walters, FP7UK National Contact Poin...UK FP7 National Contact Point ICT, Peter Walters, FP7UK National Contact Poin...
UK FP7 National Contact Point ICT, Peter Walters, FP7UK National Contact Poin...
 
H2020 space wp 2018 2020 call 2019
H2020 space wp 2018 2020 call 2019H2020 space wp 2018 2020 call 2019
H2020 space wp 2018 2020 call 2019
 
Darko Fercej: Central European Living Lab for Territorial Innovation - Open d...
Darko Fercej: Central European Living Lab for Territorial Innovation - Open d...Darko Fercej: Central European Living Lab for Territorial Innovation - Open d...
Darko Fercej: Central European Living Lab for Territorial Innovation - Open d...
 
[DSC Europe 22] BI Toolkit Powered Natural Language Processing - Sanda Martin...
[DSC Europe 22] BI Toolkit Powered Natural Language Processing - Sanda Martin...[DSC Europe 22] BI Toolkit Powered Natural Language Processing - Sanda Martin...
[DSC Europe 22] BI Toolkit Powered Natural Language Processing - Sanda Martin...
 
Summit2013 john domingue - horizon2020
Summit2013   john domingue - horizon2020Summit2013   john domingue - horizon2020
Summit2013 john domingue - horizon2020
 
Future Internet PPP for Smart Cities Ana Garcia
Future Internet PPP for Smart Cities Ana GarciaFuture Internet PPP for Smart Cities Ana Garcia
Future Internet PPP for Smart Cities Ana Garcia
 

Plus de Europeana Newspapers

Presentation of Philippe Mezzasalma at the BnF Information Day in Paris
Presentation of Philippe Mezzasalma at the BnF Information Day in ParisPresentation of Philippe Mezzasalma at the BnF Information Day in Paris
Presentation of Philippe Mezzasalma at the BnF Information Day in ParisEuropeana Newspapers
 
Presentation of Ioannis Anagnostopoulos at BnF Information Day
Presentation of Ioannis Anagnostopoulos at BnF Information DayPresentation of Ioannis Anagnostopoulos at BnF Information Day
Presentation of Ioannis Anagnostopoulos at BnF Information DayEuropeana Newspapers
 
Présentation Günter Mühlberger, BnF Information Day
Présentation Günter Mühlberger, BnF Information DayPrésentation Günter Mühlberger, BnF Information Day
Présentation Günter Mühlberger, BnF Information DayEuropeana Newspapers
 
Presentation of Claus Gravenhorst, BnF Information Day
Presentation of Claus Gravenhorst, BnF Information DayPresentation of Claus Gravenhorst, BnF Information Day
Presentation of Claus Gravenhorst, BnF Information DayEuropeana Newspapers
 
Presentation of Alaa Abi Haidar at the BnF Information Day
Presentation of Alaa Abi Haidar at the BnF Information DayPresentation of Alaa Abi Haidar at the BnF Information Day
Presentation of Alaa Abi Haidar at the BnF Information DayEuropeana Newspapers
 
Europeana Newspapers Estonian Infoday Ragne Kouts
Europeana Newspapers Estonian Infoday Ragne KoutsEuropeana Newspapers Estonian Infoday Ragne Kouts
Europeana Newspapers Estonian Infoday Ragne KoutsEuropeana Newspapers
 
Europeana Newspapers Estonian Infoday Kristel Veimann
Europeana Newspapers Estonian Infoday Kristel VeimannEuropeana Newspapers Estonian Infoday Kristel Veimann
Europeana Newspapers Estonian Infoday Kristel VeimannEuropeana Newspapers
 
Europeana Newspapers Estonian Infoday Krista Aru
Europeana Newspapers Estonian Infoday Krista AruEuropeana Newspapers Estonian Infoday Krista Aru
Europeana Newspapers Estonian Infoday Krista AruEuropeana Newspapers
 
Europeana Newspapers Estonian Infoday Fred Puss
Europeana Newspapers Estonian Infoday Fred PussEuropeana Newspapers Estonian Infoday Fred Puss
Europeana Newspapers Estonian Infoday Fred PussEuropeana Newspapers
 
Europeana Newpapers LFT Infoday Neudecker
Europeana Newpapers LFT Infoday NeudeckerEuropeana Newpapers LFT Infoday Neudecker
Europeana Newpapers LFT Infoday NeudeckerEuropeana Newspapers
 
Europeana Newspapers LFT Infoday Thompson
Europeana Newspapers LFT Infoday ThompsonEuropeana Newspapers LFT Infoday Thompson
Europeana Newspapers LFT Infoday ThompsonEuropeana Newspapers
 
Europeana Newspapers LFT Infoday Rossi
Europeana Newspapers LFT Infoday RossiEuropeana Newspapers LFT Infoday Rossi
Europeana Newspapers LFT Infoday RossiEuropeana Newspapers
 
Europeana Newspapers LFT Infoday Messina
Europeana Newspapers LFT Infoday MessinaEuropeana Newspapers LFT Infoday Messina
Europeana Newspapers LFT Infoday MessinaEuropeana Newspapers
 
Europeana Newspapers Infoday Marchetti
Europeana Newspapers Infoday MarchettiEuropeana Newspapers Infoday Marchetti
Europeana Newspapers Infoday MarchettiEuropeana Newspapers
 
Europeana Newspapers LFT Infoday Kempf
Europeana Newspapers LFT Infoday KempfEuropeana Newspapers LFT Infoday Kempf
Europeana Newspapers LFT Infoday KempfEuropeana Newspapers
 
Europeana Newspapers LFT Infoday Genereux
Europeana Newspapers LFT Infoday GenereuxEuropeana Newspapers LFT Infoday Genereux
Europeana Newspapers LFT Infoday GenereuxEuropeana Newspapers
 
Europeana Newspapers LFT Infoday Bolioli
Europeana Newspapers LFT Infoday BolioliEuropeana Newspapers LFT Infoday Bolioli
Europeana Newspapers LFT Infoday BolioliEuropeana Newspapers
 

Plus de Europeana Newspapers (20)

Presentation of Philippe Mezzasalma at the BnF Information Day in Paris
Presentation of Philippe Mezzasalma at the BnF Information Day in ParisPresentation of Philippe Mezzasalma at the BnF Information Day in Paris
Presentation of Philippe Mezzasalma at the BnF Information Day in Paris
 
Presentation of Ioannis Anagnostopoulos at BnF Information Day
Presentation of Ioannis Anagnostopoulos at BnF Information DayPresentation of Ioannis Anagnostopoulos at BnF Information Day
Presentation of Ioannis Anagnostopoulos at BnF Information Day
 
Présentation Günter Mühlberger, BnF Information Day
Présentation Günter Mühlberger, BnF Information DayPrésentation Günter Mühlberger, BnF Information Day
Présentation Günter Mühlberger, BnF Information Day
 
Presentation of Claus Gravenhorst, BnF Information Day
Presentation of Claus Gravenhorst, BnF Information DayPresentation of Claus Gravenhorst, BnF Information Day
Presentation of Claus Gravenhorst, BnF Information Day
 
Presentation of Alaa Abi Haidar at the BnF Information Day
Presentation of Alaa Abi Haidar at the BnF Information DayPresentation of Alaa Abi Haidar at the BnF Information Day
Presentation of Alaa Abi Haidar at the BnF Information Day
 
Europeana Newspapers Estonian Infoday Ragne Kouts
Europeana Newspapers Estonian Infoday Ragne KoutsEuropeana Newspapers Estonian Infoday Ragne Kouts
Europeana Newspapers Estonian Infoday Ragne Kouts
 
Europeana Newspapers Estonian Infoday Kristel Veimann
Europeana Newspapers Estonian Infoday Kristel VeimannEuropeana Newspapers Estonian Infoday Kristel Veimann
Europeana Newspapers Estonian Infoday Kristel Veimann
 
Europeana Newspapers Estonian Infoday Krista Aru
Europeana Newspapers Estonian Infoday Krista AruEuropeana Newspapers Estonian Infoday Krista Aru
Europeana Newspapers Estonian Infoday Krista Aru
 
Europeana Newspapers Estonian Infoday Fred Puss
Europeana Newspapers Estonian Infoday Fred PussEuropeana Newspapers Estonian Infoday Fred Puss
Europeana Newspapers Estonian Infoday Fred Puss
 
Europeana Newpapers LFT Infoday Neudecker
Europeana Newpapers LFT Infoday NeudeckerEuropeana Newpapers LFT Infoday Neudecker
Europeana Newpapers LFT Infoday Neudecker
 
Europeana Newspapers LFT Infoday Thompson
Europeana Newspapers LFT Infoday ThompsonEuropeana Newspapers LFT Infoday Thompson
Europeana Newspapers LFT Infoday Thompson
 
Europeana Newspapers LFT Infoday Rossi
Europeana Newspapers LFT Infoday RossiEuropeana Newspapers LFT Infoday Rossi
Europeana Newspapers LFT Infoday Rossi
 
Enp lft infoday_neudecker
Enp lft infoday_neudeckerEnp lft infoday_neudecker
Enp lft infoday_neudecker
 
Europeana Newspapers LFT Infoday Messina
Europeana Newspapers LFT Infoday MessinaEuropeana Newspapers LFT Infoday Messina
Europeana Newspapers LFT Infoday Messina
 
Europeana Newspapers Infoday Marchetti
Europeana Newspapers Infoday MarchettiEuropeana Newspapers Infoday Marchetti
Europeana Newspapers Infoday Marchetti
 
Europeana Newspapers LFT Infoday Kempf
Europeana Newspapers LFT Infoday KempfEuropeana Newspapers LFT Infoday Kempf
Europeana Newspapers LFT Infoday Kempf
 
Europeana Newspapers LFT Infoday Genereux
Europeana Newspapers LFT Infoday GenereuxEuropeana Newspapers LFT Infoday Genereux
Europeana Newspapers LFT Infoday Genereux
 
Europeana Newspapers LFT Infoday Bolioli
Europeana Newspapers LFT Infoday BolioliEuropeana Newspapers LFT Infoday Bolioli
Europeana Newspapers LFT Infoday Bolioli
 
ENP_Dutch_Infoday_MWillems
ENP_Dutch_Infoday_MWillemsENP_Dutch_Infoday_MWillems
ENP_Dutch_Infoday_MWillems
 
ENP_Dutch_Infoday_PHuijnen
ENP_Dutch_Infoday_PHuijnen ENP_Dutch_Infoday_PHuijnen
ENP_Dutch_Infoday_PHuijnen
 

Dernier

SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 

Dernier (20)

SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 

Europeana Newspapers wp2 liber2013

  • 1. Europeana Newspapers Workshop: Refinement WP2 – Introduction to Refinement Munich, 26 June 2013 Clemens Neudecker (@cneudecker)
  • 2. This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the Competitiveness and Innovation Framework Programme by the European Community http://ec.europa.eu/ict_psp Overview • Objectives & Challenges • Overview of Refinement Dataset • Introduction to Refinement: Workflow & Technologies • Questions & Answers 2
  • 3. This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the Competitiveness and Innovation Framework Programme by the European Community http://ec.europa.eu/ict_psp Objectives - Analysis of available digital newspaper collections of project partners and identification of subsets suitable for refinement - Definition of requirements and minimum quality of digitized newspapers for refinement to enable advanced services in Europeana - Coordination of the scalable processing of 10 million digitised newspaper pages with several refinement technologies - Providing recommendations on best practices for the refinement of digitised newspaper collections with full-text (and ingest to Europeana)
  • 4. This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the Competitiveness and Innovation Framework Programme by the European Community http://ec.europa.eu/ict_psp Challenges • Processing quality vs. speed/throughput • Volume of data requires focus on simple & standardised workflow with clear checkpoints • Diverse partners supplying content with different digitisation & access policies • Large variety of content in terms of file formats, fonts, languages, etc. 4
  • 5. This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the Competitiveness and Innovation Framework Programme by the European Community http://ec.europa.eu/ict_psp The data
  • 6. This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the Competitiveness and Innovation Framework Programme by the European Community http://ec.europa.eu/ict_psp Europeana Newspaper Dataset (1)
  • 7. This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the Competitiveness and Innovation Framework Programme by the European Community http://ec.europa.eu/ict_psp Europeana Newspaper Dataset (2)
  • 8. This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the Competitiveness and Innovation Framework Programme by the European Community http://ec.europa.eu/ict_psp Europeana Newspapers Dataset (3)
  • 9. This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the Competitiveness and Innovation Framework Programme by the European Community http://ec.europa.eu/ict_psp Europeana Newspapers Dataset (4)
  • 10. This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the Competitiveness and Innovation Framework Programme by the European Community http://ec.europa.eu/ict_psp Refinement Workflow steps 10
  • 11. This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the Competitiveness and Innovation Framework Programme by the European Community http://ec.europa.eu/ict_psp Tools (BCT) • BCT = Binarisation and Colour Reduction Tool • Purpose: Convert grey/colour scans to bitonal using highly optimised GPP method • Background: Reduce total file size of master images to guarantee feasibility and timing of data transfers 11
  • 12. This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the Competitiveness and Innovation Framework Programme by the European Community http://ec.europa.eu/ict_psp Tools (FRT) • FRT = File Rename Tool • Purpose: Support content holders in preparing their data in the correct format • Background: Ensure folder structure and file naming requirements for automated processing are met 12
  • 13. This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the Competitiveness and Innovation Framework Programme by the European Community http://ec.europa.eu/ict_psp Tools (FAT) • FAT = File Analyzer Tool • Purpose: Final quality check of data before refinement • Background: Ensure content and refinement partners that all preparation steps have been executed successfully 13
  • 14. This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the Competitiveness and Innovation Framework Programme by the European Community http://ec.europa.eu/ict_psp Refinement: OCR@UIBK • OCR = Optical Character Recognition • Number of pages to be refined: 8 million • Technologies: ABBYY FineReader SDK • State-of-the-art OCR software, fully supports Fraktur/Latin/Cyrillic fonts • Result: METS/ALTO package containing images, metadata & full text 14
  • 15. This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the Competitiveness and Innovation Framework Programme by the European Community http://ec.europa.eu/ict_psp OCR  Full text search 15 http://www.europeana-newspapers.eu/building-a-content-browser-for-digital-newspapers/
  • 16. This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the Competitiveness and Innovation Framework Programme by the European Community http://ec.europa.eu/ict_psp Refinement: OLR@CCS • OLR = Optical Layout Recognition • Number of pages to be refined: 2 million • Technologies: docWorks • Separation of columns, articles, headlines, page classes • Result: METS/ALTO package containing images, metadata & full text 16
  • 17. This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the Competitiveness and Innovation Framework Programme by the European Community http://ec.europa.eu/ict_psp OLR  Article separation 17
  • 18. This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the Competitiveness and Innovation Framework Programme by the European Community http://ec.europa.eu/ict_psp Refinement: NER@KB • NER = Named Entities Recognition • Number of pages to be refined: 2 million • Technologies: Stanford CRF-NER • Languages supported: German, Dutch, English (+ French, Latvian) • Open source: https://github.com/KBNLresearch/europeananp-ner • Detection of Named entities: Person, Location, Organization • Feedback cycle with manual training step  better results 18
  • 19. This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the Competitiveness and Innovation Framework Programme by the European Community http://ec.europa.eu/ict_psp NER  Browse by names or places 19
  • 20. Thank you for your attention! clemens.neudecker@kb.nl