This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the
Competitiveness and Innovation Framework Programme by the European Community
http://ec.europa.eu/ict_psp
Overview
• Objectives & Challenges
• Introduction to Refinement Dataset
• Overview of Refinement Workflow & Tools
• Refinement with OCR
• Refinement with OLR
• Refinement with NER
• Short summary
• Questions & Answers
Objectives & Challenges
Objectives
- Analyse the digital newspaper collections held by project partners and
select subsets suitable for refinement
- Define requirements and minimum quality of digitized newspapers for
refinement and advanced services in Europeana
- Coordinate timely processing of 10 million newspaper pages provided by
libraries, using several refinement technologies
- Provide recommendations on best practices for refining digitized
newspaper collections for full-text ingest into Europeana
Challenges
• Processing quality vs. speed/throughput
• Volume of data requires focus on simple & strictly
followed workflows with checkpoints on progress
• Large number of partners supplying content with
different digitisation & access policies
• Large variety of content in terms of file formats,
fonts, languages
Refinement Dataset
Initial dataset
Master List
https://sp.uibk.ac.at/sites/eu-news/Refinement/Lists/MasterList/AllItems.aspx
Europeana Newspapers Dataset (1)
Europeana Newspapers Dataset (2)
Europeana Newspapers Dataset (3)
Europeana Newspapers Dataset (4)
Workflow & Tools
Refinement Workflow steps
Tools (BCT.1)
• BCT = Binarisation and Colour Reduction Tool
• Produced by UIBK as a Windows EXE-Tool with GUI
• Purpose: Convert grey/colour scans to bitonal using the binarisation
method of Gatos, Pratikakis and Perantonis (GPP)
• Background: Need to reduce total file size of master images
to guarantee feasibility and timing of data transfers
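To illustrate why converting to bitonal shrinks master images, here is a minimal sketch. It does not implement the GPP method: a plain global Otsu threshold is swapped in as a stand-in, and the function names `otsu_threshold` and `to_bitonal` are hypothetical. It only shows the storage effect, where each 8-bit grey pixel becomes a single bit, so the uncompressed raster is roughly eight times smaller.

```python
def otsu_threshold(pixels):
    """Return the grey level (0-255) that best separates dark from light pixels."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w_bg = 0    # number of pixels at or below the candidate threshold
    sum_bg = 0  # sum of the grey levels of those pixels
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue
        w_fg = total - w_bg
        if w_fg == 0:
            break
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def to_bitonal(pixels, width):
    """Pack grey pixels into rows of 1-bit pixels (1 = ink, 0 = paper)."""
    t = otsu_threshold(pixels)
    row_bytes = (width + 7) // 8  # each row is padded to a whole byte
    height = len(pixels) // width
    out = bytearray(row_bytes * height)
    for i, p in enumerate(pixels):
        if p <= t:  # dark pixel, treated as ink
            y, x = divmod(i, width)
            out[y * row_bytes + x // 8] |= 0x80 >> (x % 8)
    return bytes(out)


# Synthetic 16x2 "scan": 220 = paper, 30 = ink (made-up values).
grey = [220] * 8 + [30] * 8 + [30] * 4 + [220] * 12
bitonal = to_bitonal(grey, width=16)
print(len(grey), "bytes grey ->", len(bitonal), "bytes bitonal")
```

On this toy input the 32 grey bytes pack into 4 bitonal bytes. Real savings also depend on the compression applied to each format.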
Tools (BCT.2)
Tools (BCT.3)
• Internally wraps GraphicsMagick to create lower-resolution
images for viewing in the content browser
• Integration of Kakadu for JPEG 2000 support is under discussion
• With the GPP method, next to no decrease in OCR accuracy was
observed when using bitonal rather than grey/colour images for
OCR (in a small test, accuracy even rose from 72% to 83%)
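The slide does not state how the accuracy figures were measured. One common measure, assumed here purely for illustration, is character accuracy: one minus the edit distance between the OCR output and a ground-truth transcription, divided by the ground-truth length. The function names and the sample strings below are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def character_accuracy(ground_truth, ocr_text):
    """Share of ground-truth characters recognised correctly, clamped at 0."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    errors = edit_distance(ground_truth, ocr_text)
    return max(0.0, 1.0 - errors / len(ground_truth))


# Made-up example: three 'e' characters misread as 'c'.
truth = "Europeana Newspapers"
ocr = "Europcana Ncwspapcrs"
print(f"{character_accuracy(truth, ocr):.0%}")
```

Comparing this score for the same pages processed as grey/colour and as bitonal images would give the kind of before/after figure quoted above.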
Tools (FRT.1)
• FRT = File Rename Tool
• Produced by UIBK as a Windows EXE-Tool with GUI
• Purpose: Support content holders in preparing their data in
the correct structure required for large-scale processing by
refinement partners
Tools (FRT.2)
Tools (FRT.3)
• Simplifies batch renaming of files and folders according to
the project delivery specification
• Visual checks in the tool interface help to spot issues
that still have to be corrected
• Highlights possible errors and conflicts to the user
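The batch-renaming and conflict-highlighting idea can be sketched as a dry run. This is not the FRT itself, and the delivery specification is not reproduced on the slides, so the target pattern `<title_id>_<YYYYMMDD>_<page>.tif` and all names below are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date


def target_name(title_id, issue_date, page, ext="tif"):
    """Build a file name in the assumed pattern, e.g. NP01_19140728_0001.tif."""
    return f"{title_id}_{issue_date:%Y%m%d}_{page:04d}.{ext}"


def plan_renames(items):
    """Plan renames without touching the file system.

    items: iterable of (old_name, title_id, issue_date, page).
    Returns (plan, conflicts): plan maps old name to new name, and
    conflicts maps a new name to every old name competing for it.
    """
    by_target = defaultdict(list)
    for old_name, title_id, issue_date, page in items:
        by_target[target_name(title_id, issue_date, page)].append(old_name)
    plan = {olds[0]: new for new, olds in by_target.items() if len(olds) == 1}
    conflicts = {new: olds for new, olds in by_target.items() if len(olds) > 1}
    return plan, conflicts


# Hypothetical scans: two files claim to be page 2 of the same issue.
scans = [
    ("scan_001.tif", "NP01", date(1914, 7, 28), 1),
    ("scan_002.tif", "NP01", date(1914, 7, 28), 2),
    ("scan_002_copy.tif", "NP01", date(1914, 7, 28), 2),
]
plan, conflicts = plan_renames(scans)
print(plan)
print(conflicts)
```

Separating the plan from the actual rename lets conflicts be shown to the user before any file is changed.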
Tools (FAT.1)
• FAT = File Analyzer Tool
• Produced by UIBK as a Windows EXE-Tool with GUI
• Purpose: Final quality check of data preparation
• FAT analyses the final data (images & metadata) prepared
by content holder for refinement and checks whether all
necessary data preparation steps have been successfully
completed
Tools (FAT.2)
Tools (FAT.3)
• Verifies metadata against data available in Master List
• Verifies file & folder structure against project specification
• Produces a log and XML output describing the data and
the provenance of its processing
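A check of delivered metadata against the Master List, with the result written as XML, might look roughly like the sketch below. The record fields (`language`, `pages`), the report element names, and the function `check_delivery` are hypothetical; the actual FAT output format is not shown on the slides, and the folder-structure check is left out.

```python
import xml.etree.ElementTree as ET


def check_delivery(master_list, delivered):
    """Compare delivered metadata with Master List entries.

    Both arguments map a title id to a dict of metadata fields.
    Returns an XML element with one <title> per id and its status.
    """
    report = ET.Element("report")
    for title_id in sorted(set(master_list) | set(delivered)):
        entry = ET.SubElement(report, "title", id=title_id)
        expected = master_list.get(title_id)
        actual = delivered.get(title_id)
        problems = []
        if expected is None:
            problems.append("not in Master List")
        elif actual is None:
            problems.append("missing from delivery")
        else:
            for field, value in sorted(expected.items()):
                if actual.get(field) != value:
                    problems.append(
                        f"{field}: expected {value!r}, found {actual.get(field)!r}"
                    )
        entry.set("status", "failed" if problems else "ok")
        for text in problems:
            ET.SubElement(entry, "issue").text = text
    return report


# Made-up records: NP02 was delivered with one page fewer than listed.
master = {
    "NP01": {"language": "de", "pages": 4},
    "NP02": {"language": "nl", "pages": 8},
}
delivered = {
    "NP01": {"language": "de", "pages": 4},
    "NP02": {"language": "nl", "pages": 7},
}
report = check_delivery(master, delivered)
print(ET.tostring(report, encoding="unicode"))
```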
Refinement: OCR
• OCR = Optical Character Recognition
• Executing organisation: University of Innsbruck (UIBK)
• Number of pages to be refined: 8 million
• Technologies: ABBYY FineReader SDK
• State-of-the-art OCR software with full support for Fraktur type and Arabic/Cyrillic scripts
OCR processing at UIBK
Refinement: OLR
• OLR = Optical Layout Recognition
• Executing organisation: Content Conversion Specialists (CCS)
• Number of pages to be refined: 2 million
• Technologies: docWorks
• Columns, articles, headlines, page classification
OLR processing at CCS
Three ways are offered to libraries for doing the OLR process with CCS:
1. Fully on-site at the library (requires local installation of docWorks)
2. Conversion off-shore, QA at the library via internet connection
3. Conversion off-shore, QA at the library via backup shipment
Refinement: NER
• NER = Named Entity Recognition
• Executing organisation: Koninklijke Bibliotheek
• Number of pages to be refined: > 2 million
• Technologies: Stanford CRF-NER
• Languages: German, Dutch, English, (French)
• Open source available: https://github.com/KBNLresearch/europeananp-ner
• Named entities: Person, Location, Organization
NER processing at KB
1. UIBK/CCS complete refinement with OCR/OLR
2. Data (OCR, images, metadata) sent via hard disk to Europeana/TEL and KB in the ENMAP package format
3. KB NER-Tool extracts references to the OCR files from the ENMAP package
4. OCR files (ALTO) are processed with the Stanford CRF-NER algorithm
5. Detected named entities can be exported in a variety of output formats
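Steps 4 and 5 can be sketched as follows: read the words and their coordinates from an ALTO file, label them, and keep the coordinates attached to each detected entity. This is not the KB NER-Tool. The Stanford CRF-NER classifier is replaced by a trivial, hypothetical gazetteer lookup, and the sample ALTO fragment and function names are made up for illustration. The `String` element and its `CONTENT`, `HPOS`, `VPOS`, `WIDTH` and `HEIGHT` attributes come from the ALTO schema.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written ALTO fragment (illustrative only).
ALTO_SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock><TextLine>
    <String CONTENT="Mayor" HPOS="100" VPOS="50" WIDTH="60" HEIGHT="20"/>
    <String CONTENT="visits" HPOS="170" VPOS="50" WIDTH="70" HEIGHT="20"/>
    <String CONTENT="Amsterdam" HPOS="250" VPOS="50" WIDTH="110" HEIGHT="20"/>
  </TextLine></TextBlock></PrintSpace></Page></Layout>
</alto>"""

# Stand-in for a trained model: a hypothetical one-entry gazetteer.
HYPOTHETICAL_GAZETTEER = {"Amsterdam": "LOCATION"}


def words_with_coordinates(alto_xml):
    """Return each ALTO String as a dict with its text and bounding box."""
    root = ET.fromstring(alto_xml)
    words = []
    for el in root.iter():
        # Drop the namespace prefix so any ALTO version matches.
        if el.tag.rsplit("}", 1)[-1] == "String":
            box = tuple(
                float(el.get(k)) for k in ("HPOS", "VPOS", "WIDTH", "HEIGHT")
            )
            words.append({"text": el.get("CONTENT"), "box": box})
    return words


def tag_entities(words, classify):
    """Keep words the classifier labels, preserving their coordinates."""
    entities = []
    for word in words:
        label = classify(word["text"])
        if label is not None:
            entities.append({**word, "label": label})
    return entities


entities = tag_entities(
    words_with_coordinates(ALTO_SAMPLE), HYPOTHETICAL_GAZETTEER.get
)
print(entities)
```

Because each entity carries its own bounding box, the result can be written to whichever export format is chosen without going back to the ALTO file.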
Issues encountered
1. Issue: Amount of data to be transferred from libraries to refinement partners
What will be done to address the problem?
Reduction of file size by applying optimized GPP binarization
2. Issue: Storage format for named entities needs to preserve coordinates,
but ALTO-XML cannot store semantic information
What will be done to address the problem?
Several alternative storage formats have been implemented
NER-Tool ensures the word coordinates are retained after processing
3. Issue: Ottoman language/script currently not supported in OCR software
What will be done to address the problem?
Select only newspapers in Latin alphabet for refinement of NLT content
Thank you for your attention!
Questions?
clemens.neudecker@kb.nl