Cultural heritage institutions hold collections of printed newspapers that are valuable resources for the study of history, linguistics and other Digital Humanities disciplines. Effective retrieval of newspaper content based on metadata alone is nearly impossible, which makes retrieval based on (digitized) full-text particularly relevant. Europeana, Europe’s Digital Library, is in a position to provide access to large newspaper collections with full-text resources. Full-text corpora are also relevant for Europeana’s objective of promoting the use of cultural heritage resources within research infrastructures. We have derived requirements for aggregating and publishing Europeana’s newspaper full-text corpus in an interoperable way, based on investigations into the specific characteristics of cultural data, the needs of two research infrastructures (CLARIN and EUDAT) and the practices being promoted in the International Image Interoperability Framework (IIIF) community. We have then defined a ‘full-text profile’ for the Europeana Data Model, which is being applied to Europeana’s newspaper corpus.
Opening Digitized Newspapers Corpora: Europeana’s Full-text Data Interoperability Case
1. LDK 2019 – 2nd Conference on Language, Data and Knowledge
May 2019
2. Context
CC BY-SA
● Cultural heritage institutions hold collections of printed newspapers that are
valuable resources for several scientific disciplines
● Effective retrieval of newspapers requires full-text
● Full-text corpora are also relevant for Europeana’s objective of promoting the
use of cultural heritage resources within research infrastructures
● We have derived requirements for aggregating and publishing newspapers
full-text with two research infrastructures (CLARIN and EUDAT) and based on
the practices being promoted in the IIIF community
● We have defined a ‘full-text profile’ for the Europeana Data Model, which is
being applied to Europeana’s newspaper corpus
3. Europeana
The Platform for Europe’s Digital Cultural Heritage
● Aggregates and makes available data:
• From all EU countries
• From ~3,700 galleries, libraries,
archives and museums
• Under a CC0 licence
• More than 58M objects
• In about 50 languages
“We transform the world with culture! We
want to build on Europe’s rich heritage and
make it easier for people to use, whether
for work, for learning or just for fun.”
4. Europeana
The Platform for Europe’s Digital Cultural Heritage
● Data aggregation focused on metadata
● … with cultural objects as the main entity
● … content (e.g. full-text) is a recent activity
● … promoting the research use of heritage data is ongoing
5. Europeana’s Newspapers Corpus
● This corpus contains over 11 million pages of full text of historic
newspapers
○ Mainly from the 19th century
○ Aggregated from national and research libraries across Europe.
● Europeana has the objective to:
○ expose the aggregated full text of the corpus
○ enable its data-driven usage in research
● … through interoperability with European research infrastructures
6. EUDAT’s Service Suite
https://www.eudat.eu/services
Covering both access and deposit,
from informal data sharing to
long-term archiving, and
addressing identification,
discoverability and computability
of both long-tail and big data,
EUDAT services seek to address the
full lifecycle of research data
7. Connecting two e-Infrastructures to
facilitate the reuse of cultural
heritage data in research
● The challenges presented by cultural
heritage data resources
● How EUDAT services were used
● How a deeper interoperability
between Europeana and EUDAT can
be pursued
The Europeana Data Pilot
outcomes and
conclusions
8. CLARIN - Common Language Resources
and Technology Infrastructure
• CLARIN provides easy and sustainable access for scholars
in the humanities and social sciences and beyond
• to digital language data (in written, spoken, video or
multimodal form)
• and advanced tools to discover, explore, exploit,
annotate, analyse or combine them, wherever they are
located
9. Connecting two e-Infrastructures to
facilitate the reuse of cultural heritage
data in research
● The technical solution in use
● The available functionality in CLARIN for
the application of Language Processing
Resources for conducting research
based on cultural heritage data resources
Bringing Europeana and CLARIN
together: Dissemination and
exploitation of cultural heritage
data in a research infrastructure
10. The Europeana Data Model - EDM
● Europeana uses EDM as its technological solution for data exchange
○ For data acquisition from cultural heritage data providers
○ For data distribution and reuse by third parties
● EDM follows the principles of the Semantic Web
● EDM is defined collaboratively with all the sectors represented in Europeana
11. Design decisions on modeling of full-text in EDM
● IIIF is a technology widely used in cultural heritage to provide end-users with rich
interfaces for images (with much potential for researchers)
● The IIIF community has specified textual representations of images (e.g.
transcriptions) using annotations from the W3C Web Annotation model
● It defines the use of a list of annotations, each one referring to a portion of
the full-text and indicating its corresponding position in the image of a page
● Our decision for EDM was to represent the full-text content of newspapers
as annotations on the images of newspapers’ pages
○ … in a way that is compatible with IIIF
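The annotation-list approach described on this slide can be sketched as follows. This is a minimal illustration in Python using the JSON-LD vocabulary of the W3C Web Annotation model, not Europeana's actual serialization; the URLs, the `make_text_annotation` helper and the sample OCR lines are hypothetical.

```python
import json

# JSON-LD context published by the W3C for the Web Annotation model.
ANNO_CONTEXT = "http://www.w3.org/ns/anno.jsonld"


def make_text_annotation(anno_id, text, image_url, x, y, w, h):
    """Build one annotation: a portion of text placed on a region of a page image."""
    return {
        "@context": ANNO_CONTEXT,
        "id": anno_id,
        "type": "Annotation",
        "body": {"type": "TextualBody", "value": text},
        # The region is expressed as a media fragment appended to the image URL.
        "target": f"{image_url}#xywh={x},{y},{w},{h}",
    }


# Hypothetical page image and OCR lines: (text, x, y, width, height).
PAGE_IMAGE = "https://example.org/newspaper/issue1/page1.jpg"
ocr_lines = [
    ("First line of the page", 10, 20, 300, 40),
    ("Second line of the page", 10, 70, 300, 40),
]

# One annotation per portion of text, in reading order.
annotations = [
    make_text_annotation(f"https://example.org/anno/{i}", text, PAGE_IMAGE, x, y, w, h)
    for i, (text, x, y, w, h) in enumerate(ocr_lines, start=1)
]

print(json.dumps(annotations[0], indent=2))
```

Each annotation carries both the text and its position, so a viewer can overlay the transcription on the page image while a text-mining tool can read the bodies alone.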
12. Full-text Profile for EDM
General principles for full-text annotations in the EDM extension
13. Full-text Profile for EDM
Full-text without position information
14. Full-text Profile for EDM
Full-text resource with position on the image
W3C Media Fragment
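As an illustration of how a position on the image can be carried by a W3C Media Fragment, the sketch below builds and parses the spatial `xywh` fragment on an image URL. The helper names and the example URL are hypothetical, and only pixel coordinates are handled (the `percent:` form of the specification is left out).

```python
import re

# Matches a trailing spatial media fragment such as "#xywh=10,20,300,40"
# or "#xywh=pixel:10,20,300,40" (pixel units only in this sketch).
_XYWH = re.compile(r"#xywh=(?:pixel:)?(\d+),(\d+),(\d+),(\d+)$")


def with_region(image_url, x, y, w, h):
    """Return the image URL with a spatial media fragment appended."""
    return f"{image_url}#xywh={x},{y},{w},{h}"


def parse_region(url):
    """Split a URL into (image_url, (x, y, w, h)); raise if no region is present."""
    match = _XYWH.search(url)
    if match is None:
        raise ValueError(f"no xywh media fragment in {url!r}")
    x, y, w, h = (int(group) for group in match.groups())
    return url[: match.start()], (x, y, w, h)


# Hypothetical page image.
target = with_region("https://example.org/page1.jpg", 10, 20, 300, 40)
image_url, region = parse_region(target)
print(target)
print(image_url, region)
```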
15. Full-text Profile for EDM
Full-text fragment with position on the image using oa:FragmentSelector
W3C Media Fragment
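A fragment selector expresses the same region as a structured target instead of a URL suffix. The sketch below converts one form into the other using the JSON-LD keys of the W3C Web Annotation model (`source`, `selector`, `conformsTo`, `value`); the function name and the example URL are hypothetical, and the exact shape used in the EDM profile may differ.

```python
from urllib.parse import urldefrag

# URI identifying the W3C Media Fragments specification, used in "conformsTo".
MEDIA_FRAGS = "http://www.w3.org/TR/media-frags/"


def to_fragment_selector(target_url):
    """Turn 'image#xywh=...' into a target with an explicit FragmentSelector."""
    source, fragment = urldefrag(target_url)
    if not fragment:
        raise ValueError(f"{target_url!r} has no fragment to select")
    return {
        "source": source,
        "selector": {
            "type": "FragmentSelector",
            "conformsTo": MEDIA_FRAGS,
            "value": fragment,
        },
    }


# Hypothetical page image with a region.
specific_target = to_fragment_selector("https://example.org/page1.jpg#xywh=10,20,300,40")
print(specific_target)
```

Keeping the source and the selector apart makes it possible to combine the same image with other selector types without rewriting its URL.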
16. Full-text Profile for EDM
Full-text fragment with position using oa:TextPositionSelector
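A text position selector identifies a piece of the full text by character offsets: in the W3C Web Annotation model, `start` is the position of the first character included and `end` is the position of the first character after the selection. The sketch below resolves such a selector against a page's text; the sample text and helper names are hypothetical.

```python
def select_text(full_text, selector):
    """Return the piece of full_text identified by a TextPositionSelector."""
    if selector.get("type") != "TextPositionSelector":
        raise ValueError("unsupported selector type")
    start, end = selector["start"], selector["end"]
    if not 0 <= start <= end <= len(full_text):
        raise ValueError("selector offsets fall outside the text")
    # 'start' is inclusive and 'end' is exclusive, which matches Python slicing.
    return full_text[start:end]


def selector_for(full_text, snippet):
    """Build a TextPositionSelector for the first occurrence of snippet."""
    start = full_text.find(snippet)
    if start < 0:
        raise ValueError("snippet not found in text")
    return {"type": "TextPositionSelector", "start": start, "end": start + len(snippet)}


# Hypothetical full text of one page.
page_text = "Morning edition. Ships arrived in the harbour yesterday."
selector = selector_for(page_text, "Ships arrived")
print(selector, select_text(page_text, selector))
```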
17. Full-text Profile for EDM
Representing the logical structure of articles and paragraphs of full-text
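One way to picture a logical structure of articles and paragraphs is as typed units that each cover a character range of the full text, with paragraphs nested inside the article whose range contains them. The sketch below follows that assumption only; the `Unit` class, the "article" and "paragraph" labels and the offsets are hypothetical and do not reproduce the properties defined in the profile.

```python
from dataclasses import dataclass, field


@dataclass
class Unit:
    kind: str   # hypothetical labels: "article" or "paragraph"
    start: int  # first character covered (inclusive)
    end: int    # first character after the unit (exclusive)
    children: list = field(default_factory=list)


def nest(units):
    """Attach each paragraph to the first article whose range contains it."""
    articles = [u for u in units if u.kind == "article"]
    for paragraph in (u for u in units if u.kind == "paragraph"):
        for article in articles:
            if article.start <= paragraph.start and paragraph.end <= article.end:
                article.children.append(paragraph)
                break
    return articles


# Hypothetical page with two articles and three paragraphs.
units = [
    Unit("article", 0, 200),
    Unit("paragraph", 0, 120),
    Unit("paragraph", 120, 200),
    Unit("article", 200, 350),
    Unit("paragraph", 200, 350),
]
articles = nest(units)
for article in articles:
    print(article.kind, article.start, article.end, len(article.children))
```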
18. Full-text Profile for EDM
Specification of the language of the text: for the whole edm:FullTextResource
19. Full-text Profile for EDM
Specification of the language of the text: for a piece of text in isolation
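The two levels of language specification can be read as a default with local overrides: a language stated for the whole edm:FullTextResource applies unless a piece of text states its own. The sketch below illustrates that reading only; the dictionary keys and sample values are hypothetical and are not the property names of the profile.

```python
def effective_language(piece, resource):
    """Language of a piece of text, falling back to the resource-level language."""
    return piece.get("language") or resource.get("language")


# Hypothetical full-text resource for a Dutch newspaper page.
resource = {"type": "FullTextResource", "language": "nl", "value": "..."}

# Hypothetical pieces of text: the second one is a quotation in French.
pieces = [
    {"start": 0, "end": 40},
    {"start": 40, "end": 75, "language": "fr"},
]

languages = [effective_language(piece, resource) for piece in pieces]
print(languages)
```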
20. Conclusions
● The work on the Europeana corpus with CLARIN and EUDAT allowed all three
infrastructures to gain knowledge on requirements for data interoperability
for research
○ … particularly, from Europeana’s point of view, on how to make cultural
heritage corpora more discoverable, accessible, machine-processable
and citable in research contexts
● With full-text supported by EDM, Europeana expects to lower the technical
barriers to a sustainable aggregation of cultural heritage corpora, leading to
their increased availability for research and other purposes
21. Future work
● Ongoing work with CLARIN to ensure optimal processing of Europeana
corpora through CLARIN’s Virtual Language Observatory
● Resume aggregation of newspapers corpora by Europeana
● Improve EDM’s full-text model
○ Identify additional requirements resulting from different digitization
practices across Europe
○ Follow any progress made by the IIIF community on full-text
22. Thank you for your attention
nuno.freire@tecnico.ulisboa.pt
Netherlands, Public Domain
1660 - 1625, Rijksmuseum
Anonymous
Arrival of a Portuguese ship
Acknowledgments
Fundação para a Ciência e a Tecnologia (FCT): UID/CEC/50021/2013
European Commission contract number 30-CE-0885387/00-80.