Dr. Kepa Rodriguez, Data and Content Specialist, Archives Division, Yad Vashem
Integration and Retrieval of Heterogeneous Archival Metadata
2016 EVA/Minerva Jerusalem International Conference on Digitisation of Cultural Heritage
http://2016.minervaisrael.org.il
http://www.digital-heritage.org.il
1. EVA/Minerva 2016
Integration and Retrieval of
Heterogeneous Archival Metadata
CONNECTING
COLLECTIONS
Kepa J. Rodriguez – Archives Yad Vashem
09/11/2016
2. Outline
●
Data integration in the first phase of the project
●
Our current integration approach
●
Retrieval of data using controlled vocabularies
●
Development of the EHRI controlled vocabularies
3. Data integration in the first phase of the project
●
Holding institutions delivered data in very different formats:
●
XML, text files, CSV, JSON, etc.
●
Ingestion into the portal was done case by case
●
We interpreted each institution's data model and mapped it to our model
●
Sometimes without the help of the institution
●
Lots of data entered by hand
●
Process not sustainable: it cannot be repeated
●
No automatic updates are possible
●
If an institution updates content, data has to be updated by hand
●
Other problems: infrastructure, persistent identifiers, etc.
4. Proposal for the second phase of the project
● Data conversion
● Data publication and synchronization
● Data ingestion
5. Data conversion
●
Conversion tool: different data formats into EAD:
●
XML, JSON, CSV...
●
Generic transformation
●
Useful for a relevant number of institutions
●
Reusable functions, such as mappings from specific fields of an
institution's export format into EAD
●
Utilities to configure specific transformations
●
Validation of the output:
●
Machine validation: XML validation protocols
●
Schematron, RNG
●
Human validation: HTML preview including mark-up
for validation errors
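The conversion step above can be sketched in Python as follows. This is only a minimal illustration, not the project's conversion tool: the CSV column names in `FIELD_MAP`, the second sample row (`X.1`) and the helper `missing_required` are assumptions made for the example. Real machine validation would run Schematron or RNG schemas rather than this small required-field check.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical mapping from one institution's CSV columns to EAD <did> children.
FIELD_MAP = {
    "ref_code": "unitid",
    "title": "unittitle",
    "extent": "physdesc",
}


def row_to_ead(row, level="file"):
    """Build a minimal EAD <archdesc> element from one CSV row."""
    archdesc = ET.Element("archdesc", {"level": level})
    did = ET.SubElement(archdesc, "did")
    for column, ead_tag in FIELD_MAP.items():
        value = (row.get(column) or "").strip()
        if value:  # skip empty cells instead of emitting empty elements
            ET.SubElement(did, ead_tag).text = value
    return archdesc


def missing_required(archdesc, required=("unitid", "unittitle")):
    """Small stand-in for machine validation: list missing <did> children."""
    did = archdesc.find("did")
    return [tag for tag in required if did.find(tag) is None]


# Illustrative input; the second row is deliberately incomplete.
sample_csv = (
    "ref_code,title,extent\n"
    "M.49.E,Testimonies of Holocaust Survivors,6845 files\n"
    "X.1,,\n"
)
records = [row_to_ead(row) for row in csv.DictReader(io.StringIO(sample_csv))]
for record in records:
    # Print the generated EAD and any required fields that are missing.
    print(ET.tostring(record, encoding="unicode"), missing_required(record))
```

Keeping the mapping in a plain dictionary is one way to make it reusable: a new institution with a similar export format would only need a different `FIELD_MAP`.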
6. EAD File sample (1)
<archdesc level="subgrp">
<did>
<unitid>M.49.E</unitid>
<unittitle encodinganalog="3.1.2">Testimonies of Holocaust Survivors collected by the
Central Jewish Historical Commission in Poland, 1944-1947</unittitle>
<physdesc encodinganalog="3.1.5">6845 files</physdesc>
<langmaterial>
<language langcode="deu" encodinganalog="3.4.3">German</language>
<language langcode="pol" encodinganalog="3.4.3">Polish</language>
<language langcode="yid" encodinganalog="3.4.3">Yiddish</language>
</langmaterial>
<repository>
<corpname>ארכיון יד ושם / Yad Vashem Archives</corpname>
</repository>
</did>
<scopecontent encodinganalog="3.3.1">
<p>The collection consists of approximately 7,200 testimonies collected by the
Centralna Żydowska Komisja Historyczna (Central Jewish Historical Committee) in
Poland during its active years, 1944-1947.
…..
as well as testimonies from survivors who fought in partisan units and survivors who
were in hiding.</p>
</scopecontent>
…....
8. Data publication and synchronization
●
We plan to use two data publication protocols:
●
OAI-PMH: one of the first protocols for publication of data
●
Publication of data in different formats: Dublin Core (default), EAD,
etc.
●
OAI-PMH servers are not easy to implement and to maintain for small
archives
●
But we want to implement a client for institutions that already use it
●
ResourceSync: a new protocol
●
Based on SiteMaps
●
Data can be published on the web page of the institution
●
Higher security
●
Use sitemaps to expose changes and updates
●
Only modified and new data will be transferred to the portal
●
Both are standard protocols of the Open Archives Initiative
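As a rough sketch of how a harvesting client could use the two protocols above: the first function reads a ResourceSync change list (a sitemap document extended with `rs:md` elements) and returns only the resources modified after a given time; the second builds an OAI-PMH `ListRecords` request URL for incremental harvesting. The `example.org` URLs, the sample change list and the function names are placeholders for illustration, not the project's actual client.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

# Illustrative change list, shaped like the examples in the ResourceSync
# specification; the URLs are placeholders.
CHANGE_LIST = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="changelist" from="2016-11-01T00:00:00Z"/>
  <url>
    <loc>http://example.org/ead/collection-1.xml</loc>
    <lastmod>2016-11-02T10:00:00Z</lastmod>
    <rs:md change="updated"/>
  </url>
  <url>
    <loc>http://example.org/ead/collection-2.xml</loc>
    <lastmod>2016-11-03T09:30:00Z</lastmod>
    <rs:md change="created"/>
  </url>
</urlset>"""


def changed_resources(change_list_xml, since):
    """Return (url, change type) pairs modified after `since`.

    Timestamps are compared as strings, which is only valid when all of
    them use the same UTC ISO 8601 format, as in the sample above.
    """
    root = ET.fromstring(change_list_xml)
    changes = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        lastmod = url.findtext("sm:lastmod", default="", namespaces=NS)
        md = url.find("rs:md", NS)
        change = md.get("change") if md is not None else None
        if lastmod > since:
            changes.append((loc, change))
    return changes


def oai_list_records_url(base_url, metadata_prefix="oai_dc", from_date=None):
    """Build an OAI-PMH ListRecords request URL for (incremental) harvesting."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date
    return base_url + "?" + urlencode(params)


print(changed_resources(CHANGE_LIST, "2016-11-02T12:00:00Z"))
print(oai_list_records_url("http://example.org/oai", from_date="2016-11-01"))
```

In both cases the portal fetches only what changed since the last synchronization, which is what makes automatic updates feasible.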
9. Data ingestion
●
After data is ingested into the portal, it will receive a
permanent URL:
●
Formal protocol is in progress
●
Necessary to publish our data in the Linked Open Data cloud
●
Updates: data will be overwritten
●
But the portal keeps the user generated data
●
But... is it enough for the user just to have all
information in a single infrastructure?
10. Data retrieval
●
The user needs to be able to retrieve information related to
selected topics, places, people, organizations, creators...
●
Regardless of which institution holds it
●
Regardless of the language in which the metadata is written
11. EHRI controlled vocabularies
●
EHRI Thesaurus
●
Concepts: hierarchy of concepts formalized in SKOS
●
A first set translated into 10 languages
●
Made by historians and content specialists
●
Authority lists:
●
Named entities or instances of the concepts
●
Proposed by historians and specialists: not really useful for indexing
and retrieval of data
●
During import many were added by hand to address the needs of the real
data
●
Domain specific authorities: Ghettos, Camps, Administrative Districts
●
Vocabularies created for applications in the portal:
●
Two research guides
●
Linked to the EHRI Thesaurus
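To illustrate why a multilingual concept hierarchy helps retrieval, here is a minimal sketch that models two SKOS-style concepts with preferred labels per language and a `broader` link, then resolves a label in any language to the same concept. The concept identifiers, the labels and the hierarchy are illustrative assumptions and are not taken from the EHRI Thesaurus. A real implementation would more likely load the SKOS data with an RDF library.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Concept:
    concept_id: str
    pref_labels: Dict[str, str]    # language code -> label (skos:prefLabel)
    broader: Optional[str] = None  # id of the parent concept (skos:broader)


# Illustrative data only.
THESAURUS = {
    "persecution": Concept(
        "persecution", {"en": "Persecution", "de": "Verfolgung"}
    ),
    "forced-labour": Concept(
        "forced-labour",
        {
            "en": "Forced labour",
            "de": "Zwangsarbeit",
            "fr": "Travail forcé",
            "pl": "Praca przymusowa",
        },
        broader="persecution",
    ),
}


def build_label_index(thesaurus):
    """Map every label, in every language, to its concept id."""
    index = {}
    for concept in thesaurus.values():
        for label in concept.pref_labels.values():
            index[label.casefold()] = concept.concept_id
    return index


def find_concept(label, index):
    """Resolve a label in any language; returns None when unknown."""
    return index.get(label.strip().casefold())


def ancestors(concept_id, thesaurus):
    """Follow skos:broader links up to the top of the hierarchy."""
    chain = []
    parent = thesaurus[concept_id].broader
    while parent is not None:
        chain.append(parent)
        parent = thesaurus[parent].broader
    return chain


label_index = build_label_index(THESAURUS)
print(find_concept("Zwangsarbeit", label_index))
print(ancestors("forced-labour", THESAURUS))
```

Because descriptions are linked to concept identifiers rather than to strings, a query in one language can reach material described in another.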
12. Problems of the first approach of the project
●
A vocabulary built with knowledge about the Shoah can be
helpful to represent the history, but not necessarily the
documentation:
●
The compilation of an encyclopedia and the implementation of an
engine for cataloguing and retrieval are two very different things
and require different strategies and kinds of expertise.
●
The vocabularies should be able to retrieve the real existing
data:
●
Vocabularies should be able to describe the data, not only the
content, e.g. types of documents, physical format of the data...
●
A strategy to extend the vocabularies when new data brings new
needs has to be implemented.
13. The reality of the data
●
Different institutions use different systems to assign
keywords (or no system)
●
Keywords can have different relevance in different systems
●
In a National Archive “holocaust” can be a relevant keyword, but it
is not relevant for the EHRI portal.
●
The same keyword can have different meanings in different
knowledge bases
●
e.g. “labor” in one set of imported data corresponds to “forced
labor”, in another set to “trade unions”
●
Relevant information is often given as free text:
●
Natural Language Processing is necessary to extract this
information, but within the project we can do this only at an
experimental level.
14. EHRI's data driven approach (1)
●
Extraction of access points from the EAD files during import
<controlaccess>
<geogname>Poland</geogname>
<geogname>Warsaw</geogname>
</controlaccess>
<controlaccess>
<subject>Persecution of Jews</subject>
<subject>Testimonies, Biographies</subject>
<subject>Holocaust survivors</subject>
</controlaccess>
<controlaccess>
<corpname>Centralna Żydowska Komisja Historyczna</corpname>
</controlaccess>
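A minimal sketch of this extraction step, using only the Python standard library. It assumes EAD without an XML namespace, as in the sample, and wraps the `<controlaccess>` blocks in an `<archdesc>` element so that the fragment has a single root. The function name and the inclusion of `persname` in the tag list are choices made for this example.

```python
import xml.etree.ElementTree as ET

# The access points from the slide, wrapped in <archdesc> to form one document.
EAD_FRAGMENT = """<archdesc level="subgrp">
  <controlaccess>
    <geogname>Poland</geogname>
    <geogname>Warsaw</geogname>
  </controlaccess>
  <controlaccess>
    <subject>Persecution of Jews</subject>
    <subject>Testimonies, Biographies</subject>
    <subject>Holocaust survivors</subject>
  </controlaccess>
  <controlaccess>
    <corpname>Centralna Żydowska Komisja Historyczna</corpname>
  </controlaccess>
</archdesc>"""

ACCESS_POINT_TAGS = ("geogname", "subject", "corpname", "persname")


def extract_access_points(ead_xml):
    """Collect access points from all <controlaccess> blocks, grouped by type."""
    root = ET.fromstring(ead_xml)
    points = {tag: [] for tag in ACCESS_POINT_TAGS}
    for block in root.iter("controlaccess"):
        for child in block:
            if child.tag in points and child.text:
                points[child.tag].append(child.text.strip())
    return points


access_points = extract_access_points(EAD_FRAGMENT)
for kind, values in access_points.items():
    print(kind, values)
```

Grouping by element name matters because each type is processed differently afterwards: places, subjects, and persons or corporate bodies are linked to different vocabularies.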
15. EHRI's data driven approach (2)
●
Persons, corporate bodies:
●
Check whether we have corresponding authority files
●
If we have: link the description unit with the corresponding authority
file
●
If we don't have: create a new authority file
●
Priority of EHRI: creators of archival collections
●
Places:
●
Link the places with the geographical database GeoNames
●
Problematic for historical places, some of them will be added as extra
vocabulary.
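The link-or-create rule for persons and corporate bodies could look roughly like the sketch below. The class name, the identifier scheme (`auth-0001`) and the matching rule (case-folding plus removal of diacritics) are assumptions made for this illustration. Matching real authority files needs more careful disambiguation than string normalization, and linking places to GeoNames is not covered here.

```python
import unicodedata


def normalize_name(name):
    """Case-fold, strip diacritics and collapse whitespace for comparison."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return " ".join(without_marks.casefold().split())


class AuthorityRegistry:
    """In-memory stand-in for the portal's authority files (hypothetical)."""

    def __init__(self):
        self._by_key = {}
        self._next_id = 1

    def link_or_create(self, name, kind):
        """Return (authority id, created?) for a person or corporate body."""
        key = (kind, normalize_name(name))
        created = key not in self._by_key
        if created:
            # No matching authority file yet: create a new one.
            self._by_key[key] = f"auth-{self._next_id:04d}"
            self._next_id += 1
        return self._by_key[key], created


registry = AuthorityRegistry()
first = registry.link_or_create(
    "Centralna Żydowska Komisja Historyczna", "corpname"
)
# A spelling variant without diacritics resolves to the same authority file.
second = registry.link_or_create(
    "centralna zydowska komisja historyczna", "corpname"
)
print(first, second)
```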
16. EHRI's data driven approach (3)
●
Concepts/terms: the most complicated case
●
Archives used very different strategies for concepts:
●
Some institutions compose terms using different rules
(or no rules)
●
Subject: “Jews--Persecution--France” (data of USHMM)
●
EHRI has an atomic approach
●
Subject: “Persecution of Jews”
●
Place: “France”
●
Steps to process concepts/terms:
●
Terms are normalized and de-duplicated
●
If there are equivalent terms in the thesaurus we establish a link
●
If there are no equivalent terms, the concept goes to further
analysis
●
If necessary, a board of experts will consider accommodating a new
concept in our concept hierarchy.
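The processing steps above can be sketched as a small pipeline: split composed headings into atomic terms, normalize and de-duplicate them, link those that have an equivalent in the thesaurus, and queue the rest for further analysis. The lookup tables and the rewrite rule for `Jews--Persecution` are hypothetical stand-ins. In practice they would come from the thesaurus, a place database and source-specific mappings.

```python
# Hypothetical lookup tables for illustration only.
THESAURUS_TERMS = {
    "persecution of jews": "Persecution of Jews",
    "holocaust survivors": "Holocaust survivors",
}
KNOWN_PLACES = {"france", "poland"}
# Hypothetical rule for rewriting a composed heading as one atomic concept.
COMPOSED_RULES = {("jews", "persecution"): "persecution of jews"}


def normalize(term):
    """Case-fold and collapse whitespace so variants compare equal."""
    return " ".join(term.casefold().split())


def atomize(heading):
    """Split e.g. 'Jews--Persecution--France' into concepts and places."""
    parts = [normalize(p) for p in heading.split("--") if p.strip()]
    places = [p for p in parts if p in KNOWN_PLACES]
    rest = tuple(p for p in parts if p not in KNOWN_PLACES)
    concepts = [COMPOSED_RULES[rest]] if rest in COMPOSED_RULES else list(rest)
    return concepts, places


def process_subjects(headings):
    """Normalize, de-duplicate and link terms; unknown ones go to review."""
    linked, places, review = [], [], []
    for heading in headings:
        concepts, found_places = atomize(heading)
        for place in found_places:
            if place not in places:
                places.append(place)
        for concept in concepts:
            target = linked if concept in THESAURUS_TERMS else review
            label = THESAURUS_TERMS.get(concept, concept)
            if label not in target:  # de-duplicate
                target.append(label)
    return {"linked": linked, "places": places, "review": review}


result = process_subjects(
    [
        "Jews--Persecution--France",
        "Persecution of Jews",
        "persecution  of jews",
        "Trade unions",
    ]
)
print(result)
```

Terms that end up in the `review` list correspond to the concepts that go to further analysis and, if necessary, to the board of experts.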
17. Ghettos and Concentration Camps
●
We are considering starting a WikiData project for ghettos and
concentration camps
●
Strategy:
●
Extract information from the current thesaurus and alternative
sources
●
Encyclopedic knowledge
●
Data from project partners
●
Integration of all this data in the WikiData platform
●
Enrichment with help of the community
●
Multilingual labels and non-controversial information
●
Finally the data in WikiData and in the portal should be
synchronized
18. NIOD Institute for War, Holocaust and Genocide Studies (NL)
CEGESOMA Centre for Historical Research and Documentation on War and Contemporary Society (BE)
Jewish Museum in Prague (CZ)
Center for Holocaust Studies at the Institute for Contemporary History in Munich (DE)
YAD VASHEM The Holocaust Martyrs’ and Heroes’ Remembrance Authority (IL)
United States Holocaust Memorial Museum (USA)
Bundesarchiv (DE)
The Wiener Library Institute for the Study of the Holocaust & Genocide (UK)
Holocaust Documentation Centre (SK)
Polish Center for Holocaust Research (PL)
The Jewish Museum of Greece (GR)
Jewish Historical Institute (PL)
King’s College London (UK)
Ontotext AD (BG)
Elie Wiesel National Institute for the Study of Holocaust in Romania (RO)
DANS Data Archiving and Networked Services (NL)
Shoah Memorial, Museum, Center for Contemporary Jewish Documentation (FR)
ITS International Tracing Service (DE)
Hungarian Jewish Archives (HU)
INRIA Institute for Research in Computer Science and Automation (FR)
Vilna Gaon State Jewish Museum (LT)
VWI Vienna Wiesenthal Institute for Holocaust Studies (AT)
Foundation Jewish Contemporary Documentation Center (IT)
CONNECTING
KNOWLEDGE