Slides for my keynote presentation "Linked Data for Digital History" presented at Semantic Web for Scientific History (SW4SH) co-located with ESWC 2015
1. Linked Data for
Digital Humanities
SW4SH 2015
Victor de Boer
With input from Christophe Guéret, Serge ter Braake,
Niels Ockeloen, Antske Fokkens, Dirk Roorda, Lora
Aroyo, Johan Oomen, Oana Inel, Jan Wielemaker
2. Victor de Boer
Web & Media Group, CS, VU University Amsterdam
Netherlands Institute for Sound and Vision
Cultural Heritage
Digital History
Linked Data for Development
3. Digital History
Sub-discipline of digital humanities
Buzzword that people are embracing and at
the same time already getting tired of.
“...the use of digital media and computational
analytics for furthering historical practice,
presentation, analysis, or research”
--Wikipedia
4. Digital History
Part of the historian's effort has moved from
the physical archives to digital ones
Cross-domain collaboration
Img:www.doaks.org, www.dkrz.de
9. Aging
Data Tool
C. Guéret based on http://redmonk.com/jgovernor/2007/04/05/why-applciations-are-like-fish-and-data-is-like0wine/
10. Even better
Do not bake the data into the tool and treat
data as an end product.
Build tools on top of the data.
Make sure others can do so as well.
Fig: C. Guéret
11. Framework generic solutions with historians
1. Preprocess, clean, model, link, and enrich data in collaboration with
domain experts;
2. Access heterogeneous datasets in a convenient way to get an
intuition of the character and anomalies of the (linked) data;
3. Perform arbitrary queries to retrieve results relevant to their
research questions;
4. Verify the veracity of query results, by following provenance links
to original material
5. Retrieve and analyze the data with the tool of preference;
6. Republish and share results
12. Linked Data for Digital History
• Represent heterogeneous datasets with their own data models
– In one data format (RDF)
– Link what can be linked to integrate at project level (and beyond)
– Keep specificity of original data
• Links to other sources: re-use knowledge
• Common-sense or very specific
• Digital hermeneutics
• Allow multiple levels of semantic enrichment
– normalization
– through Named Graphs
– Provenance
• Linked Data is the (technically) best way to publish and share your
research data
15. The Problem:
((Maritime) historical) data is not integrated
• Researchers’ data is “lost”
– In different physical locations
– In different file formats
– In different semantic structures
• In a workshop, we identified 25+
maritime historical datasets.
– http://dutchshipsandsailors.nl
• We do not want to force one
monolithic data model for
integration
16. The solution: Linked Open Data
• Represent heterogeneous datasets
with their own data models
– In one data format (RDF)
– Link what can be linked to integrate at
project level (and beyond)
– Keep specificity of original data
• Links to other sources: re-use
knowledge
• Allow multiple levels of semantic
enrichment/ normalization
– through Named Graphs
– Provenance
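A minimal sketch of the named-graph idea on this slide, assuming plain Python tuples stand in for RDF quads. The graph names, the `ex:` prefix, and all values are hypothetical illustrations, not the project's actual data or vocabulary.

```python
# Each statement is a quad: (subject, predicate, object, named graph).
# Graph names and the ex: prefix are hypothetical.
QUADS = [
    ("ex:ship-1", "ex:scheepsnaam", "De Berkel", "graph:original"),
    ("ex:ship-1", "ex:scheepstype", "Schip", "graph:original"),
    ("ex:ship-1", "ex:normalizedType", "ex:type-Ship", "graph:normalized"),
    ("ex:ship-1", "ex:sameShipAs", "ex:other-ship-9", "graph:links"),
]


def select(quads, graphs):
    """Return the triples found in the chosen named graphs only."""
    return [(s, p, o) for (s, p, o, g) in quads if g in graphs]


# A researcher who trusts only the source data queries one graph.
original_only = select(QUADS, {"graph:original"})

# Another researcher opts in to the enrichment layers as well.
enriched = select(QUADS, {"graph:original", "graph:normalized", "graph:links"})

print(len(original_only), len(enriched))
```

Because each enrichment level lives in its own graph, the original statements are never overwritten, and every added layer can be inspected or left out.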
17. What we did
1. Model four maritime historical datasets as
RDF
– Noordelijke Monsterrollen Database [J. Leinenga]
– Generale Zeemonsterrollen [M. van Rossum]
– Dutch Asiatic Shipping
– VOC Opvarenden
2. Link to each other (based on ships, ship types,
ranks, geography,…)
– Models and links evaluated by domain experts
3. Publish as Linked Open Data
4. Show how this data cloud can lead to new
types of integrated research questions
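Step 2 above can be sketched roughly as follows. This is not the project's actual linking procedure: the name normalization, the use of `skos:closeMatch` as the link predicate, and the example URIs are all assumptions made for illustration. The output is a list of candidate links that domain experts would still need to evaluate.

```python
import re
import unicodedata


def normalize_name(name):
    """Lowercase, strip accents and punctuation, and drop a leading article."""
    text = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9 ]", " ", text.lower())
    words = text.split()
    if words and words[0] in {"de", "het", "t"}:
        words = words[1:]
    return " ".join(words)


def candidate_links(dataset_a, dataset_b):
    """Pair up URIs from two datasets whose ship names share a normalized key."""
    index = {}
    for uri, name in dataset_b.items():
        index.setdefault(normalize_name(name), []).append(uri)
    links = []
    for uri, name in dataset_a.items():
        for match in index.get(normalize_name(name), []):
            links.append((uri, "skos:closeMatch", match))
    return links


# Hypothetical records: URI -> ship name as written in the source.
gzm = {"ex:gzm/ship-1": "De Berkel", "ex:gzm/ship-2": "Transit"}
das = {"ex:das/ship-7": "BERKEL", "ex:das/ship-8": "Zeelandia"}

links = candidate_links(gzm, das)
print(links)
```

Name matching alone over-generates (many ships shared a name), so in practice further evidence such as dates or ship types would be needed before a link is accepted.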
18. Modeling in collaboration with historians
[Figure: RDF graph of an example record.
Classes: dss:Record, dss:Ship, dss:ShipType, gzmvoc:Telling, gzmvoc:Schip, gzmvoc:Scheepstype, das:Voyage.
Instances: gzmvoc:telling-1046-De_Berkel, gzmvoc:schip-1046-De_Berkel, gzmvoc:type-Ship, das:voyage-1918_61, and a blank node.
Properties: rdfs:label, dss:has_ship, dss:has_shiptype, dss:scheepsnaam, dss:azRegistratieKop, gzmvoc:schip, gzmvoc:scheepsnaam, gzmvoc:scheepstype, gzmvoc:has_shiptype, gzmvoc:telling, gzmvoc:aziatischeBemanning, gzmvoc:azAantalMatrozen, "gzmvoc:heeft DAS heenreis".
Literals: "1046", "Schip", "De Berkel", "21", "Moorse mattroosen".]
DIFFERENT but LINKED DATAMODELS
19. Modelling principles
Model each dataset as directly as possible
Only “syntactical” transformation to RDF
No normalization
Reusability
Transparency, trust
Normalize and link in second stage
store in separate RDF Named Graphs
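The two-stage principle can be sketched as below, assuming a source record arrives as a dictionary of strings. The `http://example.org/` namespace, the `_int` property suffix, and the graph names are hypothetical; the column names and values are borrowed from the example record purely for illustration. Stage 1 copies every value verbatim; stage 2 writes its normalized value to a separate named graph, using N-Quads syntax.

```python
BASE = "http://example.org/"  # hypothetical namespace
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"


def literal(value):
    """Serialize a Python string as an N-Triples string literal."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def record_to_quads(record_id, record, graph):
    """Stage 1: one quad per source column, values copied verbatim."""
    subject = f"<{BASE}record/{record_id}>"
    return [
        f"{subject} <{BASE}vocab/{column}> {literal(value)} <{BASE}graph/{graph}> ."
        for column, value in record.items()
    ]


def normalize_count(record_id, record, column, graph):
    """Stage 2: add a typed integer next to the original string, in its own graph."""
    try:
        number = int(record[column].strip())
    except ValueError:
        return []  # leave unparseable values untouched
    subject = f"<{BASE}record/{record_id}>"
    typed = f'"{number}"^^<{XSD_INTEGER}>'
    return [f"{subject} <{BASE}vocab/{column}_int> {typed} <{BASE}graph/{graph}> ."]


row = {"scheepsnaam": "De Berkel", "azAantalMatrozen": "21"}
original = record_to_quads("1046", row, "original")
normalized = normalize_count("1046", row, "azAantalMatrozen", "normalized")
print("\n".join(original + normalized))
```

Keeping stage 1 purely syntactic means anyone can check the RDF against the source file line by line, which is where the transparency and trust come from.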
20. Links to Historical Newspapers
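[HARLINGEN, 24 October.] …gestrand.
Tevens is het berigt ontvangen, dat het hier
behoorende schoonerschip Transit,
kapitein Schaap, in de Noordzee is
gezonken, nadat het achterschip was
weggeslagen; een ligtmatroos verloor
daarbij het leven. Mede zijn hier drie
vreemde schepen met meer en minder
zware averij binnengeloopen.
(English: "…stranded. Word has also been received that the schooner Transit
of this port, Captain Schaap, sank in the North Sea after its stern was
knocked away; an ordinary seaman lost his life. Three foreign ships have
also put in here with more or less heavy damage.")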
- Andrea Bravo Balado
22. ClioPatria Triplestore
Data live at Huygens Institute for Dutch
History
http://dutchshipsandsailors.nl/data
~30 Million triples
Dev. Server
http://semanticweb.cs.vu.nl/dss
Purl.org URIs redirect to live server w/
content negotiation
SPARQL endpoint
Web interface
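A hedged sketch of how a client might talk to such a server, using only the Python standard library. The endpoint path `/sparql/` appended to the development server URL is an assumption, as is the example resource URI; the query itself is a generic placeholder. The requests are built but not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed endpoint path; check the server documentation for the real one.
ENDPOINT = "http://semanticweb.cs.vu.nl/dss/sparql/"

QUERY = """
SELECT ?s ?p ?o
WHERE { ?s ?p ?o }
LIMIT 10
"""


def sparql_request(endpoint, query):
    """Build (but do not send) a SPARQL Protocol GET request for JSON results."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


def rdf_request(uri, media_type="text/turtle"):
    """Build a content-negotiated request: same URI, RDF instead of HTML."""
    return Request(uri, headers={"Accept": media_type})


req = sparql_request(ENDPOINT, QUERY)
print(req.full_url)
print(req.get_header("Accept"))

# Hypothetical resource URI, for illustration only.
rdf = rdf_request("http://example.org/resource/ship-1")
print(rdf.get_header("Accept"))
```

Passing either request object to `urllib.request.urlopen` would perform the actual HTTP call; with content negotiation, the `Accept` header decides whether a browser-style HTML page or machine-readable RDF comes back for the same URI.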
24. • SPARQL for R package
Data analysis and visualisation
25. Dutch Ships and Sailors
• Linked Data principles are a great fit to digital
history requirements
– Heterogeneous models/datasets, light-weight
reusable integration
– Multiple levels of normalisation, through separate
named graphs
– SW Provenance matches Historical Provenance
• Watch out when you sail your Schooner into the
North Sea
38. Starting point
Starting Point: Biography Portal of the
Netherlands; www.biografischportaal.nl
125,000 short biographical descriptions with limited
metadata from 23 Dutch biographical dictionaries
(~76,000 individuals)
What kind of historical questions can be
answered with these data with the help of
computational methods?
Biographynet.nl
42. Linked Data for BiographyNet
[Figure: layered model of the NNBW biography of Thorbecke.
Source layer labels: Person, Meta Data ("Thorbecke"), Biographical Description, Provenance Meta Data, Biography Parts, Event: Birth (1798).
Enrichment layer labels (NLP Tool): Biographical Description, Person, Meta Data, Event: Birth (Zwolle, 1798-01-14).
Source text shown on the slide, truncated: "Johan Rudolph Thorbecke werd in 1798 geboren op 14 januari in Zwolle en komt uit een half-Duit" (English: "Johan Rudolph Thorbecke was born in Zwolle on 14 January 1798 and comes from a ...").]
43. Provenance in Biographynet
Goals: ensure the credibility of the demonstrator, evaluate its
performance, and improve the academic status of the tool
Information involved: sources, but also NER input data, etc.
Processes involved: all steps in enrichment, aggregation…
People involved: who was responsible for pipeline, tool,
Includes P-PLAN*: allows for comparing the actual activity
and its input/output with the original plan and its variables
* Daniel Garijo, Yolanda Gil; http://www.opmw.org/model/p-plan
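The plan-versus-execution comparison can be illustrated with a small sketch in plain Python. It does not use the P-PLAN vocabulary itself; the `Activity` class, the step names, and the identifiers are hypothetical stand-ins for the information, processes, and people listed on this slide.

```python
from dataclasses import dataclass, field


@dataclass
class Activity:
    """One executed step of an enrichment pipeline."""
    name: str
    planned_step: str  # the plan step this run claims to correspond to
    agent: str         # person or tool responsible
    used: list = field(default_factory=list)
    generated: list = field(default_factory=list)


def trace_sources(entity, activities):
    """Walk back from an output to every input it was derived from."""
    sources = set()
    for act in activities:
        if entity in act.generated:
            for inp in act.used:
                sources.add(inp)
                sources |= trace_sources(inp, activities)
    return sources


def deviations(activities, plan):
    """List executed activities whose step is missing from the original plan."""
    return [a.name for a in activities if a.planned_step not in plan]


# Hypothetical pipeline: all names are illustrative only.
plan = {"tokenize", "ner"}
runs = [
    Activity("run-1", "tokenize", "tool:tokenizer", ["bio:text"], ["bio:tokens"]),
    Activity("run-2", "ner", "tool:ner", ["bio:tokens"], ["bio:birth-event"]),
    Activity("run-3", "manual-fix", "person:editor", ["bio:birth-event"], ["bio:birth-event-v2"]),
]

print(sorted(trace_sources("bio:birth-event-v2", runs)))
print(deviations(runs, plan))
```

Tracing back from an enriched statement to the original text is what lets a historian verify a result, and flagging unplanned steps shows where the executed pipeline departed from its design.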
46. Biographynet
Data sustainably stored at historical research institute
Structured data accessible as LOD
Provenance through deep linking into enrichment
Generic browse / compare functionality for broad
historic research (prosopography)
Re-usability
48. Het Koninkrijk der Nederlanden in de Tweede Wereldoorlog
History of German occupied Dutch society (1940-1945)
Published 1969-1991 in 14 volumes, 30 parts, 18,000 pages
1. Digitization
2. Open Data
3. Enriched access with Linked Data
Verrijkt Koninkrijk
49. Step 1: Lou de Jong’s “Het Koninkrijk” was digitized and made
available in a reusable format
Step 2: Named Entity Recognition and consolidation of the back-of-
the-book index provide structured vocabularies with links into the text
country, collection, doc-type, volume, chapter, section, sub-section, paragraph
Back-of-the-Book index Named Entities
50. Verrijkt Koninkrijk
Step 3: Enrichment with Linked Data makes new ways of
interaction and analysis possible
Back-of-the-Book index Named Entities
55. Reuse in comparative interface
Quantifying Historical Perspectives on WWII
http://qhp.science.uva.nl/
56. Verrijkt Koninkrijk
Textual data in digital repositories
Structured data accessible as LOD
Provenance
Simple tooling for primary research project
Re-usability
58. Nanopublications in digital history
Cf. http://www.slideshare.net/schambers3/nanopublications-in-the-arts-and-humanities
59. Shebanq
Source material shared in sustainable repo
Queries and metadata shared among
researchers
Working on Linked Data model based on
OpenAnnotation
60. Historical tool / data criticism
What happens between question/query and answer?
What happened to the original data?
Can we make the complex computer processes understandable
for ‘lay’ people?
A detailed and understandable (these two match poorly)
description of the process of preprocessing and enrichments
A detailed description of the way the data is visualized and the
choices made in the design
61. Historical tool criticism
… willingness from historians to invest the time to
learn about computer processes (at least the basic
principles)
Possibilities for education at universities to bridge
the gap between computer science and humanities
studies and make tool criticism an integral part of
students’ curricula
“Why do we still teach history students to decipher
17th-century handwriting, but not SQL?”
65. MultimediaN E-Culture project (2006)
Museums have increasingly nice websites
But: most of them are driven by stand-alone collection
databases
Data is isolated, both syntactically and semantically
If users can do cross-collection search, the individual
collections become more valuable!
Semantic Search
Humanities research foc
As (digital) humanities researchers seek more (international and cross-domain) collaboration, integrating humanities datasets becomes more important to those researchers. One subdomain where this is very much prevalent is (social) historical research. Historical researchers often collect data from historical archives for their specific research questions. However, these datasets are often not presented to other researchers in sharable formats. If they are shared at all, the datasets are published in a multitude of formats. To further the digital history agenda, it has been recognized that representing and sharing data is key [4, 10].
“Published between 1969 and 1991, the 30 volumes still combine the qualities of an authoritative work for a general audience, and an inevitable point of reference for scholars”
Digitized version online in 2011, crashing the server
VIC: update namespaces
How can we facilitate building applications on top of Linked Cultural Data?