  1. 1. Linked Data for Digital Humanities SW4SH 2015 Victor de Boer With input from Christophe Guéret, Serge ter Braake, Niels Ockeloen, Antske Fokkens, Dirk Roorda, Lora Aroyo, Johan Oomen, Oana Inel, Jan Wielemaker
  2. 2. Victor de Boer Web & Media Group, CS, VU University Amsterdam Netherlands Institute for Sound and Vision Cultural Heritage Digital History Linked Data for Development
  3. 3. Digital History Sub-discipline of digital humanities Buzzword that people are embracing and at the same time already getting tired of. “ ..the use of digital media and computational analytics for furthering historical practice, presentation, analysis, or research” --Wikipedia
  4. 4. Digital History Part of the historian's effort moves from physical archives to digital ones. Cross-domain collaboration. Img: www.doaks.org, www.dkrz.de
  5. 5. Data-driven research Fig: Christophe Guéret
  6. 6. Tools and visualisations http://armstrongdigitalhistory.org/, http://www.vcdh.virginia.edu/courses/fall07/hius401-f/, http://digitalhistory.unl.edu/essays/thomasessay.php, http://www.philipvickersfithian.com/2013/05/gender-in-stacks-on-managing-small.html
  7. 7. “That is great. I would love that… …but my research questions are slightly different.” Img:Monty Python
  8. 8. Better Enter your (national) research infrastructure here Fig: C. Guéret
  9. 9. Aging Data Tool C. Guéret based on http://redmonk.com/jgovernor/2007/04/05/why-applciations-are-like-fish-and-data-is-like0wine/
  10. 10. Even better Do not bake the data into the tool and treat data as an end product. Build tools on top of the data. Make sure others can do so as well. Fig: C. Guéret
  11. 11. Framework: generic solutions with historians 1. Preprocess, clean, model, link and enrich data in collaboration with domain experts; 2. Access heterogeneous datasets in a convenient way to get an intuition of the character and anomalies of the (linked) data; 3. Perform arbitrary queries to retrieve results relevant to the research questions; 4. Verify the veracity of query results by following provenance links to the original material; 5. Retrieve and analyze the data with the tool of preference; 6. Republish and share results.
  12. 12. Linked Data for Digital History • Represent heterogeneous datasets with their own data models – In one data format (RDF) – Link what can be linked to integrate at project level (and beyond) – Keep specificity of original data • Links to other sources: re-use knowledge • Common-sense or very specific • Digital hermeneutics • Allow multiple levels of semantic enrichment – normalization – through Named Graphs – Provenance • Linked Data is the (technically) best way to publish and share your research data
  13. 13. Some examples
  14. 14. Dutch Ships and Sailors
  15. 15. The Problem: ((Maritime) historical) data is not integrated • Researchers’ data is “lost” – In different physical locations – In different file formats – In different semantic structures • In a workshop, we identified 25+ maritime historical datasets. – http://dutchshipsandsailors.nl • We do not want to force one monolithic data model for integration
  16. 16. The solution: Linked Open Data • Represent heterogeneous datasets with their own data models – In one data format (RDF) – Link what can be linked to integrate at project level (and beyond) – Keep specificity of original data • Links to other sources: re-use knowledge • Allow multiple levels of semantic enrichment/ normalization – through Named Graphs – Provenance
  17. 17. What we did 1. Model four maritime historical datasets as RDF – Noordelijke Monsterrollen Database [J. Leinenga] – Generale Zeemonsterrollen [M. van Rossum] – Dutch Asiatic Shipping – VOC Opvarenden 2. Link to each other (based on ships, ship types, ranks, geography,…) – Models and links evaluated by domain experts 3. Publish as Linked Open Data 4. Show how this data cloud can lead to new types of integrated research questions
  18. 18. Modeling in collaboration with historians dss:Record gzmvoc:Telling gzmvoc:telling-1046-De_Berkel __bnode_1 gzmvoc:aziatischeBemanning dss:Ship gzmvoc:Schip gzmvoc: schip-1046-De_Berkel dss:has_ship gzmvoc:schip "1046" “Schip” “De Berkel” rdfs:label dss:scheepsnaam gzmvoc:scheepsnaam dss:ShipType gzmvoc:Scheepstype gzmvoc: type-Ship dss:has_shiptype gzmvoc:has_shiptype gzmvoc:scheepstype “21” “Moorse mattroosen” dss:azRegistratieKop gzmvoc:azAantalMatrozen gzmvoc:telling gzmvoc:heeft DAS heenreis dss:Record das:Voyage das:voyage-1918_61 DIFFERENT but LINKED DATAMODELS
  19. 19. Modelling principles Model each dataset as directly as possible Only “syntactical” transformation to RDF No normalization Reusability Transparency, trust Normalize and link in second stage store in separate RDF Named Graphs
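The two-stage principle on this slide can be sketched in a few lines of Python. This is a minimal illustration and not the project's actual conversion code: the namespace and graph URIs, the `normalise_name` rule and the `_normalised` property suffix are assumptions made up for the example. Only the idea comes from the slide: convert the source directly into one named graph, then store normalised values in a separate one.

```python
# Stage 1 copies source fields into quads verbatim; stage 2 writes
# normalised values into a separate named graph.
# All URIs below are hypothetical placeholders, not the project's real ones.

GZMVOC = "http://example.org/gzmvoc/"
RAW_GRAPH = "http://example.org/graph/raw"
NORM_GRAPH = "http://example.org/graph/normalised"


def record_to_quads(record_id, record, graph=RAW_GRAPH):
    """Stage 1: one quad per source field, values copied unchanged."""
    subject = f"{GZMVOC}record-{record_id}"
    return [(subject, GZMVOC + field, value, graph)
            for field, value in record.items()]


def normalise_name(name):
    """Illustrative rule only: collapse whitespace and lower-case."""
    return " ".join(name.split()).lower()


def normalise(quads, predicate, graph=NORM_GRAPH):
    """Stage 2: derive normalised quads without touching the raw ones."""
    return [(s, p + "_normalised", normalise_name(o), graph)
            for (s, p, o, _g) in quads if p == predicate]


raw = record_to_quads("1046", {"scheepsnaam": "De  Berkel "})
norm = normalise(raw, GZMVOC + "scheepsnaam")
store = raw + norm  # both layers coexist; the raw quads are never overwritten
```

Because the normalised quads live in their own graph, a user who distrusts the normalisation can drop that graph and still query the data exactly as it appeared in the source.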
  20. 20. Links to Historical Newspapers [HARLINGEN, 24 October.] …gestrand. Tevens is het berigt ontvan°e > dat het hier behoorende schoonerschip Transit, kapitein Schaap, in de Noordzee is gezonken, nadat het achterschip was weggeslagen ; een ligtmatroos verloor daarbij het leven. Mede zijn hier drie vreemde schepen met meer en minder zware averij binnengeloopen. - Andrea Bravo Balado
  21. 21. DAS GZMVOC MDB VOCOPV Begunstig den VOCOPV Soldijboek en PROV AAT VOCOPV Opvaren den foaf owl:sameAs dss:hasKBLink rdfs:subClassOf, rdfs:subPropertyOf dss:DAS link skos :exactMatch
  22. 22. ClioPatria Triplestore Data live at Huygens Institute for Dutch History http://dutchshipsandsailors.nl/data ~30 Million triples Dev. Server http://semanticweb.cs.vu.nl/dss Purl.org URIs redirect to live server w/ content negotiation SPARQL endpoint Web interface
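As a hedged sketch of how a client might talk to such a server, the Python below builds a SPARQL protocol request and a content-negotiated request for a single URI. The endpoint path `/sparql/`, the query and the helper names are assumptions for illustration; the slide only gives the base data URL. The requests are built but not sent, so the sketch runs offline.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed endpoint location; only the base URL appears on the slide.
ENDPOINT = "http://dutchshipsandsailors.nl/data/sparql/"

# Deliberately generic query: fetch any ten triples.
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"


def sparql_request(endpoint, query):
    """Build a SPARQL-protocol GET request that asks for JSON results."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


def rdf_request(uri):
    """Build a request that asks a Linked Data URI for Turtle
    (content negotiation through the Accept header)."""
    return Request(uri, headers={"Accept": "text/turtle"})


req = sparql_request(ENDPOINT, QUERY)
```

Passing `req` to `urllib.request.urlopen` would perform the actual call; whether the live server accepts this exact path and these media types would need to be checked against its documentation.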
  23. 23. Search, browse and query
  24. 24. • SPARQL for R package Data analysis and visualisation
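The slide uses the SPARQL package for R; the same analysis step is sketched here in Python instead. It assumes the endpoint returns the W3C SPARQL JSON results layout, and the sample document, its ships and its ship types are invented placeholders rather than project data.

```python
import json
from collections import Counter

# Hand-written sample in the W3C "SPARQL Query Results JSON" layout.
# The ships and types are invented placeholders, not project data.
SAMPLE = """
{
  "head": {"vars": ["ship", "type"]},
  "results": {"bindings": [
    {"ship": {"type": "uri", "value": "http://example.org/ship/1"},
     "type": {"type": "literal", "value": "schooner"}},
    {"ship": {"type": "uri", "value": "http://example.org/ship/2"},
     "type": {"type": "literal", "value": "schooner"}},
    {"ship": {"type": "uri", "value": "http://example.org/ship/3"}}
  ]}
}
"""


def bindings_to_rows(doc):
    """Flatten result bindings into plain dicts; unbound variables become None."""
    variables = doc["head"]["vars"]
    return [{var: binding.get(var, {}).get("value") for var in variables}
            for binding in doc["results"]["bindings"]]


rows = bindings_to_rows(json.loads(SAMPLE))
# Count ships per type, skipping rows where the type is unbound.
ships_per_type = Counter(row["type"] for row in rows if row["type"] is not None)
```

Once the results are plain rows, they can be handed to whatever analysis or plotting tool the researcher prefers.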
  25. 25. Dutch Ships and Sailors • Linked Data principles are a great fit for digital history requirements – Heterogeneous models/datasets, light-weight reusable integration – Multiple levels of normalisation, through separate named graphs – SW Provenance matches Historical Provenance • Watch out when you sail your Schooner into the North Sea
  26. 26. DIVE
  27. 27. DIGITAL HUMANITIES RESEARCHERS Media researcher Lars Arve Røssland of the University of Bergen. (Photo: Andreas R. Graven) EXPLORATIVE SEARCH Digital Hermeneutics: The combination of digital (Web) technology and theory of interpretation
  28. 28. Builds on AGORA PROJECT Slide: Lora Aroyo
  29. 29. DATA: OPENIMAGES.EU Open videos Netherlands Institute for Sound and Vision ~3000, mostly news broadcasts Descriptions
  30. 30. DATA: DELPHER.NL Scans of radio bulletins (hand annotated) 1937 – 1984 1.5 million, OCR’ed and processed with named entity recognition (NER)
  33. 33. DIGITAL SUBMARINE UI https://www.flickr.com/photos/benjcarson/245171885 https://www.flickr.com/photos/mibuchat/2774251415 INFINITY OF EXPLORATION
  35. 35. DIVE Data conversion pipeline includes crowdsourcing, text analysis Again provenance Generic browsing doesn’t have to be boring
  36. 36. BiographyNet Biographynet.nl
  37. 37. Starting point Starting Point: Biography Portal of the Netherlands; www.biografischportaal.nl 125,000 short biographical descriptions with limited metadata from 23 Dutch biographical dictionaries (~76,000 individuals) What kind of historical questions can be answered with these data with the help of computational methods Biographynet.nl
  38. 38. Methods developed in collaboration
  39. 39. Biographynet.nl
  40. 40. Biographynet conversion Biographynet.nl
  41. 41. Linked Data for BiographyNet Example text: “Johan Rudolph Thorbecke werd in 1798 geboren op 14 januari in Zwolle en komt uit een half-Duit” [English: “Johan Rudolph Thorbecke was born on 14 January 1798 in Zwolle and comes from a half-Ger…”] Diagram labels: Thorbecke; Biographical Description; Provenance Meta Data; NNBW; Person Meta Data “Thorbecke”; Biography Parts; Event: Birth, 1798; Biographical Description Enrichment; NLP Tool; Person Meta Data; Event: Birth, Zwolle, 1798-01-14 Biographynet.nl
  42. 42. Provenance in Biographynet: ensure the credibility of the demonstrator, evaluate its performance and improve the academic status of the tool. Information involved: sources, but also NER input data, etc. Processes involved: all steps in enrichment, aggregation… People involved: who was responsible for pipeline, tool, … Includes P-PLAN*: allows for comparing the actual activity and its input/output with the original plan and its variables. Biographynet.nl *Daniel Garijo, Yolanda Gil; http://www.opmw.org/model/p-plan
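The information, processes and people listed on this slide map naturally onto provenance triples. The sketch below uses property names from the W3C PROV-O vocabulary (`prov:wasGeneratedBy`, `prov:used`, `prov:wasAssociatedWith`, `prov:wasDerivedFrom`) rather than P-PLAN itself, and every `http://example.org/bgn/` URI and helper name is a hypothetical placeholder, not the BiographyNet model.

```python
PROV = "http://www.w3.org/ns/prov#"
EX = "http://example.org/bgn/"  # hypothetical namespace for the example


def enrichment_provenance(output, source, activity, agent):
    """Describe one enrichment step: what it produced, used, and who ran it."""
    return [
        (output, PROV + "wasGeneratedBy", activity),
        (activity, PROV + "used", source),
        (activity, PROV + "wasAssociatedWith", agent),
        (output, PROV + "wasDerivedFrom", source),
    ]


def trace_sources(triples, entity):
    """Follow prov:wasDerivedFrom links back to the original material."""
    derived = {}
    for s, p, o in triples:
        if p == PROV + "wasDerivedFrom":
            derived.setdefault(s, []).append(o)
    found, stack, seen = [], [entity], {entity}
    while stack:
        current = stack.pop()
        parents = derived.get(current, [])
        if not parents and current != entity:
            found.append(current)  # nothing further upstream: an original source
        for parent in parents:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return found


triples = (
    enrichment_provenance(EX + "enriched-description", EX + "nnbw-text",
                          EX + "nlp-run", EX + "nlp-tool")
    + enrichment_provenance(EX + "birth-event", EX + "enriched-description",
                            EX + "event-extraction", EX + "pipeline-owner")
)
origin = trace_sources(triples, EX + "birth-event")
```

Here `origin` leads from the extracted birth event back through the enrichment to the source text, which is the kind of check a historian would use to verify a query result.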
  43. 43. Interface for historians (mockup) Biographynet.nl
  44. 44. Biographynet.nl
  45. 45. Biographynet Data sustainably stored at historical research institute Structured data accessible as LOD Provenance through deep linking into enrichment Generic browse / compare functionality for broad historic research (prosopography) Re-usability Biographynet.nl
  46. 46. Verrijkt Koninkrijk
  47. 47. Het Koninkrijk der Nederlanden in de Tweede Wereldoorlog History of German occupied Dutch society (1940-1945) Published 1969 - 1991 in 14 volumes, 30 parts, 18.000 pages 1. Digitization 2. Open Data 3. Enriched access with Linked Data Verrijkt Koninkrijk
  48. 48. Step 1: Lou de Jong’s “Het Koninkrijk” was digitized and made available in a reusable format Step 2: Named Entity Recognition and consolidation of the back-of- the-book index provide structured vocabularies with links into the text country, collection, doc-type, volume, chapter, section, sub-section, paragraph Back-of-the-Book index Named Entities
  49. 49. Verrijkt Koninkrijk Step 3: Enrichment with Linked Data makes new ways of interaction and analysis possible Back-of-the-Book index Named Entities
  50. 50. niod:Blitzkrieg niod:oai_wo2_niod_nl_rec_102045 dct:subject http://resolver.verrijktkoninkrijk.nl/nl.vk.d.reg.4.1386 botb:Blitzkrieg skos:exactMatch
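The link on this slide can be read as two triples, and following them shows how an index term leads to archive records. The sketch below is an assumption-laden illustration: the `niod:` and `botb:` prefixes are expanded to placeholder `example.org` namespaces, and the direction of the `skos:exactMatch` link is guessed from the slide layout, so the code treats it as symmetric and follows it for one hop only.

```python
DCT_SUBJECT = "http://purl.org/dc/terms/subject"
SKOS_EXACT = "http://www.w3.org/2004/02/skos/core#exactMatch"

# Placeholder expansions of the slide's prefixes (the real ones are not given).
NIOD = "http://example.org/niod/"
BOTB = "http://example.org/botb/"

TRIPLES = [
    (NIOD + "oai_wo2_niod_nl_rec_102045", DCT_SUBJECT, NIOD + "Blitzkrieg"),
    (BOTB + "Blitzkrieg", SKOS_EXACT, NIOD + "Blitzkrieg"),
]


def matching_concepts(triples, concept):
    """Return the concept plus everything linked to it by skos:exactMatch,
    in either direction (one hop only)."""
    concepts = {concept}
    for s, p, o in triples:
        if p == SKOS_EXACT:
            if s == concept:
                concepts.add(o)
            elif o == concept:
                concepts.add(s)
    return concepts


def records_about(triples, concept):
    """Find records whose dct:subject is the concept or an exact match of it."""
    concepts = matching_concepts(triples, concept)
    return sorted(s for s, p, o in triples
                  if p == DCT_SUBJECT and o in concepts)


hits = records_about(TRIPLES, BOTB + "Blitzkrieg")
```

Starting from the back-of-the-book index term, `hits` contains the archive record that was only ever tagged with the other vocabulary's concept.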
  51. 51. Chart: National-Socialist 29%, Social-Democrat 21%, Protestant 13%, Liberal 12%, R-Catholic 12%, Communist 8%, Jewish 5% http://semanticweb.cs.vu.nl/verrijktkoninkrijk/ http://search.loedejongdigitaal.nl/
  52. 52. Results are links to paragraphs
  53. 53. re-usability
  54. 54. Reuse in comparative interface Quantifying Historical Perspectives on WWII http://qhp.science.uva.nl/
  55. 55. Verrijkt Koninkrijk Textual data in digital repositories Structured data accessible as LOD Provenance Simple tooling for primary research project Re-usability
  56. 56. Shebanq http://shebanq.ancient-data.org/
  57. 57. Nanopublications in digital history Cf. http://www.slideshare.net/schambers3/nanopublications-in-the-arts-and-humanities
  58. 58. Shebanq Source material shared in sustainable repo Queries and metadata shared among researchers Working on Linked Data model based on OpenAnnotation
  59. 59. Historical tool / data criticism What happens between question/query and answer? What happened to the original data? Can we make the complex computer processes understandable for ‘lay’ people? A detailed and understandable (these two match poorly) description of the process of preprocessing and enrichments A detailed description of the way the data is visualized and the choices made in the design
  60. 60. Historical tool criticism … willingness from historians to invest the time to learn about computer processes (at least the basic principles) Possibilities for education at universities to bridge the gap between computer science and humanities studies and make tool criticism an integral part of students’ curricula “Why do we still teach history students to decipher 17th Century handwriting, but not SQL?”
  61. 61. Thank you! Victor de Boer http://victordeboer.com v.de.boer@vu.nl @victordeboer
  62. 62. Backup slides
  63. 63. MultimediaN E-Culture project (2006) Museums have increasingly nice websites But: most of them are driven by stand-alone collection databases Data is isolated, both syntactically and semantically If users can do cross-collection search, the individual collections become more valuable! Semantic Search
  64. 64. E-Culture data cloud
  65. 65. Vocabulary alignment “Easel-pieces” RMA concept “Schilderij” RMA is the thesaurus of Rijksmuseum AAT artefact type “Easel Piece” “Painting” AAT is Getty’s Art & Architecture Thesaurus
  66. 66. http://e-culture.multimedian.nl/

