Building an Operating System for Open Science: data integration challenges, Dataverse data repository and knowledge graphs. Lecture by Slava Tykhonov, DANS-KNAW, for the Journées Scientifiques de Rochebrune 2023 (JSR'23).
1. Journées Scientifiques de Rochebrune 2023 (JSR'23)
Slava Tykhonov, R&D
(DANS-KNAW, the Netherlands)
29 March 2023
Decentralized research data infrastructure
and knowledge graphs
3. Building an Operating System for Open Science
● A generic Common Research and Data Infrastructure should be distributed
and robust enough to be scaled up and reused for challenging tasks such as
cancer research
● Networked services are built from Open Source components
● Data is processed and published in a FAIR way; provenance information is
part of our Data Lake
● Data evaluation and credibility are the top priority: we provide tools for the
expert community to verify our datasets
● The transparency of data and services guarantees the reproducibility of all
experiments and brings new insights in multidisciplinary research
● Infrastructure should foster collaboration between people and bring together
the general public, researchers, citizen scientists, etc.
● Infrastructure is free of charge; (meta)data is protected and licensed.
5. Building a horizontal platform to serve vertical teams
Source: CoronaWhy infrastructure introduction
6. DANS Data Stations - Future Data Services
Dataverse is an API-based data platform and a key framework for Open Innovation!
7. What is Dataverse?
● Open source data repository developed by IQSS at Harvard University
● Mature product with a long history (since 2006), created by an experienced
Agile development team
● Clear vision and understanding of research communities’ requirements, public
roadmap
● Well-developed architecture with rich APIs that allow building application layers
around Dataverse
● The strong community behind Dataverse helps to improve the basic
functionality and develop it further.
● DANS-KNAW delivered a production-ready (Docker/k8s) Dataverse repository for
the European Open Science Cloud (EOSC) communities CESSDA, CLARIN and
DARIAH.
● Dataverse is a de facto standard for FAIR data repositories in Europe, with wide
adoption in the Netherlands, France, Norway, Portugal and other European countries
8. Data integration challenges
● datasets are very heterogeneous and multilingual
● data usually lacks sufficient quality control
● data providers use different modeling schemas and styles
● linked data cleansing and versioning are very difficult to track and maintain
properly; web resources aren’t persistent
● even modern data repositories provide only metadata records
describing data, without giving access to individual data items stored in
files
● it is difficult to assign entity relationships in knowledge graphs and to keep
them up to date manually
9. Benefits of the Common Data Infrastructure
● It’s distributed and sustainable, suitable for the future
● maintenance costs will drop massively: the more organizations join,
the less expensive it will be to support
● saved maintenance costs could be reallocated to training and further
development of new (common) features
● reuse of the same infrastructure components will improve the quality and
the speed of knowledge exchange
● building multidisciplinary teams that reuse the same infrastructure can bring
new insights and unexpected views
● Common Data Infrastructure plays the role of a “universal gravity” force
for Data Science projects
(and so on…)
10. Semantic interoperability on the infrastructure level
We envision a situation where thousands of Dataverse instances on the web (due to EOSC)
can be searched simultaneously for data and will form a shared Data Lake.
The old dream of federated search / a universal catalogue can only be realised if:
(1) Crosswalks (mappings across different metadata schemes) are implemented
(2) Metadata schemes offer ways to enrich indexes with values from controlled
vocabularies
Standard response (centralized) = standardisation and harmonisation = repository software,
certain metadata standards, or certain controlled vocabularies
New response (distributed) = explore agile solutions (Proofs of Concept) which can be
implemented by different communities (even smaller ones), so we keep variety and still enable
integration in the Distributed Data Network by applying Linked Data technologies.
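The two conditions above (crosswalks and enrichment from controlled vocabularies) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the field names in the crosswalk, the vocabulary entries and the example.org URIs are placeholders, not taken from any real Dataverse configuration.

```python
# Hypothetical crosswalk from one metadata scheme to another.
# Field names on both sides are illustrative placeholders.
CROSSWALK = {
    "titl": "title",
    "AuthEnty": "creator",
    "abstract": "description",
    "keyword": "subject",
}

# Hypothetical controlled vocabulary: label -> concept URI (placeholder URIs).
VOCABULARY = {
    "migration": "https://vocab.example.org/concept/migration",
    "public health": "https://vocab.example.org/concept/public-health",
}


def apply_crosswalk(record, crosswalk):
    """Rename fields according to the crosswalk; unmapped fields are dropped."""
    return {crosswalk[key]: value for key, value in record.items() if key in crosswalk}


def enrich_subjects(record, vocabulary):
    """Add concept URIs for every subject label known to the vocabulary."""
    enriched = dict(record)
    enriched["subject_uri"] = [
        vocabulary[label.lower()]
        for label in record.get("subject", [])
        if label.lower() in vocabulary
    ]
    return enriched


source = {
    "titl": "Survey on mobility",
    "AuthEnty": "Jane Doe",
    "keyword": ["Migration", "Housing"],
    "internalId": 42,  # has no mapping, so it does not survive the crosswalk
}
harmonised = enrich_subjects(apply_crosswalk(source, CROSSWALK), VOCABULARY)
print(harmonised)
```

Each community can maintain its own crosswalk table while the index still receives the same target fields and concept URIs, which is the point of the distributed response.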
11. “Archive in a box” features (SSHOC Dataverse)
● Dockerized version of the Dataverse application and shared networked services
● fully automatic Dataverse deployment with Traefik proxy
● Dataverse configuration managed through the environment file .env
● different Dataverse distributions with services of your choice, suitable for different
use cases and research communities
● support for external controlled vocabularies (demo of CESSDA CMM metadata fields
connected to the Skosmos framework)
● S3-compatible MinIO storage support for Cloud Storage
● data previewers integrated in the Dataverse distribution
● startup process managed through scripts located in the init.d folder
● automatic Solr reindexing
● external services integration with PostgreSQL triggers
● support for custom metadata schemes (CESSDA CMM, CLARIN CMDI, ...)
● built-in Web interface localization uses the Dataverse language pack to support multiple
languages out of the box
https://github.com/IQSS/dataverse-docker
12. “Archive in a box” infra suitable both for academics and industry
Source: Citizen Science and Open Science Core Concepts and Areas of Synergy (Vohland and Göbel, 2017)
Anyone can set up their own digital archive and share the content in a distributed infra
Decentralized FAIR Dataverse network with APIs to share (meta)data, search, storage and provenance
14. Open Data vs Restricted (Sensitive) Data
Credits: OECD
Can data still be sensitive and FAIR at the same time?
15. Building FAIR decentralized data network for any type of content
Source: Wikipedia
We’re considering an experimental implementation of decentralized identifiers (DIDs) for controlled
vocabularies, and a content types extension to archive various types of content.
DIDs can be assigned to any artefact, including images, audio and video, for example to store and link
metadata records and provenance information together with their digitized content.
A DID can be private (invisible and not resolvable for the public) but still accessible with a cryptographic key.
16. DOI costs for Open Data
The DataCite agency charges data providers a fee that depends on the number of identifiers,
and it can be a significant amount starting from 1 million DOIs. What about DIDs?
17. Typical problems of “centralized” identifiers
Disambiguation and authorship issues:
● two authors with the same name are mentioned in different papers: how do you know who is who?
● it’s very difficult to assign a paper to a specific person with an ORCID without knowing for a fact that this person is the original author
● some people can make false (fraudulent) authorship claims
A centralized entity can be considered a single point of failure.
Typical questions:
● can an email address be considered an identifier?
● what to do when an email address changes because the domain name changes, and the identifier disappears
or is no longer resolvable?
● how reliable is the ORCID database?
18. “Centralized” controlled vocabularies
The European Language Social Science Thesaurus (ELSST) is hosted by various data providers, such as CESSDA and ODISSEI, in Skosmos.
CESSDA has an updated version with more language properties.
How about versions of vocabularies, concept changes and drift?
19. Decentralized identifiers as possible solution
We envision a near future where it will be possible to create a decentralized system that does not depend on any specific
registry, single provider or single authority, so that all connections are established in a peer-to-peer network but remain persistent at
the same time.
The resolution of the global decentralized identifier (DID) should be cryptographically verifiable to prove the identity and the
ownership of that identifier.
Core DID features are listed below:
1. A permanent (persistent) identifier (never changes)
2. A resolvable identifier (you can look it up to discover metadata)
3. A cryptographically-verifiable identifier (with private and public keys)
4. A decentralized identifier (no centralized authority)
DIDs should bring control of all provenance and metadata back to their owners instead of giving it away. At the same time, the public part
need not be very different from other persistent identifiers like DOIs, and DIDs could even replace them for specific use cases like sharing sensitive
data.
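To make the shape of such an identifier concrete, here is a minimal sketch that splits a DID into its method and method-specific part. It assumes the generic `did:<method>:<method-specific-id>` syntax of the W3C DID Core specification; the function name `parse_did` is hypothetical, and the sketch only checks syntax, it does not resolve or verify anything.

```python
import re

# One character of the method-specific id: letters, digits, ".", "-", "_",
# or a percent-encoded byte (generic DID syntax, W3C DID Core).
IDCHAR = r"(?:[A-Za-z0-9._-]|%[0-9A-Fa-f]{2})"

# "did:" + method name (lowercase letters and digits) + ":" + method-specific id,
# where the id may itself contain colon-separated segments.
DID_PATTERN = re.compile(
    rf"did:(?P<method>[a-z0-9]+):(?P<msid>(?:{IDCHAR}*:)*{IDCHAR}+)"
)


def parse_did(did):
    """Return (method, method_specific_id) or raise ValueError for non-DIDs."""
    match = DID_PATTERN.fullmatch(did)
    if match is None:
        raise ValueError(f"not a syntactically valid DID: {did!r}")
    return match.group("method"), match.group("msid")


method, msid = parse_did("did:oyd:zQmdQvLdpogfEf5EHK7778EM9xoxFMVFdJgRD7SdYRcCHeL")
print(method)  # oyd
print(msid)
```

The method name (here `oyd`) tells a resolver which DID method, and therefore which network or storage, is responsible for the rest of the identifier.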
20. Major Concerns about DIDs
● The selection of PID technology, governance and business model depends heavily on a variety of
additional non-technical factors; depending on the use case, one needs a sensible
mechanism for identifying the best solution.
● Centralized solutions can work better for some use cases, depending on the requirements.
● The cost of DIDs can increase if you don’t have the resources to run the infrastructure; more expertise
is required.
● DIDs take power away from centralized authorities and give it back to individuals, who should
be prepared for the concept shift, for example how to use “digital wallets” to keep their
ownership.
● The automation of trust with DID technology means no “human in the loop”, which could
be risky in the long run.
21. The place of DID as unified resource
Source: “Self-Sovereign Identity” by Alex Preukschat and Drummond Reed
A DID can be considered a “replacement” for domain names and DNS from the “centralized” network
22. Example of DID with private and public key, and service endpoints
Service endpoints tell how exactly to interact with the subject: which protocols and which network endpoints
are available to connect to, for example, an agent that represents the data subject, so that you can then exchange
credentials or other messages.
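A minimal sketch of what such a DID document can look like, written as a Python dictionary. The property names (`id`, `verificationMethod`, `authentication`, `service`, `serviceEndpoint`) follow the W3C DID Core data model; the identifier, the key value, the service type and the endpoint URL are placeholders made up for illustration. Note that only the public key appears in the document; the private key stays with the controller.

```python
import json

DID = "did:example:123456789abcdefghi"  # placeholder identifier

did_document = {
    "@context": "https://www.w3.org/ns/did/v1",
    "id": DID,
    "verificationMethod": [
        {
            "id": f"{DID}#key-1",
            "type": "Ed25519VerificationKey2020",
            "controller": DID,
            "publicKeyMultibase": "zPLACEHOLDER",  # not a real key
        }
    ],
    "authentication": [f"{DID}#key-1"],
    "service": [
        {
            "id": f"{DID}#agent",
            "type": "ExampleAgentService",  # hypothetical service type
            "serviceEndpoint": "https://agent.example.org/inbox",
        }
    ],
}


def endpoints_of_type(document, service_type):
    """Return the endpoints of all services with the given type."""
    return [
        service["serviceEndpoint"]
        for service in document.get("service", [])
        if service["type"] == service_type
    ]


print(json.dumps(did_document, indent=2))
print(endpoints_of_type(did_document, "ExampleAgentService"))
```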
24. DID URLs with parameters
Source: Decentralized identifiers (DIDs) fundamentals and deep dive, SSIMeetup
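As a rough illustration of DID URLs with parameters, the sketch below splits a DID URL into the DID itself, an optional path, the query parameters and a fragment. The parameter names `service`, `relativeRef` and `versionTime` are defined in W3C DID Core; the function name `split_did_url` and the example identifier are assumptions, and no resolution is performed.

```python
from urllib.parse import parse_qs


def split_did_url(did_url):
    """Split a DID URL into DID, path, query parameters and fragment."""
    rest, _, fragment = did_url.partition("#")
    rest, _, query = rest.partition("?")
    did, slash, path = rest.partition("/")
    # Keep only the first value of each parameter for simplicity.
    params = {name: values[0] for name, values in parse_qs(query).items()}
    return {"did": did, "path": slash + path, "params": params, "fragment": fragment}


parts = split_did_url("did:example:123?service=agent&relativeRef=/credentials#degree")
print(parts["did"])       # did:example:123
print(parts["params"])    # {'service': 'agent', 'relativeRef': '/credentials'}
print(parts["fragment"])  # degree

versioned = split_did_url("did:example:123?versionTime=2021-05-10T17:00:00Z")
print(versioned["params"]["versionTime"])  # 2021-05-10T17:00:00Z
```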
25. “Decentralized” technology is not the same as “Blockchain” technology
“Blockchain is a digitally distributed database that is shared among nodes, which are computers in the blockchain network, that makes
it difficult or impossible to change, hack, or cheat the system”.
Blockchain parties:
- Holder (Owner of the Verifiable Credential)
- Issuer (provides a credential to a holder and signs the credential with their private key)
- Verifier (can check the blockchain to ensure that the issued credential belongs to the party it was issued to)
It’s not necessary to use a blockchain to issue decentralized identifiers, as there are about 100 methods to register DIDs being
developed by various companies and organizations around the world. They implement the same interface spec in different ways,
with standardized input and output.
The OYDID method was developed in Vienna and provides a self-sustained environment for managing digital identifiers
(DIDs). The did:oyd method links the identifier cryptographically to the DID Document and, through provenance information
in a public log that is also cryptographically linked, ensures resolving to the latest valid version of the DID Document.
26. Universal Resolver for DIDs
Try this! https://dev.uniresolver.io
curl https://dev.uniresolver.io/1.0/identifiers/did:oyd:zQmdQvLdpogfEf5EHK7778EM9xoxFMVFdJgRD7SdYRcCHeL
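The same request can be sketched in Python with the standard library only. The base URL is the one from the curl command above; the function names are hypothetical, and the assumption that the JSON response carries a `didDocument` key should be checked against the live service. `main()` is deliberately not called here because it needs network access.

```python
import json
import urllib.request

# Base URL taken from the curl example above.
RESOLVER_BASE = "https://dev.uniresolver.io/1.0/identifiers/"


def resolver_url(did, base=RESOLVER_BASE):
    """Build the resolver URL for a DID."""
    if not did.startswith("did:"):
        raise ValueError(f"expected a DID, got {did!r}")
    return base + did


def resolve(did, timeout=30):
    """Fetch the resolution result for a DID as parsed JSON (needs network)."""
    with urllib.request.urlopen(resolver_url(did), timeout=timeout) as response:
        return json.load(response)


def main():
    result = resolve("did:oyd:zQmdQvLdpogfEf5EHK7778EM9xoxFMVFdJgRD7SdYRcCHeL")
    # Assumption: the DID document sits under "didDocument"; fall back to the
    # whole response if it does not.
    print(json.dumps(result.get("didDocument", result), indent=2))


print(resolver_url("did:oyd:zQmdQvLdpogfEf5EHK7778EM9xoxFMVFdJgRD7SdYRcCHeL"))
```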
27. OYDID methods explained
“OYDID (Own Your Decentralized IDentifier) takes the approach to not maintain DID and DID Document on a public ledger
but on one or more local storages (that usually are publicly available). Through cryptographically linking the DID identifier
to the DID Document, and furthermore linking the DID Document to a chained provenance trail, the same security and
validation properties as a traditional DID are maintained while avoiding highly redundant storage and general public access.”
(from OYDID docs)
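The idea of cryptographically linking an identifier to its document can be sketched as follows. This is not the OYDID algorithm: the SHA-256 hex digest, the canonical JSON form and the `did:example:` prefix are simplifying assumptions chosen only to show why a changed document no longer matches its identifier.

```python
import hashlib
import json


def canonical(document):
    """Serialise a document deterministically so equal content hashes equally."""
    return json.dumps(document, sort_keys=True, separators=(",", ":")).encode("utf-8")


def derive_identifier(document):
    """Derive an identifier from the document content (illustrative scheme)."""
    return "did:example:" + hashlib.sha256(canonical(document)).hexdigest()


def verify(identifier, document):
    """Check that the document is the one the identifier was derived from."""
    return identifier == derive_identifier(document)


doc = {"label": "monkeypox", "created": "2022-06-01"}
did = derive_identifier(doc)
print(verify(did, doc))                           # True
print(verify(did, dict(doc, label="smallpox")))   # False
```

Because the check needs only the document and the identifier, it works wherever the document is stored, which is what makes a public ledger optional in this approach.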
28. DIDs for controlled vocabularies
Generic problem of CVs: most controlled vocabularies are published and distributed in an unsustainable way and often
don’t even have persistent identifiers resolving to their concepts.
Possible solution for CLARIAH FAIR vocabularies:
● assign a DID to every vocabulary concept and use the built-in “update” mechanism to keep all revisions in a chain of
linked DIDs resolving to the archived version of every change
● metadata records can be linked in a distributed way to the DIDs corresponding to a specific version of a concept
preserved in the data ledger
● this approach is more sustainable by design and can be considered a step towards FAIR vocabularies, with high scores in
FAIR assessment as well
● vocabulary management/updates stay in the hands of the vocabulary owner/creator; a separate private key will be generated for every
concept and should be stored in a secure place
● extra properties and attributes could be added to the DID documents representing a specific vocabulary concept, such as
provenance information containing the date of creation or modification, authors, the name of the ontology, and relations to other
ontologies. They can even have their own labels.
● statistics on concept usage, linkages, relations and other metrics will be available directly from the DID chains
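The chain of linked revisions described in the list above can be sketched like this. It is an in-memory illustration only: the class names, the truncated SHA-256 identifiers and the `did:example:` prefix are assumptions, and key management (the private key per concept) is left out entirely.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Revision:
    concept: dict
    previous: Optional[str]  # identifier of the preceding revision, None for the first

    @property
    def identifier(self):
        # The identifier covers the content and the link to the previous revision.
        payload = json.dumps(
            {"concept": self.concept, "previous": self.previous}, sort_keys=True
        )
        return "did:example:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()[:32]


class ConceptHistory:
    """Keeps every revision of one vocabulary concept, linked back to front."""

    def __init__(self):
        self._revisions = {}
        self._latest = None

    def update(self, concept):
        revision = Revision(concept=concept, previous=self._latest)
        self._revisions[revision.identifier] = revision
        self._latest = revision.identifier
        return revision.identifier

    def resolve(self, identifier):
        """Return the archived concept for a specific revision identifier."""
        return self._revisions[identifier].concept

    def chain(self):
        """Return all revision identifiers, oldest first."""
        identifiers, current = [], self._latest
        while current is not None:
            identifiers.append(current)
            current = self._revisions[current].previous
        return list(reversed(identifiers))


history = ConceptHistory()
v1 = history.update({"prefLabel": "unemployment"})
v2 = history.update({"prefLabel": "unemployment", "altLabel": "joblessness"})
print(history.chain() == [v1, v2])  # True
print(history.resolve(v1))          # {'prefLabel': 'unemployment'}
```

A metadata record that stores `v1` keeps pointing at the concept as it was when the record was created, even after later updates.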
29. CoronaWhy Proof of Concept on DIDs
A Dataverse with information on the 2022 Monkeypox outbreak uses DIDs as persistent identifiers
https://datasets.coronawhy.org
31. Vocabulary recommender
The Vocabulary Recommender command-line interface (CLI) was developed by Triply and provides a recommendation interface which returns relevant
Internationalized Resource Identifiers (IRIs) based on the search input. It works with SPARQL or Elasticsearch endpoints which contain relevant
vocabulary datasets.
DANS has created a service out of it.
32. Decentralized archiving with DIDs
Cache and storage
All concepts are cached in RAM using Redis and preserved in a MongoDB database. After every restart the key:value
pairs for URI:DID are reindexed and available for lookup in the cache. It should be possible to move all DID data from one network to
another without too much effort.
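The restart behaviour described above can be sketched with two small stand-in classes. Plain Python containers replace Redis and MongoDB here so the example runs without any services; class names, method names and the example URIs are hypothetical.

```python
class ConceptStore:
    """Stand-in for the persistent database (MongoDB in the setup above)."""

    def __init__(self):
        self._records = []

    def save(self, uri, did):
        self._records.append({"uri": uri, "did": did})

    def all(self):
        return list(self._records)


class LookupCache:
    """Stand-in for the in-memory cache (Redis in the setup above)."""

    def __init__(self):
        self._pairs = {}

    def reindex(self, store):
        """Rebuild all URI:DID pairs from the persistent store."""
        self._pairs = {record["uri"]: record["did"] for record in store.all()}
        return len(self._pairs)

    def lookup(self, uri):
        return self._pairs.get(uri)


store = ConceptStore()
store.save("https://vocab.example.org/concept/1", "did:example:aaa")
store.save("https://vocab.example.org/concept/2", "did:example:bbb")

cache = LookupCache()  # a fresh cache, as after a restart
print(cache.lookup("https://vocab.example.org/concept/1"))  # None
print(cache.reindex(store))                                  # 2
print(cache.lookup("https://vocab.example.org/concept/1"))  # did:example:aaa
```

Since the cache is always rebuilt from the store, moving the data to another network only requires moving the persistent records.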
Archiving layer
Content archiving functionality is optional and implemented using the S3 protocol, compatible with cloud storage services such as AWS,
Azure Blob Storage and Google Cloud Platform (GCP). By default, the contents of every object or web page with a global DID can be
stored in MinIO High Performance Object Storage.
33. Use case: COVID-19 Museum (C19M) with Yves Rozenholc
“Archive in a box” infrastructure based on Dataverse
34. Archive in a box: increasing Dataverse metadata interoperability
Support for external controlled vocabularies was contributed by the SSHOC project (data infrastructure for the EOSC)
37. C19M components: Cloud Storage - MinIO
MinIO is an open source distributed object storage server written in Go, designed for Private Cloud
infrastructure and providing S3 storage functionality.
MinIO is suited for storing unstructured data such as photos, videos, log files, backups, and container images.
Some features:
● supports multiple, sophisticated server-side encryption schemes to protect data, wherever it may be
● supports the most advanced standards in identity management, integrating with
OpenID Connect compatible providers
● continuous replication is designed for large-scale, cross-data-center deployments
● a MinIO Federation Server supports an unlimited number of Distributed Mode sets
38. Human-in-the-Loop for Machine Learning
“Computers are incredibly fast, accurate and stupid; humans are incredibly slow, inaccurate and brilliant; together they
are powerful beyond imagination.”
Albert Einstein
“A combination of AI and Human Intelligence gives rise to an extremely high level of accuracy and intelligence
(Super Intelligence)”
Source: Hackernoon.com
40. C19M components: Hypothes.is as a peer review service
1. The AI pipeline performs domain-specific entity extraction and ranking of relevant CORD-19 papers.
2. Automatic entities and statements will be added; important fragments should be highlighted.
3. Human annotators should verify the results and validate all statements.
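The three steps above can be sketched as a small review workflow. This is an assumption-laden illustration: it does not call the Hypothes.is API, and the class names, status values, scores and paper identifiers are all made up.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    ACCEPTED = "accepted"
    REJECTED = "rejected"


@dataclass
class Annotation:
    paper_id: str
    entity: str
    score: float                      # relevance score from the AI pipeline
    status: Status = Status.PENDING   # machine output starts unverified
    reviewer: str = ""


def rank(annotations):
    """Step 1: order machine-extracted annotations by relevance."""
    return sorted(annotations, key=lambda a: a.score, reverse=True)


def review(annotation, reviewer, accept):
    """Step 3: a human annotator accepts or rejects one statement."""
    annotation.status = Status.ACCEPTED if accept else Status.REJECTED
    annotation.reviewer = reviewer
    return annotation


def validated(annotations):
    """Only human-accepted annotations count as validated."""
    return [a for a in annotations if a.status is Status.ACCEPTED]


queue = rank([
    Annotation("paper-1", "ACE2 receptor", 0.91),
    Annotation("paper-2", "vitamin C", 0.35),
])
review(queue[0], "annotator-a", accept=True)
review(queue[1], "annotator-a", accept=False)
print([a.entity for a in validated(queue)])  # ['ACE2 receptor']
```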
41. SEMAF service - semantic transformations
Proposal: SEMAF: A Proposal for a Flexible Semantic Mapping Framework