This document discusses the need for a harmonized dataset model for open data portals. It describes existing dataset models like DCAT, VoID, CKAN, and others. It proposes classifying metadata into information groups (resource, tag, group, organization) and types (general, ownership, provenance, etc.). The document outlines a process for harmonizing existing models which includes mapping these information groups and types and examining how extras fields are used across different models and portals. The goal is to define a minimum set of metadata needed to build dataset profiles and enable interoperability.
HDL Model for Harmonizing Open Data Metadata
1. HDL
Towards a Harmonized Dataset
Model for Open Data Portals
Ahmad Assaf, Raphaël Troncy and Aline Senart
@ahmadaassaf
PROFILES 15 – 2nd International Workshop on Dataset PROFIling & fEderated Search for Linked Data 1st June 2015
2. Open Data/Linked Open Data
Open Data (OD) is data that can be easily discovered, accessed, reused and redistributed by anyone [Davies et al. 2014]
Open Data should be placed in public domain under liberal terms of use and available
in electronic formats that are non-proprietary and machine readable.
Linked Open Data (LOD) refers to the semantically rich, linked and machine readable
open data.
Open Data has major benefits for citizens, businesses, societies and governments.
3. Metadata
Metadata is structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use, or manage information resources
Data discovery, exploration and reuse
Organization & identification
Archiving & preservation
4. Data Portals/Data Management Systems
Data Portals (Catalogs) are the entry points to discover published datasets
Data Portals are curated collections of dataset metadata providing a set of discovery and integration services
Data Portals can be public, like datahub.io or publicdata.eu, or private, like enigma.io or quandl.com
Portals are built on top of Data Management Systems (DMS) like CKAN, DKAN and Socrata
5. Why a Harmonized Model?
Exploring/discovering datasets for (re)use
Defining a “minimal” set of information needed to build a “profile”
Building tools that will automatically generate/validate metadata models
6. The Data Catalog Vocabulary (DCAT)✝ is a W3C recommendation to facilitate interoperability
between data catalogs on the web
DCAT is an RDF vocabulary with three main classes: dcat:Catalog, dcat:Dataset and dcat:Distribution
DCAT Profiles [extensions built upon DCAT]
DCAT-AP✝✝ defines a minimal set of properties that should be included in a dataset's profile by specifying mandatory and optional properties
The Asset Description Metadata Schema (ADMS)✝✝✝ is used to semantically describe
assets (code lists, taxonomies, vocabularies)
Dataset Models - DCAT
✝ http://w3.org/TR/vocab-dcat/
✝✝ https://joinup.ec.europa.eu/asset/dcat_application_profile/description
✝✝✝ http://www.w3.org/TR/vocab-adms/
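To make the three DCAT classes concrete, here is a minimal JSON-LD sketch of a catalog holding one dataset with one distribution; all identifiers and URLs are illustrative, not taken from the paper:

```python
import json

# A minimal, hypothetical JSON-LD sketch of DCAT's three main classes:
# a dcat:Catalog listing one dcat:Dataset with one dcat:Distribution.
catalog = {
    "@context": {
        "dcat": "http://www.w3.org/ns/dcat#",
        "dct": "http://purl.org/dc/terms/",
    },
    "@id": "http://example.org/catalog",
    "@type": "dcat:Catalog",
    "dcat:dataset": [{
        "@id": "http://example.org/dataset/1",
        "@type": "dcat:Dataset",
        "dct:title": "Example dataset",
        "dcat:distribution": [{
            "@type": "dcat:Distribution",
            "dcat:downloadURL": "http://example.org/files/1.csv",
            "dct:format": "text/csv",
        }],
    }],
}

print(json.dumps(catalog, indent=2))
```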
7. Dataset Models - VoID✝
RDF vocabulary for interlinked datasets
In addition to describing datasets, VoID
describes the links between datasets
VoID defines two main classes, void:Dataset and void:Linkset, together with the void:subset property
A linkset in VoID is a subclass of a dataset, used for storing triples that express the interlinking relationship between datasets
✝ http://www.w3.org/TR/void/
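The interlinking relationship can be sketched as a plain dict using VoID's property names for linksets (the dataset names and counts are illustrative):

```python
# A hypothetical void:Linkset description: it names the two datasets it
# connects, the link predicate used, and how many interlinking triples
# it contains. All values are illustrative.
linkset = {
    "@type": "void:Linkset",
    "void:subjectsTarget": "ex:datasetA",  # dataset hosting the subjects
    "void:objectsTarget": "ex:datasetB",   # dataset hosting the objects
    "void:linkPredicate": "owl:sameAs",    # the kind of link asserted
    "void:triples": 1500,                  # number of interlinking triples
}

# Because void:Linkset is a subclass of void:Dataset, a linkset can also
# carry dataset-level metadata (e.g. dct:creator) alongside these fields.
print(linkset["void:linkPredicate"])
```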
8. Dataset Models – CKAN✝/DKAN✝✝
The data model describes a set of entities (dataset, resource, group, tag)
Additional information can be added via arbitrary “extra” key/value fields
The core metadata is represented as a JSON file
Supports Linked Data and RDF by providing a complete and functional mapping of its model to Linked Data formats
CKAN supports descriptions of vocabularies
DKAN is a Drupal-based DMS
✝ http://ckan.org/
✝✝ http://demo.getdkan.com/
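The extras mechanism can be illustrated with a trimmed, hypothetical CKAN record in the shape returned by CKAN's package_show API (all values invented):

```python
# A trimmed, hypothetical CKAN dataset record: core fields plus the
# free-form "extras" key/value pairs the model allows.
dataset = {
    "name": "example-dataset",
    "title": "Example Dataset",
    "resources": [{"url": "http://example.org/data.csv", "format": "CSV"}],
    "tags": [{"name": "open-data"}],
    "extras": [
        {"key": "spatial-reference-system", "value": "EPSG:4326"},
        {"key": "frequency-of-update", "value": "monthly"},
    ],
}

# Extras are a list of {key, value} dicts; flatten them for easy lookup.
extras = {e["key"]: e["value"] for e in dataset["extras"]}
print(extras["frequency-of-update"])  # -> monthly
```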
9. Dataset Models - Continued

Project Open Data (POD)✝✝✝
Online collection of best practices and case studies to help data publishers
The POD data model is based on DCAT
Similarly to DCAT-AP, POD defines three types of metadata elements: Required, Required-If and Expanded (optional)
Metadata extensions use elements from the “Expanded” fields

Socrata✝
Commercial platform to streamline data publishing, management, analysis and reuse
The model is designed specifically to represent tabular data
The model covers a basic set of metadata properties and has good support for geospatial data

Schema.org✝✝
A collection of schemas used to mark up HTML pages with structured data
Covers many domains. We are interested in the Dataset schema, although we also use various properties from schemas like organizations, authors, etc.

✝ http://socrata.com/
✝✝ http://schema.org/
✝✝✝ https://project-open-data.cio.gov/
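A minimal sketch of how POD-style metadata levels could be checked; the field names and the Required-If condition below are illustrative, not the exact POD schema:

```python
# Hypothetical POD-style validation: "required" fields must always be
# present; "required-if" fields apply only when their trigger condition
# holds on the record. Field names are illustrative.
REQUIRED = ["title", "description", "identifier"]
REQUIRED_IF = {"accessURL": lambda r: r.get("accessLevel") == "public"}

def missing_fields(record):
    """Return the list of fields a record is missing at its metadata level."""
    missing = [f for f in REQUIRED if f not in record]
    missing += [f for f, cond in REQUIRED_IF.items()
                if cond(record) and f not in record]
    return missing

print(missing_fields({"title": "t", "accessLevel": "public"}))
```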
11. Metadata Classification – Information Groups

Resource: Actual raw data that can be downloaded or accessed directly, e.g. JSON, CSV, SPARQL endpoint
Tag: Descriptive knowledge about the dataset contents and structure. This can range from simple textual tags to semantically rich controlled terms
Group: Organizational units that share common semantics. They can be seen as a cluster or curation based on shared themes/categories
Organization: Clustering or curation based solely on associations with specific administration parties
12. Metadata Classification – Information Types

General Information: title, description, id
Ownership Information: author, maintainer_email
Provenance Information: version, creation_date, update_date
Access Information: URL, license_title, license_id
Geospatial Information: bbox, layers
Temporal Information: coverage_from, coverage_to
Statistical Information: max_value, uniques, average
Quality Information: rating, availability, freshness
13. Harmonization Process
Examine the model or vocabulary specification and documentation
Examine existing datasets using these models
Examine the source code for DMS
1. Map the information groups [resource, tag, group, organization]
2. Map the information types [general, ownership, provenance, etc.]
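Step 2 can be sketched as a lookup table that assigns each source field name to one of the information types; the assignments below are an illustrative sample, not the paper's full mapping:

```python
# A hypothetical field-to-information-type mapping: each source model's
# field name is assigned to one information type (general, ownership,
# provenance, access, ...). Unknown fields fall into "unclassified".
INFO_TYPE = {
    "title": "general", "notes": "general", "id": "general",
    "author": "ownership", "maintainer_email": "ownership",
    "version": "provenance", "metadata_created": "provenance",
    "url": "access", "license_id": "access",
}

def classify(fields):
    """Group a flat list of metadata field names by information type."""
    grouped = {}
    for f in fields:
        grouped.setdefault(INFO_TYPE.get(f, "unclassified"), []).append(f)
    return grouped

print(classify(["title", "license_id", "maintainer_email", "bbox"]))
```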
14. Mapping Information Types
CKAN:       maintainer_email
DKAN:       maintainer_email
POD:        contactPoint -> hasEmail
Schema.org: CreativeWork:producer -> Person:email
VoID:       void:Dataset -> dct:creator -> foaf:Person:givenName
DCAT:       dcat:Dataset -> dct:creator -> foaf:Person:givenName
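The mapping above can be read as a set of per-model property paths; the helper below is a hypothetical sketch (not from the paper) that walks a nested record along the path for its model:

```python
# Hypothetical per-model property paths for the maintainer/creator
# contact field, mirroring the mapping table above.
MAPPING = {
    "CKAN":       ["maintainer_email"],
    "DKAN":       ["maintainer_email"],
    "POD":        ["contactPoint", "hasEmail"],
    "Schema.org": ["producer", "email"],
    "VoID":       ["dct:creator", "foaf:givenName"],
    "DCAT":       ["dct:creator", "foaf:givenName"],
}

def resolve(record, model):
    """Walk a nested dict record along the model's property path."""
    value = record
    for key in MAPPING[model]:
        value = value.get(key)
        if value is None:
            return None
    return value

pod_record = {"contactPoint": {"hasEmail": "mailto:jane@example.org"}}
print(resolve(pod_record, "POD"))  # -> mailto:jane@example.org
```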
15. Extra Information
Examining the models, we noticed an abundance of information filled in “extras” fields
Using Roomba we generated aggregation reports to inspect those extras on LOD Cloud✝ and
OpenAfrica✝✝
extras>value : extras>name (extra field names and values)
resources>resource_type : resources>name (types describing resources)
53% of the datasets in OpenAfrica have additional geospatial metadata attached (spatial-reference-system, spatial-harvester, bbox-east-long, bbox-north-long, bbox-south-long, bbox-west-long)
16% of the datasets have additional provenance and ownership information (frequency-of-update, dataset-reference-date)
✝ http://datahub.io/group/lodcloud
✝✝ http://africaopendata.org/
Roomba: https://github.com/ahmadassaf/opendata-checker/tree/master/model
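A minimal sketch (not Roomba itself) of the aggregation step behind these reports: count how often each extras key appears across a portal's dataset records (the sample records are invented):

```python
from collections import Counter

# Hypothetical sample of CKAN-style dataset records with "extras" fields.
datasets = [
    {"extras": [{"key": "bbox-east-long", "value": "41.9"},
                {"key": "frequency-of-update", "value": "annual"}]},
    {"extras": [{"key": "bbox-east-long", "value": "3.2"}]},
]

# Tally extras keys across all records; frequent keys are candidates
# for promotion into the harmonized model.
counts = Counter(e["key"] for d in datasets for e in d.get("extras", []))
print(counts.most_common())
```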
16. https://xkcd.com/927/
17. Questions?
Ahmad Assaf
http://ahmadassaf.com/
@ahmadaassaf
http://github.com/ahmadassaf
Editor's notes
An asset is something that can be opened and read using familiar desktop software, as opposed to raw data that needs to be processed.
The interlinking is modelled by a linkset (void:Linkset). A linkset in VoID is a subclass of a dataset, used for storing triples that express the interlinking relationship between datasets. In each interlinking triple, the subject is a resource hosted in one dataset and the object is a resource hosted in another dataset. This modelling enables a flexible and powerful way to describe the interlinking between two datasets in great detail, such as how many links exist, which kinds of links (e.g. owl:sameAs or foaf:knows) are present, or who claims these statements.