1. Metadata Quality Assurance
Péter Király
peter.kiraly@gwdg.de
Heyne Haus, Göttingen, 18/12/2015
Oberseminar Datenmanagement, Cloud und e-Infrastructure
Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen
2. Metadata Quality Assurance Framework
What is metadata?
Data about data
Specifically: descriptive data about ...
digitized (or physical) objects
such as paintings, books, photos
larger datasets
such as research data
Provides access points to the underlying data
Why is data quality important?
„Fitness for purpose”
no metadata → no access to data → no data usage
more explanation:
Data on the Web Best Practices
W3C Working Draft 17 December 2015
http://www.w3.org/TR/2015/WD-dwbp-20151217/
Symptoms of bad quality metadata
Hard to identify („What is it?”)
Hard to distinguish from other records
Misleading descriptions
Uninterpretable descriptions
Missing fields
Unreusable (lost original context)
Hard to find
Same entity, differently recorded
lucas cranach der ältere
Cranach, Lucas (der Ältere) [Herstellung]
Cranach, Lucas (I) (naar tekening van)
Cranach, Lucas vanem (autor)
Result of entity detection:
http://dbpedia.org/resource/Lucas_Cranach_the_Elder
http://viaf.org/viaf/49268177/
none
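A naive illustration of how such variant strings can be clustered before linking them to an authority record: lower-case the value, strip bracketed role qualifiers, and drop epithet tokens. The stoplist below is purely illustrative; a real system would consult authority files (VIAF, DBpedia) instead.

```python
import re

# Illustrative stoplist of epithets/role words; hand-made for this example only.
EPITHETS = {"der", "die", "ältere", "vanem", "autor"}

def name_key(raw):
    """Reduce a creator-name variant to a crude matching key."""
    s = raw.lower()
    s = re.sub(r"[\[\(].*?[\]\)]", " ", s)   # drop "(der Ältere)", "[Herstellung]" etc.
    tokens = re.findall(r"[a-zäöüõ]+", s)
    core = sorted(t for t in tokens if t not in EPITHETS)
    return " ".join(core)

variants = [
    "lucas cranach der ältere",
    "Cranach, Lucas (der Ältere) [Herstellung]",
    "Cranach, Lucas (I) (naar tekening van)",
    "Cranach, Lucas vanem (autor)",
]
print({name_key(v) for v in variants})   # all four collapse to "cranach lucas"
```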
Same entity recorded differently
Different displays and content:
http://dbpedia.org/resource/Lucas_Cranach_the_Elder
http://viaf.org/viaf/49268177/
none
What to measure?
[Figure: a grid of sample documents (doc1, doc2, doc3, …) crossed with metadata fields (field1…field4); each measurement covers a different slice of this grid]
An overall value for a record
An overall value for a record set (e.g. a collection from the same source)
An overall value for a field: how do users utilize the field?
A field group: a group of fields that together supports a given functionality, e.g. display, search, identification, re-use, multilinguality.
Grouping fields by functionalities
[Table: Dublin Core fields (dc:title, dcterms:alternative, dc:description, dc:creator, dc:publisher, dc:contributor) marked (×) against the functionalities they support: Mandatory, Descriptiveness, Searchability, Contextualisation, Identification, Browsing, Viewing, Re-Usability, Multilinguality]
Created by Valentine Charles, Europeana Research and Development team
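The exact field-to-functionality assignments of the table are not reproduced here, but the idea of measuring completeness per functional group can be sketched as follows; the group definitions below are invented for illustration.

```python
# Hypothetical field groups; the real mapping is the Europeana table above.
FIELD_GROUPS = {
    "Searchability":   ["dc:title", "dc:description", "dc:creator"],
    "Identification":  ["dc:title", "dcterms:alternative", "dc:description"],
    "Multilinguality": ["dc:title", "dc:description"],
}

def group_completeness(record, groups=FIELD_GROUPS):
    """Share of populated fields per functional group, 0.0 to 1.0."""
    return {
        group: sum(1 for f in fields if record.get(f)) / len(fields)
        for group, fields in groups.items()
    }

record = {"dc:title": "Portrait of a Lady", "dc:creator": "Cranach, Lucas"}
print(group_completeness(record))
```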
Metrics
The foundational metrics were set by Bruce–Hillmann, Stvilia, Ochoa–Duval, Gavrilis et al.
Completeness
Accuracy
Conformance to expectations
Logical consistency and coherence
Accessibility
Timeliness
Provenance
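Completeness, the simplest of these metrics, can be sketched as the share of schema fields a record actually fills; the field names below are illustrative Dublin Core, not a prescribed schema.

```python
SCHEMA_FIELDS = ["dc:title", "dc:creator", "dc:description", "dc:publisher"]

def completeness(record, schema=SCHEMA_FIELDS):
    """Fraction of schema fields with a non-empty value in the record."""
    filled = sum(1 for field in schema if record.get(field))
    return filled / len(schema)

record = {"dc:title": "Lucas Cranach: Selbstbildnis", "dc:description": "Oil on panel"}
print(completeness(record))  # 0.5: two of the four fields are filled
```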
Data sources
Europeana – the European digital library, museum and archive: 48M+ metadata records in EDM (Europeana Data Model) schema
TextGrid repository: Dublin Core metadata
and TEI (Text Encoding Initiative) records
Research data from the Göttingen Campus
Library catalogue records in MARC (Machine-Readable Cataloging) schema
Other open data
Method: collection – measuring – sharing
Data collection (ingestion) via REST API, OAI-PMH harvesting, file download etc.
Issues:
GWDG cloud: 160 GB, Europeana: 300 GB
low I/O performance
Europeana OAI-PMH is in a „beta” state
OAI-PMH requires 10M+ HTTP requests
REST API requires 50M+ HTTP requests
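The harvesting loop behind those millions of requests follows the OAI-PMH resumption-token protocol: request ListRecords, collect the records, then re-request with the returned token until none is left. A minimal sketch of the parsing step (no network code; the identifiers and token are made up):

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"

def parse_list_records(xml_text):
    """Pull record identifiers and the resumption token out of a ListRecords page."""
    root = ET.fromstring(xml_text)
    ids = [el.text for el in root.iter(OAI + "identifier")]
    token_el = root.find(".//" + OAI + "resumptionToken")
    token = token_el.text if token_el is not None and token_el.text else None
    return ids, token  # token is None on the last page

page = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:1</identifier></header></record>
    <record><header><identifier>oai:example.org:2</identifier></header></record>
    <resumptionToken>page-2-token</resumptionToken>
  </ListRecords>
</OAI-PMH>"""
ids, token = parse_list_records(page)
print(ids, token)  # ['oai:example.org:1', 'oai:example.org:2'] page-2-token
```

The caller would keep issuing `ListRecords&resumptionToken=…` requests until `token` comes back as `None`.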
Method: collection – measuring – sharing
Measuring records
Big data, so it has to be scalable
Apache Hadoop: MapReduce and friends
Pluggable architecture: „meters”
UI: set parameters for meters
input: records, schema, meters, config files
output:
identifier, projected metadata fields
metric1, metric2, metric3 ... metricN
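The pluggable „meter” design can be sketched as a common interface: each meter maps a record to named metric values, and the runner emits one row per record, identifier first, metric columns after. The class and metric names here are assumptions for illustration, not the toolkit's actual API.

```python
class Meter:
    """Interface every pluggable meter implements."""
    def measure(self, record):
        raise NotImplementedError

class TitleLengthMeter(Meter):
    def measure(self, record):
        return {"title_length": len(record.get("dc:title", ""))}

class HasCreatorMeter(Meter):
    def measure(self, record):
        return {"has_creator": int(bool(record.get("dc:creator")))}

def run_meters(records, meters):
    """One output row per record: identifier followed by every metric value."""
    rows = []
    for rec in records:
        row = {"id": rec["id"]}
        for meter in meters:
            row.update(meter.measure(rec))
        rows.append(row)
    return rows

records = [{"id": "rec-1", "dc:title": "Stillleben", "dc:creator": "unknown"}]
print(run_meters(records, [TitleLengthMeter(), HasCreatorMeter()]))
```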
Method: collection – measuring – sharing
Statistical analysis
Calculating descriptive statistics with R/Julia/other tools
Deriving numbers that represent collections and fields from the record-level measurements
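Aggregating record-level scores into a collection-level figure is ordinary descriptive statistics; sketched here in Python (instead of R or Julia), with invented scores:

```python
import statistics

# Hypothetical record-level completeness scores for one collection
scores = [0.25, 0.5, 0.5, 0.75, 1.0]

summary = {
    "n":      len(scores),
    "mean":   statistics.mean(scores),
    "median": statistics.median(scores),
    "stdev":  statistics.stdev(scores),
    "min":    min(scores),
    "max":    max(scores),
}
print(summary)
```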
Method: collection – measuring – sharing
Completeness of 3 collections, 2 response types
[Chart: best and worst record in each collection; the collections contain similar records, heterogeneous records, and different manifestations respectively]
Method: collection – measuring – sharing
Outputs
Display results in an interactive dashboard
REST API to share the raw data
Images: i) European Data Portal Metadata Quality Dashboard ii) Kibana promotional video
What is it good for?
Improve the metadata
Improve the metadata schema and its documentation
Propagate „good practice”
Improve services: „good” data is ranked higher in the search result list
Specifically for GWDG:
Could be built into current and planned data management / data archiving tools
Further steps
Define meters with a Domain Specific Language
Pattern discovery, machine learning,
clustering
Connectors for data sources
„Jenkins for data publication”
Problem catalogue
Data source
Schema
Metadata QA Report
Follow me
Project plan and blog: http://pkiraly.github.io
Software development:
https://github.com/pkiraly/europeana-oai-pmh-client: Harvester for Europeana OAI-PMH Service
https://github.com/pkiraly/oai-pmh-lib: OAI-PMH client library
https://github.com/pkiraly/europeana-api-php-client: PHP client for Europeana’s REST API
https://github.com/pkiraly/europeana-qa: Europeana Metadata Quality Assurance Toolkit
@kiru, https://www.linkedin.com/in/peterkiraly