Measuring completeness as metadata quality metric in Europeana (DH 2017)
1. Measuring completeness as metadata
quality metric in Europeana
Péter Király
peter.kiraly@gwdg.de
Gesellschaft für wissenschaftliche
Datenverarbeitung mbH Göttingen, Germany
Digital Humanities 2017 (Montréal, Canada)
9th August, 2017
2. Measuring completeness. Glossary
bit.ly/mq-dh2017 - 2
★ Metadata here: cultural heritage metadata (descriptions of books etc.)
★ Europeana: a metadata aggregator collecting 53M metadata records from 3500+ cultural heritage institutions, http://europeana.eu
★ Big Data here: 10-100 million metadata records, 100 GB - 1.5 TB
★ EDM: Europeana Data Model, Europeana’s metadata schema
5. Measuring completeness. Non normalized institution names
difference in whitespace (“\n ”)
★ The Royal Library: The National Library of Denmark and Copenhagen University Library (40,680)
★ The Royal Library: The National Library of Denmark and Copenhagen University Library (20,688)
difference in name (translations, extra attributes)
★ National Library of the Netherlands (1,291,139)
★ National Library of the Netherlands - Koninklijke Bibliotheek (554,068)
★ Bodleian Libraries, University of Oxford (354,441)
★ Bodleian Libraries, Oxford University (3,243)
★ Cinecittà Luce S.p.A. (372,412)
★ Cinecittà Luce (2,405)
★ LUCE (105)
affects “filter by institution” function & web widget
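The whitespace variant of this problem can be sketched in a few lines of Python. This is a minimal illustration, not the method used in the Europeana tooling: the function names `normalize_name` and `merge_counts` are hypothetical, and the assumption that the first Royal Library entry contains a line break is made only for the example. Name variants such as translations or extra attributes need a curated mapping and are not handled here.

```python
import re
from collections import Counter

def normalize_name(name: str) -> str:
    """Collapse every run of whitespace (spaces, tabs, line breaks) into one space."""
    return re.sub(r"\s+", " ", name).strip()

def merge_counts(facet):
    """Sum the record counts of facet values that are equal after normalization."""
    merged = Counter()
    for name, count in facet:
        merged[normalize_name(name)] += count
    return merged

# Counts taken from the slide; the position of the line break is an assumption.
facet = [
    ("The Royal Library: The National Library of Denmark\n"
     "and Copenhagen University Library", 40680),
    ("The Royal Library: The National Library of Denmark "
     "and Copenhagen University Library", 20688),
    ("Cinecittà Luce S.p.A.", 372412),
    ("Cinecittà Luce", 2405),
]

merged = merge_counts(facet)
for name, count in merged.most_common():
    print(f"{count:>9,}  {name}")
```

After merging, the two Royal Library values become a single facet entry, while the two Cinecittà Luce values stay separate because they differ in more than whitespace.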
6. Measuring completeness. Non normalized values in “year” facet
good
★ 1666
★ 1914
bad
★ -1988
★ 13436
★ 97500000
★ 20140409
★ 1146345933
affects “filter by year” function
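A simple plausibility check already separates the good and bad values listed above. The sketch below is an illustration only: the function name `is_plausible_year` and the bounds 0 and 2017 are assumptions chosen for the example, and a collection with dates before the common era would need a different lower bound.

```python
def is_plausible_year(value: str, min_year: int = 0, max_year: int = 2017) -> bool:
    """Return True if value parses as an integer within [min_year, max_year].

    The default bounds are assumptions for this example, not Europeana rules.
    """
    try:
        year = int(value.strip())
    except ValueError:
        return False
    return min_year <= year <= max_year

# Facet values from the slide.
values = ["1666", "1914", "-1988", "13436", "97500000", "20140409", "1146345933"]
for value in values:
    label = "good" if is_plausible_year(value) else "bad"
    print(f"{value:>12}  {label}")
```

With these bounds the check flags exactly the five values the slide marks as bad. Values such as 20140409 look like dates in a compact format, so a follow-up step could try to recover the year instead of discarding them.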
7. Measuring completeness. Multilinguality problem
★ Mona Lisa → 456 results
★ La Gioconda → 365 results
★ La Joconde → 71 results
http://www.europeana.eu/portal/en/record/90402/RP_F_00_351.html
affects search function
8. Measuring completeness. Empty fields
no useful information
more examples in Report and Recommendations from the Task Force on Metadata Quality (2015)
9. Measuring completeness. The question
How can we determine which records should be improved
and which are good enough?
we would like to have metrics like this:
support of functional requirements
good
acceptable
bad
10. Measuring completeness. Why is data quality important?
“Fitness for purpose” (QA principle)
purpose: to access content
no metadata → no access to data → no data usage
more explanation:
Data on the Web Best Practices
W3C Working Draft, https://www.w3.org/TR/dwbp/
12. Measuring completeness. Purposes
★ improve the metadata
★ services: good data → reliable functions
★ better metadata schema & documentation
★ propagate “good practice”
13. Measuring completeness. Proposal I. - an organization
Europeana Data Quality Committee
★ analyzing/revising metadata schema
★ functional requirement analysis
○ defining “enabling” elements
★ problem catalog
★ multilinguality
14. Measuring completeness. Proposal II. - a tool proposal
“Metadata Quality Assurance Framework”
a generic tool for measuring metadata quality
★ adaptable to different metadata schemes
★ scalable (to Big Data)
★ understandable reports for data curators
★ open source
15. Measuring completeness. What to measure?
★ Structure and semantics
Completeness, cardinality, uniqueness, length, dictionary entry, data type conformance, multilinguality (see [bibliography])
★ Functional requirements
Requirements of the most important functions, discovery scenarios
★ Problem catalog
Known metadata problems
16. Measuring completeness. Completeness categories
★ simple completeness
ratio of filled fields
★ cardinality of fields
which fields are filled and how intensively
★ functionalities
field groups supporting functions
○ mandatory elements
○ descriptiveness
○ searchability
○ contextualization
○ identification
○ browsing
○ …
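The three categories above can be sketched as follows. This is a simplified illustration, not the Metadata Quality Assurance Framework's implementation: records are modelled as dictionaries mapping a field name to a list of values, the function names are hypothetical, and both the four-field schema and the "searchability" field group are assumptions made for the example.

```python
def is_filled(values) -> bool:
    """A field counts as filled if it has at least one non-blank value."""
    return any(v.strip() for v in values or [])

def simple_completeness(record: dict, schema_fields: list) -> float:
    """Ratio of schema fields that are filled in the record."""
    filled = sum(1 for f in schema_fields if is_filled(record.get(f)))
    return filled / len(schema_fields)

def field_cardinality(record: dict, schema_fields: list) -> dict:
    """Number of non-blank values per field (how intensively a field is used)."""
    return {f: sum(1 for v in record.get(f, []) if v.strip()) for f in schema_fields}

def functionality_score(record: dict, group_fields: list) -> float:
    """Ratio of filled fields within a group that supports one function."""
    return simple_completeness(record, group_fields)

# Hypothetical schema subset and field group, for illustration only.
schema = ["dc:title", "dc:description", "dc:subject", "dct:alternative"]
searchability = ["dc:title", "dc:description", "dc:subject"]

record = {
    "dc:title": ["Mona Lisa"],
    "dc:description": ["", "  "],          # present but empty
    "dc:subject": ["painting", "portrait"],
}

print(simple_completeness(record, schema))         # 0.5
print(field_cardinality(record, schema))
print(functionality_score(record, searchability))
```

Treating blank strings as unfilled matters here: the `dc:description` field exists in the record but carries no useful information, so it does not contribute to any of the scores.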
17. Measuring completeness. Measurement levels
levels: overall view, collection view, record view
metrics: Completeness, Field cardinality, Uniqueness, Multilinguality, Language specification, Problem catalog, etc.
[diagram labels: links, measurements, aggregated statistics, metrics]
18. Measuring completeness. Completeness score calculation
two methods (Method I, Method II) for the completeness score: weighted cardinality and weighted functionality
Pearson’s correlation coefficient is 0.52
weight: 2.5 × score
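For readers who want to reproduce this kind of comparison, Pearson's correlation coefficient between two score series can be computed as sketched below. The function name `pearson` and the sample scores are hypothetical; the slide's value of 0.52 comes from the actual Europeana measurements, which are not reproduced here.

```python
import math

def pearson(xs, ys) -> float:
    """Pearson's correlation coefficient of two equally long numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical per-record scores from two scoring methods (not real data).
cardinality_scores = [0.21, 0.35, 0.30, 0.42, 0.25]
functionality_scores = [0.50, 0.62, 0.71, 0.66, 0.48]

print(round(pearson(cardinality_scores, functionality_scores), 2))
```

A coefficient near 1 would mean the two methods rank records almost identically; a moderate value indicates that they capture partly different aspects of completeness.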
19. Measuring completeness. Completeness score distribution
Distribution of completeness scores in one dataset.
functionality-based method
★ higher scores
★ more varied
cardinality-based method
★ lower scores
★ less varied
combined method
★ closer to the functionality-based method
20. Measuring completeness. Results
★ many records lack semantic enrichments (contextual entities)
○ 6% have agent, 28% place, 32% timespan, 40% concept entities
○ only a couple of data providers have 100% coverage
★ only mandatory elements appear in every record
★ there are unused fields
★ there are overused fields
○ suggestion: generic fields → specific fields
○ dc:description → dc:subject, dct:alternative, dct:tableOfContents
23. Measuring completeness. Further steps
★ scores into recommendations
★ communication
★ expert evaluation
★ cooperation with other projects
★ ingestion process
★ W3C recommendations
○ Shapes Constraint Language
○ Data Quality Vocabulary
★ is usage in line with scores?
★ do scores change?
★ machine learning-based classification & clustering
[column labels: human, technical]
24. Measuring completeness. Credits and links
This research is conducted in close collaboration with the Europeana Data
Quality Committee, thanks to all the members! Special thanks to Marco
Büchler & the eTRAP team and to the GWDG HPC experts!
★ Europeana Data Quality Committee // http://pro.europeana.eu/europeana-tech/data-quality-committee
★ demo site // http://144.76.218.178/europeana-qa/
★ source code (GPL v3.0) // http://pkiraly.github.io/about/#source-codes
★ Europeana data (CC0) // http://hdl.handle.net/21.11101/0000-0001-781F-7
★ [bibliography] // http://zotero.org/groups/metadata_assessment
★ contact // peter.kiraly@gwdg.de, @kiru
★ slides // http://bit.ly/mq-dh2017