Measuring completeness as metadata quality metric in Europeana
Péter Király
peter.kiraly@gwdg.de
Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen, Germany
Digital Humanities 2017 (Montréal, Canada)
9th August, 2017
Measuring completeness. Glossary
bit.ly/mq-dh2017 - 2
★ Metadata (here): cultural heritage metadata (descriptions of books etc.)
★ Europeana: a metadata aggregator for 3500+ cultural heritage institutions, with 53M metadata records (http://europeana.eu)
★ Big Data (here): 10-100 million metadata records, 100 GB - 1.5 TB
★ EDM: Europeana Data Model, Europeana’s metadata schema
Measuring completeness.
bit.ly/mq-dh2017 - 3
The problem
Measuring completeness. Generic title and bad thumbnail
bit.ly/mq-dh2017 - 4
affects search and identification
Measuring completeness. Non-normalized institution names
difference in whitespaces (“n “):
★ The Royal Library: The National Library of Denmark and Copenhagen University Library (40,680)
★ The Royal Library: The National Library of Denmark and Copenhagen University Library (20,688)
difference in name (translations, extra attributes):
★ National Library of the Netherlands (1,291,139)
★ National Library of the Netherlands - Koninklijke Bibliotheek (554,068)
★ Bodleian Libraries, University of Oxford (354,441)
★ Bodleian Libraries, Oxford University (3,243)
★ Cinecittà Luce S.p.A. (372,412)
★ Cinecittà Luce (2,405)
★ LUCE (105)
affects “filter by institution” function & web widget
bit.ly/mq-dh2017 - 5
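The whitespace variants above can be merged mechanically. The following is a minimal sketch, not the approach used by Europeana: the function names are hypothetical, the counts come from the slide, and the position of the stray newline in the second name is an assumption. It handles only whitespace differences; translated or extended names (for example "Koninklijke Bibliotheek") would still need a curated mapping.

```python
import re
from collections import Counter


def normalize_name(name: str) -> str:
    """Collapse every run of whitespace (spaces, tabs, newlines) to one space."""
    return re.sub(r"\s+", " ", name).strip()


def merge_facet_counts(facet: dict[str, int]) -> Counter:
    """Sum the record counts of facet values that normalize to the same name."""
    merged: Counter = Counter()
    for name, count in facet.items():
        merged[normalize_name(name)] += count
    return merged


# Counts are taken from the slide; the location of the newline is assumed.
facet = {
    "The Royal Library: The National Library of Denmark and Copenhagen University Library": 40680,
    "The Royal Library: The National Library of Denmark\nand Copenhagen University Library": 20688,
    "Cinecittà Luce S.p.A.": 372412,
    "Cinecittà Luce": 2405,
}
merged = merge_facet_counts(facet)
```

After merging, the two Royal Library entries share one facet value, while the two Cinecittà Luce spellings remain separate because they differ in more than whitespace.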
Measuring completeness. Non-normalized values in “year” facet
good
★ 1666
★ 1914
bad
★ -1988
★ 13436
★ 97500000
★ 20140409
★ 1146345933
affects “filter by year” function
bit.ly/mq-dh2017 - 6
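A simple range check is enough to separate the good and bad values listed above. This is a sketch under stated assumptions: the function name and the accepted range (year 1 up to 2017, the year of the talk) are hypothetical choices, and a real collection with BCE dates would need a wider range.

```python
def is_plausible_year(value: str, min_year: int = 1, max_year: int = 2017) -> bool:
    """Return True if value parses as an integer within [min_year, max_year].

    The default bounds are assumptions for illustration only.
    """
    try:
        year = int(value.strip())
    except ValueError:
        return False
    return min_year <= year <= max_year


# Facet values from the slide.
good = ["1666", "1914"]
bad = ["-1988", "13436", "97500000", "20140409", "1146345933"]

flagged = [v for v in good + bad if not is_plausible_year(v)]
```

With these bounds, `flagged` contains exactly the values the slide marks as bad. Values such as 20140409 look like full dates and 1146345933 like a Unix timestamp, so a follow-up step could try to recover the year instead of only rejecting them.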
Measuring completeness. Multilinguality problem
★ Mona Lisa → 456 results
★ La Gioconda → 365 results
★ La Joconde → 71 results
http://www.europeana.eu/portal/en/record/90402/RP_F_00_351.html
affects search function
bit.ly/mq-dh2017 - 7
Measuring completeness. Empty fields
no useful information
more examples in Report and Recommendations from the Task Force on Metadata Quality (2015)
bit.ly/mq-dh2017 - 8
Measuring completeness. The question
How can we determine which records should be improved and which are good enough?
we would like to have metrics like this:
support of functional requirements: good / acceptable / bad
bit.ly/mq-dh2017 - 9
Measuring completeness. Why is data quality important?
“Fitness for purpose” (QA principle)
purpose: to access content
no metadata → no access to data → no data usage
more explanation:
Data on the Web Best Practices
W3C Working Draft, https://www.w3.org/TR/dwbp/
bit.ly/mq-dh2017 - 10
Measuring completeness. Hypothesis
by measuring structural elements we can approximate metadata record quality
≃ metadata smell
bit.ly/mq-dh2017 - 11
Measuring completeness. Purposes
★ improve the metadata
★ services: good data → reliable functions
★ better metadata schema & documentation
★ propagate “good practice”
bit.ly/mq-dh2017 - 12
Measuring completeness. Proposal I. - an organization
Europeana Data Quality Committee
★ analyzing/revising metadata schema
★ functional requirement analysis
○ defining “enabling” elements
★ problem catalog
★ multilinguality
bit.ly/mq-dh2017 - 13
Measuring completeness. Proposal II. - a tool proposal
“Metadata Quality Assurance Framework”
a generic tool for measuring metadata quality
★ adaptable to different metadata schemes
★ scalable (to Big Data)
★ understandable reports for data curators
★ open source
bit.ly/mq-dh2017 - 14
Measuring completeness. What to measure?
★ Structure and semantics
Completeness, cardinality, uniqueness, length, dictionary entry, data type conformance, multilinguality (see [bibliography])
★ Functional requirements
Requirements of the most important functions, discovery scenarios
★ Problem catalog
Known metadata problems
bit.ly/mq-dh2017 - 15
Measuring completeness. Completeness categories
★ simple completeness
ratio of filled fields
★ cardinality of fields
which fields are filled and how intensively
★ functionalities
field groups supporting functions
○ mandatory elements
○ descriptiveness
○ searchability
○ contextualization
○ identification
○ browsing
○ …
bit.ly/mq-dh2017 - 16
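The three categories on this slide can be illustrated on a single record. The sketch below is an assumption-laden illustration, not the framework's implementation: the function names, the four-field schema excerpt and the two field groups are hypothetical, although the field names themselves are ordinary Dublin Core properties.

```python
def is_filled(value) -> bool:
    """A field counts as filled if it holds at least one non-blank value."""
    if value is None:
        return False
    values = value if isinstance(value, list) else [value]
    return any(str(v).strip() for v in values)


def simple_completeness(record: dict, fields: list[str]) -> float:
    """Ratio of filled fields among the given fields."""
    return sum(1 for f in fields if is_filled(record.get(f))) / len(fields)


def field_cardinality(record: dict, fields: list[str]) -> dict[str, int]:
    """Number of non-blank values per field (how intensively it is filled)."""
    result = {}
    for f in fields:
        value = record.get(f)
        values = value if isinstance(value, list) else [value]
        result[f] = sum(1 for v in values if v is not None and str(v).strip())
    return result


def functionality_scores(record: dict, groups: dict[str, list[str]]) -> dict[str, float]:
    """Completeness of each field group that supports a function."""
    return {name: simple_completeness(record, fs) for name, fs in groups.items()}


# Hypothetical schema excerpt and field groups, for illustration only.
FIELDS = ["dc:title", "dc:description", "dc:creator", "dc:subject"]
GROUPS = {
    "searchability": ["dc:title", "dc:subject"],
    "descriptiveness": ["dc:title", "dc:description"],
}
record = {
    "dc:title": ["Photograph"],
    "dc:description": [""],
    "dc:creator": [],
    "dc:subject": ["portrait", "studio"],
}

completeness = simple_completeness(record, FIELDS)
cardinality = field_cardinality(record, FIELDS)
by_function = functionality_scores(record, GROUPS)
```

Here a blank string and an empty list both count as unfilled, which is one possible reading of "filled"; a stricter definition could also reject placeholder values.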
Measuring completeness. Measurement levels
views: overall view, collection view, record view
metrics: Completeness, Field cardinality, Uniqueness, Multilinguality, Language specification, Problem catalog, etc.
diagram labels: links, measurements, aggregated statistics
bit.ly/mq-dh2017 - 17
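One way to read the three views is that record-level measurements are rolled up into collection-level and overall statistics. The sketch below assumes exactly that and uses the arithmetic mean as the only aggregate; the function name, the record layout and the collection identifiers are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def aggregate(records: list[dict]) -> tuple[dict[str, float], float]:
    """Roll record-level scores up to collection level and to one overall value."""
    by_collection: dict[str, list[float]] = defaultdict(list)
    for rec in records:
        by_collection[rec["collection"]].append(rec["completeness"])
    collection_means = {c: mean(scores) for c, scores in by_collection.items()}
    overall_mean = mean(rec["completeness"] for rec in records)
    return collection_means, overall_mean


# Hypothetical record-level measurements.
records = [
    {"id": "r1", "collection": "A", "completeness": 0.5},
    {"id": "r2", "collection": "A", "completeness": 1.0},
    {"id": "r3", "collection": "B", "completeness": 0.25},
]
collection_means, overall_mean = aggregate(records)
```

The overall mean is computed over records rather than over collection means, so large collections weigh more; averaging the collection means instead would be an equally defensible choice.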
Measuring completeness. Completeness score calculation
[diagram: Weighted cardinality and Weighted functionality feed into the Completeness score; labels: Method I, Method II]
Pearson’s correlation coefficient is 0.52
weight: 2.5 × score
bit.ly/mq-dh2017 - 18
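The slide reports a Pearson correlation between two scoring methods and a combined score, but it does not give the formulas. The sketch below shows the standard Pearson coefficient and, as a pure assumption, a weighted mean as the combination; the function names and default weights are hypothetical, and the slide's "2.5 × score" weighting is not modelled here.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson's correlation coefficient of two equally long score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def combined_score(cardinality_score: float, functionality_score: float,
                   w_card: float = 1.0, w_func: float = 1.0) -> float:
    """Weighted mean of the two sub-scores (the weights are assumptions)."""
    total = w_card * cardinality_score + w_func * functionality_score
    return total / (w_card + w_func)


# Hypothetical per-record scores from the two methods.
cardinality_scores = [0.2, 0.3, 0.4, 0.6]
functionality_scores = [0.5, 0.6, 0.9, 0.8]

r = pearson(cardinality_scores, functionality_scores)
combined = [combined_score(c, f) for c, f in zip(cardinality_scores, functionality_scores)]
```

A coefficient well below 1, such as the 0.52 reported on the slide, indicates that the two methods rank records differently, which is the motivation for looking at a combined score.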
Measuring completeness. Completeness score distribution
Distribution of completeness scores in one dataset.
functionality-based method
★ higher scores
★ more variance
cardinality-based method
★ lower scores
★ less variance
combined method
★ closer to the functionality-based method
bit.ly/mq-dh2017 - 19
Measuring completeness. Results
★ many records lack semantic enrichments (contextual entities)
○ 6% have agent, 28% place, 32% timespan, 40% concept entities
○ only a couple of data providers have 100% coverage
★ only the mandatory elements appear in every record
★ there are unused fields
★ there are overused fields
○ suggestion: generic fields → specific fields
○ dc:description → dc:subject, dct:alternative, dct:tableOfContents
bit.ly/mq-dh2017 - 20
Measuring completeness. Visualization
bit.ly/mq-dh2017 - 21
Measuring completeness. Technical background
processing workflow: ingest → measure → statistical analysis → web interface
★ ingest: OAI-PMH, Europeana API, Hadoop, NoSQL
★ measure: Spark, Hadoop, Java, Apache Solr
★ statistical analysis: Spark, Scala, R
★ web interface: PHP, D3.js, highchart.js, NoSQL
data formats: json; csv; html, svg; json, jpg
bit.ly/mq-dh2017 - 22
Measuring completeness. Further steps
★ scores into recommendations
★ communication
★ expert evaluation
★ cooperation with other projects
★ ingestion process
★ W3C recommendations
○ Shapes Constraint Language
○ Data Quality Vocabulary
★ is usage in line with scores?
★ do scores change?
★ machine learning-based classification & clustering
human technical
bit.ly/mq-dh2017 - 23
Measuring completeness. Credits and links
bit.ly/mq-dh2017 - 24
This research is conducted in close collaboration with the Europeana Data Quality Committee, thanks to all the members! Special thanks to Marco Büchler & the eTRAP team and to the GWDG HPC experts!
★ Europeana Data Quality Committee // http://pro.europeana.eu/europeana-tech/data-quality-committee
★ demo site // http://144.76.218.178/europeana-qa/
★ source code (GPL v3.0) // http://pkiraly.github.io/about/#source-codes
★ Europeana data (CC0) // http://hdl.handle.net/21.11101/0000-0001-781F-7
★ [bibliography] // http://zotero.org/groups/metadata_assessment
★ contact // peter.kiraly@gwdg.de, @kiru
★ slides // http://bit.ly/mq-dh2017

Measuring completeness as metadata quality metric in Europeana (DH 2017)

  • 1. Measuring completeness as metadata quality metric in Europeana. Péter Király, peter.kiraly@gwdg.de, Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen, Germany. Digital Humanities 2017 (Montréal, Canada), 9th August, 2017
  • 2. Measuring completeness. Glossary (bit.ly/mq-dh2017)
    ★ Metadata: here, cultural heritage metadata (descriptions of books etc.)
    ★ Europeana: a metadata aggregator from 3500+ cultural heritage institutions with 53M metadata records, http://europeana.eu
    ★ Big Data: here, 10-100 million metadata records, 100 GB - 1.5 TB
    ★ EDM: Europeana Data Model, Europeana’s metadata schema
  • 4. Generic title and bad thumbnail: affects search and identification
  • 5. Non normalized institution names: affects “filter by institution” function & web widget
    difference in whitespaces (“n “):
    ★ The Royal Library: The National Library of Denmark and Copenhagen University Library (40,680)
    ★ The Royal Library: The National Library of Denmark and Copenhagen University Library (20,688)
    difference in name (translations, extra attributes):
    ★ National Library of the Netherlands (1,291,139)
    ★ National Library of the Netherlands - Koninklijke Bibliotheek (554,068)
    ★ Bodleian Libraries, University of Oxford (354,441)
    ★ Bodleian Libraries, Oxford University (3,243)
    ★ Cinecittà Luce S.p.A. (372,412)
    ★ Cinecittà Luce (2,405)
    ★ LUCE (105)
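The whitespace case on slide 5 can be illustrated with a minimal Python sketch. It is not taken from the talk or from the Europeana code base: the function name `normalize_whitespace` is hypothetical, and the exact whitespace inside the second Royal Library value is assumed (the slide only says the two values differ in whitespace). The sketch shows that collapsing whitespace merges the whitespace variants, while variants that differ in wording stay separate and would need a different treatment, for example an authority list.

```python
import re
from collections import Counter


def normalize_whitespace(name: str) -> str:
    """Collapse every run of whitespace (space, tab, newline) into one space."""
    return re.sub(r"\s+", " ", name).strip()


# Facet values and counts from the slide; the "\n " inside the second
# value is an ASSUMPTION made only for this illustration.
facet = {
    "The Royal Library: The National Library of Denmark and Copenhagen University Library": 40680,
    "The Royal Library: The National Library of Denmark\n and Copenhagen University Library": 20688,
    "Bodleian Libraries, University of Oxford": 354441,
    "Bodleian Libraries, Oxford University": 3243,
}

merged = Counter()
for name, count in facet.items():
    merged[normalize_whitespace(name)] += count

for name, count in merged.most_common():
    print(f"{count:>8}  {name}")
```

After normalization the two Royal Library values share one facet entry, while the two Bodleian spellings remain distinct.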
  • 6. Non normalized values in “year” facet: affects “filter by year” function
    good: ★ 1666 ★ 1914
    bad: ★ -1988 ★ 13436 ★ 97500000 ★ 20140409 ★ 1146345933
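A simple range check is one way to separate the "good" and "bad" values listed on slide 6. The following Python sketch is an illustration only: the function name `classify_year` and the accepted range (0 to 2017, the year of the talk) are assumptions, not rules stated in the slides. A real collection with BCE dates would need a wider range.

```python
def classify_year(value: str, min_year: int = 0, max_year: int = 2017) -> str:
    """Label a facet value 'good' if it parses as an integer year inside
    the accepted range, otherwise 'bad'. The range is an assumption."""
    try:
        year = int(value)
    except ValueError:
        return "bad"
    return "good" if min_year <= year <= max_year else "bad"


# Values copied from the slide.
values = ["1666", "1914", "-1988", "13436", "97500000", "20140409", "1146345933"]
for v in values:
    print(v, classify_year(v))
```

With this range, only 1666 and 1914 are labelled "good", which matches the grouping on the slide.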
  • 7. Multilinguality problem: affects search function
    ★ Mona Lisa → 456 results
    ★ La Gioconda → 365 results
    ★ La Joconde → 71 results
    http://www.europeana.eu/portal/en/record/90402/RP_F_00_351.html
  • 8. Empty fields: no useful information. More examples in Report and Recommendations from the Task Force on Metadata Quality (2015)
  • 9. The question: How can we determine which records should be improved and which are good enough? We would like to have metrics like this: support of functional requirements: good / acceptable / bad
  • 10. Why is data quality important? “Fitness for purpose” (QA principle). Purpose: to access content. No metadata → no access to data → no data usage. More explanation: Data on the Web Best Practices, W3C Working Draft, https://www.w3.org/TR/dwbp/
  • 11. Hypothesis: by measuring structural elements we can approximate metadata record quality ≃ metadata smell
  • 12. Purposes
    ★ improve the metadata
    ★ services: good data → reliable functions
    ★ better metadata schema & documentation
    ★ propagate “good practice”
  • 13. Proposal I. - an organization: Europeana Data Quality Committee
    ★ analyzing/revising metadata schema
    ★ functional requirement analysis
      ○ defining “enabling” elements
    ★ problem catalog
    ★ multilinguality
  • 14. Proposal II. - a tool proposal: “Metadata Quality Assurance Framework”, a generic tool for measuring metadata quality
    ★ adaptable to different metadata schemes
    ★ scalable (to Big Data)
    ★ understandable reports for data curators
    ★ open source
  • 15. What to measure?
    ★ Structure and semantics: completeness, cardinality, uniqueness, length, dictionary entry, data type conformance, multilinguality (see [bibliography])
    ★ Functional requirements: requirements of the most important functions, discovery scenarios
    ★ Problem catalog: known metadata problems
  • 16. Completeness categories
    ★ simple completeness: ratio of filled fields
    ★ cardinality of fields: which fields are filled and how intensively
    ★ functionalities: field groups supporting functions
      ○ mandatory elements
      ○ descriptiveness
      ○ searchability
      ○ contextualization
      ○ identification
      ○ browsing
      ○ …
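The three completeness categories on slide 16 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Metadata Quality Assurance Framework's implementation: the record layout (a dict of field name to list of values), the helper names, and the contents of the `FIELD_GROUPS` mapping are all hypothetical. Only the field names themselves (`dc:title`, `dc:description`, `dc:subject`, `dct:alternative`) appear elsewhere in the slides.

```python
from typing import Dict, List, Sequence


def is_filled(values: Sequence[str]) -> bool:
    """A field counts as filled if at least one value is non-blank."""
    return any(v.strip() for v in values)


def simple_completeness(record: Dict[str, List[str]], fields: Sequence[str]) -> float:
    """Ratio of filled fields among the fields of interest."""
    if not fields:
        return 0.0
    filled = sum(1 for f in fields if is_filled(record.get(f, [])))
    return filled / len(fields)


def cardinality(record: Dict[str, List[str]], fields: Sequence[str]) -> Dict[str, int]:
    """Number of non-blank values per field ('how intensively' it is used)."""
    return {f: sum(1 for v in record.get(f, []) if v.strip()) for f in fields}


# HYPOTHETICAL field groups; the real groups are defined by the project.
FIELD_GROUPS = {
    "identification": ["dc:title", "dct:alternative"],
    "searchability": ["dc:title", "dc:description", "dc:subject"],
}

FIELDS = ["dc:title", "dc:description", "dc:subject", "dct:alternative"]

# Made-up record for illustration only.
record = {
    "dc:title": ["Photograph"],
    "dc:description": [""],
    "dc:subject": ["portrait", "woman"],
    "dct:alternative": [],
}

print(simple_completeness(record, FIELDS))
print(cardinality(record, FIELDS))
for group, group_fields in FIELD_GROUPS.items():
    print(group, simple_completeness(record, group_fields))
```

Scoring each field group with the same ratio keeps the sketch short; a weighted variant would multiply each field's contribution by a per-field weight before summing.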
  • 17. Measurement levels
    ★ views: overall view, collection view, record view
    ★ measurements: Completeness, Field cardinality, Uniqueness, Multilinguality, Language specification, Problem catalog etc.
    ★ diagram labels: links, measurements, aggregated statistics, metrics
  • 18. Completeness score calculation
    ★ diagram labels: Weighted cardinality, Completeness score, Weighted functionality, Method I, Method II, weight: 2.5 × score
    ★ Pearson’s correlation coefficient is 0.52
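Slide 18 reports a Pearson's correlation coefficient between two scoring methods. As a reminder of what that statistic measures, here is a minimal Python sketch of the standard formula. The score lists are invented for illustration and are not Europeana data, so the sketch does not attempt to reproduce the 0.52 reported on the slide.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson's correlation coefficient of two equally long samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if dev_x == 0 or dev_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / (dev_x * dev_y)


# INVENTED per-record scores from two hypothetical scoring methods.
method_1 = [0.20, 0.35, 0.30, 0.50, 0.45]
method_2 = [0.40, 0.70, 0.50, 0.65, 0.90]
print(round(pearson(method_1, method_2), 2))
```

A value near 1 would mean the two methods rank records almost identically; a moderate value indicates that they agree only partly.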
  • 19. Completeness score distribution: distribution of completeness scores in one dataset
    ★ functionality-based method: higher scores, more variant
    ★ cardinality-based method: lower scores, less variant
    ★ combined method: closer to functionality
  • 20. Results
    ★ lots of records miss semantic enrichments (contextual entities)
      ○ 6% have agent, 28% place, 32% timespan, 40% concept entities
      ○ only a couple of data providers have 100% coverage
    ★ only mandatory elements appear in each record
    ★ there are unused fields
    ★ there are overused fields
      ○ suggestion: generic fields → specific field
      ○ dc:description → dc:subject, dct:alternative, dct:tableOfContents
  • 22. Technical background
    ★ technologies, in slide order: OAI-PMH, Europeana API, Hadoop, NoSQL, Spark, Hadoop, Java, Apache Solr, Spark, Scala, R, PHP, D3.js, highchart.js, NoSQL
    ★ processing workflow: ingest, measure, statistical analysis, web interface
    ★ data formats, in slide order: json, csv, html, svg, json, jpg
  • 23. Further steps (slide columns: human, technical)
    ★ scores into recommendations
    ★ communication
    ★ expert evaluation
    ★ cooperation with other projects
    ★ ingestion process
    ★ W3C recommendations
      ○ Shapes Constraint Language
      ○ Data Quality Vocabulary
    ★ is usage in-line with scores?
    ★ do scores change?
    ★ machine learning-based classification & clustering
  • 24. Credits and links. This research is conducted in close collaboration with the Europeana Data Quality Committee, thanks to all the members! Special thanks to Marco Büchler & the eTRAP team and to the GWDG HPC experts!
    ★ Europeana Data Quality Committee // http://pro.europeana.eu/europeana-tech/data-quality-committee
    ★ demo site // http://144.76.218.178/europeana-qa/
    ★ source code (GPL v3.0) // http://pkiraly.github.io/about/#source-codes
    ★ Europeana data (CC0) // http://hdl.handle.net/21.11101/0000-0001-781F-7
    ★ [bibliography] // http://zotero.org/groups/metadata_assessment
    ★ contact // peter.kiraly@gwdg.de, @kiru
    ★ slides // http://bit.ly/mq-dh2017