Metadata Quality Assurance
Péter Király
peter.kiraly@gwdg.de
Heyne Haus, Göttingen, 18/12/2015
Oberseminar Datenmanagement, Cloud und e-Infrastructure
Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen
Metadata Quality Assurance Framework
What is metadata?
• Data about data
• Specifically: descriptive data about ...
  ◦ digitized (or physical) objects, such as paintings, books, photos
  ◦ larger datasets, such as research data
• Provides access points to the underlying data
Why is data quality important?
"Fitness for purpose"
no metadata → no access to data → no data usage
More explanation:
Data on the Web Best Practices
W3C Working Draft, 17 December 2015
http://www.w3.org/TR/2015/WD-dwbp-20151217/
Symptoms of bad quality metadata
• Hard to identify ("What is it?")
• Hard to distinguish from other records
• Misleading descriptions
• Uninterpretable descriptions
• Missing fields
• Not reusable (lost original context)
• Hard to find
Some typical issues
• Title is not informative
Mixing different data types
• Numeric
• RDF resource
Field overuse
• What is the meaning of the field?
• identifier
• relation
• source
TextGrid OAI-PMH response
Copy & paste cataloguing
• Keeping placeholders / templates
Same entity, differently recorded
• lucas cranach der ältere
• Cranach, Lucas (der Ältere) [Herstellung]
• Cranach, Lucas (I) (naar tekening van)
• Cranach, Lucas vanem (autor)
Result of entity detection:
• http://dbpedia.org/resource/Lucas_Cranach_the_Elder
• http://viaf.org/viaf/49268177/
• none
Same entity recorded differently
Different displays and content:
• http://dbpedia.org/resource/Lucas_Cranach_the_Elder
• http://viaf.org/viaf/49268177/
• none
What to measure?
[Figure, repeated on four slides: a grid with records (doc1, doc2, doc3, ...) as rows and fields (field1 to field4) as columns]
• Record: an overall value for a record
• Record set: an overall value for a record set (e.g. a collection from the same source)
• Field: an overall value for a field (how do users utilize the field?)
• Field group: a group of fields that together supports a given functionality, e.g. display, search, identify, re-use, multilinguality
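The measurement levels above (record, record set, field, field group) can be illustrated with a minimal Python sketch. It assumes completeness (the share of filled fields) as the metric and uses made-up records; `FIELDS`, `records`, and the helper functions are hypothetical names, not part of the framework itself.

```python
from statistics import mean

# Hypothetical schema and made-up records: None or "" means the field is missing.
FIELDS = ["field1", "field2", "field3", "field4"]
records = {
    "doc1": {"field1": "a", "field2": "b", "field3": "c", "field4": "d"},
    "doc2": {"field1": "a", "field2": "", "field3": None, "field4": "d"},
    "doc3": {"field1": "a", "field2": "b"},
}


def is_filled(record, field):
    """A field counts as filled if it is present and not empty."""
    return bool(record.get(field))


def record_completeness(record, fields=FIELDS):
    """Record level: share of the given fields that are filled."""
    return sum(is_filled(record, f) for f in fields) / len(fields)


def field_completeness(records, field):
    """Field level: share of records in which the field is filled."""
    return sum(is_filled(r, field) for r in records.values()) / len(records)


def collection_completeness(records, fields=FIELDS):
    """Record-set level: mean of the record-level scores."""
    return mean(record_completeness(r, fields) for r in records.values())


record_scores = {doc_id: record_completeness(r) for doc_id, r in records.items()}
field_scores = {f: field_completeness(records, f) for f in FIELDS}
collection_score = collection_completeness(records)
```

A field-group score falls out of the same helper: passing a subset of fields as `fields` to `record_completeness` scores only that group.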
Grouping fields by functionalities
Functionalities (table columns): Mandatory, Descriptiveness, Searchability, Contextualisation, Identification, Browsing, Viewing, Re-Usability, Multilinguality
Fields (table rows; each × marks one functionality the field supports):
dc:title × × × × ×
dcterms:alternative × × × ×
dc:description × × × × × ×
dc:creator × × × ×
dc:publisher × ×
dc:contributor ×
Created by Valentine Charles, Europeana Research and Development team
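A per-functionality score could be computed along the following lines. This is only a sketch: the field-to-functionality assignments in `FUNCTIONALITY_FIELDS` are hypothetical placeholders and do not reproduce the assignments in the table above, and the example record is made up.

```python
# HYPOTHETICAL mapping for illustration only; it does not reproduce the
# actual field-to-functionality assignments from the table.
FUNCTIONALITY_FIELDS = {
    "Descriptiveness": ["dc:title", "dc:description", "dc:creator"],
    "Searchability": ["dc:title", "dcterms:alternative", "dc:description"],
    "Identification": ["dc:title", "dc:publisher"],
}


def group_scores(record, groups=FUNCTIONALITY_FIELDS):
    """Return, per functionality, the share of its fields filled in the record."""
    scores = {}
    for name, fields in groups.items():
        filled = sum(1 for f in fields if record.get(f))
        scores[name] = filled / len(fields)
    return scores


# Made-up record with only two of the fields filled.
example = {"dc:title": "Portrait of a man", "dc:creator": "Cranach, Lucas"}
scores = group_scores(example)
```

Scoring by group rather than by the whole schema makes it visible when a record is, say, well described but poorly identifiable.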
Metrics
The foundational metrics were set by Bruce and Hillmann; Stvilia; Ochoa and Duval; Gavrilis et al.
• Completeness
• Accuracy
• Conformance to expectations
• Logical consistency and coherence
• Accessibility
• Timeliness
• Provenance
Data sources
• Europeana, the European digital library, museum and archive: 48M+ metadata records in EDM (Europeana Data Model) schema
• TextGrid repository: Dublin Core metadata and TEI (Text Encoding Initiative) records
• Research data from the Göttingen Campus
• Library catalogue records in MARC (Machine-Readable Cataloging) schema
• Other open data
Method: collection – measuring – sharing
• Data collection (ingestion) via REST API, OAI-PMH harvesting, file download etc.
• Issues:
  ◦ GWDG cloud: 160 GB, Europeana: 300 GB
  ◦ low I/O performance
  ◦ Europeana OAI-PMH is in a "beta" state
  ◦ OAI-PMH requires 10M+ HTTP requests
  ◦ REST API requires 50M+ HTTP requests
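The request counts above follow from how OAI-PMH paginates: each `ListRecords` response carries one page of records plus a resumption token for the next page. The loop can be sketched in Python as below. This is a generic illustration using only the standard library, not the code of the harvesters listed at the end of the deck; `harvest`, `http_fetch`, and the `fetch` parameter are hypothetical names, and the base URL and metadata prefix are supplied by the caller.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def http_fetch(url):
    """Default fetcher: one HTTP GET per page of results."""
    with urlopen(url) as response:
        return response.read()


def harvest(base_url, metadata_prefix, fetch=http_fetch):
    """Yield <record> elements, following resumption tokens until exhausted."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    while True:
        root = ET.fromstring(fetch(base_url + "?" + urlencode(params)))
        yield from root.iterfind(".//oai:record", OAI_NS)
        token = root.find(".//oai:resumptionToken", OAI_NS)
        if token is None or not (token.text or "").strip():
            break  # no token (or an empty one) marks the last page
        # A resumption token is an exclusive argument: send it alone with the verb.
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}
```

Passing the fetcher in as a parameter keeps the pagination logic testable without network access and leaves room for retries or throttling, which matter when a harvest runs to millions of requests.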
Method: collection – measuring – sharing
Measuring records
• Big data, so it should be scalable
• Apache Hadoop: MapReduce and friends
• Pluggable architecture: "meters"
• UI: set parameters for meters
• Input: records, schema, meters, config files
• Output:
  ◦ identifier, projected metadata fields
  ◦ metric1, metric2, metric3 ... metricN
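One way to picture the pluggable "meters" idea is as a set of named functions that each map a record to a number, with one output row per record. The Python sketch below is an assumption about the shape of such a design, not the framework's actual interface; `Meter`, `completeness_meter`, `title_length_meter`, and `measure` are hypothetical names, and the sample record is made up.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]
Meter = Callable[[Record], float]


def completeness_meter(fields: List[str]) -> Meter:
    """Build a meter that scores the share of `fields` that are filled."""
    def meter(record: Record) -> float:
        return sum(1 for f in fields if record.get(f)) / len(fields)
    return meter


def title_length_meter(record: Record) -> float:
    """Toy meter: number of characters in the title (0 if missing)."""
    return float(len(str(record.get("title") or "")))


def measure(record: Record, id_field: str, meters: Dict[str, Meter]) -> Dict[str, object]:
    """Produce one output row: the identifier followed by one value per meter."""
    row: Dict[str, object] = {"identifier": record.get(id_field)}
    for name, meter in meters.items():
        row[name] = meter(record)
    return row


# Meters are configured once, then applied to every record.
meters = {
    "completeness": completeness_meter(["title", "creator", "date"]),
    "title_length": title_length_meter,
}
row = measure({"id": "doc1", "title": "Untitled", "creator": "", "date": "1530"}, "id", meters)
```

Because `measure` looks at a single record and keeps no state, a function of this kind could plausibly run inside the map step of a MapReduce job, which is what makes the approach scale.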
Method: collection – measuring – sharing
Statistical analysis
• Calculating descriptive statistics with R/Julia/other tools
• Deriving numbers that represent collections and fields from the record-level measurements
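The step from record-level measurements to collection-level numbers can be sketched as follows. The deck names R and Julia; Python's standard `statistics` module stands in here as the "other tool". The rows are made-up values, and `summarize` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import mean, median, pstdev

# Made-up record-level output rows: (collection, completeness score).
rows = [
    ("collectionA", 0.9), ("collectionA", 0.7), ("collectionA", 0.8),
    ("collectionB", 0.2), ("collectionB", 0.6),
]


def summarize(rows):
    """Group record-level scores by collection and describe each group."""
    by_collection = defaultdict(list)
    for collection, score in rows:
        by_collection[collection].append(score)
    return {
        name: {
            "n": len(scores),
            "min": min(scores),
            "max": max(scores),
            "mean": mean(scores),
            "median": median(scores),
            "stdev": pstdev(scores),  # population standard deviation
        }
        for name, scores in by_collection.items()
    }


summary = summarize(rows)
```

Grouping the same rows by field name instead of collection name would give the field-level view.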
Method: collection – measuring – sharing
[Figure: completeness of 3 collections, 2 response types. Annotations: best in collection; worst in collection; similar records; heterogeneous records; different manifestations]
Method: collection – measuring – sharing
Outputs
• Display results in an interactive dashboard
• REST API to share the raw data
Images: i) European Data Portal Metadata Quality Dashboard ii) Kibana promotional video
Method: collection – measuring – sharing
Data Quality Vocabulary (W3C Working Draft)
http://w3c.github.io/dwbp/vocab-dqg.html
:myDatasetDistribution
    dqv:hasQualityMeasure :measure1, :measure2 .

:measure1
    a dqv:QualityMeasure ;
    dqv:computedOn :myDatasetDistribution ;
    dqv:hasMetric :csvAvailabilityMetric ;
    dqv:value "1.0"^^xsd:double .

:measure2
    a dqv:QualityMeasure ;
    dqv:computedOn :myDatasetDistribution ;
    dqv:hasMetric :csvConsistencyMetric ;
    dqv:value "0.5"^^xsd:double .
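Statements of this shape could be generated from computed metric values with a few lines of Python. The sketch below simply mirrors the class and property names shown in the example above, taken from the working draft; `to_dqv_turtle` is a hypothetical helper, and like the example it omits the `@prefix` declarations a Turtle parser would need.

```python
def to_dqv_turtle(subject, measures):
    """Serialize {metric_name: value} as DQV statements in Turtle syntax."""
    names = [f":measure{i}" for i in range(1, len(measures) + 1)]
    lines = [subject, f"    dqv:hasQualityMeasure {', '.join(names)} .", ""]
    for name, (metric, value) in zip(names, measures.items()):
        lines += [
            name,
            "    a dqv:QualityMeasure ;",
            f"    dqv:computedOn {subject} ;",
            f"    dqv:hasMetric {metric} ;",
            f'    dqv:value "{float(value)}"^^xsd:double .',
            "",
        ]
    return "\n".join(lines)


turtle = to_dqv_turtle(
    ":myDatasetDistribution",
    {":csvAvailabilityMetric": 1.0, ":csvConsistencyMetric": 0.5},
)
```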
What is it good for?
• Improve the metadata
• Improve the metadata schema and its documentation
• Propagate "good practice"
• Improve services: "good" data is ranked higher in the search result list
Specifically for GWDG:
• Could be built into current and planned data management / data archiving tools
Further steps
• Define meters with a domain-specific language
• Pattern discovery, machine learning, clustering
• Connectors for data sources
• "Jenkins for data publication"
[Diagram labels: Problem catalogue; Data source; Schema; Metadata QA Report]
Follow me
• Project plan and blog: http://pkiraly.github.io
• Software development:
  ◦ https://github.com/pkiraly/europeana-oai-pmh-client: Harvester for Europeana OAI-PMH Service
  ◦ https://github.com/pkiraly/oai-pmh-lib: OAI-PMH client library
  ◦ https://github.com/pkiraly/europeana-api-php-client: PHP client for Europeana’s REST API
  ◦ https://github.com/pkiraly/europeana-qa: Europeana Metadata Quality Assurance Toolkit
• @kiru, https://www.linkedin.com/in/peterkiraly
