  1. 1. Visualization of Metadata Quality for Open Government Data. Konrad Johannes Reiche*, Edzard Höfig, Ina Schieferdecker**, presented by Nikolay Tcholtchev**. konrad.reiche@gmail.com*, {firstname.lastname}@fokus.fraunhofer.de**
  2. 2. “A piece of content or data is open if anyone is free to use, reuse, and redistribute it — subject only, at most, to the requirement to attribute and/or share-alike.” O·pen Da·ta /ˈəʊp(ə)n ˈdeɪtə/
  3. 3. “A piece of content or data is open if anyone is free to use, reuse, and redistribute it — subject only, at most, to the requirement to attribute and/or share-alike.” O·pen Da·ta /ˈəʊp(ə)n ˈdeɪtə/
  4. 4. Government Data Citizens DOMAIN
  5. 5. Government Data Citizens DOMAIN DESIGN Repositories XML JSON RDF Metadata PDF XLS CSV DOC Resources
  6. 6. Quality. What could possibly go wrong? Metadata Record. Name: regional-household-income; ID: 98899446-0a1a-43bc-874c-2d54dc700670; Maintainer: Margaret Jarmon; Maintainer Email: magaret.jarmon@cabinet-office.x.gsi.gov.uk; Author: Office for National Statistics; Author Email: webmaster@cabinet-office.x.gsi.gov.uk; License ID: uk-ogl. Resources. (1) URL: http://www.ons.gov.uk/ons/rhi13; Description: Spring 2013; Format: CSV. (2) URL: http://www.ons.gov.uk/ons/rhi14; Description: Spring 2014; Format: CSV.
  7. 7. Quality. What could possibly go wrong? Metadata Record. Name: regional-household-income; ID: 98899446-0a1a-43bc-874c-2d54dc700670; Maintainer: (empty); Maintainer Email: (empty); Author: Office for National Statistics; Author Email: (empty); License ID: uk-ogl. Resources. (1) URL: http://www.ons.gov.uk/ons/rhi13; Description: Spring 2013; Format: CSV. (2) URL: http://www.ons.gov.uk/ons/rhi14; Description: (empty); Format: CSV.
  8. 8. Quality. What could possibly go wrong? Metadata Record. Name: regional-household-income; ID: 98899446-0a1a-43bc-874c-2d54dc700670; Maintainer: (empty); Maintainer Email: (empty); Author: Office for National Statistics; Author Email: (empty); License ID: uk-ogl. Resources. (1) URL: http://www.ons.gov.uk/ons/rhi13; Description: Spring 2013; Format: CSV. (2) URL: http://www.ons.gov.uk/ons/rhi14; Description: (empty); Format: CSV. CSV HTML
  9. 9. Metadata Record. Name: regional-household-income; ID: 98899446-0a1a-43bc-874c-2d54dc700670; Maintainer: (empty); Maintainer Email: (empty); Author: Office for National Statistics; Author Email: (empty); License ID: uk-ogl. Resources. (1) URL: http://www.ons.gov.uk/ons/rhi13; Description: Spring 2013; Format: CSV. (2) URL: http://www.ons.gov.uk/ons/rhi14; Description: (empty); Format: CSV. Quality. What could possibly go wrong? CSV
  10. 10. Metadata Record. Name: (empty); ID: 98899446-0a1a-43bc-874c-2d54dc700670; Maintainer: (empty); Maintainer Email: (empty); Author: (empty); Author Email: (empty); License ID: uk-ogl. Resources. (1) URL: http://www.ons.gov.uk/ons/rhi13; Description: Spring 2013; Format: CSV. (2) URL: http://www.ons.gov.uk/ons/rhi14; Description: (empty); Format: CSV. Quality. What could possibly go wrong? CSV
  11. 11. Reputation Loss QUALITY LOSS Information Loss - Missing Fields - Dead Links - Inaccurate Information - False Information - Outdated Values - Missing Information - Bad Spelling - Non-Schema Compliant Bad Searchability Unreliable Untrustworthy
  12. 12. Meta·da·ta Qual·i·ty /ˈmɛtədeɪtə kwɒlɪti/ The fitness to describe the data (resources), supporting the task dimensions of finding, identifying, selecting and eventually obtaining the resources. The quality is inversely proportional to the uncertainty of the user about the actual data.
  13. 13. Assessing Metadata Quality is HARD. Highly Subjective. Metadata Resource ? 1. Manual 2. Automated Wrong Qualified Process Principles + Guidelines Postulated as being not feasible anymore due to the large number of metadata records. - Algorithms? - Procedures? - Oracle? - Machine Learning?
  14. 14. Automated Quality Assessment. Empirical Analysis + Visual Aid - Field Usage - Field Values Framework - Based on Information Quality - Three Dimensions: - Intrinsic - Relational / Contextual - Reputational - Evaluation Criteria - Completeness - Accuracy - Provenance - Logical Consistency - Timeliness …
  15. 15. QUALITY METRICS
  16. 16. $q_m : \mathit{record}_t \longrightarrow V \in [0, 1]$ Measurement. Assigning a symbolic value to an object to enable the characterization of a certain attribute of that object. Process P Quality. Complex Attribute. No single measure. Highly Subjective. Use of Proxies.
  17. 17. Completeness. How many fields have been completed? Record contains all the information required to have an ideal representation of the described resource. Metadata Record. Name: uk-civil-service-high-earners; ID: 68addaac-59ae-4230-bb67-c5a8f6a76285; Maintainer: (empty); Maintainer Email: (empty); Author: Civil Service Capability Group; Author Email: webmaster@cabinet-office.x.gsi.gov.uk; License ID: uk-ogl. Resources. (1) Size: 40959; Description: Civil Servants Salaries 2010; Format: CSV. (2) Size: (empty); Description: Civil Servants Salaries 2011; Format: CSV.
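The completeness idea on this slide can be sketched in a few lines of Python. The slide does not give a formula, so this assumes the simplest reading: the share of expected fields that hold a non-empty value. The field list `FIELDS` and the lower-case key names are hypothetical and chosen only to mirror the record shown above.

```python
from typing import Any, Mapping, Sequence


def is_filled(value: Any) -> bool:
    """Treat None, blank strings and empty containers as 'not completed'."""
    if value is None:
        return False
    if isinstance(value, str):
        return value.strip() != ""
    if isinstance(value, (list, dict, tuple, set)):
        return len(value) > 0
    return True


def completeness(record: Mapping[str, Any], fields: Sequence[str]) -> float:
    """Fraction of the expected fields that are filled, in [0, 1]."""
    if not fields:
        return 0.0
    filled = sum(1 for name in fields if is_filled(record.get(name)))
    return filled / len(fields)


# Hypothetical field list (an assumption, not the schema used by the authors).
FIELDS = ["name", "id", "maintainer", "maintainer_email",
          "author", "author_email", "license_id"]

record = {
    "name": "uk-civil-service-high-earners",
    "id": "68addaac-59ae-4230-bb67-c5a8f6a76285",
    "maintainer": "",
    "maintainer_email": "",
    "author": "Civil Service Capability Group",
    "author_email": "webmaster@cabinet-office.x.gsi.gov.uk",
    "license_id": "uk-ogl",
}

score = completeness(record, FIELDS)  # 5 of the 7 fields are filled
```

Nested resource fields (such as the missing `Size`) are left out here; a fuller version would have to decide how to count them.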
  18. 18. Weighted Completeness. Not all fields are equally relevant. Weight value $w_i$ expresses the relative importance of field $i$. Metadata Record. Name: uk-civil-service-high-earners; ID: 68addaac-59ae-4230-bb67-c5a8f6a76285; Maintainer: (empty); Maintainer Email: (empty); Author: Civil Service Capability Group; Author Email: webmaster@cabinet-office.x.gsi.gov.uk; License ID: uk-ogl. Resources. (1) Size: 40959; Description: Civil Servants Salaries 2010; Format: CSV. (2) Size: (empty); Description: Civil Servants Salaries 2011; Format: CSV.
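A weighted variant might look like the sketch below. The slide only states that a weight expresses the relative importance of a field, so the normalization (sum of the weights of filled fields divided by the sum of all weights) is an assumption, and the weights in `WEIGHTS` are invented for illustration.

```python
from typing import Any, Mapping


def is_filled(value: Any) -> bool:
    """A field counts as filled unless it is None, blank or an empty container."""
    if isinstance(value, str):
        return value.strip() != ""
    return value not in (None, [], {}, ())


def weighted_completeness(record: Mapping[str, Any],
                          weights: Mapping[str, float]) -> float:
    """Sum of weights of filled fields divided by the sum of all weights."""
    total = sum(weights.values())
    if total <= 0:
        return 0.0
    achieved = sum(w for name, w in weights.items()
                   if is_filled(record.get(name)))
    return achieved / total


# Hypothetical weights: name and license matter most, contact details least.
WEIGHTS = {"name": 3, "license_id": 3, "author": 2,
           "maintainer": 1, "maintainer_email": 1}

record = {
    "name": "uk-civil-service-high-earners",
    "license_id": "uk-ogl",
    "author": "Civil Service Capability Group",
    "maintainer": "",
    "maintainer_email": "",
}

score = weighted_completeness(record, WEIGHTS)  # (3 + 3 + 2) / 10
```

With equal weights this reduces to plain completeness, which is one reason to prefer this normalization in a sketch.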
  19. 19. Accuracy. How accurately is the resource represented? Semantic distance $d_i$. Difference between the information a user can extract from the record and the resource. Metadata Record. Name: regional-household-income; ID: 98899446-0a1a-43bc-874c-2d54dc700670; Maintainer: (empty); Maintainer Email: (empty); Author: Office for National Statistics; Author Email: (empty); License ID: uk-ogl. Resources. (1) URL: http://www.ons.gov.uk/ons/rhi13; Description: Spring 2013; Format: CSV. (2) URL: http://www.ons.gov.uk/ons/rhi14; Description: (empty); Format: CSV. CSV HTML
  20. 20. Richness of Information. How much value is added? $q_i(\mathit{record}) = \frac{\sum_{i=1}^{n} I(\mathit{field}_i)}{n}$ Vocabulary terms and descriptions should be meaningful. Information should be unique and not redundant. Figure labels: $m$ Number of Documents, $n$ Number of Words.
  21. 21. Readability. How readable are the descriptions? $q_r(\mathit{record}) = 206.835 - 1.015 \cdot \frac{\mathit{words}}{\mathit{sentences}} - 84.6 \cdot \frac{\mathit{syllables}}{\mathit{words}}$ Readable in terms of cognitive accessibility. Flesch-Kincaid Reading Ease
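The reading-ease formula on this slide is straightforward to compute once words, sentences and syllables are counted. The sketch below uses the standard Flesch constant 206.835 and a deliberately naive syllable counter (runs of vowels), which is an assumption for illustration only; real syllable counting for English is considerably harder.

```python
import re


def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease from raw counts; returns 0.0 for empty input."""
    if words == 0 or sentences == 0:
        return 0.0
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def count_syllables(word: str) -> int:
    """Naive heuristic: one syllable per run of vowels, at least one per word."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def readability(text: str) -> float:
    """Score a description string with the naive counters above."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return flesch_reading_ease(len(words), len(sentences), syllables)


example = readability("The cat sat.")
```

The raw score is not confined to [0, 1], so it would have to be normalized before it fits the metric signature used elsewhere in the slides; the slides do not say how that is done.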
  22. 22. Availability. Are the links working? $q_{av}(\mathit{record}) = \frac{\sum_{i=1}^{n} a_i}{n}$ Metadata only links to the resources. Without working links the actual data is not available. $a_i$ is true if the $i$th resource is reachable through the URL.
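As a rough illustration of the availability metric, the sketch below averages a reachability check over the resource URLs of one record. Using an HTTP HEAD request and accepting any 2xx or 3xx status is an assumption, not necessarily what Metadata Census does; the checker is injectable so the metric can be exercised without network access.

```python
import http.client
import urllib.error
import urllib.request
from typing import Callable, Iterable


def is_reachable(url: str, timeout: float = 5.0) -> bool:
    """Send a HEAD request and report whether the server answered with 2xx/3xx."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, http.client.HTTPException, ValueError, OSError):
        return False


def availability(urls: Iterable[str],
                 check: Callable[[str], bool] = is_reachable) -> float:
    """Fraction of resource URLs for which check(url) is true."""
    url_list = list(urls)
    if not url_list:
        return 0.0
    return sum(1 for url in url_list if check(url)) / len(url_list)


# Offline example with a stand-in checker (hypothetical: only rhi13 'works').
resource_urls = ["http://www.ons.gov.uk/ons/rhi13",
                 "http://www.ons.gov.uk/ons/rhi14"]
score = availability(resource_urls, check=lambda url: url.endswith("rhi13"))
```

Some servers reject HEAD requests, so a production checker would probably fall back to GET before declaring a link dead.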
  23. 23. Implementation. Metadata Census
  24. 24. REQUIREMENTS. Functional: Metadata Harvester, Schemaless Data Store, Quality Metrics, Visualization, Leaderboard. Non-functional: Scalability, Extensibility.
  25. 25. DESIGN. Repository: + url : String, + name : String, + type : Symbol. Snapshot: + date : Date. MetaMetadata: + metadata_record : Hash, + score : Float, + statistics : Hash, + completeness : Hash, + weighted_completeness : Hash, + richness_of_information : Hash, ... Also shown: + latitude : String, + longitude : String, + best_record() : MetaMetadata, + worst_record() : MetaMetadata, + score() : Float. Multiplicities: 0..*, 1..*.
  26. 26. CompletenessMetric WeightedCompleteness <<Interface>> Metric + compute(record) MetricWorker + perform(snapshot, metric) GenericMetricWorker CompletenessMetricWorker OpennessMetric <<use>> <<use>> <<use>>
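The class diagram on this slide suggests a small plug-in structure: every metric implements `compute(record)`, and a worker applies one metric to all records of a snapshot. The Python sketch below mirrors those names, but the bodies are assumptions: a snapshot is modeled as a plain list of record dictionaries, and the worker simply averages the per-record scores.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Sequence

Record = Dict[str, Any]


class Metric(ABC):
    """Interface from the diagram: maps one metadata record to a score."""

    @abstractmethod
    def compute(self, record: Record) -> float:
        ...


class CompletenessMetric(Metric):
    """Share of the expected fields that hold a non-empty value."""

    def __init__(self, fields: Sequence[str]) -> None:
        self.fields = list(fields)

    def compute(self, record: Record) -> float:
        if not self.fields:
            return 0.0
        filled = sum(1 for name in self.fields
                     if record.get(name) not in (None, "", [], {}))
        return filled / len(self.fields)


class GenericMetricWorker:
    """Applies a metric to every record of a snapshot and averages the scores."""

    def perform(self, snapshot: List[Record], metric: Metric) -> float:
        scores = [metric.compute(record) for record in snapshot]
        return sum(scores) / len(scores) if scores else 0.0


snapshot = [
    {"name": "regional-household-income", "author": "Office for National Statistics"},
    {"name": "uk-civil-service-high-earners", "author": ""},
]
result = GenericMetricWorker().perform(snapshot, CompletenessMetric(["name", "author"]))
```

Keeping the worker ignorant of the concrete metric is what would let new metrics be added without touching the scheduling code, which matches the extensibility requirement named earlier.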
  27. 27. Metadata Harvester JSON JSON JSON Archives API Requests Records
  28. 28. Imports Persist Metadata Census Metadata Harvester JSON JSON JSON Archives API Requests Records Preliminary Analyzer Dump Importer Database
  29. 29. Imports Persist Metadata Census Metadata Harvester JSON JSON JSON Archives API Requests Records Metric Processor Query Records Scheduler Analyzer Preliminary Analyzer Dump Importer Database
  30. 30. View User Generates Investigates Imports Persist Metadata Census Metadata Harvester JSON JSON JSON Archives API Requests Records Metric Processor Query Records Scheduler Analyzer Preliminary Analyzer Dump Importer Database
  31. 31. Open Government Data. Evaluation
  32. 32. Implementation focused exclusively on CKAN repositories.
  33. 33. Rank | Repository | Score | Misspelling | Richness of Information | Openness | Completeness | Availability | Weighted Completeness | Readability | Accuracy
      1 | data.gc.ca | 74 | 97 | 86 | 80 | 79 | 79 | 81 | 71 | 20
      2 | data.sa.gov.au | 71 | 98 | 63 | 94 | 77 | 86 | 82 | 72 | 0
      3 | GovData.de | 67 | 99 | 4 | 38 | 55 | 81 | 87 | 79 | 56
      4 | data.qld.gov.au | 66 | 99 | 67 | 96 | 73 | 60 | 78 | 59 | 0
      4 | PublicData.eu | 66 | 98 | 84 | 69 | 64 | 70 | 67 | 42 | 32
      4 | data.gov.uk | 66 | 97 | 85 | 69 | 62 | 74 | 67 | 44 | 28
      4 | africaopendata.org | 66 | 100 | 20 | 78 | 70 | 87 | 68 | 55 | 53
      5 | datos.codeandomexico.org | 65 | 100 | 55 | 84 | 65 | 100 | 75 | 37 | 0
      6 | catalogodatos.gub.uy | 63 | 100 | 64 | 1 | 70 | 74 | 78 | 65 | 52
      6 | data.openpolice.ru | 63 | 100 | 0 | 0 | 58 | 100 | 81 | 100 | 64
      7 | dados.gov.br | 61 | 100 | 87 | 36 | 53 | 57 | 72 | 44 | 39
      8 | opendata.admin.ch | 59 | 100 | 12 | 0 | 58 | 100 | 68 | 35 | 100
      9 | data.gv.at | 57 | 100 | 21 | 99 | 51 | 68 | 65 | 59 | 0
      10 | data.gov.sk | 49 | 100 | 51 | 0 | 48 | 92 | 58 | 37 | 7
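The rank column of this leaderboard repeats a rank for tied scores and continues with the next integer afterwards (four repositories share rank 4, followed by rank 5). That pattern is commonly called dense ranking. The sketch below reproduces it from the overall scores; how the overall score itself is aggregated from the individual metrics is not stated on the slide and is not modeled here.

```python
from typing import Dict, List, Tuple


def leaderboard(scores: Dict[str, int]) -> List[Tuple[int, str, int]]:
    """Return (rank, repository, score) rows, ties sharing a rank (dense ranking)."""
    # sorted() is stable, so tied repositories keep their input order.
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    rows: List[Tuple[int, str, int]] = []
    rank = 0
    previous = None
    for name, score in ordered:
        if score != previous:
            rank += 1
            previous = score
        rows.append((rank, name, score))
    return rows


# Overall scores taken from the first rows of the leaderboard slide.
rows = leaderboard({
    "data.gc.ca": 74,
    "data.sa.gov.au": 71,
    "GovData.de": 67,
    "data.qld.gov.au": 66,
    "PublicData.eu": 66,
    "data.gov.uk": 66,
    "africaopendata.org": 66,
    "datos.codeandomexico.org": 65,
})
```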
  34. 34. Conclusion
  35. 35. What is good about this approach? Metadata quality is quantified, but every quality aspect on its own. Metric scores are aggregated to make it comparable. Every additional quality metric is supposed to complete the quality puzzle. Automated — Generic — Quantifiable — Repeatable
  36. 36. Platform has the advantage that it acts as a beacon... If your metadata breaks bad everyone will see it.
  37. 37. What is not so good about this approach? - Lacks a sufficient number of quality metrics - No empirical analysis beforehand - Overvalues problems with the metadata. More quality metrics are necessary. Current metrics need to consider more special cases in the metadata records.
  38. 38. Final Thought. Do not aim for excellence, aim for low-quality metadata.
  39. 39. Quality Feed. Monitor metadata changes live and record changes in a timeline. Repository Support. Other repository software offers public APIs as well, Socrata being the most prominent. More Quality Metrics: - Duplicate Detection - Discoverability - Coherence - Advancement - Reputation
  40. 40. Metadata Revision System. Avoid storing whole snapshots, but the change set. Domain-Specific Language. Make it even easier to add individual quality metrics.
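The revision-system idea (store the change set instead of whole snapshots) could be prototyped roughly as follows. This is only a sketch under simple assumptions: records are flat dictionaries, and the names `change_set` and `apply_change_set` are hypothetical, not part of Metadata Census.

```python
from typing import Any, Dict

Record = Dict[str, Any]
Changes = Dict[str, Dict[str, Any]]


def change_set(old: Record, new: Record) -> Changes:
    """Describe how to get from the old record version to the new one."""
    added = {key: new[key] for key in new if key not in old}
    removed = {key: old[key] for key in old if key not in new}
    changed = {key: (old[key], new[key])
               for key in old if key in new and old[key] != new[key]}
    return {"added": added, "removed": removed, "changed": changed}


def apply_change_set(old: Record, changes: Changes) -> Record:
    """Rebuild the new record version from the old one plus its change set."""
    result = {key: value for key, value in old.items()
              if key not in changes["removed"]}
    result.update(changes["added"])
    result.update({key: pair[1] for key, pair in changes["changed"].items()})
    return result


old = {"name": "regional-household-income", "maintainer": "Margaret Jarmon",
       "license_id": "uk-ogl"}
new = {"name": "regional-household-income", "maintainer": "",
       "license_id": "uk-ogl", "author": "Office for National Statistics"}

delta = change_set(old, new)
restored = apply_change_set(old, delta)
```

Because each change set records both the old and the new value, the same data could also feed the quality timeline mentioned on the previous slide.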
  41. 41. DEMO: metadata-census.com
