DOIs, provenance & vocabularies - Nicholas Car (CSIRO)
Presented at the ANDS facilitated GeoNetwork Community of Practice on April 3rd, 2017 in Canberra.
1. DOIs, provenance & vocabularies
Nicholas Car
Data Architect
nicholas.car@ga.gov.au
DOIs, Provenance & Vocabs
2. Outline
Three different extensions to regular GeoNetwork (GN) use:
1. DOI and other identifier use
2. Provenance formulation and recording
3. Vocabulary use
6. DOIs and other identifiers
• GN uses UUIDs for records
• Strengths:
• Universally unique so:
• Able to be generated by or outside GN
• Transferable
• Indefinitely stable
Example: Alex can generate catalogue records using custom code and post them into GA’s eCat. He generates the UUIDs himself, rather than having eCat do it, so that he knows what they are before submission.
Example: Jingbo can move records between catalogues at the NCI and still use the same UUIDs for them.
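Generating a UUID outside the catalogue, as in Alex's case, can be sketched as follows. This is a minimal Python illustration, not GA's actual code: the `build_record` helper and its dictionary layout are hypothetical, and the submission step to eCat is left out.

```python
import uuid


def build_record(title: str) -> dict:
    """Build a hypothetical catalogue record with a client-generated UUID."""
    # uuid4() creates a random (version 4) UUID without contacting any catalogue.
    record_id = str(uuid.uuid4())
    return {"uuid": record_id, "title": title}


record = build_record("Example dataset")
# The identifier is known before the record is submitted anywhere.
print(record["uuid"])
```

Because the identifier does not depend on the catalogue that stores the record, the same value can travel with the record when it moves between catalogues.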
8. DOIs and other identifiers
• GN uses UUIDs for records
• Weaknesses:
• Not meaningful
• Not part of an identifier scheme
• Not resolvable by themselves
data.gov.au, which does not use GN, provides both UUIDs and meaningful aliases for datasets, e.g. “Offshore reconnaissance geophysical techniques”:
http://data.gov.au/dataset/cdecf261-84a7-4911-a645-2d7113e97d0b
http://data.gov.au/dataset/offshore-reconnaissance-geophysical-techniques
9. DOIs and other identifiers
• What are DOIs?
• Persistent identifiers used to uniquely identify digital objects, standardised by ISO (ISO 26324)
• Built on the Handle System: highly persistent
• Popular and widely understood
• Many convenience resolver services exist, e.g. https://doi.org/{DOI} (https://doi.org/10.4225/25/58a3ff6e07d21)
• IGSNs are another DOI-like identifier
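The resolver pattern https://doi.org/{DOI} can be illustrated with a short Python sketch. The function names are made up for this example. The only facts relied on are the resolver prefix shown on the slide and the rule that a DOI consists of a prefix and a suffix separated by the first slash.

```python
DOI_RESOLVER = "https://doi.org/"


def doi_to_url(doi: str) -> str:
    """Return the resolver URL for a bare DOI."""
    return DOI_RESOLVER + doi


def split_doi(doi: str) -> tuple[str, str]:
    """Split a DOI into its prefix and suffix at the first slash."""
    prefix, suffix = doi.split("/", 1)
    return prefix, suffix


doi = "10.4225/25/58a3ff6e07d21"
print(doi_to_url(doi))  # https://doi.org/10.4225/25/58a3ff6e07d21
print(split_doi(doi))   # ('10.4225', '25/58a3ff6e07d21')
```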
10. DOIs and other identifiers
• GA uses DOIs for important datasets and our own eCat IDs for all datasets, e.g.:
• “Radiometric Thorium Equivalent grid of Warrachie, SA”
• UUID: 64af9ff3-71dd-431a-bc94-9d2280acef79
• eCat ID: 106850
• Our landing page: http://www.ga.gov.au/metadata-gateway/metadata/record/106850
• DOI: https://doi.org/10.4225/25/58a3ff6e07d21
11. GA’s DOI directions
• eCat IDs will remain our authoritative IDs
• Due to their embedded presence & simplicity
• GN is configured to mint them
• We will promote eCat IDs & other IDs like DOIs, not UUIDs
• GN landing page’s “Permalink” button will reveal a DOI
• If one exists for the record
• If not, an eCat-based URI including the eCat ID
• UUIDs will only be used under the hood
• For GN functions like crosslinks
• We may support other ID schemes in the future, like IGSNs
• We require architecture outside GN for URI ID redirection
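The “Permalink” rule above (a DOI if the record has one, otherwise an eCat-based URI) can be sketched in Python. This is an illustration only, not GN or GA code: the `permalink` function is hypothetical, and the URI template is an assumption copied from the landing page example for eCat ID 106850.

```python
from typing import Optional

# Assumed URI pattern, taken from the landing page example for eCat ID 106850.
ECAT_URI_TEMPLATE = "http://www.ga.gov.au/metadata-gateway/metadata/record/{ecat_id}"


def permalink(ecat_id: int, doi: Optional[str] = None) -> str:
    """Prefer the DOI when a record has one, else fall back to the eCat URI."""
    if doi:
        return "https://doi.org/" + doi
    return ECAT_URI_TEMPLATE.format(ecat_id=ecat_id)


# Record with a DOI: the DOI resolver URL is returned.
print(permalink(106850, "10.4225/25/58a3ff6e07d21"))
# Same eCat ID treated as if it had no DOI: the eCat-based URI is returned.
print(permalink(106850))
```

The UUID never appears in either branch, which matches the intent to keep UUIDs under the hood.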
15. GA’s provenance model
• We use PROV, the W3C provenance data model
• We do not use ISO 19115 Lineage because it is:
• Designed for satellite data processing
• Limited to the history of the catalogued item only
• Not a database or graph model (de-normalised with respect to many objects)
16. GA’s provenance model
• Some provenance is stored in our GN eCat
• We also link across multiple systems
• Example: links between GN and ARGUS
• Datasets linked to surveys’ metadata online
17. GA’s provenance model
• We have had to define our dataset-to-dataset provenance relationships in ISO 19115:
• PROV: wasDerivedFrom
• ISO 19115-1: AssociationTypeCode dependency
• PROV: wasRevisionOf
• ISO 19115-1: AssociationTypeCode revisionOf
• PROV: hadPrimarySource
• ISO 19115-1: AssociationTypeCode source
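The three pairings listed above amount to a small lookup table. The Python sketch below records exactly those pairings. The function name and the error handling are illustrative choices and do not describe how GA implements the mapping.

```python
# PROV relationship -> ISO AssociationTypeCode value, as listed on the slide.
PROV_TO_ISO = {
    "wasDerivedFrom": "dependency",
    "wasRevisionOf": "revisionOf",
    "hadPrimarySource": "source",
}


def to_iso_association(prov_relation: str) -> str:
    """Translate a PROV dataset-to-dataset relation to its ISO code value."""
    try:
        return PROV_TO_ISO[prov_relation]
    except KeyError:
        # Relations without a listed pairing are rejected rather than guessed.
        raise ValueError(f"No ISO mapping defined for {prov_relation!r}") from None


print(to_iso_association("wasRevisionOf"))  # revisionOf
```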
18. GA’s provenance model
• We can also have dataset-to-other-thing relationships
• ARGUS example:
• PROV: Dataset prov:wasGeneratedBy Activity
• ISO 19115-1: Dataset ? Activity (not in GN)
20. Vocabularies
• Items in GN are stored with keywords and the thesaurus they come from:
<mri:descriptiveKeywords>
  <mri:MD_Keywords>
    <mri:keyword>
      <gco:CharacterString>Offshore Areas</gco:CharacterString>
    </mri:keyword>
    <mri:type>
      <mri:MD_KeywordTypeCode
          codeList="http://asdd.ga.gov.au/asdd/profileinfo/gmxCodelists.xml#MD_KeywordTypeCode"
          codeListValue="theme">theme</mri:MD_KeywordTypeCode>
    </mri:type>
  </mri:MD_Keywords>
</mri:descriptiveKeywords>
21. Vocabularies
• Items in GN are stored with keywords and the thesaurus they come from:
<mri:descriptiveKeywords>
  <mri:MD_Keywords>
    <mri:keyword>
      <gco:CharacterString>Earth Sciences</gco:CharacterString>
    </mri:keyword>
    <mri:thesaurusName>
      <cit:CI_Citation>
        <cit:title>
          <gco:CharacterString>Australian and New Zealand Standard Research Classification (ANZSRC)</gco:CharacterString>
        </cit:title>
        ...
22. Vocabularies
• Items in GN are stored with keywords and the thesaurus they come from:
...
<cit:CI_OnlineResource>
  <cit:linkage>
    <gco:CharacterString>http://www.abs.gov.au/ausstats/abs@.nsf/mf/1297.0</gco:CharacterString>
  </cit:linkage>
</cit:CI_OnlineResource>
...
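Reading keywords together with their thesaurus title from XML like the fragments above can be sketched with Python's standard library. The sample document is a simplified, closed-off version of the slide fragments. The namespace URIs are assumptions in the ISO 19115-3 style and should be checked against real records, and `keywords_with_thesaurus` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs (ISO 19115-3 style); verify against your own records.
NS = {
    "mri": "http://standards.iso.org/iso/19115/-3/mri/1.0",
    "cit": "http://standards.iso.org/iso/19115/-3/cit/1.0",
    "gco": "http://standards.iso.org/iso/19115/-3/gco/1.0",
}

SAMPLE = """
<mri:descriptiveKeywords
    xmlns:mri="http://standards.iso.org/iso/19115/-3/mri/1.0"
    xmlns:cit="http://standards.iso.org/iso/19115/-3/cit/1.0"
    xmlns:gco="http://standards.iso.org/iso/19115/-3/gco/1.0">
  <mri:MD_Keywords>
    <mri:keyword>
      <gco:CharacterString>Earth Sciences</gco:CharacterString>
    </mri:keyword>
    <mri:thesaurusName>
      <cit:CI_Citation>
        <cit:title>
          <gco:CharacterString>Australian and New Zealand Standard Research Classification (ANZSRC)</gco:CharacterString>
        </cit:title>
      </cit:CI_Citation>
    </mri:thesaurusName>
  </mri:MD_Keywords>
</mri:descriptiveKeywords>
"""


def keywords_with_thesaurus(xml_text: str) -> list:
    """Return (keyword, thesaurus title or None) pairs from keyword blocks."""
    root = ET.fromstring(xml_text.strip())
    results = []
    # iter() visits the root and all descendants, so nested blocks are found too.
    for block in root.iter("{%s}MD_Keywords" % NS["mri"]):
        title = block.findtext(
            "mri:thesaurusName/cit:CI_Citation/cit:title/gco:CharacterString",
            default=None,
            namespaces=NS,
        )
        for kw in block.findall("mri:keyword/gco:CharacterString", NS):
            results.append((kw.text.strip(), title.strip() if title else None))
    return results


print(keywords_with_thesaurus(SAMPLE))
```

A keyword block without a `thesaurusName`, like the "Offshore Areas" example, would yield `None` as its thesaurus title.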
23. Vocabularies
• GA is moving to online SKOS-based vocabs for all code lists
• E.g. “GA Data Classification”
• Broad GA categorisation for all data
• Will be compulsory, as ANZSRC is, and enforced by GN
• Records can also use specialised terms from other vocabs
• GN will offer term selection
• Live from the online vocab, not from stored XML
24. Vocabularies
• We are keen to work with others testing GN/SPARQL service integration
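As a rough idea of what a term-selection lookup against a SPARQL service might involve, the Python sketch below builds a query for SKOS concepts by label and wraps it in a GET URL using the standard `query` parameter of the SPARQL Protocol. The endpoint address and function names are hypothetical, no request is actually sent, and this does not describe how GN integrates with such a service.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; substitute the SPARQL service of a real vocab server.
ENDPOINT = "http://example.org/sparql"


def label_search_query(text: str, limit: int = 20) -> str:
    """Build a SPARQL query for SKOS concepts whose prefLabel contains `text`."""
    # Escape backslashes and double quotes so the text is a safe string literal.
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "PREFIX skos: <http://www.w3.org/2004/02/skos/core#>\n"
        "SELECT ?concept ?label WHERE {\n"
        "  ?concept a skos:Concept ;\n"
        "           skos:prefLabel ?label .\n"
        f'  FILTER(CONTAINS(LCASE(STR(?label)), LCASE("{escaped}")))\n'
        "}\n"
        f"LIMIT {int(limit)}"
    )


def request_url(text: str) -> str:
    """Return a GET URL that carries the query in the `query` parameter."""
    return ENDPOINT + "?" + urlencode({"query": label_search_query(text)})


print(label_search_query("offshore"))
print(request_url("offshore"))
```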
25. Vocabularies
• Remediation of existing keywords is anticipated:
• Automated keyword (KW) testing for term tidy-up
• Text mining of abstracts with Natural Language Processing to add to KWs
• Bulk addition, based on business knowledge of record data
• E.g. thematic tagging based on GA section
26. Vocabularies
• Further keyword remediation: reverse vocab application
• Existing free-text terms matched to vocab terms
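Matching existing free-text keywords against vocabulary labels could, in its simplest form, look like the Python sketch below. It only does case- and whitespace-insensitive exact matching. The vocabulary entries and URIs are placeholders, and the function names are hypothetical. Real remediation would likely need fuzzier matching and human review.

```python
def normalise(term: str) -> str:
    """Lower-case a term and collapse runs of whitespace."""
    return " ".join(term.lower().split())


def match_keywords(free_text_terms, vocab_labels):
    """Split free-text terms into (matched, unmatched) against vocab labels.

    `vocab_labels` maps a preferred label to its concept URI.
    """
    index = {normalise(label): uri for label, uri in vocab_labels.items()}
    matched, unmatched = {}, []
    for term in free_text_terms:
        uri = index.get(normalise(term))
        if uri is not None:
            matched[term] = uri
        else:
            unmatched.append(term)
    return matched, unmatched


# Placeholder vocabulary entries; the URIs are not real.
vocab = {
    "Offshore Areas": "http://example.org/vocab/offshore-areas",
    "Earth Sciences": "http://example.org/vocab/earth-sciences",
}
matched, unmatched = match_keywords(["offshore  areas", "Geophysics"], vocab)
print(matched)    # {'offshore  areas': 'http://example.org/vocab/offshore-areas'}
print(unmatched)  # ['Geophysics']
```

The unmatched list is the useful by-product: it shows which free-text terms still need a vocabulary home.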
27. Vocabularies
• We will be registering our vocabs themselves as datasets in eCat!
29. Afterword
• Lots of extension work is under way at GA using GN
• Inter-system linking is growing
• Semantic richness beyond ISO 19115 is growing
• GN will remain the only catalogue system for the foreseeable future
• Other GN initiatives at GA are a topic for another CoP meeting!