[Unclear] words are denoted in square brackets
#4 FAIR Data Principles – R is Reusable
20 September 2017
Video & slides available from ANDS website
START OF TRANSCRIPT
Keith Russell: Welcome to the fourth webinar in this series.
My name is Keith Russell from the Australian National Data Service and
I am your host for today. My colleague Susannah Sabine is behind the
scenes co-hosting the webinar with me.
The Australian National Data Service works with research organisations
around Australia to establish trusted partnerships, provide reliable
services to add value to research data and enhance the capability in
the research sector.
We work together with two other NCRIS funded projects RDS
(Research Data Services) and Nectar to create an aligned set of joint
investments to deliver transformation in the research sector.
This webinar is part of a series of ANDS activities which aim to support
the Australian research community in increasing our ability to manage
our research data as a national asset.
This is the fourth and final in a series of webinars on the FAIR data
principles. We have had webinars on Findable, Accessible,
Interoperable. Today we will talk about making data re-usable,
according to the FAIR data principles.
Today I will kick off with an introduction to what the Force11 FAIR data
principles say about making your research data re-usable.
Then we will have two speakers that will talk about how you can take
these principles and apply these in practice.
First we will have Nerida Quatermass from Creative Commons
Australia who will provide more information on using the Creative
Commons Licensing framework and things to think about when
choosing a licence.
After that Margie Smith from Geoscience Australia will present on the
work that they have been doing on attaching provenance information to
research data.
These are the elements that Force11 has described for making
your research data re-usable.
First of all it is important to note that the other elements under FAIR
(Findable, Accessible, Interoperable) are also really important to make
data re-usable.
For example, if nobody can find the data, it will not be re-used.
The first high level heading is that the data and the metadata should
have a plurality of accurate and relevant attributes. Under this heading
they have described three elements that these attributes should cover.
1) Number one is that the data and the metadata should be released
with a clear and accessible licence for the data. Making data available
but not assigning any licence makes the data really hard to re-use: it is
completely unclear to a re-user what they can actually do with the data.
If you attach a licence make sure that it is in a machine readable format.
That way machines can access the data and know whether it can be
used for analysis.
Nerida will explain about a possible framework to use to assign a
licence to the data.
2) Number two is that the data and the metadata are associated with
provenance information on how the data was created. This provides
clarity on the steps that were taken in collecting, selecting and analysing
the data, turning it from raw data into derived data and finally into the final
data set. This is extremely useful information if you want to re-use the
data, as it provides context and gives you background on whether the
data will also be suitable for your purposes.
Attaching provenance information is easier said than done and I am
very grateful that Margie Smith is willing to present on how they have
picked up this challenge at Geoscience Australia.
3) The final point is that the data and the metadata should meet
domain-relevant community standards. For example, the data is best in a data
format and file format that is commonly used in the discipline so it is easy
for another researcher in that discipline to pick up and use. Also use a
metadata format that is common in that discipline, as that often contains
specific fields that are relevant to that discipline and help a researcher
in that field to quickly understand the potential re-use of the data set.
I would now like to ask our two speakers to talk in more detail about
aspects that are relevant for making the data re-usable.
First we will have Nerida Quatermass from Creative Commons
Australia, based at QUT.
She will present on the Creative Commons Licensing framework and
considerations on using these licences.
Next we will have Margie Smith from Geoscience Australia who will
present on how GA has attached provenance information to data.
We will save up questions till the end of the webinar. But please feel
free to type them in the question box as we go along.
Nerida Quatermass: Copyright law grants the monopoly over a work in material form to the
“owner” of it. CC licences have filled a need for a public licence, i.e. one
that anybody can rely on as permission to re-use a work. Before CC
licences, the only way to get re-use rights was by exceptions allowed
in copyright law or licences directly negotiated between a copyright
owner and a licensee.
Public licences like CC are central to opening up access to research
output, including sharing of data associated with these.
I’ve put an open access spectrum there because it’s really important
to distinguish between free access and re-usability which starts with
permission to share, and extends to the right to make derivative
works.
These permissions to re-use are communicated with a clear machine
readable licence.
You probably all know about CC licences. But as an overview:
Four licence elements can be combined, resulting in six CC licenses.
They are featured in this slide on a spectrum of allowing more to less
re-use of a work.
The most open or permissive licence is Attribution, the most restrictive
is Attribution-Noncommercial-NoDerivatives.
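As an illustration of how four elements give six licences (a sketch that was not shown in the webinar): it assumes Attribution (BY) is present in every licence, NonCommercial (NC) is optional, and ShareAlike (SA) and NoDerivatives (ND) are mutually exclusive. The function name is hypothetical.

```python
from itertools import product

# Assumption: BY is always present, NC is optional, and SA / ND are
# mutually exclusive ways of constraining adaptations.
COMMERCIAL_OPTIONS = ("", "NC")
ADAPTATION_OPTIONS = ("", "SA", "ND")


def cc_licences():
    """Return the six CC licence codes built from BY, NC, SA and ND."""
    codes = []
    for nc, adapt in product(COMMERCIAL_OPTIONS, ADAPTATION_OPTIONS):
        # Drop the empty strings that stand for "element not used".
        parts = ["BY"] + [p for p in (nc, adapt) if p]
        codes.append("CC " + "-".join(parts))
    return codes


if __name__ == "__main__":
    for code in cc_licences():
        print(code)
```

Two commercial options times three adaptation options gives the six combinations, from CC BY through to CC BY-NC-ND.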
The “free cultural Works seal” was developed for Wikimedia content. It
signals an important delineation between less and more restrictive
licenses applied to works in the digital commons. It distinguishes non-
software works.
In addition to the licenses, CC offers two public domain tools.
CC0, the public domain dedication tool, is for creators, and
the Public Domain (PD) Mark is used to indicate works that are already in the
public domain. The PD Mark is used commonly by cultural heritage institutions
in digital collections; its icon is a C with a strikethrough.
CC0 can be particularly important to maximize the re-use of data and
databases because it otherwise may be unclear whether highly factual
data and databases are restricted by copyright or other rights. CC0 is
intended to cover all copyright and database rights, so however data
and databases are restricted under copyright or otherwise, those
rights are all surrendered.
It is foremost a waiver. It means you waive all of YOUR rights so that
you have zero rights left in a work, effectively dedicating it to the
public domain.
It has a legal code beneath it, because you need a legal mechanism
to relinquish your rights.
When you release content under CC Zero, you are explicitly stating
that you do not expect attribution. There's a little bit of uncertainty
around CC0 because Australian moral rights are fairly new, but the
licences are designed as carefully as possible to respect the author's
wishes.
The main points are:
Do license your data – international rules are too variable to rely on
public domain
CC0 – ensures maximum compatibility with other licensed works and
prevents attribution stacking (e.g. attributing many in a project; the
immediate source of a derivative work plus upstream works – there are
other ways to acknowledge contribution)
Next best is CC BY – if you really want attribution to be a legal requirement.
The licences communicate re-use rights through the three-layer
design:
1. The Legal Code is the legal instrument which states the terms and
conditions of the licence
2. Human Readable format is a plain-language summary of the
licence, with relevant icons to clearly indicate the conditions of licensing
and the re-use rights under the licence: You are free to… under the
following terms
In addition to supporting its reuse by individuals the FAIR Principles
put specific emphasis on enhancing the ability of machines to
automatically find and use the data- bringing us to the third layer:
A machine-readable translation of the licence attaches to digital works
or digital copies of work. The translation code (rights expression
language) becomes embedded in the digital source, which helps
search engines and other applications identify a work. This can also
be achieved by uploading a work to a content sharing platform that
supports CC licensing and takes care of the machine-readability for
you.
It’s also important to mark a work with the licence. I’ll talk about
marking shortly.
Regarding the robustness of the legal instrument:
The Creative Commons Licences have been upheld in every
jurisdiction in which litigation concerning them has occurred. There
have been no recorded cases of litigation concerning a Creative
Commons Licence in Australia, which would tend to support the
quality of their construction.
CC Licences are irrevocable, so last for the term of copyright.
They are non-exclusive, so it is open to the rights holder to apply
another licence to the material should the need arise. For example, if
you release material under a CC- Non-Commercial Licence, but a
commercial partner wishes to exploit the material, you can enter into a
separate licence with the commercial partner that permits commercial
reuse. This is known as “Dual licensing”.
To maximise discoverability by search engines and software systems
make sure to use our license chooser tool to get the machine-
readable html code. The licence chooser also mints the licence for
marking a work.
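To make "machine-readable html code" concrete, the sketch below builds an HTML fragment in which the licence link carries the rel="license" link type, which is what lets software recognise the licence. It is an illustration only and not the exact markup the licence chooser emits; the function name and the example title are hypothetical, and the URLs follow the public creativecommons.org pattern.

```python
from html import escape

# Licence URLs following the public creativecommons.org pattern.
LICENCE_URLS = {
    "CC BY 4.0": "https://creativecommons.org/licenses/by/4.0/",
    "CC BY-NC 4.0": "https://creativecommons.org/licenses/by-nc/4.0/",
    "CC0 1.0": "https://creativecommons.org/publicdomain/zero/1.0/",
}


def licence_html(title, licence):
    """Build an HTML fragment whose link is marked rel="license"."""
    url = LICENCE_URLS[licence]
    return (
        f"<span>{escape(title)}</span> is licensed under "
        f'<a rel="license" href="{url}">{escape(licence)}</a>.'
    )


if __name__ == "__main__":
    # Hypothetical dataset title, used only for illustration.
    print(licence_html("Example survey dataset", "CC BY 4.0"))
```

A crawler or harvester can then look for links with rel="license" instead of trying to interpret free-text rights statements.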
Four important things come out of this- licence selection, attribution,
citation and more permissions:
1. Licence selection- guided by questions about what re-use you will
allow:
• Allow adaptations of your work to be shared?
• Allow commercial uses of your work?
• Remember that if your work is an adaptation of a work licensed
under either CC BY-SA or CC BY-NC-SA, then your derivative
work must be made available under the same license as per the
ShareAlike condition.
2. Attribution is a base condition of all the CC licences.
There is flexibility in the attribution requirements: attribution can be
given in a way that is “reasonable to means, medium and context”, it can
link to a separate resource, and the
licensor may waive some or all of the attribution requirements.
3. Citation- location of the work, and also source works: Answers
concerns from data creators about being able to find the original
data.
• If the work you are licensing is a derivative of another work, then
you need to communicate that your work is a derivative, including
the source URL of the original work and a description of the
derivation or modification.
• When modifying materials under one of the Version 4.0 CC licenses,
you must make a note of any modifications you make to the
materials, regardless of whether the modification is significant
enough to merit a derivative work, and provide URI back to source.
Answers concerns from data creators about being able to find the
original data.
• It might be unfeasible to include attribution within a merged dataset in
which case, include URI back to unmodified version.
Lastly, More permissions:
• For example, if you license something under CC BY but are okay with
people not attributing you in certain cases--this is your chance to
specify those cases.
• You can't change the terms of a CC license, but you can always grant
additional permissions or warranties beyond what the license
allows.
• Does your work incorporate elements of several third party materials?
– Mark these and provide attribution.
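The attribution and citation points above can be sketched as a small helper that assembles an attribution line for a re-used dataset: title, creator, source URI and licence, plus a note of any modifications. This is an illustrative sketch rather than official Creative Commons wording; the function name, parameters and example values are all hypothetical.

```python
def attribution_statement(title, creator, source_uri, licence, licence_uri,
                          modifications=None):
    """Return a plain-text attribution line for a re-used dataset."""
    line = (
        f'"{title}" by {creator}, {source_uri}, '
        f"licensed under {licence} ({licence_uri})."
    )
    if modifications:
        # Version 4.0 licences ask re-users to note any changes made.
        line += f" Modified from the original: {modifications}."
    return line


if __name__ == "__main__":
    # Hypothetical values, used only for illustration.
    print(attribution_statement(
        title="Example coastal survey",
        creator="Example Agency",
        source_uri="https://example.org/dataset/123",
        licence="CC BY 4.0",
        licence_uri="https://creativecommons.org/licenses/by/4.0/",
        modifications="clipped to a smaller study area",
    ))
```

Where attribution cannot sensibly be carried inside a merged dataset, the same line can live in the metadata record, with the source URI pointing back to the unmodified version.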
Marking communicates the licence ON the work: here is a list of ways to
mark a work
Regarding content platforms: If there is no licence field there is usually
a description or other free form field where you can enter info about a
work.
My key message today is Re-use is a core component of FAIR data. So,
do licence your data to enable re-use.
Creative Commons licenses provide a simple mechanism
• to ensure that users of research have the rights they need to reuse,
replicate, and apply research outputs and data
• to disseminate and communicate research output in order to
maximise the impact of work while protecting intellectual property
and academic integrity
• with built-in attribution and citation, which creates a clear path to the
original data.
Margie Smith: Hi there!
My name is Margie Smith and I have worked at Geoscience Australia
since November 2016 in the Science Data Governance and Policy
team… a team of two.
I came across to help GA meet its obligations under the National
Archives of Australia’s Digital Continuity 2020 Policy, to bring some
external policy knowledge into the organisation and to provide
governance guidance around science data management.
In response to the National Archives Digital Continuity 2020 Policy
and other Australian Government Open Data policies, government
organisations have been tasked with making their data holdings
visible and available.
Making data open is not new to GA but there is most definitely now a
whole of government push for access to all data domains.
I have produced several documents to meet the DC2020 data
governance milestones, but as you can see from this diagram, there
has to be a balance of both oversight and execution across the data
lifecycle – to have one without the other will either produce a pile of
documents that nobody reads or a plethora of silos of excellence
generating portals, datasets and services that only those in the know
can find and use.
Whilst there are a series of external drivers for data management, use
and re-use, there are also strong drivers currently within the
organisation.
For example:
• the cost of collecting or acquiring the data
• the cost of not finding data previously acquired or
• finding data and not being the person who ‘knows’ all about it
• succession planning
• analogue collections – diaries or paper products that have yet to
be digitised
• general public servant obligations like the Archives Act
• and, of course, GA’s Science Principles and vision.
Provenance will support the organisation through enabling data re-use
(as you can now find it) and allow for transparent science and advice
through understanding the data supply chain.
At the moment, our metadata records indicate provenance of the data
through the lineage statement or in the abstract.
As shown in these examples, the provenance of a dataset or product
is usually free text and can be semi-structured or unstructured.
Very concise or…
… not exactly concise.
Here the abstract includes everything you need to know about the
Coastal inundation modelling for Busselton, Western Australia, under
current and future climate.
Whilst this provenance information is very useful, it is not particularly
useable; and by useable*, I mean its ability to be located, retrieved,
presented and interpreted – by person or ideally, by machine search.
* From ISO 15489-1:2016 Information and documentation --
Records management -- Part 1: Concepts and principles
As an example of why we need provenance for data reuse, I have
made up a scenario.
In this scenario, the advice was generated from the complete dataset
at the time.
A scientist generated a model using algorithms and provided advice
based on the output of the model.
The advice, assuming it was of a general nature, is then made
available through the catalogue – generally as a PDF document.
The metadata for the advice gives the name of the dataset used, the
area that the advice covers, the organisation as author of the report,
and perhaps some of the methodology used in the generation of the
report.
In most cases, you could link the advice to the name of the dataset
that was used to generate the advice, but not easily to the scientist or
team and the models used to generate the advice.
So this provenance model of a data product could work well as a
highly structured PROV system.
My colleague Nick Car gave a presentation on GA’s PROV model to
ANDS in March and I suggest you watch that for specific information
about the model at Geoscience Australia.
Adapting Nick’s model, I have tried to replicate my previous scenario –
modelling what we are working towards at GA.
This is currently happening through lineage and association with
digital objects rather than a true PROV model of digital objects.
Working from right to left, the Advice would have a metadata record in
eCat, our electronic catalogue, that indicates the process used to
generate the advice, which is made up of the temporal subset of the
dataset the advice is based on, the software or models that were
applied to the data and information around that data’s acquisition as
well as the reason the advice was required.
If the data is to be re-used in future advice, it might also be helpful to
know what models were tried previously that didn’t work.
For our catalogue-like things, we need to gradually add the ability to
link Entities, Agents, Activities etc to be able to use graph structured
provenance (PROV-DM) across multiple types of objects and across
multiple systems in the future.
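As a rough illustration of graph-structured provenance, the scenario above can be written as a handful of relations between Entities, Activities and Agents and then walked from the advice back to its sources. This is a minimal sketch and not GA's PROV model: the relation names follow W3C PROV-DM, but every identifier and both function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical identifiers; relation names follow W3C PROV-DM.
RELATIONS = [
    ("advice:report-1", "wasGeneratedBy", "activity:modelling-run"),
    ("activity:modelling-run", "used", "dataset:temporal-subset"),
    ("activity:modelling-run", "used", "software:model-v1"),
    ("activity:modelling-run", "wasAssociatedWith", "agent:scientist-a"),
    ("dataset:temporal-subset", "wasDerivedFrom", "dataset:full"),
]


def build_graph(relations):
    """Index (subject, relation, object) triples by subject."""
    graph = defaultdict(list)
    for subject, relation, obj in relations:
        graph[subject].append((relation, obj))
    return graph


def trace(graph, start):
    """Return every node reachable from start, depth-first."""
    seen, stack, order = set(), [start], []
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        # Push in reverse so relations are followed in the listed order.
        for _, target in reversed(graph.get(node, [])):
            stack.append(target)
    return order


if __name__ == "__main__":
    for node in trace(build_graph(RELATIONS), "advice:report-1"):
        print(node)
```

Starting from the advice, the walk reaches the modelling activity, the data subset and its parent dataset, the software and the scientist, which is the linkage that free-text lineage statements cannot provide to a machine.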
In my role I am particularly interested in the repeatability of advice
given by any government entity. Per the Archives Act, advice of this
type given by government must be stored for a period of years and
include the models, algorithms, software and data used to generate
the advice. It is a safety net for the entity and the public servants that
generated the advice at that point in time.
This is currently a manual process, heavily reliant on the individual
generating the advice and storing it appropriately.
It would be excellent if the work we are currently undertaking would
make it a lot easier for scientists to generate and catalogue this advice
in the future.
Prior to sorting out what I wanted to include in this presentation, I had
another look at the FAIR principles for data reuse.
Looking at these principles, I was feeling a lot better about what has
been achieved at GA in the last 18 months.
We have a public catalogue, it has a clear and accessible data usage
license and the standards used for cataloguing are in the spatial
domain.
The lineage in a metadata record has been the de facto ‘data
provenance’ to date.
We are currently working on multi-domain metadata retrieval from our
catalogue; for example, we will be able to export records in AGRIF for
Records Management, ISO19115 for spatial and DCAT for the
National Archives.
The Google search is already enabled in the search panel on the
ga.gov.au splash page – this enables a search of both the website
and the catalogue for content.
In June, I was fortunate to attend a technical meeting of the Open
Geospatial Consortium, which is an international spatial standards
organisation. It was evident in discussions there that many other
countries were also working towards delivering their catalogues in
formats other than spatial to enable searching by other domains.
We have a new catalogue, our eCat, where
• metadata records will have a persistent identifier
• the license for data re-use is clear
• you can get to the data or product directly from the metadata
record
• and records for data are linked to services and portals that use
them, and vice versa.
At the moment, we are working to publish the 19115-3 catalogue
schema and codelists that are used by GA in the catalogue.
In terms of oversight, we have data product plans, roles and
responsibilities, and workflows for the release of products from GA
through eCat which is a longstanding and well understood process.
For the past month, my area has been undertaking work to highlight
the need for science areas to focus on a data-first rather than product-
first view. This data-first process will echo the data product publishing
workflows and have a dedicated internal catalogue we are calling
SourceCat.
SourceCat is a clone of the eCat software and is being trialled within
two areas of GA before being released across the organisation.
Once we have this in place, being able to show provenance from the
product to the data will be made easier as we start the process at the
beginning rather than try and remediate at the product publishing end
of a project.
This is a view of our new eCat – the electronic catalogue for products
generated at GA.
We have moved to the newer metadata standard for Australian Spatial
Data, the ISO 19115-1:2014 which you can see indicated on the page.
There are also Keyword lists which have been somewhat free-forming
to date. We have now selected well defined vocabularies where they
exist and are working with the custodians to publish them whilst at the
same time wrapping a governance structure around their maintenance
and future extension.
There is a persistent id and data download is indicated.
In the scenario I gave before, I pictured how the provenance of a data
product would work well as part of a highly structured PROV model.
The structure required supports data provenance and re-use even if it
doesn’t become a PROV system immediately.
The Source Catalogue is currently being built as a proof of concept for
two science areas in the organisation with the intention of making it an
agency tool for all data that is acquired or created.
In the future we intend to have a Software Catalogue and Objects
Catalogue so that the software or models used in data curation or
data products can be included as per PROV models. These are all
clones of the eCat software.
With this comes the need to support the organisation with tools and
documented procedures that in the future will become automagic
processes to bring data into the building. This support is more of the
oversight and execution balance that I spoke of earlier.
We are also using the catalogue standard to introduce elements that
will align with a future PROV model.
We will be including the element ‘derivedFrom’ in the metadata record.
In the future, if a product does not have a ‘derivedFrom’ element, it will
not be published.
Further into the future we will include the element ‘haveProv’, which is
different to lineage, as it is forward facing – linking the data to all
products that have used it.
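The two rules just described can be sketched in a few lines: a publishing check that rejects a record with no 'derivedFrom' element, and a forward-facing index built by inverting 'derivedFrom' so that each source lists the products that used it. This is an illustrative sketch only: the records, identifiers and function names are hypothetical, and only the element name 'derivedFrom' comes from the talk.

```python
from collections import defaultdict

# Hypothetical metadata records; only 'derivedFrom' is named in the talk.
RECORDS = [
    {"id": "product-1", "title": "Inundation advice", "derivedFrom": ["data-1"]},
    {"id": "product-2", "title": "Hazard map", "derivedFrom": ["data-1", "data-2"]},
    {"id": "product-3", "title": "Product with no source"},
]


def can_publish(record):
    """A record is publishable only if it names at least one source."""
    return bool(record.get("derivedFrom"))


def forward_links(records):
    """Invert derivedFrom: map each source to the products that used it."""
    used_by = defaultdict(list)
    for record in records:
        for source in record.get("derivedFrom", []):
            used_by[source].append(record["id"])
    return dict(used_by)


if __name__ == "__main__":
    for record in RECORDS:
        print(record["id"], "publishable:", can_publish(record))
    print(forward_links(RECORDS))
```

Because the forward view is computed from the backward links, it stays consistent as long as every published product carries its 'derivedFrom' element.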
By having all these links embedded, Nick explained that this will allow
a machine readable PROV-record to link to a metadata record to
indicate provenance exists. He then started talking PROV bundles
and lost me but hopefully all these steps will lead to the working
PROV model of the future GA.
I was also thinking about the next talk on licensing frameworks. In this
future machine-to-machine scenario, the licenses of aggregated
products may be determined through an automated rule set
depending on the way the data product is delivered.
In this example a dataset and its associated web service have
differing licences. For third-party aggregated data use this process is
currently determined through extensive written agreements for each
product.
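One way to picture such an automated rule set is a function that takes the licences of the component datasets and returns the licence an adapted, aggregated product would need to carry. The sketch below is deliberately simplified and is an assumption rather than GA's process: it covers only CC licences, treats the product as an adaptation, ignores licence versions and delivery method, and returns None where no single licence satisfies every component. All names are hypothetical.

```python
# Each licence is the set of CC elements it carries; CC0 carries none.
LICENCES = {
    "CC0": frozenset(),
    "CC BY": frozenset({"BY"}),
    "CC BY-SA": frozenset({"BY", "SA"}),
    "CC BY-NC": frozenset({"BY", "NC"}),
    "CC BY-NC-SA": frozenset({"BY", "NC", "SA"}),
    "CC BY-ND": frozenset({"BY", "ND"}),
    "CC BY-NC-ND": frozenset({"BY", "NC", "ND"}),
}
NAMES = {elements: name for name, elements in LICENCES.items()}


def aggregate_licence(component_licences):
    """Return the licence for an adaptation combining the components.

    Returns None when no single licence satisfies all components.
    """
    sets = [LICENCES[name] for name in component_licences]
    if any("ND" in s for s in sets):
        return None  # NoDerivatives material cannot be adapted
    combined = frozenset().union(*sets)
    # ShareAlike requires the result to carry that same licence.
    if any("SA" in s and s != combined for s in sets):
        return None
    return NAMES[combined]


if __name__ == "__main__":
    print(aggregate_licence(["CC BY", "CC BY-NC"]))
    print(aggregate_licence(["CC BY-SA", "CC BY-NC"]))
```

The conditions accumulate across components, so the result is never more permissive than the most restrictive input, and combinations that ShareAlike or NoDerivatives rule out are flagged for a human decision rather than guessed.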
Finally, it takes a lot of work to remediate legacy metadata records.
Are we going to remediate every single one of our legacy data
records? NO – or at least not straight away. Not all data is high value
nor does all data have to be highly useable, but all data acquired and
data products created should be FAIR.
To re-use data, it is necessary to understand its provenance to assess
if it is fit for purpose. In working towards a PROV model and
implementing tools like SourceCat, we are also further along the
path to achieving GA’s vision to fully maximise our data potential.
END OF TRANSCRIPT