The Virtual Hazards Impact & Risk Laboratory (VHIRL) is a scientific workflow portal that provides researchers with access to a cloud computing environment for natural hazards eResearch tools. It allows researchers to construct experiments with data from a variety of sources and execute cloud computing processes for rapid and remote simulation and analysis. The service currently includes tools for the simulation of three major hazards affecting the Asia-Pacific region: earthquakes, tsunamis and tropical cyclones.
For scientific results, the establishment of provenance is key to reproducibility and trust. Thus the need for any virtual laboratory to provide provenance information for the tasks it manages is obvious, but the appropriate way to report and manage provenance information is not always so straightforward. Many virtual laboratories and workflow systems provide bespoke provenance management with a focus on internal system use. This has clear benefits for reproducibility within the system, but it limits the interoperability of systems. For VHIRL, a provenance solution was required that was as interoperable with other, external, provenance systems as possible.
A related common issue facing workflow tools and virtual laboratories is the need to manage software code. With this come well-known issues associated with code sharing: licensing, source code management, version management and dependency resolution. There is a wide selection of commonly used tools to help solve these problems, for example Git and Subversion.
A key goal of VHIRL was to externalise as much information management as was reasonable. VHIRL is a virtual laboratory: it is not designed to be a data store, software repository, or records management system. A solution was required that could hand off the management of provenance records and code to external services, with links between them, other data services and VHIRL jobs where appropriate.
Scientific software can be quite complicated and systems for managing dependencies and source vary from system to system. In order to provide the least friction for authors of software, we designed a system called the Scientific Software Solution Centre (SSSC) to manage solutions to scientific problems and deliver the solution templates, code and dependencies that enable them for use in VHIRL and other Virtual Laboratories and applications.
Standard Provenance Reporting and Scientific Software Management in Virtual Laboratories
1. Catherine Wise, Nicholas J Car, Ryan Fraser and Geoff Squire
Data61 and LAND & WATER
Standard Provenance Reporting and
Scientific Software Management in
Virtual Labs
2. What are VLs?
What is VHIRL?
What is provenance?
How does VHIRL manage provenance (or not)?
How do we represent VHIRL’s actions in standardised provenance?
What work, other than representation, is needed for provenance?
What benefits do we get from this work?
Outline
6. • Virtual Hazards Impact & Risk Laboratory (VHIRL) is a scientific workflow portal
• Gives researchers access to cloud computing for natural hazards research
• data from a variety of sources
• uses cloud computing resources
• currently has tools for earthquakes, tsunamis & tropical cyclones in the Asia-Pacific region
What is VHIRL?
12. From http://en.wikipedia.org/wiki/Provenance#Computer_Science:
What is provenance?
“Computer science uses the term provenance to mean the
lineage of data or processes, as per data provenance.
However there is a field of informatics research within
computer science called provenance that studies how
provenance of data and processes should be characterised,
stored and used. Semantic web standards bodies, such as the
World Wide Web Consortium, ratified a standard for
provenance representation in 2014, known as PROV.”
13. How do we represent VLs using
standardised provenance?
14. • Natively tracks ‘everything’ used for scenario (re)runs
• Is not a: Data store, Software repo, Records mgt system
• Externalises as much information mgt as possible
• Code managed by the SSSC
VHIRL’s own data management
15. • SSSC is a web-based system to manage code & dependencies
• Contains Problems & Solutions that define a workflow
• Solutions consist of a Toolbox
• Toolboxes are code wrapped in a Python script + description of the required inputs
Scientific Software Solution Centre (SSSC)
Class diagram for the SSSC
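The relationships named on this slide (Problems, Solutions, Toolboxes and their required inputs) can be sketched as plain Python data classes. This is only an illustrative sketch: the class fields, the `missing_inputs` helper and all example values are assumptions made here, not the SSSC's actual schema or API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Toolbox:
    """Code wrapped in a Python script, plus a description of required inputs."""
    name: str
    script: str  # hypothetical: path or name of the wrapping Python script
    required_inputs: Dict[str, str] = field(default_factory=dict)  # input name -> description


@dataclass
class Solution:
    """A solution to a problem; per the slide, it consists of a Toolbox."""
    name: str
    toolbox: Toolbox


@dataclass
class Problem:
    """A scientific problem with zero or more candidate solutions."""
    name: str
    solutions: List[Solution] = field(default_factory=list)


def missing_inputs(solution: Solution, supplied: Dict[str, object]) -> List[str]:
    """Return the required input names that the caller has not supplied."""
    required = solution.toolbox.required_inputs
    return sorted(name for name in required if name not in supplied)


# Hypothetical example values, for illustration only.
toolbox = Toolbox(
    name="example-toolbox",
    script="run_model.py",
    required_inputs={"elevation": "netCDF elevation grid", "bbox": "bounding box"},
)
solution = Solution(name="example-solution", toolbox=toolbox)
problem = Problem(name="example-problem", solutions=[solution])

print(missing_inputs(solution, {"elevation": "grid.nc"}))  # ['bbox']
```

Keeping the input description next to the code is what would let a virtual laboratory check a job before it is submitted, as `missing_inputs` does here.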
16. Scientific Software Solution Centre (SSSC)
• Beautiful, RESTful API; this example:
http://vhirl-dev.csiro.au/scm/toolbox/2
• A Solution is a prov:Plan
• No RDF metadata, yet!
22. VHIRL provenance into PROMS Server
[Figure: Reporting Systems X and Y each send Reports, made up of Entities, Activities and Agents, to an Organisational Provenance Store, where they are reported and stored.]
23. Modelling VHIRL’s data types
[Figure: a VL Run with its user, the VL and a Report, together with the data types involved: managed data, web service data, user supplied data, managed code, user supplied code and output data.]
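One way to picture how a VL run and its data types map onto PROV is to write the run out as PROV-N statements. The sketch below is an assumption-laden illustration, not VHIRL's implementation: the function name, the `ex` prefix, the example namespace and all identifiers are hypothetical. Inputs become entities that the run `used`, outputs are entities related to the run by `wasGeneratedBy`, and the user is an agent linked by `wasAssociatedWith`.

```python
from typing import Iterable


def vl_run_to_provn(run_id: str, user_id: str,
                    inputs: Iterable[str], outputs: Iterable[str],
                    prefix: str = "ex",
                    namespace: str = "http://example.org/") -> str:
    """Render one VL run as a small PROV-N document (illustrative only)."""
    def q(name: str) -> str:
        # Build a prefixed (qualified) name such as ex:run456.
        return f"{prefix}:{name}"

    lines = ["document", f"  prefix {prefix} <{namespace}>"]
    lines.append(f"  agent({q(user_id)})")
    lines.append(f"  activity({q(run_id)}, -, -)")  # start and end times left unset
    for name in inputs:
        lines.append(f"  entity({q(name)})")
        lines.append(f"  used({q(run_id)}, {q(name)}, -)")
    for name in outputs:
        lines.append(f"  entity({q(name)})")
        lines.append(f"  wasGeneratedBy({q(name)}, {q(run_id)}, -)")
    lines.append(f"  wasAssociatedWith({q(run_id)}, {q(user_id)}, -)")
    lines.append("endDocument")
    return "\n".join(lines)


# Hypothetical identifiers for the data types named on the slide.
provn = vl_run_to_provn(
    run_id="run456",
    user_id="user1",
    inputs=["managedData", "webServiceData", "userSuppliedData",
            "managedCode", "userSuppliedCode"],
    outputs=["outputData"],
)
print(provn)
```

The third argument of `wasAssociatedWith` is where a plan could be named in place of the `-` marker.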
30. Data Management
Data types shown: managed data, web service data, user supplied data, managed code, user supplied code, output data
Annotations shown:
• VL ID’d and persisted
• cited using PROMS-O format
• soon to be VL ID’d and persisted, with minimal metadata recorded too
• SSSC ID’d and persisted
• perhaps SSSC ID’d and persisted, perhaps VL managed
• soon to be VL ID’d and persisted, if required, perhaps with time limits
Virtual Labs Service Citation Example
[{ref}] {service title}
{service endpoint URI}
{query}
{time queried}
{cached copy ID}
[1] “Subset of elevation”
http://pid.csiro.au/service/anuga-thredds
“bussleton.nc?var=elevation&spatial=bb&north=-33.06495205829679&south=-33.551573283840156&west=114.84967874597227&east=115.70661233971667&temporal=all&time_start=&time_end=&horizStride”
“2014-12-15T13:15:11”
http://pid.csiro.au/dataset/abcd1234
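The citation template above has five fields after the reference number, so it can be rendered mechanically. The following is a minimal sketch only: the `ServiceCitation` class and its field names are hypothetical, chosen to mirror the template, and the query string is shortened for readability.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceCitation:
    """One web-service data citation, mirroring the slide's template fields."""
    ref: int
    title: str
    endpoint: str      # service endpoint URI
    query: str         # query sent to the service
    time_queried: str  # ISO 8601 timestamp of the query
    cached_copy: str   # identifier of the cached copy of the result

    def render(self) -> str:
        """Return the citation as one field per line, in template order."""
        return "\n".join([
            f'[{self.ref}] "{self.title}"',
            self.endpoint,
            f'"{self.query}"',
            f'"{self.time_queried}"',
            self.cached_copy,
        ])


citation = ServiceCitation(
    ref=1,
    title="Subset of elevation",
    endpoint="http://pid.csiro.au/service/anuga-thredds",
    query="bussleton.nc?var=elevation&spatial=bb",  # shortened for the example
    time_queried="2014-12-15T13:15:11",
    cached_copy="http://pid.csiro.au/dataset/abcd1234",
)
print(citation.render())
```

Recording the query and the time queried alongside a cached copy is what makes a service result citable even if the live service later returns different data.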
32. Establishing Reporting - Reporting Toolkits
[Figure: managed data “Grid X” and web service data “Service Y” alongside the VL Run “Run 456”, an Agent, and the Report for Run 456.]
e1 = Entity(title='Grid X',
    description='netCDF grid of property X',
    uri='http://eg-vl.org.au/dataset/123',
    downloadURL='http://eg-vl.org.au/dataset/123?_view=dl',
    wasAttributedTo='http://data.ga.gov.au/id/person/john.doe')
33. Establishing Reporting - Reporting Toolkits
e2 = ServiceEntity(
    title='Subset of elevation',
    description='5km solar radiation interpolated raster service',
    serviceBaseUri='http://siss2.anu.edu.au/anuga/busselton.nc',
    query='var=elevation&spatial=bb&north=-33.06495205&south=-33.551573283&west=114.84967874&east=115.70661233&temporal=all&time_start=&time_end=&horizStride',
    queriedAtTime='2014-12-15T13:15:11',
    cachedCopy='http://bom.gov.au/dataset/678')
34. Establishing Reporting - Reporting Toolkits
a0 = Activity(
    title='Run 456',
    description='Upper bound run, full Grid X use',
    wasAssociatedWith={VL added automatically},
    startedAtTime={VL added automatically},
    endedAtTime={VL added automatically},
    usedEntities=[e1, e2],
    generatedEntities={VL added automatically})
35. Establishing Reporting - Reporting Toolkits
r0 = Report(
    title='Report for Run 456',
    description='Upper bound run, full Grid X use',
    startingActivity={VL added automatically},
    endingActivity={VL added automatically})
rs0 = ReportSender('http://provstore.vl.org.au/report/')
rs0.send(r0)
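The Entity, ServiceEntity, Activity, Report and ReportSender calls on these slides are abbreviated, with several values filled in by the VL. To show the overall flow end to end, here is a self-contained sketch using stand-in classes. These stand-ins are assumptions written for this example and are not the reporting toolkit's real classes or signatures; `DryRunSender` only collects the serialised report locally instead of sending it to a provenance store.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class Entity:
    title: str
    uri: str
    description: str = ""


@dataclass
class ServiceEntity:
    title: str
    serviceBaseUri: str
    query: str
    queriedAtTime: str
    cachedCopy: Optional[str] = None


@dataclass
class Activity:
    title: str
    usedEntities: List[object] = field(default_factory=list)
    generatedEntities: List[object] = field(default_factory=list)
    startedAtTime: Optional[str] = None  # on the slides, added by the VL
    endedAtTime: Optional[str] = None    # on the slides, added by the VL


@dataclass
class Report:
    title: str
    startingActivity: Activity
    endingActivity: Activity

    def to_dict(self) -> dict:
        # asdict() recurses into nested dataclasses and lists.
        return asdict(self)


class DryRunSender:
    """Stand-in sender: serialises reports to JSON and keeps them in memory."""

    def __init__(self, endpoint: str):
        self.endpoint = endpoint
        self.outbox: List[str] = []

    def send(self, report: Report) -> str:
        payload = json.dumps(report.to_dict(), sort_keys=True)
        self.outbox.append(payload)
        return payload


e1 = Entity(title="Grid X", uri="http://eg-vl.org.au/dataset/123",
            description="netCDF grid of property X")
e2 = ServiceEntity(title="Subset of elevation",
                   serviceBaseUri="http://siss2.anu.edu.au/anuga/busselton.nc",
                   query="var=elevation&spatial=bb",  # shortened for the example
                   queriedAtTime="2014-12-15T13:15:11")
a0 = Activity(title="Run 456", usedEntities=[e1, e2])
r0 = Report(title="Report for Run 456", startingActivity=a0, endingActivity=a0)

sender = DryRunSender("http://provstore.vl.org.au/report/")
payload = sender.send(r0)
print(payload)
```

The point of the pattern is that the VL author only describes the inputs and the run; times, the associated agent and generated entities can be filled in automatically before the report is sent.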