RDSI Dash Tinman

Data Sharing (DaSh) Programme
Tinman – 31st October 2011

SECTION A – CONTEXT AND PURPOSE
The RDSI project was established through the SuperScience investment from the Education
Infrastructure Fund (EIF) in the 2009 Federal Budget, and is managed through the Department of
Innovation, Industry, Science and Research (DIISR). The detailed objectives, expected outcomes and
process to achieve these are described in the RDSI Project Plan, available from the RDSI website [1].
Quoting:
The expected benefits of RDSI are to:
• improve the availability of quality research data for sharing and re-use and, as a result, expand the
scale and scope of problems that Australian researchers may seek to address;
• improve research efficiency; and
• reduce institutional data storage costs and enable more extensive collaboration.
The infrastructure may also assist institutions to:
• sustain a quality of research in the digital age that includes the reproducibility of results;
• meet the storage requirements of key research activities undertaken at that institution; and
• comply with the research data provisions of Universities Australia’s Australian Code for the
Responsible Conduct of Research.
The RDSI project is delivered through four key programmes which are jointly coordinated, depend on
each other, but are delivered through different and complementary approaches.
• The Node Development (NoDe) programme will establish a small number of physical sites around
Australia to provide baseline storage and access services to the research sector.
• The Data Sharing (DaSh) programme will develop the technical architecture for inter-node and
node-user data movement, access management and sharing functionality for the sector.
• The Research Data Services (ReDS) programme will support the development of larger collections
of value, their infrastructure requirement at nodes and their association with collaboration and
analysis facilities.
• The Vendor Panel (VePa) programme provides the public research sector with a set of preferred
commercial suppliers for the delivery of storage infrastructure and services, leveraging the
economies of scale of both the sector and the RDSI investment.
The intention of the RDSI project is to foster the development of an enduring and sustainable
infrastructure, on a cost-effective basis, well beyond the lifetime of the project itself.
This document addresses the DaSh programme, providing a broad outline of the programme itself, its
requirements, expectations and deliverables. As a “tinman” model it is not intended to be a final
position, but to provide suggestions on points of further discussion and encourage feedback. It
follows the earlier strawman workshop. It will form the basis for the final model of the DaSh
programme. There will be sector-wide consultation on this tinman model.
[1] http://rdsi.uq.edu.au/

SECTION B – SUMMARY OF THE DaSh PROGRAMME
1. Goals of the DaSh programme
The DaSh Programme will build capability to support the sharing and re-use of research data and, as a
result, is aimed at expanding the scale and scope of problems that Australian researchers may seek to
address. In order to identify what high performance data sharing and data movement services are
needed by the sector, consultations with relevant research sector stakeholders, combined with an
evaluation of existing services, will be undertaken during implementation of the Project.
2. DaSh programme Themes
The DaSh Programme will consist of ten themes as follows:
DaShNet – The network connecting nodes to users and to each other
Federated Authorisation and Service registration – Upgrading the AAF for authorisation
ReDS Application Processing – Automation of application workflows for the ReDS programme
RDSI DaShBoard – A system to automatically collect and publish Node and Collections metrics
RDSI Data Fabric – Providing a common access to collections and working storage for researchers
RDSI File Systems – Establishing File System(s) across nodes with a consistent namespace
RDSI Data Mover – Providing fast data movement between, into and out of nodes
RDSI StoreGate – A gateway to external public storage
RDSI DaShLab – An environment to support testing of implementation and changes of RDSI elements
RDSI Portal – AAF Integrated access to RDSI elements/services with appropriate entitlement
3. Consultation and governance within the DaSh programme
The DaSh Programme will establish a Technical Advisory Committee (TAC) to provide advice to the
project on elements of the technical architecture. The TAC will consist of staff from the RDSI project,
including the Project Director, Project Manager and DaSh Technical Architect together with a
representative from each confirmed Node.
A Technical Reference Group (TRG) will also be established to provide early comments on DaSh
designs and proposals. Membership of the TRG will be open.
4. Development Principles for the DaSh programme
Where possible, the project will seek to acquire or re-use software before considering development. If
development is required, the RDSI project team will call for expressions of interest from Nodes to
undertake the development. Where practical, there will be calls for expression of interest from Nodes
to host RDSI services, if hosting is required.

SECTION C – DISCUSSION OF THEMES IN THE DaSh PROGRAMME
This section will discuss the proposed themes in the DaSh programme together with considerations of
implementation.
DaShNet – The network connecting nodes to users and to each other
Proposition
Interconnecting RDSI nodes to each other with the highest available bandwidth will improve the
availability of services across RDSI by supporting replication between nodes. Replication will allow the
delivery of higher availability services than could be provided by individual data centres, which are
often at Uptime Institute Tier 2 status. Resilience through widely distributed replication is more cost
effective than upgrading data centres. These interconnections will also support rapid data movement
between nodes. As the use of RDSI nodes increases, there is a potential for congestion at the access
point for the node. This can be alleviated by providing high bandwidth dedicated access for each
node.
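As a rough illustration of the replication argument, the sketch below computes the combined availability of a collection replicated across several nodes. It assumes that node outages are independent and uses the commonly quoted 99.741% availability figure for a Tier 2 facility; both are assumptions made for illustration and are not RDSI measurements.

```python
# Illustrative only: combined availability of a collection replicated at
# several nodes, assuming node outages are statistically independent.

TIER2_AVAILABILITY = 0.99741  # commonly quoted Tier 2 figure (assumption)
HOURS_PER_YEAR = 24 * 365


def replicated_availability(node_availability: float, replicas: int) -> float:
    """Probability that at least one of `replicas` copies is reachable."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return 1.0 - (1.0 - node_availability) ** replicas


def expected_downtime_hours(availability: float) -> float:
    """Expected hours per year during which no copy is reachable."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    for replicas in (1, 2, 3):
        a = replicated_availability(TIER2_AVAILABILITY, replicas)
        print(f"{replicas} replica(s): {a:.6%} available, "
              f"about {expected_downtime_hours(a):.2f} hours down per year")
```

Under these assumptions a single Tier 2 site is unavailable for roughly 22.7 hours a year, whereas two independent replicas are jointly unavailable for only a few minutes. Correlated failures, such as a shared network fault, would reduce this benefit.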
Discussion
The goals of DaShNet will be to establish:
(i) A set of interconnections between primary nodes using the fastest available wavelengths
across the AARNet backbone
(ii) A network access connection for each primary node to support dedicated access to the
node. This will also use the fastest available wavelengths across the AARNet backbone
(iii) Appropriate network connections to each additional node
It is anticipated that the majority of funding for this theme will be for network equipment and
wavelength implementation costs. The initial expectation is that there will be a single source of
network equipment for this theme.
Implementation Considerations
DaShNet will be a project proposed by RDSI to the National Research Network (NRN) project which
will look to use the upgraded AARNet backbone. There will, therefore, be early discussion between
RDSI, NRN and AARNet.
Federated Authorisation and Service registration – Upgrading the AAF for authorisation
Proposition
RDSI will require that users be able to use the Australian Access Federation (AAF) for
authentication unless there is an agreed exception. However, the mechanisms for granting
authorisation to use a resource, such as a collection, would have to be implemented independently by
resource managers unless there is a federated approach to authorisation. Implementation of an
“Entitlements Service” to support such an approach will benefit users and managers of RDSI
infrastructure and collections by eliminating duplication and providing a consistent approach to
authorisation. The AAF is the logical home for such an entitlements service.

Discussion
An entitlements service would be a logical extension to the AAF’s existing authorisation service and
would have wide benefits in providing a consistent approach for a number of eResearch projects,
including RDSI and NeCTAR. An entitlements service could either be developed specifically as a direct
enhancement of the AAF, or one of a small number of commercially available entitlement systems
could be licensed and integrated with the AAF. This will be a crucial service for other developments in
RDSI and in other projects; an early delivery of at least the interface specifications will therefore be
essential.
A directory holding authorisation, registration and other service information will be required to
support RDSI services and the design of the directory will depend on the choice of solution for the
entitlement service. A part of such a directory might be implemented in an RDSI portal.
Early discussion will be undertaken between RDSI, AAF and NeCTAR to determine the most
appropriate design.
ReDS Application Processing – Automation of application workflows for the ReDS programme
Proposition
There will be a range of applications for allocation of space under the ReDS programme which vary in
complexity and size. There will be advantages in both timeliness and workload if the process is
automated to the greatest possible extent. This automation will also provide additional benefits by
providing a point for automatically capturing and storing the parameters of a collection for use by the
RDSI measurement and monitoring processes.
Discussion
The design of this application will be determined by developments in the ReDS programme which will
establish agreed levels of delegation and automation and by the requirements of the RDSI DaShBoard
which will determine the data to be captured for monitoring and measurement. The application may
involve either bespoke development or licensing a commercial product for modification and
integration. It will need to integrate with the RDSI portal and will influence the definition of metrics
about collections to be used by the RDSI DaShBoard.
The RDSI project team will develop a specification for this application taking into account the
requirements of other DaSh themes. After an initial market survey of available commercial offerings,
a specification for ReDS Application Processing will be developed and expressions of interest will
then be sought from confirmed RDSI nodes to develop, integrate and host the application as
appropriate. It is anticipated that ReDS Application Processing would be integrated with the RDSI
Portal and that potential users would access it through the portal.

RDSI DaShBoard – A system to automatically collect and publish Node and Collections metrics
Proposition
The project plan has described the process of establishing trust in the collection of RDSI nodes by
openly and transparently publishing performance against agreed service levels and metrics for both
nodes and collections. The RDSI DaShBoard will automate the process of collecting and publishing this
data to support the production of timely information with low levels of manual intervention.
Discussion
Metrics for the monitoring of nodes will be jointly developed with the Node Development programme
and metrics for monitoring collections will be jointly developed with the ReDS programme. A common
protocol for the transmission of monitoring data to the RDSI DaShBoard must also be developed. It is
anticipated that the DaShBoard would be a part of the RDSI Portal and would collect and display
information from all RDSI Nodes automatically and on a regular basis. This information would relate
to both nodes and collections. The DaShBoard could be developed specifically for RDSI or commercial
software could be licensed and integrated with the RDSI Portal.
After an initial market survey of available commercial offerings, a specification for the RDSI
DaShBoard will be developed and expressions of interest will then be sought from confirmed RDSI
nodes to develop, integrate and host the application as appropriate.
RDSI Data Fabric – Providing a common access to collections and working storage for researchers
Proposition
As described in the RDSI Project Plan, one of the objectives for the DaSh programme is to provide a
consistent interface for researchers to collections and it is understood that this may be only one of a
number of interfaces depending on the nature and uses of the collection. At the same time, it is
helpful for researchers also to have access to some easily accessible storage, through the same
collaborative interface, to support their access to, and use and development of, collections. This storage
must support easy collaboration between researchers. The RDSI Data Fabric will be the means of
achieving these objectives.
Discussion
The ARCS Data Fabric successfully provides a consistent interface to collaborative storage using iRODS
middleware, and iRODS also has significant functionality to support a consistent interface to
distributed collections at RDSI nodes and elsewhere. The ARCS Data Fabric implements its own
arrangements for entitlements and uses other ARCS functionality to establish service registration
through an LDAP directory which also supports the additional identity credentials needed for
WebDAV access to the Data Fabric.
Whilst there is a goal of migrating functionality from the ARCS Data Fabric to the RDSI Data Fabric, this
does not necessarily imply that the technology solution will be the same and there may well be
benefits in ensuring that an RDSI Data Fabric integrates with and uses the entitlements service
described earlier. Furthermore a number of commercial solutions have emerged over the last year
with some having tight coupling with an entitlements service.
After an initial investigation and potentially testing of existing open source solutions and
commercially available products, the RDSI Project team will work with iVEC, who are the existing
development and support group for the ARCS Data Fabric, to develop an appropriate specification
which will then be the subject of consultation with the sector. In the event that a commercially
available product is chosen, joint work will be undertaken with the Vendor Panel (VePa) programme
in relation to establishing a panel for procurement.
After the development of an initial specification, the RDSI Data Fabric would be established by a call
for expressions of interest from confirmed RDSI nodes, for one node to undertake any development
or integration and three nodes to host it. The developing node might also be one of the three hosting
nodes.
RDSI File Systems – Establishing File System(s) across nodes with a consistent namespace
Proposition
For some applications, a distributed file system providing a consistent namespace within and between
nodes may be required to provide increased levels of durability. In addition, a file system with
enhanced levels of security may be necessary if nodes are to host data collections with higher levels
of confidentiality.
Discussion
The RDSI File Systems theme will work with the RDSI Research Data Managers, the ReDS Programme
Manager and other stakeholders to develop appropriate use cases, whilst the DaSh Technical
Architect will identify and, where feasible, test different options. These may include open source file
systems or commercially available file systems that could be licensed by the sector. In the latter case,
the VePa programme will be leveraged to establish an appropriate panel of vendors. The use cases
and technical options will be discussed with confirmed nodes and other interested stakeholders
before developing a requirements specification.
Once a requirements specification has been developed the RDSI project team will discuss
implementation options with confirmed nodes.
RDSI Data Mover – Providing fast data movement between, into and out of nodes
Proposition
Researcher-accessible tools to efficiently move data between nodes, into nodes and out of nodes will
be of benefit to users of RDSI services.

Discussion
As the size of data sets scales up to hundreds of terabytes and potentially petabytes, existing tools to
ingest data into nodes, extract data from nodes or move it between nodes are severely challenged. In
particular, for larger data movements, a third party transfer service is needed so that a researcher can
submit a transfer and then continue to use their own computing resources whilst waiting for
notification of completion. For efficiency, the process needs to be user driven and substantially
automated.
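The "submit, carry on working, be notified" pattern described above can be sketched as follows. This is a minimal illustration only: the class, function and path names are hypothetical, the transfer itself is simulated, and a real service would run at the nodes, not in the researcher's own process.

```python
# Illustrative only: the "submit, carry on, be notified" pattern for a
# third-party transfer service. All names here are hypothetical.
import time
from concurrent.futures import Future, ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class TransferRequest:
    source: str        # e.g. a location at one node (hypothetical naming)
    destination: str   # e.g. a location at another node
    size_bytes: int


def run_transfer(request: TransferRequest) -> str:
    """Stand-in for the real data movement, which would run at the nodes."""
    time.sleep(0.01)  # simulate work; a real transfer could take hours
    return (f"moved {request.size_bytes} bytes "
            f"from {request.source} to {request.destination}")


class TransferService:
    """Accepts requests and notifies the submitter when each one finishes."""

    def __init__(self, workers: int = 2) -> None:
        self._pool = ThreadPoolExecutor(max_workers=workers)

    def submit(self, request: TransferRequest, notify) -> Future:
        future = self._pool.submit(run_transfer, request)
        # The callback fires on completion, so the researcher need not poll.
        future.add_done_callback(lambda f: notify(request, f.result()))
        return future

    def shutdown(self) -> None:
        self._pool.shutdown(wait=True)


if __name__ == "__main__":
    service = TransferService()
    job = service.submit(
        TransferRequest("node-a:/collections/example",
                        "node-b:/collections/example", 10**12),
        notify=lambda req, outcome: print("notification:", outcome),
    )
    print("submitted; the researcher's own machine is free for other work")
    job.result()       # only here so that the demonstration waits before exiting
    service.shutdown()
```

The point of the design is that the submitting machine holds only a handle to the job; the data never passes through it.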
An investigation, and potentially testing, of existing open source solutions and the small number of
commercially available products will be undertaken and the results published. After consultation with
nodes and other interested parties, a specification for development or for licensing and integration of a
commercially available product will be developed.
RDSI StoreGate – A gateway to external public storage
Proposition
Researchers will benefit from streamlined access to one or more external public storage clouds both
as a means of storing appropriate research data in the cloud and for accessing relevant services in
public storage clouds. One potential use could be for additional copies of data that are stored at RDSI
nodes; however, there are a number of other use cases. A particular benefit in using external public cloud
storage is that it is an “on demand” service which can meet short term needs with a fast provisioning
time (often in minutes), little upper limit on capacity and an ability to pay only for the time that the
storage is actually needed. Potential users of external public storage will still need to pay for such
storage; the proposition is that it will be faster and cheaper to access and that it may reduce
proliferation in the use of small pools of external storage each with their own identity credentials.
Discussion
The RDSI project, owing to the nature of its funding, cannot fund public cloud storage. However, to
facilitate use of public storage for research purposes it can develop a gateway for connection to a
number of external public storage cloud providers. Use of such external storage encounters three
principal difficulties: performance, proliferation and cost. Performance issues arise from accessing
external storage over the public internet rather than taking advantage of the dedicated high
bandwidth available across the Australian Research and Education Network (AREN). RDSI StoreGate
would seek to address this issue by attempting to facilitate peering of a number of external public
storage providers with the AREN.
There is anecdotal evidence of significant existing use of external public storage providers for research
data. An example would be the use of Dropbox which stores its data in Amazon’s storage service. The
proliferation in the use of individual Dropbox accounts which do not integrate with other services in
the sector, such as the AAF, forms a barrier to collaboration. RDSI StoreGate will investigate options
to improve integration.
The cost of using external public storage often breaks down into three components: a network traffic
charge; a cost for moving data into and out of the external storage; and a cost of the storage itself.
The first of these could be eliminated by peering a number of storage providers with the AREN as
described earlier. The second of these is a function of location and the content delivery networks
used by external storage providers. It may also be improved by peering but it would greatly benefit
from the ability to access Australian based providers. Both this and the third element of cost (the
storage itself) are susceptible to price reduction through demand aggregation. By working with the
RDSI Vendor Panel (VePa) programme to create a panel of external storage providers, RDSI StoreGate
is intended to reduce costs through the aggregation of demand for external public storage.
Internet2 recently announced its Net+ services, which include some form of aggregated access to
external storage providers in the United States (Box.net and HP). The RDSI project team will review
available providers in Australia and work closely with AARNet, Internet2 and others in developing a
specification for RDSI StoreGate and will also work closely with the VePa programme to construct a
panel of external public storage providers. Implementation options will be developed after these
stages.
RDSI DaShLab – An environment to support testing of implementation and changes of RDSI elements
Proposition
The DaSh Technical Architect, together with Technical Architects from each of the nodes, will benefit
from the ability to test implementations of, and changes to, the RDSI Technical Architecture.
Discussion
The RDSI Nodes and the network between them present a unique environment which cannot easily
be replicated by any individual node or institution. Successful implementation of infrastructure and
applications will be dependent on an ability to undertake meaningful testing. By establishing a test
environment or testbed which spans a number of nodes, it will be possible to support the testing of
infrastructure and applications in a realistic environment. DaShLab will be the test environment
spanning the Nodes.
After the development of an initial specification, DaShLab would be established by a call for
expressions of interest from confirmed RDSI nodes, with a target minimum of 2 nodes and no
maximum number. The DaSh programme would fund infrastructure at the nodes to facilitate the
development of DaShLab.
RDSI Portal – AAF Integrated access to RDSI elements/services with appropriate entitlement
Proposition
An RDSI Portal will be an effective means of integrating access to all RDSI services including those
described within other DaSh programme themes. It may also be effective in acting as an integration
point with other eResearch project services.
Discussion

The design of the RDSI Portal will be strongly dependent on developments within the other RDSI
themes with which it must integrate. It must clearly be integrated with the AAF and with the
Entitlements Service described earlier. Depending on the design of the Entitlements Service, it may be
necessary for the RDSI Portal to hold directory information about service or resource registration.
The portal may involve either bespoke development or licensing a commercial product for
modification and integration.
The RDSI project team will develop a specification for the RDSI Portal taking into account the
requirements of other DaSh themes. After an initial market survey of available commercial offerings,
a specification for the portal will be developed and expressions of interest will then be sought from
confirmed RDSI nodes to develop, integrate and host the application as appropriate.
SECTION D – IN-DEPTH DISCUSSION OF DaSh PROGRAMME ELEMENTS
This section explores underlying components of the DaSh programme in depth. It is presented to
underpin, extend and enhance the discussion on DaSh programme themes, which have been
described earlier in summary form. The topics discussed in this section are generally applicable to
more than one theme and it is not, therefore, intended that there should be a one to one
correspondence between these topics and the themes.
1. Identity, Authentication and Authorization within RDSI
Like those of many other countries, the Australian Higher Education and Research sectors have a SAML v2 based
trust federation, called the Australian Access Federation (AAF). This technology allows university staff,
students and researchers to access applications using the credentials issued to them by their
institutions. When an end-user later proves possession of and control over these credentials by
authenticating at the institution's Identity Provider (IdP), the binding between the end-user and their
digital identity is also proven at some level of assurance.
At a simpler level, institutions manufacture the digital identity of staff, students and researchers
within their institution, based on information within their systems-of-record like HR and SIS systems,
using some form of identity and access management process tailored to that institution. The end-
user's digital identity is composed of all the relevant attributes that may potentially be used to
provide access to a resource. The AAF provides a mechanism, based on the SAML v2 specification, to
assert some components of an end-user's digital identity and transport them securely to a service
provider (SP) so that the resource owner can make an informed authorization decision to allow an
end-user to access that resource. No matter what resources an end-user wishes to access, access is
always based on the end-user's digital identity. Effectively, an end-user's digital identity is a constant
across all the SPs in the federation.
An end-user's digital identity can in some cases be supplemented by other Identity Providers outside
the province of the end-user's institution. While this allows for a more expressive attribute economy
for authorization it does create some policy and technical issues. Ideally an institution's IdP should
only assert attributes within its province and identity process. Asserting attributes outside one's
province diminishes the level of assurance of those attributes. Secondly, there is an issue of scale at
the institutions themselves. As an example, consider an attribute whose presence in a SAML assertion
informs a SP that a group of researchers from several institutions can access a resource. Using only
institutional IdPs to achieve this, an attribute must be present in the digital identity of every member
of this group. Coordination at this level across a potentially large number of institutions and people
tends to be erratic. However, if this attribute were asserted by a single non-institutional IdP, the scaling
problem would be minimized.
The AAF is in the process of creating a National Entitlement Service which will allow principal
investigators and people in similar roles to create entitlements linked to end-users which can be
asserted to a SP in addition to the institution's assertion and used by the SP for fine-grained
authorization. RDSI will leverage this service in many of its web-browser-based applications. Node
operators should also follow RDSI's lead and, where appropriate, use the AAF to authenticate to
Node-based service providers.
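To make the division of labour concrete, the following sketch shows how a service provider might combine the institution's assertion with a separately asserted entitlement. It is not the AAF's design: the attribute names, issuer URLs and entitlement strings are hypothetical, and assertions are modelled as plain objects where a real SP would validate signed SAML messages.

```python
# Illustrative only: how a service provider (SP) might combine attributes
# asserted by an institutional IdP with entitlements asserted by a separate
# entitlement service. Attribute names and values are hypothetical.
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Assertion:
    issuer: str      # who asserted these attributes
    subject: str     # federation-wide identifier for the end-user
    attributes: dict = field(default_factory=dict)


def is_authorised(institutional: Assertion, entitlements: Assertion,
                  required_entitlement: str,
                  trusted_entitlement_issuer: str) -> bool:
    """Grant access only if both assertions describe the same end-user and the
    trusted entitlement service asserts the required entitlement."""
    if institutional.subject != entitlements.subject:
        return False
    if entitlements.issuer != trusted_entitlement_issuer:
        return False
    return required_entitlement in entitlements.attributes.get("entitlements", ())


if __name__ == "__main__":
    idp = Assertion("https://idp.example.edu.au", "bob-opaque-id",
                    {"affiliation": "staff"})
    ent = Assertion("https://entitlements.example.org", "bob-opaque-id",
                    {"entitlements": ["collection:example:read"]})
    print(is_authorised(idp, ent, "collection:example:read",
                        "https://entitlements.example.org"))
```

In this arrangement the group is defined once, at the entitlement service, so it does not have to be replicated in the identity store of every participating institution.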
In fact, it is one of the prime principles of the RDSI project to attempt to use the AAF's federated
identity to access both web-browser-based applications and non-web-browser-based applications.
However, the typical SAML v2 authentication profiles used in the AAF do not work well for
applications that are not based on the web browser model, which relies on HTTP cookies
and redirects. Examples of such applications within RDSI's circle of interest are:
• WebDAV
• i-commands for iRODS
• XMPP/Jabber (XMPP is one of the potential protocols for the management and control of
cloud resources)
• SSH
• Mounting and accessing file systems
• Accessing databases
• MyProxy (a service for issuing X.509 certificates for Grid computing)
Fortunately, there are initiatives already under way to develop the ability to use a federated credential to
access these applications and services. For example, the iPlant Collaborative
<http://www.iplantcollaborative.org> is using work based on Project Moonshot to provide federated
authentication for iRODS i-commands on the command line. (The Project Moonshot work
connects GSS-API with RADIUS and EAP to achieve this.) There is also work to use the SAML v2 Enhanced
Client or Proxy profile to achieve a similar result. RDSI and the AAF will work together to advance
these innovative authentication and authorization initiatives for use within RDSI and its sister
projects. Unfortunately, these emerging technologies are still rough around the edges and may not be
available for production use within RDSI. Until they mature, currently available solutions will be
needed for some of these access issues.
Additionally, the movement of data has been a significant component of Grid computing for some
time and many applications have been developed to provide these services using the Globus Toolkit.
These Grid services typically use GSI (Grid Security Infrastructure) to provide authentication and
authorization using X.509 certificates. While the Globus tools are, in some people’s eyes, overly
complicated, it would be a mistake to ignore an existing production infrastructure that does the
job. For this reason GSI will be a significant component of authentication and authorization in RDSI
Nodes.
2. Identity, Authentication and Authorization within Data Storage Systems
The concept underlying the previous section, that an end-user's digital identity is constant across all
service providers, is a powerful one. But how does one provide an analogous concept of an end-user's
digital identity being constant across all data storage systems, within RDSI Nodes, RDSI's sister
projects and other programs?
One way of achieving this goal is to synchronize all participating data storage systems to use a
common identity layer. One such layer could be implemented using an LDAP directory service,
common across all participating data storage systems. This, in concert with the Pluggable
Authentication Modules (PAM) mechanism, which is standard in almost all Unix-like systems, can
provide such an identity layer. We will also concentrate on the Portable Operating System Interface
for Unix (POSIX) series of standards, as this covers most UNIX systems as well as Microsoft
Windows systems if the Microsoft Windows Services for UNIX (SFU) component is installed. It should
be noted that POSIX/Unix semantics differ from the Microsoft Windows/NFS semantics, but with
SFU installed the core identity layer should be consistent over both UNIX and Windows.
POSIX systems link a user to a numeric ID, called the UID, and link a collection of users to a group
which is identified by a numeric ID called the GID. In this POSIX representation the user names and
group names are only there as crutches for the “wetware” that uses these systems. It is the numeric
values of the UID and GID that matter in file system operations. Access to files or directories is
based on the UID, the GID, the permissions of the file (which are stored in the inode of the file or
directory) and the credentials used to prove the identity of the user. An LDAP directory or Active Directory
server can store these mappings of users to UIDs and collections of users to GIDs, together with the
credentials used to authenticate the end-user. Other useful information can also be stored using LDAP
schemas such as RFC 2307.
This Data Storage Identity Layer (DSIL) will provide a consistent user/UID and group/GID namespace
which can be plugged into both remote and local file systems using the likes of PAM and the DSIL
LDAP server, so that remote and local file systems share the same semantics for a particular UID or
GID without any remapping. Administrators of the local or remote file systems that use these
mappings can provision user accounts as long as they do not degrade the semantics of the mappings.
For instance, if the user Bob has a UID/GID of 12345/67890 as defined in the DSIL LDAP directory, any
account provisioned for Bob on any participating file system must have a username of Bob, a UID of
numeric value 12345 and a default GID of numeric value 67890. Additional attributes related to
provisioning of accounts, like the home directory, the GECOS field and the preferred shell, are in the
domain of the local or remote administrators. As OpenLDAP is likely to be chosen as the DSIL LDAP
service, a local administrator may host a local leaf-node LDAP replica using the OpenLDAP Translucent
Proxy and rewrite the non-mandatory attributes on the fly.
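The provisioning rule for Bob can be expressed as a simple check. The sketch below is illustrative only: the record layout and function name are hypothetical, and a real deployment would read these values from the DSIL LDAP directory (for example from RFC 2307 posixAccount entries), whereas here they are in-memory objects.

```python
# Illustrative only: checking that a locally provisioned account preserves the
# DSIL mapping. The record layout and names are hypothetical.
from dataclasses import dataclass


@dataclass(frozen=True)
class Account:
    username: str
    uid: int
    gid: int          # default (primary) group
    home: str = ""    # left to the local or remote administrator
    shell: str = ""   # left to the local or remote administrator


def mapping_violations(dsil: Account, local: Account) -> list:
    """Return the mandatory fields on which the local account departs from DSIL."""
    mandatory = ("username", "uid", "gid")
    return [name for name in mandatory
            if getattr(dsil, name) != getattr(local, name)]


if __name__ == "__main__":
    dsil_bob = Account("Bob", 12345, 67890)
    # A different home directory and shell are allowed; the identity is unchanged.
    ok = Account("Bob", 12345, 67890, home="/data/home/bob", shell="/bin/bash")
    # A locally assigned UID breaks the shared semantics.
    bad = Account("Bob", 5001, 67890)
    print(mapping_violations(dsil_bob, ok))
    print(mapping_violations(dsil_bob, bad))
```

Only the username, UID and default GID are compared, which mirrors the split between mandatory mappings and locally controlled attributes described above.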

An interface within the RDSI portal will also be provided for those who do not want to use the DSIL service
and instead provide their own UID/GID mappings. This interface will allow such users to design their
own mapping options which they will use when mounting a remote file system onto their local file
system. In both these cases it is typically the root user that mounts these remote file systems. There
must be a certain level of trust in this act amongst all parties.
It should be noted that without the consistent namespace provided by DSIL, the sharing of data through
a remote file system is made more difficult, as individual UIDs and GIDs may have to be remapped from
a remote file system's UIDs/GIDs to the local file system's UIDs/GIDs so as to receive the full benefit
of the remote file system. Taking this into account, and the fact that there are many such file systems
in the Australian Higher Education and Research sectors, UID/GID remapping is an unscalable and
piecemeal solution.
Additionally there are issues concerning the confidentiality and integrity of the data exported from or
imported to remote file systems. File systems like NFSv4.1 provide GSS-API mechanisms to protect
the confidentiality and integrity of the data without the likes of TLS, GRE or SSH tunnelling; however
many file systems do not. This topic will be discussed further in later sections.
The central component of DSIL is a well replicated LDAP service using various typical LDAP schemas
including the likes of RFC 2307 and 2377. The directory needs to be populated with identity
information from both end-users' IdPs and various Attribute Authorities that may have additional
identity information outside the scope of their institution. A prototypical workflow of a person
named Bob wishing to register with DSIL is as follows:
(i) Bob uses his credentials issued by his institution to access the RDSI portal using the AAF
infrastructure for the first time. Bob's email address, surname, given name and other
appropriate attributes are asserted in the SAML payload to the RDSI DSIL Registration portal.
As this is Bob's first visit to the portal, it requires Bob to nominate two unique usernames.
The first is an 8 character username based on the original POSIX standard. This will provide a
compatibility level over all POSIX based systems where required. The second username will be
a long username, potentially up to 256 characters.
The two accounts will be linked. While the DNs of the two LDAP entries are different, the
important attributes like the uidNumber, gidNumber, etc. will be provisioned uniquely and
replicated to both accounts.
(It should also be noted that most modern systems use unsigned 32-bit integers to store UIDs
and GIDs. This potentially provides the DSIL directory service with a maximum of roughly 4 billion
accounts and 4 billion groups. To ensure that system accounts don't collide with DSIL, end-user
accounts will start with UIDs and GIDs of 1,000,000.)
(ii) Bob must also provide a new password for this account. Passwords will be stored in a
Kerberos v5 KDC (Key Distribution Centre) which will impose strong passwords and strong
encryption types (like AES) as defined by RDSI policy. Kerberos v5 pre-authentication will also be
enabled to reduce the risk of compromised hashes. This password will also be subject to a password
aging regime as per RDSI policy. Nearing the end of the aging period Bob will receive a series of
emails prompting him to access the RDSI portal password management system to restart the aging
process. Ignoring these emails will trigger the archiving of Bob's account.
(iii) Bob at this time (or any other time) can upload and manage his SSH public keys. A patched
version of SSH, namely OpenSSH-LPK, provides an easy way of centralizing strong user
authentication by using an LDAP server for retrieving public keys instead of
~/.ssh/authorized_keys. This allows the de-provisioning of a user's SSH access at one point.
(iv) Bob at a previous time has accessed the AAF National Entitlement Service where he has
defined and managed a set of Australian and New Zealand Standard Research Classification
(ANZSRC) codes which represent the research disciplines that Bob is interested in. This information,
coupled with similar codes related to research data, will allow RDSI to track in a broad sense
how researchers use research collections and allow RDSI to tune the use of RDSI and Nodes
over the project. As part of the SAML workflow the AAF National Entitlement Service
Attribute Authority will be queried to add these ANZSRC codes to the SAML assertion.
(v) Bob at this time (or any other time) can request to be a group coordinator. A new unique GID
will be provided to Bob and he will be able to invite other users to register with the DSIL
services (if needed) and join his group. At this time the group only exists as an entry in the
DSIL directory. To provision this group access control, a local or remote administrator must
change the GID of the file or directory to the new GID.
Bob can also nominate other group coordinators, who will have the same rights as Bob within
that group.
(vi) Bob can only manage the password of his DSIL account after a successful federated
authentication to the RDSI portal password management system. System Administrators should
be very reluctant to change Bob's password. This will ensure that DSIL credentials
maintain an appropriate level of authentication assurance.
Through end-users registering with the DSIL service, RDSI will organically grow a database of identity
information that will span both web-browser-based applications and data storage systems.
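Step (i) of the workflow above, nominating two usernames and receiving one shared UID/GID pair, can be sketched as follows. The starting value of 1,000,000, the 8 character limit and the 256 character limit come from the text; everything else is an assumption made for illustration, including the permitted character set for short names, the use of the full unsigned 32-bit range as the upper bound, and the names `IdAllocator` and `register`.

```python
import re

FIRST_ID = 1_000_000        # from the text: end-user IDs start here
MAX_ID = 2**32 - 1          # assumed bound: largest unsigned 32-bit value
SHORT_NAME = re.compile(r"[a-z][a-z0-9]{0,7}")   # assumed character set
LONG_NAME_LIMIT = 256

class IdAllocator:
    """Hands out consecutive numeric IDs, never reusing one."""
    def __init__(self, first=FIRST_ID, last=MAX_ID):
        self._next, self._last = first, last

    def allocate(self):
        if self._next > self._last:
            raise RuntimeError("ID space exhausted")
        value = self._next
        self._next += 1
        return value

def register(short_name, long_name, taken, uids, gids):
    """Validate both usernames and link them to one new UID/GID pair."""
    if not SHORT_NAME.fullmatch(short_name):
        raise ValueError("short username must be 1 to 8 characters")
    if not 0 < len(long_name) <= LONG_NAME_LIMIT:
        raise ValueError("long username must be 1 to 256 characters")
    if short_name == long_name or {short_name, long_name} & taken:
        raise ValueError("usernames must be unique")
    ids = (uids.allocate(), gids.allocate())
    taken.update({short_name, long_name})
    return {short_name: ids, long_name: ids}   # both names share the IDs
```

A real directory would also need to make the allocation atomic across replicas, which this in-memory sketch does not attempt.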
3. RDSI Portal
The RDSI portal is one of the major web applications within the RDSI ecosystem. It will provide several
federated services which will be of use to the end users of RDSI and potentially other sister projects.
These services will be detailed below.
DSIL Registration Portal
The purpose of the DSIL Registration portal is to extend the digital identity of an end-user into the
realm of data storage by adding attributes that describe the end-user in a file system. Once an end-
user has authenticated to the portal an LDAP entry is created in the DSIL directory. Such a directory
entry might look like:
dn: uid=bobuser,ou=users,dc=dsil,dc=rdsi,dc=edu,dc=au
objectclass: top
objectclass: person
objectclass: organizationalPerson
objectclass: inetOrgPerson
objectclass: posixAccount
objectclass: ldapPublicKey
description: Bob User's Account
userPassword: {KERBEROS}bobuser@RDSI.EDU.AU
cn: Bob User
sn: User
givenname: Bob
mail: bob@domain.com
manager: uid=bobsboss,ou=users,dc=dsil,dc=rdsi,dc=edu,dc=au
uid: bobuser
uidNumber: 1234500
gidNumber: 6789000
homeDirectory: /home/bobuser
sshPublicKey: ssh-dss AAAAB3...
sshPublicKey: ssh-dss AAAAM5...
At this stage the entry represents only a potential account. A system administrator of a file system
that participates in the DSIL service must provision the account as they see fit, as long as they
do not degrade the semantics of the user/UID and group/GID mappings.
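Before provisioning, an administrator might want to confirm that a directory entry carries the attributes RFC 2307 makes mandatory for the posixAccount object class (cn, uid, uidNumber, gidNumber and homeDirectory). The following is a minimal sketch with hypothetical function names; the parser handles only simple "name: value" lines and ignores LDIF features such as base64 values and folded lines. The sample is an abbreviated copy of the example entry above.

```python
POSIX_ACCOUNT_REQUIRED = {"cn", "uid", "uidnumber", "gidnumber", "homedirectory"}

ENTRY = """\
dn: uid=bobuser,ou=users,dc=dsil,dc=rdsi,dc=edu,dc=au
objectclass: posixAccount
cn: Bob User
uid: bobuser
uidNumber: 1234500
gidNumber: 6789000
homeDirectory: /home/bobuser
"""

def parse_entry(ldif_text):
    """Parse one simple LDIF entry into {lower-case name: [values]}."""
    entry = {}
    for line in ldif_text.strip().splitlines():
        name, _, value = line.partition(":")
        entry.setdefault(name.strip().lower(), []).append(value.strip())
    return entry

def missing_posix_attributes(entry):
    """Return the mandatory posixAccount attributes absent from the entry."""
    return sorted(POSIX_ACCOUNT_REQUIRED - set(entry))
```

An empty result from `missing_posix_attributes(parse_entry(ENTRY))` indicates that the numeric identity needed for provisioning is present.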
Account Management Portal
End-users will need to manage their own DSIL credentials in line with the RDSI password policy. This
portal will ensure that DSIL credentials remain sufficiently strong over the period of the password
aging process. Previously used passwords will be rejected as new passwords.
The DSIL service will initially attempt to achieve a level of authentication assurance similar to
level 2 of the NIST 800-63/Liberty Identity Assurance Framework standard. Levels of identity assurance
are asserted by the end-user's IdP and transported to the DSIL service in the payload of a SAML
assertion, where they will be encoded within the eduPersonAssurance attribute.
As there will be password aging associated with their DSIL credentials, end-users might lose access to
the DSIL service because they have ignored the series of emails warning of the impending doom of their
account. Moreover there are situations where end-users just fall off the map. End-users should enter
the email address of a colleague or similar trusted person into this portal so that, if a person cannot
respond to an email of doom, another may respond so as to fend off the account archiving process.
Group Management Portal
Group management is a crucial component of any research activity. Once you have proven your
identity to a service provider or relying party, some authorisation process kicks in to determine if you
have the rights to access it. This is true from the large scale of the Large Hadron Collider to the much
smaller scale of a simple file system containing protected data. Authorization in a file system is
typically achieved using groups. If a user is in the right group and the permissions of a file or
directory allow access to that group, the user can access the file or directory.
As described above, a group coordinator defines the need for a particular group to provide access
control to a resource. These groups and their members must be added to the DSIL directory and await the
provisioning of this group to a resource (i.e. file or directory) by a local or remote administrator.

However there are many additional ways to provide authorization information so as to create a richer
palette of mechanisms than file system groups alone, especially as the DSIL directory will
contain both local and institutionally sourced attributes. One means of managing all this
authorization data within the DSIL directory may be to provide an instance of Internet2's Grouper
Groups Management Toolkit v2.0. A directory entry for a group might look like:
dn: cn=bobsgroup,ou=groups,dc=dsil,dc=rdsi,dc=edu,dc=au
objectclass: top
objectclass: posixGroup
description: Bobs Group
cn: bobsgroup
gidNumber: 1000000
memberUid: bobuser
memberUid: eveuser
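A file system client that consumes such entries needs to turn them into the list of GIDs a user belongs to. The sketch below assumes the group entries have already been read from the directory into plain dictionaries; the dictionary layout and the function name are illustrative choices, not a DSIL interface.

```python
# Group entries as they might look after being read from the directory.
GROUPS = [
    {"cn": "bobsgroup", "gidNumber": 1000000,
     "memberUid": ["bobuser", "eveuser"]},
]

def gids_for(username, groups):
    """Return the sorted GIDs of every group that lists the user."""
    return sorted(
        group["gidNumber"]
        for group in groups
        if username in group.get("memberUid", [])
    )
```

Here `gids_for("eveuser", GROUPS)` yields the single GID 1000000, and a user who is not listed in any group receives an empty list.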
RDSI Attribute Authority
A (SAML v2) Attribute Authority is an effective way of having your authorization cake and eating it as
well. When an institutional IdP asserts a set of attributes to an SP, it should only assert information
that is within the scope of the institution. However there may be other sources of authorization
information pertaining to the authenticated end-user of the institutional IdP which may provide extra
information to an SP. Mashing these sets of authorization data together provides a richer palette of
authorization possibilities.
An Attribute Authority (AA) provides a secondary source of attributes to be asserted alongside the
payload of the institutional IdP. An AA is somewhat like a lobotomized IdP and is usually backed by an
LDAP server; in this case the DSIL directory. The management of the AA attributes can be provided by
applications like Internet2's Grouper Groups Management Toolkit v2.0, as described above, which will
allow delegated individuals to access and manage various authorization data. When projects
like Project Moonshot reach a certain level of production quality, the RDSI AA will be ready to
provide direct authentication and authorization using institutional credentials.
ReDs Portal
The ReDs portal, a component of the RDSI portal, allows collection owners and data curators to
submit their data for merit-based ReDs funding so as to offset the cost of storing their data at the
various RDSI nodes. Using the collection owner's or data curator's federated identity, the portal will
allow them to upload sufficient information so that the RDSI Resource Allocation Panel can assess the
merit of the submission. The information required for the submission is detailed in the ReDs program.
Collection owners and data curators will be able to track the progression of their ReDs bid through the
portal. All formal communications between ReDs bidders and the RDSI Resource Allocation
Panel will be tracked as well.
Monitoring and Analytics
As in any business it is important to keep a constant watch on the metrics that describe the health of
the business so as to maximize its profits. In a similar way the RDSI project also needs to keep a close
eye on the metrics that describe its health. The RDSI ecosystem consists of many entities such as
potential and successful Node bidders, collection owners and data custodians, potential and
successful ReDs bidders and of course the end-user researchers as well. All these entities need to
have sufficient information so as to make their component of RDSI a success and therefore the whole
project a success.
RDSI will ensure that these metrics are monitored and provided as openly as possible to all.
My Node Portal
As stated above, the RDSI ReDs program provides funding for the storage of significant data
collections. Once a collection owner or data curator has been successful in their ReDs bid they have to
store the collection in one of the RDSI Nodes. The choice of node is of course up to the successful
bidder.
A conscientious bidder would need to take in many facts concerning the way a particular node
functions as a business or how their collections would suit a node that specializes in a set of
disciplines. This information is typically somewhat elusive. To aid successful ReDs
bidders in making an informed choice, RDSI will ensure that sufficient information is available to them.
RDSI Nodes must supply up-to-date detailed information and metrics concerning their operations.
This information will be displayed on the My Node portal, a component of the RDSI portal.
All RDSI nodes will be required to regularly collect information concerning all
facets of a node's operations. This data will be transferred to the My Node portal and displayed in an
intuitive manner.
4. RDSI Analytics
It is of considerable importance for RDSI itself to monitor how researchers of various disciplines
interact with data sets produced by various disciplines. While collecting this information at the level
of the individual researcher would be somewhat overbearing and would raise issues of researchers’
privacy, at a discipline level it can provide information that will enhance the success of the RDSI
project.
Relating the Australian and New Zealand Standard Research Classification (ANZSRC) codes of
researchers to the same codes associated with the data sets as metadata should provide de-identified
data that will help the RDSI project to measure its success.
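One way to picture this de-identified measurement is as a cross-tabulation of researcher discipline codes against data set discipline codes. The sketch below is an assumption about how such a tally could be computed; the function name, the shape of the access log and the code strings used in the example are placeholders rather than real classifications or an RDSI design.

```python
from collections import Counter

def discipline_usage(access_log, researcher_codes, dataset_codes):
    """Count accesses per (researcher code, data set code) pair.

    access_log       -- iterable of (researcher_id, dataset_id) events
    researcher_codes -- researcher_id -> list of discipline codes
    dataset_codes    -- dataset_id -> list of discipline codes
    The result holds only codes, so no individual is identifiable from it.
    """
    counts = Counter()
    for researcher, dataset in access_log:
        for r_code in researcher_codes.get(researcher, ()):
            for d_code in dataset_codes.get(dataset, ()):
                counts[(r_code, d_code)] += 1
    return counts
```

For instance, two accesses by researchers coded "0201" to a data set coded "0601" would appear as a count of 2 against the pair ("0201", "0601"), with no trace of who made them.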
Knowledge Management
The sharing of knowledge is an important process in research. Without this sharing the efficiency of
research endeavours would be much curtailed and researchers would spend significant time
re-inventing the wheel. The RDSI portal will provide a wiki so as to allow researchers to share the
data-related tricks of their trades. The wiki will also allow researchers to document how they create,
use and store their data. This may well produce productive synergies between various researchers and
even between disciplines.
It is also important for RDSI and Node operators to have a good understanding of the data practices of
researchers and disciplines so as to meet their needs.

5. DaShNet
Moving data from a RDSI Node to a researcher, or ingesting a new data collection from a researcher
into a RDSI Node, will be one of the “meat and potatoes” daily operations within the RDSI project.
However these daily operations are fraught with consequences, especially if the volume of data to be
transferred is large. If there is insufficient network bandwidth and/or high network latency between
the researcher and the data they are trying to access, the efficiency of the research process will
deteriorate. Researchers usually have many activities “on the go” and they will typically move on to
another activity while waiting for a long data transfer to finish. Getting back to the original activity
may take some time or, in some cases, may never happen.
As an example, consider this: most Australian universities have a connection of at least 1Gbps and at
most 10Gbps. Transferring 1TB of data will take slightly over 2 hours at 1Gbps or about 13 minutes at
10Gbps. In this scenario a researcher will probably just go out for a cup of coffee rather than move on
to another activity. However, if 100TB of data were transferred, it would take slightly over 222 hours
at 1Gbps or 22 hours at 10Gbps, and the researcher would definitely move on to a new activity.
The solution for this issue is twofold. Firstly the network bandwidth between the researcher and a
RDSI Node must be maximized, considering the network topology both inside the researcher's
institution and across the AREN (Australian Research and Education Network) backbone. Similarly the
network latency must be minimized. In a coordinated move the AREN is currently moving
its backbone bandwidth to 100Gbps and the NRN (National Research Network) is providing
40Gbps network links from the AREN backbone to a RDSI Node's border router.
Reconsidering the previous 100TB data transfer at a bandwidth of 40Gbps, and assuming that an
institution will eventually upgrade its border routers to at least 40Gbps, it would take approximately
5.5 hours to transfer the 100TB of data rather than the 22 hours at 10Gbps.
Secondly highly efficient data movement protocols must be employed. This topic will be discussed in a
later section.
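The transfer times quoted in this section follow from dividing the data volume by the link rate. The sketch below reproduces that arithmetic under idealised assumptions: decimal units (1TB = 10^12 bytes, 1Gbps = 10^9 bits per second), a fully utilised link, and no protocol overhead or latency effects. The function name is illustrative.

```python
def transfer_hours(terabytes, gbps):
    """Ideal transfer time in hours for a volume in TB over a link in Gbps."""
    bits = terabytes * 1e12 * 8          # bytes to bits
    seconds = bits / (gbps * 1e9)
    return seconds / 3600

if __name__ == "__main__":
    for tb, rate in [(1, 1), (1, 10), (100, 1), (100, 10), (100, 40)]:
        print(f"{tb}TB at {rate}Gbps: {transfer_hours(tb, rate):.2f} hours")
```

Real transfers will be slower than these figures, which is one reason the choice of data movement protocol matters as well as the raw bandwidth.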
6. National File System
One of the initiatives of the Australian Research and Collaboration Services (ARCS) was the ARCS Data
Fabric, which provided 25GB of free storage to all researchers. Unfortunately ARCS funding finished on
1st July 2011, leaving this service in financial doubt. RDSI therefore has to step forward to continue
this service. The RDSI project will provide a National File System that offers researchers in
the Australian Higher Education and Research sector 25GB of free storage.
The deployment of this file system will closely follow that of the ARCS Data Fabric so as to
provide a similar interface to previous and current users. It will run the iRODS v3 software using the
OS authentication feature. This will allow the DSIL LDAP directory to provide the same username/UID
and group/GID semantics within iRODS as without.
7. Data as a Service
The RDSI project is a prime example of DaaS, Data as a Service. As defined by Wikipedia:
DaaS is based on the concept that the product, data in this case, can be provided on
demand to the user regardless of geographic or organizational separation of provider
and consumer.
Data as a Service brings the notion that data quality can happen in a centralized
place, cleansing and enriching data and offering it to different systems, applications
or users, irrespective of where they were in the organization or on the network. As
such, Data as Service solutions provide the following advantages:
• Agility – Customers can move quickly due to the simplicity of the data access and the
fact that they don’t need extensive knowledge of the underlying data. If customers
require a slightly different data structure or has location specific requirements, the
implementation is easy because the changes are minimal.
• Cost-effectiveness – Providers can build the base with the data experts and
outsource the presentation layer, which makes for very cost effective user interfaces
and makes change requests at the presentation layer much more feasible.
• Data quality – Access to the data is controlled through the data services, which
tends to improve data quality because there is a single point for updates. Once
those services are tested thoroughly, they only need to be regression tested if they
remain unchanged for the next deployment.
In RDSI's case the data itself is generated by researchers doing the normal things that researchers do,
i.e. compiling discipline based data sets and publishing their findings. Such data sets, as prescribed
by the rigours of the RDSI ReDs program, will be uploaded to the central repositories within the
collection of RDSI Nodes. Easy discovery of and access to the data contained within the RDSI Nodes is
imperative.
Data Discovery and Metadata
As the RDSI Nodes will be brimming with useful data sets and collections it will be very important for a
researcher to be able to easily find a particular data set. However without sufficient metadata
describing the data set it will be next to impossible for a researcher to discover the existence of the
data, let alone where it is located. Without accurate and sufficient metadata the RDSI
infrastructure is pointless.
In Medieval times parish priests were entrusted with the care of souls. These priests were titled
curates. In present times data custodians are entrusted with the care of metadata. Data collections
and data sets must have data custodians too so that they can be curated, cared for and discoverable
throughout their life cycle.
It is assumed that ANDS (Australian National Data Service) will provide its expertise with respect to
curation matters. For more details on this subject please read the ANDS Guide The Data Curation
Continuum.
Data Movement
One of the prime purposes of a RDSI Node is to be able to move data to or from the Node so that it
can be consumed by a researcher to produce some form of new scientific result. This data
movement can be achieved in an extraordinarily large number of ways. However the data
movement mechanism that is chosen is usually the prescribed data movement protocol of a particular
discipline or the preferred data movement mechanism of the researcher or his or her research group.
In a sector as robust as the Australian Higher Education and Research sector this still leaves a
potentially large number of data movement mechanisms in use. It would be economically infeasible
for every RDSI Node to provide an interface for every data movement mechanism used in the sector.
At some stage a Node must choose what interfaces it will support.
So how can RDSI help Nodes in the choice of data movement mechanisms? An obvious answer is that
RDSI, through the DaSh Technical Architecture, will compel all Nodes to implement a certain set of data
movement mechanisms. These mechanisms will be decided through a community input process.
Nodes can of course implement other data movement mechanisms as well, and this choice will
obviously be one of the many differentiators of a Node from other Nodes, either attracting or
repelling successful ReDs bidders.
Of these compelled interfaces a number will be considered to be of a commodity type. That is, to the
end-users these interfaces will be well known and in common use. For the system administrators of
Nodes these interfaces will also be well known, and their installation and support should be a
known quantity. Some potential examples of these are GridFTP, NFS v4, CIFS, webDAV and iRODS.
A number of these compelled data movement mechanisms may also be of a specialist type, where the
burden of installation and support is higher than for the commodity type and end-users may not have
been commonly exposed to them.
The list of compelled data movement mechanisms will provide an initial level playing field for all
Nodes. It will also provide a level playing field for all end-users of RDSI Node repositories.
In the next sections we will discuss some of the data movement interfaces that may have a part to
play within the RDSI project. As a gross simplification these interfaces will be categorized as:
• File Transfers (in which the provision of the service is mostly stateless).
• File Systems (in which the provision of the service is mostly stateful).
• Data Middleware (in which there are other applications between the data and the end-
user).
Which interfaces will be compelled will be teased out using community advice. An initial list may
look like this:
File Transfers      File Systems                      Data Middleware
GridFTP             NFS v4.1 (pNFS), Clustered NFS    iRODS
Rsync over SSH      pCIFS, CIFS                       Globus Online
Amazon S3           webDAV                            Reliable File Transfer (RFT)
HTTP                                                  Storage Resource Manager (SRM)

As in all movement of data from one place to another there is always a risk that either the
confidentiality and/or the integrity of the data may be compromised in transit. Some data movement
mechanisms provide a layer of encryption to minimize these risks. Others use digital signatures or
checksums to detect that the data has been tampered with. However there are other data movement
mechanisms that do not provide any security of the data in transit.
The use of such insecure data movement mechanisms within the RDSI project will only be tolerated
when they are tunnelled through a layer that supplies confidentiality and integrity for the data.
Such layers are provided by protocols like TLS, GRE or SSH tunnelling. In these cases it is the
responsibility of the Node to provide this end-to-end layer from a RDSI Node to the end-user
however it wishes.
File Transfers
The original File Transfer Protocol (FTP) specification was published as RFC 114 in 1971, even before
TCP and IP existed. Since then file transfers have been the heavy lifters in the data
movement area. Simply put, file transfers move a complete file or a piece of a file from one place to
another, ideally as fast as possible. Examples of file transfer mechanisms of interest to RDSI are:
• GridFTP is a protocol for network transfers using grid frameworks. GridFTP is part of the
Globus toolkit and was designed for efficient and secure transfer of large amounts of data.
GridFTP uses extensions to the FTP protocol to add enhancements such as parallel transfers
and automatic restart of transfer after interruption.
o ARCS GridFTP service
• Rsync
• HTTP
• Amazon S3 is an online storage web service offered by Amazon Web Services. Amazon S3
provides storage through web services interfaces (REST, SOAP, and BitTorrent).
• Tsunami UDP is a fast user-space file transfer protocol that uses TCP control and UDP data for
transfer over very high speed long distance networks (≥ 1 Gbps and even 10 GE), designed to
provide more throughput than possible with TCP over the same networks.
• Aspera’s fasp™ transport technology is an emerging standard for the high-speed movement of
large files or large collections of files over wide area networks.
• Bitspeed Velocity is a software application that accelerates file transfers. It aims to raise
utilization of existing WAN bandwidth to as much as 100%.
8. SRM based File Transfers
In the simplest situation a file transfer mechanism assumes there is only one protocol
supported at both ends of the transfer. However in real life either end of the transfer may support
multiple file transfer mechanisms and there may not be an exact overlap of these mechanisms. In such
cases the separate end points must negotiate a common mechanism before a file transfer can be
initiated. Storage Resource Manager (SRM) is a Grid middleware application that helps
provide this negotiation layer as well as other useful features such as coordinating storage allocation,
dynamic space reservation and automatic garbage collection that prevents clogging of storage
systems.
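The negotiation step can be sketched in a few lines. This is not the SRM interface, only an illustration of the idea under a simple assumption: each end point advertises a list of mechanisms, the source's list is in preference order, and the first mechanism supported by both ends is chosen. The function name is hypothetical.

```python
def negotiate(source_mechanisms, destination_mechanisms):
    """Return the first mechanism in the source's preference order that
    the destination also supports, or None if there is no overlap."""
    supported = set(destination_mechanisms)
    for mechanism in source_mechanisms:
        if mechanism in supported:
            return mechanism
    return None
```

For example, a source offering GridFTP, HTTP and Rsync in that order and a destination offering Rsync and HTTP would settle on HTTP; with no mechanism in common the transfer cannot start.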

File Systems
A distributed file system or network file system is any file system that allows access to files from
multiple hosts sharing via a computer network. This makes it possible for multiple users on multiple
machines to share files and storage resources. The client nodes do not have direct access to the
underlying block storage but interact over the network using a protocol. This makes it possible to
restrict access to the file system depending on access lists or capabilities on both the servers and the
clients, depending on how the protocol is designed. Ideally these file systems should be able to move
data as fast as possible so as to maximize researcher productivity. Examples of network file systems
of interest to RDSI are:
• NFS v4.1/pNFS. NFSv4.1 adds the Parallel NFS (pNFS) capability, which enables data access
parallelism. The NFSv4.1 protocol defines a method of separating the file system meta-data
from the location of the file data; it goes beyond the simple name/data separation by striping
the data amongst a set of data servers. Whether an implementation of NFS v4.1/pNFS
implements enough of the standard to provide strong authentication, data integrity and
data privacy is the concern of a Node operator, given the RDSI stance on the importance of this
matter.
• NFS v4. The NFS v4 protocol specification RFC 3010 provides both strong authentication using
GSSAPI as well as strong integrity and privacy using LIPKEY and SPKM-3. Whether an
implementation of NFS v4 provides sufficient aspects of the standard to provide strong
authentication, data integrity and data privacy is the concern of a Node operator given the
RDSI stance on the importance of this matter.
• SMB/CIFS/pCIFS. The Common Internet File System (CIFS), also known as Server Message
Block (SMB), is a network protocol whose most common use is sharing files on a Local Area
Network. While CIFS can use strong authentication protocols like Kerberos it has little native
support in the areas of data integrity or privacy. To combat this deficiency one can tunnel
CIFS/SMB file systems over protocols like SSH, TLS or GRE.
CTDB is a cluster implementation of the TDB database used by Samba and other projects to
store temporary data and is the core component that provides pCIFS ("parallel CIFS") with
Samba3/4.
• webDAV (RFC 4918) is a set of methods based on the Hypertext Transfer Protocol that
facilitates collaboration between users in editing and managing documents and files stored on
web servers. The WebDAV protocol makes the Web a readable and writable medium. It
provides a framework for users to create, change and move documents on a server. The most
important features of the WebDAV protocol include:
• Locking ("overwrite prevention")
• Properties (creation, removal, and querying of information about author, modified date
et cetera);
• Namespace management (ability to copy and move Web pages within a server's
namespace)
• Collections (creation, removal, and listing of resources)
The webDAV specification does not natively support data integrity or privacy; however
webDAV is typically tunnelled through TLS to provide these services.
Data Middleware
• iRODS. The Integrated Rule-Oriented Data System is open source software that helps people
manage large collections of digital data distributed across multiple sites running diverse
infrastructure.
• OPeNDAP. OPeNDAP, an acronym for "Open-source Project for a Network Data Access Protocol", is a
data transport architecture and protocol widely used by earth scientists. The protocol is based
on HTTP and the current specification is OPeNDAP 2.0 draft. OPeNDAP includes standards for
encapsulating structured data, annotating the data with attributes and adding semantics that
describe the data. The protocol is maintained by OPeNDAP.org, a publicly-funded non-profit
organization that also provides free reference implementations of OPeNDAP servers and
clients.
• Globus Online. Globus Online is a fast, reliable file transfer service that makes it easy for any
user to move any data anywhere. Recommended by HPC centres and user communities of all
kinds, Globus Online automates the time-consuming and error-prone activity of managing file
transfers, so users can stay focused on what’s most important: their research.
• Globus Reliable File Transfer (RFT) Service. RFT is a Web Services Resource Framework (WSRF)
compliant web service that provides “job scheduler"-like functionality for data movement. You
simply provide a list of source and destination URLs (including directories or file globs) and
then the service writes your job description into a database and then moves the files on your
behalf. Once the service has taken your job request, interactions with it are similar to any job
scheduler.
• Globus Replica Location Service (RLS). The RLS service is one component of data management
services for Grid environments. RLS is a tool that provides the ability to keep track of one or more
copies, or replicas, of files in a Grid environment. This tool, which is included in the Globus
Toolkit, is especially helpful for users or applications that need to find where existing files are
located in the Grid.
• Globus Data Replication Service (DRS). The function of the DRS is to ensure that a specified set
of files exists on a storage site. The DRS begins by querying RLS to discover where the desired
files exist in the Grid. After the files are located, the DRS creates a transfer request that is
executed by RFT. After the transfers are completed, DRS registers the new replicas with RLS.
• WAN Data Cache. Researchers are naturally distributed across cities and the country. In most
cases researchers are located at universities, where the network bandwidth available to them is
both sufficiently large and sufficiently close at hand. Access speeds to data within the RDSI
Nodes will thus be sufficient due to the AREN, NRN and the DaShNet initiative. However, there
will always be researchers who are not so well endowed with network bandwidth. These “spatially
disenfranchised” researchers are still required to perform their science and access data within
the RDSI Nodes. WAN Data Caches can help them achieve significantly more effective access and
bandwidth to data within an RDSI Node than they currently have. However, this access and
bandwidth will always be less than that of “spatially enfranchised” researchers.
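The way DRS combines RLS and RFT, as described in the list above, can be sketched as follows. This is a minimal in-memory illustration in Python and not the Globus Toolkit API: the names ReplicaCatalog, TransferService and replicate are hypothetical stand-ins for RLS, RFT and DRS respectively, and the gsiftp URLs are invented example values.

```python
class ReplicaCatalog:
    """Hypothetical stand-in for RLS: maps a logical file name to replica URLs."""

    def __init__(self):
        self._replicas = {}

    def register(self, logical_name, url):
        self._replicas.setdefault(logical_name, set()).add(url)

    def lookup(self, logical_name):
        return sorted(self._replicas.get(logical_name, ()))


class TransferService:
    """Hypothetical stand-in for RFT: accepts a job made of (source, destination) pairs."""

    def __init__(self):
        self.completed = []

    def submit(self, pairs):
        # A real service would persist the job and move the data;
        # this sketch only records each pair as completed.
        self.completed.extend(pairs)


def replicate(logical_names, site_prefix, catalog, transfers):
    """Hypothetical DRS step: ensure each file has a replica under site_prefix."""
    jobs = []
    for name in logical_names:
        existing = catalog.lookup(name)          # 1. query the catalog
        if any(url.startswith(site_prefix) for url in existing):
            continue                             # already present at this site
        if not existing:
            raise LookupError(f"no known replica of {name}")
        jobs.append((name, existing[0], site_prefix + name))
    transfers.submit([(src, dst) for _name, src, dst in jobs])  # 2. transfer
    for name, _src, dst in jobs:
        catalog.register(name, dst)              # 3. register the new replicas
    return [dst for _name, _src, dst in jobs]


catalog = ReplicaCatalog()
catalog.register("survey.dat", "gsiftp://node-a.example.org/data/survey.dat")
transfers = TransferService()
created = replicate(["survey.dat"], "gsiftp://node-b.example.org/data/", catalog, transfers)
```

Running replicate a second time with the same arguments creates no further transfers, because the catalog already lists a replica at the destination site.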

Structured Data
The labels "structured data" and "unstructured data" are often used ambiguously by different interest
groups; and often used lazily to cover multiple distinct aspects of the issue. In reality, there are at
least three orthogonal aspects to structure2:
• The structure of the data itself.
• The structure of the container that hosts the data.
• The structure of the access method used to access the data.
These three dimensions are largely independent and one does not need to imply another. For
example, it is absolutely feasible and reasonable to store unstructured data in a structured database
container and access it by unstructured search mechanisms.
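As an illustration of this independence, the sketch below stores free-text notes (unstructured data) in a relational table (a structured container) and retrieves them by substring search (an unstructured access method). It uses Python's standard sqlite3 module; the table name, column names and sample notes are assumptions made for the example.

```python
import sqlite3

# Structured container: a relational table in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, body TEXT NOT NULL)")

# Unstructured data: free-text field notes with no internal schema.
conn.executemany(
    "INSERT INTO notes (body) VALUES (?)",
    [
        ("Field log: sediment cores collected at site 4, heavy rain.",),
        ("Instrument drift noticed after recalibration on Tuesday.",),
        ("Rain delayed the second sediment survey.",),
    ],
)


def search(term):
    """Unstructured access: substring match over the free text."""
    rows = conn.execute(
        "SELECT id FROM notes WHERE body LIKE ? ORDER BY id",
        (f"%{term}%",),
    )
    return [row[0] for row in rows]


sediment_notes = search("sediment")
drift_notes = search("drift")
```

The same table could equally be queried by primary key, which would be a structured access method over the same unstructured content.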
In many cases researchers will have their data described and constrained by some of the aspects
detailed above. To support these activities Nodes would require more infrastructure than “just plain
storage”. However, Node operators should see this as an opportunity. In fact a collection of Nodes
might collaborate together to provide, for example, a massive distributed query engine based on the
concepts of NoSQL and Map/Reduce. Such a service could be quite enticing to a significant portion of
the Australian Higher Education and Research sectors.
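The Map/Reduce idea behind such a query engine can be sketched in a few lines. In this hedged Python illustration each Node computes a partial result over its own records (the map step) and the partial results are then merged (the reduce step). The Node names, the record layout and the query itself are hypothetical, and a real service would run the map step at each Node and not in one process.

```python
from collections import Counter
from functools import reduce

# Hypothetical records held at two Nodes.
node_records = {
    "node-a": [{"discipline": "astronomy"}, {"discipline": "genomics"}],
    "node-b": [
        {"discipline": "genomics"},
        {"discipline": "climate"},
        {"discipline": "genomics"},
    ],
}


def map_node(records):
    """Map step, run locally at each Node: count records per discipline."""
    return Counter(record["discipline"] for record in records)


def merge_counts(left, right):
    """Reduce step: combine two partial counts."""
    return left + right


partials = [map_node(records) for records in node_records.values()]
totals = reduce(merge_counts, partials, Counter())
```

Only the small per-Node counts cross the network in this pattern; the records themselves stay where they are stored.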
9. Data Integrity
Data integrity within the RDSI project is of utmost importance. If an RDSI Node can't provide data to
an end-user in the same state in which it was ingested, then researchers may not be able to trust the
data from that Node. Moreover, the stain of a loss of data integrity at one Node may affect the
trustworthiness of other Nodes in the eyes of data custodians and end users. While it is
impossible to reduce the risk to data integrity to zero, it is possible to manage this risk.
Node operators bear the brunt of this risk and must ensure that both proactive and reactive measures
are taken. As a proactive measure, storage systems should be able to detect events such as bit rot
and silent corruption and attempt to heal them without human input. Such events should also be
monitored within the My Node portal even if the system successfully healed the loss of data integrity.
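One common way to detect such corruption is to record a checksum when data is ingested and to verify it whenever the data is read. The sketch below illustrates the idea with Python's standard hashlib module. The VerifiedStore class and its attribute names are hypothetical, and the choice of SHA-256 is an assumption and not a requirement stated in this document.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


class VerifiedStore:
    """Hypothetical store that records a checksum at ingest and checks it on read."""

    def __init__(self):
        self.blobs = {}      # object name -> stored bytes
        self.checksums = {}  # object name -> SHA-256 recorded at ingest
        self.events = []     # names of objects that failed verification

    def ingest(self, name: str, data: bytes) -> None:
        self.blobs[name] = data
        self.checksums[name] = sha256_hex(data)

    def read(self, name: str) -> bytes:
        data = self.blobs[name]
        if sha256_hex(data) != self.checksums[name]:
            # Record the event so that it can be surfaced to operators.
            self.events.append(name)
            raise IOError(f"integrity check failed for {name}")
        return data


store = VerifiedStore()
store.ingest("sample.bin", b"observation data")
clean_copy = store.read("sample.bin")

# Simulate silent corruption by flipping one bit of the stored copy.
damaged = bytearray(store.blobs["sample.bin"])
damaged[0] ^= 0x01
store.blobs["sample.bin"] = bytes(damaged)

try:
    store.read("sample.bin")
    corruption_detected = False
except IOError:
    corruption_detected = True
```

This sketch only detects the fault. Healing it without human input would additionally require a redundant copy or parity data from which the original bytes can be restored.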
As a secondary proactive measure, a service similar to fsprobe, the CERN probabilistic data integrity
checker, should perform a regular check of file systems by writing various combinations of bit
patterns and then reading them back. This can be used to identify file system, operating system and
hardware problems.
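The write-then-read-back check can be sketched as follows. This is not fsprobe itself, only a minimal Python illustration of the same idea; the function name probe, the chosen bit patterns and the block sizes are assumptions.

```python
import os
import tempfile

# Bit patterns to write: all zeros, all ones, and two alternating patterns.
PATTERNS = (b"\x00", b"\xff", b"\xaa", b"\x55")


def probe(directory, block_size=4096, blocks=16):
    """Write each pattern to a scratch file, read it back, and
    return the patterns whose read-back data did not match."""
    failures = []
    for pattern in PATTERNS:
        expected = pattern * block_size * blocks
        fd, path = tempfile.mkstemp(dir=directory)
        try:
            with os.fdopen(fd, "wb") as handle:
                handle.write(expected)
                handle.flush()
                os.fsync(handle.fileno())  # ask the OS to push the data to storage
            with open(path, "rb") as handle:
                actual = handle.read()
            if actual != expected:
                failures.append(pattern)
        finally:
            os.remove(path)
    return failures


with tempfile.TemporaryDirectory() as scratch:
    failed_patterns = probe(scratch)
```

An immediate read-back like this may be served from the operating system's page cache and never reach the storage device, so a production checker would need to bypass or flush the cache, or delay the read, for the check to exercise the hardware.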
As a reactive measure, when a storage system does detect a fault, the cause must be investigated
promptly and mitigation strategies designed and put in place. RDSI must be informed of these faults
and mitigation strategies. Node operators should share this information with each other so that a
shared body of knowledge about these storage anomalies can help minimize future anomalies across all
RDSI infrastructure.
2 Duncan Pauly, founder and chief technology officer of Coppereye

10. Manifestation of Trust within the RDSI programme
All RDSI infrastructure must manifest a significant level of trustworthiness so that
researchers, data custodians and other users will feel secure in its use.
In an infrastructure like PKI or a SAML-based federation, the province of trust is usually located at a
single point: the trust root is either a root CA (in the case of a PKI) or a self-signed
certificate (in the case of a SAML federation, where it is used to digitally sign an aggregation of
SAML metadata). For both PKI and SAML federations there are also open and well-published practice
statements that allow end-users and relying parties to understand the risks of using the PKI or SAML
federation.
In the RDSI infrastructure there are a number of trust centres that manifest this aggregated trust.
Some are manifested by RDSI itself as a governance and policies layer, some are manifested by the
RDSI Nodes and their work practices. Some are manifested in the appropriate sanctioned use of the
DSIL LDAP directory. There are also trust manifestation centres that at first glance have little real
connection to either RDSI or Nodes. For example, when a remote RDSI file system is mounted on a
local file system, it is the work practices of the local system administrators that generate the
trustworthiness of that act.
For this reason the manifestation of trust for all aspects of the RDSI project is somewhat more
complicated than the simple case of a PKI or SAML federation. This increases the risk to end-users and
relying parties, as they may not be able to fully understand the risks of using the RDSI
infrastructure. The RDSI governance layer must manage the perception of risk well so as to maintain
a level of trustworthiness at which researchers, data custodians and other users will feel
secure in its use.
