I shall provide a summary of JISC work in the area of ‘Big Data’. My primary focus will be on how to manage the huge amount of research data produced in UK universities. I shall cover the history of JISC interventions to improve research data management and look at next steps. I shall also touch on other areas of work, such as ‘Digging into Data’ and web archiving, which also deal with ‘big data’.
Simon Hodson
1. Thursday 10 May 2012
Eduserv Symposium: Big Data
JISC and the Big (Research) Data Challenge
Simon Hodson
JISC Programme Manager, Managing Research Data
2. Why is managing research data important?
JISC considers it a priority to support universities in improving the way
research data is managed and, where appropriate, made available for
reuse.
Research funder policies, legislative frameworks, good practice, open data
agenda
– The outputs of publicly funded research should be publicly available.
– The evidence underpinning research findings should be available for
validation
Good data management is good for research
– More efficient research process, avoidance of data loss, benefits of data reuse
Alignment with university missions.
– Universities want to provide excellent research infrastructure.
– Universities want to have better oversight of research outputs.
3. Estimated Research Data Requirements
Two Russell Group Universities
Estimated current data holdings of c.2PB (managed and unmanaged)
Currently provide 800TB/300TB in a central storage facility, not all of which is
used (but will be full in 12-18 months)…
Significant amount of data in temporary storage, external drives etc…
‘the more groups we go to talk to, the more we're hearing of significant
data holdings on external hard drives and small RAID systems’
1994 Group University
No central research data provision.
Faculties (medicine, business, humanities) have 20-30TB each.
Engineering currently has 170TB faculty system, urgent need to expand.
But… one group, recently interviewed, currently has 250TB, only half in
‘managed storage’; will reach PB levels in the next few years.
4. DUDs
The data centre under the desk (or in a backpack) is not adequate.
5. Why manage research data?
Not just about storage or avoiding data loss…!
It’s about knowing what to keep and what to throw away…
Important to extract maximum return on investment from publicly
funded research.
Access to underlying data is essential for verification and therefore
research integrity.
Opportunities to extract more knowledge from existing data, new
analysis.
It’s about making the most out of data created!
7. JISC and Research Data
1. Understanding the problem (pre-2007 to 2009)
2. Prototyping solutions (2009-11)
3. Hardening solutions and building institutional capacity (2011-13)
4. Developing elements of national infrastructure (2013+)
8. 1: Understanding the Problem
Key JISC reports:
Dealing with Data:
http://www.ukoln.ac.uk/ukoln/staff/e.j.lyon/reports/dealing_with_data_report-final.pdf
Keeping Research Data Safe:
http://www.jisc.ac.uk/media/documents/publications/keepingresearchdatasafe0408.pdf
Skills, Role, Career Structure of Data Scientists and Curators:
http://www.jisc.ac.uk/media/documents/programmes/digitalrepositories/dataskillscareersfinalreport.pdf
Other:
UKRDS Scoping Study:
http://www.ukrds.ac.uk/resources/
9. Prototyping Solutions:
First MRD Programme, 2009-11
RDM Infrastructure (guidance/support, systems)
RDM Planning (DMPs, best practice, disciplinary challenges)
RDM Training (targeted at disciplinary needs)
Challenges of data citation and publication
First JISC MRD Programme, 2009-11: http://bit.ly/jiscmrd2009-11
JISC MRD Outputs Page: http://bit.ly/jiscmrd2009-11-outputs
10. Building Institutional Capacity:
Second MRD Programme, 2011-13
RDM Infrastructure (policy, guidance/support, systems)
17 large projects
RDM Planning (DMPs, best practice, disciplinary challenges)
RDM Training (disciplines and libraries/research
support)
Innovative data publication
Second JISC MRD Programme, 2011-13: http://bit.ly/jiscmrd2009-11
Projects shortly to be announced for research data publication and developing RDM
training materials: http://bit.ly/jiscmrd-2012-Call
11. A holistic approach…
Leadership and Policy Development
Guidance and Training
Publication, Citation and Discovery Mechanisms
Support for Data Management Planning
RDM Systems and Infrastructure
12. How to develop RDM services
Why develop services?
Roles and responsibilities
Process of service development
The components / building blocks
• Policy
• Data Management Planning
• Storage
• Data registry.....
Getting started
In development! Examples and case studies to develop into toolkit
Slide Credit: Sarah Jones and Martin Donnelly, DCC
13. Next steps? Elements of a national infrastructure
Journals are increasingly implementing policies requiring availability
of underlying data.
Registry of Journal Data Policies to help researchers and research
administrators understand the implications and changing landscape.
Universities are developing catalogues of research data holdings.
National registry of research data to facilitate discovery, reuse; better
understanding of impact and research landscape.
15. Thank You!
First JISC MRD Programme, 2009-11: http://bit.ly/jiscmrd2009-11
JISC MRD Outputs Page: http://bit.ly/jiscmrd2009-11-outputs
Second JISC MRD Programme, 2011-13: http://bit.ly/jiscmrd2009-11
Programme Blog: http://researchdata.jiscinvolve.org/
MRD Project Blogs: http://tiny.cc/MRDblogs
Twitter: #jiscmrd
E-mail: s.hodson@jisc.ac.uk
Acknowledgements for slides, content: Carol Goble, Liz Lyon, Peter Murray-Rust, David Shotton, Martin Donnelly, Sarah Jones.
16. From prototype to platform…
DataFlow Project: http://www.dataflow.ox.ac.uk/
UMF Programme SaaS for RDM Projects: http://www.jisc.ac.uk/whatwedo/programmes/umf.aspx
17. The JISC UMF DataFlow Project
DataStage is a file management system.
A DataStage data package consists of selected data files accompanied by an RDF metadata manifest, with a SWORD v2 wrapper.
DataBank is a generic repository, and can be used to store things other than research datasets, for example data management plans (DMPs).
Diagram labels: Researchers; DataStage file system; SWORD deposit; DataBank repository; Researchers, other users.
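To make the idea of a data package concrete, here is a minimal Python sketch of bundling selected files with an RDF metadata manifest. It is an illustration under stated assumptions, not DataStage's actual package format: the function names, the manifest file name (manifest.rdf), the use of a zip archive and the choice of Dublin Core terms (dcterms:title, dcterms:hasPart) are all hypothetical. The SWORD v2 deposit step is left out.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DCTERMS_NS = "http://purl.org/dc/terms/"


def build_manifest(package_id, title, filenames):
    """Return an RDF/XML manifest (bytes) listing the files in the package.

    Hypothetical layout: one rdf:Description for the package, with a
    dcterms:title and one dcterms:hasPart per file.
    """
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("dcterms", DCTERMS_NS)
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    desc = ET.SubElement(
        root, f"{{{RDF_NS}}}Description", {f"{{{RDF_NS}}}about": package_id}
    )
    ET.SubElement(desc, f"{{{DCTERMS_NS}}}title").text = title
    for name in filenames:
        ET.SubElement(
            desc, f"{{{DCTERMS_NS}}}hasPart", {f"{{{RDF_NS}}}resource": name}
        )
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


def build_package(package_id, title, files):
    """Zip the selected files with their manifest; files maps name -> bytes."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in files.items():
            zf.writestr(name, data)
        # The manifest lists file names in sorted order (assumed convention).
        zf.writestr("manifest.rdf", build_manifest(package_id, title, sorted(files)))
    return buf.getvalue()


if __name__ == "__main__":
    package = build_package(
        "urn:example:dataset-1",  # hypothetical identifier
        "Example dataset",
        {"readings.csv": b"t,v\n0,1.5\n"},
    )
    with zipfile.ZipFile(io.BytesIO(package)) as zf:
        print(zf.namelist())
```

In a real deployment the resulting archive would be handed to a SWORD v2 client for deposit into the repository; the details of that exchange depend on the repository's service document and are not shown here.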