A two-hour introductory session presented to PhD students at the University of Melbourne, 13 September 2012.
Given by Steve Bennett (VeRSI) and Jeff Christiansen (ANDS).
UpSkills: Research Data Management for the Sciences
1. Research Data Planning
...for the Sciences
MSGR UpSkills Program
Jeff Christiansen & Steve Bennett
13 September 2012
17/09/2012 1
2. Why data management
What data
Where you store it
Who owns it
How you manage it
Bonus: start work on a data management plan!
3. Intro – who we are
Dr Jeff Christiansen: jeff.christiansen@ands.org.au
Australian National Data Service
Previously researcher in molecular genetics
Steve Bennett: steve.bennett@versi.edu.au
Victorian e-Research Strategic Initiative
Helps researchers with systems for digital data
4. Why data management
What data
Where you store it
Who owns it
How you manage it
5. Becoming aware of data management in research
BSc (Hons)
Experiment 1
?
Experiment 2
7. Becoming aware of data management in research
PhD
CCACGCGTCCGGTGTGAGCTCTCCTTCAGCTGCTGCAGGCATTACACTCAGCTCTGCTGT
CCAAGCTGCTCATGTGATTGCCCTCTAATCCATTCAGGCAAAGTGAGCTAGACTTGTTTA
AGCTGCAGGTCTTATTTTGATTGTAGCAGGCTAGTGAACAGTCACAGAAGTGGTTCAAGT
ATTGTGCCCCTTGGAGCTGTTATCTTTGAAAATGTGGCCGTGGCTGGAAAAGGATGCATC
TGCACCAATGGCACAGTGACCAGCCAGTTGCTTAGGGGCTTAGCTGGTGGATTTGGACCT
GTCTTCTGCAACCTGGGGAAAGCATAATCTACTGTGTTATTTGATAATGGAAGCGCCGTG
ATCAGATCCATCCCTCTGCTTTGAATTTTCAAACAAATAATCAAGAATTTGGCTCGTGTT
AAAAAAAAAAAAAAAA
14. Becoming aware of data management in research
EMAGE Database Project Manager
18. Becoming aware of data management in research
EMAGE Database Project Manager
Cross DB queries need to use appropriate descriptors, not just free text
E.g. Gene name identifiers
19. Becoming aware of data management in research
Being organised, having systems in place and adopting community standards are all helpful in data management.
Think about what you will be required to do when publishing.
There are obligations for having data available for others post publication.
It’s useful to have your data organised so you can collaborate with others easily.
What will happen to your data when you leave the lab? Your supervisor would like to know what’s what/where.
20. Data Planning & Managing
Motivators
#1 Meet your obligations
legal, ethical, funding requirements; uni, department, group policies
Find out now – avoid hassle later (ask research-data@unimelb.edu.au)
#2 Make your life easier
a data management system to make your research work
a data management plan to save time
keeping data, finding stuff again, labelling, security
sharing & collaborating
#3 Helping your career
being a professional researcher
data – your assets and records – finding, understanding data in years to come
contributing to global research community
manage your data now, help your future self.
21. Why data management
What data
Where you store it
Who owns it
How you manage it
Ask: research-data@unimelb.edu.au
22. What is data?
Observational data
Sensor readings, telemetry (non-reproducible)
Experimental data
Gene sequences, chromatograms (reproducible, but expensive)
Simulation data
Climate models (model the most important thing)
Derived/compiled data
Compiled database (reproducible but expensive)
23. What else is data?
Social sciences
Surveys, statistical data
Humanities
Cultural artefacts (video, photos, sound…)
Physical samples
Soil, biological, water, archaeological…
Does anyone here not have data?
24. The University’s definitions
Research Data
laboratory notebooks; field notebooks; primary research data (hardcopy or
in computer); questionnaires; audiotapes; videotapes; models; photographs;
films; test responses; slides; artefacts; specimens; samples
Research Records
Includes correspondence (electronic mail and paper-based correspondence);
project files; grant applications; ethics applications; technical reports; research
reports; master lists; signed consent forms; and information sheets for research
participants
Administrative Records (Research Office, Central Records)
Includes contracts and agreements, patents, licences, grants, intellectual property
and trademarks, policies, ethics, research project files, reports, publications
What is often included as “Research Data”:
= data + records + copies (physical & digital)
= stuff you used and/or created
25. Group activity (15 mins)
Form groups of similar discipline
Earth sciences/forestry/botany/agriculture
Health/medical biology/physio/social work
Engineering/computer science/linguistics
Discuss:
What kind of data do you collect?
How do you get it?
Your data management checklist:
Section 1.1
26. Why data management
What data
Where you store it
Who owns it
How you manage it
27. Research trends
Research Data is increasing in size
Protein crystallography 100 GB/experiment
Gene sequencing 1,000 GB/day
High-energy physics 10,000,000s GB/year
Astronomy (SKA) 1,000,000,000 GB/day
Research Collaborations are increasing
Human Genome project (1990-2003)
113 people, 20 orgs
Belle collaboration (1994-..)
~370 people, 60 inst., 14 countries
ATLAS collaboration @ LHC CERN (1994-2020+)
~2500 people, 169 inst., 37 countries
Research Data is increasingly digital
Wonderful opportunities for reuse,
sharing, collaboration, analysis
Data science (4th paradigm)
“eResearch”!
28. Research trends
Large scale data intensive science
“A totally new way of doing research”
New research methods, new skills,
therefore new training needed
New skills...
Specialists – in both technology and
research
Informatics – dealing with data from
collection through analysis
Data Management and Planning –
collecting, maintaining, sharing data
Everyone!
29. How big?
Scale, left to right: 1 MB (spreadsheets); 10 GB to 1 TB (numerical, simulations, synchrotron, video); 1 PB
Left: Easy!
Middle: Awkward
Right: Easy? (Probably already solved)
Marked on the scale: Limit of Google Drive, DropBox…
30. Where to keep it?
Possibilities:
Research group storage
Ask!
Local computer
Backups crucial. Sharing hard. Disaster looms.
Cloud (Dropbox, Google Drive)
Check security, legals. How to archive?
Ask research-data@unimelb.edu.au
33. Group activity #2 (15 mins)
Discuss
How much data will you have?
Where will you store it?
What data formats?
Data management checklist
Complete section 2.3 & 2.4
If non-digital: 2.1, 2.2
34. Why data management
What data
Where you store it
Who owns it
How you manage it
35. In collaborations, get IP right early.
Find out:
Does the University own your data?
Can you still share it?
Restrictions?
Licences?
36. IP – who claims to own it
Copyright – who has legal backing (not all data can be copyrighted)
Ethics – more rules you agreed to
Must you keep the data private?
Must you share it?
Privacy – can you de-identify the data?
37. Group activity #3 (15 mins)
Discuss
Who owns your data?
What data can you share? With whom?
How will you protect confidential information?
Data management checklist
Complete section 1.3
38. Why data management
What data
Where you store it
Who owns it
How you manage it
41. Starting your system
Consider your goals – what do you want to get out of managing your data?
Figure out your criteria for keeping data
Picture your data three years from now
Consider the metadata you want to collect to document your datasets
42. Benefits
Find your data 3 years from now
Get more papers out of your data
Save time and stress – get organised
Share with collaborators
Some journals require data submission
43. Being more professional...
Not rocket science!
Stop and think about what data you have, what you’re doing, what you should be doing
Some scary facts:
Microfilm, non-acidic paper last 100+ years
magnetic media lasts 10+ years
optical media lasts 20+ years
2-10% of hard drives fail every year
software & hardware can outdate quickly
Scary stories:
US study: 100s of charges of “research misconduct”
40% could have been avoided by better data management!
UniMelb: ~20 cases of research misconduct in 2008.
Most involved students. All needed good records!
Climategate scandal, UK – FOI
(Image: Burroughs 1977 – B 9495 Magnetic Tape Subsystem)
Proper Planning & Management is needed!!!
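The hard drive figure on this slide compounds over the life of a project. As a rough sketch only, assuming drive failures are independent and occur at a constant annual rate (a simplification that is not taken from the slide), the chance of losing at least one drive can be estimated as below. The function name and the three-year span are illustrative choices.

```python
def chance_of_loss(annual_failure_rate: float, years: int, drives: int = 1) -> float:
    """Probability that at least one of `drives` fails within `years`.

    Assumes independent failures at a constant annual rate (a simplification).
    """
    # Chance that a single drive survives every year of the period.
    survive_one = (1.0 - annual_failure_rate) ** years
    # All drives must survive for there to be no loss at all.
    return 1.0 - survive_one ** drives


if __name__ == "__main__":
    # The slide quotes 2-10% of hard drives failing every year.
    for rate in (0.02, 0.10):
        p = chance_of_loss(rate, years=3)
        print(f"annual rate {rate:.0%}: {p:.1%} chance of failure within 3 years")
```

Under these assumptions, a single drive with no backup has roughly a 6% to 27% chance of failing during a three-year project.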
44. High level view
Your data management system needs to cover:
Create, Capture, Describe
Store, Secure, Preserve
(Use, Transform, Update)
Keep, Transfer, Destroy
(National Archives)
45. A simple Data Management System
Identify key data in your context, important stuff to keep (your Data Assets)
Find secure places to keep physical & digital Records + Data (filing cabinet, department shared drive) – backups are essential
Where and when should there be checks on your data (sanity checks, quality control, standards)?
File your data and records into logical divisions, say activities, projects, or pieces of work
e.g. folders /DeptShare/johnsmith/Records/ProteinABC Investigation
Don’t break things down too much, it makes things harder to find!
Have a consistent file naming convention:
perhaps: ActivityOrContents-LocationOrPerson-CreateDate-Id-Description.ext
e.g. “ProteinABC-LJW-20100409-0001 Raw data from instrument.dat”
Keep good metadata (notes, records) on how you captured your data, particularly for physical records
Descriptions of collections or files – structured text files are good enough
e.g. FileOrCollectionName-metadata.txt
On other things, entities that are not files – structured text files or spreadsheets
Have a good labelling/ID/coding system
Perhaps keep a registry (spreadsheet will do; IDs, names, location, basic metadata)
Find the right balance in digitising physical stuff (easy and quick)
Digital is easy to keep/transfer/search if stored properly. However, digitising/scanning everything can be time consuming and without good descriptions may not be useful.
Link digital notes/metadata to physical stuff (IDs, names, labels, codes, location)
Have some basic digital representations or notes of important physical stuff
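The file naming convention on this slide can be enforced with a few lines of code, which helps keep names consistent across a project. The sketch below follows the slide's example (a space before the description) and assumes an 8-digit date and a 4-digit ID; the function names `build_name` and `parse_name` and the field names are hypothetical, chosen for illustration.

```python
import re
from datetime import date, datetime

# Pattern modelled on the slide's example:
# "ProteinABC-LJW-20100409-0001 Raw data from instrument.dat"
NAME_PATTERN = re.compile(
    r"(?P<activity>[A-Za-z0-9]+)-(?P<who>[A-Za-z0-9]+)-"
    r"(?P<created>\d{8})-(?P<seq>\d{4}) (?P<description>.+)\.(?P<ext>[A-Za-z0-9]+)"
)


def build_name(activity: str, who: str, created: date, seq: int,
               description: str, ext: str) -> str:
    """Assemble a file name and reject fields that would break the convention."""
    name = f"{activity}-{who}-{created:%Y%m%d}-{seq:04d} {description}.{ext}"
    if NAME_PATTERN.fullmatch(name) is None:
        raise ValueError(f"fields do not fit the convention: {name!r}")
    return name


def parse_name(filename: str) -> dict:
    """Split a conforming file name back into its fields."""
    match = NAME_PATTERN.fullmatch(filename)
    if match is None:
        raise ValueError(f"not a conforming file name: {filename!r}")
    fields = match.groupdict()
    fields["created"] = datetime.strptime(fields["created"], "%Y%m%d").date()
    fields["seq"] = int(fields["seq"])
    return fields


if __name__ == "__main__":
    name = build_name("ProteinABC", "LJW", date(2010, 4, 9), 1,
                      "Raw data from instrument", "dat")
    print(name)
    print(parse_name(name))
```

Putting the date in YYYYMMDD form means that sorting by name also sorts by date, which is one reason such a convention pays off later.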
46. Free Tools
jEdit – text file editor (private notes, metadata and records)
local disk + file share + Cobian Backup (private project records, data)
Google Desktop (file and email search)
Zotero (reference material) (EndNote is Uni default)
EVO & Skype & Google chat (video/tele/chat communication)
http://evo.arcs.org.au/
Sakai@Melbourne (project workspace)
https://sakai.unimelb.edu.au/
Google docs + Sites (collaborative editing)
Google groups (email list)
(see Info Skills classes on EndNote, UpSkills 29 June on VC)
research data storage, a tricky one…
use local storage in preference, ask around
DropBox, Google Drive, Microsoft SkyDrive, box.com…
too many others to list, heaps on the web…
See Digital Research Tools (DiRT) wiki for a huge list
http://digitalresearchtools.pbworks.com/
Check with your supervisor,
47. Data Security
2 aspects to security
Safety from damage or loss
How important is the data to you?
Safety from incorrect use
What are the possible consequences?
Safety from damage or loss (unintended and intentional)…
What’s acceptable loss (safety can cost, use up time)
Backups (data, software, system)
How often (hourly, daily, weekly, monthly, manually, automated)?
How many and where (onsite, offsite, both, multiple)?
Departmental storage? Probably backed up already!
Disaster Recovery
Quality hardware, multiple/spare servers, spare disk drives,
Operating System and Applications image backups
(talk with someone technical, your local IT guys)
48. Data Security
Safety from damage or loss (continued)…
Make sure Backup is occurring
Essential data and records... “Your Archive”
Frequency should depend on how often your data changes
Incremental backups are essential. Replication IS NOT SAFE!!!
Keep some copies (one?) offsite.
Database backups should use database tools (mysqldump, pg_dump etc.)
Departmental storage is best... probably backed up already!
Worst case... DIY, use external hard drives or remote storage
Seek advice on software
for Windows I use... Cobian Backup, DriveImage XML
for Linux I use... rsync (see http://rsync.samba.org/examples.html )
for Mac there is... Time Machine
(talk with someone technical, your local IT guys)
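To illustrate why the slide prefers incremental backups to replication: a plain mirror faithfully copies accidental edits and deletions, whereas an incremental scheme keeps earlier versions. The following is a minimal sketch of the idea, not a replacement for the tools named above; the function names, the `manifest.json` file and the one-folder-per-run layout are assumptions made for illustration, and the backup folder is assumed to sit outside the source folder.

```python
import hashlib
import json
import shutil
from pathlib import Path


def file_digest(path: Path) -> str:
    """SHA-256 of a file, read in chunks so large data files are not loaded whole."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def incremental_backup(source: Path, backup_root: Path, label: str) -> list:
    """Copy files that are new or changed since the last run into backup_root/label.

    Earlier snapshot folders are left untouched, so older versions survive a
    bad edit in `source`. Returns the relative paths that were copied.
    """
    manifest_file = backup_root / "manifest.json"
    manifest = json.loads(manifest_file.read_text()) if manifest_file.exists() else {}
    snapshot = backup_root / label
    copied = []
    for path in sorted(p for p in source.rglob("*") if p.is_file()):
        rel = path.relative_to(source).as_posix()
        digest = file_digest(path)
        if manifest.get(rel) != digest:  # new file, or content has changed
            target = snapshot / rel
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(path, target)
            manifest[rel] = digest
            copied.append(rel)
    backup_root.mkdir(parents=True, exist_ok=True)
    manifest_file.write_text(json.dumps(manifest, indent=2))
    return copied
```

Each run should use a fresh label (for example, the date) so that snapshots never overwrite one another. Restoring a file means looking in the most recent snapshot folder that contains it.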
49. Data Security
Safety from incorrect use (unintended and malicious)…
PCI DSS - a recommendation (Payment Card Industry Data Security Standard)
eg. google for: “nacubo.org payment card data security”
12 requirements that are good practice (first 10 are the basics)
10 IT basics…
Firewall servers
Do not use default usernames/passwords
Physically protect stored data (lock up servers, disk, tape, source material)
Use encrypted transmission over internet (VPN, SSL, SSH, GridFTP, S/MIME email)
Update antivirus/antimalware software regularly
Use secure and trusted applications
Restrict access to sensitive data (tighter control, or put it somewhere else)
Assign unique IDs for each user
Record and monitor all access to data
Plus some good practice…
Don’t retain sensitive data
Or encrypt sensitive information
50. Read up!
Google: research data toolkit
http://researchdata.unimelb.edu.au
ANDS guides
To consider: identifiers, DOIs, archival,
security, licensing, metadata formats,
ontologies, controlled vocabularies,
definition of “collection”, data reuse,
metadata stores…!
51. Group activity #4 (15 mins)
Data management checklist
Complete section 3.1
This bit is pretty easy for most people. We’ll do a quick summary.
Trivia: some disciplines actually don’t have data: philosophy, theology, law.
But maybe your data volumes are easy to manage. Who is towards the left? Who is in the middle? Who is towards the right? The middle can be the most awkward: too big to store online, too small to get the attention of “big data” initiatives.
If you’re organised – and lucky – you can deposit your data in a repository for your discipline, in the University’s Research Data Registry or in the national register: Research Data Australia. This helps increase your profile and helps potential collaborators find you.
Soil samples left by one research group at the Burnley Campus. With only basic labelling, how will future research groups make any sense of it?
Now we talk about the “who”: who owns the data, who controls it, as well as restrictions on it: privacy, confidentiality, ethics, requirements to share (or not to share).
Now you’ve thought about what data you’ve got, made some decisions about where to put it, and have considered the thorny issues of IP, it’s time to put that knowledge together systematically: a data management system.
Show of hands: who has a data management plan? Who has heard of the Policy on the Management of Research Data and Records?
After investigating a number of different research data life cycles, I believe this to be the simplest approach to research data record keeping that might integrate with a broad range of research practice.
Once you know what information you’re going to keep (your archive) you can start putting into place a Data Management System. Apply, where practical, to all data/records you collect. Check: everyone knows metadata?
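For anyone unsure what a metadata record might look like in practice, here is a small sketch of the structured text file suggested on the Data Management System slide (FileOrCollectionName-metadata.txt). The "Key: value" layout, the function names and the example field names are assumptions for illustration; use whatever fields your discipline expects.

```python
from pathlib import Path


def write_metadata(data_file: Path, fields: dict) -> Path:
    """Write 'Key: value' lines to <name>-metadata.txt beside data_file."""
    sidecar = data_file.with_name(f"{data_file.stem}-metadata.txt")
    lines = [f"{key}: {value}" for key, value in fields.items()]
    sidecar.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return sidecar


def read_metadata(sidecar: Path) -> dict:
    """Read the 'Key: value' lines back into a dictionary."""
    fields = {}
    for line in sidecar.read_text(encoding="utf-8").splitlines():
        if ": " in line:
            key, value = line.split(": ", 1)
            fields[key] = value
    return fields


if __name__ == "__main__":
    # Placeholder values only; the field names are hypothetical.
    data = Path("ProteinABC-LJW-20100409-0001 Raw data from instrument.dat")
    sidecar = write_metadata(data, {
        "Title": "Raw data from instrument",
        "Created": "2010-04-09",
        "Captured by": "LJW",
        "How captured": "(describe instrument, settings, sample IDs)",
    })
    print(sidecar.name)
    print(read_metadata(sidecar))
```

A plain text file like this stays readable without special software, which matters when someone else picks up the data years later.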