This presentation covers a number of best practices for managing research data. The main topics include: file naming and organization conventions, data documentation, and data storage and backups.
2. Do You Still Have Your Data?
• What if your hard drive crashes?
• What if you are accused of fraud?
• What if your collaborator abruptly quits?
• What if the building burns down?
• What if you need to use your old data?
• What if your backup fails?
• What if your computer gets stolen?
• What if…
3. What Are Data?
• Observational
– Sensor data, telemetry, survey data, sample data,
images
• Experimental
– Gene sequences, chromatograms, toroid magnetic
field data
• Simulation
– Climate models, economic models
• Derived or compiled
– Text and data mining, compiled database, 3D models,
data gathered from public documents
4. Why Data Management?
• Don’t loose data
• Find data more easily
– Especially if you need older data
• Easier to analyze organized, documented data
• Avoid accusations of fraud & misconduct
• Get credit for your data
• Don’t drown in irrelevant data
5. For each minute of planning at
beginning of a project, you will save
10 minutes of headache later
6. What This Session Covers
• Introduction to a few topics in data
management
– File organization conventions
– Documentation
– Storage and backups
7. What This Session Covers
• Hands-on exercises in each topic
• My goal is to offer practical, usable solutions
– Recognize that I can’t cover everything
10. File Naming Conventions
• Make it easier to find files
• Avoid many duplicates
– Especially when you’re not sure which is the latest
or most correct!
• Make it easier to wrap up a project because
you know which files belong to it!
11. File Naming Conventions
• Files should be named consistently
• Files names should be descriptive but short
(<25 characters)
• Use underscores instead of spaces
• Avoid these characters: “ / : * ? ‘ < > * + & $
• Use the file dating convention: YYYY-MM-DD
– This works well with a lab notebook
12. File Versioning
• Why?
– If you only have one copy and you make a
mistake…
– If your data is stored in multiple locations
13. File Versioning
• For analyzed data, use version numbers
• Save files often to a new version
• Label the final version FINAL
• For code, consider GIT or SVN
14. File Organization
• Any system is better than none
• Possibilities
– One project, one folder (for small projects)
– Separate folders for data or project stages
– Separate folders for different types of data
– Date-based folders (pairs well with a lab
notebook!)
15. What To Avoid
• One person data hoards
• Data scattered across several machines
– Not backups! Backups are fine
• Storage that doesn’t mirror “ownership”
– If it’s communal, it belongs in a communal place
– If data collection happens on an individual’s
machine, that doesn’t mean the data should stay
there!
16. Document Your Conventions
• No point to have a system without
documentation
– README.txt
• Use .txt over .doc because it’s more durable
– Front cover of research notebook
– A printout by the computer
– Etc.
17. Document Your Conventions
• In project-wide README.txt
– Basic project information
• Title
• Contributors
• Grant info
• etc.
– Contact information for at least one person
– All locations where data live, including backups
– Useful information about the files and how
they’re organized
18. Exercise: File Naming Conventions
• Develop a file naming convention for your
most common data type
20. What would someone unfamiliar
with your data need in order to find,
evaluate, understand, and reuse
them?
21. Documentation
• Consider the differences between
– someone inside your lab
– someone outside your lab but in your field
– someone outside your field
• Two parts: metadata and methods
22. Documentation
Methods
• How the data were
gathered
• How the data should be
interpreted
• What you did
– Limitations on what you did
• …build trust in your data
Metadata
• What you’re looking at
• Who made it and when
• How it got there
• What it means and
• What you can do with it
• …before you even look at
the file
23. Methods
• Examples of methods to document
– Code
– Survey
– Codebook
– Data dictionary
– Anything that lets someone reproduce your results
• Don’t forget the units!
24. Metadata
• Informal and formal description of data
• Informal standard can fit your unique research
• Benefits of a formal standard
– Completeness
– Aids in sharing
– Often required for deposit into a repository
• May be required by your funder
25. Metadata
• Tons of formal standards available across
many, many disciplines
• Consult
– Disciplinary repository
– Your peers
– Subject librarian
– Data Services Librarian
26. Metadata
• Decide on a metadata standard before you
collect the data!
– Easier to record metadata when collecting data
than to convert later
• Standard or no, keep metadata CONSISTENTLY
and COMPUTABLY whenever you can
27. Metadata Standard: Dublin Core
• contributor
• coverage
• creator
• date
• description
• format
• identifier
• language
• publisher
• relation
• rights
• source
• subject
• title
• type
28. Metadata Example
• Contributor
– Jane Collaborator
• Creator
– Kristin Briney
• Date
– 2013 Apr 15
• Description
– A microscopy image of
cancerous breast tissues
under 20x zoom. This image is
my control, so it has only the
standard staining describe on
2013 Feb 2 in my notebook.
• Format
– JPEG
• Identifier
– IMG00057.jpg
• Relation
– Same sample as images
IMG00056.jpg and
IMG00055.jpg
• Subject
– Breast cancer
• Title
– Cancerous breast tissue
control
29. Exercise: Documentation
• For your most common data type, make a list
of the most important information to record
for each dataset
31. A Note on Security
• Does your data fall under the following?
– HIPAA
• Health information
– FERPA
• Student information
– FISMA
• Government subcontractor
– Human subject research, etc.
Ask for help!
32. A Note on Security
• Secure storage
• Controlled access
• De-identification of personal information
• Security training
33. UWM Security Resources
• UWM Information Security Office
– Visit: https://www4.uwm.edu/itsecurity/
– Email: infosec@uwm.edu
• Certificate in Information Security
• HIPAA
– https://www4.uwm.edu/legal/hipaa/index.cfm
• FERPA
– http://www4.uwm.edu/academics/ferpa.cfm
34. Storage
• Library motto: Lots of Copies Keeps Stuff Safe!
• Rule of 3: 2 onsite, 1 offsite
• Storage run by experts is more reliable than
storage you run yourself
– It costs more, but that’s for a reason
36. Your Computer
• You’re using it, but should you keep data on it?
– What happens if you lose it?
– What happens if it is stolen?
– What happens if it breaks?
– Will the data stay there as long as you are required to
keep them?
• Don’t be disorganized
• Don’t keep sensitive data here
• Verdict: By itself it is not enough
37. USB/Flash Drive
• Pros
– Small, convenient package
– Big enough for a wide variety of datasets
• Cons
– Will you remember to back your data up onto it?
– Easy to lose
– Easy to perpetuate out-of-date copies
• Verdict: good for data transport, but not for
backup
38. CD-ROMs/DVD-ROMs
• Pros
– More reliable (but if one does fail, you won’t know
until it’s too late)
– Portable
• Cons
– Will you remember to back your data up onto it?
– Hassle to deal with
– Slow to write to
– Difficult to keep track of old copies
• Verdict: Not good for quick backup, and just okay
for periodic offsite backup
39. External Hard Drive
• Pros
– Relatively cheap
– Large storage capacity
• • Cons
– You have to set up, maintain, and audit it yourself
– Some brands are less reliable
– Disorganization a problem
• Verdict: Coupled with automatic-backup
software, an okay choice for onsite backup
– You’ll still want a second backup offsite
40. Shared Drives/Servers
• Pros
– Keeps data off your easily-stolen laptop
– Not your problem to manage
– Shared costs typically mean lower costs
• Cons
– Who’s managing the thing? Are they competent?
– Can have storage quotas
– Can be hard to get to outside the lab or the office
• Verdict: If well-managed, a good choice for regular use,
onsite, or offsite backup
– Beware the dusty Linux box under the desk!
41. Tape Backup
• Pros
– Can happen near-invisibly
– Highly reliable
– Tolerably secure (not always on network)
• Cons
– Can be hard or slow to get data back
– Not always audited as often as they should be
• Verdict: Good for onsite or offsite backup, if
somebody else is running them and you know
they’re regularly audited
42. Cloud Storage
• Pros
– Convenient syncing
– Cheap
– If client-side encryption is involved, decently secure
• Cons
– Required network connection
• Possible security risks and inconvenience if off-network
– Ongoing (and out of your control) costs
– Your backup is hostage to their business risks
– Reliability, security, privacy not guaranteed
• Verdict: For savvy shoppers, fine for offsite backup. A
little risky for your only backup.
43. Exercise: Storage
• Conduct a quick inventory of your data
– What datasets do you have?
– How big are they?
• Inventory where your files are currently
stored, including backups. How safe are your
data?
44. Backups
• Any backup is better than none
• Automatic backup is better than manual
• Your research is only as safe as your backup
plan
– Lots of horror stories here
45. Ideal Backup Characteristics
• Low effort
• High reliability
• As secure as necessary
– Tradeoffs between security and convenience
• As open as possible to collaborators
• Well organized
46. Check Your Backups
• Backups only as good as ability to recover data
• Test your backups periodically
– Preferably a fixed schedule
– 1 or 2 times a year may be enough
– Bigger/more complex data should be checked
more often
• Test your backup whenever you change things
47. A Final Note
• Must retain data at least 3 years post-project
per OMB Circular A-110
– Better to retain for >6 years
• Consider letting someone else worry about
this
– A disciplinary repository
– The UWM Digital Commons
48. Exercise: Backups
• Sketch out your ideal backup system, and
identify the first step in getting to there from
your current system.
50. Where to Go from Here
• Talk to your coworkers
– …but be aware you might not be able to change
things
– Discuss
• Common schemes for metadata and file naming
• Centralized documentation
• Robust backup
• Use good practices and be a model for others
51. UWM Resources
• Data management resources
– dataplan.uwm.edu
• Information Security Office
– www4.uwm.edu/itsecurity/
• Data Services Librarian
– Kristin Briney, briney@uwm.edu
52. Thank You!
• This presentation available under a Creative Commons
Attribution (CC-BY) license
• Some content courtesy of Dorothea Salo
– http://dsalo.info/
– http://www.graduateschool.uwm.edu/research/researcher
-central/proposal-development/data-plan/boot-camp/