1. The document discusses tips and tools for data stewardship, including planning for data management, best practices for data collection and organization, documenting workflows, creating metadata, and sharing data.
2. It emphasizes writing a data management plan, keeping raw data separate and secure, using version control and backups, and revisiting plans periodically.
3. The document encourages learning skills for data management, using resources like libraries and repositories, and embracing changes that support more open and reproducible science.
20. Because reproducibility* is one of
the fundamental tenets of science.
*Reproducibility here means being able to go from data to figures/results,
not independent verification by following the same techniques.
23. Because reproducibility is one of
the fundamental tenets of science.
Because we need to be credible.
Because Fox News, creationism,
and the war on science.
24. “Help us identify grants that are wasteful
or that you don’t think are a good use of
taxpayer dollars.”
Rep. Adrian Smith (R-Nebraska), a member of the House Committee on Science
and Technology
25. Because reproducibility is one of
the fundamental tenets of science.
Because we need to be credible.
Because Fox News, creationism,
and the war on science.
Because it means faster progress.
33. … “Federal agencies investing in research and
development (more than $100 million in annual
expenditures) must have clear and coordinated
policies for increasing public access to
research products.”
Feb 2013
34. 1. Maximize free public access
2. Ensure researchers create data management plans
3. Allow costs for data preservation and access in proposal budgets
4. Ensure evaluation of data management plan merits
5. Ensure researchers comply with their data management plans
6. Promote data deposition into public repositories
7. Develop approaches for identification and attribution of datasets
8. Educate folks about data stewardship
From Flickr by Joe Crimmings Photography
40. Use descriptive file names
• Unique
• Reflect contents
From R Cook, ESA Best Practices Workshop 2010
Bad:
Mydata.xls
2001_data.csv
best version.txt
Better:
Eaffinis_nanaimo_2010_counts.xls
Name parts, in order: study organism, site name, year, what was measured
*Not for everyone
Planning
Design file naming scheme
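A naming scheme is easier to keep consistent when a small helper builds the names. The sketch below is only one possible approach, written in Python; the function name `build_file_name` and its parameters are hypothetical and simply mirror the four parts of the "Better" example above.

```python
import re


def build_file_name(organism: str, site: str, year: int, measured: str,
                    ext: str = "csv") -> str:
    """Build a name of the form organism_site_year_measured.ext."""
    parts = [organism, site, str(year), measured]
    # Strip everything except letters, digits, and hyphens so the
    # underscore is only ever used as the separator between parts.
    cleaned = [re.sub(r"[^A-Za-z0-9-]+", "", part) for part in parts]
    if not all(cleaned):
        raise ValueError("every name part needs at least one letter or digit")
    return "_".join(cleaned) + "." + ext.lstrip(".")


print(build_file_name("Eaffinis", "nanaimo", 2010, "counts", ext="xls"))
# prints Eaffinis_nanaimo_2010_counts.xls
```

Because spaces and punctuation are stripped from each part, "E. affinis" and "Eaffinis" produce the same name, which keeps file names unique and sortable.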
43. A relational database is
A set of tables
Relationships among the tables
A language to specify & query the tables
A RDB provides
Scalability: millions+ records
Features for sub-setting, querying, sorting
Reduced redundancy & entry errors
From Mark Schildhauer
Planning
Consider a database
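To make the three ingredients concrete (tables, relationships, a query language), here is a minimal sketch using SQLite through Python's standard `sqlite3` module. The table names, column names, and all values are placeholders invented for illustration, not data from the talk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")          # throwaway in-memory database
conn.execute("PRAGMA foreign_keys = ON")    # make SQLite enforce relationships

# A set of tables, plus a relationship: each observation points at one site.
conn.executescript("""
CREATE TABLE site (
    site_id INTEGER PRIMARY KEY,
    name    TEXT NOT NULL UNIQUE
);
CREATE TABLE observation (
    obs_id    INTEGER PRIMARY KEY,
    site_id   INTEGER NOT NULL REFERENCES site(site_id),
    year      INTEGER NOT NULL,
    n_counted INTEGER NOT NULL
);
""")

# Placeholder rows: the site name is stored once, not once per observation.
conn.executemany("INSERT INTO site (site_id, name) VALUES (?, ?)",
                 [(1, "site_a"), (2, "site_b")])
conn.executemany(
    "INSERT INTO observation (site_id, year, n_counted) VALUES (?, ?, ?)",
    [(1, 2010, 12), (1, 2011, 7), (2, 2010, 3)])


def totals_by_site(connection):
    """Query across both tables: total count per site, sorted by site name."""
    rows = connection.execute("""
        SELECT s.name, SUM(o.n_counted)
        FROM observation AS o
        JOIN site AS s ON s.site_id = o.site_id
        GROUP BY s.name
        ORDER BY s.name
    """)
    return rows.fetchall()


print(totals_by_site(conn))
# prints [('site_a', 19), ('site_b', 3)]
```

With foreign keys switched on, an observation that refers to a site that does not exist is rejected, which is one way a database reduces entry errors.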
44. You should invest time in learning databases if
your data sets are large or complex
Consider investing time in learning databases if
your data are small and humble
you ever intend to share your data
you are < 30 years old
Planning
From Mark Schildhauer
Consider a database
45. Store your data in a repository
Institutional archive
Discipline/specialty archive
Pick a data repository
From Flickr by torkildr
Ask a librarian
Repos of repos:
databib.org
re3data.org
Planning
46. From Flickr by sepasynod
From Flickr by taberandrew
From Flickr by withassociates
What software?
What hardware?
What personnel?
How often?
Set up reminders!
Test system
Decide on preservation/backup
Planning
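"Test system" can be as simple as checking that a backup copy is byte-for-byte identical to the original. The following Python sketch assumes a plain file copy to a second location; the function names and the choice of SHA-256 are illustrative assumptions, not a recommendation from the talk.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 checksum of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_and_verify(source: Path, backup_dir: Path) -> Path:
    """Copy `source` into `backup_dir` and confirm the copy matches."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / source.name
    shutil.copy2(source, target)   # copy2 also keeps timestamps
    if sha256_of(source) != sha256_of(target):
        raise OSError(f"backup of {source} does not match the original")
    return target
```

A scheduled job (the "How often?" question) could call `backup_and_verify` for each raw data file and report any failure.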
47. …document that
describes what you will
do with your data
throughout
the research project
From Flickr by Barbies Land
Write a data
management plan!
Planning
48. DMP components
• What will be collected
• Methods
• Standards
• Metadata
• Sharing/access
• Long-term storage
But they all have different requirements and express them in different ways
Planning
From Flickr by Barbies Land
49. Step-by-step wizard for generating DMP
create | edit | re-use | share
Free & open to community
dmptool.org
Planning
51. Realistically:
• Archive .csv version of raw data
• Make a “raw” tab in working data file
• Do all work on other tabs
During collection
Keep raw data raw
52. Raw data as .csv
R script for processing & analysis
During collection
Ideally:
• Use scripts to process data
• Save them with data
Keep raw data raw
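The slide suggests an R script; the same idea is sketched here in Python, purely as an illustration. The raw .csv is only ever read, and every change goes into a separate output file. The function names, the "drop rows with an empty value" rule, and the read-only step are assumptions made for the example.

```python
import csv
import os
import stat
from pathlib import Path


def lock_raw_file(raw_path: Path) -> None:
    """Remove write permission so the raw file is not edited by accident."""
    os.chmod(raw_path, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)


def process(raw_path: Path, out_path: Path, column: str) -> int:
    """Write rows that have a value in `column` to `out_path`.

    The raw file is opened for reading only. Returns the number of rows kept.
    """
    if raw_path.resolve() == out_path.resolve():
        raise ValueError("output must not overwrite the raw data file")
    kept = 0
    with raw_path.open(newline="") as src, \
            out_path.open("w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames,
                                extrasaction="ignore")
        writer.writeheader()
        for row in reader:
            if (row[column] or "").strip():   # skip rows with no value
                writer.writerow(row)
                kept += 1
    return kept
```

Saving a script like this next to the raw file means the processed file can always be regenerated from the untouched original.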
53. During collection
Document your workflow
Workflow: how you get from the raw data to the final
products of your research
Simple workflow: flow chart
Temperature data, salinity data → Data import into Excel → Data in spreadsheet → Quality control & data cleaning → “Clean” T & S data → Analysis: mean, SD → Summary statistics → Graph production
54. During collection
Workflow: how you get from the raw data to the final
products of your research
Simple workflow: commented script
• R, SAS, MATLAB…
• Well-documented code is
Easier to review
Easier to share
Easier to use for repeat analysis
Document your workflow
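As an illustration of a commented script, here is a short Python sketch of a workflow with three steps: import, quality control, summary statistics. The column name, the acceptable range, and the values are placeholders chosen for the example.

```python
import csv
import io
import statistics


def read_values(csv_text: str, column: str) -> list:
    """Step 1, import: read one numeric column, skipping blank cells."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [float(row[column]) for row in reader if row[column].strip()]


def quality_control(values: list, low: float, high: float) -> list:
    """Step 2, QC: keep only values inside the plausible range [low, high]."""
    return [v for v in values if low <= v <= high]


def summarize(values: list) -> dict:
    """Step 3, analysis: sample size, mean, and sample standard deviation."""
    return {"n": len(values),
            "mean": statistics.mean(values),
            "sd": statistics.stdev(values)}


# Placeholder input: -999 stands in for a missing-value code.
raw_text = "temperature\n10.0\n12.0\n-999\n14.0\n"

clean = quality_control(read_values(raw_text, "temperature"), low=-5, high=40)
print(summarize(clean))
# prints {'n': 3, 'mean': 12.0, 'sd': 2.0}
```

Each function carries a one-line comment naming the workflow step it performs, so the script doubles as documentation of the workflow.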
59. Create parameter table
From doi:10.3334/ORNLDAAC/777
From R Cook, ESA Best Practices Workshop 2010
During collection
Break down spreadsheets
Fake a relational database
Create a site table
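One way to picture "breaking down" a spreadsheet: site details that repeat on every row move into their own site table, and the data rows keep only a site ID. The Python sketch below assumes a flat list of rows; all field names and values are hypothetical placeholders.

```python
def split_flat_rows(flat_rows, site_fields):
    """Return (site_table, data_table); data rows refer to sites by site_id."""
    site_ids = {}        # tuple of site attribute values -> site_id
    site_table = []
    data_table = []
    for row in flat_rows:
        key = tuple(row[field] for field in site_fields)
        if key not in site_ids:
            # First time this site appears: give it the next ID.
            site_ids[key] = len(site_ids) + 1
            site_table.append({"site_id": site_ids[key],
                               **{f: row[f] for f in site_fields}})
        rest = {k: v for k, v in row.items() if k not in site_fields}
        data_table.append({"site_id": site_ids[key], **rest})
    return site_table, data_table


# Placeholder spreadsheet rows: site details repeat on every line.
flat_rows = [
    {"site_name": "site_a", "latitude": 0.0, "longitude": 0.0,
     "year": 2010, "n_counted": 12},
    {"site_name": "site_a", "latitude": 0.0, "longitude": 0.0,
     "year": 2011, "n_counted": 7},
    {"site_name": "site_b", "latitude": 1.0, "longitude": 1.0,
     "year": 2010, "n_counted": 3},
]

site_table, data_table = split_flat_rows(
    flat_rows, ["site_name", "latitude", "longitude"])

# A parameter table describes each data column once (placeholder entries).
parameter_table = [
    {"parameter": "year", "units": "calendar year",
     "description": "Year of the survey"},
    {"parameter": "n_counted", "units": "individuals",
     "description": "Number of organisms counted"},
]
```

Each of the three tables can then be saved as its own .csv file, which gives much of the benefit of a relational database without leaving spreadsheets.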
61. Metadata: data reporting
WHO created the data?
WHAT is the content
of the data set?
WHEN was it created?
WHERE was it collected?
HOW was it developed?
WHY was it developed?
From Flickr by //ichaelPatric|{
During collection
Create metadata
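The six questions above can be captured in a small machine-readable record stored next to the data. The Python sketch below is a minimal illustration and does not follow any formal metadata standard; the field names follow the six questions, and every value is a placeholder.

```python
import json

REQUIRED_FIELDS = ("who", "what", "when", "where", "how", "why")


def missing_fields(record: dict) -> list:
    """List the required questions that have no answer yet."""
    return [field for field in REQUIRED_FIELDS
            if not str(record.get(field, "")).strip()]


def to_json(record: dict) -> str:
    """Serialize the record, refusing to write an incomplete one."""
    missing = missing_fields(record)
    if missing:
        raise ValueError("metadata is missing: " + ", ".join(missing))
    return json.dumps(record, indent=2, sort_keys=True)


# Placeholder answers; replace each with the real details of the data set.
record = {
    "who": "name of the person who created the data",
    "what": "short description of the data set contents",
    "when": "date or date range of creation",
    "where": "collection location",
    "how": "methods and instruments used",
    "why": "scientific reason the data were collected",
}

print(to_json(record))
```

Writing the output to a file such as `metadata.json` beside the data keeps the description and the data together.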
62. Digital context
• Name of the data set
• The name(s) of the data file(s) in the
data set
• Date the data set was last modified
• Example data file records for each data
type file
• Pertinent companion files
• List of related or ancillary data sets
• Software (including version number)
used to prepare/read the data set
• Data processing that was performed
Personnel & stakeholders
• Who collected
• Who to contact with questions
• Funders
Scientific context
• Scientific reason why the data were
collected
• What data were collected
• What instruments (including model & serial
number) were used
• Environmental conditions during collection
• Temporal & spatial resolution
• Standards or calibrations used
Information about parameters
• How each was measured or produced
• Units of measure
• Format used in the data set
• Precision & accuracy if known
Information about data
• Definitions of codes used
• Quality assurance & control measures
• Known problems that limit data use (e.g.
uncertainty, sampling problems)
During collection
Create metadata
63. What is metadata?
Metadata standards…
• Provide structure to describe data
Common terms | definitions | language | structure
• Come in many flavors
EML, FGDC, ISO 19115, DarwinCore,…
• Can be met using software tools
Morpho (EML), Metavist (FGDC), NOAA MERMaid (CSDGM)
During collection
Create metadata