Data Matters for AGU Early Career Conference
1. Data Matters
Tips & Tools for Better Research
Carly Strasser, California Digital Library
carlystrasser@gmail.com
AGU Student & Early Career Scientist Conference
14 Dec 2014
From Flickr by Lachlan Donald
2. Why are
you here?
Science: you’re (probably)
doing it wrong
8. Digital data
From Flickr by Flickmor
From Flickr by DW0825
From Flickr by US Army Environmental Command
C. Strasser
Courtesy of WHOI
From Flickr by deltaMike
20. Because reproducibility is one of
the fundamental tenets of science.
Because we need to be credible.
Because Fox News, creationism,
and the war on science.
21. “Help us identify grants that are wasteful
or that you don’t think are a good use of
taxpayer dollars.”
Rep. Adrian Smith (R-Nebraska), a member of the House Committee on Science
and Technology
22. Because reproducibility is one of
the fundamental tenets of science.
Because we need to be credible.
Because Fox News, creationism,
and the war on science
Because it means faster progress.
30. Feb 2013
… “Federal agencies investing in research and
development (more than $100 million in annual
expenditures) must have clear and coordinated
policies for increasing public access to
research products.”
32. From Flickr by Big Swede Guy
data management
Best
Practices
33. From Flickr by Mark Sardella
Plan before data collection
34. Planning
Design sample naming scheme
• Create a key (data dictionary)
• Make sure names are unique
• Define codes
From Flickr by zebbie
35. Planning
Design file naming scheme
Use descriptive file names
• Unique
• Reflect contents
From R Cook, ESA Best Practices Workshop 2010
Bad:
Mydata.xls
2001_data.csv
best version.txt
Better:
Eaffinis_nanaimo_2010_counts.xls
• Eaffinis: study organism
• nanaimo: site name
• 2010: year
• counts: what was measured
*Not for everyone
*
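A minimal sketch of the naming scheme above, written in Python (an assumption; the slides do not prescribe a language). The function name `make_file_name` and its parameters are hypothetical, and the rule that each part may contain only letters and digits is one possible convention, not a requirement from the slides.

```python
import re

def make_file_name(organism, site, year, measured, ext="csv"):
    """Build a descriptive name of the form organism_site_year_measured.ext."""
    parts = [organism, site, str(year), measured]
    for part in parts:
        # Assumed convention: letters and digits only, so names stay portable
        # and the underscore remains an unambiguous separator.
        if not re.fullmatch(r"[A-Za-z0-9]+", part):
            raise ValueError(f"invalid name part: {part!r}")
    return "_".join(parts) + "." + ext

print(make_file_name("Eaffinis", "nanaimo", 2010, "counts", ext="xls"))
# prints: Eaffinis_nanaimo_2010_counts.xls
```

Building names from a function rather than typing them by hand keeps every file in a project on the same pattern.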
36. Planning
Design file organization
Biodiversity
  Lake
    Experiments
      Biodiv_H20_heatExp_2005to2008.csv
      Biodiv_H20_predatorExp_2001to2003.csv
      …
    Field work
      Biodiv_H20_PlanktonCount_2001toActive.csv
      Biodiv_H20_ChlAprofiles_2003.csv
      …
  Grassland
Consider…
• Dependencies?
• File formats?
• Time of collection?
• Order of analysis?
From S. Hampton
37. Planning
Design your spreadsheet
Constrain entries
Atomize
Break down spreadsheets
From Flickr by Ulleskelf
38. Planning
Consider a database
A relational database is
A set of tables
Relationships among the tables
A language to specify & query the tables
An RDB provides
Scalability: millions+ records
Features for sub-setting, querying, sorting
Reduced redundancy & entry errors
From Mark Schildhauer
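As a rough illustration of the three ingredients listed above (tables, relationships among them, and a query language), here is a small sketch using Python's built-in `sqlite3` module. The table names, column names, and all values are made up for illustration and do not come from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database; nothing is written to disk
# SQLite only enforces foreign keys when this setting is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE site (
    site_id TEXT PRIMARY KEY,
    name    TEXT NOT NULL
);
CREATE TABLE sample_count (
    sample_id INTEGER PRIMARY KEY,
    site_id   TEXT NOT NULL REFERENCES site(site_id),
    year      INTEGER NOT NULL,
    n_counted INTEGER NOT NULL CHECK (n_counted >= 0)
);
""")

# The site name is stored once and referenced by its code (less redundancy).
conn.execute("INSERT INTO site VALUES ('NAN', 'Nanaimo')")
conn.executemany(
    "INSERT INTO sample_count (site_id, year, n_counted) VALUES (?, ?, ?)",
    [("NAN", 2010, 12), ("NAN", 2010, 30), ("NAN", 2011, 7)],
)

# Sub-setting, joining, and sorting are all expressed in the query language.
rows = conn.execute("""
    SELECT s.name, c.year, SUM(c.n_counted)
    FROM sample_count AS c JOIN site AS s ON s.site_id = c.site_id
    GROUP BY s.name, c.year
    ORDER BY c.year
""").fetchall()
print(rows)  # prints: [('Nanaimo', 2010, 42), ('Nanaimo', 2011, 7)]
```

The `CHECK` constraint is one way a database reduces entry errors: a negative count is rejected at insert time instead of being discovered during analysis.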
39. Pick a data repository
Store your data in a repository
Institutional archive
Discipline/specialty archive
From Flickr by torkildr
Planning
40. Pick a data repository
Ask a librarian
41. Pick a data repository
Repos of repos:
databib.org
re3data.org
42. Decide on preservation/backup
From Flickr by sepa synod
From Flickr by taberandrew
From Flickr by withassociates
Planning
43. Decide on preservation/backup
What software?
What hardware?
What personnel?
How often?
Set up reminders!
Test system
Planning
44. Planning
Write a data management plan!
…document that describes what you will do with your data throughout the research project
From Flickr by Barbies Land
45. Planning
DMP components
• What will be collected
• Methods
• Standards
• Metadata
• Sharing/access
• Long-term storage
But they all have different requirements and express them in different ways
From Flickr by Barbies Land
48. During collection
Keep raw data raw
Realistically:
• Archive .csv version of raw data
• Make a “raw” tab in working data file
• Do all work on other tabs
49. During collection
Keep raw data raw
Raw data as .csv → R script for processing & analysis
Ideally:
• Use scripts to process data
• Save them with data
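A minimal sketch of the "ideal" approach, assuming the raw data sits in a rectangular .csv file. The slide names an R script; Python is used here purely as a substitute, and the function names and the whitespace-stripping step are hypothetical examples of processing.

```python
import csv
import os
import stat

def lock_raw_file(path):
    """Mark the raw file read-only so it cannot be edited by accident."""
    os.chmod(path, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)

def process(raw_path, out_path):
    """Read the raw CSV and write a cleaned copy; the raw file is only ever read.

    Assumes every row has the same number of fields as the header.
    """
    with open(raw_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            # Example cleaning step (an assumption): strip stray whitespace.
            writer.writerow({key: (value or "").strip() for key, value in row.items()})
```

Because every change happens in the script, the processed file can be regenerated from the raw file at any time, and the script itself can be saved alongside the data.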
50. During collection
Document your workflow
Workflow: how you get from the raw data to the final products of your research
Simple workflow: flow chart
Temperature data + Salinity data
→ Data import into Excel
→ Data in spreadsheet
→ Quality control & data cleaning
→ “Clean” T & S data
→ Analysis: mean, SD
→ Summary statistics
→ Graph production
51. During collection
Document your workflow
Workflow: how you get from the raw data to the final products of your research
Commented script
• R, SAS, MATLAB…
• Well-documented code is
Easier to review
Easier to share
Easier to use for repeat analysis
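For illustration, a commented script covering the quality control and "mean, SD" steps of a simple workflow might look like the sketch below. It is written in Python rather than the languages the slide lists, the readings are made up, and the plausible range and the -999.0 error code are assumptions.

```python
import statistics

def quality_control(values, low, high):
    """Drop missing values and readings outside the plausible range [low, high]."""
    return [v for v in values if v is not None and low <= v <= high]

def summarize(values):
    """Return the mean and the sample standard deviation of the cleaned values."""
    return statistics.mean(values), statistics.stdev(values)

# Step 1: import. Made-up temperature readings; -999.0 stands for a sensor
# error code and None for a missing entry.
temperature = [12.0, 14.0, -999.0, 13.0, None]

# Step 2: quality control and cleaning (assumed plausible range in degrees C).
clean_t = quality_control(temperature, low=-2.0, high=40.0)

# Step 3: analysis (mean, SD).
mean_t, sd_t = summarize(clean_t)
print(clean_t, mean_t, sd_t)
```

Each comment records why a step exists, which is what makes the script usable for review, sharing, and repeat analysis.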
52. Constrain data entries
• Excel lists
• Data validation
• Google docs forms
Modified from K. Vanderbilt
During
collection
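The same idea as a spreadsheet pick-list or validation rule can be sketched in a script: check each entry against a fixed list of codes and an allowed range before accepting it. The field names, the code list, and the function name below are hypothetical.

```python
ALLOWED_SITES = {"lake", "grassland"}  # hypothetical code list

def validate_entry(entry):
    """Return a list of problems found in one data-entry row (empty if none)."""
    problems = []
    if entry.get("site") not in ALLOWED_SITES:
        problems.append(f"site must be one of {sorted(ALLOWED_SITES)}")
    try:
        count = int(entry.get("count", ""))
    except (TypeError, ValueError):
        problems.append("count must be a whole number")
    else:
        if count < 0:
            problems.append("count must not be negative")
    return problems

print(validate_entry({"site": "lake", "count": "3"}))    # prints: []
print(len(validate_entry({"site": "Lake ", "count": "-1"})))  # prints: 2
```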
54. During collection
Break down spreadsheets
Fake a relational database
Create parameter table
Create a site table
From doi:10.3334/ORNLDAAC/777
From R Cook, ESA Best Practices Workshop 2010
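One way to picture "faking" a relational database with plain tables: site details live in their own table, measurements refer to sites by code, and the two are joined only when needed. The sketch below uses made-up tables held in strings; in practice each would be its own sheet or .csv file, and the helper names are hypothetical.

```python
import csv
import io

# Made-up tables for illustration only.
SITE_TABLE = "site_id,site_name,latitude\nS1,Lake A,49.2\nS2,Lake B,48.9\n"
DATA_TABLE = "site_id,parameter,value\nS1,temp,12.5\nS2,temp,11.0\n"

def read_table(text):
    """Parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def join_on(rows, lookup_rows, key):
    """Attach the matching lookup row to each data row, like a database join."""
    lookup = {row[key]: row for row in lookup_rows}
    return [{**lookup[row[key]], **row} for row in rows]

joined = join_on(read_table(DATA_TABLE), read_table(SITE_TABLE), "site_id")
print(joined[0]["site_name"], joined[0]["value"])  # prints: Lake A 12.5
```

Site information is typed once in the site table, so a correction there reaches every measurement that refers to it.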
55. Metadata: data reporting
WHO created the data?
WHAT is the content
of the data set?
WHEN was it created?
WHERE was it collected?
HOW was it developed?
WHY was it developed?
From Flickr by //ichael Patric|{
During collection
Create metadata
56. During collection
Create metadata
Digital context
• Name of the data set
• The name(s) of the data file(s) in the
data set
• Date the data set was last modified
• Example data file records for each data
type file
• Pertinent companion files
• List of related or ancillary data sets
• Software (including version number)
used to prepare/read the data set
• Data processing that was performed
Personnel & stakeholders
• Who collected
• Who to contact with questions
• Funders
Scientific context
• Scientific reason why the data were
collected
• What data were collected
• What instruments (including model & serial
number) were used
• Environmental conditions during collection
• Temporal & spatial resolution
• Standards or calibrations used
Information about parameters
• How each was measured or produced
• Units of measure
• Format used in the data set
• Precision & accuracy if known
Information about data
• Definitions of codes used
• Quality assurance & control measures
• Known problems that limit data use (e.g.
uncertainty, sampling problems)
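As a small sketch of recording the who/what/when/where/how/why questions alongside a data file, the snippet below builds a record and serializes it as JSON. This is not a metadata standard; the field names, the function name, and every value are placeholders invented for illustration.

```python
import json

REQUIRED_FIELDS = ("who", "what", "when", "where", "how", "why")

def build_metadata(**fields):
    """Return a metadata record, refusing to build one with unanswered questions."""
    missing = [name for name in REQUIRED_FIELDS if not fields.get(name)]
    if missing:
        raise ValueError(f"missing metadata fields: {', '.join(missing)}")
    return {name: fields[name] for name in REQUIRED_FIELDS}

# Placeholder values only; none of these describe a real data set.
record = build_metadata(
    who="A. Researcher",
    what="Zooplankton counts per sample",
    when="2010",
    where="Example field site",
    how="Net tows, counted under a microscope",
    why="Example record for illustration",
)
text = json.dumps(record, indent=2)
print(text)
```

Saving such a record next to the data file is a lightweight starting point while the fuller items listed above are still being gathered.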
57. During collection
Create metadata
What is metadata?
Metadata standards…
• Provide structure to describe data
Common terms | definitions | language | structure
• Come in many flavors
EML, FGDC, ISO19115, DarwinCore,…
• Can be met using software tools
Morpho (EML), Metavist (FGDC), NOAA MERMaid (CSDGM)
58. Back up daily
During
collection
From Flickr by lippo
From Flickr by see phar
Original
Near
Far
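A minimal sketch of the original/near/far idea: copy a file to each backup location and confirm that every copy matches the original by checksum. The function names are hypothetical, and the destinations are assumed to be folders reachable from the local file system (for example an external drive and a mounted network share).

```python
import hashlib
import shutil
from pathlib import Path

def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def back_up(original, destinations):
    """Copy `original` into each destination folder and verify every copy."""
    original = Path(original)
    expected = sha256_of(original)
    copies = []
    for folder in destinations:
        folder = Path(folder)
        folder.mkdir(parents=True, exist_ok=True)
        copy = Path(shutil.copy2(original, folder / original.name))
        # A backup that was never checked cannot be trusted.
        if sha256_of(copy) != expected:
            raise IOError(f"backup verification failed for {copy}")
        copies.append(copy)
    return copies
```

A call such as `back_up("raw.csv", ["/near/backups", "/far/backups"])` (both paths hypothetical) could then be scheduled to run daily.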
59. During
collection
From Flickr by Barbies Land
Remember that data
management plan?
Revisit
Review
Revise
60. During
collection
Schedule a time each
week or month
Revisit
Review
Revise
From Flickr by purplemattfish