Data Management for Scientists: Workshop at Ocean Sciences 2012
1. Data Management for Scientists
Reduce your workload
Reuse your ideas
Recycle your data
www.oddee.com
Carly Strasser, PhD
California Digital Library, UC Office of the President
Ocean Sciences Meeting, February 2012
carly.strasser@ucop.edu
www.carlystrasser.net
2. Roadmap
  1. Background
  2. Data management landscape
  3. How to improve
  4. Toolbox
3. NSF-funded DataNet Project
Office of Cyberinfrastructure
Cyberinfrastructure
Community Engagement & Outreach
From Flickr by wetwebwork
Courtesy of DataONE
4. What role can libraries play in data education?
What barriers to sharing can we eliminate?
Why don’t people share data?
Is data management being taught?
Do attitudes about sharing differ among disciplines?
How can we promote storing data in repositories?
5.
6. Roadmap
  1. Background
  2. Data management landscape
  3. How to improve
  4. Toolbox
7. From Flickr by DW0825
From Flickr by Flickmor
From Flickr by deltaMike
Digital data
www.woodrow.org
C. Strasser
Courtesy of WHOI
From Flickr by US Army Environmental Command
9. Data
Models
Maximum Likelihood estimation
Matrix Models
Images
Tables
Paper
11. UGLY TRUTH
Many Earth | Environmental | Ecological scientists…
• are not taught data management
• don’t know what metadata are
• can’t name data centers or repositories
• don’t share data publicly or store it in an archive
• aren’t convinced they should share data
5shortessays.blogspot.com
12. Data Hangover
What happened?
From Flickr by SteveMcN
13. Where data end up
From Flickr by diylibrarian
www
blog.order2disorder.com
From Flickr by csessums
Data
Metadata
From Flickr by csessums
Recreated from Klump et al. 2006
14. Who cares?
From Flickr by Redden-McAllister
From Flickr by AJC1
www.rba.gov.au
15. Where data end up
From Flickr by diylibrarian
www
Data
www
Metadata
From Flickr by torkildr
Recreated from Klump et al. 2006
17. Trends in Data Archiving
Journal publishers
• Joint Data Archiving Agreement
• Data Papers: Ecological Archives, Beyond the PDF, etc.
Funders
• Data management requirements
18. Roadmap
  1. Background
  2. Data management landscape
  3. Best practices
  4. Toolbox
19. Best Practices for Data Management
  1. Planning
  2. Data collection & organization
  3. Quality control & assurance
  4. Metadata
  5. Workflows
  6. Data stewardship & reuse
  7. Planning
24. 2. Data collection & organization
Create unique identifiers
• Decide on naming scheme early
• Create a key
• Different for each sample
From Flickr by zebbie
From Flickr by sjbresnahan
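A minimal sketch of the three points above, assuming a made-up naming scheme of site code, year, and sequence number. The function name `make_sample_ids`, the `ID_KEY` dictionary, and the site code "NAN" are hypothetical and not part of the original slides.

```python
def make_sample_ids(site_code, year, n_samples):
    """Return n_samples identifiers such as 'NAN-2010-001' (hypothetical scheme)."""
    return [f"{site_code}-{year}-{i:03d}" for i in range(1, n_samples + 1)]

# The key documents what each part of an identifier means.
ID_KEY = {
    "site_code": "three-letter site abbreviation",
    "year": "four-digit collection year",
    "sequence": "sample number within site and year, zero-padded to three digits",
}

ids = make_sample_ids("NAN", 2010, 3)

# Every sample must have a different identifier.
assert len(ids) == len(set(ids))
print(ids)
```

Deciding the scheme before sampling starts means the same function can label field sheets, vials, and data rows consistently.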
25. 2. Data collection & organization
Standardize
• Consistent within columns
  – only numbers, dates, or text
• Consistent names, codes, formats
Modified from K. Vanderbilt
From Pink Floyd, The Wall
themurkyfringe.com
26. 2. Data collection & organization
Standardize
• Reduce possibility of manual error by constraining entry choices
Excel lists
Data validation
Google Docs Forms
Modified from K. Vanderbilt
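Outside of Excel or Google Docs, the same idea of constraining entry choices can be sketched in a short script. This is an illustration only: the column names, the allowed values, and the function `validate_row` are assumptions made for the example.

```python
ALLOWED_SITES = {"Nanaimo", "Victoria"}   # hypothetical controlled list of sites
ALLOWED_FLOWERING = {"Y", "N"}            # hypothetical code list

def validate_row(row):
    """Return a list of problems found in one data-entry row (a dict of strings)."""
    problems = []
    if row["site"] not in ALLOWED_SITES:
        problems.append(f"unknown site: {row['site']!r}")
    if row["flowering"] not in ALLOWED_FLOWERING:
        problems.append(f"flowering must be Y or N, got {row['flowering']!r}")
    try:
        float(row["height"])
    except ValueError:
        problems.append(f"height is not a number: {row['height']!r}")
    return problems

good = {"site": "Nanaimo", "flowering": "Y", "height": "12.5"}
bad = {"site": "nanaimo ", "flowering": "yes", "height": "tall"}

print(validate_row(good))
print(validate_row(bad))
```

A row is accepted only when the returned list is empty, so inconsistent spellings and codes are caught at entry time rather than during analysis.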
27. 2. Data collection & organization
Identify missing data
• Numeric fields: distinct value (e.g. 9999)
• Text fields: NULL or NA
• Use data flags in a separate column to qualify empty cells
M1 = missing; no sample collected
E1 = estimated from grab sample
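A small sketch of how a missing-value code and a separate flag column might be handled when the data are read back in. The column names and temperature values are invented for the example; the code 9999 and the flags M1 and E1 are the ones listed above.

```python
import csv
import io

MISSING = "9999"  # distinct value marking missing numeric data
FLAGS = {"M1": "missing; no sample collected", "E1": "estimated from grab sample"}

# Stand-in for a raw .csv file (made-up values).
raw = io.StringIO(
    "sample_id,temperature,flag\n"
    "S1,12.4,\n"
    "S2,9999,M1\n"
    "S3,11.0,E1\n"
)

values = []
notes = {}
for row in csv.DictReader(raw):
    if row["temperature"] == MISSING:
        value = None  # never let the placeholder enter a calculation
    else:
        value = float(row["temperature"])
    values.append(value)
    if row["flag"]:
        notes[row["sample_id"]] = FLAGS[row["flag"]]

present = [v for v in values if v is not None]
mean_temperature = sum(present) / len(present)
print(mean_temperature, notes)
```

Translating the code to `None` on import keeps 9999 from being averaged in as if it were a real measurement.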
28. 2. Data collection & organization
Create parameter table
Create a site table
From doi:10.3334/ORNLDAAC/777
From doi:10.3334/ORNLDAAC/777
From R Cook, ESA Best Practices Workshop 2010
29. 2. Data collection & organization
SPREADSHEETS: THE GOOD
Quick on the draw: Clickety-click and you’re ready to fire
Always there in time: Everyone has Excel
Smarter than he lets on: Stats, Pivot tables, VB scripts
Cleans up real pretty: Graphics, fonts, colors, borders
From Mark Schildhauer
30. 2. Data collection & organization
SPREADSHEETS: THE BAD
Shoot first ask later: Click&fire Click&fire Click&fire
No scruples: Delete row, click&fire, ctrl-x/ctrl-c, click&fire, re-sort, save
Talks a good story but not much education: Stats
From Mark Schildhauer
31. 2. Data collection & organization
SPREADSHEETS: THE UGLY
Ill-mannered: Takes data prisoner; conflates raw and summary data
Gaudy: Use of visual cues as metadata: color, font, border
Shifty: Cross-linking worksheets sets up “invisible” dependencies
Shiftless: No provenance
The more complicated your spreadsheet, the uglier it gets for use with other software
From Mark Schildhauer
32. 2. Data collection & organization
All of the things that make Excel great for data are bad for archiving!
  1. Create archive-ready raw data
  2. Put it somewhere special
  3. Have your fun with fancy Excel techniques
  4. Keep archiving in mind
33. 2. Data collection & organization
What about databases?
A relational database is
• A set of tables
• Relationships among the tables
• A language to specify & query the tables
From Mark Schildhauer
34. 2. Data collection & organization
Sample sites: *siteID, site_name, latitude, longitude, description
Samples: *sampleID, siteID, sample_date, speciesID, height, flowering, flag, comments
Species: *speciesID, species_name, common_name, family, order
* Denotes the primary key
From Mark Schildhauer
35. 2. Data collection & organization
Databases often enforce good practice
Must define
• Tables
• Attributes
• Relationships (constraints)
Databases provide:
• Scalability: millions+ records
• Features for sub-setting, querying, sorting
• Scripted language: SQL
• Reduced redundancy & potential data entry errors
From Mark Schildhauer
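To make the ideas of tables, attributes, and constraints concrete, here is a hedged sketch using Python's built-in `sqlite3` module. The table and column names follow the sites, samples, and species layout from the earlier slide, trimmed to a few columns; the inserted rows ("Lake A", "Species A") are invented, and SQLite is simply one convenient choice of database, not one named in the talk.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
con.executescript("""
CREATE TABLE sites (
    siteID INTEGER PRIMARY KEY,
    site_name TEXT NOT NULL,
    latitude REAL,
    longitude REAL
);
CREATE TABLE species (
    speciesID INTEGER PRIMARY KEY,
    species_name TEXT NOT NULL
);
CREATE TABLE samples (
    sampleID INTEGER PRIMARY KEY,
    siteID INTEGER NOT NULL REFERENCES sites(siteID),
    speciesID INTEGER NOT NULL REFERENCES species(speciesID),
    sample_date TEXT,
    height REAL
);
""")
con.execute("INSERT INTO sites VALUES (1, 'Lake A', 49.2, -123.9)")
con.execute("INSERT INTO species VALUES (1, 'Species A')")
con.execute("INSERT INTO samples VALUES (1, 1, 1, '2010-06-01', 1.2)")

# A sample that points at a non-existent site violates the relationship.
try:
    con.execute("INSERT INTO samples VALUES (2, 99, 1, '2010-06-02', 1.4)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# SQL joins the tables back together for sub-setting and querying.
rows = con.execute("""
    SELECT s.sampleID, si.site_name, sp.species_name
    FROM samples s
    JOIN sites si ON si.siteID = s.siteID
    JOIN species sp ON sp.speciesID = s.speciesID
""").fetchall()
print(rejected, rows)
```

Because the site name is stored once in `sites` and referenced by ID, it cannot be misspelled differently in each sample row, which is the reduced redundancy the slide refers to.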
36. 2. Data collection & organization
Spreadsheets
• Good for simple, self-contained charts, graphs, calculations
• Handy for collecting raw data
• Flexible cell content type
But…
• Hard to subset or sort
• Lack “record” integrity: can sort a column independently of all others
• Harder to maintain as complexity and size of data grows
Databases
• Works well with lots of data
• Easy to query and subset data
• Data fields are constrained
• Columns cannot be sorted independently of each other
• Normalization reduces data entry and potential for error
But…
• More to learn
• Harder to use
From Mark Schildhauer
37. 2. Data collection & organization
Invest time in learning databases if your data sets are large or complex
Consider investing time in learning databases if…
• your data are small and humble
• you ever intend to share your data
• you are < 30 years old
www.top20training.com
From Mark Schildhauer
38. 2. Data collection & organization
Use descriptive file names
PhDcomics.com
39. 2. Data collection & organization
Use descriptive file names*
• Unique
• Reflect contents
Bad: Mydata.xls, 2001_data.csv, best version.txt
Better: Eaffinis_nanaimo_2010_counts.xls
(Study organism, Site name, Year, What was measured)
*Not for everyone
From R Cook, ESA Best Practices Workshop 2010
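One way to keep file names consistent is to build them from their parts rather than typing them by hand. The helper below is a hypothetical sketch (the function `descriptive_name` is not from the slides); it reproduces the "Better" example from organism, site, year, and measurement.

```python
import re

def descriptive_name(organism, site, year, measured, ext="csv"):
    """Join the four descriptive parts with underscores, dropping spaces and punctuation."""
    parts = [organism, site, str(year), measured]
    cleaned = [re.sub(r"[^A-Za-z0-9]+", "", part) for part in parts]
    if not all(cleaned):
        raise ValueError("every part of the name must contain letters or digits")
    return "_".join(cleaned) + "." + ext

name = descriptive_name("Eaffinis", "nanaimo", 2010, "counts", ext="xls")
print(name)
```

Stripping spaces avoids names like "best version.txt", which are awkward to handle in scripts and on the command line.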
40. 2. Data collection & organization
Organize files logically
Biodiversity
  Lake
    Experiments
      Biodiv_H20_heatExp_2005to2008.csv
      Biodiv_H20_predatorExp_2001to2003.csv
      …
    Field work
      Biodiv_H20_PlanktonCount_2001toActive.csv
      Biodiv_H20_ChlAprofiles_2003.csv
      …
  Grassland
From S. Hampton
41. 2. Data collection & organization
Preserve information
• Keep raw data raw
• Use scripts to process data & save them with data
Raw data as .csv
R script for processing & analysis
42. Best Practices for Data Management
  1. Planning
  2. Data collection & organization
  3. Quality control & assurance
  4. Metadata
  5. Workflows
  6. Data stewardship & reuse
  7. Planning
43. 3. Quality control and quality assurance
Before data collection
• Define & enforce standards
• Assign responsibility for data quality
From Flickr by StacieBee
44. 3. Quality control and quality assurance
During data collection/entry
• Minimize manual entry
• Use double entry
• Use text-to-speech program to read data back
• Use a database
• Document changes
From Flickr by schock
45. 3. Quality control and quality assurance
After data entry
• Check for missing, impossible, anomalous values
• Perform statistical summaries
• Look for outliers
  – Normal probability plots
  – Regression
  – Scatter plots
  – Maps
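The checks listed for after data entry can be sketched with the standard library alone. The temperature values, the plausible range of -2 to 40, and the 1.5 × interquartile-range rule for outliers are all assumptions chosen for the example, not rules given in the talk.

```python
import statistics

temperatures = [12.1, 11.8, 12.4, 11.9, 55.0, 12.0, None, 12.3]  # made-up values

# Missing values: record where they occur instead of silently dropping them.
missing = [i for i, t in enumerate(temperatures) if t is None]
present = [t for t in temperatures if t is not None]

# Impossible values: outside a plausible range for this (hypothetical) variable.
impossible = [t for t in present if not (-2.0 <= t <= 40.0)]

# Outliers: more than 1.5 interquartile ranges beyond the quartiles.
q1, _, q3 = statistics.quantiles(present, n=4)
iqr = q3 - q1
outliers = [t for t in present if t < q1 - 1.5 * iqr or t > q3 + 1.5 * iqr]

# A simple statistical summary.
summary = {
    "n": len(present),
    "median": statistics.median(present),
    "missing": len(missing),
}
print(summary, impossible, outliers)
```

A flagged value is a prompt to go back to the field sheet or instrument log, not a licence to delete the observation.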
46. Best Practices for Data Management
  1. Planning
  2. Data collection & organization
  3. Quality control & assurance
  4. Metadata
  5. Workflows
  6. Data stewardship & reuse
  7. Planning
48. 4. Metadata basics
Metadata = Data reporting
WHO created the data?
WHAT is the content of the data set?
WHEN was it created?
WHERE was it collected?
HOW was it developed?
WHY was it developed?
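As a rough illustration, the six questions can be captured as a small structured record and checked for completeness before a data set is shared. The field values and the helper `missing_fields` are invented for this sketch, and plain JSON is used only for simplicity; it is not one of the formal metadata standards.

```python
import json

REQUIRED = ["who", "what", "when", "where", "how", "why"]

# Hypothetical answers for an imaginary data set.
metadata = {
    "who": "A. Researcher (hypothetical)",
    "what": "Zooplankton counts per sample",
    "when": "2010-06-01 to 2010-08-31",
    "where": "Example Lake",
    "how": "Vertical net tows, counted under a microscope",
    "why": "To estimate seasonal abundance",
}

def missing_fields(record):
    """Return the required questions that have no answer yet."""
    return [key for key in REQUIRED if not str(record.get(key, "")).strip()]

metadata_text = json.dumps(metadata, indent=2)
print(missing_fields(metadata))
print(metadata_text)
```

Saving the text next to the data file means the answers travel with the data rather than living only in someone's memory.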
49. 4. Metadata basics
• Scientific context
  – Scientific reason why the data were collected
  – What data were collected
  – What instruments (including model & serial number) were used
  – Environmental conditions during collection
  – Where collected & spatial resolution
  – When collected & temporal resolution
  – Standards or calibrations used
• Information about parameters
  – How each was measured or produced
  – Units of measure
  – Format used in the data set
  – Precision & accuracy if known
• Information about data
  – Definitions of codes used
  – Quality assurance & control measures
  – Known problems that limit data use (e.g. uncertainty, sampling problems)
  – How to cite the data set
• Digital context
  – Name of the data set
  – The name(s) of the data file(s) in the data set
  – Date the data set was last modified
  – Example data file records for each data type file
  – Pertinent companion files
  – List of related or ancillary data sets
  – Software (including version number) used to prepare/read the data set
  – Data processing that was performed
• Personnel & stakeholders
  – Who collected
  – Who to contact with questions
  – Funders
50. 4. Metadata basics
What is metadata?
Select the appropriate metadata standard
• Provides structure to describe data
  Common terms | definitions | language | structure
• Lots of different standards
  EML, FGDC, ISO19115, DarwinCore,…
• Tools for creating metadata files
  Morpho (EML), Metavist (FGDC), NOAA MERMaid (CSDGM)
52. Best Practices for Data Management
  1. Planning
  2. Data collection & organization
  3. Quality control & assurance
  4. Metadata
  5. Workflows
  6. Data stewardship & reuse
  7. Planning
53. 5. Workflows
Workflow: how you get from the raw data to the final products of your research
Simple workflows: flow charts
Temperature data, Salinity data → Data import into R → Data in R format → Quality control & data cleaning → “Clean” T & S data → Analysis: mean, SD → Summary statistics → Graph production
54. 5. Workflows
Workflow: how you get from the raw data to the final products of your research
Simple workflows: commented scripts
• R, SAS, MATLAB
• Well-documented code is…
  – Easier to review
  – Easier to share
  – Easier to repeat analysis
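A commented script can double as a simple workflow record. The sketch below follows the temperature and salinity steps (import, quality control, mean and SD) in Python rather than the R, SAS, or MATLAB named on the slide; all values and the plausible ranges are made up for illustration.

```python
import statistics

# Step 1: import the raw data (made-up values; None marks a missing reading).
raw_temperature = [12.1, 11.8, None, 12.4, 99.9]
raw_salinity = [30.2, 30.5, 30.1, None, 30.4]

# Step 2: quality control and cleaning. The raw lists are left untouched.
def clean(values, low, high):
    """Keep readings that are present and inside the plausible range."""
    return [v for v in values if v is not None and low <= v <= high]

clean_temperature = clean(raw_temperature, -2.0, 40.0)
clean_salinity = clean(raw_salinity, 0.0, 42.0)

# Step 3: analysis (mean and sample standard deviation).
summary = {
    name: (statistics.mean(vals), statistics.stdev(vals))
    for name, vals in [("temperature", clean_temperature), ("salinity", clean_salinity)]
}

# Step 4: report the summary statistics.
for name, (mean, sd) in summary.items():
    print(f"{name}: mean={mean:.2f}, sd={sd:.2f}")
```

Each numbered comment corresponds to one box of a flow chart, so the script and the diagram describe the same path from raw data to results.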
56. 5. Workflows
Workflows enable
Reproducibility: can someone independently validate findings?
Transparency: others can understand how you arrived at your results
Executability: others can re-run or re-use your analysis
From Flickr by merlinprincesse
57. 5. Workflows
Minimally: document your analysis
commented code; simple flow-chart
www.littlebytesoflife.com
Emerging workflow applications will…
− Link software for executable end-to-end analysis
− Provide detailed info about data & analysis
− Facilitate re-use & refinement of complex, multi-step analyses
− Enable efficient swapping of alternative models & algorithms
− Help automate tedious tasks
58. Best Practices for Data Management
  1. Planning
  2. Data collection & organization
  3. Quality control & assurance
  4. Metadata
  5. Workflows
  6. Data stewardship & reuse
  7. Planning
59. 6. Data stewardship & reuse
The 20-Year Rule
The metadata accompanying a data set should be written for a user 20 years into the future
(National Research Council 1991)
From Flickr by greensambaman
60. 6. Data stewardship & reuse
Use stable formats: csv, txt, tiff
Create back-up copies: original, near, far
Periodically test ability to restore information
Modified from R. Cook
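One hedged way to test that a back-up can be restored intact is to compare checksums of the original and the copy. The file names and the helper `sha256_of` are invented for the example, and a temporary directory stands in for the real "near" or "far" storage location.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path

def sha256_of(path):
    """Return the SHA-256 checksum of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for the original raw data file.
    original = Path(tmp) / "raw_data.csv"
    original.write_text("sample_id,temperature\nS1,12.4\n")

    # Stand-in for the back-up location.
    backup_dir = Path(tmp) / "backup"
    backup_dir.mkdir()
    backup = Path(shutil.copy2(original, backup_dir))

    # The copy is trustworthy only if its checksum matches the original's.
    backup_ok = sha256_of(original) == sha256_of(backup)

print(backup_ok)
```

Recording the checksum alongside the data also lets a future user confirm that the file has not changed since it was archived.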
61. 6. Data stewardship & reuse
Store your data in a repository
• Institutional archive
• Discipline/specialty archive
DataCite list of repositories: www.datacite.org/repolist
From Flickr by torkildr
62. 6. Data stewardship & reuse
Data Citation
• Allows readers to find data products
• Get credit for data and publications
• Promotes reproducibility
• Better measure of research impact
Example: Sidlauskas, B. 2007. Data from: Testing for unequal rates of morphological diversification in the absence of a detailed phylogeny: a case study from characiform fishes. Dryad Digital Repository. doi:10.5061/dryad.20
Learn more at www.datacite.org
Modified from R. Cook
63. Best Practices for Data Management
  1. Planning
  2. Data collection & organization
  3. Quality control & assurance
  4. Metadata
  5. Workflows
  6. Data stewardship & reuse
  7. Planning & data management plans in particular
64. 1. Planning
What is a data management plan?
A document that describes what you will do with your data during your research and after you complete your research
Data Hangover
65. 1. Planning
Why should I prepare a DMP?
• Saves time
• Increases efficiency
• Easier to use data
• Others can understand & use data
• Credit for data products
• Funders require it
66. NSF DMP Requirements
From Grant Proposal Guidelines:
DMP supplement may include:
  1. the types of data, samples, physical collections, software, curriculum materials, and other materials to be produced in the course of the project
  2. the standards to be used for data and metadata format and content (where existing standards are absent or deemed inadequate, this should be documented along with any proposed solutions or remedies)
  3. policies for access and sharing including provisions for appropriate protection of privacy, confidentiality, security, intellectual property, or other rights or requirements
  4. policies and provisions for re-use, re-distribution, and the production of derivatives
  5. plans for archiving data, samples, and other research products, and for preservation of access to them
67. 1. Types of data & other information
• Types of data produced
• Relationship to existing data
• How/when/where will the data be captured or created?
• How will the data be processed?
• Quality assurance & quality control measures
• Security: version control, backing up
• Who will be responsible for data management during/after project?
C. Strasser
biology.kenyon.edu
From Flickr by Lazurite
68. 2. Data & metadata standards
• What metadata are needed to make the data meaningful?
• How will you create or capture these metadata?
• Why have you chosen particular standards and approaches for metadata?
Wired.com
69. 3. Policies for access & sharing
4. Policies for re-use & re-distribution
• Are you under any obligation to share data?
• How, when, & where will you make the data available?
• What is the process for gaining access to the data?
• Who owns the copyright and/or intellectual property?
• Will you retain rights before opening data to wider use? How long?
• Are permission restrictions necessary?
• Embargo periods for political/commercial/patent reasons?
• Ethical and privacy issues?
• Who are the foreseeable data users?
• How should your data be cited?
70. 5. Plans for archiving & preservation
• What data will be preserved for the long term? For how long?
• Where will data be preserved?
• What data transformations need to occur before preservation?
• What metadata will be submitted alongside the datasets?
• Who will be responsible for preparing data for preservation? Who will be the main contact person for the archived data?
From Flickr by theManWhoSurfedTooMuch
71. Don’t forget: Budget
• Costs of data preparation & documentation
  – Hardware, software
  – Personnel
  – Archive fees
• How costs will be paid
  – Request funding!
dorrvs.com
72. NSF’s Vision*
DMPs and their evaluation will grow & change over time (similar to broader impacts)
Peer review will determine next steps
Community-driven guidelines
  – Different disciplines have different definitions of acceptable data sharing
  – Flexibility at the directorate and division levels
  – Tailor implementation of DMP requirement
Evaluation will vary with directorate, division, & program officer
*Unofficially
Help from Jennifer Schopf, NSF
73. Roadmap
  1. Background
  2. Data management landscape
  3. Best practices
  4. Toolbox
75. DMPTool: dmp.cdlib.org
Step-by-step wizard for generating DMP
Create | edit | re-use | share | save | generate
Open to community
Links to institutional resources
Directorate information & updates
76. CDL Services: www.cdlib.org/services/uc3
Data Repository
Deposit | Manage | Share | Preserve
• Precise identification of a dataset
• Credit to data producers and data publishers
• A link from the traditional literature to the data
• Research metrics for datasets
Example: Sidlauskas, B. 2007. Data from: Testing for unequal rates of morphological diversification in the absence of a detailed phylogeny: a case study from characiform fishes. Dryad Digital Repository. doi:10.5061/dryad.20
77. Why are you promoting Excel?
• Open source add-in
• Facilitate data management, sharing, archiving for scientists
• Focus on atmospheric, ecological, hydrological, and oceanographic data
• Collecting requirements for add-in from scientists, data centers, libraries
Funders: Gordon and Betty Moore Foundation, Microsoft Research
78. Why are you promoting Excel?
Everyone uses it
Stopgap measure
Funders: Gordon and Betty Moore Foundation, Microsoft Research
79. www.dataone.org
• Data Education Tutorials
• Database of best practices & software tools
• Links to DMPTool
• Primer on data management
From Flickr by Robert Hruzek
82. Handy References
Best Practices for Preparing Environmental Data Sets to Share and Archive. September 2010. Hook, Santhana Vannan, Beaty, Cook, & Wilson. http://daac.ornl.gov/PI/BestPractices-2010.pdf
Some Simple Guidelines for Effective Data Management. Borer, Seabloom, Jones, & Schildhauer. Bull Ecol Soc Amer, April 2009: 205-214.