This document discusses data management practices for researchers. It defines what constitutes data, such as observations, experiments, simulations, and documents. It outlines the roles of librarians in advising on data management plans, metadata practices, and archiving data. It also discusses why data management is important for validation, replication of research, and compliance with funder requirements. The document provides examples of file structures, naming conventions, metadata, codebooks, and archiving data in institutional repositories to facilitate long-term access and reuse of research data.
3. What is (are) Data?
Observations
Sensor data, telemetry, survey data, sample data
Experiments
Gene sequences, chromatograms
Simulations
Economic models
Derivations/Compilations
Text mining, data from public documents
Documents & texts themselves = data
Research Process
Observational conditions, experimental procedure,
instrumentation, label descriptions, units, metadata
4. Librarian Roles & Data
Advisory
Original data:
Consult on creating DMP
Consult on data organization, methodology, etc.
Consult on metadata practices
Consult on archiving
Help disseminate research
Journal publication, OA resources, blogs, etc.
Deposit into repository (IR, 3rd party, etc.)
Secondary data:
Consult on methodology / analysis
Discovery…
Curatorial
Manage IR (institutional repository)
Create metadata for datasets
Purchase / catalog / discovery for secondary data
5. What is Data Management?
Planning for the short-term and long-term:
care of and
access to
…your data.
Or: What are you going to do with that data?
How will you describe it?
How are you organizing it?
After you’re done, where will you put it?
How will you/others be able to access it? For how long?
6. Data Management:
Why Does it Matter?
Grant requirements
Public access to funded research
Validation
Replication
Re-use, continue research
Teaching
Natural disasters
Computer failure or theft
USB/hard drive failure or loss
Files corrupted
7. Funding
Requirements
NSF:
Proposals must include a supplementary document of no
more than two pages labeled “Data Management Plan”
…describe how the proposal will conform to NSF policy
on the dissemination and sharing of research results.
NIH:
The NIH expects and supports the timely release and
sharing of final research data… for use by other
researchers. …expected to include a plan for data
sharing or state why data sharing is not possible.
NEH Office of Digital Humanities
NOAA
IMLS
NIJ
8. DMP Considerations
What data types, from what sources, in what formats will this project
produce? How much of it will there be?
How will you describe or document your data? Are there standards you
will be using for this?
Will you be sharing your data? Do you have the rights to share the data?
What did you tell the IRB?
How often do you need to back up your files? How do you need to be
able to access your files? How many backups will you have?
How much storage space do you need? What is your budget for your
storage?
Where are you going to archive or store the data, and how will it be
accessed?
What are the roles and responsibilities around all of these things? i.e.,
Who's going to be doing all this?
16. Data Lifecycle:
Publish, Store, Access, Reuse
File size & format
Open vs. proprietary
Security
Anonymize or encrypt?
Levels may vary by access (org. vs. 3rd party)
Data Citation
Sharing
Upload data & metadata
Institutional repository, data center, etc.
Persistent identifier
22. Other places data can live…
Figshare
ICPSR
GitHub
DataUp
Dropbox
(or other cloud storage)
IF you use proper encryption
Lists of data repositories:
DataCite
DataBib
23. Data Discovery
Data Depositories (previous slide)
ICPSR
Figshare
Institutional Repositories
OpenDOAR (directory)
Specific institutions
Data Catalogs
Numeric Data Catalog (Columbia)
GeoData (Columbia, others)
Gov & Public Sources (data producers)
NYC OpenData
Data.gov
Census Bureau
Bureau of Labor Statistics
IMLS (Institute of Museum & Library Services)
Data isn’t just for science! Lots of fields collect lots of kinds of data…
Quantitative data: usually numbers, but anything that's quantifiable (survey data, for example)
Qualitative data: interviews, some surveys
Ethnographic data: observations of space use
Maps
Photographs
Sound recordings, video
Observations, e.g.: Sensor data, telemetry, survey data, sample data, neuroimages.
Experiments, e.g.: gene sequences, chromatograms, toroid magnetic field data.
Simulations, e.g.: climate models, economic models
Derivations / Compilations, e.g.: text and data mining, compiled database, 3D models, data gathered from public documents.
Research process, e.g.: observational conditions, experimental procedure, instrumentation, label descriptions, units
*http://www.nytimes.com/2011/10/06/science/06nobel.html?_r=0 “when you get to see everyone’s mistakes you can tell the mistake from a pattern” Micah Altman at RDS2013
Observational Data/Media
Real time capture
Usually irreplaceable
Derivation/Compilation Data
Reproducible
Expensive
Research Process Data
Data documentation & description (aka ‘metadata’)
Analysis algorithms & codes
Simulation Data
Model & Inputs are usually the important elements
So what are librarian roles, related to data?
-- advisory
-- curatorial
Crossover of public services & technical services…
-- metadata (TS)
-- subject expertise (either; often PS)
-- methodology/analysis (may be PS)
-- data mgmt (TS, PS, or Scholarly Communication dept.)
-- IRs… (digital services, schol comm—could be outside the libraries!)
-- secondary data (usually PS)
Some question prompts
-- librarian might work through this checklist with a researcher
-- help shape their DMP
DMPs include things like:
-- kinds of data
-- data collection methods
-- hardware, software
-- archival plans…
These are things that the librarian may consult with a researcher on…
Consider:
Files: size, format, organization
-- open format vs. proprietary
Security
-- Be vigilant and protective of data in your custody. Observe precautions when transporting paper files with patient or other individually identifiable information. Do not leave documents or USB drives in unsecured locations (cars, lockers). Encrypt information sent by email or kept in cloud storage.
-- who needs to access this data—institutional, 3rd party access?
-- before sharing data, anonymize individually identifying information, be aware of geo-location tags
Storage/backup system
-- Checksum validation, test to be sure backups are working properly
Retention
-- how long does this data need to be accessible/stored? Grant requirements, your own plans, others
-- is it reproducible or not?
Access/Transparency
-- do you (or others) need to access it frequently, or only need periodic access?
-- document the data (processes, labels, etc.)
-- deposit in repository (for public access)
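One way the checksum validation mentioned under Storage/backup might look in practice is sketched below. It is only an illustration: the function names (sha256_of, verify_backup) and the choice of SHA-256 are assumptions, not something the slides prescribe, and it assumes the backup is a plain directory copy of the original.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(original_dir: Path, backup_dir: Path) -> list[str]:
    """List files (relative paths) that are missing from the backup
    or whose contents differ from the original."""
    problems = []
    for source in sorted(original_dir.rglob("*")):
        if not source.is_file():
            continue
        relative = source.relative_to(original_dir)
        copy = backup_dir / relative
        # A file is a problem if the copy is absent or its checksum differs.
        if not copy.is_file() or sha256_of(source) != sha256_of(copy):
            problems.append(relative.as_posix())
    return problems
```

An empty list from verify_backup(Path("data"), Path("backup/data")) would suggest the backup matches; the two directory names here are placeholders.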
Consistency: Pick a system, write it down, & stick with it
Identify necessary elements
Create brief, understandable names
Date: YYYY-MM-DD
Version: v01, v02,…FINAL
In general, try to stay away from spaces in filenames as well as the following characters:
. \ / : * ? " < > | [ ] & $
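The naming advice above could be applied with a small helper like the following sketch. The element order (project, description, date, version), the underscore separator, and the function names are assumptions chosen for illustration; the notes only ask that a system be picked, written down, and followed.

```python
import re
from datetime import date

# Runs of whitespace or of the characters the notes advise against.
FORBIDDEN = re.compile(r'[\s.\\/:*?"“”<>|\[\]&$]+')


def clean_part(text: str) -> str:
    """Replace each run of spaces or discouraged characters with one hyphen."""
    return FORBIDDEN.sub("-", text.strip()).strip("-")


def build_filename(project: str, description: str, day: date,
                   version: int, extension: str) -> str:
    """Assemble project_description_YYYY-MM-DD_vNN.ext (an assumed layout)."""
    parts = [
        clean_part(project),
        clean_part(description),
        day.isoformat(),        # YYYY-MM-DD
        f"v{version:02d}",      # v01, v02, ...
    ]
    return "_".join(parts) + "." + extension.lstrip(".")


print(build_filename("Soil Survey", "pH readings: site A",
                     date(2013, 4, 2), 1, "csv"))
# Soil-Survey_pH-readings-site-A_2013-04-02_v01.csv
```

The project name, description, and date in the example call are made up. The only period left in the result is the one before the file extension.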
Metadata is important:
-- to help the researcher understand and remember their data (variables!)
-- to help others find that data (for validation, replication, reuse)
-- replication in particular requires GOOD metadata!
Codebook is an essential part of metadata for data…
Tells what each variable is, what it’s called, how it was measured, what units it’s in, how it’s coded…
Codebooks can also include survey questions asked to obtain the variables.
This is also important in Data Reference work w/ secondary data…
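A codebook of the kind described above might be kept in machine-readable form, as in this sketch. Every variable name, label, question, and code value here is hypothetical, invented only to show the shape: what each variable is called, what it means, the question asked, its units, and how it is coded.

```python
import csv
import io

# Hypothetical codebook: one entry per variable in an imagined survey.
CODEBOOK = [
    {"variable": "resp_id", "label": "Respondent identifier",
     "question": "", "units": "", "codes": {}},
    {"variable": "visits", "label": "Library visits in the past month",
     "question": "How many times did you visit the library in the past month?",
     "units": "visits per month", "codes": {}},
    {"variable": "status", "label": "Respondent status",
     "question": "What is your status?", "units": "",
     "codes": {"1": "Undergraduate", "2": "Graduate", "9": "No answer"}},
]


def codebook_to_csv(codebook) -> str:
    """Write the codebook as CSV text, one row per variable."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["variable", "label", "question", "units", "codes"])
    for entry in codebook:
        codes = "; ".join(f"{k}={v}" for k, v in entry["codes"].items())
        writer.writerow([entry["variable"], entry["label"],
                         entry["question"], entry["units"], codes])
    return buffer.getvalue()


def decode_row(row: dict, codebook) -> dict:
    """Replace coded values with their labels; leave other values as they are."""
    lookup = {entry["variable"]: entry["codes"] for entry in codebook}
    return {name: lookup.get(name, {}).get(value, value)
            for name, value in row.items()}
```

For example, decode_row({"resp_id": "017", "visits": "4", "status": "2"}, CODEBOOK) turns the status code "2" into "Graduate", which is the kind of lookup a secondary user would otherwise do by hand.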
Funders and some journals require data sharing at the time of publication or within 12 months
So, when it’s all done… where do we PUT this stuff?
More places to store original data….
So from the other end
-- as a researcher wanting to use secondary data (already collected)
-- or a student wanting to do data analysis…
Where do we FIND data?