LIS 653, Session 11: Data Management & Curation

Data Management
LIS 653
Starr Hoffman

What is (are) Data?
 Observations
 Sensor data, telemetry, survey data, sample data
 Experiments
 Gene sequences, chromatograms
 Simulations
 Economic models
 Derivations/Compilations
 Text mining, data from public documents
 Documents & texts themselves = data
 Research Process
 Observational conditions, experimental procedure,
instrumentation, label descriptions, units, metadata

Librarian Roles & Data
Advisory
 Original data:
 Consult on creating DMP
 Consult on data organization, methodology, etc.
 Consult on metadata practices
 Consult on archiving
 Help disseminate research
 Journal publication, OA resources, blogs, etc.
 Deposit into repository (IR, 3rd party, etc.)
 Secondary data:
 Consult on methodology / analysis
 Discovery…
Curatorial
 Manage IR (institutional repository)
 Create metadata for datasets
 Purchase / catalog / discovery for secondary data

What is Data Management?
Planning for the short-term and long-term:
 care of and
 access to
…your data.
Or: What are you going to do with that data?
 How will you describe it?
 How are you organizing it?
 After you’re done, where will you put it?
 How will you/others be able to access it? For how long?

Data Management: Why Does It Matter?
 Grant requirements
 Public access to funded research
 Validation
 Replication
 Re-use, continue research
 Teaching
 Natural disasters
 Computer failure or theft
 USB/hard drive failure or loss
 Files corrupted

Funding Requirements
 NSF:
Proposals must include a supplementary document of no
more than two pages labeled “Data Management Plan”
…describe how the proposal will conform to NSF policy
on the dissemination and sharing of research results.
 NIH:
The NIH expects and supports the timely release and
sharing of final research data… for use by other
researchers. …expected to include a plan for data
sharing or state why data sharing is not possible.
 NEH Office of Digital Humanities
 NOAA
 IMLS
 NIJ

DMP Considerations
 What data types, from what sources, in what formats will this project
produce? How much of it will there be?
 How will you describe or document your data? Are there standards you
will be using for this?
 Will you be sharing your data? Do you have the rights to share the data?
What did you tell the IRB?
 How often do you need to back up your files? How do you need to be
able to access your files? How many backups will you have?
 How much storage space do you need? What is your budget for your
storage?
 Where are you going to archive or store the data? And how will it be
accessed?
 What are the roles and responsibilities around all of these things? i.e.,
Who's going to be doing all this?
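
The storage and backup questions above come down to simple arithmetic. The sketch below is one hedged way to estimate it in Python. The function names, the dataset size, the number of backup copies, and the price per GB are all hypothetical placeholders, not figures from any funder or storage provider.

```python
def storage_needed_gb(dataset_gb: float, backup_copies: int) -> float:
    """Total space for the working copy plus each full backup copy."""
    return dataset_gb * (1 + backup_copies)


def annual_cost(total_gb: float, price_per_gb_month: float) -> float:
    """Yearly storage cost at a flat monthly price per GB."""
    return total_gb * price_per_gb_month * 12


# Hypothetical project: 50 GB of data, 2 backup copies, $0.02 per GB per month.
total = storage_needed_gb(50, 2)
cost = annual_cost(total, 0.02)
print(f"{total:.0f} GB total, about ${cost:.2f} per year")
```

Swapping in a project's own numbers gives a starting figure for the budget section of a DMP; real pricing is often tiered, so treat the result as a rough estimate.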

Planning the Data Lifecycle
Consider…
 Files:
 Size, format, organization
 Security
 Storage/Backup system
 Retention
 Access/Transparency

Data Lifecycle: Create / Analyze / Edit
 File Management
 Consistency, brevity, description
 Versioning (v01, v02, FINAL)
 Avoid spaces
Directory structure
/[Project]/[Grant Number]/[Event]/[Date]
File naming
[description]_[instrument]_[location]_[YYYYMMDD].[ext]
 Transparency/Sharing
 Document data: codebook, metadata

File Structure & Naming Examples
Directory Structure
/[Project]/[Grant Number]/[Event]/[Date]
 /NYCPhysicalActivity/NOT-MH-14-033/Interview/20141109
 /Dissertation/LitReview/LibraryLeadership/
File Naming
[description]_[instrument]_[location]_[YYYYMMDD].[ext]
 PhysicalActivity_InterviewQs_PS193_20141109.doc
 PhysicalActivity_InterviewResponses_20141022.xls
 LibraryLeadershipHenson_Article_2011.pdf
 Leadership_Survey_20130917.doc
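
The directory and file-naming patterns above can be generated by a small script so that every name comes out consistent. This is a minimal Python sketch, assuming the two patterns shown on the slide; the helper names (`clean`, `build_directory`, `build_filename`) are illustrative and not part of any standard tool.

```python
from datetime import date
from pathlib import PurePosixPath


def clean(part: str) -> str:
    """Remove all whitespace so the name contains no spaces."""
    return "".join(part.split())


def build_directory(project: str, grant_number: str, event: str, day: date) -> PurePosixPath:
    """Build /[Project]/[Grant Number]/[Event]/[Date] with the date as YYYYMMDD."""
    parts = [clean(project), clean(grant_number), clean(event), day.strftime("%Y%m%d")]
    return PurePosixPath("/", *parts)


def build_filename(description: str, instrument: str, location: str, day: date, ext: str) -> str:
    """Build [description]_[instrument]_[location]_[YYYYMMDD].[ext]."""
    fields = [clean(description), clean(instrument), clean(location), day.strftime("%Y%m%d")]
    return "_".join(fields) + "." + ext.lstrip(".")


day = date(2014, 11, 9)
print(build_directory("NYC Physical Activity", "NOT-MH-14-033", "Interview", day))
# /NYCPhysicalActivity/NOT-MH-14-033/Interview/20141109
print(build_filename("Physical Activity", "InterviewQs", "PS193", day, "doc"))
# PhysicalActivity_InterviewQs_PS193_20141109.doc
```

Writing dates as YYYYMMDD means files sort chronologically when sorted by name.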

Metadata & Description
 Variables: labels, meaning, how they were
measured, units, codes
 Survey questions
 Experimental procedures
 Research methodology
 Statistical analyses performed
 Preferred data citation
 Pew Hispanic Center. (2008). 2007 Hispanic
Healthcare Survey [Data file and code book].
Retrieved from
http://pewhispanic.org/datasets/
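
A codebook that records the variable details listed above can be as simple as a small table saved next to the data. The sketch below writes one as CSV using Python's standard library. The variables, labels, and codes are invented for illustration and do not come from any real dataset, and the column names are only one possible layout.

```python
import csv
import io

# Hypothetical variables, for illustration only.
CODEBOOK = [
    {"variable": "age", "label": "Respondent age", "measurement": "Self-reported",
     "units": "years", "codes": "-9 = missing"},
    {"variable": "active_days", "label": "Days physically active in the past week",
     "measurement": "Survey question", "units": "days", "codes": "0-7; -9 = missing"},
]

FIELDS = ["variable", "label", "measurement", "units", "codes"]


def codebook_to_csv(rows: list[dict]) -> str:
    """Return the codebook as CSV text with one row per variable."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


print(codebook_to_csv(CODEBOOK))
```

CSV is used here because it is an open format that most tools can read.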

Data Lifecycle: Publish, Store, Access, Reuse
 File size & format
 Open vs. proprietary
 Security
 Anonymize or encrypt?
 Levels may vary by access (org. vs. 3rd party)
 Data Citation
 Sharing
 Upload data & metadata
 Institutional repository, data center, etc.
 Persistent identifier
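
For the "Anonymize or encrypt?" question above, one common first step before sharing is to replace direct identifiers with stable pseudonyms. The Python sketch below does this with a keyed hash (HMAC-SHA256) from the standard library. The key, the participant IDs, and the field names are hypothetical. This is pseudonymization only, and it is an assumption that it suits a given project: other fields may still identify people, so it does not replace whatever the IRB protocol requires.

```python
import hashlib
import hmac


def pseudonymize(participant_id: str, secret_key: bytes, length: int = 12) -> str:
    """Return a stable pseudonym for an ID using a keyed hash (HMAC-SHA256)."""
    digest = hmac.new(secret_key, participant_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]


# Hypothetical key and records, for illustration only. A real key should be
# generated randomly and stored separately from the shared data.
key = b"replace-with-a-randomly-generated-secret"
records = [
    {"id": "P001", "minutes_active": 30},
    {"id": "P002", "minutes_active": 45},
]

# Copy each record, swapping the real ID for its pseudonym.
shared = [
    {"id": pseudonymize(r["id"], key), "minutes_active": r["minutes_active"]}
    for r in records
]
print(shared)
```

The same ID always maps to the same pseudonym under one key, so records for one participant can still be linked across files without exposing the original ID.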

Other places data can live…
 Figshare
 ICPSR
 GitHub
 DataUp
 Dropbox
 (or other cloud storage)
 IF you use proper encryption
Lists of data repositories:
 DataCite
 DataBib

Data Discovery
 Data Repositories (previous slide)
 ICPSR
 Figshare
 Institutional Repositories
 OpenDOAR (directory)
 Specific institutions
 Data Catalogs
 Numeric Data Catalog (Columbia)
 GeoData (Columbia, others)
 Gov & Public Sources (data producers)
 NYC OpenData
 Data.gov
 Census Bureau
 Bureau of Labor Statistics
 IMLS (Institute of Museum & Library Services)

Replicated Data
And Finally…
Geeky puns.
