The document provides guidance on developing a data management plan (DMP) to meet requirements for National Science Foundation grant proposals. It discusses the context and rationale for federal data policies, defines the key elements required for a DMP, and provides examples of DMPs for different types of research data. The main points are: the NSF data policy aims to increase research impact and data sharing/reuse; a DMP must address the types of data generated, metadata standards, data access/sharing plans, long-term preservation, and associated costs; and good planning helps ensure data remain accessible, usable, and preserved into the future. Resources and guidance are available to help researchers develop robust and fundable DMPs.
1. DATA MANAGEMENT PLANS & PLANNING: MEETING THE NSF REQUIREMENT
June 13, 2012
2. WHO ARE WE?
Heather Coates
Digital Scholarship & Data Management Librarian
Liaison to the School of Public Health
University Library
Kristi Palmer
Digital Scholarship Team Leader
Liaison to the Department of History
and Programs of Women's and American Studies
University Library
3. LEARNING OBJECTIVES
After attending this workshop:
You will understand the NSF data policies.
You will be aware of the relevant data-related services at IUPUI.
You will have resources to develop a data management plan
(DMP) for your NSF proposal(s).
You will be able to write a comprehensive DMP for your NSF
proposal(s).
You will send your DMP draft to the Data Services Program for
review and assistance as needed.
4. OVERVIEW
Context for the NSF data policies
Meeting the NSF DMP requirement
The requirement: 5 elements
Developing a Data Management Plan
Implementing your plan
Workshop Evaluation (5 minutes)
5. CONTEXT: SCHOLARLY COMMUNICATIONS
Funding, funding, funding
Scholarly Impact
Exposure increased citation
More equal access (especially for students)
Facilitates reproducibility
Facilitate new discoveries via secondary analysis/data re-use
Foster productive collaborations
Lead to new computational techniques
Planning for the future
If we can’t find it, it doesn’t exist
Persistent access
Long-term preservation of scholarly records
6. CONTEXT: WHY THE LIBRARY?
preservation, curation, access
Trusted member of the institution
Organizational structure lends itself to collaboration with
researchers
Interdisciplinary by nature
Existing infrastructure for digital information
Existing expertise in preserving and providing access to
information
Program of Digital Scholarship
Archives
7. CONTEXT: DATA SERVICES PROGRAM
Part of the Program of Digital Scholarship
Mission
Identifying data issues and connecting you to the solutions
Services
Workshops
Individual consultations
Data repository
Resources
Guide to NSF Data Management Plan Requirement
Website
8. CONTEXT: TERMINOLOGY
Cyberinfrastructure: computing resources & networks, services,
& people (see Empowering People, 2009 for more)
Data management: technical processing and preparation of data
for analysis
Data curation: selection of data for preservation and adding
value for current and future use
Data citation: mechanisms to enable easy reuse and verification,
track impact of data, and create structures to recognize and
reward researchers (DataCite)
Data sharing: must take into account ethical and legal issues; a
spectrum with many options
Data stewardship:
9. CONTEXT: FEDERAL POLICIES
Issues in scholarly communication
Open access
Open data & data citation
Data management & curation
Federal policies (incremental steps towards openness)
National Research Council, 1985
Office of Management & Budget, 1999: Circular A-110
NIH Data Sharing Policy, 2003
NIH Public Access Policy, 2008
NSF DMP Requirement, 2011
Other policies: NEH, NOAA, NASA, Howard Hughes Medical Institute, Wellcome Trust
10. CONTEXT: IU STRATEGIC PLAN
IU Empowering People Strategic Plan for IT (2009), Action
33:
“IU should provision a data utility service for research data
that affords abundant near- and long-term storage, ease of
use, and preservation capabilities. This data utility will need
to offer a range of services for securing data, providing
authorized access within and beyond IU; ensuring metadata
description, annotation, and provenance; and providing
backup/recovery services.”
11. CONTEXT: OPEN ACCESS
What is Open Access?
Freely available, online, and free of most copyright restrictions
Why should you care?
Right thing to do?
Increase your citations
“We analysed 119,924 conference articles in computer science and related
disciplines. The mean number of citations to offline articles is 2.74, and the
mean number of citations to online articles is 7.03, an increase of 157%.”
(Lawrence, 2001)
Publisher functions need not reside in for-profit hands
"Between 1975 and 2005 the average cost of journals in chemistry and
physics rose from $76.84 to $1,879.56. In the same period, the cost of a
gallon of unleaded regular gasoline rose from 55 cents to $1.82. If the gallon
of gas had increased in price at the same rate as chemistry and physics
journals over this period it would have reached $12.43 in 2005, and would
be over $14.50 today.” (Lewis, 2008)
12. CONTEXT: OPEN ACCESS @ IUPUI AND IU
IUPUI University Library Program of Digital Scholarship
http://www.ulib.iupui.edu/digitalscholarship
Open Journals
IUPUIScholarWorks-Faculty Scholarship
Electronic Theses and Dissertations
Cultural Heritage Collections
Data
eArchives
13. CONTEXT: RESEARCH LIFE CYCLE
Source: DDI Structural Reform Group. “DDI Version 3.0 Conceptual Model." DDI Alliance. 2004. Accessed on 11 August 2008.
<http://www.icpsr.umich.edu/DDI/committee-info/Concept-Model-WD.pdf>.
14. CONTEXT: BENEFITS OF PLANNING
Saves time
Less reorganization down the road
Increases efficiency
Gathers necessary information for analysis and writing
Prevents problems in understanding data and metadata
Prevents data loss
If you have a plan, you are more likely to back up your data
Makes it easier to preserve your data
Documentation is more easily created throughout a project
Metadata generation can be automated or incorporated into procedures
Requirements of some funding agencies and institutions
15. DMP: INTERPRETING THE POLICY
Why?
Increased impact of research money
Reduce redundant data collection
Enhance use and value of existing data
Further scientific research
Data gathering tool
What kinds of data are we collecting?
How are researchers collecting, managing, and preserving data?
What are community norms?
Language is broad to allow input from research communities
Implementation costs of the DMP CAN be included in direct costs
16. DMP: KEEP IN MIND
The gist of it…
Describe what you will do with your data during and after the proposed
project
Ensures data is safe now and in the future
DMP should reflect…
Awareness of data management and curation in your discipline
Feasible plan to utilize available cyberinfrastructure
Try to…
Explain the rationale for your choices
Identify roles for data management and curation activities
17. DMP: ELEMENTS
Types of data
Standards and metadata
Access and sharing
Re-use, re-distribution, and the production of derivatives
Long-term preservation
[Budget]
18. DMP: TYPES OF DATA [1]
Use standards common in your research community
Characterize the data
Types of data
experimental, observational, raw or derived, models, simulations, curriculum
materials, software, images, audio, video, etc.
File formats (e.g., text, spreadsheet, database, etc.)
How much data? (# of files, total size)
Will the data be reproducible?
Relationship to existing data? (i.e., interoperability)
Syntactic
Semantic
19. DMP: TYPES OF DATA [2]
How will data be collected?
How? (tools, instruments, measurements, etc.)
When? (timeframe, series)
Where? (sites, settings)
How will data be processed?
Workflows (brief overview using flow chart)
Software packages
How will the data be stored and managed?
File naming conventions
Version control
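A file naming convention is easier to follow when a small helper builds the names. The sketch below is only an illustration: the function name `build_filename` and the pattern source_datatype_date_version are assumptions, not a convention prescribed by NSF or by this workshop.

```python
import re
from datetime import date


def build_filename(source: str, datatype: str, collected: date,
                   version: int, ext: str = "csv") -> str:
    """Build a name such as 'SiteA_soiltemp_20120613_v02.csv' (hypothetical pattern)."""

    def clean(part: str) -> str:
        # Drop spaces and punctuation so names stay portable across systems.
        return re.sub(r"[^A-Za-z0-9]+", "", part)

    # ISO-style date and a zero-padded version number keep files sorted in order.
    return f"{clean(source)}_{clean(datatype)}_{collected:%Y%m%d}_v{version:02d}.{ext}"


print(build_filename("Site A", "soil temp", date(2012, 6, 13), 2))
```

Whatever pattern you choose, describing it in the DMP and generating names from one shared helper reduces the reorganization needed later.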
20. DMP: TYPES OF DATA [3]
What QA & QC measures will be used?
Identify steps during processing and analysis to eliminate missing data
points, identify outliers, and provide statistical summaries (e.g., double
data entry, histograms, scatterplots)
Before data are collected, define and enforce standards and assign
responsibility
During project, document processes and any changes or deviations
What is the backup and security plan?
Identify particular security or confidentiality issues
Describe location & frequency
Roles & responsibilities
Who will carry out data collection, processing, and backup activities?
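The QA and QC steps above (counting missing data points, flagging outliers, producing statistical summaries) can be sketched in a few lines. This is a minimal illustration, assuming missing values are recorded as `None` and that a z-score cutoff is an acceptable outlier rule for the data; the function name `screen` and the cutoff value are hypothetical choices.

```python
import statistics


def screen(values, z_cutoff=3.0):
    """Summarize a list of measurements; None marks a missing value."""
    present = [v for v in values if v is not None]
    mean = statistics.mean(present)
    sd = statistics.stdev(present)  # sample standard deviation
    # Flag values whose distance from the mean exceeds z_cutoff standard deviations.
    outliers = [v for v in present if sd > 0 and abs(v - mean) / sd > z_cutoff]
    return {
        "n": len(values),
        "missing": len(values) - len(present),
        "mean": mean,
        "sd": sd,
        "outliers": outliers,
    }


readings = [10.1, 9.8, None, 10.3, 10.0, 9.9, 10.2, 42.0]
report = screen(readings, z_cutoff=2.0)
print(report["missing"], report["outliers"])
```

Flagged values should be reviewed and documented rather than silently deleted, so the cutoff and the decision made for each flagged point belong in the project documentation.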
21. EXAMPLE: TYPES OF DATA
Atmospheric Concentrations of CO2, Mauna Loa
Observatory, Hawaii, 2011-2013
https://www.dataone.org/sites/all/documents/DMP_MaunaLoa_Formatted.pdf
Arthropod responses to grassland nutrient limitation
https://www.dataone.org/sites/all/documents/DMP_NutNet_Formatted.pdf
22. DMP: STANDARDS & METADATA [1]
Metadata describes the who, what, when, where, how, why of
the data
Include workflow: how you get from raw data to final products
Purpose: enable finding, organization, interoperability,
identification, archiving & preservation
Standards are commonly agreed upon terms and definitions in a
structured format
Dublin Core (commonly used by libraries)
Darwin Core (geographic occurrence of species)
EML (ecology)
Data Documentation Initiative (DDI; social sciences)
IEEE LOM (learning objects metadata)
23. DMP: STANDARDS & METADATA [2]
Ask yourself: will your datasets be self-explanatory or
understandable in isolation?
Decisions to make about metadata
Relevant standard(s)
Format
Content
What information is needed to use and interpret in 5 years, 25 years?
How are metadata created?
Automatically generated
Manually created
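As one illustration of automatically generated metadata, the sketch below writes a few Dublin Core elements to XML with Python's standard library. The wrapper element `metadata`, the function name, and the field values are assumptions made for the example; a real record should follow whatever profile your chosen repository requires.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"


def dublin_core_record(fields: dict) -> str:
    """Return a small XML record with one dc:* element per field."""
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("metadata")  # hypothetical wrapper element
    for name, value in fields.items():
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return ET.tostring(root, encoding="unicode")


# Placeholder values for illustration only.
record = dublin_core_record({
    "title": "Example dataset",
    "creator": "Doe, J.",
    "date": "2012-06-13",
    "format": "text/csv",
})
print(record)
```

Fields such as date and format can often be filled in by the collection or processing scripts, leaving only descriptive fields to be entered manually.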
24. EXAMPLE: STANDARDS & METADATA [1]
Atmospheric Concentrations of CO2, Mauna Loa Observatory,
Hawaii, 2011-2013
https://www.dataone.org/sites/all/documents/DMP_MaunaLoa_Formatted.pdf
Metadata will be comprised of two formats: contextual
information about the data in a text based document and ISO
19115 standard metadata in an xml file. These two formats for
metadata were chosen to provide a full explanation of the data
(text format) and to ensure compatibility with international
standards (xml format). The standard XML file will be more
complete; the document file will be a human-readable summary of
the XML file.
25. EXAMPLE: STANDARDS & METADATA [2]
Rio Grande Hydrologic Geodatabase Compendium
https://www.dataone.org/sites/all/documents/DMP_Hydrologic_Formatted.pdf
Microsoft Access Database format will be used since it is readily accessible and it is compatible with ESRI ArcGIS (http://www.esri.com/software/arcgis/index.html), a Geographic Information System software package used by the stakeholders. Naming conventions will be consistent: no spaces will be used in table names or field names. The file naming convention will consist of the datasource_datatype format for raw data files. Data reporting functionality will be built into the VBA processing programs to provide output in .txt file format for number of records per source when updatable data sources are refreshed.
Every effort will be made to go back to the authoritative source for an identified dataset. Quality control of the database will be performed using SQL statements that capitalize on the database structure to ensure relational database integrity. Appropriate primary keys will be assigned to manage possible data duplicates. Potential duplicate site IDs will be handled through automated procedures and the creation of alternate ID tables.
A data dictionary will be created that defines the table definition, table fields, and table field data types. An entity-relationship diagram will be created that defines the relational structure of the database.
A metadata record will be produced using the FGDC standard that describes the entire geodatabase. The FGDC standard was chosen due to required Federal government standards.
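The example plan above mentions SQL quality control, record counts per source, and handling of duplicate site IDs. The sketch below shows what such checks might look like. It uses Python's built-in SQLite module as a stand-in for the Microsoft Access and VBA tooling named in the plan, and the table `raw_sites`, its columns, and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_sites (site_id TEXT, source TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO raw_sites VALUES (?, ?, ?)",
    [
        ("RG001", "usgs", "Gauge A"),
        ("RG002", "usgs", "Gauge B"),
        ("RG001", "state", "Gauge A (state copy)"),
    ],
)


def duplicate_site_ids(connection):
    """List site IDs that appear more than once, with their counts."""
    return connection.execute(
        "SELECT site_id, COUNT(*) FROM raw_sites "
        "GROUP BY site_id HAVING COUNT(*) > 1 ORDER BY site_id"
    ).fetchall()


def records_per_source(connection):
    """Count records per data source, e.g. for a refresh report."""
    return connection.execute(
        "SELECT source, COUNT(*) FROM raw_sites GROUP BY source ORDER BY source"
    ).fetchall()


print(duplicate_site_ids(conn))
print(records_per_source(conn))
```

Running checks like these each time a source is refreshed, and saving the output, gives a simple audit trail for the quality control a plan describes.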
26. DMP: ACCESS & SHARING
What are your obligations for sharing?
Funding agency, institution, other organization, legal, etc.
What are the ethical or legal issues? (i.e., privacy,
confidentiality, security, intellectual property, or other rights)
How will the data be made available?
What is the process for gaining access?
When will the data be made available?
When will the data become available?
For how long will the data be available?
What is the process for gaining access?
Who will have access to the data?
27. DMP: RE-USE, RE-DISTRIBUTION, ETC.
What rights will you retain before data is made available?
Will permission restrictions be necessary?
Limits or conditions for political, commercial, or patent reasons?
Is there an embargo period? Why?
Future users and uses
Who might be interested in the data?
How might you anticipate this data being used?
What value might the data have for these people?
28. EXAMPLE: ACCESS, SHARING, RE-USE
Development of a NanoKlein Calorimeter
http://libguides.unm.edu/content.php?pid=137795&sid=1422879
We expect to apply for a patent for this instrument. All of the
materials submitted as part of the patent process will be a matter
of public record. We will also make technical drawings, test data
and calibration data available through our institutional repository.
Cave Microbiology
http://libguides.unm.edu/content.php?pid=137795&sid=1422879
29. DMP: LONG-TERM PRESERVATION
Project-based funding does not lend itself to long-term
preservation.
What data will be preserved?
What transformations are necessary to prepare the data?
How long do you think the data will be useful? How long will the
data be preserved?
Contextual information needed to make the data reusable
metadata, references, reports, manuscripts, grant proposal, etc.
Where will it be preserved?
Links to published materials and other outcomes? Use of persistent
citation?
Procedures for preservation and back-up?
Who will be the contact for the dataset?
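One common way to make preservation and back-up procedures checkable is to record a checksum for every file and compare the values later. The sketch below assumes SHA-256 checksums and a simple in-memory manifest; the function names are hypothetical, and a repository may use its own fixity tooling instead.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Compute the SHA-256 checksum of one file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder: Path) -> dict:
    """Map each file's path (relative to folder) to its checksum."""
    return {
        str(p.relative_to(folder)): sha256_of(p)
        for p in sorted(folder.rglob("*"))
        if p.is_file()
    }


def changed_files(folder: Path, manifest: dict) -> list:
    """Return files from the manifest that are now missing or altered."""
    current = build_manifest(folder)
    return sorted(name for name in manifest if current.get(name) != manifest[name])
```

A manifest created when data are deposited can be re-checked after each backup or migration; an empty result from `changed_files` suggests the copies are intact.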
30. EXAMPLE: LONG-TERM PRESERVATION [1]
Arthropod responses to grassland nutrient limitation
https://www.dataone.org/sites/all/documents/DMP_NutNet_Formatted.pdf
We will preserve both arthropod datasets generated during this
project (abundance and stoichiometry) for the long term in the
Digital Conservancy at the U of M. We will include the .csv
files, along with the associated metadata files. We will also submit
an abstract with the datasets that describe their original context
and any potentially relevant project information. Borer will be
responsible for preparing data for long-term preservation and for
updating contact information for investigators.
31. EXAMPLE: LONG-TERM PRESERVATION [2]
Improving the long-term preservability of HDF-formatted data by
creating maps to file contents
https://www.dataone.org/sites/all/documents/DMP_HDFMap_Formatted.pdf
The writer software will be preserved by the HDF Group for the life
of the HDF libraries. The HDF Group uses industry-standard best
practices to ensure the integrity of their software and systems.
Once the map writer has been used to generate maps for every
HDF file in existence, the continued existence of the writer
software is not required. The reader software will be preserved at
SourceForge.org for as long as there is community interest. The
collection of HDF files will be preserved at NSIDC as long as utility
is deemed high.
32. IUPUIDATAWORKS
Institutional repository that can facilitate subject repositories
Policies are being developed, informed by faculty needs
Pilot projects
More support at little/no cost
Flexibility in what we are willing to do
New tools to demonstrate impact of research
The future
Standardized levels of service
Standardized policies, responsive to faculty needs
Cost recovery for significant intellectual/time investment
33. IMPLEMENTING YOUR PLAN [1]
The DMP is a working document
NSF expects progress to be reported (progress reports, final
reports, new grant proposals)
Incorporate implementation into the project startup process
C&G, IRB, IACUC all have to be in place before data collection can begin
Review, revise, and set up your system during startup
Good documentation ensures…
A shared understanding of the data throughout a project
That future researchers will be able to understand data within the
relevant context
That re-users of data are able to interpret the data appropriately
34. IMPLEMENTING YOUR PLAN [2]
Research File System: http://pti.iu.edu/storage/rfs
Scholarly Data Archive: http://pti.iu.edu/storage/sda
Research Technologies, UITS: http://uits.iu.edu/page/avel
Core Services, UITS: http://pti.iu.edu/cs
Scholarly Cyberinfrastructure, UITS: http://uits.iu.edu/page/amee
CTSI Tools: http://www.indianactsi.org/rct (Alfresco Share, REDCap)
Program of Digital Scholarship: http://ulib.iupui.edu/digitalscholarship
Center for Research & Learning: http://crl.iupui.edu/
OVCR: http://research.iupui.edu/development/
Office of Academic Affairs: http://www.academicaffairs.iupui.edu
Intellectual Property Policy: https://www.indiana.edu/~vpfaa/academicguide/index.php/Policy_I-11
IUWare: https://iuware.iu.edu
IUanyWare: https://iuanyware.iu.edu/vpn/index.html
StatMath: http://www.indiana.edu/~statmath/
Statistics Consulting Center: http://www.math.iupui.edu/asci/
35. PRACTICAL TOOLS
Lynda.com tutorials: http://ittraining.iu.edu/lynda/default.aspx
Cleaning Up Your Excel Data (2010)
Managing & Analyzing Data in Excel (2010)
Data Validation in Depth (2010)
DMPTool: https://dmp.cdlib.org/
DMPOnline: https://dmponline.dcc.ac.uk/
UK Data Archive Costing Tool:
http://www.data-archive.ac.uk/media/257647/ukda_jiscdmcosting.pdf
Creative Commons Licenses & Data:
http://wiki.creativecommons.org/Data
Licensing Research Data, Digital Curation Centre
http://www.dcc.ac.uk/resources/how-guides/license-research-data
CIC Author Addendum
http://www.cic.net/authors
36. RECOMMENDED READING
UK Data Archive: Managing & Sharing Data Brochure:
http://www.data-archive.ac.uk/media/2894/managingsharing.pdf
37. MORE RESOURCES
National Science Board, Digital Research Data Sharing &
Management, 2012 (pre-publication):
http://www.nsf.gov/nsb/publications/2011/nsb1124.pdf
Committee on Science, Engineering, and Public Policy (U.S.).
(2009). Ensuring the integrity, accessibility, and stewardship of
research data in the digital age. Washington, D.C.: National
Academies Press.
National Science Board Committee on Strategy and Budget Task
Force on Data Policies. (2011). Digital Research Data Sharing &
Management. Washington, D.C.: National Science Board.
America Creating Opportunities to Meaningfully Promote
Excellence in Technology, Education, and Science Reauthorization
Act of 2010, Pub. L. No. 111-358. 124 Stat. 3982 (2010).
Retrieved from the Library of Congress Thomas database.
38. REFERENCES
1. Higgins, S. (n.d.). What are metadata standards. http://www.dcc.ac.uk/resources/briefing-papers/standards-watch-papers/what-are-metadata-standards
2. Digital Curation Centre. (n.d.). DCC Charter and Statement of Principles. Retrieved from http://www.dcc.ac.uk/about-us/dcc-charter.
3. Indiana University. (2011). Indiana University’s Advanced Cyberinfrastructure. Retrieved from http://pti.iu.edu/cyberinfrastructure.pdf.
4. Indiana University. (2009). Empowering People: Indiana University’s Strategic Plan for Information Technology. Retrieved from http://ovpit.iu.edu/strategic2/.
5. National Science Foundation. (2011). Award and Administration Guide: Chapter IV C.4., Dissemination and Sharing of Research Results. Retrieved from http://www.nsf.gov/pubs/policydocs/pappguide/nsf11001/aag_6.jsp#VID4.
6. Lawrence, S., Free online availability substantially increases a paper’s impact, Nature, 31 May 2001. http://www.nature.com/nature/debates/e-access/Articles/lawrence.html (accessed November 5, 2008).
7. Lewis, David W. "Library budgets, open access, and the future of scholarly communication: Transformations in academic publishing." C&RL News, May 2008, Vol. 69, No. 5. [Available at: http://www.ala.org/ala/mgrps/divs/acrl/publications/crlnews/2008/may/ALA_print_layout_1_471139_471139.cfm]
39. COMPELLING CASES FOR OPEN DATA
SPARC, Research is more valuable when it’s shared:
http://www.arl.org/sparc/greaterreach/index.shtml
Tim Berners-Lee: http://www.ted.com/talks/tim_berners_lee_on_the_next_web.html
Open-source cancer research: http://www.ted.com/talks/jay_bradner_open_source_cancer_research.html
Polymath problem blogs:
http://polymathprojects.org/about/
http://stevekochscience.blogspot.com/2011/02/open-data-success-story.html
http://eaves.ca/2011/09/07/the-economics-of-open-data-mini-case-transit-data-translink/
40. THANK YOU
Tell us what you think, take a brief survey.
Find us @
http://ulib.iupui.edu/digitalscholarship/dataservices
Heather Coates, hcoates@iupui.edu, 317-278-7125
Kristi Palmer, klpalmer@iupui.edu, 317-274-8230
IUB
Stacy Konkiel, skonkiel@indiana.edu, 812-856-5295
41. EXTRA: NIH DATA SHARING POLICY
$500,000 or more in direct costs in any year of the proposed research
Final research data, not summary statistics or tables, not underlying
pathology reports and other clinical source documents, might include
both raw data and derived variables
If an application describes a data-sharing plan, NIH expects that plan
to be enacted.
NIH expects the timely release and sharing of data to be no later than
the acceptance for publication of the main findings from the final
dataset.
It is the responsibility of the investigators, their Institutional Review
Board (IRB), and their institution to protect the rights of subjects and
the confidentiality of the data. Prior to sharing, data should be
redacted to strip all identifiers, and effective strategies should be
adopted to minimize risks of unauthorized disclosure of personal
identifiers.
42. EXTRA: NIH DATA SHARING PLAN
describe briefly the expected schedule for data sharing
the format of the final dataset
the documentation to be provided
whether or not any analytic tools also will be provided
whether or not a data-sharing agreement will be required
if so, a brief description of such an agreement (including the criteria for
deciding who can receive the data and whether or not any conditions
will be placed on their use)
mode of data sharing (e.g., under their own auspices by mailing
a disk or posting data on their institutional or personal
website, through a data archive or enclave)
Applicants may request funds in their application for data
sharing.
43. RESOURCES
National Institutes of Health, Data Sharing Policy
http://grants.nih.gov/grants/policy/data_sharing/data_sharing_guidance.htm
NIH Public Access Policy Implications
http://publicaccess.nih.gov/public_access_policy_implications_2012.pdf
Editor's notes
Housekeeping: hold questions until the end, make sure everyone has handouts
Resources:
- Slides
- DSP Guide to NSF DMP
- NSF Policy language handout
- CIC Author Addendum
We’re going to spend the majority of our time today walking through practical tips and examples for each section of the DMP, but there is important background information you need to know first.
The NSF data sharing policy and data management plan requirement came about within the context of broader discussions about how information is disseminated in the sciences, so we’ll quickly review that discussion before getting into the practical steps of developing a DMP. We want to accomplish two things:
- We want to prepare you to engage in discussions about the scholarly record and how research products are disseminated, specifically your rights and options.
- We want to give you the information you need to make informed decisions regarding copyright, IP, patent, and other issues when it comes to choosing where to publish and preserve all things related to your research. Data is just one piece of this picture.
In addition to funding, there are many compelling reasons to plan for preserving and sharing your data. The good news is that data sharing can boost the scholarly impact of your data and research in general, which is always good for promotion and tenure.
- Collaborations: funders are increasingly looking for interdisciplinary and multi-institution collaborations.
The benefits of digital data come with costs. Unlike with paper-based data and records, we can’t assume that we’ll be able to access and use digital information in 5, 10, or 50 years. We need to plan for managing and preserving valuable digital data so that the scholarly record isn’t lost. If we can’t find something, it doesn’t exist. These issues of persistent access and long-term preservation are challenges that libraries have been solving for thousands of years.
Some people wonder why the library is taking on this challenge of helping researchers to manage and preserve their data. There are several good reasons.
- every college or university has a library
- our place within IUPUI facilitates collaboration; we have existing relationships with each department; these collaborations are another way to build capacity for data management, sharing, preservation, and curation by making use of resources that are already available
- libraries and librarians have been caring for information in many formats for thousands of years; while the formats change more rapidly these days, our core principles remain the same
Other campus units can help you with your research, but have a different focus, such as compliance with human subjects or animal use guidelines, contracts and grants, bioethics, etc.
The Data Services Program is part of the University Library’s Program of Digital Scholarship. The Data Services Program offers workshops and consultations for developing an NSF data management plan as well as data management and curation in general. In addition, we have established a data repository for IUPUI research. The repository is one of many tools available to you for preserving and sharing, if appropriate, your research data. On our website, we’ve provided links to:
- Sample NSF DMPs from other institutions
- Various tools
- Guidance from institutions like the ICPSR and Digital Curation Centre (UK)
- Significant publications discussing data management and curation
I want to clarify some terms so we’re all on the same page. Data management is largely seen as the purview of scientists and biostatisticians since it varies by research community and discipline. Data sharing is not an all-or-none proposition. It encompasses a wide spectrum of activities ranging from open data publishing on the internet without restriction to controlled access by pre-defined partners or collaborators. Data citation is a concept similar to citation of scholarly publications and refers to mechanisms that:
- allow easy reuse and verification of data (DataCite);
- allow the impact of data to be tracked (DataCite);
- create a scholarly structure that recognizes and rewards data producers (DataCite).
These policies came about as a result of broader conversations about scholarly communication. In case you aren’t familiar with the term, it refers to the processes by which we produce and disseminate information relating to teaching, research, and other scholarly activities. Our goal is to provide you with the information necessary to engage in these discussions within IU and your research communities so that you are making informed decisions about how your research gets out there, who retains rights, and who can access it. The NSF data policies are not radical deviations; they are logical steps forward towards more formal guidelines for providing public access, data management best practices, minimal requirements for sharing data, and data stewardship.
http://www.nsf.gov/bfa/dias/policy/dmpfaqs.jsp
IU as an institution has been engaged in discussions about scholarly communications for several years and has voiced its commitment to these issues by including data management and curation in the IT Strategic Plan.
The purpose of diagrams like this is to provide a common ground for discussion of a complex and diverse process. This represents a rough map of the research process that we all engage in. The open access conversation is focused on the dissemination of research products like peer-reviewed articles and books at the end of the research life cycle, whereas data management planning is most effective when it’s initiated before data collection begins and implemented throughout the research life cycle.
It is important to know that the language was crafted to:
- Allow the research community to shape the implementation
- Give communities of practice a role in developing relevant best practices
The budget allocations and narrative should tell a cohesive story; if you identify big challenges in data storage and preservation, but do not allocate funds to address these challenges, it will likely raise a red flag for the review committee.
Ultimately, this document should demonstrate that you are aware of data management and preservation issues in general, more specifically the relevant practices within your research community or discipline, and that you have thought through how these affect your proposed project. The plans proposed should be feasible – for you and for us.
If you look at the guide I’ve provided, you’ll see these topics are broken down into a variety of specific questions to address. We’ll go through each section in more detail. It may be helpful to begin your DMP with a few sentences describing the research project in general, to provide context for the detailed information in each section. As you develop each section of your DMP, it’s important to do two things:
- Explain your reasoning: it could just be that it’s a standard practice in your field/community.
- Identify roles for data management and curation activities: think about who on your team or in another campus unit will carry out the activities described; this section should identify who will be carrying out the major elements of your plan. This may include the PI, staff, students, external contractors, institutional IT, the library, and external data repositories.
In this first section, you want to describe two things: the data you will generate or use and the documentation you will create to facilitate data management and curation.
- Syntactic interoperability: ensures technical infrastructure (hardware, software, data formats) used to create and discover data can work together/communicate with each other
- Semantic interoperability: ensures that the data can be interpreted once exchanged, through use of common data and metadata structures and content
In addition to describing your plan in the DMP, these activities should be described in working documents throughout the life of the project. Creating data documentation is easiest and most efficient at the beginning of a project. Good documentation ensures three things: a shared understanding of the data throughout a project; that future researchers will be able to understand data within the context they were created; that re-users of data are able to interpret the data appropriately. You don’t need to spend a lot of time or space describing the planned documentation, but it is worthwhile to mention what format it will take and who will be responsible for creating and maintaining it. This can relate to the second section describing metadata and standards.
Data screening tests: histograms, boxplots, Z-scores, etc.
Research methods, even within a single lab, change over time. Good documentation can facilitate efficient data collection and processing and preserve data integrity.
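As an illustration of the Z-score screening test mentioned above, here is a minimal sketch using only the Python standard library. The function name zscore_outliers and the default threshold of 3.0 are assumptions made for this example, not part of any NSF guidance; the appropriate threshold depends on your data and your field.

```python
from statistics import mean, stdev


def zscore_outliers(values, threshold=3.0):
    """Return (index, value, z) for each reading whose |z| exceeds threshold.

    Hypothetical helper for illustration only. Uses the sample standard
    deviation, so it needs at least two readings.
    """
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        # All readings are identical, so nothing can be flagged.
        return []
    flagged = []
    for i, v in enumerate(values):
        z = (v - mu) / sigma
        if abs(z) > threshold:
            flagged.append((i, v, z))
    return flagged


# Example: eleven plausible readings and one suspicious one (made-up numbers).
readings = [10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 10.1, 9.9, 10.0, 10.1, 9.9, 25.0]
for index, value, z in zscore_outliers(readings, threshold=2.5):
    print(f"reading {index} = {value} looks unusual (z = {z:.2f})")
```

Flagged readings are candidates for review, not automatic deletions. Whatever screening rule you settle on, record it in your data documentation so that others can see how the data were checked.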
CO2:
Site, frequency, raw data file description
Final data product
File format and size
Arthropod:
Process of data collection is well described, with referral to the proposal for further detail
Pre-defined sample/note code: naming convention & unique ID
Measurements of interest are described, but not defined; common definition or practice?
File naming convention and formats
Process for transfer of files is not described
Relationship to existing data is described & process for integration is included
Who created the data?
What is the content of the data?
When were the data created?
Where is it geographically?
How were the data developed?
Why were the data developed?
You can use flow charts for simple workflows.
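The six questions above can be captured in a small machine-readable record stored alongside the data. The sketch below is one possible approach, not a formal metadata standard: the field names, the helper functions, and the placeholder values are all hypothetical, and a repository or discipline standard may require a different structure.

```python
import json

# Hypothetical field names mirroring the who/what/when/where/how/why questions.
REQUIRED_FIELDS = ("who", "what", "when", "where", "how", "why")


def missing_fields(record):
    """List the required questions that the record leaves unanswered."""
    return [f for f in REQUIRED_FIELDS if not str(record.get(f, "")).strip()]


def to_json(record):
    """Serialize a complete record; refuse to write an incomplete one."""
    gaps = missing_fields(record)
    if gaps:
        raise ValueError("metadata record is missing: " + ", ".join(gaps))
    return json.dumps(record, indent=2, sort_keys=True)


# Placeholder values for illustration only.
example = {
    "who": "Name of the person or lab that created the data",
    "what": "Short description of the content of the data",
    "when": "Dates of data collection",
    "where": "Geographic location or site",
    "how": "Instruments, methods, and processing steps",
    "why": "Research question the data were collected to answer",
}
print(to_json(example))
```

A plain-text file like this is easy to keep next to the data files and can later be mapped onto whatever metadata standard your repository recommends.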
Ask yourself: are your data self-explanatory? Consider it from the perspective of a typical reader of a journal you publish in or a colleague who might be interested in collaborating. The solution is good documentation and metadata. More and more often, the people analyzing the data are not those who collected it. Metadata and good data documentation support a stronger understanding of the data and facilitate high-quality, appropriate re-use.
There are a lot of standards out there; you can ask what others in your discipline or research community are doing or contact us to see if we are aware of emerging standards. If you know you will be depositing your data in a particular repository, you can ask them what their requirements or recommendations are.
Metadata formats included; formal ISO standard
Content of metadata not fully described
No rationale
Format described
Software tools
File naming convention
Quality control procedures described
Data dictionary
Metadata standard
No rationale
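A file naming convention is easier to follow when it can be checked automatically. The sketch below assumes a made-up convention of the form project_site_YYYYMMDD_vNN.ext; the pattern, the allowed extensions, and the function name are hypothetical and should be replaced with whatever convention your own DMP describes.

```python
import re
from datetime import datetime

# Hypothetical convention: project_site_YYYYMMDD_vNN.csv (or .txt),
# lowercase letters and digits only in the project and site parts.
PATTERN = re.compile(
    r"^(?P<project>[a-z0-9]+)_(?P<site>[a-z0-9]+)_"
    r"(?P<date>\d{8})_v(?P<version>\d{2})\.(?P<ext>csv|txt)$"
)


def parse_filename(name):
    """Return the parts of a conforming file name, or None if it does not conform."""
    match = PATTERN.match(name)
    if match is None:
        return None
    parts = match.groupdict()
    try:
        # Reject impossible calendar dates such as February 30.
        datetime.strptime(parts["date"], "%Y%m%d")
    except ValueError:
        return None
    return parts


# Made-up file names for illustration.
for name in ["co2_site01_20120613_v01.csv", "co2 data final.csv"]:
    status = "ok" if parse_filename(name) else "does not follow the convention"
    print(f"{name}: {status}")
```

Running a check like this before files are transferred or deposited catches naming drift early, while the person who created the file can still fix it.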
Let’s take a look at the handout with the NSF policy language. Again, the language is broad and allows for practices to vary by research community. As you can see from the policy, data dissemination and sharing does not refer to publishing in scholarly journals. In this section, you should define what you will share, how you will share it, and the procedures for access. If you plan to use a specific data repository, they can help you develop this section; they will likely have standard processes in place.
Acceptable practices for data sharing vary by discipline; some have very mature data repositories while others rely on informal channels. Best practices for persistent access call for more permanent and secure mechanisms than a faculty or department website. The solution at IUPUI is our data repository (IUPUI DataWorks).
In terms of the access procedures, you want to think about what mechanism will be used for requests, whether registration and authentication are necessary, and what information you want to keep for your own records about those who request and receive your data. This can be useful information to demonstrate the value and impact of your research.
Data sharing encompasses a wide spectrum of activities; you can decide what, when, how, and with whom you will share your data. Even if you are part of a community in which data sharing is not common practice, I urge you to think about what data might be shared or re-used without compromising your intellectual property or competitiveness. Sharing your data broadly can increase the impact of your research and benefit both the institution and your research community. Often, the value of data is unknown or unrecognized until the data have been examined by a wide audience.
This section will relate to the access and sharing section, but should focus on policies and permissions for re-use, re-distribution, and production of derivative works, as opposed to the mechanisms described in the previous section. You can protect your ability to use the data for ongoing analysis while sharing as much of it as possible with your research community and the general public. While you can’t plan for every case, it is useful to imagine who might be interested in the data and how it might be used, and to set up a process for handling those cases. Depending on where you decide to deposit your data, this could be very formalized or relatively informal.
If you decide to share your data through a repository, there are often mechanisms built in for applying Creative Commons licenses. This is true for our data repository as well.
Calorimeter:
There are many avenues for sharing data and the results of a study; the patent process is one of them.
Cave Microbiology:
This is representative of a reference dataset
Demonstrates history of sharing their data
Includes the selected license for their data
Physical samples: not required to share, but they include this info
Field notes: requirement to share as public documents
SEM images: subject repository
Website for educational resources
Ultimately, the impetus for data management, preservation, sharing, and curation will need to come from someone other than funding agencies. Institutions and libraries are looking at sustainable ways to fulfill our commitments and make sure that we can be good data stewards. These systems are still being developed. Realistically, we can promise that your data will be available in 10 years; in 10 years, we hope to have solutions for many of these problems.
Here, you should relate the strategies you’ve outlined in previous sections to your long-term preservation strategy. This is an opportunity for you to discuss the long-term plan for keeping your data safe with us or with an external data repository in your discipline. If you are completely unsure how to approach this, feel free to contact the DSP for support. We can help you develop a feasible and appropriate preservation strategy that relies on existing services and infrastructure, whether at IU or elsewhere. A key component of your plan is the description of the cyberinfrastructure available to you and how you will use it to carry out your plan as a responsible data steward. Although your lab may be equipped to store and maintain the data for a project while it’s active, you may not have the capacity to make sure the data is preserved once the project is complete and your lab resources are dedicated to new endeavors. Neither IU nor NSF wants to see scientific data lost, and both are investing significant effort and resources in maintaining the scientific record.
These are activities that the Library specifically is invested in and equipped to do; our focus is on long-term preservation, curation, and access. What this means will likely vary by dataset, project, and lab; we’re happy to think this through with you to develop a plan that will meet your needs.
Identified an institutional repository
Selected files to be archived; could be more specific
Type and format of project information is vague; abstract may not be sufficient
Example of software archival/preservation
What are “industry-standard best practices”?
Continued existence of writer software is not required, but reader software will be preserved
Is SourceForge an institution that is likely to persist for 10, 25, 50 years?
There is a wealth of resources at IU to help you with your research. These are just a few of those relevant to data management and curation.