2. What is an NSF Data Management Plan?
How and why was it created?
Why are Libraries a part of data management?
(Short Break)
Creating and Implementing NSF Data Management Plans
Preserving Research Data after a project is completed
3. To:
Understand current NSF and government data
policy requirements.
Be aware of research support services within the
Libraries.
Locate and use various resources to develop data
management plans (DMPs) for NSF proposal(s).
Write a comprehensive DMP for NSF proposal(s).
Identify and plan for long-term preservation of
research data from funded projects.
4.
5. Storing Research Data “Forever”
Serge Goldstein
Associate CIO & Director of Academic Services
Princeton University
Fall 2010 Coalition for Networked Information Meeting
URL: http://www.youtube.com/watch?v=fQ-YEcV1k1A
6. Cyberinfrastructure: computing resources &
networks, services, & people
Data management: technical processing and preparation of
data for analysis
Data curation: selection of data for preservation and adding
value for current and future use
Data citation: mechanisms to enable easy reuse and
verification, track impact of data, and create structures to
recognize and reward researchers (DataCite)
Data sharing: must take into account ethical and legal issues;
a spectrum with many options
Source: Heather Coates and Kristi Palmer, Data management plans & planning: Meeting the NSF
Requirement, March 7, 2012
URL: http://www.slideshare.net/goldenphizzwizards/meeting-the-nsf-dmp-requirement-20120307-final
12. Open Access
Open Educational Tools
Open Standards
Open Science
Open Source
Dorothea Salo, Battle of the Opens, Book of Trogool, March 15, 2010
http://en.wikipedia.org/wiki/File:Benjamin_Franklin_-_Join_or_Die.jpg
13. Houghton, J.W. (2011). "The costs and potential benefits of alternative scholarly publishing models"
Information Research, 16(1) paper 469. [Available at http://InformationR.net/ir/16-1/paper469.html]
17. Saves time
Less reorganization for future projects
Increases efficiency
Compile and prioritize data collection(s)
Anticipate how your data will be used
Consider data preservation requirements and
plan for them
Be better aware of funding agency mandates
and data preservation culture in your field
28. 1. Types of Data
2. Data and Metadata Standards
3. Policies for Access and Sharing
Data Privacy and Protection
4. Data re-use and re-distribution
5. Data Archiving and Preservation
29. Expected data. The DMP should describe the types of data, samples, physical collections, software, curriculum materials, or other
materials to be produced in the course of the project. It should then describe the expected types of data to be retained.
The Federal government defines ‘data’ in OMB Circular A-110 as:
Research data is defined as the recorded factual material commonly accepted in the scientific community as necessary to
validate research findings, but not any of the following: preliminary analyses, drafts of scientific papers, plans for future
research, peer reviews, or communications with colleagues. This "recorded" material excludes physical objects
(e.g., laboratory samples). Research data also do not include:
(A) Trade secrets, commercial information, materials necessary to be held confidential by a researcher until they are
published, or similar information which is protected under law; and
(B) Personnel and medical information and similar information the disclosure of which would constitute a clearly
unwarranted invasion of personal privacy, such as information that could be used to identify a particular person in a
research study.
PIs should use the opportunity of the DMP to give thought to matters such as:
• The types of data that their project might generate and eventually share with others, and under what conditions
• How data are to be managed and maintained until they are shared with others
• Factors that might impinge on their ability to manage data, e.g. legal and ethical restrictions on access to non-
aggregated data
• The lowest level of aggregated data that PIs might share with others in the scientific community, given that community’s
norms on data
• The mechanism for sharing data and/or making them accessible to others
• Other types of information that should be maintained and shared regarding data,
Source: http://www.nsf.gov/sbe/SBE_DataMgmtPlanPolicy.pdf
30. “This research project will generate data resulting from sensor recordings (i.e. earth
pressures, accelerations, wall deformation and displacement and soil
settlement) during the centrifuge experiments. In addition to the
raw, uncorrected sensor data, converted and corrected data (in engineering
units), as well as several other forms of derived data will be produced. Metadata
that describes the experiments with their materials, loads, experimental
environment and parameters will be produced. The experiments will also be
recorded with still cameras and video cameras. Photos and videos will be part of
the data collection.”
“A total storage demand of 50 GB is anticipated at the University of Michigan, and
50 GB at Auburn University.”
“Based on the previous viscoelastic turbulent channel flow simulations, the amount
of resulting binary data is estimated around 40 TB per year. Some text format
data files are also required for post-processing in the laboratory and are
anticipated to be around 1 TB per year.”
“In one year, we will perform approximately 2 to 3 simulations. This means ~100 3D
plots, 30 restart files, 1000 EUV, X-ray and LASCO-like images, 10 satellite
files, 1000 2D plot files (total of about 150 GB of data per year).”
Source: http://deepblue.lib.umich.edu/handle/2027.42/86586
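The per-year volume estimates quoted above can be rolled up into a rough project-wide storage figure. The sketch below is illustrative only. The per-year numbers come from the channel flow example (about 40 TB of binary data and about 1 TB of text data per year). The 3-year duration is an assumption and does not appear in any of the quoted plans.

```python
def total_storage_tb(per_year_tb, years):
    """Sum the yearly volume of every data category over the whole project."""
    return sum(per_year_tb.values()) * years


# Per-year figures taken from the quoted channel flow plan.
per_year_tb = {"binary": 40, "text": 1}

# ASSUMPTION: a 3-year project; the quoted plan does not state a duration.
years = 3

print(total_storage_tb(per_year_tb, years))  # prints 123
```

Listing each data category separately makes it easier to revise the estimate when one category grows faster than expected.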
31. “The data, samples, and materials expected to be produced will consist of laboratory
notebooks, raw data files from experiments, experimental analysis data files,
simulation data, microscopy images, optical images, LabView acquisition programs,
and quantum dot superlattice nanowire thermoelectric samples.... each of these data
is described below:
A. Laboratory notebooks: The graduate student and PI will record by hand any
observations, procedures, and ideas generated during the course of the research.
B. Experimental raw data files: These files will consist of ASCII text that represents data
directly collected from the various electrical instruments used to measure the
thermoelectric properties of the superlattice nanowire thermoelectric devices.
C. Experimental analysis data files: These files will consist of spreadsheets and plots of
the raw data mentioned in Part A. The data in these files will have been manipulated
to yield meaningful and quantitative values for the device efficiency and ZT. The
analysis will be performed using best practice and acceptable methods for calculating
device efficiency and ZT.
D. Simulation data: These data will represent the results from commercially available
simulation and modeling software to model the quantum confinement.
E. Microscopy images: Images of the proposed silicon nanostructures will be generated
by scanning electron microscopy (SEM), transmission electron microscopy (TEM) at
high resolution to quantify wire diameter and roughness, and atomic force
microscopy (AFM).
Source: http://deepblue.lib.umich.edu/handle/2027.42/86586
32. Data formats and dissemination. The DMP should describe data
formats, media, and dissemination approaches that will be
used to make data and metadata available to others. Policies
for public access and sharing should be described, including
provisions for appropriate protection of
privacy, confidentiality, security, intellectual property, or other
rights or requirements. Research centers and major
partnerships with industry or other user communities must
also address how data are to be shared and managed with
partners, center members, and other major stakeholders.
Source: http://www.nsf.gov/sbe/SBE_DataMgmtPlanPolicy.pdf
33. Period of data retention. SBE is committed to
timely and rapid data distribution. However, it
recognizes that types of data can vary widely
and that acceptable norms also vary by
scientific discipline. It is strongly
committed, however, to the underlying
principle of timely access, and applicants
should address how this will be met in their
DMP statement.
Source: http://www.nsf.gov/sbe/SBE_DataMgmtPlanPolicy.pdf
34. “The Dublin Core will be used as the standard for metadata. The
metadata set mainly consists of fifteen elements, including
title, creator, subject, description, publisher, contributor, date, type,
format, identifier, source, language, relation, coverage, and rights.
These elements have been ratified as both national (i.e., ANSI/NISO
Standard Z39.85) and international standards (i.e., ISO Standard
15836). Further, they describe resources such as
text, video, audio, and data files. These standard formats will be used
in our study.”
“For each code made available, a user's manual will be provided with
instructions for compiling the source codes, installing and running the
codes, formulating input data streams, and visualizing the output.
Documentation will be in PDF format.”
Source: http://deepblue.lib.umich.edu/handle/2027.42/86586
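To make the fifteen Dublin Core elements concrete, here is a minimal sketch of one way to write a record as XML with Python's standard library. The wrapper element name `metadata`, the function name `dublin_core_xml`, and all field values are hypothetical and chosen only for illustration. The namespace URI is the standard one for the Dublin Core element set.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

# The fifteen elements named in the quoted plan.
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def dublin_core_xml(fields):
    """Serialize a dict of Dublin Core fields to an XML string."""
    unknown = set(fields) - set(DC_ELEMENTS)
    if unknown:
        raise ValueError(f"not Dublin Core elements: {sorted(unknown)}")
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("metadata")  # hypothetical wrapper element
    for name in DC_ELEMENTS:       # keep a stable element order
        if name in fields:
            child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
            child.text = fields[name]
    return ET.tostring(root, encoding="unicode")


# Hypothetical record, for illustration only.
record = {
    "title": "Example dataset",
    "creator": "Example Researcher",
    "format": "text/csv",
    "rights": "Example license statement",
}
print(dublin_core_xml(record))
```

Rejecting unknown field names early keeps records consistent across a project, which is the point of adopting a metadata standard.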
35. “Verilog, SPICE, and MATLAB files generated will be processed and submitted to FTP servers
as .mat files with TXT documentation. The data will be distributed in several widely used
formats, including ASCII, tab-delimited (for use with Excel), and MAT format. Instructional
material and relevant technical reports will be provided as PDF. Digital video data files
generated will be processed and submitted to the FTP servers in MPEG-4 (.mp4) and .avi
formats. Variables will use a standardized naming convention consisting of a
prefix, root, suffix system.”
“Plasma image data will be RGB colored JPG or TIFF format with resolution determined by the
camera. Video data will be RGB colored AVI format.”
“Images from the scanning electron microscopes (SEMs) and focused ion beam workstations
(FIBs) are saved in tagged image file format (TIFF), which is readily readable by a wide
variety of imaging and processing applications.”
Source: http://deepblue.lib.umich.edu/handle/2027.42/86586
36. III. Policies for access and sharing and provisions for appropriate protection/privacy
As detailed in the project description, the CARE platform is intended to be a research cloud
service that provides analytical middleware for use in analyzing health data. During the
project, access will be limited to project team members and invited expert stakeholders through a
password protected website. Commencing with Task 5 (month 26), means for access by the
broader research community will be implemented. At that time, the project team will determine
whether there is a need for initiating access charges, which may be appropriate for securing the
longer-term sustainability of the CARE platform and analysis tools.
All of the data that will be utilized are publicly available data sets that have been de-identified by
public agencies and have passed their standards for privacy protection and assurance so that no
individually identifiable data is provided. The datasets to be utilized within this project and other
intellectual property have been released without restriction.
Over the course of the study, the project team will meet with both the Community Health
Institute and the SafeRoadMaps/CERS team to arrive at a data-sharing agreement for
postproject utilization of their data. Such an agreement will provide a model for not only this
partnership, but for licensing the CARE Platform analytics for use by other health data sets.
Source: http://rci.ucsd.edu/_files/DMP%20Example%20Chaitan%20Baru%20SDSC.pdf
37. “After uploading the data into the NEES Project Warehouse
and allowing public access, all data will be available for re-
use and re-distribution with proper acknowledgement of
their originators.”
“Researchers and practitioners in diverse fields will be able to
readily reuse and redistribute shared data. Terms of use will
include the prohibition of commercial use of the
work – modifications of the work will be allowed with the
proper citations.”
“The simulation code will be developed in C and provided to the
public in source code format for non-commercial use under
GNU General Public License (GPL).”
Source: http://deepblue.lib.umich.edu/handle/2027.42/86586
38. “Before data is stored, it will be stripped of all institutional and individual
identifiers to ensure confidentiality by staff of the Center following
procedures developed by the researchers.”
“Audio files of interviews will be stored on a password protected secure
server during the study and for two years after, and destroyed
subsequently.”
“Exceptions to shared data include proprietary DTE GIS utility
information (for security reasons) and software code of commercial
interest to the project's GOALI partners or identified licensees. Both
exceptions are permitted by the ENG DMP policy.... The research
team will however develop a set of 3D GIS datasets for distribution
to the public. These datasets will represent non-existent buried
infrastructure and will only be useful for the evaluation of the other
research products.”
Source: http://deepblue.lib.umich.edu/handle/2027.42/86586
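The first quoted plan above describes stripping identifiers before data is stored. A minimal sketch of that step follows, assuming the data is held as a list of dictionaries. The function name and the field names (`name`, `institution`, `response`) are hypothetical. Removing direct identifiers alone may not be enough to protect confidentiality, so treat this as one step in a procedure rather than a complete one.

```python
def strip_identifiers(records, identifier_fields):
    """Return copies of the records without the named identifier fields."""
    drop = set(identifier_fields)
    return [
        {key: value for key, value in record.items() if key not in drop}
        for record in records
    ]


# Hypothetical interview records, for illustration only.
raw = [
    {"name": "A. Person", "institution": "Example U", "response": "yes"},
    {"name": "B. Person", "institution": "Sample College", "response": "no"},
]

clean = strip_identifiers(raw, ["name", "institution"])
print(clean)  # the originals in `raw` are left unchanged
```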
39. IV. Policies and provisions for re-use, re-distribution
As noted in the project description, policies for provision
and re-use will be developed as part of the research
project. It is anticipated that there will be
considerable interest in the platform and tools within
the research and practice community, including
academic researchers, health research agencies, and
cloud service providers, among others. The need for
such a tool was identified during a recent NSF
sponsored symposium on Health
Cyberinfrastructure, which was conducted by the PIs.
Source: http://rci.ucsd.edu/_files/DMP%20Example%20Chaitan%20Baru%20SDSC.pdf
40. Data storage and preservation of access. The
DMP should describe physical and cyber
resources and facilities that will be used for the
effective preservation and storage of research
data. These can include third party facilities
and repositories.
Source: http://www.nsf.gov/sbe/SBE_DataMgmtPlanPolicy.pdf
41. V. Plans for archiving and Preservation of access
The project website and service will contain all appropriate information
and documentation for using the CARE platform and tool for health
research discovery and analysis. The site will also contain all
references, research papers, and related products developed throughout
the course of the project.
The San Diego Supercomputer Facility at UC San Diego will host the data
throughout the research project and provide a minimum of three years of
online access beyond the completion of the project. Data storage will be
performed at the nominal rates charged by SDSC to any project using
the facility. These are relatively modest (~$1000/TB) and can be borne
ahead of time for the 3-year period. Should the CARE platform not
extend beyond the three years (post grant), the data could then be
archived at SDSC at even lower cost. A decision would have to be made
at that point in time regarding how exactly to archive the data, and on
paying for the archival storage.
Source: http://rci.ucsd.edu/_files/DMP%20Example%20Chaitan%20Baru%20SDSC.pdf
42. “For archiving, the data along with any related publications
will be deposited in Libra, the UVA archival system, with
an appropriate licensing statement. DOIs will be attached
to all data stored from this project. Since the current
preservation plan for Libra is indefinite data
storage, preservation of access is assured.”
“Materials to be publicly shared will be stored with the Deep
Blue repository, a service of the UM Libraries that provides
deposit access and preservation services. Deposited items
will be assigned a persistent URL that will be registered
with the Handle System for assigning, managing, and
resolving persistent identifiers (‘handles’) for digital
objects and other Internet resources.”
Source: http://deepblue.lib.umich.edu/handle/2027.42/86586
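As an illustration of how a handle works as a persistent identifier, the sketch below pulls the handle out of a repository URL and builds the matching address on the public Handle proxy, hdl.handle.net. It assumes the repository URL contains a `/handle/` path segment, as the Deep Blue source URLs on these slides do. Both function names are hypothetical.

```python
from urllib.parse import urlparse


def handle_from_url(url):
    """Extract 'prefix/suffix' from a repository URL containing '/handle/'."""
    path = urlparse(url).path
    marker = "/handle/"
    if marker not in path:
        raise ValueError(f"no handle found in {url!r}")
    return path.split(marker, 1)[1].strip("/")


def resolver_url(handle):
    """Build the address of the handle on the public Handle proxy."""
    return f"https://hdl.handle.net/{handle}"


handle = handle_from_url("http://deepblue.lib.umich.edu/handle/2027.42/86586")
print(handle)                # prints 2027.42/86586
print(resolver_url(handle))  # prints https://hdl.handle.net/2027.42/86586
```

Citing the resolver address rather than the repository's own URL means the citation can keep working if the repository later moves its web pages.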
43.
44. What are your goals?
Who needs access and when?
When, if ever, can data be shared or distributed?
Prepare for future funder mandates
Plan beyond individual PI/grant projects
45. • Campus Copyright policy
• Collaborator institution copyright and
ownership policies, informal agreements
• Patent and provenance issues
• International copyright considerations
• Post-project data retention requirements
• Post-employment data agreements
URL: http://www2.binghamton.edu/academics/provost/faculty-staff-handbook/handbook-xii.html
46. Survey sample:
308 campus researchers with
externally sponsored projects or
submitted proposals (2009-
2011); 91 survey respondents
Source: Binghamton University Research Faculty Survey, June 2011, Jim Wolf, Director of Academic Computing (ret.)
48. [Charts: "Data Accessibility" and "Data Preservation Timeframe"]
Data Accessibility legend: access granted to individuals; openly available to all; proprietary; private
Data Preservation Timeframe legend: forever; 3-7 yrs; <3 yrs
Axis categories: Local research group server; ITS storage; Library archive; Disciplinary repository (e.g., ICPSR)
Source: Research Faculty Survey, Jim Wolf, Director of Academic Computing (ret.), June 2011
49. Create consistent, standardized metadata
Perform regular file fixity and format checks
Identify, update and migrate file formats
Mitigate and eliminate file degradation
Provide storage space, controlled access and
an “exit strategy”
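A fixity check, as listed above, can be sketched with checksums: record a checksum for every file at deposit time, then recompute and compare on a schedule. The example below is a minimal sketch using SHA-256 from Python's standard library. The function names and the in-memory manifest layout are assumptions made for illustration and are not taken from any particular repository system.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file under root (by relative path) to its checksum."""
    root = Path(root)
    return {
        str(path.relative_to(root)): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def check_fixity(root, manifest):
    """Return the files whose checksum changed or that are now missing."""
    current = build_manifest(root)
    return sorted(
        name for name, checksum in manifest.items()
        if current.get(name) != checksum
    )
```

In use, the manifest from `build_manifest` would be saved alongside the deposit, and `check_fixity` would be run periodically; an empty list means every recorded file still matches its original checksum.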
51. Media Deterioration and Format Obsolescence Demonstrate that
“Backups” are Inadequate for Long-Term Preservation
Sources: http://oldcomputers.net/macintosh.html; http://www.classiccmp.org/dunfield/pc/index.htm
52. Build content from one project to the next
Create a set of policies based on current best
practices and funder requirements
Refine data collection, access, use, distribution, and
preservation policies over time
Open Access and other open movements have advocates who feel strongly about these causes. Like the patriots of the American Revolution, they feel that publishing is broken and that only radical action can change the system.
Publishers are offering the same content in more ways. This recent UK study shows that the more publishing and viewing options are available, the less expensive it becomes. This is the opposite of our expenditures, where items are becoming more expensive in newer formats.
Scholars, though, also care about the impact of their work, not just how it is published. The Open movement has created more metrics and groups like altmetrics.org that advocate for weblinks, bookmarks and online conversations on research to help measure impact for tenure and promotion decisions.
The open access conversation is focused on the dissemination of research products like peer-reviewed articles and books at the end of the research life cycle, whereas data management planning is most effective when it’s initiated before data collection begins and implemented throughout the research life cycle.