NISO Virtual Conference: Dealing with the Data
Deluge: Successful Techniques for Scientific
Data Management
April 23, 2014
Speakers:
Jan Brase, Jared Lyle, Mercè Crosas,
Michael Witt, Christine Borgman, Adriane Chapman,
David Wilcox, Judy Ruttenberg
http://www.niso.org/news/events/2014/virtual/data_deluge/
Agenda
11:00 a.m. – 11:10 a.m. – Introduction
Todd Carpenter, Executive Director, NISO
11:10 a.m. - 12:00 p.m. Keynote Speaker: DataCite – A Global Approach for Better Data Sharing
Jan Brase, Ph.D., German National Library of Science and Technology
12:00 p.m. - 12:30 p.m. Guidelines and Resources for Office of Science and Technology Policy (OSTP) Data Access Plans
Jared Lyle, Director of Data Curation Services, Interuniversity Consortium for Political and Social Research (ICPSR), University of
Michigan
12:30 p.m. - 1:00 p.m. Joint Declaration of Data Citation Principles: Implementation and Compliance in the Dataverse Repository
Mercè Crosas, Ph.D., Director of Data Science, Institute for Quantitative Social Science (IQSS), Harvard University
1:00 p.m. - 1:45 p.m. Lunch Break
1:45 p.m. - 2:15 p.m. Purdue University Research Repository (PURR): A Commitment to Supporting Researchers
Michael Witt, Head, Distributed Data Curation Center (D2C2); Associate Professor of Library Science, Purdue University Research
Repository (PURR)
2:15 p.m. - 2:45 p.m. The Roles of Data Citation in Data Management
Christine L. Borgman, Professor & Presidential Chair in Information Studies, UCLA
2:45 p.m. - 3:15 p.m. Is This Data Fit for My Use? The Challenges and Opportunities Data Provenance Presents
Adriane Chapman, MITRE
3:15 p.m. - 3:30 p.m. Afternoon Break
3:30 p.m. - 4:00 p.m. A Durable Space: Technologies for Accessing Our Collective Digital Heritage
David Wilcox, Product Manager, DuraSpace
4:00 p.m. - 4:30 p.m. The SHared Access Research Ecosystem (SHARE) Project: A Joint Initiative of ARL, AAU, and APLU
Judy Ruttenberg, Program Director for Transforming Research Libraries, Association of Research Libraries (ARL)
4:30 p.m. - 5:00 p.m. Conference Roundtable
Moderated by Todd Carpenter, Executive Director, NISO
DataCite –
A global approach for better data
sharing
Jan Brase
DataCite
NISO virtual conference
April 23rd 2014
Thousand years ago:
science was empirical
describing natural phenomena
Last few hundred years:
theoretical branch
using models, generalizations
Last few decades:
a computational branch
simulating complex phenomena
Today:
data exploration (eScience)
unify theory, experiment, and
simulation
Jim Gray, eScience Group, Microsoft Research
Science Paradigms
Scientific Information is more than a journal article or a
book
Libraries should open their catalogues to any kind of
information
The catalogue of the future is NOT ONLY a window to the
library's holdings, but
a portal in a net of trusted providers of scientific content
Consequences for Libraries
We do not have it
BUT
We know where you can find it
And here is the link to it!
Simulation
Scientific Films
3D Objects
Grey Literature
Research Data
Software
Including non-classical publications
Why is this a role for libraries?
• Libraries have a history in bringing
scientific information to the public
• Libraries have a tendency to be persistent
• A project will be forgotten in 40 years, the
library will very likely still exist then
• Libraries are very trustworthy organisations
DataCite
High visibility of the content
Easy re-use and verification.
Scientific reputation for the collection and documentation of
content (Citation Index)
Encouraging the Brussels declaration on STM publishing
Avoiding duplications
Motivation for new research
What if any kind of scientific content would
be citable?
How to achieve this?
Science is global
• it needs global standards
• Global workflows
• Cooperation of global players
Science is carried out locally
• By local scientists
• Being part of local infrastructures
• Having local funders
Global consortium carried by local institutions
focused on improving the scholarly infrastructure
around datasets and other non-textual
information
focused on working with data centres and
organisations that hold content
Providing standards, workflows and best-practice
Initially, but not exclusively, based on the DOI system
Founded December 1st 2009 in London
DataCite
[Diagram: International DOI Foundation; DataCite with its Managing Agent (TIB); Member Institutions, each serving several Data Centres; Associate Stakeholders that DataCite works with. Diagram labels: Member, Works with.]
DataCite structure
1. Technische Informationsbibliothek (TIB)
2. Canada Institute for Scientific and Technical Information (CISTI)
3. California Digital Library, USA
4. Purdue University, USA
5. Office of Scientific and Technical Information (OSTI), USA
6. Library of TU Delft, The Netherlands
7. Technical Information Center of Denmark
8. The British Library
9. ZB Med, Germany
10. ZBW, Germany
11. Gesis, Germany
12. Library of ETH Zürich
13. L’Institut de l’Information Scientifique et Technique (INIST), France
14. Swedish National Data Service (SND)
15. Australian National Data Service (ANDS)
16. Conferenza dei Rettori delle Università Italiane (CRUI)
17. National Research Council of Thailand (NRCT)
18. The Hungarian Academy of Sciences
19. University of Tartu, Estonia
20. Japan Link Center (JaLC)
21. South African Environmental Observation Network (SAEON)
22. European Organisation for Nuclear Research (CERN)
DataCite members
Affiliated members:
1. Digital Curation Center (UK)
2. Microsoft Research
3. Interuniversity Consortium for Political and Social Research (ICPSR)
4. Korea Institute of Science and Technology Information (KISTI)
5. Beijing Genomics Institute (BGI)
6. IEEE
7. Harvard University Library
8. World Data System (WDS)
9. GWDG
[Figure: sediment core profiles for cores PS1389-3, PS1390-3, PS1431-1, PS1640-1 and PS1648-1, plotted against Age (kyr), max. 233.55 kyr. Panels per core: IRD (grav/10 cm³), Sand (%), CaCO3 (%), TOC (%), Radio (%/sand), Smect (%/clay).]
[Figure: map covering latitude 54°0' to 55°30' and longitude 11° to 15°, scale 1:2695194 at Latitude 0°. Legend: World vector shore line; Grain size class KOLP A, KOLP B, KOLP DIN, KOEHN, KOEHN2; Geochemistry. Source: Baltic Sea Research Institute, Warnemünde.]
Earthquake events => doi:10.1594/GFZ.GEOFON.gfz2009kciu
Climate models => doi:10.1594/WDCC/dphase_mpeps
Sea bed photos => doi:10.1594/PANGAEA.757741
Distributed samples => doi:10.1594/PANGAEA.51749
Medical case studies => doi:10.1594/eaacinet2007/CR/5-270407
Computational model => doi:10.4225/02/4E9F69C011BC8
Audio record => doi:10.1594/PANGAEA.339110
Grey Literature => doi:10.2314/GBV:489185967
Videos => doi:10.3207/2959859860
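Each DOI name in the list above can be turned into a clickable link by prefixing it with the DOI proxy address that the slides use later (http://dx.doi.org). A minimal sketch of that step; the function name doi_to_url is hypothetical, and the sketch assumes the DOI name needs no percent-encoding:

```python
DOI_PROXY = "http://dx.doi.org/"  # proxy address quoted in the slides


def doi_to_url(doi_name: str) -> str:
    """Turn 'doi:10.1594/PANGAEA.757741' into a proxy URL."""
    name = doi_name.strip()
    # Drop the optional 'doi:' scheme prefix used on the slide.
    if name.lower().startswith("doi:"):
        name = name[len("doi:"):]
    # Every DOI name starts with the directory indicator '10.' and has a suffix.
    if not name.startswith("10.") or "/" not in name:
        raise ValueError(f"not a DOI name: {doi_name!r}")
    return DOI_PROXY + name


print(doi_to_url("doi:10.1594/PANGAEA.757741"))
```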
What type of data are we talking about?
Anything that is the foundation
of further research
is research data
Data is evidence
Over 3,200,000 DOI names registered so far.
290 data centers.
10,000,000 resolutions in 2013.
DataCite Metadata schema published (in cooperation with
all members) http://schema.datacite.org
DataCite MetadataStore
http://search.datacite.org
DataCite in 2014
DataCite search
Searchterm: *
Searchterm: uploaded:[NOW-7DAY TO NOW]
Searchterm: relatedIdentifier:*
Searchterm: relatedIdentifier:issupplementto:10.1029*
Searchterm: relatedIdentifier:*:10.1055*
OAI and Statistics
OAI Harvester
http://oai.datacite.org
DataCite statistics (resolution and registration)
http://stats.datacite.org
DataCite Content Service
Service for displaying DataCite metadata
Different formats (BibTeX, RIS, RDF, etc.)
Content Negotiation (through MIME type)
• Access through DOI proxy (http://dx.doi.org)
• First implemented by CNRI and CrossRef:
Documentation:
http://www.crosscite.org/cn/
Content negotiation
Optimized for m2m communication using the accept
header of the http protocol
curl -L -H "Accept: MIME_TYPE" http://dx.doi.org/DOI
Try out a shortcut in any web browser:
http://data.datacite.org/MIME_TYPE/DOI
http://data.crossref.org/DOI
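The curl call and the browser shortcut above can be sketched in Python using only the standard library. The function names are hypothetical, the request is built but not sent, and whether the 2014 endpoints still answer is not checked here:

```python
from urllib.request import Request

DOI_PROXY = "http://dx.doi.org/"             # proxy named on the slide
SHORTCUT_BASE = "http://data.datacite.org/"  # shortcut host named on the slide


def negotiation_request(doi: str, mime_type: str) -> Request:
    """Build (without sending) the equivalent of
    curl -L -H "Accept: MIME_TYPE" http://dx.doi.org/DOI"""
    return Request(DOI_PROXY + doi, headers={"Accept": mime_type})


def shortcut_url(doi: str, mime_type: str) -> str:
    """Build the browser shortcut http://data.datacite.org/MIME_TYPE/DOI."""
    return f"{SHORTCUT_BASE}{mime_type}/{doi}"


req = negotiation_request("10.5524/100005", "application/rdf+xml")
print(req.full_url, req.get_header("Accept"))
print(shortcut_url("10.5524/100005", "application/rdf+xml"))
```

Passing the request to urllib.request.urlopen would send it and follow redirects, much as the -L flag does for curl.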
Resolving to the citation
http://data.datacite.org/application/x-datacite+text/10.5524/100005
Li, J; Zhang, G; Lambert, D; Wang, J (2011): Genomic data
from Emperor penguin. GigaScience.
http://dx.doi.org/10.5524/100005
Resolving to the RDF metadata
http://data.datacite.org/application/rdf+xml/10.5524/100005
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:j.0="http://purl.org/dc/terms/">
  <rdf:Description rdf:about="http://dx.doi.org/10.5524/100005">
    <j.0:identifier>10.5524/100005</j.0:identifier>
    <j.0:creator>Li, J</j.0:creator>
    <j.0:creator>Zhang, G</j.0:creator>
    <j.0:creator>Wang, J</j.0:creator>
    <owl:sameAs>doi:10.5524/100005</owl:sameAs>
    <owl:sameAs>info:doi/10.5524/100005</owl:sameAs>
    <j.0:publisher>GigaScience</j.0:publisher>
    <j.0:creator>Lambert, D</j.0:creator>
    <j.0:date>2011</j.0:date>
    <j.0:title>Genomic data from the Emperor penguin (Aptenodytes forsteri)</j.0:title>
  </rdf:Description>
</rdf:RDF>
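Because the RDF metadata is plain XML, a machine client can read it with a standard parser. The following is a rough sketch, not DataCite tooling: it embeds a shortened copy of the record above and pulls out the creators and title with Python's xml.etree.ElementTree. The prefix name dcterms is a local choice that maps to the same namespace as j.0.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the record shown on the slide.
RDF_XML = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:j.0="http://purl.org/dc/terms/">
  <rdf:Description rdf:about="http://dx.doi.org/10.5524/100005">
    <j.0:identifier>10.5524/100005</j.0:identifier>
    <j.0:creator>Li, J</j.0:creator>
    <j.0:creator>Zhang, G</j.0:creator>
    <j.0:creator>Wang, J</j.0:creator>
    <j.0:publisher>GigaScience</j.0:publisher>
    <j.0:creator>Lambert, D</j.0:creator>
    <j.0:date>2011</j.0:date>
    <j.0:title>Genomic data from the Emperor penguin (Aptenodytes forsteri)</j.0:title>
  </rdf:Description>
</rdf:RDF>"""

# Prefixes are local names; only the namespace URIs have to match the record.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dcterms": "http://purl.org/dc/terms/",
}

root = ET.fromstring(RDF_XML)
description = root.find("rdf:Description", NS)

subject = description.get(f"{{{NS['rdf']}}}about")
creators = [c.text for c in description.findall("dcterms:creator", NS)]
title = description.findtext("dcterms:title", namespaces=NS)

print(subject)
print(creators)
print(title)
```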
Example of use
This allows persistent identification of RDF statements!
Implemented for all of the more than 65 million CrossRef and
DataCite DOI names
Example of use:
DOI Citation Formatter
http://www.crosscite.org/citeproc/
2012: STM, CrossRef and DataCite Joint
Statement
1. To improve the availability and findability of research data,
the signers encourage authors of research papers to deposit
researcher validated data in trustworthy and reliable
Data Archives.
2. The Signers encourage Data Archives to enable bi-directional
linking between datasets and publications by
using established and community endorsed unique
persistent identifiers such as database accession codes and
DOI's.
3. The Signers encourage publishers and data archives to make
visible or increase visibility of these links from publications
to datasets and vice versa
Example
The dataset:
Storz, D et al. (2009):
Planktic foraminiferal flux and faunal composition of sediment trap
L1_K276 in the northeastern Atlantic.
http://dx.doi.org/10.1594/PANGAEA.724325
Is supplement to the article:
Storz, David; Schulz, Hartmut; Waniek, Joanna J; Schulz-Bull, Detlef;
Kucera, Michal (2009): Seasonal and interannual variability of the
planktic foraminiferal flux in the vicinity of the Azores Current.
Deep-Sea Research Part I-Oceanographic Research Papers, 56(1),
107-124,
http://dx.doi.org/10.1016/j.dsr.2008.08.009
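The dataset-to-article link above is what the relatedIdentifier search terms shown earlier query. A small sketch of how such a link, and the reverse link needed for bi-directional linking, might be held in a program. The dictionary layout and the function inverse_link are illustrative assumptions; only the relation names IsSupplementTo and IsSupplementedBy are taken from the DataCite metadata schema.

```python
# Relation types and their inverses (names from the DataCite metadata schema).
INVERSE_RELATION = {
    "IsSupplementTo": "IsSupplementedBy",
    "IsSupplementedBy": "IsSupplementTo",
}

# The example from the slide: the PANGAEA dataset supplements the article.
dataset_link = {
    "identifier": "10.1594/PANGAEA.724325",
    "relationType": "IsSupplementTo",
    "relatedIdentifier": "10.1016/j.dsr.2008.08.009",
    "relatedIdentifierType": "DOI",
}


def inverse_link(link: dict) -> dict:
    """Return the same relationship seen from the other object's side."""
    return {
        "identifier": link["relatedIdentifier"],
        "relationType": INVERSE_RELATION[link["relationType"]],
        "relatedIdentifier": link["identifier"],
        "relatedIdentifierType": link["relatedIdentifierType"],
    }


article_link = inverse_link(dataset_link)
print(article_link["identifier"], article_link["relationType"],
      article_link["relatedIdentifier"])
```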
Next steps
ODIN project with ORCID.
http://datacite.labs.orcid-eu.org/
MoU with Thomson Reuters to cooperate on
the Data Citation Index
DataCite plugin for next DSpace release
(early 2014)
Cooperation
MoU with ORCID
Agreement with Re3Data and DataBib to
include their service in 2016
MoU with RDA to become organisational
affiliate
2014 Annual conference
Let us get back to libraries
The wave
Growth of
Information –
Diversity of media types and
formats
User requirements – e. g. :
Science 2.0, collaborative
networks, social media
A threat?
Information overload is only a problem for manual curation.
Google is not complaining about the data deluge; they're
constantly trying to get more data.
The more data you throw, the better the filter gets.
Developing and maintaining these tools is a classic
task for libraries!
Don’t turn off the taps, build boats.
It is not only a challenge …
… it is an opportunity
We all should ride the wave …
Thank you!
Guidelines and
Resources for OSTP
Data Access Plans
NISO Webinar
April 2014
www.icpsr.umich.edu/datamanagement
The OSTP Memo
Guidelines for Response
• Released February 2013, this memo directs funding
agencies with an annual R&D budget over $100 million to
develop a public access plan for disseminating the results
of their research
• ICPSR stresses that standards and guidelines for many of
the requirements currently exist
• The slides to follow provide an overview of the access
plan elements including guidelines and resources on how
to respond to meet digital data requirements in the
memo
The OSTP Memo – A Review
• Released February 22, 2013
• A concern for investment: “Policies that mobilize these
publications and data for re-use through preservation and
broader public access also maximize the impact and
accountability of the Federal research investment.”
• Federal agencies with over $100 M annually in R&D
expenditures to develop plans to support increased public
access to the results of research funded by the Federal
Government
• Plans to contain eight points
The Eight Points of the Plan
1. Strategy for leveraging existing archives
2. Strategy to improve the public’s ability to locate and access digital
data
3. Approach to optimize search, archival, and dissemination features
that encourage innovation in accessibility & interoperability and
ensure long-term stewardship
4. A plan to notify awardees & researchers of their obligations
5. Strategy for measuring and enforcing compliance with the plan
6. Identification of resources within the existing agency budget to
implement plan
7. Timeline for implementation
8. Identification of special circumstances that prevent the agency from
meeting memo objectives
Data Portion of Memo - 13 Elements
• The portion of the memo describing objectives
for public access to data stresses 13 elements
for a public access plan
• The elements are also summarized online
within ICPSR’s Web site:
http://icpsr.umich.edu/content/datamanagement/ostp.html
http://sites.nationalacademies.org/DBASSE/CurrentProjects/DBASSE_0
http://www.icpsr.umich.edu/files/ICPSR/ICPSRComment
RDAP 2014 Panel: Funding agency
(NOAA, NSF, NIH) responses to federal
requirements for public access to
research results
Wendy Kozlowski (Cornell), Moderator
http://www.slideshare.net/asist_org/rdap14-ostp-panel-introduction
http://www.slideshare.net/asist_org/rdap-3-2714thakur
Visit ICPSR Archives/Repositories already Meeting
Public Access Requirements
ICPSR – a 50-Year History of Providing Access to
Research Data
Established in 1962, ICPSR maintains and shares
over 8,600 research datasets and hosts 16 public-
access specialized collections of data funded by
various government agencies and foundations. Our
mission:
ICPSR advances and expands social and behavioral
research, acting as a global leader in data
stewardship and providing rich data resources and
responsive educational opportunities for present
and future generations.
ICPSR’s Data Management & Curation Goals
• Quality - Data at ICPSR are
enhanced with meaningful
information to make them complete,
self-explanatory, and usable for
future researchers
• Access – Sought by over 730
member institutions and indexed by
all the major search engines, ICPSR
data are easily discoverable and
widely accessible to the public.
• Citation - By providing
standardized and well-recognized
data citations, ICPSR ensures that
data producers receive credit for
their archived data
• Preservation – For over 50
years, ICPSR has preserved its data
resources for the long-term,
guarding against deterioration,
accidental loss, and digital
obsolescence
• Confidentiality - Stringent
protections are in place for securing
and distributing sensitive data
• Educational Support –
ICPSR has a long tradition of
supporting training in quantitative
methods, scientific data
management, and resources for
instruction
ICPSR’s Data Management & Curation Site
http://www.icpsr.umich.edu/datamanagement/
http://icpsr.umich.edu/datamanagement/ostp.html
ICPSR’s Guidelines for OSTP Data
Access Plan Page
Maximize Access
"Maximize access, by the general public and without charge, to digitally
formatted scientific data created with Federal funds"
• Increasing access to research data prevents the duplication of effort,
provides accountability and verification of research results, and
increases opportunities for innovation and collaboration.
• Finding and accessing data in repositories requires descriptive metadata
("data about data") in standard, machine-actionable form. Metadata
help search engines find data, and help researchers understand the
context of data collections.
• Standards already exist: see Data Documentation Initiative
– http://www.ddialliance.org/
Maximize Access cont.
• Access also involves knowing how to interpret the data. Incomplete data
limit reuse. Obsolete data formats can be unreadable.
– Repositories 'curate' or enhance data to make it complete, self-explanatory,
and usable for future researchers. This includes adding descriptive labels,
correcting coding errors, gathering documentation, and standardizing the
final versions of files. This is called “data curation.”
– Like museums that curate art or artifacts for study and understanding now
and in the future, data archives curate data with the same goals.
• Data curation is crucial to maximizing access. Resources for curating
data:
– ICPSR's Guide to Social Science Data Preparation and Archiving
– UK Data Archive's Managing and Sharing Data guide.
Protect Confidentiality and Privacy
• It is critically important to protect the
identities of research subjects.
• Disclosure risk is a term that is often used
for the possibility that a data record from a
study could be linked to a specific person.
• Concerns about disclosure risk have grown
as more datasets have become available
online, and it has become easier to link
research datasets with publicly available
external databases.
Protect Confidentiality and Privacy cont.
Protecting confidentiality of research subjects is not a viable
argument for not sharing data. Infrastructure, including virtual
and physical data enclaves, already exists:
• Restricted-Use Data are made available for research
purposes for use by investigators who agree to stringent
conditions for the use of the data and its physical
safekeeping.
• Enclave Data are those datasets which present especially
acute disclosure risks. They can be accessed only on-site in
ICPSR's physical data enclave in Ann Arbor. Investigators
must be approved. Their notes and analytic output are
reviewed by ICPSR staff.
Balance Demands of Long-term
Preservation and Access
• Preserving digital data requires much more than
storing files on a server, desktop, or in the cloud!
• Digital preservation is the active and ongoing
management of digital content to lengthen the
lifespan and mitigate against loss, including physical
deterioration, format obsolescence, and hardware and
software failure.
Balance Demands of Long-term
Preservation and Access cont.
• Not all data are worth preserving
indefinitely; less valuable or easily
producible data may be preserved for
shorter periods.
• Establish selection and appraisal guidelines
that make it clear what to save or discard.
– Selection criteria consider factors like
availability, confidentiality, copyright, quality,
file format, and financial commitment.
Use of Data Management Plans
• Data management plans describe how researchers
will provide for long-term preservation of, and
access to, scientific data in digital formats.
• Data management plans provide opportunities for
researchers to manage and curate their data more
actively from project inception to completion.
• See ICPSR's resource: Guidelines for Effective Data
Management Plans
Include Cost of Data Management in Funding
Proposals
• Data management services carry real costs, ranging from
personnel to storage to software.
• Just as maintenance costs are routinely built into physical
infrastructure development, data management costs should be
built into data development.
• Long-term access to data requires durable institutions that plan
on a scale of decades and even generations.
• Cost resources:
– DataONE's Provide budget information for your data
management plan
– UK Data Archive's Costing Tool: Data Management Planning.
Evaluate Data Management Plans &
Ensure Compliance
• Plans help researchers prepare for working with
and preserving data, help repositories get ready to
accession and provide access, and help agencies
understand the community's needs for archiving
and access. Evaluation helps refine plans so they
are realistic and attainable.
• If data management plans are to be a standard
component of funding applications, funding
recipients should be held accountable for
deviations from the originally stated plans.
Promote Public Deposit of Data
• Public deposit of data helps to ensure the long-term
accessibility and preservation of the data.
• It removes the burden of ongoing maintenance and care (and
user support) from the researcher and provides a stable system
to which data can be entrusted.
• Many sustainable online repositories are already available to
host and archive research data. These may include discipline-
specific repositories, archives administered by funding
agencies, or institutional repositories.
• Databib, a searchable directory of over 500 research data
repositories, can help locate relevant repositories by subject
area.
Preserve Intellectual Property Rights
and Commercial Interests
Original research may be both
commercially valuable and proprietary.
There are several approaches to
managing these interests, including:
– Tailor copyright and patent licenses, such
as through Creative Commons licenses
– Establish an embargo period or delayed
dissemination on distribution.
Private-sector Cooperation to Improve
Access
Encourage cooperation with the private sector
to improve data access and compatibility.
Issues to consider:
• What funding structures will be in place to ensure that both
organizations involved are benefiting from the partnership?
• Will the partnership require any rights to be transferred to the
private organization?
• How does private-sector cooperation affect
access restrictions and intellectual property
concerns?
Mechanisms for Identification &
Attribution of Data
• Properly citing data encourages the replication of
scientific results, improves research standards, guarantees
persistent reference, and gives proper credit to data
producers.
• Citing data is straightforward. Each citation must include
the basic elements that allow a unique dataset to be
identified over time: title, author, date, version, and
persistent identifier.
• Resources: ICPSR's Data Citations page, IASSIST's Quick
Guide to Data Citation, DataCite.
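The five basic elements named above (title, author, date, version, persistent identifier) lend themselves to a simple completeness check before a citation is published. A minimal sketch, assuming a hypothetical DataCitation class; the sample values are invented for illustration and do not describe a real dataset:

```python
from __future__ import annotations

from dataclasses import dataclass, fields


@dataclass
class DataCitation:
    """The five basic citation elements listed on the slide."""
    author: str
    date: str
    title: str
    version: str
    persistent_id: str

    def missing_elements(self) -> list[str]:
        """Names of elements that are empty or only whitespace."""
        return [f.name for f in fields(self)
                if not getattr(self, f.name).strip()]


# Invented example values, for illustration only.
complete = DataCitation(
    author="Doe, Jane",
    date="2014",
    title="Example survey data",
    version="V1",
    persistent_id="doi:10.1234/example",
)
incomplete = DataCitation(
    author="Doe, Jane", date="2014", title="Example survey data",
    version="", persistent_id="",
)

print(complete.missing_elements())
print(incomplete.missing_elements())
```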
Data Stewardship Workforce Development
In coordination with other agencies and the private
sector, support training, education, and workforce
development related to scientific data
management, analysis, storage, preservation, and
stewardship. Recent data stewardship workforce
development in the United States has included:
• Digital Preservation Outreach and Education, from the Library of
Congress
• Digital Preservation Management tutorial, from Cornell University,
ICPSR, and MIT
• DigCCurr, from the University of North Carolina
Data Stewardship Workforce
Development cont.
ICPSR hosts data stewardship courses as part of
its Summer Program in Quantitative Methods of
Social Research. These include:
• Curating and Managing Research Data for Re-Use
• Assessing and Mitigating Disclosure Risk: Essentials for
Social Science
• Providing Social Science Data Services: Strategies for
Design and Operation
Long-term Support for Repository
Development
• ICPSR advocates long-term funding for specialized, long-lived,
trustworthy, and sustainable repositories that can mediate
between the needs of scientific disciplines and data
preservation requirements.
• As digital data management becomes an increasingly important
part of scientific research, funding agencies must contribute to
the developing ecosystem of services and technologies that
support access to and preservation of data.
• For more information, including various long-term funding
models, see ICPSR’s 2013 position paper – “The Price of
Keeping Knowledge”
Get More information
• Visit ICPSR’s Data Management & Curation site:
http://www.icpsr.umich.edu/datamanagement
• Contact us:
– netmail@icpsr.umich.edu
– (734) 647-2200
Acknowledgements:
Linda Detterman
Emily Reynolds
Gavin Strassel
Thank you!
lyle@umich.edu
Joint Declaration of Data Citation
Principles:
Implementation and Compliance in
the Dataverse Repository
Mercè Crosas, Ph.D.
Twitter: @mercecrosas
Director of Data Science
Institute for Quantitative Social Science, Harvard University
NISO Virtual Conference, April 23, 2014
A brief History of Data Citation
Altman M., Crosas M., 2014, “The Evolution of Data Citation: From
Principles to Implementation” IASSIST Quarterly, In Press
1906
Chicago Manual
of Style
Standards in Scholarly Citation:
author/creator, title, dates,
publisher or distributor of the work
1960
First scientific digital
data archives
1977 – 1998
ASBR (“Data File” type)
MARC (machine readable catalog)
1999-2014
Data Repositories
(NESSTAR, Dataverse,
Dryad, Figshare)
DOI services (DataCite)
The Making of the Principles
• Decades of research and practices in data citation
• Consolidated to a single set of Principles
• By a synthesis group representing 25+ organizations
• Driven by the premise that:
"sound, reproducible scholarship rests upon a foundation of
robust, accessible data"
and
"data should be considered legitimate, citable products of
research"
Joint Declaration of Data Citation Principles
1 Importance
2 Credit and Attribution
3 Evidence
4 Unique Identification
5 Access
6 Persistence
7 Specificity and Verifiability
8 Interoperability and flexibility
Full Principles: https://www.force11.org/datacitation
Endorsement: https://www.force11.org/datacitation/endorsements
1. Importance
Data should be considered legitimate, citable
products of research. Data citations should be
accorded the same importance in the
scholarly record as citations of other research
objects, such as publications.
2. Credit and Attribution
Data citations should facilitate giving scholarly
credit and normative and legal attribution to all
contributors to the data, recognizing that a
single style or mechanism of attribution may not
be applicable to all data.
3. Evidence
In scholarly literature, whenever and wherever
a claim relies upon data, the corresponding
data should be cited.
4. Unique Identification
A data citation should include a persistent
method for identification that is machine
actionable, globally unique, and widely used
by a community.
5. Access
Data citations should facilitate access to the
data themselves and to such associated
metadata, documentation, code, and other
materials, as are necessary for both humans
and machines to make informed use of the
referenced data.
6. Persistence
Unique identifiers, and metadata describing
the data, and its disposition, should persist --
even beyond the lifespan of the data they
describe.
7. Specificity and Verifiability
Data citations should facilitate identification of,
access to, and verification of the specific data
that support a claim. Citations or citation
metadata should include information about
provenance and fixity sufficient to facilitate
verifying that the specific time slice, version
and/or granular portion of data retrieved
subsequently is the same as was originally cited.
8. Interoperability and flexibility
Data citation methods should be sufficiently
flexible to accommodate the variant practices
among communities, but should not differ so
much that they compromise interoperability of
data citation practices across communities.
About Dataverse
• A software framework to build data repositories.
• Provides a preservation and archival infrastructure,
… while researchers share, keep control of and get
recognition for their data through a web interface.
• Harvard Dataverse is open to all researchers and
disciplines.
• It contains more than 50,000 data sets.
• Other large Dataverse instances throughout the world:
Odum at UNC, Dutch Universities, Scholar Portal, Fudan University.
• Dataverse 4.0 (June 2014) brings an entirely new UI and
improved data publishing workflows.
Data Citation Implementation in Dataverse
The Dataverse generates a Data Citation for each deposited
data set compliant with the Principles:
Authors, Year, Dataset Title, DOI, Data Repository, UNF, version
Example:
Logan Vidal, 2013, "ANES data coding",
http://dx.doi.org/10.7910/DVN/23274 Harvard Dataverse,
UNF:5:0fdUNzmCsyeqrVKtgUG74A==, V8
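To make the field order concrete, here is a short sketch that assembles the seven components into one citation string. It is not Dataverse code: the function name, the comma separators and the semicolon used to join multiple authors are assumptions chosen to resemble the example above.

```python
def dataverse_style_citation(authors, year, title, doi_url,
                             repository, unf, version):
    """Join the seven components in the order shown on the slide:
    Authors, Year, Dataset Title, DOI, Data Repository, UNF, version."""
    author_part = "; ".join(authors)  # separator for several authors is assumed
    return (f'{author_part}, {year}, "{title}", {doi_url}, '
            f'{repository}, {unf}, {version}')


citation = dataverse_style_citation(
    authors=["Logan Vidal"],
    year=2013,
    title="ANES data coding",
    doi_url="http://dx.doi.org/10.7910/DVN/23274",
    repository="Harvard Dataverse",
    unf="UNF:5:0fdUNzmCsyeqrVKtgUG74A==",
    version="V8",
)
print(citation)
```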
Compliant with Principle 2
Principle 2:
Credit and Attribution: …facilitate giving
scholarly credit and … attribution to all
contributors to the data, …
Authors, Year, Dataset Title, DOI, Data Repository, UNF, version
Compliant with Principles 4, 5, 6
Principles 4, 5, 6
Unique Identification: …machine actionable, globally unique, and
widely used by a community …
Access: … access to the data themselves and to such associated
metadata, documentation, code, and other materials …
Persistence: … even beyond the lifespan of the data they describe.
Authors, Year, Dataset Title, DOI, Data Repository, UNF, version
Resolves to landing page with access
to metadata, docs, code and data
Landing Page Example: Metadata
Landing Page Example: Data, Code & Docs
Compliant with Principle 7
Principle 7
Specificity and Verifiability: …provenance and fixity
sufficient to facilitate verifying that the specific time slice,
version and/or granular portion of data …
Authors, Year, Dataset Title, DOI, Data Repository, UNF, version
Universal Numerical Fingerprint:
Independent of format
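The idea of a format-independent fingerprint can be illustrated with a toy example. This is explicitly not the UNF algorithm, whose normalization rules are defined in its own specification. The sketch only shows the principle: bring each value to a canonical form, then hash the canonical form, so that the same numbers read from different file formats give the same fingerprint. The function name and the TOY: prefix are made up.

```python
import base64
import hashlib


def toy_fingerprint(values, digits=7):
    """Hash a canonical text form of numeric values (NOT the real UNF)."""
    # Canonical form: every value as a float in exponent notation with a
    # fixed number of significant digits, one value per line.
    canonical = "\n".join(format(float(v), f".{digits - 1}e") for v in values)
    digest = hashlib.sha256(canonical.encode("utf-8")).digest()
    # Keep the first 128 bits and base64-encode them for a short printable form.
    return "TOY:" + base64.b64encode(digest[:16]).decode("ascii")


from_text_file = toy_fingerprint(["1.50", "2", "3.25"])   # e.g. values read from CSV
from_binary_file = toy_fingerprint([1.5, 2.0, 3.25])      # same values as floats
changed_data = toy_fingerprint([1.5, 2.0, 3.26])

print(from_text_file == from_binary_file)
print(from_text_file == changed_data)
```

A matching fingerprint in a citation lets a reader verify that the data retrieved later are the data originally cited, which is the fixity requirement in Principle 7.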
Example of version History
Compliant with Principle 8
Principle 8: Interoperability and flexibility:
Dataverse exports all citation metadata in XML and JSON formats
Implementation Suggestions for Publishers
• Upgrade data citation to references section [Principle 1: Importance]
• In article, cite data by claim [Principle 3: Evidence]
• Provide guidelines for authors based on Principles, but customized to
each journal [Principle 8: Interoperability and Flexibility]
• Interoperate with, or recommend, trusted Data Repositories
compliant with the Principles
• Build tools to access machine-readable metadata from datasets
Want to be involved?
Join the Data Citation Implementation group:
https://www.force11.org/datacitationimplementation
Remaining Challenges
• Challenges of Provenance: What is the chain of
ownership and transformations to the data?
• Challenges of Identity: What should be cited? At what
level of granularity and versioning for large, dynamic
datasets?
• Challenges of Attribution: How do you support attribution
for hundreds or thousands of contributors?
Altman M., Crosas M., 2014, “The Evolution of Data Citation:
From Principles to Implementation” IASSIST Quarterly, In Press
NISO VIRTUAL CONFERENCE
APRIL 23, 2014 – SUCCESSFUL TECHNIQUES FOR SCIENTIFIC DATA MANAGEMENT
Purdue University Research Repository (PURR):
A Commitment to Supporting Researchers
Michael Witt
Head, Distributed Data Curation Center
Associate Professor of Library Science
http://www.lib.purdue.edu/research/witt
E-mail: mwitt@purdue.edu
OVERVIEW
1. Preaching to the choir, but still: Data
2. Ecosystem of data repositories
3. Our campus data repository & service (PURR)
a. Data management planning
b. Project space for collaboration
c. Publishing data
d. Archiving data
4. Creating opportunities for liaison librarians & helping to
operationalize library research data services
5. Roles and collaboration
6. Conclusion
DATA = EVIDENCE
http://epicgraphic.com/data-cake
FUNDING AGENCY MANDATES
ECOSYSTEM OF DATA REPOSITORIES
• Publisher, e.g., Dryad
• Sub/Disciplinary, e.g., RKMP
• Consortium, e.g., ICPSR
• Country, e.g., Research Data Australia
• Government, e.g., data.gc.ca
• Research center, e.g., NASA GES DISC
• Instrument, e.g., CHANDRA
• General-purpose, e.g., FigShare
• Roll-your-own, e.g., DataVerse
• University, e.g., PURR
• Many others…
CAMPUS COLLABORATION
The PURR service is a collaborative effort of the
Purdue University Libraries, Office of the Vice
President for Research, and Information
Technology at Purdue. PURR is a designated
university core research facility.
Designated community:
Purdue University faculty, staff, and graduate
student researchers; their collaborators; and the
current and future consumers of their data.
LIBRARY STRATEGIC PLAN
Data is written into the three pillars of our strategic plan:
• Learning
“…information literacy defined broadly to include digital information
literacy, science literacy, data literacy, health literacy, etc…”
• Scholarly Communication
“Lead in data-related scholarship and initiatives”
• Global Challenges
“We will lead in international initiatives in information literacy and e-
science and … contribute to international information literacy,
learning spaces, data management, and scholarly communication
initiatives.”
https://www.lib.purdue.edu/sites/default/files/admin/plan2016.pdf
http://purr.purdue.edu
CURATION LIFECYCLE SERVICE MODEL
Witt, M. (2012). Co-designing, Co-developing, and Co-implementing an Institutional Data Repository Service. Journal
of Library Administration, 52(2). DOI:10.1080/01930826.2012.655607. http://docs.lib.purdue.edu/lib_fsdocs/6/
Digital Curation Centre’s Curation Lifecycle Model: http://www.dcc.ac.uk/resources/curation-lifecycle-model
PURR SERVICE – INTERNAL MODEL
PURR SERVICE – EXTERNAL MODEL
INTRO TO PURR VIDEO
http://www.youtube.com/watch?v=Yw0IJj7FqA8
PURR POSTCARD AND POSTER
Dimensions of Discovery (Winter 2013). Office of the Vice President for Research, Purdue University,
http://www.purdue.edu/research/vpr/publications/docs/dimensions/Winter2013.pdf
DATA MANAGEMENT PLANS
• Boilerplate text
• Example DMPs
• DMP Self-Assessment
• DMPTool
• Workshops
• Tutorials
• Reference and consultation with subject-
specialist librarian and/or data services
specialist
https://purr.purdue.edu/dmp
CREATE PROJECT AND COLLABORATE
Create:
• any Purdue faculty, staff, or graduate student researcher can create
projects
• describe the project
• disclaim use of sensitive or restricted data
• receive a default allocation of storage
• register a grant award to increase allocation
• invite collaborators to join project
Collaborate:
• git repository to share and version files (Google Drive integration)
• wiki
• blog
• to-do list management and project notes
• newsfeed
• stage data publications
SENSITIVE AND RESTRICTED DATA
Sensitive data: Information whose access must be guarded due to proprietary,
ethical, or privacy considerations. This classification applies even though there
may not be a civil statute requiring this protection.
Restricted data: Information protected because of protective statutes, policies or
regulations. This level also represents information that isn't by default protected
by legal statute, but for which the Information Owner has exercised their right to
restrict access.
http://www.purdue.edu/securepurdue/policies/dataConfident/restrictions.cfm
• FERPA → Registrar
• HIPAA → Health Center
• IRB → Human Research Protection Program
• Export Control → Vice President for Research
PROJECT SPACE
PURR project tutorial video:
http://www.youtube.com/watch?v=q5xGO_oF9uQ
STORAGE MENU
https://purr.purdue.edu/about/pricing
DATA PUBLICATION
PURR publication tutorial video:
http://www.youtube.com/watch?v=jYBcsfiRhio
PRESERVATION AND STEWARDSHIP
Initial commitment of 10 years
• data producer or dept can fund for longer
• otherwise remanded to library collection
Design guided by ISO 16363 / TRAC
• Organization infrastructure
• Digital object management
• Technical infrastructure & Security Risk
Management
ARCHIVAL INFORMATION PACKAGE
BagIt “bag” contains:
• bag declaration file, manifest file, data files
Metadata file (XML):
• METS wrapper
• Dublin Core and MODS (descriptive metadata)
• PREMIS (preservation metadata)
MetaArchive: LOCKSS replication network (7 copies)
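A minimal sketch of the bag layout described above, using only the Python standard library. It writes a bag declaration (`bagit.txt`), a payload directory (`data/`), and a checksum manifest. The function names `make_bag` and `verify_bag`, the choice of SHA-256, and the declared version string are assumptions; PURR's actual AIP module is not shown in the slides, and the METS/PREMIS metadata file is omitted here.

```python
import hashlib
import tempfile
from pathlib import Path


def make_bag(bag_dir, files):
    """Write a minimal bag: declaration file, data/ payload, checksum manifest."""
    bag = Path(bag_dir)
    data = bag / "data"
    data.mkdir(parents=True, exist_ok=True)

    # Bag declaration file.
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )

    # Payload files plus one manifest line per file: "<checksum>  <path>".
    lines = []
    for name, content in sorted(files.items()):
        (data / name).write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        lines.append(f"{digest}  data/{name}\n")
    (bag / "manifest-sha256.txt").write_text("".join(lines), encoding="utf-8")
    return bag


def verify_bag(bag_dir):
    """Recompute each payload checksum and compare it with the manifest."""
    bag = Path(bag_dir)
    manifest = (bag / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        digest, rel_path = line.split(maxsplit=1)
        actual = hashlib.sha256((bag / rel_path).read_bytes()).hexdigest()
        if actual != digest:
            return False
    return True


# Usage: build a bag in a temporary directory and check its fixity.
example_bag = make_bag(Path(tempfile.mkdtemp()) / "bag", {"readings.csv": b"t,v\n1,2\n"})
intact = verify_bag(example_bag)
```

The manifest is what lets a replication network detect a damaged copy: any copy whose recomputed checksums disagree with the manifest can be repaired from another copy.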
SUPPORTING POLICIES
• Terms of Deposit
• Collection Development Policy
• Preservation Policy
• Preservation Strategies
• File Format Recommendations
• Preservation Support Policy
https://purr.purdue.edu/legal/terms
REPOSITORY SOFTWARE: HUBZERO
• HUBzero, open source software: http://hubzero.org
• Maintained by HUBzero Foundation, originally funded by NSF
• Over 50 hubs online, supporting different virtual scientific communities,
hundreds of thousands of users
• http://nanoHUB.org - grandfather of the hubs, exemplar
• Built to facilitate virtual communities and online, scientific collaboration,
research/teaching
• Collaborate, develop, publish, access, execute, and manage content
using a web browser
• Software tools, documents, multimedia, learning objects, datasets, etc.
• Social network functionality and collaboration features
• LAMP stack, Joomla framework, OpenVZ and Rappture, git, etc.
• EZID interface to mint DataCite DOIs (coming soon: ORCID)
• Some extensions customized for PURR not in core distribution
PURR TEAM
• Executive Committee: Dean of Libraries, Vice
President for Research, Chief Information
Officer
• Steering Committee: 2 from libraries, 2 from IT,
2 from research office and sponsored programs,
3 domain faculty researchers
• Personnel: Project Director (.50), Technologists
(3.85), HUBzero Liaison (.35), Metadata
Specialist (.20), Digital Archivist (.25), Digital
Data Repository Specialist (1.0)
LIBRARIES PURR TEAM
PURR Project Director (50%)
Michael Witt
Three examples of responsibilities:
• resourcing (personnel, budget, coffee, etc.)
• oversees development roadmap, service definition
and design
• communicates across constituencies
LIBRARIES PURR TEAM
Digital Data Repository Specialist
Courtney Matthews
Three examples of responsibilities:
• primary point of contact for helping users and
librarians utilize PURR
• coordinates outreach, support, and development
(tons of community engagement)
• helps to acquire, organize, and ingest data
collections
LIBRARIES PURR TEAM
Digital Library Software Developer
Mark Fisher
Three examples of responsibilities:
• developing a module to create archival information
packages from datasets published in PURR
• integrating PURR with MetaArchive, a LOCKSS
preservation network
• web and graphics design to keep the PURR website
current and dynamic
LIBRARIES PURR TEAM
Digital Archivist (25%)
Carly Dearborn
Three examples of responsibilities:
• define and implement AIP as well as long-term
digital object management and supporting practices
• lead policy development and documentation such as
PURR’s preservation policy, preservation strategies,
file format recommendations, and preservation
support policy
• consult with data producers and librarians on file
formats, appraisal of data collections, and data
management planning
LIBRARIES PURR TEAM
Metadata Specialist (20%)
Amy Barton
Three examples of responsibilities:
• consult with data producers and librarians to identify
and apply appropriate metadata schemas and
vocabularies to describe datasets
• design and implement metadata for preservation,
findability, and citability (i.e., DataCite DOIs)
• enhance and provide quality assurance for metadata
for acquired data collections
KEY PLAYERS: SUBJECT LIBRARIANS
KEY PLAYERS: DATA SPECIALISTS
Librarians consult on data management plans in their
subject areas.
Creating opportunities for librarians to interact with researchers about data
Librarian is notified by e-mail when a new project is
created or a grant is awarded, based on department
affiliation of Purdue project owner.
Creating opportunities for librarians to interact with researchers about data
Librarian may consult or collaborate on project
if needed.
Creating opportunities for librarians to interact with researchers about data
Librarians review and post submitted
datasets.
Creating opportunities for librarians to interact with researchers about data
At the end of initial commitment (10 years), archived
and published datasets are remanded to the
Libraries' collection. A librarian working with the
digital archivist selects (or not) the dataset for the
collection.
Creating opportunities for librarians to interact with researchers about data
CONCLUSION
• Soft launch in 2012; 2013 was our first full year
• PURR included in 1,040 data management plans with proposals
from Purdue (tracked by our sponsored programs office)
• 79 grants awarded
• 1,466 registered researchers
• 331 active research projects
• Average project team size: 4 people
• Average files per project: 67 files
DMP analysis (n=111 NSF proposals from Purdue, Jan-Jun 2013)
• 49% PURR
• 29% Local computer or server
• 14% Disciplinary repository (e.g., ICPSR, Protein Data Bank,
nanoHUB, NEES)
• 8% No data or not applicable
THANK YOU
PURR: http://purr.purdue.edu
Michael Witt
Head, Distributed Data Curation Center
Associate Professor of Library Science
http://www.lib.purdue.edu/research/witt
E-mail: mwitt@purdue.edu
The Roles of Data Citation in Data
Management
NISO Virtual Conference:
Dealing with the Data Deluge: Successful Techniques for
Scientific Data Management
http://www.niso.org/news/events/2014/virtual/data_deluge/
Christine L. Borgman
Professor and Presidential Chair in Information Studies
University of California, Los Angeles
hudsonalpha.org
NASA Astronomy Picture of the Day
Deluge!!!
Data!
Scientists
Social Scientists
Funding agencies Policy makers
Humanists
Librarians
http://www.guzer.com/pictures/suprise_suprise.jpg 14
Publishers Internet architects
http://www.census.gov/population/cen2000/map02.gif
What are data?
ncl.ucar.edu
http://onlineqda.hud.ac.uk/Intro_QDA/Examples_of_Qualitative_Data.php
Marie Curie's notebook, aip.org
hudsonalpha.org
NASA Astronomy Picture of the Day
Data are representations of
observations, objects, or other entities
used as evidence of phenomena for
the purposes of research or
scholarship.
C.L. Borgman, 2014, forthcoming, Big Data, Little Data, No Data: Scholarship in
the Networked World, MIT Press.
hudsonalpha.org
Publications are
arguments made by
authors, and data are
the evidence used to
support the arguments.
C.L. Borgman, 2014, forthcoming, Big Data, Little Data, No Data: Scholarship in
the Networked World, MIT Press.
Citing publications vs. data
• If publications are the stars and
planets of the scientific
universe, data are the ‘dark
matter’ – influential but largely
unobserved in our mapping
process*
*CODATA-ICSTI Task Group on Data Citation Standards and Practices, 2013, p. 54
Authorship and Attribution
• Publications
– Independent units
– Authorship is negotiated
• Data
– Compound objects
– Ownership is rarely clear
– Attribution
• Long term responsibility: Investigators
• Expertise for interpretation: Data collectors and analysts
hudsonalpha.org
Attribution of data
• Legal responsibility
– Licensed data
– Specific attribution required
• Scholarly credit: contributorship
– Author of data
– Contributor of data to this publication
– Colleague who shared data
– Software developer
– Data collector
– Instrument builder
– Data curator
– Data manager
– Data scientist
– Field site staff
– Data calibration
– Data analysis, visualization
– Funding source
– Data repository
– Lab director
– Principal investigator
– University research office
– Research subjects
– Research workers, e.g., citizen science…
Scholarly credit
• Publications
• Publications
• Publications
• Publications
• Publications
• Publications
• Awards and honors
• Grants
• Teaching
• Service
• Data
http://blog.startfreshtoday.com/Portals/170402/images/improve-credit-score1.jpg
Everyone is overwhelmed with life and
email and, in academia, trying to get
funding and write papers. Whether
something is open or not open is not
highest on the priority list. There’s still
need for making people aware of open
science issues and making it easy for
them to participate if they want to.
Jonathan Eisen, genetics professor at the
University of California, Davis
DESPITE BEING GOOD FOR YOU AND FOR SCIENCE,
TOO MANY CHALLENGES AND TOO LITTLE TIME
Rewards for
publications
Effort to
document
data
Competition,
priority
Control,
ownership
Slide courtesy of Mercè Crosas, Harvard IQSS; mashup of Borgman and Crosas slides
Data citation as solution to…
• Credit
• Attribution
• Discovery
Research practices
• Goal is publications that report the research
Vs.
• Goal is data that are reusable by others
Image: Alyssa Goodman, Harvard Astronomy
Scientific data creation, use, and reuse*
• What are the characteristics
of data use and reuse within
each research community?
• How do characteristics of
data use and reuse vary
within and between research
communities?
Fastlizard4’s image of a Geiger counter setup to measure
background radiation (flickr.com)
* Wynholds, L. A., Wallis, J. C., Borgman, C. L., Sands, A., & Traweek, S. (2012). Data, data use, and scientific
inquiry: two case studies of data practices. In Proceedings of the 12th ACM/IEEE-CS joint conference on Digital
Libraries (pp. 19–22). New York, NY, USA: ACM. doi:10.1145/2232817.2232822
* Wallis, J. C., Rolando, E., & Borgman, C. L. (2013). If We Share Data, Will Anyone Use Them? Data Sharing and
Reuse in the Long Tail of Science and Technology. PLoS ONE, 8(7), e67332. doi:10.1371/journal.pone.0067332
Research Sites
• Center for Embedded
Networked Sensing
– Science research
• Environment
• Seismology
– Technology research
• Instrumentation
• Networks
– Small science
– Circa 300 partners
• Sloan Digital Sky Survey
– Science research
• Astronomy
• Astrophysics
– Technology research
• Instrumentation
• Databases
– Big science
– Circa 400 partners
Interview Questions
Topic Question CENS SDSS
Data
Types
Within your work, what is typically considered
to be “data?”
X X
How do you distinguish between different
levels or states of data?
X
Data
Sources
What are the main sources of data for your
research projects?
X
Do you routinely or have you ever used data
that you did not generate yourself, or from
beyond the immediate project team?
X X
Data
Use
When you look at data, what are you hoping to
find in it?
X X
When, if ever, do you reuse your datasets? X X
Dimensions of Data
• Observed vs. simulated data
• Lab generated vs. field collected
• Collected by team vs. obtained from external
sources
• Old vs. new data
• Raw vs. processed data
• Foreground vs. background data
Research findings
• Uses of data vary by type of inquiry
• Foreground data
– Research questions
– Curated
– Cited
• Background data
– Necessary for comparison or calibration
– Rarely curated
– Rarely cited
• Value of data lies in their use
• “Use” of data is not reflected in citations
http://drpinna.com/the-gold-standard-22948
Sharing and discovering data
• Means to share data
– Curated data archives: NASA, UKDA, ICPSR…
– Contributor-curated collections
– Research domain collections
– University repositories
– Personal websites
– ftp sites
• Release upon request*
http://www.zippykidstore.com/
*Wallis, J. C., Rolando, E., & Borgman, C. L. (2013). If We Share Data, Will
Anyone Use Them? Data Sharing and Reuse in the Long Tail of Science and
Technology. PLoS ONE, 8(7), e67332. doi:10.1371/journal.pone.0067332
Discoverability
• Data are inseparable from
– Code
– Technical standards
– Documentation
– Instrumentation
– Calibration
– Provenance
– Workflows
– Local practices
– Physical samples
http://peacetour.org/sites/default/files/code4peace-logo2-v3-color-sm.jpg
Usability of cited objects
• Identify the form and content
• Interpret
• Evaluate
• Open
• Read
• Compute upon
• Reuse
• Combine
• Describe
• Annotate…
Identity and persistence of digital
objects
• Identity
– Identifiers
• DOI, Handles, URI, PURL…
– Naming and namespaces
• Authors/creators: ORCID, VIAF…
• Generic/specific: registry number…
– Description
• Self-describing
• Metadata augmentation
• Persistence
– Permanent
– Long-lived
– Scratch spaces
http://web-interview-questions.blogspot.com/2010_06_21_archive.html
Intellectual property
• What can I do with this object?
• What rights are associated?
– Reuse
– Reproduce
– Attribute
• Who owns the rights?
• How open are data?
– Open data
– Open bibliography
http://pzwart.wdka.hro.nl/mdr/research/lliang/mdr/mdr_images/opencontent.jpg/
Implications for data management
• Authors of publications
– Cite publications for their data, findings, and other content
– Cite your data as you wish others to cite them
– Cite others’ data and publications as they wish to be cited
• Data archives
– Add metadata for discovery of datasets
– Add metadata for interpretation and provenance
• Institutional repositories, bibliographic databases
– Establish standards and practices for citing data sources
– Coordinate communities, e.g., telescope bibliography, IAU*
*IAU Working Group Libraries. (2013). Best Practices for Creating a Telescope Bibliography.
IAU-Commission5 - WG Libraries. http://iau-commission5.wikispaces.com/WG+Libraries
Data Citation and Attribution
Uhlir, P. F. (Ed.). (2012). For Attribution -- Developing
Data Attribution and Citation Practices and Standards:
Summary of an International Workshop. Washington,
D.C.: The National Academies Press. Retrieved from
http://www.nap.edu/catalog.php?record_id=13564
Data Science Journal, Volume 12,
13 September 2013
2012
CODATA-ICSTI Task Group on Data
Citation and Attribution. Co-Chairs: Jan
Brase, Sarah Callaghan, Christine
Borgman
Research funding acknowledgements
Research reported here is supported
in part by grants from the National
Science Foundation and the Alfred P.
Sloan Foundation:
The Transformation of Knowledge, Culture,
and Practice in Data-Driven Science: A
Knowledge Infrastructures Perspective, Sloan
Award # 20113194, CL Borgman, UCLA, PI; S
Traweek, UCLA, Co-PI
The Data Conservancy, NSF Cooperative
Agreement (DataNet) award OCI0830976,
Sayeed Choudhury, Johns Hopkins University,
PI
The Center for Embedded Networked Sensing
(CENS) is funded by NSF Cooperative
Agreement #CCR-0120778, Deborah L. Estrin,
UCLA, PI
Towards a Virtual Organization for Data
Cyberinfrastructure, NSF #OCI-0750529, C.L.
Borgman, UCLA, PI; G. Bowker, Santa Clara
University, Co-PI; Thomas Finholt, University
of Michigan, Co-PI
Monitoring, Modeling & Memory: Dynamics
of Data and Knowledge in Scientific
Cyberinfrastructures: NSF #0827322, P.N.
Edwards, UM, PI; Co-PIs C.L. Borgman, UCLA;
G. Bowker, SCU and Pittsburgh; T. Finholt, UM;
S. Jackson, UM; D. Ribes, Georgetown; S.L.
Star, SCU and Pittsburgh
Finding and following digital objects
• Discoverability
– Identify existence
– Locate
– Retrieve
• Provenance
– Chain of custody
– Transformations from original state
• Relationships
– Units identified
– Links between units
– Actions on relationships
http://chicagoist.com/2008/10/09/a_gourmet_oasis_provenance_food_and.php
Infrastructure for digital objects
• Social practice
• Usability
• Identity
• Persistence
• Discoverability
• Provenance
• Relationships
• Intellectual property
• Policy
http://datalib.ed.ac.uk/GRAPHICS/blue_data.gif
Social practice
• Why cite data?
– Reproduce research
– Replicate findings
– Reuse data
• Why attribute data?
– Social expectation
– Legal responsibility
• How to cite data?
– Bibliographic reference
– Identifier
– Link
http://farm2.static.flickr.com/1207/707625876_46aa44851f_o.jpg
Foreground vs Background
Aspect | Foreground data | Background data
Uses | Research questions | Comparison, calibration
Reuses | Internal data sources | External data sources
Disposition | Retain, curate | Discard
Value | Reference in paper | Rarely cited
UCLA USC UCR CALTECH UCM
CENTER FOR EMBEDDED NETWORKED SENSING
Sensor Collected
Application Data
Sensor Collected
Proprioceptive Data
Sensor Collected
Performance Data
Hand Collected
Application Data
Flow
Water depth
Ammonium
Ammonia Phosphate
Water temp
pH
Temperature
Conductivity
Chlorophyll
GPS/location Time
Sap flow
CO2
Humidity
Rainfall
Packets transmitted
Packets received
ORP
PAR
Motor speed
Rudder angle
Heading
Roll/pitch/yaw
Soil moisture
Nitrate
Calcium
Chloride
Water potential
Wind speed
Wind direction
Wind duration
Leaf wetness
Routing table
Neighbor table
Fault detection
Awake time
Organism presence
Organism concentration
Battery voltage
Mercury
Methylmercury
Nutrient concentration
Nutrient presence
LandSat images Mosscam
CDOM
Bird calls
CENS Data: Foreground vs background
Astronomy data: Foreground vs. background
Type | Source Named | Genre
Catalog (Data) index | SIMBAD, VizieR | Obs
Curated Data Collection | NASA Exoplanet Database | Obs
Data Archive | Multi-mission Archive at STScI (MAST), Infrared Science Archive (IRSA) | Obs
Federated Data Query Services | Virtual Observatory Services (NVO, IVOA) | Obs
Ground Based Instruments | DEep Imaging Multi-Object Spectrograph (DEIMOS), Keck Observatories, Laser Interferometer Gravitational-Wave Observatory (LIGO) | Obs
Ground Based Sky Surveys | Deep Lens Survey, DEEP2 Galaxy Redshift Survey, Catalina Transients Survey, Palomar-Quest Survey, Sloan Digital Sky Survey (SDSS), Digitized Palomar Observatory Sky Survey (DPOSS), SDSS Value Added Catalogs | Obs
Physical Constants | NIST Atomic Spectra Database | Exp
Publications Index | SAO/NASA Astrophysics Data System | Mixed
Simulation | Millennium Simulation Database | Sim
Space Based Instruments | Chandra X-Ray Observatory, Fermi Large Area Telescope, Far Ultraviolet Spectroscopic Explorer (FUSE), Galaxy Evolution Explorer (GALEX), Hubble Space Telescope, Spitzer Space Telescope, XMM X-ray Telescope | Obs
Space Based Sky Surveys | Two Micron All Sky Survey (2MASS), Infrared Astronomical Satellite Survey (IRAS), Wide-field Infrared Survey Explorer (WISE) | Obs
© 2012 The MITRE Corporation. All rights reserved.
Adriane Chapman
achapman@mitre.org
M. David Allen
dmallen@mitre.org
Barbara Blaustein
bblaustein@mitre.org
Is this data fit for
my use?
The challenges and
opportunities provenance
presents
Information graphic courtesy of FreeDigitalPhotos.net
Public Release #12-1548.
What is
Provenance?
■Provenance can help in evaluating whether data
is fit for a specific purpose
– Does the data item derive from an Internet source?
– Were untrusted organizations involved in producing the
data item?
■Provenance “in the raw” is not always useful to
users
– Generally presented as a directed acyclic graph (DAG)
– Many users have a good intuitive understanding of
simple graphs, BUT
Is Data Fit for a Specific Use?
Provenance graphs are often
large and unwieldy
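The two fitness questions above can be sketched as queries over a small provenance DAG. Everything here is hypothetical: the graph, the node attributes, and the function names are illustrative and are not taken from the PLUS system.

```python
# Hypothetical provenance graph: each node maps to the nodes it was derived from.
derived_from = {
    "report": ["model_run"],
    "model_run": ["filtered", "parameters"],
    "filtered": ["web_scrape", "agency_feed"],
    "web_scrape": [],
    "agency_feed": [],
    "parameters": [],
}

# Hypothetical metadata attached to some nodes.
attributes = {
    "web_scrape": {"source": "internet", "org": "unknown"},
    "agency_feed": {"source": "internal", "org": "AgencyA"},
    "parameters": {"source": "internal", "org": "AgencyA"},
}


def ancestors(node, graph):
    """Return every node that the given node directly or indirectly derives from."""
    seen = set()
    stack = list(graph.get(node, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(graph.get(current, []))
    return seen


def derives_from_internet(node, graph, attrs):
    """Does the data item derive from an Internet source?"""
    return any(attrs.get(a, {}).get("source") == "internet"
               for a in ancestors(node, graph))


def untrusted_orgs(node, graph, attrs, trusted):
    """Which organizations outside the trusted set were involved upstream?"""
    orgs = {attrs[a]["org"] for a in ancestors(node, graph)
            if "org" in attrs.get(a, {})}
    return sorted(orgs - set(trusted))
```

A user asks a yes/no question and gets a short answer, without having to read the whole graph.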
Use Case
Financial Systemic Risk Analysis
Analysts
Financial Models
build and run
Are there
systemic
risks to the
health of
the financial
system?
Decision Makers
Public Release #12-3756
Systemic Risk: The IT Problem
■ To monitor systemic risk, regulators have hundreds of
analysts, running hundreds of models…
– …against hundreds of data sets at various time scales…
– …each with thousands of different parameter settings
■ Currently, care and feeding of these models (especially data
extract-transform-load) is ad hoc
■ Result: Current simulation environments don’t support
analysts’ need to find and interpret data across the
resulting millions of simulation executions
Data Provenance Challenge
I ran a flow of funds model
from the University of
Vermont back in May. Which
version did I use? What
transformations did I perform
on the input data sets?
Which model runs
used the 1Q 2011
version of the
FDIC’s Uniform
Bank Performance
Reports?
Who is running
Prof. Jones’
model? What input
data are they using
it with and with
what parameters?
Data Provenance Example
d1
Filter (P1)
Multi-market
model (P2)
Source:
Thomson
Reuters order
book data
d2 d3
Filter (P3) d4 d5
Version: 1
Time-horizon = 2016
Invoked-by: Jones
Version: 2
Time-horizon = 2018
Invoked-by: Smith
Multi-market
model (P4)
Year: 2010
Sector: Technology
Time Series
Normalization (P6)
Link-based
Classification
Model (P7)
d7 d8
Outlook: “Excellent”
Outlook: “Poor”
Filter (P5) Year: 2001-2010
Sector: Housing
Periodicity: Quarterly Invoked-by:
Roberts
Outlook: “Fair”
d6
Source: Nanex
order book data
FitnessWidgets
Ease of Use for End Users
Data-centric goal: build tools and applications over
provenance information to support a user’s needs.
Information graphic courtesy of FreeDigitalPhotos.net
■ Ad hoc, user-defined
Fitness Widgets: Pre-defined queries
operating over provenance graphs
Fitness Widgets: Pre-defined queries
operating over provenance graphs
■ Complex, pre-defined
More Complex: Cross-organizational
“double counting”
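One way to read cross-organizational "double counting" is that two inputs to a combined product trace back to the same original source. The sketch below checks for that under this assumed interpretation; the graph and the function names are hypothetical and do not come from the slides.

```python
from itertools import combinations

# Hypothetical provenance graph: each node maps to the nodes it was derived from.
derived_from = {
    "combined_estimate": ["org_a_total", "org_b_total"],
    "org_a_total": ["survey_2010"],
    "org_b_total": ["survey_2010", "registry"],
    "survey_2010": [],
    "registry": [],
}


def sources(node, graph):
    """Return the root sources (nodes with no parents) that a node derives from."""
    parents = graph.get(node, [])
    if not parents:
        return {node}
    found = set()
    for parent in parents:
        found |= sources(parent, graph)
    return found


def double_counted(node, graph):
    """Map each pair of direct inputs to the root sources they share, if any."""
    overlaps = {}
    for left, right in combinations(graph.get(node, []), 2):
        shared = sources(left, graph) & sources(right, graph)
        if shared:
            overlaps[(left, right)] = shared
    return overlaps


overlap = double_counted("combined_estimate", derived_from)
```

Here both organizations' totals rest on the same survey, so adding them would count that survey twice.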
The Skeletons
PLUS Provenance Manager
Provenance Manager
PLUS
Users &
Applications
Administrators
Provenance Store (MySQL)
PLUS
Applications &
Capture Agents
Report
Annotate
Retrieve
Administer
(access control,
archiving, etc.)
API
(provenance-aware
applications)
Coordination points for automatic
provenance capture
Web Proxy
(provenance-aware
applications)
Approved for Public Release 10-4145
© 2010 The MITRE Corporation. All rights reserved
A. Chapman, M.D. Allen, B. Blaustein, L. Seligman, “PLUS: A Provenance Manager for
Integrated Information,” IEEE Int. Conf. on Information Reuse and Integration (IRI ‘11), Las
API
Architectural Options for Lineage Capture
■ “Smart Applications”
– Strategy: Each application calls lineage API to log whatever it thinks
is important.
– But, unrealistic for legacy applications
■ “Interceptors”
– Strategy: Listen in to whatever is happening, and log silently as it
happens
– Requires a small number of points of lineage capture: ESBs are
ideal, since they act as central “routers”
■ “Wrappers”
– Strategy: Write a transparent wrapper service. Make sure all
orchestrations call the wrapper service with enough information for
the wrapper to invoke the real thing.
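The "wrapper" strategy can be sketched in a few lines: callers invoke the wrapper, which runs the real function and records what went in and what came out. The in-memory `lineage_log`, the `capture_lineage` decorator, and the sample `filter_by_year` function are assumptions for illustration, not the PLUS API.

```python
import functools
from datetime import datetime, timezone

lineage_log = []  # stand-in for a provenance store


def capture_lineage(func):
    """Wrap a processing step so each call is logged without changing the step."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)  # invoke the real thing
        lineage_log.append({
            "process": func.__name__,
            "inputs": [repr(a) for a in args],
            "parameters": dict(kwargs),
            "output": repr(result),
            "when": datetime.now(timezone.utc).isoformat(),
        })
        return result
    return wrapper


@capture_lineage
def filter_by_year(rows, year=2010):
    """Example processing step: keep only rows for one year."""
    return [r for r in rows if r["year"] == year]


rows = [{"year": 2010, "value": 1}, {"year": 2011, "value": 2}]
kept = filter_by_year(rows, year=2010)
```

The wrapped function needs no knowledge of lineage capture, which is what makes this approach workable for existing code; the cost is that every caller must go through the wrapper.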
Public Release #10-1285
View Provenance
The provenance graph
is built automatically
over time by
“watching” users’
actions
The system can
show relationship
information and
metadata details
Get Details
Sort Information
The system
provides ways to
get information
“at a glance”, e.g.
which
organizations
own the data that
was used.
FitnessWidgets
FitnessWidgets
help the analyst
assess data
products for his
specific use.
Annotations
Annotate any node.
Information can be
propagated through
graph.
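A minimal sketch of propagating an annotation through a provenance graph: a note attached to one node is copied to everything derived from it. The graph, the note text, and the `propagate` function are hypothetical; the slides do not say how the system implements propagation.

```python
from collections import deque

# Hypothetical provenance graph: each node maps to the nodes derived from it.
feeds_into = {
    "d1": ["filter"],
    "filter": ["d2"],
    "d2": ["model"],
    "d4": ["model"],
    "model": ["d3"],
    "d3": [],
}


def propagate(start, note, graph, annotations):
    """Attach a note to a node and to every node downstream of it."""
    queue = deque([start])
    visited = set()
    while queue:
        node = queue.popleft()
        if node in visited:
            continue
        visited.add(node)
        annotations.setdefault(node, []).append(note)
        queue.extend(graph.get(node, []))
    return annotations


annotations = propagate("d1", "source data recalled", feeds_into, {})
```

In this example the note reaches `d3` because it was computed from `d1`, while `d4`, an independent input, stays unannotated.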
■ Provenance keeps track of who did what to the data, and when.
■ Provenance can help
– Determine what data to use
– Find data
– Know what happened to the data
■ It is not a silver bullet
– Capture is hard
■ Determine what pieces of information are vital to judging
“fitness”, try to capture those
Conclusions
SHARE PROJECT UPDATE
Judy Ruttenberg, Program Director
Association of Research Libraries
NISO Virtual Conference: Dealing with the
Data Deluge
April 23, 2014
Higher education &
research community
• Preservation, access, and reuse of
research outputs (data, articles, and
more)
• Interlocking layers & services to
better understand what research is
being produced, and to render that
research as accessible as possible
• Leverage existing ecosystem
Formation and context of SHARE
• Institutional OA policies
• AAU-ARL Task Force on Scholarly
Communication
• Funder mandates
–2013 OSTP Memorandum
–2014 Omnibus Appropriations
–Private and other funder policies
Who is SHARE?
Steering Group
• Provost, Library directors, CIO, SRO
• ARL, AAU, APLU, CNI, SPARC, NLM (federal
agency liaison)
Staff
• Project Manager (ARL), Technical Director,
Product/Community Lead, Development Team
Working Groups
• Repository, Workflow, Technical,
Communications
Layers & Services of SHARE
Notification Service: Project underway
– Beta release fall 2014
– Full release fall 2015
Concurrent planning for
interactive systems:
– Registry
– Discovery
– Aggregation
SHARE Notification Service
Problem Statement:
• Difficult to keep abreast of the release
of publications, datasets, other
research outputs
• No single, structured way to report
research output releases in timely
and ubiquitous manner
SHARE Notification Service
Outcome & Goal:
• Know that research output exists
• Enable, short-term & with high-latency:
–Repository Managers to identify
articles/papers/reports for deposit
–University and funding agency grant
administrators to determine compliance
with public access policies
SHARE Notification Service –
Building Blocks
SHARE Notification Service –
Information Flow
SHARE Research Release Events
SHARE Research Release Events
SHARE Registry Layer
SHARE Registry &
Discovery Layers
Other Community Initiatives
• CHORUS
• ORCID
• CrossRef
• International
Long-term planning
• Data
• Author rights: An intellectual property
rights strategy, including the promotion of
university-based open access policies and
favorable licensing terms, will be part of
the scaffolding that will enable the layers
of SHARE to develop
www.arl.org/share
www.facebook.com/SHARE.research
www.twitter.com/share_research
share@arl.org
Staying connected with SHARE:
NISO Virtual Conference
Dealing with the Data Deluge: Successful
Techniques for Scientific Data Management
NISO Virtual Conference • April 23, 2014
Questions?
All questions will be posted with presenter answers on
the NISO website following the webinar:
http://www.niso.org/news/events/2014/virtual/data_deluge/
Thank you for joining us today.
Please take a moment to fill out the brief online survey.
We look forward to hearing from you!
THANK YOU

April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

  • 1. NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management April 23, 2014 Speakers: Jan Brase, Jared Lyle, Mercè Crosas, Michael Witt, Christine Borgman, Adriane Chapman, David Wilcox, Judy Ruttenberg http://www.niso.org/news/events/2014/virtual/data_deluge/
  • 2. NISO Virtual Conference: The Semantic Web Coming of Age: Technologies and Implementations Agenda 11:00 a.m. – 11:10 a.m. – Introduction Todd Carpenter, Executive Director, NISO 11:10 a.m. - 12:00 p.m. Keynote Speaker: DataCite – A Global Approach for Better Data Sharing Jan Brase, Ph.D., German National Library of Science and Technology 12:00 p.m. - 12:30 p.m. Guidelines and Resources for Office of Science and Technology Policy (OSTP) Data Access Plans Jared Lyle, Director of Data Curation Services, Interuniversity Consortium for Political and Social Research (ICPSR), University of Michigan 12:30 p.m. - 1:00 p.m. Joint Declaration of Data Citation Principles: Implementation and Compliance in the Dataverse Repository Mercè Crosas, Ph.D., Director of Data Science, Institute for Quantitative Social Science (IQSS), Harvard University 1:00 p.m. - 1:45 p.m. Lunch Break 1:45 p.m. - 2:15 p.m. Purdue University Research Repository (PURR): A Commitment to Supporting Researchers Michael Witt, Head, Distributed Data Curation Center (D2C2); Associate Professor of Library Science, Purdue University Research Repository (PURR) 2:15 p.m. - 2:45 p.m. The Roles of Data Citation in Data Management Christine L. Borgman, Professor & Presidential Chair in Information Studies, UCLA 2:45 p.m. - 3:15 p.m. Is This Data Fit for My Use? The Challenges and Opportunities Data Provenance Presents Adriane Chapman, MITRE 3:15 p.m. - 3:30 p.m. Afternoon Break 3:30 p.m. - 4:00 p.m. A Durable Space: Technologies for Accessing Our Collective Digital Heritage David Wilcox, Product Manager, DuraSpace 4:00 p.m. - 4:30 p.m. The SHared Access Research Ecosystem (SHARE) Project: A Joint Initiative of ARL, AAU, and APLU Judy Ruttenberg, Program Director for Transforming Research Libraries, Association of Research Libraries (ARL) 4:30 p.m. - 5:00 p.m. Conference Roundtable Moderated by Todd Carpenter, Executive Director, NISO
  • 3. DataCite – A global approach for better data sharing Jan Brase DataCite NISO virtual conference April 23rd 2014
  • 4. Science Paradigms. A thousand years ago: science was empirical, describing natural phenomena. Last few hundred years: a theoretical branch using models and generalizations. Last few decades: a computational branch simulating complex phenomena. Today: data exploration (eScience) unifies theory, experiment, and simulation. (Jim Gray, eScience Group, Microsoft Research)
  • 5. Consequences for Libraries. Scientific information is more than a journal article or a book. Libraries should open their catalogues to any kind of information. The catalogue of the future is NOT ONLY a window onto the library's holdings, but a portal in a net of trusted providers of scientific content.
  • 6. We do not have it, BUT we know where you can find it, and here is the link to it!
  • 7. Including non-classical publications: Simulation, Scientific Films, 3D Objects, Grey Literature, Research Data, Software
  • 8. Why is this a role for libraries? • Libraries have a history of bringing scientific information to the public • Libraries have a tendency to be persistent • A project will be forgotten in 40 years; the library will very likely still exist then • Libraries are very trustworthy organisations
  • 10. What if any kind of scientific content were citable? High visibility of the content. Easy re-use and verification. Scientific reputation for the collection and documentation of content (Citation Index). Encouraging the Brussels declaration on STM publishing. Avoiding duplications. Motivation for new research.
  • 11. How to achieve this? Science is global • It needs global standards • Global workflows • Cooperation of global players Science is carried out locally • By local scientists • Being part of local infrastructures • Having local funders
  • 12. DataCite: Global consortium carried by local institutions; focused on improving the scholarly infrastructure around datasets and other non-textual information; focused on working with data centres and organisations that hold content; providing standards, workflows and best practice; initially, but not exclusively, based on the DOI system; founded December 1st 2009 in London.
  • 13. DataCite structure (diagram labels): International DOI Foundation, DataCite, Managing Agent (TIB), Member Institution (each shown with three Data Centres), …, Member, Associate Stakeholder, Works with
  • 14. DataCite members: 1. Technische Informationsbibliothek (TIB) 2. Canada Institute for Scientific and Technical Information (CISTI) 3. California Digital Library, USA 4. Purdue University, USA 5. Office of Scientific and Technical Information (OSTI), USA 6. Library of TU Delft, The Netherlands 7. Technical Information Center of Denmark 8. The British Library 9. ZB Med, Germany 10. ZBW, Germany 11. Gesis, Germany 12. Library of ETH Zürich 13. L’Institut de l’Information Scientifique et Technique (INIST), France 14. Swedish National Data Service (SND) 15. Australian National Data Service (ANDS) 16. Conferenza dei Rettori delle Università Italiane (CRUI) 17. National Research Council of Thailand (NRCT) 18. The Hungarian Academy of Sciences 19. University of Tartu, Estonia 20. Japan Link Center (JaLC) 21. South African Environmental Observation Network (SAEON) 22. European Organisation for Nuclear Research (CERN) Affiliated members: 1. Digital Curation Center (UK) 2. Microsoft Research 3. Interuniversity Consortium for Political and Social Research (ICPSR) 4. Korea Institute of Science and Technology Information (KISTI) 5. Beijing Genomic Institute (BGI) 6. IEEE 7. Harvard University Library 8. World Data System (WDS) 9. GWDG
  • 15. What type of data are we talking about? [Figures: sediment core profiles plotted against age (kyr) for cores PS 1389-3, PS 1390-3, PS 1431-1, PS 1640-1, and PS 1648-1; grain size and geochemistry map, source: Baltic Sea Research Institute, Warnemünde.] Earthquake events => doi:10.1594/GFZ.GEOFON.gfz2009kciu Climate models => doi:10.1594/WDCC/dphase_mpeps Sea bed photos => doi:10.1594/PANGAEA.757741 Distributed samples => doi:10.1594/PANGAEA.51749 Medical case studies => doi:10.1594/eaacinet2007/CR/5-270407 Computational model => doi:10.4225/02/4E9F69C011BC8 Audio record => doi:10.1594/PANGAEA.339110 Grey Literature => doi:10.2314/GBV:489185967 Videos => doi:10.3207/2959859860
  • 16. Anything that is the foundation of further research is research data. Data is evidence.
  • 17. Over 3,200,000 DOI names registered so far. 290 data centers. 10,000,000 resolutions in 2013. DataCite Metadata schema published (in cooperation with all members) http://schema.datacite.org DataCite MetadataStore http://search.datacite.org DataCite in 2014
  • 18. DataCite search Searchterm: * Searchterm: uploaded:[NOW-7DAY TO NOW] Searchterm: relatedIdentifier:* Searchterm: relatedIdentifier:issupplementto:10.1029* Searchterm:relatedIdentifier:*:10.1055*
  • 22. OAI and Statistics OAI Harvester http://oai.datacite.org DataCite statistics (resolution and registration) http://stats.datacite.org
  • 25. DataCite Content Service Service for displaying DataCite metadata Different formats (BibTeX, RIS, RDF, etc.) Content negotiation (through MIME type) • Access through DOI proxy (http://dx.doi.org) • First implemented by CNRI and CrossRef • Documentation: http://www.crosscite.org/cn/
  • 26. Content negotiation Optimized for machine-to-machine (m2m) communication using the Accept header of the HTTP protocol: curl -L -H "Accept: MIME_TYPE" http://dx.doi.org/DOI Try a shortcut in any web browser: http://data.datacite.org/MIME_TYPE/DOI http://data.crossref.org/DOI
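The curl call on this slide can be sketched in Python with only the standard library. This is a minimal sketch, not DataCite's own client: the function names `build_request` and `fetch` are hypothetical, and `application/x-bibtex` is assumed here as the MIME type for the BibTeX format mentioned on the previous slide.

```python
from urllib.request import Request, urlopen

DOI_PROXY = "http://dx.doi.org/"  # DOI proxy named on the slide


def build_request(doi: str, mime_type: str) -> Request:
    """Build a request asking the DOI proxy for `doi` rendered as `mime_type`."""
    # The Accept header carries the requested format, as in `curl -H "Accept: ..."`.
    return Request(DOI_PROXY + doi, headers={"Accept": mime_type})


def fetch(doi: str, mime_type: str, timeout: float = 10.0) -> str:
    """Send the request and return the response body as text.

    urlopen follows redirects by default, which mirrors `curl -L`.
    """
    with urlopen(build_request(doi, mime_type), timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `fetch("10.5524/100005", "application/x-bibtex")` would need network access; what comes back depends on the formats the registration agency supports for that DOI.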
  • 27. Resolving to the citation http://data.datacite.org/application/x-datacite+text/10.5524/100005 Li, J; Zhang, G; Lambert, D; Wang, J (2011): Genomic data from Emperor penguin. GigaScience. http://dx.doi.org/10.5524/100005
  • 28. Resolving to the RDF metadata http://data.datacite.org/application/rdf+xml/10.5524/100005 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:owl="http://www.w3.org/2002/07/owl#" xmlns:j.0="http://purl.org/dc/terms/" > <rdf:Description rdf:about="http://dx.doi.org/10.5524/100005"> <j.0:identifier>10.5524/100005</j.0:identifier> <j.0:creator>Li, J</j.0:creator> <j.0:creator>Zhang, G</j.0:creator> <j.0:creator>Wang, J</j.0:creator> <owl:sameAs>doi:10.5524/100005</owl:sameAs> <owl:sameAs>info:doi/10.5524/100005</owl:sameAs> <j.0:publisher>GigaScience</j.0:publisher> <j.0:creator>Lambert, D</j.0:creator> <j.0:date>2011</j.0:date> <j.0:title>Genomic data from the Emperor penguin (Aptenodytes forsteri)</j.0:title> </rdf:Description></rdf:RDF>
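Once retrieved, an RDF/XML record like the one above can be read with a standard XML parser. The sketch below copies the slide's record into a string and pulls out the title and creators; the helper name `summarize` and the `dcterms` prefix label are choices made for this example (the record itself uses the generated prefix `j.0`).

```python
import xml.etree.ElementTree as ET

# The RDF/XML record shown on the slide, copied verbatim.
RDF_XML = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
 xmlns:owl="http://www.w3.org/2002/07/owl#"
 xmlns:j.0="http://purl.org/dc/terms/">
 <rdf:Description rdf:about="http://dx.doi.org/10.5524/100005">
  <j.0:identifier>10.5524/100005</j.0:identifier>
  <j.0:creator>Li, J</j.0:creator>
  <j.0:creator>Zhang, G</j.0:creator>
  <j.0:creator>Wang, J</j.0:creator>
  <owl:sameAs>doi:10.5524/100005</owl:sameAs>
  <owl:sameAs>info:doi/10.5524/100005</owl:sameAs>
  <j.0:publisher>GigaScience</j.0:publisher>
  <j.0:creator>Lambert, D</j.0:creator>
  <j.0:date>2011</j.0:date>
  <j.0:title>Genomic data from the Emperor penguin (Aptenodytes forsteri)</j.0:title>
 </rdf:Description>
</rdf:RDF>"""

# Prefix labels are local to this script; only the namespace URIs matter.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dcterms": "http://purl.org/dc/terms/",
}


def summarize(rdf_xml: str) -> dict:
    """Extract the subject URI, title, and creators from one rdf:Description."""
    root = ET.fromstring(rdf_xml)
    description = root.find("rdf:Description", NS)
    return {
        "about": description.get("{%s}about" % NS["rdf"]),
        "title": description.findtext("dcterms:title", namespaces=NS),
        "creators": [e.text for e in description.findall("dcterms:creator", NS)],
    }


record = summarize(RDF_XML)
```

Creators come back in document order, which for this record differs from the order in the formatted citation on the previous slide.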
  • 29. Example of use This allows persistent identification of RDF statements! Implemented for all over 65 million CrossRef and DataCite DOI names Example of use: DOI Citation Formatter http://www.crosscite.org/citeproc/
  • 32. 2012: STM, CrossRef and DataCite Joint Statement 1. To improve the availability and findability of research data, the signers encourage authors of research papers to deposit researcher validated data in trustworthy and reliable Data Archives. 2. The Signers encourage Data Archives to enable bidirectional linking between datasets and publications by using established and community endorsed unique persistent identifiers such as database accession codes and DOI's. 3. The Signers encourage publishers and data archives to make visible or increase visibility of these links from publications to datasets and vice versa
  • 33. Example The dataset: Storz, D et al. (2009): Planktic foraminiferal flux and faunal composition of sediment trap L1_K276 in the northeastern Atlantic. http://dx.doi.org/10.1594/PANGAEA.724325 Is supplement to the article: Storz, David; Schulz, Hartmut; Waniek, Joanna J; Schulz-Bull, Detlef; Kucera, Michal (2009): Seasonal and interannual variability of the planktic foraminiferal flux in the vicinity of the Azores Current. Deep-Sea Research Part I-Oceanographic Research Papers, 56(1), 107-124, http://dx.doi.org/10.1016/j.dsr.2008.08.009
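The "is supplement to" link in this example is the kind of relation the DataCite metadata schema records in a `relatedIdentifier` element, which is what the `relatedIdentifier:issupplementto:...` search terms shown earlier query. A minimal sketch of building that element follows; the helper name `related_identifier` is hypothetical, and a full DataCite record would wrap this element in the schema's other required fields.

```python
import xml.etree.ElementTree as ET


def related_identifier(target_doi: str, relation_type: str = "IsSupplementTo") -> ET.Element:
    """Build a DataCite-style relatedIdentifier element pointing at `target_doi`."""
    element = ET.Element(
        "relatedIdentifier",
        {"relatedIdentifierType": "DOI", "relationType": relation_type},
    )
    element.text = target_doi
    return element


# The PANGAEA dataset above is a supplement to the Deep-Sea Research article.
link = related_identifier("10.1016/j.dsr.2008.08.009")
xml_fragment = ET.tostring(link, encoding="unicode")
```

Registering the link on the dataset side lets a publisher or search service discover it from the article DOI alone.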
  • 36. Next steps ODIN project with ORCID: http://datacite.labs.orcid-eu.org/ MoU with Thomson Reuters to cooperate on the Data Citation Index DataCite plugin for the next DSpace release (early 2014)
  • 40. Cooperation MoU with ORCID Agreement with Re3Data and DataBib to include their service in 2016 MoU with RDA to become organisational affiliate
  • 42. Let us get back to libraries
  • 43. The wave Growth of Information – Diversity of media types and formats User requirements – e.g., Science 2.0, collaborative networks, social media
  • 44. A threat? Information overload is only a problem for manual curation. Google is not complaining about data deluge; they're constantly trying to get more data. The more data you throw, the better the filter gets. Developing and maintaining these tools is a classic task for libraries! Don’t turn off the taps, build boats.
  • 45. It is not only a challenge … … it is an opportunity We all should ride the wave …
  • 47. Guidelines and Resources for OSTP Data Access Plans NISO Webinar April 2014 www.icpsr.umich.edu/datamanagement
  • 48. The OSTP Memo Guidelines for Response • Released February 2013, this memo directs funding agencies with an annual R&D budget over $100 million to develop a public access plan for disseminating the results of their research • ICPSR stresses that standards and guidelines for many of the requirements currently exist • The slides to follow provide an overview of the access plan elements including guidelines and resources on how to respond to meet digital data requirements in the memo
  • 49. The OSTP Memo – A Review • Released February 22, 2013 • A concern for investment: “Policies that mobilize these publications and data for re-use through preservation and broader public access also maximize the impact and accountability of the Federal research investment.” • Federal agencies with over $100 M annually in R&D expenditures to develop plans to support increased public access to the results of research funded by the Federal Government • Plans to contain eight points
  • 50. The Eight Points of the Plan 1. Strategy for leveraging existing archives 2. Strategy to improve the public’s ability to locate and access digital data 3. Approach to optimize search, archival, and dissemination features that encourage innovation in accessibility & interoperability and ensure long-term stewardship 4. A plan to notify awardees & researchers of their obligations 5. Strategy for measuring and enforcing compliance with the plan 6. Identification of resources within the existing agency budget to implement plan 7. Timeline for implementation 8. Identification of special circumstances that prevent the agency from meeting memo objectives
  • 51. Data Portion of Memo - 13 Elements • The portion of the memo describing objectives for public access to data stresses 13 elements for a public access plan • The elements are also summarized online within ICPSR’s Web site: http://icpsr.umich.edu/content/datamanagement/ostp.html
  • 54. RDAP 2014 Panel: Funding agency (NOAA, NSF, NIH) responses to federal requirements for public access to research results Wendy Kozlowski (Cornell), Moderator http://www.slideshare.net/asist_org/rdap14-ostp-panel-introduction http://www.slideshare.net/asist_org/rdap-3-2714thakur
  • 55. Visit ICPSR Archives/Repositories already Meeting Public Access Requirements
  • 56. ICPSR – a 50-Year History of Providing Access to Research Data Established in 1962, ICPSR maintains and shares over 8,600 research datasets and hosts 16 public- access specialized collections of data funded by various government agencies and foundations. Our mission: ICPSR advances and expands social and behavioral research, acting as a global leader in data stewardship and providing rich data resources and responsive educational opportunities for present and future generations.
  • 57. ICPSR’s Data Management & Curation Goals • Quality - Data at ICPSR are enhanced with meaningful information to make them complete, self-explanatory, and usable for future researchers • Access – Sought by over 730 member institutions and indexed by all the major search engines, ICPSR data are easily discoverable and widely accessible to the public. • Citation - By providing standardized and well-recognized data citations, ICPSR ensures that data producers receive credit for their archived data • Preservation – For over 50 years, ICPSR has preserved its data resources for the long-term, guarding against deterioration, accidental loss, and digital obsolescence • Confidentiality - Stringent protections are in place for securing and distributing sensitive data • Educational Support – ICPSR has a long tradition of supporting training in quantitative methods, scientific data management, and resources for instruction
  • 58. ICPSR’s Data Management & Curation Site http://www.icpsr.umich.edu/datamanagement/
  • 61. Maximize Access "Maximize access, by the general public and without charge, to digitally formatted scientific data created with Federal funds" • Increasing access to research data prevents the duplication of effort, provides accountability and verification of research results, and increases opportunities for innovation and collaboration. • Finding and accessing data in repositories requires descriptive metadata ("data about data") in standard, machine-actionable form. Metadata help search engines find data, and help researchers understand the context of data collections. • Standards already exist: see Data Documentation Initiative – http://www.ddialliance.org/
  • 62. Maximize Access cont. • Access also involves knowing how to interpret the data. Incomplete data limit reuse. Obsolete data formats can be unreadable. – Repositories 'curate' or enhance data to make it complete, self-explanatory, and usable for future researchers. This includes adding descriptive labels, correcting coding errors, gathering documentation, and standardizing the final versions of files. This is called “data curation.” – Like museums that curate art or artifacts for study and understanding now and in the future, data archives curate data with the same goals. • Data curation is crucial to maximizing access. Resources for curating data: – ICPSR's Guide to Social Science Data Preparation and Archiving – UK Data Archive's Managing and Sharing Data guide.
  • 63. Protect Confidentiality and Privacy • It is critically important to protect the identities of research subjects. • Disclosure risk is a term that is often used for the possibility that a data record from a study could be linked to a specific person. • Concerns about disclosure risk have grown as more datasets have become available online, and it has become easier to link research datasets with publicly available external databases.
  • 64. Protect Confidentiality and Privacy cont. Protecting confidentiality of research subjects is not a viable argument for not sharing data. Infrastructure, including virtual and physical data enclaves, already exists: • Restricted-Use Data are made available for research purposes for use by investigators who agree to stringent conditions for the use of the data and its physical safekeeping. • Enclave Data are those datasets which present especially acute disclosure risks. They can be accessed only on-site in ICPSR's physical data enclave in Ann Arbor. Investigators must be approved. Their notes and analytic output are reviewed by ICPSR staff.
  • 65. Balance Demands of Long-term Preservation and Access • Preserving digital data requires much more than storing files on a server, desktop, or in the cloud! • Digital preservation is the active and ongoing management of digital content to lengthen the lifespan and mitigate against loss, including physical deterioration, format obsolescence, and hardware and software failure.
  • 66. Balance Demands of Long-term Preservation and Access cont. • Not all data are worth preserving indefinitely; less valuable or easily producible data may be preserved for shorter periods. • Establish selection and appraisal guidelines that make it clear what to save or discard. – Selection criteria consider factors like availability, confidentiality, copyright, quality, file format, and financial commitment.
  • 67. Use of Data Management Plans • Data management plans describe how researchers will provide for long-term preservation of, and access to, scientific data in digital formats. • Data management plans provide opportunities for researchers to manage and curate their data more actively from project inception to completion. • See ICPSR's resource: Guidelines for Effective Data Management Plans
  • 68. Include Cost of Data Management in Funding Proposals • Data management services carry real costs, ranging from personnel to storage to software. • Just as maintenance costs are routinely built into physical infrastructure development, data management costs should be built into data development. • Long-term access to data requires durable institutions that plan on a scale of decades and even generations. • Cost resources: – DataONE's Provide budget information for your data management plan – UK Data Archive's Costing Tool: Data Management Planning.
  • 69. Evaluate Data Management Plans & Ensure Compliance • Plans help researchers prepare for working with and preserving data, repositories get ready to accession and provide access, and agencies understand the community needs for archiving and access. Evaluation helps refine plans so they are realistic and attainable. • If data management plans are to be a standard component of funding applications, funding recipients should be held accountable for deviations from the originally stated plans.
  • 70. Promote Public Deposit of Data • Public deposit of data helps to ensure the long-term accessibility and preservation of the data. • It removes the burden of ongoing maintenance and care (and user support) from the researcher and provides a stable system to which data can be entrusted. • Many sustainable online repositories are already available to host and archive research data. These may include discipline-specific repositories, archives administered by funding agencies, or institutional repositories. • Databib, a searchable directory of over 500 research data repositories, can help locate relevant repositories by subject area.
  • 71. Preserve Intellectual Property Rights and Commercial Interests Original research may be both commercially valuable and proprietary. There are several approaches to managing these interests, including: – Tailor copyright and patent licenses, such as through Creative Commons licenses – Establish an embargo period or delayed dissemination on distribution.
  • 72. Private-sector Cooperation to Improve Access Encourage cooperation with the private sector to improve data access and compatibility. Issues to consider: • What funding structures will be in place to ensure that both organizations involved are benefiting from the partnership? • Will the partnership require any rights to be transferred to the private organization? • How does private-sector cooperation affect access restrictions and intellectual property concerns?
  • 73. Mechanisms for Identification & Attribution of Data • Properly citing data encourages the replication of scientific results, improves research standards, guarantees persistent reference, and gives proper credit to data producers. • Citing data is straightforward. Each citation must include the basic elements that allow a unique dataset to be identified over time: title, author, date, version, and persistent identifier. • Resources: ICPSR's Data Citations page, IASSIST's Quick Guide to Data Citation, DataCite.
  • 74. Data Stewardship Workforce Development In coordination with other agencies and the private sector, support training, education, and workforce development related to scientific data management, analysis, storage, preservation, and stewardship. Recent data stewardship workforce development in the United States has included: • Digital Preservation Outreach and Education, from the Library of Congress • Digital Preservation Management tutorial, from Cornell University, ICPSR, and MIT • DigCCurr, from the University of North Carolina
  • 75. Data Stewardship Workforce Development cont. ICPSR hosts data stewardship courses as part of its Summer Program in Quantitative Methods of Social Research. These include: • Curating and Managing Research Data for Re-Use • Assessing and Mitigating Disclosure Risk: Essentials for Social Science • Providing Social Science Data Services: Strategies for Design and Operation
  • 76. Long-term Support for Repository Development • ICPSR advocates long-term funding for specialized, long-lived, trustworthy, and sustainable repositories that can mediate between the needs of scientific disciplines and data preservation requirements. • As digital data management becomes an increasingly important part of scientific research, funding agencies must contribute to the developing ecosystem of services and technologies that support access to and preservation of data. • For more information, including various long-term funding models, see ICPSR’s 2013 position paper – “The Price of Keeping Knowledge”
  • 77. Get More information • Visit ICPSR’s Data Management & Curation site: http://www.icpsr.umich.edu/datamanagement • Contact us: – netmail@icpsr.umich.edu – (734) 647-2200
  • 80. Joint Declaration of Data Citation Principles: Implementation and Compliance in the Dataverse Repository Mercè Crosas, Ph.D. Twitter: @mercecrosas Director of Data Science Institute for Quantitative Social Science, Harvard University NISO Virtual Conference, April 23, 2014
  • 81. A brief History of Data Citation Altman M., Crosas M., 2014, “The Evolution of Data Citation: From Principles to Implementation” IASSIST Quarterly, In Press • 1906: Chicago Manual of Style Standards in Scholarly Citation: author/creator, title, dates, publisher or distributor of the work • 1960: First scientific digital data archives • 1977 – 1998: ASBR (“Data File” type), MARC (machine readable catalog) • 1999 – 2014: Data Repositories (NESSTAR, Dataverse, Dryad, Figshare), DOI services (DataCite)
  • 82. The Making of the Principles • Decades of research and practices in data citation • Consolidated to a single set of Principles • By a synthesis group representing 25+ organizations • Driven by the premise that: "sound, reproducible scholarship rests upon a foundation of robust, accessible data" and "data should be considered legitimate, citable products of research"
  • 83. Joint Declaration of Data Citation Principles 1 Importance 2 Credit and Attribution 3 Evidence 4 Unique Identification 5 Access 6 Persistence 7 Specificity and Verifiability 8 Interoperability and flexibility Full Principles: https://www.force11.org/datacitation Endorsement: https://www.force11.org/datacitation/endorsements
  • 84. Joint Declaration of Data Citation Principles 1. Importance Data should be considered legitimate, citable products of research. Data citations should be accorded the same importance in the scholarly record as citations of other research objects, such as publications.
  • 85. Joint Declaration of Data Citation Principles 2. Credit and Attribution Data citations should facilitate giving scholarly credit and normative and legal attribution to all contributors to the data, recognizing that a single style or mechanism of attribution may not be applicable to all data.
  • 86. Joint Declaration of Data Citation Principles 3. Evidence In scholarly literature, whenever and wherever a claim relies upon data, the corresponding data should be cited.
  • 87. Joint Declaration of Data Citation Principles 4. Unique Identification A data citation should include a persistent method for identification that is machine actionable, globally unique, and widely used by a community.
  • 88. Joint Declaration of Data Citation Principles 5. Access Data citations should facilitate access to the data themselves and to such associated metadata, documentation, code, and other materials, as are necessary for both humans and machines to make informed use of the referenced data.
  • 89. Joint Declaration of Data Citation Principles 6. Persistence Unique identifiers, and metadata describing the data, and its disposition, should persist -- even beyond the lifespan of the data they describe.
  • 90. Joint Declaration of Data Citation Principles 7. Specificity and Verifiability Data citations should facilitate identification of, access to, and verification of the specific data that support a claim. Citations or citation metadata should include information about provenance and fixity sufficient to facilitate verifying that the specific time slice, version and/or granular portion of data retrieved subsequently is the same as was originally cited.
  • 91. Joint Declaration of Data Citation Principles 8. Interoperability and flexibility Data citation methods should be sufficiently flexible to accommodate the variant practices among communities, but should not differ so much that they compromise interoperability of data citation practices across communities.
  • 92. About Dataverse • A software framework to build data repositories. • Provides a preservation and archival infrastructure, … while researchers share, keep control of and get recognition for their data through a web interface. • Harvard Dataverse is open to all researchers and disciplines. • It contains more than 50,000 data sets. • Other large Dataverse instances throughout the world: ODUM at UNC, Dutch Universities, Scholar Portal, Fudan University. • Dataverse 4.0 (June 2014) brings an entirely new UI and improved data publishing workflows.
  • 93. Data Citation Implementation in Dataverse The Dataverse generates a Data Citation for each deposited data set compliant with the Principles: Authors, Year, Dataset Title, DOI, Data Repository, UNF, version Example: Logan Vidal, 2013, "ANES data coding ", http://dx.doi.org/10.7910/DVN/23274 Harvard Dataverse, UNF:5:0fdUNzmCsyeqrVKtgUG74A==, V8
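The citation pattern on this slide (Authors, Year, Dataset Title, DOI, Data Repository, UNF, version) can be sketched as a small record type. This is an illustration only, not Dataverse's implementation: the class name `DataCitation`, its field names, and the comma separators are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class DataCitation:
    """The seven citation elements listed on the slide (field names are illustrative)."""
    authors: str
    year: int
    title: str
    doi: str
    repository: str
    unf: str
    version: str

    def format(self) -> str:
        # Render the DOI as a resolvable URL, as in the slide's example.
        doi_url = "http://dx.doi.org/" + self.doi
        parts = [self.authors, str(self.year), f'"{self.title}"',
                 doi_url, self.repository, self.unf, self.version]
        return ", ".join(parts)


citation = DataCitation(
    authors="Logan Vidal",
    year=2013,
    title="ANES data coding",
    doi="10.7910/DVN/23274",
    repository="Harvard Dataverse",
    unf="UNF:5:0fdUNzmCsyeqrVKtgUG74A==",
    version="V8",
)
text = citation.format()
```

Keeping the elements as separate fields, rather than one string, is what makes it possible to export the same citation as XML or JSON metadata.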
  • 94. Compliant with Principle 2 Principle 2: Credit and Attribution: …facilitate giving scholarly credit and … attribution to all contributors to the data, … Authors, Year, Dataset Title, DOI, Data Repository, UNF, version
  • 95. Compliant with Principles 4, 5, 6 Principles 4, 5, 6 Unique Identification: …machine actionable, globally unique, and widely used by a community … Access: … access to the data themselves and to such associated metadata, documentation, code, and other materials … Persistence: … even beyond the lifespan of the data they describe. Authors, Year, Dataset Title, DOI, Data Repository, UNF, version Resolves to landing page with access to metadata, docs, code and data
  • 97. Landing Page Example: Data, Code & Docs
  • 98. Compliant with Principle 7 Principle 7 Specificity and Verifiability: …provenance and fixity sufficient to facilitate verifying that the specific time slice, version and/or granular portion of data … Authors, Year, Dataset Title, DOI, Data Repository, UNF, version Universal Numerical Fingerprint: Independent of format
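The idea of a fingerprint that is "independent of format" can be illustrated with a toy example. The sketch below is not the UNF algorithm, and its output will not match any real UNF value; it only shows the general principle assumed here: normalize each value to a canonical text form, then hash the canonical form, so the result depends on the data and not on how a file happened to write it. The function name `toy_fingerprint` and the `digits` parameter are hypothetical.

```python
import base64
import hashlib


def toy_fingerprint(values, digits: int = 7) -> str:
    """Hash a numeric column after normalizing each value (NOT the real UNF)."""
    # Canonical form: scientific notation with a fixed number of significant
    # digits, so 1, 1.0 and "1.00" all become the same string.
    canonical = [format(float(v), f".{digits - 1}e") for v in values]
    digest = hashlib.sha256("\n".join(canonical).encode("utf-8")).digest()
    # Truncate and base64-encode to get a short printable fingerprint.
    return base64.b64encode(digest[:16]).decode("ascii")


from_floats = toy_fingerprint([1, 2.5, 3])
from_text = toy_fingerprint(["1.0", "2.50", "3"])  # e.g. values read from a CSV file
changed = toy_fingerprint([1, 2.5, 4])
```

Here `from_floats` and `from_text` agree because both inputs normalize to the same canonical strings, while `changed` differs because one value is different.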
  • 100. Compliant with Principle 8 Principle 8: Interoperability and flexibility: Dataverse exports all citation metadata in XML, JSON formats
  • 101. Implementation Suggestions for Publishers • Upgrade data citation to references section [Principle 1: Importance] • In article, cite data by claim [Principle 3: Evidence] • Provide guidelines for authors based on Principles, but customized to each journal [Principle 8: Interoperability and Flexibility] • Interoperate with, or recommend, trusted Data Repositories compliant with the Principles • Build tools to access machine-readable metadata from datasets Want to be involved? Join the Data Citation Implementation group: https://www.force11.org/datacitationimplementation
  • 102. Remaining Challenges • Challenges of Provenance: what is the chain of ownership and transformations to the data? • Challenges of Identity: what should be cited? At what level of granularity and versioning for large, dynamic datasets? • Challenges of Attribution: how do you support attribution for hundreds or thousands of contributors? Altman M., Crosas M., 2014, “The Evolution of Data Citation: From Principles to Implementation” IASSIST Quarterly, In Press
  • 103. NISO VIRTUAL CONFERENCE APRIL 23, 2014 – SUCCESSFUL TECHNIQUES FOR SCIENTIFIC DATA MANAGEMENT Purdue University Research Repository (PURR): A Commitment to Supporting Researchers Michael Witt Head, Distributed Data Curation Center Associate Professor of Library Science http://www.lib.purdue.edu/research/witt E-mail: mwitt@purdue.edu
  • 104. OVERVIEW 1. Preaching to the choir, but still: Data 2. Ecosystem of data repositories 3. Our campus data repository & service (PURR) a. Data management planning b. Project space for collaboration c. Publishing data d. Archiving data 4. Creating opportunities for liaison librarians & helping to operationalize library research data services 5. Roles and collaboration 6. Conclusion
  • 107. ECOSYSTEM OF DATA REPOSITORIES • Publisher, e.g., Dryad • Sub/Disciplinary, e.g., RKMP • Consortium, e.g., ICPSR • Country, e.g., Research Data Australia • Government, e.g., data.gc.ca • Research center, e.g., NASA GES DISC • Instrument, e.g., CHANDRA • General-purpose, e.g., FigShare • Roll-your-own, e.g., DataVerse • University, e.g., PURR • Many others…
  • 108. CAMPUS COLLABORATION The PURR service is a collaborative effort of the Purdue University Libraries, Office of the Vice President for Research, and Information Technology at Purdue. PURR is a designated university core research facility. Designated community: Purdue University faculty, staff, and graduate student researchers; their collaborators; and the current and future consumers of their data.
  • 109. LIBRARY STRATEGIC PLAN Data is written into the three pillars of our strategic plan: • Learning “…information literacy defined broadly to include digital information literacy, science literacy, data literacy, health literacy, etc…” • Scholarly Communication “Lead in data-related scholarship and initiatives” • Global Challenges “We will lead in international initiatives in information literacy and e-science and … contribute to international information literacy, learning spaces, data management, and scholarly communication initiatives.” https://www.lib.purdue.edu/sites/default/files/admin/plan2016.pdf
  • 111. CURATION LIFECYCLE SERVICE MODEL Witt, M. (2012). Co-designing, Co-developing, and Co-implementing an Institutional Data Repository Service. Journal of Library Administration, 52(2). DOI:10.1080/01930826.2012.655607. http://docs.lib.purdue.edu/lib_fsdocs/6/ Digital Curation Centre’s Curation Lifecycle Model: http://www.dcc.ac.uk/resources/curation-lifecycle-model
  • 112. PURR SERVICE – INTERNAL MODEL
  • 113. PURR SERVICE – EXTERNAL MODEL
  • 114. INTRO TO PURR VIDEO http://www.youtube.com/watch?v=Yw0IJj7FqA8
  • 115. PURR POSTCARD AND POSTER
  • 116. Dimensions of Discovery (Winter 2013). Office of the Vice President for Research, Purdue University, http://www.purdue.edu/research/vpr/publications/docs/dimensions/Winter2013.pdf
  • 117. DATA MANAGEMENT PLANS • Boilerplate text • Example DMPs • DMP Self-Assessment • DMPTool • Workshops • Tutorials • Reference and consultation with subject-specialist librarian and/or data services specialist https://purr.purdue.edu/dmp
  • 118. CREATE PROJECT AND COLLABORATE Create: • any Purdue faculty, staff, or graduate student researcher can create projects • describe the project • disclaim use of sensitive or restricted data • receive a default allocation of storage • register a grant award to increase allocation • invite collaborators to join project Collaborate: • git repository to share and version files (Google Drive integration) • wiki • blog • to-do list management and project notes • newsfeed • stage data publications
  • 119. SENSITIVE AND RESTRICTED DATA Sensitive data: Information whose access must be guarded due to proprietary, ethical, or privacy considerations. This classification applies even though there may not be a civil statute requiring this protection. Restricted data: Information protected because of protective statutes, policies or regulations. This level also represents information that isn't by default protected by legal statute, but for which the Information Owner has exercised their right to restrict access. http://www.purdue.edu/securepurdue/policies/dataConfident/restrictions.cfm • FERPA → Registrar • HIPAA → Health Center • IRB → Human Research Protection Program • Export Control → Vice President for Research
  • 121. PROJECT SPACE PURR project tutorial video: http://www.youtube.com/watch?v=q5xGO_oF9uQ
  • 123. DATA PUBLICATION PURR publication tutorial video: http://www.youtube.com/watch?v=jYBcsfiRhio
  • 124. PRESERVATION AND STEWARDSHIP Initial commitment of 10 years • data producer or dept can fund for longer • otherwise remanded to library collection Design guided by ISO 16363 / TRAC • Organization infrastructure • Digital object management • Technical infrastructure & Security Risk Management
  • 125. ARCHIVAL INFORMATION PACKAGE BagIt “bag” contains: • bag declaration file, manifest file, data files Metadata file (XML): • METS wrapper • Dublin Core and MODS (descriptive metadata) • PREMIS (preservation metadata) MetaArchive: LOCKSS replication network (7 copies)
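The three bag components named on this slide (bag declaration file, manifest file, data files) can be sketched with the standard library alone. This is a minimal illustration of the BagIt layout, not PURR's packaging module: the function name `make_bag`, the SHA-256 checksum, the `BagIt-Version: 1.0` declaration, and the sample payload are all assumptions for this example, and the METS, MODS, and PREMIS metadata file is left out.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Dict


def make_bag(bag_dir: Path, payload: Dict[str, bytes]) -> None:
    """Write a minimal BagIt-style bag: declaration, manifest, and data files."""
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True, exist_ok=True)

    manifest_lines = []
    for name, content in sorted(payload.items()):
        # Data files live under data/; the manifest records one checksum per file.
        (data_dir / name).write_bytes(content)
        checksum = hashlib.sha256(content).hexdigest()
        manifest_lines.append(f"{checksum}  data/{name}")

    # Bag declaration file: BagIt version and tag file encoding.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n", encoding="utf-8"
    )
    (bag_dir / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )


# Build a bag in a temporary directory and list what was written.
with tempfile.TemporaryDirectory() as tmp:
    bag = Path(tmp) / "dataset-bag"
    make_bag(bag, {"readings.csv": b"x,y\n1,2\n"})
    created = sorted(p.relative_to(bag).as_posix() for p in bag.rglob("*") if p.is_file())
    manifest = (bag / "manifest-sha256.txt").read_text(encoding="utf-8")
```

Because the manifest stores a checksum for every data file, each replicated copy of the bag can be verified independently, which is what makes the format a good fit for a preservation network.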
  • 126. SUPPORTING POLICIES • Terms of Deposit • Collection Development Policy • Preservation Policy • Preservation Strategies • File Format Recommendations • Preservation Support Policy https://purr.purdue.edu/legal/terms
  • 127. REPOSITORY SOFTWARE: HUBZERO • HUBzero, open source software: http://hubzero.org • Maintained by HUBzero Foundation, originally funded by NSF • Over 50 hubs online, supporting different virtual scientific communities, hundreds of thousands of users • http://nanoHUB.org - grandfather of the hubs, exemplar • Built to facilitate virtual communities and online, scientific collaboration, research/teaching • Collaborate, develop, publish, access, execute, and manage content using a web browser • Software tools, documents, multimedia, learning objects, datasets, etc. • Social network functionality and collaboration features • LAMP stack, Joomla framework, OpenVZ and Rappture, git, etc. • EZID interface to mint DataCite DOIs (coming soon: ORCID) • Some extensions customized for PURR not in core distribution
  • 128. PURR TEAM • Executive Committee: Dean of Libraries, Vice President for Research, Chief Information Officer • Steering Committee: 2 from libraries, 2 from IT, 2 from research office and sponsored programs, 3 domain faculty researchers • Personnel: Project Director (.50), Technologists (3.85), HUBzero Liaison (.35), Metadata Specialist (.20), Digital Archivist (.25), Digital Data Repository Specialist (1.0)
  • 129. LIBRARIES PURR TEAM 129 PURR Project Director (50%) Michael Witt Three examples of responsibilities: • resourcing (personnel, budget, coffee, etc.) • oversees development roadmap, service definition and design • communicates across constituencies
  • 130. LIBRARIES PURR TEAM 130 Digital Data Repository Specialist Courtney Matthews Three examples of responsibilities: • primary point of contact for helping users and librarians utilize PURR • coordinates outreach, support, and development (tons of community engagement) • helps to acquire, organize, and ingest data collections
  • 131. LIBRARIES PURR TEAM Digital Library Software Developer Mark Fisher Three examples of responsibilities: • developing a module to create archival information packages from datasets published in PURR • integrating PURR with MetaArchive, a LOCKSS preservation network • web and graphics design to keep the PURR website current and dynamic
  • 132. LIBRARIES PURR TEAM 132 Digital Archivist (25%) Carly Dearborn Three examples of responsibilities: • define and implement AIP as well as long-term digital object management and supporting practices • lead policy development and documentation such as PURR’s preservation policy, preservation strategies, file format recommendations, and preservation support policy • consult with data producers and librarians on file formats, appraisal of data collections, and data management planning
  • 133. LIBRARIES PURR TEAM Metadata Specialist (20%) Amy Barton Three examples of responsibilities: • consult with data producers and librarians to identify and apply appropriate metadata schemas and vocabularies to describe datasets • design and implement metadata for preservation, findability, and citability (i.e., DataCite DOIs) • enhance and provide quality assurance for metadata for acquired data collections
  • 134. KEY PLAYERS: SUBJECT LIBRARIANS 134
  • 135. KEY PLAYERS: DATA SPECIALISTS 135
  • 136. Librarians consult on data management plans in their subject areas. Creating opportunities for librarians to interact with researchers about data 136
  • 137. Librarian is notified by e-mail when a new project is created or a grant is awarded, based on department affiliation of Purdue project owner. Creating opportunities for librarians to interact with researchers about data 137
  • 138. Librarian may consult or collaborate on project if needed. Creating opportunities for librarians to interact with researchers about data 138
  • 139. Librarians review and post submitted datasets. Creating opportunities for librarians to interact with researchers about data 139
  • 140. At the end of initial commitment (10 years), archived and published datasets are remanded to the Libraries’ collection. A librarian working with the digital archivist selects (or not) the dataset for the collection. Creating opportunities for librarians to interact with researchers about data
  • 141. CONCLUSION • Soft launch in 2012; 2013 was our first full year • PURR included in 1,040 data management plans with proposals from Purdue (tracked by our sponsored programs office) • 79 grants awarded • 1,466 registered researchers • 331 active research projects • Average project team size: 4 people • Average files per project: 67 files DMP analysis (n=111 NSF proposals from Purdue, Jan-Jun 2013) • 49% PURR • 29% Local computer or server • 14% Disciplinary repository (e.g., ICPSR, Protein Data Bank, nanoHUB, NEES) • 8% No data or not applicable 141
  • 142. THANK YOU PURR: http://purr.purdue.edu Michael Witt Head, Distributed Data Curation Center Associate Professor of Library Science http://www.lib.purdue.edu/research/witt E-mail: mwitt@purdue.edu
  • 143. The Roles of Data Citation in Data Management NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management http://www.niso.org/news/events/2014/virtual/data_deluge/ Christine L. Borgman Professor and Presidential Chair in Information Studies University of California, Los Angeles hudsonalpha.org NASA Astronomy Picture of the Day
  • 144. Deluge!!! Data! • Scientists • Social Scientists • Funding agencies • Policy makers • Humanists • Librarians • Publishers • Internet architects http://www.guzer.com/pictures/suprise_suprise.jpg
  • 146. 146 Data are representations of observations, objects, or other entities used as evidence of phenomena for the purposes of research or scholarship. C.L. Borgman, 2014, forthcoming, Big Data, Little Data, No Data: Scholarship in the Networked World, MIT Press. hudsonalpha.org
  • 147. Publications are arguments made by authors, and data are the evidence used to support the arguments. C.L. Borgman, 2014, forthcoming, Big Data, Little Data, No Data: Scholarship in the Networked World, MIT Press.
  • 148. Citing publications vs. data • If publications are the stars and planets of the scientific universe, data are the ‘dark matter’ – influential but largely unobserved in our mapping process* *CODATA-ICSTI Task Group on Data Citation Standards and Practices, 2013, p. 54
  • 149. Authorship and Attribution • Publications – Independent units – Authorship is negotiated • Data – Compound objects – Ownership is rarely clear – Attribution • Long term responsibility: Investigators • Expertise for interpretation: Data collectors and analysts hudsonalpha.org
  • 150. Attribution of data • Legal responsibility – Licensed data – Specific attribution required • Scholarly credit: contributorship – Author of data – Contributor of data to this publication – Colleague who shared data – Software developer – Data collector – Instrument builder – Data curator – Data manager – Data scientist – Field site staff – Data calibration – Data analysis, visualization – Funding source – Data repository – Lab director – Principal investigator – University research office – Research subjects – Research workers, e.g., citizen science… 150
  • 151. Scholarly credit • Publications • Publications • Publications • Publications • Publications • Publications • Awards and honors • Grants • Teaching • Service • Data http://blog.startfreshtoday.com/Portals/170402/images/improve-credit-score1.jpg
  • 152. Everyone is overwhelmed with life and email and, in academia, trying to get funding and write papers. Whether something is open or not open is not highest on the priority list. There’s still need for making people aware of open science issues and making it easy for them to participate if they want to. Jonathan Eisen, genetics professor at the University of California, Davis DESPITE BEING GOOD FOR YOU AND FOR SCIENCE, TOO MANY CHALLENGES AND TOO LITTLE TIME Rewards for publications Effort to document data Competition, priority Control, ownership Slide courtesy of Merce Crosas, Harvard IQSS; Mashup of Borgman and Crosas slides 152
  • 153. Data citation as solution to… • Credit • Attribution • Discovery
  • 154. Research practices • Goal is publications that report the research Vs. • Goal is data that are reusable by others Image: Alyssa Goodman, Harvard Astronomy 154
  • 155. Scientific data creation, use, and reuse* • What are the characteristics of data use and reuse within each research community? • How do characteristics of data use and reuse vary within and between research communities? Fastlizard4’s image of a Geiger counter setup to measure background radiation (flickr.com) 155 * Wynholds, L. A., Wallis, J. C., Borgman, C. L., Sands, A., & Traweek, S. (2012). Data, data use, and scientific inquiry: two case studies of data practices. In Proceedings of the 12th ACM/IEEE-CS joint conference on Digital Libraries (pp. 19–22). New York, NY, USA: ACM. doi:10.1145/2232817.2232822 * Wallis, J. C., Rolando, E., & Borgman, C. L. (2013). If We Share Data, Will Anyone Use Them? Data Sharing and Reuse in the Long Tail of Science and Technology. PLoS ONE, 8(7), e67332. doi:10.1371/journal.pone.0067332
  • 156. Research Sites • Center for Embedded Networked Sensing – Science research • Environment • Seismology – Technology research • Instrumentation • Networks – Small science – Circa 300 partners • Sloan Digital Sky Survey – Science research • Astronomy • Astrophysics – Technology research • Instrumentation • Databases – Big science – Circa 400 partners
  • 157. Interview Questions Topic Question CENS SDSS Data Types Within your work, what is typically considered to be “data?” X X How do you distinguish between different levels or states of data? X Data Sources What are the main sources of data for your research projects? X Do you routinely or have you ever used data that you did not generate yourself, or from beyond the immediate project team? X X Data Use When you look at data, what are you hoping to find in it? X X When, if ever, do you reuse your datasets? X X
  • 158. Dimensions of Data • Observed vs. simulated data • Lab generated vs. field collected • Collected by team vs. obtained from external sources • Old vs. new data • Raw vs. processed data • Foreground vs. background data 158
  • 159. Research findings • Uses of data vary by type of inquiry • Foreground data – Research questions – Curated – Cited • Background data – Necessary for comparison or calibration – Rarely curated – Rarely cited • Value of data lies in their use • “Use” of data is not reflected in citations http://drpinna.com/the-gold-standard-22948
  • 160. Sharing and discovering data • Means to share data – Curated data archives: NASA, UKDA, ICPSR… – Contributor-curated collections – Research domain collections – University repositories – Personal websites – ftp sites • Release upon request* http://www.zippykidstore.com/ *Wallis, J. C., Rolando, E., & Borgman, C. L. (2013). If We Share Data, Will Anyone Use Them? Data Sharing and Reuse in the Long Tail of Science and Technology. PLoS ONE, 8(7), e67332. doi:10.1371/journal.pone.0067332 160
  • 161. Discoverability • Data are inseparable from – Code – Technical standards – Documentation – Instrumentation – Calibration – Provenance – Workflows – Local practices – Physical samples http://peacetour.org/sites/default/files/code4peace-logo2-v3-color-sm.jpg 161
  • 162. Usability of cited objects • Identify the form and content • Interpret • Evaluate • Open • Read • Compute upon • Reuse • Combine • Describe • Annotate… 162
  • 163. Identity and persistence of digital objects • Identity – Identifiers • DOI, Handles, URI, PURL… – Naming and namespaces • Authors/creators: ORCID, VIAF… • Generic/specific: registry number… – Description • Self-describing • Metadata augmentation • Persistence – Permanent – Long-lived – Scratch spaces http://web-interview-questions.blogspot.com/2010_06_21_archive.html
  • 164. Intellectual property • What can I do with this object? • What rights are associated? – Reuse – Reproduce – Attribute • Who owns the rights? • How open are data? – Open data – Open bibliography http://pzwart.wdka.hro.nl/mdr/research/lliang/mdr/mdr_images/opencontent.jpg/
  • 165. Implications for data management • Authors of publications – Cite publications for their data, findings, and other content – Cite your data as you wish others to cite them – Cite others’ data and publications as they wish to be cited • Data archives – Add metadata for discovery of datasets – Add metadata for interpretation and provenance • Institutional repositories, bibliographic databases – Establish standards and practices for citing data sources – Coordinate communities, e.g., telescope bibliography, IAU* 165 *IAU Working Group Libraries. (2013). Best Practices for Creating a Telescope Bibliography. IAU-Commission5 - WG Libraries. http://iau-commission5.wikispaces.com/WG+Libraries
  • 166. Data Citation and Attribution 166 Uhlir, P. F. (Ed.). (2012). For Attribution -- Developing Data Attribution and Citation Practices and Standards: Summary of an International Workshop. Washington, D.C.: The National Academies Press. Retrieved from http://www.nap.edu/catalog.php?record_id=13564 Data Science Journal, Volume 12, 13 September 2013 2012 CODATA-ICSTI Task Group on Data Citation and Attribution. Co-Chairs: Jan Brase, Sarah Callaghan, Christine Borgman
  • 167. Research funding acknowledgements Research reported here is supported in part by grants from the National Science Foundation and the Alfred P. Sloan Foundation: The Transformation of Knowledge, Culture, and Practice in Data-Driven Science: A Knowledge Infrastructures Perspective, Sloan Award # 20113194, CL Borgman, UCLA, PI; S Traweek, UCLA, Co-PI The Data Conservancy, NSF Cooperative Agreement (DataNet) award OCI0830976, Sayeed Choudhury, Johns Hopkins University, PI The Center for Embedded Networked Sensing (CENS) is funded by NSF Cooperative Agreement #CCR-0120778, Deborah L. Estrin, UCLA, PI Towards a Virtual Organization for Data Cyberinfrastructure, NSF #OCI-0750529, C.L. Borgman, UCLA, PI; G. Bowker, Santa Clara University, Co-PI; Thomas Finholt, University of Michigan, Co-PI Monitoring, Modeling & Memory: Dynamics of Data and Knowledge in Scientific Cyberinfrastructures: NSF #0827322, P.N. Edwards, UM, PI; Co-PIs C.L. Borgman, UCLA; G. Bowker, SCU and Pittsburgh; T. Finholt, UM; S. Jackson, UM; D. Ribes, Georgetown; S.L. Star, SCU and Pittsburgh 167
  • 168. Finding and following digital objects • Discoverability – Identify existence – Locate – Retrieve • Provenance – Chain of custody – Transformations from original state • Relationships – Units identified – Links between units – Actions on relationships http://chicagoist.com/2008/10/09/a_gourmet_oasis_provenance_food_and.php
  • 169. Infrastructure for digital objects • Social practice • Usability • Identity • Persistence • Discoverability • Provenance • Relationships • Intellectual property • Policy http://datalib.ed.ac.uk/GRAPHICS/blue_data.gif 169
  • 170. Social practice • Why cite data? – Reproduce research – Replicate findings – Reuse data • Why attribute data? – Social expectation – Legal responsibility • How to cite data? – Bibliographic reference – Identifier – Link 170 http://farm2.static.flickr.com/1207/707625876_46aa44851f_o.jpg
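The three ways to cite data listed on this slide (bibliographic reference, identifier, link) can be combined in one citation string. The sketch below is illustrative only: the `DatasetRecord` fields, the punctuation pattern, and the example DOI `10.1234/example` are assumptions rather than a prescribed standard, and a repository's own citation guidance should take precedence.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class DatasetRecord:
    """Hypothetical minimal metadata needed to cite a dataset."""
    creators: Tuple[str, ...]
    year: int
    title: str
    publisher: str  # the repository or archive holding the data
    doi: str        # the persistent identifier


def format_citation(rec: DatasetRecord) -> str:
    """Build a bibliographic reference that ends in a resolvable DOI link."""
    creators = "; ".join(rec.creators)
    link = f"https://doi.org/{rec.doi}"
    return f"{creators} ({rec.year}). {rec.title}. {rec.publisher}. {link}"
```

Writing the identifier as a resolver link means the same string serves as reference, identifier, and link, so a reader can follow it and a machine can extract the DOI.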
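  • 171. Foreground vs. Background • Uses: research questions (foreground data); comparison, calibration (background data) • Reuses: internal data sources (foreground data); external data sources (background data) • Disposition: retain, curate (foreground data); discard (background data) • Value: reference in paper (foreground data); rarely cited (background data)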
  • 172. CENS Data: Foreground vs background UCLA USC UCR CALTECH UCM CENTER FOR EMBEDDED NETWORKED SENSING Sensor Collected Application Data Sensor Collected Proprioceptive Data Sensor Collected Performance Data Hand Collected Application Data Flow Water depth Ammonium Ammonia Phosphate Water temp pH Temperature Conductivity Chlorophyll GPS/location Time Sap flow CO2 Humidity Rainfall Packets transmitted Packets received ORP PAR Motor speed Rudder angle Heading Roll/pitch/yaw Soil moisture Nitrate Calcium Chloride Water potential Wind speed Wind direction Wind duration Leaf wetness Routing table Neighbor table Fault detection Awake time Organism presence Organism concentration Battery voltage Mercury Methylmercury Nutrient concentration Nutrient presence LandSat images Mosscam CDOM Bird calls
  • 173. Astronomy data: Foreground vs. background (Type: Source Named [Genre]) • Catalog (Data) index: SIMBAD, VizieR [Obs] • Curated Data Collection: NASA Exoplanet Database [Obs] • Data Archive: Multi-mission Archive at STScI (MAST), Infrared Science Archive (IRSA) [Obs] • Federated Data Query Services: Virtual Observatory Services (NVO, IVOA) [Obs] • Ground Based Instruments: DEep Imaging Multi-Object Spectrograph (DEIMOS), Keck Observatories, Laser Interferometer Gravitational-Wave Observatory (LIGO) [Obs] • Ground Based Sky Surveys: Deep Lens Survey, DEEP2 Galaxy Redshift Survey, Catalina Transients Survey, Palomar-Quest Survey, Sloan Digital Sky Survey (SDSS), Digitized Palomar Observatory Sky Survey (DPOSS), SDSS Value Added Catalogs [Obs] • Physical Constants: NIST Atomic Spectra Database [Exp] • Publications Index: SAO/NASA Astrophysics Data System [Mixed] • Simulation: Millennium Simulation Database [Sim] • Space Based Instruments: Chandra X-Ray Observatory, Fermi Large Area Telescope, Far Ultraviolet Spectroscopic Explorer (FUSE), Galaxy Evolution Explorer (GALEX), Hubble Space Telescope, Spitzer Space Telescope, XMM X-ray Telescope [Obs] • Space Based Sky Surveys: Two Micron All Sky Survey (2MASS), Infrared Astronomical Satellite Survey (IRAS), Wide-field Infrared Survey Explorer (WISE) [Obs]
  • 174. © 2012 The MITRE Corporation. All rights reserved. Adriane Chapman achapman@mitre.org M. David Allen dmallen@mitre.org Barbara Blaustein bblaustein@mitre.org Is this data fit for my use? The challenges and opportunities provenance presents Information graphic courtesy of FreeDigitalPhotos.net Public Release #12-1548.
  • 175. What is Provenance? Public Release #12-1548.
  • 176. Is Data Fit for a Specific Use? ■ Provenance can help in evaluating whether data is fit for a specific purpose – Does the data item derive from an Internet source? – Were untrusted organizations involved in producing the data item? ■ Provenance “in the raw” is not always useful to users – Generally presented as a directed acyclic graph (DAG) – Many users have a good intuitive understanding of simple graphs, BUT provenance graphs are often large and unwieldy Public Release #12-1548.
  • 177. Use Case
  • 178. Financial Systemic Risk Analysis • Analysts build and run Financial Models • Are there systemic risks to the health of the financial system? • Decision Makers Public Release #12-3756
  • 179. Systemic Risk: The IT Problem ■ To monitor systemic risk, regulators have hundreds of analysts, running hundreds of models… – …against hundreds of data sets at various time scales… – …each with thousands of different parameter settings ■ Currently, care and feeding of these models (especially data extract-transform-load) is ad hoc ■ Result: Current simulation environments don’t support analysts’ need to find and interpret data across the resulting millions of simulation executions Public Release #12-3756
  • 180. Data Provenance Challenge I ran a flow of funds model from the University of Vermont back in May. Which version did I use? What transformations did I perform on the input data sets? Which model runs used the 1Q 2011 version of the FDIC’s Uniform Bank Performance Reports? Who is running Prof. Jones’ model? What input data are they using it with and with what parameters? Public Release #12-3756
  • 181. Data Provenance Example d1 Filter (P1) Multi-market model (P2) Source: Thomson Reuters order book data d2 d3 Filter (P3) d4 d5 Version: 1 Time-horizon = 2016 Invoked-by: Jones Version: 2 Time-horizon = 2018 Invoked-by: Smith Multi-market model (P4) Year: 2010 Sector: Technology Time Series Normalization (P6) Link-based Classification Model (P7) d7 d8 Outlook: “Excellent” Outlook: “Poor” Filter (P5) Year: 2001-2010 Sector: Housing Periodicity: Quarterly Invoked-by: Roberts Outlook: “Fair” d6 Source: Nanex order book data Public Release #12-3756
  • 182. FitnessWidgets
  • 183. Ease of Use for End Users Data-centric goal: build tools and applications over provenance information to support a user’s needs. Information graphic courtesy of FreeDigitalPhotos.net Public Release #12-1548.
  • 184. Fitness Widgets: Pre-defined queries operating over provenance graphs ■ Ad hoc, user-defined Public Release #12-1548.
  • 185. Fitness Widgets: Pre-defined queries operating over provenance graphs ■ Complex, pre-defined Public Release #12-1548.
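A fitness widget, described on these slides as a pre-defined query over a provenance graph, can be sketched as a graph traversal. The example below is an assumption-laden illustration and not the PLUS implementation: the graph is a plain dictionary mapping each node to the nodes it was derived from, and the function names, node names, and owner labels are all hypothetical. It answers the earlier question of whether untrusted organizations were involved in producing a data item.

```python
from collections import deque
from typing import Dict, List, Set


def ancestors(graph: Dict[str, List[str]], node: str) -> Set[str]:
    """Return every upstream node that `node` was derived from, directly or not."""
    seen: Set[str] = set()
    queue = deque(graph.get(node, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        queue.extend(graph.get(current, []))
    return seen


def untrusted_ancestors(
    graph: Dict[str, List[str]],
    owner: Dict[str, str],
    node: str,
    trusted: Set[str],
) -> Set[str]:
    """Fitness widget: upstream nodes whose owner is not in `trusted`.

    Nodes with no recorded owner are flagged too, on the assumption that
    unknown origin should be reviewed rather than silently accepted.
    """
    return {n for n in ancestors(graph, node) if owner.get(n) not in trusted}
```

The point of packaging the query this way is that the analyst sees a short answer (a set of suspect inputs, possibly empty) instead of the full, unwieldy graph.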
  • 186. More Complex: Cross-organizational “double counting” Public Release #12-1548.
  • 187. The Skeletons
  • 188. PLUS Provenance Manager Provenance Manager PLUS Users & Applications Administrators Provenance Store (MySQL) PLUS Applications & Capture Agents Report Annotate Retrieve Administer (access control, archiving, etc.) API (provenance-aware applications) Coordination points for automatic provenance capture Web Proxy (provenance-aware applications) Approved for Public Release 10-4145 © 2010 The MITRE Corporation. All rights reserved A. Chapman, M.D. Allen, B. Blaustein, L. Seligman, “PLUS: A Provenance Manager for Integrated Information,” IEEE Int. Conf. on Information Reuse and Integration (IRI ’11), Las API
  • 189. Architectural Options for Lineage Capture ■ “Smart Applications” – Strategy: Each application calls lineage API to log whatever it thinks is important. – But, unrealistic for legacy applications ■ “Interceptors” – Strategy: Listen in to whatever is happening, and log silently as it happens – Requires a small number of points of lineage capture: ESBs are ideal, since they act as central “routers” ■ “Wrappers” – Strategy: Write a transparent wrapper service. Make sure all orchestrations call the wrapper service with enough information for the wrapper to invoke the real thing. Public Release #10-1285
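The “Wrappers” strategy above can be illustrated at small scale with a function decorator: callers invoke the wrapper, which runs the real step and records lineage as a side effect. This is a sketch under stated assumptions, not the PLUS API: `LINEAGE_LOG` is an in-memory stand-in for a provenance store, and `capture_lineage` and `keep_sector` are hypothetical names.

```python
import functools
from datetime import datetime, timezone

# In-memory stand-in for a provenance store (assumption for illustration).
LINEAGE_LOG = []


def capture_lineage(process_name):
    """Wrap a processing step so each call is logged without changing the step."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)  # invoke the real thing
            LINEAGE_LOG.append({
                "process": process_name,
                "inputs": [repr(a) for a in args],
                "parameters": {k: repr(v) for k, v in kwargs.items()},
                "output": repr(result),
                "at": datetime.now(timezone.utc).isoformat(),
            })
            return result
        return wrapper
    return decorator


@capture_lineage("filter")
def keep_sector(rows, sector):
    """Example step: keep only rows for one sector."""
    return [r for r in rows if r["sector"] == sector]
```

The wrapped step needs no lineage code of its own, which is why this approach suits legacy applications better than the “Smart Applications” strategy. The cost is that every caller must go through the wrapper.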
  • 190. Public Release #12-1548.
  • 191. View Provenance The provenance graph is built automatically over time by “watching” users’ actions Public Release #12-1548.
  • 192. Get Details The system can show relationship information and metadata details Public Release #12-1548.
  • 193. Sort Information The system provides ways to get information “at a glance”, e.g. which organizations own the data that was used. Public Release #12-1548.
  • 194. FitnessWidgets FitnessWidgets help the analyst assess data products for his specific use. Public Release #12-1548.
  • 195. Annotations Annotate any node. Information can be propagated through graph. Public Release #12-1548.
  • 196. Conclusions ■ Provenance keeps track of who did what, when, to data. ■ Provenance can help – Determine what data to use – Find data – Know what happened to the data ■ It is not a silver bullet – Capture is hard ■ Determine what pieces of information are vital to judging “fitness”, try to capture those
  • 197. SHARE PROJECT UPDATE Judy Ruttenberg, Program Director Association of Research Libraries NISO Virtual Conference: Dealing with the Data Deluge April 23, 2014
  • 198. Higher education & research community • Preservation, access, and reuse of research outputs (data, articles, and more) • Interlocking layers & services to better understand what research is being produced, and to render that research as accessible as possible • Leverage existing ecosystem
  • 199. Formation and context of SHARE • Institutional OA policies • AAU-ARL Task Force on Scholarly Communication • Funder mandates –2013 OSTP Memorandum –2014 Omnibus Appropriations –Private and other funder policies
  • 200. Who is SHARE? Steering Group • Provost, Library directors, CIO, SRO • ARL, AAU, APLU, CNI, SPARC, NLM (federal agency liaison) Staff • Project Manager (ARL), Technical Director, Product/Community Lead, Development Team Working Groups • Repository, Workflow, Technical, Communications
  • 201. Layers & Services of SHARE Notification Service: Project underway – Beta release fall 2014 – Full release fall 2015 Concurrent planning for interactive systems: – Registry – Discovery – Aggregation
  • 202. SHARE Notification Service Problem Statement: • Difficult to keep abreast of the release of publications, datasets, other research outputs • No single, structured way to report research output releases in timely and ubiquitous manner
  • 203. SHARE Notification Service Outcome & Goal: • Know that research output exists • Enable, short-term & with high-latency: –Repository Managers to identify articles/papers/reports for deposit –University and funding agency grant administrators to determine compliance with public access policies
  • 204. SHARE Notification Service – Building Blocks
  • 205. SHARE Notification Service – Information Flow
  • 210. Other Community Initiatives • CHORUS • ORCID • CrossRef • International
  • 211. Long-term planning • Data • Author rights: An intellectual property rights strategy, including the promotion of university-based open access policies and favorable licensing terms, will be part of the scaffolding that will enable the layers of SHARE to develop
  • 213. NISO Virtual Conference Dealing with the Data Deluge: Successful Techniques for Scientific Data Management NISO Virtual Conference • April 23, 2014 Questions? All questions will be posted with presenter answers on the NISO website following the webinar: http://www.niso.org/news/events/2014/virtual/data_deluge/
  • 214. Thank you for joining us today. Please take a moment to fill out the brief online survey. We look forward to hearing from you! THANK YOU

Editor's notes

  1. Current archives/collections/repositories already meeting public access requirements regarding data. NACDA – NACJD – SAMHDA: examples of long-term sustainability. NAHDAP – SAMHDA – DSDR: examples of sharing of confidential data. NACJD: example of depository/researcher compliance (holding 10% of funding to PI). LGBT – MET: unique infrastructure and dissemination. Research Connections: reports and data dissemination; audiences including policymakers.
  2. Abstract: Decades of data citation research, initiatives and guidelines have been consolidated into a single set of Data Citation Principles, created by a synthesis group that represents more than 25 organizations. The principles are driven by the premise that "sound, reproducible scholarship rests upon a foundation of robust, accessible data" and therefore "data should be considered legitimate, citable products of research". The Dataverse repository, developed at Harvard University's IQSS, generates a data citation compliant with the Joint Principles, and provides data publishing workflows to guarantee a persistent linkage between journal articles and the underlying data. The Dataverse is open and free to all researchers.
  3. We are all familiar with the metaphor of the data deluge … we are all being drowned in data. And we all may be drowning, period…. Concerns with data capture the prior trends in significant ways. And yet, much of that data is runoff – it is not curated, and maybe should not be kept. We need to identify what is the “right stuff” to keep, the right way to keep it, and the right tools and services to make it useful. ** Data have become a critical focus for scholarly communication – but we cannot address ALL, or even very many, of those issues here. Will try to stay as narrowly focused on the issues of data citation and attribution as we can.
  4. Here’s the real problem with the data deluge, and the data policies: an utter lack of agreement on what constitutes data! Data tend to be defined by example – unacceptable in usual scholarly discourse. Would you define an animal by example? RCUK says data, specimens, models – identifying something as data, or a form of evidence, is itself a scholarly act. Marie Curie’s notebook is scientific data and also historical data. These astronomical data can only be understood with access to the models used to generate them. The field notes in the bottom right are of little value without the research design and interpretation. The mouse is most certainly data – but getting useful information may require sending a postdoc to someone’s lab for 6 months to learn the method.
  5. We’re here to talk about attribution. In CC terms, that means giving credit – but credit for what? If you have obtained data through a license, it may require that the data be cited in a certain way. The broader issue is, attribution for what? We heard yesterday that data citation is an incentive for data release. That’s an untested hypothesis – and needs to be tested. What we found at the symposium was that everyone down the line had their hand out! The mechanism for citation will vary by who is getting credit and the reason for making the reference. These are but a few of the many stakeholders that might deserve credit in some situations.
  6. In between: Publication plus methods for longitudinal research. ** Few researchers conduct their activities with reuse in mind – DL services have to begin at the very beginning of the process if data are to be managed and useful to anyone later.
  7. Comparison of two large collaborative research sites. Interviews and ethnographic fieldwork. The data practices of CENS and SDSS researchers have implications for data curation, system evaluation, and policy.
  8. Beyond the method of collection (sensor vs hand) and the domain interested (robotics, systems, app sci) there are these other dimensions along which “use” varies.
  9. Some data that are important to the conduct of research are not viewed as sufficiently valuable to keep. Other data of great value may not be mentioned or cited, because those data serve only as background to a given investigation. Metrics to assess the value of documents do not map well to data.
  10. The ability to discover the existence of data is a critical requirement for a data-sharing infrastructure. We can define discovery as being the ability to determine the existence of a set of data objects with specified attributes or characteristics. The attributes of interest include aspects such as the producer of the data, the date of production, the method of production, a description of its contents, its representation. Discovery may also include aspects such as levels of quality, certification, or validation by third parties. Discoverability depends both on the description and representation of data and on tools and services to search for data objects. Data rarely are self-describing. Description and representation usually take the form of metadata, some of which may be automated if data are generated by instruments such as sensor networks or telescopes. Much metadata creation requires human intervention, making it an expensive process that is often avoided by researchers (Edwards, Mayernik, Batcheller, Bowker & Borgman, 2011, forthcoming; Mayernik, 2011; Mayernik, Batcheller & Borgman, 2011). The lack of standards and practices for citing data, akin to citing publications, is a barrier to discoverability [cite BRDI mtg Aug 2011]. A variety of approaches to discovery are possible. Web search engines that walk the visible internet are one possibility, assuming that data descriptions are reachable via standard web protocols. With the introduction of semantic web technologies and associated crawlers and search engines, location of data-sets of interest based on semantic content becomes possible. Alternatively, more discipline-specific and structured catalogs can be created. Arguably quite a bit of data is self describing: e.g., FITS, NetCDF, … Tho even those are incomplete. The more succinct ex we can include the better. Seems to me to be a distinct issue, related to naming not discoverability? A bit of both. Let’s discuss.
  11. Let’s look more closely at each of these. Identity – unique, and in what space should it be? Generic or field specific? Persistence – not all data should be available forever – what needs to be identified and why?
  13. This is the demand side. It is really hard to reuse other people’s data. You need to know so much about the data to trust what you’ve got. The cases of reuse that we find are data from curated repositories, as in astronomy, surveys, and so on. Even in the big data world, they spend up to 80% of their time cleaning data to make them reusable. Reuse in clinical trials, where reproducibility is part of the paradigm. Relatively little reuse of data in most areas – see our paper, just out in PLOS ONE this summer.
  15. An infrastructure for digital objects has many features – we’re concerned at this meeting with how they apply to data, attribution, and citation – but must remember that they are part of a larger internet architecture of digital objects. I will provide a brief overview of these relationships as background to the issues we will address this week.
  16. Usability – really reusability – how valuable is an object if you can’t open it? Need software? Locked up in PDF? Related to discoverability. IP: if we could release everything under CC0 licenses, the world would be a simpler place. That won’t happen. Need to know what rights are attached, what you can do with it. Open data – in the sense of no rights attached, and in the sense of reusable (structured vs. PDF). Open bib – citations per se are facts, and generally not copyrightable. Movement toward open bib – let the descriptions, the metadata, go free, then others can map the world of content and ideas. Separate from the paywalls.
  17. Provenance addresses the challenges described above. First, it helps analysts understand the data assumptions behind different simulation runs. For example, in Figure 1, nodes P2 and P4 represent runs of the multi-market model that produce outputs d3 and d5 respectively. In this simple example, d3 and d5 differ in their assessments of the economic outlook (“excellent” vs. “poor”). What might account for the differences? Data provenance allows the user to see exactly what input data and model version were used, what transformations were performed on the data, and the parameter settings used. Without automatic capture of this information as simulations and data transformation tools run, it is very difficult to recreate the provenance retrospectively by examining scripts and individual analysts’ notes. Provenance also helps analysts find collaborators. For example, a new analyst, Wilson, could query to see who is using the multi-market model (Smith and Jones), how they are using it, and with what data. Alternatively, Wilson could query to see who is using order book data (Smith, Jones, and Roberts), what sources they are using for it (Thomson Reuters and Nanex), and what models they are using the data for. These queries would be very difficult to answer in a large-scale analytic environment without provenance.
  18. Like a concept flow. But YOU don’t have to know that the Washington Post copies from Reuters, or that analyst 2 uses a personal communication. It’s captured silently and built up over time.
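
The provenance queries described in note 17 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the system shown in the talk: the `ProvenanceStore` class, its method names, the input dataset ids (d1, d2), the model versions, the parameter values, the "liquidity screen" model, and the assignment of analysts to runs are all hypothetical. Only the run ids P2 and P4, the outputs d3 and d5, the analyst names, and the two data sources come from the notes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    data_id: str
    kind: str    # e.g. "order book"
    source: str  # a vendor name, or "derived" for model output


@dataclass(frozen=True)
class Run:
    run_id: str
    analyst: str
    model: str
    model_version: str
    inputs: tuple      # ids of datasets read by the run
    outputs: tuple     # ids of datasets written by the run
    params: tuple = ()  # (name, value) pairs


class ProvenanceStore:
    """Hypothetical in-memory record of runs, filled in as tools execute."""

    def __init__(self):
        self.datasets = {}
        self.runs = []

    def add_dataset(self, ds):
        self.datasets[ds.data_id] = ds

    def record_run(self, run):
        self.runs.append(run)

    def producer_of(self, data_id):
        # Which run produced this output, with which inputs and settings?
        for run in self.runs:
            if data_id in run.outputs:
                return run
        return None

    def analysts_using_model(self, model):
        return sorted({r.analyst for r in self.runs if r.model == model})

    def analysts_using_kind(self, kind):
        return sorted({r.analyst for r in self.runs
                       for d in r.inputs if self.datasets[d].kind == kind})

    def sources_of_kind(self, kind):
        used = {d for r in self.runs for d in r.inputs}
        return sorted({self.datasets[d].source for d in used
                       if self.datasets[d].kind == kind})


store = ProvenanceStore()
# Input datasets: ids and vendor assignments are assumptions for illustration.
store.add_dataset(Dataset("d1", "order book", "Thomson Reuters"))
store.add_dataset(Dataset("d2", "order book", "Nanex"))
store.add_dataset(Dataset("d3", "economic outlook", "derived"))
store.add_dataset(Dataset("d5", "economic outlook", "derived"))

# Runs: versions, parameters, and analyst assignments are assumptions.
store.record_run(Run("P2", "Smith", "multi-market", "1.0",
                     ("d1",), ("d3",), (("horizon_days", 30),)))
store.record_run(Run("P4", "Jones", "multi-market", "1.1",
                     ("d2",), ("d5",), (("horizon_days", 90),)))
store.record_run(Run("P5", "Roberts", "liquidity screen", "0.3",
                     ("d1",), ("d6",)))

# Why do d3 and d5 disagree? Compare the runs that produced them.
run_d3 = store.producer_of("d3")
run_d5 = store.producer_of("d5")
print(run_d3)
print(run_d5)

# Wilson's collaborator queries.
print(store.analysts_using_model("multi-market"))
print(store.analysts_using_kind("order book"))
print(store.sources_of_kind("order book"))
```

The design choice that matters is that `record_run` is called by the tools themselves as they execute, so the record accumulates without analysts writing anything down; the queries are then simple filters over that record.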