Presentation at the National Library of Medicine, in a Symposium organized by the National Data Stewardship Residency, funded by the Library of Congress and the Institute of Museum and Library Services, on "Digital Frenemies: Closing the Gap in Born-Digital and Made-Digital Curation".
https://ndsr2016.wordpress.com/
1. The Rise of Data Publishing
in the Digital World
(and how Dataverse and DataTags help)
Mercè Crosas, Ph.D.
Chief Data Science and Technology Officer
Institute for Quantitative Social Science
Harvard University
@mercecrosas
NDSR 2016 Symposium
2. From 1665 to late 20th century:
A steady increase in size and
complexity of research output
7. The number of journals doubles every 20 years
since the 1750s, with growth in the number of scientists
1700: 3 journals
1800: ~10 journals
1900: ~400 journals
2000: ~14,000 journals (peer-reviewed)
(Chart: number of journals on a log scale, 1665 to 1965; Mabe, 2003)
17. Data Tables and Visuals Become Increasingly
Common, and Part of the Scientific Argument
Tables and figures:
• a few tables & visuals, as part of the text
• 50% of articles have tables & figures
• most articles have tables & figures, often standalone
Citations and methods:
• 50% cite previous work
• method sections appear
• 100% with citations (1 per 100 words), part of scholarly credit
Visualization milestones:
• First line graphs and bar charts (Playfair, 1786)
• First scatterplots (Herschel, 1833; Galton, 1896)
(Timeline axis: 1665 to 1965)
30. Scholarly Publishing Adapts to the
Increase of Cognitive Complexity (Gross et al., 2001)
• 18th century:
  • formal components appear in articles (introduction,
    conclusions, tables, figures, citations)
• 19th century:
  • articles explain data instead of only establishing observed facts
  • wide use of visuals, high citation density, methods sections
• 20th century:
  • structured quantitative data with increased use of statistics
  • wide range of data types enabled by new technologies
• Number of scientists increases from hundreds to a few million
• Science becomes extremely specialized:
  • from 1 journal to 14,000 peer-reviewed journals
  • one new journal for every 150 authors, read by 500
31. In the last decades, more
and more publications
and data
32. A Steeper Growth of Scholarly Output
Since 1950, the total number of journals has doubled every ~15 years
2010: 80,000 journals
2010: 33,000 peer-reviewed
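As a rough arithmetic check on the doubling times quoted in these slides, a constant doubling time of T years corresponds to an annual growth rate of 2^(1/T) - 1. The sketch below assumes constant exponential growth, which is a simplification and not something the cited sources state; the function names are made up for this illustration.

```python
def annual_growth_rate(doubling_time_years: float) -> float:
    """Annual growth rate implied by a constant doubling time (assumed model)."""
    if doubling_time_years <= 0:
        raise ValueError("doubling time must be positive")
    return 2 ** (1.0 / doubling_time_years) - 1.0


def projected_count(start_count: float, years: float,
                    doubling_time_years: float) -> float:
    """Count after `years` of growth, starting from `start_count`."""
    return start_count * 2 ** (years / doubling_time_years)


if __name__ == "__main__":
    # Compare the two doubling times mentioned in the talk.
    for t in (20, 15):
        print(f"doubling every {t} years ~ {annual_growth_rate(t):.1%} per year")
```

Under this assumption, doubling every 20 years is roughly 3.5% growth per year, and doubling every 15 years is roughly 4.7% per year.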
40. An Outburst of Research Data and Specialization
Results in > 1,000 Community Repositories
1920 - 1950s: First Social Science Data Archives (Odum, ICPSR, ...)
1970 - 1980s: First Biomedical Databases (PDB, GenBank, ...)
2016: A wide range of Research Data Repositories
(1,500 repositories listed in re3data.org)
49. Data Publishing Emerges as the Union of
Scholarly Publishing and Data Archiving
Scholarly publishing:
Distribute research output
• Attribution and credit
• Dissemination
• Finding & Reuse
Data Archiving:
Long-term access to data
• Accessibility
• Preservation
• Finding & Reuse
55. Why Data Publishing now?
• Data (and software) have become common input and
output of research
• A scholarly article cannot hold or accurately describe these
vast amounts of data and software
• As input and output of research, data must be citable and
accessible to enable validation and reuse, with attribution
Extending the thesis of Gross et al., data publishing accommodates the
complexity of research input and output in the digital world.
67. What is needed for FAIR Data Publishing
FAIR = Findable, Accessible, Interoperable, Reusable
Data Citation
• Persistent ID to reference data uniquely
• Support for versions and fixity
• Attribution to authors and repository
Metadata
• Catalog to discover and locate the data
• Sufficient information to understand and reuse the data
Repository
• Digital access to metadata and data
• Archive and preservation for long-term access
• Interoperability through standards and APIs
69. Dataverse: a data repository system that serves as a
solution for publishing FAIR research data
71. Around the World
Harvard Dataverse:
Generic data repository open
to researchers worldwide
Dataverse repositories serve a community, an institution, an archive, ...
76. Data Citation Basics
Force11, Joint Declaration of Data Citation Principles; Starr et al., 2015
The dataset landing page is accessible and guaranteed by the repository
(or data publisher), even when data are restricted or deaccessioned
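The citation elements named in this talk (authors, persistent ID, repository, version, and a fixity value) can be pictured as a small record that renders to a citation string. This is a minimal sketch under stated assumptions: the class name, field names, output layout, and the placeholder DOI are all hypothetical, and this is not the citation format produced by Dataverse or mandated by the Force11 principles.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DataCitation:
    """Hypothetical container for the elements of a data citation."""
    authors: List[str]
    year: int
    title: str
    persistent_id: str                  # e.g. a DOI or Handle (placeholder below)
    repository: str                     # attribution to the repository
    version: str                        # which published version is cited
    fingerprint: Optional[str] = None   # fixity value, if one is available

    def format(self) -> str:
        """Render the citation as a single comma-separated string."""
        parts = [
            "; ".join(self.authors),
            str(self.year),
            f'"{self.title}"',
            self.persistent_id,
            self.repository,
            f"V{self.version}",
        ]
        if self.fingerprint:
            parts.append(self.fingerprint)
        return ", ".join(parts)


if __name__ == "__main__":
    citation = DataCitation(
        authors=["Doe, Jane"],
        year=2016,
        title="Example Survey",
        persistent_id="doi:10.5072/FK2/EXAMPLE",  # made-up identifier
        repository="Example Dataverse",
        version="1.1",
    )
    print(citation.format())
```

Keeping the version and fixity value in the citation itself is what lets a reader confirm that the data they retrieve is the data that was cited.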
86. Information Extraction: Tabular Files
Input formats: RData, Stata, SPSS, Excel, CSV
        var 1   var 2   var 3
obs 1   2       a       0
obs 2   4       c       0
obs 3   6       b       1
obs 4   1       e       0
obs 5   2       a       1
obs 6   3       b       1
Variable Metadata: variable name, label, type, stats, geospatial coordinates
Data Values: the cell values alone, independent of format
Universal Numerical Fingerprint (UNF):
checksum on data values, from canonical format
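The idea of a checksum computed on data values in a canonical form, and not on the bytes of a particular file format, can be illustrated as follows. This is NOT the actual UNF algorithm: the normalization rules, the use of SHA-256, and every function name here are assumptions chosen only to show why the same values loaded from different file formats can yield the same fingerprint.

```python
import hashlib
from typing import Iterable, List, Union

Value = Union[int, float, str, None]


def canonical(value: Value) -> str:
    """Render one cell in a format-independent text form (illustrative rules)."""
    if value is None:
        return "m:"                      # marker for a missing value
    if isinstance(value, bool):
        return "n:" + f"{float(value):.6e}"
    if isinstance(value, (int, float)):
        # Fixed exponent notation, so 2 and 2.0 normalize identically.
        return "n:" + f"{float(value):.6e}"
    return "s:" + str(value)


def column_fingerprint(values: Iterable[Value]) -> str:
    """Hash the canonical form of every value in one variable (column)."""
    digest = hashlib.sha256()
    for v in values:
        digest.update(canonical(v).encode("utf-8"))
        digest.update(b"\n")             # separator between cells
    return digest.hexdigest()


def table_fingerprint(columns: List[List[Value]]) -> str:
    """Combine per-column fingerprints; sorting makes column order irrelevant."""
    col_prints = sorted(column_fingerprint(c) for c in columns)
    return hashlib.sha256("".join(col_prints).encode("ascii")).hexdigest()


if __name__ == "__main__":
    # The example table from the slide, as it might load from two formats.
    as_integers = [[2, 4, 6, 1, 2, 3], list("acbeab"), [0, 0, 1, 0, 1, 1]]
    as_floats = [[2.0, 4.0, 6.0, 1.0, 2.0, 3.0], list("acbeab"),
                 [0.0, 0.0, 1.0, 0.0, 1.0, 1.0]]
    print(table_fingerprint(as_integers) == table_fingerprint(as_floats))
```

Because the fingerprint depends only on the normalized values, it can serve as a fixity check that survives migration of the file to a newer format.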
93. Tiered Access
Tier                  Metadata  Files       How to Access
Open (default): CC0   Open      Open        Click to download
Guestbook             Open      Open        Fill in guestbook before download
Terms of Use          Open      Open        Click through terms of use before download
Data Restricted       Open      Restricted  Request access via click-through
Data Restricted       Open      Restricted  Request access via application
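The tiers in this table can be written down as plain data, which makes the pattern easy to see: metadata stays open in every tier and only file access varies. The sketch below is an illustration only; the class and function names are hypothetical and do not reflect how Dataverse models permissions internally.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Access(Enum):
    OPEN = "open"
    RESTRICTED = "restricted"


@dataclass(frozen=True)
class Tier:
    """One row of the tiered-access table (hypothetical representation)."""
    name: str
    metadata: Access
    files: Access
    how_to_access: str


TIERS: List[Tier] = [
    Tier("Open (default): CC0", Access.OPEN, Access.OPEN,
         "Click to download"),
    Tier("Guestbook", Access.OPEN, Access.OPEN,
         "Fill in guestbook before download"),
    Tier("Terms of Use", Access.OPEN, Access.OPEN,
         "Click through terms of use before download"),
    Tier("Data Restricted", Access.OPEN, Access.RESTRICTED,
         "Request access via click-through"),
    Tier("Data Restricted", Access.OPEN, Access.RESTRICTED,
         "Request access via application"),
]


def needs_access_request(tier: Tier) -> bool:
    """True when files cannot be downloaded without a granted request."""
    return tier.files is Access.RESTRICTED


if __name__ == "__main__":
    for tier in TIERS:
        print(f"{tier.name}: request needed = {needs_access_request(tier)}")
```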
107. Data Publishing Workflows
Create Dataset (landing page restricted)
Publish v. 1
Review (collaborators or anonymous reviewers)
Minor change (metadata only): Publish v. 1.1
Major change (might include new data file): Publish v. 2
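The version numbering in this workflow follows a simple rule: a minor change bumps the number after the dot, and a major change starts a new whole number. A minimal sketch, assuming versions are stored as (major, minor) pairs with "v. 1" written as (1, 0); the function names are hypothetical and this is not Dataverse's own code.

```python
from typing import Tuple

Version = Tuple[int, int]


def next_version(current: Version, major_change: bool) -> Version:
    """Return the version number to publish after a change."""
    major, minor = current
    if major_change:
        # Major change (might include a new data file): new whole number.
        return (major + 1, 0)
    # Minor change (metadata only): bump the number after the dot.
    return (major, minor + 1)


def label(version: Version) -> str:
    """Format a version the way the slide writes it, e.g. 'v. 1.1' or 'v. 2'."""
    major, minor = version
    return f"v. {major}" if minor == 0 else f"v. {major}.{minor}"


if __name__ == "__main__":
    v1 = (1, 0)
    v1_1 = next_version(v1, major_change=False)
    v2 = next_version(v1_1, major_change=True)
    print(label(v1), label(v1_1), label(v2))
```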
111. The Biomedical Dataverse at Harvard Medical School,
also tested as a persistent repository for LINCS data
(NIH Library of Integrated Network-based Cellular Signatures)
Collaboration with Piotr Sliz and Caroline Shamu (HMS)
113. “User Uploads must be void of all identifiable information,
such that re-identification of any subjects from the amalgamation
of the information available from all of the materials (across
datasets and dataverses) uploaded under any one author and/or
user should not be possible.”
114. “Submitter represents and warrants that the Content does not
contain any information (i) which identifies, or which can be used
in conjunction with other publicly available information to
personally identify, any individual;”
115. “If you are submitting human sequences to GenBank, do not
include any data that could reveal the personal identity of the
source. It is our assumption that you have received any necessary
informed consent authorizations that your organizations require
prior to submitting your sequences.”
GenBank
116. How can we maximize
publishing sensitive data while
being mindful of privacy?
117. The DataTags System
Sweeney L, Crosas M, Bar-Sinai M. Sharing Sensitive Data with
Confidence: The DataTags System. Technology Science. 2015101601.
October 16, 2015. http://techscience.org/a/2015101601
120. A datatag is a set of security features and access
requirements for file handling
A datatags repository is one that stores and shares
data files in accordance with standardized and
ordered levels of security and access requirements
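One way to picture "ordered levels of security and access requirements" is an ordered enumeration in which each level carries its own handling rules. The level names and requirements below are hypothetical placeholders invented for this sketch; they are not the tags defined by the DataTags System.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Dict, List


class Level(IntEnum):
    """Hypothetical ordered levels; a higher value means stricter handling."""
    PUBLIC = 0
    REGISTERED = 1
    APPROVED = 2
    ENCLAVE = 3


@dataclass(frozen=True)
class DataTag:
    """Security features and access requirements attached to one level."""
    level: Level
    encrypted_storage: bool
    access: str


# Made-up requirements, for illustration only.
TAGS: Dict[Level, DataTag] = {
    Level.PUBLIC: DataTag(Level.PUBLIC, False, "open download"),
    Level.REGISTERED: DataTag(Level.REGISTERED, False, "login required"),
    Level.APPROVED: DataTag(Level.APPROVED, True, "approved application"),
    Level.ENCLAVE: DataTag(Level.ENCLAVE, True, "on-site access only"),
}


def repository_can_host(max_supported: Level, file_level: Level) -> bool:
    """A repository may hold a file only up to the level it supports."""
    return file_level <= max_supported


def strictest(levels: List[Level]) -> Level:
    """A dataset with mixed files is handled at its strictest file level."""
    return max(levels, default=Level.PUBLIC)


if __name__ == "__main__":
    print(repository_can_host(Level.APPROVED, Level.REGISTERED))
    print(strictest([Level.PUBLIC, Level.APPROVED]).name)
```

Because the levels are ordered, questions such as "can this repository hold this file?" reduce to a comparison.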
130. References
• http://dataverse.org
• http://dataverse.harvard.edu
• http://datatags.org
• Sweeney L, Crosas M, Bar-Sinai M, 2015, Sharing
Sensitive Data with Confidence: The DataTags System,
Technology Science, http://techscience.org/a/2015101601
• Gross, Harmon, Reidy, 2001, Communicating Science
• Mabe, 2003, The Growth and Number of Journals
• Friendly, 2006, A Brief History of Data Visualization