Why data science matters and what we can do with it
1. deepcarbon.net
Xiaogang (Marshall) Ma and DCO-Data Science Team
Tetherless World Constellation
Rensselaer Polytechnic Institute
2. Outline
• Data Management and Publication
• Interoperability of Data
• Provenance of Research
• Era of Science 2.0
4. Why Manage and Publish Data
• Meet grant requirements
– Many funding agencies now require researchers to formally state
how they will manage and preserve datasets generated from a
research project.
… …
5. • Increase your research efficiency
– Have you ever had a hard time understanding the data that
you or your colleagues have collected?
Data work
6. Nice, now I have my DATA well managed, and next…
Image courtesy of British Geological Survey
7. • Increase the visibility of your research
– Making your data available to other researchers through
widely-searched repositories can increase your prominence
and demonstrate continued use of the data and relevance of
your research.
• Facilitate new discoveries
– Enabling other researchers to use your data reinforces open
scientific inquiry and can lead to new and unanticipated
discoveries. It also prevents duplication of effort, because
others can use your data rather than gathering the data
themselves.
8. Data Management Plan: What and How
• What is a Data Management Plan?
– A data management plan is a formal document that outlines
what you will do with your data during and after you complete
your research.
• What is involved in developing one?
– Developing a data management plan can be time-consuming,
tedious, and daunting, but it's a very important step in ensuring
that your research data is safe and sound for the present and
future.
– With the right process and framework, it does not take too
long and can pay off enormously in the long run.
9. • Topics in a data management plan include
– Introduction and context
– Data types, formats, standards and capture methods
– Short-term storage and data management
– Deposit and long-term preservation
– Data sharing, access and re-use
– Resourcing
– Adherence and review
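The topic list above can double as a completeness checklist for a draft plan. The following is a minimal sketch, assuming a draft is held as a dictionary keyed by topic heading; the function name `missing_topics` and the draft contents are hypothetical and not part of any DMP tool.

```python
# Topic headings copied from the slide above.
DMP_TOPICS = [
    "Introduction and context",
    "Data types, formats, standards and capture methods",
    "Short-term storage and data management",
    "Deposit and long-term preservation",
    "Data sharing, access and re-use",
    "Resourcing",
    "Adherence and review",
]


def missing_topics(plan):
    """Return the topics that are absent or left blank in a draft plan."""
    return [topic for topic in DMP_TOPICS if not plan.get(topic, "").strip()]


# Hypothetical draft: the text under each heading is placeholder content.
draft = {
    "Introduction and context": "Project overview goes here.",
    "Resourcing": "   ",  # whitespace only, so it counts as missing
}

for topic in missing_topics(draft):
    print("Still to write:", topic)
```

Keeping the headings in one list means the same checklist can be reused across projects, whichever of the tools on the next slide is used to write the plan itself.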
10. • Resources and tools that help create DMPs:
– DCC Data Management Plans:
http://www.dcc.ac.uk/resources/data-management-plans
– MIT Data Management and Publishing:
http://libraries.mit.edu/data-management/
– NSF Data Management Plan Requirements:
http://www.nsf.gov/eng/general/dmp.jsp
– DMPTool: https://dmptool.org
– IEDA Data Management Plan Tool:
http://www.iedadata.org/compliance/plan
– DCC DMPOnline: https://dmponline.dcc.ac.uk
12. Data Publication & Citation
• Data as first class products of research
– NSF bio-sketches can include data publications
Image from j4h.net
13. • Ways of data publication
– Data as supplemental material of a paper
– Standalone data
– Data paper: data + descriptive ‘data paper’
(Strasser, 2014)
Examples:
• Standalone data journals: Nature Scientific Data, Geoscience Data
Journal, Ecological Archives
• Journals that publish data papers: GigaScience, F1000 Research,
Internet Archaeology
15. • Data Citation Index
– Indexes the world's leading data repositories
– Records for the datasets are connected to related peer-
reviewed literature indexed in the Web of Science™
– Allows researchers to efficiently access data across subjects
and regions
17. A good example
• OneGeology
• Web-accessible geologic map data
worldwide (scale ~1:1 million)
• Stimulate a rapid increase in interoperability
(i.e. disseminate GeoSciML and
vocabularies further and faster)
• 120 participating countries (July 2014)
20. CGI Geoscience Terminology Workgroup
• Construct a collection of vocabularies for
populating information interchange documents and
enabling interoperability
• Provide labels for concepts, scope to various
communities defined by language, science domain,
or application domain
A list of recently finished vocabularies:
– Earth Resource Form
– Environmental Impact Value
– Exploration Activity Type
– Exploration Result
– UNFC Value
– Earth Resource Expression
– Earth Resource Shape
– Enduse Potential
– Mineral Occurrence Type
– Mining Activity Type
– Processing Activity Type
– Mining Waste Type Value
– Commodity Code
– Mineral Deposit Group
– Mineral Deposit Type
– Product Value
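The idea of one concept carrying labels for several language communities can be sketched as a small lookup. This is only an illustration: the identifier `example:granite`, the `VOCABULARY` structure and the `label_for` helper are hypothetical and are not taken from any published CGI vocabulary.

```python
# Hypothetical vocabulary entry: the identifier is invented for illustration,
# not copied from a published CGI vocabulary.
VOCABULARY = {
    "example:granite": {"en": "granite", "de": "Granit", "es": "granito"},
}


def label_for(concept_id, language, fallback="en"):
    """Return a concept's label in the requested language, else the fallback label."""
    labels = VOCABULARY[concept_id]
    return labels.get(language, labels[fallback])


print(label_for("example:granite", "de"))
print(label_for("example:granite", "ja"))  # no Japanese label, falls back to English
```

Because data records refer to the concept identifier rather than to a label, two datasets written in different languages can still be recognised as describing the same thing.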
24. Interoperable:
“Data should be discoverable, accessible, decodable,
understandable and usable, and data sharing should be
legal and ethical for all participants.”
(Ma et al., 2011)
25. • Interoperability does not mean that all data should be
mediated or standardized.
• However, it is important that data archives are
accompanied by detailed documentation, clarifying data
provenance, data model, vocabularies used, etc.
(Ma et al., 2011)
27. Provenance capture
• Documenting provenance
– Linking a range of observations and model outputs, research
activities, people and organizations involved in the production of
scientific findings with the supporting data sets and methods
used to generate them.
Well-curated provenance information
makes scientific workflows transparent
and improves the credibility and
trustworthiness of their outputs. It also
facilitates informed and rational policy
and decision-making.
Image from nature.com
(Ma et al., 2014)
28. “Figure 1.2: Sea Level Rise: Past, Present, and Future” in the draft of the
Third National Climate Assessment report of the USA (NCA3)
What is the provenance
of this figure?
29. • Detailed caption of that figure:
– Estimated, observed and possible amounts of global sea level
rise from 1800 to 2100. Proxy estimates (Kemp et al. 2012)
(for example, based on sediment records) are shown in red
(pink band shows uncertainty), tide gauge data in blue
(Church and White 2011a), and satellite observations are
shown in green (Nerem et al. 2010). The future scenarios
range from 0.66 feet to 6.6 feet in 2100 (Parris et al. 2012).
Higher or lower amounts of sea level rise are considered
implausible, as represented by the gray shading. The orange
line at right shows the currently projected range of sea level
rise of 1 to 4 feet by 2100, which falls within the larger risk-
based scenario range. The large projected range reflects
uncertainty about how glaciers and ice sheets will react to the
warming ocean, the warming atmosphere, and changing winds
and currents. As seen in the observations, there are year-to-
year variations in the trend. (Figure source: Josh Willis, NASA
Jet Propulsion Laboratory)
As a case study, let’s trace the
provenance of this figure.
30. Provenance tracing of NASA contributions to Figure 1.2 in draft NCA3
Here only the details of
Topex-Poseidon mission are
shown
Here only the details of one
paper (i.e., “paper/103”) cited
by that figure are shown
(a) Instances of
calibration, model and
software underpinning
“paper/103”
(b) Instances of sensor,
instrument and platform
underpinning that paper
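Tracing provenance of this kind amounts to walking a graph of "was derived from" links. The following is a minimal sketch under stated assumptions: only `figure/1.2` citing `paper/103` comes from the slide, while every other identifier, the `DERIVED_FROM` structure and the `trace_provenance` function are hypothetical placeholders rather than the actual NCA3 provenance records.

```python
from collections import deque

# Each key "was derived from" the listed sources.
# Only the figure -> "paper/103" link is taken from the slide; all other
# identifiers are invented placeholders for illustration.
DERIVED_FROM = {
    "figure/1.2": ["paper/103"],
    "paper/103": ["dataset/example", "software/example", "model/example"],
    "dataset/example": ["instrument/example"],
    "instrument/example": ["platform/example"],
}


def trace_provenance(start, links):
    """Return every upstream item reachable from start, in breadth-first order."""
    found = []
    visited = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for source in links.get(node, []):
            if source not in visited:
                visited.add(source)
                found.append(source)
                queue.append(source)
    return found


for item in trace_provenance("figure/1.2", DERIVED_FROM):
    print(item)
```

A breadth-first walk lists the closest sources first (the cited paper), then the items behind them (data, software, model, instrument, platform), which mirrors the order in which a reader would drill down from the figure.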
36. • Science 2.0
– New practices of scientists who post raw experimental results,
nascent theories, claims of discovery and draft papers on the
Web for others to see and comment on.
– Proponents say these “open access” practices make scientific
progress more collaborative and therefore more productive.
– Critics say scientists who put preliminary findings online risk
having others copy or exploit the work to gain credit or even
patents.
(Waldrop, 2008)
38. • Social scholarship: Reconsidering scholarly practices in
the age of social media
– Polled 1,600 US and Canadian faculty members
– Found that 15% use Twitter, 28% use YouTube and 39% use
Facebook for scholarly activity
(Greenhow and Gleason, 2014)
Using social media more often would help scientists to disseminate
their results, debate findings and engage a wider audience
Researchers must learn to create a robust
online presence
Social-media metrics to be added to the
tenure process
39. • Altmetrics
– A very broad group of metrics, capturing various parts of the
impact a paper or work can have.
(Lin and Fenner, 2013)
The ImpactStory
Altmetrics Classifications
40. • altmetric.com
– already a product used by NPG, Springer, etc.
This Altmetric score means that the article is:
• in the 99th percentile (ranked 181st) of the
81,582 tracked articles of a similar age in all
journals
• in the 93rd percentile (ranked 69th) of the 992
tracked articles of a similar age in Nature
http://www.nature.com/nature/journal/v497/n7449/nature12127/metrics
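The percentiles quoted above are consistent with a simple rank-based calculation. The sketch below assumes the percentile is the share of tracked articles ranked below the article, rounded down; this is an assumption that happens to reproduce the slide's numbers, not a documented description of how altmetric.com computes them, and the function name `percentile_from_rank` is hypothetical.

```python
def percentile_from_rank(rank, total):
    """Percentage of tracked articles ranked below this one, rounded down.

    Assumed formula: floor(100 * (total - rank) / total).
    """
    if not 1 <= rank <= total:
        raise ValueError("rank must be between 1 and total")
    return (total - rank) * 100 // total


# Figures from the slide: rank and number of tracked articles of a similar age.
print(percentile_from_rank(181, 81582))  # all journals
print(percentile_from_rank(69, 992))     # Nature only
```

Integer arithmetic is used so that the rounding-down step is explicit and free of floating-point surprises.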
We saw this figure at the beginning of this presentation. So, what can we now do with the provenance tracing?
(a) Instances of calibration, model and software underpinning “paper/103” and (b) Instances of sensor, instrument and platform underpinning that paper.