A presentation reviewing technical trends in data management, publication and citation, and methodologies for data interoperability, provenance of research and semantic eScience.
Why Data Science Matters - 2014 WDS Data Stewardship Award Lecture
1. TWC
Why Data Science Matters
Xiaogang (Marshall) Ma
Tetherless World Constellation
Rensselaer Polytechnic Institute
Email: max7@rpi.edu; Twitter: @MarshallXMa
ICSU-WDS Data Stewardship Award Lecture
SciDataCon 2014, New Delhi, India, Nov. 02-05
2. Acknowledgements
• Dr. Mustapha Mokrane and Dr. Simon Hodson
• Colleagues at TWC/RPI, CODATA-ECDP, ESIP, CGI-IUGS,
AGU/ESSI, ICSU-WDS, RDA, ITC, and more
• My mentor Prof. Peter Fox
• My family
• All of you
3. Outline
• Technical trends
– Data management, publication & citation
• Methodology
– Interoperability & Provenance
• Data management is just a start
– Data analysis
– Semantic eScience
5. Data Management Plan
• Data Management Plan
– A formal document that outlines what you will do with your data
during and after you complete your research
• Resources/Tools help create DMPs:
– NSF Data Management Plan Requirements:
http://www.nsf.gov/eng/general/dmp.jsp
– DCC Data Management Plans:
http://www.dcc.ac.uk/resources/data-management-plans
– DMPTool: https://dmptool.org
– DCC DMPOnline: https://dmponline.dcc.ac.uk
6. Data Publication
• Data as first class products of research
– e.g., NSF bio-sketches can include data publications
See: http://www.nsf.gov/pubs/2013/nsf13004/nsf13004.jsp
Image from j4h.net
7. TWC
“All data necessary to understand, assess, and extend the conclusions of
the manuscript must be available to any reader of Science.”
“…authors are required to make materials, data and associated protocols
promptly available to readers without undue qualifications.”
“…authors must make materials, data, and associated protocols available
to readers.”
“…it is a condition of publication that authors make available the data and
research materials supporting the results in the article.”
“…require authors to make all data underlying the findings described in
their manuscript fully available without restriction…”
“Earth and space science data should be widely accessible in multiple
formats and long‐term preservation of data is an integral responsibility of
scientists and sponsoring institutions.”
“…support the principle that research data should be made freely
available to all researchers…”
“…recommends depositing data that correspond to journal articles in
reliable data repositories…”
8. TWC
• Ways of data publication
– Data as supplemental material of a paper
– Standalone data
– Data paper: data in a repository + descriptive ‘data paper’
Examples:
• Standalone data journals: Nature Scientific Data, Geoscience Data
Journal, Ecological Archives, Data in Brief …
• Journals that publish data papers: Earth and Space Science,
GigaScience, F1000 Research, Internet Archaeology …
Strasser, GeoData 2014 Workshop Presentation (2014)
9. TWC
An isolated data island?!
Image from nature.com
10. Data Citation
• Data Citation Index
– Indexes the world's leading data repositories
– Connects datasets to related refereed literature indexed in
the Web of Science™
– Efficient access to data across subjects and regions
Image courtesy http://wokinfo.com
11. Data interoperability
Interoperability:
“Data should be discoverable, accessible, decodable,
understandable and usable, and data sharing should be
legal and ethical for all participants.”
Ma et al., Nature Geoscience (2011)
Original image from: http://ehna.org
12. Provenance of research
Provenance documentation
“Linking a range of observations and model outputs, research
activities, people and organizations involved in the production of
scientific findings with the supporting data sets and methods
used to generate them”
Image from nature.com
Ma et al., Nature Climate Change (2014)
http://data.globalchange.gov
13. TWC
• IPython Notebook:
A web-based interactive computational environment
Codes, APIs,
datasets, text…
PDF document
• We extended the IPython Notebook environment to enable
automatic provenance capture during a scientific workflow
Di Stefano et al., ESIP 2014 Summer Meeting Presentation (2014)
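The idea of automatic provenance capture can be illustrated with a minimal Python sketch. This is not the extension described by Di Stefano et al.; it only assumes that each workflow step is a plain function and that a record of what each step used and generated is enough for illustration. The names `capture_provenance`, `PROVENANCE_LOG`, `load_temperatures` and the sample values are hypothetical.

```python
import functools
import hashlib
from datetime import datetime, timezone

# Hypothetical in-memory store; a real system would persist these records.
PROVENANCE_LOG = []


def _digest(value):
    """Short fingerprint of a value, used to link outputs to later inputs."""
    return hashlib.sha256(repr(value).encode("utf-8")).hexdigest()[:12]


def capture_provenance(func):
    """Record which inputs a step used and which output it generated."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        started = datetime.now(timezone.utc).isoformat()
        result = func(*args, **kwargs)
        used = [_digest(a) for a in args]
        used += [_digest(v) for _, v in sorted(kwargs.items())]
        PROVENANCE_LOG.append({
            "activity": func.__name__,
            "used": used,
            "generated": _digest(result),
            "started_at": started,
        })
        return result
    return wrapper


@capture_provenance
def load_temperatures():
    # Illustrative values only, not real observations.
    return [14.1, 14.3, 14.8]


@capture_provenance
def mean(values):
    return sum(values) / len(values)


average = mean(load_temperatures())

for record in PROVENANCE_LOG:
    print(record["activity"], record["used"], "->", record["generated"])
```

Because the fingerprint of the output of `load_temperatures` matches the fingerprint of the input of `mean`, the two records can be chained into a simple lineage without the user writing any documentation by hand.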
15. Semantic eScience
• Artificial Intelligence accelerates scientific discovery
– Data search, synthesis and hypothesis representation
– Data analysis: reasoning with models of the data
Gil et al., Science (2014)
Image from science.com
A state-of-the-art example:
Hanalyzer (high-throughput analyzer)
• Uses natural language processing to
automatically extract a semantic network from
all PubMed papers relevant to a scientist
• Uses Semantic Web technology to integrate
assertions from other biomedical sources
• Reasons about the network to find new
correlations that suggest new genes to
investigate
Leach et al., PLoS Comput Bio (2009)
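The third step, reasoning about a network of assertions to suggest new genes, can be sketched in a toy form. This is not the Hanalyzer implementation; it assumes assertions are already available as subject-predicate-object triples and uses a single hand-written rule (genes sharing a process with a known gene become candidates). The gene and process names and the function `candidate_genes` are hypothetical placeholders.

```python
from collections import defaultdict

# Hypothetical assertions, as if merged from literature mining and databases.
TRIPLES = [
    ("geneA", "participates_in", "processX"),
    ("geneB", "participates_in", "processX"),
    ("geneC", "participates_in", "processY"),
    ("geneA", "mentioned_with", "geneD"),
]


def candidate_genes(triples, known_gene):
    """Return genes that share at least one process with known_gene."""
    members = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate == "participates_in":
            members[obj].add(subject)

    candidates = set()
    for genes in members.values():
        if known_gene in genes:
            candidates |= genes - {known_gene}
    return sorted(candidates)


print(candidate_genes(TRIPLES, "geneA"))
```

A production system would use an ontology and a reasoner rather than one hard-coded rule, but the pattern is the same: integrate assertions into one graph, then query it for links that no single source states directly.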
16. TWC Deep Carbon Virtual Observatory
Fox, RDA Fourth Plenary Meeting Presentation (2014)
A cyber-enabled
platform for linked
science
http://deepcarbon.net
17. Summary
• Data as first class products of research
• eScience: the digital or electronic facilitation of science
• Semantic eScience
– A virtuous circle between science and semantic technologies
– Data driven + Knowledge driven?
Image courtesy @WileyExchanges