Measures of Central Tendency: Mean, Median and Mode
Scientific Data Management
1. School of Information
Studies
Syracuse University
Scientific Data Management
2011 Professional Development Day for
New England Librarians
Jian Qin
School of Information Studies
Syracuse University
http://eslib.ischool.syr.edu/
School of Information Studies
Syracuse University
The Day ahead
An environmental scan
• E-Science, cyberinfrastructure, and data
• What do all these have to do with me?
Case study: The gravitational
wave research data management
Group work: Role play in
developing data management
initiatives
3/16/2011 2:00 PM Overview of E-Science 2
1
2. School of Information
Studies
Syracuse University
An environmental scan
• E-Science, cyberinfrastructure, and data
• What do all these have to do with me?
Overview of E-Science
Characteristics of e-science
Data sets, data collections, and data
repositories
Why does it matter to libraries?
School of Information Studies
Syracuse University
E-Science
“In the future, e-Science will refer to
the large scale science that will
increasingly be carried out through
distributed global collaborations
enabled by the Internet. ”
National e-Science Center. (2008). Defining e-Science.
http://www.nesc.ac.uk/nesc/define.html
Overview of E-Science 4
3/16/2011 2:01 PM
2
3. School of Information Studies
Syracuse University
E-Infrastructure for the research lifecycle
http://epubs.cclrc.ac.uk/bitstrea
m/3857/science_lifecycle_STFC
_poster1.PDF
Overview of E-Science 5
3/16/2011 2:01 PM
School of Information Studies
Syracuse University
Characteristics of e-science
• Digital data driven
• Distributed
• Collaborative
• Trans-disciplinary
• Fuses pillars of science
– Experiment
– Theory Greer, Chris. (2008). E-Science: Trends,
Transformations & Responses. In:
– Model/simulation Reinventing Science Librarianship:
Models for the Future, October 2008.
– Observation/correlation http://www.arl.org/bm~doc/ff08greer.
pps
3/16/2011 2:01 PM Overview of E-Science 6
3
4. School of Information Studies
Syracuse University
3/16/2011 2:01 PM Overview of E-Science 7
School of Information Studies
Syracuse University
Shift in Science Paradigms
Thousand A few hundred A few decades Today
years ago years ago ago
Data exploration (eScience)
unify theory, experiment, and
simulation
A computational -- Data captured by
approach instruments or generated by
simulating simulator
Theoretical complex -- Processed by software
branch phenomena -- Information/Knowledge
using models, stored in computer
generalizations -- Scientist analyzes
Science was database/files using data
empirical management and statistics
describing natural Gray, J. & Szalay, A. (2007). eScience – A transformed
phenomena scientific method. http://research.microsoft.com/en-
us/um/people/gray/talks/NRC-CSTB_eScience.ppt
4
5. School of Information Studies
Syracuse University
Gray, J. & Szalay, A. (2007). eScience – A transformed
X-Info scientific method. http://research.microsoft.com/en-
us/um/people/gray/talks/NRC-CSTB_eScience.ppt
• The evolution of X-Info and Comp-X
for each discipline X
• How to codify and represent our knowledge
Experiments &
Instruments
Other Archives facts questions
Literature facts ? answers
Simulations
The Generic Problems
• Data ingest
• Query and Vis tools
• Managing a petabyte
• Building and executing models
• Common schema
• Integrating data and Literature
• How to organize it
• Documenting experiments
• How to reorganize it
• Curation and long-term preservation
• How to share with others
School of Information Studies
Syracuse University
Useful resources
• What is eScience?
• eScience Initiatives
• Science Research and Data
• Science Data Management
• Literature Reviews
• Data Policy Issues
• eScience Research Centers
• http://eslib.ischool.syr.edu/index.ph
http://research.microsoft.com/en- p?option=com_content&view=sectio
us/collaboration/fourthparadigm/
n&id=9&Itemid=83
Overview of E-Science 10
3/16/2011 2:01 PM
5
6. School of Information Studies
Syracuse University
Data
Any and all complex
data entities from
observations, experiments,
simulations, models, and
higher order assemblies,
along with the associated
documentation needed to An artist’s conception (above) depicts
fundamental NEON observatory
describe and interpret the instrumentation and systems as well as
potential spatial organization of the
data. environmental measurements made by these
instruments and systems.
http://www.nsf.gov/pubs/2007/nsf0728/ns
f0728_4.pdf
3/16/2011 2:01 PM Overview of E-Science 11
School of Information Studies
Syracuse University
Scientific data formats
Common data format
Image formats
Matrix formats
Microarray file formats
Communication protocols
3/16/2011 2:01 PM Overview of E-Science 12
6
7. School of Information Studies
Syracuse University
Scientific datasets
• The scientific data
set, or SDS, is a
group of data
structures used to
store and
describe
multidimensional
arrays of
scientific data.
NCSA HDF Development Group. (1998). HDF 4.1r2 User's Guide.
http://www.hdfgroup.org/training/HDFtraining/UsersGuide/SDS_SD.fm1.ht
ml#48894
Overview of E-Science 13
3/16/2011 2:01 PM
School of Information Studies
Syracuse University
Example: Ecological dataset
• Floristic diversity
data
– Related links
– Data attributes
– Download link
Overview of E-Science 14
3/16/2011 2:01 PM
7
8. School of Information Studies
Syracuse University
Example: Biodiversity dataset
• Actions for Porcupine
Marine Natural History
Society - Marine flora
and fauna records from
the North-east Atlantic
– Metadata record
output in different
standard formats
– URL for dataset
download
Overview of E-Science 15
3/16/2011 2:01 PM
School of Information Studies
Syracuse University
Example: The Significant Earthquake Database
• The Significant
Earthquake Database
– A database containing
data about significant
earthquake events and the
damages caused
– An interface for extracting
a subset of data
– A link to download the
whole dataset
– Documentation
Overview of E-Science 16
3/16/2011 2:01 PM
8
9. School of Information Studies
Syracuse University
Dataset example:
Cayuga Lake (New York) Water Quality
Monitoring Data Related to the Cornell Lake
Source Cooling Facility, 1998-2009
Overview of E-Science 17
3/16/2011 2:01 PM
School of Information Studies
Syracuse University
Research data collections
Data output Size Metadata Management
Standards
Larger, Multiple, Organized
discipline- comprehensive Institutionalized,
based
Heroic
Smaller, individual
team-based None or inside the
random team
3/16/2011 2:01 PM Overview of E-Science 18
9
10. School of Information Studies
Syracuse University
Research collections
• Limited processing or long-term
management
• Not conformed to any data
standards
• Varying sizes and formats of data
files
• Low level of processing, lack of
plan for data products
• Low awareness of metadata
standards and data management
issues
Overview of E-Science 19
3/16/2011 2:01 PM
School of Information Studies
Syracuse University
Resource collections
• Authored by a community of investigators,
within a domain or science or engineering
• Developed with community level standards
• Life time is between mid- and long-term
• Example: Hubbard Brook Ecosystem Study
(http://www.hubbardbrook.org )
– One of the regional sites in the Long term
Ecological Research Network (LTER)
– Community of the ecological domain
– Community of investigators from around the
country on ecosystem study
– Ecological Metadata Language (EML), a
community-level standard
– Cataloged, searchable dataset collections
Overview of E-Science 20
3/16/2011 2:01 PM
10
11. School of Information Studies
Syracuse University
Reference collection
• Example: Global Biodiversity Information Facility
– Created by large segments of science community
– Conform to robust, well-established and comprehensive
standards, e.g.
• ABCD (Access to Biological Collection Data)
• Darwin Core
• DiGIR (Distributed Generic Information Retrieval)
• Dublin Core Metadata standard
• GGF (Global Grid Forum)
• Invasive Alien Species Profile
• LSID (Life Sciences Identifier)
• OGC (Open Geospatial Consortium)
3/16/2011 2:01 PM Overview of E-Science 21
School of Information Studies
Syracuse University
http://www.tdwg.org/
Global standards/
Biodiversity
Information
Facility
http://www.gbif.org/informatics/discoverymetadata/a-metadata-
infrastructure/ PM
3/16/2011 2:01 Overview of E-Science 22
11
12. School of Information Studies
Syracuse University
Datasets, data collections, and data
repositories System for storing,
managing,
preserving, and
• Data collections are built for providing access to
larger segments of science datasets
and engineering Data
• Datasets repository
– typically centered around an A repository may
event or a study contain one or more
– contain a single file or multiple data collections
files in various formats A data collection may
– coupled with documentation contain one or more
about the background of data datasets
collection and processing
A dataset may
contain one or more
3/16/2011 2:01 PM Overview of E-Science data files 23
School of Information Studies
Syracuse University
An emerging trend in academic libraries
3/16/2011 2:01 PM Overview of E-Science 24
12
13. School of Information Studies
Syracuse University
Initiatives in research libraries
Data support and Libraries involved in
services in supporting
institutions: eScience:
45% 73%
• Pressure points:
– Lack of resources
– Difficulty acquiring the appropriate staff and
expertise to provide eScience and data
management or curation services
– Lack of a unifying direction on campus
Source: Soehner, C., Steeves, C. & Ward, J. (2010). E-Science and data support services: A
study of ARL member institution. http://www.arl.org/bm~doc/escience_report2010.pdf
3/16/2011 2:01 PM Overview of E-Science 25
School of Information Studies
Syracuse University
Data preservation challenges
• Data formats
– Vary in data types, e.g. vector and raster data types
– Format conversions, e.g. from an old version to a newer
one
• Data relations
– e.g. there are data models, annotations, classification
schemes, and symbolization files for a digital map
• Semantic issues
– Naming datasets and attributes
Overview of E-Science 26
3/16/2011 2:01 PM
13
14. School of Information Studies
Syracuse University
Data access challenges
• Reliability
• Authenticity
• Leverage technology to make data access
easier and more effective
– Cross-database search
– Integration applications
Overview of E-Science 27
3/16/2011 2:01 PM
School of Information Studies
Syracuse University
Supporting digital research data
• Lifecycle of research data
– Create: data creation/capture/gathering from laboratory
experiments, field work, surveys, devices, media,
simulation output…
– Edit: organize, annotate, clean, filter…
– Use/reuse: analyze, mine, model, derive additional data,
visualize, input to instruments /computers
– Publish: disseminate, create, portals /data. Databases,
associate with literature
– Preserve/destroy: store / preserve, store /replicate
/preserve, store / ignore, destroy…
Overview of E-Science 28
3/16/2011 2:01 PM
14
15. School of Information Studies
Syracuse University
Supporting data management
The data deluge Researchers need:
Numerical, image, video Specialized search
engines to discover
Models, simulations, bit the data they need
streams
Powerful data mining
XML, CVS, DB, HTML tools to use and
analyze the data
3/16/2011 2:01 PM Overview of E-Science 29
School of Information Studies
Syracuse University
Research data management
Community
Institution
eScience
librarian
Financial and
policy support Science Data content User
domain idiosyncrasies requirements
Evolving and interconnecting –
Institutional Community National International
repository repository repository repository
Overview of E-Science 30
3/16/2011 2:01 PM
15
16. School of Information Studies
Syracuse University
Implications to scholarly communication
process
Publishing Curation Archiving
Data publishing; Maintaining, preserving
The long-term
New scholarly publishing and adding value to
storage, retrieval, and
models—open access, digital research data
use of scientific data
institutional and throughout its lifecycle.
and methods.
community repositories,
self-publishing, library
publishing, ....
3/16/2011 2:01 PM Overview of E-Science 31
School of Information Studies
Syracuse University
SDM, research impact, and value
3/16/2011 2:01 PM Overview of E-Science 32
16
17. School of Information Studies
Syracuse University
Cyberinfrastructure (CI) vision for the 21st
century discovery
Atkins, D. (2007). Global interdisciplinary research.
http://www.nsf.gov/pubs/2007/nsf0728/index.jsp
Overview of E-Science 33
3/16/2011 2:01 PM
School of Information Studies
Syracuse University
Educating the new type of workforce
• Scientific data literacy (SDL)
project (http://sdl.syr.edu),
2007-2009
• E-Science Librarianship
Curriculum project (eSLib)
(http://eslib.ischool.syr.edu),
2009-2012, in partnership with
Cornell University Library
Overview of E-Science 34
3/16/2011 2:01 PM
17
18. School of Information Studies
Syracuse University
eSLib curriculum
Scientific Data Management
(core)
Other activities:
Cyberinfrastructure and • eScience Labs
Scientific Collaboration (core) • Mentorship
program
Data services (capstone) • Student projects
• Internships
Database systems (required
elective)
Metadata, workflows, XML
3/16/2011 2:01 PM Overview of E-Science 35
School of Information Studies
Syracuse University
Defining data literacy
IL: ACRL. (2010).
DL: Finn, Charles, W.P. (Tech & Learning, 2004)
SDL: Qin, J. & J. D’Ignazio, (Journal of Library Metadata, 2010)
3/16/2011 2:00 PM Overview of E-Science 36
18
19. School of Information Studies
Syracuse University
Summary
• E-Science development has raised
expectations to research libraries
– Working knowledge and skills in e-Science
– Focus on process (data and team science)
rather than product (reference services)
– Proactive, collaborative, integrative, and
interdisciplinary
Overview of E-Science 37
3/16/2011 2:01 PM
School of Information
Studies
Syracuse University
Case Study: Learning Data
Management Needs of the
Gravitational Wave
Researchers
3/16/2011 2:01 PM Overview of E-Science 38
19
20. School of Information Studies
Syracuse University
Gravitational Wave (GW) Research
Overview of E-Science 39
3/16/2011 2:01 PM
School of Information Studies
Syracuse University
Nature of GW research
• Computational intensive
• Large amounts of data
• Highly distributed and collaborative
• Open while “secretive” (embargo time)
Overview of E-Science 40
3/16/2011 2:01 PM
20
21. School of Information Studies
Syracuse University
Data management goal
• To enable:
– dataset search by various options and
– once a dataset of interest is identified, enable the
tracking of this dataset to its original (raw) data.
Overview of E-Science 41
3/16/2011 2:01 PM
School of Information Studies
Syracuse University
What is needed to accomplish the goal?
• Metadata, of course! But
– What metadata?
– Metadata about what data, the raw data,
processed data, code data, or output data?
• What is there to be learned?
Overview of E-Science 42
3/16/2011 2:01 PM
21
22. School of Information Studies
Syracuse University
Understand the research workflow
• Interview the scientist
– Listening (good listening skills)
– Asking questions (don’t be afraid of asking
questions)
– Use your librarian brain to ingest the
conversation:
• How does the research flow from one point to next?
• What consists of the research input and output at each
stage of research in terms of data?
Overview of E-Science 43
3/16/2011 2:01 PM
School of Information Studies
Mapping out the knowledge v0.1 Syracuse University
Overview of E-Science 44
3/16/2011 2:01 PM
22
23. School of Information Studies
Mapping out the knowledge v0.2 Syracuse University
Overview of E-Science 45
3/16/2011 2:01 PM
School of Information Studies
Mapping
Syracuse University
out the
knowledge
v0.3
Overview of E-Science 46
3/16/2011 2:01 PM
23
24. School of Information Studies
Mapping out the knowledge v1.0 Syracuse University
3/16/2011 2:01 PM Overview of E-Science 47
School of Information Studies
Syracuse University
Lessons learned
• Science is learnable even if you don’t have a subject
background
– Learn enough to understand the research process and workflow
• Scientists are eager to get help
• Librarians need to be technical-minded
– Data, metadata, database
– Structures, models, workflows
• Librarians need to be good listeners while staying good
conversation leaders
– Know when and how to lead the conversation to get what you
need for data management planning and implementation
– Do your homework on the subject so that you can be an
intelligent listener
Overview of E-Science 48
3/16/2011 2:01 PM
24
25. School of Information
Studies
Syracuse University
Case Discussions for
Working Lunch
3/16/2011 2:01 PM Overview of E-Science 49
School of Information Studies
Syracuse University
Case Study #1: To build or not to build a data
repository?
A university library has developed an institutional repository for preserving and
providing access to the scholarly output by the researchers in this institution.
Now the new challenge arises from e-science research demanding data
management plan by the funding agency and the linking between publications
and data by the authors and users. You already know that some faculty use
their disciplinary data repository for submitting their datasets (e.g., GenBank for
microbiology research data). The problem you face now is whether an
institutional data repository should be built for those who do “small science” and
don’t have funding nor expertise to manage their data.
Questions to be addressed:
• What are the strategies you will use to approach the problem?
• What are the possible solutions for the problem?
• What are some of the tradeoffs for the solutions you will adopt?
Overview of E-Science 50
3/16/2011 2:01 PM
25
26. School of Information Studies
Syracuse University
Case study #2: Developing a data
taxonomy
The concept of research data management is a stranger to many faculty as
well as your library staff. What is data? What is a data set? These seemingly
simple terms can be very confusing and have different interpretations in
different context and disciplines. As part of the data management strategies,
you decide to develop an authoritative data taxonomy for the campus research
community. This data taxonomy will benefit the creation and use of institutional
data policies, data repository or repositories, and data management plans
required of funding agencies.
Questions to be addressed:
• What should the data taxonomy include?
• What form should it take, a database-driven website or a static HTML page?
• Who should be the constituencies in this process?
• Who will be the maintainer once the taxonomy is released?
Overview of E-Science 51
3/16/2011 2:01 PM
School of Information Studies
Syracuse University
Case study #3: Developing a data policy
Data policies play an important role in governing how the data will be managed,
shared, and accessed. It is also an instrument that will fend off potential legal
problems. Data policies have several types: data access and use, data
publishing, and data management. Your university’s Office of Sponsored
Research has some existing policy on data, but it is neither systematic nor
complete. Many of the terms were defined years ago and did not cover the new
areas such as the embargo period of data. As the university has decided to
build a data repository for managing and preserving datasets, a data policy has
become one of the top priorities for both the institution and the data repository.
Questions to be addressed:
• What should the data policy include?
• Who should be the constituencies in this process?
• Who will be the interpretation authority for the data policy?
3/16/2011 2:01 PM Overview of E-Science 52
26
27. School of Information Studies
Syracuse University
Case study #4: Cataloging datasets
Describing datasets is the process of creating metadata for datasets. In
scientific disciplines, several metadata standards have been developed, e.g.,
the Content Standard for Digital Geospatial Metadata (CSDGM), Darwin Core,
and Ecological Metadata Language (EML). Each of these metadata standards
contains hundreds of elements and requires both metadata and subject
knowledge training in order to use them. Besides, creating one record using
any of these standards will require a tremendous time investment. But you
library does not have such specialized personnel nor have the fund to hire new
persons for the job. The existing staff has some general metadata skills such as
Dublin Core. In deciding the metadata schema for your data repository, you
need to address these questions:
• Should I adopt a scientific metadata standard or develop one tailored to our
need?
• How can I learn what metadata elements are critical to dataset submitters and
searchers?
• What are some of the benefits and disadvantages for adopting a standard or
developing a local schema?
3/16/2011 2:01 PM Overview of E-Science 53
School of Information Studies
Syracuse University
Case study #5: Evaluating data repository tools
Research data as a driving force for e-science is inherently a tool-intensive
field. Tools related to data management can be divided into two broad
categories: those for creating metadata records and those for data repository
management. An academic institution decided to build their own data repository
as part of the supporting service for researchers to meet the data management
plan requirement of funding agencies. This data repository development task
was handed down to the library. You the library director have to decide whether
to develop an in-house system or use an off-the-shelf software system. As
usual, you put together a taskforce to find a solution to this challenge. The
questions to be addressed by the taskforce include:
• What are the options available to us?
• What evaluation criteria are the most important to our goal?
• What are the limitations for us to adopt one option or the other?
• How will this option be interoperate with existing institutional repository
system? Or, can the existing repository system used for data repository
purposes?
3/16/2011 2:01 PM Overview of E-Science 54
27