Presentation Title: Grand Challenges and Big Data: Implications for Public Participation in Scientific Research
Presenter: William Michener, Professor and PI/Director of DataONE, University Libraries, University of New Mexico
Michener Plenary PPSR2012
1. Grand Challenges and Big Data:
Implications for Public Participation in
Scientific Research
Bill Michener
Professor and DataONE Project Director
University of New Mexico
4 August 2012
PPSR Meeting in Portland, OR
7. The Long Tail of Orphan Data
“Most of the bytes are at the high end, but most of the datasets are at the low end” – Jim Gray
[Figure: Volume vs. rank frequency of datatype. Labels: Specialized repositories (e.g. GenBank, PDB); Orphan data (B. Heidorn)]
8. Research and Data Life Cycle Integration
[Figure: research cycle (Ideas, Proposal writing, Research, Publication) linked to the data life cycle (Plan, Collect, Assure, Describe, Preserve, Discover, Integrate, Analyze)]
12. DataONE Supports Data Preservation
Three major components for a flexible, scalable, sustainable network
Member Nodes
• diverse institutions
• serve local community
• provide resources for managing their data
• retain copies of data
Coordinating Nodes
• retain complete metadata catalog
• indexing for search
• network-wide services
• ensure content availability (preservation)
• replication services
Investigator Toolkit
13. ORNL DAAC as a DataONE Member Node
[Diagram labels: NASA collectors; DAAC Users (UWG); Investigator Toolkit; DataONE Users]
21. 3. Tools for Innovation and Discovery
The Fourth Paradigm:
1. Observational and experimental
2. Theoretical research
3. Computer simulations of natural phenomena
4. Data-intensive research
• new tools, techniques, and ways of working
22. Investigator Toolkit Support
[Figure: data life cycle (Plan, Collect, Assure, Describe, Preserve, Discover, Integrate, Analyze) with example tools: DMP-Tool, Kepler]
25. ✔ Check for best practices
✔ Create metadata
✔ Connect to ONEShare
[Diagram label: Data & Metadata (EML)]
26. Exploration, Visualization, and Analysis Tools
Diverse bird observations and environmental data from 300,000 locations in the US integrated and analyzed using High Performance Computing Resources
[Data layers shown: Land Cover; Meteorology; MODIS – Remote sensing data]
Spatio-Temporal Exploratory Model identifies factors affecting patterns of migration
Model results: Occurrence of Indigo Bunting (2008) [time axis: Jan, Apr, Jun, Sep, Dec]
• Examine patterns of migration
• Infer how climate change may affect bird migration
Networking and the interconnectedness of information: defining the relationships between components increases the value and utility of those items. The internet provides connectivity between systems, and a good deal of infrastructure has been built on this rapidly evolving, now pervasive fabric. The design of most internet-based infrastructure, though, is very ephemeral, and thus is not suitable for preservation of information or, more importantly, of the relationships between elements. URLs are often used as identifiers, but they have a significant problem: their resolution, that is, finding the location where the content identified by the URL may be retrieved, is entirely dependent on the persistent availability of the service endpoint referenced by the URL. A change in any component in the resolution chain results in failure, and thus negates the utility of the URL. [Diagram of URL resolution process] The semantic web, with its goal of interconnectedness between information, is entirely dependent on effective identifier resolution. Preservation of content. Access to content. Creating communities of agents able to access and manipulate information. Generating new content and relationships between content, and discovering new associations. Being completely open about activity: the generation of new content, mining existing information, and access to processing resources may, however, be best done with some privacy. There are always some activities best not performed in full public view. The DataONE project is building infrastructure that addresses these concerns.
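The dependence of a URL on a single service endpoint can be illustrated with a minimal sketch. It assumes a toy resolver that maps one persistent identifier to several known locations and returns the first one that is reachable. The `Resolver` class, the identifier string, and the example URLs are all hypothetical and are not part of any DataONE interface.

```python
from typing import Callable, Dict, List, Optional


class Resolver:
    """Toy resolver (hypothetical): one persistent identifier, many locations."""

    def __init__(self) -> None:
        self._locations: Dict[str, List[str]] = {}

    def register(self, pid: str, url: str) -> None:
        """Record one more location where the identified content is held."""
        self._locations.setdefault(pid, []).append(url)

    def resolve(self, pid: str, is_available: Callable[[str], bool]) -> Optional[str]:
        """Return the first registered location that is currently reachable."""
        for url in self._locations.get(pid, []):
            if is_available(url):
                return url
        return None


# Hypothetical identifier and endpoints, for illustration only.
resolver = Resolver()
resolver.register("pid:example-dataset", "https://repo-a.example.org/data/42")
resolver.register("pid:example-dataset", "https://repo-b.example.org/mirror/42")

# Simulate the first endpoint going away: a bare URL would now fail,
# while the identifier still resolves to the surviving copy.
offline = {"https://repo-a.example.org/data/42"}
location = resolver.resolve("pid:example-dataset", lambda url: url not in offline)
```

The availability check is passed in as a function so the sketch stays self-contained; a real resolver would probe the network instead.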
There is widely used infrastructure for certain well-defined “easy” biological datatypes like DNA sequences and protein structures. But these repositories are not adequate to capture the many datasets that require more context to be reusable. Our civilization is not wealthy enough to ever support the variety of specialized repositories that would be needed, or the curation that would be needed to standardize these data.
In fact, many researchers find the new requirement to be quite confusing. Here are just a few examples of the questions that they are asking.
DataONE is a federated data network built to improve access to Earth science data, and to support science by: engaging the relevant science, data, and policy communities; facilitating easy, secure, and persistent storage of data; and disseminating integrated and user-friendly tools for data discovery, analysis, visualization, and decision-making. There are three principal components: Member Nodes, which include a diverse array of data centers and repositories that are associated with national and international agencies and research networks, universities, libraries, etc. Coordinating Nodes, which support data replication across Member Nodes (i.e., data centers) as well as network-wide services like 24/7 access to metadata at the CNs, indexing, and rapid search and discovery, etc. An Investigator Toolkit that includes tools that are widely used by scientists. The tools are coupled with the DataONE resources so that it is, for example, possible to seamlessly and transparently access data at Member Nodes through the tool of your choice.
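The division of labor between Member Nodes and Coordinating Nodes described above can be sketched as a toy model. This is an illustration under stated assumptions, not the DataONE software or its API: the class names, method names, the `replicas` parameter, and the example identifier are all hypothetical. Member Nodes hold the data, while the Coordinating Node keeps the metadata catalog, answers searches, and copies each object to other nodes.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class MemberNode:
    """Toy Member Node (hypothetical): holds data objects keyed by identifier."""
    name: str
    objects: Dict[str, bytes] = field(default_factory=dict)


@dataclass
class CoordinatingNode:
    """Toy Coordinating Node (hypothetical): catalog, search, and replication."""
    members: List[MemberNode] = field(default_factory=list)
    catalog: Dict[str, str] = field(default_factory=dict)  # pid -> description

    def register(self, origin: MemberNode, pid: str, data: bytes,
                 description: str, replicas: int = 2) -> None:
        """Store the object at its origin, catalog it, and replicate it."""
        origin.objects[pid] = data
        self.catalog[pid] = description
        others = [m for m in self.members if m is not origin and pid not in m.objects]
        # The origin counts as one copy; fill the remainder from other nodes.
        for node in others[: max(replicas - 1, 0)]:
            node.objects[pid] = data

    def search(self, term: str) -> List[str]:
        """Case-insensitive search over the metadata catalog."""
        return sorted(pid for pid, desc in self.catalog.items()
                      if term.lower() in desc.lower())

    def holders(self, pid: str) -> List[str]:
        """Names of the Member Nodes that currently hold a copy."""
        return [m.name for m in self.members if pid in m.objects]


# Hypothetical three-node network, for illustration only.
nodes = [MemberNode("node-a"), MemberNode("node-b"), MemberNode("node-c")]
cn = CoordinatingNode(members=nodes)
cn.register(nodes[0], "pid:soil-2012", b"depth,moisture\n", "Soil moisture survey")
```

In this sketch the metadata lives only at the Coordinating Node and the data only at Member Nodes, which mirrors the split described in the notes; a real network would add authentication, failure handling, and replication policies.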
NASA Collectors: Field investigators who collect data from NASA-funded projects and deposit those data in the ORNL DAAC. DAAC Users: Those who search and download data from the ORNL DAAC. Member Node Crescent: the software stack that enables the MN functionality for the ORNL DAAC. This crescent software is developed and installed by D1 staff, making use of the characteristics of the DAAC system and metadata. DAAC users can obtain data directly from the ORNL DAAC, as they did before. D1 users will access metadata from the CN and will acquire ORNL DAAC data from the DAAC indirectly via the Member Node. The data and documentation downloads are recorded by the DAAC; D1 users see the DAAC’s citation to the downloaded data set.
There are many opportunities for collaboration with DataONE and many benefits to doing so; the next few slides highlight the benefits and opportunities for research scientists, Member Nodes, and funding agencies. This map highlights many of the international partners that have expressed interest in establishing Member Nodes, many of which are active members of the DataONE Users Group.
Other development activities during years 2-5 will focus on expanding the suite of tools that are available through the Investigator Toolkit. New tool additions will be identified and prioritized by the DataONE Users Group.
As one example, DataONE is part of a consortium that is developing a Data Management Planning Online Tool. The tool “walks” scientists through the process of developing a concise, but comprehensive data management plan that could enable good stewardship of data and meet requirements of sponsors and home institutions.
The five steps are located on the left sidebar and include information about the data, metadata (or documentation about the data), policies for access and re-use, and plans for archiving and preserving the data. In this example, the Univ. of Virginia offers suggested text for archiving and preserving the data that can be pasted into the plan.
How else do we know what the community needs? The Scientific Exploration, Visualization and Analysis working group is another example that you heard about earlier. In summary, by running through a comprehensive case study, this working group was able to provide specific guidance on the challenges faced when conducting data-intensive science, challenges that were communicated to, and met by, the DataONE core CI team and developers. Another mechanism to understand community needs is to conduct extensive surveys of stakeholders….