2. DataONE vision and approach
Enable new science and knowledge creation through
universal access to data about life on earth and the
environment that sustains it.
1. Build on existing cyberinfrastructure
2. Create new cyberinfrastructure
3. Support communities of practice
3. DataONE Cyberinfrastructure
Three major components for a flexible, scalable, sustainable network
Member Nodes
• diverse institutions
• serve local community
• provide resources for managing their data
• retain copies of data
Coordinating Nodes
• retain complete metadata catalog
• indexing for search
• network-wide services
• ensure content availability (preservation)
• replication services
Investigator Toolkit
4. Training in all elements of the data life cycle
[Figure: the data life cycle as a loop: Plan, Collect, Assure, Describe, Preserve, Discover, Integrate, Analyze. Labeled tool: Kepler.]
5. DataONE Education and Training
Summer Internships
Training at Conferences and Workshops
• Supercomputing 2011
• DataONE Implementation Workshop: Publishing data as a Member Node
• Ecological Society of America (ESA)
• American Geophysical Union (AGU)
Educational Modules
Graduate-level course
• Summer Institute for Environmental Informatics
7. Environmental Information Management (EIM) Institute
For graduate students in biology, geology, ecology, or other environmental sciences, environmental engineering, geography, or science librarianship
Conceptual and practical hands-on training to effectively design, manage, analyze, visualize, and preserve data and information:
• Managing data files
• Creating databases and web portals
• Data analysis and visualization
• Techniques for managing, analyzing, and visualizing geospatial data
8. DataONE Team and Sponsors
• Amber Budden, Roger Dahl, Rebecca Koskela, Bill Michener, Robert Nahf, Mark Servilla, Dave Vieglais
• Suzie Allard, Carol Tenopir, Maribeth Manoff, Kimberley Douglass, Robert Waltz, Bruce Wilson
• John Cobb, Bob Cook, Giri Palanisamy, Line Pouchard
• Patricia Cruse, John Kunze
• Sky Bristol, Mike Frame, Richard Huffine, Viv Hutchison, Jeff Morisette, Jake Weltzin, Lisa Zolly
• Chris Jones, Stephanie Hampton, Matt Jones
• Paul Allen, Rick Bonney, Steve Kelling
• Ryan Scherle, Todd Vision
• Randy Butler
• Ewa Deelman
• Peter Honeyman
• Jeff Horsburgh
• Robert Sandusky
• Bertram Ludaescher
• Peter Buneman
• Cliff Duke
• Carole Goble
• Donald Hobern
• David DeRoure
Leon Levy Foundation
11. A Science Use Case
Diverse bird observations and environmental data from 300,000 locations in the US, integrated and analyzed using High Performance Computing Resources
Input data: Land Cover; Meteorology; MODIS – Remote sensing data
Spatio-Temporal Exploratory Model identifies factors affecting patterns of migration
Model results: Occurrence of Indigo Bunting (2008) [Figure: maps for Jan, Apr, Jun, Sep, Dec]
• Examine patterns of migration
• Infer how climate change may affect bird migration
Editor's notes
The DataONE mission/vision is to “enable new science and knowledge creation through universal access to data about life on earth and the environment that sustains it.” DataONE is based on three precepts. 1. We are leveraging existing infrastructure such as the hundreds of existing data centers and repositories, and the myriad of software tools. 2. We are focusing our efforts on developing new infrastructure that better enables interoperability across data centers and between scientific tools and data resources. [The new cyberinfrastructure being created by DataONE is illustrated on a future slide.] 3. We recognize that the largest challenges are sociocultural in nature, and thus we focus significant attention on engaging and supporting the broader community of stakeholders (e.g. scientists, students, librarians).
DataONE is a federated data network built to improve access to Earth science data, and to support science by: engaging the relevant science, data, and policy communities; facilitating easy, secure, and persistent storage of data; and disseminating integrated and user-friendly tools for data discovery, analysis, visualization, and decision-making. There are three principal components:
• Member Nodes, which include a diverse array of data centers and repositories associated with national and international agencies and research networks, universities, libraries, etc.
• Coordinating Nodes (CNs), which support data replication across Member Nodes (i.e., data centers) as well as network-wide services such as 24/7 access to metadata at the CNs, indexing, and rapid search and discovery.
• An Investigator Toolkit, which includes tools that are widely used by scientists. The tools are coupled with the DataONE resources so that it is, for example, possible to seamlessly and transparently access data at Member Nodes through the tool of your choice.
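The division of labor between Member Nodes and Coordinating Nodes described in these notes can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the DataONE API: the class names, the `publish`, `search`, and `resolve` methods, the `replicas` parameter, and the sample identifier are all hypothetical, and real replication and indexing are far more involved.

```python
from dataclasses import dataclass, field


@dataclass
class MemberNode:
    """Hypothetical stand-in for a repository that holds data objects."""
    name: str
    objects: dict = field(default_factory=dict)  # identifier -> raw bytes

    def store(self, pid: str, data: bytes) -> None:
        self.objects[pid] = data


@dataclass
class CoordinatingNode:
    """Hypothetical stand-in: keeps the metadata catalog and arranges replicas."""
    members: list = field(default_factory=list)
    catalog: dict = field(default_factory=dict)  # identifier -> metadata dict
    holders: dict = field(default_factory=dict)  # identifier -> set of node names

    def register(self, node: MemberNode) -> None:
        self.members.append(node)

    def publish(self, origin: MemberNode, pid: str, data: bytes,
                metadata: dict, replicas: int = 2) -> None:
        # The data lives at the originating Member Node ...
        origin.store(pid, data)
        # ... while the Coordinating Node records the metadata.
        self.catalog[pid] = metadata
        self.holders[pid] = {origin.name}
        # Copy the object to other Member Nodes until `replicas` copies exist.
        for node in self.members:
            if len(self.holders[pid]) >= replicas:
                break
            if node.name not in self.holders[pid]:
                node.store(pid, data)
                self.holders[pid].add(node.name)

    def search(self, term: str) -> list:
        # Naive search: substring match over all metadata values.
        term = term.lower()
        return sorted(
            pid for pid, md in self.catalog.items()
            if any(term in str(value).lower() for value in md.values())
        )

    def resolve(self, pid: str) -> list:
        # Names of the Member Nodes that currently hold a copy.
        return sorted(self.holders.get(pid, set()))


if __name__ == "__main__":
    cn = CoordinatingNode()
    a, b, c = MemberNode("node-a"), MemberNode("node-b"), MemberNode("node-c")
    for node in (a, b, c):
        cn.register(node)
    cn.publish(a, "obs-001", b"species,count\n", {"title": "Bird observations"})
    print(cn.search("bird"), cn.resolve("obs-001"))
```

In this toy model a tool from the Investigator Toolkit would first ask the Coordinating Node to search the catalog, then use `resolve` to find a Member Node that holds the data.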
Other development activities during years 3-5 will focus on expanding the suite of tools that are available through the Investigator Toolkit. New tool additions will be identified and prioritized by the DataONE Users Group.
Other development activities during years 2-5 will focus on expanding the suite of tools that are available through the Investigator Toolkit. New tool additions will be identified and prioritized by the DataONE Users Group.
This final slide illustrates the initial DataONE partners that have now been involved for over 3 years, since the proposal was conceived. The DataONE Users Group now includes significantly more partners and we expect to grow exponentially over the next five years.
The DataONE team is growing!
The Scientific Exploration, Visualization and Analysis (EVA) Working Group is an example of a scientific use case. By running through a comprehensive case study, this working group was able to provide specific guidance on the challenges faced when conducting data-intensive science. These challenges were communicated to, and met by, the DataONE core CI team and developers.
Science requires multiple cooperating extreme-scale CI components (a lesson learned from the EVA/eBird pilot):
• The EVA pilot collaborated with TeraGrid (now XSEDE) to use HPC and "schlep" data as part of the workflow.
• 50K CPU-core hours (SUs) were used last year (supporting SOTB 2011).
• 3M hours are allocated this year (the Cornell CLO team has optimized code for a 3-10X speedup and loosened the data transfer bottleneck, so we will underrun).
• Plan for 500 species (3 yr data) runs. Currently: 70/wk for the 2011 campaign.
• HPC use has grown 10X two years in a row. Data increases as well.
Conclusion: success breeds scale.