Transforming Science Through Data-driven Discovery
How Cyverse.org enables scalable
data discoverability and re-use
Matt Vaughn, co-PI
@mattdotvaughn
vaughn@tacc.utexas.edu
History and Context
• ~$100M direct NSF investment over 10 years
• Currently working to sustain its successes beyond 2018
• iPlant 2008: Empowering a New Plant Biology
• iPlant 2013: Cyberinfrastructure for Life Science
• CyVerse 2016: Transforming Science Through Data-Driven Discovery
Plant Science Cyberinfrastructure Collaborative
A "new type of organization" that is "community-driven", uniting "biologists, computer and information scientists and experts from other disciplines working in an integrated team" to provide "computational and cyberinfrastructure capabilities and expertise that are capable of handling large and heterogeneous plant biology data sets"
What is Cyberinfrastructure?
•Data storage and retrieval
•Software (system & user)
•Computing capability
•Human expertise and support
Organized into systems that solve problems of size
and scope that would not otherwise be solvable
Platform Overview
• Ready-to-use Platforms
• Foundational Capabilities
• Established CI Components
• Extensible Services
(Diagram axis: Ease of Use)
Adoption and Outputs
• Over 40K registered users (15-20% active)
• Millions of computing hours on XSEDE, campus HPC, CyVerse systems, and commercial cloud
• 2+ PB of user data stored in the CyVerse Data Store
• Hundreds of publications, courses, and discoveries
• Spin-off technologies
  • Jetstream: NSF production cloud
  • Syndicate: software-defined storage system
  • Agave API: multitenant science PaaS
• Communities such as iAnimal, iMicrobe, iPlant.UK
• 3rd-party software resources using it as a platform
Finding and re-using Data (1)
[Architecture diagram; labels recovered from the slide]
• Federation: Tucson resources, Austin resources, CSHL resource, iPlant.UK resources
• Metadata: ElasticSearch
• iRODS (2+ PB), catalog servers
• Data Store APIs: Agave API, AWS S3, public FTP, SFTP
At the heart of all Cyverse applications is a data-centric
architecture, designed to be scaled and extended
Finding and re-using Data (2)
• Browser-based file manager
• Upload from local or URI
• Download
• Add/Edit comments and tags
• AVU metadata + structured
templates
• Share with collaborators or any
Cyverse user
The Cyverse Discovery Environment Data Window
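The "AVU metadata + structured templates" bullet can be illustrated with a small sketch. In iRODS, metadata is attached to a file or collection as attribute-value-unit (AVU) triples. The template format, the field names (`organism`, `read_length`), and the `check_against_template` helper below are hypothetical, chosen only for illustration; they are not the Discovery Environment's actual template schema.

```python
from typing import Dict, List, NamedTuple


class AVU(NamedTuple):
    """One attribute-value-unit triple, the metadata model used by iRODS."""
    attribute: str
    value: str
    unit: str = ""


# Hypothetical template: required attribute name -> expected unit ("" = no unit).
TEMPLATE: Dict[str, str] = {"organism": "", "read_length": "bp"}


def check_against_template(avus: List[AVU], template: Dict[str, str]) -> List[str]:
    """Return a list of problems; an empty list means the AVUs fit the template."""
    problems = []
    by_attr = {a.attribute: a for a in avus}
    for attr, unit in template.items():
        if attr not in by_attr:
            problems.append(f"missing attribute: {attr}")
        elif by_attr[attr].unit != unit:
            problems.append(f"wrong unit for {attr}: expected {unit!r}")
    return problems


avus = [AVU("organism", "Zea mays"), AVU("read_length", "150", "bp")]
print(check_against_template(avus, TEMPLATE))  # prints []
```

A structured template in this sense is just a named set of expected attributes, which is what lets free-form AVUs be validated and searched consistently.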
Finding and re-using Data (3)
• Browser-based file manager
• Upload from local or URI
• Download
• Add/Edit comments and tags
• AVU metadata + structured
templates
• Share with collaborators or any
Cyverse user
Google Drive, for big data
The Cyverse Discovery Environment Data Window
Finding and re-using Software (1)
• Extendable App Catalog
• Provide Dockerfile + GUI
specification
• Develop VM image
• Deploy application web
service
Info view for a Cyverse Discovery Environment application
Finding and re-using Software (2)
• Extendable App Catalog
• Provide Dockerfile + GUI
specification
• Develop VM image
• Deploy application web
service
• Require links to
documentation, example files
and usage, appropriate
software and domain
ontologies
Public or shared Atmosphere VM images tagged with “GWAS”
Finding and re-using Software (3)
• Extendable App Catalog
• Provide Dockerfile + GUI
specification
• Develop VM image
• Deploy application web
service
• Require links to
documentation, example files
and usage, appropriate
software and domain
ontologies
• Give credit to app author and
software author
Application and Data catalogs available to 3rd parties
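The catalog requirements listed on these slides (documentation links, example files, ontology terms, and credit for both the app author and the software author) amount to a checklist that could be applied to each submission. The sketch below is one way to express that checklist; the `AppSubmission` record, its field names, and the example values are all hypothetical and do not reflect the actual CyVerse submission format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AppSubmission:
    """Hypothetical record for an app proposed for a public catalog."""
    name: str
    container_image: str            # e.g. built from the contributed Dockerfile
    documentation_url: str = ""
    example_files: List[str] = field(default_factory=list)
    ontology_terms: List[str] = field(default_factory=list)
    app_author: str = ""            # person who integrated the app
    software_author: str = ""       # author of the underlying software


def missing_requirements(app: AppSubmission) -> List[str]:
    """List the catalog requirements that this submission does not yet meet."""
    checks = {
        "documentation link": bool(app.documentation_url),
        "example files": bool(app.example_files),
        "ontology terms": bool(app.ontology_terms),
        "app author credit": bool(app.app_author),
        "software author credit": bool(app.software_author),
    }
    return [name for name, ok in checks.items() if not ok]


draft = AppSubmission(
    name="example-aligner",
    container_image="example/aligner:1.0",
    documentation_url="https://example.org/docs",
    app_author="A. Contributor",
)
print(missing_requirements(draft))
# prints ['example files', 'ontology terms', 'software author credit']
```

Keeping the app author and the software author as separate fields mirrors the slide's point that both deserve credit when they are different people.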
Cyverse Data Commons (1)
Data Commons Landing Page (1.0)
Persistent URL for each data set. No authentication
required. Fast browsing and retrieval.
NCBI SRA Submission Workflow in DE
Cyverse is the analysis home for a lot of genomics
data. To get it off our systems, we need to help get it
into the SRA!
Cyverse Data Commons (2)
Actively facilitating publication and discovery of data stored with CyVerse
Research data workflow:
• Candidate research data @ Data Store
• Identify, organize, rename files and folders
• Prepare a DataCite metadata document
• Submit to CyVerse Curation Team
• Data snapshot made public. DOI issued.
VM image workflow:
• Candidate VM image
• Document contents & capabilities
• Prepare a DataCite metadata document
• Submit to CyVerse Curation Team
• Public image released. DOI issued.
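Both workflows include a "Prepare a DataCite metadata document" step. As a rough sketch of what a pre-submission check might look like, the code below tests a record for the properties that DataCite Metadata Schema 4.x lists as mandatory. Representing the record as a plain Python dict, the `missing_datacite_fields` helper, and every example value (including the publisher string) are assumptions for illustration; the slides do not describe the actual curation tooling.

```python
# Mandatory properties in DataCite Metadata Schema 4.x.
REQUIRED = ["identifier", "creators", "titles", "publisher",
            "publicationYear", "resourceType"]


def missing_datacite_fields(record: dict) -> list:
    """Return the mandatory DataCite properties that are absent or empty."""
    return [key for key in REQUIRED if not record.get(key)]


# Hypothetical record; the identifier is empty because, in the workflow above,
# the DOI is issued only after the curation team accepts the submission.
record = {
    "identifier": "",
    "creators": ["Example, Researcher"],
    "titles": ["Example maize RNA-seq data set"],
    "publisher": "CyVerse Data Commons",
    "publicationYear": 2016,
    "resourceType": "Dataset",
}
print(missing_datacite_fields(record))  # prints ['identifier']
```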
Summary
• CyVerse is a model for providing cyberinfrastructure to diverse bioscience user communities
• State of the art has shifted at least twice since we started work
• Had to overcome initial reticence to “give data” to CyVerse
• Still hard to get developers and providers to maintain what they contribute
• Cost recovery model: we have started using the term ‘subsidized’ rather than ‘free’, but it might be too late
• Natural synergy between our organization and ODEN objectives
Transforming Science Through Data-driven Discovery
CyVerse Executive Team: Parker Antin, Nirav Merchant, Eric Lyons, Matt Vaughn, Doreen Ware, Dave Micklos
Contact: Matt Vaughn, @mattdotvaughn, vaughn@tacc.utexas.edu
CyVerse is supported by the National Science Foundation under Grant Nos. DBI-0735191 and DBI-1265383.


Editor's notes

  1. (Brief) History and Context
     • In the mid-2000s, the NSF realized that biology had some unique CI challenges that were not being met.
     • Plant Genome was already spending on full-genome characterization projects (Arabidopsis 2010, etc.).
     • Big data was on the horizon; NGS was just emerging.
     • BIO-specific CI: plant sciences were chosen for their strong communities and sharing culture.
     • iPlant was funded in 2008. The project spent its first 18 months assessing the immediate and future needs of plant science, then began developing CI.
     • Renewed in 2013 with a broadened mandate to cover BIO in general, excepting human disease.
     • Rebranded in 2016 as part of a strategy to operate sustainably after the initial program is over.
  2. What is Cyberinfrastructure?
     • Before diving into specifics, define cyberinfrastructure.
     • This is remarkably similar to the definition of a Commons.
     • So, our charge was: blend data storage + computing capability, reproducible analysis, and human expertise.
  3. Platform Overview
     • Vertically integrated set of offerings that serve a variety of users (technical skill, science use case, geographic location, etc.).
     • Data storage is centralized and sharing is easy. This is tied to the ability to analyze in situ.
     • Ease of use <-> ease of re-use.
     • Everything below the consumer-facing layer: LEGO building blocks.
     • At the bottom: federation is baked in. We own almost no hardware! This is key. Hard to sustain!
  4. Adoption and Outputs (END 6:00)
     • So, what if you build it and they don’t come? Luckily, they did.
     • On average, we serve as many users as other major CI investments like leadership-class clusters or the XSEDE project. But different users!
     • Home to lots of training and consulting (~25% effort).
     • CyVerse has spun out at least three successful open data ecosystem products.
  5. Finding and re-using Data (1)
     • EARLY DESIGN DECISION: availability of a scalable “Data Store”.
     • OPTIONAL: you don’t have to keep all your data there, but we hope to add sufficient value that you do.
     • Tech stack: there was nothing ready to go. Combines iRODS + ElasticSearch + Agave APIs.
     • Currently 2+ PB of user files. At UA this is purchased as needed. At TACC, it is sliced from our Corral storage offering. CSHL and iPlant.UK are federating in.
     • Agave APIs give us access to other storage protocols like S3, SFTP, FTP, Azure, etc.
  6. Finding and re-using Data (2)
     • Why don’t you just give us Google, Dropbox, Box?
     • Data Store APIs let us implement the Data Window GUI. Here’s an example from CyVerse’s DE workbench.
     • Comprehensive, easy data management, but petascale.
     • Aside: provenance is tracked under the hood, but we don’t expose it via the UI yet.
  7. Finding and re-using Data (3)
     • Google Drive for scientific big data.
     • Can do local caching as well, but native support is hard to do well.
     • This has been our story to date on data; more in a minute.
  8. Finding and re-using Software (1)
     • Reagents (data) and protocols (apps) must both be sharable and reusable.
     • Software -> Application Catalog.
     • Each front-end GUI has its own concept and implementation, but they share common infrastructure or are interoperable.
     • Here’s the DE, our flagship workbench application.
     • Deploying apps to these catalogs involves a Docker or VM image plus GUI specifications (written in some DSL or metadata form).
     • About half of the applications in CyVerse are community contributed.
  9. Finding and re-using Software (2)
     • Here’s the Atmosphere image catalog.
     • Mandate provision of help docs and example data/usage.
  10. Finding and re-using Software (3)
     • Here’s an example of a CyVerse app and data available in a 3rd-party application.
     • Give credit and attribution to the app contributor as well as the primary software author (if different).
  11. Turning back to data: CyVerse Data Commons (1) (12:00)
     • To date, the CyVerse strategy around data has been "Bring it in, use and discover within the platform, we won’t lock you in".
     • This was not selfish: adopters needed a clear path, and we wanted to be sure our CI was externally reliable.
     • We’ve been working to broaden this approach as our technology has matured, under a banner called “CyVerse Data Commons”.
     • We hold a lot of primary (1°) data. Some of it has a natural home, like NCBI. We have taken responsibility to help that happen.
     • In other cases, it makes sense to publish in place: there is no natural repository, the data is too large to move, or there is an expectation that re-users will perform extensive re-analysis on it.
     • We accomplish this now with “Community Data” and deep-linking.
     • Improving offerings over the course of 2016 …
  12. CyVerse Data Commons (2)
     • Here are two example workflows being implemented. Both result in a persistent, resolvable identifier.
     • Note: the VM workflow is already implemented in our sister project Jetstream. Images can be exported and are being published at IU ScholarWorks.
     • Uses the DataCite schema. Indexed by public search engines.
     • Feeds into our ElasticSearch-based metadata service to allow easy search and retrieval.
     • The search API will be publicly accessible later this year.
  13. Bullet points