Synopsis: Biological research increasingly depends on computational analysis of large and complex data sets. These slides were used in a one-hour webinar that surveyed the platforms, tools, and services for large-scale data analysis offered by CyVerse, a cyberinfrastructure (CI) project of the US National Science Foundation. The webinar is available at https://youtu.be/QErkkoDFdyU.
The webinar was aimed at viewers interested in the compute and platform architecture of CyVerse. It introduced the basic components of CyVerse CI, including the Discovery Environment (a simple web portal for managing data, analyses, and workflows); the Data Store (scalable, secure, and reliable storage for terabyte-scale data management); Atmosphere (one-click, on-demand cloud computing); and the Visual and Interactive Computing Environment (flexible implementations of JupyterLab, RStudio, and R Shiny).
CyVerse provides a full stack of CI services with entry points for both computational novices and software developers. All resources are freely available to the community, and free accounts can be obtained at user.cyverse.org. New users can visit the CyVerse YouTube channel (https://www.youtube.com/channel/UC-gvdjTz9rq6RovZ57LoDDA/featured) for webinars on getting started with CyVerse and on using specific tools and workflows.
Speaker: Jason Williams, Assistant Director, External Collaborations, Cold Spring Harbor Laboratory DNA Learning Center, and Education, Outreach, and Training lead for CyVerse.
CyVerse: Extensible Cyberinfrastructure for Life Science
1. Transforming Science Through Data-driven Discovery
Extensible Cyberinfrastructure for Life Science
Jason Williams – Education, Outreach, Training Lead
Cold Spring Harbor Laboratory
williams@cshl.edu @JasonWilliamsNY
2. Transforming science through data-driven discovery
More than 60K users, PBs of data, and hundreds of publications, courses, and discoveries
CyVerse vision
3. CyVerse evolution
iPlant (2008): Empowering a New Plant Biology
iPlant (2013): Cyberinfrastructure for Life Sciences
CyVerse (2016): Transforming Science Through Data-Driven Discovery
[Timeline figure: axis years 2006, 2010, 2015, 2017, 2018; milestones labeled "public launch" and "funding renewal" (twice)]
4. We are funded by the National Science Foundation
• We are your colleagues and collaborators!
• >$100 Million in investment
• Freely available to the community
• Spur national/international collaboration
• Cite CyVerse: CyVerse.org/acknowledge-cite-cyverse (DBI-0735191 and DBI-1265383)
CyVerse evolution
5. What is Cyberinfrastructure?
• Data storage
• Software
• High-performance computing
• People
organized into systems that solve problems of size and scope that would not otherwise be solvable.
6. Platforms, tools, datasets; Storage and compute; Training and support
Community-focused cyberinfrastructure
7. Microbial, Plant, Animal, Biomedical, Ecological
CyVerse is built for data
Sequence, Images, Other datatypes
8. CyVerse product stack
Ready to use Platforms
Foundational Capabilities
Established CI Components
Extensible Services
Ease of Use
9. Genomic data and analysis:
• Reference guided assembly
• De novo assembly
• RNA-Seq (expression; gene/isoform discovery)
• Variant calling
• Genome/Transcriptome annotation
• ChIP-Seq/Integration of epigenetic information
• Multiple sequencing platforms
• New and evolving technologies
CyVerse Community Priorities
10. CyVerse is a collaborative virtual organization
CyVerse Institutions
CyVerse UK
11. • We strive to be the CI Lego blocks
• Danish 'leg godt' - 'play well'
• Also translates as 'I put together' in Latin
• If a solution is not available, you can craft your own using CyVerse CI components
CyVerse Products
12. Data Store
Initial 100 GB allocation – TB allocations available
Automatic data backup
Easy upload/download and sharing
The resources you need to share and manage data with your lab,
colleagues and community
13. Discovery Environment
Hundreds of bioinformatics Apps in an easy-to-use interface
A platform that can run almost any bioinformatics application
Seamlessly integrated with data and high performance computing
User extensible – add your own applications
14. Atmosphere
Cloud computing for the life sciences
Simple: Access to hundreds of virtual machine images
Flexible: Fully customize your software setup
Powerful: Integrated with CyVerse computing and data resources
15. Science APIs
Fully customize CyVerse resources
Science-as-a-service platform
Define your own compute, and storage resources (local and CyVerse)
Build your own app store of scientific codes and workflows
16. DNA Subway
Educational workflows for Genomes, DNA Barcoding, RNA-Seq
Commonly used bioinformatics tools in streamlined workflows
Teach important concepts in biology and bioinformatics
Inquiry-based experiments for novel discovery and publication of data
17. Bisque
Image analysis, management, and metadata
Secure image storage, analysis, and data management
Integrate existing applications or create new ones
Custom visualization and image handling routines and APIs
19. CyVerse Data Store
Store any type of file related to your research
Move files seamlessly between CyVerse platforms
Automate file transfers
Share files with lab members, collaborators, and communities
20. CyVerse Data Store
Multiple ways to access
Point-and-click: Cyberduck, Discovery Environment
Command line: iCommands
21. Discovery Environment
Simple upload/download for small files
Bulk upload files and folders (<10GB)
Import from URL (no size limit)
Advantage (+): Covers most upload/download and sharing needs
Disadvantage (-): Some size/speed limitations
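The size guidance on this slide can be sketched as a small decision helper. This is only an illustration under stated assumptions: the function name and return strings are hypothetical, the 10 GB threshold is taken from the "Bulk upload" bullet, and the slide's "small files" case is not modeled because no threshold is given for it.

```python
from typing import Optional

# From the slide: bulk upload handles files and folders under 10 GB.
DE_BULK_UPLOAD_LIMIT_GB = 10


def suggest_de_transfer(size_gb: float, source_url: Optional[str] = None) -> str:
    """Suggest a transfer route for a dataset (hypothetical helper, not a CyVerse API)."""
    if size_gb < 0:
        raise ValueError("size_gb must be non-negative")
    if source_url:
        # The slide lists import from URL as having no size limit.
        return "import from URL"
    if size_gb < DE_BULK_UPLOAD_LIMIT_GB:
        return "bulk upload"
    # Too large for Discovery Environment bulk upload: use another access route.
    return "use Cyberduck or iCommands"


if __name__ == "__main__":
    for size in (2, 50):
        print(size, "GB ->", suggest_de_transfer(size))
```

A helper like this only encodes the limits stated on the slide; actual limits may differ over time and should be checked against current CyVerse documentation.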
22. Cyberduck
Drag and drop files and folders
No size limit, file editing/previews
Easy Desktop functionality
Advantage (+): More like desktop file systems
Disadvantage (-): No permissions/metadata control
23. iCommands
Full flexibility
Ability to script and automate
Access from terminal/server
Advantage (+): Customizability
Disadvantage (-): Requires some command line expertise
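To illustrate the "script and automate" point, here is a minimal sketch that builds iCommands command lines from Python without running them. The commands `iput` and `iget` and the flags `-P` (progress), `-K` (verify checksum), and `-r` (recursive) are standard iCommands options; the helper names and the Data Store path `/iplant/home/your_username/...` are placeholders assumed for illustration, and an authenticated session (for example via `iinit`) is assumed before any command is actually executed.

```python
import shlex
from typing import List


def build_iput_command(local_path: str, collection: str, recursive: bool = False) -> List[str]:
    """Build (but do not run) an iput command that uploads local_path into a collection."""
    command = ["iput", "-P", "-K"]  # -P: show progress, -K: verify checksum after transfer
    if recursive:
        command.append("-r")  # -r: upload a directory and its contents
    command += [local_path, collection]
    return command


def build_iget_command(remote_path: str, local_dir: str, recursive: bool = False) -> List[str]:
    """Build (but do not run) an iget command that downloads remote_path into local_dir."""
    command = ["iget", "-P", "-K"]
    if recursive:
        command.append("-r")
    command += [remote_path, local_dir]
    return command


if __name__ == "__main__":
    # Placeholder paths: replace with your own directory and Data Store collection.
    upload = build_iput_command("raw_reads", "/iplant/home/your_username/raw_reads", recursive=True)
    print(shlex.join(upload))
```

Returning the command as a list keeps the sketch testable; on a machine with iCommands installed, the list could be passed to `subprocess.run(command, check=True)` inside a loop or scheduled job.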
26. Discovery Environment
A platform that can run almost any bioinformatics application
Seamlessly integrated with data and high performance computing
User extensible – add your own applications
27. • Upload / Download files and folders
• Share files via URL (Public Links)
• Share files/folders with other users
Data
Manage data
Discovery Environment Overview
28. Apps
• Run hundreds of bioinformatics Apps
• Build automated workflows
• Modify Apps or integrate new ones
Analyze data and customize Applications
Discovery Environment Overview
29. Analyses
• Monitor job status and find results
• Cancel jobs or re-launch jobs
• Detailed job history
View history, find results, reproduce analyses, optimize parameters
Discovery Environment Overview
31. Demo analysis – sequence alignment using MUSCLE
View sample data in Data Store
Launch a job using the MUSCLE sequence alignment app
Monitor the job progress and view results
Task: Take unaligned DNA sequences in FASTA format and create a multiple alignment
Discovery Environment
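The demo task has a simple input/output contract that can be sketched in a few lines: unaligned FASTA records go in, and a multiple alignment (every row padded with gaps to the same length) comes out. The sketch below does not run MUSCLE; it only parses FASTA text and checks whether a set of records looks like an alignment. The function names and the toy sequences are invented for illustration, and the "aligned" example is hand-written rather than real MUSCLE output.

```python
from typing import Dict


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA text into {record_name: sequence}."""
    records: Dict[str, str] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            if not header:
                raise ValueError("empty FASTA header")
            name = header[0]  # keep the identifier, drop any description
            records[name] = ""
        elif name is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            records[name] += line  # sequences may span several lines
    return records


def is_alignment(records: Dict[str, str]) -> bool:
    """True when there are at least two rows and all rows have the same length."""
    lengths = {len(seq) for seq in records.values()}
    return len(records) >= 2 and len(lengths) == 1


# Toy data, made up for this sketch.
UNALIGNED = ">seq1\nACGTACGT\n>seq2\nACGACGT\n>seq3\nACGTACG\n"
ALIGNED = ">seq1\nACGTACGT\n>seq2\nACG-ACGT\n>seq3\nACGTACG-\n"

if __name__ == "__main__":
    print(is_alignment(parse_fasta(UNALIGNED)))
    print(is_alignment(parse_fasta(ALIGNED)))
```

A check like this can be handy after downloading results from an analysis, to confirm that the output file is an alignment rather than a copy of the input.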
32. Flexible implementations of JupyterLab, RStudio, and R Shiny
All customizable via Docker
Developer friendly
Visual and Interactive Computing Environment
Discovery Environment: VICE
34. Atmosphere
Simple: Access hundreds of virtual machine images
Flexible: Fully customize your software setup
Powerful: Integrated with CyVerse computing and data resources
35. Important concepts: Image
What is Cloud Computing?
Image (file)
Document(s) (file)
Original system
Complete clone (files/data)
Copied Document(s) (file)
36. Important concepts: Instance
What is Cloud Computing?
CyVerse Cloud
+(Disk + CPU + Memory) + (Image)
Atmosphere Instance
(virtual machine)
128.196.34.158
37. Atmosphere Overview
Largest, easiest to use cloud for Life Sciences
• Choose an existing image or customize
• Instances up to 16-Core / 128 GB RAM
• Access via shell or VNC
• Share your images with selected users, or make them public
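The image and instance concepts from the preceding slides can be modeled as a small data structure: an instance is an image plus allocated disk, CPU, and memory. The sketch below is illustrative only. The class and field names are hypothetical and are not an Atmosphere API; the 16-core and 128 GB RAM ceilings come from the overview slide, and no disk ceiling is enforced because the slides do not state one.

```python
from dataclasses import dataclass

# Upper bounds quoted on the Atmosphere overview slide.
MAX_CORES = 16
MAX_RAM_GB = 128


@dataclass(frozen=True)
class Image:
    """A saved copy of a complete system (software and files)."""
    name: str


@dataclass(frozen=True)
class Instance:
    """A running virtual machine: an image plus allocated resources."""
    image: Image
    cores: int
    ram_gb: int
    disk_gb: int

    def __post_init__(self) -> None:
        if not 1 <= self.cores <= MAX_CORES:
            raise ValueError(f"cores must be between 1 and {MAX_CORES}")
        if not 1 <= self.ram_gb <= MAX_RAM_GB:
            raise ValueError(f"ram_gb must be between 1 and {MAX_RAM_GB}")
        if self.disk_gb < 1:
            raise ValueError("disk_gb must be at least 1")


if __name__ == "__main__":
    # Hypothetical image name, used only for this example.
    vm = Instance(Image("example-bioinformatics-image"), cores=4, ram_gb=32, disk_gb=100)
    print(vm)
```

The same image can back many instances of different sizes, which is what makes uniform setups for a lab or classroom practical.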
38. Atmosphere
Cloud computing for life sciences: sample use cases
• Run the software and data that are monopolizing your laptop/desktop
• Use desktop enabled images to run visually oriented programs (GUI)
• SUDO access – manage complex dependencies
• Uniform computing setups for your lab, collaborators, and students
• Make your own software available to a larger user community
40. Atmosphere
Cloud computing for life sciences
Windows: VNC Viewer, PuTTY
Mac: VNC Viewer, Shell/terminal
Linux: VNC Viewer, Shell/terminal
VNC Viewer: www.realvnc.com/download/viewer
PuTTY: www.putty.org
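The client choices on this slide amount to a lookup by operating system. The sketch below encodes that lookup and builds a shell connection command. It assumes that shell access means a standard SSH login with a username and the instance's IP address; the function names and the `your_username` placeholder are hypothetical, and the IP address is the example shown on the instance slide.

```python
from typing import List

# Shell clients per operating system, as listed on the slide.
SHELL_CLIENTS = {
    "windows": "PuTTY",
    "mac": "Shell/terminal",
    "linux": "Shell/terminal",
}


def shell_client_for(os_name: str) -> str:
    """Return the shell client the slide suggests for an operating system."""
    key = os_name.strip().lower()
    if key not in SHELL_CLIENTS:
        raise ValueError(f"unknown operating system: {os_name!r}")
    return SHELL_CLIENTS[key]


def build_ssh_command(username: str, host: str) -> List[str]:
    """Build (but do not run) an ssh command for a Mac or Linux terminal."""
    if not username or not host:
        raise ValueError("username and host are required")
    return ["ssh", f"{username}@{host}"]


if __name__ == "__main__":
    print(shell_client_for("Windows"))
    print(" ".join(build_ssh_command("your_username", "128.196.34.158")))
```

On Windows, the same username and IP address would be entered into PuTTY's connection dialog instead of a terminal command.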
41. Where to go from here:
Learning Center
• Get Started Guide
• Tutorials and Videos
• Documentation
Upcoming Events
• Workshops
• Webinars
learning.cyverse.org
42. Transforming Science Through Data-driven Discovery
Parker Antin
Nirav Merchant
Eric Lyons
Matt Vaughn
Doreen Ware
Dave Micklos
CyVerse is supported by the National Science Foundation under Grant No. DBI-0735191 and DBI-1265383.
Executive Team
Editor's notes
Focus here is on genomics data, but not restricted to genomics data