The document discusses extending the iPlant cyberinfrastructure to support microbes in addition to plants. It provides an overview of iPlant, including its funding from NSF, collaborations, resources like data storage and computing platforms, and applications for analysis. Future plans are outlined to build tools and streamline workflows for metagenomics and enable high-throughput computing for microbial data.
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to microbes
1. Bonnie Hurwitz, PhD
Arizona Health Sciences Center
Extending the iPlant Cyberinfrastructure: From Plants to Microbes
2. The iPlant Collaborative
Community Cyberinfrastructure for Life Science
http://www.iplantcollaborative.org
3. iVirus and iMicrobe
Matt Sullivan, PhD
Funding: Joaquin Ruiz, PhD (Dean, College of Science); Shane Burgess, PhD (Dean, CALS)
Staff: Darren Boss, Devesh Chourasiya
4. The iPlant Collaborative
Vision: Enable life science researchers and educators to use and extend cyberinfrastructure to understand and ultimately predict the complexity of biological systems.
5. How iPlant CI Enables Discovery
Challenge: Create an easy-to-use platform powerful enough to handle data-intensive biology.
Many bioinformatics tools are "off limits" to those without specialized computational backgrounds.
6. The iPlant Collaborative
Who makes up iPlant?
iPlant is a collaborative virtual organization.
7. The iPlant Collaborative
How is iPlant funded?
• iPlant renewed by NSF: September 2013 begins the next 5-year period
• Scientific Advisory Board: focus on genotype-phenotype science
• NSF recommended expansion of scope beyond plants
8. The iPlant Collaborative
Who does iPlant collaborate with?
iPlant collaborates to enable access to the solutions that work best for the community.
9. How iPlant CI Enables Discovery: Overview of Resources
End users and computational users draw on iPlant and TeraGrid/XSEDE for:
✓ Storage ✓ Computation ✓ Hosting ✓ Web services ✓ Scalability
Building a platform that can support diverse and constantly evolving needs.
10. iPlant Data Store
✓ Initial 100 GB allocation; TB allocations available
✓ Automatic data backup
✓ Easy upload/download and sharing
The resources you need to share and manage
data with your lab, colleagues and community
11. Discovery Environment
Hundreds of bioinformatics Apps in an easy-to-use interface
✓ A platform that can run almost any bioinformatics application
✓ Seamlessly integrated with data and high-performance computing
✓ User extensible: add your own applications
12. Agave API
Fully customize iPlant resources
✓ Science-as-a-service platform
✓ Define your own compute and storage resources (local and iPlant)
✓ Build your own app store of scientific code and workflows
13. Atmosphere
Cloud computing for the life sciences
✓ Simple: one-click access to more than 100 virtual machine images
✓ Flexible: fully customize your software setup
✓ Powerful: integrated with iPlant computing and data resources
14. DNA Subway
Educational workflows for genomes, DNA barcoding, and RNA-Seq
✓ Commonly used bioinformatics tools in streamlined workflows
✓ Teach important concepts in biology and bioinformatics
✓ Inquiry-based experiments for novel discovery and publication of data
15. Bisque
Image analysis, management, and metadata
✓ Secure image storage, analysis, and data management
✓ Integrate existing applications or create new ones
✓ Custom visualization and image-handling routines and APIs
16. Typical Use
iMicrobe and iVirus leverage the iPlant cyberinfrastructure (end users and computational users, via TeraGrid/XSEDE).
Using iPlant for:
✓ Storage ✓ Computation ✓ Analysis ✓ App dev. ✓ Pipeline dev. ✓ Code distrib. ✓ Data discoverability
17. What's Under the Hood? Stampede: High-Level Overview
• Base cluster (Dell/Intel/Mellanox):
  – Intel Sandy Bridge processors
  – Dell dual-socket nodes w/ 32 GB RAM (2 GB/core)
  – 6,400 nodes
  – 56 Gb/s Mellanox FDR InfiniBand interconnect
  – More than 100,000 cores; 2.2 PF peak performance
• Co-processors:
  – Intel Xeon Phi "MIC" Many Integrated Core processors
  – Special release of "Knight's Corner" (61 cores)
  – All MIC cards are on site at TACC; more than 6,000 installed, final installation ongoing for formal summer acceptance
  – 7+ PF peak performance
• Max total concurrency: exceeds 500,000 cores, 1.8M threads
• Entered production operations on January 7, 2013
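The base-cluster figures above hang together arithmetically. A quick sanity check, assuming per-chip specs not stated on the slide (8-core Sandy Bridge sockets at 2.7 GHz, 8 double-precision FLOPs per cycle with AVX):

```python
# Sanity check of the Stampede base-cluster figures quoted above.
# Assumed per-chip specs (not on the slide): 8 cores per socket,
# 2.7 GHz clock, 8 double-precision FLOPs/cycle with AVX.
nodes = 6400
cores = nodes * 2 * 8                  # dual-socket, 8 cores per socket
peak_pf = cores * 2.7e9 * 8 / 1e15     # cores * Hz * FLOPs/cycle, in PF
print(cores, round(peak_pf, 2))        # 102400 cores, ~2.21 PF
```

Under those assumptions, 6,400 dual-socket nodes give 102,400 cores and about 2.21 PF, matching the "more than 100,000 cores, 2.2 PF peak" quoted above.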
18. iMicrobe/iVirus: New App Development
June 2013 – May 2014:
13 new Apps
1 high-throughput analysis pipeline
19. Forging Ahead with iPlant
• Build a metagenomics toolkit
• Streamline metagenomics workflows
• Enable high-throughput computing
• Provide key datasets for computation
20. iPlant Data Store
The resources you need to share and manage data with your lab, colleagues, and community
21. Overview of the iPlant Data Store
Some complications of Big Data:
• Difficult/slow transfers
• Expense for storage/backup
• Difficult to share and publish
• Metadata
• Analysis
22. iPlant Supports the Life Cycle of Data
Store, markup, search, transfer, analyze, visualize, collaborate, and share, from pre-publication through post-publication (e.g., Data → Algo 1 → Results A; Data → Algo 2 → Results B).
23. Overview of the iPlant Data Store
Scalable, reliable, redundant, high-performance (TeraGrid/XSEDE)
• Access your data from multiple iPlant services
• Automatic data backup (redundant between University of Arizona and University of Texas)
• Multiple ways to share data with collaborators
• Multi-threaded high-speed transfers
• Default 100 GB allocation; >1 TB allocations available with justification
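The "multi-threaded high-speed transfers" bullet can be illustrated with a toy sketch: split a payload into chunks and copy them via a thread pool, roughly how parallel transfer clients overlap I/O. This is an in-memory illustration only, not actual iPlant Data Store client code:

```python
# Toy illustration of multi-threaded transfer: split a payload into
# chunks and copy each chunk in a worker thread, then reassemble.
# In-memory only; real clients overlap network I/O the same way.
from concurrent.futures import ThreadPoolExecutor

def transfer(payload: bytes, chunk_size: int = 4) -> bytes:
    chunks = [payload[i:i + chunk_size]
              for i in range(0, len(payload), chunk_size)]
    dest = [None] * len(chunks)
    def copy(i):                      # each worker moves one chunk
        dest[i] = chunks[i]
    with ThreadPoolExecutor(max_workers=4) as pool:
        list(pool.map(copy, range(len(chunks))))
    return b"".join(dest)

print(transfer(b"metagenome-reads", 4))  # → b'metagenome-reads'
```

The payload arrives intact because each worker writes to its own slot; ordering is restored at reassembly rather than during transfer.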
24. Overview of the iPlant Data Store
Some important items we won't see: transfer-speed comparison.

Source           | Destination  | Copy method | Time (seconds)
CD               | My computer  | cp          | 320
Berkeley server  | My computer  | scp         | 150
External drive   | My computer  | cp          | 36
USB 2.0 flash    | My computer  | cp          | 30
iDS              | My computer  | iget        | 18
My computer      | My computer  | cp          | 15

Close to optimum conditions; transfer between Univ. of Arizona and UC Berkeley.
100 GB: 29m15s (1 GB / 17.5 seconds)
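The quoted rate is easy to verify: 100 GB in 29 minutes 15 seconds works out to the "1 GB / 17.5 seconds" on the slide:

```python
# Check the quoted iPlant Data Store transfer rate: 100 GB moved in
# 29 minutes 15 seconds between Univ. of Arizona and UC Berkeley.
seconds = 29 * 60 + 15       # 1755 s total
per_gb = seconds / 100       # seconds per GB
print(round(per_gb, 2))      # 17.55 s/GB, i.e. "1 GB / 17.5 seconds"
```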
26. Overview of the iPlant Discovery Environment
Through the Discovery Environment you have:
• High-powered computing
• iPlant data store
• Easy-to-use interface
• Virtually limitless apps
• Analysis history (provenance)
27. What can you do in the iPlant DE?
A scalable platform for powerful computing, data, and application resources
• Navigate the components of the DE
• Access and manipulate data
• Start and complete an analysis
• Track your analysis and see your results
28. Why is the iPlant DE Scalable?
Democratize your code
• Rich platform for bioinformatics: ~400 apps (and counting)
• Data co-localized with analysis
• Easy-to-use interface, with access to support
• Easy to integrate and customize your own tools
29. Discovery Environment Example
Goal: Create a metagenomic assembly.
Task 1: Upload a metagenomic FASTA file to your personal data store.
Task 2: Run quality control on your raw sequence reads.
Task 3: Find and select an assembly tool (e.g., MetaVelvet).
Task 4: Specify parameters and your input files; run the assembly App.
Task 5: Monitor the progress of your analysis and save parameters.
Task 6: View your results.
37. What about Annotations?
• Annotations are descriptions of features on contigs in a genome / metagenome
  – Ab initio gene predictions
  – Protein homology (GenBank nr, SIMAP)
  – Curated protein resources (COG, KEGG, …)
• Secondary annotations
  – InterProScan (Pfam, PIR, PROSITE, …)
  – GO and other ontologies
  – Pathway mapping (KEGG, MetaCyc, EcoCyc)
38. Genome and Metagenome Assembly & Annotation at iPlant
Assembly: ALLPATHS-LG, Newbler, SOAPdenovo, Velvet, MetaVelvet, ABySS, SPA, Digital Norm., IDBA-UD
Ab initio gene prediction: Glimmer, Prodigal, FragGeneScan, MetaGene, MetaGeneMark
Transcriptome assembly, de novo: Trinity, SOAPdenovo-Trans, Velvet/Oases, Trans-ABySS; reference-guided: TopHat, Cufflinks
Annotation, primary: BLAST (k-mer based); secondary: InterProScan, InterPro2GO
Conversion tools: tophat2gff, cufflinks2gff
Visualization: JBrowse, Web Apollo
Inputs: metagenome input, evidence input
Data Commons: genomes and metagenomes, proteins/genes, reference annotations, metadata (in iRODS)
Key: in the DE / at TACC / under development
Using iPlant for: ✓ Storage ✓ Computation ✓ Analysis ✓ Data access ✓ Code distr. ✓ Query by metadata
39.
40. The Louis Pasteur Method
We can't "see" all bacteria using culture-based approaches: Razumov (1932), "The Great Plate Anomaly."
41. The Post-Genomic Era: From Pasteur to CSI
From isolate genomics to community genomics (metagenomics)
42. Making Sense of Metagenomes
Environmental sample → extract DNA → library creation → high-throughput sequencing → assemble reads → gene prediction → compare to known proteins → function and taxonomy
43. Viromes are dominated by the Unknown
Photic: Bacteria 5%, Eukaryota 1%, Archaea 0%, Viruses 3%, Unknown 91%
Aphotic: Viruses 7%, Bacteria 4%, Eukaryota 1%, Archaea 0%, Unknown 88%
We need new tools!
Hurwitz BL & Sullivan MB. The Pacific Ocean Virome (POV). PLoS One. 8: e57355.
49. Niche-Defining Photic Core: Host Genes that Promote Viral Replication
• Fe-S cluster biogenesis and function
• DNA/protein biosynthesis and repair
• Host "wake-up"
• Energy production in photosynthesis
Hurwitz BL, Hallam S, Sullivan MB (2013). Metabolic Reprogramming by Viruses in the Sunlit and Dark Ocean. Genome Biology, 14, R123.
Hurwitz BL, Brum J, Sullivan MB. Depth-Stratified Functional and Taxonomic Niche Specialization in the 'Core' and 'Flexible' Pacific Ocean Virome. In review.
50. Niche-Defining Aphotic Core: Adaptive for High-Pressure Environments
• DNA replication initiation
• DNA repair
• Motility
• Energy production in the TCA cycle
Hurwitz BL, Hallam S, Sullivan MB (2013). Metabolic Reprogramming by Viruses in the Sunlit and Dark Ocean. Genome Biology, 14, R123.
Hurwitz BL, Brum J, Sullivan MB. Depth-Stratified Functional and Taxonomic Niche Specialization in the 'Core' and 'Flexible' Pacific Ocean Virome. In review.
51. PCpipe: Creating Protein Clusters for Viral Ecology
iPlant Discovery Environment: automated workflows
New.fastq → QC sequences (FASTQ_shrinker) → Assembly part 1 (velveth) → Assembly part 2 (velvetg) → Find genes (MetaGeneMark) → New.a.faa → pcpipe part 1 (cd-hit-2d, against POV PCs) → pcpipe part 2 (cd-hit) → POV + novel PCs → Input to analyses: BLASTX to nr, QIIME, rarefaction
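The cd-hit / cd-hit-2d steps above cluster protein sequences at an identity threshold. A toy sketch of the underlying greedy-clustering idea follows; real cd-hit uses a fast word filter and is far more efficient, and the sequences and threshold here are made up for illustration:

```python
# Toy greedy clustering at an identity threshold, the basic idea
# behind cd-hit as used in PCpipe. Not the real algorithm: cd-hit
# prefilters candidate pairs with short-word counting.
def identity(a, b):
    # fraction of matching positions over the longer sequence
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))

def greedy_cluster(seqs, threshold=0.9):
    reps, clusters = [], []
    # longest-first, as cd-hit does: long sequences seed clusters
    for s in sorted(seqs, key=len, reverse=True):
        for i, r in enumerate(reps):
            if identity(s, r) >= threshold:
                clusters[i].append(s)
                break
        else:                      # no representative close enough
            reps.append(s)
            clusters.append([s])
    return clusters

print(greedy_cluster(["ACGTACGT", "ACGTACGA", "TTTT"], 0.8))
# → [['ACGTACGT', 'ACGTACGA'], ['TTTT']]
```

cd-hit-2d applies the same idea across two sets (here, new ORFs against existing POV protein clusters), so sequences that fail to join an existing cluster fall through to the novel-PC step.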
52. Creating Workflows: Easy as 1-2-3-4
1. Select the Apps
2. Order the Apps
3. Map outputs to inputs
4. Run the analysis
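The 1-2-3-4 recipe amounts to function composition: each app's outputs become the next app's inputs. A minimal sketch with hypothetical stand-in "apps" (these toy functions are not real Discovery Environment apps):

```python
# Toy sketch of the 1-2-3-4 workflow recipe: select apps, order
# them, map each step's outputs to the next step's inputs, run.
def qc(reads):
    # drop reads containing ambiguous bases (toy quality control)
    return [r for r in reads if "N" not in r]

def assemble(reads):
    # placeholder "assembly": concatenate the surviving reads
    return ["".join(reads)]

workflow = [qc, assemble]          # 1. select and 2. order the apps
data = ["ACGT", "ANGT", "TTGA"]
for step in workflow:              # 3. outputs map to the next inputs
    data = step(data)              # 4. run the analysis
print(data)                        # → ['ACGTTTGA']
```

The DE workflow builder does this mapping graphically, which is also why the Foundation API gotcha on a later slide matters: a step whose outputs the builder cannot capture breaks the chain.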
63. Gotchas in the PCpipe Workflow
New.fastq → QC sequences (FASTQ_shrinker) → Assembly part 1 (velveth) → Assembly part 2 (velvetg) → Find genes (MetaGeneMark) → New.a.faa → pcpipe part 1 (cd-hit-2d) → pcpipe part 2 (cd-hit) (together, the pcpipe workflow) → POV + novel PCs → Annotation: protein annotation, secondary annotation
Gotcha: apps that run via the Foundation API on XSEDE (HPC) cannot be used in a DE workflow.
64. An Integrated PCpipe
Inputs: (1) user ORFs; (2) existing protein clusters
Step 1: cd-hit-2d (on Condor)
Step 2: cd-hit (on Condor)
Step 3: extract proteins in novel PCs (on Condor)
Step 4: BLAST vs. SIMAP (on TACC)
Step 5: SIMAP annotation pipeline (on Condor)
Outputs: (1) ORFs in existing clusters; (2) ORFs in new clusters; (3) annotation for new clusters
Pipeline management: iPlant App → iMicrobe adapter → iMicrobe Condor node, with Foundation code handling HPC job distribution.