The document discusses extending the iPlant cyberinfrastructure to support microbes in addition to plants. It provides an overview of iPlant, including its funding from NSF, collaborations, resources like data storage and computing platforms, and applications for analysis. Future plans are outlined to build tools and streamline workflows for metagenomics and enable high-throughput computing for microbial data.
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to microbes
1. Bonnie Hurwitz, PhD
Arizona Health Sciences Center
Extending the iPlant Cyberinfrastructure: From Plants to Microbes
2. The iPlant Collaborative
Community Cyberinfrastructure for Life Science
http://www.iplantcollaborative.org
3. iVirus and iMicrobe
Matt Sullivan, PhD
Funding: Joaquin Ruiz, PhD (Dean, College of Science); Shane Burgess, PhD (Dean, CALS)
Staff: Darren Boss, Devesh Chourasiya
4. The iPlant Collaborative
Vision: Enable life science researchers and educators to use and extend cyberinfrastructure to understand and ultimately predict the complexity of biological systems.
5. How iPlant CI Enables Discovery
Challenge: Create an easy-to-use platform powerful enough to handle data-intensive biology.
Many bioinformatics tools are "off limits" to those without specialized computational backgrounds.
6. The iPlant Collaborative
Who makes up iPlant?
iPlant is a collaborative virtual organization.
7. The iPlant Collaborative
How is iPlant funded?
• iPlant renewed by NSF: September 2013 begins the next 5-year period
• Scientific Advisory Board: focus on genotype-phenotype science
• NSF recommended expansion of scope beyond plants
8. The iPlant Collaborative
Who does iPlant collaborate with?
iPlant collaborates to enable access to the solutions that work best for the community.
9. How iPlant CI Enables Discovery: Overview of Resources
End users and computational users draw on iPlant and TeraGrid/XSEDE for:
✓ Storage ✓ Computation ✓ Hosting ✓ Web services ✓ Scalability
Building a platform that can support diverse and constantly evolving needs.
10. iPlant Data Store
✓ Initial 100 GB allocation; TB allocations available
✓ Automatic data backup
✓ Easy upload/download and sharing
The resources you need to share and manage
data with your lab, colleagues and community
11. Discovery Environment
Hundreds of bioinformatics Apps in an easy-to-use interface
✓ A platform that can run almost any bioinformatics application
✓ Seamlessly integrated with data and high-performance computing
✓ User extensible: add your own applications
12. Agave API
Fully customize iPlant resources
✓ Science-as-a-service platform
✓ Define your own compute and storage resources (local and iPlant)
✓ Build your own app store of scientific code and workflows
13. Atmosphere
Cloud computing for the life sciences
✓ Simple: one-click access to more than 100 virtual machine images
✓ Flexible: fully customize your software setup
✓ Powerful: integrated with iPlant computing and data resources
14. DNA Subway
Educational workflows for genomes, DNA barcoding, and RNA-Seq
✓ Commonly used bioinformatics tools in streamlined workflows
✓ Teach important concepts in biology and bioinformatics
✓ Inquiry-based experiments for novel discovery and publication of data
15. Bisque
Image analysis, management, and metadata
✓ Secure image storage, analysis, and data management
✓ Integrate existing applications or create new ones
✓ Custom visualization and image-handling routines and APIs
16. Typical Use
iMicrobe and iVirus leverage the iPlant cyberinfrastructure (end users and computational users, via TeraGrid/XSEDE).
Using iPlant for:
✓ Storage ✓ Computation ✓ Analysis ✓ App dev. ✓ Pipeline dev. ✓ Code distrib. ✓ Data discoverability
17. What's Under the Hood? Stampede: High-Level Overview
• Base cluster (Dell/Intel/Mellanox):
  – Intel Sandy Bridge processors
  – Dell dual-socket nodes w/ 32 GB RAM (2 GB/core)
  – 6,400 nodes
  – 56 Gb/s Mellanox FDR InfiniBand interconnect
  – More than 100,000 cores; 2.2 PF peak performance
• Co-processors:
  – Intel Xeon Phi "MIC" Many Integrated Core processors
  – Special release of "Knight's Corner" (61 cores)
  – All MIC cards are on site at TACC; more than 6,000 installed, final installation ongoing for formal summer acceptance
  – 7+ PF peak performance
• Max total concurrency: exceeds 500,000 cores, 1.8M threads
• Entered production operations on January 7, 2013
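The base-cluster figures above hang together arithmetically. A quick sanity check, assuming per-chip specs not stated on the slide (8-core Sandy Bridge sockets at 2.7 GHz, 8 double-precision FLOPs per cycle with AVX):

```python
# Sanity check of the Stampede base-cluster figures quoted above.
# Assumed per-chip specs (not on the slide): 8 cores per socket,
# 2.7 GHz clock, 8 double-precision FLOPs/cycle with AVX.
nodes = 6400
cores = nodes * 2 * 8                  # dual-socket, 8 cores per socket
peak_pf = cores * 2.7e9 * 8 / 1e15     # cores * Hz * FLOPs/cycle, in PF
print(cores, round(peak_pf, 2))        # 102400 cores, ~2.21 PF
```

Under those assumptions, 6,400 dual-socket nodes give 102,400 cores and about 2.21 PF, matching the "more than 100,000 cores, 2.2 PF peak" quoted above.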
18. iMicrobe/iVirus: New App Development
June 2013 – May 2014:
13 new Apps
1 high-throughput analysis pipeline
19. Forging Ahead with iPlant
• Build a metagenomics toolkit
• Streamline metagenomics workflows
• Enable high-throughput computing
• Provide key datasets for computation
20. iPlant Data Store
The resources you need to share and manage data with your lab, colleagues, and community
21. Overview of the iPlant Data Store
Some complications of Big Data:
• Difficult/slow transfers
• Expense for storage/backup
• Difficult to share and publish
• Metadata
• Analysis
22. iPlant Supports the Life Cycle of Data
Store, markup, search, transfer, analyze, visualize, collaborate, and share, from pre-publication through post-publication (e.g., Data → Algo 1 → Results A; Data → Algo 2 → Results B).
23. Overview of the iPlant Data Store
Scalable, reliable, redundant, high-performance (TeraGrid/XSEDE)
• Access your data from multiple iPlant services
• Automatic data backup (redundant between University of Arizona and University of Texas)
• Multiple ways to share data with collaborators
• Multi-threaded high-speed transfers
• Default 100 GB allocation; >1 TB allocations available with justification
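The "multi-threaded high-speed transfers" bullet can be illustrated with a toy sketch: split a payload into chunks and copy them via a thread pool, roughly how parallel transfer clients overlap I/O. This is an in-memory illustration only, not actual iPlant Data Store client code:

```python
# Toy illustration of multi-threaded transfer: split a payload into
# chunks and copy each chunk in a worker thread, then reassemble.
# In-memory only; real clients overlap network I/O the same way.
from concurrent.futures import ThreadPoolExecutor

def transfer(payload: bytes, chunk_size: int = 4) -> bytes:
    chunks = [payload[i:i + chunk_size]
              for i in range(0, len(payload), chunk_size)]
    dest = [None] * len(chunks)
    def copy(i):                      # each worker moves one chunk
        dest[i] = chunks[i]
    with ThreadPoolExecutor(max_workers=4) as pool:
        list(pool.map(copy, range(len(chunks))))
    return b"".join(dest)

print(transfer(b"metagenome-reads", 4))  # → b'metagenome-reads'
```

The payload arrives intact because each worker writes to its own slot; ordering is restored at reassembly rather than during transfer.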
24. Overview of the iPlant Data Store
Some important items we won't see: transfer-speed comparison.

Source           | Destination  | Copy method | Time (seconds)
CD               | My computer  | cp          | 320
Berkeley server  | My computer  | scp         | 150
External drive   | My computer  | cp          | 36
USB 2.0 flash    | My computer  | cp          | 30
iDS              | My computer  | iget        | 18
My computer      | My computer  | cp          | 15

Close to optimum conditions; transfer between Univ. of Arizona and UC Berkeley.
100 GB: 29m15s (1 GB / 17.5 seconds)
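The quoted rate is easy to verify: 100 GB in 29 minutes 15 seconds works out to the "1 GB / 17.5 seconds" on the slide:

```python
# Check the quoted iPlant Data Store transfer rate: 100 GB moved in
# 29 minutes 15 seconds between Univ. of Arizona and UC Berkeley.
seconds = 29 * 60 + 15       # 1755 s total
per_gb = seconds / 100       # seconds per GB
print(round(per_gb, 2))      # 17.55 s/GB, i.e. "1 GB / 17.5 seconds"
```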
26. Overview of the iPlant Discovery Environment
Through the Discovery Environment you have:
• High-powered computing
• iPlant data store
• Easy-to-use interface
• Virtually limitless apps
• Analysis history (provenance)
27. What can you do in the iPlant DE?
A scalable platform for powerful computing, data, and application resources
• Navigate the components of the DE
• Access and manipulate data
• Start and complete an analysis
• Track your analysis and see your results
28. Why is the iPlant DE Scalable?
Democratize your code
• Rich platform for bioinformatics: ~400 apps (and counting)
• Data co-localized with analysis
• Easy-to-use interface, with access to support
• Easy to integrate and customize your own tools
29. Discovery Environment Example
Goal: Create a metagenomic assembly.
Task 1: Upload a metagenomic FASTA file to your personal data store.
Task 2: Run quality control on your raw sequence reads.
Task 3: Find and select an assembly tool (e.g., MetaVelvet).
Task 4: Specify parameters and your input files; run the assembly App.
Task 5: Monitor the progress of your analysis and save parameters.
Task 6: View your results.
37. What about Annotations?
• Annotations are descriptions of features on contigs in a genome / metagenome
  – Ab initio gene predictions
  – Protein homology (GenBank nr, SIMAP)
  – Curated protein resources (COG, KEGG, …)
• Secondary annotations
  – InterProScan (Pfam, PIR, PROSITE, …)
  – GO and other ontologies
  – Pathway mapping (KEGG, MetaCyc, EcoCyc)
38. Genome and Metagenome Assembly & Annotation at iPlant
Assembly: ALLPATHS-LG, Newbler, SOAPdenovo, Velvet, MetaVelvet, ABySS, SPA, Digital Norm., IDBA-UD
Ab initio gene prediction: Glimmer, Prodigal, FragGeneScan, MetaGene, MetaGeneMark
Transcriptome assembly, de novo: Trinity, SOAPdenovo-Trans, Velvet/Oases, Trans-ABySS; reference-guided: TopHat, Cufflinks
Annotation, primary: BLAST (k-mer based); secondary: InterProScan, InterPro2GO
Conversion tools: tophat2gff, cufflinks2gff
Visualization: JBrowse, Web Apollo
Inputs: metagenome input, evidence input
Data Commons: genomes and metagenomes, proteins/genes, reference annotations, metadata (in iRODS)
Key: in the DE / at TACC / under development
Using iPlant for: ✓ Storage ✓ Computation ✓ Analysis ✓ Data access ✓ Code distr. ✓ Query by metadata
39.
40. The Louis Pasteur Method
We can't "see" all bacteria using culture-based approaches: Razumov (1932), "The Great Plate Anomaly."
41. The Post-Genomic Era: From Pasteur to CSI
From isolate genomics to community genomics (metagenomics)
42. Making Sense of Metagenomes
Environmental sample → extract DNA → library creation → high-throughput sequencing → assemble reads → gene prediction → compare to known proteins → function and taxonomy
43. Viromes are dominated by the Unknown
Photic: Bacteria 5%, Eukaryota 1%, Archaea 0%, Viruses 3%, Unknown 91%
Aphotic: Viruses 7%, Bacteria 4%, Eukaryota 1%, Archaea 0%, Unknown 88%
We need new tools!
Hurwitz BL & Sullivan MB. The Pacific Ocean Virome (POV). PLoS One. 8: e57355.
49. Niche-Defining Photic Core: Host Genes that Promote Viral Replication
• Fe-S cluster biogenesis and function
• DNA/protein biosynthesis and repair
• Host "wake-up"
• Energy production in photosynthesis
Hurwitz BL, Hallam S, Sullivan MB (2013). Metabolic Reprogramming by Viruses in the Sunlit and Dark Ocean. Genome Biology, 14, R123.
Hurwitz BL, Brum J, Sullivan MB. Depth-Stratified Functional and Taxonomic Niche Specialization in the 'Core' and 'Flexible' Pacific Ocean Virome. In review.
50. Niche-Defining Aphotic Core: Adaptive for High-Pressure Environments
• DNA replication initiation
• DNA repair
• Motility
• Energy production in the TCA cycle
Hurwitz BL, Hallam S, Sullivan MB (2013). Metabolic Reprogramming by Viruses in the Sunlit and Dark Ocean. Genome Biology, 14, R123.
Hurwitz BL, Brum J, Sullivan MB. Depth-Stratified Functional and Taxonomic Niche Specialization in the 'Core' and 'Flexible' Pacific Ocean Virome. In review.
51. PCpipe: Creating Protein Clusters for Viral Ecology
iPlant Discovery Environment: automated workflows
New.fastq → QC sequences (FASTQ_shrinker) → Assembly part 1 (velveth) → Assembly part 2 (velvetg) → Find genes (MetaGeneMark) → New.a.faa → pcpipe part 1 (cd-hit-2d, against POV PCs) → pcpipe part 2 (cd-hit) → POV + novel PCs → Input to analyses: BLASTX to nr, QIIME, rarefaction
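The cd-hit / cd-hit-2d steps above cluster protein sequences at an identity threshold. A toy sketch of the underlying greedy-clustering idea follows; real cd-hit uses a fast word filter and is far more efficient, and the sequences and threshold here are made up for illustration:

```python
# Toy greedy clustering at an identity threshold, the basic idea
# behind cd-hit as used in PCpipe. Not the real algorithm: cd-hit
# prefilters candidate pairs with short-word counting.
def identity(a, b):
    # fraction of matching positions over the longer sequence
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))

def greedy_cluster(seqs, threshold=0.9):
    reps, clusters = [], []
    # longest-first, as cd-hit does: long sequences seed clusters
    for s in sorted(seqs, key=len, reverse=True):
        for i, r in enumerate(reps):
            if identity(s, r) >= threshold:
                clusters[i].append(s)
                break
        else:                      # no representative close enough
            reps.append(s)
            clusters.append([s])
    return clusters

print(greedy_cluster(["ACGTACGT", "ACGTACGA", "TTTT"], 0.8))
# → [['ACGTACGT', 'ACGTACGA'], ['TTTT']]
```

cd-hit-2d applies the same idea across two sets (here, new ORFs against existing POV protein clusters), so sequences that fail to join an existing cluster fall through to the novel-PC step.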
52. Creating Workflows: Easy as 1-2-3-4
1. Select the Apps
2. Order the Apps
3. Map outputs to inputs
4. Run the analysis
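The 1-2-3-4 recipe amounts to function composition: each app's outputs become the next app's inputs. A minimal sketch with hypothetical stand-in "apps" (these toy functions are not real Discovery Environment apps):

```python
# Toy sketch of the 1-2-3-4 workflow recipe: select apps, order
# them, map each step's outputs to the next step's inputs, run.
def qc(reads):
    # drop reads containing ambiguous bases (toy quality control)
    return [r for r in reads if "N" not in r]

def assemble(reads):
    # placeholder "assembly": concatenate the surviving reads
    return ["".join(reads)]

workflow = [qc, assemble]          # 1. select and 2. order the apps
data = ["ACGT", "ANGT", "TTGA"]
for step in workflow:              # 3. outputs map to the next inputs
    data = step(data)              # 4. run the analysis
print(data)                        # → ['ACGTTTGA']
```

The DE workflow builder does this mapping graphically, which is also why the Foundation API gotcha on a later slide matters: a step whose outputs the builder cannot capture breaks the chain.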
63. Gotchas in the PCpipe Workflow
New.fastq → QC sequences (FASTQ_shrinker) → Assembly part 1 (velveth) → Assembly part 2 (velvetg) → Find genes (MetaGeneMark) → New.a.faa → pcpipe part 1 (cd-hit-2d) → pcpipe part 2 (cd-hit) (together, the pcpipe workflow) → POV + novel PCs → Annotation: protein annotation, secondary annotation
Gotcha: apps that run via the Foundation API on XSEDE (HPC) cannot be used in a DE workflow.
64. An Integrated PCpipe
Inputs: (1) user ORFs; (2) existing protein clusters
Step 1: cd-hit-2d (on Condor)
Step 2: cd-hit (on Condor)
Step 3: extract proteins in novel PCs (on Condor)
Step 4: BLAST vs. SIMAP (on TACC)
Step 5: SIMAP annotation pipeline (on Condor)
Outputs: (1) ORFs in existing clusters; (2) ORFs in new clusters; (3) annotation for new clusters
Pipeline management: iPlant App → iMicrobe adapter → iMicrobe Condor node, with Foundation code handling HPC job distribution.