myGrid Workflow for Bioinformatics Analysis
1. http://taverna.org.uk/
Stian Soiland-Reyes & Robert Haines
myGrid, School of Computer Science, University of Manchester, UK
ITER IM workshop, Château de Cadarache, 2011-06-08
2. What is myGrid?
• An e-Science collaboration since 2001
• Not a grid!
• Numerous partners involved:
  – University of Manchester
  – University of Southampton
  – University of Oxford
  – EMBL-EBI
• Provides sustainable and production-quality software
• Supported by OMII-UK, EPSRC and BBSRC
• Mixture of developers, bioinformaticians and researchers
Software | Services | Content | Skills | Community
http://www.mygrid.org.uk/
http://www.taverna.org.uk/
3. Motivation: Bioinformatics
• Challenge: large amounts of data
• Many open questions
• Numerous freely available public datasets and analysis tools
4. Huge amounts of data
• Microarray: 1000+ genes
• QTL regions: 100+ genes
• Next Gen Sequencing: 100,000+ genes
How do I look at all the genes systematically?
5. Manual approach
• Search using public web sites and databases
  – Pubmed, Uniprot, EBI, BioMart
• Copy and paste to web tools for analysis
  – NCBI Blast, EBI InterPro
• Further processing locally
  – R, Perl, Python
6. Manual: disadvantages
• Scale of analysis task overwhelms researchers – lots of data
• User bias and premature filtering of datasets – cherry picking
• Hypothesis-driven approach to data analysis
• Constant changes in data – problems with re-analysis of data
• Implicit methodologies (hyper-linking through web pages)
• Error proliferation from any of the listed issues – notably human error
7. Web services and workflows
• Web services
  – Technology and standards for exposing code and data resources that can be programmatically consumed by a remote third party
  – Description of how to interact with the service: parameters, documentation
• Workflows
  – General technique for describing and executing a process
  – Describe what you want to do and which services to run
8. The Taverna Open Source Suite of Tools
[Architecture diagram. Components shown: Web Portals, Workflow Repository, GUI Workbench, Client User Interfaces, Virtual Machine, Third Party Tools, Service Catalogue, Workflow Engine, Provenance Store, Workflow Server, Activity and Service Plug-in Manager, Open Provenance Model, Programming and APIs, Secure Service Access]
9. Taverna workflows
• A set of (local and remote) services to analyze or manage data
• Nested workflows are also services
• Data links connect services, i.e. output from service A is input to services B and C
• Describes the desired dataflow instead of process coordination
• Automatic iterations
• Can customize list handling and control links
[Workflow diagram. Workflow inputs: start_position, chromosome_name, end_position. Services shown include genes_in_qtl, mmusculus_gene_ensembl, Kegg_gene_ids, Get_pathways, merge_reports and create_report. A nested workflow takes regex and gene_ids and returns pathway_genes and pathway_ids via get_pathways_by_genes1. Workflow outputs: gene_descriptions, genes_pathways, merged_pathways, pathway_descriptions, pathway_ids, kegg_external_gene_reference, report, ensembl_database_release, kegg_pathway_release]
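The dataflow idea on this slide (output from service A feeds services B and C, and the workflow states what is connected rather than when things run) can be illustrated with a small sketch. This is not how Taverna's engine is implemented; `run_dataflow`, the toy services and the `gene_ids` input are hypothetical names chosen for the illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_dataflow(services, links, inputs):
    """Run every service once all of its upstream values exist.

    services: name -> function taking a list of upstream values
    links:    name -> list of upstream names (workflow inputs or services)
    inputs:   name -> value for each workflow input
    """
    values = dict(inputs)
    graph = {name: links.get(name, []) for name in services}
    # The execution order is derived from the data links, not written by hand.
    for name in TopologicalSorter(graph).static_order():
        if name in values:  # a workflow input that already has a value
            continue
        upstream = [values[source] for source in links.get(name, [])]
        values[name] = services[name](upstream)
    return values


# Hypothetical toy services: A splits a string, B and C both consume A's output.
services = {
    "A": lambda upstream: upstream[0].split(","),
    "B": lambda upstream: [gene.upper() for gene in upstream[0]],
    "C": lambda upstream: len(upstream[0]),
}
links = {"A": ["gene_ids"], "B": ["A"], "C": ["A"]}

result = run_dataflow(services, links, {"gene_ids": "brca1,tp53"})
print(result["B"], result["C"])
```

Only the links are declared; reordering the `services` dictionary does not change the result, which is the point of describing dataflow instead of process coordination.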
10. What types of services and data?
• WSDL/SOAP web services
  – Secured invocation with HTTPS/SSL/WS-Security
• RESTful web services
  – Secured invocation with HTTPS/Basic Auth
• Spreadsheet import
• Command line tools (local, SSH)
• Inline scripts (Beanshell, R)
• Excel/CSV spreadsheets
• Java APIs
• Customizations: BioMart, BioMoby / SADI, Soaplab
• Grid services (EGEE gLite, caGrid, PBS, UNICORE)
• … your tool (Plugin tutorial in wiki)
11. Service limitations
• Web service creation involves wrapping existing tools or writing WS code
• Web services can go down
  – Can use redundant services in workflow
  – Service monitoring
• Transferring data up/down to WS is slow
  – Support references in WS interface
• Executing command line tools directly requires execution access
  – Trickier to share workflows: they require either SSH/grid credentials or installing tools locally
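The "redundant services" remedy can be sketched as a retry-then-failover wrapper: try one service a few times, then move on to an equivalent one. This is a generic illustration, not Taverna's own retry or failover configuration; `call_with_failover`, `primary` and `mirror` are hypothetical names.

```python
import time


def call_with_failover(services, payload, retries=2, delay=0.0):
    """Try each equivalent service in turn, retrying each one before moving on."""
    failures = []
    for service in services:
        for attempt in range(1, retries + 2):  # first call plus `retries` retries
            try:
                return service(payload)
            except Exception as error:
                failures.append(f"{service.__name__} attempt {attempt}: {error}")
                time.sleep(delay)
    raise RuntimeError("all services failed: " + "; ".join(failures))


def primary(sequence):
    # Stand-in for a remote service that is currently down.
    raise ConnectionError("service is down")


def mirror(sequence):
    # Stand-in for a redundant service offering the same operation.
    return f"mirror analysed {len(sequence)} bases"


answer = call_with_failover([primary, mirror], "ACGT", retries=1)
print(answer)
```

The caller sees a single result and does not need to know which provider answered; only when every alternative has failed does the error surface.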
12. Which services?
• Taverna is general: it can connect to standard web services and command line tools for any domain
• In bioinformatics:
  – From professional third-party organisations providing robust & open data/analysis services
  – To under-the-desk web services for one particular purpose, run by PhD students
• http://biocatalogue.org/ - 2000+ services from 140+ providers, crowd sourced and quality monitored
14. BioCatalogue integration
• Search services from workbench
• Add services to workflow
• View service descriptions and uptime status from within workflow
15. Taverna workbench
• Graphical desktop tool
• No server installation required
• Drag-and-drop services into diagram
• Connect services, run, reconnect, rerun
• Integrates diverse set of tools
19. Sharing workflows
• myExperiment.org allows users to share, find, download and rate workflows
• “Facebook for the scientist”
• 4000+ members, 1400+ workflows
• Open source code, can set up own instance
20. myExperiment integration
• Search and browse workflows
  – By tags
  – Free text search
  – Own/group workflows
  – Packs, e.g. “Examples”
• Upload/share workflows
21. Taverna workflow features
• Nested workflows
  – Reuse existing components
• Implicit iterations
  – With customizable list handling
• Pipelining
  – Process partial iteration results early
• Parallelisation
  – Run as soon as data is available
• Retries, failover, looping
  – For stability and conditional testing
• Plugin-extensible execution control
  – Ideas: caching, error detection, dynamic service lookup
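Implicit iteration means a service written for single values is called once per element when it receives a list. The sketch below shows two ways of combining several input lists: every combination (a cross product) or element-wise pairing (a dot product). These two strategies are an assumption about what "customizable list handling" covers, and `iterate` is a hypothetical helper, not a Taverna API; the real engine also keeps nested list structure, which this flat version ignores.

```python
from itertools import product


def iterate(service, *port_lists, strategy="cross"):
    """Call a single-value service once per combination of list elements.

    strategy="cross": every combination of elements from the input lists
    strategy="dot":   elements paired by position (lists of equal length)
    """
    if strategy == "cross":
        combinations = product(*port_lists)
    elif strategy == "dot":
        combinations = zip(*port_lists)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [service(*arguments) for arguments in combinations]


def label(gene, species):
    # A service that only knows how to handle one gene and one species.
    return f"{species}:{gene}"


genes = ["a", "b"]
species = ["mouse", "rat"]
crossed = iterate(label, genes, species, strategy="cross")
paired = iterate(label, genes, species, strategy="dot")
print(crossed)
print(paired)
```

The service itself contains no loop; the iteration is decided entirely by the shape of the data arriving at its inputs.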
22. Extensible UI and engine
• Plugins can provide new “perspectives”
  – e.g.: BioCatalogue, myExperiment
• Provide service-specific customization
  – e.g.: BioMart interface replicates web site
• Adding new functionality
  – New service types, e.g.: …
  – Execution control like looping/branching
  – Design helpers, “Find matching service”
23. Workflow limitations
• Initially designed for dataflows
  – Not suitable for business processes like “HR procedure for hiring new staff”
  – Long-running workflows require Taverna Server
• .. But suitable for coordinating command line and grid executions
  – The data might just be job references
• Execution control extensible, e.g.:
  – Looping, Branching
  – Dynamic service lookup
  – Data manipulation, Error detection
24. Data and provenance handling
• Data references passed between services in workflow
  – http, file, sftp, gridftp, etc (extensible)
  – Data downloaded/uploaded or references translated when needed
• Provenance captured for workflow runs
  – Trace execution steps, view intermediate values while running
  – Export as Open Provenance Model (OPM) / RDF
  – Proof and origin of produced outputs
  – Extensible annotations
• Wf4Ever: reproducible research objects
  – Workflow/data as a scientific publication, preservation
  – Need to capture more service data and metadata
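Provenance capture (tracing execution steps and keeping intermediate values so that the origin of an output can be shown) can be sketched as a thin recording layer around each step. This does not reproduce Taverna's provenance store or the OPM/RDF export; `ProvenanceLog` and its methods are hypothetical.

```python
from datetime import datetime, timezone


class ProvenanceLog:
    """Record the inputs and output of every step that is run through it."""

    def __init__(self):
        self.records = []

    def run(self, step_name, function, **inputs):
        started = datetime.now(timezone.utc).isoformat()
        output = function(**inputs)
        self.records.append(
            {"step": step_name, "inputs": inputs, "output": output, "started": started}
        )
        return output

    def origin_of(self, value):
        """Return the names of the steps that produced the given value."""
        return [record["step"] for record in self.records if record["output"] == value]


log = ProvenanceLog()
genes = log.run("split", lambda text: text.split(","), text="brca1,tp53")
count = log.run("count", lambda items: len(items), items=genes)

# Intermediate values stay inspectable after (or during) the run.
for record in log.records:
    print(record["step"], record["inputs"], "->", record["output"])
```

Because each record holds both inputs and output, a result can be traced back step by step to the workflow inputs it came from.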
25. Data limitations
• Running Workbench limited by:
  – Local disk space for storing data
  – Network speeds for up/download
  – Firewall access
  – Execute workflow using Taverna Server, or command line remotely with ssh/job submission
• No standardized WS reference mechanism
  – Agree on mechanism within WS ‘family’ with shared disk (e.g. deconstruct local path from HTTP URI)
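The "deconstruct local path from HTTP URI" convention can be sketched as follows: services that share a disk agree that URIs under a known base correspond to files under a known mount point, so a reference can be opened locally instead of downloaded. The base URI, the mount point and the function name are all hypothetical.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlparse


def uri_to_shared_path(uri, base_uri, mount_point):
    """Map an HTTP URI under base_uri to a path under mount_point.

    Returns None when the URI lies outside the agreed base, in which case
    the caller has to fall back to an ordinary download.
    """
    parsed, base = urlparse(uri), urlparse(base_uri)
    if (parsed.scheme, parsed.netloc) != (base.scheme, base.netloc):
        return None
    try:
        relative = PurePosixPath(unquote(parsed.path)).relative_to(
            PurePosixPath(unquote(base.path))
        )
    except ValueError:  # path is not below the base path
        return None
    if ".." in relative.parts:  # refuse references that climb out of the share
        return None
    return PurePosixPath(mount_point) / relative


# Assumed convention for one service 'family' with a shared disk.
BASE_URI = "http://data.example.org/results/"
MOUNT_POINT = "/shared/results"

local = uri_to_shared_path(
    "http://data.example.org/results/run42/out.fasta", BASE_URI, MOUNT_POINT
)
foreign = uri_to_shared_path("http://other.example.org/x", BASE_URI, MOUNT_POINT)
print(local, foreign)
```

Services outside the family still receive a valid HTTP reference, so the shortcut stays an optimisation rather than a requirement.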
26. Parameter sweeps
• Implicit iterations with pipelining provide an intuitive way to set up parameter sweeps
• Advanced looping and extensible execution control allow iterative & recursive reductions/approximations
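A parameter sweep built from iteration plus pipelining can be sketched in plain Python: every combination of parameter values becomes one run, runs execute concurrently, and results are handed downstream as they become available in order. The `simulate` function and the parameter names are placeholders, and the thread pool stands in for whatever actually executes the runs.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def sweep(simulate, grid, workers=4):
    """Yield (parameters, result) for every combination of values in the grid."""
    names = list(grid)
    combinations = [dict(zip(names, values)) for values in product(*grid.values())]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() submits all runs at once and yields results in input order,
        # so early results can be processed while later runs are still going.
        results = pool.map(lambda parameters: simulate(**parameters), combinations)
        for parameters, result in zip(combinations, results):
            yield parameters, result


def simulate(temperature, pressure):
    # Placeholder model; a real sweep would call a service or tool here.
    return temperature * pressure


grid = {"temperature": [1, 2], "pressure": [10, 20]}
outcomes = list(sweep(simulate, grid))
best_parameters, best_value = max(outcomes, key=lambda outcome: outcome[1])
print(best_parameters, best_value)
```

An iterative reduction would wrap this in a loop that narrows the grid around `best_parameters` and sweeps again until the result stops improving.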
27. Taverna command line
• Executes from Windows/Linux/OSX shells
• Takes a predefined workflow with files as inputs and outputs
• Quick way to “productionize” a workflow
28. Taverna Server
• REST/SOAP interface to execute workflows
• Client libraries for Ruby and Java
• Two demonstration web interfaces
  – Ruby
  – Java Portlets
• Upcoming:
  – Security delegation
  – AWS image
29. Taverna portlet
• Example portlet interface
• Executes workflows using Taverna Server
31. Ruby web interface
• Example customized web interface
• Uses Ruby gem t2-server
32. Grids and clusters
• Taverna has been integrated with several leading grid and middleware infrastructures, such as:
  – PBS
  – caGrid/Globus
  – EGEE/gLite
  – NorduGrid’s ARC
  – JSDL/GridSAM
• Plans for SAGA integration
33. Taverna on the cloud
• Use-case: SNP analysis and annotation of genomes sequenced from breeds of cows in Africa – why are some of them resistant to X?
• Amazon EC2 with Taverna Server and local services
• Ruby on Rails web interface
• Runs through 31 chromosomes in 2 hours using 10 instances - $10
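The figures on this slide allow a quick back-of-the-envelope cost breakdown. The sketch assumes that "$10" is the total bill for the whole run and that all 10 instances ran for the full 2 hours; neither is stated explicitly.

```python
# Figures from the slide.
instances = 10
hours = 2
chromosomes = 31
total_cost_usd = 10.0  # assumption: "$10" is the total cost of the run

# Derived quantities.
instance_hours = instances * hours
cost_per_instance_hour = total_cost_usd / instance_hours
cost_per_chromosome = total_cost_usd / chromosomes

print(f"{instance_hours} instance-hours")
print(f"${cost_per_instance_hour:.2f} per instance-hour")
print(f"${cost_per_chromosome:.2f} per chromosome")
```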
35. Taverna 3 roadmap
• OSGi plugin system
• Workflow language: Scufl2
  – Compound format; embedding metadata, dependencies, independent API for creating/inspecting workflows
• Components
  – Finding/sharing command line tool descriptions
  – Richer way of finding compatible services
36. Open source, open development
• The Taverna suite of tools is all open source, free to use and customize
• Large user community, active mailing lists
• Lead developers: myGrid in Manchester, UK
• Contributors from across the world
• PAL programme
• myGrid provides training, tutorials and documentation