Internal seminar @Newcastle University, Feb 2011

Programming scientific data pipelines with
the Taverna workflow management system
Dr. Paolo Missier
School of Computing Science
Newcastle University, UK

Newcastle, Feb. 2011

With thanks to the myGrid team in
Manchester for contributing their
material and time

Outline
Objective:
to provide a practical introduction to workflow
systems for scientific applications

• Workflows in context: lifecycle and the workflow eco-system
• Workflows for data integration
• The user experience: from services and scripts to workflows
• Extensibility I: importing services
– using R scripts
• Extensibility II: plugins
• The Taverna computation model: a closer look
– functional model, dataflow parallelism
• Performance

2

Workflows in science
High level programming models for scientific applications

• Specification of service / components
execution orchestration

• Handles cross cutting concerns like
error handling, service invocation,
data movement, data streaming,
provenance tracking…..

• A workflow is a specification
configured for each run

3

What are Workflows used for?
EarthSciences Life Sciences

4

Taverna
• First released 2004
• Current version Taverna 2.2
• Currently 1500+ users per month, 350+ organizations, ~40
countries, 80,000+ downloads across versions

• Freely available, open source LGPL
• Windows, Mac OS, and Linux

• http://www.taverna.org.uk
• User and developer workshops
• Documentation
• Public Mailing list and direct email support

http://www.taverna.org.uk/introduction/taverna-in-use/

5

Who else is in this space?
Trident

Triana
VisTrails
Kepler

Taverna
Pegasus (ISI)

6

Example: the BioAID workflow
Purpose:
The workflow extracts protein names from documents retrieved from
MedLine based on a user Query (cf Apache Lucene syntax).
The protein names are filtered by checking if there exists a valid UniProt
ID for the given protein name.

Credits:
- Marco Roos (workflow),
- text mining services by Sophia Katrenko and Edgar Meij (AID), and
Martijn Schuemie (BioSemantics, Erasmus University Rotterdam).

Available from myExperiment:
http://www.myexperiment.org/workflows/154.html

7

The workflows eco-system in myGrid
A process-centric science lifecycle


Service discovery
and import


Service discovery
and import Data
Metadata Methods
- inputs
- provenance - the workflow
- parameters
- annotations
- results

Workflow as data integrator

QTL
genomic
regions

genes
in QTL

metabolic
pathways
(KEGG)

Taverna computational model (very briefly)
List-structured
KEGG gene ids: geneIDs pathways
[ [ mmu:26416 ], •
•
• •
[ mmu:328788 ] ] •
•
• •
• •

geneIDs pathways
•
•
• •
•
•
• •
• •

[ path:mmu04010 MAPK signaling,
path:mmu04370 VEGF signaling ]

[ [ path:mmu04210 Apoptosis, path:mmu04010 MAPK signaling, ...],
[ path:mmu04010 MAPK signaling , path:mmu04620 Toll-like receptor, ...] ]

Taverna computational model (very briefly)
List-structured
KEGG gene ids:

[ [ mmu:26416 ],
[ mmu:328788 ] ]

• Collection processing
• Simple type system
• no record / tuple structure
• data driven computation
• with optional processor synchronisation
• parallel processor activation
• greedy (no scheduler)



From services and scripts to workflows
• the BioAID workflow again:
– http://www.myexperiment.org/workflows/154.html
• overall composition:
• 15 beanshell and other local scripts
– mostly for data formatting
• 4 WSDL-based service operations:

operation service

getUniprotID synsetServer

queryToArray tokenize

apply applyCRFService

search SearcherWSService

11

Service composition requires adapters
Example: SBML model optimisation workflow -- designed by Peter Li
http://www.myexperiment.org/workflows/1201

Service composition requires adapters
Example: SBML model optimisation workflow -- designed by Peter Li
http://www.myexperiment.org/workflows/1201

String[] lines = inStr.split("n");
StringBuffer sb = new StringBuffer();
for(i = 1; i < lines.length -1; i++)
{
String str = lines[i];
str = str.replaceAll("<result>", "");
str = str.replaceAll("</result>", "");
sb.append(str.trim() + "n");
}

String outStr = sb.toString();

Url -> content (built-in shell script)

import java.util.regex.Pattern;
import java.util.regex.Matcher;

sb = new StringBuffer();
p = "CHEBI:[0-9]+";

Pattern pattern = Pattern.compile(p);
Matcher matcher = pattern.matcher(sbrml);
while (matcher.find())
{
sb.append("urn:miriam:obo.chebi:" + matcher.group() + ",");
}
String out = sb.toString();
//Clean up
if(out.endsWith(","))
out = out.substring(0, out.length()-1);

chebiIds = out.split(",");

Building workflows from existing services
• Large collection of available services
– default but extensible palette of services in the workbench
– mostly third party
– All the major providers: NCBI, DDBJ, EBI …

A plethora of providers:

For an example of how
to build a simple
workflow, please follow
Exercise 3 from this
tutorial

13

Incorporating R scripts into Taverna
Requirements for using R in a local installation:
- install R from main archive site:
http://cran.r-project.org/
- install Rserve:
http://www.rforge.net/Rserve/
- start Rserve locally:
- start the R console and type the commands:
library(Rserve)
Rserve(args="--no-save")

Taverna can display graphical output from R

The following R script simply produces a png image that is
displayed on the Taverna output:
png(g);
plot(rnorm(1:100));
dev.off();

To use it, create an R Taverna workflow with output port g
- of type png image
See also: http://www.mygrid.org.uk/usermanual1.7/rshell_processor.html
14

Integration between Taverna and eScience Central
• An example of integration between
– Taverna workflows (desktop)
– the eScience Central cloud environment
• Facilitated by Taverna’s plugin architecture

• See http://www.cs.man.ac.uk/~pmissier/T-eSC-integration.svg

15

Plugin: Excel spreadsheets as workflow input
• Third-party plugin code can later be bundled in a distribution
• Ex.: importing input data from a spreadsheet
– see: http://www.myexperiment.org/workflows/1417.html
– and example input spreadsheet: http://www.myexperiment.org/files/410.html

16

Taverna Model of Computation: a closer look
• Arcs between two ports define data dependencies
– processors with inputs on all their (connected) ports are ready
– no active scheduling: admission control is simply by the size of threads pool
– processors fire as soon as they are ready and there are available threads in
the pool
• No control structures
– no explicit branching or loop constructs
• but dependencies between processors can be added:
– end(P1) ➔ begin(P2)

coordination link semantics:
“fetch_annotations can only start after
ImprintOutputAnnotator has completed”

Typical pattern:
writer ➔ reader
(eg to external DB)

17

List processing model
• Consider the gene-enzymes workflow from the previous demo:

Values can be either atomic or (nested) lists
- values are of simple types (string, number,...)
- but also mime types for images (see R example above)

What happens if the input to our workflow is a list of gene IDs?
geneID = [ mmu:26416, mmu:19094 ]

we need to declare the input geneID to be of depth 1
- depth n in general, for a generic n-deep list
18

Implicit iteration over lists
Demo:
– reload KEGG-genes-enzymes-atomicInput.t2flow
– declare input geneID to be of depth 1
– input two genes, run

– Each processor is activated once for each element in the list
– this is because each is designed to accept an atomic value
– the result is a nested list of results, one for each gene in the input list

19

Functional model for collection processing /1
Simple processing:
service expects atomic values,
receives atomic values

v1 v2 v3

X1 X2 X3

P
Y1 Y2

w1 w2

20

Simple processing: Simple iteration:
service expects atomic values, service expects atomic values,
receives atomic values receives input list

v = [v1 ... vn]
v1 v2 v3 v = [v1 ... vn] v1 vn

X1 X2 X3 X X X

Y1
P
Y2 Y
P
➠ P1

Y
... Pn

Y

w1 w2 w = [w1 ... wn] w1 wn

w = [w1 ... wn]

20


v = [v1 ... vn]
v1 v2 v3 v = [v1 ... vn] v1 vn

X1 ad = 1
X2 X3 X X X

Y1
P
Y2
dd = 0

Y
P
➠ P1

Y
... Pn

Y
δ=1
w1 w2 w = [w1 ... wn] w1 wn

w = [w1 ... wn]

20


v = [v1 ... vn]
v1 v2 v3 v = [v1 ... vn] v1 vn

X1 ad = 1
X2 X3 X X X

Y1
P
Y2
dd = 0 P

Y
➠ P1

Y
... Pn

Y
δ=1
w1 w2 w = [w1 ... wn] w1 wn

w = [w1 ... wn]
v = [[...], ...[...]]
Extension:
service expects atomic ad =2
X
values,
receives input nested list dd = 0 P

Y
δ=2
20
w = [[..] ...[...]]

Functional model /2

The simple iteration model v = [[...], ...[...]]
generalises by induction to a
ad =n
generic δ=n-m X

dd = m P

Y
δ = n-m ≥ 0
w = [[..] ...[...]] - depth = n-m

21

Functional model /2

The simple iteration model v = [[...], ...[...]]
generalises by induction to a
ad =n
generic δ=n-m X

dd = m P

Y
δ = n-m ≥ 0
w = [[..] ...[...]] - depth = n-m
This leads to a recursive functional
formulation for simple collection
processing:

v = a1 . . . an
(P v) if l = 0
(evall P v) =
(map (evall − 1 P ) v) if l > 0

21

Functional model - multiple inputs /3
v2 = [v21 ... v2k]

v1 = [v11 ... vin] X1 X2 X3 v3 = [v31 ... v3m]

P
dd(X1) = 0, ad(v1) = 1 ➠ δ1 = 1
Y
dd(X2) = 1, ad(v2) = 1 ➠ δ2 = 0

dd(X3) = 0, ad(v3) = 1 ➠ δ3 = 1
w = [ [w11 ... w1n],
...
[wm1 ...wmn] ]

22

Functional model - multiple inputs /3
v2 = [v21 ... v2k]

v1 = [v11 ... vin] X1 X2 X3 v3 = [v31 ... v3m]

P
dd(X1) = 0, ad(v1) = 1 ➠ δ1 = 1
Y
dd(X2) = 1, ad(v2) = 1 ➠ δ2 = 0

dd(X3) = 0, ad(v3) = 1 ➠ δ3 = 1
w = [ [w11 ... w1n],
...
[wm1 ...wmn] ]

Cross-product involving v1 and v3 (but not v2):
v1 ⊗ v3 = [ [ <v1i, v3j> | j:1..m ] | i:1..n ] // cross product
and including v2: [ [ <v1i, v2, v3j> | j:1..m ] | i:1..n ]

22

Generalised cross product
Binary product, δ = 1: a × b = [[ ai , bj ]|bj ← b]|ai ← a]
(eval2 P a, b ) = (map (eval1 P ) a × b)

23


Generalized to arbitrary depths:

[[(vi , wj )|wj ← w]|vi ← v]
 if d1 > 0, d2 >0

[(v , w)|v ← v]
i i if d1 > 0, d2 =0
(v, d1 ) ⊗ (w, d2 ) =
[(v, wj )|wj ← w]
 if d1 = 0, d2 >0


(v, w) if d1 = 0, d2 =0
...and to n operands: ⊗i:1...n (vi , di )

23


Generalized to arbitrary depths:

[[(vi , wj )|wj ← w]|vi ← v]
 if d1 > 0, d2 >0

[(v , w)|v ← v]
i i if d1 > 0, d2 =0
(v, d1 ) ⊗ (w, d2 ) =
[(v, wj )|wj ← w]
 if d1 = 0, d2 >0


(v, w) if d1 = 0, d2 =0
...and to n operands: ⊗i:1...n (vi , di )

Finally: general functional semantics for collection-based processing

(evall P (v1 , d1 ), . . . , (vn , dn ) )
(P v1 , . . . , vn ) if l = 0
=
(map (evall−1 P ) ⊗i:1...n vi , di ) if l > 0
23

Parallelism in the dataflow model
• The data-driven model with implicit iterations provides
opportunities for parallel processing of workflows
two types of parallelism:
• intra-processor: implicit iteration over list data
• inter-processor: pipelining

[ id1, id2, id3, ...]

SFH1 SFH2 SFH3 implicit assumption
... ... ... of independence
amongst the threads
that operate on
elements of a list
getDS1 getDS2 getDS3

[ DS1, DS2, DS3, ...]

24

Exploiting latent parallelism

[ a, b, c,...]

[ (echo_1 a), (echo_1 b), (echo_1 c)]

(echo_2 (echo_1 a))
(echo_2 (echo_1 b))
(echo_2 (echo_1 c))

See also:
http://www.myexperiment.org/workflows/1372.html

25

Performance - experimental setup
• previous version of Taverna engine used as baseline
• objective: to measure incremental improvement

list generator

Parameters:
multiple parallel pipelines

- byte size of list elements (strings)
- size of input list
- length of linear chain

main insight: when the workflow is designed for
pipelining, parallelism is exploited effectively

26

Performance study: Experimental setup - I
• Programmatically generated dataflows

– the “T-towers”

parameters:
- size of the lists involved
- length of the paths
- includes one cross product

27

caGrid workflow for performance analysis
Goal: perform cancer diagnosis using microarray analysis
- learn a model for lymphoma type prediction based on
samples from different lymphoma types

source: caGrid

28 http://www/myexperiment.org/workflows/746


lymphoma samples
source: caGrid ➔ hybridization data



lymphoma samples

process microarray
data as training dataset



lymphoma samples

process microarray
data as training dataset
learn
predictive model


Results I - Memory usage

shorter execution
time due to pipelining
T2 main memory
data management

T2 embedded Derby back-end

T1 baseline

list size: 1,000 strings of 10K chars each
no intra-processor parallelism (1 thread/processor)

29

Results II - Available processors pool

pipelining in T2 makes up for smaller pools of threads/processor

30

Results III - Bounded main memory usage
Separation of data and process spaces ensures scalable data management

varying data element size:
10K, 25K, 100K chars

31

Ongoing effort: Taverna on the Cloud
• Early experiments on running multiple instances of Taverna
workflows in a cloud environment
• Coarse-grained cloud deployment: workflow-at-a-time
– data partitioning ➔ each partition is allocated to a workflow instance

For more details, please see: Paul Fisher, ECCB talk slides, October, 2010

32

Summary
• Workflows: high-level programming paradigm
• Bridges the gap between scientists and developers
• Many workflow models available (commercial/open source)
• Taverna implements a dataflow model
– has proven useful for a broad variety of scientific applications

Strengths:
• Rapid prototyping given a base of third-party or own services
• Explicit modelling of data integration processes
• Extensibility:
– for workflow designers: easy to import third-party services (SOAP, REST)
– accepts scripts in a variety of languages
– for developers: easy to add functionality using a plugin model
• Good potential for parallelisation
• Early experiments on cloud deployment: workflow-at-a-time
– ongoing study for finer-grain deployment of portions of the workflow
33
Back to the start

ADDITIONAL MATERIAL
• Provenance of workflow data
• Provenance and Trust of Web data

34

Example workflow (Taverna)

chr: 17 QTL →
start: 28500000
end: 3000000 Ensembl Genes

Ensembl Gene → Ensembl Gene →
Uniprot Gene Entrez Gene

Uniprot Gene → Entrez Gene →
Kegg Gene Kegg Gene

merge gene IDs

Gene → Pathway path:mmu04210 Apoptosis,
path:mmu04010 MAPK, ...

Baseline provenance of a workflow run
QTL → mmu:12575
Ensembl Genes
v1 ... vn w

Ensembl Gene → Ensembl Gene →
Uniprot Gene Entrez Gene
path:mmu04012
exec
Uniprot Gene → Entrez Gene → a1 ... an b1 ... bm
Kegg Gene Kegg Gene
mmu:26416

merge gene IDs

path:mmu04010 y11 ymn
Gene → Pathway
...

path:mmu04010→derives_from→mmu:26416
path:mmu04012→derives_from→mmu:12575

• The graph encodes all direct data dependency relations
• Baseline query model: compute paths amongst sets of nodes
• Transitive closure over data dependency relations
36

Motivation for fine-grained provenance
List-structured
KEGG gene ids:

[ [ mmu:26416 ],
[ mmu:328788 ] ]



Motivation for fine-grained provenance
List-structured
KEGG gene ids:

[ [ mmu:26416 ],
[ mmu:328788 ] ]

geneIDs pathways
•
•
• •
•
•
• •
• •



Efficient query processing: main result
Workflow graph Provenance graph

X X v1 ... vn w

Q R
Y Y

a1 ... an b1 ... bm

X1 X2 X3

P
y11 ... ymn
Y

y = [ [y11 ... y1n],
...
[ym1 ... ymn] ]

• Query the provenance of individual collections elements
• But, avoid computing transitive closures on the provenance graph
• Use workflow graph as index instead
• Exploit workflow model semantics to statically predict dependencies on individual tree
elements
• This results in substantial performance improvement for typical queries

Efficient query processing: main result
Workflow graph Provenance graph
[1] []

X X v1 ... vn w

Q R
Y Y

a1 ... an b1 ... bm
[]

X1 X2 X3 [n]
[1]
P
y11 ... ymn
Y

y = [ [y11 ... y1n],
...
[ym1 ... ymn] ]

• Query the provenance of individual collections elements
• But, avoid computing transitive closures on the provenance graph
• Use workflow graph as index instead
• Exploit workflow model semantics to statically predict dependencies on individual tree
elements
• This results in substantial performance improvement for typical queries

Trust and provenance for Web data
• Testimonials: http://www.w3.org/2005/Incubator/prov/
– "At the toolbar (menu, whatever) associated with a document there is a button marked
"Oh, yeah?". You press it when you lose that feeling of trust. " - Tim Berners-Lee, Web
Design Issues, September 1997
– Provenance is the number one issue we face when publishing government data as
linked data for data.gov.uk" - John Sheridan, UK National Archives, data.gov.uk,
February 2010

how exactly is
provenance-based quality
checking going to work?

Upcoming W3C Working Group
on Provenance for Web data

- a European initiative: chaired
by Luc Moreau (Southampton),
Paul Groth (NL)

39

Provenance graphs and belief networks
Intuition:
As news propagate, so do trust and quality judgments about them
• Is there a principled way to model this?
• Idea: explore conceptual similarities between provenance graphs and
belief networks (i.e. Bayesian networks)

Standard Bayesian network example:

40

From process graph to provenance graph

dx

A1 C1
P1 d1 S1 d1’ P2 d2 S2 d2’

data production data publishing

41

Quality Control points
dx

A1 C1
P1 d1 S1 d1’ P2 d2 S2 d2’


41

Quality Control points
dx

A1 C1
P1 d1 S1 d1’ P2 d2 S2 d2’


provenance
“used” graph for dx
P(dx)
“was generated by”
d2’ S2 d2 P2

d1’ S1 d1 P1

“published”

“was published by”
C1 A1
41
curator
author

From provenance graph to belief network
provenance
“used” graph for dx
P(dx)
“was generated by”
d2’ S2 d2 P2

d1’ S1 d1 P1

“published”

“was published by”
C1 A1
curator
author

CPT
QCP CPT
CPT
P1 A1

d1 S1 C1

- assume judgments are available at QCPs
Pdx A2 d1’ - Where do the remaining conditional probabilities
come from?

d2 S2
- can judgments be
42
27 propagated here?
d2’

Internal seminar @Newcastle University, Feb 2011

Recommandé

Recommandé

Contenu connexe

Similaire à Internal seminar @Newcastle University, Feb 2011

Similaire à Internal seminar @Newcastle University, Feb 2011 (20)

Plus de Paolo Missier

Plus de Paolo Missier (20)

Dernier

Dernier (20)

Internal seminar @Newcastle University, Feb 2011

Notes de l'éditeur