12. What do Scientists use Taverna for? ESIP meeting, Santa Barbara, CA, July 2009 - P. Missier
Application areas: Systems biology model building; Proteomics; Sequence analysis; Protein structure prediction; Gene/protein annotation; Microarray data analysis; QTL studies; QSAR studies; Medical image analysis; Public Health care epidemiology; Heart model simulations; High throughput screening; Phenotypical studies; Phylogeny; Statistical analysis; Text mining; Astronomy, Music, Meteorology.
Users and projects: Netherlands Bioinformatics Centre; Genome Canada Bioinformatics Platform; BioMOBY; US FLOSS social science program; RENCI; SysMO Consortium; French SIGENAE farm animals project; ThaiGrid; CARMEN Neuroscience project; SPINE consortium; EU Enfin, EMBRACE, BioSapiens, Casimir; EU SysMO Consortium; NERC Centre for Ecology and Hydrology; Bergen Centre for Computational Biology; Max-Planck Institute for Plant Breeding Research; Genoa Cancer Research Centre; AstroGrid; 30 USA academic and research institutions.
13. 200 Genotype Phenotype Metabolic pathways Literature [Paul Fisher]
14.
15.
16. WaaS: Workflows as a Service [Pettifer, Kell, University of Manchester] inside
17. Workflows operating over Grid Infrastructure
KnowARC (http://www.knowarc.eu): "KnowARC integrated with Taverna", an application prototype to use Taverna as a direct interface to Grid resources running ARC.
caGrid (http://cagrid.org/): open source grid software infrastructure aimed at enabling multi-institutional data sharing and analysis. Underpins caBIG. Taverna links together caGrid resources.
EGEE (http://www.eu-egee.org/): Europe’s leading grid computing project. Piloted Taverna over EGEE gLite services.
18.
19. caBIG cancer cyberinfrastructure uses Taverna to link services
A sample caGrid workflow for microarray analysis, using caArray, GenePattern and geWorkbench [Ravi Madduri]
Wei Tan, Ravi Madduri, Kiran Keshav, Baris E. Suzek, Scott Oster, Ian Foster. "Orchestrating caGrid Services in Taverna". Proc. IEEE Intl Conf on Web Services (ICWS 2008).
20. Who else is in this space? Kepler, Triana, BPEL, Ptolemy II, Taverna, Trident, BioExtract
21.
22. Workflow-based experimentation lifecycle: Run, Analyse Results, Publish, Develop. Collect and query provenance metadata
23. Taverna. Graphical workbench for professionals. Plug-in architecture. Nested workflows. Drag and drop. Wiring together. Rapidly incorporate new services without coding. Not restricted to predetermined services. Access to local and remote resources and analysis tools. 3500+ service operations available at start-up
24.
25.
26.
27.
28.
29.
30. Curation Model Versioning Quantitative Content Tags Service Model Semantic Content Ontologies Functional Capabilities Provenance Operational Capabilities Operational Metrics Usage Policy Community Standing Ratings Usage Statistics Attribution Free text Searching Statistics Usable and Useful Understandable Controlled vocabs Interfaces
34. Scaling up along the social dimension. What? Where? Why? Who? How? Crossing the boundaries of individual investigation. Develop Run Analyze Publish Develop Run Analyze Publish
35. Scientific collaboration. Source: Andrea Wiggins, talk given at the School of Computer Science, University of Manchester, UK, June 18th, 2009
36. Source: Andrea Wiggins, talk given at the School of Computer Science, University of Manchester, UK, June 18th, 2009
37. Source: Andrea Wiggins, talk given at the School of Computer Science, University of Manchester, UK, June 18th, 2009
38.
39.
40.
41. Publishing for collaboration. Run, Analyse Results, Publish, Develop. Collect and query provenance metadata. Design-time reuse: composition from existing workflows. Runtime reuse: workflows as services. Compare results across versions. Foster virtual scientific communities. Provenance exchange and interoperability: the OPM experiment
42. Collaboration in the workflow space. What? Where? Why? Who? How? Develop Run Analyze Publish Develop Run Analyze Publish
55. Workflows and Services Curation by Experts Social Curation by the Crowd refine validate refine validate Self-Curation by Contributors seed seed refine validate seed refine validate seed Automated Curation
60. Collaboration in the workflow space. What? Where? Why? Who? How? Develop Run Analyze Publish Develop Run Analyze Publish
72. Provenance interoperability for open science. OPM. Develop Run Analyze Publish Develop Run Analyze Publish. OPM: the Open Provenance Model
73. The Science Lifecycle scientists Graduate Students Undergraduate Students experimentation Data, Metadata, Provenance, Scripts, Workflows, Services, Ontologies, Blogs, ... Digital Libraries Next Generation Researchers Adapted from David De Roure’s slides Local Web Repositories Virtual Learning Environment Technical Reports Reprints Peer-Reviewed Journal & Conference Papers Preprints & Metadata Certified Experimental Results & Analyses
74. Finding the Provenance of research outputs across all the systems data transited through scientists Local Web Repositories Graduate Students Undergraduate Students Virtual Learning Environment Technical Reports Reprints Peer-Reviewed Journal & Conference Papers Preprints & Metadata Certified Experimental Results & Analyses experimentation Data, Metadata, Provenance, Scripts, Workflows, Services, Ontologies, Blogs, ... Digital Libraries Next Generation Researchers
75. Provenance Across Applications. Local provenance stores. Adapted from Luc Moreau’s slides: “The Open Provenance Model” (Univ. of Southampton, UK), 2009. Application Application Application Application Application. Provenance Inter-Operability Layer: import from OPM, export to OPM
81. Upcoming events
SWPM 2009: The First International Workshop on the Role of Semantic Web in Provenance Management, http://wiki.knoesis.org/index.php/SWPM-2009. Co-located with ISWC'09, October 25/26 2009, Washington D.C., USA. Submission deadline: Friday, July 31, 2009.
Special issue of Future Generation Computer Systems Journal (FGCS) on the third provenance challenge (to be announced). Expected deadline: Dec. 2009.
82.
83.
Editor's notes
Repetitive and mundane boring stuff made easier, reliable and adaptable. Big science and collaborative science
Interoperability, Integration and Collaboration Automated processing Interactive Repetitive and accurate compound processes (protocols) Transparent processes Data flow Trackable results Agile software development
This workflow searches for genes which reside in a QTL (Quantitative Trait Loci) region in the mouse, Mus musculus. The workflow requires as input: a chromosome name or number; a QTL start base pair position; a QTL end base pair position. Data is then extracted from BioMart to annotate each of the genes found in this region. The Entrez and UniProt identifiers are then sent to KEGG to obtain KEGG gene identifiers. The KEGG gene identifiers are then used to search for pathways in the KEGG pathway database. This is pathways_and_gene_annotations_for_qtl_phenotype_28303, executed with chromosome = 17, start_position = 28500000, end_position = 32500000.
Mention scalability: we support large datasets as well, in addition to these small lists, and lists can grow very large.
The case study of Paul’s work. Different datatypes, held in geographically different places, from different subdisciplines. Phenotypic response investigated using microarrays, in the form of expressed genes, or evidence provided through QTL mapping. Genes captured in the microarray experiment and present in the QTL (Quantitative Trait Loci) region. Microarray + QTL.
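The data flow in this note (region filter, then identifier mapping, then pathway lookup) can be sketched roughly as follows. This is only an illustration: the gene records and the two lookup tables are hypothetical in-memory stand-ins for the BioMart and KEGG services, and none of the names come from the actual Taverna workflow.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    gene_id: str      # hypothetical identifier, not a real Entrez/UniProt ID
    chromosome: str
    start: int        # base pair position
    end: int          # base pair position


# Hypothetical stand-in for the gene annotations a BioMart query would return.
GENES = [
    Gene("geneA", "17", 28_600_000, 28_650_000),
    Gene("geneB", "17", 33_000_000, 33_020_000),  # outside the QTL region
    Gene("geneC", "4", 29_000_000, 29_010_000),   # different chromosome
    Gene("geneD", "17", 32_400_000, 32_450_000),
]

# Hypothetical stand-ins for the two KEGG lookups.
GENE_TO_KEGG = {"geneA": "kegg:A", "geneD": "kegg:D"}
KEGG_TO_PATHWAYS = {"kegg:A": ["pathway1", "pathway2"], "kegg:D": ["pathway2"]}


def genes_in_qtl(genes, chromosome, start_position, end_position):
    """Keep the genes that lie entirely inside the QTL region."""
    return [
        g for g in genes
        if g.chromosome == str(chromosome)
        and g.start >= start_position
        and g.end <= end_position
    ]


def pathways_for_qtl(chromosome, start_position, end_position):
    """Map each gene in the region to its pathways: {gene_id: [pathway, ...]}."""
    result = {}
    for gene in genes_in_qtl(GENES, chromosome, start_position, end_position):
        kegg_id = GENE_TO_KEGG.get(gene.gene_id)
        if kegg_id is None:
            continue  # no KEGG gene identifier known for this gene
        result[gene.gene_id] = KEGG_TO_PATHWAYS.get(kegg_id, [])
    return result


if __name__ == "__main__":
    # Same input values as the run quoted in the note.
    print(pathways_for_qtl(17, 28_500_000, 32_500_000))
```

In the real workflow each step is a remote service call wired together in Taverna; the sketch only mirrors the order in which the data is passed along.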
The workflow produces OMIM tagged diseases which can be used to enrich the proto-ontology automatically in RDF
- caGrid (or the Grid) is the underlying network architecture and platform that provides the basis for connectivity of caBIG® tools.
- ARC (Advanced Resource Connector) plugin for Taverna, used for medical imaging. Taverna is a workflow management system well known in bioinformatics. To show the ease of transitioning from local to grid, we execute MUSTANG first on the command line, then on the grid using Taverna. We then present two examples from the field of medical imaging, where the first one has to deal with huge temporary datasets. It thus greatly benefits from ARC's storage management and grid URL handling capabilities. The last example shows how one can achieve rapid testing iterations by separating the program binary from its use case description and the workflow. In the end, myExperiment is presented, which is a free web community for sharing Taverna workflows. By preparing a use case and an example workflow one can make one's own program easily usable by everyone, since the ARC middleware equalizes different grid configurations. Video.
Aimed at different layers of the software stack. “The Many Faces of IT as Service”, Foster, Tuecke, 2005. “Provisioning”: reservation to configuration to … make sure the resource will do what I want it to do, with the right qualities of service. Virtualization = separation of concerns between provider and consumer of “content” (client and service; service provider and resource provider). Provisioning = assemble and configure resources to meet user needs. Management = sustain desired qualities of service despite a dynamic environment.
The GT4 plug-in supports semantic-based service queries. Users can enter multiple service query criteria (up to three; in the current scenario three is enough, and more can be added to the program upon request) together with the corresponding values, and multiple criteria can be combined. The initial GUI shows only one query criterion, but more can be added by clicking the “Add Service Query” button. For example, we can query for the caGrid services whose “Research Center” name is “Ohio State University”, whose Service Name is “DICOMDataService”, and which have the operation “PullOp”.
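Combining several query criteria amounts to a conjunction of field matches over the service metadata. A minimal sketch, assuming a made-up list of service descriptions and a hypothetical `query_services` helper (this is not the plug-in's API):

```python
MAX_CRITERIA = 3  # the note says up to three criteria are supported for now

# Hypothetical service descriptions; only the first mirrors the note's example.
SERVICES = [
    {
        "Research Center": "Ohio State University",
        "Service Name": "DICOMDataService",
        "Operation": ["PullOp", "QueryOp"],
    },
    {
        "Research Center": "Ohio State University",
        "Service Name": "OtherService",
        "Operation": ["QueryOp"],
    },
    {
        "Research Center": "Another Center",
        "Service Name": "DICOMDataService",
        "Operation": ["PullOp"],
    },
]


def _matches(service, key, value):
    """A criterion matches if the field equals the value or, for lists, contains it."""
    field = service.get(key)
    if isinstance(field, list):
        return value in field
    return field == value


def query_services(services, criteria):
    """Return the services that satisfy every criterion (logical AND)."""
    if len(criteria) > MAX_CRITERIA:
        raise ValueError(f"at most {MAX_CRITERIA} criteria are supported")
    return [
        s for s in services
        if all(_matches(s, key, value) for key, value in criteria.items())
    ]


if __name__ == "__main__":
    hits = query_services(SERVICES, {
        "Research Center": "Ohio State University",
        "Service Name": "DICOMDataService",
        "Operation": "PullOp",
    })
    print([s["Service Name"] for s in hits])
```

Each added criterion can only narrow the result set, which matches the GUI behaviour of starting with one criterion and adding more.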
we are now taking a broader view... what we have shown so far is just the “run” part of a more comprehensive lifecycle
Beanshell scripting and XML processing support inside the workflows. Taverna 2: long-running workflows, data reference handling, data streaming and staging, multiple extensibility points. Complete the Taverna 2 properties: new data reference handling, security management, provenance management, asynchronous processor and data streaming, explicit monitoring and steering support, a new dispatch layer that better supports dynamic service binding and service invocation through a resource broker, improved concurrency handling at the workflow level.
The clustalw program from Emboss is called ‘emma’. Services are not deposited and preserved in software libraries. Rapid metadata heart-beat, especially on operational metadata. BioNanny: using Grid tools. Use myExperiment to notify scientists of potential problems. Use myExperiment to be smart about which services should be monitored. Workflows are deposited but… not self-contained: they link to external services in flux, depend on software, or incorporate services unavailable to others. Workflow fragility and hence decay. Workflows become plans and provenance rather than working scientific objects unless tended and updated.
For DEMO: search "adaptivedisclosure". Key point: large-scale curation of rich service descriptions through community engagement, sustained over time, to ensure the quality of annotations. What makes BioCatalogue stand apart from other approaches to service repositories? BioCatalogue is a "super-registry" that is able to accommodate service descriptions from multiple different source registries thanks to a flexible annotation model. Context of application: distributed, P2P BioCatalogue.
SeekDA is from DERI: a search engine for Web services. http://www.ebi.ac.uk/uniprot-das/
Curation results in trust and therefore usage, which encourages / justifies sustained further curation efforts. Curation really happens in all the associated tools that reference and use BioCatalogue, e.g. curation of Taverna services, of myExperiment workflows, etc. Accreditation of curators: an idea borrowed from the Wikipedia style of community contribution. Not all contributors are equal, but all are entitled to provide contributions.
How are the types of annotations chosen? Contributors are free to add pretty much any type of annotation at the moment, using a simple tagging mechanism, i.e., annotations are not necessarily locked into controlled vocabularies / ontologies. Think of them as name-value pairs for the time being, but already exposing metadata as RDF through the API. Users can rate services. The main rating criteria are expected to be ease of use, availability and quality of documentation. Separate subjective from objective. Additionally, there is room for automated curation, using Quasar for example: service monitoring and associated automated annotations; functional testing of services. Let users / providers upload complex test scripts, including full-fledged workflows, for example. BioCatalogue can run these periodically and annotate the services with the outcome (think "junit test reports" as automated annotations). Service Profile Wheel. Availability. Freshness.
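The idea of free-form name-value annotations, with test outcomes recorded as automated annotations, could look something like the sketch below. The classes and the `run_test_script` helper are hypothetical illustrations and do not reflect the BioCatalogue data model or API.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Annotation:
    name: str
    value: str
    source: str  # "user" for contributed annotations, "automated" for test outcomes


@dataclass
class ServiceEntry:
    name: str
    annotations: List[Annotation] = field(default_factory=list)

    def annotate(self, name: str, value: str, source: str = "user") -> None:
        """Attach a free-form name-value pair; no controlled vocabulary is enforced."""
        self.annotations.append(Annotation(name, value, source))

    def values(self, name: str) -> List[str]:
        """Return all values recorded under the given annotation name."""
        return [a.value for a in self.annotations if a.name == name]


def run_test_script(entry: ServiceEntry, script_name: str,
                    script: Callable[[], bool]) -> bool:
    """Run one test script and record its outcome as an automated annotation."""
    try:
        passed = bool(script())
    except Exception:
        passed = False  # a crashing script counts as a failed test
    entry.annotate(f"test:{script_name}", "pass" if passed else "fail",
                   source="automated")
    return passed


if __name__ == "__main__":
    entry = ServiceEntry("example-service")
    entry.annotate("tag", "sequence alignment")
    run_test_script(entry, "endpoint-alive", lambda: True)
    for a in entry.annotations:
        print(a.source, a.name, a.value)
```

Keeping the `source` on each annotation is one way to separate subjective user input from objective, machine-generated results, as the note suggests.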
Automated monitoring & testing Test scripts, endpoint availability, meantime failure Partner feeds myExperiment.org Workflow profile Update feeds to users Develop incentives Expert for oversight How do we rank? How do we compare non-alike?
Cite FLOSS
myExperiment is as much an engineering project as it is a social experiment
e-Science is me-Science. Aligning community with individual. But we have to be aware of the drivers for collaboration. Competitive advantage: be the first with the Nature paper. Academic vanity: credit, credibility, fame, acclaim, recognition, peer respect, reputation. Adoption: get my stuff adopted / recognised. More funding. Being found out: open to rigorous inspection. Being scooped: beaten by lab X. Protecting my turf. Releasing results too early. Getting left behind. Being out of fashion. Looking stupid: being misinterpreted or misrepresented. Losing control. Taking a risk.
attribution -> provenance
Transferring data, methods and know-how from one discipline to another (e.g. astronomy image analysis applied to cancer tissue microarrays). How do you find relevant material that uses a different jargon, in a different discipline, organised to suit only its experts? Validation? How do I know it does what it says it does? Reproducibility? When the services are volatile. Reusability? When it contains in-house code or an in-house application. Longevity? Will this workflow still run in 6 months' time? Palpability? Why does this workflow fail? Why does it work? How does it work?
Validation? How do I know it does what it says it does? Reproducibility? When the services are volatile. Reusability? When it contains in-house code or an in-house application. Longevity? Will this workflow still run in 6 months' time? Palpability? Why does this workflow fail? Why does it work? How does it work?
myExperiment is as much an engineering project as it is a social experiment
Step back: what do we mean by provenance, in general? Data and process provenance. What do we do with it? We collect a great deal of metadata, at a very fine level of granularity: what can we expect to get out of it?
see dedicated in-depth talk for details on our approach
A lineage query lets users identify (1) variables that carry interesting values for which provenance is sought, and (2) nodes in the graph where provenance information should be reported.
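A lineage query of this kind can be pictured as a backwards traversal of the provenance graph from a target variable, reporting only at a chosen set of nodes. The sketch below is a generic illustration under assumed names: the `DERIVED_FROM` graph and the `lineage` function are hypothetical and are not Taverna's provenance API or its query algorithm.

```python
from collections import deque

# Hypothetical provenance graph: each variable maps to the variables it was derived from.
DERIVED_FROM = {
    "pathways": ["kegg_ids"],
    "kegg_ids": ["entrez_ids", "uniprot_ids"],
    "entrez_ids": ["genes_in_region"],
    "uniprot_ids": ["genes_in_region"],
    "genes_in_region": ["chromosome", "start_position", "end_position"],
}


def lineage(target, focus=None):
    """List the upstream variables of `target` in breadth-first order.

    If `focus` is given, only variables in that set are reported, although
    the traversal still walks through all intermediate nodes.
    """
    seen = set()
    order = []
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for parent in DERIVED_FROM.get(node, []):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    if focus is not None:
        order = [n for n in order if n in focus]
    return order


if __name__ == "__main__":
    # Where did the "pathways" value come from, reported only at two nodes?
    print(lineage("pathways", focus={"genes_in_region", "chromosome"}))
```

Separating the target variable from the reporting nodes keeps the answer small even when the full provenance trace is very fine-grained.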