A presentation of IDB (Infrastructure Distributed for Biology) using StratusLab technology by Christophe Blanchet and Clément Gauthey at Lille, France, May 2013.
IDB-Cloud Providing Bioinformatics Services on Cloud
1. Christophe Blanchet, Clément Gauthey
Infrastructure Distributed for Biology
IDB-IBCP CNRS FR3302 - LYON - FRANCE
http://idee-b.ibcp.fr
IDB acknowledges co-funding by the European Community's Seventh Framework Programme (INFSO-RI-261552)
and the French National Research Agency's Arpege Programme (ANR-10-SEGI-001)
2. Réseau des Ingénieurs en Bioinformatique, Lille, 23 May 2013
Bioinformatics Today
• Biological data are big data
• 1512 online databases (NAR Database Issue 2013)
• Institut Sanger, UK, 5 PB
• Beijing Genome Institute, China, 4 sites, 10 PB
➡ Big data in many places
• Analysing such data has become difficult
• Scale-up of the analyses: from gene/protein to complete genome/proteome, ...
• Many different tools in daily use
• They need to be combined in workflows
• Usual interfaces: portals, web services, federation, ...
➡ Datacenters with ease of access/use
• Distributed resources
• Experimental platforms: NGS, imaging, ...
• Bioinformatics platforms
➡ Federation of datacenters
3. Sequencing Genomes
source: www.politigenomics.com/next-generation-sequencing-informatics
Complete genome sequencing has become a lab commodity with NGS (cheap and efficient)
source: www.genomesonline.org
4. Infrastructures in Biology
Many tools and web services to process and visualize large amounts of data
5. The scene
• Bioinformatics service providers:
• Is it easy to deploy many (incompatible) tools?
• To connect them to public databases?
• To limit the transfer of huge data sets?
• To provide users with their own computing resources?
• With their own isolated storage?
• Scientists:
• Is it easy to access and use these tools?
• To adapt them to your usage?
• To get your own or other tools deployed on a datacenter?
• To combine them?
• To get your own computing/storage resources?
[Diagram: a scientist needs tool X from the bioinformatics center. French biologists have access to regional computer resources (RENABI), usually one cluster for all uses. Decision points: Availability? (yes/no), Compatible? (yes/no?). Other labels: engineers, installation time.]
7. RENABI GRISBI www.grisbio.fr
Meeting the needs
gLite GRISBI
International databanks ~ yes BioMAJ NFS
Personal space ~ yes XtreemFS ?
Shared space ~ yes
Simple access to storage no XtreemFS ?
Distribution of computations WMS
Integration of existing clusters ~ yes CE-gateway
Software deployment SWAREA ++ human time
Workflow/pipeline ~ DAG
Identity and access management vo.renabi.fr Shibboleth/LDAP
Easy-to-use interface ~ CLI "GR commands"
Public interface (anonymous access through portal and web services) no ? robot certificates, MyProxy ?
➡ The gLite software meets the need for computing power
➡ Its data access and management modes are less suited to the community's practices
8. Cloud computing?
Created by Sam Johnston
License: Creative Commons
9. StratusLab Project
Goal
• Create a comprehensive, open-source IaaS cloud distribution
EU FP7 project
• 1 June 2010 to 31 May 2012 (2 years)
• 6 partners from 5 countries
• Budget: 3.3 M€ (2.3 M€ EC)
Contacts
• Website: http://stratuslab.eu/
• Twitter: @StratusLab
• Support: support@stratuslab.eu
Partners: CNRS (FR), UCM (ES), GRNET (GR), SIXSQ (CH), TID (ES), TCD (IE)
10. IDB’s Cloud
• Cloud workbench for Biology
• 13 turnkey bioinformatics appliances (as of Apr. 2013)
• Running since Sept. 2011, open to the biology community
• Lyon, France
• Powered by
• StratusLab
• Compute nodes, block storage
• 900+ cores, 4+ TB RAM, 36 TB of virtual disks
• Mainly Intel Sandy Bridge servers with 32 cores and 128 GB RAM
• Bigmem servers with 64 cores and 768 GB RAM
• VMs from 1 core and 1 GB RAM to 64 cores and 768 GB RAM
• + OpenStack
• Object storage (Swift)
• 200+ TB of redundant and scalable storage
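The hardware figures on this slide suggest a simple placement rule. As a minimal sketch (the `SERVER_TYPES` table, the labels "standard" and "bigmem", and the `pick_server` helper are hypothetical names for illustration, not part of StratusLab or of the IDB cloud), the following picks the smallest server type listed above that can host a requested VM:

```python
from typing import Optional

# Server types taken from the slide: (label, cores, RAM in GB).
# The labels are illustrative only.
SERVER_TYPES = [
    ("standard", 32, 128),  # Intel Sandy Bridge servers
    ("bigmem", 64, 768),    # large-memory servers
]


def pick_server(cores: int, ram_gb: int) -> Optional[str]:
    """Return the smallest server type able to host the VM, or None."""
    if cores < 1 or ram_gb < 1:
        raise ValueError("a VM needs at least 1 core and 1 GB of RAM")
    for label, max_cores, max_ram_gb in SERVER_TYPES:
        if cores <= max_cores and ram_gb <= max_ram_gb:
            return label
    return None  # larger than any single server on the slide


print(pick_server(1, 1))       # smallest VM size quoted on the slide
print(pick_server(16, 256))    # few cores, but needs a large-memory server
print(pick_server(128, 1024))  # does not fit on one server
```

A real scheduler would also track what is already running on each host; this sketch only checks the static per-server limits.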
11. Driven through a simple web interface
12. Integrate Bioinformatics Tools in Cloud
[Diagram: bioinformatics tools (BLAST, GOR4, FastA, SSearch, Abyss, ClustalW, Ray, BWA, PhyML) are installed on Linux virtual machines (RedHat, CentOS, Debian, Ubuntu, Suse) to create new appliances, which are published in the Bioinformatics Marketplace: NGS, Structure, Galaxy, ARIA, Sequence, (…)]
• Appliances are virtual machines
• Small: a few GB, easy to convert to most virtualization formats
• Installed and pre-configured with common bioinformatics tools
• e.g. BLAST, ClustalW, ARIA, MEME, HMMER, TopHat, BWA, SAMtools, etc.
13. Bioinformatics Appliances
14. Select your bioinformatics tools
15. Run Bioinformatics Cloud Instances
[Diagram: from the portal, users launch instances of appliances from the Bioinformatics Marketplace (NGS, Structure, Galaxy, ARIA, Sequence, (…)) on the resources of IBCP's cloud. PaaS: BLAST, Clustal, etc. IaaS: a master and storage VM (ARIA, SharedFS) launches jobs over ssh on worker VMs (CNS).]
16. Manage your Cloud Instances
17. Biological Data in Cloud
[Diagram: public data sources (UniProt, PDB, EMBL, PROSITE, genomes) are shared (NFS) with the bioinformatics cloud. The cloud offers PaaS services (BLAST, Clustal, etc.) and IaaS virtual clusters reached from the portal: a master and storage VM (ARIA, SharedFS) launches jobs over ssh on worker VMs (CNS). User persistent data is kept on pdisk (iSCSI). Upload your data and get your results with scp or http/S3.]
19. Common bioinformatics node
• ‘Biocompute’ appliance
• Use your own instance(s)
• With pre-installed standard bioinformatics tools
• BLAST, FastA, SSearch, HMM, ...
• ClustalW2, Clustal-Omega, Muscle, ...
• Bowtie(2), BWA, samtools, ...
• MEME, R, etc.
• Connected to public reference data
• UniProt, EMBL, genomes, PDB, etc.
• Automatically shared with the VMs
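As a minimal sketch of how such an instance might be used, the Python snippet below assembles a `blastp` command line against a shared reference database. The flags are standard BLAST+ options, but the database path `/db/uniprot/uniprot_sprot`, the file names and the `blastp_command` helper are assumptions made for illustration, not the actual layout of the appliance.

```python
from typing import List


def blastp_command(query: str, db: str, out: str,
                   evalue: float = 1e-5, threads: int = 1) -> List[str]:
    """Build a BLAST+ blastp command as an argument list."""
    if evalue <= 0:
        raise ValueError("evalue must be positive")
    if threads < 1:
        raise ValueError("threads must be at least 1")
    return [
        "blastp",
        "-query", query,           # FASTA file with the protein sequences
        "-db", db,                 # pre-formatted BLAST database
        "-out", out,               # result file
        "-evalue", str(evalue),    # expectation value threshold
        "-outfmt", "6",            # tabular output
        "-num_threads", str(threads),
    ]


# Hypothetical paths: the shared database location is an assumption.
cmd = blastp_command("my_proteins.fasta", "/db/uniprot/uniprot_sprot",
                     "hits.tsv", threads=8)
print(" ".join(cmd))
```

On an instance where BLAST+ is installed, the list can be passed to `subprocess.run(cmd, check=True)`; the sketch only prints it so that it runs anywhere.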
20. Structural Biology
• TOwards StruCtural AssignmeNt Improvement
• To improve the determination of protein structures based on Nuclear Magnetic Resonance (NMR) information with the ARIA software
• Large computational needs
• An NMR laboratory will not necessarily invest in building a cluster of about 100 nodes just to run such NMR structure calculations.
• The flexibility of the cloud in deploying the different required bioinformatics tools can accelerate such a procedure.
• Commercial interest in providing such tools to structural biologists on a “pay as you go” basis.
• Endorsers: Institut Pasteur Paris and CNRS IBCP
21. IaaS deployment of ARIA
[Diagram: ARIA takes the input data (10s of MB) through structure preparation (8x) and many CNS calculations (20-100), which read and write intermediate results on shared storage, and produces the final results (GB). Virtual cluster: a master and storage VM (ARIA, SharedFS) launches jobs over ssh on worker VMs (CNS).]
A significant increase in the number of calculated protein conformations improves the statistics on the NMR conformations and can help to overcome the ambiguity bottleneck.
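The master/worker layout on this slide can be illustrated with a small sketch. It spreads a batch of structure calculations over worker VMs in round-robin order and builds the ssh command the master would issue for each one. This is not ARIA's own scheduler: the worker names, the shared directory and the `run_cns.sh` wrapper script are hypothetical.

```python
from typing import Dict, List


def assign_jobs(workers: List[str], n_jobs: int) -> Dict[str, List[int]]:
    """Spread job indices 0..n_jobs-1 over the workers in round-robin order."""
    if not workers:
        raise ValueError("at least one worker is required")
    plan: Dict[str, List[int]] = {worker: [] for worker in workers}
    for job in range(n_jobs):
        plan[workers[job % len(workers)]].append(job)
    return plan


def ssh_command(worker: str, job: int, shared_dir: str) -> List[str]:
    """Build the ssh command for one job.

    'run_cns.sh' is a hypothetical wrapper living on the shared file system.
    """
    return ["ssh", worker, f"{shared_dir}/run_cns.sh", str(job)]


# Hypothetical cluster of 4 workers and 10 structure calculations.
workers = [f"worker-{i:02d}" for i in range(1, 5)]
plan = assign_jobs(workers, 10)
for worker, jobs in plan.items():
    print(worker, jobs)
print(" ".join(ssh_command("worker-01", 0, "/shared/aria")))
```

Because every worker mounts the same shared file system, each job can write its intermediate results where the master expects to find them.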
22. Galaxy portal for NGS analyses
• Analyse NGS data
• The Galaxy portal is widely used in the community
• Connected to large public data: sequences and indexes
• Large user data (GBs)
• Preserve workflows and results (persistent storage)
23. Proteomics desktop
• Motivation
• Collaboration with a mass spectrometry platform
• Running out of space on their local resources
• Protein identification
• Experimental mass spectrometry data
• Reference databases: nr, Swiss-Prot
• Reference screening tools: OMSSA, X!Tandem
• User interface
• Remote display
• NX
• Reference GUIs
• SearchGUI
• PeptideShaker
source: PeptideShaker site
24. Conclusion
• Provide turnkey bioinformatics appliances
• Standard tools and pipelines
• Interoperability: ready to run on cloud
• Easier to transfer appliances than data (GB vs TB)
• Provide a cloud infrastructure tightly connected to existing bioinformatics infrastructure
• IDB’s public bioinformatics cloud
• Linked to public biological databases
• In collaboration with the French Bioinformatics Institute
• Ease usage by scientists
• Usual bioinformatics gateways
• Persistent and large ubiquitous storage
• Web interface for cloud management
• Access upon registration, for standard use
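The "GB vs TB" argument in this conclusion can be made concrete with a little arithmetic. The sketch below compares the time needed to move an appliance and a data set over the same link; the 5 GB appliance, the 2 TB data set and the 1 Gbit/s link are assumed values chosen for illustration, not measurements from the IDB cloud.

```python
GB = 10**9   # bytes
TB = 10**12  # bytes


def transfer_seconds(size_bytes: float, link_bits_per_s: float) -> float:
    """Ideal transfer time, ignoring protocol overhead and latency."""
    if link_bits_per_s <= 0:
        raise ValueError("link speed must be positive")
    return size_bytes * 8 / link_bits_per_s


LINK = 10**9  # assumed 1 Gbit/s network link

appliance_s = transfer_seconds(5 * GB, LINK)  # assumed appliance size
dataset_s = transfer_seconds(2 * TB, LINK)    # assumed data set size

print(f"appliance: {appliance_s:.0f} s")
print(f"data set:  {dataset_s / 3600:.1f} h")
print(f"ratio:     {dataset_s / appliance_s:.0f}x")
```

Under these assumptions the appliance moves in well under a minute while the data set takes hours, which is the case for bringing the tools to the data.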
25. Perspectives
• Define good practices for providing the academic community and industry with bioinformatics services
• French Bioinformatics Institute (IFB)
• Its goals are to provide core bioinformatics resources to the national and international life science research community in key fields such as genomics, proteomics, systems biology, etc.
• It aims to build a national academic cloud devoted to bioinformatics, inspired by the model evaluated through the IDB’s cloud.
• European ELIXIR infrastructure
• To build a sustainable European infrastructure for biological information, supporting life science research and its translation
• IFB will be the French representative in ELIXIR.
[Diagram: a scientist needs tool X from the bioinformatics center. French biologists have access to regional resources (RENABI). Decision point: Available? (yes/no). Engineers create new appliances and register them in the appliances catalog. Cloud: bioinformatics or public cloud; regional, national or a federation.]
26. Acknowledgment
• IDB members: Clément Gauthey, Simon Malesys
• StratusLab members
• Co-funding by the European Community's Seventh Framework Programme (INFSO-RI-261552) and by the French National Research Agency's Arpege Programme (ANR-10-SEGI-001).
Questions?
http://idee-b.ibcp.fr