Data-intensive applications on cloud
computing resources: Applications in life
sciences
Ola Spjuth <ola.spjuth@farmbio.uu.se>
Department of Pharmaceutical Biosciences
and Science for Life Laboratory
Uppsala University
Today: We have access to high-throughput
technologies to study biological phenomena
Science for Life Laboratory, Sweden
An internationally leading center
that develops and applies
large-scale technologies for
molecular biosciences with a
focus on health and
environment.
National platform since 2013
Stockholm node
Uppsala node
● Coordinated by NBIS
● ~63 FTEs (staff >75)
● Staff at all major
Swedish universities
● 2018 budget ~7.5 M€
● Bioinformatics
platform of SciLifeLab
2017: Human whole genome sequenced
in 3 days for ~$1100
…requires supercomputers
for analysis and storage
Massively parallel sequencing….
2017: Illumina HiSeq X systems. 15K whole human
genomes per year
2016: NGI data velocity 950 Mbp/min ≈ 16 Mbp/s
Analysis
Scientists
Sample
transfer
Current mode of operation
Platforms
Pre-processing (NGI)
Research (SNIC)
Data
delivery
What we sequenced at NGI /
Some statistics Storage usage
Projects at SNIC-UPPMAX
Data-intensive bioinformatics
Other disciplines
Support tickets
NGS users
• Key observations
– Batch-oriented on HPC/HTC, shared storage, Linux, open source software
– Computations are not so large, seldom multi-node
– Storage is the biggest challenge. Projects do not end. Users do not clean up
data. WGS projects are very large.
– Many users, often inefficient, and lots of software (admin burden, support,
education)
– Free resources (no cost) do not promote efficient usage
• Investment strategies
– When investing in computational hardware, it takes a long time from
funding decision until the resources are operational (10-12 months on
average).
– Expansion of resources is done at specific points in time, with low flexibility
between these.
– Decisions on resources are made by a national board with limited influence
from life science scientists or platforms (Sweden)
Why cloud in the life sciences?
• Access to resources
– Flexible configurations
– On-demand, pay-as-you-go
• Collaborate on international level
– Publish/federate data
– E.g. Large sequencing initiatives, “move compute to the
data”
• New types of analysis environments
– Hadoop/Spark/Flink etc.
– Microservices, Docker, Kubernetes, Mesos
Using clouds in Bioinformatics
How can we take advantage of cloud resources?
Simplest example:
• Start VM from (pre-made) VMI
• Upload data
• Perform scientific task
• Download results
• Terminate VM
Easy to scale this up to using many instances!
Or….. is it?
• What if I want to run 100 instances in parallel?
• What about if I want a new tool? Later versions?
• Do I need to upload data every time?
So we want to set up and use a virtual
cluster
• Multiple compute nodes
• Network
• Distributed storage
• Firewall, DNS, reverse proxy, etc.
So, we now have a virtual cluster. And now?
Batch-like system
– Install a queueing system, e.g. SLURM
– Install bioinformatics software
Big Data system
– Install HDFS + Hadoop/Spark on the nodes
Container-based system
– Install Docker and Kubernetes
Data
– Ingress project data, possibly reference data
(There are tools
that can help
automating some
of these
procedures.)
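As a rough illustration of the batch-like option, the sketch below renders a SLURM batch script from Python. The `#SBATCH` options and the `SLURM_ARRAY_TASK_ID` variable are standard SLURM; the helper name `render_sbatch`, the job name, and the file names in the usage note are assumptions made for this example, not something taken from the slides.

```python
from typing import Optional


def render_sbatch(job_name: str, command: str, cpus: int = 1,
                  time_limit: str = "01:00:00",
                  array: Optional[str] = None) -> str:
    """Render a minimal SLURM batch script as a string (hypothetical helper)."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --cpus-per-task={cpus}",
        f"#SBATCH --time={time_limit}",
    ]
    if array is not None:
        # A job array runs the same script once per index in the range.
        lines.append(f"#SBATCH --array={array}")
    # Blank line between the directives and the command, trailing newline at the end.
    lines += ["", command, ""]
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical example: one array task per sample file.
    print(render_sbatch(
        job_name="align",
        command="bwa mem ref.fa sample_${SLURM_ARRAY_TASK_ID}.fastq "
                "> sample_${SLURM_ARRAY_TASK_ID}.sam",
        cpus=4,
        array="1-100",
    ))
```

The printed script could be saved to a file and submitted with `sbatch`; with `--array=1-100` the queueing system, rather than the user, takes care of running the 100 tasks in parallel.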
Challenges with cloud
• Tradition: Strong HPC tradition in academia
– Sweden: Existing HPC resources funded by Research
Council and personnel at six centers in Sweden (SNIC)
• Economy: Cost model is new
– Difficult to assess the costs
• Data: How to work with large-scale data (TB/PB-range)
• Legal: Working with sensitive data
• Educational: New technology for many
Some SciLifeLab cloud options
● Geographically distributed federated IaaS cloud
based on 2nd generation HPC-hardware
● Built using OpenStack
SNIC Cloud in Sweden
Needs in bioinformatics
• Primarily resources with a lot of RAM and storage (high I/O)
• Preferably transparent system, users don’t want to deal with
e-infrastructure at all
• How to work with storage (tiered?)
• Is Best-Effort SLA enough?
Virtual Machines and Containers
Virtual machines
• Package entire systems (heavy)
• Completely isolated
• Suitable in cloud environments
Containers:
• Share OS
• Smaller, faster, portable
• Docker
Microservices
• Decompose functionality into smaller, loosely coupled, on-demand
services communicating via an API
– “Do one thing and do it well”
• Services are easy to replace, language-agnostic
– Minimize risk, maximize agility
– Suitable for loosely coupled teams
– Portable - easy to scale
– Multiple services can be chained into larger tasks
Software containers (e.g. Docker) are
ideal for microservices!
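To make the idea of a small service that does one thing behind an API concrete, here is a minimal sketch using only the Python standard library. The service (GC content of a DNA sequence), the `/gc` path, and all function names are assumptions chosen for illustration; a real deployment would package something like this in a container.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import ProxyHandler, Request, build_opener


def gc_content(sequence: str) -> float:
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


class GCHandler(BaseHTTPRequestHandler):
    """Answers POST requests carrying JSON like {"sequence": "ACGT"}."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        body = json.dumps({"gc": gc_content(payload.get("sequence", ""))}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet


def start_service(port: int = 0) -> HTTPServer:
    """Start the service in a background thread; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), GCHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def call_service(server: HTTPServer, sequence: str) -> float:
    """Client side: any language that can send JSON over HTTP could do this."""
    url = f"http://127.0.0.1:{server.server_address[1]}/gc"
    request = Request(url, data=json.dumps({"sequence": sequence}).encode(),
                      headers={"Content-Type": "application/json"})
    opener = build_opener(ProxyHandler({}))  # bypass any configured proxy
    with opener.open(request) as response:
        return json.loads(response.read())["gc"]


if __name__ == "__main__":
    service = start_service()
    print(call_service(service, "GGCCAATT"))
    service.shutdown()
    service.server_close()
```

Because the caller only depends on the HTTP/JSON contract, the implementation behind it can be replaced or rewritten in another language without touching the clients.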
Orchestrating containers
• Origin: Google
• A declarative language for
launching containers
• Start, stop, update, and
manage a cluster of
machines running
containers in a consistent
and maintainable way
• Suitable for microservices
Containers
Scheduled and packed containers on nodes
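Assuming the orchestrator described here is Kubernetes (it is named elsewhere in the deck), the declarative style can be sketched by building a Deployment manifest as plain data. The field layout follows the `apps/v1` Deployment schema; the service name and image are placeholders invented for this example.

```python
import json


def deployment_manifest(name: str, image: str, replicas: int = 3) -> dict:
    """Build a Kubernetes Deployment manifest as a plain dictionary."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            # Desired state: the cluster keeps this many copies running.
            "replicas": replicas,
            # The selector must match the labels on the pod template.
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [{"name": name, "image": image}]},
            },
        },
    }


if __name__ == "__main__":
    # Placeholder name and image, for illustration only.
    manifest = deployment_manifest("gc-service", "example.org/gc-service:1.0")
    print(json.dumps(manifest, indent=2))
```

The output describes what should be running, not how to start it; saved to a file, it could be submitted with `kubectl apply -f`, which accepts JSON as well as YAML.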
Connecting the microservices
• A suitable way of using
containers is to connect
them into a (scientific)
workflow.
• Tools like Pachyderm
(http://pachyderm.io/), Luigi
(https://github.com/spotify/luigi)
and Galaxy
(https://galaxyproject.org/)
can assist with this.
• Goal: Reproducible,
fault-tolerant, scalable execution.
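The core mechanism behind such workflow tools can be sketched in a few lines: run tasks in dependency order and skip any task whose output already exists, so a failed run can be resumed. This is a toy illustration using the standard library only; it is not the API of Pachyderm, Luigi, or Galaxy, and the task names are invented.

```python
import tempfile
from graphlib import TopologicalSorter  # Python 3.9+
from pathlib import Path


def fetch(workdir: Path) -> str:
    return "acgt\nggcc\n"  # stand-in for downloading raw data


def clean(workdir: Path) -> str:
    return (workdir / "fetch.out").read_text().upper()


def count(workdir: Path) -> str:
    return str(len((workdir / "clean.out").read_text().split()))


# Task name -> (names of upstream tasks, function producing the output text).
TASKS = {
    "fetch": ([], fetch),
    "clean": (["fetch"], clean),
    "count": (["clean"], count),
}


def run_workflow(tasks: dict, workdir: Path) -> list:
    """Run tasks in dependency order; skip tasks whose output file exists."""
    graph = {name: set(deps) for name, (deps, _) in tasks.items()}
    executed = []
    for name in TopologicalSorter(graph).static_order():
        target = workdir / f"{name}.out"
        if target.exists():
            continue  # already done in an earlier run
        target.write_text(tasks[name][1](workdir))
        executed.append(name)
    return executed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(run_workflow(TASKS, Path(tmp)))  # first run executes every task
        print(run_workflow(TASKS, Path(tmp)))  # second run finds nothing to do
```

In a container-based setup each task function would instead launch a container, but the bookkeeping (dependencies, outputs, resumability) stays the same.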
Virtual Research Environments
[Diagram: a researcher and other researchers, each with separate tools and data. VREs aim to bridge this gap!]
[Diagram: a Virtual Research Environment that brings together tools, data, and compute and storage resources, shared by a researcher and other researchers.]
PhenoMeNal
• Horizon 2020 project, 2015-2018
• Virtual Research Environments (VRE), Microservices, Workflows
• Towards interoperable and scalable Metabolomics data analysis
• Private environments for sensitive data
http://phenomenal-h2020.eu/
PhenoMeNal approach and stack
• Enable users to deploy their own virtual
infrastructure on an IaaS provider
• Containerize tools, orchestrate microservices
with workflow systems on top of Kubernetes
[Diagram: virtual infrastructure stack, with labels DockerHub, GitHub, Cloudflare, Packer, Terraform, kubeadm, and kubectl.]
KubeNow
Users should not see this…
Users should see this!
Start-to-end MS-analysis
Deployment on local clouds
Steffen Neumann,
IPB Halle
Two on-premises deployments
MRC-NIHR Phenome Centre
Kultima group
www.caramba.clinic
Bring compute to the data
• Moving data can be problematic
– e.g. size, legal, resources, costs, time…
• VRE encompasses all components necessary to carry out
analysis
– Launch near data
– Re-use environment, or even a scientific workflow
• Next step: Federate data, federate clouds
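A back-of-the-envelope calculation shows why moving data can be problematic. The sketch below estimates transfer time from data size and network bandwidth; the 100 TB data set and the link speeds are illustrative assumptions, not figures from the slides.

```python
def transfer_hours(size_tb: float, bandwidth_gbit_s: float,
                   efficiency: float = 1.0) -> float:
    """Hours needed to move size_tb terabytes over a link.

    efficiency is the fraction of the nominal bandwidth actually achieved
    (1.0 means the link is fully and exclusively used).
    """
    size_bits = size_tb * 1e12 * 8  # decimal terabytes to bits
    seconds = size_bits / (bandwidth_gbit_s * 1e9 * efficiency)
    return seconds / 3600


if __name__ == "__main__":
    # Hypothetical 100 TB sequencing data set over different links.
    for gbit in (1, 10, 100):
        print(f"{gbit:>3} Gbit/s: {transfer_hours(100, gbit):7.1f} hours")
```

Under these assumptions a fully utilized 1 Gbit/s link needs more than nine days for 100 TB, before any legal or cost considerations, which is the practical argument for launching the analysis environment next to the data.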
Research focus in my group
e-Science methods development
Smart data management,
predictive modeling
Applied e-Science research
Drug discovery and
individualized diagnostics
e-infrastructure development
Automation, Big Data
Privacy
preservation
Workflows
Big Data
frameworks
Data management and
predictive modeling
Data
federation
Compute
federation
[Figure: Efficiency (%) versus date, 2014-2017, in two panels: NGS projects and other projects. An annotation marks when efficiency feedback to users began.]
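The slides do not state how efficiency was computed. One common definition on shared clusters, assumed here, is the CPU time a job actually used divided by the CPU time it reserved (cores multiplied by wall-clock time). A minimal sketch:

```python
def job_efficiency(cpu_seconds: float, wall_seconds: float, cores: int) -> float:
    """CPU efficiency in percent: used CPU time over reserved CPU time."""
    if wall_seconds <= 0 or cores <= 0:
        raise ValueError("wall_seconds and cores must be positive")
    return 100.0 * cpu_seconds / (wall_seconds * cores)


def project_efficiency(jobs) -> float:
    """Efficiency over many jobs, weighted by the resources each one reserved.

    jobs is an iterable of (cpu_seconds, wall_seconds, cores) tuples.
    """
    jobs = list(jobs)
    used = sum(cpu for cpu, _, _ in jobs)
    reserved = sum(wall * cores for _, wall, cores in jobs)
    if reserved <= 0:
        raise ValueError("no reserved resources")
    return 100.0 * used / reserved


if __name__ == "__main__":
    # Hypothetical job: 16 cores reserved for one hour, one core kept busy.
    print(job_efficiency(cpu_seconds=3600, wall_seconds=3600, cores=16))
```

A single-threaded tool run on a full 16-core node scores 6.25% by this measure, the kind of pattern that feedback to users is meant to surface.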
Selected research questions
How can we improve efficiency on shared HPC for data-
intensive bioinformatics?
1. M. Dahlö, D. Schofield, W. Schaal, and O. Spjuth. Tracking the NGS revolution: Usage and system support of bioinformatics
projects on shared high-performance computing clusters. In preparation.
2. O. Spjuth, E. Bongcam-Rudloff, J. Dahlberg, M. Dahlö, A. Kallio, L. Pireddu, F. Vezzi, and E. Korpelainen. Recommendations on
e-infrastructures for next-generation sequencing. Gigascience, 2016, 5:26.
3. S. Lampa, M. Dahlö, P. I. Olason, J. Hagberg, and O. Spjuth. Lessons learned from implementing a national infrastructure in
Sweden for storage and analysis of next-generation sequencing data. Gigascience, 2013, 2:9.
Data locality?
Outsourcing?
Martin Dahlö
Selected research questions
Can Big Data frameworks aid data-intensive bioinformatics?
1. A. Siretskiy, L. Pireddu, T. Sundqvist, and O. Spjuth. A quantitative assessment of the Hadoop framework for analyzing
massively parallel DNA sequencing data. Gigascience, 2015, 4:26.
2. L. Ahmed, A. Edlund, E. Laure, and O. Spjuth. Using Iterative MapReduce for Parallel Virtual Screening. 2013 IEEE 5th
International Conference on Cloud Computing Technology and Science (CloudCom), vol. 2, pp. 27-32, 2013.
3. M. Capuccini, L. Carlsson, U. Norinder, and O. Spjuth. Conformal Prediction in Spark: Large-Scale Machine Learning with
Confidence. IEEE/ACM 2nd International Symposium on Big Data Computing (BDC), Limassol, 2015, pp. 61-67.
4. M. Capuccini, L. Ahmed, W. Schaal, E. Laure, and O. Spjuth. Large-scale virtual screening on public cloud resources with
Apache Spark. Journal of Cheminformatics, 2017, 9:15.
Laeeq
Valentin
Marco
Efficient Virtual Screening
with Apache Spark and
Machine Learning
Hadoop pipeline scales better than HPC
and is economical for current data sizes
“EasyMapReduce: Leverage the Power of Spark and Docker
to Scale Scientific Tools in MapReduce Fashion”
https://spark-summit.org/east-2017/events/easymapreduce-leverage-the-power-of-spark-and-docker-to-scale-scientific-tools-in-mapreduce-fashion/
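The MapReduce pattern itself can be sketched in plain Python: a map step that processes each record independently, and a reduce step that merges the partial results. This is not the Spark or EasyMapReduce API; the k-mer counting task and the function names are assumptions chosen for illustration.

```python
from collections import Counter
from functools import reduce


def map_kmers(read: str, k: int = 3) -> Counter:
    """Map step: count the k-mers in a single read, independent of other reads."""
    return Counter(read[i:i + k] for i in range(len(read) - k + 1))


def reduce_counts(left: Counter, right: Counter) -> Counter:
    """Reduce step: merge two partial counts (associative, so order is free)."""
    return left + right


def count_kmers(reads, k: int = 3) -> Counter:
    """Run map over every read, then fold the partial counts together."""
    partial_counts = map(lambda read: map_kmers(read, k), reads)
    return reduce(reduce_counts, partial_counts, Counter())


if __name__ == "__main__":
    print(count_kmers(["ACGTA", "CGTAC"]))
```

Because each map call touches only its own read and the reduce step is associative, a framework is free to spread the work over many nodes, which is where the scalability comes from.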
Selected research questions
How useful are Scientific Workflows in
data-intensive research?
O. Spjuth et al. Experiences with workflows for automating data-intensive bioinformatics. Biology Direct.
2015 Aug 19;10(1):43.
S. Lampa, J. Alvarsson and O. Spjuth. Towards agile large-scale predictive modelling in drug discovery with
flow-based programming design principles. Journal of Cheminformatics, 2016, 8:67
Samuel
Jon
• Streamline analysis on high-
performance e-infrastructures
• Support reproducible data analysis
• Enable large-scale data analysis
http://scipipe.org
https://github.com/pharmbio/sciluigi
http://pachyderm.io
Selected research questions
How can we deploy smart, high-availability services with APIs?
http://www.openrisknet.org
• Horizon 2020 project, 2017-2020
• E-Infrastructure for chemical safety assessment
• Multi-tenant Virtual Environments, microservices
• APIs, “Semantic interoperability”
• Academia – industry
• Much focus on standardizing chemical data and predictive modeling
Staffan
Jonathan
Arvid
Research questions around the
corner
• Public and private data sources are not static. How can we
continuously improve predictive models as data changes?
• We can generate too much data. Can predictive modeling aid data
acquisition, storage and analysis?
Reactive/continuous modeling
Data sources
Coordinate
Integrate
Version
Monitor
Publish
models
Archive
models
User
Train and
assess model
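One way to read this loop is as a monitor that retrains only when the data has actually changed and archives every published model. The sketch below is an assumption-laden illustration: the fingerprinting scheme, the `ModelRegistry` class, and the toy training function are all invented for this example.

```python
import hashlib
import json


def data_fingerprint(records) -> str:
    """Version the data: same records (in any order) give the same fingerprint."""
    blob = json.dumps(sorted(records)).encode()
    return hashlib.sha256(blob).hexdigest()


class ModelRegistry:
    """Keeps an archive of (data fingerprint, model) pairs, newest last."""

    def __init__(self, train):
        self.train = train    # function: records -> model
        self.versions = []    # archived models, one per data version

    def update(self, records) -> bool:
        """Retrain and publish only if the data changed; report whether it did."""
        fingerprint = data_fingerprint(records)
        if self.versions and self.versions[-1][0] == fingerprint:
            return False
        self.versions.append((fingerprint, self.train(records)))
        return True

    @property
    def current(self):
        """The most recently published model."""
        return self.versions[-1][1]


if __name__ == "__main__":
    # Toy "model": the mean of the observed values.
    registry = ModelRegistry(train=lambda values: sum(values) / len(values))
    print(registry.update([1, 2, 3]))     # new data, so a model is trained
    print(registry.update([3, 2, 1]))     # same data, nothing to do
    print(registry.update([1, 2, 3, 6]))  # data changed, retrain
    print(registry.current, len(registry.versions))
```

Keeping the fingerprint next to each archived model records which data version produced it, which supports the versioning and archiving steps in the loop.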
HASTE
Hierarchical Analysis of Spatial and TEmporal image data
From intelligent data acquisition via smart data management to confident predictions
PI, Aim 1: Carolina Wählby; Aim 2: Ola Spjuth; Aim 3: Andreas Hellander
29 MSEK
2017-2022
Training data
Can we use
privileged
information to
improve machine
learning models?
Training
Can we make a valid
ranking and guide data
acquisition?
Is something interesting
happening? Can we
assign valid probabilities
for that?
Collect more data
Online setting
Aim 2: Guiding data acquisition with
machine learning
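The slides do not say which method is meant by "valid probabilities". Conformal prediction, which appears in the group's cited work, is one candidate, and its central calculation is short enough to sketch. The nonconformity scores below are made-up numbers; in practice they would come from a trained model evaluated on a held-out calibration set.

```python
def conformal_p_value(calibration_scores, test_score: float) -> float:
    """Inductive conformal p-value for one candidate label.

    calibration_scores are nonconformity scores of held-out examples with
    known labels; test_score is the score of the new example under the
    candidate label. Higher scores mean "stranger".
    """
    n = len(calibration_scores)
    at_least_as_strange = sum(1 for s in calibration_scores if s >= test_score)
    return (at_least_as_strange + 1) / (n + 1)


def prediction_set(scores_by_label: dict, calibration_scores,
                   significance: float) -> set:
    """Keep every label whose p-value exceeds the significance level."""
    return {
        label
        for label, score in scores_by_label.items()
        if conformal_p_value(calibration_scores, score) > significance
    }


if __name__ == "__main__":
    calibration = [0.1, 0.2, 0.3, 0.4]  # made-up calibration scores
    print(conformal_p_value(calibration, 0.35))
    print(prediction_set({"interesting": 0.05, "background": 0.9},
                         calibration, significance=0.25))
```

Under the usual exchangeability assumption, the true label is excluded from the prediction set with probability at most the chosen significance level, which is the sense in which the output is valid.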
Aim 3: Explore a hierarchical model
based on Information Layers
Data warehouse,
distributed storage
Edge
Cloudlet,
private
cloud
Acknowledgements
Wes Schaal
Jonathan Alvarsson
Staffan Arvidsson
Arvid Berg
Samuel Lampa
Marco Capuccini
Martin Dahlö
Valentin Georgiev
Anders Larsson
Polina Georgiev
Maris Lapins
Jon-Ander Novella
Lars Carlsson
Ernst Ahlberg
Ola Engqvist
SNIC Science Cloud
Andreas Hellander
Salman Toor
Caramba.clinic
Kim Kultima
Stephanie Herman
Payam Emami
Research group website: http://pharmb.io
Thank you

Chemical decision support in toxicology and pharmacology (OpenToxEU 2013)Chemical decision support in toxicology and pharmacology (OpenToxEU 2013)
Chemical decision support in toxicology and pharmacology (OpenToxEU 2013)
 
Building a flexible infrastructure with Bioclipse, open source, and federated...
Building a flexible infrastructure with Bioclipse, open source, and federated...Building a flexible infrastructure with Bioclipse, open source, and federated...
Building a flexible infrastructure with Bioclipse, open source, and federated...
 
Accessing and scripting CDK from Bioclipse
Accessing and scripting CDK from BioclipseAccessing and scripting CDK from Bioclipse
Accessing and scripting CDK from Bioclipse
 

Dernier

CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡anilsa9823
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfSumit Kumar yadav
 
Orientation, design and principles of polyhouse
Orientation, design and principles of polyhouseOrientation, design and principles of polyhouse
Orientation, design and principles of polyhousejana861314
 
Boyles law module in the grade 10 science
Boyles law module in the grade 10 scienceBoyles law module in the grade 10 science
Boyles law module in the grade 10 sciencefloriejanemacaya1
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksSérgio Sacani
 
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisRaman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisDiwakar Mishra
 
Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoSérgio Sacani
 
Zoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfZoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfSumit Kumar yadav
 
Botany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsBotany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsSumit Kumar yadav
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...RohitNehra6
 
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...Sérgio Sacani
 
A relative description on Sonoporation.pdf
A relative description on Sonoporation.pdfA relative description on Sonoporation.pdf
A relative description on Sonoporation.pdfnehabiju2046
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPirithiRaju
 
Artificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C PArtificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C PPRINCE C P
 
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...Sérgio Sacani
 
Biological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfBiological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfmuntazimhurra
 
G9 Science Q4- Week 1-2 Projectile Motion.ppt
G9 Science Q4- Week 1-2 Projectile Motion.pptG9 Science Q4- Week 1-2 Projectile Motion.ppt
G9 Science Q4- Week 1-2 Projectile Motion.pptMAESTRELLAMesa2
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsSérgio Sacani
 
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bNightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bSérgio Sacani
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Lokesh Kothari
 

Dernier (20)

CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdf
 
Orientation, design and principles of polyhouse
Orientation, design and principles of polyhouseOrientation, design and principles of polyhouse
Orientation, design and principles of polyhouse
 
Boyles law module in the grade 10 science
Boyles law module in the grade 10 scienceBoyles law module in the grade 10 science
Boyles law module in the grade 10 science
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disks
 
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisRaman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
 
Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on Io
 
Zoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfZoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdf
 
Botany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsBotany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questions
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...
 
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
 
A relative description on Sonoporation.pdf
A relative description on Sonoporation.pdfA relative description on Sonoporation.pdf
A relative description on Sonoporation.pdf
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
 
Artificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C PArtificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C P
 
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
 
Biological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfBiological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdf
 
G9 Science Q4- Week 1-2 Projectile Motion.ppt
G9 Science Q4- Week 1-2 Projectile Motion.pptG9 Science Q4- Week 1-2 Projectile Motion.ppt
G9 Science Q4- Week 1-2 Projectile Motion.ppt
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
 
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bNightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
 

Data-intensive applications on cloud computing resources: Applications in life sciences

  • 1. Data-intensive applications on cloud computing resources: Applications in life sciences Ola Spjuth <ola.spjuth@farmbio.uu.se> Department of Pharmaceutical Biosciences and Science for Life Laboratory Uppsala University
  • 2. Today: We have access to high-throughput technologies to study biological phenomena
  • 3. Science for Life Laboratory, Sweden: An internationally leading center that develops and applies large-scale technologies for molecular biosciences with a focus on health and environment. National platform since 2013, with a Stockholm node and an Uppsala node.
  • 4. Bioinformatics platform of SciLifeLab
      ● Coordinated by NBIS
      ● ~63 FTEs (staff > 75)
      ● Staff at all major Swedish universities
      ● 2018 budget ~7.5 M€
  • 5. Massively parallel sequencing… requires supercomputers for analysis and storage
      • 2017: Human whole genome sequenced in 3 days for ~$1100
      • 2017: Illumina HiSeq X systems, 15K whole human genomes per year
      • 2016: NGI data velocity 950 Mbp/hour ≈ 16 Mbp/min
  • 6. Current mode of operation (diagram labels: Scientists; Sample transfer; Platforms; Pre-processing (NGI); Data delivery; Research (SNIC); Analysis)
  • 7. What we sequenced at NGI /
  • 8. Some statistics (figure; panels: Storage usage, Projects at SNIC-UPPMAX, Support tickets; series: Data-intensive bioinformatics, Other disciplines)
  • 9. NGS users
      • Key observations
          – Batch-oriented on HPC/HTC, shared storage, Linux, open source software
          – Computations are not so large, seldom multi-node
          – Storage is the biggest challenge. Projects do not end. Users do not clean up data. WGS projects are very large.
          – Many and inefficient users, lots of software (admin burden, support, education)
          – Free resources (no cost) do not promote efficient usage
      • Investment strategies
          – When investing in computational hardware, it takes a long time from funding decision until the resources are operational (10-12 months on average).
          – Expansion of resources is done at specific points in time, with low flexibility between these.
          – Decisions on resources are made by a national board with limited influence from life science scientists or platforms (Sweden)
  • 10. Why cloud in the life sciences?
      • Access to resources
          – Flexible configurations
          – On-demand, pay-as-you-go
      • Collaborate at an international level
          – Publish/federate data
          – E.g. large sequencing initiatives, “move compute to the data”
      • New types of analysis environments
          – Hadoop/Spark/Flink etc.
          – Microservices, Docker, Kubernetes, Mesos
  • 11. Using clouds in Bioinformatics
      How can we take advantage of cloud resources? Simplest example:
      • Start VM from (pre-made) VMI
      • Upload data
      • Perform scientific task
      • Download results
      • Terminate VM
      Easy to scale this up to using many instances! Or… is it?
      • What if I want to run 100 instances in parallel?
      • What if I want a new tool? Later versions?
      • Do I need to upload data every time?
  • 12. So we want to set up and use a virtual cluster
      • Multiple compute nodes
      • Network
      • Distributed storage
      • Firewall, DNS, reverse proxy, etc.
      So, we now have a virtual cluster. And now?
      • Batch-like system
          – Install a queueing system, e.g. SLURM
          – Install bioinformatics software
      • Big Data system
          – Install HDFS + Hadoop/Spark on the nodes
      • Container-based system
          – Install Docker and Kubernetes
      • Data
          – Ingress project data, possibly reference data
      (There are tools that can help automate some of these procedures.)
  • 13. Challenges with cloud
      • Tradition: Strong HPC tradition in academia
          – Sweden: Existing HPC resources funded by the Research Council, with personnel at 6 centers in Sweden (SNIC)
      • Economy: Cost model is new
          – Difficult to assess the costs
      • Data: How to work with large-scale data (TB/PB range)
      • Legal: Working with sensitive data
      • Educational: New technology for many
  • 14. Some SciLifeLab cloud options
  • 15. SNIC Cloud in Sweden
      ● Geographically distributed federated IaaS cloud based on 2nd generation HPC hardware
      ● Built using OpenStack
  • 16. Needs in bioinformatics
      • Primarily resources with a lot of RAM and storage (high I/O)
      • Preferably a transparent system; users don’t want to deal with e-infrastructure at all
      • How to work with storage (tiered?)
      • Is Best-Effort SLA enough?
  • 17. Virtual Machines and Containers
      • Virtual machines
          – Package entire systems (heavy)
          – Completely isolated
          – Suitable in cloud environments
      • Containers
          – Share OS
          – Smaller, faster, portable
          – Docker
  • 19. Microservices
      • Decompose functionality into smaller, loosely coupled, on-demand services communicating via an API
          – “Do one thing and do it well”
      • Services are easy to replace, language-agnostic
          – Minimize risk, maximize agility
          – Suitable for loosely coupled teams
          – Portable, easy to scale
          – Multiple services can be chained into larger tasks
      Software containers (e.g. Docker) are ideal for microservices!
  • 20. Orchestrating containers
      • Origin: Google
      • A declarative language for launching containers
      • Start, stop, update, and manage a cluster of machines running containers in a consistent and maintainable way
      • Suitable for microservices
      (Diagram labels: Containers; Scheduled and packed containers on nodes)
  • 21. Connecting the microservices
      • A suitable way of using containers is to connect them into a (scientific) workflow.
      • Tools like Pachyderm (http://pachyderm.io/), Luigi (https://github.com/spotify/luigi) and Galaxy (https://galaxyproject.org/) can assist with this.
      • Goal: Reproducible, fault-tolerant, scalable execution.
  • 22. Virtual Research Environments: VREs aim to bridge this gap! (Diagram labels: Researcher, Other researchers, Tools, Data)
  • 24. PhenoMeNal (http://phenomenal-h2020.eu/)
      • Horizon 2020 project, 2015-2018
      • Virtual Research Environments (VRE), Microservices, Workflows
      • Towards interoperable and scalable Metabolomics data analysis
      • Private environments for sensitive data
      (Diagram labels: DockerHub, GitHub, Virtual Infrastructure)
  • 25. PhenoMeNal approach and stack
      • Enable users to deploy their own virtual infrastructure on an IaaS provider
      • Containerize tools, orchestrate microservices with workflow systems on top of Kubernetes
      (Stack shown: KubeNow, Cloudflare, kubeadm, Terraform, kubectl, Packer)
  • 26. Users should not see this…
  • 27. Users should see this!
  • 29. Deployment on local clouds (Steffen Neumann, IPB Halle)
  • 30. Two on-premises deployments: MRC-NIHR Phenome Centre; Kultima group (www.caramba.clinic)
  • 31. Bring compute to the data
      • Moving data can be problematic
          – e.g. size, legal, resources, costs, time…
      • VRE encompasses all components necessary to carry out analysis
          – Launch near data
          – Re-use environment, or even a scientific workflow
      • Next step: Federate data, federate clouds
  • 32. Research focus in my group
      • e-Science methods development: Smart data management, predictive modeling
      • Applied e-Science research: Drug discovery and individualized diagnostics
      • e-infrastructure development: Automation, Big Data
  • 33. Data management and predictive modeling (diagram labels: Privacy preservation, Workflows, Big Data frameworks, Data federation, Compute federation)
  • 34. Selected research questions: How can we improve efficiency on shared HPC for data-intensive bioinformatics? Data locality? Outsourcing? (Martin Dahlö)
      [Figure: efficiency (%) by date, 2014-2017, in two panels, “NGS projects” and “Other projects”; annotation: “Efficiency feedback to users began”.]
      1. M. Dahlö, D. Schofield, W. Schaal and O. Spjuth, Tracking the NGS revolution: Usage and system support of bioinformatics projects on shared high-performance computing clusters. In preparation.
      2. O. Spjuth, E. Bongcam-Rudloff, J. Dahlberg, M. Dahlö, A. Kallio, L. Pireddu, F. Vezzi, and E. Korpelainen, Recommendations on e-infrastructures for next-generation sequencing. Gigascience, 2016, 5:26
      3. S. Lampa, M. Dahlö, P. I. Olason, J. Hagberg, and O. Spjuth, Lessons learned from implementing a national infrastructure in Sweden for storage and analysis of next-generation sequencing data. Gigascience, 2013, 2:9
  • 35. Selected research questions: Can Big Data frameworks aid data-intensive bioinformatics? (Laeeq, Valentin, Marco)
      • Efficient virtual screening with Apache Spark and machine learning
      • Hadoop pipeline scales better than HPC and is economical for current data sizes
      1. A. Siretskiy, L. Pireddu, T. Sundqvist, and O. Spjuth. A quantitative assessment of the Hadoop framework for analyzing massively parallel DNA sequencing data. Gigascience, 2015, 4:26
      2. L. Ahmed, A. Edlund, E. Laure, and O. Spjuth. Using Iterative MapReduce for Parallel Virtual Screening. 2013 IEEE 5th International Conference on Cloud Computing Technology and Science (CloudCom), vol. 2, pp. 27-32, 2013
      3. M. Capuccini, L. Carlsson, U. Norinder and O. Spjuth. Conformal Prediction in Spark: Large-Scale Machine Learning with Confidence. IEEE/ACM 2nd International Symposium on Big Data Computing (BDC), Limassol, 2015, pp. 61-67
      4. M. Capuccini, L. Ahmed, W. Schaal, E. Laure and O. Spjuth. Large-scale virtual screening on public cloud resources with Apache Spark. Journal of Cheminformatics, 2017, 9:15
  • 36. “EasyMapReduce: Leverage the power of Spark And Docker To scale scientific tools in MapReduce fashion” https://spark-summit.org/east-2017/events/easymapreduce-leverage-the-power-of-spark-and-docker-to-scale-scientific-tools-in-mapreduce-fashion/
  • 37. Selected research questions: How useful are Scientific Workflows in data-intensive research? (Samuel, Jon)
      • Streamline analysis on high-performance e-infrastructures
      • Support reproducible data analysis
      • Enable large-scale data analysis
      Tools: http://scipipe.org, https://github.com/pharmbio/sciluigi, http://pachyderm.io
      1. O. Spjuth et al. Experiences with workflows for automating data-intensive bioinformatics. Biology Direct, 2015, 10(1):43
      2. S. Lampa, J. Alvarsson and O. Spjuth. Towards agile large-scale predictive modelling in drug discovery with flow-based programming design principles. Journal of Cheminformatics, 2016, 8:67
  • 38. Selected research questions: How can we deploy smart, high-availability services with APIs? (Staffan, Jonathan, Arvid)
      http://www.openrisknet.org
      • Horizon 2020 project, 2017-2020
      • E-infrastructure for chemical safety assessment
      • Multi-tenant Virtual Environments, microservices
      • APIs, “semantic interoperability”
      • Academia and industry
      • Much focus on standardizing chemical data and predictive modeling
  • 39. Research questions around the corner
      • Public and private data sources are not static. How can we continuously improve predictive models as data changes?
      • We can generate too much data. Can predictive modeling aid data acquisition, storage and analysis?
  • 41. HASTE: Hierarchical Analysis of Spatial and TEmporal and image data
      From intelligent data acquisition via smart data management to confident predictions
      • PI, Aim 1: Carolina Wählby
      • Aim 2: Ola Spjuth
      • Aim 3: Andreas Hellander
      • 29 MSEK, 2017-2022
  • 42. Aim 2: Guiding data acquisition with machine learning
      • Training (training data): Can we use privileged information to improve machine learning models?
      • Online setting: Is something interesting happening? Can we assign valid probabilities for that? Can we make a valid ranking and guide data acquisition? Collect more data.
  • 43. Aim 3: Explore a hierarchical model based on Information Layers (diagram labels: Edge; Cloudlet, private cloud; Data warehouse, distributed storage)
  • 44. Acknowledgements
      • Wes Schaal, Jonathan Alvarsson, Staffan Arvidsson, Arvid Berg, Samuel Lampa, Marco Capuccini, Martin Dahlö, Valentin Georgiev, Anders Larsson, Polina Georgiev, Maris Lapins, Jon-Ander Novella
      • Lars Carlsson, Ernst Ahlberg, Ola Engqvist
      • SNIC Science Cloud: Andreas Hellander, Salman Toor
      • Caramba.clinic: Kim Kultima, Stephanie Herman, Payam Emami
  • 45. Research group website: http://pharmb.io Thank you
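
As a rough illustration of the workflow idea on slides 21 and 37, the sketch below chains a few steps into a dependency graph and runs them in order. It uses only the Python standard library and is not the Pachyderm, Luigi, or Galaxy API; the step names, the `WORKFLOW` table, and `run_workflow` are hypothetical, and each plain function stands in for what would be a containerized tool.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical steps: each plain function stands in for one containerized tool.
def fetch(inputs):
    return "AACGTGCT"

def reverse(inputs):
    return inputs["fetch"][::-1]

def gc_content(inputs):
    seq = inputs["fetch"]
    return sum(base in "GC" for base in seq) / len(seq)

def report(inputs):
    return f"reversed={inputs['reverse']} gc={inputs['gc_content']:.2f}"

# Step name -> (function, names of the steps it depends on).
WORKFLOW = {
    "fetch": (fetch, []),
    "reverse": (reverse, ["fetch"]),
    "gc_content": (gc_content, ["fetch"]),
    "report": (report, ["reverse", "gc_content"]),
}

def run_workflow(workflow, cache=None):
    """Run steps in dependency order, reusing results already in `cache`."""
    results = dict(cache or {})
    graph = {name: deps for name, (_, deps) in workflow.items()}
    executed = []
    for name in TopologicalSorter(graph).static_order():
        if name in results:
            continue  # result supplied by an earlier run, so skip the step
        func, deps = workflow[name]
        results[name] = func({dep: results[dep] for dep in deps})
        executed.append(name)
    return results, executed

results, executed = run_workflow(WORKFLOW)
print(results["report"])
```

Passing a `cache` with the results of steps that already finished makes a re-run skip them, which is a very simplified stand-in for the fault-tolerant, reproducible execution that real workflow systems aim for.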

Editor's notes

  1. Strategic funding to enable: infrastructure for high-throughput analysis; a multi-disciplinary research environment; competence in technology and analysis methodology.
  2. Drop applications into VMs running Docker in different clouds.
  3. How can we improve efficiency on shared HPC for data-intensive bioinformatics? Can Cloud Computing and Big Data frameworks aid data-intensive research? How useful are Scientific Workflows in data-intensive research? Can predictive modeling aid data acquisition, storage and analysis? How can we continuously improve predictive models as data changes?
  4. Making predictions with valid estimates of uncertainty; using privileged information in model training; deploying models efficiently on different e-infrastructures.
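
One way to obtain the valid uncertainty estimates mentioned in note 4 is conformal prediction, which the slides cite (Capuccini et al., 2015). The following is a minimal sketch of inductive conformal p-values and is not the group's implementation. It assumes nonconformity scores (higher means less typical) have already been computed by some underlying model; the function names, labels, and numbers are illustrative only.

```python
def conformal_p_value(calibration_scores, test_score):
    """Fraction of calibration scores at least as large as the test score,
    with the test example itself counted once (the +1 terms)."""
    n_at_least = sum(1 for s in calibration_scores if s >= test_score)
    return (n_at_least + 1) / (len(calibration_scores) + 1)

def prediction_set(calibration_scores, candidate_scores, significance):
    """Keep every candidate label whose p-value exceeds the significance level."""
    return {
        label
        for label, score in candidate_scores.items()
        if conformal_p_value(calibration_scores, score) > significance
    }

# Illustrative numbers: nonconformity scores from a held-out calibration set,
# and the score the test compound would get under each candidate label.
calibration = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
candidates = {"active": 0.25, "inactive": 0.95}

print(conformal_p_value(calibration, candidates["active"]))
print(prediction_set(calibration, candidates, significance=0.2))
```

Under the usual exchangeability assumption, a prediction set built this way excludes the true label with probability at most the chosen significance level, which is the sense in which the uncertainty estimate is valid.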