Presentation at the de.NBI 2017 symposium “The Future Development of Bioinformatics in Germany and Europe” held at the Center for Interdisciplinary Research (ZiF) of Bielefeld University, October 23-25, 2017.
https://www.denbi.de/symposium2017
1. Data-intensive applications on cloud computing resources: Applications in life sciences
Ola Spjuth <ola.spjuth@farmbio.uu.se>
Department of Pharmaceutical Biosciences
and Science for Life Laboratory
Uppsala University
2. Today: We have access to high-throughput technologies to study biological phenomena
3. Science for Life Laboratory, Sweden
An internationally leading center that develops and applies large-scale technologies for molecular biosciences, with a focus on health and environment.
National platform since 2013, with nodes in Stockholm and Uppsala.
5. Massively parallel sequencing
2017: A human whole genome can be sequenced in 3 days for ~$1100, but the data requires supercomputers for analysis and storage.
2017: Illumina HiSeq X systems: 15K whole human genomes per year.
2016: NGI data velocity 950 Mbp/minute ≈ 16 Mbp/s.
8. Some statistics
[Charts: storage usage and projects at SNIC-UPPMAX, split into data-intensive bioinformatics vs. other disciplines, and support tickets]
9. NGS users
• Key observations
– Batch-oriented on HPC/HTC, shared storage, Linux, open source software
– Computations are not so large, seldom multi-node
– Storage is the biggest challenge: projects do not end, users do not clean up data, and WGS projects are very large
– Many and inefficient users, lots of software (admin burden, support, education)
– Free resources (no cost) do not promote efficient usage
• Investment strategies
– When investing in computational hardware, it takes a long time from funding decision until the resources are operational (10-12 months on average)
– Expansion of resources is done at specific points in time, with low flexibility in between
– Decisions on resources are made by a national board with limited influence from life science scientists or platforms (Sweden)
10. Why cloud in the life sciences?
• Access to resources
– Flexible configurations
– On-demand, pay-as-you-go
• Collaborate at an international level
– Publish/federate data
– E.g. large sequencing initiatives, “move compute to the data”
• New types of analysis environments
– Hadoop/Spark/Flink etc.
– Microservices, Docker, Kubernetes, Mesos
11. Using clouds in bioinformatics
How can we take advantage of cloud resources?
Simplest example:
• Start a VM from a (pre-made) VMI
• Upload data
• Perform the scientific task
• Download results
• Terminate the VM
Easy to scale this up to many instances!
Or… is it?
• What if I want to run 100 instances in parallel?
• What if I want a new tool, or later versions?
• Do I need to upload data every time?
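The five steps above can be sketched as shell commands assembled from Python. The image, flavor, host address and analysis script are placeholders, and the commands are only built, not executed, so the sketch can be read without an OpenStack account:

```python
# Sketch of the slide's five steps as assembled shell commands. All names
# (image, flavor, host, script) are placeholders; nothing is executed here.

def vm_lifecycle(image="bio-vmi", flavor="m1.large", name="analysis-vm",
                 host="10.0.0.5", data="reads.fastq", results="results.tar.gz"):
    """Return the command sequence: boot, upload, compute, fetch, terminate."""
    return [
        ["openstack", "server", "create", "--image", image,
         "--flavor", flavor, "--wait", name],                # start VM from VMI
        ["scp", data, f"cloud-user@{host}:~/"],              # upload data
        ["ssh", f"cloud-user@{host}", "./run_analysis.sh"],  # run the task (hypothetical script)
        ["scp", f"cloud-user@{host}:~/{results}", "."],      # download results
        ["openstack", "server", "delete", name],             # terminate VM
    ]

for cmd in vm_lifecycle():
    print(" ".join(cmd))
```

Running 100 instances in parallel then amounts to driving 100 such sequences, which is exactly where the manual approach stops scaling.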
12. So we want to set up and use a virtual cluster
• Multiple compute nodes
• Network
• Distributed storage
• Firewall, DNS, reverse proxy, etc.
So, we now have a virtual cluster. And now?
Batch-like system
– Install a queueing system, e.g. SLURM
– Install bioinformatics software
Big Data system
– Install HDFS + Hadoop/Spark on the nodes
Container-based system
– Install Docker and Kubernetes
Data
– Ingress project data, possibly reference data
(There are tools that can help automate some of these procedures.)
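For the batch-like option, usage on the virtual cluster then looks just like on a physical HPC system. A minimal SLURM batch script might look like this (tool, module and file names are illustrative, not from the talk):

```bash
#!/bin/bash
#SBATCH -J align-sample        # job name (illustrative)
#SBATCH -N 1                   # computations are seldom multi-node
#SBATCH -c 8                   # cores for the tool
#SBATCH -t 12:00:00
# Hypothetical tool and paths, shown only to illustrate the batch model.
module load bwa
bwa mem -t 8 ref.fa reads.fastq > aln.sam
```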
13. Challenges with cloud
• Tradition: Strong HPC tradition in academia
– Sweden: existing HPC resources funded by the Research Council, with personnel at 6 centers in Sweden (SNIC)
• Economy: The cost model is new
– Difficult to assess the costs
• Data: How to work with large-scale data (TB/PB range)
• Legal: Working with sensitive data
• Educational: New technology for many
15. SNIC Cloud in Sweden
● Geographically distributed federated IaaS cloud based on 2nd-generation HPC hardware
● Built using OpenStack
16. Needs in bioinformatics
• Primarily resources with a lot of RAM and storage (high I/O)
• Preferably a transparent system; users don’t want to deal with e-infrastructure at all
• How to work with storage (tiered?)
• Is a best-effort SLA enough?
17. Virtual machines and containers
Virtual machines:
• Package entire systems (heavy)
• Completely isolated
• Suitable in cloud environments
Containers:
• Share the host OS
• Smaller, faster, portable
• E.g. Docker
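As a sketch of the container side, a minimal Dockerfile packages one tool in a small image. The base image and the choice of samtools are illustrative, not from the talk:

```dockerfile
# Illustrative only: package a single bioinformatics tool in a small image.
FROM ubuntu:16.04
RUN apt-get update && \
    apt-get install -y --no-install-recommends samtools && \
    rm -rf /var/lib/apt/lists/*
ENTRYPOINT ["samtools"]
```

Because the image carries only the tool and its dependencies, not a whole guest OS, it starts in seconds and can be versioned and shared like code.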
19. Microservices
• Decompose functionality into smaller, loosely coupled, on-demand services communicating via an API
– “Do one thing and do it well”
• Services are easy to replace and language-agnostic
– Minimize risk, maximize agility
– Suitable for loosely coupled teams
– Portable, easy to scale
– Multiple services can be chained into larger tasks
Software containers (e.g. Docker) are ideal for microservices!
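As a toy illustration of “do one thing and do it well”, a single-endpoint service can be written in a few lines of pure Python. The base-counting service and its API are invented for illustration; any real service would live behind the same kind of small HTTP interface:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class BaseCountHandler(BaseHTTPRequestHandler):
    """Toy one-endpoint service: POST a DNA string, get base counts back."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        seq = self.rfile.read(length).decode()
        body = json.dumps({b: seq.count(b) for b in "ACGT"}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the sketch quiet

def make_service(port=0):
    """Bind the service; port 0 picks a free port."""
    return HTTPServer(("127.0.0.1", port), BaseCountHandler)
```

Because the contract is just HTTP and JSON, the service can be replaced by a faster implementation in any language without touching its consumers.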
20. Orchestrating containers with Kubernetes
• Origin: Google
• A declarative language for launching containers
• Start, stop, update, and manage a cluster of machines running containers in a consistent and maintainable way
• Suitable for microservices
[Figure: containers scheduled and packed onto cluster nodes]
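“Declarative” here means describing the desired state and letting the system converge to it. A minimal Deployment manifest might look like the following; the image name and resource numbers are placeholders, not from the talk:

```yaml
# Desired state: three replicas of one containerized tool, scheduled by Kubernetes.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: aligner
spec:
  replicas: 3
  selector:
    matchLabels:
      app: aligner
  template:
    metadata:
      labels:
        app: aligner
    spec:
      containers:
      - name: aligner
        image: example/aligner:1.0   # hypothetical image
        resources:
          requests:
            memory: "4Gi"            # bioinformatics tools are often RAM-hungry
            cpu: "1"
```

Changing `replicas` and re-applying the manifest is all it takes to scale; the scheduler packs the containers onto nodes.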
21. Connecting the microservices
• A suitable way of using containers is to connect them into a (scientific) workflow.
• Tools like Pachyderm (http://pachyderm.io/), Luigi (https://github.com/spotify/luigi) and Galaxy (https://galaxyproject.org/) can assist with this.
• Goal: Reproducible, fault-tolerant, scalable execution.
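The core idea these tools share can be sketched in a few lines of plain Python: each step declares its output and is skipped when that output already exists, so a failed run can simply be restarted. This illustrates the concept only; it is not the API of Pachyderm, Luigi, or Galaxy:

```python
import os
import tempfile

def step(output_path, run):
    """Run `run` only when the output is missing; finished steps are skipped,
    so a failed workflow can simply be restarted (crude fault tolerance)."""
    if not os.path.exists(output_path):
        run(output_path)
    return output_path

def write_reads(path):
    with open(path, "w") as f:
        f.write("ACGT\nGGTA\n")

def count_reads(path, upstream):
    with open(upstream) as f:
        n = sum(1 for _ in f)
    with open(path, "w") as f:
        f.write(str(n))

# A two-step workflow: produce reads, then count them.
workdir = tempfile.mkdtemp()
reads = step(os.path.join(workdir, "reads.txt"), write_reads)
count = step(os.path.join(workdir, "count.txt"), lambda p: count_reads(p, reads))
```

Real workflow systems add the rest: dependency graphs, scheduling across machines, and containerized steps.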
24. PhenoMeNal
• Horizon 2020 project, 2015-2018
• Virtual Research Environments (VRE), microservices, workflows
• Towards interoperable and scalable metabolomics data analysis
• Private environments for sensitive data
http://phenomenal-h2020.eu/
[Figure: GitHub, DockerHub, and virtual infrastructure components]
25. PhenoMeNal approach and stack
• Enable users to deploy their own virtual infrastructure on an IaaS provider
• Containerize tools, orchestrate microservices with workflow systems on top of Kubernetes
Stack: KubeNow, kubeadm, kubectl, Terraform, Packer, Cloudflare
31. Bring compute to the data
• Moving data can be problematic
– e.g. size, legal constraints, resources, costs, time…
• A VRE encompasses all components necessary to carry out the analysis
– Launch near the data
– Re-use the environment, or even a scientific workflow
• Next step: Federate data, federate clouds
32. Research focus in my group
• e-Science methods development: smart data management, predictive modeling
• Applied e-Science research: drug discovery and individualized diagnostics
• e-Infrastructure development: automation, Big Data
34. Selected research questions
How can we improve efficiency on shared HPC for data-intensive bioinformatics?
Data locality? Outsourcing?
[Figure: scatter plots of efficiency (%) over time, 2014-2017, for NGS projects vs. other projects; efficiency feedback to users began partway through the series]
1. M. Dahlö, D. Schofield, W. Schaal and O. Spjuth. Tracking the NGS revolution: Usage and system support of bioinformatics projects on shared high-performance computing clusters. In preparation.
2. O. Spjuth, E. Bongcam-Rudloff, J. Dahlberg, M. Dahlö, A. Kallio, L. Pireddu, F. Vezzi, and E. Korpelainen. Recommendations on e-infrastructures for next-generation sequencing. GigaScience, 2016, 5:26.
3. S. Lampa, M. Dahlö, P. I. Olason, J. Hagberg, and O. Spjuth. Lessons learned from implementing a national infrastructure in Sweden for storage and analysis of next-generation sequencing data. GigaScience, 2013, 2:9.
Martin Dahlö
35. Selected research questions
Can Big Data frameworks aid data-intensive bioinformatics?
• Efficient virtual screening with Apache Spark and machine learning
• A Hadoop pipeline scales better than HPC and is economical for current data sizes
1. A. Siretskiy, L. Pireddu, T. Sundqvist, and O. Spjuth. A quantitative assessment of the Hadoop framework for analyzing massively parallel DNA sequencing data. GigaScience, 2015, 4:26.
2. L. Ahmed, A. Edlund, E. Laure, and O. Spjuth. Using iterative MapReduce for parallel virtual screening. IEEE 5th International Conference on Cloud Computing Technology and Science (CloudCom), vol. 2, pp. 27-32, 2013.
3. M. Capuccini, L. Carlsson, U. Norinder and O. Spjuth. Conformal prediction in Spark: Large-scale machine learning with confidence. IEEE/ACM 2nd International Symposium on Big Data Computing (BDC), Limassol, 2015, pp. 61-67.
4. M. Capuccini, L. Ahmed, W. Schaal, E. Laure and O. Spjuth. Large-scale virtual screening on public cloud resources with Apache Spark. Journal of Cheminformatics, 2017, 9:15.
Laeeq, Valentin, Marco
36. “EasyMapReduce: Leverage the power of Spark and Docker to scale scientific tools in MapReduce fashion”
https://spark-summit.org/east-2017/events/easymapreduce-leverage-the-power-of-spark-and-docker-to-scale-scientific-tools-in-mapreduce-fashion/
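The pattern in the talk title can be illustrated without Spark or Docker: partition the input, run the same tool over every partition in parallel, and reduce the results. Here the “tool” is a plain function standing in for a `docker run` call on a Spark executor; the GC-counting task is invented for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

def run_tool(partition):
    """Stand-in for a containerized tool: in the real setting this would
    shell out to `docker run <tool-image>` on a Spark executor."""
    return sum(read.count("G") + read.count("C") for read in partition)

def map_reduce(reads, n_partitions=4):
    """Partition the input, map the tool over partitions in parallel, reduce."""
    size = max(1, len(reads) // n_partitions)
    partitions = [reads[i:i + size] for i in range(0, len(reads), size)]
    with ThreadPoolExecutor() as pool:
        return sum(pool.map(run_tool, partitions))
```

The appeal of the approach is that the tool itself needs no modification: the container wraps it unchanged, and the framework handles partitioning and parallelism.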
37. Selected research questions
How useful are scientific workflows in data-intensive research?
• Streamline analysis on high-performance e-infrastructures
• Support reproducible data analysis
• Enable large-scale data analysis
http://scipipe.org
https://github.com/pharmbio/sciluigi
http://pachyderm.io
O. Spjuth et al. Experiences with workflows for automating data-intensive bioinformatics. Biology Direct, 2015, 10:43.
S. Lampa, J. Alvarsson and O. Spjuth. Towards agile large-scale predictive modelling in drug discovery with flow-based programming design principles. Journal of Cheminformatics, 2016, 8:67.
Samuel, Jon
38. Selected research questions
How can we deploy smart, high-availability services with APIs?
http://www.openrisknet.org
• Horizon 2020 project, 2017-2020
• e-Infrastructure for chemical safety assessment
• Multi-tenant virtual environments, microservices
• APIs, “semantic interoperability”
• Academia and industry
• Much focus on standardizing chemical data and predictive modeling
Staffan, Jonathan, Arvid
39. Research questions around the corner
• Public and private data sources are not static. How can we continuously improve predictive models as data changes?
• We can generate too much data. Can predictive modeling aid data acquisition, storage and analysis?
41. HASTE
Hierarchical Analysis of Spatial and TEmporal image data
From intelligent data acquisition via smart data management to confident predictions
PI, Aim 1: Carolina Wählby; Aim 2: Ola Spjuth; Aim 3: Andreas Hellander
29 MSEK, 2017-2022
42. Aim 2: Guiding data acquisition with machine learning
[Figure: pipeline from training data through a trained model to an online setting]
• Training: Can we use privileged information to improve machine learning models?
• Online setting: Is something interesting happening? Can we assign valid probabilities for that?
• Can we make a valid ranking, guide data acquisition, and collect more data?
43. Aim 3: Explore a hierarchical model based on information layers
• Edge
• Cloudlet, private cloud
• Data warehouse, distributed storage
44. Acknowledgements
Wes Schaal
Jonathan Alvarsson
Staffan Arvidsson
Arvid Berg
Samuel Lampa
Marco Capuccini
Martin Dahlö
Valentin Georgiev
Anders Larsson
Polina Georgiev
Maris Lapins
Jon-Ander Novella
Lars Carlsson
Ernst Ahlberg
Ola Engqvist
SNIC Science Cloud
Andreas Hellander
Salman Toor
Caramba.clinic
Kim Kultima
Stephanie Herman
Payam Emami
Strategic funding to enable:
Infrastructure for high-throughput analysis
Multi-disciplinary research environment
Competence in technology and analysis methodology
• Drop applications into VMs running Docker in different clouds
• Making predictions with valid estimates of uncertainty
• Using privileged information in model training
• Deploying models efficiently on different e-infrastructures