A talk given at the BioBankCloud conference in Feb 2015 about distributed computing in the contexts of genomics and health.
In this talk, we presented the results we obtained exploring the 1000 Genomes data using ADAM, followed by an introduction to our scalable GA4GH server implementation built with ADAM, Apache Spark, and Play Framework 2.
BioBankCloud: Machine Learning on Genomics + GA4GH @ Med at Scale
1. Scalable genomic data processing and interoperable systems with ADAM/Spark
Andy Petrella
Xavier Tordoir
2015-02-19
2. Lineup
Intro
● who we are
● we do distributed computing
Abstract
● Content: Distributed machine learning on genome data
● Distributed data and processing (S3, Spark, Tachyon)
● Distributed machine learning (MLlib, H2O)
● Spark Notebook
Context
● 1000 genomes in VCF
● Distributed genomic data in ADAM
● Size matters (VCF → ADAM + partitioned)
● Data available on S3 (s3://med-at-scale/1000genomes)
● Stratification
Procedure
● Deploy Spark on ec2
● Deploy Spark Notebook
● Load data
● Clean data
● Transform data
● Train KMeans
Results
● Prediction (confusion matrix)
● Performance
On the bench
● GA4GH compliant and scalable server
● Ad hoc analyses and sharing (through Tachyon)
3. Andy
@Noootsab, I am
@SparkNotebook creator
@Devoxx4Kids organizer
Maths & CS
Scalable systems
Machine learning
Med@Scale
Xavier
@xtordoir
Physics
Data analysis
Genomics
Distributed computing
4. Products (OSS)
● SparkNotebook
● GA4GH server
What we do
Distributed computing consultancy in
● Internet of Things
● Finance
● Geospatial
● Marketing
Training and coaching in
● Scala
● Spark
● Distributed architecture
● Distributed machine learning
Research and development
● Distributed machine learning models
● Genomics and health
5. Data: 1000genomes (Genotypes + Samples Population)
- A lot of data → a real scalability test
- Machine learning:
- Genotype inference
- Population classification (supervised learning)
- Population stratification (unsupervised learning)
Distributed Machine Learning on Genotypes
Data
6. The era of distributed computing
Strong Open Source ecosystem, Industrial developments and research
- Infrastructure can be elastic (e.g. EC2/S3)
- Data storage: HDFS (large blocks…), S3 (remote...)
- Processing: Beefed up MapReduce: Spark
- Escaping the IOPs: Tachyon in-memory filesystem
- Scheduling, HA (Mesos, Marathon)
Distributed Data Processing
9. Distributed Genomic Data
1000 genomes
1092 samples
43,372,735,220 genotypes
Original Data
Unpartitioned VCF files on FTP or S3: 152 GB (gzipped)
The VCF format is not easily parallelizable, and compression makes it worse
ADAM / med-at-scale
ADAM files on S3: 70.75 GB (Parquet, compressed)
9172 partitions (~7 MB each)
@see http://med-at-scale.s3.amazonaws.com/1000genomes/counts.html
Eggo project
https://github.com/bigdatagenomics/eggo
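To see why raw VCF is awkward to parallelize, note that one VCF line carries the calls for all samples at a site, while ADAM's schema flattens this into per-sample genotype records. A minimal, hypothetical sketch of that flattening in plain Scala (not ADAM's actual code; the `Genotype` case class and `parseVcfLine` helper are illustrative):

```scala
// One flattened genotype record, roughly in the spirit of ADAM's schema.
case class Genotype(sample: String, contig: String, pos: Long, call: String)

// Split a single VCF data line (tab-separated) into per-sample records.
// Genotype columns start at index 9; we keep only the GT subfield.
def parseVcfLine(line: String, samples: Seq[String]): Seq[Genotype] = {
  val fields = line.split("\t")
  fields.drop(9).zip(samples).map { case (fmt, sample) =>
    Genotype(sample, fields(0), fields(1).toLong, fmt.split(":")(0))
  }
}

val recs = parseVcfLine(
  "22\t16050075\t.\tA\tG\t.\tPASS\t.\tGT\t0|0\t0|1",
  Seq("HG00096", "HG00097"))
// recs: one record per sample, each independently storable and partitionable
```

Once flattened like this, records can be written as Parquet and split into many small partitions, which is what makes the 9172-partition layout above possible.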
10. Data
We have the 1000 genomes data, hence
- we have genotypes
- we have sample population labels
Exploration
We can cluster samples.
We can compare the clusters with the sample populations.
Model
We can run a simple stratification algorithm: K-Means.
Technology assessment
11. K-Means
MLlib provides K-Means (not hierarchical)
→ we limit the analysis to 3 populations
MLlib uses the Breeze linalg library
→ only the Euclidean metric (at the time)
[diagram] Genotype encoding against the reference allele A: AA → 0, AT → 1, TT → 2 (count of non-reference alleles)
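The encoding in the diagram (a genotype becomes the count of non-reference alleles) can be sketched in a few lines of Scala; `encodeGenotype` is a hypothetical helper, not part of MLlib or ADAM:

```scala
// Encode a genotype call as the number of non-reference alleles,
// matching the diagram: AA -> 0, AT -> 1, TT -> 2 (ref allele A).
def encodeGenotype(alleles: Seq[String], ref: String): Int =
  alleles.count(_ != ref)

val features = Seq(Seq("A", "A"), Seq("A", "T"), Seq("T", "T"))
  .map(encodeGenotype(_, "A"))
// features: Seq(0, 1, 2)
```

Each sample's vector of such counts over the selected variants is what gets fed to K-Means.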
12. Procedure
Spark on EC2 cluster
- spark-ec2 script
- 2 to 40 workers (13 GB RAM / 4 cores each)
- 10 to 40 minutes to launch
$ ./spark-ec2 launch
[diagram: driver with four workers]
13. Procedure
SparkNotebook on EC2 cluster
- access from your browser
- configure spark
- control computations on the cluster
14. Procedure
Load data
- Read ADAM data from S3 repo
- Read the samples populations
18. Results
~ 100,000 variants
Population  #0  #1  #2
GBR          0   0  89
ASW         54   0   7
CHB          0  97   0
The procedure reconstructs the
actual populations.
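Reading the confusion matrix above, the agreement between clusters and labelled populations can be checked with a few lines of Scala (the matrix values are copied from the slide; assigning each population to its dominant cluster is our own quick summary, not a slide metric):

```scala
// Rows: populations; columns: cluster ids #0, #1, #2 (from the slide).
val confusion = Map(
  "GBR" -> Seq(0, 0, 89),
  "ASW" -> Seq(54, 0, 7),
  "CHB" -> Seq(0, 97, 0))

// Assign each population to its dominant cluster and measure agreement.
val correct = confusion.values.map(_.max).sum // 89 + 54 + 97 = 240
val total   = confusion.values.map(_.sum).sum // 247 samples
val agreement = correct.toDouble / total      // ~0.97
```

The only confusion is the 7 ASW samples landing in the GBR-dominated cluster, consistent with admixture in that population.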
19. Results
Performance by cluster size
                              2 nodes   20 nodes (*)
Cluster launch                 10 min    30 min
Count chr22 genotypes (S3)      6 min    1.1 min
Save chr22 from S3 to HDFS     26 min    3.5 min
Count chr22 genotypes (HDFS)   10 min    1.4 min
(*) cluster size / number of partitions not optimal here: 80 cores / 114 partitions
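A quick back-of-the-envelope check on those numbers (times taken from the table above):

```scala
// Counting chr22 genotypes from S3: 6 min on 2 nodes vs 1.1 min on 20 nodes.
val speedup    = 6.0 / 1.1       // ~5.5x for 10x the nodes
val efficiency = speedup / 10.0  // ~0.55, consistent with the partitioning
                                 // caveat: 80 cores for only 114 partitions
```

With so few partitions per core, some executors sit idle, which is why the scaling is sub-linear here.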
22. On the bench
Global Alliance for Genomics and Health (GA4GH)
http://genomicsandhealth.org/
http://ga4gh.org/
- Framework for responsible data sharing
- Define schemas
- Define services for interoperability
28. Thank you
BioBankCloud, KTH (Jim Dowling)
UC Berkeley AMPLab, bdgenomics.org team (Frank Nothaft, Matt Massie)
Cloudera (Uri Laserson)
Hey…
Come back tomorrow morning → for demos
And afternoon → to hack on it!