Scalable genomic data processing and
interoperable systems with ADAM/Spark
Andy Petrella
Xavier Tordoir
2015-02-19
Lineup
Intro
● who we are
● we do distributed computing
Abstract
● Content: distributed machine learning on genomic data
● Distributed data and processing (S3, Spark, Tachyon)
● Distributed machine learning (MLlib, H2O)
● Spark Notebook
Context
● 1000 genomes in VCF
● Distributed genomic data in ADAM
● Size matters (VCF → ADAM + partitioned)
● Data available on S3 (s3://med-at-scale/1000genomes)
● Stratification
Procedure
● Deploy Spark on EC2
● Deploy Spark Notebook
● Load data
● Clean data
● Transform data
● Train KMeans
Results
● Prediction (confusion matrix)
● Performance
On the bench
● GA4GH compliant and scalable server
● Ad hoc analyses and sharing (through Tachyon)
Andy
@Noootsab, I am
@SparkNotebook creator
@Devoxx4Kids organizer
Maths & CS
Scalable systems
Machine learning
Med@Scale
Xavier
@xtordoir
Physics
Data analysis
Genomics
Distributed computing
Products (OSS)
● SparkNotebook
● GA4GH server
What do we do?
Distributed computing consultancy in
● Internet of Things
● Finance
● Geospatial
● Marketing
Training and coaching in
● Scala
● Spark
● Distributed architecture
● Distributed machine learning
Research and development
● Distributed machine learning models
● Genomics and health
Data: 1000genomes (Genotypes + Samples Population)
- A large dataset → a real scalability test
- Machine learning:
- Genotype inference
- Population classification (supervised learning)
- Population stratification (unsupervised learning)
Distributed Machine Learning on Genotypes
Data
The era of distributed computing
Strong Open Source ecosystem, Industrial developments and research
- Infrastructure can be elastic (e.g. EC2/S3)
- Data storage: HDFS (large blocks…), S3 (remote...)
- Processing: Beefed up MapReduce: Spark
- Escaping the IOPs: Tachyon in-memory filesystem
- Scheduling, HA (Mesos, Marathon)
Distributed Data Processing
Berkeley Data Analytics Stack (BDAS)
Distributed Data Processing
SparkNotebook
Interactive distributed computing
Dev' time
Distributed Genomic Data
1000 genomes
1092 samples
43,372,735,220 genotypes
Original Data
Unpartitioned VCF files on FTP or S3: 152 GB (gzipped)
The VCF format is not easily parallelizable, and compression makes it even worse
ADAM / med-at-scale
ADAM files on S3: 70.75 GB (Parquet, compressed)
9172 partitions (7 MB each)
@see http://med-at-scale.s3.amazonaws.com/1000genomes/counts.html
Eggo project
https://github.com/bigdatagenomics/eggo
Data
We have the 1000 genomes data, hence
- we have genotypes
- we have sample population labels
Exploration
We can cluster samples.
We can compare the clusters with the sample populations.
Model
We can run a simple stratification algorithm: K-Means.
Technology assessment
K-Means
MLlib provides K-Means (not hierarchical)
→ limit to 3 populations
MLlib uses the Breeze linear algebra library
→ Only the Euclidean metric (at the time)
Genotype encoding (reference allele A): AA → 0, AT → 1, TT → 2
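The encoding on this slide can be read as counting the alleles that differ from the reference allele. A minimal sketch in plain Python, assuming a genotype is given as a two-character string; the function name `encode_genotype` is hypothetical and not taken from ADAM or MLlib:

```python
def encode_genotype(alleles, ref):
    """Return the number of alleles that differ from the reference allele."""
    return sum(1 for allele in alleles if allele != ref)


# The three examples from the slide, with reference allele "A".
for genotype in ("AA", "AT", "TT"):
    print(genotype, "->", encode_genotype(genotype, "A"))
```

With this encoding every genotype becomes a number in {0, 1, 2}, which is what lets a sample be treated as a numeric vector later on.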
Procedure
Spark on EC2 cluster
- spark-ec2 script
- 2 to 40 workers (13 GB / 4 cores each)
- 10 to 40 minutes to launch
(Diagram: one driver, four workers)
$ ./spark-ec2 launch
Procedure
SparkNotebook on EC2 cluster
- access from your browser
- configure spark
- control computations on the cluster
Procedure
Load data
- Read ADAM data from the S3 repo
- Read the sample population labels
Procedure
Filter and clean data
- Sample: chromosome slice (chr22), 3 populations (GBR, ASW, CHB)
- Missing genotypes (remove incomplete variants)
          Variant1  Variant2  Variant3  Variant4  Variant5  Variant6  Variant7
Sample1   0         0         1         0         1         0         1
Sample2   2         NA        1         2         1         0         0
Sample3   2         0         1         2         2         0         2
Sample4   1         1         0         0         0         NA        0
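The "remove incomplete variants" step can be illustrated on the small matrix above. This is a plain-Python sketch, not the Spark code used in the talk; `None` stands in for NA and the helper name `complete_variant_indices` is hypothetical:

```python
# Genotype matrix from the slide: one row per sample, one column per variant.
variants = ["Variant1", "Variant2", "Variant3", "Variant4",
            "Variant5", "Variant6", "Variant7"]
samples = {
    "Sample1": [0, 0, 1, 0, 1, 0, 1],
    "Sample2": [2, None, 1, 2, 1, 0, 0],
    "Sample3": [2, 0, 1, 2, 2, 0, 2],
    "Sample4": [1, 1, 0, 0, 0, None, 0],
}


def complete_variant_indices(matrix):
    """Return the column indices where every sample has a genotype."""
    rows = list(matrix.values())
    return [j for j in range(len(rows[0]))
            if all(row[j] is not None for row in rows)]


keep = complete_variant_indices(samples)
kept_names = [variants[j] for j in keep]
cleaned = {name: [row[j] for j in keep] for name, row in samples.items()}

print(kept_names)
print(cleaned["Sample2"])
```

Variant2 and Variant6 are dropped because one sample is missing a genotype for each of them; the remaining five columns are complete for all four samples.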
Procedure
Transform data
- Flat Genotype collection → Sample collection
- Each Sample is a Vector of Genotypes (0, 1, 2)
- Vector is ordered consistently
Genotype: Variant, Sample (ID), Alleles
Sample: Sample (ID), Vector[Genotype]
Vector[Variant]
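The transformation from a flat genotype collection to one vector per sample might look like the following plain-Python sketch. The record layout `(sample_id, variant_id, value)` and the function name `to_sample_vectors` are assumptions made for illustration; the talk itself works on ADAM records in Spark.

```python
from collections import defaultdict

# Hypothetical flat records: (sample_id, variant_id, encoded genotype).
flat = [
    ("S2", "v2", 1), ("S1", "v1", 0), ("S1", "v2", 2),
    ("S2", "v1", 1), ("S1", "v3", 1), ("S2", "v3", 0),
]


def to_sample_vectors(records):
    """Group genotypes by sample and order every vector by variant id."""
    variant_order = sorted({variant for _, variant, _ in records})
    by_sample = defaultdict(dict)
    for sample_id, variant_id, value in records:
        by_sample[sample_id][variant_id] = value
    vectors = {sample: [calls[v] for v in variant_order]
               for sample, calls in by_sample.items()}
    return variant_order, vectors


variant_order, vectors = to_sample_vectors(flat)
print(variant_order)
print(vectors)
```

Using one shared `variant_order` for all samples is what keeps the vectors "ordered consistently": position i refers to the same variant in every vector, whatever order the flat records arrive in.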
Procedure
Train K-Means
- 10 iterations
- 3 clusters
Input: one Vector per Sample (Sample (ID), Vector[Genotype])
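The slides train MLlib's K-Means on the cluster. To show what such a training run does, here is a minimal plain-Python sketch of Lloyd's algorithm with the Euclidean metric, 3 clusters and 10 iterations. It is not MLlib's implementation; the toy vectors and the choice of initial centers are assumptions made so the example is deterministic.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centers):
    """Index of the center closest to the point."""
    return min(range(len(centers)),
               key=lambda k: squared_distance(point, centers[k]))


def kmeans(points, centers, iterations=10):
    """Lloyd's algorithm: assign points, then move centers to the means."""
    centers = [list(c) for c in centers]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            groups[nearest(p, centers)].append(p)
        for k, members in enumerate(groups):
            if members:  # an empty cluster keeps its previous center
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return centers


# Hypothetical genotype vectors (values in 0, 1, 2), two samples per group.
points = [[0, 0, 0, 0], [0, 1, 0, 0],
          [2, 2, 0, 0], [2, 2, 1, 0],
          [0, 0, 2, 2], [0, 0, 2, 1]]
centers = kmeans(points, centers=[points[0], points[2], points[4]], iterations=10)
assignments = [nearest(p, centers) for p in points]
print(assignments)
```

Each sample ends up with a cluster index, which is the value compared against the known population labels in the results.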
Results
~ 100,000 variants
      #0   #1   #2
GBR    0    0   89
ASW   54    0    7
CHB    0   97    0
The procedure reconstructs the
actual populations.
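A confusion matrix like the one above can be built by counting (population, cluster) pairs. The sketch below is plain Python with a hypothetical helper `confusion_matrix`; it also summarises the counts reported on the slide as a purity score, which is a metric chosen here for illustration and not one the slides report.

```python
from collections import Counter


def confusion_matrix(labels, clusters):
    """Count how many samples of each population land in each cluster."""
    return Counter(zip(labels, clusters))


# Hypothetical usage with three samples.
print(confusion_matrix(["GBR", "GBR", "CHB"], [2, 2, 1]))

# Counts reported on the slide: population -> samples in clusters #0, #1, #2.
reported = {"GBR": [0, 0, 89], "ASW": [54, 0, 7], "CHB": [0, 97, 0]}

total = sum(sum(row) for row in reported.values())
# Purity: each cluster is credited with its most frequent population.
majority = sum(max(column) for column in zip(*reported.values()))
purity = majority / total
print(f"{majority} of {total} samples match their cluster's majority population ({purity:.1%})")
```

On the reported counts, the only samples outside their cluster's majority population are the 7 ASW samples grouped with GBR in cluster #2.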
Results
Performance (cluster size)
                               2 nodes   20 nodes (*)
Cluster launch                 10 min    30.0 min
Count chr22 genotypes (S3)     6 min     1.1 min
Save chr22 from S3 to HDFS     26 min    3.5 min
Count chr22 genotypes (HDFS)   10 min    1.4 min
(*) Cluster size / number of partitions not optimal here: 80 cores / 114 partitions
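The timings above can be turned into speedup and parallel efficiency figures, and the footnote about 80 cores and 114 partitions can be made concrete. This is a small arithmetic sketch in plain Python; it assumes one task per partition, one task per core at a time, and tasks of equal duration, none of which the slides state.

```python
import math

# Timings in minutes from the table: (2 nodes, 20 nodes).
timings = {
    "Count chr22 genotypes (S3)": (6.0, 1.1),
    "Save chr22 from S3 to HDFS": (26.0, 3.5),
    "Count chr22 genotypes (HDFS)": (10.0, 1.4),
}
node_ratio = 20 / 2

for task, (t2, t20) in timings.items():
    speedup = t2 / t20
    efficiency = speedup / node_ratio
    print(f"{task}: speedup {speedup:.1f}x, efficiency {efficiency:.0%}")

# 114 partitions on 80 cores need two waves of tasks, and the second
# wave keeps only 34 of the 80 cores busy.
cores, partitions = 80, 114
waves = math.ceil(partitions / cores)
utilization = partitions / (waves * cores)
print(f"{waves} waves, {utilization:.0%} of core slots used")
```

Under these assumptions, a partition count that is a multiple of the core count would avoid the partly idle second wave.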
Results
Performance (cluster size)
121,023 variants         2 nodes   20 nodes
Missing data (collect)   7.8 min   33 sec
Train (10 iter)          2.1 min   28 sec
Predict (collect)        8 sec     2 sec
Results
Performance, 20 nodes (data size)
                         121,023 variants   491,222 variants
Missing data (collect)   33 sec             3.7 min
Train (10 iter)          28 sec             1.6 min
Predict (collect)        2 sec              25 sec
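To see how each step scales with data size, the growth in run time can be compared with the growth in the number of variants. A short plain-Python sketch using only the figures from the table; the variable names are illustrative:

```python
small, large = 121_023, 491_222

# Timings in seconds on 20 nodes: (121,023 variants, 491,222 variants).
steps = {
    "Missing data (collect)": (33, 3.7 * 60),
    "Train (10 iter)": (28, 1.6 * 60),
    "Predict (collect)": (2, 25),
}

data_ratio = large / small
print(f"Data grows by {data_ratio:.2f}x")

for step, (t_small, t_large) in steps.items():
    time_ratio = t_large / t_small
    # A time ratio above the data ratio means worse than linear scaling.
    trend = "worse than linear" if time_ratio > data_ratio else "linear or better"
    print(f"{step}: time grows by {time_ratio:.1f}x ({trend})")
```

On these numbers only the training step grows more slowly than the data; the two steps that collect results to the driver grow faster.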
On the bench
Global Alliance for Genomics and Health (GA4GH)
http://genomicsandhealth.org/
http://ga4gh.org/
- Framework for responsible data sharing
- Define schemas
- Define services for interoperability
On the bench
GA4GH schemas
On the bench
GA4GH google implementation
On the bench
GA4GH compliant
& scalable server
Open source and available on GitHub,
https://github.com/med-at-scale/high-health
PRs are welcome!
On the bench
Methods grouped in micro services
GA4GH & Custom methods
Thank you
Biobankcloud, KTH (Jim Dowling)
UC Berkeley AMPLab, bdgenomics.org team (Frank Nothaft, Matt Massie)
Cloudera (Uri Laserson)
Hey…
Come back tomorrow morning → for demos
And afternoon → to hack on it!

Contenu connexe

Tendances

Fast Variant Calling with ADAM and avocado
Fast Variant Calling with ADAM and avocadoFast Variant Calling with ADAM and avocado
Fast Variant Calling with ADAM and avocadofnothaft
 
Spark meetup london share and analyse genomic data at scale with spark, adam...
Spark meetup london  share and analyse genomic data at scale with spark, adam...Spark meetup london  share and analyse genomic data at scale with spark, adam...
Spark meetup london share and analyse genomic data at scale with spark, adam...Andy Petrella
 
Scaling up genomic analysis with ADAM
Scaling up genomic analysis with ADAMScaling up genomic analysis with ADAM
Scaling up genomic analysis with ADAMfnothaft
 
Challenges and Opportunities of Big Data Genomics
Challenges and Opportunities of Big Data GenomicsChallenges and Opportunities of Big Data Genomics
Challenges and Opportunities of Big Data GenomicsYasin Memari
 
ADAM—Spark Summit, 2014
ADAM—Spark Summit, 2014ADAM—Spark Summit, 2014
ADAM—Spark Summit, 2014fnothaft
 
Spark Summit EU talk by Erwin Datema and Roeland van Ham
Spark Summit EU talk by Erwin Datema and Roeland van HamSpark Summit EU talk by Erwin Datema and Roeland van Ham
Spark Summit EU talk by Erwin Datema and Roeland van HamSpark Summit
 
VariantSpark: applying Spark-based machine learning methods to genomic inform...
VariantSpark: applying Spark-based machine learning methods to genomic inform...VariantSpark: applying Spark-based machine learning methods to genomic inform...
VariantSpark: applying Spark-based machine learning methods to genomic inform...Denis C. Bauer
 
H2O World - Sparkling water on the Spark Notebook: Interactive Genomes Clust...
H2O World -  Sparkling water on the Spark Notebook: Interactive Genomes Clust...H2O World -  Sparkling water on the Spark Notebook: Interactive Genomes Clust...
H2O World - Sparkling water on the Spark Notebook: Interactive Genomes Clust...Sri Ambati
 
Extreme Scripting July 2009
Extreme Scripting July 2009Extreme Scripting July 2009
Extreme Scripting July 2009Ian Foster
 
Hadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant StoreHadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant StoreUri Laserson
 
Scaling Genetic Data Analysis with Apache Spark with Jon Bloom and Tim Poterba
Scaling Genetic Data Analysis with Apache Spark with Jon Bloom and Tim PoterbaScaling Genetic Data Analysis with Apache Spark with Jon Bloom and Tim Poterba
Scaling Genetic Data Analysis with Apache Spark with Jon Bloom and Tim PoterbaDatabricks
 
Exploring Spark for Scalable Metagenomics Analysis: Spark Summit East talk by...
Exploring Spark for Scalable Metagenomics Analysis: Spark Summit East talk by...Exploring Spark for Scalable Metagenomics Analysis: Spark Summit East talk by...
Exploring Spark for Scalable Metagenomics Analysis: Spark Summit East talk by...Spark Summit
 
Big Data Science with H2O in R
Big Data Science with H2O in RBig Data Science with H2O in R
Big Data Science with H2O in RAnqi Fu
 
Population-scale high-throughput sequencing data analysis
Population-scale high-throughput sequencing data analysisPopulation-scale high-throughput sequencing data analysis
Population-scale high-throughput sequencing data analysisDenis C. Bauer
 
Distributed GLM with H2O - Atlanta Meetup
Distributed GLM with H2O - Atlanta MeetupDistributed GLM with H2O - Atlanta Meetup
Distributed GLM with H2O - Atlanta MeetupSri Ambati
 
Managing Genomics Data at the Sanger Institute
Managing Genomics Data at the Sanger InstituteManaging Genomics Data at the Sanger Institute
Managing Genomics Data at the Sanger Instituteinside-BigData.com
 
Share and analyze geonomic data at scale by Andy Petrella and Xavier Tordoir
Share and analyze geonomic data at scale by Andy Petrella and Xavier TordoirShare and analyze geonomic data at scale by Andy Petrella and Xavier Tordoir
Share and analyze geonomic data at scale by Andy Petrella and Xavier TordoirSpark Summit
 
Next generation genomics: Petascale data in the life sciences
Next generation genomics: Petascale data in the life sciencesNext generation genomics: Petascale data in the life sciences
Next generation genomics: Petascale data in the life sciencesGuy Coates
 

Tendances (20)

Fast Variant Calling with ADAM and avocado
Fast Variant Calling with ADAM and avocadoFast Variant Calling with ADAM and avocado
Fast Variant Calling with ADAM and avocado
 
Spark meetup london share and analyse genomic data at scale with spark, adam...
Spark meetup london  share and analyse genomic data at scale with spark, adam...Spark meetup london  share and analyse genomic data at scale with spark, adam...
Spark meetup london share and analyse genomic data at scale with spark, adam...
 
Scaling up genomic analysis with ADAM
Scaling up genomic analysis with ADAMScaling up genomic analysis with ADAM
Scaling up genomic analysis with ADAM
 
Challenges and Opportunities of Big Data Genomics
Challenges and Opportunities of Big Data GenomicsChallenges and Opportunities of Big Data Genomics
Challenges and Opportunities of Big Data Genomics
 
ADAM—Spark Summit, 2014
ADAM—Spark Summit, 2014ADAM—Spark Summit, 2014
ADAM—Spark Summit, 2014
 
Spark Summit EU talk by Erwin Datema and Roeland van Ham
Spark Summit EU talk by Erwin Datema and Roeland van HamSpark Summit EU talk by Erwin Datema and Roeland van Ham
Spark Summit EU talk by Erwin Datema and Roeland van Ham
 
Genome Big Data
Genome Big DataGenome Big Data
Genome Big Data
 
VariantSpark: applying Spark-based machine learning methods to genomic inform...
VariantSpark: applying Spark-based machine learning methods to genomic inform...VariantSpark: applying Spark-based machine learning methods to genomic inform...
VariantSpark: applying Spark-based machine learning methods to genomic inform...
 
H2O World - Sparkling water on the Spark Notebook: Interactive Genomes Clust...
H2O World -  Sparkling water on the Spark Notebook: Interactive Genomes Clust...H2O World -  Sparkling water on the Spark Notebook: Interactive Genomes Clust...
H2O World - Sparkling water on the Spark Notebook: Interactive Genomes Clust...
 
Extreme Scripting July 2009
Extreme Scripting July 2009Extreme Scripting July 2009
Extreme Scripting July 2009
 
Hadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant StoreHadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant Store
 
Scaling Genetic Data Analysis with Apache Spark with Jon Bloom and Tim Poterba
Scaling Genetic Data Analysis with Apache Spark with Jon Bloom and Tim PoterbaScaling Genetic Data Analysis with Apache Spark with Jon Bloom and Tim Poterba
Scaling Genetic Data Analysis with Apache Spark with Jon Bloom and Tim Poterba
 
Spark Summit East 2015
Spark Summit East 2015Spark Summit East 2015
Spark Summit East 2015
 
Exploring Spark for Scalable Metagenomics Analysis: Spark Summit East talk by...
Exploring Spark for Scalable Metagenomics Analysis: Spark Summit East talk by...Exploring Spark for Scalable Metagenomics Analysis: Spark Summit East talk by...
Exploring Spark for Scalable Metagenomics Analysis: Spark Summit East talk by...
 
Big Data Science with H2O in R
Big Data Science with H2O in RBig Data Science with H2O in R
Big Data Science with H2O in R
 
Population-scale high-throughput sequencing data analysis
Population-scale high-throughput sequencing data analysisPopulation-scale high-throughput sequencing data analysis
Population-scale high-throughput sequencing data analysis
 
Distributed GLM with H2O - Atlanta Meetup
Distributed GLM with H2O - Atlanta MeetupDistributed GLM with H2O - Atlanta Meetup
Distributed GLM with H2O - Atlanta Meetup
 
Managing Genomics Data at the Sanger Institute
Managing Genomics Data at the Sanger InstituteManaging Genomics Data at the Sanger Institute
Managing Genomics Data at the Sanger Institute
 
Share and analyze geonomic data at scale by Andy Petrella and Xavier Tordoir
Share and analyze geonomic data at scale by Andy Petrella and Xavier TordoirShare and analyze geonomic data at scale by Andy Petrella and Xavier Tordoir
Share and analyze geonomic data at scale by Andy Petrella and Xavier Tordoir
 
Next generation genomics: Petascale data in the life sciences
Next generation genomics: Petascale data in the life sciencesNext generation genomics: Petascale data in the life sciences
Next generation genomics: Petascale data in the life sciences
 

En vedette

How to Improve Your Website
How to Improve Your WebsiteHow to Improve Your Website
How to Improve Your WebsiteBizSmart Select
 
Parvat Pradesh Mein Pavas
Parvat Pradesh Mein PavasParvat Pradesh Mein Pavas
Parvat Pradesh Mein Pavaszainul2002
 
Understanding the Big Picture of e-Science
Understanding the Big Picture of e-ScienceUnderstanding the Big Picture of e-Science
Understanding the Big Picture of e-ScienceAndrew Sallans
 
Guía taller 2 a padres de familia ie medellin
Guía taller 2 a padres de familia ie medellinGuía taller 2 a padres de familia ie medellin
Guía taller 2 a padres de familia ie medellinCarlos Ríos Lemos
 
Alfred day hershy
Alfred day hershyAlfred day hershy
Alfred day hershykimmygee_
 
5 of the Biggest Myths about Growing Your Business
5 of the Biggest Myths about Growing Your Business5 of the Biggest Myths about Growing Your Business
5 of the Biggest Myths about Growing Your BusinessVolaris Group
 
Our changing state: the realities of austerity and devolution
Our changing state: the realities of austerity and devolutionOur changing state: the realities of austerity and devolution
Our changing state: the realities of austerity and devolutionBrowne Jacobson LLP
 
LA Chef for OpenStack Hackday
LA Chef for OpenStack HackdayLA Chef for OpenStack Hackday
LA Chef for OpenStack HackdayMatt Ray
 
Navigating Uncertainty when Launching New Ideas
Navigating Uncertainty when Launching New IdeasNavigating Uncertainty when Launching New Ideas
Navigating Uncertainty when Launching New Ideashopperomatic
 
De la aldea a los recintos ceremoniales en la sociedad andina del periodo ini...
De la aldea a los recintos ceremoniales en la sociedad andina del periodo ini...De la aldea a los recintos ceremoniales en la sociedad andina del periodo ini...
De la aldea a los recintos ceremoniales en la sociedad andina del periodo ini...Gusstock Concha Flores
 
Brief Encounter: London Zoo
Brief Encounter: London ZooBrief Encounter: London Zoo
Brief Encounter: London ZooEarnest
 
Icsi transformation 11-13 sept - agra
Icsi transformation   11-13 sept - agraIcsi transformation   11-13 sept - agra
Icsi transformation 11-13 sept - agraPavan Kumar Vijay
 
F.Blin IFLA Trend Report English_dk
F.Blin IFLA Trend Report English_dkF.Blin IFLA Trend Report English_dk
F.Blin IFLA Trend Report English_dkFrederic Blin
 
The Clientshare Academy Briefing - Gold Membership - by Practice Paradox
The Clientshare Academy Briefing - Gold Membership - by Practice ParadoxThe Clientshare Academy Briefing - Gold Membership - by Practice Paradox
The Clientshare Academy Briefing - Gold Membership - by Practice ParadoxPractice Paradox
 
Start Writing Groovy
Start Writing GroovyStart Writing Groovy
Start Writing GroovyEvgeny Goldin
 
Créer et afficher une tag list sur scoop.it
Créer et afficher une tag list sur scoop.itCréer et afficher une tag list sur scoop.it
Créer et afficher une tag list sur scoop.itThierry Zenou
 

En vedette (20)

How to Improve Your Website
How to Improve Your WebsiteHow to Improve Your Website
How to Improve Your Website
 
Parvat Pradesh Mein Pavas
Parvat Pradesh Mein PavasParvat Pradesh Mein Pavas
Parvat Pradesh Mein Pavas
 
Understanding the Big Picture of e-Science
Understanding the Big Picture of e-ScienceUnderstanding the Big Picture of e-Science
Understanding the Big Picture of e-Science
 
Guía taller 2 a padres de familia ie medellin
Guía taller 2 a padres de familia ie medellinGuía taller 2 a padres de familia ie medellin
Guía taller 2 a padres de familia ie medellin
 
Alfred day hershy
Alfred day hershyAlfred day hershy
Alfred day hershy
 
5 of the Biggest Myths about Growing Your Business
5 of the Biggest Myths about Growing Your Business5 of the Biggest Myths about Growing Your Business
5 of the Biggest Myths about Growing Your Business
 
Our changing state: the realities of austerity and devolution
Our changing state: the realities of austerity and devolutionOur changing state: the realities of austerity and devolution
Our changing state: the realities of austerity and devolution
 
LA Chef for OpenStack Hackday
LA Chef for OpenStack HackdayLA Chef for OpenStack Hackday
LA Chef for OpenStack Hackday
 
Navigating Uncertainty when Launching New Ideas
Navigating Uncertainty when Launching New IdeasNavigating Uncertainty when Launching New Ideas
Navigating Uncertainty when Launching New Ideas
 
De la aldea a los recintos ceremoniales en la sociedad andina del periodo ini...
De la aldea a los recintos ceremoniales en la sociedad andina del periodo ini...De la aldea a los recintos ceremoniales en la sociedad andina del periodo ini...
De la aldea a los recintos ceremoniales en la sociedad andina del periodo ini...
 
Brief Encounter: London Zoo
Brief Encounter: London ZooBrief Encounter: London Zoo
Brief Encounter: London Zoo
 
Icsi transformation 11-13 sept - agra
Icsi transformation   11-13 sept - agraIcsi transformation   11-13 sept - agra
Icsi transformation 11-13 sept - agra
 
Italy weddings
Italy weddingsItaly weddings
Italy weddings
 
F.Blin IFLA Trend Report English_dk
F.Blin IFLA Trend Report English_dkF.Blin IFLA Trend Report English_dk
F.Blin IFLA Trend Report English_dk
 
The Clientshare Academy Briefing - Gold Membership - by Practice Paradox
The Clientshare Academy Briefing - Gold Membership - by Practice ParadoxThe Clientshare Academy Briefing - Gold Membership - by Practice Paradox
The Clientshare Academy Briefing - Gold Membership - by Practice Paradox
 
Presentación taller 1
Presentación taller 1Presentación taller 1
Presentación taller 1
 
Ngan hang-thuong-mai 2
Ngan hang-thuong-mai 2Ngan hang-thuong-mai 2
Ngan hang-thuong-mai 2
 
Start Writing Groovy
Start Writing GroovyStart Writing Groovy
Start Writing Groovy
 
Simplifying life
Simplifying lifeSimplifying life
Simplifying life
 
Créer et afficher une tag list sur scoop.it
Créer et afficher une tag list sur scoop.itCréer et afficher une tag list sur scoop.it
Créer et afficher une tag list sur scoop.it
 

Similaire à BioBankCloud: Machine Learning on Genomics + GA4GH @ Med at Scale

Next-generation sequencing data format and visualization with ngs.plot 2015
Next-generation sequencing data format and visualization with ngs.plot 2015Next-generation sequencing data format and visualization with ngs.plot 2015
Next-generation sequencing data format and visualization with ngs.plot 2015Li Shen
 
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...Chester Chen
 
Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2Li Shen
 
Spark Summit EU talk by Sameer Agarwal
Spark Summit EU talk by Sameer AgarwalSpark Summit EU talk by Sameer Agarwal
Spark Summit EU talk by Sameer AgarwalSpark Summit
 
Machine learning at Scale with Apache Spark
Machine learning at Scale with Apache SparkMachine learning at Scale with Apache Spark
Machine learning at Scale with Apache SparkMartin Zapletal
 
Databases Have Forgotten About Single Node Performance, A Wrongheaded Trade Off
Databases Have Forgotten About Single Node Performance, A Wrongheaded Trade OffDatabases Have Forgotten About Single Node Performance, A Wrongheaded Trade Off
Databases Have Forgotten About Single Node Performance, A Wrongheaded Trade OffTimescale
 
Processing 70Tb Of Genomics Data With ADAM And Toil
Processing 70Tb Of Genomics Data With ADAM And ToilProcessing 70Tb Of Genomics Data With ADAM And Toil
Processing 70Tb Of Genomics Data With ADAM And ToilSpark Summit
 
Next Generation Indexes For Big Data Engineering (ODSC East 2018)
Next Generation Indexes For Big Data Engineering (ODSC East 2018)Next Generation Indexes For Big Data Engineering (ODSC East 2018)
Next Generation Indexes For Big Data Engineering (ODSC East 2018)Daniel Lemire
 
BioPig for scalable analysis of big sequencing data
BioPig for scalable analysis of big sequencing dataBioPig for scalable analysis of big sequencing data
BioPig for scalable analysis of big sequencing dataZhong Wang
 
Project Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Project Tungsten Phase II: Joining a Billion Rows per Second on a LaptopProject Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Project Tungsten Phase II: Joining a Billion Rows per Second on a LaptopDatabricks
 
Deep learning with kafka
Deep learning with kafkaDeep learning with kafka
Deep learning with kafkaNitin Kumar
 
Next-generation sequencing format and visualization with ngs.plot
Next-generation sequencing format and visualization with ngs.plotNext-generation sequencing format and visualization with ngs.plot
Next-generation sequencing format and visualization with ngs.plotLi Shen
 
Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...
Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...
Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...Spark Summit
 
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...MLconf
 
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...Databricks
 
Project Tungsten: Bringing Spark Closer to Bare Metal
Project Tungsten: Bringing Spark Closer to Bare MetalProject Tungsten: Bringing Spark Closer to Bare Metal
Project Tungsten: Bringing Spark Closer to Bare MetalDatabricks
 
Metadata and Provenance for ML Pipelines with Hopsworks
Metadata and Provenance for ML Pipelines with Hopsworks Metadata and Provenance for ML Pipelines with Hopsworks
Metadata and Provenance for ML Pipelines with Hopsworks Jim Dowling
 
And Then There Are Algorithms
And Then There Are AlgorithmsAnd Then There Are Algorithms
And Then There Are AlgorithmsInfluxData
 

Similaire à BioBankCloud: Machine Learning on Genomics + GA4GH @ Med at Scale (20)

Next-generation sequencing data format and visualization with ngs.plot 2015
Next-generation sequencing data format and visualization with ngs.plot 2015Next-generation sequencing data format and visualization with ngs.plot 2015
Next-generation sequencing data format and visualization with ngs.plot 2015
 
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
 
The steps of R code Master.pptx
The steps of R code Master.pptxThe steps of R code Master.pptx
The steps of R code Master.pptx
 
Blinkdb
BlinkdbBlinkdb
Blinkdb
 
Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2
 
Spark Summit EU talk by Sameer Agarwal
Spark Summit EU talk by Sameer AgarwalSpark Summit EU talk by Sameer Agarwal
Spark Summit EU talk by Sameer Agarwal
 
Machine learning at Scale with Apache Spark
Machine learning at Scale with Apache SparkMachine learning at Scale with Apache Spark
Machine learning at Scale with Apache Spark
 
Databases Have Forgotten About Single Node Performance, A Wrongheaded Trade Off
Databases Have Forgotten About Single Node Performance, A Wrongheaded Trade OffDatabases Have Forgotten About Single Node Performance, A Wrongheaded Trade Off
Databases Have Forgotten About Single Node Performance, A Wrongheaded Trade Off
 
Processing 70Tb Of Genomics Data With ADAM And Toil
Processing 70Tb Of Genomics Data With ADAM And ToilProcessing 70Tb Of Genomics Data With ADAM And Toil
Processing 70Tb Of Genomics Data With ADAM And Toil
 
Next Generation Indexes For Big Data Engineering (ODSC East 2018)
Next Generation Indexes For Big Data Engineering (ODSC East 2018)Next Generation Indexes For Big Data Engineering (ODSC East 2018)
Next Generation Indexes For Big Data Engineering (ODSC East 2018)
 
BioPig for scalable analysis of big sequencing data
BioPig for scalable analysis of big sequencing dataBioPig for scalable analysis of big sequencing data
BioPig for scalable analysis of big sequencing data
 
Project Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Project Tungsten Phase II: Joining a Billion Rows per Second on a LaptopProject Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Project Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
 
Deep learning with kafka
Deep learning with kafkaDeep learning with kafka
Deep learning with kafka
 
Next-generation sequencing format and visualization with ngs.plot
Next-generation sequencing format and visualization with ngs.plotNext-generation sequencing format and visualization with ngs.plot
Next-generation sequencing format and visualization with ngs.plot
 
Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...
Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...
Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...
 
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...
 
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
 
Project Tungsten: Bringing Spark Closer to Bare Metal
Project Tungsten: Bringing Spark Closer to Bare MetalProject Tungsten: Bringing Spark Closer to Bare Metal
Project Tungsten: Bringing Spark Closer to Bare Metal
 
Metadata and Provenance for ML Pipelines with Hopsworks
Metadata and Provenance for ML Pipelines with Hopsworks Metadata and Provenance for ML Pipelines with Hopsworks
Metadata and Provenance for ML Pipelines with Hopsworks
 
And Then There Are Algorithms
And Then There Are AlgorithmsAnd Then There Are Algorithms
And Then There Are Algorithms
 

Plus de Andy Petrella

Data Observability Best Pracices
Data Observability Best PracicesData Observability Best Pracices
Data Observability Best PracicesAndy Petrella
 
How to Build a Global Data Mapping
How to Build a Global Data MappingHow to Build a Global Data Mapping
How to Build a Global Data MappingAndy Petrella
 
Interactive notebooks
Interactive notebooksInteractive notebooks
Interactive notebooksAndy Petrella
 
Governance compliance
Governance   complianceGovernance   compliance
Governance complianceAndy Petrella
 
Data science governance and GDPR
Data science governance and GDPRData science governance and GDPR
BioBankCloud: Machine Learning on Genomics + GA4GH @ Med at Scale

  • 1. Scalable genomic data processing and interoperable systems with ADAM/Spark
    Andy Petrella, Xavier Tordoir
    2015-02-19
  • 2. Lineup
    Intro
    ● who we are
    ● we do distributed computing
    Abstract
    ● Content: Distributed machine learning on genomes data
    ● Distributed data and processing (S3, Spark, Tachyon)
    ● Distributed machine learning (MLlib, H2O)
    ● Spark Notebook
    Context
    ● 1000 genomes in VCF
    ● Distributed genomic data in ADAM
    ● Size matters (VCF → ADAM + partitioned)
    ● Data available on S3 (s3://med-at-scale/1000genomes)
    ● Stratification
    Procedure
    ● Deploy Spark on ec2
    ● Deploy Spark Notebook
    ● Load data
    ● Clean data
    ● Transform data
    ● Train KMeans
    Results
    ● Prediction (confusion matrix)
    ● Performance
    On the bench
    ● GA4GH compliant and scalable server
    ● Ad hoc analyses and sharing (through Tachyon)
  • 3. Andy
    @Noootsab, I am @SparkNotebook creator, @Devoxx4Kids organizer
    Maths & CS, Scalable systems, Machine learning, Med@Scale
    Xavier
    @xtordoir
    Physics, Data analysis, Genomics, Distributed computing
  • 4. Products (OSS)
    ● SparkNotebook
    ● GA4GH server
    What we do?
    Distributed computing consultancy in
    ● Internet of Things
    ● Finance
    ● Geospatial
    ● Marketing
    Training and coaching in
    ● Scala
    ● Spark
    ● Distributed architecture
    ● Distributed machine learning
    Research and development
    ● Distributed machine learning models
    ● Genomics and health
  • 5. Data: 1000genomes (Genotypes + Samples Population)
    - Quite some data → real scalability test
    - Machine learning:
      - Genotype inference
      - Population classification (supervised learning)
      - Population stratification (unsupervised learning)
    Distributed Machine Learning on Genotypes Data
  • 6. The era of distributed computing
    Strong Open Source ecosystem, Industrial developments and research
    - Infrastructure can be elastic (e.g. EC2/S3)
    - Data storage: HDFS (large blocks…), S3 (remote...)
    - Processing: Beefed up MapReduce: Spark
    - Escaping the IOPs: Tachyon in-memory filesystem
    - Scheduling, HA (Mesos, Marathon)
    Distributed Data Processing
  • 8. SparkNotebook
    Interactive Distributed Computing
    [Figure: repeated "Dev’ time" labels]
  • 9. Distributed Genomic Data
    1000 genomes
    - 1092 samples
    - 43,372,735,220 genotypes
    Original data
    - VCF, non-partitioned files on FTP or S3: 152 GB (gzipped)
    - VCF format is not easily parallelizable, even worse with compression
    ADAM / med-at-scale
    - ADAM files on S3: 70.75 GB (Parquet, compressed)
    - 9172 partitions (7 MB each)
    @see http://med-at-scale.s3.amazonaws.com/1000genomes/counts.html
    Eggo project: https://github.com/bigdatagenomics/eggo
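As a quick sanity check of the sizes quoted on this slide, the arithmetic can be sketched as follows. The numbers are the ones on the slide; the use of decimal units (1 GB = 1000 MB) is an assumption, and the variable names are only illustrative.

```python
# Figures quoted on the slide.
SAMPLES = 1092
GENOTYPES = 43_372_735_220
VCF_GZIPPED_GB = 152.0
ADAM_PARQUET_GB = 70.75
PARTITIONS = 9172

# Assumption: decimal units (1 GB = 1000 MB = 1e9 bytes).
mb_per_partition = ADAM_PARQUET_GB * 1000 / PARTITIONS
size_ratio = VCF_GZIPPED_GB / ADAM_PARQUET_GB
bytes_per_genotype = ADAM_PARQUET_GB * 1e9 / GENOTYPES
genotypes_per_sample = GENOTYPES / SAMPLES

print(f"{mb_per_partition:.1f} MB per partition")
print(f"gzipped VCF is {size_ratio:.2f}x the size of the ADAM files")
print(f"{bytes_per_genotype:.2f} bytes per genotype in ADAM")
print(f"about {genotypes_per_sample / 1e6:.1f} million genotypes per sample")
```

Under these assumptions the average partition comes out between 7 and 8 MB, which matches the rounded figure on the slide.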
  • 10. Data
    We have the 1000 genomes data, hence
    - we have genotypes
    - we have samples population labels
    Exploration
    We can cluster samples. We can compare with samples populations.
    Model
    We can run simple stratification algorithms, K-Means.
    Technology assessment
  • 11. K-Means
    MLlib provides K-Means (not hierarchical) → limit to 3 populations
    MLlib uses the breeze linalg library → only the Euclidean metric (at that moment)
    Genotype encoding (A is the reference allele): AA → 0, AT → 1, TT → 2
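The genotype encoding on this slide can be read as "count the alleles that differ from the reference". A minimal sketch of that reading, together with the Euclidean distance K-Means relies on, is given below. The function names `encode_genotype` and `euclidean` are hypothetical and are not part of MLlib or ADAM.

```python
import math

def encode_genotype(alleles, ref):
    """Count the alleles that differ from the reference allele (0, 1 or 2)."""
    return sum(1 for allele in alleles if allele != ref)

def euclidean(a, b):
    """Euclidean distance between two equally long genotype vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two made-up samples over three variants, with A as the reference allele.
sample_a = [encode_genotype(g, "A") for g in ("AA", "AT", "TT")]
sample_b = [encode_genotype(g, "A") for g in ("TT", "AT", "AA")]

print(sample_a, sample_b, euclidean(sample_a, sample_b))
```

With this encoding every sample becomes a numeric vector, so a purely Euclidean K-Means can be applied without a custom metric.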
  • 12. Procedure
    Spark on EC2 cluster
    - spark-ec2 script
    - 2 to 40 workers (x 13GB / 4 cores)
    - 10 to 40 minutes to launch
    $ ./spark-ec2 launch
    [Diagram: one Driver and four Workers]
  • 13. Procedure
    SparkNotebook on EC2 cluster
    - access from your browser
    - configure spark
    - control computations on the cluster
    [Diagram: one Driver and four Workers]
  • 14. Procedure
    Load data
    - Read ADAM data from S3 repo
    - Read the samples populations
    [Diagram: one Driver and four Workers]
  • 15. Procedure
    Filter and clean data
    - Sample: chromosome slice (chr22), 3 populations (GBR, ASW, CHB)
    - Missing genotypes (remove incomplete variants)

             Variant1  Variant2  Variant3  Variant4  Variant5  Variant6  Variant7
    Sample1  0         0         1         0         1         0         1
    Sample2  2         NA        1         2         1         0         0
    Sample3  2         0         1         2         2         0         2
    Sample4  1         1         0         0         0         NA        0
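Removing incomplete variants means dropping every column of the table above in which at least one sample has a missing genotype. A small in-memory sketch using the values from the slide follows. It uses plain Python rather than Spark, represents NA as `None`, and the helper name `complete_variant_indices` is hypothetical.

```python
NA = None  # missing genotype

variants = ["Variant1", "Variant2", "Variant3", "Variant4",
            "Variant5", "Variant6", "Variant7"]
genotypes = {
    "Sample1": [0, 0, 1, 0, 1, 0, 1],
    "Sample2": [2, NA, 1, 2, 1, 0, 0],
    "Sample3": [2, 0, 1, 2, 2, 0, 2],
    "Sample4": [1, 1, 0, 0, 0, NA, 0],
}

def complete_variant_indices(rows):
    """Indices of the variants that are genotyped in every sample."""
    n_variants = len(next(iter(rows.values())))
    return [i for i in range(n_variants)
            if all(row[i] is not None for row in rows.values())]

kept = complete_variant_indices(genotypes)
kept_variants = [variants[i] for i in kept]
cleaned = {sample: [row[i] for i in kept] for sample, row in genotypes.items()}

print(kept_variants)
print(cleaned)
```

Here Variant2 and Variant6 are dropped, because Sample2 and Sample4 respectively have no call for them.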
  • 16. Procedure
    Transform data
    - Flat Genotype collection → Sample collection
    - Each Sample is a Vector of Genotypes (0, 1, 2)
    - Vector is ordered consistently
    [Diagram: Genotype (Variant, Sample (ID), Alleles) → Sample (Sample (ID), Vector[Genotype]); Vector[Variant]]
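The transformation above amounts to a group-by on the sample ID followed by a consistent ordering of the variants. A plain-Python sketch is shown below. The record layout `(variant, sample, genotype)`, the identifiers `rs1`..`rs3` and `S1`/`S2`, and the function name `to_sample_vectors` are all made up for illustration, and the code assumes incomplete variants were already removed.

```python
from collections import defaultdict

# Flat genotype records: (variant ID, sample ID, encoded genotype).
records = [
    ("rs3", "S1", 1), ("rs1", "S1", 0), ("rs2", "S1", 2),
    ("rs2", "S2", 0), ("rs3", "S2", 2), ("rs1", "S2", 1),
]

def to_sample_vectors(records):
    """Group records by sample and order each vector by variant ID."""
    # One global, sorted variant order so that position i means the
    # same variant in every sample vector.
    variant_order = sorted({variant for variant, _, _ in records})
    by_sample = defaultdict(dict)
    for variant, sample, genotype in records:
        by_sample[sample][variant] = genotype
    # Raises KeyError if a sample lacks a variant (data assumed complete).
    vectors = {sample: [calls[v] for v in variant_order]
               for sample, calls in by_sample.items()}
    return variant_order, vectors

variant_order, vectors = to_sample_vectors(records)
print(variant_order)
print(vectors)
```

Sorting by variant ID is only one way to get a consistent order; any ordering works as long as it is shared by all samples.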
  • 17. Procedure
    Train K-Means
    - 10 iterations
    - 3 clusters
    [Diagram: Sample (Sample (ID), Vector[Genotype]) → Vector, Vector, Vector]
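To make the training step concrete, here is a small single-machine sketch of K-Means (Lloyd's algorithm) with 3 clusters and 10 iterations on made-up genotype vectors. It is not MLlib's distributed implementation. The farthest-point initialisation is a simplification chosen to keep the example deterministic, and the function names are hypothetical.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)),
               key=lambda c: squared_distance(point, centroids[c]))

def kmeans(points, k=3, iterations=10):
    # Deterministic farthest-point initialisation (an assumption of this
    # sketch, not what MLlib does).
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(
            squared_distance(p, c) for c in centroids))
        centroids.append(list(farthest))
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return centroids

# Made-up samples: three groups of two genotype vectors each.
samples = [
    [0, 0, 0, 0], [0, 1, 0, 0],
    [2, 2, 2, 2], [2, 2, 1, 2],
    [0, 0, 2, 2], [0, 0, 2, 1],
]
centroids = kmeans(samples, k=3, iterations=10)
labels = [nearest(s, centroids) for s in samples]
print(labels)
```

The cluster indices carry no meaning by themselves; they only become interpretable once compared with the known population labels.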
  • 18. Results
    ~ 100,000 variants

         #0   #1   #2
    GBR   0    0   89
    ASW  54    0    7
    CHB   0   97    0

    The procedure reconstructs the actual populations.
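One way to quantify how well the clusters match the populations is to map each cluster to its majority population and count the samples that agree. The sketch below does this for the confusion matrix on the slide. "Purity" is a standard clustering measure, but it is an addition here and not something the slides report.

```python
# Rows: true population; columns: K-Means cluster #0, #1, #2 (from the slide).
confusion = {
    "GBR": [0, 0, 89],
    "ASW": [54, 0, 7],
    "CHB": [0, 97, 0],
}
n_clusters = 3

majority = {}
correct = 0
for cluster in range(n_clusters):
    # Population contributing the most samples to this cluster.
    population = max(confusion, key=lambda p: confusion[p][cluster])
    majority[cluster] = population
    correct += confusion[population][cluster]

total = sum(sum(row) for row in confusion.values())
purity = correct / total

print(majority)
print(f"{correct}/{total} samples in their majority cluster ({purity:.1%})")
```

On these numbers, 240 of 247 samples sit in the cluster dominated by their own population; the 7 exceptions are ASW samples grouped with GBR.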
  • 19. Results
    Performance (cluster size)

                                  2 NODES   20 NODES(*)
    Cluster launch                10 min    30.0 min
    Count chr22 genotypes (S3)    6 min     1.1 min
    Save chr22 from S3 to HDFS    26 min    3.5 min
    Count chr22 genotypes (HDFS)  10 min    1.4 min

    (*) Cluster size / nb partitions not optimal here: 80 cores / 114 partitions
  • 20. Results
    Performance (cluster size)
    121,023 variants

                            2 NODES   20 NODES
    Missing data (collect)  7.8 min   33 sec
    Train (10 iter)         2.1 min   28 sec
    Predict (collect)       8 sec     2 sec
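The timings on the two cluster-size slides above can be turned into speedups and parallel efficiencies for the move from 2 to 20 nodes. The sketch below uses the slide figures converted to seconds and leaves out the cluster launch, which got slower. The last part illustrates one possible reading of the 80 cores / 114 partitions footnote, under the assumption (not stated in the slides) of one equally long task per partition and one task per core at a time.

```python
import math

NODE_RATIO = 20 / 2  # ideal speedup when going from 2 to 20 nodes

# (time on 2 nodes, time on 20 nodes), in seconds, from the slides.
timings = {
    "Count chr22 genotypes (S3)": (6 * 60, 1.1 * 60),
    "Save chr22 from S3 to HDFS": (26 * 60, 3.5 * 60),
    "Count chr22 genotypes (HDFS)": (10 * 60, 1.4 * 60),
    "Missing data (collect)": (7.8 * 60, 33),
    "Train (10 iter)": (2.1 * 60, 28),
    "Predict (collect)": (8, 2),
}

speedup = {step: t2 / t20 for step, (t2, t20) in timings.items()}
efficiency = {step: s / NODE_RATIO for step, s in speedup.items()}

for step in timings:
    print(f"{step:30s} speedup {speedup[step]:5.2f}x, "
          f"efficiency {efficiency[step]:.0%}")

# Footnote on slide 19: 114 partitions on 80 cores.
cores, partitions = 80, 114
waves = math.ceil(partitions / cores)        # rounds of tasks needed
utilisation = partitions / (waves * cores)   # share of core slots used
print(f"{waves} waves, {utilisation:.0%} of core slots used")
```

Under that assumption the second wave keeps only 34 of the 80 cores busy, which is consistent with the footnote calling the configuration not optimal.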
  • 21. Results
    Performance, 20 NODES (data size)

                            121,023 variants   491,222 variants
    Missing data (collect)  33 sec             3.7 min
    Train (10 iter)         28 sec             1.6 min
    Predict (collect)       2 sec              25 sec
  • 22. On the bench
    Global Alliance for Genomics and Health (GA4GH)
    http://genomicsandhealth.org/
    http://ga4gh.org/
    - Framework for responsible data sharing
    - Define schemas
    - Define services for interoperability
  • 24. On the bench
    GA4GH Google implementation
  • 25. On the bench
    GA4GH Google implementation
  • 26. On the bench
    GA4GH compliant & scalable server
    Open source and available on GitHub: https://github.com/med-at-scale/high-health
    PRs are welcome!
  • 27. On the bench
    Methods grouped in micro services
    GA4GH & Custom methods
  • 28. Thank you
    - Biobankcloud, KTH (Jim Dowling)
    - UC Berkeley AMPLab, bdgenomics.org team (Frank Nothaft, Matt Massie)
    - Cloudera (Uri Laserson)
    Hey… Come back tomorrow morning → for demos
    And afternoon → to hack on it!