Lightning fast genomics with Spark, Adam and Scala

Andy Petrella, CEO & Founder at Kensu
Lightning fast genomics 
With Spark and ADAM
Who are we? 
Andy 
@Noootsab 
@NextLab_be 
@Wajug co-driver 
@Devoxx4Kids organizer 
Maths & CS 
Data lover: geo, open, massive 
Fool 
Xavier 
@xtordoir 
SilicoCloud 
-> Physics 
-> Data analysis 
-> genomics 
-> scalable systems 
-> ...
Genomics 
What is genomics about? 
Medical Diagnostics 
Drug response 
Disease mechanisms
Genomics 
What is genomics about? 
- A human genome is a 3 billion long sequence (of 
nucleic acids: “bases”) 
- About 1 base in 1,000 varies across the human population
- Genomes encode bio-molecules (tens of thousands) 
- These molecules interact together 
...and with environment 
→ Biological systems are very complex
Genomics 
State of the art 
- growing technological capacity 
- cost reduction 
- growing data
Genomics 
State of the art 
- I.T. becomes the bottleneck (cost and latency)
- data is sacrificed through sampling or cut-offs
Andrea Sboner et al.
Genomics 
Blocking points 
- “legacy stack” not designed to scale (C, Perl, …)
- HPC approach not a fit (data intensive)
Genomics 
Future of genomics 
- Personal genomes (e.g. 1,000,000 genomes for cancer 
research) 
- New sequencing technologies 
- Sequence “stuff” as needed (e.g. microbiome, 
diagnostics) 
- medicalCondition = f(genomics, environmentHistory)
Genomics 
Need for scalability → Scala & Spark
Need for simplicity and clarity → ADAM
Parquet 101 
Columnar storage 
Row oriented 
Column oriented
Parquet 101 
Columnar storage 
> Homogeneous collocated data 
> Better range access 
> Better encoding
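A minimal sketch of the idea in plain Scala (no Parquet dependency). The Genotype record and its values are hypothetical, chosen only to show why a column of homogeneous, co-located values compresses well. Run-length encoding is used as one example of an encoding that benefits from this layout.

```scala
object ColumnarSketch {
  // Hypothetical record type, used only for illustration.
  final case class Genotype(sample: String, chromosome: String, position: Long)

  // Row-oriented: one record after another, fields interleaved.
  val rows: Vector[Genotype] = Vector(
    Genotype("S1", "6", 100L),
    Genotype("S2", "6", 100L),
    Genotype("S3", "6", 250L)
  )

  // Column-oriented: one homogeneous sequence per field.
  val samples: Vector[String]     = rows.map(_.sample)
  val chromosomes: Vector[String] = rows.map(_.chromosome)
  val positions: Vector[Long]     = rows.map(_.position)

  // Run-length encoding collapses consecutive repeats into (value, count) pairs.
  def runLength[A](column: Seq[A]): List[(A, Int)] =
    column.foldLeft(List.empty[(A, Int)]) {
      case ((value, count) :: tail, next) if value == next => (value, count + 1) :: tail
      case (acc, next)                                      => (next, 1) :: acc
    }.reverse

  def main(args: Array[String]): Unit = {
    println(runLength(chromosomes)) // List((6,3))
    println(runLength(positions))   // List((100,2), (250,1))
  }
}
```

Reading a single field (say, positions) touches one contiguous sequence instead of every record, which is the "better range access" point on the slide.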
Parquet 101 
Efficient encoding of nested typed structures 
message Document {
  required int64 DocId;
  optional group Links {
    repeated int64 Backward;
    repeated int64 Forward;
  }
  repeated group Name {
    repeated group Language {
      required string Code;
      optional string Country;
    }
    optional string Url;
  }
}
Parquet 101 
Efficient encoding of nested typed structures 
message Document {
  required int64 DocId;
  optional group Links {
    repeated int64 Backward;
    repeated int64 Forward;
  }
  repeated group Name {
    repeated group Language {
      required string Code;
      optional string Country;
    }
    optional string Url;
  }
}
Nested structure → Tree
Empty levels → Branch pruning
Repetitions → Metadata (index)
Types → Safe/Fast codec
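To make the tree shape concrete, here is one possible way to mirror the Document schema with Scala case classes. This is only an illustration of the nesting, not how Parquet stores the data. The mapping (required to a plain field, optional to Option, repeated to Seq) and the example values are assumptions made for this sketch.

```scala
object DocumentSketch {
  // required -> plain field, optional -> Option, repeated -> Seq.
  final case class Language(code: String, country: Option[String])
  final case class Name(language: Seq[Language], url: Option[String])
  final case class Links(backward: Seq[Long], forward: Seq[Long])
  final case class Document(docId: Long, links: Option[Links], name: Seq[Name])

  // Collect every Name.Language.Code value of a document, in order.
  def languageCodes(doc: Document): Seq[String] =
    doc.name.flatMap(_.language).map(_.code)

  // Illustrative values only.
  val example: Document = Document(
    docId = 10L,
    links = Some(Links(backward = Seq.empty, forward = Seq(20L, 40L, 60L))),
    name = Seq(
      Name(Seq(Language("en-us", Some("us")), Language("en", None)), Some("http://A")),
      Name(Seq.empty, Some("http://B")),
      Name(Seq(Language("en-gb", Some("gb"))), None)
    )
  )

  def main(args: Array[String]): Unit =
    println(languageCodes(example)) // List(en-us, en, en-gb)
}
```

The second Name has no Language at all: that is an "empty level", the kind of branch a columnar encoding can prune instead of storing placeholders.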
Parquet 101 
Efficient encoding of nested typed structures 
ref: https://blog.twitter.com/2013/dremel-made-simple-with-parquet
Parquet 101 
Optimized distributed storage (e.g. in HDFS)
ref: http://grepalex.com/2014/05/13/parquet-file-format-and-object-model/
Parquet 101 
Efficient (schema based) serialization: AVRO 
JSON Schema
{
  "namespace": "example.avro",
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "favorite_number", "type": ["int", "null"]},
    {"name": "favorite_color", "type": ["string", "null"]}
  ]
}
IDL
record User {
  string name;
  union { null, int } favorite_number = null;
  union { null, string } favorite_color = null;
}
Parquet 101 
Efficient (schema based) serialization: AVRO 
JSON Schema
{
  "namespace": "example.avro",
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "favorite_number", "type": ["int", "null"]},
    {"name": "favorite_color", "type": ["string", "null"]}
  ]
}
Part of the:
● protocol
● serialization
→ less metadata
Define: IDL → JSON 
Send: Binary → JSON
ADAM 
Credits: AmpLab (UC Berkeley)
ADAM 
Overview (Sequencing) 
- DNA is a molecule 
…or a Seq[Char] 
(A, T, G, C) alphabet
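Taking the Seq[Char] view literally, a small sketch in plain Scala. The function names are made up for this example; the base pairing used (A with T, G with C) is the standard one.

```scala
object DnaSketch {
  val alphabet: Set[Char] = Set('A', 'T', 'G', 'C')

  // A sequence is valid when every base belongs to the alphabet.
  def isValid(dna: Seq[Char]): Boolean = dna.forall(alphabet.contains)

  // Base pairing: A<->T and G<->C.
  val complement: Map[Char, Char] = Map('A' -> 'T', 'T' -> 'A', 'G' -> 'C', 'C' -> 'G')

  // The opposite strand, read in its own direction.
  def reverseComplement(dna: Seq[Char]): Seq[Char] = dna.reverse.map(complement)

  def main(args: Array[String]): Unit = {
    val read: Seq[Char] = "GATTACA".toSeq
    println(isValid(read))                    // true
    println(reverseComplement(read).mkString) // TGTAATC
  }
}
```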
ADAM 
Sequencing 
- Massively parallel sequencing of random reads of 100-150
bases (20,000,000 reads per genome)
- 30-60x coverage for quality 
- All this mess must be re-organised! 
→ ADAM
ADAM 
Variants Calling 
- From an organized set of reads (ADAM Pileup) 
- Detect variants (Variant Calling) 
→ AVOCADO
ADAM 
Genomics specifications 
- SAM, BAM, VCF 
- Indexable 
- libraries 
- ~ scalable: hadoop-bam
ADAM 
ADAM model 
- schema based (Avro), libraries are generated 
- no storage spec here!
ADAM 
ADAM model 
- Parquet storage 
- evenly distribute data 
- storage optimized for read/query 
- better compression
ADAM 
ADAM API 
- ADAMContext provides functions to read from HDFS
ADAM 
ADAM API 
- Scala classes generated from Avro 
- Data loaded as RDDs (Spark’s Resilient Distributed 
Datasets) 
- functions on RDDs (write to HDFS, genomic objects 
manipulations)
ADAM 
ADAM API 
- e.g. reading genotypes
ADAM 
ADAM Benchmark 
- It scales! 
- Data is more compact 
- Read perf is better 
- Code is simpler
Stratification using 1000Genomes 
As usual… let’s get some data. 
Genomes relate to health and are private. 
Still, there are options!
Stratification using 1000Genomes 
http://www.1000genomes.org/ 
(Nowadays targeting 2000 genomes) 
ref: http://upload.wikimedia.org/wikipedia/en/e/eb/Genetic_Variation.jpg
Stratification using 1000Genomes 
Study genetic variations in populations (needs 
more contextual data for healthcare). 
To validate the interest in ADAM, we’ll do some 
qualitative exploration of the data. 
Question: is it possible to predict which
subpopulation a given genome belongs to?
Stratification using 1000Genomes 
We can run an unsupervised algorithm on a 
massive number of genomes. 
The idea is to find clusters that would match 
subpopulations. 
Actually, it’s important because it reflects
population histories: gene flows, selection, ...
Stratification using 1000Genomes 
From the 200 TB of data, we’ll focus on
chromosome 6, and only on its variants
ref: http://en.wikipedia.org/wiki/Chromosome
Genome Data 
Data structure 
Panel: Map[SampleID, Population]
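A minimal sketch of how such a panel could be built from text lines. The tab-separated layout (sample ID first, population second) is an assumption for this example; the sample IDs and populations are the ones shown in the prediction output later in the deck.

```scala
object PanelSketch {
  type SampleID = String
  type Population = String

  // Assumed layout: tab-separated lines, sample ID first, population second.
  // Lines with fewer than two fields are skipped.
  def parsePanel(lines: Seq[String]): Map[SampleID, Population] =
    lines
      .map(_.split("\t"))
      .collect { case fields if fields.length >= 2 => fields(0) -> fields(1) }
      .toMap

  def main(args: Array[String]): Unit = {
    val panel = parsePanel(Seq("NA20332\tASW", "HG00120\tGBR", "NA18560\tCHB"))
    println(panel.get("HG00120")) // Some(GBR)
    println(panel.get("unknown")) // None
  }
}
```

Looking a sample up with get returns an Option, which matches the Some(...) wrappers in the prediction output.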
Genome Data 
Data structure 
Genotypes in VCF format 
Basically a text file. Ours were downloaded from S3. 
Converted to ADAM Genotypes
Machine Learning model 
Clustering: KMeans 
ref: http://en.wikipedia.org/wiki/K-means_clustering
Machine Learning model 
Clustering: KMeans 
PreProcess = {A,C,T,G}² → {0,1,2} 
Space = {0,1,2}¹⁷⁰⁰⁰⁰⁰⁰⁰ 
Distance = Euclidean (L2) (*)
(*) MLlib restriction, although here L2 ~ L1
SPARK-3012 
ref: http://en.wikipedia.org/wiki/K-means_clustering
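A sketch of the preprocessing and distance in plain Scala, without MLlib. The encoding rule is an assumption: each genotype (a pair of bases) is mapped to the number of alleles that differ from a reference base, which yields 0, 1 or 2. The deck does not spell out the exact rule, so treat this as one plausible reading.

```scala
object GenotypeEncodingSketch {
  // Assumption: a genotype is encoded as the number of alleles that differ
  // from the reference base, so the result is 0, 1 or 2.
  def encode(reference: Char, genotype: (Char, Char)): Double = {
    val (first, second) = genotype
    Seq(first, second).count(_ != reference).toDouble
  }

  // Euclidean (L2) distance between two samples.
  def l2(a: Seq[Double], b: Seq[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // Manhattan (L1) distance, for comparison.
  def l1(a: Seq[Double], b: Seq[Double]): Double =
    a.zip(b).map { case (x, y) => math.abs(x - y) }.sum

  def main(args: Array[String]): Unit = {
    val references = Seq('A', 'C', 'G')
    val sample1 = references.zip(Seq(('A', 'A'), ('C', 'T'), ('T', 'T'))).map { case (r, g) => encode(r, g) }
    val sample2 = references.zip(Seq(('A', 'G'), ('C', 'C'), ('T', 'T'))).map { case (r, g) => encode(r, g) }
    println(sample1)              // List(0.0, 1.0, 2.0)
    println(l2(sample1, sample2)) // square root of 2, about 1.414
    println(l1(sample1, sample2)) // 2.0
  }
}
```

In this example every per-variant difference is 0 or 1, so the squared L2 distance equals the L1 distance. That may be one way to read the "L2 ~ L1" remark on the slide.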
Machine Learning model 
MLlib, KMeans
MLlib:
● Machine Learning Algorithms 
● Data structures (e.g. Vector)
Machine Learning model 
MLlib KMeans
DataFrame Map: 
● key = Sample 
● value = Vector of Genotypes alleles (sorted by Variant)
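The key/value shape can be sketched with plain Scala collections. The Call record is a hypothetical, simplified stand-in for an encoded genotype, and the sketch assumes every sample has a call at every variant. With Spark, the same grouping would run on a distributed collection rather than a local Seq.

```scala
object SampleVectorsSketch {
  // Hypothetical, simplified genotype record: one encoded value per sample and variant.
  final case class Call(sample: String, position: Long, value: Double)

  // Group calls by sample, then order each sample's values by variant position
  // so that the same index means the same variant in every vector.
  def toVectors(calls: Seq[Call]): Map[String, Vector[Double]] =
    calls.groupBy(_.sample).map { case (sample, sampleCalls) =>
      sample -> sampleCalls.sortBy(_.position).map(_.value).toVector
    }

  def main(args: Array[String]): Unit = {
    val calls = Seq(
      Call("NA20332", 200L, 1.0), Call("NA20332", 100L, 0.0),
      Call("HG00120", 100L, 2.0), Call("HG00120", 200L, 2.0)
    )
    toVectors(calls).toSeq.sortBy(_._1).foreach(println)
    // (HG00120,Vector(2.0, 2.0))
    // (NA20332,Vector(0.0, 1.0))
  }
}
```

Sorting by variant matters: without a shared order, coordinate i of two samples would refer to different variants and distances between vectors would be meaningless.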
Mashup 
prediction 
Sample [NA20332] is in cluster #0 for population Some(ASW) 
Sample [NA20334] is in cluster #2 for population Some(ASW) 
Sample [HG00120] is in cluster #2 for population Some(GBR) 
Sample [NA18560] is in cluster #1 for population Some(CHB)
Mashup 
       #0   #1   #2
GBR     0    0   89
ASW    54    0    7
CHB     0   97    0
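A table like this one can be derived by counting (population, cluster) pairs. The sketch below is plain Scala; the function names are made up for this example, and the four input pairs are the ones from the prediction output on the previous slide, not the full data set.

```scala
object ClusterTableSketch {
  // Count samples per (population, cluster) pair.
  def crossTab(assignments: Seq[(String, Int)]): Map[(String, Int), Int] =
    assignments.groupBy(identity).map { case (key, group) => key -> group.size }

  // Render one tab-separated row per population, one column per cluster.
  def render(table: Map[(String, Int), Int], clusters: Range): Seq[String] = {
    val populations = table.keys.map(_._1).toSeq.distinct.sorted
    populations.map { population =>
      val counts = clusters.map(cluster => table.getOrElse((population, cluster), 0))
      (population +: counts.map(_.toString)).mkString("\t")
    }
  }

  def main(args: Array[String]): Unit = {
    // (population, predicted cluster) pairs from the four sample lines shown earlier.
    val assignments = Seq(("ASW", 0), ("ASW", 2), ("GBR", 2), ("CHB", 1))
    render(crossTab(assignments), 0 to 2).foreach(println)
  }
}
```

Read row by row, the table on the slide shows GBR and CHB each falling entirely into one cluster, while ASW is split (54 samples in cluster #0, 7 in cluster #2).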
Cluster 
4 m3.xlarge instances (ec2) 
16 cores + 60G
Cluster 
Performance
Cluster 
40 m3.xlarge 
160 cores + 600G
Conclusions and future work 
● ADAM and Spark provide tools to 
manipulate genomics data in a scalable way 
● Simple APIs in Scala 
● MLlib for machine learning
→ implement less naïve algorithms 
→ cross medical and environmental data with 
genomes
Acknowledgments 
Scala.IO 
AmpLab 
Matt Massie Frank Nothaft 
Vincent Botta
That’s all Folks 
Apparently, we’re supposed to stay on stage 
Waiting for questions 
Hoping for none 
Looking at the bar 
And the lunch 
Oh there are beers 
And candies 
who can read this?
Lightning fast genomics with Spark, Adam and Scala

  • 1. Lightning fast genomics With Spark and ADAM
  • 2. Who are we? Andy @Noootsab @NextLab_be @Wajug co-driver @Devoxx4Kids organizer Maths & CS Data lover: geo, open, massive Fool Xavier @xtordoir SilicoCloud -> Physics -> Data analysis -> genomics -> scalable systems -> ...
  • 3. Genomics What is genomics about? Medical Diagnostics Drug response Diseases mechanisms
  • 4. Genomics What is genomics about? - A human genome is a 3 billion long sequence (of nucleic acids: “bases”) - 1 per 1000 base is variable in human population - Genomes encode bio-molecules (tens of thousands) - These molecules interact together ...and with environment → Biological systems are very complex
  • 5. Genomics State of the art - growing technological capacity - cost reduction - growing data._
  • 6. Genomics State of the art - I.T. becomes the bottleneck (cost and latency) - data is sacrificed through sampling or cut-offs Andrea Sboner et al.
  • 7. Genomics Blocking points - “legacy stack” not designed to scale (C, perl, …) - HPC approach not a fit (data intensive)
  • 8. Genomics Future of genomics - Personal genomes (e.g. 1,000,000 genomes for cancer research) - New sequencing technologies - Sequence “stuff” as needed (e.g. microbiome, diagnostics) - medicalCondition = f(genomics, environmentHistory)
  • 9. Genomics Need for scalability → Scala & Spark Need for simplicity, clarity → ADAM
  • 10. Parquet 101 Columnar storage Row oriented Column oriented
  • 11. Parquet 101 Columnar storage > Homogeneous collocated data > Better range access > Better encoding
  • 12. Parquet 101 Efficient encoding of nested typed structures message Document { required int64 DocId; optional group Links { repeated int64 Backward; repeated int64 Forward; } repeated group Name { repeated group Language { required string Code; optional string Country; } optional string Url; } }
  • 13. Parquet 101 Efficient encoding of nested typed structures message Document { required int64 DocId; optional group Links { repeated int64 Backward; repeated int64 Forward; } repeated group Name { repeated group Language { required string Code; optional string Country; } optional string Url; } } Nested structure →Tree Empty levels →Branch pruning Repetitions →Metadata (index) Types → Safe/Fast codec
  • 14. Parquet 101 Efficient encoding of nested typed structures ref: https://blog.twitter.com/2013/dremel-made-simple-with-parquet
  • 15. Parquet 101 Optimized distributed storage (e.g. in HDFS) ref: http://grepalex.com/2014/05/13/parquet-file-format-and-object-model/
  • 16. Parquet 101 Efficient (schema based) serialization: AVRO JSON Schema IDL { "namespace": "example.avro", "type": "record", "name": "User", "fields": [ {"name": "name", "type": "string"}, {"name": "favorite_number", "type": ["int", "null"]}, {"name": "favorite_color", "type": ["string", "null"]} ] } record User { string name; union { null, int } favorite_number = null; union { null, string } favorite_color = null; }
  • 17. Parquet 101 Efficient (schema based) serialization: AVRO JSON Schema { "namespace": "example.avro", "type": "record", "name": "User", "fields": [ {"name": "name", "type": "string"}, {"name": "favorite_number", "type": ["int", "null"]}, {"name": "favorite_color", "type": ["string", "null"]} ] } Part of the: ● protocol ● serialization → less metadata Define: IDL → JSON Send: Binary → JSON
  • 18. ADAM Credits: AmpLab (UC Berkeley)
  • 19. ADAM Overview (Sequencing) - DNA is a molecule …or a Seq[Char] (A, T, G, C) alphabet
  • 20. ADAM Sequencing - Massively parallel sequencing of random reads of 100-150 bases (20,000,000 reads per genome) - 30-60x coverage for quality - All this mess must be re-organised! → ADAM
  • 21. ADAM Variants Calling - From an organized set of reads (ADAM Pileup) - Detect variants (Variant Calling) → AVOCADO
  • 22. ADAM Genomics specifications - SAM, BAM, VCF - Indexable - libraries - ~ scalable: hadoop-bam
  • 23. ADAM ADAM model - schema based (Avro), libraries are generated - no storage spec here!
  • 24. ADAM ADAM model - Parquet storage - evenly distribute data - storage optimized for read/query - better compression
  • 25. ADAM ADAM API - ADAMContext provides functions to read from HDFS
  • 26. ADAM ADAM API - Scala classes generated from Avro - Data loaded as RDDs (Spark’s Resilient Distributed Datasets) - functions on RDDs (writing to HDFS, manipulating genomic objects)
  • 27. ADAM ADAM API - e.g. reading genotypes
  • 28. ADAM ADAM Benchmark - It scales! - Data is more compact - Read perf is better - Code is simpler
  • 29. Stratification using 1000Genomes As usual… let’s get some data. Genomes relate to health and are private. Still, there are options!
  • 30. Stratification using 1000Genomes http://www.1000genomes.org/ (Nowadays targeting 2000 genomes) ref: http://upload.wikimedia.org/wikipedia/en/e/eb/Genetic_Variation.jpg
  • 33. Stratification using 1000Genomes Study genetic variations in populations (needs more contextual data for healthcare). To assess the value of ADAM, we’ll do some qualitative exploration of the data. Question: is it possible to predict which subpopulation a given genome belongs to?
  • 34. Stratification using 1000Genomes We can run an unsupervised algorithm on a massive number of genomes. The idea is to find clusters that match subpopulations. This matters because the clusters reflect population histories: gene flow, selection, ...
  • 35. Stratification using 1000Genomes From the 200 TB of data, we’ll focus on chromosome 6, and only on its variants ref: http://en.wikipedia.org/wiki/Chromosome
  • 36. Genome Data Data structure
  • 37. Genome Data Data structure Panel: Map[SampleID, Population]
  • 38. Genome Data Data structure Genotypes in VCF format Basically a text file. Ours were downloaded from S3. Converted to ADAM Genotypes
  • 39. Machine Learning model Clustering: KMeans ref: http://en.wikipedia.org/wiki/K-means_clustering
  • 40. Machine Learning model Clustering: KMeans PreProcess = {A,C,T,G}² → {0,1,2} Space = {0,1,2}¹⁷⁰⁰⁰⁰⁰⁰⁰ Distance = Euclidean (L2) ⁽*⁾ ⁽*⁾ MLlib restriction (SPARK-3012), although here L2 ~ L1 ref: http://en.wikipedia.org/wiki/K-means_clustering
  • 41. Machine Learning model MLlib, KMeans MLlib: ● Machine Learning Algorithms ● Data structures (e.g. Vector)
  • 42. Machine Learning model MLlib KMeans DataFrame Map: ● key = Sample ● value = Vector of genotype alleles (sorted by Variant)
  • 43. Mashup prediction
      Sample [NA20332] is in cluster #0 for population Some(ASW)
      Sample [NA20334] is in cluster #2 for population Some(ASW)
      Sample [HG00120] is in cluster #2 for population Some(GBR)
      Sample [NA18560] is in cluster #1 for population Some(CHB)
  • 44. Mashup
               #0   #1   #2
      GBR       0    0   89
      ASW      54    0    7
      CHB       0   97    0
  • 45. Cluster 4 m3.xlarge instances (ec2) 16 cores + 60G
  • 47. Cluster 40 m3.xlarge 160 cores + 600G
  • 48. Conclusions and future work ● ADAM and Spark provide tools to manipulate genomics data in a scalable way ● Simple APIs in Scala ● MLlib for machine learning → implement less naïve algorithms → cross medical and environmental data with genomes
  • 49. Acknowledgments Acknowledgements Scala.IO AmpLab Matt Massie Frank Nothaft Vincent Botta
  • 50. That’s all Folks Apparently, we’re supposed to stay on stage Waiting for questions Hoping for none Looking at the bar And the lunch Oh there are beers And candies who can read this?
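
The preprocessing and mashup steps from slides 40 to 44 can be sketched in plain Scala. This is a minimal illustration, not the authors' code: the object and function names are hypothetical, plain collections stand in for Spark RDDs, and the encoding (counting the alleles that differ from the reference base) is an assumption about how {A,C,T,G}² is mapped to {0,1,2}.

```scala
// Hypothetical sketch: plain Scala collections stand in for Spark RDDs and MLlib.
object StratificationSketch {

  // Assumed encoding of a diploid genotype: the number of alleles (0, 1 or 2)
  // that differ from the reference base at that variant position.
  def encode(ref: Char, alleles: (Char, Char)): Int =
    Seq(alleles._1, alleles._2).count(_ != ref)

  // Squared Euclidean (L2) distance between two encoded genomes.
  def dist2(a: Vector[Double], b: Vector[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the closest cluster centre, i.e. the k-means assignment step.
  def nearest(centers: Seq[Vector[Double]], point: Vector[Double]): Int =
    centers.indices.minBy(i => dist2(centers(i), point))

  // "Mashup": count samples per (population, cluster) pair.
  // Samples missing from the panel are skipped.
  def crossTab(panel: Map[String, String],
               clusters: Map[String, Int]): Map[(String, Int), Int] =
    clusters.toSeq
      .flatMap { case (sample, cluster) => panel.get(sample).map(pop => (pop, cluster)) }
      .groupBy(identity)
      .map { case (key, hits) => key -> hits.size }
}
```

Feeding the four samples printed on slide 43 into `crossTab` would give one count per (population, cluster) pair; the full table on slide 44 is the same aggregation over every sample in the panel.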