Lightning fast genomics with Spark, Adam and Scala

Biotechnology now makes it possible to obtain a personal genome for $1000. DNA sequencing has made tremendous progress since the 1970s: more samples per experiment and more genomic coverage at higher speeds. The genomic analysis standards developed over the years, however, were not designed with scalability and adaptability in mind. In this talk, we present a game-changing technology in this area: ADAM, initiated by the AMPLab at Berkeley. ADAM is a framework based on Apache Spark and the Parquet storage format. We will see how it can speed up a sequence reconstruction by a factor of 150.

  1. Lightning fast genomics with Spark and ADAM
  2. Who are we?
     Andy: @Noootsab, @NextLab_be, @Wajug co-driver, @Devoxx4Kids organizer; Maths & CS; data lover (geo, open, massive); fool
     Xavier: @xtordoir, SilicoCloud; Physics -> Data analysis -> genomics -> scalable systems -> ...
  3. Genomics: What is genomics about?
     - Medical diagnostics
     - Drug response
     - Disease mechanisms
  4. Genomics: What is genomics about?
     - A human genome is a sequence of 3 billion nucleic acids (“bases”)
     - About 1 base in 1000 varies across the human population
     - Genomes encode bio-molecules (tens of thousands)
     - These molecules interact with each other ...and with the environment
     → Biological systems are very complex
  5. Genomics: State of the art
     - growing technological capacity
     - cost reduction
     - growing data
  6. Genomics: State of the art
     - I.T. becomes the bottleneck (cost and latency)
     - data is sacrificed through sampling or cut-offs
     (Andrea Sboner et al.)
  7. Genomics: Blocking points
     - the “legacy stack” was not designed to scale (C, Perl, …)
     - the HPC approach is not a fit (the workload is data intensive)
  8. Genomics: Future of genomics
     - Personal genomes (e.g. 1,000,000 genomes for cancer research)
     - New sequencing technologies
     - Sequence “stuff” as needed (e.g. microbiome, diagnostics)
     - medicalCondition = f(genomics, environmentHistory)
  9. Genomics
     - Need for scalability → Scala & Spark
     - Need for simplicity and clarity → ADAM
  10. Parquet 101: Columnar storage
      Row oriented vs. column oriented
  11. Parquet 101: Columnar storage
      > Homogeneous, collocated data
      > Better range access
      > Better encoding
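The three points on the slide above can be sketched in plain Scala. This is only an illustration, not Parquet's actual implementation: the `Genotype` record and its fields are hypothetical, and run-length encoding stands in for the family of encodings a columnar format can apply once the values of a field sit next to each other.

```scala
object ColumnarSketch {
  // Hypothetical record type, used only for illustration.
  final case class Genotype(sample: String, chromosome: String, position: Long, alleleCount: Int)

  // Column-oriented layout: one homogeneous sequence per field.
  final case class Columns(
      sample: Vector[String],
      chromosome: Vector[String],
      position: Vector[Long],
      alleleCount: Vector[Int])

  // Pivot a row-oriented collection into columns.
  def toColumns(rows: Seq[Genotype]): Columns =
    Columns(
      rows.map(_.sample).toVector,
      rows.map(_.chromosome).toVector,
      rows.map(_.position).toVector,
      rows.map(_.alleleCount).toVector)

  // Run-length encoding: consecutive repeats collapse into (value, count) pairs.
  def runLengthEncode[A](column: Seq[A]): Vector[(A, Int)] =
    column.foldLeft(Vector.empty[(A, Int)]) { (acc, value) =>
      acc.lastOption match {
        case Some((prev, n)) if prev == value => acc.init :+ ((prev, n + 1))
        case _                                => acc :+ ((value, 1))
      }
    }
}
```

A query that only needs `position` reads a single vector, and a column such as `chromosome`, which repeats the same value over long runs, shrinks to a handful of pairs.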
  12. Parquet 101: Efficient encoding of nested typed structures
      message Document {
        required int64 DocId;
        optional group Links {
          repeated int64 Backward;
          repeated int64 Forward;
        }
        repeated group Name {
          repeated group Language {
            required string Code;
            optional string Country;
          }
          optional string Url;
        }
      }
  13. Parquet 101: Efficient encoding of nested typed structures
      (same Document schema as slide 12)
      - Nested structure → Tree
      - Empty levels → Branch pruning
      - Repetitions → Metadata (index)
      - Types → Safe/Fast codec
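One way to read the `Document` schema is as a tree of Scala case classes. The sketch below assumes the mapping required → plain field, optional → `Option`, repeated → `Seq`. The class names mirror the schema, but the helper functions and the sample values are illustrative only, and nothing here reproduces how Parquet actually stores the levels.

```scala
object DocumentSketch {
  // required -> plain field, optional -> Option, repeated -> Seq
  final case class Links(backward: Seq[Long], forward: Seq[Long])
  final case class Language(code: String, country: Option[String])
  final case class Name(language: Seq[Language], url: Option[String])
  final case class Document(docId: Long, links: Option[Links], name: Seq[Name])

  // Values of the leaf column Name.Language.Code, in document order.
  def languageCodes(doc: Document): Seq[String] =
    for {
      name <- doc.name
      language <- name.language
    } yield language.code

  // Values of the optional leaf Name.Language.Country; absent values are skipped.
  def countries(doc: Document): Seq[String] =
    doc.name.flatMap(_.language).flatMap(_.country)
}
```

Each leaf of the tree (`DocId`, `Links.Forward`, `Name.Language.Code`, ...) corresponds to one column. An empty branch, such as a `Name` without any `Language`, contributes no value to the columns below it.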
  14. Parquet 101: Efficient encoding of nested typed structures
      ref: https://blog.twitter.com/2013/dremel-made-simple-with-parquet
  15. Parquet 101: Optimized distributed storage (e.g. in HDFS)
      ref: http://grepalex.com/2014/05/13/parquet-file-format-and-object-model/
  16. Parquet 101: Efficient (schema-based) serialization: Avro
      JSON schema:
        {
          "namespace": "example.avro",
          "type": "record",
          "name": "User",
          "fields": [
            {"name": "name", "type": "string"},
            {"name": "favorite_number", "type": ["int", "null"]},
            {"name": "favorite_color", "type": ["string", "null"]}
          ]
        }
      IDL:
        record User {
          string name;
          union { null, int } favorite_number = null;
          union { null, string } favorite_color = null;
        }
  17. Parquet 101: Efficient (schema-based) serialization: Avro
      The JSON schema (same User schema as slide 16) is part of the:
      ● protocol
      ● serialization → less metadata
      Define: IDL → JSON
      Send: Binary → JSON
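To make "less metadata" concrete, here is a hand-written sketch of how a `User` record could be laid out in binary. It assumes the encoding rules of the Avro specification (zig-zag variable-length integers, length-prefixed UTF-8 strings, a branch index in front of each union value, nothing for null) and the union order of the JSON schema, where `int` and `string` are branch 0. It does not use the Avro library, and the object and method names are made up for this example.

```scala
import java.io.ByteArrayOutputStream
import java.nio.charset.StandardCharsets

object AvroBinarySketch {
  final case class User(name: String, favoriteNumber: Option[Int], favoriteColor: Option[String])

  // Zig-zag maps small negative and positive numbers to small unsigned ones.
  def zigZag(n: Long): Long = (n << 1) ^ (n >> 63)

  // Variable-length encoding: 7 bits per byte, high bit set while more bytes follow.
  def writeLong(out: ByteArrayOutputStream, n: Long): Unit = {
    var v = zigZag(n)
    while ((v & ~0x7FL) != 0L) {
      out.write(((v & 0x7F) | 0x80).toInt)
      v >>>= 7
    }
    out.write(v.toInt)
  }

  // A string is its UTF-8 byte length followed by the bytes.
  def writeString(out: ByteArrayOutputStream, s: String): Unit = {
    val bytes = s.getBytes(StandardCharsets.UTF_8)
    writeLong(out, bytes.length.toLong)
    out.write(bytes, 0, bytes.length)
  }

  // Fields are written in schema order, without names or type tags.
  def encode(user: User): Array[Byte] = {
    val out = new ByteArrayOutputStream()
    writeString(out, user.name)
    user.favoriteNumber match { // union ["int", "null"]
      case Some(n) => writeLong(out, 0L); writeLong(out, n.toLong)
      case None    => writeLong(out, 1L) // null has an empty body
    }
    user.favoriteColor match { // union ["string", "null"]
      case Some(c) => writeLong(out, 0L); writeString(out, c)
      case None    => writeLong(out, 1L)
    }
    out.toByteArray
  }
}
```

Under these assumptions `User("Alyssa", Some(256), None)` takes 11 bytes, because field names and types live in the schema rather than in every record.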
  18. ADAM
      Credits: AMPLab (UC Berkeley)
  19. ADAM: Overview (Sequencing)
      - DNA is a molecule
        …or a Seq[Char] over the (A, T, G, C) alphabet
  20. ADAM: Sequencing
      - Massively parallel sequencing of random reads of 100-150 bases (20,000,000 reads per genome)
      - 30-60x coverage for quality
      - All this mess must be re-organised! → ADAM
  21. ADAM: Variant Calling
      - From an organized set of reads (ADAM Pileup)
      - Detect variants (Variant Calling)
      → AVOCADO
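The pileup idea can be shown with a toy example in plain Scala. Everything here is a simplification made for illustration: `AlignedRead` is a hypothetical type, reads are assumed to be already aligned without gaps, and the majority vote is a deliberately naive stand-in, not the algorithm used by ADAM or AVOCADO.

```scala
object PileupSketch {
  // Hypothetical aligned read: 0-based start position on the reference plus its bases.
  final case class AlignedRead(start: Int, bases: String)

  // Pileup: for each reference position, the bases observed in the reads covering it.
  def pileup(reads: Seq[AlignedRead]): Map[Int, Seq[Char]] =
    reads
      .flatMap(r => r.bases.zipWithIndex.map { case (base, i) => (r.start + i, base) })
      .groupBy(_._1)
      .map { case (pos, hits) => pos -> hits.map(_._2) }

  // Most frequent base at a position.
  private def majority(bases: Seq[Char]): Char =
    bases.groupBy(identity).maxBy(_._2.size)._1

  // Naive call: report a position when it is covered by at least minDepth reads
  // and the most frequent observed base differs from the reference base.
  def callVariants(reference: String, reads: Seq[AlignedRead], minDepth: Int): Map[Int, Char] =
    pileup(reads).toSeq.flatMap { case (pos, bases) =>
      val top = majority(bases)
      val covered = pos >= 0 && pos < reference.length && bases.size >= minDepth
      if (covered && top != reference(pos)) Some(pos -> top) else None
    }.toMap
}
```

Grouping by position is the step that turns the "mess" of random reads into something organized. On real data that grouping is the expensive, distributed part.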
  22. ADAM: Genomics specifications
      - SAM, BAM, VCF
      - Indexable
      - libraries
      - ~ scalable: hadoop-bam
  23. ADAM: ADAM model
      - schema based (Avro), libraries are generated
      - no storage spec here!
  24. ADAM: ADAM model
      - Parquet storage
      - evenly distribute data
      - storage optimized for read/query
      - better compression
  25. ADAM: ADAM API
      - AdamContext provides functions to read from HDFS
  26. ADAM: ADAM API
      - Scala classes generated from Avro
      - Data loaded as RDDs (Spark’s Resilient Distributed Datasets)
      - functions on RDDs (write to HDFS, genomic object manipulations)
  27. ADAM: ADAM API
      - e.g. reading genotypes
  28. ADAM: ADAM Benchmark
      - It scales!
      - Data is more compact
      - Read performance is better
      - Code is simpler
  29. Stratification using 1000Genomes
      As usual… let’s get some data. Genomes relate to health and are private. Still, there are options!
  30. Stratification using 1000Genomes
      http://www.1000genomes.org/ (nowadays targeting 2000 genomes)
      ref: http://upload.wikimedia.org/wikipedia/en/e/eb/Genetic_Variation.jpg
  31. Stratification using 1000Genomes
  32. Stratification using 1000Genomes
  33. Stratification using 1000Genomes
      Study genetic variations in populations (healthcare applications need more contextual data).
      To validate the interest of ADAM, we’ll do some qualitative exploration of the data.
      Question: is it possible to predict which subpopulation a given genome belongs to?
  34. Stratification using 1000Genomes
      We can run an unsupervised algorithm on a massive number of genomes.
      The idea is to find clusters that match subpopulations.
      This matters because the clusters reflect population histories: gene flows, selection, ...
  35. Stratification using 1000Genomes
      From the 200 TB of data, we’ll focus on the 6th chromosome, and actually only on its variants.
      ref: http://en.wikipedia.org/wiki/Chromosome
  36. Genome Data: Data structure
  37. Genome Data: Data structure
      Panel: Map[SampleID, Population]
  38. Genome Data: Data structure
      Genotypes in VCF format: basically a text file. Ours were downloaded from S3 and converted to ADAM Genotypes.
  39. Machine Learning model: Clustering with KMeans
      ref: http://en.wikipedia.org/wiki/K-means_clustering
  40. Machine Learning model: Clustering with KMeans
      PreProcess = {A,C,T,G}² → {0,1,2}
      Space = {0,1,2}¹⁷⁰⁰⁰⁰⁰⁰⁰
      Distance = Euclidean (L2) ⁽*⁾
      ⁽*⁾ MLlib restriction, although here L2 ~ L1 (SPARK-3012)
      ref: http://en.wikipedia.org/wiki/K-means_clustering
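A possible reading of the preprocessing step, sketched in plain Scala: each genotype (a pair of alleles) is mapped to the number of alleles that differ from the reference. That interpretation of {A,C,T,G}² → {0,1,2} is an assumption, as are the function names. The two distance functions are there to look at the L2 ~ L1 remark: when two samples differ by at most 1 in every coordinate, the squared Euclidean distance equals the L1 distance.

```scala
object GenotypeEncoding {
  // Number of alleles that differ from the reference allele: 0, 1 or 2 (assumed encoding).
  def encode(reference: Char, allele1: Char, allele2: Char): Int =
    Seq(allele1, allele2).count(_ != reference)

  // Euclidean (L2) distance between two encoded samples.
  def euclidean(a: Seq[Int], b: Seq[Int]): Double = {
    require(a.length == b.length, "vectors must have the same dimension")
    math.sqrt(a.zip(b).map { case (x, y) => ((x - y) * (x - y)).toDouble }.sum)
  }

  // Manhattan (L1) distance between two encoded samples.
  def manhattan(a: Seq[Int], b: Seq[Int]): Double = {
    require(a.length == b.length, "vectors must have the same dimension")
    a.zip(b).map { case (x, y) => math.abs(x - y).toDouble }.sum
  }
}
```

With one coordinate per variant, a sample becomes a point in the {0,1,2}ⁿ space named on the slide.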
  41. Machine Learning model: MLlib, KMeans
      MLlib:
      ● Machine Learning Algorithms
      ● Data structures (e.g. Vector)
  42. Machine Learning model: MLlib KMeans
      DataFrame Map:
      ● key = Sample
      ● value = Vector of Genotype alleles (sorted by Variant)
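The talk relies on MLlib's KMeans. As a stand-in that runs without Spark, here is a minimal k-means (Lloyd's iterations) in plain Scala over the same shape of data: a map from sample to vector. It is a sketch under stated assumptions: the initial centroids are passed in explicitly, the number of iterations is fixed, and none of the names below are MLlib API.

```scala
object KMeansSketch {
  type Point = Vector[Double]

  def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the centroid closest to the point (first one wins on ties).
  def closest(centroids: Seq[Point], p: Point): Int =
    centroids.indices.minBy(i => squaredDistance(centroids(i), p))

  // Coordinate-wise mean of a non-empty set of points.
  def mean(points: Seq[Point]): Point =
    points.transpose.map(col => col.sum / col.size).toVector

  // Lloyd's iterations: assign each point to its closest centroid, then move
  // each centroid to the mean of its points. Empty clusters keep their centroid.
  def fit(points: Seq[Point], initial: Seq[Point], iterations: Int): Seq[Point] =
    (1 to iterations).foldLeft(initial) { (centroids, _) =>
      val byCluster = points.groupBy(p => closest(centroids, p))
      centroids.indices.map(i => byCluster.get(i).map(mean).getOrElse(centroids(i)))
    }

  // Cluster index for every sample of the map.
  def predict(centroids: Seq[Point], samples: Map[String, Point]): Map[String, Int] =
    samples.map { case (id, p) => id -> closest(centroids, p) }
}
```

Calling `fit(samples.values.toSeq, initial, iterations)` and then `predict(centroids, samples)` yields one cluster index per sample, which is the input of the mashup step.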
  43. Mashup: prediction
      Sample [NA20332] is in cluster #0 for population Some(ASW)
      Sample [NA20334] is in cluster #2 for population Some(ASW)
      Sample [HG00120] is in cluster #2 for population Some(GBR)
      Sample [NA18560] is in cluster #1 for population Some(CHB)
  44. Mashup
           #0  #1  #2
      GBR   0   0  89
      ASW  54   0   7
      CHB   0  97   0
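The population-by-cluster table can be derived by joining the panel with the predicted clusters. The sketch below is plain Scala with hypothetical names (`crossTable`, `purity`). The purity measure is an addition for illustration and was not part of the talk.

```scala
object MashupSketch {
  // Count samples per (population, cluster) pair; samples missing from the panel are skipped.
  def crossTable(panel: Map[String, String], clusters: Map[String, Int]): Map[(String, Int), Int] =
    clusters.toSeq
      .flatMap { case (sample, cluster) => panel.get(sample).map(pop => (pop, cluster)) }
      .groupBy(identity)
      .map { case (key, hits) => key -> hits.size }

  // Share of samples that fall in the dominant cluster of their population.
  def purity(table: Map[(String, Int), Int]): Double = {
    val dominant = table.groupBy(_._1._1).values.map(_.values.max).sum
    dominant.toDouble / table.values.sum
  }
}
```

Applied to the counts of the table above, 240 of the 247 samples (89 + 54 + 97) land in the dominant cluster of their population, a purity of about 97%.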
  45. Cluster
      4 m3.xlarge instances (EC2): 16 cores + 60G
  46. Cluster: Performance
  47. Cluster
      40 m3.xlarge: 160 cores + 600G
  48. Conclusions and future work
      ● ADAM and Spark provide tools to manipulate genomics data in a scalable way
      ● Simple APIs in Scala
      ● MLlib for machine learning
      → implement less naïve algorithms
      → cross medical and environmental data with genomes
  49. Acknowledgements
      Scala.IO, AMPLab, Matt Massie, Frank Nothaft, Vincent Botta
  50. That’s all Folks
      Apparently, we’re supposed to stay on stage
      Waiting for questions
      Hoping for none
      Looking at the bar
      And the lunch
      Oh there are beers
      And candies
      who can read this?
