
Spark London Meetup: Share and analyse genomic data at scale with Spark, Adam, Tachyon and the Spark Notebook


Genomics and health data is now one of the hot topics requiring heavy computation, especially machine learning. This helps a science with very relevant societal impact to get even better outcomes. That is why Apache Spark and the ADAM library built on it are a must-have.


This talk will be twofold.


First, we'll show how Apache Spark, MLlib and ADAM can be plugged together to extract information from even huge and wide genomics datasets. Everything will be packed into examples from the Spark Notebook, showing how bio-scientists can work interactively with such a system.
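
One of the statistics the slides compute (slides 32 and 38) is the allele frequency per variant. The following is a minimal sketch of that computation in plain Python, not the Spark/ADAM code from the talk's notebooks. The function names, the `(variant_id, alleles)` record shape and the toy data are all assumptions made for illustration.

```python
from collections import Counter


def allele_frequencies(genotypes):
    """Return {variant_id: {allele: frequency}}.

    `genotypes` is an iterable of (variant_id, alleles) pairs, where
    `alleles` holds the alleles called for one sample at that variant.
    This record shape is hypothetical, not ADAM's Genotype schema.
    """
    counts = {}
    for variant_id, alleles in genotypes:
        counts.setdefault(variant_id, Counter()).update(alleles)

    freqs = {}
    for variant_id, counter in counts.items():
        total = sum(counter.values())
        freqs[variant_id] = {a: n / total for a, n in counter.items()}
    return freqs


def minor_allele_frequency(freq):
    """Frequency of the second most common allele (0.0 if only one allele)."""
    ordered = sorted(freq.values(), reverse=True)
    return ordered[1] if len(ordered) > 1 else 0.0


# Toy data: four diploid samples at "rs1", two at "rs2" (made-up identifiers).
toy = [
    ("rs1", ("A", "A")),
    ("rs1", ("A", "G")),
    ("rs1", ("G", "G")),
    ("rs1", ("A", "A")),
    ("rs2", ("C", "C")),
    ("rs2", ("C", "C")),
]
freqs = allele_frequencies(toy)
mafs = {v: minor_allele_frequency(f) for v, f in freqs.items()}
```

On a cluster the same count-then-normalise logic would be expressed as a grouped aggregation over the genotype records; this sketch only shows the arithmetic.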

Second, we'll explain how these methodologies and even the datasets themselves can be shared at very large scale between remote entities like hospitals or laboratories using micro services leveraging Apache Spark, ADAM, Play Framework 2, Avro and Tachyon.


  1. Share and analyse genomic data at scale with Spark, Adam, Tachyon and the Spark Notebook, by Data Fellas, Spark London Meetup, July 1st ’15
  2. Outline. PART I: Adam: genomics on Spark; 1K Genomes in Adam on S3; Explore: Compute Stats; Learn: train a model. PART II: GA4GH: Standard for Genomics; med-at-scale project; Explore: using Standards; Create custom micro services
  3. Andy Petrella (@noootsab): Maths, Scala, Apache Spark, Spark Notebook, Trainer, Data Banana. Xavier Tordoir (@xtordoir): Physics, Bioinformatics, Scala, Spark
  4. PART I Spark & Genomics. Adam: genomics on Spark; 1K Genomes in Adam on S3; Explore: Compute Stats; Learn: train a model. So that’s the thing that separates us?
  5. Adam: What is genomics data? Okay, sounds good. Give me two of them! Genome is an important factor in health: Medical Diagnostics, Drug response, Disease mechanisms, …
  6. Adam: What is genomics data? You mean devs are slacking off? On the data production: Fast biotech progress. Not so fast IT progress?
  7. Adam: What is genomics data? No! They’re just sticky bubbles... On the data production: Sequence {A, T, G, C}, 3 billion bases
  8. Adam: What is genomics data? Okay, a lot of bubbles. On the data production: Sequence {A, T, G, C}, 3 billion bases … x 30 (x 60?)
  9. Adam: What is genomics data? C’mon. A big mess of plenty of lil’ bubbles then. On the data production: massively parallel. Sequence {A, T, G, C}, 3 billion bases … x 30 (x 60?)
  10. Adam: What is genomics data? Ah, that explains why the black bars are different
  11. Adam: What is genomics data? Dude... Tens of millions
  12. Adam: What is genomics data? Staaaaaaph. Tens of millions, 1000’s, 1,000,000’s, …
  13. Adam: What is genomics data? ‘coz it makes sparkling bubbles, right? Ok, looks like Apache Spark makes a lot of sense here …
  14. Adam: An understandable model. Well done, a spec as text in a PDF…
  15. Adam: An understandable model. Take that
  16. Adam: An understandable model. Dunno what a Genotype is, but it contains a Variant. Apparently.
  17. Adam: An understandable model. Yeaaah: generate client == more slack. Adam provides an Avro schema
  18. Adam: An efficient storage. Machismo in I.T., what a flaw! ● Distribute data ● Schema based ● Read/query efficient ● Compact
  19. Adam: An efficient storage. That’s a quick step. ● Distribute data ● Schema based ● Read/query efficient ● Compact. PARQUET!
  20. Adam: An efficient storage. Is Eve okay to use the parquet for that? ● Distribute data ● Schema based ● Read/query efficient ● Compact. PARQUET! Adam provides Parquet as storage format
  21. Adam: A clean API. Object Wrappedy. Adam Context
  22. Adam: A clean API. I could have done this as a one liner. Adam Context IO methods
  23. Adam: A clean API. At least, it’s going to be simpler than the chemistry. ● Scala classes generated from Avro ● Data loaded as RDDs ● Functions on RDDs ○ write to HDFS ○ genomic object manipulations ○ primitives to query genomics datasets
  24. Adam: Part of a pipeline. human | Seq | SNAP | Avocado | Adam | Ga4gh. ADAM is a JVM library leveraging Spark, Avro and Parquet. It still needs to be combined with sources (SNAP). Adam data is part of processes (Avocado). It can also be the source for external processing and learning (like MLlib).
  25. Thousand Genomes: Open Data Set. Games without Frontiers. 1000 genomes: http://www.1000genomes.org/
  26. Thousand Genomes: Produces BAMs, VCFs, ... Why do you complain, they are compressed …
  27. Thousand Genomes: Where are the data? DNA Russian roulette: which is fastest? ● EBI FTP: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/ ● NCBI FTP: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/ ● S3: http://aws.amazon.com/1000genomes/ ● GS: gs://genomics-public-data/ftp-trace.ncbi.nih.gov/1000genomes/ftp
  28. Thousand Genomes: Adam that shit on S3. Hmmm, like in the good old days of HPC. The bad part … ● get the vcf.gz file on local disk (& time for a coffee) ● uncompress (& go for lunch) ● put in HDFS (& take dessert)
  29. Thousand Genomes: Adam that shit on S3. What? No grappa? The good part … the Notebook (this one)
  30. Thousand Genomes: Adam that shit on S3. Okay, good enough to wait a bit… What did we gain? ● Before: 152 GB (gzipped) in 23 files ● After: 71 GB in 9172 partitions (43,372,735,220 genotypes)
  31. Explore Genomics: Access the data. Just in case you don’t believe us -_-’ Access data from this notebook
  32. Explore Genomics: Compute statistics. We’re there to compute, right? Compute Freqs from this Spark Notebook
  33. Learn Genomics: The problem. Insane, you’ll have a hard time with me |:-[ How to deal with heterogeneous data? ● Population stratification ● Identify natural clusters ● Assign genomes to these clusters
  34. Learn Genomics: The dimensions. Wiiiiiiiiiiiiiiiiide rows. ● 1000 Samples (rows) ● 30,000,000 variants (columns or variables). Hard to explore such a feature space…
  35. Learn Genomics: The dimensions. *LDA for Latent Dirichlet Allocation… Dimensionality reduction? ● Ideal would be a “Genetic” Mixture measure (LDA* would do that…) ● Or a genetic distance (edit distance). KMeans & distances to centroids
  36. Learn Genomics: The model. Reduce, train, validate, infer. ● Split training/validation set ● Train KMeans with 25 clusters ● Compute distances to each centroid as new features ● Train Random Forest ● Validation
  37. Learn Genomics: The notebook. Define and train the model in this Notebook. The whole shebang?
  38. Adam: Our pipeline. I am a Llama. Convert VCFs to ADAM. Store ADAM to S3. Compute allele frequencies. Store allele frequencies to S3. Compute Minor Allele Frequency distribution. Train a model for stratification. Hmmm… quite some missing pieces, right?
  39. PART II Standards & Micro Services. Wake up! GA4GH: Standard for Genomics; med-at-scale project; Explore: using Standards; Create custom micro services
  40. GA4GH: Let’s fix the baseline. In I.T. it’s easy, everything is standardized… Global Alliance for Genomics and Health: http://genomicsandhealth.org/ http://ga4gh.org/ Framework for responsible data sharing. ● Define schemas ● Define services. Along with ethical, legal, security and clinical aspects
  41. GA4GH: Models. … everybody has his own standard
  42. GA4GH: Services. But a shared schema is a bit better!
  43. GA4GH: Metadata. The data of my data is also my data. Work in progress: ● Individual ● Sample ● Experiment ● Dataset ● IndividualGroup ● Analysis. But still very young and too centered on data. Beacon ⁽*⁾: tells the world you have data. Clearly not enough
  44. Med At Scale, by Data Fellas. Existing scalable implementation: Google Genomics. Uses ● BigQuery ● Google cloud computing ● Dremel ● … That’s what happens when you think you have…
  45. Med At Scale, by Data Fellas. Google Genomics is pushing hard …
  46. Med At Scale: Scalability first. BIG. There is another scalable implementation: Med At Scale, by Data Fellas. Uses ● Apache Spark ● Adam ● S3 ● HDFS ● …
  47. Med At Scale: Scalability first. Data Fellas is pushing TOO. BIG
  48. Med At Scale: Composability. Very BIG. GA4GH defines quite some methods, or services. They don’t all have the same requirements in terms of exposure and data processing → micro services for the win. Allows granular deployment and composition/chaining of methods to answer a global question
  49. Med At Scale: Customization. Data Fellas is a data science company, thus our goal is to expose data analyses. A data analysis is ● elaborated in a notebook ● validated on a cluster ● deployed as a micro service itself. Still defining a Schema and Service. VERY VERY BIG
  50. Med At Scale: Ready for the load. Balls! We saw that one row has 30,000,000 columns. The queries are slicing and dicing those columns → views are huge. Hence, Tachyon via RDD.persist/save will optimize the collocated queries in space and time. The hard part is (will be) to size the Tachyon cluster
  51. Med At Scale: Ad Hoc Analytics. Who left the rats out? Standards are very important. However, they cannot define everything, mostly OLAP. Ad hoc analytics are thus allowed on the raw data using Apache Spark directly. Of course, interactivity is a key to performance… hence the Spark Notebook is involved.
  52. Med At Scale: How it works. Finally…
  53. Med At Scale: ADAM (and Spark). Finally…
  54. Med At Scale: MLlib (and Spark). Finally…
  55. Med At Scale: Efficient binary data. Finally…
  56. Med At Scale: Micro Service. Finally…
  57. Med At Scale: Cache and Collaboration. Finally…
  58. Explore: Using GA4GH endpoints. Notebook TIME! Use the Scala/Java Avro client from the browser. I give you Bananas, you give me Ananas
  59. Customize: Create and use micro service (WIP). Planning the next gear. Remember the frequencies use case? There is a custom endpoint, manually created. We’re working on an integrated workflow. In a notebook: ● create the process ● create Cassandra schema ● persist (using connector) ● define service Avro IDL ● generate project for DCOS ● log usage (see next)
  60. Optimization: Query mining (Roadmap). Always look at the bright side. Back to the high dimensionality problem. Caching beforehand is a good solution but is not optimal. Plan: analyse the Request/Response objects and the gathered runtime metrics to adapt the caching policies, i.e. query mining processes
  61. References. Adam: https://github.com/bigdatagenomics/adam Bdg-Formats: https://github.com/bigdatagenomics/bdg-formats GA4GH website: http://genomicsandhealth.org/ GA4GH data working group: http://ga4gh.org/ Spark-Notebook: https://github.com/andypetrella/spark-notebook/ Med-At-Scale: https://github.com/med-at-scale/high-health Data Fellas: http://data-fellas.guru/
  62. Q/A⁽*⁾ THANKS! ⁽*⁾ or head to the pub (at least beers…)
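
Slides 35 and 36 describe reducing the very wide variant space by training KMeans and then using each sample's distance to every centroid as the new feature vector. The sketch below illustrates only that feature step, in plain Python. It is not the MLlib code from the talk's notebook. The helper names, the toy genotype vectors, `k=2` (the slides use 25 clusters) and the choice to seed centroids with the first `k` points are assumptions made to keep the example small and deterministic.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, iterations=10):
    """Plain Lloyd's algorithm; centroids start as the first k points (assumption)."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: distance(p, centroids[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def centroid_features(point, centroids):
    """Map a wide sample vector to k features: its distance to each centroid."""
    return [distance(point, c) for c in centroids]


# Toy samples: each row is one genome, each column a variant encoded as the
# count of alternate alleles (0, 1 or 2). Values are made up.
samples = [
    [0, 0, 1, 0, 0, 1],
    [0, 1, 1, 0, 0, 1],
    [2, 2, 0, 2, 1, 0],
    [2, 1, 0, 2, 2, 0],
]
centroids = kmeans(samples, k=2)
features = [centroid_features(s, centroids) for s in samples]
```

Each sample goes from six columns to two; with 30,000,000 variants and 25 clusters the same idea yields 25 features per genome, which is what a downstream classifier such as the Random Forest on slide 36 would train on.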
