"Petascale Genomics with Spark", Sean Owen, Director of Data Science at Cloudera
YouTube Link: https://www.youtube.com/watch?v=HY93FdK5i60
About the Author:
Sean is Director of Data Science at Cloudera, based in London. Before Cloudera, he founded Myrrix Ltd, a company commercializing large-scale real-time recommender systems on Apache Hadoop. He has been a primary committer and VP for Apache Mahout, and co-author of Mahout in Action. Previously, Sean was a senior engineer at Google. He holds an MBA from the London Business School and a BA in Computer Science from Harvard.
Before we dive in, let me ask a couple of questions:
Biologists?
Spark experts?
There are always at least three different constituencies in the room:
* biologists
* programmers
* someone thinking about how to build a business around this
Gonna tell you a lot of lies today.
It won’t satisfy everyone, but where I skip over the truth, maybe at least a breadcrumb of it will be left over.
This will not be a very technical talk.
What even is genomics?
Who here has heard the terms ‘chromosome’ and ‘gene’ before, and could explain the difference?
So before we dive into the main part of the talk, I’m going to spend a few minutes discussing some of the basic biological concepts.
Your body is made of trillions of cells. But each of those cells has, ideally, an identical genome.
The genome is a collection of 23 linear molecules, the chromosomes. These are called ‘polymers’: they’re built, like Legos, out of a small number of repeated interlocking parts – these are the A, T, G, and C you’ve probably heard about.
The content of the genome is determined by the linear order in which these letters are arranged. (Linear is important!)
Without losing much, assume that our genomes are contained on just a single chromosome.
Now, not only do all the cells in your body have identical genomes…
[ADVANCE]
We can define a concept of ‘location’ across chromosomes.
This is possibly the most important concept in genome informatics, the idea that DNA defines a common linear coordinate system.
This also means that we can talk about differences between individuals in terms of diffs to a common reference genome.
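To make that coordinate-and-diff idea concrete, here’s a minimal sketch in Scala (all names here are mine, not from the talk) of a genomic location and an individual’s ‘diff’ against the reference:
```scala
object GenomeCoordinates {
  // A position on the shared linear coordinate system: chromosome + offset.
  case class Locus(chromosome: String, position: Long)

  // One difference from the reference genome at a given locus.
  case class Variant(locus: Locus, refAllele: String, altAllele: String)

  def main(args: Array[String]): Unit = {
    // e.g. a single-base substitution, A -> G, at position 1,234,567 on chr7
    val snp = Variant(Locus("chr7", 1234567L), "A", "G")
    println(snp)
  }
}
```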
But where does this reference genome come from?
Here is Bill Clinton (and Craig Venter and Francis Collins), announcing in June of 2000 the “rough draft” of the Human Genome – this is the Human Genome Project.
Took >10 years and $2 billion
What did this actually do?
Anyone recognize this?
Genome analogy: a text file containing part of the linear sequence of ACGTs.
Difficult to understand.
Mapmakers work to add ANNOTATIONS to the map.
And often, it’s only the annotations that are interesting, so mapmakers focus on *annotation* of the maps themselves.
The core technologies are 2D planar and spherical geometry, geometric operations composed out of latitudes and longitudes.
What does the annotated map of the genome look like?
Chromosome on top. Highlighted red portion is what we’re zoomed in on.
See the scale: total of about 600,000 bases (ACGTs) arranged from left to right.
Multiple annotation “tracks” are overlaid on the genome sequence, marking functional elements, positions of observed human differences, similarity to other animals.
In part it’s the product of numerous additional large biology annotation projects (e.g., HapMap project, 1000 Genomes, ENCODE).
Lots of bioinformatics is computing these annotations, or evaluating models on top of them.
How are these annotations actually generated? Shift gears and talk about the technology.
DNA SEQUENCING
If satellites provide images of the world for cartography, sequencers are the microscopes that give you “images” of the genome.
Over the past decade, massive EXPONENTIAL increase in throughput (much faster than Moore’s law)
Bioinformatics is the computational process to reconstruct the genomic information. But…
[ADVANCE]
Pipelines, of course.
Example pipeline: raw sequencing data => a single individual’s “diff” from the reference.
How are these typically structured?
Each step is typically written as a standalone program – passing files from stage to stage
These are written as part of a globally-distributed research program, by researchers and grad students around the world, who have to assume the lowest common denominator: command line and filesystem
What does one of these files look like?
Bioinformaticians LOVE hand-coded file formats.
But they only store a few fundamental data types.
Strong assumptions are baked into the formats, with inconsistent implementations in multiple languages.
They don’t allow different storage backends.
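For a flavor of what these hand-rolled formats involve, here’s a toy Scala parser for a single VCF-style variant line. (Real VCF has headers, multi-allelic sites, INFO/FORMAT fields, and many more rules; this sketch is illustrative only.)
```scala
object VcfLineParser {
  case class VcfRecord(chrom: String, pos: Long, id: String,
                       ref: String, alt: String, qual: Option[Double])

  // Parse one tab-separated VCF data line. Note the baked-in assumptions:
  // column order, "." as a missing value, one record per line.
  def parse(line: String): VcfRecord = {
    val f = line.split("\t")
    VcfRecord(f(0), f(1).toLong, f(2), f(3), f(4),
      if (f(5) == ".") None else Some(f(5).toDouble))
  }

  def main(args: Array[String]): Unit = {
    println(parse("chr7\t1234567\trs123\tA\tG\t99.0\tPASS\t."))
  }
}
```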
OK, we discussed what the data/files are like that are passed around. What about the computation itself?
The formats impose a severe constraint: a global sort invariant. Many implementations depend on this, even when it’s not necessary or conducive to distributed computing.
But what if we jump into one of these functions. You’ll find a dependence on…
[ADVANCE]
Most bioinformatics tools make strong assumptions about their environments, and also the structure of the data (e.g., global sort), when it shouldn’t be necessary.
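To make the sort dependence concrete, here’s a toy sketch (mine, not from any real tool): a one-pass overlap count that is only correct if reads arrive globally sorted by start position – exactly the kind of hidden invariant these tools bake in.
```scala
object SortedOnlyCoverage {
  case class Read(start: Long, end: Long) // half-open interval on one chromosome

  // Count reads overlapping a target position in one pass. Correct only if
  // `reads` is sorted by start: we stop at the first read starting past the
  // target. Unsorted input silently undercounts.
  def coverageAt(reads: Iterator[Read], target: Long): Int =
    reads.takeWhile(_.start <= target).count(r => target < r.end)

  def main(args: Array[String]): Unit = {
    val sorted = Iterator(Read(10, 60), Read(20, 40), Read(55, 90))
    println(coverageAt(sorted, 30)) // 2: the first two reads cover position 30
  }
}
```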
Ok, but that’s not all…
[ADVANCE]
We’ve looked at the data and a bit of code for one of these tools. But this runs the pipeline on a single individual.
But of course, it’s never one pipeline…
[ADVANCE]
Scale out!
Typically managed with a pretty low-level job scheduler.
SCALE!
New levels of ambition for large biology projects.
100k genomes at Genomics England in collaboration with National Health Service.
Raw data for a single individual can be in the hundreds of GB
But even before we hit that huge scale (which is soon)…
We don’t want to analyze each sample separately. We want to use ALL THE DATA we generate.
Well, these pipelines often include lots of aggregation, perhaps we can just…
[ADVANCE]
Do the easy thing! Not ideal, especially as the amount of data goes up (data transfer) and the number of files increases (file handles). You may start hitting the cracks.
But even worse…
[ADVANCE]
God help you if you want to jointly use all the data in an earlier part of the pipeline.
Two problems:
* large scale
* using all data simultaneously
Things like global sort order are overly restrictive and lead to algorithms relying on them when it’s not necessary.
An example of such an algorithm: bioinformatics loves evaluating probabilistic models along the chromosomes.
We can easily extract parallelism at different parts of our pipelines.
Use higher level distributed computing primitives and let the system figure out all the platform issues for you: storage, job scheduling, fault tolerance, shuffles.
Cheap scalable STORAGE at bottom
Resource management middle
EXECUTION engines that can run your code on the cluster and provide parallelism
Consistent SERIALIZATION framework
Scientists should NOT WORRY about lower levels (coordination, file formats, storage details, fault tolerance)
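As a hedged sketch of what those higher-level primitives buy you – using today’s Spark SQL API, with a made-up path and schema – here’s a per-chromosome variant count where the shuffle, task scheduling, and fault tolerance are all the framework’s problem:
```scala
import org.apache.spark.sql.SparkSession

object VariantCounts {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("variant-counts").getOrCreate()

    // Hypothetical Parquet dataset of variants with a `chromosome` column;
    // the load is automatically split across the cluster.
    val variants = spark.read.parquet("hdfs:///genomics/variants.parquet")

    // One declarative aggregation; the cluster-wide shuffle, scheduling,
    // and retries on failure all happen under the hood.
    variants.groupBy("chromosome").count().show()

    spark.stop()
  }
}
```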
Another computation for a statistical aggregate on genome variant data. Details not important.
Spark data flow:
* distributed data load
* high-level joins/spatial computations that are parallelized as necessary (sketched below)
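A minimal sketch of that data flow (paths and schemas are invented for illustration): two distributed loads, then a high-level join keyed on the shared genomic coordinate system, with Spark planning the parallelism.
```scala
import org.apache.spark.sql.SparkSession

object VariantAnnotationFlow {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("variant-flow").getOrCreate()

    // Distributed load: each dataset is split across the cluster.
    val variants    = spark.read.parquet("hdfs:///genomics/variants.parquet")
    val annotations = spark.read.parquet("hdfs:///genomics/annotations.parquet")

    // High-level join on (chromosome, position); Spark decides whether to
    // shuffle or broadcast the smaller side.
    val annotated = variants.join(annotations, Seq("chromosome", "position"))
    annotated.write.parquet("hdfs:///genomics/annotated-variants.parquet")

    spark.stop()
  }
}
```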
But the really nice thing is that, because our data is stored using the Avro data model…
[ADVANCE]
We’ve implemented this vision with Spark, starting from the AMPLab (the same people who gave you Spark), in a project called
ADAM
The reason this works is that Spark naturally handles pipelines, and automatically performs shuffles when appropriate, but also…
In addition to some of the standard pipeline transformations, ADAM implements the core spatial join operations (analogous to a geospatial library).
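ADAM’s real region join is more sophisticated, but a minimal broadcast-style sketch of the idea – overlapping intervals on the genomic coordinate line; all names here are mine – looks like this:
```scala
import org.apache.spark.sql.SparkSession

object BroadcastRegionJoin {
  case class Region(chrom: String, start: Long, end: Long) // half-open

  def overlaps(a: Region, b: Region): Boolean =
    a.chrom == b.chrom && a.start < b.end && b.start < a.end

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("region-join").getOrCreate()
    val sc = spark.sparkContext

    // Small annotation track: ship a copy to every executor.
    val genes = sc.broadcast(Seq(Region("chr7", 1000, 5000),
                                 Region("chr7", 9000, 12000)))

    // Pretend these reads were loaded from a large distributed dataset.
    val reads = sc.parallelize(Seq(Region("chr7", 4800, 4950),
                                   Region("chr7", 7000, 7150)))

    // Region join: pair each read with every gene it overlaps.
    val hits = reads.flatMap(r => genes.value.filter(g => overlaps(r, g))
                                             .map(g => (r, g)))
    hits.collect().foreach(println)

    spark.stop()
  }
}
```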
Single-node performance improvements.
Free scalability: fixed price, significant wall-clock improvements
See most recent SIGMOD.
Not to be outdone, Craig Venter proposes 1 million genomes at Human Longevity Inc.
Cloudera is hiring.
Including the data science team.