HBaseCon 2013: Apache HBase, Apache Hadoop, DNA and YOU!

DNA, HBase, Hadoop, and
YOU!
by Jeremy Pollack

What does Ancestry.com do?
• Over 30,000 historical content collections
• 11 billion records and images
• Records dating back to 16th century
• 4 petabytes
We are the world's largest online family history resource.

It’s the “eureka” moment of discovery that drives
our business!

DNA molecule 1 differs from DNA molecule 2 at a single base-pair location (a C/T polymorphism).
(http://en.wikipedia.org/wiki/Single-nucleotide_polymorphism)
What does Ancestry DNA do?
"Spit in a tube, pay $99, learn about your past"
• Decodes your family origins (ethnicity)
• Finds your long-lost relatives
• We have identified over four million
fourth cousins.
• The average customer has close to 30
fourth cousin matches.
• By examining these matches, we can
connect your family tree to those of
your distant relatives.
• Ancestry DNA has 120K+ samples, one of
the largest DNA databases in the world.
• About 690GB of data (uncompressed),
or about 6.2 MB per sample

What is GERMLINE?
• GERMLINE is an algorithm that finds hidden relationships
within a pool of DNA.
• GERMLINE also refers to the reference implementation of that
algorithm.
• You can find it here :
http://www1.cs.columbia.edu/~gusev/germline/

So what's the problem?
• GERMLINE (the implementation) was not meant to be used in an industrial setting.
  • Stateless
  • Single threaded
  • Prone to swapping
• GERMLINE performs poorly on large data sets.
• We were running up against its limitations.
• Put simply : GERMLINE couldn't scale.

[Chart: GERMLINE Run Times (in hours). X axis: number of samples, 2,500 to 60,000. Y axis: hours, 0 to 25.]

[Chart: Projected GERMLINE Run Times (in hours). X axis: number of samples, 2,500 to 122,500. Y axis: hours, 0 to 700. Series: GERMLINE run times; projected GERMLINE run times.]

The Mission : Create a Scalable Matching Engine
... and thus was born Jermline
(aka "Jermline with a J")

The Input
Starbuck : ACTGACCTAGTTGAC
Adama : TTAAGCCTAGTTGAC
Kara Thrace, aka Starbuck
• Ace viper pilot
• Has a special destiny
• Not to be trifled with
Admiral Adama
• Admiral of the Colonial Fleet
• Routinely saves humanity from destruction
• Not so great with model ships
0 1 2
Starbuck : ACTGA CCTAG TTGAC
Adama : TTAAG CCTAG TTGAC
Separate into words
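
The word-splitting step above can be sketched in a few lines. This is a minimal illustration, not the GERMLINE or Jermline code: the function name `split_into_words` is hypothetical, and the word size of 5 is simply what the slide's toy example uses (the real word size is not stated here).

```python
def split_into_words(sequence, word_size=5):
    """Split a sequence into fixed-size, non-overlapping words.

    word_size=5 mirrors the toy example on the slide; it is an
    assumption, not a documented GERMLINE parameter.
    """
    return [sequence[i:i + word_size]
            for i in range(0, len(sequence), word_size)]


starbuck_words = split_into_words("ACTGACCTAGTTGAC")
adama_words = split_into_words("TTAAGCCTAGTTGAC")
print(starbuck_words)  # ['ACTGA', 'CCTAG', 'TTGAC']
print(adama_words)     # ['TTAAG', 'CCTAG', 'TTGAC']
```

The index of each word in the returned list is the position (0, 1, 2) shown above the sequences.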

0 1 2
ACTGA_0 : Starbuck
TTAAG_0 : Adama
CCTAG_1 : Starbuck, Adama
TTGAC_2 : Starbuck, Adama
Build the hash table
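
As a rough sketch of the hash table above: each key is a word plus its position, and each value is the list of samples that have that word at that position. The function name `build_hash_table` and the word size of 5 are assumptions made for illustration only.

```python
from collections import defaultdict


def build_hash_table(samples, word_size=5):
    """Map "WORD_POSITION" keys to the samples carrying that word there.

    `samples` is a dict of sample name -> sequence. The key format
    follows the slide; everything else is a hypothetical sketch.
    """
    table = defaultdict(list)
    for name, sequence in samples.items():
        starts = range(0, len(sequence), word_size)
        for position, start in enumerate(starts):
            word = sequence[start:start + word_size]
            table[f"{word}_{position}"].append(name)
    return dict(table)


samples = {
    "Starbuck": "ACTGACCTAGTTGAC",
    "Adama": "TTAAGCCTAGTTGAC",
}
hash_table = build_hash_table(samples)
for key, names in hash_table.items():
    print(key, ":", ", ".join(names))
```

Any key with more than one sample attached marks a position where those samples share a word.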

Iterate through genome and find matches
0 1 2
ACTGA_0 : Starbuck
TTAAG_0 : Adama
CCTAG_1 : Starbuck, Adama
TTGAC_2 : Starbuck, Adama
Starbuck and Adama match from position 1 to position 2
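
One way to picture the matching pass: collect, for every pair of samples, the positions where they share a word, then merge consecutive positions into ranges. The sketch below assumes the toy hash table from the slides; `find_matches` is a hypothetical name, and the real GERMLINE algorithm does more than this simple merge.

```python
from collections import defaultdict
from itertools import combinations


def find_matches(table, num_positions):
    """Return {(sample_a, sample_b): [(first_pos, last_pos), ...]}."""
    shared = defaultdict(set)
    for key, names in table.items():
        position = int(key.rsplit("_", 1)[1])
        for pair in combinations(sorted(names), 2):
            shared[pair].add(position)

    matches = {}
    for pair, positions in shared.items():
        ranges, run_start = [], None
        # Walk one step past the end so an open run is always closed.
        for p in range(num_positions + 1):
            if p in positions:
                if run_start is None:
                    run_start = p
            elif run_start is not None:
                ranges.append((run_start, p - 1))
                run_start = None
        matches[pair] = ranges
    return matches


toy_table = {
    "ACTGA_0": ["Starbuck"],
    "TTAAG_0": ["Adama"],
    "CCTAG_1": ["Starbuck", "Adama"],
    "TTGAC_2": ["Starbuck", "Adama"],
}
print(find_matches(toy_table, num_positions=3))
# {('Adama', 'Starbuck'): [(1, 2)]}
```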

Does that mean they're related?
...maybe...

Baltar : TTAAGCCTAGGGGCG
But wait... what about Baltar?
Gaius Baltar
• Handsome
• Genius
• Kinda evil

Adding a new sample, the GERMLINE way

The GERMLINE Way
Step one : Rebuild the entire hash table from scratch, including the new sample
0 1 2
Baltar : TTAAG CCTAG GGGCG
ACTGA_0 : Starbuck
TTAAG_0 : Adama, Baltar
CCTAG_1 : Starbuck, Adama, Baltar
TTGAC_2 : Starbuck, Adama
GGGCG_2 : Baltar

The GERMLINE Way
Step two : Find everybody's matches all over again, including the new sample. (n x n comparisons)
Adama and Baltar match from position 0 to position 1
Starbuck and Baltar match at position 1

The GERMLINE Way
Step three : Now, throw away the evidence!
Adama and Baltar match from position 0 to position 1
Starbuck and Baltar match at position 1
You have done this before, and you will have to do it ALL OVER AGAIN.

Not so good, right?
Now let's take a look at the Jermline way.

The Jermline way
Step one : Update the hash table.
Already stored in HBase :
Row key      Starbuck   Adama
2_ACTGA_0    1          -
2_TTAAG_0    -          1
2_CCTAG_1    1          1
2_TTGAC_2    1          1
New sample to add :
Baltar : TTAAG CCTAG GGGCG
Key : [CHROMOSOME]_[WORD]_[POSITION]
Qualifier : [USER ID]
Cell value : A byte set to 1, denoting that the user has that word at that position on that chromosome
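
The incremental update can be sketched as follows. A plain nested dict stands in for the HBase table here (row key -> {qualifier: cell value}); no real HBase client calls are shown, and the function name `add_sample_to_hash_table` is hypothetical. Only the key, qualifier, and cell-value layout comes from the slide.

```python
def add_sample_to_hash_table(hash_table, chromosome, user_id, words):
    """Write one cell per word for a new sample; existing rows are kept.

    Row key   : [CHROMOSOME]_[WORD]_[POSITION]
    Qualifier : [USER ID]
    Value     : a single byte set to 1
    Returns the row keys that were touched.
    """
    touched = []
    for position, word in enumerate(words):
        row_key = f"{chromosome}_{word}_{position}"
        hash_table.setdefault(row_key, {})[user_id] = b"\x01"
        touched.append(row_key)
    return touched


# In-memory stand-in for what is "already stored in HBase".
stored = {}
add_sample_to_hash_table(stored, 2, "Starbuck", ["ACTGA", "CCTAG", "TTGAC"])
add_sample_to_hash_table(stored, 2, "Adama", ["TTAAG", "CCTAG", "TTGAC"])

# Adding Baltar touches only his three rows; nothing is rebuilt.
baltar_rows = add_sample_to_hash_table(
    stored, 2, "Baltar", ["TTAAG", "CCTAG", "GGGCG"])
print(baltar_rows)  # ['2_TTAAG_0', '2_CCTAG_1', '2_GGGCG_2']
```

The point of the layout is that a new sample costs one write per word, independent of how many samples are already stored.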

The Jermline way
Step two : Find matches.
Already stored in HBase :
Row key       2_Starbuck         2_Adama
2_Starbuck    -                  { (1, 2), ... }
2_Adama       { (1, 2), ... }    -
New matches to add :
Baltar and Adama match from position 0 to position 1
Baltar and Starbuck match at position 1
Key : [CHROMOSOME]_[USER ID]
Qualifier : [CHROMOSOME]_[USER ID]
Cell value : A list of ranges where the two users match on a chromosome
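
A sketch of this step, again with nested dicts standing in for the two HBase tables: look up only the new sample's rows, collect the positions shared with each existing user, merge them into ranges, and add those ranges to the match table. The names `find_new_matches` and `store_matches` are hypothetical, and writing each match under both users' rows is an assumption based on the symmetric layout shown on the slide.

```python
def find_new_matches(hash_table, chromosome, new_user, words):
    """Return {other_user: [(first_pos, last_pos), ...]} for one new sample."""
    shared = {}
    for position, word in enumerate(words):
        row = hash_table.get(f"{chromosome}_{word}_{position}", {})
        for other in row:
            if other != new_user:
                shared.setdefault(other, []).append(position)

    matches = {}
    for other, positions in shared.items():
        ranges = []
        start = prev = positions[0]
        for p in positions[1:]:  # positions are already ascending
            if p != prev + 1:
                ranges.append((start, prev))
                start = p
            prev = p
        ranges.append((start, prev))
        matches[other] = ranges
    return matches


def store_matches(match_table, chromosome, new_user, matches):
    """Record each match under both users' rows (assumed symmetric)."""
    new_key = f"{chromosome}_{new_user}"
    for other, ranges in matches.items():
        other_key = f"{chromosome}_{other}"
        match_table.setdefault(new_key, {})[other_key] = list(ranges)
        match_table.setdefault(other_key, {})[new_key] = list(ranges)


word_rows = {
    "2_ACTGA_0": {"Starbuck": b"\x01"},
    "2_TTAAG_0": {"Adama": b"\x01"},
    "2_CCTAG_1": {"Starbuck": b"\x01", "Adama": b"\x01"},
    "2_TTGAC_2": {"Starbuck": b"\x01", "Adama": b"\x01"},
}
match_rows = {
    "2_Starbuck": {"2_Adama": [(1, 2)]},
    "2_Adama": {"2_Starbuck": [(1, 2)]},
}

baltar_matches = find_new_matches(
    word_rows, 2, "Baltar", ["TTAAG", "CCTAG", "GGGCG"])
store_matches(match_rows, 2, "Baltar", baltar_matches)
print(baltar_matches)  # {'Adama': [(0, 1)], 'Starbuck': [(1, 1)]}
```

Existing matches, such as the one between Starbuck and Adama, are left untouched; only Baltar's ranges are added.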

But wait ... what about
Zarek, Roslin, Hera, and Helo?

Photo by Benh Lieu Song
Run them in parallel with Hadoop!

Parallelism with Hadoop
• Batches are usually about a thousand people.
• Each mapper takes a single chromosome for a single person.
• MapReduce Jobs :
  • Job #1 : Match Words
    • Updates the hash table
  • Job #2 : Match Segments
    • Identifies areas where the samples match
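
To make the "one mapper per chromosome per person" idea concrete, here is a local stand-in for Job #1 only. It is not Hadoop code: a thread pool plays the role of the mappers, a dict plays the role of the HBase hash table, and all names (`make_map_tasks`, `match_words_mapper`, `run_match_words_job`) and the sample sequences are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def make_map_tasks(batch):
    """One task per (person, chromosome), as described on the slide."""
    return [(user, chrom, seq)
            for user, chroms in batch.items()
            for chrom, seq in chroms.items()]


def match_words_mapper(task, word_size=5):
    """Emit (row_key, user) pairs for one chromosome of one person."""
    user, chrom, seq = task
    return [(f"{chrom}_{seq[i:i + word_size]}_{i // word_size}", user)
            for i in range(0, len(seq), word_size)]


def run_match_words_job(batch, hash_table, workers=4):
    """Run all mappers in parallel, then apply their output to the table."""
    tasks = make_map_tasks(batch)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for emitted in pool.map(match_words_mapper, tasks):
            for row_key, user in emitted:
                hash_table.setdefault(row_key, {})[user] = b"\x01"
    return len(tasks)


# Made-up sequences, purely for illustration.
batch = {
    "Zarek": {"1": "ACTGACCTAG", "2": "TTAAGCCTAG"},
    "Roslin": {"2": "TTAAGGGGCG"},
}
table = {}
num_mappers = run_match_words_job(batch, table)
print(num_mappers)  # 3
```

Because each task covers a single chromosome of a single person, the tasks share no state and can be spread across as many workers as are available.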

Okay, but how does Jermline perform?
A 1700% improvement over
GERMLINE!

[Chart: Run Times For Matching (in hours). X axis: number of samples, 2,500 to 120,000. Y axis: hours, 0 to 25.]

[Chart: Run Times For Matching (in hours). X axis: number of samples, 2,500 to 120,000. Y axis: hours, 0 to 180. Series: GERMLINE run times; Jermline run times; projected GERMLINE run times.]

Bottom line : By leveraging Hadoop and HBase, we dramatically increased our processing capacity. Without Hadoop and HBase, this would have been hideously expensive and difficult.
• Previously, we ran GERMLINE on a single "beefy box".
  • 12-core 2.2 GHz Opteron 6174 with 256 GB of RAM
  • We had upgraded this machine until it couldn't be upgraded any more.
  • Processing time was unacceptable, growth was unsustainable.
  • To continue running GERMLINE on a single box, we would have required a vastly more powerful machine, probably at the supercomputer level.
• Now, we run Jermline on a cluster.
  • 20 x 12-core 2 GHz Xeon E5-2620 with 96 GB of RAM
  • We can now run 16 batches per day, whereas before we could only run one.
  • Most importantly, growth is sustainable. To add capacity, we need only add more nodes.
