SlideShare une entreprise Scribd logo
1  sur  32
DNA, HBase, Hadoop, and
YOU!
by Jeremy Pollack
What does Ancestry.com do?
• Over 30,000 historical content collections
• 11 billion records and images
• Records dating back to 16th century
• 4 petabytes
We are the world's largest online family history resource.
It’s the “eureka” moment of discovery that drives
our business!
DNA molecule 1 differs from DNA molecule 2
at a single base-pair location (a C/T
polymorphism).
(http://en.wikipedia.org/wiki/Single-
nucleiotide_polymorphism)
What does Ancestry DNA do?
"Spit in a tube, pay $99, learn about your past"
• Decodes your family origins (ethnicity)
• Finds your long-lost relatives
• We have identified over four million
fourth cousins.
• The average customer has close to 30
fourth cousin matches.
• By examining these matches, we can
connect your family tree to those of
your distant relatives.
• Ancestry DNA has 120K+ samples, one of
the largest DNA databases in the world.
• About 690GB of data (uncompressed),
or about 6.2 MB per sample
What is GERMLINE?
• GERMLINE is an algorithm that finds hidden relationships
within a pool of DNA.
• GERMLINE also refers to the reference implementation of that
algorithm.
• You can find it here :
http://www1.cs.columbia.edu/~gusev/germline/
So what's the problem?
• GERMLINE (the implementation) was not meant to be used in
an industrial setting.
• Stateless
• Single threaded
• Prone to swapping
• GERMLINE performs poorly on large data sets.
• We were running up against its limitations.
• Put simply : GERMLINE couldn't scale.
0
5
10
15
20
25
2500
5000
7500
10000
12500
15000
17500
20000
22500
25000
27500
30000
32500
35000
37500
40000
42500
45000
47500
50000
52500
55000
57500
60000
Hours
Number of samples
GERMLINE Run Times (in hours)
Projected GERMLINE Run Times (in hours)
0
100
200
300
400
500
600
700
2500
5000
7500
10000
12500
15000
17500
20000
22500
25000
27500
30000
32500
35000
37500
40000
42500
45000
47500
50000
52500
55000
57500
60000
62500
65000
67500
70000
72500
75000
77500
80000
82500
85000
87500
90000
92500
95000
97500
100000
102500
105000
107500
110000
112500
115000
117500
120000
122500
Hours
Number of samples
GERMLINE run
times
Projected
GERMLINE run
times
The Mission : Create a Scalable Matching
Engine
... and thus was born
(aka "Jermline with a J")
DNA Matching : How it Works
Starbuck : ACTGACCTAGTTGAC
Adama : TTAAGCCTAGTTGAC
The Input
Kara Thrace, aka
Starbuck
• Ace viper pilot
• Has a special
destiny
• Not to be trifled
with
Admiral Adama
• Admiral of the
Colonial Fleet
• Routinely saves
humanity from
destruction
• Not so great
with model
ships
0 1 2
Starbuck : ACTGA CCTAG TTGAC
Adama : TTAAG CCTAG TTGAC
Separate into words
0 1 2
Starbuck : ACTGA CCTAG TTGAC
Adama : TTAAG CCTAG TTGAC
ACTGA_0 : Starbuck
TTAAG_0 : Adama
CCTAG_1 : Starbuck, Adama
TTGAC_2 : Starbuck, Adama
Build the hash table
Iterate through genome and find matches
Starbuck and Adama match from position 1 to position 2
0 1 2
Starbuck : ACTGA CCTAG TTGAC
Adama : TTAAG CCTAG TTGAC
ACTGA_0 : Starbuck
TTAAG_0 : Adama
CCTAG_1 : Starbuck, Adama
TTGAC_2 : Starbuck, Adama
Does that mean they're related?
...maybe...
Baltar : TTAAGCCTAGGGGCG
But wait... what about Baltar?
Gaius Baltar
• Handsome
• Genius
• Kinda evil
Adding a new sample, the GERMLINE way
0 1 2
Starbuck : ACTGA CCTAG TTGAC
Adama : TTAAG CCTAG TTGAC
Baltar : TTAAG CCTAG GGGCG
ACTGA_0 : Starbuck
TTAAG_0 : Adama, Baltar
CCTAG_1 : Starbuck, Adama, Baltar
TTGAC_2 : Starbuck, Adama
GGGCG_2 : Baltar
Step one : Rebuild the entire hash table from scratch, including the new
sample
The GERMLINE Way
Starbuck and Adama match from position 1 to position 2
Adama and Baltar match from position 0 to position 1
Starbuck and Baltar match at position 1
Step two : Find everybody's matches all over again, including the new
sample. (n x n comparisons)
0 1 2
Starbuck : ACTGA CCTAG TTGAC
Adama : TTAAG CCTAG TTGAC
Baltar : TTAAG CCTAG GGGCG
ACTGA_0 : Starbuck
TTAAG_0 : Adama, Baltar
CCTAG_1 : Starbuck, Adama, Baltar
TTGAC_2 : Starbuck, Adama
GGGCG_2 : Baltar
The GERMLINE Way
Starbuck and Adama match from position 1 to position 2
Adama and Baltar match from position 0 to position 1
Starbuck and Baltar match at position 1
Step three : Now, throw away the evidence!
0 1 2
Starbuck : ACTGA CCTAG TTGAC
Adama : TTAAG CCTAG TTGAC
Baltar : TTAAG CCTAG GGGCG
ACTGA_0 : Starbuck
TTAAG_0 : Adama, Baltar
CCTAG_1 : Starbuck, Adama, Baltar
TTGAC_2 : Starbuck, Adama
GGGCG_2 : Baltar
You have done this before, and you will have to do
it ALL OVER AGAIN.
The GERMLINE Way
Not so good, right?
Now let's take a look at the way.
Step one : Update the hash table.
Starbuck Adama
2_ACTGA_0 1
2_TTAAG_0 1
2_CCTAG_1 1 1
2_TTGAC_2 1 1
Already stored in HBase
Baltar : TTAAG CCTAG GGGCG New sample to add
Key : [CHROMOSOME]_[WORD]_[POSITION]
Qualifier : [USER ID]
Cell value : A byte set to 1, denoting that the user has that word at that position on
that chromosome
The way
Baltar and Adama match from position 0 to position 1
Baltar and Starbuck match at position 1
Already stored
in HBase
2_Starbuck 2_Adama
2_Starbuck { (1, 2), ...}
2_Adama { (1, 2), ... }
New matches
to add
Key : [CHROMOSOME]_[USER ID]
Qualifier : [CHROMOSOME]_[USER ID]
Cell value : A list of ranges where the two users match on a chromosome
The way
Step two : Find matches.
But wait ... what about
Zarek, Roslin, Hera, and Helo?
Photo by Benh Lieu Song
Run them in parallel with Hadoop!
• Batches are usually about a thousand people.
• Each mapper takes a single chromosome for a single person.
• MapReduce Jobs :
• Job #1 : Match Words
• Updates the hash table
• Job #2 : Match Segments
• Identifies areas where the samples match
Parallelism with Hadoop
Okay, but how does Jermline perform?
Okay, but how does Jermline perform?
A 1700% improvement over
GERMLINE!
0
5
10
15
20
25
2500
5000
7500
10000
12500
15000
17500
20000
22500
25000
27500
30000
32500
35000
37500
40000
42500
45000
47500
50000
52500
55000
57500
60000
62500
65000
67500
70000
72500
75000
77500
80000
82500
85000
87500
90000
92500
95000
97500
100000
102500
105000
107500
110000
112500
115000
117500
120000
Hours
Number of samples
Run Times For Matching (in hours)
Run Times For Matching (in hours)
0
20
40
60
80
100
120
140
160
180
2500
5000
7500
10000
12500
15000
17500
20000
22500
25000
27500
30000
32500
35000
37500
40000
42500
45000
47500
50000
52500
55000
57500
60000
62500
65000
67500
70000
72500
75000
77500
80000
82500
85000
87500
90000
92500
95000
97500
100000
102500
105000
107500
110000
112500
115000
117500
120000
Hours
Number of samples
GERMLINE run
times
Jermline run
times
Projected
GERMLINE run
times
Bottom line : By leveraging Hadoop and HBase, we
dramatically increased our processing capacity. Without
Hadoop and HBase, this would have been hideously
expensive and difficult.
• Previously, we ran GERMLINE on a single "beefy box".
• 12-core 2.2GHZ Opteron 6174 with 256GB of RAM
• We had upgraded this machine until it couldn't be upgraded any more.
• Processing time was unacceptable, growth was unsustainable.
• To continue running GERMLINE on a single box, we would have required a vastly more
powerful machine, probably at the supercomputer level.
• Now, we run Jermline on a cluster.
• 20 X 12-core 2GHZ Xeon E5-2620 with 96GB of RAM
• We can now run 16 batches per day, whereas before we could only run one.
• Most importantly, growth is sustainable. To add capacity, we need only add more
nodes.
Questions?

Contenu connexe

En vedette

HBaseCon 2012 | Relaxed Transactions for HBase - Francis Liu, Yahoo!
HBaseCon 2012 | Relaxed Transactions for HBase - Francis Liu, Yahoo!HBaseCon 2012 | Relaxed Transactions for HBase - Francis Liu, Yahoo!
HBaseCon 2012 | Relaxed Transactions for HBase - Francis Liu, Yahoo!Cloudera, Inc.
 
HBase Read High Availability Using Timeline-Consistent Region Replicas
HBase Read High Availability Using Timeline-Consistent Region ReplicasHBase Read High Availability Using Timeline-Consistent Region Replicas
HBase Read High Availability Using Timeline-Consistent Region ReplicasHBaseCon
 
Cross-Site BigTable using HBase
Cross-Site BigTable using HBaseCross-Site BigTable using HBase
Cross-Site BigTable using HBaseHBaseCon
 
HBaseCon 2013: Rebuilding for Scale on Apache HBase
HBaseCon 2013: Rebuilding for Scale on Apache HBaseHBaseCon 2013: Rebuilding for Scale on Apache HBase
HBaseCon 2013: Rebuilding for Scale on Apache HBaseCloudera, Inc.
 
HBaseCon 2013: 1500 JIRAs in 20 Minutes
HBaseCon 2013: 1500 JIRAs in 20 MinutesHBaseCon 2013: 1500 JIRAs in 20 Minutes
HBaseCon 2013: 1500 JIRAs in 20 MinutesCloudera, Inc.
 
HBaseCon 2013: Being Smarter Than the Smart Meter
HBaseCon 2013: Being Smarter Than the Smart MeterHBaseCon 2013: Being Smarter Than the Smart Meter
HBaseCon 2013: Being Smarter Than the Smart MeterCloudera, Inc.
 
HBaseCon 2012 | Building Mobile Infrastructure with HBase
HBaseCon 2012 | Building Mobile Infrastructure with HBaseHBaseCon 2012 | Building Mobile Infrastructure with HBase
HBaseCon 2012 | Building Mobile Infrastructure with HBaseCloudera, Inc.
 
HBaseCon 2012 | Living Data: Applying Adaptable Schemas to HBase - Aaron Kimb...
HBaseCon 2012 | Living Data: Applying Adaptable Schemas to HBase - Aaron Kimb...HBaseCon 2012 | Living Data: Applying Adaptable Schemas to HBase - Aaron Kimb...
HBaseCon 2012 | Living Data: Applying Adaptable Schemas to HBase - Aaron Kimb...Cloudera, Inc.
 
HBaseCon 2012 | Scaling GIS In Three Acts
HBaseCon 2012 | Scaling GIS In Three ActsHBaseCon 2012 | Scaling GIS In Three Acts
HBaseCon 2012 | Scaling GIS In Three ActsCloudera, Inc.
 
HBaseCon 2012 | HBase for the Worlds Libraries - OCLC
HBaseCon 2012 | HBase for the Worlds Libraries - OCLCHBaseCon 2012 | HBase for the Worlds Libraries - OCLC
HBaseCon 2012 | HBase for the Worlds Libraries - OCLCCloudera, Inc.
 
HBaseCon 2012 | Unique Sets on HBase and Hadoop - Elliot Clark, StumbleUpon
HBaseCon 2012 | Unique Sets on HBase and Hadoop - Elliot Clark, StumbleUponHBaseCon 2012 | Unique Sets on HBase and Hadoop - Elliot Clark, StumbleUpon
HBaseCon 2012 | Unique Sets on HBase and Hadoop - Elliot Clark, StumbleUponCloudera, Inc.
 
HBaseCon 2013: Apache Hadoop and Apache HBase for Real-Time Video Analytics
HBaseCon 2013: Apache Hadoop and Apache HBase for Real-Time Video Analytics HBaseCon 2013: Apache Hadoop and Apache HBase for Real-Time Video Analytics
HBaseCon 2013: Apache Hadoop and Apache HBase for Real-Time Video Analytics Cloudera, Inc.
 
HBaseCon 2013: Project Valta - A Resource Management Layer over Apache HBase
HBaseCon 2013: Project Valta - A Resource Management Layer over Apache HBaseHBaseCon 2013: Project Valta - A Resource Management Layer over Apache HBase
HBaseCon 2013: Project Valta - A Resource Management Layer over Apache HBaseCloudera, Inc.
 
HBaseCon 2015: Trafodion - Integrating Operational SQL into HBase
HBaseCon 2015: Trafodion - Integrating Operational SQL into HBaseHBaseCon 2015: Trafodion - Integrating Operational SQL into HBase
HBaseCon 2015: Trafodion - Integrating Operational SQL into HBaseHBaseCon
 
HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...
HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...
HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...Cloudera, Inc.
 
Tales from the Cloudera Field
Tales from the Cloudera FieldTales from the Cloudera Field
Tales from the Cloudera FieldHBaseCon
 
HBaseCon 2012 | Leveraging HBase for the World’s Largest Curated Genomic Data...
HBaseCon 2012 | Leveraging HBase for the World’s Largest Curated Genomic Data...HBaseCon 2012 | Leveraging HBase for the World’s Largest Curated Genomic Data...
HBaseCon 2012 | Leveraging HBase for the World’s Largest Curated Genomic Data...Cloudera, Inc.
 
HBaseCon 2013: Apache HBase, Meet Ops. Ops, Meet Apache HBase.
HBaseCon 2013: Apache HBase, Meet Ops. Ops, Meet Apache HBase.HBaseCon 2013: Apache HBase, Meet Ops. Ops, Meet Apache HBase.
HBaseCon 2013: Apache HBase, Meet Ops. Ops, Meet Apache HBase.Cloudera, Inc.
 

En vedette (18)

HBaseCon 2012 | Relaxed Transactions for HBase - Francis Liu, Yahoo!
HBaseCon 2012 | Relaxed Transactions for HBase - Francis Liu, Yahoo!HBaseCon 2012 | Relaxed Transactions for HBase - Francis Liu, Yahoo!
HBaseCon 2012 | Relaxed Transactions for HBase - Francis Liu, Yahoo!
 
HBase Read High Availability Using Timeline-Consistent Region Replicas
HBase Read High Availability Using Timeline-Consistent Region ReplicasHBase Read High Availability Using Timeline-Consistent Region Replicas
HBase Read High Availability Using Timeline-Consistent Region Replicas
 
Cross-Site BigTable using HBase
Cross-Site BigTable using HBaseCross-Site BigTable using HBase
Cross-Site BigTable using HBase
 
HBaseCon 2013: Rebuilding for Scale on Apache HBase
HBaseCon 2013: Rebuilding for Scale on Apache HBaseHBaseCon 2013: Rebuilding for Scale on Apache HBase
HBaseCon 2013: Rebuilding for Scale on Apache HBase
 
HBaseCon 2013: 1500 JIRAs in 20 Minutes
HBaseCon 2013: 1500 JIRAs in 20 MinutesHBaseCon 2013: 1500 JIRAs in 20 Minutes
HBaseCon 2013: 1500 JIRAs in 20 Minutes
 
HBaseCon 2013: Being Smarter Than the Smart Meter
HBaseCon 2013: Being Smarter Than the Smart MeterHBaseCon 2013: Being Smarter Than the Smart Meter
HBaseCon 2013: Being Smarter Than the Smart Meter
 
HBaseCon 2012 | Building Mobile Infrastructure with HBase
HBaseCon 2012 | Building Mobile Infrastructure with HBaseHBaseCon 2012 | Building Mobile Infrastructure with HBase
HBaseCon 2012 | Building Mobile Infrastructure with HBase
 
HBaseCon 2012 | Living Data: Applying Adaptable Schemas to HBase - Aaron Kimb...
HBaseCon 2012 | Living Data: Applying Adaptable Schemas to HBase - Aaron Kimb...HBaseCon 2012 | Living Data: Applying Adaptable Schemas to HBase - Aaron Kimb...
HBaseCon 2012 | Living Data: Applying Adaptable Schemas to HBase - Aaron Kimb...
 
HBaseCon 2012 | Scaling GIS In Three Acts
HBaseCon 2012 | Scaling GIS In Three ActsHBaseCon 2012 | Scaling GIS In Three Acts
HBaseCon 2012 | Scaling GIS In Three Acts
 
HBaseCon 2012 | HBase for the Worlds Libraries - OCLC
HBaseCon 2012 | HBase for the Worlds Libraries - OCLCHBaseCon 2012 | HBase for the Worlds Libraries - OCLC
HBaseCon 2012 | HBase for the Worlds Libraries - OCLC
 
HBaseCon 2012 | Unique Sets on HBase and Hadoop - Elliot Clark, StumbleUpon
HBaseCon 2012 | Unique Sets on HBase and Hadoop - Elliot Clark, StumbleUponHBaseCon 2012 | Unique Sets on HBase and Hadoop - Elliot Clark, StumbleUpon
HBaseCon 2012 | Unique Sets on HBase and Hadoop - Elliot Clark, StumbleUpon
 
HBaseCon 2013: Apache Hadoop and Apache HBase for Real-Time Video Analytics
HBaseCon 2013: Apache Hadoop and Apache HBase for Real-Time Video Analytics HBaseCon 2013: Apache Hadoop and Apache HBase for Real-Time Video Analytics
HBaseCon 2013: Apache Hadoop and Apache HBase for Real-Time Video Analytics
 
HBaseCon 2013: Project Valta - A Resource Management Layer over Apache HBase
HBaseCon 2013: Project Valta - A Resource Management Layer over Apache HBaseHBaseCon 2013: Project Valta - A Resource Management Layer over Apache HBase
HBaseCon 2013: Project Valta - A Resource Management Layer over Apache HBase
 
HBaseCon 2015: Trafodion - Integrating Operational SQL into HBase
HBaseCon 2015: Trafodion - Integrating Operational SQL into HBaseHBaseCon 2015: Trafodion - Integrating Operational SQL into HBase
HBaseCon 2015: Trafodion - Integrating Operational SQL into HBase
 
HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...
HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...
HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...
 
Tales from the Cloudera Field
Tales from the Cloudera FieldTales from the Cloudera Field
Tales from the Cloudera Field
 
HBaseCon 2012 | Leveraging HBase for the World’s Largest Curated Genomic Data...
HBaseCon 2012 | Leveraging HBase for the World’s Largest Curated Genomic Data...HBaseCon 2012 | Leveraging HBase for the World’s Largest Curated Genomic Data...
HBaseCon 2012 | Leveraging HBase for the World’s Largest Curated Genomic Data...
 
HBaseCon 2013: Apache HBase, Meet Ops. Ops, Meet Apache HBase.
HBaseCon 2013: Apache HBase, Meet Ops. Ops, Meet Apache HBase.HBaseCon 2013: Apache HBase, Meet Ops. Ops, Meet Apache HBase.
HBaseCon 2013: Apache HBase, Meet Ops. Ops, Meet Apache HBase.
 

Plus de Cloudera, Inc.

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxCloudera, Inc.
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera, Inc.
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards FinalistsCloudera, Inc.
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Cloudera, Inc.
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Cloudera, Inc.
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Cloudera, Inc.
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Cloudera, Inc.
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Cloudera, Inc.
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Cloudera, Inc.
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Cloudera, Inc.
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Cloudera, Inc.
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Cloudera, Inc.
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformCloudera, Inc.
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Cloudera, Inc.
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Cloudera, Inc.
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Cloudera, Inc.
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Cloudera, Inc.
 

Plus de Cloudera, Inc. (20)

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptx
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the Platform
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
 

Dernier

From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 

Dernier (20)

From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 

HBaseCon 2013: Apache HBase, Apache Hadoop, DNA and YOU!

  • 1. DNA, HBase, Hadoop, and YOU! by Jeremy Pollack
  • 2. What does Ancestry.com do? • Over 30,000 historical content collections • 11 billion records and images • Records dating back to 16th century • 4 petabytes We are the world's largest online family history resource.
  • 3. It’s the “eureka” moment of discovery that drives our business!
  • 4. DNA molecule 1 differs from DNA molecule 2 at a single base-pair location (a C/T polymorphism). (http://en.wikipedia.org/wiki/Single- nucleiotide_polymorphism) What does Ancestry DNA do? "Spit in a tube, pay $99, learn about your past" • Decodes your family origins (ethnicity) • Finds your long-lost relatives • We have identified over four million fourth cousins. • The average customer has close to 30 fourth cousin matches. • By examining these matches, we can connect your family tree to those of your distant relatives. • Ancestry DNA has 120K+ samples, one of the largest DNA databases in the world. • About 690GB of data (uncompressed), or about 6.2 MB per sample
  • 5. What is GERMLINE? • GERMLINE is an algorithm that finds hidden relationships within a pool of DNA. • GERMLINE also refers to the reference implementation of that algorithm. • You can find it here : http://www1.cs.columbia.edu/~gusev/germline/
  • 6. So what's the problem? • GERMLINE (the implementation) was not meant to be used in an industrial setting. • Stateless • Single threaded • Prone to swapping • GERMLINE performs poorly on large data sets. • We were running up against its limitations. • Put simply : GERMLINE couldn't scale.
  • 8. Projected GERMLINE Run Times (in hours) 0 100 200 300 400 500 600 700 2500 5000 7500 10000 12500 15000 17500 20000 22500 25000 27500 30000 32500 35000 37500 40000 42500 45000 47500 50000 52500 55000 57500 60000 62500 65000 67500 70000 72500 75000 77500 80000 82500 85000 87500 90000 92500 95000 97500 100000 102500 105000 107500 110000 112500 115000 117500 120000 122500 Hours Number of samples GERMLINE run times Projected GERMLINE run times
  • 9. The Mission : Create a Scalable Matching Engine ... and thus was born (aka "Jermline with a J")
  • 10. DNA Matching : How it Works
  • 11. Starbuck : ACTGACCTAGTTGAC Adama : TTAAGCCTAGTTGAC The Input Kara Thrace, aka Starbuck • Ace viper pilot • Has a special destiny • Not to be trifled with Admiral Adama • Admiral of the Colonial Fleet • Routinely saves humanity from destruction • Not so great with model ships
  • 12. 0 1 2 Starbuck : ACTGA CCTAG TTGAC Adama : TTAAG CCTAG TTGAC Separate into words
  • 13. 0 1 2 Starbuck : ACTGA CCTAG TTGAC Adama : TTAAG CCTAG TTGAC ACTGA_0 : Starbuck TTAAG_0 : Adama CCTAG_1 : Starbuck, Adama TTGAC_2 : Starbuck, Adama Build the hash table
  • 14. Iterate through genome and find matches Starbuck and Adama match from position 1 to position 2 0 1 2 Starbuck : ACTGA CCTAG TTGAC Adama : TTAAG CCTAG TTGAC ACTGA_0 : Starbuck TTAAG_0 : Adama CCTAG_1 : Starbuck, Adama TTGAC_2 : Starbuck, Adama
  • 15. Does that mean they're related? ...maybe...
  • 16. Baltar : TTAAGCCTAGGGGCG But wait... what about Baltar? Gaius Baltar • Handsome • Genius • Kinda evil
  • 17. Adding a new sample, the GERMLINE way
  • 18. 0 1 2 Starbuck : ACTGA CCTAG TTGAC Adama : TTAAG CCTAG TTGAC Baltar : TTAAG CCTAG GGGCG ACTGA_0 : Starbuck TTAAG_0 : Adama, Baltar CCTAG_1 : Starbuck, Adama, Baltar TTGAC_2 : Starbuck, Adama GGGCG_2 : Baltar Step one : Rebuild the entire hash table from scratch, including the new sample The GERMLINE Way
  • 19. Starbuck and Adama match from position 1 to position 2 Adama and Baltar match from position 0 to position 1 Starbuck and Baltar match at position 1 Step two : Find everybody's matches all over again, including the new sample. (n x n comparisons) 0 1 2 Starbuck : ACTGA CCTAG TTGAC Adama : TTAAG CCTAG TTGAC Baltar : TTAAG CCTAG GGGCG ACTGA_0 : Starbuck TTAAG_0 : Adama, Baltar CCTAG_1 : Starbuck, Adama, Baltar TTGAC_2 : Starbuck, Adama GGGCG_2 : Baltar The GERMLINE Way
  • 20. Starbuck and Adama match from position 1 to position 2 Adama and Baltar match from position 0 to position 1 Starbuck and Baltar match at position 1 Step three : Now, throw away the evidence! 0 1 2 Starbuck : ACTGA CCTAG TTGAC Adama : TTAAG CCTAG TTGAC Baltar : TTAAG CCTAG GGGCG ACTGA_0 : Starbuck TTAAG_0 : Adama, Baltar CCTAG_1 : Starbuck, Adama, Baltar TTGAC_2 : Starbuck, Adama GGGCG_2 : Baltar You have done this before, and you will have to do it ALL OVER AGAIN. The GERMLINE Way
  • 21. Not so good, right? Now let's take a look at the way.
  • 22. Step one : Update the hash table. Starbuck Adama 2_ACTGA_0 1 2_TTAAG_0 1 2_CCTAG_1 1 1 2_TTGAC_2 1 1 Already stored in HBase Baltar : TTAAG CCTAG GGGCG New sample to add Key : [CHROMOSOME]_[WORD]_[POSITION] Qualifier : [USER ID] Cell value : A byte set to 1, denoting that the user has that word at that position on that chromosome The way
  • 23. Baltar and Adama match from position 0 to position 1 Baltar and Starbuck match at position 1 Already stored in HBase 2_Starbuck 2_Adama 2_Starbuck { (1, 2), ...} 2_Adama { (1, 2), ... } New matches to add Key : [CHROMOSOME]_[USER ID] Qualifier : [CHROMOSOME]_[USER ID] Cell value : A list of ranges where the two users match on a chromosome The way Step two : Find matches.
  • 24. But wait ... what about Zarek, Roslin, Hera, and Helo?
  • 25. Photo by Benh Lieu Song Run them in parallel with Hadoop!
  • 26. • Batches are usually about a thousand people. • Each mapper takes a single chromosome for a single person. • MapReduce Jobs : • Job #1 : Match Words • Updates the hash table • Job #2 : Match Segments • Identifies areas where the samples match Parallelism with Hadoop
  • 27. Okay, but how does Jermline perform?
  • 28. Okay, but how does Jermline perform? A 1700% improvement over GERMLINE!
  • 30. Run Times For Matching (in hours) 0 20 40 60 80 100 120 140 160 180 2500 5000 7500 10000 12500 15000 17500 20000 22500 25000 27500 30000 32500 35000 37500 40000 42500 45000 47500 50000 52500 55000 57500 60000 62500 65000 67500 70000 72500 75000 77500 80000 82500 85000 87500 90000 92500 95000 97500 100000 102500 105000 107500 110000 112500 115000 117500 120000 Hours Number of samples GERMLINE run times Jermline run times Projected GERMLINE run times
  • 31. Bottom line : By leveraging Hadoop and HBase, we dramatically increased our processing capacity. Without Hadoop and HBase, this would have been hideously expensive and difficult. • Previously, we ran GERMLINE on a single "beefy box". • 12-core 2.2GHZ Opteron 6174 with 256GB of RAM • We had upgraded this machine until it couldn't be upgraded any more. • Processing time was unacceptable, growth was unsustainable. • To continue running GERMLINE on a single box, we would have required a vastly more powerful machine, probably at the supercomputer level. • Now, we run Jermline on a cluster. • 20 X 12-core 2GHZ Xeon E5-2620 with 96GB of RAM • We can now run 16 batches per day, whereas before we could only run one. • Most importantly, growth is sustainable. To add capacity, we need only add more nodes.

Notes de l'éditeur

  1. Everything from birth certificates, obituaries, immigration records, census records, voter registration, old phone books, everything.
  2. Typically, the way it works is this :You search through our records to find one of your relatives. Once you've found enough records that you're satisfied you've found your relative, you attach them to your family tree. After that, Ancestry goes to work for you. Our search engine takes a look at your whole tree to find relatives that you may not know about yet, and presents these to you as hints. (shaky leaf) You can then examine these hints and see if they are, in fact, related to you. I, myself, found all my great-grandparents as well as a few aunts and uncles like that. It's pretty cool! And the beauty of it is that, say you've found a relative who's researched their family tree pretty extensively? Well, you get to piggyback on all that research by simply adding their family tree to yours. A fine example of crowdsourcing.
  3. However, this has its limitations. What if you don't know your extended family that well? What if your ancestors came to the country as slaves? What if your ancestors came into the country illegally?DNA to the rescue.
  4. Mention how we kept upgrading and tightening things up
  5. For each person-to-person comparison, we add up the total length of their shared DNA and run that through a statistical model to see how closely they're related.
  6. Remind people that GERMLINE was stateless