Scaling AncestryDNA with the Hadoop Ecosystem
June 5, 2014
What Ancestry uses from the Hadoop ecosystem
● Hadoop, HDFS, and MapReduce
● HBase
● Columnar, NoSQL data store, unlimited rows and columns
● Azkaban
● Workflow
2
What will this presentation cover?
● Describe the problem
● Discoveries with DNA
● Three key steps in the pipeline process
● Measure everything principle
● Three steps with Hadoop
● Hadoop as a job scheduler for the ethnicity step
● Scaling matching step
● MapReduce implementation of phasing
● Performance
● What comes next?
3
Discoveries with DNA
● Autosomal DNA test that analyzes 700,000 SNPs
● Over 400,000 DNA samples in our database
● Identified 3 million relationships that connect the genotyped members
through shared ancestors
4
[Chart: DNA database size (# of samples) and cousin matches (4th cousin or closer), Jan 2012 through Apr 2014]
Three key steps in the pipeline
● What is a pipeline?
1. Ethnicity (AdMixture)
2. Matching (GERMLINE and Jermline)
3. Phasing (Beagle and Underdog)
● First pipeline executed on a single, beefy box.
● Only option is to scale vertically
5
[Diagram: pipeline steps (Init, Processing, Results, Finalize) running on a single "beefy box": 24 cores, 256 GB memory]
Measure everything principle
● Start time, end time, duration in seconds, and sample count for every step
in the pipeline. Also the full end-to-end processing time.
● Put the data into pivot tables and graph each step
● Normalize the data (the sample size was changing)
● Use the collected data to predict future performance
6
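As a rough illustration of this principle, here is a minimal timing sketch (hypothetical; the wrapper API, step names, and CSV output are assumptions, not Ancestry's actual instrumentation):

```java
// Hypothetical per-step timing collector; rows can be loaded into a pivot table for graphing.
import java.util.ArrayList;
import java.util.List;

public class StepTimer {
    public static class StepMetric {
        final String step;
        final long startMillis;
        final long endMillis;
        final int sampleCount;

        StepMetric(String step, long startMillis, long endMillis, int sampleCount) {
            this.step = step;
            this.startMillis = startMillis;
            this.endMillis = endMillis;
            this.sampleCount = sampleCount;
        }

        long durationSeconds() {
            return (endMillis - startMillis) / 1000;
        }
    }

    private final List<StepMetric> metrics = new ArrayList<>();

    // Run a pipeline step and record start time, end time, duration, and sample count.
    public void timeStep(String step, int sampleCount, Runnable work) {
        long start = System.currentTimeMillis();
        work.run();
        long end = System.currentTimeMillis();
        metrics.add(new StepMetric(step, start, end, sampleCount));
    }

    // Emit one CSV row per step.
    public void dumpCsv() {
        System.out.println("step,start_ms,end_ms,duration_s,samples");
        for (StepMetric m : metrics) {
            System.out.printf("%s,%d,%d,%d,%d%n",
                    m.step, m.startMillis, m.endMillis, m.durationSeconds(), m.sampleCount);
        }
    }
}
```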
Challenges and pain points
Performance degrades when DNA pool grows
● Static (by batch size)
● Linear (by DNA pool size)
● Quadratic (matching-related steps) – time bomb
7
Ethnicity step on Hadoop
Using Hadoop as a job scheduler to scale AdMixture
8
First step with Hadoop
● What was in place?
● Smart engineers with no Hadoop experience
● Pipeline running on a single computer that would not scale
● New business that needed a scalable solution to grow
First step using Hadoop
● Run AdMixture step in parallel on Hadoop
● Self-contained program with set inputs and outputs
● Simple MapReduce implementation
● Experience running jobs on Hadoop
● Freed up CPU and memory on the single computer for the other steps
9
What did we do? (Don’t cringe…)
1. Mapper -> Key: User ID, Ethnicity Result
2. Reducer -> Key: User ID, Array [Ethnicity Result]
10
[Diagram: for a batch of 1,000 samples, the pipeline beefy box (process step control) submits 40 AdMixture jobs of 25 samples each to the 40-node DNA Hadoop cluster; each mapper forks AdMixture sequentially for its samples, and after the shuffle a simple reducer merges the outputs into a single results file]
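A minimal sketch of that fork-and-merge pattern, assuming a hypothetical AdMixture binary path and a simple (user ID, sample path) input format; this is illustrative, not the production mapper:

```java
// Hypothetical mapper that forks an external AdMixture-style executable per sample.
// The binary path, arguments, and input value format are illustrative assumptions.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class AdmixtureForkMapper extends Mapper<Text, Text, Text, Text> {

    @Override
    protected void map(Text userId, Text samplePath, Context context)
            throws IOException, InterruptedException {
        // Fork the external binary once per sample, sequentially within this mapper.
        ProcessBuilder pb = new ProcessBuilder("/opt/dna/admixture", samplePath.toString());
        pb.redirectErrorStream(true);
        Process proc = pb.start();

        StringBuilder result = new StringBuilder();
        try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(proc.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                result.append(line).append('\n');
            }
        }
        if (proc.waitFor() != 0) {
            throw new IOException("AdMixture failed for sample " + userId);
        }
        // Key: User ID, Value: ethnicity result; the reducer merges these per user.
        context.write(userId, new Text(result.toString()));
    }
}
```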
Performance results
● Went from processing 500 samples in 20 hours
to processing 1,000 samples in 2 ½ hours
● Reduced Beagle phasing step by 4 hours
11
[Chart: AdMixture time (sec) and run size per release, showing the drop at the H1 release of AdMixture on Hadoop]
Provided valuable experience and bought us time
GERMLINE to Jermline
Moving the matching step to MapReduce and HBase
12
Introducing … GERMLINE!
● GERMLINE is an algorithm that finds hidden relationships within a pool of DNA
● Also refers to the reference implementation of that algorithm, written in C++. You can find it here: http://www1.cs.columbia.edu/~gusev/germline/
So what’s the problem?
● GERMLINE (the implementation) was not meant to be used in an industrial setting
● Stateless, single-threaded, prone to swapping (heavy memory usage)
● GERMLINE performs poorly on large data sets
● Our metrics predicted exactly where the process would slow to a crawl
● Put simply: GERMLINE couldn't scale
13
GERMLINE run times (in hours)
14
[Chart: GERMLINE run time (0–25 hours) vs. number of samples (2,500–60,000)]
Projected GERMLINE run times (in hours)
15
[Chart: actual and projected GERMLINE run times (0–700 hours) vs. number of samples (2,500–122,500)]
700 hours = 29+ days
DNA matching walkthrough
Simplified example showing how the code works
16
DNA matching : How it works
17
Cersei : ACTGACCTAGTTGAC
Joffrey : TTAAGCCTAGTTGAC
The Input
Cersei Baratheon
• Former queen of
Westeros
• Machiavellian
manipulator
• Mostly evil, but
occasionally
sympathetic
Joffrey Baratheon
• Pretty much the human
embodiment of evil
• Needlessly cruel
• Kinda looks like Justin
Bieber
DNA matching : How it works
18
0 1 2
Cersei : ACTGA CCTAG TTGAC
Joffrey : TTAAG CCTAG TTGAC
Separate into words
DNA matching : How it works
19
0 1 2
Cersei : ACTGA CCTAG TTGAC
Joffrey : TTAAG CCTAG TTGAC
ACTGA_0 : Cersei
TTAAG_0 : Joffrey
CCTAG_1 : Cersei, Joffrey
TTGAC_2 : Cersei, Joffrey
Build the hash table
DNA matching : How it works
20
0 1 2
Cersei : ACTGA CCTAG TTGAC
Joffrey : TTAAG CCTAG TTGAC
ACTGA_0 : Cersei
TTAAG_0 : Joffrey
CCTAG_1 : Cersei, Joffrey
TTGAC_2 : Cersei, Joffrey
Iterate through genome and find matches
Cersei and Joffrey match from position 1 to position 2
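A toy sketch of the idea in this walkthrough (a word length of 5 and two hard-coded samples, purely for illustration): split each genome into words, key each word by its position, and report users who share a word at the same position.

```java
// Toy illustration of the word-hash matching idea from the walkthrough.
// Word length, sample data, and match reporting are simplified assumptions.
import java.util.*;

public class WordMatchDemo {
    static final int WORD_LEN = 5;

    public static void main(String[] args) {
        Map<String, String> genomes = new LinkedHashMap<>();
        genomes.put("Cersei",  "ACTGACCTAGTTGAC");
        genomes.put("Joffrey", "TTAAGCCTAGTTGAC");

        // Build the hash table: key = word + "_" + position, value = users carrying it.
        Map<String, List<String>> hashTable = new HashMap<>();
        for (Map.Entry<String, String> e : genomes.entrySet()) {
            String dna = e.getValue();
            for (int pos = 0; (pos + 1) * WORD_LEN <= dna.length(); pos++) {
                String word = dna.substring(pos * WORD_LEN, (pos + 1) * WORD_LEN);
                hashTable.computeIfAbsent(word + "_" + pos, k -> new ArrayList<>())
                         .add(e.getKey());
            }
        }

        // Iterate through positions and report users sharing the same word.
        int numWords = genomes.values().iterator().next().length() / WORD_LEN;
        for (int pos = 0; pos < numWords; pos++) {
            for (List<String> users : usersAt(hashTable, pos)) {
                if (users.size() > 1) {
                    System.out.println(users + " match at position " + pos);
                }
            }
        }
    }

    static Collection<List<String>> usersAt(Map<String, List<String>> table, int pos) {
        List<List<String>> result = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : table.entrySet()) {
            if (e.getKey().endsWith("_" + pos)) {
                result.add(e.getValue());
            }
        }
        return result;
    }
}
```

Collapsing consecutive matching positions into ranges yields the "from position 1 to position 2" match above; the next slides move this same hash table into HBase so it does not have to be rebuilt from scratch for every new sample.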
21
Does that mean they’re related?
...maybe
IBD to relationship estimation
● We use the total length of all shared
segments to estimate the relationship
between two genetic relatives
● This is basically a classification problem
22
[Chart: ERSA probability (0.00–0.05) vs. total IBD (cM, 5–5000) for relationship models m1 through m11]
But wait...what about Jaime?
23
Jaime : TTAAGCCTAGGGGCG
Jaime Lannister
• Kind of a has-been
• Killed the Mad King
• Has the hots for his
sister, Cersei
The Jermline way
24
Step one: Update the hash table
Already stored in HBase (chromosome 2 hash table):
2_ACTGA_0 : Cersei = 1
2_TTAAG_0 : Joffrey = 1
2_CCTAG_1 : Cersei = 1, Joffrey = 1
2_TTGAC_2 : Cersei = 1, Joffrey = 1
New sample to add: Jaime : TTAAG CCTAG GGGCG
Key : [CHROMOSOME]_[WORD]_[POSITION]
Qualifier : [USER ID]
Cell value : A byte set to 1, denoting that the user has that word at that position on that
chromosome
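A minimal sketch of writing one word observation with this key layout using the standard HBase client (the table name "dna_hash_table" and column family "d" are made-up; the production schema and HBase version are not specified in the slides):

```java
// Hypothetical writer for the hash-table layout described above.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HashTableWriter {

    // Row key: [CHROMOSOME]_[WORD]_[POSITION]; qualifier: [USER ID]; value: a single byte 1.
    static void putWord(Table table, int chromosome, String word, int position,
                        String userId) throws java.io.IOException {
        Put put = new Put(Bytes.toBytes(chromosome + "_" + word + "_" + position));
        put.addColumn(Bytes.toBytes("d"), Bytes.toBytes(userId), new byte[] { 1 });
        table.put(put);
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("dna_hash_table"))) {
            // Jaime's three words on chromosome 2, as in the example above.
            putWord(table, 2, "TTAAG", 0, "Jaime");
            putWord(table, 2, "CCTAG", 1, "Jaime");
            putWord(table, 2, "GGGCG", 2, "Jaime");
        }
    }
}
```

Because HBase stores sparse rows cheaply, each user only occupies cells for the words they actually carry, and many processes can write to the table at the same time.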
The Jermline way
25
Step two: Find matches, update the results table
Already stored in HBase (chromosome 2 results table):
2_Cersei : { 2_Joffrey : (1, 2), ...}
2_Joffrey : { 2_Cersei : (1, 2), ...}
New matches to add:
Jaime and Joffrey match from position 0 to position 1
Jaime and Cersei match at position 1
Key : [CHROMOSOME]_[USER ID]
Qualifier : [CHROMOSOME]_[USER ID]
Cell value : A list of ranges where the two users match on a chromosome
The Jermline way
26
Results Table
2_Cersei : { 2_Joffrey : (1, 2), ...}, { 2_Jaime : (1), ...}
2_Joffrey : { 2_Cersei : (1, 2), ...}, { 2_Jaime : (0, 1), ...}
2_Jaime : { 2_Cersei : (1), ...}, { 2_Joffrey : (0, 1), ...}
Hash Table
2_ACTGA_0 : Cersei = 1
2_TTAAG_0 : Joffrey = 1, Jaime = 1
2_CCTAG_1 : Cersei = 1, Joffrey = 1, Jaime = 1
2_TTGAC_2 : Cersei = 1, Joffrey = 1
2_GGGCG_2 : Jaime = 1
27
But wait...what about Daenerys, Tyrion,
Arya, and Jon Snow?
Run them in parallel with Hadoop!
28
Parallelism with Hadoop
● Batches are usually about a thousand people
● Each mapper takes a single chromosome for a single person
● MapReduce jobs:
● Job #1: Match words
- Updates the hash table
● Job #2: Match segments
- Identifies areas where the samples match
29
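A toy sketch of what Job #2's segment step amounts to: given the sorted word positions where a pair of users match, collapse consecutive positions into ranges (the positions and the pair are illustrative):

```java
// Toy sketch of "Job #2: match segments": collapse consecutive matching word
// positions for a pair of users into (start, end) ranges.
import java.util.ArrayList;
import java.util.List;

public class SegmentMerger {
    // Merge sorted word positions where a pair matches into [start, end] segments.
    public static List<int[]> toSegments(int[] matchingPositions) {
        List<int[]> segments = new ArrayList<>();
        if (matchingPositions.length == 0) {
            return segments;
        }
        int start = matchingPositions[0];
        int prev = matchingPositions[0];
        for (int i = 1; i < matchingPositions.length; i++) {
            int pos = matchingPositions[i];
            if (pos != prev + 1) {            // gap: close the current segment
                segments.add(new int[] { start, prev });
                start = pos;
            }
            prev = pos;
        }
        segments.add(new int[] { start, prev });
        return segments;
    }

    public static void main(String[] args) {
        // Cersei and Joffrey share words at positions 1 and 2 -> one segment (1, 2).
        for (int[] seg : toSegments(new int[] { 1, 2 })) {
            System.out.println("(" + seg[0] + ", " + seg[1] + ")");
        }
    }
}
```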
Matching steps on Hadoop
1. Hash Table Mapper -> Breaks input into words and fills the hash table (HBase Table #1)
2. Hash Table Reducer -> default reducer (does nothing)
3. Results Mapper -> For each new user, read hash table, fill in the results table (HBase Table #2)
4. Results Reducer -> Key: Object(User ID #1, User ID #2), Array[Object(chrom + pos, Matching DNA Segment)]
30
[Diagram: the pipeline beefy box (process step control) drives four stages on the 40-node DNA Hadoop cluster: (1) Hash Table Mappers, (2) Hash Table Reducers, (3) Results Mappers, (4) Results Reducers. Input: User ID, Phased DNA]
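A hedged sketch of how steps 1–2 can be wired up as a Hadoop job that writes Puts straight into the HBase hash table via TableOutputFormat. The input format, table name, and column family are assumptions, and this is not the actual Jermline implementation (which the later slides say will be open-sourced):

```java
// Hypothetical driver + mapper for a "fill the hash table" style job (steps 1-2 above).
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class FillHashTableJob {

    // Input line (assumed): chromosome <TAB> userId <TAB> phased DNA string.
    public static class HashTableMapper
            extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
        private static final int WORD_LEN = 96; // the editor's notes mention 96-character words

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] parts = line.toString().split("\t");
            String chromosome = parts[0], userId = parts[1], dna = parts[2];
            for (int pos = 0; (pos + 1) * WORD_LEN <= dna.length(); pos++) {
                String word = dna.substring(pos * WORD_LEN, (pos + 1) * WORD_LEN);
                byte[] rowKey = Bytes.toBytes(chromosome + "_" + word + "_" + pos);
                Put put = new Put(rowKey);
                put.addColumn(Bytes.toBytes("d"), Bytes.toBytes(userId), new byte[] { 1 });
                context.write(new ImmutableBytesWritable(rowKey), put);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set(TableOutputFormat.OUTPUT_TABLE, "dna_hash_table"); // assumed table name

        Job job = Job.getInstance(conf, "jermline-fill-hash-table");
        job.setJarByClass(FillHashTableJob.class);
        job.setMapperClass(HashTableMapper.class);
        job.setNumReduceTasks(0); // "default reducer (does nothing)"
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TableOutputFormat.class);
        job.setOutputKeyClass(ImmutableBytesWritable.class);
        job.setOutputValueClass(Put.class);

        FileInputFormat.addInputPath(job, new Path(args[0])); // User ID, Phased DNA
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The results job (steps 3 and 4) follows the same pattern, except its mapper reads the hash table back for each new user and writes match ranges into the second HBase table.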
Run times for matching with Jermline
31
A 1700% performance improvement over GERMLINE!
[Chart: Jermline run time (0–25 hours) vs. number of samples (2,500–117,500)]
Run times for matching (in hours)
32
[Chart: GERMLINE run times, Jermline run times, and projected GERMLINE run times (0–180 hours) vs. number of samples (2,500–117,500)]
Jermline performance
● Science team is sure the Jermline algorithm is linear
● Improving the accuracy
● Found a bug in original C++ reference code
● Balancing false positives and false negatives
● Binary version of Jermline
● Use less memory and improve speed
● Paper submitted describing the implementation
● Releasing as an Open Source project soon
33
Beagle to Underdog
Moving phasing step from a single process to MapReduce
34
Phasing goes to the dogs
● Beagle
● Open source, freely available program
● Multi-threaded process that runs on one computer
● More accurate with a large sample set
● Underdog
● Does the same statistical calculations with a larger reference set, which increases accuracy
● Carefully split into a MapReduce implementation that allows parallel processing
● Collaboration between the DNA Science and Pipeline Developer Teams
35
What did we do?
1. Window Mapper -> Key: Window ID, Object(User ID, DNA)
2. Window Reducer -> Key: Window ID, Array[Object(User ID, DNA)]
3. Phase Mapper (loads window data) -> Key: User ID, Object(Window ID, Phased DNA)
4. Phase Reducer -> Key: User ID, Array[Object(Window ID, Phased DNA)]
36
[Diagram: the pipeline beefy box (process step control) drives four stages on the 40-node DNA Hadoop cluster: (1) Window Mappers, (2) Window Reducers, (3) Phase Mappers, (4) Phase Reducers. Input per user: ATGCATCGTACGGACT... The window reference data is loaded once]
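A toy sketch of step 1 (the Window Mapper): key each user's DNA by window ID so that all samples for a window arrive at the same reducer, which is what allows the expensive window reference data to be loaded only once. The window size and tab-separated input are assumptions, not Underdog's real format:

```java
// Toy sketch of the windowing idea behind the first Underdog MapReduce job.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WindowMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
    private static final int WINDOW_SIZE = 1000; // assumed number of positions per window

    // Input line (assumed): userId <TAB> unphased DNA string.
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] parts = line.toString().split("\t");
        String userId = parts[0], dna = parts[1];
        for (int windowId = 0; windowId * WINDOW_SIZE < dna.length(); windowId++) {
            int start = windowId * WINDOW_SIZE;
            int end = Math.min(start + WINDOW_SIZE, dna.length());
            // Key: Window ID; Value: (User ID, DNA slice), matching step 1 above.
            context.write(new IntWritable(windowId),
                          new Text(userId + "\t" + dna.substring(start, end)));
        }
    }
}
```

The phase mappers then load the window reference data once and phase every user's slice of that window, and the phase reducers reassemble each user's phased genome.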
Underdog performance
● Went from 12 hours to process 1,000 samples
to under 25 minutes with a MapReduce implementation
With improved accuracy!
37
[Chart: total run size and total Beagle/Underdog phasing duration per run, showing the drop when Underdog replaces Beagle]
Performance and next steps
Incremental change
38
Pipeline steps and incremental change…
● Incremental change over time
● Supporting the business in a “just in time” Agile way
39
[Chart: per-step duration for every pipeline run as the DNA pool grew from 500 to roughly 400,000 samples. Steps: Beagle-Underdog Phasing, Pipeline Finalize, Relationship Processing, Germline-Jermline Results Processing, Germline-Jermline Processing, Beagle Post Phasing, Admixture, Plink Prep, Pipeline Initialization. Annotated milestones: Jermline replaces Germline, Ethnicity V2 release, Underdog replaces Beagle, AdMixture on Hadoop]
…while the business continues to grow rapidly
40
[Chart: DNA database size (# of processed samples), Jan 2012 through Apr 2014]
What’s next? Building different pipelines
● Azkaban
● Allows us to easily tie together steps on Hadoop
● Drop different steps in/out and create different pipelines
● Significant improvement over a hand coded pipeline
● Cloud
● New algorithm changes will force a complete re-run of the entire DNA pool
● Best example: New matching or ethnicity algorithm will force us to reprocess 400K+
samples
● Solution: Use the cloud for this processing while the current pipeline keeps chugging along
41
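For context, Azkaban flows are defined by small per-step job files chained with dependencies, which is what makes it easy to drop steps in and out. A hypothetical two-step fragment (job names, jar, parameters, and class names are made up, not Ancestry's actual flow):

```
# phasing.job -- hypothetical Azkaban job file
type=command
command=hadoop jar dna-pipeline.jar com.example.UnderdogPhasing ${batch.id}

# matching.job -- runs only after phasing succeeds
type=command
command=hadoop jar dna-pipeline.jar com.example.JermlineMatching ${batch.id}
dependencies=phasing
```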
What’s next? Other areas for improvement
● Admixture as a MapReduce implementation
● Last major algorithm that needs to be addressed
● Expect to get performance improvements similar to Underdog
● Matching growth will cause problems
● Matches per run increasing
● Change the handoff
Keep measuring everything and adjust
42
[Chart: Jermline processing duration, batch run size, and total matches per run (up to roughly 9,000,000 matches) across runs from about 84,000 to 393,000 total samples]
Questions?
Ancestry is hiring for the DNA Pipeline Team!
Tech Roots Blog: http://blogs.ancestry.com/techroots
byetman@ancestry.com
Special thanks to the DNA Science and Pipeline Development Teams at
Ancestry

Editor's notes

  1. Critically important. In software development you must measure your performance at every step. Does not matter what you are doing, if you are not measuring your performance, you can’t improve. The last point is critical. We could determine the formula for performance of key phases (correlate this) and used that formula to predict future performance at particular DNA pool sizes. We could see the problems coming and knew when we were going to have performance issues. Story #1: Our first step that was going out of control (going quadratic), was the first implementation of the relationship calculation – happens just after matching. This step was basically two nested for loops that walked over the entire DNA pool for each input sample. Simple code, it worked with small numbers, fell over fast. Time was approaching 5 hours to run. Two of my developers rewrote this in PERL and got it down to 2 minutes 30 seconds. They were ecstatic. One of our DNA Scientists (PhD in Bioinformatics, MS in Computer Science – he knows how to code) wrote an AWK command (nasty regular expressions) that ran in less than 10 seconds. My devs were humbled. For the next week, whenever they ran into Keith, they formally bowed to his skills. (All in good nature, all fun.)
  2. Static by batch size (Phasing). Some steps took a long time but were very consistent. A worry but not critical to fix up front. Linear by DNA Pool size (Pipeline Initialization). Looked at ways to streamline and improve performance of these steps. Quadratic – those are the time bombs (Germline, Relationship processing, Germline results processing) The only way we knew this was coming was because we measured each step in the pipeline.
  3. This is a really simple example. Reality is 22 chromosomes and 96 characters in a word
  4. ‘G’ Germline would have to rebuild the hash table for all samples and then re-run all comparisons. An all by all comparison
  5. This is where HBase shines. It is easy to add columns and rows, very efficient with empty cells (sparse matrix). Hammer HBase with multiple processes doing this at the same time.
  6. Understand your application! Loading the window reference data is the most expensive step in the process. Most other implementations (Beagle included) loaded this data multiple times.
  7. Graph of each step in the pipeline for every run. This graph shows how important it is to measure everything. Some steps have been greatly reduced or eliminated. Light blue is the matching step. You can see it going quadratic and then the change when ‘J’ Jermline was released.