Systematic Analysis of Parameter Selection
for Sequence Alignment Algorithms
Project Recap and Phase 1 Summary
Aaron Smalter Hall
Molecular Graphics and Modeling Laboratory
University of Kansas
June 26, 2013
Motivation
● Genomics has become heavily dependent on
the use of sequence alignment tools
● Performance of sequence alignment is directly
dependent on parameters
● To date there is no systematic analysis of
sequence alignment parameters and their
effects on alignment performance
Challenges
● Sequence alignment is computationally
intensive
● Sequence alignment is often controlled by
many different parameters
● Often not tractable to perform alignment with
multiple parameter combinations
● Number of reads in a data set is growing –
partially offset by better hardware
Approach
● Systematic analysis of effects of parameter
perturbation on sequence alignment behavior
– Analyze performance sensitivity of individual
parameters (phase 1)
– Analyze performance sensitivity of parameter
combinations (phase 2)
– Compare performance characteristics across
sequence alignment tools (phase 3)
Experimental Design (Phase 1)
● Identify a broad set of interesting alignment tools
● Identify a broad set of interesting parameters for
each tool
● Identify interesting data sets to test
tools/parameters
● Execute tools on the same data sets, while
changing parameters individually over a wide
range
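The one-parameter-at-a-time design above can be sketched as a small job generator. This is a minimal illustration, not the project's actual scripts: the default values, the swept values, and the input file names are assumptions chosen for the example.

```python
# Hypothetical defaults and sweep values, for illustration only.
DEFAULTS = {"-k": 19, "-B": 4, "-O": 6}
SWEEP = {"-k": [1, 10, 19, 100], "-B": [0, 1, 4, 10]}


def one_at_a_time(defaults, sweep):
    """Yield (flag, value, settings) with a single parameter moved per run."""
    for flag, values in sweep.items():
        for value in values:
            settings = dict(defaults)  # every other parameter stays at its default
            settings[flag] = value
            yield flag, value, settings


def command_line(tool, settings, inputs):
    """Build an argument list: tool words, then flag/value pairs, then inputs."""
    args = list(tool)
    for flag, value in settings.items():
        args += [flag, str(value)]
    return args + list(inputs)


runs = list(one_at_a_time(DEFAULTS, SWEEP))
# File names below are placeholders, not the data sets used in the study.
first = command_line(["bwa", "mem"], runs[0][2], ["ref.fa", "reads_1.fq", "reads_2.fq"])
print(len(runs), "runs; first:", " ".join(first))
```

Each run differs from the defaults in at most one flag, so any change in CPU time, reads mapped, or MAPQ can be attributed to that flag alone.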
Experimental Design (Phase 2)
● Identify alignment parameters for each tool that are individually
sensitive to changes
● Define sensitivity w.r.t.:
– Computation time and memory required
– Read mapping rate
– Read mapping quality
● Identify functional ranges for each individual parameter
● Execute alignment tools on combinations of parameters, while
perturbing parameters across functional range
● Identify regions of increased sensitivity and execute alignment
tools with finer grained parameter value intervals
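A rough sketch of the last step, choosing where to place finer grained values. It assumes "increased sensitivity" means the largest change in one metric between two adjacent tested values, and it spaces new values linearly; both are assumptions, and the metric numbers below are made up.

```python
def refine_most_sensitive_interval(values, metric, n_new=3):
    """Return n_new evenly spaced values inside the adjacent pair of tested
    values whose metric differs the most."""
    pairs = sorted(zip(values, metric))
    jumps = [abs(m2 - m1) for (_, m1), (_, m2) in zip(pairs, pairs[1:])]
    i = max(range(len(jumps)), key=jumps.__getitem__)  # index of the biggest jump
    lo, hi = pairs[i][0], pairs[i + 1][0]
    step = (hi - lo) / (n_new + 1)
    return [lo + step * k for k in range(1, n_new + 1)]


# Made-up example: reads mapped collapses somewhere between 19 and 100.
tested_values = [0, 1, 10, 19, 100]
reads_mapped = [5.8e6, 5.8e6, 5.8e6, 5.8e6, 2.0e6]
new_values = refine_most_sensitive_interval(tested_values, reads_mapped)
print(new_values)
```

For integer-only parameters the new values would need rounding, and for parameters tested over several orders of magnitude a logarithmic spacing may be the better choice.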
Experimental Design (Phase 3)
● Identify relationships between parameters
● Identify parameter space regions of best
performance
● Compare performance and sensitivities across
sequence alignment tools
Current Status
● We are at the end of phase one, ready to move
into phase two
● Completed:
– Select alignment tools
– Select data sets
– Select parameters of interest
– Run experiments across broad parameter ranges
– Collect performance and sensitivity data
● Now: visualize and assess data for each
alignment tool
Phase 2 Requirements
● Identify sensitive parameters
● Identify functional range of parameters
● Write software scripts to automatically
generate parameter combination jobs to run
on cluster (there will be many, many jobs)
● Execute jobs and collect results
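The job-generation script could look roughly like this. It is a sketch under assumptions: the functional ranges are invented placeholders (the real ones come out of the phase 1 analysis), and the job naming scheme is hypothetical.

```python
import itertools

# Hypothetical functional ranges; real ranges come from the phase 1 results.
FUNCTIONAL_RANGES = {"-k": [10, 19, 30], "-B": [1, 4, 10], "-A": [1, 2]}


def combination_jobs(ranges):
    """Yield one settings dict per point of the full parameter grid."""
    flags = sorted(ranges)
    for combo in itertools.product(*(ranges[f] for f in flags)):
        yield dict(zip(flags, combo))


def job_name(settings):
    """Readable, unique job name such as 'A1_B1_k10' (naming is an assumption)."""
    return "_".join(f"{flag.lstrip('-')}{value}" for flag, value in sorted(settings.items()))


jobs = list(combination_jobs(FUNCTIONAL_RANGES))
print(len(jobs), "jobs, e.g.", job_name(jobs[0]))
```

The job count is the product of the range sizes (here 3 x 3 x 2 = 18), which is why the number of jobs grows so quickly once more parameters or finer intervals are added.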
Experimental Choices
● Datasets
● Alignment tools
● Collected Results
– Parameters
– Sensitivities
– Functional ranges
Datasets
● Collected several data sets:
– DePristo/Broad NA12878 Whole Genome
– DePristo/Broad NA12878 Whole Exome
– Synthetic paired end
● 10 million pairs
● 1 million pairs
● 100k pairs
● 10k pairs
● 1k pairs
Alignment Tools
● BWA-mem
● BWA-sw
● SOAP2
● Bowtie2
● Novoalign
● SeqAlto
● RazerS
Collected Results
● For each alignment tool:
– Table of parameters ranked (roughly) by magnitude of effect on
performance
● according to standard deviation of performance characteristics
– Figures of most sensitive parameters showing performance
results over entire parameter range tested
– Scatter plots of every experimental result showing tradeoffs:
● CPU time vs. reads mapped
● CPU time vs. mean MAPQ
● Reads mapped vs. mean MAPQ
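The ranking in the parameter tables can be sketched as follows. The numbers are made up, and the use of the sample standard deviation is an assumption; the slides do not say which estimator was used.

```python
import statistics

# Made-up results: for each parameter, one reads-mapped count per tested value.
results = {
    "-k": [100, 100, 100, 40],
    "-w": [100, 101, 100, 99],
    "-U": [100, 100, 100, 100],
}


def rank_by_stdev(results):
    """Return (parameter, stdev) pairs, most sensitive parameter first."""
    scores = {param: statistics.stdev(vals) for param, vals in results.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = rank_by_stdev(results)
for param, score in ranking:
    print(f"{param:4s} {score:8.2f}")
```

A parameter whose metric never moves across its tested values gets a standard deviation of 0 and lands at the bottom of the table.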
Comparison at Defaults
Aligner | CPU Usage | Max V.Mem | Mapped Reads | MAPQ Mean | MAPQ Stdev | Pct Mismatch
BWA-mem | 483.35 | 5.576G | 5827211 | 44.0732 | 20.0053 | 25.4545
BWA-sw | 1102.11 | 5.215G | 2853347 | 64.7333 | 69.9756 | 25.3673
SOAP2 | 463.01 | 5.613G | 5645049 | 22.8406 | 12.7877 | 30.0828
Bowtie2 | 1565.36 | 3.343G | 5803539 | 25.1817 | 14.6297 | 25.7779
Novoalign | 4247.55 | 7.981G | 5296829 | 65.422 | 11.862 | 25.3865
SeqAlto | 2823.06 | 7.022G | 5669412 | 49.164 | 18.4466 | 25.8334
RazerS | 26738.65 | 8.577G | 1777714 | 255 | 0 | 38.9175
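A quick worked calculation on the default-settings numbers: mapped reads per unit of CPU usage. The table does not state the CPU unit, so the ratio is only meaningful for comparing aligners against each other.

```python
# (aligner, CPU usage, mapped reads) copied from the comparison at defaults.
DEFAULT_RUNS = [
    ("BWA-mem", 483.35, 5827211),
    ("BWA-sw", 1102.11, 2853347),
    ("SOAP2", 463.01, 5645049),
    ("Bowtie2", 1565.36, 5803539),
    ("Novoalign", 4247.55, 5296829),
    ("SeqAlto", 2823.06, 5669412),
    ("RazerS", 26738.65, 1777714),
]


def reads_per_cpu_unit(rows):
    """Return (aligner, mapped reads per unit of CPU usage), highest first."""
    rates = [(name, reads / cpu) for name, cpu, reads in rows]
    return sorted(rates, key=lambda r: r[1], reverse=True)


throughput = reads_per_cpu_unit(DEFAULT_RUNS)
for name, rate in throughput:
    print(f"{name:10s} {rate:10.1f}")
```

On this measure SOAP2 and BWA-mem are nearly tied at the top, and RazerS is lower by more than two orders of magnitude.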
BWA-mem
● A recent addition to the BWA package
– Designed for reads from 70bp to 1Mbp
● Based on Burrows-Wheeler Transform index
structures
● Some parameter values caused BWA-mem to report
more mapped reads than should be present
● Fairly typical set of parameters
● Released 2012
BWA-mem Parameter Sensitivities
Name | Flag | CPU | Memory | Reads | MAPQ | Parameter Values | Invalid Values
minimum seed length | -k | 274.71 | 1,439.18 | 1,231,012.26 | 20.08 | [0,1,10,19,100] | 1,000.00
occurrence threshold for discard | -c | 1,432.77 | 7,771.84 | 178,211.44 | 5.45 | [1,10,100,1000,10000,100000] | 0.00
mismatch penalty | -B | 55.54 | 290.55 | 892,401.53 | 8.28 | [0,1,4,10,100,1000] | []
matching score | -A | 137.85 | 713.91 | 481,096.44 | 2.29 | [1,10,100,1000] | 0.00
unpaired penalty | -U | 32.41 | 166.97 | 8.52 | 7.25 | [0,1,9,10,100,1000] | []
re-seeding threshold | -r | 59.50 | 301.14 | 620.75 | 1.64 | [0,1,1.01,1.1,1.5,2,10,100,1000] | []
gap open penalty | -O | 50.71 | 264.05 | 13,670.66 | 0.76 | [0,1,6,10,100,1000] | []
band width | -w | 35.58 | 185.60 | 9,465.24 | 0.10 | [0,1,10,100,1000,10000] | []
gap extension penalty | -E | 35.12 | 182.84 | 11,426.13 | 0.08 | [0,1,10,100,1000] | []
clipping penalty | -L | 12.18 | 62.41 | 6,357.34 | 0.04 | [0,1,5,10,100,1000] | []
BWA-mem – k, c
BWA-mem – B, A
BWA-mem – U, O
BWA-mem Tradeoffs
BWA-sw
● Doesn't work on paired ends
– Treat each end as an individual read
– Reported count of reads mapped is unreliable because of
identical read IDs
● Works on reads 70bp-1Mbp
● Similar features to BWA-mem
● Similar parameters to BWA-mem
● Released 2010
BWA-sw – Parameter Sensitivities
Name | Flag | CPU | Memory | Reads | MAPQ | Parameter Values | Invalid Values
min score threshold | -T | 410.90 | 2,121.72 | 727,340.56 | 27.76 | [0,1,10,37,100] | 1,000.00
z-best heuristics | -z | 27,174.89 | 144,452.72 | 10,993.91 | 21.56 | [1,10,100] | 0.00
threshold adjustment coefficient | -c | 114.96 | 593.24 | 27,795.00 | 22.45 | [0,1,5.5,10] | [100,1000]
mismatch penalty | -b | 122.87 | 634.87 | 7,638.45 | 16.90 | [0,1,3,10,100,1000] | []
gap open penalty | -q | 311.60 | 1,609.43 | 18,425.36 | 16.73 | [0,1,5,10,100,1000] | []
max SA interval for seed | -s | 11,048.94 | 57,081.69 | 24.42 | 1.13 | [1,3,10,100,1000] | []
min number seeds | -N | 265.15 | 1,368.47 | 345.99 | 4.47 | [0,1,5,10,100,1000] | []
gap extension penalty | -r | 271.25 | 1,400.37 | 2,992.56 | 0.62 | [1,2,10,100,1000] | []
band width | -w | 111.93 | 577.31 | 2,800.64 | 0.01 | [1,10,33,100,1000] | []
match score | -a | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | [0,10,100,1000]
BWA-sw – T, z
BWA-sw – c, b
BWA-sw – q, N
BWA-sw Tradeoffs
SOAP2
● Also based on BWT index structures
● Order of magnitude improvement over
previous version
● Similar parameters to BWA
● Original release in 2008, latest release in 2011
SOAP2 – Parameter Sensitivities
Name | Flag | CPU | Memory | Reads | MAPQ | Parameter Values | Invalid Values
min insert size | -m | 268.29 | 1,505.68 | 0.00 | 4.49 | [0,1,10,100,400,1000,10000,100000] | []
continuous gap size allowed | -g | 50.62 | 284.38 | 36,521.81 | 0.07 | [0,1,10,100,1000] | []
min alignment length | -s | 60.41 | 338.90 | 33,200.06 | 0.07 | [10,100,255,1000] | []
max insert size | -x | 186.14 | 1,044.54 | 0.00 | 2.70 | [0,1,10,100,600,1000,10000,100000] | []
disallow gap within e-bp | -e | 24.60 | 137.84 | 0.00 | 0.00 | [0,1,5,10,100,1000] | []
max mismatch per read | -v | 20.21 | 113.54 | 31.70 | 0.00 | [0,1,5,10,100,1000] | []
seed length | -l | 11.50 | 64.80 | 412.81 | 0.00 | [100,256,1000] | []
number Ns to allow | -n | 7.33 | 41.21 | 115.89 | 0.00004 | [0,1,5,10,100,1000] | []
SOAP2 – m ,g
SOAP2 – s, x
SOAP2 - Tradeoffs
Bowtie 2
● Works on reads from 50-1000bp
● Compresses BWT index to limit memory
footprint
● Similar parameters to BWA and SOAP2, with a
few additions
● Released 2012, latest release in 2013
Bowtie2 – Parameter Sensitivities
Name | Flag | CPU | Memory | Reads | MAPQ | Parameter Values | Invalid Values
length of seed substring | -L | 27,039.75 | 2,857.25 | 34,224.99 | 2.68 | [3,6,9,13,16,19,22,26,29,32] | []
end of interval between seed substrings | -i2 | 950.37 | 3,500.88 | 15,031.23 | 1.21 | [0,1,1.25,2,4,8,16] | []
min acceptable alignment score coefficient | -score-min2 | 501.10 | 718.53 | 801,965.66 | 13.57 | [-0.9,-0.6,-0.3,0,1] | []
reference gap open penalty | -rfg1 | 181.85 | 2,152.54 | 2,889.80 | 0.11 | [0,1,3,5,6,10,32,100] | []
reference gap extend penalty | -rfg2 | 462.52 | 1,624.29 | 5,373.45 | 0.15 | [1,3,5,6,10,32,100] | []
max mismatch penalty | -mp1 | 91.02 | 1,071.79 | 59,024.02 | 5.69 | [2,3,5,6,10,32,100] | []
stop gap extension after <D> failures | -D | 76.67 | 1,441.92 | 12,201.07 | 0.93 | [5,9,13,15,17,21,25] | []
read gap open penalty | -rdg1 | 31.19 | 1,395.54 | 6,491.17 | 0.04 | [0,1,3,5,6,10,32,100] | []
min mismatch penalty | -mp2 | 24.00 | 1,292.41 | 1,068.41 | 0.09 | [2,3,4,5] | []
try <R> sets of seeds for repetitive seeds | -R | 47.47 | 1,185.47 | 290.98 | 0.00 | [1,2,3] | []
penalty for Ns | -np | 27.86 | 1,161.23 | 3,349.96 | 0.01 | [0,1,2,3,5,10,32,100] | []
read gap extension penalty | -rdg2 | 30.43 | 1,072.39 | 6,812.47 | 0.04 | [1,3,5,6,10,32,100] | []
max mismatches in seed | -N | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00
Bowtie2 – L, i2
Bowtie2 – score-min2, rfg1
Bowtie2 – rfg2, mp1
Bowtie2 - Tradeoffs
Novoalign
● Smallest number of parameters
● Requires paid license for commercial use
● Does global alignment with full Needleman-Wunsch
algorithm
● Some nice 'bonus' features:
– multithreaded support
– Base quality calibration
– Adapter stripping
● Originally released 2008, newest version 3 released last
month
Novoalign – Parameter Sensitivities
Name | Flag | CPU | Memory | Reads | MAPQ | Parameter Values | Invalid Values
gap open penalty | g | 201,109.30 | 3,731,245.16 | 4,535.13 | 3.09 | [0,10,20,30,40,50,60,70,80,90,99] | []
threshold for highest alignment score | t | 948.24 | 7,490.47 | 1,095,323.89 | 1.11 | [-1,0,10,20,40,50,60,70,80,90,100] | []
minimum good quality bases for read | l | 565.53 | 4,489.02 | 244,324.72 | 0.07 | [15,20,25,35,45,55,65,75,85,95,100] | []
structural variation penalty for chimeric fragments | v | 423.15 | 5,630.93 | 4,542.41 | 0.18 | [0,10,20,30,40,50,60,70,80,90,100,110,120,130,140] | []
gap extend penalty | x | 251.55 | 1,985.14 | 6,112.46 | 0.08 | [6,10,20,30,40,50,60,70,80,90,99] | []
threshold for homopolymer filter | h | 269.90 | 2,297.69 | 136.19 | 0.00 | [0,10,20,30,40] | []
Novoalign – g, t
Novoalign – l, v
Novoalign – x, h
Novoalign Tradeoffs
SeqAlto
● More parameters than other aligners
● Uses standard hashing index structures with
larger seeds and adaptive stopping
● Designed for reads about 100bp or more
● Claims to be 2-4x faster than BWA, but our results
do not agree
● Initially released in 2012
SeqAlto – Parameter Table
Name | Flag | CPU | Memory | Reads | MAPQ | Parameter Values | Invalid Values
k-mer maximum occurrence threshold (Needleman-Wunsch) | max_occ_nw | 78.04 | 545.72 | 65,393.48 | 0.73 | [2,10,100,1000,100000] | []
minimum gap open rate | o | 5,074.01 | 35,925.63 | 281.74 | 0.01 | [0.005,0.05,0.5,0.99,1] | []
maximum template size | i | 2,707.76 | 18,995.95 | 676.10 | 10.24 | [250,550,5500,55000] | []
k-mer maximum occurrence threshold | max_occ | 174.30 | 1,222.60 | 58,578.73 | 0.66 | [2,10,100,1000,10000,100000] | []
Phred score pairing prior | d | 103.76 | 727.66 | 928.91 | 1.96 | [0,8,80,100,800,8000] | []
maximum gap extension length | e | 892.84 | 6,259.12 | 4,723.80 | 0.05 | [0,5,25,50,75,100,1000] | []
Needleman-Wunsch mismatch penalty | nw_sub | 308.02 | 2,156.92 | 1,242.50 | 1.60 | [0,10,15,100,1000] | []
Needleman-Wunsch match score | nw_mat | 30.85 | 215.38 | 4,575.00 | 1.23 | [0,2,5,10,100,1000] | []
Needleman-Wunsch gap extension penalty | nw_ext | 16.84 | 116.35 | 5,343.32 | 0.12 | [2,10,100,1000] | []
Smith-Waterman match score | sw_mat | 32.53 | 225.79 | 2,322.35 | 0.05 | [0,2,5,10,100,1000] | []
Needleman-Wunsch gap open penalty | nw_gap | 33.41 | 232.77 | 1,922.74 | 0.01 | [0,10,40,100,1000] | []
additional k-mer look-ahead for high mismatch (Needleman-Wunsch) | kmer_pen_nw | 38.24 | 267.76 | 1,899.81 | 0.09 | [0,1,10] | []
additional k-mer look-ahead for high mismatch | kmer_pen | 34.28 | 240.52 | 1,647.21 | 0.08 | [0,1,10,100] | []
k-mer look ahead | look_ahead | 85.48 | 600.54 | 42.25 | 0.04 | [0,2,10,100] | []
minimum unclipped read percentage | c | 5.46 | 38.80 | 320.08 | 0.00 | [0,5,25,50,75,100] | []
Smith-Waterman gap open penalty | sw_gap | 6.87 | 48.40 | 313.79 | 0.01 | [0,10,40,100,1000] | []
Smith-Waterman mismatch penalty | sw_sub | 14.78 | 101.98 | 98.20 | 0.02 | [0,10,15,100,1000] | []
Smith-Waterman gap extension penalty | sw_ext | 9.57 | 65.70 | 4.49 | 0.00 | [0,2,10,100,1000] | []
k-mer look ahead (Needleman-Wunsch) | look_ahead_nw | 3.74 | 26.81 | 0.00 | 0.00 | [0,2,10,100] | []
average template size | m | 3.35 | 24.88 | 0.00 | 0.00 | [0,100,200,300,550] | []
SeqAlto – max_occ_nw, o
SeqAlto – i, max_occ
SeqAlto – d, e
SeqAlto - Tradeoffs
RazerS
● Some irregularities:
– Majority of experiments report only 1.7 million reads
mapped, but some experiments report over 13 million
reads mapped
– MAPQ reported as 255 for all experiments
– Mean read length for most experiments is ~4.5
● Uses q-gram counting for approximate search
● Latest version, RazerS 3, supports parallelization
● Initially released 2012
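In the SAM format, a MAPQ of 255 means the mapping quality is not available, so a mean of 255 with zero spread is a placeholder rather than a score. A minimal sketch of a summary that treats it that way, assuming the aligner output is SAM text; the records below are made up, and this is not the script used to produce the tables.

```python
import statistics


def sam_record(name, flag, mapq):
    """Build a minimal made-up 11-field SAM line for the demo."""
    fields = [name, str(flag), "chr1", "100", str(mapq), "4M", "*", "0", "0", "ACGT", "IIII"]
    return "\t".join(fields)


def mapq_summary(sam_lines):
    """Return (mapped record count, mean MAPQ over records with a usable MAPQ)."""
    mapped = 0
    mapqs = []
    for line in sam_lines:
        if line.startswith("@"):  # header line
            continue
        fields = line.rstrip("\n").split("\t")
        flag, mapq = int(fields[1]), int(fields[4])
        if flag & 0x4:  # 0x4 = segment unmapped
            continue
        mapped += 1
        if mapq != 255:  # 255 = mapping quality unavailable
            mapqs.append(mapq)
    mean = statistics.mean(mapqs) if mapqs else None
    return mapped, mean


lines = [
    "@HD\tVN:1.6",
    sam_record("r1", 0, 60),
    sam_record("r2", 0, 0),
    sam_record("r3", 0, 255),
    sam_record("r4", 4, 0),
]
mapped, mean_mapq = mapq_summary(lines)
print(mapped, mean_mapq)
```

This sketch counts records, not distinct reads. An aligner that writes several records per read would inflate the count, which may be worth checking for the runs that report far more reads mapped than expected.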
RazerS – Parameter Sensitivities
Name | Flag | CPU | Memory | Reads | MAPQ | Parameter Values | Invalid Values
tolerated deviation from library size | le | 3,808.44 | 28,396.98 | 6,406,142.49 | 0.00 | [0,25,50,100,1000,10000] | []
threshold of common kmers between read and reference | t | 40,775.85 | 297,513.56 | 886,521.11 | 0.00 | [-1,1,10,100] | []
percent identity threshold | i | 15,076.96 | 79,300.67 | 695,578.11 | 0.00 | [92,100] | [50,60]
mean library length | ll | 1,492.70 | 16,932.42 | 1,578,820.93 | 0.00 | [100,120,220,320,2200] | []
no gaps flag | ng | 2,623.38 | 21,989.04 | 669,685.28 | 0.00 | [0,1] | []
repeat length | rl | 2,826.93 | 18,235.32 | 6.00 | 0.00 | [10,100,1000,10000] | []
distance range for best match errors | dr | 1,371.85 | 7,688.67 | 230,520.49 | 0.00 | [-1,0,1,10,100] | []
read kmers overabundance cutoff | oc | 1,205.15 | 5,543.36 | 0.00 | 0.00 | [0,1] | []
overlap length | ol | 1,107.97 | 5,205.90 | 0.00 | 0.00 | [-1,0,1,10,100] | []
mutation rate | mr | 734.48 | 3,577.62 | 0.00 | 0.00 | [0,1,5,10] | []
percent recognition rate | rr | 354.60 | 1,607.40 | 0.00 | 0.00 | [82,85,90,99] | []
taboo length | tl | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | [10,100]
RazerS – le, t
RazerS – i, ll
RazerS – ng, dr
RazerS - Tradeoffs
CPU Time Histograms
BWA-mem BWA-sw SOAP2
Bowtie2 Novoalign SeqAlto
Mean MAPQ Histograms
BWA-mem BWA-sw SOAP2
Bowtie2 Novoalign SeqAlto
Reads Mapped Histograms
BWA-mem BWA-sw SOAP2
Bowtie2 Novoalign SeqAlto
Some Conclusions
● BWA-mem and SOAP2 are fastest, and execute in minutes
– But, BWA accuracy is not great
– SOAP2 accuracy is even worse
● Novoalign is the most accurate but requires more time and memory,
and aligns fewer reads
● Bowtie2 is the most memory efficient
● Novoalign and SeqAlto appear to be the most stable aligners
● SeqAlto is decent all around, not the best, not the worst
● RazerS has some basic issues
● In many cases, the best performance characteristics can be
achieved without sacrificing performance in other areas
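One way to make these trade-off statements checkable is a Pareto test on the default-settings numbers: an aligner is dominated if another one is no worse on CPU usage, reads mapped, and mean MAPQ, and strictly better on at least one. This is a sketch, not the analysis behind the conclusions; treating mean MAPQ as comparable across aligners is itself an assumption, since each tool scores MAPQ differently.

```python
# (aligner, CPU usage, mapped reads, mean MAPQ) from the comparison at defaults.
# RazerS is left out because its MAPQ of 255 is a placeholder, not a score.
ROWS = [
    ("BWA-mem", 483.35, 5827211, 44.0732),
    ("BWA-sw", 1102.11, 2853347, 64.7333),
    ("SOAP2", 463.01, 5645049, 22.8406),
    ("Bowtie2", 1565.36, 5803539, 25.1817),
    ("Novoalign", 4247.55, 5296829, 65.422),
    ("SeqAlto", 2823.06, 5669412, 49.164),
]


def dominates(a, b):
    """True if a is at least as good as b on every axis and better on one.
    Lower CPU is better; higher reads mapped and higher mean MAPQ are better."""
    no_worse = a[1] <= b[1] and a[2] >= b[2] and a[3] >= b[3]
    better = a[1] < b[1] or a[2] > b[2] or a[3] > b[3]
    return no_worse and better


def pareto_front(rows):
    """Names of the aligners that no other aligner dominates."""
    return [a[0] for a in rows if not any(dominates(b, a) for b in rows)]


front = pareto_front(ROWS)
print(front)
```

On these three axes only Bowtie2 is dominated (by BWA-mem at defaults); memory, where Bowtie2 is strongest, is deliberately not included here and would change the picture.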
Next Steps
● Generate parameter combination jobs
● Submit jobs to Beocat for execution
– Beocat has been under a lot of maintenance lately;
is that more or less finished?
● Consolidate results for next round of analysis
● Interpret results and start working on
manuscript
Acknowledgements
● Faculty
– Brooke Fridley
– Jeremy Chen
– Sue Brown
● Students, Staff, and Post-docs
– Byunggil Yoo
– Jennifer Shelton
– Rama Raghavan
– Greg Matuszek
BWA-mem Supplementary
BWA-sw Supplementary
Bowtie2 Supplementary
SeqAlto Supplementary
RazerS Supplementary
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 

Param selection phase1summary_v2

  • 1. Systematic Analysis of Parameter Selection for Sequence Alignment Algorithms Project Recap and Phase 1 Summary Aaron Smalter Hall Molecular Graphics and Modeling Laboratory University of Kansas June 26, 2013
  • 2. Motivation ● Genomics has become heavily dependent on the use of sequence alignment tools ● Performance of sequence alignment is directly dependent on parameters ● To date there is no systematic analysis of sequence alignment parameters and their effects on alignment performance
  • 3. Challenges ● Sequence alignment is computationally intensive ● Sequence alignment is often controlled by many different parameters ● Often not tractable to perform alignment with multiple parameter combinations ● Number of reads in a data set is growing – partially offset by better hardware
  • 4. Approach ● Systematic analysis of effects of parameter perturbation on sequence alignment behavior – Analyze performance sensitivity of individual parameters (phase 1) – Analyze performance sensitivity of parameter combinations (phase 2) – Compare performance characteristics across sequence alignment tools (phase 3)
  • 5. Experimental Design (Phase 1) ● Identify a broad set of interesting alignment tools ● Identify a broad set of interesting parameters for each tool ● Identify interesting data sets to test tools/parameters ● Execute tools on the same data sets, while changing parameters individually over a wide range
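The one-parameter-at-a-time design described on this slide can be sketched as follows. This is a minimal illustration, not the project's actual scripts: the function name `one_at_a_time_commands` and the input file names are hypothetical, and the two flag value lists are copied from the BWA-mem sensitivity table on slide 16.

```python
from typing import Dict, List, Sequence


def one_at_a_time_commands(
    base_cmd: Sequence[str],
    sweeps: Dict[str, Sequence[object]],
    inputs: Sequence[str],
) -> List[List[str]]:
    """Build one command per (flag, value) pair.

    Each command changes a single flag; every other parameter is left
    at the tool's default by simply not passing it.
    """
    commands = []
    for flag, values in sweeps.items():
        for value in values:
            commands.append([*base_cmd, flag, str(value), *inputs])
    return commands


# Hypothetical input files; the values are those tested for -k and -B.
sweeps = {
    "-k": [0, 1, 10, 19, 100],
    "-B": [0, 1, 4, 10, 100, 1000],
}
commands = one_at_a_time_commands(
    ["bwa", "mem"], sweeps, ["ref.fa", "reads_1.fq", "reads_2.fq"]
)
for cmd in commands[:3]:
    print(" ".join(cmd))
```

The number of runs grows as the sum of the value-list lengths (11 here), which is what keeps phase 1 tractable compared with testing combinations.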
  • 6. Experimental Design (Phase 2) ● Identify alignment parameters for each tool that are individually sensitive to changes ● Define sensitivity w.r.t.: – Computation time and memory required – Read mapping rate – Read mapping quality ● Identify functional ranges for each individual parameter ● Execute alignment tools on combinations of parameters, while perturbing parameters across functional range ● Identify regions of increased sensitivity and execute alignment tools with finer grained parameter value intervals
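One way to pick "finer grained parameter value intervals" is to find the pair of adjacent tested values between which a metric changes most, and add new values inside that interval. The sketch below assumes that rule; the slides do not say how the regions of increased sensitivity were actually chosen, and the function name and metric values are hypothetical.

```python
from typing import List, Sequence


def refine_most_sensitive_interval(
    values: Sequence[float],
    metric: Sequence[float],
    n_new: int = 3,
) -> List[float]:
    """Return n_new evenly spaced values inside the interval between
    adjacent tested values where the metric changes the most."""
    if len(values) != len(metric) or len(values) < 2:
        raise ValueError("need at least two (value, metric) pairs of equal length")
    pairs = sorted(zip(values, metric))
    # Absolute metric change across each adjacent pair of tested values.
    jumps = [abs(pairs[i + 1][1] - pairs[i][1]) for i in range(len(pairs) - 1)]
    i = jumps.index(max(jumps))
    lo, hi = pairs[i][0], pairs[i + 1][0]
    step = (hi - lo) / (n_new + 1)
    return [lo + step * (j + 1) for j in range(n_new)]


# Hypothetical reads-mapped counts for five tested seed lengths.
tested = [0, 1, 10, 19, 100]
reads_mapped = [5.8e6, 5.8e6, 5.8e6, 5.8e6, 2.0e6]
print(refine_most_sensitive_interval(tested, reads_mapped, n_new=2))
```

In this made-up example the largest change falls between 19 and 100, so the new values land inside that interval.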
  • 7. Experimental Design (Phase 3) ● Identify relationships between parameters ● Identify parameter space regions of best performance ● Compare performance and sensitivities across sequence alignment tools
  • 8. Current Status ● We are at the end of phase one, ready to move into phase two ● Completed: – Select alignment tools – Select data sets – Select parameters of interest – Run experiments across broad parameter ranges – Collect performance and sensitivity data ● Now: visualize and assess data for each alignment tool
  • 9. Phase 2 Requirements ● Identify sensitive parameters ● Identify functional range of parameters ● Write software scripts to automatically generate parameter combination jobs to run on cluster (there will be many, many jobs) ● Execute jobs and collect results
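The job-generation script mentioned on this slide could look roughly like the sketch below, which enumerates the full Cartesian product of parameter values and emits one command line per combination. The function name, the grid values, and the input file names are assumptions for illustration; wrapping each line in a cluster submission script is left out because the slides do not describe the scheduler setup.

```python
import itertools
import shlex
from typing import Dict, Iterator, Sequence


def combination_commands(
    base_cmd: Sequence[str],
    grid: Dict[str, Sequence[object]],
    inputs: Sequence[str],
) -> Iterator[str]:
    """Yield one shell-quoted command line per parameter combination."""
    flags = list(grid)
    for combo in itertools.product(*(grid[f] for f in flags)):
        args = list(base_cmd)
        for flag, value in zip(flags, combo):
            args += [flag, str(value)]
        args += list(inputs)
        yield shlex.join(args)


# Hypothetical functional ranges for three BWA-mem flags.
grid = {"-k": [10, 19], "-B": [1, 4], "-O": [6, 10]}
jobs = list(
    combination_commands(["bwa", "mem"], grid, ["ref.fa", "reads_1.fq", "reads_2.fq"])
)
print(len(jobs))
print(jobs[0])
```

The job count is the product of the value-list lengths, so restricting each parameter to its functional range first is what keeps "many, many jobs" from becoming intractable.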
  • 10. Experimental Choices ● Datasets ● Alignment tools ● Collected Results – Parameters – Sensitivities – Functional ranges
  • 11. Datasets ● Collected several data sets: – DePristo/Broad NA12878 Whole Genome – DePristo/Broad NA17878 Whole Exome – Synthetic paired end ● 10 million pairs ● 1 million pairs ● 100k pairs ● 10k pairs ● 1k pairs
  • 12. Alignment Tools ● BWA-mem ● BWA-sw ● SOAP2 ● Bowtie2 ● Novoalign ● SeqAlto ● RazerS
  • 13. Collected Results ● For each alignment tool: – Table of parameters ranked (roughly) by magnitude of effect on performance ● according to standard deviation of performance characteristics – Figures of most sensitive parameters showing performance results over entire parameter range tested – Scatter plots of every experimental result showing tradeoffs: ● CPU time vs. reads mapped ● CPU time vs. mean MAPQ ● Reads mapped vs. mean MAPQ
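The ranking by standard deviation can be sketched as below. The slide does not say whether a sample or population standard deviation was used, so the population form (`statistics.pstdev`) is an assumption here, and the per-flag mean MAPQ values are made up for illustration.

```python
import statistics
from typing import Dict, List, Tuple


def rank_by_stdev(results: Dict[str, List[float]]) -> List[Tuple[str, float]]:
    """Rank parameters by the spread of a metric across their tested values.

    results maps a flag to the metric observed at each tested value of
    that flag, with all other parameters held at their defaults.
    """
    ranked = [(flag, statistics.pstdev(vals)) for flag, vals in results.items()]
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Hypothetical mean MAPQ per run for three flags.
mean_mapq = {
    "-k": [10.0, 40.0, 44.0, 20.0],
    "-w": [44.0, 44.1, 44.0, 43.9],
    "-B": [44.0, 30.0, 44.0, 40.0],
}
for flag, spread in rank_by_stdev(mean_mapq):
    print(f"{flag}\t{spread:.2f}")
```

A large spread means the metric moves a lot as that one parameter changes, which is the sense in which the tables on the following slides call a parameter sensitive.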
  • 14. Comparison at Defaults
      Aligner | CPU Usage | Max V.Mem | Mapped reads | MAPQ mean | MAPQ stdev | Pct mismatch
      BWA-mem | 483.35 | 5.576G | 5827211 | 44.0732 | 20.0053 | 25.4545
      BWA-sw | 1102.11 | 5.215G | 2853347 | 64.7333 | 69.9756 | 25.3673
      SOAP2 | 463.01 | 5.613G | 5645049 | 22.8406 | 12.7877 | 30.0828
      Bowtie2 | 1565.36 | 3.343G | 5803539 | 25.1817 | 14.6297 | 25.7779
      Novoalign | 4247.55 | 7.981G | 5296829 | 65.422 | 11.862 | 25.3865
      SeqAlto | 2823.06 | 7.022G | 5669412 | 49.164 | 18.4466 | 25.8334
      RazerS | 26738.65 | 8.577G | 1777714 | 255 | 0 | 38.9175
  • 15. BWA-mem ● A recent addition to the BWA package – Designed for short reads up to 100bp ● Based on Burrows-Wheeler Transform index structures ● Some parameter values caused BWA to find more reads than should be present ● Fairly typical set of parameters ● Released 2012
  • 16. BWA-mem Parameter Sensitivities
      Name | Flag | CPU | Memory | Reads | MAPQ | Parameter Values | Invalid Values
      minimum seed length | -k | 274.71 | 1,439.18 | 1,231,012.26 | 20.08 | [0,1,10,19,100] | 1,000.00
      occurrence threshold for discard | -c | 1,432.77 | 7,771.84 | 178,211.44 | 5.45 | [1,10,100,1000,10000,100000] | 0.00
      mismatch penalty | -B | 55.54 | 290.55 | 892,401.53 | 8.28 | [0,1,4,10,100,1000] | []
      matching score | -A | 137.85 | 713.91 | 481,096.44 | 2.29 | [1,10,100,1000] | 0.00
      unpaired penalty | -U | 32.41 | 166.97 | 8.52 | 7.25 | [0,1,9,10,100,1000] | []
      re-seeding threshold | -r | 59.50 | 301.14 | 620.75 | 1.64 | [0,1,1.01,1.1,1.5,2,10,100,1000] | []
      gap open penalty | -O | 50.71 | 264.05 | 13,670.66 | 0.76 | [0,1,6,10,100,1000] | []
      band width | -w | 35.58 | 185.60 | 9,465.24 | 0.10 | [0,1,10,100,1000,10000] | []
      gap extension penalty | -E | 35.12 | 182.84 | 11,426.13 | 0.08 | [0,1,10,100,1000] | []
      clipping penalty | -L | 12.18 | 62.41 | 6,357.34 | 0.04 | [0,1,5,10,100,1000] | []
  • 21. BWA-sw ● Doesn't work on paired ends – Treat each end as an individual read – Reads mapped reported is bugged because of identical read IDs ● Works on reads 70bp-1Mbp ● Similar features to BWA-mem ● Similar parameters to BWA-mem ● Released 2010
  • 22. BWA-sw – Parameter Sensitivities
      Name | Flag | CPU | Memory | Reads | MAPQ | Parameter Values | Invalid Values
      min score threshold | -T | 410.90 | 2,121.72 | 727,340.56 | 27.76 | [0,1,10,37,100] | 1,000.00
      z-best heuristics | -z | 27,174.89 | 144,452.72 | 10,993.91 | 21.56 | [1,10,100] | 0.00
      threshold adjustment coef | -c | 114.96 | 593.24 | 27,795.00 | 22.45 | [0,1,5.5,10] | [100,1000]
      mismatch penalty | -b | 122.87 | 634.87 | 7,638.45 | 16.90 | [0,1,3,10,100,1000] | []
      gap open penalty | -q | 311.60 | 1,609.43 | 18,425.36 | 16.73 | [0,1,5,10,100,1000] | []
      max SA interval for seed | -s | 11,048.94 | 57,081.69 | 24.42 | 1.13 | [1,3,10,100,1000] | []
      min number seeds | -N | 265.15 | 1,368.47 | 345.99 | 4.47 | [0,1,5,10,100,1000] | []
      gap extension penalty | -r | 271.25 | 1,400.37 | 2,992.56 | 0.62 | [1,2,10,100,1000] | []
      band width | -w | 111.93 | 577.31 | 2,800.64 | 0.01 | [1,10,33,100,1000] | []
      match score | -a | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | [0,10,100,1000]
  • 27. SOAP2 ● Also based on BWT index structures ● Order of magnitude improvement over previous version ● Similar parameters to BWA ● Original release in 2008, latest release in 2011
  • 28. SOAP2 – Parameter Sensitivities
      Name | Flag | CPU | Memory | Reads | MAPQ | Parameter Values | Invalid Values
      min insert size | -m | 268.29 | 1,505.68 | 0.00 | 4.49 | [0,1,10,100,400,1000,10000,100000] | []
      continuous gap size allowed | -g | 50.62 | 284.38 | 36,521.81 | 0.07 | [0,1,10,100,1000] | []
      min alignment length | -s | 60.41 | 338.90 | 33,200.06 | 0.07 | [10,100,255,1000] | []
      max insert size | -x | 186.14 | 1,044.54 | 0.00 | 2.70 | [0,1,10,100,600,1000,10000,100000] | []
      disallow gap within e-bp | -e | 24.60 | 137.84 | 0.00 | 0.00 | [0,1,5,10,100,1000] | []
      max mismatch per read | -v | 20.21 | 113.54 | 31.70 | 0.00 | [0,1,5,10,100,1000] | []
      seed length | -l | 11.50 | 64.80 | 412.81 | 0.00 | [100,256,1000] | []
      number Ns to allow | -n | 7.33 | 41.21 | 115.89 | 0.00004 | [0,1,5,10,100,1000] | []
  • 32. Bowtie 2 ● Works on reads from 50-1000bp ● Compresses BWT index to limit memory footprint ● Similar parameters to BWA and SOAP2, with a few additions ● Released 2012, latest release in 2013
  • 33. Bowtie2 – Parameter Sensitivities
      Name | Flag | CPU | Memory | Reads | MAPQ | Parameter Values | Invalid Values
      length of seed substring | -L | 27,039.75 | 2,857.25 | 34,224.99 | 2.68 | [3,6,9,13,16,19,22,26,29,32] | []
      end of interval between seed substrings | -i2 | 950.37 | 3,500.88 | 15,031.23 | 1.21 | [0,1,1.25,2,4,8,16] | []
      min acceptable alignment score coefficient | -score-min2 | 501.10 | 718.53 | 801,965.66 | 13.57 | [-0.9,-0.6,-0.3,0,1] | []
      reference gap open penalty | -rfg1 | 181.85 | 2,152.54 | 2,889.80 | 0.11 | [0,1,3,5,6,10,32,100] | []
      reference gap extend penalty | -rfg2 | 462.52 | 1,624.29 | 5,373.45 | 0.15 | [1,3,5,6,10,32,100] | []
      max mismatch penalty | -mp1 | 91.02 | 1,071.79 | 59,024.02 | 5.69 | [2,3,5,6,10,32,100] | []
      stop gap extension after <D> failures | -D | 76.67 | 1,441.92 | 12,201.07 | 0.93 | [5,9,13,15,17,21,25] | []
      read gap open penalty | -rdg1 | 31.19 | 1,395.54 | 6,491.17 | 0.04 | [0,1,3,5,6,10,32,100] | []
      min mismatch penalty | -mp2 | 24.00 | 1,292.41 | 1,068.41 | 0.09 | [2,3,4,5] | []
      try <R> sets of seeds for repetitive seeds | -R | 47.47 | 1,185.47 | 290.98 | 0.00 | [1,2,3] | []
      penalty for Ns | -np | 27.86 | 1,161.23 | 3,349.96 | 0.01 | [0,1,2,3,5,10,32,100] | []
      read gap extension penalty | -rdg2 | 30.43 | 1,072.39 | 6,812.47 | 0.04 | [1,3,5,6,10,32,100] | []
      max mismatches in seed | -N | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00
  • 38. Novoalign ● Smallest number of parameters ● Requires paid license for commercial use ● Does global alignment with full Needleman-Wunsch algorithm ● Some nice 'bonus' features: – multithreaded support – Base quality calibration – Adapter stripping ● Originally released 2008, newest version 3 released last month
  • 39. Novoalign – Parameter Sensitivities
      Name | Flag | CPU | Memory | Reads | MAPQ | Parameter Values | Invalid Values
      gap open penalty | 'g' | 201,109.30 | 3,731,245.16 | 4,535.13 | 3.09 | [0,10,20,30,40,50,60,70,80,90,99] | []
      threshold for highest alignment score | t | 948.24 | 7,490.47 | 1,095,323.89 | 1.11 | [-1,0,10,20,40,50,60,70,80,90,100] | []
      minimum good qual bases for read | l | 565.53 | 4,489.02 | 244,324.72 | 0.07 | [15,20,25,35,45,55,65,75,85,95,100] | []
      structural variation penalty for chimeric fragments | v | 423.15 | 5,630.93 | 4,542.41 | 0.18 | [0,10,20,30,40,50,60,70,80,90,100,110,120,130,140] | []
      gap extend penalty | x | 251.55 | 1,985.14 | 6,112.46 | 0.08 | [6,10,20,30,40,50,60,70,80,90,99] | []
      threshold for homopolymer filter | 'h' | 269.90 | 2,297.69 | 136.19 | 0.00 | [0,10,20,30,40] | []
  • 44. SeqAlto ● More parameters than other aligners ● Uses standard hashing index structures with larger seeds and adaptive stopping ● Designed for reads about 100bp or more ● Claims 2-4x faster than BWA but our results do not agree ● Initially released in 2012
  • 45. SeqAlto – Parameter Table
      Name | Flag | CPU | Memory | Reads | MAPQ | Parameter Values | Invalid Values
      k-mer maximum occurrence threshold (Needleman-Wunsch) | max_occ_nw | 78.04 | 545.72 | 65,393.48 | 0.73 | [2,10,100,1000,100000] | []
      minimum gap open rate | o | 5,074.01 | 35,925.63 | 281.74 | 0.01 | [0.005,0.05,0.5,0.99,1] | []
      maximum template size | i | 2,707.76 | 18,995.95 | 676.10 | 10.24 | [250,550,5500,55000] | []
      k-mer maximum occurrence threshold | max_occ | 174.30 | 1,222.60 | 58,578.73 | 0.66 | [2,10,100,1000,10000,100000] | []
      Phred score pairing prior | d | 103.76 | 727.66 | 928.91 | 1.96 | [0,8,80,100,800,8000] | []
      maximum gap extension length | e | 892.84 | 6,259.12 | 4,723.80 | 0.05 | [0,5,25,50,75,100,1000] | []
      Needleman-Wunsch mismatch penalty | nw_sub | 308.02 | 2,156.92 | 1,242.50 | 1.60 | [0,10,15,100,1000] | []
      Needleman-Wunsch match score | nw_mat | 30.85 | 215.38 | 4,575.00 | 1.23 | [0,2,5,10,100,1000] | []
      Needleman-Wunsch gap extension penalty | nw_ext | 16.84 | 116.35 | 5,343.32 | 0.12 | [2,10,100,1000] | []
      Smith-Waterman match score | sw_mat | 32.53 | 225.79 | 2,322.35 | 0.05 | [0,2,5,10,100,1000] | []
      Needleman-Wunsch gap open penalty | nw_gap | 33.41 | 232.77 | 1,922.74 | 0.01 | [0,10,40,100,1000] | []
      additional k-mer look-ahead for high mismatch (Needleman-Wunsch) | kmer_pen_nw | 38.24 | 267.76 | 1,899.81 | 0.09 | [0,1,10] | []
      additional k-mer look-ahead for high mismatch | kmer_pen | 34.28 | 240.52 | 1,647.21 | 0.08 | [0,1,10,100] | []
      k-mer look ahead | look_ahead | 85.48 | 600.54 | 42.25 | 0.04 | [0,2,10,100] | []
      minimum unclipped read percentage | c | 5.46 | 38.80 | 320.08 | 0.00 | [0,5,25,50,75,100] | []
      Smith-Waterman gap open penalty | sw_gap | 6.87 | 48.40 | 313.79 | 0.01 | [0,10,40,100,1000] | []
      Smith-Waterman mismatch penalty | sw_sub | 14.78 | 101.98 | 98.20 | 0.02 | [0,10,15,100,1000] | []
      Smith-Waterman gap extension penalty | sw_ext | 9.57 | 65.70 | 4.49 | 0.00 | [0,2,10,100,1000] | []
      k-mer look ahead (Needleman-Wunsch) | look_ahead_nw | 3.74 | 26.81 | 0.00 | 0.00 | [0,2,10,100] | []
      average template size | m | 3.35 | 24.88 | 0.00 | 0.00 | [0,100,200,300,550] | []
  • 47. SeqAlto – i, max_occ
  • 50. RazerS ● Some irregularities: – Majority of experiments report only 1.7 million reads mapped, but some experiments report over 13 million reads mapped – MAPQ reported as 255 for all experiments – Mean read length for most experiments is ~4.5 ● Uses q-gram counting for approximate search ● Latest version (RazerS 3) supports parallelization ● Initially released 2012
  • 51. RazerS – Parameter Sensitivities
      Name | Flag | CPU | Memory | Reads | MAPQ | Parameter Values | Invalid Values
      tolerated deviation from library size | le | 3,808.44 | 28,396.98 | 6,406,142.49 | 0.00 | [0,25,50,100,1000,10000] | []
      threshold of common kmers between read and reference | t | 40,775.85 | 297,513.56 | 886,521.11 | 0.00 | [-1,1,10,100] | []
      percent identity threshold | i | 15,076.96 | 79,300.67 | 695,578.11 | 0.00 | [92,100] | [50,60]
      mean library length | ll | 1,492.70 | 16,932.42 | 1,578,820.93 | 0.00 | [100,120,220,320,2200] | []
      no gaps flag | ng | 2,623.38 | 21,989.04 | 669,685.28 | 0.00 | [0,1] | []
      repeat length | rl | 2,826.93 | 18,235.32 | 6.00 | 0.00 | [10,100,1000,10000] | []
      distance range for best match errors | dr | 1,371.85 | 7,688.67 | 230,520.49 | 0.00 | [-1,0,1,10,100] | []
      read kmers overabundance cutoff | oc | 1,205.15 | 5,543.36 | 0.00 | 0.00 | [0,1] | []
      overlap length | ol | 1,107.97 | 5,205.90 | 0.00 | 0.00 | [-1,0,1,10,100] | []
      mutation rate | mr | 734.48 | 3,577.62 | 0.00 | 0.00 | [0,1,5,10] | []
      percent recognition rate | rr | 354.60 | 1,607.40 | 0.00 | 0.00 | [82,85,90,99] | []
      taboo length | tl | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | [10,100]
  • 56. CPU Time Histograms BWA-mem BWA-sw SOAP2 Bowtie2 Novoalign SeqAlto
  • 57. Mean MAPQ Histograms BWA-mem BWA-sw SOAP2 Bowtie2 Novoalign SeqAlto
  • 58. Reads Mapped Histograms BWA-mem BWA-sw SOAP2 Bowtie2 Novoalign SeqAlto
  • 59. Some Conclusions ● BWA-mem and SOAP2 are fastest, and execute in minutes – But, BWA accuracy is not great – SOAP2 accuracy is even worse ● Novoalign is the most accurate but requires more time and memory, and aligns fewer reads ● Bowtie2 is the most memory efficient ● Novoalign and SeqAlto appear to be the most stable aligners ● SeqAlto is decent all around, not the best, not the worst ● RazerS has some basic issues ● In many cases, the best performance characteristics can be achieved without sacrificing performance in other areas
  • 60. Next Steps ● Generate parameter combination jobs ● Submit jobs to Beocat for execution – Beocat has been under a lot of maintenance lately, is that more or less finished? ● Consolidate results for next round of analysis ● Interpret results and start working on manuscript
  • 61. Acknowledgements ● Faculty – Brooke Fridley – Jeremy Chen – Sue Brown ● Students, Staff, and Post-docs – Byunggil Yoo – Jennifer Shelton – Rama Raghavan – Greg Matuszek