This document summarizes phase 1 of a project analyzing how parameter selection affects the performance of sequence alignment algorithms. In phase 1, the authors ran a variety of sequence alignment tools across broad ranges of individual parameter values to analyze sensitivity. They identified sensitive parameters for each tool and collected results on computation time, memory usage, number of reads mapped, and mapping quality. The document also outlines plans for phase 2, which analyzes parameter combinations, and phase 3, which compares tools.
Param selection phase1summary_v2
1. Systematic Analysis of Parameter Selection
for Sequence Alignment Algorithms
Project Recap and Phase 1 Summary
Aaron Smalter Hall
Molecular Graphics and Modeling Laboratory
University of Kansas
June 26, 2013
2. Motivation
● Genomics has become heavily dependent on
the use of sequence alignment tools
● Performance of sequence alignment is directly
dependent on parameters
● To date there is no systematic analysis of
sequence alignment parameters and their
effects on alignment performance
3. Challenges
● Sequence alignment is computationally
intensive
● Sequence alignment is often controlled by
many different parameters
● Often not tractable to perform alignment with
multiple parameter combinations
● Number of reads in a data set is growing –
partially offset by better hardware
4. Approach
● Systematic analysis of effects of parameter
perturbation on sequence alignment behavior
– Analyze performance sensitivity of individual
parameters (phase 1)
– Analyze performance sensitivity of parameter
combinations (phase 2)
– Compare performance characteristics across
sequence alignment tools (phase 3)
5. Experimental Design (Phase 1)
● Identify a broad set of interesting alignment tools
● Identify a broad set of interesting parameters for
each tool
● Identify interesting data sets to test
tools/parameters
● Execute tools on the same data sets, while
changing parameters individually over a wide
range
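The one-parameter-at-a-time design above can be sketched as follows. This is a minimal illustration, not the project's actual scripts: the function name `one_at_a_time` is hypothetical, the sweep values for `-T` and `-z` are copied from the BWA-sw table later in this deck, and the default values are placeholders rather than verified tool defaults.

```python
from typing import Dict, List, Sequence


def one_at_a_time(defaults: Dict[str, float],
                  sweeps: Dict[str, Sequence[float]]) -> List[Dict[str, float]]:
    """Build one parameter setting per (parameter, value) pair.

    Exactly one parameter is changed in each setting; every other
    parameter stays at its default.
    """
    runs = []
    for name, values in sweeps.items():
        for value in values:
            setting = dict(defaults)  # copy so the defaults are never mutated
            setting[name] = value
            runs.append(setting)
    return runs


# Placeholder defaults (assumed for illustration only).
defaults = {"-T": 37, "-z": 1}
# Sweep values taken from the BWA-sw table in this deck.
sweeps = {"-T": [0, 1, 10, 37, 100], "-z": [1, 10, 100]}

runs = one_at_a_time(defaults, sweeps)
print(len(runs))  # 8 settings: 5 for -T plus 3 for -z
```

The number of runs grows with the sum of the sweep lengths, which is what keeps phase 1 tractable compared with testing combinations.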
6. Experimental Design (Phase 2)
● Identify alignment parameters for each tool that are individually
sensitive to changes
● Define sensitivity w.r.t.:
– Computation time and memory required
– Read mapping rate
– Read mapping quality
● Identify functional ranges for each individual parameter
● Execute alignment tools on combinations of parameters, while
perturbing parameters across functional range
● Identify regions of increased sensitivity and execute alignment
tools with finer grained parameter value intervals
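One way to pick the finer-grained intervals is to subdivide only those coarse intervals where a performance metric changes sharply. The sketch below assumes a simple rule (absolute change between neighboring values above a threshold); the function name `refine_grid`, the threshold, and the sample numbers are all made up for illustration and are not taken from the experiments.

```python
from typing import List, Sequence


def refine_grid(values: Sequence[float], metric: Sequence[float],
                threshold: float, points: int = 3) -> List[float]:
    """Insert `points` evenly spaced values into each sensitive interval.

    An interval between two neighboring parameter values counts as
    sensitive when the metric changes by more than `threshold` across it.
    """
    refined = [values[0]]
    for i in range(1, len(values)):
        low, high = values[i - 1], values[i]
        if abs(metric[i] - metric[i - 1]) > threshold:
            step = (high - low) / (points + 1)
            refined.extend(low + step * k for k in range(1, points + 1))
        refined.append(high)
    return refined


coarse = [0, 10, 100]            # coarse parameter values (illustrative)
mapped = [1000.0, 990.0, 200.0]  # made-up reads-mapped counts per value

print(refine_grid(coarse, mapped, threshold=100.0))
# [0, 10, 32.5, 55.0, 77.5, 100]
```

Only the interval from 10 to 100 is subdivided here, because the metric barely moves between 0 and 10.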
7. Experimental Design (Phase 3)
● Identify relationships between parameters
● Identify parameter space regions of best
performance
● Compare performance and sensitivities across
sequence alignment tools
8. Current Status
● We are at the end of phase one, ready to move
into phase two
● Completed:
– Select alignment tools
– Select data sets
– Select parameters of interest
– Run experiments across broad parameter ranges
– Collect performance and sensitivity data
● Now: visualize and assess data for each
alignment tool
9. Phase 2 Requirements
● Identify sensitive parameters
● Identify functional range of parameters
● Write software scripts to automatically
generate parameter combination jobs to run
on cluster (there will be many, many jobs)
● Execute jobs and collect results
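A job generator for parameter combinations could look roughly like the sketch below. It enumerates the Cartesian product of the functional ranges and emits one command line per combination. The executable name `aligner` and the function `combination_jobs` are hypothetical, the flag values are illustrative, and submitting the commands to a cluster scheduler is left out.

```python
import itertools
from typing import Dict, Iterator, Sequence


def combination_jobs(command: str,
                     grid: Dict[str, Sequence]) -> Iterator[str]:
    """Yield one command line per combination of parameter values."""
    flags = list(grid)
    # itertools.product walks every combination of the value lists.
    for combo in itertools.product(*(grid[flag] for flag in flags)):
        options = " ".join(f"{flag} {value}" for flag, value in zip(flags, combo))
        yield f"{command} {options}"


# Illustrative functional ranges for two flags (assumed values).
grid = {"-T": [10, 37], "-z": [1, 10, 100]}
jobs = list(combination_jobs("aligner", grid))

print(len(jobs))  # 6 jobs: 2 values of -T times 3 values of -z
print(jobs[0])    # aligner -T 10 -z 1
```

The job count is the product of the range sizes, which is why restricting each parameter to its functional range first matters.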
13. Collected Results
● For each alignment tool:
– Table of parameters ranked (roughly) by magnitude of effect on
performance
● according to standard deviation of performance characteristics
– Figures of most sensitive parameters showing performance
results over entire parameter range tested
– Scatter plots of every experimental result showing tradeoffs:
● CPU time vs. reads mapped
● CPU time vs. mean MAPQ
● Reads mapped vs. mean MAPQ
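The ranking by standard deviation can be sketched as below. The deck does not say whether the sample or the population standard deviation was used; this sketch assumes the sample form (`statistics.stdev`). The function name `rank_by_std` and the CPU times are made up for illustration.

```python
import statistics
from typing import Dict, List, Tuple


def rank_by_std(results: Dict[str, List[float]]) -> List[Tuple[str, float]]:
    """Rank parameters by the standard deviation of one metric.

    `results` maps a parameter to the metric values observed while only
    that parameter was varied. Each list needs at least two values.
    """
    ranked = [(name, statistics.stdev(values)) for name, values in results.items()]
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Made-up CPU times (seconds) observed while sweeping each parameter.
cpu_seconds = {
    "-T": [100.0, 110.0, 120.0],
    "-z": [100.0, 400.0, 900.0],
    "-a": [100.0, 100.0, 100.0],
}

ranking = rank_by_std(cpu_seconds)
print([name for name, _ in ranking])  # ['-z', '-T', '-a']
```

A parameter whose metric never changes gets a standard deviation of zero and lands at the bottom of the table.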
15. BWA-mem
● A recent addition to the BWA package
– Designed for short reads up to 100bp
● Based on Burrows-Wheeler Transform index
structures
● Some parameter values caused BWA to report more
mapped reads than should be present
● Fairly typical set of parameters
● Released 2012
21. BWA-sw
● Doesn't work on paired ends
– Treat each end as an individual read
– The reported number of reads mapped is unreliable
because both ends carry identical read IDs
● Works on reads 70bp-1Mbp
● Similar features to BWA-mem
● Similar parameters to BWA-mem
● Released 2010
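One possible workaround for the identical read IDs, not something the deck says was done, is to tag each end's IDs before alignment so that the two ends stay distinguishable when counting mapped reads. The sketch assumes strict four-line FASTQ records; the function name `tag_read_ids` and the sample record are hypothetical.

```python
from typing import Iterable, Iterator


def tag_read_ids(fastq_lines: Iterable[str], suffix: str) -> Iterator[str]:
    """Append `suffix` to the read ID on each FASTQ header line.

    Assumes strict four-line records, so every fourth line starting at
    the first one is a header. Any comment after the first space is kept.
    """
    for index, line in enumerate(fastq_lines):
        line = line.rstrip("\n")
        if index % 4 == 0:
            read_id, _, comment = line.partition(" ")
            line = read_id + suffix + (" " + comment if comment else "")
        yield line


# Made-up record standing in for the first end of a pair.
end1 = ["@read1 sample", "ACGT", "+", "IIII"]
tagged = list(tag_read_ids(end1, "/1"))
print(tagged[0])  # @read1/1 sample
```

Running the same function with `"/2"` on the second file would give each end a unique ID.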
22. BWA-sw – Parameter Sensitivities
Name | Flag | CPU | Memory | Reads | MAPQ | Parameter Values | Invalid Values
min score threshold | -T | 410.90 | 2,121.72 | 727,340.56 | 27.76 | [0,1,10,37,100] | 1,000.00
z-best heuristics | -z | 27,174.89 | 144,452.72 | 10,993.91 | 21.56 | [1,10,100] | 0.00
threshold adjustment coef | -c | 114.96 | 593.24 | 27,795.00 | 22.45 | [0,1,5.5,10] | [100,1000]
mismatch penalty | -b | 122.87 | 634.87 | 7,638.45 | 16.90 | [0,1,3,10,100,1000] | []
gap open penalty | -q | 311.60 | 1,609.43 | 18,425.36 | 16.73 | [0,1,5,10,100,1000] | []
max SA interval for seed | -s | 11,048.94 | 57,081.69 | 24.42 | 1.13 | [1,3,10,100,1000] | []
min number seeds | -N | 265.15 | 1,368.47 | 345.99 | 4.47 | [0,1,5,10,100,1000] | []
gap extension penalty | -r | 271.25 | 1,400.37 | 2,992.56 | 0.62 | [1,2,10,100,1000] | []
band width | -w | 111.93 | 577.31 | 2,800.64 | 0.01 | [1,10,33,100,1000] | []
match score | -a | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | [0,10,100,1000]
27. SOAP2
● Also based on BWT index structures
● Order of magnitude improvement over
previous version
● Similar parameters to BWA
● Original release in 2008, latest release in 2011
28. SOAP2 – Parameter Sensitivities
Name | Flag | CPU | Memory | Reads | MAPQ | Parameter Values | Invalid Values
min insert size | -m | 268.29 | 1,505.68 | 0.00 | 4.49 | [0,1,10,100,400,1000,10000,100000] | []
continuous gap size allowed | -g | 50.62 | 284.38 | 36,521.81 | 0.07 | [0,1,10,100,1000] | []
min alignment length | -s | 60.41 | 338.90 | 33,200.06 | 0.07 | [10,100,255,1000] | []
max insert size | -x | 186.14 | 1,044.54 | 0.00 | 2.70 | [0,1,10,100,600,1000,10000,100000] | []
disallow gap within e-bp | -e | 24.60 | 137.84 | 0.00 | 0.00 | [0,1,5,10,100,1000] | []
max mismatch per read | -v | 20.21 | 113.54 | 31.70 | 0.00 | [0,1,5,10,100,1000] | []
seed length | -l | 11.50 | 64.80 | 412.81 | 0.00 | [100,256,1000] | []
number Ns to allow | -n | 7.33 | 41.21 | 115.89 | 0.00004 | [0,1,5,10,100,1000] | []
32. Bowtie 2
● Works on reads from 50-1000bp
● Compresses BWT index to limit memory
footprint
● Similar parameters to BWA and SOAP2, with a
few additions
● Released 2012, latest release in 2013
33. Bowtie2 – Parameter Sensitivities
Name | Flag | CPU | Memory | Reads | MAPQ | Parameter Values | Invalid Values
length of seed substring | -L | 27,039.75 | 2,857.25 | 34,224.99 | 2.68 | [3,6,9,13,16,19,22,26,29,32] | []
end of interval between seed substrings | -i2 | 950.37 | 3,500.88 | 15,031.23 | 1.21 | [0,1,1.25,2,4,8,16] | []
min acceptable alignment score coefficient | -score-min2 | 501.10 | 718.53 | 801,965.66 | 13.57 | [-0.9,-0.6,-0.3,0,1] | []
reference gap open penalty | -rfg1 | 181.85 | 2,152.54 | 2,889.80 | 0.11 | [0,1,3,5,6,10,32,100] | []
reference gap extend penalty | -rfg2 | 462.52 | 1,624.29 | 5,373.45 | 0.15 | [1,3,5,6,10,32,100] | []
max mismatch penalty | -mp1 | 91.02 | 1,071.79 | 59,024.02 | 5.69 | [2,3,5,6,10,32,100] | []
stop gap extension after <D> failures | -D | 76.67 | 1,441.92 | 12,201.07 | 0.93 | [5,9,13,15,17,21,25] | []
read gap open penalty | -rdg1 | 31.19 | 1,395.54 | 6,491.17 | 0.04 | [0,1,3,5,6,10,32,100] | []
min mismatch penalty | -mp2 | 24.00 | 1,292.41 | 1,068.41 | 0.09 | [2,3,4,5] | []
try <R> sets of seeds for repetitive seeds | -R | 47.47 | 1,185.47 | 290.98 | 0.00 | [1,2,3] | []
penalty for Ns | -np | 27.86 | 1,161.23 | 3,349.96 | 0.01 | [0,1,2,3,5,10,32,100] | []
read gap extension penalty | -rdg2 | 30.43 | 1,072.39 | 6,812.47 | 0.04 | [1,3,5,6,10,32,100] | []
max mismatches in seed | -N | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00
38. Novoalign
● Smallest number of parameters
● Requires paid license for commercial use
● Does global alignment with full Needleman-Wunsch
algorithm
● Some nice 'bonus' features:
– multithreaded support
– Base quality calibration
– Adapter stripping
● Originally released 2008, newest version 3 released last
month
39. Novoalign – Parameter Sensitivities
Name | Flag | CPU | Memory | Reads | MAPQ | Parameter Values | Invalid Values
gap open penalty | g | 201,109.30 | 3,731,245.16 | 4,535.13 | 3.09 | [0,10,20,30,40,50,60,70,80,90,99] | []
threshold for highest alignment score | t | 948.24 | 7,490.47 | 1,095,323.89 | 1.11 | [-1,0,10,20,40,50,60,70,80,90,100] | []
minimum good qual bases for read | l | 565.53 | 4,489.02 | 244,324.72 | 0.07 | [15,20,25,35,45,55,65,75,85,95,100] | []
structural variation penalty for chimeric fragments | v | 423.15 | 5,630.93 | 4,542.41 | 0.18 | [0,10,20,30,40,50,60,70,80,90,100,110,120,130,140] | []
gap extend penalty | x | 251.55 | 1,985.14 | 6,112.46 | 0.08 | [6,10,20,30,40,50,60,70,80,90,99] | []
threshold for homopolymer filter | h | 269.90 | 2,297.69 | 136.19 | 0.00 | [0,10,20,30,40] | []
44. SeqAlto
● More parameters than other aligners
● Uses standard hashing index structures with
larger seeds and adaptive stopping
● Designed for reads about 100bp or more
● Claims 2-4x faster than BWA but our results
do not agree
● Initially released in 2012
45. SeqAlto – Parameter Table
Name | Flag | CPU | Memory | Reads | MAPQ | Parameter Values | Invalid Values
k-mer maximum occurrence threshold (Needleman-Wunsch) | max_occ_nw | 78.04 | 545.72 | 65,393.48 | 0.73 | [2,10,100,1000,100000] | []
minimum gap open rate | o | 5,074.01 | 35,925.63 | 281.74 | 0.01 | [0.005,0.05,0.5,0.99,1] | []
maximum template size | i | 2,707.76 | 18,995.95 | 676.10 | 10.24 | [250,550,5500,55000] | []
k-mer maximum occurrence threshold | max_occ | 174.30 | 1,222.60 | 58,578.73 | 0.66 | [2,10,100,1000,10000,100000] | []
Phred score pairing prior | d | 103.76 | 727.66 | 928.91 | 1.96 | [0,8,80,100,800,8000] | []
maximum gap extension length | e | 892.84 | 6,259.12 | 4,723.80 | 0.05 | [0,5,25,50,75,100,1000] | []
Needleman-Wunsch mismatch penalty | nw_sub | 308.02 | 2,156.92 | 1,242.50 | 1.60 | [0,10,15,100,1000] | []
Needleman-Wunsch match score | nw_mat | 30.85 | 215.38 | 4,575.00 | 1.23 | [0,2,5,10,100,1000] | []
Needleman-Wunsch gap extension penalty | nw_ext | 16.84 | 116.35 | 5,343.32 | 0.12 | [2,10,100,1000] | []
Smith-Waterman match score | sw_mat | 32.53 | 225.79 | 2,322.35 | 0.05 | [0,2,5,10,100,1000] | []
Needleman-Wunsch gap open penalty | nw_gap | 33.41 | 232.77 | 1,922.74 | 0.01 | [0,10,40,100,1000] | []
additional k-mer look-ahead for high mismatch (Needleman-Wunsch) | kmer_pen_nw | 38.24 | 267.76 | 1,899.81 | 0.09 | [0,1,10] | []
additional k-mer look-ahead for high mismatch | kmer_pen | 34.28 | 240.52 | 1,647.21 | 0.08 | [0,1,10,100] | []
k-mer look ahead | look_ahead | 85.48 | 600.54 | 42.25 | 0.04 | [0,2,10,100] | []
minimum unclipped read percentage | c | 5.46 | 38.80 | 320.08 | 0.00 | [0,5,25,50,75,100] | []
Smith-Waterman gap open penalty | sw_gap | 6.87 | 48.40 | 313.79 | 0.01 | [0,10,40,100,1000] | []
Smith-Waterman mismatch penalty | sw_sub | 14.78 | 101.98 | 98.20 | 0.02 | [0,10,15,100,1000] | []
Smith-Waterman gap extension penalty | sw_ext | 9.57 | 65.70 | 4.49 | 0.00 | [0,2,10,100,1000] | []
k-mer look ahead (Needleman-Wunsch) | look_ahead_nw | 3.74 | 26.81 | 0.00 | 0.00 | [0,2,10,100] | []
average template size | m | 3.35 | 24.88 | 0.00 | 0.00 | [0,100,200,300,550] | []
50. RazerS
● Some irregularities:
– Majority of experiments report only 1.7 million reads
mapped, but some experiments report over 13 million
reads mapped
– MAPQ reported as 255 for all experiments
– Mean read length for most experiments is ~4.5
● Uses q-gram counting for approximate search
● Latest version, RazerS 3, supports parallelization
● Initially released 2012
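In the SAM format, a MAPQ of 255 means the mapping quality is not available, so a constant 255 would distort any mean MAPQ comparison against the other tools. The sketch below shows one way to compute mean MAPQ while skipping unmapped reads and the 255 sentinel. The function name `mean_mapq` and the sample records are made up; it assumes tab-separated SAM text lines and is not the analysis code used for these slides.

```python
from typing import Iterable, Optional


def mean_mapq(sam_lines: Iterable[str]) -> Optional[float]:
    """Mean MAPQ over mapped records with a usable mapping quality.

    Skips header lines, unmapped reads (FLAG bit 0x4), and records whose
    MAPQ is 255, the SAM value for "mapping quality not available".
    Returns None when no usable record remains.
    """
    total = 0
    count = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        flag, mapq = int(fields[1]), int(fields[4])
        if flag & 0x4 or mapq == 255:
            continue
        total += mapq
        count += 1
    return total / count if count else None


# Made-up SAM records for illustration.
sam = [
    "@HD\tVN:1.4",
    "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t0\tchr1\t200\t255\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r4\t16\tchr1\t300\t20\t4M\t*\t0\t0\tACGT\tIIII",
]
print(mean_mapq(sam))  # 40.0
```

For a tool that reports 255 everywhere, this returns `None`, which makes the missing mapping quality explicit instead of plotting it as a score.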
51. RazerS – Parameter Sensitivities
Name | Flag | CPU | Memory | Reads | MAPQ | Parameter Values | Invalid Values
tolerated deviation from library size | le | 3,808.44 | 28,396.98 | 6,406,142.49 | 0.00 | [0,25,50,100,1000,10000] | []
threshold of common kmers between read and reference | t | 40,775.85 | 297,513.56 | 886,521.11 | 0.00 | [-1,1,10,100] | []
percent identity threshold | i | 15,076.96 | 79,300.67 | 695,578.11 | 0.00 | [92,100] | [50,60]
mean library length | ll | 1,492.70 | 16,932.42 | 1,578,820.93 | 0.00 | [100,120,220,320,2200] | []
no gaps flag | ng | 2,623.38 | 21,989.04 | 669,685.28 | 0.00 | [0,1] | []
repeat length | rl | 2,826.93 | 18,235.32 | 6.00 | 0.00 | [10,100,1000,10000] | []
distance range for best match errors | dr | 1,371.85 | 7,688.67 | 230,520.49 | 0.00 | [-1,0,1,10,100] | []
read kmers overabundance cutoff | oc | 1,205.15 | 5,543.36 | 0.00 | 0.00 | [0,1] | []
overlap length | ol | 1,107.97 | 5,205.90 | 0.00 | 0.00 | [-1,0,1,10,100] | []
mutation rate | mr | 734.48 | 3,577.62 | 0.00 | 0.00 | [0,1,5,10] | []
percent recognition rate | rr | 354.60 | 1,607.40 | 0.00 | 0.00 | [82,85,90,99] | []
taboo length | tl | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | [10,100]
59. Some Conclusions
● BWA-mem and SOAP2 are fastest, and execute in minutes
– But, BWA accuracy is not great
– SOAP2 accuracy is even worse
● Novoalign is the most accurate but requires more time and memory,
and aligns fewer reads
● Bowtie2 is the most memory efficient
● Novoalign and SeqAlto appear to be the most stable aligners
● SeqAlto is decent all around, not the best, not the worst
● RazerS has some basic issues
● In many cases, the best performance characteristics can be
achieved without sacrificing performance in other areas
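One way to check the last point is to look for runs that no other run beats on every axis, that is, the Pareto front over CPU time, reads mapped, and mean MAPQ. The sketch below is an illustration under assumptions: the `Run` record, the function names, and all numbers are made up, and the deck does not state that a Pareto analysis was performed.

```python
from typing import List, NamedTuple


class Run(NamedTuple):
    label: str
    cpu_seconds: float   # lower is better
    reads_mapped: int    # higher is better
    mean_mapq: float     # higher is better


def dominates(a: Run, b: Run) -> bool:
    """True if `a` is no worse than `b` everywhere and better somewhere."""
    no_worse = (a.cpu_seconds <= b.cpu_seconds
                and a.reads_mapped >= b.reads_mapped
                and a.mean_mapq >= b.mean_mapq)
    better = (a.cpu_seconds < b.cpu_seconds
              or a.reads_mapped > b.reads_mapped
              or a.mean_mapq > b.mean_mapq)
    return no_worse and better


def pareto_front(runs: List[Run]) -> List[Run]:
    """Keep the runs that no other run dominates."""
    return [r for r in runs if not any(dominates(other, r) for other in runs)]


# Made-up results for three parameter settings.
runs = [
    Run("A", 100.0, 900, 30.0),
    Run("B", 200.0, 950, 35.0),
    Run("C", 250.0, 940, 33.0),
]
print([r.label for r in pareto_front(runs)])  # ['A', 'B']
```

Setting C drops out because B is faster, maps more reads, and has higher mean MAPQ; A and B remain because each wins on a different axis.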
60. Next Steps
● Generate parameter combination jobs
● Submit jobs to Beocat for execution
– Beocat has been under a lot of maintenance lately,
is that more or less finished?
● Consolidate results for next round of analysis
● Interpret results and start working on
manuscript
61. Acknowledgements
● Faculty
– Brooke Fridley
– Jeremy Chen
– Sue Brown
● Students, Staff, and Post-docs
– Byunggil Yoo
– Jennifer Shelton
– Rama Raghavan
– Greg Matuszek