Taken from here:
https://www.dropbox.com/sh/55nfktmn7lgai98/48sIHw8bzJ/kvg_20_line_lifesavers_mad_v2.pptx.pdf
and uploaded to SlideShare for convenience. Credits to:
Kiran V Garimella (kiran.garimella@gmail.com), Mark A DePristo
GENOME SEQUENCING AND ANALYSIS, BROAD INSTITUTE
Research Informatics Group
ELI LILLY AND COMPANY
20-Line Lifesavers: Coding simple solutions in the GATK
1. Eli Lilly / September 14-15, 2011
2. Genome Analysis Toolkit (GATK)
ˈjē-ˌnōm ə-ˈna-lə-səs ˈtül-ˌkit
Noun
1. A suite of tools for working with medical resequencing projects
(e.g. 1,000 Genomes, The Cancer Genome Atlas)
2. A structured software library that makes writing efficient
analysis tools using next-generation sequencing data easy
3. Genome Analysis Toolkit (GATK)
Most users think of the toolkit merely as a set of tools that implement our ideas…
4. Genome Analysis Toolkit (GATK)
… but the GATK’s real power is in how easy it makes it to instantiate your ideas.
This is what we will discuss today.
5. Some tasks are made difficult by the wrong tools
“These BAMs have numeric, non-unique read IDs that collide when you merge them! How long will it take to fix?”
“All day!”
The fix: convert to SAM format, read the header, parse the read group info into a hash table keyed on the ID, loop over the reads, look up the read group ID in the hash, find the platform unit tag, prepend it to the read name, convert back to BAM, reindex BAM.
Lines of code: 500.
With all apologies to Randall Munroe and XKCD
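The lookup-and-prepend core of that manual fix can be sketched in plain Java. This is only an illustration under stated assumptions: the Read class, the renameAll method, and the example read group IDs and platform units are hypothetical stand-ins, not GATK or samtools types, and the SAM/BAM conversion and indexing steps are left out entirely.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java stand-in for the manual approach; none of these types come from the GATK or samtools.
class ManualReadRenamer {
    /** Hypothetical minimal read: a name plus the ID of the read group it belongs to. */
    static final class Read {
        final String name;
        final String readGroupId;

        Read(String name, String readGroupId) {
            this.name = name;
            this.readGroupId = readGroupId;
        }
    }

    /** Prepends each read's platform unit (looked up by read group ID) to its name. */
    static List<String> renameAll(Map<String, String> platformUnitByReadGroupId, List<Read> reads) {
        List<String> renamed = new ArrayList<>();
        for (Read read : reads) {
            String platformUnit = platformUnitByReadGroupId.get(read.readGroupId);
            if (platformUnit == null) {
                throw new IllegalArgumentException("Unknown read group: " + read.readGroupId);
            }
            renamed.add(platformUnit + "." + read.name);
        }
        return renamed;
    }

    public static void main(String[] args) {
        // Made-up header information: read group ID -> platform unit.
        Map<String, String> platformUnits = new HashMap<>();
        platformUnits.put("rg1", "FLOWCELL1.1");
        platformUnits.put("rg2", "FLOWCELL2.3");

        // Two reads with the same numeric name that would collide after a merge.
        List<Read> reads = new ArrayList<>();
        reads.add(new Read("1", "rg1"));
        reads.add(new Read("1", "rg2"));

        for (String name : renameAll(platformUnits, reads)) {
            System.out.println(name);
        }
    }
}
```

Even this stripped-down version needs its own header bookkeeping, which is the part a framework can take off your hands.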
6. That same task, written in the GATK (20 lines of code)
package org.broadinstitute.sting.gatk.walkers.examples;

import net.sf.samtools.SAMFileWriter;
import net.sf.samtools.SAMRecord;
import org.broadinstitute.sting.commandline.Output;
import org.broadinstitute.sting.gatk.contexts.ReferenceContext;
import org.broadinstitute.sting.gatk.refdata.ReadMetaDataTracker;
import org.broadinstitute.sting.gatk.walkers.ReadWalker;

public class FixReadNames extends ReadWalker<Integer, Integer> {
    @Output
    SAMFileWriter out;

    @Override
    public Integer map(ReferenceContext ref, SAMRecord read, ReadMetaDataTracker metaDataTracker) {
        read.setReadName(read.getReadGroup().getPlatformUnit() + "." + read.getReadName());
        out.addAlignment(read);
        return null;
    }

    @Override
    public Integer reduceInit() { return null; }

    @Override
    public Integer reduce(Integer value, Integer sum) { return null; }
}
7. That same task, written in the GATK
(code that’s not filled in for you by the IDE – 5 lines)
Most of the code is boilerplate, and the IDE can fill it in for you. The amount
of code you have to manually write is actually very small.
8. Those tasks are simple when using the right tools…
“These BAMs have numeric, non-unique read IDs that collide when you merge them! How long will it take to fix?”
“Um, all day...”
The fix: write a GATK ReadWalker that modifies the read name and writes it out again. Spend rest of time looking at lolcats.
Lines of code: 5.
With all apologies to Randall Munroe and XKCD
9. …though whether you’ll tell people that is up to you.
Hehe, I can haz cheezburger INDEED.
With all apologies to Randall Munroe and XKCD
10. We’re going to write genuinely useful, deadline-defeating, lifesaving tools in < 20 lines of code
11. Now we’ll go through a bunch of programs and learn
to write new GATK tools by example
• We’ll set up the environment and look at five tutorial programs:
– HelloRead: A simple walker that prints read information from a BAM
– FixReadNames: Modify read names and emit results to a new BAM file
– HelloVariant: A simple walker that prints variant information from a VCF
– ComputeCoverageFromVCF: Computes a coverage histogram from a VCF
– FindExclusiveVariants: Create a new VCF of variants exclusive to a sample
• Finished and commented versions are in the codebase at:
– java/src/org/broadinstitute/sting/gatk/walkers/tutorial/
• How these tutorials work:
– The numbered icon enumerates the various steps in each tutorial.
– The code that you should write at each step is in the IntelliJ window.
– Text in boxes gives additional information on each step, emphasizes
some information, and may clarify the command or code that you should write.
13. See our wiki resources
• http://www.broadinstitute.org/gsa/wiki/index.php/Configuring_IntelliJ
• http://www.broadinstitute.org/gsa/wiki/index.php/Queue_with_IntelliJ_IDEA
14. Mechanics of a GATK “walker”
(a program that “walks” along a dataset in a prescribed way)
15. ReadWalker: “walks” over reads and allows a
computation to be performed on each one
ReadWalker: process one read at a time
[Figure: reads aligned beneath the reference TTTAAATCTGTGGGGTTAATCGGCGGGCTAA; the computation order (1), (2), (3), (4), (5) steps through the reads one at a time]
Example use cases:
1. Setting an extra metadata tag in a read
2. Searching for mouse contaminant reads and excluding them
3. Finding or realigning indels
Some example GATK programs: CycleQualityWalker,
TableRecalibrationWalker, CountReadsWalker, IndelRealigner, etc.
18. LocusWalker: “walks” over genomic positions and
allows a computation to be performed at each one
LocusWalker: process a single-base genomic position at a time
[Figure: reads aligned beneath the reference TTTAAATCTGTGGGGTTAATCGGCGGGCTAA; the computation order (1), (2), (3), (4), (5) … steps through the reference one base at a time]
Example use cases:
1. Variant calling
2. Depth of coverage calculations
3. Compute properties of regions (GC content, read error rates)
Note: reads are required for locus walkers. RefWalkers are a similar type
of walker that examine each genomic locus, but do not require reads.
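The difference between the two traversal styles can be sketched with a toy example in plain Java. The Read class and both methods below are hypothetical and exist only for illustration; the GATK engine supplies reads and loci through its own classes, and it does not scan every read at every position the way this naive loop does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy illustration of the two traversal styles; these are not GATK classes.
class TraversalSketch {
    /** Hypothetical aligned read covering reference positions start..end (1-based, inclusive). */
    static final class Read {
        final String name;
        final int start;
        final int end;

        Read(String name, int start, int end) {
            this.name = name;
            this.start = start;
            this.end = end;
        }
    }

    /** Read-style traversal: one computation per read (here, its aligned length). */
    static List<Integer> lengthPerRead(List<Read> reads) {
        List<Integer> lengths = new ArrayList<>();
        for (Read read : reads) {
            lengths.add(read.end - read.start + 1);
        }
        return lengths;
    }

    /** Locus-style traversal: one computation per position (here, depth of coverage). */
    static int[] depthPerLocus(List<Read> reads, int firstPos, int lastPos) {
        int[] depth = new int[lastPos - firstPos + 1];
        for (int pos = firstPos; pos <= lastPos; pos++) {
            for (Read read : reads) {
                if (read.start <= pos && pos <= read.end) {
                    depth[pos - firstPos]++;
                }
            }
        }
        return depth;
    }

    public static void main(String[] args) {
        List<Read> reads = Arrays.asList(new Read("a", 1, 4), new Read("b", 3, 6));
        System.out.println(lengthPerRead(reads));                        // one value per read
        System.out.println(Arrays.toString(depthPerLocus(reads, 1, 6))); // one value per position
    }
}
```

The same two reads yield two results in the read-style pass and six in the locus-style pass, which is the practical meaning of choosing one walker type over the other.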
21. RodWalker: “walks” over positions in a file and allows
a computation to be performed at each one
RodWalker: process a genomic position from a file (e.g. VCF) at a time
[Figure: variants (*) for SampleA, SampleB, and SampleC shown beneath the reference TTTAAATCTGTGGGGTTAATCGGCGGGCTAA; the computation order (1), (2), (3), (4) steps through the variant positions one at a time]
Example use cases:
1. Variant filtering
2. Computing metrics on variants
3. Refining variant calls by enforcing additional constraints
Some example GATK programs: VariantEval, PhaseByTransmission,
VariantAnnotator, VariantRecalibrator, SelectVariants, etc.
25. Example 1: Hello, Read!
Step 1: Right-click on “walkers”, select New->Package.
26. Example 1: Hello, Read!
Step 2: Type “examples” as the package name.
Step 3: Click “OK”.
27. Example 1: Hello, Read!
Step 4: Right-click on “examples” and select New->Java class. Enter the name “HelloRead”.
A file declaring the class and proper package name is created for you.
28. Example 1: Hello, Read!
Step 5: Add the following text to the class declaration:
extends ReadWalker<Integer, Integer> {
This will tell the GATK that you are creating a program that iterates over all of the reads in a BAM file, one at a time.
The “import” statement at the top will be added by the IDE.
29. Example 1: Hello, Read!
Step 6: IntelliJ can detect what methods you need to implement in order to get your program working.
Make sure your cursor is on the class declaration and type “Alt-Enter” to get the contextual action menu.
Select “Implement Methods”.
30. Example 1: Hello, Read!
Step 7: Select all of the methods (usually, they’ll already be selected, so you won’t need to do anything).
Step 8: Click “OK”.
31. Example 1: Hello, Read!
The three methods, map(), reduceInit(), and reduce(), are now implemented with placeholder code.
32. Example 1: Hello, Read!
Step 9: Declare a PrintStream and mark it with the @Output annotation. This tells the GATK that we’re going to channel our output through this object.
Don’t worry about instantiating it; the GATK will do that automatically.
33. Example 1: Hello, Read!
Step 10: In your map() method, add a line of code that prints “Hello” and the name of the read:
out.println("Hello, " + read.getReadName());
Or, just type read. and then hit Ctrl-Space. IntelliJ will show you a window of all the methods you can call, and you can just select the one you want from the list.
Step 11: When you’re done, hit the disk icon (or type Ctrl-S) to save your work.
34. Example 1: Hello, Read!
Step 12: Back in the terminal window, change to your gatk-lilly directory and type:
ant dist
This will compile the GATK-Lilly codebase, including your new walker!
36. Example 1: Hello, Read!
Step 13: Run your code by entering the following command:
java -jar dist/GenomeAnalysisTK.jar \
  -T HelloRead \
  -R /lrlhps/apps/gatk-lilly/resources/ucsc.hg19.fasta \
  -I /lrlhps/apps/gatk-lilly/testdata/114T.chr21.analysis_ready.bam \
  | less
Every walker must be provided with a reference FASTA file.
37. Example 1: Hello, Read!
Your code is now running and saying
“Hello” to every read in the file!
38. Example 1: Hello, Read!
Step 14: Let’s add some information to the output. Add the line:
out.println("Hello, " + read.getReadName() +
    " at " + read.getReferenceName() +
    ":" + read.getAlignmentStart()
);
This will print out the read name, the contig name, and the starting position for the read’s alignment.
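To check the shape of that output line without compiling the whole toolkit, the string concatenation can be tried on its own. This is a stand-alone sketch: HelloMessage and its format method are hypothetical, and the three parameters stand in for the values the walker would get from read.getReadName(), read.getReferenceName(), and read.getAlignmentStart().

```java
// Stand-alone sketch of the message format; the class and method names are made up for illustration.
class HelloMessage {
    /** Builds "Hello, <name> at <contig>:<start>" from plain values. */
    static String format(String readName, String contig, int alignmentStart) {
        // The space on each side of "at" keeps the read name and contig from running together.
        return "Hello, " + readName + " at " + contig + ":" + alignmentStart;
    }

    public static void main(String[] args) {
        // Example values only; a real run prints one such line per read in the BAM file.
        System.out.println(format("read1", "chr21", 9411000));
    }
}
```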
39. Example 1: Hello, Read!
Step 15: Compile and run with a single command:
ant dist && java -jar dist/GenomeAnalysisTK.jar \
  -T HelloRead \
  -R /lrlhps/apps/gatk-lilly/resources/ucsc.hg19.fasta \
  -I /lrlhps/apps/gatk-lilly/testdata/114T.chr21.analysis_ready.bam \
  | less -S
(The && instructs the shell to proceed only if the previous command
was successful. If the compilation fails, HelloRead will not be run.)
40. Example 1: Hello, Read!
The updated command is running and showing us the
alignment position in addition to the read name!
41. Example 1: Hello, Read!
Step 16: You can run on just a specific region by supplying the -L argument,
and redirect the output to a separate file with the -o argument:
java -jar dist/GenomeAnalysisTK.jar \
  -T HelloRead \
  -R /lrlhps/apps/gatk-lilly/resources/ucsc.hg19.fasta \
  -I /lrlhps/apps/gatk-lilly/testdata/114T.chr21.analysis_ready.bam \
  -L chr21:9411000-9411200 \
  -o test.txt
No additional code is required on your part to enable this.
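For a sense of what the engine is doing on your behalf, here is a rough plain-Java sketch of interval handling. It is not the GATK’s interval parser: IntervalSketch is a hypothetical class that understands only the contig:start-stop form, and treating a read as selected when it overlaps the interval is an assumption made for this illustration.

```java
// Hypothetical helper, not the GATK's interval parser: handles only "contig:start-stop".
class IntervalSketch {
    final String contig;
    final int start;
    final int stop;

    IntervalSketch(String contig, int start, int stop) {
        this.contig = contig;
        this.start = start;
        this.stop = stop;
    }

    /** Parses text such as "chr21:9411000-9411200" into its three parts. */
    static IntervalSketch parse(String text) {
        int colon = text.lastIndexOf(':');
        int dash = text.indexOf('-', colon + 1);
        if (colon < 1 || dash < 0) {
            throw new IllegalArgumentException("Expected contig:start-stop, got " + text);
        }
        int start = Integer.parseInt(text.substring(colon + 1, dash));
        int stop = Integer.parseInt(text.substring(dash + 1));
        if (start > stop) {
            throw new IllegalArgumentException("Start is after stop in " + text);
        }
        return new IntervalSketch(text.substring(0, colon), start, stop);
    }

    /** Assumed selection rule: a read counts if any part of it falls inside the interval. */
    boolean overlaps(String readContig, int readStart, int readEnd) {
        return contig.equals(readContig) && readStart <= stop && readEnd >= start;
    }

    public static void main(String[] args) {
        IntervalSketch interval = parse("chr21:9411000-9411200");
        System.out.println(interval.overlaps("chr21", 9410990, 9411005));
        System.out.println(interval.overlaps("chr21", 9411201, 9411300));
    }
}
```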
42. Example 1: Hello, Read!
The resultant file, with reads from chr21:9,411,000-9,411,200 only.
43. Example 2: Fix read names
Let’s use what we’ve learned to write a program that changes read names, as discussed earlier in this tutorial.
44. Example 2: Fix read names
Step 1: Now let’s create a new example program called “FixReadNames”.
45. Example 2: Fix read names
Step 2: Make FixReadNames a ReadWalker.
46. Example 2: Fix read names
Step 3: This time, we’ll emit a BAM file by directing the output to a SAMFileWriter object instead of a PrintStream.
47. Example 2: Fix read names
Step 4: Change the read name, tacking on the platform unit information.
48. Example 2: Fix read names
Step 5: Add the alignment to the output stream.
49. Example 2: Fix read names
Step 6: Compile and run your code:
ant dist && java -jar dist/GenomeAnalysisTK.jar \
  -T FixReadNames \
  -R /lrlhps/apps/gatk-lilly/resources/ucsc.hg19.fasta \
  -I /lrlhps/apps/gatk-lilly/testdata/114T.chr21.analysis_ready.bam \
  -L chr21:9411000-9411200 \
  -o test.bam
50. Example 2: Fix read names
Step 7: Run the following command to see your results:
samtools view test.bam | less -S
51. Example 2: Fix read names
All of the read names now have the
platform unit prepended to them!
52. Example 3: Hello, Variant!
This will be a larger example,
introducing variant processing,
map-reduce calculations, and
the onTraversalDone() method.
All code required is listed here.
53. Example 3: Hello, Variant!
Step 1: We’ve created a new program called “HelloVariant”.
54. Example 3: Hello, Variant!
Step 2: This program extends RodWalker<Integer, Integer>.
56. Example 3: Hello, Variant!
Step 4: In the map() function, we’ll loop over lines in a VCF file and print metadata from each record.
57. Example 3: Hello, Variant!
Step 5: Return 1. This will get passed to reduce() later.
58. Example 3: Hello, Variant!
Step 6: The reduceInit() method gets called before the first reduce() call. By returning 0, we initialize the record counter.
59. Example 3: Hello, Variant!
Step 7: All of the return values from map() get passed to reduce(), one at a time. Here, we add value to sum, effectively counting all the calls to map().
60. Example 3: Hello, Variant!
Step 8: The onTraversalDone() method runs after the computation is complete. Here, we print the total number of map() calls made.
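The way these four methods fit together can be sketched with a toy traversal loop in plain Java. Everything here is an assumption made for illustration: the Walker interface, the traverse method, and CountingWalker are hypothetical and only mirror the method names used in this tutorial; the real GATK engine is far more involved.

```java
import java.util.Arrays;
import java.util.List;

// Toy traversal engine written for illustration; it is not the GATK engine.
class MapReduceSketch {
    /** Hypothetical walker contract mirroring the method names used in this tutorial. */
    interface Walker<I, M, R> {
        M map(I item);
        R reduceInit();
        R reduce(M value, R sum);
        void onTraversalDone(R result);
    }

    /** Seeds the running sum, folds every map() result into it, then reports the final value. */
    static <I, M, R> R traverse(Walker<I, M, R> walker, List<I> items) {
        R sum = walker.reduceInit();
        for (I item : items) {
            sum = walker.reduce(walker.map(item), sum);
        }
        walker.onTraversalDone(sum);
        return sum;
    }

    /** Counts records the way the text describes: map() returns 1, reduce() adds. */
    static final class CountingWalker implements Walker<String, Integer, Integer> {
        public Integer map(String record) { return 1; }
        public Integer reduceInit() { return 0; }
        public Integer reduce(Integer value, Integer sum) { return value + sum; }
        public void onTraversalDone(Integer result) {
            System.out.println("Records processed: " + result);
        }
    }

    public static void main(String[] args) {
        // Three placeholder records stand in for lines of a VCF file.
        traverse(new CountingWalker(), Arrays.asList("rec1", "rec2", "rec3"));
    }
}
```

Because reduce() only ever sees one map() result and the running sum, the walker itself never needs to hold the whole file in memory.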
61. Example 3: Hello, Variant!
Step 9: Compile and run the HelloVariant walker, but this time, rather than specifying a BAM
file with the -I argument, we’ll attach a VCF file:
ant dist && java -jar dist/GenomeAnalysisTK.jar \
  -T HelloVariant \
  -R /lrlhps/apps/gatk-lilly/resources/ucsc.hg19.fasta \
  -B:variant,VCF /lrlhps/apps/gatk-lilly/testdata/test.chr21.analysis_ready.vcf
62. Example 3: Hello, Variant!
The program prints out the reference allele,
alternate allele, and locus for each VCF record, and
finally prints out the number of records processed!
63. Example 4: Compute depth of coverage from a VCF file
Let’s continue exploring variant processing by taking a closer look at the VariantContext object, the programmatic representation of a VCF record.
This program will compute a depth of coverage histogram using VCF metadata rather than a BAM file.
64. Example 4: Compute depth of coverage from a VCF file
Step 1: Create a new program called ComputeCoverageFromVCF of type RodWalker<Integer, Integer> with the usual
@Output
PrintStream out
declaration.
65. Example 4: Compute depth of coverage from a VCF file
Step 2: Add a command line argument with the following code:
@Argument(fullName="sample", shortName="sn", doc="Sample to process", required=false)
public String SAMPLE;
This adds the command-line argument --sample (aka -sn) and stores the supplied value in the String variable SAMPLE.
We’ll use this to allow the user to specify whether they want to get coverage for a specific sample or all of the samples (by specifying no sample at all).
66. Example 4: Compute depth of coverage from a VCF file
Step 3: Declare a map to store the coverage counts.
private TreeMap<Integer, Integer> histogram = new TreeMap<Integer, Integer>();
A TreeMap is a sorted map (backed by a tree rather than a hash table) that returns its keys in sorted order.
67. Example 4: Compute depth of coverage from a VCF file
Step 4: Loop over the variants. For each one, we’ll print the coverage observed. We also make sure that we get the coverage for the sample requested (if the user specified a sample name to the --sample argument), or for all samples (if the user specified no sample name at all).
For every coverage level we observe, we increment the appropriate entry in the histogram object.
68. Example 4: Compute depth of coverage from a VCF file
Step 5: In the onTraversalDone() method, we’ll loop over every coverage level in the histogram and output the depth and the number of times we observed that depth.
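The histogram bookkeeping in these two steps can be tried out in plain Java, independent of the GATK. In this sketch the per-record depths arrive as a simple map from sample name to depth; that input shape, the class name, and the method names are assumptions made for illustration, since the real walker reads depths from the VariantContext.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of the histogram logic; the input shape is a hypothetical stand-in for VCF records.
class CoverageHistogramSketch {
    /** Adds the depths from one record. A null sample means "use every sample". */
    static void addRecord(Map<String, Integer> depthBySample, String sample,
                          TreeMap<Integer, Integer> histogram) {
        for (Map.Entry<String, Integer> entry : depthBySample.entrySet()) {
            if (sample == null || sample.equals(entry.getKey())) {
                // Start the count at 1, or add 1 to the existing count for this depth.
                histogram.merge(entry.getValue(), 1, Integer::sum);
            }
        }
    }

    /** Renders "depth<TAB>count" lines; TreeMap iteration yields depths in ascending order. */
    static String render(TreeMap<Integer, Integer> histogram) {
        StringBuilder text = new StringBuilder();
        for (Map.Entry<Integer, Integer> entry : histogram.entrySet()) {
            text.append(entry.getKey()).append('\t').append(entry.getValue()).append('\n');
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Made-up depths for two records and two samples.
        Map<String, Integer> first = new LinkedHashMap<>();
        first.put("113N", 30);
        first.put("114T", 12);
        Map<String, Integer> second = new LinkedHashMap<>();
        second.put("113N", 12);
        second.put("114T", 12);

        TreeMap<Integer, Integer> histogram = new TreeMap<>();
        addRecord(first, null, histogram);
        addRecord(second, null, histogram);
        System.out.print(render(histogram));
    }
}
```

Using a sorted map means the output step needs no explicit sort before printing.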
69. Example 4: Compute depth of coverage from a VCF file
Step 6: Compile and run:
ant dist && java -jar dist/GenomeAnalysisTK.jar \
  -T ComputeCoverageFromVCF \
  -R /lrlhps/apps/gatk-lilly/resources/ucsc.hg19.fasta \
  -B:variant,VCF /lrlhps/apps/gatk-lilly/testdata/test.chr21.analysis_ready.vcf \
  -o histogram.txt
70. Example 4: Compute depth of coverage from a VCF file
Two columns of information are printed. The first column is the coverage level; the second is the number of times that coverage level was observed.
71. Example 5: Find variants unique to a single sample
For our last example, we’ll write a simple program that can take an input VCF and write a new VCF containing only variants that are exclusive to one sample.
We’ll also introduce the initialize() method, which can be used to prepare the environment for the computation.
72. Example 5: Find variants unique to a single sample
Step 1: Create a new RodWalker called FindExclusiveVariants that has a command-line argument called “sample” (aka “sn”) of type String.
Add an output stream, but rather than be of type PrintStream, make it of type VCFWriter. We’ll use this to output a new VCF file based on the input VCF.
73. Example 5: Find variants unique to a single sample
Step 2: The initialize() method is called first, before any of the map() or reduce() calls are made. It is useful for preparing the environment, writing headers, setting up variables, etc.
Here, we’ll write a VCF header to the output stream. While we’re free to add/remove header lines and samples, we’ll just copy the input file’s header to the output file.
74. Example 5: Find variants unique to a single sample
Step 3: Loop over each record in the VCF, and each Genotype object contained within the VariantContext object. Check the genotypes of each sample and, if only our sample of interest is variant, output the record to the new VCF file.
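The exclusivity test at the heart of this step can be sketched in plain Java. The map from sample name to a variant/non-variant flag is a hypothetical stand-in for the Genotype objects, and treating a missing sample as non-variant is an assumption of this sketch rather than documented GATK behavior. Sample names other than 113N are made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java sketch of the exclusivity check; the map is a stand-in for per-sample genotypes.
class ExclusiveVariantSketch {
    /** True when the sample of interest is variant and every other sample is not. */
    static boolean isExclusiveTo(String sample, Map<String, Boolean> isVariantBySample) {
        if (!Boolean.TRUE.equals(isVariantBySample.get(sample))) {
            return false; // sample of interest is absent or not variant at this record
        }
        for (Map.Entry<String, Boolean> entry : isVariantBySample.entrySet()) {
            if (!entry.getKey().equals(sample) && Boolean.TRUE.equals(entry.getValue())) {
                return false; // some other sample carries the variant too
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // One made-up record: only 113N is variant.
        Map<String, Boolean> record = new LinkedHashMap<>();
        record.put("113N", true);
        record.put("otherSampleA", false);
        record.put("otherSampleB", false);
        System.out.println(isExclusiveTo("113N", record));
    }
}
```

In the walker, a record that passes this check is the one that gets written to the output VCF.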
75. Example 5: Find variants unique to a single sample
Step 4: Compile and run:
ant dist && java -jar dist/GenomeAnalysisTK.jar \
  -T FindExclusiveVariants \
  -R /lrlhps/apps/gatk-lilly/resources/ucsc.hg19.fasta \
  -B:variant,VCF /lrlhps/apps/gatk-lilly/testdata/test.chr21.analysis_ready.vcf \
  -sn 113N \
  -o 113.exclusive.vcf
76. Example 5: Find variants unique to a single sample
Step 5: After the program completes, look at the output.
77. Example 5: Find variants unique to a single sample
Step 6: You can scroll left and right with the arrow keys, but let’s clean up the output to
make it easier to read. Supply this command instead:
grep -v '##' 113.exclusive.vcf | cut -f1-7,10- | head -10 | column -t | less -S
78. Example 5: Find variants unique to a single sample
Observe how the third sample is
variant and the other three samples
are not. Our program is selecting only
the variants that are exclusive to 113N!
79. Conclusions
• From the five example programs, we have learned how to:
– configure IntelliJ for GATK development
– create a new ReadWalker or RodWalker
– declare output streams (PrintStream, SAMFileWriter, VCFWriter)
– access and modify metadata in reads
– access variants, samples, and metadata from a VCF file
– declare command-line arguments
– prepare for computations with the initialize() method
– finish computations with the onTraversalDone() method
– compile and run new GATK programs
• This tutorial is more than enough to get started with writing new
and useful GATK programs
– Our FixReadNames, ComputeCoverageFromVCF, and FindExclusiveVariants
walkers are fully realized programs, ready to be used for real work.
– You now have enough information to write your own somatic variant finder.
80. Additional resources
• For more information on developing in the GATK and Java, see
– http://www.broadinstitute.org/gsa/wiki/index.php/GATK_Development
– http://download.oracle.com/javase/tutorial/java/index.html
• Explore the GATK Git repository at
– https://github.com/broadgsa
– https://github.com/signup/free (to add your own code, sign up for a free account)
• To learn Git, the codebase’s version control system, see
– http://gitref.org/
– http://git-scm.com/course/svn.html (for those already familiar with SVN)
• Read our papers on the GATK framework and tools
– http://genome.cshlp.org/content/20/9/1297.long
– http://www.nature.com/ng/journal/v43/n5/abs/ng.806.html
• For more guidance, feel free to look at other programs in the GATK
– Every program is a tutorial!