Eli Lilly / September 14-15, 2011

20-Line Lifesavers:
Coding simple solutions in the GATK
Kiran V Garimella (kiran.garimella@gmail.com), Mark A DePristo
Genome Sequencing and Analysis, Broad Institute

Research Informatics Group
Eli Lilly and Company
Genome Analysis Toolkit (GATK)
ˈjē-ˌnōm ə-ˈna-lə-səs ˈtül-ˌkit

Noun

1.  A suite of tools for working with medical resequencing projects
    (e.g. 1,000 Genomes, The Cancer Genome Atlas)

2.  A structured software library that makes writing efficient
    analysis tools using next-generation sequencing data easy
Most users think of the toolkit merely as a set of tools that implement our ideas…

… but the GATK's real power is in how easy it makes it to instantiate your ideas.

This is what we will discuss today.
Some tasks are made difficult by the wrong tools

These BAMs have numeric, non-unique read IDs that collide when you merge them! How long will it take to fix?

Convert to SAM format, read the header, parse the read group info into a hash table keyed on the ID, loop over the reads, look up the read group ID in the hash, find the platform unit tag, prepend it to the read name, convert back to BAM, reindex BAM. Lines of code: 500.

All day!

With all apologies to Randall Munroe and XKCD
That same task, written in the GATK (20 lines of code)
package org.broadinstitute.sting.gatk.walkers.examples;

import net.sf.samtools.SAMFileWriter;
import net.sf.samtools.SAMRecord;
import org.broadinstitute.sting.commandline.Output;
import org.broadinstitute.sting.gatk.contexts.ReferenceContext;
import org.broadinstitute.sting.gatk.refdata.ReadMetaDataTracker;
import org.broadinstitute.sting.gatk.walkers.ReadWalker;

public class FixReadNames extends ReadWalker<Integer, Integer> {
    @Output
    SAMFileWriter out;

    @Override
    public Integer map(ReferenceContext ref, SAMRecord read, ReadMetaDataTracker metaDataTracker) {
        read.setReadName(read.getReadGroup().getPlatformUnit() + "." + read.getReadName());
        out.addAlignment(read);

        return null;
    }

    @Override
    public Integer reduceInit() { return null; }

    @Override
    public Integer reduce(Integer value, Integer sum) { return null; }
}
That same task, written in the GATK
(code that's not filled in for you by the IDE: 5 lines)


Most of the code is boilerplate, and the IDE can fill it in for you. The amount
of code you have to write by hand is actually very small.
Those tasks are simple when using the right tools…

These BAMs have numeric, non-unique read IDs that collide when you merge them! How long will it take to fix?

Write a GATK ReadWalker that modifies the read name and writes it out again. Spend the rest of the time looking at lolcats. Lines of code: 5.

Um, all day...
…though whether you'll tell people that is up to you.

Hehe, I can haz cheezburger INDEED.
We're going to write genuinely useful, deadline-defeating, lifesaving tools in < 20 lines of code
Now we'll go through a bunch of programs and learn to write new GATK tools by example
•  We'll set up the environment and look at five tutorial programs:
  –  HelloRead: A simple walker that prints read information from a BAM
  –  FixReadNames: Modify read names and emit results to a new BAM file
  –  HelloVariant: A simple walker that prints variant information from a VCF
  –  ComputeCoverageFromVCF: Computes a coverage histogram from a VCF
  –  FindExclusiveVariants: Create a new VCF of variants exclusive to a sample

•  Finished and commented versions are in the codebase at:
  –  java/src/org/broadinstitute/sting/gatk/walkers/tutorial/

•  How these tutorials work:
  –  The numbered icons (1, 2, 3, …) enumerate the steps in each tutorial.
  –  The code that you should write at each step is in the IntelliJ window.
  –  Text in boxes gives additional information on each step, emphasizes
     some information, and may clarify the command or code that you should write.
Setting up for GATK development
See our wiki resources

•  http://www.broadinstitute.org/gsa/wiki/index.php/Configuring_IntelliJ

•  http://www.broadinstitute.org/gsa/wiki/index.php/Queue_with_IntelliJ_IDEA
Mechanics of a GATK “walker”
(a program that “walks” along a dataset in a prescribed way)
ReadWalker: “walks” over reads and allows a computation to be performed on each one

ReadWalker: process one read at a time

[Figure: reads aligned beneath the reference sequence TTTAAATCTGTGGGGTTAATCGGCGGGCTAA; the computation order is read (1), then (2), (3), (4), (5).]

Example use cases:
1.  Setting an extra metadata tag in a read
2.  Searching for mouse contaminant reads and excluding them
3.  Finding or realigning indels

Some example GATK programs: CycleQualityWalker, TableRecalibrationWalker, CountReadsWalker, IndelRealigner, etc.
LocusWalker: “walks” over genomic positions and allows a computation to be performed at each one

LocusWalker: process a single-base genomic position at a time

[Figure: reads aligned beneath the reference sequence TTTAAATCTGTGGGGTTAATCGGCGGGCTAA; the computation order is reference position (1), then (2), (3), (4), (5), and so on.]

Example use cases:
1.  Variant calling
2.  Depth of coverage calculations
3.  Computing properties of regions (GC content, read error rates)

Note: reads are required for locus walkers. RefWalkers are a similar type
of walker that examines each genomic locus but does not require reads.
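
To make the locus-by-locus idea concrete, here is a minimal sketch of a depth-of-coverage calculation: for each reference position, count the reads whose alignment spans it. This is plain Java and does not use the GATK API. The AlignedRead class, its 1-based inclusive start and end fields, and the depthAt helper are hypothetical stand-ins chosen for illustration.

```java
import java.util.Arrays;
import java.util.List;

final class LocusDepthSketch {
    /** Hypothetical stand-in for an aligned read; not a GATK or samtools class. */
    static final class AlignedRead {
        final String name;
        final int start; // 1-based, inclusive
        final int end;   // 1-based, inclusive

        AlignedRead(String name, int start, int end) {
            this.name = name;
            this.start = start;
            this.end = end;
        }
    }

    /** Counts the reads whose alignment covers the given position. */
    static int depthAt(List<AlignedRead> reads, int position) {
        int depth = 0;
        for (AlignedRead read : reads) {
            if (read.start <= position && position <= read.end) {
                depth++;
            }
        }
        return depth;
    }

    public static void main(String[] args) {
        // Illustrative reads only; the coordinates are made up.
        List<AlignedRead> reads = Arrays.asList(
                new AlignedRead("r1", 1, 10),
                new AlignedRead("r2", 5, 14),
                new AlignedRead("r3", 8, 17));

        // Visit each position in order, as a locus-style traversal would.
        for (int position = 1; position <= 17; position++) {
            System.out.println(position + "\t" + depthAt(reads, position));
        }
    }
}
```

A real LocusWalker would receive the overlapping reads for each position from the framework instead of scanning a list, so the per-position work would be limited to the counting step.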
RodWalker: “walks” over positions in a file and allows a computation to be performed at each one

RodWalker: process a genomic position from a file (e.g. VCF) at a time

[Figure: variant sites (*) for SampleA, SampleB, and SampleC marked beneath the reference sequence TTTAAATCTGTGGGGTTAATCGGCGGGCTAA; the computation order is site (1), then (2), (3), (4).]

Example use cases:
1.  Variant filtering
2.  Computing metrics on variants
3.  Refining variant calls by enforcing additional constraints

Some example GATK programs: VariantEval, PhaseByTransmission, VariantAnnotator, VariantRecalibrator, SelectVariants, etc.
Writing your first GATK walkers
Example 1: Hello, Read!

1. Right-click on “walkers” and select New->Package.

2. Type “examples” as the package name.

3. Click “OK”.

4. Right-click on “examples” and select New->Java class. Enter the name “HelloRead”.
   A file declaring the class and proper package name is created for you.

5. Add the following text to the class declaration:

   extends ReadWalker<Integer, Integer> {

   This tells the GATK that you are creating a program that iterates over all of
   the reads in a BAM file, one at a time.

   The “import” statement at the top will be added by the IDE.

6. IntelliJ can detect which methods you need to implement in order to get your
   program working.

   Make sure your cursor is on the class declaration and press Alt-Enter to get
   the contextual action menu.

   Select “Implement Methods”.

7. Select all of the methods (usually they'll already be selected, so you won't
   need to do anything).

8. Click “OK”.

   The three methods, map(), reduceInit(), and reduce(), are now implemented
   with placeholder code.

9. Declare a PrintStream and mark it with the @Output annotation. This tells the
   GATK that we're going to channel our output through this object.

   Don't worry about instantiating it; the GATK will do that automatically.
10. In your map() method, add a line of code that prints “Hello” and the name of the read:

    out.println("Hello, " + read.getReadName());

    Or, just type read. and then press Ctrl-Space. IntelliJ will show you a window of all
    the methods you can call, and you can select one from the list.

11. When you're done, click the disk icon (or press Ctrl-S) to save your work.
12. Back in the terminal window, change to your gatk-lilly directory and type:

    ant dist

    This will compile the GATK-Lilly codebase, including your new walker!
    It'll take about a minute to compile.

13. Run your code by entering the following command:

    java -jar dist/GenomeAnalysisTK.jar \
      -T HelloRead \
      -R /lrlhps/apps/gatk-lilly/resources/ucsc.hg19.fasta \
      -I /lrlhps/apps/gatk-lilly/testdata/114T.chr21.analysis_ready.bam \
    | less

    Every walker must be provided with a reference FASTA file.

    Your code is now running and saying “Hello” to every read in the file!
14. Let's add some information to the output. Add the line:

    out.println("Hello, " + read.getReadName() +
                " at " + read.getReferenceName() +
                ":" + read.getAlignmentStart());

    This will print out the read name, the contig name, and the starting position
    of the read's alignment.
15. Compile and run with a single command:

    ant dist && java -jar dist/GenomeAnalysisTK.jar \
      -T HelloRead \
      -R /lrlhps/apps/gatk-lilly/resources/ucsc.hg19.fasta \
      -I /lrlhps/apps/gatk-lilly/testdata/114T.chr21.analysis_ready.bam \
    | less -S

    (The && instructs the shell to proceed only if the previous command
    was successful. If the compilation fails, HelloRead will not be run.)

    The updated command is running and showing us the alignment position in
    addition to the read name!
16. You can run on just a specific region by supplying the -L argument,
    and redirect the output to a separate file with the -o argument:

    java -jar dist/GenomeAnalysisTK.jar \
      -T HelloRead \
      -R /lrlhps/apps/gatk-lilly/resources/ucsc.hg19.fasta \
      -I /lrlhps/apps/gatk-lilly/testdata/114T.chr21.analysis_ready.bam \
      -L chr21:9411000-9411200 \
      -o test.txt

    No additional code is required on your part to enable this.

    The resultant file contains reads from chr21:9,411,000-9,411,200 only.
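
The GATK handles the -L argument for you, but it can help to see what restricting a traversal to a region amounts to. The sketch below parses an interval string of the form contig:start-stop and tests whether an alignment touches it. It is plain Java, not the GATK's own interval code: the IntervalSketch class and its overlaps method are hypothetical, only the contig:start-stop form is handled, and the rule that any overlapping read counts as inside the region is an assumption made for illustration.

```java
final class IntervalSketch {
    final String contig;
    final int start; // 1-based, inclusive
    final int stop;  // 1-based, inclusive

    IntervalSketch(String contig, int start, int stop) {
        this.contig = contig;
        this.start = start;
        this.stop = stop;
    }

    /** Parses a string such as "chr21:9411000-9411200". */
    static IntervalSketch parse(String text) {
        int colon = text.lastIndexOf(':');
        int dash = text.indexOf('-', colon + 1);
        if (colon < 1 || dash < 0) {
            throw new IllegalArgumentException("Expected contig:start-stop, got: " + text);
        }
        return new IntervalSketch(
                text.substring(0, colon),
                Integer.parseInt(text.substring(colon + 1, dash)),
                Integer.parseInt(text.substring(dash + 1)));
    }

    /** True when an alignment spanning [alignStart, alignEnd] on readContig touches this interval. */
    boolean overlaps(String readContig, int alignStart, int alignEnd) {
        return contig.equals(readContig) && alignStart <= stop && alignEnd >= start;
    }

    public static void main(String[] args) {
        IntervalSketch region = parse("chr21:9411000-9411200");

        // Illustrative alignments only; the coordinates are made up.
        System.out.println(region.overlaps("chr21", 9410990, 9411010)); // straddles the left edge
        System.out.println(region.overlaps("chr21", 9411201, 9411250)); // starts just past the right edge
    }
}
```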
Example 2: Fix read names

Let's use what we've learned to write a program that changes read names,
as discussed earlier in this tutorial.

1. Create a new example program called “FixReadNames”.

2. Make FixReadNames a ReadWalker.

3. This time, we'll emit a BAM file by directing the output to a
   SAMFileWriter object instead of a PrintStream.

4. Change the read name, tacking on the platform unit information.

5. Add the alignment to the output stream.

6. Compile and run your code:

   ant dist && java -jar dist/GenomeAnalysisTK.jar \
     -T FixReadNames \
     -R /lrlhps/apps/gatk-lilly/resources/ucsc.hg19.fasta \
     -I /lrlhps/apps/gatk-lilly/testdata/114T.chr21.analysis_ready.bam \
     -L chr21:9411000-9411200 \
     -o test.bam

7. Run the following command to see your results:

   samtools view test.bam | less -S

   All of the read names now have the platform unit prepended to them!
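
The effect of the renaming can be checked without a BAM file. The sketch below applies the same string operation as the walker's map() method (platform unit, a dot, then the old name) to two sets of numeric read names and shows that the merged names stop colliding. It is plain Java: the class name, the helper methods, and the platform unit values "FC1.1" and "FC1.2" are hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

final class ReadNameCollisionSketch {
    /** Same string operation as the walker: platform unit, a dot, then the old name. */
    static String fixedName(String platformUnit, String readName) {
        return platformUnit + "." + readName;
    }

    /** Applies fixedName to every name in a lane. */
    static List<String> fixAll(String platformUnit, List<String> readNames) {
        List<String> fixed = new ArrayList<String>();
        for (String name : readNames) {
            fixed.add(fixedName(platformUnit, name));
        }
        return fixed;
    }

    /** Counts how many different names appear in the list. */
    static int distinctCount(List<String> names) {
        return new HashSet<String>(names).size();
    }

    public static void main(String[] args) {
        // Two hypothetical lanes whose reads were both numbered from 1.
        List<String> laneA = Arrays.asList("1", "2", "3");
        List<String> laneB = Arrays.asList("1", "2", "3");

        List<String> mergedBefore = new ArrayList<String>(laneA);
        mergedBefore.addAll(laneB);

        List<String> mergedAfter = fixAll("FC1.1", laneA);
        mergedAfter.addAll(fixAll("FC1.2", laneB));

        System.out.println("distinct names before: " + distinctCount(mergedBefore));
        System.out.println("distinct names after:  " + distinctCount(mergedAfter));
    }
}
```

The fix only removes collisions when each input file carries a different platform unit, which is the situation the tutorial describes.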
Example 3: Hello, Variant!

This will be a larger example, introducing variant processing, map-reduce
calculations, and the onTraversalDone() method. All code required is listed here.

1. We've created a new program called “HelloVariant”.

2. This program extends RodWalker<Integer, Integer>.

3. It declares a PrintStream.

4. In the map() function, we'll loop over lines in a VCF file and print
   metadata from each record.

5. Return 1. This will get passed to reduce() later.

6. reduceInit() gets called before the first reduce() call. By returning 0,
   we initialize the record counter.

7. All of the return values from map() get passed to reduce(), one at a time.
   Here, we add value to sum, effectively counting all the calls to map().

8. The onTraversalDone() method runs after the computation is complete.
   Here, we print the total number of map() calls made.
9. Compile and run the HelloVariant walker, but this time, rather than specifying
   a BAM file with the -I argument, we'll attach a VCF file:

   ant dist && java -jar dist/GenomeAnalysisTK.jar \
     -T HelloVariant \
     -R /lrlhps/apps/gatk-lilly/resources/ucsc.hg19.fasta \
     -B:variant,VCF /lrlhps/apps/gatk-lilly/testdata/test.chr21.analysis_ready.vcf

   The program prints out the reference allele, alternate allele, and locus for
   each VCF record, and finally prints out the number of records processed!
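
The order in which map(), reduceInit(), reduce(), and onTraversalDone() are called can be sketched with a small stand-alone driver. This is plain Java and only mimics the calling pattern described in steps 5 to 8. The Walker interface, the traverse method, and CountingWalker below are simplified, hypothetical stand-ins, not the GATK classes, and the records are plain strings instead of VCF records.

```java
import java.util.Arrays;
import java.util.List;

final class MapReduceSketch {
    /** Hypothetical, simplified version of the walker contract; not the GATK interface. */
    interface Walker<T, M, R> {
        M map(T record);
        R reduceInit();
        R reduce(M value, R sum);
        void onTraversalDone(R result);
    }

    /** Calls reduceInit() once, then map() and reduce() per record, then onTraversalDone(). */
    static <T, M, R> R traverse(List<T> records, Walker<T, M, R> walker) {
        R sum = walker.reduceInit();
        for (T record : records) {
            sum = walker.reduce(walker.map(record), sum);
        }
        walker.onTraversalDone(sum);
        return sum;
    }

    /** Prints each record and counts how many were seen. */
    static final class CountingWalker implements Walker<String, Integer, Integer> {
        public Integer map(String record) {
            System.out.println("Hello, " + record);
            return 1; // one record seen
        }

        public Integer reduceInit() {
            return 0; // the counter starts at zero
        }

        public Integer reduce(Integer value, Integer sum) {
            return value + sum;
        }

        public void onTraversalDone(Integer result) {
            System.out.println("Records processed: " + result);
        }
    }

    public static void main(String[] args) {
        // Illustrative records only; a real RodWalker would see VCF records.
        traverse(Arrays.asList("site-1", "site-2", "site-3"), new CountingWalker());
    }
}
```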
Example 4: Compute depth of coverage from a VCF file

Let's continue exploring variant processing by taking a closer look at the
VariantContext object, the programmatic representation of a VCF record.

This program will compute a depth of coverage histogram using VCF metadata
rather than a BAM file.

1. Create a new program called

     ComputeCoverageFromVCF

   of type

     RodWalker<Integer, Integer>

   with the usual

     @Output
     PrintStream out;

   declaration.

2. Add a command-line argument with the following code:

   @Argument(fullName="sample", shortName="sn", doc="Sample to process", required=false)
   public String SAMPLE;

   This adds the command-line argument --sample (aka -sn) and stores the supplied
   value in the String variable SAMPLE.

   We'll use this to let the user choose between coverage for a specific sample
   and coverage for all of the samples (by specifying no sample at all).
3. Declare a map to store the coverage counts:

   private TreeMap<Integer, Integer> histogram = new TreeMap<Integer, Integer>();

   A TreeMap is a map (backed by a tree rather than a hash table) that returns
   its keys in sorted order.

4. Loop over the variants. For each one, we'll print the coverage observed.
   We also make sure that we get the coverage for the sample requested (if the
   user specified a sample name to the --sample argument), or for all samples
   (if the user specified no sample name at all).

   For every coverage level we observe, we increment the appropriate entry in
   the histogram object.

5. In the onTraversalDone() method, we'll loop over every coverage level in the
   histogram and output the depth and the number of times we observed that depth.
6. Compile and run:

   ant dist && java -jar dist/GenomeAnalysisTK.jar \
     -T ComputeCoverageFromVCF \
     -R /lrlhps/apps/gatk-lilly/resources/ucsc.hg19.fasta \
     -B:variant,VCF /lrlhps/apps/gatk-lilly/testdata/test.chr21.analysis_ready.vcf \
     -o histogram.txt

   Two columns of information are printed. The first column is the coverage level;
   the second is the number of times that coverage level was observed.
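
The histogram bookkeeping from steps 3 to 5 can be tried on its own. The sketch below increments a TreeMap entry for each observed depth and then renders the two-column output in ascending depth order. It is plain Java: the class name, the observe and render methods, and the depth values are hypothetical, and in the real walker the depths would come from the VCF records, which this sketch does not read.

```java
import java.util.Map;
import java.util.TreeMap;

final class CoverageHistogramSketch {
    private final TreeMap<Integer, Integer> histogram = new TreeMap<Integer, Integer>();

    /** Records one observation of the given depth. */
    void observe(int depth) {
        Integer previous = histogram.get(depth);
        histogram.put(depth, previous == null ? 1 : previous + 1);
    }

    /** Formats the histogram as "depth<TAB>count" lines, in ascending depth order. */
    String render() {
        StringBuilder text = new StringBuilder();
        for (Map.Entry<Integer, Integer> entry : histogram.entrySet()) {
            text.append(entry.getKey()).append('\t').append(entry.getValue()).append('\n');
        }
        return text.toString();
    }

    public static void main(String[] args) {
        CoverageHistogramSketch sketch = new CoverageHistogramSketch();

        // Illustrative depths only, deliberately out of order.
        int[] depths = {30, 12, 30, 7, 12, 30};
        for (int depth : depths) {
            sketch.observe(depth);
        }

        // Prints the rows for depths 7, 12, and 30, in that order.
        System.out.print(sketch.render());
    }
}
```

Because the TreeMap keeps its keys sorted, the output needs no separate sorting step before it is written.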
Example 5: Find variants unique to a single sample

For our last example, we'll write a simple program that can take an input VCF
and write a new VCF containing only variants that are exclusive to one sample.

We'll also introduce the initialize() method, which can be used to prepare the
environment for the computation.

1. Create a new RodWalker called FindExclusiveVariants that has a
   command-line argument called “sample” (aka “sn”) of type String.

   Add an output stream, but rather than making it of type PrintStream,
   make it of type VCFWriter. We'll use this to output a new VCF file
   based on the input VCF.

2. The initialize() method is called first, before any of the map() or
   reduce() calls are made. It is useful for preparing the
   environment, writing headers, setting up variables, etc.

   Here, we'll write a VCF header to the output stream. While we're
   free to add/remove header lines and samples, we'll just copy the
   input file's header to the output file.

3. Loop over each record in the VCF, and each Genotype object contained
   within the VariantContext object. Check the genotypes of each sample and,
   if only our sample of interest is variant, output the record to the new
   VCF file.
4. Compile and run:

   ant dist && java -jar dist/GenomeAnalysisTK.jar \
     -T FindExclusiveVariants \
     -R /lrlhps/apps/gatk-lilly/resources/ucsc.hg19.fasta \
     -B:variant,VCF /lrlhps/apps/gatk-lilly/testdata/test.chr21.analysis_ready.vcf \
     -sn 113N \
     -o 113.exclusive.vcf

5. After the program completes, look at the output.

6. You can scroll left and right with the arrow keys, but let's clean up the
   output to make it easier to read. Supply this command instead:

   grep -v '##' 113.exclusive.vcf | cut -f1-7,10- | head -10 | column -t | less -S

   Observe how the third sample is variant and the other three samples
   are not. Our program is selecting only the variants that are exclusive to 113N!
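
The test applied in step 3 (only the sample of interest is variant) can be sketched independently of the VCF classes. The example below represents one site as a map from sample name to whether that sample's genotype is non-reference. It is plain Java, not the GATK API: the class, the isExclusiveTo method, and the sample names other than 113N are hypothetical, and treating a missing genotype as not variant is an assumption made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ExclusiveVariantSketch {
    /**
     * Returns true when the sample of interest is variant at this site and no
     * other sample is. A sample absent from the map is treated as not variant
     * (an assumption of this sketch).
     */
    static boolean isExclusiveTo(String sample, Map<String, Boolean> isVariantBySample) {
        if (!Boolean.TRUE.equals(isVariantBySample.get(sample))) {
            return false;
        }
        for (Map.Entry<String, Boolean> entry : isVariantBySample.entrySet()) {
            if (!entry.getKey().equals(sample) && Boolean.TRUE.equals(entry.getValue())) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // One hypothetical site: only 113N carries the variant.
        Map<String, Boolean> exclusiveSite = new LinkedHashMap<String, Boolean>();
        exclusiveSite.put("sampleA", false);
        exclusiveSite.put("sampleB", false);
        exclusiveSite.put("113N", true);
        exclusiveSite.put("sampleC", false);

        // Another hypothetical site: 113N shares the variant with sampleB.
        Map<String, Boolean> sharedSite = new LinkedHashMap<String, Boolean>(exclusiveSite);
        sharedSite.put("sampleB", true);

        System.out.println(isExclusiveTo("113N", exclusiveSite)); // only 113N is variant
        System.out.println(isExclusiveTo("113N", sharedSite));    // sampleB is variant too
    }
}
```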
Conclusions
•  From the five example programs, we have learned how to:
  –  configure IntelliJ for GATK development
  –  create a new ReadWalker or RodWalker
  –  declare output streams (PrintStream, SAMFileWriter, VCFWriter)
  –  access and modify metadata in reads
  –  access variants, samples, and metadata from a VCF file
  –  declare command-line arguments
  –  prepare for computations with the initialize() method
  –  finish computations with the onTraversalDone() method
  –  compile and run new GATK programs

•  This tutorial is more than enough to get started with writing new
   and useful GATK programs
  –  Our FixReadNames, ComputeCoverageFromVCF, and FindExclusiveVariants
     walkers are fully realized programs, ready to be used for real work.
  –  You now have enough information to write your own somatic variant finder.
Additional resources
•  For more information on developing in the GATK and Java, see
  –  http://www.broadinstitute.org/gsa/wiki/index.php/GATK_Development
  –  http://download.oracle.com/javase/tutorial/java/index.html

•  Explore the GATK Git repository at
  –  https://github.com/broadgsa
  –  https://github.com/signup/free (to add your own code, sign up for a free account)

•  To learn Git, the codebase's version control system, see
  –  http://gitref.org/
  –  http://git-scm.com/course/svn.html (for those already familiar with SVN)

•  Read our papers on the GATK framework and tools
  –  http://genome.cshlp.org/content/20/9/1297.long
  –  http://www.nature.com/ng/journal/v43/n5/abs/ng.806.html

•  For more guidance, feel free to look at other programs in the GATK
  –  Every program is a tutorial!

20-Line Lifesavers: Coding simple solutions in the GATK

  • 1. Eli Lilly / September 14-15, 2011. 20-Line Lifesavers: Coding simple solutions in the GATK. Kiran V Garimella (kiran.garimella@gmail.com), Mark A DePristo, Genome Sequencing and Analysis, Broad Institute. Research Informatics Group, Eli Lilly and Company
  • 2. Genome Analysis Toolkit (GATK) ˈjē-ˌnōm(,)ə-ˈna-lə-səs(,)ˈtül(,)kit Noun 1. A suite of tools for working with medical resequencing projects (e.g. 1,000 Genomes, The Cancer Genome Atlas) 2. A structured software library that makes writing efficient analysis tools using next-generation sequencing data easy
  • 3. Genome Analysis Toolkit (GATK) ˈjē-ˌnōm(,)ə-ˈna-lə-səs(.)ˈtül(,)kit Noun 1. A suite of tools for working with medical resequencing projects (e.g. 1,000 Genomes, The Cancer Genome Atlas) 2. A structured software library that makes writing efficient analysis tools using next-generation sequencing data easy. Most users think of the toolkit merely as a set of tools that implement our ideas…
  • 4. Genome Analysis Toolkit (GATK) ˈjē-ˌnōm(,)ə-ˈna-lə-səs(.)ˈtül(,)kit Noun 1. A suite of tools for working with medical resequencing projects (e.g. 1,000 Genomes, The Cancer Genome Atlas) 2. A structured software library that makes writing efficient analysis tools using next-generation sequencing data easy. …but the GATK's real power is in how easy it makes it to instantiate your ideas. This is what we will discuss today.
  • 5. Some tasks are made difficult by the wrong tools. “These BAMs have numeric, non-unique read IDs that collide when you merge them! How long will it take to fix?” “All day!” Convert to SAM format, read the header, parse the read group info into a hash table keyed on the ID, loop over the reads, look up the read group ID in the hash, find the platform unit tag, prepend it to the read name, convert back to BAM, reindex BAM. Lines of code: 500. With all apologies to Randall Munroe and XKCD
  • 6. That same task, written in the GATK (20 lines of code)
      package org.broadinstitute.sting.gatk.walkers.examples;

      import net.sf.samtools.SAMFileWriter;
      import net.sf.samtools.SAMRecord;
      import org.broadinstitute.sting.commandline.Output;
      import org.broadinstitute.sting.gatk.contexts.ReferenceContext;
      import org.broadinstitute.sting.gatk.refdata.ReadMetaDataTracker;
      import org.broadinstitute.sting.gatk.walkers.ReadWalker;

      public class FixReadNames extends ReadWalker<Integer, Integer> {
          @Output
          SAMFileWriter out;

          @Override
          public Integer map(ReferenceContext ref, SAMRecord read, ReadMetaDataTracker metaDataTracker) {
              read.setReadName(read.getReadGroup().getPlatformUnit() + "." + read.getReadName());
              out.addAlignment(read);
              return null;
          }

          @Override
          public Integer reduceInit() { return null; }

          @Override
          public Integer reduce(Integer value, Integer sum) { return null; }
      }
  • 7. That same task, written in the GATK (code that's not filled in for you by the IDE: 5 lines). This slide repeats the FixReadNames class from slide 6. Most of the code is boilerplate, and the IDE can fill it in for you. The amount of code you have to manually write is actually very small.
  • 8. Those tasks are simple when using the right tools… “These BAMs have numeric, non-unique read IDs that collide when you merge them! How long will it take to fix?” “Um, all day...” Write a GATK ReadWalker that modifies the read name and writes it out again. Spend rest of time looking at lolcats. Lines of code: 5. With all apologies to Randall Munroe and XKCD
  • 9. …though whether you'll tell people that is up to you. Hehe, I can haz cheezburger INDEED. With all apologies to Randall Munroe and XKCD
  • 10. We're going to write genuinely useful, deadline-defeating, lifesaving tools in < 20 lines of code
  • 11. Now we'll go through a bunch of programs and learn to write new GATK tools by example
      •  We'll set up the environment and look at five tutorial programs:
        –  HelloRead: A simple walker that prints read information from a BAM
        –  FixReadNames: Modify read names and emit results to a new BAM file
        –  HelloVariant: A simple walker that prints variant information from a VCF
        –  ComputeCoverageFromVCF: Computes a coverage histogram from a VCF
        –  FindExclusiveVariants: Create a new VCF of variants exclusive to a sample
      •  Finished and commented versions are in the codebase at:
        –  java/src/org/broadinstitute/sting/gatk/walkers/tutorial/
      •  How these tutorials work:
        –  The numbered icon (e.g. 3) enumerates the various steps in each tutorial.
        –  The code that you should write at each step is in the IntelliJ window.
        –  Text in boxes gives additional information on each step, emphasizes some information, and may clarify the command or code that you should write.
  • 12. Setting up for GATK development
  • 13. See our wiki resources •  http://www.broadinstitute.org/gsa/wiki/index.php/Configuring_IntelliJ •  http://www.broadinstitute.org/gsa/wiki/index.php/Queue_with_IntelliJ_IDEA
  • 14. Mechanics of a GATK “walker” (a program that “walks” along a dataset in a prescribed way)
  • 15-17. ReadWalker: “walks” over reads and allows a computation to be performed on each one. ReadWalker: process one read at a time. [Diagram: reference TTTAAATCTGTGGGGTTAATCGGCGGGCTAA with reads (1) to (5) beneath it, numbered in computation order.] Example use cases: 1. Setting an extra metadata tag in a read 2. Searching for mouse contaminant reads and excluding them 3. Finding or realigning indels. Some example GATK programs: CycleQualityWalker, TableRecalibrationWalker, CountReadsWalker, IndelRealigner, etc.
  • 18-20. LocusWalker: “walks” over genomic positions and allows a computation to be performed at each one. LocusWalker: process a single-base genomic position at a time. [Diagram: computation order (1)(2)(3)(4)(5)… along the reference TTTAAATCTGTGGGGTTAATCGGCGGGCTAA, with reads aligned beneath it.] Example use cases: 1. Variant calling 2. Depth of coverage calculations 3. Computing properties of regions (GC content, read error rates). Note: reads are required for locus walkers. RefWalkers are a similar type of walker that examines each genomic locus, but they do not require reads.
  • 21-23. RodWalker: “walks” over positions in a file and allows a computation to be performed at each one. RodWalker: process a genomic position from a file (e.g. VCF) at a time. [Diagram: computation order (1)(2)(3)(4) along the reference TTTAAATCTGTGGGGTTAATCGGCGGGCTAA, with variants marked (*) for SampleA, SampleB, and SampleC.] Example use cases: 1. Variant filtering 2. Computing metrics on variants 3. Refining variant calls by enforcing additional constraints. Some example GATK programs: VariantEval, PhaseByTransmission, VariantAnnotator, VariantRecalibrator, SelectVariants, etc.
  • 24. Writing your first GATK walkers
  • 25. Example 1: Hello, Read! Step 1: Right-click on “walkers”, select New->Package.
  • 26. Example 1: Hello, Read! Step 2: Type “examples” as the package name. Step 3: Click “OK”.
  • 27. Example 1: Hello, Read! Step 4: Right-click on “examples” and select New->Java class. Enter the name “HelloRead”. A file declaring the class and proper package name is created for you.
  • 28. Example 1: Hello, Read! Step 5: Add the following text to the class declaration: extends ReadWalker<Integer, Integer> { This will tell the GATK that you are creating a program that iterates over all of the reads in a BAM file, one at a time. The “import” statement at the top will be added by the IDE.
  • 29. Example 1: Hello, Read! Step 6: IntelliJ can detect what methods you need to implement in order to get your program working. Make sure your cursor is on the class declaration and type “Alt-Enter” to get the contextual action menu. Select “Implement Methods”.
  • 30. Example 1: Hello, Read! Step 7: Select all of the methods (usually, they'll already be selected, so you won't need to do anything). Step 8: Click “OK”.
  • 31. Example 1: Hello, Read! The three methods, map(), reduceInit(), and reduce(), are now implemented with placeholder code.
  • 32. Example 1: Hello, Read! Step 9: Declare a PrintStream and mark it with the @Output annotation. This tells the GATK that we're going to channel our output through this object. Don't worry about instantiating it: the GATK will do that automatically.
  • 33. Example 1: Hello, Read! Step 10: In your map() method, add a line of code that prints “Hello” and the name of the read: out.println("Hello, " + read.getReadName()); Or, just type read. and then hit Ctrl-Space. IntelliJ will show you a window of all the methods you can call, and you can just select it from the list. Step 11: When you're done, hit the disk icon (or type Ctrl-S) to save your work.
  • 34. Example 1: Hello, Read! Step 12: Back in the terminal window, change to your gatk-lilly directory and type: ant dist This will compile the GATK-Lilly codebase, including your new walker!
  • 35. Example 1: Hello, Read! It'll take about a minute to compile.
  • 36. Example 1: Hello, Read! Step 13: Run your code by entering the following command: java -jar dist/GenomeAnalysisTK.jar -T HelloRead -R /lrlhps/apps/gatk-lilly/resources/ucsc.hg19.fasta -I /lrlhps/apps/gatk-lilly/testdata/114T.chr21.analysis_ready.bam | less Every walker must be provided with a reference FASTA file.
  • 37. Example 1: Hello, Read! Your code is now running and saying “Hello” to every read in the file!
  • 38. Example 1: Hello, Read! Step 14: Let's add some information to the output. Add the line: out.println("Hello, " + read.getReadName() + " at " + read.getReferenceName() + ":" + read.getAlignmentStart()); This will print out the read name, the contig name, and the starting position for the read's alignment.
  • 39. Example 1: Hello, Read! Step 15: Compile and run with a single command: ant dist && java -jar dist/GenomeAnalysisTK.jar -T HelloRead -R /lrlhps/apps/gatk-lilly/resources/ucsc.hg19.fasta -I /lrlhps/apps/gatk-lilly/testdata/114T.chr21.analysis_ready.bam | less -S (The && instructs the shell to proceed only if the previous command was successful. If the compilation fails, HelloRead will not be run.)
  • 40. Example 1: Hello, Read! The updated command is running and showing us the alignment position in addition to the read name!
  • 41. Example 1: Hello, Read! Step 16: You can run on just a specific region by supplying the -L argument, and redirect the output to a separate file with the -o argument: java -jar dist/GenomeAnalysisTK.jar -T HelloRead -R /lrlhps/apps/gatk-lilly/resources/ucsc.hg19.fasta -I /lrlhps/apps/gatk-lilly/testdata/114T.chr21.analysis_ready.bam -L chr21:9411000-9411200 -o test.txt No additional code is required on your part to enable this.
  • 42. Example 1: Hello, Read! The resultant file, with reads from chr21:9,411,000-9,411,200 only.
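
To make the region restriction concrete, here is a minimal plain-Java sketch of parsing an interval string such as chr21:9411000-9411200 and testing whether a read overlaps it. It is an illustration only: IntervalSketch and its methods are hypothetical names, and the overlap rule (any shared base counts) is an assumption for the example, not a description of how the GATK engine implements -L.

```java
// Hypothetical helper for illustration; not a GATK class.
class IntervalSketch {
    final String contig;
    final int start; // 1-based, inclusive
    final int stop;  // 1-based, inclusive

    IntervalSketch(String contig, int start, int stop) {
        this.contig = contig;
        this.start = start;
        this.stop = stop;
    }

    /** Parses text of the form "contig:start-stop". */
    static IntervalSketch parse(String text) {
        int colon = text.indexOf(':');
        int dash = text.indexOf('-', colon);
        String contig = text.substring(0, colon);
        int start = Integer.parseInt(text.substring(colon + 1, dash));
        int stop = Integer.parseInt(text.substring(dash + 1));
        return new IntervalSketch(contig, start, stop);
    }

    /** Assumed rule: a read is kept if it shares at least one base with the interval. */
    boolean overlaps(String readContig, int readStart, int readEnd) {
        return contig.equals(readContig) && readStart <= stop && readEnd >= start;
    }

    public static void main(String[] args) {
        IntervalSketch region = parse("chr21:9411000-9411200");
        System.out.println(region.overlaps("chr21", 9410990, 9411010));
        System.out.println(region.overlaps("chr21", 9411201, 9411300));
    }
}
```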
  • 43. Example 2: Fix read names. Let's use what we've learned to write a program that can change read names, as discussed earlier in this tutorial.
  • 44. Example 2: Fix read names. Step 1: Now let's create a new example program called “FixReadNames”.
  • 45. Example 2: Fix read names. Step 2: Make FixReadNames a ReadWalker.
  • 46. Example 2: Fix read names. Step 3: This time, we'll emit a BAM file by directing the output to a SAMFileWriter object instead of a PrintStream.
  • 47. Example 2: Fix read names. Step 4: Change the read name, tacking on the platform unit information.
  • 48. Example 2: Fix read names. Step 5: Add the alignment to the output stream.
  • 49. Example 2: Fix read names. Step 6: Compile and run your code: ant dist && java -jar dist/GenomeAnalysisTK.jar -T FixReadNames -R /lrlhps/apps/gatk-lilly/resources/ucsc.hg19.fasta -I /lrlhps/apps/gatk-lilly/testdata/114T.chr21.analysis_ready.bam -L chr21:9411000-9411200 -o test.bam
  • 50. Example 2: Fix read names. Step 7: Run the following command to see your results: samtools view test.bam | less -S
  • 51. Example 2: Fix read names. All of the read names now have the platform unit prepended to them!
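
The renaming rule itself can be tried without the GATK. The sketch below is a toy, plain-Java version of what the map() step does to each read: look up the platform unit for the read's read group and prepend it to the name. ToyRead, the lookup map, and the example platform unit strings are all made up for illustration; in the walker, the GATK and samtools classes provide this information directly.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java illustration only; ToyRead and the lookup map are hypothetical
// stand-ins, not GATK or samtools classes.
class ReadNameFixSketch {
    /** A toy read: a name plus the ID of the read group it belongs to. */
    static final class ToyRead {
        String name;
        final String readGroupId;

        ToyRead(String name, String readGroupId) {
            this.name = name;
            this.readGroupId = readGroupId;
        }
    }

    /** Prepends the read group's platform unit to the read name, joined by a dot. */
    static void fixName(ToyRead read, Map<String, String> platformUnitById) {
        String platformUnit = platformUnitById.get(read.readGroupId);
        read.name = platformUnit + "." + read.name;
    }

    public static void main(String[] args) {
        Map<String, String> platformUnitById = new HashMap<String, String>();
        platformUnitById.put("rg1", "FLOWCELL_A.1"); // made-up platform units
        platformUnitById.put("rg2", "FLOWCELL_B.3");

        ToyRead a = new ToyRead("42", "rg1");
        ToyRead b = new ToyRead("42", "rg2"); // same numeric name, different read group
        fixName(a, platformUnitById);
        fixName(b, platformUnitById);
        System.out.println(a.name);
        System.out.println(b.name);
    }
}
```

After the fix, the two reads that shared the numeric name 42 no longer collide, because each carries its own platform unit as a prefix.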
  • 52. Example 3: Hello, Variant! This will be a larger example, introducing variant processing, map-reduce calculations, and the onTraversalDone() method. All code required is listed here.
  • 53. Example 3: Hello, Variant! Step 1: We've created a new program called “HelloVariant”.
  • 54. Example 3: Hello, Variant! Step 2: This program extends RodWalker<Integer, Integer>.
  • 55. Example 3: Hello, Variant! Step 3: Declares a PrintStream.
  • 56. Example 3: Hello, Variant! Step 4: In the map() function, we'll loop over lines in a VCF file and print metadata from each record.
  • 57. Example 3: Hello, Variant! Step 5: Return 1. This will get passed to reduce() later.
  • 58. Example 3: Hello, Variant! Step 6: The reduceInit() method gets called before the first reduce() call. By returning 0, we initialize the record counter.
  • 59. Example 3: Hello, Variant! Step 7: All of the return values from map() get passed to reduce(), one at a time. Here, we add value to sum, effectively counting all the calls to map().
  • 60. Example 3: Hello, Variant! Step 8: The onTraversalDone() method runs after the computation is complete. Here, we print the total number of map() calls made.
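
The order in which these four methods are called can be sketched in plain Java. The mini-engine below is hypothetical (ToyWalker, traverse, and CountingWalker are names invented for this example, and the real traversal is performed by the GATK), but it follows the sequence described in the steps above: reduceInit() once, then map() and reduce() for each record, then onTraversalDone() at the end.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical mini-engine for illustration; the real traversal is done by the GATK.
class MapReduceSketch {
    /** The four hooks described in the slides, reduced to a toy interface. */
    interface ToyWalker<T> {
        int map(T record);                 // called once per record
        int reduceInit();                  // initial value of the running sum
        int reduce(int value, int sum);    // folds one map() result into the sum
        void onTraversalDone(int result);  // called once, after the last record
    }

    /** Drives a walker over the records in the order the slides describe. */
    static <T> int traverse(List<T> records, ToyWalker<T> walker) {
        int sum = walker.reduceInit();
        for (T record : records) {
            sum = walker.reduce(walker.map(record), sum);
        }
        walker.onTraversalDone(sum);
        return sum;
    }

    /** Counts records the way HelloVariant does: map() returns 1, reduce() adds. */
    static final class CountingWalker implements ToyWalker<String> {
        public int map(String record) { return 1; }
        public int reduceInit() { return 0; }
        public int reduce(int value, int sum) { return value + sum; }
        public void onTraversalDone(int result) {
            System.out.println("Records processed: " + result);
        }
    }

    public static void main(String[] args) {
        // Made-up record labels standing in for VCF lines.
        traverse(Arrays.asList("chr21:100", "chr21:250", "chr21:975"), new CountingWalker());
    }
}
```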
  • 61. Example 3: Hello, Variant! Step 9: Compile and run the HelloVariant walker, but this time, rather than specifying a BAM file with the -I argument, we'll attach a VCF file: ant dist && java -jar dist/GenomeAnalysisTK.jar -T HelloVariant -R /lrlhps/apps/gatk-lilly/resources/ucsc.hg19.fasta -B:variant,VCF /lrlhps/apps/gatk-lilly/testdata/test.chr21.analysis_ready.vcf
  • 62. Example 3: Hello, Variant! The program prints out the reference allele, alternate allele, and locus for each VCF record, and finally prints out the number of records processed!
  • 63. Example 4: Compute depth of coverage from a VCF file. Let's continue exploring variant processing by taking a closer look at the VariantContext object, the programmatic representation of a VCF record. This program will compute a depth of coverage histogram using VCF metadata rather than a BAM file.
  • 64. Example 4: Compute depth of coverage from a VCF file. Step 1: Create a new program called ComputeCoverageFromVCF of type RodWalker<Integer, Integer> with the usual @Output PrintStream out declaration.
  • 65. Example 4: Compute depth of coverage from a VCF file. Step 2: Add a command-line argument with the following code: @Argument(fullName="sample", shortName="sn", doc="Sample to process", required=false) public String SAMPLE; This adds the command-line argument --sample (aka -sn) and stores the supplied value in the String variable SAMPLE. We'll use this to allow the user to specify whether they want to get coverage for a specific sample or for all of the samples (by specifying no sample at all).
  • 66. Example 4: Compute depth of coverage from a VCF file. Step 3: Declare a map to store the coverage counts. private TreeMap<Integer, Integer> histogram = new TreeMap<Integer, Integer>(); A TreeMap is a map that returns its keys in sorted order (it is backed by a tree rather than a hash table).
  • 67. Example 4: Compute depth of coverage from a VCF file. Step 4: Loop over the variants. For each one, we'll print the coverage observed. We also make sure that we get the coverage for the sample requested (if the user specified a sample name to the --sample argument), or for all samples (if the user specified no sample name at all). For every coverage level we observe, we increment the appropriate entry in the histogram object.
  • 68. Example 4: Compute depth of coverage from a VCF file. Step 5: In the onTraversalDone() method, we'll loop over every coverage level in the histogram and output the depth and the number of times we observed that depth.
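
The histogram bookkeeping in these steps can be tried on its own in plain Java. The sketch below assumes the depths have already been extracted (the array of depths is made up, and CoverageHistogramSketch is a hypothetical class name); it shows only the increment-per-observation pattern and the sorted iteration that TreeMap provides, not how the walker reads depth from a VariantContext.

```java
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration; the input depths are made up.
class CoverageHistogramSketch {
    /** Tallies how many times each depth occurs; keys iterate in ascending order. */
    static TreeMap<Integer, Integer> buildHistogram(int[] depths) {
        TreeMap<Integer, Integer> histogram = new TreeMap<Integer, Integer>();
        for (int depth : depths) {
            Integer seen = histogram.get(depth);
            histogram.put(depth, seen == null ? 1 : seen + 1);
        }
        return histogram;
    }

    public static void main(String[] args) {
        int[] depths = {30, 12, 30, 8, 12, 30};
        // Two tab-separated columns: coverage level, then number of observations.
        for (Map.Entry<Integer, Integer> entry : buildHistogram(depths).entrySet()) {
            System.out.println(entry.getKey() + "\t" + entry.getValue());
        }
    }
}
```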
  • 69. Example 4: Compute depth of coverage from a VCF file. Step 6: Compile and run: ant dist && java -jar dist/GenomeAnalysisTK.jar -T ComputeCoverageFromVCF -R /lrlhps/apps/gatk-lilly/resources/ucsc.hg19.fasta -B:variant,VCF /lrlhps/apps/gatk-lilly/testdata/test.chr21.analysis_ready.vcf -o histogram.txt
  • 70. Example 4: Compute depth of coverage from a VCF file. Two columns of information are printed. The first column is the coverage level; the second is the number of times that coverage level was observed.
  • 71. Example 5: Find variants unique to a single sample For our last example, weʼll write a simple program that can take an input VCF and write a new VCF containing only variants that are exclusive to one sample.& Weʼll also introduce the initialize() method, which can be used to prepare the environment for the computation.&
  • 72. Example 5: Find variants unique to a single sample 1! Create a new RodWalker called FindExclusiveVariants that has a command-line argument called “sample” (aka “sn”) of type String.& Add an output stream, but rather than be of type PrintStream, make it of type VCFWriter. Weʼll use this to output a new VCF file based on the input VCF.&
  • 73. Example 5: Find variants unique to a single sample. Step 2: The initialize() method is called first, before any of the map() or reduce() calls are made. It is useful for preparing the environment, writing headers, setting up variables, etc. Here, we'll write a VCF header to the output stream. While we're free to add or remove header lines and samples, we'll just copy the input file's header to the output file.
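The call order described above can be illustrated with a toy traversal in plain Java. This is not the GATK engine: the MiniWalker interface, its simplified method signatures, and the record strings are all hypothetical stand-ins, chosen only to show that initialize() runs once before any map() or reduce() call and that onTraversalDone() runs once at the end.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of a walker's lifecycle; every name here is hypothetical.
class WalkerLifecycleSketch {
    interface MiniWalker {
        void initialize();                          // one-time setup, e.g. write a header
        Integer map(String record);                 // called once per record
        Integer reduceInit();                       // starting value for the reduction
        Integer reduce(Integer value, Integer sum); // fold each map() result into the sum
        void onTraversalDone(Integer result);       // one-time wrap-up
    }

    /** Drives a counting walker over the records and returns the calls it logged. */
    static List<String> traverse(List<String> records) {
        final List<String> log = new ArrayList<String>();
        MiniWalker walker = new MiniWalker() {
            public void initialize() { log.add("initialize: write header"); }
            public Integer map(String record) { log.add("map: " + record); return 1; }
            public Integer reduceInit() { return 0; }
            public Integer reduce(Integer value, Integer sum) { return sum + value; }
            public void onTraversalDone(Integer result) {
                log.add("onTraversalDone: " + result + " records");
            }
        };

        walker.initialize();                 // always first
        Integer sum = walker.reduceInit();
        for (String record : records) {
            sum = walker.reduce(walker.map(record), sum);
        }
        walker.onTraversalDone(sum);         // always last
        return log;
    }

    public static void main(String[] args) {
        for (String line : traverse(Arrays.asList("chr21:100", "chr21:200"))) {
            System.out.println(line);
        }
    }
}
```

Writing the header in initialize() rather than in map() guarantees it is emitted exactly once and before any record.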
  • 74. Example 5: Find variants unique to a single sample. Step 3: Loop over each record in the VCF, and over each Genotype object contained within the VariantContext object. Check the genotypes of each sample and, if only our sample of interest is variant, output the record to the new VCF file.
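The exclusivity test at the heart of this step can be sketched in plain Java. This is an illustration under assumptions, not the walker's actual code: each record is reduced to a map from sample name to a flag saying whether that sample's genotype is variant, the class name ExclusiveVariantCheck and the sample name 113T are hypothetical, and a sample that is missing from the map is treated as not variant.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the per-record check; no GATK classes are used.
class ExclusiveVariantCheck {
    /**
     * Returns true only if 'sample' is variant and every other sample is not.
     * variantBySample: sample name -> true if that sample's genotype is variant.
     */
    static boolean isExclusiveTo(String sample, Map<String, Boolean> variantBySample) {
        // The sample of interest must itself carry the variant.
        if (!Boolean.TRUE.equals(variantBySample.get(sample))) {
            return false;
        }
        // No other sample may carry it.
        for (Map.Entry<String, Boolean> entry : variantBySample.entrySet()) {
            boolean isOther = !entry.getKey().equals(sample);
            if (isOther && Boolean.TRUE.equals(entry.getValue())) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, Boolean> record = new HashMap<String, Boolean>();
        record.put("113N", true);
        record.put("113T", false);   // hypothetical second sample
        System.out.println(isExclusiveTo("113N", record)); // true
        System.out.println(isExclusiveTo("113T", record)); // false
    }
}
```

In the walker, a record for which this check passes would be written to the VCFWriter; all others would be skipped.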
  • 75. Example 5: Find variants unique to a single sample. Step 4: Compile and run: ant dist && java -jar dist/GenomeAnalysisTK.jar -T FindExclusiveVariants -R /lrlhps/apps/gatk-lilly/resources/ucsc.hg19.fasta -B:variant,VCF /lrlhps/apps/gatk-lilly/testdata/test.chr21.analysis_ready.vcf -sn 113N -o 113.exclusive.vcf
  • 76. Example 5: Find variants unique to a single sample. Step 5: After the program completes, look at the output.
  • 77. Example 5: Find variants unique to a single sample. Step 6: You can scroll left and right with the arrow keys, but let's clean up the output to make it easier to read. Supply this command instead: grep -v '##' 113.exclusive.vcf | cut -f1-7,10- | head -10 | column -t | less -S
  • 78. Example 5: Find variants unique to a single sample. Observe how the third sample is variant and the other three samples are not. Our program selects only the variants that are exclusive to 113N.
  • 79. Conclusions
      •  From the five example programs, we have learned how to:
          –  configure IntelliJ for GATK development
          –  create a new ReadWalker or RodWalker
          –  declare output streams (PrintStream, SAMFileWriter, VCFWriter)
          –  access and modify metadata in reads
          –  access variants, samples, and metadata from a VCF file
          –  declare command-line arguments
          –  prepare for computations with the initialize() method
          –  finish computations with the onTraversalDone() method
          –  compile and run new GATK programs
      •  This tutorial is more than enough to get started with writing new and useful GATK programs
          –  Our FixReadNames, ComputeCoverageFromVCF, and FindExclusiveVariants walkers are fully realized programs, ready to be used for real work.
          –  You now have enough information to write your own somatic variant finder.
  • 80. Additional resources
      •  For more information on developing in the GATK and Java, see
          –  http://www.broadinstitute.org/gsa/wiki/index.php/GATK_Development
          –  http://download.oracle.com/javase/tutorial/java/index.html
      •  Explore the GATK Git repository at
          –  https://github.com/broadgsa
          –  https://github.com/signup/free (to add your own code, sign up for a free account)
      •  To learn Git, the codebase's version control system, see
          –  http://gitref.org/
          –  http://git-scm.com/course/svn.html (for those already familiar with SVN)
      •  Read our papers on the GATK framework and tools
          –  http://genome.cshlp.org/content/20/9/1297.long
          –  http://www.nature.com/ng/journal/v43/n5/abs/ng.806.html
      •  For more guidance, feel free to look at other programs in the GATK
          –  Every program is a tutorial!