Sequencing data analysis
Workshop – part 2 / mapping to a reference genome


            Previously in this workshop…

      Mapping to a reference genome – the steps

    Mapping to a reference genome – the workshop

                  Maté Ongenaert
Previously in this workshop…
Introduction – the real cost of sequencing
Previously in this workshop…
  The workflow of NGS data analysis
                            Data analysis

                 Raw machine reads… What’s next?

                Preprocessing (machine/technology)
                 - adaptors, indexes, conversions,…
                 - machine/technology dependent

              Reads with associated qualities (universal)
                              - FASTQ
                            - QC check

            Depending on application (generally applicable)
        - ‘de novo’ assembly of genome (bacterial genomes,…)
         - Mapping to a reference genome >> mapped reads
                          - SAM/BAM/…

             High-level analysis (specific for application)
                            - SNP calling
                           - Peak calling
Previously in this workshop…
                                     Main data formats
                                     Raw sequence reads:

- Represent the sequence ~ FASTA

- Extension: represent the quality, per base ~ FASTQ – Q for quality
Score ~ Phred ~ ASCII character: Phred score + 33 = Sanger encoding

- Machine and platform independent and compressed: SRA (NCBI)
Get the original FASTQ file using the SRA Toolkit (NCBI)
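To make the Phred + 33 encoding concrete, here is a minimal Python sketch that decodes a Sanger-encoded quality string into per-base scores and error probabilities. The FASTQ record and the function names are made up for illustration; they do not come from the workshop data.

```python
def phred33_to_scores(quality_string):
    """Convert a Sanger (Phred+33) FASTQ quality string to integer scores."""
    return [ord(char) - 33 for char in quality_string]


def error_probability(phred_score):
    """Phred definition: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-phred_score / 10)


# Hypothetical four-line FASTQ record (name, sequence, separator, qualities).
record = ["@read_1", "GATTACA", "+", "II5+!I?"]
scores = phred33_to_scores(record[3])
print(scores)
print([round(error_probability(q), 4) for q in scores])
```

A score of 20 corresponds to an error probability of 1 in 100, a score of 30 to 1 in 1000.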
Previously in this workshop…
                                Main data formats
- Now moving to a common file format >> SAM / BAM (Sequence Alignment/Map)
- BAM: binary (read: computer-readable, indexed, compressed) ‘form’ of SAM


# QNAME: template name
# FLAG: bitwise flag
# RNAME: reference name
# POS: mapping position
# MAPQ: mapping quality
# CIGAR: CIGAR string
# RNEXT: reference name of the mate/next fragment
# PNEXT: position of the mate/next fragment
# TLEN: observed template length
# SEQ: fragment sequence
# QUAL: ASCII of Phred-scaled base quality + 33

@HD VN:1.3 SO:coordinate
@SQ SN:ref LN:45

#Alignment block
r001 163 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *
r002 0 ref 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA *
r003 0 ref 9 30 5H6M * 0 0 AGCTAA * NM:i:1
r004 0 ref 16 30 6M14N5M * 0 0 ATAGCTTCAGC *
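The CIGAR strings in the alignment block can be turned into coordinates with a few lines of Python. The sketch below assumes the standard SAM operation set (M, I, D, N, S, H, P, =, X); the function names are chosen for illustration only.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_REFERENCE = set("MDN=X")   # operations that advance along the reference
CONSUMES_QUERY = set("MIS=X")       # operations that use up bases of SEQ


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]


def reference_span(cigar):
    """Number of reference bases covered by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_REFERENCE)


def query_length(cigar):
    """Expected length of the SEQ field."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_QUERY)


def end_position(pos, cigar):
    """1-based, inclusive end position on the reference."""
    return pos + reference_span(cigar) - 1


# r001 from the alignment block: POS 7, CIGAR 8M2I4M1D3M
print(parse_cigar("8M2I4M1D3M"))
print(end_position(7, "8M2I4M1D3M"), query_length("8M2I4M1D3M"))
```

For r001 the query length equals the length of its SEQ field, which is a quick sanity check on any SAM record.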
Previously in this workshop…
                                         Main data formats
- BED files (location / annotation / scores): Browser Extensible Data
Used for mapping / annotation / peak locations; extension: bigBed (binary)
# chr
# start
# end
# name
# score
# strand

track   name=pairedReads description="Clone Paired Reads" useScore=1
#chr    start end name score strand
chr22   1000 5000 cloneA 960 +
chr22   2000 6000 cloneB 900 -
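A small Python sketch for reading such BED lines. It relies on the BED convention that start is 0-based and end is exclusive, so the feature length is simply end minus start. The record type and function names are illustrative.

```python
from collections import namedtuple

BedRecord = namedtuple("BedRecord", "chrom start end name score strand")


def parse_bed6(line):
    """Parse the first six whitespace-separated BED columns."""
    chrom, start, end, name, score, strand = line.split()[:6]
    return BedRecord(chrom, int(start), int(end), name, int(score), strand)


def feature_length(record):
    """BED start is 0-based and end is exclusive, so length = end - start."""
    return record.end - record.start


def one_based_interval(record):
    """The same interval written as 1-based, inclusive coordinates."""
    return record.start + 1, record.end


clone = parse_bed6("chr22   1000 5000 cloneA 960 +")
print(clone.name, feature_length(clone), one_based_interval(clone))
```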

- BEDGraph files (location, combined with score)
Used to represent peak scores
track type=bedGraph name="BedGraph Format" description="BedGraph format"
visibility=full color=200,100,0 altColor=0,100,200 priority=20
#chr start    end      score
chr19 59302000 59302300 -1.0
chr19 59302300 59302600 -0.75
chr19 59302600 59302900 -0.50
Previously in this workshop…
                                       Main data formats
- WIG files (location / annotation / scores): wiggle
Used to visualize or summarize data, in most cases count data or normalized count
data (RPKM); extension: bigWig (binary version, often used in GEO for ChIP-seq peaks)

browser position chr19:59304200-59310700
browser hide all

#150 base wide bar graph at arbitrarily spaced positions,
#threshold line drawn at y=11.76
#autoScale off viewing range set to [0:25]
#priority = 10 positions this as the first graph

track type=wiggle_0 name="variableStep" description="variableStep format"
visibility=full autoScale=off viewLimits=0.0:25.0 color=50,150,255
yLineMark=11.76 yLineOnOff=on priority=10
variableStep chrom=chr19 span=150
59304701 10.0
59304901 12.5
59305401 15.0
59305601 17.5
59305901 20.0
59306081 17.5
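The variableStep block above can be expanded into explicit intervals. This is a minimal Python sketch that handles only variableStep data (not fixedStep) and assumes the WIG convention of 1-based positions; the function name is illustrative.

```python
def parse_variable_step(lines):
    """Expand variableStep WIG data into (chrom, start, end, value) tuples.

    Coordinates are 1-based and inclusive; 'span' bases share one value.
    """
    chrom, span, intervals = None, 1, []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue  # skip comments and browser/track configuration lines
        if line.startswith("variableStep"):
            fields = dict(item.split("=", 1) for item in line.split()[1:])
            chrom = fields["chrom"]
            span = int(fields.get("span", 1))
            continue
        position, value = line.split()
        start = int(position)
        intervals.append((chrom, start, start + span - 1, float(value)))
    return intervals


wig_lines = [
    "variableStep chrom=chr19 span=150",
    "59304701 10.0",
    "59304901 12.5",
]
for interval in parse_variable_step(wig_lines):
    print(interval)
```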
Previously in this workshop…
                                    Main data formats
- GFF format (General Feature Format) or GTF
Used for annotation of genetic / genomic features – such as all coding genes in Ensembl
Often used in downstream analysis to assign annotation to regions / peaks / …

# seqname (the name of the sequence)
# source (the program that generated this feature)
# feature (the name of this type of feature – for example: exon)
# start (the starting position of the feature in the sequence)
# end (the ending position of the feature)
# score (a score between 0 and 1000)
# strand (valid entries include '+', '-', or '.')
# frame (if the feature is a coding exon, frame should be a number between
0-2 that represents the reading frame of the first base. If the feature is
not a coding exon, the value should be '.'.)
# group (all lines with the same group are linked together into a single

track name=regulatory description="TeleGene(tm)    Regulatory Regions"
#chr   source   feature   start    end   score    strand frame group
chr22 TeleGene enhancer 1000000 1001000 500        + . touch1
chr22 TeleGene promoter 1010000 1010100 900        + . touch1
chr22 TeleGene promoter 1020000 1020000 800        - . touch2
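When GFF annotation is combined with BED-based tools, the coordinate systems differ: GFF positions are 1-based and inclusive, BED positions are 0-based with an exclusive end. The Python sketch below converts one GFF line to BED-style fields. Using the group as the BED name is an arbitrary choice made for this illustration.

```python
def gff_to_bed6(line):
    """Convert one 9-column GFF line into BED-style fields.

    GFF start is 1-based, so it is shifted by one; the end stays the same
    because the BED end is exclusive.
    """
    fields = line.split(None, 8)  # the 9th column (group) may contain spaces
    seqname, _source, _feature, start, end, score, strand, _frame, group = fields
    return (seqname, int(start) - 1, int(end), group.strip(), score, strand)


print(gff_to_bed6("chr22 TeleGene enhancer 1000000 1001000 500 + . touch1"))
```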
Previously in this workshop…
                                     Main data formats
- VCF format (Variant Call Format)
For representing variants (SNPs, indels)
Previously in this workshop…
                                  Main data formats

- UCSC browser data formats, including the most commonly used formats that are
  widely accepted

- In addition, ENCODE data formats (narrowPeak / broadPeak)
Mapping to a reference genome
                                      The workflow

Aligning the raw sequence reads to a reference genome by using an indexing strategy and
an alignment algorithm, taking into account the quality scores and specific conditions

- Raw sequence reads with quality scores: FASTQ
- Reference genome: FASTA files can be downloaded (UCSC/Ensembl)

- Sequence reads <> reference genome: alignment
- To perform an efficient alignment, an indexing strategy is used
- For instance (BWA/Bowtie): FM indexes (based on the Burrows-Wheeler transform) on the
  reference genome and/or the sequence reads

- Specific conditions: single-end or paired-end; how many mismatches allowed; trade-off
  speed/accuracy/specificity; local re-alignment afterwards for improved indel calling; …

>> Result: mapped sequence reads: chr / start / end / quality >> SAM file (>> BAM)
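To illustrate the indexing idea, here is a toy FM index in Python: it builds the Burrows-Wheeler transform of a short reference through a naively sorted suffix array and finds exact matches by backward search. It is a teaching sketch only. BWA and Bowtie use far more compact data structures and also allow mismatches and gaps; all names below are illustrative.

```python
def build_fm_index(reference):
    """Build a toy FM index (suffix array, first-column offsets, occurrence table)."""
    text = reference + "$"  # '$' marks the end and sorts before A, C, G, T
    # Naive suffix array: sort the suffix start positions (fine for toy input).
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # The BWT character of each suffix is the character that precedes it.
    bwt = "".join(text[i - 1] for i in sa)
    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    # first_row[c]: number of characters in the text that are smaller than c.
    first_row, total = {}, 0
    for ch in sorted(counts):
        first_row[ch] = total
        total += counts[ch]
    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, bwt_char in enumerate(bwt):
        for ch in counts:
            occ[ch][i + 1] = occ[ch][i] + (1 if bwt_char == ch else 0)
    return sa, first_row, occ


def find_exact(read, index):
    """Backward search: return the 1-based start positions of exact matches."""
    sa, first_row, occ = index
    low, high = 0, len(sa)  # half-open range of suffix array rows
    for ch in reversed(read):
        if ch not in first_row:
            return []
        low = first_row[ch] + occ[ch][low]
        high = first_row[ch] + occ[ch][high]
        if low >= high:
            return []
    return sorted(sa[row] + 1 for row in range(low, high))


index = build_fm_index("GATTACAGATTACA")
print(find_exact("ATTA", index))
print(find_exact("GGG", index))
```

Each read character is processed once, so the search time depends on the read length and not on the genome size, which is why the index only has to be built once per reference.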
Mapping to a reference genome
                                       The workflow
The reference genome

- Sequences (human, rat, mouse, …) can be downloaded from UCSC (Golden Path) or Ensembl
- Difficulty: download in 2bit format (needs a converter) >> FASTA files (.fa)
- Needs to be indexed by the mapping program you are going to use

- BWA: bwa index
- Bowtie: bowtie-build (pre-computed indexes available)

- BWA example:

bwa index [-p prefix] [-a algoType] [-c] <in.db.fasta>
Index database sequences in the FASTA format.

-c         Build color-space index. The input FASTA should be in nucleotide space.
-p STR     Prefix of the output database [same as db filename]
-a STR     Algorithm for constructing BWT index. Available options are:
           is      IS linear-time algorithm for constructing suffix array.
                   It requires 5.37N memory where N is the size of the database.
           bwtsw   Algorithm implemented in BWT-SW. This method works with the whole human genome
Mapping to a reference genome
                                     The workflow
The sequencing reads

- Sequence reads with quality scores: FASTQ files from the machine
- Depending on the mapping program, need to be indexed as well

- BWA: converts reads to SA coordinates (Suffix Array) based on the reference genome
- Bowtie: not needed: indexing and aligning in one step

- BWA:
- Index reference genome
- Index sequence reads (INPUT: FASTQ and REF. GENOME ) >> SA coordinates (OUTPUT:
Mapping to a reference genome
                                       The workflow
aln        bwa aln [-n][-o][-e][-d][-i][-k][-l][-t][-cRN][-M][-O][-E][-q]
            <in.db.fasta> <in.query.fq> > <out.sai>

Find the SA coordinates of the input reads.
Maximum maxSeedDiff differences are allowed in the first seedLen subsequence and
maximum maxDiff differences are allowed in the whole sequence.

-n NUM     Maximum edit distance if the value is INT, or the fraction of missing
           alignments given 2% uniform base error rate if FLOAT
-o INT     Maximum number of gap opens
-e INT     Maximum number of gap extensions, -1 for k-difference mode
-d INT     Disallow a long deletion within INT bp towards the 3’-end
-i INT     Disallow an indel within INT bp towards the ends [5]
-l INT     Take the first INT subsequence as seed.
-k INT     Maximum edit distance in the seed
-t INT     Number of threads (multi-threading mode)
-M INT     Mismatch penalty
-O INT     Gap open penalty
-E INT     Gap extension penalty
-R INT     Proceed with suboptimal alignments
-c         Reverse query but not complement it
-N         Disable iterative search.
-q INT     Parameter for read trimming.
-I         The input is in the Illumina 1.3+ read format (quality equals ASCII-64)
-B INT     Length of barcode starting from the 5’-end.
-b         Specify the input read sequence file is the BAM format.
-0         When -b is specified, only use single-end reads in mapping.
-1         When -b is specified, only use the first read in a read pair in mapping
-2         When -b is specified, only use the second read in a read pair in mapping
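The -I option exists because Illumina 1.3+ files store qualities as Phred + 64 instead of the Sanger Phred + 33. The conversion itself is a fixed shift of 31 ASCII positions, as this Python sketch shows (the function name is illustrative):

```python
def illumina64_to_sanger33(quality_string):
    """Re-encode an Illumina 1.3+ (Phred+64) quality string as Sanger (Phred+33)."""
    converted = []
    for char in quality_string:
        score = ord(char) - 64
        if score < 0:
            raise ValueError(f"{char!r} is not a valid Phred+64 quality character")
        converted.append(chr(score + 33))
    return "".join(converted)


print(illumina64_to_sanger33("hhhhB@"))
```

Feeding Phred + 64 data to a tool that expects Phred + 33 makes every base look 31 quality points better than it is, so it pays to check the encoding before mapping.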
Mapping to a reference genome
                                       The workflow
samse      bwa samse [-n maxOcc] <in.db.fasta> <in.sai> <in.fq> > <out.sam>
Generate alignments in the SAM format given single-end reads
Repetitive hits will be randomly chosen.

-n INT     Maximum number of alignments to output in the XA tag for reads paired properly.
-r STR     Specify the read group in a format like ‘@RG\tID:foo\tSM:bar’

sampe      bwa sampe [-a][-o][-n][-N][-P] <in.db.fasta>
           <in1.sai> <in2.sai> <in1.fq> <in2.fq> > <out.sam>
Generate alignments in the SAM format given paired-end reads.
Repetitive read pairs will be placed randomly.

-a INT     Maximum insert size for a read pair to be considered being mapped properly.
-o INT     Maximum occurrences of a read for pairing.
-P         Load the entire FM-index into memory to reduce disk operations
-n INT     Maximum number of alignments to output in the XA tag for reads paired properly
-N INT     Maximum number of alignments to output in the XA tag for discordant read pairs
-r STR     Specify the read group in a format like ‘@RG\tID:foo\tSM:bar’
Mapping to a reference genome
                                       The workshop
Mapping using BWA

bwa-0.5.9 aln -t 4 /opt/genomes/GRCh37/index/bwa/GRCh37 SRR058523.fastq > SRR058523.sai

bwa-0.5.9: BWA and its version
aln: alignment functionality of BWA
-t 4: use 4 threads (CPU cores) at the same time to speed up
/opt/genomes/GRCh37/index/bwa/GRCh37: location of the reference genome index
SRR058523.fastq: FASTQ file to align to the reference
>: redirects the output to a file
SRR058523.sai: the output file (suffix array coordinates of the reads)

Maps the input sequences (FASTQ) to the reference genome index >> output: SA coordinates
 of the reads

No ‘real’ genomic mapping yet, so a next step is needed…
Mapping to a reference genome
                                       The workshop
Mapping using BWA

bwa-0.5.9 samse /opt/genomes/GRCh37/index/bwa/GRCh37 SRR058523.sai SRR058523.fastq |
samtools-0.1.18 view -bhSo PHF8-unsorted.bam -

bwa-0.5.9: BWA and its version
samse: single-end mapping and output in SAM format
/opt/genomes/GRCh37/index/bwa/GRCh37: location of the reference genome index
SRR058523.sai: the reads index (SA coordinates from bwa aln)
SRR058523.fastq: the raw reads and quality scores

On its own, this would output a SAM file (> SRR058523.sam, for instance)
But we don’t need the SAM file, we would like a BAM file >> processing by samtools

| is the ‘pipe’ symbol: hands over the output of one command to the next
samtools-0.1.18: samtools and its version
view: the command to process SAM files
-b: output BAM; -h: print the headers; -S: input is SAM; -o: output file name
PHF8-unsorted.bam: output file name
-: read the input from standard input (here: the SAM output piped in from bwa samse)
Mapping to a reference genome
                                        The workshop
Mapping using BWA

bwa-0.5.9 aln -t 4 /opt/genomes/GRCh37/index/bwa/GRCh37 SRR058523.fastq > SRR058523.sai

bwa-0.5.9 samse /opt/genomes/GRCh37/index/bwa/GRCh37 SRR058523.sai SRR058523.fastq |
samtools-0.1.18 view -bhSo PHF8-unsorted.bam -

Two-step process in BWA

Next steps: process the BAM file >> sort and index it (using samtools)

samtools-0.1.18 sort PHF8-unsorted.bam PHF8-sorted

Creates a sorted BAM file (PHF8-sorted.bam)
samtools-0.1.18 index PHF8-sorted.bam

Indexes the sorted BAM file (and creates a BAM index file: PHF8-sorted.bam.bai)
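The commands on this slide can be chained from a script. The Python sketch below only assembles the argument lists and prints them; run_mapping shows one possible way to execute them with subprocess and needs the bwa and samtools binaries to be installed. The function names and the output naming scheme are assumptions made for this illustration, and the sort call uses the output-prefix syntax of the samtools 0.1.x series shown in this workshop.

```python
import subprocess


def plan_mapping(index_prefix, fastq, sample, threads=4,
                 bwa="bwa-0.5.9", samtools="samtools-0.1.18"):
    """Return the .sai path and the argument lists for each mapping step."""
    sai = fastq.rsplit(".", 1)[0] + ".sai"
    unsorted_bam = sample + "-unsorted.bam"
    sorted_prefix = sample + "-sorted"
    return {
        "sai": sai,
        "aln": [bwa, "aln", "-t", str(threads), index_prefix, fastq],
        "samse": [bwa, "samse", index_prefix, sai, fastq],
        "view": [samtools, "view", "-bhSo", unsorted_bam, "-"],
        "sort": [samtools, "sort", unsorted_bam, sorted_prefix],
        "index": [samtools, "index", sorted_prefix + ".bam"],
    }


def run_mapping(plan):
    """Execute the plan (requires bwa and samtools to be installed)."""
    with open(plan["sai"], "wb") as sai_file:  # same as: bwa aln ... > reads.sai
        subprocess.run(plan["aln"], stdout=sai_file, check=True)
    samse = subprocess.Popen(plan["samse"], stdout=subprocess.PIPE)
    try:
        # Same as the pipe: bwa samse ... | samtools view -bhSo out.bam -
        subprocess.run(plan["view"], stdin=samse.stdout, check=True)
    finally:
        samse.stdout.close()
    if samse.wait() != 0:
        raise RuntimeError("bwa samse failed")
    subprocess.run(plan["sort"], check=True)
    subprocess.run(plan["index"], check=True)


plan = plan_mapping("/opt/genomes/GRCh37/index/bwa/GRCh37", "SRR058523.fastq", "PHF8")
for step in ("aln", "samse", "view", "sort", "index"):
    print(" ".join(plan[step]))
```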
Mapping to a reference genome
                                         The workshop
BAM: what’s next?

So, now we have the sorted and indexed BAM file – what’s next?

This file is the starting point for all other analysis, depending on the application:

ChIP-seq: peak calling
SNP calling
RNA-seq: calculate gene-expression levels of the transcripts / find splice variants

What are the first things?
- Visualize it (IGV can load BAM files)
- First downstream analysis: QC and basic statistics (how many mapped reads, quality
  distribution, distribution across chromosomes,…)
Mapping to a reference genome
                                        The workshop
First downstream analysis

- QC and basic statistics (how many mapped reads, quality distribution, distribution
  across chromosomes, information on paired-end reads,…)

/opt/samstat/samstat PHF8-sorted.bam

- Outputs a HTML file with statistics
Mapping to a reference genome
                                                The workshop
First downstream analysis

- QC and basic statistics (how many mapped reads, quality distribution, distribution
  across chromosomes, information on paired-end reads,…)

BamUtil (stats)

bam stats --in PHF8-sorted.bam --basic --phred --baseSum

Number of records read = 15732744

TotalReads(e6)   15.73
MappedReads(e6) 15.04
PairedReads(e6) 15.73
ProperPair(e6)   14.65
DuplicateReads(e6)                  0.00
QCFailureReads(e6)                  0.00

MappingRate(%)   95.59
PairedReads(%)   100.00
ProperPair(%)    93.11
DupRate(%)       0.00
QCFailRate(%)    0.00

TotalBases(e6)   802.37
BasesInMappedReads(e6)              766.95

Quality          Count
33               0
34               0
35               71373
36               0
37               0
38               203544
39               403649
40               921714
41               2081099
42               1974615
43               2285826
Mapping to a reference genome
                                       The workshop
First downstream analysis

- QC and basic statistics (how many mapped reads, quality distribution, distribution
  across chromosomes, information on paired-end reads,…)

samtools-0.1.18 idxstats PHF8-sorted.bam

1      249250621        503714   0
2      243199373        345217   0
3      198022430        273477   0
4      191154276        229016   0
5      180915260        360339   0
6      171115067        257468   0
7      159138663        269704   0
8      146364022        242656   0
9      141213431        203505   0
10     135534747        237496   0
11     135006516        218116   0
12     133851895        231426   0
13     115169878        106831   0
14     107349540        119062   0
15     102531392        141351   0
16     90354753         183004   0
17     81195210         187024   0
18     78077248         86101    0
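The idxstats columns are: reference name, sequence length, number of mapped reads, number of unmapped reads. Raw counts favour long chromosomes, so a quick normalization is mapped reads per megabase. A minimal Python sketch, using three rows copied from the output above (function names are illustrative):

```python
def parse_idxstats(text):
    """Parse 'samtools idxstats' output into (name, length, mapped, unmapped) rows."""
    rows = []
    for line in text.strip().splitlines():
        name, length, mapped, unmapped = line.split()
        rows.append((name, int(length), int(mapped), int(unmapped)))
    return rows


def reads_per_megabase(rows):
    """Mapped reads per million bases of each reference sequence."""
    return {name: mapped / (length / 1e6)
            for name, length, mapped, _unmapped in rows if length > 0}


idxstats_output = """
1      249250621        503714   0
2      243199373        345217   0
13     115169878        106831   0
"""
rows = parse_idxstats(idxstats_output)
for name, density in reads_per_megabase(rows).items():
    print(name, round(density, 1))
```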
Mapping to a reference genome
                                     The workshop
First downstream analysis

- Think about PCR duplicates >> you may want to remove them (or set a ‘flag’ in the BAM
  file, indicating that a read is a duplicate)
- samtools rmdup or Picard MarkDuplicates

- Find out how these tools work and what other flags are used in BAM files
- Can you make statistics with the BAM flags?
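As a starting point for the flag question: the FLAG field is a sum of bit values defined in the SAM specification, so it can be decoded with bitwise AND. The Python sketch below decodes single flags and counts properties over a list of flags. The short property names and the example flag list are made up for illustration; in practice the flags would come from the second column of each SAM record.

```python
from collections import Counter

# Bit values as defined in the SAM specification; the short names are our own.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",  # added in later versions of the specification
}


def decode_flag(flag):
    """Return the names of all bits that are set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


def flag_statistics(flags):
    """Count how often each property occurs in a collection of FLAG values."""
    counts = Counter()
    for flag in flags:
        counts.update(decode_flag(flag))
    return counts


print(decode_flag(163))  # the flag of read r001 in the SAM example
print(flag_statistics([163, 83, 1187, 4]))
```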
Mapping to a reference genome
                                     The workshop
Mapping – now let’s start!

- Mapping is only the starting point for most downstream analysis tools
- Depends on the application and what you want to do:

    - Exome sequencing / whole genome sequencing: SNP calling (samtools): based on
      mapping quality / coverage >> identification of SNPs (VCF output format)

    - ChIP-seq: peak calling: based on coverage of ChIP and input, enriched regions are
      identified (BED output, BEDGraph and/or WIG files)

    - RNA-seq: assign reads to the transcripts, normalize (length of exon and number of
      reads in the sequencing library = RPKM) >> (relative) expression levels >>
      identification of differentially expressed genes
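The RPKM normalization mentioned for RNA-seq (reads per kilobase of exon per million mapped reads) is a one-line formula. A minimal Python sketch with made-up numbers:

```python
def rpkm(read_count, feature_length_bp, total_mapped_reads):
    """Reads per kilobase of feature per million mapped reads.

    RPKM = reads * 1e9 / (feature length in bp * total mapped reads)
    """
    if feature_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("feature length and total mapped reads must be positive")
    return read_count * 1e9 / (feature_length_bp * total_mapped_reads)


# Hypothetical gene: 500 reads on 2,000 bp of exon, 10 million mapped reads in the library.
print(rpkm(500, 2000, 10_000_000))
```

Dividing by the feature length makes long and short genes comparable; dividing by the library size makes samples with different sequencing depths comparable.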
