Sequencing data analysis
Workshop – part 2 / mapping to a reference genome



                      Outline

            Previously in this workshop…

      Mapping to a reference genome – the steps

    Mapping to a reference genome – the workshop




                  Maté Ongenaert
Previously in this workshop…
Introduction – the real cost of sequencing
Previously in this workshop…
  The workflow of NGS data analysis
                            Data analysis

                 Raw machine reads… What’s next?

                Preprocessing (machine/technology)
                 - adaptors, indexes, conversions,…
                 - machine/technology dependent

              Reads with associated qualities (universal)
                              - FASTQ
                            - QC check

            Depending on application (general applicable)
        - ‘de novo’ assembly of genome (bacterial genomes,…)
         - Mapping to a reference genome >> mapped reads
                          - SAM/BAM/…

             High-level analysis (specific for application)
                            - SNP calling
                           - Peak calling
Previously in this workshop…
                                     Main data formats
                                     Raw sequence reads:

- Represent the sequence ~ FASTA
  >SEQUENCE_IDENTIFIER
  GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT


- Extension: represent the quality, per base ~ FASTQ (Q for quality)
Score ~ Phred quality, stored as one ASCII character per base: Phred + 33 = Sanger encoding
  @SEQUENCE_IDENTIFIER
  GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
  +
  !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
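The Phred + 33 encoding can be illustrated with a minimal Python sketch that decodes the quality line of the FASTQ record above. The function names are illustrative choices, not part of any tool mentioned in these slides.

```python
def phred33_to_scores(quality):
    """Convert a Sanger (Phred+33) quality string to integer Phred scores."""
    return [ord(ch) - 33 for ch in quality]


def error_probability(q):
    """Probability that a base call with Phred score q is wrong: 10^(-q/10)."""
    return 10 ** (-q / 10)


# Quality line of the FASTQ example record
quality = "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65"
scores = phred33_to_scores(quality)

print(scores[:4])             # [0, 6, 6, 9]
print(error_probability(20))  # Phred 20 corresponds to a 1% error probability
```

A score of 0 (the character '!') means the base call carries no confidence at all; each increase of 10 lowers the error probability tenfold.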



- Machine and platform independent and compressed: SRA (NCBI)
Get the original FASTQ file using the SRA Toolkit (NCBI)
Previously in this workshop…
                                Main data formats
- Now moving to a common file format >> SAM / BAM (Sequence Alignment/Map)
- BAM: binary (read: computer-readable, indexed, compressed) ‘form’ of SAM

DESCRIPTION OF THE 11 FIELDS IN THE ALIGNMENT SECTION

# QNAME: template name
# FLAG: bitwise flag
# RNAME: reference name
# POS: mapping position
# MAPQ: mapping quality
# CIGAR: CIGAR string
# RNEXT: reference name of the mate/next fragment
# PNEXT: position of the mate/next fragment
# TLEN: observed template length
# SEQ: fragment sequence
# QUAL: ASCII of Phred-scaled base quality + 33

#Headers
@HD VN:1.3 SO:coordinate
@SQ SN:ref LN:45

#Alignment block
r001 163 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *
r002 0 ref 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA *
r003 0 ref 9 30 5H6M * 0 0 AGCTAA * NM:i:1
r004 0 ref 16 30 6M14N5M * 0 0 ATAGCTTCAGC *
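As a rough illustration of how the fields and the CIGAR string fit together, the sketch below splits an alignment line from the example above and derives the last reference position covered by the read. The helper names are made up for this example; real SAM files are tab-separated, which `split()` handles as well.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]


def reference_span(cigar):
    """Reference bases covered: M, D, N, = and X consume the reference."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")


def query_length(cigar):
    """Read bases described: M, I, S, = and X consume the read."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MIS=X")


line = "r001 163 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *"
fields = line.split()
qname, flag, rname = fields[0], int(fields[1]), fields[2]
pos, mapq, cigar, seq = int(fields[3]), int(fields[4]), fields[5], fields[9]

end = pos + reference_span(cigar) - 1  # POS is 1-based, end is inclusive
print(qname, rname, pos, end)          # r001 ref 7 22
print(query_length(cigar) == len(seq)) # True: the CIGAR accounts for every base in SEQ
```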
Previously in this workshop…
                                         Main data formats
- BED files (location / annotation / scores): Browser Extensible Data
Used for mapping / annotation / peak locations; extension: bigBed (binary)
FIELDS USED:
# chr
# start
# end
# name
# score
# strand

track   name=pairedReads description="Clone Paired Reads" useScore=1
#chr    start end name score strand
chr22   1000 5000 cloneA 960 +
chr22   2000 6000 cloneB 900 -
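A minimal sketch of reading one such BED line in Python follows. It relies on the BED convention that start is 0-based and end is exclusive, so the feature length is simply end minus start. `BedRecord` and `parse_bed6` are illustrative names, not an existing library API.

```python
from typing import NamedTuple


class BedRecord(NamedTuple):
    chrom: str
    start: int   # 0-based, inclusive
    end: int     # exclusive
    name: str
    score: int
    strand: str


def parse_bed6(line):
    """Parse the first six whitespace-separated BED columns."""
    chrom, start, end, name, score, strand = line.split()[:6]
    return BedRecord(chrom, int(start), int(end), name, int(score), strand)


rec = parse_bed6("chr22   1000 5000 cloneA 960 +")
print(rec.end - rec.start)  # 4000 bases
print(rec.start + 1)        # 1001: first base in 1-based coordinates
```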


- BEDGraph files (location, combined with score)
Used to represent peak scores
track type=bedGraph name="BedGraph Format" description="BedGraph format"
visibility=full color=200,100,0 altColor=0,100,200 priority=20
#chr start    end      score
chr19 59302000 59302300 -1.0
chr19 59302300 59302600 -0.75
chr19 59302600 59302900 -0.50
Previously in this workshop…
                                       Main data formats
- WIG files (location / annotation / scores): wiggle
Used to visualize or summarize data, in most cases count data or normalized count
data (RPKM); extension: bigWig, the binary version (often used in GEO for ChIP-seq peaks)




browser position chr19:59304200-59310700
browser hide all

#150 base wide bar graph at arbitrarily spaced positions,
#threshold line drawn at y=11.76
#autoScale off viewing range set to [0:25]
#priority = 10 positions this as the first graph

track type=wiggle_0 name="variableStep" description="variableStep format"
visibility=full autoScale=off viewLimits=0.0:25.0 color=50,150,255
yLineMark=11.76 yLineOnOff=on priority=10
variableStep chrom=chr19 span=150
59304701 10.0
59304901 12.5
59305401 15.0
59305601 17.5
59305901 20.0
59306081 17.5
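To make the relation between the formats concrete, here is a small sketch that expands variableStep wiggle data lines into bedGraph-style intervals. It assumes the usual conventions: wiggle positions are 1-based, bedGraph intervals are 0-based with an exclusive end. The function name is hypothetical.

```python
def variable_step_to_bedgraph(lines):
    """Convert variableStep wiggle lines to (chrom, start, end, value) tuples."""
    chrom, span = None, 1
    intervals = []
    for line in lines:
        line = line.strip()
        # Skip blank lines, comments and browser/track declarations
        if not line or line.startswith(("#", "track", "browser")):
            continue
        if line.startswith("variableStep"):
            params = dict(item.split("=", 1) for item in line.split()[1:])
            chrom = params["chrom"]
            span = int(params.get("span", 1))
            continue
        pos, value = line.split()
        start0 = int(pos) - 1  # 1-based wiggle position to 0-based start
        intervals.append((chrom, start0, start0 + span, float(value)))
    return intervals


wig = [
    "variableStep chrom=chr19 span=150",
    "59304701 10.0",
    "59304901 12.5",
]
for interval in variable_step_to_bedgraph(wig):
    print(*interval)
```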
Previously in this workshop…
                                    Main data formats
- GFF format (General Feature Format) or GTF
Used for annotation of genetic / genomic features – such as all coding genes in Ensembl
Often used in downstream analysis to assign annotation to regions / peaks / …
FIELDS USED:

# seqname (the name of the sequence)
# source (the program that generated this feature)
# feature (the name of this type of feature – for example: exon)
# start (the starting position of the feature in the sequence)
# end (the ending position of the feature)
# score (a score between 0 and 1000)
# strand (valid entries include '+', '-', or '.')
# frame (if the feature is a coding exon, frame should be a number between
0-2 that represents the reading frame of the first base. If the feature is
not a coding exon, the value should be '.'.)
# group (all lines with the same group are linked together into a single
item)

track name=regulatory description="TeleGene(tm)    Regulatory Regions"
#chr   source   feature   start    end   score      strand frame group
chr22 TeleGene enhancer 1000000 1001000 500        + . touch1
chr22 TeleGene promoter 1010000 1010100 900        + . touch1
chr22 TeleGene promoter 1020000 1020000 800        - . touch2
Previously in this workshop…
                                     Main data formats
- VCF format (Variant Call Format)
For SNP representation
Previously in this workshop…
                                  Main data formats
- http://genome.ucsc.edu/FAQ/FAQformat.html

- UCSC browser data formats, including the most commonly used and widely accepted
  formats

- In addition, ENCODE data formats (narrowPeak / broadPeak)
Mapping to a reference genome
                                      The workflow
Mapping:

Aligning the raw sequence reads to a reference genome by using an indexing strategy and
aligning algorithm, taking into account the quality scores and with specific conditions

- Raw sequence reads with quality scores: FASTQ
- Reference genome: FASTA files can be downloaded (UCSC/Ensembl)

- Sequence reads <> reference genome: alignment
- To perform an efficient alignment, an indexing strategy is used
- For instance (BWA/Bowtie): FM indexes (based on the Burrows-Wheeler transform) on the
  reference genome and/or the sequence reads
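The idea behind an FM index can be shown on a toy scale. The sketch below builds the Burrows-Wheeler transform by sorting all rotations and counts exact matches with a backward search. It is a teaching illustration only: BWA and Bowtie use far more memory-efficient constructions and precomputed occurrence tables, and the function names here are assumptions.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (toy inputs only)."""
    text += "$"  # sentinel; sorts before A, C, G, T
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_string, pattern):
    """Count exact matches of pattern with FM-index backward search."""
    # first_row[c]: index of the first sorted rotation starting with c
    counts = {}
    for ch in bwt_string:
        counts[ch] = counts.get(ch, 0) + 1
    first_row, total = {}, 0
    for ch in sorted(counts):
        first_row[ch] = total
        total += counts[ch]

    def occ(ch, i):
        # Occurrences of ch in bwt_string[:i]; a real index precomputes this
        return bwt_string[:i].count(ch)

    lo, hi = 0, len(bwt_string)  # current range of matching rotations
    for ch in reversed(pattern):
        if ch not in first_row:
            return 0
        lo = first_row[ch] + occ(ch, lo)
        hi = first_row[ch] + occ(ch, hi)
        if lo >= hi:
            return 0
    return hi - lo


reference = "GATTACAGATTACA"
index = bwt(reference)
print(count_occurrences(index, "ATTA"))  # 2
print(count_occurrences(index, "TTT"))   # 0
```

The search processes the read from its last base to its first, narrowing a range of rows in the sorted rotations; the reference itself is never scanned, which is what makes the lookup fast.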

- Specific conditions: single-end or paired-end; how many mismatches allowed; trade-off
  speed/accuracy/specificity; local re-alignment afterwards for improved indel calling; …

>> Result: mapped sequence reads: chr / start / end / quality >> SAM file (>> BAM)
Mapping to a reference genome
                                       The workflow
The reference genome

- Sequences (human, rat, mouse, …) can be downloaded from UCSC (Golden Path) or
  Ensembl
- Difficulty: downloads may be in 2bit format (needs a converter) >> FASTA files (.fa)
- Need to be indexed by the mapping program you are going to use

- BWA: bwa index
- Bowtie: bowtie-build (pre-computed indexes available)

- BWA example:

bwa index [-p prefix] [-a algoType] [-c] <in.db.fasta>
Index database sequences in the FASTA format.

OPTIONS:
-c         Build color-space index. The input FASTA should be in nucleotide space.
-p STR     Prefix of the output database [same as db filename]
-a STR     Algorithm for constructing BWT index. Available options are:
           is       IS linear-time algorithm for constructing suffix array.
                    It requires 5.37N memory where N is the size of the database.
           bwtsw    Algorithm implemented in BWT-SW. This method works with the whole human genome.
Mapping to a reference genome
                                     The workflow
The sequencing reads

- Sequence reads with quality scores: FASTQ files from the machine
- Depending on the mapping program, need to be indexed as well

- BWA: converts reads to SA coordinates (Suffix Array) based on the reference genome
  index
- Bowtie: not needed: indexing and aligning in one step

- BWA:
  - Index the reference genome
  - Align the sequence reads (INPUT: FASTQ and REF. GENOME index) >> SA coordinates
    (OUTPUT: SAI)
  - Convert the SA coordinates (INPUT: SAI, FASTQ and REF. GENOME index) >> SAM/BAM (OUTPUT)
Mapping to a reference genome
                                       The workflow
aln        bwa aln [-n][-o][-e][-d][-i][-k][-l][-t][-cRN][-M][-O][-E][-q]
            <in.db.fasta> <in.query.fq> > <out.sai>

Find the SA coordinates of the input reads.
At most maxSeedDiff differences are allowed in the first seedLen bases (the seed), and
at most maxDiff differences are allowed in the whole sequence.

OPTIONS:
-n NUM     Maximum edit distance if the value is INT, or the fraction of missing alignments
           given a 2% uniform base error rate if FLOAT
-o INT     Maximum number of gap opens
-e INT     Maximum number of gap extensions, -1 for k-difference mode
-d INT     Disallow a long deletion within INT bp towards the 3’-end
-i INT     Disallow an indel within INT bp towards the ends [5]
-l INT     Take the first INT subsequence as seed.
-k INT     Maximum edit distance in the seed
-t INT     Number of threads (multi-threading mode)
-M INT     Mismatch penalty
-O INT     Gap open penalty
-E INT     Gap extension penalty
-R INT     Proceed with suboptimal alignments
-c         Reverse query but not complement it
-N         Disable iterative search.
-q INT     Parameter for read trimming.
-I         The input is in the Illumina 1.3+ read format (quality equals ASCII-64)
-B INT     Length of barcode starting from the 5’-end.
-b         Specify the input read sequence file is the BAM format.
-0         When -b is specified, only use single-end reads in mapping.
-1         When -b is specified, only use the first read in a read pair in mapping
-2         When -b is specified, only use the second read in a read pair in mapping
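The -I option exists because Illumina 1.3+ files store qualities as Phred + 64 rather than Phred + 33. As a sketch of what that difference means, the snippet below re-encodes a Phred + 64 quality string into the Sanger encoding; the function name is illustrative.

```python
def illumina13_to_sanger(quality):
    """Re-encode a Phred+64 quality string as Phred+33 (Sanger)."""
    scores = [ord(ch) - 64 for ch in quality]
    if any(q < 0 for q in scores):
        # Characters below '@' cannot occur in a Phred+64 string
        raise ValueError("input does not look like Phred+64 qualities")
    return "".join(chr(q + 33) for q in scores)


print(illumina13_to_sanger("hhhh"))  # IIII (Phred 40 in both encodings)
```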
Mapping to a reference genome
                                       The workflow
samse      bwa samse [-n maxOcc] <in.db.fasta> <in.sai> <in.fq> > <out.sam>
Generate alignments in the SAM format given single-end reads
Repetitive hits will be randomly chosen.

OPTIONS:
-n INT     Maximum number of alignments to output in the XA tag for reads paired properly.
-r STR     Specify the read group in a format like ‘@RG\tID:foo\tSM:bar’


sampe      bwa sampe [-a][-o][-n][-N][-P] <in.db.fasta>
           <in1.sai> <in2.sai> <in1.fq> <in2.fq> > <out.sam>
Generate alignments in the SAM format given paired-end reads.
Repetitive read pairs will be placed randomly.

OPTIONS:
-a INT     Maximum insert size for a read pair to be considered being mapped properly.
-o INT     Maximum occurrences of a read for pairing.
-P         Load the entire FM-index into memory to reduce disk operations
-n INT     Maximum number of alignments to output in the XA tag for reads paired properly
-N INT     Maximum number of alignments to output in the XA tag for discordant read pairs
-r STR     Specify the read group in a format like ‘@RG\tID:foo\tSM:bar’
Mapping to a reference genome
                                       The workshop
Mapping using BWA

bwa-0.5.9 aln -t 4 /opt/genomes/GRCh37/index/bwa/GRCh37 SRR058523.fastq > SRR058523.sai

bwa-0.5.9: BWA and its version
aln: alignment functionality of BWA
-t 4: use 4 threads (CPU cores) at the same time to speed up
/opt/genomes/GRCh37/index/bwa/GRCh37: location (prefix) of the reference genome index
SRR058523.fastq: FASTQ file to align to the reference
>: redirects the output to a file
SRR058523.sai: the output file (suffix array coordinates of the reads)

Maps the input sequences (FASTQ) to the reference genome index >> output: SA coordinates
 of the reads

This is not yet a ‘real genomic mapping’ (no chromosome / position), so a next step is needed…
Mapping to a reference genome
                                       The workshop
Mapping using BWA

bwa-0.5.9 samse /opt/genomes/GRCh37/index/bwa/GRCh37 SRR058523.sai SRR058523.fastq |
samtools-0.1.18 view -bhSo PHF8-unsorted.bam -


bwa-0.5.9: BWA and its version
samse: single-end mapping and output in SAM format
/opt/genomes/GRCh37/index/bwa/GRCh37: location (prefix) of the reference genome index
SRR058523.sai: the SA coordinates of the reads (output of bwa aln)
SRR058523.fastq: the raw reads and quality scores

On its own, this would output a SAM file (> SRR058523.sam, for instance)
But we don’t need the SAM file, we would like a BAM file >> processing by samtools

| is the ‘pipe’ symbol: hands over the output of one command to the next
samtools-0.1.18: samtools and its version
view: the command to process SAM files
-b: output BAM; -h: include the header; -S: input is SAM; -o: output file name
PHF8-unsorted.bam: output file name
- (at the end): read the input from standard input, i.e. from the pipe
Mapping to a reference genome
                                        The workshop
Mapping using BWA

bwa-0.5.9 aln -t 4 /opt/genomes/GRCh37/index/bwa/GRCh37 SRR058523.fastq > SRR058523.sai

bwa-0.5.9 samse /opt/genomes/GRCh37/index/bwa/GRCh37 SRR058523.sai SRR058523.fastq |
samtools-0.1.18 view -bhSo PHF8-unsorted.bam -

Two-step process in BWA

Next steps: process the BAM file >> sort and index it (using samtools)

samtools-0.1.18 sort PHF8-unsorted.bam PHF8-sorted

Creates a sorted BAM file (PHF8-sorted.bam)

samtools-0.1.18 index PHF8-sorted.bam

Indexes the sorted BAM file (and creates a BAM index file: PHF8-sorted.bam.bai)
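When the same steps have to be repeated for several samples, it can help to generate the command lines instead of typing them. The sketch below only builds the command strings shown on this slide and does not run anything. The function name and its parameters are hypothetical; the program names, options and the samtools 0.1.18 style `sort <in.bam> <out.prefix>` are taken from the commands above.

```python
import shlex


def bwa_single_end_commands(index_prefix, fastq, sample, threads=4):
    """Return the shell commands for aln, samse + view, sort and index."""
    q = shlex.quote  # protect paths containing spaces or special characters
    sai = fastq.rsplit(".", 1)[0] + ".sai"
    unsorted_bam = f"{sample}-unsorted.bam"
    sorted_prefix = f"{sample}-sorted"
    return [
        f"bwa-0.5.9 aln -t {threads} {q(index_prefix)} {q(fastq)} > {q(sai)}",
        f"bwa-0.5.9 samse {q(index_prefix)} {q(sai)} {q(fastq)} | "
        f"samtools-0.1.18 view -bhSo {q(unsorted_bam)} -",
        f"samtools-0.1.18 sort {q(unsorted_bam)} {q(sorted_prefix)}",
        f"samtools-0.1.18 index {q(sorted_prefix + '.bam')}",
    ]


commands = bwa_single_end_commands(
    "/opt/genomes/GRCh37/index/bwa/GRCh37", "SRR058523.fastq", "PHF8"
)
for command in commands:
    print(command)
```

Printing the commands first makes it easy to check them by eye before pasting them into a terminal or a job script.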
Mapping to a reference genome
                                         The workshop
BAM: what’s next?

So, now we have the sorted and indexed BAM file – what’s next?

This file is the starting point for all other analysis, depending on the application:

ChIP-seq: peak calling
SNP calling
RNA-seq: calculate gene-expression levels of the transcripts / find splice variants

What are the first things?
- Visualize it (IGV can load BAM files)
- First downstream analysis: QC and basic statistics (how many mapped reads, quality
  distribution, distribution across chromosomes,…)
Mapping to a reference genome
                                        The workshop
First downstream analysis

- QC and basic statistics (how many mapped reads, quality distribution, distribution
  across chromosomes, information on paired-end reads,…)

Samstat
/opt/samstat/samstat PHF8-sorted.bam



- Outputs a HTML file with statistics
Mapping to a reference genome
                                                The workshop
First downstream analysis

- QC and basic statistics (how many mapped reads, quality distribution, distribution
  across chromosomes, information on paired-end reads,…)

BamUtil (stats)

bam stats --in PHF8-sorted.bam --basic --phred --baseSum

Number of records read = 15732744

TotalReads(e6)   15.73
MappedReads(e6) 15.04
PairedReads(e6) 15.73
ProperPair(e6)   14.65
DuplicateReads(e6)                  0.00
QCFailureReads(e6)                  0.00

MappingRate(%)   95.59
PairedReads(%)   100.00
ProperPair(%)    93.11
DupRate(%)       0.00
QCFailRate(%)    0.00

TotalBases(e6)   802.37
BasesInMappedReads(e6)              766.95

Quality          Count
33               0
34               0
35               71373
36               0
37               0
38               203544
39               403649
40               921714
41               2081099
42               1974615
43               2285826
Mapping to a reference genome
                                       The workshop
First downstream analysis

- QC and basic statistics (how many mapped reads, quality distribution, distribution
  across chromosomes, information on paired-end reads,…)

Samtools
samtools-0.1.18 idxstats PHF8-sorted.bam

1      249250621        503714   0
2      243199373        345217   0
3      198022430        273477   0
4      191154276        229016   0
5      180915260        360339   0
6      171115067        257468   0
7      159138663        269704   0
8      146364022        242656   0
9      141213431        203505   0
10     135534747        237496   0
11     135006516        218116   0
12     133851895        231426   0
13     115169878        106831   0
14     107349540        119062   0
15     102531392        141351   0
16     90354753         183004   0
17     81195210         187024   0
18     78077248         86101    0
Mapping to a reference genome
                                     The workshop
First downstream analysis

- Think about PCR duplicates >> you may want to remove them (or set a ‘flag’ in the BAM
  file, indicating it is a duplicate)
- Samtools rmdup or Picard MarkDuplicates

- Find out how these tools work and what other flags are used in BAM files
- Can you make statistics with the BAM flags?
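As a starting point for this exercise, the FLAG field is a sum of bits, each with a fixed meaning in the SAM specification. The sketch below decodes a flag into readable names and tallies them over a list of flags. The short names and function names are this example's own choices; the bit values follow the SAM specification.

```python
from collections import Counter

SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """List the names of all bits set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


def flag_statistics(flags):
    """Count how often each flag bit is set across many reads."""
    stats = Counter()
    for flag in flags:
        stats.update(decode_flag(flag))
    return stats


print(decode_flag(163))
# ['paired', 'proper_pair', 'mate_reverse_strand', 'second_in_pair']

# 1187 = 163 + 1024: the same read, additionally marked as a duplicate
print(flag_statistics([163, 1187, 0])["duplicate"])  # 1
```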
Mapping to a reference genome
                                     The workshop
Mapping – now let’s start!

- Mapping is only the starting point for most downstream analysis tools
- Depends on the application and what you want to do:

    - Exome sequencing / whole genome sequencing: SNP calling (samtools): based on
      mapping quality / coverage >> identification of SNPs (VCF output format)

    - ChIP-seq: peak calling: based on coverage of ChIP and input, enriched regions are
      identified (BED output, BEDgraph and/or WIG files)

    - RNA-seq: assign reads to the transcripts, normalize (length of exon and number of
      reads in the sequencing library = RPKM) >> (relative) expression levels >>
      identification of differentially expressed genes
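The RPKM normalization mentioned above (reads per kilobase of feature per million mapped reads) comes down to one formula. A minimal sketch, with illustrative numbers rather than values from a real experiment:

```python
def rpkm(read_count, feature_length_bp, total_mapped_reads):
    """Reads per kilobase of feature per million mapped reads."""
    return read_count * 1e9 / (feature_length_bp * total_mapped_reads)


# Hypothetical gene: 500 reads on 2,000 bp of exons,
# in a library with 15.04 million mapped reads
print(round(rpkm(500, 2000, 15_040_000), 2))
```

Dividing by the feature length corrects for longer transcripts collecting more reads, and dividing by the library size makes samples sequenced to different depths comparable.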
Literature managment trainingLiterature managment training
Literature managment training
 
Scientific literature managment - exercises
Scientific literature managment - exercisesScientific literature managment - exercises
Scientific literature managment - exercises
 

Dernier

Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
ciinovamais
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
kauryashika82
 
Making and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdfMaking and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdf
Chris Hunter
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
heathfieldcps1
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
PECB
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
QucHHunhnh
 

Dernier (20)

ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.
 
psychiatric nursing HISTORY COLLECTION .docx
psychiatric  nursing HISTORY  COLLECTION  .docxpsychiatric  nursing HISTORY  COLLECTION  .docx
psychiatric nursing HISTORY COLLECTION .docx
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdf
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptxINDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
 
Making and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdfMaking and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdf
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SD
 
PROCESS RECORDING FORMAT.docx
PROCESS      RECORDING        FORMAT.docxPROCESS      RECORDING        FORMAT.docx
PROCESS RECORDING FORMAT.docx
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
On National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsOn National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan Fellows
 
How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17
 
Python Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxPython Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docx
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 

Workshop NGS data analysis - 2

  • 1. Sequencing data analysis Workshop – part 2 / mapping to a reference genome Outline Previously in this workshop… Mapping to a reference genome – the steps Mapping to a reference genome – the workshop Maté Ongenaert
  • 2. Previously in this workshop… Introduction – the real cost of sequencing
  • 3. Previously in this workshop… Introduction – the real cost of sequencing
  • 4. Previously in this workshop… The workflow of NGS data analysis
    Data analysis
    Raw machine reads… What’s next?
    Preprocessing (machine/technology)
    - adaptors, indexes, conversions,…
    - machine/technology dependent
    Reads with associated qualities (universal)
    - FASTQ
    - QC check
    Depending on application (generally applicable)
    - ‘de novo’ assembly of genome (bacterial genomes,…)
    - Mapping to a reference genome → mapped reads
    - SAM/BAM/…
    High-level analysis (specific for application)
    - SNP calling
    - Peak calling
  • 5. Previously in this workshop… The workflow of NGS data analysis
  • 6. Previously in this workshop… Main data formats Raw sequence reads: - Represent the sequence ~ FASTA >SEQUENCE_IDENTIFIER GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT - Extension: represent the quality, per base ~ FASTQ – Q for quality Score ~ phred ~ ASCII table ~ phred + 33 = Sanger @SEQUENCE_IDENTIFIER GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT + !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65 - Machine and platform independent and compressed: SRA (NCBI) Get the original FASTQ file using SRATools (NCBI)
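The Phred+33 (Sanger) encoding on this slide can be decoded by hand. The following is a minimal sketch, not part of the original workshop: the function names are illustrative, and it assumes the quality string really uses the +33 offset (older Illumina files use +64).

```python
def phred33_to_scores(quality):
    """Convert a FASTQ quality string (Phred+33 / Sanger) to integer Phred scores."""
    return [ord(ch) - 33 for ch in quality]


def error_probability(q):
    """Probability that a base call with Phred score q is wrong: 10^(-q/10)."""
    return 10 ** (-q / 10)


# Quality line of the FASTQ record shown on the slide.
quality = "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65"
scores = phred33_to_scores(quality)

print(scores[:5])
print(max(scores), error_probability(max(scores)))
```

A score of 10 corresponds to a 10% error probability, 20 to 1%, 30 to 0.1%, and so on.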
  • 7. Previously in this workshop… Main data formats
    - Now moving to a common file format → SAM / BAM (Sequence Alignment/Map)
    - BAM: binary (read: computer-readable, indexed, compressed) ‘form’ of SAM
    DESCRIPTION OF THE 11 FIELDS IN THE ALIGNMENT SECTION
    # QNAME: template name
    # FLAG
    # RNAME: reference name
    # POS: mapping position
    # MAPQ: mapping quality
    # CIGAR: CIGAR string
    # RNEXT: reference name of the mate/next fragment
    # PNEXT: position of the mate/next fragment
    # TLEN: observed template length
    # SEQ: fragment sequence
    # QUAL: ASCII of Phred-scale base quality+33
    # Headers
    @HD VN:1.3 SO:coordinate
    @SQ SN:ref LN:45
    # Alignment block
    r001 163 ref  7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *
    r002   0 ref  9 30 3S6M1P1I4M *  0  0 AAAAGATAAGGATA    *
    r003   0 ref  9 30 5H6M       *  0  0 AGCTAA            * NM:i:1
    r004   0 ref 16 30 6M14N5M    *  0  0 ATAGCTTCAGC       *
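Two of the SAM fields are compact encodings that are worth unpacking: FLAG is a bit field and CIGAR is a run-length description of the alignment. The sketch below is an illustration only (the helper names are made up for this example); the bit meanings and the CIGAR operators follow the SAM specification. It is applied to read r001 from the alignment block.

```python
import re

# FLAG bits as defined in the SAM specification.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_QUERY = set("MIS=X")      # operators that use up bases of the read
CONSUMES_REFERENCE = set("MDN=X")  # operators that use up bases of the reference


def decode_flag(flag):
    """Return the names of all bits set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def cigar_lengths(cigar):
    """Return (read bases, reference bases) covered by a CIGAR string."""
    query = reference = 0
    for count, op in CIGAR_RE.findall(cigar):
        if op in CONSUMES_QUERY:
            query += int(count)
        if op in CONSUMES_REFERENCE:
            reference += int(count)
    return query, reference


# Read r001: FLAG 163, POS 7, CIGAR 8M2I4M1D3M
print(decode_flag(163))
query_len, ref_len = cigar_lengths("8M2I4M1D3M")
print(query_len, ref_len, "last aligned reference position:", 7 + ref_len - 1)
```

The read length derived from the CIGAR string (17) matches the length of the SEQ field of r001, which is a quick consistency check on any SAM record.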
  • 8. Previously in this workshop… Main data formats
    - BED files (location / annotation / scores): Browser Extensible Data
      Used for mapping / annotation / peak locations; extension: bigBED (binary)
    FIELDS USED: # chr # start # end # name # score # strand
    track name=pairedReads description="Clone Paired Reads" useScore=1
    #chr  start end  name   score strand
    chr22 1000  5000 cloneA 960   +
    chr22 2000  6000 cloneB 900   -
    - BEDGraph files (location, combined with score)
      Used to represent peak scores
    track type=bedGraph name="BedGraph Format" description="BedGraph format" visibility=full color=200,100,0 altColor=0,100,200 priority=20
    #chr  start    end      score
    chr19 59302000 59302300 -1.0
    chr19 59302300 59302600 -0.75
    chr19 59302600 59302900 -0.50
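BED coordinates are 0-based and half-open (the start is included, the end is not), which is a frequent source of off-by-one errors when combining BED with 1-based formats such as SAM or GFF. A minimal parsing sketch, with hypothetical function and key names, using the example lines from the slide:

```python
def parse_bed(lines):
    """Parse BED lines into dictionaries; track/browser/comment lines are skipped."""
    records = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("track", "browser", "#")):
            continue
        fields = line.split()
        start, end = int(fields[1]), int(fields[2])
        records.append({
            "chrom": fields[0],
            "start": start,                      # 0-based, inclusive
            "end": end,                          # 0-based, exclusive
            "name": fields[3] if len(fields) > 3 else None,
            "score": int(fields[4]) if len(fields) > 4 else None,
            "strand": fields[5] if len(fields) > 5 else None,
            "length": end - start,               # no +1 needed for half-open intervals
            "one_based": (start + 1, end),       # same interval, 1-based inclusive
        })
    return records


bed_lines = [
    'track name=pairedReads description="Clone Paired Reads" useScore=1',
    "chr22 1000 5000 cloneA 960 +",
    "chr22 2000 6000 cloneB 900 -",
]
for record in parse_bed(bed_lines):
    print(record["name"], record["length"], record["one_based"])
```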
  • 9. Previously in this workshop… Main data formats
    - WIG files (location / annotation / scores): wiggle
      Used to visualize or summarize data, in most cases count data or normalized count data (RPKM); extension: BigWig, the binary version (often used in GEO for ChIP-seq peaks)
    browser position chr19:59304200-59310700
    browser hide all
    #150 base wide bar graph at arbitrarily spaced positions,
    #threshold line drawn at y=11.76
    #autoScale off viewing range set to [0:25]
    #priority = 10 positions this as the first graph
    track type=wiggle_0 name="variableStep" description="variableStep format" visibility=full autoScale=off viewLimits=0.0:25.0 color=50,150,255 yLineMark=11.76 yLineOnOff=on priority=10
    variableStep chrom=chr19 span=150
    59304701 10.0
    59304901 12.5
    59305401 15.0
    59305601 17.5
    59305901 20.0
    59306081 17.5
  • 10. Previously in this workshop… Main data formats
    - GFF format (General Feature Format) or GTF
      Used for annotation of genetic / genomic features, such as all coding genes in Ensembl
      Often used in downstream analysis to assign annotation to regions / peaks / …
    FIELDS USED:
    # seqname (the name of the sequence)
    # source (the program that generated this feature)
    # feature (the name of this type of feature, for example: exon)
    # start (the starting position of the feature in the sequence)
    # end (the ending position of the feature)
    # score (a score between 0 and 1000)
    # strand (valid entries include '+', '-', or '.')
    # frame (if the feature is a coding exon, frame should be a number between 0-2 that represents the reading frame of the first base. If the feature is not a coding exon, the value should be '.'.)
    # group (all lines with the same group are linked together into a single item)
    track name=regulatory description="TeleGene(tm) Regulatory Regions"
    #chr  source   feature  start   end     score strand frame group
    chr22 TeleGene enhancer 1000000 1001000 500   +      .     touch1
    chr22 TeleGene promoter 1010000 1010100 900   +      .     touch1
    chr22 TeleGene promoter 1020000 1020000 800   -      .     touch2
  • 11. Previously in this workshop… Main data formats - VCF format (Variant Call Format) For SNP representation
  • 12. Previously in this workshop… Main data formats - http://genome.ucsc.edu/FAQ/FAQformat.html - UCSC browser data formats, including the most commonly used and widely accepted formats - In addition, ENCODE data formats (narrowPeak / broadPeak)
  • 14. Mapping to a reference genome The workflow
    Mapping: aligning the raw sequence reads to a reference genome by using an indexing strategy and an alignment algorithm, taking into account the quality scores and with specific conditions
    - Raw sequence reads with quality scores: FASTQ
    - Reference genome: FASTA files can be downloaded (UCSC/Ensembl)
    - Sequence reads <> reference genome: alignment
    - To perform an efficient alignment, an indexing strategy is used
    - For instance (BWA/Bowtie): FM indexes (based on the Burrows-Wheeler transform) on the reference genome and/or the sequence reads
    - Specific conditions: single-end or paired-end; how many mismatches are allowed; trade-off speed/accuracy/specificity; local re-alignment afterwards for improved indel calling; …
    >> Result: mapped sequence reads: chr / start / end / quality
    >> SAM file (>> BAM)
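To make the indexing idea concrete, here is a toy FM index with exact-match backward search. It is a teaching sketch only and not how BWA or Bowtie are implemented: it builds the suffix array by naive sorting, stores full occurrence tables, and does not handle mismatches, gaps or the reverse strand. All names are illustrative.

```python
def build_fm_index(reference):
    """Build a toy FM index (suffix array, C table, occurrence table) for a short sequence."""
    text = reference + "$"  # unique terminator that sorts before A, C, G, T
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # Burrows-Wheeler transform: the character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in suffix_array)
    alphabet = sorted(set(text))
    # first[c]: number of characters in the text that are strictly smaller than c.
    first, total = {}, 0
    for c in alphabet:
        first[c] = total
        total += text.count(c)
    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return suffix_array, first, occ


def find_exact(pattern, index):
    """Return all 0-based positions where pattern occurs, using backward search."""
    suffix_array, first, occ = index
    lo, hi = 0, len(suffix_array)  # current suffix array interval [lo, hi)
    for ch in reversed(pattern):
        if ch not in first:
            return []
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(suffix_array[i] for i in range(lo, hi))


index = build_fm_index("GATTACAGATTACA")
print(find_exact("ATTA", index))
print(find_exact("GGG", index))
```

The search loop runs once per base of the read, independent of the genome size, which is why the index is built once for the reference and then reused for millions of reads.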
  • 15. Mapping to a reference genome The workflow
    The reference genome
    - Sequences (human, rat, mouse, …) can be downloaded from UCSC (Golden Path) or Ensembl
    - Difficulty: download in 2bit format (needs a converter) >> FASTA files (.fa)
    - Needs to be indexed by the mapping program you are going to use
    - BWA: bwa index
    - Bowtie: bowtie-build (pre-computed indexes available)
    - BWA example:
    bwa index [-p prefix] [-a algoType] [-c] <in.db.fasta>
    Index database sequences in the FASTA format.
    OPTIONS:
    -c      Build color-space index. The input FASTA should be in nucleotide space.
    -p STR  Prefix of the output database [same as db filename]
    -a STR  Algorithm for constructing BWT index. Available options are:
            is     IS linear-time algorithm for constructing suffix array. It requires 5.37N memory where N is the size of the database.
            bwtsw  Algorithm implemented in BWT-SW. This method works with the whole human genome
  • 16. Mapping to a reference genome The workflow
    The sequencing reads
    - Sequence reads with quality scores: FASTQ files from the machine
    - Depending on the mapping program, these need to be indexed as well
    - BWA: converts reads to SA coordinates (Suffix Array) based on the reference genome index
    - Bowtie: not needed: indexing and aligning in one step
    - BWA:
      - Index reference genome
      - Index sequence reads (INPUT: FASTQ and REF. GENOME) >> SA coordinates (OUTPUT: SAI)
      - SA coordinates (INPUT: SAI/FASTQ and REF. GENOME) >> SAM/BAM (OUTPUT)
  • 17. Mapping to a reference genome The workflow
    aln
    bwa aln [-n][-o][-e][-d][-i][-k][-l][-t][-cRN][-M][-O][-E][-q] <in.db.fasta> <in.query.fq> > <out.sai>
    Find the SA coordinates of the input reads. At most maxSeedDiff differences are allowed in the first seedLen bases (the seed), and at most maxDiff differences are allowed in the whole sequence.
    OPTIONS:
    -n NUM  Maximum edit distance if the value is INT
    -o INT  Maximum number of gap opens
    -e INT  Maximum number of gap extensions, -1 for k-difference mode
    -d INT  Disallow a long deletion within INT bp towards the 3’-end
    -i INT  Disallow an indel within INT bp towards the ends [5]
    -l INT  Take the first INT subsequence as seed.
    -k INT  Maximum edit distance in the seed
    -t INT  Number of threads (multi-threading mode)
    -M INT  Mismatch penalty
    -O INT  Gap open penalty
    -E INT  Gap extension penalty
    -R INT  Proceed with suboptimal alignments
    -c      Reverse query but not complement it
    -N      Disable iterative search.
    -q INT  Parameter for read trimming.
    -I      The input is in the Illumina 1.3+ read format (quality equals ASCII-64)
    -B INT  Length of barcode starting from the 5’-end.
    -b      Specify the input read sequence file is the BAM format.
    -0      When -b is specified, only use single-end reads in mapping.
    -1      When -b is specified, only use the first read in a read pair in mapping
    -2      When -b is specified, only use the second read in a read pair in mapping
  • 18. Mapping to a reference genome The workflow
    samse
    bwa samse [-n maxOcc] <in.db.fasta> <in.sai> <in.fq> > <out.sam>
    Generate alignments in the SAM format given single-end reads. Repetitive hits will be randomly chosen.
    OPTIONS:
    -n INT  Maximum number of alignments to output in the XA tag for reads paired properly.
    -r STR  Specify the read group in a format like ‘@RG\tID:foo\tSM:bar’
    sampe
    bwa sampe [-a][-o][-n][-N][-P] <in.db.fasta> <in1.sai> <in2.sai> <in1.fq> <in2.fq> > <out.sam>
    Generate alignments in the SAM format given paired-end reads. Repetitive read pairs will be placed randomly.
    OPTIONS:
    -a INT  Maximum insert size for a read pair to be considered being mapped properly.
    -o INT  Maximum occurrences of a read for pairing.
    -P      Load the entire FM-index into memory to reduce disk operations
    -n INT  Maximum number of alignments to output in the XA tag for reads paired properly
    -N INT  Maximum number of alignments to output in the XA tag for discordant read pairs
    -r STR  Specify the read group in a format like ‘@RG\tID:foo\tSM:bar’
  • 20. Mapping to a reference genome The workshop
    Mapping using BWA
    bwa-0.5.9 aln -t 4 /opt/genomes/GRCh37/index/bwa/GRCh37 SRR058523.fastq > SRR058523.sai
    bwa-0.5.9: BWA and its version
    aln: alignment functionality of BWA
    -t 4: use 4 threads (CPU cores) at the same time to speed up
    /opt/genomes/GRCh37/index/bwa/GRCh37: location of the reference genome index
    SRR058523.fastq: FASTQ file to align to the reference
    >: redirects the output to a file
    SRR058523.sai: the output file (SA Index file)
    Maps the input sequences (FASTQ) to the reference genome index → output: SA coordinates of the reads
    This is not yet a ‘real genomic mapping’, so a next step is needed…
  • 21. Mapping to a reference genome The workshop
    Mapping using BWA
    bwa-0.5.9 samse /opt/genomes/GRCh37/index/bwa/GRCh37 SRR058523.sai SRR058523.fastq | samtools-0.1.18 view -bhSo PHF8-unsorted.bam -
    bwa-0.5.9: BWA and its version
    samse: single-end mapping and output to SAM format
    /opt/genomes/GRCh37/index/bwa/GRCh37: location of the reference genome index
    SRR058523.sai: the reads index
    SRR058523.fastq: the raw reads and quality scores
    On its own, this would output a SAM file (for instance with > SRR058523.sam)
    But we don’t need the SAM file, we would like a BAM file → processing by samtools
    |: the ‘pipe’ symbol, hands over the output of one command to the next
    samtools-0.1.18: samtools and its version
    view: the command to process SAM files
    -bhSo: b output BAM; h print the headers; S input is SAM; o output file name
    PHF8-unsorted.bam: output file name
    -: read the input from standard input, i.e. the data handed over by the pipe
  • 22. Mapping to a reference genome The workshop
    Mapping using BWA
    bwa-0.5.9 aln -t 4 /opt/genomes/GRCh37/index/bwa/GRCh37 SRR058523.fastq > SRR058523.sai
    bwa-0.5.9 samse /opt/genomes/GRCh37/index/bwa/GRCh37 SRR058523.sai SRR058523.fastq | samtools-0.1.18 view -bhSo PHF8-unsorted.bam -
    Two-step process in BWA
    Next steps: process the BAM file → sort and index it (using samtools)
    samtools-0.1.18 sort PHF8-unsorted.bam PHF8-sorted
    Creates a sorted BAM file (PHF8-sorted.bam)
    samtools-0.1.18 index PHF8-sorted.bam
    Indexes the sorted BAM file (and creates a BAM index file: PHF8-sorted.bam.bai)
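When the same steps have to be repeated for many samples, the commands can be generated and chained from a script. The sketch below is one possible way to do this and rests on several assumptions: the executables are named bwa-0.5.9 and samtools-0.1.18 as in this workshop, the old samtools `sort <in.bam> <out.prefix>` syntax applies, and the helper names `build_commands` and `run_pipeline` are made up for this example. Only the options shown on the slides are used.

```python
import subprocess


def build_commands(index_prefix, fastq, sample, threads=4,
                   bwa="bwa-0.5.9", samtools="samtools-0.1.18"):
    """Assemble the argument lists for the aln / samse / view / sort / index steps."""
    sai = fastq.rsplit(".", 1)[0] + ".sai"
    unsorted_bam = sample + "-unsorted.bam"
    sorted_prefix = sample + "-sorted"
    return {
        "aln": [bwa, "aln", "-t", str(threads), index_prefix, fastq],
        "sai": sai,
        "samse": [bwa, "samse", index_prefix, sai, fastq],
        "view": [samtools, "view", "-bhSo", unsorted_bam, "-"],
        "sort": [samtools, "sort", unsorted_bam, sorted_prefix],
        "index": [samtools, "index", sorted_prefix + ".bam"],
    }


def run_pipeline(cmds):
    """Run the steps; requires the tools, the genome index and the FASTQ file to exist."""
    # Step 1: bwa aln, with standard output redirected to the .sai file ('>' in the shell).
    with open(cmds["sai"], "wb") as sai_file:
        subprocess.run(cmds["aln"], stdout=sai_file, check=True)
    # Step 2: bwa samse piped into samtools view ('|' in the shell).
    samse = subprocess.Popen(cmds["samse"], stdout=subprocess.PIPE)
    subprocess.run(cmds["view"], stdin=samse.stdout, check=True)
    samse.stdout.close()
    if samse.wait() != 0:
        raise RuntimeError("bwa samse failed")
    # Steps 3 and 4: sort and index the BAM file.
    subprocess.run(cmds["sort"], check=True)
    subprocess.run(cmds["index"], check=True)


commands = build_commands("/opt/genomes/GRCh37/index/bwa/GRCh37", "SRR058523.fastq", "PHF8")
for step in ("aln", "samse", "view", "sort", "index"):
    print(" ".join(commands[step]))
# run_pipeline(commands)  # uncomment on a machine where the tools and data are available
```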
  • 23. Mapping to a reference genome The workshop
    BAM: what’s next?
    So, now we have the sorted and indexed BAM file. What’s next?
    This file is the starting point for all other analyses, depending on the application:
    - ChIP-seq: peak calling
    - SNP calling
    - RNA-seq: calculate gene-expression levels of the transcripts / find splice variants
    What are the first things to do?
    - Visualize it (IGV can load BAM files)
    - First downstream analysis: QC and basic statistics (how many mapped reads, quality distribution, distribution across chromosomes,…)
  • 24. Mapping to a reference genome The workshop First downstream analysis - QC and basic statistics (how many mapped reads, quality distribution, distribution across chromosomes, information on paired-end reads,…) Samstat /opt/samstat/samstat PHF8-sorted.bam - Outputs an HTML file with statistics
  • 25. Mapping to a reference genome The workshop
    First downstream analysis
    - QC and basic statistics (how many mapped reads, quality distribution, distribution across chromosomes, information on paired-end reads,…)
    BamUtil (stats)
    bam stats --in PHF8-sorted.bam --basic --phred --baseSum
    Number of records read = 15732744
    TotalReads(e6)          15.73
    MappedReads(e6)         15.04
    PairedReads(e6)         15.73
    ProperPair(e6)          14.65
    DuplicateReads(e6)      0.00
    QCFailureReads(e6)      0.00
    MappingRate(%)          95.59
    PairedReads(%)          100.00
    ProperPair(%)           93.11
    DupRate(%)              0.00
    QCFailRate(%)           0.00
    TotalBases(e6)          802.37
    BasesInMappedReads(e6)  766.95
    Quality Count
    33      0
    34      0
    35      71373
    36      0
    37      0
    38      203544
    39      403649
    40      921714
    41      2081099
    42      1974615
    43      2285826
  • 26. Mapping to a reference genome The workshop
    First downstream analysis
    - QC and basic statistics (how many mapped reads, quality distribution, distribution across chromosomes, information on paired-end reads,…)
    Samtools
    samtools-0.1.18 idxstats PHF8-sorted.bam
    1   249250621  503714  0
    2   243199373  345217  0
    3   198022430  273477  0
    4   191154276  229016  0
    5   180915260  360339  0
    6   171115067  257468  0
    7   159138663  269704  0
    8   146364022  242656  0
    9   141213431  203505  0
    10  135534747  237496  0
    11  135006516  218116  0
    12  133851895  231426  0
    13  115169878  106831  0
    14  107349540  119062  0
    15  102531392  141351  0
    16  90354753   183004  0
    17  81195210   187024  0
    18  78077248   86101   0
  • 27. Mapping to a reference genome The workshop
    First downstream analysis
    - Think about PCR duplicates → you may want to remove them (or set a ‘flag’ in the BAM file, indicating it is a duplicate)
    - Samtools rmdup or Picard MarkDuplicates
    - Find out how these tools work and what other flags are used in BAM files
    - Can you make statistics with the BAM flags?
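As a starting point for the last question, statistics can be derived directly from the FLAG column of SAM-formatted text (for a BAM file, the text can be obtained with samtools view). The sketch below is illustrative: the function name is made up, only four of the flag bits are tallied, and apart from r001 the example reads are invented for this example.

```python
from collections import Counter


def flag_statistics(sam_lines):
    """Tally a few FLAG bits (SAM specification) over SAM-formatted alignment lines."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):      # header line
            continue
        flag = int(line.rstrip("\n").split("\t")[1])
        counts["total"] += 1
        if flag & 0x4:
            counts["unmapped"] += 1
        else:
            counts["mapped"] += 1
        if flag & 0x400:              # set by duplicate-marking tools
            counts["duplicate"] += 1
        if flag & 0x1 and flag & 0x2:
            counts["proper_pair"] += 1
    return counts


example = [
    "@HD\tVN:1.3\tSO:coordinate",
    "r001\t163\tref\t7\t30\t8M2I4M1D3M\t=\t37\t39\tTTAGATAAAGGATACTG\t*",
    "dup1\t1024\tref\t9\t30\t6M\t*\t0\t0\tAGCTAA\t*",   # invented read, marked as duplicate
    "unm1\t4\t*\t0\t0\t*\t*\t0\t0\tACGTAC\t*",           # invented read, unmapped
]
stats = flag_statistics(example)
print(dict(stats))
print("duplicate rate (%):", 100 * stats["duplicate"] / stats["total"])
```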
  • 28. Mapping to a reference genome The workshop
    Mapping – now let’s start!
    - Mapping is only the starting point for most downstream analysis tools
    - What comes next depends on the application and what you want to do:
    - Exome sequencing / whole genome sequencing: SNP calling (samtools): based on mapping quality / coverage → identification of SNPs (VCF output format)
    - ChIP-seq: peak calling: based on coverage of ChIP and input, enriched regions are identified (BED output, BEDgraph and/or WIG files)
    - RNA-seq: assign reads to the transcripts, normalize for exon length and for the number of reads in the sequencing library (RPKM) → (relative) expression levels → identification of differentially expressed genes
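The RPKM normalization mentioned for RNA-seq (reads per kilobase of feature per million mapped reads) is a single formula. A minimal sketch with hypothetical numbers (the function name and the example values are not from the workshop data):

```python
def rpkm(read_count, feature_length_bp, total_mapped_reads):
    """Reads Per Kilobase of feature per Million mapped reads."""
    # reads / (length in kb) / (library size in millions) = reads * 1e9 / (length * total)
    return read_count * 1e9 / (feature_length_bp * total_mapped_reads)


# Hypothetical gene: 500 reads on 2,000 bp of exons, in a library of 15 million mapped reads.
print(round(rpkm(500, 2000, 15_000_000), 2))
```

Dividing by the feature length makes long and short genes comparable within a sample; dividing by the library size makes samples with different sequencing depth comparable.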