Workshop NGS data analysis - 3

Sequencing data analysis
Workshop – part 3 / peak calling and annotation

Outline

Previously in this workshop…

Peak calling and annotation – the steps

Peak calling and annotation – the workshop

Maté Ongenaert

Introduction – the real cost of sequencing

The workflow of NGS data analysis
Data analysis

Raw machine reads… What’s next?

Preprocessing (machine/technology)
- adaptors, indexes, conversions,…
- machine/technology dependent

Reads with associated qualities (universal)
- FASTQ
- QC check

Depending on application (generally applicable)
- ‘de novo’ assembly of genome (bacterial genomes,…)
- Mapping to a reference genome → mapped reads
- SAM/BAM/…

High-level analysis (specific for application)
- SNP calling
- Peak calling

The workflow of NGS data analysis

Main data formats
Raw sequence reads:

- Represent the sequence ~ FASTA
>SEQUENCE_IDENTIFIER
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT

- Extension: represent the quality, per base ~ FASTQ (Q for quality)
Score ~ Phred score, stored as the ASCII character with code Phred + 33 (Sanger encoding)
@SEQUENCE_IDENTIFIER
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

- Machine and platform independent and compressed: SRA (NCBI)
Get the original FASTQ file using the SRA Toolkit (NCBI)
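The Phred + 33 encoding can be illustrated with a minimal Python sketch. The function names are chosen for illustration only; the quality string is the one from the FASTQ example above, and the error probability uses the standard Phred definition P = 10^(-Q/10).

```python
def phred33_to_scores(quality_string):
    """Decode a Sanger (Phred+33) quality string into integer Phred scores."""
    return [ord(char) - 33 for char in quality_string]


def error_probability(phred_score):
    """Convert a Phred score Q into a base-calling error probability 10^(-Q/10)."""
    return 10 ** (-phred_score / 10)


quality = "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65"
scores = phred33_to_scores(quality)

# One score per base: '!' has ASCII code 33 and therefore Phred score 0
print(scores[:5])
# The last base has quality character '5' (ASCII 53), i.e. Phred 20
print(scores[-1], error_probability(scores[-1]))
```

A Phred score of 20 corresponds to an expected error rate of 1 in 100 base calls, and a score of 30 to 1 in 1000.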

Main data formats
- Now moving to a common file format → SAM / BAM (Sequence Alignment/Map)
- BAM: binary (read: computer-readable, indexed, compressed) ‘form’ of SAM

DESCRIPTION OF THE 11 FIELDS IN THE ALIGNMENT SECTION

# QNAME: template name
# FLAG: bitwise flag
# RNAME: reference name
# POS: mapping position
# MAPQ: mapping quality
# CIGAR: CIGAR string
# RNEXT: reference name of the mate/next fragment
# PNEXT: position of the mate/next fragment
# TLEN: observed template length
# SEQ: fragment sequence
# QUAL: ASCII of Phred-scale base quality+33

#Headers
@HD VN:1.3 SO:coordinate
@SQ SN:ref LN:45

#Alignment block
r001 163 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *
r002 0 ref 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA *
r003 0 ref 9 30 5H6M * 0 0 AGCTAA * NM:i:1
r004 0 ref 16 30 6M14N5M * 0 0 ATAGCTTCAGC *
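To make the FLAG and CIGAR fields more concrete, the following Python sketch decodes the first alignment line of the example. The flag bit meanings and the set of CIGAR operations that consume reference bases (M, D, N, =, X) follow the SAM specification; the function names and the short flag labels are illustrative assumptions. Real SAM files are tab-separated; whitespace splitting is used here so the example line can be pasted as shown.

```python
import re

# Bit values of the SAM FLAG field (labels are shortened for illustration)
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}

CIGAR_OPERATION = re.compile(r"(\d+)([MIDNSHP=X])")


def decode_flag(flag):
    """Return the labels of all bits that are set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def reference_span(cigar):
    """Number of reference bases covered by an alignment (M, D, N, = and X)."""
    return sum(int(n) for n, op in CIGAR_OPERATION.findall(cigar) if op in "MDN=X")


def parse_alignment(line):
    """Parse the first six mandatory fields of one SAM alignment line."""
    qname, flag, rname, pos, mapq, cigar = line.split()[:6]
    pos = int(pos)
    return {
        "qname": qname,
        "flags": decode_flag(int(flag)),
        "rname": rname,
        "pos": pos,                                # 1-based leftmost position
        "mapq": int(mapq),
        "cigar": cigar,
        "end": pos + reference_span(cigar) - 1,    # 1-based, inclusive
    }


record = parse_alignment("r001 163 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *")
print(record["flags"])
print(record["pos"], record["end"])
```

Flag 163 is 1 + 2 + 32 + 128, and the CIGAR string 8M2I4M1D3M covers 16 reference bases, so read r001 spans reference positions 7 to 22.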

Main data formats
- BED files (location / annotation / scores): Browser Extensible Data
Used for mapping / annotation / peak locations; extension: bigBed (binary)
FIELDS USED:
# chr
# start
# end
# name
# score
# strand

track name=pairedReads description="Clone Paired Reads" useScore=1
#chr start end name score strand
chr22 1000 5000 cloneA 960 +
chr22 2000 6000 cloneB 900 -
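BED coordinates are 0-based with an exclusive end, whereas SAM and GFF positions are 1-based and inclusive. A small Python sketch of reading the example above, assuming six whitespace-separated columns per record (the function name is illustrative):

```python
def parse_bed(lines):
    """Parse 6-column BED lines, skipping track, browser and comment lines."""
    records = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("track", "browser", "#")):
            continue
        chrom, start, end, name, score, strand = line.split()[:6]
        records.append({
            "chrom": chrom,
            "start": int(start),               # 0-based, inclusive
            "end": int(end),                   # 0-based, exclusive
            "name": name,
            "score": int(score),
            "strand": strand,
            "length": int(end) - int(start),   # no +1 needed for half-open intervals
        })
    return records


bed_text = """track name=pairedReads description="Clone Paired Reads" useScore=1
#chr start end name score strand
chr22 1000 5000 cloneA 960 +
chr22 2000 6000 cloneB 900 -
"""

records = parse_bed(bed_text.splitlines())
for record in records:
    # Print 1-based inclusive coordinates, as used in SAM and GFF
    print(record["name"], record["chrom"], record["start"] + 1, record["end"], record["length"])
```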

- BEDGraph files (location, combined with score)
Used to represent peak scores
track type=bedGraph name="BedGraph Format" description="BedGraph format"
visibility=full color=200,100,0 altColor=0,100,200 priority=20
#chr start end score
chr19 59302000 59302300 -1.0
chr19 59302300 59302600 -0.75
chr19 59302600 59302900 -0.50

Main data formats
- WIG files (location / annotation / scores): wiggle
Used to visualise or summarise data, in most cases count data or normalised count
data (RPKM); extension: bigWig, the binary version (often used in GEO for ChIP-seq peaks)

browser position chr19:59304200-59310700
browser hide all

#150 base wide bar graph at arbitrarily spaced positions,
#threshold line drawn at y=11.76
#autoScale off viewing range set to [0:25]
#priority = 10 positions this as the first graph

track type=wiggle_0 name="variableStep" description="variableStep format"
visibility=full autoScale=off viewLimits=0.0:25.0 color=50,150,255
yLineMark=11.76 yLineOnOff=on priority=10
variableStep chrom=chr19 span=150
59304701 10.0
59304901 12.5
59305401 15.0
59305601 17.5
59305901 20.0
59306081 17.5
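As an illustration of how the variableStep block is interpreted, the Python sketch below expands each data line into an interval of `span` bases. WIG positions are 1-based, so the start is shifted by one to obtain BED/bedGraph-style 0-based, half-open intervals. The function name is illustrative, and the sketch assumes track and browser settings each sit on a single line.

```python
def variable_step_to_intervals(lines):
    """Expand variableStep WIG data into (chrom, start, end, value) intervals."""
    intervals = []
    chrom, span = None, 1
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        if line.startswith("variableStep"):
            # Declaration line, e.g. "variableStep chrom=chr19 span=150"
            settings = dict(item.split("=", 1) for item in line.split()[1:])
            chrom = settings["chrom"]
            span = int(settings.get("span", 1))
            continue
        position, value = line.split()
        start = int(position) - 1          # 1-based WIG position to 0-based start
        intervals.append((chrom, start, start + span, float(value)))
    return intervals


wig_text = """variableStep chrom=chr19 span=150
59304701 10.0
59304901 12.5
59305401 15.0
59305601 17.5
59305901 20.0
59306081 17.5
"""

intervals = variable_step_to_intervals(wig_text.splitlines())
for chrom, start, end, value in intervals:
    print(chrom, start, end, value)
```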

Main data formats
- GFF format (General Feature Format) or GTF
Used for annotation of genetic / genomic features – such as all coding genes in Ensembl
Often used in downstream analysis to assign annotation to regions / peaks / …
FIELDS USED:

# seqname (the name of the sequence)
# source (the program that generated this feature)
# feature (the name of this type of feature – for example: exon)
# start (the starting position of the feature in the sequence)
# end (the ending position of the feature)
# score (a score between 0 and 1000)
# strand (valid entries include '+', '-', or '.')
# frame (if the feature is a coding exon, frame should be a number between
0-2 that represents the reading frame of the first base. If the feature is
not a coding exon, the value should be '.'.)
# group (all lines with the same group are linked together into a single
item)

track name=regulatory description="TeleGene(tm) Regulatory Regions"
#chr source feature start end score strand frame group
chr22 TeleGene enhancer 1000000 1001000 500 + . touch1
chr22 TeleGene promoter 1010000 1010100 900 + . touch1
chr22 TeleGene promoter 1020000 1020000 800 - . touch2

Peak calling
The workflow
Peak calling:

Identify genomic regions where the number of sequenced reads (coverage) in the IP sample
is higher than would be expected from the input (control) sample >> enriched
regions >> possibly captured by the IP and thus sequenced at higher coverage

Peak annotation:

When such enriched regions are identified, where are they located (intron/exon/…) ?
What is the closest gene or the closest promoter region?

Peak calling
The workflow
Peak calling:

Coverage

From the BAM file: mapping against the reference genome
Both the IP sample and the control (Input) must be mapped; duplicate reads are ignored by
most peak callers

Peak caller will determine coverage for both samples
- Store them for visualisation (WIG files; BIGWIG files or similar)

Enriched

Find out which regions are enriched (either within the sample or versus a control (Input)
sample) → statistics ~ model of tag distributions and normalisation strategy
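The idea of coverage plus an enrichment statistic can be sketched in Python as follows. This is a deliberately simplified illustration, not the algorithm of any particular peak caller: reads are counted in fixed windows, the control is scaled to the IP library size (one possible normalisation strategy), and a Poisson upper-tail probability is used as the tag distribution model. Window size, threshold, pseudocount and all names are assumptions made for the example.

```python
import math
from collections import Counter


def window_counts(read_starts, window_size):
    """Count reads per fixed-size window, keyed by window index."""
    return Counter(start // window_size for start in read_starts)


def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)
    cdf = term
    for i in range(1, k):
        term *= lam / i
        cdf += term
    return max(0.0, 1.0 - cdf)


def enriched_windows(ip_starts, control_starts, window_size=200,
                     alpha=1e-3, pseudocount=1.0):
    """Return windows where the IP count exceeds the scaled control expectation."""
    ip = window_counts(ip_starts, window_size)
    control = window_counts(control_starts, window_size)
    # Normalisation: scale control counts to the IP library size
    scale = len(ip_starts) / max(len(control_starts), 1)
    hits = []
    for window, ip_count in sorted(ip.items()):
        expected = control.get(window, 0) * scale + pseudocount
        p_value = poisson_upper_tail(ip_count, expected)
        if p_value < alpha:
            start = window * window_size
            hits.append((start, start + window_size, ip_count, expected, p_value))
    return hits


# Toy data on one chromosome: 20 IP reads pile up in window 1000-1200
ip_reads = list(range(1000, 1020)) + [50, 250, 450, 650]
control_reads = [w * 200 + offset for w in range(6) for offset in (10, 20, 30, 40)]

hits = enriched_windows(ip_reads, control_reads)
for start, end, ip_count, expected, p_value in hits:
    print(start, end, ip_count, expected, p_value)
```

Real peak callers refine each of these steps (strand shifts, local background estimates, multiple-testing correction), which is why their results differ.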

Peak calling
The workflow
Peak calling:

[Table: comparison of peak callers; the per-program feature marks (X) lost their column positions]

Column groups: Density profiles / Peak assignment / Control data adjustment / Significance relative to control data / Statistical model or test

Feature columns:
- Window-based
- Tag clustering
- Gaussian kernel
- Strand-specific
- Peak height or FE
- Background subtract
- Genomic dupl/deletions
- Normalized control
- FDR
- Statistical model on control
- Conditional binomial
- Local Poisson
- Chromosome Poisson
- HMM
- T-test

Programs compared (reference):
- Cisgenome [73]
- Minimal ChipSeq Peak Finder [74]
- E-range [75]
- MACS [76]
- QuEST [77]
- Hpeak [78]
- Sole-Search [79]
- PeakSeq [80]
- SISSRS [81]
- spp package [82]
Peak calling
The workflow
Usage: macs14 <-t tfile> [-n name] [-g genomesize] [options]

Example: macs14 -t ChIP.bam -c Control.bam -f BAM -g h -n test -w --call-subpeaks

macs14 -- Model-based Analysis for ChIP-Sequencing

Options:
--version show program's version number and exit
-h, --help show this help message and exit.
-t TFILE, --treatment=TFILE
ChIP-seq treatment files. REQUIRED. When ELANDMULTIPET
is selected, you must provide two files separated by
comma, e.g.
s_1_1_eland_multi.txt,s_1_2_eland_multi.txt
-c CFILE, --control=CFILE
Control files. When ELANDMULTIPET is selected, you
must provide two files separated by comma, e.g.
s_2_1_eland_multi.txt,s_2_2_eland_multi.txt
-n NAME, --name=NAME Experiment name, which will be used to generate output
file names. DEFAULT: "NA"
-f FORMAT, --format=FORMAT
Format of tag file, "AUTO", "BED" or "ELAND" or
"ELANDMULTI" or "ELANDMULTIPET" or "ELANDEXPORT" or
"SAM" or "BAM" or "BOWTIE". The default AUTO option
will let MACS decide which format the file is. Please
check the definition in 00README file if you choose
ELAND/ELANDMULTI/ELANDMULTIPET/ELANDEXPORT/SAM/BAM/BOWTIE.
DEFAULT: "AUTO"
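When many samples have to be processed, the command line can be assembled from a script. The Python sketch below only builds the argument list, using the options shown in the usage and example lines above (-t, -c, -f, -n, -g); the helper name and its defaults are illustrative assumptions, and running the command requires MACS 1.4 to be installed.

```python
import shlex


def build_macs14_command(treatment, control=None, name="NA",
                         file_format="AUTO", genome_size=None):
    """Assemble a macs14 argument list from the options shown in the help text."""
    command = ["macs14", "-t", treatment]
    if control is not None:
        command += ["-c", control]
    command += ["-f", file_format, "-n", name]
    if genome_size is not None:
        command += ["-g", genome_size]
    return command


command = build_macs14_command("ChIP.bam", control="Control.bam", name="test",
                               file_format="BAM", genome_size="h")
print(shlex.join(command))

# To actually run it (assumes macs14 is on the PATH):
# import subprocess
# subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.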

Peak calling
The workflow
Peak annotation

Enriched

Peak locations > in which features are my peaks located? Are they close to a gene? Provide
some statistics on how far the peaks are from annotated TSSs

R/Bioconductor
ChIPpeakAnno package

PeakAnalyzer
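Independent of the tools listed above, the "closest TSS" question can be sketched in a few lines of Python. The gene names and coordinates are hypothetical, the peak is reduced to its midpoint, and strand is ignored; this is a simplification and does not reproduce how ChIPpeakAnno or PeakAnalyzer work.

```python
import bisect
from collections import defaultdict


def build_tss_index(tss_records):
    """Group (chrom, position, gene) records per chromosome, sorted by position."""
    index = defaultdict(list)
    for chrom, position, gene in tss_records:
        index[chrom].append((position, gene))
    for chrom in index:
        index[chrom].sort()
    return index


def nearest_tss(index, chrom, peak_start, peak_end):
    """Return (gene, distance) for the TSS closest to the peak midpoint.

    The distance is TSS position minus peak midpoint, so a negative value
    means the TSS lies at a lower coordinate than the peak.
    """
    sites = index.get(chrom)
    if not sites:
        return None
    midpoint = (peak_start + peak_end) // 2
    i = bisect.bisect_left(sites, (midpoint, ""))
    # Only the neighbours directly left and right of the midpoint can be closest
    candidates = sites[max(i - 1, 0):i + 1]
    position, gene = min(candidates, key=lambda site: abs(site[0] - midpoint))
    return gene, position - midpoint


# Hypothetical annotation and peaks
tss_index = build_tss_index([
    ("chr1", 1000, "geneA"),
    ("chr1", 5000, "geneB"),
    ("chr2", 300, "geneC"),
])
peaks = [("chr1", 1800, 2000), ("chr1", 4900, 5300), ("chr2", 100, 200)]

for chrom, start, end in peaks:
    print(chrom, start, end, nearest_tss(tss_index, chrom, start, end))
```

Collecting the distances over all peaks gives the kind of summary statistic mentioned above, for example a histogram of peak-to-TSS distances.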

Peak calling
The workflow
Further downstream processing
Peak overlaps

Is the observed overlap larger than one can expect if the datasets were random?

- The peak caller gives each peak a score
- Randomly distribute these scores across the peaks of the same peakset (factor) and, for a
given percentage of top peaks, calculate the number of overlapping peaks in the real
dataset and with the randomly distributed scores
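The randomisation procedure described above can be sketched in Python as follows. This is one possible reading of the slide, under stated assumptions: peaks are (chrom, start, end) tuples, the same top fraction is taken from both peaksets, and the reported fraction is the share of permutations that reach at least the real overlap. All names and the toy data are illustrative.

```python
import random


def overlaps(a, b):
    """True if two (chrom, start, end) intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def top_fraction(peaks, scores, fraction):
    """Return the highest-scoring fraction of peaks (at least one peak)."""
    n_top = max(1, int(len(peaks) * fraction))
    ranked = sorted(zip(scores, range(len(peaks))), reverse=True)
    return [peaks[i] for _, i in ranked[:n_top]]


def count_overlaps(peaks_a, peaks_b):
    """Number of peaks in A that overlap at least one peak in B."""
    return sum(any(overlaps(a, b) for b in peaks_b) for a in peaks_a)


def permutation_test(peaks_a, scores_a, peaks_b, scores_b, fraction,
                     n_permutations=1000, seed=0):
    """Compare the real top-peak overlap with overlaps after shuffling scores."""
    rng = random.Random(seed)
    real = count_overlaps(top_fraction(peaks_a, scores_a, fraction),
                          top_fraction(peaks_b, scores_b, fraction))
    random_counts = []
    for _ in range(n_permutations):
        # Redistribute the scores within each peakset
        shuffled_a = rng.sample(scores_a, len(scores_a))
        shuffled_b = rng.sample(scores_b, len(scores_b))
        random_counts.append(
            count_overlaps(top_fraction(peaks_a, shuffled_a, fraction),
                           top_fraction(peaks_b, shuffled_b, fraction)))
    mean_random = sum(random_counts) / n_permutations
    fraction_at_least_real = sum(c >= real for c in random_counts) / n_permutations
    return real, mean_random, fraction_at_least_real


# Toy data: every peak of factor A overlaps the peak of factor B with the same rank
peaks_a = [("chr1", i * 1000, i * 1000 + 100) for i in range(20)]
peaks_b = [("chr1", i * 1000 + 50, i * 1000 + 150) for i in range(20)]
scores_a = list(range(20))
scores_b = list(range(20))

real, mean_random, fraction_at_least_real = permutation_test(
    peaks_a, scores_a, peaks_b, scores_b, fraction=0.25)
print(real, mean_random, fraction_at_least_real)
```

In the toy data the top 25% of both factors coincide, so the real overlap is much higher than the mean of the permutations.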

Peak calling
The workflow
Identify sequence motifs (region around ‘peak’, searched for motifs)

Identify differentially bound regions between conditions/factors/…

Peak calling
The workflow
Peak overlaps

              10%           15%           20%           30%           50%           75%
Real          7             18            25            52            102           201
Means         0.347         11.53         2.699         9.297         42.377        140.888
Factor diff   20.17291066   1.56114484    9.262689885   5.593202108   2.406966043   1.426665152
FDR           0             0             0             0             0             0

       10%    15%    20%    30%    50%    75%
10%    282    333    506    907    1000   1000
20%    59     33     125    332    1000   1000
30%    4      2      9      27     981    1000
50%    2      0      0      0      95     1000
75%    0      0      0      0      0      148
