Raw data investigation

Joachim Jacob
20 and 27 January 2014

This presentation is available under the Creative Commons Attribution-ShareAlike 3.0 Unported License. Please refer to
http://www.bits.vib.be/ if you use this presentation or parts hereof.
Experimental setup
We have decided on:
● how many samples per condition
● how deep to sequence
These choices determine how reliable the statistics will be.
Base them on experience and on tools like Scotty. A wrong
experimental design cannot be fixed afterwards. Best
approach: pilot data (3 samples per condition, 10M reads).
But we have other sequencing options to choose from!
PE versus SE Illumina
● Single end (SE): from each cDNA fragment only
one end is read.
● Paired end (PE): the cDNA fragment is read from
both ends.
[Figure: purify and fragment; SE versus PE reads]
PE versus SE Illumina
Single end (SE):
● Gene level differential expression
Paired end (PE):
● Novel splice junction detection
● De novo assembly of transcriptome
● Helps with correctly positioning reads on the
reference genome sequence.

Note: PE is not the same as mate pairs.
Strandedness
● Naive protocols obtain reads from cDNA
fragments, BUT the link with the sense or
antisense strand is broken.
● Stranded protocols generate reads
from one strand, corresponding to the
sense or antisense strand (depending on
the protocol).
Strandedness
[Figure: reads from a non-stranded versus a stranded protocol]
Example of a stranded protocol
● dUTP protocol to generate stranded reads.
Importance of strandedness
● Strandedness can bias the read counts
compared to non-stranded protocols.
● Whether you should apply it depends on the
genome: e.g. when genes overlap, the benefit of
assigning reads to the correct gene can
outweigh the added technical variation.
Length of the reads
● Does not matter so much (when we want
to quantify by aligning to a reference
sequence): 50 bp will do.
● The most important point is to be able to
accurately position the read on the
reference genome sequence, to assign it
to the correct gene.
● Length can become important if you
want to assemble the transcriptome.
For DE on the gene level
The 'cheapest' protocol for high-throughput
sequencing suffices to achieve DE detection:
● SE
● 50bp
● Option: strandedness.

Use the money you have left over for
increasing the number of replicates.
Illumina Truseq protocol
Raw Illumina data
The data you get arrives as...
[Figure: delivered files, labelled with experiment and barcode]
Compressed, usually with gzip
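Since the files arrive gzip-compressed, it can help to peek at the first lines without unpacking the whole file to disk. A minimal sketch using Python's standard `gzip` module; the function name `head_gz` and the file name in the usage note are illustrative assumptions, not part of the course material.

```python
import gzip
from itertools import islice


def head_gz(path, n=8):
    """Return the first n lines of a gzip-compressed text file.

    The file is decompressed on the fly, so nothing is written to disk.
    """
    with gzip.open(path, "rt") as handle:
        return [line.rstrip("\n") for line in islice(handle, n)]
```

Usage, with a hypothetical file name: `for line in head_gz("sample1_ACAGTG.fastq.gz"): print(line)`.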
Raw Illumina data
(this one: 87196924 lines)
@HWI-ST571:202:D1B86ACXX:2:1102:1146:2155 1:N:0:ACAGTG
CCAACATCGAGGTCGCAATCTTTTTNANCGATATGAACTCTCCAAAAAAA
+
@@@FFFDFHHDG?FFHIIJJJJJIJ#1#1:BFFIGJJJJJIJJGIJJJJA
@HWI-ST571:202:D1B86ACXX:2:1102:1073:2240 1:N:0:ACAGTG
CGGAGCTGAAGGAGAAACTGAAATCCCTGCAATGTGAATTGTACGTTCTT
+
CCCFFFFFGGHHHIJJJJJJJIJFHIJIIIJJJJGIIIIIEFGHIFCHJI
@HWI-ST571:202:D1B86ACXX:2:1102:1385:2192 1:N:0:ACAGTG
GTTGGCAGCCCTGGAGCCCTGCCTCGGTGGTTTAGCCAGTACTAGGGGAT
+
CCCFFFFFHHHHHJJJIJJJJJJGIJJCGHFHIGIHJJJBDHGHHJJJIE
@HWI-ST571:202:D1B86ACXX:2:1102:1352:2244 1:N:0:ACAGTG
ATTTCCTCTTATTTACGTTGCTTTAAAGCGAGACTTCAACGCCATTTGAC
+
@@CFFFFFHHFHDFGHIJIIJGIJGGEHGGJB>??FHHGFFFGHIGIECF
@HWI-ST571:202:D1B86ACXX:2:1102:1981:2152 1:N:0:ACAGTG
CATCGAAGCAAAGCATATAAAGTTANTNNTNNCTGAGTTGTACATATTGC
+
??;;D?DB6CDB+<EFE>:AFA443#2##1##11)0:0?9**0??DAGI4
@HWI-ST571:202:D1B86ACXX:2:1102:1877:2165 1:N:0:ACAGTG
GAAGTGCCCCGCTGGCAGCACACAAGGAGCAGCCCGCTGCCGGACCACTC
+
?@@DDDADFFAA:CEGHBFGAHGD?F@BE9BFF?D@F;'-8AG<B92=;;

One read takes (minimum) 4 lines: the identifier, the
sequence, a '+' line, and the certainty of reading each
base at that position ('quality').

http://wiki.bits.vib.be/index.php/.fas
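The record layout above can be read with a few lines of Python. This is a sketch under two assumptions that the slides do not state: every read takes exactly 4 lines, and qualities use the Phred+33 encoding (quality character code minus 33). Under the 4-line assumption, the 87196924 lines correspond to 21799231 reads. The names `Read`, `parse_fastq`, `phred_scores` and `error_probability` are illustrative.

```python
import io
from typing import NamedTuple


class Read(NamedTuple):
    name: str   # identifier line without the leading '@'
    seq: str    # base calls
    qual: str   # one quality character per base


def parse_fastq(handle):
    """Yield Read tuples from a FASTQ stream with 4 lines per record.

    Lines are counted rather than searched for '@', because a quality
    line may itself start with '@' (as in the first record above).
    """
    while True:
        header = handle.readline()
        if header == "":
            return                      # end of file
        if header.strip() == "":
            continue                    # tolerate blank lines between records
        seq = handle.readline().rstrip("\n")
        handle.readline()               # the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield Read(header.rstrip("\n")[1:], seq, qual)


def phred_scores(qual, offset=33):
    """Decode a quality string into integer scores (offset 33 is assumed)."""
    return [ord(char) - offset for char in qual]


def error_probability(score):
    """Phred definition: Q = -10 * log10(p), so p = 10 ** (-Q / 10)."""
    return 10 ** (-score / 10)


# The first record from the slide.
record = io.StringIO(
    "@HWI-ST571:202:D1B86ACXX:2:1102:1146:2155 1:N:0:ACAGTG\n"
    "CCAACATCGAGGTCGCAATCTTTTTNANCGATATGAACTCTCCAAAAAAA\n"
    "+\n"
    "@@@FFFDFHHDG?FFHIIJJJJJIJ#1#1:BFFIGJJJJJIJJGIJJJJA\n"
)
first = next(parse_fastq(record))
scores = phred_scores(first.qual)
```

In this record the first `N` (position 26) carries the quality character `#`, which decodes to a score of 2 under the assumed offset: the sequencer is flagging that it could not call this base.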
Exploring the raw data
1) Check whether the Fastq file is consistent
2) Make graphs of some metrics of the raw data

http://wiki.bits.vib.be/index.php/.fastq
http://wiki.bits.vib.be/index.php/RNAseq_toolbox#Quality_control_and_visualization_of_raw_reads
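What "consistent" means in step 1 can be made concrete with a few checks per record. The checks below are a reasonable minimum chosen for illustration (they are not the list used by any particular grooming tool), and the function name `check_record` is an assumption.

```python
def check_record(header, seq, plus, qual):
    """Return a list of problems found in one 4-line FASTQ record.

    An empty list means the record passed all checks.
    """
    problems = []
    if not header.startswith("@"):
        problems.append("header does not start with '@'")
    if not plus.startswith("+"):
        problems.append("third line does not start with '+'")
    if len(seq) != len(qual):
        problems.append("sequence and quality lengths differ")
    if set(seq.upper()) - set("ACGTN"):
        problems.append("unexpected characters in sequence")
    # Quality characters are printable ASCII, codes 33 ('!') to 126 ('~').
    if any(not 33 <= ord(char) <= 126 for char in qual):
        problems.append("quality characters outside printable ASCII")
    return problems
```

For example, `check_record("@r1", "ACGT", "+", "III")` reports that the sequence and quality lengths differ.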
FastQC – graphical exploration

http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
FastQC – perfect example

Reads have good quality!
FastQC – perfect example
Anna Karenina principle: “There is only one way
to be good, but there are many ways to be
wrong.”

We will start by showing a good sample.
Afterwards we will discuss a less good sample.

http://en.wikipedia.org/wiki/Anna_Karenina_principle
FastQC – perfect example

Smooth histogram/density line towards the right.
FastQC – perfect example

Steady nucleotide distribution.

Bias typical for Illumina.
FastQC – perfect example

GC content not strongly fluctuating.

Bias typical for Illumina.
FastQC – perfect example

GC content nicely bell-shaped.
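The per-read GC distribution behind this plot can be approximated by computing the GC fraction of every read and binning the results. A minimal sketch; the function names and the choice to ignore `N` bases are assumptions, and FastQC's own implementation may differ in detail.

```python
from collections import Counter


def gc_fraction(seq):
    """Fraction of G and C among the called bases (N is ignored)."""
    called = [base for base in seq.upper() if base in "ACGT"]
    if not called:
        return 0.0
    return sum(base in "GC" for base in called) / len(called)


def gc_histogram(seqs):
    """Count reads per rounded GC percentage (0 to 100)."""
    return Counter(round(100 * gc_fraction(seq)) for seq in seqs)
```

A roughly bell-shaped histogram is what the good sample shows; a second peak or a strong shift towards AT would be a reason to look closer.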
FastQC – perfect example

No N's! (This should ring a bell.)
FastQC – perfect example

All reads have length 50bp.
FastQC – perfect example

Reads are nicely duplicated: some amount of duplication
is to be expected in RNA-seq data.
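A rough feeling for the duplication level can be obtained by counting identical read sequences. This is a simplified sketch and not FastQC's own estimate, which is computed differently; the function name is an assumption.

```python
from collections import Counter


def duplication_rate(seqs):
    """Fraction of reads that are repeats of a sequence already seen."""
    counts = Counter(seqs)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 1 - len(counts) / total
```

For four reads of which two are identical, `duplication_rate(["A", "A", "C", "G"])` gives 0.25: one of the four reads is a repeat.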
FastQC – perfect example

Kmers are short sequence stretches. Sometimes they are
overrepresented. But in RNA-seq this is not so important
(duplication).
FastQC – less good RNA-seq sample

A relatively large portion of the reads have mistakes at
the 3' end of the read.
FastQC – less good RNA-seq sample

There is an overrepresentation of reads with a low mean
quality score.
FastQC – less good RNA-seq sample

Not a steady level of the different nucleotide fractions.
FastQC – less good RNA-seq sample

GC content fluctuates.
FastQC – less good RNA-seq sample

Heavily skewed towards AT-rich reads.
FastQC – less good RNA-seq sample

Apparently a mixture of two sets of reads with different
lengths.
FastQC – less good RNA-seq sample

Duplication seems a bit on the low side (reported figures
are from 60-75%).
FastQC – less good RNA-seq sample

Very highly skewed read number. Often the sequence of the
Truseq adaptor, or multiplex identifiers, can be found
here. BLAST can reveal more information!
FastQC – less good RNA-seq sample

Specific patterns of specific kmers. Note: A and T rich.
Quality control of raw data
Proceed? Or rerun?
This QC can guide you to the preprocessing steps that
you definitely need to apply. The extra time and
money needed to correct the biases can sometimes
justify a rerun of the experiment.
This QC also shows which preprocessing steps have
already been applied by the sequencing provider.
Preprocessing
Removing unwanted parts of the raw data so it helps as
much as possible with reaching our goal: defining
differentially expressed genes.
1) removing technical contamination
● Low quality read parts
● Technical sequences: adaptors
● PhiX internal control sequences
2) removing biological contamination
● polyA-tails
● rRNA sequences
● mtDNA sequences
After this, we run FastQC again.
Technical contamination
Our goal is to detect differential expression (DE). For this we
need to assign reads with high confidence to the
correct genomic location.
Removal of low quality read parts: they have a
higher chance of containing errors, and cause noise in
our read counts.
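Removing low quality read parts often comes down to cutting bases from the 3' end, where quality tends to drop. The sketch below uses a deliberately simple rule (cut trailing bases while their score is below a threshold); it is not the algorithm of fastq-mcf or any other tool, and the threshold of 20 and the Phred+33 offset are assumptions.

```python
def trim_3prime(seq, qual, min_q=20, offset=33):
    """Cut bases from the 3' end while their quality score is below min_q.

    Returns the trimmed sequence and the matching quality string.
    """
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]
```

For example, `trim_3prime("ACGTACGT", "IIIII###")` keeps `ACGTA`: the three trailing bases have quality `#`, a score of 2 under the assumed offset. Reads that become very short after trimming are usually discarded, since they are hard to position uniquely.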
Technical contamination
Our goal is to detect differential expression (DE). For this we
need to assign reads with high confidence to the
correct genomic location.
Removal of adaptor sequences (and other
technical sequences, such as multiplex identifiers) as they
cannot be mapped to the reference genome.
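The idea of adaptor removal can be illustrated with exact string matching: cut the read where the adaptor starts, including the case where only the beginning of the adaptor fits at the 3' end. This sketch is not meant to reproduce a real clipping tool, which also tolerates sequencing errors. The adaptor string in the usage note is only an example; take the actual sequences from the documentation of your library kit. `clip_adapter` and `min_overlap` are illustrative names.

```python
def clip_adapter(seq, adapter, min_overlap=3):
    """Remove an adaptor (exact matches only) from the 3' side of a read."""
    # Case 1: the complete adaptor occurs somewhere in the read.
    pos = seq.find(adapter)
    if pos != -1:
        return seq[:pos]
    # Case 2: the read ends with the first few bases of the adaptor.
    # Try the longest possible overlap first.
    longest = min(len(adapter) - 1, len(seq))
    for overlap in range(longest, min_overlap - 1, -1):
        if seq.endswith(adapter[:overlap]):
            return seq[:len(seq) - overlap]
    return seq
```

With the example adaptor `AGATCGGAAGAGC`, the read `ACGTACGTAGATC` is clipped to `ACGTACGT`. The quality string must be cut to the same length.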
Technical contamination
● List of technical sequences
● Advised to use defaults

http://code.google.com/p/ea-utils/wiki/FastqMcf
Fastq-mcf output

http://code.google.com/p/ea-utils/wiki/FastqMcf
Technical contamination
● Never remove duplicate reads! Highly expressed
genes can have genuine duplicate reads, which are
not due to the PCR amplification step in the
protocol.
● PhiX sequences: the DNA of the Phi X bacteriophage
is spiked in to monitor and optimize sequencing on
Illumina machines. Your sequencing provider
should filter out those sequences before delivery.
You can filter them out by aligning your reads to the
PhiX genome.

http://en.wikipedia.org/wiki/Phi_X_174
Biological contamination
[Figure: a cell, with nucleus and mitochondria]
● Mitochondria contain rRNA, mRNA and mtDNA
● rRNA and non-coding RNA (95% of RNA)
● mRNA (5% of RNA)
Biological contamination
mRNAs are captured with oligo-dT coated beads.
Occasionally, non-protein coding sequences are also
captured (especially since mtRNA and rRNA can be
relatively rich in AT).
We can remove them via homology searching (BLAST)
with known non-protein coding sequences.
[Figure labels: mitochondrial, rRNA and nc; mRNA (5% of RNA)]
Biological contamination
[Figure: transcript ending in AAAAAAAAAAAAA]
mRNAs are post-transcriptionally modified: e.g. the
addition of a poly-A tail. If our goal is to map the reads
to a reference genome sequence, the polyA tails
should be removed. This can be viewed as some
source of 'biological contamination' in our
sequences (…).
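Removing a polyA tail from a read can be sketched as stripping a run of A's at the 3' end when it is long enough to be a tail rather than ordinary sequence. The minimum length of 10 is an arbitrary assumption, and the sketch only looks at the 3' end of the read as given (depending on the protocol, tails may also show up as T's at the start of a read).

```python
def trim_poly_a(seq, qual, min_len=10):
    """Strip a trailing run of at least min_len A's, with its qualities."""
    stripped = seq.rstrip("A")
    tail_length = len(seq) - len(stripped)
    if tail_length >= min_len:
        return stripped, qual[:len(stripped)]
    return seq, qual
```

For example, a read `ACGT` followed by twelve A's is reduced to `ACGT`, while `ACGTAA` is left unchanged.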
Biological contamination
● Get the non-protein coding sequences via
Biomart.
● Mitochondrial genome sequence also.
Filter the biological contamination
Map your reads against the biological contaminant
sequences imported via Biomart.
We are interested in the
reads that don't map!
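Keeping "the reads that don't map" is a simple filter once the names of the mapped reads are known. In this sketch the reads are `(name, sequence, quality)` tuples and `mapped_names` is assumed to come from the aligner's output; how you extract those names depends on the aligner and is not shown.

```python
def keep_unmapped(reads, mapped_names):
    """Return the reads whose name is not among the mapped read names."""
    mapped = set(mapped_names)   # set lookup keeps the filter fast
    return [read for read in reads if read[0] not in mapped]
```

In practice many aligners can write the unaligned reads to a separate file themselves, which avoids this extra step.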
Doing this in Galaxy
Useful: take a sample of your reads: fastq-to-tabular,
select random lines, tabular-to-fastq
1. Create a new history
2. Load the sample data
3. Run fastqMcf to remove technical sequences
4. Run bowtie to match against biological sequence
databases, and keep reads that don't match.
5. Summarize: fastqc
→ make a workflow of this sample history.
→ run the workflow on all your samples in parallel
→ store the cleaned reads in a data library.
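Outside Galaxy, the "take a sample of your reads" step can be sketched with reservoir sampling, which draws a uniform random sample in a single pass without loading the whole file. The function name, the fixed seed and the record representation are assumptions made for illustration.

```python
import random


def sample_reads(records, k, seed=0):
    """Draw k records uniformly at random from an iterable of any length."""
    rng = random.Random(seed)    # fixed seed makes the sample reproducible
    sample = []
    for index, record in enumerate(records):
        if index < k:
            sample.append(record)        # fill the reservoir first
        else:
            # Replace a random slot with probability k / (index + 1).
            slot = rng.randint(0, index)
            if slot < k:
                sample[slot] = record
    return sample
```

Each record should be a complete read (all 4 lines together), so that the sample is still a valid FASTQ file when written out again.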
Summary preprocessing
Your reads
→ Format consistent? Errors in quality? …
Your groomed reads
→ Trends in raw data? QC report …
→ Get technical contaminants
- …
Your groomed reads without technical contamination
→ Get biological contaminants
- …
- …
Your groomed reads without technical
and biological contamination
→ How does your data look now? QC …
Keywords
Paired end
Stranded reads
gzip
fastq
Biological contamination
Technical contamination
Adapter sequence

Write in your own words what the terms mean
Exercise
→ investigating and preprocessing raw RNA-seq data
Break

Contenu connexe

Tendances

Viral Metagenomics (CABBIO 20150629 Buenos Aires)
Viral Metagenomics (CABBIO 20150629 Buenos Aires)Viral Metagenomics (CABBIO 20150629 Buenos Aires)
Viral Metagenomics (CABBIO 20150629 Buenos Aires)bedutilh
 
Tools for Transcriptome Data Analysis
Tools for Transcriptome Data AnalysisTools for Transcriptome Data Analysis
Tools for Transcriptome Data AnalysisSANJANA PANDEY
 
Next-generation sequencing and quality control: An Introduction (2016)
Next-generation sequencing and quality control: An Introduction (2016)Next-generation sequencing and quality control: An Introduction (2016)
Next-generation sequencing and quality control: An Introduction (2016)Sebastian Schmeier
 
Sequence alig Sequence Alignment Pairwise alignment:-
Sequence alig Sequence Alignment Pairwise alignment:-Sequence alig Sequence Alignment Pairwise alignment:-
Sequence alig Sequence Alignment Pairwise alignment:-naveed ul mushtaq
 
Pairwise sequence alignment
Pairwise sequence alignmentPairwise sequence alignment
Pairwise sequence alignmentavrilcoghlan
 
Transcriptome Analysis & Applications
Transcriptome Analysis & ApplicationsTranscriptome Analysis & Applications
Transcriptome Analysis & Applications1010Genome Pte Ltd
 
Flash introduction to Qiime2 -- 16S Amplicon analysis
Flash introduction to Qiime2 -- 16S Amplicon analysisFlash introduction to Qiime2 -- 16S Amplicon analysis
Flash introduction to Qiime2 -- 16S Amplicon analysisAndrea Telatin
 
NGS data formats and analyses
NGS data formats and analysesNGS data formats and analyses
NGS data formats and analysesrjorton
 
Ppt of genome annotatioon 2
Ppt of genome annotatioon 2Ppt of genome annotatioon 2
Ppt of genome annotatioon 2shivangi1singh
 
Introduction to 16S Analysis with NGS - BMR Genomics
Introduction to 16S Analysis with NGS - BMR GenomicsIntroduction to 16S Analysis with NGS - BMR Genomics
Introduction to 16S Analysis with NGS - BMR GenomicsAndrea Telatin
 
RNA-seq quality control and pre-processing
RNA-seq quality control and pre-processingRNA-seq quality control and pre-processing
RNA-seq quality control and pre-processingmikaelhuss
 

Tendances (20)

Basics of Genome Assembly
Basics of Genome Assembly Basics of Genome Assembly
Basics of Genome Assembly
 
Viral Metagenomics (CABBIO 20150629 Buenos Aires)
Viral Metagenomics (CABBIO 20150629 Buenos Aires)Viral Metagenomics (CABBIO 20150629 Buenos Aires)
Viral Metagenomics (CABBIO 20150629 Buenos Aires)
 
Introduction to next generation sequencing
Introduction to next generation sequencingIntroduction to next generation sequencing
Introduction to next generation sequencing
 
Tools for Transcriptome Data Analysis
Tools for Transcriptome Data AnalysisTools for Transcriptome Data Analysis
Tools for Transcriptome Data Analysis
 
Next-generation sequencing and quality control: An Introduction (2016)
Next-generation sequencing and quality control: An Introduction (2016)Next-generation sequencing and quality control: An Introduction (2016)
Next-generation sequencing and quality control: An Introduction (2016)
 
real time-PCR..
real time-PCR..real time-PCR..
real time-PCR..
 
Real-Time PCR
Real-Time PCRReal-Time PCR
Real-Time PCR
 
Sequence alig Sequence Alignment Pairwise alignment:-
Sequence alig Sequence Alignment Pairwise alignment:-Sequence alig Sequence Alignment Pairwise alignment:-
Sequence alig Sequence Alignment Pairwise alignment:-
 
Pairwise sequence alignment
Pairwise sequence alignmentPairwise sequence alignment
Pairwise sequence alignment
 
Transcriptome Analysis & Applications
Transcriptome Analysis & ApplicationsTranscriptome Analysis & Applications
Transcriptome Analysis & Applications
 
Flash introduction to Qiime2 -- 16S Amplicon analysis
Flash introduction to Qiime2 -- 16S Amplicon analysisFlash introduction to Qiime2 -- 16S Amplicon analysis
Flash introduction to Qiime2 -- 16S Amplicon analysis
 
RNA-Seq
RNA-SeqRNA-Seq
RNA-Seq
 
Protein Databases
Protein DatabasesProtein Databases
Protein Databases
 
Primer designing for pcr and qpcr and their applications
Primer designing for pcr and qpcr and their applicationsPrimer designing for pcr and qpcr and their applications
Primer designing for pcr and qpcr and their applications
 
NGS data formats and analyses
NGS data formats and analysesNGS data formats and analyses
NGS data formats and analyses
 
RT-PCR
RT-PCRRT-PCR
RT-PCR
 
Ppt of genome annotatioon 2
Ppt of genome annotatioon 2Ppt of genome annotatioon 2
Ppt of genome annotatioon 2
 
Introduction to 16S Analysis with NGS - BMR Genomics
Introduction to 16S Analysis with NGS - BMR GenomicsIntroduction to 16S Analysis with NGS - BMR Genomics
Introduction to 16S Analysis with NGS - BMR Genomics
 
Express sequence tags
Express sequence tagsExpress sequence tags
Express sequence tags
 
RNA-seq quality control and pre-processing
RNA-seq quality control and pre-processingRNA-seq quality control and pre-processing
RNA-seq quality control and pre-processing
 

En vedette

Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...
Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...
Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...VHIR Vall d’Hebron Institut de Recerca
 
RNA-seq for DE analysis: detecting differential expression - part 5
RNA-seq for DE analysis: detecting differential expression - part 5RNA-seq for DE analysis: detecting differential expression - part 5
RNA-seq for DE analysis: detecting differential expression - part 5BITS
 
RNA-seq: general concept, goal and experimental design - part 1
RNA-seq: general concept, goal and experimental design - part 1RNA-seq: general concept, goal and experimental design - part 1
RNA-seq: general concept, goal and experimental design - part 1BITS
 
RNA-seq for DE analysis: extracting counts and QC - part 4
RNA-seq for DE analysis: extracting counts and QC - part 4RNA-seq for DE analysis: extracting counts and QC - part 4
RNA-seq for DE analysis: extracting counts and QC - part 4BITS
 
BITS - Comparative genomics on the genome level
BITS - Comparative genomics on the genome levelBITS - Comparative genomics on the genome level
BITS - Comparative genomics on the genome levelBITS
 
Productivity tips - Introduction to linux for bioinformatics
Productivity tips - Introduction to linux for bioinformaticsProductivity tips - Introduction to linux for bioinformatics
Productivity tips - Introduction to linux for bioinformaticsBITS
 
The structure of Linux - Introduction to Linux for bioinformatics
The structure of Linux - Introduction to Linux for bioinformaticsThe structure of Linux - Introduction to Linux for bioinformatics
The structure of Linux - Introduction to Linux for bioinformaticsBITS
 
Text mining on the command line - Introduction to linux for bioinformatics
Text mining on the command line - Introduction to linux for bioinformaticsText mining on the command line - Introduction to linux for bioinformatics
Text mining on the command line - Introduction to linux for bioinformaticsBITS
 
RNA-seq for DE analysis: the biology behind observed changes - part 6
RNA-seq for DE analysis: the biology behind observed changes - part 6RNA-seq for DE analysis: the biology behind observed changes - part 6
RNA-seq for DE analysis: the biology behind observed changes - part 6BITS
 
Managing your data - Introduction to Linux for bioinformatics
Managing your data - Introduction to Linux for bioinformaticsManaging your data - Introduction to Linux for bioinformatics
Managing your data - Introduction to Linux for bioinformaticsBITS
 
Deep learning with Tensorflow in R
Deep learning with Tensorflow in RDeep learning with Tensorflow in R
Deep learning with Tensorflow in Rmikaelhuss
 
BITS - Protein inference from mass spectrometry data
BITS - Protein inference from mass spectrometry dataBITS - Protein inference from mass spectrometry data
BITS - Protein inference from mass spectrometry dataBITS
 
BITS - Genevestigator to easily access transcriptomics data
BITS - Genevestigator to easily access transcriptomics dataBITS - Genevestigator to easily access transcriptomics data
BITS - Genevestigator to easily access transcriptomics dataBITS
 
BITS - Comparative genomics: the Contra tool
BITS - Comparative genomics: the Contra toolBITS - Comparative genomics: the Contra tool
BITS - Comparative genomics: the Contra toolBITS
 
wings2014 Workshop 1 Design, sequence, align, count, visualize
wings2014 Workshop 1 Design, sequence, align, count, visualizewings2014 Workshop 1 Design, sequence, align, count, visualize
wings2014 Workshop 1 Design, sequence, align, count, visualizeAnn Loraine
 
Part 1 of RNA-seq for DE analysis: Defining the goal
Part 1 of RNA-seq for DE analysis: Defining the goalPart 1 of RNA-seq for DE analysis: Defining the goal
Part 1 of RNA-seq for DE analysis: Defining the goalJoachim Jacob
 
BITS - Introduction to comparative genomics
BITS - Introduction to comparative genomicsBITS - Introduction to comparative genomics
BITS - Introduction to comparative genomicsBITS
 
Introduction to Linux for bioinformatics
Introduction to Linux for bioinformaticsIntroduction to Linux for bioinformatics
Introduction to Linux for bioinformaticsBITS
 
Kogo 2013 RNA-seq analysis
Kogo 2013 RNA-seq analysisKogo 2013 RNA-seq analysis
Kogo 2013 RNA-seq analysisJunsu Ko
 

En vedette (20)

Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...
Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...
Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...
 
RNA-seq for DE analysis: detecting differential expression - part 5
RNA-seq for DE analysis: detecting differential expression - part 5RNA-seq for DE analysis: detecting differential expression - part 5
RNA-seq for DE analysis: detecting differential expression - part 5
 
RNA-seq: general concept, goal and experimental design - part 1
RNA-seq: general concept, goal and experimental design - part 1RNA-seq: general concept, goal and experimental design - part 1
RNA-seq: general concept, goal and experimental design - part 1
 
RNA-seq for DE analysis: extracting counts and QC - part 4
RNA-seq for DE analysis: extracting counts and QC - part 4RNA-seq for DE analysis: extracting counts and QC - part 4
RNA-seq for DE analysis: extracting counts and QC - part 4
 
BITS - Comparative genomics on the genome level
BITS - Comparative genomics on the genome levelBITS - Comparative genomics on the genome level
BITS - Comparative genomics on the genome level
 
Productivity tips - Introduction to linux for bioinformatics
Productivity tips - Introduction to linux for bioinformaticsProductivity tips - Introduction to linux for bioinformatics
Productivity tips - Introduction to linux for bioinformatics
 
The structure of Linux - Introduction to Linux for bioinformatics
The structure of Linux - Introduction to Linux for bioinformaticsThe structure of Linux - Introduction to Linux for bioinformatics
The structure of Linux - Introduction to Linux for bioinformatics
 
Text mining on the command line - Introduction to linux for bioinformatics
Text mining on the command line - Introduction to linux for bioinformaticsText mining on the command line - Introduction to linux for bioinformatics
Text mining on the command line - Introduction to linux for bioinformatics
 
RNA-seq for DE analysis: the biology behind observed changes - part 6
RNA-seq for DE analysis: the biology behind observed changes - part 6RNA-seq for DE analysis: the biology behind observed changes - part 6
RNA-seq for DE analysis: the biology behind observed changes - part 6
 
Managing your data - Introduction to Linux for bioinformatics
Managing your data - Introduction to Linux for bioinformaticsManaging your data - Introduction to Linux for bioinformatics
Managing your data - Introduction to Linux for bioinformatics
 
Deep learning with Tensorflow in R
Deep learning with Tensorflow in RDeep learning with Tensorflow in R
Deep learning with Tensorflow in R
 
BITS - Protein inference from mass spectrometry data
BITS - Protein inference from mass spectrometry dataBITS - Protein inference from mass spectrometry data
BITS - Protein inference from mass spectrometry data
 
BITS - Genevestigator to easily access transcriptomics data
BITS - Genevestigator to easily access transcriptomics dataBITS - Genevestigator to easily access transcriptomics data
BITS - Genevestigator to easily access transcriptomics data
 
BITS - Comparative genomics: the Contra tool
BITS - Comparative genomics: the Contra toolBITS - Comparative genomics: the Contra tool
BITS - Comparative genomics: the Contra tool
 
RNA-seq Analysis
RNA-seq AnalysisRNA-seq Analysis
RNA-seq Analysis
 
wings2014 Workshop 1 Design, sequence, align, count, visualize
wings2014 Workshop 1 Design, sequence, align, count, visualizewings2014 Workshop 1 Design, sequence, align, count, visualize
wings2014 Workshop 1 Design, sequence, align, count, visualize
 
Part 1 of RNA-seq for DE analysis: Defining the goal
Part 1 of RNA-seq for DE analysis: Defining the goalPart 1 of RNA-seq for DE analysis: Defining the goal
Part 1 of RNA-seq for DE analysis: Defining the goal
 
BITS - Introduction to comparative genomics
BITS - Introduction to comparative genomicsBITS - Introduction to comparative genomics
BITS - Introduction to comparative genomics
 
Introduction to Linux for bioinformatics
Introduction to Linux for bioinformaticsIntroduction to Linux for bioinformatics
Introduction to Linux for bioinformatics
 
Kogo 2013 RNA-seq analysis
Kogo 2013 RNA-seq analysisKogo 2013 RNA-seq analysis
Kogo 2013 RNA-seq analysis
 

Similaire à RNA-seq: analysis of raw data and preprocessing - part 2

Part 2 of RNA-seq for DE analysis: Investigating raw data
Part 2 of RNA-seq for DE analysis: Investigating raw dataPart 2 of RNA-seq for DE analysis: Investigating raw data
Part 2 of RNA-seq for DE analysis: Investigating raw dataJoachim Jacob
 
Dgaston dec-06-2012
Dgaston dec-06-2012Dgaston dec-06-2012
Dgaston dec-06-2012Dan Gaston
 
Assembly and finishing
Assembly and finishingAssembly and finishing
Assembly and finishingNikolay Vyahhi
 
Introducing data analysis: reads to results
Introducing data analysis: reads to resultsIntroducing data analysis: reads to results
Introducing data analysis: reads to resultsAGRF_Ltd
 
RNA sequencing analysis tutorial with NGS
RNA sequencing analysis tutorial with NGSRNA sequencing analysis tutorial with NGS
RNA sequencing analysis tutorial with NGSHAMNAHAMNA8
 
Bioinformatics workshop Sept 2014
Bioinformatics workshop Sept 2014Bioinformatics workshop Sept 2014
Bioinformatics workshop Sept 2014LutzFr
 
Genome in a Bottle - Towards new benchmarks for the “dark matter” of the huma...
Genome in a Bottle - Towards new benchmarks for the “dark matter” of the huma...Genome in a Bottle - Towards new benchmarks for the “dark matter” of the huma...
Genome in a Bottle - Towards new benchmarks for the “dark matter” of the huma...GenomeInABottle
 
Prokka - rapid bacterial genome annotation - ABPHM 2013
Prokka - rapid bacterial genome annotation - ABPHM 2013Prokka - rapid bacterial genome annotation - ABPHM 2013
Prokka - rapid bacterial genome annotation - ABPHM 2013Torsten Seemann
 
Massively Parallel Sequencing - integrating the Ion PGM™ sequencer into your ...
Massively Parallel Sequencing - integrating the Ion PGM™ sequencer into your ...Massively Parallel Sequencing - integrating the Ion PGM™ sequencer into your ...
Massively Parallel Sequencing - integrating the Ion PGM™ sequencer into your ...Thermo Fisher Scientific
 
Anne_Vaittinen_advanced_seminar_presentation
Anne_Vaittinen_advanced_seminar_presentationAnne_Vaittinen_advanced_seminar_presentation
Anne_Vaittinen_advanced_seminar_presentationAnne Vaittinen
 
1073958 wp guide-develop-pcr_primers_1012
1073958 wp guide-develop-pcr_primers_10121073958 wp guide-develop-pcr_primers_1012
1073958 wp guide-develop-pcr_primers_1012Elsa von Licy
 
20150601 bio sb_assembly_course
20150601 bio sb_assembly_course20150601 bio sb_assembly_course
20150601 bio sb_assembly_coursehansjansen9999
 
Genome in a bottle for amp GeT-RM 181030
Genome in a bottle for amp GeT-RM 181030Genome in a bottle for amp GeT-RM 181030
Genome in a bottle for amp GeT-RM 181030GenomeInABottle
 
rnaseq2015-02-18-170327193409.pdf
rnaseq2015-02-18-170327193409.pdfrnaseq2015-02-18-170327193409.pdf
rnaseq2015-02-18-170327193409.pdfPushpendra83
 
2012 10-24 - ngs webinar
2012 10-24 - ngs webinar2012 10-24 - ngs webinar
2012 10-24 - ngs webinarElsa von Licy
 
2013 pag-equine-workshop
2013 pag-equine-workshop2013 pag-equine-workshop
2013 pag-equine-workshopc.titus.brown
 
[2017-05-29] DNASmartTagger
[2017-05-29] DNASmartTagger [2017-05-29] DNASmartTagger
[2017-05-29] DNASmartTagger Eli Kaminuma
 

Similaire à RNA-seq: analysis of raw data and preprocessing - part 2 (20)

Part 2 of RNA-seq for DE analysis: Investigating raw data
Part 2 of RNA-seq for DE analysis: Investigating raw dataPart 2 of RNA-seq for DE analysis: Investigating raw data
Part 2 of RNA-seq for DE analysis: Investigating raw data
 
Dgaston dec-06-2012
Dgaston dec-06-2012Dgaston dec-06-2012
Dgaston dec-06-2012
 
Assembly and finishing
Assembly and finishingAssembly and finishing
Assembly and finishing
 
20140711 4 e_tseng_ercc2.0_workshop
20140711 4 e_tseng_ercc2.0_workshop20140711 4 e_tseng_ercc2.0_workshop
20140711 4 e_tseng_ercc2.0_workshop
 
Introducing data analysis: reads to results
Introducing data analysis: reads to resultsIntroducing data analysis: reads to results
Introducing data analysis: reads to results
 
RNA sequencing analysis tutorial with NGS
RNA sequencing analysis tutorial with NGSRNA sequencing analysis tutorial with NGS
RNA sequencing analysis tutorial with NGS
 
Bioinformatics workshop Sept 2014
Bioinformatics workshop Sept 2014Bioinformatics workshop Sept 2014
Bioinformatics workshop Sept 2014
 
Genome in a Bottle - Towards new benchmarks for the “dark matter” of the huma...
Genome in a Bottle - Towards new benchmarks for the “dark matter” of the huma...Genome in a Bottle - Towards new benchmarks for the “dark matter” of the huma...
Genome in a Bottle - Towards new benchmarks for the “dark matter” of the huma...
 
Prokka - rapid bacterial genome annotation - ABPHM 2013
Prokka - rapid bacterial genome annotation - ABPHM 2013Prokka - rapid bacterial genome annotation - ABPHM 2013
Prokka - rapid bacterial genome annotation - ABPHM 2013
 
Massively Parallel Sequencing - integrating the Ion PGM™ sequencer into your ...
Massively Parallel Sequencing - integrating the Ion PGM™ sequencer into your ...Massively Parallel Sequencing - integrating the Ion PGM™ sequencer into your ...
Massively Parallel Sequencing - integrating the Ion PGM™ sequencer into your ...
 
Anne_Vaittinen_advanced_seminar_presentation
Anne_Vaittinen_advanced_seminar_presentationAnne_Vaittinen_advanced_seminar_presentation
Anne_Vaittinen_advanced_seminar_presentation
 
1073958 wp guide-develop-pcr_primers_1012
1073958 wp guide-develop-pcr_primers_10121073958 wp guide-develop-pcr_primers_1012
1073958 wp guide-develop-pcr_primers_1012
 
20150601 bio sb_assembly_course
20150601 bio sb_assembly_course20150601 bio sb_assembly_course
20150601 bio sb_assembly_course
 
Genome in a bottle for amp GeT-RM 181030
Genome in a bottle for amp GeT-RM 181030Genome in a bottle for amp GeT-RM 181030
Genome in a bottle for amp GeT-RM 181030
 
rnaseq2015-02-18-170327193409.pdf
rnaseq2015-02-18-170327193409.pdfrnaseq2015-02-18-170327193409.pdf
rnaseq2015-02-18-170327193409.pdf
 
2012 10-24 - ngs webinar
2012 10-24 - ngs webinar2012 10-24 - ngs webinar
2012 10-24 - ngs webinar
 
RMR-Nirma-NGS-Heena.pdf
RMR-Nirma-NGS-Heena.pdfRMR-Nirma-NGS-Heena.pdf
RMR-Nirma-NGS-Heena.pdf
 
2013 pag-equine-workshop
2013 pag-equine-workshop2013 pag-equine-workshop
2013 pag-equine-workshop
 
[2017-05-29] DNASmartTagger
[2017-05-29] DNASmartTagger [2017-05-29] DNASmartTagger
[2017-05-29] DNASmartTagger
 
Gwas.emes.comp
Gwas.emes.compGwas.emes.comp
Gwas.emes.comp
 

RNA-seq: analysis of raw data and preprocessing - part 2

  • 1. Raw data investigation Joachim Jacob 20 and 27 January 2014 This presentation is available under the Creative Commons Attribution-ShareAlike 3.0 Unported License. Please refer to http://www.bits.vib.be/ if you use this presentation or parts hereof.
  • 2. Experimental setup We have decided on: ● how many samples per condition ● how deep to sequence. These choices, guided by experience and tools like Scotty, determine how reliable the statistics will be. A wrong experimental design cannot be fixed. Best approach: pilot data (3 samples per condition, 10M reads). But there are other sequencing options to choose from!
  • 3. PE versus SE Illumina ● Single end (SE): from each cDNA fragment only one end is read. ● Paired end (PE): the cDNA fragment is read from both ends. [Figure: purify and fragment, then SE or PE sequencing]
  • 4. PE versus SE Illumina Single end (SE): ● Gene level differential expression Paired end (PE): ● Novel splice junction detection ● De novo assembly of the transcriptome ● Helps with correctly positioning reads on the reference genome sequence. Note: PE is not the same as mate pairs.
  • 5. Strandedness ● Naive protocols obtain reads from cDNA fragments, BUT the link with the sense or antisense strand is broken. ● Stranded protocols generate reads from one strand, corresponding to the sense or antisense strand (depending on the protocol).
  • 7. Example of a stranded protocol ● dUTP protocol to generate stranded reads.
  • 8. Importance of strandedness ● Strandedness can bias the read counts compared to non-stranded protocols. ● Whether you should apply it depends on the genome: e.g. when genes overlap, the benefit of assigning reads to the correct genes can outweigh the technical variation.
  • 9. Length of the reads ● Does not matter so much (when we want to quantify by aligning to a reference sequence): 50 bp will do. ● The most important point is to be able to accurately position the read on the reference genome sequence, to assign it to the correct gene. ● Length can become important if you want to assemble the transcriptome.
  • 10. For DE on the gene level The 'cheapest' protocol for high-throughput sequencing suffices to achieve DE detection: ● SE ● 50bp ● Option: strandedness. Use the money you have left over for increasing the number of replicates.
  • 12. Raw Illumina data The data you get arrives as... [Figure: file listing, with labels marking the barcode and the experiment] Compressed, usually with gzip
  • 13. Raw Illumina data (this one: 87196924 lines) One read takes a minimum of 4 lines, including the sequence and the certainty of reading each base at its position ('quality'):
      @HWI-ST571:202:D1B86ACXX:2:1102:1146:2155 1:N:0:ACAGTG
      CCAACATCGAGGTCGCAATCTTTTTNANCGATATGAACTCTCCAAAAAAA
      +
      @@@FFFDFHHDG?FFHIIJJJJJIJ#1#1:BFFIGJJJJJIJJGIJJJJA
      @HWI-ST571:202:D1B86ACXX:2:1102:1073:2240 1:N:0:ACAGTG
      CGGAGCTGAAGGAGAAACTGAAATCCCTGCAATGTGAATTGTACGTTCTT
      +
      CCCFFFFFGGHHHIJJJJJJJIJFHIJIIIJJJJGIIIIIEFGHIFCHJI
      @HWI-ST571:202:D1B86ACXX:2:1102:1385:2192 1:N:0:ACAGTG
      GTTGGCAGCCCTGGAGCCCTGCCTCGGTGGTTTAGCCAGTACTAGGGGAT
      +
      CCCFFFFFHHHHHJJJIJJJJJJGIJJCGHFHIGIHJJJBDHGHHJJJIE
      @HWI-ST571:202:D1B86ACXX:2:1102:1352:2244 1:N:0:ACAGTG
      ATTTCCTCTTATTTACGTTGCTTTAAAGCGAGACTTCAACGCCATTTGAC
      +
      @@CFFFFFHHFHDFGHIJIIJGIJGGEHGGJB>??FHHGFFFGHIGIECF
      @HWI-ST571:202:D1B86ACXX:2:1102:1981:2152 1:N:0:ACAGTG
      CATCGAAGCAAAGCATATAAAGTTANTNNTNNCTGAGTTGTACATATTGC
      +
      ??;;D?DB6CDB+<EFE>:AFA443#2##1##11)0:0?9**0??DAGI4
      @HWI-ST571:202:D1B86ACXX:2:1102:1877:2165 1:N:0:ACAGTG
      GAAGTGCCCCGCTGGCAGCACACAAGGAGCAGCCCGCTGCCGGACCACTC
      +
      ?@@DDDADFFAA:CEGHBFGAHGD?F@BE9BFF?D@F;'-8AG<B92=;;
    http://wiki.bits.vib.be/index.php/.fas
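To make the four-line layout concrete, here is a minimal sketch that reads records from a FASTQ stream and converts the quality characters to numbers. It assumes exactly four lines per read and the Phred+33 encoding (quality character code minus 33), which the slides do not state explicitly. The function names `read_fastq` and `phred_scores` are made up for this illustration.

```python
import io


def read_fastq(handle):
    """Yield (identifier, sequence, quality) from a FASTQ stream.

    Assumes exactly four lines per read, as in the slide above.
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line, ignored here
        quality = handle.readline().rstrip("\n")
        yield header[1:], sequence, quality  # drop the leading '@'


def phred_scores(quality, offset=33):
    """Convert quality characters to integer scores (assumed Phred+33)."""
    return [ord(char) - offset for char in quality]


# The first read shown on the slide.
example = io.StringIO(
    "@HWI-ST571:202:D1B86ACXX:2:1102:1146:2155 1:N:0:ACAGTG\n"
    "CCAACATCGAGGTCGCAATCTTTTTNANCGATATGAACTCTCCAAAAAAA\n"
    "+\n"
    "@@@FFFDFHHDG?FFHIIJJJJJIJ#1#1:BFFIGJJJJJIJJGIJJJJA\n"
)
records = list(read_fastq(example))
identifier, sequence, quality = records[0]
scores = phred_scores(quality)
```

Under the Phred+33 assumption, the `#` characters at the N positions of this read decode to a score of 2, the lowest in the read. For a gzip-compressed file, the same generator can be fed with `gzip.open(path, "rt")`.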
  • 14. Exploring the raw data 1) Check whether the Fastq file is consistent. 2) Make graphs of some metrics of the raw data. http://wiki.bits.vib.be/index.php/.fastq http://wiki.bits.vib.be/index.php/RNAseq_toolbox#Quality_control_and_visualization_of_raw_reads
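A consistency check can be as simple as the sketch below. The three rules it tests (line count divisible by 4, header starting with `@`, sequence and quality of equal length) are a plausible minimum chosen for this illustration, not the checks of any specific tool. `check_fastq` is a hypothetical helper name.

```python
def check_fastq(lines):
    """Return a list of problems found in FASTQ lines (empty if none)."""
    problems = []
    if len(lines) % 4 != 0:
        problems.append("line count is not a multiple of 4")
    complete = len(lines) - len(lines) % 4  # ignore a trailing partial record
    for start in range(0, complete, 4):
        header, sequence, separator, quality = lines[start:start + 4]
        if not header.startswith("@"):
            problems.append(f"line {start + 1}: header does not start with '@'")
        if not separator.startswith("+"):
            problems.append(f"line {start + 3}: separator does not start with '+'")
        if len(sequence) != len(quality):
            problems.append(f"line {start + 2}: sequence and quality lengths differ")
    return problems


good = ["@read1", "ACGT", "+", "IIII"]
bad = ["@read1", "ACGT", "+", "III"]
good_report = check_fastq(good)
bad_report = check_fastq(bad)

# The example file on the previous slide has 87196924 lines,
# which is consistent with 4 lines per read:
reads_in_example = 87196924 // 4
```

With four lines per read, the 87196924 lines of the example file correspond to 21799231 reads.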
  • 15. FastQC – graphical exploration http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
  • 16. FastQC – perfect example Reads have good quality!
  • 17. FastQC – perfect example Anna Karenina principle: “There is only one way to be good, but there are many ways to be wrong.” We will start by showing a good sample. Afterwards we will discuss a less good sample. http://en.wikipedia.org/wiki/Anna_Karenina_principle
  • 18. FastQC – perfect example Smooth histogram/ density line towards the right,
  • 19. FastQC – perfect example Steady nucleotide distribution. Bias typical for Illumina
  • 20. FastQC – perfect example Not strongly fluctuating GC content. Bias typical for Illumina
  • 21. FastQC – perfect example GC-content nicely bell shaped
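The per-read GC content behind this plot can be computed as in the sketch below. Skipping N calls in the denominator is a choice made for this illustration and is not necessarily how FastQC handles them. `gc_fraction` is a hypothetical helper name.

```python
def gc_fraction(sequence):
    """Fraction of G and C among the called bases (A, C, G, T) of a read."""
    called = [base for base in sequence.upper() if base in "ACGT"]
    if not called:
        return 0.0  # read consists only of N or other symbols
    return sum(base in "GC" for base in called) / len(called)


# Hypothetical reads, just to show the call.
fractions = [gc_fraction(read) for read in ["GGCC", "ATGC", "AATT", "GCNN"]]
```

A histogram of these fractions over all reads gives the kind of curve that should look bell shaped in a good sample.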
  • 22. FastQC – perfect example No N's! (should ring something)
  • 23. FastQC – perfect example All reads have length 50bp,
  • 24. FastQC – perfect example Reads are nicely duplicated: some amount of duplication is to be expected in RNA-seq data.
  • 26. FastQC – perfect example Kmers are short sequence stretches. Sometimes they are overrepresented. But in RNA-seq this is not so important (duplication).
  • 27. FastQC – less good RNA-seq sample A relatively large portion of the reads has mistakes at the 3' end of the read.
  • 28. FastQC – less good RNA-seq sample There is an overrepresentation of reads with a low mean quality score
  • 29. FastQC – less good RNA-seq sample Not a steady level of different nucleotide fractions
  • 30. FastQC – less good RNA-seq sample Fluctuates
  • 31. FastQC – less good RNA-seq sample Heavily skewed towards AT-rich reads
  • 32. FastQC – less good RNA-seq sample Apparently a mixture of two sets of reads with different lengths
  • 33. FastQC – less good RNA-seq sample Duplication seems a bit on the low side (reported figures range from 60 to 75%)
  • 34. FastQC – less good RNA-seq sample Very highly skewed read numbers. Often the sequence of a TruSeq adaptor or of multiplex identifiers can be found here. BLAST can reveal more information!
  • 35. FastQC – less good RNA-seq sample Specific patterns of specific kmers. Note: A and T rich
  • 36. Quality control of raw data Proceed? Or rerun? This QC can guide you to which preprocessing steps you need to apply for sure. The extra time and money needed to correct the biases can sometimes justify a rerun of the experiment. This QC shows which preprocessing steps have already been made by the sequencing provider.
  • 37. Preprocessing Removing unwanted parts of the raw data so it helps as much as possible with reaching our goal: defining differentially expressed genes. 1) removing technical contamination ● Low quality read parts ● Technical sequences: adaptors ● PhiX internal control sequences 2) removing biological contamination ● polyA-tails ● rRNA sequences ● mtDNA sequences After this, we run FastQC again.
  • 38. Technical contamination Our goal is to detect differential expression (DE); for this we need to assign reads with high confidence to the correct genomic location. Removal of low quality read parts: they have a higher chance of containing errors, and cause noise in our read counts.
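Removing low quality read parts can be sketched as follows. This is a deliberately simple rule: bases are dropped from the 3' end while their score is below a threshold. It is not the algorithm of any particular trimming tool. The threshold of 20 and the Phred+33 offset are assumptions, and `trim_3prime` is a hypothetical helper name.

```python
def trim_3prime(sequence, quality, min_phred=20, offset=33):
    """Drop bases from the 3' end while their quality is below min_phred."""
    end = len(sequence)
    while end > 0 and ord(quality[end - 1]) - offset < min_phred:
        end -= 1
    return sequence[:end], quality[:end]


# Hypothetical read: the last two bases have quality '#' (score 2).
trimmed = trim_3prime("ACGTAC", "IIII##")
untouched = trim_3prime("ACGT", "IIII")
emptied = trim_3prime("ACGT", "####")
```

Reads that become very short after trimming are usually better discarded, since short reads are hard to position uniquely.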
  • 41. Technical contamination Our goal is to detect differential expression (DE); for this we need to assign reads with high confidence to the correct genomic location. Removal of adaptor sequences (and other technical sequences, such as multiplex identifiers), as they cannot be mapped to the reference genome.
  • 42. Technical contamination Our goal is to detect differential expression (DE); for this we need to assign reads with high confidence to the correct genomic location. Removal of adaptor sequences (and other technical sequences, such as multiplex identifiers), as they cannot be mapped to the reference genome. [Screenshot labels: list of technical sequences; advised to use defaults] http://code.google.com/p/ea-utils/wiki/FastqMcf
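The idea behind adaptor removal can be illustrated with exact string matching: cut the read where the adaptor starts, or where a prefix of the adaptor runs off the 3' end. Real tools such as fastq-mcf also tolerate mismatches, so this is only a sketch of the principle and not how that tool works internally. The adaptor string below is used purely as an illustration (take the sequences from your own library kit), and `clip_adapter` is a hypothetical helper name.

```python
def clip_adapter(read, adapter, min_overlap=3):
    """Remove an exactly matching 3' adaptor, or adaptor prefix, from a read."""
    # Case 1: the complete adaptor occurs somewhere in the read.
    position = read.find(adapter)
    if position != -1:
        return read[:position]
    # Case 2: the read ends with the first `overlap` bases of the adaptor.
    longest = min(len(adapter) - 1, len(read))
    for overlap in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[:-overlap]
    return read


ADAPTER = "AGATCGGAAGAGC"  # illustrative only; check your kit documentation

full_hit = clip_adapter("ACGTACGTAGATCGGAAGAGCTTT", ADAPTER)
partial_hit = clip_adapter("ACGTACGTAGATC", ADAPTER)
no_hit = clip_adapter("ACGTACGT", ADAPTER)
```

The `min_overlap` parameter limits false positives: a one or two base match at the end of a read happens too often by chance to be trusted.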
  • 44. Technical contamination ● Never remove duplicate reads! Highly expressed genes can have genuine duplicate reads, which are not due to the PCR amplification step in the protocol. ● PhiX sequences: the DNA of the Phi X bacteriophage is spiked in to monitor and optimize sequencing on Illumina machines. Your sequencing provider should filter out those sequences before delivery. You can filter them out by aligning your reads to the PhiX genome. http://en.wikipedia.org/wiki/Phi_X_174
  • 45. Biological contamination [Figure labels: cell; nucleus; mitochondria contain rRNA, mRNA and mtDNA; rRNA and non-coding RNA (95% of RNA); mRNA (5% of RNA)]
  • 46. Biological contamination mRNAs are captured with oligo-dT coated beads. Occasionally, non-protein coding sequences are also captured (especially since mtRNA and rRNA can be relatively rich in AT). We can remove them via homology searching (BLAST) with known non-protein coding sequences. [Figure labels: mitochondrial rRNA and non-coding RNA; mRNA (5% of RNA)]
  • 47. Biological contamination AAAAAAAAAAAAA mRNAs are post-transcriptionally modified: e.g. the addition of a poly-A tail. If our goal is to map the reads to a reference genome sequence, the polyA tails should be removed. This can be viewed as some source of 'biological contamination' in our sequences (…).
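PolyA tail removal can be sketched with a regular expression that matches a run of A's at the 3' end of the read. The minimum tail length of 10 is an arbitrary assumption for this example, and `trim_poly_a` is a hypothetical helper name. Depending on the library protocol the tail may show up as a run of T's instead, which this sketch does not handle.

```python
import re


def trim_poly_a(sequence, quality, min_tail=10):
    """Remove a run of at least min_tail A's at the 3' end of a read."""
    match = re.search(r"A{%d,}$" % min_tail, sequence)
    if match is None:
        return sequence, quality
    # Keep the quality string the same length as the trimmed sequence.
    return sequence[:match.start()], quality[:match.start()]


# Hypothetical reads.
with_tail = trim_poly_a("ACGT" + "A" * 12, "I" * 16)
short_run = trim_poly_a("ACGTAAA", "IIIIIII")
```

Requiring a minimum run length avoids trimming reads that merely end in a few genuine A's.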
  • 48. Biological contamination ● Get the non-protein coding sequences via Biomart. Mitochondrial genome sequence also.
  • 51. Filter the biological contamination [Screenshot labels: your reads; the biological reads, imported via Biomart] We are interested in the reads that don't map!
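If the aligner writes SAM output, the reads that do not map can be picked out through the FLAG column: bit 0x4 is set when a read is unmapped. The sketch below assumes tab-separated SAM text lines. The two alignment records and the reference name `rRNA_1` are invented for the example, and `unmapped_read_names` is a hypothetical helper name.

```python
def unmapped_read_names(sam_lines):
    """Return the names of reads whose SAM FLAG has the 0x4 (unmapped) bit set."""
    names = []
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        read_name, flag = fields[0], int(fields[1])
        if flag & 0x4:
            names.append(read_name)
    return names


# Invented SAM content: read1 maps to a contaminant sequence, read2 does not.
sam_example = [
    "@HD\tVN:1.0\tSO:unsorted",
    "read1\t0\trRNA_1\t100\t255\t4M\t*\t0\t0\tACGT\tIIII",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tTTGA\tIIII",
]
clean_reads = unmapped_read_names(sam_example)
```

Here `read2` is the read to keep, because it did not match any of the contaminant sequences.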
  • 53. Doing this in Galaxy Useful: first take a sample of your reads (fastq-to-tabular, select random lines, tabular-to-fastq). 1. Create a new history. 2. Load the sample data. 3. Run fastqMcf to remove technical sequences. 4. Run bowtie to match against biological sequence databases, and keep the reads that don't match. 5. Summarize with fastqc. → Make a workflow of this sample history. → Run the workflow on all your samples in parallel. → Store the cleaned reads in a data library.
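Outside Galaxy, the sampling step can be approximated with reservoir sampling, which draws a fixed number of reads in a single pass without loading the whole file into memory. This is a different technique from the fastq-to-tabular route above, shown only as a sketch. `sample_reads` and the toy records are hypothetical.

```python
import random


def sample_reads(records, k, seed=0):
    """Return k records chosen uniformly at random (reservoir sampling)."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    reservoir = []
    for index, record in enumerate(records):
        if index < k:
            reservoir.append(record)
        else:
            # Replace an existing entry with probability k / (index + 1).
            slot = rng.randint(0, index)
            if slot < k:
                reservoir[slot] = record
    return reservoir


# Toy records standing in for (identifier, sequence, quality) tuples.
toy_records = [(f"read{i}", "ACGT", "IIII") for i in range(100)]
subset = sample_reads(toy_records, 3)
```

For paired end data, both files have to be sampled with the same read selection, so that mates stay together.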
  • 54. Summary preprocessing Your reads → Format consistent? Errors in quality? → Your groomed reads → Trends in raw data? QC report → Get technical contaminants → Your groomed reads without technical contamination → Get biological contaminants → Your groomed reads without technical and biological contamination → How does your data look now? QC
  • 55. Keywords: paired end; stranded reads; gzip; fastq; biological contamination; technical contamination; adapter sequence. Write in your own words what the terms mean.
  • 56. Exercise → investigating and preprocessing raw RNA-seq data
  • 57. Break