Genomics Course - UAT (VHIR) 2012 - NGS Data Analysis
1. Introduction to NGS
(Now Generation Sequencing)
Data Analysis
Alex Sánchez
Statistics and Bioinformatics Research Group
Statistics department, Universitat de Barcelona
Statistics and Bioinformatics Unit
Vall d’Hebron Institut de Recerca
NGS Data analysis http://ueb.vhir.org/NGS2012
2. Outline
• Introduction
• Bioinformatics Challenges
• NGS data analysis: Some examples and workflows
• Metagenomics, De novo sequencing, Variant detection, RNA-seq
• Software
• Galaxy, Genome viewers
• Data formats and quality control
3. Introduction
4. Why is NGS revolutionary?
• NGS has brought high speed not only to genome
sequencing and personal medicine,
• it has also changed the way we do genome research
Got a question on genome organization?
SEQUENCE IT !!!
Ana Conesa, bioinformatics researcher at
Principe Felipe Research Center
5. NGS means high sequencing capacity
GS FLX 454 (ROCHE)
HiSeq 2000 (ILLUMINA)
5500xl SOLiD (ABI)
GS Junior
Ion TORRENT
12. Some numbers
Platform                                                 454/FLX       Solexa (Illumina)   AB SOLiD
Read length                                              ~350-400 bp   36, 75, or 106 bp   50 bp
Single read                                              Yes           Yes                 Yes
Paired-end reads                                         Yes           Yes                 Yes
Long-insert (several Kbp) mate-paired reads              Yes           Yes                 No
Number of reads per instrument run                       5.00K         >100 M              400 M
Max data output                                          0.5 Gbp       20.5 Gbp            20 Gbp
Run time to 1 Gb                                         6 Days        > 1 Day             > 1 Day
Ease of use (workflow)                                   Difficult     Least difficult     Difficult
Base calling                                             Flow space    Nucleotide space    Color space
DNA Applications
Whole genome sequencing and resequencing                 Yes           Yes                 Yes
De novo sequencing                                       Yes           Yes                 Yes
Targeted resequencing                                    Yes           Yes                 Yes
Discovery of genetic variants (SNPs, InDels, CNV, ...)   Yes           Yes                 Yes
Chromatin Immunoprecipitation (ChIP)                     Yes           Yes                 Yes
Methylation analysis                                     Yes           Yes                 Yes
Metagenomics                                             Yes           No                  No
RNA Applications                                         Yes           Yes                 Yes
Whole transcriptome                                      Yes           Yes                 Yes
Small RNA                                                Yes           Yes                 Yes
Expression tags                                          Yes           Yes                 Yes
14. I have my sequences/images. Now what?
15. NGS pushes (bio)informatics needs up
• Need for computer power
• VERY large text files (~10 million lines long)
– Can’t do ‘business as usual’ with familiar tools such as Perl/Python.
– Impossible memory usage and execution time
• Impossible to browse for problems
• Need sequence Quality filtering
• Need for large amount of CPU power
• Informatics groups must manage compute clusters
• Challenges in parallelizing existing software or redesign of algorithms to work in a
parallel environment
• Need for Bioinformatics power!!!
• The challenge shifts from data generation to data analysis!
• How should bioinformatics be structured?
• Bigger centralized bioinformatics services? (or research groups providing service?)
• Distributed model: bioinformaticians must be part of the teams. Interoperability?
16. Data management issues
• Raw data are large. How long should they be kept?
• Processed data are manageable for most people
– 20 million reads (50bp) ~1Gb
• More of an issue for a facility: HiSeq recommends
32 CPU cores, each with 4GB RAM
• Certain studies are much more data intensive than others
– Whole genome sequencing
• A 30X coverage genome pair (tumor/normal) ~500 GB
• 50 genome pairs ~ 25 TB
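The figures on this slide can be checked with simple arithmetic. The sketch below is only an illustration: the function names are made up for this example, and the 500 GB per tumor/normal pair is taken directly from the slide rather than derived.

```python
# Back-of-the-envelope data volume estimates for NGS projects.
# All function names are illustrative, not part of any tool.

def total_bases(n_reads: int, read_length_bp: int) -> int:
    """Number of sequenced bases in a run: reads x read length."""
    return n_reads * read_length_bp


def project_storage_tb(n_genome_pairs: int, gb_per_pair: float = 500.0) -> float:
    """Storage for a set of tumor/normal genome pairs, in TB.

    gb_per_pair defaults to the ~500 GB quoted on the slide
    for one 30X tumor/normal pair.
    """
    return n_genome_pairs * gb_per_pair / 1000.0


if __name__ == "__main__":
    # 20 million reads of 50 bp give about 1 Gb (gigabase) of sequence
    print(total_bases(20_000_000, 50))
    # 50 genome pairs at ~500 GB each
    print(project_storage_tb(50), "TB")
```

Running the same calculation before committing to a project is a cheap way to see whether it fits the storage that is actually available.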
17. So what?
• In NGS we have to process really big amounts of data,
which is not trivial in computing terms.
• Big NGS projects require supercomputing infrastructures
• Or put another way: it's not the case that anyone can do
everything.
– Small facilities must carefully choose projects that match
their computing capabilities.
18. Computational infrastructure for NGS
• There is great variety but a good point to start with:
– Computing cluster
• Multiple nodes (servers) with multiple cores
• High performance storage (TB, PB level)
• Fast networks (10Gb ethernet, infiniband)
– Enough space and conditions for the equipment
("server room")
– Skilled people (sysadmin, developers)
• CNAG, in Barcelona: 36 people, more than 50% of them
informaticians
19. Alternatives (1): Cloud Computing
• Pros
– Flexibility.
– You pay what you use.
– No need to maintain a data center.
• Cons
– Transferring big datasets over the internet is
slow.
– You pay for consumed bandwidth.
That is a problem with big datasets.
– Lower performance, especially in disk
read/write.
– Privacy/security concerns.
– More expensive for big and long
term projects.
20. Alternatives (2): Grid Computing
• Pros
– Cheaper.
– More resources available.
• Cons
– Heterogeneous
environment.
– Slow connectivity (especially
in Spain).
– Much time required to find
good resources in the grid.
21. In summary?
•“NGS” arrived 2007/8
•No-one predicted NGS in 2001 (ten years ago)
•Therefore we cannot predict what we will come
up against
•TGS (third-generation sequencing) presents specific challenges
–Large Data Storage
–Technology-aware software
–Enables new assays and new science
•We would have said the same about NGS….
•These are not new problems, but will require
new solutions
•There is a lag between technology and
software….
22. Bioinformatics and bioinformaticians
• The term bioinformatician means many things
• Some may require a wide range of skills
• Others require a depth of specific skills
• The best thing we can teach is the ability to learn and
adapt
• The spirit of adventure
• There is a definite skills shortage
• There always has been
25. NGS data analysis stages
26. Quality control and preprocessing of NGS data
27. Data types
28. Why QC and preprocessing
• Sequencer output:
– Reads + quality
• Natural questions
– Is the quality of my sequenced
data OK?
– If something is wrong can I fix it?
• Problem: HUGE files... How
do they look?
• Files are flat files and big...
tens of Gbs (even hard to
browse them)
30. How is quality measured?
• Sequencing systems assign a quality score to each peak (base call)
• Phred scores provide log(10)-transformed error probability values:
If p is the probability that the base call is wrong, the Phred score is
Q = -10·log10(p)
– score = 20 corresponds to a 1% error rate
– score = 30 corresponds to a 0.1% error rate
– score = 40 corresponds to a 0.01% error rate
• The base calling (A, T, G or C) is performed based on Phred
scores.
• Ambiguous positions with Phred scores <= 20 are labeled with N.
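The relation between error probability and Phred score can be checked numerically. This is a minimal sketch of the formula Q = -10·log10(p) and its inverse; the function names are chosen for this example only.

```python
import math


def phred_from_error(p: float) -> float:
    """Phred quality score for an error probability p (0 < p <= 1)."""
    return -10.0 * math.log10(p)


def error_from_phred(q: float) -> float:
    """Error probability corresponding to a Phred quality score q."""
    return 10.0 ** (-q / 10.0)


if __name__ == "__main__":
    # Reproduce the three reference points listed on the slide
    for q in (20, 30, 40):
        print(f"Q = {q}: error rate = {error_from_phred(q) * 100:g}%")
```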
31. Data formats
• FastA format (everybody knows about it)
– Header line starts with “>” followed by a sequence ID
– Sequence (string of nt).
• FastQ format (http://maq.sourceforge.net/fastq.shtml)
– First comes the sequence ID line (like Fasta but starting with “@”), followed by the sequence
– Then “+” and sequence ID (optional) and in the following line are
QVs encoded as single byte ASCII codes
• Different quality encoding variants
• Nearly all downstream analyses take FastQ as input
32. The fastq format
• A FASTQ file normally uses four lines per sequence.
– Line 1 begins with a '@' character and is followed by a sequence
identifier and an optional description (like a FASTA title line).
– Line 2 is the raw sequence letters.
– Line 3 begins with a '+' character and is optionally followed by the same
sequence identifier (and any description) again.
– Line 4 encodes the quality values for the sequence in Line 2, and must
contain the same number of symbols as letters in the sequence.
• Different encodings are in use
• Sanger format can encode a Phred quality score from 0 to 93 using ASCII 33 to 126
@Seq description
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
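The four-line layout and the Sanger encoding can be illustrated by decoding the record above. This is a minimal sketch, assuming one record of exactly four lines and the Sanger offset of 33; `parse_fastq_record` is a name invented for this example, not a function from any library.

```python
SANGER_OFFSET = 33  # Sanger format: Phred 0..93 stored as ASCII 33..126


def parse_fastq_record(lines):
    """Parse one four-line FASTQ record into (name, sequence, Phred scores)."""
    header, seq, plus, qual = (line.rstrip("\n") for line in lines)
    if not header.startswith("@") or not plus.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(seq) != len(qual):
        raise ValueError("sequence and quality lines differ in length")
    scores = [ord(ch) - SANGER_OFFSET for ch in qual]
    return header[1:], seq, scores


# The example record shown on the slide
record = [
    "@Seq description",
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT",
    "+",
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65",
]

name, seq, scores = parse_fastq_record(record)

if __name__ == "__main__":
    print(name, len(seq), "bases")
    print("first five quality scores:", scores[:5])
```

For real files, which hold millions of records, the same function can be applied to successive groups of four lines read from the file, without loading it into memory.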
33. Some tools to deal with QC
• Use FastQC to see your starting state.
• Use Fastx-toolkit to optimize different datasets and then
visualize the result with FastQC to prove your success!
• Hints:
– Trimming, clipping and filtering may improve quality
– But beware of removing too many sequences…
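To make the trimming and filtering hints concrete, here is a minimal sketch of 3'-end quality trimming followed by a length filter. It is a simplified illustration, not the algorithm used by Fastx-toolkit; the threshold of 20 and the minimum length are assumptions chosen for the example.

```python
def trim_3prime(seq, scores, min_quality=20):
    """Remove trailing bases whose Phred score is below min_quality."""
    end = len(seq)
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    return seq[:end], scores[:end]


def passes_length_filter(seq, min_length=4):
    """Keep a read only if enough bases survive trimming."""
    return len(seq) >= min_length


if __name__ == "__main__":
    seq, scores = trim_3prime("ACGTAC", [30, 32, 31, 30, 10, 5])
    print(seq, scores, passes_length_filter(seq))
```

Counting how many reads fail the length filter is one way to notice when trimming is removing too many sequences.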
Go to the tutorial and try the exercises...
34. Applications
• [1] Metagenomics
• [2] De novo sequencing
• [3] Amplicon analysis
• [4] Variant discovery
• [5] Transcriptome analysis
• …and more …
35. [1] Metagenomics & other community-based “omics”
Zoetendal E G et al.
Gut 2008;57:1605-1615
37. [1] Metagenomic Approaches
SMALL-SCALE: 16S rRNA gene profiling
The basic approach is to identify microbes in a complex
community by exploiting universal and conserved targets,
such as rRNA genes (Petrosini).
Challenges and limitations: Chimeric sequences caused by
PCR amplification and sequencing errors.
LARGE-SCALE: Whole Genome Shotgun (WGS)
Whole-genome approaches make it possible to identify and
annotate microbial genes and their functions in the
community.
Challenges and limitations:
• relatively large amounts of starting material required
• potential contamination of metagenomic samples with host
genetic material
• high numbers of genes of unknown function
Environmental Shotgun Sequencing (ESS).
A primer on metagenomics.
PLoS Comput Biol. 2010 Feb 26;6(2):e1000667.
38. [1] Comparative Metagenomics
Comparing two or more metagenomes is necessary to understand how genomic differences
affect, and are affected by the abiotic environment.
MEGAN can also be used to compare the OTU composition
of two or more frequency-normalized samples.
MG-RAST provides a comparative functional and
sequence-based analysis for uploaded samples.
Other software based on phylogenetic data includes UniFrac.
39. [1] Some Metagenomics projects
"whole-genome shotgun sequencing" was applied to microbial populations
A total of 1.045 billion base pairs of nonredundant sequence were analyzed
"whole-genome shotgun sequencing"
78 million base pairs of unique DNA sequence were analyzed
To date, 242 metagenomic projects are ongoing and 103 are completed
(www.genomesonline.org).
40. [2] De novo sequencing
41. [3] Amplicon analysis
Each amplicon (PCR product) is sequenced individually, allowing
for the identification of rare variants and the assignment of
haplotype information over the full sequence length
Some applications:
• Ultra-deep amplicon sequencing: detection of low-frequency (<1%)
variants in complex mixtures (rare somatic mutations, viral quasispecies...)
• Ultra-broad amplicon sequencing: identification of rare alleles
associated with hereditary diseases, heterozygote SNP calling...
• 16S rRNA amplicon sequencing: metabolic profiling of environmental
habitats, bacterial taxonomy and phylogeny
42. [3] Example of raw data generation with GS-FLX
...
44. [3] Final output examples
...
NT substitution (error) matrices
AA frequency tables
Bar plots output example (with circular legend for the AA)
45. [4] Variant discovery
Your aligner decides the type/amount of variants you can
identify
Naive SNP calling
Reads counting
Statistically supported SNP calling
Maximum likelihood, Bayesian
Quality score recalibration
Recalibrate quality score from whole alignment
Local realignment around indels
Realign reads
Known variants (limited species)
dbSNP
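The "naive SNP calling" idea (read counting) can be sketched in a few lines. This is only an illustration of the counting approach, assuming the bases observed at one reference position are already available as a list; the depth and allele-fraction thresholds are arbitrary values for the example, and real callers add the statistical models, recalibration and realignment steps listed above.

```python
from collections import Counter


def naive_snp_call(ref_base, pileup, min_depth=8, min_fraction=0.2):
    """Call a SNP at one position by counting non-reference bases.

    pileup is the list of bases that aligned reads show at this position.
    Returns (alt_base, alt_fraction), or None if no variant is called.
    """
    depth = len(pileup)
    if depth < min_depth:
        return None  # too few reads to say anything
    alt_counts = Counter(base for base in pileup if base != ref_base)
    if not alt_counts:
        return None  # every read agrees with the reference
    alt_base, n_alt = alt_counts.most_common(1)[0]
    fraction = n_alt / depth
    if fraction < min_fraction:
        return None  # more likely a sequencing error than a variant
    return alt_base, fraction


if __name__ == "__main__":
    print(naive_snp_call("A", list("AAAAAGGGGG")))  # heterozygous-looking site
    print(naive_snp_call("A", list("AAAAAAAAAG")))  # single mismatch
```

A single mismatching read in ten is rejected here, which shows why simple counting struggles with low-frequency variants and why quality-aware statistical callers are preferred.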
46. [4] Example: Exome Variant Analysis
49. [4]
50. [4] Many ongoing sequencing projects
51. [5] Transcriptome Analysis using NGS
RNA-Seq, or "Whole Transcriptome Shotgun Sequencing" ("WTSS"),
refers to the use of HTS technologies to sequence cDNA in order to get
information about a sample's RNA content.
Reads produced by sequencing are aligned to a reference
genome to build transcriptome mappings.
52. [5] Applications (1) Whole transcriptome analysis
mRNA AAAA → Fragmentation → RT → cDNA library → sequencing
• Detects expression of known and novel mRNAs
• Identification of alternative splicing events
• Detects expressed SNPs or mutations
• Identifies allele specific expression patterns
53. [5] Applications (2) Differential expression
1. Reads are mapped to the reference
genome or transcriptome
2. Mapped reads are assembled into
expression summaries (tables of
counts, showing how many reads are in
coding region, exon, gene or junction)
3. The data are normalized
4. Statistical testing of differential
expression (DE) is performed,
producing a list of genes with P-values
and fold changes.
54. [5] RNA Seq data analysis - Mapping
• Main issues:
– Number of allowed mismatches
– Number of multihits
– Mates expected distance
– Considering exon junctions
• End up with a list of # of reads per transcript
• These will be our (discrete) response variable
55. [5] RNA Seq data analysis - Normalization
• Two main sources of bias
– Influence of length: Counts are proportional to the transcript
length times the mRNA expression level.
– Influence of sequencing depth: The higher the sequencing depth, the
higher the counts.
• How to deal with this
– Normalize (correct) gene counts to minimize biases.
– Use statistical models that take into account
length and sequencing depth
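One common way to correct for both biases at once is RPKM (reads per kilobase of transcript per million mapped reads). The slide does not name a specific method, so RPKM is used here only as a familiar example of length and depth normalization; the counts in the usage lines are made up.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads Per Kilobase of transcript per Million mapped reads.

    Divides the raw count by transcript length (in kb) and by
    sequencing depth (in millions of mapped reads).
    """
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)


if __name__ == "__main__":
    # Same expression level, but the second transcript is twice as long
    # and the second library twice as deep: raw counts differ 4-fold,
    # normalized values agree.
    print(rpkm(1000, 2000, 10_000_000))
    print(rpkm(4000, 4000, 20_000_000))
```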
56. [5] RNA Seq - Differential expression methods
• Fisher's exact test or similar approaches.
• Use Generalized Linear Models and model counts using
– Poisson distribution.
– Negative binomial distribution.
• Transform count data to use existing approaches for
microarray data.
• …
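As an illustration of the first approach in the list, the sketch below implements a two-sided Fisher's exact test on a 2x2 table using only the standard library. The table layout (reads from the gene versus all other reads, in two samples) is an assumption made for this example, and the direct use of binomial coefficients is only practical for small counts; real analyses rely on dedicated statistical packages.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]].

    Example layout (an assumption for this sketch):
        a = reads from the gene in sample 1,  b = other reads in sample 1
        c = reads from the gene in sample 2,  d = other reads in sample 2
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Sum the probabilities of all tables at least as extreme as the observed one
    p_value = sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )
    return min(1.0, p_value)


if __name__ == "__main__":
    print(fisher_exact_two_sided(3, 1, 1, 3))
    print(fisher_exact_two_sided(10, 0, 0, 10))
```

The test treats each gene independently and ignores biological variability between replicates, which is one reason the count-based models in the list (Poisson, negative binomial) are often preferred.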
57. [5] Advantages of RNA-seq
• Unlike hybridization approaches, does not require an existing genomic
sequence
– Expected to replace microarrays for transcriptomic studies
• Very low background noise
– Reads can be unambiguously mapped
• Resolution up to 1 bp
• High-throughput quantitative measurement of transcript abundance
– Better than Sanger sequencing of cDNA or EST libraries
• Cost decreasing all the time
– Lower than traditional sequencing
• Can reveal sequence variations (SNPs)
• Automated pipelines available
58. Software for NGS preprocessing and analysis
59. Which software for NGS (data) analysis?
• Answer is not straightforward.
http://seqanswers.com/wiki/Software/list
• Many possible classifications
– Biological domains
• SNP discovery, Genomics, ChIP-Seq, De-novo assembly, …
– Bioinformatics methods
• Mapping, Assembly, Alignment, Seq-QC,…
– Technology
• Illumina, 454, ABI SOLID, Helicos, …
– Operating system
• Linux, Mac OS X, Windows, …
– License type
• GPLv3, GPL, Commercial, Free for academic use,…
– Language
• C++, Perl, Java, C, Python
– Interface
• Web Based, Integrated solutions, command line tools, pipelines,…
61. Some popular tools and places
63. Obtain data from many data sources including the UCSC Table Browser,
BioMart, WormBase, or your own data.
Prepare data for further analysis by rearranging or cutting data columns,
filtering data and many other actions.
Analyze data by finding overlapping regions, determining statistics,
phylogenetic analysis and much more
64. User Register
• Contains links to the downloading, pre-processing and analysis tools
• Displays menus and data inputs
• Shows the history of analysis steps, data and result viewing
72. List saved histories and shared histories.
Work on a current history, create new, share workflow
Copyright OpenHelix. No use or reproduction without express written consent
73. Creates a workflow, allows
user to repeat analysis
using different datasets.
75. Why is visualization important?
make large amounts of data more interpretable
glean patterns from the data
sanity check / visual debugging
more…
76. History of Genome Visualization
(Timeline figure: 1800s, 1900s, 2000s)
77. What is a “Genome Browser”
linear representation of a genome
position-based annotations, each called a track
continuous annotations: e.g. conservation
interval annotations: e.g. gene, read alignment
point annotations: e.g. SNPs
user specifies a subsection of genome to look at
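The track model described above can be sketched as a list of position-based features plus a query for the region the user is looking at. This is a toy illustration, not how any particular browser stores its data; the `Feature` type, the coordinates (0-based, end-exclusive) and the example features are all hypothetical.

```python
from typing import List, NamedTuple


class Feature(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive
    name: str


def features_in_view(track: List[Feature], chrom: str, start: int, end: int) -> List[Feature]:
    """Return the features of a track that overlap the viewed region."""
    return [f for f in track if f.chrom == chrom and f.start < end and f.end > start]


# An interval track (genes) and a point track (SNPs), both hypothetical
genes = [
    Feature("chr1", 100, 500, "geneA"),
    Feature("chr1", 800, 1200, "geneB"),
    Feature("chr2", 100, 500, "geneC"),
]
snps = [Feature("chr1", 450, 451, "snp1")]

if __name__ == "__main__":
    # The user looks at chr1:400-900
    for track in (genes, snps):
        print([f.name for f in features_in_view(track, "chr1", 400, 900)])
```

A linear scan is enough for a handful of features; tracks with millions of read alignments need indexed formats so that only the viewed region is loaded.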
78. Server-side model
(e.g. UCSC, Ensembl, Gbrowse)
server
• central data store
• renders images
• sends to client
client
• requests images
• displays images
79. Client-side model
(e.g. Savant, IGV)
server
• stores data
client
• local HTS store
• renders images
• displays images
HTS machine
80. Rough comparison of Genome Browsers
                     UCSC     Ensembl  GBrowse  Savant   IGV
Model                Server   Server   Server   Client   Client
Interactive
HTS support
Database of tracks
Plugins
Legend: No support, Some support, Good support
81. Limitations of most genome browsers
do not support multiple genomes simultaneously
do not capture 3-dimensional conformation
do not capture spatial or temporal information
do not integrate well with analytics
cannot be customized
The SAVANT GENOME BROWSER has been created to overcome these limitations
82. Integrative Genomics Viewer (IGV)
The Integrative Genomics Viewer (IGV) is a high-performance visualization tool
for interactive exploration of large, integrated datasets. It supports a wide variety
of data types including sequence alignments, microarrays, and genomic
annotations.
83. Acknowledgements
Statistics and Bioinformatics Research Group, Department of
Statistics, Universitat de Barcelona.
All the members of the Unitat d’Estadística i Bioinformàtica
of VHIR (Vall d’Hebron Institut de Recerca)
Unitat de Serveis Científico Tècnics (UCTS) of VHIR (Vall
d’Hebron Institut de Recerca)
People whose materials have been borrowed or who have
contributed with their work
Manel Comabella, Rosa Prieto, Paqui Gallego, Javier
Santoyo, Ana Conesa, Thomas Girke and Silvia
Cardona.…
84. Thank you for your attention and patience