Introduction to NGS
(Now Generation Sequencing)
        Data Analysis
           Alex Sánchez

        Statistics and Bioinformatics Research Group
        Statistics department, Universitat de Barcelona

        Statistics and Bioinformatics Unit
        Vall d’Hebron Institut de Recerca




     NGS Data analysis       http://ueb.vhir.org/NGS2012
Outline
• Introduction
• Bioinformatics Challenges
• NGS data analysis: Some examples and workflows
   • Metagenomics, De novo sequencing, Variant detection, RNA-seq
• Software
   • Galaxy, Genome viewers
• Data formats and quality control




Introduction




Why is NGS revolutionary?
• NGS has brought high speed not only to genome
  sequencing and personal medicine,
• it has also changed the way we do genome research

  Got a question on genome organization?


         SEQUENCE IT !!!

                 Ana Conesa, bioinformatics researcher at
                         Principe Felipe Research Center



NGS means high sequencing capacity




  GS FLX 454               HiSeq 2000               5500xl SOLiD
  (ROCHE)                  (ILLUMINA)               (ABI)




               GS Junior


                                         Ion TORRENT




NGS Platforms Performance




                               454 GS Junior
                               35MB




454 Sequencing




ABI SOLID Sequencing




Solexa sequencing




Applications of Next-Generation
          Sequencing




Comparison of 2nd NGS




Some numbers

Platform                                                  454/FLX      Solexa (Illumina)   ABI SOLiD
Read length                                               ~350-400bp   36, 75, or 106 bp   50bp
Single read                                               Yes          Yes                 Yes
Paired-end Reads                                          Yes          Yes                 Yes
Long-insert (several Kbp) mate-paired reads               Yes          Yes                 No
Number of reads per instrument run                        5.00K        >100 M              400M
Max Data output                                           0.5Gbp       20.5 Gbp            20Gbp
Run time to 1Gb                                           6 Days       > 1 Day             >1 Day
Ease of use (workflow)                                    Difficult    Least difficult     Difficult
Base Calling                                              Flow Space   Nucleotide space    Color space
DNA Applications
Whole genome sequencing and resequencing                  Yes          Yes                 Yes
de novo sequencing                                        Yes          Yes                 Yes
Targeted resequencing                                     Yes          Yes                 Yes
Discovery of genetic variants (SNPs, InDels, CNV, ...)    Yes          Yes                 Yes
Chromatin Immunoprecipitation (ChIP)                      Yes          Yes                 Yes
Methylation Analysis                                      Yes          Yes                 Yes
Metagenomics                                              Yes          No                  No
RNA Applications                                          Yes          Yes                 Yes
Whole Transcriptome                                       Yes          Yes                 Yes
Small RNA                                                 Yes          Yes                 Yes
Expression Tags                                           Yes          Yes                 Yes




Bioinformatics challenges of NGS




I have my sequences/images. Now what?




NGS pushes (bio)informatics needs up
• Need for computer power
   •   VERY large text files (~10 million lines long)
        – Can’t do ‘business as usual’ with familiar tools such as Perl/Python.
        – Impossible memory usage and execution time
        • Impossible to browse for problems
   •   Need sequence Quality filtering
   •   Need for large amount of CPU power
        •   Informatics groups must manage compute clusters
        •   Challenges in parallelizing existing software or redesign of algorithms to work in a
            parallel environment
• Need for Bioinformatics power!!!
   •   The challenge turns from data generation into data analysis!
   •   How should bioinformatics be structured?
        •   Bigger centralized bioinformatics services? (or research groups providing service?)
        •   Distributed model: bioinformaticians must be part of the teams. Interoperability?




Data management issues
• Raw data are large. How long should they be kept?
• Processed data are manageable for most people
   – 20 million reads (50bp) ~1Gb
• More of an issue for a facility: HiSeq recommends
  32 CPU cores, each with 4GB RAM

• Certain studies are much more data intensive than others
   – Whole genome sequencing
      • A 30X coverage genome pair (tumor/normal) ~500 GB
      • 50 genome pairs ~ 25 TB
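
The figures on this slide are simple arithmetic. A quick check, as a sketch only: it assumes decimal units (1 TB = 1000 GB) and reads "~1Gb" as about one gigabase of sequence; neither is stated on the slide, and the function names are made up for illustration.

```python
# Back-of-the-envelope check of the storage figures.
# Assumptions (not stated on the slide): decimal units, "Gb" read as gigabases.

def total_bases(n_reads, read_length_bp):
    """Total sequenced bases for a run of equal-length reads."""
    return n_reads * read_length_bp


def project_storage_tb(n_pairs, gb_per_pair):
    """Storage in TB for n tumor/normal genome pairs at a given size per pair."""
    return n_pairs * gb_per_pair / 1000


print(total_bases(20_000_000, 50))   # 1000000000 bases, about 1 Gb
print(project_storage_tb(50, 500))   # 25.0 TB
```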




So what?

• In NGS we have to process really big amounts of data,
  which is not trivial in computing terms.

• Big NGS projects require supercomputing infrastructures

• Or put another way: it's not the case that anyone can do
  everything.
   – Small facilities must carefully choose their projects to be scaled
     with their computing capabilities.




Computational infrastructure for NGS
• There is great variety but a good point to start with:

   – Computing cluster
       • Multiple nodes (servers) with multiple cores
       • High performance storage (TB, PB level)
       • Fast networks (10Gb ethernet, infiniband)
   – Enough space and conditions for the equipment
     ("servers room")
   – Skilled people (sysadmin, developers)
       • CNAG, in Barcelona: 36 people, more than 50% of them
         informaticians




Alternatives (1): Cloud Computing
• Pros
   – Flexibility.
   – You pay what you use.
   – Don't need to maintain a data center.
• Cons
   – Transferring big datasets over the internet is slow.
   – You pay for consumed bandwidth.
     That is a problem with big datasets.
   – Lower performance, especially in disk
     read/write.
   – Privacy/security concerns.
   – More expensive for big and long
     term projects.




Alternatives (2): Grid Computing
• Pros
   – Cheaper.
   – More resources available.
• Cons
   – Heterogeneous
     environment.
   – Slow connectivity (especially
     in Spain).
   – Much time required to find
     good resources in the grid.



In summary?
• “NGS” arrived 2007/8
• No-one predicted NGS in 2001 (ten years ago)
• Therefore we cannot predict what we will come up against
• TGS represents specific challenges
   – Large Data Storage
   – Technology-aware software
   – Enables new assays and new science
• We would have said the same about NGS….
• These are not new problems, but will require new solutions
• There is a lag between technology and software….



Bioinformatics and bioinformaticians
•   The term bioinformatician means many things
•   Some may require a wide range of skills
•   Others require a depth of specific skills
•   The best thing we can teach is the ability to learn and
    adapt
    • The spirit of adventure
    • There is a definite skills shortage
    • There always has been




Increasing importance of data analysis
needs




NGS data analysis




NGS data analysis stages




Quality control and preprocessing of
             NGS data




Data types




Why QC and preprocessing
• Sequencer output:
   – Reads + quality
• Natural questions
   – Is the quality of my sequenced
     data OK?
   – If something is wrong can I fix it?
• Problem: HUGE files... How
  do they look?
• Files are flat files and big...
  tens of Gbs (even hard to
  browse them)



Preprocessing sequences improves results




How is quality measured?




• Sequencing systems assign a quality score to each peak (base call)
• Phred scores provide log10-transformed error probability values:
  If p is the probability that the base call is wrong, the Phred score is
                Q = -10·log10(p)
    – score = 20 corresponds to a 1% error rate
    – score = 30 corresponds to a 0.1% error rate
    – score = 40 corresponds to a 0.01% error rate
• The base calling (A, T, G or C) is performed based on Phred
  scores.
• Ambiguous positions with Phred scores <= 20 are labeled with N.
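
The relation between error probability and Phred score can be checked numerically. A minimal sketch using the standard definition Q = -10·log10(p); the function names are chosen here for illustration.

```python
import math


def phred_from_error_prob(p):
    """Phred quality score for a base-call error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


def error_prob_from_phred(q):
    """Inverse transform: error probability for a Phred score q."""
    return 10 ** (-q / 10)


# The three reference points listed above: 1%, 0.1% and 0.01% error rates.
for p in (0.01, 0.001, 0.0001):
    print(p, round(phred_from_error_prob(p)))
```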


Data formats
• FastA format (everybody knows about it)
   – Header line starts with “>” followed by a sequence ID
   – Sequence (string of nt).


• FastQ format (http://maq.sourceforge.net/fastq.shtml)
   – First the sequence record, as in Fasta but with the header starting with “@”
   – Then “+” and the sequence ID (optional), and on the following line the
     quality values (QVs) encoded as single-byte ASCII codes
       • Different quality encoding variants exist


• Nearly all downstream analyses take FastQ as input



The fastq format
• A FASTQ file normally uses four lines per sequence.
   – Line 1 begins with a '@' character and is followed by a sequence
     identifier and an optional description (like a FASTA title line).
   – Line 2 is the raw sequence letters.
   – Line 3 begins with a '+' character and is optionally followed by the same
     sequence identifier (and any description) again.
   – Line 4 encodes the quality values for the sequence in Line 2, and must
     contain the same number of symbols as letters in the sequence.
       • Different encodings are in use
       • Sanger format can encode a Phred quality score from 0 to 93 using ASCII 33 to 126


@Seq description
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
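
A minimal sketch of reading such records and decoding the quality line, assuming the Sanger encoding (ASCII code minus 33) and strictly four lines per record. The function name `read_fastq` is made up for illustration; real files may need a more tolerant parser.

```python
import io


def read_fastq(handle):
    """Yield (identifier, sequence, phred_scores) from a 4-line-per-record FASTQ stream.

    Assumes Sanger encoding: Phred score = ASCII code - 33.
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record: " + header)
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ: " + header)
        yield header[1:], seq, [ord(c) - 33 for c in qual]


# The record shown on the slide.
record = (
    "@Seq description\n"
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT\n"
    "+\n"
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65\n"
)
for name, seq, quals in read_fastq(io.StringIO(record)):
    print(name, len(seq), quals[:5])  # Seq description 60 [0, 6, 6, 9, 7]
```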




Some tools to deal with QC
• Use FastQC to see your starting state.

• Use Fastx-toolkit to optimize different datasets and then
  visualize the result with FastQC to prove your success!

• Hints:
   – Trimming, clipping and filtering may improve quality
   – But beware of removing too many sequences…
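
To make the hints concrete, here is a sketch of one simple trimming rule and one simple filter. This is an illustration only, not the algorithm used by FastQC or the Fastx-toolkit; the thresholds and function names are assumptions.

```python
def trim_3prime(seq, quals, threshold=20):
    """Drop trailing bases whose Phred score is below the threshold."""
    end = len(quals)
    while end > 0 and quals[end - 1] < threshold:
        end -= 1
    return seq[:end], quals[:end]


def keep_read(quals, min_mean=20, min_length=1):
    """Keep a read only if it is long enough and its mean quality is high enough."""
    return len(quals) >= min_length and sum(quals) / len(quals) >= min_mean


seq, quals = trim_3prime("ACGTAC", [30, 30, 30, 30, 10, 5])
print(seq, quals, keep_read(quals))  # ACGT [30, 30, 30, 30] True
```

Counting how many reads survive such a filter is a cheap way to notice when too many sequences are being removed.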


Go to the tutorial and try the exercises...




Applications
•   [1] Metagenomics
•   [2] De novo sequencing
•   [3] Amplicon analysis
•   [4] Variant discovery
•   [5] Transcriptome analysis
•   …and more …




[1] Metagenomics & other community-based “omics”




Zoetendal E G et al.
Gut 2008;57:1605-1615




[1] A metagenomics workflow
[Figure: Short reads (40-150 bp) → Assembly → Contigs → Gene prediction → ORFs →
 Homology searching → Proteins, families, functions → Functional classification,
 Ontologies → Functional profiles. In parallel: Binning → Sequences into species.]
[1] Metagenomic Approaches
SMALL-SCALE: 16S rRNA gene profiling
The basic approach is to identify microbes in a complex
community by exploiting universal and conserved targets,
such as rRNA genes (Petrosini).

Challenges and limitations: Chimeric sequences caused by
PCR amplification and sequencing errors.

LARGE-SCALE: Whole Genome Shotgun (WGS)
Whole-genome approaches make it possible to identify and
annotate microbial genes and their functions in the
community.
Challenges and limitations:
 • relatively large amounts of starting material required
 • potential contamination of metagenomic samples with host
   genetic material
 • high numbers of genes of unknown function.

Environmental Shotgun Sequencing (ESS).
Source: A primer on metagenomics. PLoS Comput Biol. 2010 Feb 26;6(2):e1000667.




[1] Comparative Metagenomics
Comparing two or more metagenomes is necessary to understand how genomic differences
affect, and are affected by the abiotic environment.


MEGAN can also be used to compare the OTU composition
of two or more frequency-normalized samples.
MG-RAST provides a comparative functional and
sequence-based analysis for uploaded samples.
Other software based on phylogenetic data includes UniFrac.
[1] Some Metagenomics projects



"whole-genome shotgun sequencing" was applied to microbial populations
A total of 1.045 billion base pairs of nonredundant sequence were analyzed




"whole-genome shotgun sequencing"
78 million base pairs of unique DNA sequence were analyzed



To date, 242 metagenomic projects are ongoing and 103 are completed
(www.genomesonline.org).




[2] De novo sequencing




[3] Amplicon analysis
Each amplicon (PCR product) is sequenced individually, allowing
  for the identification of rare variants and the assignment of
  haplotype information over the full sequence length

Some applications:
     ● Ultra-deep amplicon sequencing: detection of low-frequency (<1%)
       variants in complex mixtures → rare somatic mutations, viral quasispecies...
     ● Ultra-broad amplicon sequencing: identification of rare alleles
       associated with hereditary diseases, heterozygote SNP calling...
     ● 16S rRNA amplicon sequencing: metabolic profiling of environmental
       habitats, bacterial taxonomy and phylogeny




[3] Example of raw data generation with GS-FLX

...




[3] Data Workflow
...




                    Data Processing
[3] Final output examples
...
                                                                       NT substitution (error) matrices




Bar plots output example (with circular legend for the AA)                   AA frequency tables




[4] Variant discovery
Your aligner decides the type/amount of variants you can
  identify
Naive SNP calling
   Reads counting
Statistically supported SNP calling
   Maximum likelihood, Bayesian
Quality score recalibration
   Recalibrate quality score from whole alignment
Local realignment around indels
   Realign reads
Known variants (limited species)
   dbSNP
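
A sketch of what "naive SNP calling by reads counting" can look like at a single reference position. The depth and allele-fraction thresholds are arbitrary assumptions, and the function name is made up; statistical callers replace this rule with a likelihood or Bayesian model.

```python
from collections import Counter


def naive_snp_call(ref_base, pileup_bases, min_depth=8, min_alt_fraction=0.2):
    """Call a SNP at one position by counting the bases of the reads covering it.

    Returns (alt_base, alt_count, depth), or None if no variant is called.
    Thresholds are illustrative assumptions, not recommended values.
    """
    bases = [b.upper() for b in pileup_bases if b.upper() in "ACGT"]
    depth = len(bases)
    if depth < min_depth:
        return None  # not enough coverage to say anything
    alt_counts = Counter(b for b in bases if b != ref_base.upper())
    if not alt_counts:
        return None  # every read agrees with the reference
    alt, n_alt = alt_counts.most_common(1)[0]
    if n_alt / depth >= min_alt_fraction:
        return alt, n_alt, depth
    return None


print(naive_snp_call("A", "AAAAAGGGGG"))  # ('G', 5, 10)
print(naive_snp_call("A", "AAAAAAAAAG"))  # None: 1 of 10 reads looks like an error
```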



[4] Example: Exome Variant Analysis




[4] Genotype calling tools




[4] GATK pipeline




[4]




[4] Many ongoing sequencing projects




[5] Transcriptome Analysis using NGS

    RNA-Seq, or "Whole
    Transcriptome Shotgun
    Sequencing" ("WTSS")
    refers to use of HTS
    technologies to sequence
    cDNA in order to get
    information about a
    sample's RNA content.
    
        Reads produced by
        sequencing
    
        Aligned to a reference
        genome to build
        transcriptome mappings.




[5] Applications (1) Whole transcriptome analysis
[Figure: mRNA (AAAA) → Fragmentation → RT → cDNA library → sequencing]
• Detects expression of known and novel mRNAs
• Identification of alternative splicing events
• Detects expressed SNPs or mutations
• Identifies allele-specific expression patterns




[5] Applications (2) Differential expression
1. Reads are mapped to the reference genome or transcriptome.
2. Mapped reads are assembled into expression summaries (tables of
   counts, showing how many reads are in each coding region, exon,
   gene or junction).
3. The data are normalized.
4. Statistical testing of differential expression (DE) is performed,
   producing a list of genes with P-values and fold changes.
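
Step 2 (building a table of counts) can be sketched as follows. The data layout is a simplifying assumption: each mapped read is reduced to a (chromosome, position) pair and each gene to one half-open interval; real tools work from alignment files and annotation files with exons and strands.

```python
def count_reads_per_gene(read_positions, genes):
    """Count mapped reads falling inside each gene interval.

    read_positions: iterable of (chrom, pos) pairs, one per mapped read.
    genes: dict mapping gene name -> (chrom, start, end), half-open [start, end).
    """
    counts = {name: 0 for name in genes}
    for chrom, pos in read_positions:
        for name, (g_chrom, start, end) in genes.items():
            if chrom == g_chrom and start <= pos < end:
                counts[name] += 1
    return counts


genes = {"geneA": ("chr1", 100, 1000), "geneB": ("chr2", 100, 500)}
reads = [("chr1", 150), ("chr1", 950), ("chr2", 150), ("chr1", 5000)]
print(count_reads_per_gene(reads, genes))  # {'geneA': 2, 'geneB': 1}
```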




[5] RNA Seq data analysis - Mapping




• Main issues:
   – Number of allowed mismatches
   – Number of multihits
   – Mates expected distance
   – Considering exon junctions
• End up with a list of # of reads per transcript
   – These will be our (discrete) response variable


[5] RNA Seq data analysis - Normalization
• Two main sources of bias
   – Influence of length: Counts are proportional to the transcript
     length times the mRNA expression level.
   – Influence of sequencing depth: The higher the sequencing depth, the
     higher the counts.


• How to deal with this
   – Normalize (correct) gene counts to minimize biases.
   – Use statistical models that take into account
     length and sequencing depth
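
One widely used way to correct counts for both biases is RPKM (reads per kilobase of transcript per million mapped reads). The slide does not name a specific method, so this is only an illustrative sketch.

```python
def rpkm(count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads.

    Divides the raw count by transcript length (in kb) and by
    sequencing depth (in millions of mapped reads).
    """
    return count * 1e9 / (gene_length_bp * total_mapped_reads)


# Same raw count, but the second gene is twice as long and the run ten times deeper.
print(rpkm(500, 1000, 1_000_000))
print(rpkm(500, 2000, 10_000_000))
```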




[5] RNA Seq - Differential expression methods

• Fisher's exact test or similar approaches.

• Use Generalized Linear Models and model counts using
   – Poisson distribution.
   – Negative binomial distribution.

• Transform count data to use existing approaches for
  microarray data.

• …
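
As an illustration of the first approach, a sketch of a two-sided Fisher's exact test on a 2x2 table (reads from one gene versus all other reads, in two conditions). It uses only the standard library and is meant for small counts; the table layout and the function name are assumptions made for this example.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Probability of seeing x in the top-left cell given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance so ties with the observed table are included.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


# Rows: condition 1, condition 2. Columns: reads in the gene, all other reads.
print(fisher_exact_two_sided(3, 1, 1, 3))
```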




[5] Advantages of RNA-seq

    Unlike hybridization approaches, it does not require an existing genomic
    sequence
    
        Expected to replace microarrays for transcriptomic studies

    Very low background noise
    
        Reads can be unambiguously mapped

    Resolution up to 1 bp

    High-throughput quantitative measurement of transcript abundance
    
        Better than Sanger sequencing of cDNA or EST libraries

    Cost decreasing all the time
    
        Lower than traditional sequencing

    Can reveal sequence variations (SNPs)

    Automated pipelines available




Software for NGS preprocessing and analysis




Which software for NGS (data) analysis?
• Answer is not straightforward.
                                       http://seqanswers.com/wiki/Software/list
• Many possible classifications
   – Biological domains
       • SNP discovery, Genomics, ChIP-Seq, De-novo assembly, …
   – Bioinformatics methods
       • Mapping, Assembly, Alignment, Seq-QC,…
   – Technology
       • Illumina, 454, ABI SOLID, Helicos, …
   – Operating system
       • Linux, Mac OS X, Windows, …
   – License type
       • GPLv3, GPL, Commercial, Free for academic use,…
   – Language
       • C++, Perl, Java, C, Python
   – Interface
       • Web Based, Integrated solutions, command line tools, pipelines,…




Some popular tools and places




http://galaxy.psu.edu/




Galaxy Site




• Obtain data from many data sources including the UCSC Table Browser,
  BioMart, WormBase, or your own data.
• Prepare data for further analysis by rearranging or cutting data columns,
  filtering data and many other actions.
• Analyze data by finding overlapping regions, determining statistics,
  phylogenetic analysis and much more.




User: Register
• Contains links to the downloading, pre-processing and analysis tools
• Displays menus and data inputs
• Shows the history of analysis steps, data and result viewing




Click Get Data




Get Data
         from Database




Upload File   File Format

              Upload or paste file




FASTQ file manipulation: format conversion, summary statistics,
  trimming reads, filtering reads by quality score…
Input: sanger FASTQ
Output: SAM format
Downstream analysis: SAM -> BAM




List saved histories and shared histories.
Work on a current history, create new, share workflow.

Copyright OpenHelix. No use or reproduction without express written consent.
Creates a workflow, allows
   user to repeat analysis
   using different datasets.




DATA VISUALIZATION




Why is visualization important?
make large amounts of data more interpretable
glean patterns from the data
sanity check / visual debugging
more…




History of Genome Visualization




   1800s                       1900s                    2000s
                                time




What is a “Genome Browser”
linear representation of a genome
position-based annotations, each called a track
  continuous annotations: e.g. conservation
  interval annotations: e.g. gene, read alignment
  point annotations: e.g. SNPs
user specifies a subsection of genome to look at
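
The core query behind a track view can be sketched in a few lines: given an interval track and the region the user is looking at, return the annotations that overlap it. The `Feature` type and the half-open 0-based coordinates are assumptions made for this illustration; real browsers use indexed file formats rather than a linear scan.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    """One interval annotation on a track (0-based start, exclusive end)."""
    chrom: str
    start: int
    end: int
    name: str


def features_in_view(track, chrom, view_start, view_end):
    """Return the features of a track that overlap the requested region."""
    return [
        f for f in track
        if f.chrom == chrom and f.start < view_end and f.end > view_start
    ]


genes = [
    Feature("chr1", 100, 500, "geneA"),
    Feature("chr1", 800, 1200, "geneB"),
    Feature("chr2", 100, 500, "geneC"),
]
print([f.name for f in features_in_view(genes, "chr1", 400, 900)])  # ['geneA', 'geneB']
```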




Server-side model
 (e.g. UCSC, Ensembl, Gbrowse)


server
• central data store
• renders images
• sends to client

client
• requests images
• displays images

Client-side model
 (e.g. Savant, IGV)


server
• stores data

client
• local HTS store
• renders images
• displays images

HTS machine (diagram label)
Rough comparison of Genome
   Browsers
                 UCSC      Ensembl     GBrowse        Savant     IGV
Model            Server     Server       Server        Client    Client
Interactive
HTS support
Database of tracks
Plugins

(Ratings for these rows were shown graphically on the slide.
 Legend: No support, Some support, Good support.)




Limitations of most genome
browsers
do not support multiple genomes simultaneously
do not capture 3-dimensional conformation
do not capture spatial or temporal information
do not integrate well with analytics
cannot be customized

  The SAVANT
  GENOME BROWSER
  has been created
  to overcome these
  limitations




Integrative Genomics Viewer (IGV)




The Integrative Genomics Viewer (IGV) is a high-performance visualization tool
for interactive exploration of large, integrated datasets. It supports a wide variety
of data types including sequence alignments, microarrays, and genomic
annotations.
Acknowledgements
 • Statistics and Bioinformatics Research Group, Department of
   Statistics, Universitat de Barcelona.

 • All the members at the Unitat d’Estadística i Bioinformàtica
   del VHIR (Vall d’Hebron Institut de Recerca)

 • Unitat de Serveis Científico Tècnics (UCTS) del VHIR (Vall
   d’Hebron Institut de Recerca)

 • People whose materials have been borrowed or who have
   contributed with their work
     – Manel Comabella, Rosa Prieto, Paqui Gallego, Javier
       Santoyo, Ana Conesa, Thomas Girke and Silvia
       Cardona.…




Thank you for your attention and patience





 
So you want to do a: RNAseq experiment, Differential Gene Expression Analysis
So you want to do a: RNAseq experiment, Differential Gene Expression AnalysisSo you want to do a: RNAseq experiment, Differential Gene Expression Analysis
So you want to do a: RNAseq experiment, Differential Gene Expression Analysis
 
140127 abrf interlaboratory study proposal
140127 abrf interlaboratory study proposal140127 abrf interlaboratory study proposal
140127 abrf interlaboratory study proposal
 
ChIP-seq
ChIP-seqChIP-seq
ChIP-seq
 

Similaire à Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

20110524zurichngs 1st pub
20110524zurichngs 1st pub20110524zurichngs 1st pub
20110524zurichngs 1st pub
sesejun
 
Challenges and Opportunities of Big Data Genomics
Challenges and Opportunities of Big Data GenomicsChallenges and Opportunities of Big Data Genomics
Challenges and Opportunities of Big Data Genomics
Yasin Memari
 

Similaire à Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS (20)

Making powerful science: an introduction to NGS data analysis
Making powerful science: an introduction to NGS data analysisMaking powerful science: an introduction to NGS data analysis
Making powerful science: an introduction to NGS data analysis
 
Big data solution for ngs data analysis
Big data solution for ngs data analysisBig data solution for ngs data analysis
Big data solution for ngs data analysis
 
Genome in a Bottle - Towards new benchmarks for the “dark matter” of the huma...
Genome in a Bottle - Towards new benchmarks for the “dark matter” of the huma...Genome in a Bottle - Towards new benchmarks for the “dark matter” of the huma...
Genome in a Bottle - Towards new benchmarks for the “dark matter” of the huma...
 
20110524zurichngs 1st pub
20110524zurichngs 1st pub20110524zurichngs 1st pub
20110524zurichngs 1st pub
 
Cloud bioinformatics 2
Cloud bioinformatics 2Cloud bioinformatics 2
Cloud bioinformatics 2
 
HPCAC - the state of bioinformatics in 2017
HPCAC - the state of bioinformatics in 2017HPCAC - the state of bioinformatics in 2017
HPCAC - the state of bioinformatics in 2017
 
Next-Generation Sequencing an Intro to Tech and Applications: NGS Tech Overvi...
Next-Generation Sequencing an Intro to Tech and Applications: NGS Tech Overvi...Next-Generation Sequencing an Intro to Tech and Applications: NGS Tech Overvi...
Next-Generation Sequencing an Intro to Tech and Applications: NGS Tech Overvi...
 
Brian O'Connor HBase Talk - Triangle Hadoop Users Group Dec 2010
Brian O'Connor HBase Talk - Triangle Hadoop Users Group Dec 2010 Brian O'Connor HBase Talk - Triangle Hadoop Users Group Dec 2010
Brian O'Connor HBase Talk - Triangle Hadoop Users Group Dec 2010
 
Digital RNAseq Technology Introduction: Digital RNAseq Webinar Part 1
Digital RNAseq Technology Introduction: Digital RNAseq Webinar Part 1Digital RNAseq Technology Introduction: Digital RNAseq Webinar Part 1
Digital RNAseq Technology Introduction: Digital RNAseq Webinar Part 1
 
2014 nicta-reproducibility
2014 nicta-reproducibility2014 nicta-reproducibility
2014 nicta-reproducibility
 
Challenges and Opportunities of Big Data Genomics
Challenges and Opportunities of Big Data GenomicsChallenges and Opportunities of Big Data Genomics
Challenges and Opportunities of Big Data Genomics
 
Genome in a bottle for ashg grc giab workshop 181016
Genome in a bottle for ashg grc giab workshop 181016Genome in a bottle for ashg grc giab workshop 181016
Genome in a bottle for ashg grc giab workshop 181016
 
Bes meeting spain 2015_alfredo garcia fernandez
Bes meeting spain 2015_alfredo garcia fernandezBes meeting spain 2015_alfredo garcia fernandez
Bes meeting spain 2015_alfredo garcia fernandez
 
Reproducible research - to infinity
Reproducible research - to infinityReproducible research - to infinity
Reproducible research - to infinity
 
Cshl minseqe 2013_ouellette
Cshl minseqe 2013_ouelletteCshl minseqe 2013_ouellette
Cshl minseqe 2013_ouellette
 
NGS File formats
NGS File formatsNGS File formats
NGS File formats
 
2016 bergen-sars
2016 bergen-sars2016 bergen-sars
2016 bergen-sars
 
Getting Started with NGS (Discover the Benefits of Technology and How it Oper...
Getting Started with NGS (Discover the Benefits of Technology and How it Oper...Getting Started with NGS (Discover the Benefits of Technology and How it Oper...
Getting Started with NGS (Discover the Benefits of Technology and How it Oper...
 
Sc12 workshop-writeup
Sc12 workshop-writeupSc12 workshop-writeup
Sc12 workshop-writeup
 
High Throughput Sequencing Technologies: What We Can Know
High Throughput Sequencing Technologies: What We Can KnowHigh Throughput Sequencing Technologies: What We Can Know
High Throughput Sequencing Technologies: What We Can Know
 

Plus de VHIR Vall d’Hebron Institut de Recerca

Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de expression génica
Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de expression génicaCurso de Genómica - UAT (VHIR) 2012 - Análisis de datos de expression génica
Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de expression génica
VHIR Vall d’Hebron Institut de Recerca
 
Curso de Genómica - UAT (VHIR) 2012 - Tecnologías de Ultrasecuenciación y de ...
Curso de Genómica - UAT (VHIR) 2012 - Tecnologías de Ultrasecuenciación y de ...Curso de Genómica - UAT (VHIR) 2012 - Tecnologías de Ultrasecuenciación y de ...
Curso de Genómica - UAT (VHIR) 2012 - Tecnologías de Ultrasecuenciación y de ...
VHIR Vall d’Hebron Institut de Recerca
 

Plus de VHIR Vall d’Hebron Institut de Recerca (20)

Introduction to Metagenomics. Applications, Approaches and Tools (Bioinformat...
Introduction to Metagenomics. Applications, Approaches and Tools (Bioinformat...Introduction to Metagenomics. Applications, Approaches and Tools (Bioinformat...
Introduction to Metagenomics. Applications, Approaches and Tools (Bioinformat...
 
Introduction to Functional Analysis with IPA (UEB-UAT Bioinformatics Course -...
Introduction to Functional Analysis with IPA (UEB-UAT Bioinformatics Course -...Introduction to Functional Analysis with IPA (UEB-UAT Bioinformatics Course -...
Introduction to Functional Analysis with IPA (UEB-UAT Bioinformatics Course -...
 
Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...
Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...
Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...
 
Basic Aspects of Microarray Technology and Data Analysis (UEB-UAT Bioinformat...
Basic Aspects of Microarray Technology and Data Analysis (UEB-UAT Bioinformat...Basic Aspects of Microarray Technology and Data Analysis (UEB-UAT Bioinformat...
Basic Aspects of Microarray Technology and Data Analysis (UEB-UAT Bioinformat...
 
Brief Overview to Amplicon Variant Analysis (UEB-UAT Bioinformatics Course - ...
Brief Overview to Amplicon Variant Analysis (UEB-UAT Bioinformatics Course - ...Brief Overview to Amplicon Variant Analysis (UEB-UAT Bioinformatics Course - ...
Brief Overview to Amplicon Variant Analysis (UEB-UAT Bioinformatics Course - ...
 
Introduction to NGS Variant Calling Analysis (UEB-UAT Bioinformatics Course -...
Introduction to NGS Variant Calling Analysis (UEB-UAT Bioinformatics Course -...Introduction to NGS Variant Calling Analysis (UEB-UAT Bioinformatics Course -...
Introduction to NGS Variant Calling Analysis (UEB-UAT Bioinformatics Course -...
 
Introduction to Galaxy (UEB-UAT Bioinformatics Course - Session 2.2 - VHIR, B...
Introduction to Galaxy (UEB-UAT Bioinformatics Course - Session 2.2 - VHIR, B...Introduction to Galaxy (UEB-UAT Bioinformatics Course - Session 2.2 - VHIR, B...
Introduction to Galaxy (UEB-UAT Bioinformatics Course - Session 2.2 - VHIR, B...
 
NGS Applications II (UEB-UAT Bioinformatics Course - Session 2.1.3 - VHIR, Ba...
NGS Applications II (UEB-UAT Bioinformatics Course - Session 2.1.3 - VHIR, Ba...NGS Applications II (UEB-UAT Bioinformatics Course - Session 2.1.3 - VHIR, Ba...
NGS Applications II (UEB-UAT Bioinformatics Course - Session 2.1.3 - VHIR, Ba...
 
NGS Applications I (UEB-UAT Bioinformatics Course - Session 2.1.2 - VHIR, Bar...
NGS Applications I (UEB-UAT Bioinformatics Course - Session 2.1.2 - VHIR, Bar...NGS Applications I (UEB-UAT Bioinformatics Course - Session 2.1.2 - VHIR, Bar...
NGS Applications I (UEB-UAT Bioinformatics Course - Session 2.1.2 - VHIR, Bar...
 
Storing and Accessing Information. Databases and Queries (UEB-UAT Bioinformat...
Storing and Accessing Information. Databases and Queries (UEB-UAT Bioinformat...Storing and Accessing Information. Databases and Queries (UEB-UAT Bioinformat...
Storing and Accessing Information. Databases and Queries (UEB-UAT Bioinformat...
 
Introduction to Bioinformatics (UEB-UAT Bioinformatics Course - Session 1.1 -...
Introduction to Bioinformatics (UEB-UAT Bioinformatics Course - Session 1.1 -...Introduction to Bioinformatics (UEB-UAT Bioinformatics Course - Session 1.1 -...
Introduction to Bioinformatics (UEB-UAT Bioinformatics Course - Session 1.1 -...
 
Genome Browsing, Genomic Data Mining and Genome Data Visualization with Ensem...
Genome Browsing, Genomic Data Mining and Genome Data Visualization with Ensem...Genome Browsing, Genomic Data Mining and Genome Data Visualization with Ensem...
Genome Browsing, Genomic Data Mining and Genome Data Visualization with Ensem...
 
Information management at vhir ueb using tiki-cms
Information management at vhir ueb using tiki-cmsInformation management at vhir ueb using tiki-cms
Information management at vhir ueb using tiki-cms
 
Introduction to Metagenomics Data Analysis - UEB-VHIR - 2013
Introduction to Metagenomics Data Analysis - UEB-VHIR - 2013Introduction to Metagenomics Data Analysis - UEB-VHIR - 2013
Introduction to Metagenomics Data Analysis - UEB-VHIR - 2013
 
Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de RT-qPCR
Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de RT-qPCRCurso de Genómica - UAT (VHIR) 2012 - Análisis de datos de RT-qPCR
Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de RT-qPCR
 
Curso de Genómica - UAT (VHIR) 2012 - RT-qPCR
Curso de Genómica - UAT (VHIR) 2012 - RT-qPCRCurso de Genómica - UAT (VHIR) 2012 - RT-qPCR
Curso de Genómica - UAT (VHIR) 2012 - RT-qPCR
 
Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de expression génica
Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de expression génicaCurso de Genómica - UAT (VHIR) 2012 - Análisis de datos de expression génica
Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de expression génica
 
Curso de Genómica - UAT (VHIR) 2012 - Microarrays
Curso de Genómica - UAT (VHIR) 2012 - MicroarraysCurso de Genómica - UAT (VHIR) 2012 - Microarrays
Curso de Genómica - UAT (VHIR) 2012 - Microarrays
 
Curso de Genómica - UAT (VHIR) 2012 - Arrays de Proteínas Zeptosens
 Curso de Genómica - UAT (VHIR) 2012 - Arrays de Proteínas Zeptosens Curso de Genómica - UAT (VHIR) 2012 - Arrays de Proteínas Zeptosens
Curso de Genómica - UAT (VHIR) 2012 - Arrays de Proteínas Zeptosens
 
Curso de Genómica - UAT (VHIR) 2012 - Tecnologías de Ultrasecuenciación y de ...
Curso de Genómica - UAT (VHIR) 2012 - Tecnologías de Ultrasecuenciación y de ...Curso de Genómica - UAT (VHIR) 2012 - Tecnologías de Ultrasecuenciación y de ...
Curso de Genómica - UAT (VHIR) 2012 - Tecnologías de Ultrasecuenciación y de ...
 

Dernier

Salient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functionsSalient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functions
KarakKing
 

Dernier (20)

Towards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptxTowards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptx
 
OSCM Unit 2_Operations Processes & Systems
OSCM Unit 2_Operations Processes & SystemsOSCM Unit 2_Operations Processes & Systems
OSCM Unit 2_Operations Processes & Systems
 
Jamworks pilot and AI at Jisc (20/03/2024)
Jamworks pilot and AI at Jisc (20/03/2024)Jamworks pilot and AI at Jisc (20/03/2024)
Jamworks pilot and AI at Jisc (20/03/2024)
 
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
 
Google Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptxGoogle Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptx
 
On National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsOn National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan Fellows
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
 
Fostering Friendships - Enhancing Social Bonds in the Classroom
Fostering Friendships - Enhancing Social Bonds  in the ClassroomFostering Friendships - Enhancing Social Bonds  in the Classroom
Fostering Friendships - Enhancing Social Bonds in the Classroom
 
Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)
 
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
 
COMMUNICATING NEGATIVE NEWS - APPROACHES .pptx
COMMUNICATING NEGATIVE NEWS - APPROACHES .pptxCOMMUNICATING NEGATIVE NEWS - APPROACHES .pptx
COMMUNICATING NEGATIVE NEWS - APPROACHES .pptx
 
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptxOn_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
 
Understanding Accommodations and Modifications
Understanding  Accommodations and ModificationsUnderstanding  Accommodations and Modifications
Understanding Accommodations and Modifications
 
Python Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxPython Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docx
 
Food safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdfFood safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdf
 
latest AZ-104 Exam Questions and Answers
latest AZ-104 Exam Questions and Answerslatest AZ-104 Exam Questions and Answers
latest AZ-104 Exam Questions and Answers
 
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdfUGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
 
21st_Century_Skills_Framework_Final_Presentation_2.pptx
21st_Century_Skills_Framework_Final_Presentation_2.pptx21st_Century_Skills_Framework_Final_Presentation_2.pptx
21st_Century_Skills_Framework_Final_Presentation_2.pptx
 
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
 
Salient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functionsSalient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functions
 

Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de NGS

  • 1. Introduction to NGS (Now Generation Sequencing) Data Analysis. Alex Sánchez. Statistics and Bioinformatics Research Group, Statistics Department, Universitat de Barcelona; Statistics and Bioinformatics Unit, Vall d’Hebron Institut de Recerca. NGS Data analysis http://ueb.vhir.org/NGS2012
  • 2. Outline • Introduction • Bioinformatics Challenges • NGS data analysis: Some examples and workflows • Metagenomics, De novo sequencing, Variant detection, RNA-seq • Software • Galaxy, Genome viewers • Data formats and quality control
  • 3. Introduction
  • 4. Why is NGS revolutionary? • NGS has brought high speed not only to genome sequencing and personal medicine, • it has also changed the way we do genome research. Got a question on genome organization? SEQUENCE IT !!! (Ana Conesa, bioinformatics researcher at Principe Felipe Research Center)
  • 5. NGS means high sequencing capacity: GS FLX 454 (Roche), HiSeq 2000 (Illumina), 5500xl SOLiD (ABI), GS Junior, Ion Torrent
  • 6. NGS Platforms Performance: 454 GS Junior, 35MB
  • 7. 454 Sequencing
  • 8. ABI SOLiD Sequencing
  • 9. Solexa sequencing
  • 10. Applications of Next-Generation Sequencing
  • 11. Comparison of 2nd NGS
  • 12. Some numbers
      Platform | 454/FLX | Solexa (Illumina) | ABI SOLiD
      Read length | ~350-400 bp | 36, 75, or 106 bp | 50 bp
      Single read | Yes | Yes | Yes
      Paired-end reads | Yes | Yes | Yes
      Long-insert (several Kbp) mate-paired reads | Yes | Yes | No
      Number of reads per instrument run | 5.00K | >100 M | 400 M
      Max data output | 0.5 Gbp | 20.5 Gbp | 20 Gbp
      Run time to 1 Gb | 6 days | >1 day | >1 day
      Ease of use (workflow) | Difficult | Least difficult | Difficult
      Base calling | Flow space | Nucleotide space | Color space
      DNA applications
      Whole genome sequencing and resequencing | Yes | Yes | Yes
      De novo sequencing | Yes | Yes | Yes
      Targeted resequencing | Yes | Yes | Yes
      Discovery of genetic variants (SNPs, InDels, CNV, ...) | Yes | Yes | Yes
      Chromatin immunoprecipitation (ChIP) | Yes | Yes | Yes
      Methylation analysis | Yes | Yes | Yes
      Metagenomics | Yes | No | No
      RNA applications | Yes | Yes | Yes
      Whole transcriptome | Yes | Yes | Yes
      Small RNA | Yes | Yes | Yes
      Expression tags | Yes | Yes | Yes
  • 13. Bioinformatics challenges of NGS
  • 14. I have my sequences/images. Now what?
  • 15. NGS pushes (bio)informatics needs up • Need for computer power • VERY large text files (~10 million lines long) – Can’t do ‘business as usual’ with familiar tools such as Perl/Python – Impossible memory usage and execution time • Impossible to browse for problems • Need for sequence quality filtering • Need for large amounts of CPU power • Informatics groups must manage compute clusters • Challenges in parallelizing existing software or redesigning algorithms to work in a parallel environment • Need for bioinformatics power!!! • The challenge shifts from data generation to data analysis! • How should bioinformatics be structured? • Bigger centralized bioinformatics services? (or research groups providing service?) • Distributed model: bioinformaticians must be part of the teams. Interoperability?
  • 16. Data management issues • Raw data are large. How long should they be kept? • Processed data are manageable for most people – 20 million reads (50 bp) ~1 Gb • More of an issue for a facility: HiSeq recommends 32 CPU cores, each with 4 GB RAM • Certain studies are much more data intensive than others – Whole genome sequencing • A 30X coverage genome pair (tumor/normal) ~500 GB • 50 genome pairs ~25 TB
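The figures on the data management slide can be checked with simple arithmetic. The sketch below is illustrative only: it assumes that "~1 Gb" means one gigabase of sequence and that 1 TB = 1000 GB, and the function names are made up for this example.

```python
def total_bases(n_reads: int, read_length: int) -> int:
    """Number of sequenced bases in a run (reads x read length)."""
    return n_reads * read_length


def project_storage_tb(n_pairs: int, gb_per_pair: float) -> float:
    """Storage for a set of tumor/normal genome pairs, in TB (1 TB = 1000 GB)."""
    return n_pairs * gb_per_pair / 1000.0


# 20 million reads of 50 bp, as on the slide
print(total_bases(20_000_000, 50) / 1e9, "Gbases")

# 50 genome pairs at ~500 GB per 30X tumor/normal pair
print(project_storage_tb(50, 500), "TB")
```

The same two functions make it easy to test other scenarios, for example longer reads or a larger cohort, before committing to a storage purchase.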
  • 17. So what? • In NGS we have to process really large amounts of data, which is not trivial in computing terms. • Big NGS projects require supercomputing infrastructures. • Or put another way: not everyone can do everything. – Small facilities must carefully choose projects that are scaled to their computing capabilities.
  • 18. Computational infrastructure for NGS • There is great variety, but a good starting point is: – Computing cluster • Multiple nodes (servers) with multiple cores • High performance storage (TB, PB level) • Fast networks (10Gb Ethernet, InfiniBand) – Enough space and suitable conditions for the equipment ("server room") – Skilled people (sysadmins, developers) • CNAG, in Barcelona: 36 people, more than 50% of them informaticians
  • 19. Alternatives (1): Cloud Computing • Pros – Flexibility. – You pay for what you use. – No need to maintain a data center. • Cons – Transferring big datasets over the internet is slow. – You pay for consumed bandwidth, which is a problem with big datasets. – Lower performance, especially in disk read/write. – Privacy/security concerns. – More expensive for big and long-term projects.
  • 20. Alternatives (2): Grid Computing • Pros – Cheaper. – More resources available. • Cons – Heterogeneous environment. – Slow connectivity (especially in Spain). – Much time required to find good resources in the grid.
  • 21. In summary? • “NGS” arrived 2007/8 • No one predicted NGS in 2001 (ten years ago) • Therefore we cannot predict what we will come up against • TGS presents specific challenges – Large data storage – Technology-aware software – Enables new assays and new science • We would have said the same about NGS… • These are not new problems, but will require new solutions • There is a lag between technology and software…
  • 22. Bioinformatics and bioinformaticians • The term bioinformatician means many things • Some may require a wide range of skills • Others require a depth of specific skills • The best thing we can teach is the ability to learn and adapt • The spirit of adventure • There is a definite skills shortage • There always has been
  • 23. Increasing importance of data analysis needs
  • 24. NGS data analysis
  • 25. NGS data analysis stages
  • 26. Quality control and preprocessing of NGS data
  • 27. Data types
  • 28. Why QC and preprocessing • Sequencer output: – Reads + quality • Natural questions – Is the quality of my sequenced data OK? – If something is wrong, can I fix it? • Problem: HUGE files... What do they look like? • Files are flat files and big... tens of GB (hard even to browse)
  • 29. Preprocessing sequences improves results
  • 30. How is quality measured? • Sequencing systems usually assign a quality score to each peak • Phred scores provide log10-transformed error probability values: if p is the probability that the base call is wrong, the Phred score is Q = -10·log10(p) – score = 20 corresponds to a 1% error rate – score = 30 corresponds to a 0.1% error rate – score = 40 corresponds to a 0.01% error rate • Base calling (A, T, G or C) is performed based on Phred scores. • Ambiguous positions with Phred scores <= 20 are labeled with N.
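The relation between error probability and Phred score can be sketched in a few lines. This is a minimal illustration of the formula Q = -10·log10(p) and its inverse; the function names are chosen for this example and do not come from any particular tool.

```python
import math


def phred_from_error(p: float) -> float:
    """Phred quality score for an error probability p (0 < p <= 1)."""
    return -10.0 * math.log10(p)


def error_from_phred(q: float) -> float:
    """Error probability implied by a Phred quality score q."""
    return 10.0 ** (-q / 10.0)


# The three reference points listed on the slide
for q in (20, 30, 40):
    print(q, error_from_phred(q))
```

Working in both directions is useful in practice: quality files store Q, while expected error counts for a read are computed from p.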
  • 31. Data formats • FastA format (everybody knows about it) – Header line starts with “>” followed by a sequence ID – Sequence (string of nt) • FastQ format (http://maq.sourceforge.net/fastq.shtml) – First is the header line (like FastA, but starting with “@”), followed by the sequence – Then “+” and the sequence ID (optional), and on the following line the QVs encoded as single-byte ASCII codes • Different quality encoding variants exist • Nearly all downstream analyses take FastQ as input
  • 32. The fastq format • A FASTQ file normally uses four lines per sequence. – Line 1 begins with a '@' character and is followed by a sequence identifier and an optional description (like a FASTA title line). – Line 2 is the raw sequence letters. – Line 3 begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again. – Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence. • Different encodings are in use • Sanger format can encode a Phred quality score from 0 to 93 using ASCII 33 to 126. Example record:
      @Seq description
      GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
      +
      !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
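The four-line layout and the Sanger encoding described above can be illustrated with a small reader. This is a minimal sketch, assuming strictly four lines per record and Sanger (ASCII offset 33) qualities; `read_fastq` is a name invented here, and real files with other encodings or wrapped lines need a proper parser.

```python
import io
from typing import Iterator, List, Tuple

SANGER_OFFSET = 33  # Sanger encoding: ASCII 33 ('!') stands for Q = 0


def read_fastq(handle) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (title, sequence, qualities) for each four-line FASTQ record."""
    while True:
        title = handle.readline().rstrip("\n")
        if not title:  # end of file (or a blank line) stops this sketch
            return
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        if not title.startswith("@") or not plus.startswith("+"):
            raise ValueError("not a four-line FASTQ record")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield title[1:], seq, [ord(c) - SANGER_OFFSET for c in qual]


# The example record from the slide
EXAMPLE = (
    "@Seq description\n"
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT\n"
    "+\n"
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65\n"
)

records = list(read_fastq(io.StringIO(EXAMPLE)))
title, seq, quals = records[0]
print(title, len(seq), min(quals), max(quals))
```

Because the reader is a generator, it processes one record at a time and does not need to load a multi-gigabyte file into memory.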
  • 33. Some tools to deal with QC • Use FastQC to see your starting state. • Use Fastx-toolkit to optimize different datasets and then visualize the result with FastQC to prove your success! • Hints: – Trimming, clipping and filtering may improve quality – But beware of removing too many sequences… Go to the tutorial and try the exercises...
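To make the trimming and filtering hints concrete, here is a minimal sketch of 3'-end quality trimming followed by a length filter. It is not the algorithm used by FastQC or Fastx-toolkit; the thresholds (`min_q=20`, `min_len=20`) and function names are assumptions chosen for illustration.

```python
from typing import List, Tuple


def trim_3prime(seq: str, quals: List[int], min_q: int = 20) -> Tuple[str, List[int]]:
    """Remove bases from the 3' end while their quality is below min_q."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]


def passes_length_filter(seq: str, min_len: int = 20) -> bool:
    """Keep a read only if enough bases survive trimming."""
    return len(seq) >= min_len


read = "ACGTAC"
qualities = [30, 30, 30, 25, 10, 5]
trimmed_seq, trimmed_quals = trim_3prime(read, qualities)
print(trimmed_seq, trimmed_quals, passes_length_filter(trimmed_seq, min_len=4))
```

Counting how many reads fail the length filter at different thresholds is a quick way to see whether the trimming is too aggressive, which is the risk the slide warns about.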
  • 34. Applications • [1] Metagenomics • [2] De novo sequencing • [3] Amplicon analysis • [4] Variant discovery • [5] Transcriptome analysis • …and more…
  • 35. [1] Metagenomics & other community-based “omics” (Zoetendal E G et al. Gut 2008;57:1605-1615)
  • 36. [1] A metagenomics workflow (figure; stage labels: short reads (40-150 bp), assembly, contigs, gene prediction, ORFs, homology searching, proteins/families/functions, functional classification, ontologies, binning of sequences into species, functional profiles)
  • 37. [1] Metagenomic Approaches. SMALL-SCALE: 16S rRNA gene profiling. The basic approach is to identify microbes in a complex community by exploiting universal and conserved targets, such as rRNA genes (Petrosini). Challenges and limitations: chimeric sequences caused by PCR amplification and sequencing errors. LARGE-SCALE: Whole Genome Shotgun (WGS). Whole-genome approaches make it possible to identify and annotate microbial genes and their functions in the community. Challenges and limitations: relatively large amounts of starting material required; potential contamination of metagenomic samples with host genetic material; high numbers of genes of unknown function. Environmental Shotgun Sequencing (ESS). Reference: A primer on metagenomics. PLoS Comput Biol. 2010 Feb 26;6(2):e1000667.
  • 38. [1] Comparative Metagenomics. Comparing two or more metagenomes is necessary to understand how genomic differences affect, and are affected by, the abiotic environment. MEGAN can also be used to compare the OTU composition of two or more frequency-normalized samples. MG-RAST provides a comparative functional and sequence-based analysis for uploaded samples. Other software based on phylogenetic data includes UniFrac.
  • 39. [1] Some Metagenomics projects. "Whole-genome shotgun sequencing" was applied to microbial populations; a total of 1.045 billion base pairs of nonredundant sequence were analyzed. "Whole-genome shotgun sequencing": 78 million base pairs of unique DNA sequence were analyzed. To date, 242 metagenomic projects are ongoing and 103 are completed (www.genomesonline.org).
  • 40. [2] De novo sequencing
  • 41. [3] Amplicon analysis. Each amplicon (PCR product) is sequenced individually, allowing for the identification of rare variants and the assignment of haplotype information over the full sequence length. Some applications: ● Detection of low-frequency (<1%) variants in complex mixtures → rare somatic mutations, viral quasispecies... (ultra-deep amplicon sequencing) ● Identification of rare alleles associated with hereditary diseases, heterozygote SNP calling... (ultra-broad amplicon sequencing) ● Metabolic profiling of environmental habitats, bacterial taxonomy and phylogeny (16S rRNA amplicon sequencing)
  • 42. [3] Example of raw data generation with GS-FLX ...
  • 43. [3] Data Workflow ... Data Processing
  • 44. [3] Final output examples ... NT substitution (error) matrices; bar plot output example (with circular legend for the AA); AA frequency tables
  • 45. [4] Variant discovery. Your aligner decides the type/amount of variants you can identify. • Naive SNP calling: read counting • SNP calling with statistical support: maximum likelihood, Bayesian • Quality score recalibration: recalibrate quality scores from the whole alignment • Local realignment around indels: realign reads • Known variants (limited species): dbSNP
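The "naive SNP calling by read counting" item can be illustrated with a toy caller that looks at the bases piled up on a single reference position. This is a sketch only: the function name and the thresholds (`min_depth=8`, `min_fraction=0.2`) are assumptions, and it ignores base qualities, strand bias and the statistical models used by real callers.

```python
from collections import Counter
from typing import Optional, Tuple


def naive_snp_call(
    ref_base: str,
    pileup: str,
    min_depth: int = 8,
    min_fraction: float = 0.2,
) -> Optional[Tuple[str, float]]:
    """Return (alt_base, alt_fraction) if a non-reference base is frequent enough."""
    # Count only unambiguous bases; N and other symbols are ignored
    counts = Counter(b for b in pileup.upper() if b in "ACGT")
    depth = sum(counts.values())
    if depth < min_depth:
        return None  # too few reads to say anything
    alt_counts = {b: n for b, n in counts.items() if b != ref_base.upper()}
    if not alt_counts:
        return None  # every read agrees with the reference
    alt_base, alt_n = max(alt_counts.items(), key=lambda kv: kv[1])
    fraction = alt_n / depth
    return (alt_base, fraction) if fraction >= min_fraction else None


print(naive_snp_call("A", "AAAAAGGGGG"))  # half the reads carry G
print(naive_snp_call("A", "AAAAAAAAAG"))  # a single G among ten reads
```

A single mismatching read out of ten is more likely a sequencing error than a variant, which is why fixed-threshold counting is usually replaced by the likelihood-based approaches listed on the slide.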
  • 46. [4] Example: Exome Variant Analysis
  • 47. [4] Genotype calling tools
  • 48. [4] GATK pipeline
  • 49. [4]
  • 50. [4] Many ongoing sequencing projects
  • 51. [5] Transcriptome Analysis using NGS • RNA-Seq, or "Whole Transcriptome Shotgun Sequencing" ("WTSS"), refers to the use of HTS technologies to sequence cDNA in order to get information about a sample's RNA content. • Reads produced by sequencing are aligned to a reference genome to build transcriptome mappings.
  • 52. [5] Applications (1) • Whole transcriptome analysis • Detects expression of known and novel mRNAs • Identification of alternative splicing events • Detects expressed SNPs or mutations • Identifies allele-specific expression patterns (Figure labels: mRNA AAAA, fragmentation, RT, cDNA library, sequencing)
  • 53. [5] Applications (2) Differential expression 1. Reads are mapped to the reference genome or transcriptome; 2. Mapped reads are assembled into expression summaries (tables of counts, showing how many reads are in each coding region, exon, gene or junction); 3. The data are normalized; 4. Statistical testing of differential expression (DE) is performed, producing a list of genes with P-values and fold changes.
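Step 2 of the workflow above (turning mapped reads into a table of counts) might look like the sketch below. It assumes a simplified representation in which each mapped read is reduced to a chromosome and a single position; the gene coordinates and the function name are hypothetical.

```python
def count_reads_per_gene(genes, read_positions):
    """Build an expression summary: number of mapped reads falling in each gene.

    genes: dict gene_name -> (chrom, start, end), coordinates 1-based inclusive.
    read_positions: iterable of (chrom, position) for mapped reads.
    """
    counts = {gene: 0 for gene in genes}
    for chrom, pos in read_positions:
        for gene, (g_chrom, start, end) in genes.items():
            if chrom == g_chrom and start <= pos <= end:
                counts[gene] += 1
    return counts


# Hypothetical annotation and mapped reads.
genes = {"geneA": ("chr1", 100, 500), "geneB": ("chr1", 900, 1200), "geneC": ("chr2", 1, 50)}
reads = [("chr1", 150), ("chr1", 499), ("chr1", 1000), ("chr1", 700), ("chr2", 60)]

summary = count_reads_per_gene(genes, reads)
print(summary)
```

The linear scan over genes is fine for a toy example; a real pipeline would use an indexed structure and would also have to decide how to treat reads that overlap several features.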
  • 54. [5] RNA Seq data analysis - Mapping • Main issues: – Number of allowed mismatches – Number of multihits – Mates expected distance – Considering exon junctions • End up with a list of # of reads per transcript. These will be our (discrete) response variable.
  • 55. [5] RNA Seq data analysis - Normalization • Two main sources of bias – Influence of length: counts are proportional to the transcript length times the mRNA expression level. – Influence of sequencing depth: the higher the sequencing depth, the higher the counts. • How to deal with this – Normalize (correct) gene counts to minimize biases. – Use statistical models that take into account length and sequencing depth.
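One common way to correct for both biases at once is RPKM (reads per kilobase of transcript per million mapped reads). The slide does not name a specific method, so RPKM is an assumption here, and the gene lengths and counts are hypothetical.

```python
def rpkm(count, length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    if length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and library size must be positive")
    return count * 1e9 / (length_bp * total_mapped_reads)


# Hypothetical library of 1,000,000 mapped reads.
library_size = 1_000_000
gene_a = rpkm(count=200, length_bp=2000, total_mapped_reads=library_size)
gene_b = rpkm(count=100, length_bp=1000, total_mapped_reads=library_size)

# geneA has twice the raw count only because it is twice as long.
print(gene_a, gene_b)
```

Here the two genes have different raw counts but the same normalized value, which illustrates the length effect described above. Dividing by library size plays the same role for sequencing depth.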
  • 56. [5] RNA Seq - Differential expression methods • Fisher's exact test or similar approaches. • Use Generalized Linear Models and model counts using – Poisson distribution. – Negative binomial distribution. • Transform count data to use existing approaches for microarray data. • …
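The first option, Fisher's exact test, can be sketched with the standard library alone. The 2x2 table layout (reads for one gene versus all other reads, in two conditions) is an assumption about how the test would be applied, and the counts are hypothetical. The function requires Python 3.8+ for `math.comb`.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same margins
    that are no more likely than the observed one.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance so ties with the observed table are included.
    return sum(p for p in (prob(x) for x in range(lo, hi + 1)) if p <= p_obs * (1 + 1e-7))


# Hypothetical counts: reads for one gene vs. all other reads in two conditions.
p_value = fisher_exact_two_sided(30, 970, 10, 990)
print(p_value)
```

This treats each library as a single sample, so it captures sampling noise only. With biological replicates, the count-based GLMs listed on the slide (Poisson, negative binomial) are the usual alternative.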
  • 57. [5] Advantages of RNA-seq • Unlike hybridization approaches, does not require existing genomic sequence • Expected to replace microarrays for transcriptomic studies • Very low background noise • Reads can be unambiguously mapped • Resolution up to 1 bp • High-throughput quantitative measurement of transcript abundance • Better than Sanger sequencing of cDNA or EST libraries • Cost decreasing all the time • Lower than traditional sequencing • Can reveal sequence variations (SNPs) • Automated pipelines available
  • 58. Software for NGS preprocessing and analysis
  • 59. Which software for NGS (data) analysis? • Answer is not straightforward. http://seqanswers.com/wiki/Software/list • Many possible classifications – Biological domains • SNP discovery, Genomics, ChIP-Seq, De-novo assembly, … – Bioinformatics methods • Mapping, Assembly, Alignment, Seq-QC, … – Technology • Illumina, 454, ABI SOLiD, Helicos, … – Operating system • Linux, Mac OS X, Windows, … – License type • GPLv3, GPL, Commercial, Free for academic use, … – Language • C++, Perl, Java, C, Python – Interface • Web based, integrated solutions, command line tools, pipelines, …
  • 61. Some popular tools and places
  • 63. • Obtain data from many data sources including the UCSC Table Browser, BioMart, WormBase, or your own data. • Prepare data for further analysis by rearranging or cutting data columns, filtering data and many other actions. • Analyze data by finding overlapping regions, determining statistics, phylogenetic analysis and much more.
  • 64. User Register • Contains links to the downloading, pre-processing and analysis tools • Displays menus and data inputs • Shows the history of analysis steps, data and result viewing
  • 66. Get Data from Database
  • 67. Upload File. File Format. Upload or paste file.
  • 68.
  • 69. FASTQ file manipulation: format conversion, summary statistics, trimming reads, filtering reads by quality score…
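Two of these operations, trimming reads and filtering them by quality score, can be sketched in plain Python. The sketch assumes four-line FASTQ records with Phred+33 quality encoding. The thresholds and function names are hypothetical and do not correspond to any specific Galaxy tool.

```python
def parse_fastq(lines):
    """Yield (name, sequence, quality_string) from FASTQ lines (4 lines per record)."""
    lines = [ln.rstrip("\n") for ln in lines if ln.strip()]
    for i in range(0, len(lines) - 3, 4):
        header, seq, _plus, qual = lines[i:i + 4]
        yield header[1:], seq, qual


def phred_scores(qual, offset=33):
    """Decode a quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in qual]


def trim_3prime(seq, qual, min_q=20):
    """Remove bases from the 3' end until one with quality >= min_q is reached."""
    scores = phred_scores(qual)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], qual[:end]


def filter_reads(records, min_q=20, min_len=4):
    """Trim each read, then keep only reads that are still at least min_len long."""
    kept = []
    for name, seq, qual in records:
        seq, qual = trim_3prime(seq, qual, min_q)
        if len(seq) >= min_len:
            kept.append((name, seq, qual))
    return kept


# Hypothetical records: 'I' encodes Q40 and '#' encodes Q2 in Phred+33.
fastq_text = """@read1
ACGTACGT
+
IIIIII##
@read2
ACGT
+
I###
"""
kept = filter_reads(parse_fastq(fastq_text.splitlines()))
print(kept)
```

The first read loses its two low-quality 3' bases and is kept; the second is trimmed to a single base and discarded for being too short.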
  • 71. Downstream analysis: SAM -> BAM
  • 72. List saved histories and shared histories. Work on a current history, create new, share workflow. (Copyright OpenHelix. No use or reproduction without express written consent.)
  • 73. Creates a workflow, allows user to repeat analysis using different datasets.
  • 74. DATA VISUALIZATION
  • 75. Why is visualization important? • make large amounts of data more interpretable • glean patterns from the data • sanity check / visual debugging • more…
  • 76. History of Genome Visualization (timeline figure: 1800s, 1900s, 2000s)
  • 77. What is a “Genome Browser”? • linear representation of a genome • position-based annotations, each called a track – continuous annotations: e.g. conservation – interval annotations: e.g. gene, read alignment – point annotations: e.g. SNPs • user specifies a subsection of genome to look at
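The track model described above can be sketched as a small data structure: each track is a list of position-based features, and a view is the subset that overlaps the region the user asks for. The class, function and feature names below are hypothetical and are not taken from any real genome browser.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive; equal to start for point annotations
    label: str


def features_in_region(track, chrom, region_start, region_end):
    """Return the features of a track that overlap the requested region."""
    return [f for f in track
            if f.chrom == chrom and f.start <= region_end and f.end >= region_start]


# Hypothetical tracks: one interval track (genes) and one point track (SNPs).
tracks = {
    "genes": [Feature("chr1", 100, 500, "geneA"), Feature("chr1", 900, 1200, "geneB")],
    "snps": [Feature("chr1", 250, 250, "snp1"), Feature("chr1", 1000, 1000, "snp2")],
}

# The user asks to look at chr1:200-600.
view = {name: [f.label for f in features_in_region(track, "chr1", 200, 600)]
        for name, track in tracks.items()}
print(view)
```

Treating a point annotation as an interval of length one lets the same overlap test serve both track types. Continuous tracks such as conservation scores would need a per-position value array instead.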
  • 78. Server-side model (e.g. UCSC, Ensembl, GBrowse) server: • central data store • renders images • sends to client; client: • requests images • displays images
  • 79. Client-side model (e.g. Savant, IGV) server: • stores data; client: • local HTS store • renders images • displays images (figure label: HTS machine)
  • 80. Rough comparison of Genome Browsers. Browsers compared: UCSC, Ensembl, GBrowse, Savant, IGV. Model: Server, Server, Server, Client, Client. Other criteria (Interactive, HTS support, Database of tracks, Plugins) are rated on the slide with the legend: No support, Some support, Good support.
  • 81. Limitations of most genome browsers • do not support multiple genomes simultaneously • do not capture 3-dimensional conformation • do not capture spatial or temporal information • do not integrate well with analytics • cannot be customized. The SAVANT GENOME BROWSER has been created to overcome these limitations.
  • 82. Integrative Genomics Viewer (IGV) The Integrative Genomics Viewer (IGV) is a high-performance visualization tool for interactive exploration of large, integrated datasets. It supports a wide variety of data types including sequence alignments, microarrays, and genomic annotations.
  • 83. Acknowledgements • Statistics and Bioinformatics Research Group, Statistics Department, Universitat de Barcelona. • All the members at the Unitat d’Estadística i Bioinformàtica del VHIR (Vall d’Hebron Institut de Recerca) • Unitat de Serveis Científico Tècnics (UCTS) del VHIR (Vall d’Hebron Institut de Recerca) • People whose materials have been borrowed or who have contributed with their work • Manel Comabella, Rosa Prieto, Paqui Gallego, Javier Santoyo, Ana Conesa, Thomas Girke and Silvia Cardona.…
  • 84. Thank you for your attention and patience