SlideShare a Scribd company logo
1 of 57
Download to read offline
Discovery and annotation of
variants by exome analysis using
              NGS


                  Granada, June 2011

                    Javier Santoyo
                    jsantoyo@cipf.es
                   http://bioinfo.cipf.es
        Bioinformatics and Genomics Department
      Centro de Investigacion Principe Felipe (CIPF)
                    (Valencia, Spain)
Some of the most common
    applications of NGS.
  Many array-based technologies become obsolete !!!
                                            Resequencing:
    RNA-seq                                 SNV and indel
Transcriptomics:
  Quantitative                                Structural
   Descriptive                             variation (CNV,
  (alternative                             translocations,
    splicing)                             inversions, etc.)
     miRNA




          Chip-seq                         De novo
 Protein-DNA interactions                 sequencing
 Active transcription factor
     binding sites, etc.         Metagenomics
                               Metatranscriptomics
Evolution of the papers published in
microarray and next gen technologies




Source Pubmed. Query: "high-throughput sequencing"[Title/Abstract] OR "next
generation sequencing"[Title/Abstract] OR "rna seq"[Title/Abstract]) AND
year[Publication Date]
Projections 2011 based on January and February
Next generation sequencing
                   technologies are here
     100,000       First genome:
                      13 years
                  ~3,000,000,000€
     10,000


       1,000
€




        100


         10

                                                   Moore’s
          1
M




                                                    Law
n
o
l
i




         0.1
                                                        <2weeks
                                                            ~1000€
        0.01


       0.001


               1990             2001   2007 2009     2012




 While the cost goes down, the amount of data to manage and its complexity raise
exponentially.
Sequence to Variation Workflow
                  Raw                        FastX
                                                        FastQ
                  Data                       Tookit
                            FastQ


                                                               BWA/
       IGV                                                    BWASW
                                          BWA/
                                         BWASW

                                                        SAM



                                                              SAM
                                                              Tools
                                             Samtools
                BCF         BCF
                                             mPileup
       Filter   Tools Raw   Tools
 GFF                                Pileup              BAM
       VCF            VCF
Raw Sequence Data Format
• Fasta, csfasta

• Fastq, csfastq

• SFF

• SRF

• The eXtensible SeQuence (XSQ)


• Others:
  – http://en.wikipedia.org/wiki/List_of_file_formats#Biology
Fasta & Fastq formats
• FastA format
  – Header line starts with “>” followed by a sequence ID
  – Sequence (string of nt).


• FastQ format
  – First is the sequence (like Fasta but starting with “@”)
  – Then “+” and sequence ID (optional) and in the following line
    are QVs encoded as single byte ASCII codes


• Nearly all aligners take FastA or FastQ as input sequence

• Files are flat files and are big
Processed Data Formats
• Column separated file format contains features
  and their chromosomal location. There are flat
  files (no compact)
  – GFF and GTF
  – BED
  – WIG


• Similar but compact formats and they can handle
  larger files
  – BigBED
  – BigWIG
Processed Data Formats GFF
• Column separated file format contains features
  located at chromosomal locations

• Not a compact format

• Several versions
  – GFF 3 most currently used
  – GFF 2.5 is also called GTF (used at Ensembl for
    describing gene features)
GFF structure




GFF3 can describes the representation of a protein-coding gene
GFF3 file example
##gff-version   3
##sequence-region   ctg123 1 1497228
ctg123 . gene            1000 9000     .   +   .   ID=gene00001;Name=EDEN
ctg123 . TF_binding_site 1000 1012     .   +   .   Parent=gene00001
ctg123 . mRNA            1050 9000     .   +   .   ID=mRNA00001;Parent=gene00001
                                                                                          Column 1: "seqid"
ctg123 . mRNA            1050 9000     .   +   .   ID=mRNA00002;Parent=gene00001
ctg123 . mRNA            1300 9000     .   +   .   ID=mRNA00003;Parent=gene00001          Column 2: "source"
ctg123 . exon            1300 1500     .   +   .   Parent=mRNA00003
ctg123 . exon            1050 1500     .   +   .   Parent=mRNA00001,mRNA00002             Column 3: "type"
ctg123 . exon            3000 3902     .   +   .   Parent=mRNA00001,mRNA00003
ctg123 . exon            5000 5500     .   +   .   Parent=mRNA00001,mRNA00002,mRNA00003   Column 4: "start"
ctg123 . exon            7000 9000     .   +   .   Parent=mRNA00001,mRNA00002,mRNA00003
ctg123 . CDS             1201 1500     .   +   0   ID=cds00001;Parent=mRNA00001           Column 5: "end"
ctg123 . CDS             3000 3902     .   +   0   ID=cds00001;Parent=mRNA00001
                                                                                          Column 6: "score"
ctg123 . CDS             5000 5500     .   +   0   ID=cds00001;Parent=mRNA00001
ctg123 . CDS             7000 7600     .   +   0   ID=cds00001;Parent=mRNA00001           Column 7: "strand"
ctg123 . CDS             1201 1500     .   +   0   ID=cds00002;Parent=mRNA00002
ctg123 . CDS             5000 5500     .   +   0   ID=cds00002;Parent=mRNA00002           Column 8: "phase"
ctg123 . CDS             7000 7600     .   +   0   ID=cds00002;Parent=mRNA00002
                                                                                          Column 9: "attributes"
ctg123 . CDS             3301 3902     .   +   0   ID=cds00003;Parent=mRNA00003
ctg123 . CDS             5000 5500     .   +   2   ID=cds00003;Parent=mRNA00003
ctg123 . CDS             7000 7600     .   +   2   ID=cds00003;Parent=mRNA00003
ctg123 . CDS             3391 3902     .   +   0   ID=cds00004;Parent=mRNA00003
ctg123 . CDS             5000 5500     .   +   2   ID=cds00004;Parent=mRNA00003
ctg123 . CDS             7000 7600     .   +   2   ID=cds00004;Parent=mRNA00003
BED

• Created by USCS genome team

• Contains similar information to the GFF, but
 optimized for viewing in the UCSC genome browser
• BIG BED, optimized for next gen data – essentially
 a binary version
  – It can be displayed at USCS Web browser (even several
    Gbs !!)
WIG


• Also created by USCS team

• Optimized for storing “levels”

• Useful for displaying “peaks” (transcriptome, ChIP-
 seq)
• BIG WIG is a binary WIG format

• It can also be uploaded onto USCS web browser
Short Read Aligners, just a few?
     AGiLE                               PerM
     BFAST                               QPalma
     BLASTN                              RazerS
     BLAT                                RMAP
     Bowtie                              SeqMap
     BWA                                 Shrec
     CASHX                               SHRiMP
     CUDA-EC                             SLIDER
     ELAND                               SLIM Search
     GNUMAP                              SOAP and SOAP2
     GMAP and GSNAP                      SOCS
     Geneious Assembler                  SSAHA and SSAHA2
     LAST                                Stampy
     MAQ                                 Taipan
     MOM                                 UGENE
     MOSAIK                              XpressAlign
     Novoalign                           ZOOM
     PALMapper                           ...


                          Adapted from http://en.wikipedia.org/wiki/List_of_sequence_alignment_software
Mapped Data: SAM specification
This specification aims to define a generic sequence alignment format, SAM,
that describes the alignment of query sequencing reads to a reference
sequence or assembly, and:
• Is flexible enough to store all the alignment information generated by various
alignment programs;
• Is simple enough to be easily generated by alignment programs or converted
from existing alignment formats;


• SAM specification was developed for the 1000 Genome Project.
   – Contains information about the alignment of a read to a genome and
     keeps track of chromosomal position, quality alignment, and features of
     the alignment (extended cigar).
• Includes mate pair / paired end information joining distinct reads
• Quality of alignment denoted by mapping/pairing QV
 Li H.*, Handsaker B.*, Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R. and 1000 Genome Project Data
 Processing Subgroup (2009) The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics, 25, 2078-9. [PMID: 19505943]
Mapped Data: SAM/BAM format
●
     SAM (Sequence Alignment/Map) developed to keep track of
     chromosomal position, quality alignments and features of sequence
     reads alignment.

• BAM is a binary version of SAM - This format is more compact

• Most of downstream analysis programs takes this binary format

• Allows most of operations on the alignment to work on a stream
without loading the whole alignment into memory

• Allows the file to be indexed by genomic position to efficiently
retrieve all reads aligning to a locus.

    Li H.*, Handsaker B.*, Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R. and 1000 Genome Project Data
    Processing Subgroup (2009) The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics, 25, 2078-9. [PMID: 19505943]
BAM format
• Many tertiary analysis tools use BAM
   – BAM makes machine specific issues “transparent” e.g.
     colour space
   – A common format makes downstream analysis
     independent from the mapping program




 IGV http://www.broadinstitute.org/igv
                                                                                              SAVANT Fiume et al

   Adapted from http://www.bioinformatics.ca/files/CBW%20-%20presentations/HTSeq_2010_Module%202/HTSeq_2010_Module%202.pdf
SAM format
SAM format
VCF format
- The Variant Call Format (VCF) is the emerging standard for
storing variant data.
- Originally designed for SNPs and short INDELs, it also works for
structural variations.


- VCF consists of a header section and a data section.
- The header must contain a line starting with one '#', showing
the name of each field, and then the sample names starting at
the 10th column.
- The data section is TAB delimited with each line consisting of
at least 8 mandatory fields
The FORMAT field and sample information are allowed to be
absent.


         http://samtools.sourceforge.net/mpileup.shtml
VCF data section fields




http://samtools.sourceforge.net/mpileup.shtml
The Draft Human Genome Sequence millestone


                                             DISEASE


                                             METABOLOME


                                             PROTEOME


                                             TRANSCRIPTOME


                                             GENOME
Some diseases are coded in the
             genome
How to find mutations associated to diseases?




           An equivalent of the genome would amount
           almost 2000 books, containing 1.5 million letters
           each (average books with 200 pages).
           This information is contained in any single cell of
           the body.
In monogenic diseases only one
  mutation causes the disease.
                                   The challenge is to find
                                   this letter changed out of
                                   all the 3000 millions of
                                   letters in the 2000 books
                                   of the library
     T
                                   Solution:
                                   Read it all

                                   Problem:
Example:                           Too much to read
Book 1129, pag. 163, 3rd
paragraph, 5th line, 27th letter
should be a A instead a T
The bioinfomatic challenge: finding the
    mutation that causes the disease.

The problem is even worst: we cannot read the library
complete library.

Sequencers can only read small portions. No more than
500 letters at a time.


So, the library
must be inferred
from fragments
of pages of the
                    ATCCACTGG
books               CCCCTCGTA
                    GCGAAAAGC
Reading the text
            En un lugar de la Mancha, de cuyo nombre no quiero acordarme…

            En un lugar de la Mancha, de cuyo nombre no quiero acordarme…

            En un lugar de la Mancha, de cuyo nombre ni quiero acordarme…

            En un lugar de la Mancha, de cuyo nombre ni quiero acordarme…




En u | n lugar d | e la Manc | ha, de c | uyo no | mbre no qu | iero acor | darme

En un lu | gar de la M | ancha, de c | uyo nom |    bre no q | uiero aco | rdarme

En | un luga | r de la Ma | ncha, de cu | yo nombr | e ni quie | ro acordar | me

En un lu | gar de la Man | cha, d | e cuyo n | ombre n | i quier | o acorda | rme
Mapping fragments and detection of
            mutations
                                  yo nombr
   un lugar de la Mancha, de cu      ombre n                darme
         gar de la Man       e cuyo n e      ni quie o acorda
En n lugar d        ancha, de cuyo nom           uiero aco     me
En u     gar de la M    ha, de c         bre no q iero acor rme
En un lu      e la Manc                 mbre no qu ro acordar
En un lugar de la M cha, d uyo no             i quier      rdarme
En un lugar de la Mancha, de cuyo nombre no quiero acordarme
Space reduction: Look only for
         Mutations that can affect transcription and/or gene
                              products

Triplex-forming
                                       Human-mouse                            Amino acid
sequences                                                                                   Splicing inhibitors
(Goñi et al. Nucleic Acids
                                       conserved regions                      change
Res.32:354-60, 2004)




                                                                                            SF2/ASF                SC35




             Transfac

           TFBSs
                                                                                              SRp40                SRp55
          (Wingender et al., Nucleic
          Acids Res., 2000)               Intron/exon junctions                            ESE (exonic splicing
                                          (Cartegni et al., Nature Rev. Genet., 2002)      enhancers) motifs
                                                                                           recognized by SR
                                                                                           proteins
                                                                                           (Cartegni et al., Nucleic Acids Res., 2003)
•
                     Why exome sequencing?
        Whole-genome sequencing of individual humans is increasingly practical . But cost remains a key
        consideration and added value of intergenic mutations is not cost-effective.

    ●
         Alternative approach: targeted resequencing of all protein-coding subsequences (exome
         sequencing, ~1% of human genome)

•       Linkage analysis/positional cloning studies that focused on protein coding sequences were highly
        successful at identification of variants underlying monogenic diseases (when adequately powered)

•       Known allelic variants known to underlie Mendelian disorders disrupt protein-coding sequences

    ●
         Large fraction of rare non-synonymous variants in human genome are predicted to be deleterious

•       Splice acceptor and donor sites are also enriched for highly functional variation and are therefore
        targeted as well


        The exome represents a highly enriched subset of the genome in which to
                        search for variants with large effect sizes
How does exome sequencing work?
     Gene A                      Gene B

                                                        DNA (patient)

1     Produce shotgun
      library




 2                                                              Determine
        Capture exon                                            variants,
        sequences                Map against             5      Filter, compare
                          4      reference genome               patients

               3
                                                    candidate genes

               Wash & Sequence
Exome sequencing
         Common Research Goals

Identify novel genes responsible for monogenic diseases

Use the results of genetic research to discover new drugs acting on
new targets (new genes associated with human disease pathways)

Identify susceptibility genes for common diseases




                      Challenges

To develop innovative bioinformatics tools for the detection and
  characterisation of mutations using genomic information.
Rare and common disorders
                                            Exome sequencing




Manolio TA, et al (2009). Finding the missing heritability of complex diseases. Nature., 461(7265), 747
Finding mutations associated to
           diseases
    The simplest case: dominant monogenic disease


                                     A


                         B       C




                             D       E
Controls
                                              Cases
The principle: comparison of patients
      and reference controls

mutation                                                  Patient 1

                                                          Patient 2

                                                          Patient 3

                                                          Control 1

                                                          Control 2

                                                          Control 3



           candidate gene (shares mutation for all patients but no controls)
Different levels of complexity
• Diseases can be dominant or recessive
• Diseases can have incomplete penetrancy
• Patients can be sporadic or familiar
• Controls can or cannot be available

                       Dominant: (hetero in B, C and D) AND (no in A
    A        B        and E) AND no in controls

                       Recessive: (homo in B, C and D) hetero in A
                      and D AND NO homo in controls

C       D        E

        Ad-hoc strategies are needed for the analysis
Filtering
                                                      Reducing the number of
Strategies                                            candidate genes in two
                                                      directions

                                                          Share Genes with
                                                          variations

               Patient      Patient 2    Patient 3
               1                                         Several shared Genes
No filtering   genes with   genes with   genes with
               variant      variant      variant


Remove         genes with   genes with   genes with
known          variant      variant      variant
variants
Remove         genes with   genes with   genes with
synonymou      variant      variant      variant
s variants
...
                                                          Few shared Genes
An example with MTC
     A         B       Dominant:
                       Heterozygotic in A and D
                       Homozygotic reference allele in B and C
                       Homozygotic reference allele in controls
     D         C




                                                                  A

                                                                  D
 RET, codon
634 mutation
                                                                  B


                                                                  C
The Pursuit of Better and more Efficient Healthcare
as well as Clinical Innovation through Genetic and
                 Genomic Research


          Public-Private partnership
          Autonomous Government of Andalusia

          Spanish Ministry of Innovation

          Pharma and Biotech Companies
MGP Specific objectives

 To sequence the genomes of clinically well characterized patients
  with potential mutations in novel genes.


 To generate and validate a database of genomes of phenotyped
  control individuals.


 To develop bioinformatics tools for the detection and
  characterisation of mutations
SAMPLES + UPDATED AND COMPREHENSIVE HEALTH RECORD
            Currently14,000 Phenotyped DNA Samples
             from patients and control individuals.


     Prospective Healthcare:                   Patient health & sample record real
  linking research & genomic                   time automatic update and
 information to health record                  comprehensive data mining system
           databases




SIDCA Bio e-Bank
Andalusian DNA Bank
Direct link to the health care system

MPG roadmap is based on the availability of >14.000 well-
characterized samples with a permanent updated sample
information & PHR that will be used as the first steps of the
implementation of personalized medicine in the Andalusian HCS

                •Cancer                        • Unknown genes
                •Congenital anomalies          • Known genes discarded
                (heart, gut, CNS,…)
                                               • Responsible genes
                •Mental retardation            known but unknown
                •MCA/MR syndromes              modifier genes
>14.000         •Diabetes                      •Susceptibility Genes
phenotyped      •Neurodegenerative      with
samples         diseases                       •…

                •Stroke (familiar)
                •Endometriosis
                •Control Individuals
Two technologies to scan for
        variations

       454 Roche         Structural
       Longer reads
       Lower coverage    variation

                         •Amplifications
                         •Deletions
                         •CNV
                         •Inversions
                         •Translocations


       SOLiD ABI
       Shorter reads     Variants
       Higher coverage
                         •SNPs
                         •Mutations
                         •indels
Analysis Pipeline
Raw Data Quality Control QC   FastQC & in house software


                              BWA, Bowtie, BFAST
      Reads Mapping & QC




                                                                     Bioscope
                              QC in house software


      Variant Calling         GATK, SAMTOOLS, FreeBayes



      Variant Filtering       dbSNP, 1000 Genomes



     Variant annotation                Annovar & in house software



Disease/Control comparison              Family & healthy controls
Approach to discovery rare or novel
                      variants




130.00              99.000
0




           55.000                18.000.000
Timeline of the MGP


Now      2011                 2012 …



  • First 50                                   Yet unpredictable
                • Several hundreds of
  genomes                                      number of
                genomes
  • First 2-3                                  genomes and
                • 100 diseases (approx)        diseases
  diseases


                Announced         Expected
                changes in        changes in
                the               the
                technology        technology
Nimblegen capture arrays
●   SeqCap EZ Human Exome Library v1.0 / v2.0

●   Gene and exon annotations (v2.0):
    ●
        RefSeq (Jan 2010), CCDS (Sept 2009), miRBase (v.14, Sept 2009).

●   Total of ~30,000 coding genes (theoretically)
    ●
        ~300,000 exons;
    ●   36.5 Mb are targeted by the design.

●   2.1 million long oligo probes to cover the target regions.
    ●
        Because some flanking regions are also covered by probes, the total size of regions covered by probes is
        44.1 Mb

●   Real coverage:
    – Coding genes included:       18897
    – miRNAs       720
    – Coding genes not captured:    3865
Sequencing at MGP
 By the end of June
2011 there are 72
exomes sequenced
so far.

 4 SOLiD can produce
20 exomes per week

 The facilities of the
CASEGH can carry
out the MGP and
other collaborative
projects at the same
time
Real coverage and some figures
 Sequencing
 Enrichment +Sequence run: ~2 weeks
 400,000,000 sequences/flowcell
 20,000,000,000 bases/flowcell
 Short 50bp sequences


 Exome Coverage
 SeqCap EZ Human Exome Library v1.0 / v2.0
 Total of ~30,000 coding genes (theoretically)
    ~300,000 exons;
    36.5 Mb are targeted by the design (2.1 million long oligo
    probes).
 Real coverage:
    Coding genes included:      18,897
    miRNAs                         720
    Coding genes not captured: 3,865
    Genes of the exome with coverage >10x: 85%
Data Simulation & analysis
 ●
     Data simulated for 80K mutations and 60x coverage




• BWA finds less false positives in simulated data
Real Data & analysis
●
    Analiysis data is compared to genotyped data
    ●
        BFAST higher number of variants and 95% agreement
And this is what we find in the
  variant calling pipeline
Coverage           > 50x
Variants (SNV): 60.000 – 80.000
Variants (indels): 600-1000
100 known variants associated to disease risk
From data to knowledge
              Some considerations
                              Obvious: huge datasets need to be
                              managed by computers
                              Important: bioinformatics is
                              necessary to properly analyze the
                              data.
                              Even more important and not so
                              obvious: hypotheses must be tested
                              from the perspective provided by the
                              bioinformatics

The science is generated where the data reside.
Yesterday’s “one-bite” experiments fit into a laboratory notebook. Today, terabite
data from genomic experiments reside in computers, the new scientist’s notebook
And It gets more complicated
 Context and cooperation between genes cannot be ignored
Controls                                      Cases




The cases of the multifactorial disease will have different
mutations (or combinations). Many cases have to be used
to obtain significant associations to many markers. The
only common element is the pathway (yet unknow)
affected.
The Bioinformatics and Genomics Department at the Centro de
     Investigación Príncipe Felipe (CIPF), Valencia, Spain, and…
Joaquín Dopazo
Eva Alloza
Roberto Alonso
Alicia Amadoz
Davide Baù
Jose Carbonell
Ana Conesa
Alejandro de Maria
                     ...the INB, National Institute of Bioinformatics (Functional
Hernán Dopazo        Genomics Node) and the CIBERER Network of Centers for Rare
Pablo Escobar        Diseases
Fernando García
Francisco García
Luz García
Stefan Goetz
Carles Llacer
Martina Marbà
Marc Martí
Ignacio Medina
David Montaner
Luis Pulido
Rubén Sánchez
Javier Santoyo
Patricia Sebastian
François Serra
Sonia Tarazona
Joaquín Tárraga
Enrique Vidal
Adriana Cucchi
CAG

Área de Genómica                HOSPITAL UNIVERSITARIO VIRGEN DEL ROCÍO
 Dr. Rosario Fernández Godino   Dr. Macarena Ruiz Ferrer

 Dr. Alicia Vela Boza           Nerea Matamala Zamarro

 Dr. Slaven Erceg               Prof. Guillermo Antiñolo Gil

Dr. Sandra Pérez Buira          Director de la UGC de Genética, Reproducción y Medicina

 María Sánchez León             Fetal Director del Plan de Genética de Andalucía

 Javier Escalante Martín
 Ana Isabel López Pérez         CABIMER
 Beatriz Fuente Bermúdez         Director de CABIMER y del Departamento de Terapia
                                Celular y Medicina Regenerativa
Área Bioinformática
                                Prof. Shom Shanker Bhattacharya,
Daniel Navarro Gómez
 Pablo Arce García
                                 CENTRO DE INVESTIGACIÓN PRÍNCIPE FELIPE
Juan Miguel Cruz
Secretaría/Administración       Responsable de la Unidad de Bioinformática y Genómica y
Inmaculada Guillén Baena        Director científico asociado para Bioinformática del Plan de
                                Genética de Andalucía
                                Dr. Joaquín Dopazo Blázquez

More Related Content

What's hot

RNA-seq: analysis of raw data and preprocessing - part 2
RNA-seq: analysis of raw data and preprocessing - part 2RNA-seq: analysis of raw data and preprocessing - part 2
RNA-seq: analysis of raw data and preprocessing - part 2BITS
 
NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...
NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...
NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...QIAGEN
 
RNA-Seq Analysis: Everything You Always Wanted to Know...and then some
RNA-Seq Analysis: Everything You Always Wanted to Know...and then someRNA-Seq Analysis: Everything You Always Wanted to Know...and then some
RNA-Seq Analysis: Everything You Always Wanted to Know...and then somebasepairtech
 
Wellcome Trust Advances Course: NGS Course - Lecture1
Wellcome Trust Advances Course: NGS Course - Lecture1Wellcome Trust Advances Course: NGS Course - Lecture1
Wellcome Trust Advances Course: NGS Course - Lecture1Thomas Keane
 
Part 2 of RNA-seq for DE analysis: Investigating raw data
Part 2 of RNA-seq for DE analysis: Investigating raw dataPart 2 of RNA-seq for DE analysis: Investigating raw data
Part 2 of RNA-seq for DE analysis: Investigating raw dataJoachim Jacob
 
diffReps: automated ChIP-seq differential analysis package
diffReps: automated ChIP-seq differential analysis packagediffReps: automated ChIP-seq differential analysis package
diffReps: automated ChIP-seq differential analysis packageLi Shen
 
Odyssey Of The IWGSC Reference Genome Sequence: 12 Years 1 Month 28 Days 11 ...
 Odyssey Of The IWGSC Reference Genome Sequence: 12 Years 1 Month 28 Days 11 ... Odyssey Of The IWGSC Reference Genome Sequence: 12 Years 1 Month 28 Days 11 ...
Odyssey Of The IWGSC Reference Genome Sequence: 12 Years 1 Month 28 Days 11 ...Fabio Caligaris
 
Toward A Better Understanding Of Plant Genome Structure: Combining NGS, Optic...
Toward A Better Understanding Of Plant Genome Structure: Combining NGS, Optic...Toward A Better Understanding Of Plant Genome Structure: Combining NGS, Optic...
Toward A Better Understanding Of Plant Genome Structure: Combining NGS, Optic...Fabio Caligaris
 
Digital RNAseq for Gene Expression Profiling: Digital RNAseq Webinar Part 2
Digital RNAseq for Gene Expression Profiling: Digital RNAseq Webinar Part 2Digital RNAseq for Gene Expression Profiling: Digital RNAseq Webinar Part 2
Digital RNAseq for Gene Expression Profiling: Digital RNAseq Webinar Part 2QIAGEN
 
Characterizing Alzheimer’s Disease candidate genes and transcripts with targe...
Characterizing Alzheimer’s Disease candidate genes and transcripts with targe...Characterizing Alzheimer’s Disease candidate genes and transcripts with targe...
Characterizing Alzheimer’s Disease candidate genes and transcripts with targe...Integrated DNA Technologies
 
Increasing genome editing efficiency with optimized CRISPR-Cas enzymes
Increasing genome editing efficiency with optimized CRISPR-Cas enzymesIncreasing genome editing efficiency with optimized CRISPR-Cas enzymes
Increasing genome editing efficiency with optimized CRISPR-Cas enzymesIntegrated DNA Technologies
 
Introduction to Single-cell RNA-seq
Introduction to Single-cell RNA-seqIntroduction to Single-cell RNA-seq
Introduction to Single-cell RNA-seqTimothy Tickle
 
rhAmp™ SNP Genotyping: A novel approach for improving PCR-based SNP genotyping
rhAmp™ SNP Genotyping: A novel approach for improving PCR-based SNP genotypingrhAmp™ SNP Genotyping: A novel approach for improving PCR-based SNP genotyping
rhAmp™ SNP Genotyping: A novel approach for improving PCR-based SNP genotypingIntegrated DNA Technologies
 

What's hot (20)

RNA-seq: analysis of raw data and preprocessing - part 2
RNA-seq: analysis of raw data and preprocessing - part 2RNA-seq: analysis of raw data and preprocessing - part 2
RNA-seq: analysis of raw data and preprocessing - part 2
 
NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...
NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...
NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...
 
RNA-Seq Analysis: Everything You Always Wanted to Know...and then some
RNA-Seq Analysis: Everything You Always Wanted to Know...and then someRNA-Seq Analysis: Everything You Always Wanted to Know...and then some
RNA-Seq Analysis: Everything You Always Wanted to Know...and then some
 
Wellcome Trust Advances Course: NGS Course - Lecture1
Wellcome Trust Advances Course: NGS Course - Lecture1Wellcome Trust Advances Course: NGS Course - Lecture1
Wellcome Trust Advances Course: NGS Course - Lecture1
 
Part 2 of RNA-seq for DE analysis: Investigating raw data
Part 2 of RNA-seq for DE analysis: Investigating raw dataPart 2 of RNA-seq for DE analysis: Investigating raw data
Part 2 of RNA-seq for DE analysis: Investigating raw data
 
diffReps: automated ChIP-seq differential analysis package
diffReps: automated ChIP-seq differential analysis packagediffReps: automated ChIP-seq differential analysis package
diffReps: automated ChIP-seq differential analysis package
 
Abrf 2017 hadfield j
Abrf 2017 hadfield jAbrf 2017 hadfield j
Abrf 2017 hadfield j
 
Odyssey Of The IWGSC Reference Genome Sequence: 12 Years 1 Month 28 Days 11 ...
 Odyssey Of The IWGSC Reference Genome Sequence: 12 Years 1 Month 28 Days 11 ... Odyssey Of The IWGSC Reference Genome Sequence: 12 Years 1 Month 28 Days 11 ...
Odyssey Of The IWGSC Reference Genome Sequence: 12 Years 1 Month 28 Days 11 ...
 
Toward A Better Understanding Of Plant Genome Structure: Combining NGS, Optic...
Toward A Better Understanding Of Plant Genome Structure: Combining NGS, Optic...Toward A Better Understanding Of Plant Genome Structure: Combining NGS, Optic...
Toward A Better Understanding Of Plant Genome Structure: Combining NGS, Optic...
 
Digital RNAseq for Gene Expression Profiling: Digital RNAseq Webinar Part 2
Digital RNAseq for Gene Expression Profiling: Digital RNAseq Webinar Part 2Digital RNAseq for Gene Expression Profiling: Digital RNAseq Webinar Part 2
Digital RNAseq for Gene Expression Profiling: Digital RNAseq Webinar Part 2
 
Sept2016 sv nist_intro
Sept2016 sv nist_introSept2016 sv nist_intro
Sept2016 sv nist_intro
 
Protein function prediction
Protein function predictionProtein function prediction
Protein function prediction
 
ChIP-seq - Data processing
ChIP-seq - Data processingChIP-seq - Data processing
ChIP-seq - Data processing
 
Sept2016 sv nabsys
Sept2016 sv nabsysSept2016 sv nabsys
Sept2016 sv nabsys
 
Characterizing Alzheimer’s Disease candidate genes and transcripts with targe...
Characterizing Alzheimer’s Disease candidate genes and transcripts with targe...Characterizing Alzheimer’s Disease candidate genes and transcripts with targe...
Characterizing Alzheimer’s Disease candidate genes and transcripts with targe...
 
20140711 4 e_tseng_ercc2.0_workshop
20140711 4 e_tseng_ercc2.0_workshop20140711 4 e_tseng_ercc2.0_workshop
20140711 4 e_tseng_ercc2.0_workshop
 
Increasing genome editing efficiency with optimized CRISPR-Cas enzymes
Increasing genome editing efficiency with optimized CRISPR-Cas enzymesIncreasing genome editing efficiency with optimized CRISPR-Cas enzymes
Increasing genome editing efficiency with optimized CRISPR-Cas enzymes
 
Introduction to Single-cell RNA-seq
Introduction to Single-cell RNA-seqIntroduction to Single-cell RNA-seq
Introduction to Single-cell RNA-seq
 
rhAmp™ SNP Genotyping: A novel approach for improving PCR-based SNP genotyping
rhAmp™ SNP Genotyping: A novel approach for improving PCR-based SNP genotypingrhAmp™ SNP Genotyping: A novel approach for improving PCR-based SNP genotyping
rhAmp™ SNP Genotyping: A novel approach for improving PCR-based SNP genotyping
 
20140711 3 t_clark_ercc2.0_workshop
20140711 3 t_clark_ercc2.0_workshop20140711 3 t_clark_ercc2.0_workshop
20140711 3 t_clark_ercc2.0_workshop
 

Similar to Discovery and annotation of variants by exome analysis using NGS

rnaseq_from_babelomics
rnaseq_from_babelomicsrnaseq_from_babelomics
rnaseq_from_babelomicsFrancisco Garc
 
SRAdb Bioconductor Package Overview
SRAdb Bioconductor Package OverviewSRAdb Bioconductor Package Overview
SRAdb Bioconductor Package OverviewSean Davis
 
Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2Li Shen
 
Workshop NGS data analysis - 2
Workshop NGS data analysis - 2Workshop NGS data analysis - 2
Workshop NGS data analysis - 2Maté Ongenaert
 
How to Standardise and Assemble Raw Data into Sequences: What Does it Mean fo...
How to Standardise and Assemble Raw Data into Sequences: What Does it Mean fo...How to Standardise and Assemble Raw Data into Sequences: What Does it Mean fo...
How to Standardise and Assemble Raw Data into Sequences: What Does it Mean fo...Joseph Hughes
 
Next-generation sequencing data format and visualization with ngs.plot 2015
Next-generation sequencing data format and visualization with ngs.plot 2015Next-generation sequencing data format and visualization with ngs.plot 2015
Next-generation sequencing data format and visualization with ngs.plot 2015Li Shen
 
20110524zurichngs 1st pub
20110524zurichngs 1st pub20110524zurichngs 1st pub
20110524zurichngs 1st pubsesejun
 
Next-generation sequencing format and visualization with ngs.plot
Next-generation sequencing format and visualization with ngs.plotNext-generation sequencing format and visualization with ngs.plot
Next-generation sequencing format and visualization with ngs.plotLi Shen
 
Next-generation sequencing - variation discovery
Next-generation sequencing - variation discoveryNext-generation sequencing - variation discovery
Next-generation sequencing - variation discoveryJan Aerts
 
rnaseq2015-02-18-170327193409.pdf
rnaseq2015-02-18-170327193409.pdfrnaseq2015-02-18-170327193409.pdf
rnaseq2015-02-18-170327193409.pdfPushpendra83
 
Galaxy RNA-Seq Analysis: Tuxedo Protocol
Galaxy RNA-Seq Analysis: Tuxedo ProtocolGalaxy RNA-Seq Analysis: Tuxedo Protocol
Galaxy RNA-Seq Analysis: Tuxedo ProtocolHong ChangBum
 
Cisco crs1
Cisco crs1Cisco crs1
Cisco crs1wjunjmt
 
2015 Bioc4010 lecture1and2
2015 Bioc4010 lecture1and22015 Bioc4010 lecture1and2
2015 Bioc4010 lecture1and2Dan Gaston
 
Creating a SNP calling pipeline
Creating a SNP calling pipelineCreating a SNP calling pipeline
Creating a SNP calling pipelineDan Bolser
 

Similar to Discovery and annotation of variants by exome analysis using NGS (20)

rnaseq_from_babelomics
rnaseq_from_babelomicsrnaseq_from_babelomics
rnaseq_from_babelomics
 
SRAdb Bioconductor Package Overview
SRAdb Bioconductor Package OverviewSRAdb Bioconductor Package Overview
SRAdb Bioconductor Package Overview
 
Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2
 
Lecture6.pptx
Lecture6.pptxLecture6.pptx
Lecture6.pptx
 
06.pptx
06.pptx06.pptx
06.pptx
 
Workshop NGS data analysis - 2
Workshop NGS data analysis - 2Workshop NGS data analysis - 2
Workshop NGS data analysis - 2
 
Cisco CCNA Data Center Networking Fundamentals
Cisco CCNA Data Center Networking FundamentalsCisco CCNA Data Center Networking Fundamentals
Cisco CCNA Data Center Networking Fundamentals
 
How to Standardise and Assemble Raw Data into Sequences: What Does it Mean fo...
How to Standardise and Assemble Raw Data into Sequences: What Does it Mean fo...How to Standardise and Assemble Raw Data into Sequences: What Does it Mean fo...
How to Standardise and Assemble Raw Data into Sequences: What Does it Mean fo...
 
Next-generation sequencing data format and visualization with ngs.plot 2015
Next-generation sequencing data format and visualization with ngs.plot 2015Next-generation sequencing data format and visualization with ngs.plot 2015
Next-generation sequencing data format and visualization with ngs.plot 2015
 
20110524zurichngs 1st pub
20110524zurichngs 1st pub20110524zurichngs 1st pub
20110524zurichngs 1st pub
 
Rnaseq forgenefinding
Rnaseq forgenefindingRnaseq forgenefinding
Rnaseq forgenefinding
 
Next-generation sequencing format and visualization with ngs.plot
Next-generation sequencing format and visualization with ngs.plotNext-generation sequencing format and visualization with ngs.plot
Next-generation sequencing format and visualization with ngs.plot
 
Next-generation sequencing - variation discovery
Next-generation sequencing - variation discoveryNext-generation sequencing - variation discovery
Next-generation sequencing - variation discovery
 
RNA-Seq
RNA-SeqRNA-Seq
RNA-Seq
 
rnaseq2015-02-18-170327193409.pdf
rnaseq2015-02-18-170327193409.pdfrnaseq2015-02-18-170327193409.pdf
rnaseq2015-02-18-170327193409.pdf
 
dfl
dfldfl
dfl
 
Galaxy RNA-Seq Analysis: Tuxedo Protocol
Galaxy RNA-Seq Analysis: Tuxedo ProtocolGalaxy RNA-Seq Analysis: Tuxedo Protocol
Galaxy RNA-Seq Analysis: Tuxedo Protocol
 
Cisco crs1
Cisco crs1Cisco crs1
Cisco crs1
 
2015 Bioc4010 lecture1and2
2015 Bioc4010 lecture1and22015 Bioc4010 lecture1and2
2015 Bioc4010 lecture1and2
 
Creating a SNP calling pipeline
Creating a SNP calling pipelineCreating a SNP calling pipeline
Creating a SNP calling pipeline
 

More from cursoNGS

Towards an understanding of diversity in biological and biomedical systems
Towards an understanding of diversity in biological and biomedical systemsTowards an understanding of diversity in biological and biomedical systems
Towards an understanding of diversity in biological and biomedical systemscursoNGS
 
Utilidad de la genómica en la salud humana
Utilidad de la genómica en la salud humanaUtilidad de la genómica en la salud humana
Utilidad de la genómica en la salud humanacursoNGS
 
NGS analysis of micro-RNA
NGS analysis of micro-RNANGS analysis of micro-RNA
NGS analysis of micro-RNAcursoNGS
 
Differential expression in RNA-Seq
Differential expression in RNA-SeqDifferential expression in RNA-Seq
Differential expression in RNA-SeqcursoNGS
 
Computational infrastructure for NGS data analysis
Computational infrastructure for NGS data analysisComputational infrastructure for NGS data analysis
Computational infrastructure for NGS data analysiscursoNGS
 
Introduction to NGS
Introduction to NGSIntroduction to NGS
Introduction to NGScursoNGS
 
Linux for bioinformatics
Linux for bioinformaticsLinux for bioinformatics
Linux for bioinformaticscursoNGS
 
Introduccion a la bioinformatica
Introduccion a la bioinformaticaIntroduccion a la bioinformatica
Introduccion a la bioinformaticacursoNGS
 

More from cursoNGS (8)

Towards an understanding of diversity in biological and biomedical systems
Towards an understanding of diversity in biological and biomedical systemsTowards an understanding of diversity in biological and biomedical systems
Towards an understanding of diversity in biological and biomedical systems
 
Utilidad de la genómica en la salud humana
Utilidad de la genómica en la salud humanaUtilidad de la genómica en la salud humana
Utilidad de la genómica en la salud humana
 
NGS analysis of micro-RNA
NGS analysis of micro-RNANGS analysis of micro-RNA
NGS analysis of micro-RNA
 
Differential expression in RNA-Seq
Differential expression in RNA-SeqDifferential expression in RNA-Seq
Differential expression in RNA-Seq
 
Computational infrastructure for NGS data analysis
Computational infrastructure for NGS data analysisComputational infrastructure for NGS data analysis
Computational infrastructure for NGS data analysis
 
Introduction to NGS
Introduction to NGSIntroduction to NGS
Introduction to NGS
 
Linux for bioinformatics
Linux for bioinformaticsLinux for bioinformatics
Linux for bioinformatics
 
Introduccion a la bioinformatica
Introduccion a la bioinformaticaIntroduccion a la bioinformatica
Introduccion a la bioinformatica
 

Recently uploaded

New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 

Recently uploaded (20)

New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 

Discovery and annotation of variants by exome analysis using NGS

  • 1. Discovery and annotation of variants by exome analysis using NGS Granada, June 2011 Javier Santoyo jsantoyo@cipf.es http://bioinfo.cipf.es Bioinformatics and Genomics Department Centro de Investigacion Principe Felipe (CIPF) (Valencia, Spain)
  • 2. Some of the most common applications of NGS. Many array-based technologies become obsolete !!! Resequencing: RNA-seq SNV and indel Transcriptomics: Quantitative Structural Descriptive variation (CNV, (alternative translocations, splicing) inversions, etc.) miRNA Chip-seq De novo Protein-DNA interactions sequencing Active transcription factor binding sites, etc. Metagenomics Metatranscriptomics
  • 3. Evolution of the papers published in microarray and next gen technologies Source Pubmed. Query: "high-throughput sequencing"[Title/Abstract] OR "next generation sequencing"[Title/Abstract] OR "rna seq"[Title/Abstract]) AND year[Publication Date] Projections 2011 based on January and February
  • 4. Next generation sequencing technologies are here 100,000 First genome: 13 years ~3,000,000,000€ 10,000 1,000 € 100 10 Moore’s 1 M Law n o l i 0.1 <2weeks ~1000€ 0.01 0.001 1990 2001 2007 2009 2012 While the cost goes down, the amount of data to manage and its complexity raise exponentially.
  • 5. Sequence to Variation Workflow Raw FastX FastQ Data Tookit FastQ BWA/ IGV BWASW BWA/ BWASW SAM SAM Tools Samtools BCF BCF mPileup Filter Tools Raw Tools GFF Pileup BAM VCF VCF
  • 6. Raw Sequence Data Format • Fasta, csfasta • Fastq, csfastq • SFF • SRF • The eXtensible SeQuence (XSQ) • Others: – http://en.wikipedia.org/wiki/List_of_file_formats#Biology
  • 7. Fasta & Fastq formats • FastA format – Header line starts with “>” followed by a sequence ID – Sequence (string of nt). • FastQ format – First is the sequence (like Fasta but starting with “@”) – Then “+” and sequence ID (optional) and in the following line are QVs encoded as single byte ASCII codes • Nearly all aligners take FastA or FastQ as input sequence • Files are flat files and are big
  • 8. Processed Data Formats • Column separated file format contains features and their chromosomal location. There are flat files (no compact) – GFF and GTF – BED – WIG • Similar but compact formats and they can handle larger files – BigBED – BigWIG
  • 9. Processed Data Formats GFF • Column separated file format contains features located at chromosomal locations • Not a compact format • Several versions – GFF 3 most currently used – GFF 2.5 is also called GTF (used at Ensembl for describing gene features)
  • 10. GFF structure GFF3 can describes the representation of a protein-coding gene
  • 11. GFF3 file example ##gff-version 3 ##sequence-region ctg123 1 1497228 ctg123 . gene 1000 9000 . + . ID=gene00001;Name=EDEN ctg123 . TF_binding_site 1000 1012 . + . Parent=gene00001 ctg123 . mRNA 1050 9000 . + . ID=mRNA00001;Parent=gene00001 Column 1: "seqid" ctg123 . mRNA 1050 9000 . + . ID=mRNA00002;Parent=gene00001 ctg123 . mRNA 1300 9000 . + . ID=mRNA00003;Parent=gene00001 Column 2: "source" ctg123 . exon 1300 1500 . + . Parent=mRNA00003 ctg123 . exon 1050 1500 . + . Parent=mRNA00001,mRNA00002 Column 3: "type" ctg123 . exon 3000 3902 . + . Parent=mRNA00001,mRNA00003 ctg123 . exon 5000 5500 . + . Parent=mRNA00001,mRNA00002,mRNA00003 Column 4: "start" ctg123 . exon 7000 9000 . + . Parent=mRNA00001,mRNA00002,mRNA00003 ctg123 . CDS 1201 1500 . + 0 ID=cds00001;Parent=mRNA00001 Column 5: "end" ctg123 . CDS 3000 3902 . + 0 ID=cds00001;Parent=mRNA00001 Column 6: "score" ctg123 . CDS 5000 5500 . + 0 ID=cds00001;Parent=mRNA00001 ctg123 . CDS 7000 7600 . + 0 ID=cds00001;Parent=mRNA00001 Column 7: "strand" ctg123 . CDS 1201 1500 . + 0 ID=cds00002;Parent=mRNA00002 ctg123 . CDS 5000 5500 . + 0 ID=cds00002;Parent=mRNA00002 Column 8: "phase" ctg123 . CDS 7000 7600 . + 0 ID=cds00002;Parent=mRNA00002 Column 9: "attributes" ctg123 . CDS 3301 3902 . + 0 ID=cds00003;Parent=mRNA00003 ctg123 . CDS 5000 5500 . + 2 ID=cds00003;Parent=mRNA00003 ctg123 . CDS 7000 7600 . + 2 ID=cds00003;Parent=mRNA00003 ctg123 . CDS 3391 3902 . + 0 ID=cds00004;Parent=mRNA00003 ctg123 . CDS 5000 5500 . + 2 ID=cds00004;Parent=mRNA00003 ctg123 . CDS 7000 7600 . + 2 ID=cds00004;Parent=mRNA00003
  • 12. BED • Created by USCS genome team • Contains similar information to the GFF, but optimized for viewing in the UCSC genome browser • BIG BED, optimized for next gen data – essentially a binary version – It can be displayed at USCS Web browser (even several Gbs !!)
  • 13. WIG • Also created by USCS team • Optimized for storing “levels” • Useful for displaying “peaks” (transcriptome, ChIP- seq) • BIG WIG is a binary WIG format • It can also be uploaded onto USCS web browser
  • 14. Short Read Aligners, just a few? AGiLE PerM BFAST QPalma BLASTN RazerS BLAT RMAP Bowtie SeqMap BWA Shrec CASHX SHRiMP CUDA-EC SLIDER ELAND SLIM Search GNUMAP SOAP and SOAP2 GMAP and GSNAP SOCS Geneious Assembler SSAHA and SSAHA2 LAST Stampy MAQ Taipan MOM UGENE MOSAIK XpressAlign Novoalign ZOOM PALMapper ... Adapted from http://en.wikipedia.org/wiki/List_of_sequence_alignment_software
  • 15. Mapped Data: SAM specification This specification aims to define a generic sequence alignment format, SAM, that describes the alignment of query sequencing reads to a reference sequence or assembly, and: • Is flexible enough to store all the alignment information generated by various alignment programs; • Is simple enough to be easily generated by alignment programs or converted from existing alignment formats; • SAM specification was developed for the 1000 Genome Project. – Contains information about the alignment of a read to a genome and keeps track of chromosomal position, quality alignment, and features of the alignment (extended cigar). • Includes mate pair / paired end information joining distinct reads • Quality of alignment denoted by mapping/pairing QV Li H.*, Handsaker B.*, Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R. and 1000 Genome Project Data Processing Subgroup (2009) The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics, 25, 2078-9. [PMID: 19505943]
  • 16. Mapped Data: SAM/BAM format ● SAM (Sequence Alignment/Map) developed to keep track of chromosomal position, quality alignments and features of sequence reads alignment. • BAM is a binary version of SAM - This format is more compact • Most of downstream analysis programs takes this binary format • Allows most of operations on the alignment to work on a stream without loading the whole alignment into memory • Allows the file to be indexed by genomic position to efficiently retrieve all reads aligning to a locus. Li H.*, Handsaker B.*, Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R. and 1000 Genome Project Data Processing Subgroup (2009) The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics, 25, 2078-9. [PMID: 19505943]
  • 17. BAM format • Many tertiary analysis tools use BAM – BAM makes machine specific issues “transparent” e.g. colour space – A common format makes downstream analysis independent from the mapping program IGV http://www.broadinstitute.org/igv SAVANT Fiume et al Adapted from http://www.bioinformatics.ca/files/CBW%20-%20presentations/HTSeq_2010_Module%202/HTSeq_2010_Module%202.pdf
  • 20. VCF format - The Variant Call Format (VCF) is the emerging standard for storing variant data. - Originally designed for SNPs and short INDELs, it also works for structural variations. - VCF consists of a header section and a data section. - The header must contain a line starting with one '#', showing the name of each field, and then the sample names starting at the 10th column. - The data section is TAB delimited with each line consisting of at least 8 mandatory fields The FORMAT field and sample information are allowed to be absent. http://samtools.sourceforge.net/mpileup.shtml
  • 21. VCF data section fields http://samtools.sourceforge.net/mpileup.shtml
  • 22. The Draft Human Genome Sequence millestone DISEASE METABOLOME PROTEOME TRANSCRIPTOME GENOME
  • 23. Some diseases are coded in the genome How to find mutations associated to diseases? An equivalent of the genome would amount almost 2000 books, containing 1.5 million letters each (average books with 200 pages). This information is contained in any single cell of the body.
  • 24. In monogenic diseases only one mutation causes the disease. The challenge is to find this letter changed out of all the 3000 millions of letters in the 2000 books of the library T Solution: Read it all Problem: Example: Too much to read Book 1129, pag. 163, 3rd paragraph, 5th line, 27th letter should be a A instead a T
  • 25. The bioinfomatic challenge: finding the mutation that causes the disease. The problem is even worst: we cannot read the library complete library. Sequencers can only read small portions. No more than 500 letters at a time. So, the library must be inferred from fragments of pages of the ATCCACTGG books CCCCTCGTA GCGAAAAGC
  • 26. Reading the text En un lugar de la Mancha, de cuyo nombre no quiero acordarme… En un lugar de la Mancha, de cuyo nombre no quiero acordarme… En un lugar de la Mancha, de cuyo nombre ni quiero acordarme… En un lugar de la Mancha, de cuyo nombre ni quiero acordarme… En u | n lugar d | e la Manc | ha, de c | uyo no | mbre no qu | iero acor | darme En un lu | gar de la M | ancha, de c | uyo nom | bre no q | uiero aco | rdarme En | un luga | r de la Ma | ncha, de cu | yo nombr | e ni quie | ro acordar | me En un lu | gar de la Man | cha, d | e cuyo n | ombre n | i quier | o acorda | rme
  • 27. Mapping fragments and detection of mutations yo nombr un lugar de la Mancha, de cu ombre n darme gar de la Man e cuyo n e ni quie o acorda En n lugar d ancha, de cuyo nom uiero aco me En u gar de la M ha, de c bre no q iero acor rme En un lu e la Manc mbre no qu ro acordar En un lugar de la M cha, d uyo no i quier rdarme En un lugar de la Mancha, de cuyo nombre no quiero acordarme
  • 28. Space reduction: Look only for Mutations that can affect transcription and/or gene products Triplex-forming Human-mouse Amino acid sequences Splicing inhibitors (Goñi et al. Nucleic Acids conserved regions change Res.32:354-60, 2004) SF2/ASF SC35 Transfac TFBSs SRp40 SRp55 (Wingender et al., Nucleic Acids Res., 2000) Intron/exon junctions ESE (exonic splicing (Cartegni et al., Nature Rev. Genet., 2002) enhancers) motifs recognized by SR proteins (Cartegni et al., Nucleic Acids Res., 2003)
  • 29. Why exome sequencing? Whole-genome sequencing of individual humans is increasingly practical . But cost remains a key consideration and added value of intergenic mutations is not cost-effective. ● Alternative approach: targeted resequencing of all protein-coding subsequences (exome sequencing, ~1% of human genome) • Linkage analysis/positional cloning studies that focused on protein coding sequences were highly successful at identification of variants underlying monogenic diseases (when adequately powered) • Known allelic variants known to underlie Mendelian disorders disrupt protein-coding sequences ● Large fraction of rare non-synonymous variants in human genome are predicted to be deleterious • Splice acceptor and donor sites are also enriched for highly functional variation and are therefore targeted as well The exome represents a highly enriched subset of the genome in which to search for variants with large effect sizes
  • 30. How does exome sequencing work? Gene A Gene B DNA (patient) 1 Produce shotgun library 2 Determine Capture exon variants, sequences Map against 5 Filter, compare 4 reference genome patients 3 candidate genes Wash & Sequence
  • 31.
  • 32. Exome sequencing Common Research Goals Identify novel genes responsible for monogenic diseases Use the results of genetic research to discover new drugs acting on new targets (new genes associated with human disease pathways) Identify susceptibility genes for common diseases Challenges To develop innovative bioinformatics tools for the detection and characterisation of mutations using genomic information.
  • 33. Rare and common disorders Exome sequencing Manolio TA, et al (2009). Finding the missing heritability of complex diseases. Nature., 461(7265), 747
  • 34. Finding mutations associated to diseases The simplest case: dominant monogenic disease A B C D E Controls Cases
  • 35. The principle: comparison of patients and reference controls mutation Patient 1 Patient 2 Patient 3 Control 1 Control 2 Control 3 candidate gene (shares mutation for all patients but no controls)
  • 36. Different levels of complexity • Diseases can be dominant or recessive • Diseases can have incomplete penetrancy • Patients can be sporadic or familiar • Controls can or cannot be available Dominant: (hetero in B, C and D) AND (no in A A B and E) AND no in controls Recessive: (homo in B, C and D) hetero in A and D AND NO homo in controls C D E Ad-hoc strategies are needed for the analysis
  • 37. Filtering Reducing the number of Strategies candidate genes in two directions Share Genes with variations Patient Patient 2 Patient 3 1 Several shared Genes No filtering genes with genes with genes with variant variant variant Remove genes with genes with genes with known variant variant variant variants Remove genes with genes with genes with synonymou variant variant variant s variants ... Few shared Genes
  • 38. An example with MTC A B Dominant: Heterozygotic in A and D Homozygotic reference allele in B and C Homozygotic reference allele in controls D C A D RET, codon 634 mutation B C
  • 39. The Pursuit of Better and more Efficient Healthcare as well as Clinical Innovation through Genetic and Genomic Research Public-Private partnership Autonomous Government of Andalusia Spanish Ministry of Innovation Pharma and Biotech Companies
  • 40. MGP Specific objectives  To sequence the genomes of clinically well characterized patients with potential mutations in novel genes.  To generate and validate a database of genomes of phenotyped control individuals.  To develop bioinformatics tools for the detection and characterisation of mutations
  • 41. SAMPLES + UPDATED AND COMPREHENSIVE HEALTH RECORD Currently14,000 Phenotyped DNA Samples from patients and control individuals. Prospective Healthcare: Patient health & sample record real linking research & genomic time automatic update and information to health record comprehensive data mining system databases SIDCA Bio e-Bank Andalusian DNA Bank
  • 42. Direct link to the health care system MPG roadmap is based on the availability of >14.000 well- characterized samples with a permanent updated sample information & PHR that will be used as the first steps of the implementation of personalized medicine in the Andalusian HCS •Cancer • Unknown genes •Congenital anomalies • Known genes discarded (heart, gut, CNS,…) • Responsible genes •Mental retardation known but unknown •MCA/MR syndromes modifier genes >14.000 •Diabetes •Susceptibility Genes phenotyped •Neurodegenerative with samples diseases •… •Stroke (familiar) •Endometriosis •Control Individuals
  • 43. Two technologies to scan for variations 454 Roche Structural Longer reads Lower coverage variation •Amplifications •Deletions •CNV •Inversions •Translocations SOLiD ABI Shorter reads Variants Higher coverage •SNPs •Mutations •indels
  • 44. Analysis Pipeline Raw Data Quality Control QC FastQC & in house software BWA, Bowtie, BFAST Reads Mapping & QC Bioscope QC in house software Variant Calling GATK, SAMTOOLS, FreeBayes Variant Filtering dbSNP, 1000 Genomes Variant annotation Annovar & in house software Disease/Control comparison Family & healthy controls
  • 45. Approach to discovery rare or novel variants 130.00 99.000 0 55.000 18.000.000
  • 46. Timeline of the MGP Now 2011 2012 … • First 50 Yet unpredictable • Several hundreds of genomes number of genomes • First 2-3 genomes and • 100 diseases (approx) diseases diseases Announced Expected changes in changes in the the technology technology
  • 47.
  • 48. Nimblegen capture arrays ● SeqCap EZ Human Exome Library v1.0 / v2.0 ● Gene and exon annotations (v2.0): ● RefSeq (Jan 2010), CCDS (Sept 2009), miRBase (v.14, Sept 2009). ● Total of ~30,000 coding genes (theoretically) ● ~300,000 exons; ● 36.5 Mb are targeted by the design. ● 2.1 million long oligo probes to cover the target regions. ● Because some flanking regions are also covered by probes, the total size of regions covered by probes is 44.1 Mb ● Real coverage: – Coding genes included: 18897 – miRNAs 720 – Coding genes not captured: 3865
  • 49. Sequencing at MGP By the end of June 2011 there are 72 exomes sequenced so far. 4 SOLiD can produce 20 exomes per week The facilities of the CASEGH can carry out the MGP and other collaborative projects at the same time
  • 50. Real coverage and some figures Sequencing Enrichment +Sequence run: ~2 weeks 400,000,000 sequences/flowcell 20,000,000,000 bases/flowcell Short 50bp sequences Exome Coverage SeqCap EZ Human Exome Library v1.0 / v2.0 Total of ~30,000 coding genes (theoretically) ~300,000 exons; 36.5 Mb are targeted by the design (2.1 million long oligo probes). Real coverage: Coding genes included: 18,897 miRNAs 720 Coding genes not captured: 3,865 Genes of the exome with coverage >10x: 85%
  • 51. Data Simulation & analysis ● Data simulated for 80K mutations and 60x coverage • BWA finds less false positives in simulated data
  • 52. Real Data & analysis ● Analiysis data is compared to genotyped data ● BFAST higher number of variants and 95% agreement
  • 53. And this is what we find in the variant calling pipeline Coverage > 50x Variants (SNV): 60.000 – 80.000 Variants (indels): 600-1000 100 known variants associated to disease risk
  • 54. From data to knowledge Some considerations Obvious: huge datasets need to be managed by computers Important: bioinformatics is necessary to properly analyze the data. Even more important and not so obvious: hypotheses must be tested from the perspective provided by the bioinformatics The science is generated where the data reside. Yesterday’s “one-bite” experiments fit into a laboratory notebook. Today, terabite data from genomic experiments reside in computers, the new scientist’s notebook
  • 55. And It gets more complicated Context and cooperation between genes cannot be ignored Controls Cases The cases of the multifactorial disease will have different mutations (or combinations). Many cases have to be used to obtain significant associations to many markers. The only common element is the pathway (yet unknow) affected.
  • 56. The Bioinformatics and Genomics Department at the Centro de Investigación Príncipe Felipe (CIPF), Valencia, Spain, and… Joaquín Dopazo Eva Alloza Roberto Alonso Alicia Amadoz Davide Baù Jose Carbonell Ana Conesa Alejandro de Maria ...the INB, National Institute of Bioinformatics (Functional Hernán Dopazo Genomics Node) and the CIBERER Network of Centers for Rare Pablo Escobar Diseases Fernando García Francisco García Luz García Stefan Goetz Carles Llacer Martina Marbà Marc Martí Ignacio Medina David Montaner Luis Pulido Rubén Sánchez Javier Santoyo Patricia Sebastian François Serra Sonia Tarazona Joaquín Tárraga Enrique Vidal Adriana Cucchi
  • 57. CAG Área de Genómica HOSPITAL UNIVERSITARIO VIRGEN DEL ROCÍO Dr. Rosario Fernández Godino Dr. Macarena Ruiz Ferrer Dr. Alicia Vela Boza Nerea Matamala Zamarro Dr. Slaven Erceg Prof. Guillermo Antiñolo Gil Dr. Sandra Pérez Buira Director de la UGC de Genética, Reproducción y Medicina María Sánchez León Fetal Director del Plan de Genética de Andalucía Javier Escalante Martín Ana Isabel López Pérez CABIMER Beatriz Fuente Bermúdez Director de CABIMER y del Departamento de Terapia Celular y Medicina Regenerativa Área Bioinformática Prof. Shom Shanker Bhattacharya, Daniel Navarro Gómez Pablo Arce García CENTRO DE INVESTIGACIÓN PRÍNCIPE FELIPE Juan Miguel Cruz Secretaría/Administración Responsable de la Unidad de Bioinformática y Genómica y Inmaculada Guillén Baena Director científico asociado para Bioinformática del Plan de Genética de Andalucía Dr. Joaquín Dopazo Blázquez