A Hybrid Approach to Assemble and
Annotate the Brassica rapa
Transcriptome in the Cloud through the
iPlant Collaborative and XSEDE
Upendra Kumar Devisetty
Postdoctoral Researcher
Maloof Lab, UC Davis
Research in Maloof Lab
B. rapa mapping population: R500 (oil seed cultivar) and IMB211 (rapid cycling cultivar)
• Reference transcriptome
• Genome annotation
The existing annotation relied mainly on in silico gene models and EST data
(Wang et al. 2011)
– In silico gene models (GENSCAN, GlimmerHMM, Fgenesh) have difficulty
accurately predicting:
• short exons
• very long exons
• non-translated exons
• genes that encode non-coding RNAs
– ESTs
• miss 20-40% of novel transcripts
• miss transcripts expressed only under highly specific tissue,
environmental, or treatment conditions
• are 3’ biased
• are short
Original
Why is accurate genome annotation needed?
• Accurate and comprehensive genome annotation (e.g. gene
models) is imperative for functional studies
• Useful for accurate estimation of mRNA abundance and detection of
eQTLs (expression QTLs) in mapping populations
Objectives
• To detect transcripts that are not present in the existing
genome reference of B. rapa (novel transcripts)
• To update the existing gene models of B. rapa genome
UK Devisetty et al. 2014 G3: Genes|Genomes|Genetics
Growth chamber, greenhouse, field
apical
meristem
R500
Library construction
• TruSeq RNA-Seq kit (Illumina)
• High throughput and easy to use
Sequencing
• 128 RNA-Seq libraries
• 17 lanes
• PE100 sequencing
• Illumina GAIIx
• 3,354 million raw paired-end reads
Quality control
o Atmosphere and iRODS
o 2,550 million quality-controlled paired-end reads (888 GB)
Servers
(iPlant Atmosphere)
XX-TB
Storage
(iPlant Data Store and EBS)
Users
Now everyone can share data without sharing resources!
B. rapa transcriptome assembly and genome reannotation pipeline
Transcriptome assembly
De novo vs. reference-based assemblies
Approach: de novo
  Advantages:
  - no reference needed
  - detection of non-collinear transcripts
  Disadvantages:
  - lowly expressed genes (poorly assembled)
  - misassemblies due to repeats
Approach: reference-based
  Advantages:
  - alignment tolerates sequencing errors
  - repeats detected through alignment
  Disadvantages:
  - reference is needed
  - assumes transcripts are collinear with the genome
Transcriptome assembly
• XSEDE, funded by NSF, is the most powerful integrated collection of
advanced digital resources and services in the world
• Scientists and engineers around the world use XSEDE
resources and services: supercomputers, collections of data,
help services
• Consists of supercomputers, high-end visualization, data
analysis, and storage resources around the country
xsede.org
Summary statistics for de novo assembly pipeline
Assembly type   Number of transcripts   Average transcript length (bp)   N50
Velvet-Oases    601,915                 1,553                            2,218
Trinity         158,863                 1,112                            1,863
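The columns in this table can be derived from a plain list of transcript lengths. The following is a minimal Python sketch, assuming the usual N50 definition (the length L such that transcripts of length L or longer contain at least half of all assembled bases). The function name and the toy lengths are illustrative and are not taken from the study.

```python
def assembly_stats(lengths):
    """Return (number of transcripts, mean length, N50) for lengths given in bp."""
    if not lengths:
        raise ValueError("no transcript lengths given")
    total = sum(lengths)
    running = 0
    n50 = 0
    # Walk from the longest transcript down until half of all bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            n50 = length
            break
    return len(lengths), total / len(lengths), n50


# Toy input (hypothetical lengths, not data from the study)
count, mean_len, n50 = assembly_stats([100, 200, 300, 400, 500])
print(count, mean_len, n50)
```

In practice the lengths would be read from the assembler's FASTA output. A high transcript count combined with a high N50, as in the Velvet-Oases row, does not by itself indicate a better assembly.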
Transcriptome assembly
TopHat-Cufflinks-Cuffcompare was run on Atmosphere
Summary statistics for Reference assembly pipeline
Do the assembly algorithms differ with respect to
detection of novel transcripts?
RT-PCR validations of assembled novel transcripts
Transcriptome annotation
The entire transcriptome annotation was run on Atmosphere
Challenges and problems during annotation
Pipeline (figure): merged final novel transcripts → Cap3 → TransDecoder → CDS to BAM → BAM to BED
Problem 1) Cap3-merged transcripts have multiple ORFs for the same contig
Solution: run blastx against the NCBI nr database and choose appropriate filters
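One way such filtering might work is sketched below in Python. It assumes blastx was run with tabular output (-outfmt 6, whose default columns put percent identity in column 3, e-value in column 11, and bit score in column 12) and that each query ID is an ORF ID. The thresholds, the function name, and the ORF-to-contig mapping are hypothetical. The slides do not state which filters were actually used.

```python
def best_orf_per_contig(blast_lines, orf_to_contig, max_evalue=1e-5, min_identity=30.0):
    """For each contig, keep the ORF whose best blastx hit passes the filters
    and has the highest bit score. Thresholds are illustrative assumptions."""
    best = {}  # contig -> (bitscore, orf_id)
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        orf_id = fields[0]
        identity = float(fields[2])
        evalue = float(fields[10])
        bitscore = float(fields[11])
        # Drop weak hits before comparing ORFs.
        if evalue > max_evalue or identity < min_identity:
            continue
        contig = orf_to_contig[orf_id]
        if contig not in best or bitscore > best[contig][0]:
            best[contig] = (bitscore, orf_id)
    return {contig: orf_id for contig, (_score, orf_id) in best.items()}
```

Contigs whose ORFs have no hit passing the filters are simply absent from the result, so they can be reviewed separately.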
Problem 2) Overlapping transcripts in the BED file
Solution: use bedops merge and then select the longest transcript
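The merge-and-select idea for overlapping transcripts can be illustrated in plain Python. This is a sketch of the logic only, not the BEDOPS tool itself. It assumes BED-style 0-based, half-open coordinates, ignores strand, and uses a hypothetical tuple layout of (chrom, start, end, name).

```python
def longest_per_overlap_cluster(records):
    """Group overlapping intervals per chromosome and keep the longest of each group."""
    kept = []
    cluster_chrom = None
    cluster_end = None
    best = None
    for rec in sorted(records, key=lambda r: (r[0], r[1], r[2])):
        chrom, start, end, _name = rec
        if best is not None and chrom == cluster_chrom and start < cluster_end:
            # Overlaps the current cluster: extend it and keep the longer record.
            cluster_end = max(cluster_end, end)
            if end - start > best[2] - best[1]:
                best = rec
        else:
            # A new cluster starts: emit the winner of the previous one.
            if best is not None:
                kept.append(best)
            cluster_chrom, cluster_end, best = chrom, end, rec
    if best is not None:
        kept.append(best)
    return kept
```

Intervals that only touch (one ends where the next starts) are treated as separate clusters here. Whether that matches the desired behavior depends on how the merge step is configured.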
Problem 3) Very long transcripts due to misassembly
Solution: filter the transcripts
(Figure: transcripts after filtering)
Problem 4) No lines connecting the exons in the BED file
Solution: use a custom script to join the lines; check the UCSC BED file guidelines
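In the UCSC BED format, a genome browser draws exons as connected blocks when a transcript is written as a single 12-column line with blockCount, blockSizes, and blockStarts (offsets relative to chromStart). The custom script from the talk is not shown, so the Python below is only a sketch of how per-exon records might be joined. The input tuple layout, the placeholder score and color, and setting thickStart/thickEnd to the transcript bounds are all assumptions.

```python
from collections import defaultdict


def exons_to_bed12(exons):
    """exons: iterable of (chrom, start, end, name, strand), one tuple per exon.
    Returns one tab-separated BED12 line per transcript name.
    Assumes exons of the same transcript do not overlap."""
    grouped = defaultdict(list)
    for chrom, start, end, name, strand in exons:
        grouped[(chrom, name, strand)].append((start, end))
    lines = []
    for (chrom, name, strand), blocks in sorted(grouped.items()):
        blocks.sort()
        tx_start = blocks[0][0]
        tx_end = max(end for _start, end in blocks)
        sizes = ",".join(str(end - start) for start, end in blocks)
        # Block starts are offsets from the transcript start, so the first is 0.
        starts = ",".join(str(start - tx_start) for start, _end in blocks)
        fields = [chrom, tx_start, tx_end, name, 0, strand,
                  tx_start, tx_end, "0,0,0", len(blocks), sizes, starts]
        lines.append("\t".join(str(f) for f in fields))
    return lines
```

For example, two exons of a hypothetical transcript tx1 at 100-200 and 300-450 become one line with blockCount 2, blockSizes 100,150, and blockStarts 0,200.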
Results for detection of novel transcripts
Number of novel transcripts detected - 3,537 (v1.2) and 2,732 (v1.5)
(Figure panels: Original vs. Novel)
Program to Assemble Spliced Alignments (PASA)
o Genome annotation pipeline from TIGR, used widely elsewhere
o Uses EST spliced alignments to model genes
o Gene structure consistent with experimental data
o Identifies alternate splicing variations
o Helps to correct gene structure
PASA was installed and run on Atmosphere
Results for updating gene models
Number of gene models updated: 28,139 (v1.2) and 28,112 (v1.5)
(Figure panels: Original vs. Novel gene models for Bra000108 and Bra022192)
Genome Browser
(http://tinyurl.com/BrapaGenome)
Conclusions
• Deep RNA-Seq provides enough coverage to detect a large number
of unknown transcripts and to improve genome annotation
• Neither de novo nor reference-based assembly alone is
the best choice; a hybrid assembly can offer more
accurate assembly and annotation
• Problems during genome re-annotation need to be
addressed before a fully annotated genome is obtained
• iPlant Collaborative and XSEDE provide the systems and
people to facilitate transcriptome assembly and genome
reannotation
ACKNOWLEDGEMENTS
• Julin Maloof
• Mike Covington
• Cody Markelz
• An Tat
• Kazu Nozue
• Saradadevi Lekkala
• Maloof lab
• Harmer lab
• Cynthia Weinig
• Marc T. Brock
• Matthew Rubin
• Brian Haas
• Andy Edmonds
• Edwin Skidmore
• Sangeeta Kuchimanchi
• Matt Vaughan


Editor's notes

  1. Why am I using an integrated approach? This is why….. Second point biased