Statistics for Next Generation Sequencing (RNA-Seq)
Distribution?
• 25,000 genes, each with counts over several samples
   – 2 conditions, each with several replicates

• Recall: log-Normal for microarrays
   – Based on fits to actual data with many replicates

• No equivalent data for RNA-Seq
   – So go back to first principles
RNA-Seq Setting
RNA-Seq Counts Distribution
Hypergeometric Distribution
Simplifying the Hypergeometric Distribution
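A minimal sketch of the first-principles argument these slide titles point to, under assumptions not recoverable from the source: sequencing is treated as drawing reads without replacement from a finite pool of fragments, so the count for one gene is hypergeometric. When the pool is large and the gene is a small fraction of it, this is close to a Poisson with the same mean. The pool size, gene size, and read depth below are hypothetical numbers chosen only for illustration.

```python
import math

def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k), via log-gamma."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def hypergeom_pmf(k, pool, gene, reads):
    """P(exactly k of `reads` fragments drawn without replacement come from the gene)."""
    if k < 0 or k > min(gene, reads):
        return 0.0
    return math.exp(log_comb(gene, k) + log_comb(pool - gene, reads - k)
                    - log_comb(pool, reads))

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

# Hypothetical numbers, chosen only for illustration.
POOL = 100_000_000   # fragments in the library
GENE = 5_000         # fragments that come from one gene
READS = 100_000      # fragments actually sequenced

lam = READS * GENE / POOL   # expected count for this gene
max_gap = max(abs(hypergeom_pmf(k, POOL, GENE, READS) - poisson_pmf(k, lam))
              for k in range(31))

print(f"lambda = {lam}")
print(f"largest pmf difference for k = 0..30: {max_gap:.2e}")
```

The log-gamma form avoids building the very large binomial coefficients explicitly.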
The Poisson Distribution
• λ is both the mean and the variance
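For reference, the Poisson probability mass function and its first two moments, writing $Y$ for the read count (a symbol introduced here, not taken from the slide):

```latex
\[
P(Y = k) = \frac{\lambda^{k} e^{-\lambda}}{k!}, \quad k = 0, 1, 2, \dots
\qquad
\mathbb{E}[Y] = \operatorname{Var}(Y) = \lambda
\]
```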
The Poisson Distribution: Examples (Wikipedia)
•   The number of soldiers killed by horse-kicks each year in each corps in
    the Prussian cavalry. This example was made famous by a book of Ladislaus
    Josephovich Bortkiewicz (1868–1931).
•   The number of yeast cells used when brewing Guinness beer. This example was
    made famous by William Sealy Gosset (1876–1937).[19]
•   The number of phone calls arriving at a call centre per minute.
•   The number of goals in sports involving two competing teams.
•   The number of deaths per year in a given age group.
•   The number of jumps in a stock price in a given time interval.
•   Under an assumption of homogeneity, the number of times a web server is
    accessed per minute.
•   The number of mutations in a given stretch of DNA after a certain amount of
    radiation.
•   The proportion of cells that will be infected at a given multiplicity of infection.
Is Mean = Variance for NGS?
– Variance ∝ Mean²

[Figure: variance vs. mean on a log scale; the white line is the Poisson line]
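The check behind this slide can be sketched as follows: compute each gene's mean and variance across replicates, then look at the slope of log(variance) against log(mean). A slope near 1 matches the Poisson line; a slope near 2 matches Variance ∝ Mean². The counts and gene names below are made up for illustration and the function names are hypothetical.

```python
import math
import statistics

def mean_and_variance(counts_by_gene):
    """Return (mean, sample variance) of the replicate counts for each gene."""
    return {gene: (statistics.mean(c), statistics.variance(c))
            for gene, c in counts_by_gene.items()}

def loglog_slope(points):
    """Least-squares slope of log(variance) against log(mean)."""
    xs = [math.log(m) for m, v in points]
    ys = [math.log(v) for m, v in points]
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den

# Made-up replicate counts for four genes in one condition (illustration only).
toy_counts = {
    "geneA": [8, 12, 10, 14, 6],
    "geneB": [80, 130, 95, 60, 135],
    "geneC": [900, 1300, 700, 1150, 950],
    "geneD": [7000, 12500, 9000, 11000, 10500],
}

per_gene = mean_and_variance(toy_counts)
for gene, (m, v) in per_gene.items():
    print(f"{gene}: mean={m:.1f} variance={v:.1f} variance/mean={v / m:.1f}")

slope = loglog_slope(list(per_gene.values()))
print(f"log-log slope: {slope:.2f} (1 = Poisson line, 2 = variance ~ mean^2)")
```

In this toy data the variance-to-mean ratio grows with the mean, which is the pattern the slide describes for real NGS counts.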
Why this Over-Dispersion?

• The Poisson model captures only technical variation,
  not biological variation

• Biological variation induces more variance than the
  Poisson model captures
   – No reason to expect a difference from microarrays, where SD ∝ Mean
     (or Variance ∝ Mean²)

[Figure: SD vs. Mean for Microarrays]
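One way to make this argument concrete, as a sketch under assumptions not stated on the slide: let the Poisson rate itself vary between biological replicates. Write $\Lambda$ for that random rate, with mean $\mu$ and coefficient of variation $c$, and $Y$ for the count, with $Y \mid \Lambda \sim \text{Poisson}(\Lambda)$ (all symbols are introduced here for illustration). The law of total variance gives:

```latex
\[
\operatorname{Var}(Y)
  = \mathbb{E}\big[\operatorname{Var}(Y \mid \Lambda)\big]
  + \operatorname{Var}\big(\mathbb{E}[Y \mid \Lambda]\big)
  = \mathbb{E}[\Lambda] + \operatorname{Var}(\Lambda)
  = \mu + c^{2}\mu^{2}
\]
```

The first term is the technical (Poisson) part and the second is the biological part. For large $\mu$ the second term dominates, so SD ≈ $c\mu$, which is the Variance ∝ Mean² behaviour.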
Handling Over-Dispersion
What Distribution is X?

• Log-Normal for Arrays?

• The combination of log-Normal and Poisson
  doesn’t have a neat closed form (i.e., formula)

• So assume a Gamma distribution
   – Poisson + Gamma → Negative Binomial
   – Used traditionally to fix the problem of over-dispersion
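The Poisson + Gamma construction can be checked by simulation. This is a sketch assuming that X is the gene's underlying rate, which varies between replicates and follows a Gamma distribution, and that the observed count is Poisson given that rate. The values of MU and ALPHA, and all function names, are illustrative choices rather than anything taken from the slides.

```python
import math
import random
import statistics

def poisson_draw(lam, rng):
    """One Poisson(lam) draw by Knuth's method, chunked so exp(-lam) cannot underflow."""
    total = 0
    while lam > 0:
        step = min(lam, 200.0)
        limit = math.exp(-step)
        k, prod = 0, rng.random()
        while prod > limit:
            k += 1
            prod *= rng.random()
        total += k
        lam -= step
    return total

def gamma_poisson_draw(mu, alpha, rng):
    """Draw a rate from Gamma(shape=1/alpha, scale=alpha*mu), then a Poisson count."""
    rate = rng.gammavariate(1.0 / alpha, alpha * mu)
    return poisson_draw(rate, rng)

def neg_binomial_pmf(k, mu, alpha):
    """Negative Binomial pmf with mean mu and variance mu + alpha*mu**2."""
    r = 1.0 / alpha
    log_p = (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
             + r * math.log(r / (r + mu)) + k * math.log(mu / (r + mu)))
    return math.exp(log_p)

MU, ALPHA, N = 20.0, 0.25, 20_000   # illustrative values, not from the slides
rng = random.Random(0)
draws = [gamma_poisson_draw(MU, ALPHA, rng) for _ in range(N)]

sample_mean = statistics.fmean(draws)
sample_var = statistics.variance(draws)
print(f"mean: simulated {sample_mean:.2f}, expected {MU:.2f}")
print(f"variance: simulated {sample_var:.2f}, expected {MU + ALPHA * MU**2:.2f}")

for k in (0, 5, 20, 60):
    empirical = draws.count(k) / N
    print(f"P(count={k}): simulated {empirical:.4f}, "
          f"NB formula {neg_binomial_pmf(k, MU, ALPHA):.4f}")
```

A pure Poisson with the same mean would have variance 20; the mixture's expected variance is 120, so the extra 100 comes entirely from the Gamma-distributed rate.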
The Gamma Distribution
[Figure annotation: Control on Right Tail]
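For reference, the Gamma density in the shape-scale parameterisation (the symbols $k$ and $\theta$ are introduced here and do not appear on the slide):

```latex
\[
f(x; k, \theta) = \frac{x^{k-1} e^{-x/\theta}}{\Gamma(k)\,\theta^{k}}, \quad x > 0,
\qquad
\mathbb{E}[X] = k\theta, \quad \operatorname{Var}(X) = k\theta^{2}
\]
```

With the mean held fixed, a smaller shape $k$ gives a larger coefficient of variation ($1/\sqrt{k}$) and a heavier right tail, which is presumably the control the slide annotation refers to.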
The Negative Binomial Distribution
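The Negative Binomial arises by integrating a Poisson count over a Gamma-distributed rate. The form below uses mean $\mu$ and dispersion $\alpha$, with Gamma shape $1/\alpha$ and scale $\alpha\mu$; this parameterisation is chosen here for illustration and may differ from the one on the original slide.

```latex
\[
P(Y = y)
  = \int_{0}^{\infty} \frac{x^{y} e^{-x}}{y!}\,
      \frac{x^{1/\alpha - 1} e^{-x/(\alpha\mu)}}{\Gamma(1/\alpha)\,(\alpha\mu)^{1/\alpha}}\,dx
  = \frac{\Gamma(y + 1/\alpha)}{\Gamma(1/\alpha)\, y!}
    \left(\frac{1}{1 + \alpha\mu}\right)^{1/\alpha}
    \left(\frac{\alpha\mu}{1 + \alpha\mu}\right)^{y}
\]
\[
\mathbb{E}[Y] = \mu, \qquad \operatorname{Var}(Y) = \mu + \alpha\mu^{2}
\]
```

As $\alpha \to 0$ the variance falls back to $\mu$, the Poisson case.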
Estimating Parameters
• For each gene, estimate the mean across replicates, and then estimate the
  variance from the curve fit above
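The two-step recipe can be sketched as follows. The curve on the original slide is not recoverable, so this assumes the Negative Binomial form variance = mean + α·mean² and pools a single α across genes by averaging simple per-gene moment estimates. The counts, gene names, and function names are hypothetical, and production tools fit the mean-variance trend in more elaborate ways.

```python
import statistics

def estimate_dispersion(counts_by_gene):
    """Pooled dispersion: average of the per-gene estimates (var - mean) / mean**2."""
    estimates = []
    for counts in counts_by_gene.values():
        m = statistics.fmean(counts)
        v = statistics.variance(counts)
        if m > 0:
            # Clip at zero: a gene that looks under-dispersed contributes no extra variance.
            estimates.append(max(0.0, (v - m) / m**2))
    return statistics.fmean(estimates)

def fitted_variance(mean, alpha):
    """Variance read off the assumed curve: variance = mean + alpha * mean**2."""
    return mean + alpha * mean**2

# Made-up replicate counts for four genes in one condition (illustration only).
toy_counts = {
    "geneA": [8, 12, 10, 14, 6],
    "geneB": [80, 130, 95, 60, 135],
    "geneC": [900, 1300, 700, 1150, 950],
    "geneD": [7000, 12500, 9000, 11000, 10500],
}

alpha_hat = estimate_dispersion(toy_counts)
fitted = {gene: fitted_variance(statistics.fmean(c), alpha_hat)
          for gene, c in toy_counts.items()}

print(f"pooled dispersion estimate: {alpha_hat:.4f}")
for gene, counts in toy_counts.items():
    print(f"{gene}: sample variance={statistics.variance(counts):.1f} "
          f"fitted variance={fitted[gene]:.1f}")
```

Reading the variance off a pooled curve, rather than using each gene's own sample variance, borrows information across genes; with only a few replicates per gene the raw sample variance is very noisy.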
Bias Correction
Thank You

Introduction to statistics iii
