Statistics for Microarray Data
Background

[Figure: black box with parameters μ, σ²]




• Few observations made by a black box

• What is the distribution behind the black box?

• E.g., with what probability will it output a number
  bigger than 5?
Approach

• Easy to determine with many observations

• With few observations:

• Assume a canonical distribution based on prior
  knowledge

• Determine parameters of this distribution using
  the observations, e.g., mean, variance
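As a minimal sketch of this approach, assume the black box is Normal, estimate its mean and variance from a handful of observations, and read off the probability of an output bigger than 5. The observations below are made up for illustration and do not come from the slides; the choice of a Normal distribution is likewise an assumption.

```python
from statistics import NormalDist, mean, variance

# Hypothetical observations from the black box (made up for illustration).
observations = [3.1, 4.7, 5.2, 3.9, 4.4]

mu_hat = mean(observations)        # sample mean estimates mu
var_hat = variance(observations)   # sample variance (n - 1 denominator) estimates sigma^2

# Assumed canonical distribution: Normal with the estimated parameters.
fitted = NormalDist(mu=mu_hat, sigma=var_hat ** 0.5)

# Probability that the black box outputs a number bigger than 5.
p_greater_than_5 = 1.0 - fitted.cdf(5.0)
print(f"mean={mu_hat:.3f} variance={var_hat:.3f} P(X>5)={p_greater_than_5:.3f}")
```

With so few observations the estimated parameters are themselves uncertain, which is what the following slides on estimating the mean and variance address.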
Estimating the mean
Estimating the variance σ²

• The sample variance follows a (scaled) Chi-Square distribution if the original distribution was Normal
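The behaviour of both estimates can be checked with a small simulation. The sketch below assumes a Normal black box with made-up parameters (mu = 10, sigma = 2, n = 5); it relies on the standard results that the sample mean has variance σ²/n and that (n − 1)·s²/σ² follows a Chi-Square distribution with n − 1 degrees of freedom, whose mean is n − 1 and whose variance is 2(n − 1).

```python
import random
from statistics import mean, variance

random.seed(0)
mu, sigma, n = 10.0, 2.0, 5      # assumed true parameters and sample size
trials = 20000

sample_means = []
scaled_variances = []
for _ in range(trials):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    sample_means.append(mean(sample))
    # (n - 1) * s^2 / sigma^2 is Chi-Square with n - 1 degrees of freedom.
    scaled_variances.append((n - 1) * variance(sample) / sigma ** 2)

avg_scaled = mean(scaled_variances)      # Chi-Square mean: n - 1
var_scaled = variance(scaled_variances)  # Chi-Square variance: 2 * (n - 1)
var_of_mean = variance(sample_means)     # variance of the sample mean: sigma^2 / n

print(f"mean of scaled variance: {avg_scaled:.2f} (theory {n - 1})")
print(f"variance of scaled variance: {var_scaled:.2f} (theory {2 * (n - 1)})")
print(f"variance of sample mean: {var_of_mean:.3f} (theory {sigma ** 2 / n:.3f})")
```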
Microarray Data
• Many genes (about 25,000)

• 2 conditions (or more), many replicates within
  each condition

• Which genes are differentially expressed
  between the two conditions?
More Specifically
• For a particular gene
  – Each condition is a black box
  – Say 3 observations from each black box


• Do both black boxes have the same
  distribution?
  – Assume same canonical distribution
  – Do both have the same parameters?
Which Canonical Distribution
• Use data with many replicates

• 418.0294, 295.8019, 272.1220, 315.2978, 294.2242,
  379.8320, 392.1817, 450.4758, 335.8242, 265.2478,
  196.6982, 289.6532, 274.4035, 246.6807, 254.8710,
  165.9416, 281.9463, 246.6434, 259.0019, 242.1968


• Which distribution?
What is a QQ Plot
Distribution of log raw intensities
 across genes on a single array
The QQ plot of log scale intensities
(i.e., actual vs simulated from normal)
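A QQ plot pairs the sorted observations with the quantiles that a reference distribution would produce at the same ranks; points close to a straight line suggest the reference distribution fits. The sketch below computes those pairs for the 20 replicate intensities listed earlier, on the log scale, against a Normal with the same mean and SD. The base-2 logarithm and the plotting positions (i + 0.5) / n are assumptions (the slides do not state either), and theoretical quantiles are used here rather than simulated draws.

```python
import math
from statistics import NormalDist, mean, stdev

# The 20 replicate intensities from the "Which Canonical Distribution" slide.
intensities = [
    418.0294, 295.8019, 272.1220, 315.2978, 294.2242,
    379.8320, 392.1817, 450.4758, 335.8242, 265.2478,
    196.6982, 289.6532, 274.4035, 246.6807, 254.8710,
    165.9416, 281.9463, 246.6434, 259.0019, 242.1968,
]

# Work on the log scale (base 2 is an assumption) and sort the observations.
log_values = sorted(math.log2(x) for x in intensities)
n = len(log_values)

# Reference Normal with the same mean and SD as the log-scale data.
fitted = NormalDist(mean(log_values), stdev(log_values))

# Theoretical quantile for each rank, using plotting positions (i + 0.5) / n.
theoretical = [fitted.inv_cdf((i + 0.5) / n) for i in range(n)]

# Each pair is one point of the QQ plot: (theoretical quantile, observed value).
qq_points = list(zip(theoretical, log_values))
for t, obs in qq_points:
    print(f"{t:8.3f} {obs:8.3f}")
```

Plotting the first column against the second gives the QQ plot; systematic curvature away from the diagonal would argue against the log-normal assumption.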
QQ Plot against a Normal Distribution
• 10 + 10 replicates in
  two groups

• Single group QQ plot

• Combined 2 groups QQ
  plot

• Combined log-scale QQ
  plot
• Shapiro-Wilk Test
Which Canonical Distribution



• Assume a log-normal distribution
Benford’s Law
• Frequency distribution of first significant digit




    Pr(first significant digit = d) = log10(d+1) − log10(d), for d = 1, ..., 9
    This holds when the fractional part of log10(x) is uniformly distributed in [0,1)
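The formula can be tabulated directly. This short sketch prints the Benford probability of each leading digit and confirms that the nine probabilities sum to 1; the variable name `benford` is just an illustrative choice.

```python
import math

# Benford's law: probability that the first significant digit equals d.
benford = {d: math.log10(d + 1) - math.log10(d) for d in range(1, 10)}

for d, p in benford.items():
    print(d, round(p, 3))

# The terms telescope to log10(10) - log10(1) = 1.
total = sum(benford.values())
print("total:", round(total, 6))
```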
Differential Expression

[Figure: two black boxes, Group 1 with parameters μ1, σ1² and Group 2 with parameters μ2, σ2²]

• Is μ1 = μ2?
• Is σ1 = σ2? Is variance a function of mean?
[Figure: SD vs Mean across 3 replicates, plotted for all genes]

• SD increases linearly with Mean
[Figure: SD vs Mean across 3 replicates, computed for all genes after log-transformation]

• SD is flat now, except for very low values
• Another reason to work on the log scale
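One way to see why the log scale flattens the SD is to simulate multiplicative noise. The sketch below is not the slide data: it assumes 2000 hypothetical genes with true levels between 100 and 10000, 3 replicates each, and a lognormal noise factor with SD 0.1 on the natural-log scale. It then compares the average SD of the higher-expressed half of the genes with that of the lower-expressed half, on both scales.

```python
import math
import random
from statistics import mean, stdev

random.seed(1)
n_genes, n_reps, noise = 2000, 3, 0.1   # all values assumed for illustration

raw_stats, log_stats = [], []
for _ in range(n_genes):
    # True expression level, spread log-uniformly between 100 and 10000.
    level = 10 ** random.uniform(2, 4)
    # Multiplicative noise: each replicate is the level times a lognormal factor.
    reps = [level * math.exp(random.gauss(0, noise)) for _ in range(n_reps)]
    logs = [math.log2(r) for r in reps]
    raw_stats.append((mean(reps), stdev(reps)))
    log_stats.append((mean(logs), stdev(logs)))

def sd_ratio_high_vs_low(stats):
    """Average SD of the higher-mean half of genes divided by that of the lower half."""
    ordered = sorted(stats)              # sorts by mean, the first tuple element
    half = len(ordered) // 2
    low = mean(sd for _, sd in ordered[:half])
    high = mean(sd for _, sd in ordered[half:])
    return high / low

raw_ratio = sd_ratio_high_vs_low(raw_stats)
log_ratio = sd_ratio_high_vs_low(log_stats)
print(f"raw scale: {raw_ratio:.2f}  log scale: {log_ratio:.2f}")
```

Under these assumptions the raw-scale ratio is far above 1 (SD grows with the mean), while the log-scale ratio stays close to 1 (SD is roughly flat). The simulation does not reproduce the departure at very low values noted on the slide.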
Differential Expression

[Figure: two black boxes, Group 1 with parameters μ1, σ1² and Group 2 with parameters μ2, σ2²]

• Is μ1 = μ2?
• Is σ1 = σ2? Sort-of YES
The T-Statistic
• Flattened Normal, or T-Distribution
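The formulas on the T-Statistic slides did not survive extraction. As a hedged sketch, the standard two-sample t-statistic with a pooled variance (which matches the "σ1 = σ2? Sort-of YES" assumption) can be computed as below. The function name `pooled_t_statistic` and the log-scale observations are hypothetical, and this may differ in detail from the form shown on the slides.

```python
import math
from statistics import mean, variance

def pooled_t_statistic(group1, group2):
    """Two-sample t-statistic assuming both groups share the same variance."""
    n1, n2 = len(group1), len(group2)
    # Pooled variance: per-group sample variances weighted by degrees of freedom.
    pooled_var = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / (n1 + n2 - 2)
    # Standard error of the difference between the two group means.
    standard_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = (mean(group1) - mean(group2)) / standard_error
    return t, n1 + n2 - 2   # statistic and its degrees of freedom

# Hypothetical log-scale observations for one gene, 3 replicates per condition.
t_stat, dof = pooled_t_statistic([7.0, 8.0, 9.0], [10.0, 11.0, 12.0])
print(f"t = {t_stat:.3f} with {dof} degrees of freedom")
```

With 3 replicates per group the statistic has only 4 degrees of freedom, so its reference distribution has much heavier tails than the Normal, which is the "flattened" shape the slide refers to.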
A Problem
[Figure: SD vs Mean across 3 replicates, computed for all genes after log-transformation, with a fitted curve and the annotations below]

• The curve fit here may be a better estimate
• Lots of false positives can be avoided here
• Not much difference here
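The problem can be illustrated with a simulation in which no gene is truly different. With only 3 replicates, a gene's SD is sometimes very small by chance, which inflates its t-statistic. The sketch below is an assumption-laden stand-in for the curve fit: because the simulated true SD is the same for every gene, the "curve" reduces to a single shared variance (the average over all genes). All parameter values are made up; the cutoff 4.604 is the two-sided 1% critical value of the t-distribution with 4 degrees of freedom.

```python
import math
import random
from statistics import mean, variance

random.seed(2)
n_genes, n_reps, sigma = 5000, 3, 0.2   # assumed; every gene is null (no true difference)
threshold = 4.604                        # two-sided 1% cutoff, t-distribution with 4 d.o.f.

diffs, gene_vars = [], []
for _ in range(n_genes):
    g1 = [random.gauss(8.0, sigma) for _ in range(n_reps)]
    g2 = [random.gauss(8.0, sigma) for _ in range(n_reps)]
    diffs.append(mean(g1) - mean(g2))
    # Per-gene pooled variance (equal group sizes, so a plain average).
    gene_vars.append((variance(g1) + variance(g2)) / 2)

# Stand-in for the fitted curve: one shared variance estimated from all genes.
shared_var = mean(gene_vars)
scale = math.sqrt(2 / n_reps)            # sqrt(1/n1 + 1/n2) with n1 = n2 = n_reps

# Count genes called "different" at the same cutoff under each variance estimate.
per_gene_hits = sum(
    abs(d) / (math.sqrt(v) * scale) > threshold for d, v in zip(diffs, gene_vars)
)
shared_hits = sum(
    abs(d) / (math.sqrt(shared_var) * scale) > threshold for d in diffs
)
print(f"false positives with per-gene SD: {per_gene_hits}")
print(f"false positives with shared SD:   {shared_hits}")
```

At the same cutoff, the per-gene SD produces roughly 1% false positives while the shared estimate produces almost none. Real data would need a curve fitted against the mean, and genes may genuinely differ in variance, so this overstates how cleanly the idea works in practice.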
Thank You

Introduction to statistics ii
