JHU Job Talk

Jeff Leek's JHU Job Talk from 2009 on surrogate variable analysis.


  1. A General Framework for Multiple Testing Dependence. Jeffrey Leek, Johns Hopkins University School of Medicine.
  2. Big Picture Ideas. High-dimensional multiple hypothesis testing is common. Problem: dependence between tests can result in incorrect statistical and scientific results. A solution: define and address multiple testing dependence at the level of the data, not the P-values.
  3. High-Dimensional Multiple Testing Is Common: spatial epidemiology, brain imaging, molecular biology.
  4. Inflammation and the Host Response to Injury. mRNA expression: ~50,000 genes. Clinical data: >150 clinical variables. Patients 1, 2, …, 166. MOF measures severity of injury.
  5. Data at Initial Time Point. [Figure: Multiple Organ Failure.]
  6. Simple Analysis
     1. Fit the model to the data, $x_i$, for gene i: $x_i = a_i + b_i\,\mathrm{MOF} + e_i$
     2. Calculate P-values for testing the hypotheses: $H_0: b_i = 0$ vs. $H_1: b_i \neq 0$
     3. …
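The per-gene fit in the Simple Analysis slide can be sketched as follows. This is a minimal illustration, assuming ordinary least squares with an intercept; the function name `per_gene_t_statistics` and the toy data are made up here and are not from the talk. P-values would then follow from a t distribution with n - 2 degrees of freedom.

```python
import numpy as np

def per_gene_t_statistics(X, mof):
    """Fit x_i = a_i + b_i * MOF + e_i for every row of X by least squares.

    X   : (m, n) array, one row per gene, one column per patient.
    mof : (n,) array with the MOF score of each patient.
    Returns the slope estimates b_hat (m,) and their t-statistics (m,).
    """
    m, n = X.shape
    design = np.column_stack([np.ones(n), mof])           # (n, 2): intercept, MOF
    coef, *_ = np.linalg.lstsq(design, X.T, rcond=None)   # (2, m) coefficients
    resid = X.T - design @ coef                           # (n, m) residuals
    sigma2 = (resid ** 2).sum(axis=0) / (n - 2)           # residual variance per gene
    slope_var = sigma2 * np.linalg.inv(design.T @ design)[1, 1]
    b_hat = coef[1]
    return b_hat, b_hat / np.sqrt(slope_var)

# Toy data (illustrative only): 200 genes, 30 patients, first 10 genes respond to MOF.
rng = np.random.default_rng(0)
mof = rng.normal(size=30)
X = rng.normal(size=(200, 30))
X[:10] += 3.0 * mof
b_hat, t_stat = per_gene_t_statistics(X, mof)
```

Each gene is tested on its own here, which is exactly the setting in which dependence between genes goes unmodeled.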
  7. Four “Replicated” Studies. [Figure: P-value histograms (Frequency vs. P-value) for Phases 1, 2, 3, and 4.]
  8. Start With The Whole Data (m hypothesis tests, n observations per test)
     •  Data for test i: $x_i = (x_{i1}, x_{i2}, \ldots, x_{in})$
     •  “Primary variable(s)”: $Y = (y_1, y_2, \ldots, y_n)$
     •  Model: $x_{ij} = a_i + \sum_{k=1}^{d} b_{ik} s_k(y_j) + e_{ij}$
     •  Hypothesis test i: $H_{0i}: b_i \in \Omega_0$ vs. $H_{1i}: b_i \in \Omega_1$
  9. Underlying Model: $X = B\,S(Y) + E$, where the rows of X are tests and the columns are observations.
  10. A Simple Simulated Example. [Figure panels: Independent E, Dependent E; axes: Genes, Arrays.]
  11. Null P-Value Distributions. [Figure: Frequency vs. P-value histograms for Independent E and Dependent E.]
  12. Null P-Value Distributions. [Figure: the same histograms labeled by correlation, |ρ| = 0.40, 0.31, 0.10, 0.00, for Independent E and Dependent E.]
  13. Null Distribution Behavior. [Figure panels: Dependent E, Independent E.]
  14. False Discovery Rate Estimates. [Figure panels: Independent E, Dependent E.]
  15. Ranking Estimates. [Figure panels: Independent E, Dependent E.]
  16. When To Address Dependence? Analysis pipeline: Data X → Fit model $X = BS + E$ → Obtain $\hat{B}$ and R → Form test-statistics and null distribution → Calculate P-values → Form P-value threshold.
  17. When To Address Dependence? Existing approaches, on the same pipeline: empirical null approaches modify the null distribution at the test-statistic level; dependence adjustments conservatively modify the P-value threshold.
  18. Examples of Existing Approaches
     •  Empirical Null
        - Devlin and Roeder, Biometrics (1999)
        - Efron, JASA (2004)
        - Schwartzman, AOAS (2008)
     •  Error Rate Adjustments
        - Benjamini and Yekutieli, Annals of Statistics (2001)
        - Romano, Shaikh, and Wolf, Test (2001)
        - Dudoit, Gilbert, van der Laan, Biometrical Journal (2008)
  19. When To Address Dependence? Our approach, on the same pipeline: fit the model $X = BS + \Gamma G + U$, where G is a valid dependence kernel.
  20. When To Address Dependence? With our approach, dependence and bias are no longer present at any of these steps; standard methods can be used.
  21. New Dependence Definitions. Leek and Storey (2008)
     Definition: Data X are population-level multiple testing dependent if: …
     Definition: Data X are estimation-level multiple testing dependent if: …
  22. Structure in E. [Figure: Genes by Array, with MOF shown alongside; panels: Signal + Dependent Noise, Dependent Noise, Independent Noise.]
  23. Decomposing E: $X = BS + E$ (data = primary variables + random variation; rows are tests, columns are observations).
  24. Decomposing E: $X = BS + H + U$ (data = primary variables + dependent variation + independent variation).
  25. Decomposing E: $X = BS + \Gamma G + U$, with $H = \Gamma G$ (data = primary variables + dependence kernel + independent variation).
  26. Decomposing E. Leek and Storey (2008)
     Theorem: Let the data be distributed according to the model $X = BS + E$. Suppose that for each $e_i$ there is no Borel measurable function, g, such that $e_i = g(e_1, \ldots, e_{i-1}, e_{i+1}, \ldots, e_m)$ almost surely. Then there exist matrices $\Gamma$ (m×r), G (r×n) (r ≤ n) and U (m×n) such that $X = BS + \Gamma G + U$, where the rows of U are independent, $u_i \neq 0$, and $u_i = h_i(e_i)$ for a non-random Borel measurable function $h_i$.
  27. Dependence Kernel. Leek and Storey (2008)
     Definition (Dependence Kernel): An r×n matrix G forms a dependence kernel for the data X if the following equality holds: $X = BS + E = BS + \Gamma G + U$, where the rows of U are independent.
  28. Fitting S & G Results In Independent Tests. Leek and Storey (2008)
     Theorem: Let G be any valid dependence kernel for the data X. Suppose that the model $x_i = b_i S + \gamma_i G + u_i$ is fit by least squares, resulting in residuals $r_i$. If the row space jointly spanned by S and G has dimension less than n, then the $r_i$ and the $\hat{b}_i$ are jointly independent given S and G, and: …
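The joint least-squares fit named in the theorem on slide 28 can be sketched numerically. This is an illustration under assumed simulated data, not the talk's code; `fit_with_kernel` and `top_share` are hypothetical helpers. It compares residuals from fitting S alone with residuals from fitting S and G together, using the share of residual variance captured by the leading singular vector as a rough indicator of shared structure across tests.

```python
import numpy as np

def fit_with_kernel(X, S, G):
    """Least-squares fit of x_i = b_i S + gamma_i G + u_i for all rows at once.

    X : (m, n) data, S : (d, n) primary variables, G : (r, n) dependence kernel.
    Returns B_hat (m, d), Gamma_hat (m, r) and the residuals (m, n).
    """
    d = S.shape[0]
    Z = np.vstack([S, G])                              # (d + r, n) joint row space
    coef, *_ = np.linalg.lstsq(Z.T, X.T, rcond=None)   # (d + r, m)
    coef = coef.T
    return coef[:, :d], coef[:, d:], X - coef @ Z

def top_share(M):
    """Fraction of the total sum of squares carried by the first singular vector."""
    s = np.linalg.svd(M, compute_uv=False)
    return s[0] ** 2 / np.sum(s ** 2)

# Simulated data (illustrative values only).
rng = np.random.default_rng(1)
m, n = 500, 20
S = np.vstack([np.ones(n), np.repeat([0.0, 1.0], n // 2)])  # intercept + group indicator
G = rng.normal(size=(1, n))                                 # one unmodeled factor
X = rng.normal(size=(m, 2)) @ S + rng.normal(size=(m, 1)) @ G + rng.normal(size=(m, n))

B_hat, Gamma_hat, resid = fit_with_kernel(X, S, G)
_, _, resid_S_only = fit_with_kernel(X, S, np.zeros((0, n)))  # same fit without G

share_joint = top_share(resid)
share_S_only = top_share(resid_S_only)
```

In this simulation the residuals from the S-only fit still carry the factor G, so their leading singular vector takes a larger share of the variance than after the joint fit.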
  29. A “Blessing” of Dimensionality: $X = BS + \Gamma G + U$ (data = primary variables + dependence kernel + independent variation; rows are tests, columns are observations).
  30. Iteratively Reweighted Surrogate Variable Analysis
     Whole data: $X = BS + \Gamma G + U$. Test i data: $x_i = b_i S + \gamma_i G + u_i$.
     1. Estimate the row dimension, $\hat{r}$, of G.
     2. Form an initial estimate $\hat{G}_0$ equal to the first $\hat{r}$ right singular vectors of $R = X - \hat{B}S$.
     Iterate for b = 0, …, B:
     3. Estimate … .
     4. Weight the ith row of X by … and set $\hat{G}^{(b+1)}$ to be the first $\hat{r}$ right singular vectors of the weighted matrix.
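The two SVD steps of the algorithm above (the initial estimate in step 2 and the reweighted estimate in step 4) can be sketched as below. This is a minimal sketch, not the IRW-SVA implementation: the row weights are elided in the transcript, so the example substitutes hypothetical "oracle" weights (1 for rows with no primary-variable effect, 0 otherwise) purely for illustration. All function names and the simulated data are assumptions.

```python
import numpy as np

def first_right_singular_vectors(M, r):
    """Return the first r right singular vectors of M as an (r, n) array."""
    _, _, vt = np.linalg.svd(M, full_matrices=False)
    return vt[:r]

def initial_kernel_estimate(X, S, r):
    """Step 2: first r right singular vectors of R = X - B_hat S."""
    B_hat = X @ S.T @ np.linalg.inv(S @ S.T)   # row-wise least squares on S
    return first_right_singular_vectors(X - B_hat @ S, r)

def reweighted_kernel_estimate(X, weights, r):
    """Step 4: scale row i of X by weights[i], then take the first r right singular vectors."""
    return first_right_singular_vectors(weights[:, None] * X, r)

def alignment(G_hat, G_true):
    """Absolute cosine between a one-row estimate and a one-row true kernel."""
    return np.linalg.norm(G_hat @ G_true.T) / np.linalg.norm(G_true)

# Simulated data (illustrative only): one primary variable, one hidden factor.
rng = np.random.default_rng(2)
m, n = 1000, 20
S = np.repeat([-1.0, 1.0], n // 2)[None, :]   # (1, n) two-group contrast
G = np.tile([1.0, -1.0], n // 2)[None, :]     # (1, n) hidden factor, orthogonal to S
B = np.zeros((m, 1))
B[:100] = 2.0                                 # first 100 tests carry real signal
Gamma = rng.normal(size=(m, 1))
X = B @ S + Gamma @ G + rng.normal(size=(m, n))

G0 = initial_kernel_estimate(X, S, r=1)
oracle_weights = (B[:, 0] == 0).astype(float)  # stand-in for the estimated weights
G1 = reweighted_kernel_estimate(X, oracle_weights, r=1)
```

Singular vectors are only defined up to sign, which is why the comparison with the true G uses an absolute cosine.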
  31. An Example of the IRW-SVA Algorithm. [Figure panels: The Data, True G, Estimate of G, Pr(G & !S).]
  32. An Example of the IRW-SVA Algorithm. [Figure panels: The Data, True G, Estimate of G, Pr(G & !S).]
  33. An Example of the IRW-SVA Algorithm. [Figure panels: The Data, True G, Estimate of G, Pr(G & !S).]
  34. An Example of the IRW-SVA Algorithm. [Figure panels: The Data, True G, Estimate of G, Pr(G & !S).]
  35. An Example of the IRW-SVA Algorithm. [Figure panels: The Data, True G, Estimate of G, Pr(G & !S).]
  36. An Example of the IRW-SVA Algorithm. [Figure panels: The Data, True G, Estimate of G, Pr(G & !S).]
  37. An Example of the IRW-SVA Algorithm. [Figure panels: The Data, True G, Estimate of G, Pr(G & !S).]
  38. Iteratively Re-weighted Surrogate Variable Analysis. [Repeat of the algorithm on slide 30.]
  39. Estimating The Row Dimension of G
     1. Buja and Eyuboglu (1992) proposed a permutation approach.
     2. Patterson, Price, and Reich (2006) proposed a sequential testing strategy based on Tracy-Widom theory.
     3. Leek (in preparation) proposes an eigenvalue estimator that is consistent in the number of tests.
  40. Estimating The Row Dimension of G
     1. Assume the data follow $X = BS + \Gamma G + U$, where G and S have row dimensions r and d, r + d < n.
     2. Calculate the singular values $s_1, \ldots, s_n$ of X and choose b such that r + d < b.
     3. Calculate the eigenvalues $\lambda_1, \ldots, \lambda_n$ of $\frac{1}{m}\left[R^T R - s_b^2 P\right]$, where $P = I - S(S^T S)^{-1} S^T$ and $R = XP$.
     4. Set $\hat{r} = \sum_{j=1}^{n} 1(\lambda_j > m^{-1/3})$.
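The four steps on slide 40 can be sketched as follows. This is a hedged transcription of the formula as it appears in the transcript, not a validated implementation of the estimator: the function name and simulated data are assumptions, and because S is stored here as a d×n array (rows are primary variables) the projector is written as I - Sᵀ(SSᵀ)⁻¹S, the n×n counterpart of the slide's expression.

```python
import numpy as np

def estimate_row_dimension(X, S, b):
    """Count eigenvalues of (R^T R - s_b^2 P) / m that exceed m^(-1/3).

    X : (m, n) data, S : (d, n) primary variables, b : 1-based index with r + d < b.
    """
    m, n = X.shape
    s = np.linalg.svd(X, compute_uv=False)              # singular values, descending
    P = np.eye(n) - S.T @ np.linalg.inv(S @ S.T) @ S    # projects out the rows of S
    R = X @ P
    lam = np.linalg.eigvalsh((R.T @ R - s[b - 1] ** 2 * P) / m)
    return int(np.sum(lam > m ** (-1.0 / 3.0)))

# Simulated data (illustrative only): d = 1 primary variable, r = 2 hidden factors.
rng = np.random.default_rng(4)
m, n, d, r = 2000, 20, 1, 2
S = np.repeat([-1.0, 1.0], n // 2)[None, :]
G = rng.normal(size=(r, n))
X = rng.normal(size=(m, d)) @ S + rng.normal(size=(m, r)) @ G + rng.normal(size=(m, n))

r_hat = estimate_row_dimension(X, S, b=4)
```

The consistency result is asymptotic in m, so for a finite simulation like this one the count need not equal r exactly.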
  41. Estimating The Row Dimension of G. Leek (In Prep.)
     Theorem: As $m \to \infty$, $\hat{r} = \sum_{j=1}^{n} 1(\lambda_j > m^{-1/3})$ is a consistent estimate of the row dimension of G, provided that:
     (1) the $u_{ij}$ are independent
     (2) $E[u_{ij}] = 0$
     (3) $E[u_{ij}^2] = \sigma_i^2 < M_1$
     (4) $E[u_{ij}^4] < M_2$
     (5) $\lim_{m \to \infty} \frac{1}{m} \Gamma^T \Gamma$ is positive definite with unique eigenvalues
  42. Iteratively Re-weighted Surrogate Variable Analysis. [Repeat of the algorithm on slide 30.]
  43. Break The Estimation Into Two Components
  44. Estimating the Probability Weights
     1. Form F-statistics $F_1, \ldots, F_m$ for testing the hypotheses: …
     2. Bootstrap from the conditional null model to obtain null statistics $F_1^{0k}, \ldots, F_m^{0k}$, k = 1, …, K.
     3. From Bayes’ Theorem: … where $F_i^{0k} \sim g_0$ and $F_i \sim \pi_0 g_0 + (1 - \pi_0) g_1$.
  45. Estimating the Probability Weights. Steps 1 to 3 as on slide 44, plus:
     4. Estimate the ratio of the densities with a non-parametric logistic regression where the $F_i$ are “successes” and the $F_i^{0k}$ are “failures” (Anderson and Blair 1982).
  46. Estimating the Probability Weights. Steps 1 to 4 as on slide 45, plus:
     5. Estimate $\pi_0$ according to Storey (2002).
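The Bayes' theorem display in step 3 did not survive in this transcript. Under the mixture stated on the slide, $F_i \sim \pi_0 g_0 + (1 - \pi_0) g_1$, the standard two-groups form would read as below for the event $b_i \neq 0$ named on slide 47. This is a textbook reconstruction offered as an assumption, not the slide's own formula; the marginal density $g$ and the logistic success probability $p(F)$ are symbols introduced here for illustration.

```latex
\Pr(b_i \neq 0 \mid F_i)
  = \frac{(1 - \pi_0)\, g_1(F_i)}{\pi_0\, g_0(F_i) + (1 - \pi_0)\, g_1(F_i)}
  = 1 - \pi_0\, \frac{g_0(F_i)}{g(F_i)},
\qquad g = \pi_0 g_0 + (1 - \pi_0)\, g_1 .

% Pooling the m observed statistics ("successes") with the mK null statistics
% ("failures"), the probability that a pooled statistic at value F is a success is
p(F) = \frac{g(F)}{g(F) + K\, g_0(F)}
\quad\Longrightarrow\quad
\frac{g_0(F)}{g(F)} = \frac{1 - p(F)}{K\, p(F)} .
```

Read this way, the logistic regression in step 4 supplies the density ratio and step 5 supplies $\pi_0$, which together give the posterior probability.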
  47. Estimating the Probability Weights. [Figure: estimate of the posterior probability that $b_i \neq 0$.]
  48. SVA-Adjusted Analysis
     1. Estimate G with IRW-SVA.
     2. Fit …
     3. Test the hypotheses $H_{0i}: b_i \in \Omega_0$ vs. $H_{1i}: b_i \in \Omega_1$.
  49. A Simple Simulated Example. [Figure panels: Independent E, Dependent E; axes: Genes, Arrays.]
  50. Null Distribution Behavior. [Figure panels: Independent E, Dependent E, Dependent E + IRW-SVA.]
  51. False Discovery Rate Estimates. [Figure panels: Independent E, Dependent E, Dependent E + IRW-SVA; axes: True False Discovery Rate, Q-value.]
  52. Ranking Estimates. [Figure panels: Independent E, Dependent E, Dependent E + IRW-SVA; axes: Ranking by True Signal to Noise, Average Ranking by T-Statistic.]
  53. Inflammation and the Host Response to Injury. mRNA expression: ~50,000 genes. Clinical data: >150 clinical variables. Patients 1, 2, …, 166. MOF measures severity of injury.
  54. Four “Replicated” Studies. [Figure: P-value histograms (Frequency vs. P-value) for Phases 1, 2, 3, and 4.]
  55. Functional Enrichment Across Phases. [Figure: percent of total significant pathways by number of phases in which a significant pathway appears (1 of 4, 2 of 4, 3 of 4, 4 of 4); series: Unadjusted, IRW-SVA Adjusted.]
  56. Summary
     •  High-dimensional hypothesis testing is common.
     •  Dependence between tests can result in incorrect statistical and scientific inference.
     •  We can define and address dependence at the level of the model using the dependence kernel.
     •  IRW-SVA can be used to improve inference in high-dimensional multiple hypothesis testing.
  57. Future Work
     •  Multiple Testing
        - Develop dependence kernel estimates for spatial data
        - Develop diagnostic tests for multiple testing procedures
     •  High-Dimensional Asymptotics
        - Extend methods for asymptotic SVD to binary data
     •  Feature Selection for High-Dimensional Classifiers
        - Extensions of top-scoring pairs (TSP) to survival data
        - Theoretical connections to LDA and SVM
        - Embedding TSP in a logic regression framework
  58. Thank You
  59. Estimating The Row Dimension of G. Buja and Eyuboglu (1992)
     1. Calculate the residuals $R = X - \hat{B}S$.
     2. Calculate the singular values of R, $d_1, \ldots, d_n$.
     For k = 1, …, K do steps 3-4:
     3. Permute each row of R individually to get $R_0$.
     4. Take the SVD of the residuals $R^* = R_0 - \hat{B}_0 S$ to obtain null singular values $d_{i0}^{k}$.
     5. Compare $d_i$ to $d_{i0}^{k}$ for k = 1, …, K to calculate a P-value for the ith right singular vector.
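The permutation steps above can be sketched as below. This is a minimal sketch under assumptions: the slide does not say how the comparison in step 5 is turned into a P-value, so the "(count + 1) / (K + 1)" rule used here is a common convention chosen for illustration, and `permutation_pvalues` and the simulated data are hypothetical.

```python
import numpy as np

def permutation_pvalues(X, S, K=20, seed=0):
    """Permutation P-values for the singular values of the residual matrix.

    X : (m, n) data, S : (d, n) primary variables, K : number of permutations.
    """
    rng = np.random.default_rng(seed)
    hat = S.T @ np.linalg.inv(S @ S.T) @ S          # (n, n) projector onto rows of S
    R = X - X @ hat                                 # step 1: R = X - B_hat S
    d = np.linalg.svd(R, compute_uv=False)          # step 2: observed singular values
    exceed = np.zeros_like(d)
    for _ in range(K):
        order = rng.random(R.shape).argsort(axis=1)   # independent shuffle per row
        R0 = np.take_along_axis(R, order, axis=1)     # step 3: permuted residuals
        R_star = R0 - R0 @ hat                        # step 4: R* = R0 - B_hat0 S
        d0 = np.linalg.svd(R_star, compute_uv=False)
        exceed += d0 >= d                             # null value at least as large
    return (exceed + 1) / (K + 1)                     # step 5 (assumed P-value rule)

# Simulated data (illustrative only): one strong hidden factor.
rng = np.random.default_rng(5)
m, n = 1000, 20
S = np.repeat([-1.0, 1.0], n // 2)[None, :]
G = np.tile([1.0, -1.0], n // 2)[None, :]
X = rng.normal(size=(m, 1)) @ S + rng.normal(size=(m, 1)) @ G + rng.normal(size=(m, n))

p_values = permutation_pvalues(X, S, K=20, seed=0)
```

With K permutations the smallest attainable P-value under this rule is 1 / (K + 1), so K bounds the resolution of the test.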
  60. Why Does This Work? Leek and Storey (2007), Leek and Storey (2008)
     Useful Fact: $X = BS + E = BS + \Gamma G + U = BS + \Lambda H + U$ if G and H have the same row space.
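The useful fact can be checked numerically: if H = AG for an invertible r×r matrix A, the rows of G and H span the same space, so a least-squares fit on S and H leaves the same residuals as a fit on S and G. The sketch below uses made-up data and a hypothetical helper, `residuals_after_fit`.

```python
import numpy as np

def residuals_after_fit(X, S, G):
    """Residuals from the least-squares fit of every row of X on the rows of S and G."""
    Z = np.vstack([S, G])
    coef, *_ = np.linalg.lstsq(Z.T, X.T, rcond=None)
    return X - coef.T @ Z

# Simulated data (illustrative only).
rng = np.random.default_rng(3)
m, n, r = 50, 12, 2
S = np.vstack([np.ones(n), np.arange(n, dtype=float)])
G = rng.normal(size=(r, n))
X = rng.normal(size=(m, 2)) @ S + rng.normal(size=(m, r)) @ G + rng.normal(size=(m, n))

A = np.array([[2.0, 1.0], [0.0, -3.0]])   # any invertible r x r matrix
H = A @ G                                 # different matrix, same span of rows
same_residuals = np.allclose(residuals_after_fit(X, S, G),
                             residuals_after_fit(X, S, H))
```

This is why an estimate of G only needs to recover the right span, not the original factors themselves.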
  61. References
     Benjamini Y and Hochberg Y. (1995) “Controlling the false discovery rate: a practical and powerful approach to multiple testing.” JRSSB, 57: 289-300.
     De Castro MC, Monte-Mor RL, Sawyer DO, and Singer BH. (2005) “Malaria risk on the amazon frontier.” PNAS, 103: 2452-2457.
     Devlin B and Roeder K. (1999) “Genomic control for association studies.” Biometrics, 55: 997-1004.
     Efron B. (2004) “Large-scale simultaneous hypothesis testing: The choice of a null hypothesis.” JASA, 99: 96-104.
     Leek JT and Storey JD. (2008) “A general framework for multiple testing dependence.” Proceedings of the National Academy of Sciences, 105: 18718-18723.
     Leek JT and Storey JD. (2007) “Capturing heterogeneity in gene expression studies by ‘Surrogate Variable Analysis’.” PLoS Genetics, 3: e161.
     Taylor JE and Worsley KJ. (2007) “Detecting sparse signals in random fields, with applications to brain mapping.” JASA, 102: 913-928.
     Thank You
  62. Empirical Null
     1. Perform each hypothesis test individually.
     2. Obtain the test-statistic for each test.
     3. Compare distribution of test-statistics to the theoretical null distribution.
     4. Adjust theoretical null so that it matches the observed statistics in a low signal region.
  63. [Figure: Theoretical Null. Efron (2004).]
  64. [Figure: Theoretical Null and Empirical Null. Efron (2004).]
  65. Empirical Null Results in Incorrect Null Distribution. [Figure label: Dep. Kernel.]
  66. With Confounding Empirical Null is Ill-Posed
     •  Observed statistics or observed P-values come from mixture distribution: $\pi_0 g_0 + \pi_1 g_1$
     •  Dependence distorts $g_0$ … can go either way: [figure]
     •  Must use full data set to capture dependence
