3. What causes Autism Spectrum Disorders?
Neuroimaging
Environment
Behavior
Genetics
37% heritable
MZ twins: 66% concordance; fraternal: 30%
No single SNP genome-wide significance
CNVs: less than 1% of cases
De novo mutations: 10-20% of cases
valproic acid, rubella, infections during pregnancy,
alcohol, thalidomide, parental age, antidepressants,
something else?
aberrant functional connectivity and structure
not reproducible
biased and unreliable
“gold standard”
4. Research in Progress
1. Brain structure
2. Behavioral Phenotype
3. Genetic Signature of Behavior
1. Meta analysis of Brain Function
2. Gene Expression
3. Evaluation
5. Why is this work meaningful?
A new model of neuropsychiatric disorder based on
patterns of local brain structure
neuropsychiatric profile
brain
phenotype
cognitive
phenotype
6. 1. Brain Structure to Predict ASD
• N=400 samples
• M=276 features
– Area
– Volume
– Curvature
– Thickness
brain
phenotype
cognitive
phenotype
7. 2. Behavioral Phenotype
“Eye gaze score”
What is the developmental trajectory of eye gaze?
0: normal 1: aberrant
• National Database of Autism Research (NDAR)
• ~150-200 behavioral metrics
• “eye”,“gaze”: 678 questions for 22,823 subjects
cognitive
phenotype
8. 2. Behavioral Phenotype
ASD vs. Healthy Control Eye Gaze Scores
Two Sample T-Test
t = 46.315, p-value < 2.2e-16
[Histogram: frequency of eye gaze score, autism vs. control, N=22,823]
11. 3. Genetic Signature of Behavior
Social deficits
Communication deficits
Repetitive behaviors
ASD
Brain Map
Meta Analysis of Brain Function
“anxiety” 525 Terms
http://vbmis.com/bmi/project/neuromap/
12. Gene
Expression
3. Genetic Signature of Behavior
Gene Expression
Social deficits
Communication deficits
Repetitive behaviors
ASD
Brain Map
“anxiety”
13. Why is this work meaningful?
Gene
Expression
Social deficits
Communication deficits
Repetitive behaviors
Brain Map
Behavior
• Clinical solutions:
– Autism has no drugs
– Identify genetic markers that can be detected in blood
• Genetic signature of a behavior
– Leads us closer to drug solution
– Signature indicates likelihood of drug working for
specific kind of ASD
14. Mapping behavior to genes
Gene
Expression
Social deficits
Communication deficits
Repetitive behaviors
Brain Map
Behavior
“anxiety”
Neurosynth Allen Overlap
15. 3. Genetic Signature of Behavior
Match points in “anxiety” map to Allen Brain Atlas
Neurosynth Allen
18. 3. Genetic Signature of Behavior
How to find interesting genes for a behavioral map?
“anxiety”
19. • Assess the “relative importance” of each gene probe
to define a term
• If predictors in regression are uncorrelated,
assessing relative importance means:
3. Genetic Signature of Behavior
How to find interesting genes for a behavioral map?
Shapley Value Regression
Bigger change = more “important”
20. 3. Genetic Signature of Behavior
How to find interesting genes for a behavioral map?
Shapley Value Regression
• Assess the “relative importance” of each gene probe
to define a term
• If predictors in regression are uncorrelated,
assessing relative importance means:
R2
% variance accounted for by model
quality of model predictors
21. 3. Genetic Signature of Behavior
How to find interesting genes for a behavioral map?
Shapley Value Regression
• creates a score for each player in a game that
represents that player’s contribution to the total
value of the game
Attributes (genes): players
Total Value: quality of model (R2)
Shapley value of gene j = sum over models of [ weight × (R2 with attribute j − R2 without attribute j) ]
weight based on n total predictors, k in model
22. 3. Genetic Signature of Behavior
How to find interesting genes for a behavioral map?
Shapley Value Regression
• creates a score for each player in a game that
represents that player’s contribution to the total
value of the game
Attributes (genes): players
Total Value: quality of model (R2)
marginal contribution to the R2 from adding the
attribute to the model last
23. 0 0 0
0 1 0
1 0 1
0 0 0
1 0 0
0 0 0
0 0 1
3. Genetic Signature of Behavior
How to find interesting genes for a behavioral map?
Shapley Value Regression
• Assess the “relative importance” of each gene to define a term
• Define an expression property: consistent pattern of regulation
0.25 0.12 1.20
1.50 0.80 3.40
0.80 0.90 1.00
0.40 0.75 0.20
1.40 0.32 4.50
0.89 0.21 2.40
0.70 0.10 1.20
Probes
Samples
1 0 1
0 0 0
0 0 0
0 1 0
0 0 0
1 0 1
0 0 0
Microarray Expression Condition 1 (B1) Condition 2 (B2)
24. 3. Genetic Signature of Behavior
How do I evaluate my gene subsets?
• Gene Set Enrichment Analysis
– determines whether an a priori defined set of genes
shows statistically significant, concordant differences
between two phenotypes.
Nextbio gene expression data for ASD vs. HC
Broad Institute Drug Gene Expression Database
25. 3. Genetic Signature of Behavior
How do I evaluate my subsets?
Gene Set Enrichment Analysis
1. Enrichment Score: the degree to which a set S is
overrepresented at the extremes of my list
2. Estimate the significance level of the scores
3. Multiple hypothesis testing
Subramanian et al., Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles.
PNAS 2005 102 (43) 15545-15550; published ahead of print September 30, 2005,doi:10.1073/pnas.0506580102
26. 3. Genetic Signature of Behavior
How do I evaluate my gene subsets?
• Nextbio gene expression data for ASD vs. HC
Is actual gene expression data in ASD vs HC:
1. overexpressed for any of my behavioral term sets?
2. overexpressed for gene sets found aberrant in ASD?
3. overexpressed for any functional pathways (C2)
Analysis in Progress!
27. 3. Genetic Signature of Behavior
How do I evaluate my gene subsets?
– Broad Institute Drug Gene Expression Database
– Daily Med
(disorders with anxiety): Adjustment Disorders Affective Disorders, Psychotic
Neurocirculatory Asthenia Obsessive-Compulsive Disorder Premenstrual
Syndrome Seasonal Affective Disorder Panic Disorder
(drugs): Meprobamate Fluvoxamine Clorazepate Dipotassium Alprazolam
Chlormezanone Trazodone Lorazepam Temazepam Amobarbital Pentobarbital
Oxazepam Secobarbital Diazepam Hydroxyzine Ritanserin Oxprenolol
Medazepam Secobarbital Diazepam Meprobamate Fluvoxamine Clorazepate
Dipotassium Pentobarbital Amobarbital Alprazolam Chlormezanone
Trazodone Lorazepam Temazepam Hydroxyzine Oxazepam Oxprenolol
Medazepam
28. 3. Genetic Signature of Behavior
How do I evaluate my gene subsets?
– Broad Institute Connectivity map .CEL Files
• Extract Log2 transformed normalized data
• 17 cell lines, 22K probes, 5 anxiety medications
Is gene expression data for cells exposed to drugs:
1. overexpressed for any of my behavioral term sets?
2. overexpressed for gene sets found aberrant in ASD?
3. overexpressed for any functional pathways (C2)
How to define phenotypes?
29. Acknowledgements
Advisors
Dennis Wall
Russ Altman
Daniel Rubin
Colleagues
Ruth O’Hara
Joachim Hallmayer
Antonio Hardan
Admin Support
Susan Aptekar
John DiMario
Mary Jeanne & Nancy
Steven Bagley
Funding
Microsoft Research
SGF and NSF
Wall Lab
Maude David
Leticia Diaz Beltran
Jena Daniels
Marlena Duda
Alex Lancaster
Jack Kosmicki
Jae Yoon-Jung
Nikhila Albert
Byron Hinebaugh
Rubin Lab
Francisco Gimenez
Rebecca Sawyer
Tiffany Ting Lu
BMI Family
Diego
Boots
Peyton
Linda
Katie
Natalie
Beth
Winn
Sarah
Emily
Jonathan
Erika and Brian & co
Luke
Sam
31. 3. Genetic Signature of Behavior
How to find interesting genes for a behavioral map?
PACall.csv
Contains a present/absent flag which indicates whether the probe's
expression is well above background. It is set to 1 when both of the
following conditions are met.
1) The 2-sided t-test p-value is lower than 0.01, (indicating the mean
signal of the probe's expression is significantly different from the
corresponding background).
2) The difference between the background subtracted signal and the
background is significant (> 2.6 * background standard deviation).
• Microarray expression
• PA Call
34. 3. Genetic Signature of Behavior
Gene Set Enrichment Analysis
1. Calculate an enrichment score (ES) that reflects the
degree to which a set S is overrepresented at the
extremes of the entire ranked list L.
2. Estimate the significance level of the ES by permuting
the phenotype labels and recomputing the ES for
permuted data --> null distribution --> calculate P value
3. Multiple hypothesis testing
37. (Age Specific) Brain Structure to Predict ASD
Age group               9-18 years   18+ years
Correctly Classified    58 (100%)    69 (100%)
Incorrectly Classified  0 (0%)       0 (0%)
38. 3. Genetic Signature of Behavior
Terms with >75% overlap
childhood : children
japanese : chinese
default : chinese
taskrelated : chinese
frequency : card
tracking : words
family : videos
default : japanese
taskrelated : japanese
taskrelated : default
So if you remember my quals talk, you know that my biological problem pertains to a data driven approach to discover subtypes of autism spectrum disorder. I’ll briefly motivate this again.
Cost, prevalence, emerges between 2-3 years of age. Prevalence increases 10-17%/year and it’s not due to changes in definition or diagnostic rates.
We use the DSM-5 to diagnose. It’s based on behavior and clinical observations, and the problem with this approach is that it’s subject to bias and to differences between clinicians, and autism is so highly heterogeneous. Not only is it comorbid with all of these disorders, but what is clear is that it extends far beyond being a “brain” disorder. Individuals with ASD span the gamut in terms of intellectual disability, motor coordination, attention, sleep, and gastrointestinal disturbance, and then you have a small cohort that excels in visual skills, music, math and art. So a list of behavioral terms is not sufficient for early diagnosis, which is essential for treatment and better long-term outcome.
And you know, a big issue with autism and many neuropsychiatric disorders is that we don’t understand the underlying etiology of the disorder enough to know where to look in biology. So, of course this is a ripe area for research.
Sorry internet, the answer is not vaccines. Here is a summary of what we do know. We know that the disorder influences the brain, but we can’t find a reliable biomarker. We do know there is aberrant functional connectivity, prefrontal cortex and temporal cortex, increased white and gray matter, and this all starts around 2 years of age.
There are proven environmental factors that increase ASD risk, and there is Russ and Steven Bagley’s recent study, which found a strong correlation between intellectual disability and environmental location.
Behavior on its own, we know, tends to be biased and unreliable, but we use it a lot because it’s the gold standard.
Here’s the thing about genetics. We know that ASD has a genetic signature, we just haven't found it yet.
Current estimates: 37% heritable. MZ twins: 66% concordance, fraternal, 30%.
So now I want to talk about my research for the past 6 months, so you have context for my current “research in progress.” I started by looking at pure brain structure, then developed methods to extract a behavioral phenotype and conduct a meta analysis of brain function, and my current work is to identify the genetic signature of a behavior. We will talk about all of these.
If we understand the genetic signature of behavior, we can look at genes and predict someone’s behavior
If we find that any of the drug data is overexpressed for a set of genes, what we’ve essentially found is a set of druggable genes.
Autism has no drugs. We need a strategy that leads us closer to hope for clinical/drug solutions.
Understanding the relationship between brain, behavior in a way that leads to clinical solutions – e.g. identification of genetic markers that can be detected in the blood (that predicts early the likelihood of a behavior developing later). Signature that indicates likelihood of a drug working for a kid who will develop a specific kind of autism.
I started very simply. I extracted 276 structural metrics describing thickness, curvature, volume, and area across 400 individuals, and there was beautiful clustering! The problem, of course, is the same story – there is definitely aberrant structure, but because it’s so heterogeneous, we don’t have any labels to validate this clustering.
I was able to use these features with some behavioral traits to predict ASD with an accuracy of about 80%, but it dropped…
So of course I looked for validation in behavioral data! As we talked about, we have a lot of metrics that “get at” issues with social, communication, and repetitive behaviors, but research suggests that sensory aspects are much more important, such as sensitivity to sound, touch, and something like eye gaze. So since we know that eye contact is aberrant with ASD, I decided that I would try to use the NDAR database to extract “eye gaze scores.” I spent about 2 months developing infrastructure and methods to query the entire National Database of Autism Research (NDAR) to develop these scores. And it’s not complicated – I used regular expressions to find any kind of word related to eye / gaze / eye contact, then manually curated my set of questions and manually normalized them to all be between 0 and 1, with 0 indicating normal eye gaze and 1 indicating aberrant, and then I wanted to know if I could distinguish ASD vs HC with my scores.
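The curation step described here can be sketched roughly as follows. This is only a minimal illustration, not the actual NDAR pipeline: the question texts, the `EYE_GAZE_PATTERN` expression, and the `normalize_score` helper are all hypothetical, and the sketch assumes each question has a known minimum and maximum response plus a manually assigned flag for whether a higher response means more aberrant.

```python
import re

# Hypothetical pattern: matches "eye", "eyes", or "gaze" as whole words.
EYE_GAZE_PATTERN = re.compile(r"\b(eye(s)?|gaze)\b", re.IGNORECASE)

def find_candidate_questions(questions):
    """Return the questions whose text mentions eye or gaze."""
    return [q for q in questions if EYE_GAZE_PATTERN.search(q)]

def normalize_score(value, minimum, maximum, higher_is_aberrant=True):
    """Rescale a raw response to [0, 1], with 0 = normal and 1 = aberrant."""
    if maximum == minimum:
        raise ValueError("question has no range")
    scaled = (value - minimum) / (maximum - minimum)
    return scaled if higher_is_aberrant else 1.0 - scaled

# Made-up question texts, for illustration only.
questions = [
    "Does the child make eye contact when spoken to?",
    "Unusual gaze or staring",
    "Sensitivity to loud sounds",
]
candidates = find_candidate_questions(questions)
```

The regular expression only proposes candidates; the direction flag for each question is exactly the part that still needs manual curation.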
What about differences in age?
I could also break apart my data to look at differences in age groups, and we see a clear difference between all ages of HC and different ages of autism, with the worst eye contact being in infants, and then having it slowly improve over time. So this was great, and I now wanted to go back to my behavioral data to see if this could explain my clustering. It couldn’t, at all. So then I looked to do validation on another data source, ABIDE, but there just isn’t the overlap of behavioral metrics to make it possible. I would have to do a completely new analysis, finding possibly different questions, and you can’t compare apples and oranges. And in the late fall a paper came out of Harvard that manually curated EVERY single question in this database for these kinds of behavioral terms, so I considered myself scooped and totally dropped this research. My manual curation method would be totally infeasible for any large number of metrics, and so at this point I have future plans to use their ontology.
TODO: Better look up method
At this point I decided that getting scooped was terrible, and if I’m in an awesome new lab with Dennis Wall, I should try to expand my skillset beyond imaging data. I wanted to incorporate genetics somewhere in here, because heritability plays a big role. I would want to create hypotheses about the genetic signature of behavior.
If we can start with behaviors that are aberrant in a disorder, find brain areas involved in the manifestation of that behavior, and then look at gene expression, we can create a hypothesized subset of genes that are implicated in the behavior, and test with actual data. Then we can predict someone’s behavior from genetics!
So let’s start with our behavioral data. Step 1 is to figure out which spatial areas in the brain are likely to be involved with that behavior. I used the Neurosynth API:
Takes as input a behavioral term, and a significance threshold
Performs meta analysis to produce a set of spatial maps
Extracts nonzero voxels from FDR corrected (absolute value) image --> MNI coordinates for significant spatial locations associated with term based on literature
Here is a visualization of the map for the term "anxiety" - which is my first query/test term. As we would expect, we see activation in bilateral amygdala, OFC, and insula.
Now we are interested in gene expression of these areas. And now we go to the Allen Brain Atlas, which has gene expression for 3,702 spatial locations in the brain for 60K gene probes.
If we understand the genetic signature of behavior, we can look at genes and predict someone’s behavior
If we find that any of the drug data is overexpressed for a set of genes, what we’ve essentially found is a set of druggable genes.
Autism has no drugs. We need a strategy that leads us closer to hope for clinical/drug solutions.
Understanding the relationship between brain, behavior in a way that leads to clinical solutions – e.g. identification of genetic markers that can be detected in the blood (that predicts early the likelihood of a behavior developing later). Signature that indicates likelihood of a drug working for a kid who will develop a specific kind of autism.
My metric was simple – find the closest sample point for each point in my behavioral map, and only keep those that are 3mm or closer, because that’s the typical resolution of a voxel in neuroimaging. And you can see we have pretty good overlap.
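As a rough sketch of this matching step, assuming both the behavioral map and the Allen samples are given as MNI coordinates in millimetres: the function name `match_to_samples` and the toy coordinates are hypothetical, and the brute-force search is only for clarity (a spatial index such as a KD-tree would be the natural choice at full scale).

```python
import math

def match_to_samples(map_points, sample_points, max_distance_mm=3.0):
    """For each map coordinate, find the nearest sample; keep it if close enough.

    Returns a dict mapping map-point index -> (sample index, distance in mm).
    """
    matches = {}
    for i, point in enumerate(map_points):
        best_index, best_distance = None, math.inf
        for j, sample in enumerate(sample_points):
            distance = math.dist(point, sample)  # Euclidean distance
            if distance < best_distance:
                best_index, best_distance = j, distance
        if best_index is not None and best_distance <= max_distance_mm:
            matches[i] = (best_index, best_distance)
    return matches

# Toy coordinates (made up for illustration).
map_points = [(24.0, -2.0, -18.0), (40.0, 10.0, 0.0)]
sample_points = [(25.0, -2.0, -18.0), (-30.0, 20.0, 5.0)]
matches = match_to_samples(map_points, sample_points)
```

Here the first map point is 1 mm from a sample and is kept, while the second has no sample within 3 mm and is dropped.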
OK, so at this point we have for each term a set of sample points in the Allen Brain Atlas – now I needed to figure out my interesting subset of genes. I started by taking the entirety of the Allen Brain Atlas and putting it into BigQuery. So let’s talk about the data that I have
So since this PA call matrix has a 1 to indicate expression above background across the entire brain, if I could just find the values of 1 for regions in my behavioral maps, those would be interesting. So here we have a toy example of this PA call matrix for a single behavioral term. Rows are samples, and columns are gene probes. So my first strategy was to sum over the samples, the idea being that a gene would be more relevant to a term if it’s expressed above background in more areas. So this becomes a vector of features to describe my behavioral term, and I could normalize these values to get the genes that are expressed across most of the map. However – this is misleading – because for any term, there is no one probe that is expressed sig. above background for greater than 2% of sample locations. And even if this was meaningful, I found that an arbitrary threshold at .9 still gave me 15-20K genes. That’s not a small enough subset!
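This first strategy amounts to a column sum over the PA call matrix. A minimal sketch, assuming rows are samples and columns are probes as in the toy example; the function name `fraction_expressed` and the 0.5 cutoff are illustrative choices, not the threshold used in the analysis.

```python
def fraction_expressed(pa_calls):
    """Column-wise fraction of samples in which each probe is called present.

    pa_calls: list of rows (samples), each a list of 0/1 flags per probe.
    """
    n_samples = len(pa_calls)
    n_probes = len(pa_calls[0])
    return [sum(row[p] for row in pa_calls) / n_samples for p in range(n_probes)]

# Toy PA-call matrix: 4 samples (rows) x 3 probes (columns).
pa_calls = [
    [1, 0, 1],
    [0, 0, 1],
    [0, 1, 1],
    [0, 0, 0],
]
fractions = fraction_expressed(pa_calls)  # one value per probe
# Illustrative cutoff only: keep probes present in at least half the samples.
selected = [p for p, f in enumerate(fractions) if f >= 0.5]
```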
A change in 1 standard unit of a coefficient == predicted change of βA units of the criterion variable
Bigger changes in β == bigger changes == more “important”
Take absolute value or square coefficients to deal with negatives
Sum is the R2 value == quality of model predictors, % variance accounted for by model
Show regression coefficients, animate larger
So let’s look at attempt number 2. We are going to use Shapley value regression, which is used to assess the “relative importance” of a gene probe to define a term. If all of the predictor variables in a regression model are uncorrelated with each other, then assessing the relative importance of the various predictors is fairly straightforward. If we consider the standardized regression coefficients (often called Beta coefficients), their interpretation is clear. A change of 1 standard unit in the variable A will result in a predicted change of βA standard units of our criterion variable.
Bigger values of β mean bigger changes in our criterion. Therefore, β can be thought of as a measure of importance. We take absolute value or square to get rid of negative signs, and that’s why R2 gets at relative importance.
R2 == interpreted as the percent of variance in the criterion variable that is accounted for by the model.
But let’s go back to this “uncorrelated” term – yeah right! When we have correlated variables the idea of holding all constant and changing one to assess relative importance breaks down.
So Shapley value regression creates a score for each player in a game that represents the player’s contribution to the total value of the game. We treat the attributes as the players, and the total value of the game as the quality of the regression model, or the R2.
So this entire dudesey, when M is the full model, is the marginal contribution to the R squared from adding the attribute to the model last. So with these shapley values we have assessed the “relative importance” of each gene. Now how does this get applied to our data?
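To make the definition concrete, here is a small sketch of an exact Shapley computation over a generic value function. It is not the code used in this work: the probe names and the table of R2 values are made up, and the weight k!(n−k−1)!/n! is the standard Shapley weight for a subset of size k out of n players. The enumeration costs 2^n evaluations, so it is only feasible for a handful of predictors.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley value of each player for a set function `value`.

    value(frozenset) -> float, e.g. the R^2 of a model fit on that subset.
    """
    n = len(players)
    result = {}
    for j in players:
        others = [p for p in players if p != j]
        total = 0.0
        for k in range(n):  # k = size of the subset that excludes j
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Marginal gain in value from adding j to this subset.
                total += weight * (value(s | {j}) - value(s))
        result[j] = total
    return result

# Made-up R^2 values for every subset of three hypothetical probes.
r_squared = {
    frozenset(): 0.0,
    frozenset("A"): 0.30, frozenset("B"): 0.20, frozenset("C"): 0.10,
    frozenset("AB"): 0.40, frozenset("AC"): 0.35, frozenset("BC"): 0.25,
    frozenset("ABC"): 0.45,
}
scores = shapley_values(["A", "B", "C"], r_squared.__getitem__)
```

A useful sanity check is that the scores sum to the R2 of the full model, so each one can be read as that probe's share of the explained variance.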
We start with microarray expression data, and we need to find genes that are associated with some “expression property,” which in this case is being generally upregulated or downregulated in this set. A group of genes S⊆N which realizes the association between the expression property and the condition on a single array is called a winning coalition for that array. So I took the mean +/- 1 SD to define two new matrices, B1 and B2, with B1 representing the condition UP and B2 the condition DOWN.
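A minimal sketch of this thresholding, under assumptions the notes do not spell out: the mean and SD are taken per probe across samples, the sample standard deviation is used, and a value counts as UP only if strictly above mean + 1 SD (DOWN if strictly below mean − 1 SD). The function name `binarize_expression` and the toy values are hypothetical.

```python
from statistics import mean, stdev

def binarize_expression(expression):
    """Split an expression matrix into UP (B1) and DOWN (B2) indicator matrices.

    expression: list of rows (samples), each a list of values per probe.
    A value is UP if above the probe's mean + 1 SD, DOWN if below mean - 1 SD.
    """
    n_probes = len(expression[0])
    columns = [[row[p] for row in expression] for p in range(n_probes)]
    upper = [mean(c) + stdev(c) for c in columns]
    lower = [mean(c) - stdev(c) for c in columns]
    b1 = [[int(v > upper[p]) for p, v in enumerate(row)] for row in expression]
    b2 = [[int(v < lower[p]) for p, v in enumerate(row)] for row in expression]
    return b1, b2

# Toy matrix: 5 samples (rows) x 2 probes (columns); values are made up.
expression = [
    [1.0, 5.0],
    [1.0, 5.0],
    [1.0, 5.0],
    [1.0, 1.0],
    [4.0, 5.0],
]
b1, b2 = binarize_expression(expression)
```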
And we plug these two matrices into the shapley value formula, and in order to get rid of high shapley values that could be attributed to chance we did 1000 bootstrap samples and for each calculated the unadjusted p value.
Then we column bind these two matrices, and use the R package multtest to do the bootstrap procedure and the result is WHAT And this was very helpful because sets of 20K genes went down to a couple of hundred.
Step 1: Calculation of an Enrichment Score. We calculate an enrichment score (ES) that reflects the degree to which a set S is
overrepresented at the extremes (top or bottom) of the entire ranked list L. The score is calculated by walking down the list L,
increasing a running-sum statistic when we encounter a gene in S and decreasing it when we encounter genes not in S. The magnitude
of the increment depends on the correlation of the gene with the phenotype. The enrichment score is the maximum deviation from
zero encountered in the random walk; it corresponds to a weighted Kolmogorov–Smirnov-like statistic (ref. 7 and Fig. 1B).
Step 2: Estimation of Significance Level of ES. We estimate the statistical significance (nominal P value) of the ES by using an
empirical phenotype-based permutation test procedure that preserves the complex correlation structure of the gene expression
data. Specifically, we permute the phenotype labels and recompute the ES of the gene set for the permuted data, which generates a null
distribution for the ES. The empirical, nominal P value of the observed ES is then calculated relative to this null distribution.
Importantly, the permutation of class labels preserves gene-gene correlations and, thus, provides a more biologically reasonable assessment of significance than would be obtained by permuting genes.
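Step 1 can be sketched as a running sum over the ranked list. This is a simplified illustration rather than the Broad Institute implementation: the gene names and correlation values are made up, the exponent `p` weights hits by their correlation with the phenotype, and the permutation test from Step 2 is left out.

```python
def enrichment_score(ranked_genes, correlations, gene_set, p=1.0):
    """Running-sum enrichment score for `gene_set` in a ranked list.

    ranked_genes: gene names ordered by correlation with the phenotype.
    correlations: matching list of correlation values (same order).
    Returns the running-sum deviation from zero with the largest magnitude.
    """
    in_set = [g in gene_set for g in ranked_genes]
    n_hits = sum(in_set)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must cover some but not all ranked genes")
    hit_norm = sum(abs(r) ** p for r, hit in zip(correlations, in_set) if hit)
    running, best = 0.0, 0.0
    for r, hit in zip(correlations, in_set):
        if hit:
            running += abs(r) ** p / hit_norm  # step up, weighted by correlation
        else:
            running -= 1.0 / n_miss            # step down by a fixed amount
        if abs(running) > abs(best):
            best = running
    return best

# Made-up ranked list: positive correlations at the top, negative at the bottom.
ranked = ["g1", "g2", "g3", "g4", "g5", "g6"]
corr = [2.0, 1.5, 0.5, -0.5, -1.0, -2.0]
es_top = enrichment_score(ranked, corr, {"g1", "g2"})
es_bottom = enrichment_score(ranked, corr, {"g5", "g6"})
```

A set concentrated at the top of the list gets a positive score and one concentrated at the bottom gets a negative score, which is the sense in which the statistic looks at both extremes.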
I have THIS MANY datasets from nextbio to use for this analysis.
I am interested if gene expression in ASD vs HC is overexpressed for any of my behavioral term sets, gene sets found to be aberrant in ASD, and for any functional pathways.
WRITE ABOUT RESULTS?
I can also take this data, filter it to only include terms in each of my subsets, and then do GSEA with the autism datasets and functional pathways database.
The Broad Institute has a database of gene expression for THIS MANY cell cultures exposed to different medications. The stupid web interface requires an “up” and “down” list of genes, and it’s a black box, so I decided to download their instances and do the analysis on my own. I saw that I would need to look up the instances based on the medication name, so it’s a question of, for example, “which medications are relevant for anxiety?” I wrote scripts that use Daily Med to find all medications relevant to anxiety:
I then could search my cmap instances for these drugs, and since a bunch of these are kind of old, I found 20 instances for 5 drugs. I used regular expressions to find them, and I’m going to go back and make sure that I haven’t missed any.
I now want to look at the Connectivity Map data, which are Affymetrix files. I extracted the log2 transformed, normalized data, and decided to keep drugs separate, because just because they all treat anxiety doesn’t mean we can just assume they impact gene expression equivalently. In total I have 17 cell lines across about 22K probes.
So of course if we have similar terms, I was worried that there would be too much overlap in my sample spatial maps. So first I looked at Tanimoto scores, or the Jaccard index, which is the intersection divided by the union – a score of 1 means the maps are identical, and 0 means they share nothing. So here are pairwise scores, and we have the terms matched to themselves over here. I also looked at this plot for each behavioral term against all others, because you get the sense there are some squished similar maps here, and visually looked at all maps with scores greater than .75. There was some overlap, but I didn’t see reason at this point to artificially remove terms from the analysis.
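For reference, the overlap score is simple to compute once each term's map is reduced to a set of matched sample points. The sample IDs below are made up, and returning 1.0 for two empty maps is a convention chosen for this sketch.

```python
def jaccard_index(a, b):
    """Intersection over union of two sets; 1.0 = identical, 0.0 = disjoint."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here for two empty maps
    return len(a & b) / len(a | b)

# Hypothetical sets of Allen sample IDs matched to two term maps.
term_a = {101, 102, 103, 104}
term_b = {103, 104, 105}
score = jaccard_index(term_a, term_b)
```

In this toy case two of the five distinct samples are shared, so the pair would fall well below the .75 cutoff used for the visual check.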
That just shows the data, but in terms of a classifier, my best performance was using ADTrees. I was able to correctly predict almost 80% of cases. And the features are what we would expect. And I sort of glossed over this, because in my mind this is not good enough. This classifier uses behavioral data, and I’m not convinced by that. Other people might have been, because there was a paper published with a 60% accuracy classifier. But honestly, why bother? When I removed the behavioral data, the accuracy dropped to about 70%. But you know, there is interesting clustering here. Can I come up with some behavioral metric to explain this?
I also have 1300…
So we had some major overfitting going on here, but when I separated into groups, I could get perfect performance. This is ADTree 10 fold cross validation. However my sample sizes were also way too small. And it’s not so helpful for these two
Still not great. But one thing this doesn’t account for is the huge variation that we see between different age groups. So I tried that. age groups to make a diagnosis, we need intervention in the first few years of life.
I found a subset of terms with greater than 75% overlap, and manually checked them – and we can see that the spatial maps are different – but then the question is – do I really want to artificially remove terms just because they have similar spatial maps? I didn’t see any reason to.