1. A Study of Random Forests Learning Mechanism with
Application to the Identification of Informative Gene
Interactions in Microarray Data
Jorge M. Arevalillo and Hilario Navarro
Dept. of Statistics and Operational Research
Universidad Nacional de Educación a Distancia
Salford Analytics and Data Mining Conference 2012, San Diego
2. Outline
Weak Marginal / Strong bivariate genetic interactions
RF learning mechanism
RF bivariate interaction detector procedure
Controlling the curse of dimensionality
Handling the small sample effect
Application to microarray data
Conclusions
2 Jorge M. Arevalillo and H. Navarro. RF Learning Mechanism with Application to the Identification of Gene Interactions
3. Human Genetics Basics
DNA is often described as the blueprint of living organisms. It is composed of
two complementary strands of nucleotides:
adenine (A) pairs with thymine (T), and cytosine (C) pairs with guanine (G)
A gene is a segment of DNA that contains the genetic information for
the synthesis of a protein
The human genome in numbers
23 pairs of chromosomes
2 meters of DNA
A sequence about 3 billion base pairs (bp) long
30000 – 40000 genes
Over 99% of the genome is identical in all
human beings
4. The central dogma of molecular biology
The expression of the genetic information stored
in the DNA occurs in two stages
•TRANSCRIPTION. DNA is transcribed into
messenger RNA (mRNA).
•TRANSLATION. The mRNA is transported to the cell
cytoplasm and translated to produce a protein
Amino acids are used to construct proteins, which
in turn determine the observed phenotype
DNA microarray technologies make it possible to measure the abundance of mRNA by
monitoring the expression levels of hundreds or thousands of genes under different
conditions of the phenotype
5. Weak marginal / Strong bivariate genetic interactions
In binary classification we define a weak marginal / strong bivariate (WM/SB) gene to gene
interaction as a pair of variables (genes) whose joint distribution discriminates the outcome
but whose marginal distributions are irrelevant for class separation
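A minimal sketch of a WM/SB pattern, assuming a synthetic XOR layout in which the class is 1 exactly when the two inputs have opposite signs. The function names and the grid of cut points are illustrative choices, not part of the original slides. The best single-input threshold rule stays close to chance, while a rule on the pair separates the classes perfectly.

```python
import random

def make_xor_sample(n, seed=0):
    """Draw n cases with two inputs uniform on [-1, 1]; class is 1 when the signs differ."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x1, x2 = rng.uniform(-1, 1), rng.uniform(-1, 1)
        y = 1 if x1 * x2 < 0 else 0
        data.append((x1, x2, y))
    return data

def accuracy(data, rule):
    """Proportion of cases for which rule(x1, x2) equals the class label."""
    return sum(rule(x1, x2) == y for x1, x2, y in data) / len(data)

def best_marginal_accuracy(data, index):
    """Best accuracy of a single-threshold rule on one input, in either orientation."""
    best = 0.0
    for cut in [c / 10 for c in range(-10, 11)]:
        acc = accuracy(data, lambda x1, x2: 1 if (x1, x2)[index] > cut else 0)
        best = max(best, acc, 1 - acc)
    return best

data = make_xor_sample(2000)
marginal_1 = best_marginal_accuracy(data, 0)   # near 0.5: weak marginal signal
marginal_2 = best_marginal_accuracy(data, 1)   # near 0.5: weak marginal signal
joint = accuracy(data, lambda x1, x2: 1 if x1 * x2 < 0 else 0)  # strong bivariate signal
print(marginal_1, marginal_2, joint)
```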
6. RF learning mechanism
Random Forest (RF) is an ensemble of decision trees
grown in a special way
Randomness is injected into the RF mechanism in two ways:
each tree in the forest is grown on a bootstrap resample of the data,
and the best splitter at each node is sought within
a randomly selected subset of inputs
The number ntree of trees in the forest and the number R of candidate inputs for
splitting each node must be set in advance. Defaults: ntree = 500 and R = square
root of the number p of inputs
Each tree is grown on a bootstrap sample that contains about 63% of the distinct
observations. The classification error rate is estimated using the remaining 37%,
the out of bag cases. The error rate evaluated on the out
of bag cases is called oob
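The 63% / 37% split follows from bootstrap sampling: a given case is missed by all n draws with probability (1 - 1/n)^n, which tends to 1/e (about 0.37). The short simulation below is only an illustration; the sample size n = 500 and the 200 repetitions are arbitrary choices.

```python
import math
import random

def in_bag_fraction(n, rng):
    """Fraction of distinct cases that appear in one bootstrap sample of size n."""
    drawn = {rng.randrange(n) for _ in range(n)}
    return len(drawn) / n

rng = random.Random(1)
n = 500
fractions = [in_bag_fraction(n, rng) for _ in range(200)]
mean_in_bag = sum(fractions) / len(fractions)

exact = 1 - (1 - 1 / n) ** n   # expected in-bag fraction for this n
limit = 1 - math.exp(-1)       # large-sample limit, about 0.632
print(round(mean_in_bag, 3), round(exact, 3), round(limit, 3))
```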
7. RF table of variable importance
The high dimensional nature of the data obtained from gene expression microarray
experiments has created the need for variable selection procedures that separate
relevant predictors (genes), which carry useful information for classifying the
phenotype, from irrelevant ones
RF generates variable importance measures that rank predictors
according to their contribution to the predictive accuracy of the ensemble
RF gives two measures of variable importance
• GINI MEASURE. Each variable is assigned a score that adds up all the improvements
in the Gini index over all the nodes of the trees in the forest that use the variable as splitting
variable
• PERMUTATION BASED MEASURE. For each variable, its values are randomly permuted across
the cases to obtain a noisy predictor; this noisy predictor is used in place of the original
predictor and the oob is computed again. The importance of the variable is the oob
error after permutation minus the oob error before permutation
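The permutation based measure can be sketched as follows. This is a simplified illustration, not RF itself: a fixed rule (`predict`, a hypothetical stand-in for a fitted forest) is scored on one data set, whereas RF computes the errors tree by tree on the out of bag cases. The input that the rule depends on gets a large importance; the inputs it ignores get exactly zero.

```python
import random

def error_rate(predict, rows, labels):
    """Proportion of misclassified cases."""
    return sum(predict(r) != y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(predict, rows, labels, column, rng):
    """Error after permuting one column minus error before permuting it."""
    baseline = error_rate(predict, rows, labels)
    shuffled = [r[column] for r in rows]
    rng.shuffle(shuffled)
    permuted_rows = [r[:column] + (v,) + r[column + 1:]
                     for r, v in zip(rows, shuffled)]
    return error_rate(predict, permuted_rows, labels) - baseline

rng = random.Random(0)
rows = [(rng.uniform(-1, 1), rng.uniform(-1, 1), rng.uniform(-1, 1))
        for _ in range(1000)]
labels = [1 if r[0] > 0 else 0 for r in rows]

def predict(row):
    """Hypothetical stand-in for a fitted classifier: uses only the first input."""
    return 1 if row[0] > 0 else 0

importances = [permutation_importance(predict, rows, labels, c, rng)
               for c in range(3)]
print(importances)
```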
8. The oob error rate degradation in high dimensional settings
An extreme synthetic example: the XOR interaction pattern
The oob error rate degrades rapidly as the number of noisy inputs
increases; hence the XOR signal is lost
The interaction is captured only when it appears alone, without the disturbance of
the noisy inputs; so an exhaustive search among all the pairs of inputs is required
for the RF learning mechanism to detect the interaction
Our proposal offers shortcuts and tricks that simplify the search
9. Search procedure. Sequential stage
The RF ranking of variable importance gives new insights into the degradation
of the oob error rate
Some alternatives that explore this ranking in a sequential manner, Díaz Uriarte (2006)
and Genuer (2008), have been proposed to identify relevant patterns
correlated with the outcome
10. Search procedure. Hunting stage
The second stage is designed to hunt for hard-to-uncover bivariate associations,
which are missed by sequential search strategies
The idea is to group the inputs into blocks, then run RF on all
the variables belonging to each pair of blocks and use its oob error to highlight the block
matches where WM/SB interactions are more likely to appear. This limits the search
[Figure: Block i and Block j are paired into Match (i,j); the block matches are then ranked]
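The saving from working with block matches can be sketched by counting. The figures below assume 1700 candidate genes and block size 5, values that appear later in the talk; the helper names and the placeholder gene labels are illustrative. Every pair of variables is covered by at least one block match, because a match runs RF on all the variables of both blocks.

```python
import math
from itertools import combinations

def make_blocks(variables, bsize):
    """Split the list of variables into consecutive blocks of size bsize."""
    return [variables[i:i + bsize] for i in range(0, len(variables), bsize)]

def block_matches(blocks):
    """All unordered pairs (i, j) of block indices with i < j."""
    return list(combinations(range(len(blocks)), 2))

genes = [f"g{i}" for i in range(1700)]    # placeholder gene labels
blocks = make_blocks(genes, 5)
matches = block_matches(blocks)

n_exhaustive = math.comb(len(genes), 2)   # RF runs needed to test every pair alone
print(len(blocks), len(matches), n_exhaustive)
```

Under these assumptions the hunting stage needs 57,630 RF runs on 10 variables each, instead of 1,444,150 runs on single pairs; the pairwise search is then confined to the top ranked matches.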
11. Drawback with the oob error rate
Simulation experiment with block size = 6
The boxplots show that the oob error rate cannot distinguish between block
matches containing a weak marginal / strong bivariate association and block
matches with only noisy inputs
[Figure: boxplots of the oob error rate for block matches containing the XOR interaction versus block matches with only noisy inputs. Sample sizes (40,40): overlap = 0.31. Sample sizes (40,20): overlap = 0.42]
The curses of dimensionality and low sample size are coming up again
12. Data augmentation
To overcome this drawback, the data are artificially augmented and the oob
error rate of an RF run on the augmented data is then computed
Data perturbation is carried out according to a scheme with the following parameters
(details in Arevalillo and Navarro (2011), Fundamenta Informaticae, Special Issue on
Machine Learning in Bioinformatics)
r is the sample range of X
b is the number of bins into which the range is divided. It controls the amount of
perturbation
k is an augmentation parameter that gives the factor by which the dataset
must be amplified
The new oob error computed on the augmented merged dataset is actually a
perturbed error rate measure. We call it the perturbed oob
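The exact perturbation scheme is given in the cited paper and is not reproduced in these slides. The sketch below is therefore only one plausible reading, with two assumptions of its own: each value is shifted by uniform noise of at most half a bin width, r / (2b), and k perturbed copies are merged with the original data. The function name `augment` is hypothetical.

```python
import random

def augment(rows, labels, b=5, k=3, seed=0):
    """Merge the data with k noisy copies (assumed scheme, not the authors' exact one)."""
    rng = random.Random(seed)
    n_cols = len(rows[0])
    # Assumed noise half-width per input: half of one bin of the sample range r.
    half_widths = []
    for c in range(n_cols):
        column = [r[c] for r in rows]
        half_widths.append((max(column) - min(column)) / b / 2)
    out_rows, out_labels = list(rows), list(labels)
    for _ in range(k):
        for row, y in zip(rows, labels):
            noisy = tuple(v + rng.uniform(-h, h) for v, h in zip(row, half_widths))
            out_rows.append(noisy)
            out_labels.append(y)   # a perturbed case keeps the class of its source
    return out_rows, out_labels

rows = [(0.0, 10.0), (1.0, 20.0), (2.0, 30.0)]
labels = [0, 1, 0]
aug_rows, aug_labels = augment(rows, labels, b=5, k=3)
print(len(aug_rows))
```

A larger b means narrower bins and so a milder perturbation, which matches the role the slides give to b.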
13. The perturbed oob measure
[Figure: boxplots for block matches containing the XOR interaction versus block matches with only noisy inputs, comparing the oob error rate with the perturbed oob for k = 1, 3, 5, 7, 9]
Overlap between the XOR and noisy-input boxplots:
Sample sizes   oob error rate   perturbed oob: k=1   k=3    k=5    k=7    k=9
(40,40)        0.31             0.15                 0.07   0.05   0.05   0.03
(40,20)        0.42             0.35                 0.24   0.18   0.16   0.14
The perturbed oob measure overcomes the initial drawback
14. Summary of the algorithm
The details of the implementation of the algorithm can be found in
Arevalillo and Navarro (2011), Fundamenta Informaticae, Special Issue on
Machine Learning in Bioinformatics
Usually bsize = 6, 8, b = 5 and k = 3, 5, 7
are good settings
Strategies for the sequential step include: screeplots of
variable importance, VARSEL
(Díaz Uriarte, BMC Bioinformatics, 2006) and
oob error smoothing
(Genuer et al., INRIA, 2008)
15. Application to the colon cancer data
Gene expression levels for 40 tumor and 22 healthy tissue samples
were collected with an Affymetrix Hum6000 oligonucleotide array (Alon et al., PNAS
1999). The expression levels were arranged in a matrix with 62 rows (samples) and
2000 columns (genes), along with a column containing the clinical outcome variable Y
Y=1 for tumor samples and Y=0 for healthy samples
The data are publicly available in the
colonCA package of Bioconductor
www.bioconductor.org
16. Data pre-processing
Gene expression intensities were pre-processed with a log transformation and a
standardization across genes
The figure shows the potential outliers flagged by the RF outlier detector. Cases 18, 20,
52, 55 and 58 were previously identified as outliers in the specialized literature
(Chow et al., Physiol. Genomics 2001; Ambroise and McLachlan, PNAS 2002)
These outliers might be caused by
different sources of error while collecting
the data. We remove them from the
analysis and end up with a data set
containing 57 cases and 2000 predictors
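The log transformation and standardization step can be sketched as below. The slides do not say how "standardization across genes" was computed, so this assumes that each sample (row) is centered and scaled over its own genes, and that intensities are strictly positive. The function name and the toy matrix are illustrative.

```python
import math
import statistics

def preprocess(matrix):
    """Log-transform intensities, then standardize each sample across its genes."""
    result = []
    for sample in matrix:
        logged = [math.log(v) for v in sample]
        mean = statistics.fmean(logged)
        sd = statistics.stdev(logged)   # sample standard deviation over the genes
        result.append([(v - mean) / sd for v in logged])
    return result

# Toy matrix: 2 samples (rows) by 3 genes (columns) of raw intensities.
toy = [[1.0, 10.0, 100.0],
       [2.0, 4.0, 8.0]]
processed = preprocess(toy)
print(processed)
```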
17. A first selection. Sequential search
Simple inspection of the screeplot of RF
variable importance allows us to identify the
most relevant variables. A forward sequential
search strategy as in Genuer (2008) gives a
selection containing the most informative
genes for classifying the clinical outcome
[Table: list of genes selected after the sequential
search step.] It agrees closely with
previous selections (Ben-Dor et al.,
J. Comp. Biol. 2000)
18. Results
Control parameters for the hunting stage of the procedure were set to block
size = 5, k = 5 and b = 5. The RF controls ntree and mtry (the number R of candidate
inputs per split) were set to their default values
Findings for the three top ranked block matches (heat map plots of the oob for each
match and scatter plots for the selected gene to gene interactions)
Selected bivariate gene interactions:
(X86693, M80815)
(R60883, U04953)
(L12350, X86693)
19. Additional insights
Oob error rate with all the genes = 3.5%
Oob error rate with the first 300 top ranked genes as
predictors = 1.8%
Oob error rate with all the genes but the 300 top
ranked = 26.3%
In this case the sequential stage is carried out
manually by filtering the 300 top ranked genes
The hunting step of the RF bivariate interaction detector
procedure makes it possible to uncover interesting patterns in
the remaining 1700 genes
Interesting gene associations come up from the first
100 positions of the ranking of block matches
RF oob error for the best 10 gene to gene
interactions is 10.5%
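As a quick consistency check, the reported oob rates can be turned back into counts of misclassified cases. This assumes that every rate was computed on the 57 cases retained after outlier removal, which the slides do not state explicitly; the dictionary keys are just labels.

```python
N_CASES = 57  # assumed denominator: cases left after removing the five outliers

reported_percent = {
    "all genes": 3.5,
    "top 300 genes": 1.8,
    "all but top 300 genes": 26.3,
    "best 10 gene to gene interactions": 10.5,
}

# Nearest whole number of misclassified cases implied by each reported rate.
implied_errors = {name: round(rate / 100 * N_CASES)
                  for name, rate in reported_percent.items()}

# Rates recomputed from those counts, rounded to one decimal as in the slides.
recomputed = {name: round(100 * count / N_CASES, 1)
              for name, count in implied_errors.items()}

for name in reported_percent:
    print(name, implied_errors[name], recomputed[name])
```

Under that assumption the four rates correspond to 2, 1, 15 and 6 misclassified cases out of 57.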
20. Summary and conclusions
RF is a widely used algorithm for classification and variable selection in high
dimensional, small sample data. However, sequential search strategies based on
the oob error and the RF ranking of variable importance usually fail to uncover hidden
weak marginal / strong bivariate interactions in these data structures
This happens because of the curse of dimensionality and the small sample size;
both degrade the performance of the RF classifier. Data
augmentation and an exhaustive exploration of the feature space by blocks, which
uses RF as the search engine, protect against this phenomenon
A perturbed oob measure is obtained when RF is run on all the features
belonging to each pair of blocks in the augmented dataset
The ranking of perturbed oobs then limits the search from the set of all possible
bivariate interactions to the variables within the top ranked block matches
When applied to a real gene expression data set, the proposed bivariate interaction
detector algorithm uncovered WM/SB gene to gene interactions
associated with the phenotype
21. Future research
The method was proposed for binary classification. Its extension to multi-class
problems and the development of tricks and shortcuts that reduce the
computational cost are avenues for future research
The interaction detector algorithm uses RF as the search engine. The use of
other search engines based on classifiers such as LDA, QDA, SVM, … is also an issue for
future research. Recently, Arevalillo and Navarro (2011, BMC Bioinformatics)
proposed QDA as the search engine
The development of an R package that incorporates all these improvements is another goal
Finally, finding informative WM/SB genomic
interactions in SNP data is an open research issue
22. Thank you for your attention
Jorge M. Arevalillo: jmartin@ccia.uned.es
Hilario Navarro: hnavarro@ccia.uned.es
Department of Statistics and Operational Research
Universidad Nacional de Educación a Distancia
Paseo Senda del Rey nº 9. 28040 Madrid