To appear in Proc. of The 2003 International Conference on Machine Learning and Applications (ICMLA'03), Los Angeles, California, June 23-24, 2003.
Fast Decision Tree Learning Techniques
for Microarray Data Collections
Xiaoyong Li and Christoph F. Eick
Department of Computer Science
University of Houston, TX 77204-3010
e-mail: ceick@cs.uh.edu
Abstract

DNA microarrays allow monitoring of expression levels for thousands of genes simultaneously. The ability to successfully analyze the huge amounts of genomic data is of increasing importance for research in biology and medicine. The focus of this paper is the discussion of techniques and algorithms of a decision tree learning tool that has been devised taking into consideration the special features of microarray data sets: continuous-valued attributes and a small number of examples, each with a large number of genes. The paper introduces novel approaches to speed up leave-one-out cross validation through the reuse of results of previous computations, through attribute pruning, and through approximate computation techniques. Our approach employs special histogram-based data structures for continuous attributes, both for speedup and for the purpose of pruning. We present experimental results for three microarray data sets that suggest that these optimizations lead to speedups between 150% and 400%. We also present arguments that our attribute pruning techniques not only lead to better speed but also enhance the testing accuracy.

Key words and phrases: decision trees, concept learning for microarray data sets, leave-one-out cross validation, heuristics for split point selection, decision tree reuse.

1. Introduction

The advent of DNA microarray technology provides biologists with the ability to monitor expression levels for thousands of genes simultaneously. Applications of microarrays range from the study of gene expression in yeast under different environmental stress conditions to the comparison of gene expression profiles of tumors from cancer patients [1]. In addition to the enormous scientific potential of DNA microarrays to help in understanding gene regulation and interactions, microarrays have very important applications in pharmaceutical and clinical research. By comparing gene expression in normal and abnormal cells, microarrays may be used to identify which genes are involved in causing particular diseases. Currently, most approaches to the computational analysis of gene expression data focus on the attempt to learn about genes and tumor classes in an unsupervised way. Many research projects employ cluster analysis for both tumor samples and genes, mostly using hierarchical clustering methods [2,3] and partitioning methods, such as self-organizing maps [4], to identify groups of similar genes and groups of similar samples.

This paper, however, centers on the application of supervised learning techniques to microarray data collections. In particular, we will discuss the features of a decision tree learning tool for microarray data sets. We assume that each data set includes gene expression data of mRNA samples. Normally, the number of genes in these data sets is quite large (usually between 1,000 and 10,000). Each gene is characterized by numerical values that measure the degree to which the gene is turned on for the particular sample. The number of examples in the training set, on the other hand, is typically below one hundred. Associated with each sample is its type or class, which we are trying to predict. Moreover, in this paper we will restrict our discussion to binary classification problems.

Section 2 introduces decision tree learning techniques for microarray data collections. Section 3 discusses how to speed up leave-one-out cross validation. Section 4 presents experimental results that evaluate our techniques for three microarray data sets, and Section 5 summarizes our findings.
2. Decision Tree Learning Techniques for Microarray Data Collections

2.1 Decision Tree Algorithms Reviewed

The traditional decision tree learning algorithm (for more discussion of decision trees see [5]) builds a decision tree by repeatedly dividing the examples until each partition is pure by definition or meets other termination conditions (to be discussed later). If a node satisfies a termination condition, the node is marked with a class label that is the majority class of the samples associated with this node. In the case of microarray data sets, the splitting criterion for assigning examples to nodes is of the form "A < v" (where A is an attribute and v is a real number).

In the algorithm description in Fig. 1 below, we assume that:
1. D is the whole microarray training data set;
2. T is the decision tree to be built;
3. N is a node of the decision tree, which holds the indexes of samples;
4. R is the root node of the decision tree;
5. Q is a queue that contains nodes of the same type as N;
6. Si is a split point, a structure containing a gene index i, a real number v, and an information gain value. A split point provides a split criterion that partitions the tree node N into two nodes N1 and N2 based on whether gene i's value for each example in the node is or is not greater than value v;
7. Gi denotes the i-th gene.

Procedure buildTree(D):
1. Initialize root node R of tree T using data set D;
2. Initialize queue Q to contain root node R;
3. While Q is not empty do {
4.   De-queue the first node N in Q;
5.   If N does not satisfy the termination condition {
6.     For each gene Gi (i = 1, 2, …)
7.       { Evaluate splits on gene Gi based on information gain;
8.         Record the best split point Si for Gi and its information gain }
9.     Determine the split point Smax with the highest information gain;
10.    Use Smax to divide node N into N1 and N2, and attach nodes N1 and N2 to node N in the decision tree T;
11.    En-queue N1 and N2 to Q;
12.  }
13. }

Figure 1: Decision Tree Learning Algorithm

The result of applying the decision tree learning algorithm is a tree whose intermediate nodes associate split points with attributes, and whose leaf nodes represent decisions (classes in our case). Test conditions for a node are selected to maximize the information gain, relying on the following framework: We assume we have 2 classes, sometimes called '+' and '-' in the following, in our classification problem. A test S subdivides the examples D = (p1, p2) into 2 subsets D1 = (p11, p12) and D2 = (p21, p22). The quality of a test S is measured using Gain(D,S):

Let H(D = (p1, …, pm)) = Σ_{i=1..m} pi * log2(1/pi) (the entropy function). Then:

Gain(D,S) = H(D) − Σ_{i=1..2} (|Di| / |D|) * H(Di)

In the above, |D| denotes the number of elements in set D, and D = (p1, p2) with p1 + p2 = 1 indicates that of the |D| examples, p1*|D| examples belong to the first class and p2*|D| examples belong to the second class.
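To make these definitions concrete, the following Python sketch (an illustration for the reader, not the implementation of our tool) computes H and Gain directly from class-count vectors:

from math import log2

def entropy(counts):
    """H(D) for a class-count vector, e.g., (3, 5) for 3 '+' and 5 '-' examples."""
    total = sum(counts)
    return sum((c / total) * log2(total / c) for c in counts if c > 0)

def gain(parent_counts, child_counts):
    """Information gain of a test that splits the parent into the given subsets."""
    total = sum(parent_counts)
    weighted = sum((sum(child) / total) * entropy(child) for child in child_counts)
    return entropy(parent_counts) - weighted

# Example: 8 examples (3+, 5-) split into (3+, 1-) and (0+, 4-):
print(gain((3, 5), [(3, 1), (0, 4)]))  # approx. 0.55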
2.2 Attribute Histograms

Our research introduced a number of new data structures for the purpose of speeding up the decision tree learning algorithms. One of these data structures, called the attribute histogram, captures the class distribution of a sorted continuous attribute. Let us assume we have 7 examples whose values for an attribute A are 1.01, 1.07, 1.44, 2.20, 3.86, 4.3, and 5.71, and whose class distribution is (-, +, +, +, -, -, +); that is, the first example belongs to class 2, the second example to class 1, and so on. If we group all adjacent samples with the same class, we obtain the histogram for this attribute, which is (1-, 3+, 2-, 1+), for short (1,3,2,1), as depicted in Fig. 2; if the class distribution for the sorted attribute A had been (+,+,-,-,-,-,+), A's histogram would be (2,4,1). Efficient algorithms to compute attribute histograms are discussed in [6].
2.3 Searching for the Best Split Point

As mentioned earlier, the traditional decision tree algorithm has a preference for tests that reduce entropy. To find the best test for a node, we have to search through all the possible split points for each attribute. In order to compute the best split point for a numeric attribute, normally the (sorted) list of its values is scanned from the beginning, and the entropy is computed for each split point, which is placed halfway between every two adjacent attribute values. The entropy for each split point can actually be computed efficiently, as shown in Figure 2, because of our attribute histogram data structure. Based on the histogram (1-, 3+, 2-, 1+), we only consider three possible splits: (1- | 3+, 2-, 1+), (1-, 3+ | 2-, 1+), and (1-, 3+, 2- | 1+); the vertical bar represents the split point. Thus we reduce the number of split points from 6 down to 3 (Fayyad and Irani proved in [7] that splitting between adjacent samples that belong to the same class leads to sub-optimal information gain; in general, their paper advocates a multi-splitting algorithm for continuous attributes, whereas our approach relies on binary splits).

Figure 2: Example of an Attribute Histogram

A situation that we have not discussed until now involves histograms that contain identical attribute values belonging to different classes. To cope with this situation when considering a split point, we need to check the two neighboring examples' attribute values on both sides of the split point. If they are the same, we have to discard this split point even if its information gain is high.

After we have determined the best split point for all the attributes (genes in our case), the attribute with the highest information gain is selected and used to split the current node.
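The following sketch illustrates this search (again for exposition only): it enumerates only the block boundaries of a histogram as candidate split points and scores each one with the gain() function from the sketch in Section 2.1, maintaining running class counts so that each candidate is scored in constant time:

def best_split(histogram):
    """Return (gain, boundary) for the best block boundary of a histogram
    such as [(1, '-'), (3, '+'), (2, '-'), (1, '+')]; per [7], split points
    inside a block need not be considered. Uses gain() from Section 2.1."""
    total = {'+': 0, '-': 0}
    for count, label in histogram:
        total[label] += count
    left = {'+': 0, '-': 0}
    best = (-1.0, None)
    for boundary in range(len(histogram) - 1):
        count, label = histogram[boundary]
        left[label] += count  # running class counts of the left partition
        right = {c: total[c] - left[c] for c in total}
        g = gain((total['+'], total['-']),
                 [(left['+'], left['-']), (right['+'], right['-'])])
        best = max(best, (g, boundary))
    return best

print(best_split([(1, '-'), (3, '+'), (2, '-'), (1, '+')]))
# approx. (0.198, 0), i.e., the split (1- | 3+, 2-, 1+)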
3. Optimizations for Leave-one-out Cross-validation

In k-fold cross-validation, we divide the data into k disjoint subsets of (approximately) equal size, then train the classifier k times, each time leaving out one of the subsets from training and using only the omitted subset as the test set to compute the error rate. If k equals the sample size, this is called "leave-one-out" cross-validation. For large data sets, leave-one-out is very computationally demanding, since it has to construct more decision trees than the usual forms of cross validation (k=10 is a popular choice in the literature). But for data sets with few examples, such as microarray data sets, leave-one-out cross validation is quite popular and practical, since it gives the least biased model evaluation. Also, when doing leave-one-out cross validation, the computations for the different subsets tend to be very similar. Therefore, it seems attractive to speed up leave-one-out cross validation through the reuse of results of previous computations, which is the main topic of the next subsection.
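For reference, the unoptimized procedure we want to accelerate looks as follows (a sketch; build_tree and classify are stand-ins for any decision tree learner and its prediction function, not names from our tool):

def leave_one_out_error(examples, labels, build_tree, classify):
    """Train on all examples but one, test on the held-out example,
    and report the overall error rate."""
    errors = 0
    for i in range(len(examples)):
        training = examples[:i] + examples[i + 1:]
        training_labels = labels[:i] + labels[i + 1:]
        tree = build_tree(training, training_labels)
        if classify(tree, examples[i]) != labels[i]:
            errors += 1
    return errors / len(examples)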
3.1 Reuse of Sub-trees from Previous Runs

It is important to note that the whole data set and the training sets in leave-one-out differ in only one example. Therefore, in the likely event that the same root test is selected for two such data sets, we already know that at least one of the 2 sub-trees below the root node generated by the first run (for the whole data set) can be reused when constructing the other decision trees. Similar opportunities for reuse exist at other levels of decision trees. Taking advantage of this property, we compare the node to be split with the stored nodes from previous runs, and reuse sub-trees if a match occurs.

In order to get a speedup through sub-tree reuse, it is critical that matching nodes from previous runs can be found quickly. To facilitate the comparison of two nodes, we use bit strings to represent the sample list of each node. For example, if we have 10 samples in total, and 5 are associated with the current node, we use a bit string such as "0101001101" as the signature of this node, and use XOR string comparisons and signature hashing to quickly determine whether a reusable sub-tree exists.
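A minimal sketch of this signature mechanism (our reconstruction for exposition; the data layout of the actual tool may differ): the sample set of a node becomes an integer bit mask, and a dictionary keyed by the mask acts as the signature hash table. Since the learner is deterministic, the sub-tree grown below a node depends only on the node's sample set, so a signature match permits reuse:

def signature(sample_ids):
    """Encode the set of sample indexes attached to a node as a bit mask,
    e.g., samples {1, 3, 6, 7, 9} out of 10 correspond to the bit string
    "0101001101" from the text (sample no. 0 leftmost)."""
    mask = 0
    for i in sample_ids:
        mask |= 1 << i
    return mask

subtree_cache = {}  # signature -> sub-tree stored by a previous run

def lookup_or_build(sample_ids, build_subtree):
    """Reuse the stored sub-tree when a node with the same sample set
    (and hence the same signature) was already expanded earlier."""
    sig = signature(sample_ids)
    if sig not in subtree_cache:  # a mismatch corresponds to a nonzero XOR
        subtree_cache[sig] = build_subtree(sample_ids)
    return subtree_cache[sig]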
3.2 Using Histograms for Attribute Pruning

Assume that two histograms A = (2+, 2-) and B = (1+, 1-, 1+, 1-) are given. In this case, our job is to find the best split point among all possible splits of both histograms. Obviously, B can never give a better split than A, because (2+ | 2-) has entropy 0. This implies that performing information gain computations for attribute B is a waste of time. That prompts us to look for a way to distinguish between "good" and "bad" histograms, and to exclude attributes with bad histograms from consideration for speedup.

Mathematically, it might be quite complicated to come up with a formula that predicts the best attribute to be used for a particular node of the decision tree. However, we are considering an approximate method that may not always be correct but hopefully is correct most of the time. The idea is to use an index, which we call the "hist index". The hist index of a histogram S with m blocks is defined as:

Hist(S) = Σ_{j=1..m} Pj^2

where Pj is the relative frequency of block j in S. For example, if we have a histogram (1, 3, 4, 2), its hist index (computed here with raw block counts, which yields the same ranking of the attributes at a given node, since all of them share the same example count) would be: 1^2 + 3^2 + 4^2 + 2^2 = 30. A histogram with a high hist index is more likely to contain the best split point than a histogram with a low hist index. Intuitively, the fewer blocks a histogram has, the better the chance that it contains a good split point; mathematically, a^2 > a1^2 + a2^2 holds if a = a1 + a2 with a1, a2 > 0.

Our decision tree learning algorithm uses the hist index to prune attributes as follows. Prior to determining the best split point of an attribute, its hist index is computed and compared with the average hist index of all the previous histograms in the same round; only if its hist index value is larger than the previous average is the best split point for this attribute determined; otherwise, the attribute is excluded from consideration for test conditions of the particular node.
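The pruning rule can be sketched as follows (again an illustration, using raw block counts as in the worked example above):

def hist_index(histogram):
    """Hist(S) as the sum of squared block sizes, e.g., (1, 3, 4, 2) -> 30."""
    return sum(block * block for block in histogram)

def attributes_to_evaluate(histograms):
    """Keep an attribute only if its hist index exceeds the average hist
    index of the attributes processed before it in the same round."""
    total, seen = 0.0, 0
    selected = []
    for attribute, histogram in enumerate(histograms):
        h = hist_index(histogram)
        if seen == 0 or h > total / seen:
            selected.append(attribute)
        total += h
        seen += 1
    return selected

print(hist_index((1, 3, 4, 2)))                        # 30
print(attributes_to_evaluate([(2, 2), (1, 1, 1, 1)]))  # [0]: B = (1+,1-,1+,1-) is pruned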
3.3 Approximating Entropy Computations

This sub-section addresses the following question: do we really have to compute the log values, which require a lot of floating point computation, to find the smallest entropy values?

Let us assume we have a histogram (2-, 3+, 7-, 5+, 2-) and we need to determine the split point that minimizes entropy. Let us consider the difference between two splits, 1st: (2-, 3+ | 7-, 5+, 2-) and 2nd: (2-, 3+, 7- | 5+, 2-). Apparently, the 2nd is better than the 1st. Since we are dealing only with binary classification, we can assign a numeric value of +1 to one class and a value of -1 to the other class, and use the sum of the absolute differences in class memberships in the two resulting partitions to approximate entropy computations; the larger this result is, the lower the entropy is. In this case, for the first split the sum is |-2 + 3| + |-7 + 5 - 2| = 5, and for the second the sum is |-2 + 3 - 7| + |5 - 2| = 9. We call this method the absolute difference heuristic. We performed some experiments [6] to determine how often the same split point is picked by the information gain heuristic and the absolute difference heuristic. Our results indicate that in most cases (approximately between 91% and 100%, depending on data set characteristics) the same split point is picked by both methods.
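In code, the heuristic reduces to signed sums over the two partitions, with no logarithms involved (a sketch reproducing the example above):

def absolute_difference(signed_histogram, boundary):
    """Score a split of a signed histogram such as [-2, +3, -7, +5, -2]
    ('+' blocks positive, '-' blocks negative) at a block boundary: the
    sum of the absolute signed totals of the two partitions. A larger
    score indicates a lower entropy."""
    left = sum(signed_histogram[:boundary])
    right = sum(signed_histogram[boundary:])
    return abs(left) + abs(right)

h = [-2, +3, -7, +5, -2]
print(absolute_difference(h, 2))  # 1st split (2-, 3+ | 7-, 5+, 2-): 5
print(absolute_difference(h, 3))  # 2nd split (2-, 3+, 7- | 5+, 2-): 9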
4. Evaluation

In this section we present the results of experiments that evaluate our methods for 3 different microarray data sets.

4.1 Data Sets and Experimental Design

The first data set is a leukemia data collection that consists of 62 bone marrow and 10 peripheral blood samples from acute leukemia patients (obtained from Golub et al. [8]). The 72 samples fall into two types of acute leukemia: acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL). These samples come from both adults and children. The RNA samples were hybridized to Affymetrix high-density oligonucleotide microarrays that contain probes for p = 7,130 human genes.

The second data set, a colon tissue data set, contains the expression levels (red intensity/green intensity) of the 2,000 genes with the highest minimal intensity across 62 colon tissues. These gene expressions, in 40 tumor and 22 normal colon tissue samples, were analyzed with an Affymetrix oligonucleotide array containing over 6,500 human genes (Alon et al. [2]).

The third data set comes from a study of gene expression in breast cancer patients (van 't Veer et al. [3]). The data set contains data from 98 primary breast cancer patients: 34 from patients who developed distant metastases within 5 years, 44 from patients who continued to be disease-free after a period of at least 5 years, 18 from patients with BRCA1 germline mutations, and 2 from BRCA2 carriers. All patients were lymph node negative and under 55 years of age at diagnosis.

In the experiments, we did not use all genes, but rather selected a subset P with p elements of the genes. Decision trees were then learnt that operate on the selected subset of genes. As proposed in [9], we remove genes from the data sets based on the ratio of their between-groups to within-groups sum of squares. For a particular gene j, the ratio is defined as:

BSS(j) / WSS(j) = [Σ_i Σ_k I(yi = k) * (x̄kj − x̄.j)^2] / [Σ_i Σ_k I(yi = k) * (xij − x̄kj)^2]

where the sums run over all samples i and classes k, I is the indicator function, x̄.j denotes the average expression level of gene j across all samples, and x̄kj denotes the average expression level of gene j across the samples belonging to class k.

To give an explicit example, assume we have four samples and two genes per sample: the first gene's expression level values for the four samples are (1, 2, 3, 4) and the second's are (1, 3, 2, 4); the sample class memberships are (+, -, +, -) (listed in the order of samples no. 1, no. 2, no. 3, and no. 4). For gene 1, we have BSS/WSS = 0.25, and for gene 2, BSS/WSS = 4. If we have to remove one gene, gene 1 will be removed according to our rule, since it has the lower BSS/WSS value. The removal of gene 1 is reasonable because we can tell the class membership of the samples by looking at their gene 2 expression level values: if a sample's gene 2 expression level is greater than 2.5, the sample should belong to the negative class; otherwise, the sample belongs to the positive class. If we evaluate gene 1 instead, we will not be able to perform the classification in one single step as we have just done with gene 2.

After we calculate the BSS/WSS ratios for all genes in a data set, only the p genes with the largest ratios remain in the data sets that are used in the experiments. Experiments were conducted with different p values.
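The ratio can be computed directly from its definition; the following sketch (ours, for exposition) reproduces the two-gene example above:

def bss_wss(values, classes):
    """BSS(j)/WSS(j) for one gene as defined above ([9]);
    `classes` holds one class label per sample."""
    overall = sum(values) / len(values)
    class_means = {}
    for c in set(classes):
        members = [v for v, y in zip(values, classes) if y == c]
        class_means[c] = sum(members) / len(members)
    bss = sum((class_means[y] - overall) ** 2 for y in classes)
    wss = sum((v - class_means[y]) ** 2 for v, y in zip(values, classes))
    return bss / wss

classes = ['+', '-', '+', '-']
print(bss_wss([1, 2, 3, 4], classes))  # gene 1: 0.25
print(bss_wss([1, 3, 2, 4], classes))  # gene 2: 4.0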
In the experiments, we compared the popular C5.0/See5.0 decision tree tool (which was run with its default parameter settings) with two versions of our tool. The first version, called the microarray decision tree tool, does not use any optimizations but employs pre-pruning: it stops growing the tree when at least 90% of the examples belong to the majority class. The second version of our tool, called the optimized decision tree tool, uses the same pre-pruning and employs all the techniques that were discussed in Section 3.
4.2 Experimental Results

The first experiment evaluated the accuracy of the three decision tree learning tools. Tables 1-3 below display each algorithm's error rate on the three data sets, for three different p values used in gene selection. The first column of each table gives the p value that was used. The other columns give the total number of misclassifications and the error rate (in parentheses). Error rates were computed using leave-one-out cross validation.

Table 1: Leukemia data set test results (72 samples)

p      C5.0         Microarray      Optimized
                    Decision Tree   Decision Tree
1024   5 (6.9%)     5 (6.9%)        4 (5.6%)
900    4 (5.6%)     8 (11.1%)       5 (6.9%)
750    13 (18.1%)   11 (15.3%)      3 (4.2%)

Table 2: Colon tissue data set test results (62 samples)

p      C5.0         Microarray      Optimized
                    Decision Tree   Decision Tree
1600   12 (19.4%)   15 (24.2%)      16 (25.8%)
1200   12 (19.4%)   15 (24.2%)      16 (25.8%)
800    12 (19.4%)   14 (22.6%)      16 (25.8%)

Table 3: Breast cancer data set test results (78 samples)

p      C5.0         Microarray      Optimized
                    Decision Tree   Decision Tree
5000   38 (48.7%)   29 (37.2%)      35 (44.9%)
1600   39 (50.0%)   32 (41.0%)      30 (38.5%)
1200   39 (50.0%)   31 (39.7%)      29 (37.2%)

If we study the error rates for the three methods listed in the three tables carefully, it can be noticed that on average the error rates for the optimized decision tree are lower than those of the unoptimized one, which looks quite surprising, since the optimized decision tree tool uses a lot of approximate computations and pruning.

However, further analysis revealed that the use of attribute pruning (using the hist index we introduced in Section 3.2) provides an explanation for the better average accuracy of the optimized decision tree tool. Why would attribute pruning lead to more accurate predictions in some cases? The reason is that the entropy function does not take the class distribution along sorted attributes into consideration. For example, suppose we have two attribute histograms (3+, 3-, 6+) and (3+, 1-, 2+, 1-, 2+, 1-, 2+). For the first histogram the best split point is (3+ | 3-, 6+), but the second histogram has a similar split point (3+ | 1-, 2+, 1-, 2+, 1-, 2+) which is equivalent to (3+ | 3-, 6+) with respect to the information gain heuristic. Therefore, both split points have the same chance of being selected. But, intuitively, the second split point is much worse than the first one because of its large number of blocks, which requires more tests to separate the two classes properly.

The traditional information gain heuristic ignores such distributional aspects entirely, which causes a loss of accuracy in some circumstances. Hist index based pruning, as proposed in Section 3.2, improves on this situation by removing attributes that have a low hist index (like the second attribute in the above example) beforehand. Intuitively, continuous attributes with long histograms, representing "flip-flopping" class memberships, are not very attractive choices for test conditions, because more nodes/tests are necessary in a decision tree to predict classes correctly based on such an attribute. In summary, attribute pruning removes some of those "bad" attributes, which explains the higher average accuracy in the experiments.

In another experiment we compared the cpu time for leave-one-out cross validation for the three decision tree learning tools: C5.0, normal (microarray decision tree), and optimized (optimized decision tree). All these experiments were performed on an 850 MHz Intel Pentium processor with 128 MB of main memory. The cpu time displayed (in seconds) in Table 4 includes the time of the tree building and evaluation process (note: these experiments are identical to those previously listed in Tables 1 to 3). Our experimental results suggest that the decision tree tool designed for microarray data sets normally runs slightly faster than the C5.0 tool, while the speedup of the optimized microarray decision tree tool is quite significant and ranges from 150% to 400%.

Table 4: CPU time comparison of the three decision tree tools

Data set        p      CPU time (seconds)
                       C5.0    Normal   Optimized
Leukemia        1024   6.7     3.5      1.2
                900    5.6     3.1      1.1
                750    6.0     4.1      1.1
Colon tissue    1600   12.0    8.0      2.2
                1200   9.0     6.0      1.7
                800    5.9     3.8      1.1
Breast cancer   5000   74.5    75.3     15.9
                2000   30.4    30.2     6.4
                1500   22.4    20.4     4.8

5. Summary and Conclusion

We introduced decision tree learning algorithms for microarray data sets, and optimizations to speed up leave-one-out cross validation. Toward this goal, several strategies were employed: the introduction of the hist index to help prune attributes, approximate computations to measure entropy, and the reuse of sub-trees from previous runs. We claim that the first two ideas are new, whereas the third idea was also explored in Blockeel's paper [10], which centered on the reuse of split points. The performance of the microarray decision tree tool was compared with that of the commercially available decision tree tool C5.0/See5.0 using 3 microarray data sets. The experiments suggest that our tool runs between 150% and 400% faster than C5.0.
We also compared the trees that were generated in the experiments for the same data sets. We observed that the trees generated by the same tool are very similar. Trees generated by different tools also had a significant degree of similarity. Basically, all the trees that were generated for the three data sets are of small size, normally with fewer than 10 nodes. We also noticed that smaller trees seem to be correlated with lower error rates.

Also worth mentioning is that our experimental results revealed that the use of the hist index resulted in better accuracy in some cases. These results also suggest that for continuous attributes the traditional entropy-based information gain heuristic does not work very well, because it fails to reflect the class distribution characteristics of the samples with respect to continuous attributes. Therefore, better evaluation heuristics are needed for continuous attributes. This problem is the subject of our current research; in particular, we are currently investigating multi-modal heuristics that use both the hist index and entropy. Another problem investigated in our current research is the generalization of the techniques described in this paper to classification problems that involve more than two classes.

References

[1] A. Brazma and J. Vilo. Gene expression data analysis, FEBS Letters, 480:17-24, 2000.

[2] U. Alon, N. Barkai, D. A. Notterman, K. Gish, S. Ybarra, D. Mack, and A. J. Levine. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays, PNAS, 96:6745-6750, June 1999.

[3] L. J. van 't Veer, H. Dai, M. J. van de Vijver, Y. D. He, A. A. M. Hart, M. Mao, H. L. Peterse, K. van der Kooy, M. J. Marton, A. T. Witteveen, G. J. Schreiber, R. M. Kerkhoven, C. Roberts, P. S. Linsley, R. Bernards, and S. H. Friend. Gene expression profiling predicts clinical outcome of breast cancer, Nature, 415:530-536, 2002.

[4] P. Tamayo, D. Slonim, J. Mesirov, Q. Zhu, S. Kitareewan, E. Dmitrovsky, E. Lander, and T. Golub. Interpreting patterns of gene expression with self-organizing maps, PNAS, 96:2907-2912, 1999.

[5] J. R. Quinlan. C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo, 1993.

[6] X. Li. Concept learning techniques for microarray data collections, Master's Thesis, University of Houston, December 2002.

[7] U. Fayyad and K. Irani. Multi-interval discretization of continuous-valued attributes for classification learning, Proc. Int. Joint Conf. on Artificial Intelligence (IJCAI-93), pp. 1022-1029, 1993.

[8] T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, and E. S. Lander. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring, Science, 286:531-537, 1999.

[9] S. Dudoit, J. Fridlyand, and T. P. Speed. Comparison of discrimination methods for the classification of tumors using gene expression data, Journal of the American Statistical Association, Vol. 97, No. 457, pp. 77-87, 2002.

[10] H. Blockeel and J. Struyf. Efficient algorithms for decision tree cross-validation, Machine Learning: Proceedings of the Eighteenth International Conference, pp. 11-18, 2001.