1. Knowledge extraction and
visualisation using rule-based
machine learning
Dr. Jaume Bacardit
Interdisciplinary Computing and Complex Systems
(ICOS) research group
University of Nottingham
jaume.bacardit@nottingham.ac.uk
ICOS seminar. 11/10/2012
2. Preface
• I came to Nottingham in 2005 to work as a postdoc on a project applying
evolutionary rule learning to protein structure prediction (EPSRC
GR/T07534/01). In the project we managed to:
– Generate predictors that are competitive with the state-of-the-art
– Moreover, extract human-readable explanations providing new
knowledge
– Propose several improvements to the learning algorithms so they
could scale to large problems
• When I became a lecturer in 2008 I started several collaborations with
experimentalists analysing biological data of all kinds, always with the goal
of extracting knowledge
– Thanks to having sets of rules, it is relatively straightforward to
develop a generic methodology to extract knowledge from them, that
can be applied almost straight away to a variety of datasets
– Still, we are only at the tip of the iceberg, there are many ways in
which this analysis can be made more efficient/reliable/useful
4. A set of rules as a knowledge
representation
If (X<0.25 and Y>0.75) or (X>0.75 and Y<0.25) then …
If (X>0.75 and Y>0.75) then …
If (X<0.25 and Y<0.25) then …
Everything else …
[Figure: the rules partition the unit square (0 ≤ X, Y ≤ 1); each rule covers corner regions and the default rule covers the rest; the class symbols from the plot are not recoverable]
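The decision-list semantics of a rule set like the one above can be sketched in a few lines of Python: rules are tried in order, the first condition that holds assigns the class, and the final "everything else" default rule catches the rest. The class labels here are placeholders, since the figure's class symbols did not survive extraction.

```python
# Minimal sketch of evaluating a decision list over two attributes X, Y.
# Rules fire in order; the first match wins; a default rule closes the list.
# Class labels are illustrative placeholders, not from the original slide.

def predict(x, y):
    rules = [
        (lambda x, y: (x < 0.25 and y > 0.75) or (x > 0.75 and y < 0.25), "class A"),
        (lambda x, y: x > 0.75 and y > 0.75, "class B"),
        (lambda x, y: x < 0.25 and y < 0.25, "class C"),
    ]
    for condition, label in rules:
        if condition(x, y):
            return label          # first matching rule assigns the class
    return "default class"        # "everything else"

print(predict(0.1, 0.9))  # first rule fires -> class A
print(predict(0.5, 0.5))  # no rule fires -> default class
```

The order-dependence is what makes the representation compact: later rules only need to describe what earlier rules have not already claimed.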
6. The BioHEL rule learning system
• BioHEL [Bacardit et al., 09] is an evolutionary
learning system that applies the Iterative Rule
Learning (IRL) approach
• Designed explicitly to deal with noisy large-scale
datasets
• IRL was first used in EC by the SIA system
[Venturini, 93]
7. BioHEL’s learning paradigm
– IRL has been used for many years in the ML community
under the name separate-and-conquer
– A standard elitist Genetic Algorithm generates each rule
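The separate-and-conquer loop can be sketched as follows. The `learn_one_rule` function below is a toy stand-in for BioHEL's GA-based rule search: it just scans single-attribute ">threshold" rules and returns the pure rule covering the most examples, or None when no pure rule remains.

```python
# Separate-and-conquer (Iterative Rule Learning) skeleton: learn one rule,
# remove the training examples it covers, repeat until the set is exhausted.
# learn_one_rule is a trivial placeholder for the GA search BioHEL uses.

def learn_one_rule(examples):
    best = None  # (coverage, attribute index, threshold, class label)
    n_attrs = len(examples[0][0])
    for attr in range(n_attrs):
        for feats, _ in examples:
            thr = feats[attr]
            covered = [lab for f, lab in examples if f[attr] > thr]
            if covered and len(set(covered)) == 1:  # pure rule
                if best is None or len(covered) > best[0]:
                    best = (len(covered), attr, thr, covered[0])
    return best[1:] if best else None

def separate_and_conquer(examples):
    rules, remaining = [], list(examples)
    while remaining:
        rule = learn_one_rule(remaining)
        if rule is None:  # no pure rule left (a default rule would catch these)
            break
        attr, thr, label = rule
        rules.append(rule)
        remaining = [(f, lab) for f, lab in remaining if f[attr] <= thr]
    return rules

data = [((0.9,), "pos"), ((0.8,), "pos"), ((0.2,), "neg"), ((0.1,), "neg")]
print(separate_and_conquer(data))  # [(0, 0.2, 'pos'), (0, 0.1, 'neg')]
```

Because covered examples are removed after each iteration, the rules learned form exactly the kind of ordered decision list shown earlier.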
8. BioHEL’s characteristics 1/2
• Objective function that tries to balance the
generation of accurate and general rules
– Accurate: not making many mistakes
– General: covering as many examples as possible and covering as much
of the search space as possible
• Attribute list rule representation
– Automatically identifying the relevant attributes for a given rule and
discarding all the other ones
• Ensemble mechanisms
– Exploiting the GA's stochasticity to construct ensembles of rule sets, all
of them generated from the same data but with different random
seeds; ensembles are also used for ordinal classification
9. BioHEL’s characteristics 2/2
• The ILAS windowing scheme
– An efficiency enhancement method: the training set is divided into
strata, and different GA iterations use different strata for their fitness
evaluation, following a round-robin policy
• GPGPU-based fitness evaluation
– Obtaining ~50x speedups on large datasets on its own and ~700x
speedups in combination with ILAS
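The round-robin stratification idea behind ILAS can be sketched as below. This is a plain random split for illustration; the actual ILAS scheme stratifies so that class proportions are preserved in each stratum.

```python
# Sketch of ILAS-style windowing: split the training set into strata and
# have each GA iteration evaluate fitness on a different stratum, cycling
# round-robin. A plain shuffle-and-split stands in for class-balanced
# stratification here.

import random

def make_strata(training_set, num_strata, seed=0):
    data = list(training_set)
    random.Random(seed).shuffle(data)
    # deal examples into strata like cards
    return [data[i::num_strata] for i in range(num_strata)]

strata = make_strata(range(100), num_strata=4)
for iteration in range(8):
    window = strata[iteration % len(strata)]  # round-robin stratum choice
    # evaluate_population(population, window) would go here
    print(iteration, len(window))
```

Each iteration thus pays only 1/num_strata of the full evaluation cost, which is where the efficiency gain comes from.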
11. Functional Network Reconstruction for
seed germination
Microarray data obtained from seed tissue of
Arabidopsis thaliana
122 samples represented by the expression level
of almost 14000 genes
It had been experimentally determined whether
each of the seeds had germinated or not
Can we learn to predict germination/dormancy
from the microarray data?
Bassel et al., Plant Cell 23(9):3101-3116, 2011
12. Generating rule sets
BioHEL was able to predict the
outcome of the samples with
93.5% accuracy (10 × 10-fold cross-
validation)
Learning from a scrambled dataset
(labels randomly assigned to
samples) produced ~50% accuracy
If At1g27595>100.87 and At3g49000>68.13 and At2g40475>55.96 Predict
germination
If At4g34710>349.67 and At4g37760>150.75 and At1g30135>17.66 Predict
germination
If At3g03050>37.90 and At2g20630>96.01 and At3g02885>9.66 Predict germination
If At5g54910>45.03 and At4g18975>16.74 and At3g28910>52.76 and At1g48320>56.80
Predict germination
Everything else Predict dormancy
13. Identifying regulators
The rule building process is stochastic
It generates different rule sets each time the system is run
But if we run the system many times, we can see
patterns in the rule sets
Some genes appear far more frequently than the rest
Some associated with dormancy
Some associated with germination
We generated 10K rule sets for each outcome
Rules predicted one of the two outcomes
The default rule captured the other
15. Generating co-prediction networks of
interactions
• For each of the rules shown before to fire, all of
the conditions in it need to be true at the same time
– Each rule therefore expresses an interaction between
certain genes
• From a large number of rule sets we can identify
pairs of genes that co-occur with high frequency
and generate functional networks, with a
methodology we coined co-prediction
• The network shows a different topology when
compared to other types of network
construction methods (e.g. gene co-
expression)
• Different regions in the network contain the
germination and dormancy genes.
• Other visualisations providing the big picture
exist (Urbanowicz et al., 2012)
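The co-prediction counting step can be sketched as follows: across many rule sets, count how often each pair of genes appears together in the same rule, then keep the most frequent pairs as network edges. The gene names and the threshold of 2 are illustrative only.

```python
# Sketch of co-prediction network construction: pairs of genes that
# repeatedly co-occur inside the same rule, across many rule sets,
# become edges of a functional network. Data below is made up.

from collections import Counter
from itertools import combinations

rule_sets = [
    [{"At1g27595", "At3g49000"}, {"At4g34710", "At1g30135"}],
    [{"At1g27595", "At3g49000", "At2g40475"}],
]  # each rule reduced to the set of genes in its conditions

pair_counts = Counter()
for rule_set in rule_sets:
    for rule in rule_set:
        for a, b in combinations(sorted(rule), 2):
            pair_counts[(a, b)] += 1

# keep pairs co-occurring at least twice as network edges
edges = [pair for pair, n in pair_counts.items() if n >= 2]
print(edges)  # [('At1g27595', 'At3g49000')]
```

With the 10K rule sets mentioned earlier, the same counting loop scales directly; only the frequency threshold needs tuning.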
16. Experimental validation
We have experimentally verified this analysis
By ordering and planting knockouts for the highly ranked
genes
We have been able to identify four new regulators of
germination, with phenotypes different from the wild type
17. Same analysis. Different datasets
• We applied the same principle to three cancer
datasets from the literature (E. Glaab et al., PLoS
ONE (2012) 7(7):e39932)
• We checked PubMed to see if the genes linked
together in BioHEL’s rules appeared together in
the literature
• We used Point-Wise Mutual Information (PMI) to
quantify whether the genes appear linked together
in the literature more often than by chance
• Compared the PMI scores of the highly ranked
pairs of genes with random pairs
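Given literature-occurrence counts, PMI is the log of the observed co-occurrence probability over the product of the marginal probabilities; values well above zero indicate that two genes co-occur more often than chance. A minimal sketch, with made-up counts:

```python
# Sketch of the PMI score used to check literature co-occurrence.
# n_both: abstracts mentioning both genes; n_a, n_b: abstracts mentioning
# each gene; n_total: abstracts searched. All counts below are invented.

import math

def pmi(n_both, n_a, n_b, n_total):
    """log( P(a,b) / (P(a) * P(b)) ); > 0 means more co-occurrence than chance."""
    p_both = n_both / n_total
    p_a = n_a / n_total
    p_b = n_b / n_total
    return math.log(p_both / (p_a * p_b))

# genes mentioned in 50 and 40 of 10000 abstracts, together in 10
score = pmi(10, 50, 40, 10000)
print(round(score, 3))  # 3.912
```

Comparing such scores for highly ranked gene pairs against random pairs, as in the slide, turns the raw PMI values into a significance check.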
19. And to lots of other datasets!
• These datasets were generated using transcriptomics
technology
– Looks at RNA
• There are lots of other –omics (hundreds of them)
– Proteomics
– Lipidomics
– Metabolomics
– Next-generation sequencing
• Each –omics requires specific preprocessing, but the
learning and knowledge extraction process is exactly
the same
• Lots of datasets out there
20. Another example different from -omics
• Protein Structure Prediction aims to predict the 3D
structure of a protein based on its primary sequence
21. Prediction types of PSP
• There are several kinds of prediction problems within
the scope of PSP
– The main one, of course, is to predict the 3D coordinates
of all atoms of a protein (or at least the backbone) based
on its primary sequence
– There are many structural properties of individual residues
within a protein that can be predicted, for instance:
• The secondary structure state of the residue
• If a residue is buried in the core of the protein or exposed on the
surface
– Accurate predictions of these sub-problems can simplify
the general 3D PSP problem
22. Contact Map prediction
• Prediction, for each pair of residues in a
protein, whether these residues are in
contact (have a small distance between
them in the 3D structure) or not
• This problem can be represented by a
binary matrix. 1= contact, 0 = non
contact. Plotting this matrix reveals the
main traits in the protein structure
• Very sparse characteristic: Less than 2%
of contacts in native structures
• Training sets easily reach millions of
residue pairs
• Our method was one of the top
predictors in the last two editions of the
CASP competition (actually, the best
sequence-based predictor in last CASP)
helices sheets
(Bacardit et al., Bioinformatics (2012) 28 (19): 2441-2448)
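For training data, contact maps are derived from known 3D structures: two residues are "in contact" if the distance between chosen atoms falls under a cutoff (8 Å between C-beta atoms is a common convention). A minimal sketch with toy coordinates:

```python
# Sketch of building a binary contact map from 3D coordinates.
# Two residues are in contact if their representative atoms are closer
# than a cutoff (commonly 8 angstroms). Coordinates below are toy values.

import math

def contact_map(coords, cutoff=8.0):
    n = len(coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) < cutoff:
                cmap[i][j] = cmap[j][i] = 1  # symmetric matrix
    return cmap

coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (20.0, 0.0, 0.0)]
print(contact_map(coords))  # [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
```

The sparsity mentioned on the slide is visible even here: most entries of the matrix are zero.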
23. Steps for CM prediction
1. Prediction of several structural features:
– Secondary structure (using PSIPRED)
– Solvent accessibility, recursive convex hull and
coordination number (using BioHEL [Bacardit et al., 09])
2. Integration of all these predictions plus other
sources of information
3. Final CM prediction (using BioHEL)
24. Characterisation of the contact map
problem
Three types of input information were used:
1. Detailed information of three different windows of
residues, centered around
– the two target residues (2x)
– the middle point between them
2. Information about the connecting segment between the
two target residues
3. Global protein information
[Figure: diagram locating the three information sources along the protein chain]
25. Samples and ensembles
Training set
Training set contained 32 million
pairs of AA and 631 attributes
x50 (+60GB of disk space)
Samples
50 samples of 660K examples are
generated from the training set with a
ratio of 2:1 non-contacts/contacts
x25
Rule sets BioHEL is run 25 times for each sample
Prediction is done by a consensus of
1250 rule sets
Confidence of prediction is computed
based on the votes distribution in the
Consensus ensemble.
Whole training process took about 25K
CPU hours
Predictions
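The consensus step can be sketched as below: each rule set casts one vote, the majority class is the prediction, and the winning fraction of votes serves as a confidence score. The vote counts are illustrative.

```python
# Sketch of ensemble consensus: majority vote over the rule sets'
# individual predictions, with the winning vote fraction as confidence.
# Vote counts below are invented for illustration.

from collections import Counter

def consensus(votes):
    counts = Counter(votes)
    label, n = counts.most_common(1)[0]  # majority class and its count
    return label, n / len(votes)         # prediction and confidence

votes = ["contact"] * 900 + ["non-contact"] * 350
label, confidence = consensus(votes)
print(label, round(confidence, 3))  # contact 0.72
```

Thresholding on this confidence is what lets a sparse problem like contact prediction trade coverage for precision.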
26. Knowledge extraction in contact map
prediction
• The basic analysis is exactly the same
[Figures: frequency of individual attributes and of pairs of attributes across the rule sets]
27. But the analysis can be much more refined
• Because the representation has a very clear structure
and we have lots of domain knowledge
• For instance, there are several ways to aggregate the
ranks of individual attributes based on characteristics
of the representation/domain
[Figures: attribute ranks aggregated by source of information and by amino acid type]
29. The knowledge extraction can be
much more refined
• We just looked at what attributes appear in the
rules, but not yet at the shape of the predicates
• Sometimes biasing the representation helps
generate knowledge that is more useful to the
domain experts
– In the experiments with the seed data BioHEL was
constrained to generate only predicates “Att>X”
– But we always have to be careful when introducing
bias
30. Is the knowledge real?
• Data is far from perfect, lots of spurious peaks
• Probably many of the edges in the network are false
positives
• Strategies for filtering the knowledge
– Classic blind feature selection?
– Contrast the knowledge with databases of curated
information about the genes/interactions
• Some of these are quite pricey!
• Or we need strong text mining skills
– Careful balance is needed, we don’t want to filter true
positives
– Using expert knowledge to bias the learning process (Moore
& White, 2006)
31. Modelling the ML problem
• Datasets annotated as “case/controls” are easy
• What happens with N>2 labels?
– Tricky for decision lists, as there is an implicit overlap
between rules
• What happens with continuous annotations?
– There are similar examples in the literature using
model trees (Nepomuceno-Chamorro et al., 2010)
• What happens when the annotation is a time
course?
– Ordinal classification problem
32. References
• BioHEL
– Improving the scalability of rule-based evolutionary learning. J. Bacardit, E.K.
Burke and N. Krasnogor. Memetic Computing journal 1(1):55-67, 2009
– Speeding Up the Evaluation of Evolutionary Learning Systems using GPGPUs.
M. Franco, N. Krasnogor and J. Bacardit. In Proceedings of the 12th Annual
Conference on Genetic and Evolutionary Computation (GECCO2010), 1039-
1046, ACM Press, 2010
– Modelling the Initialisation Stage of the ALKR Representation for Discrete
Domains and GABIL Encoding. M. Franco, N. Krasnogor and J. Bacardit. In
Proceedings of the 13th Annual Conference on Genetic and Evolutionary
Computation - GECCO2011, pages 1291-1298. ACM, 2011
– Post-processing Operators for Decision Lists. M. Franco, N. Krasnogor and J.
Bacardit. In Proceedings of the 14th Annual Conference on Genetic and
Evolutionary Computation - GECCO2012, pages 847-854. ACM, 2012
– Analysing BioHEL using challenging boolean functions. M. Franco, N.
Krasnogor and J. Bacardit. Evolutionary Intelligence, 5(2):87-102, June 2012
33. References
• Knowledge extraction and visualisation
– Prediction of Recursive Convex Hull Class Assignments for Protein Residues. Stout, M.,
Bacardit, J., Hirst, J.D. and Krasnogor, N. Bioinformatics, 24(7):916-923, 2008
– Automated Alphabet Reduction for Protein Datasets. J. Bacardit, M. Stout, J.D. Hirst, A.
Valencia, R.E. Smith and N. Krasnogor. BMC Bioinformatics 10:6, 2009
– Functional Network Construction in Arabidopsis Using Rule-Based Machine Learning on
Large-Scale Data Sets. George W. Bassel, Enrico Glaab, Julietta Marquez, Michael J.
Holdsworth and Jaume Bacardit. The Plant Cell, 23(9):3101-3116, 2011
– E. Glaab, J. Bacardit, J.M. Garibaldi and N. Krasnogor. Using Rule-Based Machine
Learning for Candidate Disease Gene Prioritization and Sample Classification of Cancer
Gene Expression Data. PLoS ONE 7(7):e39932. 2012. doi:10.1371/journal.pone.0039932
– J. Bacardit, P. Widera, A. Márquez-Chamorro, F. Divina, J.S. Aguilar-Ruiz and Natalio
Krasnogor. Contact map prediction using a large-scale ensemble of rule sets and the
fusion of multiple predicted structural features. Bioinformatics (2012) 28 (19): 2441-
2448. doi:10.1093/bioinformatics/bts472
– HP Fainberg, K. Bodley, J. Bacardit, D. Li, F. Wessely, NP. Mongan, ME. Symonds, L. Clarke
and A. Mostyn, Reduced neonatal mortality in Meishan piglets: a role for hepatic fatty
acids? PLoS ONE, in press, 2012
34. References
• Related work
– Nepomuceno-Chamorro, I.A., Aguilar-Ruiz, J.S., and
Riquelme, J.C. (2010). Inferring gene regression networks
with model trees. BMC Bioinformatics 11: 517
– Moore, J. and White, B., Exploiting expert knowledge in
genetic programming for genome-wide genetic analysis,
Parallel Problem Solving from Nature-PPSN IX, pp. 969-
977, 2006
– R. J. Urbanowicz, A. Granizo-MacKenzie, and J. H. Moore.
Instance-linked attribute tracking and feedback for
Michigan-style supervised learning classifier systems. In
GECCO ’12: Proceedings of the 14th annual conference on
Genetic and evolutionary computation , pages 927–934.
ACM Press, 2012
35. Acknowledgements
• Natalio Krasnogor
• Michael Holdsworth
• George Bassel
• Enrico Glaab
• Pawel Widera
• Maria Franco
• Anna Swan
• Hernan Fainberg
• EPSRC GR/T07534/01 & EP/H016597/1