Poster_JOBIM_v4.2
1. Introduction
Study objectives
In this project, we develop a workflow
based on Knowledge Discovery and Data
Mining methodologies to propose
advanced biomarker discovery
solutions.
We propose to use machine
learning algorithms, such as support
vector machines and random
forests, to analyze metabolomic
datasets in order to identify
predictive biomarkers of metabolic
syndrome.
These methods will be compared,
and the relevant features obtained by
the supervised approaches will be
visualized graphically. Using
formal concept analysis, a concept
lattice will then be constructed and
association rules will be extracted
between emerging features.
Problem statement
- Metabolomics: generation of
complex and massive data that are
noisy and variable,
redundant (correlated),
heterogeneous and scalable,
with a high number of variables compared
to the number of samples.
Knowledge Discovery based on Formal
Concept Analysis for biomarker
identification from metabolomic data
Dhouha Grissa1,2, Jérémie Bourseau2, Blandine Comte1, Amedeo Napoli2, Estelle Pujos-Guillot1,3
1 INRA, UMR1019, UNH-MAPPING, F-63000 Clermont-Ferrand, France
2 LORIA, B.P. 239, F-54506 Vandoeuvre-lès-Nancy, France
3 INRA, UMR1019, Plateforme d’Exploration du Métabolisme, F-63000 Clermont-Ferrand, France
Conclusion
Methods
References
- Data cleaning: to remove noise and
outliers
using signal filtering methods and PCA
respectively.
- Data transformations: to remove
systematic analytical variation
using zero-mean normalization and UV
scaling.
- Features ranking: for each group
obtained by MST-kNN technique, a
ranked list of ions is produced using
classification algorithms:
1. Support Vector Machines (SVM)
[Vapnik et Chervonenkis, 1964]:
Supervised learning models that use kernel-based
decision functions to separate data into discrete
classes. In this study, we are interested in the
weight assigned to each variable.
2. Random Forests (RF) [B. Leo, 2001]:
RF are a combination of tree predictors such that
each tree depends on the values of a random
vector sampled independently and with the
same distribution for all trees in the forest. RF
can be used to rank the importance of variables
in a regression or classification problem in a
natural way.
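The zero-mean normalization and UV (unit variance) scaling mentioned under data transformations can be sketched as follows. This is a minimal illustration using only the Python standard library, not the poster's actual implementation; the function name `uv_scale` and the toy intensity matrix are assumptions made for the example.

```python
from statistics import mean, stdev

def uv_scale(matrix):
    """Mean-center each column and divide it by its sample standard deviation.

    `matrix` is a list of rows (samples); each column is one ion.
    Constant columns are mapped to zeros to avoid dividing by zero.
    """
    scaled_columns = []
    for col in zip(*matrix):
        mu = mean(col)
        sd = stdev(col)  # sample standard deviation (n - 1 denominator)
        if sd == 0:
            scaled_columns.append([0.0] * len(col))
        else:
            scaled_columns.append([(x - mu) / sd for x in col])
    # Transpose back to rows = samples, columns = ions.
    return [list(row) for row in zip(*scaled_columns)]

# Hypothetical ion intensities: 3 samples x 3 ions.
intensities = [
    [10.0, 200.0, 5.0],
    [12.0, 180.0, 5.0],
    [14.0, 220.0, 5.0],
]
scaled = uv_scale(intensities)
```

After scaling, every non-constant ion has mean 0 and standard deviation 1, so high-intensity ions no longer dominate distance-based or weight-based analyses.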
To extract knowledge from
biological datasets and help
experts find relevant
information, we propose here an
approach that we applied to
metabolomic data.
A combination of several
techniques was found to be
essential to discover knowledge
from complex metabolomic data,
from data pre-processing
through to the visualization and
validation of the extracted
features, which are
candidate biomarkers of
metabolic syndrome.
We are currently applying this
workflow to a test dataset, well
known to the biology experts,
and preliminary results are
promising. The most relevant
metabolites were first identified
by MST-kNN in the first two
groups, and then
highlighted by the supervised
approaches, Random Forest and
SVM, in which the weight of each
feature is considered. A new
matrix of reduced dimension,
containing only the best-ranked
metabolites, is then built so that
FCA can be applied to discover and
visualize relationships among
biomarker candidates, in addition
to extracting association
rules between emerging patterns.
In the future, we will follow the
same process to identify
predictive biomarkers
of metabolic syndrome / type 2
diabetes.
[Ganter & Wille, 1999]: B. Ganter and R. Wille.
Springer, 1999.
[Arefin et al., 2014]: A. S. Arefin, R. Vimieiro, C.
Riveros, H. Craig, P. Moscato. DOI:
10.1371/journal.pone.0111445, 2014.
[B. Leo, 2001]: L. Breiman. Machine Learning 45 (1):
5–32, 2001. doi:10.1023/A:1010933404324.
[Vapnik et Chervonenkis, 1964]: V. Vapnik and A.
Chervonenkis. Automation and Remote Control,
25, 1964.
[Agrawal et al., 1993]: R. Agrawal, T. Imieliński
and A. Swami. Proceedings of the 1993 ACM
SIGMOD int. conf. on Management of data -
SIGMOD '93. p. 207, 1993.
1: Diet-health Interaction Along life – Predictive
biomarkers of life trAnSitiON outcome linked to
retirement.
Contact: dhouha.grissa@clermont.inra.fr
Ions = 1195 variables
- Variables/Ions clustering:
Using the MST-kNN algorithm [Arefin et al.,
2014].
MST-kNN is a graph-based partitioning
algorithm that can use different distance
measures: Euclidean, Manhattan,
Jensen-Shannon divergence (JSD), etc.
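To illustrate the graph-based idea behind MST-kNN, here is a simplified single-pass sketch. It builds a minimum spanning tree (MST), keeps an MST edge only when at least one endpoint has the other among its k nearest neighbours, and reports the remaining connected components as clusters. The published algorithm also selects k automatically and recurses on the resulting components; both steps are omitted here. The edge-retention rule as written, the fixed `k`, the Euclidean distance and the toy points are assumptions for illustration only.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mst_edges(dist):
    """Prim's algorithm on a dense distance matrix; returns (i, j) edges."""
    n = len(dist)
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        # Cheapest edge leaving the current tree.
        i, j = min(
            ((u, v) for u in in_tree for v in range(n) if v not in in_tree),
            key=lambda e: dist[e[0]][e[1]],
        )
        edges.append((i, j))
        in_tree.add(j)
    return edges

def knn_sets(dist, k):
    """For each point, the set of indices of its k nearest neighbours."""
    n = len(dist)
    return [
        set(sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])[:k])
        for i in range(n)
    ]

def mst_knn_clusters(points, k):
    """Single-pass simplification: intersect the MST with the kNN graph."""
    n = len(points)
    dist = [[euclidean(p, q) for q in points] for p in points]
    neighbours = knn_sets(dist, k)
    kept = [
        (i, j) for i, j in mst_edges(dist)
        if j in neighbours[i] or i in neighbours[j]
    ]
    # Union-find to collect the connected components of the kept edges.
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in kept:
        parent[find(i)] = find(j)
    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())

# Hypothetical ion profiles forming two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
clusters = mst_knn_clusters(points, k=2)
```

In this toy case the long MST edge bridging the two groups is discarded because neither endpoint lists the other among its two nearest neighbours, which leaves two clusters of three points each.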
Step 1.1: Pre-processing of data
Step 1.2: Reduction and feature selection
Step 2.1: Unsupervised approach
Step 2.2: Supervised approach
Step 3: Visualization
- Formal Concept Analysis (FCA)
[Ganter & Wille, 1999]:
Extraction of Relationships among
Data with FCA.
Data consists of a matrix containing
the most relevant features, deduced
from the previous step.
- Association rules between emerging
features [Agrawal et al., 1993]:
...
...
- Data reduction:
Correlated data, filtering.
- Feature selection:
Support Vector Machine-Recursive Feature
Elimination (SVM-RFE),
Random Forest, t-test.
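Of the feature selection options listed above, the t-test is the simplest to sketch without external libraries. The example below ranks ions by the absolute value of a two-sample t statistic between two classes. The poster does not say which t-test variant is used; Welch's unequal-variance form, the function names and the toy data are assumptions made for this sketch.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's two-sample t statistic (unequal variances)."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0

def rank_features(X, y):
    """Return feature indices sorted from most to least discriminative.

    X: list of samples (rows), each a list of ion intensities.
    y: list of binary class labels (0 or 1), one per sample.
    """
    scores = []
    for j in range(len(X[0])):
        group0 = [row[j] for row, label in zip(X, y) if label == 0]
        group1 = [row[j] for row, label in zip(X, y) if label == 1]
        scores.append((abs(welch_t(group0, group1)), j))
    return [j for _, j in sorted(scores, reverse=True)]

# Hypothetical data: 6 samples x 3 ions; ion 1 separates the classes best.
X = [
    [1.0, 5.0, 3.0],
    [1.2, 5.1, 2.0],
    [0.9, 4.9, 4.0],
    [1.1, 9.0, 3.0],
    [1.3, 9.2, 2.0],
    [1.0, 8.8, 4.0],
]
y = [0, 0, 0, 1, 1, 1]
ranking = rank_features(X, y)
```

A univariate ranking like this ignores correlations between ions, which is one reason to combine it with multivariate methods such as SVM-RFE and Random Forest.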
Figure 4: Basic Working of Random Forests
$I = \{i_1, i_2, \dots, i_m\}$: a set of items;
Transaction database $T$: a set of transactions $T = \{t_1, t_2, \dots, t_n\}$
An association rule is an implication of the form $X \Rightarrow Y$, where $X, Y \subseteq I$ and $X \cap Y = \emptyset$
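The usual way to assess a rule of this form is through its support and confidence over the transaction database. The sketch below computes both for a small example; the item names and transactions are hypothetical and only illustrate the definitions.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y, transactions):
    """Confidence of the rule X => Y: support(X union Y) / support(X)."""
    x, y = set(x), set(y)
    if x & y:
        raise ValueError("X and Y must be disjoint")
    support_x = support(x, transactions)
    return support(x | y, transactions) / support_x if support_x else 0.0

# Hypothetical transactions: each set lists the features present in a sample.
transactions = [
    {"ion_a", "ion_b", "ion_c"},
    {"ion_a", "ion_b"},
    {"ion_a", "ion_c"},
    {"ion_b", "ion_c"},
]
rule_support = support({"ion_a", "ion_b"}, transactions)
rule_confidence = confidence({"ion_a"}, {"ion_b"}, transactions)
```

Here the rule ion_a => ion_b holds in 2 of the 4 transactions (support 0.5) and in 2 of the 3 transactions containing ion_a (confidence 2/3).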
Figure 3: Basic Working of Support Vector Machines
Figure 5: Example of Concept Lattice
http://www.thebookmyproject.com/wp-content/uploads/Intrusion-Detection-Technique-by-using-K-means-Fuzzy-Neural-Network-and-SVM-classifiers.jpg
Formal Context Specification and Extraction
Construction of
concept Lattices
$K = (\mathit{Ind}, \mathit{Ion}, I)$
$I \subseteq \mathit{Ind} \times \mathit{Ion}$
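A formal context relates individuals (Ind) to the ions (Ion) they exhibit, and each formal concept is a pair (extent, intent) in which the extent is exactly the set of individuals sharing all ions of the intent, and vice versa. The brute-force sketch below enumerates the concepts of a tiny binary context. The context itself is hypothetical, and how continuous ion intensities would be binarized is not specified in the poster, so a ready-made binary relation is assumed.

```python
from itertools import combinations

def extent(attrs, context):
    """Individuals that have every ion in `attrs`."""
    return frozenset(obj for obj, has in context.items() if attrs <= has)

def intent(objs, context, all_attrs):
    """Ions shared by every individual in `objs`."""
    shared = set(all_attrs)
    for obj in objs:
        shared &= context[obj]
    return frozenset(shared)

def formal_concepts(context):
    """Enumerate all (extent, intent) pairs by closing every attribute subset.

    Exponential in the number of ions, so only suitable for small contexts.
    """
    all_attrs = frozenset().union(*context.values())
    concepts = set()
    for size in range(len(all_attrs) + 1):
        for attrs in combinations(sorted(all_attrs), size):
            objs = extent(frozenset(attrs), context)
            concepts.add((objs, intent(objs, context, all_attrs)))
    return concepts

# Hypothetical binary context: individual -> set of ions marked as present.
context = {
    "ind1": {"ion_a", "ion_b"},
    "ind2": {"ion_b", "ion_c"},
    "ind3": {"ion_a", "ion_b", "ion_c"},
}
concepts = formal_concepts(context)
```

Ordering these concepts by inclusion of their extents gives the concept lattice. For realistic numbers of ions, a dedicated algorithm or FCA tool would be needed instead of subset enumeration.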
Figure 2: Process of Knowledge Extraction from
metabolomics raw data
Samples
Ions = variables
Ion intensities
Figure 1: Metabolomics is a powerful phenotyping tool in
nutrition research to better understand the biological
mechanisms involved in the pathophysiological processes
and identify biomarkers of metabolic deviations.
Need ways to extract relevant
information and ignore random
variation (noise)