Extracting biclusters of similar values with Triadic Concept Analysis

(French title: Extraction de biclusters de valeurs similaires à l'aide de l'analyse de concepts triadiques)
M. Kaytoue, S. O. Kuznetsov,
J. Macko, W. Meira Jr. and A. Napoli
Bordeaux, 31 January - 3 February 2012
Extraction et Gestion des Connaissances - EGC 2012

Context
Knowledge Discovery in Databases

Biclustering numerical data
Numerical data and bicluster
Given a numerical dataset (G, M, W, I)
–object/attribute data-table–
G a set of objects (lines)
M a set of attributes (columns)
W a set of values
I ⊆ G × M × W a relation s.t. (g, m, w) ∈ I, written m(g) = w,
means that object g takes the value w for attribute m
–simply represents data-cells–
a bicluster is a pair (A, B) with A ⊆ G and B ⊆ M.
–a rectangle in the data-table–

Example
Given a dataset (G, M, W, I) with
G = {g1, g2, g3, g4}
M = {m1, m2, m3, m4, m5}
W = {0, 1, 2, 6, 7, 8, 9}
and e.g. m2(g4) = 9
the bicluster ({g2, g3, g4}, {m3, m4}) can be viewed as the rectangle
formed by rows g2 to g4 and columns m3 to m4 (shaded gray on the slide)
     m1  m2  m3  m4  m5
g1    1   2   2   1   6
g2    2   1   1   0   6
g3    2   2   1   7   6
g4    8   9   2   6   7
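The example table can be written down directly in code. The following is a minimal Python sketch, assuming a plain dictionary representation; the names `ATTRS`, `ROWS`, `value` and `bicluster_values` are illustrative and do not come from the slides.

```python
# Example table from the slide: rows are objects g1..g4, columns are attributes m1..m5.
ATTRS = ["m1", "m2", "m3", "m4", "m5"]
ROWS = {
    "g1": [1, 2, 2, 1, 6],
    "g2": [2, 1, 1, 0, 6],
    "g3": [2, 2, 1, 7, 6],
    "g4": [8, 9, 2, 6, 7],
}


def value(g, m):
    """Return m(g), the value that object g takes for attribute m."""
    return ROWS[g][ATTRS.index(m)]


def bicluster_values(A, B):
    """Return the sub-table (rectangle) selected by the bicluster (A, B)."""
    return {g: {m: value(g, m) for m in B} for g in A}


# W is the set of values that actually occur in the table.
W = sorted({w for row in ROWS.values() for w in row})
rect = bicluster_values(["g2", "g3", "g4"], ["m3", "m4"])
print(W)
print(rect)
```

Any pair (A, B) of object and attribute subsets selects such a rectangle; at this point nothing yet requires the selected values to be similar.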

But... a bicluster should reflect
a local phenomenon in the data: “rectangles of values”
connectedness of values: e.g. similar values
overlapping: objects/attributes may belong to several patterns
a partial order, e.g. for algorithmic issues
maximality of rectangles w.r.t. connectedness and ordering
Several types of biclusters

Several applications
Collaborative filtering and recommender systems
Finding web communities
Discovery of association rules in databases
Gene expression analysis, ...
Several algorithms
Iterative Row and Column Clustering Combination
Divide and Conquer / Distribution Parameter Identification
Greedy Iterative Search / Exhaustive Bicluster Enumeration
A difficult problem generally relying on heuristics
S. C. Madeira and A. L. Oliveira
Biclustering Algorithms for Biological Data Analysis: a survey.
In IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2004.

Introducing similarity
A simple similarity relation
w1 ≃θ w2 ⇐⇒ |w1 − w2| ≤ θ with θ ∈ R, w1, w2 ∈ W
Considered type of biclusters
A bicluster (A, B) is a bicluster of similar values if
mi(gj) ≃θ mk(gl), ∀gj, gl ∈ A, ∀mi, mk ∈ B
     m1  m2  m3  m4  m5
g1    1   2   2   1   6
g2    2   1   1   0   6
g3    2   2   1   7   6
g4    8   9   2   6   7
(with θ = 2)
and maximal if no object/attribute can be added
J. Besson, C. Robardet, L. De Raedt, J.-F. Boulicaut
Mining Bi-sets in Numerical Data.
In KDID 2006: 11-23.
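The definition above can be checked mechanically on the example table. This is a minimal Python sketch, assuming the similarity relation |w1 − w2| ≤ θ; the helper names are illustrative. It relies on the observation that all values of a rectangle are pairwise similar exactly when the largest and smallest value differ by at most θ.

```python
from itertools import product

ATTRS = ["m1", "m2", "m3", "m4", "m5"]
ROWS = {
    "g1": [1, 2, 2, 1, 6],
    "g2": [2, 1, 1, 0, 6],
    "g3": [2, 2, 1, 7, 6],
    "g4": [8, 9, 2, 6, 7],
}


def cells(A, B):
    """Values of all cells in the rectangle A x B (A and B must be non-empty)."""
    return [ROWS[g][ATTRS.index(m)] for g, m in product(A, B)]


def is_similar(A, B, theta):
    """True if every pair of values in the rectangle differs by at most theta."""
    vals = cells(A, B)
    return max(vals) - min(vals) <= theta


def is_maximal(A, B, theta):
    """True if (A, B) has similar values and no object or attribute can be added."""
    if not is_similar(A, B, theta):
        return False
    for g in ROWS:
        if g not in A and is_similar(list(A) + [g], B, theta):
            return False
    for m in ATTRS:
        if m not in B and is_similar(A, list(B) + [m], theta):
            return False
    return True


print(is_maximal(["g1", "g2", "g3"], ["m1", "m2", "m3"], theta=2))
print(is_maximal(["g1", "g2"], ["m1", "m2"], theta=2))
```

With θ = 2, the first rectangle cannot be extended, while the second one can still absorb attribute m3 and is therefore not maximal.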

Formal Concept Analysis (Ganter & Wille, 1999)
From a formal context to a concept lattice...
m1 m2 m3
g1 × ×
g2 × ×
g3 × ×
g4 × ×
g5 × × ×
Formal concepts = maximal rectangles
... with interesting properties (and existing algorithms!)
Maximality of concepts as rectangles
Overlapping of concepts
Specialization/generalisation hierarchy
This is exactly what we need for biclustering

Contribution
FCA: an interesting framework for biclustering
Use FCA for a complete, correct and non-redundant extraction
of biclusters of similar values with lossless discretization
with no fixed similarity parameter (useful for top-k pattern
discovery)
with a given similarity parameter (as in the literature)
Design an algorithm that
is more efficient than its competitors
can be easily distributed
can handle several constraints (e.g. size) on the fly
A better understanding of closed numerical pattern mining

Outline
1 Formal Concept Analysis (FCA)
2 A first FCA-based biclustering method
3 Algorithm TriMax
4 Experiments
5 Conclusion and perspectives

Formal Concept Analysis (FCA)
In a nutshell...
FCA
A data analysis theory rooted in order and lattice theory that
characterizes formal concepts (also known as closed itemsets)
A concept in a formal context
Formal context (G, M, I): objects, attributes, incidence relation
Two derivation operators are used to define formal concepts
A concept is a maximal rectangle of ×, modulo column and line
permutations
m1 m2 m3
g1 × ×
g2 × ×
g3 × ×
g4 × ×
g5 × × ×
({g3, g4, g5}, {m2, m3}) is a formal concept
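The two derivation operators can be sketched in a few lines of Python. The column positions of the crosses are not recoverable from this transcript: the stated concept implies that g3 and g4 have m2 and m3 and that g1 and g2 each have m1 plus one other attribute, but which one is an assumption made here purely for illustration. The function names are also illustrative.

```python
ALL_ATTRS = {"m1", "m2", "m3"}

# Formal context as object -> set of attributes.
# Rows g1 and g2 are ASSUMED (their second attribute is a guess);
# rows g3, g4, g5 follow from the concept stated on the slide.
CONTEXT = {
    "g1": {"m1", "m2"},  # assumption
    "g2": {"m1", "m3"},  # assumption
    "g3": {"m2", "m3"},
    "g4": {"m2", "m3"},
    "g5": {"m1", "m2", "m3"},
}


def intent(A):
    """Derivation A': the attributes shared by every object in A."""
    shared = set(ALL_ATTRS)
    for g in A:
        shared &= CONTEXT[g]
    return shared


def extent(B):
    """Derivation B': the objects that have every attribute in B."""
    return {g for g, attrs in CONTEXT.items() if set(B) <= attrs}


def is_concept(A, B):
    """(A, B) is a formal concept iff A' = B and B' = A."""
    return intent(A) == set(B) and extent(B) == set(A)


print(is_concept({"g3", "g4", "g5"}, {"m2", "m3"}))
print(is_concept({"g3", "g4"}, {"m2", "m3"}))
```

The second pair is a rectangle of crosses but not a concept, because object g5 can still be added: this is the maximality that makes concepts a natural fit for biclustering.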

Formal Concept Analysis (FCA)
Triadic Concept Analysis (Lehmann & Wille, 1995)
An extension of FCA to ternary relations
An object has an attribute under a given condition
Triadic context (G, M, B, Y): objects, attributes, conditions,
incidence relation
Several derivation operators characterize “triadic
concepts” as maximal cubes of ×
Condition b1:
m1 m2 m3
g1 ×
g2 × ×
g3 × ×
g4 × ×
g5 × ×
Condition b2:
m1 m2 m3
g1 × × ×
g2 × ×
g3 × × ×
g4 × ×
g5 × ×
Condition b3:
m1 m2 m3
g1 × ×
g2 ×
g3 × × ×
g4 × ×
g5 × × ×
({g3, g4, g5}, {m2, m3}, {b1, b2, b3}) is a triadic concept


A first FCA-based biclustering method
Basic idea
Principle
Start from a numerical dataset
Build a triadic context with the same objects, the same attributes, and
a losslessly discretized “numerical space” dimension
Extract triadic concepts
We show interesting links between biclusters of similar
values and triadic concepts

Discretization method
Interordinal scaling (an existing discretization scale)
Let (G, M, W, I) be a numerical dataset (with W the set of
data-values).
Now consider the set
T = {[min(W), w], ∀w ∈ W} ∪ {[w, max(W)], ∀w ∈ W}.
Known fact: T and all its intersections characterize any interval
of values on W.
Example
With W = {0, 1, 2, 6, 7, 8, 9}, one has
T = {[0, 0], [0, 1], [0, 2], [0, 6], ..., [0, 9], [1, 9], [2, 9], ..., [9, 9]}
and for example [0, 8] ∩ [2, 9] = [2, 8]
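The scale T is easy to generate. Here is a minimal Python sketch, assuming intervals are represented as (low, high) tuples; the function names are illustrative.

```python
W = [0, 1, 2, 6, 7, 8, 9]


def interordinal_scale(values):
    """Return T: all intervals [min(W), w] and [w, max(W)] for w in W."""
    lo, hi = min(values), max(values)
    return sorted({(lo, w) for w in values} | {(w, hi) for w in values})


def intersect(i1, i2):
    """Intersection of two closed intervals, or None if they are disjoint."""
    lo, hi = max(i1[0], i2[0]), min(i1[1], i2[1])
    return (lo, hi) if lo <= hi else None


T = interordinal_scale(W)
print(T)
print(intersect((0, 8), (2, 9)))
```

Since only values of W are used as bounds, T contains [0, 6] but not [0, 3], and any interval [a, b] with a, b ∈ W is obtained as [min(W), b] ∩ [a, max(W)].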

Building a triadic context
Transformation procedure
From a numerical dataset (G, M, W, I), build a triadic context
(G, M, T, Y) such that (g, m, t) ∈ Y ⇐⇒ m(g) ∈ t
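The transformation can be sketched in Python on the example table used earlier in the talk. This is an illustration only, assuming (low, high) tuples for intervals; `build_triadic_context` and `modus` are hypothetical helper names.

```python
ATTRS = ["m1", "m2", "m3", "m4", "m5"]
ROWS = {
    "g1": [1, 2, 2, 1, 6],
    "g2": [2, 1, 1, 0, 6],
    "g3": [2, 2, 1, 7, 6],
    "g4": [8, 9, 2, 6, 7],
}


def interordinal_scale(values):
    """All intervals [min, w] and [w, max] for w in values."""
    lo, hi = min(values), max(values)
    return sorted({(lo, w) for w in values} | {(w, hi) for w in values})


def build_triadic_context(rows, attrs, scale):
    """Y contains (g, m, t) exactly when the value m(g) lies in interval t."""
    return {
        (g, m, t)
        for g, vals in rows.items()
        for m, v in zip(attrs, vals)
        for t in scale
        if t[0] <= v <= t[1]
    }


def modus(Y, A, B, scale):
    """Intervals that contain every value of the rectangle A x B."""
    return {t for t in scale if all((g, m, t) in Y for g in A for m in B)}


W = {v for vals in ROWS.values() for v in vals}
T = interordinal_scale(W)
Y = build_triadic_context(ROWS, ATTRS, T)
C = modus(Y, ["g1", "g2", "g3"], ["m1", "m2", "m3"], T)
# Intersecting the intervals of the modus gives the tightest interval around the values.
tightest = (max(lo for lo, _ in C), min(hi for _, hi in C))
print(sorted(C), tightest)
```

For the rectangle ({g1, g2, g3}, {m1, m2, m3}) the intervals of the third dimension intersect to [1, 2], which is the range of values inside the rectangle.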

First contribution
We proved that there is a one-to-one correspondence between
(i) Triadic concepts of the resulting triadic context
(ii) Biclusters of similar values maximal for some θ ≥ 0
Interesting facts
An efficient algorithm for concept extraction exists (Data-Peeler)
L. Cerf, J. Besson, C. Robardet, J.-F. Boulicaut
Closed patterns meet n-ary relations.
In TKDD 3(1): (2009).
This algorithm can handle several constraints
Top-k biclusters: a concept (A, B, C) with high |A|, |B| and |C|
corresponds to a bicluster (A, B) that is a large rectangle of close
values (by properties of the interordinal scale)
This formalization allows us to design a new algorithm to
extract maximal biclusters for a given parameter θ

Algorithm TriMax
Compute all maximal biclusters for a given θ
Principle
Use another (but similar) discretization procedure to build the
triadic context based on tolerance blocks
Standard algorithms output biclusters of similar values that are not
necessarily maximal
We design a new algorithm TriMax for that task
TriMax is flexible, uses standard FCA algorithms in its
core and is more efficient than its competitors

Algorithm TriMax
Finding maximal sets of similar values
≃θ is a tolerance relation:
reflexive, symmetric, but not transitive
Blocks of tolerance of W
Maximal sets of pairwise similar values are closed sets
Example with θ = 1
≃1  0   1   2   6   7   8   9
0   ×   ×
1   ×   ×   ×
2       ×   ×
6               ×   ×
7               ×   ×   ×
8                   ×   ×   ×
9                       ×   ×
Blocks of tolerance
{0, 1}
{1, 2}
{6, 7}
{7, 8}
{8, 9}
Renamed classes
[0, 1]
[1, 2]
[6, 7]
[7, 8]
[8, 9]
S. O. Kuznetsov
Galois Connections in Data Analysis: Contributions from the Soviet Era and Modern Russian Research.
In Formal Concept Analysis, Foundations and Applications, 2005.
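For numerical values with the relation |w1 − w2| ≤ θ, the blocks of tolerance are the maximal runs of sorted values whose range is at most θ. The sketch below is a minimal Python illustration of that observation, not the procedure used in the authors' implementation; the function name is illustrative.

```python
def tolerance_blocks(values, theta):
    """Return the maximal sets of pairwise similar values (|w1 - w2| <= theta)."""
    ws = sorted(set(values))
    blocks = []
    last_end = -1  # index where the previously kept window ends
    for i, lo in enumerate(ws):
        # Extend the window to the right as far as the values stay within theta of lo.
        j = i
        while j + 1 < len(ws) and ws[j + 1] - lo <= theta:
            j += 1
        # Keep the window only if it reaches further right than the previous one;
        # otherwise it is contained in an earlier window and is not maximal.
        if j > last_end:
            blocks.append(ws[i:j + 1])
            last_end = j
    return blocks


W = [0, 1, 2, 6, 7, 8, 9]
print(tolerance_blocks(W, 1))
print(tolerance_blocks(W, 2))
```

With θ = 1 this yields the five blocks listed on the slide. Blocks may overlap (1 belongs to both {0, 1} and {1, 2}), which reflects that the relation is not transitive.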

Algorithm TriMax
New transformation procedure
Tolerance blocks based scaling
Compute the set C of all blocks of tolerance over W
From the numerical dataset (G, M, W, I), build the triadic
context (G, M, C, Z) such that (g, m, c) ∈ Z ⇐⇒ m(g) ∈ c
Actually, we remove “useless information” (example on the slide with θ = 1)

Algorithm TriMax
Second contribution
Algorithm TriMax
Any triadic concept corresponds to a bicluster of similar values,
but not necessarily maximal!
This led us to the algorithm TriMax, which:
Processes each formal context (one for each block of tolerance)
with any existing FCA algorithm
Treats any resulting concept as a maximal bicluster candidate and uses a
simple procedure to check maximality (this may be
problematic, but experiments show good behaviour)
Each context can be processed separately
TriMax allows a complete, correct and non-redundant
extraction of all maximal biclusters of similar values for a
user-defined similarity parameter θ
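The overall scheme can be illustrated on the small example table. The following Python sketch is a simplified illustration under stated assumptions, not the authors' C++ implementation: concepts are enumerated naively instead of with InClose, and maximality is tested directly by trying to add an object or an attribute rather than through the modus. All function names are hypothetical.

```python
from itertools import combinations

ATTRS = ["m1", "m2", "m3", "m4", "m5"]
ROWS = {
    "g1": [1, 2, 2, 1, 6],
    "g2": [2, 1, 1, 0, 6],
    "g3": [2, 2, 1, 7, 6],
    "g4": [8, 9, 2, 6, 7],
}


def val(g, m):
    return ROWS[g][ATTRS.index(m)]


def tolerance_blocks(values, theta):
    """Blocks of tolerance as (low, high) pairs: maximal runs of sorted values."""
    ws, blocks, last_end = sorted(set(values)), [], -1
    for i, lo in enumerate(ws):
        j = i
        while j + 1 < len(ws) and ws[j + 1] - lo <= theta:
            j += 1
        if j > last_end:
            blocks.append((lo, ws[j]))
            last_end = j
    return blocks


def concepts(incidence):
    """Naive concept enumeration: close every non-empty attribute subset."""
    found = set()
    for k in range(1, len(ATTRS) + 1):
        for B in combinations(ATTRS, k):
            A = frozenset(g for g in ROWS if all((g, m) in incidence for m in B))
            if A:
                closed = frozenset(
                    m for m in ATTRS if all((g, m) in incidence for g in A)
                )
                found.add((A, closed))
    return found


def is_similar(A, B, theta):
    vals = [val(g, m) for g in A for m in B]
    return max(vals) - min(vals) <= theta


def is_maximal(A, B, theta):
    """Direct check (an assumption of this sketch, not the modus-based test)."""
    grow_rows = any(is_similar(A | {g}, B, theta) for g in ROWS if g not in A)
    grow_cols = any(is_similar(A, B | {m}, theta) for m in ATTRS if m not in B)
    return not (grow_rows or grow_cols)


def trimax_sketch(theta):
    values = {v for row in ROWS.values() for v in row}
    result = set()
    for lo, hi in tolerance_blocks(values, theta):
        # One dyadic context per block: a cross where the value falls in the block.
        incidence = {(g, m) for g in ROWS for m in ATTRS if lo <= val(g, m) <= hi}
        for A, B in concepts(incidence):
            if is_maximal(A, B, theta):
                result.add((A, B))  # the set removes candidates found in several blocks
    return result


for A, B in sorted(trimax_sketch(2), key=lambda ab: (sorted(ab[0]), sorted(ab[1]))):
    print(sorted(A), sorted(B))
```

Because each block is handled in its own loop iteration with no shared state other than the result set, the per-block work could be distributed, in line with the remark that each context can be processed separately.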

Experiments
TriMax - settings
Implementation: C++, Boost library 1.42
InClose algorithm for processing dyadic contexts
Data: gene expression data of the species Laccaria bicolor
Configuration: Intel CPU 2.54 GHz, 8 GB RAM

Experiments
TriMax - monitoring aspects
Starting with all 12 attributes, we vary the number of
objects and the similarity parameter θ, and monitor:
Number of maximal biclusters of similar values
Execution time (in seconds)
Number of tolerance blocks
Density of the triadic context
Comparison of the number of non-maximal biclusters with
the number of maximal biclusters
Execution time profiling of the main procedures of TriMax

Experiments
TriMax - experimental results
[Figure: six plots showing the number of maximal biclusters, execution time (in seconds), number of blocks of tolerance, density of the triadic context, number of generated biclusters, and execution time]

Experiments
TriMax bottleneck
Computing the modus is problematic...
TriMax builds a formal context (2D) for each block of tolerance
extracts concepts (A, B) for each of them
computes the modus C to get the triadic concept (A, B, C) and
checks maximality
But...
In many applications, experts have preferences
One can remove a bicluster candidate before modus
computation according to some constraints
Example with θ = 33,000, 500 objects, 12 attributes
104,226 maximal biclusters extracted in 16.130 sec
5,332 maximal biclusters in 2.1 sec with at least 10 (at most 40)
objects

Experiments
Comparison
Existing algorithms
Numerical Biset Miner (NBS-Miner) - not scalable
J. Besson, C. Robardet, L. De Raedt, J.-F. Boulicaut
Mining Bi-sets in Numerical Data.
In KDID 2006: 11-23.
Interval Pattern Structures (IPS) - less efficient than TriMax
M. Kaytoue, S. O. Kuznetsov, and A. Napoli
Biclustering Numerical Data in Formal Concept Analysis.
ICFCA, Springer, 2011.

Experiments
An example of comparison
Increasing number of objects and all 12 attributes.
Results in milliseconds.
[Figure: three plots, for θ = 0, θ = 700 and θ = 10000]
Other scenarios show a similar behaviour.

Conclusion and perspectives
Conclusion
Contribution
A better understanding of closed numerical pattern mining
within FCA
A formal characterization of a type of bicluster
TriMax for efficient computation
Perspectives
top-k bicluster discovery
n-dimensional numerical datasets
Distributed computation
Constraints (size, mean-square residue, etc.)
Links with Fuzzy FCA
