A Game Theoretic Framework for Heterogeneous Information Network Clustering


Game Theoretic Framework for
Heterogeneous Information Network Clustering

Faris Alqadah

Johns Hopkins University


Outline
1 Introduction
    Motivation
2 Preliminaries
    HINs and FCA
    Game Theory
3 The Bi-clustering Game
    Party-Planners
4 Framework
    GHIN
5 Reward Functions
    Expected Satisfaction
6 Experimental Results
    Real world HINs
7 Conclusion


Motivation

Heterogeneous Information Networks (HINs) are pervasive
in applications ranging from bioinformatics to e-commerce.
Generalization of bi-clustering to pairwise relations as
opposed to tensor spaces.
No unified definition of a HIN-cluster or algorithmic
framework to mine them.
Address shortcomings of ‘pattern’-based approaches.


HINs

Objects derived from
distinct domains
Topology of the network
determined by
pairwise-binary relations
amongst domains.
Graph representation of a
HIN is a multi-partite
graph.
Clicking patterns, social
networks, gene networks
from different experiments.


Related Work

Three major categories of work
Multi-way clustering [5, 4, 1, 2]: Directly extend
bi-clustering or co-clustering. Mostly hard-clusters.
Information-network [10, 11]: Combine ranking and
clustering using probabilistic generative models, limited by
network topology, hard clustering.
Pattern-based [3, 12, 7]: Formal Concept Analysis,
overlapping clustering, too many clusters, parameter
settings.


Key Idea

For a single-edge HIN, trade-off between the number
of nodes in the bipartite sets.
For a multiple-edge HIN, competing cluster-influences.
An ‘ideal’ HIN-cluster should be an equilibrium
point among all competing clustering influences.
Nash equilibrium: No one can do any better
assuming everyone else retains the same strategy.


Notation

Context $K_{ij} = (G_i, G_j, I_{ij})$: two sets and a relation.
A HIN $G_n = (V, E)$, where $V$ is a set of domains
$\{G_1, \ldots, G_n\}$ and $(G_i, G_j) \in E$ iff $\exists K_{ij}$.


Concepts (maximal bicliques)

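Common neighbors:

\[
\psi^j(A_i) =
\begin{cases}
\{ g_j \in G_j \mid g_j \, I_{ij} \, g_i \ \ \forall g_i \in A_i \} & \text{if } (G_i, G_j) \in E, \\
\emptyset & \text{otherwise.}
\end{cases}
\]

Concept or maximal bi-clique: $(A_i, A_j)$ such that
$\psi^j(A_i) = A_j$ and $\psi^i(A_j) = A_i$.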
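
A minimal sketch of the common-neighbor operator and the concept test for a single context. The toy relation `I12`, the object names `g1`..`g3` and `m1`..`m3`, and the helper names `invert`, `psi` and `is_concept` are illustrative assumptions, not taken from the talk.

```python
# Toy bipartite context K_12 = (G1, G2, I12). ASSUMPTION: illustrative data only.
I12 = {
    "g1": {"m1", "m2", "m3"},
    "g2": {"m1"},
    "g3": {"m1", "m2"},
}


def invert(relation):
    """Build the reverse adjacency map (objects of G2 -> neighbours in G1)."""
    reverse = {}
    for g, neighbours in relation.items():
        for m in neighbours:
            reverse.setdefault(m, set()).add(g)
    return reverse


I21 = invert(I12)
G1, G2 = set(I12), set(I21)


def psi(relation, universe, subset):
    """Common neighbours shared by every object in `subset`.

    For an empty subset the condition holds vacuously, so the whole
    opposite domain (`universe`) is returned.
    """
    common = set(universe)
    for obj in subset:
        common &= relation.get(obj, set())
    return common


def is_concept(a1, a2):
    """(a1, a2) is a concept iff each side is exactly the other's common neighbours."""
    return psi(I12, G2, a1) == set(a2) and psi(I21, G1, a2) == set(a1)


print(psi(I12, G2, {"g1", "g3"}))
print(is_concept({"g1", "g3"}, {"m1", "m2"}))
print(is_concept({"g1", "g2"}, {"m1"}))
```

In this toy context the pair ({g1, g3}, {m1, m2}) passes the test, while ({g1, g2}, {m1}) does not, because g3 is also adjacent to m1 and the biclique is therefore not maximal.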


FCA-based approaches

Generalize the notion of a concept (several definitions),
and enumerate all such concepts.
Parameter settings not always intuitive.
A simple change in definition requires a substantially
different algorithm design.
For a suitably defined game, Nash equilibrium points capture
maximal bi-cliques.


Normal form game

A finite, n-player, normal form game $G$ is a triple $\langle N, (M_i), (r_i) \rangle$
where
$N = \{1, \ldots, n\}$ is the set of players.
$M_i = \{m_{i1}, \ldots, m_{il_i}\}$ is the set of moves available to player $i$,
and $l_i$ is the number of available moves for that player.
$r_i : M_1 \times \cdots \times M_n \to \mathbb{R}$ is the reward function for each
player $i$. It maps a profile of moves to a value.
Each player $i$ selects a strategy from the set of all available
strategies, $P_i = \{p_i : M_i \to [0, 1]\}$.


Nash equilibrium and example

Nash equilibrium: A strategy profile in which no player has an
incentive to unilaterally deviate [8, 6].

\[
\forall i \in N,\ \forall p_i \in P_i : \quad
r_i(p_1^*, \ldots, p_{i-1}^*, p_i, p_{i+1}^*, \ldots, p_n^*) \le r_i(p_1^*, \ldots, p_n^*)
\]

                    Player 2 chooses 0  Player 2 chooses 1  Player 2 chooses 2
Player 1 chooses 0  (0,0)               (1,0)               (2,-2)
Player 1 chooses 1  (0,1)               (1,1)               (3,-2)
Player 1 chooses 2  (-2,2)              (-2,3)              (2,2)
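
As a hedged illustration of the definition, the sketch below tests every cell of the example table for unilateral deviations. It considers pure strategies only, which is an assumption (the definition above also allows mixed strategies); the names `payoffs` and `pure_nash` are illustrative.

```python
# payoffs[r][c] = (reward to player 1, reward to player 2) when player 1
# chooses move r and player 2 chooses move c; values copied from the table.
payoffs = [
    [(0, 0), (1, 0), (2, -2)],
    [(0, 1), (1, 1), (3, -2)],
    [(-2, 2), (-2, 3), (2, 2)],
]


def pure_nash(table):
    """Return all cells from which neither player gains by deviating alone."""
    n_rows, n_cols = len(table), len(table[0])
    equilibria = []
    for r in range(n_rows):
        for c in range(n_cols):
            reward_1, reward_2 = table[r][c]
            # Player 1 may only change the row; player 2 only the column.
            p1_stays = all(table[alt][c][0] <= reward_1 for alt in range(n_rows))
            p2_stays = all(table[r][alt][1] <= reward_2 for alt in range(n_cols))
            if p1_stays and p2_stays:
                equilibria.append((r, c))
    return equilibria


equilibria = pure_nash(payoffs)
print(equilibria)
```

Under these assumptions the check reports four pure equilibria: (0,0), (0,1), (1,0) and (1,1). The cell (2,2) pays both players more than any of them, yet it is not stable, because player 1 gains by switching to move 1.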


Party planner game

Two party planners P1 and P2 plan a party by inviting
guests from disjoint sets of clients G1 and G2 .
Party planners receive compensation based on overall
satisfaction of clients.
Client satisfaction is a function of positive and negative
interactions at the party.
P1 and P2 do not cooperate, but are privy to each other's
guest list at any point. Both wish to maximize
compensation.


Satisfaction Reward Function

Let $(A_1, A_2)$ be a party. Define the satisfaction of $g_1 \in A_1$ attending
party $(A_1, A_2)$ as

\[
sat_1(g_1, A_2) = \frac{|\psi^2(g_1) \cap A_2| - w \cdot |A_2 \setminus \psi^2(g_1)|}{|A_2|} \tag{1}
\]

Overall reward to party planner $i$:

\[
r_i^{sat}(A_i, A_j) = \sum_{g_i \in A_i} sat_i(g_i, A_j) \tag{2}
\]
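
Equations (1) and (2) can be sketched as follows. The toy relation `I12`, the weight `w = 5`, and the function names are assumptions made for illustration; the slides do not state the underlying data. Exact fractions are used so that rewards can be compared without rounding noise.

```python
from fractions import Fraction

# ASSUMPTION: illustrative context, not given in the slides.
I12 = {
    "g1": {"m1", "m2", "m3"},
    "g2": {"m1"},
    "g3": {"m1", "m2"},
}
# Reverse adjacency, used for planner 2's guests.
I21 = {}
for g, ms in I12.items():
    for m in ms:
        I21.setdefault(m, set()).add(g)


def sat(neighbours, other_side, w):
    """Equation (1): hits minus w times misses, normalized by the other guest list.

    `other_side` is assumed to be non-empty.
    """
    hits = len(neighbours & other_side)
    misses = len(other_side - neighbours)
    return Fraction(hits - w * misses, len(other_side))


def reward(relation, own_side, other_side, w):
    """Equation (2): sum of the satisfaction of every invited guest."""
    return sum(sat(relation.get(g, set()), other_side, w) for g in own_side)


party = ({"g1", "g2"}, {"m1", "m2"})
r1 = reward(I12, party[0], party[1], w=5)
r2 = reward(I21, party[1], party[0], w=5)
print(r1, r2)
```

For the party ({g1, g2}, {m1, m2}) both planners receive -1: g2 and m2 each meet one guest they have no relation with, and with w = 5 that penalty outweighs the positive interactions.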


Concepts as Nash equilibrium points

            M1          M1, M2      M1, M2, M3  M1, M3      M2          M2, M3      M3
G1          (1,1)       (1,2)       (1,3)       (1,2)       (1,1)       (1,2)       (1,1)
G1, G2      (2,1)       (-1,-1)     (-2,-3)     (-1,-1)     (-4,-2)     (-4,-4)     (-4,-2)
G1, G2, G3  (3,1)       (0,0)       (-3,-3)     (-3,-2)     (-3,-1)     (-6,-4)     (-9,-3)
G1, G3      (2,1)       (2,2)       (0,0)       (-1,-1)     (2,1)       (-1,-1)     (-4,-2)
G2          (1,1)       (-2,-4)     (-3,-9)     (-2,-4)     (-5,-5)     (-5,-10)    (-5,-5)
G2, G3      (2,1)       (-1,-1)     (-4,-6)     (-4,-4)     (-4,-2)     (-7,-7)     (-10,-5)
G3          (1,1)       (1,2)       (-1,-3)     (-2,-4)     (1,1)       (-2,-4)     (-5,-5)


Concepts as Nash equilibrium points

Theorem
For any instance of the bi-clustering game $G_{bicluster}$ in which $r_i^{sat}$
is the selected reward function, there exists $w^*$ such that
$\forall w \ge w^*$, if $(A_1^*, A_2^*)$ is a concept of $K = (G_1, G_2, I_{12})$ then
$(A_1^*, A_2^*)$ is a Nash equilibrium point of $G_{bicluster}$.


HIN-clustering game

Extend the bi-clustering game to n party planners and n sets of guests.
Guest interactions are determined by network topology.
Mining HIN-clusters is equivalent to finding
Nash equilibrium points of the HIN-clustering game.
Finding a Nash equilibrium is non-trivial [9].
Adapt a simple strategy and a key heuristic to enumerate the
Nash equilibrium points.


Strategy and heuristics

            M1          M1, M2      M1, M2, M3  M1, M3      M2          M2, M3      M3
G1          (1,1)       (1,2)       (1**,3**)   (1**,2)     (1,1)       (1**,2)     (1**,1)
G1, G2      (2,1**)     (-1,-1)     (-2,-3)     (-1,-1)     (-4,-2)     (-4,-4)     (-4,-2)
G1, G2, G3  (3**,1**)   (0,0)       (-3,-3)     (-3,-2)     (-3,-1)     (-6,-4)     (-9,-3)
G1, G3      (2,1)       (2**,2**)   (0,0)       (-1,-1)     (2**,1)     (-1,-1)     (-4,-2)
G2          (1,1**)     (-2,-4)     (-3,-9)     (-2,-4)     (-5,-5)     (-5,-10)    (-5,-5)
G2, G3      (2,1**)     (-1,-1)     (-4,-6)     (-4,-4)     (-4,-2)     (-7,-7)     (-10,-5)
G3          (1,1)       (1,2**)     (-1,-3)     (-2,-4)     (1,1)       (-2,-4)     (-5,-5)

1 Mark all second components that are maximal in each row.
2 Mark all first components that are maximal in each column.
3 Any cell that has both components marked is a Nash
equilibrium.
Heuristic: Every Nash equilibrium point is a superset of an
n-concept.
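
The three marking steps can be sketched directly on the payoff table from the slide. The table values are copied from the slide; the names `ROWS`, `COLS`, `TABLE` and `mark_and_intersect` are illustrative assumptions.

```python
ROWS = ["G1", "G1, G2", "G1, G2, G3", "G1, G3", "G2", "G2, G3", "G3"]
COLS = ["M1", "M1, M2", "M1, M2, M3", "M1, M3", "M2", "M2, M3", "M3"]
# TABLE[r][c] = (reward to planner 1, reward to planner 2)
TABLE = [
    [(1, 1), (1, 2), (1, 3), (1, 2), (1, 1), (1, 2), (1, 1)],
    [(2, 1), (-1, -1), (-2, -3), (-1, -1), (-4, -2), (-4, -4), (-4, -2)],
    [(3, 1), (0, 0), (-3, -3), (-3, -2), (-3, -1), (-6, -4), (-9, -3)],
    [(2, 1), (2, 2), (0, 0), (-1, -1), (2, 1), (-1, -1), (-4, -2)],
    [(1, 1), (-2, -4), (-3, -9), (-2, -4), (-5, -5), (-5, -10), (-5, -5)],
    [(2, 1), (-1, -1), (-4, -6), (-4, -4), (-4, -2), (-7, -7), (-10, -5)],
    [(1, 1), (1, 2), (-1, -3), (-2, -4), (1, 1), (-2, -4), (-5, -5)],
]


def mark_and_intersect(table):
    """Return the (row, column) indices of cells marked in both steps."""
    n_rows, n_cols = len(table), len(table[0])

    # Step 1: in each row, mark the cells whose second component is maximal.
    second_marked = set()
    for r in range(n_rows):
        best = max(table[r][c][1] for c in range(n_cols))
        second_marked |= {(r, c) for c in range(n_cols) if table[r][c][1] == best}

    # Step 2: in each column, mark the cells whose first component is maximal.
    first_marked = set()
    for c in range(n_cols):
        best = max(table[r][c][0] for r in range(n_rows))
        first_marked |= {(r, c) for r in range(n_rows) if table[r][c][0] == best}

    # Step 3: a cell marked in both passes is a pure Nash equilibrium.
    return sorted(first_marked & second_marked)


nash_cells = [(ROWS[r], COLS[c]) for r, c in mark_and_intersect(TABLE)]
print(nash_cells)
```

On this table the procedure returns the three doubly marked cells: (G1; M1, M2, M3), (G1, G2, G3; M1) and (G1, G3; M1, M2).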


GHIN framework

Even with the heuristic, exponential run time is still possible.
Sacrifice completeness, but guarantee correctness.
Attempt to form a Nash equilibrium point with each object
in the HIN.


GHIN framework

1 For each object gi in the seed set, attempt to form a
maximally large n-partite clique in the HIN.
2 Add objects from all domains to the clique while the reward
increases.
3 Remove objects not in the original clique from all domains
while the reward increases.
4 If steps 2 and 3 make no change, a Nash equilibrium has been
found; otherwise repeat steps 2 and 3.
5 Update the seed set by removing all objects in the cluster.
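
The five steps might look as follows for the two-domain case. This is a simplified reading, not the author's implementation, and it rests on several assumptions: seeds are drawn from the first domain only, every object has at least one neighbor, "reward increases" is read as the acting planner's own satisfaction reward, and a round cap guards against cycling. The toy relation and all names are illustrative.

```python
from fractions import Fraction


def invert(relation):
    reverse = {}
    for g, neighbours in relation.items():
        for m in neighbours:
            reverse.setdefault(m, set()).add(g)
    return reverse


def common(relation, universe, subset):
    """Common neighbours of all objects in `subset`."""
    result = set(universe)
    for obj in subset:
        result &= relation[obj]
    return result


def sat(neighbours, other_side, w):
    """Satisfaction reward of one guest (hits minus w * misses, normalized)."""
    hits = len(neighbours & other_side)
    return Fraction(hits - w * len(other_side - neighbours), len(other_side))


def ghin_bipartite(i12, w, max_rounds=100):
    i21 = invert(i12)
    g1_all, g2_all = set(i12), set(i21)
    seeds, clusters = sorted(g1_all), []
    while seeds:
        seed = seeds[0]
        # Step 1: close the seed into a maximal biclique (a concept).
        a2 = common(i12, g2_all, {seed})
        a1 = common(i21, g1_all, a2)
        core1, core2 = set(a1), set(a2)
        for _ in range(max_rounds):
            before = (frozenset(a1), frozenset(a2))
            # Step 2: a guest with positive satisfaction raises its planner's reward.
            a1 |= {x for x in g1_all - a1 if sat(i12[x], a2, w) > 0}
            a2 |= {y for y in g2_all - a2 if sat(i21[y], a1, w) > 0}
            # Step 3: drop added guests whose satisfaction turned negative.
            a1 -= {x for x in a1 - core1 if sat(i12[x], a2, w) < 0}
            a2 -= {y for y in a2 - core2 if sat(i21[y], a1, w) < 0}
            # Step 4: stop once a full add/remove pass changes nothing.
            if (frozenset(a1), frozenset(a2)) == before:
                break
        clusters.append((a1, a2))
        # Step 5: objects already clustered are no longer used as seeds.
        seeds = [s for s in seeds if s not in a1]
    return clusters


# ASSUMPTION: illustrative context.
I12 = {"g1": {"m1", "m2", "m3"}, "g2": {"m1"}, "g3": {"m1", "m2"}}
clusters = ghin_bipartite(I12, w=5)
print(clusters)
```

On the toy context with w = 5 this sketch returns two clusters, ({g1}, {m1, m2, m3}) and ({g1, g2, g3}, {m1}). The third concept, ({g1, g3}, {m1, m2}), is never reached because g3 has already been removed from the seed set, which illustrates the completeness trade-off.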


Shortcomings of satisfaction reward function

The satisfaction reward function is simple, intuitive, and efficient.
If matrices in the HIN have significantly different density levels,
then bias occurs.
Use expected satisfaction instead.


Expected satisfaction

Assume all objects are independent.
For a given party (A1 , . . . , An ), the number of
interactions is the number of successes in |Aj | draws from a finite
population of |Gj | objects.
The number of successes is a hypergeometrically
distributed random variable.


Expected satisfaction

\[
exp_{ij}(g_i, A_j) = \frac{|A_j| \cdot |\psi^j(g_i)|}{|G_j|}
\]
\[
var_{ij}(g_i, A_j) = \frac{|A_j| \cdot |\psi^j(g_i)| \cdot (|G_j| - |A_j|) \cdot (|G_j| - |\psi^j(g_i)|)}{|G_j|^2 \cdot (|G_j| - 1)}
\]
\[
esat(g_i, A_j) = \frac{|\psi^j(g_i) \cap A_j| - exp_{ij}(g_i, A_j)}{var_{ij}(g_i, A_j)} - w
\]
\[
esat(g_i, A_{-i}) = \sum_{A_j \subseteq G_j,\ (G_i, G_j) \in E} esat(g_i, A_j)
\]
\[
r_i^{esat}(A_i, A_{-i}) = \sum_{g_i \in A_i} esat(g_i, A_{-i})
\]
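
A small numeric sketch of the hypergeometric mean and variance used above. The placement of `w` in `esat` (subtracted after the ratio) follows one reading of the slide layout and should be treated as an assumption, as should the function and parameter names.

```python
from fractions import Fraction


def expected_hits(n_party, n_neigh, n_domain):
    """Mean of the hypergeometric count: |A_j| * |psi^j(g_i)| / |G_j|."""
    return Fraction(n_party * n_neigh, n_domain)


def variance_hits(n_party, n_neigh, n_domain):
    """Variance of the hypergeometric count (requires |G_j| > 1)."""
    numerator = n_party * n_neigh * (n_domain - n_party) * (n_domain - n_neigh)
    return Fraction(numerator, n_domain ** 2 * (n_domain - 1))


def esat(observed, n_party, n_neigh, n_domain, w):
    """Observed minus expected interactions, scaled by the variance, minus w.

    ASSUMPTION: w is subtracted after the ratio. The score is undefined when
    the variance is zero (for example when A_j equals the whole domain).
    """
    var = variance_hits(n_party, n_neigh, n_domain)
    if var == 0:
        raise ValueError("variance is zero; esat is undefined for this party")
    return (observed - expected_hits(n_party, n_neigh, n_domain)) / var - w


# Domain of 10 objects, guest list of 4, and g_i related to 5 objects overall.
print(expected_hits(4, 5, 10))
print(variance_hits(4, 5, 10))
print(esat(4, 4, 5, 10, w=1))
```

With these numbers two interactions are expected, the variance is 2/3, and a guest who interacts with all four invitees scores (4 - 2) / (2/3) - 1 = 2.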


Tiring party goers

Incorporate a ‘tiring’ factor to avoid too much overlap. Let $c(g_i)$
denote the number of clusters $g_i$ has appeared in up to the
current time-step, then let

\[
t = f(c(g_i))
\]

where $f : \mathbb{N} \to (0, 1]$ and $f$ is anti-monotonic. For example:

\[
f(x) = \frac{1}{x^2}
\qquad \text{or} \qquad
f(x) = \frac{1}{e^x}
\]


HINs and evaluation
HIN name    Description                                              Num domains  Num classes  Total num objects
MER         Newsgroup, Middle East politics and Religion             3            2            24,783
REC         Newsgroup, recreation                                    3            2            26,225
SCI         Newsgroup, science                                       3            5            37,413
PC          Newsgroup, pc and software                               3            5            35,186
PCR         Newsgroup, politics and Christianity                     3            2            24,485
FOUR_AREAS  DBLP subset of database, data mining, AI, and IR papers  4            4            70,517

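Extrinsic evaluation, $B^3$ precision and recall, where $C(g)$ is the set of
clusters containing $g$ and $L(g)$ is its set of class labels:

\[
Prec(g, g') = \frac{\min(|C(g) \cap C(g')|,\ |L(g) \cap L(g')|)}{|C(g) \cap C(g')|}
\]
\[
Rcl(g, g') = \frac{\min(|C(g) \cap C(g')|,\ |L(g) \cap L(g')|)}{|L(g) \cap L(g')|}
\]
\[
B^3\,Prec = Avg_{g} \left[ Avg_{g' :\ C(g) \cap C(g') \ne \emptyset} \left[ Prec(g, g') \right] \right]
\]
\[
B^3\,Rcl = Avg_{g} \left[ Avg_{g' :\ L(g) \cap L(g') \ne \emptyset} \left[ Rcl(g, g') \right] \right]
\]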
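
A minimal sketch of these measures for overlapping clusterings, together with the standard F-beta combination of precision and recall. The toy data and the names `bcubed` and `f_beta` are illustrative; skipping objects that share no cluster (or no label) with anything is an assumption about how the outer average treats empty inner averages.

```python
def bcubed(clusters, labels):
    """clusters / labels map each object to its set of cluster ids / class labels."""
    objects = sorted(clusters)
    prec_per_object, rcl_per_object = [], []
    for g in objects:
        prec, rcl = [], []
        for h in objects:  # h ranges over all objects, including g itself
            shared_clusters = len(clusters[g] & clusters[h])
            shared_labels = len(labels[g] & labels[h])
            agreed = min(shared_clusters, shared_labels)
            if shared_clusters:
                prec.append(agreed / shared_clusters)
            if shared_labels:
                rcl.append(agreed / shared_labels)
        if prec:
            prec_per_object.append(sum(prec) / len(prec))
        if rcl:
            rcl_per_object.append(sum(rcl) / len(rcl))
    precision = sum(prec_per_object) / len(prec_per_object) if prec_per_object else 0.0
    recall = sum(rcl_per_object) / len(rcl_per_object) if rcl_per_object else 0.0
    return precision, recall


def f_beta(precision, recall, beta):
    """Weighted harmonic mean; beta < 1 favors precision, beta > 1 favors recall."""
    if precision == 0 and recall == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)


# ASSUMPTION: toy data with two classes and two clusters.
labels = {"a": {"X"}, "b": {"X"}, "c": {"Y"}, "d": {"Y"}}
clusters = {"a": {1}, "b": {1}, "c": {1}, "d": {2}}
precision, recall = bcubed(clusters, labels)
print(precision, recall, f_beta(precision, recall, 1.0))
```

Here cluster 1 mixes both classes, so precision drops to 2/3, and class Y is split across two clusters, so recall is 3/4; the resulting F1 is 12/17.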


Results

HIN         Algorithm    F1        F0.5      F2
MER         GHIN expsat  0.627051  0.736396  0.622735
            GHIN sat     0.553790  0.649559  0.569664
            NetClus      0.3759    0.4512    0.322
            MDC          0.3661    0.4533    0.3070
REC         GHIN expsat  0.544189  0.633362  0.508778
            GHIN sat     0.434367  0.485025  0.451840
            NetClus      0.2784    0.2870    0.2704
            MDC          0.2845    0.2953    0.2746
SCI         GHIN expsat  0.484068  0.589704  0.530239
            GHIN sat     0.402306  0.481798  0.462886
            NetClus      0.2609    0.2583    0.2635
            MDC          0.2532    0.2529    0.2535
PC          GHIN expsat  0.334827  0.520472  0.302943
            GHIN sat     0.306503  0.432229  0.345382
            NetClus      0.2254    0.2068    0.2477
            MDC          0.2282    0.2116    0.2476
PCR         GHIN expsat  0.640894  0.793399  0.508778
            GHIN sat     0.541986  0.574588  0.530971
            NetClus      0.3642    0.4396    0.3109
            MDC          0.3440    0.4268    0.2810
FOUR_AREAS  GHIN expsat  0.623117  0.598877  0.650079
            GHIN sat     0.5315    0.506687  0.5588
            NetClus      0.3612    0.36655   0.3560
            MDC          0.5085    0.5162    0.5010


Class distributions in clusters

Algorithm    Class  C1         C2          C3         C4
GHIN expsat  DB     0.0601266  0.93633     0.0133188  0.0512748
             DM     0.028481   0.0363608   0.0106007  0.850142
             IR     0.882911   0.0204432   0.133188   0.0339943
             AI     0.028481   0.00686642  0.842892   0.0645892
NetClus      DB     0.0553833  0.450802    0.500074   0.0955971
             DM     0.163934   0.15815     0.128535   0.304584
             IR     0.179553   0.0512035   0.242707   0.112786
             AI     0.60113    0.339844    0.128684   0.487033
MDC          DB     0.186681   0.232455    0.803727   0.000000
             DM     0.261844   0.000000    0.128592   0.161790
             IR     0.003183   0.278748    0.000000   0.75888
             AI     0.548292   0.488797    0.067680   0.079323


Sample Clusters

Terms        Authors                   Conferences
data         Surajit Chaudhuri         VLDB
database     Divesh Srivastava         SIGMOD
queries      H. V. Jagadish            ICDE
databases    Jeffrey F. Naughton       PODS
querys       Michael J. Carey          EDBT
xml          Raghu Ramakrishnan

mining       Jiawei Han                KDD
learning     Christos Faloutsos        PAKDD
data         Wei Wang                  ICDM
frequent     Heikki Mannila            SDM
association  Srinivasan Parthasarathy  PKDD
patterns     Ke Wang                   ICML


Applying GHIN to EMAP data

E-MAP (epistatic miniarray profiles): query and target genes.
Genetic interaction score indicates whether a strain is
healthier or sicker than expected (positive or negative).
Negative network derived by using scores ≤ −2.5.
Find Nash points, and use functional enrichment: Do we
find small functional classes?


Applying GHIN to EMAP data

[Figure, two panels: "Functional enrichment by large classes (31−500)" and "Functional enrichment by small classes". Each plots the fraction of patterns enriched against the P-value threshold for two series, "Exp sat tiring" and "Sat".]


Clusters exclusively annotated by small functional classes:

YBR078W ECM33
YIL034C CAP2
YIL159W BNR1
YKL007W CAP1
YMR054W STV1
YMR058W FET3
YMR089C YTA12
YFL031W HAC1
YHR079C IRE1
YJL095W BCK1
YCL048W SPS22
YIL073C SPO22
YJL155C FBP26
YLR267W BOP2


Parameter study

Effect of w on extrinsic clustering quality.
[Figure, three panels: F1 score, F0.5 score, and F2 score plotted against w (0 to 12) for the HINs mer, rec, pcr, pc, sci, and four.]


Parameter study

Effect of w on algorithm operation.
[Figure, three panels: average number of iterations to find a Nash point, total number of iterations, and number of clusters plotted against w (0 to 12) for the HINs mer, rec, pcr, pc, sci, and four.]


Conclusion

Novel framework for defining and enumerating
HIN-clusters.
First (as far as I know) connection between information
network clustering and game theory.
Initial experimental results show promise.


Ongoing and future work

Development of reward functions (information theoretic,
spectral?).
Clustering in biological data: do we find smaller functional
classes compared to other bi-clustering methods?
Extension of framework to weighted HINs.
More algorithmic development.
Compare algorithms with an actual Nash solver.


[1] A. Banerjee, S. Basu, and S. Merugu.
Multi-way clustering on relation graphs.
In Proceedings of the SIAM International Conference on
Data Mining, 2007.
[2] R. Bekkerman, R. El-Yaniv, and A. McCallum.
Multi-way distributional clustering via pairwise interactions.
In ICML ’05: Proceedings of the 22nd International
Conference on Machine Learning, pages 41–48, New York,
NY, USA, 2005. ACM.
[3] J. Li, G. Liu, H. Li, and L. Wong.
Maximal biclique subgraphs and closed pattern pairs of the
adjacency matrix: A one-to-one correspondence and
mining algorithms.
IEEE Trans. Knowl. Data Eng., 19(12):1625–1637, 2007.
[4] B. Long, X. Wu, Z. M. Zhang, and P. S. Yu.
Unsupervised learning on k-partite graphs.
In KDD ’06: Proceedings of the 12th ACM SIGKDD
International Conference on Knowledge Discovery and Data
Mining, pages 317–326, New York, NY, USA, 2006. ACM.
[5] B. Long, Z. M. Zhang, X. Wu, and P. S. Yu.
Spectral clustering for multi-type relational data.
In ICML ’06: Proceedings of the 23rd International
Conference on Machine Learning, pages 585–592, New
York, NY, USA, 2006. ACM.
[6] E. Mendelson.
Introducing Game Theory and Its Applications.
Chapman & Hall / CRC, 2004.
[7] M. J. Zaki, M. Peters, I. Assent, and T. Seidl.
Clicks: An effective algorithm for mining subspace clusters
in categorical datasets.
Data and Knowledge Engineering, special issue on
Intelligent Data Mining, 60(2):51–70, 2007.
[8] G. Owen.
Game Theory.
Academic Press, 1995.
[9] R. Porter, E. Nudelman, and Y. Shoham.
Simple search methods for finding a Nash equilibrium.
In Games and Economic Behavior, pages 664–669, 2004.
[10] Y. Sun, J. Han, P. Zhao, Z. Yin, H. Cheng, and T. Wu.
RankClus: Integrating clustering with ranking for
heterogeneous information network analysis.
In Proc. 2009 Int. Conf. on Extending Data Base
Technology (EDBT’09), 2009.
[11] Y. Sun, Y. Yu, and J. Han.
Ranking-based clustering of heterogeneous information
networks with star network schema.
In Proc. 2009 ACM SIGKDD Int. Conf. on Knowledge
Discovery and Data Mining (KDD’09), 2009.
[12] A. Tanay, R. Sharan, and R. Shamir.
Discovering statistically significant biclusters in gene
expression data.
In Proceedings of ISMB 2002, 2002.
