1. Inferring Multiple Graph Structures
Julien Chiquet¹, Yves Grandvalet², Christophe Ambroise¹
¹ Statistique et Génome, CNRS & Université d'Évry Val d'Essonne
² Heudiasyc, CNRS & Université de Technologie de Compiègne
NeMo – 21 June 2010
Chiquet, Grandvalet, Ambroise, arXiv preprint.
Inferring multiple Gaussian graphical structures.
Chiquet, Grasseau, Charbonnier and Ambroise, R package SIMoNe.
http://stat.genopole.cnrs.fr/~jchiquet/fr/softwares/simone
Inferring Multiple Graph Structures 1
2. Problem
Inference
few arrays ⇔ few examples
lots of genes ⇔ high dimension
interactions ⇔ very high dimension
Which interactions?
The main difficulty is the low-sample-size, high-dimensional setting.
Our main hope is to benefit from sparsity: few genes interact.
3. Handling the scarcity of data
Merge several experimental conditions
experiment 1 experiment 2 experiment 3
4. Handling the scarcity of data
Inferring each graph independently does not help
experiment 1 experiment 2 experiment 3
(X_1^(1), …, X_{n₁}^(1))   (X_1^(2), …, X_{n₂}^(2))   (X_1^(3), …, X_{n₃}^(3))
inference        inference        inference
5. Handling the scarcity of data
By pooling all the available data
experiment 1 experiment 2 experiment 3
(X_1, …, X_n), n = n₁ + n₂ + n₃.
inference
9. Outline
Statistical model
Multi-task learning
Algorithms and methods
Model selection
Experiments
11. Gaussian graphical modeling
Let
X = (X_1, …, X_p) ∼ N(0_p, Σ), and assume n i.i.d. copies of X,
X be the n × p matrix whose kth row is X_k,
Θ = (θ_ij)_{i,j∈P} = Σ⁻¹ be the concentration matrix.
Graphical interpretation
Since corr_{ij|P∖{i,j}} = −θ_ij / √(θ_ii θ_jj) for i ≠ j,
θ_ij = 0  ⇔  X_i ⊥ X_j | X_{P∖{i,j}}  ⇔  edge (i, j) ∉ network.
The nonzeroes of Θ describe the graph structure.
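The correspondence above can be checked on a toy precision matrix: zero entries of Θ are exactly the missing edges, and the off-diagonal entries give the partial correlations. A minimal numpy sketch (the matrix values are illustrative):

```python
import numpy as np

# Toy concentration (precision) matrix Theta = Sigma^{-1} for p = 3 variables.
# Zeroes in Theta correspond to missing edges in the conditional-independence graph.
Theta = np.array([[2.0, 0.6, 0.0],
                  [0.6, 2.0, 0.5],
                  [0.0, 0.5, 2.0]])

d = np.sqrt(np.diag(Theta))
# Partial correlation: corr_{ij|rest} = -theta_ij / sqrt(theta_ii * theta_jj)
partial_corr = -Theta / np.outer(d, d)
np.fill_diagonal(partial_corr, 1.0)

# Adjacency of the graph: nonzero off-diagonal entries of Theta
adjacency = (np.abs(Theta) > 1e-12) & ~np.eye(3, dtype=bool)
print(adjacency.astype(int))   # edges (1,2) and (2,3); no edge (1,3)
```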
12. The model likelihood
Let S = n⁻¹ XᵀX be the empirical variance–covariance matrix: S is a sufficient statistic for X ⇒ L(Θ; X) = L(Θ; S)
The log-likelihood
L(Θ; S) = (n/2) log det(Θ) − (n/2) trace(SΘ) − (n/2) log(2π).
The MLE of Θ is S⁻¹:
not defined for n < p
not sparse ⇒ fully connected graph
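The failure of the MLE for n < p is easy to see numerically: with fewer samples than variables, S has rank at most n, so S⁻¹ does not exist. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5, 10          # fewer samples than variables
X = rng.standard_normal((n, p))
S = X.T @ X / n       # empirical covariance, rank at most n < p

# S is singular, so the MLE Theta_hat = S^{-1} does not exist:
rank = np.linalg.matrix_rank(S)
print(rank)           # at most 5, strictly less than p = 10
```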
13. Penalized Approaches
Penalized Likelihood (Banerjee et al., 2008)
max_{Θ∈S₊} L(Θ; S) − λ ‖Θ‖₁
well defined for n < p
sparse ⇒ sensible graph
SDP of size O(p²) (solved by Friedman et al., 2007)
Neighborhood Selection (Meinshausen & Bühlmann, 2006)
β̂ = argmin_{β∈R^{p−1}} (1/n) ‖X_j − X_{∖j} β‖₂² + λ ‖β‖₁
where X_j is the jth column of X and X_{∖j} is X deprived of X_j
not symmetric, not positive-definite
p independent LASSO problems of size (p − 1)
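Neighborhood selection can be sketched in a few lines with an off-the-shelf LASSO solver: regress each variable on all the others and read the neighbors off the nonzero coefficients. A minimal illustration on synthetic data (the `alpha` value and the planted dependence are assumptions for the example, not from the talk):

```python
import numpy as np
from sklearn.linear_model import Lasso

# Neighborhood selection: p independent l1-penalized regressions,
# one per variable; nonzero coefficients give its neighbors.
rng = np.random.default_rng(1)
n, p = 50, 10
X = rng.standard_normal((n, p))
X[:, 1] = X[:, 0] + 0.5 * rng.standard_normal(n)   # make variables 0 and 1 dependent

neighbors = {}
for j in range(p):
    X_minus_j = np.delete(X, j, axis=1)
    fit = Lasso(alpha=0.2).fit(X_minus_j, X[:, j])
    others = [k for k in range(p) if k != j]
    neighbors[j] = [others[k] for k in np.flatnonzero(fit.coef_)]

print(neighbors[0])   # variable 1 should appear among the neighbors of 0
```

Note the asymmetry mentioned on the slide: nothing forces variable i to select j whenever j selects i, so a symmetrization rule (AND/OR) is needed afterwards.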
15. Neighborhood vs. Likelihood
Pseudo-likelihood (Besag, 1975)
P(X_1, …, X_p) ≈ ∏_{j=1}^{p} P(X_j | {X_k}_{k≠j})
The corresponding pseudo-log-likelihood, with D = diag(Θ):
L̃(Θ; S) = (n/2) log det(D) − (n/2) trace(S D⁻¹ Θ²) − (n/2) log(2π)
to be compared with the log-likelihood
L(Θ; S) = (n/2) log det(Θ) − (n/2) trace(SΘ) − (n/2) log(2π)
Proposition (Ambroise, Chiquet, Matias, 2008)
Neighborhood selection leads to the graph maximizing the penalized pseudo-log-likelihood.
Proof: β̂_i = −θ̃_ij / θ̃_jj, where Θ̃ = argmax_Θ L̃(Θ; S) − λ ‖Θ‖₁
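The identity at the heart of the proof (regression coefficients equal −θ_ij/θ_jj for a Gaussian vector) is easy to verify numerically; the covariance values below are toy numbers chosen for the check:

```python
import numpy as np

# Numerical check: the population regression coefficients of X_j on the
# remaining variables equal -theta_ij / theta_jj for a Gaussian vector.
Sigma = np.array([[1.0, 0.5, 0.2],
                  [0.5, 1.0, 0.3],
                  [0.2, 0.3, 1.0]])
Theta = np.linalg.inv(Sigma)

j = 0
others = [1, 2]
# Population regression coefficients of X_j on {X_i, i != j}:
coefs = np.linalg.solve(Sigma[np.ix_(others, others)], Sigma[others, j])
for k, i in enumerate(others):
    print(np.isclose(coefs[k], -Theta[i, j] / Theta[j, j]))   # True, True
```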
18. Multi-task learning
We have T samples (experimental conditions) of the same variables
X^(t) is the tth data matrix, S^(t) is the empirical covariance
examples are assumed to be drawn from N(0, Σ^(t))
Ignoring the relationships between the tasks leads to separable objectives:
max_{Θ^(t)∈R^{p×p}, t=1,…,T}  L(Θ^(t); S^(t)) − λ ‖Θ^(t)‖₁
Multi-task learning = solving the T tasks jointly
We may couple the objectives
through the fitting term,
through the penalty term.
20. Coupling through the fitting term
Intertwined LASSO
max_{Θ^(t), t=1,…,T}  Σ_{t=1}^{T} L(Θ^(t); S̃^(t)) − λ ‖Θ^(t)‖₁
S̄ = (1/n) Σ_{t=1}^{T} n_t S^(t) is the “pooled-tasks” covariance matrix.
S̃^(t) = α S^(t) + (1 − α) S̄ is a mixture of the task-specific and pooled covariance matrices.
α = 0 pools the data sets and infers a single graph
α = 1 separates the data sets and infers T graphs independently
α = 1/2 in all our experiments
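The intertwined covariance matrices are a one-line computation once the per-task covariances are available; a minimal sketch with arbitrary synthetic tasks:

```python
import numpy as np

# Intertwined covariances: each task mixes its own empirical covariance
# with the sample-size-weighted pooled covariance.
rng = np.random.default_rng(2)
p, sizes = 4, [30, 50, 20]
S_tasks = []
for n_t in sizes:
    X = rng.standard_normal((n_t, p))
    S_tasks.append(X.T @ X / n_t)

n = sum(sizes)
S_pooled = sum(n_t * S for n_t, S in zip(sizes, S_tasks)) / n

alpha = 0.5   # the talk's default
S_tilde = [alpha * S + (1 - alpha) * S_pooled for S in S_tasks]

# alpha = 0 would give the pooled matrix for every task;
# alpha = 1 would keep each task-specific matrix unchanged.
print(np.allclose(S_tilde[0], 0.5 * (S_tasks[0] + S_pooled)))   # True
```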
21. Coupling through penalties: group-LASSO
We group parameters by sets of corresponding edges across graphs (figure: two four-node example graphs on X_1, …, X_4; each edge (i, j) is grouped with its counterparts across graphs).
Graphical group-LASSO
max_{Θ^(t), t=1,…,T}  Σ_{t=1}^{T} L(Θ^(t); S^(t)) − λ Σ_{i≠j} ( Σ_{t=1}^{T} (θ_ij^(t))² )^{1/2}
Sparsity pattern shared between graphs
Identical graphs across tasks
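The group-LASSO penalty above is the sum, over off-diagonal positions (i, j), of the ℓ2 norm of the T coefficients of that edge. A minimal sketch on arbitrary matrices:

```python
import numpy as np

# Group-LASSO penalty on T precision matrices: each off-diagonal position
# (i, j) is one group gathering the T coefficients of that edge.
rng = np.random.default_rng(3)
T, p = 3, 4
Thetas = np.stack([np.eye(p) + 0.1 * rng.standard_normal((p, p)) for _ in range(T)])

penalty = 0.0
for i in range(p):
    for j in range(p):
        if i != j:
            group = Thetas[:, i, j]            # edge (i, j) across the T graphs
            penalty += np.linalg.norm(group)   # l2 norm of the group
print(penalty)
```

Because the ℓ2 norm of a group is zero only when all its entries are zero, a zeroed group removes the edge from all T graphs at once, hence the identical supports.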
26. Coupling through penalties: cooperative-LASSO
Same grouping, and bet that correlations are likely to be sign-consistent (figure: four-node example graphs where each edge keeps a consistent sign across tasks).
Gene interactions are either inhibitory or activating across assays.
Graphical cooperative-LASSO
max_{Θ^(t), t=1,…,T}  Σ_{t=1}^{T} L(Θ^(t); S^(t)) − λ Σ_{i≠j} [ ( Σ_{t=1}^{T} [θ_ij^(t)]₊² )^{1/2} + ( Σ_{t=1}^{T} [θ_ij^(t)]₋² )^{1/2} ]
where [u]₊ = max(0, u) and [u]₋ = min(0, u).
Plausible in many other situations
Sparsity pattern shared between graphs, which may differ
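The cooperative penalty splits each group into its positive and negative parts before taking ℓ2 norms, so an edge can be switched off in one sign direction while staying active in the other. A minimal sketch:

```python
import numpy as np

# Cooperative-LASSO penalty: each group is split into its positive and
# negative parts, and each part contributes its own l2 norm.
rng = np.random.default_rng(4)
T, p = 3, 4
Thetas = 0.2 * rng.standard_normal((T, p, p))

def coop_penalty(Thetas):
    pos = np.maximum(Thetas, 0.0)   # [u]_+ = max(0, u)
    neg = np.minimum(Thetas, 0.0)   # [u]_- = min(0, u)
    norms = np.sqrt((pos ** 2).sum(axis=0)) + np.sqrt((neg ** 2).sum(axis=0))
    mask = ~np.eye(Thetas.shape[1], dtype=bool)   # off-diagonal terms only
    return norms[mask].sum()

# When every group is sign-consistent, one of the two parts vanishes and
# the cooperative penalty coincides with the plain group-LASSO penalty.
print(coop_penalty(np.abs(Thetas)))
```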
35. A Geometric View of Sparsity
Constrained Optimization
max_{β₁,β₂} L(β₁, β₂) − λ Ω(β₁, β₂)   ⇔   max_{β₁,β₂} L(β₁, β₂)  s.t.  Ω(β₁, β₂) ≤ c
(figure: level sets of L and the constraint set in the (β₁, β₂) plane)
37. A Geometric View of Sparsity
Supporting Hyperplane
A hyperplane supports a set iff
the set is contained in one half-space
the set has at least one point on the hyperplane
(figure: supporting hyperplanes of a convex set in the (β₁, β₂) plane)
There are supporting hyperplanes at all points of convex sets: they generalize tangents.
56. Decomposition strategy
Estimate the jth neighborhood of the T graphs:
max_{K^(t), t=1,…,T}  Σ_{t=1}^{T} L̃(K^(t); S^(t)) − λ Ω(K^(t))
decomposes into p convex optimization problems of size T × (p − 1):
β̂_j = argmin_{β∈R^{T×(p−1)}}  f_j(β) + λ Ω(β),
where β̂_j is a minimizer iff 0 ∈ ∇_β f_j(β) + λ ∂_β Ω(β).
Group-LASSO:
Ω(β) = Σ_{i=1}^{p−1} ‖β_i^[1:T]‖₂
Coop-LASSO:
Ω(β) = Σ_{i=1}^{p−1} ( ‖[β_i^[1:T]]₊‖₂ + ‖[−β_i^[1:T]]₊‖₂ )
where β_i^[1:T] is the vector gathering the coefficients of edge (i, j) across the T graphs.
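The elementary operation behind solving these grouped subproblems is group soft-thresholding, which shrinks a whole group radially and zeroes it when its norm falls below the penalty level. A minimal sketch of this building block (the function name is illustrative):

```python
import numpy as np

# Group soft-thresholding: shrink the group toward zero by lam in l2 norm,
# zeroing the whole group when its norm is at most lam.
def group_soft_threshold(v, lam):
    norm = np.linalg.norm(v)
    if norm <= lam:
        return np.zeros_like(v)        # the whole group is switched off
    return (1.0 - lam / norm) * v      # otherwise shrink it radially

g = np.array([0.3, -0.4])              # one edge across T = 2 graphs, norm 0.5
print(group_soft_threshold(g, 0.25))   # shrunk but nonzero: [0.15, -0.2]
print(group_soft_threshold(g, 0.6))    # zeroed: [0.0, 0.0]
```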
59. Active set algorithm
// 0. INITIALIZATION
β ← 0, A ← ∅
while 0 ∉ ∂_β L(β) do
    // 1. MASTER PROBLEM: OPTIMIZATION WITH RESPECT TO β_A
    Find a solution h to the smooth problem
    ∇_h f(β_A + h) + λ ∂_h Ω(β_A + h) = 0, where ∂_h Ω = {∇_h Ω}.
    β_A ← β_A + h
    // 2. IDENTIFY NEWLY ZEROED VARIABLES
    while ∃ i ∈ A : β_i = 0 and min_{ν∈∂_β Ω} |∂f(β)/∂β_i + λν| = 0 do
        A ← A ∖ {i}
    end
    // 3. IDENTIFY NEW NON-ZERO VARIABLES
    // Select the candidate i ∈ Aᶜ that most violates the optimality conditions,
    // i.e. for which an infinitesimal change of β_i yields the highest reduction of L
    i ← argmax_{j∈Aᶜ} v_j, where v_j = min_{ν∈∂_β Ω} |∂f(β)/∂β_j + λν|
    if v_i ≠ 0 then
        A ← A ∪ {i}
    else
        Stop and return β, which is optimal
    end
end
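The active-set idea can be illustrated on a plain LASSO problem, min_β (1/2n)‖y − Xβ‖² + λ‖β‖₁: track a set of nonzero coordinates, optimize over it, drop zeroed coordinates, and bring in the worst violator of the optimality conditions. This is a loose sketch in that spirit, not the paper's solver; all names are illustrative:

```python
import numpy as np

# Active-set flavored coordinate descent for a plain LASSO problem.
def lasso_active_set(X, y, lam, n_iter=200):
    n, p = X.shape
    beta = np.zeros(p)
    active = set()
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / n
        # Violation v_j of the optimality condition for inactive coordinates:
        # distance of -grad_j to the subdifferential interval lam * [-1, 1]
        v = np.maximum(np.abs(grad) - lam, 0.0)
        if active:
            v[list(active)] = 0.0
        j = int(np.argmax(v))
        if v[j] > 1e-10:
            active.add(j)            # 3. bring in the worst violator
        # 1. optimize over the active set by coordinate soft-thresholding
        for k in list(active):
            r = y - X @ beta + X[:, k] * beta[k]   # partial residual
            z = X[:, k] @ r / n
            beta[k] = np.sign(z) * max(abs(z) - lam, 0.0) / (X[:, k] @ X[:, k] / n)
            if beta[k] == 0.0:
                active.discard(k)    # 2. drop newly zeroed variables
    return beta

# Tiny usage on synthetic data with a sparse true coefficient vector:
rng = np.random.default_rng(5)
X = rng.standard_normal((100, 8))
beta_true = np.zeros(8)
beta_true[0], beta_true[3] = 2.0, -1.5
y = X @ beta_true + 0.01 * rng.standard_normal(100)
beta_hat = lasso_active_set(X, y, lam=0.1)
print(np.flatnonzero(np.abs(beta_hat) > 0.2))   # recovered support: [0 3]
```

Maintaining the small active set is what keeps such methods efficient when the solution is very sparse, which is exactly the regime targeted here.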
63. Tuning the penalty parameter
What does the literature say?
Theory-based penalty choices
1. Optimal order of penalty in the p ≫ n framework: √(n log p)
Bunea et al. 2007, Bickel et al. 2009
2. Control of the probability of connecting two distinct connectivity sets
Meinshausen et al. 2006, Banerjee et al. 2008, Ambroise et al. 2009
practically much too conservative
Cross-validation
Optimal in terms of prediction, not in terms of selection
Problematic with small samples: changes the sparsity constraint due to sample size
64. Tuning the penalty parameter
BIC / AIC
Theorem (Zou et al. 2008)
df(β̂_λ^lasso) = ‖β̂_λ^lasso‖₀
Straightforward extensions to the graphical framework:
BIC(λ) = L(Θ̂_λ; X) − df(Θ̂_λ) (log n)/2
AIC(λ) = L(Θ̂_λ; X) − df(Θ̂_λ)
Rely on asymptotic approximations, but still relevant for small data sets
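Both criteria are cheap to evaluate along a regularization path. A minimal sketch, with the log-likelihood written exactly as on the earlier slide and degrees of freedom counted as nonzero parameters:

```python
import numpy as np

# BIC/AIC for a fitted precision matrix, following the slides' formulas.
def loglik(Theta, S, n):
    _, logdet = np.linalg.slogdet(Theta)
    return n / 2 * logdet - n / 2 * np.trace(S @ Theta) - n / 2 * np.log(2 * np.pi)

def df(Theta, tol=1e-8):
    return np.count_nonzero(np.abs(Theta) > tol)   # ||Theta||_0

def bic(Theta, S, n):
    return loglik(Theta, S, n) - df(Theta) * np.log(n) / 2

def aic(Theta, S, n):
    return loglik(Theta, S, n) - df(Theta)

# Along a path of estimates (one Theta per lambda), one would pick
# the lambda whose Theta maximizes BIC (or AIC).
print(bic(np.eye(2), np.eye(2), 10) < aic(np.eye(2), np.eye(2), 10))   # True for n > e^2
```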
66. Data Generation
We set
the number of nodes p
the number of edges K
the number of examples n
Process
1. Generate a random adjacency matrix with 2K nonzero off-diagonal terms
2. Compute the normalized Laplacian L
3. Generate a symmetric matrix of random signs R
4. Compute the concentration matrix K_ij = L_ij R_ij
5. Compute Σ by pseudo-inversion of K
6. Generate correlated Gaussian data ∼ N(0, Σ)
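The six steps above can be sketched as follows; the exact normalization and sign conventions are one possible reading of the recipe, not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(6)
p, K, n = 10, 10, 50

# 1. random adjacency with 2K off-diagonal terms (K symmetric edges)
A = np.zeros((p, p))
rows, cols = np.triu_indices(p, k=1)
idx = rng.choice(rows.size, size=K, replace=False)
A[rows[idx], cols[idx]] = A[cols[idx], rows[idx]] = 1.0

# 2. normalized Laplacian
deg = A.sum(axis=1)
d = np.where(deg > 0, 1.0 / np.sqrt(np.maximum(deg, 1)), 0.0)
L = np.eye(p) - d[:, None] * A * d[None, :]

# 3.-4. symmetric random signs, concentration matrix K_ij = L_ij * R_ij
R = np.sign(rng.standard_normal((p, p)))
R = np.triu(R, 1) + np.triu(R, 1).T + np.eye(p)
Kmat = L * R

# 5.-6. covariance by pseudo-inversion, then Gaussian sampling
# (sign flips can make Kmat indefinite in this sketch, hence check_valid="ignore")
Sigma = np.linalg.pinv(Kmat)
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n, check_valid="ignore")
print(X.shape)   # (50, 10)
```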
67. Simulating Related Tasks
Generate
1. an “ancestor” with p = 20 nodes and K = 20 edges
2. T = 4 children by adding and deleting δ edges
3. T = 4 Gaussian samples
Figure: ancestor and children with δ = 2 perturbations
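The perturbation step (deriving each child from the ancestor by deleting δ edges and adding δ new ones) can be sketched directly on adjacency matrices; the helper name is illustrative:

```python
import numpy as np

# Derive a child graph from the ancestor by deleting delta existing edges
# and adding delta absent ones, keeping the adjacency matrix symmetric.
def perturb(A, delta, rng):
    A = A.copy()
    rows, cols = np.triu_indices(A.shape[0], k=1)
    present = np.flatnonzero(A[rows, cols] > 0)
    absent = np.flatnonzero(A[rows, cols] == 0)
    for k in rng.choice(present, size=delta, replace=False):
        A[rows[k], cols[k]] = A[cols[k], rows[k]] = 0
    for k in rng.choice(absent, size=delta, replace=False):
        A[rows[k], cols[k]] = A[cols[k], rows[k]] = 1
    return A

rng = np.random.default_rng(7)
ancestor = np.zeros((20, 20))
r, c = np.triu_indices(20, k=1)
sel = rng.choice(r.size, size=20, replace=False)   # K = 20 edges
ancestor[r[sel], c[sel]] = ancestor[c[sel], r[sel]] = 1

children = [perturb(ancestor, delta=2, rng=rng) for _ in range(4)]
print(int(children[0].sum() / 2))   # 20: edge count is preserved
```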
80. Breast Cancer
Prediction of the outcome of preoperative chemotherapy
Two types of patients: the response can be classified either as
1. pathologic complete response (PCR)
2. residual disease (not PCR)
Gene expression data
133 patients (99 not PCR, 34 PCR)
26 identified genes (differential analysis)
82. Conclusions
To sum up
Clarified links between neighborhood selection and graphical
L ASSO
Identified the relevance of Multi-Task Learning in network
inference
First methods for inferring multiple Gaussian Graphical Models
Consistent improvements upon the available baseline solutions
Available in the R package SIMoNe
Perspectives
Explore model-selection capabilities
Other applications of the Cooperative-L ASSO
Theoretical analysis (uniqueness, selection consistency)