1. Inferring Multiple Graph Structures
Julien Chiquet¹, Yves Grandvalet², Christophe Ambroise¹
¹ Statistique et Génome, CNRS & Université d'Évry Val d'Essonne
² Heudiasyc, CNRS & Université de Technologie de Compiègne
NeMo – 21 June 2010
Chiquet, Grandvalet, Ambroise, arXiv preprint.
Inferring multiple Gaussian graphical structures.
Chiquet, Grasseau, Charbonnier and Ambroise, R package SIMoNe.
http://stat.genopole.cnrs.fr/~jchiquet/fr/softwares/simone
Inferring Multiple Graph Structures 1
2. Problem
Inference
few arrays ⇔ few examples
lots of genes ⇔ high dimension
interactions ⇔ very high dimension
Which interactions?
The main difficulty is the low-sample-size, high-dimensional setting.
Our main hope is to benefit from sparsity: few genes interact.
3. Handling the scarcity of data
Merge several experimental conditions
experiment 1 experiment 2 experiment 3
4. Handling the scarcity of data
Inferring each graph independently does not help
experiment 1 experiment 2 experiment 3
(X_1^(1), …, X_{n₁}^(1))   (X_1^(2), …, X_{n₂}^(2))   (X_1^(3), …, X_{n₃}^(3))
inference        inference        inference
5. Handling the scarcity of data
By pooling all the available data
experiment 1 experiment 2 experiment 3
(X_1, …, X_n), n = n₁ + n₂ + n₃.
inference
9. Outline
Statistical model
Multi-task learning
Algorithms and methods
Model selection
Experiments
11. Gaussian graphical modeling
Let
X = (X_1, …, X_p) ∼ N(0_p, Σ), and assume n i.i.d. copies of X,
X be the n × p matrix whose kth row is X_k,
Θ = (θ_ij)_{i,j∈P} = Σ⁻¹ be the concentration matrix.
Graphical interpretation
Since corr_{ij|P∖{i,j}} = −θ_ij / √(θ_ii θ_jj) for i ≠ j,
θ_ij = 0  ⇔  X_i ⊥ X_j | X_{P∖{i,j}}  ⇔  edge (i, j) ∉ network.
The nonzeroes of Θ describe the graph structure.
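The correspondence above can be checked on a toy precision matrix: zero entries of Θ are exactly the missing edges, and the off-diagonal entries give the partial correlations. A minimal numpy sketch (the matrix values are illustrative):

```python
import numpy as np

# Toy concentration (precision) matrix Theta = Sigma^{-1} for p = 3 variables.
# Zeroes in Theta correspond to missing edges in the conditional-independence graph.
Theta = np.array([[2.0, 0.6, 0.0],
                  [0.6, 2.0, 0.5],
                  [0.0, 0.5, 2.0]])

d = np.sqrt(np.diag(Theta))
# Partial correlation: corr_{ij|rest} = -theta_ij / sqrt(theta_ii * theta_jj)
partial_corr = -Theta / np.outer(d, d)
np.fill_diagonal(partial_corr, 1.0)

# Adjacency of the graph: nonzero off-diagonal entries of Theta
adjacency = (np.abs(Theta) > 1e-12) & ~np.eye(3, dtype=bool)
print(adjacency.astype(int))   # edges (1,2) and (2,3); no edge (1,3)
```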
12. The model likelihood
Let S = n⁻¹ XᵀX be the empirical variance–covariance matrix: S is a sufficient statistic for X ⇒ L(Θ; X) = L(Θ; S)
The log-likelihood
L(Θ; S) = (n/2) log det(Θ) − (n/2) trace(SΘ) − (n/2) log(2π).
The MLE of Θ is S⁻¹:
not defined for n < p
not sparse ⇒ fully connected graph
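The failure of the MLE for n < p is easy to see numerically: with fewer samples than variables, S has rank at most n, so S⁻¹ does not exist. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5, 10          # fewer samples than variables
X = rng.standard_normal((n, p))
S = X.T @ X / n       # empirical covariance, rank at most n < p

# S is singular, so the MLE Theta_hat = S^{-1} does not exist:
rank = np.linalg.matrix_rank(S)
print(rank)           # at most 5, strictly less than p = 10
```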
13. Penalized Approaches
Penalized Likelihood (Banerjee et al., 2008)
max_{Θ∈S₊} L(Θ; S) − λ ‖Θ‖₁
well defined for n < p
sparse ⇒ sensible graph
SDP of size O(p²) (solved by Friedman et al., 2007)
Neighborhood Selection (Meinshausen & Bühlmann, 2006)
β̂ = argmin_{β∈R^{p−1}} (1/n) ‖X_j − X_{∖j} β‖₂² + λ ‖β‖₁
where X_j is the jth column of X and X_{∖j} is X deprived of X_j
not symmetric, not positive-definite
p independent LASSO problems of size (p − 1)
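Neighborhood selection can be sketched in a few lines with an off-the-shelf LASSO solver: regress each variable on all the others and read the neighbors off the nonzero coefficients. A minimal illustration on synthetic data (the `alpha` value and the planted dependence are assumptions for the example, not from the talk):

```python
import numpy as np
from sklearn.linear_model import Lasso

# Neighborhood selection: p independent l1-penalized regressions,
# one per variable; nonzero coefficients give its neighbors.
rng = np.random.default_rng(1)
n, p = 50, 10
X = rng.standard_normal((n, p))
X[:, 1] = X[:, 0] + 0.5 * rng.standard_normal(n)   # make variables 0 and 1 dependent

neighbors = {}
for j in range(p):
    X_minus_j = np.delete(X, j, axis=1)
    fit = Lasso(alpha=0.2).fit(X_minus_j, X[:, j])
    others = [k for k in range(p) if k != j]
    neighbors[j] = [others[k] for k in np.flatnonzero(fit.coef_)]

print(neighbors[0])   # variable 1 should appear among the neighbors of 0
```

Note the asymmetry mentioned on the slide: nothing forces variable i to select j whenever j selects i, so a symmetrization rule (AND/OR) is needed afterwards.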
15. Neighborhood vs. Likelihood
Pseudo-likelihood (Besag, 1975)
P(X_1, …, X_p) ≈ ∏_{j=1}^{p} P(X_j | {X_k}_{k≠j})
The corresponding pseudo-log-likelihood, with D = diag(Θ):
L̃(Θ; S) = (n/2) log det(D) − (n/2) trace(S D⁻¹ Θ²) − (n/2) log(2π)
to be compared with the log-likelihood
L(Θ; S) = (n/2) log det(Θ) − (n/2) trace(SΘ) − (n/2) log(2π)
Proposition (Ambroise, Chiquet, Matias, 2008)
Neighborhood selection leads to the graph maximizing the penalized pseudo-log-likelihood.
Proof: β̂_i = −θ̃_ij / θ̃_jj, where Θ̃ = argmax_Θ L̃(Θ; S) − λ ‖Θ‖₁
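The identity at the heart of the proof (regression coefficients equal −θ_ij/θ_jj for a Gaussian vector) is easy to verify numerically; the covariance values below are toy numbers chosen for the check:

```python
import numpy as np

# Numerical check: the population regression coefficients of X_j on the
# remaining variables equal -theta_ij / theta_jj for a Gaussian vector.
Sigma = np.array([[1.0, 0.5, 0.2],
                  [0.5, 1.0, 0.3],
                  [0.2, 0.3, 1.0]])
Theta = np.linalg.inv(Sigma)

j = 0
others = [1, 2]
# Population regression coefficients of X_j on {X_i, i != j}:
coefs = np.linalg.solve(Sigma[np.ix_(others, others)], Sigma[others, j])
for k, i in enumerate(others):
    print(np.isclose(coefs[k], -Theta[i, j] / Theta[j, j]))   # True, True
```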
18. Multi-task learning
We have T samples (experimental conditions) of the same variables
X^(t) is the tth data matrix, S^(t) is the empirical covariance
examples are assumed to be drawn from N(0, Σ^(t))
Ignoring the relationships between the tasks leads to separable objectives:
max_{Θ^(t)∈R^{p×p}, t=1,…,T}  L(Θ^(t); S^(t)) − λ ‖Θ^(t)‖₁
Multi-task learning = solving the T tasks jointly
We may couple the objectives
through the fitting term,
through the penalty term.
20. Coupling through the fitting term
Intertwined LASSO
max_{Θ^(t), t=1,…,T}  Σ_{t=1}^{T} L(Θ^(t); S̃^(t)) − λ ‖Θ^(t)‖₁
S̄ = (1/n) Σ_{t=1}^{T} n_t S^(t) is the “pooled-tasks” covariance matrix.
S̃^(t) = α S^(t) + (1 − α) S̄ is a mixture of the task-specific and pooled covariance matrices.
α = 0 pools the data sets and infers a single graph
α = 1 separates the data sets and infers T graphs independently
α = 1/2 in all our experiments
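The intertwined covariance matrices are a one-line computation once the per-task covariances are available; a minimal sketch with arbitrary synthetic tasks:

```python
import numpy as np

# Intertwined covariances: each task mixes its own empirical covariance
# with the sample-size-weighted pooled covariance.
rng = np.random.default_rng(2)
p, sizes = 4, [30, 50, 20]
S_tasks = []
for n_t in sizes:
    X = rng.standard_normal((n_t, p))
    S_tasks.append(X.T @ X / n_t)

n = sum(sizes)
S_pooled = sum(n_t * S for n_t, S in zip(sizes, S_tasks)) / n

alpha = 0.5   # the talk's default
S_tilde = [alpha * S + (1 - alpha) * S_pooled for S in S_tasks]

# alpha = 0 would give the pooled matrix for every task;
# alpha = 1 would keep each task-specific matrix unchanged.
print(np.allclose(S_tilde[0], 0.5 * (S_tasks[0] + S_pooled)))   # True
```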
21. Coupling through penalties: group-LASSO
We group parameters by sets of corresponding edges across graphs (figure: two four-node example graphs on X_1, …, X_4; each edge (i, j) is grouped with its counterparts across graphs).
Graphical group-LASSO
max_{Θ^(t), t=1,…,T}  Σ_{t=1}^{T} L(Θ^(t); S^(t)) − λ Σ_{i≠j} ( Σ_{t=1}^{T} (θ_ij^(t))² )^{1/2}
Sparsity pattern shared between graphs
Identical graphs across tasks
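The group-LASSO penalty above is the sum, over off-diagonal positions (i, j), of the ℓ2 norm of the T coefficients of that edge. A minimal sketch on arbitrary matrices:

```python
import numpy as np

# Group-LASSO penalty on T precision matrices: each off-diagonal position
# (i, j) is one group gathering the T coefficients of that edge.
rng = np.random.default_rng(3)
T, p = 3, 4
Thetas = np.stack([np.eye(p) + 0.1 * rng.standard_normal((p, p)) for _ in range(T)])

penalty = 0.0
for i in range(p):
    for j in range(p):
        if i != j:
            group = Thetas[:, i, j]            # edge (i, j) across the T graphs
            penalty += np.linalg.norm(group)   # l2 norm of the group
print(penalty)
```

Because the ℓ2 norm of a group is zero only when all its entries are zero, a zeroed group removes the edge from all T graphs at once, hence the identical supports.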
26. Coupling through penalties: cooperative-LASSO
Same grouping, and bet that correlations are likely to be sign-consistent (figure: four-node example graphs where each edge keeps a consistent sign across tasks).
Gene interactions are either inhibitory or activating across assays.
Graphical cooperative-LASSO
max_{Θ^(t), t=1,…,T}  Σ_{t=1}^{T} L(Θ^(t); S^(t)) − λ Σ_{i≠j} [ ( Σ_{t=1}^{T} [θ_ij^(t)]₊² )^{1/2} + ( Σ_{t=1}^{T} [θ_ij^(t)]₋² )^{1/2} ]
where [u]₊ = max(0, u) and [u]₋ = min(0, u).
Plausible in many other situations
Sparsity pattern shared between graphs, which may differ
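The cooperative penalty splits each group into its positive and negative parts before taking ℓ2 norms, so an edge can be switched off in one sign direction while staying active in the other. A minimal sketch:

```python
import numpy as np

# Cooperative-LASSO penalty: each group is split into its positive and
# negative parts, and each part contributes its own l2 norm.
rng = np.random.default_rng(4)
T, p = 3, 4
Thetas = 0.2 * rng.standard_normal((T, p, p))

def coop_penalty(Thetas):
    pos = np.maximum(Thetas, 0.0)   # [u]_+ = max(0, u)
    neg = np.minimum(Thetas, 0.0)   # [u]_- = min(0, u)
    norms = np.sqrt((pos ** 2).sum(axis=0)) + np.sqrt((neg ** 2).sum(axis=0))
    mask = ~np.eye(Thetas.shape[1], dtype=bool)   # off-diagonal terms only
    return norms[mask].sum()

# When every group is sign-consistent, one of the two parts vanishes and
# the cooperative penalty coincides with the plain group-LASSO penalty.
print(coop_penalty(np.abs(Thetas)))
```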
35. A Geometric View of Sparsity
Constrained Optimization
max_{β₁,β₂} L(β₁, β₂) − λ Ω(β₁, β₂)   ⇔   max_{β₁,β₂} L(β₁, β₂)  s.t.  Ω(β₁, β₂) ≤ c
(figure: level sets of L and the constraint set in the (β₁, β₂) plane)
37. A Geometric View of Sparsity
Supporting Hyperplane
A hyperplane supports a set iff
the set is contained in one half-space
the set has at least one point on the hyperplane
(figure: supporting hyperplanes of a convex set in the (β₁, β₂) plane)
There are supporting hyperplanes at all points of convex sets: they generalize tangents.
56. Decomposition strategy
Estimate the jth neighborhood of the T graphs:
max_{K^(t), t=1,…,T}  Σ_{t=1}^{T} L̃(K^(t); S^(t)) − λ Ω(K^(t))
decomposes into p convex optimization problems of size T × (p − 1):
β̂_j = argmin_{β∈R^{T×(p−1)}}  f_j(β) + λ Ω(β),
where β̂_j is a minimizer iff 0 ∈ ∇_β f_j(β) + λ ∂_β Ω(β).
Group-LASSO:
Ω(β) = Σ_{i=1}^{p−1} ‖β_i^[1:T]‖₂
Coop-LASSO:
Ω(β) = Σ_{i=1}^{p−1} ( ‖[β_i^[1:T]]₊‖₂ + ‖[−β_i^[1:T]]₊‖₂ )
where β_i^[1:T] is the vector gathering the coefficients of edge (i, j) across the T graphs.
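The elementary operation behind solving these grouped subproblems is group soft-thresholding, which shrinks a whole group radially and zeroes it when its norm falls below the penalty level. A minimal sketch of this building block (the function name is illustrative):

```python
import numpy as np

# Group soft-thresholding: shrink the group toward zero by lam in l2 norm,
# zeroing the whole group when its norm is at most lam.
def group_soft_threshold(v, lam):
    norm = np.linalg.norm(v)
    if norm <= lam:
        return np.zeros_like(v)        # the whole group is switched off
    return (1.0 - lam / norm) * v      # otherwise shrink it radially

g = np.array([0.3, -0.4])              # one edge across T = 2 graphs, norm 0.5
print(group_soft_threshold(g, 0.25))   # shrunk but nonzero: [0.15, -0.2]
print(group_soft_threshold(g, 0.6))    # zeroed: [0.0, 0.0]
```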
59. Active set algorithm
// 0. INITIALIZATION
β ← 0, A ← ∅
while 0 ∉ ∂_β L(β) do
    // 1. MASTER PROBLEM: OPTIMIZATION WITH RESPECT TO β_A
    Find a solution h to the smooth problem
    ∇_h f(β_A + h) + λ ∂_h Ω(β_A + h) = 0, where ∂_h Ω = {∇_h Ω}.
    β_A ← β_A + h
    // 2. IDENTIFY NEWLY ZEROED VARIABLES
    while ∃ i ∈ A : β_i = 0 and min_{ν∈∂_β Ω} |∂f(β)/∂β_i + λν| = 0 do
        A ← A ∖ {i}
    end
    // 3. IDENTIFY NEW NON-ZERO VARIABLES
    // Select the candidate i ∈ Aᶜ that most violates the optimality conditions,
    // i.e. for which an infinitesimal change of β_i yields the highest reduction of L
    i ← argmax_{j∈Aᶜ} v_j, where v_j = min_{ν∈∂_β Ω} |∂f(β)/∂β_j + λν|
    if v_i ≠ 0 then
        A ← A ∪ {i}
    else
        Stop and return β, which is optimal
    end
end
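The active-set idea can be illustrated on a plain LASSO problem, min_β (1/2n)‖y − Xβ‖² + λ‖β‖₁: track a set of nonzero coordinates, optimize over it, drop zeroed coordinates, and bring in the worst violator of the optimality conditions. This is a loose sketch in that spirit, not the paper's solver; all names are illustrative:

```python
import numpy as np

# Active-set flavored coordinate descent for a plain LASSO problem.
def lasso_active_set(X, y, lam, n_iter=200):
    n, p = X.shape
    beta = np.zeros(p)
    active = set()
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / n
        # Violation v_j of the optimality condition for inactive coordinates:
        # distance of -grad_j to the subdifferential interval lam * [-1, 1]
        v = np.maximum(np.abs(grad) - lam, 0.0)
        if active:
            v[list(active)] = 0.0
        j = int(np.argmax(v))
        if v[j] > 1e-10:
            active.add(j)            # 3. bring in the worst violator
        # 1. optimize over the active set by coordinate soft-thresholding
        for k in list(active):
            r = y - X @ beta + X[:, k] * beta[k]   # partial residual
            z = X[:, k] @ r / n
            beta[k] = np.sign(z) * max(abs(z) - lam, 0.0) / (X[:, k] @ X[:, k] / n)
            if beta[k] == 0.0:
                active.discard(k)    # 2. drop newly zeroed variables
    return beta

# Tiny usage on synthetic data with a sparse true coefficient vector:
rng = np.random.default_rng(5)
X = rng.standard_normal((100, 8))
beta_true = np.zeros(8)
beta_true[0], beta_true[3] = 2.0, -1.5
y = X @ beta_true + 0.01 * rng.standard_normal(100)
beta_hat = lasso_active_set(X, y, lam=0.1)
print(np.flatnonzero(np.abs(beta_hat) > 0.2))   # recovered support: [0 3]
```

Maintaining the small active set is what keeps such methods efficient when the solution is very sparse, which is exactly the regime targeted here.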
63. Tuning the penalty parameter
What does the literature say?
Theory-based penalty choices
1. Optimal order of penalty in the p ≫ n framework: √(n log p)
Bunea et al. 2007, Bickel et al. 2009
2. Control of the probability of connecting two distinct connectivity sets
Meinshausen et al. 2006, Banerjee et al. 2008, Ambroise et al. 2009
practically much too conservative
Cross-validation
Optimal in terms of prediction, not in terms of selection
Problematic with small samples: changes the sparsity constraint due to sample size
64. Tuning the penalty parameter
BIC / AIC
Theorem (Zou et al. 2008)
df(β̂_λ^lasso) = ‖β̂_λ^lasso‖₀
Straightforward extensions to the graphical framework:
BIC(λ) = L(Θ̂_λ; X) − df(Θ̂_λ) (log n)/2
AIC(λ) = L(Θ̂_λ; X) − df(Θ̂_λ)
Rely on asymptotic approximations, but still relevant for small data sets
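Both criteria are cheap to evaluate along a regularization path. A minimal sketch, with the log-likelihood written exactly as on the earlier slide and degrees of freedom counted as nonzero parameters:

```python
import numpy as np

# BIC/AIC for a fitted precision matrix, following the slides' formulas.
def loglik(Theta, S, n):
    _, logdet = np.linalg.slogdet(Theta)
    return n / 2 * logdet - n / 2 * np.trace(S @ Theta) - n / 2 * np.log(2 * np.pi)

def df(Theta, tol=1e-8):
    return np.count_nonzero(np.abs(Theta) > tol)   # ||Theta||_0

def bic(Theta, S, n):
    return loglik(Theta, S, n) - df(Theta) * np.log(n) / 2

def aic(Theta, S, n):
    return loglik(Theta, S, n) - df(Theta)

# Along a path of estimates (one Theta per lambda), one would pick
# the lambda whose Theta maximizes BIC (or AIC).
print(bic(np.eye(2), np.eye(2), 10) < aic(np.eye(2), np.eye(2), 10))   # True for n > e^2
```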
66. Data Generation
We set
the number of nodes p
the number of edges K
the number of examples n
Process
1. Generate a random adjacency matrix with 2K nonzero off-diagonal terms
2. Compute the normalized Laplacian L
3. Generate a symmetric matrix of random signs R
4. Compute the concentration matrix K_ij = L_ij R_ij
5. Compute Σ by pseudo-inversion of K
6. Generate correlated Gaussian data ∼ N(0, Σ)
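The six steps above can be sketched as follows; the exact normalization and sign conventions are one possible reading of the recipe, not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(6)
p, K, n = 10, 10, 50

# 1. random adjacency with 2K off-diagonal terms (K symmetric edges)
A = np.zeros((p, p))
rows, cols = np.triu_indices(p, k=1)
idx = rng.choice(rows.size, size=K, replace=False)
A[rows[idx], cols[idx]] = A[cols[idx], rows[idx]] = 1.0

# 2. normalized Laplacian
deg = A.sum(axis=1)
d = np.where(deg > 0, 1.0 / np.sqrt(np.maximum(deg, 1)), 0.0)
L = np.eye(p) - d[:, None] * A * d[None, :]

# 3.-4. symmetric random signs, concentration matrix K_ij = L_ij * R_ij
R = np.sign(rng.standard_normal((p, p)))
R = np.triu(R, 1) + np.triu(R, 1).T + np.eye(p)
Kmat = L * R

# 5.-6. covariance by pseudo-inversion, then Gaussian sampling
# (sign flips can make Kmat indefinite in this sketch, hence check_valid="ignore")
Sigma = np.linalg.pinv(Kmat)
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n, check_valid="ignore")
print(X.shape)   # (50, 10)
```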
67. Simulating Related Tasks
Generate
1. an “ancestor” with p = 20 nodes and K = 20 edges
2. T = 4 children by adding and deleting δ edges
3. T = 4 Gaussian samples
Figure: ancestor and children with δ = 2 perturbations
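The perturbation step (deriving each child from the ancestor by deleting δ edges and adding δ new ones) can be sketched directly on adjacency matrices; the helper name is illustrative:

```python
import numpy as np

# Derive a child graph from the ancestor by deleting delta existing edges
# and adding delta absent ones, keeping the adjacency matrix symmetric.
def perturb(A, delta, rng):
    A = A.copy()
    rows, cols = np.triu_indices(A.shape[0], k=1)
    present = np.flatnonzero(A[rows, cols] > 0)
    absent = np.flatnonzero(A[rows, cols] == 0)
    for k in rng.choice(present, size=delta, replace=False):
        A[rows[k], cols[k]] = A[cols[k], rows[k]] = 0
    for k in rng.choice(absent, size=delta, replace=False):
        A[rows[k], cols[k]] = A[cols[k], rows[k]] = 1
    return A

rng = np.random.default_rng(7)
ancestor = np.zeros((20, 20))
r, c = np.triu_indices(20, k=1)
sel = rng.choice(r.size, size=20, replace=False)   # K = 20 edges
ancestor[r[sel], c[sel]] = ancestor[c[sel], r[sel]] = 1

children = [perturb(ancestor, delta=2, rng=rng) for _ in range(4)]
print(int(children[0].sum() / 2))   # 20: edge count is preserved
```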
80. Breast Cancer
Prediction of the outcome of preoperative chemotherapy
Two types of patients: the response can be classified either as
1. pathologic complete response (PCR)
2. residual disease (not PCR)
Gene expression data
133 patients (99 not PCR, 34 PCR)
26 identified genes (differential analysis)
82. Conclusions
To sum up
Clarified links between neighborhood selection and graphical
L ASSO
Identified the relevance of Multi-Task Learning in network
inference
First methods for inferring multiple Gaussian Graphical Models
Consistent improvements upon the available baseline solutions
Available in the R package SIMoNe
Perspectives
Explore model-selection capabilities
Other applications of the Cooperative-L ASSO
Theoretical analysis (uniqueness, selection consistency)