Learning probabilistic networks of condition-specific response:
           Digging deep in yeast stationary phase
         Sushmita Roy∗ , Terran Lane∗ , and Margaret Werner-Washburne+
           ∗ Department of Computer Science, University of New Mexico
           + Department of Biology, University of New Mexico


                                           Abstract

Condition-specific networks are functional networks of genes describing molecular behavior un-
der different conditions such as environmental stresses, cell types, or tissues. These networks
frequently comprise parts that are unique to each condition, and parts that are shared among
related conditions. Existing approaches for learning condition-specific networks typically iden-
tify either only differences or similarities across conditions. Most of these approaches first learn
networks per condition independently, and then identify similarities and differences in a post-
learning step. Such approaches have not exploited the shared information across conditions
during network learning.
   We describe an approach for learning condition-specific networks that simultaneously identi-
fies the shared and unique subgraphs during network learning, rather than as a post-processing
step. Our approach learns networks across condition sets, shares data from conditions, and leads
to high quality networks capturing biologically meaningful information.
   On simulated data from two conditions, our approach outperformed an existing approach
of learning networks per condition independently, especially on small training datasets. We
further applied our approach to microarray data from two yeast stationary-phase cell popu-
lations, quiescent and non-quiescent. Our approach identified several functional interactions
that suggest respiration-related processes are shared across the two conditions. We also iden-
tified interactions specific to each population including regulation of epigenetic expression in
the quiescent population, consistent with known characteristics of these cells. Finally, we found
several high confidence cases of combinatorial interaction among single gene deletions that can
be experimentally tested using double gene knock-outs, and contribute to our understanding of
differentiated cell populations in yeast stationary phase.



1    Introduction

Although the DNA for an organism is relatively constant, every organism on earth has the po-

tential to respond to different environmental stimuli or to differentiate into distinct cell-types or

tissues. Different environmental conditions, cell-types or tissues can be considered as different in-

stantiations of a global variable, the condition variable, which induces condition-specific responses.

These condition-specific responses typically require global changes at the transcript, protein and

metabolic levels and are of interest as they provide insight into how organisms function at a systems

level. Condition-specific networks describe functional interactions among genes and other macro-

molecules under different conditions, providing a systemic view of condition-specific behavior in

organisms.

    Analysis of condition-specific responses has been one of the principal goals of molecular biology,

and several approaches have been developed to capture condition-specific responses at different

levels of granularity. The most common approach is the identification of differentially expressed

genes in a condition of interest using genome-wide measurements of gene, and often protein,

expression [20]. More recent approaches are based on bi-clustering, which clusters genes and conditions

simultaneously [5,7,9,29], and identify sets of genes that are co-regulated in sets of conditions. How-

ever, these approaches do not provide fine-grained interaction structure that explains the condition-

specific response of genes. More advanced approaches additionally identify transcription modules

(set of transcription factors regulating a set of target genes) that are co-expressed in a condition-

specific manner [11,13,26,31], but these too do not provide detailed interaction information among

genes for each condition.

    In this paper, we describe a novel approach, Network Inference with Pooling Data (NIPD), for

condition-specific response analysis that emphasizes the fine-grained interaction patterns among

genes under different conditions. The main conceptual contribution of our approach is to learn

networks for any subset of conditions. This subsumes existing approaches that find either only

patterns that are specific to each condition, or only patterns that are shared across conditions.

To make this clear, let us consider a simple example of two environmental starvation conditions:

Carbon and Nitrogen starvation. Using our approach we can simultaneously find patterns that are


specific only to Carbon starvation, only to Nitrogen starvation, and those that are shared across

these two conditions. From the methodological stand-point our work is similar to Bayesian multi-

nets [10], which we extend by allowing data to be pooled across conditions and learning networks

for any subset of conditions.

   NIPD is based on the framework of probabilistic graphical models (PGMs), where edges rep-

resent pairwise and higher-order statistical dependencies among genes. Similar to existing PGM

learning algorithms, NIPD infers networks by iteratively scoring candidate networks and selecting

the network with the highest score [12]. However, NIPD uses a novel score that evaluates candidate

networks with respect to data from any subset of conditions, pooling data for subsets with more

than one condition. This subset score and search strategy of NIPD incorporate and exploit the

shared information across the conditions during structure learning, rather than as a post-processing

step. As a result, we are able to identify sub-networks not only specific to one condition, but to mul-

tiple conditions simultaneously, which allows us to build a more holistic picture of condition-specific

response.

   The data pooling aspect of NIPD makes more data available for estimating parameters for

higher-order interactions, i.e., interactions among more than two genes. This enables NIPD to

robustly estimate higher-order interactions, which are more difficult to estimate due to the high

number of parameters relative to pairwise dependencies.

   By formulating NIPD in the framework of PGMs we have additional benefits: (a) PGMs are

generative models of the data, providing a system-wide description of the condition-specific behavior

as a probabilistic network, (b) the probabilistic component naturally handles noise in the data,

(c) the graph structure captures condition-specific behavior at the level of gene-gene interactions,

rather than coarse clusters of genes, (d) the PGM framework can be easily extended to more

complex situations where the condition variable itself may be a random variable that must be

inferred during network learning. We implement NIPD with undirected, probabilistic graphical

models [14]. However, the NIPD framework is applicable to directed graphs as well.

   We are not the first to propose networks for capturing condition-specific behavior [24, 34].

Several network-based approaches have been developed for capturing condition-specific behavior



such as disease-specific subgraphs in cancer [8], stress response networks in yeast [21], or networks

across different species [4,28]. However, these approaches are not probabilistic in nature, often rely

on the network being known, and are restricted to pairwise co-expression relationships rather than

general statistical dependencies. Other approaches such as differential dependency networks [34],

and mixture of subgraphs [24], construct probabilistic models, but focus on differences rather

than both differences and similarities. The majority of these approaches infer a network for each

condition separately, and then compare the networks from different conditions to identify the edges

capturing condition-specific behavior.

    We compared NIPD against an existing approach for learning networks from the conditions

independently. We refer to this approach as INDEP, which represents a general class of existing

algorithms that learn networks per condition independently. On simulated data from networks

with known ground truth, NIPD inferred networks with higher quality than did INDEP, especially

on small training datasets. We also applied our approach to microarray data from two yeast

(Saccharomyces cerevisiae) cell types, quiescent and non-quiescent, isolated from glucose-starved,

stationary phase cultures [2]. Networks learned by NIPD were associated with many more Gene

Ontology biological processes [3], or were enriched in targets of known transcription factors (TFs)

[17], than networks learned by INDEP. Many of the TFs were involved in stress response, which

is consistent with the fact that the populations are under starvation stress. NIPD also identified

many more shared edges, which represent biologically meaningful dependencies, than the INDEP

approach. This suggests that by pooling data from multiple conditions, we are able to not only

capture shared structures better, but also to infer networks with higher overall quality.



2    Results

The goal of our experiments was threefold: (a) to examine the quality of condition-specific networks

inferred by our approach that combines data from different conditions (NIPD) versus an

independent learner (INDEP), (b) to evaluate the algorithmic performance (measured by network

structure quality) as a function of training data size, and (c) to analyze how two different cell populations

behave, at the network level, in response to the same starvation stress. We address (a) and (b)

on simulated data from networks with known topology, giving us ground truth to directly validate

the inferred networks. We address (c) on microarray data from two yeast cell populations isolated

from glucose-starved stationary phase cultures [2].


2.1   NIPD had superior performance on networks with known ground truth

We simulated data from two sets of networks, each set with two networks, one network per condition.

In the first, HIGHSIM, the networks for the two conditions shared a larger portion (60%) of the

edges, and in the second, LOWSIM, the networks shared a smaller (20%) portion of the edges.

We compared the networks inferred by NIPD to those inferred by INDEP by assessing the match

between true and inferred node neighborhoods (See Supplementary Methods). Briefly, the data were

split into q partitions, where q ∈ {2, 4, 6, 8, 10}, and networks were learned for each partition. The size of

the training data decreased with increasing q. We first evaluated overall network structure quality

by obtaining the number of nodes on which one approach was significantly better (t-test p-value

< 0.05) in capturing its neighborhood as a function of q. On LOWSIM, NIPD was significantly

better for smaller amounts of training data. On HIGHSIM, NIPD performed significantly better

than INDEP for all training data sizes (Fig 1).
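
   The neighborhood comparison can be illustrated with a minimal sketch. The exact measure is given in the Supplementary Methods, which are not reproduced here, so the per-node F1 score below is an assumption chosen for illustration, and the function names `neighborhoods` and `neighborhood_f1` are hypothetical. The t-test across partitions is not shown.

```python
def neighborhoods(edges, nodes):
    """Map each node to the set of its neighbors in an undirected edge list."""
    nbrs = {v: set() for v in nodes}
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    return nbrs


def neighborhood_f1(true_edges, inferred_edges, nodes):
    """Per-node F1 between true and inferred neighbor sets.

    Assumption: F1 stands in for the neighborhood-match measure,
    which the main text does not spell out.
    """
    true_n = neighborhoods(true_edges, nodes)
    inf_n = neighborhoods(inferred_edges, nodes)
    scores = {}
    for v in nodes:
        tp = len(true_n[v] & inf_n[v])  # correctly recovered neighbors
        precision = tp / len(inf_n[v]) if inf_n[v] else 0.0
        recall = tp / len(true_n[v]) if true_n[v] else 0.0
        denom = precision + recall
        scores[v] = 2 * precision * recall / denom if denom else 0.0
    return scores


# Toy example with four hypothetical genes.
nodes = ["g1", "g2", "g3", "g4"]
true_edges = [("g1", "g2"), ("g2", "g3"), ("g3", "g4")]
inferred_edges = [("g1", "g2"), ("g2", "g4")]
scores = neighborhood_f1(true_edges, inferred_edges, nodes)
```

   Computing such a score per node for each of the q partitions gives, for every node, two samples (one per method) that a t-test could then compare.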

   Next, we evaluated how well the shared edges were captured as a function of decreasing amounts

of training data (Supplementary Fig 1). NIPD captured shared edges better than INDEP on

LOWSIM as the amounts of training data decreased. NIPD was better than INDEP on HIGHSIM

regardless of the size of the training data.

   Our results show that when the underlying networks corresponding to the different conditions

share a lot of structure, NIPD has a significant advantage over INDEP, which does not do

any pooling. Furthermore, as training data size decreases, NIPD is better than INDEP for learning

both overall and shared structures, independent of the extent of sharing in the true networks.


2.2   Application to yeast quiescence

We applied NIPD to microarray data from two yeast cell populations, quiescent (QUIESCENT)

and non-quiescent (NON-QUIESCENT), isolated from glucose starvation-induced stationary phase


cultures [2]. The two cell populations are in the same media but have differentiated physiologically

and morphologically, suggesting that each population is responding differently. We learned networks

using NIPD and INDEP treating each cell population as a condition. Because each array in the

dataset was obtained from a single gene deletion mutant, the networks were constrained such that

genes with deletion mutants connected to the remaining genes (footnote 1).

      The inferred networks from both methods were evaluated using information from Gene Ontology

(GO) process, GO Slim [3] and transcriptional regulatory networks [17]. Gene Ontology is a

hierarchically structured ontology of terms used to annotate genes. GO Slim is a collapsed single

level view of the complete GO terms, providing high level information of the processes, functions

and cellular locations involving a set of genes. Finally, we analyzed combinations of genes with

deletions that were in the neighborhood of other non-deletion genes.


2.2.1      NIPD identified more biologically meaningful dependencies

To determine if one network was more biologically meaningful than the other, we examined the net-

works based on Gene Ontology (GO) slim category (process, function and location), transcription

factor binding data and GO process, referred to as GOSLIM, TFNET and GOPROC, respectively

(Fig 2). Network quality was determined by the number of GOSLIM categories (or TFNET or

GOPROC) with better coverage than random networks (See Methods). Both approaches were

equivalent for GOSLIM, with INDEP outperforming NIPD in QUIESCENT and NIPD outperforming

INDEP on NON-QUIESCENT. NIPD outperformed INDEP by a larger margin than

it was outperformed on TFNET categories from NON-QUIESCENT. NIPD was consistently better

than INDEP on GOPROC categories.

      The networks learned by NIPD had many more edges than the networks learned by INDEP

(Supplementary Table 1). To estimate the proportion of the edges capturing biologically meaningful

relationships, we computed semantic similarity of genes connected by the edges [16]. Although both

INDEP and NIPD had significantly better semantic similarity than random networks, INDEP

degraded in p-value for QUIESCENT at the highest value of semantic similarity (Fig 3). NIPD-inferred

networks had many more edges with high semantic similarity than INDEP, while keeping

the proportion of edges satisfying a particular semantic similarity threshold close to INDEP. This

suggests that NIPD identifies more dependencies that are biologically relevant than INDEP without

suffering in precision.

  1 This is not a bipartite graph because the genes with deletion mutants are allowed to connect to each other.


2.2.2   NIPD identified more shared edges representing common starvation response

We performed a more fine-grained analysis of the inferred networks by considering each gene and

its immediate neighborhood and tested whether these gene neighborhoods were enriched in GO

biological processes, or in the target set of transcription factors (TFs) (See Methods). Using a false

discovery rate (FDR) cutoff of 0.05, we identified many more subgraphs in the networks inferred

by NIPD than by INDEP to be enriched in a GO process or in targets of TFs (Figs 4, 5). NIPD

identified more processes and larger subgraphs in both populations (oxidative phosphorylation,

protein folding, fatty acid metabolism, ammonium transport) than did INDEP.
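
   As an illustration of this kind of neighborhood enrichment analysis, the sketch below pairs a hypergeometric tail test with Benjamini-Hochberg FDR correction. The specific test used in this work is described in the Methods and is not confirmed here, so both choices are assumptions, and the function names are hypothetical.

```python
from math import comb


def hypergeom_tail(N, K, n, k):
    """P(X >= k) when n genes are drawn from N genes, K of which are annotated.

    Assumption: a hypergeometric test is used as a stand-in for the
    enrichment test applied to each gene neighborhood.
    """
    upper = min(K, n)
    total = comb(N, n)
    hits = sum(comb(K, x) * comb(N - K, n - x) for x in range(k, upper + 1))
    return hits / total


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda idx: pvalues[idx])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical usage: keep neighborhoods whose adjusted p-value is below 0.05.
raw = [0.01, 0.04, 0.03, 0.5]
adj = benjamini_hochberg(raw)
enriched = [i for i, p in enumerate(adj) if p < 0.05]
```

   Here `N` would be the number of genes in the network, `K` the number annotated with a given GO process or TF, `n` the neighborhood size, and `k` the number of annotated genes in the neighborhood.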

   NIPD-identified subgraphs involved in aerobic respiration and oxidative phosphorylation were

enriched in targets of HAP4, a global activator for respiration genes. The presence of HAP4 targets

in both cell populations makes sense because both populations are experiencing glucose starvation

and must switch to respiration for deriving energy. We also found the TFs, MSN2, MSN4, and

HSF1, regulating subgraphs involved in protein folding. These TFs activate stress responses and

are known to activate genes involved in heat, oxidative and starvation stress. We also found

targets of SIP4 in both populations. SIP4 is a transcriptional activator of gluconeogenesis [32],

expressed highly in glucose repressed cells [15], and therefore would be expected to be present in

both quiescent and non-quiescent cells. In contrast, the only shared regulatory connection found

by INDEP was HAP4. We conclude that the NIPD approach identified more networks that were

biologically relevant and informative about glucose starvation response than did INDEP.




2.2.3    Wiring differences in NIPD-inferred networks exhibit population-specific star-

         vation response

NIPD identified several processes associated exclusively with quiescent cells. These included regulatory

processes (regulation of epigenetic gene expression, and regulation of nucleobase, nucleoside

and nucleic acid metabolism) and metabolic processes (pentose phosphate shunt). These were

novel predictions that highlight differences between these cells based on network wiring. INDEP

identified only one population-specific GO process (response to reactive oxygen species in NON-QUIESCENT).

An INDEP-identified subgraph specific to QUIESCENT (protein de-ubiquitination) was

actually a subset of the NIPD-identified subgraph involved in epigenetic gene expression regulation,

indicating that NIPD subsumed most of the information captured by INDEP.

   NIPD QUIESCENT networks contained subgraphs enriched exclusively in targets of SKO1, and

AZF1. Both of these are zinc finger TFs: AZF1 protein is expressed highly under non-fermentable

carbon sources [27], and SKO1 regulates low-affinity glucose transporters [30]; both observations are

consistent with the condition experienced by these cells. Unlike NIPD, which identified SIP4 to

be associated with both populations, INDEP identified SIP4 only in QUIESCENT. However, as

we described in the previous section, it is more likely that SIP4 is involved in both QUIESCENT

and NON-QUIESCENT populations. INDEP also found the TFs YAP7 and AFT2 exclusively in

QUIESCENT and NON-QUIESCENT, respectively. YAP7 is involved in general stress response

and would be expected to have targets in both QUIESCENT and NON-QUIESCENT. AFT2 is

required under oxidative stress and is consistent with the over-abundance of reactive oxygen species

in NON-QUIESCENT population [1].

   NIPD also identified wiring differences in the subgraphs involved in shared processes. For example,

in addition to HAP4, NIPD identified HAP2 as an important TF in QUIESCENT. The

presence of both HAP2 and HAP4 makes biological sense because they are both part of the

HAP2/HAP3/HAP4/HAP5 complex required for activation of respiratory genes. The presence

of both HAP2 and HAP4 in QUIESCENT, but not NON-QUIESCENT, suggests that the QUIESCENT

population may be better equipped for respiration and long term survival in stationary

phase.


   Overall, the NIPD-inferred networks captured key differences and similarities in metabolic and

regulatory processes, which are consistent with existing information about these cell populations

[1,2], and also include novel findings that can provide new insight into starvation response in yeast.


2.2.4     NIPD identified several knock-out combinations

The microarrays used in this study measured expression profiles of single-gene deletion mutants for genes that were

previously identified to be highly expressed at the mRNA level in stationary phase. We constrained

the inferred networks to identify neighborhoods of genes comprising only the genes with deletion

mutants, allowing us to identify combinations of such deletion mutants and their targets. Such com-

binations can be validated in the laboratory to verify cross-talk between pathways. We found that

NIPD-inferred networks contained significantly more deletion combinations compared to random

networks for both the quiescent and non-quiescent populations (p-value < 3E-10, Supplementary

Tables 3, 4, 5), which was not the case for the INDEP-identified networks (Supplementary Tables 6,

7).

      A more stringent analysis of the knock-out combinations using GO process semantic similar-

ity identified several double knock-out and target gene candidates (Supplementary Table 2). We

also found more deletion combinations in NON-QUIESCENT compared to QUIESCENT. This is

consistent with the identification of many more mutants affecting non-quiescent than quiescent

cells [2]. In QUIESCENT, we found three genes that were all likely down-stream targets of a

COX7-QCR8 double knock-out, all involved in the cytochrome-c oxidase complex of the mitochondrial

inner membrane. Other deletion mutant combinations were involved in mitochondrial

ATP synthesis and ion transport. Many of these genes have been shown to be required for quiescent

or non-quiescent cell function, viability and survival [2, 18]. In NON-QUIESCENT, we found

several knock-out combinations involved in oxidative phosphorylation, aerobic respiration, etc., including

a novel combination, YMR31 and QCR8, connected to TPS2. All three genes are found in

the mitochondria, which play a critical and complex role in starved cells, but the exact mechanisms

are not well-understood. Experimental analysis of this triplet can provide new insights into the role

of mitochondria in glucose-starved cells. In summary, these results demonstrated another benefit



of data pooling in NIPD: learning more complex, combinatorial relationships among genes.



3    Discussion

Inference and analysis of cellular networks has been one of the cornerstones of systems biology.

We have developed a network learning approach, Network Inference with Pooling Data (NIPD), to

capture a systemic view of condition-specific response. NIPD is based on probabilistic graphical

models and infers the functional wiring among genes involved in condition-specific response. The

crux of our approach is to learn networks for any subset of conditions capturing fine-grained gene

interaction patterns not only in individual conditions but in any combination of conditions. This

allows NIPD to robustly identify both shared and unique components of condition-specific cellular

networks. In comparison to an approach that learns networks independently (INDEP), NIPD

(a) pools data across different conditions, enabling better exploitation of the shared information

between conditions, (b) learns better overall network structures in the face of decreasing amounts

of training data, and (c) learns structures with many more biologically meaningful dependencies.

    Small training data sets, which are especially common for biological data, present significant

challenges for any network learning approach. In particular, approaches such as INDEP may learn

drastically different networks due to small data perturbations leading to differences that are not

biologically meaningful. NIPD is more resilient to small perturbations because by pooling data

from different conditions during network learning, NIPD effectively has more data for estimating

parameters for the shared parts of the network.

    Another challenge in the analysis of condition-specific networks is to extract patterns that

are shared across conditions. Approaches such as INDEP that learn networks for each condition

independently, and then compare the networks, are more likely to learn different networks making

it difficult to identify the similarities across conditions. Application of both NIPD and INDEP

approaches to microarray data from two yeast populations showed that many of the subgraphs that

would be considered specific to each population by INDEP, were actually shared biological processes

that must be activated in both populations irrespective of their morphological and physiological

differences.

One of the strengths of NIPD in comparison with INDEP was its ability to identify pairs of gene

deletions and downstream targets using data from individual gene deletions. Amazingly, several

of these gene deletions are already known to have a phenotypic effect on stationary phase cultures

and often on quiescent or non-quiescent cells (Supplementary Table 2) [2,18]. These predictions are

therefore good candidates for future experiments using double deletion mutants, and are a drastic

reduction of the space of possible combinations of sixty-nine single gene deletions. Identification of

population-specific malfunctions in signaling pathways via experimental analysis of these multiple

deletions can provide new insight into aging and cancer studies using yeast stationary phase as a

model system.

    The NIPD approach establishes ground-work for important future enhancements, including the

ability to efficiently learn networks from many conditions. The probabilistic framework of NIPD can

be easily extended to automatically infer the condition variable to make NIPD widely applicable to

datasets with uncertainty about the conditions. The NIPD approach can also integrate novel types

of high-throughput data including RNA-seq [33] and ChIP-seq [25]. These extensions will allow

us to systematically identify the parts, and the wiring among them that determine stage-specific,

tissue-specific and disease-specific behavior in whole organisms.



4     Methods

4.1   Independent learning of condition-specific networks: INDEP

Existing approaches for learning condition-specific networks [4, 21, 28] can be considered as special

cases of a general independent learning approach, INDEP, where networks for each condition

are learned independently and then compared to identify network parts unique or shared across

conditions.

    Let {D1 , · · · , Dk } denote k datasets from k conditions. In the INDEP approach, each network

Gc , 1 ≤ c ≤ k, is learned independently using data from Dc only. Our implementation of the

INDEP framework considered each Gc as an undirected probabilistic graphical model, or a Markov

random field (MRF) [14], which, like Bayesian networks, can capture higher-order dependencies,



but additionally captures cyclic dependencies. We use a pseudo-likelihood framework with an

MDL penalty to learn the structure of the MRF [6]. The pseudo-likelihood score for a network
$G_c$ describing data $D_c$ is

$$\mathrm{PLL}(G_c) = \sum_{i=1}^{N} \mathrm{PLLV}(X_i, M_{ci}, c),$$

where $X_1, \cdots, X_N$ are the random variables (one for each gene), encoding the expression value of a gene. PLLV is $X_i$'s contribution to the overall pseudo-likelihood and is defined, including a minimum description length (MDL) penalty, as

$$\mathrm{PLLV}(X_i, M_{ci}, c) = \sum_{d=1}^{|D_c|} \log P(X_i = x_{di} \mid M_{ci} = m_{cdi}) + \frac{|\theta_{ci}| \log(|D_c|)}{2}.$$

Here $M_{ci}$ is the Markov blanket (MB) of $X_i$ in condition $c$, and $x_{di}$ and $m_{cdi}$ are assignments to $X_i$ and $M_{ci}$, respectively, from the $d$th data point. $\theta_{ci}$ are the parameters of the conditional distribution $P(X_i \mid M_{ci})$. We

assume the conditional distributions to be conditional Gaussians. The structure learning algorithm

for each graph is described in [22].
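
A minimal sketch of the per-variable score may help make the definition concrete. It assumes each conditional Gaussian is a linear regression of a gene on its Markov blanket with a maximum likelihood variance, and it subtracts the MDL term so that higher scores are better; that sign convention, the parameter count, and the names `gaussian_pllv` and `pseudo_log_likelihood` are assumptions of this sketch rather than details taken from the implementation in [22].

```python
import numpy as np


def gaussian_pllv(x, blanket):
    """Penalized log-likelihood of one gene given its Markov blanket.

    x: (n,) expression values; blanket: (n, m) array, where m may be 0.
    Assumption: linear-Gaussian conditional fitted by least squares.
    """
    n = x.shape[0]
    design = np.column_stack([np.ones(n), blanket])  # intercept + blanket genes
    coef, *_ = np.linalg.lstsq(design, x, rcond=None)
    resid = x - design @ coef
    var = max(float(np.mean(resid ** 2)), 1e-12)     # ML variance, floored
    loglik = -0.5 * n * (np.log(2.0 * np.pi * var) + 1.0)
    n_params = design.shape[1] + 1                   # weights plus variance
    return loglik - 0.5 * n_params * np.log(n)       # MDL-style penalty


def pseudo_log_likelihood(data, blankets):
    """Sum of per-gene scores; blankets[i] lists column indices of gene i's MB."""
    return sum(
        gaussian_pllv(data[:, i], data[:, list(blankets[i])])
        for i in range(data.shape[1])
    )
```

Under these assumptions, a candidate graph for one condition is scored by passing that condition's data matrix (rows are arrays, columns are genes) together with the Markov blankets implied by the graph.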


4.2    Network Inference with Pooling Data: NIPD

The NIPD approach that we present extends the INDEP approach by incorporating shared infor-

mation across conditions during structure learning. In this framework, we do not learn networks

for each condition c separately. Instead, we devise a score for each edge addition that considers

networks for any subset of the conditions. Let C denote the set of k conditions. For a non-singleton

set, E ⊆ C, we pool the data from all conditions e ∈ E and evaluate the overall score improve-

ment on adding an edge to networks for all e ∈ E. To learn {G1 , · · · , Gk } for the k conditions

simultaneously, we maximize the following MDL-based score:


$$S(G_1, \cdots, G_k) = P(D_1, \cdots, D_k \mid \theta_1, \cdots, \theta_k)\, P(\theta_1, \cdots, \theta_k \mid G_1, \cdots, G_k) + \text{MDL Penalty} \tag{1}$$

Here $\theta_1, \cdots, \theta_k$ are the maximum likelihood parameters for the $k$ graphs. We assume $P(D_c \mid \theta_1, \cdots, \theta_k) = P(D_c \mid \theta_c)$. That is, if we know the parameters $\theta_c$, the likelihood of the data from condition $c$, $D_c$, given $\theta_c$ can be estimated independently. Thus, $P(D_1, \cdots, D_k \mid \theta_1, \cdots, \theta_k) = \prod_{c=1}^{k} P(D_c \mid \theta_c)$. Because our networks are MRFs, we use pseudo-likelihood $\mathrm{PLL}(D_c)$. We expand the complete condition-specific parameter set $\theta_c$ to $\{\theta_{c1}, \cdots, \theta_{cN}\}$, which is the set of parameters of each variable $X_i$, $1 \le i \le N$,




in condition c. Using the parameter modularity assumption for each variable, we have:

$$P(\theta_1, \cdots, \theta_k \mid G_1, \cdots, G_k) = \prod_{i=1}^{N} P(\theta_{1i}, \cdots, \theta_{ki} \mid M_{1i}, \cdots, M_{ki}) \tag{2}$$


Note the parameters of conditional probabilities of individual random variables are independent, but

the parameters per variable are not independent across conditions. To enforce dependency among

the θci , we make Mci depend on all the neighbors of Xi in condition c and all sets of conditions

that include c. To convey the intuition behind this idea, let us consider the two condition case

$C = \{A, B\}$. A variable $X_j$ can be in $X_i$'s MB in condition $A$ either if it is connected to $X_i$ only in condition $A$, or if it is connected to $X_i$ in both conditions $A$ and $B$. Let $M^*_{Ai}$ be the set of variables that are connected to $X_i$ only in condition $A$ but not in both $A$ and $B$. Similarly, let $M^*_{\{A,B\}i}$ denote the set of variables that are connected to $X_i$ in both $A$ and $B$ conditions. Hence, $M_{Ai} = M^*_{Ai} \cup M^*_{\{A,B\}i}$. More generally, for any $c \in C$,

$$M_{ci} = \bigcup_{E \in \mathrm{powerset}(C) \,:\, c \in E} M^*_{Ei},$$

where $M^*_{Ei}$ denotes the neighbors of $X_i$ only in condition set $E$. To incorporate this dependency in the structure score, we need to define $P(X_i \mid M_{ci})$ such that it takes into account all subsets $E$, $c \in E$. We assume that the MBs, $M^*_{Ei}$, independently influence $X_i$. This allows us to write $P(X_i \mid M_{ci})$ as a product:

$$P(X_i \mid M_{ci}) \propto \prod_{E \in \mathrm{powerset}(C) \,:\, c \in E} P(X_i \mid M^*_{Ei}).$$

To learn the $k$ graphs, we exhaustively enumerate over condition sets, $E$, and estimate parameters $\theta_{Ei}$ by pooling the data for all non-singleton $E$.
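
The bookkeeping behind this decomposition can be sketched in a few lines. The sketch assumes that, for one variable, the exclusive neighbor sets are stored in a dictionary keyed by condition set; the names `condition_sets` and `markov_blanket` and the gene labels are hypothetical.

```python
from itertools import combinations


def condition_sets(conditions):
    """All non-empty subsets of the conditions, as frozensets."""
    conds = sorted(conditions)
    return [
        frozenset(subset)
        for size in range(1, len(conds) + 1)
        for subset in combinations(conds, size)
    ]


def markov_blanket(exclusive_nbrs, c):
    """Markov blanket in condition c: the union of the exclusive
    neighbor sets over every condition set E that contains c."""
    blanket = set()
    for cond_set, nbrs in exclusive_nbrs.items():
        if c in cond_set:
            blanket |= nbrs
    return blanket


# Two-condition example for a single variable (hypothetical neighbors).
exclusive = {
    frozenset({"A"}): {"X2"},        # connected only in A
    frozenset({"B"}): {"X3"},        # connected only in B
    frozenset({"A", "B"}): {"X4"},   # connected in both A and B
}
mb_a = markov_blanket(exclusive, "A")
mb_b = markov_blanket(exclusive, "B")
```

Because the number of condition sets grows as 2^k - 1, exhaustive enumeration of this kind is practical only for a small number of conditions.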

   Our structure learning algorithm maintains a conditional distribution for every variable, Xi for

every set $E \in \mathrm{powerset}(C)$. We consider the addition of an edge $\{X_i, X_j\}$ in every set $E$. This addition

will affect the conditionals of $X_i$ and $X_j$ in all conditions $e \in E$. Because the MBs per condition

set independently influence the conditional, the pseudo-likelihood $\mathrm{PLLV}(X_i, M_{ei}, e)$ decomposes as

$\sum_{E \,:\, e \in E} \mathrm{PLLV}(X_i, M^*_{Ei}, e)$ (Supplementary information). The net score improvement of adding

an edge $\{X_i, X_j\}$ to a condition set $E$ is given by:

                                      |De |
          ∆Score{Xi ,Xj },E =                 PLLV(Xi , Mei ∪ {Xj }, e) − PLLV(Xi , Mei , e) +
                                e∈E d=1
                                                PLLV(Xj , Mej ∪ {Xi }, e) − PLLV(Xj , Mej , e)                         (3)




Because of the decomposability of $\mathrm{PLLV}(X_i | M_{ei})$, all terms other than those
involving the Markov blanket variables in condition set $E$ remain unchanged, producing the score
improvement:

$$\Delta\mathrm{Score}_{\{X_i,X_j\},E} = \mathrm{PLLV}(X_i | M^*_{Ei} \cup \{X_j\}) - \mathrm{PLLV}(X_i | M^*_{Ei})$$

This score decomposability allows us to efficiently learn networks over condition sets. Our
structure learning algorithm is described in more detail in the Supplementary material.
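
A rough sketch of how such a local score change might be evaluated is given below. It is not the authors' implementation. It assumes discrete-valued data, a maximum-likelihood tabular conditional in place of the actual conditional model, and data pooled over the conditions in $E$. It also sums the gain for both endpoints of the edge, as in Eq. 3. The names `pllv`, `delta_score`, and `mb_star` are hypothetical.

```python
from collections import Counter
from math import log


def pllv(rows, i, blanket):
    """Conditional log-likelihood of column i given the blanket columns.

    Assumption: discrete values and maximum-likelihood (count-based)
    conditional probabilities; the paper's conditional model may differ.
    """
    cols = sorted(blanket)
    joint = Counter((row[i], tuple(row[k] for k in cols)) for row in rows)
    marginal = Counter(tuple(row[k] for k in cols) for row in rows)
    return sum(n * log(n / marginal[cfg]) for (_, cfg), n in joint.items())


def delta_score(data, E, i, j, mb_star):
    """Score change from adding edge {X_i, X_j} to condition set E.

    Only the blankets specific to E are touched; data are pooled over
    the conditions in E (assumed reading of the pooling step).
    """
    rows = [row for e in E for row in data[e]]
    m_i = mb_star.get((E, i), set())
    m_j = mb_star.get((E, j), set())
    gain_i = pllv(rows, i, m_i | {j}) - pllv(rows, i, m_i)
    gain_j = pllv(rows, j, m_j | {i}) - pllv(rows, j, m_j)
    return gain_i + gain_j


# Toy data: columns 0 and 1 are identical, column 2 is unrelated to both.
data = {
    "A": [(0, 0, 0), (1, 1, 0)],
    "B": [(0, 0, 1), (1, 1, 1)],
}
E = frozenset({"A", "B"})
print(delta_score(data, E, 0, 1, {}))  # large gain: the columns determine each other
print(delta_score(data, E, 0, 2, {}))  # no gain: the columns are unrelated
```

With maximum-likelihood counts the gain can never be negative on the training data, so a practical score would need a complexity penalty or prior; that part is omitted here.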


4.3   Simulated data description and analysis

We generated simulated datasets using two sets of networks of known structure, HIGHSIM and

LOWSIM. All networks had the same number of nodes n = 68 and were obtained from the E. coli

regulatory network [23]. We used the INDEP model for generating the eight simulated datasets.

The parameters of the INDEP model were initialized using random partitions of an initial dataset

generated from a differential-equation-based regulatory network simulator [19].


4.4   Microarray data description

Each microarray measures the expression of all yeast genes in response to a genetic deletion in
either the quiescent or the non-quiescent population [2]. The data comprise 85 deletions in the
quiescent population and 93 in the non-quiescent population, with 69 deletions common to both
populations. The arrays had biological replicates, producing 170 and 186 measurements per gene in
the quiescent and non-quiescent populations, respectively. We filtered the microarray data to
exclude genes with > 80% missing values, resulting in 3,012 genes. We constrained the network
structures such that genes could connect only to the 69 genes with deletion mutants, and no gene
had more than 8 neighbors.
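
The filtering step and the structural constraints might be sketched as follows. This is an illustration under stated assumptions, not the authors' preprocessing code. Missing values are represented as `None`, and the constraint is read as "every edge must have at least one endpoint among the deletion genes", which is one possible interpretation of the text. The names `filter_genes` and `edge_allowed` and the toy gene identifiers are hypothetical.

```python
def filter_genes(expression, max_missing=0.8):
    """Drop genes whose fraction of missing values exceeds max_missing.

    expression maps a gene name to its list of measurements, with None
    marking a missing value (assumed encoding).
    """
    kept = {}
    for gene, values in expression.items():
        missing = sum(v is None for v in values) / len(values)
        if missing <= max_missing:
            kept[gene] = values
    return kept


def edge_allowed(g, h, neighbors, deletion_genes, max_neighbors=8):
    """Check the structural constraints before adding edge {g, h}.

    Assumed reading: at least one endpoint must be a deletion gene, and
    neither endpoint may end up with more than max_neighbors neighbors.
    """
    if g == h or (g not in deletion_genes and h not in deletion_genes):
        return False
    return (len(neighbors.get(g, ())) < max_neighbors
            and len(neighbors.get(h, ())) < max_neighbors)


expression = {
    "g1": [1.0, None, 2.0, 3.0, 4.0],  # 20% missing: kept
    "g2": [None] * 5,                  # 100% missing: dropped
    "g3": [None] * 9 + [1.0],          # 90% missing: dropped
}
print(sorted(filter_genes(expression)))  # ['g1']
```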


4.5   Validation of network edges using coverage of annotation categories

The coverage of an annotation category $A$ is defined as the harmonic mean of precision and
recall. Let $L$ denote the complete list of genes used for network learning, and let
$L_A \subseteq L$ denote the genes annotated with $A$. Let $l_A$ denote the number of edges in our
learned network between two genes $g_i$ and $g_j$ such that $g_i \in L_A$ and $g_j \in L_A$. Let
$t_A$ be the total number of edges that are connected to genes in $L_A$ (note $t_A > l_A$). Let
$s_A$ denote the total number of edges that could exist among the genes in $L_A$, which is
$\binom{|L_A|}{2}$ if $|L_A| < 8$ and $|L_A| \cdot 8$ if $|L_A| > 8$. Precision for category $A$ is
defined as $p_A = \frac{l_A}{t_A}$ and recall is defined as $r_A = \frac{l_A}{s_A}$. These are used
to define the coverage of category $A$, $\frac{2 p_A r_A}{p_A + r_A}$. We compute this coverage
score for all categories using each inferred network, and compare the score against an expected
coverage from random networks with the same degree distribution.
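
As a worked illustration of the coverage score, the following sketch computes $l_A$, $t_A$, $s_A$, precision, recall, and their harmonic mean for a toy network. The function name `coverage` and the example genes are hypothetical. The case $|L_A| = 8$ is not specified in the text and is assumed here to use the $|L_A| \cdot 8$ branch.

```python
from math import comb


def coverage(edges, annotated, max_neighbors=8):
    """Harmonic mean of precision and recall for one annotation category.

    edges: iterable of (g_i, g_j) pairs from the learned network.
    annotated: the set L_A of genes annotated with the category.
    """
    # l_A: edges with both endpoints annotated with the category.
    l_a = sum(1 for u, v in edges if u in annotated and v in annotated)
    # t_A: edges touching at least one annotated gene.
    t_a = sum(1 for u, v in edges if u in annotated or v in annotated)
    # s_A: edges that could exist among the annotated genes.
    n = len(annotated)
    s_a = comb(n, 2) if n < max_neighbors else n * max_neighbors
    if l_a == 0 or t_a == 0 or s_a == 0:
        return 0.0
    precision = l_a / t_a
    recall = l_a / s_a
    return 2 * precision * recall / (precision + recall)


edges = [("a", "b"), ("a", "c"), ("a", "d"), ("e", "f")]
annotated = {"a", "b", "c"}
# l_A = 2, t_A = 3, s_A = 3, so precision = recall = 2/3.
print(coverage(edges, annotated))
```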

    To compare NIPD against INDEP, consider the inferred quiescent networks. Let
$A_{\mathrm{INDEP}}$ and $A_{\mathrm{NIPD}}$ denote the categories with better-than-random coverage
in the INDEP and NIPD quiescent networks, respectively. To determine how much better INDEP is than
NIPD, we obtain the number of categories in $A_{\mathrm{INDEP}} \cup A_{\mathrm{NIPD}}$ on which
INDEP has better coverage than NIPD. We similarly assess how much better NIPD is than INDEP. We
repeat this procedure for the non-quiescent networks. We also compared the semantic similarity of
edges in inferred and random networks [16] (Supplementary material).
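
The category-counting comparison can be sketched as below. The function `count_wins` and the toy coverage values are hypothetical, and a category missing from one method's coverage table is assumed to have coverage 0.

```python
def count_wins(cov_indep, cov_nipd, better_indep, better_nipd):
    """Count categories on which each method has the higher coverage.

    cov_*: dict mapping category to coverage score for that method.
    better_*: set of categories on which that method beat random networks.
    Only categories in the union of the two sets are compared.
    """
    union = better_indep | better_nipd
    indep_wins = sum(
        1 for a in union if cov_indep.get(a, 0.0) > cov_nipd.get(a, 0.0))
    nipd_wins = sum(
        1 for a in union if cov_nipd.get(a, 0.0) > cov_indep.get(a, 0.0))
    return indep_wins, nipd_wins


cov_indep = {"respiration": 0.10, "folding": 0.30, "transport": 0.05}
cov_nipd = {"respiration": 0.25, "folding": 0.20, "transport": 0.40}
better_indep = {"respiration", "folding"}
better_nipd = {"respiration"}
# "transport" is ignored because neither method beat random on it.
print(count_wins(cov_indep, cov_nipd, better_indep, better_nipd))  # (1, 1)
```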


4.6    Evaluation of gene deletion combinations

We identified combinations of genes with deletion mutants from Markov blankets comprising more
than one of these deletion genes. We evaluated each algorithm's ability to capture gene deletion
combinations by comparing the number of such combinations against the number found in random
networks with the same number of edges. This random network model provided a rough significance
assessment on the number of inferred knock-out combinations (Supplementary Table 3). We then
performed a more stringent analysis based on semantic similarity, using the sub-network spanning
only the genes with deletion combinations. We generated random networks with the same degree
distribution as this sub-network and computed the semantic similarity of each gene with the set of
deletion genes connected to it, in the inferred and random networks. We then selected genes with
significantly higher semantic similarity than in random networks (z-test, p-value < 0.05).
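
One way to carry out such a z-test against a random-network background is sketched below. The text does not state whether the test is one-sided or which variance estimate is used, so the one-sided form and the sample standard deviation are assumptions, as are the function name `z_test_upper` and the toy values.

```python
from math import erfc, sqrt
from statistics import mean, stdev


def z_test_upper(observed, random_values):
    """One-sided z-test: is observed higher than the random background?

    random_values: the same statistic computed on random networks.
    Returns (z, p), where p is the upper-tail probability under a
    normal approximation (assumed form of the test).
    """
    mu = mean(random_values)
    sigma = stdev(random_values)
    if sigma == 0:
        raise ValueError("random background has zero variance")
    z = (observed - mu) / sigma
    p = 0.5 * erfc(z / sqrt(2))
    return z, p


# Toy semantic-similarity values from five random networks.
background = [0.1, 0.2, 0.3, 0.4, 0.5]
z, p = z_test_upper(0.9, background)
print(p < 0.05)  # True: 0.9 lies far above the background mean of 0.3
```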



5     Acknowledgements

This work is supported by grants from NIMH (1R01MH076282-03) and NSF (IIS-0705681) to

T.L., from NIH (GM-67593) and NSF (MCB0734918) to M.W.W. and from HHMI-NIH/NIBIB

(56005678).

[Figure 1 plots: four panels, HIGHSIM NET1, LOWSIM NET1, HIGHSIM NET2, LOWSIM NET2.]

Figure 1: Number of variables (y-axis) on which one method was significantly better than the other
as a function of the size of the training data (x-axis). Left is for the two networks (HIGHSIM)
that share 60% of their edges and right is for the two networks (LOWSIM) that share 20% of their
edges. The top and bottom graphs are for networks from the individual conditions.

[Figure 2 plots: three bar-chart panels, GOSLIM, TFNET, GOPROC. Y-axis: # of Categories. X-axis:
QUIESCENT, NON-QUIESCENT. Legend: INDEP>NIPD, NIPD>INDEP.]


Figure 2: Network quality comparison based on coverage of GOSlim (GOSLIM), targets of tran-
scription factors (TFNET) and GO process (GOPROC). Each bar represents the number of cat-
egories on which INDEP had better coverage than NIPD (INDEP>NIPD) or NIPD had better
coverage than INDEP (NIPD>INDEP).


References

 [1] C. Allen, S. Büttner, A. D. Aragon, J. A. Thomas, O. Meirelles, J. E. Jaetao, D. Benn,
     S. W. Ruby, M. Veenhuis, F. Madeo, and M. Werner-Washburne. Isolation of quiescent and
     nonquiescent cells from yeast stationary-phase cultures. J Cell Biol, 174(1):89–100, July 2006.

 [2] A. D. Aragon, A. L. Rodriguez, O. Meirelles, S. Roy, G. S. Davidson, C. Allen, R. Joe,
     P. Tapia, D. Benn, and M. Werner-Washburne. Characterization of differentiated quiescent and
     non-quiescent cells in yeast stationary-phase cultures. Molecular Biology of the Cell, 2008.

 [3] M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P. Davis,


[Figure 3 plots: two panels, QUIESCENT and NON-QUIESCENT. X-axis: Semantic Similarity. Y-axis:
log(# of Edges). Legend: RAND (NIPD), NIPD, RAND (INDEP), INDEP.]


Figure 3: Network quality comparison based on semantic similarity. The dashed lines represent
the background distribution generated from random networks and the solid lines represent the
distribution of the semantic similarity in the inferred networks.




[Figure 4 network diagram: gene subgraphs labeled with the GO processes acetyl-CoA metabolic
process; organelle ATP synthesis coupled electron transport; oxidative phosphorylation; aerobic
respiration; protein folding; regulation of gene expression, epigenetic; regulation of nucleobase,
nucleoside, nucleotide and nucleic acid metabolic process; ethanol metabolic process; polyamine
catabolic process; beta-alanine biosynthetic process; nitrogen utilization; ammonium transport;
carboxylic acid biosynthetic process; NADH regeneration; fatty acid metabolic process;
pentose-phosphate shunt; pentose metabolic process; response to metal ion. TF nodes: HAP2_TF,
HAP4_TF, SIP4_TF, MSN2_TF, MSN4_TF, HSF1_TF, SKO1_TF, AZF1_TF.]


Figure 4: GO processes and TF targets for subgraphs from the NIPD-inferred networks using the
quiescent population. The text below each subgraph indicates the process. The diamonds represent
the TFs. A TF is connected to the subgraph which is enriched in the targets of the TF. The circular
nodes represent the genes in the network and color represents the extent of differential expression,
red: up-regulated, green: down-regulated.




[Figure 5 image: network subgraphs with their enriched TFs. TFs shown: HAP4, MSN4, MSN2, HSF1, SIP4. Subgraph process labels: ion transport; oxidative phosphorylation; protein folding; ammonium transport; nitrogen utilization; energy derivation by oxidation of organic compounds; fatty acid metabolic process; beta-alanine biosynthetic process; polyamine catabolic process; mitochondrial electron transport, ubiquinol to cytochrome c; aerobic respiration; carnitine metabolic process]

Figure 5: GO processes and TF targets for subgraphs from the NIPD-inferred networks using the
non-quiescent population. The legend is similar to that of Figure 4.

    K. Dolinski, S. S. Dwight, J. T. Eppig, M. A. Harris, D. P. Hill, L. Issel-Tarver, A. Kasarskis,

    S. Lewis, J. C. Matese, J. E. Richardson, M. Ringwald, G. M. Rubin, and G. Sherlock. Gene

    ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet,

    25(1):25–29, May 2000.

 [4] S. Bergmann, J. Ihmels, and N. Barkai. Similarities and differences in genome-wide expression

    data of six organisms. PLoS Biol, 2(1), January 2004.

 [5] Sven Bergmann, Jan Ihmels, and Naama Barkai. Iterative signature algorithm for the analysis

    of large-scale gene expression data. Physical review. E, Statistical, nonlinear, and soft matter

    physics, 67(3 Pt 1), March 2003.

 [6] Julian Besag. Efficiency of pseudolikelihood estimation for simple Gaussian fields. Biometrika,

    64(3):616–618, December 1977.

 [7] Richard Bonneau, David J Reiss, Paul Shannon, Marc Facciotti, Leroy Hood, Nitin S Baliga,

    and Vesteinn Thorsson. The Inferelator: an algorithm for learning parsimonious regulatory

    networks from systems-biology data sets de novo. Genome Biology, 2006.

[8] Han-Yu Chuang, Eunjung Lee, Yu-Tsueng Liu, Doheon Lee, and Trey Ideker. Network-based

    classification of breast cancer metastasis. Mol Syst Biol, 3, October 2007.

 [9] Karthik Devarajan. Nonnegative matrix factorization: An analytical and interpretive tool in

    computational biology. PLoS Comput Biol, 4(7):e1000029+, July 2008.

[10] Dan Geiger and David Heckerman. Advances in probabilistic reasoning. In Proceedings of

    the seventh conference (1991) on Uncertainty in artificial intelligence, pages 118–126, San

    Francisco, CA, USA, 1991. Morgan Kaufmann Publishers Inc.

[11] Christopher T. Harbison, D. Benjamin Gordon, Tong Ihn Lee, Nicola J. Rinaldi, Kenzie D.

    Macisaac, Timothy W. Danford, Nancy M. Hannett, Jean-Bosco Tagne, David B. Reynolds,

    Jane Yoo, Ezra G. Jennings, Julia Zeitlinger, Dmitry K. Pokholok, Manolis Kellis, P. Alex

    Rolfe, Ken T. Takusagawa, Eric S. Lander, David K. Gifford, Ernest Fraenkel, and Richard A.

    Young. Transcriptional regulatory code of a eukaryotic genome. Nature, 2004.

[12] David Heckerman. A Tutorial on Learning Bayesian Networks. Technical Report MSR-TR-

    95-06, Microsoft research, March 1995.

[13] Hyunsoo Kim, William Hu, and Yuval Kluger. Unraveling condition specific gene transcriptional
    regulatory networks in Saccharomyces cerevisiae. BMC Bioinformatics, 2006.

[14] Steffen L. Lauritzen. Graphical Models. Oxford Statistical Science Series. Oxford University

    Press, New York, USA, July 1996.

[15] P. Lesage, X. Yang, and M. Carlson. Yeast SNF1 protein kinase interacts with SIP4, a C6 zinc
    cluster transcriptional activator: a new role for SNF1 in the glucose response. Molecular and
    Cellular Biology, 16(5):1921–1928, May 1996.

[16] P. W. Lord, R. D. Stevens, A. Brass, and C. A. Goble. Investigating semantic similarity
    measures across the Gene Ontology: the relationship between sequence and annotation.
    Bioinformatics, 19(10):1275–1283, July 2003.




[17] Kenzie Macisaac, Ting Wang, D. Benjamin Gordon, David Gifford, Gary Stormo, and Ernest
    Fraenkel. An improved map of conserved regulatory sites for Saccharomyces cerevisiae. BMC
    Bioinformatics, 7(1):113+, March 2006.

[18] M. Juanita Martinez, Sushmita Roy, Amanda B. Archuletta, Peter D. Wentzell, Sonia S.

    Anna-Arriola, Angelina L. Rodriguez, Anthony D. Aragon, Gabriel A. Quinones, Chris Allen,

    and Margaret Werner-Washburne. Genomic analysis of stationary-phase and exit in
    Saccharomyces cerevisiae: Gene expression and identification of novel essential genes. Mol. Biol. Cell,

    15(12):5295–5305, December 2004.

[19] Pedro Mendes, Wei Sha, and Keying Ye. Artificial gene networks for objective comparison of

    analysis algorithms. Bioinformatics, 19:122–129, 2003.

[20] Wei Pan. A comparative review of statistical methods for discovering differentially expressed

    genes in replicated microarray experiments. Bioinformatics, 18(4):546–554, April 2002.

[21] Oleg Rokhlenko, Ydo Wexler, and Zohar Yakhini. Similarities and differences of gene
    expression in yeast stress conditions. Bioinformatics, 23(2):e184–e190, January 2007.

[22] Sushmita Roy, Terran Lane, and Margaret Werner-Washburne. Learning structurally consis-

    tent undirected probabilistic graphical models. In ICML, page 114, 2009.

[23] Heladia Salgado, Socorro Gama-Castro, Martin Peralta-Gil, Edgar Diaz-Peredo, Fabiola

    Sanchez-Solano, Alberto Santos-Zavaleta, Irma Martinez-Flores, Veronica Jimenez-Jacinto,

    Cesar Bonavides-Martinez, Juan Segura-Salazar, Agustino Martinez-Antonio, and Julio

    Collado-Vides. RegulonDB (version 5.0): Escherichia coli K-12 transcriptional regulatory
    network, operon organization, and growth conditions. Nucleic Acids Research, 34:D394, 2006.

[24] Guido Sanguinetti, Josselin Noirel, and Phillip C. Wright. MMG: a probabilistic tool to identify

    submodules of metabolic pathways. Bioinformatics, 24(8):1078–1084, April 2008.

[25] Dominic Schmidt, Michael D. Wilson, Christiana Spyrou, Gordon D. Brown, James Hadfield,

    and Duncan T. Odom. ChIP-seq: Using high-throughput sequencing to discover protein-DNA

    interactions. Methods, 48(3):240–248, July 2009.

[26] Eran Segal, Dana Pe’er, Aviv Regev, Daphne Koller, and Nir Friedman. Learning module

    networks. Journal of Machine Learning Research, 6:557–588, April 2005.

[27] T. Stein, J. Kricke, D. Becher, and T. Lisowsky. Azf1p is a nuclear-localized zinc-finger protein

    that is preferentially expressed under non-fermentative growth conditions in Saccharomyces
    cerevisiae. Current Genetics, 34(4):287–296, October 1998.

[28] Joshua M. Stuart, Eran Segal, Daphne Koller, and Stuart K. Kim. A gene-coexpression network

    for global discovery of conserved genetic modules. Science, 302(5643):249–255, October 2003.

[29] Amos Tanay, Roded Sharan, Martin Kupiec, and Ron Shamir. Revealing modularity and

    organization in the yeast molecular network by integrated analysis of highly heterogeneous

    genomewide data. Proceedings of the National Academy of Sciences of the United States of

    America, 101(9):2981–2986, March 2004.

[30] Lidia Tomás-Cobos, Laura Casadomé, Glòria Mas, Pascual Sanz, and Francesc Posas.
    Expression of the HXT1 low affinity glucose transporter requires the coordinated activities of the HOG
    and glucose signalling pathways. The Journal of Biological Chemistry, 279(21):22010–22019,
    May 2004.

[31] D. P. Tuck, H. M. Kluger, and Y. Kluger. Characterizing disease states from topological

    properties of transcriptional regulatory networks. BMC Bioinformatics, 7, 2006.

[32] O. Vincent and M. Carlson. Sip4, a Snf1 kinase-dependent transcriptional activator, binds to
    the carbon source-responsive element of gluconeogenic genes. The EMBO Journal,
    17(23):7002–7008, December 1998.

[33] Zhong Wang, Mark Gerstein, and Michael Snyder. RNA-Seq: a revolutionary tool for
    transcriptomics. Nat Rev Genet, 10(1):57–63, January 2009.

[34] Bai Zhang, Huai Li, Rebecca B. Riggins, Ming Zhan, Jianhua Xuan, Zhen Zhang, Eric P.

    Hoffman, Robert Clarke, and Yue Wang. Differential dependency network analysis to identify

    condition-specific topological changes in biological networks. Bioinformatics, pages btn660+,

    December 2008.

Appendix

1    Generation and analysis of simulated data

We first obtained a sub-network of n = 68 nodes, G1 , from the E. coli regulatory network [23]. We

then generated two networks, G2 and G3 , by flipping 20% and 60% of G1 ’s edges, respectively.

{G1 , G2 } comprised networks in HIGHSIM and {G1 , G3 } comprised networks in LOWSIM. For

each pair of networks, we generated initial datasets using a differential equation-based gene regu-

latory network simulator [19]. We then split the data into two parts, learned two INDEP models

for each partition, and generated data from these models. We repeated this procedure four times

producing eight sets of simulated data with different parameters but the same network topology.

It was possible to generate all eight sets from the regulatory network simulator by perturbing the

kinetic constants, but our current data generation procedure was faster.

    We compared the structure of the networks inferred by INDEP and NIPD using a per-variable
neighborhood comparison. Assume we are comparing the INDEP-inferred networks against the true
networks in HIGHSIM. We compare against each of the true networks, $\{G_1, G_2\}$, one at a time.
Let $G_1^{\mathrm{INDEP}}$ and $G_2^{\mathrm{INDEP}}$ be the two networks inferred by INDEP using
datasets from HIGHSIM. For each variable $X_i$, we compare $X_i$'s neighborhood in $G_1$ to its
inferred neighborhoods in both $G_1^{\mathrm{INDEP}}$ and $G_2^{\mathrm{INDEP}}$ to obtain match
scores $F_{i1}^{\mathrm{INDEP}}$ and $F_{i2}^{\mathrm{INDEP}}$, respectively. INDEP's match of
$X_i$'s neighborhood in $G_1$ is the maximum of $F_{i1}^{\mathrm{INDEP}}$ and
$F_{i2}^{\mathrm{INDEP}}$. We obtain a match score for different folds of the data. Similarly, we
obtain a match score for NIPD for all variables from different folds of the data. We then obtain the
number of variables on which NIPD has a significantly higher match score than INDEP as a function
of training data size. We repeat this procedure for all eight datasets for HIGHSIM to obtain the
average number of variables on which NIPD is better than INDEP. We repeat this procedure for
$G_2$ and then for the NIPD.
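
The per-variable comparison described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the match score is an F-score (harmonic mean of precision and recall) between neighbor sets, and that each network is stored as a set of undirected edges. The function names `f_score`, `neighborhood`, and `match_score` are hypothetical.

```python
def f_score(true_nbrs, inferred_nbrs):
    """Harmonic mean of precision and recall between two neighbor sets."""
    true_nbrs, inferred_nbrs = set(true_nbrs), set(inferred_nbrs)
    if not true_nbrs and not inferred_nbrs:
        return 1.0  # assumption: two empty neighborhoods count as a perfect match
    overlap = len(true_nbrs & inferred_nbrs)
    if overlap == 0:
        return 0.0
    precision = overlap / len(inferred_nbrs)
    recall = overlap / len(true_nbrs)
    return 2 * precision * recall / (precision + recall)


def neighborhood(graph, node):
    """Neighbors of `node` in an undirected graph given as a set of frozenset edges."""
    return {v for edge in graph if node in edge for v in edge if v != node}


def match_score(node, true_graph, inferred_graphs):
    """Best F-score of the node's true neighborhood over all inferred graphs."""
    truth = neighborhood(true_graph, node)
    return max(f_score(truth, neighborhood(g, node)) for g in inferred_graphs)


# Toy example: one true network and two inferred networks.
g_true = {frozenset(("A", "B")), frozenset(("A", "C"))}
g_inf1 = {frozenset(("A", "B"))}
g_inf2 = {frozenset(("A", "B")), frozenset(("A", "C")), frozenset(("A", "D"))}
score_a = match_score("A", g_true, [g_inf1, g_inf2])
```

Taking the maximum over the inferred graphs mirrors the text: a variable is credited with its best-matching inferred neighborhood, since the learner does not know which inferred network corresponds to which true network.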




2    Semantic similarity-based validation

We use the definition of semantic similarity from Lord et al. [16]. Semantic similarity between
two annotation terms is defined as a function of the maximal amount of information present in a
common ancestor of the terms. For GO terms, the information is inversely related to the
number of genes that are annotated with a term; that is, a very specific term with few genes has
more information than a broader term that has many more genes annotated with it. The functional
similarity between two genes is given by the average semantic similarity of the sets of GO process
terms associated with the genes. Let $g_i$ and $g_j$ be two genes connected by an edge in our inferred
network. Let $T_i$ and $T_j$ be the sets of GO process terms associated with $g_i$ and $g_j$, respectively.
The average semantic similarity, $\mathrm{sim}(g_i, g_j)$, over all pairs of terms is

$$\mathrm{sim}(g_i, g_j) = \frac{1}{|T_i| \, |T_j|} \sum_{t_p \in T_i,\; t_q \in T_j} \mathrm{semsim}(t_p, t_q)$$

Semantic similarity is $\mathrm{semsim}(t_p, t_q) = -\log(\min_{a \in P_{pq}} p_a)$, where $P_{pq}$ is the set of
common ancestors of the terms $t_p$ and $t_q$ in the GO process "is-a" hierarchy. $-\log(p_a)$ is the
amount of information associated with a term $a$, and $p_a$ is the probability of the term, defined as
the ratio of the number of genes annotated with the term $a$ to the total number of genes with a GO
process assignment.
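
The two definitions above can be illustrated with a short sketch on a toy hierarchy. Several details are assumptions made for the example: the term names and probabilities are invented, a term counts as its own ancestor, and pairs with no common ancestor score zero. The names `ancestors`, `semsim`, and `gene_sim` are hypothetical.

```python
import math


def ancestors(term, parents):
    """All ancestors of `term` in an is-a hierarchy, including the term itself."""
    seen, stack = set(), [term]
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(parents.get(t, ()))
    return seen


def semsim(tp, tq, parents, prob):
    """-log of the smallest probability among the common ancestors of tp and tq."""
    common = ancestors(tp, parents) & ancestors(tq, parents)
    if not common:
        return 0.0  # assumption: no shared ancestor means no shared information
    return -math.log(min(prob[a] for a in common))


def gene_sim(terms_i, terms_j, parents, prob):
    """Average semsim over all term pairs, normalized by |T_i| * |T_j|."""
    if not terms_i or not terms_j:
        return 0.0
    total = sum(semsim(tp, tq, parents, prob) for tp in terms_i for tq in terms_j)
    return total / (len(terms_i) * len(terms_j))


# Toy hierarchy (illustrative values only, not real GO annotations).
parents = {"respiration": ["metabolism"], "fermentation": ["metabolism"],
           "metabolism": ["root"]}
prob = {"root": 1.0, "metabolism": 0.5, "respiration": 0.1, "fermentation": 0.2}
s_pair = semsim("respiration", "fermentation", parents, prob)
s_genes = gene_sim({"respiration"}, {"respiration", "fermentation"}, parents, prob)
```

In this toy case the two sibling terms share only `metabolism` and `root`, so their similarity is $-\log(0.5)$, while a term compared with itself scores its own information content, $-\log(0.1)$.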

    We used semantic similarity for global validation of the inferred edges and also for assessing

the strength of association between combinations of single gene knock-outs and a target gene.

In both cases, we generated random networks with the same degree distributions as the inferred

networks and estimated a background semantic similarity distribution. For assessing the strength

of association between a gene, gi and the set of knock-out genes that are connected to it, Ki , we

had to compare the similarity of a gene with a set of genes. We assumed GO process terms for

the set Ki to be the union of all terms associated with the genes, gj ∈ Ki . We then computed the

semantic similarity between the term set associated with gene gi and the union of terms associated

with Ki .
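
The text does not say how the random networks were generated. One common way to preserve the degree distribution is repeated double-edge swapping, sketched below as an assumption rather than as the procedure actually used. The sketch also shows the union of term sets for a knock-out set $K_i$. The names `degree_preserving_shuffle`, `degrees`, and `union_terms` are hypothetical.

```python
import random


def degree_preserving_shuffle(edges, n_swaps, seed=0):
    """Randomize an undirected simple graph by double-edge swaps.

    Each accepted swap rewires edges (a, b), (c, d) into (a, d), (c, b),
    which leaves the degree of every node unchanged.
    """
    rng = random.Random(seed)
    edge_list = [tuple(e) for e in edges]
    edge_set = {frozenset(e) for e in edge_list}
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edge_list)), 2)
        (a, b), (c, d) = edge_list[i], edge_list[j]
        if len({a, b, c, d}) < 4:
            continue  # a shared endpoint would create a self-loop or a no-op
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        if new1 in edge_set or new2 in edge_set:
            continue  # reject swaps that would duplicate an existing edge
        edge_set -= {frozenset((a, b)), frozenset((c, d))}
        edge_set |= {new1, new2}
        edge_list[i], edge_list[j] = (a, d), (c, b)
    return edge_list


def degrees(edges):
    """Map each node to its degree."""
    counts = {}
    for u, v in edges:
        counts[u] = counts.get(u, 0) + 1
        counts[v] = counts.get(v, 0) + 1
    return counts


def union_terms(genes, gene_terms):
    """Union of the GO process terms of all genes in a knock-out set."""
    return set().union(*(gene_terms[g] for g in genes))


edges = [("A", "B"), ("C", "D"), ("E", "F"), ("A", "C"), ("B", "F")]
shuffled = degree_preserving_shuffle(edges, n_swaps=100, seed=1)
k_terms = union_terms(["g1", "g2"], {"g1": {"t1"}, "g2": {"t1", "t2"}})
```

Scoring the edges of many such shuffled networks with the semantic similarity measure would give a background distribution against which the inferred network's similarity can be compared.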




3    Structure learning algorithm of NIPD in detail

Our score for structure learning is based on the pseudo-likelihood of the data given the model and
requires us to compute the conditional probability distribution of each variable in a condition c.
We require that the parameters of this conditional distribution be dependent such that we can pool
the data from the different conditions to estimate the parameters. The conditional distribution,
$P(X_i \mid M_{ci})$, in condition c is defined as a product:

$$P(X_i = x_{id} \mid M_{ci} = m_{cid}) \propto \prod_{E \in \mathrm{powerset}(C) \,:\, c \in E} P(X_i = x_{id} \mid M^*_{Ei} = m^*_{Eid}), \tag{4}$$

where d is the data point index and $M^*_{Ei}$ is the Markov blanket (MB) of $X_i$ exclusively in
condition set E. The proportionality term can be eliminated using the normalization term
$\frac{1}{Z_{cid}}$. In our conditional Gaussian case,
$\frac{1}{Z_{1id}} = \mathcal{N}(\mu_{1id} \mid \mu_{3id}, \sigma^2_{1i} + \sigma^2_{3i})$, where
$\sigma_{3i}$ is the standard deviation from the condition set {1, 2}, and
$\mu_{1id} = w_{1i}^T m^*_{1id}$ is the mean of the conditional Gaussian using the dth data
point in condition 1. Thus, $\frac{1}{Z_{1id}}$ is the probability of $\mu_{1id}$ from a Gaussian
distribution with mean estimated from the pooled data. To make the product in Eq 4 a valid
conditional distribution, we need to subtract out the normalization term. However, working with the
unnormalized form gives us three benefits. First, and most important, it enables our score to be a
decomposable sum on taking logarithms. Second, the normalization term behaves as a smoothing
term for a condition-specific mean, $\mu_{1id}$, preferring network structures with means $\mu_{1id}$
closer to the shared mean $\mu_{3id}$. Third, avoiding the computation of $Z_{cid}$ for each data point
gives us some runtime benefits.
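
The Gaussian term $\mathcal{N}(\mu_{1id} \mid \mu_{3id}, \sigma^2_{1i} + \sigma^2_{3i})$ corresponds to a standard identity: the product of two univariate Gaussian densities in the same variable integrates to a Gaussian density in the difference of their means, with the variances added. The sketch below checks this numerically. It assumes univariate factors with illustrative means and variances, and the names `normal_pdf` and `product_integral` are hypothetical.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def product_integral(mu1, var1, mu3, var3, lo=-50.0, hi=50.0, steps=20_000):
    """Trapezoidal integral over x of N(x | mu1, var1) * N(x | mu3, var3)."""
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        x = lo + k * h
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * normal_pdf(x, mu1, var1) * normal_pdf(x, mu3, var3)
    return total * h


# Illustrative values: condition-specific factor (mu1, var1), pooled factor (mu3, var3).
mu1, var1, mu3, var3 = 1.0, 0.5, 1.8, 2.0
numeric = product_integral(mu1, var1, mu3, var3)
closed_form = normal_pdf(mu1, mu3, var1 + var3)

# The term shrinks as the condition-specific mean moves away from the shared mean.
near = normal_pdf(1.0, 1.8, var1 + var3)
far = normal_pdf(1.0, 5.0, var1 + var3)
```

The last two lines illustrate the smoothing behavior described in the text: the term is larger when the condition-specific mean is close to the shared mean, so structures with such means are preferred.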

    Our structure learning algorithm begins with k empty graphs and proposes edge additions for all
variables, for all subsets of the condition set C. The while loop iteratively makes edge modifications
until the score no longer improves. The outermost for loop (Steps 4-17) iterates over variables
$X_i$ to identify new candidate MB variables $X_j$ in a condition set E. We iterate over all candidate
MB variables $X_j$ (Steps 5-15) and condition sets E (Steps 6-14) and compute the score
improvement for each pair $\{X_j, E\}$ (Step 13). In Steps 7-9 we add a check: if a variable $X_j$ is
already present in the MB of $X_i$ for any subset or superset D of E, we do not include it as a
candidate. If the current condition set under consideration has more than one condition, data from
these conditions are pooled and the parameters for the new distribution $P(X_i \mid M^*_{Ei})$ are
estimated using the pooled dataset (Steps 10-12). A candidate move for a variable $X_i$ is composed
of a pair $\{X_j, E'\}$ with the maximal score improvement over all variables and conditions
(Step 16). After all candidate moves have been identified, we attempt all the moves in the order of
decreasing score improvement (Step 18). Each move adds the edge $\{X_i, X_j\}$ in condition set
$E'$. However, if either $X_i$ or $X_j$ was already updated in a previous move, we ignore the move.
Because not all candidate moves are made, sorting the moves in decreasing order of score
improvement ensures that the moves with the highest score improvements are attempted first. The
algorithm converges when no edge addition improves the score of the k graphs.

Algorithm 1 NIPD
 1: Input:
       Random variable set, $X = \{X_1, \cdots, X_{|X|}\}$
       Set of conditions C
       Datasets of RV joint assignments, $\{D_1, \cdots, D_{|C|}\}$
       Maximum neighborhood size, $k_{max}$
 2: Output:
       Inferred graphs $G_1, \cdots, G_{|C|}$
 3: while Score($G_1, \cdots, G_{|C|}$) does not stabilize do
 4:    for $X_i \in X$ do   /* Propose moves */
 5:       for $X_j \in (X \setminus \{X_i\})$ do
 6:          for $E \in \mathrm{powerset}(C)$ do
 7:             if $X_j \in M^*_{Di}$, s.t. either $D \subset E$ or $E \subset D$ then
 8:                Skip $X_j$.
 9:             end if
10:             if $|E| > 1$ then
11:                Estimate parameters for new conditional $P(X_i \mid M^*_{Ei} \cup \{X_j\})$ using pooled
                   dataset $D_E$ obtained from merging all $D_e$ s.t. $e \in E$.
12:             end if
13:             Compute $\Delta\mathrm{Score}_{\{X_i X_j\} E}$.
14:          end for
15:       end for
16:       Store $\{X_i, X_j, E'\}$ as candidate move for $X_i$, where
          $\{X_j, E'\} = \arg\max_{j, E} \Delta\mathrm{Score}_{\{X_i X_j\} E}$
17:    end for
18:    Make candidate moves $\{X_i, X_j, E'\}$ in order of decreasing score improvement
       /* Attempt moves to modify graph structures */
19: end while
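
The bookkeeping parts of Algorithm 1 can be sketched in a few lines: enumerating condition sets (Step 6), the subset/superset check (Steps 7-9), and applying moves in order of decreasing score improvement (Step 18). This is an illustrative sketch, not the authors' implementation. The score computation itself is omitted. The sketch treats D = E as blocked and applies only moves with positive improvement, both of which are assumptions. The names `condition_sets`, `blocked`, and `apply_moves` are hypothetical.

```python
from itertools import combinations


def condition_sets(conditions):
    """All non-empty subsets of the condition set, smallest first."""
    items = sorted(conditions)
    return [frozenset(combo)
            for r in range(1, len(items) + 1)
            for combo in combinations(items, r)]


def blocked(xj, cond_set, markov_blankets):
    """True if xj is already in X_i's MB for a subset or superset of cond_set.

    `markov_blankets` maps each condition set D to the MB variables of X_i in D.
    """
    return any(xj in members and (d <= cond_set or cond_set <= d)
               for d, members in markov_blankets.items())


def apply_moves(candidates):
    """Apply (xi, xj, cond_set, delta) moves greedily, best improvement first.

    A move is ignored if either endpoint was already updated in this sweep.
    """
    updated, applied = set(), []
    for xi, xj, cond_set, delta in sorted(candidates, key=lambda m: m[3], reverse=True):
        if delta <= 0:
            break  # assumption: only score-improving moves are made
        if xi in updated or xj in updated:
            continue
        applied.append((xi, xj, cond_set))
        updated.update((xi, xj))
    return applied


all_sets = condition_sets({1, 2})
moves = apply_moves([
    ("A", "B", frozenset({1}), 0.5),
    ("B", "C", frozenset({1, 2}), 0.9),
    ("D", "E", frozenset({2}), 0.2),
    ("F", "G", frozenset({1}), -0.1),
])
```

In the toy run, the move for A is skipped because B was already updated by the higher-scoring move, which matches the rule stated in the text that a move is ignored if either endpoint was updated earlier in the same sweep.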




[Figure 1 image: two panels, HIGHSIM (top) and LOWSIM (bottom). x-axis: Size of training data; y-axis: Shared Edge F-score; legend: NIPD, INDEP]

                                          Figure 1: Shared edges in the HIGHSIM and LOWSIM networks




                                        METHOD      POPULATION          EDGE-CNT           SHARED EDGE-CNT
                                        NIPD        QUIESCENT             378                    271
                                        NIPD        NON-QUIESCENT         402                    271
                                        INDEP       QUIESCENT             171                    25
                                        INDEP       NON-QUIESCENT         200                    25

                                         Table 1: Structure of the inferred networks using INDEP and NIPD.




 
Proyecto triqui 904 daniel castillo juan carreño
Proyecto triqui 904 daniel castillo juan carreñoProyecto triqui 904 daniel castillo juan carreño
Proyecto triqui 904 daniel castillo juan carreño
Dani Castillo Kastillo
 

En vedette (18)

Cellular respiration teacher
Cellular respiration teacherCellular respiration teacher
Cellular respiration teacher
 
Em06 iav
Em06 iavEm06 iav
Em06 iav
 
Following the prophet mohammed (pbuh)
Following the prophet mohammed (pbuh)Following the prophet mohammed (pbuh)
Following the prophet mohammed (pbuh)
 
Our youth in the west
Our youth in the westOur youth in the west
Our youth in the west
 
علامات الساعة
علامات الساعةعلامات الساعة
علامات الساعة
 
10 factors for uniting muslims in australia
10 factors for uniting muslims in australia10 factors for uniting muslims in australia
10 factors for uniting muslims in australia
 
Em10 fl
Em10 flEm10 fl
Em10 fl
 
Em03 t
Em03 tEm03 t
Em03 t
 
علامات الساعة
علامات الساعةعلامات الساعة
علامات الساعة
 
Our youth in the west
Our youth in the westOur youth in the west
Our youth in the west
 
Priscilla oti
Priscilla otiPriscilla oti
Priscilla oti
 
Bioe506
Bioe506Bioe506
Bioe506
 
TRABAJO EN CLASE # 3
TRABAJO EN CLASE # 3TRABAJO EN CLASE # 3
TRABAJO EN CLASE # 3
 
TRABAJO EN CLASE
TRABAJO EN CLASETRABAJO EN CLASE
TRABAJO EN CLASE
 
Proyecto triqui 904 daniel castillo juan carreño
Proyecto triqui 904 daniel castillo juan carreñoProyecto triqui 904 daniel castillo juan carreño
Proyecto triqui 904 daniel castillo juan carreño
 
ULTIMO PROYECTO EXCEL
ULTIMO PROYECTO EXCELULTIMO PROYECTO EXCEL
ULTIMO PROYECTO EXCEL
 
ROBOT EDUCADOR
ROBOT EDUCADORROBOT EDUCADOR
ROBOT EDUCADOR
 
Estadisticas 806
Estadisticas 806Estadisticas 806
Estadisticas 806
 

Similaire à Condspe

Deep learning methods in metagenomics: a review
Deep learning methods in metagenomics: a reviewDeep learning methods in metagenomics: a review
Deep learning methods in metagenomics: a review
ssuser6fc73c
 
Community Finding with Applications on Phylogenetic Networks [Extended Abstract]
Community Finding with Applications on Phylogenetic Networks [Extended Abstract]Community Finding with Applications on Phylogenetic Networks [Extended Abstract]
Community Finding with Applications on Phylogenetic Networks [Extended Abstract]
Luís Rita
 
A clonal based algorithm for the reconstruction of genetic network using s sy...
A clonal based algorithm for the reconstruction of genetic network using s sy...A clonal based algorithm for the reconstruction of genetic network using s sy...
A clonal based algorithm for the reconstruction of genetic network using s sy...
eSAT Journals
 
A Biclustering Method for Rationalizing Chemical Biology Mechanisms of Action
A Biclustering Method for Rationalizing Chemical Biology Mechanisms of ActionA Biclustering Method for Rationalizing Chemical Biology Mechanisms of Action
A Biclustering Method for Rationalizing Chemical Biology Mechanisms of Action
Gerald Lushington
 
Confirming DNA Replication Origins of Saccharomyces Cerevisiae A Deep Learnin...
Confirming DNA Replication Origins of Saccharomyces Cerevisiae A Deep Learnin...Confirming DNA Replication Origins of Saccharomyces Cerevisiae A Deep Learnin...
Confirming DNA Replication Origins of Saccharomyces Cerevisiae A Deep Learnin...
Anthony Parziale
 

Similaire à Condspe (20)

Applied Bioinformatics Assignment 5docx
Applied Bioinformatics Assignment  5docxApplied Bioinformatics Assignment  5docx
Applied Bioinformatics Assignment 5docx
 
Deep learning methods in metagenomics: a review
Deep learning methods in metagenomics: a reviewDeep learning methods in metagenomics: a review
Deep learning methods in metagenomics: a review
 
Community Finding with Applications on Phylogenetic Networks [Extended Abstract]
Community Finding with Applications on Phylogenetic Networks [Extended Abstract]Community Finding with Applications on Phylogenetic Networks [Extended Abstract]
Community Finding with Applications on Phylogenetic Networks [Extended Abstract]
 
Reconstruction and analysis of cancerspecific Gene regulatory networks from G...
Reconstruction and analysis of cancerspecific Gene regulatory networks from G...Reconstruction and analysis of cancerspecific Gene regulatory networks from G...
Reconstruction and analysis of cancerspecific Gene regulatory networks from G...
 
Big datasets
Big datasetsBig datasets
Big datasets
 
Clustering Approaches for Evaluation and Analysis on Formal Gene Expression C...
Clustering Approaches for Evaluation and Analysis on Formal Gene Expression C...Clustering Approaches for Evaluation and Analysis on Formal Gene Expression C...
Clustering Approaches for Evaluation and Analysis on Formal Gene Expression C...
 
A clonal based algorithm for the reconstruction of genetic network using s sy...
A clonal based algorithm for the reconstruction of genetic network using s sy...A clonal based algorithm for the reconstruction of genetic network using s sy...
A clonal based algorithm for the reconstruction of genetic network using s sy...
 
A clonal based algorithm for the reconstruction of
A clonal based algorithm for the reconstruction ofA clonal based algorithm for the reconstruction of
A clonal based algorithm for the reconstruction of
 
Systems biology for Medicine' is 'Experimental methods and the big datasets
Systems biology for Medicine' is 'Experimental methods and the big datasetsSystems biology for Medicine' is 'Experimental methods and the big datasets
Systems biology for Medicine' is 'Experimental methods and the big datasets
 
Metagenomics and it’s applications
Metagenomics and it’s applicationsMetagenomics and it’s applications
Metagenomics and it’s applications
 
metagenomicsanditsapplications-161222180924.pdf
metagenomicsanditsapplications-161222180924.pdfmetagenomicsanditsapplications-161222180924.pdf
metagenomicsanditsapplications-161222180924.pdf
 
System Biology and Pathway Network.pptx
System Biology and Pathway Network.pptxSystem Biology and Pathway Network.pptx
System Biology and Pathway Network.pptx
 
A Biclustering Method for Rationalizing Chemical Biology Mechanisms of Action
A Biclustering Method for Rationalizing Chemical Biology Mechanisms of ActionA Biclustering Method for Rationalizing Chemical Biology Mechanisms of Action
A Biclustering Method for Rationalizing Chemical Biology Mechanisms of Action
 
Systems biology & Approaches of genomics and proteomics
 Systems biology & Approaches of genomics and proteomics Systems biology & Approaches of genomics and proteomics
Systems biology & Approaches of genomics and proteomics
 
Protein protein interaction, functional proteomics
Protein protein interaction, functional proteomicsProtein protein interaction, functional proteomics
Protein protein interaction, functional proteomics
 
Confirming DNA Replication Origins of Saccharomyces Cerevisiae A Deep Learnin...
Confirming DNA Replication Origins of Saccharomyces Cerevisiae A Deep Learnin...Confirming DNA Replication Origins of Saccharomyces Cerevisiae A Deep Learnin...
Confirming DNA Replication Origins of Saccharomyces Cerevisiae A Deep Learnin...
 
Confirming dna replication origins of saccharomyces cerevisiae a deep learnin...
Confirming dna replication origins of saccharomyces cerevisiae a deep learnin...Confirming dna replication origins of saccharomyces cerevisiae a deep learnin...
Confirming dna replication origins of saccharomyces cerevisiae a deep learnin...
 
CADD
CADDCADD
CADD
 
dream
dreamdream
dream
 
Unravelling the molecular linkage of co morbid diseases
Unravelling the molecular linkage of co morbid diseasesUnravelling the molecular linkage of co morbid diseases
Unravelling the molecular linkage of co morbid diseases
 

Plus de nahomyitbarek

Plus de nahomyitbarek (20)

Five themes of geography
Five themes of geographyFive themes of geography
Five themes of geography
 
Digestive
DigestiveDigestive
Digestive
 
Day in-the-life-of-a-cell
Day in-the-life-of-a-cellDay in-the-life-of-a-cell
Day in-the-life-of-a-cell
 
Cvd definitions and statistics jan 2012
Cvd definitions and statistics jan 2012Cvd definitions and statistics jan 2012
Cvd definitions and statistics jan 2012
 
Concept presentation on chemical bonding (iris lo)
Concept presentation on chemical bonding (iris lo)Concept presentation on chemical bonding (iris lo)
Concept presentation on chemical bonding (iris lo)
 
Computer assignment for grade 9
Computer assignment for grade  9Computer assignment for grade  9
Computer assignment for grade 9
 
Chembond
ChembondChembond
Chembond
 
Chapter13
Chapter13Chapter13
Chapter13
 
Cellular respiration
Cellular respirationCellular respiration
Cellular respiration
 
Cell respirationc
Cell respirationcCell respirationc
Cell respirationc
 
Cell respiration-apbio-1204285933555932-5
Cell respiration-apbio-1204285933555932-5Cell respiration-apbio-1204285933555932-5
Cell respiration-apbio-1204285933555932-5
 
Cell respirationa
Cell respirationaCell respirationa
Cell respirationa
 
Cell respiration
Cell respirationCell respiration
Cell respiration
 
Biol3400 labmanual
Biol3400 labmanualBiol3400 labmanual
Biol3400 labmanual
 
Bioint9 10
Bioint9 10Bioint9 10
Bioint9 10
 
11ch16
11ch1611ch16
11ch16
 
04 gr3 gd
04 gr3 gd04 gr3 gd
04 gr3 gd
 
Topic1
Topic1Topic1
Topic1
 
Em09 cn
Em09 cnEm09 cn
Em09 cn
 
Em08 ect
Em08 ectEm08 ect
Em08 ect
 

Condspe

Learning probabilistic networks of condition-specific response: Digging deep in yeast stationary phase

Sushmita Roy∗, Terran Lane∗, and Margaret Werner-Washburne+
∗ Department of Computer Science, University of New Mexico
+ Department of Biology, University of New Mexico

Abstract

Condition-specific networks are functional networks of genes describing molecular behavior under different conditions such as environmental stresses, cell types, or tissues. These networks frequently comprise parts that are unique to each condition, and parts that are shared among related conditions. Existing approaches for learning condition-specific networks typically identify either only differences or similarities across conditions. Most of these approaches first learn networks per condition independently, and then identify similarities and differences in a post-learning step. Such approaches have not exploited the shared information across conditions during network learning.

We describe an approach for learning condition-specific networks that simultaneously identifies the shared and unique subgraphs during network learning, rather than as a post-processing step. Our approach learns networks across condition sets, shares data from conditions, and leads to high quality networks capturing biologically meaningful information.

On simulated data from two conditions, our approach outperformed an existing approach of learning networks per condition independently, especially on small training datasets. We further applied our approach to microarray data from two yeast stationary-phase cell populations, quiescent and non-quiescent. Our approach identified several functional interactions that suggest respiration-related processes are shared across the two conditions. We also identified interactions specific to each population including regulation of epigenetic expression in the quiescent population, consistent with known characteristics of these cells. Finally, we found several high confidence cases of combinatorial interaction among single gene deletions that can be experimentally tested using double gene knock-outs, and contribute to our understanding of differentiated cell populations in yeast stationary phase.
1 Introduction

Although the DNA for an organism is relatively constant, every organism on earth has the potential to respond to different environmental stimuli or to differentiate into distinct cell-types or tissues. Different environmental conditions, cell-types or tissues can be considered as different instantiations of a global variable, the condition variable, which induces condition-specific responses. These condition-specific responses typically require global changes at the transcript, protein and metabolic levels and are of interest as they provide insight into how organisms function at a systems level. Condition-specific networks describe functional interactions among genes and other macromolecules under different conditions, providing a systemic view of condition-specific behavior in organisms.

Analysis of condition-specific responses has been one of the principal goals of molecular biology, and several approaches have been developed to capture condition-specific responses at different levels of granularity. The most common approach is the identification of differentially expressed genes in a condition of interest using genome-wide measurements of gene, and often protein, expression [20]. More recent approaches are based on bi-clustering, which clusters genes and conditions simultaneously [5,7,9,29] and identifies sets of genes that are co-regulated in sets of conditions. However, these approaches do not provide the fine-grained interaction structure that explains the condition-specific response of genes. More advanced approaches additionally identify transcription modules (sets of transcription factors regulating a set of target genes) that are co-expressed in a condition-specific manner [11,13,26,31], but these too do not provide detailed interaction information among genes for each condition.
In this paper, we describe a novel approach, Network Inference with Pooling Data (NIPD), for condition-specific response analysis that emphasizes the fine-grained interaction patterns among genes under different conditions. The main conceptual contribution of our approach is to learn networks for any subset of conditions. This subsumes existing approaches that find either only patterns that are specific to each condition, or only patterns that are shared across conditions. To make this clear, let us consider a simple example of two environmental starvation conditions: Carbon and Nitrogen starvation. Using our approach we can simultaneously find patterns that are specific only to Carbon starvation, only to Nitrogen starvation, and those that are shared across these two conditions. From the methodological standpoint our work is similar to Bayesian multinets [10], which we extend by allowing data to be pooled across conditions and learning networks for any subset of conditions.

NIPD is based on the framework of probabilistic graphical models (PGMs), where edges represent pairwise and higher-order statistical dependencies among genes. Similar to existing PGM learning algorithms, NIPD infers networks by iteratively scoring candidate networks and selecting the network with the highest score [12]. However, NIPD uses a novel score that evaluates candidate networks with respect to data from any subset of conditions, pooling data for subsets with more than one condition. This subset score and search strategy of NIPD incorporates and exploits the shared information across the conditions during structure learning, rather than as a post-processing step. As a result, we are able to identify sub-networks not only specific to one condition, but to multiple conditions simultaneously, which allows us to build a more holistic picture of condition-specific response. The data pooling aspect of NIPD makes more data available for estimating parameters for higher-order interactions, i.e., interactions among more than two genes. This enables NIPD to robustly estimate higher-order interactions, which are more difficult to estimate due to the high number of parameters relative to pairwise dependencies.
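The pooling idea described above can be illustrated with a minimal sketch. It is not the NIPD score itself; it only enumerates every non-empty subset of conditions and pools the samples belonging to each subset, which is the data a subset-specific score would be evaluated on. The function name `pooled_datasets`, the condition labels, and the toy expression values are assumptions made for illustration.

```python
from itertools import combinations


def pooled_datasets(data_by_condition):
    """Map each non-empty subset of conditions to its pooled samples.

    data_by_condition: dict mapping a condition name to a list of samples.
    Returns a dict keyed by a sorted tuple of condition names.
    """
    conditions = sorted(data_by_condition)
    pooled = {}
    for size in range(1, len(conditions) + 1):
        for subset in combinations(conditions, size):
            # Pooling here means concatenating the samples of every
            # condition that belongs to the subset.
            pooled[subset] = [s for c in subset for s in data_by_condition[c]]
    return pooled


# Hypothetical toy data: two starvation conditions, two genes per sample.
toy = {
    "carbon": [[0.1, 1.2], [0.3, 0.9]],
    "nitrogen": [[1.1, 0.2], [0.8, 0.4], [0.9, 0.1]],
}
pools = pooled_datasets(toy)
for subset, samples in pools.items():
    print(subset, len(samples))
```

With two conditions this yields three datasets: one per condition and one pooled across both. The number of subsets grows exponentially with the number of conditions, so a practical search would have to restrict which subsets it considers.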
By formulating NIPD in the framework of PGMs we have additional benefits: (a) PGMs are generative models of the data, providing a system-wide description of the condition-specific behavior as a probabilistic network, (b) the probabilistic component naturally handles noise in the data, (c) the graph structure captures condition-specific behavior at the level of gene-gene interactions, rather than coarse clusters of genes, (d) the PGM framework can be easily extended to more complex situations where the condition variable itself may be a random variable that must be inferred during network learning. We implement NIPD with undirected, probabilistic graphical models [14]. However, the NIPD framework is applicable to directed graphs as well.

We are not the first to propose networks for capturing condition-specific behavior [24, 34]. Several network-based approaches have been developed for capturing condition-specific behavior such as disease-specific subgraphs in cancer [8], stress response networks in yeast [21], or networks across different species [4,28]. However, these approaches are not probabilistic in nature, often rely on the network being known, and are restricted to pairwise co-expression relationships rather than general statistical dependencies. Other approaches such as differential dependency networks [34], and mixture of subgraphs [24], construct probabilistic models, but focus on differences rather than both differences and similarities. The majority of these approaches infer a network for each condition separately, and then compare the networks from different conditions to identify the edges capturing condition-specific behavior.

We compared NIPD against an existing approach for learning networks from the conditions independently. We refer to this approach as INDEP, which represents a general class of existing algorithms that learn networks per condition independently. On simulated data from networks with known ground truth, NIPD inferred networks with higher quality than did INDEP, especially on small training datasets. We also applied our approach to microarray data from two yeast (Saccharomyces cerevisiae) cell types, quiescent and non-quiescent, isolated from glucose-starved, stationary phase cultures [2]. Networks learned by NIPD were associated with many more Gene Ontology biological processes [3], or were enriched in targets of known transcription factors (TFs) [17], than networks learned by INDEP. Many of the TFs were involved in stress response, which is consistent with the fact that the populations are under starvation stress. NIPD also identified many more shared edges representing biologically meaningful dependencies than did the INDEP approach. This suggests that by pooling data from multiple conditions, we are able to not only capture shared structures better, but also to infer networks with higher overall quality.
2 Results

The goal of our experiments was threefold: (a) to examine the quality of condition-specific networks inferred by our approach that combines data from different conditions (NIPD) versus an independent learner (INDEP), (b) to evaluate the algorithmic performance (measured by network structure quality) as a function of training data size, and (c) to analyze how two different cell populations behave, at the network level, in response to the same starvation stress. We address (a) and (b) on simulated data from networks with known topology, giving us ground truth to directly validate the inferred networks. We address (c) on microarray data from two yeast cell populations isolated from glucose-starved stationary phase cultures [2].

2.1 NIPD had superior performance on networks with known ground truth

We simulated data from two sets of networks, each set with two networks, one network per condition. In the first, HIGHSIM, the networks for the two conditions shared a larger portion (60%) of the edges, and in the second, LOWSIM, the networks shared a smaller portion (20%) of the edges. We compared the networks inferred by NIPD to those inferred by INDEP by assessing the match between true and inferred node neighborhoods (see Supplementary Methods). Briefly, the data were split into q partitions, where q ∈ {2, 4, 6, 8, 10}, and networks were learned for each partition. The size of the training data decreased with increasing q. We first evaluated overall network structure quality by obtaining, as a function of q, the number of nodes on which one approach was significantly better (t-test p-value < 0.05) in capturing the node's neighborhood. On LOWSIM, NIPD was significantly better for smaller amounts of training data. On HIGHSIM, NIPD performed significantly better than INDEP for all training data sizes (Fig 1). Next, we evaluated how well the shared edges were captured as a function of decreasing amounts of training data (Supplementary Fig 1). NIPD captured shared edges better than INDEP on LOWSIM as the amounts of training data decreased. NIPD was better than INDEP on HIGHSIM regardless of the size of the training data. Our results show that when the underlying networks corresponding to the different conditions share a lot of structure, NIPD has a significantly greater advantage than INDEP, which does not do any pooling.
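The neighborhood comparison described above can be sketched as follows. The exact match measure is given in the Supplementary Methods and is not reproduced here; this sketch assumes an F-score between the true and inferred neighbor sets of a node, which is one common choice. The helper names and the toy edge lists are hypothetical.

```python
def neighborhoods(edges):
    """Build an undirected adjacency map from a list of (u, v) edges."""
    nbrs = {}
    for u, v in edges:
        nbrs.setdefault(u, set()).add(v)
        nbrs.setdefault(v, set()).add(u)
    return nbrs


def neighborhood_f_score(true_nbrs, inferred_nbrs, node):
    """F-score between the true and inferred neighbor sets of one node.

    This is an assumed stand-in for the paper's match measure.
    """
    truth = true_nbrs.get(node, set())
    guess = inferred_nbrs.get(node, set())
    hits = len(truth & guess)
    if hits == 0:
        return 0.0
    precision = hits / len(guess)
    recall = hits / len(truth)
    return 2 * precision * recall / (precision + recall)


# Toy ground-truth and inferred networks.
true_net = neighborhoods([("A", "B"), ("A", "C"), ("B", "C")])
inferred_net = neighborhoods([("A", "B"), ("A", "D")])
score_a = neighborhood_f_score(true_net, inferred_net, "A")
score_c = neighborhood_f_score(true_net, inferred_net, "C")
print(score_a, score_c)
```

Computing such a score per node for each of the q partitions gives two samples of scores per node, one for each method, which could then be compared with a t-test as in the text.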
Furthermore, as training data size decreases, NIPD is better than INDEP for learning both overall and shared structures, independent of the extent of sharing in the true networks.

2.2 Application to yeast quiescence

We applied NIPD to microarray data from two yeast cell populations, quiescent (QUIESCENT) and non-quiescent (NON-QUIESCENT), isolated from glucose starvation-induced stationary phase cultures [2]. The two cell populations are in the same media but have differentiated physiologically and morphologically, suggesting that each population is responding differently. We learned networks using NIPD and INDEP treating each cell population as a condition. Because each array in the dataset was obtained from a single gene deletion mutant, the networks were constrained such that genes with deletion mutants connected to the remaining genes¹. The inferred networks from both methods were evaluated using information from Gene Ontology (GO) process, GO Slim [3] and transcriptional regulatory networks [17]. Gene Ontology is a hierarchically structured ontology of terms used to annotate genes. GO Slim is a collapsed single-level view of the complete GO terms, providing high-level information about the processes, functions and cellular locations involving a set of genes. Finally, we analyzed combinations of genes with deletions that were in the neighborhood of other non-deletion genes.

2.2.1 NIPD identified more biologically meaningful dependencies

To determine if one network was more biologically meaningful than the other, we examined the networks based on Gene Ontology (GO) Slim category (process, function and location), transcription factor binding data and GO process, referred to as GOSLIM, TFNET and GOPROC, respectively (Fig 2). Network quality was determined by the number of GOSLIM categories (or TFNET or GOPROC) with better coverage than random networks (see Methods). Both approaches were equivalent for GOSLIM, with INDEP outperforming NIPD in QUIESCENT and NIPD outperforming INDEP on NON-QUIESCENT. On TFNET, NIPD outperformed INDEP by a larger margin than the margin by which it was itself outperformed on the TFNET categories from NON-QUIESCENT. NIPD was consistently better than INDEP on GOPROC categories. The networks learned by NIPD had many more edges than the networks learned by INDEP (Supplementary Table 1).
To estimate the proportion of the edges capturing biologically meaningful relationships, we computed semantic similarity of genes connected by the edges [16]. Although both INDEP and NIPD had significantly better semantic similarity than random networks, INDEP degraded in p-value for QUIESCENT at the highest value of semantic similarity (Fig 3). NIPD-inferred networks had many more edges with high semantic similarity than INDEP, while keeping the proportion of edges satisfying a particular semantic similarity threshold close to INDEP. This suggests that NIPD identifies more dependencies that are biologically relevant than INDEP without suffering in precision.

¹ This is not a bi-partite graph because the genes with deletion mutants are allowed to connect to each other.

2.2.2 NIPD identified more shared edges representing common starvation response

We performed a more fine-grained analysis of the inferred networks by considering each gene and its immediate neighborhood, and tested whether these gene neighborhoods were enriched in GO biological processes, or in the target set of transcription factors (TFs) (see Methods). Using a false discovery rate (FDR) cutoff of 0.05, we identified many more subgraphs in the networks inferred by NIPD than by INDEP to be enriched in a GO process or in targets of TFs (Figs 4, 5). NIPD identified more processes and larger subgraphs in both populations (oxidative phosphorylation, protein folding, fatty acid metabolism, ammonium transport) than did INDEP. NIPD-identified subgraphs involved in aerobic respiration and oxidative phosphorylation were enriched in targets of HAP4, a global activator of respiration genes. The presence of HAP4 targets in both cell populations makes sense because both populations are experiencing glucose starvation and must switch to respiration for deriving energy. We also found the TFs MSN2, MSN4, and HSF1 regulating subgraphs involved in protein folding. These TFs activate stress responses and are known to activate genes involved in heat, oxidative and starvation stress. We also found targets of SIP4 in both populations. SIP4 is a transcriptional activator of gluconeogenesis [32], expressed highly in glucose repressed cells [15], and therefore would be expected to be present in both quiescent and non-quiescent cells. In contrast, the only shared regulatory connection found by INDEP was HAP4.
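A neighborhood enrichment analysis of this kind can be sketched in a few lines. The text states only that an FDR cutoff of 0.05 was used; the hypergeometric test and the Benjamini-Hochberg procedure below are assumptions, chosen because they are standard for this kind of analysis, and the function names and numbers are hypothetical.

```python
from math import comb


def hypergeom_pvalue(population, annotated, drawn, overlap):
    """P(X >= overlap) when `drawn` genes are sampled without replacement
    from `population` genes, of which `annotated` carry the annotation."""
    total = comb(population, drawn)
    upper = min(annotated, drawn)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, drawn - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


def benjamini_hochberg(pvalues, alpha=0.05):
    """Return the sorted indices of tests rejected at FDR level alpha."""
    order = sorted(range(len(pvalues)), key=lambda i: pvalues[i])
    m = len(pvalues)
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose p-value passes its step-up threshold.
        if pvalues[idx] <= alpha * rank / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


# Toy case: a 3-gene neighborhood in which all 3 genes carry a term that
# annotates 3 of 10 genes overall.
p_toy = hypergeom_pvalue(population=10, annotated=3, drawn=3, overlap=3)
rejected = benjamini_hochberg([0.001, 0.04, 0.03, 0.5], alpha=0.05)
print(p_toy, rejected)
```

In practice one p-value would be computed per (neighborhood, term) pair, and the correction applied across all pairs tested.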
We conclude that the NIPD approach identified more networks that were biologically relevant and informative about glucose starvation response than did INDEP.
2.2.3 Wiring differences in NIPD-inferred networks exhibit population-specific starvation response

NIPD identified several processes associated exclusively with quiescent cells. These included regulatory processes (regulation of epigenetic gene expression, and regulation of nucleobase, nucleoside and nucleic acid metabolism) and metabolic processes (pentose phosphate shunt). These were novel predictions that highlight differences between these cells based on network wiring. INDEP identified only one population-specific GO process (response to reactive oxygen species in NON-QUIESCENT). An INDEP-identified subgraph specific to QUIESCENT (protein de-ubiquitination) was actually a subset of the NIPD-identified subgraph involved in epigenetic gene expression regulation, indicating that NIPD subsumed most of the information captured by INDEP.

NIPD QUIESCENT networks contained subgraphs enriched exclusively in targets of SKO1 and AZF1. Both of these are zinc finger TFs: AZF1 protein is expressed highly under non-fermentable carbon sources [27], and SKO1 regulates low affinity glucose transporters [30]. Both are consistent with the condition experienced by these cells. Unlike NIPD, which identified SIP4 to be associated with both populations, INDEP identified SIP4 only in QUIESCENT. However, as we described in the previous section, it is more likely that SIP4 is involved in both QUIESCENT and NON-QUIESCENT populations. INDEP also found the TFs YAP7 and AFT2 exclusively in QUIESCENT and NON-QUIESCENT, respectively. YAP7 is involved in general stress response and would be expected to have targets in both QUIESCENT and NON-QUIESCENT. AFT2 is required under oxidative stress and is consistent with the over-abundance of reactive oxygen species in the NON-QUIESCENT population [1].

NIPD also identified wiring differences in the subgraphs involved in shared processes. For example, in addition to HAP4, NIPD identified HAP2 as an important TF in QUIESCENT.
The presence of both HAP2 and HAP4 makes biological sense because they are both part of the HAP2/HAP3/HAP4/HAP5 complex required for activation of respiratory genes. The presence of both HAP2 and HAP4 in QUIESCENT, but not NON-QUIESCENT, suggests that the QUIESCENT population may be better equipped for respiration and long term survival in stationary phase.
Overall, the NIPD-inferred networks captured key differences and similarities in metabolic and regulatory processes, which are consistent with existing information about these cell populations [1,2], and also include novel findings that can provide new insight into starvation response in yeast.

2.2.4 NIPD identified several knock-out combinations

The microarrays used in this study measured expression profiles of single gene deletion mutants, for genes that were previously identified to be highly expressed at the mRNA level in stationary phase. We constrained the inferred networks to identify neighborhoods of genes comprising only the genes with deletion mutants, allowing us to identify combinations of such deletion mutants and their targets. Such combinations can be validated in the laboratory to verify cross-talk between pathways. We found that NIPD-inferred networks contained significantly more deletion combinations compared to random networks for both the quiescent and non-quiescent populations (p-value < 3E-10, Supplementary Tables 3, 4, 5), which was not the case for the INDEP-identified networks (Supplementary Tables 6, 7). A more stringent analysis of the knock-out combinations using GO process semantic similarity identified several double knock-out and target gene candidates (Supplementary Table 2). We also found more deletion combinations in NON-QUIESCENT compared to QUIESCENT. This is consistent with the identification of many more mutants affecting non-quiescent than quiescent cells [2]. In QUIESCENT, we found three genes that were all likely downstream targets of a COX7-QCR8 double knock-out, all involved in the cytochrome-c oxidase complex of the mitochondrial inner membrane. Other deletion mutant combinations were involved in mitochondrial ATP synthesis and ion transport. Many of these genes have been shown to be required for quiescent and non-quiescent cell function, viability and survival [2, 18].
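Extracting candidate knock-out combinations from an inferred network can be sketched as below. This is an illustrative reading of the procedure, not the authors' implementation: for every gene without a deletion mutant, it lists the pairs of deletion-mutant genes found in that gene's neighborhood. The function name, the placeholder target `TARGET_1`, and the toy adjacency map are hypothetical.

```python
from itertools import combinations


def knockout_pairs(neighbors, deletion_genes):
    """For each non-deletion gene, list the pairs of deletion-mutant genes
    that appear together in its neighborhood."""
    result = {}
    for gene, nbrs in neighbors.items():
        if gene in deletion_genes:
            continue  # only non-deletion genes are treated as targets
        deleted_nbrs = sorted(n for n in nbrs if n in deletion_genes)
        pairs = list(combinations(deleted_nbrs, 2))
        if pairs:
            result[gene] = pairs
    return result


# Toy adjacency map; TARGET_1 and TARGET_2 are placeholder gene names.
toy_neighbors = {
    "TARGET_1": {"COX7", "QCR8"},
    "TARGET_2": {"QCR8"},
    "QCR8": {"TARGET_1", "TARGET_2"},
    "COX7": {"TARGET_1"},
}
toy_deletions = {"COX7", "QCR8"}
candidates = knockout_pairs(toy_neighbors, toy_deletions)
print(candidates)
```

Each returned pair with its target is a candidate for a double knock-out experiment; comparing the number of such pairs against the count obtained from random networks would give a significance estimate of the kind reported above.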
In NON-QUIESCENT, we found several knock-out combinations involved in oxidative phosphorylation, aerobic respiration, etc., including a novel combination, YMR31 and QCR8, connected to TPS2. All three genes are found in the mitochondria, which play a critical and complex role in starved cells, but the exact mechanisms are not well understood. Experimental analysis of this triplet can provide new insights into the role of mitochondria in glucose-starved cells. In summary, these results demonstrated another benefit
of data pooling in NIPD: learning more complex, combinatorial relationships among genes.

3 Discussion

Inference and analysis of cellular networks has been one of the cornerstones of systems biology. We have developed a network learning approach, Network Inference with Pooling Data (NIPD), to capture a systemic view of condition-specific response. NIPD is based on probabilistic graphical models and infers the functional wiring among genes involved in condition-specific response. The crux of our approach is to learn networks for any subset of conditions, capturing fine-grained gene interaction patterns not only in individual conditions but in any combination of conditions. This allows NIPD to robustly identify both shared and unique components of condition-specific cellular networks. In comparison to an approach that learns networks independently (INDEP), NIPD (a) pools data across different conditions, enabling better exploitation of the shared information between conditions, (b) learns better overall network structures in the face of decreasing amounts of training data, and (c) learns structures with many more biologically meaningful dependencies.

Small training data sets, which are especially common for biological data, present significant challenges for any network learning approach. In particular, approaches such as INDEP may learn drastically different networks due to small data perturbations, leading to differences that are not biologically meaningful. NIPD is more resilient to small perturbations because, by pooling data from different conditions during network learning, it effectively has more data for estimating parameters for the shared parts of the network.

Another challenge in the analysis of condition-specific networks is to extract patterns that are shared across conditions.
Approaches such as INDEP that learn networks for each condition independently, and then compare the networks, are more likely to learn different networks, making it difficult to identify the similarities across conditions. Application of both the NIPD and INDEP approaches to microarray data from two yeast populations showed that many of the subgraphs that would be considered specific to each population by INDEP were actually shared biological processes that must be activated in both populations irrespective of their morphological and physiological differences.
One of the strengths of NIPD in comparison with INDEP was its ability to identify pairs of gene deletions and downstream targets using data from individual gene deletions. Notably, several of these gene deletions are already known to have a phenotypic effect on stationary-phase cultures and often on quiescent or non-quiescent cells (Supplementary Table 2) [2,18]. These predictions are therefore good candidates for future experiments using double deletion mutants, and are a drastic reduction of the space of possible combinations of sixty-nine single gene deletions. Identification of population-specific malfunctions in signaling pathways via experimental analysis of these multiple deletions can provide new insight into aging and cancer studies using yeast stationary phase as a model system.

The NIPD approach establishes groundwork for important future enhancements, including the ability to efficiently learn networks from many conditions. The probabilistic framework of NIPD can be easily extended to automatically infer the condition variable, making NIPD widely applicable to datasets with uncertainty about the conditions. The NIPD approach can also integrate novel types of high-throughput data including RNA-seq [33] and ChIP-seq [25]. These extensions will allow us to systematically identify the parts, and the wiring among them, that determine stage-specific, tissue-specific and disease-specific behavior in whole organisms.

4 Methods

4.1 Independent learning of condition-specific networks: INDEP

Existing approaches for learning condition-specific networks [4, 21, 28] can be considered as special cases of a general independent learning approach, INDEP, where networks for each condition are learned independently and then compared to identify network parts unique or shared across conditions. Let $\{D_1, \cdots, D_k\}$ denote $k$ datasets from $k$ conditions. In the INDEP approach, each network $G_c$, $1 \le c \le k$, is learned independently using data from $D_c$ only.
Our implementation of the INDEP framework considered each $G_c$ as an undirected probabilistic graphical model, or a Markov random field (MRF) [14], which, like Bayesian networks, can capture higher-order dependencies,
but additionally captures cyclic dependencies. We use a pseudo-likelihood framework with an MDL penalty to learn the structure of the MRF [6]. The pseudo-likelihood score for a network $G_c$ describing data $D_c$ is

$$\mathrm{PLL}(G_c) = \sum_{i=1}^{N} \mathrm{PLLV}(X_i, M_{ci}, c),$$

where $X_1, \cdots, X_N$ are the random variables (one for each gene), encoding the expression value of a gene. PLLV is $X_i$'s contribution to the overall pseudo-likelihood and is defined, including a minimum description length (MDL) penalty, as

$$\mathrm{PLLV}(X_i, M_{ci}, c) = \sum_{d=1}^{|D_c|} \log P(X_i = x_{di} \mid M_{ci} = m_{cdi}) + \frac{|\theta_{ci}| \log(|D_c|)}{2}.$$

Here $M_{ci}$ is the Markov blanket (MB) of $X_i$ in condition $c$, and $x_{di}$ and $m_{cdi}$ are assignments to $X_i$ and $M_{ci}$, respectively, from the $d$th data point. $\theta_{ci}$ are the parameters of the conditional distribution $P(X_i \mid M_{ci})$. We assume the conditional distributions to be conditional Gaussians. The structure learning algorithm for each graph is described in [22].

4.2 Network Inference with Pooling Data: NIPD

The NIPD approach that we present extends the INDEP approach by incorporating shared information across conditions during structure learning. In this framework, we do not learn networks for each condition $c$ separately. Instead, we devise a score for each edge addition that considers networks for any subset of the conditions. Let $C$ denote the set of $k$ conditions. For a non-singleton set, $E \subseteq C$, we pool the data from all conditions $e \in E$ and evaluate the overall score improvement on adding an edge to networks for all $e \in E$. To learn $\{G_1, \cdots, G_k\}$ for the $k$ conditions simultaneously, we maximize the following MDL-based score:

$$S(G_1, \cdots, G_k) = P(D_1, \cdots, D_k \mid \theta_1, \cdots, \theta_k)\, P(\theta_1, \cdots, \theta_k \mid G_1, \cdots, G_k) + \text{MDL Penalty} \qquad (1)$$

Here $\theta_1, \cdots, \theta_k$ are the maximum likelihood parameters for the $k$ graphs. We assume $P(D_c \mid \theta_1, \cdots, \theta_k) = P(D_c \mid \theta_c)$. That is, if we know the parameters $\theta_c$, the likelihood of the data from condition, $D_c$, given $\theta_c$ can be estimated independently. Thus, $P(D_1, \cdots, D_k \mid \theta_1, \cdots, \theta_k) = \prod_{c=1}^{k} P(D_c \mid \theta_c)$. Because our networks are MRFs, we use pseudo-likelihood $\mathrm{PLL}(D_c)$. We expand the complete condition-specific parameter set $\theta_c$ to $\{\theta_{c1}, \cdots, \theta_{cN}\}$, which is the set of parameters of each variable $X_i$, $1 \le i \le N$,
in condition $c$. Using the parameter modularity assumption for each variable, we have:

$$P(\theta_1, \cdots, \theta_k \mid G_1, \cdots, G_k) = \prod_{i=1}^{N} P(\theta_{1i}, \cdots, \theta_{ki} \mid M_{1i}, \cdots, M_{ki}) \qquad (2)$$

Note that the parameters of conditional probabilities of individual random variables are independent, but the parameters per variable are not independent across conditions. To enforce dependency among the $\theta_{ci}$, we make $M_{ci}$ depend on all the neighbors of $X_i$ in condition $c$ and all sets of conditions that include $c$. To convey the intuition behind this idea, let us consider the two-condition case $C = \{A, B\}$. A variable $X_j$ can be in $X_i$'s MB in condition $A$ either if it is connected to $X_i$ only in condition $A$, or if it is connected to $X_i$ in both conditions $A$ and $B$. Let $M^*_{Ai}$ be the set of variables that are connected to $X_i$ only in condition $A$ but not in both $A$ and $B$. Similarly, let $M^*_{\{A,B\}i}$ denote the set of variables that are connected to $X_i$ in both $A$ and $B$ conditions. Hence, $M_{Ai} = M^*_{Ai} \cup M^*_{\{A,B\}i}$. More generally, for any $c \in C$, $M_{ci} = \bigcup_{E \in \mathrm{powerset}(C)\,:\,c \in E} M^*_{Ei}$, where $M^*_{Ei}$ denotes the neighbors of $X_i$ only in condition set $E$.

To incorporate this dependency in the structure score, we need to define $P(X_i \mid M_{ci})$ such that it takes into account all subsets $E$ with $c \in E$. We assume that the MBs, $M^*_{Ei}$, independently influence $X_i$. This allows us to write $P(X_i \mid M_{ci})$ as a product:

$$P(X_i \mid M_{ci}) \propto \prod_{E \in \mathrm{powerset}(C)\,:\,c \in E} P(X_i \mid M^*_{Ei}).$$

To learn the $k$ graphs, we exhaustively enumerate over condition sets, $E$, and estimate parameters $\theta_{Ei}$ by pooling the data for all non-singleton $E$. Our structure learning algorithm maintains a conditional distribution for every variable $X_i$, for every set $E \in \mathrm{powerset}(C)$. We consider the addition of an edge $\{X_i, X_j\}$ in every set $E$. This addition will affect the conditionals of $X_i$ and $X_j$ in all conditions $e \in E$.
Because the MBs per condition set independently influence the conditional, the pseudo-likelihood $\mathrm{PLLV}(X_i, M_{ei}, e)$ decomposes as $\sum_{E \,:\, e \in E} \mathrm{PLLV}(X_i, M^*_{Ei}, e)$ (Supplementary information). The net score improvement of adding an edge $\{X_i, X_j\}$ to a condition set $E$ is given by:

$$\Delta\mathrm{Score}_{\{X_i,X_j\},E} = \sum_{e \in E} \sum_{d=1}^{|D_e|} \Big[ \mathrm{PLLV}(X_i, M_{ei} \cup \{X_j\}, e) - \mathrm{PLLV}(X_i, M_{ei}, e) + \mathrm{PLLV}(X_j, M_{ej} \cup \{X_i\}, e) - \mathrm{PLLV}(X_j, M_{ej}, e) \Big] \qquad (3)$$
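
As a rough illustration of how such a score could be computed, the sketch below fits a conditional Gaussian for one variable by least squares on data pooled over a condition set, and evaluates the gain from adding one neighbor. The names `pllv` and `edge_gain`, the intercept term, the parameter count, and the choice to subtract the MDL term so that larger scores are better are all assumptions made for this example; they are not taken from the authors' implementation, and the sketch scores only the pooled condition set rather than the full product over condition sets.

```python
import numpy as np

def pllv(data, target, blanket):
    """Penalized Gaussian log-likelihood of column `target` given columns `blanket`.

    Assumption: X_target | blanket ~ N(w^T m + b, sigma^2), fitted by least squares.
    """
    n = data.shape[0]
    x = data[:, target]
    # Design matrix: an intercept column plus the Markov blanket columns.
    design = np.hstack([np.ones((n, 1)), data[:, list(blanket)]])
    w, *_ = np.linalg.lstsq(design, x, rcond=None)
    resid = x - design @ w
    var = max(float(np.mean(resid ** 2)), 1e-12)  # ML variance, floored for stability
    loglik = -0.5 * n * (np.log(2.0 * np.pi * var) + 1.0)
    n_params = design.shape[1] + 1  # weights, intercept, variance (assumed count)
    # Assumed sign convention: the MDL term is subtracted so that higher is better.
    return loglik - 0.5 * n_params * np.log(n)

def edge_gain(data_by_condition, cond_set, i, j, blanket_i, blanket_j):
    """Score change from adding edge {X_i, X_j} in condition set `cond_set`."""
    # Pool the rows of every condition in the set before estimating parameters.
    pooled = np.vstack([data_by_condition[e] for e in cond_set])
    gain_i = pllv(pooled, i, blanket_i + [j]) - pllv(pooled, i, blanket_i)
    gain_j = pllv(pooled, j, blanket_j + [i]) - pllv(pooled, j, blanket_j)
    return gain_i + gain_j

rng = np.random.default_rng(0)

def simulate(n):
    """Toy data: X1 depends on X0, X2 is independent of both."""
    x0 = rng.normal(size=n)
    x1 = 2.0 * x0 + 0.1 * rng.normal(size=n)
    x2 = rng.normal(size=n)
    return np.column_stack([x0, x1, x2])

data = {"A": simulate(100), "B": simulate(100)}
shared_gain = edge_gain(data, ("A", "B"), 0, 1, [], [])
noise_gain = edge_gain(data, ("A", "B"), 0, 2, [], [])
print(shared_gain > noise_gain)
```

On this toy data the edge between the two dependent variables should score far higher than the edge to the independent variable, which is the behavior a greedy structure search relies on.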
Because of the decomposability of $\mathrm{PLLV}(X_i \mid M_{ei})$, all terms other than those involving the Markov blanket variables in condition set $E$ remain unchanged, producing the score improvement:

$$\Delta\mathrm{Score}_{\{X_i,X_j\},E} = \mathrm{PLLV}(X_i \mid M^*_{Ei} \cup \{X_j\}) - \mathrm{PLLV}(X_i \mid M^*_{Ei})$$

This score decomposability allows us to efficiently learn networks over condition sets. Our structure learning algorithm is described in more detail in Supplementary material.

4.3 Simulated data description and analysis

We generated simulated datasets using two sets of networks of known structure, HIGHSIM and LOWSIM. All networks had the same number of nodes, $n = 68$, and were obtained from the E. coli regulatory network [23]. We used the INDEP model for generating the eight simulated datasets. The parameters of the INDEP model were initialized using random partitions of an initial dataset generated from a differential-equation based regulatory network simulator [19].

4.4 Microarray data description

Each microarray measures the expression of all yeast genes in response to genetic deletions from quiescent (85) and non-quiescent (93) populations [2], with 69 common to both populations. The arrays had biological replicates, producing 170 and 186 measurements per gene in the quiescent and non-quiescent populations, respectively. We filtered the microarray data to exclude genes with > 80% missing values, resulting in 3,012 genes. We constrained the network structures such that a gene could connect only to the 69 genes with deletion mutants, and no gene had more than 8 neighbors.

4.5 Validation of network edges using coverage of annotation categories

The coverage of an annotation category $A$ is defined as the harmonic mean of precision and recall. Let $L$ denote the complete list of genes used for network learning, and let $L_A \subseteq L$ denote the genes annotated with $A$. Let $l_A$ denote the number of edges in our learned network between two genes $g_i$ and $g_j$ such that $g_i \in L_A$ and $g_j \in L_A$.
Let $t_A$ be the total number of edges that are connected to genes in $L_A$ (note $t_A > l_A$). Let $s_A$ denote the total number of edges that could exist among the
$|L_A|$ genes in $L_A$, which is $\binom{|L_A|}{2}$ if $|L_A| < 8$ and $|L_A| * 8$ if $|L_A| > 8$. Precision for category $A$ is defined as $p_A = \frac{l_A}{t_A}$ and recall is defined as $r_A = \frac{l_A}{s_A}$. These are used to define the coverage of category $A$, $\frac{2 p_A r_A}{p_A + r_A}$. We compute this coverage score for all categories using each inferred network, and compare the score against an expected coverage from random networks with the same degree distribution.

To compare NIPD against INDEP, assume we are comparing the inferred quiescent networks. Let $A_{INDEP}$ and $A_{NIPD}$ denote the categories better than random in the INDEP and NIPD quiescent networks, respectively. To determine how much better INDEP is than NIPD, we obtain the number of categories in $A_{INDEP} \cup A_{NIPD}$ on which INDEP has a better coverage than NIPD. We similarly assess how much better NIPD is than INDEP. We repeat this procedure for the non-quiescent networks. We also compared the semantic similarity of edges in inferred and random networks [16] (Supplementary material).

4.6 Evaluation of gene deletion combinations

We identified combinations of genes with deletion mutants from Markov blankets comprising > 1 of these deletion genes. We evaluated each algorithm's ability to capture gene deletion combinations by comparing the number of such combinations against the number found in random networks with the same number of edges. This random network model provided a rough significance assessment on the number of inferred knock-out combinations (Supplementary Table 3). We then performed a more stringent analysis based on semantic similarity, using the sub-network spanning only the genes with deletion combinations. We generated random networks with the same degree distributions as this sub-network and computed the semantic similarity of each gene with the set of deletion genes connected to it, in the inferred and random networks. We then selected genes with significantly higher semantic similarity than in random networks (z-test, p-value < 0.05).
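
The coverage score of Section 4.5 can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `coverage`, the edge-list representation, and the handling of the unspecified case $|L_A| = 8$ (treated here like $|L_A| < 8$) are assumptions.

```python
from math import comb

def coverage(edges, category_genes, max_neighbors=8):
    """Harmonic mean of precision and recall for one annotation category.

    edges: iterable of (gene, gene) pairs from a learned network.
    category_genes: genes annotated with the category (L_A).
    """
    cat = set(category_genes)
    # l_A: edges with both endpoints in the category.
    l_a = sum(1 for u, v in edges if u in cat and v in cat)
    # t_A: edges touching at least one gene in the category.
    t_a = sum(1 for u, v in edges if u in cat or v in cat)
    n = len(cat)
    # s_A: edges that could exist among the category genes under the degree cap.
    # Assumption: the boundary case n == max_neighbors uses the binomial form.
    s_a = comb(n, 2) if n <= max_neighbors else n * max_neighbors
    if l_a == 0 or t_a == 0 or s_a == 0:
        return 0.0
    precision = l_a / t_a
    recall = l_a / s_a
    return 2 * precision * recall / (precision + recall)

# Toy network: a chain a-b-c-d with category {a, b, c}.
toy_edges = [("a", "b"), ("b", "c"), ("c", "d")]
toy_score = coverage(toy_edges, {"a", "b", "c"})
print(round(toy_score, 3))
```

In the toy chain, two of the three edges touching the category lie inside it and three edges are possible among its genes, so precision and recall are both 2/3.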
5 Acknowledgements

This work is supported by grants from NIMH (1R01MH076282-03) and NSF (IIS-0705681) to T.L., from NIH (GM-67593) and NSF (MCB0734918) to M.W.W., and from HHMI-NIH/NIBIB (56005678).
[Figure 1 plots omitted. Panels: HIGHSIM NET1, LOWSIM NET1, HIGHSIM NET2, LOWSIM NET2.]

Figure 1: Number of variables (y-axis) on which one method was significantly better than the other as a function of the size of the training data (x-axis). Left is for the two networks (HIGHSIM) that share 60% of their edges and right is for the two networks (LOWSIM) that share 20% of their edges. The top and bottom graphs are for networks from the individual conditions.

[Figure 2 bar charts omitted. Panels: GOSLIM, TFNET, GOPROC. y-axis: # of Categories. Groups: QUIESCENT, NON-QUIESCENT. Legend: INDEP>NIPD, NIPD>INDEP.]

Figure 2: Network quality comparison based on coverage of GOSlim (GOSLIM), targets of transcription factors (TFNET) and GO process (GOPROC). Each bar represents the number of categories on which INDEP had better coverage than NIPD (INDEP>NIPD) or NIPD had better coverage than INDEP (NIPD>INDEP).

[Figure 3 plots omitted. Panels: QUIESCENT, NON-QUIESCENT. x-axis: Semantic Similarity. y-axis: log(# of Edges). Legend: RAND (NIPD), NIPD, RAND (INDEP), INDEP.]

Figure 3: Network quality comparison based on semantic similarity. The dashed lines represent the background distribution generated from random networks and the solid lines represent the distribution of the semantic similarity in the inferred networks.

[Figure 4 network diagram omitted. TF nodes: HAP4, HAP2, SIP4, MSN2, MSN4, HSF1, SKO1, AZF1. Subgraph labels: acetyl-CoA metabolic process; organelle ATP synthesis coupled electron transport; oxidative phosphorylation; aerobic respiration; protein folding; regulation of gene expression, epigenetic; nitrogen utilization; ammonium transport; regulation of nucleobase, nucleoside, nucleotide and nucleic acid metabolic process; ethanol metabolic process; polyamine catabolic process; beta-alanine biosynthetic process; carboxylic acid biosynthetic process; NADH regeneration; response to metal ion; fatty acid metabolic process; pentose-phosphate shunt; pentose metabolic process.]

Figure 4: GO processes and TF targets for subgraphs from the NIPD-inferred networks using the quiescent population. The text below each subgraph indicates the process. The diamonds represent the TFs. A TF is connected to the subgraph which is enriched in the targets of the TF. The circular nodes represent the genes in the network and color represents the extent of differential expression, red: up-regulated, green: down-regulated.

[Figure 5 network diagram omitted. TF nodes: HAP4, MSN4, MSN2, HSF1, SIP4. Subgraph labels: ion transport; oxidative phosphorylation; protein folding; ammonium transport; nitrogen utilization; energy derivation by oxidation of organic compounds; fatty acid metabolic process; beta-alanine biosynthetic process; mitochondrial electron transport, ubiquinol to cytochrome c; polyamine catabolic process; aerobic respiration; carnitine metabolic process.]

Figure 5: GO processes and TF targets for subgraphs from the NIPD-inferred networks using the non-quiescent population. The legend is the same as in Figure 4.

References

[1] C. Allen, S. Büttner, A. D. Aragon, J. A. Thomas, O. Meirelles, J. E. Jaetao, D. Benn, S. W. Ruby, M. Veenhuis, F. Madeo, and M. Werner-Washburne. Isolation of quiescent and nonquiescent cells from yeast stationary-phase cultures. J Cell Biol, 174(1):89–100, July 2006.
[2] Anthony D. Aragon, Angelina L. Rodriguez, Osorio Meirelles, Sushmita Roy, George S. Davidson, Chris Allen, Ray Joe, Phillip Tapia, Don Benn, and Margaret Werner-Washburne. Characterization of differentiated quiescent and non-quiescent cells in yeast stationary-phase cultures. Molecular Biology of the Cell, 2008.
[3] M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P. Davis, K. Dolinski, S. S. Dwight, J. T. Eppig, M. A. Harris, D. P. Hill, L. Issel-Tarver, A. Kasarskis, S. Lewis, J. C. Matese, J. E. Richardson, M. Ringwald, G. M. Rubin, and G. Sherlock. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet, 25(1):25–29, May 2000.
[4] S. Bergmann, J. Ihmels, and N. Barkai. Similarities and differences in genome-wide expression data of six organisms. PLoS Biol, 2(1), January 2004.
[5] Sven Bergmann, Jan Ihmels, and Naama Barkai. Iterative signature algorithm for the analysis of large-scale gene expression data. Physical Review E, Statistical, Nonlinear, and Soft Matter Physics, 67(3 Pt 1), March 2003.
[6] Julian Besag. Efficiency of pseudolikelihood estimation for simple Gaussian fields. Biometrika, 64(3):616–618, December 1977.
[7] Richard Bonneau, David J Reiss, Paul Shannon, Marc Facciotti, Leroy Hood, Nitin S Baliga, and Vesteinn Thorsson. The Inferelator: an algorithm for learning parsimonious regulatory networks from systems-biology data sets de novo. Genome Biology, 2006.
[8] Han-Yu Chuang, Eunjung Lee, Yu-Tsueng Liu, Doheon Lee, and Trey Ideker. Network-based classification of breast cancer metastasis. Mol Syst Biol, 3, October 2007.
[9] Karthik Devarajan. Nonnegative matrix factorization: An analytical and interpretive tool in computational biology. PLoS Comput Biol, 4(7):e1000029+, July 2008.
[10] Dan Geiger and David Heckerman. Advances in probabilistic reasoning. In Proceedings of the seventh conference (1991) on Uncertainty in artificial intelligence, pages 118–126, San Francisco, CA, USA, 1991. Morgan Kaufmann Publishers Inc.
[11] Christopher T. Harbison, D. Benjamin Gordon, Tong Ihn Lee, Nicola J. Rinaldi, Kenzie D. Macisaac, Timothy W. Danford, Nancy M. Hannett, Jean-Bosco Tagne, David B. Reynolds, Jane Yoo, Ezra G. Jennings, Julia Zeitlinger, Dmitry K. Pokholok, Manolis Kellis, P. Alex Rolfe, Ken T. Takusagawa, Eric S. Lander, David K. Gifford, Ernest Fraenkel, and Richard A. Young. Transcriptional regulatory code of a eukaryotic genome. Nature, 2004.
[12] David Heckerman. A Tutorial on Learning Bayesian Networks. Technical Report MSR-TR-95-06, Microsoft Research, March 1995.
[13] Hyunsoo Kim, William Hu, and Yuval Kluger. Unraveling condition specific gene transcriptional regulatory networks in Saccharomyces cerevisiae. BMC Bioinformatics, 2006.
[14] Steffen L. Lauritzen. Graphical Models. Oxford Statistical Science Series. Oxford University Press, New York, USA, July 1996.
[15] P. Lesage, X. Yang, and M. Carlson. Yeast SNF1 protein kinase interacts with SIP4, a C6 zinc cluster transcriptional activator: a new role for SNF1 in the glucose response. Molecular and Cellular Biology, 16(5):1921–1928, May 1996.
[16] P. W. Lord, R. D. Stevens, A. Brass, and C. A. Goble. Investigating semantic similarity measures across the gene ontology: the relationship between sequence and annotation. Bioinformatics, 19(10):1275–1283, July 2003.
[17] Kenzie Macisaac, Ting Wang, D. Benjamin Gordon, David Gifford, Gary Stormo, and Ernest Fraenkel. An improved map of conserved regulatory sites for Saccharomyces cerevisiae. BMC Bioinformatics, 7(1):113+, March 2006.
[18] M. Juanita Martinez, Sushmita Roy, Amanda B. Archuletta, Peter D. Wentzell, Sonia S. Anna-Arriola, Angelina L. Rodriguez, Anthony D. Aragon, Gabriel A. Quinones, Chris Allen, and Margaret Werner-Washburne. Genomic analysis of stationary-phase and exit in Saccharomyces cerevisiae: Gene expression and identification of novel essential genes. Mol. Biol. Cell, 15(12):5295–5305, December 2004.
[19] Pedro Mendes, Wei Sha, and Keying Ye. Artificial gene networks for objective comparison of analysis algorithms. Bioinformatics, 19:122–129, 2003.
[20] Wei Pan. A comparative review of statistical methods for discovering differentially expressed genes in replicated microarray experiments. Bioinformatics, 18(4):546–554, April 2002.
[21] Oleg Rokhlenko, Ydo Wexler, and Zohar Yakhini. Similarities and differences of gene expression in yeast stress conditions. Bioinformatics, 23(2):e184–e190, January 2007.
[22] Sushmita Roy, Terran Lane, and Margaret Werner-Washburne. Learning structurally consistent undirected probabilistic graphical models. In ICML, page 114, 2009.
[23] Heladia Salgado, Socorro Gama-Castro, Martin Peralta-Gil, Edgar Diaz-Peredo, Fabiola Sanchez-Solano, Alberto Santos-Zavaleta, Irma Martinez-Flores, Veronica Jimenez-Jacinto, Cesar Bonavides-Martinez, Juan Segura-Salazar, Agustino Martinez-Antonio, and Julio Collado-Vides. RegulonDB (version 5.0): Escherichia coli K-12 transcriptional regulatory network, operon organization, and growth conditions. Nucleic Acids Research, 34:D394, 2006.
[24] Guido Sanguinetti, Josselin Noirel, and Phillip C. Wright. MMG: a probabilistic tool to identify submodules of metabolic pathways. Bioinformatics, 24(8):1078–1084, April 2008.
[25] Dominic Schmidt, Michael D. Wilson, Christiana Spyrou, Gordon D. Brown, James Hadfield, and Duncan T. Odom. ChIP-seq: Using high-throughput sequencing to discover protein-DNA interactions. Methods, 48(3):240–248, July 2009.
[26] Eran Segal, Dana Pe'er, Aviv Regev, Daphne Koller, and Nir Friedman. Learning module networks. Journal of Machine Learning Research, 6:557–588, April 2005.
[27] T. Stein, J. Kricke, D. Becher, and T. Lisowsky. Azf1p is a nuclear-localized zinc-finger protein that is preferentially expressed under non-fermentative growth conditions in Saccharomyces cerevisiae. Current Genetics, 34(4):287–296, October 1998.
[28] Joshua M. Stuart, Eran Segal, Daphne Koller, and Stuart K. Kim. A gene-coexpression network for global discovery of conserved genetic modules. Science, 302(5643):249–255, October 2003.
[29] Amos Tanay, Roded Sharan, Martin Kupiec, and Ron Shamir. Revealing modularity and organization in the yeast molecular network by integrated analysis of highly heterogeneous genomewide data. Proceedings of the National Academy of Sciences of the United States of America, 101(9):2981–2986, March 2004.
[30] Lidia Tomás-Cobos, Laura Casadomé, Glòria Mas, Pascual Sanz, and Francesc Posas. Expression of the HXT1 low affinity glucose transporter requires the coordinated activities of the HOG and glucose signalling pathways. The Journal of Biological Chemistry, 279(21):22010–22019, May 2004.
[31] D. P. Tuck, H. M. Kluger, and Y. Kluger. Characterizing disease states from topological properties of transcriptional regulatory networks. BMC Bioinformatics, 7, 2006.
[32] O. Vincent and M. Carlson. Sip4, a Snf1 kinase-dependent transcriptional activator, binds to the carbon source-responsive element of gluconeogenic genes. The EMBO Journal, 17(23):7002–7008, December 1998.
[33] Zhong Wang, Mark Gerstein, and Michael Snyder. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet, 10(1):57–63, January 2009.
[34] Bai Zhang, Huai Li, Rebecca B. Riggins, Ming Zhan, Jianhua Xuan, Zhen Zhang, Eric P. Hoffman, Robert Clarke, and Yue Wang. Differential dependency network analysis to identify condition-specific topological changes in biological networks. Bioinformatics, pages btn660+, December 2008.
Appendix

1 Generation and analysis of simulated data

We first obtained a sub-network of $n = 68$ nodes, $G_1$, from the E. coli regulatory network [23]. We then generated two networks, $G_2$ and $G_3$, by flipping 20% and 60% of $G_1$'s edges, respectively. $\{G_1, G_2\}$ comprised the networks in HIGHSIM and $\{G_1, G_3\}$ comprised the networks in LOWSIM. For each pair of networks, we generated initial datasets using a differential equation-based gene regulatory network simulator [19]. We then split the data into two parts, learned two INDEP models for each partition, and generated data from these models. We repeated this procedure four times, producing eight sets of simulated data with different parameters but the same network topology. It was possible to generate all eight sets from the regulatory network simulator by perturbing the kinetic constants, but our current data generation procedure was faster.

We compared the structure of the networks inferred by INDEP and NIPD using a per-variable neighborhood comparison. Assume we are comparing the INDEP-inferred networks against the true networks in HIGHSIM. We compare each of the true networks, $\{G_1, G_2\}$, one at a time. Let $G_1^{INDEP}$ and $G_2^{INDEP}$ be the two networks inferred by INDEP using datasets from HIGHSIM. For each variable $X_i$, we compare $X_i$'s neighborhood in $G_1$ to its inferred neighborhoods in both $G_1^{INDEP}$ and $G_2^{INDEP}$ to obtain match scores $F_{i1}^{INDEP}$ and $F_{i2}^{INDEP}$, respectively. INDEP's match of $X_i$'s neighborhood in $G_1$ is the max of $F_{i1}^{INDEP}$ and $F_{i2}^{INDEP}$. We obtain a match score for different folds of the data. Similarly, we obtain a match score for NIPD for all variables from different folds of the data. We then obtain the number of variables on which NIPD has a significantly higher match score compared to INDEP as a function of training data size. We repeat this procedure for all eight datasets for HIGHSIM to obtain the average number of variables on which NIPD is better than INDEP.
We repeat this procedure for $G_2$ and then for NIPD.
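
The per-variable neighborhood comparison can be sketched as follows. The text does not define the match score $F$; this example assumes it is the F-score (harmonic mean of precision and recall) between true and inferred neighbor sets, and the function names and the convention for two empty neighborhoods are likewise assumptions.

```python
def neighbors(edges, node):
    """Neighbor set of `node` in an undirected edge list."""
    return {v if u == node else u for u, v in edges if node in (u, v)}

def f_score(true_nbrs, inferred_nbrs):
    """F-score between a true and an inferred neighborhood (assumed match score)."""
    true_nbrs, inferred_nbrs = set(true_nbrs), set(inferred_nbrs)
    if not true_nbrs and not inferred_nbrs:
        return 1.0  # assumed convention: two empty neighborhoods agree
    hits = len(true_nbrs & inferred_nbrs)
    if hits == 0:
        return 0.0
    precision = hits / len(inferred_nbrs)
    recall = hits / len(true_nbrs)
    return 2 * precision * recall / (precision + recall)

def match_score(true_edges, inferred_graphs, node):
    """Best match of the true neighborhood over all inferred graphs (the max step)."""
    truth = neighbors(true_edges, node)
    return max(f_score(truth, neighbors(g, node)) for g in inferred_graphs)

# Toy example: node 0 has true neighbors {1, 2}.
true_edges = [(0, 1), (0, 2)]
inferred_1 = [(0, 1)]                  # misses one neighbor
inferred_2 = [(0, 1), (0, 2), (0, 3)]  # one spurious neighbor
best = match_score(true_edges, [inferred_1, inferred_2], 0)
print(round(best, 3))
```

Taking the maximum over both inferred graphs means a method is not penalized for recovering a true neighborhood in the "wrong" condition's network.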
2 Semantic similarity-based validation

We use the definition of semantic similarity from Lord et al. [16]. Semantic similarity between two annotation terms is defined as a function of the maximal amount of information present in a common ancestor of the terms. For GO terms, the information is inversely proportional to the number of genes that are annotated with a term; that is, a very specific term with few genes has more information than a broader term that has many more genes annotated with it. The functional similarity between two genes is given by the average semantic similarity of the sets of GO process terms associated with the genes. Let $g_i$ and $g_j$ be two genes connected by an edge in our inferred network. Let $T_i$ and $T_j$ be the sets of GO process terms associated with $g_i$ and $g_j$, respectively. The average semantic similarity, $\mathrm{sim}(g_i, g_j)$, over all pairs of terms is

$$\mathrm{sim}(g_i, g_j) = \frac{1}{|T_i| * |T_j|} \sum_{t_p \in T_i,\, t_q \in T_j} \mathrm{semsim}(t_p, t_q)$$

The semantic similarity is $\mathrm{semsim}(t_p, t_q) = -\log(\min_{a \in P_{pq}} p_a)$, where $P_{pq}$ is the set of common ancestors of the terms $t_p$ and $t_q$ in the GO process "is-a" hierarchy. $-\log(p_a)$ is the amount of information associated with a term $a$, and $p_a$ is the probability of the term, defined as the ratio of the number of genes annotated with the term $a$ to the total number of genes with a GO process assignment.

We used semantic similarity for global validation of the inferred edges and also for assessing the strength of association between combinations of single gene knock-outs and a target gene. In both cases, we generated random networks with the same degree distributions as the inferred networks and estimated a background semantic similarity distribution. For assessing the strength of association between a gene $g_i$ and the set of knock-out genes that are connected to it, $K_i$, we had to compare the similarity of a gene with a set of genes. We assumed the GO process terms for the set $K_i$ to be the union of all terms associated with the genes $g_j \in K_i$.
We then computed the semantic similarity between the term set associated with gene $g_i$ and the union of terms associated with $K_i$.
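
The two definitions above translate directly into code. The sketch below uses a small made-up "is-a" hierarchy and made-up term probabilities purely for illustration; the function names are hypothetical, and treating a term as one of its own ancestors is an assumption of this example.

```python
import math
from itertools import product

def ancestors(term, parents):
    """The term itself plus everything reachable through is-a parent links."""
    seen, stack = {term}, [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def semsim(tp, tq, parents, prob):
    """-log of the smallest probability among the common ancestors of tp and tq."""
    common = ancestors(tp, parents) & ancestors(tq, parents)
    if not common:
        return 0.0  # assumed convention when the terms share no ancestor
    return -math.log(min(prob[a] for a in common))

def gene_similarity(terms_i, terms_j, parents, prob):
    """Average semsim over all pairs of terms from the two genes."""
    pairs = list(product(terms_i, terms_j))
    return sum(semsim(tp, tq, parents, prob) for tp, tq in pairs) / len(pairs)

# Made-up hierarchy: root -> {A, D}, A -> {B, C}; probabilities are illustrative.
toy_parents = {"A": ["root"], "D": ["root"], "B": ["A"], "C": ["A"]}
toy_prob = {"root": 1.0, "A": 0.5, "D": 0.5, "B": 0.25, "C": 0.25}

print(semsim("B", "C", toy_parents, toy_prob))
print(gene_similarity(["B"], ["C", "D"], toy_parents, toy_prob))
```

For the gene-versus-knock-out-set comparison described above, `terms_j` would simply be the union of the term sets of all genes in the set.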
3 Structure learning algorithm of NIPD in detail

Our score for structure learning is based on the pseudo-likelihood of the data given the model and requires us to compute the conditional probability distribution of each variable in a condition $c$. We require that the parameters of this conditional distribution be dependent such that we can pool the data from the different conditions to estimate the parameters. The conditional distribution $P(X_i \mid M_{ci})$ in condition $c$ is defined as a product:

$$P(X_i = x_{id} \mid M_{ci} = m_{cid}) \propto \prod_{E \in \mathrm{powerset}(C)\,:\,c \in E} P(X_i = x_{id} \mid M^*_{Ei} = m^*_{Ei}), \qquad (4)$$

where $d$ is the data point index and $M^*_{Ei}$ is the Markov blanket (MB) of $X_i$ exclusively in condition set $E$. The proportionality term can be eliminated using the normalization term $\frac{1}{Z_{cid}}$. In our conditional Gaussian case, $\frac{1}{Z_{1id}} = N(\mu_{1id} \mid \mu_{3id}, \sigma^2_{1i} + \sigma^2_{3i})$, where $\sigma_{3i}$ is the standard deviation from the condition set $\{1, 2\}$ and $\mu_{1id} = w_{1i}^T m^*_{1id}$ is the mean of the conditional Gaussian using the $d$th data point in condition 1. Thus, $\frac{1}{Z_{1id}}$ is the probability of $\mu_{1id}$ from a Gaussian distribution with mean estimated from the pooled data. To make the product in Eq. 4 a valid conditional distribution, we need to subtract out the normalization term. However, working with the unnormalized form gives us three benefits. First, and most important, it enables our score to be a decomposable sum on taking logarithms. Second, the normalization term behaves as a smoothing term for a condition-specific mean, $\mu_{1id}$, preferring network structures with means $\mu_{1id}$ closer to the shared mean $\mu_{3id}$. Third, avoiding the computation of $Z_{cid}$ for each data point gives us some runtime benefits.

Our structure learning algorithm begins with $k$ empty graphs and proposes edge additions for all variables, for all subsets of the condition set $C$. The while loop iteratively makes edge modifications until the score no longer improves.
The outermost for loop (Steps 4-17) iterates over variables $X_i$ to identify new candidate MB variables $X_j$ in a condition set $E$. We iterate over all candidate MB variables $X_j$ (Steps 5-15) and condition sets $E$ (Steps 6-14), and compute the score improvement for each pair $\{X_j, E\}$ (Step 13). In Steps 7-9 we check whether $X_j$ is already present in the MB of $X_i$ for any subset or superset $D$ of $E$; if so, we do not include it as a candidate. If the current condition set contains more than one condition, data from these conditions are pooled and the parameters of the new distribution $P(X_i \mid M^*_{Ei})$ are estimated using the pooled dataset (Steps 10-12). A candidate move for a variable $X_i$ is the pair $\{X_{j'}, E'\}$ with the maximal score improvement over all variables and conditions (Step 16).

After all candidate moves have been identified, we attempt the moves in order of decreasing score improvement (Step 18). Each move adds the edge $\{X_i, X_{j'}\}$ in condition set $E'$. However, if either $X_i$ or $X_{j'}$ was already updated in a previous move, we ignore the move. Because not all candidate moves are made, sorting the moves by decreasing score improvement ensures that the moves with the highest score improvements are attempted first. The algorithm converges when no edge addition improves the score of the $k$ graphs.

Algorithm 1 NIPD
 1: Input:
      Random variable set, $X = \{X_1, \cdots, X_{|X|}\}$
      Set of conditions $C$
      Datasets of RV joint assignments, $\{D_1, \cdots, D_{|C|}\}$
      Maximum neighborhood size, $k_{max}$
 2: Output: Inferred graphs $G_1, \cdots, G_{|C|}$
 3: while Score($G_1, \cdots, G_{|C|}$) does not stabilize do
 4:   for $X_i \in X$ do   /* Propose moves */
 5:     for $X_j \in (X \setminus \{X_i\})$ do
 6:       for $E \in \mathrm{powerset}(C)$ do
 7:         if $X_j \in M^*_{Di}$, s.t. either $D \subset E$ or $E \subset D$ then
 8:           Skip $X_j$.
 9:         end if
10:         if $|E| > 1$ then
11:           Estimate parameters for new conditional $P(X_i \mid M^*_{Ei} \cup \{X_j\})$ using pooled dataset $D_E$ obtained from merging all $D_e$ s.t. $e \in E$.
12:         end if
13:         Compute $\Delta Score_{\{X_i X_j\}E}$.
14:       end for
15:     end for
16:     Store $\{X_i, X_{j'}, E'\}$ as candidate move for $X_i$, where $\{X_{j'}, E'\} = \arg\max_{j,E} \Delta Score_{\{X_i X_j\}E}$
17:   end for
18:   Make candidate moves $\{X_i, X_{j'}, E'\}$ in order of decreasing score improvement   /* Attempt moves to modify graph structures */
19: end while
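The bookkeeping in Steps 7-9 and Step 18 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the dictionary `blankets` (mapping a `(variable, condition set)` pair to a set of MB members), and the `(gain, i, j, E)` tuple layout for candidate moves are all hypothetical. Two further assumptions are flagged in the comments: the subset/superset test also covers `D == E`, and moves with non-positive score improvement are skipped. Scoring and parameter estimation are left out entirely.

```python
from itertools import combinations


def condition_sets(conditions):
    """All non-empty subsets of the condition set, as frozensets."""
    items = sorted(conditions)
    return [frozenset(combo)
            for size in range(1, len(items) + 1)
            for combo in combinations(items, size)]


def already_in_related_set(blankets, i, j, E):
    """Steps 7-9: is X_j already in the MB of X_i for a subset or superset D of E?

    Assumption: the comparison is non-strict, so D == E also counts.
    """
    return any(var == i and j in members and (D <= E or E <= D)
               for (var, D), members in blankets.items())


def apply_moves(candidates, blankets):
    """Step 18: attempt candidate moves in order of decreasing score improvement.

    A move is ignored if either endpoint was already updated by an earlier
    move in this pass. Assumption: moves that do not improve the score
    (gain <= 0) are skipped as well.
    """
    updated = set()
    applied = []
    for gain, i, j, E in sorted(candidates, key=lambda move: move[0], reverse=True):
        if gain <= 0 or i in updated or j in updated:
            continue
        # Undirected edge: each endpoint joins the other's MB in condition set E.
        blankets.setdefault((i, E), set()).add(j)
        blankets.setdefault((j, E), set()).add(i)
        updated.update((i, j))
        applied.append((i, j, E))
    return applied
```

In a full search loop, `already_in_related_set` would filter the pairs generated from `condition_sets(C)` before scoring, and `apply_moves` would run once per iteration of the while loop on the best candidate stored for each variable.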
[Figure 1 plot omitted: two panels, HIGHSIM and LOWSIM, each showing Shared Edge F-score (y-axis) against Size of training data (x-axis) for NIPD and INDEP.]

Figure 1: Shared edges in the HIGHSIM and LOWSIM networks

METHOD   POPULATION      EDGE-CNT   SHARED EDGE-CNT
NIPD     QUIESCENT       378        271
         NON-QUIESCENT   402
INDEP    QUIESCENT       171        25
         NON-QUIESCENT   200

Table 1: Structure of the inferred networks using INDEP and NIPD.