This document discusses Bayesian networks and their application in bioinformatics. It begins with an introduction to Bayesian networks, including how they can represent joint probability distributions and be used for classification. It then discusses learning Bayesian network structures from data and performing probabilistic inference. The document applies these concepts to analyzing microarray gene expression and drug activity data from cancer cell lines. It describes preprocessing the NCI60 dataset and learning a Bayesian network to model dependencies between genes, drugs and cancer types for purposes of target discovery.
2. Copyright (c) 2002 by SNU CSE Biointelligence Lab 2
Contents
Bayesian networks – preliminaries
Bayesian networks vs. causal networks
PDAG (partially directed acyclic graph) representation of the Bayesian network
Structural learning of the Bayesian network
Classification using Bayesian networks
Microarray data analysis with Bayesian networks
Experimental results on the NCI60 data set
Term Project #3
Diagnosis using Bayesian networks
Bayesian Networks
The joint probability distribution over all the variables in
the Bayesian network.
P(X_1, X_2, ..., X_n) = ∏_{i=1}^{n} P(X_i | Pa_i)
Example (network with edges A → C, B → C, B → D, C → E), applying the chain rule and then the conditional independencies encoded by the graph:

P(A, B, C, D, E) = P(A) P(B|A) P(C|A, B) P(D|A, B, C) P(E|A, B, C, D)
                 = P(A) P(B) P(C|A, B) P(D|B) P(E|C)
Local probability distribution for X_i:
Pa_i : the set of parents of X_i among (X_1, ..., X_n)
θ_ijk : parameter for P(X_i | Pa_i)
q_i : # of configurations for Pa_i
r_i : # of states for X_i
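As a concrete toy sketch of this factorization, the joint probability of the A–E example above can be computed as a product of the local terms. All CPT values below are invented purely for illustration:

```python
# Joint probability as a product of local distributions for the
# example network A -> C <- B, B -> D, C -> E.
# All CPT values below are invented purely for illustration.

def joint(a, b, c, d, e, cpts):
    """P(A,B,C,D,E) = P(A) P(B) P(C|A,B) P(D|B) P(E|C)."""
    return (cpts["A"][a]
            * cpts["B"][b]
            * cpts["C"][(a, b)][c]
            * cpts["D"][b][d]
            * cpts["E"][c][e])

cpts = {
    "A": {0: 0.6, 1: 0.4},
    "B": {0: 0.7, 1: 0.3},
    "C": {(0, 0): {0: 0.9, 1: 0.1}, (0, 1): {0: 0.5, 1: 0.5},
          (1, 0): {0: 0.4, 1: 0.6}, (1, 1): {0: 0.2, 1: 0.8}},
    "D": {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}},
    "E": {0: {0: 0.75, 1: 0.25}, 1: {0: 0.1, 1: 0.9}},
}

# A valid joint distribution: the 2^5 assignments sum to 1.
total = sum(joint(a, b, c, d, e, cpts)
            for a in (0, 1) for b in (0, 1) for c in (0, 1)
            for d in (0, 1) for e in (0, 1))
```

Because each local table sums to 1 for every parent configuration, the product automatically defines a proper joint distribution.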
Knowing
the Joint Probability Distribution
We can calculate any conditional probability from the
joint probability distribution in principle.
Gene B
Class
Gene F Gene G
Gene A
Gene C Gene D
Gene E Gene H
This Bayesian network can classify
the examples by calculating the
appropriate conditional
probabilities.
P(Class| other variables)
Classification by Bayesian Networks I
Calculate the conditional probability of ‘Class’ variable
given the value of the other variables.
Infer the conditional probability from the joint probability
distribution.
For example,

P(Class | Gene A, Gene B, ..., Gene H)
  = P(Gene A, Gene B, ..., Gene H, Class) / Σ_Class P(Gene A, Gene B, ..., Gene H, Class)

where the summation is taken over all the possible class values.
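A minimal sketch of this normalization step. The toy joint table below is invented for illustration; in practice the joint would come from the Bayesian network factorization:

```python
# P(Class | evidence) obtained by normalizing the joint over class values.

def posterior(joint_p, evidence, class_values):
    """P(Class=c | e) = joint(c, e) / sum over c' of joint(c', e)."""
    scores = {c: joint_p(c, evidence) for c in class_values}
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}

# Invented toy joint P(Class, Gene) over a single binary gene attribute.
table = {("ALL", 0): 0.30, ("ALL", 1): 0.20,
         ("AML", 0): 0.10, ("AML", 1): 0.40}

post = posterior(lambda c, g: table[(c, g)], 1, ["ALL", "AML"])
```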
Knowing the Causal Structure
Gene B
Class
Gene F Gene G
Gene A
Gene C Gene D
Gene E Gene H
Gene C regulates Gene E and F.
Gene D regulates Gene G and H.
Class has an effect on Gene F and G.
Bayesian Networks vs. Causal Networks
What the network structure encodes:
Bayesian networks: conditional independencies, read off via the d-separation property of the structure.
Causal networks: causal relationships.
The network structure asserts that every node is conditionally independent of all of its non-descendants given the values of its immediate parents.
Two Equivalent DAGs
X → Y     X ← Y
These two DAGs assert the same conditional independencies: X and Y are dependent on each other. DAGs asserting the same conditional independencies form an equivalence class.
Causal relationships are hard to learn from observational data.
Verma and Pearl’s Theorem
Theorem:
Two DAGs are equivalent if and only if they have the same
skeleton and the same v-structures.
X → Z ← Y
v-structure (X, Z, Y): X and Y are parents of Z and are not adjacent to each other.
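The theorem translates directly into a structural test. A sketch, with DAGs encoded as child → parent-set dictionaries:

```python
def skeleton(dag):
    """Undirected edges of the DAG (given as a child -> parent-set dict)."""
    return {frozenset((p, c)) for c, ps in dag.items() for p in ps}

def v_structures(dag):
    """Triples (X, Z, Y): X and Y are parents of Z and not adjacent."""
    skel = skeleton(dag)
    vs = set()
    for z, parents in dag.items():
        ps = sorted(parents)
        for i in range(len(ps)):
            for j in range(i + 1, len(ps)):
                if frozenset((ps[i], ps[j])) not in skel:
                    vs.add((ps[i], z, ps[j]))
    return vs

def equivalent(d1, d2):
    """Verma-Pearl: same skeleton and same v-structures."""
    return skeleton(d1) == skeleton(d2) and v_structures(d1) == v_structures(d2)

# X -> Y vs. Y -> X: equivalent (same skeleton, no v-structure).
pair = {"X": set(), "Y": {"X"}}
flipped = {"Y": set(), "X": {"Y"}}
# X -> Z <- Y vs. the chain X -> Z -> Y: same skeleton, but only the
# collider has the v-structure (X, Z, Y), so they are not equivalent.
collider = {"X": set(), "Y": set(), "Z": {"X", "Y"}}
chain = {"X": set(), "Z": {"X"}, "Y": {"Z"}}
```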
PDAG Representations
Minimal PDAG representations of the equivalence class
The only directed edges are those that participate in v-structures.
Completed PDAG representation
Every directed edge corresponds to a compelled edge, and every
undirected edge corresponds to a reversible edge.
Example: PDAG Representations
[Figure: an equivalence class of two DAGs over nodes X, Y, Z, W, V, shown together with its minimal PDAG and its completed PDAG.]
Learning Bayesian Networks
Metric approach
Use a scoring metric to measure how well a particular structure fits an observed set of cases; a search algorithm finds a high-scoring structure → a canonical form of an equivalence class.
Independence approach
An independence oracle (approximated by some statistical test) is queried to identify the equivalence class that captures the independencies in the distribution from which the data was generated → search for a PDAG.
Scoring Metrics for Bayesian Networks
Likelihood: L(G, Θ_G, C) = P(C | G^h, Θ_G)
G^h: the hypothesis that the data (C) was generated by a distribution that can be factored according to G.
The maximum likelihood metric of G:

M_ML(G, C) = max_{Θ_G} L(G, Θ_G, C)

This metric prefers the complete graph structure, since adding parameters can only raise the maximized likelihood.
Information Criterion Scoring Metrics
The Akaike information criterion (AIC) metric:

M_AIC(G, C) = log M_ML(G, C) − Dim(G)

The Bayesian information criterion (BIC) metric:

M_BIC(G, C) = log M_ML(G, C) − (1/2) Dim(G) log N

where Dim(G) is the number of free parameters of G and N is the number of cases.
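A sketch of the BIC computation for a single node family under the multinomial local model, where counts[j][k] plays the role of N_ijk and the maximized log-likelihood term is Σ N_ijk log(N_ijk / N_ij):

```python
import math

def bic_node(counts):
    """BIC contribution of one node: log ML minus (1/2) Dim log N.
    counts[j][k] = N_ijk for parent configuration j and node state k."""
    loglik = 0.0
    n_total = 0
    for j_counts in counts:
        n_ij = sum(j_counts)
        n_total += n_ij
        for n_ijk in j_counts:
            if n_ijk > 0:
                loglik += n_ijk * math.log(n_ijk / n_ij)
    q = len(counts)        # parent configurations
    r = len(counts[0])     # states of the node
    dim = q * (r - 1)      # free parameters of this family
    return loglik - 0.5 * dim * math.log(n_total)
```

For the full network, a decomposable score like BIC is simply the sum of such per-family terms.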
MDL Scoring Metrics
The minimum description length (MDL) metric 1:

M_MDL1(G, C) = log P(G) + M_BIC(G, C)

The minimum description length (MDL) metric 2:

M_MDL2(G, C) = log M_ML(G, C) − |E| log N − c · Dim(G)

where |E| is the number of edges in G and c is a constant.
Bayesian Scoring Metrics
A Bayesian metric:

M(G, C) = log P(G^h) + log P(C | G^h) + c

The BDe (Bayesian Dirichlet & likelihood equivalence) metric:

P(C | G^h) = ∏_{i=1}^{n} ∏_{j=1}^{q_i} [ Γ(N'_ij) / Γ(N'_ij + N_ij) ] ∏_{k=1}^{r_i} [ Γ(N'_ijk + N_ijk) / Γ(N'_ijk) ]

where N_ijk is the number of cases with X_i in state k and Pa_i in configuration j, N'_ijk are the corresponding Dirichlet hyperparameters, N_ij = Σ_k N_ijk, and N'_ij = Σ_k N'_ijk.
Greedy Search Algorithm
for Bayesian Network Learning
Generate the initial Bayesian network structure G0.
For m = 1, 2, 3, …, until convergence.
Among all the possible local changes (insertion of an edge, reversal of an edge, and deletion of an edge) in Gm–1, the one that leads to the largest improvement in the score is performed. The resulting graph is Gm.
Stopping criterion
Score(Gm–1) == Score(Gm).
At each iteration (when learning a Bayesian network of n variables), O(n^2) local changes must be evaluated to select the best one.
Random restarts are usually adopted to escape local maxima.
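A compact sketch of this greedy procedure over a generic structure score. The toy score in the example simply rewards matching a target edge set; a real score would be BIC, BDe, etc.:

```python
from itertools import permutations

def is_acyclic(nodes, edges):
    """DFS cycle check; edges is a set of (parent, child) pairs."""
    parents = {v: {u for (u, w) in edges if w == v} for v in nodes}
    seen, done = set(), set()
    def visit(v):
        if v in done:
            return True
        if v in seen:          # back edge: cycle found
            return False
        seen.add(v)
        ok = all(visit(p) for p in parents[v])
        done.add(v)
        return ok
    return all(visit(v) for v in nodes)

def neighbors(nodes, edges):
    """All single-edge insertions, deletions, and reversals."""
    for u, v in permutations(nodes, 2):
        if (u, v) in edges:
            yield edges - {(u, v)}                  # deletion
            yield (edges - {(u, v)}) | {(v, u)}     # reversal
        elif (v, u) not in edges:
            yield edges | {(u, v)}                  # insertion

def greedy_search(nodes, score, edges=frozenset()):
    """Apply the best local change until the score stops improving."""
    while True:
        best, best_score = edges, score(edges)
        for cand in neighbors(nodes, edges):
            if is_acyclic(nodes, cand) and score(cand) > best_score:
                best, best_score = frozenset(cand), score(cand)
        if best == edges:      # Score(G_{m-1}) == Score(G_m): stop
            return edges
        edges = best

# Example: a toy score rewarding agreement with a target edge set.
target = {("A", "B"), ("B", "C")}
learned = greedy_search(["A", "B", "C"], lambda e: -len(set(e) ^ target))
```

For random restarts one would rerun `greedy_search` from several random initial structures and keep the best result.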
Probabilistic Inference
Calculate the conditional probability given the values of
the observed variables.
Junction tree algorithm
Sampling method
General probabilistic inference is intractable.
However, calculation of the conditional probability for the
classification is rather straightforward because of the property of
the Bayesian network structure.
The Markov Blanket
All the variables of interest
X = {X1, X2, …, Xn}
For a variable Xi, its Markov blanket MB(Xi) is a subset of X − {Xi} which satisfies:

P(Xi | X − {Xi}) = P(Xi | MB(Xi))

Markov boundary
The minimal Markov blanket.
Markov Blanket in Bayesian Networks
Given the Bayesian network structure, the determination
of the Markov blanket of a variable is straightforward.
By the conditional independence assertions.
Gene B
Class
Gene F Gene G
Gene A
Gene C Gene D
Gene E Gene H
The Markov blanket of a node in the Bayesian network consists of all of its parents, its children, and its spouses (the other parents of its children).
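Reading the blanket off the structure is a few set operations. A sketch, using the figure's network (edges inferred from the slides: A, B → Class; Class, C → F; Class, D → G; C → E; D → H):

```python
def markov_blanket(dag, x):
    """Parents, children, and spouses of x in a child -> parents dict."""
    parents = dag.get(x, set())
    children = {c for c, ps in dag.items() if x in ps}
    spouses = {p for c in children for p in dag[c]} - {x}
    return parents | children | spouses

# The network from the figure: Class has parents Gene A and Gene B and
# children Gene F and Gene G; Gene C and Gene D are the children's
# other parents; Gene E and Gene H lie outside the blanket.
dag = {
    "Class": {"Gene A", "Gene B"},
    "Gene E": {"Gene C"},
    "Gene F": {"Gene C", "Class"},
    "Gene G": {"Class", "Gene D"},
    "Gene H": {"Gene D"},
}
mb = markov_blanket(dag, "Class")
```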
Classification by Bayesian Networks II
Writing A for Gene A, ..., H for Gene H:

P(Class | A, B, C, D, E, F, G, H)
  = P(A, B, C, D, E, F, G, H, Class) / Σ_Class P(A, B, C, D, E, F, G, H, Class)
  = [P(A) P(B) P(C) P(D) P(Class|A, B) P(E|C) P(F|C, Class) P(G|Class, D) P(H|D)]
    / (Σ_Class P(A) P(B) P(C) P(D) P(Class|A, B) P(E|C) P(F|C, Class) P(G|Class, D) P(H|D))
  = P(Class|A, B) P(F|C, Class) P(G|Class, D)
    / (Σ_Class P(Class|A, B) P(F|C, Class) P(G|Class, D))

The factors that do not contain Class, namely P(A), P(B), P(C), P(D), P(E|C), and P(H|D), are constant with respect to the summation and cancel. Only the local distributions involving Class and its children remain: the computation uses exactly the Markov blanket of Class.
DNA Microarrays
Monitor thousands of gene expression levels simultaneously, in contrast to traditional one-gene-at-a-time experiments.
Fabricated by high-speed robotics.
[Figure: microarray with known probes.]
A Comparative
Hybridization Experiment
[Figure: a comparative hybridization experiment, followed by image analysis.]
Mining on
Gene Expression and Drug Activity Data
Relationships among human cancer, gene expression, and drug
activity
Revealing these relationships can expose the causes and mechanisms of cancer development and suggest new molecular targets for anti-cancer drugs.
[Figure: triangle relating human cancer, gene expression, and drug activity.]
NCI (National Cancer Institute)
Drug Discovery Program
NCI60 cell lines data set
NCI60 Cell Lines Data Set
From 60 human cancer cell lines
Colorectal, renal, ovarian, breast, prostate, lung, and central
nervous system origin cancers, as well as leukemias and
melanomas
Gene expression patterns
cDNA microarray
Drug activity patterns
Sulphorhodamine B assay: measures changes in total cellular protein after 48 hours of drug treatment
Schematic View
of the Modeling Approach
[Figure: schematic of the modeling approach]
Gene expression data + drug activity data
→ Preprocessing (thresholding, clustering, discretization)
→ Selected gene, drug, and cancer-type nodes (e.g. Gene A, Gene B, Drug A, Drug B, Cancer)
→ Learned Bayesian network
→ Dependency analysis and probabilistic inference
Data Preparation
cDNA microarray data
Gene expression profiles on 60 cell lines: a 1376 × 60 matrix
Drug activity data
Drug activity patterns on the 60 cell lines: a 118 × 60 matrix
Combined: a (1376 + 118) × 60 data matrix (1376 genes and 118 drugs over the 60 samples)
Preprocessing
Thresholding
Elimination of unknown ESTs → 805 genes
Elimination of drugs which have more than 4 missing values → 84 drugs
(1376 genes + 118 drugs) × 60 samples → (805 genes + 84 drugs) × 60 samples
Discretization
Local probability model for Bayesian networks: multinomial distribution
Each value is mapped to {-1, 0, 1} using thresholds -c and +c.
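The thresholding rule can be sketched in a few lines; the exact handling of values equal to the boundary is an assumption here:

```python
def discretize(value, c):
    """Map an expression/activity value to {-1, 0, 1} via thresholds +/-c.
    Boundary handling (<= vs. <) is an assumption of this sketch."""
    if value <= -c:
        return -1
    if value >= c:
        return 1
    return 0

# Invented example row of log-ratio values, discretized with c = 0.5.
row = [-0.9, -0.2, 0.1, 0.55, 0.7]
codes = [discretize(v, 0.5) for v in row]
```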
Bayesian Network Learning
for Gene-Drug Analysis
Large-scale Bayesian network
Several hundred nodes (up to 890)
General greedy search is inapplicable because of time and space complexity.
Search heuristics
Local-to-global search heuristics
Exploit the locality of Bayesian networks to reduce the entire search space.
The local structure: the Markov blanket
Find the candidate Markov blanket (of pre-determined size k) of each node → this reduces the global search space.
Local to Global Search Heuristics
Input:
- A data set D.
- An initial Bayesian network structure B0.
- A decomposable scoring metric,
Output: A Bayesian network structure B.
Loop for n = 1, 2, …, until convergence.
- Local Search Step:
* Based on D and Bn–1, select for each Xi a set CBi^n (|CBi^n| ≤ k) of candidate Markov blanket members of Xi.
* For each set {Xi, CBi^n}, learn the local structure and determine the Markov blanket of Xi, BLn(Xi), from this local structure.
* Merge all Markov blanket structures G({Xi, BLn(Xi)}, Ei) into a global network structure Hn (could be cyclic).
- Global Search Step:
* Find the Bayesian network structure Bn ⊆ Hn which maximizes Score(Bn, D) and retains all non-cyclic edges in Hn.

The scoring metric is decomposable:

Score(B, D) = Σ_i Score(X_i | Pa_B(X_i), D)
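Decomposability, which the local search step relies on, means the global score is a sum of per-family terms, so a local change to one node's parent set only requires rescoring that family. A sketch, with an invented stand-in for the family score:

```python
def total_score(parents, family_score, data):
    """Score(B, D) = sum over i of Score(X_i | Pa_B(X_i), D)."""
    return sum(family_score(x, pa, data) for x, pa in parents.items())

# Invented stand-in for a real family score (BIC, BDe, ...):
# penalize large parent sets.
toy_family_score = lambda x, pa, data: -len(pa)

structure = {"A": set(), "B": {"A"}, "C": {"A", "B"}}
s = total_score(structure, toy_family_score, None)
```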
Dimensionality Problem
The number of attributes (nodes) >> the sample size. Consequences:
Unreliable structure of the learned Bayesian networks
Probabilistic inference is nearly impossible.
Remedy: downsize the number of attributes by clustering, in the preprocessing step.
Prototype: the mean of all members in a cluster.
Bayesian Network with 45 Prototypes
Node types (46 nodes in all)
40 gene prototypes
5 drug prototypes
Cancer label
Discretization boundary: -c, +c
Bayesian network learning
Varying candidate Markov blanket size (k = 5 ~ 15); select the best one
Three data sets (c = 0.43, 0.50, 0.60) → three Bayesian networks
Probabilistic inference

Distribution ratio per discretization threshold c:
c      -1      0       1
0.43   33.3%   33.3%   33.3%
0.50   30.8%   38.3%   30.8%
0.60   27.4%   45.1%   27.4%
Correlations between
ASNS and L-Asparaginase
Part of the Bayesian network (c = 0.60):
[Figure: the prototype containing ASNS and SID W 484773, PYRROLINE-5-CARBOXYLATE REDUCTASE [5':AA037688, 3':AA037689], connected to the prototype for L-Asparaginase.]

Conditional probability table:
P(D2|G4)   D2 = -1    D2 = 0     D2 = 1
G4 = -1    0.32096    0.27086    0.40818
G4 = 0     0.31387    0.41247    0.27366
G4 = 1     0.32167    0.34920    0.32913
Bayesian Networks
on Subset of Genes and Drugs
Node types (17 nodes in all)
12 genes
4 drugs
Cancer label
Discretization boundary: -c, +c
Clustering of genes and drugs together; the selected nodes come from neighboring clusters
Bayesian network learning
General greedy search with restarts (100 times); select the best one
Three data sets (c = 0.43, 0.50, 0.60) → three Bayesian networks
Probabilistic inference

Distribution ratio per discretization threshold c:
c      -1      0       1
0.43   33.3%   33.3%   33.3%
0.50   30.8%   38.3%   30.8%
0.60   27.4%   45.1%   27.4%
Around the L-Asparaginase
< Part of the Bayesian network (c = 0.6) >
Outline
Task 1: Structural learning of the Bayesian network
Data generation from the ALARM network
Structural learning of Bayesian networks using more than two
kinds of algorithms and scores
Compare the learned results w.r.t. the edge errors according to
the various sample sizes and the learning algorithms
Task 2: Classification using Bayesian networks
Arbitrarily divide the Leukemia data set between the training set
and the test set
Learn the Bayesian network from the training data set using one
of the metric-based approaches
Evaluate the performance of the Bayesian network as a
classifier (classification accuracy)
Data Generation
Using the Netica Software (http://www.norsys.com)
The ALARM network
# of nodes: 37
# of edges: 46
Structural Learning
Independence method
BN Power constructor
(http://www.cs.ualberta.ca/~jcheng/bnsoft.htm)
Metric-based method
LearnBayes (http://www.cs.huji.ac.il/labs/compbio/LibB/)
MDL, BIC, BD, and likelihood scores can be used.
The Leukemia Data Set
Class type
ALL (acute lymphoblastic leukemia) or AML (acute myeloid
leukemia)
Data set
# of attributes: 50 gene expression levels (0 or 1)
# of samples: 72