Network analysis for computational biology

Nathalie Villa-Vialaneix - nathalie.villa@univ-paris1.fr
http://www.nathalievilla.org
IUT STID (Carcassonne) & SAMM (Université Paris 1)
INSERM Seminar - 11 October 2011
1 October 2013 (INSERM Seminar)

Outline
1 An introduction to networks/graphs
2 Analysis of the effect of low-calorie diet on obese women

An introduction to networks/graphs

What is a network/graph? (French: réseau/graphe)
Mathematical object used to model relational data between entities.

The entities are called the nodes or the vertices (French: nœuds/sommets).

A relation between two entities is modeled by an edge (French: arête).

Examples
Bibliographic network
nodes: genes, TFs, enzymes...
edges: a relationship between two nodes that has already been reported in
the literature

Examples
Network inferred from expression data
nodes: genes
edges: a strong direct co-expression between two nodes observed
in the gene expression data

Standard issues associated with networks
Inference
Given expression data, how to build a graph whose edges represent the
direct links between genes?
Example: co-expression networks built from microarray/RNAseq data (nodes = genes;
edges = significant direct links between expressions of two genes)

Graph mining (examples)
1 Network visualization: nodes are not a priori associated with a given
position. How to represent the network in a meaningful way?
[Figure: the same network drawn with random positions and with positions
chosen so that connected nodes are closer together]

2 Network clustering: identify communities (groups of nodes that are
densely connected and share comparatively few links with the other groups)

Network inference
Data: large scale gene expression data
\[ X = \left( X_i^j \right)_{i = 1, \ldots, n;\ j = 1, \ldots, p} \]
• rows: individuals, $n \simeq 30/50$
• columns: variables (gene expressions), $p \simeq 10^3/10^4$
What we want to obtain: a network with
What we want to obtain: a network with
• nodes: genes;
• edges: significant and direct co-expression between two genes (to track
transcriptional regulations)

Advantages of inferring a network from large scale
transcription data
1 over raw data: focuses on the strongest direct relationships:
irrelevant or indirect relations are removed (more robust) and the data
are easier to visualize and understand.
Expression data are analyzed all together and not by pairs.

2 over bibliographic network: can handle interactions with yet
unknown (not annotated) genes and deal with data collected in a
particular condition.

Using correlations: relevance network
[Butte and Kohane, 1999,
Butte and Kohane, 2000]
First (naive) approach: calculate correlations between expressions for all
pairs of genes, discard the smallest ones by thresholding and build the
network from the remaining pairs.
Correlations → Thresholding → Graph
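The three steps above can be sketched in base R as follows. This is only an illustration on simulated data: the matrix `X`, the gene names and the threshold value `0.8` are assumptions made for the example, not values from the talk.

```r
# Relevance-network sketch: correlations -> thresholding -> graph.
# All data below are simulated; the threshold is an arbitrary illustrative choice.
set.seed(1)
n <- 40; p <- 6
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("gene", 1:p)))
X[, 2] <- X[, 1] + rnorm(n, sd = 0.1)   # make gene2 strongly correlated with gene1

C <- cor(X)                              # step 1: all pairwise correlations
threshold <- 0.8                         # step 2: keep only |correlation| >= threshold
adj <- (abs(C) >= threshold) * 1         # step 3: adjacency matrix of the graph
diag(adj) <- 0                           # no self-loops

# Edge list (each undirected edge reported once)
edge_list <- which(adj == 1 & upper.tri(adj), arr.ind = TRUE)
print(edge_list)
```

The adjacency matrix `adj` is symmetric, so each edge is read from its upper triangle only.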

But correlation is not causality...

Strong indirect correlation: x drives both y and z (x → y, x → z), with no
direct link between y and z.
set.seed(2807); x <- runif(100)
y <- 2*x+1+rnorm(100,0,0.1); cor(x,y)   # [1] 0.9988261
z <- 2*x+1+rnorm(100,0,0.1); cor(x,z)   # [1] 0.998751
cor(y,z)                                # [1] 0.9971105

Partial correlation
cor(lm(y~x)$residuals, lm(z~x)$residuals)   # [1] -0.1933699

Networks are built using partial correlations, i.e., correlations between
gene expressions given the expression of all the other genes
(residual correlations).
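With more than three variables, all partial correlations can be obtained at once from the inverse of the covariance matrix. The sketch below reuses the simulated x, y, z from the slides and checks the result against the residual-based computation. It assumes n > p so that the covariance matrix is invertible, which does not hold for typical expression data, where sparse or regularized estimators are needed instead.

```r
# Partial correlations from the inverse covariance (precision) matrix.
# Assumes more observations than variables so that cov(X) can be inverted.
set.seed(2807)
x <- runif(100)
y <- 2 * x + 1 + rnorm(100, 0, 0.1)
z <- 2 * x + 1 + rnorm(100, 0, 0.1)
X <- cbind(x, y, z)

Omega <- solve(cov(X))                                    # precision matrix
pcor <- -Omega / sqrt(outer(diag(Omega), diag(Omega)))    # -Omega_ij / sqrt(Omega_ii * Omega_jj)
diag(pcor) <- 1

# Same quantity computed as a correlation between regression residuals
pcor_yz_residuals <- cor(lm(y ~ x)$residuals, lm(z ~ x)$residuals)

print(pcor[2, 3])            # partial correlation between y and z given x
print(pcor_yz_residuals)     # should agree with the line above
```

Entry `pcor[i, j]` is the correlation between variables i and j given all the other variables, which is the quantity used to decide whether an edge is drawn.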

Visualization tools help understand the graph macro-structure
Purpose: How to display the nodes in a meaningful and aesthetic way?
Standard approach: force-directed placement (FDP) algorithms
(French: algorithmes de forces), e.g., [Fruchterman and Reingold, 1991]
• attractive forces: similar to springs along the edges
• repulsive forces: similar to electric forces between all pairs of vertices
• iterative algorithm, run until the vertex positions stabilize.
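The interplay of the two forces can be sketched in a few lines of base R. This is a simplified illustration in the spirit of force-directed placement, not the exact published algorithm: the function name `fdp_layout`, the constants (`k`, the initial temperature, the cooling rate) and the toy graph are all assumptions made for the example.

```r
# Minimal force-directed layout sketch (base R only).
# Constants and the toy graph are illustrative assumptions.
fdp_layout <- function(adj, n_iter = 200, seed = 1) {
  set.seed(seed)
  n <- nrow(adj)
  pos <- matrix(runif(2 * n), ncol = 2)   # random initial positions in the unit square
  k <- sqrt(1 / n)                        # target distance between neighbours
  temp <- 0.1                             # maximum move allowed per iteration
  for (it in seq_len(n_iter)) {
    disp <- matrix(0, n, 2)
    for (i in seq_len(n)) {
      delta <- matrix(pos[i, ], n, 2, byrow = TRUE) - pos   # vectors pointing to vertex i
      dist <- pmax(sqrt(rowSums(delta^2)), 1e-9)
      repulse <- delta * (k^2 / dist^2)          # pushes i away from every other vertex
      attract <- -delta * (dist / k) * adj[i, ]  # pulls i toward its neighbours only
      disp[i, ] <- colSums(repulse) + colSums(attract)
    }
    len <- pmax(sqrt(rowSums(disp^2)), 1e-9)
    pos <- pos + disp / len * pmin(len, temp)    # move each vertex, capped by the temperature
    temp <- temp * 0.98                          # cool down so that positions stabilize
  }
  pos
}

# Toy graph: two triangles joined by one edge
edges <- rbind(c(1, 2), c(1, 3), c(2, 3), c(4, 5), c(4, 6), c(5, 6), c(3, 4))
adj <- matrix(0, 6, 6)
adj[edges] <- 1
adj[edges[, 2:1]] <- 1
pos <- fdp_layout(adj)

plot(pos, pch = 19, xlab = "", ylab = "")
segments(pos[edges[, 1], 1], pos[edges[, 1], 2],
         pos[edges[, 2], 1], pos[edges[, 2], 2])
```

The decreasing temperature is what makes the iteration settle: early moves are large, later ones only fine-tune the positions.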

Visualization software
• package igraph [Csardi and Nepusz, 2006], http://igraph.sourceforge.net/
(static representation with useful tools for graph mining)
• free software Gephi, http://gephi.org (interactive software, supports
zooming and panning)

Extracting important nodes
1 vertex degree (French: degré): number of edges incident to a given vertex.
Vertices with a high degree are called hubs; the degree is a measure of the
vertex popularity.

2 vertex betweenness (French: centralité): number of shortest paths between all
pairs of vertices that pass through the vertex. Betweenness is a
centrality measure (vertices with a large betweenness are the most likely to
disconnect the network if removed).
In the slide's figure, the orange node has degree 2 and betweenness 4.
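Both measures can be computed from an adjacency matrix in base R. The sketch below uses a breadth-first search that counts shortest paths (the approach usually attributed to Brandes) on an unweighted, undirected graph. The function name `betweenness_unweighted` and the 5-node path graph are illustrative assumptions, and the path is not the graph shown on the slide.

```r
# Degree and betweenness for an unweighted, undirected graph (base R only).
betweenness_unweighted <- function(adj) {
  n <- nrow(adj)
  bet <- numeric(n)
  for (s in seq_len(n)) {
    dist <- rep(-1, n); dist[s] <- 0
    sigma <- numeric(n); sigma[s] <- 1     # number of shortest paths from s
    preds <- vector("list", n)             # predecessors on shortest paths
    visit_order <- integer(0)
    queue <- s
    while (length(queue) > 0) {            # breadth-first search from s
      v <- queue[1]; queue <- queue[-1]
      visit_order <- c(visit_order, v)
      for (w in which(adj[v, ] != 0)) {
        if (dist[w] < 0) { dist[w] <- dist[v] + 1; queue <- c(queue, w) }
        if (dist[w] == dist[v] + 1) {
          sigma[w] <- sigma[w] + sigma[v]
          preds[[w]] <- c(preds[[w]], v)
        }
      }
    }
    delta <- numeric(n)                    # dependency of s on each vertex
    for (w in rev(visit_order)) {
      for (v in preds[[w]]) {
        delta[v] <- delta[v] + sigma[v] / sigma[w] * (1 + delta[w])
      }
      if (w != s) bet[w] <- bet[w] + delta[w]
    }
  }
  bet / 2   # each unordered pair of end points was counted twice
}

# Illustrative graph: the path 1 - 2 - 3 - 4 - 5
n <- 5
adj <- matrix(0, n, n)
for (i in 1:(n - 1)) { adj[i, i + 1] <- 1; adj[i + 1, i] <- 1 }

degree <- rowSums(adj)
bet <- betweenness_unweighted(adj)
print(degree)   # 1 2 2 2 1
print(bet)      # 0 3 4 3 0
```

On this path the middle vertex has degree 2 and betweenness 4: the pairs (1,4), (1,5), (2,4) and (2,5) all have their unique shortest path through it.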

Vertex clustering (French: classification)
Cluster vertices into groups that are densely connected and share
comparatively few links with the other groups. Clusters are often called
communities (French: communautés) in social sciences or modules in
biology.
Example 1: Natty's facebook network

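Find clusters by modularity optimization (French: modularité)
The modularity [Newman and Girvan, 2004] of the partition
$(C_1, \ldots, C_K)$ is equal to:
\[ Q(C_1, \ldots, C_K) = \frac{1}{2m} \sum_{k=1}^{K} \sum_{x_i, x_j \in C_k} (W_{ij} - P_{ij}) \]
with $P_{ij}$: weight of a null model (graph with the same degree distribution
but no preferential attachment):
\[ P_{ij} = \frac{d_i d_j}{2m} \]
with $d_i = \sum_{j \neq i} W_{ij}$ the degree of $x_i$ and
$m = \frac{1}{2} \sum_i d_i$ the total edge weight.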

A good clustering should maximize the modularity:
• $Q$ increases when $(x_i, x_j)$ are in the same cluster and $W_{ij} \gg P_{ij}$
• $Q$ decreases when $(x_i, x_j)$ are in two different clusters and $W_{ij} \gg P_{ij}$

Modularity optimization helps separate hubs: an edge weighs more in the
criterion when it is attached to a low-degree vertex, because its expected
weight under the null model is then small.
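As a small worked example, the modularity of a given partition can be computed directly from a weight matrix in base R. The sketch takes $d_i$ as the (weighted) degree of vertex $i$ and $2m$ as the sum of all degrees, which is the usual Newman-Girvan convention and is stated here as an assumption. The function name `modularity_q` and the toy graph (two triangles joined by one edge) are illustrative.

```r
# Modularity of a partition, computed from a symmetric weight matrix W.
# Convention assumed here: d_i = sum_j W_ij and 2m = sum_i d_i.
modularity_q <- function(W, membership) {
  d <- rowSums(W)
  two_m <- sum(d)
  P <- outer(d, d) / two_m                        # null model: P_ij = d_i d_j / (2m)
  same <- outer(membership, membership, "==")     # TRUE when i and j share a cluster
  sum((W - P)[same]) / two_m
}

# Toy graph: two triangles joined by one edge
edges <- rbind(c(1, 2), c(1, 3), c(2, 3), c(4, 5), c(4, 6), c(5, 6), c(3, 4))
W <- matrix(0, 6, 6)
W[edges] <- 1
W[edges[, 2:1]] <- 1

q_triangles <- modularity_q(W, c(1, 1, 1, 2, 2, 2))   # one cluster per triangle
q_single    <- modularity_q(W, rep(1, 6))             # everything in one cluster
print(q_triangles)   # 5/14, about 0.357
print(q_single)      # 0
```

Splitting the graph into its two triangles gives a positive modularity, whereas putting all vertices in a single cluster gives exactly 0, so the criterion prefers the split.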

Analysis of the effect of low-calorie diet on obese women

Data
Experimental protocol
135 obese women measured at 3 time points: before the low-calorie diet (LCD),
after a 2-month LCD and 6 months later (between the end of the LCD and the
last measurement, women are randomized into one of 5 recommended diet groups).
At every time step, 221 gene expressions, 28 fatty acids and 15 clinical
variables (e.g., weight, HDL, ...) are measured.

Correlations between gene expressions and correlations between a gene
expression and a fatty acid level are not of the same order of magnitude: the
inference method must differ within a group of variables and between two groups.

Data
Data pre-processing
At CID3, individuals are split into three groups: weight loss, weight
regain and stable weight (these groups are not associated with the diet group
according to a χ² test).
The partition is similar to the one obtained with the weight increase/decrease ratio.

Method
Network inference → Clustering → Mining
Network inference:
• 3 inter-dataset networks (sparse CCA)
• 3 intra-dataset networks (sparse partial correlation)
• merge for each time step into one network: 5 networks (CID1, CID2, 3×CID3)
Mining:
• Extract important nodes
• Study/Compare clusters

Brief overview of results
5 networks inferred with 264 nodes each:
                                            CID1       CID2       CID3g1     CID3g2     CID3g3
size of largest connected component (LCC)   244        251        240        259        258
density                                     2.3%       2.3%       2.3%       2.3%       2.3%
transitivity                                17.2%      11.9%      21.6%      10.6%      10.4%
number of clusters                          14 (2-52)  10 (4-52)  11 (2-46)  12 (2-51)  12 (3-54)
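For reference, the first three summary statistics of the table can be computed from an adjacency matrix in base R, as sketched below. The definitions used are the standard ones (density: share of possible edges that are present; transitivity: share of connected triples that are closed into triangles). The toy graph and the function name `largest_component_size` are illustrative assumptions, not the networks of the study.

```r
# Size of the largest connected component, density and transitivity
# of an unweighted, undirected graph given by its adjacency matrix.
largest_component_size <- function(adj) {
  n <- nrow(adj); seen <- rep(FALSE, n); best <- 0
  for (s in seq_len(n)) {
    if (seen[s]) next
    queue <- s; seen[s] <- TRUE; size <- 0
    while (length(queue) > 0) {            # breadth-first search from s
      v <- queue[1]; queue <- queue[-1]; size <- size + 1
      nb <- which(adj[v, ] != 0 & !seen)
      seen[nb] <- TRUE
      queue <- c(queue, nb)
    }
    best <- max(best, size)
  }
  best
}

# Toy graph: two triangles joined by one edge, plus one isolated vertex (7)
edges <- rbind(c(1, 2), c(1, 3), c(2, 3), c(4, 5), c(4, 6), c(5, 6), c(3, 4))
adj <- matrix(0, 7, 7)
adj[edges] <- 1
adj[edges[, 2:1]] <- 1

n <- nrow(adj)
d <- rowSums(adj)
lcc_size <- largest_component_size(adj)
density <- (sum(adj) / 2) / (n * (n - 1) / 2)       # edges / possible edges
closed <- sum(diag(adj %*% adj %*% adj))            # 6 x number of triangles
transitivity <- closed / sum(d * (d - 1))           # closed / all connected triples

print(lcc_size)       # 6
print(density)        # 7/21, about 0.333
print(transitivity)   # 0.6
```

A density near 2% with a transitivity of 10% to 20%, as in the table, means that neighbours of a node are linked to each other much more often than two randomly chosen nodes.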

References
Butte, A. and Kohane, I. (1999).
Unsupervised knowledge discovery in medical databases using relevance networks.
In Proceedings of the AMIA Symposium, pages 711-715.
Butte, A. and Kohane, I. (2000).
Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements.
In Proceedings of the Pacific Symposium on Biocomputing, pages 418-429.
Csardi, G. and Nepusz, T. (2006).
The igraph software package for complex network research.
InterJournal, Complex Systems.
Fruchterman, T. and Reingold, E. (1991).
Graph drawing by force-directed placement.
Software: Practice and Experience, 21:1129-1164.
Newman, M. and Girvan, M. (2004).
Finding and evaluating community structure in networks.
Physical Review E, 69:026113.
