This document provides an introduction to network analysis and graphs. It discusses how networks can be used to model relationships between entities, with nodes representing entities and edges representing relationships. It also discusses issues like network inference from data, visualization of networks, identifying important nodes, and clustering nodes into communities. Standard algorithms and tools for tasks like force-directed placement, centrality measures, and modularity optimization are also introduced.
1. Network analysis for computational biology
Nathalie Villa-Vialaneix - nathalie.villa@univ-paris1.fr
http://www.nathalievilla.org
IUT STID (Carcassonne) & SAMM (Université Paris 1)
Séminaire INSERM - 11 octobre 2011
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 1 / 19
2. Outline
1 An introduction to networks/graphs
2 Analysis of the eect of low-calorie diet on obese women
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 2 / 19
3. An introduction to networks/graphs
Outline
1 An introduction to networks/graphs
2 Analysis of the eect of low-calorie diet on obese women
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 3 / 19
4. An introduction to networks/graphs
What is a network/graph? réseau/graphe
Mathematical object used to model relational data between entities.
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 4 / 19
5. An introduction to networks/graphs
What is a network/graph? réseau/graphe
Mathematical object used to model relational data between entities.
The entities are called the nodes or the vertices
n÷uds/sommets
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 4 / 19
6. An introduction to networks/graphs
What is a network/graph? réseau/graphe
Mathematical object used to model relational data between entities.
A relation between two entities is modeled by an edge
arête
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 4 / 19
7. An introduction to networks/graphs
Examples
Bibliographic network
nodes gene, TF, enzyme...
edges a relationship between two nodes that is already reported in
the litterature
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 5 / 19
8. An introduction to networks/graphs
Examples
Network inferred from expression data
nodes genes
edges a strong direct co-expression between two nodes observed
in the gene expression data
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 5 / 19
9. An introduction to networks/graphs
Standard issues associated with networks
Inference
Giving expression data, how to build a graph whose edges represent the
direct links between genes?
Example: co-expression networks built from microarray/RNAseq data (nodes = genes;
edges = signicant direct links between expressions of two genes)
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 6 / 19
10. An introduction to networks/graphs
Standard issues associated with networks
Inference
Giving expression data, how to build a graph whose edges represent the
direct links between genes?
Graph mining (examples)
1 Network visualization: nodes are not a priori associated to a given
position. How to represent the network in a meaningful way?
Random positions
Positions aiming at representing
connected nodes closer
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 6 / 19
11. An introduction to networks/graphs
Standard issues associated with networks
Inference
Giving expression data, how to build a graph whose edges represent the
direct links between genes?
Graph mining (examples)
1 Network visualization: nodes are not a priori associated to a given
position. How to represent the network in a meaningful way?
2 Network clustering: identify communities (groups of nodes that are
densely connected and share a few links (comparatively) with the other groups)
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 6 / 19
12. An introduction to networks/graphs
Network inference
Data: large scale gene expression data
individuals
n 30/50
X =
. . . . . .
. . X
j
i . . .
. . . . . .
variables (genes expression), p 103/4
What we want to obtain: a network with
• nodes: genes;
• edges: signicant and direct co-expression between two genes (track
transcription regulations)
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 7 / 19
13. An introduction to networks/graphs
Advantages of inferring a network from large scale
transcription data
1 over raw data: focuses on the strongest direct relationships:
irrelevant or indirect relations are removed (more robust) and the data
are easier to visualize and understand.
Expression data are analyzed all together and not by pairs.
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 8 / 19
14. An introduction to networks/graphs
Advantages of inferring a network from large scale
transcription data
1 over raw data: focuses on the strongest direct relationships:
irrelevant or indirect relations are removed (more robust) and the data
are easier to visualize and understand.
Expression data are analyzed all together and not by pairs.
2 over bibliographic network: can handle interactions with yet
unknown (not annotated) genes and deal with data collected in a
particular condition.
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 8 / 19
15. An introduction to networks/graphs
Using correlations: relevance network
[Butte and Kohane, 1999,
Butte and Kohane, 2000]
First (naive) approach: calculate correlations between expressions for all
pairs of genes, threshold the smallest ones and build the network.
Correlations Thresholding Graph
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 9 / 19
16. An introduction to networks/graphs
But correlation is not causality...
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 10 / 19
17. An introduction to networks/graphs
But correlation is not causality...
strong indirect correlation
y z
x
set.seed(2807); x - runif(100)
y - 2*x+1+rnorm(100,0,0.1); cor(x,y); [1] 0.9988261
z - 2*x+1+rnorm(100,0,0.1); cor(x,z); [1] 0.998751
cor(y,z); [1] 0.9971105
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 10 / 19
18. An introduction to networks/graphs
But correlation is not causality...
strong indirect correlation
y z
x
set.seed(2807); x - runif(100)
y - 2*x+1+rnorm(100,0,0.1); cor(x,y); [1] 0.9988261
z - 2*x+1+rnorm(100,0,0.1); cor(x,z); [1] 0.998751
cor(y,z); [1] 0.9971105
Partial correlation
cor(lm(y∼x)$residuals,lm(z∼x)$residuals) [1] -0.1933699
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 10 / 19
19. An introduction to networks/graphs
But correlation is not causality...
strong indirect correlation
y z
x
Networks are built using partial correlations, i.e., correlations between
gene expressions knowing the expression of all the other genes
(residual correlations).
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 10 / 19
20. An introduction to networks/graphs
Visualization tools help understand the graph
macro-structure
Purpose: How to display the nodes in a meaningful and aesthetic way?
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 11 / 19
21. An introduction to networks/graphs
Visualization tools help understand the graph
macro-structure
Purpose: How to display the nodes in a meaningful and aesthetic way?
Standard approach: force directed placement algorithms (FDP)
algorithmes de forces (e.g., [Fruchterman and Reingold, 1991])
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 11 / 19
22. An introduction to networks/graphs
Visualization tools help understand the graph
macro-structure
Purpose: How to display the nodes in a meaningful and aesthetic way?
Standard approach: force directed placement algorithms (FDP)
algorithmes de forces (e.g., [Fruchterman and Reingold, 1991])
• attractive forces: similar to springs along the edges
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 11 / 19
23. An introduction to networks/graphs
Visualization tools help understand the graph
macro-structure
Purpose: How to display the nodes in a meaningful and aesthetic way?
Standard approach: force directed placement algorithms (FDP)
algorithmes de forces (e.g., [Fruchterman and Reingold, 1991])
• attractive forces: similar to springs along the edges
• repulsive forces: similar to electric forces between all pairs of vertices
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 11 / 19
24. An introduction to networks/graphs
Visualization tools help understand the graph
macro-structure
Purpose: How to display the nodes in a meaningful and aesthetic way?
Standard approach: force directed placement algorithms (FDP)
algorithmes de forces (e.g., [Fruchterman and Reingold, 1991])
• attractive forces: similar to springs along the edges
• repulsive forces: similar to electric forces between all pairs of vertices
iterative algorithm until stabilization of the vertex positions.
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 11 / 19
25. An introduction to networks/graphs
Visualization software
• package igraph1 [Csardi and Nepusz, 2006] (static
representation with useful tools for graph mining)
1
http://igraph.sourceforge.net/
2
http://gephi.org
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 12 / 19
26. An introduction to networks/graphs
Visualization software
• package igraph1 [Csardi and Nepusz, 2006] (static
representation with useful tools for graph mining)
• free software Gephi2 (interactive software, supports zooming
and panning)
1
http://igraph.sourceforge.net/
2
http://gephi.org
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 12 / 19
27. An introduction to networks/graphs
Extracting important nodes
1 vertex degree degré: number of edges adjacent to a given vertex.
Vertices with a high degree are called hubs: measure of the vertex
popularity.
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 13 / 19
28. An introduction to networks/graphs
Extracting important nodes
1 vertex degree degré: number of edges adjacent to a given vertex.
Vertices with a high degree are called hubs: measure of the vertex
popularity.
2 vertex betweenness centralité: number of shortest paths between all
pairs of vertices that pass through the vertex. Betweenness is a
centrality measure (vertices with a large betweenness that are the most likely to
disconnect the network if removed).
The orange node's degree is equal to 2, its betweenness to 4.
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 13 / 19
29. An introduction to networks/graphs
Extracting important nodes
1 vertex degree degré: number of edges adjacent to a given vertex.
Vertices with a high degree are called hubs: measure of the vertex
popularity.
2 vertex betweenness centralité: number of shortest paths between all
pairs of vertices that pass through the vertex. Betweenness is a
centrality measure (vertices with a large betweenness that are the most likely to
disconnect the network if removed).
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 13 / 19
30. An introduction to networks/graphs
Extracting important nodes
1 vertex degree degré: number of edges adjacent to a given vertex.
Vertices with a high degree are called hubs: measure of the vertex
popularity.
2 vertex betweenness centralité: number of shortest paths between all
pairs of vertices that pass through the vertex. Betweenness is a
centrality measure (vertices with a large betweenness that are the most likely to
disconnect the network if removed).
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 13 / 19
31. An introduction to networks/graphs
Extracting important nodes
1 vertex degree degré: number of edges adjacent to a given vertex.
Vertices with a high degree are called hubs: measure of the vertex
popularity.
2 vertex betweenness centralité: number of shortest paths between all
pairs of vertices that pass through the vertex. Betweenness is a
centrality measure (vertices with a large betweenness that are the most likely to
disconnect the network if removed).
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 13 / 19
32. An introduction to networks/graphs
Extracting important nodes
1 vertex degree degré: number of edges adjacent to a given vertex.
Vertices with a high degree are called hubs: measure of the vertex
popularity.
2 vertex betweenness centralité: number of shortest paths between all
pairs of vertices that pass through the vertex. Betweenness is a
centrality measure (vertices with a large betweenness that are the most likely to
disconnect the network if removed).
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 13 / 19
33. An introduction to networks/graphs
Extracting important nodes
1 vertex degree degré: number of edges adjacent to a given vertex.
Vertices with a high degree are called hubs: measure of the vertex
popularity.
2 vertex betweenness centralité: number of shortest paths between all
pairs of vertices that pass through the vertex. Betweenness is a
centrality measure (vertices with a large betweenness that are the most likely to
disconnect the network if removed).
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 13 / 19
34. An introduction to networks/graphs
Vertex clustering classication
Cluster vertexes into groups that are densely connected and share a few
links (comparatively) with the other groups. Clusters are often called
communities communautés (social sciences) or modules modules
(biology).
Example 1: Natty's facebook network
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 14 / 19
35. An introduction to networks/graphs
Find clusters by modularity optimization modularité
The modularity [Newman and Girvan, 2004] of the partition
(C1, . . . , CK) is equal to:
Q(C1, . . . , CK) =
1
2m
K
k=1xi ,xj∈Ck
(Wij − Pij)
with Pij: weight of a null model (graph with the same degree distribution
but no preferential attachment):
Pij =
didj
2m
with di = 1
2 j=i Wij.
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 15 / 19
36. An introduction to networks/graphs
Find clusters by modularity optimization modularité
The modularity [Newman and Girvan, 2004] of the partition
(C1, . . . , CK) is equal to:
Q(C1, . . . , CK) =
1
2m
K
k=1xi ,xj∈Ck
(Wij − Pij)
with Pij: weight of a null model (graph with the same degree distribution
but no preferential attachment):
Pij =
didj
2m
with di = 1
2 j=i Wij.
A good clustering should maximize the modularity:
• Q when (xi, xj) are in the same cluster and Wij Pij
• Q when (xi, xj) are in two dierent clusters and Wij Pij
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 15 / 19
37. An introduction to networks/graphs
Find clusters by modularity optimization modularité
The modularity [Newman and Girvan, 2004] of the partition
(C1, . . . , CK) is equal to:
Q(C1, . . . , CK) =
1
2m
K
k=1xi ,xj∈Ck
(Wij − Pij)
with Pij: weight of a null model (graph with the same degree distribution
but no preferential attachment):
Pij =
didj
2m
with di = 1
2 j=i Wij.
A good clustering should maximize the modularity:
• Q when (xi, xj) are in the same cluster and Wij Pij
• Q when (xi, xj) are in two dierent clusters and Wij Pij
Modularity optimization helps separate hubs (an edge is all the more
important in the criterion that it is attached to a vertex with a low degree).
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 15 / 19
38. Analysis of the eect of low-calorie diet on obese women
Outline
1 An introduction to networks/graphs
2 Analysis of the eect of low-calorie diet on obese women
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 16 / 19
39. Analysis of the eect of low-calorie diet on obese women
Data
Experimental protocol
135 obese women and 3 times: before LCD, after a 2-month LCD and 6
months later (between the end of LCD and the last measurement, women
are randomized into one of 5 recommended diet groups).
At every time step, 221 gene expressions, 28 fatty acids and 15 clinical
variables (i.e., weight, HDL, ...)
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 17 / 19
40. Analysis of the eect of low-calorie diet on obese women
Data
Experimental protocol
135 obese women and 3 times: before LCD, after a 2-month LCD and 6
months later (between the end of LCD and the last measurement, women
are randomized into one of 5 recommended diet groups).
At every time step, 221 gene expressions, 28 fatty acids and 15 clinical
variables (i.e., weight, HDL, ...)
Correlations between gene expressions and between a gene expression and a
fatty acid levels are not of the same order: inference method must be
dierent inside the groups and between two groups.
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 17 / 19
41. Analysis of the eect of low-calorie diet on obese women
Data
Data pre-processing
At CID3, individuals are split into three groups: weight loss, weight
regain and stable weight (groups are not correlated to the diet group
according to χ2-test).
Partition similar to the one obtained with weight increase/decrease ratio.
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 17 / 19
42. Analysis of the eect of low-calorie diet on obese women
Method
Network inference Clustering Mining
3 inter-dataset networks
sparse CCA
merge for each
time step into
one network
3 intra-dataset networks
sparse partial
correlation
5 networks
CID1 CID2
3×CID3
Extract important nodes
Study/Compare clusters
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 18 / 19
43. Analysis of the eect of low-calorie diet on obese women
Brief overview on results
5 networks inferred with 264 nodes each:
CID1 CID2 CID3g1 CID3g2 CID3g3
size LCC 244 251 240 259 258
density 2.3% 2.3% 2.3% 2.3% 2.3%
transitivity 17.2% 11.9% 21.6% 10.6% 10.4%
nb clusters 14 (2-52) 10 (4-52) 11 (2-46) 12 (2-51) 12 (3-54)
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 19 / 19
44. Analysis of the eect of low-calorie diet on obese women
References
Butte, A. and Kohane, I. (1999).
Unsupervised knowledge discovery in medical databases using relevance networks.
In Proceedings of the AMIA Symposium, pages 711715.
Butte, A. and Kohane, I. (2000).
Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements.
In Proceedings of the Pacic Symposium on Biocomputing, pages 418429.
Csardi, G. and Nepusz, T. (2006).
The igraph software package for complex network research.
InterJournal, Complex Systems.
Fruchterman, T. and Reingold, B. (1991).
Graph drawing by force-directed placement.
Software, Practice and Experience, 21:11291164.
Newman, M. and Girvan, M. (2004).
Finding and evaluating community structure in networks.
Physical Review, E, 69:026113.
1 octobre 2013 (Séminaire INSERM) Network Nathalie Villa-Vialaneix 19 / 19