Graph Mining and Social
Network Analysis
Data Mining; EECS 4412
Darren Rolfe + Vince Chu
11.06.14
Agenda
Graph Mining
Methods for Mining Frequent Subgraphs
Apriori-based Approach: AGM, FSG
Pattern-Growth Approach: gSpan
Social Networks Analysis
Properties and Features of Real Social Graphs
Models of Graphs we can use
Using those models for prediction and other tasks
Graph Mining
Methods for Mining Frequent Subgraphs
Why Mine Graphs?
A lot of data today can be represented in the form of a graph
Social: Friendship networks, social media networks, email and instant
messaging networks, document citation networks, blogs
Technological: Power grid, the internet
Biological: Spread of virus/disease, protein/gene regulatory networks
What Do We Need To Do
Identify various kinds of graph patterns
Frequent substructures are the very basic patterns that can be discovered in
a collection of graphs, useful for:
characterizing graph sets,
discriminating different groups of graphs,
classifying and clustering graphs,
building graph indices, and
facilitating similarity search in graph databases
Mining Frequent Subgraphs
Performed on a collection of graphs
Notation:
The vertex set of a graph 𝑔 is denoted by 𝑉(𝑔)
The edge set of a graph 𝑔 is denoted by 𝐸(𝑔)
A label function, 𝐿, maps a vertex or an edge to a label.
A graph 𝑔 is a subgraph of another graph 𝑔’ if there exists a subgraph isomorphism from 𝑔 to 𝑔’.
Given a labeled graph data set, 𝐷 = {𝐺1, 𝐺2, … , 𝐺𝑛}, we define 𝑠𝑢𝑝𝑝𝑜𝑟𝑡(𝑔) (or 𝑓𝑟𝑒𝑞𝑢𝑒𝑛𝑐𝑦(𝑔)) as the
percentage (or number) of graphs in 𝐷 where 𝑔 is a subgraph.
A frequent graph is a graph whose support is no less than a minimum support threshold, 𝑚𝑖𝑛_𝑠𝑢𝑝.
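As a minimal sketch of the support definition above, assuming small graphs with vertex labels only (edge labels are omitted) and a brute-force subgraph test. The names `is_subgraph`, `support`, `graph` and the toy data set `D` are illustrative assumptions, not part of the original slides:

```python
from itertools import permutations

def is_subgraph(small, big):
    """Brute-force test: is there a label-preserving injective map from the
    vertices of `small` into `big` that keeps every edge of `small`?
    A graph is a pair (labels, edges): labels maps vertex -> label and
    edges is a set of frozenset({u, v}). Exponential time; toy sizes only."""
    s_labels, s_edges = small
    b_labels, b_edges = big
    s_vertices = list(s_labels)
    for image in permutations(list(b_labels), len(s_vertices)):
        f = dict(zip(s_vertices, image))
        labels_ok = all(s_labels[v] == b_labels[f[v]] for v in s_vertices)
        edges_ok = all(frozenset(f[v] for v in e) in b_edges for e in s_edges)
        if labels_ok and edges_ok:
            return True
    return False

def support(pattern, dataset):
    """Fraction of graphs in `dataset` that contain `pattern` as a subgraph."""
    return sum(is_subgraph(pattern, g) for g in dataset) / len(dataset)

def graph(labels, edge_list):
    """Helper that builds the (labels, edges) pair used above."""
    return labels, {frozenset(e) for e in edge_list}

# Hypothetical data set D of three labelled graphs.
D = [
    graph({1: "A", 2: "B", 3: "C"}, [(1, 2), (2, 3), (1, 3)]),
    graph({1: "A", 2: "B", 3: "C"}, [(1, 2), (2, 3)]),
    graph({1: "A", 2: "B"}, [(1, 2)]),
]
edge_ab = graph({1: "A", 2: "B"}, [(1, 2)])
edge_ac = graph({1: "A", 2: "C"}, [(1, 2)])
min_sup = 0.5
print(support(edge_ab, D) >= min_sup)  # A-B occurs in all three graphs
print(support(edge_ac, D) >= min_sup)  # A-C occurs only in the first graph
```

With this toy data the A-B edge is frequent at a 50% threshold and the A-C edge is not.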
Discovering Frequent Substructures
Usually consists of two steps:
1. Generate frequent substructure candidates.
2. Check the frequency of each candidate.
Most studies on frequent substructure discovery focus on the optimization of
the first step, because the second step involves a subgraph isomorphism test
whose computational complexity is excessively high (i.e., NP-complete).
Graph Isomorphism
An isomorphism of graphs 𝐺 and 𝐻
is a bijection between the vertex
sets of 𝐺 and 𝐻
𝑓: 𝑉(𝐺) → 𝑉(𝐻)
such that any two vertices 𝑢 and 𝑣 of
𝐺 are adjacent in 𝐺 if and only if
𝑓(𝑢) and 𝑓(𝑣) are adjacent in 𝐻.
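The definition can be checked directly on small graphs by trying every bijection between the vertex sets. This is a brute-force sketch for illustration only; the function name `are_isomorphic` and the example graphs are assumptions, and the factorial running time makes it unusable beyond a handful of vertices:

```python
from itertools import permutations

def are_isomorphic(vertices_g, edges_g, vertices_h, edges_h):
    """Return True if some bijection f: V(G) -> V(H) maps edges onto edges.
    Graphs are simple and undirected; edges are given as pairs of vertices."""
    eg = {frozenset(e) for e in edges_g}
    eh = {frozenset(e) for e in edges_h}
    vg, vh = list(vertices_g), list(vertices_h)
    if len(vg) != len(vh) or len(eg) != len(eh):
        return False
    for image in permutations(vh):
        f = dict(zip(vg, image))
        # f is a bijection and |E(G)| = |E(H)|, so mapping every edge of G
        # onto an edge of H also guarantees that non-edges map to non-edges.
        if all(frozenset(f[x] for x in e) in eh for e in eg):
            return True
    return False

# Hypothetical examples: two drawings of a 4-cycle, and a triangle with a tail.
square = ("abcd", [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")])
crossed = ([1, 2, 3, 4], [(1, 3), (3, 2), (2, 4), (4, 1)])
paw = ([1, 2, 3, 4], [(1, 2), (2, 3), (3, 1), (3, 4)])
print(are_isomorphic(*square, *crossed))
print(are_isomorphic(*square, *paw))
```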
[Figure: example graphs, Graph G and Graph H]
Frequent
Subgraphs: An
Example
1. Start with a labelled graph
data set.
2. Set a minimum support
threshold for frequent
graph.
3. Generate frequent
substructure candidates.
4. Check the frequency of each
candidate.
[Figure: labelled graph data set of four graphs, Graph 1 to Graph 4, with vertex labels A, B and C]
Let the minimum support for this example be 50%.
[Figure: candidate substructures generated at sizes k = 1, k = 2, k = 3 and k = 4]
[Figure: a size-3 candidate checked against Graph 1 to Graph 4]
k = 3, frequency: 3, support: 75%
[Figure: a size-4 candidate checked against Graph 1 to Graph 4]
k = 4, frequency: 2, support: 50%
Apriori-based Approach
Apriori-based frequent substructure mining algorithms share similar
characteristics with Apriori-based frequent itemset mining algorithms.
Search for frequent graphs:
Starts with graphs of small “size”; definition of graph size depends on algorithm used.
Proceeds in a bottom-up manner by generating candidates having an extra vertex, edge, or
path.
The main design complexity of Apriori-based substructure mining algorithms is
the candidate generation step.
Candidate generation problem in frequent substructure mining is harder than that in
frequent itemset mining, because there are many ways to join two substructures.
Apriori-based Approach
1. Generate size 𝑘 frequent subgraph candidates
Generated by joining two similar but slightly different frequent subgraphs that were
discovered in the previous call of the algorithm.
2. Check the frequency of each candidate
3. Generate the size 𝑘 + 1 frequent candidates
4. Continue until candidates are empty
Algorithm: AprioriGraph
Apriori-based Frequent Substructure Mining
Input:
𝐷, a graph data set
𝑚𝑖𝑛_𝑠𝑢𝑝, minimum support threshold
Output:
𝑆𝑘, frequent substructure set
Method:
𝑆1 ← frequent single-elements in 𝐷
Call 𝐴𝑝𝑟𝑖𝑜𝑟𝑖𝐺𝑟𝑎𝑝ℎ(𝐷, 𝑚𝑖𝑛_𝑠𝑢𝑝, 𝑆1)
procedure AprioriGraph(D, min_sup, S_k)
    S_{k+1} ← ∅;
    foreach frequent g_i ∈ S_k do
        foreach frequent g_j ∈ S_k do
            foreach size-(k+1) graph g formed by merge(g_i, g_j) do
                if g is frequent in D and g ∉ S_{k+1} then
                    insert g into S_{k+1};
    if S_{k+1} ≠ ∅ then
        AprioriGraph(D, min_sup, S_{k+1});
    return;
AGM - Apriori-based Graph Mining
Vertex-based candidate generation method that increases the substructure
size by one vertex at each iteration of AprioriGraph.
Graph size 𝑘 is the number of vertices in the graph
Two size-k frequent graphs are joined only if they have the same size-(k−1) subgraph.
Newly formed candidate includes the size-(k−1) subgraph in common and the additional
two vertices from the two size-k patterns.
Because it is undetermined whether there is an edge connecting the additional two
vertices, we actually can form two substructures.
AGM: An Example
Two substructures joined by two chains.
Graph size 𝑘 is the number of vertices in the graph
[Figure: two size-4 graphs (k = 4) joined into two size-5 candidates (k = 5)]
FSG – Frequent Subgraph Discovery
Edge-based candidate generation strategy that increases the substructure size
by one edge in each call of AprioriGraph.
Graph size 𝑘 is the number of edges in the graph
Two size-k patterns are merged if and only if they share the same subgraph having k−1
edges, which is called the core.
Newly formed candidate includes the core and the additional two edges from size-k
patterns.
FSG: An Example
Two substructure patterns and their potential candidates.
Graph size 𝑘 is the number of edges in the graph
[Figure: two k = 4 patterns and their k = 5 candidates]
FSG: Another Example
Two substructure patterns and their potential candidates.
Graph size 𝑘 is the number of edges in the graph
[Figure: two k = 5 patterns and their k = 6 candidates]
Pitfall: Apriori-based Approach
Generation of subgraph candidates is complicated and expensive.
Level-wise candidate generation → Breadth-first search
To determine whether a size-(k+1) graph is frequent, must check all corresponding size-k
subgraphs to obtain the upper bound of frequency.
Before mining any size-(k+1) subgraph, requires complete mining of size-k subgraphs
Subgraph isomorphism is an NP-complete
problem, so pruning is expensive.
Pattern-Growth Approach
1. Initially, start with the frequent vertices as frequent graphs
2. Extend these graphs by adding a new edge such that newly formed graphs are
frequent graphs
A graph g can be extended by adding a new edge e; the newly formed graph is denoted by g ◇_x e.
If e introduces a new vertex, we denote the new graph by g ◇_xf e, otherwise g ◇_xb e, where f or
b indicates that the extension is in a forward or backward direction.
3. For each discovered graph g, it performs extensions recursively until all the
frequent graphs with g embedded are discovered.
4. The recursion stops once no frequent graph can be generated.
Algorithm: PatternGrowthGraph
Simplistic Pattern Growth-based Frequent Substructure Mining
Input:
𝑔, a frequent graph
𝐷, a graph data set
𝑚𝑖𝑛_𝑠𝑢𝑝, minimum support threshold
Output:
𝑆, frequent graph set
Method:
𝑆 ← ∅
Call 𝑃𝑎𝑡𝑡𝑒𝑟𝑛𝐺𝑟𝑜𝑤𝑡ℎ𝐺𝑟𝑎𝑝ℎ(𝑔, 𝐷, 𝑚𝑖𝑛_𝑠𝑢𝑝, 𝑆)
procedure PatternGrowthGraph(g, D, min_sup, S)
    if g ∈ S then return;
    else insert g into S;
    scan D once, find all edges e such that g can be extended to g ◇_x e;
    foreach frequent g ◇_x e do
        PatternGrowthGraph(g ◇_x e, D, min_sup, S);
    return;
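The procedure above can be sketched in runnable form for toy inputs. This is an illustration under stated assumptions, not the reference implementation: graphs carry vertex labels only, support is an absolute count, the subgraph test and the duplicate check (`canonical`) are brute force, and extensions are enumerated from a label alphabet instead of by scanning D. All function names and the data set are hypothetical.

```python
from itertools import permutations

# A graph is (labels, edges): labels is a tuple (vertex i has label labels[i])
# and edges is a frozenset of frozenset({u, v}). Edge labels are omitted.
def graph(labels, edge_list):
    return tuple(labels), frozenset(frozenset(e) for e in edge_list)

def contains(big, small):
    """Brute-force subgraph test (label-preserving injective vertex map)."""
    (b_labels, b_edges), (s_labels, s_edges) = big, small
    for image in permutations(range(len(b_labels)), len(s_labels)):
        if all(b_labels[image[v]] == lab for v, lab in enumerate(s_labels)) and \
           all(frozenset(image[v] for v in e) in b_edges for e in s_edges):
            return True
    return False

def frequency(g, D):
    return sum(contains(G, g) for G in D)

def canonical(g):
    """Smallest (labels, edges) encoding over all vertex renumberings;
    two patterns are isomorphic exactly when their encodings are equal."""
    labels, edges = g
    keys = []
    for perm in permutations(range(len(labels))):  # perm[old] = new index
        new_labels = tuple(lab for _, lab in sorted(zip(perm, labels)))
        new_edges = tuple(sorted(tuple(sorted(perm[v] for v in e)) for e in edges))
        keys.append((new_labels, new_edges))
    return min(keys)

def extensions(g, alphabet):
    labels, edges = g
    n = len(labels)
    for u in range(n):
        for lab in alphabet:            # forward: the edge adds new vertex n
            yield labels + (lab,), edges | {frozenset((u, n))}
        for v in range(u + 1, n):       # backward: both endpoints already exist
            if frozenset((u, v)) not in edges:
                yield labels, edges | {frozenset((u, v))}

def pattern_growth(g, D, min_sup, S, alphabet):
    key = canonical(g)
    if key in S:                        # duplicate pattern: already explored
        return
    S.add(key)
    for ext in extensions(g, alphabet):
        if frequency(ext, D) >= min_sup:
            pattern_growth(ext, D, min_sup, S, alphabet)

# Hypothetical data set: a triangle, a path and a single edge.
D = [graph("ABC", [(0, 1), (1, 2), (0, 2)]),
     graph("ABC", [(0, 1), (1, 2)]),
     graph("AB", [(0, 1)])]
alphabet, min_sup, S = "ABC", 2, set()
for lab in alphabet:
    seed = graph(lab, [])
    if frequency(seed, D) >= min_sup:
        pattern_growth(seed, D, min_sup, S, alphabet)
print(len(S))
```

Starting from vertex B reaches the A-B edge a second time, and the `key in S` check is what cuts that duplicate branch off. The cost of generating and then detecting such duplicates is the inefficiency that gSpan is designed to reduce.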
Pattern-Growth:
An Example
1. Start with the frequent
vertices as frequent graphs
2. Extend these graphs by
adding a new edge forming
new frequent graphs
3. For each discovered graph
g, recursively extend
4. Stops once no frequent
graph can be generated
[Figure: labelled graph data set of four graphs, Graph 1 to Graph 4, with vertex labels A, B and C]
Let the minimum support for this example be 50%.
[Figure sequence over Graph 1 to Graph 4, one caption per slide:]
Let’s arbitrarily start with this frequent vertex
Extend graph (forward); add frequent edge
Extend frequent graph (forward) again…
Extend graph (backward); previously seen node
Extend frequent graph (forward) again…
Stop recursion, try different start vertex…
Pitfall: PatternGrowthGraph
Simple, but not efficient
The same graph can be discovered many times (duplicate graphs)
Generating and detecting duplicate graphs increases the workload
gSpan (Graph-Based Substructure Pattern Mining)
Designed to reduce the generation of duplicate graphs.
Explores via depth-first search (DFS)
DFS lexicographic order and minimum DFS code form a canonical labeling
system to support DFS search.
Discovers all the frequent subgraphs without candidate generation or false-positive
pruning. It combines the growing and checking of frequent
subgraphs into one procedure, thus accelerating the mining process.
gSpan (Graph-Based Substructure Pattern Mining)
DFS Subscripting
When performing a DFS in a graph, we construct a DFS tree
One graph can have several different DFS trees
Depth-first discovery of the vertices forms a linear order
Use subscripts to label this order according to their discovery time
i < j means vi is discovered before vj.
v0 is the root and vn is the rightmost vertex.
The straight path from v0 to vn is called the rightmost path.
gSpan (Graph-Based Substructure Pattern Mining)
DFS Code
We transform each subscripted graph to an edge sequence, called a DFS
code, so that we can build an order among these sequences.
The goal is to select the subscripting that generates the minimum sequence
as its base subscripting.
There are two kinds of orders in this transformation process:
1. Edge order, which maps edges in a subscripted graph into a sequence; and
2. Sequence order, which builds an order among edge sequences
An edge is represented by a 5-tuple, (𝑖, 𝑗, 𝑙𝑖, 𝑙(𝑖,𝑗), 𝑙𝑗); 𝑙𝑖 and 𝑙𝑗 are the labels
of 𝑣𝑖 and 𝑣𝑗, respectively, and 𝑙(𝑖,𝑗) is the label of the edge connecting them
gSpan (Graph-Based Substructure Pattern Mining)
DFS Lexicographic Order
For each DFS tree, we sort the tuples of its DFS code into an edge sequence.
Based on the DFS lexicographic ordering, the minimum DFS code of a given
graph G, written as dfs(G), is the minimal one among all the DFS codes.
The subscripting that generates the minimum DFS code is called the base
subscripting.
Given two graphs 𝐺 and 𝐺’, 𝐺 is isomorphic to 𝐺’ if and only if 𝑑𝑓𝑠(𝐺) = 𝑑𝑓𝑠(𝐺’).
Based on this property, what we need to do for mining frequent subgraphs is to perform only the
right-most extensions on the minimum DFS codes, since such an extension will guarantee the
completeness of mining results.
DFS Code: An
Example
DFS Subscripting
When performing a DFS in a
graph, we construct a DFS tree
One graph can have several
different DFS trees
[Figure: a labelled graph (vertex labels X, X, Z, Y; edge labels a, a, b, b) and several of its DFS trees, with vertices subscripted v0 to v3]
DFS
Lexicographic
Order: An
Example
For each DFS tree, we sort
the tuples of its DFS code into an
edge sequence.
Based on the DFS lexicographic
ordering, the minimum DFS
code of a given graph G,
written as dfs(G), is the minimal
one among all the DFS codes.
[Figure: three DFS trees of the same labelled graph (vertex labels X, X, Z, Y; edge labels a, a, b, b)]
edge   γ0                γ1                γ2
e0     (0, 1, X, a, X)   (0, 1, X, a, X)   (0, 1, Y, b, X)
e1     (1, 2, X, a, Z)   (1, 2, X, b, Y)   (1, 2, X, a, X)
e2     (2, 0, Z, b, X)   (1, 3, X, a, Z)   (2, 3, X, b, Z)
e3     (1, 3, X, b, Y)   (3, 0, Z, b, X)   (3, 1, Z, a, X)
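The three DFS codes of this example can be compared with a simplified ordering. The sketch below handles only the case where the first differing 5-tuples share the same (i, j) and so are ordered by their labels; the full DFS lexicographic order also ranks forward and backward edges by their indices, which is deliberately left unimplemented here. The function name `compare_dfs_codes` is an assumption for illustration.

```python
def compare_dfs_codes(alpha, beta):
    """Return -1, 0 or 1 if alpha is smaller than, equal to or larger than beta.
    Simplified: the first differing tuples must agree on (i, j)."""
    for a, b in zip(alpha, beta):
        if a == b:
            continue
        if a[:2] != b[:2]:
            raise NotImplementedError("edges differ in (i, j); full edge order needed")
        # Same (i, j): order by the label triple (l_i, l_(i,j), l_j).
        return -1 if a[2:] < b[2:] else 1
    # One code is a prefix of the other: the shorter one is smaller.
    return (len(alpha) > len(beta)) - (len(alpha) < len(beta))

gamma0 = [(0, 1, "X", "a", "X"), (1, 2, "X", "a", "Z"),
          (2, 0, "Z", "b", "X"), (1, 3, "X", "b", "Y")]
gamma1 = [(0, 1, "X", "a", "X"), (1, 2, "X", "b", "Y"),
          (1, 3, "X", "a", "Z"), (3, 0, "Z", "b", "X")]
gamma2 = [(0, 1, "Y", "b", "X"), (1, 2, "X", "a", "X"),
          (2, 3, "X", "b", "Z"), (3, 1, "Z", "a", "X")]

print(compare_dfs_codes(gamma0, gamma1))  # they first differ at e1: a < b
print(compare_dfs_codes(gamma1, gamma2))  # they first differ at e0: X < Y
```

Among these three codes γ0 is the smallest. Whether it is the minimum DFS code of the graph depends on the remaining DFS trees, which this sketch does not enumerate.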
gSpan (Graph-Based Substructure Pattern Mining)
1. Initially, a starting vertex is randomly chosen
2. Vertices in a graph are marked so that we can tell which vertices have been
visited
3. Visited vertex set is expanded repeatedly until a full DFS tree is built
4. Given a graph G and a DFS tree T in G, a new edge e
Can be added between the right-most vertex and another vertex on the
right-most path (backward extension); or
Can introduce a new vertex and connect to a vertex on the right-most
path (forward extension).
Because both kinds of extensions take place on the right-most path, we call them
right-most extensions, denoted by g ◇_r e
Algorithm: gSpan
Pattern growth-based frequent substructure mining that reduces duplicate graph generation.
Input:
𝑠, a DFS code
𝐷, a graph data set
𝑚𝑖𝑛_𝑠𝑢𝑝, minimum support
threshold
Output:
𝑆, frequent graph set
Method:
𝑆 ← ∅
Call 𝑔𝑆𝑝𝑎𝑛(𝑠, 𝐷, 𝑚𝑖𝑛_𝑠𝑢𝑝, 𝑆)
procedure gSpan(s, D, min_sup, S)
    if s ≠ dfs(s) then return;
    insert s into S;
    set C to ∅;
    scan D once, find all edges e such that s can be right-most extended to s ◇_r e;
        insert s ◇_r e into C and count its frequency;
    foreach frequent s ◇_r e in C do
        gSpan(s ◇_r e, D, min_sup, S);
    return;
Other Graph Mining
So far the techniques we have discussed:
Can handle only one special kind of graphs:
Labeled, undirected, connected simple graphs without any specific constraints
Assume that the database to be mined contains a set of graphs
Each consisting of a set of labeled vertices and labeled but undirected edges, with no other
constraints.
Other Graph Mining
Mining Variant and Constrained Substructure Patterns
Closed frequent substructure
where a frequent graph G is closed if and only if there is no proper supergraph G′ that has
the same support as G
Maximal frequent substructure
where a frequent pattern G is maximal if and only if there is no frequent super-pattern of G.
Constraint-based substructure mining
Element, set, or subgraph containment constraint
Geometric constraint
Value-sum constraint
Application: Classification
We mine frequent graph patterns in the training set.
The features that are frequent in one class but rather infrequent in the other
class(es) should be considered as highly discriminative features; used for
model construction.
To achieve high-quality classification,
We can adjust: the thresholds on frequency, discriminativeness, and graph connectivity
Based on: the data, the number and quality of the features generated, and the classification accuracy.
Application: Cluster analysis
We mine frequent graph patterns in the training set.
The set of graphs that share a large set of similar graph patterns should be
considered as highly similar and should be grouped into similar clusters.
The minimal support threshold can be used as a way to adjust the number of
frequent clusters or generate hierarchical clusters.
Social Network Analysis
Examples of Social Networks
Twitter network
http://willchernoff.com/
Email Network
https://wiki.cs.umd.edu
Air Transportation Network
www.mccormick.northwestern.edu
Social Network Analysis
Nodes often represent
an object or entity such as a person,
computer/server, power generator,
airport, etc
Links represent relationships,
e.g. ‘likes’, ‘follows’, ‘flies to’, etc
http://www.liacs.nl/~erwin/dbdm2009/GraphMining.pdf
Why are we interested?
It turns out that the structure of
real-world graphs often has
special characteristics
This is important because
structure always affects function
e.g. the structure of a social network
affects how a rumour, or an
infectious disease, spreads
e.g. the structure of a power grid
determines how robust the network
is to power failures
Goal:
1. Identify the characteristics /
properties of graphs; structural and
dynamic / behavioural
2. Generate models of graphs that
exhibit these characteristics
3. Use these tools to make predictions
about the behaviour of graphs
Properties of Real-World Social Graphs
1. Degree Distribution
Plot the fraction of nodes with degree k (denoted pk) vs. k
Our intuition:
Poisson/Normal
Distribution
WRONG!
mathworld.wolfram.com
Correct: Highly Skewed
http://cs.stanford.edu/people/jure/talks/www08tutorial/
Properties of Real-World Social Graphs
1. (continued)
Real-world social networks tend to have a
highly skewed distribution that follows the
Power Law: p_k ~ k^(-a)
A small percentage of nodes have very high
degree, are highly connected
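A short sketch of how p_k, the fraction of nodes with degree k, can be computed from an edge list. The function name and the star-shaped toy network are assumptions for illustration; nodes with no edges do not appear in an edge list and are therefore not counted.

```python
from collections import Counter

def degree_distribution(edges):
    """Map each degree k to p_k, the fraction of nodes having degree k."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    counts = Counter(degree.values())
    n = len(degree)
    return {k: c / n for k, c in sorted(counts.items())}

# Hypothetical toy network: one hub (node 0) connected to four leaves.
star = [(0, 1), (0, 2), (0, 3), (0, 4)]
print(degree_distribution(star))
```

Even in this tiny example the distribution is skewed: most nodes have degree 1 and a single hub holds all the links.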
Example: Spread of a virus
black squares = infected
pink = infected but not contagious
green = exposed but not infected
Properties of Real-World Social Graphs
2. Small World Effect: for most real graphs, the number
of hops it takes to reach any node from any other node
is about 6 (Six Degrees of Separation).
Milgram did an experiment, asked people in
Nebraska to send letters to people in Boston
Constraint: letters could only be delivered to people
known on a first name basis.
Only 25% of the letters reached their target, but the
ones that did arrived in about 6 hops on average
Properties of Real-World Social Graphs
2. (continued)
The distribution of the shortest
path lengths.
Example: MSN Messenger
Network
If we pick a random node in the
network and then count how many
hops it is from every other node, we
get this graph
Most nodes are at a distance of 7
hops away from any other node
http://cs.stanford.edu/people/jure/talks/www08tutorial/
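The hop-count measurement described here is a breadth-first search from the chosen node. A minimal sketch, with hypothetical function names and a six-node ring standing in for a real network:

```python
from collections import Counter, deque

def hop_distances(adj, source):
    """Breadth-first search: number of hops from `source` to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def hop_histogram(adj, source):
    """How many nodes lie at each hop distance from `source`."""
    return Counter(d for node, d in hop_distances(adj, source).items() if node != source)

# Hypothetical toy network: a ring of six nodes.
ring = {i: {(i - 1) % 6, (i + 1) % 6} for i in range(6)}
print(sorted(hop_histogram(ring, 0).items()))
```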
Properties of Real-World Social Graphs
3. Network Resilience
If a node is removed, how is the network affected?
For real-world graphs, you must remove the highly connected nodes in order to reduce
the connectivity of the graph
Removing a node that is sparsely connected does not have a significant effect on
connectivity
Since the proportion of highly connected nodes in a real-world graph is small, the
probability of choosing and removing such a node at random is small
→ real-world graphs are resilient to random attacks!
→ conversely, targeted attacks on highly connected nodes are very effective!
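The contrast between random and targeted removal can be seen on a toy hub-and-spoke network by measuring the largest connected component after a node is removed. The function name and the network are assumptions for illustration only.

```python
def largest_component(adj, removed=()):
    """Size of the largest connected component once `removed` nodes are deleted."""
    removed = set(removed)
    seen, best = set(), 0
    for start in adj:
        if start in removed or start in seen:
            continue
        stack, size = [start], 0
        seen.add(start)
        while stack:
            u = stack.pop()
            size += 1
            for v in adj[u]:
                if v not in removed and v not in seen:
                    seen.add(v)
                    stack.append(v)
        best = max(best, size)
    return best

# Hypothetical toy network: hub 0 connected to leaves 1..5.
hub_and_spoke = {0: {1, 2, 3, 4, 5}, 1: {0}, 2: {0}, 3: {0}, 4: {0}, 5: {0}}
print(largest_component(hub_and_spoke, removed=[3]))  # remove a sparsely connected node
print(largest_component(hub_and_spoke, removed=[0]))  # remove the hub
```

Removing a leaf leaves the rest connected, while removing the hub splits the network into isolated nodes.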
Properties of Real-World Social Graphs
4. Densification
How does the number of edges in the
graph grow as the number of nodes
grows?
Previous belief: # edges grows linearly
with # nodes i.e. 𝐸(𝑡) ~ 𝑁(𝑡)
Actually, # edges grows superlinearly with
the # nodes, i.e. the # of edges grows
faster than the number of nodes
i.e. 𝐸(𝑡) ~ 𝑁(𝑡)^𝑎, with 𝑎 > 1
Graph gets denser over time
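Given node and edge counts at several points in time, the exponent a can be estimated as the slope of a least-squares line through the points (log N, log E). The sketch below uses synthetic counts generated from E = N^1.5 purely to exercise the code; they are not measurements from any real network, and the function name is an assumption.

```python
import math

def densification_exponent(node_counts, edge_counts):
    """Least-squares slope of log(E) against log(N)."""
    xs = [math.log(n) for n in node_counts]
    ys = [math.log(e) for e in edge_counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# Synthetic snapshots (assumption): edge counts follow E = N ** 1.5 exactly.
nodes = [100, 1000, 10000]
edges = [n ** 1.5 for n in nodes]
a = densification_exponent(nodes, edges)
print(round(a, 6))
```

An estimate close to 1 would match the older linear-growth belief, while a value above 1 indicates densification.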
Properties of Real-World Social Graphs
5. Shrinking Diameter
Diameter is taken to be the longest shortest path in the graph
As a network grows, the diameter actually gets smaller, i.e. the distance between nodes
slowly decreases
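The diameter as defined here can be computed by running a breadth-first search from every node. The sketch assumes a connected, undirected graph. The toy example, a path that gains one well-connected node, only illustrates that adding a node can shorten distances; it is not a model of how real networks grow.

```python
from collections import deque

def eccentricity(adj, source):
    """Largest hop distance from `source` to any node (graph assumed connected)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values())

def diameter(adj):
    """Longest shortest path: the maximum eccentricity over all nodes."""
    return max(eccentricity(adj, s) for s in adj)

# Hypothetical example: a path of five nodes, then a sixth node linked to all of them.
before = {i: {j for j in (i - 1, i + 1) if 0 <= j < 5} for i in range(5)}
after = {u: set(nbrs) | {5} for u, nbrs in before.items()}
after[5] = set(range(5))
print(diameter(before), diameter(after))
```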
Features/Properties of Graphs
Community structure
Densification
Shrinking diameter
Generators: How do we model graphs
Try:
Generating a random graph
Given n vertices, connect each pair i.i.d. with
probability p
Degrees follow a Poisson distribution (approximately, for large n)
Follows from our intuition
Not useful; no community structure
Does not mirror real-world graphs
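The random-graph generator described above fits in a few lines. A minimal sketch, where the function name and the `seed` parameter are assumptions added for reproducibility:

```python
import random
from itertools import combinations

def random_graph(n, p, seed=None):
    """Connect each of the n*(n-1)/2 vertex pairs independently with probability p."""
    rng = random.Random(seed)
    return [(i, j) for i, j in combinations(range(n), 2) if rng.random() < p]

edges = random_graph(10, 0.3, seed=42)
print(len(edges), "edges out of", 10 * 9 // 2, "possible pairs")
```

Every pair is treated identically, so nothing in the construction favours hubs or communities.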
Generators: How do we model graphs
(Erdos‐Renyi) Random graphs (1960s)
Exponential random graphs
Small‐world model
Preferential attachment
Edge copying model
Community guided attachment
Forest Fire
Kronecker graphs (today)
Kronecker Graphs
For Kronecker graphs, all the properties of
real-world graphs can actually be proven
Best model we have today
Adjacency matrix, recursive generation
Kronecker Graphs
1. Construct adjacency matrix for a graph G:
A(G) = (a_ij), where a_ij = 1 if i and j are adjacent, and 0 otherwise
Side Note: An eigenvalue of a matrix A is a scalar λ for which
Av = λv holds for some nonzero vector v (an eigenvector of A)
Kronecker Graphs
2. Generate the 2nd Kronecker graph by taking the Kronecker product of the
1st graph with itself. The Kronecker product of two matrices A and B, written
A ⊗ B, is the block matrix whose (i, j) block is a_ij B.
Kronecker Graphs
Visually, this is just taking the first matrix and replacing each entry that
is equal to 1 with a copy of the second matrix (and each 0 with a block of zeros).
3 x 3 matrix 9 x 9 matrix
Kronecker Graphs
We define the Kronecker product of two graphs as the Kronecker product of
their adjacency matrices
Therefore, we can compute the k-th Kronecker graph by iteratively taking the
Kronecker product of an initial graph G1 with itself, using k copies of G1:
Gk = G1 ⊗ G1 ⊗ G1 ⊗ … ⊗ G1
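A plain-Python sketch of the recursive construction on adjacency matrices stored as lists of lists. The function names and the 2 x 2 initiator matrix are assumptions for illustration; a numerical library would normally be used for large matrices.

```python
def kronecker_product(a, b):
    """Block matrix whose (i, j) block is a[i][j] * b."""
    rows_b, cols_b = len(b), len(b[0])
    return [[a[i // rows_b][j // cols_b] * b[i % rows_b][j % cols_b]
             for j in range(len(a[0]) * cols_b)]
            for i in range(len(a) * rows_b)]

def kronecker_power(g1, k):
    """Adjacency matrix of G_k: the Kronecker product of k copies of g1."""
    result = g1
    for _ in range(k - 1):
        result = kronecker_product(result, g1)
    return result

# Hypothetical 2 x 2 initiator matrix.
g1 = [[1, 1],
      [1, 0]]
for row in kronecker_power(g1, 2):
    print(row)
```

Each step multiplies the number of nodes by the size of G1, so G_k for this initiator has 2^k nodes.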
Applying Models to Real World Graphs
Can then predict and understand the structure
Virus Propagation
A form of diffusion; a fundamental process in social networks
Can also refer to spread of rumours, news, etc
Virus Propagation
SIS Model: Susceptible - Infected - Susceptible
Virus birth rate β = the probability that an infected node attacks a neighbour
Virus death rate δ = probability that an infected node becomes cured
[Figure: healthy, at-risk and infected nodes; an infected node infects a neighbour with probability β and heals with probability δ]
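One time step of the SIS model can be sketched as follows. The update order is an assumption made for this illustration: infection attempts and cures are both evaluated against the set of nodes infected at the start of the step. The function name is hypothetical.

```python
import random

def sis_step(adj, infected, beta, delta, rng):
    """Return the infected set after one step.
    Each infected node infects each healthy neighbour with probability beta
    and is cured (becomes susceptible again) with probability delta."""
    newly_infected = set()
    for u in sorted(infected):
        for v in sorted(adj[u]):
            if v not in infected and rng.random() < beta:
                newly_infected.add(v)
    cured = {u for u in sorted(infected) if rng.random() < delta}
    return (infected - cured) | newly_infected

# Hypothetical toy network: a path 0 - 1 - 2 with node 0 infected at the start.
path = {0: {1}, 1: {0, 2}, 2: {1}}
rng = random.Random(0)
state = {0}
for _ in range(2):
    state = sis_step(path, state, beta=1.0, delta=0.0, rng=rng)
print(sorted(state))
```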
Virus Propagation
The virus strength of a graph: s = β/δ
The epidemic threshold τ of a graph is a value such that if:
s = β/δ < τ then an epidemic cannot happen.
So we can ask the question:
Will the virus become epidemic?
Will the rumours/news become viral?
How to find threshold τ? Theorem:
τ = 1/λ_1,A where λ_1,A is the largest eigenvalue of the adjacency matrix A of the graph
So if s < τ then there is no epidemic
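The threshold can be computed numerically from the adjacency matrix. The sketch below estimates the largest eigenvalue by power iteration on A + I; the shift is an implementation choice made here so that the iteration also converges on bipartite graphs. It assumes a symmetric 0/1 adjacency matrix given as a list of lists, and all function names are hypothetical.

```python
def largest_eigenvalue(adj, iterations=1000, tol=1e-12):
    """Estimate the largest eigenvalue of a symmetric adjacency matrix."""
    n = len(adj)
    v = [1.0] * n
    lam = 0.0
    for _ in range(iterations):
        # Multiply by (A + I), then rescale to avoid overflow.
        w = [v[i] + sum(adj[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = max(abs(x) for x in w)
        v = [x / norm for x in w]
        # Rayleigh quotient of A at the current vector.
        av = [sum(adj[i][j] * v[j] for j in range(n)) for i in range(n)]
        new_lam = sum(v[i] * av[i] for i in range(n)) / sum(x * x for x in v)
        if abs(new_lam - lam) < tol:
            return new_lam
        lam = new_lam
    return lam

def epidemic_threshold(adj):
    return 1.0 / largest_eigenvalue(adj)

def epidemic_cannot_happen(beta, delta, adj):
    """True when the virus strength s = beta / delta is below the threshold."""
    return beta / delta < epidemic_threshold(adj)

# Hypothetical example: the complete graph on four nodes (largest eigenvalue 3).
k4 = [[0 if i == j else 1 for j in range(4)] for i in range(4)]
print(epidemic_threshold(k4))
print(epidemic_cannot_happen(0.1, 0.5, k4))  # s = 0.2
print(epidemic_cannot_happen(0.3, 0.5, k4))  # s = 0.6
```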
Link Prediction
Given a social network at time t1, predict the edges that will be added at time t2
Assign connection score(x,y) to each pair of nodes
Usually taken to be the (negated) length of the shortest path between the nodes x and y; other measures use # of
neighbours in common, and the Katz measure
Produce a list of scores in decreasing order
The pair at the top of the list are most likely to have a link created between them in the
future
Can also use this measure for clustering
Link Prediction
Score(x,y) = # of neighbours in
common
Top score = score(B,C) = 5
[Figure: example network with nodes A to J]
Likely new link between
B and C
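The common-neighbours score can be sketched as follows. The small network used here is hypothetical and is not the one drawn on the slide, so the scores differ from the slide's example; the function name is also an assumption.

```python
from itertools import combinations

def common_neighbour_scores(adj):
    """Score every non-adjacent pair by its number of shared neighbours and
    return the pairs in decreasing order of score."""
    scores = {}
    for x, y in combinations(sorted(adj), 2):
        if y not in adj[x]:
            scores[(x, y)] = len(adj[x] & adj[y])
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical network: B and C are not linked but share neighbours D, E and F.
network = {
    "A": {"B"},
    "B": {"A", "D", "E", "F"},
    "C": {"D", "E", "F"},
    "D": {"B", "C"},
    "E": {"B", "C"},
    "F": {"B", "C"},
}
ranking = common_neighbour_scores(network)
print(ranking[0])
```

The pair at the top of the ranking is the predicted next link.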
Viral Marketing
A customer may increase the sales of some product if they interact positively with their
peers in the social network
Assign a network value to a customer
Diffusion in Networks: Influential Nodes
Some nodes in the network can be active
they can spread their influence to other nodes
e.g. news, opinions, etc that propagate through a network of friends
2 models: Threshold model, Independent Contagion model
Thanks
Any questions?

Contenu connexe

Tendances

similarity measure
similarity measure similarity measure
similarity measure ZHAO Sam
 
Network centrality measures and their effectiveness
Network centrality measures and their effectivenessNetwork centrality measures and their effectiveness
Network centrality measures and their effectivenessemapesce
 
Community detection in social networks
Community detection in social networksCommunity detection in social networks
Community detection in social networksFrancisco Restivo
 
Chicago Crime Dataset Project Proposal
Chicago Crime Dataset Project ProposalChicago Crime Dataset Project Proposal
Chicago Crime Dataset Project ProposalAashri Tandon
 
Data Mining and Intrusion Detection
Data Mining and Intrusion Detection Data Mining and Intrusion Detection
Data Mining and Intrusion Detection amiable_indian
 
Streaming Algorithms
Streaming AlgorithmsStreaming Algorithms
Streaming AlgorithmsJoe Kelley
 
3.1 clustering
3.1 clustering3.1 clustering
3.1 clusteringKrish_ver2
 
5.3 mining sequential patterns
5.3 mining sequential patterns5.3 mining sequential patterns
5.3 mining sequential patternsKrish_ver2
 
Dimension Reduction Introduction & PCA.pptx
Dimension Reduction Introduction & PCA.pptxDimension Reduction Introduction & PCA.pptx
Dimension Reduction Introduction & PCA.pptxRohanBorgalli
 
Belief Networks & Bayesian Classification
Belief Networks & Bayesian ClassificationBelief Networks & Bayesian Classification
Belief Networks & Bayesian ClassificationAdnan Masood
 
4.2 spatial data mining
4.2 spatial data mining4.2 spatial data mining
4.2 spatial data miningKrish_ver2
 
BIM Data Mining Unit1 by Tekendra Nath Yogi
 BIM Data Mining Unit1 by Tekendra Nath Yogi BIM Data Mining Unit1 by Tekendra Nath Yogi
BIM Data Mining Unit1 by Tekendra Nath YogiTekendra Nath Yogi
 
Unsupervised learning
Unsupervised learningUnsupervised learning
Unsupervised learningamalalhait
 

Tendances (20)

3. mining frequent patterns
3. mining frequent patterns3. mining frequent patterns
3. mining frequent patterns
 
Clustering - K-Means, DBSCAN
Clustering - K-Means, DBSCANClustering - K-Means, DBSCAN
Clustering - K-Means, DBSCAN
 
SPADE -
SPADE - SPADE -
SPADE -
 
similarity measure
similarity measure similarity measure
similarity measure
 
Naive Bayes
Naive BayesNaive Bayes
Naive Bayes
 
Network centrality measures and their effectiveness
Network centrality measures and their effectivenessNetwork centrality measures and their effectiveness
Network centrality measures and their effectiveness
 
Community detection in social networks
Community detection in social networksCommunity detection in social networks
Community detection in social networks
 
Chicago Crime Dataset Project Proposal
Chicago Crime Dataset Project ProposalChicago Crime Dataset Project Proposal
Chicago Crime Dataset Project Proposal
 
Data Mining and Intrusion Detection
Data Mining and Intrusion Detection Data Mining and Intrusion Detection
Data Mining and Intrusion Detection
 
Streaming Algorithms
Streaming AlgorithmsStreaming Algorithms
Streaming Algorithms
 
3.1 clustering
3.1 clustering3.1 clustering
3.1 clustering
 
5.3 mining sequential patterns
5.3 mining sequential patterns5.3 mining sequential patterns
5.3 mining sequential patterns
 
Dimension Reduction Introduction & PCA.pptx
Dimension Reduction Introduction & PCA.pptxDimension Reduction Introduction & PCA.pptx
Dimension Reduction Introduction & PCA.pptx
 
Belief Networks & Bayesian Classification
Belief Networks & Bayesian ClassificationBelief Networks & Bayesian Classification
Belief Networks & Bayesian Classification
 
Machine learning clustering
Machine learning clusteringMachine learning clustering
Machine learning clustering
 
4.2 spatial data mining
4.2 spatial data mining4.2 spatial data mining
4.2 spatial data mining
 
BIM Data Mining Unit1 by Tekendra Nath Yogi
 BIM Data Mining Unit1 by Tekendra Nath Yogi BIM Data Mining Unit1 by Tekendra Nath Yogi
BIM Data Mining Unit1 by Tekendra Nath Yogi
 
Link Analysis
Link AnalysisLink Analysis
Link Analysis
 
Unsupervised learning
Unsupervised learningUnsupervised learning
Unsupervised learning
 
Artificial Neural Networks for Data Mining
Artificial Neural Networks for Data MiningArtificial Neural Networks for Data Mining
Artificial Neural Networks for Data Mining
 

En vedette

Data mining in social network
Data mining in social networkData mining in social network
Data mining in social networkakash_mishra
 
Trends In Graph Data Management And Mining
Trends In Graph Data Management And MiningTrends In Graph Data Management And Mining
Trends In Graph Data Management And MiningSrinath Srinivasa
 
Large Graph Mining – Patterns, tools and cascade analysis by Christos Faloutsos
Large Graph Mining – Patterns, tools and cascade analysis by Christos FaloutsosLarge Graph Mining – Patterns, tools and cascade analysis by Christos Faloutsos
Large Graph Mining – Patterns, tools and cascade analysis by Christos FaloutsosBigMine
 
Data Mining: Graph mining and social network analysis
Data Mining: Graph mining and social network analysisData Mining: Graph mining and social network analysis
Data Mining Seminar - Graph Mining and Social Network Analysis

  • 1. Graph Mining and Social Network Analysis Data Mining; EECS 4412 Darren Rolfe + Vince Chu 11.06.14
  • 2. Agenda Graph Mining Methods for Mining Frequent Subgraphs Apriori-based Approach: AGM, FSG Pattern-Growth Approach: gSpan Social Networks Analysis Properties and Features of Social Real Graphs Models of Graphs we can use Using those models to predict/other things
  • 3. Graph Mining Methods for Mining Frequent Subgraphs
  • 4. Why Mine Graphs? A lot of data today can be represented in the form of a graph Social: Friendship networks, social media networks, email and instant messaging networks, document citation networks, blogs Technological: Power grid, the internet Biological: Spread of virus/disease, protein/gene regulatory networks
  • 5. What Do We Need To Do Identify various kinds of graph patterns Frequent substructures are the very basic patterns that can be discovered in a collection of graphs, useful for: characterizing graph sets, discriminating different groups of graphs, classifying and clustering graphs, building graph indices, and facilitating similarity search in graph databases
  • 6. Mining Frequent Subgraphs Performed on a collection of graphs Notation: Vertex set of a graph 𝑔 by 𝑉(𝑔) Edge set of a graph 𝑔 by 𝐸(𝑔) A label function, 𝐿, maps a vertex or an edge to a label. A graph 𝑔 is a subgraph of another graph 𝑔’ if there exists a subgraph isomorphism from 𝑔 to 𝑔’. Given a labeled graph data set, 𝐷 = {𝐺1, 𝐺2, … , 𝐺𝑛}, we define 𝑠𝑢𝑝𝑝𝑜𝑟𝑡(𝑔) (or 𝑓𝑟𝑒𝑞𝑢𝑒𝑛𝑐𝑦(𝑔)) as the percentage (or number) of graphs in 𝐷 where 𝑔 is a subgraph. A frequent graph is a graph whose support is no less than a minimum support threshold, 𝑚𝑖𝑛_𝑠𝑢𝑝.
  • 7. Discovering Frequent Substructures Usually consists of two steps: 1. Generate frequent substructure candidates. 2. Check the frequency of each candidate. Most studies on frequent substructure discovery focus on the optimization of the first step, because the second step involves a subgraph isomorphism test whose computational complexity is excessively high (i.e., NP-complete).
  • 8. Graph Isomorphism An isomorphism of graphs 𝐺 and 𝐻 is a bijection between the vertex sets of 𝐺 and 𝐻, ƒ: 𝑉(𝐺) → 𝑉(𝐻), such that any two vertices 𝑢 and 𝑣 of 𝐺 are adjacent in 𝐺 if and only if ƒ(𝑢) and ƒ(𝑣) are adjacent in 𝐻. [Figure: two example graphs, Graph G and Graph H, drawn on vertices A, B, C, D, G, H, I, J]
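The definition on slide 8 can be checked directly for very small graphs by trying every bijection between the two vertex sets. The following is only a brute-force sketch, assuming simple undirected graphs; the function name `are_isomorphic` and the toy graphs are illustrative and do not come from the slides. Its factorial running time is one way to see why the slides treat isomorphism tests as the expensive step.

```python
from itertools import permutations

def are_isomorphic(vertices_g, edges_g, vertices_h, edges_h):
    """Brute-force test: try every bijection f: V(G) -> V(H)."""
    # Undirected edges are stored as frozensets so (u, v) equals (v, u).
    eg = {frozenset(e) for e in edges_g}
    eh = {frozenset(e) for e in edges_h}
    if len(vertices_g) != len(vertices_h) or len(eg) != len(eh):
        return False
    vg = list(vertices_g)
    for image in permutations(vertices_h):
        f = dict(zip(vg, image))  # one candidate bijection
        # With equal edge counts, mapping E(G) exactly onto E(H) means
        # u, v adjacent in G if and only if f(u), f(v) adjacent in H.
        if {frozenset((f[u], f[v])) for u, v in edges_g} == eh:
            return True
    return False

# Hypothetical toy graphs: two paths on three vertices, and a star versus a path.
path_abc = (["a", "b", "c"], [("a", "b"), ("b", "c")])
path_132 = ([1, 2, 3], [(1, 3), (3, 2)])
star = ([0, 1, 2, 3], [(0, 1), (0, 2), (0, 3)])
path4 = (["a", "b", "c", "d"], [("a", "b"), ("b", "c"), ("c", "d")])
```

Calling `are_isomorphic(*path_abc, *path_132)` compares the two paths, while `are_isomorphic(*star, *path4)` compares two graphs with the same number of vertices and edges but different structure.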
  • 9. Frequent Subgraphs: An Example 1. Start with a labelled graph data set. 2. Set a minimum support threshold for frequent graphs. 3. Generate frequent substructure candidates. 4. Check the frequency of each candidate. [Figure: labelled graph data set of four graphs, Graph 1 to Graph 4, with vertex labels A, B, C]
  • 10. Frequent Subgraphs: An Example Let the minimum support for this example be 50%. 1. Start with a labelled graph data set. 2. Set a minimum support threshold for frequent graphs. 3. Generate frequent substructure candidates. 4. Check the frequency of each candidate.
  • 11. Frequent Subgraphs: An Example 1. Start with a labelled graph data set. 2. Set a minimum support threshold for frequent graphs. 3. Generate frequent substructure candidates. 4. Check the frequency of each candidate. [Figure: candidate substructures at k = 1, k = 2, k = 3 and k = 4]
  • 12. Frequent Subgraphs: An Example 1. Start with a labelled graph data set. 2. Set a minimum support threshold for frequent graphs. 3. Generate frequent substructure candidates. 4. Check the frequency of each candidate. [Figure: candidate substructures at k = 1, k = 2, k = 3 and k = 4]
  • 13. Frequent Subgraphs: An Example 1. Start with a labelled graph data set. 2. Set a minimum support threshold for frequent graphs. 3. Generate frequent substructure candidates. 4. Check the frequency of each candidate. [Figure: Graph 1 to Graph 4 and a candidate substructure] k = 3, frequency: 3, support: 75%
  • 14. Frequent Subgraphs: An Example 1. Start with a labelled graph data set. 2. Set a minimum support threshold for frequent graphs. 3. Generate frequent substructure candidates. 4. Check the frequency of each candidate. [Figure: Graph 1 to Graph 4 and a candidate substructure] k = 4, frequency: 2, support: 50%
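The arithmetic behind the two results in the example can be written out as a short sketch. The function names below are illustrative assumptions; the numbers are the ones stated on the slides (four graphs, minimum support 50%). A candidate is frequent when its support is no less than the threshold, so the k = 4 candidate with exactly 50% still qualifies.

```python
def support(frequency, num_graphs):
    """Support as the fraction of graphs in D that contain the candidate."""
    return frequency / num_graphs

def is_frequent(frequency, num_graphs, min_sup):
    """Frequent means support is no less than min_sup (>=, not >)."""
    return support(frequency, num_graphs) >= min_sup

NUM_GRAPHS = 4   # Graph 1 to Graph 4
MIN_SUP = 0.5    # 50% minimum support from the example

k3 = support(3, NUM_GRAPHS)  # k = 3 candidate, found in 3 graphs
k4 = support(2, NUM_GRAPHS)  # k = 4 candidate, found in 2 graphs
print(k3, is_frequent(3, NUM_GRAPHS, MIN_SUP))
print(k4, is_frequent(2, NUM_GRAPHS, MIN_SUP))
```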
  • 15. Apriori-based Approach Apriori-based frequent substructure mining algorithms share similar characteristics with Apriori-based frequent itemset mining algorithms. The search for frequent graphs starts with graphs of small “size” (the definition of graph size depends on the algorithm used) and proceeds in a bottom-up manner by generating candidates having an extra vertex, edge, or path. The main design complexity of Apriori-based substructure mining algorithms is the candidate generation step. The candidate generation problem in frequent substructure mining is harder than that in frequent itemset mining, because there are many ways to join two substructures.
  • 16. Apriori-based Approach 1. Generate size-𝑘 frequent subgraph candidates by joining two similar but slightly different frequent subgraphs that were discovered in the previous call of the algorithm. 2. Check the frequency of each candidate. 3. Generate the size-(𝑘 + 1) frequent candidates. 4. Continue until the candidate set is empty.
  • 17. Algorithm: AprioriGraph Apriori-based Frequent Substructure Mining
    Input: 𝐷, a graph data set; 𝑚𝑖𝑛_𝑠𝑢𝑝, minimum support threshold
    Output: 𝑆𝑘, frequent substructure set
    Method: 𝑆1 ← frequent single-elements in 𝐷; call 𝐴𝑝𝑟𝑖𝑜𝑟𝑖𝐺𝑟𝑎𝑝ℎ(𝐷, 𝑚𝑖𝑛_𝑠𝑢𝑝, 𝑆1)
    procedure AprioriGraph(D, min_sup, Sk)
      Sk+1 ← ∅;
      foreach frequent gi ∈ Sk do
        foreach frequent gj ∈ Sk do
          foreach size (k+1) graph g formed by merge(gi, gj) do
            if g is frequent in D and g ∉ Sk+1 then
              insert g into Sk+1;
      if Sk+1 ≠ ∅ then
        AprioriGraph(D, min_sup, Sk+1);
      return;
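The level-wise loop of AprioriGraph can be sketched in runnable form under a strong simplifying assumption that is not in the slides: every vertex label is unique within a graph, so a graph is just a set of labelled edges and the subgraph test reduces to set containment instead of a subgraph isomorphism test. Graph size is counted in edges, the recursion is unrolled into a `while` loop, and all names and the toy data set are hypothetical.

```python
def edge(u, v):
    """Undirected edge between two uniquely labelled vertices."""
    return frozenset((u, v))

def support(pattern, dataset):
    """Fraction of graphs whose edge set contains every edge of the pattern."""
    return sum(pattern <= graph for graph in dataset) / len(dataset)

def is_connected(edges):
    """True if the edges form a single connected component."""
    edges = list(edges)
    seen = set(edges[0])
    changed = True
    while changed:
        changed = False
        for e in edges:
            if e & seen and not e <= seen:
                seen |= e
                changed = True
    return all(e <= seen for e in edges)

def apriori_graph(dataset, min_sup):
    """Join size-k frequent patterns into size-(k+1) candidates, level by level."""
    all_edges = set().union(*dataset)
    # S1: frequent single edges.
    level = {frozenset([e]) for e in all_edges
             if support(frozenset([e]), dataset) >= min_sup}
    frequent = []
    while level:                      # stop once S_{k+1} is empty
        frequent.extend(level)
        next_level = set()            # a set, so duplicates are dropped
        for gi in level:
            for gj in level:
                candidate = gi | gj   # stands in for merge(gi, gj)
                if (len(candidate) == len(gi) + 1
                        and is_connected(candidate)
                        and support(candidate, dataset) >= min_sup):
                    next_level.add(candidate)
        level = next_level
    return frequent

# Hypothetical toy data set (not the graphs from the slides).
AB, BC, CA = edge("A", "B"), edge("B", "C"), edge("C", "A")
dataset = [frozenset({AB, BC, CA}), frozenset({AB, BC}),
           frozenset({AB, CA}), frozenset({BC})]
result = apriori_graph(dataset, min_sup=0.5)
```

With repeated labels, `support` would need a real subgraph isomorphism test and `merge` could yield several candidates per pair, which is where the cost described on the slides comes from.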
  • 18. AGM - Apriori-based Graph Mining Vertex-based candidate generation method that increases the substructure size by one vertex at each iteration of AprioriGraph. Graph size 𝑘 is the number of vertices in the graph. Two size-k frequent graphs are joined only if they have the same size-(k−1) subgraph. The newly formed candidate includes the size-(k−1) subgraph in common and the additional two vertices from the two size-k patterns. Because it is undetermined whether there is an edge connecting the additional two vertices, we can actually form two substructures.
  • 19. AGM: An Example Two substructures joined by two chains. Graph size 𝑘 is the number of vertices in the graph. [Figure: two k = 4 substructures joined into k = 5 candidates; vertex labels A, B, C]
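The vertex-based join described for AGM can be illustrated with a small sketch. It assumes, purely for illustration, that vertex labels are unique, so the shared size-(k−1) subgraph can be found by intersecting vertex sets; the real AGM algorithm works on adjacency matrices and has to handle repeated labels. The function `agm_join` and the example patterns are hypothetical.

```python
def agm_join(p, q):
    """Join two size-k patterns (vertices, edges) that share a size-(k-1) subgraph.

    Returns two candidates with k+1 vertices: one without and one with an
    edge between the two vertices that are not in the shared subgraph.
    """
    vp, ep = p
    vq, eq = q
    common = vp & vq
    if len(vp) != len(vq) or len(common) != len(vp) - 1:
        return []
    # The subgraph induced on the common vertices must be the same in both.
    if {e for e in ep if e <= common} != {e for e in eq if e <= common}:
        return []
    (a,) = vp - common   # the extra vertex of p
    (b,) = vq - common   # the extra vertex of q
    vertices = vp | vq
    without_edge = (vertices, ep | eq)
    with_edge = (vertices, ep | eq | {frozenset((a, b))})
    return [without_edge, with_edge]

def E(u, v):
    """Undirected edge helper."""
    return frozenset((u, v))

# Two hypothetical size-3 patterns sharing the subgraph A - B.
p = (frozenset("ABC"), frozenset({E("A", "B"), E("B", "C")}))
q = (frozenset("ABD"), frozenset({E("A", "B"), E("B", "D")}))
candidates = agm_join(p, q)
```

The two returned candidates differ only in whether the edge C - D is present, which mirrors the statement that the join forms two substructures.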
  • 20. FSG – Frequent Subgraph Discovery Edge-based candidate generation strategy that increases the substructure size by one edge in each call of AprioriGraph. Graph size 𝑘 is the number of edges in the graph. Two size-k patterns are merged if and only if they share the same subgraph having k−1 edges, which is called the core. The newly formed candidate includes the core and the additional two edges from the size-k patterns.
  • 21. FSG: An Example Two substructure patterns and their potential candidates. Graph size 𝑘 is the number of edges in the graph. [Figure: two k = 4 patterns joined into k = 5 candidates; vertex labels A, B, C]
  • 22. FSG: Another Example Two substructure patterns and their potential candidates. Graph size 𝑘 is the number of edges in the graph. [Figure: k = 5 patterns and the k = 6 candidates formed from them; vertex labels A, B, C]
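The edge-based join can be sketched in the same spirit. This is a minimal illustration that assumes uniquely labelled vertices, so a pattern is a set of edges and the core is simply the intersection of two patterns; `fsg_join` and the sample patterns are hypothetical names. With repeated labels, as in the figures above, the same pair of patterns can produce several candidates, which this sketch does not model.

```python
def fsg_join(p, q):
    """Join two size-k edge sets that share a core of k-1 edges.

    Returns (core, candidate) where the candidate has k+1 edges,
    or None when the two patterns do not share such a core.
    """
    core = p & q
    if len(p) == len(q) and len(core) == len(p) - 1:
        return core, p | q   # core plus the two additional edges
    return None

def E(u, v):
    """Undirected edge helper."""
    return frozenset((u, v))

# Two hypothetical size-2 patterns whose core is the single edge A - B.
p = frozenset({E("A", "B"), E("B", "C")})
q = frozenset({E("A", "B"), E("A", "D")})
joined = fsg_join(p, q)
unrelated = fsg_join(p, frozenset({E("X", "Y"), E("Y", "Z")}))
```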
  • 23. Pitfall: Apriori-based Approach Generation of subgraph candidates is complicated and expensive. Level-wise candidate generation → breadth-first search. To determine whether a size-(k+1) graph is frequent, all corresponding size-k subgraphs must be checked to obtain the upper bound of its frequency. Before mining any size-(k+1) subgraph, complete mining of size-k subgraphs is required. Subgraph isomorphism is an NP-complete problem, so pruning is expensive.
  • 24. Pattern-Growth Approach 1. Initially, start with the frequent vertices as frequent graphs. 2. Extend these graphs by adding a new edge such that the newly formed graphs are frequent graphs. A graph 𝑔 can be extended by adding a new edge 𝑒; the newly formed graph is denoted by 𝑔 ◇𝑥 𝑒. If 𝑒 introduces a new vertex, we denote the new graph by 𝑔 ◇𝑥𝑓 𝑒, otherwise 𝑔 ◇𝑥𝑏 𝑒, where 𝑓 or 𝑏 indicates that the extension is in a forward or backward direction. 3. For each discovered graph 𝑔, perform extensions recursively until all the frequent graphs with 𝑔 embedded are discovered. 4. The recursion stops once no frequent graph can be generated.
  • 25. Algorithm: PatternGrowthGraph Simplistic Pattern Growth-based Frequent Substructure Mining
    Input: 𝑔, a frequent graph; 𝐷, a graph data set; 𝑚𝑖𝑛_𝑠𝑢𝑝, minimum support threshold
    Output: 𝑆, frequent graph set
    Method: 𝑆 ← ∅; call 𝑃𝑎𝑡𝑡𝑒𝑟𝑛𝐺𝑟𝑜𝑤𝑡ℎ𝐺𝑟𝑎𝑝ℎ(𝑔, 𝐷, 𝑚𝑖𝑛_𝑠𝑢𝑝, 𝑆)
    procedure PatternGrowthGraph(g, D, min_sup, S)
      if g ∈ S then return;
      else insert g into S;
      scan D once, find all edges e such that g can be extended to g ◇𝑥 e;
      foreach frequent g ◇𝑥 e do
        PatternGrowthGraph(g ◇𝑥 e, D, min_sup, S);
      return;
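A runnable sketch of PatternGrowthGraph is possible under the same simplifying assumption used for illustration elsewhere: vertex labels are unique, a graph is a set of edges, and containment replaces subgraph isomorphism. As a further assumption, the recursion is seeded with frequent single edges rather than single vertices. All names and the toy data set are hypothetical.

```python
def edge(u, v):
    """Undirected edge between two uniquely labelled vertices."""
    return frozenset((u, v))

def support(pattern, dataset):
    """Fraction of graphs whose edge set contains the pattern."""
    return sum(pattern <= graph for graph in dataset) / len(dataset)

def pattern_growth(g, dataset, min_sup, found):
    """Recursively extend frequent graph g by one edge at a time."""
    if g in found:       # duplicate graph: already reached via another path
        return
    found.add(g)
    vertices = set().union(*g)
    # Scan D once: edges that touch g and are not yet in it. An edge with one
    # new endpoint is a forward extension, otherwise a backward extension.
    extensions = {e for graph in dataset if g <= graph
                  for e in graph - g if e & vertices}
    for e in extensions:
        grown = g | {e}
        if support(grown, dataset) >= min_sup:
            pattern_growth(grown, dataset, min_sup, found)

# Hypothetical toy data set (not the graphs from the slides).
AB, BC, CA = edge("A", "B"), edge("B", "C"), edge("C", "A")
dataset = [frozenset({AB, BC, CA}), frozenset({AB, BC}),
           frozenset({AB, CA}), frozenset({BC})]

found = set()
for e in set().union(*dataset):
    seed = frozenset([e])
    if support(seed, dataset) >= 0.5:
        pattern_growth(seed, dataset, 0.5, found)
```

In this run the pattern containing A - B and B - C is reached both from the seed A - B and from the seed B - C; the second visit is cut off only by the `g in found` check, which is the duplicate-generation cost the next slide describes.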
  • 26. Pattern-Growth: An Example 1. Start with the frequent vertices as frequent graphs. 2. Extend these graphs by adding a new edge, forming new frequent graphs. 3. For each discovered graph g, recursively extend. 4. Stop once no frequent graph can be generated. [Figure: labelled graph data set of four graphs, Graph 1 to Graph 4, with vertex labels A, B, C]
  • 27. Pattern-Growth: An Example Let the minimum support for this example be 50%. 1. Start with the frequent vertices as frequent graphs. 2. Extend these graphs by adding a new edge, forming new frequent graphs. 3. For each discovered graph g, recursively extend. 4. Stop once no frequent graph can be generated.
  • 28. Pattern-Growth: An Example 1. Start with the frequent vertices as frequent graphs. 2. Extend these graphs by adding a new edge, forming new frequent graphs. 3. For each discovered graph g, recursively extend. 4. Stop once no frequent graph can be generated. [Figure: Graph 1 to Graph 4] Let’s arbitrarily start with this frequent vertex.
  • 29. Pattern-Growth: An Example 1. Start with the frequent vertices as frequent graphs. 2. Extend these graphs by adding a new edge, forming new frequent graphs. 3. For each discovered graph g, recursively extend. 4. Stop once no frequent graph can be generated. [Figure: Graph 1 to Graph 4] Extend graph (forward); add frequent edge.
  • 30. Pattern-Growth: An Example 1. Start with the frequent vertices as frequent graphs. 2. Extend these graphs by adding a new edge, forming new frequent graphs. 3. For each discovered graph g, recursively extend. 4. Stop once no frequent graph can be generated. [Figure: Graph 1 to Graph 4] Extend frequent graph (forward) again…
  • 31. Pattern-Growth: An Example 1. Start with the frequent vertices as frequent graphs. 2. Extend these graphs by adding a new edge, forming new frequent graphs. 3. For each discovered graph g, recursively extend. 4. Stop once no frequent graph can be generated. [Figure: Graph 1 to Graph 4] Extend graph (backward); previously seen node.
  • 32. Pattern-Growth: An Example 1. Start with the frequent vertices as frequent graphs. 2. Extend these graphs by adding a new edge, forming new frequent graphs. 3. For each discovered graph g, recursively extend. 4. Stop once no frequent graph can be generated. [Figure: Graph 1 to Graph 4] Extend frequent graph (forward) again…
  • 33. Pattern-Growth: An Example 1. Start with the frequent vertices as frequent graphs. 2. Extend these graphs by adding a new edge, forming new frequent graphs. 3. For each discovered graph g, recursively extend. 4. Stop once no frequent graph can be generated. [Figure: Graph 1 to Graph 4] Stop recursion, try different start vertex…
  • 34. Pitfall: PatternGrowthGraph Simple, but not efficient. The same graph can be discovered many times (duplicate graphs). Generating and detecting duplicate graphs increases the workload.
  • 35. gSpan (Graph-Based Substructure Pattern Mining) Designed to reduce the generation of duplicate graphs. Explores via depth-first search (DFS). DFS lexicographic order and minimum DFS code form a canonical labeling system to support DFS search. Discovers all the frequent subgraphs without candidate generation and false-positive pruning. It combines the growing and checking of frequent subgraphs into one procedure, thus accelerating the mining process.
  • 36. gSpan (Graph-Based Substructure Pattern Mining) DFS Subscripting When performing a DFS in a graph, we construct a DFS tree. One graph can have several different DFS trees. Depth-first discovery of the vertices forms a linear order. Use subscripts to label the vertices according to their discovery time: i < j means vi is discovered before vj. v0 is the root and vn is the rightmost vertex. The straight path from v0 to vn is called the rightmost path.
  • 37. gSpan (Graph-Based Substructure Pattern Mining) DFS Code We transform each subscripted graph to an edge sequence, called a DFS code, so that we can build an order among these sequences. The goal is to select the subscripting that generates the minimum sequence as its base subscripting. There are two kinds of orders in this transformation process: 1. Edge order, which maps edges in a subscripted graph into a sequence; and 2. Sequence order, which builds an order among edge sequences An edge is represented by a 5-tuple, (𝑖, 𝑗, 𝑙𝑖, 𝐼(𝑖,𝑗), 𝑙𝑗); 𝑙𝑖 and 𝑙𝑗 are the labels of 𝑣𝑖 and 𝑣𝑗, respectively, and 𝐼(𝑖,𝑗) is the label of the edge connecting them
  • 38. gSpan (Graph-Based Substructure Pattern Mining) DFS Lexicographic Order For each DFS tree, the edge tuples are arranged in sequence to form a DFS code, and the DFS codes are then ordered among themselves. Based on the DFS lexicographic ordering, the minimum DFS code of a given graph G, written as dfs(G), is the minimal one among all the DFS codes. The subscripting that generates the minimum DFS code is called the base subscripting. Given two graphs 𝐺 and 𝐺’, 𝐺 is isomorphic to 𝐺’ if and only if 𝑑𝑓𝑠(𝐺) = 𝑑𝑓𝑠(𝐺’). Based on this property, mining frequent subgraphs only requires performing right-most extensions on the minimum DFS codes, since such extensions guarantee the completeness of the mining results.
  • 39. DFS Code: An Example DFS Subscripting When performing a DFS in a graph, we construct a DFS tree. One graph can have several different DFS trees. [Figure: a graph with vertex labels X, X, Z, Y and edge labels a, a, b, b, repeated with different DFS subscriptings v0, v1, v2, v3]
  • 40. DFS Lexicographic Order: An Example For each DFS tree, we sort the DFS code (tuples) to a set of orderings. Based on the DFS lexicographic ordering, the minimum DFS code of a given graph G, written as dfs(G), is the minimal one among all the DFS codes. [Figure: three DFS subscriptings of the same graph with vertex labels X, X, Z, Y and edge labels a, a, b, b]
      edge   γ0                 γ1                 γ2
      e0     (0, 1, X, a, X)    (0, 1, X, a, X)    (0, 1, Y, b, X)
      e1     (1, 2, X, a, Z)    (1, 2, X, b, Y)    (1, 2, X, a, X)
      e2     (2, 0, Z, b, X)    (1, 3, X, a, Z)    (2, 3, X, b, Z)
      e3     (1, 3, X, b, Y)    (3, 0, Z, b, X)    (3, 1, Z, a, X)
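The three DFS codes in this example can be written down as lists of 5-tuples and compared. The sketch below uses Python's built-in tuple comparison as a stand-in for the DFS lexicographic order. That is a simplifying assumption: the full gSpan order has extra rules for comparing forward and backward edges with different (i, j), which are not implemented here. For these three codes the first differing edges share the same (i, j), so only the labels decide and the plain comparison gives the same answer. The names `GAMMA0`, `GAMMA1`, `GAMMA2` and `min_code_simplified` are illustrative.

```python
# Each DFS code is a list of 5-tuples (i, j, l_i, l_(i,j), l_j),
# copied from the example on the slide.
GAMMA0 = [(0, 1, "X", "a", "X"), (1, 2, "X", "a", "Z"),
          (2, 0, "Z", "b", "X"), (1, 3, "X", "b", "Y")]
GAMMA1 = [(0, 1, "X", "a", "X"), (1, 2, "X", "b", "Y"),
          (1, 3, "X", "a", "Z"), (3, 0, "Z", "b", "X")]
GAMMA2 = [(0, 1, "Y", "b", "X"), (1, 2, "X", "a", "X"),
          (2, 3, "X", "b", "Z"), (3, 1, "Z", "a", "X")]


def min_code_simplified(codes):
    """Smallest code under plain tuple-by-tuple comparison.

    Assumption: this is NOT the complete DFS lexicographic order;
    it only agrees with it when the first differing edges have the
    same (i, j), as is the case for the three codes above.
    """
    return min(codes)


ordered = sorted([GAMMA2, GAMMA0, GAMMA1])
print(ordered == [GAMMA0, GAMMA1, GAMMA2])  # True
print(min_code_simplified([GAMMA1, GAMMA2, GAMMA0]) == GAMMA0)  # True
```

Under this comparison γ0 comes out smallest, so its subscripting would be the base subscripting for the example graph.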
  • 41. gSpan (Graph-Based Substructure Pattern Mining) 1. Initially, a starting vertex is randomly chosen. 2. Vertices in a graph are marked so that we can tell which vertices have been visited. 3. The visited vertex set is expanded repeatedly until a full DFS tree is built. 4. Given a graph G and a DFS tree T in G, a new edge e can be added between the right-most vertex and another vertex on the right-most path (backward extension), or can introduce a new vertex and connect it to a vertex on the right-most path (forward extension). Because both kinds of extension take place on the right-most path, we call them right-most extensions, denoted by g ◇_r e.
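Since both kinds of extension start from the right-most path, it helps to see how that path can be read off a DFS code. The following is a minimal sketch, assuming a DFS code stored as a list of 5-tuples (i, j, l_i, l_(i,j), l_j) in which forward edges have i < j. The function name `rightmost_path` is hypothetical; `GAMMA0` and `GAMMA2` are the codes γ0 and γ2 from the DFS code example.

```python
GAMMA0 = [(0, 1, "X", "a", "X"), (1, 2, "X", "a", "Z"),
          (2, 0, "Z", "b", "X"), (1, 3, "X", "b", "Y")]
GAMMA2 = [(0, 1, "Y", "b", "X"), (1, 2, "X", "a", "X"),
          (2, 3, "X", "b", "Z"), (3, 1, "Z", "a", "X")]


def rightmost_path(dfs_code):
    """Return the vertex subscripts on the right-most path, root first."""
    parent = {}
    for i, j, *_labels in dfs_code:
        if i < j:          # forward edge: v_j was discovered from v_i
            parent[j] = i
    # The right-most vertex is the last one discovered (largest subscript).
    v = max(parent) if parent else 0
    path = [v]
    while v in parent:     # walk up the DFS tree to the root v_0
        v = parent[v]
        path.append(v)
    return path[::-1]


print(rightmost_path(GAMMA0))  # [0, 1, 3]
print(rightmost_path(GAMMA2))  # [0, 1, 2, 3]
```

A backward extension would connect the last vertex of this list to an earlier one on it, and a forward extension would attach a new vertex to any vertex on it.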
  • 42. Algorithm: gSpan Pattern growth-based frequent substructure mining that reduces duplicate graph generation.
    Input: 𝑠, a DFS code; 𝐷, a graph data set; 𝑚𝑖𝑛_𝑠𝑢𝑝, the minimum support threshold.
    Output: 𝑆, the frequent graph set.
    Method: 𝑆 ← ∅; call 𝑔𝑆𝑝𝑎𝑛(𝑠, 𝐷, 𝑚𝑖𝑛_𝑠𝑢𝑝, 𝑆).
    procedure gSpan(s, D, min_sup, S)
      if s ≠ dfs(s) then return;
      insert s into S;
      set C to ∅;
      scan D once, find all edges e such that s can be right-most extended to s ◇_r e;
      insert s ◇_r e into C and count its frequency;
      foreach frequent s ◇_r e in C do
        gSpan(s ◇_r e, D, min_sup, S);
      return;
  • 43. Other Graph Mining So far, the techniques we have discussed can handle only one special kind of graph: labeled, undirected, connected simple graphs without any specific constraints. They assume that the database to be mined contains a set of graphs, each consisting of a set of labeled vertices and labeled but undirected edges, with no other constraints.
  • 44. Other Graph Mining Mining Variant and Constrained Substructure Patterns Closed frequent substructure: a frequent graph G is closed if and only if there is no proper supergraph G′ that has the same support as G. Maximal frequent substructure: a frequent pattern G is maximal if and only if there is no frequent super-pattern of G. Constraint-based substructure mining: element, set, or subgraph containment constraint; geometric constraint; value-sum constraint.
  • 45. Application: Classification We mine frequent graph patterns in the training set. The features that are frequent in one class but rather infrequent in the other class(es) should be considered highly discriminative features and are used for model construction. To achieve high-quality classification, we can adjust the thresholds on frequency, discriminativeness, and graph connectivity, based on the data, the number and quality of the features generated, and the classification accuracy.
  • 46. Application: Cluster analysis We mine frequent graph patterns in the training set. The set of graphs that share a large set of similar graph patterns should be considered as highly similar and should be grouped into similar clusters. The minimal support threshold can be used as a way to adjust the number of frequent clusters or generate hierarchical clusters.
  • 48. Examples of Social Networks Twitter network http://willchernoff.com/ Email Network https://wiki.cs.umd.edu Air Transportation Network www.mccormick.northwestern.edu
  • 49. Social Network Analysis Nodes often represent an object or entity such as a person, computer/server, power generator, airport, etc. Links represent relationships, e.g. ‘likes’, ‘follows’, ‘flies to’, etc. http://www.liacs.nl/~erwin/dbdm2009/GraphMining.pdf
  • 50. Why are we interested? It turns out that the structure of real-world graphs often has special characteristics. This is important because structure always affects function, e.g. the structure of a social network affects how a rumour, or an infectious disease, spreads, and the structure of a power grid determines how robust the network is to power failures. Goal: 1. Identify the characteristics / properties of graphs, both structural and dynamic / behavioural. 2. Generate models of graphs that exhibit these characteristics. 3. Use these tools to make predictions about the behaviour of graphs.
  • 51. Properties of Real-World Social Graphs 1. Degree Distribution Plot the fraction of nodes with degree k (denoted p_k) vs. k. Our intuition: Poisson/Normal distribution. WRONG! mathworld.wolfram.com Correct: highly skewed. http://cs.stanford.edu/people/jure/talks/www08tutorial/
  • 52. Properties of Real-World Social Graphs 1. (continued) Real-world social networks tend to have a highly skewed degree distribution that follows a power law: p_k ~ k^(-a). A small percentage of nodes have very high degree, i.e. are highly connected. Example: spread of a virus. [Figure legend: black squares = infected; pink = infected but not contagious; green = exposed but not infected]
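To make p_k concrete, the degree distribution can be computed directly from an edge list. This is a minimal sketch: the function name `degree_distribution` and the toy edge list `EDGES` (one hub plus a few low-degree nodes) are hypothetical, and nodes with no edges at all are not counted because they never appear in an edge list.

```python
from collections import Counter


def degree_distribution(edges):
    """Return {k: p_k}, the fraction of nodes that have degree k."""
    degree = Counter()
    for u, v in edges:       # undirected edge: both endpoints gain a degree
        degree[u] += 1
        degree[v] += 1
    n = len(degree)
    counts = Counter(degree.values())
    return {k: c / n for k, c in sorted(counts.items())}


# Hypothetical toy graph: one hub linked to six nodes, plus one extra edge.
EDGES = [("hub", x) for x in "abcdef"] + [("a", "b")]

for k, p in degree_distribution(EDGES).items():
    print(k, round(p, 3))
```

Even in this tiny graph most nodes have degree 1 or 2 and a single node has degree 6, which is the skewed shape described above. On real data one would plot p_k against k on log-log axes to look for a roughly straight line.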
  • 53. Properties of Real-World Social Graphs 2. Small World Effect: for most real graphs, the number of hops it takes to reach any node from any other node is about 6 (Six Degrees of Separation). Milgram ran an experiment in which people in Nebraska were asked to send letters to people in Boston. Constraint: letters could only be passed on to people known on a first-name basis. Only 25% of the letters reached their target, but those that did took about 6 hops.
  • 54. Properties of Real-World Social Graphs 2. (continued) The distribution of the shortest path lengths. Example: MSN Messenger Network. If we pick a random node in the network and count how many hops it is from every other node, we get the distribution of shortest path lengths. Most nodes are about 7 hops away from any other node. http://cs.stanford.edu/people/jure/talks/www08tutorial/
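The hop-count measurement described here is a breadth-first search from the chosen node. Below is a small sketch, assuming the graph is stored as an adjacency-list dictionary; `hop_distribution` and the toy graph `ADJ` are illustrative names and data, not the MSN Messenger network.

```python
from collections import Counter, deque


def hop_distribution(adj, source):
    """Count how many nodes lie at each hop distance from source (BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:          # first visit gives the shortest distance
                dist[v] = dist[u] + 1
                queue.append(v)
    return Counter(d for node, d in dist.items() if node != source)


# Hypothetical toy graph: path 0-1-2-3 with an extra branch 1-4.
ADJ = {0: [1], 1: [0, 2, 4], 2: [1, 3], 3: [2], 4: [1]}

print(dict(hop_distribution(ADJ, 0)))  # {1: 1, 2: 2, 3: 1}
```

Repeating this from many randomly chosen source nodes and averaging the counts gives an estimate of the shortest-path-length distribution for the whole network.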
  • 55. Properties of Real-World Social Graphs 3. Network Resilience If a node is removed, how is the network affected? For real-world graphs, you must remove the highly connected nodes in order to reduce the connectivity of the graph. Removing a node that is sparsely connected does not have a significant effect on connectivity. Since the proportion of highly connected nodes in a real-world graph is small, the probability of choosing and removing such a node at random is small → real-world graphs are resilient to random attacks! → conversely, targeted attacks on highly connected nodes are very effective!
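One way to see the difference between random and targeted removal is to measure the size of the largest connected component before and after deleting a node. This sketch uses that measure as a stand-in for "connectivity", which is an assumption; the slides do not name a specific metric. The function `largest_component` and the toy graph `HUB_GRAPH` are hypothetical.

```python
def largest_component(adj, removed=()):
    """Size of the largest connected component after deleting `removed`."""
    seen = set(removed)                # removed nodes are never visited
    best = 0
    for start in adj:
        if start in seen:
            continue
        stack, size = [start], 0
        seen.add(start)
        while stack:                   # depth-first traversal of one component
            u = stack.pop()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        best = max(best, size)
    return best


# Hypothetical toy graph: hub 0 linked to nodes 1..5, plus the edge 1-2.
HUB_GRAPH = {0: [1, 2, 3, 4, 5], 1: [0, 2], 2: [0, 1],
             3: [0], 4: [0], 5: [0]}

print(largest_component(HUB_GRAPH))               # 6
print(largest_component(HUB_GRAPH, removed=[5]))  # 5
print(largest_component(HUB_GRAPH, removed=[0]))  # 2
```

Removing a sparsely connected node barely changes the largest component, while removing the hub breaks the toy graph into small pieces.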
  • 56. Properties of Real-World Social Graphs 4. Densification How does the number of edges in the graph grow as the number of nodes grows? Previous belief: # edges grows linearly with # nodes, i.e. E(t) ~ N(t). Actually, # edges grows superlinearly with # nodes, i.e. faster than the number of nodes: E(t) ~ N(t)^a with a > 1. The graph gets denser over time.
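If E(t) ~ N(t)^a, then a is the slope of log E against log N, so a rough estimate can be made from two snapshots of the same network. The sketch below assumes exactly that two-point estimate; the function name `densification_exponent` and the snapshot numbers are made up for illustration and are not measurements from any real network.

```python
import math


def densification_exponent(n1, e1, n2, e2):
    """Estimate a in E(t) ~ N(t)^a from two snapshots (log-log slope)."""
    return math.log(e2 / e1) / math.log(n2 / n1)


# Hypothetical snapshots: nodes grow 100x while edges grow 1000x.
a = densification_exponent(n1=100, e1=1_000, n2=10_000, e2=1_000_000)
print(round(a, 3))  # 1.5
```

A value of a above 1 means the average degree rises as the network grows. With more than two snapshots, a least-squares fit on the log-log points would be the more robust choice.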
  • 57. Properties of Real-World Social Graphs 5. Shrinking Diameter The diameter is taken to be the longest shortest path in the graph. As a network grows, the diameter actually gets smaller, i.e. the distance between nodes slowly decreases.
  • 58. Features/Properties of Graphs Community structure Densification Shrinking diameter
  • 59. Generators: How do we model graphs? Try: generating a random graph. Given n vertices, connect each pair independently with probability p. The degree distribution is approximately Poisson for large n, which matches our intuition. Not useful: no community structure; does not mirror real-world graphs.
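The random graph described here takes only a few lines to generate. This is a minimal sketch using the standard library; the function name `random_graph` and the `seed` parameter are illustrative choices.

```python
import random
from itertools import combinations


def random_graph(n, p, seed=None):
    """Connect each of the n*(n-1)/2 vertex pairs independently with prob. p."""
    rng = random.Random(seed)
    return [(u, v) for u, v in combinations(range(n), 2)
            if rng.random() < p]


edges = random_graph(n=1000, p=0.01, seed=42)
# The expected number of edges is p * n * (n - 1) / 2 = 4995.
print(len(edges))
```

Every node has the same expected degree p(n - 1), so very high-degree hubs are extremely unlikely, which is one reason this model does not reproduce the skewed degree distributions seen in real networks.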
  • 60. Generators: How do we model graphs? (Erdős-Rényi) random graphs (1960s); exponential random graphs; small-world model; preferential attachment; edge copying model; community guided attachment; Forest Fire; Kronecker graphs (today)
  • 61. Kronecker Graphs For Kronecker graphs, all the properties of real-world graphs can actually be proven. Best model we have today. Based on the adjacency matrix and recursive generation.
  • 62. Kronecker Graphs 1. Construct the adjacency matrix for a graph G: A(G) = (A_ij), where A_ij = 1 if i and j are adjacent and A_ij = 0 otherwise. Side note: an eigenvalue of a matrix A is a scalar λ for which Av = λv holds for some nonzero vector v (an eigenvector of A).
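  • 63. Kronecker Graphs 2. Generate the 2nd Kronecker graph by taking the Kronecker product of the 1st graph with itself. The Kronecker product of 2 graphs is defined through their adjacency matrices: for matrices A (n × m) and B, A ⊗ B is the block matrix whose (i, j) block is a_ij B.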
  • 64. Kronecker Graphs Visually, this is just taking the first matrix and replacing each entry equal to 1 with a copy of the second matrix (and each 0 with a block of zeros). [Figure: a 3 x 3 matrix and the resulting 9 x 9 matrix]
  • 65. Kronecker Graphs We define the Kronecker product of two graphs as the Kronecker product of their adjacency matrices. Therefore, we can compute the kth Kronecker graph by iteratively taking the Kronecker product of an initial graph G_1 with itself: G_k = G_1 ⊗ G_1 ⊗ … ⊗ G_1 (k factors).
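The recursive construction can be sketched without any libraries by representing adjacency matrices as lists of lists. The helper names `kron` and `kronecker_power` are hypothetical, and the 3 x 3 initiator `G1` below is an assumed example, not necessarily the matrix shown in the slide figure.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of lists."""
    # Row (i, p) of the result is row i of a with every entry a[i][j]
    # replaced by a[i][j] times row p of b.
    return [[x * y for x in row_a for y in row_b]
            for row_a in a for row_b in b]


def kronecker_power(g1, k):
    """k-th Kronecker graph: the product of k copies of g1."""
    result = g1
    for _ in range(k - 1):
        result = kron(result, g1)
    return result


# Assumed 3 x 3 initiator adjacency matrix (self-loops included).
G1 = [[1, 1, 0],
      [1, 1, 1],
      [0, 1, 1]]

G2 = kronecker_power(G1, 2)
print(len(G2), len(G2[0]))        # 9 9
print(sum(map(sum, G2)))          # 49, i.e. 7 * 7 nonzero entries
```

Each step multiplies the number of nodes by 3 and the number of nonzero entries by 7 for this initiator, so the edge count grows faster than the node count.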
  • 66. Applying Models to Real World Graphs Can then predict and understand the structure
  • 67. Virus Propagation A form of diffusion; a fundamental process in social networks Can also refer to spread of rumours, news, etc
  • 68. Virus Propagation SIS Model: Susceptible - Infected - Susceptible. Virus birth rate β = the probability that an infected node attacks a neighbour. Virus death rate δ = the probability that an infected node becomes cured. [Figure: an infected node infects each at-risk neighbour with probability β and heals with probability δ; nodes are marked as healthy, at risk, or infected]
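A single round of the SIS process can be simulated as follows. This is a sketch under assumptions that the slides do not spell out: updates are synchronous, each infected node gets one infection attempt per susceptible neighbour per round, and a cured node becomes susceptible again immediately. The function name `sis_step` and the toy graph `PATH` are hypothetical.

```python
import random


def sis_step(adj, infected, beta, delta, rng):
    """One synchronous SIS round; returns the new set of infected nodes."""
    newly_infected = set()
    for u in infected:
        for v in adj[u]:
            # Each infected node attacks each susceptible neighbour
            # with probability beta.
            if v not in infected and rng.random() < beta:
                newly_infected.add(v)
    # Each infected node is cured with probability delta.
    still_infected = {u for u in infected if rng.random() >= delta}
    return still_infected | newly_infected


# Hypothetical toy graph: the path 0 - 1 - 2.
PATH = {0: [1], 1: [0, 2], 2: [1]}

rng = random.Random(0)
infected = {0}
for _ in range(5):
    infected = sis_step(PATH, infected, beta=0.5, delta=0.2, rng=rng)
print(sorted(infected))
```

Running many rounds for different values of β and δ shows whether the infection dies out or persists on a given graph.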
  • 69. Virus Propagation The virus strength is s = β/δ. The epidemic threshold τ of a graph is a value such that if s = β/δ < τ, then an epidemic cannot happen. So we can ask: Will the virus become epidemic? Will the rumours/news become viral? How do we find the threshold τ? Theorem: τ = 1/λ_{1,A}, where λ_{1,A} is the largest eigenvalue of the adjacency matrix A of the graph. So if s < τ, there is no epidemic.
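The threshold test can be tried on a small graph by approximating the largest eigenvalue with power iteration. The sketch below is an illustration with assumptions: it expects a symmetric 0/1 adjacency matrix with at least one edge, and it iterates on A + I rather than A (then subtracts 1) so that the iteration also settles on bipartite graphs. The names `largest_eigenvalue`, `below_epidemic_threshold`, and the star graph `STAR` are hypothetical.

```python
def _shifted_multiply(adj_matrix, v):
    """Compute (A + I) v for a square matrix stored as lists of lists."""
    n = len(adj_matrix)
    return [v[i] + sum(adj_matrix[i][j] * v[j] for j in range(n))
            for i in range(n)]


def largest_eigenvalue(adj_matrix, iterations=200):
    """Approximate the largest eigenvalue of a symmetric adjacency matrix."""
    v = [1.0] * len(adj_matrix)
    for _ in range(iterations):
        w = _shifted_multiply(adj_matrix, v)
        norm = max(abs(x) for x in w)
        v = [x / norm for x in w]
    mv = _shifted_multiply(adj_matrix, v)
    rayleigh = sum(a * b for a, b in zip(mv, v)) / sum(x * x for x in v)
    return rayleigh - 1.0  # undo the +I shift


def below_epidemic_threshold(adj_matrix, beta, delta):
    """True if s = beta/delta is below tau = 1/lambda_1 (no epidemic)."""
    tau = 1.0 / largest_eigenvalue(adj_matrix)
    return beta / delta < tau


# Hypothetical star graph: node 0 linked to nodes 1..4 (lambda_1 = 2).
STAR = [[0, 1, 1, 1, 1],
        [1, 0, 0, 0, 0],
        [1, 0, 0, 0, 0],
        [1, 0, 0, 0, 0],
        [1, 0, 0, 0, 0]]

print(round(largest_eigenvalue(STAR), 6))                     # 2.0
print(below_epidemic_threshold(STAR, beta=0.1, delta=0.5))    # True
print(below_epidemic_threshold(STAR, beta=0.4, delta=0.5))    # False
```

For the star graph τ = 0.5, so a virus with s = 0.2 dies out while one with s = 0.8 is above the threshold. For large graphs a numerical library's eigenvalue solver would be the practical choice.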
  • 70. Link Prediction Given a social network at time t1, predict the edges that will be added by time t2. Assign a connection score(x,y) to each pair of nodes. The score is usually based on the length of the shortest path between x and y (shorter paths score higher); other measures use the number of neighbours in common or the Katz measure. Produce a list of pairs ranked by decreasing score. The pairs at the top of the list are the most likely to have a link created between them in the future. These scores can also be used for clustering.
  • 71. Link Prediction Score(x,y) = # of neighbours in common. Top score = score(B,C) = 5. Likely new link between B and C. [Figure: example network with nodes A to J]
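The common-neighbours score is easy to compute when each node's neighbours are stored as a set. The sketch below ranks every pair that is not yet linked. The function name `common_neighbour_scores` and the toy network `NETWORK` are hypothetical; the network is much smaller than the one in the slide figure, which cannot be recovered from the transcript.

```python
from itertools import combinations


def common_neighbour_scores(adj):
    """Rank all non-adjacent pairs by their number of shared neighbours."""
    scores = {}
    for x, y in combinations(sorted(adj), 2):
        if y not in adj[x]:                    # only score missing links
            scores[(x, y)] = len(adj[x] & adj[y])
    return sorted(scores.items(), key=lambda item: -item[1])


# Hypothetical toy network (not the one in the slide figure).
NETWORK = {
    "A": {"B", "C"},
    "B": {"A", "D", "E"},
    "C": {"A", "D", "E"},
    "D": {"B", "C"},
    "E": {"B", "C"},
}

ranked = common_neighbour_scores(NETWORK)
print(ranked[0])  # (('B', 'C'), 3)
```

Here B and C share three neighbours, more than any other unlinked pair, so a link between them is the top prediction.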
  • 72. Viral Marketing A customer may increase the sales of some product if they interact positively with their peers in the social network Assign a network value to a customer
  • 73. Diffusion in Networks: Influential Nodes Some nodes in the network can be active: they can spread their influence to other nodes, e.g. news, opinions, etc. that propagate through a network of friends. 2 models: Threshold model, Independent Contagion model