SlideShare une entreprise Scribd logo
1  sur  31
Simplicial closure and
higher-order link
predictionAustin R. Benson · Cornell
Linear Algebra and Network Science
SIAM Annual · July 11, 2018
1
bit.ly/sc-holp-code bit.ly/arb-SIAMAN18
bit.ly/sc-holp-data arXiv 1802.06916
Joint work with
Rediet Abebe & Jon Kleinberg (Cornell)
Michael Schaub & Ali Jadbabaie (MIT)
Networks are sets of nodes and edges (graphs)
that model real-world systems.
2
Collaboration
nodes are
people/groups
edges link entities
working together
Communications
nodes are
people/accounts
edges show info.
exchange
Physical proximity
nodes are people/animals
edges link those that
interact in close proximity
Drug compounds
nodes are substances
edge between substances
that appear in the same drug
Real-world systems are composed of “higher-
order” interactions that we often reduce to
pairwise ones.
3
Collaboration
nodes are
people/groups
teams are made up of
small groups
Communications
nodes are
people/accounts
emails often have several
recipients, not just one
Physical proximity
nodes are people/animals
people often gather in
small groups
Drug compounds
nodes are substances
drugs are made up of
several substances
There are many ways to mathematically represent
the higher-order structure present in relational
data.
4
• Hypergraphs [Berge 89]
• Set systems [Frankl 95]
• Affiliation networks [Feld 81, Newman-Watts-Strogatz 02]
• Multipartite networks [Lambiotte-Ausloos 05, Lind-Herrmann 07]
• Abstract simplicial complexes [Lim 15, Osting-Palande-Wang 17]
• Multilayer networks [Kivelä+ 14, Boccaletti+ 14, many others…]
• Meta-paths [Sun-Han 12]
• Motif-based representations [Benson-Gleich-Leskovec 15, 17]
• …
Data representation is not the problem.
The problem is how do we evaluate and compare models and algorithms?
Link prediction is a classical machine learning
problem in network science that is used to
evaluate models.
5
We observe data which is a list of edges in a graph up to some point t.
We want to predict which new edges will form in the future.
Shows up in a variety of applications
• Predicting new social relationships and friend recommendation.
[Backstrom-Leskovec 11; Wang+ 15]
• Inferring new links between genes and diseases.
[Wang-Gulbahce-Yu 11; Moreau-Tranchevent 12]
• Suggesting novel connections in the scientific community.
[Liben-Nowell-Kleinberg 07; Tang-Wu-Sun-Su 12]
Link prediction is also used as a framework to compare
[Liben-Nowell-Kleinberg 03, 07; Lü-Zhau 11]
We propose “higher-order link prediction” as a
similar framework for evaluation of higher-order
models.
6
Data.
Observe simplices up to some
time t. Using this data, want to
predict what groups of > 2
nodes will appear in a simplex
in the future.
t
1
2
3
4
5
6
7
8
9
We predict structure that classical link
prediction would not even consider!
Possible applications
• Novel combinations of drugs for treatments.
• Group chat recommendation in social networks.
• Team formation.
We collected many datasets of timestamped
simplices, where each simplex is a subset of
nodes.
7
1. Coauthorship in different domains.
2. Emails with multiple recipients.
3. Tags on Q&A forums.
4. Threads on Q&A forums.
5. Contact/proximity measurements.
6. Musical artist collaboration.
7. Substance makeup and classification
codes applied to drugs the FDA
examines.
8. U.S. Congress committee
memberships and bill sponsorship.
9. Combinations of drugs seen in
patients in ER visits. https://math.stackexchange.com/q/80181
bit.ly/sc-holp-data
Thinking of higher-order data as a weighted
projected graph with “filled-in” structures is a
convenient viewpoint.
8
1
2
3
4
5
6
7
8
9
Data. Pictures to have in mind.
9
5
23
16
20
10
i
j k
i
j k
Warm-up. What’s more common in data?
or
“Open triangle”
each pair has been in a
simplex together but all 3
nodes have never been in the
same simplex
“Closed triangle”
there is some simplex that
contains all 3 nodes
There is lots of variation in the fraction of triangles
that are open, but datasets from the same domain
are similar.
11See also [Patania-Petri-Vaccarino 17] for fraction of open triangles on arxiv collaboration networks.
Open problem.What is a generative model that
captures the diversity in these two features?
Dataset domain separation also occurs at the local
level.
12
• Randomly sample 100 egonets per dataset and measure log
of average degree and fraction of open triangles.
• Logistic regression model to predict domain
(coauthorship, tags, threads, email, contact)
• 75% model accuracy vs. 21% with random guessing.
13
Two remaining parts of this talk.
1. Simplicial closure.
How do new simplices (e.g., closed triangles) appear?
2. Higher-order link prediction.
How can we predict which new simplices appear?
Algorithms boil down to some fun linear algebra!
Groups of nodes go through trajectories until
finally reaching a “simplicial closure event.”
14
1
2
3
4
5
6
7
8
9
1
2 6
1
2 6
1
2 6
1
2 6
1
2 6
{ 1, 2, 3, 4}
t1
{ 1, 6}
t3
{ 2, 6}
t4
{ 1,2,6}
t8
For this talk, we will just look at simplicial closure on 3 nodes.
Groups of nodes go through trajectories until
finally reaching a “simplicial closure event.”
15
Substances in marketed drugs recorded in the National Drug Code
directory.HIV protease
inhibitors
UGT1A1
inhibitors
Breast cancer
resistanceprotein inhibitors
1
2+
2+
1
2+
1
1
2+
2+
1
2+
2+
2+
Reyataz
RedPharm
2003
Reyataz
Squibb & Sons
2003
Kaletra
Physicians
Total Care
2006
Promacta
GSK (25mg)
2008
Promacta
GSK (50mg)
2008
Kaletra
DOH Central
Pharmacy
2009
Evotaz
Squibb & Sons
2015
We bin weighted edges into “weak” and
“strong” ties in the projected graph W.
• Weak ties. Wij = 1 (one simplex contains i and j)
• Strong ties. Wij > 2 (at least two simplices contain i and j)
1
2+1
1
2+
1
1
1
1
2+
2+
2+
1
1
2+
2+
1
2+
2+
2+
245,996
74,219
14,541
7,575
5,781
773
3,560
952
445
389
618
285
2,732,839
157,236
66,644
7,987
8,844
328
3,171
779
722
285
We can also study temporal dynamics in
aggregate.
16
Coauthorship data of scholars publishing in
history. Nodes in most new closed
triangles have had no previous
interaction.
Open triangle of all weak
ties more likely to form a
strong tie before closing.
Simplicial closure depends on structure in projected
graph.
17
• First 80% of the data (in time) ⟶ record configurations of triplets not in closed triangle.
• Remainder of data ⟶ find fraction that are now closed triangles.
Increased edge density
increases closure
probability.
Increased tie strength
increases closure
probability.
Tension between
edge density and tie
strength.
Left and middle observations are consistent with theory and empirical studies of social
networks. [Granovetter 73; Leskovec+ 08; Backstrom+ 06; Kossinets-Watts 06]
Closure probability Closure probability Closure probability
18
Two remaining parts of this talk.
1. Simplicial closure.
How do new simplices (e.g., closed triangles) appear?
2. Higher-order link prediction.
How can we predict which new simplices appear?
Algorithms boil down to some fun linear algebra!
We propose “higher-order link prediction” as a
similar framework for evaluation of higher-order
models.
19
Data.
Observe simplices up to some
time t. Using this data, want to
predict what groups of > 2
nodes will appear in a simplex
in the future.
t
1
2
3
4
5
6
7
8
9
We predict structure that classical link
prediction would not even consider!
20
Our structural analysis tells us what we should be
looking at for prediction.
1. Edge density is a positive indicator.
⟶ focus our attention on predicting which open
triangles become closed triangles.
2. Tie strength is a positive indicator.
⟶ various ways of incorporating this information
i
j k
Wij
Wjk
Wjk
21
For every open triangle, we assign a score
function on first 80% of data based on structural
properties.
Four broad classes of score functions for
an open triangle.
Score s(i, j, k)…
1. is a function of Wij, Wjk, Wjk
2. is a function of A[:, [i, j, k]]
3. is built from “whole-network”
similarity scores on edges
4. is learned from data
i
j k
Wij
Wjk
Wjk
After computing scores, predict that open triangles with highest
scores will be closed triangles in final 20% of data.
22
1. s(i, j, k) is a function of Wij , Wjk , and Wjk
i
j k
Wij
Wjk
Wjk
1. Arithmetic mean
2. Geometric mean
3. Harmonic mean
4. Generalized mean
Open problem. Fast way to find top-k scores? Without forming W?
23
2. s(i, j, k) is a function of is a function of A[:, [i, j,
k]]
i
j k
Wij
Wjk
Wjk
1. Number of common fourth neighbors
2. Generalized Jaccard coefficient
3. Preferential attachment
Open problem. Fast way to find top-k scores (or just
sample)?
24
3. s(i, j, k) is is built from “whole-network”
similarity scores on edges: s(i, j, k) = Sij + Sji + Sjk +
Skj + Sik + Ski
i
j k
Wij
Wjk
Wjk
1. PageRank (unweighted or
weighted)
2. Katz (unweighted or weighted)
Hadamard product with A is just reducing storage costs.
25
How do we actually solve the systems?
Want to reduce
computational cost
(only need scores on
open triangles).
For each node i that participates in an open triangle
1. Solve
2. Store ith column
using iterative solver with low tolerance
Open software problem. We would like block iterative solvers in Julia!
26
4. s(i, j, k) is learned from data.
1. Split data into training and validation sets.
2. Compute features of (i, j, k) from previous ideas using training data.
3. Throw features + validation labels into machine learning blender
→ learn model.
4. Re-compute features on combined training + validation
→ apply model on the data.
27
28
A few lessons learned from applying all of these
ideas.
1. We can predict pretty well on all datasets using some method.
→ 4x to 107x better than random w/r/t mean average
precision
depending on the dataset/method
2. Thread co-participation and co-tagging on stack exchange are
consistently easy to predict.
3. Simply averaging Wij, Wjk, and Wik consistently performs well.
i
j k
Wij
Wjk
Wjk
Generalized means of edge weights are good
predictors of new closed triangles.
29
Good performance from this local information is a deviation from classical link prediction,
where methods that use long paths (e.g., PageRank) perform well [Liben-Nowell & Kleinberg
07].
For structures on k nodes, the subsets of size k-1 contain rich information only when k > 2.
i
j k
Wij
Wjk
Wjk
i
j k
?
There are lots of opportunities in studying this
data.
30
1. Higher-order data is pervasive!
We have ways to represent data, and higher-order link prediction is a
general framework for comparing comparing models and methods.
2. There is rich static and temporal structure in the datasets we
collected.
3. We only looked at 3-node patterns.
Computation becomes much more challenging for 4-node patterns.
4. Do you have a better model, algorithm, or data representation?
Great! Show us by out-performing these baselines.
Data. bit.ly/sc-holp-data
Code. bit.ly/sc-holp-code
Simplicial closure and higher-order link prediction.
31
Paper. arXiv 1802.06916
Data. bit.ly/sc-holp-data
Code. bit.ly/sc-holp-code
Slides. bit.ly/arb-SIAMAN18
Austin R. Benson
http://cs.cornell.edu/~arb
@austinbenson
arb@cs.cornell.edu
THANKS!
i
j k

Contenu connexe

Tendances

Link Prediction in (Partially) Aligned Heterogeneous Social Networks
Link Prediction in (Partially) Aligned Heterogeneous Social NetworksLink Prediction in (Partially) Aligned Heterogeneous Social Networks
Link Prediction in (Partially) Aligned Heterogeneous Social Networks
Sina Sajadmanesh
 
Active Image Clustering: Seeking Constraints from Humans to Complement Algori...
Active Image Clustering: Seeking Constraints from Humans to Complement Algori...Active Image Clustering: Seeking Constraints from Humans to Complement Algori...
Active Image Clustering: Seeking Constraints from Humans to Complement Algori...
Harish Vaidyanathan
 
Probabilistic Relational Models for Link Prediction Problem
Probabilistic Relational Models for Link Prediction ProblemProbabilistic Relational Models for Link Prediction Problem
Probabilistic Relational Models for Link Prediction Problem
Sina Sajadmanesh
 
O N T HE D ISTRIBUTION OF T HE M AXIMAL C LIQUE S IZE F OR T HE V ERTICES IN ...
O N T HE D ISTRIBUTION OF T HE M AXIMAL C LIQUE S IZE F OR T HE V ERTICES IN ...O N T HE D ISTRIBUTION OF T HE M AXIMAL C LIQUE S IZE F OR T HE V ERTICES IN ...
O N T HE D ISTRIBUTION OF T HE M AXIMAL C LIQUE S IZE F OR T HE V ERTICES IN ...
csandit
 
Thesis (presentation)
Thesis (presentation)Thesis (presentation)
Thesis (presentation)
nlt2390
 

Tendances (20)

Simplicial closure and higher-order link prediction LA/OPT
Simplicial closure and higher-order link prediction LA/OPTSimplicial closure and higher-order link prediction LA/OPT
Simplicial closure and higher-order link prediction LA/OPT
 
Computational Frameworks for Higher-order Network Data Analysis
Computational Frameworks for Higher-order Network Data AnalysisComputational Frameworks for Higher-order Network Data Analysis
Computational Frameworks for Higher-order Network Data Analysis
 
Thesis2006
Thesis2006Thesis2006
Thesis2006
 
Higher-order link prediction and other hypergraph modeling
Higher-order link prediction and other hypergraph modelingHigher-order link prediction and other hypergraph modeling
Higher-order link prediction and other hypergraph modeling
 
Calculation of the Minimum Computational Complexity Based on Information Entropy
Calculation of the Minimum Computational Complexity Based on Information EntropyCalculation of the Minimum Computational Complexity Based on Information Entropy
Calculation of the Minimum Computational Complexity Based on Information Entropy
 
EVOLUTIONARY CENTRALITY AND MAXIMAL CLIQUES IN MOBILE SOCIAL NETWORKS
EVOLUTIONARY CENTRALITY AND MAXIMAL CLIQUES IN MOBILE SOCIAL NETWORKSEVOLUTIONARY CENTRALITY AND MAXIMAL CLIQUES IN MOBILE SOCIAL NETWORKS
EVOLUTIONARY CENTRALITY AND MAXIMAL CLIQUES IN MOBILE SOCIAL NETWORKS
 
Link Prediction in (Partially) Aligned Heterogeneous Social Networks
Link Prediction in (Partially) Aligned Heterogeneous Social NetworksLink Prediction in (Partially) Aligned Heterogeneous Social Networks
Link Prediction in (Partially) Aligned Heterogeneous Social Networks
 
Active Image Clustering: Seeking Constraints from Humans to Complement Algori...
Active Image Clustering: Seeking Constraints from Humans to Complement Algori...Active Image Clustering: Seeking Constraints from Humans to Complement Algori...
Active Image Clustering: Seeking Constraints from Humans to Complement Algori...
 
EFFICIENT KNOWLEDGE BASE MANAGEMENT IN DCSP
EFFICIENT KNOWLEDGE BASE MANAGEMENT IN DCSP EFFICIENT KNOWLEDGE BASE MANAGEMENT IN DCSP
EFFICIENT KNOWLEDGE BASE MANAGEMENT IN DCSP
 
17 Statistical Models for Networks
17 Statistical Models for Networks17 Statistical Models for Networks
17 Statistical Models for Networks
 
Probabilistic Relational Models for Link Prediction Problem
Probabilistic Relational Models for Link Prediction ProblemProbabilistic Relational Models for Link Prediction Problem
Probabilistic Relational Models for Link Prediction Problem
 
O NTOLOGY B ASED D OCUMENT C LUSTERING U SING M AP R EDUCE
O NTOLOGY B ASED D OCUMENT C LUSTERING U SING M AP R EDUCE O NTOLOGY B ASED D OCUMENT C LUSTERING U SING M AP R EDUCE
O NTOLOGY B ASED D OCUMENT C LUSTERING U SING M AP R EDUCE
 
O N T HE D ISTRIBUTION OF T HE M AXIMAL C LIQUE S IZE F OR T HE V ERTICES IN ...
O N T HE D ISTRIBUTION OF T HE M AXIMAL C LIQUE S IZE F OR T HE V ERTICES IN ...O N T HE D ISTRIBUTION OF T HE M AXIMAL C LIQUE S IZE F OR T HE V ERTICES IN ...
O N T HE D ISTRIBUTION OF T HE M AXIMAL C LIQUE S IZE F OR T HE V ERTICES IN ...
 
08 Statistical Models for Nets I, cross-section
08 Statistical Models for Nets I, cross-section08 Statistical Models for Nets I, cross-section
08 Statistical Models for Nets I, cross-section
 
Thesis (presentation)
Thesis (presentation)Thesis (presentation)
Thesis (presentation)
 
graph_embeddings
graph_embeddingsgraph_embeddings
graph_embeddings
 
A SECURE DIGITAL SIGNATURE SCHEME WITH FAULT TOLERANCE BASED ON THE IMPROVED ...
A SECURE DIGITAL SIGNATURE SCHEME WITH FAULT TOLERANCE BASED ON THE IMPROVED ...A SECURE DIGITAL SIGNATURE SCHEME WITH FAULT TOLERANCE BASED ON THE IMPROVED ...
A SECURE DIGITAL SIGNATURE SCHEME WITH FAULT TOLERANCE BASED ON THE IMPROVED ...
 
Reasoning Over Knowledge Base
Reasoning Over Knowledge BaseReasoning Over Knowledge Base
Reasoning Over Knowledge Base
 
Simplicial closure and simplicial diffusions
Simplicial closure and simplicial diffusionsSimplicial closure and simplicial diffusions
Simplicial closure and simplicial diffusions
 
Event-Driven, Client-Server Archetypes for E-Commerce
Event-Driven, Client-Server Archetypes for E-CommerceEvent-Driven, Client-Server Archetypes for E-Commerce
Event-Driven, Client-Server Archetypes for E-Commerce
 

Similaire à Simplicial closure and higher-order link prediction

ICPSR - Complex Systems Models in the Social Sciences - Lecture 3 - Professor...
ICPSR - Complex Systems Models in the Social Sciences - Lecture 3 - Professor...ICPSR - Complex Systems Models in the Social Sciences - Lecture 3 - Professor...
ICPSR - Complex Systems Models in the Social Sciences - Lecture 3 - Professor...
Daniel Katz
 
Distribution of maximal clique size of the
Distribution of maximal clique size of theDistribution of maximal clique size of the
Distribution of maximal clique size of the
IJCNCJournal
 
Clique-based Network Clustering
Clique-based Network ClusteringClique-based Network Clustering
Clique-based Network Clustering
Guang Ouyang
 

Similaire à Simplicial closure and higher-order link prediction (20)

Simplicial closure & higher-order link prediction
Simplicial closure & higher-order link predictionSimplicial closure & higher-order link prediction
Simplicial closure & higher-order link prediction
 
Simplicial closure and higher-order link prediction --- SIAMNS18
Simplicial closure and higher-order link prediction --- SIAMNS18Simplicial closure and higher-order link prediction --- SIAMNS18
Simplicial closure and higher-order link prediction --- SIAMNS18
 
Community detection
Community detectionCommunity detection
Community detection
 
ICPSR - Complex Systems Models in the Social Sciences - Lecture 3 - Professor...
ICPSR - Complex Systems Models in the Social Sciences - Lecture 3 - Professor...ICPSR - Complex Systems Models in the Social Sciences - Lecture 3 - Professor...
ICPSR - Complex Systems Models in the Social Sciences - Lecture 3 - Professor...
 
Distribution of maximal clique size of the
Distribution of maximal clique size of theDistribution of maximal clique size of the
Distribution of maximal clique size of the
 
Simplicial closure & higher-order link prediction
Simplicial closure & higher-order link predictionSimplicial closure & higher-order link prediction
Simplicial closure & higher-order link prediction
 
Higher-order clustering coefficients
Higher-order clustering coefficientsHigher-order clustering coefficients
Higher-order clustering coefficients
 
Higher-order clustering coefficients at Purdue CSoI
Higher-order clustering coefficients at Purdue CSoIHigher-order clustering coefficients at Purdue CSoI
Higher-order clustering coefficients at Purdue CSoI
 
Socialnetworkanalysis (Tin180 Com)
Socialnetworkanalysis (Tin180 Com)Socialnetworkanalysis (Tin180 Com)
Socialnetworkanalysis (Tin180 Com)
 
Engineering Data Science Objectives for Social Network Analysis
Engineering Data Science Objectives for Social Network AnalysisEngineering Data Science Objectives for Social Network Analysis
Engineering Data Science Objectives for Social Network Analysis
 
Community detection in social networks[1]
Community detection in social networks[1]Community detection in social networks[1]
Community detection in social networks[1]
 
Clique-based Network Clustering
Clique-based Network ClusteringClique-based Network Clustering
Clique-based Network Clustering
 
MODELING SOCIAL GAUSS-MARKOV MOBILITY FOR OPPORTUNISTIC NETWORK
MODELING SOCIAL GAUSS-MARKOV MOBILITY FOR OPPORTUNISTIC NETWORK MODELING SOCIAL GAUSS-MARKOV MOBILITY FOR OPPORTUNISTIC NETWORK
MODELING SOCIAL GAUSS-MARKOV MOBILITY FOR OPPORTUNISTIC NETWORK
 
Jürgens diata12-communities
Jürgens diata12-communitiesJürgens diata12-communities
Jürgens diata12-communities
 
Higher-order spectral graph clustering with motifs
Higher-order spectral graph clustering with motifsHigher-order spectral graph clustering with motifs
Higher-order spectral graph clustering with motifs
 
Opposite Opinions
Opposite OpinionsOpposite Opinions
Opposite Opinions
 
LCF: A Temporal Approach to Link Prediction in Dynamic Social Networks
 LCF: A Temporal Approach to Link Prediction in Dynamic Social Networks LCF: A Temporal Approach to Link Prediction in Dynamic Social Networks
LCF: A Temporal Approach to Link Prediction in Dynamic Social Networks
 
Medicinal Applications of Quantum Computing
Medicinal Applications of Quantum ComputingMedicinal Applications of Quantum Computing
Medicinal Applications of Quantum Computing
 
Graph Analysis Beyond Linear Algebra
Graph Analysis Beyond Linear AlgebraGraph Analysis Beyond Linear Algebra
Graph Analysis Beyond Linear Algebra
 
Topology ppt
Topology pptTopology ppt
Topology ppt
 

Plus de Austin Benson

Plus de Austin Benson (15)

Hypergraph Cuts with General Splitting Functions (JMM)
Hypergraph Cuts with General Splitting Functions (JMM)Hypergraph Cuts with General Splitting Functions (JMM)
Hypergraph Cuts with General Splitting Functions (JMM)
 
Spectral embeddings and evolving networks
Spectral embeddings and evolving networksSpectral embeddings and evolving networks
Spectral embeddings and evolving networks
 
Hypergraph Cuts with General Splitting Functions
Hypergraph Cuts with General Splitting FunctionsHypergraph Cuts with General Splitting Functions
Hypergraph Cuts with General Splitting Functions
 
Hypergraph Cuts with General Splitting Functions
Hypergraph Cuts with General Splitting FunctionsHypergraph Cuts with General Splitting Functions
Hypergraph Cuts with General Splitting Functions
 
Semi-supervised learning of edge flows
Semi-supervised learning of edge flowsSemi-supervised learning of edge flows
Semi-supervised learning of edge flows
 
Choosing to grow a graph
Choosing to grow a graphChoosing to grow a graph
Choosing to grow a graph
 
Random spatial network models for core-periphery structure
Random spatial network models for core-periphery structureRandom spatial network models for core-periphery structure
Random spatial network models for core-periphery structure
 
Random spatial network models for core-periphery structure.
Random spatial network models for core-periphery structure.Random spatial network models for core-periphery structure.
Random spatial network models for core-periphery structure.
 
Sampling methods for counting temporal motifs
Sampling methods for counting temporal motifsSampling methods for counting temporal motifs
Sampling methods for counting temporal motifs
 
Higher-order clustering in networks
Higher-order clustering in networksHigher-order clustering in networks
Higher-order clustering in networks
 
New perspectives on measuring network clustering
New perspectives on measuring network clusteringNew perspectives on measuring network clustering
New perspectives on measuring network clustering
 
Higher-order spectral graph clustering with motifs
Higher-order spectral graph clustering with motifsHigher-order spectral graph clustering with motifs
Higher-order spectral graph clustering with motifs
 
Tensor Eigenvectors and Stochastic Processes
Tensor Eigenvectors and Stochastic ProcessesTensor Eigenvectors and Stochastic Processes
Tensor Eigenvectors and Stochastic Processes
 
Spacey random walks CMStatistics 2017
Spacey random walks CMStatistics 2017Spacey random walks CMStatistics 2017
Spacey random walks CMStatistics 2017
 
Spacey random walks CAM Colloquium
Spacey random walks CAM ColloquiumSpacey random walks CAM Colloquium
Spacey random walks CAM Colloquium
 

Dernier

Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
chadhar227
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
nirzagarg
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
nirzagarg
 
Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1
ranjankumarbehera14
 
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
gajnagarg
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
nirzagarg
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
vexqp
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
nirzagarg
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
gajnagarg
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
HyderabadDolls
 
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
wsppdmt
 

Dernier (20)

Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
 
Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
 
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Dubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls DubaiDubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls Dubai
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt
 
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
 
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
 
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangePredicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
 

Simplicial closure and higher-order link prediction

  • 1. Simplicial closure and higher-order link predictionAustin R. Benson · Cornell Linear Algebra and Network Science SIAM Annual · July 11, 2018 1 bit.ly/sc-holp-code bit.ly/arb-SIAMAN18 bit.ly/sc-holp-data arXiv 1802.06916 Joint work with Rediet Abebe & Jon Kleinberg (Cornell) Michael Schaub & Ali Jadbabaie (MIT)
  • 2. Networks are sets of nodes and edges (graphs) that model real-world systems. 2 Collaboration nodes are people/groups edges link entities working together Communications nodes are people/accounts edges show info. exchange Physical proximity nodes are people/animals edges link those that interact in close proximity Drug compounds nodes are substances edge between substances that appear in the same drug
  • 3. Real-world systems are composed of “higher- order” interactions that we often reduce to pairwise ones. 3 Collaboration nodes are people/groups teams are made up of small groups Communications nodes are people/accounts emails often have several recipients, not just one Physical proximity nodes are people/animals people often gather in small groups Drug compounds nodes are substances drugs are made up of several substances
  • 4. There are many ways to mathematically represent the higher-order structure present in relational data. 4 • Hypergraphs [Berge 89] • Set systems [Frankl 95] • Affiliation networks [Feld 81, Newman-Watts-Strogatz 02] • Multipartite networks [Lambiotte-Ausloos 05, Lind-Herrmann 07] • Abstract simplicial complexes [Lim 15, Osting-Palande-Wang 17] • Multilayer networks [Kivelä+ 14, Boccaletti+ 14, many others…] • Meta-paths [Sun-Han 12] • Motif-based representations [Benson-Gleich-Leskovec 15, 17] • … Data representation is not the problem. The problem is how do we evaluate and compare models and algorithms?
  • 5. Link prediction is a classical machine learning problem in network science that is used to evaluate models. 5 We observe data which is a list of edges in a graph up to some point t. We want to predict which new edges will form in the future. Shows up in a variety of applications • Predicting new social relationships and friend recommendation. [Backstrom-Leskovec 11; Wang+ 15] • Inferring new links between genes and diseases. [Wang-Gulbahce-Yu 11; Moreau-Tranchevent 12] • Suggesting novel connections in the scientific community. [Liben-Nowell-Kleinberg 07; Tang-Wu-Sun-Su 12] Link prediction is also used as a framework to compare [Liben-Nowell-Kleinberg 03, 07; Lü-Zhau 11]
  • 6. We propose “higher-order link prediction” as a similar framework for evaluation of higher-order models. 6 Data. Observe simplices up to some time t. Using this data, want to predict what groups of > 2 nodes will appear in a simplex in the future. t 1 2 3 4 5 6 7 8 9 We predict structure that classical link prediction would not even consider! Possible applications • Novel combinations of drugs for treatments. • Group chat recommendation in social networks. • Team formation.
  • 7. We collected many datasets of timestamped simplices, where each simplex is a subset of nodes. 7 1. Coauthorship in different domains. 2. Emails with multiple recipients. 3. Tags on Q&A forums. 4. Threads on Q&A forums. 5. Contact/proximity measurements. 6. Musical artist collaboration. 7. Substance makeup and classification codes applied to drugs the FDA examines. 8. U.S. Congress committee memberships and bill sponsorship. 9. Combinations of drugs seen in patients in ER visits. https://math.stackexchange.com/q/80181 bit.ly/sc-holp-data
  • 8. Thinking of higher-order data as a weighted projected graph with “filled-in” structures is a convenient viewpoint. 8 1 2 3 4 5 6 7 8 9 Data. Pictures to have in mind.
  • 10. 10 i j k i j k Warm-up. What’s more common in data? or “Open triangle” each pair has been in a simplex together but all 3 nodes have never been in the same simplex “Closed triangle” there is some simplex that contains all 3 nodes
  • 11. There is lots of variation in the fraction of triangles that are open, but datasets from the same domain are similar. 11See also [Patania-Petri-Vaccarino 17] for fraction of open triangles on arxiv collaboration networks. Open problem.What is a generative model that captures the diversity in these two features?
  • 12. Dataset domain separation also occurs at the local level. 12 • Randomly sample 100 egonets per dataset and measure log of average degree and fraction of open triangles. • Logistic regression model to predict domain (coauthorship, tags, threads, email, contact) • 75% model accuracy vs. 21% with random guessing.
  • 13. 13 Two remaining parts of this talk. 1. Simplicial closure. How do new simplices (e.g., closed triangles) appear? 2. Higher-order link prediction. How can we predict which new simplices appear? Algorithms boil down to some fun linear algebra!
  • 14. Groups of nodes go through trajectories until finally reaching a “simplicial closure event.” 14 1 2 3 4 5 6 7 8 9 1 2 6 1 2 6 1 2 6 1 2 6 1 2 6 { 1, 2, 3, 4} t1 { 1, 6} t3 { 2, 6} t4 { 1,2,6} t8 For this talk, we will just look at simplicial closure on 3 nodes.
  • 15. Groups of nodes go through trajectories until finally reaching a “simplicial closure event.” 15 Substances in marketed drugs recorded in the National Drug Code directory.HIV protease inhibitors UGT1A1 inhibitors Breast cancer resistanceprotein inhibitors 1 2+ 2+ 1 2+ 1 1 2+ 2+ 1 2+ 2+ 2+ Reyataz RedPharm 2003 Reyataz Squibb & Sons 2003 Kaletra Physicians Total Care 2006 Promacta GSK (25mg) 2008 Promacta GSK (50mg) 2008 Kaletra DOH Central Pharmacy 2009 Evotaz Squibb & Sons 2015 We bin weighted edges into “weak” and “strong” ties in the projected graph W. • Weak ties. Wij = 1 (one simplex contains i and j) • Strong ties. Wij > 2 (at least two simplices contain i and j)
  • 16. 1 2+1 1 2+ 1 1 1 1 2+ 2+ 2+ 1 1 2+ 2+ 1 2+ 2+ 2+ 245,996 74,219 14,541 7,575 5,781 773 3,560 952 445 389 618 285 2,732,839 157,236 66,644 7,987 8,844 328 3,171 779 722 285 We can also study temporal dynamics in aggregate. 16 Coauthorship data of scholars publishing in history. Nodes in most new closed triangles have had no previous interaction. Open triangle of all weak ties more likely to form a strong tie before closing.
  • 17. Simplicial closure depends on structure in projected graph. 17 • First 80% of the data (in time) ⟶ record configurations of triplets not in closed triangle. • Remainder of data ⟶ find fraction that are now closed triangles. Increased edge density increases closure probability. Increased tie strength increases closure probability. Tension between edge density and tie strength. Left and middle observations are consistent with theory and empirical studies of social networks. [Granovetter 73; Leskovec+ 08; Backstrom+ 06; Kossinets-Watts 06] Closure probability Closure probability Closure probability
  • 18. 18 Two remaining parts of this talk. 1. Simplicial closure. How do new simplices (e.g., closed triangles) appear? 2. Higher-order link prediction. How can we predict which new simplices appear? Algorithms boil down to some fun linear algebra!
  • 19. We propose “higher-order link prediction” as a similar framework for evaluation of higher-order models. 19 Data. Observe simplices up to some time t. Using this data, want to predict what groups of > 2 nodes will appear in a simplex in the future. t 1 2 3 4 5 6 7 8 9 We predict structure that classical link prediction would not even consider!
  • 20. 20 Our structural analysis tells us what we should be looking at for prediction. 1. Edge density is a positive indicator. ⟶ focus our attention on predicting which open triangles become closed triangles. 2. Tie strength is a positive indicator. ⟶ various ways of incorporating this information i j k Wij Wjk Wjk
  • 21. 21 For every open triangle, we assign a score function on first 80% of data based on structural properties. Four broad classes of score functions for an open triangle. Score s(i, j, k)… 1. is a function of Wij, Wjk, Wjk 2. is a function of A[:, [i, j, k]] 3. is built from “whole-network” similarity scores on edges 4. is learned from data i j k Wij Wjk Wjk After computing scores, predict that open triangles with highest scores will be closed triangles in final 20% of data.
  • 22. 22 1. s(i, j, k) is a function of Wij , Wjk , and Wjk i j k Wij Wjk Wjk 1. Arithmetic mean 2. Geometric mean 3. Harmonic mean 4. Generalized mean Open problem. Fast way to find top-k scores? Without forming W?
  • 23. 23 2. s(i, j, k) is a function of is a function of A[:, [i, j, k]] i j k Wij Wjk Wjk 1. Number of common fourth neighbors 2. Generalized Jaccard coefficient 3. Preferential attachment Open problem. Fast way to find top-k scores (or just sample)?
  • 24. 24 3. s(i, j, k) is is built from “whole-network” similarity scores on edges: s(i, j, k) = Sij + Sji + Sjk + Skj + Sik + Ski i j k Wij Wjk Wjk 1. PageRank (unweighted or weighted) 2. Katz (unweighted or weighted) Hadamard product with A is just reducing storage costs.
  • 25. 25 How do we actually solve the systems? Want to reduce computational cost (only need scores on open triangles). For each node i that participates in an open triangle 1. Solve 2. Store ith column using iterative solver with low tolerance Open software problem. We would like block iterative solvers in Julia!
  • 26. 26 4. s(i, j, k) is learned from data. 1. Split data into training and validation sets. 2. Compute features of (i, j, k) from previous ideas using training data. 3. Throw features + validation labels into machine learning blender → learn model. 4. Re-compute features on combined training + validation → apply model on the data.
  • 27. 27
  • 28. 28 A few lessons learned from applying all of these ideas. 1. We can predict pretty well on all datasets using some method. → 4x to 107x better than random w/r/t mean average precision depending on the dataset/method 2. Thread co-participation and co-tagging on stack exchange are consistently easy to predict. 3. Simply averaging Wij, Wjk, and Wik consistently performs well. i j k Wij Wjk Wjk
  • 29. Generalized means of edge weights are good predictors of new closed triangles. 29 Good performance from this local information is a deviation from classical link prediction, where methods that use long paths (e.g., PageRank) perform well [Liben-Nowell & Kleinberg 07]. For structures on k nodes, the subsets of size k-1 contain rich information only when k > 2. i j k Wij Wjk Wjk i j k ?
  • 30. There are lots of opportunities in studying this data. 30 1. Higher-order data is pervasive! We have ways to represent data, and higher-order link prediction is a general framework for comparing comparing models and methods. 2. There is rich static and temporal structure in the datasets we collected. 3. We only looked at 3-node patterns. Computation becomes much more challenging for 4-node patterns. 4. Do you have a better model, algorithm, or data representation? Great! Show us by out-performing these baselines. Data. bit.ly/sc-holp-data Code. bit.ly/sc-holp-code
  • 31. Simplicial closure and higher-order link prediction. 31 Paper. arXiv 1802.06916 Data. bit.ly/sc-holp-data Code. bit.ly/sc-holp-code Slides. bit.ly/arb-SIAMAN18 Austin R. Benson http://cs.cornell.edu/~arb @austinbenson arb@cs.cornell.edu THANKS! i j k