One of the most practical ways to improve our machine learning predictions right away is to use graph algorithms for connected feature extraction. We’ll quickly dive into creating a machine learning pipeline, with tips on training and evaluating a model for link prediction, integrating Neo4j and Spark in our workflow. We’ll look at an example that uses several models to predict future collaborations and shows measurable improvements from graph-based features.
Speaker: Amy Hodler
5. Relationships Are Often
the Strongest Predictors of Behavior
“Increasingly we're learning that you can make
better predictions about people by getting all the
information from their friends and their friends’
friends than you can from the information you
have about the person themselves”
10. Features for ML:
Feature Extraction
Feature Extraction is how we change the shape or format of
the data to be usable in a machine learning pipeline. For example,
from a graph, we extract the relevant subset of the data into a
tabular format for model building.
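The extraction step can be sketched in plain Python: a graph held as an adjacency dict is flattened into tabular rows an ML library can consume. The toy graph, pair list, and column names here are illustrative only; in the real pipeline this comes out of Neo4j via a Cypher query.

```python
# Minimal sketch: extract a tabular feature set from a graph stored as an
# adjacency dict (node -> set of neighbors). Toy data for illustration.
graph = {
    1: {2, 3, 4},
    2: {1, 3},
    3: {1, 2, 4},
    4: {1, 3},
}

def extract_rows(graph, pairs):
    """Turn (node, node) pairs into flat rows usable for model building."""
    rows = []
    for u, v in pairs:
        rows.append({
            "node1": u,
            "node2": v,
            "degree1": len(graph[u]),  # simple per-node feature
            "degree2": len(graph[v]),
        })
    return rows

rows = extract_rows(graph, [(1, 2), (2, 4)])
```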
11. Features for ML:
Feature Engineering
Feature Engineering is how we combine and process the data to
create new, more meaningful features, such as clustering or
connectivity metrics.
• Influence
• Connectivity
• Communities
• Relationships
12. Features for ML:
Feature Selection
Feature Selection is how we reduce the number of features used
in a model to a relevant subset. This can be done algorithmically or
based on domain expertise, but the objective is to maximize the
predictive power of your model while minimizing overfitting.
13. Stop Throwing Away Data You Already Have
[Diagram: data flowing through a Machine Learning Pipeline, turning Decisions into Better Decisions]
15. Can we infer which new interactions are likely to occur
in the future?
16. #UnifiedAnalytics #SparkAISummit
+ 50 years of biomedical
data integrated in a
knowledge graph
Predicting new uses for
drugs by using the graph
structure to create features
for link prediction
het.io
18. Link Prediction Methods
Algorithm Measures
• Run targeted algorithms and score outcomes
• Set a threshold value used to predict a link between nodes
Machine Learning
• Use the measures as features to train an ML model
[Diagram: related algorithm categories: Community Detection, Similarity, Link Prediction]

1st Node | 2nd Node | Common Neighbors | Preferential Attachment | label
1        | 2        | 4                | 15                      | 1
3        | 4        | 7                | 12                      | 1
5        | 6        | 1                | 1                       | 0
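Measures like Common Neighbors and Preferential Attachment can be computed directly from an adjacency structure. A minimal sketch (toy graph; the values are unrelated to the dataset above):

```python
# Two classic link-prediction measures over a graph stored as an
# adjacency dict (node -> set of neighbors). Toy graph for illustration.
graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"a", "c"},
}

def common_neighbors(g, u, v):
    # number of neighbors shared by u and v (a triadic-closure signal)
    return len(g[u] & g[v])

def preferential_attachment(g, u, v):
    # product of degrees: well-connected nodes tend to gain more links
    return len(g[u]) * len(g[v])
```

Either score can then feed a threshold rule directly, or become one column of the feature table used to train the model.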
20. Predicting Collaboration with a
Graph Enhanced ML Model
• Citation Network Dataset - Research Dataset
– Used a subset with 52K papers, 80K authors, 140K author
relationships and 29K citation relationships
– “ArnetMiner: Extraction and Mining of Academic Social
Networks”, by J. Tang et al
• Neo4j
– Create a co-authorship graph and perform connected feature engineering
• Spark and MLlib
– Train and test our model using a random forest classifier
21. Our Link Prediction Workflow
Import Data (extract data & store as graph)
→ Create Co-Author Graph
→ Explore, Clean, Modify (identified sparse feature areas)
→ Prepare for Machine Learning
  • Feature Engineering: new graphy features
  • Train / Test Split
  • Resample: downsampled for proportional representation
→ Train Models (Model Selection: Random Forest, an ensemble method)
→ Evaluate Results (Precision, Accuracy, Recall; ROC Curve & AUC)
→ Productionize
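The resampling step of the workflow can be sketched as a simple downsample of the majority class so that both labels are proportionally represented. This is an illustration in plain Python; the real pipeline would do the equivalent on Spark DataFrames.

```python
import random

def downsample(rows, label_key="label", seed=42):
    """Downsample the majority class so both classes have equal counts."""
    pos = [r for r in rows if r[label_key] == 1]
    neg = [r for r in rows if r[label_key] == 0]
    big, small = (pos, neg) if len(pos) > len(neg) else (neg, pos)
    rng = random.Random(seed)  # fixed seed for a reproducible sample
    return small + rng.sample(big, len(small))

# 90 negative examples vs. 10 positive: heavily imbalanced
rows = [{"label": 0}] * 90 + [{"label": 1}] * 10
balanced = downsample(rows)
```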
22. Graph Algorithms Used for
Feature Engineering (few examples)
Preferential Attachment measures the likelihood that
nodes connect based on their degrees (well-connected
nodes tend to gain more connections)
Common Neighbors measures the number of shared
neighbors between two nodes (triadic closure)
Illustration from be.amazd.com/link-prediction/
23. Triangle counting and clustering coefficients
measure the density of connections around nodes
Louvain Modularity identifies interacting
communities and hierarchies
Graph Algorithms Used for
Feature Engineering (few examples)
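Triangle-based measures like the local clustering coefficient can also be computed straight from an adjacency dict. A minimal hand-rolled sketch (in practice we ran these as Neo4j graph algorithms):

```python
from itertools import combinations

def local_clustering(g, u):
    """Fraction of a node's neighbor pairs that are themselves connected,
    i.e. how dense the connections around the node are."""
    nbrs = list(g[u])
    k = len(nbrs)
    if k < 2:
        return 0.0  # no neighbor pairs to close
    # count triangles through u: neighbor pairs that are linked
    links = sum(1 for a, b in combinations(nbrs, 2) if b in g[a])
    return 2.0 * links / (k * (k - 1))

graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"a", "c"},
}
```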
24. Training Our Model
This is one decision tree in
our Random Forest used as a
binary classifier to learn how
to classify a pair: predicting
either linked or not linked.
25. Did you get really high accuracy
on your first run without tuning?
OMG I’m Good!
Data Leakage!
We had to go back and use time-based
splits for train/test datasets
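A time-based split like the one we had to adopt can be sketched as follows (the field name is hypothetical; the point is that training data must strictly predate test data):

```python
def time_split(rows, cutoff, time_key="year"):
    """Split so the model trains only on data from before the cutoff,
    preventing future information from leaking into training."""
    train = [r for r in rows if r[time_key] < cutoff]
    test = [r for r in rows if r[time_key] >= cutoff]
    return train, test

rows = [{"year": y} for y in (2004, 2005, 2006, 2007, 2008)]
train, test = time_split(rows, cutoff=2006)
```

A random split would have let the model see collaborations from later years while predicting earlier ones, which is exactly the leakage that produced the suspiciously high first-run accuracy.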
28. Feature Influence for Tuning
To compute feature
importance, the random forest
algorithm in Spark averages
the reduction in impurity
across all trees in the forest
Feature rankings are relative to the
group of features evaluated
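The impurity-reduction idea behind that importance score can be illustrated with Gini impurity for one binary split. This is a hand-rolled sketch of the concept, not Spark's implementation:

```python
def gini(labels):
    """Gini impurity of a set of binary (0/1) labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)  # fraction of positives
    return 2.0 * p * (1.0 - p)     # 0 when pure, 0.5 when 50/50

def impurity_reduction(parent, left, right):
    """Drop in weighted Gini impurity from splitting parent into left/right.
    Averaging this over all trees gives a forest's feature importance."""
    n = len(parent)
    weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
    return gini(parent) - weighted

# A perfect split of a 50/50 parent removes all impurity
parent = [1, 1, 0, 0]
reduction = impurity_reduction(parent, [1, 1], [0, 0])
```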
29. Resources
Code/Repositories:
• This example from O’Reilly book: bit.ly/2FPgGVV (ML Folder)
• Python notebook: github.com/AliciaFrame/Public-Python-Notebooks
• neo4j.com/graph-algorithms-book (Chapter 8: Link Prediction)
30. Amy.Hodler@neo4j.com
32. Resources
Spark Community
• spark.apache.org/community.html
• users@spark.apache.org
Neo4j Community
• neo4j.com/developer/
• neo4j.com/developer/graph-algorithms/
• community.neo4j.com
33. Neo4j Invented the Labeled Property Graph Model

[Diagram: two PERSON nodes (name: “Dan”, born: May 29, 1970, twitter: “@dan”; name: “Ann”, born: Dec 5, 1975) linked by MARRIED TO and LIVES WITH (since: Jan 10, 2011) relationships, plus DRIVES and OWNS relationships to a CAR node (brand: “Volvo”, model: “V70”); location properties Latitude: 37.5629900°, Longitude: -122.3255300°]

Nodes
• Can have Labels to classify nodes
• Labels have native indexes
Relationships
• Relate nodes by type and direction
Properties
• Attributes of Nodes & Relationships
• Stored as Name/Value pairs
• Can have indexes and composite indexes
• Visibility security by user/role
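The labeled property graph model above can be sketched as plain data structures. This is an illustration of the model only, not how Neo4j stores data:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    labels: set  # e.g. {"PERSON"}: labels classify the node
    properties: dict = field(default_factory=dict)  # name/value pairs

@dataclass
class Relationship:
    rel_type: str  # e.g. "DRIVES": every relationship has a type
    start: Node    # and a direction, from start node
    end: Node      # to end node
    properties: dict = field(default_factory=dict)

dan = Node({"PERSON"}, {"name": "Dan", "twitter": "@dan"})
car = Node({"CAR"}, {"brand": "Volvo", "model": "V70"})
drives = Relationship("DRIVES", dan, car)
```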