Workshop - Neo4j Graph Data Science

© 2022 Neo4j, Inc. All rights reserved.

Neo4j is a Native Graph Database

Relational VS Graph models
[Figure: the same data in two models. Relational model: Person and Friend tables joined through a Person-Friend table. Graph model: ANDREAS, TOBIAS, MICA and DELIA connected directly by KNOWS relationships.]

Labeled property graph model components
● Nodes
- Represent objects in the graph
● Relationships
- Relate nodes by type and direction
● Properties
- Name-value pairs that can go
on nodes and relationships
- Can have indexes and composite indexes
(types: String, Number, Long, Date, Spatial, byte
and arrays of those)
● Labels
- Group nodes
- Shape the domain
[Figure: two PERSON nodes (name: “Dan”, born: May 29, 1970, twitter: “@dan” and name: “Ann”, born: Dec 5, 1975) and a CAR node (brand: “Volvo”, model: “V70”), connected by LOVES, LIVES WITH, DRIVES and OWNS relationships; one relationship carries the property since: Jan 10, 2011]
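The four components can be mimicked with plain data structures. This is a minimal sketch, assuming hypothetical `Node` and `Relationship` classes (they are not part of any Neo4j API) and assuming, for illustration only, that Dan is the one who DRIVES the car.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    labels: set                                      # labels group nodes, e.g. {"PERSON"}
    properties: dict = field(default_factory=dict)   # name-value pairs


@dataclass
class Relationship:
    start: Node                                      # a relationship has a direction ...
    end: Node
    type: str                                        # ... and a type
    properties: dict = field(default_factory=dict)   # it can carry properties too


dan = Node({"PERSON"}, {"name": "Dan", "twitter": "@dan"})
car = Node({"CAR"}, {"brand": "Volvo", "model": "V70"})
drives = Relationship(dan, car, "DRIVES")            # assumed direction: Dan -> car

# Pattern matching by hand: which brands are driven by PERSON nodes?
brands = [r.end.properties["brand"] for r in [drives]
          if r.type == "DRIVES" and "PERSON" in r.start.labels]
```

In a graph database this lookup is what a pattern query expresses declaratively; the list comprehension only shows which pieces of information (label, type, direction, property) take part in it.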

What is data science?
“Data science is an interdisciplinary
field that uses scientific methods,
processes, algorithms and systems to
extract knowledge and insights from
structured and unstructured data.” -
Wikipedia
[Figure: diagram with the label "Domain Knowledge"]

What is Graph data science?
Graph Data Science is a science-
driven approach to gain knowledge
from the relationships and structures
in data, typically to power predictions.
Graph data scientists use
relationships to answer
questions.

So, When Do I Need Graph Algorithms?
Query (Cypher): local patterns
● Real-time, local decisioning and pattern matching
● You know what you’re looking for and are making a decision
Graph Algorithms: global computation
● Global analysis and iterations
● You’re learning the overall structure of a network, updating data, and predicting

Graph Algorithm Categories
● Pathfinding & Search: finds optimal paths or evaluates route availability and quality
● Centrality & Importance: determines the importance of distinct nodes in the network
● Community Detection: detects group clustering or partition
● Similarity: evaluates how alike nodes are by neighbours and relationships
● Heuristic Link Prediction: estimates the likelihood of nodes forming a future relationship
● Node Embeddings & ML: compute low-dimensional vector representations of nodes in a graph, and allow you to train supervised machine learning models
https://neo4j.com/docs/graph-data-science/current/

60+ Graph Data Science Techniques in Neo4j
Pathfinding & Search
• Shortest Path
• Single-Source Shortest Path
• All Pairs Shortest Path
• A* Shortest Path
• Yen’s K Shortest Path
• Minimum Weight Spanning Tree
• K-Spanning Tree (MST)
• Random Walk
• Breadth & Depth First Search
Centrality & Importance
• Degree Centrality
• Closeness Centrality
• Harmonic Centrality
• Betweenness Centrality & Approx.
• PageRank
• Personalized PageRank
• ArticleRank
• Eigenvector Centrality
• Hyperlink Induced Topic Search (HITS)
• Influence Maximization (Greedy, CELF)
Community Detection
• Triangle Count
• Local Clustering Coefficient
• Connected Components (Union Find)
• Strongly Connected Components
• Label Propagation
• Louvain Modularity
• K-1 Coloring
• Modularity Optimization
• Speaker Listener Label Propagation
Supervised Machine Learning
• Node Classification
• Link Prediction
• Node Regression
… and more!
Heuristic Link Prediction
• Adamic Adar
• Common Neighbors
• Preferential Attachment
• Resource Allocations
• Same Community
• Total Neighbors
Similarity
• Node Similarity
• K-Nearest Neighbors (KNN)
• Jaccard Similarity
• Cosine Similarity
• Pearson Similarity
• Euclidean Distance
• Approximate Nearest Neighbors (ANN)
Graph Embeddings
• Node2Vec
• FastRP
• FastRPExtended
• GraphSAGE
• Synthetic Graph Generation
• Scale Properties
• Collapse Paths
• One Hot Encoding
• Split Relationships
• Graph Export
• Pregel API (write your own algos)

How can they be used?
Stand Alone Solution
● Find significant patterns and optimal structures
● Use community detection and similarity scores for recommendations
Machine Learning Pipeline
● Use the measures as features to train an ML model
1st node   2nd node   Common neighbors   Preferential attachment   Label
1          2          4                  15                        1
3          4          7                  12                        1
5          6          1                  1                         0
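The feature columns in the table can be computed from nothing but adjacency information. The following is a minimal sketch on a hypothetical toy graph; the function names and the data are assumptions for illustration, and the label simply records whether the two nodes are already connected.

```python
# Toy undirected graph as adjacency sets (hypothetical data).
neighbors = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"a", "c"},
}


def common_neighbors(g, u, v):
    """Number of nodes adjacent to both u and v."""
    return len(g[u] & g[v])


def preferential_attachment(g, u, v):
    """Product of the degrees of u and v."""
    return len(g[u]) * len(g[v])


def feature_row(g, u, v):
    """One row of the feature table: two nodes, two measures, one label."""
    return (u, v, common_neighbors(g, u, v),
            preferential_attachment(g, u, v), int(v in g[u]))


rows = [feature_row(neighbors, "b", "d"), feature_row(neighbors, "a", "c")]
```

Rows like these are what a supervised model would be trained on: the measures are the features and the last column is the target.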

Access & deploy GDS
● Besides the Neo4j Browser, the GDS library can also be accessed through
the Neo4j drivers

What, Where & Who?

Which of the colored nodes would be considered the most
‘important'?

D has the highest degree (valence)
This is the most connected individual in the network. If
importance is how well you are personally known, you
would pick D.
G has the highest closeness centrality (0.52)
Information will disperse through the network quickest
through this individual. If you need to get a message out
rapidly, you would choose G.
I has the highest betweenness centrality (0.59)
This element is an efficient connector to other elements. The risk of
disruption is higher if you lose I.
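That the most connected node and the most central node can differ is easy to reproduce. Here is a minimal sketch, assuming a hypothetical graph of two hubs joined through a middle node `m`; closeness is computed as (n - 1) divided by the sum of shortest-path distances, which is one common definition.

```python
from collections import deque

# Two hubs (h, k) with three leaves each, joined through m (hypothetical data).
edges = [("h", "l1"), ("h", "l2"), ("h", "l3"), ("h", "m"),
         ("m", "k"), ("k", "r1"), ("k", "r2"), ("k", "r3")]

graph = {}
for u, v in edges:
    graph.setdefault(u, set()).add(v)
    graph.setdefault(v, set()).add(u)


def closeness(g, source):
    """(reachable nodes) / (sum of shortest-path lengths), found by BFS."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        cur = queue.popleft()
        for nxt in g[cur]:
            if nxt not in dist:
                dist[nxt] = dist[cur] + 1
                queue.append(nxt)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0


degrees = {n: len(graph[n]) for n in graph}
closenesses = {n: closeness(graph, n) for n in graph}
most_central = max(closenesses, key=closenesses.get)
```

The hubs have the highest degree, yet `m`, with only two relationships, is closest to everybody else.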

I'm in it for the money
Who will most likely get the highest pay rise?
It's the bridging employees.

Where - Horizontal
What are the Graph Data Science sweet spots?
● Fraud Detection
● Disambiguation & Segmentation
● Personalized Recommendations
● Churn Prediction
● Search & Master Data Mgmt.
● Predictive Maintenance
● Cybersecurity

Where - Finance
● Synthetic identity fraud
● Fraud rings
● Money laundering
● Recommendations
● Customer segmentation
● Churn prediction
● ...

Where - Healthcare
● Drug repurposing
● Patient journey
● Contact tracing
● Regulatory compliance
● ...

Where - Retail
● Logistics & Routing
● Supply chain
● Recommendations
● Customer segmentation
● ...

Who - References
Meredith: Marketing to the Anonymous
• Mostly anonymous users across devices and sites with ever changing cookies
• 4.4 TB: +14 Bn nodes, +20 Bn relationships
• +160 Mn rich, unique profiles created
• 612% increase in visits per profile
Top 10 Bank: Financial Fraud Detection & Recovery
• Almost 70% of credit card fraud was missed
• Synthetic identities were the biggest challenge
• +1 Bn nodes and +1 Bn relationships to analyse
• Graph analytics with queries & algorithms help find $10’s of millions of fraud in the 1st year
AstraZeneca: Patient Journeys
• Early intervention project with 3 yrs of visits, tests & diagnosis with 10’s of Bn of records
• Finding similarities in patient journeys
• Graph algorithms for identifying communities & best intervention points

Interacting moving parts

Describing the problem
Graph theory has been around for a while. So have a lot of the graph
algorithms. What you'll find is that the majority of them only work on a
very specific shape of graph ...

Multipartite
• Multiple node types
• Multiple relationship types
• Most common graph
• (what we’ve seen so far)
[Figure: example graph with Merchant, Transaction, Bank, Client, Phone, Email and NI Number nodes, connected by NEXT, TO, PERFORMED, FIRST_TX, LAST_TX, HAS_PHONE, HAS_EMAIL and HAS_NI_NUMBER relationships]

Bipartite
• Contains nodes that can be divided into two sets
◦ Such that relationships only exist between sets but not within each set.
• Node similarity relies on this type of graph
[Figure: Client nodes connected to Phone, Email and NI Number nodes by HAS_PHONE, HAS_EMAIL and HAS_NI_NUMBER relationships]

Monopartite
• Contains one node label and one relationship type
• Most Graph Data Science algorithms rely on this type of graph
[Figure: Client nodes connected to each other by TRANSFER_TO relationships]

Why can’t I run my algorithm on a multipartite graph?
What if I try to run an algorithm on this graph?
• How many relationships does each person have? 1 or 2
• How many relationships does each book have? 5 or 6
• What is the direction of the relationships in this graph? Person-[:APPEARED_IN]->Book
• Can I reach a person node from another person node, following the directed relationships? No!

What if I try to run an algorithm on this graph?
• What would an algorithm that used the number of edges each node has to calculate centrality conclude? Books are more important than people
• What would an algorithm that followed directed relationships to find communities conclude? There are seven communities?

If you want to find out:
• What person is the most important
• How many communities of people are there,
across all the books?
You need to reshape your graph!
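What "reshaping" means can be sketched without a database. The example below is an illustration under assumptions: the person and book names are hypothetical, and the reshaping rule (connect two people when they appear in the same book) is one plausible choice, not necessarily the one used in the workshop.

```python
from collections import Counter
from itertools import combinations

# Person-[:APPEARED_IN]->Book pairs (hypothetical data).
appeared_in = [
    ("Ann", "Book 1"), ("Bob", "Book 1"), ("Cid", "Book 1"),
    ("Ann", "Book 2"), ("Bob", "Book 2"),
    ("Dee", "Book 3"),
]

# Naive degree count on the bipartite graph: books collect the most edges.
raw_degree = Counter()
for person, book in appeared_in:
    raw_degree[person] += 1
    raw_degree[book] += 1

# Reshape into a monopartite person-person graph:
# two people are related when they appear in the same book.
cast = {}
for person, book in appeared_in:
    cast.setdefault(book, set()).add(person)

co_appearances = Counter()
for people in cast.values():
    for a, b in combinations(sorted(people), 2):
        co_appearances[(a, b)] += 1   # weight = number of shared books
```

On the raw graph a book comes out as the "most important" node; on the reshaped graph only people remain, so centrality and community questions are finally about people.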

Graph Catalog
Procedures (part of the GDS library) that let you reshape and subset your
transactional graph so you have the right data in the right shape to run
analytical algorithms.
Mutable in-memory workspace

Graph Algorithms

Creating the graph projection
The projection will be loaded into memory
CALL gds.graph.project('GraphProjection', 'Character', {
  INTERACTS_WITH: {
    type: 'INTERACTS_WITH',
    properties: {count: {property: 'count'}}
  }
}) YIELD graphName, nodeCount, relationshipCount, projectMillis;
This is a native projection. It is very efficient, but the graph must exist with the same
structure in the database!

Calling an Algorithm Procedure
Good news! All algorithms in GDS follow the same syntax:
CALL gds[.<tier>].<algorithm>.<execution-mode>[.<estimate>](
graphName: STRING,
configuration: MAP
)
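Because the naming scheme is regular, a procedure name can be assembled mechanically. This is a minimal sketch: `procedure_name` is a hypothetical helper, not part of any Neo4j driver or client, and `someAlgorithm` is a placeholder rather than a real GDS algorithm.

```python
VALID_MODES = {"stream", "write", "mutate", "stats"}


def procedure_name(algorithm, mode, tier=None, estimate=False):
    """Compose gds[.<tier>].<algorithm>.<execution-mode>[.estimate]."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown execution mode: {mode}")
    parts = ["gds"]
    if tier:                      # e.g. "beta" or "alpha"; omitted when product supported
        parts.append(tier)
    parts += [algorithm, mode]
    if estimate:
        parts.append("estimate")
    return ".".join(parts)


names = [
    procedure_name("pageRank", "stream"),
    procedure_name("louvain", "write", estimate=True),
    procedure_name("someAlgorithm", "mutate", tier="beta"),
]
```

The resulting strings are only the procedure names; the call itself still takes the graph name and the configuration map as arguments.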

Tiers of Support
Product supported: Supported by product engineering, tested for
stability, scale, fully optimized
CALL gds.<algorithm>.<execution-mode>[.<estimate>]
Beta: Candidate for product supported tier
CALL gds.beta.<algorithm>.<execution-mode>[.<estimate>]
Alpha: Experimental implementation, may be changed in future.
CALL gds.alpha.<algorithm>.<execution-mode>[.<estimate>]

Execution Modes
Stream: Stream your results back as Cypher result rows. Generally node id(s) and scores.
CALL gds[.<tier>].<algorithm>.stream[.<estimate>]
Write: Write your results back to Neo4j as node or relationship properties, or new
relationships. Must specify writeProperty
CALL gds[.<tier>].<algorithm>.write[.<estimate>]
Mutate: update the in-memory graph with the results of the algorithm
CALL gds[.<tier>].<algorithm>.mutate[.<estimate>]
Stats: Returns statistics about the algorithm output - percentiles, counts
CALL gds[.<tier>].<algorithm>.stats[.<estimate>]

Estimation
Estimate lets you estimate the memory requirements for running your
algorithm with the specified configuration -- just like .estimate with
graph catalog operations.
CALL gds.<algorithm>.<execution-mode>.estimate
Note: Only production quality algorithms support
.stats and .estimate

Common Configuration Parameters
● concurrency: how many concurrent threads can be used when executing the algorithm? Default: 4
● readConcurrency: how many concurrent threads can be used when reading data? Default: concurrency
● writeConcurrency: how many concurrent threads can be used when writing results? Default: concurrency
● relationshipWeightProperty: property containing the weight (must be numeric). Default: null
● writeProperty: property name to write back to. Default: n/a

Graph Embeddings
and Graph Native ML

Node Embedding
What are node embeddings?
The representation of nodes as low-dimensional vectors that
summarize their graph position, the structure of their local graph
neighborhood as well as any possible node features
How?
Encoder - Decoder Framework

Node Embedding
Encode nodes such that similarity in
the embedding space (e.g. cosine
similarity) approximates similarity in
the graph
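Cosine similarity itself is a small computation. The sketch below uses hypothetical two-dimensional embeddings for three nodes; real embeddings have many more dimensions, but the arithmetic is the same.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equally long, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings: n1 and n2 sit close together, n3 does not.
embeddings = {"n1": [1.0, 0.0], "n2": [0.9, 0.1], "n3": [0.0, 1.0]}

near = cosine_similarity(embeddings["n1"], embeddings["n2"])
far = cosine_similarity(embeddings["n1"], embeddings["n3"])
```

A good encoder would place nodes that are similar in the graph so that `near`-style values are high and `far`-style values are low.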

Graph Embeddings in Neo4j
Node2Vec
● Random walk based embedding that can encode structural similarity or topological proximity.
● Easy to understand, interpretable parameters, plenty of examples
GraphSAGE
● Inductive embedding that encodes properties of neighboring nodes when learning topology.
● Generalizes to unseen graphs, first method to incorporate properties
FastRP
● A super fast linear algebra based approach to embeddings that can encode topology or properties.
● 75,000x faster than Node2Vec, extended to encode properties

GraphSAGE (SAmpling and AggreGatE)
[Figure: the neighborhood of node A is processed in three steps: SAMPLE, AGGREGATE, PREDICT]
● Assumes that nodes in the same neighborhood should have similar representations
● Uses node properties in addition to relationships
● Inductive approach that learns a function to calculate an embedding

Some final thoughts ...

Data Science is COMPLICATED
● Data Modeling: How do we shape data into a graph in the first place?
● Which Algorithms? Dozens of libraries, hundreds of algos & no docs!
● Learn Syntax: We’ve picked a library...good luck learning the syntax
● Reshape Data: What? We have to build the entire ETL pipeline for this?
● What Now? Are the results right? How do we get into production?

SIMPLIFY your experience
● Data Modeling: With Neo4j it’s already a graph
● Which Algorithms? We have validated algos, clear docs, & tutorials
● Learn Syntax: Neo4j syntax is standardized and simplified
● Reshape Data: Seamlessly reshape data with 1 command
● What Now? Simply write results to Neo4j & move to production

Eurovision Song Contest

Why that dataset?
● Relatively easy to find
● The domain is generally understood
● The results of our queries and algorithms can be verified
● There are a lot of myths to debunk / confirm … almost everybody in
Europe has at least one of them in their heads.

Model
That's a monopartite that is!

Couple of points
● This is an instance model rather than a classical database model. As we
don't have a schema to generate, we can just as well show some sample
data.
● You could argue that the year should also be a property of the
relationship, rather than part of the type. However, most of the analysis
we'll do today will be year-based.
● The dataset contains data from 1975 to 2018. That data was the easiest
to normalize (the voting system has changed a lot over the years) and
stays clear of recent controversy. Feel free to go 1956 to 2022 afterwards
though, it's really fun.

Cypher
Hands-on

Follow along
How this is going to work is that I am going to avoid flipping back and forth
between slides and executing syntax. Instead you are going to execute
syntax!
In the virtual environment https://milano-summit.graphdatabase.ninja:7473/ ...
execute the following guide in the Neo4j Browser:
:play https://metis.graphdatabase.ninja/summit/cypher.html
:play https://metis.graphdatabase.ninja/summit/gds.html
You will find the syntax labeled with numbers, exactly as on the slides. So do
follow along!

Taking it from the top
In Cypher you MATCH a pattern and then RETURN a result
MATCH (c:Country {name: "Finland"})
RETURN c;
001
Filtering is done with WHERE (this statement does exactly the same)
MATCH (c:Country)
WHERE c.name = "Finland"
RETURN c;
002

Using patterns to answer questions
Who won in 1975?
MATCH (c:Country)<-[vote:VOTE_1975_JURY|VOTE_1975_PUBLIC]-()
RETURN c.name, sum(vote.weight) as score
ORDER BY score DESC LIMIT 10;
003
● The Netherlands did (with Ding-a-Dong). You can verify at
https://eurovisionworld.com/eurovision/1975 that the data is correct.
● Please take a moment to note down the positions of Finland, Sweden and
Ireland (7, 8, 9), this is going to be useful in a bit.

One more of those
Who won in 2006?
MATCH (c:Country)<-[vote:VOTE_2006_JURY|VOTE_2006_PUBLIC]-()
RETURN c.name, sum(vote.weight) as score
ORDER BY score DESC LIMIT 10;
004
Finland (Hard Rock Hallelujah) did
(https://eurovisionworld.com/eurovision/2006) … just in case you wondered
what the music was about.

Let's up the ante
Does country-X almost always give country-Y points?
That clearly requires a couple of definitions:
● almost always → at least 80% of the time
● a minimum of 15 entries for country-Y (otherwise it's not really significant
… sorry Australia)
● in order to keep the complexity limited the splitting and renaming of
countries is not taken into account (but you could if you wanted to)
● only jury votes are considered
● …

Let's up the ante
The approach then becomes:
● First you determine how many times a country competed.
● You keep that result with an intermediate projection (WITH) and filter out
based on the number of entries
● You then determine how many times the other countries voted for that
country
● Use another intermediate projection to filter based on the percentage
● Project the result ordered by relevance
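The steps above can be tried on a toy data set before writing any Cypher. This is a minimal sketch in plain Python; the function name, the tuple layout (year, source, target) and the sample votes are assumptions for illustration, and the thresholds are lowered so the tiny data set produces a result.

```python
from collections import defaultdict


def loyal_voters(votes, min_entries, min_share):
    """Return (source, target, votes, entries) for near-constant voters."""
    # Step 1: how many times did each country compete (receive votes)?
    years_competed = defaultdict(set)
    for year, _source, target in votes:
        years_competed[target].add(year)

    # Steps 2 and 3: keep frequent competitors, count votes per voter.
    years_voted = defaultdict(set)
    for year, source, target in votes:
        if len(years_competed[target]) >= min_entries:
            years_voted[(source, target)].add(year)

    # Step 4: filter on the percentage.
    result = []
    for (source, target), years in years_voted.items():
        entries = len(years_competed[target])
        if len(years) >= min_share * entries:
            result.append((source, target, len(years), entries))

    # Step 5: order by relevance.
    return sorted(result, key=lambda r: r[2] + r[3], reverse=True)


# Hypothetical votes: X backs Y every year, W only twice, X competed once.
votes = ([(y, "X", "Y") for y in range(2001, 2006)]
         + [(2001, "W", "Y"), (2002, "W", "Y"), (2001, "Z", "X")])
result = loyal_voters(votes, min_entries=3, min_share=0.8)
```

The two intermediate dictionaries play the role of the two intermediate projections (WITH) in the Cypher approach.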

Let's up the ante
MATCH (target:Country)<-[r]-()
WHERE NOT type(r) IN ['SPLIT_INTO','WAS_RENAMED']
  AND NOT type(r) CONTAINS 'PUBLIC'
// one jury relationship type per year, so distinct types = number of entries
WITH target, count(DISTINCT type(r)) AS totalentries
// "a minimum of 15 entries"
WHERE totalentries >= 15
MATCH (target)<-[r]-(source:Country)
WHERE NOT type(r) IN ['SPLIT_INTO','WAS_RENAMED']
  AND NOT type(r) CONTAINS 'PUBLIC'
WITH target, source, count(r) AS votes, totalentries
// "almost always": at least 80% of the entries
WHERE votes >= totalentries * 0.80
RETURN source.name AS `country-X`, target.name AS `country-Y`, votes, totalentries
ORDER BY totalentries + votes DESC;
005

Let's up the ante - Conclusions
● It does happen
● But it's not as common as some of the myths would have you believe.

Biting off more than we can chew
Are there blocks of countries (cliques/cohorts … whatever you want to call
them) that keep votes amongst themselves?
This is much harder to determine
● It requires reciprocity (it's not good enough that X always votes for Y, it
has to go the other way too)
● It needs quite a few countries to collaborate before you see the impact.
● …
It is a long-standing myth (?) that the Scandinavian countries do exactly this.
Let's find out …

Biting off more than we can chew
You can do this with Cypher.
It would get pretty hairy though. However, if you reduce the problem to its
essence, what you want to do is find out if there are voting communities
that persist over time …
I wonder if there are GDS algorithms that can determine communities …


Graph Data Science
Hands-on
… at last …

Follow along
How this is going to work is that I am going to avoid flipping back and forth
between slides and executing syntax. Instead you are going to execute
syntax!
In the virtual environment https://summit.graphdatabase.ninja:7473/ ... execute
the following guide in the Neo4j Browser:
:play https://metis.graphdatabase.ninja/summit/gds.html
You will find the syntax labeled with numbers, exactly as on the slides. So do
follow along. And oh yes ... one last thing ...
There will be questions!

Best practice
A typical run of a graph algorithm has the following steps:
1. Know your data. Run some statistics. This will help determine if the
results make sense. Run some estimates. Do you have enough memory?
2. Project the necessary data into the in-memory workspace.
3. Run the algorithm in estimate mode. Run it in stats mode. See 1. for the
reason.
4. Run the algorithm. Handle the results.
5. Remove the projection if it is no longer needed.

Best practice
In this session we will focus on 2. and 4. (to save time and reduce complexity)
but please do not forget the other steps once you are doing this on your own.

Using algorithms to answer questions
Who won in 1975?
This question is asking about the importance of countries in our voting graph.
That's a centrality problem and the best-known algorithm for it is
PageRank, so let's apply that!

Project the relevant data into the in-memory workspace
CALL gds.graph.project("eurosong1975",
"Country",
"VOTE_1975_JURY",
{ relationshipProperties: "weight" }
) YIELD graphName, nodeCount, relationshipCount
RETURN graphName, nodeCount, relationshipCount;
001
Something is not quite right, check https://eurovisionworld.com/eurovision/1975
again, how many countries participated?

Show an overview of the projections
CALL gds.graph.list();
002
Clean up the projection
CALL gds.graph.drop("eurosong1975");
003

And try it in a different way …
CALL gds.graph.project.cypher("eurosong1975",
"MATCH (c:Country) WHERE EXISTS ((c)-[:VOTE_1975_JURY]-())
RETURN id(c) as id, labels(c) as labels",
"MATCH (s:Country)-[r:VOTE_1975_JURY]->(t:Country) RETURN
id(s) as source, id(t) as target, type(r) as type, r.weight
as weight"
) YIELD graphName, nodeCount, relationshipCount
RETURN graphName, nodeCount, relationshipCount;
004

Native projection VERSUS Cypher projection
● Native projection is very efficient, scales to huge graphs
● Native projection requires that your original graph is completely tailored to
the problems
● Cypher projection is less efficient
● Cypher projection gives you full flexibility (you can even project things
that aren't there)
For our hands-on we'll go with Cypher projections, but do keep the above in mind!

Streaming the results for 1975
CALL gds.pageRank.stream("eurosong1975", {
maxIterations: 20,
dampingFactor: 0.85,
relationshipWeightProperty: "weight"
}) YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name AS name, score
ORDER BY score DESC, name ASC LIMIT 10;
005
Does anybody notice something strange about positions 7, 8 and 9?

A bit of a rant
Why aren't Finland, Ireland and Sweden in the correct order? Is PageRank
giving us information that a plain score cannot? Yes and no.
In PageRank, incoming votes are only part of the story. A vote gets more
importance if it comes from a node (originally a web page, here a country)
that itself has a high score.
Ireland got votes from The Netherlands. The others did not.
The lesson here is that you
● Need to understand your data
● Need to understand the algorithms
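The effect is easy to reproduce on a toy graph. Below is a minimal sketch of weighted PageRank in plain Python; it uses the classic normalized formulation and assumes every node has at least one outgoing vote, so its scores will not match the scaling GDS uses. The countries and weights are hypothetical.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Weighted PageRank; assumes every node has outgoing edges."""
    nodes = sorted({n for s, t, _ in edges for n in (s, t)})
    out_weight = {n: 0.0 for n in nodes}
    for s, _, w in edges:
        out_weight[s] += w
    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for s, t, w in edges:
            # a node passes on its own score, split by vote weight
            nxt[t] += damping * score[s] * w / out_weight[s]
        score = nxt
    return score


# (source, target, points): W is the clear winner and votes only for B.
edges = [("p", "W", 12), ("q", "W", 12), ("r", "W", 12),
         ("A", "W", 12), ("B", "W", 12),
         ("p", "A", 5), ("q", "A", 5),   # A collects 10 points
         ("W", "B", 8)]                  # B collects 8 points, from the winner

raw_points = {}
for _, target, points in edges:
    raw_points[target] = raw_points.get(target, 0) + points

scores = pagerank(edges)
```

A has more points than B, but B ends up with the higher PageRank score because its points come from the highest-scoring node.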

Let's up the ante
What were the voting communities in 1975?
CALL gds.louvain.stream("eurosong1975", {
relationshipWeightProperty: "weight"
}) YIELD nodeId, communityId
RETURN collect(gds.util.asNode(nodeId).name) AS members,
communityId
ORDER BY communityId DESC;
006
Nice, but without looking over all the years there's no way to bust the Scandinavian
myth …

Let's up the ante
Project the remaining years without televoting
UNWIND range(1976,2015,1) as year
CALL {
WITH year
CALL gds.graph.project.cypher("eurosong" + year,
"MATCH (c:Country) WHERE EXISTS ((c)-[:VOTE_" + year + "_JURY]-()) RETURN id(c)
as id, labels(c) as labels",
"MATCH (s:Country)-[r:VOTE_" + year + "_JURY]->(t:Country) RETURN id(s) as
source, id(t) as target, type(r) as type, r.weight as weight"
) YIELD graphName
RETURN graphName
}
RETURN year, graphName;
007

Let's up the ante
Project the remaining years with televoting
// year range assumed: the previous step ends at 2015 and the dataset ends in 2018
UNWIND range(2016,2018,1) as year
CALL {
WITH year
CALL gds.graph.project.cypher("eurosong" + year,
"MATCH (c:Country) WHERE EXISTS ((c)-[:VOTE_" + year + "_JURY]-()) RETURN id(c)
as id, labels(c) as labels",
"MATCH (s:Country)-[r:VOTE_" + year + "_JURY|VOTE_" + year + "_PUBLIC]->(t:Country)
RETURN id(s) as source, id(t) as target, type(r) as type, r.weight as weight"
) YIELD graphName
RETURN graphName
}
RETURN year, graphName;
008

Let's up the ante
Run Louvain in bulk and mutate the in-memory projection
// year range assumed: one projection per year of the dataset (1975 to 2018)
UNWIND range(1975,2018,1) as year
CALL {
WITH year
CALL gds.louvain.mutate("eurosong" + year, {
relationshipWeightProperty: "weight",
mutateProperty: "louvain" + year
}) YIELD nodePropertiesWritten
RETURN nodePropertiesWritten
}
RETURN year, nodePropertiesWritten;
009

Mutatis mutandis
There are three main modes (ignoring stats and estimate) to run an algorithm:
● stream - streams (duh) the results and is typically used either as a test run
(with visual inspection of the results) or when you want to use the results
outside of Neo4j (in a machine learning pipeline, for example)
● write - modifies the original graph, which can be very useful if you want to
combine analytics with real-time use cases
● mutate - modifies the in-memory projection, which is typically done when you
have a chain of algorithms where one has to feed into the next

Let's talk about embeddings
Before we can finally confirm or debunk the Viking conspiracy there's an image
we saw earlier that I'm betting none of you questioned …
Machine Learning Pipeline
How does that work? An ML pipeline eats features, not graphs. Enter
embeddings …

Let's talk about embeddings
An embedding is a vector, a list of numbers, that represents (part of) a
graph. In Neo4j these are currently node embeddings: a node and its place
in the graph are represented as a list of numbers, which an ML pipeline can
totally ingest!
There are currently three algorithms that can create node-embeddings
● Fast Random Projection
● GraphSAGE
● Node2Vec

Let's talk about embeddings
So … you are going to ignore all three and create your own …
In the in-memory projections (one per year) the Country nodes now have an
additional property, louvainXXXX (with XXXX the year) that holds their
community.
I would argue that a node's community is a pretty good indication of the
structure around a node. Combining all of them (for all years) into one list
gives us … a pretty decent embedding (not to mention one that's human
interpretable). Let's do it!
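The idea can be sketched outside the database first. In this minimal example the community ids per year are hypothetical, and each list entry is built by gluing the year and the community id together so that the same id in different years stays distinct.

```python
# year -> {country: community id}; hypothetical Louvain results.
communities = {
    1975: {"Sweden": 0, "Norway": 0, "Malta": 1},
    1976: {"Sweden": 2, "Norway": 2, "Malta": 3},
}


def build_embeddings(communities_by_year):
    """One list per country: an integer '<year><community>' per year."""
    embeddings = {}
    for year in sorted(communities_by_year):
        for country, community in communities_by_year[year].items():
            value = int(str(year) + str(community))
            embeddings.setdefault(country, []).append(value)
    return embeddings


embeddings = build_embeddings(communities)
```

Countries that land in the same community year after year end up with the same list, which is exactly what makes this home-made embedding comparable and human-readable.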

Let's up the ante
Create the embedding
// year range assumed: one projection per year of the dataset (1975 to 2018)
UNWIND range(1975,2018,1) as year
CALL gds.graph.streamNodeProperty("eurosong" + year, "louvain" + year)
YIELD nodeId, propertyValue
WITH nodeId, propertyValue, year
WITH nodeId, toInteger(toString(year) + toString(propertyValue)) as embeddingvalue
WITH nodeId, collect(embeddingvalue) as embedding
MATCH (c:Country) WHERE id(c) = nodeId
SET c.embedding = embedding;
010

Let's up the ante
Verify the embedding
MATCH (c:Country)
RETURN c.name, c.embedding;
011

Let's up the ante
Cleanup, the in-memory projections have served their purpose
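// year range assumed: one projection per year of the dataset (1975 to 2018)
UNWIND range(1975,2018,1) as year
CALL {
WITH year
CALL gds.graph.drop("eurosong" + year) YIELD graphName
RETURN graphName
}
RETURN "dropped " + graphName;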
012
Just building up the suspense btw … I could totally have ignored this …

Let's up the ante
Compare the embeddings and infer a SIMILAR relationship
MATCH (c1:Country),(c2:Country)
WHERE id(c1) > id(c2)
AND c1.embedding IS NOT NULL
AND c2.embedding IS NOT NULL
AND gds.similarity.jaccard(c1.embedding, c2.embedding) > 0.60
AND size(c1.embedding) > 1
AND size(c2.embedding) > 1
MERGE (c1)-[:SIMILAR {score: gds.similarity.jaccard(c1.embedding,
c2.embedding)}]->(c2);
013
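What the comparison measures can be shown with a plain set-based Jaccard similarity. This is a sketch under assumptions: the embeddings are hypothetical, and the GDS function may treat duplicate values differently from Python sets.

```python
def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


# Hypothetical embeddings: same community in four out of five years.
country_1 = [19750, 19762, 19771, 19783, 19794]
country_2 = [19750, 19762, 19771, 19783, 19795]
country_3 = [19751, 19760, 19772, 19781, 19790]

close_pair = jaccard(country_1, country_2)
distant_pair = jaccard(country_1, country_3)
```

With the 0.60 threshold used above, the first pair would get a SIMILAR relationship and the second would not.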

Confirmed or Debunked?
Check the results
MATCH p=(:Country)-[r:SIMILAR]->(:Country) RETURN p;
014
Yes, there is some collusion, but remember, you'd need quite the cluster to
actually influence the results significantly. And it would seem it's not the
Scandinavian countries that have that at the moment.
What's going on between San Marino and Georgia though?

Thank you!
Contact us at
sales@neo4j.com
