Gremlin: A Graph-Based Programming Language
1. Gremlin G = (V, E)
A Graph-Based Programming Language
Marko A. Rodriguez
T-5, Center for Nonlinear Studies
Los Alamos National Laboratory
http://markorodriguez.com
http://gremlin.tinkerpop.com
February 25, 2010
2. Abstract
Gremlin is a Turing-complete, graph-based programming language
developed for key/value-pair multi-relational graphs called property graphs.
Gremlin makes extensive use of XPath 1.0 to support complex graph
traversals. Connectors exist to various graph databases and frameworks.
This language has application in the areas of graph query, analysis, and
manipulation.
Center for Nonlinear Studies PostDoc Seminar – Los Alamos National Laboratory – February 25, 2010
3. Acknowledgements
• Marko A. Rodriguez [http://markorodriguez.com]
designed, developed, tested, and documented Gremlin.
• Peter Neubauer [http://www.linkedin.com/in/neubauer]
aided in the design and the evangelizing of Gremlin.
• Pavel Yaskevich [http://github.com/xedin]
aided in the development of user defined functions in Gremlin.
• Joshua Shinavier [http://fortytwo.net]
provided initial conceptual support for Gremlin.
• Ketrina Yim [http://csillustrated.berkeley.edu]
designed the logo for Gremlin.
• Gremlin-Users Group [http://groups.google.com/group/gremlin-users]
provided much direction in the design and implementation of Gremlin.
4. Outline
• Introduction to Graphs and Graph Software
• Basic Gremlin Concepts
• Gremlin Language Description
• Advanced Gremlin Concepts
• Conclusions
6. What is a Graph?
• A graph (network) is composed of a collection of vertices (dots) and edges (lines).
There are many types of graphs: directed/undirected, weighted, attributed, etc.
[Figure: a collage of graph types: vertex-labeled, vertex-attributed, edge-labeled, edge-attributed, weighted, directed, undirected, multi, hyper, half-edge, pseudo, regular, semantic, and resource description framework graphs. Sample annotations in the figure include knows, hired, created=2-01-09, modified=2-11-09, 0.2, http://ex.com/123, type="person", and name="emil".]
7. Why Use a Graph?
• A graph is a very general data structure that can be used to model
various systems.
A graph can model the structure of transportation, technological,
bibliographic, etc. systems.
A graph can model a list, a map, a tree, etc.
• There are numerous graph algorithms that are defined independent of
the domain of the graph model.
• There are numerous graph databases, frameworks, packages, etc.
that aid in the creation, manipulation, and analysis of graphs.
8. Graph Databases, Frameworks, and Packages
• Neo4j Graph Database [http://neo4j.org]
• AllegroGraph Quad Store [http://www.franz.com/agraph]
• HyperGraphDB [http://www.kobrix.com/hgdb.jsp]
• Java Universal Network/Graph Framework [http://jung.sourceforge.net]
• OpenRDF Sesame Framework [http://www.openrdf.org]
• InfoGrid Graph Database [http://infogrid.org]
• Filament Graph Toolkit [http://filament.sourceforge.net]
• OWLim Semantic Repository [http://www.ontotext.com/owlim]
• Sones Graph Database [http://www.sones.com]
• NetworkX Graph Toolkit [http://networkx.lanl.gov]
• iGraph Toolkit [http://igraph.sourceforge.net]
• Blueprints Graph API [http://blueprints.tinkerpop.com]
• ... and many more.
9. What Makes Gremlin Different?
• Gremlin is a domain specific language for working with graphs.
• Gremlin is not an application programming interface (API).
• Gremlin makes use of various graph databases, frameworks, packages.
• Gremlin is a language that currently has a virtual machine
implementation written in Java.
• What can be succinctly expressed in Gremlin is verbose/clumsy to
express in general purpose languages such as Java, Python, Ruby, etc.
• Gremlin allows one to map single-relational graph analysis algorithms
over to the multi-relational domain.
10. Single-Relational Graphs
• In single-relational graphs, all edges have the same meaning
(e.g. all edges are either friendship, kinship, worksWith, knows, etc.).
G = (V, E ⊆ (V × V ))
• Most graph algorithms are defined for single-relational graphs
(e.g. centrality/ranking, clustering/community detection, etc.).
[Figure: a single-relational graph over the vertices person-a, person-b, and person-c.]
NOTE: These types of graphs are also known as directed, vertex-labeled graphs.
11. Multi-Relational Graphs
• In multi-relational graphs, edges can have different meanings.
G = (V, E ⊂ (V × V ), ω : E → Σ∗)
• Most graph software is designed for multi-relational graphs (e.g. arbitrary
objects as vertices and edges, knowledge-based reasoning systems, etc.).
[Figure: a multi-relational graph over the vertices person-a, book-b, and book-c, with edges labeled authored, read, and cites.]
NOTE: These types of graphs are also known as directed, vertex/edge-labeled graphs.
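The definition G = (V, E, ω) can be made concrete with a small sketch. The Python below is illustrative only: the vertex names come from the figure, but the edge directions and the helper name `project` are assumptions. It stores ω as a mapping from edge to label and shows how filtering on one label yields a single-relational edge set.

```python
# Toy multi-relational graph G = (V, E, omega).
# Vertex names follow the slide's figure; edge directions are assumed.
V = {"person-a", "book-b", "book-c"}
E = [
    ("person-a", "book-b"),
    ("person-a", "book-c"),
    ("book-b", "book-c"),
]
# omega : E -> label (a string over some alphabet)
omega = {
    ("person-a", "book-b"): "authored",
    ("person-a", "book-c"): "read",
    ("book-b", "book-c"): "cites",
}


def project(edges, labeling, label):
    """Return the single-relational edge set holding only edges with `label`."""
    return [edge for edge in edges if labeling[edge] == label]


cites_only = project(E, omega, "cites")
print(cites_only)
```

Each projection of this kind is an ordinary single-relational graph, which is why algorithms written for G = (V, E) can be reused once a meaning of adjacency has been fixed.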
12. Gremlin and Multi-Relational Graphs
• Gremlin provides a means to elegantly map single-relational graph
analysis algorithms over to the multi-relational graph domain.
• Gremlin provides an elegant way to do automated reasoning in
multi-relational graphs using path expressions.
These two points form the primary thesis of this presentation.
Rodriguez M.A., Shinavier, J., “Exposing Multi-Relational Networks to Single-Relational Network Analysis
Algorithms,” Journal of Informetrics, 4(1), 29–41, doi:10.1016/j.joi.2009.06.004, LA-UR-08-03931,
http://arxiv.org/abs/0806.2274, December 2009.
13. Property Graphs
• Gremlin works with a type of multi-relational graph called a property
graph.
Vertices and edges are labeled with unique identifiers.
Edges are directed, labeled, and can form loops.
Multiple edges of the same label can exist for the same vertex pair.
Vertices and edges can have any number of key/value pair
properties/attributes.
Property graphs are a relatively general graph structure that can be constrained to model other graph
structures — though, a property-based hypergraph would be the most general (see HyperGraphDB and the
JUNG API).
14. Property Graphs
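[Figure: the example property graph used in the following slides.]
Vertices:
v[1]: name = "marko", age = 29
v[2]: name = "vadas", age = 27
v[3]: name = "lop", lang = "java"
v[4]: name = "josh", age = 32
v[5]: name = "ripple", lang = "java"
v[6]: name = "peter", age = 35
Edges:
e[7][1-knows->2]: weight = 0.5
e[8][1-knows->4]: weight = 1.0
e[9][1-created->3]: weight = 0.4
e[10][4-created->5]: weight = 1.0
e[11][4-created->3]: weight = 0.4
e[12][6-created->3]: weight = 0.2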
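A property graph of this kind can be sketched in a few lines of Python. This is not Blueprints or Gremlin code: the class and method names (`PropertyGraph`, `add_vertex`, `out_edges`, and so on) are hypothetical, and the endpoints of edges 10 to 12 follow the standard TinkerPop toy graph as read from the figure.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    id: int
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    id: int
    out_v: int   # tail vertex id
    label: str
    in_v: int    # head vertex id
    properties: dict = field(default_factory=dict)


class PropertyGraph:
    """Directed, labeled multigraph with key/value properties on elements."""

    def __init__(self):
        self.vertices = {}
        self.edges = {}

    def add_vertex(self, vid, **props):
        self.vertices[vid] = Vertex(vid, props)

    def add_edge(self, eid, out_v, label, in_v, **props):
        self.edges[eid] = Edge(eid, out_v, label, in_v, props)

    def out_edges(self, vid):
        return [e for e in self.edges.values() if e.out_v == vid]

    def in_edges(self, vid):
        return [e for e in self.edges.values() if e.in_v == vid]


g = PropertyGraph()
g.add_vertex(1, name="marko", age=29)
g.add_vertex(2, name="vadas", age=27)
g.add_vertex(3, name="lop", lang="java")
g.add_vertex(4, name="josh", age=32)
g.add_vertex(5, name="ripple", lang="java")
g.add_vertex(6, name="peter", age=35)
g.add_edge(7, 1, "knows", 2, weight=0.5)
g.add_edge(8, 1, "knows", 4, weight=1.0)
g.add_edge(9, 1, "created", 3, weight=0.4)
g.add_edge(10, 4, "created", 5, weight=1.0)
g.add_edge(11, 4, "created", 3, weight=0.4)
g.add_edge(12, 6, "created", 3, weight=0.2)

print(sorted(e.id for e in g.out_edges(1)))   # ids of the edges leaving marko
```

Keying edges by their own identifier, rather than by vertex pair, is what allows several edges with the same label between the same two vertices.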
15. Outline
• Introduction to Graphs and Graph Software
• Basic Gremlin Concepts
• Gremlin Language Description
• Advanced Gremlin Concepts
16. Gremlin System Architecture
[Figure: architecture diagram with the Gremlin Console and Gremlin ScriptEngine above the graph stores Neo4j, NativeStore, and TinkerGraph.]
• The Gremlin console is a scripting environment which allows for the dynamic
evaluation of Gremlin code.
• Gremlin implements JSR 223, which allows Gremlin to also be used within the
Java language and thus, as a virtual machine directly accessible to Java
applications. Popular JSR 223 implementations include Jython, JRuby, and
Groovy. For a fine list of implementations see https://scripting.dev.java.net.
• Blueprints is a set of interfaces for abstract data structures such as graphs
and documents. Implementations of these interfaces exist for various data
management systems.
• There exist many graph data management systems that span various graph data
models (e.g. edge labeled graphs, RDF graphs, hypergraphs, etc.).
17. “Hello World” in the Gremlin Console
marko$ ./gremlin.sh
         \,,,/
         (o o)
-----oOOo-(_)-oOOo-----
gremlin>
gremlin> concat('goodbye', ' ', 'self')
==>goodbye self
18. Simple Traversals in Gremlin
[Figure: the example property graph, with vertex 1 (marko) as the current object.]
gremlin> $_ := g:key('name','marko')
==>v[1]
gremlin> .
==>v[1]
gremlin> ./outE
==>e[7][1-knows->2]
==>e[9][1-created->3]
==>e[8][1-knows->4]
gremlin> ./outE/@weight
==>0.5
==>0.4
==>1.0
./outE/@weight: “Get the current object(s). Then get the outgoing edges of those objects. Then get the
weights of those edges.”
$_ is a reserved variable meaning the root list of objects.
19. Simple Traversals in Gremlin
[Figure: the example property graph, with the traversal moving from vertex 1 (marko) to vertex 3 (lop).]
gremlin> .
==>v[1]
gremlin> ./outE[@label='created']/inV
==>v[3]
gremlin> $_ := $_last
==>v[3]
gremlin> ./@name
==>lop
gremlin> g:map(.)
==>name=lop
==>lang=java
./outE[@label='created']/inV: “Get the current object(s). Then get the outgoing edges of those
objects, where their labels equal ‘created’. Then get the incoming vertices of those ‘created’ edges.”
$_last is a reserved variable meaning the last value evaluated.
20. Simple Traversals in Gremlin
[Figure: the example property graph.]
./outE[@label='knows']/inV[matches(@name,'va.{3}') and @age > 21]/@name
==>vadas
21. Simple Traversals in Gremlin
./outE[@label='knows']/inV[matches(@name,'va.{3}') and @age > 21]/@name
1. .: Get the current object(s).
2. outE[@label='knows']: Get the outgoing edges of the current
object(s), where their labels equal ‘knows’.
3. inV[matches(@name,'va.{3}') and @age > 21]: Get the incoming
vertices of those ‘knows’ edges, where the names of those vertices are 5
characters long, start with ‘va’, and whose age is greater than 21.
4. @name: Get the name of those particular incoming vertices.
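The four steps above can be mirrored in ordinary Python to make the evaluation order explicit. This is a sketch rather than Gremlin's implementation: the tuple layout, the dictionary names, and the function `knows_filtered` are assumptions, and `re.fullmatch` is used because the slide describes the pattern as matching the whole name.

```python
import re

# A fragment of the example graph: vertex id -> properties.
VERTICES = {
    1: {"name": "marko", "age": 29},
    2: {"name": "vadas", "age": 27},
    3: {"name": "lop", "lang": "java"},
    4: {"name": "josh", "age": 32},
}
# (edge id, tail vertex, label, head vertex)
EDGES = [(7, 1, "knows", 2), (8, 1, "knows", 4), (9, 1, "created", 3)]


def knows_filtered(start):
    """Names reached by outE[@label='knows']/inV[matches(...) and @age > 21]."""
    names = []
    for _edge_id, out_v, label, in_v in EDGES:
        # Step 2: keep only the 'knows' edges leaving the start vertex.
        if out_v != start or label != "knows":
            continue
        # Step 3: move to the head vertex and apply the boolean filter.
        props = VERTICES[in_v]
        if re.fullmatch(r"va.{3}", props["name"]) and props.get("age", 0) > 21:
            # Step 4: project the name property.
            names.append(props["name"])
    return names


print(knows_filtered(1))
```

The nested conditions are what the single path expression replaces; each additional step or filter in Gremlin would otherwise become another loop or branch.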
22. Knowledge-Based Reasoning
• Blueprints implements the Sesame SAIL interfaces and thus, Gremlin
can be used over the many Resource Description Framework (RDF)
triple/quad stores. In such cases, RDF is modeled as a property graph
where the named graph component is the @ng edge property.
• Gremlin makes use of the Sesame SAIL SPARQL engine to allow for
queries based on graph-pattern matching.
gremlin> sail:sparql('SELECT ?x ?y WHERE { ?x foaf:knows ?y }')
==>{y=v[http://ex.com#2], x=v[http://ex.com#1]}
==>{y=v[http://ex.com#4], x=v[http://ex.com#1]}
• Gremlin is useful for knowledge-based reasoning using path
expressions.
23. Reasoning as Defining New Types of Adjacency
[Figure: the example graph (marko, vadas, lop, josh, ripple, peter; knows and created edges) with inferred co-developer edges added.]
• Graph-based reasoning is the process of making explicit what is implicit in
the graph.
• A reasoner takes a graph G and a collection of graph-patterns
(i.e. transformation/rewrite rules) and creates a new graph G′ (usually, G ⊂ G′).
G′ has new relationships/edges and thus, new definitions of vertex adjacency.
• Example: The co-developers of person A are those people who have created
the same software as person A and who are themselves not person A (as person
A has created the same software as him or herself).
For these “co-developer” examples, we will use vertex 1 (marko) as the source of the
reasoning process.
24. The Co-Developers of Marko A. Rodriguez in SPARQL
[Figure: the example property graph annotated with the query variables ?x, ?y, and ?z.]
SELECT ?x WHERE {
  marko created ?y .
  ?z created ?y .
  ?z != marko .
  ?z name ?x
}
This query would return: josh and peter.
25. The Co-Developers of Marko A. Rodriguez in Gremlin
[Figure: the example graph with inferred co-developer edges added.]
gremlin> ./@name
==>marko
gremlin> ./outE[@label='created']/inV/inE[@label='created']/outV[g:except($_)]/@name
==>josh
==>peter
26. The Co-Developers of Marko A. Rodriguez in Gremlin
./outE[@label='created']/inV/inE[@label='created']/outV[g:except($_)]/@name
1. .: Get the current object(s) (i.e. vertex 1, denoting Marko).
2. outE[@label='created']: Get the outgoing edges of the Marko vertex, where their
labels equal ‘created’.
3. inV: Get the incoming (i.e. head) vertices of those ‘created’ edges.
4. inE[@label='created']: Get the incoming edges of those vertices, where their
labels equal ‘created’.
5. outV[g:except($_)]: Get the outgoing (i.e. tail) vertices of those ‘created’ edges,
where those vertices are not the Marko vertex.
6. @name: Get the name of those non-Marko vertices.
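As a cross-check of the walkthrough above, the same traversal can be written as a short Python sketch over the example graph. The tuple layout and the function name `co_developers` are assumptions made for illustration, and the endpoints of edges 10 to 12 follow the standard TinkerPop toy graph.

```python
NAMES = {1: "marko", 2: "vadas", 3: "lop", 4: "josh", 5: "ripple", 6: "peter"}
# (edge id, tail vertex, label, head vertex)
EDGES = [
    (7, 1, "knows", 2),
    (8, 1, "knows", 4),
    (9, 1, "created", 3),
    (10, 4, "created", 5),
    (11, 4, "created", 3),
    (12, 6, "created", 3),
]


def co_developers(root):
    """Vertices that created the same software as `root`, excluding `root`."""
    # Steps 2-3: outE[@label='created']/inV
    projects = [in_v for _, out_v, label, in_v in EDGES
                if out_v == root and label == "created"]
    result = []
    for project in projects:
        # Steps 4-5: inE[@label='created']/outV[g:except($_)]
        for _, out_v, label, in_v in EDGES:
            if in_v == project and label == "created" and out_v != root:
                result.append(out_v)
    return result


# Step 6: @name
print([NAMES[v] for v in co_developers(1)])
```

Like the path expression, this sketch does not deduplicate: a person who shares two projects with the root would appear twice.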
27. Defining Co-Developers in Gremlin
path co-developer
  ./outE[@label='created']/inV/inE[@label='created']/outV[g:except($_)]
end
Once defined, you can use it like any other path segment.
gremlin> ./co-developer
==>v[4]
==>v[6]
gremlin> ./co-developer/@name
==>josh
==>peter
28. Defining Co-Developers in Java
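import java.util.ArrayList;
import java.util.List;

// Vertex and Edge are the Blueprints element interfaces; Path is Gremlin's path interface.
public class CoDeveloperPath implements Path {
    public List invoke(Object root) {
        if (root instanceof Vertex) {
            // Collect every project the root vertex has created.
            List<Vertex> projects = new ArrayList<Vertex>();
            for (Edge edge : ((Vertex) root).getOutEdges()) {
                if (edge.getLabel().equals("created")) {
                    projects.add(edge.getInVertex());
                }
            }
            // Collect the other creators of those projects.
            List<Vertex> coDevelopers = new ArrayList<Vertex>();
            for (Vertex project : projects) {
                for (Edge edge : project.getInEdges()) {
                    // equals() rather than != so that distinct handles to the root still match.
                    if (edge.getLabel().equals("created") && !edge.getOutVertex().equals(root)) {
                        coDevelopers.add(edge.getOutVertex());
                    }
                }
            }
            return coDevelopers;
        } else {
            // The path is only defined for vertex roots.
            return null;
        }
    }
}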
29. Outline
• Introduction to Graphs and Graph Software
• Basic Gremlin Concepts
• Gremlin Language Description
• Advanced Gremlin Concepts
• Conclusions
30. Gremlin Type System
object
  element
    vertex
    edge
  graph
  number
  string
  boolean
  map
  list
31. Predefined Paths and Properties
[Figure: predefined paths and properties illustrated on the example graph: vertex 1 out edges, vertex 3 in edges, edge 9 out vertex, edge 9 in vertex, edge 9 label (created), edge 9 id, vertex 4 id, and vertex 4 properties (name = "josh", age = 32).]
object       property  description                            example
graph        V         the vertex iterator of the graph       $g/V
graph        E         the edge iterator of the graph         $g/E
vertex/edge  @id       the identifier of the element          $v/@id
vertex       outE      the outgoing edges of the vertex       $v/outE
vertex       inE       the incoming edges of the vertex       $v/inE
vertex       bothE     both in and out edges of the vertex    $v/bothE
edge         outV      the outgoing tail vertex of the edge   $e/outV
edge         inV       the incoming head vertex of the edge   $e/inV
edge         bothV     both in and out vertices of the edge   $e/bothV
edge         @label    the label of the edge                  $e/@label
32. Predefined Functions
g:assign() g:remove-idx() g:list() g:sort() g:print()
g:assign() g:load() g:dedup() g:map() g:time()
g:unassign() g:save() g:union() g:keys() g:p()
g:id() g:clear() g:intersect() g:values() g:to-json()
g:key() g:close() g:difference() g:rand-nat() g:from-json()
g:add-v() g:keys() g:retain() g:rand-real() ...
g:add-e() g:values() g:except() g:prob() ..
g:remove-ve() g:map() g:remove() g:cont() .
g:idx-all() g:get() g:get() g:halt()
g:add-idx() g:op-value() g:op-value() g:type()
There are over 70 predefined functions. See the following for a description of each.
http://wiki.github.com/tinkerpop/gremlin/core-function-library
http://wiki.github.com/tinkerpop/gremlin/gremlin-function-library
33. Working With Non-Graph Types
gremlin> 1.2 + 6
==>7.2
gremlin> 'this is a string'
==>this is a string
gremlin> true() or false()
==>true
gremlin> g:map('marko','lanl','peter','neotech','josh','rpi')
==>marko=lanl
==>peter=neotech
==>josh=rpi
gremlin> g:list('graphs','hockey','motorcycles',6)
==>graphs
==>hockey
==>motorcycles
==>6.0
34. Working With Non-Graph Types
gremlin> $m := g:map('hobbies',g:list('hockey','graphs'),
           'location', g:map('state','new mexico', 'city', 'santa fe',
           'zipcode', 87501), 'age', 30)
==>location={zipcode=87501.0, state=new mexico, city=santa fe}
==>age=30.0
==>hobbies=[hockey, graphs]
gremlin> $m/@age
==>30.0
gremlin> $m/@hobbies[2]
==>graphs
gremlin> $m/@location/@city
==>santa fe
35. Variables
• Variables in Gremlin are prefixed with a $ character.
• There is a collection of reserved variables that all begin with $_.
$_ is the root list of objects.
$_last is the last result evaluated by the evaluator.
$_g is the “working graph” to reduce typing with graph functions.
gremlin> $x := 1
==>1.0
gremlin> $y := 2
==>2.0
gremlin> $x + $y
==>3.0
36. Language Statements
Variable Assignment
gremlin> $i := 1 + 5
==>6.0
gremlin> $i
==>6.0
Repeat
gremlin> $i := 0
==>0.0
gremlin> repeat 10
           $i := $i + 1
         end
==>10.0
If/Else
gremlin> if true()
           $i := 1
         else
           $i := 2
         end
==>1.0
While
gremlin> $i := 'g'
==>g
gremlin> while not(matches($i, 'ggg'))
           $i := concat($i,'g')
         end
==>ggg
37. Language Statements
Foreach
gremlin> $i := 0
==>0.0
gremlin> foreach $j in 1 | 2 | 3
           $i := $i + $j
         end
==>6.0
Path
gremlin> path friend_name
           ./outE[@label='knows']/inV/@name
         end
gremlin> ./friend_name
==>vadas
==>josh
Function
gremlin> func ex:hello($name)
           concat('hello ', $name)
         end
gremlin> ex:hello('pavel')
==>hello pavel
You can define functions and paths in native Gremlin (as demonstrated above) or in Java.
38. XPath Filters
• Use [ ] filters to filter objects in a path expression (i.e. “such that” or
“where”).
• The evaluated result of [ ] must be a number or boolean.
If it is a number, it is treated as the position within an array (i.e. list).
If it is a boolean, it is treated as whether to include or exclude the
object from the next path in the sequence.
gremlin> ./outE[@label='knows']
==>e[7][1-knows->2]
==>e[8][1-knows->4]
gremlin> ./outE[@label='knows' and @weight>0.5]/inV[@age<21 or @name='josh'][true()][1]
==>v[4]
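The two filter behaviours described above (a number selects a position, a boolean includes or excludes) can be sketched in Python. The helper `apply_filter` is hypothetical and models only the vertex filters of the last expression; the edge filter on @weight is assumed to have already left vadas and josh's head vertices in play for illustration.

```python
def apply_filter(items, predicate):
    """Apply one [ ] filter; predicate(item, position) returns a number or a bool."""
    kept = []
    for position, item in enumerate(items, start=1):   # XPath positions start at 1
        result = predicate(item, position)
        if isinstance(result, bool):
            # Boolean result: include or exclude the item.
            if result:
                kept.append(item)
        elif result == position:
            # Numeric result: keep only the item at that position.
            kept.append(item)
    return kept


people = [{"name": "vadas", "age": 27}, {"name": "josh", "age": 32}]
step1 = apply_filter(people, lambda v, pos: v["age"] < 21 or v["name"] == "josh")
step2 = apply_filter(step1, lambda v, pos: True)   # [true()] keeps everything
step3 = apply_filter(step2, lambda v, pos: 1)      # [1] keeps the first item
print(step3)
```

The boolean check comes first because Python treats `True` as the integer 1; without it, `[true()]` would be mistaken for the positional filter `[1]`.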
39. Outline
• Introduction to Graphs and Graph Software
• Basic Gremlin Concepts
• Gremlin Language Description
• Advanced Gremlin Concepts
• Conclusions
40. A Grateful Dead Dataset
2,500 concerts
35,000 songs played
600 songs
30 years
11 members
1 band
... the Grateful Dead.
41. A Grateful Dead Dataset
• vertices denote songs and artists
type: “song” or “artist”.
name: name of song or artist.
performances: number of times song was played in concert.
song_type: whether the song was a “cover” or “original”.
• edges denote followed_by, sung_by, written_by
weight: number of times a song was followed by another song over all concerts played.
Rodriguez, M.A., Gintautas, V., Pepe, A., “A Grateful Dead Analysis: The Relationship Between Concert and Listening
Behavior,” First Monday, 14(1), University of Illinois at Chicago Library, http://arxiv.org/abs/0807.2466, January 2009.
NOTE: A portion of the raw dataset courtesy of Mark Leone http://www.cs.cmu.edu/~mleone/gdead/setlists.html
42. A Grateful Dead Dataset
[Figure: an example set list (Stanley Theater, Pittsburgh, PA, 11/30/79, 2nd Set: Scarlet Begonias, Fire on the Mountain, Passenger, Terrapin Station, ...) and its graph encoding. Song vertices (type="song"; name="Scarlet..", "Fire on..", "Pass..", "Terrap..") are chained by weighted followed_by edges (weights shown: 239, 1, 2) and point to artist vertices (type="artist"; name="Hunter", "Garcia", "Lesh") through written_by and sung_by edges.]
43. A Grateful Dead Dataset – Load Data/Basic Stats
gremlin> g:load('data/graph-example-2.xml')
==>true
gremlin> count($_g/V)
==>809.0
gremlin> count($_g/E)
==>8049.0
44. A Grateful Dead Dataset – Out-Degree of Each Vertex
gremlin> $degrees := g:map()
gremlin> foreach $v in $_g/V
$degrees[@name=$v/@name] := count($v/outE)
end
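For comparison, the same out-degree tally can be sketched in Python. The edge list below is a hypothetical miniature stand-in (the names song-a, artist-x, and so on are placeholders, not entries from the Grateful Dead dataset), and sorting by value in descending order is assumed to correspond to a descending g:sort on the map's values.

```python
from collections import Counter

# Hypothetical miniature graph: (tail, label, head)
EDGES = [
    ("song-a", "followed_by", "song-b"),
    ("song-a", "followed_by", "song-c"),
    ("song-a", "sung_by", "artist-x"),
    ("song-b", "followed_by", "song-c"),
    ("song-b", "sung_by", "artist-y"),
    ("song-c", "sung_by", "artist-x"),
]

# Equivalent of count($v/outE) for every vertex that has outgoing edges.
degrees = Counter(tail for tail, _label, _head in EDGES)

# Sort the map by value, largest first.
ranked = sorted(degrees.items(), key=lambda kv: kv[1], reverse=True)
for name, degree in ranked:
    print(f"{name}={degree}")
```

One difference from the Gremlin loop is that vertices without outgoing edges never enter the `Counter`, whereas the foreach over $_g/V records a zero for them.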
45. A Grateful Dead Dataset – Out-Degree of Each Vertex
gremlin> g:sort($degrees, 'value', true())
==>PLAYING IN THE BAND=96.0
==>SUGAR MAGNOLIA=92.0
==>PROMISED LAND=89.0
==>GOOD LOVING=87.0
==>NOT FADE AWAY=86.0
==>I KNOW YOU RIDER=85.0
==>CASSIDY=83.0
==>DEAL=82.0
==>JACK STRAW=81.0
==>ONE MORE SATURDAY NIGHT=81.0
==>EL PASO=80.0
==>MEXICALI BLUES=79.0
...
46. A Grateful Dead Dataset – Inspecting Single Vertex
gremlin> $v := g:key('name','CHINA DOLL')[1]
==>v[129]
gremlin> g:map($v)
==>name=CHINA DOLL
==>song_type=original
==>performances=114
==>type=song
gremlin> $v/outE[@label='sung_by']/inV/@name
==>Garcia
47. A Grateful Dead Dataset – Inspecting Single Vertex
gremlin> $v/outE[@label='followed_by']/inV/@name
==>BIG RIVER
==>THROWING STONES
==>SAMSON AND DELILAH
==>TRUCKING
==>CASEY JONES
==>HIGH TIME
...
gremlin> $v/outE[@label='followed_by']/@weight
==>2
==>8
==>1
==>2
==>1
==>1
...
48. Introduction to PageRank
• The remainder of this section will discuss the PageRank algorithm and
its application to multi-relational graphs.
• The arguments made and the examples presented generalize to all other
single-relational graph algorithms. However, for the sake of brevity and
consistency, only PageRank will be discussed.
49. Introduction to Matrix-Based PageRank
• PageRank is a centrality measure based on the primary eigenvector of a modified
version of a graph. Let $A \in \mathbb{R}_+^{|V| \times |V|}$ denote the adjacency matrix
representing the graph.
• In order to ensure positive real values in the eigenvector, the graph must be
strongly connected. PageRank induces strong connectivity by overlaying a low
probability (defined by $\alpha \in [0, 1]$, usually 0.15) “teleportation” graph over the
original graph. Let $B \in \left\{ \frac{1}{|V|} \right\}^{|V| \times |V|}$ denote a teleportation adjacency
matrix where every vertex is connected to every vertex with equal probability.
$C = (1 - \alpha) A + \alpha B$, where $C \in \mathbb{R}_+^{|V| \times |V|}$
$\lambda = \lambda C$, where $\lambda \in \mathbb{R}_+^{|V|}$ is the PageRank vector over $V$.
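A minimal power-iteration sketch of these equations follows, in plain Python. It rests on assumptions the slide does not spell out: the rows of A are normalized to sum to 1 so that C is a stochastic matrix, every vertex has at least one outgoing edge, and the three-vertex graph and the function name `pagerank` are hypothetical.

```python
def pagerank(adjacency, alpha=0.15, iterations=100):
    """Approximate the vector satisfying rank = rank * C by repeated multiplication."""
    n = len(adjacency)
    # Row-normalize A (assumes no row is all zeros).
    A = [[w / sum(row) for w in row] for row in adjacency]
    # C = (1 - alpha) * A + alpha * B, where every entry of B is 1/n.
    C = [[(1 - alpha) * A[i][j] + alpha / n for j in range(n)] for i in range(n)]
    rank = [1.0 / n] * n
    for _ in range(iterations):
        # One step of the row-vector product rank * C.
        rank = [sum(rank[i] * C[i][j] for i in range(n)) for j in range(n)]
    return rank


# Hypothetical graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
adjacency = [
    [0, 1, 1],
    [0, 0, 1],
    [1, 0, 0],
]
ranks = pagerank(adjacency)
print([round(r, 3) for r in ranks])
```

Because every row of C sums to 1, the entries of the rank vector keep summing to 1 across iterations, so no renormalization step is needed in this sketch.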
50. Introduction to Random Walk-Based PageRank
• PageRank can be implemented by a random walk.
• Create a vertex counter map, m : V → N+.
• Place a walker on a random vertex in V . Denote the walker’s current
vertex i ∈ V .
1. increment the vertex counter by 1 (i.e. m(i) ← m(i) + 1).
2. the walker chooses a random adjacent vertex with probability 1 − α.
3. the walker chooses a random vertex in V with probability α.
4. rinse and repeat until m reaches a stationary probability distribution
(continually normalize m if you want a probability distribution).
• We will use this random walk model in the Gremlin examples to follow.
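The random walk can be sketched in Python as follows. The sketch treats α as the teleportation probability, matching C = (1 − α)A + αB; the function name, the fixed seed, the step count, and the three-vertex graph are all assumptions chosen for illustration.

```python
import random


def random_walk_pagerank(out_neighbors, alpha=0.15, steps=20000, seed=0):
    """Estimate PageRank as the fraction of steps a walker spends at each vertex."""
    rng = random.Random(seed)
    vertices = list(out_neighbors)
    counts = {v: 0 for v in vertices}
    current = rng.choice(vertices)
    for _ in range(steps):
        counts[current] += 1                      # m(i) <- m(i) + 1
        neighbors = out_neighbors[current]
        if neighbors and rng.random() >= alpha:
            current = rng.choice(neighbors)       # follow an edge
        else:
            current = rng.choice(vertices)        # teleport (or escape a dead end)
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()}


# Hypothetical graph: a -> b, a -> c, b -> c, c -> a
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = random_walk_pagerank(graph)
print(ranks)
```

The estimate is noisy for short walks; longer walks bring the normalized counts closer to the stationary distribution of the matrix formulation.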
51. PageRank over Multi-Relational Graphs
• PageRank was designed for single-relational graphs (i.e. where all edges
have the same meaning).
• In a multi-relational graph, what does it mean to find the centrality
of a vertex when vertices can be related by various types of edges?
For example, if there exist “socializes with” and “met once” edges, then a
person who has “met once” many people could be the most centrally located
in the graph. Also, what if your graph has more than just “person”-type
vertices (e.g. cars, pets, buildings, articles, etc.) and “person”-type
edges (e.g. owns, walks, livesAt, cites, etc.)?
52. PageRank over Multi-Relational Graphs
[Figure: people vertices (Herbert, Johan, Marko, Josh, Jen, ...) connected to one another by knows edges, each with a type edge to a single Person vertex.]
• Calculating single-relational PageRank would yield Person as the most central
vertex.
• You can boolean filter certain edge labels (e.g. ignore type edges; in such
cases, you would have the centrality scores over the knows social graph).
• However, what if you only wanted to traverse knows edges if and only if the
adjacent vertex knows more than 10 other people?
• In the end, you want complete control (universal computability) over the
paths that the traverser/walker can take through a graph.
53. PageRank over Multi-Relational Graphs
• In multi-relational graphs, the meaning of your graph algorithm’s results is
defined by your definition of adjacency.
• With respect to random walk-based PageRank, define the path that the walker
should take. That path is the definition of adjacency.
• The stationary probability distribution created from this walk yields a path-dependent
centrality.
• Thus, in a multi-relational graph, there are many types of PageRanks that can
be calculated — one for each type of path defined for a walker.
Rodriguez, M.A., “Grammar-Based Random Walkers in Semantic Networks”, Knowledge-Based Systems,
21(7), 727–739, http://arxiv.org/abs/0803.4355, October 2008.
54. PageRank over “Garcia Followed By” SubGraph
• Define a path that will go from song-to-song by “followed_by” edges and
only traverse songs that are “sung_by” Jerry Garcia.
(./outE[@label='followed_by']/inV/outE[@label='sung_by']
/inV[@name='Garcia']/../..)[g:rand-nat()]
[Figure: the path applied to a song (.) with three followed_by edges, with steps labeled A through D. Two branches reach a sung_by edge ending at name="Garcia" and one reaches name="Weir"; /../.. steps back from the Garcia vertex to the songs, and g:rand-nat() selects one of them.]
55. PageRank over “Garcia Followed By” SubGraph
path garcia-followed_by
  (./outE[@label='followed_by']/inV/outE[@label='sung_by']
    /inV[@name='Garcia']/../..)[g:rand-nat()]
end
$m := g:map()
$alpha := 0.15
$_ := g:key('type', 'song')[g:rand-nat()]
repeat 2500
  $_ := ./garcia-followed_by
  if count($_) > 0
    g:op-value('+',$m,$_[1]/@name, 1.0)
  end
  if g:rand-real() < $alpha or count($_) = 0
    $_ := g:key('type', 'song')[g:rand-nat()]
  end
end
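The same loop can be sketched in Python to show how the path definition, rather than the raw edge set, decides where the walker may step. Everything below is illustrative: the miniature graph uses placeholder song names, and the function names `garcia_followed_by` and `path_pagerank` and the fixed seed are assumptions.

```python
import random

# Hypothetical miniature graph: (tail, label, head)
EDGES = [
    ("song-a", "followed_by", "song-b"),
    ("song-a", "followed_by", "song-c"),
    ("song-b", "followed_by", "song-c"),
    ("song-c", "followed_by", "song-a"),
    ("song-c", "followed_by", "song-b"),
    ("song-a", "sung_by", "Garcia"),
    ("song-b", "sung_by", "Weir"),
    ("song-c", "sung_by", "Garcia"),
]
SONGS = ["song-a", "song-b", "song-c"]


def garcia_followed_by(song):
    """Songs that follow `song` and are sung by Garcia (the path before [g:rand-nat()])."""
    followers = [head for tail, label, head in EDGES
                 if tail == song and label == "followed_by"]
    return [s for s in followers if (s, "sung_by", "Garcia") in EDGES]


def path_pagerank(steps=2500, alpha=0.15, seed=0):
    rng = random.Random(seed)
    counts = {}
    current = rng.choice(SONGS)
    for _ in range(steps):
        candidates = garcia_followed_by(current)
        if candidates:
            current = rng.choice(candidates)      # the [g:rand-nat()] selection
            counts[current] = counts.get(current, 0) + 1
        if not candidates or rng.random() < alpha:
            current = rng.choice(SONGS)           # teleport to a random song
    return counts


counts = path_pagerank()
print(sorted(counts.items(), key=lambda kv: kv[1], reverse=True))
```

Swapping in a different path function changes the definition of adjacency, and with it the ranking, without touching the walking loop.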
56. PageRank over “Garcia Followed By” SubGraph
gremlin> g:sort($m,'value',true())
==>CRAZY FINGERS=98.0
==>HES GONE=85.0
==>CHINA CAT SUNFLOWER=79.0
==>BERTHA=76.0
==>UNCLE JOHNS BAND=74.0
==>TERRAPIN STATION=72.0
==>GOING DOWN THE ROAD FEELING BAD=71.0
==>WHARF RAT=71.0
==>EYES OF THE WORLD=65.0
==>COLD RAIN AND SNOW=62.0
==>SHIP OF FOOLS=58.0
==>RAMBLE ON ROSE=53.0
==>CASEY JONES=51.0
==>DARK STAR=47.0
==>DEAL=46.0
...
57. Universal Computation in Paths
path path-name
  # any arbitrary computation can occur here
end
• A path definition can be used to define adjacencies.
  Adjacency can be expressed as anything that can be computed by a Turing machine.
  Path definitions are used to create “semantically meaningful” results from
  single-relational graph algorithms applied to multi-relational graphs.
  Path definitions make explicit what is implicit in the structure of the graph. This
  has applications to knowledge-based reasoning.
• A path definition can perform any arbitrary computation.
  Path definitions can check/set vertex/edge properties.
  Path definitions can create new vertices and edges.
  Path definitions can call/define functions.
This allows fine grained control over how your traverser/walker moves through a graph.
58. Outline
• Introduction to Graphs and Graph Software
• Basic Gremlin Concepts
• Gremlin Language Description
• Advanced Gremlin Concepts
• Conclusions
59. The Current Gremlin EcoSystems
• Webling: Web console for Gremlin
(developed by Pavel Yaskevich w/ funding from Neo Technology)
• Project Gargamel: Distributed Graph Computing
(uses Linked Process and Gremlin)
• ReXster: A Graph-Based Recommender Engine
60. Thank You
Please enjoy Gremlin at http://gremlin.tinkerpop.com ...
My homepage is http://markorodriguez.com.
Please feel free to contact me with any questions or comments.