14. New Solution to the Bacon Problem $keanu = $actorIndex->find('name', 'Keanu Reeves'); $kevin = $actorIndex->find('name', 'Kevin Bacon'); $path = $keanu->findPathTo($kevin);
* Six degrees game * Relational databases can't easily answer certain types of questions
* first pass using a relational database * cast table: actor_name, movie_title * hard to visualize the solution * In order to do this, you need to do multiple passes or joins
* Each degree adds a join * Increases complexity * Decreases performance * Stop when the actor you're looking for is in the list
* this problem highlights the ugly truth about RDBs * they weren't designed to handle these types of problems. * arbitrary path query * RDB relationships join data, but are not data in themselves * Set math * Gather everything in the set that matches these criteria, then tell me if this thing is in the set * 1 set, no problem * 2nd set no problem * 3rd set not related to 1st * 4th not related to 2nd * 5th related to 1st and 4th * etc. * Relationships are only available between overlapping sets
* disjoint sets
* Graphs * Not X-Y * Computer Science definition of graphs * A graph is an ordered pair G = (V, E) where V is a set of vertices and E is a set of edges , which are pairs of vertices. * Node : vertex * Relationship : edge * Property : meta-datum attached to a node or relationship * Nodes can have arbitrary properties * Relationships are first-class citizens Have a type Have properties Have a direction Domain semantics Traversable in any direction * This is how graph dbs solve the problems that RDBs can't * Path : an ordered list of nodes and relationships * Paths are found using traversal algorithms
* Tree data-structures * Networks * Maps * vehicles on streets == packets through network * Relational databases are graphs!
* Make each record a node * Make every foreign key a relationship * RDB indexes are usually stored in a tree structure * Trees are graphs * Why not use RDBs? * The trouble with RDBs is how they are stored in memory and queried * Require a translation step from memory blocks to graph structure * Relationships not first-class citizens * Many problem domains map poorly to rows/tables
* Big Data ** billions of nodes and relationships in a single instance * "Internet of Things" buzzword * Social networking - friends of friends of friends of friends * Assembly/Manufacturing - 1 widget contains 3 gadgets each contain 2 gizmos * Map directions - starting at my house find a route to the office that goes past the pub * Multi-tenancy - root node per tenant * all queries start at root * No overlap between graphs = no accidental data spillage * Fraud: track transactions back to origination * Pretty much anything that can be drawn on a whiteboard
* Example: retail system * Customer makes Order * Store sells Order * Order contains Items * Supplier supplied Items * Customer rates Items * Did this customer rank supplier X highly? * Which suppliers sell the highest rated items? * Does item A get rated higher when ordered with Item B? * All can be answered with RDBs as well * Not as elegant * Not as performant
* This is where the power of graph dbs comes from * Paths - find any relationship chain between A and B * Kevin Bacon example, known start and end * Traversal - filter out paths that don't meet criteria * Complex path finding, base next decision on existing path from start to current position * Define path-finding (prune) and result filtering functions * Queries - Here is what I want, find it however you can * SPARQL, Gremlin, Cypher
* Actors are nodes * Movies are nodes * Relationship: Actor is IN a movie * pseudo-code shortened for brevity * Compare to degree selection join queries
* Cypher is "what to find" * describe the "shape" of the thing you're looking for * Very white-board friendly * Pros: easy to understand, query looks like domain model * Cons: not as powerful, not fully featured (YET) * result set is an array of arrays
* RDBs are really good at data aggregation * Set math, duh * Have to traverse the whole graph in order to do aggregation * Truly tabular means not a lot of relationships between the data types
* billions of nodes and relationships in a single instance * cluster replication * transactions * native bindings for Ruby, Python, and language that can run in JVM * Licensing