When Relational Isn't Enough: Neo4j at Squidoo

Neo4j at Seth Godin’s Squidoo
with Chief Engineer Gil Hildebrand

What’s Squidoo?

Passionate people sharing the ideas they care about
Social publishing platform with over 3 million users
100 million+ pageviews per month, Quantcast ranked #35 in the US

Introducing Postcards
A brand new product from Squidoo
Currently in private beta (not public just yet)
Single-page, beautifully designed personal recommendations of books, movies, music albums, quotes, and other products and media types

Semantic Web

A group of methods and technologies to allow machines to understand the meaning - or "semantics" - of information

Postcards get better with the Semantic Web
We parse web pages and external APIs to extract meaning.
Web pages: Meta and Open Graph tags (Title, Description, Photo, and Video)
External APIs: Amazon, IMDB, Freebase, Google, YouTube, Bing, and more
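
As a rough illustration of the web page half of that step, here is a minimal Python sketch that pulls Title, Description, Photo, and Video out of Meta and Open Graph tags using only the standard library. The class and function names, the sample page, and the choice to prefer Open Graph values over plain meta tags are assumptions made for this example, not Squidoo's crawler.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collects the <title> text and every <meta name/property=...> value."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            # Open Graph uses property="og:...", classic meta tags use name="..."
            key = attrs.get("property") or attrs.get("name")
            content = attrs.get("content")
            if key and content:
                self.meta[key.lower()] = content.strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title and data.strip():
            self.meta.setdefault("title", data.strip())


def extract_metadata(html):
    """Return the fields named on the slide, preferring Open Graph values."""
    parser = MetaTagParser()
    parser.feed(html)
    found = parser.meta
    return {
        "title": found.get("og:title") or found.get("title"),
        "description": found.get("og:description") or found.get("description"),
        "photo": found.get("og:image"),
        "video": found.get("og:video"),
        "type": found.get("og:type"),
    }


# Hypothetical page used only to exercise the parser.
SAMPLE_PAGE = """
<html><head>
<title>Hotel California - Eagles</title>
<meta property="og:title" content="Hotel California">
<meta property="og:type" content="music.album">
<meta property="og:image" content="http://example.com/cover.jpg">
<meta name="description" content="1976 album by the Eagles">
</head><body></body></html>
"""

metadata = extract_metadata(SAMPLE_PAGE)
```

When a page declares an `og:type`, the crawler gets the media type for free; when it does not, the title alone is ambiguous.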

Problem is normalization

The meta tag “Hotel California” on a web page is not particularly useful unless I know the tag is music related - then I can search for music albums containing Hotel California.
This is not easy, but the web as a whole is becoming more structured.

Connecting the Dots

Crawl a web page or API to extract metadata
Store subjects, nouns, adjectives, and possessives into Neo
Query Neo to organize subjects into Stacks based on nouns, adjectives, and possessives

Stacking Up
Postcards are organized into Stacks. Stacks are a taxonomy based on media type and other common factors. Examples:
Books Stack
Crime Novel Books Stack
Tom Clancy Books Stack
Stacks are created automatically based on the metadata associated with each Postcard.
A minimum of three Postcards is required for a Stack to exist.
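
The automatic creation rule can be sketched in a few lines of Python. This is an in-memory illustration only: the dictionary layout, the function names, and the sample Postcards are assumptions, and the deck itself does this with Neo4j queries rather than a loop.

```python
from collections import defaultdict

MIN_POSTCARDS_PER_STACK = 3  # from the slide: a Stack needs at least three Postcards


def candidate_stacks(postcard):
    """Yield every Stack key a Postcard could belong to, broadest first."""
    noun = postcard["noun"]
    yield (noun,)
    for adjective in postcard.get("adjectives", []):
        yield (adjective, noun)
    for possessive in postcard.get("possessives", []):
        yield (possessive, noun)


def build_stacks(postcards):
    """Group Postcards by shared metadata and drop groups that are too small."""
    stacks = defaultdict(list)
    for postcard in postcards:
        for key in candidate_stacks(postcard):
            stacks[key].append(postcard["subject"])
    return {key: subjects for key, subjects in stacks.items()
            if len(subjects) >= MIN_POSTCARDS_PER_STACK}


# Hypothetical sample data.
POSTCARDS = [
    {"subject": "The Hunt for Red October", "noun": "Book",
     "adjectives": ["Thriller"], "possessives": ["Tom Clancy"]},
    {"subject": "Patriot Games", "noun": "Book",
     "adjectives": ["Thriller"], "possessives": ["Tom Clancy"]},
    {"subject": "Clear and Present Danger", "noun": "Book",
     "adjectives": ["Thriller"], "possessives": ["Tom Clancy"]},
    {"subject": "The Big Sleep", "noun": "Book",
     "adjectives": ["Crime Novel"], "possessives": ["Raymond Chandler"]},
]

stacks = build_stacks(POSTCARDS)
```

With this data the Books, Thriller Books, and Tom Clancy Books Stacks exist, while Crime Novel Books does not yet, because only one Postcard carries that descriptor.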

Modeling Taxonomy
We found that the “Parts of Speech” are a great way to model the Postcards taxonomy.
All Postcards have:
Name of the item (subject)
Domains or media types (nouns)
Descriptors (adjectives)
Owners or creators (possessives)
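
One way to picture the model is as a small record type, sketched here in Python. The `Postcard` class, the `stack_title` helper, and its naive "add an s" pluralization are hypothetical and exist only to show how the four parts of speech combine into a Stack name.

```python
from dataclasses import dataclass, field


@dataclass
class Postcard:
    subject: str                                      # name of the item
    nouns: list = field(default_factory=list)         # domains or media types
    adjectives: list = field(default_factory=list)    # descriptors
    possessives: list = field(default_factory=list)   # owners or creators


def stack_title(noun, adjective=None, possessive=None):
    """Compose a Stack name as possessive + adjective + plural noun."""
    parts = []
    if possessive:
        parts.append(possessive + "'s")
    if adjective:
        parts.append(adjective)
    parts.append(noun + "s")  # naive pluralization, enough for this sketch
    return " ".join(parts)


# Hypothetical example Postcard.
card = Postcard(
    subject="Purple Cow",
    nouns=["Book"],
    adjectives=["Business"],
    possessives=["Seth Godin"],
)

title = stack_title(card.nouns[0], card.adjectives[0], card.possessives[0])
```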

Modeling with our existing DB platforms

Very familiar with MySQL.
Extremely reliable.
Relational model makes normalization possible, but scaling is a concern as joins get larger and larger.

Schema

CREATE TABLE post_meta (
    post_id BIGINT,
    user_id VARCHAR(255),
    date_created SMALLINT,
    subject VARCHAR(255),
    noun VARCHAR(255),
    KEY (user_id),
    KEY (date_created),
    KEY (subject),
    KEY (noun)
);

CREATE TABLE adjectives (
    post_id BIGINT,
    user_id VARCHAR(255),
    adjective VARCHAR(255),
    PRIMARY KEY (user_id, adjective),
    KEY (adjective)
);

CREATE TABLE possessives (
    post_id BIGINT,
    user_id VARCHAR(255),
    possessive VARCHAR(255),
    PRIMARY KEY (user_id, possessive),
    KEY (possessive)
);

Queries

Seth Godin’s Business Books

SELECT m.post_id FROM post_meta m
JOIN possessives USING(user_id)
JOIN adjectives USING(user_id)
WHERE possessive='Seth Godin'
AND adjective='Business'
AND noun='Book';

90s Rock Music Albums

SELECT m.post_id FROM post_meta m
JOIN adjectives USING(user_id)
WHERE adjective='Rock'
AND noun='Music'
AND date_created BETWEEN 1990 AND 1999;

At Squidoo, used primarily for analytics.
Massively scalable, but no relational model or aggregation features. Heavy denormalization required.
Many operations have to be performed asynchronously using queues or batch processes.

Truly Relational
Our data model is very much a graph problem
Recommendation systems are one query away (easy!)
Meets all our tech requirements

Evaluating Tech Requirements

High availability
Great administrative tools
Great PHP wrapper
https://github.com/jadell/neo4jphp
Commercial support

Learning to think in graphs was HARD, but now feels NATURAL

Should it be a node or a property?

Which direction should the relationship point?

More so than any other type of database I’ve encountered, graph DBs require you to know in advance exactly what queries you’ll need to perform.

Reviewing Sample Graphs (It Helps)

Official Examples: http://bit.ly/RzCDY9
5 Common Graphs: http://slidesha.re/cnomwz
Movies: http://bitly.com/QZbGw0

Designing with paper or flow chart

First Prototype

Basic HTML
REST API only
Easy to get started, but the real power comes from Cypher

Extending the Prototype with Cypher

Implement Cypher for recommendations and other traversals.
Cypher looks intimidating at first, and the “it’s like SQL” analogy was not particularly helpful for me.
However, Cypher is essential for using Neo’s most powerful features, and is worth learning. Once you get past the strange (but necessary) arrow syntax, it does start to feel like SQL.

Tip #1: Use reference nodes

START ref=node:Meta(title = "Actor")
MATCH ref<-[:IS]-actor
RETURN actor;

Tip #2: Use reference properties

foreach ($posts as $post) {
    if ($post->getProperty('type') == 'Actor') {
        // do something special for actors
    }
}

Tip #3: Schema Changes
At first, there were a lot of schema changes during development
No equivalent to MySQL’s ALTER TABLE or TRUNCATE TABLE
Two options:
Shut down Neo, rm -rf data/graph.db/*, and restart
Or use this plugin: http://bitly.com/rHFSu6
With the plugin, node IDs do not restart from zero

Tip #3.1: Schema Changes
Wiped your DB and need to start over? Use an initialization script to set things up.

function initialize() {
    // Node 0 is the graph's reference node; use it as the root of the taxonomy
    $master = $this->client->getNode(0);
    $master->setProperty('title', 'Master')->setProperty('parent', '')->save();

    // should be node 1
    $user_master = $this->client->makeNode();
    $user_master->save();
    $user_index = new Everyman\Neo4j\Index\NodeIndex($this->client, 'users');
    $user_index->save();

    $post_index = new Everyman\Neo4j\Index\NodeIndex($this->client, 'post');
    $post_index->save();

    $index = new Everyman\Neo4j\Index\NodeIndex($this->client, 'master');
    $nouns = array('Movie', 'Music', 'TV', 'Book', 'Video', 'Article', 'Photo', 'Product', 'Game', 'Squidoo');

    // One reference node, index entry, and dedicated index per noun
    foreach ($nouns as $noun) {
        $node = $this->client->makeNode();
        $node->setProperty('title', $noun)->setProperty('type', 'master')->save();
        $index->add($node, 'noun', $noun);
        $index->save();
        $node->relateTo($master, 'IS')->save();

        $noun_index = new Everyman\Neo4j\Index\NodeIndex($this->client, $noun);
        $noun_index->save();
    }
}

Nouns

“Noun” is our word for the domain or media type associated with a Postcard

Movie Noun
Just one example. We have books, music albums, products, and many others!

Single User’s Stack about Director Martin Scorsese

START user=node({user_id})
MATCH user-[:POSTED]->post-[:POST]->subject-[:`BY`]->possessive
WHERE possessive.title={meta} AND subject.type={noun}
RETURN DISTINCT post, COLLECT(subject) as subject;

{user_id} = 123
{meta} = 'Martin Scorsese'
{noun} = 'Movie'
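
For readers still getting used to the arrow syntax, the same traversal can be followed by hand over a tiny in-memory graph. This Python sketch is only an analogy for what the MATCH pattern walks: the node IDs, edge list, and helper names are invented for illustration, and it is not how Neo4j executes the query.

```python
# Hypothetical node properties, keyed by a made-up node ID.
NODES = {
    "user:123": {}, "user:456": {},
    "post:1": {}, "post:2": {}, "post:3": {}, "post:4": {},
    "subject:Goodfellas": {"type": "Movie"},
    "subject:Jaws": {"type": "Movie"},
    "subject:Taxi Driver": {"type": "Movie"},
    "subject:Casino": {"type": "Movie"},
    "possessive:Scorsese": {"title": "Martin Scorsese"},
    "possessive:Spielberg": {"title": "Steven Spielberg"},
}

# Directed relationships as (start, type, end), mirroring the arrows in MATCH.
EDGES = [
    ("user:123", "POSTED", "post:1"),
    ("user:123", "POSTED", "post:2"),
    ("user:123", "POSTED", "post:3"),
    ("user:456", "POSTED", "post:4"),
    ("post:1", "POST", "subject:Goodfellas"),
    ("post:2", "POST", "subject:Jaws"),
    ("post:3", "POST", "subject:Taxi Driver"),
    ("post:4", "POST", "subject:Casino"),
    ("subject:Goodfellas", "BY", "possessive:Scorsese"),
    ("subject:Jaws", "BY", "possessive:Spielberg"),
    ("subject:Taxi Driver", "BY", "possessive:Scorsese"),
    ("subject:Casino", "BY", "possessive:Scorsese"),
]


def outgoing(node, rel_type):
    """Follow every outgoing relationship of one type from a node."""
    return [end for start, rel, end in EDGES if start == node and rel == rel_type]


def user_stack(user, meta, noun):
    """user-[:POSTED]->post-[:POST]->subject-[:BY]->possessive, with the WHERE filters."""
    results = {}
    for post in outgoing(user, "POSTED"):
        for subject in outgoing(post, "POST"):
            if NODES[subject].get("type") != noun:
                continue
            for possessive in outgoing(subject, "BY"):
                if NODES[possessive].get("title") == meta:
                    results.setdefault(post, []).append(subject)
    return results


scorsese_stack = user_stack("user:123", "Martin Scorsese", "Movie")
```

Starting from one user keeps the walk small: Casino is a Scorsese movie too, but it was posted by a different user and is never visited.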

Finding Stacks for a Postcard

START post=node:post(post_id={post_id})
MATCH post-[:POST]->subject-->adjective-[:IS]->parent
RETURN subject, adjective, parent;

Finding a user’s “Liked” Postcards

START user=node({user_id})
MATCH user-[:LIKED]->post-[:POST]->subject
RETURN DISTINCT post, COLLECT(subject) as subject;

Popularity Sorting

Popularity is based on Likes, Comments, and other social signals, using a time decay factor to favor newer Postcards.
It was difficult to find an algorithm that let us support time decay without having to constantly re-score all Postcards.
Long story short, we use Cypher’s ORDER BY for sorting. We perform a calculation based on the pop_score and pop_date properties that exist in each Postcard node.
An individual Postcard’s pop_score and pop_date are updated in real time when someone interacts with it.
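
The deck does not give the actual formula, so the following Python sketch shows one common way a stored score and timestamp pair can support time decay without re-scoring everything: exponential decay evaluated at sort time. The half-life, the interaction weights, and the function names are all assumptions, not Squidoo's algorithm.

```python
WEEK = 7 * 24 * 3600
HALF_LIFE_SECONDS = WEEK  # assumed: a stored score loses half its value every week


def decayed_score(pop_score, pop_date, now):
    """Value of a stored score at time `now`, given when it was last updated."""
    age = max(0.0, now - pop_date)
    return pop_score * 0.5 ** (age / HALF_LIFE_SECONDS)


def record_interaction(postcard, weight, now):
    """Fold one Like or Comment into the stored pair; only this Postcard is touched."""
    current = decayed_score(postcard["pop_score"], postcard["pop_date"], now)
    postcard["pop_score"] = current + weight
    postcard["pop_date"] = now


def sort_by_popularity(postcards, now):
    """The calculation a sort step would perform: newest-adjusted score, descending."""
    return sorted(
        postcards,
        key=lambda p: decayed_score(p["pop_score"], p["pop_date"], now),
        reverse=True,
    )


# Hypothetical Postcards: an old favorite and a fresher, less-liked one.
old_card = {"title": "old", "pop_score": 10.0, "pop_date": 0}
new_card = {"title": "new", "pop_score": 4.0, "pop_date": 2 * WEEK}
now = 2 * WEEK

ranking_before = [p["title"] for p in sort_by_popularity([old_card, new_card], now)]
record_interaction(old_card, 2.0, now)  # someone likes the old Postcard today
ranking_after = [p["title"] for p in sort_by_popularity([old_card, new_card], now)]
```

Because decay is applied when reading, idle Postcards never need a batch update; a write happens only when someone interacts with a Postcard.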

Next Steps

Follow Users and Stacks (Activity Stream)
Load Balancing
Disambiguation

The End

Gil Hildebrand
gil@squidoo.com
