Neo4j - Tales from the Trenches

A RECOMMENDATION ENGINE CASE STUDY

Michal Bachman & Nicki Watt
@bachmanm & @techiewatt

Who we are …
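• Nicki Watt (role = “consultant”) --works on--> Opigram
• Nicki Watt --works for--> OpenCredo
• Michal Bachman (role = “consultant”) --works on--> Opigram
• Michal Bachman --works for--> OpenCredo
• Nicki Watt --colleague of--> Michal Bachman
• Opigram --uses--> Neo4J
• OpenCredo --partner of--> Neo4J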

Opigram
• http://labs.yougov.co.uk
• Opinion Profile
• Social Network (TBD)
• Recommendation Engine
• CMS

Opigram

• Panelists --provides--> Opinions --about--> Things (themselves)
• Opinions --generates--> Recommendations / Interesting Insights --about--> Things
• Example insights
– People who like … also tend to like …
– People who like … tend to support …
– People who like … describe themselves as …

Opigram
• Started Feb 2011
• Nov 2011
– OpenCredo
– Many lessons learned
• Stats
– ~ 150k panelists (a.k.a. users)
– ~ 100k “things” (movies, books,…)
– ~ 8M relationships

Neo4J
• Graph Database
• Schema-less (NoSQL)
• Vertices and Edges
• a.k.a. Nodes and Relationships
• Traversals
• Version 1.7 just released!

Opigram + Neo4J
• Taxonomy of “things”
• Opinions on “things”
• Recommendations
• Offline “Crunching”

Opigram + MySQL
• CMS Functionality
• Crunching Results
• Configuration / Metadata

Lessons Learned
• Everyone loves Neo4J! Find praise online
• “Trenches Talk” - Aiming to provide
insight into some real problems
encountered and approaches to solutions
• We have 5 practical lessons for you
– Tips
– Tricks
– Troubles

Lessons Learned
• Lesson 1: Graph “Schema”
• Lesson 2: Neo Node IDs
• Lesson 3: Graph-wide Operations
• Lesson 4: Extracting Randomised Data
• Lesson 5: Multi-threading

Schema-less ≠

Credit: Greencolander

• Pulp Fiction --type--> Movie
• Michal --review--> Pulp Fiction (text = “…”, descriptors = Cool, Funny)
• Pulp Fiction --described as--> Cool (votes = 1)
• Pulp Fiction --described as--> Funny (votes = 1)
• Cool, Boring, Funny, Romantic --type--> Descriptor

• Pulp Fiction --type--> Movie
• Michal --created--> Review (text = “…”)
• Review --review of--> Pulp Fiction
• Review --described as--> Cool
• Review --described as--> Funny
• Cool, Boring, Funny, Romantic --type--> Descriptor

Neo Node IDs
• What are they?
• Can I use them to represent my keys?
– No!
• Why not?
– Not stable
– IDs are garbage collected (and reused) over time, thus only guaranteed to be unique during a specific time span

Example

“User Transformation”

MySQL
USER_ID   NEO_NODE_ID   ACTIVE
101       1             Y
102       2             Y
103       3             N / Y

Graph (Neo node IDs in brackets)
• Michal [1], Nicki [2], Jim [3] --type--> Panelist [4]
• Cool, Boring, Funny, Romantic --type--> Descriptor (IDs 5, 6, 7, 8 shown on these nodes)

Jim is now Cool!
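
One way to read the example above: a node is deleted, its internal ID is later handed to a new node, and an external table still points at the old ID. The sketch below is a toy model in plain Python, not Neo4j code; `ToyStore` and its ID-recycling policy are assumptions made purely for illustration.

```python
class ToyStore:
    """Toy node store that recycles freed IDs (illustrative assumption,
    not Neo4j's real allocator)."""

    def __init__(self):
        self._nodes = {}       # internal id -> properties
        self._free_ids = []    # ids released by deletes, available for reuse
        self._next_id = 1

    def create(self, **props):
        if self._free_ids:
            node_id = self._free_ids.pop()   # reuse a freed id first
        else:
            node_id = self._next_id
            self._next_id += 1
        self._nodes[node_id] = props
        return node_id

    def delete(self, node_id):
        del self._nodes[node_id]
        self._free_ids.append(node_id)

    def get(self, node_id):
        return self._nodes[node_id]


store = ToyStore()
michal = store.create(name="Michal")
nicki = store.create(name="Nicki")
jim = store.create(name="Jim")

# External table (MySQL in the talk) that stores the internal node id.
user_to_node = {101: michal, 102: nicki, 103: jim}

store.delete(jim)                  # Jim's node goes away...
cool = store.create(name="Cool")   # ...and its id is handed to a new node

# The stale mapping for user 103 now resolves to the wrong node.
resolved = store.get(user_to_node[103])["name"]
```

In this toy run the row for user 103 resolves to the "Cool" node, which is the failure the slide jokes about.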

Alternate ID Strategies
• Client provided IDs
– Add as a standard property on the node
– Add to index (or use auto indexer)
• Natural vs. Synthetic IDs
• Auto generate your own IDs
– Hook into Neo4J Transaction Kernel
– Use auto indexer

Auto generate your own IDs
1) Implement TransactionEventHandler

2) Register TransactionEventHandler with graphDatabaseService

3) Turn auto indexing on for seamless generation
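
The three steps above can be mimicked in a few lines of plain Python. This is a hedged sketch of the idea only: `ToyGraph`, `register_handler`, `make_uid_assigner` and the `uid` property are hypothetical names, and the dictionary `index` merely stands in for an auto index. It does not use the Neo4j API.

```python
import itertools


class ToyGraph:
    """Toy graph with before-commit hooks (hypothetical, for illustration)."""

    def __init__(self):
        self.nodes = []
        self.handlers = []
        self.index = {}   # stands in for an auto index on the "uid" property

    def register_handler(self, handler):
        # Step 2: register the handler with the graph.
        self.handlers.append(handler)

    def commit(self, created_nodes):
        # Handlers run before the data becomes visible.
        for handler in self.handlers:
            handler(created_nodes)
        for node in created_nodes:
            self.nodes.append(node)
            if "uid" in node:
                # Step 3: indexing happens automatically on commit.
                self.index[node["uid"]] = node


def make_uid_assigner(start=1):
    # Step 1: the handler that stamps an application-level id on new nodes.
    counter = itertools.count(start)

    def before_commit(created_nodes):
        for node in created_nodes:
            if "uid" not in node:   # keep client-provided ids untouched
                node["uid"] = next(counter)

    return before_commit


graph = ToyGraph()
graph.register_handler(make_uid_assigner())
graph.commit([{"name": "Michal"}, {"name": "Nicki"}])
graph.commit([{"name": "Jim", "uid": 103}])   # client-provided id

nicki = graph.index[2]
```

Because the generated `uid` is an ordinary property, it survives deletes and re-creates independently of whatever internal ID the store assigns.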

Lesson 2: Conclusion
Don’t use Neo Node IDs as your keys!!!
It’s a losing battle, ultimately the force
will not be with you!

credit: http://uk.xbox.gamespy.com

Lesson 3

Graph-wide Operations

Motivations
• Fixes
– Bugs
– Re-indexing
• “Schema” Migrations
• Data Export
• Data Analysis
• Count Caching

Lesson 3: Graph-wide Operations
• Batch Updates
• Delete relationships only from one side
• GlobalGraphOperations since 1.6
• No need for TX when reading

Example

Deleting “soft-deleted” relationships
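
A minimal sketch of such a clean-up, written against a toy in-memory model rather than the Neo4j API. The data layout (`nodes`, `rels`), the `deleted` flag and the `commit` callback are assumptions for illustration. It shows two of the points from the previous slide: work is committed in batches, and each relationship is handled only from its start node so it is never deleted twice.

```python
def delete_soft_deleted(nodes, rels, batch_size, commit):
    """Delete relationships flagged deleted=True, committing in batches.

    nodes: node name -> set of relationship ids touching that node
    rels:  relationship id -> {"start": ..., "end": ..., "deleted": bool}
    commit: callable invoked once per finished batch
    """
    deleted = 0
    pending = 0
    for node, touching in nodes.items():
        for rel_id in list(touching):   # copy: the set shrinks as we delete
            rel = rels[rel_id]
            # Every relationship is reachable from both of its nodes;
            # act only when we see it from its start node.
            if rel["start"] != node or not rel["deleted"]:
                continue
            nodes[rel["start"]].discard(rel_id)
            nodes[rel["end"]].discard(rel_id)
            del rels[rel_id]
            deleted += 1
            pending += 1
            if pending == batch_size:
                commit()                # close the batch, start a new one
                pending = 0
    if pending:
        commit()                        # flush the last partial batch
    return deleted


nodes = {"michal": {1, 2, 3}, "pulp_fiction": {1}, "cool": {2}, "funny": {3}}
rels = {
    1: {"start": "michal", "end": "pulp_fiction", "deleted": True},
    2: {"start": "michal", "end": "cool", "deleted": False},
    3: {"start": "michal", "end": "funny", "deleted": True},
}
commits = []
removed = delete_soft_deleted(nodes, rels, batch_size=1,
                              commit=lambda: commits.append("tx"))
```

With `batch_size=1` each deletion gets its own commit; a real job would pick a much larger batch.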

Lesson 4

Extracting Randomised Data

Extracting Randomised Data
• Use Cases
– Provide Random Suggestions to users
– Use for statistical analysis aka “Random
Sampling”
• Problem
– No built in Neo4J support
– Not Neo4J’s sweet spot
– May result in very bad performance

Options
• Randomisation Strategies
– “Load, Shuffle, Pick”
– “Hit and Miss”
– Custom Relationship Expander/Evaluator
– Reservoir Sampling
• Performance Helpers
– Indexes
– Front with a cache if need be
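
Two of the strategies listed above, sketched in plain Python over any iterable of candidates. How the candidates are produced (traversal, index lookup) is left out, and the function names are illustrative. “Load, Shuffle, Pick” needs the whole candidate set in memory, while reservoir sampling (the classic "Algorithm R") keeps only k items while streaming through the candidates once.

```python
import random


def load_shuffle_pick(items, k, rng=random):
    """Materialise every candidate, then pick k of them at random."""
    pool = list(items)                       # whole candidate set in memory
    return rng.sample(pool, min(k, len(pool)))


def reservoir_sample(items, k, rng=random):
    """Pick k items uniformly at random from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(items):
        if i < k:
            reservoir.append(item)           # fill the reservoir first
        else:
            j = rng.randint(0, i)            # inclusive on both ends
            if j < k:
                reservoir[j] = item          # replace with probability k/(i+1)
    return reservoir


rng = random.Random(2012)                    # fixed seed, repeatable runs
picked = reservoir_sample(iter(range(160000)), 25, rng)
shuffled = load_shuffle_pick(range(100), 25, rng)
too_few = reservoir_sample(range(3), 25, rng)
```

Reservoir sampling still has to visit every candidate once, so on a cold graph the cost is dominated by reading the candidates, not by the sampling itself.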

Traversals vs. Index
[Chart] 25 random nodes extracted from [Sample Size] using “Reservoir Sampling” algorithm
• X-Axis: Sample Size (5000, 10000, 20000, 40000, 80000, 160000)
• Y-Axis: Time (milliseconds), 0 to 45000
• Series
– 1.4.2 TRAVERSAL PASS 1 (COLD)
– 1.4.2 TRAVERSAL PASS 2 (WARMISH)
– 1.5 TRAVERSAL PASS 1 (COLD)
– 1.5 TRAVERSAL PASS 2 (WARMISH)
– 1.5 INDEX
– 1.6.2 TRAVERSAL PASS 1 (COLD)
– 1.6.2 TRAVERSAL PASS 2 (WARMISH)
• Use of Lucene indexes can reduce time to roughly 300 - 1000 ms from cold

Conclusion
• Most options are not “truly random”
more “randomish”
• Primarily has bad performance when
hitting cold parts of graph
• Caching helps
– If an option, serve stale data until next
random sample can be selected
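
The last bullet can be sketched as a small cache that keeps serving the previous sample until a fresh one has been fully computed. This is an assumption-laden illustration: `StaleSampleCache` and `compute_sample` are hypothetical names, and in practice `refresh()` would be called from a background job or timer.

```python
import itertools
import threading


class StaleSampleCache:
    """Serve the last computed sample until a new one is ready."""

    def __init__(self, compute_sample):
        self._compute = compute_sample
        self._lock = threading.Lock()
        self._sample = compute_sample()   # first sample is computed eagerly

    def get(self):
        # Cheap read: never triggers the slow sampling itself.
        with self._lock:
            return self._sample

    def refresh(self):
        fresh = self._compute()           # slow part runs outside the lock
        with self._lock:
            self._sample = fresh          # swap in the new sample at once


# Stand-in for an expensive random extraction from the graph.
batch_numbers = itertools.count(1)
cache = StaleSampleCache(lambda: "sample-%d" % next(batch_numbers))

first = cache.get()
still_first = cache.get()   # readers keep getting the stale sample
cache.refresh()
second = cache.get()
```

Readers never wait on the slow sampling step; they only ever see a complete old sample or a complete new one.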

Lesson 5: Multi-threading
• Shortcoming in Neo4J
• Fixed in version 1.7
• Avoid relationship properties in multi-threaded pre-1.7 apps

Beer Time!
• @bachmanm
• michal.bachman@opencredo.com

• @techiewatt
• nicki.watt@opencredo.com
