Neo4j - Tales from the Trenches
1. Neo4J – Tales from the Trenches
A RECOMMENDATION ENGINE
CASE STUDY
Michal Bachman & Nicki Watt
@bachmanm & @techiewatt
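2. Who we are …
[Graph diagram]
• Nicki Watt (role = “consultant”): works for OpenCredo, works on Opigram
• Michal Bachman (role = “consultant”): works for OpenCredo, works on Opigram
• Nicki Watt: colleague of Michal Bachman
• Opigram: uses Neo4J
• OpenCredo: partner of Neo4J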
4. Opigram
[Diagram]
• Panelists provide Opinions about Things, and About (themselves)
• This generates Recommendations / Interesting Insights
– People who like … also tend to like …
– People who like … tend to support …
– People who like … describe themselves as ….
8. Opigram
• Started Feb 2011
• Nov 2011
– OpenCredo
– Many lessons learned
• Stats
– ~ 150k panelists (a.k.a. users)
– ~ 100k “things” (movies, books,…)
– ~ 8M relationships
9. Neo4J
• Graph Database
• Schema-less (NoSQL)
• Vertices and Edges
• a.k.a. Nodes and Relationships
• Traversals
• Version 1.7 just released!
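10. Neo4J
[Graph diagram]
• Nicki Watt (role = “consultant”): works for OpenCredo, works on Opigram
• Michal Bachman (role = “consultant”): works for OpenCredo, works on Opigram
• Nicki Watt: colleague of Michal Bachman
• Opigram: uses Neo4J
• OpenCredo: partner of Neo4J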
11. Opigram + Neo4J
• Taxonomy of “things”
• Opinions on “things”
• Recommendations
• Offline “Crunching”
13. Lessons Learned
• Everyone loves Neo4J! Find praise online
• “Trenches Talk” - Aiming to provide
insight into some real problems
encountered and approaches to solutions
• We have 5 practical lessons for you
– Tips
– Tricks
– Troubles
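17. [Graph diagram: review modelled as a relationship]
• Michal: “review” relationship to Pulp Fiction, with text = “…” and descriptors = Cool, Funny
• Pulp Fiction: type Movie
• Pulp Fiction: described as Cool (votes = 1), described as Funny (votes = 1)
• Cool, Boring, Funny, Romantic: type Descriptor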
18. [Graph diagram: review modelled as a node]
• Michal: created Review (text = “…”)
• Review: review of Pulp Fiction
• Review: described as Cool, described as Funny
• Pulp Fiction: type Movie
• Cool, Boring, Funny, Romantic: type Descriptor
20. Neo Node IDs
• What are they?
• Can I use them to represent my keys?
– No!
• Why not?
– Not stable
– IDs are garbage collected over time, thus only guaranteed to be unique during a specific time span
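The recycling problem can be illustrated with a toy store. This is a minimal sketch, not Neo4J code: the class RecyclingStore and its methods are hypothetical, and the free-list behaviour is an assumption chosen only to show how a freed ID can end up pointing at a different node.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Toy store that hands freed IDs back out; an illustration only, not Neo4J. */
final class RecyclingStore {
    private final Map<Long, String> nodes = new HashMap<>();
    private final Deque<Long> freeIds = new ArrayDeque<>();
    private long nextId = 1;

    /** Creates a node, reusing a freed ID when one is available. */
    long create(String name) {
        long id = freeIds.isEmpty() ? nextId++ : freeIds.pop();
        nodes.put(id, name);
        return id;
    }

    /** Deletes a node and marks its ID as reusable. */
    void delete(long id) {
        if (nodes.remove(id) != null) {
            freeIds.push(id);
        }
    }

    String get(long id) {
        return nodes.get(id);
    }

    public static void main(String[] args) {
        RecyclingStore store = new RecyclingStore();
        store.create("Michal");
        store.create("Nicki");
        long jim = store.create("Jim"); // an external table stores this ID as a key
        store.delete(jim);              // Jim's node goes away
        store.create("Cool");           // the freed ID is handed to a new node
        System.out.println(store.get(jim)); // prints "Cool"
    }
}
```

Any external system that kept the old ID as a foreign key now silently resolves to the wrong node, which is the failure the slide warns about.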
22. [Diagram: MySQL table referencing Neo node IDs]
MySQL
USER_ID   NEO_NODE_ID   ACTIVE
101       1             Y
102       2             Y
103       3             N / Y
Graph
• Michal, Nicki, Jim: type Panelist
• Cool, Boring, Funny, Romantic: type Descriptor
• Nodes are labelled with Neo node IDs 1 to 8
Caption: Jim is now Cool !
23. Alternate ID Strategies
• Client provided IDs
– Add as a standard property on the node
– Add to index (or use auto indexer)
• Natural vs. Synthetic IDs
• Auto generate your own IDs
– Hook into Neo4J Transaction Kernel
– Use auto indexer
24. Auto generate your own IDs
1) Implement TransactionEventHandler
2) Register TransactionEventHandler with graphDatabaseService
3) Turn auto indexing on for seamless generation
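The three steps above can be sketched with a toy model. None of the types below are Neo4J's: ToyNode, BeforeCommitHook, UidAssigner and ToyDb are hypothetical stand-ins for a node, a TransactionEventHandler, its implementation, and a graphDatabaseService with auto indexing. The property name "uid" is also an assumption.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

/** Toy model of "assign our own ID just before commit"; not Neo4J's API. */
final class OwnIdDemo {
    /** Hypothetical stand-in for a node: just a property map. */
    static final class ToyNode {
        final Map<String, Object> properties = new HashMap<>();
    }

    /** Hypothetical stand-in for a before-commit callback (step 1). */
    interface BeforeCommitHook {
        void beforeCommit(List<ToyNode> createdNodes);
    }

    /** Gives every newly created node a "uid" property. */
    static final class UidAssigner implements BeforeCommitHook {
        // Unique within one JVM only; a clustered setup would need a shared source.
        private final AtomicLong sequence = new AtomicLong();

        @Override
        public void beforeCommit(List<ToyNode> createdNodes) {
            for (ToyNode node : createdNodes) {
                node.properties.put("uid", sequence.incrementAndGet());
            }
        }
    }

    /** Toy "database": runs hooks on commit, then indexes nodes by uid. */
    static final class ToyDb {
        private final List<BeforeCommitHook> hooks = new ArrayList<>();
        private final Map<Long, ToyNode> uidIndex = new HashMap<>();

        /** Step 2: register the hook. */
        void register(BeforeCommitHook hook) {
            hooks.add(hook);
        }

        void commit(List<ToyNode> createdNodes) {
            for (BeforeCommitHook hook : hooks) {
                hook.beforeCommit(createdNodes);
            }
            // Step 3: "auto index" every node that carries a uid.
            for (ToyNode node : createdNodes) {
                Object uid = node.properties.get("uid");
                if (uid instanceof Long) {
                    uidIndex.put((Long) uid, node);
                }
            }
        }

        ToyNode findByUid(long uid) {
            return uidIndex.get(uid);
        }
    }
}
```

Client code never sets or reads the internal node ID here; it looks nodes up through the index on the generated key.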
25. Lesson 2: Conclusion
Don’t use Neo Node IDs as your keys!!!
It’s a losing battle, ultimately the force
will not be with you!
credit: http://uk.xbox.gamespy.com
27. Motivations
• Fixes
– Bugs
– Re-indexing
• “Schema” Migrations
• Data Export
• Data Analysis
• Count Caching
28. Lesson 3: Graph-wide Operations
• Batch Updates
• Delete relationships only from one side
• GlobalGraphOperations since 1.6
• No need for TX when reading
41. Extracting Randomised Data
• Use Cases
– Provide Random Suggestions to users
– Use for statistical analysis aka “Random
Sampling”
• Problem
– No built in Neo4J support
– Not Neo4J’s sweet spot
– May result in very bad performance
42. Options
• Randomisation Strategies
– “Load, Shuffle, Pick”
– “Hit and Miss”
– Custom Relationship Expander/Evaluator
– Reservoir Sampling
• Performance Helpers
– Indexes
– Front with a cache if need be
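Of the strategies listed, Reservoir Sampling is the easiest to show in isolation. The following is a minimal sketch of the classic single-pass variant over any Iterable; the class name ReservoirSampler and the method signature are assumptions for illustration, not the code used on the project.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

final class ReservoirSampler {
    /** Returns up to k items chosen uniformly at random from source, in one pass. */
    static <T> List<T> sample(Iterable<? extends T> source, int k, Random random) {
        if (k < 0) {
            throw new IllegalArgumentException("k must be >= 0");
        }
        List<T> reservoir = new ArrayList<>(k);
        long seen = 0;
        for (T item : source) {
            seen++;
            if (reservoir.size() < k) {
                // Fill the reservoir with the first k items.
                reservoir.add(item);
            } else {
                // Keep the new item with probability k / seen,
                // replacing a uniformly chosen existing slot.
                long slot = (long) (random.nextDouble() * seen);
                if (slot < k) {
                    reservoir.set((int) slot, item);
                }
            }
        }
        return reservoir;
    }

    public static void main(String[] args) {
        List<Integer> population = new ArrayList<>();
        for (int i = 0; i < 160000; i++) {
            population.add(i);
        }
        // 25 random items, as in the benchmark setup described in this talk.
        System.out.println(sample(population, 25, new Random()));
    }
}
```

The sampler never needs to know the size of the source up front, which is why it fits results that arrive as an Iterable. It still has to walk every element, so it does nothing for cold parts of the graph.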
45. Traversals vs. Index
[Chart: 25 random nodes extracted from [Sample Size] using “Reservoir Sampling” algorithm]
• X-Axis: Sample Size (5000 to 160000)
• Y-Axis: Time in milliseconds (0 to 45000)
• Series: 1.4.2 TRAVERSAL PASS 1 (COLD), 1.4.2 TRAVERSAL PASS 2 (WARMISH), 1.5 TRAVERSAL PASS 1 (COLD), 1.5 TRAVERSAL PASS 2 (WARMISH), 1.5 INDEX, 1.6.2 TRAVERSAL PASS 1 (COLD), 1.6.2 TRAVERSAL PASS 2 (WARMISH)
• Annotation: Use of lucene indexes can reduce time to +- 300 - 1000ms from cold
46. Conclusion
• Most options are not “truly random”
more “randomish”
• Primarily has bad performance when
hitting cold parts of graph
• Caching helps
– If an option, serve stale data until next
random sample can be selected
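The "serve stale data" idea can be sketched as a small holder that always returns the last computed sample and is refreshed separately. This is an illustration under assumptions: the class StaleSampleCache is hypothetical, and the refresh is expected to be driven by a background thread or scheduler that is not shown.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

/** Serves the most recently computed random sample until a new one is ready. */
final class StaleSampleCache<T> {
    private final Supplier<List<T>> sampler;
    private final AtomicReference<List<T>> current;

    StaleSampleCache(Supplier<List<T>> sampler) {
        this.sampler = sampler;
        // The first sample is computed up front so get() never returns null.
        this.current = new AtomicReference<>(sampler.get());
    }

    /** Never touches the graph: returns whatever sample was computed last. */
    List<T> get() {
        return current.get();
    }

    /** Recomputes the sample; readers keep seeing the old one until this finishes. */
    void refresh() {
        current.set(sampler.get());
    }
}
```

Readers pay the cost of a slow, cold sampling run only indirectly: the expensive call happens in refresh(), and get() stays cheap in the meantime.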
Nicki: A complete online profile of your interests, tastes and opinions. Designed to be useful to you and to the rest of the world. http://labs.yougov.co.uk
Michal
Example: find all the companies that work on Opigram
Nicki
Don’t spend too much time on this:
• First 2: general and applicable to all
• Next 2: specific tips, there is a chance you’ll need them
• Last one: performance
Describe the problem and how it evolved.
Can express:
• A user’s descriptors for a given review
• Top 5 descriptors for a given thing
• All things that are cool
Problems with the resulting schema:
• Change in a review => need to update votes
• Need to make sure “described as” is deleted when votes = 0
• Finding “all people that described something as cool” is too complicated
• Not future-proof: what if we now want to review 2 things together (like Nicki and I)?
No need to keep track of votes.
Can still do all the traversals I need:
• A user’s descriptors for a given review
• Top 5 descriptors for a given thing
• All things that are cool
• PLUS: All people that used a descriptor
Can review multiple things now.
As Michal explained, a node is -
Neo Node ID is:
• Neo4j generated unique id
• long (generally auto incrementing, like MySQL auto-incrementing primary keys or Oracle sequences)
• Easily accessible and exposed via Neo4J APIs
May hear that and think: great, I need a unique identifier, sounds like it does what I need, I shall just use that rather than manage it myself.
ENTER LESSON 1: Don’t Use Neo Node IDs as your primary keys
Benefits of this approach:
• No code for you to worry about
• If you have multiple clients writing to the database (legacy system) this will be taken care of for you under the covers
• generateUniqueID() needs to be unique across HA
Different versions handle differently:
• 1.4.2: Mostly recycling of old IDs
• 1.5+: Possible changing of IDs between server restarts
TODO: Don’t expose! + Index is your friend
Problem
• Trying to pick a number of random nodes out of the graph
• Not Neo4J’s sweet spot
• Especially hard when dealing with sub graphs
Examples
• Pick some random nodes out of the graph to display to people to ask for recommendations
• Use as part of statistical algorithms to make statements like “People who tend to like …. tend to also ….”
Solutions
• If size small enough and known traversal path
– Load into Collections and shuffle
• If size large
– Custom Relationship Expander
• If the whole graph is in play ….
– Scattergun
– Indexer with Reservoir Sampling algorithm
Lessons Learned …
• Random Access type work is not Neo4J’s sweet spot
• Can get around it with indexes and random(ish) selection algorithms but may not be ideal
How Random does Random need to be?
• Load, Shuffle, Pick
– If hitting a known, small subset of the graph
– Load all nodes then Collections.shuffle(..)
• “Hit and Miss”
– All nodes form part of “population”, not good when you want subsets of the graph
– Generate random IDs, deal with cases of misses
• Custom Relationship Expander/Evaluator
– Randomly discard relationships as you go along
– Iterables returned by traverser are generally not random, gives more precedence to nodes earlier on
• Reservoir Sampling
– Designed for use with Iterables
– Randomly build up and replace ultimate subset to return
• Use an index
• Front with a cache if need be
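The first two strategies in these notes can be sketched in a few lines each. This is an illustration under assumptions: the class RandomPickers, its method names, the exists predicate (standing in for "does a node with this ID exist?") and the maxAttempts cut-off are all hypothetical, and no Neo4J calls are involved.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;
import java.util.function.LongPredicate;

final class RandomPickers {
    /** "Load, Shuffle, Pick": copy a small, known subset, shuffle it, take the first k. */
    static <T> List<T> loadShufflePick(Iterable<? extends T> subset, int k, Random random) {
        List<T> loaded = new ArrayList<>();
        for (T item : subset) {
            loaded.add(item);
        }
        Collections.shuffle(loaded, random);
        int count = Math.max(0, Math.min(k, loaded.size()));
        return new ArrayList<>(loaded.subList(0, count));
    }

    /**
     * "Hit and Miss": guess IDs in [0, maxId] and keep the ones that exist.
     * Gives up after maxAttempts guesses, so it may return fewer than k IDs.
     */
    static Set<Long> hitAndMiss(long maxId, LongPredicate exists, int k,
                                int maxAttempts, Random random) {
        Set<Long> hits = new LinkedHashSet<>();
        for (int attempt = 0; attempt < maxAttempts && hits.size() < k; attempt++) {
            long candidate = (long) (random.nextDouble() * (maxId + 1));
            if (exists.test(candidate)) {
                hits.add(candidate); // a miss is simply skipped
            }
        }
        return hits;
    }
}
```

Load, Shuffle, Pick holds the whole subset in memory, so it only suits small, known parts of the graph. Hit and Miss treats the entire ID range as its population, which is why it is a poor fit when only a subgraph should be sampled.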
How Random does Random need to be?
• Load, Shuffle, Pick
– If hitting a known, small subset of the graph
– Load all nodes then Collections.shuffle(..)
• “Scattergun”
– All nodes form part of “population”
– Generate random IDs, deal with cases of misses
• Custom Relationship Expander
– Randomly discard relationships as you go along
• Reservoir Sampling
– Great for use with Iterables
• Use an index
• Front with a cache if need be
Test environment
• Mac OS - 10.7, 8GB Ram
• LeftOver +-4.5GB
• JVM Heap max 1.5GB
• Neo4J Mapped Memory Settings 2.0GB
– neostore.nodestore.db.mapped_memory=256M
– neostore.relationshipstore.db.mapped_memory=768M
– neostore.propertystore.db.mapped_memory=512M
– neostore.propertystore.db.strings.mapped_memory=256M
– neostore.propertystore.db.arrays.mapped_memory=256M
Post Upgrade from 1.4.2 -> 1.5 -> 1.6.2...
– 2.2M neostore.nodestore.db
– 395.0M neostore.propertystore.db
– 2.8M neostore.propertystore.db.strings
– 581.0M neostore.relationshipstore.db
– 7.6M neostore.propertystore.db.arrays
Pre Upgrade from 1.4.2 -> 1.5 -> 1.6.2...
– 2.2M neostore.nodestore.db
– 1000.0M neostore.propertystore.db
– 54.0M neostore.propertystore.db.arrays
– 8000.0M neostore.propertystore.db.strings
– 581.0M neostore.relationshipstore.db
TODO: Mention disk access