Neo4J – Tales from the Trenches


   A RECOMMENDATION ENGINE
          CASE STUDY

     Michal Bachman & Nicki Watt
     @bachmanm & @techiewatt
Who we are …
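• Nicki Watt --works on--> Opigram
• Nicki Watt --works for (role = “consultant”)--> OpenCredo
• Nicki Watt --colleague of--> Michal Bachman
• Michal Bachman --works on--> Opigram
• Michal Bachman --works for (role = “consultant”)--> OpenCredo
• Opigram --uses--> Neo4J
• OpenCredo --partner of--> Neo4J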
Opigram
•   http://labs.yougov.co.uk
•   Opinion Profile
•   Social Network (TBD)
•   Recommendation Engine
•   CMS
Opigram

• Panelists --provides--> Opinions
• Opinions --about--> Things
• Opinions --about--> Panelists (themselves)
• Opinions --generates--> Recommendations / Interesting Insights
  – People who like … also tend to like …
  – People who like … tend to support …
  – People who like … describe themselves as …
Opigram
• Started Feb 2011
• Nov 2011
    – OpenCredo
    – Many lessons learned
• Stats
    – ~ 150k panelists (a.k.a. users)
    – ~ 100k “things” (movies, books,…)
    – ~ 8M relationships
Neo4J
•   Graph Database
•   Schema-less (NoSQL)
•   Vertices and Edges
•   a.k.a. Nodes and Relationships
•   Traversals
•   Version 1.7 just released!
Neo4J
[Figure: the graph from “Who we are …” again, as an example of nodes, relationships and properties]
Opigram + Neo4J
•   Taxonomy of “things”
•   Opinions on “things”
•   Recommendations
•   Offline “Crunching”
Opigram + MySQL
• CMS Functionality
• Crunching Results
• Configuration / Metadata
Lessons Learned
• Everyone loves Neo4J! Find praise online
• “Trenches Talk” - Aiming to provide
  insight into some real problems
  encountered and approaches to solutions
• We have 5 practical lessons for you
  – Tips
  – Tricks
  – Troubles
Lessons Learned
•   Lesson 1: Graph “Schema”
•   Lesson 2: Neo Node IDs
•   Lesson 3: Graph-wide Operations
•   Lesson 4: Extracting Randomised Data
•   Lesson 5: Multi-threading
Lesson 1

Graph “Schema”
Schema-less ≠ [image; credit: Greencolander]
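[Diagram: first model]
• Michal --review (text = “…”, descriptors = Cool, Funny)--> Pulp Fiction
• Pulp Fiction --type--> Movie
• Pulp Fiction --described as (votes = 1)--> Cool
• Pulp Fiction --described as (votes = 1)--> Funny
• Cool, Funny, Boring, Romantic --type--> Descriptor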
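[Diagram: second model, with the review as its own node]
• Michal --created--> Review (text = “…”)
• Review --review of--> Pulp Fiction
• Review --described as--> Cool
• Review --described as--> Funny
• Pulp Fiction --type--> Movie
• Cool, Funny, Boring, Romantic --type--> Descriptor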
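
A minimal Java sketch of why the second model (review as its own node) needs no stored vote counters: the counts can be derived from the reviews whenever they are needed. This is plain Java, not Neo4J code; the class and method names (ReviewSketch, ReviewModelSketch, votesFor, peopleWhoUsed) are hypothetical and exist only for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a Review node and its outgoing relationships.
final class ReviewSketch {
    final String author;            // models the "created" relationship
    final String thing;             // models the "review of" relationship
    final List<String> descriptors; // models the "described as" relationships

    ReviewSketch(String author, String thing, String... descriptors) {
        this.author = author;
        this.thing = thing;
        this.descriptors = Arrays.asList(descriptors);
    }
}

// Hypothetical in-memory stand-in for the graph of reviews.
final class ReviewModelSketch {
    private final List<ReviewSketch> reviews = new ArrayList<>();

    void add(ReviewSketch review) {
        reviews.add(review);
    }

    // Votes per descriptor for one thing, counted on demand instead of stored.
    Map<String, Integer> votesFor(String thing) {
        Map<String, Integer> votes = new TreeMap<>();
        for (ReviewSketch review : reviews) {
            if (!review.thing.equals(thing)) {
                continue;
            }
            for (String descriptor : review.descriptors) {
                votes.merge(descriptor, 1, Integer::sum);
            }
        }
        return votes;
    }

    // All people that used a given descriptor in at least one review.
    List<String> peopleWhoUsed(String descriptor) {
        List<String> people = new ArrayList<>();
        for (ReviewSketch review : reviews) {
            if (review.descriptors.contains(descriptor) && !people.contains(review.author)) {
                people.add(review.author);
            }
        }
        return people;
    }
}
```

Changing or deleting a review in this sketch only touches that review; there is no separate counter to keep in sync.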
Lesson 2

Neo4J Node IDs
Neo Node IDs
• What are they?
• Can I use them to represent my keys?
  – No!
• Why not?
  – Not stable
  – IDs are garbage collected over time, thus
    only guaranteed to be unique during a
    specific time span
Example

“User Transformation”
MySQL:
USER_ID   NEO_NODE_ID   ACTIVE
101       1             Y
102       2             Y
103       3             N, then Y

Graph (Neo node IDs in brackets):
• Michal [1], Nicki [2], Jim [3] --type--> Panelist [4]
• Cool, Boring [5], Funny [7], Romantic [8] --type--> Descriptor [6]

Jim is now Cool!
Alternate ID Strategies
• Client provided IDs
  – Add as a standard property on the node
  – Add to index (or use auto indexer)
• Natural vs. Synthetic IDs
• Auto generate your own IDs
  – Hook into Neo4J Transaction Kernel
  – Use auto indexer
Auto generate your own IDs
1)   Implement TransactionEventHandler
2)   Register TransactionEventHandler with graphDatabaseService
3)   Turn auto indexing on for seamless generation
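
The code shown on this slide did not survive extraction. As a rough, plain-Java illustration of the idea (not the Neo4J API and not the authors' code), the sketch below assigns an application-level ID to every new node just before commit and keeps a lookup index on it. SketchNode, IdAssigningHook, the "uid" property name and the use of UUIDs are all assumptions made for this example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical stand-in for a graph node: just a bag of properties.
final class SketchNode {
    final Map<String, Object> properties = new HashMap<>();
}

// Hypothetical stand-in for a "before commit" hook plus an index on the ID property.
final class IdAssigningHook {
    static final String ID_PROPERTY = "uid"; // assumed property name

    // Plays the role of the index: application ID -> node.
    private final Map<String, SketchNode> index = new HashMap<>();

    // Called with the nodes created in the transaction that is about to commit.
    void beforeCommit(Iterable<SketchNode> createdNodes) {
        for (SketchNode node : createdNodes) {
            // Respect a client-provided ID; generate one only if it is missing.
            if (!node.properties.containsKey(ID_PROPERTY)) {
                node.properties.put(ID_PROPERTY, UUID.randomUUID().toString());
            }
            index.put((String) node.properties.get(ID_PROPERTY), node);
        }
    }

    // Look a node up by its application-level ID, never by an internal node ID.
    SketchNode findById(String id) {
        return index.get(id);
    }
}
```

Random UUIDs are used here only because they need no coordination between writers; any generator that stays unique across all writers would fit the same slot.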
Lesson 2: Conclusion
Don’t use Neo Node IDs as your keys!!!
It’s a losing battle, ultimately the force
will not be with you!




            credit: http://uk.xbox.gamespy.com
Lesson 3

Graph-wide Operations
Motivations
• Fixes
    – Bugs
    – Re-indexing
•   “Schema” Migrations
•   Data Export
•   Data Analysis
•   Count Caching
Lesson 3: Graph-wide Operations
•   Batch Updates
•   Delete relationships only from one side
•   GlobalGraphOperations since 1.6
•   No need for TX when reading
Example

Deleting “soft-deleted” relationships
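
The code for this example is not part of the extracted text. A minimal plain-Java sketch of the “Batch Updates” idea follows: walk over all items once, but hand them to the commit step in fixed-size batches so that no single transaction grows too large. BatchSketch, inBatches and the commitBatch callback are hypothetical names; in a real application the callback would wrap one database transaction.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

final class BatchSketch {
    // Feeds items to commitBatch in groups of at most batchSize.
    // Returns the number of batches that were committed.
    static <T> int inBatches(Iterable<T> items, int batchSize, Consumer<List<T>> commitBatch) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<T> batch = new ArrayList<>(batchSize);
        int batches = 0;
        for (T item : items) {
            batch.add(item);
            if (batch.size() == batchSize) {
                commitBatch.accept(batch); // stand-in for: begin tx, apply changes, commit
                batches++;
                batch = new ArrayList<>(batchSize);
            }
        }
        if (!batch.isEmpty()) {
            commitBatch.accept(batch);     // final, partially filled batch
            batches++;
        }
        return batches;
    }
}
```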
Example

Computing statistics
Lesson 4

Extracting Randomised Data
Extracting Randomised Data
• Use Cases
  – Provide Random Suggestions to users
  – Use for statistical analysis aka “Random
    Sampling”
• Problem
  – No built in Neo4J support
  – Not Neo4J’s sweet spot
  – May result in very bad performance
Options
• Randomisation Strategies
  – “Load, Shuffle, Pick”
  – “Hit and Miss”
  – Custom Relationship Expander/Evaluator
  – Reservoir Sampling
• Performance Helpers
  – Indexes
  – Front with a cache if need be
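
A minimal plain-Java sketch of the first two strategies, under stated assumptions: “Load, Shuffle, Pick” loads everything and uses Collections.shuffle, and “Hit and Miss” draws random IDs and skips the ones that do not resolve. RandomPickSketch, loadShufflePick, hitAndMiss, the lookup function and the maxAttempts cap are hypothetical and not taken from the slides.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;
import java.util.function.LongFunction;

final class RandomPickSketch {
    // "Load, Shuffle, Pick": only sensible for a known, small subset.
    static <T> List<T> loadShufflePick(Iterable<T> loaded, int count, Random random) {
        List<T> all = new ArrayList<>();
        for (T item : loaded) {
            all.add(item);
        }
        Collections.shuffle(all, random);
        return new ArrayList<>(all.subList(0, Math.min(count, all.size())));
    }

    // "Hit and Miss": guess IDs in [0, maxId]; lookup returns null on a miss.
    static <T> List<T> hitAndMiss(LongFunction<T> lookup, long maxId, int count,
                                  int maxAttempts, Random random) {
        List<T> picked = new ArrayList<>();
        Set<Long> tried = new HashSet<>();
        for (int attempt = 0; attempt < maxAttempts && picked.size() < count; attempt++) {
            long id = (long) (random.nextDouble() * (maxId + 1));
            if (!tried.add(id)) {
                continue; // already tried this ID
            }
            T hit = lookup.apply(id);
            if (hit != null) {
                picked.add(hit);
            }
        }
        return picked; // may hold fewer than count items if attempts run out
    }
}
```

The attempt cap is there because a sparse ID space can otherwise turn “Hit and Miss” into a very long loop.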
Custom Relationship Evaluator
Reservoir Sampling Algorithm
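
The code from this slide is missing from the extracted text. The following is a minimal plain-Java sketch of reservoir sampling in its textbook form (often called Algorithm R), which picks k items uniformly from an Iterable of unknown length in a single pass. It is not the authors' implementation; ReservoirSketch and sample are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

final class ReservoirSketch {
    // Returns up to k items chosen uniformly at random from source, in one pass.
    static <T> List<T> sample(Iterable<T> source, int k, Random random) {
        if (k < 0) {
            throw new IllegalArgumentException("k must not be negative");
        }
        List<T> reservoir = new ArrayList<>(k);
        long seen = 0;
        for (T item : source) {
            seen++;
            if (reservoir.size() < k) {
                // Fill the reservoir with the first k items.
                reservoir.add(item);
            } else {
                // Keep the new item with probability k / seen,
                // replacing a uniformly chosen slot.
                long slot = (long) (random.nextDouble() * seen); // uniform in [0, seen)
                if (slot < k) {
                    reservoir.set((int) slot, item);
                }
            }
        }
        return reservoir;
    }
}
```

Because it only needs an Iterable, the same method could in principle be fed from a traversal or from an index query; only the cost of producing the items differs.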
Traversals vs. Index
25 random nodes extracted from [Sample Size] using “Reservoir Sampling” algorithm
[Chart. X-axis: sample size (5000, 10000, 20000, 40000, 80000, 160000). Y-axis: time in milliseconds (0 to 45000). Series: 1.4.2 traversal pass 1 (cold), 1.4.2 traversal pass 2 (warmish), 1.5 traversal pass 1 (cold), 1.5 traversal pass 2 (warmish), 1.5 index, 1.6.2 traversal pass 1 (cold), 1.6.2 traversal pass 2 (warmish).]
Use of Lucene indexes can reduce time to roughly 300 - 1000 ms from cold
Conclusion
• Most options are not “truly random”,
  more “randomish”
• Performance is bad primarily when
  hitting cold parts of the graph
• Caching helps
  – If it is an option, serve stale data until the next
    random sample can be selected
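
A minimal plain-Java sketch of the caching advice, assuming a sample can be recomputed in the background: readers always get the last finished sample, and a refresh swaps in a new one once it is ready. StaleSampleCache and its sampler argument are hypothetical names for this illustration.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

final class StaleSampleCache<T> {
    private final AtomicReference<List<T>> current = new AtomicReference<>();
    private final Supplier<List<T>> sampler; // the expensive random selection

    StaleSampleCache(Supplier<List<T>> sampler) {
        this.sampler = sampler;
        this.current.set(sampler.get()); // pay the cold cost once, up front
    }

    // Never triggers sampling: returns the last finished sample, even if stale.
    List<T> get() {
        return current.get();
    }

    // Intended to be called from a background thread or scheduler.
    void refresh() {
        current.set(sampler.get());
    }
}
```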
Lesson 5

Multi-threading
Lesson 5: Multi-threading
• Shortcoming in Neo4J
• Fixed in version 1.7
• Avoid relationship properties in multi-threaded
  pre-1.7 apps
Questions?
Beer Time!
• @bachmanm
• michal.bachman@opencredo.com

• @techiewatt
• nicki.watt@opencredo.com

Editor's notes

  1. TODO Neo logo
  2. Nicki: A complete online profile of your interests, tastes and opinions. Designed to be useful to you and to the rest of the world. http://labs.yougov.co.uk
  3. Michal
  4. Example: find all the companies that work on Opigram
  5. Nicki
  6. Don’t spend too much time on this:
     - First 2: general and applicable to all
     - Next 2: specific tips, there is a chance you’ll need them
     - Last one: performance
  7. Describe the problem and how it evolved.
     Can express:
     - A user's descriptors for a given review
     - Top 5 descriptors for a given thing
     - All things that are cool
     Problems with the resulting schema:
     - Change in a review => need to update votes
     - Need to make sure “described as” is deleted when votes = 0
     - Finding “all people that described something as cool” is too complicated
     - Not future-proof: what if we now want to review 2 things together (like Nicki and I)?
  8. No need to keep track of votes.
     Can still do all the traversals I need:
     - A user's descriptors for a given review
     - Top 5 descriptors for a given thing
     - All things that are cool
     PLUS:
     - All people that used a descriptor
     - Can review multiple things now
  9. As Michal explained, a node is -
     Neo Node ID is:
     - Neo4j generated unique id
     - long (generally auto incrementing, like MySQL auto-incrementing primary keys or Oracle sequences)
     - Easily accessible and exposed via Neo4J APIs
     May hear that and think: great, I need a unique identifier, sounds like it does what I need, I shall just use that rather than manage it myself.
     ENTER LESSON 2: Don’t use Neo Node IDs as your primary keys
  10. Benefits of this approach:
     - No code for you to worry about
     - If you have multiple clients writing to the database (legacy system) this will be taken care of for you under the covers
     - generateUniqueID() needs to be unique across HA
  11. Different versions handle this differently:
     - 1.4.2: mostly recycling of old IDs
     - 1.5+: possible changing of IDs between server restarts
     TODO: Don’t expose! + Index is your friend
  12. Problem
     - Trying to pick a random number of nodes out of the graph
     - Not Neo4J’s sweet spot
     - Especially hard when dealing with sub graphs
     Examples
     - Pick some random nodes out of the graph to display to people to ask for recommendations
     - Use as part of statistical algorithms to make statements like “People who tend to like … tend to also …”
     Solutions
     - If size small enough and known traversal path: load into Collections and shuffle
     - If size large: Custom Relationship Expander
     - If the whole graph is in play: Scattergun, or Indexer with Reservoir Sampling algorithm
     Lessons Learned
     - Random-access type work is not Neo4J’s sweet spot
     - Can get around it with indexes and random(ish) selection algorithms but may not be ideal
  13. How random does random need to be?
     - Load, Shuffle, Pick: if hitting a known, small subset of the graph, load all nodes, then Collections.shuffle(..)
     - “Hit and Miss”: all nodes form part of the “population”, so not good when you want subsets of the graph; generate random IDs, deal with cases of misses
     - Custom Relationship Expander/Evaluator: randomly discard relationships as you go along; Iterables returned by a traverser are generally not random and give more precedence to nodes earlier on
     - Reservoir Sampling: designed for use with Iterables; randomly build up and replace the ultimate subset to return
     - Use an index
     - Front with a cache if need be
  14. How random does random need to be?
     - Load, Shuffle, Pick: if hitting a known, small subset of the graph, load all nodes, then Collections.shuffle(..)
     - “Scattergun”: all nodes form part of the “population”; generate random IDs, deal with cases of misses
     - Custom Relationship Expander: randomly discard relationships as you go along
     - Reservoir Sampling: great for use with Iterables
     - Use an index
     - Front with a cache if need be
  15. Same as note 14.
  16. Test machine and store sizes:
     - Mac OS 10.7, 8GB RAM, roughly 4.5GB left over
     - JVM heap max 1.5GB
     - Neo4J mapped memory settings (2.0GB in total):
       neostore.nodestore.db.mapped_memory = 256M
       neostore.relationshipstore.db.mapped_memory = 768M
       neostore.propertystore.db.mapped_memory = 512M
       neostore.propertystore.db.strings.mapped_memory = 256M
       neostore.propertystore.db.arrays.mapped_memory = 256M
     - Post upgrade from 1.4.2 -> 1.5 -> 1.6.2...
       2.2M neostore.nodestore.db
       395.0M neostore.propertystore.db
       2.8M neostore.propertystore.db.strings
       581.0M neostore.relationshipstore.db
       7.6M neostore.propertystore.db.arrays
     - Pre upgrade from 1.4.2 -> 1.5 -> 1.6.2...
       2.2M neostore.nodestore.db
       1000.0M neostore.propertystore.db
       54.0M neostore.propertystore.db.arrays
       8000.0M neostore.propertystore.db.strings
       581.0M neostore.relationshipstore.db
  17. TODO: Mention disk access. Otherwise same as note 14.