GraphConnect 2014 SF: Neo4j at Scale using Enterprise Integration Patterns

Neo4j at Scale using Enterprise Integration Patterns
presented by Brad Nussbaum, CTO, MediaHound

  1. Graph at Scale
     10/22/14
     Brad Nussbaum
     CTO | MediaHound
     brad@mediahound.com | @bradnussbaum
     www.mediahound.com
  2. MediaHound started with a search…
  3. But Entertainment is more than just Film & TV
  4. Entertainment Preferences Are Scattered
     Music | Movies | Television | Books
  5. One platform that connects:
     All Content | All Sources | All Devices | All Brands | All Artists | All Users
  6. The Entertainment Graph
     A comprehensive database that brings together movies, books, games, music, and TV, including the cast & crew, sources, reviews, categories, genres, lists and more! The Entertainment Graph powers meaningful recommendations, exciting data insights and comprehensive social discovery.
  7. Use Cases:
     Academy Award Winners On Netflix
       MATCH (c:Collection)-[:CONTAINS]-(m:Movie)
       WHERE (m)-[:MATCHED_SOURCE]-(n:NETFLIX) AND c.name="Academy Award Winners"
       RETURN m;
     Movies and Shows Based On Zombie Books
       MATCH (m:Media)-[:BASED_ON]-(b:Book)-[:HAS_TRAIT]-(z:Trait)
       WHERE (m.type="Movie" OR m.type="Show") AND z.name="Zombies"
       RETURN m;
     To access ‘The Movie Graph’ mini app (which uses a different model than above), from your browser, run :play movie graph
  8. Entertainment Recommendations
     Collaborative Filtering:
     If Joe, Amy and Steve like Gladiator, AND Joe and Amy like Toy Story, THEN MediaHound recommends Toy Story to Steve.
     Graph Influencers:
     Joe is an early adopter. Joe, Amy and Steve like several things in common, so MediaHound recommends Amy and Steve follow Joe. Joe discovers the next big hit and shares it on his feed. Amy and Steve see Joe's post and give it a listen.
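The slides describe the collaborative-filtering rule but not the query behind it. As a rough illustration only, a basic version of that rule over a hypothetical (:User)-[:LIKES]->(:Media) model could be expressed as a parameterized Cypher statement; the labels, relationship type and properties here are assumptions, not MediaHound's actual schema.

import java.util.Objects;

// Illustrative collaborative-filtering query over an assumed (:User)-[:LIKES]->(:Media) model:
// recommend media liked by users who share likes with the target user, excluding media the
// target user already likes. Intended to be run via the transactional endpoint (slide 10).
public final class CollaborativeFilteringQuery {

    public static final String CYPHER =
        "MATCH (me:User {username: {username}})-[:LIKES]->(shared:Media)<-[:LIKES]-(peer:User), " +
        "      (peer)-[:LIKES]->(rec:Media) " +
        "WHERE NOT (me)-[:LIKES]->(rec) " +
        "RETURN rec.title AS title, count(DISTINCT peer) AS score " +
        "ORDER BY score DESC LIMIT 10";

    private CollaborativeFilteringQuery() {}

    // Convenience check used by callers before binding the {username} parameter.
    public static String requireUsername(String username) {
        return Objects.requireNonNull(username, "username parameter must not be null");
    }
}
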
  9. We needed our graph database to perform under sustained user write load AND during heavy batch update operations. We needed to recommend media content in real-time, which required many concurrent pattern matching operations on the graph.
  10. We realized through trial and error that using the Transactional Cypher HTTP Endpoint was the BEST solution to control batch writes.
      POST http://localhost:7474/db/data/transaction/commit
      Accept: application/json; charset=UTF-8
      Content-Type: application/json
      {
        "statements": [
          { "statement": "MERGE (n:User {username: 'bradnussbaum'})-[:FOLLOWS]->(m:User {username: 'bennussbaum'})" },
          { "statement": "MERGE (n:User {username: 'bradnussbaum'})<-[:FOLLOWS]-(m:User {username: 'bennussbaum'})" }
        ]
      }
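Driving that endpoint from the JVM takes very little code. A minimal sketch, assuming a local unauthenticated server and statements that keep their string literals in single quotes; the class and method names are illustrative, not MediaHound's code:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Minimal sketch: POST a batch of Cypher statements to the transactional commit endpoint
// so they are applied as ONE transaction. Error handling and authentication are omitted.
public class TransactionalEndpointClient {

    private static final String COMMIT_URL = "http://localhost:7474/db/data/transaction/commit";

    public static int postBatch(List<String> statements) throws Exception {
        StringBuilder body = new StringBuilder("{\"statements\":[");
        for (int i = 0; i < statements.size(); i++) {
            if (i > 0) body.append(',');
            // Assumes the Cypher contains no unescaped double quotes (use single quotes in literals).
            body.append("{\"statement\":\"").append(statements.get(i)).append("\"}");
        }
        body.append("]}");

        HttpURLConnection conn = (HttpURLConnection) new URL(COMMIT_URL).openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Accept", "application/json; charset=UTF-8");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body.toString().getBytes(StandardCharsets.UTF_8));
        }
        return conn.getResponseCode(); // 200 means the request was accepted and the batch committed or errored per-statement
    }
}

Everything sent in one POST commits as a single transaction, which is why the batch-sizing guidance on the next slides matters.
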
  11. We ran tests using the low-level kernel and found that sustained transaction writes performed optimally between 400-2,000 nodes and relationships per transaction.
      Compare the difference between…
      • Writing a single relationship (33 bytes) per transaction for 10k iterations
      • Writing 1k relationships per transaction for 10 iterations
      **As of 2.2, Neo4j will batch writes on the server
      http://neo4j.com/docs/stable/linux-performance-guide.html
      git clone git@github.com:neo4j-contrib/tooling.git
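A minimal sketch of keeping each transaction inside that window, assuming Guava (mentioned later in the deck for caching) is on the classpath and reusing the postBatch helper sketched after slide 10; the batch size of 1,000 is simply an arbitrary value inside the 400-2,000 range:

import com.google.common.collect.Lists;
import java.util.List;

// Chunk a large list of Cypher statements into transaction-sized batches so that
// each POST to the transactional endpoint carries roughly 400-2,000 writes.
public class BatchSizer {

    private static final int BATCH_SIZE = 1000; // arbitrary value inside the recommended window

    public static void writeInBatches(List<String> statements) throws Exception {
        for (List<String> batch : Lists.partition(statements, BATCH_SIZE)) {
            TransactionalEndpointClient.postBatch(batch); // one POST == one transaction
        }
    }
}
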
  12. We used Enterprise Integration Patterns (EIP) to create optimal batch sizes for each transaction.
      • Splitters to break down larger messages
      • Aggregators to combine single CQL statements together into a single batch transaction
      • Throttling to control concurrent requests and requests per second
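The deck does not name the integration framework. Assuming Apache Camel as the EIP layer, a route combining those three patterns might look roughly like this; the endpoint URIs, batch size and throttle rate are illustrative, and a real route would use an aggregation strategy matched to the downstream writer:

import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.processor.aggregate.GroupedExchangeAggregationStrategy;

// Sketch of an EIP route: split a large message into single CQL statements,
// aggregate them back into transaction-sized batches, and throttle the rate
// at which batches are handed to the Neo4j writer.
public class BatchWriteRoute extends RouteBuilder {

    @Override
    public void configure() {
        from("activemq:queue:cql.statements")           // illustrative source endpoint
            .split(body().tokenize("\n"))               // Splitter: one CQL statement per exchange
            .aggregate(constant(true), new GroupedExchangeAggregationStrategy())
                .completionSize(1000)                   // Aggregator: ~1k statements per batch
                .completionTimeout(2000)                // flush partial batches after 2 seconds
            .throttle(20).timePeriodMillis(1000)        // Throttler: at most 20 batches per second
            .to("bean:transactionalEndpointClient");    // hand the grouped batch to the HTTP writer
    }
}
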
  13. We frequently run 40+ concurrent write transactions to a 3 instance cluster for hours at a time. Deadlocks can occur often with many concurrent write operations.
      • Retry Transient Errors after a small period of time.
      • Use the Error Index to split failed TX statements.
      Read here to learn all the error status codes, seriously.
      http://neo4j.com/docs/stable/status-codes.html
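A minimal retry sketch for the first bullet, assuming the caller supplies a function that performs the write and returns the transactional endpoint's JSON response body; the backoff values and the simple string check are illustrative, and a real implementation would parse the errors array and its status codes properly:

import java.util.concurrent.Callable;

// Retry a write whose transaction failed with a transient error (e.g. a deadlock),
// backing off briefly between attempts.
public class TransientRetry {

    private static final int MAX_ATTEMPTS = 5;

    // 'post' performs the write and returns the transactional endpoint's JSON response body.
    public static void writeWithRetry(Callable<String> post) throws Exception {
        String responseBody = "";
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            responseBody = post.call();
            // The "errors" array in the response carries status codes such as
            // Neo.TransientError.Transaction.DeadlockDetected (see the status-codes link above).
            if (!responseBody.contains("Neo.TransientError")) {
                return; // committed, or failed with a non-transient error handled elsewhere
            }
            Thread.sleep(100L * attempt); // small, growing pause before retrying
        }
        throw new IllegalStateException("Still failing after " + MAX_ATTEMPTS + " attempts: " + responseBody);
    }
}
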
  14. Test write throughput on your cluster with the push factor you plan to use in production and intentionally kill your master under load.
      You need to have two load balancers:
      • One for all the instances you want performing reads
      • One for your master ONLY (send writes here)
      Master Check: /db/manage/server/ha/master - Returns true|false
      Slave Check: /db/manage/server/ha/slave - Returns true|false
      Available Check: /db/manage/server/ha/available - Returns master|slave
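If your load balancer cannot call those endpoints directly, a tiny probe is enough. A rough sketch of the master check (host, port and timeouts are illustrative), relying on non-master instances answering this endpoint with a non-200 status:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

// Sketch of a master check against the HA management endpoint, usable from a
// custom health probe that decides which instance receives writes.
public class HaMasterCheck {

    public static boolean isMaster(String host, int port) {
        try {
            URL url = new URL("http://" + host + ":" + port + "/db/manage/server/ha/master");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setConnectTimeout(1000);
            conn.setReadTimeout(1000);
            try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
                return "true".equalsIgnoreCase(in.readLine().trim()); // body is literally "true" or "false"
            }
        } catch (Exception e) {
            return false; // unreachable or non-master instances are never routed writes
        }
    }
}
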
  15. Check your driver for transaction support. Embedded mode has full transaction support but most remote drivers do not at this time. This will be changing in the near future…depending on which driver you use.
      **Spring Data Neo4j is actively being developed to include these features as part of 2.2.
  16. Graph at Scale
      10/22/14
      Ben Nussbaum
      Director of Engineering | MediaHound
      ben@mediahound.com | @bennussbaum
      www.mediahound.com
  17. We built custom algorithms that needed run-time decision making as Neo4j Extensions with Spring Data Neo4j.
      • Cache abstraction with Google’s Guava to build large in-memory indexes of nodes and relationships.
      • Integration for job instructions and results to and from the broker.
      • Async for batch job processing.
      https://github.com/AtomRain/neo4j-extensions
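A minimal sketch of the Guava-backed index idea inside an extension that holds a GraphDatabaseService reference; the User label, username property and cache bound are illustrative assumptions, and the loader assumes the node exists:

import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;
import org.neo4j.graphdb.DynamicLabel;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;

// Sketch of an in-memory index inside a server extension: username -> node id,
// loaded lazily from the graph and kept in a bounded Guava cache.
public class UserNodeCache {

    private final LoadingCache<String, Long> nodeIdsByUsername;

    public UserNodeCache(final GraphDatabaseService graphDb) {
        this.nodeIdsByUsername = CacheBuilder.newBuilder()
            .maximumSize(1_000_000) // illustrative bound; tune to heap and job context
            .build(new CacheLoader<String, Long>() {
                @Override
                public Long load(String username) {
                    try (Transaction tx = graphDb.beginTx()) {
                        // Assumes a :User node with this username exists; a real loader would handle misses.
                        Node user = graphDb.findNode(DynamicLabel.label("User"), "username", username);
                        tx.success();
                        return user.getId();
                    }
                }
            });
    }

    public long nodeIdFor(String username) {
        return nodeIdsByUsername.getUnchecked(username);
    }
}
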
  18. We took advantage of spot processing from AWS to run our custom extension algorithms.
      • On-demand graph processing with as many instances at a time as needed (we have used up to 9).
      • Concurrent job operations per spot.
      • Cache optimizations based on Labels and context of the jobs.
  19. We built a flexible job controller that enables concurrent job processing on spot instances.
      • Large jobs are broken into smaller jobs that can be processed by a single spot instance.
      • Spots process unit jobs and return results. If a spot dies, the job stays in the queue and another spot picks it up.
      • Memory and CPU constraints on an instance make this a necessity, especially when processing 30M+ songs.
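The deck does not show the controller itself. As a rough sketch of the splitting step, assuming a JMS broker such as ActiveMQ (the broker choice, queue name and chunk size are assumptions), a large job can be partitioned into unit jobs and enqueued so any available spot picks one up:

import com.google.common.collect.Lists;
import java.util.List;
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;
import org.apache.activemq.ActiveMQConnectionFactory;

// Sketch of the "split a large job into unit jobs" idea: partition the ids to process
// and enqueue one message per unit job. If a spot dies before acknowledging a message,
// the unit job stays on the queue for another spot to consume.
public class JobSplitter {

    private static final int UNIT_JOB_SIZE = 10_000; // illustrative chunk size

    public static void enqueueUnitJobs(String brokerUrl, List<Long> mediaIds) throws Exception {
        ConnectionFactory factory = new ActiveMQConnectionFactory(brokerUrl);
        Connection connection = factory.createConnection();
        connection.start();
        try {
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            Queue jobQueue = session.createQueue("jobs.unit");     // illustrative queue name
            MessageProducer producer = session.createProducer(jobQueue);
            for (List<Long> chunk : Lists.partition(mediaIds, UNIT_JOB_SIZE)) {
                // Each message describes one unit job; here the payload is simply the id list.
                producer.send(session.createTextMessage(chunk.toString()));
            }
        } finally {
            connection.close();
        }
    }
}
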
  20. Spot instances run Neo4j in SINGLE mode and stay up to date using a Topic.
      [Architecture diagram: an ESB (with SDN) exchanging Job/Result messages over MQ with spot instances running Neo4j in SINGLE mode, posting results to the Neo4j HA cluster, plus a TX Topic]
      1. ESB sends Jobs to MQ
      2. Spots consume job instructions, process and send results back to MQ
      3. ESB posts job results to HA
      4. On successful post, send updates to Topic
      5. Spots consume from Topic to stay up to date
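A rough sketch of step 5, again assuming a JMS topic (broker and topic names are illustrative) carrying the committed update statements; each spot replays them against its local SINGLE-mode instance, reusing the transactional-endpoint helper sketched after slide 10. The caller is expected to keep the connection open for the life of the subscriber.

import javax.jms.Connection;
import javax.jms.Message;
import javax.jms.MessageListener;
import javax.jms.Session;
import javax.jms.TextMessage;
import javax.jms.Topic;
import org.apache.activemq.ActiveMQConnectionFactory;

// Sketch of step 5: each spot subscribes to the TX topic and replays the update
// statements against its local SINGLE-mode Neo4j, so its copy of the graph stays
// close to the HA cluster without joining the cluster itself.
public class TopicUpdateSubscriber {

    public static Connection subscribe(String brokerUrl) throws Exception {
        Connection connection = new ActiveMQConnectionFactory(brokerUrl).createConnection();
        connection.start();
        Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
        Topic txTopic = session.createTopic("tx.updates"); // illustrative topic name
        session.createConsumer(txTopic).setMessageListener(new MessageListener() {
            @Override
            public void onMessage(Message message) {
                try {
                    String cypher = ((TextMessage) message).getText();
                    // Apply the statement to the local instance, here via the same
                    // transactional endpoint sketch used for the HA cluster on slide 10.
                    TransactionalEndpointClient.postBatch(java.util.Collections.singletonList(cypher));
                } catch (Exception e) {
                    e.printStackTrace(); // sketch only; real code would handle replay failures
                }
            }
        });
        return connection; // caller closes this when the spot shuts down
    }
}
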
  21. Batch jobs return thousands of CQL statements which must not be dependent on any statements before or after.
      • Compound statements to create nodes and relationships for specific sub-graphs to avoid the need for layering wherever possible. If not…
      • Run jobs in linear phases (layering) to create nodes first, then connect relationships.
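For example, a compound, order-independent statement over a hypothetical credits sub-graph can MERGE both endpoint nodes and the relationship together, so it succeeds regardless of what ran before or after it; the labels and properties are illustrative:

// A compound, order-independent statement: both endpoint nodes and the relationship
// are MERGEd in one statement, so no earlier statement needs to have created anything.
public final class CompoundStatements {

    public static final String CREDIT_SUBGRAPH =
        "MERGE (m:Movie {id: {movieId}}) " +
        "MERGE (p:Person {id: {personId}}) " +
        "MERGE (p)-[:ACTED_IN]->(m)";

    private CompoundStatements() {}
}
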
  22. Graph at Scale
      10/22/14
      Q/A
