
Improve ML Predictions using Graph Analytics (today!)


Speaker: Amy Hodler, Analytics & AI Program Manager, Neo4j

Published in: Technology

  1. Amy E. Hodler, Graph Analytics and AI Program Manager, Neo4j. San Francisco, May 2019. Improve ML Predictions Using Graph Algorithms (Today!) Amy.Hodler@neo4j.com @amyhodler
  2. neo4j.com/graph-algorithms-book: free download (free in lobby today). Chapter 8: Graph + ML, Spark & Neo4j
  3. Relationships: One of the Strongest Predictors of Behavior (James Fowler)
  4. …And Success! (David Burkus, James Fowler, Albert-Laszlo Barabasi)
  5. Graphs Increase the Predictive Power of AI with the Data You Already Have. • Current data science models ignore network structure • Graphs add highly predictive features to existing ML models • Otherwise unattainable predictions based on relationships (Machine Learning Pipeline)
  6. Steps Forward in Graph Data Science: Graph Persistence, Knowledge Graphs, Connected Feature Engineering, Graph Native Learning
  7. Graph Feature Engineering. Feature engineering is how we combine and process the data to create new, more meaningful features, such as clustering or connectivity metrics. Feature extraction adds more descriptive features: influence, relationships, communities.
  8. Graph Machine Learning Workflow: Extract Data & Store as Graph (data aggregation; create and store graphs) • Explore, Clean, Modify (identify uninteresting features; cleanse outliers, etc.) • Prepare for Machine Learning (feature engineering/extraction; train/test split; resample for meaningful representation, e.g. proportional) • Train Models (cross-validation; model & variable selection; hyperparameter tuning; ensemble methods) • Evaluate Results (precision, accuracy, recall; ROC curve & AUC; SME review) • Productionize
  9. Example: Machine Learning for Link Prediction
  10. Can we infer new interactions in the future? What unobserved facts are we missing?
  11. Methods for Link Prediction. Algorithm Measures: run targeted algorithms and score outcomes, then set a threshold value used to predict a link between nodes. Machine Learning: use the measures as features to train an ML model.
      1st Node  2nd Node  Common Neighbors  Preferential Attachment  label
      1         2         4                 15                       1
      3         4         7                 12                       1
      5         6         1                 1                        0
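Not from the deck itself, but the two measures in the table above can be sketched in a few lines of pure Python over an adjacency-set graph (the graph and candidate pairs below are invented for illustration):

```python
# Toy undirected graph as adjacency sets (illustrative data, not the deck's).
adj = {
    1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2, 4},
    4: {1, 3}, 5: {6}, 6: {5},
}

def common_neighbors(u, v):
    # Number of nodes adjacent to both u and v.
    return len(adj[u] & adj[v])

def preferential_attachment(u, v):
    # Product of the two degrees: "rich get richer".
    return len(adj[u]) * len(adj[v])

# Score candidate pairs; these columns become ML features (or are thresholded).
features = [(u, v, common_neighbors(u, v), preferential_attachment(u, v))
            for u, v in [(1, 2), (2, 4), (5, 6)]]
```

Each row of `features` corresponds to one row of the slide's feature table.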
  12. Predicting Collaboration with a Graph Enhanced ML Model. • Citation Network Dataset (research dataset): “ArnetMiner: Extraction and Mining of Academic Social Networks”, by J. Tang et al.; used a subset with 52K papers, 80K authors, 140K author relationships and 29K citation relationships • Neo4j: create a co-authorship graph and connected feature engineering • Spark and MLlib: train and test our model using a random forest classifier
  13. 4 Models, Multiple Graph Features (Trial, Trial and Error!) • Common Authors Model: Common Authors • “Graphy” Model: Preferential Attachment, Total Neighbors • Triangles Model: Min & Max Triangles, Min & Max Clustering Coefficient • Community Model: Label Propagation, Louvain Modularity
  14. Test/Train Split
  15. Test/Train Split
      1st Node  2nd Node  Common Neighbors  Preferential Attachment  label
      Train:
      1         2         4                 15                       1
      3         4         7                 12                       1
      5         6         1                 1                        0
      Test:
      2         12        3                 3                        0
      4         9         4                 8                        1
      7         10        12                36                       1
      8         11        2                 3                        0
  16. Did you get really high accuracy on your first run without tuning? OMG I’m Good!… Data Leakage! Graph metric computation for the train set touches data from the test set.
  17. Train and Test Graphs: Time Based Split
      Train (< 2006):
      1st Node  2nd Node  Common Neighbors  Preferential Attachment  label
      1         2         4                 15                       1
      3         4         7                 12                       1
      5         6         1                 1                        0
      Test (>= 2006):
      1st Node  2nd Node  Common Neighbors  Preferential Attachment  label
      2         12        3                 3                        0
      4         9         4                 8                        1
      7         10        12                36                       1
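The time-based split can be sketched as a simple filter on publication year (pure Python; the collaboration records below are invented):

```python
# Hypothetical co-authorship records: (author_a, author_b, year).
collaborations = [
    ("tang", "li", 2004), ("li", "wang", 2005),
    ("tang", "wang", 2007), ("wang", "zhao", 2009),
]

SPLIT_YEAR = 2006  # train on earlier data, test on later, as in the deck

train = [c for c in collaborations if c[2] < SPLIT_YEAR]
test = [c for c in collaborations if c[2] >= SPLIT_YEAR]
# Graph features for the train rows must be computed only on the < 2006
# graph, which avoids the leakage problem from the previous slide.
```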
  19. Class Imbalance: Negative Examples vs. Positive Examples
  20. Class Imbalance. A very high accuracy model could predict that a pair of nodes is not linked.
  21. Class Imbalance
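One common remedy (a sketch of the general technique, not necessarily the talk's exact approach) is to downsample the negative examples so both classes are equally represented:

```python
import random

# Hypothetical labeled pairs: 1 = linked, 0 = not linked (negatives dominate).
examples = [("pos%d" % i, 1) for i in range(10)] + \
           [("neg%d" % i, 0) for i in range(90)]

positives = [e for e in examples if e[1] == 1]
negatives = [e for e in examples if e[1] == 0]

# Downsample negatives to match the positive count.
rng = random.Random(42)  # fixed seed so the sketch is reproducible
balanced = positives + rng.sample(negatives, len(positives))
```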
  22. Training Our Model. This is one decision tree in our random forest, used as a binary classifier to learn how to classify a pair: predicting either linked or not linked.
  23. Result: First Model ROC & AUC. Model 1: Common Authors
  24. Result: All Models. Model 1: Common Authors; Model 4: Community
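AUC has a handy probabilistic reading: the chance that a randomly chosen positive pair is scored above a randomly chosen negative pair. A tiny pure-Python sketch (the labels and scores are invented):

```python
def auc(labels, scores):
    # Rank-based AUC: fraction of (positive, negative) pairs ranked
    # correctly, counting ties as half a win.
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Illustrative model scores for linked (1) / not-linked (0) pairs.
score = auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.2])
```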
  25. Graph Feature Influence for Tuning. For feature importance, the Spark random forest averages the reduction in impurity across all trees in the forest. Feature rankings are relative to the group of features evaluated. Also try PageRank! Try removing some features, such as LabelPropagation.
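The "reduction in impurity" that Spark averages can be illustrated with Gini impurity for a single split (a minimal sketch with invented labels, not Spark's implementation):

```python
def gini(labels):
    # Gini impurity of a binary label set: 1 - p1^2 - p0^2.
    p1 = sum(labels) / len(labels)
    return 1 - p1 ** 2 - (1 - p1) ** 2

parent = [1, 1, 1, 0, 0, 0]          # labels reaching a tree node
left, right = [1, 1, 1, 0], [0, 0]   # labels after splitting on some feature

# Weighted impurity reduction; a random forest credits this to the split
# feature and averages it across all trees to rank features.
reduction = gini(parent) \
    - len(left) / len(parent) * gini(left) \
    - len(right) / len(parent) * gini(right)
```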
  26. Graph Feature Selection. Feature selection is how we reduce the number of features used in a model to a relevant subset. This can be done algorithmically or based on domain expertise, but the objective is to maximize the predictive power of your model while minimizing overfitting.
  27. Graph Algorithms for Feature Engineering
  28. Graph Feature Categories & Algorithms. • Pathfinding & Search: finds the optimal paths or evaluates route availability and quality • Centrality / Importance: determines the importance of distinct nodes in the network • Community Detection: detects group clustering or partition options • Heuristic Link Prediction: estimates the likelihood of nodes forming a relationship • Similarity: evaluates how alike nodes are • Embeddings: learned representations of connectivity or topology
  29. Triangles and Clustering Coefficient: the probability that a node’s neighbors are connected. • Basic network analysis, e.g. “small-world” structures, stability • ML features: tightness of groups, probability of links. Examples: node u with Triangles = 2, CC = 0.33; node u with Triangles = 2, CC = 0.2
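Both measures are easy to compute by hand. A pure-Python sketch on an invented graph that reproduces the slide's first example (node u in 2 triangles, CC ≈ 0.33):

```python
# Node "u" has 4 neighbors; two neighbor pairs are themselves connected,
# giving 2 triangles through u (illustrative graph).
adj = {
    "u": {"a", "b", "c", "d"},
    "a": {"u", "b"}, "b": {"u", "a"},
    "c": {"u", "d"}, "d": {"u", "c"},
}

def triangles(v):
    # Count edges among v's neighbors; each closes one triangle at v.
    nbrs = sorted(adj[v])
    return sum(1 for i, a in enumerate(nbrs) for b in nbrs[i + 1:]
               if b in adj[a])

def clustering_coefficient(v):
    # Fraction of possible neighbor pairs that are actually connected.
    d = len(adj[v])
    return 2 * triangles(v) / (d * (d - 1))
```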
  30. Label Propagation. • Nodes adopt labels based on their neighbors to infer clusters • Great for proposing initial clusters at large scale • Graph ML feature: group membership (classification)
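A minimal asynchronous label-propagation sketch in pure Python. The graph and the deterministic tie-breaking rule are my choices for illustration; production implementations typically randomize node order:

```python
def label_propagation(adj, sweeps=5):
    labels = {v: v for v in adj}  # every node starts in its own cluster
    for _ in range(sweeps):
        for v in sorted(adj):  # deterministic order for this sketch
            counts = {}
            for n in adj[v]:
                counts[labels[n]] = counts.get(labels[n], 0) + 1
            # Adopt the most common neighbor label (break ties by max label).
            labels[v] = max(counts, key=lambda l: (counts[l], l))
    return labels

def clique(nodes):
    return {v: set(nodes) - {v} for v in nodes}

# Two 4-cliques joined by a single bridge edge (3-4).
adj = {**clique([0, 1, 2, 3]), **clique([4, 5, 6, 7])}
adj[3].add(4)
adj[4].add(3)

labels = label_propagation(adj)
```

Nodes in each clique converge to a shared label, so the label itself can be used as a group-membership feature.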
  31. PageRank. • Broad influence based on transitive relationships and the originating node’s influence • “Golfing with the CEO” • Graph ML features: score of top influencers, influence ranking, contextual ranking
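A bare power-iteration PageRank in pure Python (the directed graph is invented; in practice you would call Neo4j's graph algorithms rather than roll your own):

```python
def pagerank(out_links, d=0.85, iterations=50):
    n = len(out_links)
    rank = {v: 1 / n for v in out_links}
    for _ in range(iterations):
        new = {v: (1 - d) / n for v in out_links}
        for v, outs in out_links.items():
            # A dangling node (no out-links) spreads its rank evenly.
            targets = outs if outs else list(out_links)
            for u in targets:
                new[u] += d * rank[v] / len(targets)
        rank = new
    return rank

# Tiny directed graph: everyone links to node 0, which links back to node 1.
rank = pagerank({0: [1], 1: [0], 2: [0], 3: [0]})
```

Node 0 collects the most rank, and node 1 inherits influence from it: influence is transitive, exactly the "golfing with the CEO" effect.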
  32. Jaccard Similarity. • Measure of proportional similarity between nodes • Recommendation of similar items • Graph ML feature: coefficient representing the similarity of nodes, often used as part of link prediction
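Jaccard similarity is just intersection over union of the two nodes' neighbor sets (a sketch with invented neighbor sets):

```python
def jaccard(a, b):
    # |A ∩ B| / |A ∪ B|: 0 for disjoint sets, 1 for identical sets.
    return len(a & b) / len(a | b)

# Hypothetical neighbor sets of two nodes.
neighbors_a = {"x", "y", "z"}
neighbors_b = {"y", "z", "w"}
similarity = jaccard(neighbors_a, neighbors_b)
```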
  33. Preferential Attachment. • Measures closeness by multiplying the number of connections two nodes have • Rich get richer… • Graph ML feature: probability of relationships forming. Illustration: be.amazd.com/link-prediction/
  34. Common Neighbors. • Based on the number of potential / closing triangles • Two strangers with a lot of friends in common… • Graph ML feature: probability of relationships forming. Weight with Adamic-Adar. Illustration: be.amazd.com/link-prediction/
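The Adamic-Adar weighting mentioned on the slide discounts common neighbors by their degree: a shared friend who knows everyone is weak evidence of a link. A pure-Python sketch (the graph is invented):

```python
import math

# Illustrative undirected graph as adjacency sets.
adj = {
    "a": {"c", "d"}, "b": {"c", "d"},
    "c": {"a", "b"}, "d": {"a", "b", "e"}, "e": {"d"},
}

def adamic_adar(u, v):
    # Sum 1/log(degree) over common neighbors: high-degree hubs count less.
    return sum(1 / math.log(len(adj[w])) for w in adj[u] & adj[v])

score = adamic_adar("a", "b")  # common neighbors: c (degree 2), d (degree 3)
```

Node c (degree 2) contributes more to the score than the busier node d (degree 3), which is the whole point of the weighting.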
  35. DEMO: Data. • Game of Thrones, Books: 800 nodes, 2,900 relationships (weighted by interactions) • Game of Thrones, TV series: 400 nodes, 565,800 relationships. Andrew Beveridge’s script-to-graph data.
  36. DEMO: Neo4j Desktop, Algorithms, Playground. neo4j.com/download/ • install.graphapp.io (Lab Tools)
  37. Resources. • Algorithms Guide in Sandbox: neo4j.com/sandbox • Algorithms Playground (NEuler): neo4j.com/developer/ • Community for Q&A: community.neo4j.com • Code & citation data from the O’Reilly book: bit.ly/2FPgGVV (ML folder). Amy.Hodler@neo4j.com @amyhodler neo4j.com/graph-algorithms-book
  38. DEMO BACKUP
