
Improve ML Predictions Using Graph Algorithms (webinar, July 23, 2019)

Graph enhancements to AI and ML are changing the landscape of intelligent applications. In this webinar, we’ll focus on using graph feature engineering to improve the accuracy, precision, and recall of machine learning models. You’ll learn how graph algorithms can provide more predictive features as well as aid in feature selection to reduce overfitting. We’ll illustrate a link prediction workflow using Spark and Neo4j to predict collaboration and discuss our missteps and tips to get to measurable improvements.

Published in: Data & Analytics

  1. 1. Improve ML Predictions using Graph Algorithms Jennifer Reif, Neo4j Amy Hodler, Neo4j July 2019 #Neo4j #GraphAnalytics
  2. 2. What in Common is Predictive?
  3. 3. Relationships: Strongest Predictors of Behavior! “Increasingly we're learning that you can make better predictions about people by getting all the information from their friends and their friends’ friends than you can from the information you have about the person themselves” James Fowler David Burkus James Fowler Albert-Laszlo Barabasi
  4. 4. • Graphs for Predictions • Connected Features • Link Prediction • Neo4j + Spark Workflow Amy E. Hodler Graph Analytics & AI Program Manager, Neo4j Amy.Hodler@neo4j.com @amyhodler Jennifer Reif Labs Engineer, Neo4j Jennifer.Reif@neo4j.com @JMHReif 4
  5. 5. Native Graph Platforms are Designed for Connected Data TRADITIONAL PLATFORMS BIG DATA TECHNOLOGY Store and retrieve data Aggregate and filter data Connections in data Real time storage & retrieval Real-Time Connected Insights Long running queries aggregation & filtering “Our Neo4j solution is literally thousands of times faster than the prior MySQL solution, with queries that require 10-100 times less code” Volker Pacher, Senior Developer Max # of hops ~3 Millions 5
  6. 6. Graph Databases Surging in Popularity Trends since 2013 DB-Engines.com 6
  7. 7. Graph in AI Research is Taking Off [Chart: research papers on graph-related AI, 2010–2018, counting mentions of graph neural network, graph convolutional, graph embedding, graph learning, graph attention, graph kernel, and graph completion. Source: Dimension Knowledge System] 7
  8. 8. Machine Learning Eats a Lot of Data Machine learning uses algorithms to train software through specific examples and progressive improvements. Algorithms iterate, continually adjusting to get closer to an objective, such as error reduction. This learning requires feeding a lot of data to a model, enabling it to learn how to process and incorporate that information. 8
  9. 9. • Many data science models ignore network structure & complex relationships • Graphs add highly predictive features to existing ML models • Otherwise unattainable predictions based on relationships More Accurate Predictions with the Data You Already Have Machine Learning Pipeline 9
  10. 10. Graph Data Science Applications EXAMPLES Financial Crimes Recommendations Cybersecurity Predictive Maintenance Customer Segmentation Churn Prediction Search & MDM Drug Discovery 10
  11. 11. Graph Data Science Gives Us Better Decisions Knowledge Graphs Higher Accuracy Connected Feature Engineering More Trust and Applicability Graph Native Learning 11
  12. 12. Connected Features 12
  13. 13. Connection-related metrics about our graph, such as the number of relationships going into or out of nodes, a count of potential triangles, or neighbors in common. 13 What Are Connected Features?
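As a minimal sketch of what such connection metrics look like, the following plain-Python snippet computes node degree and neighbors-in-common on a tiny hypothetical graph held as adjacency sets (the talk itself derives these in Neo4j, not in Python):

```python
# Toy co-authorship graph as adjacency sets (hypothetical example, not the talk's dataset).
graph = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"A", "C"},
}

def degree(g, node):
    # Number of relationships going into or out of the node.
    return len(g[node])

def common_neighbors(g, u, v):
    # Count of neighbors the two nodes share.
    return len(g[u] & g[v])

print(degree(graph, "A"))                 # 3
print(common_neighbors(graph, "B", "D"))  # B and D share A and C -> 2
```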
  14. 14. Query (e.g. Cypher) Real-time, local decisioning and pattern matching Graph Algorithms Libraries Global analysis and iterations You know what you’re looking for and making a decision You’re learning the overall structure of a network, updating data, and predicting Local Patterns Global Computation Deriving Connected Features 14
  15. 15. Graph Feature Engineering Feature Engineering is how we combine and process the data to create new, more meaningful features, such as clustering or connectivity metrics. Add More Descriptive Features: - Influence - Relationships - Communities Extraction 15
  16. 16. 16 Graph Feature Categories & Algorithms Pathfinding & Search: finds the optimal paths or evaluates route availability and quality. Centrality / Importance: determines the importance of distinct nodes in the network. Community Detection: detects group clustering or partition options. Heuristic Link Prediction: estimates the likelihood of nodes forming a relationship. Similarity: evaluates how alike nodes are. Embeddings: learned representations of connectivity or topology. 16
  17. 17. Link Prediction 17
  18. 18. 18 Can we infer new interactions in the future? What unobserved facts are we missing?
  19. 19. + 50 years of biomedical data integrated in a knowledge graph Predicting new uses for drugs by using the graph structure to create features for link prediction Example: het.io 19
  20. 20. Example: het.io 20
  21. 21. 21 Using Graph Algorithms Explore, Plan, Measure: find significant patterns and plan for optimal structures; score outcomes and set a threshold value for a prediction. Feature Engineering for Machine Learning: use the measures as features to train a model.
      1st Node | 2nd Node | Common Neighbors | Preferential Attachment | label
      1        | 2        | 4                | 15                      | 1
      3        | 4        | 7                | 12                      | 1
      5        | 6        | 1                | 1                       | 0
  22. 22. Example: Predicting Collaboration
  23. 23. • Citation Network Dataset - Research Dataset – “ArnetMiner: Extraction and Mining of Academic Social Networks”, by J. Tang et al – Used a subset with 52K papers, 80K authors, 140K author relationships and 29K citation relationships • Neo4j – Create a co-authorship graph and connected feature engineering • Spark and MLlib – Train and test our model using a random forest classifier 23 Predicting Collaboration with a Graph Enhanced ML Model
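The co-authorship graph is built by pairing up every two authors of the same paper. In the talk this is done with Cypher in Neo4j; a rough stdlib-Python equivalent of the pairing step, on hypothetical paper records, might look like:

```python
from itertools import combinations
from collections import Counter

# Hypothetical paper records; each paper lists its authors.
papers = [
    {"title": "P1", "authors": ["tang", "reif", "hodler"]},
    {"title": "P2", "authors": ["tang", "reif"]},
    {"title": "P3", "authors": ["hodler"]},
]

# Count collaborations per unordered author pair (the CO_AUTHOR relationships).
coauthor_counts = Counter()
for paper in papers:
    for a, b in combinations(sorted(paper["authors"]), 2):
        coauthor_counts[(a, b)] += 1

print(coauthor_counts[("reif", "tang")])  # 2 joint papers
```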
  24. 24. Our Link Prediction Workflow Extract Data & Store as Graph Explore, Clean, Modify Prepare for Machine Learning Train Models Evaluate Results Productionize 24
  25. 25. Our Link Prediction Workflow Import Data Create Co-Author Graph Extract Data & Store as Graph Explore, Clean, Modify Prepare for Machine Learning Train Models Evaluate Results Productionize 25
  27. 27. Our Link Prediction Workflow Extract Data & Store as Graph Explore, Clean, Modify Prepare for Machine Learning Train Models Evaluate Results Productionize Import Data Create Co-Author Graph Identify sparse feature areas Feature Engineering: New graphy features 27
  28. 28. Graph Algorithms Used for Feature Engineering (a few examples) Preferential Attachment multiplies the neighbor counts of a pair of nodes. Common Neighbors counts the neighbors two nodes share (a signal of triadic closure). Illustration: be.amazd.com/link-prediction/ 28
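Preferential attachment as described above is just the product of the two nodes' neighbor counts; a toy sketch on a hypothetical graph (the actual features come from Neo4j's algorithm library):

```python
# Toy graph as adjacency sets (hypothetical data, not the citation dataset).
graph = {
    1: {2, 3, 4},
    2: {1, 3},
    3: {1, 2, 4},
    4: {1, 3},
}

def preferential_attachment(g, u, v):
    # Product of the two nodes' neighbor counts: well-connected nodes
    # are more likely to gain further connections.
    return len(g[u]) * len(g[v])

print(preferential_attachment(graph, 2, 4))  # 2 * 2 = 4
```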
  29. 29. Graph Algorithms Used for Feature Engineering (a few examples) Triangle Count and Clustering Coefficient measure the density of connections around nodes. Louvain Modularity identifies interacting communities and hierarchies. 29
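Triangle count and local clustering coefficient can likewise be sketched in a few lines of plain Python on a hypothetical adjacency-set graph (the talk computes them with Neo4j's graph algorithms):

```python
from itertools import combinations

# Small undirected toy graph (hypothetical data).
graph = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"A", "C"},
}

def triangle_count(g, node):
    # Pairs of the node's neighbors that are themselves connected.
    return sum(1 for a, b in combinations(g[node], 2) if b in g[a])

def clustering_coefficient(g, node):
    # Fraction of possible neighbor pairs that are actually connected.
    k = len(g[node])
    if k < 2:
        return 0.0
    return 2 * triangle_count(g, node) / (k * (k - 1))

print(triangle_count(graph, "A"))          # triangles A-B-C and A-C-D -> 2
print(clustering_coefficient(graph, "A"))  # 2 / 3 possible pairs connected
```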
  30. 30. Our Link Prediction Workflow Extract Data & Store as Graph Explore, Clean, Modify Prepare for Machine Learning Train Models Evaluate Results Productionize Import Data Create Co-Author Graph Identify sparse feature areas Feature Engineering: New graphy features Train / Test Split Resample: Downsampled for proportional representation 30
  31. 31. 31 Test/Train Split
      1st Node | 2nd Node | Common Neighbors | Preferential Attachment | label
      1        | 2        | 4                | 15                      | 1
      3        | 4        | 7                | 12                      | 1
      5        | 6        | 1                | 1                       | 0
      2        | 12       | 3                | 3                       | 0
      4        | 9        | 4                | 8                       | 1
      7        | 10       | 12               | 36                      | 1
      8        | 11       | 2                | 3                       | 0
  32. 32. 32 Test/Train Split (the same pairs partitioned into Train and Test sets)
      1st Node | 2nd Node | Common Neighbors | Preferential Attachment | label
      1        | 2        | 4                | 15                      | 1
      3        | 4        | 7                | 12                      | 1
      5        | 6        | 1                | 1                       | 0
      2        | 12       | 3                | 3                       | 0
      4        | 9        | 4                | 8                       | 1
      7        | 10       | 12               | 36                      | 1
      8        | 11       | 2                | 3                       | 0
  33. 33. OMG, I'm Good! … Or Is It Data Leakage? Did you get really high accuracy on your first run without tuning? Graph metric computation for the train set may be touching data from the test set. 33
  34. 34. Train and Test Graphs: Time-Based Split
      Train (< 2006):
      1st Node | 2nd Node | Common Neighbors | Preferential Attachment | label
      1        | 2        | 4                | 15                      | 1
      3        | 4        | 7                | 12                      | 1
      5        | 6        | 1                | 1                       | 0
      Test (>= 2006):
      1st Node | 2nd Node | Common Neighbors | Preferential Attachment | label
      2        | 12       | 3                | 3                       | 0
      4        | 9        | 4                | 8                       | 1
      7        | 10       | 12               | 36                      | 1
      34
  35. 35. Train and Test Graphs: Time-Based Split
      Train:
      1st Node | 2nd Node | Common Neighbors | Preferential Attachment | label
      1        | 2        | 4                | 15                      | 1
      3        | 4        | 7                | 12                      | 1
      5        | 6        | 1                | 1                       | 0
      Test:
      1st Node | 2nd Node | Common Neighbors | Preferential Attachment | label
      2        | 12       | 3                | 3                       | 0
      4        | 9        | 4                | 8                       | 1
      7        | 10       | 12               | 36                      | 1
      35
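The time-based split above can be sketched as follows, on hypothetical co-author edges tagged with the year of first collaboration; the key anti-leakage point is that graph features for each side are computed only on that side's subgraph:

```python
# Hypothetical co-author pairs with the year the collaboration first appeared.
edges = [
    {"pair": (1, 2), "year": 2004},
    {"pair": (3, 4), "year": 2005},
    {"pair": (2, 12), "year": 2007},
    {"pair": (7, 10), "year": 2009},
]

SPLIT_YEAR = 2006

# Earlier collaborations train the model; later ones test it.
train = [e for e in edges if e["year"] < SPLIT_YEAR]
test = [e for e in edges if e["year"] >= SPLIT_YEAR]

print(len(train), len(test))  # 2 2
```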
  36. 36. Class Imbalance Negative Examples Positive Examples 36
  37. 37. 37 Class Imbalance With far more negative than positive examples, a model can achieve very high accuracy simply by predicting that every pair of nodes is not linked.
  38. 38. Class Imbalance 38
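The workflow's resampling step (downsampling for proportional representation) amounts to shrinking the negative class to the size of the positive class; a hedged sketch on made-up labeled pairs:

```python
import random

# Hypothetical labeled pairs: far more negatives (label 0) than positives (label 1).
examples = [{"label": 1}] * 10 + [{"label": 0}] * 90

positives = [e for e in examples if e["label"] == 1]
negatives = [e for e in examples if e["label"] == 0]

# Downsample negatives so both classes are equally represented.
random.seed(42)  # fixed seed for reproducibility in this sketch
balanced = positives + random.sample(negatives, len(positives))

print(len(balanced))  # 20 examples, 10 per class
```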
  39. 39. Our Link Prediction Workflow Extract Data & Store as Graph Explore, Clean, Modify Prepare for Machine Learning Train Models Evaluate Results Productionize Import Data Create Co-Author Graph Identify sparse feature areas Feature Engineering: New graphy features Train / Test Split Resample: Downsampled for proportional representation Model Selection: Random Forest Ensemble method 39
  40. 40. Picking a Classifier 40
  41. 41. Training Our Model This is one decision tree in our Random Forest used as a binary classifier to learn how to classify a pair: predicting either linked or not linked. 41
  42. 42. 42 4 Layered Models Trained Common Authors Model “Graphy” Model Triangles Model Community Model • Common Authors Adds: • Pref. Attachment • Total Neighbors Adds: • Min & Max Triangles • Min & Max Clustering Coefficient Adds: • Label Propagation • Louvain Modularity Multiple graph features used to train the models
  43. 43. Our Link Prediction Workflow Extract Data & Store as Graph Explore, Clean, Modify Prepare for Machine Learning Train Models Evaluate Results Productionize Import Data Create Co-Author Graph Identify sparse feature areas Feature Engineering: New graphy features Train / Test Split Resample: Downsampled for proportional representation Precision, Accuracy, Recall ROC Curve & AUC Model Selection: Random Forest Ensemble method 43
  44. 44. Measures Accuracy: proportion of total predictions that are correct. Beware of skewed data! Precision: proportion of positive predictions that are correct. Low score = more false positives. Recall / True Positive Rate: proportion of actual positives predicted correctly. Low score = more false negatives. False Positive Rate: proportion of actual negatives incorrectly predicted positive. ROC Curve & AUC: a chart plotting TPR against FPR, summarized by the area under the curve.
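These four measures all fall out of the confusion-matrix counts; as a small worked sketch (the counts below are invented for illustration):

```python
def metrics(tp, fp, tn, fn):
    # Derive the slide's measures from confusion-matrix counts.
    accuracy = (tp + tn) / (tp + fp + tn + fn)  # all correct / all predictions
    precision = tp / (tp + fp)                  # correct positives / predicted positives
    recall = tp / (tp + fn)                     # true positive rate
    fpr = fp / (fp + tn)                        # false positive rate
    return accuracy, precision, recall, fpr

# Hypothetical counts: 40 true positives, 10 false positives,
# 45 true negatives, 5 false negatives.
acc, prec, rec, fpr = metrics(tp=40, fp=10, tn=45, fn=5)
print(acc, prec)  # 0.85 0.8
```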
  45. 45. Result: First Model ROC & AUC Common Authors Model 1: False Positives! False Negatives! 45
  46. 46. Result: All Models Common Authors Model 1 Community Model 4 46
  47. 47. Iteration & Tuning: Feature Influence For feature importance, the Spark random forest averages the reduction in impurity across all trees in the forest Feature rankings are in comparison to the group of features evaluated Also try PageRank! Try removing different features (LabelPropagation) 47
  48. 48. Graph Machine Learning Workflow Data aggregation Create and store graphs Extract Data & Store as Graph Explore, Clean, Modify Prepare for Machine Learning Train Models Evaluate Results Productionize Identify uninteresting features Cleanse (outliers+) Feature engineering/ extraction Train / Test split Resample for meaningful representation (proportional, etc.) Precision, accuracy, recall (ROC curve & AUC) SME Review Cross-validation Model & variable selection Hyperparameter tuning Ensemble methods 48
  49. 49. Resources neo4j.com • /sandbox • /developer/graph-algorithms/ • /graphacademy/online-training/ Data & Code: • This example from O’Reilly book bit.ly/2FPgGVV (ML Folder) Jennifer.Reif@neo4j.com @JMHReif neo4j.com/ graph-algorithms-book Amy.Hodler@neo4j.com @amyhodler 49
