
Using Local Spectral Methods to Robustify Graph-Based Learning


This is my KDD2015 talk on robustness in semi-supervised learning. The paper is available on Michael Mahoney's website: http://www.stat.berkeley.edu/~mmahoney/pubs/robustifying-kdd15.pdf. See the KDD paper for all the details, which this talk is a bit light on.




  1. Using Local Spectral Methods to Robustify Graph-Based Learning
     David F. Gleich, Purdue University
     Joint work with Michael Mahoney (Berkeley)
     Supported by NSF CAREER CCF-1149756
     Code: www.cs.purdue.edu/homes/dgleich/codes/robust-diffusions
     KDD2015
  2. The graph-based data analysis pipeline
     [figure: 0/1 adjacency matrix illustrating the constructed graph]
     Raw data: relationships, images, text records, etc.
     Convert to a graph: nearest neighbors, kernels, 2-mode to 1-mode, etc.
     Algorithm/learning: important nodes, infer features, clustering, etc.
  3. "Noise" in the initial data modeling decisions
     Explicit graphs are those that are given to a data analyst ("a social network").
     •  Known spam accounts included? Users not logged in for a year? Etc.
     •  A type of noise.
     Constructed graphs are built based on some other primary data ("nearest-neighbor graphs").
     •  k-NN or ε-NN; thresholding correlations to zero.
     •  Often made for computational convenience (graph too big). A different type of noise.
     Labeled graphs occur in information diffusion/propagation ("function prediction").
     •  Labeled nodes, labeled edges; some are wrong. A direct type of noise.
     Do these decisions matter? Our experience: yes, dramatically so!
  4. The graph-based data analysis pipeline
     [figure: same pipeline as slide 2 (raw data, convert to a graph, algorithm/learning)]
     Most algorithmic and statistical research happens at the algorithm/learning step,
     but the potential downstream signal is determined by the graph-construction step.
  5. Our goal: towards an integrative analysis
     [figure: pipeline illustration]
     •  How does the graph creation process affect the outcomes of graph-based learning?
     •  Is there anything we can do to make this process more robust?
  6. Graph-based learning is usually only one component of a big pipeline
     [figure: a chain of graphs/matrices representing the stages of the pipeline]
     Many databases over genes with survival rates for various cancers, a list of
     possible genes responsible for survival, cluster analysis, reintegration of data.
     THIS STEP SHOULD BE ROBUST TO VARIATIONS ABOVE.
  7. Scalable graph analytics
     Local methods are one of the most successful classes of scalable graph analytics:
     they don't even look at the entire graph.
     •  Andersen-Chung-Lang (ACL) push method
     Conjecture: local methods regularize some variant of the original algorithm or problem.
     Justification: for ACL and a few relatives this is exact!
     Impact? Improved robustness to noise?
     [figure: for instance, to answer "what function" is shared by the starting node,
     we'd only look at the circled region]
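For intuition about the push method mentioned above, here is a minimal sketch of an ACL-style push procedure for a personalized PageRank diffusion (a common variant, not necessarily the exact solver used in the paper); the function name, alpha, and eps are placeholders.

    import networkx as nx

    def acl_push(G, seed, alpha=0.85, eps=1e-4):
        # Approximate a personalized PageRank diffusion from `seed` by the push rule:
        # keep (1 - alpha) of a node's residual mass, spread alpha of it to neighbors,
        # and only process nodes whose residual exceeds eps times their degree.
        p = {}               # sparse approximation of the diffusion
        r = {seed: 1.0}      # residual mass still to be pushed
        queue = [seed]
        while queue:
            u = queue.pop()
            du = G.degree(u)
            if du == 0 or r.get(u, 0.0) < eps * du:
                continue
            ru = r[u]
            p[u] = p.get(u, 0.0) + (1 - alpha) * ru
            r[u] = 0.0
            for v in G.neighbors(u):
                r[v] = r.get(v, 0.0) + alpha * ru / du
                if r[v] >= eps * G.degree(v):
                    queue.append(v)
        return p

    # Example: the returned dict only touches nodes near the seed.
    p = acl_push(nx.karate_club_graph(), seed=0, alpha=0.85, eps=1e-3)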
  8. Our contributions
     We study these issues in the case of semi-supervised learning (SSL) on graphs.
     1.  We illustrate a common mincut framework for a variety of SSL methods.
     2.  We show how to "localize" one of them (and make it scalable!).
     3.  We provide a more robust SSL labeling method.
     4.  We identify a weakness in SSL methods: they cannot use extra edges.
         We find one useful way to do so.
  9. Semi-supervised graph-based learning
     Given a graph and a few labeled nodes, predict the labels on the rest of the graph.
     Algorithm:
     1.  Run a diffusion for each label (possibly with negative information from other classes).
     2.  Assign new labels based on the value of each diffusion.
  10.-11. Semi-supervised graph-based learning (these two slides repeat the text of slide 9)
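A minimal sketch of the two-step procedure on slide 9, using a Zhou et al.-style diffusion per class and value-based label assignment; the dense linear solve and the variable names are illustrative, not the paper's implementation.

    import numpy as np

    def ssl_by_diffusion(A, labels, alpha=0.99):
        # A: symmetric adjacency/weight matrix; labels: dict {node: class} for the
        # few labeled nodes. Run one diffusion per class, then label by largest value.
        n = A.shape[0]
        d = np.maximum(A.sum(axis=1), 1e-12)
        S = A / np.sqrt(np.outer(d, d))                 # degree-normalized adjacency
        classes = sorted(set(labels.values()))
        F = np.zeros((n, len(classes)))
        for j, c in enumerate(classes):
            y = np.zeros(n)
            for node, lab in labels.items():
                y[node] = 1.0 if lab == c else 0.0      # seed indicator for class c
            # Zhou et al.-style diffusion: (I - alpha S) f = (1 - alpha) y
            F[:, j] = np.linalg.solve(np.eye(n) - alpha * S, (1 - alpha) * y)
        # Step 2: value-based rounding (argmax over the per-class diffusions).
        return F, np.array(classes)[np.argmax(F, axis=1)]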
  12. The diffusions proposed for semi-supervised learning are s,t-cut minorants
      [figure: small example graph with source s and sink t]
      MINCUT LP:
      minimize \sum_{ij \in E} C_{i,j} |x_i - x_j|  subject to  x_s = 1, x_t = 0.
      Spectral minorant (a linear system):
      minimize \sqrt{\sum_{ij \in E} C_{i,j} |x_i - x_j|^2}  subject to  x_s = 1, x_t = 0.
      In the unweighted case, solve via max-flow; in the weighted case, solve via
      network simplex or an industrial LP solver.
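The spectral minorant above is a quadratic objective once the square root is dropped (which doesn't change the minimizer), so the solution satisfies a graph-Laplacian linear system with x_s = 1 and x_t = 0 held fixed. A minimal dense-matrix sketch; the function name and the dense solve are illustrative.

    import numpy as np

    def spectral_minorant(C, s, t):
        # Minimize sum_{ij in E} C_ij (x_i - x_j)^2 subject to x_s = 1, x_t = 0.
        # C: symmetric nonnegative weight matrix. Eliminate the boundary nodes {s, t}
        # and solve the Laplacian system on the remaining (free) nodes.
        n = C.shape[0]
        L = np.diag(C.sum(axis=1)) - C                  # graph Laplacian
        free = [i for i in range(n) if i not in (s, t)]
        x = np.zeros(n)
        x[s] = 1.0                                      # boundary values x_s = 1, x_t = 0
        rhs = -L[np.ix_(free, [s, t])] @ np.array([1.0, 0.0])
        x[free] = np.linalg.solve(L[np.ix_(free, free)], rhs)
        return x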
  13. Representative cut problems
      [figure: two cut constructions on the example graph. ZGL attaches s and t to the
      labeled nodes with ∞-weight edges; Zhou et al. attaches s and t to every node with
      α-scaled (degree-weighted) edges. Positive label, negative label, and unlabeled
      nodes are marked.]
      Andersen-Lang has a weighting variation too; Joachims has a variation too.
      Zhou et al., NIPS 2003; Zhu et al., ICML 2003; Andersen and Lang, SODA 2008;
      Joachims, ICML 2003.
      These help our intuition about the solutions. All spectral minorants are linear systems.
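For reference, the standard (non-cut) objectives behind two of these constructions, as given in the cited papers; this is recalled from those papers, not taken from the slide.

    % ZGL (Zhu, Ghahramani, Lafferty, ICML 2003): harmonic function with hard label constraints
    \min_{f}\ \tfrac{1}{2}\sum_{ij \in E} w_{ij}\,(f_i - f_j)^2
    \quad\text{subject to}\quad f_i = y_i \ \text{for all labeled nodes } i

    % Zhou et al. (NIPS 2003): degree-normalized smoothness plus a soft fit to the labels
    \min_{f}\ \tfrac{1}{2}\sum_{ij \in E} w_{ij}\left(\frac{f_i}{\sqrt{d_i}} - \frac{f_j}{\sqrt{d_j}}\right)^{2}
    + \mu \sum_{i} (f_i - y_i)^2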
  14. Implicit regularization views on the Zhou et al. diffusion
      [figure: the Zhou et al. cut construction with α-scaled edges to s and t]
      RESULT: the spectral minorant of Zhou is equivalent to the weakly-local MOV solution.
      PROOF: the two linear systems are the same (after working out a few equivalences).
      IMPORTANCE: we'd expect Zhou to be "more robust."
      minimize \sqrt{\sum_{ij \in E} C_{i,j} |x_i - x_j|^2}  subject to  x_s = 1, x_t = 0.
      The Mahoney-Orecchia-Vishnoi (MOV) vector is a localized variation on the Fiedler
      vector that finds a small-conductance set near a seed set.
  15. A scalable, localized algorithm for Zhou et al.'s diffusion
      RESULT: we can use a variation on coordinate-descent methods, related to the
      Andersen-Chung-Lang PUSH procedure, to solve Zhou's diffusion in a scalable manner.
      PROOF: see Gleich and Mahoney, ICML 2014.
      IMPORTANCE (1): we should be able to make Zhou et al. scale.
      IMPORTANCE (2): using this algorithm adds another implicit regularization term
      that should further improve robustness!
      Original problem:
      minimize \sqrt{\sum_{ij \in E} C_{i,j} |x_i - x_j|^2}  subject to  x_s = 1, x_t = 0.
      Regularized (push) problem:
      minimize \sum_{ij \in E} C_{i,j} |x_i - x_j|^2 + \tau \sum_{i \in V} d_i x_i
      subject to  x_s = 1, x_t = 0, x_i \ge 0.
  16. Semi-supervised graph-based learning (repeat of slide 9: run a diffusion for each
      label, then assign new labels based on the value of each diffusion)
  17. Traditional rounding methods for SSL are value-based
      [figure: three-class example showing the per-class values of Zhou's diffusion]
      VALUE-BASED: use the largest value of the diffusion to pick the label.
  18. But value-based rounding doesn't work for all diffusions
      [figure panels: (b) Zhou et al., l = 3; (c) Andersen-Lang, l = 3; (d) Joachims, l = 3;
      (e) ZGL, l = 3; (f) Zhou et al., l = 15; (g) Andersen-Lang, l = 15; (h) Joachims, l = 15;
      (i) ZGL, l = 15]
      VALUE-BASED rounding fails for most of these diffusions, BUT there is still a
      signal there! Adding more labels doesn't help either; see the paper for those details.
  19. Rank-based rounding is far more robust
      NEW IDEA: look at the RANK of the item in each diffusion instead of its VALUE.
      JUSTIFICATION: based on the idea of sweep-cut rounding in spectral methods
      (use the order induced by the eigenvector, not its values).
      IMPACT: much more robust rounding to labels.
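A minimal sketch of rank-based rounding as described above: within each diffusion, replace values by their ranks (as in a sweep cut, only the order matters), then take the class where a node's rank is highest. The function name is illustrative.

    import numpy as np

    def rank_based_rounding(F):
        # F: n x k matrix with one diffusion column per class.
        # Replace each column by its within-column ranks, then argmax across classes.
        n, k = F.shape
        R = np.zeros_like(F)
        for c in range(k):
            order = np.argsort(F[:, c])        # ascending by diffusion value
            R[order, c] = np.arange(n)         # rank 0 = smallest, n - 1 = largest
        return np.argmax(R, axis=1)

Compared with value-based rounding (argmax of F itself), this is insensitive to the very different scales that the diffusions on slide 18 can take.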
  20. Rank-based rounding has a big impact on a real study
      [figure: error rate vs. average training samples per class, for Zhou and Zhou+Push,
      under value-based and rank-based rounding]
      We used the digit-prediction task from Zhou's paper, added just a bit of noise
      as label errors, and switched parameters.
  21. Main empirical results
      1.  Zhou's diffusion seems to work best for sparse graphs, whereas the ZGL
          diffusion works best for dense graphs.
      2.  On the digits dataset, dense graph constructions yield higher error rates.
      3.  Densifying a super-sparse graph construction on the digits dataset yields
          lower error.
      4.  A similar fact holds on an Amazon co-purchasing network.
  22. An illustrative synthetic problem shows the differences
      Two-class block model, 150 nodes per class; between-class probability 0.02;
      within-class probability 0.35 (dense) or 0.06 (sparse).
      We reveal labels for k nodes (varied) with different label error rates:
      sparse graph with 0%/10% (low/high) error, dense graph with 20%/60% (low/high) error.
      [figure: number of mistakes vs. number of labels for Joachims, Zhou, and ZGL in four
      settings: sparse graph with high error (the real-world scenario), sparse graph with
      low error, dense graph with low error, dense graph with high error]
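A minimal sketch of the synthetic setup on this slide (150 nodes per class, between-class probability 0.02, within-class probability 0.35 or 0.06), with a label-reveal step that flips a fraction of the revealed labels. networkx's stochastic block model is used for convenience; the error counting is not shown.

    import random
    import networkx as nx

    def two_block_model(n_per_class=150, p_within=0.06, p_between=0.02, seed=0):
        # Planted two-class graph: dense within each block, sparse between blocks.
        probs = [[p_within, p_between], [p_between, p_within]]
        G = nx.stochastic_block_model([n_per_class, n_per_class], probs, seed=seed)
        true_labels = [0] * n_per_class + [1] * n_per_class
        return G, true_labels

    def reveal_labels(true_labels, k, error_rate=0.0, seed=0):
        # Reveal k labels per class; each revealed label is flipped with prob. error_rate.
        rng = random.Random(seed)
        revealed = {}
        for c in (0, 1):
            nodes = [i for i, lab in enumerate(true_labels) if lab == c]
            for i in rng.sample(nodes, k):
                revealed[i] = 1 - c if rng.random() < error_rate else c
        return revealed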
  23. Varying density in an SSL construction
      A_{i,j} = \exp\!\left( -\frac{\| d_i - d_j \|_2^2}{2 \sigma^2} \right),
      where d_i and d_j are the data vectors; σ = 2.5 or σ = 1.25.
      We use the digits experiment from Zhou et al. 2003: 10 digits and a few label errors.
      We vary density either by the number of nearest neighbors or by the kernel width.
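A minimal sketch of the construction whose density is varied on this slide: a Gaussian-kernel weight matrix, optionally sparsified to each point's k strongest neighbors. X, sigma, and k are placeholders for the digits data and the parameters on the slide.

    import numpy as np

    def gaussian_knn_graph(X, sigma=1.25, k=None):
        # X: n x p data matrix (e.g., digit feature vectors).
        # A_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)); if k is given, keep only the
        # k strongest neighbors per row and symmetrize the sparsity pattern.
        sq = np.sum(X**2, axis=1)
        D2 = np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0.0)
        A = np.exp(-D2 / (2 * sigma**2))
        np.fill_diagonal(A, 0.0)                       # no self-loops
        if k is not None:
            keep = np.zeros(A.shape, dtype=bool)
            idx = np.argsort(-A, axis=1)[:, :k]        # k strongest neighbors per row
            keep[np.arange(A.shape[0])[:, None], idx] = True
            A = A * np.maximum(keep, keep.T)           # elementwise OR of the patterns
        return A

Increasing sigma or k makes the graph denser, which is exactly the knob the next slide turns.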
  24. As density increases, the results just get worse
      [figure: error rate vs. kernel width σ (0.8 to 2.5) and vs. number of nearest
      neighbors (5 to 250), for Zhou and Zhou+Push]
      •  Adding "more" edges seems to only hurt (unless there is no signal).
      •  Zhou+Push seems to be slightly more robust (maybe).
  25. Some observations and a question
      Adding "more data" yields "worse results" for this procedure (in a simple setting).
      Suppose I have a real-world system that can work with up to E edges on some graph.
      Is there a way I can create new edges?
  26. Densifying the graph with path expansions
      A_k = \sum_{\ell=1}^{k} A^{\ell}
      If A is the adjacency matrix, then this counts the total weight on all paths of
      length up to k.
      We now repeat the nearest-neighbor computation, but with paired parameters chosen
      so that we have the same average degree (k = 4, nn = 3).
      Error rates:
      Avg. deg.   Zhou (k = 1)   Zhou (k > 1)   Zhou w. Push (k = 1)   Zhou w. Push (k > 1)
      19          0.163          0.114          0.156                  0.117
      41          0.156          0.132          0.158                  0.113
      53          0.183          0.142          0.179                  0.136
      104         0.193          0.145          0.178                  0.144
      138         0.216          0.102          0.204                  0.101
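A minimal sketch of the path-expansion densification A_k = A + A^2 + ... + A^k using sparse matrix powers. Dropping the diagonal (closed walks) and any re-sparsification to a target average degree are assumptions about details the slide does not spell out.

    import scipy.sparse as sp

    def path_expansion(A, k):
        # A: sparse adjacency matrix. Return A_k = A + A^2 + ... + A^k, which sums
        # the weight of all paths of length up to k between each pair of nodes.
        A = sp.csr_matrix(A, dtype=float)
        Ak = A.copy()
        P = A.copy()
        for _ in range(2, k + 1):
            P = P @ A                   # walks one step longer
            Ak = Ak + P
        Ak = sp.csr_matrix(Ak)
        Ak.setdiag(0.0)                 # assumption: drop closed walks back to the start
        Ak.eliminate_zeros()
        return Ak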
  27. The same result holds for Amazon's co-purchasing network
      A_k = \sum_{\ell=1}^{k} A^{\ell}
      Mean F1 (with confidence intervals):
      k    Zhou                 Zhou w. Push
      1    0.173 [0.15, 0.19]   0.229 [0.21, 0.25]
      2    0.197 [0.18, 0.22]   0.231 [0.21, 0.25]
      3    0.221 [0.17, 0.27]   0.238 [0.19, 0.28]
      Amazon's co-purchasing network (on SNAP) is effectively a highly sparse nearest-
      neighbor network derived from their (denser) co-purchasing graph. We attempt to
      predict the items in a product category based on a small sample and study the F1
      score of the predictions. Some small details are omitted; see the full paper.
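A minimal sketch of the F1 evaluation described on this slide (predicted category members vs. true members); the thresholding that turns a diffusion into a predicted set is not shown.

    def f1_score_sets(true_members, predicted_members):
        # F1 between the true members of a product category and the predicted members.
        true_set, pred_set = set(true_members), set(predicted_members)
        tp = len(true_set & pred_set)
        if tp == 0 or not pred_set or not true_set:
            return 0.0
        precision = tp / len(pred_set)
        recall = tp / len(true_set)
        return 2 * precision * recall / (precision + recall)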
  28. Towards some theory, i.e. why are densified sparse graphs better?
      [Figure 2, panels (a) K2 sparse, (b) K2 dense, (c) RK2; truncated caption: "We
      artificially densify this graph to A_k based on a cons... and dense diffusions and
      regularization. The color indicates ... the circled nodes. The unavoidable errors
      are caused by a mislabeled ... regularizing diffusions on dense graphs produces only
      a small ..." The error and the labels are marked on the figure.]
      How do sparsity, density, and regularization of a diffusion play into the results
      in a controlled setting?
  29. Towards some theory, i.e. why are densified sparse graphs better? (continued)
      [Figure 2 again, now annotated with the densified diffusion \sum_{\ell=1}^{5} A^{\ell},
      the regularized and dense variants, and the result of using the Push algorithm;
      the error and the labels are marked.]
  30. Recap, discussion, future work
      Contributions:
      1.  Flow setup for SSL diffusions.
      2.  New robust rounding rule for class selection.
      3.  Localized Zhou's diffusion.
      4.  Empirical insights on the density of graph constructions.
      Observations:
      •  Many of these insights translate to directed, weighted graphs with fuzzy labels
         and/or some parallel architectures.
      •  A weakness: the results on density are mainly empirical.
      •  We need a theoretical basis for the densification theory!
      Supported by NSF, ARO, DARPA.
      CODE: www.cs.purdue.edu/homes/dgleich/codes/robust-diffusions
