Fast Katz and Commuters: Efficient Estimation of Social Relatedness in Large Networks

1. FAST KATZ AND COMMUTERS Efficient Estimation of Social Relatedness David F. Gleich Sandia National Labs WAW2010 16 December 2010 Palo Alto, CA With Pooya Esfandiar, Francesco Bonchi, Chen Grief, Laks V. S. Lakshmanan, and Byung-Won On Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000. David F. Gleich (Sandia) ICME la/opt seminar 1 / 28

2. MAIN RESULTS For Katz Compute one fast A – adjacency matrix Compute top fast L – Laplacian matrix Katz score : Commute time: For Commute Compute one fast David F. Gleich (Sandia) ICME la/opt seminar 2 of 28

3. OUTLINE Katz Rank and Commute Time, then Why Matrices, moments, and quadrature rules for pairwise scores Sparse linear systems solves for top-k Some results David F. Gleich (Sandia) ICME la/opt seminar 3 of 28

4. NOTE – EVERYTHING IS SIMPLE All graphs are undirected All graphs are connected All graphs are unweighted David F. Gleich (Sandia) ICME la/opt seminar 4 of 28

5. KATZ SCORES The Katz score is Carl Neumann David F. Gleich (Sandia) ICME la/opt seminar 5 of 28

6. COMMUTE TIME Consider a uniform random walk on a graph Also called the hitting time from node i to j, or the first transition time : graph Laplacian is the only null-vector Fouss et al.TKDE 2007 David F. Gleich (Sandia) ICME la/opt seminar 6 of 28

7. WHAT DO OTHER PEOPLE DO? 1) Just work with the linear algebra formulations 2) For Katz, truncate the Neumann series as a few (3-5) terms 3) Use low-rank approximations from EVD(A) or EVD(L) 4) For commute, use Johnson-Lindenstrauss inspired random sampling 5) Approximately decompose into smaller problems Liben-Nowell and Kleinberg CIKM2003, Acar et al. ICDM2009, Spielman and Srivastava STOC2008, Sarkar and Moore UAI2007,Wang et al. ICDM2007 David F. Gleich (Sandia) ICME la/opt seminar 7 of 28

8. THE PROBLEM All of these techniques are preprocessing based because most people’s goal is to compute all the scores. We want to avoid preprocessing the graph. There are a few caveats here! i.e. one could solve the system instead of looking for the matrix inverse David F. Gleich (Sandia) ICME la/opt seminar 8 of 28

9. WHY? LINK PREDICTION Neighborhood based Path based Liben-Nowell and Kleinberg 2003, 2006 found that path based link prediction was more efficient David F. Gleich (Sandia) ICME la/opt seminar 9 of 28

10. WHY NO PREPROCESSING? The graph is constantly changing as I rate new movies. David F. Gleich (Sandia) ICME la/opt seminar 10 of 28

11. WHY NO PREPROCESSING? Top-k predicted “links” are movies to watch! Pairwise scores give user similarity David F. Gleich (Sandia) ICME la/opt seminar 11 of 28

12. PAIRWISE ALGORITHMS Katz Commute Golub and Meurant to the rescue! David F. Gleich (Sandia) ICME la/opt seminar 12 of 28

13. MMQ - THE BIG IDEA Quadratic form Think Weighted sum A is s.p.d. use EVD Stieltjes integral “A tautology” Quadrature approximation Matrix equation Lanczos David F. Gleich (Sandia) ICME la/opt seminar 13 of 28

14. MMQ PROCEDURE SKETCH Bad bounds give worse Goal performance! Given Larger k gives better results; k steps is k matvecs 1. Run k-steps of Lanczos on starting with 2. Compute , with an additional eigenvalue at , set Correspond to a Gauss-Radau rule, with u as a prescribed node Easy to “update” this inverse as k increases. 3. Compute , with an additional eigenvalue at , set Correspond to a Gauss-Radau rule, with l as a prescribed node 4. Output as lower and upper bounds on b David F. Gleich (Sandia) ICME la/opt seminar 14 of 28

15. PRACTICAL MMQ David F. Gleich (Sandia) ICME la/opt seminar 15 of 28

16. ONE LAST STEP FOR KATZ Katz  David F. Gleich (Sandia) ICME la/opt seminar 16 of 28

17. TOP-K ALGORITHM FOR KATZ Approximate where is sparse Keep sparse too Ideally, don’t “touch” all of For PageRank, we can do this! David F. Gleich (Sandia) ICME la/opt seminar 17 of 28

18. THE ALGORITHM - MCSHERRY For Start with the Richardson iteration Richardson converges if Rewrite Note is sparse If , then is sparse. Idea Only add one component of to McSherry WWW2005 David F. Gleich (Sandia) ICME la/opt seminar 18 of 28

19. THE ALGORITHM FOR KATZ For Init: Pick as max Storing the non-zeros of the residual in a heap makes picking the max log(n) time. See Anderson et al. FOCS2008 for more David F. Gleich (Sandia) ICME la/opt seminar 19 of 28

20. CONVERGENCE? The “standard” proof works for 1/max-degree. If you pick as the maximum element, we can show this is convergent if Richardson converges. This proof requires to be symmetric positive definite. David F. Gleich (Sandia) ICME la/opt seminar 20 of 28

21. RESULTS – DATA, PARAMETERS All unweighted, connected graphs Easy : Hard : David F. Gleich (Sandia) ICME la/opt seminar 21 of 28

22. KATZ BOUND CONVERGENCE David F. Gleich (Sandia) ICME la/opt seminar 22 of 28

23. COMMUTE BOUND CONVERG. David F. Gleich (Sandia) ICME la/opt seminar 23 of 28

24. KATZ SET CONVERGENCE For arXiv graph. David F. Gleich (Sandia) ICME la/opt seminar 24 of 28

25. TIMING David F. Gleich (Sandia) ICME la/opt seminar 25 of 28

26. CONCLUSIONS These algorithms are faster than many alternatives in these special cases. For pairwise commute, stopping criteria are simpler with bounds. For top-k problems, we often need less than 1 matvec for good enough results David F. Gleich (Sandia) ICME la/opt seminar 26 of 28

27. WARTS AND TODOS Stopping criteria on our top-k algorithm can be a bit hairy… we should refine it. The top-k approach doesn’t work right for commute time… we have an alternative. Take advantage of new research to “seed” commute time better! Evaluate more datasets, like Netflix. von Luxburg et al. NIPS2010 David F. Gleich (Sandia) ICME la/opt seminar 27 of 28

28. More details in the paper Slides should be online soon Code is online already stanford.edu/~dgleich/ publications/2010/codes/fast-katz By AngryDogDesign on DeviantArt David F. Gleich (Sandia) ICME la/opt seminar 28

29. LEO KATZ David F. Gleich (Sandia) ICME la/opt seminar 29 of 28

30. NOT QUITE, WIKIPEDIA : adjacency, : random walk PageRank Katz These are equivalent if has constant degree David F. Gleich (Sandia) ICME la/opt seminar 30 of 28

31. WHAT KATZ ACTUALLY SAID “we assume that each link independently has the same probability of being effective” … “we conceive a constant , depending on the group and the context of the particular investigation, which has the force of a probability of effectiveness of a single link. A k-step chain then, has probability of being effective.” “We wish to find the column sums of the matrix” Leo Katz 1953, A New Status Index Derived from Sociometric Analysis, Psychometria 18(1):39-43 David F. Gleich (Sandia) ICME la/opt seminar 31 of 28

32. RETURNING TO THE MATRIX Carl Neumann David F. Gleich (Sandia) ICME la/opt seminar 32 of 28

33. PROPERTIES OF KATZ’S MATRIX is symmetric exists when is sym. pos. def. when Note that 1/max-degree suffices David F. Gleich (Sandia) ICME la/opt seminar 33 of 28

34. SKIPPING DETAILS : graph Laplacian is the only null-vector David F. Gleich (Sandia) ICME la/opt seminar 34 of 28

35. Carl Neumann I’ve heard the Neumann series called the “von Neumann” series more than I’d like! In fact, the von Neumann kernel of a graph should be named the “Neumann” kernel! Wikipedia page David F. Gleich (Sandia) ICME la/opt seminar 35 / 28

36. LANCZOS , k-steps of the Lanczos method produce and = David F. Gleich (Sandia) ICME la/opt seminar 36 of 28

37. PRACTICAL LANCZOS Only need to store the last 2 vectors in Updating requires O(matvec) work is not orthogonal David F. Gleich (Sandia) ICME la/opt seminar 37 of 28

38. PRACTICAL MMQ Increase k to become more accurate Bad eigenvalue bounds yield worse results and are easy to compute not required, we can iteratively update it’s LU factorization David F. Gleich (Sandia) ICME la/opt seminar 38 of 28

39. THE ALGORITHM For Init: How to pick ? David F. Gleich (Sandia) ICME la/opt seminar 39 of 28

40. MAIN RESULTS – SLIDE ONE A – adjacency matrix L – Laplacian matrix Katz score : Commute time: David F. Gleich (Sandia) ICME la/opt seminar 40 of 28

41. TOP-K INSPIRATION: PAGERANK Approximate where is sparse Keep sparse too? YES! Ideally, don’t “touch” all of ? YES! McSherry WWW2005, Berkhin 2007, Anderson et al. FOCS2008 – Thanks to Reid Anderson for telling me McSherry did this too. David F. Gleich (Sandia) ICME la/opt seminar 41 of 28

42. THE ALGORITHM Note is sparse. If , then is sparse. Idea only add one component of to David F. Gleich (Sandia) ICME la/opt seminar 42 of 28

43. MAIN RESULTS – SLIDE TWO For Katz Compute one fast Compute top fast For Commute Compute one fast For almost commute Compute top fast David F. Gleich (Sandia) ICME la/opt seminar 43 of 28

44. MAIN RESULTS – SLIDE THREE David F. Gleich (Sandia) ICME la/opt seminar 44 of 28

45. F-MEASURE David F. Gleich (Sandia) ICME la/opt seminar 45 of 28

46. PAIRWISE RESULTS Katz upper and lower bounds Katz error convergence Commute-time upper and lower bounds Commute-time error convergence For the arXiv graph here David F. Gleich (Sandia) ICME la/opt seminar 46 of 28

Fast Katz and Commuters: Efficient Estimation of Social Relatedness in Large Networks

Recommandé

Recommandé

Contenu connexe

Plus de David Gleich

Plus de David Gleich (20)

Dernier

Dernier (20)

Fast Katz and Commuters: Efficient Estimation of Social Relatedness in Large Networks