Streamlining Python Development: A Guide to a Modern Project Setup
Fast Katz and Commuters: Efficient Estimation of Social Relatedness in Large Networks
1. FAST KATZ AND COMMUTERS
Efficient Estimation of Social Relatedness
David F. Gleich
Sandia National Labs
WAW2010
16 December 2010
Palo Alto, CA
With Pooya Esfandiar, Francesco Bonchi, Chen Grief,
Laks V. S. Lakshmanan, and Byung-Won On
Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin
Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.
David F. Gleich (Sandia) ICME la/opt seminar 1 / 28
2. MAIN RESULTS
For Katz
Compute one fast
A – adjacency matrix Compute top fast
L – Laplacian matrix
Katz score :
Commute time:
For Commute
Compute one fast
David F. Gleich (Sandia) ICME la/opt seminar 2 of 28
3. OUTLINE
Katz Rank and Commute Time, then Why
Matrices, moments, and quadrature rules
for pairwise scores
Sparse linear systems solves for top-k
Some results
David F. Gleich (Sandia) ICME la/opt seminar 3 of 28
4. NOTE – EVERYTHING IS SIMPLE
All graphs are undirected
All graphs are connected
All graphs are unweighted
David F. Gleich (Sandia) ICME la/opt seminar 4 of 28
5. KATZ SCORES
The Katz score is
Carl Neumann
David F. Gleich (Sandia) ICME la/opt seminar 5 of 28
6. COMMUTE TIME
Consider a uniform random walk on a
graph
Also called the hitting
time from node i to j, or
the first transition time
: graph Laplacian
is the only null-vector
Fouss et al.TKDE 2007
David F. Gleich (Sandia) ICME la/opt seminar 6 of 28
7. WHAT DO OTHER PEOPLE DO?
1) Just work with the linear algebra formulations
2) For Katz, truncate the Neumann series as a few (3-5)
terms
3) Use low-rank approximations from EVD(A) or EVD(L)
4) For commute, use Johnson-Lindenstrauss inspired
random sampling
5) Approximately decompose into smaller problems
Liben-Nowell and Kleinberg CIKM2003, Acar et al. ICDM2009,
Spielman and Srivastava STOC2008, Sarkar and Moore UAI2007,Wang et al. ICDM2007
David F. Gleich (Sandia) ICME la/opt seminar 7 of 28
8. THE PROBLEM
All of these techniques are
preprocessing based because
most people’s goal is to compute
all the scores.
We want to avoid
preprocessing the graph.
There are a few caveats here! i.e. one could solve the system instead of looking for the matrix inverse
David F. Gleich (Sandia) ICME la/opt seminar 8 of 28
9. WHY? LINK PREDICTION
Neighborhood based
Path based
Liben-Nowell and Kleinberg 2003, 2006 found that path based link prediction was more efficient
David F. Gleich (Sandia) ICME la/opt seminar 9 of 28
10. WHY NO PREPROCESSING?
The graph is constantly changing
as I rate new movies.
David F. Gleich (Sandia) ICME la/opt seminar 10 of 28
11. WHY NO PREPROCESSING?
Top-k predicted “links”
are movies to watch!
Pairwise scores give
user similarity
David F. Gleich (Sandia) ICME la/opt seminar 11 of 28
12. PAIRWISE ALGORITHMS
Katz
Commute
Golub and Meurant
to the rescue!
David F. Gleich (Sandia) ICME la/opt seminar 12 of 28
13. MMQ - THE BIG IDEA
Quadratic form Think
Weighted sum A is s.p.d. use EVD
Stieltjes integral “A tautology”
Quadrature approximation
Matrix equation Lanczos
David F. Gleich (Sandia) ICME la/opt seminar 13 of 28
14. MMQ PROCEDURE SKETCH
Bad bounds give worse
Goal performance!
Given
Larger k gives better results;
k steps is k matvecs
1. Run k-steps of Lanczos on starting with
2. Compute , with an additional eigenvalue at ,
set Correspond to a Gauss-Radau rule, with
u as a prescribed node
Easy to “update” this inverse as k increases.
3. Compute , with an additional eigenvalue at , set
Correspond to a Gauss-Radau rule, with
l as a prescribed node
4. Output as lower and upper bounds on b
David F. Gleich (Sandia) ICME la/opt seminar 14 of 28
16. ONE LAST STEP FOR KATZ
Katz
David F. Gleich (Sandia) ICME la/opt seminar 16 of 28
17. TOP-K ALGORITHM FOR KATZ
Approximate
where is sparse
Keep sparse too
Ideally, don’t “touch” all of
For PageRank, we can do this!
David F. Gleich (Sandia) ICME la/opt seminar 17 of 28
18. THE ALGORITHM - MCSHERRY
For
Start with the Richardson iteration
Richardson converges if
Rewrite
Note is sparse
If , then is sparse.
Idea Only add one component of to
McSherry WWW2005
David F. Gleich (Sandia) ICME la/opt seminar 18 of 28
19. THE ALGORITHM FOR KATZ
For
Init:
Pick as max
Storing the non-zeros of the residual in a heap makes picking the max log(n) time. See Anderson et al. FOCS2008 for more
David F. Gleich (Sandia) ICME la/opt seminar 19 of 28
20. CONVERGENCE?
The “standard” proof works for
1/max-degree.
If you pick as the maximum element, we
can show this is convergent if Richardson
converges. This proof requires to be
symmetric positive definite.
David F. Gleich (Sandia) ICME la/opt seminar 20 of 28
21. RESULTS – DATA, PARAMETERS
All unweighted, connected graphs
Easy :
Hard :
David F. Gleich (Sandia) ICME la/opt seminar 21 of 28
26. CONCLUSIONS
These algorithms are faster than many
alternatives in these special cases.
For pairwise commute, stopping criteria
are simpler with bounds.
For top-k problems, we often need less
than 1 matvec for good enough results
David F. Gleich (Sandia) ICME la/opt seminar 26 of 28
27. WARTS AND TODOS
Stopping criteria on our top-k algorithm
can be a bit hairy… we should refine it.
The top-k approach doesn’t work right for
commute time… we have an alternative.
Take advantage of new research to “seed”
commute time better!
Evaluate more datasets, like Netflix.
von Luxburg et al. NIPS2010
David F. Gleich (Sandia) ICME la/opt seminar 27 of 28
28. More details in the paper
Slides should be online soon
Code is online already
stanford.edu/~dgleich/
publications/2010/codes/fast-katz
By AngryDogDesign on DeviantArt
David F. Gleich (Sandia) ICME la/opt seminar 28
30. NOT QUITE, WIKIPEDIA
: adjacency, : random walk
PageRank
Katz
These are equivalent if has constant
degree
David F. Gleich (Sandia) ICME la/opt seminar 30 of 28
31. WHAT KATZ ACTUALLY SAID
“we assume that each link independently has the
same probability of being effective” …
“we conceive a constant , depending
on the group and the context of the particular
investigation, which has the force of a probability
of effectiveness of a single link. A k-step chain
then, has probability of being effective.”
“We wish to find the column sums of the matrix”
Leo Katz 1953, A New Status Index Derived from Sociometric Analysis, Psychometria 18(1):39-43
David F. Gleich (Sandia) ICME la/opt seminar 31 of 28
32. RETURNING TO THE MATRIX
Carl Neumann
David F. Gleich (Sandia) ICME la/opt seminar 32 of 28
33. PROPERTIES OF KATZ’S MATRIX
is symmetric
exists when
is sym. pos. def. when
Note that 1/max-degree suffices
David F. Gleich (Sandia) ICME la/opt seminar 33 of 28
34. SKIPPING DETAILS
: graph Laplacian
is the only null-vector
David F. Gleich (Sandia) ICME la/opt seminar 34 of 28
35. Carl Neumann
I’ve heard the Neumann series called the “von Neumann”
series more than I’d like! In fact, the von Neumann kernel
of a graph should be named the “Neumann” kernel!
Wikipedia page
David F. Gleich (Sandia) ICME la/opt seminar 35 / 28
36. LANCZOS
, k-steps of the Lanczos method
produce
and
=
David F. Gleich (Sandia) ICME la/opt seminar 36 of 28
37. PRACTICAL LANCZOS
Only need to store the last 2 vectors in
Updating requires O(matvec) work
is not orthogonal
David F. Gleich (Sandia) ICME la/opt seminar 37 of 28
38. PRACTICAL MMQ
Increase k to become more accurate
Bad eigenvalue bounds yield worse results
and are easy to compute
not required, we can iteratively
update it’s LU factorization
David F. Gleich (Sandia) ICME la/opt seminar 38 of 28
39. THE ALGORITHM
For
Init:
How to pick ?
David F. Gleich (Sandia) ICME la/opt seminar 39 of 28
40. MAIN RESULTS – SLIDE ONE
A – adjacency matrix
L – Laplacian matrix
Katz score :
Commute time:
David F. Gleich (Sandia) ICME la/opt seminar 40 of 28
41. TOP-K INSPIRATION: PAGERANK
Approximate
where is sparse
Keep sparse too? YES!
Ideally, don’t “touch” all of ? YES!
McSherry WWW2005, Berkhin 2007, Anderson et al. FOCS2008 – Thanks to Reid Anderson for telling me McSherry did this too.
David F. Gleich (Sandia) ICME la/opt seminar 41 of 28
42. THE ALGORITHM
Note is sparse.
If , then is sparse.
Idea
only add one component of to
David F. Gleich (Sandia) ICME la/opt seminar 42 of 28
43. MAIN RESULTS – SLIDE TWO
For Katz Compute one fast
Compute top fast
For Commute
Compute one fast
For almost commute
Compute top fast
David F. Gleich (Sandia) ICME la/opt seminar 43 of 28
44. MAIN RESULTS – SLIDE THREE
David F. Gleich (Sandia) ICME la/opt seminar 44 of 28
46. PAIRWISE RESULTS
Katz upper and lower bounds
Katz error convergence
Commute-time upper and lower bounds
Commute-time error convergence
For the arXiv graph here
David F. Gleich (Sandia) ICME la/opt seminar 46 of 28