Fast pair-wise and node-wise algorithms for commute times and Katz scores
1. FAST MATRIX COMPUTATIONS FOR
PAIRWISE AND COLUMN-WISE KATZ
SCORES AND COMMUTE TIMES
David F. Gleich
Purdue University
Emory University
Math/CS Seminar
January 20, 2012
With Pooya Esfandiar, Francesco Bonchi, Chen Greif,
Laks V. S. Lakshmanan, and Byung-Won On
David F. Gleich (Purdue) Emory Math/CS Seminar 1 / 47
2. MAIN RESULTS
A – adjacency matrix For Katz
L – Laplacian matrix Compute one fast
Compute top fast
Katz score :
Commute time:
For Commute
Compute one fast
Estimate
David F. Gleich (Purdue) Emory Math/CS Seminar 2 of 47
3. OUTLINE
Why study these measures?
Katz Rank and Commute Time
How else do people compute them?
Quadrature rules for pairwise scores
Sparse linear systems solves for top-k
As many results as we have time for…
David F. Gleich (Purdue) Emory Math/CS Seminar 3 of 47
4. WHY? LINK PREDICTION
Neighborhood based
Path based
Liben-Nowell and Kleinberg 2003, 2006 found that path based link prediction was more efficient
David F. Gleich (Purdue) Emory Math/CS Seminar 4 of 47
5. NOTE
All graphs are undirected
All graphs are connected
David F. Gleich (Purdue) Emory Math/CS Seminar 5 of 47
7. NOT QUITE, WIKIPEDIA
: adjacency, : random walk
PageRank
Katz
These are equivalent if has constant
degree
David F. Gleich (Purdue) Emory Math/CS Seminar 7 of 47
8. WHAT KATZ ACTUALLY SAID
“we assume that each link independently has the
same probability of being effective” …
“we conceive a constant , depending
on the group and the context of the particular
investigation, which has the force of a probability
of effectiveness of a single link. A k-step chain
then, has probability of being effective.”
“We wish to find the column sums of the matrix”
Leo Katz 1953, A New Status Index Derived from Sociometric Analysis, Psychometria 18(1):39-43
David F. Gleich (Purdue) Emory Math/CS Seminar 8 of 47
9. A MODERN TAKE
The Katz score (node-based) is
The Katz score (edge-based) is
David F. Gleich (Purdue) Emory Math/CS Seminar 9 of 47
10. RETURNING TO THE MATRIX
Carl Neumann
David F. Gleich (Purdue) Emory Math/CS Seminar 10 of 47
11. Carl Neumann
I’ve heard the Neumann series called the “von Neumann”
series more than I’d like! In fact, the von Neumann kernel
of a graph should be named the “Neumann” kernel!
Wikipedia page
David F. Gleich (Purdue) Emory Math/CS Seminar 11 / 47
12. PROPERTIES OF KATZ’S MATRIX
is symmetric
exists when
is spd when
Note that 1/max-degree suffices
David F. Gleich (Purdue) Emory Math/CS Seminar 12 of 47
13. COMMUTE TIME
Picture taken from Google images, seems to be
Bay Bridge Traffic by Jim M. Goldstein.
David F. Gleich (Purdue) Emory Math/CS Seminar 13 / 47
14. COMMUTE TIME
Consider a uniform random walk on a
graph
Also called the hitting
time from node i to j, or
the first transition time
David F. Gleich (Purdue) Emory Math/CS Seminar 14 of 47
15. SKIPPING DETAILS
: graph Laplacian
is the only null-vector
David F. Gleich (Purdue) Emory Math/CS Seminar 15 of 47
16. WHAT DO OTHER PEOPLE DO?
1) Just work with the linear algebra formulations
2) For Katz, Truncate the Neumann series as a few (3-5)
terms, or MMQ (Benzi & Boito)
3) Use low-rank approximations from EVD(A) or EVD(L)
4) For commute, use Johnson-Lindenstrauss inspired
random sampling
5) Approximately decompose into smaller problems
Liben-Nowell and Kleinberg CIKM2003, Acar et al. ICDM2009,
Spielman and Srivastava STOC2008, Sarkar and Moore UAI2007, Wang et al. ICDM2007
David F. Gleich (Purdue) Emory Math/CS Seminar 16 of 47
17. THE PROBLEM
All of these techniques are
preprocessing based because
most people’s goal is to compute
all the scores.
We want to avoid
preprocessing the graph.
There are a few caveats here! i.e. one could solve the system instead of looking for the matrix inverse
David F. Gleich (Purdue) Emory Math/CS Seminar 17 of 47
18. WHY NO PREPROCESSING?
The graph is constantly changing
as I rate new movies.
David F. Gleich (Purdue) Emory Math/CS Seminar 18 of 47
19. WHY NO PREPROCESSING?
Top-k predicted “links”
are movies to watch!
Pairwise scores give
user similarity
David F. Gleich (Purdue) Emory Math/CS Seminar 19 of 47
21. PAIRWISE ALGORITHMS
Katz
Commute
Golub and Meurant
to the rescue!
David F. Gleich (Purdue) Emory Math/CS Seminar 21 of 47
22. MMQ - THE BIG IDEA
Quadratic form Think
Weighted sum A is s.p.d. use EVD
Stieltjes integral “A tautology”
Quadrature approximation
Matrix equation Lanczos
David F. Gleich (Purdue) Emory Math/CS Seminar 22 of 47
23. LANCZOS
, k-steps of the Lanczos method
produce
and
=
David F. Gleich (Purdue) Emory Math/CS Seminar 23 of 47
24. PRACTICAL LANCZOS
Only need to store the last 2 vectors in
Updating requires O(matvec) work
is not orthogonal
David F. Gleich (Purdue) Emory Math/CS Seminar 24 of 47
25. MMQ PROCEDURE
Goal
Given
1. Run k-steps of Lanczos on starting with
2. Compute , with an additional eigenvalue at ,
set Correspond to a Gauss-Radau rule, with
u as a prescribed node
3. Compute , with an additional eigenvalue at , set
Correspond to a Gauss-Radau rule, with
l as a prescribed node
4. Output as lower and upper bounds on
David F. Gleich (Purdue) Emory Math/CS Seminar 25 of 47
26. PRACTICAL MMQ
Increase k to become more accurate
Bad eigenvalue bounds yield worse results
and are easy to compute
not required, we can iteratively
update it’s LU factorization
David F. Gleich (Purdue) Emory Math/CS Seminar 26 of 47
28. ONE LAST STEP FOR KATZ
Katz
David F. Gleich (Purdue) Emory Math/CS Seminar 28 of 47
29. COLUMN-WISE
ALGORITHMS
David F. Gleich (Purdue) Emory Math/CS Seminar 29 / 47
30. COLUMN-WISE COMMUTE-TIME
requires entire diagonal of
=
Each vector is
computed by the a
Lanczos based CG algorithm.
Paige and Saunders, 1975
David F. Gleich (Purdue) Emory Math/CS Seminar 30 of 47
31. COLUMN-WISE COMMUTE TIME
following
Hein et al. 2010 is a
MUCH better and faster
approximation.
David F. Gleich (Purdue) Emory Math/CS Seminar 31
32. KATZ SCORES ARE LOCALIZED
Up to 50 neighbors is
99.65% of the total
mass
David F. Gleich (Purdue) Emory Math/CS Seminar 32 of 47
34. TOP-K ALGORITHM FOR KATZ
Approximate
where is sparse
Keep sparse too
Ideally, don’t “touch” all of
David F. Gleich (Purdue) Emory Math/CS Seminar 34 of 47
35. INSPIRATION - PAGERANK
Approximate
where is sparse
Keep sparse too? YES!
Ideally, don’t “touch” all of ? YES!
McSherry WWW2005, Berkhin 2007, Anderson et al. FOCS2008 – Thanks to Reid Anderson for telling me McSherry did this too.
David F. Gleich (Purdue) Emory Math/CS Seminar 35 of 47
36. THE ALGORITHM - MCSHERRY
For
Start with the Richardson iteration
Rewrite
Richardson converges if
David F. Gleich (Purdue) Emory Math/CS Seminar 36 of 47
37. THE ALGORITHM
Note is sparse.
If , then is sparse.
Idea
only add one component of to
David F. Gleich (Purdue) Emory Math/CS Seminar 37 of 47
38. THE ALGORITHM
For
Init:
How to pick ?
David F. Gleich (Purdue) Emory Math/CS Seminar 38 of 47
39. THE ALGORITHM FOR KATZ
For
Init:
Pick as max
Storing the non-zeros of the residual in a heap makes picking the max log(n) time. See Anderson et al. FOCS2008 for more
David F. Gleich (Purdue) Emory Math/CS Seminar 39 of 47
40. CONVERGENCE?
If 1/max-degree then is sub-
stochastic and the PageRank based proof
applies because the matrix is diagonally
dominant
For , then for symmetric , this
algorithm is the Gauss-Southwell
procedure and it still converges.
David F. Gleich (Purdue) Emory Math/CS Seminar 40 of 47
41. NONSYMMETRIC ALGORITHM?
Was this algorithm known in the non-
symmetric case?
YES! It’s just Gauss-Seidel/Gauss-
Southwell, but with a column vs. row
orientation.
Called dynamic residual to AMGers
David F. Gleich (Purdue) Emory Math/CS Seminar 41 of 47
42. RESULTS – DATA, PARAMETERS
All unweighted, connected graphs
Easy
Hard
David F. Gleich (Purdue) Emory Math/CS Seminar 42 of 47
47. CONCLUSIONS
These algorithms are faster than many
alternatives.
For pairwise commute, stopping criteria
are simpler with bounds
For top-k problems, we often need less
than 1 matvec for good enough results
David F. Gleich (Purdue) Emory Math/CS Seminar 47 of 47
48. Paper at WAW2010 and J. Internet Mathematics
Slides online
Code is online already
www.cs.purdue.edu/
homes/dgleich/codes/fast-katz-2011
By AngryDogDesign on DeviantArt
David F. Gleich (Purdue) Emory Math/CS Seminar 48