SlideShare une entreprise Scribd logo
1  sur  53
Télécharger pour lire hors ligne
FAST KATZ AND COMMUTERS

                                                       Quadrature Rules and Sparse Linear Solvers
                                                                    for Link Prediction Heuristics

                                                                                                              David F. Gleich
                                                                                                        Sandia National Labs

                                                                                           Berkeley LAPACK Seminar
                                                                                                     February 3, 2011

                                      With Pooya Esfandiar, Francesco Bonchi, Chen Grief,
                                               Laks V. S. Lakshmanan, and Byung-Won On
Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin
                      Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.
David F. Gleich (Sandia)                                   Berkeley SCMC Seminar                                                         1 / 43
MAIN RESULTS
A – adjacency matrix                    For Katz
L – Laplacian matrix                    Compute one       fast
                                        Compute top       fast
Katz score :
                         
Commute time:
                  

For Commute
Compute one       fast



David F. Gleich (Sandia)   Berkeley SCMC Seminar                 2 of 43
OUTLINE
Why study these measures?
Katz Rank and Commute Time
How else do people compute them?
Quadrature rules for pairwise scores
Sparse linear systems solves for top-k
As many results as we have time for…

David F. Gleich (Sandia)   Berkeley SCMC Seminar   3 of 43
WHY? LINK PREDICTION
                                                                                 Neighborhood based




                                                                                         Path based


                           Liben-Nowell and Kleinberg 2003, 2006 found that path based link prediction was more efficient
David F. Gleich (Sandia)                         Berkeley SCMC Seminar                                            4 of 43
NOTE


All graphs are undirected

All graphs are connected




David F. Gleich (Sandia)   Berkeley SCMC Seminar   5 of 43
LEO KATZ




David F. Gleich (Sandia)   Berkeley SCMC Seminar   6 of 43
NOT QUITE, WIKIPEDIA
    : adjacency,                     : random walk

                 PageRank                      

                           Katz                

These are equivalent if     has constant
degree

David F. Gleich (Sandia)          Berkeley SCMC Seminar   7 of 43
WHAT KATZ ACTUALLY SAID
                                           “we assume that each link independently has the
                                           same probability of being effective” …

                                           “we conceive a constant     , depending
                                           on the group and the context of the particular
                                           investigation, which has the force of a probability
                                           of effectiveness of a single link. A k-step chain
                                           then, has probability       of being effective.”

                                           “We wish to find the column sums of the matrix”




                           Leo Katz 1953, A New Status Index Derived from Sociometric Analysis, Psychometria 18(1):39-43
David F. Gleich (Sandia)                         Berkeley SCMC Seminar                                           8 of 43
A MODERN TAKE
The Katz score (node-based) is

                                     
The Katz score (edge-based) is

                                     


David F. Gleich (Sandia)     Berkeley SCMC Seminar   9 of 43
RETURNING TO THE MATRIX
                                           

                                                      



                Carl Neumann                       
                                                 
David F. Gleich (Sandia)             Berkeley SCMC Seminar   10 of 43
Carl Neumann
                           I’ve heard the Neumann series called the “von Neumann”
                           series more than I’d like! In fact, the von Neumann kernel
                               of a graph should be named the “Neumann” kernel!


                                                                                        Wikipedia page
David F. Gleich (Sandia)                     Berkeley SCMC Seminar                              11 / 43
PROPERTIES OF KATZ’S MATRIX
    is symmetric

    exists when                      

                is spd when                      

Note that         1/max-degree suffices


David F. Gleich (Sandia)   Berkeley SCMC Seminar    12 of 43
COMMUTE TIME
                           Picture taken from Google images, seems to be
                                Bay Bridge Traffic by Jim M. Goldstein.


David F. Gleich (Sandia)               Berkeley SCMC Seminar               13 / 43
COMMUTE TIME
Consider a uniform random walk on a
graph                    
                                                   Also called the hitting
                                                   time from node i to j, or
                                                   the first transition time
                            
                                 




David F. Gleich (Sandia)   Berkeley SCMC Seminar                      14 of 43
SKIPPING DETAILS

                    : graph Laplacian

                               

              is the only null-vector



David F. Gleich (Sandia)   Berkeley SCMC Seminar   15 of 43
WHAT DO OTHER PEOPLE DO?
1) Just work with the linear algebra formulations

2) For Katz, Truncate the Neumann series as a few (3-5)
   terms

3) Use low-rank approximations from EVD(A) or EVD(L)

4) For commute, use Johnson-Lindenstrauss inspired
   random sampling

5) Approximately decompose into smaller problems

                                                    Liben-Nowell and Kleinberg CIKM2003, Acar et al. ICDM2009,
                           Spielman and Srivastava STOC2008, Sarkar and Moore UAI2007,Wang et al. ICDM2007
David F. Gleich (Sandia)                 Berkeley SCMC Seminar                                        16 of 43
THE PROBLEM

                             All of these techniques are
                            preprocessing based because
                           most people’s goal is to compute
                                     all the scores.

                                   We want to avoid
                                preprocessing the graph.

                            There are a few caveats here! i.e. one could solve the system instead of looking for the matrix inverse
David F. Gleich (Sandia)                               Berkeley SCMC Seminar                                               17 of 43
WHY NO PREPROCESSING?




                           The graph is constantly changing
                                as I rate new movies.
David F. Gleich (Sandia)              Berkeley SCMC Seminar   18 of 43
WHY NO PREPROCESSING?

                                                      Top-k predicted “links”
                                                      are movies to watch!




                           Pairwise scores give
                                  user similarity



David F. Gleich (Sandia)      Berkeley SCMC Seminar                             19 of 43
PAIRWISE ALGORITHMS
                Katz                  

Commute                                        

                              Golub and Meurant
                                to the rescue!




David F. Gleich (Sandia)        Berkeley SCMC Seminar   20 of 43
MMQ - THE BIG IDEA
Quadratic form                                          Think                    
       


Weighted sum                                            A is s.p.d. use EVD

       


Stieltjes integral                                      “A tautology”

       


Quadrature approximation                                     
        

Matrix equation                                         Lanczos
David F. Gleich (Sandia)    Berkeley SCMC Seminar                          21 of 43
LANCZOS
        , k-steps of the Lanczos method
produce

                                    and                      


                                         =           
                                                                 




David F. Gleich (Sandia)          Berkeley SCMC Seminar             22 of 43
PRACTICAL LANCZOS

Only need to store the last 2 vectors in      

Updating requires O(matvec) work

      is not orthogonal



David F. Gleich (Sandia)   Berkeley SCMC Seminar   23 of 43
MMQ PROCEDURE
Goal                        
Given                        

1. Run k-steps of Lanczos on     starting with    
2. Compute       ,     with an additional eigenvalue at     ,
        set                                    Correspond to a Gauss-Radau rule, with
                                               u as a prescribed node
3. Compute     ,     with an additional eigenvalue at   , set
                                                Correspond to a Gauss-Radau rule, with
                                                l as a prescribed node
4. Output               as lower and upper bounds on    



David F. Gleich (Sandia)        Berkeley SCMC Seminar                                    24 of 43
PRACTICAL MMQ
Increase k to become more accurate

Bad eigenvalue bounds yield worse results

      and     are easy to compute


    not required, we can iteratively
update it’s LU factorization
David F. Gleich (Sandia)   Berkeley SCMC Seminar   25 of 43
PRACTICAL MMQ




David F. Gleich (Sandia)   Berkeley SCMC Seminar   26 of 43
ONE LAST STEP FOR KATZ
                      Katz              

                                    

                                  
                                      



David F. Gleich (Sandia)         Berkeley SCMC Seminar   27 of 43
KATZ SCORES ARE LOCALIZED

                                                   Up to 50 neighbors is
                                                   99.65% of the total
                                                   mass




David F. Gleich (Sandia)   Berkeley SCMC Seminar                       28 of 43
TOP-K ALGORITHM FOR KATZ

Approximate    
                             
where     is sparse

Keep     sparse too
Ideally, don’t “touch” all of    


David F. Gleich (Sandia)   Berkeley SCMC Seminar   29 of 43
INSPIRATION - PAGERANK

 Approximate    
                              
 where     is sparse

 Keep     sparse too? YES!
 Ideally, don’t “touch” all of     ? YES!

McSherry WWW2005, Berkhin 2007, Anderson et al. FOCS2008 – Thanks to Reid Anderson for telling me McSherry did this too.
 David F. Gleich (Sandia)                         Berkeley SCMC Seminar                                         30 of 43
THE ALGORITHM - MCSHERRY
For              
Start with the Richardson iteration
                                      
Rewrite
                                     

Richardson converges if                        


David F. Gleich (Sandia)   Berkeley SCMC Seminar   31 of 43
THE ALGORITHM
Note     is sparse.

If                 , then         is sparse.

Idea
  only add one component of         to        



David F. Gleich (Sandia)   Berkeley SCMC Seminar   32 of 43
THE ALGORITHM
For              

Init:                                
                
                

How to pick   ?

David F. Gleich (Sandia)       Berkeley SCMC Seminar   33 of 43
THE ALGORITHM FOR KATZ
For                            

Init:                                  
               
                      

Pick   as max                               
   Storing the non-zeros of the residual in a heap makes picking the max log(n) time. See Anderson et al. FOCS2008 for more
David F. Gleich (Sandia)                           Berkeley SCMC Seminar                                           34 of 43
CONVERGENCE?
The “standard” proof works for
        1/max-degree.

For                       , then for symmetric     ,
this algorithm is the Gauss-Southwell
procedure and it still converges.




David F. Gleich (Sandia)   Berkeley SCMC Seminar       35 of 43
RESULTS – DATA, PARAMETERS




               All unweighted, connected graphs
                    Easy                                    
                      Hard                            
David F. Gleich (Sandia)        Berkeley SCMC Seminar          36 of 43
KATZ BOUND CONVERGENCE




David F. Gleich (Sandia)   Berkeley SCMC Seminar   37 of 43
COMMUTE BOUND CONVERG.




David F. Gleich (Sandia)   Berkeley SCMC Seminar   38 of 43
KATZ SET CONVERGENCE




                                                   For arXiv graph.
David F. Gleich (Sandia)   Berkeley SCMC Seminar           39 of 43
TIMING




David F. Gleich (Sandia)   Berkeley SCMC Seminar   40 of 43
CONCLUSIONS
These algorithms are faster than many
alternatives.

For pairwise commute, stopping criteria
are simpler with bounds

For top-k problems, we often need less
than 1 matvec for good enough results

David F. Gleich (Sandia)   Berkeley SCMC Seminar   41 of 43
WARTS AND TODOS
Stopping criteria on our top-k algorithm
can be a bit hairy… we should refine it.
The top-k approach doesn’t work right for
commute time… we have an alternative.
                                           
                 requires the entire diagonal of      

Take advantage of new research to “seed”
commute time better!         von Luxburg et al. NIPS2010
David F. Gleich (Sandia)       Berkeley SCMC Seminar   42 of 43
Paper at WAW2010

                                                     Slides should be online soon

                                                     Code is online already
                                                        stanford.edu/~dgleich/
                                                        publications/2010/codes/fast-katz




        By AngryDogDesign on DeviantArt



David F. Gleich (Sandia)                  Berkeley SCMC Seminar                             43
F-MEASURE




David F. Gleich (Sandia)   Berkeley SCMC Seminar   44 of 43
MAIN RESULTS – SLIDE ONE
A – adjacency matrix
L – Laplacian matrix
Katz score :
                             
Commute time:
                                


David F. Gleich (Sandia)   Berkeley SCMC Seminar   45 of 43
MAIN RESULTS – SLIDE TWO
For Katz Compute one       fast
         Compute top       fast

For Commute
       Compute one       fast

For almost commute
             Compute top $F_{i,:}$ fast


David F. Gleich (Sandia)   Berkeley SCMC Seminar   46 of 43
MAIN RESULTS – SLIDE THREE




David F. Gleich (Sandia)   Berkeley SCMC Seminar   47 of 43
KATZ BOUND CONVERGENCE




David F. Gleich (Sandia)   Berkeley SCMC Seminar   48 of 43
KATZ ERROR CONVERGENCE




David F. Gleich (Sandia)   Berkeley SCMC Seminar   49 of 43
COMMUTE BOUND CONVERG.




David F. Gleich (Sandia)   Berkeley SCMC Seminar   50 of 43
COMMUTE ERROR CONVERG.




David F. Gleich (Sandia)   Berkeley SCMC Seminar   51 of 43
KATZ SET CONVERGENCE




David F. Gleich (Sandia)   Berkeley SCMC Seminar   52 of 43
KATZ ORDER CONVERGENCE




David F. Gleich (Sandia)   Berkeley SCMC Seminar   53 of 43

Contenu connexe

Similaire à Fast Katz and Commuters

Fast matrix computations for pair-wise and column-wise Katz scores and commut...
Fast matrix computations for pair-wise and column-wise Katz scores and commut...Fast matrix computations for pair-wise and column-wise Katz scores and commut...
Fast matrix computations for pair-wise and column-wise Katz scores and commut...David Gleich
 
Rank aggregation via skew-symmetric matrix completion
Rank aggregation via skew-symmetric matrix completionRank aggregation via skew-symmetric matrix completion
Rank aggregation via skew-symmetric matrix completionDavid Gleich
 
Fast katz-presentation
Fast katz-presentationFast katz-presentation
Fast katz-presentationDavid Gleich
 
Skew-symmetric matrix completion for rank aggregation
Skew-symmetric matrix completion for rank aggregationSkew-symmetric matrix completion for rank aggregation
Skew-symmetric matrix completion for rank aggregationDavid Gleich
 
Simulation Informatics; Analyzing Large Scientific Datasets
Simulation Informatics; Analyzing Large Scientific DatasetsSimulation Informatics; Analyzing Large Scientific Datasets
Simulation Informatics; Analyzing Large Scientific DatasetsDavid Gleich
 
The spectre of the spectrum
The spectre of the spectrumThe spectre of the spectrum
The spectre of the spectrumDavid Gleich
 

Similaire à Fast Katz and Commuters (6)

Fast matrix computations for pair-wise and column-wise Katz scores and commut...
Fast matrix computations for pair-wise and column-wise Katz scores and commut...Fast matrix computations for pair-wise and column-wise Katz scores and commut...
Fast matrix computations for pair-wise and column-wise Katz scores and commut...
 
Rank aggregation via skew-symmetric matrix completion
Rank aggregation via skew-symmetric matrix completionRank aggregation via skew-symmetric matrix completion
Rank aggregation via skew-symmetric matrix completion
 
Fast katz-presentation
Fast katz-presentationFast katz-presentation
Fast katz-presentation
 
Skew-symmetric matrix completion for rank aggregation
Skew-symmetric matrix completion for rank aggregationSkew-symmetric matrix completion for rank aggregation
Skew-symmetric matrix completion for rank aggregation
 
Simulation Informatics; Analyzing Large Scientific Datasets
Simulation Informatics; Analyzing Large Scientific DatasetsSimulation Informatics; Analyzing Large Scientific Datasets
Simulation Informatics; Analyzing Large Scientific Datasets
 
The spectre of the spectrum
The spectre of the spectrumThe spectre of the spectrum
The spectre of the spectrum
 

Plus de David Gleich

Engineering Data Science Objectives for Social Network Analysis
Engineering Data Science Objectives for Social Network AnalysisEngineering Data Science Objectives for Social Network Analysis
Engineering Data Science Objectives for Social Network AnalysisDavid Gleich
 
Correlation clustering and community detection in graphs and networks
Correlation clustering and community detection in graphs and networksCorrelation clustering and community detection in graphs and networks
Correlation clustering and community detection in graphs and networksDavid Gleich
 
Spectral clustering with motifs and higher-order structures
Spectral clustering with motifs and higher-order structuresSpectral clustering with motifs and higher-order structures
Spectral clustering with motifs and higher-order structuresDavid Gleich
 
Higher-order organization of complex networks
Higher-order organization of complex networksHigher-order organization of complex networks
Higher-order organization of complex networksDavid Gleich
 
Spacey random walks and higher-order data analysis
Spacey random walks and higher-order data analysisSpacey random walks and higher-order data analysis
Spacey random walks and higher-order data analysisDavid Gleich
 
Non-exhaustive, Overlapping K-means
Non-exhaustive, Overlapping K-meansNon-exhaustive, Overlapping K-means
Non-exhaustive, Overlapping K-meansDavid Gleich
 
Using Local Spectral Methods to Robustify Graph-Based Learning
Using Local Spectral Methods to Robustify Graph-Based LearningUsing Local Spectral Methods to Robustify Graph-Based Learning
Using Local Spectral Methods to Robustify Graph-Based LearningDavid Gleich
 
Spacey random walks and higher order Markov chains
Spacey random walks and higher order Markov chainsSpacey random walks and higher order Markov chains
Spacey random walks and higher order Markov chainsDavid Gleich
 
Localized methods in graph mining
Localized methods in graph miningLocalized methods in graph mining
Localized methods in graph miningDavid Gleich
 
PageRank Centrality of dynamic graph structures
PageRank Centrality of dynamic graph structuresPageRank Centrality of dynamic graph structures
PageRank Centrality of dynamic graph structuresDavid Gleich
 
Iterative methods with special structures
Iterative methods with special structuresIterative methods with special structures
Iterative methods with special structuresDavid Gleich
 
Big data matrix factorizations and Overlapping community detection in graphs
Big data matrix factorizations and Overlapping community detection in graphsBig data matrix factorizations and Overlapping community detection in graphs
Big data matrix factorizations and Overlapping community detection in graphsDavid Gleich
 
Anti-differentiating approximation algorithms: A case study with min-cuts, sp...
Anti-differentiating approximation algorithms: A case study with min-cuts, sp...Anti-differentiating approximation algorithms: A case study with min-cuts, sp...
Anti-differentiating approximation algorithms: A case study with min-cuts, sp...David Gleich
 
Localized methods for diffusions in large graphs
Localized methods for diffusions in large graphsLocalized methods for diffusions in large graphs
Localized methods for diffusions in large graphsDavid Gleich
 
Anti-differentiating Approximation Algorithms: PageRank and MinCut
Anti-differentiating Approximation Algorithms: PageRank and MinCutAnti-differentiating Approximation Algorithms: PageRank and MinCut
Anti-differentiating Approximation Algorithms: PageRank and MinCutDavid Gleich
 
Fast relaxation methods for the matrix exponential
Fast relaxation methods for the matrix exponential Fast relaxation methods for the matrix exponential
Fast relaxation methods for the matrix exponential David Gleich
 
Gaps between the theory and practice of large-scale matrix-based network comp...
Gaps between the theory and practice of large-scale matrix-based network comp...Gaps between the theory and practice of large-scale matrix-based network comp...
Gaps between the theory and practice of large-scale matrix-based network comp...David Gleich
 
MapReduce Tall-and-skinny QR and applications
MapReduce Tall-and-skinny QR and applicationsMapReduce Tall-and-skinny QR and applications
MapReduce Tall-and-skinny QR and applicationsDavid Gleich
 
Recommendation and graph algorithms in Hadoop and SQL
Recommendation and graph algorithms in Hadoop and SQLRecommendation and graph algorithms in Hadoop and SQL
Recommendation and graph algorithms in Hadoop and SQLDavid Gleich
 
Personalized PageRank based community detection
Personalized PageRank based community detectionPersonalized PageRank based community detection
Personalized PageRank based community detectionDavid Gleich
 

Plus de David Gleich (20)

Engineering Data Science Objectives for Social Network Analysis
Engineering Data Science Objectives for Social Network AnalysisEngineering Data Science Objectives for Social Network Analysis
Engineering Data Science Objectives for Social Network Analysis
 
Correlation clustering and community detection in graphs and networks
Correlation clustering and community detection in graphs and networksCorrelation clustering and community detection in graphs and networks
Correlation clustering and community detection in graphs and networks
 
Spectral clustering with motifs and higher-order structures
Spectral clustering with motifs and higher-order structuresSpectral clustering with motifs and higher-order structures
Spectral clustering with motifs and higher-order structures
 
Higher-order organization of complex networks
Higher-order organization of complex networksHigher-order organization of complex networks
Higher-order organization of complex networks
 
Spacey random walks and higher-order data analysis
Spacey random walks and higher-order data analysisSpacey random walks and higher-order data analysis
Spacey random walks and higher-order data analysis
 
Non-exhaustive, Overlapping K-means
Non-exhaustive, Overlapping K-meansNon-exhaustive, Overlapping K-means
Non-exhaustive, Overlapping K-means
 
Using Local Spectral Methods to Robustify Graph-Based Learning
Using Local Spectral Methods to Robustify Graph-Based LearningUsing Local Spectral Methods to Robustify Graph-Based Learning
Using Local Spectral Methods to Robustify Graph-Based Learning
 
Spacey random walks and higher order Markov chains
Spacey random walks and higher order Markov chainsSpacey random walks and higher order Markov chains
Spacey random walks and higher order Markov chains
 
Localized methods in graph mining
Localized methods in graph miningLocalized methods in graph mining
Localized methods in graph mining
 
PageRank Centrality of dynamic graph structures
PageRank Centrality of dynamic graph structuresPageRank Centrality of dynamic graph structures
PageRank Centrality of dynamic graph structures
 
Iterative methods with special structures
Iterative methods with special structuresIterative methods with special structures
Iterative methods with special structures
 
Big data matrix factorizations and Overlapping community detection in graphs
Big data matrix factorizations and Overlapping community detection in graphsBig data matrix factorizations and Overlapping community detection in graphs
Big data matrix factorizations and Overlapping community detection in graphs
 
Anti-differentiating approximation algorithms: A case study with min-cuts, sp...
Anti-differentiating approximation algorithms: A case study with min-cuts, sp...Anti-differentiating approximation algorithms: A case study with min-cuts, sp...
Anti-differentiating approximation algorithms: A case study with min-cuts, sp...
 
Localized methods for diffusions in large graphs
Localized methods for diffusions in large graphsLocalized methods for diffusions in large graphs
Localized methods for diffusions in large graphs
 
Anti-differentiating Approximation Algorithms: PageRank and MinCut
Anti-differentiating Approximation Algorithms: PageRank and MinCutAnti-differentiating Approximation Algorithms: PageRank and MinCut
Anti-differentiating Approximation Algorithms: PageRank and MinCut
 
Fast relaxation methods for the matrix exponential
Fast relaxation methods for the matrix exponential Fast relaxation methods for the matrix exponential
Fast relaxation methods for the matrix exponential
 
Gaps between the theory and practice of large-scale matrix-based network comp...
Gaps between the theory and practice of large-scale matrix-based network comp...Gaps between the theory and practice of large-scale matrix-based network comp...
Gaps between the theory and practice of large-scale matrix-based network comp...
 
MapReduce Tall-and-skinny QR and applications
MapReduce Tall-and-skinny QR and applicationsMapReduce Tall-and-skinny QR and applications
MapReduce Tall-and-skinny QR and applications
 
Recommendation and graph algorithms in Hadoop and SQL
Recommendation and graph algorithms in Hadoop and SQLRecommendation and graph algorithms in Hadoop and SQL
Recommendation and graph algorithms in Hadoop and SQL
 
Personalized PageRank based community detection
Personalized PageRank based community detectionPersonalized PageRank based community detection
Personalized PageRank based community detection
 

Fast Katz and Commuters

  • 1. FAST KATZ AND COMMUTERS Quadrature Rules and Sparse Linear Solvers for Link Prediction Heuristics David F. Gleich Sandia National Labs Berkeley LAPACK Seminar February 3, 2011 With Pooya Esfandiar, Francesco Bonchi, Chen Grief, Laks V. S. Lakshmanan, and Byung-Won On Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000. David F. Gleich (Sandia) Berkeley SCMC Seminar 1 / 43
  • 2. MAIN RESULTS A – adjacency matrix For Katz L – Laplacian matrix Compute one       fast Compute top       fast Katz score :                          Commute time:                 For Commute Compute one       fast David F. Gleich (Sandia) Berkeley SCMC Seminar 2 of 43
  • 3. OUTLINE Why study these measures? Katz Rank and Commute Time How else do people compute them? Quadrature rules for pairwise scores Sparse linear systems solves for top-k As many results as we have time for… David F. Gleich (Sandia) Berkeley SCMC Seminar 3 of 43
  • 4. WHY? LINK PREDICTION Neighborhood based Path based Liben-Nowell and Kleinberg 2003, 2006 found that path based link prediction was more efficient David F. Gleich (Sandia) Berkeley SCMC Seminar 4 of 43
  • 5. NOTE All graphs are undirected All graphs are connected David F. Gleich (Sandia) Berkeley SCMC Seminar 5 of 43
  • 6. LEO KATZ David F. Gleich (Sandia) Berkeley SCMC Seminar 6 of 43
  • 7. NOT QUITE, WIKIPEDIA     : adjacency,                     : random walk PageRank               Katz               These are equivalent if     has constant degree David F. Gleich (Sandia) Berkeley SCMC Seminar 7 of 43
  • 8. WHAT KATZ ACTUALLY SAID “we assume that each link independently has the same probability of being effective” … “we conceive a constant     , depending on the group and the context of the particular investigation, which has the force of a probability of effectiveness of a single link. A k-step chain then, has probability       of being effective.” “We wish to find the column sums of the matrix” Leo Katz 1953, A New Status Index Derived from Sociometric Analysis, Psychometria 18(1):39-43 David F. Gleich (Sandia) Berkeley SCMC Seminar 8 of 43
  • 9. A MODERN TAKE The Katz score (node-based) is            The Katz score (edge-based) is            David F. Gleich (Sandia) Berkeley SCMC Seminar 9 of 43
  • 10. RETURNING TO THE MATRIX                                         Carl Neumann                             David F. Gleich (Sandia) Berkeley SCMC Seminar 10 of 43
  • 11. Carl Neumann I’ve heard the Neumann series called the “von Neumann” series more than I’d like! In fact, the von Neumann kernel of a graph should be named the “Neumann” kernel! Wikipedia page David F. Gleich (Sandia) Berkeley SCMC Seminar 11 / 43
  • 12. PROPERTIES OF KATZ’S MATRIX     is symmetric     exists when                                       is spd when                       Note that         1/max-degree suffices David F. Gleich (Sandia) Berkeley SCMC Seminar 12 of 43
  • 13. COMMUTE TIME Picture taken from Google images, seems to be Bay Bridge Traffic by Jim M. Goldstein. David F. Gleich (Sandia) Berkeley SCMC Seminar 13 / 43
  • 14. COMMUTE TIME Consider a uniform random walk on a graph                     Also called the hitting time from node i to j, or the first transition time                                                       David F. Gleich (Sandia) Berkeley SCMC Seminar 14 of 43
  • 15. SKIPPING DETAILS                     : graph Laplacian                                               is the only null-vector David F. Gleich (Sandia) Berkeley SCMC Seminar 15 of 43
  • 16. WHAT DO OTHER PEOPLE DO? 1) Just work with the linear algebra formulations 2) For Katz, Truncate the Neumann series as a few (3-5) terms 3) Use low-rank approximations from EVD(A) or EVD(L) 4) For commute, use Johnson-Lindenstrauss inspired random sampling 5) Approximately decompose into smaller problems Liben-Nowell and Kleinberg CIKM2003, Acar et al. ICDM2009, Spielman and Srivastava STOC2008, Sarkar and Moore UAI2007,Wang et al. ICDM2007 David F. Gleich (Sandia) Berkeley SCMC Seminar 16 of 43
  • 17. THE PROBLEM All of these techniques are preprocessing based because most people’s goal is to compute all the scores. We want to avoid preprocessing the graph. There are a few caveats here! i.e. one could solve the system instead of looking for the matrix inverse David F. Gleich (Sandia) Berkeley SCMC Seminar 17 of 43
  • 18. WHY NO PREPROCESSING? The graph is constantly changing as I rate new movies. David F. Gleich (Sandia) Berkeley SCMC Seminar 18 of 43
  • 19. WHY NO PREPROCESSING? Top-k predicted “links” are movies to watch! Pairwise scores give user similarity David F. Gleich (Sandia) Berkeley SCMC Seminar 19 of 43
  • 20. PAIRWISE ALGORITHMS Katz             Commute                                         Golub and Meurant to the rescue! David F. Gleich (Sandia) Berkeley SCMC Seminar 20 of 43
  • 21. MMQ - THE BIG IDEA Quadratic form                 Think                        Weighted sum           A is s.p.d. use EVD    Stieltjes integral           “A tautology”    Quadrature approximation              Matrix equation         Lanczos David F. Gleich (Sandia) Berkeley SCMC Seminar 21 of 43
  • 22. LANCZOS         , k-steps of the Lanczos method produce                                     and                              =               David F. Gleich (Sandia) Berkeley SCMC Seminar 22 of 43
  • 23. PRACTICAL LANCZOS Only need to store the last 2 vectors in       Updating requires O(matvec) work       is not orthogonal David F. Gleich (Sandia) Berkeley SCMC Seminar 23 of 43
  • 24. MMQ PROCEDURE Goal                         Given                         1. Run k-steps of Lanczos on     starting with     2. Compute       ,     with an additional eigenvalue at     , set            Correspond to a Gauss-Radau rule, with u as a prescribed node 3. Compute     ,     with an additional eigenvalue at   , set           Correspond to a Gauss-Radau rule, with l as a prescribed node 4. Output               as lower and upper bounds on     David F. Gleich (Sandia) Berkeley SCMC Seminar 24 of 43
  • 25. PRACTICAL MMQ Increase k to become more accurate Bad eigenvalue bounds yield worse results       and     are easy to compute     not required, we can iteratively update it’s LU factorization David F. Gleich (Sandia) Berkeley SCMC Seminar 25 of 43
  • 26. PRACTICAL MMQ David F. Gleich (Sandia) Berkeley SCMC Seminar 26 of 43
  • 27. ONE LAST STEP FOR KATZ Katz                                                                              David F. Gleich (Sandia) Berkeley SCMC Seminar 27 of 43
  • 28. KATZ SCORES ARE LOCALIZED Up to 50 neighbors is 99.65% of the total mass David F. Gleich (Sandia) Berkeley SCMC Seminar 28 of 43
  • 29. TOP-K ALGORITHM FOR KATZ Approximate                    where     is sparse Keep     sparse too Ideally, don’t “touch” all of     David F. Gleich (Sandia) Berkeley SCMC Seminar 29 of 43
  • 30. INSPIRATION - PAGERANK Approximate                    where     is sparse Keep     sparse too? YES! Ideally, don’t “touch” all of     ? YES! McSherry WWW2005, Berkhin 2007, Anderson et al. FOCS2008 – Thanks to Reid Anderson for telling me McSherry did this too. David F. Gleich (Sandia) Berkeley SCMC Seminar 30 of 43
  • 31. THE ALGORITHM - MCSHERRY For               Start with the Richardson iteration                             Rewrite                     Richardson converges if                         David F. Gleich (Sandia) Berkeley SCMC Seminar 31 of 43
  • 32. THE ALGORITHM Note     is sparse. If                 , then         is sparse. Idea only add one component of         to         David F. Gleich (Sandia) Berkeley SCMC Seminar 32 of 43
  • 33. THE ALGORITHM For               Init:                                                                   How to pick   ? David F. Gleich (Sandia) Berkeley SCMC Seminar 33 of 43
  • 34. THE ALGORITHM FOR KATZ For                             Init:                                                                          Pick   as max     Storing the non-zeros of the residual in a heap makes picking the max log(n) time. See Anderson et al. FOCS2008 for more David F. Gleich (Sandia) Berkeley SCMC Seminar 34 of 43
  • 35. CONVERGENCE? The “standard” proof works for         1/max-degree. For                       , then for symmetric     , this algorithm is the Gauss-Southwell procedure and it still converges. David F. Gleich (Sandia) Berkeley SCMC Seminar 35 of 43
  • 36. RESULTS – DATA, PARAMETERS All unweighted, connected graphs Easy                                     Hard                             David F. Gleich (Sandia) Berkeley SCMC Seminar 36 of 43
  • 37. KATZ BOUND CONVERGENCE David F. Gleich (Sandia) Berkeley SCMC Seminar 37 of 43
  • 38. COMMUTE BOUND CONVERG. David F. Gleich (Sandia) Berkeley SCMC Seminar 38 of 43
  • 39. KATZ SET CONVERGENCE For arXiv graph. David F. Gleich (Sandia) Berkeley SCMC Seminar 39 of 43
  • 40. TIMING David F. Gleich (Sandia) Berkeley SCMC Seminar 40 of 43
  • 41. CONCLUSIONS These algorithms are faster than many alternatives. For pairwise commute, stopping criteria are simpler with bounds For top-k problems, we often need less than 1 matvec for good enough results David F. Gleich (Sandia) Berkeley SCMC Seminar 41 of 43
  • 42. WARTS AND TODOS Stopping criteria on our top-k algorithm can be a bit hairy… we should refine it. The top-k approach doesn’t work right for commute time… we have an alternative.                        requires the entire diagonal of       Take advantage of new research to “seed” commute time better! von Luxburg et al. NIPS2010 David F. Gleich (Sandia) Berkeley SCMC Seminar 42 of 43
  • 43. Paper at WAW2010 Slides should be online soon Code is online already stanford.edu/~dgleich/ publications/2010/codes/fast-katz By AngryDogDesign on DeviantArt David F. Gleich (Sandia) Berkeley SCMC Seminar 43
  • 44. F-MEASURE David F. Gleich (Sandia) Berkeley SCMC Seminar 44 of 43
  • 45. MAIN RESULTS – SLIDE ONE A – adjacency matrix L – Laplacian matrix Katz score :                          Commute time:                  David F. Gleich (Sandia) Berkeley SCMC Seminar 45 of 43
  • 46. MAIN RESULTS – SLIDE TWO For Katz Compute one       fast Compute top       fast For Commute Compute one       fast For almost commute Compute top $F_{i,:}$ fast David F. Gleich (Sandia) Berkeley SCMC Seminar 46 of 43
  • 47. MAIN RESULTS – SLIDE THREE David F. Gleich (Sandia) Berkeley SCMC Seminar 47 of 43
  • 48. KATZ BOUND CONVERGENCE David F. Gleich (Sandia) Berkeley SCMC Seminar 48 of 43
  • 49. KATZ ERROR CONVERGENCE David F. Gleich (Sandia) Berkeley SCMC Seminar 49 of 43
  • 50. COMMUTE BOUND CONVERG. David F. Gleich (Sandia) Berkeley SCMC Seminar 50 of 43
  • 51. COMMUTE ERROR CONVERG. David F. Gleich (Sandia) Berkeley SCMC Seminar 51 of 43
  • 52. KATZ SET CONVERGENCE David F. Gleich (Sandia) Berkeley SCMC Seminar 52 of 43
  • 53. KATZ ORDER CONVERGENCE David F. Gleich (Sandia) Berkeley SCMC Seminar 53 of 43