SlideShare une entreprise Scribd logo
1  sur  28
Télécharger pour lire hors ligne
FAST KATZ AND COMMUTERS
                                                              Efficient Estimation of Social Relatedness

                                                                                                              David F. Gleich
                                                                                                        Sandia National Labs

                                                                                                                 WAW2010
                                                                                                          16 December 2010
                                                                                                               Palo Alto, CA

                                      With Pooya Esfandiar, Francesco Bonchi, Chen Grief,
                                               Laks V. S. Lakshmanan, and Byung-Won On


Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin
                      Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.
David F. Gleich (Sandia)                                     ICME la/opt seminar                                                         1 / 28
MAIN RESULTS
                                       For Katz
                                       Compute one       fast
A – adjacency matrix                   Compute top       fast
L – Laplacian matrix
Katz score :
                         
Commute time:
                  

                                       For Commute
                                       Compute one       fast
David F. Gleich (Sandia)   ICME la/opt seminar                  2 of 28
OUTLINE
Katz Rank and Commute Time, then Why
Matrices, moments, and quadrature rules
for pairwise scores
Sparse linear systems solves for top-k

Some results



David F. Gleich (Sandia)   ICME la/opt seminar   3 of 28
NOTE – EVERYTHING IS SIMPLE

All graphs are undirected

All graphs are connected

All graphs are unweighted



David F. Gleich (Sandia)   ICME la/opt seminar   4 of 28
KATZ SCORES
The Katz score is

                                         


                                             
                                                        
                           Carl Neumann                
                                                   
David F. Gleich (Sandia)               ICME la/opt seminar   5 of 28
COMMUTE TIME
Consider a uniform random walk on a
graph                  
                                                      Also called the hitting
                                                      time from node i to j, or
                                                      the first transition time

                                 
                                               : graph Laplacian
                                                          
                                         is the only null-vector
                                                            Fouss et al.TKDE 2007
David F. Gleich (Sandia)        ICME la/opt seminar                        6 of 28
WHAT DO OTHER PEOPLE DO?
1) Just work with the linear algebra formulations

2) For Katz, truncate the Neumann series as a few (3-5)
   terms

3) Use low-rank approximations from EVD(A) or EVD(L)

4) For commute, use Johnson-Lindenstrauss inspired
   random sampling

5) Approximately decompose into smaller problems

                                                    Liben-Nowell and Kleinberg CIKM2003, Acar et al. ICDM2009,
                           Spielman and Srivastava STOC2008, Sarkar and Moore UAI2007,Wang et al. ICDM2007
David F. Gleich (Sandia)                   ICME la/opt seminar                                         7 of 28
THE PROBLEM

                             All of these techniques are
                            preprocessing based because
                           most people’s goal is to compute
                                     all the scores.

                                   We want to avoid
                                preprocessing the graph.

                            There are a few caveats here! i.e. one could solve the system instead of looking for the matrix inverse
David F. Gleich (Sandia)                                 ICME la/opt seminar                                                8 of 28
WHY? LINK PREDICTION
                                                                                 Neighborhood based




                                                                                         Path based


                           Liben-Nowell and Kleinberg 2003, 2006 found that path based link prediction was more efficient
David F. Gleich (Sandia)                          ICME la/opt seminar                                             9 of 28
WHY NO PREPROCESSING?




                           The graph is constantly changing
                                as I rate new movies.
David F. Gleich (Sandia)              ICME la/opt seminar     10 of 28
WHY NO PREPROCESSING?

                                                    Top-k predicted “links”
                                                    are movies to watch!




                           Pairwise scores give
                                 user similarity



David F. Gleich (Sandia)      ICME la/opt seminar                             11 of 28
PAIRWISE ALGORITHMS
                Katz                 

Commute                                        

                              Golub and Meurant
                                to the rescue!




David F. Gleich (Sandia)         ICME la/opt seminar   12 of 28
MMQ - THE BIG IDEA
Quadratic form                                         Think                    
       


Weighted sum                                           A is s.p.d. use EVD
       


Stieltjes integral                                     “A tautology”

       


Quadrature approximation                                    
        


Matrix equation                                        Lanczos
David F. Gleich (Sandia)     ICME la/opt seminar                          13 of 28
MMQ PROCEDURE SKETCH
                                                 Bad bounds give worse
Goal                                             performance!
Given                        
                                                 Larger k gives better results;
                                                 k steps is k matvecs

1. Run k-steps of Lanczos on     starting with    
2. Compute       ,     with an additional eigenvalue at     ,
        set                                      Correspond to a Gauss-Radau rule, with
                                                 u as a prescribed node
                                                 Easy to “update” this inverse as k increases.

3. Compute     ,     with an additional eigenvalue at   , set
                                                 Correspond to a Gauss-Radau rule, with
                                                 l as a prescribed node
4. Output               as lower and upper bounds on b

David F. Gleich (Sandia)        ICME la/opt seminar                                          14 of 28
PRACTICAL MMQ




David F. Gleich (Sandia)   ICME la/opt seminar   15 of 28
ONE LAST STEP FOR KATZ
                      Katz              

                                   

                                  
                                



David F. Gleich (Sandia)          ICME la/opt seminar   16 of 28
TOP-K ALGORITHM FOR KATZ
Approximate    
                             
where     is sparse

Keep     sparse too
Ideally, don’t “touch” all of    

                           For PageRank, we can do this!
David F. Gleich (Sandia)             ICME la/opt seminar   17 of 28
THE ALGORITHM - MCSHERRY
For              
Start with the Richardson iteration
                                                        Richardson converges if
Rewrite                                                             

                     

Note     is sparse
If                 , then         is sparse.

Idea Only add one component of         to        

                                                                   McSherry WWW2005
David F. Gleich (Sandia)          ICME la/opt seminar                        18 of 28
THE ALGORITHM FOR KATZ
For                          

Init:                                
               
                      

Pick   as max                               
   Storing the non-zeros of the residual in a heap makes picking the max log(n) time. See Anderson et al. FOCS2008 for more
David F. Gleich (Sandia)                            ICME la/opt seminar                                            19 of 28
CONVERGENCE?
The “standard” proof works for
        1/max-degree.

If you pick   as the maximum element, we
can show this is convergent if Richardson
converges. This proof requires     to be
symmetric positive definite.



David F. Gleich (Sandia)   ICME la/opt seminar   20 of 28
RESULTS – DATA, PARAMETERS




               All unweighted, connected graphs
                     Easy     :                                
                       Hard     :                        
David F. Gleich (Sandia)          ICME la/opt seminar             21 of 28
KATZ BOUND CONVERGENCE




David F. Gleich (Sandia)   ICME la/opt seminar   22 of 28
COMMUTE BOUND CONVERG.




David F. Gleich (Sandia)   ICME la/opt seminar   23 of 28
KATZ SET CONVERGENCE




                                                 For arXiv graph.
David F. Gleich (Sandia)   ICME la/opt seminar           24 of 28
TIMING




David F. Gleich (Sandia)   ICME la/opt seminar   25 of 28
CONCLUSIONS
These algorithms are faster than many
alternatives in these special cases.

For pairwise commute, stopping criteria
are simpler with bounds.

For top-k problems, we often need less
than 1 matvec for good enough results

David F. Gleich (Sandia)   ICME la/opt seminar   26 of 28
WARTS AND TODOS
Stopping criteria on our top-k algorithm
can be a bit hairy… we should refine it.

The top-k approach doesn’t work right for
commute time… we have an alternative.

Take advantage of new research to “seed”
commute time better!

Evaluate more datasets, like Netflix.
                                                 von Luxburg et al. NIPS2010
David F. Gleich (Sandia)   ICME la/opt seminar                       27 of 28
More details in the paper

                                                       Slides should be online soon

                                                       Code is online already
                                                          stanford.edu/~dgleich/
                                                          publications/2010/codes/fast-katz




           By AngryDogDesign on DeviantArt



David F. Gleich (Sandia)                     ICME la/opt seminar                              28

Contenu connexe

Plus de David Gleich

Engineering Data Science Objectives for Social Network Analysis
Engineering Data Science Objectives for Social Network AnalysisEngineering Data Science Objectives for Social Network Analysis
Engineering Data Science Objectives for Social Network AnalysisDavid Gleich
 
Correlation clustering and community detection in graphs and networks
Correlation clustering and community detection in graphs and networksCorrelation clustering and community detection in graphs and networks
Correlation clustering and community detection in graphs and networksDavid Gleich
 
Spectral clustering with motifs and higher-order structures
Spectral clustering with motifs and higher-order structuresSpectral clustering with motifs and higher-order structures
Spectral clustering with motifs and higher-order structuresDavid Gleich
 
Higher-order organization of complex networks
Higher-order organization of complex networksHigher-order organization of complex networks
Higher-order organization of complex networksDavid Gleich
 
Spacey random walks and higher-order data analysis
Spacey random walks and higher-order data analysisSpacey random walks and higher-order data analysis
Spacey random walks and higher-order data analysisDavid Gleich
 
Non-exhaustive, Overlapping K-means
Non-exhaustive, Overlapping K-meansNon-exhaustive, Overlapping K-means
Non-exhaustive, Overlapping K-meansDavid Gleich
 
Using Local Spectral Methods to Robustify Graph-Based Learning
Using Local Spectral Methods to Robustify Graph-Based LearningUsing Local Spectral Methods to Robustify Graph-Based Learning
Using Local Spectral Methods to Robustify Graph-Based LearningDavid Gleich
 
Spacey random walks and higher order Markov chains
Spacey random walks and higher order Markov chainsSpacey random walks and higher order Markov chains
Spacey random walks and higher order Markov chainsDavid Gleich
 
Localized methods in graph mining
Localized methods in graph miningLocalized methods in graph mining
Localized methods in graph miningDavid Gleich
 
PageRank Centrality of dynamic graph structures
PageRank Centrality of dynamic graph structuresPageRank Centrality of dynamic graph structures
PageRank Centrality of dynamic graph structuresDavid Gleich
 
Iterative methods with special structures
Iterative methods with special structuresIterative methods with special structures
Iterative methods with special structuresDavid Gleich
 
Big data matrix factorizations and Overlapping community detection in graphs
Big data matrix factorizations and Overlapping community detection in graphsBig data matrix factorizations and Overlapping community detection in graphs
Big data matrix factorizations and Overlapping community detection in graphsDavid Gleich
 
Anti-differentiating approximation algorithms: A case study with min-cuts, sp...
Anti-differentiating approximation algorithms: A case study with min-cuts, sp...Anti-differentiating approximation algorithms: A case study with min-cuts, sp...
Anti-differentiating approximation algorithms: A case study with min-cuts, sp...David Gleich
 
Localized methods for diffusions in large graphs
Localized methods for diffusions in large graphsLocalized methods for diffusions in large graphs
Localized methods for diffusions in large graphsDavid Gleich
 
Anti-differentiating Approximation Algorithms: PageRank and MinCut
Anti-differentiating Approximation Algorithms: PageRank and MinCutAnti-differentiating Approximation Algorithms: PageRank and MinCut
Anti-differentiating Approximation Algorithms: PageRank and MinCutDavid Gleich
 
Fast relaxation methods for the matrix exponential
Fast relaxation methods for the matrix exponential Fast relaxation methods for the matrix exponential
Fast relaxation methods for the matrix exponential David Gleich
 
Fast matrix primitives for ranking, link-prediction and more
Fast matrix primitives for ranking, link-prediction and moreFast matrix primitives for ranking, link-prediction and more
Fast matrix primitives for ranking, link-prediction and moreDavid Gleich
 
Gaps between the theory and practice of large-scale matrix-based network comp...
Gaps between the theory and practice of large-scale matrix-based network comp...Gaps between the theory and practice of large-scale matrix-based network comp...
Gaps between the theory and practice of large-scale matrix-based network comp...David Gleich
 
MapReduce Tall-and-skinny QR and applications
MapReduce Tall-and-skinny QR and applicationsMapReduce Tall-and-skinny QR and applications
MapReduce Tall-and-skinny QR and applicationsDavid Gleich
 
Recommendation and graph algorithms in Hadoop and SQL
Recommendation and graph algorithms in Hadoop and SQLRecommendation and graph algorithms in Hadoop and SQL
Recommendation and graph algorithms in Hadoop and SQLDavid Gleich
 

Plus de David Gleich (20)

Engineering Data Science Objectives for Social Network Analysis
Engineering Data Science Objectives for Social Network AnalysisEngineering Data Science Objectives for Social Network Analysis
Engineering Data Science Objectives for Social Network Analysis
 
Correlation clustering and community detection in graphs and networks
Correlation clustering and community detection in graphs and networksCorrelation clustering and community detection in graphs and networks
Correlation clustering and community detection in graphs and networks
 
Spectral clustering with motifs and higher-order structures
Spectral clustering with motifs and higher-order structuresSpectral clustering with motifs and higher-order structures
Spectral clustering with motifs and higher-order structures
 
Higher-order organization of complex networks
Higher-order organization of complex networksHigher-order organization of complex networks
Higher-order organization of complex networks
 
Spacey random walks and higher-order data analysis
Spacey random walks and higher-order data analysisSpacey random walks and higher-order data analysis
Spacey random walks and higher-order data analysis
 
Non-exhaustive, Overlapping K-means
Non-exhaustive, Overlapping K-meansNon-exhaustive, Overlapping K-means
Non-exhaustive, Overlapping K-means
 
Using Local Spectral Methods to Robustify Graph-Based Learning
Using Local Spectral Methods to Robustify Graph-Based LearningUsing Local Spectral Methods to Robustify Graph-Based Learning
Using Local Spectral Methods to Robustify Graph-Based Learning
 
Spacey random walks and higher order Markov chains
Spacey random walks and higher order Markov chainsSpacey random walks and higher order Markov chains
Spacey random walks and higher order Markov chains
 
Localized methods in graph mining
Localized methods in graph miningLocalized methods in graph mining
Localized methods in graph mining
 
PageRank Centrality of dynamic graph structures
PageRank Centrality of dynamic graph structuresPageRank Centrality of dynamic graph structures
PageRank Centrality of dynamic graph structures
 
Iterative methods with special structures
Iterative methods with special structuresIterative methods with special structures
Iterative methods with special structures
 
Big data matrix factorizations and Overlapping community detection in graphs
Big data matrix factorizations and Overlapping community detection in graphsBig data matrix factorizations and Overlapping community detection in graphs
Big data matrix factorizations and Overlapping community detection in graphs
 
Anti-differentiating approximation algorithms: A case study with min-cuts, sp...
Anti-differentiating approximation algorithms: A case study with min-cuts, sp...Anti-differentiating approximation algorithms: A case study with min-cuts, sp...
Anti-differentiating approximation algorithms: A case study with min-cuts, sp...
 
Localized methods for diffusions in large graphs
Localized methods for diffusions in large graphsLocalized methods for diffusions in large graphs
Localized methods for diffusions in large graphs
 
Anti-differentiating Approximation Algorithms: PageRank and MinCut
Anti-differentiating Approximation Algorithms: PageRank and MinCutAnti-differentiating Approximation Algorithms: PageRank and MinCut
Anti-differentiating Approximation Algorithms: PageRank and MinCut
 
Fast relaxation methods for the matrix exponential
Fast relaxation methods for the matrix exponential Fast relaxation methods for the matrix exponential
Fast relaxation methods for the matrix exponential
 
Fast matrix primitives for ranking, link-prediction and more
Fast matrix primitives for ranking, link-prediction and moreFast matrix primitives for ranking, link-prediction and more
Fast matrix primitives for ranking, link-prediction and more
 
Gaps between the theory and practice of large-scale matrix-based network comp...
Gaps between the theory and practice of large-scale matrix-based network comp...Gaps between the theory and practice of large-scale matrix-based network comp...
Gaps between the theory and practice of large-scale matrix-based network comp...
 
MapReduce Tall-and-skinny QR and applications
MapReduce Tall-and-skinny QR and applicationsMapReduce Tall-and-skinny QR and applications
MapReduce Tall-and-skinny QR and applications
 
Recommendation and graph algorithms in Hadoop and SQL
Recommendation and graph algorithms in Hadoop and SQLRecommendation and graph algorithms in Hadoop and SQL
Recommendation and graph algorithms in Hadoop and SQL
 

Dernier

"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 

Dernier (20)

"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 

Fast Katz and Commuters: Efficient Estimation of Social Relatedness in Large Networks

  • 1. FAST KATZ AND COMMUTERS Efficient Estimation of Social Relatedness David F. Gleich Sandia National Labs WAW2010 16 December 2010 Palo Alto, CA With Pooya Esfandiar, Francesco Bonchi, Chen Grief, Laks V. S. Lakshmanan, and Byung-Won On Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000. David F. Gleich (Sandia) ICME la/opt seminar 1 / 28
  • 2. MAIN RESULTS For Katz Compute one       fast A – adjacency matrix Compute top       fast L – Laplacian matrix Katz score :                          Commute time:                 For Commute Compute one       fast David F. Gleich (Sandia) ICME la/opt seminar 2 of 28
  • 3. OUTLINE Katz Rank and Commute Time, then Why Matrices, moments, and quadrature rules for pairwise scores Sparse linear systems solves for top-k Some results David F. Gleich (Sandia) ICME la/opt seminar 3 of 28
  • 4. NOTE – EVERYTHING IS SIMPLE All graphs are undirected All graphs are connected All graphs are unweighted David F. Gleich (Sandia) ICME la/opt seminar 4 of 28
  • 5. KATZ SCORES The Katz score is                                                    Carl Neumann                                  David F. Gleich (Sandia) ICME la/opt seminar 5 of 28
  • 6. COMMUTE TIME Consider a uniform random walk on a graph                   Also called the hitting time from node i to j, or                             the first transition time                                              : graph Laplacian                                               is the only null-vector Fouss et al.TKDE 2007 David F. Gleich (Sandia) ICME la/opt seminar 6 of 28
  • 7. WHAT DO OTHER PEOPLE DO? 1) Just work with the linear algebra formulations 2) For Katz, truncate the Neumann series as a few (3-5) terms 3) Use low-rank approximations from EVD(A) or EVD(L) 4) For commute, use Johnson-Lindenstrauss inspired random sampling 5) Approximately decompose into smaller problems Liben-Nowell and Kleinberg CIKM2003, Acar et al. ICDM2009, Spielman and Srivastava STOC2008, Sarkar and Moore UAI2007,Wang et al. ICDM2007 David F. Gleich (Sandia) ICME la/opt seminar 7 of 28
  • 8. THE PROBLEM All of these techniques are preprocessing based because most people’s goal is to compute all the scores. We want to avoid preprocessing the graph. There are a few caveats here! i.e. one could solve the system instead of looking for the matrix inverse David F. Gleich (Sandia) ICME la/opt seminar 8 of 28
  • 9. WHY? LINK PREDICTION Neighborhood based Path based Liben-Nowell and Kleinberg 2003, 2006 found that path based link prediction was more efficient David F. Gleich (Sandia) ICME la/opt seminar 9 of 28
  • 10. WHY NO PREPROCESSING? The graph is constantly changing as I rate new movies. David F. Gleich (Sandia) ICME la/opt seminar 10 of 28
  • 11. WHY NO PREPROCESSING? Top-k predicted “links” are movies to watch! Pairwise scores give user similarity David F. Gleich (Sandia) ICME la/opt seminar 11 of 28
  • 12. PAIRWISE ALGORITHMS Katz            Commute                                         Golub and Meurant to the rescue! David F. Gleich (Sandia) ICME la/opt seminar 12 of 28
  • 13. MMQ - THE BIG IDEA Quadratic form                 Think                        Weighted sum           A is s.p.d. use EVD    Stieltjes integral           “A tautology”    Quadrature approximation              Matrix equation         Lanczos David F. Gleich (Sandia) ICME la/opt seminar 13 of 28
  • 14. MMQ PROCEDURE SKETCH Bad bounds give worse Goal                         performance! Given                         Larger k gives better results; k steps is k matvecs 1. Run k-steps of Lanczos on     starting with     2. Compute       ,     with an additional eigenvalue at     , set            Correspond to a Gauss-Radau rule, with u as a prescribed node Easy to “update” this inverse as k increases. 3. Compute     ,     with an additional eigenvalue at   , set           Correspond to a Gauss-Radau rule, with l as a prescribed node 4. Output               as lower and upper bounds on b David F. Gleich (Sandia) ICME la/opt seminar 14 of 28
  • 15. PRACTICAL MMQ David F. Gleich (Sandia) ICME la/opt seminar 15 of 28
  • 16. ONE LAST STEP FOR KATZ Katz                                                                            David F. Gleich (Sandia) ICME la/opt seminar 16 of 28
  • 17. TOP-K ALGORITHM FOR KATZ Approximate                    where     is sparse Keep     sparse too Ideally, don’t “touch” all of     For PageRank, we can do this! David F. Gleich (Sandia) ICME la/opt seminar 17 of 28
  • 18. THE ALGORITHM - MCSHERRY For               Start with the Richardson iteration                             Richardson converges if Rewrite                                  Note     is sparse If                 , then         is sparse. Idea Only add one component of         to         McSherry WWW2005 David F. Gleich (Sandia) ICME la/opt seminar 18 of 28
  • 19. THE ALGORITHM FOR KATZ For                           Init:                                                                        Pick   as max     Storing the non-zeros of the residual in a heap makes picking the max log(n) time. See Anderson et al. FOCS2008 for more David F. Gleich (Sandia) ICME la/opt seminar 19 of 28
  • 20. CONVERGENCE? The “standard” proof works for         1/max-degree. If you pick   as the maximum element, we can show this is convergent if Richardson converges. This proof requires     to be symmetric positive definite. David F. Gleich (Sandia) ICME la/opt seminar 20 of 28
  • 21. RESULTS – DATA, PARAMETERS All unweighted, connected graphs Easy     :                                 Hard     :                         David F. Gleich (Sandia) ICME la/opt seminar 21 of 28
  • 22. KATZ BOUND CONVERGENCE David F. Gleich (Sandia) ICME la/opt seminar 22 of 28
  • 23. COMMUTE BOUND CONVERG. David F. Gleich (Sandia) ICME la/opt seminar 23 of 28
  • 24. KATZ SET CONVERGENCE For arXiv graph. David F. Gleich (Sandia) ICME la/opt seminar 24 of 28
  • 25. TIMING David F. Gleich (Sandia) ICME la/opt seminar 25 of 28
  • 26. CONCLUSIONS These algorithms are faster than many alternatives in these special cases. For pairwise commute, stopping criteria are simpler with bounds. For top-k problems, we often need less than 1 matvec for good enough results David F. Gleich (Sandia) ICME la/opt seminar 26 of 28
  • 27. WARTS AND TODOS Stopping criteria on our top-k algorithm can be a bit hairy… we should refine it. The top-k approach doesn’t work right for commute time… we have an alternative. Take advantage of new research to “seed” commute time better! Evaluate more datasets, like Netflix. von Luxburg et al. NIPS2010 David F. Gleich (Sandia) ICME la/opt seminar 27 of 28
  • 28. More details in the paper Slides should be online soon Code is online already stanford.edu/~dgleich/ publications/2010/codes/fast-katz By AngryDogDesign on DeviantArt David F. Gleich (Sandia) ICME la/opt seminar 28
  • 29. LEO KATZ David F. Gleich (Sandia) ICME la/opt seminar 29 of 28
  • 30. NOT QUITE, WIKIPEDIA     : adjacency,                     : random walk PageRank              Katz              These are equivalent if     has constant degree David F. Gleich (Sandia) ICME la/opt seminar 30 of 28
  • 31. WHAT KATZ ACTUALLY SAID “we assume that each link independently has the same probability of being effective” … “we conceive a constant     , depending on the group and the context of the particular investigation, which has the force of a probability of effectiveness of a single link. A k-step chain then, has probability       of being effective.” “We wish to find the column sums of the matrix” Leo Katz 1953, A New Status Index Derived from Sociometric Analysis, Psychometria 18(1):39-43 David F. Gleich (Sandia) ICME la/opt seminar 31 of 28
  • 32. RETURNING TO THE MATRIX                                         Carl Neumann                                  David F. Gleich (Sandia) ICME la/opt seminar 32 of 28
  • 33. PROPERTIES OF KATZ’S MATRIX     is symmetric     exists when                                   is sym. pos. def. when                       Note that         1/max-degree suffices David F. Gleich (Sandia) ICME la/opt seminar 33 of 28
  • 34. SKIPPING DETAILS                     : graph Laplacian                                               is the only null-vector David F. Gleich (Sandia) ICME la/opt seminar 34 of 28
  • 35. Carl Neumann I’ve heard the Neumann series called the “von Neumann” series more than I’d like! In fact, the von Neumann kernel of a graph should be named the “Neumann” kernel! Wikipedia page David F. Gleich (Sandia) ICME la/opt seminar 35 / 28
  • 36. LANCZOS           , k-steps of the Lanczos method produce                                     and                              =              David F. Gleich (Sandia) ICME la/opt seminar 36 of 28
  • 37. PRACTICAL LANCZOS Only need to store the last 2 vectors in       Updating requires O(matvec) work       is not orthogonal David F. Gleich (Sandia) ICME la/opt seminar 37 of 28
  • 38. PRACTICAL MMQ Increase k to become more accurate Bad eigenvalue bounds yield worse results       and     are easy to compute     not required, we can iteratively update it’s LU factorization David F. Gleich (Sandia) ICME la/opt seminar 38 of 28
  • 39. THE ALGORITHM For               Init:                                                                     How to pick   ? David F. Gleich (Sandia) ICME la/opt seminar 39 of 28
  • 40. MAIN RESULTS – SLIDE ONE A – adjacency matrix L – Laplacian matrix Katz score :                          Commute time:                  David F. Gleich (Sandia) ICME la/opt seminar 40 of 28
  • 41. TOP-K INSPIRATION: PAGERANK Approximate                    where     is sparse Keep     sparse too? YES! Ideally, don’t “touch” all of     ? YES! McSherry WWW2005, Berkhin 2007, Anderson et al. FOCS2008 – Thanks to Reid Anderson for telling me McSherry did this too. David F. Gleich (Sandia) ICME la/opt seminar 41 of 28
  • 42. THE ALGORITHM Note     is sparse. If                 , then         is sparse. Idea only add one component of         to         David F. Gleich (Sandia) ICME la/opt seminar 42 of 28
  • 43. MAIN RESULTS – SLIDE TWO For Katz Compute one       fast Compute top       fast For Commute Compute one       fast For almost commute Compute top       fast David F. Gleich (Sandia) ICME la/opt seminar 43 of 28
  • 44. MAIN RESULTS – SLIDE THREE David F. Gleich (Sandia) ICME la/opt seminar 44 of 28
  • 45. F-MEASURE David F. Gleich (Sandia) ICME la/opt seminar 45 of 28
  • 46. PAIRWISE RESULTS Katz upper and lower bounds Katz error convergence Commute-time upper and lower bounds Commute-time error convergence For the arXiv graph here David F. Gleich (Sandia) ICME la/opt seminar 46 of 28