SlideShare une entreprise Scribd logo
1  sur  27
Télécharger pour lire hors ligne
Overlapping
Clusters for
Distributed
Computation
DAVID F. GLEICH "     REID ANDERSEN "
 PURDUE UNIVERSITY
     MICROSOFT CORP.
COMPUTER SCIENCE "    VAHAB MIRROKNI"
 DEPARTMENT
            GOOGLE RESEARCH, NYC




                                                                      1
                                 David Gleich · Purdue
   WSDM2012
Problem 
Find a good way to distribute a big graph 
    for solving things like linear systems and simulating random walks

Contributions
Theoretical demonstration that overlap helps
Proof of concept procedure to find overlapping
partitions to reduce communication (~20%)

All code available
http://www.cs.purdue.edu/~dgleich/codes/
  overlapping





                                                                              2
                                         David Gleich · Purdue
   WSDM2012
The problem
     WHAT OUR NETWORKS       WHAT OUR OTHER
     LOOK LIKE
              NETWORKS LOOK LIKE




                                                              3
                         David Gleich · Purdue
   WSDM2012
The problem
     COMBINING NETWORKS AND GRAPHS IS A MESS




                                                                 4
                            David Gleich · Purdue
   WSDM2012
“Good” data distributions are
a fundamental problem in
distributed computation.
!
How to divide the
communication graph!
Balance work
Balance communication
Balance data
Balance programming
  complexity too




                                                     5
                David Gleich · Purdue
   WSDM2012
Current solutions
                  Work
        Comm.
       Data
          Programming

Disjoint vertex                Okay to                     “Think like a
                  Excellent
                Excellent
partitions
                    Good
                       vertex”

2d or Edge
                  Excellent
   Excellent
   Good
          “Impossible”
Partitions



Where we fit!

Overlapping                    Good to                     “Think like a
                  Okay
                     “Let’s see”
partitions
                    Excellent
                  cached vertex”




                                                                                  6
                                            David Gleich · Purdue
    WSDM2012
Goals
Find a set of "
overlapping clusters "
where 

random walks stay in a
 cluster for a long time

solving diffusion-like problems
 requires little communication
 (think PageRank, Katz, hitting times,
 semi-supervised learning) 




                                                                              7
                                         David Gleich · Purdue
   WSDM2012
Related work
Domain decomposition, Schwarz methods
 How to solve a linear system with overlap. Szyld et al.
Communication avoiding algorithms
 k-step matrix-vector products (Demmel et al.) and "
 growing overlap around partitions (Fritzsche, Frommer, Szyld)
Overlapping communities and link partitioning
algorithms for social network analysis
 Link communities (Ahn et al.); surveys by Fortunato and Satu
P2P based PageRank algorithms
 Parreira, Castillo, Donato et al. 




                                                                            8
                                       David Gleich · Purdue
   WSDM2012
Overlapping clusters
                           Each vertex 
                              in at least one cluster
                              has one home cluster
                           
Formally,
                           an overlapping cover is
                           (C, ⌧ )

                           C={       ,   ,       }
                              = set of clusters

                           ⌧ : V 7! C = map to homes
                           ⌧ is a partition!




                                                                 9
                        David Gleich · Purdue
       WSDM2012
Random walks in
      overlapping clusters
                                      Each vertex 
                                          in at least one cluster
                                          has one home cluster
                                      
    red cluster "
keeps the walk
                       Random walks
                       red cluster "
                                          go to the home
                       sends the walk     cluster after leaving
                       to gray cluster
                                      
                                      




                                                                       10
                                  David Gleich · Purdue
   WSDM2012
An evaluation metric"
      Swapping probability
                                     Is (C, ⌧ ) a good
                                     overlapping cover?
                                     Does a random walk
                                     swap clusters often?
    red cluster "
keeps the walk
                      ⇢
                                     
 1      =
                                         probability that a walk
                      red cluster "
                      sends the walk     changes clusters on each
                      to gray cluster
   step
                                         computable expression in the paper




                                                                          11
                                 David Gleich · Purdue
     WSDM2012
Overlapping clusters
                           Each vertex 
                              is in at least one cluster
                              has one home cluster
                              

                           Vol(C) = sum of degrees of
                            vertices in cluster C
                           MaxVol = "
                            upper bound on Vol(C) 
                           TotalVol(C) = "
                                    C
                             sum of Vol(C) for all clusters
                           VolRatio = TotalVol(C) / Vol(G)"
                                               C
                             how much extra data!




                                                               12
                        David Gleich · Purdue
   WSDM2012
Swapping probability &
partitioning
                                                       No overlap in
       
                                               this figure !

P is a partition
       
⇢1 (P) 
=
     1    X
       
    Cut(P)
  Vol(G)
           P2P
       
       Much like a
       classical graph
       partitioning metric




                                                                        13
                              David Gleich · Purdue
      WSDM2012
Overlapping clusters vs.
Partitioning in theory
                         Take a cycle graph
                             M groups of ℓ������ vertices
                             MaxVol = 2ℓ������
                         
                         
 partitioning
                         for
                                 1
                         
1
                         ⇢     =          (Optimal!)
                                 `
                         for overlapping
                                  4
                         ⇢1 =
                               ⌦(`2 )




                                                            14
                      David Gleich · Purdue
    WSDM2012
Heuristics for finding good "                        N P-hard for optimal
overlapping clusters
                               solution L



      Our multi-stage heuristic!
      1.  Find a large set of good clusters
          Use personalized PageRank clusters
      2.  Find “well contained” nodes (cores)
          Compute expected “leavetime” 
      3.  Cover the graph with core vertices
          Approximately solve a min set-cover problem
      4.  Combine clusters up to MaxVol
          The swapping probability is sub-modular
      




                                                                           15
                                 David Gleich · Purdue
    WSDM2012
Heuristics for finding good "                        N P-hard for optimal
overlapping clusters
                               solution L



      Our multi-stage heuristic!
      1.  Find a large set of good clusters
                                                               Each cluster takes
          Use personalized PageRank clusters, or metis
        “< MaxVol” work

      2.  Find “well contained” nodes (cores)
                                                               Takes O(Vol)
          Compute expected “leave time” 
                      work per cluster
      3.  Cover the graph with core vertices
          Approximately solve a min set-cover problem
         Fast enough

      4.  Combine clusters up to MaxVol
          The swapping probability is sub-modular
             Fast enough

      




                                                                              16
                                 David Gleich · Purdue
    WSDM2012
Demo!




                                              17
         David Gleich · Purdue
   WSDM2012
Solving "
linear "
systems
 Like PageRank, Katz, and
 semi-supervised learning




                                                                  18
                             David Gleich · Purdue
   WSDM2012
All nodes solve locally using "
the coordinate descent method.




                                     19
David Gleich · Purdue
   WSDM2012
All nodes solve locally using "
the coordinate descent method.




A core vertex for the
gray cluster.




                                      20
 David Gleich · Purdue
   WSDM2012
All nodes solve locally using "
    the coordinate descent method.




   Red sends residuals to white.
White send residuals to red.




                                          21
     David Gleich · Purdue
   WSDM2012
White then uses the coordinate
descent method to adjust its solution.
Will cause communication to red/blue.




                                          22
 David Gleich · Purdue
   WSDM2012
That algorithm is called "
restricted additive Schwarz.

  PageRank
 We look at
                 PageRank!
  Katz scores
  semi-supervised learning
  any spd or M-matrix "
     linear system




                                                   23
              David Gleich · Purdue
   WSDM2012
It works!
                           2
         communication

                                            Swapping Probability (usroads)
                                            PageRank Communication (usroads)
                                            Swapping Probability (web−Google)
                          1.5
                                            PageRank Communication (web−Google)
Relative Relative Work




                           1                                                 Metis Partitioner
                                                                        Partitioning baseline

                          0.5


                           0
                            1   1.1   1.2    1.3     1.4         1.5         1.6           1.7
                                             Volume Ratio
                                      How much more of the
                                      graph we need to store.




                                                                                                 24
                                                    David Gleich · Purdue
     WSDM2012
Edges are counted twice and some graphs have self-
    loops. The first group are geometric networks and
    the second are information networks.
                              Graph
                             Graph     Vertices
                                       |V |                     Edges
                                                                |E|               MaxDeg
                                                                                  max deg                   Density
                                                                                                            |E|/|V |
                              onera    85567                    419201            5                         4.9
                            usroads    126146                   323900            7                         2.6
                            annulus    500000                   2999258           19                        6.0

            email-Enron                33696                    361622            1383                      10.7
           soc-Slashdot                77360                    1015667           2540                      13.1
                   dico                111982                   2750576           68191                     24.6
                   lcsh                144791                   394186            1025                      2.7
             web-Google                855802                   8582704           6332                      10.0
             as-skitter                1694616                  22188418          35455                     13.1
            cit-Patents                3764117                  33023481          793                       8.8

                   1                                       1                                       1

                  0.8                                     0.8                                     0.8
    Conductance




                                                                                    Conductance
-
                                            Conductance




                  0.6                                     0.6                                     0.6

                  0.4                                     0.4                                     0.4




                                                                                                                           25
                  0.2                                     0.2                                     0.2

                   0
                                                                         David Gleich · Purdue
                                                                                        0
                                                                                                             WSDM2012
                        0               5                  0                                            0              5
he communication ratio of our best result for the PageRan
ommunication volume compared to METIS or GRACLUS show
 at the method works for 6 of them (perf. ratio < 1). The
ommunication result is not a bug.
  Graph            Comm. of         Comm. of        Perf. Ratio      Vol. Ratio
                     Partition       Overlap
  onera                18654               48            0.003                2.82
  usroads               3256                0            0.000                1.49
  annulus              12074                2            0.000                0.01
  email-Enron       194536*           235316             1.210                 1.7
  soc-Slashdot      875435*         1.3 ⇥ 106            1.480                1.78
  dico            1.5 ⇥ 106 *       2.0 ⇥ 106            1.320                1.53
  lcsh                73000*           48777             0.668                2.17
  web-Google        201159*           167609             0.833                1.57
  as-skitter       2.4 ⇥ 106        3.9 ⇥ 106            1.645                1.93
  cit-Patents      8.7 ⇥ 106        7.3 ⇥ 106            0.845                1.34

             * means Graculus
nally, we evaluate our heuristic.
                         gave a better
                 partition than Metis
       At left, the cluster combine procedure reduces 106 clusters to




                                                                                26
       around 102 . Middle, combining clusters can decrease the volume
                                           David Gleich · Purdue
 WSDM2012
Summary
                         Future work
!                                
Overlap helps reduce             Truly distributed implementation and
communication in a distributed   evaluation
process!                         
!                                Can we exploit data redundancy to
Proof of concept procedure to    solve problems on large graphs faster?
find overlapping partitions to    
reduce communication 
                     Copy 1
           Copy 2
                                       src -> dst
       src -> dst
                                       src -> dst
       src -> dst
                                       src -> dst
       src -> dst

All code available
http://www.cs.purdue.edu/~dgleich/codes/
  overlapping




                                                                           27

                                    David Gleich · Purdue
   WSDM2012

Contenu connexe

Tendances

Introduction to Machine Learning Classifiers
Introduction to Machine Learning ClassifiersIntroduction to Machine Learning Classifiers
Introduction to Machine Learning ClassifiersFunctional Imperative
 
Logistic Regression in Python | Logistic Regression Example | Machine Learnin...
Logistic Regression in Python | Logistic Regression Example | Machine Learnin...Logistic Regression in Python | Logistic Regression Example | Machine Learnin...
Logistic Regression in Python | Logistic Regression Example | Machine Learnin...Edureka!
 
Classification using back propagation algorithm
Classification using back propagation algorithmClassification using back propagation algorithm
Classification using back propagation algorithmKIRAN R
 
K-Means Clustering Algorithm - Cluster Analysis | Machine Learning Algorithm ...
K-Means Clustering Algorithm - Cluster Analysis | Machine Learning Algorithm ...K-Means Clustering Algorithm - Cluster Analysis | Machine Learning Algorithm ...
K-Means Clustering Algorithm - Cluster Analysis | Machine Learning Algorithm ...Edureka!
 
Introduction to Grad-CAM (complete version)
Introduction to Grad-CAM (complete version)Introduction to Grad-CAM (complete version)
Introduction to Grad-CAM (complete version)Hsing-chuan Hsieh
 
2.5 backpropagation
2.5 backpropagation2.5 backpropagation
2.5 backpropagationKrish_ver2
 
Optimization/Gradient Descent
Optimization/Gradient DescentOptimization/Gradient Descent
Optimization/Gradient Descentkandelin
 
Support vector machine
Support vector machineSupport vector machine
Support vector machineMusa Hawamdah
 
Machine Learning and Real-World Applications
Machine Learning and Real-World ApplicationsMachine Learning and Real-World Applications
Machine Learning and Real-World ApplicationsMachinePulse
 
Kohonen self organizing maps
Kohonen self organizing mapsKohonen self organizing maps
Kohonen self organizing mapsraphaelkiminya
 
Gradient Boosted trees
Gradient Boosted treesGradient Boosted trees
Gradient Boosted treesNihar Ranjan
 

Tendances (20)

Introduction to Machine Learning Classifiers
Introduction to Machine Learning ClassifiersIntroduction to Machine Learning Classifiers
Introduction to Machine Learning Classifiers
 
ID3 ALGORITHM
ID3 ALGORITHMID3 ALGORITHM
ID3 ALGORITHM
 
3. mining frequent patterns
3. mining frequent patterns3. mining frequent patterns
3. mining frequent patterns
 
Logistic Regression in Python | Logistic Regression Example | Machine Learnin...
Logistic Regression in Python | Logistic Regression Example | Machine Learnin...Logistic Regression in Python | Logistic Regression Example | Machine Learnin...
Logistic Regression in Python | Logistic Regression Example | Machine Learnin...
 
Classification using back propagation algorithm
Classification using back propagation algorithmClassification using back propagation algorithm
Classification using back propagation algorithm
 
K-Means Clustering Algorithm - Cluster Analysis | Machine Learning Algorithm ...
K-Means Clustering Algorithm - Cluster Analysis | Machine Learning Algorithm ...K-Means Clustering Algorithm - Cluster Analysis | Machine Learning Algorithm ...
K-Means Clustering Algorithm - Cluster Analysis | Machine Learning Algorithm ...
 
Introduction to Grad-CAM (complete version)
Introduction to Grad-CAM (complete version)Introduction to Grad-CAM (complete version)
Introduction to Grad-CAM (complete version)
 
Naive Bayes
Naive BayesNaive Bayes
Naive Bayes
 
Support vector machine
Support vector machineSupport vector machine
Support vector machine
 
2.5 backpropagation
2.5 backpropagation2.5 backpropagation
2.5 backpropagation
 
Apriori Algorithm
Apriori AlgorithmApriori Algorithm
Apriori Algorithm
 
Optimization/Gradient Descent
Optimization/Gradient DescentOptimization/Gradient Descent
Optimization/Gradient Descent
 
Support Vector machine
Support Vector machineSupport Vector machine
Support Vector machine
 
Decision trees
Decision treesDecision trees
Decision trees
 
Support vector machine
Support vector machineSupport vector machine
Support vector machine
 
What is Data Analysis and Machine Learning?
What is Data Analysis and Machine Learning?What is Data Analysis and Machine Learning?
What is Data Analysis and Machine Learning?
 
Decision tree
Decision treeDecision tree
Decision tree
 
Machine Learning and Real-World Applications
Machine Learning and Real-World ApplicationsMachine Learning and Real-World Applications
Machine Learning and Real-World Applications
 
Kohonen self organizing maps
Kohonen self organizing mapsKohonen self organizing maps
Kohonen self organizing maps
 
Gradient Boosted trees
Gradient Boosted treesGradient Boosted trees
Gradient Boosted trees
 

En vedette

Non-exhaustive, Overlapping K-means
Non-exhaustive, Overlapping K-meansNon-exhaustive, Overlapping K-means
Non-exhaustive, Overlapping K-meansDavid Gleich
 
Graph libraries in Matlab: MatlabBGL and gaimc
Graph libraries in Matlab: MatlabBGL and gaimcGraph libraries in Matlab: MatlabBGL and gaimc
Graph libraries in Matlab: MatlabBGL and gaimcDavid Gleich
 
PageRank Centrality of dynamic graph structures
PageRank Centrality of dynamic graph structuresPageRank Centrality of dynamic graph structures
PageRank Centrality of dynamic graph structuresDavid Gleich
 
Localized methods in graph mining
Localized methods in graph miningLocalized methods in graph mining
Localized methods in graph miningDavid Gleich
 
Fast matrix primitives for ranking, link-prediction and more
Fast matrix primitives for ranking, link-prediction and moreFast matrix primitives for ranking, link-prediction and more
Fast matrix primitives for ranking, link-prediction and moreDavid Gleich
 
Spacey random walks and higher order Markov chains
Spacey random walks and higher order Markov chainsSpacey random walks and higher order Markov chains
Spacey random walks and higher order Markov chainsDavid Gleich
 
Using Local Spectral Methods to Robustify Graph-Based Learning
Using Local Spectral Methods to Robustify Graph-Based LearningUsing Local Spectral Methods to Robustify Graph-Based Learning
Using Local Spectral Methods to Robustify Graph-Based LearningDavid Gleich
 
Iterative methods with special structures
Iterative methods with special structuresIterative methods with special structures
Iterative methods with special structuresDavid Gleich
 
Anti-differentiating Approximation Algorithms: PageRank and MinCut
Anti-differentiating Approximation Algorithms: PageRank and MinCutAnti-differentiating Approximation Algorithms: PageRank and MinCut
Anti-differentiating Approximation Algorithms: PageRank and MinCutDavid Gleich
 
Gaps between the theory and practice of large-scale matrix-based network comp...
Gaps between the theory and practice of large-scale matrix-based network comp...Gaps between the theory and practice of large-scale matrix-based network comp...
Gaps between the theory and practice of large-scale matrix-based network comp...David Gleich
 
The power and Arnoldi methods in an algebra of circulants
The power and Arnoldi methods in an algebra of circulantsThe power and Arnoldi methods in an algebra of circulants
The power and Arnoldi methods in an algebra of circulantsDavid Gleich
 
Anti-differentiating approximation algorithms: A case study with min-cuts, sp...
Anti-differentiating approximation algorithms: A case study with min-cuts, sp...Anti-differentiating approximation algorithms: A case study with min-cuts, sp...
Anti-differentiating approximation algorithms: A case study with min-cuts, sp...David Gleich
 
Iterative methods for network alignment
Iterative methods for network alignmentIterative methods for network alignment
Iterative methods for network alignmentDavid Gleich
 
MapReduce Tall-and-skinny QR and applications
MapReduce Tall-and-skinny QR and applicationsMapReduce Tall-and-skinny QR and applications
MapReduce Tall-and-skinny QR and applicationsDavid Gleich
 
What you can do with a tall-and-skinny QR factorization in Hadoop: Principal ...
What you can do with a tall-and-skinny QR factorization in Hadoop: Principal ...What you can do with a tall-and-skinny QR factorization in Hadoop: Principal ...
What you can do with a tall-and-skinny QR factorization in Hadoop: Principal ...David Gleich
 
Tall and Skinny QRs in MapReduce
Tall and Skinny QRs in MapReduceTall and Skinny QRs in MapReduce
Tall and Skinny QRs in MapReduceDavid Gleich
 
Direct tall-and-skinny QR factorizations in MapReduce architectures
Direct tall-and-skinny QR factorizations in MapReduce architecturesDirect tall-and-skinny QR factorizations in MapReduce architectures
Direct tall-and-skinny QR factorizations in MapReduce architecturesDavid Gleich
 
A multithreaded method for network alignment
A multithreaded method for network alignmentA multithreaded method for network alignment
A multithreaded method for network alignmentDavid Gleich
 
A history of PageRank from the numerical computing perspective
A history of PageRank from the numerical computing perspectiveA history of PageRank from the numerical computing perspective
A history of PageRank from the numerical computing perspectiveDavid Gleich
 
How does Google Google: A journey into the wondrous mathematics behind your f...
How does Google Google: A journey into the wondrous mathematics behind your f...How does Google Google: A journey into the wondrous mathematics behind your f...
How does Google Google: A journey into the wondrous mathematics behind your f...David Gleich
 

En vedette (20)

Non-exhaustive, Overlapping K-means
Non-exhaustive, Overlapping K-meansNon-exhaustive, Overlapping K-means
Non-exhaustive, Overlapping K-means
 
Graph libraries in Matlab: MatlabBGL and gaimc
Graph libraries in Matlab: MatlabBGL and gaimcGraph libraries in Matlab: MatlabBGL and gaimc
Graph libraries in Matlab: MatlabBGL and gaimc
 
PageRank Centrality of dynamic graph structures
PageRank Centrality of dynamic graph structuresPageRank Centrality of dynamic graph structures
PageRank Centrality of dynamic graph structures
 
Localized methods in graph mining
Localized methods in graph miningLocalized methods in graph mining
Localized methods in graph mining
 
Fast matrix primitives for ranking, link-prediction and more
Fast matrix primitives for ranking, link-prediction and moreFast matrix primitives for ranking, link-prediction and more
Fast matrix primitives for ranking, link-prediction and more
 
Spacey random walks and higher order Markov chains
Spacey random walks and higher order Markov chainsSpacey random walks and higher order Markov chains
Spacey random walks and higher order Markov chains
 
Using Local Spectral Methods to Robustify Graph-Based Learning
Using Local Spectral Methods to Robustify Graph-Based LearningUsing Local Spectral Methods to Robustify Graph-Based Learning
Using Local Spectral Methods to Robustify Graph-Based Learning
 
Iterative methods with special structures
Iterative methods with special structuresIterative methods with special structures
Iterative methods with special structures
 
Anti-differentiating Approximation Algorithms: PageRank and MinCut
Anti-differentiating Approximation Algorithms: PageRank and MinCutAnti-differentiating Approximation Algorithms: PageRank and MinCut
Anti-differentiating Approximation Algorithms: PageRank and MinCut
 
Gaps between the theory and practice of large-scale matrix-based network comp...
Gaps between the theory and practice of large-scale matrix-based network comp...Gaps between the theory and practice of large-scale matrix-based network comp...
Gaps between the theory and practice of large-scale matrix-based network comp...
 
The power and Arnoldi methods in an algebra of circulants
The power and Arnoldi methods in an algebra of circulantsThe power and Arnoldi methods in an algebra of circulants
The power and Arnoldi methods in an algebra of circulants
 
Anti-differentiating approximation algorithms: A case study with min-cuts, sp...
Anti-differentiating approximation algorithms: A case study with min-cuts, sp...Anti-differentiating approximation algorithms: A case study with min-cuts, sp...
Anti-differentiating approximation algorithms: A case study with min-cuts, sp...
 
Iterative methods for network alignment
Iterative methods for network alignmentIterative methods for network alignment
Iterative methods for network alignment
 
MapReduce Tall-and-skinny QR and applications
MapReduce Tall-and-skinny QR and applicationsMapReduce Tall-and-skinny QR and applications
MapReduce Tall-and-skinny QR and applications
 
What you can do with a tall-and-skinny QR factorization in Hadoop: Principal ...
What you can do with a tall-and-skinny QR factorization in Hadoop: Principal ...What you can do with a tall-and-skinny QR factorization in Hadoop: Principal ...
What you can do with a tall-and-skinny QR factorization in Hadoop: Principal ...
 
Tall and Skinny QRs in MapReduce
Tall and Skinny QRs in MapReduceTall and Skinny QRs in MapReduce
Tall and Skinny QRs in MapReduce
 
Direct tall-and-skinny QR factorizations in MapReduce architectures
Direct tall-and-skinny QR factorizations in MapReduce architecturesDirect tall-and-skinny QR factorizations in MapReduce architectures
Direct tall-and-skinny QR factorizations in MapReduce architectures
 
A multithreaded method for network alignment
A multithreaded method for network alignmentA multithreaded method for network alignment
A multithreaded method for network alignment
 
A history of PageRank from the numerical computing perspective
A history of PageRank from the numerical computing perspectiveA history of PageRank from the numerical computing perspective
A history of PageRank from the numerical computing perspective
 
How does Google Google: A journey into the wondrous mathematics behind your f...
How does Google Google: A journey into the wondrous mathematics behind your f...How does Google Google: A journey into the wondrous mathematics behind your f...
How does Google Google: A journey into the wondrous mathematics behind your f...
 

Similaire à Overlapping clusters for distributed computation

DIMACS10: Parallel Community Detection for Massive Graphs
DIMACS10: Parallel Community Detection for Massive GraphsDIMACS10: Parallel Community Detection for Massive Graphs
DIMACS10: Parallel Community Detection for Massive GraphsJason Riedy
 
Rank aggregation via nuclear norm minimization
Rank aggregation via nuclear norm minimizationRank aggregation via nuclear norm minimization
Rank aggregation via nuclear norm minimizationDavid Gleich
 
Simulation Informatics; Analyzing Large Scientific Datasets
Simulation Informatics; Analyzing Large Scientific DatasetsSimulation Informatics; Analyzing Large Scientific Datasets
Simulation Informatics; Analyzing Large Scientific DatasetsDavid Gleich
 
The spectre of the spectrum
The spectre of the spectrumThe spectre of the spectrum
The spectre of the spectrumDavid Gleich
 
Massive MapReduce Matrix Computations & Multicore Graph Algorithms
Massive MapReduce Matrix Computations & Multicore Graph AlgorithmsMassive MapReduce Matrix Computations & Multicore Graph Algorithms
Massive MapReduce Matrix Computations & Multicore Graph AlgorithmsDavid Gleich
 
MapReduce for scientific simulation analysis
MapReduce for scientific simulation analysisMapReduce for scientific simulation analysis
MapReduce for scientific simulation analysisDavid Gleich
 

Similaire à Overlapping clusters for distributed computation (6)

DIMACS10: Parallel Community Detection for Massive Graphs
DIMACS10: Parallel Community Detection for Massive GraphsDIMACS10: Parallel Community Detection for Massive Graphs
DIMACS10: Parallel Community Detection for Massive Graphs
 
Rank aggregation via nuclear norm minimization
Rank aggregation via nuclear norm minimizationRank aggregation via nuclear norm minimization
Rank aggregation via nuclear norm minimization
 
Simulation Informatics; Analyzing Large Scientific Datasets
Simulation Informatics; Analyzing Large Scientific DatasetsSimulation Informatics; Analyzing Large Scientific Datasets
Simulation Informatics; Analyzing Large Scientific Datasets
 
The spectre of the spectrum
The spectre of the spectrumThe spectre of the spectrum
The spectre of the spectrum
 
Massive MapReduce Matrix Computations & Multicore Graph Algorithms
Massive MapReduce Matrix Computations & Multicore Graph AlgorithmsMassive MapReduce Matrix Computations & Multicore Graph Algorithms
Massive MapReduce Matrix Computations & Multicore Graph Algorithms
 
MapReduce for scientific simulation analysis
MapReduce for scientific simulation analysisMapReduce for scientific simulation analysis
MapReduce for scientific simulation analysis
 

Plus de David Gleich

Engineering Data Science Objectives for Social Network Analysis
Engineering Data Science Objectives for Social Network AnalysisEngineering Data Science Objectives for Social Network Analysis
Engineering Data Science Objectives for Social Network AnalysisDavid Gleich
 
Correlation clustering and community detection in graphs and networks
Correlation clustering and community detection in graphs and networksCorrelation clustering and community detection in graphs and networks
Correlation clustering and community detection in graphs and networksDavid Gleich
 
Spectral clustering with motifs and higher-order structures
Spectral clustering with motifs and higher-order structuresSpectral clustering with motifs and higher-order structures
Spectral clustering with motifs and higher-order structuresDavid Gleich
 
Higher-order organization of complex networks
Higher-order organization of complex networksHigher-order organization of complex networks
Higher-order organization of complex networksDavid Gleich
 
Spacey random walks and higher-order data analysis
Spacey random walks and higher-order data analysisSpacey random walks and higher-order data analysis
Spacey random walks and higher-order data analysisDavid Gleich
 
Big data matrix factorizations and Overlapping community detection in graphs
Big data matrix factorizations and Overlapping community detection in graphsBig data matrix factorizations and Overlapping community detection in graphs
Big data matrix factorizations and Overlapping community detection in graphsDavid Gleich
 
Localized methods for diffusions in large graphs
Localized methods for diffusions in large graphsLocalized methods for diffusions in large graphs
Localized methods for diffusions in large graphsDavid Gleich
 
Fast relaxation methods for the matrix exponential
Fast relaxation methods for the matrix exponential Fast relaxation methods for the matrix exponential
Fast relaxation methods for the matrix exponential David Gleich
 
Recommendation and graph algorithms in Hadoop and SQL
Recommendation and graph algorithms in Hadoop and SQLRecommendation and graph algorithms in Hadoop and SQL
Recommendation and graph algorithms in Hadoop and SQLDavid Gleich
 
Relaxation methods for the matrix exponential on large networks
Relaxation methods for the matrix exponential on large networksRelaxation methods for the matrix exponential on large networks
Relaxation methods for the matrix exponential on large networksDavid Gleich
 
Personalized PageRank based community detection
Personalized PageRank based community detectionPersonalized PageRank based community detection
Personalized PageRank based community detectionDavid Gleich
 
Vertex neighborhoods, low conductance cuts, and good seeds for local communit...
Vertex neighborhoods, low conductance cuts, and good seeds for local communit...Vertex neighborhoods, low conductance cuts, and good seeds for local communit...
Vertex neighborhoods, low conductance cuts, and good seeds for local communit...David Gleich
 
A dynamical system for PageRank with time-dependent teleportation
A dynamical system for PageRank with time-dependent teleportationA dynamical system for PageRank with time-dependent teleportation
A dynamical system for PageRank with time-dependent teleportationDavid Gleich
 
Sparse matrix computations in MapReduce
Sparse matrix computations in MapReduceSparse matrix computations in MapReduce
Sparse matrix computations in MapReduceDavid Gleich
 
Matrix methods for Hadoop
Matrix methods for HadoopMatrix methods for Hadoop
Matrix methods for HadoopDavid Gleich
 

Plus de David Gleich (15)

Engineering Data Science Objectives for Social Network Analysis
Engineering Data Science Objectives for Social Network AnalysisEngineering Data Science Objectives for Social Network Analysis
Engineering Data Science Objectives for Social Network Analysis
 
Correlation clustering and community detection in graphs and networks
Correlation clustering and community detection in graphs and networksCorrelation clustering and community detection in graphs and networks
Correlation clustering and community detection in graphs and networks
 
Spectral clustering with motifs and higher-order structures
Spectral clustering with motifs and higher-order structuresSpectral clustering with motifs and higher-order structures
Spectral clustering with motifs and higher-order structures
 
Higher-order organization of complex networks
Higher-order organization of complex networksHigher-order organization of complex networks
Higher-order organization of complex networks
 
Spacey random walks and higher-order data analysis
Spacey random walks and higher-order data analysisSpacey random walks and higher-order data analysis
Spacey random walks and higher-order data analysis
 
Big data matrix factorizations and Overlapping community detection in graphs
Big data matrix factorizations and Overlapping community detection in graphsBig data matrix factorizations and Overlapping community detection in graphs
Big data matrix factorizations and Overlapping community detection in graphs
 
Localized methods for diffusions in large graphs
Localized methods for diffusions in large graphsLocalized methods for diffusions in large graphs
Localized methods for diffusions in large graphs
 
Fast relaxation methods for the matrix exponential
Fast relaxation methods for the matrix exponential Fast relaxation methods for the matrix exponential
Fast relaxation methods for the matrix exponential
 
Recommendation and graph algorithms in Hadoop and SQL
Recommendation and graph algorithms in Hadoop and SQLRecommendation and graph algorithms in Hadoop and SQL
Recommendation and graph algorithms in Hadoop and SQL
 
Relaxation methods for the matrix exponential on large networks
Relaxation methods for the matrix exponential on large networksRelaxation methods for the matrix exponential on large networks
Relaxation methods for the matrix exponential on large networks
 
Personalized PageRank based community detection
Personalized PageRank based community detectionPersonalized PageRank based community detection
Personalized PageRank based community detection
 
Vertex neighborhoods, low conductance cuts, and good seeds for local communit...
Vertex neighborhoods, low conductance cuts, and good seeds for local communit...Vertex neighborhoods, low conductance cuts, and good seeds for local communit...
Vertex neighborhoods, low conductance cuts, and good seeds for local communit...
 
A dynamical system for PageRank with time-dependent teleportation
A dynamical system for PageRank with time-dependent teleportationA dynamical system for PageRank with time-dependent teleportation
A dynamical system for PageRank with time-dependent teleportation
 
Sparse matrix computations in MapReduce
Sparse matrix computations in MapReduceSparse matrix computations in MapReduce
Sparse matrix computations in MapReduce
 
Matrix methods for Hadoop
Matrix methods for HadoopMatrix methods for Hadoop
Matrix methods for Hadoop
 

Dernier

GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 

Dernier (20)

GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 

Overlapping clusters for distributed computation

  • 1. Overlapping Clusters for Distributed Computation DAVID F. GLEICH " REID ANDERSEN " PURDUE UNIVERSITY MICROSOFT CORP. COMPUTER SCIENCE " VAHAB MIRROKNI" DEPARTMENT GOOGLE RESEARCH, NYC 1 David Gleich · Purdue WSDM2012
  • 2. Problem Find a good way to distribute a big graph for solving things like linear systems and simulating random walks Contributions Theoretical demonstration that overlap helps Proof of concept procedure to find overlapping partitions to reduce communication (~20%) All code available http://www.cs.purdue.edu/~dgleich/codes/ overlapping 2 David Gleich · Purdue WSDM2012
  • 3. The problem WHAT OUR NETWORKS WHAT OUR OTHER LOOK LIKE NETWORKS LOOK LIKE 3 David Gleich · Purdue WSDM2012
  • 4. The problem COMBINING NETWORKS AND GRAPHS IS A MESS 4 David Gleich · Purdue WSDM2012
  • 5. “Good” data distributions are a fundamental problem in distributed computation. ! How to divide the communication graph! Balance work Balance communication Balance data Balance programming complexity too 5 David Gleich · Purdue WSDM2012
  • 6. Current solutions Work Comm. Data Programming Disjoint vertex Okay to “Think like a Excellent Excellent partitions Good vertex” 2d or Edge Excellent Excellent Good “Impossible” Partitions Where we fit! Overlapping Good to “Think like a Okay “Let’s see” partitions Excellent cached vertex” 6 David Gleich · Purdue WSDM2012
  • 7. Goals Find a set of " overlapping clusters " where random walks stay in a cluster for a long time solving diffusion-like problems requires little communication (think PageRank, Katz, hitting times, semi-supervised learning) 7 David Gleich · Purdue WSDM2012
  • 8. Related work Domain decomposition, Schwarz methods How to solve a linear system with overlap. Szyld et al. Communication avoiding algorithms k-step matrix-vector products (Demmel et al.) and " growing overlap around partitions (Fritzsche, Frommer, Szyld) Overlapping communities and link partitioning algorithms for social network analysis Link communities (Ahn et al.); surveys by Fortunato and Satu P2P based PageRank algorithms Parreira, Castillo, Donato et al. 8 David Gleich · Purdue WSDM2012
  • 9. Overlapping clusters Each vertex in at least one cluster has one home cluster Formally, an overlapping cover is (C, ⌧ ) C={ , , } = set of clusters ⌧ : V 7! C = map to homes ⌧ is a partition! 9 David Gleich · Purdue WSDM2012
  • 10. Random walks in overlapping clusters Each vertex in at least one cluster has one home cluster red cluster " keeps the walk Random walks red cluster " go to the home sends the walk cluster after leaving to gray cluster 10 David Gleich · Purdue WSDM2012
  • 11. An evaluation metric" Swapping probability Is (C, ⌧ ) a good overlapping cover? Does a random walk swap clusters often? red cluster " keeps the walk ⇢ 1 = probability that a walk red cluster " sends the walk changes clusters on each to gray cluster step computable expression in the paper 11 David Gleich · Purdue WSDM2012
  • 12. Overlapping clusters Each vertex is in at least one cluster has one home cluster Vol(C) = sum of degrees of vertices in cluster C MaxVol = " upper bound on Vol(C) TotalVol(C) = " C sum of Vol(C) for all clusters VolRatio = TotalVol(C) / Vol(G)" C how much extra data! 12 David Gleich · Purdue WSDM2012
  • 13. Swapping probability & partitioning No overlap in this figure ! P is a partition ⇢1 (P) = 1 X Cut(P) Vol(G) P2P Much like a classical graph partitioning metric 13 David Gleich · Purdue WSDM2012
  • 14. Overlapping clusters vs. Partitioning in theory Take a cycle graph M groups of ℓ������ vertices MaxVol = 2ℓ������ partitioning for 1 1 ⇢ = (Optimal!) ` for overlapping 4 ⇢1 = ⌦(`2 ) 14 David Gleich · Purdue WSDM2012
  • 15. Heuristics for finding good " N P-hard for optimal overlapping clusters solution L Our multi-stage heuristic! 1.  Find a large set of good clusters Use personalized PageRank clusters 2.  Find “well contained” nodes (cores) Compute expected “leavetime” 3.  Cover the graph with core vertices Approximately solve a min set-cover problem 4.  Combine clusters up to MaxVol The swapping probability is sub-modular 15 David Gleich · Purdue WSDM2012
  • 16. Heuristics for finding good " N P-hard for optimal overlapping clusters solution L Our multi-stage heuristic! 1.  Find a large set of good clusters Each cluster takes Use personalized PageRank clusters, or metis “< MaxVol” work 2.  Find “well contained” nodes (cores) Takes O(Vol) Compute expected “leave time” work per cluster 3.  Cover the graph with core vertices Approximately solve a min set-cover problem Fast enough 4.  Combine clusters up to MaxVol The swapping probability is sub-modular Fast enough 16 David Gleich · Purdue WSDM2012
  • 17. Demo! 17 David Gleich · Purdue WSDM2012
  • 18. Solving " linear " systems Like PageRank, Katz, and semi-supervised learning 18 David Gleich · Purdue WSDM2012
  • 19. All nodes solve locally using " the coordinate descent method. 19 David Gleich · Purdue WSDM2012
  • 20. All nodes solve locally using " the coordinate descent method. A core vertex for the gray cluster. 20 David Gleich · Purdue WSDM2012
  • 21. All nodes solve locally using " the coordinate descent method. Red sends residuals to white. White send residuals to red. 21 David Gleich · Purdue WSDM2012
  • 22. White then uses the coordinate descent method to adjust its solution. Will cause communication to red/blue. 22 David Gleich · Purdue WSDM2012
  • 23. That algorithm is called " restricted additive Schwarz. PageRank We look at PageRank! Katz scores semi-supervised learning any spd or M-matrix " linear system 23 David Gleich · Purdue WSDM2012
  • 24. It works! 2 communication Swapping Probability (usroads) PageRank Communication (usroads) Swapping Probability (web−Google) 1.5 PageRank Communication (web−Google) Relative Relative Work 1 Metis Partitioner Partitioning baseline 0.5 0 1 1.1 1.2 1.3 1.4 1.5 1.6 1.7 Volume Ratio How much more of the graph we need to store. 24 David Gleich · Purdue WSDM2012
  • 25. Edges are counted twice and some graphs have self- loops. The first group are geometric networks and the second are information networks. Graph Graph Vertices |V | Edges |E| MaxDeg max deg Density |E|/|V | onera 85567 419201 5 4.9 usroads 126146 323900 7 2.6 annulus 500000 2999258 19 6.0 email-Enron 33696 361622 1383 10.7 soc-Slashdot 77360 1015667 2540 13.1 dico 111982 2750576 68191 24.6 lcsh 144791 394186 1025 2.7 web-Google 855802 8582704 6332 10.0 as-skitter 1694616 22188418 35455 13.1 cit-Patents 3764117 33023481 793 8.8 1 1 1 0.8 0.8 0.8 Conductance Conductance - Conductance 0.6 0.6 0.6 0.4 0.4 0.4 25 0.2 0.2 0.2 0 David Gleich · Purdue 0 WSDM2012 0 5 0 0 5
  • 26. he communication ratio of our best result for the PageRan ommunication volume compared to METIS or GRACLUS show at the method works for 6 of them (perf. ratio < 1). The ommunication result is not a bug. Graph Comm. of Comm. of Perf. Ratio Vol. Ratio Partition Overlap onera 18654 48 0.003 2.82 usroads 3256 0 0.000 1.49 annulus 12074 2 0.000 0.01 email-Enron 194536* 235316 1.210 1.7 soc-Slashdot 875435* 1.3 ⇥ 106 1.480 1.78 dico 1.5 ⇥ 106 * 2.0 ⇥ 106 1.320 1.53 lcsh 73000* 48777 0.668 2.17 web-Google 201159* 167609 0.833 1.57 as-skitter 2.4 ⇥ 106 3.9 ⇥ 106 1.645 1.93 cit-Patents 8.7 ⇥ 106 7.3 ⇥ 106 0.845 1.34 * means Graculus nally, we evaluate our heuristic. gave a better partition than Metis At left, the cluster combine procedure reduces 106 clusters to 26 around 102 . Middle, combining clusters can decrease the volume David Gleich · Purdue WSDM2012
  • 27. Summary Future work ! Overlap helps reduce Truly distributed implementation and communication in a distributed evaluation process! ! Can we exploit data redundancy to Proof of concept procedure to solve problems on large graphs faster? find overlapping partitions to reduce communication Copy 1 Copy 2 src -> dst src -> dst src -> dst src -> dst src -> dst src -> dst All code available http://www.cs.purdue.edu/~dgleich/codes/ overlapping 27 David Gleich · Purdue WSDM2012