Engineering Data Science Objectives for Social Network Analysis


1. Engineering Data Science Objective Functions for Social Network Analysis. David F. Gleich, Purdue University. With Nate Veldt (Purdue -> Cornell) and Tony Wirth (Melbourne). Paper: arXiv:1903.05246. Code: github.com/nveldt/LearnResParams
2. Somewhere too close and very recently… An application expert: "Hi, I see you work on clustering. I want to cluster my data… what algorithm should I use?"
3. The dreaded question for people who study clustering, community detection, etc.: "What algorithm should I use?"
4. Why is this such a hard question?
5. [Slide shows a page of: David J. Hand and Nicholas A. Heard, "Finding Groups in Gene Expression Data," Journal of Biomedicine and Biotechnology 2005:2 (2005), 215–225, DOI: 10.1155/JBB.2005.215. The highlighted passage: in the early 1980s, new ad hoc clustering algorithms were being developed so rapidly that it was suggested there should be a moratorium on the development of new algorithms until some understanding of the properties of the existing ones was gained.]
6. Why is this such a hard question? There are many reasons people want to cluster data: to help understand it, to bin items for some downstream process, … There are many methods and strategies to cluster data: linkage methods from statistics, partitioning methods, objective functions (k-means) and updating algorithms, … I can't psychically intuit what you need from your data!
7. I don't like studying clustering…
8. I don't like studying clustering… …so let's do exactly that.
9. Let's do some warm up. What are the clusters in this graph?
10. Let's do some warm up. What are the clusters in this graph?
11. Let's do some warm up. What are the clusters in this graph?
12. Let's do some warm up. What are the clusters in this graph?
13. Let's do some warm up. What are the clusters in this graph? Let's consult an expert!
14. Let's do some warm up. What are the clusters in this graph?
15. Let's do some warm up. What are the clusters in this graph?
16. Graph clustering seeks "communities" of nodes. Objective functions (modularity, densest subgraph, maximum clique, conductance, sparsest cut, etc.) all seek to balance high internal density against low external connectivity.
17. Two objectives at opposite ends of the spectrum. Sparsest cut: minimize cut(S)/|S| + cut(S)/|S̄|.
18. Two objectives at opposite ends of the spectrum. Sparsest cut: minimize cut(S)/|S| + cut(S)/|S̄|. Cluster deletion: minimize the number of edges removed to partition the graph into cliques.
19. We show sparsest cut and cluster deletion are two special cases of the same new clustering framework: LambdaCC = λ-correlation clustering. This framework also leads to: new connections to other objectives (including modularity!); new approximation algorithms (a 2-approximation for cluster deletion); several experiments/applications (social network analysis); and, as an aside, a fast method for LPs with metric constraints (used in the approximation algorithms).
20. And now you are thinking… …is this talk really going to propose another new method?!
21. I'm going to advocate for flexible clustering frameworks, which we can then engineer to "fit" example data.
22. Our framework is based on correlation clustering. Edges in a signed graph indicate similarity (+) or dissimilarity (−).
23. Edges can be weighted (e.g., w⁺ᵢⱼ on a positive edge, w⁻ⱼₖ on a negative edge), but the problems become harder.
24. Objective: minimize the weight of "mistakes": a positive edge whose endpoints are separated, or a negative edge whose endpoints end up in the same cluster.
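A minimal sketch of the mistake objective (the dict-based signed-graph representation and function name are my own, not the authors' code):

```python
# Sketch: total weight of "mistakes" for a clustering of a signed graph.
# The signed graph is two dicts mapping node pairs (i, j) with i < j to
# nonnegative weights: positives (similar) and negatives (dissimilar).

def cc_mistakes(positives, negatives, cluster):
    """cluster maps each node to a cluster id."""
    total = 0.0
    for (i, j), w in positives.items():
        if cluster[i] != cluster[j]:   # positive edge split apart: mistake
            total += w
    for (i, j), w in negatives.items():
        if cluster[i] == cluster[j]:   # negative edge kept together: mistake
            total += w
    return total

# Example: path i-j-k with positive edges, plus a negative edge i-k.
pos = {(0, 1): 1.0, (1, 2): 1.0}
neg = {(0, 2): 1.0}
print(cc_mistakes(pos, neg, {0: 0, 1: 0, 2: 0}))  # one negative mistake -> 1.0
```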
25. You can use correlation clustering to cluster unsigned graphs. Given G = (V, E), construct a signed graph G' = (V, E⁺, E⁻), an instance of correlation clustering: each edge of G becomes a positive edge and each non-edge becomes a negative edge. To model sparsest cut or cluster deletion, set a resolution parameter λ ∈ (0, 1); this is LambdaCC. Without weights, unweighted correlation clustering is the same as cluster editing.
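A sketch of that construction, with the (1 − λ) / λ weighting that the two-cluster accounting on the next slide uses (again my own code, not the LearnResParams implementation):

```python
import itertools

def lambda_cc_instance(n, edges, lam):
    """Signed LambdaCC instance from an unsigned graph on nodes 0..n-1:
    each edge becomes a positive pair with weight 1 - lam, each non-edge
    a negative pair with weight lam (lam is the resolution parameter)."""
    edge_set = {tuple(sorted(e)) for e in edges}
    positives, negatives = {}, {}
    for i, j in itertools.combinations(range(n), 2):
        if (i, j) in edge_set:
            positives[(i, j)] = 1.0 - lam
        else:
            negatives[(i, j)] = lam
    return positives, negatives
```

Together with `cc_mistakes` above, this scores any clustering of an unsigned graph under LambdaCC.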
26. Consider a restriction to two clusters, S and S̄. Positive mistakes: (1 − λ)·cut(S). Negative mistakes: λ|E⁻| − λ[|S||S̄| − cut(S)]. Total weight of mistakes = cut(S) − λ|S||S̄| + λ|E⁻|.
27. This is a scaled version of sparsest cut! Two-cluster LambdaCC can be written: minimize cut(S) − λ|S||S̄| + λ|E⁻| (the last term is constant). Note that cut(S) − λ|S||S̄| < 0 ⟺ cut(S)/(|S||S̄|) < λ, and cut(S)/|S| + cut(S)/|S̄| = |V|·cut(S)/(|S||S̄|).
28. We can write the objective in terms of cuts to get a relationship with sparsest cut. The general LambdaCC objective can be written: minimize (1/2)·Σᵢ₌₁ᵏ cut(Sᵢ) − (λ/2)·Σᵢ₌₁ᵏ |Sᵢ||S̄ᵢ| + λ|E⁻|. THEOREM. Minimizing this objective produces clusters with scaled sparsest cut at most λ (if they exist), and there exists some λ' such that minimizing LambdaCC will return the minimum sparsest cut partition.
29. For large λ, LambdaCC generalizes cluster deletion. Cluster deletion is correlation clustering with infinite penalties on negative edges; we show this is equivalent to LambdaCC for the right choice of λ, namely λ ≫ (1 − λ).
30. Degree-weighted LambdaCC is related to modularity. [Figure: a small graph with node degrees marked.] Positive edge weight: 1 − λdᵢdⱼ; negative (non-edge) weight: λdᵢdⱼ. LambdaCC is a linear function of modularity, though this does not preserve approximations…
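A sketch of why the objective is linear in modularity, under the degree-weighted construction above (my own bookkeeping; see the paper for the precise statement):

```latex
\text{mistakes}(\mathcal{C})
  = \sum_{(i,j)\in E} (1-\lambda d_i d_j)\,[c_i \neq c_j]
  + \sum_{(i,j)\notin E} \lambda d_i d_j\,[c_i = c_j]
  = \underbrace{\textstyle\sum_{(i,j)\in E} (1-\lambda d_i d_j)}_{\text{constant in } \mathcal{C}}
  - \sum_{i<j} \bigl(A_{ij} - \lambda d_i d_j\bigr)\,[c_i = c_j].
```

With λ = 1/(2m), the last sum is (up to an additive constant from the diagonal i = j terms) m times the modularity Q(C), so minimizing degree-weighted LambdaCC maximizes modularity. But because the map involves an additive constant, a multiplicative approximation for one objective does not transfer to the other.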
31. Many other objectives are special cases of LambdaCC (m = |E|). Standard weighting: correlation clustering / cluster editing (λ = 1/2), cluster deletion (λ = m/(m+1)), sparsest cut (λ = ρ*, the optimal scaled sparsest cut value). Degree weighting: modularity (λ = 1/(2m)) and normalized cut (at its threshold value λ*).
32. And now, an answer to one of the most frequently asked questions in clustering: "What method should I use?"
33. Changing your method (implicitly) changes the value of λ that you are using. [Figure: ratio to the LP bound as a function of λ for Graclus, Louvain, InfoMap, RMQC, and RMC; a sparse-cut regime at small λ and a dense-subgraph regime at large λ.] This figure shows that if you use one of these algorithms (Graclus, Louvain, InfoMap, recursive max quasi-clique, or recursive max clique), then you implicitly minimize λ-CC for some choice of λ. It turns the question "what method should I use?" into "what λ should I use?"
34. (Same figure, with a note.) We wrote an entire SIMODS paper explaining how we made this figure! The LP bound involves an LP with 12 billion constraints.
35. "How should I set λ for my new clustering application?" "Can you give me an example of what you want your clusters to look like?" "I want communities that look like this!" LambdaCC inspires an approach for learning the "right" objective function to use for new applications.
36. The goal is not to reproduce the example clusters. The goal is to find sets with similar properties: similar size and density tradeoffs.
37. Let's go back to the figure we just saw. [Figure: ratio to the LP bound vs. λ for Graclus, Louvain, InfoMap, RMQC, RMC.] Each clustering traces out a bowl-shaped curve. The minimum point on each curve tells us the λ regime where the clustering optimizes LambdaCC.
38. So the "example" clustering will also correspond to some type of curve. [Figure: a single bowl-shaped curve of ratio to the LP bound vs. λ.]
39–42. As will any other clustering. (Build slides: more bowl-shaped curves are added to the figure.)
43. Strategy: start with a fixed "good" example clustering and find the minimizer of its curve, to get a λ that is designed to produce similar clusterings! Challenge: we want to do this without computing the entire curve. This is a new optimization problem where we are optimizing over λ!
44. What function is tracing out these curves? The "parameter fitness function" P_C(λ) = F_C(λ) / G(λ). Here F_C(λ) is the LambdaCC score of a clustering C, a linear function of λ, and G(λ) is the LambdaCC LP bound for fixed λ: a parametric LP, concave and piecewise linear in λ (Adler & Monteiro, 1992).
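Written out for the unweighted instance (my notation, a sketch of the definitions on the slide):

```latex
P_{\mathcal{C}}(\lambda) \;=\; \frac{F_{\mathcal{C}}(\lambda)}{G(\lambda)},
\qquad
F_{\mathcal{C}}(\lambda) \;=\; (1-\lambda)\,\#\{\text{positive mistakes of } \mathcal{C}\}
  \;+\; \lambda\,\#\{\text{negative mistakes of } \mathcal{C}\}.
```

Since F_C is at least the integer optimum, which is at least the LP bound G, we always have P_C(λ) ≥ 1: exactly the "ratio to LP bound" on the vertical axes of these figures.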
45. We prove two useful properties about P. Since F_C is linear and G is concave and piecewise linear, P satisfies: (1) if λ− < λ < λ+, then P(λ) ≤ max{P(λ−), P(λ+)}; (2) if P(λ−) = P(λ+), then P achieves its minimum in [λ−, λ+]. Translation: (1) once P goes up, it can't go back down…
46. …and (2) there are no "flat" regions where we might get stuck.
47. This allows us to minimize P without seeing all of it. From the points evaluated so far and property (1), we know the minimizer can't be to the left of this point.
48. So this is possible.
49. But so is this.
50. Evaluate P at a new point: we've ruled out that possibility! Now we know the minimizer can't be to the right of this one.
51. And if two input λ have the same fitness score, the minimizer is between them (property 2), so it's not over here.
52. We developed a bisection-like approach for minimizing P by evaluating it at carefully selected points. One-branch scenario: the minimizer isn't in [m, r]. Two-branch scenario: evaluate a couple more points to rule out [m, r].
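Here is a simplified ternary-search sketch of the same idea (my own simplification, not the LearnResParams implementation; in practice each evaluation of P costs an LP solve, so the real method is more careful about which points it queries):

```python
def minimize_fitness(P, lo, hi, tol=1e-4):
    """Shrink [lo, hi] around the minimizer of P, using the slide's two
    properties: (1) P is quasiconvex-like, so once it rises it never
    falls again; (2) equal values bracket the minimizer."""
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if P(m1) < P(m2):
            hi = m2            # property 1: minimizer can't be right of m2
        elif P(m1) > P(m2):
            lo = m1            # property 1: minimizer can't be left of m1
        else:
            lo, hi = m1, m2    # property 2: minimizer lies in [m1, m2]
    return 0.5 * (lo + hi)

# Toy check on a bowl-shaped curve:
print(minimize_fitness(lambda x: (x - 0.3) ** 2 + 1.0, 0.0, 1.0))  # ~0.3
```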
53. A simple synthetic test case to demonstrate that having an example helps. Modularity (a special case of LambdaCC with λ = 1/(2m)) wasn't able to get the community structure right for the graph G. Let's fix that! 1. Generate a new random graph G' from the same distribution. 2. Using the ground truth of G', learn a resolution parameter λ'. 3. Cluster G using LambdaCC with λ = λ'. We've captured the community structure for a specific class of graphs and can detect the right answer!
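A toy end-to-end version of steps 1–3, reusing `cc_mistakes`, `lambda_cc_instance`, and `minimize_fitness` from the sketches above. At toy scale a brute-force exact optimum stands in for the LP bound G(λ); the graph and "ground truth" below are invented for illustration:

```python
import itertools

EDGES = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]  # two triangles + a bridge
TRUTH = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}  # the "example" clustering (step 2 input)
N = 6

def exact_opt(lam):
    # Brute-force minimum LambdaCC objective over all partitions;
    # stands in for the LP bound G(lambda) at this tiny scale.
    pos, neg = lambda_cc_instance(N, EDGES, lam)
    return min(cc_mistakes(pos, neg, dict(enumerate(a)))
               for a in itertools.product(range(N), repeat=N))

def P(lam):
    # Parameter fitness of the example clustering at this lambda.
    pos, neg = lambda_cc_instance(N, EDGES, lam)
    return cc_mistakes(pos, neg, TRUTH) / exact_opt(lam)

lam_learned = minimize_fitness(P, 0.01, 0.99, tol=0.01)
print(lam_learned)  # a lambda at which TRUTH-like clusterings score well
```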
54. We tested this on a regime of synthetic graphs that is hard for modularity, controlled by a "mixing parameter" µ; smaller µ makes the ground truth easier to detect. For each µ, we train on one graph and test on 5 others. One example when µ = 0.3: modularity often fails to separate ground-truth clusters.
55–56. We can use this to test whether a metadata attribute seems to be reflected in some characteristic graph structure. (Listen, don't read!) For the Caltech network, find the minimum value of λ for the clustering X induced by a metadata attribute, then look at the objective function P(λ, X) = F(λ, X)/G(λ) at the minimizer. Do this for the real attribute and a randomized attribute (just shuffle the labels); the latter gives a null score where there is no relationship with graph structure.

Attribute:   S/F   Gen   Maj.  Maj.2  Res.  Yr    HS
min P real:  1.30  1.73  2.03  2.12   1.35  1.57  2.11
min P fake:  1.65  1.80  2.12  2.12   2.11  2.09  2.12
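A sketch of this null comparison using the toy setup above (illustrative only; the real experiment uses the Facebook100 Caltech graph and its metadata columns):

```python
import random

def min_fitness(labels):
    # Minimum of the fitness curve for the clustering induced by `labels`.
    clustering = dict(enumerate(labels))
    def P_attr(lam):
        pos, neg = lambda_cc_instance(N, EDGES, lam)
        return cc_mistakes(pos, neg, clustering) / exact_opt(lam)
    lam = minimize_fitness(P_attr, 0.01, 0.99, tol=0.01)
    return P_attr(lam)

real = [0, 0, 0, 1, 1, 1]   # attribute that tracks the graph structure
fake = real[:]
random.shuffle(fake)        # shuffled labels give the null score
print(min_fitness(real), min_fitness(fake))  # real is usually closer to 1
```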
57. We can also investigate metadata sets in social networks. This led to a fun story! [Scatter plot: the x-axis is the objective ratio at a minimum, i.e., how close you get to the lower bound; the y-axis is how well you do at finding those same sets again.]
58. (Same plot, with points split into 2006–2008 and 2009 groups.)
59. A quick summary of other work from our research team on data-driven scientific computing. Our team's overall goal is to design algorithms and methods tuned to the evolving needs and nature of scientific data analysis. Low-rank methods for network alignment (Huda Nassar -> Stanford): principled methods that scale to aligning thousands of networks. Spectral properties and generation of realistic networks (Nicole Eikmeier -> Grinnell College): "power laws" in the top singular values of the adjacency matrix are more robust than degree "power laws"; fast sampling for hypergraph models with higher-order structure. Local analysis of network data (Meng Liu): applications in bioinformatics; software at https://github.com/kfoynt/LocalGraphClustering. [Figure: a Kronecker graph with a 2×2 initiator "⊗-powered" three times into an 8×8 probability matrix.]
60. Don't ask what algorithm, ask what kind of clusters! Papers: arXiv:1903.05246 (at WWW 2019) and arXiv:1806.01678 (at WWW 2018). Code: github.com/nveldt/LearnResParams. Software: github: nveldt/LamCC, nveldt/MetricOptimization. Issues: yeah, this is still slow :-( and it needs to be generalized beyond LambdaCC (ongoing work with Meng Liu at Purdue). See the paper and code! With Nate Veldt (Purdue), Tony Wirth (Melbourne), Cameron Ruggles (Purdue), and James Saunderson (Monash).
