Graph-structured data in network security, social networks, finance, and other applications are not only massive but also continually evolving. The changes are often scattered across the graph, permitting novel parallel and incremental analysis algorithms. We discuss analysis algorithms for streaming graph data that maintain both local and global metrics with low latency and high efficiency.
Scalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs
1. scalable and efficient algorithms for analysis of massive, streaming graphs
E. Jason Riedy and David A. Bader
MS76 Scalable Network Analysis: Tools, Algorithms, Applications
SIAM PP, 15 April 2016
HPC Lab, School of Computational Science and Engineering
Georgia Institute of Technology
3. (insert prefix here)-scale data analysis
Cyber-security: Identify anomalies, malicious actors
Health care: Finding outbreaks, population epidemiology
Social networks: Advertising, searching, grouping
Intelligence: Decisions at scale, regulating algorithms
Systems biology: Understanding interactions, drug design
Power grid: Disruptions, conservation
Simulation: Discrete events, cracking meshes
• Graphs are a motif / theme in data analysis.
• Changing and dynamic graphs are important!
4. outline
1. Motivation and background
2. Incremental PageRank
3. Seed set expansion
4. Community maintenance
5. STINGER: Framework for streaming graph analysis
5. why graphs?
Another tool, like dense and sparse linear algebra.
• Combine things with pairwise relationships
• Smaller, more generic than raw data.
• Taught (roughly) to all CS students...
• Semantic attributions can capture essential relationships.
• Traversals can be faster than filtering DB joins.
• Provide clear phrasing for queries about relationships.
6. potential applications
• Social Networks
  • Identify communities, influences, bridges, trends, anomalies (trends before they happen)...
  • Potential to help social sciences, city planning, and others with large-scale data.
• Cybersecurity
  • Determine in < 5ms whether a new connection can access a device or represents a new threat...
  • Is the transfer by a virus / persistent threat?
• Bioinformatics, health
  • Construct gene sequences, analyze protein interactions, map brain interactions
• Credit fraud forensics ⇒ detection ⇒ monitoring
  • Integrate all the customer’s data, identify in real-time
7. streaming graph data
Network data rates:
• Gigabit ethernet: 81k – 1.5M packets per second
• Over 130 000 flows per second on 10 GigE (< 7.7 µs)
Person-level data rates:
• 500M posts per day on Twitter (6k / sec) [1]
• 3M posts per minute on Facebook (50k / sec) [2]
We need to analyze only the changes and not the entire graph.
Throughput & latency trade off and expose different levels of concurrency.
[1] www.internetlivestats.com/twitter-statistics/
[2] www.jeffbullas.com/2015/04/17/21-awesome-facebook-facts-and-statistics-you-need-to-check-out/
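The per-second figures above follow from simple unit conversions. The short check below reproduces them; the Ethernet frame sizes (64 and 1518 byte frames plus 8 bytes of preamble and a 12 byte inter-frame gap) are an assumption about how the packet-rate range was derived, not something the slide states.

```python
# Unit conversions behind the quoted data rates.
SECONDS_PER_DAY = 24 * 60 * 60

twitter_per_sec = 500e6 / SECONDS_PER_DAY   # 500M posts/day, roughly 6k/sec
facebook_per_sec = 3e6 / 60                 # 3M posts/minute = 50k/sec
flow_gap_us = 1e6 / 130_000                 # microseconds available per flow

# Assumed framing for 1 Gb/s Ethernet (not given on the slide).
LINE_RATE_BPS = 1e9
MIN_WIRE_BYTES = 64 + 8 + 12     # smallest frame + preamble + inter-frame gap
MAX_WIRE_BYTES = 1518 + 8 + 12   # largest untagged frame + preamble + gap

max_pps = LINE_RATE_BPS / (8 * MIN_WIRE_BYTES)   # small packets: ~1.5M/sec
min_pps = LINE_RATE_BPS / (8 * MAX_WIRE_BYTES)   # large packets: ~81k/sec

print(f"Twitter:  {twitter_per_sec:,.0f} posts/sec")
print(f"Facebook: {facebook_per_sec:,.0f} posts/sec")
print(f"10 GigE:  {flow_gap_us:.2f} us per flow")
print(f"GigE:     {min_pps:,.0f} to {max_pps:,.0f} packets/sec")
```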
8. streaming graph analysis
Terminology:
• Streaming changes into a massive, evolving graph
• Not CS streaming algorithm (tiny memory)
• Need to handle deletions as well as insertions
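A minimal sketch of what "streaming changes" means here: a batch of edge insertions and deletions applied to an in-memory graph, returning the vertices an incremental analysis would need to revisit. The `apply_batch` name, the `("+", u, w)` / `("-", u, w)` action format, and the dict-of-sets graph are all illustrative assumptions, not STINGER's data structure or API.

```python
from collections import defaultdict

def apply_batch(adj, batch):
    """Apply edge insertions ("+") and deletions ("-") to an undirected graph.

    adj maps each vertex to the set of its neighbors.
    Returns the set of vertices touched by the batch.
    """
    touched = set()
    for op, u, w in batch:
        if op == "+":
            adj[u].add(w)
            adj[w].add(u)
        elif op == "-":
            adj[u].discard(w)   # deleting a missing edge is a no-op
            adj[w].discard(u)
        else:
            raise ValueError(f"unknown action {op!r}")
        touched.update((u, w))
    return touched

adj = defaultdict(set)
touched = apply_batch(adj, [("+", 0, 1), ("+", 1, 2), ("-", 0, 1)])
```

Only the `touched` vertices change their neighborhoods, which is what lets the later algorithms avoid re-reading the whole graph.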
Previous throughput results (not comprehensive review):
Data ingest: >2M up/sec [Ediger, McColl, Poovey, Campbell, & B 2014]
Clustering coefficients: >100K up/sec [R, Meyerhenke, Bader, Ediger, & Mattson 2012]
Connected comp.: >1M up/sec [McColl, Green, & B 2013]
Community clustering: >100K up/sec∗ [R & B 2013]
10. pagerank
Everyone’s “favorite” metric: PageRank.
• Stationary distribution of the random surfer model.
• Eigenvalue problem can be re-phrased as a linear system
$$(I - \alpha A^T D^{-1})\, x = k v,$$
with
α: teleportation constant, much < 1
A: adjacency matrix
D: diagonal matrix of out degrees, with x/0 = x (self-loop)
v: personalization vector, here 1/|V|
k: irrelevant scaling constant
• Amenable to analysis, etc.
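As an illustration of the linear-system view, the sketch below solves $(I - \alpha A^T D^{-1}) x = k v$ with a plain Jacobi-style fixed-point iteration on a dense matrix. The function name, the default `alpha=0.85`, the choice `k = 1 - alpha` (which makes `x` sum to one), and the dense NumPy representation are assumptions for the example; a real implementation would use sparse storage.

```python
import numpy as np

def pagerank_jacobi(A, alpha=0.85, tol=1e-12, max_iter=1000):
    """Iterate x <- alpha * A^T D^{-1} x + k v until the update is tiny."""
    A = A.astype(float)
    n = A.shape[0]
    # Vertices with zero out-degree get a self-loop, so D stays invertible.
    for i in np.flatnonzero(A.sum(axis=1) == 0):
        A[i, i] = 1.0
    M = A.T / A.sum(axis=1)      # A^T D^{-1}: column j scaled by 1/outdeg(j)
    v = np.full(n, 1.0 / n)      # uniform personalization vector
    k = 1.0 - alpha              # assumed scaling; any k only rescales x
    x = v.copy()
    for _ in range(max_iter):
        x_new = alpha * (M @ x) + k * v
        if np.abs(x_new - x).sum() < tol:
            return x_new
        x = x_new
    return x

A = np.array([[0, 1, 1],
              [1, 0, 0],
              [0, 1, 0]])
x = pagerank_jacobi(A)
```

Because $A^T D^{-1}$ is column-stochastic and α < 1, each sweep contracts the error by a factor of α in the 1-norm, so the iteration converges.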
11. incremental pagerank
• Streaming data setting, update PageRank without
touching the entire graph.
• Existing methods maintain databases of walks, etc.
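• Let $A_\Delta = A + \Delta A$, $D_\Delta = D + \Delta D$ for the new graph, want to solve for $x + \Delta x$.
• Simple algebra:
$$(I - \alpha A_\Delta^T D_\Delta^{-1})\, \Delta x = \alpha \left( A_\Delta^T D_\Delta^{-1} - A^T D^{-1} \right) x,$$
and the right-hand side is sparse.
• Re-arrange for Jacobi,
$$\Delta x^{(k+1)} = \alpha A_\Delta^T D_\Delta^{-1} \Delta x^{(k)} + \alpha \left( A_\Delta^T D_\Delta^{-1} - A^T D^{-1} \right) x,$$
iterate, ...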
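The Jacobi re-arrangement can be sketched as follows, assuming dense NumPy matrices and the hypothetical helper names `transition` and `delta_jacobi`. The right-hand side is nonzero only where the normalized matrix changed. This sketch starts from an exact old solution `x`; it does not model what happens when `x` is itself only approximate.

```python
import numpy as np

def transition(A):
    """Return A^T D^{-1}, giving zero-out-degree vertices a self-loop."""
    A = A.astype(float)
    for i in np.flatnonzero(A.sum(axis=1) == 0):
        A[i, i] = 1.0
    return A.T / A.sum(axis=1)

def delta_jacobi(A_old, A_new, x, alpha=0.85, tol=1e-14, max_iter=10000):
    """Iterate dx <- alpha * M_new dx + alpha * (M_new - M_old) x."""
    M_old, M_new = transition(A_old), transition(A_new)
    b = alpha * ((M_new - M_old) @ x)    # sparse when few edges change
    dx = np.zeros_like(x)
    for _ in range(max_iter):
        dx_new = alpha * (M_new @ dx) + b
        if np.abs(dx_new - dx).sum() < tol:
            return dx_new
        dx = dx_new
    return dx

# Small example: add one edge (3 -> 2) and update an exact old solution.
A_old = np.array([[0, 1, 1, 0],
                  [1, 0, 0, 0],
                  [0, 1, 0, 1],
                  [1, 0, 0, 0]])
A_new = A_old.copy()
A_new[3, 2] = 1
n, alpha = 4, 0.85
kv = (1 - alpha) * np.full(n, 1.0 / n)
x_old = np.linalg.solve(np.eye(n) - alpha * transition(A_old), kv)
x_updated = x_old + delta_jacobi(A_old, A_new, x_old, alpha)
```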
12. incremental pagerank: accumulating error
• And fail: errors accumulate, and the updated solution wanders away from the true solution. Top rankings stay the same...
13. incremental pagerank: think instead
• The old solution x is an ok, not exact, solution to the
original problem, now a nearby problem.
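• How close? Residual:
$$r' = k v - x + \alpha A_\Delta^T D_\Delta^{-1} x = r + \alpha \left( A_\Delta^T D_\Delta^{-1} - A^T D^{-1} \right) x.$$
• Solve $(I - \alpha A_\Delta^T D_\Delta^{-1})\, \Delta x = r'$.
• Cheat by not refining all of $r'$, only region growing around the changes:
$$(I - \alpha A_\Delta^T D_\Delta^{-1})\, \Delta x = r'|_\Delta$$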
• (Also cheat by updating r rather than recomputing at
the changes.)
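A minimal sketch of the residual-based correction, under stated assumptions: dense NumPy matrices, a fixed number of Jacobi sweeps, and hypothetical names `transition` and `residual_update`. The optional `region` argument is a simplified stand-in for the restriction $r'|_\Delta$: it keeps the residual only on a fixed vertex set rather than growing the region as the real method would, so it trades accuracy for touching less of the graph.

```python
import numpy as np

def transition(A):
    """Return A^T D^{-1}, giving zero-out-degree vertices a self-loop."""
    A = A.astype(float)
    for i in np.flatnonzero(A.sum(axis=1) == 0):
        A[i, i] = 1.0
    return A.T / A.sum(axis=1)

def residual_update(M_new, x, v, k, alpha, region=None, sweeps=200):
    """Correct x by solving (I - alpha M_new) dx = r' with Jacobi sweeps."""
    r = k * v - x + alpha * (M_new @ x)   # residual of x on the new graph
    if region is not None:
        # "Cheat": keep the residual only on the chosen vertices.
        mask = np.zeros(len(x), dtype=bool)
        mask[list(region)] = True
        r = np.where(mask, r, 0.0)
    dx = np.zeros_like(x)
    for _ in range(sweeps):
        dx = alpha * (M_new @ dx) + r
    return x + dx

A_new = np.array([[0, 1, 1, 0],
                  [1, 0, 0, 0],
                  [0, 1, 0, 1],
                  [1, 0, 1, 0]])
n, alpha = 4, 0.85
v, k = np.full(n, 1.0 / n), 1 - alpha
M_new = transition(A_new)
x_true = np.linalg.solve(np.eye(n) - alpha * M_new, k * v)
x_stale = x_true + np.array([0.01, -0.02, 0.005, 0.005])  # inexact old x
x_fixed = residual_update(M_new, x_stale, v, k, alpha)
x_partial = residual_update(M_new, x_stale, v, k, alpha, region=[1])
```

Because the right-hand side is the residual of the current `x` rather than a difference of matrices, the full solve pulls an inexact `x` back toward the true solution instead of carrying its error forward.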
17. graphs: big, nasty hairballs
Yifan Hu’s (AT&T) visualization of the in-2004 data set
http://www2.research.att.com/~yifanhu/gallery.html
18. but no shortage of structure...
Protein interactions, Giot et al., “A Protein
Interaction Map of Drosophila melanogaster”,
Science 302, 1722-1736, 2003.
Jason’s network via LinkedIn Labs
• Locally, there are clusters or communities.
• There are methods for global community detection.
• Also need local communities around seeds for queries and targeted analysis.
19. seed set expansion
• Seed set expansion finds the “best” subgraph or communities for a set of vertices of interest.
• Many quality criteria: modularity, conductance, etc.
• Can be applied to cryptocurrency to identify and track groups of interacting entities.
• A dynamic algorithm updates communities faster than recomputation, allowing us to keep up with new data as it is produced.
20. static seed set expansion
Greedy expansion starting from S = { seed vertices }:
1. Check the fitness of every vertex v neighboring S.
• fitness = f(S ∪ {v}) − f(S)
2. If any fitness is positive, include the most fit v in S.
• Currently sequential, could include all sufficiently
good neighbors.
3. Record at which step v is included.
• This list is the base for updates.
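The greedy steps above can be sketched as follows. The quality function here, `f(S) = 1 - cut(S) / vol(S)` (one minus a simplified conductance), is an assumption chosen for the example; the slides leave `f` open. The names `fitness` and `expand` and the dict-of-sets graph are likewise illustrative.

```python
def fitness(graph, S):
    """Assumed quality f(S) = 1 - cut(S) / vol(S); higher is better."""
    vol = sum(len(graph[u]) for u in S)
    if vol == 0:
        return 0.0
    cut = sum(1 for u in S for w in graph[u] if w not in S)
    return 1.0 - cut / vol

def expand(graph, seeds):
    """Greedy expansion; returns the community and (step, vertex) order."""
    S = set(seeds)
    order = [(0, s) for s in sorted(S)]   # seeds are included at step 0
    step = 0
    while True:
        base = fitness(graph, S)
        frontier = {w for u in S for w in graph[u]} - S
        best, best_gain = None, 0.0
        for v in sorted(frontier):        # step 1: score every neighbor
            gain = fitness(graph, S | {v}) - base
            if gain > best_gain:
                best, best_gain = v, gain
        if best is None:                  # no positive fitness: stop
            return S, order
        step += 1
        S.add(best)                       # step 2: include the most fit v
        order.append((step, best))        # step 3: record when v joined

# Two triangles joined by the bridge 2-3.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
         3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
community, order = expand(graph, [0])
```

On this graph the expansion from seed 0 stops at the first triangle, since adding vertex 3 would lower the quality. The recorded order is the list a dynamic version would revisit after updates.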
Now the dynamic version by example...
21. dynamic seed set example
In preparation with Anita Zakrzewska and Eisha Nathan
24. dynamic seed set quality
Graphs from the Koblenz Network Collection.
25. dynamic seed set speed-up
Graphs from the Koblenz Network Collection.
27. community detection
• Partition a graph’s vertices into disjoint communities.
• A community locally optimizes some metric, NP-hard.
• Trying to capture that vertices are more similar within one community than between communities.
Jason’s network via LinkedIn Labs
28. what about streaming?
• Simple approach based on agglomeration:
1. Extract all vertices touched by an update.
2. Re-start agglomeration.
• “Works” and is fast (MTAAP 2013), but never
mentioned quality.
• Extracted vertices form bridges and do not re-merge.
• Some methods based on label propagation, not
metric-driven.
• Backtracking (e.g. Görke, et al., JEA 2013) preserves
quality at cost of change size.
• Ongoing: Can we limit backtracking?
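A small sketch of the simple extract-and-restart approach, under several assumptions: modularity as the metric, one best pairwise merge per pass (sequential, unlike a parallel implementation), integer community labels, and the hypothetical names `agglomerate` and `apply_update`. It illustrates the idea only and is not the implementation referenced above.

```python
from collections import defaultdict

def agglomerate(edges, comm):
    """Merge the best pair of communities until no merge raises modularity."""
    m = len(edges)
    while True:
        vol = defaultdict(int)       # sum of degrees per community
        between = defaultdict(int)   # edge counts between community pairs
        for u, w in edges:
            cu, cw = comm[u], comm[w]
            vol[cu] += 1
            vol[cw] += 1
            if cu != cw:
                between[(min(cu, cw), max(cu, cw))] += 1
        best, best_gain = None, 0.0
        for (a, b), e_ab in sorted(between.items()):
            # Modularity change from merging communities a and b.
            gain = e_ab / m - vol[a] * vol[b] / (2.0 * m * m)
            if gain > best_gain:
                best, best_gain = (a, b), gain
        if best is None:
            return comm
        a, b = best
        for u in comm:
            if comm[u] == b:
                comm[u] = a

def apply_update(edges, comm, new_edge):
    """Insert an edge, extract its endpoints, and re-start agglomeration."""
    edges.append(new_edge)
    next_label = max(comm.values()) + 1
    for u in new_edge:
        comm[u] = next_label         # extracted vertex becomes a singleton
        next_label += 1
    return agglomerate(edges, comm)

# Two triangles joined by the bridge 2-3, then a new edge 1-4 arrives.
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]
comm = agglomerate(edges, {u: u for u in range(6)})
comm = apply_update(edges, comm, (1, 4))
```

In this toy case the extracted vertices rejoin their triangles; the sketch makes no claim about how often that happens on real streams.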
29. community quality: achievable
Data from Stanford SNAP archive: Facebook.
Stream generation: Reversing the graph.
In preparation with Pushkar Godbolé
30. community quality: change size
Data from Stanford SNAP archive: Facebook.
Stream generation: Reversing the graph.
In preparation with Pushkar Godbolé
32. future directions
• Of course, continuing to develop streaming /
dynamic / incremental algorithms.
• For massive graphs, computing small changes is
always a win.
• Improving approximations or replacing expensive
metrics like betweenness centrality would be great.
• Including more external and semantic data.
• If vertices are documents or data records, many
more measures of similarity.
• Only now being exploited in concert with static graph
algorithms.
33. hpc lab people
Faculty:
• David A. Bader
• Oded Green
Data:
• Pushkar Godbolé
• Anita Zakrzewska
• Eisha Nathan
STINGER:
• Robert McColl,
• James Fairbanks,
• Adam McLaughlin,
• Daniel Henderson,
• David Ediger (now GTRI),
• Jason Poovey (GTRI),
• Karl Jiang, and
• feedback from users in industry, government, academia
Support: DoD, DoE, NSF, Intel, IBM, Oracle
34. stinger: where do you get it?
Home: www.cc.gatech.edu/stinger/
Code: git.cc.gatech.edu/git/u/eriedy3/stinger.git/
Gateway to
• code,
• development,
• documentation,
• presentations...
Remember: Academic code, but maturing
with contributions.
Users / contributors / questioners:
Georgia Tech, PNNL, CMU, Berkeley, Intel,
Cray, NVIDIA, IBM, Federal Government,
Ionic Security, Citi, ...