6. “Visualization allows people to offload cognition to the
perceptual system, using carefully designed images as a
form of external memory.
The human visual system is a very high-bandwidth
channel to the brain, with a significant amount of
processing occurring in parallel and at the pre-conscious
level.”
- Tamara Munzner
30. Network
A set of people
and a set of connections between pairs of them
31. Types of connections
Social network analysis: only one type of connection between
individuals (e.g. "friend")
Link analysis: multiple types of connections
friend
brother
employer
went to university with
sold a car to
owns 51% of
Link analysis is much more relevant to journalism, because it
allows representation of much more detail and context.
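A minimal sketch of the difference, assuming a network stored as (source, relationship, target) triples. The names and relationships are hypothetical illustrations, not data from any story.

```python
from collections import defaultdict

# Hypothetical typed edges: (source, relationship, target).
edges = [
    ("Alice", "friend", "Bob"),
    ("Alice", "employer", "Carol"),
    ("Bob", "sold a car to", "Carol"),
    ("Carol", "owns 51% of", "Acme Ltd"),
]

def links_by_type(edges):
    """Link analysis view: group (source, target) pairs by relationship type."""
    grouped = defaultdict(list)
    for source, relation, target in edges:
        grouped[relation].append((source, target))
    return dict(grouped)

def collapse_to_social_network(edges):
    """Social network view: drop the types, keep one undirected tie per pair."""
    return {frozenset((source, target)) for source, _, target in edges}

typed = links_by_type(edges)
untyped = collapse_to_social_network(edges)
```

Collapsing loses exactly the detail a reporter usually cares about: "sold a car to" and "owns 51% of" become the same kind of tie.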
32. People Act in Groups
Family and friendships: I am most closely connected to a small set of people,
who are usually closely connected to each other.
Business: I am much more likely to do business with people I already know.
Influence: I listen to people I know more than I listen to strangers.
Norms: what is right depends on what the people around me think.
People tend to marry, do business with, spend time with, etc. people from similar
backgrounds... and people who have social ties tend to be similar.
33. Two major analysis methods
…after you have the network data, which may take a very manual
process to collect.
• Look at a visualization
• Apply algorithm
In both cases, the results are not interpretable without context.
34. A “sociogram” of a fraternity from Moreno’s Who Shall Survive? (1934). Arrows show one-way
“attraction” and lines with a crossbar show “mutual attraction.”
35. Force-Directed Layout
Each edge is a "spring" with a fixed preferred length.
Plus a global repulsive force that pushes all nodes apart.
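The two forces can be sketched in a few lines. This is an illustrative toy, not the algorithm of any particular layout tool: the function name, the inverse-square repulsion, and all parameter values are assumptions.

```python
import math
import random

def force_directed_layout(nodes, edges, preferred_length=1.0,
                          repulsion=0.1, step=0.05, iterations=500, seed=0):
    """Return {node: [x, y]} after simple spring/repulsion iterations."""
    rng = random.Random(seed)
    pos = {n: [rng.uniform(-1, 1), rng.uniform(-1, 1)] for n in nodes}
    for _ in range(iterations):
        force = {n: [0.0, 0.0] for n in nodes}
        # Global repulsion: every pair of nodes pushes apart.
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                dx, dy = pos[a][0] - pos[b][0], pos[a][1] - pos[b][1]
                dist = math.hypot(dx, dy) or 1e-9
                push = repulsion / dist ** 2
                force[a][0] += push * dx / dist
                force[a][1] += push * dy / dist
                force[b][0] -= push * dx / dist
                force[b][1] -= push * dy / dist
        # Springs: each edge pulls or pushes toward its preferred length.
        for a, b in edges:
            dx, dy = pos[b][0] - pos[a][0], pos[b][1] - pos[a][1]
            dist = math.hypot(dx, dy) or 1e-9
            pull = dist - preferred_length  # positive when stretched
            force[a][0] += pull * dx / dist
            force[a][1] += pull * dy / dist
            force[b][0] -= pull * dx / dist
            force[b][1] -= pull * dy / dist
        # Move every node a small step along its net force.
        for n in nodes:
            pos[n][0] += step * force[n][0]
            pos[n][1] += step * force[n][1]
    return pos

layout = force_directed_layout(["a", "b"], [("a", "b")])
```

With two connected nodes, the pair settles slightly farther apart than the preferred length, because the repulsion stretches the spring a little.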
36. The Effect of Graph Layout on Inference from Social Network Data,
Blythe et al.
37. The Effect of Graph Layout on Inference from Social Network Data,
Blythe et al.
We asked respondents three questions about the same five
focal nodes in each sociogram:
1) how many subgroups were in the sociogram
2) how “prominent” was each player in the sociogram
3) how important a “bridging” role did each player occupy in
the sociogram
38. Centrality
Often identified with "influence" or "power." Often important in journalism.
We can visualize the graph and use our eyes, or we can compute centrality
values algorithmically.
39. Degree centrality: number of edges
Models: cases where the number of connections is important.
Example: which celebrity can reach the most people at once?
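A minimal sketch, assuming an undirected network given as a list of pairs. The names are hypothetical.

```python
from collections import Counter

def degree_centrality(edges):
    """Count how many edges touch each node."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return dict(degree)

# Hypothetical example: one celebrity followed by three fans.
edges = [("celebrity", "fan1"), ("celebrity", "fan2"),
         ("celebrity", "fan3"), ("fan1", "fan2")]
degrees = degree_centrality(edges)
most_connected = max(degrees, key=degrees.get)
```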
40. Closeness centrality: inverse of the average distance to all other nodes
Models: cases where time taken to reach a node is important.
Example: who finds out about gossip first?
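A minimal sketch using breadth-first search, assuming an unweighted, connected network stored as an adjacency dictionary. The score here is the number of other nodes divided by the sum of distances to them, so a higher score means a shorter average distance.

```python
from collections import deque

def shortest_distances(adjacency, start):
    """Breadth-first search: hop count from start to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for w in adjacency[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

def closeness_centrality(adjacency):
    """(number of other reachable nodes) / (sum of distances to them)."""
    result = {}
    for node in adjacency:
        dist = shortest_distances(adjacency, node)
        total = sum(dist.values())
        result[node] = (len(dist) - 1) / total if total else 0.0
    return result

# Hypothetical three-person chain: a - b - c.
chain = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
closeness = closeness_centrality(chain)
```

In the chain, the middle person hears any gossip within one hop, so they score highest.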
41. Betweenness centrality:
number of shortest paths that pass through node
Models: cases where control over transmission is important.
Example: who has the most power to make introductions?
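A sketch using Brandes' algorithm, assuming an unweighted, undirected network stored as an adjacency dictionary. Note one refinement over the one-line definition: when a pair of nodes has several equally short paths, each path counts as a fraction, so a node gets credit for the share of shortest paths it sits on.

```python
from collections import deque

def betweenness_centrality(adjacency):
    """Sum over node pairs of the fraction of shortest paths through each node."""
    score = {v: 0.0 for v in adjacency}
    for source in adjacency:
        order = []                            # nodes in order of discovery
        preds = {v: [] for v in adjacency}    # predecessors on shortest paths
        sigma = {v: 0 for v in adjacency}     # number of shortest paths
        dist = {v: -1 for v in adjacency}
        sigma[source] = 1
        dist[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating dependencies.
        delta = {v: 0.0 for v in adjacency}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Each undirected pair was counted from both ends.
    return {v: s / 2 for v, s in score.items()}

# Hypothetical chain: a - b - c - d.
chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
betweenness = betweenness_centrality(chain)
```

In the chain, b and c each sit on two of the shortest paths between other people, while the endpoints sit on none.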
42. Eigenvector centrality:
how likely you are to end up at a node on a random walk
(same idea as PageRank)
Models: cases where importance of neighbors is important.
Example: the private adviser to the president
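A sketch of the random-walk idea in its PageRank form, assuming an undirected network with no isolated nodes. The damping value of 0.85 and the iteration count are conventional choices, not taken from the slides.

```python
def pagerank(adjacency, damping=0.85, iterations=100):
    """Probability of being at each node on a random walk with restarts."""
    n = len(adjacency)
    rank = {v: 1.0 / n for v in adjacency}
    for _ in range(iterations):
        # With probability (1 - damping) the walker jumps to a random node.
        new_rank = {v: (1.0 - damping) / n for v in adjacency}
        # Otherwise the walker follows a random edge out of its current node.
        for v, neighbors in adjacency.items():
            share = damping * rank[v] / len(neighbors)
            for w in neighbors:
                new_rank[w] += share
        rank = new_rank
    return rank

# Hypothetical star: one hub connected to three others.
star = {"hub": ["a", "b", "c"], "a": ["hub"], "b": ["hub"], "c": ["hub"]}
ranks = pagerank(star)
```

A node scores highly when its neighbors score highly, which is why someone with a single tie to a very central person can still rank well.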
44. Finding Communities
No one definition of "community." Could mean a town, or a club, or an industry
network.
But for our purposes, a community is "a group of people with pre-existing
patterns of association."
In social network analysis, that translates into clusters in the graph.
50. Mathematical definitions of "cluster"
You've already seen several. If you can compute distance between any two
items, you can cluster.
But in social networks, not everyone is connected to everyone else...
52. Modularity
$n$ = number of vertices
$k_i$ = degree of vertex $i$
$A_{ij}$ = 1 if edge between $i,j$, 0 otherwise
$g_{ij}$ = 1 if $i,j$ in same group, 0 otherwise
There are $m = \frac{1}{2} \sum_i k_i$ total edges in the graph.
If they go between random vertices then the expected number of
edges between $i,j$ is $k_i k_j / 2m$
53. Modularity
$n$ = number of vertices
$k_i$ = degree of vertex $i$
$A_{ij}$ = 1 if edge between $i,j$, 0 otherwise
$g_{ij}$ = 1 if $i,j$ in same group, 0 otherwise
Modularity
$Q = \sum_{ij} \left( A_{ij} - k_i k_j / 2m \right) g_{ij}$
If Q>0 then there are "excess" edges inside the
groups (and fewer edges between them.)
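A minimal sketch that evaluates Q for a given grouping, following the sum as written here (over all ordered pairs i,j, including i = j). Many published definitions also divide Q by 2m; that changes the scale but not the sign. The example graph is hypothetical.

```python
def modularity(nodes, edges, group_of):
    """Q = sum over ordered pairs (i, j) in the same group of A_ij - k_i k_j / 2m."""
    degree = {v: 0 for v in nodes}
    adjacent = set()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
        adjacent.add((a, b))
        adjacent.add((b, a))
    two_m = sum(degree.values())  # sum of degrees equals 2m
    q = 0.0
    for i in nodes:
        for j in nodes:
            if group_of[i] != group_of[j]:
                continue  # g_ij = 0
            a_ij = 1.0 if (i, j) in adjacent else 0.0
            q += a_ij - degree[i] * degree[j] / two_m
    return q

# Hypothetical network: two triangles joined by a single edge (2-3).
nodes = [0, 1, 2, 3, 4, 5]
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
by_triangle = {0: "x", 1: "x", 2: "x", 3: "y", 4: "y", 5: "y"}
all_together = {v: "x" for v in nodes}

q_split = modularity(nodes, edges, by_triangle)
q_single = modularity(nodes, edges, all_together)
```

Grouping by triangle gives a positive Q, while putting every node in one group gives zero, since the expected and actual edge counts then cancel.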
54. Modularity algorithm
• Look for a division of nodes into two groups that maximizes Q
• Can find this through eigenvector technique
• Possible that no division has Q>0, in which case the graph is a
single community
• If a division with Q>0 found, split
• Recursively split sub-graphs
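A rough sketch of a single split using the eigenvector technique, assuming an undirected network with at least one edge. It finds the leading eigenvector of the matrix with entries A_ij - k_i k_j / 2m by shifted power iteration and divides nodes by the sign of their entry. The recursive step is left out, since splitting sub-graphs needs an adjusted matrix. Function and variable names are illustrative.

```python
def modularity_bisection(nodes, edges, iterations=200):
    """Split nodes into two groups by sign of the leading eigenvector, if Q > 0."""
    index = {v: i for i, v in enumerate(nodes)}
    n = len(nodes)
    A = [[0.0] * n for _ in range(n)]
    for a, b in edges:
        A[index[a]][index[b]] = 1.0
        A[index[b]][index[a]] = 1.0
    k = [sum(row) for row in A]
    two_m = sum(k)
    B = [[A[i][j] - k[i] * k[j] / two_m for j in range(n)] for i in range(n)]
    # Shift so all eigenvalues are non-negative; power iteration then
    # converges to the eigenvector with the most positive eigenvalue of B.
    shift = max(sum(abs(x) for x in row) for row in B)
    v = [float(i + 1) for i in range(n)]  # arbitrary non-uniform start
    for _ in range(iterations):
        w = [sum(B[i][j] * v[j] for j in range(n)) + shift * v[i]
             for i in range(n)]
        scale = max(abs(x) for x in w)
        v = [x / scale for x in w]
    s = [1 if x >= 0 else -1 for x in v]
    # For a two-way split, Q = (1/2) * s^T B s because the entries of B sum to 0.
    q = 0.5 * sum(B[i][j] * s[i] * s[j] for i in range(n) for j in range(n))
    left = [nodes[i] for i in range(n) if s[i] > 0]
    right = [nodes[i] for i in range(n) if s[i] < 0]
    if not left or not right or q <= 1e-9:
        return [list(nodes)], 0.0  # no division with Q > 0: one community
    return [left, right], q

# Hypothetical network: two triangles joined by a single edge (2-3).
groups, q = modularity_bisection(
    [0, 1, 2, 3, 4, 5],
    [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)],
)
single, q_single = modularity_bisection([0, 1, 2], [(0, 1), (1, 2), (0, 2)])
```

On the two-triangle example the split separates the triangles; on a lone triangle no division has Q>0, so it stays a single community.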
57. Case Study: Seattle Art World
In Seattle Art World, Women Run the Show, Seattle Times
Network obtained from
dozens of in-person
interviews. Interactive
visualization in story.
58. Case Study: Hot Wheels
Hot Wheels, Tampa Bay Times
Network obtained from
juvenile arrest records
concerning stolen cars.
Unpublished visualization
and centrality measures
used to direct reporting to
most interesting people.
59. Coded 34 Stories for Sources and Uses
Story visualization: published story contains a visualization
Reporting visualization: used to guide reporters, unpublished.
Scraping: network extracted from source documents
Algorithm: centrality, community, etc. used
Graph DB: network loaded into graph database
61. Why not algorithms?
Heterogeneous networks. Multiple entity/relationship types. “Link
analysis” like criminal investigations.
Incomplete data. Building out the network is often an interactive
process of data gathering.
Contextual interpretation: What does it mean for someone to be
“central”? Depends on the nature of the network and story.
62. Correlation of different types of info
Suppose you have a record of phone numbers called, a database of political
campaign donations, and a list of government appointees. Put them together, and
you have this story:
WASHINGTON—Time and again, Texas Gov. Rick Perry picked up his office phone in the
months before he would announce his bid for the presidency. He dialed wealthy friends who
were his big fundraisers and state officials who owed him for their jobs.
Perry also met with a Texas executive who would later co-found an independent political
committee that has promised to raise millions to support Perry but is prohibited from
coordinating its activities with the governor.
- Jack Gillum, Perry called top donors from work phones, AP, 6 Dec 2011
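A minimal sketch of that kind of cross-referencing, assuming the three sources have already been cleaned so that names match exactly. All numbers, names, and amounts are hypothetical placeholders, and a phone directory mapping numbers to people is an added assumption.

```python
def cross_reference(calls, directory, donations, appointees):
    """Label each called person with the roles they hold in the other datasets."""
    matches = []
    for number in calls:
        person = directory.get(number)
        if person is None:
            continue  # number not identified
        roles = []
        if person in donations:
            roles.append("donor")
        if person in appointees:
            roles.append("appointee")
        if roles:
            matches.append((number, person, roles))
    return matches

# Hypothetical inputs.
calls = ["555-0101", "555-0102", "555-0103", "555-0199"]
directory = {"555-0101": "Person A", "555-0102": "Person B",
             "555-0103": "Person C"}
donations = {"Person A": 50000, "Person C": 2500}
appointees = {"Person B", "Person C"}

matches = cross_reference(calls, directory, donations, appointees)
```

In practice the hard part is the step assumed away here: deciding when two records in different datasets refer to the same person.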
64. Graph Databases in Theory
Load everything into the database, then analyze using a graph query
language and interactive visualization.
“Magic bullet” for large, complex, cross-border investigations.
68. Graph Databases in Practice
Incomplete data. Building a network often requires scraping from documents. Bulk data often
unavailable or impractical, and some records need to be purchased one at a time. Instead,
reporting involves interactive data enrichment.
Record linkage: With N databases, there could be N copies of each entity.
Graph queries are not that helpful. Cypher was available to Panama Papers (PP) investigators but no one
outside the core team learned it. Moreover, it’s not clear how often reporting problems can be
expressed as a graph query. Even “find path between” did not produce any (documented) leads
on PP.
Networks need to be narratives. The most useful networks are hand-built, for a particular line
of reporting.