HW09 Social network analysis with Hadoop
1. Social network analysis with Hadoop
Jake Hofman
Yahoo! Research
October 2, 2009
Jake Hofman (Yahoo! Research) Social network analysis with Hadoop October 2, 2009
2. Social networks
• Rapid increase in amount and variety of social network data
• Valuable information for products (recommendations, advertising,
etc.) and research (structure/dynamics, diffusion, etc.)
3. Social networks
Goal: to enable analysis of large-scale social network data with readily
available software/hardware
4. 1970s ∼ 10^1 nodes
• Few direct observations; highly detailed info on nodes and edges
• E.g. karate club (Zachary, 1977)
[Figure: page excerpts from Zachary (1977), Journal of Anthropological Research, including Figure 1, "Social Network Model of Relationships in the Karate Club": 34 individuals, with an edge drawn between two individuals who consistently interacted outside karate classes, workouts, and club meetings]
5. 1990s ∼ 10^4 nodes
• Larger, indirect samples; relatively few details on nodes and edges
• E.g. APS co-authorship network (http://bit.ly/aps08jmh)
6. Present ∼ 10^8+ nodes
• Very large, dynamic samples; many details in node and edge metadata
• E.g. Mail, Messenger, Facebook, Twitter, etc.
8. Scale
• Example numbers:
  • ∼ 10^7 nodes
  • ∼ 10^2 edges/node (degree)
  • no node/edge data
  • static
  • ∼8GB
[Diagram: edge list with one (User 1, User 2) pair per row]
Simple, static networks push memory limit for commodity machines
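The ∼8GB figure is consistent with a back-of-the-envelope estimate. The sketch below assumes each edge is stored as a pair of 4-byte integer user ids; that 8-byte edge size is an assumption for illustration and is not stated on the slide.

```python
# Back-of-the-envelope size of a static edge list with no node/edge data.
# Assumption (not from the slide): each edge is two 4-byte integer user ids.
NUM_NODES = 10**7        # ~10^7 nodes (example number from the slide)
EDGES_PER_NODE = 10**2   # ~10^2 edges per node (example number from the slide)
BYTES_PER_EDGE = 2 * 4   # assumed: (user 1, user 2) stored as 32-bit ids

num_edges = NUM_NODES * EDGES_PER_NODE
total_bytes = num_edges * BYTES_PER_EDGE
total_gb = total_bytes / 10**9

print(f"{num_edges} edges, about {total_gb:.0f} GB")
```

Any per-edge metadata or a wider id type would push the total higher under the same assumptions.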
10. Scale
• Example numbers:
  • ∼ 10^7 nodes
  • ∼ 10^2 edges/node (degree)
  • node/edge metadata
  • dynamic
  • ∼100GB/day
[Diagram: messages (header, content) on the edge between User 1 and User 2; each user node carries a profile and history]
Dynamic, data-rich social networks exceed memory limits; require considerable storage
11. Distributed network analysis
MapReduce convenient for parallelizing individual node/edge-level calculations
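As an illustration of a node-level calculation that parallelizes naturally, here is a minimal sketch of out-degree counting in map/reduce style. It runs locally, with an in-memory sort standing in for Hadoop's shuffle; the function names and the tab-separated "source, target" edge format are assumptions for this example, not the package's actual interface.

```python
from itertools import groupby
from operator import itemgetter


def map_edge(line):
    """Mapper: emit (source node, 1) for one tab-separated edge line."""
    src, _dst = line.strip().split("\t")
    yield src, 1


def reduce_counts(pairs):
    """Reducer: sum the counts per node; expects pairs sorted by node."""
    for node, group in groupby(pairs, key=itemgetter(0)):
        yield node, sum(count for _, count in group)


def out_degrees(lines):
    """Run map, sort, and reduce locally over an iterable of edge lines."""
    mapped = [pair for line in lines for pair in map_edge(line)]
    mapped.sort(key=itemgetter(0))  # stands in for Hadoop's shuffle/sort
    return dict(reduce_counts(mapped))


edges = ["a\tb", "a\tc", "b\tc", "c\ta"]
degrees = out_degrees(edges)
print(degrees)
```

Each edge is processed independently in the map step, which is why this kind of calculation splits cleanly across machines.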
12. Distributed network analysis
Higher-order calculations more difficult when network exceeds memory constraints, but can be adapted to MapReduce framework
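One common way to adapt a global calculation to this style is to repeat a simple map/reduce round until nothing changes. The sketch below finds connected components by minimum-label propagation on an undirected edge list. This technique is chosen here purely for illustration; the slides do not say which algorithm the package uses, and all names are hypothetical.

```python
from collections import defaultdict


def propagate_once(labels, edges):
    """One round: every node takes the smallest label among itself and its neighbors."""
    # Map step: send each endpoint's current label across the edge, both ways.
    candidates = defaultdict(list)
    for u, v in edges:
        candidates[u].append(labels[v])
        candidates[v].append(labels[u])
    # Reduce step: keep the minimum label seen for each node.
    return {node: min([label] + candidates[node]) for node, label in labels.items()}


def connected_components(nodes, edges):
    """Repeat propagation rounds until the labels stop changing."""
    labels = {node: node for node in nodes}
    while True:
        updated = propagate_once(labels, edges)
        if updated == labels:
            return labels
        labels = updated


nodes = [1, 2, 3, 4, 5]
edges = [(1, 2), (2, 3), (4, 5)]
components = connected_components(nodes, edges)
print(components)
```

Each round only needs a node's own label and those of its neighbors, so no single machine has to hold the whole network; the cost is that the number of rounds grows with the diameter of the largest component.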
14. Package details
• Network creation/manipulation
  • Logs → edges
  • Edge list ↔ adjacency list
  • Directed ↔ undirected
  • Edge thresholds
• First-order descriptive statistics
  • Number of nodes
  • Number of edges
  • Node degrees
• Higher-order node-level descriptive statistics
  • Clustering coefficient
  • Implicit degree
  • ...
• Global calculations
  • Pairwise connectivity
  • Connected components
  • Minimum spanning tree
  • Breadth-first search
  • PageRank
  • Community detection
Currently implemented in Hadoop Streaming with Python
Algorithms exist or have been developed for additional features
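To make two of the manipulation steps concrete, the following sketch converts between an edge list and an adjacency list and symmetrizes a directed edge list. It is an in-memory illustration only; the function names are hypothetical and are not taken from the package.

```python
from collections import defaultdict


def to_undirected(edges):
    """Return the edge set with both directions of every edge present."""
    return {(u, v) for u, v in edges} | {(v, u) for u, v in edges}


def edge_list_to_adjacency(edges):
    """Group target nodes by source node (nodes with no out-edges get no entry)."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
    return {node: sorted(neighbors) for node, neighbors in adjacency.items()}


def adjacency_to_edge_list(adjacency):
    """Expand each (node, neighbors) entry back into individual edges."""
    return sorted((u, v) for u, neighbors in adjacency.items() for v in neighbors)


directed = [("a", "b"), ("b", "c")]
adjacency = edge_list_to_adjacency(to_undirected(directed))
print(adjacency)
```

In a MapReduce setting the grouping by source node is what the shuffle provides, so the edge list to adjacency list step amounts to a reducer that collects the values for each key.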
16. Application: Twitter
• Distributed crawl of Twitter social network + public messages
(crawler by Eytan Bakshy, http://bit.ly/eytanb)
• ∼ 25 million nodes, ∼ 800 million edges
17. Twitter: Degree Distribution
[Figure: log-log plot of count (10^0 to 10^8) vs. degree (10^0 to 10^6), with series for out-degree (friends) and in-degree (followers)]
• Aggregates users by number of friends/followers seen in crawl
18. Twitter: Degree Distribution
[Figure: log-log plot of count (10^0 to 10^8) vs. degree (10^0 to 10^6), with series for out-degree (friends) and in-degree (followers)]
Many people not followed by anyone; few followed by many
Most people follow at least a few others
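The degree distribution is a two-stage aggregation: first count each user's friends and followers, then count how many users share each degree value. A minimal in-memory sketch follows, assuming each edge is a (follower, followed) pair; that edge orientation and the function names are assumptions for illustration.

```python
from collections import Counter


def degree_counts(edges):
    """Stage 1: per-node out-degree (friends) and in-degree (followers)."""
    out_degree = Counter(src for src, _ in edges)
    in_degree = Counter(dst for _, dst in edges)
    return out_degree, in_degree


def degree_distribution(degree_by_node, nodes):
    """Stage 2: number of nodes at each degree value, including degree 0."""
    return Counter(degree_by_node.get(node, 0) for node in nodes)


# Assumed orientation: (u, v) means u follows v.
edges = [("a", "c"), ("b", "c"), ("c", "a")]
nodes = ["a", "b", "c"]

out_degree, in_degree = degree_counts(edges)
out_dist = degree_distribution(out_degree, nodes)
in_dist = degree_distribution(in_degree, nodes)
print(dict(out_dist), dict(in_dist))
```

Both stages are simple key-based sums, so each maps onto one MapReduce pass: the first keyed by node, the second keyed by degree.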
19. Twitter: Node-level clustering coefficient
• Fraction of possible edges amongst a node’s friends/followers that are actually present (Watts & Strogatz, 1998)
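A small worked example may help. Watts & Strogatz define the coefficient for undirected networks, dividing by the k(k-1)/2 possible edges among a node's k neighbors. The sketch below instead assumes a directed variant that divides by the k(k-1) possible ordered pairs; the slides do not state which normalization was used for the Twitter data, so treat that choice, and the function name, as assumptions.

```python
def clustering_coefficient(node, neighbors, edges):
    """Fraction of possible directed edges among neighbors[node] that exist.

    neighbors: dict mapping a node to the set of its friends or followers.
    edges: set of directed (source, target) pairs.
    """
    group = neighbors[node] - {node}
    k = len(group)
    if k < 2:
        return 0.0  # convention chosen here: no pairs means no clustering
    present = sum(1 for u, v in edges if u != v and u in group and v in group)
    return present / (k * (k - 1))  # assumed directed normalization


# "x" has two followers, "a" and "b", and "a" also follows "b".
followers = {"x": {"a", "b"}}
edges = {("a", "x"), ("b", "x"), ("a", "b")}
coefficient = clustering_coefficient("x", followers, edges)
print(coefficient)
```

Under this assumed normalization, two followers joined by a single one-way edge give a value of 0.5, and a reciprocated edge between them gives 1.0.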
21. Twitter: Node-level clustering coefficient
[Figure: histogram of count (log scale, 10^0 to 10^8) vs. clustering coefficient (0 to 0.8), with series for followers and friends]
Surprisingly high density at 0.5 (many isolated triangles)
22. Future plans
• Open-source release
• “A Model of Computation for MapReduce”, Karloff, Suri, &
Vassilvitskii, Symposium on Discrete Algorithms, 2010 (Accepted)
• Twitter analysis publication (In progress)
Goal: to enable analysis of large-scale social network data with readily
available software/hardware
23. Collaborators
• Eytan Bakshy (m, y)
• Sharad Goel (y)
• Winter Mason (y)
• Sid Suri (y)
• Sergei Vassilvitskii (y)
• Duncan Watts (y)
• (You?)
(y) Yahoo! Research (http://research.yahoo.com)
(m) University of Michigan
24. Thanks.
Questions?
Contact: hofman@yahoo-inc.com, jakehofman.com