Generating synthetic online social network graph data and topologies
1. 3rd Workshop on Graph-based Technologies and
Applications (Graph-TA)
UPC, Barcelona
2. Introduction
● Motivation: one of the difficulties for data analysts of
online social networks is the public availability of data,
while respecting the privacy of the users.
● Solution: use synthetically generated data [1].
However, this presents a series of challenges related to
generating a realistic dataset in terms of topologies,
attribute values, communities, data distributions, and so
on.
3. Introduction
In the following we present an approach for
generating a graph topology and populating it with
synthetic data to simulate an online social network.
3 Steps
Data Definition
Topology
Generation
Data
Population
4. Step 1: Topology generation
We use the R-mat (Recursive MATrix), method [3] to generate an OSN like topology. R-mat
employs a statistical approach and a recursive process to replicate the power law
distributions, skew distributions and community structure (which can be hierarchical),
while maintaining a small diameter for the graph.
The adjacency matrix A of a graph of N nodes is an N N
matrix, with entry a(i, j) = 1 if the edge (i, j) exists, and 0
otherwise.
The adjacency matrix is recursively subdivided into four
equal-sized partitions, and edges are distributed within
these partitions with a unequal probabilities.
Starting from an empty adjacency matrix, edges are
assigned into the matrix one at a time.
For each edge, one of the four partitions is assigned with
probabilities a, b, c, d respectively (a + b + c + d = 1).
The chosen partition is again subdivided into four smaller
partitions, and the procedure is repeated until a simple
cell is obtained (1 1 partition).
The edge is then assigned to this cell of the adjacency
matrix.
Fig. 1. R-mat model
In the current work, we have used the
following probabilities: a=0.45, b=0.15,
c=0.15, d=0.25
5. Step 1: Topology generation
We identify the communities in the
graph structure generated by Rmat, by
processing it with the Louvain[4]
method, which assigns a community
label to each vertex in the graph.
Fig. 2a. Rmat generated topology.
Different colors indicate
communities.
6. Step 1: Topology generation
Next we identify a set of seed vertices
which will be used as the starting
points for propagating the data.
• For each community, we find
the medoid node in terms of the
statistical and topological
qualities (especially centrality
and degree).
• We can progressively assign
seed vertices which give an
optimal coverage of the
complete graph, while not
overlapping in their immediate
neighborhoods (to avoid
overwriting data propagated
from different seeds).
Fig. 2b. Rmat generated topology
with seed nodes indicated.
7. Step 2: Data definition
• The choice of data will be application specific.
• The distributions of the values of the different attributes should be similar to
those of some real social network (ground truth).
• In order to achieve this, we can use sources of official statistics, such as
government census data
www.indexmundi.com, www.census.gov, www.bls.gov
• Statistical summaries made public by the social network providers, such as
Facebook:
www.adweek.com, fanpagelist.com, http://royal.pingdom.com/2009/11/27/study-males-
vs-females-in-social-networks/
• We may also need some lookup tables for inter-related attribute values.
For example, for the age group “18-25”, there will be a much higher proportion of
“profession=student” and “marital status=single”, and for “gender=male” there will be a
higher proportion of “like3=soccer club”.
• Another option for ‘ground truth’ are publicly available OSN datasets,
such as:
SNAP: http://snap.stanford.edu/data/#communities
9. Step 3: Data population
In order to optimize the assignment, we can use a fitness function and find the
optimum configuration using a stochastic process.
Fitness = (, , )
where = set of seed assignments, = set of data propagation rules and thresholds, = set
of required data distributions.
We initiate the data population of the network using the set of seed nodes.
The rest of the nodes will be assigned data by a propagation from the seed nodes.
The immediate neighbors of a seed will have a higher probability of being assigned similar
attribute values.
For each hop further from the seed node, the influence of the seed node on other nodes is
progressively reduced. However, we give precedence to seed nodes in the same community as the
node to be assigned.
We can also use ontologies /taxonomies and a distance measure to assign similar, rather than
identical values (with an appropriate threshold) when propagating attribute-values.
The influence on assignment by the seed node has to be traded off by the desired overall
proportions of the attribute values (diversity).
10. Step 3: Data population
Fig. 3. Data propagation from seed
nodes and topology of Fig. 1.
11. Challenges
A high degree node chosen as a seed may
have a disproportionate influence on the
network.
The synthetic data generation
may create fictitious user
profiles (combinations of
attribute-values).
The restrictions on the
placement of the seed
nodes and the topology of
the communities may cause
a significant percentage of
nodes to have random
assignments (coverage).
Making the communities
representative of key
attribute-value profiles
Obtaining ‘ground truth’.
12. Acknowledgements / References
References
[1] D.F. Nettleton, J. Salas, A Data Driven Anonymization Method for Information Rich OSN
Graphs. Submitted to the Journal "Expert Systems with Applications", Dec. 2014.
[2] D.F. Nettleton, Data mining of social networks represented as graphs, Computer Science
Review 7, 1-34 (2013).
[3] D. Chakrabarti, Y. Zhan, C. Faloutsos, R-mat: A recursive model for graph mining, in Proc.
SIAM Data Mining Conference, 2004. SIAM, Philadelphia, PA.
[4] V.D. Blondel, J.L. Guillaume, R. Lambiotte, E. Lefebure, Fast unfolding of communities in
large networks, in Journal of Statistical Mechanics: Theory and Experiment (10), 2008, pp.
1000.
Acknowledgements
This work is partially funded by the Spanish MEC (project TIN2013-49814-EXP).