3rd Workshop on Graph-based Technologies and
Applications (Graph-TA)
UPC, Barcelona
Introduction
● Motivation: a key difficulty for analysts of online social networks (OSNs) is
obtaining publicly available data while respecting the privacy of the users.
● Solution: use synthetically generated data [1].
However, this raises a series of challenges: the generated dataset must be
realistic in terms of topology, attribute values, communities, and data
distributions.
● In the following we present an approach for
generating a graph topology and populating it with
synthetic data to simulate an online social network.
● The approach has three steps: topology generation, data definition, and data population.
Step 1: Topology generation
We use the R-mat (Recursive MATrix) method [3] to generate an OSN-like topology. R-mat
employs a statistical approach and a recursive process to replicate the power-law
distributions, skewed distributions, and community structure (which can be hierarchical)
of real networks, while maintaining a small diameter for the graph.
The adjacency matrix A of a graph of N nodes is an N × N
matrix, with entry a(i, j) = 1 if the edge (i, j) exists, and 0
otherwise.
The adjacency matrix is recursively subdivided into four
equal-sized partitions, and edges are distributed within
these partitions with unequal probabilities.
Starting from an empty adjacency matrix, edges are
assigned into the matrix one at a time.
For each edge, one of the four partitions is chosen with
probabilities a, b, c, d respectively (a + b + c + d = 1).
The chosen partition is again subdivided into four smaller
partitions, and the procedure is repeated until a single
cell is obtained (1 × 1 partition).
The edge is then assigned to this cell of the adjacency
matrix.
Fig. 1. R-mat model
In the current work, we have used the
following probabilities: a=0.45, b=0.15,
c=0.15, d=0.25
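The recursive edge placement described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation. It assumes that N is a power of two (N = 2^scale), that a, b, c, d map to the top-left, top-right, bottom-left and bottom-right partitions, and that the graph is undirected with self-loops and duplicate edges discarded. The function names `rmat_edge` and `rmat_graph` are hypothetical.

```python
import random


def rmat_edge(scale, a, b, c, rng):
    """Choose one cell (row, col) of a 2**scale x 2**scale adjacency matrix."""
    row = col = 0
    for _ in range(scale):
        r = rng.random()
        # Each level of recursion fixes one more bit of the row and column index.
        row, col = row * 2, col * 2
        if r < a:                # top-left partition (assumed to be "a")
            pass
        elif r < a + b:          # top-right partition
            col += 1
        elif r < a + b + c:      # bottom-left partition
            row += 1
        else:                    # bottom-right partition, probability d
            row += 1
            col += 1
    return row, col


def rmat_graph(scale, n_edges, a=0.45, b=0.15, c=0.15, d=0.25, seed=0):
    """Return a set of n_edges distinct undirected edges (i, j) with i < j."""
    if abs(a + b + c + d - 1.0) > 1e-9:
        raise ValueError("a + b + c + d must equal 1")
    n = 2 ** scale
    if n_edges > n * (n - 1) // 2:
        raise ValueError("too many edges requested for this number of nodes")
    rng = random.Random(seed)
    edges = set()
    while len(edges) < n_edges:
        i, j = rmat_edge(scale, a, b, c, rng)
        if i != j:  # assumption: self-loops are dropped
            edges.add((min(i, j), max(i, j)))
    return edges
```

For example, `rmat_graph(6, 200)` would produce 200 edges over 64 nodes using the probabilities quoted on the slide.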
We identify the communities in the
graph structure generated by R-mat by
processing it with the Louvain method [4],
which assigns a community
label to each vertex in the graph.
Fig. 2a. R-mat generated topology.
Different colors indicate
communities.
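The Louvain method [4] searches for the labeling that maximizes modularity. The sketch below is not Louvain itself. It only computes the modularity score Q of a given labeling, which can be useful for checking the output of whichever community detection tool is used. It assumes an undirected simple graph given as an edge list. The function name `modularity` and the input format are assumptions made for illustration.

```python
def modularity(edges, community):
    """Modularity Q of a labeling for an undirected simple graph.

    edges: list of (u, v) pairs, each undirected edge listed once.
    community: dict mapping every node to its community label.
    """
    m = len(edges)
    if m == 0:
        raise ValueError("the graph must contain at least one edge")
    internal = {}    # number of edges with both endpoints in the community
    degree_sum = {}  # sum of node degrees in the community
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    # Q = sum over communities of (L_c / m) - (d_c / 2m)^2
    return sum(
        internal.get(label, 0) / m - (d / (2 * m)) ** 2
        for label, d in degree_sum.items()
    )
```

A labeling that places all vertices in one community scores Q = 0, so a useful partition should score clearly above that.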
Next we identify a set of seed vertices
which will be used as the starting
points for propagating the data.
• For each community, we find
the medoid node in terms of its
statistical and topological
properties (especially centrality
and degree).
• We can progressively assign
seed vertices which give an
optimal coverage of the
complete graph, while not
overlapping in their immediate
neighborhoods (to avoid
overwriting data propagated
from different seeds).
Fig. 2b. R-mat generated topology
with seed nodes indicated.
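One simple way to realize the seed selection above is a greedy pass over the communities. The sketch below makes two simplifying assumptions that are not taken from the slides: degree alone stands in for the medoid criterion, and "not overlapping" is read as requiring the closed neighborhoods (a seed plus its direct neighbors) of any two seeds to be disjoint. The function name `select_seeds` is hypothetical.

```python
def select_seeds(adj, community):
    """Pick at most one seed per community with non-overlapping neighborhoods.

    adj: dict mapping each node to the set of its neighbors.
    community: dict mapping each node to a (sortable) community label.
    Returns a dict mapping community label -> chosen seed node.
    """
    members = {}
    for node, label in community.items():
        members.setdefault(label, []).append(node)

    seeds = {}
    covered = set()  # seeds chosen so far together with their neighbors
    for label in sorted(members):
        # Assumption: highest degree first, ties broken by node id.
        candidates = sorted(members[label], key=lambda n: (-len(adj[n]), n))
        for node in candidates:
            hood = adj[node] | {node}
            if hood.isdisjoint(covered):
                seeds[label] = node
                covered |= hood
                break
    return seeds
```

A community may end up without a seed when every candidate neighborhood overlaps an earlier one, which is one source of the coverage problem listed under Challenges.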
Step 2: Data definition
• The choice of data will be application specific.
• The distributions of the values of the different attributes should be similar to
those of some real social network (ground truth).
• To achieve this, we can use sources of official statistics, such as
government census data:
◦ www.indexmundi.com, www.census.gov, www.bls.gov
• We can also use statistical summaries made public by the social network providers, such as
Facebook:
◦ www.adweek.com, fanpagelist.com, http://royal.pingdom.com/2009/11/27/study-males-vs-females-in-social-networks/
• We may also need some lookup tables for inter-related attribute values.
◦ For example, for the age group “18-25”, there will be a much higher proportion of
“profession=student” and “marital status=single”, and for “gender=male” there will be a
higher proportion of “like3=soccer club”.
• Another option for ‘ground truth’ is publicly available OSN datasets,
such as:
◦ SNAP: http://snap.stanford.edu/data/#communities
| Attribute | Values |
| --- | --- |
| Age | "18-25" (32%), "26-35" (26%), "36-45" (20%), "46-55" (13%), "56-65" (9%) |
| Gender | male (47%), female (53%) |
| Residence | "Palo Alto" (17%), "Santa Barbara" (16%), "Boca Raton" (16%), "Boston" (17%), "Norfolk" (17%), "San Jose" (17%) |
| Religion | "Christian" (31.9%), "Hindu" (14.8%), "Jewish" (0.2%), "Muslim" (27.1%), "Sikh" (0.3%), "Traditional Spirituality" (0.1%), "Other Religions" (12.9%), "No religious affiliation" (12.7%) |
| Marital status | "Single" (31.5%), "Married" (51.4%), "Divorced" (10.5%), "Widowed" (6.6%) |
| Profession (ISCO-08 structure) | "Manager" (12.2%), "Professional" (17.1%), "Service" (13.9%), "Sales and office" (17.8%), "Student" (23%), "Natural resources construction and maintenance" (7.0%), "Production transportation and material moving" (9.0%) |
| Political orientation | "Far Left" (9.4%), "Left" (34.7%), "Center Left" (18.1%), "Center" (18.0%), "Center Right" (10.5%), "Right" (8.0%), "Far Right" (1.3%) |
| {like1, like2, like3} | Patterns: {"entertainment", "entertainment", "music artist"} (25%), {"music artist", "music artist", "entertainment"} (25%), {"drink brand", "drink brand", "entertainment"} (25%), {"tv show", "drink brand", "soccer club"} (25%) |

Table 1. Example attributes, attribute-values and their proportions.
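To illustrate how such proportions and a lookup table for inter-related attributes might be combined, the following sketch draws an age group from the Table 1 proportions and then draws a profession whose weights depend on the age. The `BOOST` multiplier is a hypothetical value chosen only for illustration, and the names `weighted_choice` and `sample_profile` are likewise assumptions rather than part of the presented method.

```python
import random

# Marginal proportions taken from Table 1.
AGE = {"18-25": 0.32, "26-35": 0.26, "36-45": 0.20, "46-55": 0.13, "56-65": 0.09}
PROFESSION = {
    "Manager": 0.122,
    "Professional": 0.171,
    "Service": 0.139,
    "Sales and office": 0.178,
    "Student": 0.23,
    "Natural resources construction and maintenance": 0.07,
    "Production transportation and material moving": 0.09,
}

# Hypothetical lookup table: weight multipliers for inter-related values.
BOOST = {("18-25", "Student"): 3.0}


def weighted_choice(dist, rng):
    """Draw one key of dist with probability proportional to its weight."""
    values = list(dist)
    return rng.choices(values, weights=[dist[v] for v in values], k=1)[0]


def sample_profile(rng):
    """Draw an age group, then a profession conditioned on that age group."""
    age = weighted_choice(AGE, rng)
    weights = {p: w * BOOST.get((age, p), 1.0) for p, w in PROFESSION.items()}
    return {"age": age, "profession": weighted_choice(weights, rng)}
```

Boosting one combination shifts the overall share of that value away from its Table 1 proportion, so in practice the lookup table and the marginal proportions would need to be calibrated together.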
Step 3: Data population
• In order to optimize the assignment, we can use a fitness function and find the
optimum configuration using a stochastic process.
Fitness = (, , )
where  = set of seed assignments,  = set of data propagation rules and thresholds,  = set
of required data distributions.
• We initiate the data population of the network using the set of seed nodes.
• The remaining nodes are assigned data by propagation from the seed nodes.
• The immediate neighbors of a seed have a higher probability of being assigned similar
attribute values.
• For each hop further from the seed node, the influence of the seed node on other nodes is
progressively reduced. However, we give precedence to seed nodes in the same community as the
node to be assigned.
• We can also use ontologies/taxonomies and a distance measure to assign similar, rather than
identical, values (with an appropriate threshold) when propagating attribute values.
• The influence of the seed node on the assignment has to be traded off against the desired overall
proportions of the attribute values (diversity).
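The propagation rules above can be sketched for a single attribute as follows. This is one possible reading, not the authors' procedure. It assumes that the probability of copying a seed's value is `decay ** hops`, that same-community seeds take precedence over nearer seeds in other communities, and that a node which does not copy a value draws one from the required overall distribution. The names `bfs_hops` and `propagate` and the `decay` parameter are hypothetical.

```python
import random
from collections import deque


def bfs_hops(adj, source):
    """Hop distance from source to every node reachable from it."""
    hops = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in hops:
                hops[v] = hops[u] + 1
                queue.append(v)
    return hops


def propagate(adj, community, seed_values, marginal, decay=0.5, rng=None):
    """Assign one attribute value to every node, starting from the seeds.

    seed_values: dict seed node -> attribute value.
    marginal: dict value -> required overall proportion.
    """
    rng = rng or random.Random(0)
    hops_from = {s: bfs_hops(adj, s) for s in seed_values}
    options = list(marginal)
    weights = [marginal[o] for o in options]
    values = dict(seed_values)
    for node in sorted(adj):
        if node in values:
            continue
        reachable = [s for s in seed_values if node in hops_from[s]]
        if reachable:
            # Prefer a seed in the same community, then the nearest seed.
            best = min(
                reachable,
                key=lambda s: (community[s] != community[node], hops_from[s][node]),
            )
            # Seed influence shrinks with each hop (assumed geometric decay).
            if rng.random() < decay ** hops_from[best][node]:
                values[node] = seed_values[best]
                continue
        # Otherwise fall back to the required overall distribution.
        values[node] = rng.choices(options, weights=weights, k=1)[0]
    return values
```

A smaller `decay` favors the overall proportions (diversity), while a value close to 1 makes each community resemble its seed, which is the trade-off noted in the last bullet.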
Fig. 3. Data propagation from seed
nodes and topology of Fig. 1.
Challenges
• A high-degree node chosen as a seed may have a disproportionate influence on the network.
• The synthetic data generation may create fictitious user profiles (combinations of attribute-values).
• The restrictions on the placement of the seed nodes and the topology of the communities may cause a significant percentage of nodes to have random assignments (coverage).
• Making the communities representative of key attribute-value profiles.
• Obtaining ‘ground truth’.
Acknowledgements / References
References
[1] D.F. Nettleton, J. Salas, A Data Driven Anonymization Method for Information Rich OSN
Graphs. Submitted to the Journal "Expert Systems with Applications", Dec. 2014.
[2] D.F. Nettleton, Data mining of social networks represented as graphs, Computer Science
Review 7, 1-34 (2013).
[3] D. Chakrabarti, Y. Zhan, C. Faloutsos, R-mat: A recursive model for graph mining, in Proc.
SIAM Data Mining Conference, 2004. SIAM, Philadelphia, PA.
[4] V.D. Blondel, J.-L. Guillaume, R. Lambiotte, E. Lefebvre, Fast unfolding of communities in
large networks, Journal of Statistical Mechanics: Theory and Experiment 2008 (10), P10008.
Acknowledgements
This work is partially funded by the Spanish MEC (project TIN2013-49814-EXP).
Thank you for your attention!

Generating synthetic online social network graph data and topologies
