Data-driven Studies on Social Networks: Privacy and Simulation

Sameera Horawalavithana
Ph.D. Candidate,
Department of Computer Science and Eng.,
University of South Florida
sameera1@usf.edu

Outline
● Privacy in Social Networks
● Social Simulations
● The Design of the Multi-platform Cascades (MCAS) Social Simulator
○ Scenario #1: Endogenous Signals
■ Dataset
■ Evaluation
○ Scenario #2: Exogenous Signals
■ Dataset
■ Evaluation
● Lessons Learnt
● Future Work

Privacy in Social Networks
● Data breaches happen regularly: adversaries use sophisticated techniques
(e.g., de-anonymization) to defeat data protection mechanisms
(e.g., anonymization).
● A main research challenge is to develop a principled understanding of how to
measure the effectiveness of an anonymization scheme and thus, conversely, the
likely success of a de-anonymization attack.
● We introduce and experiment with a framework that identifies the relationships
between graph vulnerability and graph properties (Horawalavithana et al. 2019).
○ We show that protecting graph privacy is harder than previously thought.
○ For example, our results show that preserving network properties other than the degree
distribution can still reveal node identity.
● We quantitatively study the impact of binary node attributes on node privacy using
this framework (Horawalavithana et al. 2018).
○ Our experiments show that the population’s diversity on the binary attribute consistently degrades
anonymity

Social Simulations
● Why do we need to develop accurate simulation techniques for online media
information?
○ Helpful for intervention techniques, disaster response, fraud detection, censorship removal, picking
up signals/trends as they relate to current events, etc.
Figure panels: Organic Discussions on Reddit; Venezuelan Political Crisis

Social Simulations
● A reliable simulator can realistically respond to
internal and external stimuli and adapt to different
platforms, datasets, scenarios, each with different
characteristics.
● Our objective is to forecast fine-grained social
media activity without relying on the ground truth
in the testing period.
● Simulation results should match real-world
data. Accuracy is measured by a set of
meaningful metrics that capture both macro-level
and micro-level simulation information.

Social Simulations
● Our Approach: We combine social theories with machine learning
methodologies for predicting information dissemination within and across
social online environments.
● Datasets: The majority of the datasets used in this work were collected by
Leidos, the official data provider in the DARPA SocialSim program.
● Metrics: We used the evaluation code that was developed by Pacific
Northwest National Laboratory.

Multi-platform Cascades Social Simulator (MCAS)
● Design: Given a history of per-topic social media events and relevant exogenous
events, predict the number of information cascades and the size and growth of
cascades in the future.
● Three main design components:
○ Topic Module annotates messages with topics. This module was implemented by one of our
collaborators. They manually annotated an initial subset of messages with a predefined list of topics,
and trained a multilingual BERT model to classify each message with one or multiple such sub-topics.
○ Seed Module includes ML models that specialize predictions to particular macro-level sub-problems
(e.g., the daily number of cascades).
○ Cascade Module includes a probabilistic generative model to predict the micro-level event
information (e.g., who did what to whom) in the form of cascades.

Multi-platform Cascades Social Simulator (MCAS)
● We present two scenarios that motivate the design of the social simulators.
○ We use endogenous features extracted from in-platform discussions to predict the
growth of conversations on Reddit (Scenario #1).
○ We use both endogenous (e.g., in-platform discussions related to topics) and exogenous
(e.g., news articles) features to predict Twitter activity (Scenario #2).

Scenario #1: Endogenous Signals
● Given a set of "seeds" (e.g., original posts on a social platform, such as posts on
Reddit) in a continuous interval of time on a platform, can one predict the
information cascade trees (who responds to whom when) rooted in these seeds?
○ Can discussion threads be predicted using only post features (e.g., author who posts the
initial message, timing, textual content of the post)?

Conversation Pool Generation Algorithm
1. Generate N pools of conversations
probabilistically
a. Conversation Structure: We use the branching process to
generate the conversation structure
b. User: Users are assigned to conversation nodes following
the preferential attachment principle.
c. Timing: We use a distribution of message propagation
delays to estimate the timing
2. Test the goodness of generated conversation pools
using two trained classification models
3. Reconstruct the pool of conversations with the
feedback from the classification models
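The generation step (1a to 1c) can be sketched as follows. This is a minimal illustration, not the MCAS implementation: the offspring weights, the exponential delay distribution, the `+1` activity smoothing, and all function and parameter names are assumptions introduced here.

```python
import random


def generate_conversation(rng, users, offspring_weights=(0.5, 0.3, 0.15, 0.05),
                          mean_delay=600.0, max_nodes=100):
    """Generate one conversation tree as a list of node dicts.

    Structure: branching process (each node draws its number of replies).
    Users: preferential attachment (more active users are more likely to reply).
    Timing: reply time = parent time + a sampled propagation delay (seconds).
    """
    activity = {u: 1 for u in users}          # smoothed per-user message counts
    root_user = rng.choice(users)
    activity[root_user] += 1
    nodes = [{"id": 0, "parent": None, "user": root_user, "time": 0.0}]
    frontier = [0]
    while frontier:
        parent_id = frontier.pop(0)
        # Hypothetical offspring distribution over 0..3 replies (mean < 1).
        n_replies = rng.choices(range(len(offspring_weights)),
                                weights=offspring_weights)[0]
        for _ in range(n_replies):
            if len(nodes) >= max_nodes:
                return nodes
            user = rng.choices(users, weights=[activity[u] for u in users])[0]
            activity[user] += 1
            delay = rng.expovariate(1.0 / mean_delay)   # assumed delay model
            node = {"id": len(nodes), "parent": parent_id, "user": user,
                    "time": nodes[parent_id]["time"] + delay}
            nodes.append(node)
            frontier.append(node["id"])
    return nodes


def generate_pools(rng, users, n_pools, n_seeds):
    """Generate N pools; each pool holds one conversation per seed post."""
    return [[generate_conversation(rng, users) for _ in range(n_seeds)]
            for _ in range(n_pools)]
```

In practice the offspring and delay distributions would be fitted to the training data rather than fixed as above.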
Figure (pipeline): Generate N conversation pools → Goodness test → Reconstruct the best conversation pool

● Test the goodness of generated
conversation pools using two trained
classification models
○ We use the classification models to
assess how realistic a generated
conversation is, given its attached user and
timing information.
○ We use two individual-level
properties—branching factor and
propagation delay—of conversation
nodes as the target units for the
prediction tasks.
○ We represent conversation information
in a data structure (as shown in Fig.
5.2) where each conversation node is
described by structural, user and
content features (Table 5.4).

● Goodness score of a conversation
○ We use the Area Under Curve (AUC) of
two branch vectors and two delay
vectors to calculate the goodness score
of a conversation.
○ Each conversation receives a goodness
score as the mean of two AUC scores
from the two models.
● The goodness score identifies the
best conversation during the simulation.
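A minimal sketch of the goodness score follows. The slides do not say how the branch and delay vectors are paired, so this assumes that, for each model, one vector holds binarized labels taken from the generated conversation and the other holds the classifier's predicted scores. The function names and the 0.5 fallback are hypothetical.

```python
def auc(labels, scores):
    """Rank-based AUC: probability that a positive outscores a negative.

    Ties count as half a win. Returns 0.5 when one class is missing
    (an assumption; AUC is undefined in that case).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return 0.5
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def conversation_goodness(branch_labels, branch_scores,
                          delay_labels, delay_scores):
    """Mean of the two AUC scores (branching model and delay model)."""
    return (auc(branch_labels, branch_scores)
            + auc(delay_labels, delay_scores)) / 2.0
```

For example, a conversation whose delay predictions rank perfectly (AUC 1.0) but whose branching predictions reach AUC 0.75 would score 0.875.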

● Reconstruct the pool of conversations with the feedback from the
classification models
○ The objective is to create a pool of conversations that outperforms any
existing pool of conversations.
○ We treat the pool reconstruction problem as an optimization problem
that we solve using a genetic algorithm.
■ A gene is a conversation represented by the message tree with
assigned user and timing information to nodes.
■ An individual is a pool of conversations.
■ The population is the set of conversation pools.

Figure (genetic algorithm): Rank pools → New pool construction (uniform crossover over conversations) → Reconstructed pools.
The goodness of a pool of conversations is the sum of the goodness scores of the conversations in the pool.
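The reconstruction step can be sketched as a small genetic algorithm with uniform crossover. This is an illustrative sketch only: it assumes that position i in every pool holds the conversation generated for seed post i, that the top-ranked pools are kept unchanged (elitism), and that `score` is a caller-supplied goodness function. The generation count and parent count are hypothetical.

```python
import random


def pool_goodness(pool, score):
    """Goodness of a pool: sum of the goodness scores of its conversations."""
    return sum(score(conv) for conv in pool)


def uniform_crossover(parent_a, parent_b, rng):
    """Pick each gene (conversation) from either parent with probability 0.5."""
    return [a if rng.random() < 0.5 else b for a, b in zip(parent_a, parent_b)]


def evolve(pools, score, rng, generations=10, n_parents=2):
    """Rank pools, keep the best ones, and refill the population by crossover."""
    population = list(pools)
    for _ in range(generations):
        ranked = sorted(population, key=lambda p: pool_goodness(p, score),
                        reverse=True)
        elite = ranked[:n_parents]
        children = [uniform_crossover(*rng.sample(elite, 2), rng)
                    for _ in range(len(population) - len(elite))]
        population = elite + children
    return max(population, key=lambda p: pool_goodness(p, score))
```

Because the elite pools survive every generation, the returned pool is never worse than the best initial pool under the given score.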

Scenario #1: Dataset
● We used a Reddit dataset covering the discussions in nine crypto-currency
and 38 cyber-security related subreddits between January 2015 and August
2017 to train and test the simulator.

Measurement          Crypto   Cyber
Number of Posts      0.2M     1.76M
Number of Comments   3.5M     35.3M
Number of Users      0.14M    1.6M

Scenario #1: Overlapping Conversations
● Users respond with comments to
the original post or other users’
comments, repeatedly getting
involved in the same conversation.
● The same user can participate in
multiple related conversation
threads
Figure: Bitcoin scaling debate discussions in August 2017. There are
57 conversations with 4,418 messages posted by 1,458 users.
218 users appeared in more than one conversation and 83 users in more than two.

Scenario #1: Evaluation
● We predict the growth of Reddit conversations over one month (August 1 to
August 31, 2017).
○ We use the posts made between August 1 and August 3, 2017 as input seed posts.
○ There were 3,740 and 3,463 posts in the crypto-currency and cyber-security
domains, respectively.
● We use three baseline models.
○ Recent Replay baseline repeats the most recent n conversations from the training data.
○ Random baseline draws n conversations from the training data at random. We repeat this
process 10 times to minimize the bias of random selection.
○ Lumbreras Model uses the branching process in the generation of conversation
structures (Aragon et al. 2017).

● Predicting the structure of cascades
○ We report the distribution of the size and structural virality of generated conversations.
■ Structural virality is measured by the Wiener index of conversation trees (Goel et al. 2015).
○ We calculate the JS divergence between the distribution of each structural metric in the
generated conversations and in the ground truth.
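Both measurements can be sketched in a few lines. Structural virality is computed here as the average shortest-path distance over all node pairs of the conversation tree, and JS divergence uses log base 2 so that it lies in [0, 1]. The parent-list tree encoding and the function names are assumptions made for this sketch, not the evaluation code used in the study.

```python
import math
from collections import deque


def structural_virality(parents):
    """Average pairwise distance in a tree; parents[i] is None for the root."""
    n = len(parents)
    if n < 2:
        return 0.0
    adj = [[] for _ in range(n)]
    for child, parent in enumerate(parents):
        if parent is not None:
            adj[child].append(parent)
            adj[parent].append(child)
    total = 0
    for src in range(n):                      # BFS from every node
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
    return total / (n * (n - 1))              # ordered pairs


def js_divergence(p, q):
    """Jensen-Shannon divergence between two histograms over the same bins."""
    p_sum, q_sum = sum(p), sum(q)
    p = [x / p_sum for x in p]
    q = [x / q_sum for x in q]
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

A star-shaped thread (all replies to the root) yields a lower structural virality than a chain of the same size, which is the contrast the metric is meant to capture.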

● Predicting the temporal growth of conversations
○ We report the growth of the Reddit discussions by the daily number of comments over 1 month.
○ We compare the predicted time series and ground truth time series using Dynamic Time
Warping (DTW) and Root Mean Square Error (RMSE) metrics.
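For reference, the two time-series metrics can be sketched as below. This uses the textbook dynamic-programming DTW with absolute difference as the local cost and no warping window, which may differ from the settings of the evaluation code used in the study.

```python
import math


def rmse(pred, truth):
    """Root mean square error between two equal-length series."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth))


def dtw_distance(a, b):
    """Dynamic Time Warping distance with |x - y| as the local cost."""
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stretch b
                                 cost[i][j - 1],      # stretch a
                                 cost[i - 1][j - 1])  # step both
    return cost[len(a)][len(b)]
```

DTW tolerates a spike that arrives a day early or late, whereas RMSE penalizes it twice (once for the missed peak and once for the misplaced one), so the two metrics are complementary.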
Figure panels: Discussions on crypto-currency subreddits; Discussions on cyber-security subreddits

● Predicting the user engagement
○ We compare the number of users engaged in
multiple conversations between simulation and
ground truth (Fig. 5.9)
● Predicting the collective behavior
○ We record user participation in conversations in a
vector [c_1, c_2, ..., c_n], where c_i is a binary
value that indicates the user's involvement in the i-th
conversation.
○ We use the Pearson correlation coefficient to
compare all pairs of binary vectors.
○ We calculate the JS-divergence and RMSE between
the coefficient distributions of the simulation and the
ground truth data (Table 5.9).
○ Lower JS-divergence values reflect collective
behavior closer to that measured from the ground
truth.
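The collective-behavior measurement can be sketched as follows. Conversations are represented here as sets of user identifiers, and pairs involving a user who appears in every conversation (a constant vector, for which Pearson correlation is undefined) are skipped. Both choices are assumptions of this sketch.

```python
import math
from itertools import combinations


def participation_vectors(conversations):
    """Map each user to a binary vector [c_1, ..., c_n] over n conversations."""
    users = sorted(set().union(*conversations))
    return {u: [1 if u in conv else 0 for conv in conversations] for u in users}


def pearson(x, y):
    """Pearson correlation; returns None when either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return None
    return cov / math.sqrt(vx * vy)


def pairwise_correlations(conversations):
    """Correlation coefficients for all user pairs with defined correlation."""
    vectors = participation_vectors(conversations)
    coefficients = []
    for u, v in combinations(vectors, 2):
        r = pearson(vectors[u], vectors[v])
        if r is not None:
            coefficients.append(r)
    return coefficients
```

The resulting list of coefficients, computed once for the simulation and once for the ground truth, is what the JS-divergence and RMSE comparison operates on.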

Scenario #2: Exogenous Signals
• Can one accurately generate the social media activity on a platform (for
example, Twitter) using the recorded signals from other platforms?
• Is that doable in the context of unexpected events, when social media users both react to
unexpected news in unpredictable ways and also generate news for many news outlets?

Scenario #2: Exogenous Signals
● Seed Module
○ We train multiple neural network models to predict the number of daily tweets per topic.
○ The module variations depend on the exogenous sources and recency of features.
■ Exogenous features are the number of news articles and the number of Reddit posts
per topic. They are extracted for the “day before” and the “day of” each prediction.
○ We assign users to the predicted tweets randomly with probability proportional to the user
spread score.
■ The spread score for user u is the product of the fraction of u's tweets
that get retweeted and the total number of retweets that u receives for their
tweets (Alp et al. 2018).
■ Intuitively, the spread score captures the level of influence of a user: the higher the
spread score, the more influential the user is.
● Cascade Module is similar to the solution presented in Scenario #1.
○ This module takes the tweets predicted by the seed module as input.
○ We assign new users to the cascades.
■ We select leaves of the cascades predicted for each topic and assign those users a
completely new and unique identifier.
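The spread score and the weighted user assignment from the Seed Module can be sketched as below. The per-user history format (a list with the retweet count of each past tweet) and the uniform fallback when nobody has been retweeted are assumptions made for illustration.

```python
import random


def spread_score(retweet_counts):
    """Fraction of a user's tweets that got retweeted times total retweets.

    retweet_counts holds one entry per tweet posted by the user.
    """
    if not retweet_counts:
        return 0.0
    fraction_retweeted = (sum(1 for c in retweet_counts if c > 0)
                          / len(retweet_counts))
    return fraction_retweeted * sum(retweet_counts)


def assign_users(n_tweets, history, rng):
    """Draw an author for each predicted tweet, proportional to spread score."""
    users = list(history)
    weights = [spread_score(history[u]) for u in users]
    if sum(weights) == 0:
        weights = [1.0] * len(users)   # assumed fallback: uniform sampling
    return rng.choices(users, weights=weights, k=n_tweets)
```

For instance, a user with four tweets that received 0, 4, 6 and 0 retweets has a spread score of 0.5 × 10 = 5.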

• Twitter Dataset
• We used a Twitter dataset covering the
Venezuelan Presidential Crisis between
January and February 2019.
• This dataset covers a period of high
political tension which resulted in
nationwide protests, militarized responses,
and incidents of mass violence and arrests.

Measurement          Value
Number of Tweets     ~1M
Number of Retweets   ~11.6M
Number of Users      ~1.15M

• Exogenous Data Sources
• We collected Reddit discussions from
one of the largest Venezuela-related
subreddits, /r/vzla.
• The news article data was collected
via a publicly available geopolitical
event database, GDELT.

Measurement                 Value
Number of Reddit Messages   56K
Number of News Articles     138K

● We predict Twitter activity over two weeks (February 15 to February 28, 2019).
● We use two baselines:
○ Replay baseline repeats the messages from the last two weeks of training data.
○ Sampling baseline draws full Twitter cascades at random to match the average daily
volume of activity per topic observed in the last two weeks of training data.
● We use three metrics:
○ Time series comparison
■ NRMSE (Normalized Root Mean Squared Error) to capture the temporal pattern
■ SMAPE (Symmetric Mean Absolute Percentage Error) to capture the volume and
temporal pattern
○ Distribution level comparison
■ EMD (Earth Mover's Distance) to compare the PageRank distributions.
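The three metrics have several variants in the literature, and the slides do not state which ones the evaluation code uses. The sketch below therefore fixes one common choice for each as an assumption: NRMSE normalized by the range of the ground truth, SMAPE as a fraction in [0, 2], and a one-dimensional Earth Mover's Distance between empirical distributions.

```python
import math


def nrmse(pred, truth):
    """RMSE divided by the range of the ground truth (assumed normalization)."""
    err = math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth))
    value_range = max(truth) - min(truth)
    return err / value_range if value_range > 0 else err


def smape(pred, truth):
    """Mean of 2|p - t| / (|p| + |t|); terms with p = t = 0 contribute 0."""
    terms = [2 * abs(p - t) / (abs(p) + abs(t)) if (p or t) else 0.0
             for p, t in zip(pred, truth)]
    return sum(terms) / len(terms)


def emd_1d(xs, ys):
    """1-D Earth Mover's Distance: area between the two empirical CDFs."""
    xs, ys = sorted(xs), sorted(ys)
    points = sorted(set(xs) | set(ys))
    distance, i, j = 0.0, 0, 0
    for left, right in zip(points, points[1:]):
        while i < len(xs) and xs[i] <= left:
            i += 1
        while j < len(ys) and ys[j] <= left:
            j += 1
        distance += abs(i / len(xs) - j / len(ys)) * (right - left)
    return distance
```

In the evaluation described above, `emd_1d` would be applied to the PageRank values of the simulated and ground-truth networks, while `nrmse` and `smape` compare daily time series.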

● Predicting the daily number of tweets per topic.
○ We predict the big spikes in the number of tweets for most of the popular topics.
○ However, the spikes are mistimed in the models that use features from the day before the predictions
(see dashed lines).

● Predicting the daily number of
tweets per topic.
○ Multiple variants of our solution capture
the trend of the number of tweets closer
to the ground truth than any baseline for
most of the topics.
○ The models that use the news articles from
the last 24 hours before 8 a.m. predict
the trend of tweets better than the models
that use the news articles from the day
before the predictions
(see the two light green bars in Fig. a).
○ Using current-day exogenous data leads
to more accurate predictions than using
previous-day exogenous data.

● Predicting the daily number of tweets and retweets per topic.
○ Retweets are predicted by the cascade module. The temporal pattern of retweets is driven
mostly by the temporal pattern of tweets predicted by the seed module.

● Predicting the daily number of tweets and
retweets per topic.
○ Similar to the seed module, the
cascade module also captures the trend of the number
of shares closer to the ground truth than any
baseline for most of the topics.
○ Results suggest that the most representative
exogenous source depends on the topic of interest.
■ News articles are more helpful for predicting
topics related to the international humanitarian aid
event and violent clashes between the military
and protesters.
■ Reddit discussions are more helpful for predicting
topics related to Maduro’s dictatorship.
Table caption: Performance view. #S: number of shares over time; #NU: number of new user engagements over time; PR: page rank measurements. Green cells indicate that the models beat the baselines.
Panels: Predicting Twitter topic activity using Reddit discussions; Predicting Twitter topic activity using News Articles

Scenario #2: Evaluation
● Predicting the daily number of new user
engagements per topic.
○ Our models outperform the respective baselines
across all 12 topics with respect to NRMSE and
SMAPE.
○ Models using only Reddit features perform better
than those using only news on the arrests
and maduro/narco topics.
● Predicting the user interaction network
○ We create a directed retweet network for each topic in
which an edge points from the user who retweeted to
the user who posted the tweet.
○ The pagerank distribution of the user interaction
network is closer to the ground truth than the
Sampling baseline method for a majority of topics.
○ The network structures predicted by the Replay
baseline model are hard to beat in this network
measurement.
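The network construction and PageRank computation can be sketched as follows, using plain power iteration. The damping factor of 0.85, the fixed iteration count, and the uniform redistribution of rank from users without outgoing edges are conventional choices assumed here, not settings confirmed by the slides.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """PageRank by power iteration on a directed retweet network.

    edges: list of (retweeter, original_poster) pairs, so rank flows
    toward the users whose tweets get retweeted.
    """
    nodes = sorted({user for edge in edges for user in edge})
    n = len(nodes)
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by users with no outgoing edges is spread uniformly.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {node: (1 - damping) / n + damping * dangling / n
                    for node in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank
```

The list of rank values from the simulated network and from the ground-truth network can then be compared as two distributions.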
Table caption: Performance view. #S: number of shares over time; #NU: number of new user engagements over time; DEG: degree measurements; PR: page rank measurements. Green cells indicate that the models beat the baselines.
Panels: Predicting Twitter topic activity using Reddit discussions; Predicting Twitter topic activity using News Articles

Lessons Learnt
• Recency matters
• To predict the social media activity (i.e., the volume of messages and the user interaction
network) in the immediate future, the immediate past is more useful than the delayed past.
• This would also make the baselines very competitive as they re-generate the recent past.
• Recency and Locality matter
• To predict activity within a particular topic, the recent activity within the same topic matters.
• This observation may be biased to the design of the topic assignment model (e.g., manual
annotation process, the distribution of topics, topic co-occurrence, etc.)
• Recency introduces small-ish data, but ML models need big-ish data?
• The number of data points available for training depends on the time granularity of the
predictions. For example, one can generate more data points at hourly granularity (or
finer) than at daily or weekly granularity.
• We increase the number of data points available for training by splitting the data by
topic. For example, given N topics and M days, we can create N x M
data points. This also increases the variation in the training data, which helps
ML models learn the activity of multiple topics.
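The N x M construction can be illustrated with a small sketch that turns per-topic daily counts into one training row per (topic, day). The field names, the use of news counts as the only exogenous feature, and the zero fill for the first day's "day before" value are all hypothetical.

```python
def build_training_rows(daily_tweets, daily_news):
    """Create one row per (topic, day): N topics x M days -> N * M rows.

    daily_tweets, daily_news: {topic: [count_day_0, ..., count_day_M-1]}
    """
    rows = []
    for topic, tweet_counts in daily_tweets.items():
        news_counts = daily_news[topic]
        for day, target in enumerate(tweet_counts):
            rows.append({
                "topic": topic,
                "day": day,
                "news_day_of": news_counts[day],
                # Assumed: no history before day 0, so fill with 0.
                "news_day_before": news_counts[day - 1] if day > 0 else 0,
                "target_tweets": target,
            })
    return rows
```

With 2 topics and 3 days this yields 6 rows, whereas a single aggregate series over the same period would yield only 3.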

Lessons Learnt
• Exogenous features matter
• There are many potential exogenous data sources that capture real-world events, but
selecting the most representative exogenous features to predict topic activity matters.
• “Big” spikes are hard to predict
• We tested our simulators on special cases (e.g., political crisis, influence campaigns) which
include big spikes due to external events.
• Exogenous features on the “day of” and “day before” predictions had a big impact on
predicting spikes more accurately.
• Long vs. short time horizon predictions
• The overall volume of activity can be predicted in the long time horizon with the help of
exogenous features, but predicting the temporal pattern is hard due to compounding errors in
the simulation.
• Hard to predict the structure of the user interaction network
• We found the baselines are hard to beat in the network structural measurements.
• As they regenerate the past, they capture the patterns of user interactions more accurately.

Future Work
• Reducing the error accumulated over different modules in the pipeline design
• Any error in predicting the volume of discussions cannot be corrected later in the current
pipeline design. Accurately identifying which module degrades the overall prediction is important
for making improvements.
• Testing the generalizability of modules across various other simulation
scenarios, and datasets.
• E.g., influence operations, disinformation campaigns, private group discussions, etc.
• Explaining the performance of simulators
• What characteristics of the data determine the models’ performance?
• During our performance analysis, we have seen the simulator performing differently on
different topics. This could be partly due to the influence of external events on the activity of
particular topics, or partly due to the regular patterns observed in the data.

Main Publications
● Horawalavithana, S., Ng, K., Iamnitchi, A., Predicting Twitter Topic Activity during
Political Crisis using Exogenous Data (Under Review)
● Horawalavithana, S., Choudhury, N., Iamnitchi, A., Online Discussion Threads as
Cascade Pools: Predicting the Growth of Discussion Threads on Reddit (Under
Review)
● Horawalavithana, S., Ng, K., Iamnitchi, A., Drivers of Polarized Discussions on Twitter
during Venezuela Political Crisis, The 13th International ACM Conference on Web
Science (WebSci), 2021.
● Horawalavithana, S., Silva, R., Nabeel, M., Elvitigala, C., Wijesekara, P., and Iamnitchi,
A., Malicious and Low Credibility URLs on Twitter during the AstraZeneca
COVID-19 Vaccine Development, International Conference on Social Computing,
Behavioral-Cultural Modeling, & Prediction and Behavior Representation in Modeling and
Simulation (SBP-BRiMS), DC, USA, 2021

Main Publications (Contd.)
● Horawalavithana, S., Ng, K., Iamnitchi, A., Twitter is the Megaphone of
Cross-Platform Messaging on the White Helmets, International Conference on Social
Computing, Behavioral-Cultural Modeling, & Prediction and Behavior Representation in
Modeling and Simulation, DC, USA, 2020
● Horawalavithana, S., Bhattacharjee, A., Liu, R., Choudhury, N., O. Hall, L., & Iamnitchi,
A. Mentions of Security Vulnerabilities in Reddit, Twitter and GitHub,
IEEE/WIC/ACM International Conference on Web Intelligence, Greece, October, 2019
● Horawalavithana, S., Flores, J. G. A., Skvoretz, J., & Iamnitchi, A., Behind the Mask:
Understanding the Structural Forces that Make Social Graphs Vulnerable to
De-anonymization. IEEE Transactions on Computational Social Systems (TCSS), 2019
● Horawalavithana, S., Flores, J. A., Skvoretz, J., & Iamnitchi, A., The Risk of Node
Re-identification in Labeled Social Graphs, Applied Network Science (2019)

Other Publications
● Ng, K., Horawalavithana, S., & Iamnitchi, A., Multi-platform Information Operations:
Twitter, Facebook and YouTube against the White Helmets, The Workshop Proceedings
of the 14th International AAAI Conference on Web and Social Media (ICWSM), 2021.
● Liu, R., Mubang, F., Hall, L. O., Horawalavithana, S., Iamnitchi, A., & Skvoretz, J. (2019,
October). Predicting longitudinal user activity at fine time granularity in online
collaborative platforms. In 2019 IEEE International Conference on Systems, Man and
Cybernetics (SMC) (pp. 2535-2542). IEEE.
● Alhazmi, E., Horawalavithana, S., Skvoretz, J., Blackburn, J., & Iamnitchi, A. (2017, July). An
empirical study on team formation in online games. In Proceedings of the 2017
IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining
(ASONAM) 2017 (pp. 431-438).
● Alhazmi, E., Choudhury, N., Horawalavithana, S., & Iamnitchi, A. (2019). Temporal mobility
networks in online gaming. Frontiers in Big Data, 2, 21.

References
● Aragón, P., Gómez, V., García, D., and Kaltenbrunner, A. Generative models
of online discussion threads: state of the art and research challenges. Journal
of Internet Services and Applications, 8(1):15, 2017.
● Alp, Z., and Öğüdücü, S. Identifying topical influencers on Twitter based on
user behavior and network topology. Knowledge-Based Systems,
141:211–221, 2018.
● Goel, S., Anderson, A., Hofman, J., and Watts, D. The structural virality of
online diffusion. Management Science, 62(1):180–196, 2015.

Acknowledgments
● Funded by DARPA SocialSim Program
● Data provided by Leidos. (Thanks Kin for Reddit data)
● Evaluation code was developed by Pacific Northwest National Laboratory