We discuss the use of hierarchical transformers for user semantic similarity in the context of analyzing users' behavior and profiling social media users. The objectives of the research include finding the best model for computing semantic user similarity, exploring the use of transformer-based models, and evaluating whether the embeddings reflect the desired similarity concept and can be used for other tasks.
We use a large dataset of Twitter users and apply an automatic labeling approach. The dataset consists of English tweets posted in November and December 2020, totaling about 27GB of compressed data. Preprocessing steps include filtering out short texts, cleaning user connections, and selecting a benchmark set of users for evaluation.
Transformer architectures are known to work well on short texts, but they cannot directly process the extensive collections of tweets that describe a user's activity. Therefore, we propose a hierarchical structure of transformer models, applied first to individual tweets and then to their aggregations.
The models used in the study are hierarchical transformers, and the tweet embeddings are obtained using four Transformer-based models: RoBERTa, BERTweet, Sentence-BERT, and Twitter4SSE. We test different techniques for processing tweet embeddings to generate accurate user embeddings, including mean pooling, recurrence over BERT (RoBERT), and transformer over BERT (ToBERT).
The evaluation of the models is done on a set of 5,000 users, comparing user similarities with 30 other candidate users, 5 of which are considered similar and 25 considered dissimilar. The evaluation metrics used include mean average precision (MAP), mean reciprocal rank (MRR) at 10, and normalized discounted cumulative gain (nDCG).
The optimization process involves selecting a loss function and using the AdamW optimizer with specific hyperparameters. The results show that the hierarchical approach with a Stage-1 Twitter4SSE model and a Stage-2 Transformer model performs the best among the alternatives.
In conclusion, the research provides a large unbiased dataset for user similarity analysis, presents a hierarchical language model optimized for accurate user similarity computation, and validates the models' performance on similarity tasks, with potential applications to related problems.
The future work includes investigating the impact of time and topic drift on the models' performance.
M. Di Giovanni, M. Brambilla. Hierarchical Transformers for User Semantic Similarity. ICWE 2023
Agenda
1. Motivation
2. Model: hierarchical configuration of BERT text transformers
3. Evaluation
4. Conclusions
Context and Motivation
Analysis of users’ behaviour and profiling of social media users
► customization of the overall personal experience
► recommendations
► detection of duplicates
► social threats
Sources:
► textual-content shared by users
► the social graphs involving users
– Friendship /followship
– Mention, likes, …
► shared resources (links, media, content)
► RQ1: best model to compute semantic user similarity?
► RQ2: can Transformer-based models be used?
► RQ3: embeddings reflect our idea of similarity?
Can we use them for further tasks?
► Aim at a fully reproducible approach without influencing the results with
biased selections of small sets of users
Objectives
► Large dataset of Twitter users, with automatic labelling approach
► Training of a Hierarchical Language Model to compute accurate user
similarity
► Optimization of hyper-parameters to obtain the best configuration of the model
► Test accuracy of embeddings when applied to other tasks
Contributions
► Twitter
► Assumption: retweets represent some form of agreement, interest, or perceived importance
► Data from Archive Team Twitter [*]
► Only the textual content shared by users. No demographics, no screen names
► We select English tweets (filtered according to the “lang” field) posted in November and December 2020. They amount to about 27GB of compressed data.
[*] https://archive.org/details/twitterstream
Dataset and preprocessing
► We remove texts shorter than 20 characters (29M texts tweeted by 10M
unique users)
► We set the maximum number of tweets per user to 60 and the minimum to 5 (1M users)
► Clean the connections between users (pairs of ids where one user retweets the other): we remove auto-retweets (a user retweeting one of its own tweets), duplicate pairs, links to excluded users, and users with more than 50 connections, leaving 1.9M connections between 950k unique users
► Benchmark consists of comparing a user with 30 other candidate users, 5 of
them considered similar to it since they share at least one retweet
connection, and 25 of them considered not similar
Preprocessing
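The filtering steps above can be sketched as follows (a minimal illustration; field and function names are our own, not the paper's code):

```python
# Sketch of the tweet/user filtering described above.
# Field names ("lang", "text") are illustrative assumptions.

def filter_tweets(tweets, min_chars=20):
    """Keep English tweets whose text has at least `min_chars` characters."""
    return [t for t in tweets
            if t.get("lang") == "en" and len(t["text"]) >= min_chars]

def filter_users(tweets_by_user, min_tweets=5, max_tweets=60):
    """Keep users with at least 5 tweets; cap each user at 60 tweets."""
    return {uid: ts[:max_tweets]
            for uid, ts in tweets_by_user.items()
            if len(ts) >= min_tweets}
```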
Language Model
Down Memory Lane
Encoders and Decoders
Hierarchical Transformer Model
Tweet Embedding
► Obtain embedding of tweets using one of the following four
Transformer-based models that share the same architecture but are
pretrained with different approaches and datasets:
– RoBERTa,
– BERTweet,
– Sentence-BERT,
– Twitter4SSE.
► We test them by freezing and unfreezing their weights during the
training step.
► BERTweet and Twitter4SSE models, being pretrained on texts from
Twitter, are able to successfully deal with the intrinsic noise of data
from social media, thus no further special cleaning is required (such as
dealing with hashtags, abbreviations, and typos).
https://huggingface.co/roberta-base
https://huggingface.co/vinai/bertweet-base
https://huggingface.co/sentence-transformers/stsb-roberta-base-v2
https://huggingface.co/digio/Twitter4SSE
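All four checkpoints are standard Hugging Face models (e.g. loadable with `AutoModel.from_pretrained`). A tweet embedding is typically obtained by mean-pooling the token states while ignoring padding; a minimal sketch of that pooling step (our illustration, not the authors' code):

```python
import torch

def mean_pool(hidden_states, attention_mask):
    """Mean-pool token embeddings, ignoring padded positions.

    hidden_states: (batch, seq_len, dim) token states from the encoder
    attention_mask: (batch, seq_len) with 1 for real tokens, 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).float()   # (batch, seq_len, 1)
    summed = (hidden_states * mask).sum(dim=1)    # (batch, dim)
    counts = mask.sum(dim=1).clamp(min=1e-9)      # avoid division by zero
    return summed / counts
```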
User Embedding
We test three techniques to process tweet embeddings to generate accurate user
embeddings:
► MEAN: the weights of the Stage-1 model are frozen (no training is performed when we select this variant). We also test this approach with the Stage-1 weights unfrozen; in that case we limit the number of tweets per user, also for a fair comparison with the other variants;
► Recurrence over BERT (RoBERT): the embeddings of tweets are used as input of a Recurrent Model. We select a 2-layer LSTM model with hidden size 768. We use the last output as the user embedding. We test this approach both freezing and unfreezing the weights of the Stage-1 model;
► Transformer over BERT (ToBERT): the embeddings of tweets are used as input of a Transformer Model with 2 encoding layers (EL) and 2 decoding layers (DL), 16 heads, and 0.1 dropout. We also experiment with 1 encoding and 1 decoding layer, and with and without dropout. We test this approach both freezing and unfreezing the weights of the Stage-1 model.
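A minimal sketch of the RoBERT variant described above, assuming per-tweet embeddings are already produced by the Stage-1 model (the class name is ours):

```python
import torch
import torch.nn as nn

class RoBERTUserEncoder(nn.Module):
    """Sketch of RoBERT: a 2-layer LSTM runs over the sequence of
    per-tweet embeddings; the last output is the user embedding."""

    def __init__(self, dim=768):
        super().__init__()
        self.lstm = nn.LSTM(dim, dim, num_layers=2, batch_first=True)

    def forward(self, tweet_embs):
        # tweet_embs: (batch, n_tweets, dim)
        out, _ = self.lstm(tweet_embs)
        return out[:, -1, :]   # last step as the user embedding
```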
► Evaluation set on 5K users
► benchmark consists of comparing a user with 30 other candidate users
► 5 of them considered similar and 25 of them considered not similar
Evaluation
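For each benchmark entry, the 30 candidates can be ranked by cosine similarity to the target user's embedding; a small sketch (our illustration):

```python
import numpy as np

def rank_candidates(user_vec, cand_vecs):
    """Rank candidate users by cosine similarity to the target user.

    user_vec: (dim,) target user embedding
    cand_vecs: (n_candidates, dim) candidate embeddings
    Returns candidate indices, most similar first.
    """
    u = user_vec / np.linalg.norm(user_vec)
    c = cand_vecs / np.linalg.norm(cand_vecs, axis=1, keepdims=True)
    sims = c @ u
    return np.argsort(-sims)
```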
Optimization
► We select Multiple Negative Loss (MNLoss) as our loss function
► Within a batch of n users, we assume that a user did not retweet posts from any of the other n − 1 users. This assumption holds for small batches because of the large total number of users and the approach selected to collect data.
► We use AdamW optimizer, learning rate 2×10−5, linear scheduler with 10%
warmup steps on a single GPU (NVIDIA Tesla P100).
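Multiple Negatives loss treats the other in-batch users as negatives: each connected pair gives a positive on the diagonal of a similarity matrix, and cross-entropy is applied row-wise. A sketch under that reading (the `scale` value is an assumption, mirroring a common sentence-transformers default; in training this would be paired with the AdamW and scheduler settings above):

```python
import torch
import torch.nn.functional as F

def multiple_negatives_loss(user_a, user_b, scale=20.0):
    """In-batch Multiple Negatives loss.

    user_a, user_b: (batch, dim) embeddings of retweet-connected pairs
    (a_i, b_i); every other b_j in the batch serves as a negative for a_i.
    """
    a = F.normalize(user_a, dim=-1)
    b = F.normalize(user_b, dim=-1)
    scores = a @ b.t() * scale                 # (batch, batch) cosine sims
    labels = torch.arange(scores.size(0))      # positives on the diagonal
    return F.cross_entropy(scores, labels)
```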
Model Evaluation Results
Results discussion
► Naive approaches underperform hierarchical approaches, confirming an advantage of encoding single tweets independently.
► The hierarchical approach with a Stage-1 Twitter4SSE model and a Stage-2 Transformer model outperforms the other alternatives.
Evaluation on the task
► 20 tweets per user, thus 124k pairs of users in the training set.
We evaluate the models using three metrics:
► Mean Average Precision (MAP) between the binary labels (connected
or not connected by retweets) and the similarities.
► Mean Reciprocal Rank (MRR) @10 as a ranking quality measure defined
as the reciprocal of the rank of the first relevant element
► normalized Discounted Cumulative Gain (nDCG)
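The first two metrics can be computed directly from a candidate ranking's binary labels; a minimal sketch (our illustration):

```python
def reciprocal_rank_at_k(ranked_labels, k=10):
    """MRR@10 component: reciprocal rank of the first relevant item
    among the top-k ranked candidates, 0 if none appears."""
    for i, rel in enumerate(ranked_labels[:k], start=1):
        if rel:
            return 1.0 / i
    return 0.0

def average_precision(ranked_labels):
    """AP over a ranking of binary labels (1 = retweet-connected)."""
    hits, total = 0, 0.0
    for i, rel in enumerate(ranked_labels, start=1):
        if rel:
            hits += 1
            total += hits / i
    return total / hits if hits else 0.0
```

MAP and MRR@10 are then the means of these values over the 5K evaluated users.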
Details on evaluation
► Stage-1 Model Comparison. Firstly we investigate the best initialization model. For each experiment,
we keep the same hyper-parameters and the same Stage-2 model is trained on top of it: ToBERT with
2 encoding layers (EL) and 2 decoding layers (DL), 0.1 dropout, and MEAN pooling. We test
RoBERTa, BERTweet, S-RoBERTa, and Twitter4SSE. Table 1 shows that Twitter4SSE is the best
initialization. As expected, this model, trained to generate accurate tweet embeddings, outperforms
both the model trained on Tweets using only MLM (BERTweet) and the model trained to generate
accurate sentence embeddings on formal data (S-RoBERTa).
► MEAN Stage-2 Models Comparison. We test the MEAN Stage-2 approach on the four Stage-1
models with and without freezing their weights. Table 2 shows that unfreezing the weights leads to
better results, even if the batch size has to be reduced to 10 and the number of tokens per tweet is
reduced to 32 to fit in memory. We confirm that the best Stage-1 model is Twitter4SSE for these
configurations too.
► ToBERT Hyperparameter Comparison. We investigate the best hyperparameter configuration of the Stage-2 Transformer model (ToBERT), testing 1 and 2 encoding and decoding layers
(EL-DL), with and without dropout. We fix Twitter4SSE as initial model. Table 3 shows that 2 EL and 2
DL without dropout is the best overall configuration.
► Full Comparison. We compare the performance of the models with a Random baseline and with the
two best approaches from related work.
► As expected, a greater number of tweets per user results in a better model, when the number of pairs of training users is fixed.
► However, a greater n implies a lower number of users since we have a
limited collection of tweets.
► We investigate what is the best trade-off between the number of users and
the number of tweets per user.
► The performance of models trained with different numbers of tweets per user, including every available user, varies.
► A peak around 20 tweets is the best trade-off.
► Note: this number is highly dependent on our collection, since the number of downloaded tweets is large but finite (2 complete months).
Other tasks
► Community analysis
► Polarization detection
► Outlier detection
► Fixed model: a hierarchical model with a frozen Stage-1 Twitter4SSE model and a Stage-2 ToBERT model with 2 layers,
0.1 dropout rate, MEAN pooling, trained using 20 tweets for each user for one epoch.
Outliers
► We run the Local Outlier Factor (LOF) algorithm on three lists of users and manually inspect the results.
► On embeddings of technology list
– Outlier on videogames
► On embeddings of chefs list
– Outlier: a cook tweeting about unrelated topics
► On embeddings of charity-ngo list
– Outlier account of Charlize Theron
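A sketch of the LOF step on a list's user embeddings, using scikit-learn's implementation (the `n_neighbors` value is an assumption, not from the slides):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def flag_outliers(user_embeddings, n_neighbors=20):
    """Run LOF on the user embeddings of one list.

    Returns a boolean mask where True marks a predicted outlier
    (LOF label -1); the rest are inliers (label +1).
    """
    lof = LocalOutlierFactor(n_neighbors=n_neighbors)
    labels = lof.fit_predict(user_embeddings)
    return labels == -1
```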
Concluding
► Large unbiased dataset ready for user similarity analysis
► Selection and optimization of a hierarchical language model
► Validation of models on similarity
► Application to related problems
► Future and ongoing work: Impact of time and topic drift
Hierarchical Transformers for User Semantic Similarity
THANKS!
Marco Di Giovanni
Marco Brambilla
http://datascience.deib.polimi.it/
https://marco-brambilla.com/
marco.brambilla@polimi.it
@marcobrambi