We discuss the use of hierarchical transformers for user semantic similarity in the context of analyzing users' behavior and profiling social media users. The objectives of the research include finding the best model for computing semantic user similarity, exploring the use of transformer-based models, and evaluating whether the embeddings reflect the desired similarity concept and can be used for other tasks.
We use a large dataset of Twitter users and apply an automatic labeling approach. The dataset consists of English tweets posted in November and December 2020, totaling about 27GB of compressed data. Preprocessing steps include filtering out short texts, cleaning user connections, and selecting a benchmark set of users for evaluation.
Transformer architectures are known to work well on short texts, but they cannot directly process the extensive collections of tweets that describe a user's activity. Therefore, we propose a hierarchical structure of transformer models, applied first to individual tweets and then to their aggregations.
The models used in the study are hierarchical transformers, and the tweet embeddings are obtained using four Transformer-based models: RoBERTa, BERTweet, Sentence-BERT, and Twitter4SSE. We test different techniques for processing tweet embeddings to generate accurate user embeddings, including mean pooling, recurrence over BERT (RoBERT), and transformer over BERT (ToBERT).
The evaluation of the models is done on a set of 5,000 users, comparing user similarities with 30 other candidate users, 5 of which are considered similar and 25 considered dissimilar. The evaluation metrics used include mean average precision (MAP), mean reciprocal rank (MRR) at 10, and normalized discounted cumulative gain (nDCG).
The optimization process involves selecting a loss function and using the AdamW optimizer with specific hyperparameters. The results show that the hierarchical approach with a Stage-1 Twitter4SSE model and a Stage-2 Transformer model performs the best among the alternatives.
In conclusion, the research provides a large unbiased dataset for user similarity analysis, presents a hierarchical language model optimized for accurate user similarity computation, and validates the models' performance on similarity tasks, with potential applications to related problems.
The future work includes investigating the impact of time and topic drift on the models' performance.
M. Di Giovanni, M. Brambilla. Hierarchical Transformers for User Semantic Similarity. ICWE 2023
Agenda
1. Motivation
2. Model: hierarchical configuration of BERT text transformers
3. Evaluation
4. Conclusions
Context and Motivation
Analysis of users’ behaviour and profiling of social media users
► customization of the overall personal experience
► recommendations
► detection of duplicates
► social threats
Sources:
► textual-content shared by users
► the social graphs involving users
– Friendship /followship
– Mention, likes, …
► shared resources (links, media, content)
► RQ1: best model to compute semantic user similarity?
► RQ2: can Transformer-based models be used?
► RQ3: embeddings reflect our idea of similarity?
Can we use them for further tasks?
► Aim at a fully reproducible approach without influencing the results with
biased selections of small sets of users
Objectives
► Large dataset of Twitter users, with automatic labelling approach
► Training of a Hierarchical Language Model to compute accurate user
similarity
► Optimization of hyper-parameters to obtain the best configuration of the model
► Test accuracy of embeddings when applied to other tasks
Contributions
► Twitter
► Assumption: retweets represent some form of agreement, interest, or perceived importance
► Data from Archive Team Twitter [*]
► Only the textual content shared by users. No demographics, no screen names
► We select English tweets (filtered according to the “lang” field) posted in November and December 2020. They amount to about 27GB of compressed data.
[*] https://archive.org/details/twitterstream
Dataset and preprocessing
► We remove texts shorter than 20 characters (29M texts tweeted by 10M
unique users)
► We set the maximum number of tweets per user to 60 and the minimum to 5 (1M users)
► Clean the connections between users (pairs of ids where one user retweets the other): we remove auto-retweets (a user retweeting one of its own tweets), duplicate pairs, links to excluded users, and users with more than 50 connections, leaving 1.9M connections between 950k unique users
► Benchmark consists of comparing a user with 30 other candidate users, 5 of
them considered similar to it since they share at least one retweet
connection, and 25 of them considered not similar
Preprocessing
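The filtering steps above can be sketched as follows (a minimal illustration; field and function names are our own, not the paper's code):

```python
# Sketch of the tweet/user filtering described above.
# Field names ("lang", "text") are illustrative assumptions.

def filter_tweets(tweets, min_chars=20):
    """Keep English tweets whose text has at least `min_chars` characters."""
    return [t for t in tweets
            if t.get("lang") == "en" and len(t["text"]) >= min_chars]

def filter_users(tweets_by_user, min_tweets=5, max_tweets=60):
    """Keep users with at least 5 tweets; cap each user at 60 tweets."""
    return {uid: ts[:max_tweets]
            for uid, ts in tweets_by_user.items()
            if len(ts) >= min_tweets}
```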
Language Model
Down Memory Lane
Encoders and Decoders
Hierarchical Transformer Model
Tweet Embedding
► Obtain embedding of tweets using one of the following four
Transformer-based models that share the same architecture but are
pretrained with different approaches and datasets:
– RoBERTa,
– BERTweet,
– Sentence-BERT,
– Twitter4SSE.
► We test them by freezing and unfreezing their weights during the
training step.
► BERTweet and Twitter4SSE models, being pretrained on texts from
Twitter, are able to successfully deal with the intrinsic noise of data
from social media, thus no further special cleaning is required (such as
dealing with hashtags, abbreviations, and typos).
https://huggingface.co/roberta-base
https://huggingface.co/vinai/bertweet-base
https://huggingface.co/sentence-transformers/stsb-roberta-base-v2
https://huggingface.co/digio/Twitter4SSE
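All four checkpoints are standard Hugging Face models (e.g. loadable with `AutoModel.from_pretrained`). A tweet embedding is typically obtained by mean-pooling the token states while ignoring padding; a minimal sketch of that pooling step (our illustration, not the authors' code):

```python
import torch

def mean_pool(hidden_states, attention_mask):
    """Mean-pool token embeddings, ignoring padded positions.

    hidden_states: (batch, seq_len, dim) token states from the encoder
    attention_mask: (batch, seq_len) with 1 for real tokens, 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).float()   # (batch, seq_len, 1)
    summed = (hidden_states * mask).sum(dim=1)    # (batch, dim)
    counts = mask.sum(dim=1).clamp(min=1e-9)      # avoid division by zero
    return summed / counts
```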
User Embedding
We test three techniques to process tweet embeddings to generate accurate user
embeddings:
► MEAN: the weights of the Stage-1 model are frozen (no training is performed when we select this variant). We also test this approach with the Stage-1 weights unfrozen; in that case we limit the number of tweets per user, also for a fair comparison with the other variants;
► Recurrence over BERT (RoBERT): the embeddings of tweets are used as input of a Recurrent Model. We select a 2-layer LSTM model with hidden size 768. We use the last output as the user embedding. We test this approach both freezing and unfreezing the weights of the Stage-1 model;
► Transformer over BERT (ToBERT): the embeddings of tweets are used as input of a Transformer Model with 2 encoding layers (EL) and 2 decoding layers (DL), 16 heads, and 0.1 dropout. We also experiment with 1 encoding and 1 decoding layer, and with and without dropout. We test this approach both freezing and unfreezing the weights of the Stage-1 model.
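A minimal sketch of the RoBERT variant described above, assuming per-tweet embeddings are already produced by the Stage-1 model (the class name is ours):

```python
import torch
import torch.nn as nn

class RoBERTUserEncoder(nn.Module):
    """Sketch of RoBERT: a 2-layer LSTM runs over the sequence of
    per-tweet embeddings; the last output is the user embedding."""

    def __init__(self, dim=768):
        super().__init__()
        self.lstm = nn.LSTM(dim, dim, num_layers=2, batch_first=True)

    def forward(self, tweet_embs):
        # tweet_embs: (batch, n_tweets, dim)
        out, _ = self.lstm(tweet_embs)
        return out[:, -1, :]   # last step as the user embedding
```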
► Evaluation set on 5K users
► benchmark consists of comparing a user with 30 other candidate users
► 5 of them considered similar and 25 of them considered not similar
Evaluation
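For each benchmark entry, the 30 candidates can be ranked by cosine similarity to the target user's embedding; a small sketch (our illustration):

```python
import numpy as np

def rank_candidates(user_vec, cand_vecs):
    """Rank candidate users by cosine similarity to the target user.

    user_vec: (dim,) target user embedding
    cand_vecs: (n_candidates, dim) candidate embeddings
    Returns candidate indices, most similar first.
    """
    u = user_vec / np.linalg.norm(user_vec)
    c = cand_vecs / np.linalg.norm(cand_vecs, axis=1, keepdims=True)
    sims = c @ u
    return np.argsort(-sims)
```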
Optimization
► We select Multiple Negative Loss (MNLoss) as our loss function
► Within a batch of n users, we assume that a user did not retweet posts from any of the other n − 1 users. This assumption holds for small batches because of the large total number of users and the approach selected to collect data.
► We use AdamW optimizer, learning rate 2×10−5, linear scheduler with 10%
warmup steps on a single GPU (NVIDIA Tesla P100).
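Multiple Negatives loss treats the other in-batch users as negatives: each connected pair gives a positive on the diagonal of a similarity matrix, and cross-entropy is applied row-wise. A sketch under that reading (the `scale` value is an assumption, mirroring a common sentence-transformers default; in training this would be paired with the AdamW and scheduler settings above):

```python
import torch
import torch.nn.functional as F

def multiple_negatives_loss(user_a, user_b, scale=20.0):
    """In-batch Multiple Negatives loss.

    user_a, user_b: (batch, dim) embeddings of retweet-connected pairs
    (a_i, b_i); every other b_j in the batch serves as a negative for a_i.
    """
    a = F.normalize(user_a, dim=-1)
    b = F.normalize(user_b, dim=-1)
    scores = a @ b.t() * scale                 # (batch, batch) cosine sims
    labels = torch.arange(scores.size(0))      # positives on the diagonal
    return F.cross_entropy(scores, labels)
```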
Model Evaluation Results
Results discussion
► Naive approaches underperform hierarchical approaches, confirming an advantage of encoding single tweets independently.
► The hierarchical approach with a Stage-1 Twitter4SSE model and a Stage-2 Transformer model outperforms the other alternatives.
Evaluation on the task
► 20 tweets per user, thus 124k pairs of users in the training set.
We evaluate the models using three metrics:
► Mean Average Precision (MAP) between the binary labels (connected
or not connected by retweets) and the similarities.
► Mean Reciprocal Rank (MRR) @10 as a ranking quality measure defined
as the reciprocal of the rank of the first relevant element
► normalized Discounted Cumulative Gain (nDCG)
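The first two metrics can be computed directly from a candidate ranking's binary labels; a minimal sketch (our illustration):

```python
def reciprocal_rank_at_k(ranked_labels, k=10):
    """MRR@10 component: reciprocal rank of the first relevant item
    among the top-k ranked candidates, 0 if none appears."""
    for i, rel in enumerate(ranked_labels[:k], start=1):
        if rel:
            return 1.0 / i
    return 0.0

def average_precision(ranked_labels):
    """AP over a ranking of binary labels (1 = retweet-connected)."""
    hits, total = 0, 0.0
    for i, rel in enumerate(ranked_labels, start=1):
        if rel:
            hits += 1
            total += hits / i
    return total / hits if hits else 0.0
```

MAP and MRR@10 are then the means of these values over the 5K evaluated users.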
Details on evaluation
► Stage-1 Model Comparison. Firstly we investigate the best initialization model. For each experiment,
we keep the same hyper-parameters and the same Stage-2 model is trained on top of it: ToBERT with
2 encoding layers (EL) and 2 decoding layers (DL), 0.1 dropout, and MEAN pooling. We test
RoBERTa, BERTweet, S-RoBERTa, and Twitter4SSE. Table 1 shows that Twitter4SSE is the best
initialization. As expected, this model, trained to generate accurate tweet embeddings, outperforms
both the model trained on Tweets using only MLM (BERTweet) and the model trained to generate
accurate sentence embeddings on formal data (S-RoBERTa).
► MEAN Stage-2 Models Comparison. We test the MEAN Stage-2 approach on the four Stage-1
models with and without freezing their weights. Table 2 shows that unfreezing the weights leads to
better results, even if the batch size has to be reduced to 10 and the number of tokens per tweet is
reduced to 32 to fit in memory. We confirm that the best Stage-1 model is Twitter4SSE for these
configurations too.
► ToBERT Hyperparameter Comparison. We investigate the best hyperparameter configuration of the Stage-2 Transformer model (ToBERT), testing 1 and 2 encoding and decoding layers
(EL-DL), with and without dropout. We fix Twitter4SSE as initial model. Table 3 shows that 2 EL and 2
DL without dropout is the best overall configuration.
► Full Comparison. We compare the performance of the models with a Random baseline and with the
two best approaches from related work.
► As expected, a greater number of tweets per user results in a better model, when the number of pairs of training users is fixed.
► However, a greater n implies a lower number of users since we have a
limited collection of tweets.
► We investigate what is the best trade-off between the number of users and
the number of tweets per user.
► The performance of models trained with different numbers of tweets per user, including every available user, varies.
► A peak around 20 tweets is the best trade-off.
► Note: this number is highly dependent on our collection, since the number of downloaded tweets is large but finite (2 complete months).
Other tasks
► Community analysis
► Polarization detection
► Outlier detection
► Fixed model: a hierarchical model with a frozen Stage-1 Twitter4SSE model and a Stage-2 ToBERT model with 2 layers,
0.1 dropout rate, MEAN pooling, trained using 20 tweets for each user for one epoch.
Outliers
► We run the Local Outlier Factor (LOF) algorithm on three lists of users and manually inspect the results.
► On embeddings of technology list
– Outlier on videogames
► On embeddings of chefs list
– Outlier: a cook tweeting about unrelated topics
► On embeddings of charity-ngo list
– Outlier account of Charlize Theron
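A sketch of the LOF step on a list's user embeddings, using scikit-learn's implementation (the `n_neighbors` value is an assumption, not from the slides):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def flag_outliers(user_embeddings, n_neighbors=20):
    """Run LOF on the user embeddings of one list.

    Returns a boolean mask where True marks a predicted outlier
    (LOF label -1); the rest are inliers (label +1).
    """
    lof = LocalOutlierFactor(n_neighbors=n_neighbors)
    labels = lof.fit_predict(user_embeddings)
    return labels == -1
```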
Concluding
► Large unbiased dataset ready for user similarity analysis
► Selection and optimization of a hierarchical language model
► Validation of models on similarity
► Application to related problems
► Future and ongoing work: Impact of time and topic drift
Hierarchical Transformers for User Semantic Similarity
THANKS!
Marco Di Giovanni
Marco Brambilla
http://datascience.deib.polimi.it/
https://marco-brambilla.com/
marco.brambilla@polimi.it
@marcobrambi