This document describes a project to detect duplicate documents in the Hoaxy dataset using linguistic features and propagation dynamics on Twitter. It covers collecting documents and diffusion networks from Hoaxy, preprocessing the text, clustering documents with LDA, LSI, and HDP, extracting features from propagation dynamics, and training a random forest classifier on the clustered documents and features. The random forest achieves an F1-score of 0.72 for LDA, 0.75 for LSI, and 0.71 for HDP clusters when determining whether document pairs are duplicates. The approach aims to predict the topics of "dead" web pages from their diffusion networks on Twitter.
1. Duplicate Detection on the Hoaxy Dataset
Team: Spoilers
Social Media Mining Hackathon
Essa, Yasanka and Sreeja
2. Hoaxy
- Collects low-credibility documents on the web, e.g., fake news, hoaxes, rumours, conspiracies, etc.
- Visualizes the spread of document URLs on Twitter
  - Temporal trends
  - # shares on Twitter
  - Diffusion network: retweets, replies, mentions, etc.
- Includes bot scores, details of fact checks, etc.
https://hoaxy.iuni.iu.edu
5. Hackathon Objectives
Given two low-credibility claims, determine whether they are duplicates
- Based on the linguistic features of the claims
  - looking at the content of the claims (obvious!)
- Based on temporal dynamics on Twitter
  - looking at the diffusion networks of the two claim URLs (wow!)
9. Hoaxy API
● It provides seven types of queries
○ GET Articles, GET LatestArticles, GET Network, GET Timeline, GET TopArticles, GET TopUsers, and GET Tweets.
● GET Articles query
○ Requires a keyword and a published-date range.
○ Returns the source URL, date published, article/document id, domain, site type, total tweets, title, and score.
● GET Network query
○ Requires an article/document id, with options for max edges, max nodes, and whether mentions are included.
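The two queries above can be sketched as URL builders. The host and parameter names below are assumptions based on the query descriptions, not the documented Hoaxy API; check the live API docs before use:

```python
from urllib.parse import urlencode

# Assumed host for the Hoaxy API (hosted on RapidAPI at the time of writing).
HOST = "https://api-hoaxy.p.rapidapi.com"

def articles_url(keyword: str, date_start: str, date_end: str) -> str:
    """GET Articles: a keyword plus a published-date range (param names assumed)."""
    params = {
        "query": keyword,
        "date_published_start": date_start,
        "date_published_end": date_end,
    }
    return f"{HOST}/articles?{urlencode(params)}"

def network_url(article_ids, max_nodes=1000, max_edges=12500,
                include_mentions=False) -> str:
    """GET Network: article/document ids plus size caps (param names assumed)."""
    params = {
        "ids": ",".join(str(i) for i in article_ids),
        "nodes_limit": max_nodes,
        "edges_limit": max_edges,
        "include_user_mentions": str(include_mentions).lower(),
    }
    return f"{HOST}/network?{urlencode(params)}"

print(articles_url("election", "2018-09-16", "2018-10-16"))
```

An actual collection run would send these URLs with `requests.get` along with the required API key headers.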
10. Data Collection Process
● Collecting documents
○ Keyword list: 21 general terms from business, politics, and trending news.
○ Hoaxy API article queries.
○ Articles collected for one month, from Sep 16, 2018 to Oct 16, 2018.
● Collecting star networks
○ Using the collected article/document ids.
○ Hoaxy API diffusion-network queries.
○ Star networks collected for each article.
11. Data Characterization
● Total documents: ~7k.
● Total star networks: ~82k.
● Max. star networks per document: 627.
● Largest star network size (adoptions): ~9k (the cap specified in the API).
● Total nodes: ~172k.
● Total adoptions: ~522k.
12. Tweet Adoption
● The longest adoption occurs after 28 days.
● The majority of adoptions happen in less than an hour.
● A user may adopt tweets of another user across more than one document.
● 591 documents have been adopted by the same user pair.
15. Web Scraping
● Collected the content of each URL
● Packages used
○ Requests
○ NLTK
○ BeautifulSoup
● Data cleaning
○ Stop word removal using NLTK's English stop word list
○ Punctuation removal
○ Lemmatization using gensim's lemmatize, keeping only the nouns
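A minimal sketch of the cleaning steps above. It substitutes a tiny inline stop word set for NLTK's English list and skips the lemmatization step (done with gensim's lemmatize in the actual pipeline):

```python
import string

# Tiny stand-in stop word set (the project used NLTK's English list).
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "is", "and", "to"}

def clean(text: str) -> list:
    # Punctuation removal, then lowercase tokenization and stop word removal.
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in text.lower().split() if tok not in STOP_WORDS]

print(clean("The spread of hoaxes on Twitter, in a month!"))
# → ['spread', 'hoaxes', 'twitter', 'month']
```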
18. Document Clustering
● Group similar documents into the same cluster
● Used 3 different approaches
○ LSI (Latent Semantic Indexing)
○ LDA (Latent Dirichlet Allocation)
○ HDP (Hierarchical Dirichlet Process)
19. LDA
● A "generative probabilistic model"
● It builds
○ a topics-per-document model
○ a words-per-topic model
● Both modeled as Dirichlet distributions (a continuous multivariate probability distribution)
● Transforms bag-of-words counts into a topic space
21. LSI
● Principle: words that are used in the same contexts tend to have similar meanings.
● Matrix decomposition (SVD) on the term-document matrix
○ Identifies patterns in the relationships between terms and concepts
● Extracts the conceptual content of a body of text
● Establishes associations between terms that occur in similar contexts.
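The SVD step can be illustrated on a toy term-document matrix (counts invented for illustration); documents that share terms end up close together in the reduced concept space:

```python
import numpy as np

# Rows = terms, columns = documents (toy counts).
A = np.array([[2.0, 1.0, 0.0, 0.0],   # "election"
              [1.0, 2.0, 0.0, 0.0],   # "vote"
              [0.0, 0.0, 2.0, 1.0],   # "storm"
              [0.0, 0.0, 1.0, 2.0]])  # "flood"

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                     # keep the top-k concepts
doc_concepts = Vt[:k]     # each column places a document in concept space
print(doc_concepts.round(2))
```

In gensim the equivalent is `models.LsiModel(bow, id2word=dictionary, num_topics=k)`.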
23. Coherence score
● Used to find the optimal number of topics
● Computed as the sum of pairwise scores of the top n words w1, ..., wn used to describe a topic.
● Röder, Both, Hinneburg: "Exploring the Space of Topic Coherence Measures", WSDM'15
25. HDP
● A nonparametric Bayesian approach to clustering grouped data
● A mixed-membership model for unsupervised analysis
● Infers the number of topics from the data
● Wang, Paisley, Blei: "Online Variational Inference for the Hierarchical Dirichlet Process", JMLR (2011).
30. Features: Propagation Dynamics
- Each document's URL is represented by the # retweets in each stage of its star lifetime on Twitter (quartiles of adoption delay), giving a pair of feature vectors per document pair:

  Stage of star lifetime:  q1  q2  q3  q4
  Document X                2   5   8  12
  Document Y                4   6  15  24
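One plausible reading of this feature can be sketched as follows: given the adoption delays of a document's retweets, count how many fall into each quartile stage of the star lifetime (the exact definition used in the project is an assumption here):

```python
import statistics

def stage_counts(delays):
    """# retweets per quartile stage of the star lifetime."""
    q1, q2, q3 = statistics.quantiles(delays, n=4)
    counts = [0, 0, 0, 0]
    for d in delays:
        if d <= q1:
            counts[0] += 1
        elif d <= q2:
            counts[1] += 1
        elif d <= q3:
            counts[2] += 1
        else:
            counts[3] += 1
    return counts

# Adoption delays (e.g., minutes after the seed tweet) for one document:
print(stage_counts([1, 2, 3, 4, 5, 6, 7, 8]))   # → [2, 2, 2, 2]
```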
31. Target: Documents in same cluster
- Given a pair of documents, the target is whether they appeared in the same topic.
- E.g., topic modeling places Document X in Topic 01 and Document Y in Topic 03, so the pair's feature vector (2, 5, 8, 12, 4, 6, 15, 24) gets label 0.
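Labeling the pairs can be sketched as follows, with toy topic assignments standing in for the topic model's output:

```python
from itertools import combinations

# Toy topic assignments: document id -> topic id.
topic_of = {"X": 1, "Y": 3, "Z": 1}

# Label each document pair 1 if both fall in the same topic, else 0.
labels = {(a, b): int(topic_of[a] == topic_of[b])
          for a, b in combinations(sorted(topic_of), 2)}
print(labels)   # → {('X', 'Y'): 0, ('X', 'Z'): 1, ('Y', 'Z'): 0}
```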
32. Multiple Targets: Documents in same cluster
- Given a pair of documents, whether they appeared in the same topic, computed separately per topic model.
- E.g., under LDA, Documents X and Y fall in different topics, so the feature vector (2, 5, 8, 12, 4, 6, 15, 24) gets label 0; under LSI they fall in the same topic, so the same vector gets label 1.
33. Classification: Random Forest
● Train/test split per topic model
○ LDA
○ LSI
○ HDP
● Two classes (highly imbalanced):
○ Randomly draw balanced samples
● Hyper-parameters
○ # Decision trees: 1000
○ Split criterion: Gini impurity
○ Max depth: 4
○ Bootstrapped samples
○ Max features = sqrt(# features)

Topic Modeling    Train     Test
LDA             411,558  137,186
LSI             215,768   71,923
HDP             120,243   40,081
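A scikit-learn sketch with the hyper-parameters listed above; the toy features and labels are random stand-ins for the real pair vectors:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.random((200, 8))          # 8 features: two 4-stage vectors per pair
y = rng.integers(0, 2, 200)       # 1 = same cluster, 0 = not

clf = RandomForestClassifier(
    n_estimators=1000,            # # decision trees
    criterion="gini",             # split criterion: Gini impurity
    max_depth=4,
    bootstrap=True,               # bootstrapped samples
    max_features="sqrt",          # sqrt(# features)
    random_state=0,
)
clf.fit(X, y)
print(clf.score(X, y))
```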
34. Performance: Random Forest

Topic     Target                 Precision  Recall  F1-score    Train     Test
Modeling  (same cluster or not)
LDA       No                       0.53      0.03     0.06    411,558  137,186
          Yes                      0.57      0.97     0.72
LSI       No                       0.58      0.21     0.30    215,768   71,923
          Yes                      0.63      0.90     0.75
HDP       No                       0.64      0.80     0.71    120,243   40,081
          Yes                      0.63      0.43     0.52
35. Discussion
● How useful are such predictions?
○ Assume a fake news article has been posted at URL Z;
■ Now, Twitter users share/retweet Z.
■ Later, administrators decide to kill the web page at Z.
■ Still, Twitter users share/retweet the original tweet containing Z.
■ Use the URL diffusion network
● to predict the topic/category of the dead article
● Topic modeling captures:
○ the latent semantic structure of documents
● Propagation dynamics capture:
○ the latent cascade structure that originated on a different platform