This document describes a project to detect duplicate documents in the Hoaxy dataset using linguistic features and propagation dynamics on Twitter. It covers collecting documents and diffusion networks from Hoaxy, preprocessing the text, clustering documents with LDA, LSI, and HDP, extracting features from propagation dynamics, and training a random forest classifier on the clustered documents and features. The random forest achieves an F1-score of 0.72 for LDA, 0.75 for LSI, and 0.71 for HDP clusters when determining whether document pairs are duplicates. The approach aims to predict the topics of "dead" web pages from their diffusion networks on Twitter.
1. Duplicate Detection on the Hoaxy Dataset
Team: Spoilers
Social Media Mining Hackathon
Essa, Yasanka and Sreeja
2. Hoaxy
- Collects low-credibility documents on the web, e.g., fake news, hoaxes, rumours, conspiracies, etc.
- Visualizes the spread of document URLs on Twitter
  - Temporal trends
  - # shares on Twitter
  - Diffusion network: retweets, replies, mentions, etc.
- Includes bot scores, details of fact checks, etc.
https://hoaxy.iuni.iu.edu
5. Hackathon Objectives
Given two low-credibility claims, determine whether they are duplicates
- Based on the linguistic features of the claims
  - looking at the content of the claims (obvious!)
- Based on temporal dynamics on Twitter
  - looking at the diffusion networks of the two claim URLs (wow!)
9. Hoaxy API
● It provides seven types of queries
○ GET Articles, GET LatestArticles, GET Network, GET Timeline, GET TopArticles, GET TopUsers, and GET Tweets.
● GET Articles query
○ Requires a keyword and a published-date range.
○ Returns the source URL, date published, article/document id, domain, site type, total tweets, title, and score.
● GET Network query
○ Requires an article/document id, with options for max edges, max nodes, and whether mentions are included.
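The two queries above can be sketched as URL builders. The host and parameter names below are assumptions based on the query descriptions, not the documented Hoaxy API; check the live API docs before use:

```python
from urllib.parse import urlencode

# Assumed host for the Hoaxy API (hosted on RapidAPI at the time of writing).
HOST = "https://api-hoaxy.p.rapidapi.com"

def articles_url(keyword: str, date_start: str, date_end: str) -> str:
    """GET Articles: a keyword plus a published-date range (param names assumed)."""
    params = {
        "query": keyword,
        "date_published_start": date_start,
        "date_published_end": date_end,
    }
    return f"{HOST}/articles?{urlencode(params)}"

def network_url(article_ids, max_nodes=1000, max_edges=12500,
                include_mentions=False) -> str:
    """GET Network: article/document ids plus size caps (param names assumed)."""
    params = {
        "ids": ",".join(str(i) for i in article_ids),
        "nodes_limit": max_nodes,
        "edges_limit": max_edges,
        "include_user_mentions": str(include_mentions).lower(),
    }
    return f"{HOST}/network?{urlencode(params)}"

print(articles_url("election", "2018-09-16", "2018-10-16"))
```

An actual collection run would send these URLs with `requests.get` along with the required API key headers.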
10. Data Collection Process
● Collecting documents
○ Keyword list: 21 general terms from business, politics, and trending news.
○ Hoaxy API article queries.
○ Articles collected for one month, from Sep 16, 2018 to Oct 16, 2018.
● Collecting star networks
○ Using the collected article/document ids.
○ Hoaxy API diffusion-network queries.
○ Star networks collected for each article.
11. Data Characterization
● Total documents: ~7k.
● Total star networks: ~82k.
● Max. star networks per document: 627.
● Largest star network size (adoptions): ~9k (the cap specified in the API).
● Total nodes: ~172k.
● Total adoptions: ~522k.
12. Tweet Adoption
● The longest adoption occurs after 28 days.
● The majority of adoptions happen in less than an hour.
● A user may adopt tweets of another user across more than one document.
● 591 documents have been adopted by the same user pair.
15. Web Scraping
● Collected the content of each URL
● Packages used
○ Requests
○ NLTK
○ BeautifulSoup
● Data cleaning
○ Stop word removal using NLTK's English stop word list
○ Punctuation removal
○ Lemmatization using gensim's lemmatize, keeping only the nouns
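A minimal sketch of the cleaning steps above. It substitutes a tiny inline stop word set for NLTK's English list and skips the lemmatization step (done with gensim's lemmatize in the actual pipeline):

```python
import string

# Tiny stand-in stop word set (the project used NLTK's English list).
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "is", "and", "to"}

def clean(text: str) -> list:
    # Punctuation removal, then lowercase tokenization and stop word removal.
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in text.lower().split() if tok not in STOP_WORDS]

print(clean("The spread of hoaxes on Twitter, in a month!"))
# → ['spread', 'hoaxes', 'twitter', 'month']
```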
18. Document Clustering
● Group similar documents into the same cluster
● Used 3 different approaches
○ LSI (Latent Semantic Indexing)
○ LDA (Latent Dirichlet Allocation)
○ HDP (Hierarchical Dirichlet Process)
19. LDA
● A "generative probabilistic model"
● It builds
○ a topics-per-document model
○ a words-per-topic model
● Both modeled as Dirichlet distributions (a continuous multivariate probability distribution)
● Transforms bag-of-words counts into a topic space
21. LSI
● Principle: words that are used in the same contexts tend to have similar meanings.
● Matrix decomposition (SVD) on the term-document matrix
○ Identifies patterns in the relationships between terms and concepts
● Extracts the conceptual content of a body of text
● Establishes associations between terms that occur in similar contexts.
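The SVD step can be illustrated on a toy term-document matrix (counts invented for illustration); documents that share terms end up close together in the reduced concept space:

```python
import numpy as np

# Rows = terms, columns = documents (toy counts).
A = np.array([[2.0, 1.0, 0.0, 0.0],   # "election"
              [1.0, 2.0, 0.0, 0.0],   # "vote"
              [0.0, 0.0, 2.0, 1.0],   # "storm"
              [0.0, 0.0, 1.0, 2.0]])  # "flood"

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                     # keep the top-k concepts
doc_concepts = Vt[:k]     # each column places a document in concept space
print(doc_concepts.round(2))
```

In gensim the equivalent is `models.LsiModel(bow, id2word=dictionary, num_topics=k)`.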
23. Coherence score
● Used to find the optimal number of topics
● Computed as the sum of pairwise scores of the top n words w1, ..., wn used to describe a topic.
● Röder, Both, Hinneburg: "Exploring the Space of Topic Coherence Measures", WSDM'15
25. HDP
● A nonparametric Bayesian approach to clustering grouped data
● A mixed-membership model for unsupervised analysis
● Infers the number of topics from the data
● Wang, Paisley, Blei: "Online Variational Inference for the Hierarchical Dirichlet Process", JMLR (2011).
30. Features: Propagation Dynamics
- Each document's URL is represented by the # retweets in each stage of its star lifetime on Twitter (quartiles of adoption delay), giving a pair of feature vectors per document pair:

  Stage of star lifetime:  q1  q2  q3  q4
  Document X                2   5   8  12
  Document Y                4   6  15  24
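One plausible reading of this feature can be sketched as follows: given the adoption delays of a document's retweets, count how many fall into each quartile stage of the star lifetime (the exact definition used in the project is an assumption here):

```python
import statistics

def stage_counts(delays):
    """# retweets per quartile stage of the star lifetime."""
    q1, q2, q3 = statistics.quantiles(delays, n=4)
    counts = [0, 0, 0, 0]
    for d in delays:
        if d <= q1:
            counts[0] += 1
        elif d <= q2:
            counts[1] += 1
        elif d <= q3:
            counts[2] += 1
        else:
            counts[3] += 1
    return counts

# Adoption delays (e.g., minutes after the seed tweet) for one document:
print(stage_counts([1, 2, 3, 4, 5, 6, 7, 8]))   # → [2, 2, 2, 2]
```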
31. Target: Documents in same cluster
- Given a pair of documents, the target is whether they appeared in the same topic.
- E.g., topic modeling places Document X in Topic 01 and Document Y in Topic 03, so the pair's feature vector (2, 5, 8, 12, 4, 6, 15, 24) gets label 0.
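Labeling the pairs can be sketched as follows, with toy topic assignments standing in for the topic model's output:

```python
from itertools import combinations

# Toy topic assignments: document id -> topic id.
topic_of = {"X": 1, "Y": 3, "Z": 1}

# Label each document pair 1 if both fall in the same topic, else 0.
labels = {(a, b): int(topic_of[a] == topic_of[b])
          for a, b in combinations(sorted(topic_of), 2)}
print(labels)   # → {('X', 'Y'): 0, ('X', 'Z'): 1, ('Y', 'Z'): 0}
```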
32. Multiple Targets: Documents in same cluster
- Given a pair of documents, whether they appeared in the same topic, computed separately per topic model.
- E.g., under LDA, Documents X and Y fall in different topics, so the feature vector (2, 5, 8, 12, 4, 6, 15, 24) gets label 0; under LSI they fall in the same topic, so the same vector gets label 1.
33. Classification: Random Forest
● Train/test split per topic model
○ LDA
○ LSI
○ HDP
● Two classes (highly imbalanced):
○ Randomly draw balanced samples
● Hyper-parameters
○ # Decision trees: 1000
○ Split criterion: Gini impurity
○ Max depth: 4
○ Bootstrapped samples
○ Max features = sqrt(# features)

Topic Modeling    Train     Test
LDA             411,558  137,186
LSI             215,768   71,923
HDP             120,243   40,081
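A scikit-learn sketch with the hyper-parameters listed above; the toy features and labels are random stand-ins for the real pair vectors:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.random((200, 8))          # 8 features: two 4-stage vectors per pair
y = rng.integers(0, 2, 200)       # 1 = same cluster, 0 = not

clf = RandomForestClassifier(
    n_estimators=1000,            # # decision trees
    criterion="gini",             # split criterion: Gini impurity
    max_depth=4,
    bootstrap=True,               # bootstrapped samples
    max_features="sqrt",          # sqrt(# features)
    random_state=0,
)
clf.fit(X, y)
print(clf.score(X, y))
```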
34. Performance: Random Forest

Topic     Target                 Precision  Recall  F1-score    Train     Test
Modeling  (same cluster or not)
LDA       No                       0.53      0.03     0.06    411,558  137,186
          Yes                      0.57      0.97     0.72
LSI       No                       0.58      0.21     0.30    215,768   71,923
          Yes                      0.63      0.90     0.75
HDP       No                       0.64      0.80     0.71    120,243   40,081
          Yes                      0.63      0.43     0.52
35. Discussion
● How useful are such predictions?
○ Assume a fake news article has been posted at URL Z;
■ Now, Twitter users share/retweet Z.
■ Later, administrators decide to kill the web page at Z.
■ Still, Twitter users share/retweet the original tweet containing Z.
■ Use the URL diffusion network
● to predict the topic/category of the dead article
● Topic modeling captures:
○ the latent semantic structure of documents
● Propagation dynamics capture:
○ the latent cascade structure that originated on a different platform