1. The Grammar of Truth and Lies
Using NLP to detect Fake News
Peter J Bleackley
Playful Technology Limited
peter.bleackley@playfultechnology.co.uk
2. The Problem
● “A lie can run around the world before the truth can get its boots on.”
● Fake News spreads six times faster than real news on Twitter
● The spread of true and false news online, Soroush Vosoughi, Deb Roy, Sinan Aral, Science, Vol. 359, Issue 6380, pp. 1146-1151, 9th March 2018
● https://science.sciencemag.org/content/359/6380/1146
3. The Data
● “Getting Real about Fake News” Kaggle dataset
● https://www.kaggle.com/mrisdal/fake-news
● 12,999 articles from sites flagged as unreliable by the BS Detector Chrome extension
● Reuters-21578, Distribution 1.0 corpus
● 10,000 articles from the Reuters newswire, 1987
● http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html
● Available from NLTK
4. Don’t Use Vocabulary!
● Potential for bias, especially as the corpora are from different time periods
● Difficult to generalise
● Could be reverse-engineered by a bad actor
5. Sentence structure features
● Perform Part-of-Speech tagging with TextBlob
● Concatenate the tags to form a feature for each sentence
● “Pete Bleackley is a self-employed data scientist and computational linguist.”
● 'NNP_NNP_VBZ_DT_JJ_NNS_NN_CC_JJ_NN'
● Very large, very sparse feature set
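The feature extraction described above can be sketched in a few lines. This is an illustrative reconstruction, not the talk's actual code: the helper name is invented, and a pre-tagged sentence stands in for TextBlob output (with TextBlob, the (word, tag) pairs would come from `TextBlob(text).tags`) so the example is self-contained.

```python
# Sketch of the sentence-structure feature: concatenate the POS tags of
# one sentence into a single underscore-joined string.
def sentence_feature(tagged_tokens):
    """Build one feature string from a sentence's (word, tag) pairs."""
    return '_'.join(tag for word, tag in tagged_tokens)

# Pre-tagged version of the slide's example sentence, standing in for
# TextBlob(text).tags to keep the sketch self-contained.
tagged = [('Pete', 'NNP'), ('Bleackley', 'NNP'), ('is', 'VBZ'),
          ('a', 'DT'), ('self-employed', 'JJ'), ('data', 'NNS'),
          ('scientist', 'NN'), ('and', 'CC'), ('computational', 'JJ'),
          ('linguist', 'NN')]
print(sentence_feature(tagged))  # NNP_NNP_VBZ_DT_JJ_NNS_NN_CC_JJ_NN
```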
6. First model
● Train an LSI model (Gensim) on sentence structure features from the whole dataset
● 70/30 split between training and test data
● Sentence structure features => LSI => Logistic Regression (scikit-learn)
● https://www.kaggle.com/petebleackley/the-grammar-of-truth-and-lies
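A minimal sketch of this pipeline, with two assumptions for the sake of a self-contained example: the talk used Gensim's LSI, but here scikit-learn's `TruncatedSVD` (LSA, the same dimensionality reduction) plays that role, and the toy documents and labels are invented.

```python
# Pipeline sketch: sparse sentence-structure features -> latent semantic
# indexing -> logistic regression. Each "word" in a document is one
# sentence-structure feature string.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy documents and labels (0 = real, 1 = fake), invented for illustration.
docs = ['NNP_VBZ_DT_NN NNP_NNP_VBZ_JJ',
        'PRP_VBP_RB_JJ NNP_VBZ_DT_NN',
        'UH_NNP_VBZ_JJS UH_UH_NNP',
        'UH_NNP_VBZ_JJS PRP_VBP_RB_JJ']
labels = [0, 0, 1, 1]

model = make_pipeline(
    CountVectorizer(token_pattern=r'\S+'),   # whole feature strings as tokens
    TruncatedSVD(n_components=2, random_state=0),  # LSA, standing in for LSI
    LogisticRegression())
model.fit(docs, labels)
print(model.predict(docs))
```

In the talk's setup, the same two-stage idea applies at realistic scale: LSI compresses the very sparse feature space into dense topic vectors before classification.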
8. Sentiment analysis
● Used the VADER model in NLTK
● Produces Positive, Negative and Neutral scores for each sentence
● Sum over the document
● Precision 71%, Recall 88%, Accuracy 79%, Matthews 59%
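The "sum over the document" step can be sketched as below. The helper name and the toy scores are illustrative; with NLTK, each per-sentence dict would come from `nltk.sentiment.vader.SentimentIntensityAnalyzer().polarity_scores(sentence)` (after downloading the `vader_lexicon` resource).

```python
# Sum per-sentence VADER scores into one document-level feature vector.
def document_sentiment(sentence_scores):
    """Total the positive/negative/neutral scores over all sentences."""
    totals = {'pos': 0.0, 'neg': 0.0, 'neu': 0.0}
    for scores in sentence_scores:
        for key in totals:
            totals[key] += scores[key]
    return totals

# Toy per-sentence scores standing in for VADER output.
doc = [{'pos': 0.5, 'neg': 0.25, 'neu': 0.25},
       {'pos': 0.25, 'neg': 0.5, 'neu': 0.25}]
print(document_sentiment(doc))  # {'pos': 0.75, 'neg': 0.75, 'neu': 0.5}
```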
9. Sentence Structure + Sentiments
● Precision 74%
● Recall 90%
● Accuracy 81%
● Matthews 64%
● Slight improvement, but it looks like sentiment is doing most of the work
11. Understanding the models
● Out of 333,264 sentence structure features, 298,332 occur in only a single document
● Out of 23,000 documents, 11,276 have no features in common with the others
● We need some denser features
12. Function words
● Pronouns, prepositions, conjunctions, auxiliaries
● Present in every document – the most common words
● Usually discarded as “stopwords”...
● ...but useful for stylometric analysis, e.g. document attribution
● NLTK stopwords corpus
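Function-word features boil down to counting a fixed word list in each document. A minimal sketch, with one assumption: the talk used NLTK's stopwords corpus (`nltk.corpus.stopwords.words('english')`, available after downloading it), so the short hand-picked list below is only an illustrative subset.

```python
# Count occurrences of common function words in a token list; these
# dense counts complement the very sparse sentence-structure features.
FUNCTION_WORDS = ['the', 'a', 'and', 'of', 'to', 'in', 'it', 'is']

def function_word_counts(tokens):
    """Return a count for each function word in the document's tokens."""
    lowered = [t.lower() for t in tokens]
    return {w: lowered.count(w) for w in FUNCTION_WORDS}

tokens = 'The truth is out there and the lie is everywhere'.split()
print(function_word_counts(tokens))
# {'the': 2, 'a': 0, 'and': 1, 'of': 0, 'to': 0, 'in': 0, 'it': 0, 'is': 2}
```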
13. New model
● Sentence structure features + function words => LSI => Logistic Regression
● Precision 90%
● Recall 96%
● Accuracy 93%
● Matthews 87%
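All four metrics quoted on these slides are standard scikit-learn functions; the toy labels and predictions below are invented purely to show the calls.

```python
# Precision, recall, accuracy and Matthews correlation coefficient,
# computed on toy binary labels (1 = fake, 0 = real).
from sklearn.metrics import (precision_score, recall_score,
                             accuracy_score, matthews_corrcoef)

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]  # one miss, one false alarm

print(precision_score(y_true, y_pred))   # 0.75
print(recall_score(y_true, y_pred))      # 0.75
print(accuracy_score(y_true, y_pred))    # 0.75
print(matthews_corrcoef(y_true, y_pred))
```

The Matthews coefficient is the most demanding of the four: it only approaches 1 when both classes are predicted well, which is why it drops fastest on the weaker models above.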
14. What have we learnt?
● Grammatical and stylistic features can be used to distinguish between real and fake news
● Good choice of features is the key to success
● Will this generalise to other sources?
15. See also...
● The (mis)informed citizen
● An Alan Turing Institute project
● https://www.turing.ac.uk/research/research-projects/misinformed-citizen