1. The Grammar of Truth and Lies
Using NLP to detect Fake News
Peter J Bleackley
Playful Technology Limited
peter.bleackley@playfultechnology.co.uk
2. The Problem
● “A lie can run around the world before the truth can get its boots on.”
● Fake News spreads six times faster than real news on Twitter
● The spread of true and false news online, Soroush Vosoughi, Deb Roy, Sinan Aral, Science, Vol. 359, Issue 6380, pp. 1146-1151, 9th March 2018
● https://science.sciencemag.org/content/359/6380/1146
3. The Data
● “Getting Real about Fake News” Kaggle dataset
● https://www.kaggle.com/mrisdal/fake-news
● 12,999 articles from sites flagged as unreliable by the BS Detector Chrome extension
● Reuters-21578, Distribution 1.0 corpus
● 10,000 articles from the Reuters newswire, 1987
● http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html
● Available from NLTK
4. Don’t Use Vocabulary!
● Potential for bias, especially as the corpora are from different time periods
● Difficult to generalise
● Could be reverse-engineered by a bad actor
5. Sentence structure features
● Perform Part-of-Speech tagging with TextBlob
● Concatenate the tags to form a feature for each sentence
● “Pete Bleackley is a self-employed data scientist and computational linguist.”
● 'NNP_NNP_VBZ_DT_JJ_NNS_NN_CC_JJ_NN'
● Very large, very sparse feature set
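The feature extraction described above can be sketched in a few lines. This is an illustrative reconstruction, not the talk's actual code: the helper name is invented, and a pre-tagged sentence stands in for TextBlob output (with TextBlob, the (word, tag) pairs would come from `TextBlob(text).tags`) so the example is self-contained.

```python
# Sketch of the sentence-structure feature: concatenate the POS tags of
# one sentence into a single underscore-joined string.
def sentence_feature(tagged_tokens):
    """Build one feature string from a sentence's (word, tag) pairs."""
    return '_'.join(tag for word, tag in tagged_tokens)

# Pre-tagged version of the slide's example sentence, standing in for
# TextBlob(text).tags to keep the sketch self-contained.
tagged = [('Pete', 'NNP'), ('Bleackley', 'NNP'), ('is', 'VBZ'),
          ('a', 'DT'), ('self-employed', 'JJ'), ('data', 'NNS'),
          ('scientist', 'NN'), ('and', 'CC'), ('computational', 'JJ'),
          ('linguist', 'NN')]
print(sentence_feature(tagged))  # NNP_NNP_VBZ_DT_JJ_NNS_NN_CC_JJ_NN
```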
6. First model
● Train an LSI model (Gensim) on sentence structure features from the whole dataset
● 70/30 split between training and test data
● Sentence structure features => LSI => Logistic Regression (scikit-learn)
● https://www.kaggle.com/petebleackley/the-grammar-of-truth-and-lies
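A minimal sketch of this pipeline, with two assumptions for the sake of a self-contained example: the talk used Gensim's LSI, but here scikit-learn's `TruncatedSVD` (LSA, the same dimensionality reduction) plays that role, and the toy documents and labels are invented.

```python
# Pipeline sketch: sparse sentence-structure features -> latent semantic
# indexing -> logistic regression. Each "word" in a document is one
# sentence-structure feature string.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy documents and labels (0 = real, 1 = fake), invented for illustration.
docs = ['NNP_VBZ_DT_NN NNP_NNP_VBZ_JJ',
        'PRP_VBP_RB_JJ NNP_VBZ_DT_NN',
        'UH_NNP_VBZ_JJS UH_UH_NNP',
        'UH_NNP_VBZ_JJS PRP_VBP_RB_JJ']
labels = [0, 0, 1, 1]

model = make_pipeline(
    CountVectorizer(token_pattern=r'\S+'),   # whole feature strings as tokens
    TruncatedSVD(n_components=2, random_state=0),  # LSA, standing in for LSI
    LogisticRegression())
model.fit(docs, labels)
print(model.predict(docs))
```

In the talk's setup, the same two-stage idea applies at realistic scale: LSI compresses the very sparse feature space into dense topic vectors before classification.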
8. Sentiment analysis
● Used the VADER model in NLTK
● Produces Positive, Negative and Neutral scores for each sentence
● Sum over the document
● Precision 71%, Recall 88%, Accuracy 79%, Matthews 59%
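The "sum over the document" step can be sketched as below. The helper name and the toy scores are illustrative; with NLTK, each per-sentence dict would come from `nltk.sentiment.vader.SentimentIntensityAnalyzer().polarity_scores(sentence)` (after downloading the `vader_lexicon` resource).

```python
# Sum per-sentence VADER scores into one document-level feature vector.
def document_sentiment(sentence_scores):
    """Total the positive/negative/neutral scores over all sentences."""
    totals = {'pos': 0.0, 'neg': 0.0, 'neu': 0.0}
    for scores in sentence_scores:
        for key in totals:
            totals[key] += scores[key]
    return totals

# Toy per-sentence scores standing in for VADER output.
doc = [{'pos': 0.5, 'neg': 0.25, 'neu': 0.25},
       {'pos': 0.25, 'neg': 0.5, 'neu': 0.25}]
print(document_sentiment(doc))  # {'pos': 0.75, 'neg': 0.75, 'neu': 0.5}
```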
9. Sentence Structure + Sentiments
● Precision 74%
● Recall 90%
● Accuracy 81%
● Matthews 64%
● Slight improvement, but it looks like sentiment is doing most of the work
11. Understanding the models
● Out of 333,264 sentence structure features, 298,332 occur in only a single document
● Out of 23,000 documents, 11,276 have no features in common with the others
● We need some denser features
12. Function words
● Pronouns, prepositions, conjunctions, auxiliaries
● Present in every document – the most common words
● Usually discarded as “stopwords”...
● ...but useful for stylometric analysis, e.g. document attribution
● NLTK stopwords corpus
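Function-word features boil down to counting a fixed word list in each document. A minimal sketch, with one assumption: the talk used NLTK's stopwords corpus (`nltk.corpus.stopwords.words('english')`, available after downloading it), so the short hand-picked list below is only an illustrative subset.

```python
# Count occurrences of common function words in a token list; these
# dense counts complement the very sparse sentence-structure features.
FUNCTION_WORDS = ['the', 'a', 'and', 'of', 'to', 'in', 'it', 'is']

def function_word_counts(tokens):
    """Return a count for each function word in the document's tokens."""
    lowered = [t.lower() for t in tokens]
    return {w: lowered.count(w) for w in FUNCTION_WORDS}

tokens = 'The truth is out there and the lie is everywhere'.split()
print(function_word_counts(tokens))
# {'the': 2, 'a': 0, 'and': 1, 'of': 0, 'to': 0, 'in': 0, 'it': 0, 'is': 2}
```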
13. New model
● Sentence structure features + function words => LSI => Logistic Regression
● Precision 90%
● Recall 96%
● Accuracy 93%
● Matthews 87%
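All four metrics quoted on these slides are standard scikit-learn functions; the toy labels and predictions below are invented purely to show the calls.

```python
# Precision, recall, accuracy and Matthews correlation coefficient,
# computed on toy binary labels (1 = fake, 0 = real).
from sklearn.metrics import (precision_score, recall_score,
                             accuracy_score, matthews_corrcoef)

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]  # one miss, one false alarm

print(precision_score(y_true, y_pred))   # 0.75
print(recall_score(y_true, y_pred))      # 0.75
print(accuracy_score(y_true, y_pred))    # 0.75
print(matthews_corrcoef(y_true, y_pred))
```

The Matthews coefficient is the most demanding of the four: it only approaches 1 when both classes are predicted well, which is why it drops fastest on the weaker models above.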
14. What have we learnt?
● Grammatical and stylistic features can be used to distinguish between real and fake news
● Good choice of features is the key to success
● Will this generalise to other sources?
15. See also...
● The (mis)informed citizen
● An Alan Turing Institute project
● https://www.turing.ac.uk/research/research-projects/misinformed-citizen