SlideShare une entreprise Scribd logo
1  sur  15
Télécharger pour lire hors ligne
The Grammar of Truth and Lies
Using NLP to detect Fake News
Peter J Bleackley
Playful Technology Limited
peter.bleackley@playfultechnology.co.uk
The Problem
●
“A lie can run around the world before the truth can get its
boots on.”
●
Fake News spreads six times faster than real news on Twitter
●
The spread of true and false news online, Sorush Vosougi,
Deb Roy, Sinan Aral, Science, Vol. 359, Issue 6380, pp.
1146-1151, 9th March 2018
●
https://science.sciencemag.org/content/359/6380/1146
The Data
●
“Getting Real about Fake News” Kaggle Dataset
●
https://www.kaggle.com/mrisdal/fake-news
●
12999 articles from sites flagged as unreliable by the BS Detector
chrome extension
●
Reuters-21578, Distribution 1.0 Corpus
●
10000 articles from Reuters Newswire, 1987
●
http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html
●
Available from NLTK
Don’t Use Vocabulary!
●
Potential for bias, especially as corpora are from different
time periods
●
Difficult to generalise
●
Could be reverse-engineered by a bad actor
Sentence structure features
●
Perform Part of Speech tagging with TextBlob
●
Concatenate tags to form a feature for each sentence
●
“Pete Bleackley is a self-employed data scientist and
computational linguist.”
●
'NNP_NNP_VBZ_DT_JJ_NNS_NN_CC_JJ_NN'
●
Very large, very sparse feature set
First model
●
Train LSI model (Gensim) on sentence structure features
from whole dataset
●
70/30 split between training and test data
●
Sentence structure features => LSI => Logistic Regression
(scikit-learn)
●
https://www.kaggle.com/petebleackley/the-grammar-of-trut
h-and-lies
Performance
●
Precision 61%
●
Recall 96%
●
Accuracy 70%
●
Matthews Correlation Coefficient 50%
●
Precision measures our ability to catch the bad guys.
Sentiment analysis
●
Used VADER model in NLTK
●
Produces Positive, Negative and Neutral scores for each
sentence
●
Sum over document
●
Precision 71%, Recall 88%, Accuracy 79%, Matthews 59%
Sentence Structure + Sentiments
●
Precision 74%
●
Recall 90%
●
Accuracy 81%
●
Matthews 64%
●
Slight improvement, but it looks like sentiment is doing
most of the work
Random Forests
Precision Recall Accuracy Matthews
Sentence
structure
83% 89% 86% 71%
Sentiments 75% 75% 78% 76%
Both 84% 89% 87% 76%
Understanding the models
●
Out of 333264 sentence structure features, 298332 occur
only in a single document
●
Out of 23000 documents, 11276 have no features in
common with others
●
We need some denser features
Function words
●
Pronouns, prepositions, conjunctions, auxilliaries
●
Present in every document – most common words
●
Usually discarded as “stopwords”...
●
...but useful for stylometric analysis, eg document
attribution
●
NLTK stopwords corpus
New model
●
Sentence structure features + function words => LSI =>
Logistic Regression
●
Precision 90%
●
Recall 96%
●
Accuracy 93%
●
Matthews 87%
What have we learnt?
●
Grammatical and stylistic features can be used to
distinguish between real and fake news
●
Good choice of features is the key to success
●
Will this generalise to other sources?
See also...
●
The (mis)informed citizen
●
Alan Turing Institute project
●
https://www.turing.ac.uk/research/research-projects/misinf
ormed-citizen

Contenu connexe

Similaire à Detecting Fake News Using NLP Sentence Structure and Function Word Analysis

The zen of predictive modelling
The zen of predictive modellingThe zen of predictive modelling
The zen of predictive modellingQuinton Anderson
 
Getting to Know Your Data with R
Getting to Know Your Data with RGetting to Know Your Data with R
Getting to Know Your Data with RStephen Withington
 
Troubleshooting and Optimizing Named Entity Resolution Systems in the Industry
Troubleshooting and Optimizing Named Entity Resolution Systems in the IndustryTroubleshooting and Optimizing Named Entity Resolution Systems in the Industry
Troubleshooting and Optimizing Named Entity Resolution Systems in the IndustryPanos Alexopoulos
 
Hacking Predictive Modeling - RoadSec 2018
Hacking Predictive Modeling - RoadSec 2018Hacking Predictive Modeling - RoadSec 2018
Hacking Predictive Modeling - RoadSec 2018HJ van Veen
 
The Revolution of Digital Marketing in the Artificial Intelligence era
The Revolution of Digital Marketing in the Artificial Intelligence eraThe Revolution of Digital Marketing in the Artificial Intelligence era
The Revolution of Digital Marketing in the Artificial Intelligence eraMohamed Hanafy
 
Social media analytics as a service: tools from GATE
Social media analytics as a service: tools from GATESocial media analytics as a service: tools from GATE
Social media analytics as a service: tools from GATEDiana Maynard
 
And then there were ... Large Language Models
And then there were ... Large Language ModelsAnd then there were ... Large Language Models
And then there were ... Large Language ModelsLeon Dohmen
 
Chris Brew - TR Discover: A Natural Language Interface for Exploring Linked D...
Chris Brew - TR Discover: A Natural Language Interface for Exploring Linked D...Chris Brew - TR Discover: A Natural Language Interface for Exploring Linked D...
Chris Brew - TR Discover: A Natural Language Interface for Exploring Linked D...Machine Learning Prague
 
Word Cloud Plus with Will and Ray Poynter
Word Cloud Plus with Will and Ray PoynterWord Cloud Plus with Will and Ray Poynter
Word Cloud Plus with Will and Ray PoynterRay Poynter
 
How Machine Learning Works for Business
How Machine Learning Works for BusinessHow Machine Learning Works for Business
How Machine Learning Works for Business10x Nation
 
Machine learning with TensorFlow
Machine learning with TensorFlow  Machine learning with TensorFlow
Machine learning with TensorFlow Eslam Saeed
 
Webinar: Modern Techniques for Better Search Relevance with Fusion
Webinar: Modern Techniques for Better Search Relevance with FusionWebinar: Modern Techniques for Better Search Relevance with Fusion
Webinar: Modern Techniques for Better Search Relevance with FusionLucidworks
 
Practical Guide to the $1000 Genome (2014)
Practical Guide to the $1000 Genome (2014)Practical Guide to the $1000 Genome (2014)
Practical Guide to the $1000 Genome (2014)AllSeq
 
Is one enough? Data warehousing for biomedical research
Is one enough? Data warehousing for biomedical researchIs one enough? Data warehousing for biomedical research
Is one enough? Data warehousing for biomedical researchGreg Landrum
 
OTel Orientation: How to Train Teams (OTel in Practice)
OTel Orientation: How to Train Teams (OTel in Practice)OTel Orientation: How to Train Teams (OTel in Practice)
OTel Orientation: How to Train Teams (OTel in Practice)Paige Cruz
 
The agile forecast joe tristano southern fried agile 2018_ final
The agile forecast joe tristano  southern fried agile 2018_ finalThe agile forecast joe tristano  southern fried agile 2018_ final
The agile forecast joe tristano southern fried agile 2018_ finalJoe Tristano
 

Similaire à Detecting Fake News Using NLP Sentence Structure and Function Word Analysis (20)

The zen of predictive modelling
The zen of predictive modellingThe zen of predictive modelling
The zen of predictive modelling
 
Ml masterclass
Ml masterclassMl masterclass
Ml masterclass
 
Passwords
PasswordsPasswords
Passwords
 
Getting to Know Your Data with R
Getting to Know Your Data with RGetting to Know Your Data with R
Getting to Know Your Data with R
 
Troubleshooting and Optimizing Named Entity Resolution Systems in the Industry
Troubleshooting and Optimizing Named Entity Resolution Systems in the IndustryTroubleshooting and Optimizing Named Entity Resolution Systems in the Industry
Troubleshooting and Optimizing Named Entity Resolution Systems in the Industry
 
Entity Search Engine
Entity Search Engine Entity Search Engine
Entity Search Engine
 
Hacking Predictive Modeling - RoadSec 2018
Hacking Predictive Modeling - RoadSec 2018Hacking Predictive Modeling - RoadSec 2018
Hacking Predictive Modeling - RoadSec 2018
 
The Revolution of Digital Marketing in the Artificial Intelligence era
The Revolution of Digital Marketing in the Artificial Intelligence eraThe Revolution of Digital Marketing in the Artificial Intelligence era
The Revolution of Digital Marketing in the Artificial Intelligence era
 
Social media analytics as a service: tools from GATE
Social media analytics as a service: tools from GATESocial media analytics as a service: tools from GATE
Social media analytics as a service: tools from GATE
 
And then there were ... Large Language Models
And then there were ... Large Language ModelsAnd then there were ... Large Language Models
And then there were ... Large Language Models
 
Vikrant data scientist
Vikrant data scientistVikrant data scientist
Vikrant data scientist
 
Chris Brew - TR Discover: A Natural Language Interface for Exploring Linked D...
Chris Brew - TR Discover: A Natural Language Interface for Exploring Linked D...Chris Brew - TR Discover: A Natural Language Interface for Exploring Linked D...
Chris Brew - TR Discover: A Natural Language Interface for Exploring Linked D...
 
Word Cloud Plus with Will and Ray Poynter
Word Cloud Plus with Will and Ray PoynterWord Cloud Plus with Will and Ray Poynter
Word Cloud Plus with Will and Ray Poynter
 
How Machine Learning Works for Business
How Machine Learning Works for BusinessHow Machine Learning Works for Business
How Machine Learning Works for Business
 
Machine learning with TensorFlow
Machine learning with TensorFlow  Machine learning with TensorFlow
Machine learning with TensorFlow
 
Webinar: Modern Techniques for Better Search Relevance with Fusion
Webinar: Modern Techniques for Better Search Relevance with FusionWebinar: Modern Techniques for Better Search Relevance with Fusion
Webinar: Modern Techniques for Better Search Relevance with Fusion
 
Practical Guide to the $1000 Genome (2014)
Practical Guide to the $1000 Genome (2014)Practical Guide to the $1000 Genome (2014)
Practical Guide to the $1000 Genome (2014)
 
Is one enough? Data warehousing for biomedical research
Is one enough? Data warehousing for biomedical researchIs one enough? Data warehousing for biomedical research
Is one enough? Data warehousing for biomedical research
 
OTel Orientation: How to Train Teams (OTel in Practice)
OTel Orientation: How to Train Teams (OTel in Practice)OTel Orientation: How to Train Teams (OTel in Practice)
OTel Orientation: How to Train Teams (OTel in Practice)
 
The agile forecast joe tristano southern fried agile 2018_ final
The agile forecast joe tristano  southern fried agile 2018_ finalThe agile forecast joe tristano  southern fried agile 2018_ final
The agile forecast joe tristano southern fried agile 2018_ final
 

Dernier

08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAndikSusilo4
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?XfilesPro
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 

Dernier (20)

08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 

Detecting Fake News Using NLP Sentence Structure and Function Word Analysis

  • 1. The Grammar of Truth and Lies Using NLP to detect Fake News Peter J Bleackley Playful Technology Limited peter.bleackley@playfultechnology.co.uk
  • 2. The Problem ● “A lie can run around the world before the truth can get its boots on.” ● Fake News spreads six times faster than real news on Twitter ● The spread of true and false news online, Sorush Vosougi, Deb Roy, Sinan Aral, Science, Vol. 359, Issue 6380, pp. 1146-1151, 9th March 2018 ● https://science.sciencemag.org/content/359/6380/1146
  • 3. The Data ● “Getting Real about Fake News” Kaggle Dataset ● https://www.kaggle.com/mrisdal/fake-news ● 12999 articles from sites flagged as unreliable by the BS Detector chrome extension ● Reuters-21578, Distribution 1.0 Corpus ● 10000 articles from Reuters Newswire, 1987 ● http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html ● Available from NLTK
  • 4. Don’t Use Vocabulary! ● Potential for bias, especially as corpora are from different time periods ● Difficult to generalise ● Could be reverse-engineered by a bad actor
  • 5. Sentence structure features ● Perform Part of Speech tagging with TextBlob ● Concatenate tags to form a feature for each sentence ● “Pete Bleackley is a self-employed data scientist and computational linguist.” ● 'NNP_NNP_VBZ_DT_JJ_NNS_NN_CC_JJ_NN' ● Very large, very sparse feature set
  • 6. First model ● Train LSI model (Gensim) on sentence structure features from whole dataset ● 70/30 split between training and test data ● Sentence structure features => LSI => Logistic Regression (scikit-learn) ● https://www.kaggle.com/petebleackley/the-grammar-of-trut h-and-lies
  • 7. Performance ● Precision 61% ● Recall 96% ● Accuracy 70% ● Matthews Correlation Coefficient 50% ● Precision measures our ability to catch the bad guys.
  • 8. Sentiment analysis ● Used VADER model in NLTK ● Produces Positive, Negative and Neutral scores for each sentence ● Sum over document ● Precision 71%, Recall 88%, Accuracy 79%, Matthews 59%
  • 9. Sentence Structure + Sentiments ● Precision 74% ● Recall 90% ● Accuracy 81% ● Matthews 64% ● Slight improvement, but it looks like sentiment is doing most of the work
  • 10. Random Forests Precision Recall Accuracy Matthews Sentence structure 83% 89% 86% 71% Sentiments 75% 75% 78% 76% Both 84% 89% 87% 76%
  • 11. Understanding the models ● Out of 333264 sentence structure features, 298332 occur only in a single document ● Out of 23000 documents, 11276 have no features in common with others ● We need some denser features
  • 12. Function words ● Pronouns, prepositions, conjunctions, auxilliaries ● Present in every document – most common words ● Usually discarded as “stopwords”... ● ...but useful for stylometric analysis, eg document attribution ● NLTK stopwords corpus
  • 13. New model ● Sentence structure features + function words => LSI => Logistic Regression ● Precision 90% ● Recall 96% ● Accuracy 93% ● Matthews 87%
  • 14. What have we learnt? ● Grammatical and stylistic features can be used to distinguish between real and fake news ● Good choice of features is the key to success ● Will this generalise to other sources?
  • 15. See also... ● The (mis)informed citizen ● Alan Turing Institute project ● https://www.turing.ac.uk/research/research-projects/misinf ormed-citizen