SlideShare une entreprise Scribd logo
1  sur  6
Télécharger pour lire hors ligne
International Journal on Natural Language Computing (IJNLC) Vol.8, No.3, June 2019
DOI: 10.5121/ijnlc.2019.8302 17
FAKE NEWS DETECTION WITH SEMANTIC
FEATURES AND TEXT MINING
Pranav Bharadwaj1 and Zongru Shao2
1
South Brunswick High School, Monmouth Junction, NJ
2
Spectronn , USA
ABSTRACT
Nearly 70% of people are concerned about the propagation of fake news. This paper aims to detect fake
news in online articles through the use of semantic features and various machine learning techniques. In
this research, we investigated recurrent neural networks vs. the naive bayes classifier and random forest
classifiers using five groups of linguistic features. Evaluated with real or fake dataset from kaggle.com, the
best performing model achieved an accuracy of 95.66% using bigram features with the random forest
classifier. The fact that bigrams outperform unigrams, trigrams, and quadgrams show that word pairs as
opposed to single words or phrases best indicate the authenticity of news.
KEYWORDS
Text Mining, Fake News, Machine Learning, Semantic Features, Natural Language Processing (NLP)
1. INTRODUCTION
Nearly 70% of the population is concerned about malicious use of fake news [3]. Fake news
detection is a problem that has been taken on by large social-networking companies such as
Facebook and Twitter to inhibit the propagation of misinformation across their online platforms.
Some fake news articles have targeted major political events such as the 2016 US Presidential
Election and Brexit [4]. However, the scope of fake news extends beyond globally significant
political events. Individuals falsely reported that a golden asteroid on target to hit Earth contains
$10 quadrillion worth of precious metals in an attempt to increase the value of Bitcoin [1]. With
fake news infiltrating multiple facets of public information, many are rightly concerned.
According to a Pew Research Center survey, fake news and misinformation has had a significant
impact on 68% of Americans’ confidence in government, 54% of Americans’ confidence in each
other, and 51% of Americans’ confidence in their political readers to get work done [5].
Additionally, the previously mentioned survey also states that 79% of US adults believe that
measures should be taken to inhibit the propagation of misinformation [5]. Residents of a
Macedonian town named Veles use Google AdSense to distribute fake news around the internet,
run politically manipulative Facebook pages and websites in order to make a living [12]. One of
their Facebook pages had garnered over 1.5 million followers [12]. The rising problem that is
fake news has become increasingly important because of the vulnerability of the massive readers
and its widespread malicious influence. As these invalid sources of information have gained
traction and have established themselves as credible informants to many individuals, preventing
this category of content from spreading and detecting it at the source has become
increasingly crucial. Consequently, automated identification of fake news has been studied by
Facebook and Twitter as well as other researchers [9].
In this paper, we present a comparison between recurrent neural networks, naive bayes, and
International Journal on Natural Language Computing (IJNLC) Vol.8, No.3, June 2019
18
random forest algorithms using various linguistic features for fake-news-detection. We use the
real or fake dataset from kaggle.com evaluate these models. The remainder of this paper is
structured into three sections. Section 2 details related works, how they approached detecting fake
news, and Section 3 describes the semantic features and machine learning algorithms in our
experiment. Section 4 presents the evaluation results, in which random forest with bigram
features achieved the best accuracy of 95.66%. Section 5 presents the conclusions and future
work.
2. RELATED WORKS
Several solutions were proposed for this problem. Prior studies employed logistic regression and
“boolean crowd-sourcing algorithms” to detect fake news in social networking websites [13].
However, this research assumed that agents who post misinformation can be detected by the users
who have prior contact of the content [13]. Another study used convolutional neural networks
(CNNs), with a long short term memory (LSTM) layer to detect fake news by the text context and
additional metadata [14]. Shu et al. studied linguistic features such as word count, frequency of
words, character count, and similarity and clarity score for videos and images while proposing
rumor classification, truth discovery, click-bait detection, and spammer and bot detection [11].
Rubin et al. proposed to classify fake news as one of three types: (a) serious fabrications, (b)
large-scale hoaxes, (c) humorous fakes. It also discussed the challenges that each variant of fake
news presents to its detection [11].
However, none of the prior studies had utilized recurrent neural networks (RNNs), naive bayes,
or random forest.
3. MATERIALS AND METHODS
In this section, we describe the material dataset, text preprocessing, semantic features including
term frequency (TF), term frequency–inverse document frequency (TFIDF), bigrams, trigram,
quadgram, vectorized word representations, and machine learning algorithms such as naive bayes
classifier, random forest, recurrent neural networks (RNN). The process of making each model
has been detailed below in Figure 1.
Figure 1
3.1 Dataset
We used real-or-fake news dataset from kaggle.com in our experiments to evaluate semantic
features. It contains 6256 articles including their titles. 50% of the articles are labeled as FAKE
and the remaining as REAL. Therefore, detecting the FAKE articles is a binary classification
problem. We split the dataset into 80% for training and 20% for testing.
3.2 Text Pre-Processing
We pre-process the raw text to extract semantic features for machine learning. We use n-grams as
semantic features. We first tokenize the title and body of each article. Then, each token is
transformed into lower cases and proper nouns lose their capital-letter information. Next, we
remove stopwords and numbers for unigrams since they carry less meaning in the context. As a
International Journal on Natural Language Computing (IJNLC) Vol.8, No.3, June 2019
19
N
result, the remaining to- kens are semantic representations from linguistic perspective. Stopwords
and numbers are reserved for n-grams other than unigrams. Then, we extract TF and TFIDF
numerical features using the semantic representations.
3.3 Linguistic Features
3.3.1 TF and TFIDF
Note that a text subject (e.g., an article) is called a document in natural language processing. TF
computes how frequently a term appears in a document. Given a document d with a set of terms T
= {t1, t2, ..., tM }, and the document length is N (the total occurrence of all terms); suppose term ti
appeared xi times; then, TF of ti is denoted as
As a result, [TF(1),TF(2),...,TF(i),...,TF(M)], i ∈ [1,M] is a semantic representation for the
document. Inverse term frequency (IDF) denotes the popularity of a term across documents.
Given a set of documents D = {d1,d2,...,dk} as the subjects of interest, and TF(i) for term ti is
calculated for each document; suppose C(i) denotes the number of documents in which xi ≠ 0;
then,
Note that each term appears at least once in D. Meanwhile, TF and IDF are calculated in
logarithmically scaled:
Where i ∈ [1, M] and j ∈ [1, K]. Then, TFIDF is the product of TF and IDF:
3.3.2 N-grams
N-grams are continuous chucks of n items from a tokenized sequence for a document. Especially,
uni- grams are terms where n = 1. Bigrams are pairs of adjacent terms where n = 2. Trigrams and
quad- grams are three and four continuous terms, respectively. For example, the sentence “Your
destination is 3 miles away” is tokenized into “your”, “destination”, “is”, “3”, “miles”, and
“away”, where each term is a unigram. The bigrams are two-term strings: “your destination”,
“destination is”, “is 3”, “3 miles”, and “miles away”. Trigrams are three-term strings: “your
destination is”, “destination is 3”, “is 3 miles”, and “3 miles away”. And quadgrams are four-term
strings: “your destination is 3”, “destination is 3 miles”, and “is 3 miles away”. In our
experiment, we use unigrams, bigrams, trigrams, and quadgrams to calculate the correlated TF
and TFIDF features.
International Journal on Natural Language Computing (IJNLC) Vol.8, No.3, June 2019
20
3.3.3 Naive Bayes Classifier
The naive Bayes classifier is a classifier based on Bayes’ Theorem:
P (A |B ) =
P(B | A) × P(A)
P (B )
Where A and B are two conditions. The naive Bayes classifier takes each semantic feature as a
condition and classify the samples with the highest occurring probability. Note that it assumes
that the semantic features are independent [8].
3.3.4 Random Forest Classifier
A decision tree is a “tree” where different conditions branch off from their parents and each node
represents a class for classification. The random forest classifier is an ensemble method that
operates a multitude of decision trees and thus improves the accuracy. We adjust parameters such
as max depth, min samples split, n estimators, and random state to achieve the best performance;
where Max depth is the maximum depth of a decision tree; Min samples split is the minimum
amount of samples to split an internal node, and N estimators is the number of decision trees in
the random forest [2].
3.4 GLOVE
GloVe is an unsupervised learning algorithm that parallels the closeness of two words with their
distance in a vector space [7]. The generated vector representations are called word embed- dings.
We use word embeddings as semantic features in addition to n-grams because they represent the
semantic distances between the words in the context.
3.5 RNN
Recurrent neural networks (RNNs) utilize “memory” to process inputs and are widely used in text
generation and natural language processing [6]. Long short-term memory (LSTM) is a RNN
architecture that uses “gates” to “forget” the input at a condition. In our model we use 100 LSTM
cells in one layer and a softmax activation function. We trained the model with 22 epochs and a
batch size of 64.
4. EXPERIMENTAL RESULTS
This section presents the experimental results using naive bayes, random forest, and RNN with
six groups of features: TF, TFIDF, frequency of bigrams, trigrams, and quadgrams, and GloVe
word embeddings.
TF TFIDF Bigram Trigram Quadgram GloVe
Naive Bayes 88.08% 89.90% 90.77% 90.06% 89.74% N/A
Random Forest 89.03% 89.34% 95.66% 94.71% 89.60% N/A
RNN N/A N/A N/A N/A N/A 92.70%
Table 1 shows the accuracy using each method. Observe that random forest results in better
accuracy than the naive bayes classifier with TF, bigrams, trigrams, and quadgrams. Meanwhile,
International Journal on Natural Language Computing (IJNLC) Vol.8, No.3, June 2019
21
bigrams outperform TF, TFIDF, trigrams, and quadgrams. The RNN model with GloVe features
outperform TF, TFIDF, and quadgrams but not bigrams and trigrams.
Note that unigrams represent words; bigrams represent words and their one-to-one connections;
trigrams carry level-two connections for words if consider a one-to-one connection between two
words as level-one. As a result, bigrams carry more information than unigrams; trigrams more
than bigrams; and quadgrams more than trigrams. Also, more information for training suppose to
provide better accuracy. Therefore, we would expect quadgrams to result in higher accuracy than
trigrams, trigrams higher than bigrams, and bigrams higher than unigrams. This assumption
contradicts the data displayed in Table 1 as quadgrams do not result in the highest accuracy in the
table. The reason is when information increases, the training process takes specific details and the
model is “over-fitted”, when a model predicts the training set too well that it impairs its ability to
classify examples that are not within its training set. However, bigrams do outperform unigrams
because they carry more information. For the same reason, TF and TFIDF result in similar
accuracies because they both are derived from unigrams. Meanwhile, GloVe with RNN
outperforms unigrams and quadgrams but results in lower accuracy than bigrams and trigrams.
This is caused by single layer LSTM cells and word embeddings represent unigrams. Therefore,
the RNN model outperforms random forest if disregard the difference between word embeddings
and unigrams.
Some implications of the success of these models are that they can be used by readers to filter
through the content they consume to be wary of what articles may contain misinformation.
These models can also be used by agencies, organizations, corporations, campaigns, or any other
formal group to filter through news to find any false claims made about them or their actions.
Additionally, publishing houses or news agencies can employ these methods in order to fact
check pieces their writers compose to avoid any fake news from being produced under their
name.
5. CONCLUSION
In this paper, we applied semantic features including unigram TF & TFIDF, bigrams, trigrams,
quad- grams, and GloVe word embeddings along with naive bayes, random forest, and RNN
classifiers to detect fake news. The performance is promising as bigrams and random forest
achieved an accuracy of 95.66%. It implies that semantic features are useful for fake news
detection. As the next step, semantic features may be combined with other linguistic cues and
meta data to improve the detection performance.
REFERENCES
[1] Babayan, Davit. “Bitcoin Bulls Spreading Fake News About Golden Asteroid: Peter Schiff.”
NewsBTC, NewsBTC, 29 June 2019, www.newsbtc.com/2019/06/29/bitcoin-bulls-spreading- fake-
news-about-golden-asteroid-peter-schiff/.
[2] Breiman, L. (2001). Random forests machine learning. 45: 5–32. View Article PubMed/NCBI Google
Scholar.
[3] Handley, Lucy. “Nearly 70 Percent of People Are Worried about Fake News as a 'Weapon,'
Survey Says.” CNBC, CNBC, 22 Jan. 2018, www.cnbc.com/2018/01/22/nearly-70-percent-of-
people-are-worried-about-fake-news-as-a-weapon-survey-says.html.
[4] Kowalewski, J. (2017). The impact of fake news: Politics. Lexology.
[5] Mitchell, Amy, et al. “Many Americans Say Made-Up News Is a Critical Problem That Needs To Be
Fixed.” Pew Research Center's Journalism Project, Pew Research Center, 28 June 2019,
International Journal on Natural Language Computing (IJNLC) Vol.8, No.3, June 2019
22
www.journalism.org/2019/06/05/many-americans-say-made-up-news-is-a-critical-problem-that-
needs-to-be-fixed/.
[6] Mikolov, T., M. Karafia ́t, L. Burget, J. Cˇernocky`, and S. Khudanpur (2010). Recurrent neural
network based language model. In Eleventh annual conference of the international speech
communication association.
[7] Patel, S. (2017). Chapter 1 : Supervised learning and naive bayes classification - part 1 (theory).
Medium.
[8] Pennington, J., R. Socher, and C. Manning (2014). Glove: Global vectors for word representation. In
Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP),
pp. 1532–1543.
[9] Romano, A. (2018). Mark zuckerberg lays out facebooks 3-pronged approach to fake news.
[10] Rubin, V. L., Y. Chen, and N. J. Conroy (2015). Deception detection for news: three types of fakes.
In Proceedings of the 78th ASIS&T Annual Meeting: Information Science with Impact: Research in
and for the Community, pp. 83. American Society for Information Science.
[11] Shu, K., A. Sliva, S. Wang, J. Tang, and H. Liu (2017). Fake news detection on social media: A data
mining perspective. ACM SIGKDD Explorations Newsletter 19(1), 22–36.
[12] Soares, Isa, and Florence Davey-Atlee. “The Fake News Machine: Inside a Town Gearing up for
2020.” CNNMoney, Cable News Network, money.cnn.com/interactive/media/the-macedonia- story/.
[13] Tacchini, E., G. Ballarin, M. L. Della Vedova, S. Moret, and L. de Alfaro (2017). Some like it hoax:
Automated fake news detection in social networks. arXiv preprint arXiv:1704.07506.
[14] Wang, W. Y. (2017). “liar, liar pants on fire”: A new benchmark dataset for fake news detection.
arXiv preprint arXiv:1705.00648.

Contenu connexe

Tendances

Final Poster for Engineering Showcase
Final Poster for Engineering ShowcaseFinal Poster for Engineering Showcase
Final Poster for Engineering Showcase
Tucker Truesdale
 
Tweet Segmentation and Its Application to Named Entity Recognition
Tweet Segmentation and Its Application to Named Entity RecognitionTweet Segmentation and Its Application to Named Entity Recognition
Tweet Segmentation and Its Application to Named Entity Recognition
1crore projects
 

Tendances (20)

A RELIABLE ARTIFICIAL INTELLIGENCE MODEL FOR FALSE NEWS DETECTION MADE BY PUB...
A RELIABLE ARTIFICIAL INTELLIGENCE MODEL FOR FALSE NEWS DETECTION MADE BY PUB...A RELIABLE ARTIFICIAL INTELLIGENCE MODEL FOR FALSE NEWS DETECTION MADE BY PUB...
A RELIABLE ARTIFICIAL INTELLIGENCE MODEL FOR FALSE NEWS DETECTION MADE BY PUB...
 
FAKE NEWS DETECTION PPT
FAKE NEWS DETECTION PPT FAKE NEWS DETECTION PPT
FAKE NEWS DETECTION PPT
 
Odsc 2019 entity_reputation_knowledge_graph
Odsc 2019 entity_reputation_knowledge_graphOdsc 2019 entity_reputation_knowledge_graph
Odsc 2019 entity_reputation_knowledge_graph
 
Seminar Report Mine
Seminar Report MineSeminar Report Mine
Seminar Report Mine
 
Poster presentation in 3rd big data conclave at vit chennai on 20th april 2017
Poster presentation in 3rd big data conclave at vit chennai on 20th april 2017Poster presentation in 3rd big data conclave at vit chennai on 20th april 2017
Poster presentation in 3rd big data conclave at vit chennai on 20th april 2017
 
News Reliability Evaluation using Latent Semantic Analysis
News Reliability Evaluation using Latent Semantic AnalysisNews Reliability Evaluation using Latent Semantic Analysis
News Reliability Evaluation using Latent Semantic Analysis
 
FAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MINING
FAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MININGFAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MINING
FAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MINING
 
IRJET - Fake News Detection: A Survey
IRJET -  	  Fake News Detection: A SurveyIRJET -  	  Fake News Detection: A Survey
IRJET - Fake News Detection: A Survey
 
Automatic Hate Speech Detection: A Literature Review
Automatic Hate Speech Detection: A Literature ReviewAutomatic Hate Speech Detection: A Literature Review
Automatic Hate Speech Detection: A Literature Review
 
Final Poster for Engineering Showcase
Final Poster for Engineering ShowcaseFinal Poster for Engineering Showcase
Final Poster for Engineering Showcase
 
Analyzing-Threat-Levels-of-Extremists-using-Tweets
Analyzing-Threat-Levels-of-Extremists-using-TweetsAnalyzing-Threat-Levels-of-Extremists-using-Tweets
Analyzing-Threat-Levels-of-Extremists-using-Tweets
 
Political prediction analysis using text mining and deep learning
Political prediction analysis using text mining and deep learningPolitical prediction analysis using text mining and deep learning
Political prediction analysis using text mining and deep learning
 
CATEGORIZING 2019-N-COV TWITTER HASHTAG DATA BY CLUSTERING
CATEGORIZING 2019-N-COV TWITTER HASHTAG DATA BY CLUSTERINGCATEGORIZING 2019-N-COV TWITTER HASHTAG DATA BY CLUSTERING
CATEGORIZING 2019-N-COV TWITTER HASHTAG DATA BY CLUSTERING
 
Groundhog day: near duplicate detection on twitter
Groundhog day: near duplicate detection on twitterGroundhog day: near duplicate detection on twitter
Groundhog day: near duplicate detection on twitter
 
On Semantics and Deep Learning for Event Detection in Crisis Situations
On Semantics and Deep Learning for Event Detection in Crisis SituationsOn Semantics and Deep Learning for Event Detection in Crisis Situations
On Semantics and Deep Learning for Event Detection in Crisis Situations
 
Hoax analyzer for Indonesian news using RNNs with fasttext and glove embeddings
Hoax analyzer for Indonesian news using RNNs with fasttext and glove embeddingsHoax analyzer for Indonesian news using RNNs with fasttext and glove embeddings
Hoax analyzer for Indonesian news using RNNs with fasttext and glove embeddings
 
757
757757
757
 
Strategic perspectives 3
Strategic perspectives 3Strategic perspectives 3
Strategic perspectives 3
 
How Anonymous Can Someone be on Twitter?
How Anonymous Can Someone be on Twitter?How Anonymous Can Someone be on Twitter?
How Anonymous Can Someone be on Twitter?
 
Tweet Segmentation and Its Application to Named Entity Recognition
Tweet Segmentation and Its Application to Named Entity RecognitionTweet Segmentation and Its Application to Named Entity Recognition
Tweet Segmentation and Its Application to Named Entity Recognition
 

Similaire à FAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MINING

A Hybrid Method of Long Short-Term Memory and AutoEncoder Architectures for S...
A Hybrid Method of Long Short-Term Memory and AutoEncoder Architectures for S...A Hybrid Method of Long Short-Term Memory and AutoEncoder Architectures for S...
A Hybrid Method of Long Short-Term Memory and AutoEncoder Architectures for S...
AhmedAdilNafea
 
Criminal and Civil Identification with DNA Databases Using Bayesian Networks
Criminal and Civil Identification with DNA Databases Using Bayesian NetworksCriminal and Civil Identification with DNA Databases Using Bayesian Networks
Criminal and Civil Identification with DNA Databases Using Bayesian Networks
CSCJournals
 

Similaire à FAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MINING (20)

Multispectral Image Analysis Using Random Forest
Multispectral Image Analysis Using Random Forest  Multispectral Image Analysis Using Random Forest
Multispectral Image Analysis Using Random Forest
 
Era of Sociology News Rumors News Detection using Machine Learning
Era of Sociology News Rumors News Detection using Machine LearningEra of Sociology News Rumors News Detection using Machine Learning
Era of Sociology News Rumors News Detection using Machine Learning
 
Multispectral image analysis using random
Multispectral image analysis using randomMultispectral image analysis using random
Multispectral image analysis using random
 
A study of cyberbullying detection using Deep Learning and Machine Learning T...
A study of cyberbullying detection using Deep Learning and Machine Learning T...A study of cyberbullying detection using Deep Learning and Machine Learning T...
A study of cyberbullying detection using Deep Learning and Machine Learning T...
 
A study of cyberbullying detection using Deep Learning and Machine Learning T...
A study of cyberbullying detection using Deep Learning and Machine Learning T...A study of cyberbullying detection using Deep Learning and Machine Learning T...
A study of cyberbullying detection using Deep Learning and Machine Learning T...
 
1808.10245v1 (1).pdf
1808.10245v1 (1).pdf1808.10245v1 (1).pdf
1808.10245v1 (1).pdf
 
San Francisco Crime Analysis Classification Kaggle contest
San Francisco Crime Analysis Classification Kaggle contestSan Francisco Crime Analysis Classification Kaggle contest
San Francisco Crime Analysis Classification Kaggle contest
 
High Accuracy Location Information Extraction From Social Network Texts Using...
High Accuracy Location Information Extraction From Social Network Texts Using...High Accuracy Location Information Extraction From Social Network Texts Using...
High Accuracy Location Information Extraction From Social Network Texts Using...
 
High Accuracy Location Information Extraction From Social Network Texts Using...
High Accuracy Location Information Extraction From Social Network Texts Using...High Accuracy Location Information Extraction From Social Network Texts Using...
High Accuracy Location Information Extraction From Social Network Texts Using...
 
A Hybrid Method of Long Short-Term Memory and AutoEncoder Architectures for S...
A Hybrid Method of Long Short-Term Memory and AutoEncoder Architectures for S...A Hybrid Method of Long Short-Term Memory and AutoEncoder Architectures for S...
A Hybrid Method of Long Short-Term Memory and AutoEncoder Architectures for S...
 
Detection of Fake Accounts in Instagram Using Machine Learning
Detection of Fake Accounts in Instagram Using Machine LearningDetection of Fake Accounts in Instagram Using Machine Learning
Detection of Fake Accounts in Instagram Using Machine Learning
 
Cerdit card
Cerdit cardCerdit card
Cerdit card
 
A survey on approaches for performing sentiment analysis ijrset october15
A survey on approaches for performing sentiment analysis ijrset october15A survey on approaches for performing sentiment analysis ijrset october15
A survey on approaches for performing sentiment analysis ijrset october15
 
San Francisco Crime Prediction Report
San Francisco Crime Prediction ReportSan Francisco Crime Prediction Report
San Francisco Crime Prediction Report
 
Issues in Sentiment analysis
Issues in Sentiment analysisIssues in Sentiment analysis
Issues in Sentiment analysis
 
Fake accounts detection system based on bidirectional gated recurrent unit n...
Fake accounts detection system based on bidirectional gated  recurrent unit n...Fake accounts detection system based on bidirectional gated  recurrent unit n...
Fake accounts detection system based on bidirectional gated recurrent unit n...
 
Fake News Detection using Deep Learning
Fake News Detection using Deep LearningFake News Detection using Deep Learning
Fake News Detection using Deep Learning
 
Predicting Forced Population Displacement Using News Articles
Predicting Forced Population Displacement Using News ArticlesPredicting Forced Population Displacement Using News Articles
Predicting Forced Population Displacement Using News Articles
 
Criminal and Civil Identification with DNA Databases Using Bayesian Networks
Criminal and Civil Identification with DNA Databases Using Bayesian NetworksCriminal and Civil Identification with DNA Databases Using Bayesian Networks
Criminal and Civil Identification with DNA Databases Using Bayesian Networks
 
20142014_20142015_20142115
20142014_20142015_2014211520142014_20142015_20142115
20142014_20142015_20142115
 

Plus de kevig

Identifying Key Terms in Prompts for Relevance Evaluation with GPT Models
Identifying Key Terms in Prompts for Relevance Evaluation with GPT ModelsIdentifying Key Terms in Prompts for Relevance Evaluation with GPT Models
Identifying Key Terms in Prompts for Relevance Evaluation with GPT Models
kevig
 
Identifying Key Terms in Prompts for Relevance Evaluation with GPT Models
Identifying Key Terms in Prompts for Relevance Evaluation with GPT ModelsIdentifying Key Terms in Prompts for Relevance Evaluation with GPT Models
Identifying Key Terms in Prompts for Relevance Evaluation with GPT Models
kevig
 
Performance, Energy Consumption and Costs: A Comparative Analysis of Automati...
Performance, Energy Consumption and Costs: A Comparative Analysis of Automati...Performance, Energy Consumption and Costs: A Comparative Analysis of Automati...
Performance, Energy Consumption and Costs: A Comparative Analysis of Automati...
kevig
 
Performance, energy consumption and costs: a comparative analysis of automati...
Performance, energy consumption and costs: a comparative analysis of automati...Performance, energy consumption and costs: a comparative analysis of automati...
Performance, energy consumption and costs: a comparative analysis of automati...
kevig
 

Plus de kevig (20)

Identifying Key Terms in Prompts for Relevance Evaluation with GPT Models
Identifying Key Terms in Prompts for Relevance Evaluation with GPT ModelsIdentifying Key Terms in Prompts for Relevance Evaluation with GPT Models
Identifying Key Terms in Prompts for Relevance Evaluation with GPT Models
 
Identifying Key Terms in Prompts for Relevance Evaluation with GPT Models
Identifying Key Terms in Prompts for Relevance Evaluation with GPT ModelsIdentifying Key Terms in Prompts for Relevance Evaluation with GPT Models
Identifying Key Terms in Prompts for Relevance Evaluation with GPT Models
 
IJNLC 2013 - Ambiguity-Aware Document Similarity
IJNLC  2013 - Ambiguity-Aware Document SimilarityIJNLC  2013 - Ambiguity-Aware Document Similarity
IJNLC 2013 - Ambiguity-Aware Document Similarity
 
Genetic Approach For Arabic Part Of Speech Tagging
Genetic Approach For Arabic Part Of Speech TaggingGenetic Approach For Arabic Part Of Speech Tagging
Genetic Approach For Arabic Part Of Speech Tagging
 
Rule Based Transliteration Scheme for English to Punjabi
Rule Based Transliteration Scheme for English to PunjabiRule Based Transliteration Scheme for English to Punjabi
Rule Based Transliteration Scheme for English to Punjabi
 
Improving Dialogue Management Through Data Optimization
Improving Dialogue Management Through Data OptimizationImproving Dialogue Management Through Data Optimization
Improving Dialogue Management Through Data Optimization
 
Document Author Classification using Parsed Language Structure
Document Author Classification using Parsed Language StructureDocument Author Classification using Parsed Language Structure
Document Author Classification using Parsed Language Structure
 
Rag-Fusion: A New Take on Retrieval Augmented Generation
Rag-Fusion: A New Take on Retrieval Augmented GenerationRag-Fusion: A New Take on Retrieval Augmented Generation
Rag-Fusion: A New Take on Retrieval Augmented Generation
 
Performance, Energy Consumption and Costs: A Comparative Analysis of Automati...
Performance, Energy Consumption and Costs: A Comparative Analysis of Automati...Performance, Energy Consumption and Costs: A Comparative Analysis of Automati...
Performance, Energy Consumption and Costs: A Comparative Analysis of Automati...
 
Evaluation of Medium-Sized Language Models in German and English Language
Evaluation of Medium-Sized Language Models in German and English LanguageEvaluation of Medium-Sized Language Models in German and English Language
Evaluation of Medium-Sized Language Models in German and English Language
 
IMPROVING DIALOGUE MANAGEMENT THROUGH DATA OPTIMIZATION
IMPROVING DIALOGUE MANAGEMENT THROUGH DATA OPTIMIZATIONIMPROVING DIALOGUE MANAGEMENT THROUGH DATA OPTIMIZATION
IMPROVING DIALOGUE MANAGEMENT THROUGH DATA OPTIMIZATION
 
Document Author Classification Using Parsed Language Structure
Document Author Classification Using Parsed Language StructureDocument Author Classification Using Parsed Language Structure
Document Author Classification Using Parsed Language Structure
 
RAG-FUSION: A NEW TAKE ON RETRIEVALAUGMENTED GENERATION
RAG-FUSION: A NEW TAKE ON RETRIEVALAUGMENTED GENERATIONRAG-FUSION: A NEW TAKE ON RETRIEVALAUGMENTED GENERATION
RAG-FUSION: A NEW TAKE ON RETRIEVALAUGMENTED GENERATION
 
Performance, energy consumption and costs: a comparative analysis of automati...
Performance, energy consumption and costs: a comparative analysis of automati...Performance, energy consumption and costs: a comparative analysis of automati...
Performance, energy consumption and costs: a comparative analysis of automati...
 
EVALUATION OF MEDIUM-SIZED LANGUAGE MODELS IN GERMAN AND ENGLISH LANGUAGE
EVALUATION OF MEDIUM-SIZED LANGUAGE MODELS IN GERMAN AND ENGLISH LANGUAGEEVALUATION OF MEDIUM-SIZED LANGUAGE MODELS IN GERMAN AND ENGLISH LANGUAGE
EVALUATION OF MEDIUM-SIZED LANGUAGE MODELS IN GERMAN AND ENGLISH LANGUAGE
 
February 2024 - Top 10 cited articles.pdf
February 2024 - Top 10 cited articles.pdfFebruary 2024 - Top 10 cited articles.pdf
February 2024 - Top 10 cited articles.pdf
 
Enhanced Retrieval of Web Pages using Improved Page Rank Algorithm
Enhanced Retrieval of Web Pages using Improved Page Rank AlgorithmEnhanced Retrieval of Web Pages using Improved Page Rank Algorithm
Enhanced Retrieval of Web Pages using Improved Page Rank Algorithm
 
Effect of MFCC Based Features for Speech Signal Alignments
Effect of MFCC Based Features for Speech Signal AlignmentsEffect of MFCC Based Features for Speech Signal Alignments
Effect of MFCC Based Features for Speech Signal Alignments
 
NERHMM: A Tool for Named Entity Recognition Based on Hidden Markov Model
NERHMM: A Tool for Named Entity Recognition Based on Hidden Markov ModelNERHMM: A Tool for Named Entity Recognition Based on Hidden Markov Model
NERHMM: A Tool for Named Entity Recognition Based on Hidden Markov Model
 
NLization of Nouns, Pronouns and Prepositions in Punjabi With EUGENE
NLization of Nouns, Pronouns and Prepositions in Punjabi With EUGENENLization of Nouns, Pronouns and Prepositions in Punjabi With EUGENE
NLization of Nouns, Pronouns and Prepositions in Punjabi With EUGENE
 

Dernier

Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills KuwaitKuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
jaanualu31
 
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
AldoGarca30
 
Verification of thevenin's theorem for BEEE Lab (1).pptx
Verification of thevenin's theorem for BEEE Lab (1).pptxVerification of thevenin's theorem for BEEE Lab (1).pptx
Verification of thevenin's theorem for BEEE Lab (1).pptx
chumtiyababu
 

Dernier (20)

Thermal Engineering Unit - I & II . ppt
Thermal Engineering  Unit - I & II . pptThermal Engineering  Unit - I & II . ppt
Thermal Engineering Unit - I & II . ppt
 
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
 
Moment Distribution Method For Btech Civil
Moment Distribution Method For Btech CivilMoment Distribution Method For Btech Civil
Moment Distribution Method For Btech Civil
 
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills KuwaitKuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
 
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKARHAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR
 
Online food ordering system project report.pdf
Online food ordering system project report.pdfOnline food ordering system project report.pdf
Online food ordering system project report.pdf
 
Generative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTGenerative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPT
 
Orlando’s Arnold Palmer Hospital Layout Strategy-1.pptx
Orlando’s Arnold Palmer Hospital Layout Strategy-1.pptxOrlando’s Arnold Palmer Hospital Layout Strategy-1.pptx
Orlando’s Arnold Palmer Hospital Layout Strategy-1.pptx
 
Computer Lecture 01.pptxIntroduction to Computers
Computer Lecture 01.pptxIntroduction to ComputersComputer Lecture 01.pptxIntroduction to Computers
Computer Lecture 01.pptxIntroduction to Computers
 
AIRCANVAS[1].pdf mini project for btech students
AIRCANVAS[1].pdf mini project for btech studentsAIRCANVAS[1].pdf mini project for btech students
AIRCANVAS[1].pdf mini project for btech students
 
data_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfdata_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdf
 
Work-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptxWork-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptx
 
Bhubaneswar🌹Call Girls Bhubaneswar ❤Komal 9777949614 💟 Full Trusted CALL GIRL...
Bhubaneswar🌹Call Girls Bhubaneswar ❤Komal 9777949614 💟 Full Trusted CALL GIRL...Bhubaneswar🌹Call Girls Bhubaneswar ❤Komal 9777949614 💟 Full Trusted CALL GIRL...
Bhubaneswar🌹Call Girls Bhubaneswar ❤Komal 9777949614 💟 Full Trusted CALL GIRL...
 
kiln thermal load.pptx kiln tgermal load
kiln thermal load.pptx kiln tgermal loadkiln thermal load.pptx kiln tgermal load
kiln thermal load.pptx kiln tgermal load
 
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
 
COST-EFFETIVE and Energy Efficient BUILDINGS ptx
COST-EFFETIVE  and Energy Efficient BUILDINGS ptxCOST-EFFETIVE  and Energy Efficient BUILDINGS ptx
COST-EFFETIVE and Energy Efficient BUILDINGS ptx
 
Design For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the startDesign For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the start
 
GEAR TRAIN- BASIC CONCEPTS AND WORKING PRINCIPLE
GEAR TRAIN- BASIC CONCEPTS AND WORKING PRINCIPLEGEAR TRAIN- BASIC CONCEPTS AND WORKING PRINCIPLE
GEAR TRAIN- BASIC CONCEPTS AND WORKING PRINCIPLE
 
Verification of thevenin's theorem for BEEE Lab (1).pptx
Verification of thevenin's theorem for BEEE Lab (1).pptxVerification of thevenin's theorem for BEEE Lab (1).pptx
Verification of thevenin's theorem for BEEE Lab (1).pptx
 
Double Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torqueDouble Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torque
 

FAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MINING

  • 1. International Journal on Natural Language Computing (IJNLC) Vol.8, No.3, June 2019 DOI: 10.5121/ijnlc.2019.8302 17 FAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MINING Pranav Bharadwaj1 and Zongru Shao2 1 South Brunswick High School, Monmouth Junction, NJ 2 Spectronn , USA ABSTRACT Nearly 70% of people are concerned about the propagation of fake news. This paper aims to detect fake news in online articles through the use of semantic features and various machine learning techniques. In this research, we investigated recurrent neural networks vs. the naive bayes classifier and random forest classifiers using five groups of linguistic features. Evaluated with real or fake dataset from kaggle.com, the best performing model achieved an accuracy of 95.66% using bigram features with the random forest classifier. The fact that bigrams outperform unigrams, trigrams, and quadgrams show that word pairs as opposed to single words or phrases best indicate the authenticity of news. KEYWORDS Text Mining, Fake News, Machine Learning, Semantic Features, Natural Language Processing (NLP) 1. INTRODUCTION Nearly 70% of the population is concerned about malicious use of fake news [3]. Fake news detection is a problem that has been taken on by large social-networking companies such as Facebook and Twitter to inhibit the propagation of misinformation across their online platforms. Some fake news articles have targeted major political events such as the 2016 US Presidential Election and Brexit [4]. However, the scope of fake news extends beyond globally significant political events. Individuals falsely reported that a golden asteroid on target to hit Earth contains $10 quadrillion worth of precious metals in an attempt to increase the value of Bitcoin [1]. With fake news infiltrating multiple facets of public information, many are rightly concerned. According to a Pew Research Center survey, fake news and misinformation has had a significant impact on 68% of Americans’ confidence in government, 54% of Americans’ confidence in each other, and 51% of Americans’ confidence in their political readers to get work done [5]. Additionally, the previously mentioned survey also states that 79% of US adults believe that measures should be taken to inhibit the propagation of misinformation [5]. Residents of a Macedonian town named Veles use Google AdSense to distribute fake news around the internet, run politically manipulative Facebook pages and websites in order to make a living [12]. One of their Facebook pages had garnered over 1.5 million followers [12]. The rising problem that is fake news has become increasingly important because of the vulnerability of the massive readers and its widespread malicious influence. As these invalid sources of information have gained traction and have established themselves as credible informants to many individuals, preventing this category of content from spreading and detecting it at the source has become increasingly crucial. Consequently, automated identification of fake news has been studied by Facebook and Twitter as well as other researchers [9]. In this paper, we present a comparison between recurrent neural networks, naive bayes, and
  • 2. International Journal on Natural Language Computing (IJNLC) Vol.8, No.3, June 2019 18 random forest algorithms using various linguistic features for fake-news-detection. We use the real or fake dataset from kaggle.com evaluate these models. The remainder of this paper is structured into three sections. Section 2 details related works, how they approached detecting fake news, and Section 3 describes the semantic features and machine learning algorithms in our experiment. Section 4 presents the evaluation results, in which random forest with bigram features achieved the best accuracy of 95.66%. Section 5 presents the conclusions and future work. 2. RELATED WORKS Several solutions were proposed for this problem. Prior studies employed logistic regression and “boolean crowd-sourcing algorithms” to detect fake news in social networking websites [13]. However, this research assumed that agents who post misinformation can be detected by the users who have prior contact of the content [13]. Another study used convolutional neural networks (CNNs), with a long short term memory (LSTM) layer to detect fake news by the text context and additional metadata [14]. Shu et al. studied linguistic features such as word count, frequency of words, character count, and similarity and clarity score for videos and images while proposing rumor classification, truth discovery, click-bait detection, and spammer and bot detection [11]. Rubin et al. proposed to classify fake news as one of three types: (a) serious fabrications, (b) large-scale hoaxes, (c) humorous fakes. It also discussed the challenges that each variant of fake news presents to its detection [11]. However, none of the prior studies had utilized recurrent neural networks (RNNs), naive bayes, or random forest. 3. MATERIALS AND METHODS In this section, we describe the material dataset, text preprocessing, semantic features including term frequency (TF), term frequency–inverse document frequency (TFIDF), bigrams, trigram, quadgram, vectorized word representations, and machine learning algorithms such as naive bayes classifier, random forest, recurrent neural networks (RNN). The process of making each model has been detailed below in Figure 1. Figure 1 3.1 Dataset We used real-or-fake news dataset from kaggle.com in our experiments to evaluate semantic features. It contains 6256 articles including their titles. 50% of the articles are labeled as FAKE and the remaining as REAL. Therefore, detecting the FAKE articles is a binary classification problem. We split the dataset into 80% for training and 20% for testing. 3.2 Text Pre-Processing We pre-process the raw text to extract semantic features for machine learning. We use n-grams as semantic features. We first tokenize the title and body of each article. Then, each token is transformed into lower cases and proper nouns lose their capital-letter information. Next, we remove stopwords and numbers for unigrams since they carry less meaning in the context. As a
  • 3. International Journal on Natural Language Computing (IJNLC) Vol.8, No.3, June 2019 19 N result, the remaining to- kens are semantic representations from linguistic perspective. Stopwords and numbers are reserved for n-grams other than unigrams. Then, we extract TF and TFIDF numerical features using the semantic representations. 3.3 Linguistic Features 3.3.1 TF and TFIDF Note that a text subject (e.g., an article) is called a document in natural language processing. TF computes how frequently a term appears in a document. Given a document d with a set of terms T = {t1, t2, ..., tM }, and the document length is N (the total occurrence of all terms); suppose term ti appeared xi times; then, TF of ti is denoted as As a result, [TF(1),TF(2),...,TF(i),...,TF(M)], i ∈ [1,M] is a semantic representation for the document. Inverse term frequency (IDF) denotes the popularity of a term across documents. Given a set of documents D = {d1,d2,...,dk} as the subjects of interest, and TF(i) for term ti is calculated for each document; suppose C(i) denotes the number of documents in which xi ≠ 0; then, Note that each term appears at least once in D. Meanwhile, TF and IDF are calculated in logarithmically scaled: Where i ∈ [1, M] and j ∈ [1, K]. Then, TFIDF is the product of TF and IDF: 3.3.2 N-grams N-grams are continuous chucks of n items from a tokenized sequence for a document. Especially, uni- grams are terms where n = 1. Bigrams are pairs of adjacent terms where n = 2. Trigrams and quad- grams are three and four continuous terms, respectively. For example, the sentence “Your destination is 3 miles away” is tokenized into “your”, “destination”, “is”, “3”, “miles”, and “away”, where each term is a unigram. The bigrams are two-term strings: “your destination”, “destination is”, “is 3”, “3 miles”, and “miles away”. Trigrams are three-term strings: “your destination is”, “destination is 3”, “is 3 miles”, and “3 miles away”. And quadgrams are four-term strings: “your destination is 3”, “destination is 3 miles”, and “is 3 miles away”. In our experiment, we use unigrams, bigrams, trigrams, and quadgrams to calculate the correlated TF and TFIDF features.
  • 4. International Journal on Natural Language Computing (IJNLC) Vol.8, No.3, June 2019 20 3.3.3 Naive Bayes Classifier The naive Bayes classifier is a classifier based on Bayes’ Theorem: P (A |B ) = P(B | A) × P(A) P (B ) Where A and B are two conditions. The naive Bayes classifier takes each semantic feature as a condition and classify the samples with the highest occurring probability. Note that it assumes that the semantic features are independent [8]. 3.3.4 Random Forest Classifier A decision tree is a “tree” where different conditions branch off from their parents and each node represents a class for classification. The random forest classifier is an ensemble method that operates a multitude of decision trees and thus improves the accuracy. We adjust parameters such as max depth, min samples split, n estimators, and random state to achieve the best performance; where Max depth is the maximum depth of a decision tree; Min samples split is the minimum amount of samples to split an internal node, and N estimators is the number of decision trees in the random forest [2]. 3.4 GLOVE GloVe is an unsupervised learning algorithm that parallels the closeness of two words with their distance in a vector space [7]. The generated vector representations are called word embed- dings. We use word embeddings as semantic features in addition to n-grams because they represent the semantic distances between the words in the context. 3.5 RNN Recurrent neural networks (RNNs) utilize “memory” to process inputs and are widely used in text generation and natural language processing [6]. Long short-term memory (LSTM) is a RNN architecture that uses “gates” to “forget” the input at a condition. In our model we use 100 LSTM cells in one layer and a softmax activation function. We trained the model with 22 epochs and a batch size of 64. 4. EXPERIMENTAL RESULTS This section presents the experimental results using naive bayes, random forest, and RNN with six groups of features: TF, TFIDF, frequency of bigrams, trigrams, and quadgrams, and GloVe word embeddings. TF TFIDF Bigram Trigram Quadgram GloVe Naive Bayes 88.08% 89.90% 90.77% 90.06% 89.74% N/A Random Forest 89.03% 89.34% 95.66% 94.71% 89.60% N/A RNN N/A N/A N/A N/A N/A 92.70% Table 1 shows the accuracy using each method. Observe that random forest results in better accuracy than the naive bayes classifier with TF, bigrams, trigrams, and quadgrams. Meanwhile,
  • 5. International Journal on Natural Language Computing (IJNLC) Vol.8, No.3, June 2019 21 bigrams outperform TF, TFIDF, trigrams, and quadgrams. The RNN model with GloVe features outperform TF, TFIDF, and quadgrams but not bigrams and trigrams. Note that unigrams represent words; bigrams represent words and their one-to-one connections; trigrams carry level-two connections for words if consider a one-to-one connection between two words as level-one. As a result, bigrams carry more information than unigrams; trigrams more than bigrams; and quadgrams more than trigrams. Also, more information for training suppose to provide better accuracy. Therefore, we would expect quadgrams to result in higher accuracy than trigrams, trigrams higher than bigrams, and bigrams higher than unigrams. This assumption contradicts the data displayed in Table 1 as quadgrams do not result in the highest accuracy in the table. The reason is when information increases, the training process takes specific details and the model is “over-fitted”, when a model predicts the training set too well that it impairs its ability to classify examples that are not within its training set. However, bigrams do outperform unigrams because they carry more information. For the same reason, TF and TFIDF result in similar accuracies because they both are derived from unigrams. Meanwhile, GloVe with RNN outperforms unigrams and quadgrams but results in lower accuracy than bigrams and trigrams. This is caused by single layer LSTM cells and word embeddings represent unigrams. Therefore, the RNN model outperforms random forest if disregard the difference between word embeddings and unigrams. Some implications of the success of these models are that they can be used by readers to filter through the content they consume to be wary of what articles may contain misinformation. These models can also be used by agencies, organizations, corporations, campaigns, or any other formal group to filter through news to find any false claims made about them or their actions. Additionally, publishing houses or news agencies can employ these methods in order to fact check pieces their writers compose to avoid any fake news from being produced under their name. 5. CONCLUSION In this paper, we applied semantic features including unigram TF & TFIDF, bigrams, trigrams, quad- grams, and GloVe word embeddings along with naive bayes, random forest, and RNN classifiers to detect fake news. The performance is promising as bigrams and random forest achieved an accuracy of 95.66%. It implies that semantic features are useful for fake news detection. As the next step, semantic features may be combined with other linguistic cues and meta data to improve the detection performance. REFERENCES [1] Babayan, Davit. “Bitcoin Bulls Spreading Fake News About Golden Asteroid: Peter Schiff.” NewsBTC, NewsBTC, 29 June 2019, www.newsbtc.com/2019/06/29/bitcoin-bulls-spreading- fake- news-about-golden-asteroid-peter-schiff/. [2] Breiman, L. (2001). Random forests machine learning. 45: 5–32. View Article PubMed/NCBI Google Scholar. [3] Handley, Lucy. “Nearly 70 Percent of People Are Worried about Fake News as a 'Weapon,' Survey Says.” CNBC, CNBC, 22 Jan. 2018, www.cnbc.com/2018/01/22/nearly-70-percent-of- people-are-worried-about-fake-news-as-a-weapon-survey-says.html. [4] Kowalewski, J. (2017). The impact of fake news: Politics. Lexology. [5] Mitchell, Amy, et al. “Many Americans Say Made-Up News Is a Critical Problem That Needs To Be Fixed.” Pew Research Center's Journalism Project, Pew Research Center, 28 June 2019,
  • 6. International Journal on Natural Language Computing (IJNLC) Vol.8, No.3, June 2019 22 www.journalism.org/2019/06/05/many-americans-say-made-up-news-is-a-critical-problem-that- needs-to-be-fixed/. [6] Mikolov, T., M. Karafia ́t, L. Burget, J. Cˇernocky`, and S. Khudanpur (2010). Recurrent neural network based language model. In Eleventh annual conference of the international speech communication association. [7] Patel, S. (2017). Chapter 1 : Supervised learning and naive bayes classification - part 1 (theory). Medium. [8] Pennington, J., R. Socher, and C. Manning (2014). Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543. [9] Romano, A. (2018). Mark zuckerberg lays out facebooks 3-pronged approach to fake news. [10] Rubin, V. L., Y. Chen, and N. J. Conroy (2015). Deception detection for news: three types of fakes. In Proceedings of the 78th ASIS&T Annual Meeting: Information Science with Impact: Research in and for the Community, pp. 83. American Society for Information Science. [11] Shu, K., A. Sliva, S. Wang, J. Tang, and H. Liu (2017). Fake news detection on social media: A data mining perspective. ACM SIGKDD Explorations Newsletter 19(1), 22–36. [12] Soares, Isa, and Florence Davey-Atlee. “The Fake News Machine: Inside a Town Gearing up for 2020.” CNNMoney, Cable News Network, money.cnn.com/interactive/media/the-macedonia- story/. [13] Tacchini, E., G. Ballarin, M. L. Della Vedova, S. Moret, and L. de Alfaro (2017). Some like it hoax: Automated fake news detection in social networks. arXiv preprint arXiv:1704.07506. [14] Wang, W. Y. (2017). “liar, liar pants on fire”: A new benchmark dataset for fake news detection. arXiv preprint arXiv:1705.00648.