A Deep Learning Approach
For Twitter Spam Detection
Lijie Zhou (lijie@mail.sfsu.edu) & Hao Yue
San Francisco State University
Outline
• Problem and Challenges
• Past Work
• Our Model and Results
• Conclusion
• Future Work
What Is Spam?
Spam on Facebook and Twitter

| Platform | # of active users | # of spam accounts | % spam |
| Facebook | 2.2 billion | 60-83 million | 2.73%-3.77% |
| Twitter | 330 million | 23 million | 6.97% |

Source: https://www.statista.com/
Various Social Media Sites
Social Media’s Fundamental Design Flaw
• Sophisticated spam accounts know how to exploit various platform features to
do the greatest harm:
• Use shortened URLs to trick users
• Buy compromised accounts to look legitimate
• Use campaigns to gain traction in a short period of time
• Use bots to amplify the noise
• Social media makes it easier and faster to spread spam.
Related Work
• Detection at the tweet level
• Focus on the content of tweets
• E.g., spam words? Overuse of hashtags, URLs, mentions, …?
• Detection at the account level
• Focus on the characteristics of spam accounts
• E.g., Age of the account? # of followers? # of followees? …
Challenges
• Large amount of unlabeled data
• Hand-labeling it is time- and labor-intensive
• Feature selection may cause model overfitting
• Twitter spam drift
• Spamming behavior changes over time, so the performance of existing
machine-learning-based classifiers degrades.
Research Questions
• Question 1: Can we find an unsupervised way to learn from the
unlabeled data and later apply what we have learned to labeled data?
• Will this approach outperform the hand-labeling process?
• Question 2: Can we find a more systematic way to reduce the feature
dimensions than manual feature engineering?
Stage 1: Self-taught Learning From Unlabeled Data
Training Data w/o Label → One-to-N Encoding → Max-Min Normalization → Sparse Auto-encoder → Trained Parameter Set
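To make the Stage 1 preprocessing concrete, here is a minimal Python sketch of one-to-N (one-hot) encoding and max-min normalization. The shapes and random data are illustrative placeholders, not the paper's actual dataset.

```python
import numpy as np

def one_to_n_encode(column):
    """One-to-N encoding: map a categorical column to N binary indicator columns."""
    categories = sorted(set(column))
    index = {c: i for i, c in enumerate(categories)}
    encoded = np.zeros((len(column), len(categories)))
    for row, value in enumerate(column):
        encoded[row, index[value]] = 1.0
    return encoded

def max_min_normalize(X, eps=1e-12):
    """Rescale every feature column into [0, 1]."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / (hi - lo + eps)

# Illustrative usage: encode a categorical feature, normalize the numeric block.
lang_feature = one_to_n_encode(["en", "es", "en", "fr"])   # 4 x 3 binary matrix
X_unlabeled = np.random.rand(600, 62)                      # 600 unlabeled instances
X_pre = max_min_normalize(X_unlabeled)                     # input to the sparse AE
```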
Stage 2: Soft-max Classifier Training
Preprocessed Labeled Training Data → Sparse Auto-encoder → Soft-max Regression → Trained Parameter Set
Stage 3: Classification
Preprocessed Test Data → Sparse Auto-encoder → Soft-max Regression → Spam/Non-Spam
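Stages 2 and 3 together amount to training a soft-max classifier on the auto-encoder's hidden representation and applying it to held-out tweets. A minimal sketch, assuming a trained encoder weight matrix from Stage 1 and using scikit-learn's LogisticRegression as the soft-max regression (for two classes the two models coincide); all names, shapes, and data below are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def encode(X, W1, b1):
    """Hidden-layer activations of the trained sparse auto-encoder (Stage 1 output)."""
    return 1.0 / (1.0 + np.exp(-(X @ W1.T + b1)))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(scale=0.1, size=(25, 62)), np.zeros(25)  # hypothetical Stage 1 parameters
X_train, y_train = rng.random((365, 62)), rng.integers(0, 2, 365)
X_test = rng.random((100, 62))

# Stage 2: fit the classifier on encoded labeled training data.
clf = LogisticRegression(max_iter=1000).fit(encode(X_train, W1, b1), y_train)

# Stage 3: classify encoded test data as spam (1) or non-spam (0).
y_pred = clf.predict(encode(X_test, W1, b1))
```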
Self-taught Learning
• Assumption:
• A single unlabeled record carries little information on its own
• A large amount of unlabeled records may reveal certain patterns
• Goal:
• Find an effective model to reveal this pattern (if it exists)
• We choose the sparse auto-encoder for its good performance and simplicity
Auto-encoder
• A special neural network whose
output is (almost) identical to its
input.
• A compression tool
• The hidden layer is considered the
compressed representation of the
input.
Auto-encoder
• Model parameters:
$(W, b) = (W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)})$
• Hidden-layer activations:
$a_1^{(2)} = f(W_{11}^{(1)} x_1 + W_{12}^{(1)} x_2 + W_{13}^{(1)} x_3 + b_1^{(1)})$
$a_2^{(2)} = f(W_{21}^{(1)} x_1 + W_{22}^{(1)} x_2 + W_{23}^{(1)} x_3 + b_2^{(1)})$
$a_3^{(2)} = f(W_{31}^{(1)} x_1 + W_{32}^{(1)} x_2 + W_{33}^{(1)} x_3 + b_3^{(1)})$
• Hypothesis $h_{W,b}(x)$:
$h_{W,b}(x) = a_1^{(3)} = f(W_{11}^{(2)} a_1^{(2)} + W_{12}^{(2)} a_2^{(2)} + W_{13}^{(2)} a_3^{(2)} + b_1^{(2)}) \approx x$
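A minimal numpy sketch of this forward pass, assuming a sigmoid activation for f; the 3-unit dimensions mirror the toy network on the slide.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    """a(2) = f(W(1) x + b(1)); h = f(W(2) a(2) + b(2)), trained so that h ≈ x."""
    a2 = sigmoid(W1 @ x + b1)   # hidden (compressed) representation
    h = sigmoid(W2 @ a2 + b2)   # reconstruction of the input
    return a2, h

rng = np.random.default_rng(1)
W1, b1 = rng.normal(scale=0.1, size=(3, 3)), np.zeros(3)
W2, b2 = rng.normal(scale=0.1, size=(3, 3)), np.zeros(3)
a2, h = forward(np.array([0.2, 0.7, 0.1]), W1, b1, W2, b2)
```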
Sparse Auto-encoder
• Sparsity parameter
• Definition: a constraint imposed on the hidden layer
• Goal: ensure the pattern will be revealed even if the hidden layer is large
• Average activation: $\hat{\rho}_j = \frac{1}{m} \sum_{i=1}^{m} a_j^{(2)}(x^{(i)})$
• Penalty term
• Enforce $\hat{\rho}_j = \rho$, with sparsity target $\rho = 0.05$
• Kullback-Leibler (KL) divergence: $\sum_{j=1}^{K} KL(\rho \,\|\, \hat{\rho}_j) = \sum_{j=1}^{K} \Big[ \rho \log \frac{\rho}{\hat{\rho}_j} + (1-\rho) \log \frac{1-\rho}{1-\hat{\rho}_j} \Big]$
• $\sum_{j=1}^{K} KL(\rho \,\|\, \hat{\rho}_j) = 0$ if $\hat{\rho}_j = \rho$ for every $j$
Cost Function
$J(W,b) = \frac{1}{m} \sum_{i=1}^{m} \lVert x_i - \hat{x}_i \rVert^2 + \frac{\lambda}{2} \Big( \sum_{k,n} W^2 + \sum_{n,k} V^2 + \sum_{k} b_1^2 + \sum_{n} b_2^2 \Big) + \beta \sum_{j=1}^{k} KL(\rho \,\|\, \hat{\rho}_j)$

• First term: average sum-of-squares (reconstruction) error
• Second term: weight decay ($W$ and $V$ are the encoder and decoder weight matrices; $b_1$, $b_2$ the biases)
• Third term: sparsity penalty
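A numpy sketch of this cost, assuming sigmoid activations and reading $W$ and $V$ as the encoder and decoder weight matrices; the λ and β values are illustrative, with ρ = 0.05 as on the previous slide.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sparse_ae_cost(X, W1, b1, W2, b2, lam=1e-4, beta=3.0, rho=0.05):
    """J(W,b) = average reconstruction error + weight decay + beta * KL penalty."""
    m = X.shape[0]
    A2 = sigmoid(X @ W1.T + b1)      # hidden activations, one row per instance
    Xhat = sigmoid(A2 @ W2.T + b2)   # reconstructions

    recon = np.sum((Xhat - X) ** 2) / m                       # average sum-of-squares error
    decay = (lam / 2) * (np.sum(W1 ** 2) + np.sum(W2 ** 2)    # the slide's decay term
                         + np.sum(b1 ** 2) + np.sum(b2 ** 2)) # also covers the biases

    rho_hat = A2.mean(axis=0)        # average activation of each hidden unit
    kl = np.sum(rho * np.log(rho / rho_hat)
                + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))
    return recon + decay + beta * kl

# Illustrative shapes: 62 input features, 25 hidden units.
rng = np.random.default_rng(2)
X = rng.random((600, 62))
W1, W2 = rng.normal(scale=0.1, size=(25, 62)), rng.normal(scale=0.1, size=(62, 25))
print(sparse_ae_cost(X, W1, np.zeros(25), W2, np.zeros(62)))
```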
Cost Function
• Goal: minimize J(W, b) as a function of W and b
• Steps
• Initialization
• Update parameters with gradient descent
$W_{ij}^{(l)} := W_{ij}^{(l)} - \alpha \frac{\partial}{\partial W_{ij}^{(l)}} J(W, b)$

$b_{i}^{(l)} := b_{i}^{(l)} - \alpha \frac{\partial}{\partial b_{i}^{(l)}} J(W, b)$
Back-propagation
$\delta_i^{(n_l)}$: the "error term", a measure of how much the node is "responsible" for any error in the output
Back-propagation
1. Perform a feedforward pass, computing the activations for layers $L_2, L_3, \ldots$ up to the output layer $L_{n_l}$.
2. For each output unit $i$ in layer $n_l$ (the output layer), set
$\delta_i^{(n_l)} = -(y_i - a_i^{(n_l)}) \, f'(z_i^{(n_l)})$
3. For $l = n_l - 1, n_l - 2, n_l - 3, \ldots, 2$:
For each node $i$ in layer $l$, set $\delta_i^{(l)} = \Big( \sum_{j=1}^{s_{l+1}} W_{ji}^{(l)} \delta_j^{(l+1)} \Big) f'(z_i^{(l)})$
4. Compute the partial derivatives:
$\frac{\partial}{\partial W_{ij}^{(l)}} J(W, b; x, y) = a_j^{(l)} \delta_i^{(l+1)}$
$\frac{\partial}{\partial b_i^{(l)}} J(W, b; x, y) = \delta_i^{(l+1)}$
(Here $f'$ denotes the derivative of the activation function.)
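The four steps above, combined with the gradient-descent update from the earlier slide, can be sketched in numpy for a sigmoid auto-encoder (where the target y is the input x itself). The full sparse model would add a KL-derived term to the hidden-layer delta; that term is omitted here for brevity, and all shapes and hyperparameters are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(X, W1, b1, W2, b2, alpha=0.1, lam=1e-4):
    """One W := W - alpha * dJ/dW update computed via back-propagation."""
    m = X.shape[0]
    # Step 1: feedforward pass.
    A2 = sigmoid(X @ W1.T + b1)            # hidden activations
    A3 = sigmoid(A2 @ W2.T + b2)           # output layer (reconstruction of X)
    # Step 2: output-layer error terms; for sigmoid, f'(z) = a * (1 - a).
    d3 = -(X - A3) * A3 * (1 - A3)
    # Step 3: propagate the error terms back to the hidden layer.
    d2 = (d3 @ W2) * A2 * (1 - A2)         # sparse AE would add a KL term here
    # Step 4: partial derivatives, averaged over the batch, plus weight decay.
    dW2 = d3.T @ A2 / m + lam * W2
    dW1 = d2.T @ X / m + lam * W1
    db2 = d3.mean(axis=0) + lam * b2       # the slide's decay term includes biases
    db1 = d2.mean(axis=0) + lam * b1
    # Gradient-descent update.
    return W1 - alpha * dW1, b1 - alpha * db1, W2 - alpha * dW2, b2 - alpha * db2
```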
Fine-tuning
Preprocessed Labeled Training Data → Sparse Auto-encoder → Soft-max Regression → Trained Parameter Set
(the labeled data is used to fine-tune the auto-encoder parameters together with the soft-max classifier)
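The slides do not spell out the fine-tuning procedure, but in self-taught learning it typically means initializing the encoder with the Stage 1 parameters, stacking the soft-max layer on top, and back-propagating through both on the labeled data. A hypothetical PyTorch sketch under that assumption, with placeholder tensors and hyperparameters:

```python
import torch
import torch.nn as nn

# Hypothetical Stage 1 parameters: 62 features compressed to 25 hidden units.
W1, b1 = torch.randn(25, 62) * 0.1, torch.zeros(25)
encoder = nn.Linear(62, 25)
encoder.weight.data, encoder.bias.data = W1.clone(), b1.clone()
model = nn.Sequential(encoder, nn.Sigmoid(), nn.Linear(25, 2))  # logits for soft-max

X = torch.rand(365, 62)                 # placeholder labeled training data
y = torch.randint(0, 2, (365,))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()         # soft-max regression loss

# Fine-tuning: gradients flow through the classifier AND the encoder, so the
# unsupervised features get adjusted for the supervised spam/non-spam task.
for epoch in range(100):
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()
```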
Dataset
• 1065 instances; each instance has 62 features.
• Split the 1065 instances into three groups:
• Training w/o label – 600 instances
• Training w/ label – 365 instances
• Test w/ label – 100 instances
• Comparison group (SVM, naïve Bayes, and random forests):
• Training w/ label – 365 instances
• Test w/ label – 100 instances
Evaluation
• True Positive (TP): actual spammer, predicted spammer.
• True Negative (TN): actual non-spammer, predicted non-spammer.
• False Positive (FP): actual non-spammer, predicted spammer.
• False Negative (FN): actual spammer, predicted non-spammer.
Evaluation
Accuracy: the correctly classified instances over the total number of test instances.

Precision: $P = \frac{TP}{TP + FP} \times 100\%$

Recall: $R = \frac{TP}{TP + FN} \times 100\%$

F-Measure: $F = \frac{2PR}{P + R}$
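These four metrics follow directly from the confusion-matrix counts. A small sketch, checked against the SAE row of the results tables below:

```python
def metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F-measure from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_measure

# SAE row of the results tables: TP=34, TN=52, FP=3, FN=11.
print(metrics(34, 52, 3, 11))   # -> (0.86, 0.9189..., 0.7555..., 0.8292...)
```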
Results
Classification accuracy for each combination of hidden layer 1 (rows) and hidden layer 2 (columns) sizes:

| Hidden L1 \ Hidden L2 | 15 | 20 | 25 | 30 | 35 | 40 | 45 | 50 | 55 | Avg |
| 55 | 86% | 88% | 85% | 84% | 87% | 85% | 83% | 86% | 86% | 86% |
| 50 | 84% | 84% | 86% | 88% | 86% | 89% | 87% | 86% | 88% | 86% |
| 45 | 85% | 88% | 87% | 86% | 85% | 84% | 88% | 86% | 86% | 86% |
| 40 | 88% | 87% | 85% | 85% | 85% | 87% | 87% | 86% | 89% | 87% |
| 35 | 87% | 88% | 87% | 86% | 87% | 86% | 86% | 85% | 86% | 86% |
| 30 | 85% | 86% | 89% | 85% | 85% | 84% | 83% | 87% | 88% | 86% |
| 25 | 87% | 87% | 88% | 87% | 85% | 88% | 85% | 87% | 88% | 87% |
| 20 | 84% | 88% | 83% | 88% | 86% | 85% | 88% | 87% | 86% | 86% |
| 15 | 83% | 83% | 83% | 87% | 85% | 82% | 85% | 86% | 85% | 84% |
| Avg | 85% | 87% | 86% | 86% | 86% | 86% | 86% | 86% | 87% | |
Results – Comparison with SVM
| Model | TP | TN | FP | FN | A | P | R | F |
| SAE | 34 | 52 | 3 | 11 | 86% | 91.9% | 75.6% | 83.0% |
| Top 5 | 28 | 52 | 2 | 18 | 80% | 93.3% | 60.9% | 73.7% |
| Top 10 | 27 | 52 | 3 | 18 | 79% | 90.0% | 60.0% | 72.0% |
| Top 20 | 28 | 52 | 3 | 17 | 80% | 90.3% | 62.2% | 73.7% |
| Top 30 | 29 | 52 | 3 | 16 | 81% | 90.6% | 64.4% | 75.3% |
Results – Comparison with Random Forests & Naïve Bayes

| Model | TP | TN | FP | FN | A | P | R | F |
| SAE | 34 | 52 | 3 | 11 | 86% | 91.9% | 75.6% | 83.0% |
| Random Forest | 32 | 52 | 3 | 13 | 84% | 91.0% | 71.0% | 80.0% |
| Naïve Bayes | 33 | 50 | 5 | 12 | 83% | 86.8% | 73.0% | 79.5% |
Conclusion
• Self-taught learning: a large amount of unlabeled data + a small amount
of labeled data
• Sparse auto-encoder: reduces the feature dimensions
• Fine-tuning: improves the deep learning model to a large extent
Limitation & Future Work
• The dataset we use is relatively small.
• We are still exploring new ways to apply this model to raw data.
A Deep Learning Approach
For Twitter Spam Detection
Lijie Zhou (lijie@mail.sfsu.edu) and Hao Yue
San Francisco State University
Editor's notes
1. The key is to compute the partial derivatives.
2. We conducted an experiment on this implementation, but the result is not as expected.