SMS Spam Filter Design Using R:
 A Machine Learning Approach

                       Reza Rahimi,
                      Ph.D Candidate,
       School of Information and Computer Science,
              University of California, Irvine.
Introduction
• In basic terms, Machine Learning (ML) is about the
  construction of systems that can learn from data.
• It is used as a tool for knowledge discovery.
• Several important classes of problems can be solved
  using machine learning techniques, such as:
   – Classification (Prediction):
       • Given a collection of records as a training set.
       • Each record contains a set of attributes, one of which is
         called the class.
       • The problem is to find a model for the class attribute as a function
         of the other attributes.
            –   Example: Spam or Ham, Handwriting Recognition,…
– Clustering (Description):
   • Given a set of data points, with some attributes, and a similarity
     measure (metric) among them.
   • The goal is to find clusters such that data points in one cluster are
     more similar to one another than to points in other clusters.
        –   Example: Document Clustering, people categorization,…

– Association (Description):
   • Given a set of records, each containing some items from a given
     collection.
   • The goal is to produce dependency rules that predict the
     occurrence of an item based on occurrences of other items.
        –   Example: user habit pattern recognition,…

– Regression (Prediction):
   • Predict the value of a given continuous variable based on the values
     of other variables.
   • The model of dependency can be linear or nonlinear.
        –   Example: Stock prediction
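
As a rough illustration of two of the problem classes above, the following sketch uses only base R and its built-in `cars` and `iris` datasets. The choice of datasets, the number of clusters and the seed are assumptions made for this example and are not part of the original slides.

```r
# Regression (prediction): model stopping distance as a linear function of speed
fit <- lm(dist ~ speed, data = cars)
slope <- coef(fit)[["speed"]]   # estimated change in distance per unit of speed

# Clustering (description): group iris flowers by their four measurements
set.seed(1)                     # kmeans starts from random centers
clusters <- kmeans(iris[, 1:4], centers = 3, nstart = 10)
cluster_sizes <- table(clusters$cluster)

print(slope)
print(cluster_sizes)
```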
Problem Solving Using Machine
     Learning Framework
• ML is a mature and well-developed area.
• For each of the problem classes mentioned, it offers
  a rich set of tools, techniques, and
  algorithms.
• These tools are available in different languages and
  frameworks such as R, Matlab, Java, C++, Mahout,…
• The following procedure can be considered a
  general methodology for problem solving in this
  framework:
   – Get a sense of data: feature extraction, dimension
     reduction, noise cancellation,…
   – Problem modeling: classification, clustering, association,
     regression,…
   – Run standard ML algorithms: check the errors according to the
     standard ML metrics.
   – Select the methods that satisfy your performance criteria
     and metrics.

• In the next section I describe the design of an SMS spam
  filter in the R language, based on the methodology
  above.
SMS Spam Filter using R
•   #This file contains the SMS spam filter code, with different classifiers, in R
•   #Written by: Reza Rahimi
•   #Initialization: raw data (http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection),
•   #loading required packages, libraries and function declarations

•   #required package for text mining
•   if(!require("tm"))
•             install.packages("tm")
•
•   #required package for SVM
•   if(!require("e1071"))
•             install.packages("e1071")
•
•   #required package for KNN
•   if(!require("RWeka"))
•   install.packages("RWeka", dependencies = TRUE)
•
•   #required package for Adaboost
•   if(!require("ada"))
•             install.packages("ada")
•   library("tm")
•   library("e1071")
•   library(RWeka)
•   library("ada")
R Codes (Cont.)
•   #Initialize random generator
•   set.seed(1245)
•
•   #This function makes vector (Vector Space Model) from text message using highly repeated words
•   vsm<-function(message,highlyrepeatedwords){
•
•            #split on runs of whitespace (the regex needs an escaped backslash)
•            tokenizedmessage<-strsplit(message, "\\s+")[[1]]
•
•   #making vector
•             v<-rep(0, length(highlyrepeatedwords))
•             for(i in 1:length(highlyrepeatedwords)){
•                              for(j in 1:length(tokenizedmessage)){
•                                               if(highlyrepeatedwords[i]==tokenizedmessage[j]){
•                                               v[i]<-v[i]+1
•                                               }
•                              }
•             }
•   return (v)
•   }
•   #loading data. Original data is from http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
•   print("Loading SMS spam and ham messages!")
•   #the file is tab-separated; a header row with the column names type and sms is assumed
•   smstable<-read.csv("C:/tmp/smsspamcollection.txt", header = TRUE, sep = "\t",
    colClasses=c("type"="character","sms"="character"))
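
The nested loops in `vsm` count how often each vocabulary word occurs in a message. The same counts can be sketched in vectorized base R as below. The name `vsm_sketch`, the vocabulary and the messages are made up for this illustration; this is an alternative formulation, not the author's implementation.

```r
# Count occurrences of each vocabulary word in one message
vsm_sketch <- function(message, vocabulary) {
  tokens <- strsplit(message, "\\s+")[[1]]
  # factor() with fixed levels keeps zero counts and drops out-of-vocabulary tokens
  as.numeric(table(factor(tokens, levels = vocabulary)))
}

vocabulary <- c("free", "call", "win")   # hypothetical frequent words
single <- vsm_sketch("free prize call now free", vocabulary)

# One row per message, one column per vocabulary word
messages <- c("call me later", "free free prize call")
many <- t(vapply(messages, vsm_sketch, numeric(length(vocabulary)),
                 vocabulary = vocabulary, USE.NAMES = FALSE))

print(single)
print(many)
```

Building the whole matrix in one call also avoids growing it row by row with `rbind`, which becomes slow for thousands of messages.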
R Codes (Cont.)
•   smstabletmp<-smstable
•
•   print("Extracting Ham and Spam Basic Statistics!")
•
•   smstabletmp$type[smstabletmp$type=="ham"] <- 1
•   smstabletmp$type[smstabletmp$type=="spam"] <- 0
•
•   #Convert character data into numeric
•   tmp<-as.numeric(smstabletmp$type)
•
•   #Basic statistics: mean and variance of the ham indicator (1 = ham, 0 = spam)
•   hamavg<-mean(tmp)
•   print("Average Ham is :");hamavg
•
•   hamvariance<-var(tmp)
•   print("Var of Ham is :");hamvariance
•
•   print("Extract average token of Hams and Spams!")
•
•   nohamtokens<-0
•   noham<-0
•   nospamtokens<-0
•   nospam<-0
R Codes (Cont.)
•   for(i in 1:length(smstable$type)){
•               if(smstable[i,1]=="ham"){
•                               nohamtokens<-length(strsplit(smstable[i,2], "\\s+")[[1]])+nohamtokens
•                               noham<-noham+1
•               }else{
•                               nospamtokens<-length(strsplit(smstable[i,2], "\\s+")[[1]])+nospamtokens
•                               nospam<-nospam+1
•               }
•   }
•
•   totaltokens<-nospamtokens+nohamtokens;
•   print("total number of tokens is:")
•   print(totaltokens)
•
•   avgtokenperham<-nohamtokens/noham
•   print("Average number of tokens per ham message")
•   print(avgtokenperham)
•
•   avgtokenperspam<-nospamtokens/nospam
•   print("Average number of tokens per spam message")
•   print(avgtokenperspam)
•
•   print(" Make two different sets, training data and test data!")
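
The per-class token averages computed by the loop above can also be obtained without explicit counters. The following is a minimal sketch on a made-up three-message table; the column names follow the slides, while the data itself is an assumption.

```r
# Toy stand-in for smstable (hypothetical messages)
sms <- data.frame(
  type = c("ham", "spam", "ham"),
  sms  = c("see you at lunch", "win a free prize now", "ok"),
  stringsAsFactors = FALSE
)

# Number of whitespace-separated tokens in each message
tokens_per_message <- lengths(strsplit(sms$sms, "\\s+"))

# Average token count per class
avg_tokens <- tapply(tokens_per_message, sms$type, mean)
print(avg_tokens)
```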
R Codes (Cont.)
•   #select the percent of data that you want to use as training set
•   trdatapercent<-0.3
•
•   #training data set
•   trdata=NULL
•
•   #test data set
•   tedata=NULL
•
•   for(i in 1:length(smstable$type)){
•               if(runif(1)<trdatapercent){
•                               trdata=rbind(trdata,c(smstable[i,1],tolower(smstable[i,2])))
•               }
•               else{
•                               tedata=rbind(tedata,c(smstable[i,1],tolower(smstable[i,2])))
•               }
•   }
•
•   print("Training data size is!")
•   dim(trdata)
•
•   print("Test data size is!")
•   dim(tedata)
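
The split above draws one uniform random number per message, so the training set holds roughly, not exactly, 30% of the data. A vectorized sketch of the same idea, plus an exact-size alternative using `sample`, is shown below on made-up data; the variable names are chosen for this example only.

```r
set.seed(1245)

# Toy stand-in for smstable (hypothetical messages)
sms <- data.frame(
  type = rep(c("ham", "spam"), times = c(8, 2)),
  sms  = paste("message", 1:10),
  stringsAsFactors = FALSE
)
train_fraction <- 0.3

# Same scheme as the loop: each row goes to training with probability 0.3
is_train <- runif(nrow(sms)) < train_fraction
train <- sms[is_train, ]
test  <- sms[!is_train, ]

# Alternative: fix the training set size exactly
train_idx <- sample(nrow(sms), size = round(train_fraction * nrow(sms)))

print(dim(train))
print(dim(test))
```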
R Codes (Cont.)
•   # Text feature extraction using tm package
•
•   trsmses<-Corpus(VectorSource(trdata[,2]))
•   trsmses<-tm_map(trsmses, stripWhitespace)
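•   #recent versions of tm require plain functions to be wrapped in content_transformer()
•   trsmses<-tm_map(trsmses, content_transformer(tolower))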
•   trsmses<-tm_map(trsmses, removeWords, stopwords("english"))
•
•   dtm <- DocumentTermMatrix(trsmses)
•
•   highlyrepeatedwords<-findFreqTerms(dtm, 80)
•
•   #These highly used words are used as an index to make VSM
•   #(vector space model) for trained data and test data
•
•   #vectorized training data set
•   vtrdata=NULL
•
•   #vectorized test data set
•   vtedata=NULL
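
To make the vocabulary selection step concrete, here is a base R sketch of what a frequency threshold does: count every token across all training messages and keep those that occur at least a given number of times. The messages and the threshold of 2 are made up for this illustration, and the sketch skips the stop word removal performed with tm above.

```r
# Hypothetical training messages
messages <- c("free prize call now", "call me later", "free free entry")

# Total count of each token over the whole collection
tokens <- unlist(strsplit(tolower(messages), "\\s+"))
counts <- table(tokens)

# Keep tokens that occur at least min_count times
min_count <- 2
frequent <- names(counts)[counts >= min_count]
print(frequent)
```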
R Codes (Cont.)
•   for(i in 1:length(trdata[,2])){
•               if(trdata[i,1]=="ham"){
•                                vtrdata=rbind(vtrdata,c(1,vsm(trdata[i,2],highlyrepeatedwords)))
•               }
•               else{
•                                vtrdata=rbind(vtrdata,c(0,vsm(trdata[i,2],highlyrepeatedwords)))
•               }
•
•   }
•
•   for(i in 1:length(tedata[,2])){
•               if(tedata[i,1]=="ham"){
•                                vtedata=rbind(vtedata,c(1,vsm(tedata[i,2],highlyrepeatedwords)))
•               }
•               else{
•                                vtedata=rbind(vtedata,c(0,vsm(tedata[i,2],highlyrepeatedwords)))
•               }
•
•   }
R Codes (Cont.)
•   # Run different classification algorithms
•   # different SVMs with different kernels
•   print("----------------------------------SVM-----------------------------------------")
•   print("Linear Kernel")
•   svmlinmodel <- svm(x=vtrdata[,2:length(vtrdata[1,])],y=vtrdata[,1],type='C', kernel='linear');
•   summary(svmlinmodel)
•   predictionlin <- predict(svmlinmodel, vtedata[,2:length(vtedata[1,])])
•   tablinear <- table(pred = predictionlin , true = vtedata[,1]); tablinear
•   #share of correctly classified messages (overall accuracy)
•   accuracylin<-sum(diag(tablinear))/sum(tablinear);
•   print("General Error using Linear SVM is (in percent):");(1-accuracylin)*100
•   print("Ham Error using Linear SVM is (in percent):");(tablinear[1,2]/sum(tablinear[,2]))*100
•   print("Spam Error using Linear SVM is (in percent):");(tablinear[2,1]/sum(tablinear[,1]))*100
•
•   print("Polynomial Kernel")
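•   #type is set explicitly: with a numeric y, svm() would otherwise default to regression
•   svmpolymodel <- svm(x=vtrdata[,2:length(vtrdata[1,])],y=vtrdata[,1], type='C-classification', kernel='polynomial', probability=FALSE)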
•   summary(svmpolymodel)
•   predictionpoly <- predict(svmpolymodel, vtedata[,2:length(vtedata[1,])])
•   tabpoly <- table(pred = predictionpoly , true = vtedata[,1]); tabpoly
•
•   print("Radial Kernel")
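•   #type is set explicitly: with a numeric y, svm() would otherwise default to regression
•   svmradmodel <- svm(x=vtrdata[,2:length(vtrdata[1,])],y=vtrdata[,1], type='C-classification', kernel = "radial", gamma = 0.09, cost = 1,
    probability=FALSE)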
•   summary(svmradmodel)
•   predictionrad <- predict(svmradmodel, vtedata[,2:length(vtedata[1,])])
•   tabrad <- table(pred = predictionrad, true = vtedata[,1]); tabrad
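
The error figures printed for the linear kernel are read off a confusion table with predictions in rows and true labels in columns (1 = ham, 0 = spam). The sketch below reproduces those three quantities on ten made-up predictions, so the numbers illustrate the calculation only and are not results from the SMS data.

```r
# Hypothetical true labels and predictions (1 = ham, 0 = spam)
truth <- c(1, 1, 1, 1, 0, 0, 0, 1, 0, 1)
pred  <- c(1, 1, 0, 1, 0, 1, 0, 1, 0, 1)

# Fixing the factor levels keeps the table 2x2 even if a class is never predicted
tab <- table(pred = factor(pred, levels = c(0, 1)),
             true = factor(truth, levels = c(0, 1)))

accuracy   <- sum(diag(tab)) / sum(tab)
ham_error  <- tab["0", "1"] / sum(tab[, "1"])   # ham classified as spam
spam_error <- tab["1", "0"] / sum(tab[, "0"])   # spam classified as ham

print(tab)
print(c(accuracy = accuracy, ham_error = ham_error, spam_error = spam_error))
```

For a spam filter the ham error is usually the more costly one, since it corresponds to legitimate messages being blocked.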
R Codes (Cont.)
•   print("----------------------------------KNN-----------------------------------------")
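•   #IBk uses a formula interface and needs a factor class column for classification
•   data<-data.frame(sms=vtrdata[,2:length(vtrdata[1,])],type=factor(vtrdata[,1], levels=c(0,1)))
•   classifier <- IBk(type ~ ., data = data, control = Weka_control(K = 20, X = TRUE))
•   summary(classifier)
•   evaluate_Weka_classifier(classifier, newdata = data.frame(sms=vtedata[,2:length(vtedata[1,])],type=factor(vtedata[,1], levels=c(0,1))))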

•   print("---------------------------------Adaboost-------------------------------------")
•   adaptiveboost<-ada(x=vtrdata[,2:length(vtrdata[1,])],y=vtrdata[,1],test.x=vtedata[,2:length(vtedata[1,])],
    test.y=vtedata[,1], loss="logistic", type="gentle", iter=100)
•   summary(adaptiveboost)
•   varplot(adaptiveboost)
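
For readers unfamiliar with k-nearest neighbours, the idea behind the KNN classifier can be sketched in a few lines of base R: find the k training vectors closest to a new message vector and take a majority vote over their labels. The function name `knn_predict`, the two-word count vectors and k = 3 are assumptions for this illustration; this is not how Weka's IBk is implemented, and it omits IBk's cross-validated choice of K.

```r
# Majority vote among the k nearest training rows (Euclidean distance)
knn_predict <- function(train_x, train_y, query, k = 3) {
  query_rows <- matrix(query, nrow(train_x), length(query), byrow = TRUE)
  d <- sqrt(rowSums((train_x - query_rows)^2))
  nearest <- order(d)[seq_len(min(k, length(d)))]
  votes <- table(train_y[nearest])
  names(votes)[which.max(votes)]
}

# Hypothetical word-count vectors: column 1 = "free", column 2 = "later"
train_x <- rbind(c(2, 0), c(3, 1), c(0, 2), c(0, 3), c(1, 0))
train_y <- c("spam", "spam", "ham", "ham", "spam")

label_a <- knn_predict(train_x, train_y, c(0, 2), k = 3)
label_b <- knn_predict(train_x, train_y, c(2, 0), k = 3)
print(c(label_a, label_b))
```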
Conclusions
•   In these slides I gave a broad overview of ML and the different
    problems that can be solved in this framework.
•   I reviewed in detail one way of implementing an SMS spam filter
    using ML techniques in the R language.
•   ML provides a strong framework for solving problems in the Big Data
    domain.

More Related Content

What's hot

Spam filtering with Naive Bayes Algorithm
Spam filtering with Naive Bayes AlgorithmSpam filtering with Naive Bayes Algorithm
Spam filtering with Naive Bayes AlgorithmAkshay Pal
 
E mail image spam filtering techniques
E mail image spam filtering techniquesE mail image spam filtering techniques
E mail image spam filtering techniquesranjit banshpal
 
miniproject.ppt.pptx
miniproject.ppt.pptxminiproject.ppt.pptx
miniproject.ppt.pptxAnush90
 
Recommendation system
Recommendation system Recommendation system
Recommendation system Vikrant Arya
 
Spam email detection using machine learning PPT.pptx
Spam email detection using machine learning PPT.pptxSpam email detection using machine learning PPT.pptx
Spam email detection using machine learning PPT.pptxKunal Kalamkar
 
final-spam-e-mail-detection-180125111231.pptx
final-spam-e-mail-detection-180125111231.pptxfinal-spam-e-mail-detection-180125111231.pptx
final-spam-e-mail-detection-180125111231.pptxinfotowards
 
Sentiment analysis using naive bayes classifier
Sentiment analysis using naive bayes classifier Sentiment analysis using naive bayes classifier
Sentiment analysis using naive bayes classifier Dev Sahu
 
Presentation-Detecting Spammers on Social Networks
Presentation-Detecting Spammers on Social NetworksPresentation-Detecting Spammers on Social Networks
Presentation-Detecting Spammers on Social NetworksAshish Arora
 
Spam detection using machine learning based binary classifier_043660
Spam detection using machine learning based binary classifier_043660Spam detection using machine learning based binary classifier_043660
Spam detection using machine learning based binary classifier_043660syaidatulamirah
 
Machine Learning ppt.pptx
Machine Learning ppt.pptxMachine Learning ppt.pptx
Machine Learning ppt.pptx21MC048SARANRAJ
 
Machine Learning - Ensemble Methods
Machine Learning - Ensemble MethodsMachine Learning - Ensemble Methods
Machine Learning - Ensemble MethodsAndrew Ferlitsch
 
Machine Learning Algorithms | Machine Learning Tutorial | Data Science Algori...
Machine Learning Algorithms | Machine Learning Tutorial | Data Science Algori...Machine Learning Algorithms | Machine Learning Tutorial | Data Science Algori...
Machine Learning Algorithms | Machine Learning Tutorial | Data Science Algori...Simplilearn
 
Fp growth algorithm
Fp growth algorithmFp growth algorithm
Fp growth algorithmPradip Kumar
 

What's hot (20)

Spam filtering with Naive Bayes Algorithm
Spam filtering with Naive Bayes AlgorithmSpam filtering with Naive Bayes Algorithm
Spam filtering with Naive Bayes Algorithm
 
E mail image spam filtering techniques
E mail image spam filtering techniquesE mail image spam filtering techniques
E mail image spam filtering techniques
 
miniproject.ppt.pptx
miniproject.ppt.pptxminiproject.ppt.pptx
miniproject.ppt.pptx
 
Spam Email identification
Spam Email identificationSpam Email identification
Spam Email identification
 
Recommendation system
Recommendation system Recommendation system
Recommendation system
 
Spam email detection using machine learning PPT.pptx
Spam email detection using machine learning PPT.pptxSpam email detection using machine learning PPT.pptx
Spam email detection using machine learning PPT.pptx
 
final-spam-e-mail-detection-180125111231.pptx
final-spam-e-mail-detection-180125111231.pptxfinal-spam-e-mail-detection-180125111231.pptx
final-spam-e-mail-detection-180125111231.pptx
 
Sentiment analysis using naive bayes classifier
Sentiment analysis using naive bayes classifier Sentiment analysis using naive bayes classifier
Sentiment analysis using naive bayes classifier
 
Presentation-Detecting Spammers on Social Networks
Presentation-Detecting Spammers on Social NetworksPresentation-Detecting Spammers on Social Networks
Presentation-Detecting Spammers on Social Networks
 
Spam detection using machine learning based binary classifier_043660
Spam detection using machine learning based binary classifier_043660Spam detection using machine learning based binary classifier_043660
Spam detection using machine learning based binary classifier_043660
 
Machine Learning: Bias and Variance Trade-off
Machine Learning: Bias and Variance Trade-offMachine Learning: Bias and Variance Trade-off
Machine Learning: Bias and Variance Trade-off
 
Machine Learning ppt.pptx
Machine Learning ppt.pptxMachine Learning ppt.pptx
Machine Learning ppt.pptx
 
Machine Learning - Ensemble Methods
Machine Learning - Ensemble MethodsMachine Learning - Ensemble Methods
Machine Learning - Ensemble Methods
 
Handwritten Character Recognition
Handwritten Character RecognitionHandwritten Character Recognition
Handwritten Character Recognition
 
Text Classification
Text ClassificationText Classification
Text Classification
 
Machine learning
Machine learningMachine learning
Machine learning
 
Twitter Analytics
Twitter AnalyticsTwitter Analytics
Twitter Analytics
 
Machine Learning Algorithms | Machine Learning Tutorial | Data Science Algori...
Machine Learning Algorithms | Machine Learning Tutorial | Data Science Algori...Machine Learning Algorithms | Machine Learning Tutorial | Data Science Algori...
Machine Learning Algorithms | Machine Learning Tutorial | Data Science Algori...
 
Collaborative filtering
Collaborative filteringCollaborative filtering
Collaborative filtering
 
Fp growth algorithm
Fp growth algorithmFp growth algorithm
Fp growth algorithm
 

Viewers also liked

Algorithmic Web Spam detection - Matt Peters MozCon
Algorithmic Web Spam detection - Matt Peters MozConAlgorithmic Web Spam detection - Matt Peters MozCon
Algorithmic Web Spam detection - Matt Peters MozConmattthemathman
 
Machine Learning The Key Ingredient to Self-Driving Data Center
Machine Learning The Key Ingredient to Self-Driving Data CenterMachine Learning The Key Ingredient to Self-Driving Data Center
Machine Learning The Key Ingredient to Self-Driving Data CenterSergey A. Razin
 
Self-Driving Data Center (Apply Machine Learning to the Cloud)
Self-Driving Data Center (Apply Machine Learning to the Cloud)Self-Driving Data Center (Apply Machine Learning to the Cloud)
Self-Driving Data Center (Apply Machine Learning to the Cloud)Sergey A. Razin
 
Self-Tuning Data Centers
Self-Tuning Data CentersSelf-Tuning Data Centers
Self-Tuning Data CentersReza Rahimi
 
Open ALMS 2.0 제품 소개서
Open ALMS 2.0 제품 소개서Open ALMS 2.0 제품 소개서
Open ALMS 2.0 제품 소개서Jaebok Oh
 
The Self Healing Cloud: Protecting Applications and Infrastructure with Autom...
The Self Healing Cloud: Protecting Applications and Infrastructure with Autom...The Self Healing Cloud: Protecting Applications and Infrastructure with Autom...
The Self Healing Cloud: Protecting Applications and Infrastructure with Autom...Denim Group
 
Machine Learning Applications in NLP.ppt
Machine Learning Applications in NLP.pptMachine Learning Applications in NLP.ppt
Machine Learning Applications in NLP.pptbutest
 
Making Machine Learning Work in Practice - StampedeCon 2014
Making Machine Learning Work in Practice - StampedeCon 2014Making Machine Learning Work in Practice - StampedeCon 2014
Making Machine Learning Work in Practice - StampedeCon 2014StampedeCon
 
Spamming and Spam Filtering
Spamming and Spam FilteringSpamming and Spam Filtering
Spamming and Spam FilteringiNazneen
 
Dynatrace: New Approach to Digital Performance Management - Gartner Symposium...
Dynatrace: New Approach to Digital Performance Management - Gartner Symposium...Dynatrace: New Approach to Digital Performance Management - Gartner Symposium...
Dynatrace: New Approach to Digital Performance Management - Gartner Symposium...Michael Allen
 
Security Insights at Scale
Security Insights at ScaleSecurity Insights at Scale
Security Insights at ScaleRaffael Marty
 
Applying Machine Learning to Network Security Monitoring - BayThreat 2013
Applying Machine Learning to Network Security Monitoring - BayThreat 2013Applying Machine Learning to Network Security Monitoring - BayThreat 2013
Applying Machine Learning to Network Security Monitoring - BayThreat 2013Alex Pinto
 
AI & ML in Cyber Security - Welcome Back to 1999 - Security Hasn't Changed
AI & ML in Cyber Security - Welcome Back to 1999 - Security Hasn't ChangedAI & ML in Cyber Security - Welcome Back to 1999 - Security Hasn't Changed
AI & ML in Cyber Security - Welcome Back to 1999 - Security Hasn't ChangedRaffael Marty
 
Ankus 제품소개서
Ankus 제품소개서Ankus 제품소개서
Ankus 제품소개서onycom1
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine LearningLior Rokach
 

Viewers also liked (17)

Algorithmic Web Spam detection - Matt Peters MozCon
Algorithmic Web Spam detection - Matt Peters MozConAlgorithmic Web Spam detection - Matt Peters MozCon
Algorithmic Web Spam detection - Matt Peters MozCon
 
Machine Learning The Key Ingredient to Self-Driving Data Center
Machine Learning The Key Ingredient to Self-Driving Data CenterMachine Learning The Key Ingredient to Self-Driving Data Center
Machine Learning The Key Ingredient to Self-Driving Data Center
 
Self-Driving Data Center (Apply Machine Learning to the Cloud)
Self-Driving Data Center (Apply Machine Learning to the Cloud)Self-Driving Data Center (Apply Machine Learning to the Cloud)
Self-Driving Data Center (Apply Machine Learning to the Cloud)
 
Self-Tuning Data Centers
Self-Tuning Data CentersSelf-Tuning Data Centers
Self-Tuning Data Centers
 
Shadow wall utm
Shadow wall utmShadow wall utm
Shadow wall utm
 
Open ALMS 2.0 제품 소개서
Open ALMS 2.0 제품 소개서Open ALMS 2.0 제품 소개서
Open ALMS 2.0 제품 소개서
 
The Self Healing Cloud: Protecting Applications and Infrastructure with Autom...
The Self Healing Cloud: Protecting Applications and Infrastructure with Autom...The Self Healing Cloud: Protecting Applications and Infrastructure with Autom...
The Self Healing Cloud: Protecting Applications and Infrastructure with Autom...
 
Machine Learning Applications in NLP.ppt
Machine Learning Applications in NLP.pptMachine Learning Applications in NLP.ppt
Machine Learning Applications in NLP.ppt
 
Making Machine Learning Work in Practice - StampedeCon 2014
Making Machine Learning Work in Practice - StampedeCon 2014Making Machine Learning Work in Practice - StampedeCon 2014
Making Machine Learning Work in Practice - StampedeCon 2014
 
What is SPAM?
What is SPAM?What is SPAM?
What is SPAM?
 
Spamming and Spam Filtering
Spamming and Spam FilteringSpamming and Spam Filtering
Spamming and Spam Filtering
 
Dynatrace: New Approach to Digital Performance Management - Gartner Symposium...
Dynatrace: New Approach to Digital Performance Management - Gartner Symposium...Dynatrace: New Approach to Digital Performance Management - Gartner Symposium...
Dynatrace: New Approach to Digital Performance Management - Gartner Symposium...
 
Security Insights at Scale
Security Insights at ScaleSecurity Insights at Scale
Security Insights at Scale
 
Applying Machine Learning to Network Security Monitoring - BayThreat 2013
Applying Machine Learning to Network Security Monitoring - BayThreat 2013Applying Machine Learning to Network Security Monitoring - BayThreat 2013
Applying Machine Learning to Network Security Monitoring - BayThreat 2013
 
AI & ML in Cyber Security - Welcome Back to 1999 - Security Hasn't Changed
AI & ML in Cyber Security - Welcome Back to 1999 - Security Hasn't ChangedAI & ML in Cyber Security - Welcome Back to 1999 - Security Hasn't Changed
AI & ML in Cyber Security - Welcome Back to 1999 - Security Hasn't Changed
 
Ankus 제품소개서
Ankus 제품소개서Ankus 제품소개서
Ankus 제품소개서
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 

Similar to SMS Spam Filter Design Using R: A Machine Learning Approach

Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...PyData
 
R for Pirates. ESCCONF October 27, 2011
R for Pirates. ESCCONF October 27, 2011R for Pirates. ESCCONF October 27, 2011
R for Pirates. ESCCONF October 27, 2011Mandi Walls
 
Fingerprinting Chemical Structures
Fingerprinting Chemical StructuresFingerprinting Chemical Structures
Fingerprinting Chemical StructuresRajarshi Guha
 
Migrating from matlab to python
Migrating from matlab to pythonMigrating from matlab to python
Migrating from matlab to pythonActiveState
 
Authorship attribution pydata london
Authorship attribution   pydata londonAuthorship attribution   pydata london
Authorship attribution pydata londonkperi
 
Supervised Learning-classification Part-3.ppt
Supervised Learning-classification Part-3.pptSupervised Learning-classification Part-3.ppt
Supervised Learning-classification Part-3.pptVenneladonthireddy1
 
Supervised Learningclassification Part3.ppt
Supervised Learningclassification Part3.pptSupervised Learningclassification Part3.ppt
Supervised Learningclassification Part3.pptKush736264
 
Advanced Data Analytics with R Programming.ppt
Advanced Data Analytics with R Programming.pptAdvanced Data Analytics with R Programming.ppt
Advanced Data Analytics with R Programming.pptAnshika865276
 
Pa1 session 3_slides
Pa1 session 3_slidesPa1 session 3_slides
Pa1 session 3_slidesaiclub_slides
 

Similar to SMS Spam Filter Design Using R: A Machine Learning Approach (20)

Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
 
Decision Tree.pptx
Decision Tree.pptxDecision Tree.pptx
Decision Tree.pptx
 
Aggregate.pptx
Aggregate.pptxAggregate.pptx
Aggregate.pptx
 
R for Pirates. ESCCONF October 27, 2011
R for Pirates. ESCCONF October 27, 2011R for Pirates. ESCCONF October 27, 2011
R for Pirates. ESCCONF October 27, 2011
 
Fingerprinting Chemical Structures
Fingerprinting Chemical StructuresFingerprinting Chemical Structures
Fingerprinting Chemical Structures
 
04 standard class library c#
04 standard class library c#04 standard class library c#
04 standard class library c#
 
Migrating from matlab to python
Migrating from matlab to pythonMigrating from matlab to python
Migrating from matlab to python
 
Authorship attribution pydata london
Authorship attribution   pydata londonAuthorship attribution   pydata london
Authorship attribution pydata london
 
Python basics
Python basicsPython basics
Python basics
 
Python basics
Python basicsPython basics
Python basics
 
Python basics
Python basicsPython basics
Python basics
 
Python basics
Python basicsPython basics
Python basics
 
Python basics
Python basicsPython basics
Python basics
 
Python basics
Python basicsPython basics
Python basics
 
Python basics
Python basicsPython basics
Python basics
 
Supervised Learning-classification Part-3.ppt
Supervised Learning-classification Part-3.pptSupervised Learning-classification Part-3.ppt
Supervised Learning-classification Part-3.ppt
 
Supervised Learningclassification Part3.ppt
Supervised Learningclassification Part3.pptSupervised Learningclassification Part3.ppt
Supervised Learningclassification Part3.ppt
 
Advanced Data Analytics with R Programming.ppt
Advanced Data Analytics with R Programming.pptAdvanced Data Analytics with R Programming.ppt
Advanced Data Analytics with R Programming.ppt
 
Language R
Language RLanguage R
Language R
 
Pa1 session 3_slides
Pa1 session 3_slidesPa1 session 3_slides
Pa1 session 3_slides
 

More from Reza Rahimi

Boosting Personalization In SaaS Using Machine Learning.pdf
Boosting Personalization  In SaaS Using Machine Learning.pdfBoosting Personalization  In SaaS Using Machine Learning.pdf
Boosting Personalization In SaaS Using Machine Learning.pdfReza Rahimi
 
Self-Tuning and Managing Services
Self-Tuning and Managing ServicesSelf-Tuning and Managing Services
Self-Tuning and Managing ServicesReza Rahimi
 
Low Complexity Secure Code Design for Big Data in Cloud Storage Systems
Low Complexity Secure Code Design for Big Data in Cloud Storage SystemsLow Complexity Secure Code Design for Big Data in Cloud Storage Systems
Low Complexity Secure Code Design for Big Data in Cloud Storage SystemsReza Rahimi
 
Smart Connectivity
Smart ConnectivitySmart Connectivity
Smart ConnectivityReza Rahimi
 
The Next Big Thing in IT
The Next Big Thing in ITThe Next Big Thing in IT
The Next Big Thing in ITReza Rahimi
 
QoS-Aware Middleware for Optimal Service Allocation in Mobile Cloud Computing
QoS-Aware Middleware for Optimal Service Allocation in Mobile Cloud ComputingQoS-Aware Middleware for Optimal Service Allocation in Mobile Cloud Computing
QoS-Aware Middleware for Optimal Service Allocation in Mobile Cloud ComputingReza Rahimi
 
On Optimal and Fair Service Allocation in Mobile Cloud Computing
On Optimal and Fair Service Allocation in Mobile Cloud ComputingOn Optimal and Fair Service Allocation in Mobile Cloud Computing
On Optimal and Fair Service Allocation in Mobile Cloud ComputingReza Rahimi
 
Mobile Applications on an Elastic and Scalable 2-Tier Cloud Architecture
Mobile Applications on an Elastic and Scalable 2-Tier Cloud ArchitectureMobile Applications on an Elastic and Scalable 2-Tier Cloud Architecture
Mobile Applications on an Elastic and Scalable 2-Tier Cloud ArchitectureReza Rahimi
 
Exploiting an Elastic 2-Tiered Cloud Architecture for Rich Mobile Applications
Exploiting an Elastic 2-Tiered Cloud Architecture for Rich Mobile ApplicationsExploiting an Elastic 2-Tiered Cloud Architecture for Rich Mobile Applications
Exploiting an Elastic 2-Tiered Cloud Architecture for Rich Mobile ApplicationsReza Rahimi
 
Fingerprint High Level Classification
Fingerprint High Level ClassificationFingerprint High Level Classification
Fingerprint High Level ClassificationReza Rahimi
 
Linear Programming and its Usage in Approximation Algorithms for NP Hard Opti...
Linear Programming and its Usage in Approximation Algorithms for NP Hard Opti...Linear Programming and its Usage in Approximation Algorithms for NP Hard Opti...
Linear Programming and its Usage in Approximation Algorithms for NP Hard Opti...Reza Rahimi
 
Optimizing Multicast Throughput in IP Network
Optimizing Multicast Throughput in IP NetworkOptimizing Multicast Throughput in IP Network
Optimizing Multicast Throughput in IP NetworkReza Rahimi
 
The Case for a Signal Oriented Data Stream Management System
The Case for a Signal Oriented Data Stream Management SystemThe Case for a Signal Oriented Data Stream Management System
The Case for a Signal Oriented Data Stream Management SystemReza Rahimi
 
Mobile Cloud Computing: Big Picture
Mobile Cloud Computing: Big PictureMobile Cloud Computing: Big Picture
Mobile Cloud Computing: Big PictureReza Rahimi
 
Network Information Processing
Network Information ProcessingNetwork Information Processing
Network Information ProcessingReza Rahimi
 
Pervasive Image Computation: A Mobile Phone Application for getting Informat...
Pervasive Image Computation: A Mobile  Phone Application for getting Informat...Pervasive Image Computation: A Mobile  Phone Application for getting Informat...
Pervasive Image Computation: A Mobile Phone Application for getting Informat...Reza Rahimi
 
Gaussian Integration
Gaussian IntegrationGaussian Integration
Gaussian IntegrationReza Rahimi
 
Interactive Proof Systems and An Introduction to PCP
Interactive Proof Systems and An Introduction to PCPInteractive Proof Systems and An Introduction to PCP
Interactive Proof Systems and An Introduction to PCPReza Rahimi
 
Quantum Computation and Algorithms
Quantum Computation and Algorithms Quantum Computation and Algorithms
Quantum Computation and Algorithms Reza Rahimi
 

SMS Spam Filter Design Using R: A Machine Learning Approach

  • 1. SMS Spam Filter Design Using R: A Machine Learning Approach
      Reza Rahimi, Ph.D. Candidate, School of Information and Computer Science, University of California, Irvine.
  • 2. Introduction
      • In basic terms, Machine Learning (ML) is about the construction of systems that can learn from data.
      • It is used as a tool for knowledge discovery.
      • Several important classes of problems can be solved using machine learning techniques, such as:
          – Classification (Prediction):
              • Given a collection of records as a training set.
              • Each record contains a set of attributes, one of which is called the class.
              • The problem is to find a model for the class attribute as a function of the other attributes.
                  – Example: Spam or Ham, Handwriting Recognition,…
  • 3. (Problem classes, continued)
          – Clustering (Description):
              • Given a set of data points, with some attributes, and a similarity measure (metric) among them.
              • The goal is to find clusters such that data points in one cluster are more similar to one another than to points in other clusters.
                  – Example: Document Clustering, people categorization,…
          – Association (Description):
              • Given a set of records, each of which contains some items from a given collection.
              • The goal is to produce dependency rules that predict the occurrence of an item based on occurrences of other items.
                  – Example: user habit pattern recognition,…
          – Regression (Prediction):
              • Predict the value of a given continuous variable based on the values of other variables.
              • The model of dependency can be linear or nonlinear.
                  – Example: Stock prediction
  • 4. Problem Solving Using Machine Learning Framework
      • ML is a very mature and developed area.
      • For all of the problem classes mentioned, it offers rich resources of tools, techniques and algorithms.
      • These tools are provided in different languages and frameworks such as R, Matlab, Java, C++, Mahout,…
      • The following procedure can be considered the general methodology for problem solving in this framework:
  • 5. (Methodology diagram)
      • Get a sense of data: feature extraction, dimension reduction, noise cancellation,…
      • Problem modeling: Classification, Clustering, Association, Regression,…
      • Run standard ML algorithms: check the errors according to the standard ML metrics.
      • Select the methods that satisfy your performance criteria and metrics.
      • In the next section I will describe the design of an SMS spam filter in the R language based on this methodology.
  • 6. SMS Spam Filter using R
      #this file is SMS Spam filter codes with different classifiers in R language
      #Written by: Reza Rahimi
      #Initialization: Raw Data (http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection),
      #loading required packages, libraries and function declaration

      #required package for text mining
      if(!require("tm"))
        install.packages("tm")

      #required package for SVM
      if(!require("e1071"))
        install.packages("e1071")

      #required package for KNN
      if(!require("RWeka"))
        install.packages("RWeka", dependencies = TRUE)

      #required package for Adaboost
      if(!require("ada"))
        install.packages("ada")

      library("tm")
      library("e1071")
      library(RWeka)
      library("ada")
  • 7. R Codes (Cont.)
      #Initialize random generator
      set.seed(1245)

      #This function makes vector (Vector Space Model) from text message using highly repeated words
      vsm<-function(message,highlyrepeatedwords){

        #split the message on runs of whitespace
        tokenizedmessage<-strsplit(message, "\\s+")[[1]]

        #making vector: v[i] counts how often the i-th frequent word occurs in the message
        v<-rep(0, length(highlyrepeatedwords))
        for(i in 1:length(highlyrepeatedwords)){
          for(j in 1:length(tokenizedmessage)){
            if(highlyrepeatedwords[i]==tokenizedmessage[j]){
              v[i]<-v[i]+1
            }
          }
        }
        return (v)
      }

      #loading data. Original data is from http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
      print("Uploading SMS Spams and Hams!\n")
      smstable<-read.csv("C:/tmp/smsspamcollection.txt", header = TRUE, sep = "\t", colClasses=c("type"="character","sms"="character"))
  • 8. R Codes (Cont.)
      smstabletmp<-smstable

      print("Extracting Ham and Spam Basic Statistics!")

      smstabletmp$type[smstabletmp$type=="ham"] <- 1
      smstabletmp$type[smstabletmp$type=="spam"] <- 0

      #Convert character data into numeric
      tmp<-as.numeric(smstabletmp$type)

      #Basic statistics like mean and variance of spam and hams
      #(the mean of the 0/1 indicator is the fraction of ham messages)
      hamavg<-mean(tmp)
      print("Average Ham is :");hamavg

      hamvariance<-var(tmp)
      print("Var of Ham is :");hamvariance

      print("Extract average token of Hams and Spams!")

      nohamtokens<-0
      noham<-0
      nospamtokens<-0
      nospam<-0
  • 9. R Codes (Cont.)
      #count messages and whitespace-separated tokens per class
      for(i in 1:length(smstable$type)){
        if(smstable[i,1]=="ham"){
          nohamtokens<-length(strsplit(smstable[i,2], "\\s+")[[1]])+nohamtokens
          noham<-noham+1
        }else{
          nospamtokens<-length(strsplit(smstable[i,2], "\\s+")[[1]])+nospamtokens
          nospam<-nospam+1
        }
      }

      totaltokens<-nospamtokens+nohamtokens;
      print("total number of tokens is:")
      print(totaltokens)

      avgtokenperham<-nohamtokens/noham
      print("Average number of tokens per ham message")
      print(avgtokenperham)

      avgtokenperspam<-nospamtokens/nospam
      print("Average number of tokens per spam message")
      print(avgtokenperspam)

      print(" Make two different sets, training data and test data!")
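The per-class token averages computed by the loop on slide 9 can also be obtained without an explicit loop. The following is a minimal sketch in base R, not the author's code; the three-row `toy` data frame is a hypothetical stand-in for `smstable`, and `lengths()` assumes R 3.2 or later.

```r
# Hypothetical stand-in for smstable: one row per message
toy <- data.frame(
  type = c("ham", "ham", "spam"),
  sms  = c("see you at lunch", "ok", "win a free prize now"),
  stringsAsFactors = FALSE
)

# Number of whitespace-separated tokens in each message
ntokens <- lengths(strsplit(toy$sms, "\\s+"))

# Average token count per class, indexed by the class label
avgtokens <- tapply(ntokens, toy$type, mean)

print(ntokens)
print(avgtokens)
```

Here `avgtokens[["ham"]]` plays the role of `avgtokenperham` and `avgtokens[["spam"]]` the role of `avgtokenperspam`.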
  • 10. R Codes (Cont.)
      #select the percent of data that you want to use as training set
      trdatapercent<-0.3

      #training data set
      trdata=NULL

      #test data set
      tedata=NULL

      #each message goes to the training set with probability trdatapercent
      for(i in 1:length(smstable$type)){
        if(runif(1)<trdatapercent){
          trdata=rbind(trdata,c(smstable[i,1],tolower(smstable[i,2])))
        }
        else{
          tedata=rbind(tedata,c(smstable[i,1],tolower(smstable[i,2])))
        }
      }

      print("Training data size is!")
      dim(trdata)

      print("Test data size is!")
      dim(tedata)
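Growing `trdata` and `tedata` with `rbind` inside a loop copies the data on every iteration. As a hedged alternative, the same random assignment can be sketched with one logical mask. The `toy` data frame below is hypothetical, and unlike the loop on slide 10 this version yields data frames rather than character matrices.

```r
set.seed(1245)  # same seed value as the slides; any seed would do

# Hypothetical stand-in for smstable: two character columns, type and sms
toy <- data.frame(
  type = rep(c("ham", "spam"), times = c(8, 2)),
  sms  = paste("Message", 1:10),
  stringsAsFactors = FALSE
)

# Lower-case the text once, as the loop does per message
toy$sms <- tolower(toy$sms)

trdatapercent <- 0.3

# One uniform draw per row; TRUE sends the row to the training set
intrain <- runif(nrow(toy)) < trdatapercent

trdata <- toy[intrain, ]
tedata <- toy[!intrain, ]

print(dim(trdata))
print(dim(tedata))
```

Column indexing such as `trdata[,2]` and `trdata[i,1]` works the same way on these data frames as on the matrices built by the loop.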
  • 11. R Codes (Cont.)
      # Text feature extraction using tm package

      trsmses<-Corpus(VectorSource(trdata[,2]))
      trsmses<-tm_map(trsmses, stripWhitespace)
      #tm 0.6 and later expect base functions such as tolower to be wrapped in content_transformer()
      trsmses<-tm_map(trsmses, content_transformer(tolower))
      trsmses<-tm_map(trsmses, removeWords, stopwords("english"))

      dtm <- DocumentTermMatrix(trsmses)

      #keep the terms that occur at least 80 times in the training messages
      highlyrepeatedwords<-findFreqTerms(dtm, 80)

      #These highly used words are used as an index to make VSM
      #(vector space model) for trained data and test data

      #vectorized training data set
      vtrdata=NULL

      #vectorized test data set
      vtedata=NULL
  • 12. R Codes (Cont.)
      #each row is the class label (1 = ham, 0 = spam) followed by the word-count vector
      for(i in 1:length(trdata[,2])){
        if(trdata[i,1]=="ham"){
          vtrdata=rbind(vtrdata,c(1,vsm(trdata[i,2],highlyrepeatedwords)))
        }
        else{
          vtrdata=rbind(vtrdata,c(0,vsm(trdata[i,2],highlyrepeatedwords)))
        }
      }

      for(i in 1:length(tedata[,2])){
        if(tedata[i,1]=="ham"){
          vtedata=rbind(vtedata,c(1,vsm(tedata[i,2],highlyrepeatedwords)))
        }
        else{
          vtedata=rbind(vtedata,c(0,vsm(tedata[i,2],highlyrepeatedwords)))
        }
      }
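To make the vector space model used on slides 7 and 12 concrete, the sketch below counts vocabulary words in a single message. It is an illustrative vectorized equivalent of the nested loops in `vsm()`, not the author's function; `vsm_fast`, the vocabulary and the message are all hypothetical names and values.

```r
# Hypothetical vocabulary (stands in for highlyrepeatedwords) and message
vocab <- c("free", "call", "now")
message <- "call now to claim your free prize call"

# For each vocabulary word, count its exact matches among the message tokens
vsm_fast <- function(message, vocab) {
  tokens <- strsplit(message, "\\s+")[[1]]
  vapply(vocab, function(w) sum(tokens == w), integer(1), USE.NAMES = FALSE)
}

v <- vsm_fast(message, vocab)
print(v)
```

Matching is exact, so a token with attached punctuation such as "free!" would not be counted as "free"; this holds for the loop version as well.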
  • 13. R Codes (Cont.)
      # Run different classification algorithms
      # different SVMs with different Kernels
      print("----------------------------------SVM-----------------------------------------")
      print("Linear Kernel")
      svmlinmodel <- svm(x=vtrdata[,2:length(vtrdata[1,])],y=vtrdata[,1],type='C', kernel='linear');
      summary(svmlinmodel)
      predictionlin <- predict(svmlinmodel, vtedata[,2:length(vtedata[1,])])
      #rows are predicted labels, columns are true labels (0 = spam, 1 = ham)
      tablinear <- table(pred = predictionlin , true = vtedata[,1]); tablinear
      #fraction of correctly classified test messages
      precisionlin<-sum(diag(tablinear))/sum(tablinear);
      print("General Error using Linear SVM is (in percent):");(1-precisionlin)*100
      print("Ham Error using Linear SVM is (in percent):");(tablinear[1,2]/sum(tablinear[,2]))*100
      print("Spam Error using Linear SVM is (in percent):");(tablinear[2,1]/sum(tablinear[,1]))*100

      #type='C' requests classification; with numeric labels and no type, svm() defaults to regression
      print("Polynomial Kernel")
      svmpolymodel <- svm(x=vtrdata[,2:length(vtrdata[1,])],y=vtrdata[,1], type='C', kernel='polynomial', probability=FALSE)
      summary(svmpolymodel)
      predictionpoly <- predict(svmpolymodel, vtedata[,2:length(vtedata[1,])])
      tabpoly <- table(pred = predictionpoly , true = vtedata[,1]); tabpoly

      print("Radial Kernel")
      svmradmodel <- svm(x=vtrdata[,2:length(vtrdata[1,])],y=vtrdata[,1], type='C', kernel = "radial", gamma = 0.09, cost = 1, probability=FALSE)
      summary(svmradmodel)
      predictionrad <- predict(svmradmodel, vtedata[,2:length(vtedata[1,])])
      tabrad <- table(pred = predictionrad, true = vtedata[,1]); tabrad
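The three error figures printed for the linear SVM on slide 13 all come from one confusion table. The sketch below works through them on a small hand-made example in base R; the label vectors are hypothetical and only illustrate the arithmetic, using the slides' coding of 1 for ham and 0 for spam.

```r
# Hypothetical true and predicted labels: 1 = ham, 0 = spam
true <- c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0)
pred <- c(1, 1, 1, 1, 1, 0, 0, 0, 0, 1)

# Explicit factor levels keep the table 2x2 even if one class is never predicted
tab <- table(pred = factor(pred, levels = c(0, 1)),
             true = factor(true, levels = c(0, 1)))

# Fraction of messages classified correctly (named precisionlin in the slides)
accuracy <- as.numeric(sum(diag(tab)) / sum(tab))

# Ham error: ham messages predicted as spam, over all true ham
ham_error <- as.numeric(tab["0", "1"] / sum(tab[, "1"]))

# Spam error: spam messages predicted as ham, over all true spam
spam_error <- as.numeric(tab["1", "0"] / sum(tab[, "0"]))

print(tab)
print(c(accuracy = accuracy, ham_error = ham_error, spam_error = spam_error))
```

Indexing the table by label name rather than by position, as in `tab["0", "1"]`, makes it harder to mix up which cell is which error.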
  • 14. R Codes (Cont.)
      print("----------------------------------KNN-----------------------------------------")
      data<-data.frame(sms=vtrdata[,2:length(vtrdata[1,])],type=vtrdata[,1])
      classifier <- IBk(data, control = Weka_control(K = 20, X = TRUE))
      summary(classifier)
      evaluate_Weka_classifier(classifier, newdata = data.frame(sms=vtedata[,2:length(vtedata[1,])],type=vtedata[,1]))

      print("---------------------------------Adaboost-------------------------------------")
      adaptiveboost<-ada(x=vtrdata[,2:length(vtrdata[1,])],y=vtrdata[,1],test.x=vtedata[,2:length(vtedata[1,])], test.y=vtedata[,1], loss="logistic", type="gentle", iter=100)
      summary(adaptiveboost)
      varplot(adaptiveboost)
  • 15. Conclusions
      • In these slides I gave a broad overview of ML and the different problems that can be solved in this framework.
      • I reviewed in detail one way of implementing an SMS spam filter using ML techniques in the R language.
      • ML provides a strong framework for solving problems in the Big Data domain.