SUBSCRIPTION FRAUD ANALYTICS USING CLASSIFICATION

SOMDEEP KUMAR SEN
Trimax Analytics and Optimization Services
2/21/2014
Subscription Fraud Analytics Using Naïve Bayes Classifier

Contents
Introduction
Overview of the study
Objective of the Study
Telecommunication fraud: an Overview
    Definition
    Types
        Subscription fraud
        Recharge Voucher Fraud
        Pre-paid Balance Fraud
        Unauthorized Service Fraud
Models Used
    Naïve Bayes Classification: an overview
    Decision Tree (A Supervised Learning Method)
Methodology
Analysis & Findings
    Using R
    Using RapidMiner
Conclusion

Introduction
Advances in technology such as computers, the internet, and cellular phones have made life
easier and more convenient for most people. However, some individuals and groups have turned
these telecommunication devices into tools for defrauding unsuspecting victims. It is not
uncommon for a scam to originate in a city, county, state, or even country different from the
one in which the victim resides. While telecom fraud takes many forms, the present study
focuses on the use of analytics to detect subscription fraud, specifically the application of
the Naïve Bayes classification algorithm to detect and predict probable fraudsters.

Overview of the study
A fictitious telecom company called Bad Idea introduced an unusual rate plan, the Praxis Plan,
under which callers may make only one call in each of four time slots: Morning (9AM-Noon),
Afternoon (Noon-4PM), Evening (4PM-9PM) and Night (9PM-Midnight), i.e. four calls per day.
Despite the popularity of the plan, Bad Idea was the target of subscription fraud by a gang of
three fraudsters: Sally, Virginia and Vince. The company eventually terminated their services.
Bad Idea has their call logs, spanning about one and a half months.
The company's analytics team has been given two data sets: the Black-List Subscriber Call-Logs
and the Audit Log. The Black-List Subscriber Call-Logs data set contains the calling patterns of
the three fraudsters, i.e. Sally, Virginia and Vince. Every 5 days the company runs an audit to
check whether these fraudsters have rejoined its network. It reviews the list of subscribers who
have called the same people as the three fraudsters in the same time frames. This list is
provided in the Audit Log.
Test Data: http://bit.ly/1du9cRs
Training Data: http://bit.ly/1du9AQ1

Objective of the Study


• To provide the names of the probable callers, with the confidence expressed as a probability
• To provide the name of the fraudster, if any
• To provide the code used to determine the subscriber

Telecommunication fraud: an Overview
Definition
Telecommunication fraud is the theft of telecommunication service (telephones, cell phones,
computers etc.) or the use of telecommunication service to commit other forms of fraud.
Victims include consumers, businesses and communication service providers.
Types
Subscription fraud
Subscription fraud occurs when someone signs up for service with fraudulently obtained
customer information or false identification. Lawbreakers obtain a person's personal
information and use it to set up a cell phone account in that person's name.
Recharge Voucher Fraud
This mainly includes unusual top-up recharges and a high number of recharges in a given time
period.
Pre-paid Balance Fraud
Employees with a high number of manual balance changes, as well as subscribers with high
balances, might be an indication of Pre-paid Balance Fraud.
Unauthorized Service Fraud
A mismatch found when reconciling HLR (Home Location Register) records against post-paid
subscriber profiles, or HLR services against post-paid subscriber service profiles, or a sudden
change in subscriber usage, could be an indication of Unauthorized Service Fraud.

Models Used
Naïve Bayes Classification: an overview
A Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem (from
Bayesian statistics) with strong (naive) independence assumptions. A more descriptive term for
the underlying probability model would be "independent feature model".
In simple terms, a naive Bayes classifier assumes that the presence (or absence) of a particular
feature of a class is unrelated to the presence (or absence) of any other feature. For example, a
fruit may be considered to be an apple if it is red, round, and about 4" in diameter. Even if these
features depend on each other or upon the existence of the other features, a naive Bayes
classifier considers all of these properties to independently contribute to the probability that
this fruit is an apple.
Depending on the precise nature of the probability model, naive Bayes classifiers can be trained
very efficiently in a supervised learning setting. In many practical applications, parameter
estimation for naive Bayes models uses the method of maximum likelihood; in other words,
one can work with the naive Bayes model without believing in Bayesian probability or using any
Bayesian methods.
An advantage of the naive Bayes classifier is that it requires a small amount of training data to
estimate the parameters (means and variances of the variables) necessary for classification.
Because independent variables are assumed, only the variances of the variables for each class
need to be determined and not the entire covariance matrix.
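The independence assumption described above can be sketched in a few lines. The following is a minimal illustration written in Python; it is not the R code used in this study, and the callers (caller_a, caller_b), the callee values (x, y, z) and the function names are all hypothetical. It estimates the class prior and the per-feature likelihoods by relative frequency, multiplies them, and normalizes the products into posterior probabilities.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count classes and, per (feature position, class), the observed values."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (feature index, class) -> value counts
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
    return class_counts, value_counts

def posterior(model, row):
    """Return P(class | row), treating features as independent given the class."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        score = n / total  # prior P(class)
        for i, value in enumerate(row):
            # maximum-likelihood estimate of P(feature i = value | class)
            score *= value_counts[(i, label)][value] / n
        scores[label] = score
    norm = sum(scores.values())
    return {label: (s / norm if norm else 0.0) for label, s in scores.items()}

# Hypothetical toy data: each row is (person called in slot 1, person called in slot 2).
rows = [("x", "y"), ("x", "z"), ("z", "y"), ("z", "z")]
labels = ["caller_a", "caller_a", "caller_a", "caller_b"]
model = train_naive_bayes(rows, labels)

# For ("z", "z") the posteriors come out at about 0.25 for caller_a and 0.75 for caller_b.
print(posterior(model, ("z", "z")))
```

Note that with plain relative frequencies a value never seen for a class gives that class a posterior of exactly zero, which is the situation smoothing is meant to address.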
Decision Tree (A Supervised Learning Method):
A decision tree is a flowchart-like structure in which each internal node represents a test on
an attribute, each branch represents an outcome of the test, and each leaf node represents a
class label (the decision taken after evaluating the attributes along the path). The paths from
root to leaf represent classification rules. In decision analysis, a decision tree and the
closely related influence diagram are used as visual and analytical decision support tools, in
which the expected values (or expected utilities) of competing alternatives are calculated.
Methodology
To make the final prediction, Naïve Bayes classification was carried out in two different
packages, R and RapidMiner, so that the results produced by the two packages could be
compared.

Analysis & Findings
Using R
Our training data (BlackListSubscriberCallLogs), an Excel sheet, has 138 instances, each giving
the names of the people called by one of the fraudsters (Sally, Vince or Virginia) in each of
the time frames. We import this dataset into R as “blacklisted”.
We also have a file (Audit Log) of 15 instances, for which we predict the fraudster by the end
of this report. This is our unseen data. We import this dataset into R as “audit”.
The Process:
• Import the datasets and understand them
• Install the packages and load the libraries “caret” and “klaR” for Naïve Bayes and “party” for
  the Decision Tree
• Train our model (Naïve Bayes) using 10-fold cross validation
• Tune the parameters of the model to obtain finer results
• Check the Accuracy and Kappa values
• Compare the result of the Naïve Bayes model with a 10-fold cross-validated Decision Tree model
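For readers unfamiliar with the Kappa value mentioned in the steps above, the sketch below shows how accuracy and Cohen's kappa are derived from a confusion matrix. It is an illustrative Python sketch, not the R code used in the study; the function name and the 2x2 matrix are hypothetical and are not results from this report.

```python
def accuracy_and_kappa(matrix):
    """matrix[i][j] = number of cases whose actual class is i and predicted class is j."""
    total = sum(sum(row) for row in matrix)
    # Observed agreement: share of cases on the diagonal (this is the accuracy).
    observed = sum(matrix[i][i] for i in range(len(matrix))) / total
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    # Agreement expected by chance, from the row and column marginals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa

# Hypothetical confusion matrix, for illustration only.
example = [[8, 2],
           [1, 9]]
acc, kappa = accuracy_and_kappa(example)
print(f"accuracy = {acc:.2f}, kappa = {kappa:.2f}")  # accuracy = 0.85, kappa = 0.70
```

Kappa discounts the agreement that would be expected by chance, so it is a stricter figure than accuracy when one class dominates.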


The method used above is 10-fold cross validation, which divides the entire dataset into 10
parts of roughly equal size. In each round, 9 parts are used for training and 1 part for
testing; this is repeated 10 times so that every part is used for testing exactly once. The
reported performance is the average over all the rounds.
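The fold mechanics can be sketched as follows. This is a minimal Python illustration, not the caret implementation; the function name and the seed are assumptions, and only the row count (138) comes from the report.

```python
import random

def k_fold_indices(n_rows, k=10, seed=0):
    """Yield (train, test) index lists; each row lands in the test part exactly once."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)         # shuffle once, before splitting
    folds = [indices[i::k] for i in range(k)]    # k parts of near-equal size
    for held_out in range(k):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test

# 138 rows, as in the training data described above.
for train, test in k_fold_indices(138, k=10, seed=1):
    print(len(train), len(test))
```

A model would be fitted on each `train` list and scored on the matching `test` list, and the ten scores averaged.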
Now, we shuffle our dataset (blacklisted) for random sampling and take 15 observations
(about 10%) on which to apply our model and check its accuracy. Fixing the random seed with
the set.seed() function makes this sample reproducible, so the same set of observations can
be drawn again.
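The role of the seed can be illustrated with a short Python sketch (Python's `random.Random(seed)` plays the part that set.seed() plays in R). The function name and the seed value are hypothetical, and the sketch assumes the 15 sampled rows are kept apart from the remaining 123, which the report does not state explicitly.

```python
import random

def hold_out_sample(n_rows, n_test, seed):
    """Draw n_test row indices at random; the same seed always gives the same draw."""
    rng = random.Random(seed)
    test = sorted(rng.sample(range(n_rows), n_test))
    test_set = set(test)
    train = [i for i in range(n_rows) if i not in test_set]
    return train, test

train, test = hold_out_sample(138, 15, seed=42)
_, test_again = hold_out_sample(138, 15, seed=42)
print(test == test_again)  # True: identical seed, identical sample
```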


Upon analyzing the confusion matrix, we find that:
• The accuracy of our model is (6+3+2)/15 = 73.3%
• Precision of predicting Sally = 6/9 = 66.67%
• Precision of predicting Vince = 3/3 = 100%
• Precision of predicting Virginia = 2/3 = 66.67%
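The arithmetic above can be checked with a short Python sketch. It uses only the counts quoted in the list (correct predictions and total predictions per name); the variable names are illustrative.

```python
# Counts as quoted in the text: correct predictions and total predictions per name.
correct = {"Sally": 6, "Vince": 3, "Virginia": 2}
predicted = {"Sally": 9, "Vince": 3, "Virginia": 3}

total = sum(predicted.values())               # 15 held-out observations
accuracy = sum(correct.values()) / total      # share of all predictions that were right
precision = {name: correct[name] / predicted[name] for name in correct}

print(f"accuracy = {accuracy:.1%}")           # accuracy = 73.3%
for name, value in precision.items():
    print(f"precision({name}) = {value:.2%}")
# precision(Sally) = 66.67%
# precision(Vince) = 100.00%
# precision(Virginia) = 66.67%
```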
Now we tune our model using the Laplace correction (fL) and usekernel parameters. Laplace
correction is a smoothing technique that assigns non-zero probability to events that do not
occur in the sample. usekernel enables kernel density estimation, a non-parametric way to
estimate the probability density function of a random variable.
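As a rough illustration of Laplace correction, the Python sketch below uses the standard additive-smoothing formula. The function and argument names are hypothetical, the counts are made up for the example, and the exact computation inside klaR is not reproduced here.

```python
def smoothed_probability(count, class_total, n_values, fl=1.0):
    """Additive (Laplace) smoothing: add fl to every count before normalizing.

    count       -- times the value was seen together with the class
    class_total -- number of training rows of that class
    n_values    -- number of distinct values the feature can take
    fl          -- smoothing factor (0 means no correction)
    """
    return (count + fl) / (class_total + fl * n_values)

# Hypothetical counts: a callee never seen for this caller in 10 training rows,
# where the feature has 5 possible values.
print(smoothed_probability(0, 10, 5, fl=0.0))  # 0.0 without smoothing
print(smoothed_probability(0, 10, 5, fl=1.0))  # 1/15, small but non-zero
```

With fl greater than zero, an unseen value lowers a class's score instead of eliminating the class outright, and the smoothed probabilities over all values of a feature still sum to one.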


We observe that the outcomes of the two models, fit and fit1 (with Laplace and usekernel), are
identical in this case.
Thus, with about 73% accuracy on the held-out sample, we apply our model to the unseen data
(audit).

To obtain the posterior probabilities for each set of observation in the unseen data, we type the
following command:


Now we build the Decision Tree model and plot it:

> plot (ctreeFit$finalModel)


Reading the graph: if the evening call is made to Frank, the night call to Clark, and the
morning call to Kelly, Larry or Robert, the probability that the caller is the fraudster Vince
is close to 80%.
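A root-to-leaf path of this kind is simply a conjunction of tests. The Python sketch below encodes only the single path described above; the other branches of the tree are not given in the text and are not modeled, the names `RULE`, `LEAF` and `follow_path` are illustrative, and 0.8 stands in for "close to 80%".

```python
# The one path described in the text, as (time slot, allowed callees) tests.
RULE = [
    ("evening", {"Frank"}),
    ("night", {"Clark"}),
    ("morning", {"Kelly", "Larry", "Robert"}),
]
LEAF = ("Vince", 0.8)  # approximate leaf probability, "close to 80%" per the text

def follow_path(calls, rule=RULE, leaf=LEAF):
    """Return the leaf if every test on the path passes, otherwise None."""
    for slot, allowed in rule:
        if calls.get(slot) not in allowed:
            return None  # this record leaves the path at this node
    return leaf

print(follow_path({"morning": "Larry", "evening": "Frank", "night": "Clark"}))
# ('Vince', 0.8)
```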
A drawback of the Decision Tree plot is that the name Sally is shown below each of the nodes,
irrespective of the correct name.
We also find that the accuracy of Naïve Bayes (0.661) is better than that of the Decision Tree
(0.578). We therefore now compare the Naïve Bayes output obtained from RapidMiner against
the output of R.

Using RapidMiner
Initially both data sets are loaded into RapidMiner. The Black-List Subscriber Call-Logs data
set is named telecom234 and the Audit Log is named telecom56. Of the many classification
algorithms available in RapidMiner, we choose the Naïve Bayes classifier. The telecom234 data
set is dragged into the main window. In data mining we use the concept of data splitting: the
data set is divided into two parts, a training set and a validation set. The training set is
used to create the model, whereas the validation set is used to estimate the accuracy of the
created model. To create a model and estimate its accuracy with data splitting, we use the
validation operators found in the Evaluation -> Validation folder of the operator window. The
most commonly used operators are Split Validation and X-Validation. We first use Split
Validation: drag and drop the Split Validation operator into the process window. Split
Validation is a group operator, i.e. it groups multiple operators inside it. Group operators
carry a special sign: two overlapping blue squares on their icon, as shown in the figure below.


The Validation operator has a split ratio parameter (visible in the parameter window on the
right), which specifies how the data set will be split. The value 0.7 in the figure above
assigns 70% of the data to the training set and the remaining 30% to the testing set. Split
Validation is a nested operator: double-clicking it opens the Validation sub-process window,
which has two parts, Training and Testing.
The Naïve Bayes operator is placed in the Training window. Validation allows us to estimate the
accuracy of our model; for this purpose, RapidMiner provides many performance operators in
the Performance Measurement folder. The Apply Model and Performance (Classification)
operators are placed in the Testing window, as shown in the figure below.


Now the averagable port of the Validation operator is connected to the result port of the
Process window. In the Results perspective one gets a performance vector with details of the
created model's performance. For example, the created model has an accuracy of almost 71%,
as shown in the figure below.

Once the model is created, it can be used for classification/prediction. The telecom56 data
set, which is the unlabeled data set, is dragged into the main window, along with an Apply
Model operator. The Apply Model operator takes the model from the Validation operator and
applies it to the unlabeled telecom56 data, as shown in the figure below.

Running the whole process now gives the prediction shown in the figure below: the names of
the probable callers along with the confidence expressed as a probability.

Conclusion
Comparing the accuracy and precision from the confusion matrices of the R and RapidMiner
results, we see:
For R:
• Accuracy of the model = (6+3+2)/15 = 73.3%
• Precision of predicting Sally = 6/9 = 66.67%
• Precision of predicting Vince = 3/3 = 100%
• Precision of predicting Virginia = 2/3 = 66.67%

For RapidMiner:
• Accuracy of the model = 70.73%
• Precision of predicting Sally = 81.82%
• Precision of predicting Vince = 58.33%
• Precision of predicting Virginia = 72.22%

Since both tools give an accuracy above 70%, we can be reasonably confident in our model
and conclude that Naïve Bayes may be considered the better of the classifiers tried in this
case, where the training data is small and categorical.

 
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
 
Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham Ware
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
 
Statistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbersStatistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbers
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
 
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
 

Subscription fraud analytics using classification

  • 1. SUBSCRIPTION FRAUD ANALYTICS USING CLASSIFICATION SOMDEEP KUMAR SEN Trimax Analytics and Optimization Services 2/21/2014
  • 2. Subscription Fraud Analytics Using Naïve Bayes Classifier
    Contents
    Introduction
    Overview of the study
    Objective of the Study
    Telecommunication fraud: an Overview
      Definition
      Types
        Subscription fraud
        Recharge Voucher Fraud
        Pre-paid Balance Fraud
        Unauthorized Service Fraud
    Models Used
      Naïve Bayes Classification: an overview
      Decision Tree (A Supervised Learning Method)
    Methodology
    Analysis & Findings
      Using R
      Using RapidMiner
    Conclusion
  • 3. Introduction
    The advancement of technological tools such as computers, the internet, and cellular phones has made life easier and more convenient for most people in our society. However, some individuals and groups have subverted these telecommunication devices into tools to defraud numerous unsuspecting victims. It is not uncommon for a scam to originate in a city, county, state, or even a country different from the one in which the victim resides. While telecom fraud may occur in different forms, the present study focuses on the use of analytics to detect subscription fraud, applying the Naïve Bayes classification algorithm to detect and predict probable fraudsters.
    Overview of the study
    A fictitious telecom company called Bad Idea came up with a strange rate plan called Praxis Plan, under which callers are allowed to make only one call in each of four time slots: Morning (9AM-Noon), Afternoon (Noon-4PM), Evening (4PM-9PM) and Night (9PM-Midnight), i.e. four calls per day. Despite the popularity of the plan, Bad Idea was the target of subscription fraud by a gang of three fraudsters: Sally, Virginia and Vince. Their services were eventually terminated. Bad Idea has their call logs spanning one and a half months.
    The analytics team of the company has been provided with two data sets: Black-List Subscriber Call-Logs and Audit Log. The Black-List Subscriber Call-Logs data set contains the calling patterns of the three fraudsters, i.e. Sally, Virginia and Vince. Every 5 days the company runs an audit to check whether these fraudsters have rejoined its network. It reviews the list of subscribers who have made calls to the same people as the three fraudsters, in the same time slots. This list is provided in the Audit Log.
    Test Data: http://bit.ly/1du9cRs
    Training Data: http://bit.ly/1du9AQ1
  • 4. Objective of the Study
    - To provide the names of the probable callers and the confidence in terms of probability
    - To provide the name of the fraudster, if any
    - To provide the code used to identify the subscriber
    Telecommunication fraud: an Overview
    Definition
    Telecommunication fraud is the theft of telecommunication service (telephones, cell phones, computers etc.) or the use of telecommunication service to commit other forms of fraud. Victims include consumers, businesses and communication service providers.
    Types
    Subscription fraud
    Subscription fraud occurs when someone signs up for service with fraudulently obtained customer information or false identification. Lawbreakers obtain a person's personal information and use it to set up a cell phone account in that person's name.
    Recharge Voucher Fraud
    This mainly covers unusual top-up recharges and a high number of recharges in a given time period.
    Pre-paid Balance Fraud
    Employees with a high number of manual balance changes, as well as subscribers with high balances, might be an indication of Pre-paid Balance Fraud.
    Unauthorized Service Fraud
    A mismatch found when reconciling HLR (Home Location Register) records against post-paid subscriber profiles, or HLR services against post-paid subscriber service profiles, or a sudden change in subscriber usage, could be an indication of Unauthorized Service Fraud.
  • 5. Models Used
    Naïve Bayes Classification: an overview
    A naive Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem (from Bayesian statistics) with strong (naive) independence assumptions. A more descriptive term for the underlying probability model would be "independent feature model". In simple terms, a naive Bayes classifier assumes that, given the class, the presence (or absence) of a particular feature is unrelated to the presence (or absence) of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 4" in diameter. Even if these features depend on each other or upon the existence of the other features, a naive Bayes classifier considers all of these properties to contribute independently to the probability that this fruit is an apple.
    Depending on the precise nature of the probability model, naive Bayes classifiers can be trained very efficiently in a supervised learning setting. In many practical applications, parameter estimation for naive Bayes models uses the method of maximum likelihood; in other words, one can work with the naive Bayes model without believing in Bayesian probability or using any Bayesian methods. An advantage of the naive Bayes classifier is that it requires only a small amount of training data to estimate the parameters (means and variances of the variables) necessary for classification. Because the variables are assumed to be independent, only the variances of the variables for each class need to be determined, not the entire covariance matrix.
    Decision Tree (A Supervised Learning Method)
    A decision tree is a flowchart-like structure in which each internal node represents a test on an attribute, each branch represents an outcome of the test, and each leaf node represents a class label (the decision taken after evaluating all attributes). A path from root to leaf represents a classification rule. In decision analysis, a decision tree and the closely related influence diagram are used as visual and analytical decision support tools, where the expected values (or expected utilities) of competing alternatives are calculated.
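The independence assumption described above can be written compactly. In the sketch below, C stands for the class (here, the caller) and x_1, ..., x_n for the observed feature values; these symbols are chosen for illustration and do not appear in the original slides.

```latex
P(C \mid x_1, \dots, x_n) \;\propto\; P(C) \prod_{i=1}^{n} P(x_i \mid C),
\qquad
\hat{C} \;=\; \arg\max_{C} \; P(C) \prod_{i=1}^{n} P(x_i \mid C)
```

For the call-log data in this study, a natural reading is n = 4, with one feature per time slot (the person called in the Morning, Afternoon, Evening and Night).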
  • 6. Methodology
    To make the final prediction, Naïve Bayes classification has been carried out with two different tools, R and RapidMiner, so that the results from the two can be compared.
    Analysis & Findings
    Using R
    Our training data (BlackListSubscriberCallLogs), an Excel sheet, has 138 instances of the names of the people called by the fraudsters Sally, Vince and Virginia in each of the time slots. We import this dataset into R as "blacklisted". We also have a file (Audit Log) of 15 instances, for which we predict the fraudster by the end of this report. This is our unseen data. We import this dataset into R as "audit".
    The Process:
    - Import the datasets and understand them
    - Install the packages and load the libraries "caret" and "klaR" for Naïve Bayes and "party" for Decision Tree
    - Train our model (Naïve Bayes) using 10-fold cross-validation
    - Tweak the parameters of the model to obtain finer results
    - Check the Accuracy and Kappa values
    - Compare the result of the Naïve Bayes model with a 10-fold cross-validated Decision Tree model
  • 7. The training code (shown as a figure on the original slide) uses 10-fold cross-validation, which divides the dataset into 10 parts and uses 9 parts for training and 1 part for testing. This is repeated 10 times, holding out a different part each time. The reported accuracy is the average over the 10 held-out parts, and the final model is then fitted on the full training data.
    Next, we shuffle our dataset (blacklisted) for random sampling and take 15 observations (about 10%) to apply our model to and check its accuracy. Fixing the random seed with the set.seed() function makes this sample reproducible, so the same set of observations can be drawn again.
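The splitting described above can be sketched in base R. This is only an illustration of the idea: caret performs its own fold creation internally, and the seed value and variable names below are assumptions rather than the author's code.

```r
# Sketch only: n matches the 138 training rows mentioned in the text;
# the seed value (42) and all variable names are illustrative assumptions.
n <- 138
k <- 10

set.seed(42)                                  # fixes the random draws below
fold_id <- sample(rep(1:k, length.out = n))   # shuffled fold labels 1..10

# Every row belongs to exactly one fold; fold i is held out in round i.
fold_sizes <- table(fold_id)

for (i in 1:k) {
  test_rows  <- which(fold_id == i)
  train_rows <- which(fold_id != i)
  # a model would be fitted on train_rows and scored on test_rows here
}

# Separate random sample of 15 rows (about 10%) for a quick accuracy check.
set.seed(42)
check_rows <- sample(n, 15)

# Re-using the same seed reproduces the same sample.
set.seed(42)
check_rows_again <- sample(n, 15)
```

With 138 rows and 10 folds, each fold holds 13 or 14 rows, so every round trains on roughly 90% of the data.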
  • 8. Upon analyzing the confusion matrix, we find that:
    - The accuracy of our model is (6+3+2)/15 = 73.3%
    - Precision of predicting Sally = 6/9 = 66.67%
    - Precision of predicting Vince = 3/3 = 100%
    - Precision of predicting Virginia = 2/3 = 66.67%
    Now we tweak our model using the parameters Laplace (fL) and usekernel. Laplace is a smoothing technique that assigns a non-zero probability to events that do not occur in the sample. usekernel switches on kernel density estimation, a non-parametric way to estimate the probability density function of a random variable, which applies to numeric predictors.
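The accuracy and precision figures above can be re-derived from the counts quoted in the text. The off-diagonal cells of the confusion matrix are not given, so this sketch uses only the correct counts and the per-name totals, reading each denominator as the number of times that name was predicted (an assumption based on the text's use of "precision").

```r
# Counts taken from the figures quoted in the text; variable names are
# illustrative. Off-diagonal confusion-matrix cells are not available.
correct   <- c(Sally = 6, Vince = 3, Virginia = 2)  # correctly predicted
predicted <- c(Sally = 9, Vince = 3, Virginia = 3)  # times each name was predicted

accuracy  <- sum(correct) / sum(predicted)   # 11 / 15
precision <- correct / predicted             # 6/9, 3/3, 2/3

round(100 * accuracy, 1)    # 73.3
round(100 * precision, 2)   # Sally 66.67, Vince 100, Virginia 66.67
```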
  • 10. We observe that the outcomes of the two models, fit and fit1 (with Laplace and usekernel), are identical in this case. Thus, with 73% accuracy, we apply our model to the unseen data (audit). To obtain the posterior probabilities for each observation in the unseen data, we run the prediction command on audit [command shown as a figure on the original slide].
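To make the role of the Laplace constant and the meaning of a posterior probability concrete, here is a minimal base R sketch of a categorical naive Bayes scorer. It is not klaR's implementation and not the author's code: the function name nb_posterior, the two-column toy data and the callee name "Diana" are all invented for illustration.

```r
# Toy data, invented for illustration only (not the study's call logs).
train <- data.frame(
  Morning = c("Kelly", "Kelly", "Larry", "Robert", "Larry", "Kelly"),
  Night   = c("Clark", "Clark", "Clark", "Diana",  "Diana", "Diana"),
  Caller  = c("Vince", "Vince", "Vince", "Sally",  "Sally", "Sally"),
  stringsAsFactors = TRUE
)

# Posterior P(Caller | row) for one new observation, with Laplace constant fL.
nb_posterior <- function(train, newrow, target = "Caller", fL = 0) {
  classes <- levels(train[[target]])
  feats   <- setdiff(names(train), target)
  score <- sapply(classes, function(cl) {
    rows <- train[train[[target]] == cl, , drop = FALSE]
    p <- nrow(rows) / nrow(train)                    # prior P(class)
    for (f in feats) {
      k     <- nlevels(train[[f]])                   # distinct values of f
      count <- sum(rows[[f]] == newrow[[f]])
      p <- p * (count + fL) / (nrow(rows) + fL * k)  # smoothed P(value | class)
    }
    p
  })
  score / sum(score)                                 # normalise to sum to 1
}

new_call <- list(Morning = "Kelly", Night = "Clark")
p_raw    <- nb_posterior(train, new_call, fL = 0)    # Sally 0,   Vince 1
p_smooth <- nb_posterior(train, new_call, fL = 1)    # Sally 1/7, Vince 6/7
```

In this toy case Sally never calls Clark at night, so without smoothing her posterior is exactly zero. With fL = 1 she gets a small non-zero probability, but the predicted caller is still Vince, which shows how two settings can yield the same predictions while differing in confidence.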
  • 11. Now we build the Decision Tree model and plot the fitted tree:
    > plot(ctreeFit$finalModel)
  • 12. Reading the graph: if the evening call is made to Frank, the night call to Clark, and the morning call to Kelly, Larry or Robert, the probability that the caller is Vince is close to 80%. A drawback of the Decision Tree plot is that the name Sally is shown below each of the nodes, irrespective of the correct name. We also find that the accuracy of Naïve Bayes (0.661) is better than that of the Decision Tree (0.578). We therefore go on to compare the Naïve Bayes output obtained from RapidMiner against the output of R.
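The path read off the tree above is a plain conjunction of tests, which can be written as a small R function. The helper name matches_vince_leaf is hypothetical, only this single path is encoded, and "Diana" is an invented callee used to show a non-matching case.

```r
# Hypothetical helper: encodes only the one root-to-leaf path described in
# the text, not the whole fitted tree.
matches_vince_leaf <- function(morning, evening, night) {
  evening == "Frank" &&
    night == "Clark" &&
    morning %in% c("Kelly", "Larry", "Robert")
}

hit  <- matches_vince_leaf(morning = "Larry", evening = "Frank", night = "Clark")  # TRUE
miss <- matches_vince_leaf(morning = "Larry", evening = "Frank", night = "Diana")  # FALSE
```

A call pattern that satisfies all three tests lands in the leaf where, according to the plot, Vince is the most probable caller.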
  • 13. Using RapidMiner
    Initially both data sets are uploaded into RapidMiner. The Black-List Subscriber Call-Logs data set is named telecom234 and the Audit Log is named telecom56. Many classification algorithms are available in RapidMiner; of these, we choose the Naïve Bayes classifier. The telecom234 data set is dragged into the main window.
    In data mining we use the concept of data splitting, in which we divide our data set into two parts: a training set and a validation set. The training set is used to create the model, whereas the validation set is used to estimate the accuracy of the created model. To create a model and estimate its accuracy with data splitting, we use the validation operators found in the Evaluation -> Validation folder of the operator window. The most commonly used operators are Split Validation and XValidation. We first use Split Validation: drag and drop the Split Validation operator into the process window. Split Validation is a group operator, i.e. it groups multiple operators inside it. Group operators carry a special sign: two overlapping blue squares on their icon, as shown in the figure.
  • 14. The Validation operator has a split ratio parameter (visible in the parameter window on the right), which specifies how the data set will be split. A value of 0.7, as in the figure, puts 70% of the data in the training set and the remaining 30% in the testing set. Because Split Validation is a nested operator, we double-click on it to open its sub-process window, which has two parts: Training and Testing. The Naïve Bayes classifier is placed in the Training window. Validation allows us to estimate the accuracy of our model; for this purpose, RapidMiner provides many performance operators in the Performance Measurement folder. Place the Apply Model and Performance (Classification) operators in the Testing window, as shown in the figure.
  • 15. Now the averagable ports of the Validation operator are connected to the result port of the Process window. The result perspective then shows a performance vector with details about the performance of the created model. For example, the created model has an accuracy of almost 71%, as shown in the figure.
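The effect of the 0.7 split ratio on this data set can be checked with a few lines of R. The sketch assumes the 138 training rows reported for the R analysis and assumes the split is rounded to the nearest row; the count of 29 correct predictions is an inference that happens to reproduce the 70.73% accuracy quoted in the conclusion, not a figure from the source.

```r
n           <- 138    # training rows reported for the R analysis
split_ratio <- 0.7    # split ratio parameter described in the text

n_train <- round(split_ratio * n)   # 97, assuming rounding to the nearest row
n_test  <- n - n_train              # 41

# With 41 test rows, accuracy moves in steps of 1/41. 29 correct out of 41
# (an inferred count, not stated in the source) rounds to 70.73%.
acc_pct <- round(100 * 29 / n_test, 2)
```

A test set this small means one prediction more or less shifts the accuracy by about 2.4 percentage points.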
  • 16. Once the model is created, it can be used for classification/prediction. The telecom56 data set, which is the unlabeled data set, is dragged into the main window, together with an Apply Model operator. The Apply Model operator takes the model from the Validation operator and applies it to the unlabeled input, i.e. the telecom56 data, as shown in the figure.
  • 17. Running the whole process now produces the prediction shown in the figure: the names of the probable callers along with the confidence in terms of probability.
  • 18. Conclusion
    Comparing the accuracy and precision from the confusion matrices of the RapidMiner and R results, we see:
    For R:
    - The accuracy of our model is (6+3+2)/15 = 73.3%
    - Precision of predicting Sally = 6/9 = 66.67%
    - Precision of predicting Vince = 3/3 = 100%
    - Precision of predicting Virginia = 2/3 = 66.67%
    For RapidMiner:
    - The accuracy of our model is 70.73%
    - Precision of predicting Sally = 81.82%
    - Precision of predicting Vince = 58.33%
    - Precision of predicting Virginia = 72.22%
    Since both tools give an accuracy above 70%, we can be reasonably confident in our model and conclude that Naïve Bayes may be considered the better of the two classifiers compared in this case, where the training data is small and categorical.