Marketing Analytics using R/Python
Capstone Project - IS 6596
Project Supervisor:
Dr. Rohit Aggarwal
Project Contributors:
Mayank Badjatya - u1085897
Sagar Singh - u1088202
Contents
Executive Summary
Book Description
Why Data Science?
Skill sets required for a Data Scientist
7 Steps to effective Predictive Modelling
Marketing Analysis
    Fraud Detection
    Market Segmentation
    Advertising
Lessons Learned
Next Steps
Executive Summary
The objective of this project is to discuss the importance of Machine Learning in different sectors and how
it solves problems in the Marketing Analytics field. We cover Market Segmentation, Advertising, and Fraud
Detection. We applied several Machine Learning algorithms, using R and Python libraries, to these
problems. After building the models and running test data through them, we obtained the following results:
• We trained Decision Tree and Random Forest classifiers that predict, with 73% accuracy, whether a
person will be a defaulter or not based on credit history, income, job type, dependents, etc.
• We segmented social networking profiles based on the likes and dislikes of a person using K-Means
Clustering.
• We built a Naïve Bayes classifier on the messages a customer receives that predicts whether a
message is spam or not with an accuracy of 97%.
• We created several other models using different algorithms, but these are beyond the scope of
this report.
Book Description
An Introduction to Statistical Learning provides an accessible overview of the field of statistical learning,
an essential toolset for making sense of the vast and complex data sets that have emerged in fields ranging
from biology to finance to marketing to astrophysics in the past twenty years. This book presents some of
the most important modeling and prediction techniques, along with relevant applications. Topics include
linear regression, classification, resampling methods, shrinkage approaches, tree-based methods, support
vector machines, clustering, and more. Color graphics and real-world examples are used to illustrate the
methods presented. Since the goal of this textbook is to facilitate the use of these statistical learning
techniques by practitioners in science, industry, and other fields, each chapter contains a tutorial on
implementing the analyses and methods presented in R, an extremely popular open source statistical
software platform. An Introduction to Statistical Learning covers many of the same topics as The Elements
of Statistical Learning, but at a level accessible to a much broader audience. This book is targeted at
statisticians and non-statisticians alike
who wish to use innovative statistical learning techniques to analyze their data. The text assumes only a
previous course in linear regression and no knowledge of matrix algebra.
Machine Learning with R: This book is intended for anybody hoping to use data for action. Perhaps you
already know a bit about machine learning, but have never used R; or perhaps you know a little about R,
but are new to machine learning. In any case, this book will get you up and running quickly. It would be
helpful to have a bit of familiarity with basic math and programming concepts, but no prior experience is
required. All you need is curiosity.
Machine learning, at its core, is concerned with the algorithms that transform information into actionable
intelligence. This fact makes machine learning well-suited to the present-day era of big data. Without
machine learning, it would be nearly impossible to keep up with the massive stream of information. Given
the growing prominence of R—a cross-platform, zero-cost statistical programming environment—there
has never been a better time to start using machine learning. R offers a powerful but easy-to-learn set of
tools that can assist you with finding data insights. By combining hands-on case studies with the essential
theory that you need to understand how things work under the hood, this book provides all the knowledge
that you will need to start applying machine learning to your own projects.
Marketing Analytics Data Driven Techniques: This book helps tech-savvy marketers and data analysts
solve real-world business problems with Excel.
Using data-driven business analytics to understand customers and improve results is a great idea in
theory, but in today's busy offices, marketers and analysts need simple, low-cost ways to process and
make the most of all that data. This expert book offers the perfect solution. Written by data analysis expert
Wayne L. Winston, this practical resource shows you how to tap a simple and cost-effective tool, Microsoft
Excel, to solve specific business problems using powerful analytic techniques—and achieve optimum
results. Practical exercises in each chapter helped us apply and reinforce the techniques as we learned
them. The book:
• Shows you how to perform sophisticated business analyses using the cost-effective and widely
available Microsoft Excel instead of expensive, proprietary analytical tools
• Reveals how to target and retain profitable customers and avoid high-risk customers
• Helps you forecast sales and improve response rates for marketing campaigns
• Explores how to optimize price points for products and services, optimize store layouts, and
improve online advertising
• Covers social media, viral marketing, and how to exploit both effectively.
Why Data Science?
Data Science can be applied in almost any field. Below are examples of fields outside IT that use data
science as a tool.
• Politics: We may have heard how statistical wizard Nate Silver predicted the electoral votes for
each state in the 2012 presidential election, showing that raw data crunching of polls is much
more reliable than traditional punditry.
• Healthcare: In medicine, big data lets us build better health profiles and better predictive models
around individual patients, so that we can better diagnose and treat disease. Big data comes into
play in aggregating ever more information across the multiple scales of what constitutes a disease,
from DNA, proteins, and metabolites to cells, tissues, organs, organisms, and ecosystems.
• Automotive Industry: Areas in the automotive industry impacted by Big Data include:
a. Conceptual Design: Real-world data collected from billions of miles driven will undoubtedly
influence safety, aerodynamics, power algorithms and other fundamental elements of the vehicle.
b. Drawing Boards: Efficiency gained in design, production volumes and manufacturing through
Big Data in the auto industry will make it economically feasible to make today’s options
tomorrow’s standard equipment.
c. Procurement: Supply chain management optimized by Big Data will help manufacturers
continue to wring new efficiency from the procurement process.
d. Manufacturing: On the assembly line, data gathered throughout the building process will be
used in predictive analytics to improve manufacturing simulations and watch machine
performance, making the next assembly line even more efficient and flexible.
• Marketing: Big Data is already having a major influence on vehicle marketing. Social sentiment
will play a growing role in manufacturers’ plans to design new vehicles. Customer feedback on
current models also helps marketing experts identify key themes and messages for new
campaigns.
• Finance: Understanding consumer habits, preferences and buying power across market segments
gives manufacturers insights needed to develop more-effective financing programs. But that’s just
the first step. New insights from Big Data analyses of sales and in-field use data will help captive
financing companies develop new services and new revenue streams.
• Services: Like performance, service will benefit as both a contributor and a user of Big Data in the
automotive industry. Information gathered through millions of service events will provide
feedback to designers.
Skill sets required for a Data Scientist
Technical Skills:
Python Coding – Python is the coding language most commonly required in data science roles, along with
Java, Perl, or C/C++.
Hadoop Platform – Although this isn’t always a requirement, it is heavily preferred in many cases. Having
experience with Hive or Pig is also a strong selling point. Familiarity with cloud tools such as Amazon S3
can also be beneficial.
SQL Database/Coding – Even though NoSQL and Hadoop have become a large component of data science,
it is still expected that a candidate will be able to write and execute complex queries in SQL.
Unstructured data – It is critical that a data scientist be able to work with unstructured data, whether it is
from social media, video feeds or audio.
Non-Technical Skills
Intellectual curiosity – No doubt we have seen this phrase everywhere lately, especially as it relates to
data scientists. Frank Lo describes what it means, and talks about other necessary “soft skills” in his guest
blog posted a few months ago.
Business acumen – To be a data scientist, we need a solid understanding of the industry we are working in
and of the business problems our company is trying to solve. In data science, being able to discern which
problems are important for the business to solve is critical, as is identifying new ways the business
should be leveraging its data.
Communication skills – Companies searching for a strong data scientist are looking for someone who can
clearly and fluently translate their technical findings to a non-technical team, such as the Marketing or
Sales departments. A data scientist must enable the business to make decisions by arming them with
quantified insights, in addition to understanding the needs of their non-technical colleagues to wrangle
the data appropriately.
7 Steps to effective Predictive Modelling
Step 1: Defining the Objective
The first step in any modeling process is defining the objective. We identify the field into which the
problem falls. Typical fields include Target Marketing, Risk & Fraud Management, Strategy Implementation
and Change Management, Operational Efficiency, Customer Experience, Marketing Campaign Management,
Revenue or Loss Forecasting, Workforce Management, Financial Modeling, Churn Management, and Social
Media Influencers.
Step 2: Gathering the Data
Accurate, actionable, accessible data is the lifeblood of any successful model, so we collect enough data
to build a predictive model on it.
Step 3: Preparing the Data for Modeling
The average modeler spends 70% of his or her time preparing data. In this step we put the data into the
right format for the analysis and for the tool we want to use:
1. Do initial clean-up
2. Define variables and create a data dictionary
3. Join/append multiple datasets
4. Validate for correctness
5. Produce basic summary reports
Step 4: Selecting and Transforming the Variables
Determining the best fit is essential to good model performance. The underlying structure of the
independent variables in relation to the dependent variable determines the power and longevity of a
model.
Special consideration is given to the fact that marketing data can have hundreds or even thousands of
variables. We apply methods for identifying the best candidate variables, and use programs that
automatically segment and transform the most powerful variables to ensure the best fit.
Step 5: Processing and Evaluating the Model
All the preparation work up to this point makes this next step run smoothly. Weights of Evidence and
Information Values are calculated. For our main case study, we used various options within PROC LOGISTIC
to determine the model with the best fit. Validation data are scored, tabulated, and compared using both
SAS® and MS Excel®.
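As an illustration of the Weights of Evidence (WoE) and Information Value (IV) calculation mentioned above, here is a minimal Python sketch. It assumes the common convention WoE = ln(share of goods / share of bads) per bin and IV = sum of (share of goods - share of bads) x WoE; other sources flip the ratio. The bin labels and counts are made up for the example and are not results from our case study.

```python
import math

def woe_iv(bins):
    """Compute Weight of Evidence per bin and the overall Information Value.

    bins maps a bin label to a (good_count, bad_count) pair.
    Assumes every bin has at least one good and one bad record.
    """
    total_good = sum(good for good, _ in bins.values())
    total_bad = sum(bad for _, bad in bins.values())
    woe = {}
    iv = 0.0
    for label, (good, bad) in bins.items():
        pct_good = good / total_good  # share of all goods that fall in this bin
        pct_bad = bad / total_bad     # share of all bads that fall in this bin
        woe[label] = math.log(pct_good / pct_bad)
        iv += (pct_good - pct_bad) * woe[label]
    return woe, iv

# Hypothetical bins of a single variable (for example, an income band).
example_bins = {"low": (50, 50), "medium": (150, 30), "high": (200, 20)}
woe, iv = woe_iv(example_bins)
```

A bin with a negative WoE holds proportionally more bads than goods, and a larger IV suggests the variable separates the two groups more strongly.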
Step 6: Validating the Model
Models should perform well on the development data. In addition, if the hold-out sample is randomly
selected, the model should score the validation data with similar results. A true test of model
performance is how well it performs on data from a different time or market area. So, we used three
powerful methods for ensuring model fit: 1) Scoring alternate data is the best way to tell if our model will
perform in a real campaign; 2) Bootstrapping uses simple resampling techniques to find confidence
intervals around our estimates; 3) Key Variable Analysis calculates important market factors as they are
affected by the model, thus ensuring reasonable results.
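The bootstrapping idea in point 2 can be sketched in a few lines of Python. This is a minimal percentile bootstrap under our own assumptions (2,000 resamples, a 95% interval, a fixed seed for reproducibility); the response data are invented for the example.

```python
import random

def bootstrap_ci(values, stat, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for stat(values)."""
    rng = random.Random(seed)
    n = len(values)
    # Recompute the statistic on many resamples drawn with replacement.
    estimates = sorted(
        stat([values[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

def mean(xs):
    return sum(xs) / len(xs)

# Hypothetical outcomes: 1 = customer responded to the campaign, 0 = did not.
responses = [1] * 30 + [0] * 70
low, high = bootstrap_ci(responses, mean)
```

The interval `(low, high)` brackets the observed response rate of 0.30 and gives a sense of how much that estimate could move on a different sample.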
Step 7: Implementing and Maintaining the Model
Effective implementation is a combination of business intelligence and well-designed procedures. So, we
score a new data set with the new model. Several auditing procedures are carried out, and tracking and
model maintenance are emphasized as best practices.
Figure 1: 7 Steps of Predictive Modelling
Marketing Analysis
Figure 2 : Facets of Marketing Analysis
An accurate customer risk assessment helps us acquire the most profitable consumers while
minimizing risk. For business-to-consumer companies, Experian offers consumer credit information,
advanced scoring software, prescreening systems, and application decisioning tools. For companies
looking to acquire business customers, its business reports and public records, portfolio data, and risk
modeling tools allow clients to create comprehensive profiles of business prospects and to determine
which businesses are well-capitalized and financially suited for customer acquisition.
Fraud Detection
Fraud is a billion-dollar business, and it is increasing every year. The PwC Global Economic Crime Survey of
2016 suggests that more than one in three organizations (36%) experienced economic crime.
Traditional methods of data analysis have long been used to detect fraud. They require complex and
time-consuming investigations that deal with different domains of knowledge such as finance, economics,
business practices, and law.
To learn how Machine Learning algorithms address the fraud detection problem, we used the credit
dataset from “Machine Learning with R”.
The idea behind our credit model is to identify factors that make an applicant at higher risk of default.
Therefore, we need to obtain data on many past bank loans and whether the loan went into default, as
well as information about the applicant.
The credit dataset includes 1,000 examples of loans, plus a combination of numeric and nominal features
indicating characteristics of the loan and the loan applicant. A class variable indicates whether the loan
went into default. The features “job”, “phone”, “checking_balance”, “credit_history”, “purpose”,
“savings_balance”, “employment_duration”, “other_credit”, and “housing” are categorical, so in Python we
used onehotencoder() to convert them into 0s and 1s. After applying onehotencoder() to all the
categorical columns, we had 36 columns.
Figure 3 Conversion of categorical data into 0s and 1s
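To show what this conversion does, here is a plain-Python sketch of one-hot encoding. It is an illustration only: the encoder referred to above is presumably scikit-learn's OneHotEncoder, which we do not reproduce here, and the rows and category values below are made up (only the column names "housing" and "phone" come from the dataset).

```python
def one_hot_encode(rows, categorical_columns):
    """Replace each categorical column with one 0/1 column per distinct value."""
    # Collect the sorted distinct values observed in each categorical column.
    categories = {
        col: sorted({row[col] for row in rows}) for col in categorical_columns
    }
    encoded = []
    for row in rows:
        # Keep the non-categorical (numeric) columns unchanged.
        new_row = {k: v for k, v in row.items() if k not in categorical_columns}
        for col, values in categories.items():
            for value in values:
                new_row[f"{col}_{value}"] = 1 if row[col] == value else 0
        encoded.append(new_row)
    return encoded

# Illustrative rows; the values are invented for the example.
loans = [
    {"amount": 1200, "housing": "own", "phone": "yes"},
    {"amount": 5000, "housing": "rent", "phone": "no"},
    {"amount": 800, "housing": "other", "phone": "yes"},
]
encoded = one_hot_encode(loans, ["housing", "phone"])
```

Each categorical column expands into as many 0/1 columns as it has distinct values, which is why the column count grows after encoding.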
We performed initial data exploration and plotted the results using the matplotlib library.
Figure 4 Exploratory Data Analysis
We used a decision tree to determine whether a person is a defaulter or not, depending on the features.
The core algorithm for building decision trees is called ID3. Decision tree classifiers use a greedy
approach: an attribute chosen at an early step cannot be used again, even if using it at a later step
would give a better classification. A decision tree also tends to overfit the training data, which can give
poor results on unseen data. ID3 uses two concepts to determine the feature on which to split the dataset.
Information Gain
The information gain is based on the decrease in entropy after a dataset is split on an attribute.
Constructing a decision tree is all about finding attribute that returns the highest information gain (i.e.,
the most homogeneous branches).
Entropy
A decision tree is built top-down from a root node and involves partitioning the data into subsets that
contain instances with similar values (homogeneous). The ID3 algorithm uses entropy to calculate the
homogeneity of a sample. If the sample is completely homogeneous, the entropy is zero; if the sample
is equally divided between two classes, the entropy is one.
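The two concepts above can be made concrete with a short Python sketch. It computes entropy in bits (log base 2) and the information gain of splitting on one attribute; the four example applicants and their attribute values are invented for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )

def information_gain(rows, labels, attribute):
    """Decrease in entropy after splitting the rows on the given attribute."""
    total = len(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attribute], []).append(label)
    # Entropy after the split is the size-weighted average over the branches.
    weighted = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted

# Hypothetical applicants: "housing" separates the classes perfectly, "phone" not at all.
rows = [
    {"housing": "own", "phone": "yes"},
    {"housing": "own", "phone": "no"},
    {"housing": "rent", "phone": "yes"},
    {"housing": "rent", "phone": "no"},
]
labels = ["no_default", "no_default", "default", "default"]
gain_housing = information_gain(rows, labels, "housing")
gain_phone = information_gain(rows, labels, "phone")
```

Here the labels are split evenly, so the starting entropy is one bit; splitting on "housing" removes all of it, while splitting on "phone" removes none, so ID3 would choose "housing" first.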
After applying the Decision tree model, we got the following classification report.
Figure 5 F1 Score for Decision Tree
The F1 score is a measure of a test's accuracy. It is the harmonic mean of precision and recall,
reaching its best value at 1 (perfect precision and recall) and its worst at 0.
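As a worked example of this definition, the following Python sketch computes precision, recall, and F1 for one class from lists of actual and predicted labels. The labels are made up and do not reproduce the figures in our classification reports.

```python
def precision_recall_f1(actual, predicted, positive):
    """Precision, recall and F1 score for the class named by `positive`."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)  # true positives
    fp = sum(a != positive and p == positive for a, p in pairs)  # false positives
    fn = sum(a == positive and p != positive for a, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical loan outcomes.
actual = ["default", "default", "default", "paid", "paid", "paid", "paid", "paid"]
predicted = ["default", "default", "paid", "default", "paid", "paid", "paid", "paid"]
precision, recall, f1 = precision_recall_f1(actual, predicted, positive="default")
```

With two true positives, one false positive, and one false negative, precision and recall are both 2/3, so the F1 score is 2/3 as well.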
A single decision tree tends to overfit the training data (high variance), so to overcome this drawback we use Bagging.
Bagging is a way to decrease the variance of our prediction by generating additional training data from
our original dataset: we sample with replacement to produce new training sets of the same size as our
original data.
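The resampling behind bagging can be sketched as follows. This is a simplified illustration, not our model code: it draws same-size samples with replacement and shows the majority vote that an ensemble typically uses to combine the predictions of the individual models. The helper names and the toy data are our own.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]

def majority_vote(predictions):
    """Combine the predictions of several models by taking the most common one."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(42)           # fixed seed so the sketch is reproducible
training_set = list(range(10))    # stand-in for ten training records
samples = [bootstrap_sample(training_set, rng) for _ in range(5)]

# Suppose three trees, each trained on a different sample, predict as follows.
combined = majority_vote(["default", "paid", "default"])
```

Each sample has the same size as the original data but usually repeats some records and omits others, which is what makes the trained models differ from one another.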
Random Forests is an ensemble classifier that uses many decision tree models to predict the result. A
different subset of the training data is selected, with replacement, to train each tree. A collection of
trees is a forest, and the trees are trained on subsets selected at random, hence random
forests. After applying the Random Forest classifier, we got the following result.
Figure 6 F1 Score for Random Forest
We can clearly see the increase in the F1-score.
The next step in building the model, as discussed earlier, is to fine-tune it. For this we used the Grid
Search Cross Validation technique. After applying GridSearchCV we got the following classification
report.
Figure 7 F1 Score after GridSearchCV
From this report we conclude that the model correctly predicts whether a person will be a defaulter or
not 73% of the time.
Market Segmentation
One of the most fundamental marketing activities is market segmentation. As companies cannot
connect with all their potential customers, they must divide markets into groups (segments) of consumers,
customers, or clients with similar needs and wants. Firms can then target each of these segments by
positioning themselves in a unique segment (such as Ferrari in the high-end sports car market).
While market researchers often form market segments based on practical grounds, industry practice, and
wisdom, cluster analysis allows segments to be formed based on data, making them less dependent on
subjectivity.
Cluster analysis is a convenient method for identifying homogeneous groups of objects called clusters.
Objects (or cases, observations) in a specific cluster share many characteristics, but are very dissimilar
to objects not belonging to that cluster.
Below we walk through this process from start to finish.
For this analysis, we used a dataset representing a random sample of 30,000 U.S. high school students
who had profiles on a well-known SNS in 2006. To protect the users' anonymity, the SNS will remain
unnamed. However, at the time the data was collected, the SNS was a popular web destination for US
teenagers. Therefore, it is reasonable to assume that the profiles represent a wide cross section of
American adolescents in 2006.
Let's take a quick look at the specifics of the data.
Figure 8 Description of the data set
Figure 9 Min-Max of the Age
Figure 10 Gender and Age anomaly
There is something strange around the gender row. On looking carefully, we noticed the NA value. We
see that 2,724 records (9 percent) have missing gender data.
Besides gender, only age has missing values. A total of 5,086 records (17 percent) have missing ages. Also
concerning is the fact that the minimum and maximum values seem to be unreasonable; it is unlikely that
a 3-year-old or a 106-year-old is attending high school. To ensure that these extreme values don't cause
problems for the analysis, we cleaned them up before moving on.
Figure 11 Box Plot for the age distribution
A more reasonable range of ages for high school students includes those who are at least 13 years old
and not yet 20 years old. We treated any age value falling outside this range the same as missing data.
An easy solution for handling missing values is to exclude any record with a missing value, but that would
discard a large share of the data. Instead, for gender we created dummy variables for female and unknown
gender. We assigned teens$female the value 1 if gender is equal to F and is not NA; otherwise, we
assigned the value 0.
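The dummy coding just described can be sketched in Python as follows. The function and the name `no_gender` for the unknown-gender indicator are our own choices for the illustration; missing values are represented by None in place of R's NA.

```python
def gender_dummies(gender):
    """Return the (female, no_gender) indicator pair for one record."""
    female = 1 if gender == "F" else 0       # a missing gender (None) also maps to 0
    no_gender = 1 if gender is None else 0   # flags records with unknown gender
    return female, no_gender

genders = ["F", "M", None, "F"]
dummies = [gender_dummies(g) for g in genders]
```

Male records are the reference group: they are the ones where both indicators are 0, so no record has to be dropped because of a missing gender.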
Next, we dealt with the 5,523 missing age values, a count that includes the out-of-range ages recoded as
missing. Here we used a different strategy known as data imputation, which involves filling in the missing
data with a guess as to the true value. Most people in a graduation cohort were born within a single
calendar year, so once we had identified the typical age for each cohort, we had a reasonable estimate
of the age of a student in that graduation year.
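The age cleaning and imputation steps can be sketched together in Python. This is a minimal version under stated assumptions: the "typical age" is taken to be the mean of the valid ages in each cohort, and the field names `gradyear` and `age` as well as the sample records are hypothetical.

```python
def impute_age_by_cohort(records):
    """Recode out-of-range ages as missing, then fill them with the cohort mean."""
    # Ages outside [13, 20) are treated the same as missing values.
    cleaned = [
        {**r, "age": r["age"] if r["age"] is not None and 13 <= r["age"] < 20 else None}
        for r in records
    ]
    # Mean of the valid ages within each graduation year.
    sums = {}
    for r in cleaned:
        if r["age"] is not None:
            total, count = sums.get(r["gradyear"], (0.0, 0))
            sums[r["gradyear"]] = (total + r["age"], count + 1)
    cohort_mean = {year: total / count for year, (total, count) in sums.items()}
    # Fill each missing age with the mean of its cohort (None if the cohort has no valid age).
    return [
        {**r, "age": r["age"] if r["age"] is not None else cohort_mean.get(r["gradyear"])}
        for r in cleaned
    ]

records = [
    {"gradyear": 2006, "age": 18.0},
    {"gradyear": 2006, "age": 19.0},
    {"gradyear": 2006, "age": None},
    {"gradyear": 2009, "age": 15.0},
    {"gradyear": 2009, "age": 106.0},
]
imputed = impute_age_by_cohort(records)
```

In this example the missing 2006 age becomes 18.5 and the implausible 106 in the 2009 cohort is replaced by 15.0.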
To cluster the teenagers into marketing segments, we used an implementation of k-means clustering. We
started our cluster analysis by considering only the 36 features that represent the number of times various
interests appeared on the teen SNS profiles.
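To illustrate how k-means groups profiles, here is a small pure-Python sketch of the standard assign-then-update loop (Lloyd's algorithm). It is not the implementation we used: the starting centroids are passed in explicitly so the result is reproducible, whereas library implementations usually choose them at random, and the two-feature profiles are invented.

```python
import math

def nearest_index(point, centroids):
    """Index of the centroid closest to the point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def kmeans(points, centroids, iterations=10):
    """Cluster points around the given starting centroids."""
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest_index(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its assigned points.
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    assignments = [nearest_index(p, centroids) for p in points]
    return centroids, assignments

# Hypothetical profiles: (mentions of "football", mentions of "shopping").
profiles = [(5, 0), (4, 1), (0, 6), (1, 5)]
centroids, assignments = kmeans(profiles, centroids=[(5, 0), (0, 6)])
```

The two sports-heavy profiles end up in one cluster and the two shopping-heavy profiles in the other; with 36 interest features the idea is the same, only in more dimensions.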
Evaluating clustering results can be somewhat subjective. Ultimately, the success or failure of the model
hinges on whether the clusters are useful for their intended purpose. As the goal of this analysis was to
identify clusters of teenagers with similar interests for marketing purposes, we largely measured our
success in qualitative terms. For other clustering applications, more quantitative measures of success may
be needed. By examining whether the clusters fall above or below the mean level for each interest
category, we can notice patterns that distinguish the clusters from each other. Cluster 3 is substantially
above the mean interest level on all the sports. This suggests that this may be a group of Athletes per The
Breakfast Club stereotype.
Figure 12 Cluster segmentation
Cluster 0 includes the most mentions of "cheerleading" and the word "hot," and is above the average level
of football interest. Hence, these are the so-called Princesses. We characterized the other clusters in
the same way, and this is what we found:
• Cluster 0 (N = 872), Princesses: cute, hair, shopping, clothes, dance
• Cluster 1 (N = 21,308), Basket Cases: ???
• Cluster 2 (N = 1,041), Criminals: drunk, deaths, drugs, die, music
• Cluster 3 (N = 5,971), Athletes: basketball, soccer, football, volleyball
• Cluster 4 (N = 808), Brains: band, marching, music, rock
We then focused our effort on turning these insights into action. We applied the clusters back onto the
full dataset.
We looked at the demographic characteristics of the clusters. The mean age does not vary much by
cluster, which is not too surprising as these teen identities are often determined before high school. On
the other hand, there are some substantial differences in the proportion of females by cluster. This is a
very interesting finding, as we didn't use gender data to create the clusters, yet the clusters are still
predictive of gender. Given our success in predicting gender, we also suspected that the clusters are
predictive of the number of friends the users have. This hypothesis seems to be supported by the data.
Our findings support the popular adage that "birds of a feather flock together." By using machine learning
methods to cluster teenagers with others who have similar interests, we were able to develop a typology
of teen identities that was predictive of personal characteristics, such as gender and the number of
friends. These same methods can be applied to other contexts with similar results.
Advertising
Compared to other marketing techniques, email marketing is the cheapest way of sending a marketing
message to millions of people. Being so cheap, it is the tool of choice for marketing teams with a small
budget trying to sell cheap products. Most of the time, such products do not deliver what they promise.
Unfortunately, with email marketing, we run the risk of being exposed to malware and fraudulent emails.
Worms and viruses often make use of email and spam techniques to propagate. Phishing emails and
Nigerian 419 scams are examples of fraudulent emails which try to harvest either our money or our
personal information including credit card details. So, while email marketing is the tool of choice for most
marketing teams, it does require stringent regulations to ensure that it does not get abused. Below we
tried to build a model which predicts whether a composed message is spam or not.
The dataset included the text of SMS messages along with a label indicating whether the message is
unwanted. Junk messages are labeled spam, while legitimate messages are labeled ham. Since Naive
Bayes has been used successfully for e-mail spam filtering, it seems likely that it could also be applied to
SMS spam. However, relative to e-mail spam, SMS spam poses additional challenges for automated filters.
SMS messages are often limited to 160 characters, reducing the amount of text that can be used to identify
whether a message is junk.
Figure 13 Description of the data set
The first step towards constructing our classifier involves processing the raw data for analysis. SMS
messages are strings of text composed of words, spaces, numbers, and punctuation. Handling this type of
complex data takes a lot of thought and effort. One needs to consider how to remove numbers and
punctuation; handle uninteresting words such as and, but, and or; and how to break apart sentences into
individual words.
Figure 14 Description of length of the Ham messages
Figure 15 Description of length of the Spam messages
Our first order of business was to standardize the messages to use only lowercase characters. To this end,
we used the tolower() function, which returns a lowercase version of text strings. Continuing with our
cleanup process, we also eliminated any punctuation from the text messages. Our next task was to remove
filler words such as to, and, but, and or from our SMS messages. These terms are known as stop words and
are typically removed prior to text mining because, although they appear very frequently, they do not
provide much useful information for machine learning.
Another common standardization for text data involves reducing words to their root form in a process
called stemming. The stemming process takes words like learned, learning, and learns, and strips the suffix
to transform them into the base form, learn. After these removals, the messages are left with the blank
spaces that previously separated the now-missing pieces. The final step in our text cleanup process was
to remove this additional whitespace.
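The cleanup steps above can be sketched in Python as a single function. This is an illustration under our own assumptions, not the code we ran: the stop-word list is a tiny sample, and `naive_stem` is a deliberately crude suffix stripper standing in for a real stemming algorithm.

```python
import re

# A tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"to", "and", "but", "or", "the", "a"}

def naive_stem(word):
    """Strip a few common suffixes (a crude stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def clean_message(text):
    """Lowercase, drop numbers and punctuation, remove stop words, stem, tidy spaces."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # replace digits and punctuation with spaces
    words = [naive_stem(w) for w in text.split() if w not in STOP_WORDS]
    return " ".join(words)  # split() plus join() collapses the leftover whitespace

cleaned = clean_message("Learning, learned and LEARNS!!  Call 0800 now")
```

For the sample message the result is "learn learn learn call now": the three word forms collapse to one root, and the stop word, digits, punctuation, and extra spaces are gone.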
A word cloud is a way to visually depict the frequency at which words appear in text data. The cloud is
composed of words scattered somewhat randomly around the figure. The resulting word clouds are
shown in the following diagram:
Figure 16 Spam Word cloud
Figure 17 Ham Word cloud
Now that the data are processed to our liking, the next step is to split the messages into individual
components through a process called vectorization. We took the corpus and created a data structure in
which rows indicate documents (SMS messages) and columns indicate terms (words). The final step in the
data preparation process was to transform the sparse matrix into a data structure that can be used to
train a Naive Bayes classifier. The sparse matrix included over 6,500 features, one for every word that
appears in at least one SMS message. It is unlikely that all of these are useful for classification. To
reduce the number of features, we eliminated any word that appears in fewer than five SMS messages, or in
fewer than about 0.1 percent of the records in the training data.
Figure 18 Vectorization
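A minimal Python sketch of vectorization with a minimum document frequency is shown below. It builds a dense list-of-lists matrix for readability, whereas real tools store it sparsely; the function name and sample messages are our own, and the threshold is lowered to 2 so that the toy example keeps some terms.

```python
from collections import Counter

def document_term_matrix(messages, min_docs=5):
    """Build term counts per message, keeping terms found in at least min_docs messages."""
    tokenized = [m.split() for m in messages]
    # Document frequency: in how many messages does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vocabulary = sorted(t for t, n in doc_freq.items() if n >= min_docs)
    index = {term: i for i, term in enumerate(vocabulary)}
    matrix = []
    for tokens in tokenized:
        row = [0] * len(vocabulary)  # one column per retained term
        for term in tokens:
            if term in index:
                row[index[term]] += 1
        matrix.append(row)
    return vocabulary, matrix

# Hypothetical cleaned messages.
messages = ["free prize call now", "call me later", "free free entry"]
vocabulary, matrix = document_term_matrix(messages, min_docs=2)
```

Only "call" and "free" appear in at least two messages, so the matrix has two columns; every rarer word is dropped, which is the same pruning we applied with a threshold of five.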
To evaluate the SMS classifier, we need to test its predictions on unseen messages in the test data. The
process of evaluating machine learning algorithms is very similar to the process of evaluating students.
Since algorithms have varying strengths and weaknesses, tests should distinguish among the learners.
Figure 19 Classification report
A confusion matrix is a table that categorizes predictions according to whether they match the actual
value. One of the table's dimensions indicates the possible categories of predicted values, while the other
dimension indicates the same for actual values. Although we have only seen 2 x 2 confusion matrices so
far, a matrix can be created for models that predict any number of class values.
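The following Python sketch builds such a table for any list of class labels, with actual values as rows and predicted values as columns. The ham/spam labels below are made up and do not reproduce our classifier's results.

```python
from collections import Counter

def confusion_matrix(actual, predicted, labels):
    """Rows are actual classes and columns are predicted classes, in `labels` order."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]

# Hypothetical test-set labels.
actual = ["ham", "ham", "ham", "spam", "spam"]
predicted = ["ham", "ham", "spam", "spam", "ham"]
labels = ["ham", "spam"]
matrix = confusion_matrix(actual, predicted, labels)

# Accuracy is the share of predictions on the diagonal (predicted == actual).
accuracy = sum(matrix[i][i] for i in range(len(labels))) / len(actual)
```

Because the function takes the label list as a parameter, the same code produces a 3 x 3 or larger matrix when the model predicts more than two classes.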
Lessons Learned
Lesson 1: Marketing research is fun - We get to work with a wide variety of datasets, dive in and learn all
about the market they’re operating in, and relay valuable insights back to stakeholders. We dig up
everything from why consumers make certain purchase decisions to what they’re passionate about and what
makes them tick.
Lesson 2: Collaboration is key - While doing this project we found that, however innovative individual
team members might be, collaboration is very important.
Lesson 3: Check, re-check and then check again - Projects move quickly, which means we don’t have time
to go back and re-collect data or make corrections to a report. Questionnaires, surveys, and reports must
be checked, checked by a coworker, and checked again.
Next Steps
The next step would be to explore other facets of Marketing Analysis, such as “Upsell and Cross Sell” and
“Recommendation Systems”. We can use techniques like Principal Component Analysis (PCA) and Linear
Discriminant Analysis (LDA) to reduce the number of features, and try Quadratic Discriminant Analysis
(QDA) as an additional classifier. We can also analyze time series data using the ARIMA algorithm.