R is a programming language and software environment for statistical computing and graphics. The R language is widely used among statisticians and data miners for developing statistical software and data analysis.
2. • A retail store network wishes to offer ‘young’ products
• 25-35 year olds
• 5 weeks of data
• 1M rows each week
• Random sample of 20% taken from each week
• Data analysed using RStudio
• summary(), reshape2, googleVis
Business Case
3. • dunnhumby: customer science company
• Uses data and science to help retailers and brands delight customers and earn their loyalty.
• dunnhumby provides 117 weeks of dummy till data. Each week’s data consists of over 1M entries. This analysis investigates five weeks of data.
Data Source
4. • Identify Young Adult and Young Family Customers within the customer dataset.
• Profile those customers using categories such as:
  • Time they shop
  • Day of the week
  • Spend
  • Affluence
• Compare Young Adult and Young Family Customers with population percentages.
Objectives
• The data was imported into R and a random sample of 200,000 transactions was extracted for each week.
• The samples were then combined using the rbind() command and summary statistics were generated.
• A store location table was added to the dataset to identify stores by location in Ireland. This enabled comparison with CSO data.
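The sampling, combining and lookup steps described above can be sketched in base R as follows. This is a minimal illustration on small synthetic data, not the original script: the helpers `make_week()` and `sample_week()`, the column names (`WEEK`, `BASKET_ID`, `STORE_CODE`, `SPEND`) and the `store_locations` table are all assumptions introduced for the example.

```r
set.seed(42)  # make the random samples reproducible

# Hypothetical stand-in for one week of till data (the real weeks have over 1M rows).
make_week <- function(week, n = 1000) {
  data.frame(
    WEEK       = week,
    BASKET_ID  = sprintf("W%d_B%04d", week, seq_len(n)),
    STORE_CODE = sample(c("STORE00001", "STORE00002", "STORE00003"), n, replace = TRUE),
    SPEND      = round(runif(n, 0.1, 20), 2),
    stringsAsFactors = FALSE
  )
}
weeks <- lapply(1:5, make_week)

# Draw a 20% random sample of rows (without replacement) from one week.
sample_week <- function(df, frac = 0.2) {
  df[sample(nrow(df), size = round(frac * nrow(df))), ]
}
samples <- lapply(weeks, sample_week)

# Stack the five weekly samples into one data frame with rbind().
combined <- do.call(rbind, samples)

# Hypothetical store location lookup, joined on the store code.
store_locations <- data.frame(
  STORE_CODE = c("STORE00001", "STORE00002", "STORE00003"),
  COUNTY     = c("Kildare", "Cork", "Galway"),
  stringsAsFactors = FALSE
)
combined <- merge(combined, store_locations, by = "STORE_CODE", all.x = TRUE)

# Summary statistics for the combined sample.
print(summary(combined$SPEND))
```

Using `all.x = TRUE` keeps every sampled transaction even if a store code has no match in the lookup table, so unmatched stores show up as `NA` rather than silently disappearing.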
9. • Missing Data
  • 20% of baskets were purchased by customers who did not have a customer code.
  • 30% of baskets were purchased by customers who did not have a Customer Lifestage classification.
• Reclassification
  • Given that there was no missing data for Basket Type and only 1% of data was missing for Basket Size and Basket Sensitivity, I investigated the data using summary statistics to see if I could reclassify “Other” and “Blanks” as per CUST_LIFESTAGE.
  • While similarities were identified within the dataset, I did not have sufficient evidence to reclassify.
  • Further work using Association Rules and Cluster Analysis is suggested.
Problems Encountered
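A check for missing values of the kind described on this slide might look like the sketch below. The data frame is invented for illustration (its values were chosen so that the missing shares come out at 20% and 30%), and the column names `CUST_CODE` and `BASKET_TYPE` are assumptions; only `CUST_LIFESTAGE` appears in the slides. Missing entries are assumed to be coded as `NA`.

```r
# Hypothetical sample; NA marks a basket with no customer code or lifestage.
baskets <- data.frame(
  CUST_CODE      = c("C1", NA, "C2", "C3", NA, "C4", "C5", "C1", "C6", "C7"),
  CUST_LIFESTAGE = c("YA", NA, "YF", NA, NA, "OT", "PE", "YA", "OA", "YF"),
  BASKET_TYPE    = c("Top Up", "Top Up", "Full Shop", "Small Shop", "Top Up",
                     "Full Shop", "Top Up", "Small Shop", "Full Shop", "Top Up"),
  stringsAsFactors = FALSE
)

# Percentage of missing values in each column.
missing_pct <- round(100 * colMeans(is.na(baskets)), 1)
print(missing_pct)

# Compare the Basket Type mix of baskets with and without a lifestage
# (row percentages), as a first look at whether the blanks resemble a known group.
has_lifestage <- ifelse(is.na(baskets$CUST_LIFESTAGE), "missing", "present")
profile <- round(100 * prop.table(table(has_lifestage, baskets$BASKET_TYPE), margin = 1), 1)
print(profile)
```

A comparison like `profile` only shows whether the two groups look similar on one variable; it does not by itself justify reclassifying the blanks.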
10. • The stores are open 7 days per week.
• The trading times are between 8am and 9pm, although there may be some variation at weekends.
• The quantity is the number of items of the same product bought in each basket.
• The mean quantity was 1.45 items, with a maximum of 275.
• The mean spend per basket was €1.90, with a maximum of €485.
• Product codes are listed with the number of baskets containing each product code.
• Customer Price Sensitivity:
  • 22% baskets = LA (Less Affluent),
  • 36% baskets = MM (Mid Market),
  • 21.6% baskets = UM (Up Market),
  • 3% = XX (Unclassified).
• Customer Lifestage:
  • 10% = OA (Old Age),
  • 5% = OF (Older Families),
  • 21% = OT (Other),
  • 6% = PE (Pensioner),
  • 11% = YA (Young Adult),
  • 17% = YF (Young Family).
Retail Data Investigation
11. • 297,533 baskets, representing 30% of the dataset, did not have a Customer Lifestage category assigned to them.
• Basket ID: All baskets have an ID.
  • There are no blank values.
• Basket Size:
  • 71% = Large Baskets,
  • 24% = Medium Baskets,
  • 5% = Small Baskets.
• 194,512 baskets, representing 19% of baskets, were not assigned a customer price sensitivity.
Retail Data Investigation
• Customer codes are listed with the number of times each customer appeared in the dataset.
• The most frequent customer in this sample purchased 56 times in the store.
• However, 194,512 baskets, representing 19% of the data, were purchased by customers with no customer number.
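Category percentages and per-customer counts like those listed on these slides can be produced with `table()` and `prop.table()`. The sketch below uses a handful of invented rows; the column names `CUST_CODE` and `BASKET_SIZE` and the codes `L`, `M`, `S` are assumptions for illustration.

```r
# Hypothetical rows; the values are invented for illustration.
baskets <- data.frame(
  CUST_CODE   = c("C1", "C2", "C1", NA, "C3", "C1", NA, "C2"),
  BASKET_SIZE = c("L", "L", "M", "L", "S", "L", "M", "L"),
  stringsAsFactors = FALSE
)

# Percentage of baskets in each Basket Size category.
size_pct <- round(100 * prop.table(table(baskets$BASKET_SIZE)), 1)
print(size_pct)

# Number of appearances per customer, most frequent first.
# table() drops NA by default, so baskets without a customer code are excluded here.
cust_freq <- sort(table(baskets$CUST_CODE), decreasing = TRUE)
print(head(cust_freq, 3))

# Baskets with no customer code, as a count and as a share of all baskets.
n_missing <- sum(is.na(baskets$CUST_CODE))
cat(sprintf("%d baskets (%.1f%%) have no customer code\n",
            n_missing, 100 * n_missing / nrow(baskets)))
```

Because `table()` ignores `NA` unless `useNA` is set, the missing codes have to be counted separately, which is why the last step uses `is.na()` directly.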
12. • Basket Price Sensitivity:
  • 27.6% = LA (Less Affluent),
  • 46.7% = MM (Mid Market),
  • 25.4% = UM (Up Market),
  • 0.2% = XX (Unclassified).
• Basket Type:
  • Full Shop = 37%,
  • Small Shop = 18.6%,
  • Top Up = 44%,
  • XX = 1%.
• Basket Dominant Content:
  • Fresh = 50%,
  • Grocery = 10%,
  • Mixed = 38%,
  • Non Food = 1%,
  • Unclassified = 1%.
Retail Data Investigation
• Store Code: Each store had a store code.
  • The most popular store was STORE00696.
• Store Format:
  • 64% = Large,
  • 21% = Medium,
  • 6.5% = Small,
  • 8.5% = Extra Large.
• Each store was identified by region.
14. • Young Adult and Young Family represent 28% of the Customer Dataset
• summary() used to create summary statistics in R
Summary Statistics
15. • Average Shopping Times
  • Young Family: mean of 14.9 hours on the 24-hour clock (about 2:54pm)
    • 50% of shopping completed between 2pm and 6pm
  • Young Adult: mean of 15.9 hours on the 24-hour clock (about 3:54pm)
    • 50% of shopping completed between 1pm and 6pm
• Day of Week
  • No preferred shopping day for any of the Lifestage categories.
• subset(), mean(), quantile() used in R
Shopping Patterns
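The shopping-time figures above could be derived along the following lines with `subset()`, `mean()` and `quantile()`. This is a sketch on randomly generated data, so its numbers will not match the slide. The column names `SHOP_HOUR` and `SHOP_WEEKDAY`, and the helper `to_clock()` that reads a mean such as 14.9 as decimal hours, are assumptions introduced here.

```r
set.seed(1)

# Hypothetical data: SHOP_HOUR is the hour of purchase on a 24-hour clock (8 to 21).
baskets <- data.frame(
  CUST_LIFESTAGE = sample(c("YA", "YF", "OT"), 300, replace = TRUE),
  SHOP_HOUR      = sample(8:21, 300, replace = TRUE),
  SHOP_WEEKDAY   = sample(1:7, 300, replace = TRUE),
  stringsAsFactors = FALSE
)

# Keep only Young Family baskets.
young_family <- subset(baskets, CUST_LIFESTAGE == "YF")

# Mean shopping hour and quartiles; the middle 50% of shopping
# falls between the 25% and 75% quantiles.
mean_hour <- mean(young_family$SHOP_HOUR)
quartiles <- quantile(young_family$SHOP_HOUR, probs = c(0.25, 0.5, 0.75))

# Convert a decimal hour to a clock time, e.g. 14.9 becomes "14:54".
to_clock <- function(h) {
  total_min <- as.integer(round(h * 60))
  sprintf("%02d:%02d", total_min %/% 60L, total_min %% 60L)
}

cat("Mean shopping time:", to_clock(mean_hour), "\n")
print(quartiles)

# Share of Young Family baskets on each day of the week.
day_pct <- round(100 * prop.table(table(young_family$SHOP_WEEKDAY)), 1)
print(day_pct)
```

Roughly equal values in `day_pct` would correspond to the "no preferred shopping day" finding.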
16. Region  County      YA      YF      (YF + YA) as % of Customer Base
    E01     Kildare     7494    12423   27%
    E02     Laois       7304    12182   27%
    E03     Meath       7692    12381   27%
    N01     Donegal     11071   17605   29%
    N02     Monaghan    9368    14543   28%
    N03     Cavan       9998    16122   27%
    S01     Limerick    8443    14310   28%
    S02     Cork        11071   16919   27%
    S03     Kerry       8490    13714   28%
    W01     Mayo        8256    13618   27%
    W02     Galway      9807    15197   27%
    W03     Roscommon   6857    11307   27%
YA & YF as % of Population
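A regional breakdown of this shape can be built by cross-tabulating region against lifestage. The sketch below uses a few invented rows and assumes the "customer base" of a region is every row in that region with a lifestage recorded; the slides do not state the denominator that was actually used. The column names `REGION`, `BASE` and `PCT_YOUNG` are likewise assumptions.

```r
# Hypothetical rows: one row per basket with its store region and customer lifestage.
customers <- data.frame(
  REGION         = c("E01", "E01", "E01", "E01", "N01", "N01", "N01", "N01", "N01"),
  CUST_LIFESTAGE = c("YA", "YF", "OT", "PE", "YA", "YA", "YF", "OA", "OF"),
  stringsAsFactors = FALSE
)

# Cross-tabulate region by lifestage.
counts <- table(customers$REGION, customers$CUST_LIFESTAGE)

# One row per region: YA count, YF count, regional base, and the YA + YF share.
region_summary <- data.frame(
  REGION = rownames(counts),
  YA     = as.vector(counts[, "YA"]),
  YF     = as.vector(counts[, "YF"]),
  BASE   = as.vector(rowSums(counts)),
  stringsAsFactors = FALSE
)
region_summary$PCT_YOUNG <- round(
  100 * (region_summary$YA + region_summary$YF) / region_summary$BASE
)
print(region_summary)
```

Joining this summary to a county lookup by region code would then give the County column shown in the table.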
18. • Young Adult and Young Family customers make up a greater proportion of the customer base than of the general population.
  • 18% - Average Target Population in Ireland
  • 28% - Average Target Population in Customer Database
• Suggest running promotions between 1pm and 6pm.
• There is no preferred day for running promotions, as every day is equally important.
Outcome