This document summarizes several winning solutions from Kaggle competitions related to retail sales forecasting. It describes the data and metrics used in the competitions and highlights some common techniques from top solutions, including feature engineering of recent and temporal data, using gradient boosted trees and ensembles of models, and incorporating additional contextual data like weather and promotions.
7. 1st place solution with feature extraction
Recent data
Temporal information
Current trends
Store information
Weather
https://www.kaggle.com/c/rossmann-store-sales/discussion/18024
8. Feature extraction – Recent data
Statistics of recent sales, computed per key over several time periods:
• Keys: store, day of week, promotions, holidays
• Time periods: previous month, last quarter, last half year, last year, last 2 years
• Stats: median, mean, harmonic mean, standard deviation, skewness, kurtosis, 10%/90% percentiles
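The keys × time periods × stats grid above can be sketched with a pandas group-by aggregation. This is a minimal illustration on toy data with hypothetical column names (the actual competition data uses Store/Date/Sales); in practice the frame would first be filtered to one time period (last quarter, last year, ...).

```python
import pandas as pd

# Toy sales table (hypothetical schema)
df = pd.DataFrame({
    "store": [1] * 6 + [2] * 6,
    "sales": [10.0, 12, 11, 30, 9, 14, 20, 22, 19, 25, 21, 18],
})

# Statistics of recent sales per key (here: store)
stats = df.groupby("store")["sales"].agg(
    median="median",
    mean="mean",
    hmean=lambda s: len(s) / (1.0 / s).sum(),  # harmonic mean
    std="std",
    skew="skew",
    kurt=lambda s: s.kurt(),
    p10=lambda s: s.quantile(0.10),
    p90=lambda s: s.quantile(0.90),
)
```

The same pattern extends to the other keys (day of week, promotion flag, holiday flag) by changing the group-by columns.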
9. Feature extraction – Temporal information
Day counters (how each record relates to events or cycles)
◦ The number of days before, after or within the event
◦ Events
◦ Promotion cycle
◦ Summer holidays
◦ Store refurbishment
◦ Start of competition and start of secondary promotion cycle
Day of week, day of month, day/week/month of year
Number of holidays during the current week, last week and next week
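The day counters above can be computed as in this minimal sketch (hypothetical event dates; the real solution tracks promotions, holidays, refurbishments, and competition starts the same way):

```python
import pandas as pd

def day_counters(dates, events):
    """Days since the previous event and days until the next event, per date."""
    events = pd.DatetimeIndex(events).sort_values()
    rows = []
    for d in dates:
        past = events[events <= d]
        future = events[events >= d]
        rows.append({
            "date": d,
            "days_since_event": (d - past[-1]).days if len(past) else None,
            "days_until_event": (future[0] - d).days if len(future) else None,
        })
    return pd.DataFrame(rows)

dates = pd.date_range("2015-07-01", periods=8, freq="D")
events = ["2015-07-04", "2015-07-08"]  # hypothetical promotion start dates
counters = day_counters(dates, events)
```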
10. Feature extraction – Current trends
Last quarter and last year
Store specific linear model (ridge regression) on
◦ The day number, to extrapolate the trend six weeks into the test period
◦ Day of week
◦ Promotions
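A minimal numpy sketch of the per-store trend idea: ridge regression on the day number over a synthetic series, extrapolated 42 days (six weeks) beyond the training range. The data and the alpha value are assumptions for illustration; the real model also includes day-of-week and promotion terms.

```python
import numpy as np

# Hypothetical single-store series: linear trend + noise
rng = np.random.default_rng(0)
day = np.arange(100.0)
sales = 100.0 + 0.5 * day + rng.normal(0, 2, size=100)

# Closed-form ridge regression on [intercept, day_number]
X = np.column_stack([np.ones_like(day), day])
alpha = 1.0
w = np.linalg.solve(X.T @ X + alpha * np.eye(2), X.T @ sales)

# Extrapolate the fitted trend six weeks (42 days) past the training range
future = np.arange(100.0, 142.0)
trend_feature = w[0] + w[1] * future
```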
11. Feature extraction – Other features
Store features
◦ Assortment
◦ Store type
◦ Aggregates by store
◦ Average sales per customer
◦ Ratio of sales during promotions/holidays/Saturdays
◦ Proportion of school holidays and of days the store is open
State specific weather
◦ Max temperature
◦ Precipitation (mm)
13. Model training
• XGBoost models trained on random selections of features
• Hand-picked models
• 500 random models, validating each pair of models as an ensemble
• Take the features from all selected models and combine them into one model
• Separate models for the months May to September
• Month-ahead models
• Log-transformed the target variable; a constant multiplier (0.985) applied to the predictions
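The log transform and the constant correction factor can be sketched in a few lines (the model fit itself is stubbed out; only the transform round-trip is shown):

```python
import numpy as np

# Target is log-transformed for training: log(1 + sales)
y = np.array([0.0, 10.0, 100.0])
y_log = np.log1p(y)

# ... fit a model on y_log ...
pred_log = y_log                       # stand-in for a fitted model's output

# Invert the transform, then apply the constant multiplier from the write-up
pred = np.expm1(pred_log) * 0.985
```

A multiplier slightly below 1 shrinks predictions, which helps when the squared-error objective on the log scale tends to overshoot after back-transformation.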
15. 3rd place solution with entity embeddings
https://arxiv.org/pdf/1604.06737.pdf
18. Data
• unit_sales by date
• store_nbr
• item_nbr
• onpromotion - whether that item_nbr was on promotion for a specified date and store_nbr
• Store metadata, including city, state, type, and cluster
• Item metadata, including family, class, and perishable
• The count of sales transactions for each date/store_nbr combination
• Daily oil price
• Holidays and events, with metadata
19. Metric
Normalized Weighted Root Mean Squared Logarithmic Error
$$\mathrm{NWRMSLE} = \sqrt{\frac{\sum_{i=1}^{n} w_i \left(\log(\hat{y}_i + 1) - \log(y_i + 1)\right)^2}{\sum_{i=1}^{n} w_i}}$$
The weights w_i can be found in the items.csv file (see the Data page): perishable items are given a weight of 1.25, while all other items are given a weight of 1.00.
RMSLE incurs a larger penalty for underestimation of the actual value than for overestimation.
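The NWRMSLE metric is short enough to implement directly, which also makes its asymmetry easy to verify:

```python
import numpy as np

def nwrmsle(y_true, y_pred, weights):
    """Normalized Weighted Root Mean Squared Logarithmic Error."""
    y_true, y_pred, weights = map(np.asarray, (y_true, y_pred, weights))
    sq = weights * (np.log1p(y_pred) - np.log1p(y_true)) ** 2
    return np.sqrt(sq.sum() / weights.sum())
```

Underestimating an actual of 10 by 5 units is penalized more than overestimating it by the same 5 units, because the log differences are larger on the low side.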
28. 1st place solution
• Single LGBM model with objective = tweedie
• Divide the data into groups with similar time series and model each group separately (e.g. by store, by store/category, by store/department)
• Select the final model using the mean and standard deviation of the CV scores and the public score
• Multiple validation sets
• Ensemble of non-recursive and recursive models (recursive: predictions for earlier test days feed the lag features of later days)
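A Tweedie objective suits sales counts with many zeros and a skewed positive tail. The parameter names below come from LightGBM; the values are illustrative assumptions, not the winner's exact settings.

```python
# Illustrative LightGBM parameters for a Tweedie-objective model
# (values are assumptions for the sketch, not the winning configuration)
params = {
    "objective": "tweedie",
    "tweedie_variance_power": 1.1,  # between Poisson (1) and Gamma (2)
    "metric": "rmse",
    "learning_rate": 0.05,
    "num_leaves": 128,
}
```

The variance power interpolates between a Poisson-like model (good for counts) and a Gamma-like model (good for heavy right tails).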
29. 1st place solution
Store/item price features
◦ Max
◦ Min
◦ Std
◦ Mean
◦ Price_norm: price divided by its max
◦ Price_nunique: number of distinct prices
◦ Item_nunique: number of items sharing the same price
◦ Price momentum by month/year
• Calendar features
• Day
• Week
• Month
• Year index
• Day of week
• Weekend
• Event
• State
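The price features reduce to per-item group-by transforms, as in this minimal sketch on toy data (hypothetical schema; the M5 data uses item_id/sell_price):

```python
import pandas as pd

df = pd.DataFrame({
    "item_id": ["i1"] * 3 + ["i2"] * 3,
    "sell_price": [2.0, 2.5, 2.0, 4.0, 4.0, 5.0],
})

g = df.groupby("item_id")["sell_price"]
df["price_max"] = g.transform("max")
df["price_min"] = g.transform("min")
df["price_std"] = g.transform("std")
df["price_mean"] = g.transform("mean")
df["price_norm"] = df["sell_price"] / df["price_max"]  # price divided by its max
df["price_nunique"] = g.transform("nunique")
```

`transform` broadcasts each group statistic back onto the original rows, so the features align row-for-row with the training frame.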
30. Lag features
◦ 28-day shift, for 14 consecutive days
Lag rolling features
◦ 28 day shift
◦ Time window of [7,14, 30, 60, 180]
◦ Rolling mean/std
Rolling with shift [1,7,14]
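A minimal pandas sketch of the lag and lag-rolling features (toy single-series panel, hypothetical column names). The 28-day shift keeps the features valid for a 28-day forecast horizon; rolling stats are computed on the shifted series so no future data leaks in.

```python
import numpy as np
import pandas as pd

# Toy panel: one series id, 60 daily observations
df = pd.DataFrame({
    "id": ["A"] * 60,
    "d": np.arange(60),
    "sales": np.arange(60, dtype=float),
})

g = df.groupby("id")["sales"]
df["lag_28"] = g.shift(28)  # 28-day shift
# Rolling mean/std over the shifted series (window of 7 shown;
# the solution also uses 14, 30, 60, and 180)
df["roll_mean_7"] = g.transform(lambda s: s.shift(28).rolling(7).mean())
df["roll_std_7"] = g.transform(lambda s: s.shift(28).rolling(7).std())
```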
• Mean encoding features (mean and std)
• ['state_id']
• ['store_id']
• ['cat_id']
• ['dept_id']
• ['state_id', 'cat_id']
• ['state_id', 'dept_id']
• ['store_id', 'cat_id']
• ['store_id', 'dept_id']
• ['item_id']
• ['item_id', 'state_id']
• ['item_id', 'store_id']
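The mean-encoding features are per-group means and standard deviations of the target, broadcast back to each row. A minimal sketch with two of the key combinations above (toy data, hypothetical values):

```python
import pandas as pd

df = pd.DataFrame({
    "store_id": ["s1", "s1", "s2", "s2"],
    "cat_id":   ["food", "toys", "food", "toys"],
    "sales":    [10.0, 20.0, 30.0, 40.0],
})

# One mean/std encoding per key combination
for keys in [["store_id"], ["cat_id"], ["store_id", "cat_id"]]:
    name = "_".join(keys)
    df[f"enc_{name}_mean"] = df.groupby(keys)["sales"].transform("mean")
    df[f"enc_{name}_std"] = df.groupby(keys)["sales"].transform("std")
```

Note that encoding on the full data leaks the target; in practice the statistics are computed on the training period only.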
31. 2nd place – Aligning top and bottom
Bottom level: lgb model for each store without lag/rolling
Top 5 levels: [ALL, STATE, STORE, CATEGORY, DEPARTMENT]
32. 2nd place – Aligning top and bottom
Magic multipliers
Overshot fit (multiplier > 1)
Level 1 : all_id
“Build a set of bottom-level models with multipliers ranging from 0.9 to 1.23; the optimum is somewhere in the range 0.93-0.95; build an ensemble with 0.9, 0.93, 0.95, 0.97 and 0.99”
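The multiplier ensemble is just an average of the same bottom-level forecast scaled by each candidate factor, as in this sketch (forecast values hypothetical; multipliers from the write-up):

```python
import numpy as np

base = np.array([100.0, 50.0, 80.0])           # hypothetical bottom-level forecast
multipliers = [0.9, 0.93, 0.95, 0.97, 0.99]    # values quoted in the write-up
ensemble = np.mean([base * m for m in multipliers], axis=0)
```

Because scaling and averaging commute, this is equivalent to one multiplier equal to the mean of the five, landing near the stated 0.93-0.95 optimum.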
33. 2nd place – Aligning top and bottom
Alignment at level 1
N-BEATS prediction at level 1
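One simple way to align levels is proportional rescaling: scale the bottom-level forecasts so their sum matches the top-level (aggregate) forecast. This sketch uses hypothetical numbers; the write-up's actual reconciliation may differ in detail.

```python
import numpy as np

bottom = np.array([30.0, 50.0, 40.0])    # per-series forecasts (sum to 120)
top = 110.0                              # aggregate forecast at level 1
aligned = bottom * (top / bottom.sum())  # rescale bottom level to match the top
```

After alignment the bottom-level forecasts inherit the (often more stable) aggregate signal while keeping their relative proportions.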
34. Conclusion
• Data is the key: promotions, competition, and holidays in addition to sales data
• The metric needs to be customized for different problems
• Statistical features can be helpful (lags, rolling stats over different combinations of keys)
• GBM is still the most popular model for prediction
• Separate models may be needed for different stores, locations, or time periods
• Low-level predictions can be improved by aligning them with high-level predictions
35. Win a Kaggle competition?
• Ideas and Inspiration
• Best practice