This document summarizes several winning solutions from Kaggle competitions related to retail sales forecasting. It describes the data and metrics used in the competitions and highlights some common techniques from top solutions, including feature engineering of recent and temporal data, using gradient boosted trees and ensembles of models, and incorporating additional contextual data like weather and promotions.
7. 1st place solution with feature extraction
Recent data
Temporal information
Current trends
Store information
Weather
https://www.kaggle.com/c/rossmann-store-sales/discussion/18024
8. Feature extraction – Recent data
Statistics of recent sales, computed per key over several time periods:
• Keys: store, day of week, promotions, holidays
• Time periods: previous month, last quarter, last half year, last year, last 2 years
• Stats: median, mean, harmonic mean, standard deviation, skewness, kurtosis, 10%/90% percentiles
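The keys × time periods × stats grid above can be sketched with a pandas group-by aggregation. This is a minimal illustration on toy data with hypothetical column names (the actual competition data uses Store/Date/Sales); in practice the frame would first be filtered to one time period (last quarter, last year, ...).

```python
import pandas as pd

# Toy sales table (hypothetical schema)
df = pd.DataFrame({
    "store": [1] * 6 + [2] * 6,
    "sales": [10.0, 12, 11, 30, 9, 14, 20, 22, 19, 25, 21, 18],
})

# Statistics of recent sales per key (here: store)
stats = df.groupby("store")["sales"].agg(
    median="median",
    mean="mean",
    hmean=lambda s: len(s) / (1.0 / s).sum(),  # harmonic mean
    std="std",
    skew="skew",
    kurt=lambda s: s.kurt(),
    p10=lambda s: s.quantile(0.10),
    p90=lambda s: s.quantile(0.90),
)
```

The same pattern extends to the other keys (day of week, promotion flag, holiday flag) by changing the group-by columns.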
9. Feature extraction – Temporal information
Day counters (how each record relates to events or cycles)
◦ The number of days before, after or within the event
◦ Events
◦ Promotion cycle
◦ Summer holidays
◦ Store refurbishment
◦ Start of competition and start of secondary promotion cycle
Day of week, day of month, day/week/month of year
Number of holidays during the current week, last week and next week
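The day counters above can be computed as in this minimal sketch (hypothetical event dates; the real solution tracks promotions, holidays, refurbishments, and competition starts the same way):

```python
import pandas as pd

def day_counters(dates, events):
    """Days since the previous event and days until the next event, per date."""
    events = pd.DatetimeIndex(events).sort_values()
    rows = []
    for d in dates:
        past = events[events <= d]
        future = events[events >= d]
        rows.append({
            "date": d,
            "days_since_event": (d - past[-1]).days if len(past) else None,
            "days_until_event": (future[0] - d).days if len(future) else None,
        })
    return pd.DataFrame(rows)

dates = pd.date_range("2015-07-01", periods=8, freq="D")
events = ["2015-07-04", "2015-07-08"]  # hypothetical promotion start dates
counters = day_counters(dates, events)
```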
10. Feature extraction – Current trends
Last quarter and last year
Store specific linear model (ridge regression) on
◦ The day number, to extrapolate the trend six weeks into the test period
◦ Day of week
◦ Promotions
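A minimal numpy sketch of the per-store trend idea: ridge regression on the day number over a synthetic series, extrapolated 42 days (six weeks) beyond the training range. The data and the alpha value are assumptions for illustration; the real model also includes day-of-week and promotion terms.

```python
import numpy as np

# Hypothetical single-store series: linear trend + noise
rng = np.random.default_rng(0)
day = np.arange(100.0)
sales = 100.0 + 0.5 * day + rng.normal(0, 2, size=100)

# Closed-form ridge regression on [intercept, day_number]
X = np.column_stack([np.ones_like(day), day])
alpha = 1.0
w = np.linalg.solve(X.T @ X + alpha * np.eye(2), X.T @ sales)

# Extrapolate the fitted trend six weeks (42 days) past the training range
future = np.arange(100.0, 142.0)
trend_feature = w[0] + w[1] * future
```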
11. Feature extraction – Other features
Store features
◦ Assortment
◦ Store type
◦ Aggregates by store
◦ Average sales per customer
◦ Ratio of sales during promotions/holidays/Saturdays
◦ Proportion of school holidays and of days the store is open
State specific weather
◦ Max temperature
◦ Precipitation (mm)
13. Model training
• XGBoost models trained on random selections of features
• Hand-picked models
• 500 random models, validating each pair of models as an ensemble
• Take the features from all selected models and combine them into one model
• Separate models for the months May to September
• Month-ahead models
• Log-transformed the target variable; a constant multiplier (0.985) applied to the predictions
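The log transform and the constant correction factor can be sketched in a few lines (the model fit itself is stubbed out; only the transform round-trip is shown):

```python
import numpy as np

# Target is log-transformed for training: log(1 + sales)
y = np.array([0.0, 10.0, 100.0])
y_log = np.log1p(y)

# ... fit a model on y_log ...
pred_log = y_log                       # stand-in for a fitted model's output

# Invert the transform, then apply the constant multiplier from the write-up
pred = np.expm1(pred_log) * 0.985
```

A multiplier slightly below 1 shrinks predictions, which helps when the squared-error objective on the log scale tends to overshoot after back-transformation.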
15. 3rd place solution with entity embeddings
https://arxiv.org/pdf/1604.06737.pdf
18. Data
• unit_sales by date
• store_nbr
• item_nbr
• onpromotion - whether that item_nbr was on promotion for a specified date and store_nbr
• Store metadata, including city, state, type, and cluster
• Item metadata, including family, class, and perishable
• The count of sales transactions for each date/store_nbr combination
• Daily oil price
• Holidays and events, with metadata
19. Metric
Normalized Weighted Root Mean Squared Logarithmic Error
$$\mathrm{NWRMSLE} = \sqrt{\frac{\sum_{i=1}^{n} w_i \left(\log(\hat{y}_i + 1) - \log(y_i + 1)\right)^2}{\sum_{i=1}^{n} w_i}}$$
The weights w_i can be found in the items.csv file (see the Data page): perishable items are given a weight of 1.25, while all other items are given a weight of 1.00.
RMSLE incurs a larger penalty for underestimation of the actual value than for overestimation.
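The NWRMSLE metric is short enough to implement directly, which also makes its asymmetry easy to verify:

```python
import numpy as np

def nwrmsle(y_true, y_pred, weights):
    """Normalized Weighted Root Mean Squared Logarithmic Error."""
    y_true, y_pred, weights = map(np.asarray, (y_true, y_pred, weights))
    sq = weights * (np.log1p(y_pred) - np.log1p(y_true)) ** 2
    return np.sqrt(sq.sum() / weights.sum())
```

Underestimating an actual of 10 by 5 units is penalized more than overestimating it by the same 5 units, because the log differences are larger on the low side.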
28. 1st place solution
• Single LGBM model with objective = tweedie
• Divide the data into groups with similar time series and model each group separately (e.g. by store, by store/category, by store/department)
• Select the final model using the mean and standard deviation of the CV scores and the public score
• Multiple validation sets
• Ensemble of non-recursive and recursive models (recursive: predictions for earlier test days feed the lag features of later days)
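A Tweedie objective suits sales counts with many zeros and a skewed positive tail. The parameter names below come from LightGBM; the values are illustrative assumptions, not the winner's exact settings.

```python
# Illustrative LightGBM parameters for a Tweedie-objective model
# (values are assumptions for the sketch, not the winning configuration)
params = {
    "objective": "tweedie",
    "tweedie_variance_power": 1.1,  # between Poisson (1) and Gamma (2)
    "metric": "rmse",
    "learning_rate": 0.05,
    "num_leaves": 128,
}
```

The variance power interpolates between a Poisson-like model (good for counts) and a Gamma-like model (good for heavy right tails).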
29. 1st place solution
Store/item price features
◦ Max
◦ Min
◦ Std
◦ Mean
◦ Price_norm: price divided by its max
◦ Price_nunique: number of distinct prices
◦ Item_nunique: number of items sharing the same price
◦ Price momentum by month/year
• Calendar features
• Day
• Week
• Month
• Year index
• Day of week
• Weekend
• Event
• State
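The price features reduce to per-item group-by transforms, as in this minimal sketch on toy data (hypothetical schema; the M5 data uses item_id/sell_price):

```python
import pandas as pd

df = pd.DataFrame({
    "item_id": ["i1"] * 3 + ["i2"] * 3,
    "sell_price": [2.0, 2.5, 2.0, 4.0, 4.0, 5.0],
})

g = df.groupby("item_id")["sell_price"]
df["price_max"] = g.transform("max")
df["price_min"] = g.transform("min")
df["price_std"] = g.transform("std")
df["price_mean"] = g.transform("mean")
df["price_norm"] = df["sell_price"] / df["price_max"]  # price divided by its max
df["price_nunique"] = g.transform("nunique")
```

`transform` broadcasts each group statistic back onto the original rows, so the features align row-for-row with the training frame.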
30. Lag features
◦ 28-day shift, for 14 consecutive days
Lag rolling features
◦ 28 day shift
◦ Time window of [7,14, 30, 60, 180]
◦ Rolling mean/std
Rolling with shift [1,7,14]
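A minimal pandas sketch of the lag and lag-rolling features (toy single-series panel, hypothetical column names). The 28-day shift keeps the features valid for a 28-day forecast horizon; rolling stats are computed on the shifted series so no future data leaks in.

```python
import numpy as np
import pandas as pd

# Toy panel: one series id, 60 daily observations
df = pd.DataFrame({
    "id": ["A"] * 60,
    "d": np.arange(60),
    "sales": np.arange(60, dtype=float),
})

g = df.groupby("id")["sales"]
df["lag_28"] = g.shift(28)  # 28-day shift
# Rolling mean/std over the shifted series (window of 7 shown;
# the solution also uses 14, 30, 60, and 180)
df["roll_mean_7"] = g.transform(lambda s: s.shift(28).rolling(7).mean())
df["roll_std_7"] = g.transform(lambda s: s.shift(28).rolling(7).std())
```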
• Mean encoding features (mean and std)
• ['state_id']
• ['store_id']
• ['cat_id']
• ['dept_id']
• ['state_id', 'cat_id']
• ['state_id', 'dept_id']
• ['store_id', 'cat_id']
• ['store_id', 'dept_id']
• ['item_id']
• ['item_id', 'state_id']
• ['item_id', 'store_id']
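The mean-encoding features are per-group means and standard deviations of the target, broadcast back to each row. A minimal sketch with two of the key combinations above (toy data, hypothetical values):

```python
import pandas as pd

df = pd.DataFrame({
    "store_id": ["s1", "s1", "s2", "s2"],
    "cat_id":   ["food", "toys", "food", "toys"],
    "sales":    [10.0, 20.0, 30.0, 40.0],
})

# One mean/std encoding per key combination
for keys in [["store_id"], ["cat_id"], ["store_id", "cat_id"]]:
    name = "_".join(keys)
    df[f"enc_{name}_mean"] = df.groupby(keys)["sales"].transform("mean")
    df[f"enc_{name}_std"] = df.groupby(keys)["sales"].transform("std")
```

Note that encoding on the full data leaks the target; in practice the statistics are computed on the training period only.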
31. 2nd place – Aligning top and bottom
Bottom level: lgb model for each store without lag/rolling
Top 5 levels: [ALL, STATE, STORE, CATEGORY, DEPARTMENT]
32. 2nd place – Aligning top and bottom
Magic multipliers
Overshot fit (multiplier > 1)
Level 1 : all_id
“Build a set of bottom-level models with multipliers ranging from 0.9 to 1.23; the optimum is somewhere in the range 0.93-0.95; build an ensemble with 0.9, 0.93, 0.95, 0.97 and 0.99”
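The multiplier ensemble is just an average of the same bottom-level forecast scaled by each candidate factor, as in this sketch (forecast values hypothetical; multipliers from the write-up):

```python
import numpy as np

base = np.array([100.0, 50.0, 80.0])           # hypothetical bottom-level forecast
multipliers = [0.9, 0.93, 0.95, 0.97, 0.99]    # values quoted in the write-up
ensemble = np.mean([base * m for m in multipliers], axis=0)
```

Because scaling and averaging commute, this is equivalent to one multiplier equal to the mean of the five, landing near the stated 0.93-0.95 optimum.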
33. 2nd place – Aligning top and bottom
Alignment at level 1
N-BEATS prediction at level 1
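One simple way to align levels is proportional rescaling: scale the bottom-level forecasts so their sum matches the top-level (aggregate) forecast. This sketch uses hypothetical numbers; the write-up's actual reconciliation may differ in detail.

```python
import numpy as np

bottom = np.array([30.0, 50.0, 40.0])    # per-series forecasts (sum to 120)
top = 110.0                              # aggregate forecast at level 1
aligned = bottom * (top / bottom.sum())  # rescale bottom level to match the top
```

After alignment the bottom-level forecasts inherit the (often more stable) aggregate signal while keeping their relative proportions.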
34. Conclusion
• Data is the key: promotions, competition, and holidays in addition to sales data
• The metric needs to be customized for different problems
• Statistical features can be helpful (lags, rolling stats over different combinations of keys)
• GBM is still the most popular model for prediction
• Separate models may be needed for different stores, locations, or time periods
• Low-level predictions can be improved by aligning them with high-level predictions
35. Win a Kaggle competition?
• Ideas and Inspiration
• Best practice