Introduction
I recently started working on the Tabular Playground Series - Jun 2021 competition on Kaggle. In this competition,
Kaggle provides a tabular dataset that is synthetic, but based on a real dataset and generated using a CTGAN. The
original dataset deals with predicting the category of an eCommerce product given various attributes about the listing.
Although the features are anonymized, they retain properties of real-world features. This is a classification
challenge, where participants need to train ML models to predict a discrete category (one of 9 possible categories)
from 75 numerical features.
I decided to participate in this competition because it was a great opportunity to practice my tabular data skills. I will
share my approach to this competition in a series of three blogs, starting with this one. My aim has been to build a
simple, yet complete ML pipeline for this competition (which includes basic models, cross-validation, ensembling, and
submission). My focus in these blogs will be on explaining how I built this pipeline and how it can be improved further.
In this post, I will briefly explain the individual parts of the pipeline at a high level. Here are the three main parts of this
pipeline:
1. Data loading and processing
2. Model building and cross-validation
3. Submission
In the second blog, I will focus on the specifics of the algorithms used. And in the third, I will examine the results (speed
and accuracy) produced by the algorithms on various hardware.
Environment
All the code in this blog was run on a Kaggle kernel and a ZBook Studio mobile workstation (Z by HP) powered by an
NVIDIA RTX 5000 Mobile GPU.
Data loading and processing
First, we need to load the data we will work with. In our case, that means three CSV files: train.csv
(training data), test.csv (testing data), and sample_submission.csv (the submission format file). Once the data is
loaded, we extract the features and targets from the pandas dataframes as numpy arrays, which is easy to do with
numpy and pandas after some basic processing. These arrays will later be fed into the models for training and inference.
import numpy as np
import pandas as pd

"""
Load dataframes
"""
data_path = "/path/to/competition/data"
train_df = pd.read_csv(data_path + "/train.csv")
test_df = pd.read_csv(data_path + "/test.csv")
sample_submission = pd.read_csv(data_path + "/sample_submission.csv")

"""
Prepare data for training, validation, and testing
"""
# The dataset has 75 anonymized numerical features: feature_0 ... feature_74.
feat_cols = ["feature_{}".format(i) for i in range(75)]
train_X = np.float32(train_df[feat_cols].values)
test_X = np.float32(test_df[feat_cols].values)
# Targets are the strings "Class_1" ... "Class_9"; map them to integers 0-8.
train_y = train_df["target"]
train_y = train_y.apply(lambda y: int(y[-1]) - 1).values
I also tried using NVIDIA RAPIDS (specifically its cudf dataframe library) to speed up the dataframe loading, and it
worked well. Here are the loading times for the three files, train.csv, test.csv, and sample_submission.csv, on
both machines (all times in seconds).
                    Kaggle N.B.: pd   Kaggle N.B.: cudf   ZBook Studio: pd   ZBook Studio: cudf
Training data       1.421             1.229               1.198              1.067
Testing data        1.284             0.032               0.821              0.021
Sample submission   0.226             0.010               0.143              0.007
It can be seen that cudf loads the testing and sample-submission files dramatically faster than pandas; the gain on
the training data is smaller, likely because the first cudf call also pays a one-time GPU initialization cost. Also,
the ZBook Studio mobile workstation is faster than the Kaggle notebook with both pandas and cudf, thanks to having
more RAM, CPU cores, and GPU CUDA cores. On both machines, the GPU acceleration is dramatic.
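For reference, here is a minimal sketch of the cudf loading path used for these timings. It assumes RAPIDS is installed; cudf.read_csv mirrors the pandas call above, and to_pandas() converts the result back for libraries that expect CPU data.
import time

import cudf  # NVIDIA RAPIDS GPU dataframe library

# Time the GPU-accelerated CSV load. Note that the first cudf call
# also initializes the CUDA context, which inflates its timing.
start = time.time()
train_df_gpu = cudf.read_csv(data_path + "/train.csv")
print("train.csv loaded in {:.3f}s".format(time.time() - start))

# Convert back to pandas for downstream libraries that expect CPU data.
train_df = train_df_gpu.to_pandas()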
Model building and cross-validation
The next step is to build a model to fit the data and cross-validate it. We will build the model using popular libraries like
xgboost, lightgbm, catboost, and tensorflow. I will dig into more specifics on modelling in the second blog. For
now, assume that build_model builds a new model, fit trains the model and validates it, and predict_proba returns
the predicted probabilities during testing.
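To make those assumptions concrete, here is one hedged sketch of what build_model could look like, using xgboost's scikit-learn wrapper as the backend. The Model wrapper class is my illustration, not the pipeline's actual implementation; it exists only to adapt XGBClassifier's eval_set argument to the validation_data keyword assumed in this post.
from xgboost import XGBClassifier

class Model:
    """Thin wrapper giving xgboost the fit()/predict_proba()
    interface assumed in this post (illustrative only)."""

    def __init__(self, model_parameters):
        self.clf = XGBClassifier(**model_parameters)

    def fit(self, X, y, validation_data=None):
        # xgboost's scikit-learn API takes eval_set (a list of
        # (X, y) pairs), so translate validation_data into it.
        eval_set = list(validation_data) if validation_data else None
        self.clf.fit(X, y, eval_set=eval_set)
        return self

    def predict_proba(self, X):
        return self.clf.predict_proba(X)

def build_model(model_parameters):
    # Return a fresh, untrained model from a parameter dict.
    return Model(model_parameters)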
We will use sklearn to set up cross-validation for our model. Cross-validation splits the training data into multiple
(roughly) equal parts. Each part (or “fold”) takes a turn as the validation set, while the model is trained on the rest
of the training data. With cross-validation, we get a more credible estimate of the model’s performance on unseen data,
because every training example is used for validation exactly once (instead of a small, fixed fraction of the training
data). Since we train a new model for each fold, each model can predict both on its held-out fold (the “out-of-fold”
predictions) and on the test set; averaging the fold models’ test predictions gives the final test predictions.
And because this is a classification task, we use StratifiedKFold from sklearn, which generates folds with
(approximately) the same proportion of each class as found in the training data. Since each fold’s class distribution
is representative of the original data, the validation results generalize better to unseen testing data. The parameter
n_splits is the number of folds the training data is split into.
from sklearn.model_selection import StratifiedKFold

"""
Generate N stratified folds of data
"""
N_SPLITS = 5
folds = []
kf = StratifiedKFold(n_splits=N_SPLITS)
# Each element of folds is a ((train_X, train_y), (valid_X, valid_y)) pair.
for train_idx, valid_idx in kf.split(train_X, train_y):
    folds.append(
        (
            (train_X[train_idx], train_y[train_idx]),
            (train_X[valid_idx], train_y[valid_idx]),
        )
    )
"""
Build, train, validate, and calculate out-of-fold predictions for each fold
"""
test_yhat = []
start_time = time.time()
for i, fold in enumerate(folds):
model_parameters = {...}
model = build_model(model_parameters)
(train_X, train_y), (valid_X, valid_y) = fold
model.fit(
train_X, train_y, validation_data=((train_X, train_y), (valid_X, valid_y))
)
test_yhat.append(model.predict_proba(test_X))
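Since the competition is scored with multi-class log loss, it is also worth reporting that metric on each held-out fold. A minimal sketch, written as a hypothetical extension of the loop body above (sklearn's log_loss is real; valid_yhat is my name for the fold's validation predictions):
from sklearn.metrics import log_loss

# Hypothetical extension of the loop body above: after fitting,
# score this fold's model on its held-out validation split.
valid_yhat = model.predict_proba(valid_X)
print("fold {} log loss: {:.5f}".format(i, log_loss(valid_y, valid_yhat)))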
Submission
This is the final part of the pipeline, where we simply average the per-fold test predictions from the model and
replace the dummy probabilities in sample_submission with our predictions. Multiple models’ predictions can also be
averaged, or “ensembled,” to produce predictions that are often more accurate than any of the individual models’.
Ensembling works best when the models being ensembled are as different from each other as possible. For example,
ensembling a lightgbm model with an xgboost model will probably work better than ensembling two lightgbm models. The
variable weight is a number between 0 and 1 that dictates the relative weighting of the two models’ predictions in the
ensemble. (Ensembling will not be used in this pipeline or discussed in the results, but I suggest using it for better
performance on most ML tasks.)
"""
Average out-of-fold predictions, ensemble two models, and prepare submission CSV file
"""
weight = 0.5
final_test_yhat_model_1 = sum(test_yhat_model_1)/len(test_yhat_model_1)
final_test_yhat_model_2 = sum(test_yhat_model_2)/len(test_yhat_model_2)
final_test_yhat = weight*final_test_yhat_model_1 + (1-weight)*final_test_yhat_model_2
sample_submission[["Class_{}".format(i) for i in range(1, 10)]] = final_test_yhat
sample_submission.to_csv("submission.csv", index=False)
sample_submission.head()
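Since ensembling is not actually used in this pipeline, the single-model version is even simpler; here is a sketch using the test_yhat list from the cross-validation loop above:
# Average the per-fold test predictions of a single model.
final_test_yhat = sum(test_yhat) / len(test_yhat)
sample_submission[["Class_{}".format(i) for i in range(1, 10)]] = final_test_yhat
sample_submission.to_csv("submission.csv", index=False)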
Conclusion
In this blog, we went through the basic structure of the pipeline I used for the Tabular Playground Series - Jun 2021
competition on Kaggle. In the next blog, I will briefly explain how the models work and show how they are
implemented. Thank you for reading; I hope you enjoyed this blog, and see you in the next one!