Introduction
I recently started working on the Tabular Playground Series - Jun 2021 competition on Kaggle. In this competition,
Kaggle provides a tabular dataset that is synthetic, but based on a real dataset and generated using a CTGAN. The
original dataset deals with predicting the category of an eCommerce product given various attributes about the listing.
Although the features are anonymized, they retain properties of real-world features. This is a classification
challenge, where participants need to train ML models to predict a discrete category (one of 9 possible categories)
from 75 numerical features.
I decided to participate in this competition because it was a great opportunity to practice my tabular data skills. I will
share my approach to this competition in a series of three blogs, starting with this one. My aim has been to build a
simple, yet complete ML pipeline for this competition (which includes basic models, cross-validation, ensembling, and
submission). My focus in these blogs will be on explaining how I built this pipeline and how it can be improved further.
In this post, I will briefly explain the individual parts of the pipeline at a high level. Here are the three main parts of this
pipeline:
1. Data loading and processing
2. Model building and cross-validation
3. Submission
In the second blog, I will focus on the specifics of the algorithms used. And in the third, I will examine the results (speed
and accuracy) produced by the algorithms on various hardware.
Environment
All the code in this blog was run on a Kaggle kernel and a ZBook Studio mobile workstation (Z by HP) powered by an
NVIDIA RTX 5000 Mobile GPU.
Data loading and processing
First, we need to load the data we will work with. In our case, that means three CSV files: train.csv
(training data), test.csv (testing data), and sample_submission.csv (the submission format file). Once the data is
loaded, we extract the features and targets from the pandas dataframes as numpy arrays, which is easy to do with
numpy and pandas after some basic processing. These arrays will later be fed into the models for training and inference.
import numpy as np
import pandas as pd

"""
Load dataframes
"""
data_path = "/path/to/competition/data"
train_df = pd.read_csv(data_path + "/train.csv")
test_df = pd.read_csv(data_path + "/test.csv")
sample_submission = pd.read_csv(data_path + "/sample_submission.csv")

"""
Prepare data for training, validation, and testing
"""
# The dataset has 75 anonymized numerical features: feature_0 ... feature_74.
feat_cols = ["feature_{}".format(i) for i in range(75)]
train_X = np.float32(train_df[feat_cols].values)
test_X = np.float32(test_df[feat_cols].values)
# Targets are the strings "Class_1" ... "Class_9"; map them to integers 0-8.
train_y = train_df["target"]
train_y = train_y.apply(lambda y: int(y[-1]) - 1).values
I also tried using NVIDIA RAPIDS (specifically its cudf dataframe library) to speed up the dataframe loading, and it
worked well. Here are the loading times for the three files, train.csv, test.csv, and sample_submission.csv, on
both machines (all times in seconds).
                    Kaggle N.B.: pd   Kaggle N.B.: cudf   ZBook Studio: pd   ZBook Studio: cudf
Training data       1.421             1.229               1.198              1.067
Testing data        1.284             0.032               0.821              0.021
Sample submission   0.226             0.010               0.143              0.007
It can be seen that cudf loads the testing and sample-submission files dramatically faster than pandas; the gain on
the training data is smaller, likely because the first cudf call also pays a one-time GPU initialization cost. Also,
the ZBook Studio mobile workstation is faster than the Kaggle notebook with both pandas and cudf, thanks to having
more RAM, CPU cores, and GPU CUDA cores. On both machines, the GPU acceleration is dramatic.
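For reference, here is a minimal sketch of the cudf loading path used for these timings. It assumes RAPIDS is installed; cudf.read_csv mirrors the pandas call above, and to_pandas() converts the result back for libraries that expect CPU data.
import time

import cudf  # NVIDIA RAPIDS GPU dataframe library

# Time the GPU-accelerated CSV load. Note that the first cudf call
# also initializes the CUDA context, which inflates its timing.
start = time.time()
train_df_gpu = cudf.read_csv(data_path + "/train.csv")
print("train.csv loaded in {:.3f}s".format(time.time() - start))

# Convert back to pandas for downstream libraries that expect CPU data.
train_df = train_df_gpu.to_pandas()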
Model building and cross-validation
The next step is to build a model to fit the data and cross-validate it. We will build the model using popular libraries like
xgboost, lightgbm, catboost, and tensorflow. I will dig into more specifics on modelling in the second blog. For
now, assume that build_model builds a new model, fit trains the model and validates it, and predict_proba returns
the predicted probabilities during testing.
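To make those assumptions concrete, here is one hedged sketch of what build_model could look like, using xgboost's scikit-learn wrapper as the backend. The Model wrapper class is my illustration, not the pipeline's actual implementation; it exists only to adapt XGBClassifier's eval_set argument to the validation_data keyword assumed in this post.
from xgboost import XGBClassifier

class Model:
    """Thin wrapper giving xgboost the fit()/predict_proba()
    interface assumed in this post (illustrative only)."""

    def __init__(self, model_parameters):
        self.clf = XGBClassifier(**model_parameters)

    def fit(self, X, y, validation_data=None):
        # xgboost's scikit-learn API takes eval_set (a list of
        # (X, y) pairs), so translate validation_data into it.
        eval_set = list(validation_data) if validation_data else None
        self.clf.fit(X, y, eval_set=eval_set)
        return self

    def predict_proba(self, X):
        return self.clf.predict_proba(X)

def build_model(model_parameters):
    # Return a fresh, untrained model from a parameter dict.
    return Model(model_parameters)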
We will use sklearn to set up cross-validation for our model. Cross-validation splits the training data into multiple
(roughly) equal parts. Each part (or “fold”) takes a turn as the validation set, while the model is trained on the rest
of the training data. With cross-validation, we get a more credible estimate of the model’s performance on unseen data,
because every training example is used for validation exactly once (instead of a small, fixed fraction of the training
data). Since we train a new model for each fold, each model can predict both on its held-out fold (the “out-of-fold”
predictions) and on the test set; averaging the fold models’ test predictions gives the final test predictions.
And because this is a classification task, we use StratifiedKFold from sklearn, which generates folds with
(approximately) the same proportion of each class as found in the training data. Since each fold’s class distribution
is representative of the original data, the validation results generalize better to unseen testing data. The parameter
n_splits is the number of folds the training data is split into.
from sklearn.model_selection import StratifiedKFold

"""
Generate N stratified folds of data
"""
N_SPLITS = 5
folds = []
kf = StratifiedKFold(n_splits=N_SPLITS)
# Each element of folds is a ((train_X, train_y), (valid_X, valid_y)) pair.
for train_idx, valid_idx in kf.split(train_X, train_y):
    folds.append(
        (
            (train_X[train_idx], train_y[train_idx]),
            (train_X[valid_idx], train_y[valid_idx]),
        )
    )
"""
Build, train, validate, and calculate out-of-fold predictions for each fold
"""
test_yhat = []
start_time = time.time()
for i, fold in enumerate(folds):
model_parameters = {...}
model = build_model(model_parameters)
(train_X, train_y), (valid_X, valid_y) = fold
model.fit(
train_X, train_y, validation_data=((train_X, train_y), (valid_X, valid_y))
)
test_yhat.append(model.predict_proba(test_X))
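Since the competition is scored with multi-class log loss, it is also worth reporting that metric on each held-out fold. A minimal sketch, written as a hypothetical extension of the loop body above (sklearn's log_loss is real; valid_yhat is my name for the fold's validation predictions):
from sklearn.metrics import log_loss

# Hypothetical extension of the loop body above: after fitting,
# score this fold's model on its held-out validation split.
valid_yhat = model.predict_proba(valid_X)
print("fold {} log loss: {:.5f}".format(i, log_loss(valid_y, valid_yhat)))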
Submission
This is the final part of the pipeline, where we simply average the per-fold test predictions from the model and
replace the dummy probabilities in sample_submission with our predictions. Multiple models’ predictions can also be
averaged, or “ensembled,” to produce predictions that are often more accurate than any of the individual models’.
Ensembling works best when the models being ensembled are as different from each other as possible. For example,
ensembling a lightgbm model with an xgboost model will probably work better than ensembling two lightgbm models. The
variable weight is a number between 0 and 1 that dictates the relative weighting of the two models’ predictions in the
ensemble. (Ensembling will not be used in this pipeline or discussed in the results, but I suggest using it for better
performance on most ML tasks.)
"""
Average out-of-fold predictions, ensemble two models, and prepare submission CSV file
"""
weight = 0.5
final_test_yhat_model_1 = sum(test_yhat_model_1)/len(test_yhat_model_1)
final_test_yhat_model_2 = sum(test_yhat_model_2)/len(test_yhat_model_2)
final_test_yhat = weight*final_test_yhat_model_1 + (1-weight)*final_test_yhat_model_2
sample_submission[["Class_{}".format(i) for i in range(1, 10)]] = final_test_yhat
sample_submission.to_csv("submission.csv", index=False)
sample_submission.head()
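Since ensembling is not actually used in this pipeline, the single-model version is even simpler; here is a sketch using the test_yhat list from the cross-validation loop above:
# Average the per-fold test predictions of a single model.
final_test_yhat = sum(test_yhat) / len(test_yhat)
sample_submission[["Class_{}".format(i) for i in range(1, 10)]] = final_test_yhat
sample_submission.to_csv("submission.csv", index=False)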
Conclusion
In this blog, we went through the basic structure of the pipeline I used for the Tabular Playground Series - Jun 2021
competition on Kaggle. In the next blog, I will briefly explain how the models work and show how they are
implemented. Thank you for reading; I hope you enjoyed this blog, and see you in the next one!