3. Dataset understanding and description
No. of features = 29; dataset shape = 484,551 rows × 28 columns
Target variable = ArrDelay
4. Dataset understanding and description
Missing Values
Org_Airport -1177
Dest_Airport -1479
Duplicate Data
2 rows are duplicate
Duplicate columns
Org_Airport and Dest_Airport repeat information already present in the dataset: Origin
holds the three-letter code corresponding to Org_Airport, and Dest holds the
three-letter code corresponding to Dest_Airport.
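The checks above can be sketched in pandas; the miniature frame below is a hypothetical stand-in for the flight dataset, with column names taken from the slides:

```python
import pandas as pd

# Hypothetical miniature frame standing in for the flight dataset;
# row 3 intentionally duplicates row 0, and one value is missing in
# each of the full-name airport columns.
df = pd.DataFrame({
    "Origin": ["ATL", "ORD", "ATL", "ATL"],
    "Org_Airport": ["Hartsfield-Jackson", None, "Hartsfield-Jackson", "Hartsfield-Jackson"],
    "Dest": ["ORD", "ATL", "ORD", "ORD"],
    "Dest_Airport": ["O'Hare", "Hartsfield-Jackson", None, "O'Hare"],
})

missing = df.isna().sum()          # per-column null counts
dup_rows = df.duplicated().sum()   # count of fully duplicated rows

# Drop the redundant full-name columns (same info as the 3-letter codes)
# and the duplicate rows.
df_clean = df.drop(columns=["Org_Airport", "Dest_Airport"]).drop_duplicates()
```

The same calls scale directly to the full 484,551-row dataset.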
Categorical Variables
1. UniqueCarrier
2. FlightNum
3. TailNum
4. Origin
5. Dest
6. Data Visualization Techniques used
Box Plots
Heat Maps
Histogram
Line graphs
Pie chart
PairPlot
Used the sweetviz library for additional visualizations
7. Screenshots of different visualizations
With a correlation threshold of 0.9, the following features are correlated:
{'AirTime', 'CRSElapsedTime', 'DepDelay', 'Distance'}
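A common way to find such features is to scan the upper triangle of the absolute correlation matrix for values above the threshold. The toy frame below is a hypothetical illustration in which the same four column names are built to be collinear:

```python
import numpy as np
import pandas as pd

# Toy frame: AirTime, CRSElapsedTime, DepDelay and Distance are all
# built from one latent variable, so their pairwise correlations
# exceed 0.9 (names mirror the slides; data is synthetic).
rng = np.random.default_rng(0)
base = rng.normal(size=200)
df = pd.DataFrame({
    "AirTime": base,
    "CRSElapsedTime": base + rng.normal(scale=0.01, size=200),
    "DepDelay": base + rng.normal(scale=0.01, size=200),
    "Distance": 2 * base + rng.normal(scale=0.01, size=200),
})

corr = df.corr().abs()
# Keep only the upper triangle so each pair is examined once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = {col for col in upper.columns if (upper[col] > 0.9).any()}
```

The first column of each correlated group is kept; the rest land in `to_drop`.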
9. Data Processing and Feature Engineering
Imputation – checked for null values in the dataset and handled them with different techniques
Categorical encoding – target encoding was used because of the high cardinality of the categorical features
Handling outliers – box plots were used to analyse outliers in the data
Scaling – the Min-Max scaler was used
Feature selection – correlation analysis showed how features relate to each other and to the target
Feature split – new features were derived from the date and time fields
Dataset after feature engineering – 24 features for modelling
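Three of the steps above (target encoding, a feature split on a time field, and min-max scaling) can be sketched together; the frame, column names other than UniqueCarrier/ArrDelay, and values below are hypothetical:

```python
import pandas as pd

# Hypothetical rows; DepTime is assumed to be an hhmm-style integer.
df = pd.DataFrame({
    "UniqueCarrier": ["AA", "AA", "DL", "DL"],
    "DepTime": [630, 1245, 905, 2310],
    "ArrDelay": [10.0, 30.0, 5.0, 15.0],
})

# Target encoding: replace each carrier with the mean ArrDelay
# observed for that carrier (suits high-cardinality categoricals).
df["UniqueCarrier_te"] = df.groupby("UniqueCarrier")["ArrDelay"].transform("mean")

# Feature split: derive the departure hour from the hhmm field.
df["DepHour"] = df["DepTime"] // 100

# Min-max scaling to [0, 1].
col = df["DepHour"]
df["DepHour_scaled"] = (col - col.min()) / (col.max() - col.min())
```

In practice the target encoding should be fit on the training split only, to avoid leaking the target into the test set.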
10. Our Research
Analysed the data deeply from a domain perspective, which gave very interesting insights.
Researched different ways of handling categorical variables and settled on target encoding.
Since deleting features is always a high-impact decision, we used both visualizations and
domain knowledge to guide it.
For missing-value handling we tried different approaches and then finalized one.
We derived new attributes from the given features that we felt would help further analysis.
11. Future Tasks
More Feature Engineering
Training the model on the selected features
Model development
Model assessment
Takeaways from the last meeting
Group Dynamics
Elements of Data
Dynamic data
16. Splitting of data
With test_size=0.2 the data is split into training and test sets in an 80/20 ratio.
X_train data shape after splitting (387639, 24)
X_test data shape after splitting (96910, 24)
y_train data shape after splitting (387639,)
y_test data shape after splitting (96910,)
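The split can be reproduced with scikit-learn; the arrays below are random stand-ins with the dataset's post-feature-engineering width of 24 columns:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 1,000 rows and the 24 features left after FE.
X = np.random.rand(1000, 24)
y = np.random.rand(1000)

# test_size=0.2 gives the 80/20 split quoted in the slides;
# random_state fixes the shuffle for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```

On the full dataset the same call yields the (387639, 24) / (96910, 24) shapes shown above.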
17. Linear Regression Model
Interpretation – R² measures how much of the data's variance is explained by the
model. R² = 0.90 means that 10% of the variance is not explained by the model;
in the limiting case R² = 1, the model fits completely and explains all of the
variance.
18. Y = a + bX
b = slope
a = intercept
X = feature (independent variable)
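The fit of Y = a + bX can be illustrated with scikit-learn on synthetic one-feature data generated from known a = 2, b = 3 (all values below are illustrative, not from the flight dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data from Y = 2 + 3X plus small noise, so the fitted
# intercept and slope should land near those true values.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 2 + 3 * X[:, 0] + rng.normal(scale=0.1, size=200)

model = LinearRegression().fit(X, y)
b = model.coef_[0]      # slope
a = model.intercept_    # intercept
r2 = model.score(X, y)  # fraction of variance explained (R^2)
```

With low noise the recovered slope and intercept are close to the generating values, and R² is near 1.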
19. Ridge Regression Model
Ridge regression is a model tuning method used to analyse data that suffers
from multicollinearity. It performs L2 regularization. When multicollinearity
is present, the least-squares estimates remain unbiased but their variances
are large, which pushes the predicted values far from the actual values.
Mean squared error with Ridge Regression on train data: 0.0033528868720357806
R² with Ridge Regression on train data: 0.9999999965474273
Mean squared error with Ridge Regression on test data: 0.0027463365607708775
R² with Ridge Regression on test data: 0.9999999976478994
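A minimal sketch of Ridge on multicollinear data follows; the synthetic features below (x2 nearly a copy of x1) are an assumed illustration of the situation L2 regularization is meant to stabilize, not the project's data:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic multicollinear features: x2 is almost identical to x1.
rng = np.random.default_rng(1)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.01, size=500)
X = np.column_stack([x1, x2])
y = 4 * x1 + rng.normal(scale=0.5, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# alpha controls the strength of the L2 penalty on the coefficients.
ridge = Ridge(alpha=1.0).fit(X_tr, y_tr)
pred = ridge.predict(X_te)
mse = mean_squared_error(y_te, pred)
r2 = r2_score(y_te, pred)
```

The penalty shrinks the two collinear coefficients toward a shared value instead of letting them blow up in opposite directions, which is what keeps the test-set variance in check.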
23. Future Recommendation
1. Regression vs Classification Problem
2. Dataset can have more records for delay = 0
3. Dataset can have more relevant features according to the domain
knowledge/experience
25. Interpretation
Simple linear regression led to overfitting, giving an unrealistic accuracy of 100%. This overfitting
problem is well addressed by applying regularization to the regression model; we used L2 regularization,
i.e. Ridge Regression, to overcome the issue.
The SVM model is highly unsuitable for this problem: it takes an unreasonable amount of time to run
(about 3 hours) and still gives subpar accuracy. It is computationally expensive and inappropriate for
large datasets such as the one given.
Random forest also gives good accuracy, about 98%.
The ANN gives about 98%.