3. Dataset understanding and description
No. of features = 29; dataset shape = 484,551 rows × 28 columns
Target variable = ArrDelay
4. Dataset understanding and description
Missing Values
Org_Airport -1177
Dest_Airport -1479
Duplicate Data
2 rows are duplicate
Duplicate columns
Org_Airport and Dest_Airport repeat information already present in the dataset: Origin
holds the three-letter code corresponding to Org_Airport, and Dest holds the
three-letter code corresponding to Dest_Airport.
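The checks above can be sketched in pandas; the miniature frame below is a hypothetical stand-in for the flight dataset, with column names taken from the slides:

```python
import pandas as pd

# Hypothetical miniature frame standing in for the flight dataset;
# row 3 intentionally duplicates row 0, and one value is missing in
# each of the full-name airport columns.
df = pd.DataFrame({
    "Origin": ["ATL", "ORD", "ATL", "ATL"],
    "Org_Airport": ["Hartsfield-Jackson", None, "Hartsfield-Jackson", "Hartsfield-Jackson"],
    "Dest": ["ORD", "ATL", "ORD", "ORD"],
    "Dest_Airport": ["O'Hare", "Hartsfield-Jackson", None, "O'Hare"],
})

missing = df.isna().sum()          # per-column null counts
dup_rows = df.duplicated().sum()   # count of fully duplicated rows

# Drop the redundant full-name columns (same info as the 3-letter codes)
# and the duplicate rows.
df_clean = df.drop(columns=["Org_Airport", "Dest_Airport"]).drop_duplicates()
```

The same calls scale directly to the full 484,551-row dataset.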
Categorical Variables
1. UniqueCarrier
2. FlightNum
3. TailNum
4. Origin
5. Dest
6. Data Visualization Techniques used
Box Plots
Heat Maps
Histogram
Line graphs
Pie chart
PairPlot
Used the sweetviz library for additional visualizations
7. Screenshots of different visualizations
With a correlation threshold of 0.9, the following features are correlated:
{'AirTime', 'CRSElapsedTime', 'DepDelay', 'Distance'}
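A common way to find such features is to scan the upper triangle of the absolute correlation matrix for values above the threshold. The toy frame below is a hypothetical illustration in which the same four column names are built to be collinear:

```python
import numpy as np
import pandas as pd

# Toy frame: AirTime, CRSElapsedTime, DepDelay and Distance are all
# built from one latent variable, so their pairwise correlations
# exceed 0.9 (names mirror the slides; data is synthetic).
rng = np.random.default_rng(0)
base = rng.normal(size=200)
df = pd.DataFrame({
    "AirTime": base,
    "CRSElapsedTime": base + rng.normal(scale=0.01, size=200),
    "DepDelay": base + rng.normal(scale=0.01, size=200),
    "Distance": 2 * base + rng.normal(scale=0.01, size=200),
})

corr = df.corr().abs()
# Keep only the upper triangle so each pair is examined once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = {col for col in upper.columns if (upper[col] > 0.9).any()}
```

The first column of each correlated group is kept; the rest land in `to_drop`.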
9. Data Processing and Feature Engineering
Imputation – checked for null values in the dataset and handled them with different techniques
Categorical encoding – target encoding was used because of the high cardinality of the categorical features
Handling outliers – box plots were used to analyse outliers in the data
Scaling – the Min-Max scaler was used
Feature selection – correlation analysis showed how features relate to each other and to the target
Feature split – new features were derived from the date and time fields
Dataset after feature engineering – 24 features for modelling
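Three of the steps above (target encoding, a feature split on a time field, and min-max scaling) can be sketched together; the frame, column names other than UniqueCarrier/ArrDelay, and values below are hypothetical:

```python
import pandas as pd

# Hypothetical rows; DepTime is assumed to be an hhmm-style integer.
df = pd.DataFrame({
    "UniqueCarrier": ["AA", "AA", "DL", "DL"],
    "DepTime": [630, 1245, 905, 2310],
    "ArrDelay": [10.0, 30.0, 5.0, 15.0],
})

# Target encoding: replace each carrier with the mean ArrDelay
# observed for that carrier (suits high-cardinality categoricals).
df["UniqueCarrier_te"] = df.groupby("UniqueCarrier")["ArrDelay"].transform("mean")

# Feature split: derive the departure hour from the hhmm field.
df["DepHour"] = df["DepTime"] // 100

# Min-max scaling to [0, 1].
col = df["DepHour"]
df["DepHour_scaled"] = (col - col.min()) / (col.max() - col.min())
```

In practice the target encoding should be fit on the training split only, to avoid leaking the target into the test set.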
10. Our Research
Analysed the data deeply from a domain perspective, which gave very interesting insights.
Researched different ways of handling categorical variables and settled on target encoding.
Since deleting features is always a high-impact decision, we used both visualizations and
domain knowledge to guide it.
For missing-value handling we tried different approaches and then finalized one.
We derived new attributes from the given features that we felt would help further analysis.
11. Future Tasks
More Feature Engineering
Training the model on the selected features
Model development
Model assessment
Takeaways from the last meeting
Group Dynamics
Elements of Data
Dynamic data
16. Splitting of data
With test_size=0.2 the data is split into training and test sets in an 80/20 ratio.
X_train data shape after splitting (387639, 24)
X_test data shape after splitting (96910, 24)
y_train data shape after splitting (387639,)
y_test data shape after splitting (96910,)
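The split can be reproduced with scikit-learn; the arrays below are random stand-ins with the dataset's post-feature-engineering width of 24 columns:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 1,000 rows and the 24 features left after FE.
X = np.random.rand(1000, 24)
y = np.random.rand(1000)

# test_size=0.2 gives the 80/20 split quoted in the slides;
# random_state fixes the shuffle for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```

On the full dataset the same call yields the (387639, 24) / (96910, 24) shapes shown above.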
17. Linear Regression Model
Interpretation – R² measures how much of the data's variance is explained by the
model. R² = 0.90 means that 10% of the variance is not explained by the model;
in the limiting case R² = 1, the model fits completely and explains all of the
variance.
18. Y = a + bX
b = slope
a = intercept
X = feature (independent variable)
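The fit of Y = a + bX can be illustrated with scikit-learn on synthetic one-feature data generated from known a = 2, b = 3 (all values below are illustrative, not from the flight dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data from Y = 2 + 3X plus small noise, so the fitted
# intercept and slope should land near those true values.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 2 + 3 * X[:, 0] + rng.normal(scale=0.1, size=200)

model = LinearRegression().fit(X, y)
b = model.coef_[0]      # slope
a = model.intercept_    # intercept
r2 = model.score(X, y)  # fraction of variance explained (R^2)
```

With low noise the recovered slope and intercept are close to the generating values, and R² is near 1.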
19. Ridge Regression Model
Ridge regression is a model tuning method used to analyse data that suffers
from multicollinearity. It performs L2 regularization. When multicollinearity
is present, the least-squares estimates remain unbiased but their variances
are large, which pushes the predicted values far from the actual values.
Mean squared error with Ridge Regression on train data: 0.0033528868720357806
R² with Ridge Regression on train data: 0.9999999965474273
Mean squared error with Ridge Regression on test data: 0.0027463365607708775
R² with Ridge Regression on test data: 0.9999999976478994
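A minimal sketch of Ridge on multicollinear data follows; the synthetic features below (x2 nearly a copy of x1) are an assumed illustration of the situation L2 regularization is meant to stabilize, not the project's data:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic multicollinear features: x2 is almost identical to x1.
rng = np.random.default_rng(1)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.01, size=500)
X = np.column_stack([x1, x2])
y = 4 * x1 + rng.normal(scale=0.5, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# alpha controls the strength of the L2 penalty on the coefficients.
ridge = Ridge(alpha=1.0).fit(X_tr, y_tr)
pred = ridge.predict(X_te)
mse = mean_squared_error(y_te, pred)
r2 = r2_score(y_te, pred)
```

The penalty shrinks the two collinear coefficients toward a shared value instead of letting them blow up in opposite directions, which is what keeps the test-set variance in check.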
23. Future Recommendation
1. Regression vs Classification Problem
2. Dataset can have more records for delay = 0
3. Dataset can have more relevant features according to the domain
knowledge/experience
25. Interpretation
Simple linear regression led to overfitting, giving an unrealistic accuracy of 100%. This overfitting
problem is well addressed by applying regularization to the regression model; we used L2 regularization,
i.e. Ridge Regression, to overcome the issue.
The SVM model is highly unsuitable for this problem: it takes an unreasonable amount of time to run
(about 3 hours) and still gives subpar accuracy. It is computationally expensive and inappropriate for
large datasets such as the one given.
Random forest also gives good accuracy, about 98%.
The ANN gives about 98%.