How has the airline industry suffered during the pandemic? That question stuck in my mind
whenever I read articles about travel bans and restrictions on the movement of people, not only from one
country to another but also from one state to another. As a statistics student with a curious mind, I set out
to measure the effect of the pandemic on the airline industry. Tying statistics to business problems whose
answers could benefit a business excites me, so I took the initiative and asked two friends to help me
with this task.
We decided to get month-wise domestic and international aviation data on the number of departures and
passengers in India from Jan 2010 to April 2022 from the Airports Authority of India website. We then
cleaned, processed and transformed this data to make it usable for our analysis. The analysis I suggested
for this objective was a familiar one that we had recently learnt in our fifth semester: time series analysis,
using the Auto Regressive Integrated Moving Average (ARIMA) model, which uses past data to predict the
future. As I am comfortable with coding, I did the analysis in RStudio and Python, both of which have
excellent libraries for this kind of analysis. We built the model so that it would predict how the industry
would have behaved if covid had not occurred.
We then compared reality with the simulation, which gave us some interesting interpretations. We found
that the international aviation industry lost on average five crore thirty-three lakh per flight per month,
and the domestic industry lost on average eighty-two lakh twenty-four thousand per flight per month.
The key takeaway for the aviation industry from our simulation vs reality analysis is that international
travel is almost back on track after a major setback like the travel ban, which took 2 years and 3 months,
whereas domestic travel is yet to recover.
I presented our findings and analysis to my statistics professor, Mrs. Anwesha Roy, under whose
guidance we came this far. She was thrilled with our work and encouraged us to get it published, and
my team is currently working on that.
4. TIME SERIES DATA
Time-series data is a set of observations on a quantitative variable
collected over time. For example, historical data on sales, inventory,
customer counts, interest rates, costs, etc.
Businesses are often very interested in forecasting time series variables.
Often, independent variables are not available to build a regression
model of a time series variable, which is why time series analysis is used.
5. TIME SERIES ANALYSIS
• Time series analysis is a statistical technique that uses time-series data
for explaining the past or forecasting future events.
• In time series analysis, we analyze the past behavior of a variable in
order to predict its future behavior.
• The prediction is a function of time (days, months, years, etc.)
• No causal variable is used.
6. TIME SERIES FORECASTING
Time series forecasting means predicting future values over a
period. It involves developing models based on previous data
and applying them to make observations and guide future
strategic decisions.
The future is forecast or estimated based on what has already
happened. Time series adds a time order dependence between
observations.
7. Types of time series methods used for forecasting:
• Autoregression (AR)
• Moving Average (MA)
• Autoregressive Moving Average (ARMA)
• Autoregressive Integrated Moving Average (ARIMA)
• Seasonal Autoregressive Integrated Moving-Average (SARIMA).
8. Stationarity
Stationarity means that the statistical properties of a process
generating a time series do not change over time. ARIMA requires
stationarity.
Requirements for stationarity:
1. Mean should be constant
2. Variance should be constant
3. There is no seasonality
9. Differencing
Differencing can help stabilize the mean of a time series by removing
changes in the level of a time series, and therefore reducing trend and
seasonality.
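As a minimal sketch of the idea above, the snippet below differences a small made-up series (the numbers and the period of 4 are assumptions for illustration, not our aviation data). A first difference removes the linear trend, and a seasonal difference at the seasonal period removes the repeating pattern.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t >= lag."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]

# Hypothetical series: a linear trend (+2 per step) plus a seasonal pattern of period 4.
seasonal_pattern = [5, -3, 0, -2]
series = [10 + 2 * t + seasonal_pattern[t % 4] for t in range(12)]

first_diff = difference(series, lag=1)     # trend removed, seasonal pattern still visible
seasonal_diff = difference(series, lag=4)  # seasonal pattern removed, constant level left

print(first_diff)
print(seasonal_diff)
```

In ARIMA terms, the number of times the first difference is applied corresponds to d, and the seasonal difference corresponds to the seasonal D in SARIMA.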
10. AR MODEL
In an autoregression model, we forecast the variable of interest using a
linear combination of past values of the variable. The term autoregression
indicates that it is a regression of the variable against itself.
Thus, an autoregressive model of order p can be written as:
$y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p} + \varepsilon_t$
where,
• t = index denoting the time period.
• $y_t$ = value of the series at time t (the series has n values measured over time).
• $\phi_1, \dots, \phi_p$ = coefficients of the lagged values.
• c = constant term in the equation.
• $\varepsilon_t$ = error term at time t.
• p = order of the AR model (number of lagged terms).
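To make the AR equation concrete, here is a small sketch of a one-step-ahead AR(p) forecast. The constant, coefficients and history are hypothetical values chosen for illustration; in practice they are estimated from the data. The forecast sets the unknown future error term to zero.

```python
def ar_forecast(history, c, phi):
    """One-step-ahead AR(p) forecast: c + phi_1*y_{t-1} + ... + phi_p*y_{t-p}."""
    p = len(phi)
    if len(history) < p:
        raise ValueError("need at least p past observations")
    # history[-1] is y_{t-1}, history[-2] is y_{t-2}, and so on.
    return c + sum(phi[i] * history[-(i + 1)] for i in range(p))

# Hypothetical AR(2) model and past observations.
history = [100.0, 104.0, 103.0]
c = 10.0
phi = [0.6, 0.3]

next_value = ar_forecast(history, c, phi)  # 10 + 0.6*103 + 0.3*104
print(round(next_value, 2))
```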
11. MA MODEL
Rather than using past values of the forecast variable in a regression, a
moving average model uses past forecast errors in a regression-like
model.
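$y_t = c + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + \cdots + \theta_q \varepsilon_{t-q}$
where $\varepsilon_t$ is white noise and $\theta_1, \dots, \theta_q$ are the moving average
coefficients. We refer to this as an MA(q) model, a moving average model of order q.
Of course, we do not observe the values of $\varepsilon_t$, so it is not really a
regression in the usual sense.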
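The sketch below builds an MA(q) series from a known list of errors, only to show how each value mixes the current error with past errors. The errors, constant and coefficient are hypothetical; with real data the errors are not observed and have to be estimated along with the model.

```python
def ma_series(errors, c, theta):
    """Build y_t = c + e_t + theta_1*e_{t-1} + ... + theta_q*e_{t-q}.

    Errors before the start of the sample are treated as zero.
    """
    q = len(theta)
    values = []
    for t, e_t in enumerate(errors):
        y_t = c + e_t
        for j in range(1, q + 1):
            if t - j >= 0:
                y_t += theta[j - 1] * errors[t - j]
        values.append(y_t)
    return values

# Hypothetical white-noise draws and an MA(1) model.
errors = [1.0, -2.0, 0.5, 0.0]
y = ma_series(errors, c=5.0, theta=[0.5])
print(y)
```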
13. ARIMA models
If we combine differencing with autoregression and a moving average
model, we obtain a non-seasonal ARIMA model. ARIMA is an acronym
for Auto Regressive Integrated Moving Average (in this context,
“integration” is the reverse of differencing). The full model can be written
as:
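$y'_t = c + \phi_1 y'_{t-1} + \phi_2 y'_{t-2} + \cdots + \phi_p y'_{t-p} + \theta_1 \varepsilon_{t-1} + \cdots + \theta_q \varepsilon_{t-q} + \varepsilon_t$
where $y'_t$ is the differenced series (it may have been differenced more
than once), $\phi_1, \dots, \phi_p$ are the autoregressive coefficients and
$\theta_1, \dots, \theta_q$ are the moving average coefficients. The “predictors” on the
right-hand side include both lagged values of $y_t$ and lagged errors. We call
this an ARIMA(p, d, q) model.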
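The remark that integration is the reverse of differencing can be sketched as follows. The series and the forecasts on the differenced scale are hypothetical numbers; the point is only that a cumulative sum from a known starting level undoes a first difference, which is how forecasts of the differenced series are turned back into levels when d = 1.

```python
from itertools import accumulate

def difference(series):
    """First difference: y_t - y_{t-1}."""
    return [b - a for a, b in zip(series, series[1:])]

def integrate(diffs, start):
    """Undo first differencing: running sum of the differences from a known level."""
    return list(accumulate(diffs, initial=start))

# Hypothetical observed series.
series = [120, 125, 123, 130, 138]
diffs = difference(series)
restored = integrate(diffs, start=series[0])

# Hypothetical forecasts on the differenced scale, converted back to levels.
diff_forecasts = [4, 3]
level_forecasts = integrate(diff_forecasts, start=series[-1])[1:]

print(diffs)
print(restored)
print(level_forecasts)
```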
14. Components of ARIMA
• p: the order of the ‘Auto Regressive’ (AR) term. It refers to the number
of lags of Y to be used as predictors.
• q: the order of the ‘Moving Average’ (MA) term. It refers to the number
of lagged forecast errors that should go into the ARIMA model.
• d: the minimum number of differencing steps needed to make the series
stationary. If the time series is already stationary, then d = 0.
• Lags: lagging a time series means shifting its values forward one or more
time steps, or equivalently, shifting the times in its index backward one or
more steps.
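A tiny sketch of lagging, using made-up values: each position of the lagged series holds the value from k steps earlier, and the first k positions have no past value to show.

```python
def lag(series, k=1):
    """Shift values forward by k steps; the first k positions are left empty (None)."""
    if k < 0:
        raise ValueError("k must be non-negative")
    return ([None] * k + list(series))[:len(series)]

# Hypothetical monthly values.
values = [110, 115, 112, 118]
lag_1 = lag(values, 1)
lag_2 = lag(values, 2)

print(lag_1)
print(lag_2)
```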
15. Seasonal ARIMA models/ SARIMA
ARIMA models are also capable of modelling a wide range of seasonal
data. A seasonal ARIMA model is formed by including additional
seasonal terms in the ARIMA models we have seen so far. It is written
as follows:
ARIMA(p, d, q)(P, D, Q)[m]
where (p, d, q) is the non-seasonal part, (P, D, Q) is the seasonal part
and m = number of observations per year.
16. Auto Correlation Function
ACF is the (complete) auto-correlation function, which gives the
auto-correlation of a series with its lagged values.
In simple terms, it describes how well the present value of the series is
related to its past values. A time series can have components like trend,
seasonality, cyclic and residual. ACF considers all these components while
finding correlations, hence it is a ‘complete auto-correlation plot’.
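A minimal sketch of the sample autocorrelation, assuming the usual estimator (lagged products of deviations from the mean, divided by the total sum of squares). The series is a made-up pattern with period 4, so the autocorrelation is strongest at lag 4.

```python
def acf(series, max_lag):
    """Sample autocorrelations r_0..r_max_lag of a series."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    denom = sum(d * d for d in dev)
    return [
        sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
        for k in range(max_lag + 1)
    ]

# Hypothetical seasonal series with period 4.
series = [5, -3, 0, -2] * 6
r = acf(series, max_lag=4)

for k, r_k in enumerate(r):
    print(k, round(r_k, 3))
```

A slowly decaying ACF is a common sign of trend, and spikes at multiples of the seasonal period point to seasonality.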
17. Partial Autocorrelation Function
PACF is the partial auto-correlation function. Instead of
finding correlations of the present value with its lags like ACF, it
finds the correlation of the residuals (what remains after
removing the effects already explained by the earlier lags)
with the next lag value. Hence it is ‘partial’ and not ‘complete’:
we remove the variation already accounted for before we find
the next correlation.
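One standard way to obtain partial autocorrelations from autocorrelations is the Durbin-Levinson recursion, sketched below. This is an illustration rather than the routine our software used. The input is the theoretical ACF of a hypothetical AR(1) process with coefficient 0.5, for which the PACF should be 0.5 at lag 1 and zero afterwards.

```python
def pacf_from_acf(r):
    """Partial autocorrelations via the Durbin-Levinson recursion.

    r[0] must be 1 and r[k] is the autocorrelation at lag k.
    Returns [1.0, pacf_1, ..., pacf_K].
    """
    max_lag = len(r) - 1
    pacf = [1.0]
    prev = []  # prev[j] holds phi_{k-1, j+1}
    for k in range(1, max_lag + 1):
        num = r[k] - sum(prev[j] * r[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(prev[j] * r[j + 1] for j in range(k - 1))
        phi_kk = num / den
        # Update the lower-order coefficients, then append the new one.
        current = [prev[j] - phi_kk * prev[k - 2 - j] for j in range(k - 1)]
        current.append(phi_kk)
        pacf.append(phi_kk)
        prev = current
    return pacf

# Theoretical ACF of a hypothetical AR(1) process with coefficient 0.5.
ar1_acf = [0.5 ** k for k in range(5)]
partial = pacf_from_acf(ar1_acf)
print([round(v, 6) for v in partial])
```

The cut-off after lag 1 is the pattern usually read from a PACF plot to suggest the AR order p.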
18. AIC values and their interpretation
• The Akaike information criterion (AIC) is a mathematical method for
evaluating how well a model fits the data it was generated from. In
statistics, AIC is used to compare different possible models and
determine which one is the best fit for the data.
• To compare AIC values, we use the following rule: "If a model is
more than 2 AIC units lower than another, then it is considered
significantly better than that model."
• Formula: AIC = 2k - 2ln(L)
• AIC: Akaike information criterion
• k: number of estimated parameters in the model
• L: maximum value of the likelihood function for the model
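The formula and the 2-unit rule can be sketched as below. The parameter counts and log-likelihoods are hypothetical, chosen only to show the arithmetic; note that the function takes ln(L) directly, since software usually reports the log-likelihood.

```python
def aic(k, log_likelihood):
    """AIC = 2k - 2 ln(L), where log_likelihood is ln(L)."""
    return 2 * k - 2 * log_likelihood

# Hypothetical fitted models: (number of parameters, maximised log-likelihood).
aic_a = aic(k=3, log_likelihood=-590.0)
aic_b = aic(k=5, log_likelihood=-589.5)

# Model B fits slightly better (higher likelihood) but pays for two extra parameters.
a_clearly_better = (aic_b - aic_a) > 2

print(aic_a, aic_b, a_clearly_better)
```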
19. Indian aviation sector
The civil aviation industry in India has emerged as one of the
fastest growing industries in the country during the last three
years, but it was severely affected by the COVID-19 pandemic
in 2020-2021 because air travel was almost completely
restricted during 2020.
The losses incurred were approximately Rs. 19,564 crores during
the pandemic. This was a big hit to the aviation industry.
The graph here shows that the number of passengers went
down by almost 50% in 2020 compared to 2019.
Since then, the industry has been booming and was up
by 238 percent in 2021. The Indian aviation industry is
expected to overtake the UK. Observers say that the sky is
the limit for the aviation sector in India.
20. WHY SARIMA?
• The difference between ARIMA and regression is that ARIMA also
takes past errors into account, which regression does not. SARIMA
additionally takes the seasonality of the dataset into account.
• ARIMA also improves on the predictions of moving average methods, since
it includes an autoregressive part that uses past values to predict
future values, which moving average methods do not.
• Since our data has a seasonal component, is not stationary, and time
is an influencing factor for our prediction, we need to use SARIMA.
21. Requirements for ARIMA & SARIMA
1. Data is univariate: the dataset should contain only one variable.
2. The series being predicted is stationary.
3. The error terms are white noise: they are independent and
identically distributed with no correlation.
22. OBJECTIVES
OBJECTIVE 1
Predicting the passengers per plane for the next 4 years from Jan 2020
if covid had not happened
OBJECTIVE 2
Predicting the passengers per plane for the next 12 months
OBJECTIVE 3
Predicting the losses suffered due to COVID from 2020 to 2023
23. OUR RESOURCES
• RStudio: libraries used: readxl, tseries, forecast
• Microsoft Excel: load data and export into R and Python
• Python: libraries used: matplotlib, pandas
• Internet: dataset, YouTube, research papers, articles
• Faculty: expertise of the professors
24. Data overview
• Secondary data: from Airports Authority of India
• Time frame: Jan 2010 – Jul 2022
• Variables used:
International flights: international departures, number of people carried
Domestic flights: domestic departures, number of people carried
25. Domestic Data
• Seasonality: there is high demand during holiday seasons
• Crisis: the Covid-19 crisis hit in 2020 and the number of people
travelling decreased
• Trend: increasing till 2020
26. International Data
• Seasonality: there is high demand during holiday seasons
• Crisis: the Covid-19 crisis hit in 2020 and the number of people
travelling decreased
• Trend: increasing steadily from 2010 till 2020
28. Process of ARIMA Modelling
1. Checking stationarity: using the Augmented Dickey-Fuller test
2. Model identification: using the auto.arima function from the forecast
library in RStudio and selecting the model with the least AIC value
3. Predicting values: using the forecast library in RStudio
29. ADF Test
• The Augmented Dickey-Fuller test (ADF test) is a common statistical test used
to check whether a given time series is stationary or not. It is a unit root test.
• A Dickey-Fuller test is a unit root test of the null hypothesis that α = 1
in the model equation below, where α is the coefficient of the first lag of y.
• The presence of a unit root means the time series is non-stationary.
• Hypothesis: H0 : α = 1 vs H1 : α < 1
• Formula: $y_t = c + \beta t + \alpha y_{t-1} + \phi \Delta y_{t-1} + e_t$
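The idea behind the test can be sketched with the simplest Dickey-Fuller regression, Δy_t = c + γ y_{t-1} + e_t, where γ = α - 1 and α is the coefficient of the first lag. This sketch leaves out the trend term and the lagged differences that the augmented test adds, and it is not the adf.test routine from R; the two simulated series are hypothetical. The t-statistic of γ is compared with Dickey-Fuller critical values (roughly -2.86 at the 5% level for this constant-only form in large samples) rather than with the normal table.

```python
import math
import random

def dickey_fuller_stat(y):
    """t-statistic of gamma in the regression dy_t = c + gamma * y_{t-1} + e_t."""
    x = y[:-1]
    dy = [b - a for a, b in zip(y, y[1:])]
    n = len(x)
    mean_x = sum(x) / n
    mean_dy = sum(dy) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (di - mean_dy) for xi, di in zip(x, dy))
    gamma = sxy / sxx
    intercept = mean_dy - gamma * mean_x
    rss = sum((di - intercept - gamma * xi) ** 2 for xi, di in zip(x, dy))
    std_error = math.sqrt(rss / (n - 2) / sxx)
    return gamma / std_error

# Hypothetical data: white noise (stationary) and its running sum (a random walk).
rng = random.Random(0)
noise = [rng.gauss(0, 1) for _ in range(200)]
walk = []
level = 0.0
for shock in noise:
    level += shock
    walk.append(level)

stat_noise = dickey_fuller_stat(noise)
stat_walk = dickey_fuller_stat(walk)
print(round(stat_noise, 2), round(stat_walk, 2))
```

A strongly negative statistic is evidence against a unit root; a statistic near zero means the null of non-stationarity cannot be rejected.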
30. Checking stationarity using ADF Test
H0 : Time series is not stationary
H1 : Time series is stationary
Data                 P value   Null hypothesis   Decision
Domestic data        0.2622    Fail to reject    Not stationary
International data   0.3371    Fail to reject    Not stationary
31. ARIMA & AIC
Domestic
Model     (p,d,q)(P,D,Q)[m]      AIC
Model 1   (0,1,0)                1201.687
Model 2   (1,1,1)(0,0,1)[12]     1183.11
Model 3   (0,1,0)(0,0,1)[12]     1199.731
International
Model     (p,d,q)(P,D,Q)[m]      AIC
Model 1   (1,1,3)                1232.58
Model 2   (0,1,2)(1,0,0)[12]     1240.293
Model 3   (2,1,2)                1235.78
Domestic Model 2 and international Model 1 (highlighted in red on the slide) are
selected as the best-fit models for the respective aviation types, because they
have the lowest AIC values of all the candidate models.
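The selection step above amounts to taking the minimum AIC in each table. A small sketch using the AIC values reported on this slide (the dictionary layout and function name are illustrative choices, not part of our R workflow):

```python
# AIC values copied from the tables on this slide, keyed by model order.
domestic = {
    "(0,1,0)": 1201.687,
    "(1,1,1)(0,0,1)[12]": 1183.11,
    "(0,1,0)(0,0,1)[12]": 1199.731,
}
international = {
    "(1,1,3)": 1232.58,
    "(0,1,2)(1,0,0)[12]": 1240.293,
    "(2,1,2)": 1235.78,
}

def best_by_aic(candidates):
    """Return the model order with the lowest AIC."""
    return min(candidates, key=candidates.get)

def margin_over_runner_up(candidates):
    """AIC gap between the best model and the second best."""
    ordered = sorted(candidates.values())
    return ordered[1] - ordered[0]

best_domestic = best_by_aic(domestic)
best_international = best_by_aic(international)
print(best_domestic, round(margin_over_runner_up(domestic), 3))
print(best_international, round(margin_over_runner_up(international), 3))
```

Both winners beat their runner-up by more than 2 AIC units, which is the gap usually treated as a meaningful difference.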
34. Ljung-Box test
• The test determines whether the errors are iid or whether there is
something more behind them, that is, whether the autocorrelations of
the errors or residuals are non-zero.
• Essentially, it is a test of lack of fit: if the autocorrelations of
the residuals are very small, we say that the model doesn’t
show ‘significant lack of fit’.
• The hypotheses for the test are given below:
• H0: the model does not show lack of fit
• H1: the model does show lack of fit
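A minimal sketch of the Ljung-Box statistic, Q = n(n+2) Σ r_k² / (n - k) summed over lags 1..h, where r_k is the residual autocorrelation at lag k. Under H0, Q is compared with a chi-square distribution. The p-value helper below uses the closed form that is valid only for an even number of degrees of freedom, which is an illustrative shortcut; in practice the degrees of freedom are usually reduced by the number of fitted ARMA parameters, which is ignored here. The two series are simulated, not our residuals.

```python
import math
import random

def autocorrelations(series, max_lag):
    """Sample autocorrelations r_1..r_max_lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    denom = sum(d * d for d in dev)
    return [
        sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
        for k in range(1, max_lag + 1)
    ]

def ljung_box_q(residuals, lags):
    """Ljung-Box Q statistic over the first `lags` autocorrelations."""
    n = len(residuals)
    r = autocorrelations(residuals, lags)
    return n * (n + 2) * sum(r_k ** 2 / (n - k) for k, r_k in enumerate(r, start=1))

def chi2_sf_even(x, df):
    """Upper-tail chi-square probability; closed form valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this shortcut needs a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total

# Hypothetical residuals: white noise (good fit) vs a random walk (clear lack of fit).
rng = random.Random(1)
noise = [rng.gauss(0, 1) for _ in range(200)]
walk, level = [], 0.0
for shock in noise:
    level += shock
    walk.append(level)

p_noise = chi2_sf_even(ljung_box_q(noise, 10), 10)
p_walk = chi2_sf_even(ljung_box_q(walk, 10), 10)
print(p_noise, p_walk)
```

A large p-value means the null of no lack of fit is not rejected, which is how the tables on the following slides are read.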
35. Checking fit of the model using Ljung-Box test: Domestic Data
Lags used   P value   Null hypothesis   Decision
5           0.8784    Fail to reject    Good fit
10          0.9678    Fail to reject    Good fit
15          0.6549    Fail to reject    Good fit
25          0.7494    Fail to reject    Good fit
36. Checking fit of the model using Ljung-Box test: International Data
Lags used   P value   Null hypothesis   Decision
5           0.9993    Fail to reject    Good fit
10          0.9972    Fail to reject    Good fit
15          0.8501    Fail to reject    Good fit
25          0.9514    Fail to reject    Good fit
37. Forecasting domestic data using the best model
Date (DD-MM-YYYY)   Forecast
01-08-2022 116.46
01-09-2022 115.88
01-10-2022 116.70
01-11-2022 117.72
01-12-2022 117.64
01-01-2023 112.85
01-02-2023 117.97
01-03-2023 118.10
01-04-2023 117.28
01-05-2023 119.28
01-06-2023 117.66
01-07-2023 116.85
38. Forecasting international data using the best model
Date (DD-MM-YYYY)   Forecast
01-08-2022 147.91
01-09-2022 145.86
01-10-2022 145.41
01-11-2022 144.99
01-12-2022 144.60
01-01-2023 144.24
01-02-2023 143.90
01-03-2023 143.58
01-04-2023 143.28
01-05-2023 143.003
01-06-2023 142.74
01-07-2023 142.50
41. Checking stationarity using ADF Test
H0 : Time series is not stationary
H1 : Time series is stationary
Data                 P value   Null hypothesis   Decision
Domestic data        0.09633   Fail to reject    Not stationary
International data   0.0698    Fail to reject    Not stationary
42. ARIMA & AIC
Domestic
Model     (p,d,q)(P,D,Q)[m]      AIC
Model 1   (4,0,0)(0,1,1)[12]     633.6514
Model 2   (2,0,0)(1,1,2)[12]     640.62
Model 3   (3,0,0)(0,1,1)[12]     635.2183
International
Model     (p,d,q)(P,D,Q)[m]      AIC
Model 1   (1,1,2)(0,1,1)[12]     647.2197
Model 2   (3,1,2)(0,1,1)[12]     640.4121
Model 3   (2,1,3)(0,1,1)[12]     638.3682
Domestic Model 1 and international Model 3 (highlighted in red on the slide) are
selected as the best-fit models for the respective aviation types, because they
have the lowest AIC values of all the candidate models.
45. Checking fit of the model using Ljung-Box test: Domestic Data
Lags used   P value   Null hypothesis   Decision
5           0.7941    Fail to reject    Good fit
10          0.7305    Fail to reject    Good fit
15          0.4957    Fail to reject    Good fit
25          0.2979    Fail to reject    Good fit
46. Checking fit of the model using Ljung-Box test: International Data
Lags used   P value   Null hypothesis   Decision
5           0.8032    Fail to reject    Good fit
10          0.7952    Fail to reject    Good fit
15          0.749     Fail to reject    Good fit
25          0.07807   Fail to reject    Good fit
47. Forecasting domestic data using the best model
• If covid had not occurred, the domestic aviation industry
would have continued thriving and expanding, as we can see
• The seasonality would have continued as it is
48. Forecasting international data using the best model
• The international aviation industry stagnates with little to no growth
• A certain seasonality is expected
50. Domestic Losses
• Covid obstructed a booming domestic aviation sector
• The loss due to covid is still seen in 2022, as shown by the gap
between the blue and red lines
51. International Losses
• Huge losses were borne due to covid
• International travel is almost back on track, as the gap has
almost closed
• The biggest takeaway here is that the industry took 2 years 4
months to recover from a crisis like the travel ban
53. Monthly domestic losses
• The spikes in losses are due to the 1st and 2nd waves of covid
• The losses have finally plateaued, and the industry is likely to see
some stability now
54. A few assumptions
Domestic Loss calculation
• Bombay to Delhi is the busiest domestic air route
• Average price per ticket for this route is Rs. 5000
• Ticket is booked one month prior
• No inflation in the price of tickets
• No fluctuation in the price of tickets
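The slides do not spell out the loss formula. One plausible reading, assumed here purely for illustration, is that the monthly loss per flight is the shortfall in passengers per plane (simulated minus actual) multiplied by the fixed ticket price from the assumptions above. The passenger figures below are hypothetical, and the function name is an illustrative choice.

```python
TICKET_PRICE_RS = 5000  # average ticket price from the assumptions above

def monthly_loss_per_flight(simulated_pax, actual_pax, ticket_price):
    """Assumed loss: passenger shortfall per plane times a fixed ticket price."""
    shortfall = max(simulated_pax - actual_pax, 0.0)
    return shortfall * ticket_price

# Hypothetical passengers per plane: no-covid simulation vs what actually happened.
simulated = [116.0, 117.0, 118.0]
actual = [60.0, 95.0, 120.0]

losses = [
    monthly_loss_per_flight(s, a, TICKET_PRICE_RS)
    for s, a in zip(simulated, actual)
]
print(losses)
```

Months where actual traffic meets or exceeds the simulation are counted as zero loss in this sketch.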
56. Monthly international losses
• The international aviation industry is very unstable
• International aviation is close to catching up with our simulation
• Currently the industry is running at negligible losses
57. A few assumptions
International Loss calculation
• Bombay to Dubai is the busiest international air route
• Average price per ticket for this route is Rs. 35000
• Ticket is booked one month prior
• No inflation in the price of tickets
• No fluctuation in the price of tickets
58. LIMITATIONS
• ARIMA has poor performance for long term
forecasts
• Quite a bit of subjectivity involved in finding P,D,Q
values
• There are better models for prediction of data
• Organized data is hard to find
59. FUTURE SCOPE
• Predictions for different brands using market share
• Using better models for prediction of data
• Checking the accuracy of ARIMA and SARIMA
predictions by comparing present and predicted
values
60. References
[1] International Journal of Applied Engineering Research, ISSN 0973-4562, Volume 14,
Number 3 (2019), pp. 646-650, 10.37622/000000
[2] International Journal of Innovative Technology and Exploring Engineering
(IJITEE), ISSN: 2278-3075, Volume-8, Issue-11S, September
2019, doi.org/10.35940/ijitee.K1216.09811S19
[3] https://doi.org/10.1109/icmlc.2009.5212785
[4] https://www.aai.aero/en/business-opportunities/aai-traffic-news
[5] https://www.dgca.gov.in/digigov-portal/?page=4267/4210/servicename
61. Acknowledgements
We would like to thank Professor Anwesha for her patient instruction,
passionate support, and constructive criticism of this effort.
We would like to thank Dr. Santosha C.D. and Professor Kavya S for giving us
this opportunity and continued support in the development of this project.
We would like to thank our fellow classmates and peers for their valuable
inputs and their help in the project and for patiently listening to us.