2. Table of Contents
○ Introduction
○ Business Question
○ Description of the Data
○ Exploratory Plots and Tables
○ Unsupervised and Supervised Analytics Models
○ Recommendations and Conclusion
○ Possible next steps
3. Introduction
Flight cancellations have always been a universal problem. As economic ties between countries multiply,
cancellations cause serious problems for frequent travellers, especially long-distance travellers such as
international students and business people. Our group members come from different parts of the world, so this
question is of key interest to us. We therefore based our project on data from the Bureau of Transportation
Statistics of the United States, hoping to generate insights into air travel cancellation that would be useful
to the frequent travellers mentioned above.
Flight cancellations create a series of problems for various stakeholders in the tourism industry: customers'
schedules are delayed, airports get crowded, and demand for hotel rooms spikes if a large number of flights are
cancelled on the same day due to severe weather. With our insights, travellers can plan ahead accordingly,
airlines and airports can work to reduce cancellations based on our findings, and hotels can plan their
marketing and sales around flight cancellation patterns.
4. Business Question
Flight cancellation can happen due to a variety of reasons. The most common causes are as follows:
1. Weather
2. Natural Disasters
3. Mechanical Errors
4. Monopoly Routes
5. Aircraft Size
Our team is interested in identifying the factors that lead to a flight cancellation. After choosing our
datasets for this project and running an initial analysis, we decided to focus on the following domains:
1. Segments - by Origin Airport ID and Destination Airport ID pair
2. Airports - by Origin Airport ID
3. Airlines - by Airline ID
We learned to analyze data with Decision Tree and Regression models in our Business Intelligence and Data
Mining class, so we decided to try both models on the above-mentioned factors and choose the model with the
smallest average squared error at the initial stage of our analysis.
*To work with the two datasets, we first used SQL to combine them before conducting the analysis in SAS
Enterprise Miner.
5. Description of the Data
After careful consideration, we chose two datasets:
(1) T100 Domestic Airline Segment Data
(2) Airline On-Time Performance Data
Both datasets come from the Bureau of Transportation Statistics of the Research and Innovative Technology
Administration (RITA). The first dataset has more than 70k rows and contains domestic market data reported
by U.S. air carriers, including carrier, origin, destination, and service class for enplaned passengers, freight and
mail when both origin and destination airports are located within the boundaries of the United States and its
territories.[1] Each month, every certificated U.S. air carrier reports its traffic information to the Office of
Airline Information using an internal normalized form named T-100, and this dataset summarizes T-100 data from
1993 to 2013.
The Airline On-Time Performance dataset has more than a million rows. It is collected by the Office of Airline
Information, Bureau of Transportation Statistics (BTS), and contains on-time arrival data for non-stop domestic
flights by major air carriers, along with additional items such as departure and arrival delays, origin and
destination airports, flight numbers, scheduled and actual departure and arrival times, cancelled or diverted
flights, taxi-out and taxi-in times, air time, and non-stop distance.[2]
Variables Available
These two datasets have sufficient volume and variables to analyze the relationship between air traffic
patterns and externalities, which we define here as airports and airlines.
(1) T100 Domestic Airline Segment Data
This dataset supplied key insights on the factors that result in flight cancellations. The key measures of
this dataset are listed below:
Variables       Definition
DepScheduled    Departures Scheduled
DepPerformed    Departures Performed
Payload         Available Payload (pounds)
Seats           Available Seats
Passengers      Non-Stop Segment Passengers Transported
Freight         Non-Stop Segment Freight Transported (pounds)
Mail            Non-Stop Segment Mail Transported (pounds)
Distance        Distance between airports (miles)
LoadFactor      Load Factor: Ratio of Passenger Miles to Available Seat Miles
RampTime        Ramp to Ramp Time (minutes)
AirTime         Airborne Time (minutes)
[1] Source: http://www.transtats.bts.gov/Fields.asp?Table_ID=259
[2] Source: http://www.transtats.bts.gov/Fields.asp?Table_ID=236
(2) Airline On-Time Performance Data
This dataset supplied the factors that affect delays and the causes of the different types of delay. The key
measures of this dataset are listed below:
Variables           Definition
CarrierDelay        Carrier Delay, in Minutes
WeatherDelay        Weather Delay, in Minutes
NASDelay            National Air System Delay, in Minutes
SecurityDelay       Security Delay, in Minutes
LateAircraftDelay   Late Aircraft Delay, in Minutes
Analysis Methodology:
1. Consolidated the data for the months of May, June and July
The first dataset contains T-100 data from 1993 to 2013 and more than 10 million records. To get
valuable and effective information, we consolidated the data from May 2013 to July 2013, which yielded
70,000+ records.
2. Clean and construct new variables
a) Generated variables: Flights_Cancelled, Flights_Adhoc, Adhoc?, Cancellation?
The original first dataset has no direct indicator of the number of cancellations, but it contains
Flights_Scheduled and Flights_Performed. We subtract Flights_Performed from Flights_Scheduled to
get the number of flights with unexpected changes, including both cancellations and ad hoc flights. If the
difference is positive (fewer flights performed than scheduled), we store it in a new variable named
“Flights_Cancelled”, and if it is negative (more flights performed than scheduled), we store its magnitude in
another new variable named “Flights_Adhoc”. We also created binary variables to show the occurrence of
cancellation and adhoc, named “Cancellation?” and “Adhoc?”.
Variables           Definition
Flights_Cancelled   Number of flights cancelled (Scheduled - Performed)
Flights_Adhoc       Number of flights which took off adhoc (Performed - Scheduled)
Adhoc?              Binary Variable to depict adhoc flights
Cancellation?       Binary Variable to depict cancellations
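The derived variables can be sketched in a few lines of Python. This is only an illustration, not the SQL/SAS workflow used in the project: the record class and its field names are hypothetical, and the sign convention (a positive Scheduled - Performed difference counts as cancellations, a negative one as ad hoc flights) is an assumption taken from the table's definition of Flights_Cancelled.

```python
from dataclasses import dataclass


@dataclass
class SegmentRecord:
    """One T-100 record; field names are illustrative, not the BTS column names."""
    departures_scheduled: int
    departures_performed: int


def derive_flags(rec: SegmentRecord) -> dict:
    """Split Scheduled - Performed into cancellation and ad hoc counts."""
    diff = rec.departures_scheduled - rec.departures_performed
    flights_cancelled = max(diff, 0)   # scheduled but not performed (assumed convention)
    flights_adhoc = max(-diff, 0)      # performed but not scheduled (assumed convention)
    return {
        "Flights_Cancelled": flights_cancelled,
        "Flights_Adhoc": flights_adhoc,
        "Cancellation?": int(flights_cancelled > 0),
        "Adhoc?": int(flights_adhoc > 0),
    }


print(derive_flags(SegmentRecord(30, 27)))  # 3 cancellations, no ad hoc flights
print(derive_flags(SegmentRecord(10, 12)))  # no cancellations, 2 ad hoc flights
```

Because each record holds aggregate counts, a single record can never show both cancellations and ad hoc flights under this scheme; offsetting changes within one record are invisible.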
b) Converted sums to averages for: Passengers, Seats, Payload, Freight, Mail, Ramp_to_Ramp, AirTime
Several vital indicators that could be externalities affecting cancellation rates are reported as sums over
all flights in a record. The number of flights performed therefore influences those indicators. To remove
this bias, we calculated the average of each indicator (total amount / number of flights performed) and
generated new variables to store the results.
Variables          Definition
Avg_Passengers     Avg_Passengers = Passengers / Departures Performed
Avg_Seats          Avg_Seats = Seats / Departures Performed
Avg_Freight        Avg_Freight = Freight / Departures Performed
Avg_Mail           Avg_Mail = Mail / Departures Performed
Avg_Ramp_to_Ramp   Avg_Ramp_to_Ramp = Ramp_to_Ramp / Departures Performed
Avg_AirTime        Avg_AirTime = AirTime / Departures Performed
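A minimal Python sketch of the sum-to-average conversion follows. The function name and the sample numbers are made up for illustration, and returning a missing value when no departures were performed is an assumption; the report does not say how such records were handled.

```python
def per_departure_averages(totals: dict, departures_performed: int) -> dict:
    """Convert per-record totals into per-flight averages."""
    if departures_performed <= 0:
        # Assumption: with no flights performed the average is undefined, so leave it missing.
        return {f"Avg_{name}": None for name in totals}
    return {f"Avg_{name}": value / departures_performed for name, value in totals.items()}


# Made-up totals for one record with 20 departures performed.
record = {"Passengers": 2400, "Seats": 3000, "Freight": 1500, "Mail": 300,
          "Ramp_to_Ramp": 1800, "AirTime": 1500}
averages = per_departure_averages(record, departures_performed=20)
print(averages["Avg_Passengers"])  # 120.0
```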
3. Analyzed each dataset individually
The two datasets that we are interested in relate to flight cancellations and delays. They have different
primary keys, and the internal calculation logic differs between them. Therefore, we decided not to merge
them and analyzed them individually.
Exploratory Plots and Tables
We explored both datasets to find relationships between variables. We also looked for interesting patterns
related to flight cancellations using Tableau.
Interesting Relationships
Using a scatter plot in the data exploration menu in SAS, we arrived at some interesting relationships
between key variables in our data set.
a) Departures Performed:
We plotted the variable “departures_performed” against the variable “Airline_ID”, colored by
“Flight_Cancelled”. Blue indicates that a flight was not cancelled and red indicates that a flight was
cancelled. The plot shows that the density of red points is very high for departures exceeding 150. In
other words, airlines with a higher number of departures also had flight cancellations.
The departures_performed variable was noted for further investigation.
b) Number of Passengers:
We plotted the variable “Total Passengers” against the variable “Airport_ID”, colored by
“Flight_Cancelled”. Blue indicates that a flight was not cancelled and red indicates that a flight was
cancelled. An increase in the number of red points above the 2,500-passenger mark can be observed. In
other words, airports that handled more passengers also had flight cancellations.
The total_passengers variable was noted for further investigation.
c) Distance
We plotted the variable “Distance from Origin” against the variable “Dest_Airport_ID”, colored by
“Flight_Cancelled”. Blue indicates that a flight was not cancelled and red indicates that a flight was
cancelled. Distances between 500 and 750 miles show a higher density of red points, which suggests that
shorter flights see more cancellations.
The distance variable was noted for further investigation.
Using Tableau, we looked for interesting facts about key variables.
a) Monthly Distribution of cancellations:
The charts show that June and July are the months with the most flight delays and cancellations. The
number of diverted flights also increases in June and July.
b) Geographic distribution of flight delays
The three graphs show that:
1. Georgia had the most flights delayed due to weather.
2. Texas had the most flights delayed due to security checks.
3. Thursday sees the most flight delays.
12. Unsupervised and Supervised Analytics Models
For this project, we used k-means clustering as our unsupervised model, and tried decision trees and
regression models for each of the three domains: airports, airlines and segments.
Unsupervised Learning Model
In the segments domain, running a k-means cluster analysis gave the following:
We had 46 clusters of segments. We were primarily interested in grouping segments based on the
departures performed and the total flights cancelled in that segment.
We identified 5 major clusters. Departures performed in these clusters ranged from 6 to 864, and flights
cancelled ranged from 0 to 75. In decreasing order of frequency, the five clusters are:
● The largest cluster comprised segments averaging approximately 9 departures and 0.05 flight
cancellations.
● The next cluster comprised segments averaging approximately 55 departures and 0.21 flight
cancellations.
● The next cluster comprised segments averaging approximately 37.4 departures and 3 flight
cancellations.
● The next cluster comprised segments averaging approximately 119 departures and 0.39 flight
cancellations.
● The next cluster comprised segments averaging approximately 88 departures and 2.2 flight
cancellations.
We could not identify a significant trend with this model, so we continued with predictive
modelling.
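For readers unfamiliar with k-means, the idea can be sketched in plain Python. This is a generic Lloyd's-algorithm illustration on made-up (departures performed, flights cancelled) pairs, not the SAS clustering run described above; the starting centroids are chosen by hand and no feature scaling is applied, both of which are simplifying assumptions.

```python
import math


def kmeans(points, centroids, iterations=20):
    """Plain Lloyd's algorithm: assign each point to its nearest centroid, then re-average."""
    centroids = [tuple(c) for c in centroids]
    assignment = []
    for _ in range(iterations):
        assignment = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                new_centroids.append(tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
    return centroids, assignment


# Made-up (departures performed, flights cancelled) pairs, not the report's data.
segments = [(8, 0), (9, 0), (10, 1), (118, 0), (120, 1), (122, 0)]
centroids, labels = kmeans(segments, centroids=[segments[0], segments[3]])
print(centroids)
print(labels)
```

Without scaling, departures dominate the distance because their range is far larger than that of cancellations, which may be one reason clusters separate mainly by traffic volume.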
Supervised Learning Models
The two models that we looked at were:
1. Regression
2. Decision Tree
We base our final analysis on whichever of these two models has the lower average squared error
(ASE).
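The selection criterion can be sketched as follows. This is a generic definition of average squared error on made-up numbers; the function names are hypothetical, and SAS Enterprise Miner's exact computation for a class target may differ in detail.

```python
def average_squared_error(actual, predicted):
    """Mean of squared differences between 0/1 outcomes and predicted probabilities."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def pick_best(actual, candidates):
    """Score every candidate model and return the name with the lowest ASE."""
    scores = {}
    for name, predictions in candidates.items():
        scores[name] = average_squared_error(actual, predictions)
    return min(scores, key=scores.get), scores


# Made-up validation outcomes (1 = cancelled) and predictions from two hypothetical models.
actual = [0, 0, 1, 0, 1]
candidates = {
    "regression": [0.2, 0.1, 0.6, 0.3, 0.5],
    "decision_tree": [0.1, 0.1, 0.8, 0.2, 0.7],
}
best, scores = pick_best(actual, candidates)
print(best, scores)
```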
Regression Analysis
We conducted regression analysis to determine the significant factors that influence flight cancellations. We
performed backward, forward and stepwise regression. The diagram below represents the regression diagram:
The following actions were performed on the data:
1. Data Partition: The data was partitioned into training and validation sets for model fitting and to guard
against overfitting the training data.
2. Impute: Missing values were imputed.
3. Regression Snapshots:
Stepwise Regression (With Airline ID as Target):
The ASE for Validation (Stepwise): 0.100689
We also looked at the regressions for the other selection methods and decided to go ahead with stepwise, as
it had the lowest average squared error.
Output of the stepwise regression, depicting all significant variables:
Stepwise Regression (With Origin Airport ID as Target):
The ASE for this model was 0.112633.
Similarly, for the segment-wise regression model, we got an ASE of 0.090134.
The errors from the regression models were consistently higher than those from the decision trees (for
example, 0.100689 versus 0.078363 in the airline domain), so we rejected the regression model and based our
analysis on the decision tree.
Decision Tree Analysis
Decision trees are a simple but powerful form of multiple-variable analysis. They provide unique capabilities to
supplement, complement, and substitute for traditional statistical forms of analysis. To assess the important
variables in this study, we applied the decision tree model in SAS to our dataset. Using cross-validation, we
found the most important variables for our target and conducted further analysis to provide business
suggestions on the factors that affect flight cancellations.
A) Based on Airline ID domain
Experiment Methodology:
1. Import the following dataset :
T-100 Segment data for the months of May, June and July (84,232 rows).
2. Edit variables and set the role of each variable
Variable                              Role     Level
Airline ID                            ID       Nominal
Aircraft Config                       Input    Interval
Aircraft Group                        Input    Interval
Aircraft Categorization               Input    Nominal
Departure Performed                   Input    Interval
Class                                 Input    Nominal
Average Freight                       Input    Interval
Average Airtime                       Input    Interval
Average Total Time at ground on bot   Input    Interval
Average Mail                          Input    Interval
Average Passengers                    Input    Interval
Average Payload                       Input    Interval
Average Ramp to Ramp                  Input    Interval
Distance                              Input    Interval
Month                                 Input    Interval
Flight Cancelled                      Target   Nominal
The other variables, which are not important for this analysis, were rejected.
3. Data Partition
70% of the data was used for training and 30% for validation; all other settings were left at their defaults.
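A 70/30 partition can be illustrated with a simple random split. This sketch is an assumption-laden stand-in for the SAS Data Partition node: the function name and seed are made up, and the node's default sampling method may not be a plain shuffle.

```python
import random


def partition(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of the rows and split it into training and validation sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is reproducible
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


train, validation = partition(range(1000))
print(len(train), len(validation))  # 700 300
```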
4. Transformation
Variable transformations can be used to stabilize variance, remove nonlinearity, improve
additivity, and counter non-normality. The following variables were transformed in order to
address these irregularities:
Variable                  Method
Average Ramp to Ramp      Log
Average Payload           Log
Average Passengers        Log
Average Airtime           Log
Aircraft Categorisation   Dummy Indicator
Class                     Dummy Indicator
After transformation, the skewness of the variables was reduced considerably, as seen in the figures below:
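The effect of a log transformation on skewness can be checked with a small sketch. The values below are made up to be right-skewed, and the plain natural log assumes strictly positive inputs; how zero values were treated in the SAS transformation is not stated in this report.

```python
import math


def skewness(values):
    """Population skewness: the third standardized moment."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Made-up right-skewed per-flight averages, not the report's data.
avg_payload = [5, 8, 10, 12, 15, 20, 40, 90, 400, 2500]
log_payload = list(map(math.log, avg_payload))

print(round(skewness(avg_payload), 2))
print(round(skewness(log_payload), 2))
```

The log compresses the long right tail, so the second figure comes out smaller than the first.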
5. Decision Tree Analysis
The tree was fitted with cross-validation; all other settings were left at their defaults.
6. Results
The ASE for the validation data is 0.078363.
Decision Tree:
We also looked at the important variables for this dataset:
The subtree assessment plot showed that the tree was pruned to 45 leaves.
7. Outcomes
For a given airline, if:
● the number of departures performed is more than approximately 3,
● the average number of passengers travelling is less than approximately 3,
then there is a 99.6% probability that a flight of that airline will not be cancelled.
For a given airline, if:
● the average payload is less than 10,
● the Class is F,
● the departures performed are fewer than 49,
then there is an 82.4% probability that the flight will be cancelled.
For a given airline, if:
● the departures performed are more than 70,
● the average payload is more than 9 pounds,
● the average total time on ground is more than 18 minutes,
then there is an 83.3% probability that the flight will be cancelled.
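The three airline rules above can be written as a small lookup, which makes the thresholds easy to test against a record. This is a sketch under stated assumptions: the feature names are hypothetical, only the conditions quoted above are encoded (the full tree paths are not reproduced in this report), and because the quoted conditions can overlap, the function returns every rule a record satisfies instead of a single leaf.

```python
# Each rule: (label, condition on a feature dict, probability of cancellation).
RULES = [
    ("few passengers, more than 3 departures",
     lambda f: f["departures"] > 3 and f["avg_passengers"] < 3,
     0.004),  # stated above as a 99.6% probability of NOT being cancelled
    ("light payload, Class F, under 49 departures",
     lambda f: f["avg_payload"] < 10 and f["service_class"] == "F" and f["departures"] < 49,
     0.824),
    ("over 70 departures, heavier payload, long ground time",
     lambda f: f["departures"] > 70 and f["avg_payload"] > 9 and f["avg_ground_time"] > 18,
     0.833),
]


def matching_rules(features):
    """Return (label, P(cancelled)) for every quoted rule the record satisfies."""
    return [(label, prob) for label, condition, prob in RULES if condition(features)]


# A made-up record for illustration.
example = {"departures": 80, "avg_passengers": 50, "avg_payload": 12,
           "service_class": "F", "avg_ground_time": 20}
print(matching_rules(example))
```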
B) Based on Airport ID
Changing the ID variable to Origin Airport ID and keeping the other configurations the same, we see the
following results:
The ASE for the validation data is 0.0987131.
The decision tree:
We see that the same set of variables was important for this analysis as well:
The subtree assessment plot with the average squared errors:
Outcomes
For a given airport, if:
● the departures performed are more than 42,
● the average payload is less than 10 pounds,
● the average mail carried is more than 1,
then it is very unlikely that the flight will be cancelled (100% probability of no cancellation).
For a given airport, if:
● the departures performed are more than 70,
● the flights belong to Class F,
● the average payload is less than 10 pounds and Aircraft Config is less than 2,
then it is 83.6% likely that the flight will be cancelled.
C) Based on Segments (Origin Airport ID and Destination Airport ID pairs)
Experiment Methodology:
1. Import the following dataset :
T-100 Segment data for the months of May, June and July (84,232 rows).
2. Edit variables and set the role of each variable
Variable                              Role     Level
Origin_Airport_ID                     ID       Nominal
Dest_Airport_ID                       ID       Nominal
flightAdHoc?                          Input    Binary
Aircraft Config                       Input    Interval
Aircraft Group                        Input    Interval
Aircraft Categorization               Input    Nominal
Departure Performed                   Input    Interval
Class                                 Input    Nominal
Average Freight                       Input    Interval
Average Airtime                       Input    Interval
Average Total Time at ground on bot   Input    Interval
Average Mail                          Input    Interval
Average Passengers                    Input    Interval
Average Payload                       Input    Interval
Distance                              Input    Interval
Month                                 Input    Interval
Flight Cancelled?                     Target   Nominal
The other variables, which are not important for this analysis, were rejected.
3. Data Partition
70% of the data was used for training and 30% for validation; all other settings were left at their defaults.
4. Transformation
Variable                  Method
Average Payload           Log
Average Passengers        Log
Average Airtime           Log
Aircraft Categorisation   Dummy Indicator
Class                     Dummy Indicator
After transformation, the skewness of the variables was reduced considerably, as seen in the figures shown
in the airline-based analysis above.
5. Decision Tree Analysis
The tree was fitted with cross-validation; all other settings were left at their defaults.
6. Results
The ASE for the validation data is 0.081963.
Decision Tree:
We also looked at the important variables for this dataset:
The subtree assessment plot showed that the tree was pruned to 36 leaves.
7. Outcomes
For a given segment, if:
● the number of departures performed is more than approximately 70,
● the average allotted payload is less than approximately 9 pounds,
then there is an 88% probability that flights in that segment will be cancelled.
For a given segment, if:
● the number of departures performed is more than approximately 70,
● the average allotted payload is more than approximately 9 pounds,
● the average total time on ground for both source airport and destination airport is greater than
approximately 19 minutes,
then there is an 83.3% probability that flights in that segment will be cancelled.
For a given segment, if:
● the number of departures performed is less than approximately 10 and greater than 2,
● the flights took off ad hoc, without being scheduled,
then there is a 94.7% probability that flights in that segment will be cancelled.
29. Recommendations and Conclusion
Important Variables Venn Analysis
We performed a Venn analysis of the important variables in each of the three domains and plotted them,
focusing on the ones that mattered in arriving at our recommendations.
● Departures Performed and Avg. Payload are the most important variables in our analysis for all
three domains. They are the deciding variables for cancellations across segments, airlines and airports.
● Airlines and Segments share avg total time on ground at both source and destination as an
important variable. This is interesting because it is counter-intuitive: one would expect it to
appear as a deciding variable for airports.
● Airlines and airports share the aircraft_class variable.
● FlightAdHoc is important only for segments, Avg. Passengers only for airlines, and Aircraft Config
and Avg Mails only for airports.
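The Venn regions described in the bullets can be reproduced with Python sets. The variable lists below are transcribed from the bullets above (they are not taken from the SAS output itself), and the function name is hypothetical.

```python
def venn_regions(important):
    """Return the variables shared by all domains and those unique to each domain."""
    shared_by_all = set.intersection(*important.values())
    unique = {}
    for domain, variables in important.items():
        others = set().union(*(v for d, v in important.items() if d != domain))
        unique[domain] = variables - others
    return shared_by_all, unique


# Important variables per domain, as listed in the bullets above.
important = {
    "segments": {"Departures Performed", "Avg Payload", "Avg Total Time on Ground",
                 "FlightAdHoc"},
    "airlines": {"Departures Performed", "Avg Payload", "Avg Total Time on Ground",
                 "Aircraft Class", "Avg Passengers"},
    "airports": {"Departures Performed", "Avg Payload", "Aircraft Class",
                 "Aircraft Config", "Avg Mails"},
}
shared, unique = venn_regions(important)
print(sorted(shared))
print({domain: sorted(names) for domain, names in unique.items()})
```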
Findings and Recommendations
Segments
Findings:
● Segments whose flights carry very little payload on average (< 8 pounds) but fly
frequently are likely to see cancellations. Moreover, segments whose flights carry higher
payloads and fly frequently, but spend more than 18 minutes on the ground at both the source and
destination airports, are also likely to see cancellations.
● Segments that have few departures and flights taking off without being scheduled
see few or no cancellations.
Recommendations:
● The airport should pilot a program to redirect traffic from a few congested segments to runways
that handle non-scheduled flights. Based on the results, it can determine whether the priority
given to non-scheduled aircraft was causing cancellations.
● A new runway should be opened to speed up ground handling and reduce the average time that
higher-payload aircraft spend on the ground at both source and destination.
● The airport is accommodating flights in non-congested segments, including flights that are not
scheduled. However, flights in congested, heavy-traffic segments with few or no passengers are
being cancelled, as are those that carry passengers and cargo and take time on the ground
at both source and destination.
Airlines:
Findings:
● Small flights (carrying three or fewer people) that fly more often (more than 3
departures) have very little chance of being cancelled.
● Flights that fly more often with little payload (less than 9 pounds) tend to be cancelled
more often. They also spend a considerable amount of time at the airports (18 minutes).
Recommendations:
● The last recommendation for segments also applies to the airlines domain. Airline ground
crews should ensure quick ground handling at the airport for higher-payload aircraft at both
source and destination.
● The payload analysis from segments agrees with our finding for aircraft with fewer
passengers. Just as low-payload, high-departure segment flights were being cancelled, the same
holds true for airlines. Airline ground staff at airports should be alert when these flights are
scheduled to arrive and depart, to make sure that handling time is short.
Airports:
Findings:
● At airports with frequent departures (more than 70), flights with relatively low payload (10 pounds
or less) that belong to Class F and have mail loaded onto the aircraft are very likely to be
cancelled.
Recommendations:
● As these cancellations affect a large population, the airports should study Scheduled
Passenger/Cargo service flights to understand why they are frequently cancelled.
Our findings suggest that handling time, in terms of baggage and mail loading onto the
aircraft, is a deciding factor in cancellations, alongside other important variables. In
conclusion, handling at the airports is taking time.
32. Possible next steps
According to the Wall Street Journal, illness, family emergencies, and rescheduled business meetings are big
business for airline companies.[3] At some airlines, the resulting change fees and penalties paid by passengers
added up to $2 billion a year, which is even higher than total baggage fees. If airlines delve further
into seasonal customer data to identify cancellation patterns on the passenger side, and adjust change fees
and penalties according to the patterns discovered, they can generate higher revenue.
[3] Source: http://online.wsj.com/news/articles/SB10001424052970204563304574318212311819146