FORECASTING PROJECT ON US DOMESTIC FLIGHTS
(In Revolution Analytics)
Prepared By:
Wyendrila Roy
http://in.linkedin.com/pub/wyendrila-roy/5/3a/876
Acknowledgement
This project is done as a final project, as a part of the course titled “Business Analytics with R”. I am
really thankful to our course instructor Mr. Ajay Ohri, Founder, DecisionStats, for giving me an
opportunity to do the project in Time Series Analysis using R and providing me with the necessary
support and guidance which made me complete the project on time. I am extremely grateful to him for
providing me with the big data set and also the necessary links to start off the project and understand
Time Series Analysis.
In this project I have chosen the topic "Forecasting on US Domestic Flights", where I have analyzed the
flight activities at the top domestic airports of the US and then presented a prediction of the same for Jan
2010 – June 2011. Due to the size of the data set, this project is done in Revolution Analytics. I am really
grateful for the extremely resourceful articles and publications provided by Revolution Analytics, which
helped me in understanding the tool as well as the topic.
Also, I would like to extend my sincere regards to the support team of Edureka for their constant and
timely support.
Table of Contents
Methodology
1. Overview
2. Data Source
3. Limitations
4. Tool/Package Used
5. File Format Used
The Analysis
1. Importing the Data
2. Exploring the Data
3. Aggregating the Data
4. Building the Time Series
5. Predict Future Values based on the Time Series Analysis
Conclusion
References
Methodology
1. Overview
In this report we have analyzed time series data in the R language. We have used the "data step" functions
in Revolution Analytics' RevoScaleR package to access a large data file, manipulate it, sort it,
extract the data we needed, and then aggregate the records with monthly time stamps to form
multiple monthly time series. We have then used ordinary R time series functions to do some basic
analysis. Thereafter we have used forecasting functions to predict domestic flight activity for the top
airports in the US for the period Jan 2010 – June 2011.
2. Data Source
The dataset used in this report is the airlines “edge” flight data set (77,242 KB) from infochimps.com. It
contains 3.5 million monthly domestic flight records from 1990 to 2009.
3. Limitations
The major limitation was extracting the time series from the time-stamped data embedded in this very large
data set. Data sets of this size are too large to be read into memory and processed with standard R
functions.
4. Tool/Package Used
This report uses Revolution Analytics' add-on package called RevoScaleR™, which provides
unprecedented levels of performance and capacity for statistical analysis in the R environment. With the
help of this package, we can process, visualize and model the largest data sets in a fraction of the time
of legacy systems, without the need to deploy expensive or specialized hardware.
5. File Format Used
RevoScaleR provides a new data file type with extension .xdf that has been optimized for "data
chunking": accessing parts of an .xdf file for independent processing. .xdf files store data in a binary
format. The file format provides very fast access to a specified set of rows for a specified set of columns.
New rows and columns can be added to the file without re-writing the entire file. RevoScaleR also
provides a new R class, RxDataSource, that has been designed to support the use of external-memory
algorithms with .xdf files.
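For illustration (this sketch is not part of the report's workflow), a single chunk of rows, and only the columns of interest, can be pulled from an .xdf file into an ordinary data frame; the row offsets below are arbitrary:
# Read one chunk of the Flights .xdf file into a data frame,
# without loading the whole file into memory.
chunk <- rxReadXdf("Flights", varsToKeep = c("origin_airport", "flights"),
                   startRow = 100001, numRows = 5000)
head(chunk)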
The Analysis
1. Importing the Data
We use RevoScaleR's import facilities to read the raw text file into the special .xdf binary format used by
the RevoScaleR functions, and then inspect the result.
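The import call itself is not reproduced in the report; a minimal sketch, assuming the raw data sits in a delimited text file (the file name here is hypothetical), might look like this:
# Sketch only: convert the raw delimited text file (hypothetical name) into the
# binary .xdf format used below; column types can be controlled via colInfo.
rxImport(inData = "flights.csv", outFile = "Flights", overwrite = TRUE)
With the data in .xdf form, rxGetInfoXdf summarizes the file: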
> rxGetInfoXdf("Flights",getVarInfo=TRUE)
File name: C:\Documents and Settings\WENDELA\My Documents\Revolution\Flights.xdf
Number of observations: 3606803
Number of variables: 5
Number of blocks: 8
Variable information:
Var 1: origin_airport
683 factor levels: MHK EUG MFR SEA PDX ... CRE BOK BIH MQJ LCI
Var 2: destin_airport
708 factor levels: AMW RDM EKO WDG END ... COS HII PHD TBN OH1
Var 3: passengers, Type: integer, Low/High: (0, 89597)
Var 4: flights, Type: integer, Low/High: (0, 1128)
Var 5: month, Type: character
> rxGetInfoXdf(file="Flights",numRows=10,startRow=1)
File name: C:\Documents and Settings\WENDELA\My Documents\Revolution\Flights.xdf
Number of observations: 3606803
Number of variables: 5
Number of blocks: 8
Data (10 rows starting with row 1):
origin_airport destin_airport passengers flights month
1 MHK AMW 21 1 200810
2 EUG RDM 41 22 199011
3 EUG RDM 88 19 199012
4 EUG RDM 11 4 199010
5 MFR RDM 0 1 199002
6 MFR RDM 11 1 199003
7 MFR RDM 2 4 199001
8 MFR RDM 7 1 199009
9 MFR RDM 7 2 199011
10 SEA RDM 8 1 199002
2. Exploring the Data
Now we will sort the file by flights to find the origin/destination pairs with the most monthly
flights, and pick out the two origin airports having the most flights.
> rxSort(inData="Flights", outFile = "sortFlights", sortByVars="flights",
+ decreasing = TRUE,overwrite=TRUE)
> rxGetInfoXdf(file="sortFlights")
File name: C:\Documents and Settings\WENDELA\My Documents\Revolution\sortFlights.xdf
Number of observations: 3606803
Number of variables: 5
Number of blocks: 8
> mostflights5 <- rxGetInfoXdf(file="sortFlights",numRows=5,startRow=1)
> mostflights5
File name: C:\Documents and Settings\WENDELA\My Documents\Revolution\sortFlights.xdf
Number of observations: 3606803
Number of variables: 5
Number of blocks: 8
Data (5 rows starting with row 1):
origin_airport destin_airport passengers flights month
1 SFO LAX 83153 1128 199412
2 LAX SFO 80450 1126 199412
3 HNL OGG 73014 1058 199408
4 OGG HNL 77011 1056 199408
5 OGG HNL 63020 1044 199412
> top5f <- as.data.frame(mostflights5[[5]])
> topOA <- unique(as.vector(top5f$origin_airport))
> # Select the top 2
> top2 <- topOA[1:2]
> top2
[1] "SFO" "LAX"
From the above code we can see that the two origin airports with the most flights are San
Francisco International (SFO) and Los Angeles International (LAX).
Next, we use the RevoScaleR function rxDataStep to build a new file “mostFlights” containing only those
flights that originate in either SFO or LAX.
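The rxDataStep call is not shown in the report; a sketch of how it might have looked, given the description of the xform transform below (the exact arguments are assumptions):
# Sketch only: keep flights originating at SFO or LAX and add a two-level
# factor, origin, carrying the originating airport.
xform <- function(data) {
  data$origin <- factor(as.character(data$origin_airport),
                        levels = c("SFO", "LAX"))
  return(data)
}
rxDataStep(inData = "Flights", outFile = "mostFlights", overwrite = TRUE,
           rowSelection = origin_airport == "SFO" | origin_airport == "LAX",
           transformFunc = xform, transformVars = "origin_airport")
The resulting file can then be inspected: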
> rxGetInfoXdf("mostFlights",numRows=10,startRow=1)
File name: C:\Documents and Settings\WENDELA\My Documents\Revolution\mostFlights.xdf
Number of observations: 144505
Number of variables: 6
Number of blocks: 8
Data (10 rows starting with row 1):
origin_airport destin_airport passengers flights month origin
1 SFO RDM 1413 92 199003 SFO
2 SFO RDM 1394 88 199006 SFO
3 SFO RDM 922 86 199001 SFO
4 SFO RDM 1661 93 199008 SFO
5 SFO RDM 1093 88 199005 SFO
6 SFO RDM 995 79 199011 SFO
7 SFO RDM 1080 83 199004 SFO
8 SFO RDM 1279 78 199012 SFO
9 SFO RDM 1080 83 199002 SFO
10 SFO RDM 1493 92 199007 SFO
> rxGetInfoXdf("mostFlights",getVarInfo=TRUE)
File name: C:\Documents and Settings\WENDELA\My Documents\Revolution\mostFlights.xdf
Number of observations: 144505
Number of variables: 6
Number of blocks: 8
Variable information:
Var 1: origin_airport
683 factor levels: MHK EUG MFR SEA PDX ... CRE BOK BIH MQJ LCI
Var 2: destin_airport
708 factor levels: AMW RDM EKO WDG END ... COS HII PHD TBN OH1
Var 3: passengers, Type: integer, Low/High: (0, 83153)
Var 4: flights, Type: integer, Low/High: (0, 1128)
Var 5: month, Type: character
Var 6: origin
2 factor levels: SFO LAX
> rxHistogram(~flights|origin, data="mostFlights")
The transformation function, xform, used in rxDataStep creates a new variable, origin, with only two
levels ("SFO" and "LAX") to hold the information on the origin airports. The last line of code in this section,
the rxHistogram call, produces the histogram of monthly flights for each origin airport.
3. Aggregating the Data
Now we will break the month variable (which we originally imported as character data) into a month
and year component in order to proceed with our Time Series Analysis.
> xfunc <- function(data){
+ data$Month <- as.integer(substring(data$month,5,6))
+ data$Year <- as.integer(substring(data$month,1,4))
+ return(data)}
> # Add a new variable for time series work
> rxDataStepXdf(inFile="mostFlights", outFile = "SFO_LAX",
+ overwrite = TRUE, transformVars="month", transformFunc = xfunc)
> rxGetInfoXdf(file="SFO_LAX", numRows=10, startRow=1)
> rxDataStepXdf(inFile="SFO_LAX", outFile = "SFO.LAX",
+ varsToDrop=c("origin_airport","month"),
+ overwrite=TRUE)
> rxGetInfoXdf(file="SFO.LAX",numRows=10,startRow=1)
File name: C:\Documents and Settings\WENDELA\My Documents\Revolution\SFO.LAX.xdf
Number of observations: 144505
Number of variables: 6
Number of blocks: 8
Data (10 rows starting with row 1):
destin_airport passengers flights origin Month Year
1 RDM 1413 92 SFO 3 1990
2 RDM 1394 88 SFO 6 1990
3 RDM 922 86 SFO 1 1990
4 RDM 1661 93 SFO 8 1990
5 RDM 1093 88 SFO 5 1990
6 RDM 995 79 SFO 11 1990
7 RDM 1080 83 SFO 4 1990
8 RDM 1279 78 SFO 12 1990
9 RDM 1080 83 SFO 2 1990
10 RDM 1493 92 SFO 7 1990
The transformation function, xfunc, used in rxDataStepXdf uses ordinary R string handling functions to
break apart the month data. A second data step function drops the unnecessary variables from our final
file: SFO.LAX.
4. Building the Time Series
The function rxCube counts the number of flights in each combination of Year, Month and origin airport.
> t1 <-rxCube(flights ~ F(Year):F(Month):origin, removeZeroCounts=TRUE,data = "SFO_LAX")
> t1 <- as.data.frame(t1)
> head(t1)
F_Year F_Month origin flights Counts
1 1990 1 SFO 39.04225 284
2 1991 1 SFO 38.42034 295
3 1992 1 SFO 46.23954 263
4 1993 1 SFO 44.39464 261
5 1994 1 SFO 36.15417 240
6 1995 1 SFO 45.76768 198
From the above table, we see that there were 284 records where the originating airport was SFO for the
first month of 1990. The average number of flights among these 284 counts was 39.04225. From this
information, we can calculate the total number of flights for each month. The next bit of code does this
and forms the time information into a proper date. Note that we have reduced the data sufficiently so
that we are now working with a data frame, t1.
Now we will compute the total flights out and combine the month and year into a proper date:
> t1$flights_out <- t1$flights*t1$Counts
> names(t1) <- c("Year","Month","origin","avg.flights.per.destin","total.destin","flights.out")
> t1$Date <- as.Date(as.character(paste(t1$Month,"- 28 -",t1$Year)),"%m - %d - %Y")
> head(t1)
Year Month origin avg.flights.per.destin total.destin flights.out Date
1 1990 1 SFO 39.04225 284 11088 1990-01-28
2 1991 1 SFO 38.42034 295 11334 1991-01-28
3 1992 1 SFO 46.23954 263 12161 1992-01-28
4 1993 1 SFO 44.39464 261 11587 1993-01-28
5 1994 1 SFO 36.15417 240 8677 1994-01-28
6 1995 1 SFO 45.76768 198 9062 1995-01-28
Now, we extract the SFO data, sort it by date to form a time series, and plot it.
> SFO.t1 <- t1[t1$origin=="SFO",]
> SFO.t1 <- SFO.t1[order(SFO.t1$Date),]
> x <- SFO.t1$Date
> y <- SFO.t1$flights.out
> library(ggplot2)
> qplot(x,y, geom="line",xlab="", ylab="Number of Flights\n",main="Monthly Flights Out of SFO")
We use the R function, ts, to form the data into a time series object, and use the function stl to perform
a seasonal decomposition.
> SFO.ts <- ts(y,start=x[1],freq=12)
> sd.SFO <- stl(SFO.ts,s.window="periodic")
> plot(sd.SFO)
In the above graph, the first panel reproduces the time series. The second panel shows the periodic,
seasonal component. The third panel displays the trend and the fourth panel displays the remainder (residuals).
[Figure: seasonal decomposition of the monthly SFO flights series (stl), with panels data, seasonal, trend and remainder]
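The decomposed components can also be inspected directly from the stl object; a small illustration, not shown in the original report:
# The stl fit stores the components in a time-series matrix:
head(sd.SFO$time.series)              # columns: seasonal, trend, remainder
seasonal.SFO <- sd.SFO$time.series[, "seasonal"]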
We may now repeat the above steps for the LAX data.
> LAX.t1 <- t1[t1$origin=="LAX",]
> LAX.t1 <- LAX.t1[order(LAX.t1$Date),]
> a <-LAX.t1$Date
> b<-LAX.t1$flights.out
> qplot(a,b, geom="line",xlab="", ylab="Number of Flights\n",main="Monthly Flights Out of LAX")
[Figure: Monthly Flights Out of LAX, 1990-2010]
> LAX.ts <- ts(b, start=a[1], freq=12)
> sd.LAX <- stl(LAX.ts,s.window="periodic")
> plot(sd.LAX)
[Figure: seasonal decomposition of the monthly LAX flights series (stl), with panels data, seasonal, trend and remainder]
5. Predict Future Values based on the Time Series Analysis
Now we can proceed with the forecasting analysis. For the rest of the analysis we will work on the SFO time
series data and predict its values for the period Jan 2010 – June 2011. We will use Simple
Exponential Smoothing as well as an ARIMA model for our forecasting analysis.
> SFO.ts = ts(y, start = c(1990), freq=12)
> plot.ts(SFO.ts)
Simple Exponential Smoothing
> fit <- HoltWinters(SFO.ts, beta=FALSE, gamma=FALSE)
> fit
Smoothing parameters:
alpha: 0.6726511
beta : FALSE
gamma: FALSE
Coefficients:
[,1]
a 10987.14
[Figure: time series plot of monthly flights out of SFO, 1990-2010]
In the example above, we have stored the output of the HoltWinters() function in the list variable "fit".
The estimated smoothing parameter alpha is about 0.67, so the level estimate places substantial weight
on recent observations. By default, HoltWinters() just makes forecasts for the same time period covered
by our original time series. In this case, our original time series covered the number of flights originating
from SFO from 1990 to 2009, so the forecasts are also for 1990-2009.
> plot(fit)
The plot shows the original time series in black, and the forecasts in red.
As a measure of the accuracy of the forecasts, we can calculate the sum of squared errors for the in-
sample forecast errors, that is, the forecast errors for the time period covered by our original time
series. The sum of squared errors is stored in a named element of the list variable "fit" called "SSE", so
we can get its value by typing:
> fit$SSE
[1] 95039885
That is, here the sum of squared errors is 95,039,885.
As explained above, by default HoltWinters() just makes forecasts for the time period covered by the
original data, which is 1990-2009 in this case. We can make forecasts for further time points by using the
“forecast.HoltWinters()” function in the R “forecast” package.
[Figure: Holt-Winters filtering - observed and fitted monthly SFO flights, 1990-2010]
> Forecast <- forecast.HoltWinters(fit, h=18)
> Forecast
Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
Jan 2010 10987.14 10177.296 11796.98 9748.591 12225.68
Feb 2010 10987.14 10011.132 11963.14 9494.466 12479.81
Mar 2010 10987.14 9869.403 12104.87 9277.711 12696.56
Apr 2010 10987.14 9743.726 12230.55 9085.504 12888.77
May 2010 10987.14 9629.634 12344.64 8911.015 13063.26
Jun 2010 10987.14 9524.415 12449.86 8750.096 13224.18
Jul 2010 10987.14 9426.272 12548.00 8600.000 13374.28
Aug 2010 10987.14 9333.946 12640.33 8458.798 13515.48
Sep 2010 10987.14 9246.509 12727.77 8325.076 13649.20
Oct 2010 10987.14 9163.260 12811.02 8197.757 13776.52
Nov 2010 10987.14 9083.648 12890.63 8076.001 13898.27
Dec 2010 10987.14 9007.235 12967.04 7959.137 14015.14
Jan 2011 10987.14 8933.663 13040.61 7846.619 14127.66
Feb 2011 10987.14 8862.637 13111.64 7737.995 14236.28
Mar 2011 10987.14 8793.911 13180.36 7632.886 14341.39
Apr 2011 10987.14 8727.273 13247.00 7530.973 14443.30
May 2011 10987.14 8662.545 13311.73 7431.980 14542.30
Jun 2011 10987.14 8599.571 13374.70 7335.670 14638.61
The forecast.HoltWinters() function gives the forecast for our 18-month period, an 80% prediction
interval for the forecast, and a 95% prediction interval for the forecast. For example, the forecasted
value for Jan 2010 is about 10987.14, with a 95% prediction interval of (9748.591, 12225.68).
To plot the predictions made by forecast.HoltWinters(), we can use the "plot.forecast()" function:
> plot.forecast(Forecast)
Here the forecasts for Jan 2010 – June 2011 are plotted as a dark blue line, the 80% prediction interval
as a blue shaded area, and the 95% prediction interval as a light blue shaded area.
The ‘forecast errors’ are calculated as the observed values minus predicted values, for each time point.
We can only calculate the forecast errors for the time period covered by our original time series, which
is 1990-2009 for the Flight data.
The in-sample forecast errors are stored in the named element “residuals” of the list variable returned
by forecast.HoltWinters().
We will now obtain a correlogram of the in-sample forecast errors for lags 1-20. We can calculate a
correlogram of the forecast errors using the “acf()” function in R. To specify the maximum lag that we
want to look at, we use the “lag.max” parameter in acf().
[Figure: Forecasts from HoltWinters for Jan 2010 - June 2011 with 80% and 95% prediction intervals]
> acf(Forecast$residuals, lag.max=20)
To test whether there is significant evidence for non-zero correlations at lags 1-20, we can carry out a
Ljung-Box test.
> Box.test(Forecast$residuals, lag=20, type="Ljung-Box")
Box-Ljung test
data: Forecast$residuals
X-squared = 370.1992, df = 20, p-value < 2.2e-16
The very small p-value indicates significant non-zero autocorrelation in the in-sample forecast errors at
lags 1-20, suggesting that the simple exponential smoothing forecasts can probably be improved upon.
To be sure that the predictive model cannot be improved upon, it is also a good idea to check whether
the forecast errors are normally distributed with mean zero and constant variance. To check whether
the forecast errors have constant variance, we can make a time plot of the in-sample forecast errors:
[Figure: correlogram (ACF) of the in-sample forecast errors, Forecast$residuals, lags 1-20]
> plot.ts(Forecast$residuals)
The plot shows that the in-sample forecast errors seem to have roughly constant variance over time,
although the size of the fluctuations at the start of the time series may be slightly smaller than at later
dates. The fluctuations for the time period 2000-2005 are quite high.
To check whether the forecast errors are normally distributed with mean zero, we can plot a histogram
of the forecast errors, with an overlaid normal curve that has mean zero and the same standard
deviation as the distribution of forecast errors.
[Figure: time plot of the in-sample forecast errors, Forecast$residuals, 1990-2009]
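The code producing the histogram is not shown in the report; a minimal sketch in the spirit of the plotForecastErrors() helper from the Little Book of R on Time Series (whose histogram title matches the figure below) would be:
# Sketch only: histogram of the in-sample forecast errors with an overlaid
# normal density of mean zero and the errors' standard deviation.
forecasterrors <- na.omit(Forecast$residuals)
hist(forecasterrors, freq = FALSE, col = "red",
     main = "Histogram of forecasterrors")
curve(dnorm(x, mean = 0, sd = sd(forecasterrors)),
      add = TRUE, col = "blue", lwd = 2)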
The plot shows that the distribution of forecast errors is roughly centered on zero, and is more or less
normally distributed, although it seems to be slightly skewed to the left compared to a normal curve.
However, the left skew is relatively small, and so it is plausible that the forecast errors are normally
distributed with mean zero.
[Figure: histogram of the forecast errors (forecasterrors) with overlaid normal curve]
Autoregressive Integrated Moving Average (ARIMA) models
Autoregressive Integrated Moving Average (ARIMA) models include an explicit statistical model for the
irregular component of a time series that allows for non-zero autocorrelations in the irregular
component.
a. Differencing a Time Series
ARIMA models are defined for stationary time series. Therefore, if you start off with a non-stationary
time series, you will first need to ‘difference’ the time series until you obtain a stationary time series. If
you have to difference the time series d times to obtain a stationary series, then you have an
ARIMA(p,d,q) model, where d is the order of differencing used.
> SFO_diff <- diff(SFO.ts, differences=1)
> plot.ts(SFO_diff)
The resulting time series of first differences (above) does not appear to be stationary in mean.
Therefore, we can difference the time series twice, to see if that gives us a stationary time series:
[Figure: time plot of the first differences, SFO_diff, 1990-2010]
> SFO_diff_1 <- diff(SFO.ts, differences=2)
> plot.ts(SFO_diff_1)
The time series of second differences (above) does appear to be stationary in mean and variance, as the
level of the series stays roughly constant over time, and the variance of the series appears roughly
constant over time. Thus, it appears that we need to difference the time series of the ‘SFO flights’ twice
in order to achieve a stationary series.
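As a quick cross-check, not part of the original report, the ndiffs() function in the forecast package estimates the order of differencing needed for stationarity:
# Sketch only: automated estimate of the required differencing order d.
library(forecast)
ndiffs(SFO.ts)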
This means that we can use an ARIMA(p,d,q) model for the above time series, where d (order of
differencing) = 2, i.e., ARIMA(p,2,q). The next step is to figure out the values of p and q for the ARIMA
model. To do this, we usually need to examine the correlogram and partial correlogram of the
stationary time series.
b. Autocorrelations
> acf(SFO_diff_1, lag.max=20) # plot a correlogram
> acf(SFO_diff_1, lag.max=20, plot=FALSE)
Autocorrelations of series ‘SFO_diff_1’, by lag
0.0000 0.0833 0.1667 0.2500 0.3333 0.4167 0.5000 0.5833 0.6667 0.7500 0.8333
1.000 -0.709 0.264 0.084 -0.361 0.553 -0.652 0.529 -0.327 0.078 0.231
0.9167 1.0000 1.0833 1.1667 1.2500 1.3333 1.4167 1.5000 1.5833 1.6667
-0.572 0.787 -0.621 0.286 0.035 -0.295 0.501 -0.623 0.504 -0.287
[Figure: time plot of the second differences, SFO_diff_1, 1990-2010]
We see from the correlogram that the autocorrelations at lags 1, 2, and 3 exceed the significance bounds,
but they are decreasing and approach zero after lag 3, although there are other autocorrelations between
lags 1-20 that exceed the significance bounds.
c. Partial Autocorrelations
> pacf(SFO_diff_1, lag.max=20) # plot a partial correlogram
> pacf(SFO_diff_1, lag.max=20, plot=FALSE)
Partial autocorrelations of series ‘SFO_diff_1’, by lag
0.0833 0.1667 0.2500 0.3333 0.4167 0.5000 0.5833 0.6667 0.7500 0.8333 0.9167
-0.709 -0.481 0.055 -0.326 0.232 -0.357 -0.027 -0.401 -0.019 -0.053 -0.464
1.0000 1.0833 1.1667 1.2500 1.3333 1.4167 1.5000 1.5833 1.6667
0.130 0.096 0.045 -0.020 0.100 0.118 -0.013 -0.073 0.005
[Figure: correlogram (ACF) of SFO_diff_1, lags 1-20]
The partial correlogram shows that the partial autocorrelations at lags 1 and 2 exceed the significance
bounds, are negative, and are slowly decreasing in magnitude with increasing lag. The partial
autocorrelations approach zero after lag 2.
[Figure: partial correlogram (PACF) of SFO_diff_1, lags 1-20]
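The report does not show how the MA order of 3 was settled on; as a cross-check, again not part of the original analysis, auto.arima() from the forecast package can search for a suitable order automatically:
# Sketch only: automated (p,d,q) selection by AICc, as a sanity check on the
# manually chosen ARIMA(2,2,3) model fitted below.
library(forecast)
auto.arima(SFO.ts)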
d. Forecasting using an ARIMA model
> SFO_arima <- arima(SFO.ts, order=c(2,2,3)) # fit an ARIMA(2,2,3) model
> SFO_arima
Series: SFO.ts
ARIMA(2,2,3)
Coefficients:
ar1 ar2 ma1 ma2 ma3
-1.7315 -0.9996 0.7477 -0.7477 -1.000
s.e. 0.0014 0.0006 0.0211 0.0218 0.021
sigma^2 estimated as 215044: log likelihood=-1806.58
AIC=3625.17 AICc=3625.53 BIC=3646
>
> SFO_arimaforecasts <- forecast.Arima(SFO_arima, h=18)
> SFO_arimaforecasts
Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
Jan 2010 11141.50 10543.492 11739.51 10226.924 12056.08
Feb 2010 10705.97 9854.680 11557.25 9404.037 12007.89
Mar 2010 11356.13 10317.891 12394.36 9768.281 12943.97
Apr 2010 10666.23 9459.164 11873.29 8820.184 12512.27
May 2010 11211.39 9863.664 12559.11 9150.221 13272.56
Jun 2010 10957.55 9475.979 12439.12 8691.684 13223.41
Jul 2010 10852.63 9249.182 12456.08 8400.366 13304.90
Aug 2010 11288.52 9572.908 13004.13 8664.718 13912.32
Sep 2010 10639.15 8812.668 12465.63 7845.787 13432.51
Oct 2010 11328.32 9402.626 13254.00 8383.228 14273.40
Nov 2010 10784.62 8758.111 12811.13 7685.342 13883.90
Dec 2010 11037.64 8918.355 13156.93 7796.473 14278.81
Jan 2011 11143.50 8933.374 13353.62 7763.406 14523.59
Feb 2011 10707.79 8408.235 13007.34 7190.924 14224.65
Mar 2011 11356.89 8974.390 13739.40 7713.169 15000.62
Apr 2011 10668.99 8200.825 13137.16 6894.256 14443.73
May 2011 11211.75 8664.892 13758.61 7316.668 15106.83
Jun 2011 10960.08 8333.068 13587.08 6942.415 14977.74
> plot.forecast(SFO_arimaforecasts)
[Figure: Forecasts from ARIMA(2,2,3) for Jan 2010 - June 2011 with 80% and 95% prediction intervals]
Conclusion
In this report, we worked on a large data set, the airlines flight data set from infochimps.com, which
consists of 3.5 million monthly domestic flight records from 1990 to 2009. We started by
exploring the data set, figuring out the variables it contains and their data types, and computing basic
summary statistics. The next task was to prepare the data for analysis, which in addition to cleaning the
data also involved supplementing the data set with additional information, removing unnecessary
variables and transforming some variables in a way that made sense for the contemplated analysis.
The data set got progressively smaller as the analysis proceeded. We prepared two subsets of the
overall data, covering the flights originating from SFO and from LAX, and carried out the time series
analysis on both of them.
Finally, we did our forecasting analysis on the SFO time series, using two techniques: Simple Exponential
Smoothing and forecasting based on an ARIMA model.
References
Books/Articles
1. Little Book of R on Time Series Analysis - Avril Coghlan
2. Introduction to R's time series facilities - Michael Lundholm
3. Working with Time Series Data in R - Eric Zivot
4. White Papers on Big Data and Data Step - Revolution Analytics
Websites
1. http://www.inside-r.org/howto/extracting-time-series-large-data-sets
2. http://www.infochimps.com/datasets/us-domestic-flights-from-1990-to-2009
3. http://www.revolutionanalytics.com/
Contenu connexe

Dernier

"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 

Dernier (20)

"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 

En vedette

PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024Neil Kimberley
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)contently
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024Albert Qian
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsKurio // The Social Media Age(ncy)
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Search Engine Journal
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summarySpeakerHub
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next Tessa Mero
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentLily Ray
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best PracticesVit Horky
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project managementMindGenius
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...RachelPearson36
 
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Applitools
 
12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at WorkGetSmarter
 
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...DevGAMM Conference
 

En vedette (20)

Skeleton Culture Code
Skeleton Culture CodeSkeleton Culture Code
Skeleton Culture Code
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
 
12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work
 
ChatGPT webinar slides
ChatGPT webinar slidesChatGPT webinar slides
ChatGPT webinar slides
 
More than Just Lines on a Map: Best Practices for U.S Bike Routes
More than Just Lines on a Map: Best Practices for U.S Bike RoutesMore than Just Lines on a Map: Best Practices for U.S Bike Routes
More than Just Lines on a Map: Best Practices for U.S Bike Routes
 
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
 

Forecasting analysis on us flights v1

  • 1. FORECASTING PROJECT ON US DOMESTIC FLIGHTS (In Revolution Analytics) Prepared By: Wyendrila Roy http://in.linkedin.com/pub/wyendrila-roy/5/3a/876
  • 2. Acknowledgement This project is done as a final project, as a part of the course titled “Business Analytics with R”. I am really thankful to our course instructor Mr. Ajay Ohri, Founder, DecisionStats, for giving me an opportunity to do the project in Time Series Analysis using R and providing me with the necessary support and guidance which made me complete the project on time. I am extremely grateful to him for providing me with the big data set and also the necessary links to start of the project and understand Time Series Analysis. In this project I have chosen the topic- “Forecasting on US Domestic Flights”, where I have analyzed the flight activities in the Top Domestic Airports of US and then presented a prediction of the same for 2010 – June’ 2011. Due to the size of the data set this project is done in Revolution Analytics. I am really grateful to the extremely resourceful articles and publications provided by Revolution Analytics, which helped me in understanding the tool as well as the topic. Also, I would like to extend my sincere regards to the support team of Edureka for their constant and timely support.
  • 3. Table of Contents Methodology................................................................................................................................................4 1. Overview...........................................................................................................................................4 2. Data Source.......................................................................................................................................4 3. Limitations.........................................................................................................................................4 4. Tool/Package Used............................................................................................................................4 5. File Format Used...............................................................................................................................4 The Analysis..................................................................................................................................................5 1. Importing the Data............................................................................................................................5 2. Exploring the Data.............................................................................................................................5 3. Aggregating the Data ........................................................................................................................8 4. Building the Time Series....................................................................................................................9 5. Predict Future Values based on the Time Series Analysis ..............................................................15 Conclusion ..................................................................................................................................................28 References..................................................................................................................................................28
  • 4. Methodology 1. Overview In this report we have analyzed time series data in R language. We have used the “data step” functions in Revolution Analytics’ RevoScaleR package to access a large data file, manipulated it, sorted it, extracted the data we needed and then aggregated the records with monthly time stamps to form multiple, monthly time series. Then we have used ordinary R time series functions to do some basic analysis. Thereafter we have used forecasting functions to predict the domestic flights activity for Top airports in US for the period Jan 2010 –June 2011. 2. Data Source The dataset used in this report is the airlines “edge” flight data set (77,242 KB) from infochimps.com. It contains 3.5 million monthly domestic flight records from 1990 to 2009. 3. Limitations The major limitation was to extract the time series from time stamped data embedded in this very large data set. These types of data sets are too large to be read into memory and processed by normal R language. 4. Tool/Package Used This Report uses Revolution Analytics’ new add-on package called RevoScaleR™, which provides unprecedented levels of performance and capacity for statistical analysis in the R environment. With the help of this package, we can process, visualize and model the largest data sets in a fraction of the time of legacy systems, without the need to deploy expensive or specialized hardware. 5. File Format Used RevoScaleR provides a new data file type with extension .xdf that has been optimized for “data chunking”, accessing parts of an Xdf file for independent processing. Xdf files store data in a binary format. The file format provides very fast access to a specified set of rows for a specified set of columns. New rows and columns can be added to the file without re-writing the entire file. RevoScaleR also provides a new R class, RxDataSource that has been designed to support the use of external memory algorithms with .xdf files.
  • 5. The Analysis 1. Importing the Data We use the RevoScaleR function to read the text file into the special xdf binary format used by RevoScaleR functions: > rxGetInfoXdf("Flights",getVarInfo=TRUE) File name: C:Documents and SettingsWENDELAMy DocumentsRevolutionFlights.xdf Number of observations: 3606803 Number of variables: 5 Number of blocks: 8 Variable information: Var 1: origin_airport 683 factor levels: MHK EUG MFR SEA PDX ... CRE BOK BIH MQJ LCI Var 2: destin_airport 708 factor levels: AMW RDM EKO WDG END ... COS HII PHD TBN OH1 Var 3: passengers, Type: integer, Low/High: (0, 89597) Var 4: flights, Type: integer, Low/High: (0, 1128) Var 5: month, Type: character > rxGetInfoXdf(file="Flights",numRows=10,startRow=1) File name: C:Documents and SettingsWENDELAMy DocumentsRevolutionFlights.xdf Number of observations: 3606803 Number of variables: 5 Number of blocks: 8 Data (10 rows starting with row 1): origin_airport destin_airport passengers flights month 1 MHK AMW 21 1 200810 2 EUG RDM 41 22 199011 3 EUG RDM 88 19 199012 4 EUG RDM 11 4 199010 5 MFR RDM 0 1 199002 6 MFR RDM 11 1 199003 7 MFR RDM 2 4 199001 8 MFR RDM 7 1 199009 9 MFR RDM 7 2 199011 10 SEA RDM 8 1 199002 2. Exploring the Data Now we will sort the file by flights to find the origin / destination pairs, which have the most monthly flights and pick out the two top origin airports having the most flights.
  • 6. rxSort(inData="Flights", outFile = "sortFlights", sortByVars="flights", + decreasing = TRUE,overwrite=TRUE) > rxGetInfoXdf(file="sortFlights") File name: C:Documents and SettingsWENDELAMy DocumentsRevolutionsortFlights.xdf Number of observations: 3606803 Number of variables: 5 Number of blocks: 8 > mostflights5 <- rxGetInfoXdf(file="sortFlights",numRows=5,startRow=1) > mostflights5 File name: C:Documents and SettingsWENDELAMy DocumentsRevolutionsortFlights.xdf Number of observations: 3606803 Number of variables: 5 Number of blocks: 8 Data (5 rows starting with row 1): origin_airport destin_airport passengers flights month 1 SFO LAX 83153 1128 199412 2 LAX SFO 80450 1126 199412 3 HNL OGG 73014 1058 199408 4 OGG HNL 77011 1056 199408 5 OGG HNL 63020 1044 199412 > top5f <- as.data.frame(mostflights5[[5]]) > topOA <- unique(as.vector(top5f$origin_airport)) > # Select the top 2 > top2 <- topOA[1:2] > top2 [1] "SFO" "LAX" From the above code we can see that the two top origin airports that have the most flights are San Francisco International (SFO) and Los Angeles International (LAX) Next, we use the RevoScaleR function rxDataStep to build a new file “mostFlights” containing only those flights that originate in either SFO or LAX. > rxGetInfoXdf("mostFlights",numRows=10,startRow=1) File name: C:Documents and SettingsWENDELAMy DocumentsRevolutionmostFlights.xdf Number of observations: 144505 Number of variables: 6 Number of blocks: 8 Data (10 rows starting with row 1):
  • 7. origin_airport destin_airport passengers flights month origin 1 SFO RDM 1413 92 199003 SFO 2 SFO RDM 1394 88 199006 SFO 3 SFO RDM 922 86 199001 SFO 4 SFO RDM 1661 93 199008 SFO 5 SFO RDM 1093 88 199005 SFO 6 SFO RDM 995 79 199011 SFO 7 SFO RDM 1080 83 199004 SFO 8 SFO RDM 1279 78 199012 SFO 9 SFO RDM 1080 83 199002 SFO 10 SFO RDM 1493 92 199007 SFO > rxGetInfoXdf("mostFlights",getVarInfo=TRUE) File name: C:Documents and SettingsWENDELAMy DocumentsRevolutionmostFlights.xdf Number of observations: 144505 Number of variables: 6 Number of blocks: 8 Variable information: Var 1: origin_airport 683 factor levels: MHK EUG MFR SEA PDX ... CRE BOK BIH MQJ LCI Var 2: destin_airport 708 factor levels: AMW RDM EKO WDG END ... COS HII PHD TBN OH1 Var 3: passengers, Type: integer, Low/High: (0, 83153) Var 4: flights, Type: integer, Low/High: (0, 1128) Var 5: month, Type: character Var 6: origin 2 factor levels: SFO LAX
  • 8. > rxHistogram(~flights|origin, data="mostFlights") The transformation function, xform, used in rxDataStep creates a new variable, origin, with only two levels (“SFO” and “LAX”) to hold the information on origin airports. The last line of code in this section produces the following histogram of monthly flights 3. Aggregating the Data Now we will break the month variable (which we originally imported as character data) into a month and year component in order to proceed with our Time Series Analysis. > xfunc = function(data){data$Month = as.integer(substring(data$month,5,6)) + data$Year = as.integer(substring(data$month,1,4)) + return(data)} > xfunc = function(data){data$Month = as.integer(substring(data$month,5,6)) + data$Year = as.integer(substring(data$month,1,4)) + return(data)} > # Add a new variable for time series work > rxDataStepXdf(inFile="mostFlights", outFile = "SFO_LAX", + overwrite = TRUE, transformVars="month",transformFunc = xfunc) > (file="SFO_LAX", numRows=10,startRow=1)
  • 9. > rxDataStepXdf(inFile="SFO_LAX", outFile = "SFO.LAX", + varsToDrop=c("origin_airport","month"), + overwrite=TRUE) > rxGetInfoXdf(file="SFO.LAX",numRows=10,startRow=1) File name: C:Documents and SettingsWENDELAMy DocumentsRevolutionSFO.LAX.xdf Number of observations: 144505 Number of variables: 6 Number of blocks: 8 Data (10 rows starting with row 1): destin_airport passengers flights origin Month Year 1 RDM 1413 92 SFO 3 1990 2 RDM 1394 88 SFO 6 1990 3 RDM 922 86 SFO 1 1990 4 RDM 1661 93 SFO 8 1990 5 RDM 1093 88 SFO 5 1990 6 RDM 995 79 SFO 11 1990 7 RDM 1080 83 SFO 4 1990 8 RDM 1279 78 SFO 12 1990 9 RDM 1080 83 SFO 2 1990 10 RDM 1493 92 SFO 7 1990 The transformation function, xfunc, used in rxDataStepXdf uses ordinary R string handling functions to break apart the month data. A second data step function drops the unnecessary variables from our final file: SFO.LAX. 4. Building the Time Series The function rxCube counts the number of flights in each combination of Year, Month and origin airport. > xfunc <- function(data){ + data$Month <- as.integer(substring(data$month,5,6)) + data$Year <- as.integer(substring(data$month,1,4)) + return(data) + } > > rxDataStepXdf(inFile="mostFlights", outFile = "SFO_LAX", + overwrite = TRUE, transformVars="month",transformFunc = xfunc) > (file="SFO_LAX",numRows=10,startRow=1) > t1 <-rxCube(flights ~ F(Year):F(Month):origin, removeZeroCounts=TRUE,data = "SFO_LAX") > t1 <- as.data.frame(t1)
  • 10. > head(t1) F_Year F_Month origin flights Counts 1 1990 1 SFO 39.04225 284 2 1991 1 SFO 38.42034 295 3 1992 1 SFO 46.23954 263 4 1993 1 SFO 44.39464 261 5 1994 1 SFO 36.15417 240 6 1995 1 SFO 45.76768 198 From the above table, we see that there were 284 records where the originating airport was SFO for the first month of 1990. The average number of flights among these 284 counts was 39.04225. From this information, we can calculate the total number of flights for each month. The next bit of code does this and forms the time information into a proper date. Note that we have reduced the data sufficiently so that we are now working with a data frame, t1. Now we will compute total flights out and combine month and date into a date t1$flights_out<- t1$flights*t1$Counts > names(t1) <- c("Year","Month","origin","avg.flights.per.destin","total.destin","flights.out") > t1$Date <- as.Date(as.character(paste(t1$Month,"- 28 -",t1$Year)),"%m - %d - %Y") > head(t1) Year Month origin avg.flights.per.destin total.destin flights.out Date 1 1990 1 SFO 39.04225 284 11088 1990-01-28 2 1991 1 SFO 38.42034 295 11334 1991-01-28 3 1992 1 SFO 46.23954 263 12161 1992-01-28 4 1993 1 SFO 44.39464 261 11587 1993-01-28 5 1994 1 SFO 36.15417 240 8677 1994-01-28 6 1995 1 SFO 45.76768 198 9062 1995-01-28 Now, we extract out the SFO data, sort it to form a time series and plot it. > SFO.t1 <- SFO.t1[order(SFO.t1$Date),] > x <-SFO.t1$Date > y <-SFO.t1$flights.out > library(ggplot2) > qplot(x,y, geom="line",xlab="", ylab="Number of Flightsn",main="Monthly Flights Out of SFO")
  • 11. We use the R function, ts, to form the data into a time series object, and use the function stl to perform a seasonal decomposition.
• 12. > SFO.ts <- ts(y, start=x[1], freq=12)
> sd.SFO <- stl(SFO.ts, s.window="periodic")
> plot(sd.SFO)
[Figure: seasonal decomposition of the SFO series; panels show the data, the seasonal component, the trend, and the remainder]
In the above graph, the first panel reproduces the time series. The second panel shows the periodic, seasonal component. The third panel displays the trend, and the fourth panel displays the residuals.
• 13. We may now repeat the above steps for the LAX data.
> LAX.t1 <- t1[t1$origin=="LAX",]
> LAX.t1 <- LAX.t1[order(LAX.t1$Date),]
> a <- LAX.t1$Date
> b <- LAX.t1$flights.out
> qplot(a, b, geom="line", xlab="", ylab="Number of Flights\n", main="Monthly Flights Out of LAX")
[Figure: Monthly Flights Out of LAX, 1990-2009]
• 14. > LAX.ts <- ts(b, start=a[1], freq=12)
> sd.LAX <- stl(LAX.ts, s.window="periodic")
> plot(sd.LAX)
[Figure: seasonal decomposition of the LAX series; panels show the data, the seasonal component, the trend, and the remainder]
• 15. 5. Predict Future Values based on the Time Series Analysis
Now we can proceed with the forecasting analysis. For the remainder of the analysis we will work with the SFO time series and predict its values for the period Jan 2010 - June 2011. We will use Simple Exponential Smoothing as well as an ARIMA model for our forecasting analysis.
SFO.ts = ts(y, start = c(1990), freq=12)
plot.ts(SFO.ts)
[Figure: time plot of SFO.ts, monthly flights out of SFO, 1990-2009]
Simple Exponential Smoothing
fit <- HoltWinters(SFO.ts, beta=FALSE, gamma=FALSE)
fit
Smoothing parameters:
 alpha: 0.6726511
 beta : FALSE
 gamma: FALSE
Coefficients:
      [,1]
a 10987.14
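For intuition, with beta=FALSE and gamma=FALSE the Holt-Winters routine reduces to simple exponential smoothing: it tracks a single level, l_t = alpha * y_t + (1 - alpha) * l_{t-1}, and its point forecasts are flat at the final level. The sketch below, which is not part of the original analysis, replays that recursion using the alpha value reported above; because HoltWinters() initialises the level slightly differently, the final number will only roughly match the coefficient a.
# Minimal sketch of the simple-exponential-smoothing recursion, using the
# fitted alpha reported above. Initialising the level at the first value is
# a simplification, so the result matches HoltWinters() only approximately.
ses_level <- function(y, alpha) {
  level <- numeric(length(y))
  level[1] <- y[1]
  for (t in 2:length(y)) {
    level[t] <- alpha * y[t] + (1 - alpha) * level[t - 1]
  }
  level
}
lvl <- ses_level(as.numeric(SFO.ts), alpha = 0.6726511)
tail(lvl, 1)   # flat point forecast; compare with coefficient a (about 10987)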
• 16. By default, HoltWinters() just makes forecasts for the same time period covered by our original time series. In this case, our original time series covered the number of flights originating from SFO from 1990-2009, so the forecasts are also for 1990-2009. In the example above, we have stored the output of the HoltWinters() function in the list variable "fit".
> plot(fit)
[Figure: Holt-Winters filtering, observed (black) and fitted (red) values]
The plot shows the original time series in black and the forecasts in red. As a measure of the accuracy of the forecasts, we can calculate the sum of squared errors for the in-sample forecast errors, that is, the forecast errors for the time period covered by our original time series. The sum of squared errors is stored in a named element of the list variable "fit" called "SSE", so we can get its value by typing:
> fit$SSE
[1] 95039885
That is, here the sum of squared errors is 95039885.
As explained above, by default HoltWinters() just makes forecasts for the time period covered by the original data, which is 1990-2009 in this case. We can make forecasts for further time points by using the "forecast.HoltWinters()" function in the R "forecast" package.
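Before moving on, a quick added sanity check (not in the original slides): the SSE value above can be reproduced directly from the fitted values, since the in-sample forecast errors are simply observed minus fitted values. This assumes the fit object created above.
# The fitted values of a HoltWinters object start at the second observation,
# so the errors are computed against SFO.ts with its first value dropped.
err <- SFO.ts[-1] - fitted(fit)[, "xhat"]
sum(err^2)                 # should reproduce fit$SSE (about 95039885)
sum(residuals(fit)^2)      # equivalent shortcut using the residuals() method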
• 17. > Forecast <- forecast.HoltWinters(fit, h=18)
> Forecast
         Point Forecast     Lo 80    Hi 80    Lo 95    Hi 95
Jan 2010       10987.14 10177.296 11796.98 9748.591 12225.68
Feb 2010       10987.14 10011.132 11963.14 9494.466 12479.81
Mar 2010       10987.14  9869.403 12104.87 9277.711 12696.56
Apr 2010       10987.14  9743.726 12230.55 9085.504 12888.77
May 2010       10987.14  9629.634 12344.64 8911.015 13063.26
Jun 2010       10987.14  9524.415 12449.86 8750.096 13224.18
Jul 2010       10987.14  9426.272 12548.00 8600.000 13374.28
Aug 2010       10987.14  9333.946 12640.33 8458.798 13515.48
Sep 2010       10987.14  9246.509 12727.77 8325.076 13649.20
Oct 2010       10987.14  9163.260 12811.02 8197.757 13776.52
Nov 2010       10987.14  9083.648 12890.63 8076.001 13898.27
Dec 2010       10987.14  9007.235 12967.04 7959.137 14015.14
Jan 2011       10987.14  8933.663 13040.61 7846.619 14127.66
Feb 2011       10987.14  8862.637 13111.64 7737.995 14236.28
Mar 2011       10987.14  8793.911 13180.36 7632.886 14341.39
Apr 2011       10987.14  8727.273 13247.00 7530.973 14443.30
May 2011       10987.14  8662.545 13311.73 7431.980 14542.30
Jun 2011       10987.14  8599.571 13374.70 7335.670 14638.61
The forecast.HoltWinters() function gives the forecast for our 18-month period, an 80% prediction interval for the forecast, and a 95% prediction interval for the forecast. For example, the forecasted value for Jan 2010 is about 10987.14, with a 95% prediction interval of (9748.591, 12225.68). To plot the predictions made by forecast.HoltWinters(), we can use the "plot.forecast()" function:
• 18. > plot.forecast(Forecast)
[Figure: Forecasts from HoltWinters, showing the 1990-2009 series and the 2010-2011 forecasts]
Here the forecasts for Jan 2010 - June 2011 are plotted as a dark blue line, the 80% prediction interval as a blue shaded area, and the 95% prediction interval as a light blue shaded area.
The 'forecast errors' are calculated as the observed values minus the predicted values, for each time point. We can only calculate the forecast errors for the time period covered by our original time series, which is 1990-2009 for the flight data. The in-sample forecast errors are stored in the named element "residuals" of the list variable returned by forecast.HoltWinters(). We will now obtain a correlogram of the in-sample forecast errors for lags 1-20, using the "acf()" function in R. To specify the maximum lag that we want to look at, we use the "lag.max" parameter of acf().
• 19. > acf(Forecast$residuals, lag.max=20)
[Figure: ACF (correlogram) of Forecast$residuals, lags 1-20]
To test whether there is significant evidence for non-zero correlations at lags 1-20, we can carry out a Ljung-Box test.
> Box.test(Forecast$residuals, lag=20, type="Ljung-Box")
	Box-Ljung test
data:  Forecast$residuals
X-squared = 370.1992, df = 20, p-value < 2.2e-16
The p-value is far below 0.05, so there is strong evidence of non-zero autocorrelation in the in-sample forecast errors at lags 1-20, which suggests the simple exponential smoothing forecasts could probably be improved upon.
It is also a good idea to check whether the forecast errors are normally distributed with mean zero and have constant variance. To check whether the forecast errors have constant variance, we can make a time plot of the in-sample forecast errors:
• 20. > plot.ts(Forecast$residuals)
[Figure: time plot of the in-sample forecast errors, Forecast$residuals]
The plot shows that the in-sample forecast errors have roughly constant variance over time, although the size of the fluctuations at the start of the time series may be slightly smaller than at later dates; the fluctuations for the period 2000-2005 are quite high. To check whether the forecast errors are normally distributed with mean zero, we can plot a histogram of the forecast errors, with an overlaid normal curve that has mean zero and the same standard deviation as the distribution of the forecast errors.
• 21. [Figure: Histogram of forecasterrors, on the density scale, with an overlaid normal curve]
The plot shows that the distribution of forecast errors is roughly centered on zero, and is more or less normally distributed, although it seems to be slightly skewed to the left compared to a normal curve. However, the left skew is relatively small, and so it is plausible that the forecast errors are normally distributed with mean zero.
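The code that produced this histogram is not shown on the slide. A sketch along the lines of the approach in the Little Book of R on Time Series (listed in the References) is given below; the function name plotForecastErrors and its internals are illustrative, and the original script may have differed in detail.
# Sketch: histogram of the forecast errors on the density scale, with an
# overlaid normal curve of mean zero and the errors' standard deviation.
plotForecastErrors <- function(forecasterrors) {
  mysd <- sd(forecasterrors)
  hist(forecasterrors, freq = FALSE, col = "red",
       main = "Histogram of forecasterrors", xlab = "forecasterrors")
  xs <- seq(min(forecasterrors) - 2 * mysd,
            max(forecasterrors) + 2 * mysd, length.out = 200)
  lines(xs, dnorm(xs, mean = 0, sd = mysd), col = "blue", lwd = 2)
}
plotForecastErrors(Forecast$residuals)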
• 22. Autoregressive Integrated Moving Average (ARIMA) models
Autoregressive Integrated Moving Average (ARIMA) models include an explicit statistical model for the irregular component of a time series, one that allows for non-zero autocorrelations in the irregular component.
a. Differencing a Time Series
ARIMA models are defined for stationary time series. Therefore, if you start off with a non-stationary time series, you will first need to 'difference' the time series until you obtain a stationary one. If you have to difference the time series d times to obtain a stationary series, then you have an ARIMA(p,d,q) model, where d is the order of differencing used.
> SFO_diff <- diff(SFO.ts, differences=1)
> plot.ts(SFO_diff)
[Figure: time plot of SFO_diff, the first differences of the SFO series]
The resulting time series of first differences (above) does not appear to be stationary in mean. Therefore, we can difference the time series twice, to see if that gives us a stationary time series:
• 23. > SFO_diff_1 <- diff(SFO.ts, differences=2)
> plot.ts(SFO_diff_1)
[Figure: time plot of SFO_diff_1, the second differences of the SFO series]
The time series of second differences (above) does appear to be stationary in mean and variance, as the level of the series stays roughly constant over time and the variance of the series appears roughly constant over time. Thus, it appears that we need to difference the time series of the 'SFO flights' twice in order to achieve a stationary series. This means that we can use an ARIMA(p,d,q) model for the above time series, where d (the order of differencing) = 2, i.e., ARIMA(p,2,q).
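As an optional, formal complement to this visual check, and not part of the original analysis, a unit-root test could be applied to the twice-differenced series, for example the augmented Dickey-Fuller test from the tseries package.
# Optional stationarity check (not in the original slides): augmented
# Dickey-Fuller test on the twice-differenced series. A small p-value is
# consistent with treating SFO_diff_1 as stationary.
library(tseries)
adf.test(SFO_diff_1)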
• 24. The next step is to figure out the values of p and q for the ARIMA model. To do this, we usually need to examine the correlogram and partial correlogram of the stationary time series.
b. Autocorrelations
> acf(SFO_diff_1, lag.max=20) # plot a correlogram
> acf(SFO_diff_1, lag.max=20, plot=FALSE)
Autocorrelations of series ‘SFO_diff_1’, by lag
0.0000 0.0833 0.1667 0.2500 0.3333 0.4167 0.5000 0.5833 0.6667 0.7500 0.8333
 1.000 -0.709  0.264  0.084 -0.361  0.553 -0.652  0.529 -0.327  0.078  0.231
0.9167 1.0000 1.0833 1.1667 1.2500 1.3333 1.4167 1.5000 1.5833 1.6667
-0.572  0.787 -0.621  0.286  0.035 -0.295  0.501 -0.623  0.504 -0.287
[Figure: ACF (correlogram) of SFO_diff_1, lags 1-20]
We see from the correlogram that the autocorrelations at lags 1, 2 and 3 exceed the significance bounds, but they are decreasing and near zero after lag 3, although there are other autocorrelations between lags 1-20 that exceed the significance bounds.
c. Partial Autocorrelations
> pacf(SFO_diff_1, lag.max=20) # plot a partial correlogram
> pacf(SFO_diff_1, lag.max=20, plot=FALSE)
Partial autocorrelations of series ‘SFO_diff_1’, by lag
0.0833 0.1667 0.2500 0.3333 0.4167 0.5000 0.5833 0.6667 0.7500 0.8333 0.9167
-0.709 -0.481  0.055 -0.326  0.232 -0.357 -0.027 -0.401 -0.019 -0.053 -0.464
1.0000 1.0833 1.1667 1.2500 1.3333 1.4167 1.5000 1.5833 1.6667
 0.130  0.096  0.045 -0.020  0.100  0.118 -0.013 -0.073  0.005
• 25. [Figure: PACF (partial correlogram) of SFO_diff_1, lags 1-20]
The partial correlogram shows that the partial autocorrelations at lags 1 and 2 exceed the significance bounds, are negative, and are slowly decreasing in magnitude with increasing lag; the partial autocorrelations near zero after lag 2.
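Reading p and q off the ACF and PACF involves some judgment. As an optional cross-check, not part of the original analysis, the forecast package already used for forecast.HoltWinters() can search for an ARIMA order automatically; the order it selects may differ from the ARIMA(2,2,3) chosen below.
# Optional cross-check (not in the original slides): automatic order
# selection with auto.arima() from the forecast package.
library(forecast)
auto_fit <- auto.arima(SFO.ts)
summary(auto_fit)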
• 26. d. Forecasting using an ARIMA model
> SFO_arima <- arima(SFO.ts, order=c(2,2,3)) # fit an ARIMA(2,2,3) model
> SFO_arima
Series: SFO.ts
ARIMA(2,2,3)
Coefficients:
          ar1      ar2     ma1      ma2     ma3
      -1.7315  -0.9996  0.7477  -0.7477  -1.000
s.e.   0.0014   0.0006  0.0211   0.0218   0.021
sigma^2 estimated as 215044: log likelihood=-1806.58
AIC=3625.17 AICc=3625.53 BIC=3646
> SFO_arimaforecasts <- forecast.Arima(SFO_arima, h=18)
> SFO_arimaforecasts
         Point Forecast     Lo 80    Hi 80     Lo 95    Hi 95
Jan 2010       11141.50 10543.492 11739.51 10226.924 12056.08
Feb 2010       10705.97  9854.680 11557.25  9404.037 12007.89
Mar 2010       11356.13 10317.891 12394.36  9768.281 12943.97
Apr 2010       10666.23  9459.164 11873.29  8820.184 12512.27
May 2010       11211.39  9863.664 12559.11  9150.221 13272.56
Jun 2010       10957.55  9475.979 12439.12  8691.684 13223.41
Jul 2010       10852.63  9249.182 12456.08  8400.366 13304.90
Aug 2010       11288.52  9572.908 13004.13  8664.718 13912.32
Sep 2010       10639.15  8812.668 12465.63  7845.787 13432.51
Oct 2010       11328.32  9402.626 13254.00  8383.228 14273.40
Nov 2010       10784.62  8758.111 12811.13  7685.342 13883.90
Dec 2010       11037.64  8918.355 13156.93  7796.473 14278.81
Jan 2011       11143.50  8933.374 13353.62  7763.406 14523.59
Feb 2011       10707.79  8408.235 13007.34  7190.924 14224.65
Mar 2011       11356.89  8974.390 13739.40  7713.169 15000.62
Apr 2011       10668.99  8200.825 13137.16  6894.256 14443.73
May 2011       11211.75  8664.892 13758.61  7316.668 15106.83
Jun 2011       10960.08  8333.068 13587.08  6942.415 14977.74
• 27. > plot.forecast(SFO_arimaforecasts)
[Figure: Forecasts from ARIMA(2,2,3), showing the 1990-2009 series and the 2010-2011 forecasts]
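For completeness, the residual diagnostics applied earlier to the Holt-Winters forecasts could be applied to the ARIMA forecasts as well. This step is not shown in the original slides and is sketched here assuming the SFO_arimaforecasts object created above.
# Optional diagnostics for the ARIMA in-sample forecast errors, mirroring
# the checks used earlier for the Holt-Winters model.
acf(SFO_arimaforecasts$residuals, lag.max = 20)                        # correlogram of the residuals
Box.test(SFO_arimaforecasts$residuals, lag = 20, type = "Ljung-Box")   # test for remaining autocorrelation
plot.ts(SFO_arimaforecasts$residuals)                                  # check for roughly constant variance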
• 28. Conclusion
In this report we worked with a large data set, the airline flight data set from infochimps.com, which consisted of 3.5 million monthly domestic flight records from 1990 to 2009. We began by exploring the data set, identifying the variables it contains and their data types, and computing basic summary statistics. The next task was to prepare the data for analysis, which in addition to cleaning the data also involved supplementing it with additional information, removing unnecessary variables, and transforming some variables in a way that made sense for the contemplated analysis. The data set became progressively smaller as the analysis proceeded. We prepared two subsets of the overall data, the flights originating from SFO and those originating from LAX, and carried out the time series analysis on both. Finally, we performed the forecasting analysis on the SFO time series, using two techniques: Simple Exponential Smoothing and forecasting based on an ARIMA model.
References
Books/Articles
1. Little Book of R on Time Series Analysis - Avril Coghlan
2. Introduction to R's time series facilities - Michael Lundholm
3. Working with Time Series Data in R - Eric Zivot
4. White Papers on Big Data and Data Step - Revolution Analytics
Websites
1. http://www.inside-r.org/howto/extracting-time-series-large-data-sets
2. http://www.infochimps.com/datasets/us-domestic-flights-from-1990-to-2009
3. http://www.revolutionanalytics.com/