FORECASTING PROJECT ON US DOMESTIC FLIGHTS
(In Revolution Analytics)
Prepared By:
Wyendrila Roy
http://in.linkedin.com/pub/wyendrila-roy/5/3a/876
Acknowledgement
This project is done as a final project, as a part of the course titled “Business Analytics with R”. I am
really thankful to our course instructor Mr. Ajay Ohri, Founder, DecisionStats, for giving me an
opportunity to do the project in Time Series Analysis using R and providing me with the necessary
support and guidance which made me complete the project on time. I am extremely grateful to him for
providing me with the big data set and also the necessary links to start off the project and understand
Time Series Analysis.
In this project I have chosen the topic "Forecasting on US Domestic Flights", where I have analyzed the
flight activities at the top domestic airports of the US and then presented a prediction of the same for Jan
2010 – June 2011. Due to the size of the data set, this project is done in Revolution Analytics. I am really
grateful for the extremely resourceful articles and publications provided by Revolution Analytics, which
helped me in understanding the tool as well as the topic.
Also, I would like to extend my sincere regards to the support team of Edureka for their constant and
timely support.
Table of Contents
Methodology
1. Overview
2. Data Source
3. Limitations
4. Tool/Package Used
5. File Format Used
The Analysis
1. Importing the Data
2. Exploring the Data
3. Aggregating the Data
4. Building the Time Series
5. Predict Future Values based on the Time Series Analysis
Conclusion
References
Methodology
1. Overview
In this report we have analyzed time series data in the R language. We have used the "data step" functions
in Revolution Analytics' RevoScaleR package to access a large data file, manipulate it, sort it,
extract the data we needed, and then aggregate the records with monthly time stamps to form
multiple monthly time series. We have then used ordinary R time series functions to do some basic
analysis. Thereafter we have used forecasting functions to predict domestic flight activity for the top
airports in the US for the period Jan 2010 – June 2011.
2. Data Source
The dataset used in this report is the airlines “edge” flight data set (77,242 KB) from infochimps.com. It
contains 3.5 million monthly domestic flight records from 1990 to 2009.
3. Limitations
The major limitation was extracting the time series from the time-stamped data embedded in this very large
data set. Data sets of this size are too large to be read into memory and processed with standard R
functions.
4. Tool/Package Used
This report uses Revolution Analytics' add-on package called RevoScaleR™, which provides
unprecedented levels of performance and capacity for statistical analysis in the R environment. With the
help of this package, we can process, visualize and model the largest data sets in a fraction of the time
of legacy systems, without the need to deploy expensive or specialized hardware.
5. File Format Used
RevoScaleR provides a new data file type with extension .xdf that has been optimized for "data
chunking": accessing parts of an .xdf file for independent processing. .xdf files store data in a binary
format. The file format provides very fast access to a specified set of rows for a specified set of columns.
New rows and columns can be added to the file without re-writing the entire file. RevoScaleR also
provides a new R class, RxDataSource, that has been designed to support the use of external-memory
algorithms with .xdf files.
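For illustration (this sketch is not part of the report's workflow), a single chunk of rows, and only the columns of interest, can be pulled from an .xdf file into an ordinary data frame; the row offsets below are arbitrary:
# Read one chunk of the Flights .xdf file into a data frame,
# without loading the whole file into memory.
chunk <- rxReadXdf("Flights", varsToKeep = c("origin_airport", "flights"),
                   startRow = 100001, numRows = 5000)
head(chunk)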
The Analysis
1. Importing the Data
We use RevoScaleR's import facilities to read the raw text file into the special .xdf binary format used by
the RevoScaleR functions, and then inspect the result.
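The import call itself is not reproduced in the report; a minimal sketch, assuming the raw data sits in a delimited text file (the file name here is hypothetical), might look like this:
# Sketch only: convert the raw delimited text file (hypothetical name) into the
# binary .xdf format used below; column types can be controlled via colInfo.
rxImport(inData = "flights.csv", outFile = "Flights", overwrite = TRUE)
With the data in .xdf form, rxGetInfoXdf summarizes the file: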
> rxGetInfoXdf("Flights",getVarInfo=TRUE)
File name: C:\Documents and Settings\WENDELA\My Documents\Revolution\Flights.xdf
Number of observations: 3606803
Number of variables: 5
Number of blocks: 8
Variable information:
Var 1: origin_airport
683 factor levels: MHK EUG MFR SEA PDX ... CRE BOK BIH MQJ LCI
Var 2: destin_airport
708 factor levels: AMW RDM EKO WDG END ... COS HII PHD TBN OH1
Var 3: passengers, Type: integer, Low/High: (0, 89597)
Var 4: flights, Type: integer, Low/High: (0, 1128)
Var 5: month, Type: character
> rxGetInfoXdf(file="Flights",numRows=10,startRow=1)
File name: C:\Documents and Settings\WENDELA\My Documents\Revolution\Flights.xdf
Number of observations: 3606803
Number of variables: 5
Number of blocks: 8
Data (10 rows starting with row 1):
origin_airport destin_airport passengers flights month
1 MHK AMW 21 1 200810
2 EUG RDM 41 22 199011
3 EUG RDM 88 19 199012
4 EUG RDM 11 4 199010
5 MFR RDM 0 1 199002
6 MFR RDM 11 1 199003
7 MFR RDM 2 4 199001
8 MFR RDM 7 1 199009
9 MFR RDM 7 2 199011
10 SEA RDM 8 1 199002
2. Exploring the Data
Now we will sort the file by flights to find the origin/destination pairs with the most monthly
flights, and pick out the two origin airports having the most flights.
> rxSort(inData="Flights", outFile = "sortFlights", sortByVars="flights",
+ decreasing = TRUE,overwrite=TRUE)
> rxGetInfoXdf(file="sortFlights")
File name: C:\Documents and Settings\WENDELA\My Documents\Revolution\sortFlights.xdf
Number of observations: 3606803
Number of variables: 5
Number of blocks: 8
> mostflights5 <- rxGetInfoXdf(file="sortFlights",numRows=5,startRow=1)
> mostflights5
File name: C:\Documents and Settings\WENDELA\My Documents\Revolution\sortFlights.xdf
Number of observations: 3606803
Number of variables: 5
Number of blocks: 8
Data (5 rows starting with row 1):
origin_airport destin_airport passengers flights month
1 SFO LAX 83153 1128 199412
2 LAX SFO 80450 1126 199412
3 HNL OGG 73014 1058 199408
4 OGG HNL 77011 1056 199408
5 OGG HNL 63020 1044 199412
> top5f <- as.data.frame(mostflights5[[5]])
> topOA <- unique(as.vector(top5f$origin_airport))
> # Select the top 2
> top2 <- topOA[1:2]
> top2
[1] "SFO" "LAX"
From the above code we can see that the two origin airports with the most flights are San
Francisco International (SFO) and Los Angeles International (LAX).
Next, we use the RevoScaleR function rxDataStep to build a new file “mostFlights” containing only those
flights that originate in either SFO or LAX.
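The rxDataStep call is not shown in the report; a sketch of how it might have looked, given the description of the xform transform below (the exact arguments are assumptions):
# Sketch only: keep flights originating at SFO or LAX and add a two-level
# factor, origin, carrying the originating airport.
xform <- function(data) {
  data$origin <- factor(as.character(data$origin_airport),
                        levels = c("SFO", "LAX"))
  return(data)
}
rxDataStep(inData = "Flights", outFile = "mostFlights", overwrite = TRUE,
           rowSelection = origin_airport == "SFO" | origin_airport == "LAX",
           transformFunc = xform, transformVars = "origin_airport")
The resulting file can then be inspected: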
> rxGetInfoXdf("mostFlights",numRows=10,startRow=1)
File name: C:\Documents and Settings\WENDELA\My Documents\Revolution\mostFlights.xdf
Number of observations: 144505
Number of variables: 6
Number of blocks: 8
Data (10 rows starting with row 1):
origin_airport destin_airport passengers flights month origin
1 SFO RDM 1413 92 199003 SFO
2 SFO RDM 1394 88 199006 SFO
3 SFO RDM 922 86 199001 SFO
4 SFO RDM 1661 93 199008 SFO
5 SFO RDM 1093 88 199005 SFO
6 SFO RDM 995 79 199011 SFO
7 SFO RDM 1080 83 199004 SFO
8 SFO RDM 1279 78 199012 SFO
9 SFO RDM 1080 83 199002 SFO
10 SFO RDM 1493 92 199007 SFO
> rxGetInfoXdf("mostFlights",getVarInfo=TRUE)
File name: C:\Documents and Settings\WENDELA\My Documents\Revolution\mostFlights.xdf
Number of observations: 144505
Number of variables: 6
Number of blocks: 8
Variable information:
Var 1: origin_airport
683 factor levels: MHK EUG MFR SEA PDX ... CRE BOK BIH MQJ LCI
Var 2: destin_airport
708 factor levels: AMW RDM EKO WDG END ... COS HII PHD TBN OH1
Var 3: passengers, Type: integer, Low/High: (0, 83153)
Var 4: flights, Type: integer, Low/High: (0, 1128)
Var 5: month, Type: character
Var 6: origin
2 factor levels: SFO LAX
> rxHistogram(~flights|origin, data="mostFlights")
The transformation function, xform, used in rxDataStep creates a new variable, origin, with only two
levels ("SFO" and "LAX") to hold the information on the origin airports. The last line of code in this section,
the rxHistogram call, produces the histogram of monthly flights for each origin airport.
3. Aggregating the Data
Now we will break the month variable (which we originally imported as character data) into a month
and year component in order to proceed with our Time Series Analysis.
> xfunc <- function(data){
+ data$Month <- as.integer(substring(data$month,5,6))
+ data$Year <- as.integer(substring(data$month,1,4))
+ return(data)}
> # Add a new variable for time series work
> rxDataStepXdf(inFile="mostFlights", outFile = "SFO_LAX",
+ overwrite = TRUE, transformVars="month", transformFunc = xfunc)
> rxGetInfoXdf(file="SFO_LAX", numRows=10, startRow=1)
> rxDataStepXdf(inFile="SFO_LAX", outFile = "SFO.LAX",
+ varsToDrop=c("origin_airport","month"),
+ overwrite=TRUE)
> rxGetInfoXdf(file="SFO.LAX",numRows=10,startRow=1)
File name: C:\Documents and Settings\WENDELA\My Documents\Revolution\SFO.LAX.xdf
Number of observations: 144505
Number of variables: 6
Number of blocks: 8
Data (10 rows starting with row 1):
destin_airport passengers flights origin Month Year
1 RDM 1413 92 SFO 3 1990
2 RDM 1394 88 SFO 6 1990
3 RDM 922 86 SFO 1 1990
4 RDM 1661 93 SFO 8 1990
5 RDM 1093 88 SFO 5 1990
6 RDM 995 79 SFO 11 1990
7 RDM 1080 83 SFO 4 1990
8 RDM 1279 78 SFO 12 1990
9 RDM 1080 83 SFO 2 1990
10 RDM 1493 92 SFO 7 1990
The transformation function, xfunc, used in rxDataStepXdf uses ordinary R string handling functions to
break apart the month data. A second data step function drops the unnecessary variables from our final
file: SFO.LAX.
4. Building the Time Series
The function rxCube counts the number of flights in each combination of Year, Month and origin airport.
> t1 <-rxCube(flights ~ F(Year):F(Month):origin, removeZeroCounts=TRUE,data = "SFO_LAX")
> t1 <- as.data.frame(t1)
> head(t1)
F_Year F_Month origin flights Counts
1 1990 1 SFO 39.04225 284
2 1991 1 SFO 38.42034 295
3 1992 1 SFO 46.23954 263
4 1993 1 SFO 44.39464 261
5 1994 1 SFO 36.15417 240
6 1995 1 SFO 45.76768 198
From the above table, we see that there were 284 records where the originating airport was SFO for the
first month of 1990. The average number of flights among these 284 counts was 39.04225. From this
information, we can calculate the total number of flights for each month. The next bit of code does this
and forms the time information into a proper date. Note that we have reduced the data sufficiently so
that we are now working with a data frame, t1.
Now we will compute the total flights out and combine the month and year into a proper date:
> t1$flights_out <- t1$flights*t1$Counts
> names(t1) <- c("Year","Month","origin","avg.flights.per.destin","total.destin","flights.out")
> t1$Date <- as.Date(as.character(paste(t1$Month,"- 28 -",t1$Year)),"%m - %d - %Y")
> head(t1)
Year Month origin avg.flights.per.destin total.destin flights.out Date
1 1990 1 SFO 39.04225 284 11088 1990-01-28
2 1991 1 SFO 38.42034 295 11334 1991-01-28
3 1992 1 SFO 46.23954 263 12161 1992-01-28
4 1993 1 SFO 44.39464 261 11587 1993-01-28
5 1994 1 SFO 36.15417 240 8677 1994-01-28
6 1995 1 SFO 45.76768 198 9062 1995-01-28
Now, we extract the SFO data, sort it by date to form a time series, and plot it.
> SFO.t1 <- t1[t1$origin=="SFO",]
> SFO.t1 <- SFO.t1[order(SFO.t1$Date),]
> x <- SFO.t1$Date
> y <- SFO.t1$flights.out
> library(ggplot2)
> qplot(x,y, geom="line",xlab="", ylab="Number of Flights\n",main="Monthly Flights Out of SFO")
We use the R function, ts, to form the data into a time series object, and use the function stl to perform
a seasonal decomposition.
> SFO.ts <- ts(y,start=x[1],freq=12)
> sd.SFO <- stl(SFO.ts,s.window="periodic")
> plot(sd.SFO)
In the above graph, the first panel reproduces the time series. The second panel shows the periodic,
seasonal component. The third panel displays the trend and the fourth panel displays the remainder (residuals).
[Figure: seasonal decomposition of the monthly SFO flights series (stl), with panels data, seasonal, trend and remainder]
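The decomposed components can also be inspected directly from the stl object; a small illustration, not shown in the original report:
# The stl fit stores the components in a time-series matrix:
head(sd.SFO$time.series)              # columns: seasonal, trend, remainder
seasonal.SFO <- sd.SFO$time.series[, "seasonal"]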
We may now repeat the above steps for the LAX data.
> LAX.t1 <- t1[t1$origin=="LAX",]
> LAX.t1 <- LAX.t1[order(LAX.t1$Date),]
> a <-LAX.t1$Date
> b<-LAX.t1$flights.out
> qplot(a,b, geom="line",xlab="", ylab="Number of Flights\n",main="Monthly Flights Out of LAX")
[Figure: Monthly Flights Out of LAX, 1990-2010]
> LAX.ts <- ts(b, start=a[1], freq=12)
> sd.LAX <- stl(LAX.ts,s.window="periodic")
> plot(sd.LAX)
[Figure: seasonal decomposition of the monthly LAX flights series (stl), with panels data, seasonal, trend and remainder]
5. Predict Future Values based on the Time Series Analysis
Now we can proceed with the forecasting analysis. For the rest of the analysis we will work on the SFO time
series data and predict its values for the period Jan 2010 – June 2011. We will use Simple
Exponential Smoothing as well as an ARIMA model for our forecasting analysis.
> SFO.ts = ts(y, start = c(1990), freq=12)
> plot.ts(SFO.ts)
Simple Exponential Smoothing
> fit <- HoltWinters(SFO.ts, beta=FALSE, gamma=FALSE)
> fit
Smoothing parameters:
alpha: 0.6726511
beta : FALSE
gamma: FALSE
Coefficients:
[,1]
a 10987.14
[Figure: time series plot of monthly flights out of SFO, 1990-2010]
In the example above, we have stored the output of the HoltWinters() function in the list variable "fit".
The estimated smoothing parameter alpha is about 0.67, so the level estimate places substantial weight
on recent observations. By default, HoltWinters() just makes forecasts for the same time period covered
by our original time series. In this case, our original time series covered the number of flights originating
from SFO from 1990 to 2009, so the forecasts are also for 1990-2009.
> plot(fit)
The plot shows the original time series in black, and the forecasts in red.
As a measure of the accuracy of the forecasts, we can calculate the sum of squared errors for the in-
sample forecast errors, that is, the forecast errors for the time period covered by our original time
series. The sum of squared errors is stored in a named element of the list variable "fit" called "SSE", so
we can get its value by typing:
> fit$SSE
[1] 95039885
That is, here the sum of squared errors is 95,039,885.
As explained above, by default HoltWinters() just makes forecasts for the time period covered by the
original data, which is 1990-2009 in this case. We can make forecasts for further time points by using the
“forecast.HoltWinters()” function in the R “forecast” package.
[Figure: Holt-Winters filtering - observed and fitted monthly SFO flights, 1990-2010]
> Forecast <- forecast.HoltWinters(fit, h=18)
> Forecast
Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
Jan 2010 10987.14 10177.296 11796.98 9748.591 12225.68
Feb 2010 10987.14 10011.132 11963.14 9494.466 12479.81
Mar 2010 10987.14 9869.403 12104.87 9277.711 12696.56
Apr 2010 10987.14 9743.726 12230.55 9085.504 12888.77
May 2010 10987.14 9629.634 12344.64 8911.015 13063.26
Jun 2010 10987.14 9524.415 12449.86 8750.096 13224.18
Jul 2010 10987.14 9426.272 12548.00 8600.000 13374.28
Aug 2010 10987.14 9333.946 12640.33 8458.798 13515.48
Sep 2010 10987.14 9246.509 12727.77 8325.076 13649.20
Oct 2010 10987.14 9163.260 12811.02 8197.757 13776.52
Nov 2010 10987.14 9083.648 12890.63 8076.001 13898.27
Dec 2010 10987.14 9007.235 12967.04 7959.137 14015.14
Jan 2011 10987.14 8933.663 13040.61 7846.619 14127.66
Feb 2011 10987.14 8862.637 13111.64 7737.995 14236.28
Mar 2011 10987.14 8793.911 13180.36 7632.886 14341.39
Apr 2011 10987.14 8727.273 13247.00 7530.973 14443.30
May 2011 10987.14 8662.545 13311.73 7431.980 14542.30
Jun 2011 10987.14 8599.571 13374.70 7335.670 14638.61
The forecast.HoltWinters() function gives the forecast for our 18-month period, an 80% prediction
interval for the forecast, and a 95% prediction interval for the forecast. For example, the forecasted
value for Jan 2010 is about 10987.14, with a 95% prediction interval of (9748.591, 12225.68).
To plot the predictions made by forecast.HoltWinters(), we can use the "plot.forecast()" function:
> plot.forecast(Forecast)
Here the forecasts for Jan 2010 – June 2011 are plotted as a dark blue line, the 80% prediction interval
as a blue shaded area, and the 95% prediction interval as a light blue shaded area.
The ‘forecast errors’ are calculated as the observed values minus predicted values, for each time point.
We can only calculate the forecast errors for the time period covered by our original time series, which
is 1990-2009 for the Flight data.
The in-sample forecast errors are stored in the named element “residuals” of the list variable returned
by forecast.HoltWinters().
We will now obtain a correlogram of the in-sample forecast errors for lags 1-20. We can calculate a
correlogram of the forecast errors using the “acf()” function in R. To specify the maximum lag that we
want to look at, we use the “lag.max” parameter in acf().
[Figure: Forecasts from HoltWinters for Jan 2010 - June 2011 with 80% and 95% prediction intervals]
> acf(Forecast$residuals, lag.max=20)
To test whether there is significant evidence for non-zero correlations at lags 1-20, we can carry out a
Ljung-Box test.
> Box.test(Forecast$residuals, lag=20, type="Ljung-Box")
Box-Ljung test
data: Forecast$residuals
X-squared = 370.1992, df = 20, p-value < 2.2e-16
The very small p-value indicates significant non-zero autocorrelation in the in-sample forecast errors at
lags 1-20, suggesting that the simple exponential smoothing forecasts can probably be improved upon.
To be sure that the predictive model cannot be improved upon, it is also a good idea to check whether
the forecast errors are normally distributed with mean zero and constant variance. To check whether
the forecast errors have constant variance, we can make a time plot of the in-sample forecast errors:
[Figure: correlogram (ACF) of the in-sample forecast errors, Forecast$residuals, lags 1-20]
> plot.ts(Forecast$residuals)
The plot shows that the in-sample forecast errors seem to have roughly constant variance over time,
although the size of the fluctuations at the start of the time series may be slightly smaller than at later
dates. The fluctuations for the time period 2000-2005 are quite high.
To check whether the forecast errors are normally distributed with mean zero, we can plot a histogram
of the forecast errors, with an overlaid normal curve that has mean zero and the same standard
deviation as the distribution of forecast errors.
[Figure: time plot of the in-sample forecast errors, Forecast$residuals, 1990-2009]
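The code producing the histogram is not shown in the report; a minimal sketch in the spirit of the plotForecastErrors() helper from the Little Book of R on Time Series (whose histogram title matches the figure below) would be:
# Sketch only: histogram of the in-sample forecast errors with an overlaid
# normal density of mean zero and the errors' standard deviation.
forecasterrors <- na.omit(Forecast$residuals)
hist(forecasterrors, freq = FALSE, col = "red",
     main = "Histogram of forecasterrors")
curve(dnorm(x, mean = 0, sd = sd(forecasterrors)),
      add = TRUE, col = "blue", lwd = 2)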
The plot shows that the distribution of forecast errors is roughly centered on zero, and is more or less
normally distributed, although it seems to be slightly skewed to the left compared to a normal curve.
However, the left skew is relatively small, and so it is plausible that the forecast errors are normally
distributed with mean zero.
[Figure: histogram of the forecast errors (forecasterrors) with overlaid normal curve]
Autoregressive Integrated Moving Average (ARIMA) models
Autoregressive Integrated Moving Average (ARIMA) models include an explicit statistical model for the
irregular component of a time series that allows for non-zero autocorrelations in the irregular
component.
a. Differencing a Time Series
ARIMA models are defined for stationary time series. Therefore, if you start off with a non-stationary
time series, you will first need to ‘difference’ the time series until you obtain a stationary time series. If
you have to difference the time series d times to obtain a stationary series, then you have an
ARIMA(p,d,q) model, where d is the order of differencing used.
> SFO_diff <- diff(SFO.ts, differences=1)
> plot.ts(SFO_diff)
The resulting time series of first differences (above) does not appear to be stationary in mean.
Therefore, we can difference the time series twice, to see if that gives us a stationary time series:
[Figure: time plot of the first differences, SFO_diff, 1990-2010]
> SFO_diff_1 <- diff(SFO.ts, differences=2)
> plot.ts(SFO_diff_1)
The time series of second differences (above) does appear to be stationary in mean and variance, as the
level of the series stays roughly constant over time, and the variance of the series appears roughly
constant over time. Thus, it appears that we need to difference the time series of the ‘SFO flights’ twice
in order to achieve a stationary series.
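As a quick cross-check, not part of the original report, the ndiffs() function in the forecast package estimates the order of differencing needed for stationarity:
# Sketch only: automated estimate of the required differencing order d.
library(forecast)
ndiffs(SFO.ts)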
This means that we can use an ARIMA(p,d,q) model for the above time series, where d (order of
differencing) = 2, i.e., ARIMA(p,2,q). The next step is to figure out the values of p and q for the ARIMA
model. To do this, we usually need to examine the correlogram and partial correlogram of the
stationary time series.
b. Autocorrelations
> acf(SFO_diff_1, lag.max=20) # plot a correlogram
> acf(SFO_diff_1, lag.max=20, plot=FALSE)
Autocorrelations of series ‘SFO_diff_1’, by lag
0.0000 0.0833 0.1667 0.2500 0.3333 0.4167 0.5000 0.5833 0.6667 0.7500 0.8333
1.000 -0.709 0.264 0.084 -0.361 0.553 -0.652 0.529 -0.327 0.078 0.231
0.9167 1.0000 1.0833 1.1667 1.2500 1.3333 1.4167 1.5000 1.5833 1.6667
-0.572 0.787 -0.621 0.286 0.035 -0.295 0.501 -0.623 0.504 -0.287
[Figure: time plot of the second differences, SFO_diff_1, 1990-2010]
We see from the correlogram that the autocorrelations at lags 1, 2, and 3 exceed the significance bounds,
but they are decreasing and approach zero after lag 3, although there are other autocorrelations between
lags 1-20 that exceed the significance bounds.
c. Partial Autocorrelations
> pacf(SFO_diff_1, lag.max=20) # plot a partial correlogram
> pacf(SFO_diff_1, lag.max=20, plot=FALSE)
Partial autocorrelations of series ‘SFO_diff_1’, by lag
0.0833 0.1667 0.2500 0.3333 0.4167 0.5000 0.5833 0.6667 0.7500 0.8333 0.9167
-0.709 -0.481 0.055 -0.326 0.232 -0.357 -0.027 -0.401 -0.019 -0.053 -0.464
1.0000 1.0833 1.1667 1.2500 1.3333 1.4167 1.5000 1.5833 1.6667
0.130 0.096 0.045 -0.020 0.100 0.118 -0.013 -0.073 0.005
[Figure: correlogram (ACF) of SFO_diff_1, lags 1-20]
The partial correlogram shows that the partial autocorrelations at lags 1 and 2 exceed the significance
bounds, are negative, and are slowly decreasing in magnitude with increasing lag. The partial
autocorrelations approach zero after lag 2.
[Figure: partial correlogram (PACF) of SFO_diff_1, lags 1-20]
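The report does not show how the MA order of 3 was settled on; as a cross-check, again not part of the original analysis, auto.arima() from the forecast package can search for a suitable order automatically:
# Sketch only: automated (p,d,q) selection by AICc, as a sanity check on the
# manually chosen ARIMA(2,2,3) model fitted below.
library(forecast)
auto.arima(SFO.ts)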
d. Forecasting using an ARIMA model
> SFO_arima <- arima(SFO.ts, order=c(2,2,3)) # fit an ARIMA(2,2,3) model
> SFO_arima
Series: SFO.ts
ARIMA(2,2,3)
Coefficients:
ar1 ar2 ma1 ma2 ma3
-1.7315 -0.9996 0.7477 -0.7477 -1.000
s.e. 0.0014 0.0006 0.0211 0.0218 0.021
sigma^2 estimated as 215044: log likelihood=-1806.58
AIC=3625.17 AICc=3625.53 BIC=3646
>
> SFO_arimaforecasts <- forecast.Arima(SFO_arima, h=18)
> SFO_arimaforecasts
Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
Jan 2010 11141.50 10543.492 11739.51 10226.924 12056.08
Feb 2010 10705.97 9854.680 11557.25 9404.037 12007.89
Mar 2010 11356.13 10317.891 12394.36 9768.281 12943.97
Apr 2010 10666.23 9459.164 11873.29 8820.184 12512.27
May 2010 11211.39 9863.664 12559.11 9150.221 13272.56
Jun 2010 10957.55 9475.979 12439.12 8691.684 13223.41
Jul 2010 10852.63 9249.182 12456.08 8400.366 13304.90
Aug 2010 11288.52 9572.908 13004.13 8664.718 13912.32
Sep 2010 10639.15 8812.668 12465.63 7845.787 13432.51
Oct 2010 11328.32 9402.626 13254.00 8383.228 14273.40
Nov 2010 10784.62 8758.111 12811.13 7685.342 13883.90
Dec 2010 11037.64 8918.355 13156.93 7796.473 14278.81
Jan 2011 11143.50 8933.374 13353.62 7763.406 14523.59
Feb 2011 10707.79 8408.235 13007.34 7190.924 14224.65
Mar 2011 11356.89 8974.390 13739.40 7713.169 15000.62
Apr 2011 10668.99 8200.825 13137.16 6894.256 14443.73
May 2011 11211.75 8664.892 13758.61 7316.668 15106.83
Jun 2011 10960.08 8333.068 13587.08 6942.415 14977.74
> plot.forecast(SFO_arimaforecasts)
[Figure: Forecasts from ARIMA(2,2,3) for Jan 2010 - June 2011 with 80% and 95% prediction intervals]
Conclusion
In this report, we worked on a large data set, the airlines flight data set from infochimps.com, which
consists of 3.5 million monthly domestic flight records from 1990 to 2009. We started by
exploring the data set, figuring out the variables it contains and their data types, and computing basic
summary statistics. The next task was to prepare the data for analysis, which in addition to cleaning the
data also involved supplementing the data set with additional information, removing unnecessary
variables and transforming some variables in a way that made sense for the contemplated analysis.
The data set got progressively smaller as the analysis proceeded. We prepared two subsets of the
overall data, covering the flights originating from SFO and from LAX, and carried out the time series
analysis on both of them.
Finally, we did our forecasting analysis on the SFO time series, using two techniques: Simple Exponential
Smoothing and forecasting based on an ARIMA model.
References
Books/Articles
1. Little Book of R on Time Series Analysis - Avril Coghlan
2. Introduction to R's time series facilities - Michael Lundholm
3. Working with Time Series Data in R - Eric Zivot
4. White Papers on Big Data and Data Step - Revolution Analytics
Websites
1. http://www.inside-r.org/howto/extracting-time-series-large-data-sets
2. http://www.infochimps.com/datasets/us-domestic-flights-from-1990-to-2009
3. http://www.revolutionanalytics.com/
Contenu connexe

Dernier

"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 

Dernier (20)

"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 

En vedette

PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024Neil Kimberley
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)contently
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024Albert Qian
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsKurio // The Social Media Age(ncy)
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Search Engine Journal
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summarySpeakerHub
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next Tessa Mero
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentLily Ray
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best PracticesVit Horky
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project managementMindGenius
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...RachelPearson36
 
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Applitools
 
12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at WorkGetSmarter
 
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...DevGAMM Conference
 

En vedette (20)

Skeleton Culture Code
Skeleton Culture CodeSkeleton Culture Code
Skeleton Culture Code
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
 
12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work
 
ChatGPT webinar slides
ChatGPT webinar slidesChatGPT webinar slides
ChatGPT webinar slides
 
More than Just Lines on a Map: Best Practices for U.S Bike Routes
More than Just Lines on a Map: Best Practices for U.S Bike RoutesMore than Just Lines on a Map: Best Practices for U.S Bike Routes
More than Just Lines on a Map: Best Practices for U.S Bike Routes
 
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
 

Forecasting analysis on us flights v1

  • 1. FORECASTING PROJECT ON US DOMESTIC FLIGHTS (In Revolution Analytics) Prepared By: Wyendrila Roy http://in.linkedin.com/pub/wyendrila-roy/5/3a/876
  • 2. Acknowledgement This project is done as a final project, as a part of the course titled “Business Analytics with R”. I am really thankful to our course instructor Mr. Ajay Ohri, Founder, DecisionStats, for giving me an opportunity to do the project in Time Series Analysis using R and providing me with the necessary support and guidance which made me complete the project on time. I am extremely grateful to him for providing me with the big data set and also the necessary links to start of the project and understand Time Series Analysis. In this project I have chosen the topic- “Forecasting on US Domestic Flights”, where I have analyzed the flight activities in the Top Domestic Airports of US and then presented a prediction of the same for 2010 – June’ 2011. Due to the size of the data set this project is done in Revolution Analytics. I am really grateful to the extremely resourceful articles and publications provided by Revolution Analytics, which helped me in understanding the tool as well as the topic. Also, I would like to extend my sincere regards to the support team of Edureka for their constant and timely support.
  • 3. Table of Contents Methodology................................................................................................................................................4 1. Overview...........................................................................................................................................4 2. Data Source.......................................................................................................................................4 3. Limitations.........................................................................................................................................4 4. Tool/Package Used............................................................................................................................4 5. File Format Used...............................................................................................................................4 The Analysis..................................................................................................................................................5 1. Importing the Data............................................................................................................................5 2. Exploring the Data.............................................................................................................................5 3. Aggregating the Data ........................................................................................................................8 4. Building the Time Series....................................................................................................................9 5. Predict Future Values based on the Time Series Analysis ..............................................................15 Conclusion ..................................................................................................................................................28 References..................................................................................................................................................28
  • 4. Methodology 1. Overview In this report we have analyzed time series data in R language. We have used the “data step” functions in Revolution Analytics’ RevoScaleR package to access a large data file, manipulated it, sorted it, extracted the data we needed and then aggregated the records with monthly time stamps to form multiple, monthly time series. Then we have used ordinary R time series functions to do some basic analysis. Thereafter we have used forecasting functions to predict the domestic flights activity for Top airports in US for the period Jan 2010 –June 2011. 2. Data Source The dataset used in this report is the airlines “edge” flight data set (77,242 KB) from infochimps.com. It contains 3.5 million monthly domestic flight records from 1990 to 2009. 3. Limitations The major limitation was to extract the time series from time stamped data embedded in this very large data set. These types of data sets are too large to be read into memory and processed by normal R language. 4. Tool/Package Used This Report uses Revolution Analytics’ new add-on package called RevoScaleR™, which provides unprecedented levels of performance and capacity for statistical analysis in the R environment. With the help of this package, we can process, visualize and model the largest data sets in a fraction of the time of legacy systems, without the need to deploy expensive or specialized hardware. 5. File Format Used RevoScaleR provides a new data file type with extension .xdf that has been optimized for “data chunking”, accessing parts of an Xdf file for independent processing. Xdf files store data in a binary format. The file format provides very fast access to a specified set of rows for a specified set of columns. New rows and columns can be added to the file without re-writing the entire file. RevoScaleR also provides a new R class, RxDataSource that has been designed to support the use of external memory algorithms with .xdf files.
  • 5. The Analysis 1. Importing the Data We use the RevoScaleR function to read the text file into the special xdf binary format used by RevoScaleR functions: > rxGetInfoXdf("Flights",getVarInfo=TRUE) File name: C:Documents and SettingsWENDELAMy DocumentsRevolutionFlights.xdf Number of observations: 3606803 Number of variables: 5 Number of blocks: 8 Variable information: Var 1: origin_airport 683 factor levels: MHK EUG MFR SEA PDX ... CRE BOK BIH MQJ LCI Var 2: destin_airport 708 factor levels: AMW RDM EKO WDG END ... COS HII PHD TBN OH1 Var 3: passengers, Type: integer, Low/High: (0, 89597) Var 4: flights, Type: integer, Low/High: (0, 1128) Var 5: month, Type: character > rxGetInfoXdf(file="Flights",numRows=10,startRow=1) File name: C:Documents and SettingsWENDELAMy DocumentsRevolutionFlights.xdf Number of observations: 3606803 Number of variables: 5 Number of blocks: 8 Data (10 rows starting with row 1): origin_airport destin_airport passengers flights month 1 MHK AMW 21 1 200810 2 EUG RDM 41 22 199011 3 EUG RDM 88 19 199012 4 EUG RDM 11 4 199010 5 MFR RDM 0 1 199002 6 MFR RDM 11 1 199003 7 MFR RDM 2 4 199001 8 MFR RDM 7 1 199009 9 MFR RDM 7 2 199011 10 SEA RDM 8 1 199002 2. Exploring the Data Now we will sort the file by flights to find the origin / destination pairs, which have the most monthly flights and pick out the two top origin airports having the most flights.
  • 6. rxSort(inData="Flights", outFile = "sortFlights", sortByVars="flights", + decreasing = TRUE,overwrite=TRUE) > rxGetInfoXdf(file="sortFlights") File name: C:Documents and SettingsWENDELAMy DocumentsRevolutionsortFlights.xdf Number of observations: 3606803 Number of variables: 5 Number of blocks: 8 > mostflights5 <- rxGetInfoXdf(file="sortFlights",numRows=5,startRow=1) > mostflights5 File name: C:Documents and SettingsWENDELAMy DocumentsRevolutionsortFlights.xdf Number of observations: 3606803 Number of variables: 5 Number of blocks: 8 Data (5 rows starting with row 1): origin_airport destin_airport passengers flights month 1 SFO LAX 83153 1128 199412 2 LAX SFO 80450 1126 199412 3 HNL OGG 73014 1058 199408 4 OGG HNL 77011 1056 199408 5 OGG HNL 63020 1044 199412 > top5f <- as.data.frame(mostflights5[[5]]) > topOA <- unique(as.vector(top5f$origin_airport)) > # Select the top 2 > top2 <- topOA[1:2] > top2 [1] "SFO" "LAX" From the above code we can see that the two top origin airports that have the most flights are San Francisco International (SFO) and Los Angeles International (LAX) Next, we use the RevoScaleR function rxDataStep to build a new file “mostFlights” containing only those flights that originate in either SFO or LAX. > rxGetInfoXdf("mostFlights",numRows=10,startRow=1) File name: C:Documents and SettingsWENDELAMy DocumentsRevolutionmostFlights.xdf Number of observations: 144505 Number of variables: 6 Number of blocks: 8 Data (10 rows starting with row 1):
  • 7. origin_airport destin_airport passengers flights month origin 1 SFO RDM 1413 92 199003 SFO 2 SFO RDM 1394 88 199006 SFO 3 SFO RDM 922 86 199001 SFO 4 SFO RDM 1661 93 199008 SFO 5 SFO RDM 1093 88 199005 SFO 6 SFO RDM 995 79 199011 SFO 7 SFO RDM 1080 83 199004 SFO 8 SFO RDM 1279 78 199012 SFO 9 SFO RDM 1080 83 199002 SFO 10 SFO RDM 1493 92 199007 SFO > rxGetInfoXdf("mostFlights",getVarInfo=TRUE) File name: C:Documents and SettingsWENDELAMy DocumentsRevolutionmostFlights.xdf Number of observations: 144505 Number of variables: 6 Number of blocks: 8 Variable information: Var 1: origin_airport 683 factor levels: MHK EUG MFR SEA PDX ... CRE BOK BIH MQJ LCI Var 2: destin_airport 708 factor levels: AMW RDM EKO WDG END ... COS HII PHD TBN OH1 Var 3: passengers, Type: integer, Low/High: (0, 83153) Var 4: flights, Type: integer, Low/High: (0, 1128) Var 5: month, Type: character Var 6: origin 2 factor levels: SFO LAX
  • 8. > rxHistogram(~flights|origin, data="mostFlights") The transformation function, xform, used in rxDataStep creates a new variable, origin, with only two levels (“SFO” and “LAX”) to hold the information on origin airports. The last line of code in this section produces the following histogram of monthly flights 3. Aggregating the Data Now we will break the month variable (which we originally imported as character data) into a month and year component in order to proceed with our Time Series Analysis. > xfunc = function(data){data$Month = as.integer(substring(data$month,5,6)) + data$Year = as.integer(substring(data$month,1,4)) + return(data)} > xfunc = function(data){data$Month = as.integer(substring(data$month,5,6)) + data$Year = as.integer(substring(data$month,1,4)) + return(data)} > # Add a new variable for time series work > rxDataStepXdf(inFile="mostFlights", outFile = "SFO_LAX", + overwrite = TRUE, transformVars="month",transformFunc = xfunc) > (file="SFO_LAX", numRows=10,startRow=1)
  • 9. > rxDataStepXdf(inFile="SFO_LAX", outFile = "SFO.LAX", + varsToDrop=c("origin_airport","month"), + overwrite=TRUE) > rxGetInfoXdf(file="SFO.LAX",numRows=10,startRow=1) File name: C:Documents and SettingsWENDELAMy DocumentsRevolutionSFO.LAX.xdf Number of observations: 144505 Number of variables: 6 Number of blocks: 8 Data (10 rows starting with row 1): destin_airport passengers flights origin Month Year 1 RDM 1413 92 SFO 3 1990 2 RDM 1394 88 SFO 6 1990 3 RDM 922 86 SFO 1 1990 4 RDM 1661 93 SFO 8 1990 5 RDM 1093 88 SFO 5 1990 6 RDM 995 79 SFO 11 1990 7 RDM 1080 83 SFO 4 1990 8 RDM 1279 78 SFO 12 1990 9 RDM 1080 83 SFO 2 1990 10 RDM 1493 92 SFO 7 1990 The transformation function, xfunc, used in rxDataStepXdf uses ordinary R string handling functions to break apart the month data. A second data step function drops the unnecessary variables from our final file: SFO.LAX. 4. Building the Time Series The function rxCube counts the number of flights in each combination of Year, Month and origin airport. > xfunc <- function(data){ + data$Month <- as.integer(substring(data$month,5,6)) + data$Year <- as.integer(substring(data$month,1,4)) + return(data) + } > > rxDataStepXdf(inFile="mostFlights", outFile = "SFO_LAX", + overwrite = TRUE, transformVars="month",transformFunc = xfunc) > (file="SFO_LAX",numRows=10,startRow=1) > t1 <-rxCube(flights ~ F(Year):F(Month):origin, removeZeroCounts=TRUE,data = "SFO_LAX") > t1 <- as.data.frame(t1)
  • 10. > head(t1) F_Year F_Month origin flights Counts 1 1990 1 SFO 39.04225 284 2 1991 1 SFO 38.42034 295 3 1992 1 SFO 46.23954 263 4 1993 1 SFO 44.39464 261 5 1994 1 SFO 36.15417 240 6 1995 1 SFO 45.76768 198 From the above table, we see that there were 284 records where the originating airport was SFO for the first month of 1990. The average number of flights among these 284 counts was 39.04225. From this information, we can calculate the total number of flights for each month. The next bit of code does this and forms the time information into a proper date. Note that we have reduced the data sufficiently so that we are now working with a data frame, t1. Now we will compute total flights out and combine month and date into a date t1$flights_out<- t1$flights*t1$Counts > names(t1) <- c("Year","Month","origin","avg.flights.per.destin","total.destin","flights.out") > t1$Date <- as.Date(as.character(paste(t1$Month,"- 28 -",t1$Year)),"%m - %d - %Y") > head(t1) Year Month origin avg.flights.per.destin total.destin flights.out Date 1 1990 1 SFO 39.04225 284 11088 1990-01-28 2 1991 1 SFO 38.42034 295 11334 1991-01-28 3 1992 1 SFO 46.23954 263 12161 1992-01-28 4 1993 1 SFO 44.39464 261 11587 1993-01-28 5 1994 1 SFO 36.15417 240 8677 1994-01-28 6 1995 1 SFO 45.76768 198 9062 1995-01-28 Now, we extract out the SFO data, sort it to form a time series and plot it. > SFO.t1 <- SFO.t1[order(SFO.t1$Date),] > x <-SFO.t1$Date > y <-SFO.t1$flights.out > library(ggplot2) > qplot(x,y, geom="line",xlab="", ylab="Number of Flightsn",main="Monthly Flights Out of SFO")
  • 11. We use the R function, ts, to form the data into a time series object, and use the function stl to perform a seasonal decomposition.
• 12. > SFO.ts <- ts(y, start=x[1], freq=12)
> sd.SFO <- stl(SFO.ts, s.window="periodic")
> plot(sd.SFO)
[Figure: seasonal decomposition of the SFO series; panels show the data, the seasonal component, the trend, and the remainder]
In the above graph, the first panel reproduces the time series. The second panel shows the periodic, seasonal component. The third panel displays the trend, and the fourth panel displays the residuals.
• 13. We may now repeat the above steps for the LAX data.
> LAX.t1 <- t1[t1$origin=="LAX",]
> LAX.t1 <- LAX.t1[order(LAX.t1$Date),]
> a <- LAX.t1$Date
> b <- LAX.t1$flights.out
> qplot(a, b, geom="line", xlab="", ylab="Number of Flights\n", main="Monthly Flights Out of LAX")
[Figure: Monthly Flights Out of LAX, 1990-2009]
• 14. > LAX.ts <- ts(b, start=a[1], freq=12)
> sd.LAX <- stl(LAX.ts, s.window="periodic")
> plot(sd.LAX)
[Figure: seasonal decomposition of the LAX series; panels show the data, the seasonal component, the trend, and the remainder]
• 15. 5. Predict Future Values based on the Time Series Analysis
Now we can proceed with the forecasting analysis. For the remainder of the analysis we will work with the SFO time series and predict its values for the period Jan 2010 - June 2011. We will use Simple Exponential Smoothing as well as an ARIMA model for our forecasting analysis.
SFO.ts = ts(y, start = c(1990), freq=12)
plot.ts(SFO.ts)
[Figure: time plot of SFO.ts, monthly flights out of SFO, 1990-2009]
Simple Exponential Smoothing
fit <- HoltWinters(SFO.ts, beta=FALSE, gamma=FALSE)
fit
Smoothing parameters:
 alpha: 0.6726511
 beta : FALSE
 gamma: FALSE
Coefficients:
      [,1]
a 10987.14
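For intuition, with beta=FALSE and gamma=FALSE the Holt-Winters routine reduces to simple exponential smoothing: it tracks a single level, l_t = alpha * y_t + (1 - alpha) * l_{t-1}, and its point forecasts are flat at the final level. The sketch below, which is not part of the original analysis, replays that recursion using the alpha value reported above; because HoltWinters() initialises the level slightly differently, the final number will only roughly match the coefficient a.
# Minimal sketch of the simple-exponential-smoothing recursion, using the
# fitted alpha reported above. Initialising the level at the first value is
# a simplification, so the result matches HoltWinters() only approximately.
ses_level <- function(y, alpha) {
  level <- numeric(length(y))
  level[1] <- y[1]
  for (t in 2:length(y)) {
    level[t] <- alpha * y[t] + (1 - alpha) * level[t - 1]
  }
  level
}
lvl <- ses_level(as.numeric(SFO.ts), alpha = 0.6726511)
tail(lvl, 1)   # flat point forecast; compare with coefficient a (about 10987)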
• 16. By default, HoltWinters() just makes forecasts for the same time period covered by our original time series. In this case, our original time series covered the number of flights originating from SFO from 1990-2009, so the forecasts are also for 1990-2009. In the example above, we have stored the output of the HoltWinters() function in the list variable "fit".
> plot(fit)
[Figure: Holt-Winters filtering, observed (black) and fitted (red) values]
The plot shows the original time series in black and the forecasts in red. As a measure of the accuracy of the forecasts, we can calculate the sum of squared errors for the in-sample forecast errors, that is, the forecast errors for the time period covered by our original time series. The sum of squared errors is stored in a named element of the list variable "fit" called "SSE", so we can get its value by typing:
> fit$SSE
[1] 95039885
That is, here the sum of squared errors is 95039885.
As explained above, by default HoltWinters() just makes forecasts for the time period covered by the original data, which is 1990-2009 in this case. We can make forecasts for further time points by using the "forecast.HoltWinters()" function in the R "forecast" package.
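Before moving on, a quick added sanity check (not in the original slides): the SSE value above can be reproduced directly from the fitted values, since the in-sample forecast errors are simply observed minus fitted values. This assumes the fit object created above.
# The fitted values of a HoltWinters object start at the second observation,
# so the errors are computed against SFO.ts with its first value dropped.
err <- SFO.ts[-1] - fitted(fit)[, "xhat"]
sum(err^2)                 # should reproduce fit$SSE (about 95039885)
sum(residuals(fit)^2)      # equivalent shortcut using the residuals() method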
• 17. > Forecast <- forecast.HoltWinters(fit, h=18)
> Forecast
         Point Forecast     Lo 80    Hi 80    Lo 95    Hi 95
Jan 2010       10987.14 10177.296 11796.98 9748.591 12225.68
Feb 2010       10987.14 10011.132 11963.14 9494.466 12479.81
Mar 2010       10987.14  9869.403 12104.87 9277.711 12696.56
Apr 2010       10987.14  9743.726 12230.55 9085.504 12888.77
May 2010       10987.14  9629.634 12344.64 8911.015 13063.26
Jun 2010       10987.14  9524.415 12449.86 8750.096 13224.18
Jul 2010       10987.14  9426.272 12548.00 8600.000 13374.28
Aug 2010       10987.14  9333.946 12640.33 8458.798 13515.48
Sep 2010       10987.14  9246.509 12727.77 8325.076 13649.20
Oct 2010       10987.14  9163.260 12811.02 8197.757 13776.52
Nov 2010       10987.14  9083.648 12890.63 8076.001 13898.27
Dec 2010       10987.14  9007.235 12967.04 7959.137 14015.14
Jan 2011       10987.14  8933.663 13040.61 7846.619 14127.66
Feb 2011       10987.14  8862.637 13111.64 7737.995 14236.28
Mar 2011       10987.14  8793.911 13180.36 7632.886 14341.39
Apr 2011       10987.14  8727.273 13247.00 7530.973 14443.30
May 2011       10987.14  8662.545 13311.73 7431.980 14542.30
Jun 2011       10987.14  8599.571 13374.70 7335.670 14638.61
The forecast.HoltWinters() function gives the forecast for our 18-month period, an 80% prediction interval for the forecast, and a 95% prediction interval for the forecast. For example, the forecasted value for Jan 2010 is about 10987.14, with a 95% prediction interval of (9748.591, 12225.68). To plot the predictions made by forecast.HoltWinters(), we can use the "plot.forecast()" function:
• 18. > plot.forecast(Forecast)
[Figure: Forecasts from HoltWinters, showing the 1990-2009 series and the 2010-2011 forecasts]
Here the forecasts for Jan 2010 - June 2011 are plotted as a dark blue line, the 80% prediction interval as a blue shaded area, and the 95% prediction interval as a light blue shaded area.
The 'forecast errors' are calculated as the observed values minus the predicted values, for each time point. We can only calculate the forecast errors for the time period covered by our original time series, which is 1990-2009 for the flight data. The in-sample forecast errors are stored in the named element "residuals" of the list variable returned by forecast.HoltWinters(). We will now obtain a correlogram of the in-sample forecast errors for lags 1-20, using the "acf()" function in R. To specify the maximum lag that we want to look at, we use the "lag.max" parameter of acf().
• 19. > acf(Forecast$residuals, lag.max=20)
[Figure: ACF (correlogram) of Forecast$residuals, lags 1-20]
To test whether there is significant evidence for non-zero correlations at lags 1-20, we can carry out a Ljung-Box test.
> Box.test(Forecast$residuals, lag=20, type="Ljung-Box")
	Box-Ljung test
data:  Forecast$residuals
X-squared = 370.1992, df = 20, p-value < 2.2e-16
The p-value is far below 0.05, so there is strong evidence of non-zero autocorrelation in the in-sample forecast errors at lags 1-20, which suggests the simple exponential smoothing forecasts could probably be improved upon.
It is also a good idea to check whether the forecast errors are normally distributed with mean zero and have constant variance. To check whether the forecast errors have constant variance, we can make a time plot of the in-sample forecast errors:
• 20. > plot.ts(Forecast$residuals)
[Figure: time plot of the in-sample forecast errors, Forecast$residuals]
The plot shows that the in-sample forecast errors have roughly constant variance over time, although the size of the fluctuations at the start of the time series may be slightly smaller than at later dates; the fluctuations for the period 2000-2005 are quite high. To check whether the forecast errors are normally distributed with mean zero, we can plot a histogram of the forecast errors, with an overlaid normal curve that has mean zero and the same standard deviation as the distribution of the forecast errors.
• 21. [Figure: Histogram of forecasterrors, on the density scale, with an overlaid normal curve]
The plot shows that the distribution of forecast errors is roughly centered on zero, and is more or less normally distributed, although it seems to be slightly skewed to the left compared to a normal curve. However, the left skew is relatively small, and so it is plausible that the forecast errors are normally distributed with mean zero.
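The code that produced this histogram is not shown on the slide. A sketch along the lines of the approach in the Little Book of R on Time Series (listed in the References) is given below; the function name plotForecastErrors and its internals are illustrative, and the original script may have differed in detail.
# Sketch: histogram of the forecast errors on the density scale, with an
# overlaid normal curve of mean zero and the errors' standard deviation.
plotForecastErrors <- function(forecasterrors) {
  mysd <- sd(forecasterrors)
  hist(forecasterrors, freq = FALSE, col = "red",
       main = "Histogram of forecasterrors", xlab = "forecasterrors")
  xs <- seq(min(forecasterrors) - 2 * mysd,
            max(forecasterrors) + 2 * mysd, length.out = 200)
  lines(xs, dnorm(xs, mean = 0, sd = mysd), col = "blue", lwd = 2)
}
plotForecastErrors(Forecast$residuals)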
• 22. Autoregressive Integrated Moving Average (ARIMA) models
Autoregressive Integrated Moving Average (ARIMA) models include an explicit statistical model for the irregular component of a time series, one that allows for non-zero autocorrelations in the irregular component.
a. Differencing a Time Series
ARIMA models are defined for stationary time series. Therefore, if you start off with a non-stationary time series, you will first need to 'difference' the time series until you obtain a stationary one. If you have to difference the time series d times to obtain a stationary series, then you have an ARIMA(p,d,q) model, where d is the order of differencing used.
> SFO_diff <- diff(SFO.ts, differences=1)
> plot.ts(SFO_diff)
[Figure: time plot of SFO_diff, the first differences of the SFO series]
The resulting time series of first differences (above) does not appear to be stationary in mean. Therefore, we can difference the time series twice, to see if that gives us a stationary time series:
• 23. > SFO_diff_1 <- diff(SFO.ts, differences=2)
> plot.ts(SFO_diff_1)
[Figure: time plot of SFO_diff_1, the second differences of the SFO series]
The time series of second differences (above) does appear to be stationary in mean and variance, as the level of the series stays roughly constant over time and the variance of the series appears roughly constant over time. Thus, it appears that we need to difference the time series of the 'SFO flights' twice in order to achieve a stationary series. This means that we can use an ARIMA(p,d,q) model for the above time series, where d (the order of differencing) = 2, i.e., ARIMA(p,2,q).
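As an optional, formal complement to this visual check, and not part of the original analysis, a unit-root test could be applied to the twice-differenced series, for example the augmented Dickey-Fuller test from the tseries package.
# Optional stationarity check (not in the original slides): augmented
# Dickey-Fuller test on the twice-differenced series. A small p-value is
# consistent with treating SFO_diff_1 as stationary.
library(tseries)
adf.test(SFO_diff_1)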
• 24. The next step is to figure out the values of p and q for the ARIMA model. To do this, we usually need to examine the correlogram and partial correlogram of the stationary time series.
b. Autocorrelations
> acf(SFO_diff_1, lag.max=20) # plot a correlogram
> acf(SFO_diff_1, lag.max=20, plot=FALSE)
Autocorrelations of series ‘SFO_diff_1’, by lag
0.0000 0.0833 0.1667 0.2500 0.3333 0.4167 0.5000 0.5833 0.6667 0.7500 0.8333
 1.000 -0.709  0.264  0.084 -0.361  0.553 -0.652  0.529 -0.327  0.078  0.231
0.9167 1.0000 1.0833 1.1667 1.2500 1.3333 1.4167 1.5000 1.5833 1.6667
-0.572  0.787 -0.621  0.286  0.035 -0.295  0.501 -0.623  0.504 -0.287
[Figure: ACF (correlogram) of SFO_diff_1, lags 1-20]
We see from the correlogram that the autocorrelations at lags 1, 2 and 3 exceed the significance bounds, but they are decreasing and near zero after lag 3, although there are other autocorrelations between lags 1-20 that exceed the significance bounds.
c. Partial Autocorrelations
> pacf(SFO_diff_1, lag.max=20) # plot a partial correlogram
> pacf(SFO_diff_1, lag.max=20, plot=FALSE)
Partial autocorrelations of series ‘SFO_diff_1’, by lag
0.0833 0.1667 0.2500 0.3333 0.4167 0.5000 0.5833 0.6667 0.7500 0.8333 0.9167
-0.709 -0.481  0.055 -0.326  0.232 -0.357 -0.027 -0.401 -0.019 -0.053 -0.464
1.0000 1.0833 1.1667 1.2500 1.3333 1.4167 1.5000 1.5833 1.6667
 0.130  0.096  0.045 -0.020  0.100  0.118 -0.013 -0.073  0.005
• 25. [Figure: PACF (partial correlogram) of SFO_diff_1, lags 1-20]
The partial correlogram shows that the partial autocorrelations at lags 1 and 2 exceed the significance bounds, are negative, and are slowly decreasing in magnitude with increasing lag; the partial autocorrelations near zero after lag 2.
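Reading p and q off the ACF and PACF involves some judgment. As an optional cross-check, not part of the original analysis, the forecast package already used for forecast.HoltWinters() can search for an ARIMA order automatically; the order it selects may differ from the ARIMA(2,2,3) chosen below.
# Optional cross-check (not in the original slides): automatic order
# selection with auto.arima() from the forecast package.
library(forecast)
auto_fit <- auto.arima(SFO.ts)
summary(auto_fit)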
• 26. d. Forecasting using an ARIMA model
> SFO_arima <- arima(SFO.ts, order=c(2,2,3)) # fit an ARIMA(2,2,3) model
> SFO_arima
Series: SFO.ts
ARIMA(2,2,3)
Coefficients:
          ar1      ar2     ma1      ma2     ma3
      -1.7315  -0.9996  0.7477  -0.7477  -1.000
s.e.   0.0014   0.0006  0.0211   0.0218   0.021
sigma^2 estimated as 215044: log likelihood=-1806.58
AIC=3625.17 AICc=3625.53 BIC=3646
> SFO_arimaforecasts <- forecast.Arima(SFO_arima, h=18)
> SFO_arimaforecasts
         Point Forecast     Lo 80    Hi 80     Lo 95    Hi 95
Jan 2010       11141.50 10543.492 11739.51 10226.924 12056.08
Feb 2010       10705.97  9854.680 11557.25  9404.037 12007.89
Mar 2010       11356.13 10317.891 12394.36  9768.281 12943.97
Apr 2010       10666.23  9459.164 11873.29  8820.184 12512.27
May 2010       11211.39  9863.664 12559.11  9150.221 13272.56
Jun 2010       10957.55  9475.979 12439.12  8691.684 13223.41
Jul 2010       10852.63  9249.182 12456.08  8400.366 13304.90
Aug 2010       11288.52  9572.908 13004.13  8664.718 13912.32
Sep 2010       10639.15  8812.668 12465.63  7845.787 13432.51
Oct 2010       11328.32  9402.626 13254.00  8383.228 14273.40
Nov 2010       10784.62  8758.111 12811.13  7685.342 13883.90
Dec 2010       11037.64  8918.355 13156.93  7796.473 14278.81
Jan 2011       11143.50  8933.374 13353.62  7763.406 14523.59
Feb 2011       10707.79  8408.235 13007.34  7190.924 14224.65
Mar 2011       11356.89  8974.390 13739.40  7713.169 15000.62
Apr 2011       10668.99  8200.825 13137.16  6894.256 14443.73
May 2011       11211.75  8664.892 13758.61  7316.668 15106.83
Jun 2011       10960.08  8333.068 13587.08  6942.415 14977.74
• 27. > plot.forecast(SFO_arimaforecasts)
[Figure: Forecasts from ARIMA(2,2,3), showing the 1990-2009 series and the 2010-2011 forecasts]
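For completeness, the residual diagnostics applied earlier to the Holt-Winters forecasts could be applied to the ARIMA forecasts as well. This step is not shown in the original slides and is sketched here assuming the SFO_arimaforecasts object created above.
# Optional diagnostics for the ARIMA in-sample forecast errors, mirroring
# the checks used earlier for the Holt-Winters model.
acf(SFO_arimaforecasts$residuals, lag.max = 20)                        # correlogram of the residuals
Box.test(SFO_arimaforecasts$residuals, lag = 20, type = "Ljung-Box")   # test for remaining autocorrelation
plot.ts(SFO_arimaforecasts$residuals)                                  # check for roughly constant variance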
• 28. Conclusion
In this report we worked with a large data set, the airline flight data set from infochimps.com, which consisted of 3.5 million monthly domestic flight records from 1990 to 2009. We began by exploring the data set, identifying the variables it contains and their data types, and computing basic summary statistics. The next task was to prepare the data for analysis, which in addition to cleaning the data also involved supplementing it with additional information, removing unnecessary variables, and transforming some variables in a way that made sense for the contemplated analysis. The data set became progressively smaller as the analysis proceeded. We prepared two subsets of the overall data, the flights originating from SFO and those originating from LAX, and carried out the time series analysis on both. Finally, we performed the forecasting analysis on the SFO time series, using two techniques: Simple Exponential Smoothing and forecasting based on an ARIMA model.
References
Books/Articles
1. Little Book of R on Time Series Analysis - Avril Coghlan
2. Introduction to R's time series facilities - Michael Lundholm
3. Working with Time Series Data in R - Eric Zivot
4. White Papers on Big Data and Data Step - Revolution Analytics
Websites
1. http://www.inside-r.org/howto/extracting-time-series-large-data-sets
2. http://www.infochimps.com/datasets/us-domestic-flights-from-1990-to-2009
3. http://www.revolutionanalytics.com/