2. Building a model can be a never-ending process
Ways to improve the model:
adding interactions
taking away variables
doing transformations
3. How do we judge the quality of the model?
The answer: in relation to other models. The comparison can be based on:
an analysis of residuals
drop-in deviance
the results of an ANOVA test or Wald test
the AIC or BIC score
cross-validation error
bootstrapping
4. 18.1. Residuals
Residuals are the difference between the actual response and the fitted values.
The linear regression formulation assumes that the errors, akin to residuals, are
normally distributed.
The basic idea is that if the model is appropriately
fitted to the data, the residuals should be normally
distributed as well.
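The definition above can be sketched in a few lines. This is a minimal illustration in plain Python rather than the R workflow the slides use; the helper names `fit_line` and `residuals` and the toy data are hypothetical, and only a single predictor is handled.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (intercept, slope)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def residuals(x, y):
    """Residual = actual response minus fitted value."""
    intercept, slope = fit_line(x, y)
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Hypothetical toy data, invented for this sketch.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
res = residuals(x, y)
print([round(r, 3) for r in res])
```

With an intercept in the model, least-squares residuals sum to zero (up to rounding), which is a quick sanity check before looking at their distribution.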
6. Remember: in a coefficient plot, each coefficient is plotted as a point with a
thick line representing the one standard error confidence interval and a thin
line representing the two standard error confidence interval.
There is a vertical line indicating 0. In general, a good rule of thumb is that
if the two standard error confidence interval does not contain 0, the
coefficient is statistically significant.
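The rule of thumb can be written out directly. The sketch below is in plain Python; the function names and the two example coefficients are hypothetical and are not results from the deck.

```python
def standard_error_interval(estimate, std_error, n_se=2):
    """Interval of estimate +/- n_se standard errors."""
    return estimate - n_se * std_error, estimate + n_se * std_error

def excludes_zero(estimate, std_error, n_se=2):
    """True when the n_se standard error interval does not contain 0."""
    low, high = standard_error_interval(estimate, std_error, n_se)
    return low > 0 or high < 0

# Hypothetical (estimate, standard error) pairs, invented for this sketch.
coefficients = {"x1": (2.5, 0.4), "x2": (0.3, 0.5)}
for name, (estimate, std_error) in coefficients.items():
    low, high = standard_error_interval(estimate, std_error)
    print(name, round(low, 2), round(high, 2), excludes_zero(estimate, std_error))
```

Here `x1` would be drawn with a thin line that stays clear of the vertical line at 0, while the thin line for `x2` would cross it.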
9. ggplot2 with linear regression
ggplot2 has a handy trick for dealing with lm models. We can use the
model as the data source and ggplot2 “fortifies” it, creating
new columns for easy plotting.
The basic structure for ggplot2 starts with the ggplot function, which at its most basic
should take the data as its first argument. It can take more arguments, or fewer, but
we will stick with that for now. After initializing the object, we add layers using the +
symbol. To start, we will just discuss geometric layers such as points, lines and
histograms. They are included using functions like geom_point, geom_line and
geom_histogram. These functions take multiple arguments, the most important being which
variable in the data gets mapped to which axis or other aesthetic using aes.
Furthermore, each layer can have different aesthetic mappings and even different
data.
15. Q-Q plot
If the model is a good fit, the standardized residuals should all
fall along a straight line when plotted against the theoretical
quantiles of the normal distribution. Both the base
graphics and ggplot2 versions are shown on the next slide.
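The points behind a Q-Q plot can be computed by hand. This is a simplified sketch in plain Python: it standardizes by the sample standard deviation (without the leverage adjustment a regression package may apply to standardized residuals) and assumes the plotting positions (i - 0.5)/n. The function name and the sample values are hypothetical.

```python
from statistics import NormalDist, mean, stdev

def qq_points(values):
    """Pair theoretical normal quantiles with sorted standardized values."""
    n = len(values)
    center, spread = mean(values), stdev(values)
    standardized = sorted((v - center) / spread for v in values)
    # Assumed plotting positions: (i + 0.5) / n for i = 0 .. n - 1.
    theoretical = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, standardized))

# Hypothetical residuals, invented for this sketch.
sample = [-1.2, -0.7, -0.3, -0.1, 0.2, 0.4, 0.9, 1.5]
points = qq_points(sample)
for theoretical, observed in points:
    print(round(theoretical, 3), round(observed, 3))
```

If the residuals are close to normal, the printed pairs lie near the line observed = theoretical.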
17. Histogram
Another diagnostic is a histogram of the residuals. This time we will not be
showing the base graphics alternative because a
histogram is a standard plot that we have shown
repeatedly.
The residuals in the histogram are not normally distributed, meaning
the model is not an entirely correct specification.
19. All of this measuring of model fit only really makes sense
when comparing multiple models, because all of these
measures are relative.
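22. The ANOVA F statistic is
$$F = \frac{\sum_{i} n_i (\bar{Y}_i - \bar{Y})^2 / (K - 1)}{\sum_{i} \sum_{j} (Y_{ij} - \bar{Y}_i)^2 / (N - K)}$$
where:
$n_i$ is the number of observations in group i,
$\bar{Y}_i$ is the mean of group i,
$\bar{Y}$ is the overall mean,
$Y_{ij}$ is observation j in group i,
N is the total number of observations,
K is the number of groups.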
23. Setting aside ANOVA as a multisample test, we do believe it serves a useful purpose
in testing the relative merits of different models. Simply passing
multiple model objects to anova will return a table of results including
the residual sum of squares (RSS), which is a measure of error: the lower
the better.
25. AIC & BIC
Another metric for comparing models is the Akaike Information Criterion (AIC).
As with RSS, the model with the lowest AIC (even when the values are negative)
is considered optimal.
The BIC (Bayesian Information Criterion) is a similar measure where,
once again, lower is better.
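To make "lower is better" concrete, the sketch below computes both scores for a Gaussian linear model from its residual sum of squares. It is plain Python and assumes the usual definitions AIC = -2 log L + 2p and BIC = -2 log L + p log(n), with the error variance estimated as RSS/n; the function names and the RSS values are hypothetical, and `n_params` is assumed to count every estimated parameter, including the variance.

```python
import math

def gaussian_log_likelihood(rss, n):
    """Maximized log-likelihood of a Gaussian model with variance RSS / n."""
    return -0.5 * n * (math.log(2 * math.pi) + math.log(rss / n) + 1)

def aic(rss, n, n_params):
    """Akaike Information Criterion: -2 log L + 2 p."""
    return -2 * gaussian_log_likelihood(rss, n) + 2 * n_params

def bic(rss, n, n_params):
    """Bayesian Information Criterion: -2 log L + p log(n)."""
    return -2 * gaussian_log_likelihood(rss, n) + n_params * math.log(n)

# Hypothetical models: the larger one lowers RSS slightly but adds parameters.
small = {"rss": 120.0, "n": 100, "n_params": 3}
large = {"rss": 118.0, "n": 100, "n_params": 6}
print(aic(**small), aic(**large))
print(bic(**small), bic(**large))
print(aic(0.01, 100, 3))  # a very tight fit gives a negative AIC
```

In this invented comparison the smaller model scores lower on both criteria, because the small drop in RSS does not pay for three extra parameters.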
32. The results from cv.glm include delta, which has two numbers: the raw
cross-validation error and the adjusted cross-validation error.
The raw error is based on the cost function computed over all the folds. In this case the cost
function is the mean squared error, a measure of correctness for an estimator, defined as
$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2$.
The adjusted error compensates for not using leave-one-out cross-validation, which is like k-fold
cross-validation except that each fold is all but one data point, with one point held out.
Leave-one-out cross-validation is very accurate but highly computationally intensive.
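The raw cross-validation error can be sketched by hand. The plain-Python example below is not cv.glm: it fits a one-predictor line, uses the mean squared error as the cost, weights each fold by its size, and does not compute the adjusted error. All function names and the toy data are hypothetical. Setting k to the number of rows gives leave-one-out cross-validation.

```python
import random
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (intercept, slope)."""
    x_bar, y_bar = mean(x), mean(y)
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - slope * x_bar, slope

def mean_squared_error(actual, predicted):
    """Cost function: average squared difference."""
    return mean((a - p) ** 2 for a, p in zip(actual, predicted))

def k_fold_cv_error(x, y, k=5, seed=0):
    """Raw k-fold CV error: fold costs averaged with fold-size weights."""
    n = len(x)
    fold_ids = [i % k for i in range(n)]
    random.Random(seed).shuffle(fold_ids)
    total = 0.0
    for fold in range(k):
        train = [i for i in range(n) if fold_ids[i] != fold]
        held_out = [i for i in range(n) if fold_ids[i] == fold]
        a, b = fit_line([x[i] for i in train], [y[i] for i in train])
        cost = mean_squared_error([y[i] for i in held_out],
                                  [a + b * x[i] for i in held_out])
        total += cost * len(held_out) / n
    return total

# Hypothetical data: an exact line, and the same line with alternating noise.
x = [float(i) for i in range(1, 11)]
y_exact = [2 * xi + 1 for xi in x]
y_noisy = [yi + (0.5 if i % 2 else -0.5) for i, yi in enumerate(y_exact)]
cv_exact = k_fold_cv_error(x, y_exact)
cv_noisy = k_fold_cv_error(x, y_noisy)
loo_noisy = k_fold_cv_error(x, y_noisy, k=len(x))
print(cv_exact, cv_noisy, loo_noisy)
```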
33. Now that we have a nice number for the error, it helps us only if we can compare it to other models.
37. Bootstrapping
The idea is that we start with n rows of data. Some statistic (whether a mean,
regression or some arbitrary function) is applied to the data.
Then the data are sampled with replacement, creating a new dataset.
This new set still has n rows except that there are repeats and other rows are
entirely missing.
The statistic is applied to this new dataset.
The process is repeated R times (typically around 1,200), which generates an
entire distribution for the statistic.
This distribution can then be used to find the mean and confidence interval
(typically 95%) for the statistic.
The boot package is a very robust set of tools for making the bootstrap easy to
compute.
39. Bootstrapping
The proper way to compute the batting average is to divide total hits by total at bats. This
means we cannot simply run mean(h/ab) and sd(h/ab) to get the mean and
standard deviation. Rather, the batting average is calculated as
sum(h)/sum(ab) and its standard deviation is not easily calculated. This
problem is a great candidate for using the bootstrap.
We calculate the overall batting average with the original data. Then we
sample n rows with replacement and calculate the batting average again. We
do this repeatedly until a distribution is formed. Rather than doing this
manually, though, we use boot.
The first argument to boot is the data. The second argument is the function
that is to be computed on the data. This function must take at least two
arguments: the original data and a vector of indices, frequencies or weights.
The beautiful thing about the bootstrap is its near universal applicability. It
can be used in just about any situation where an analytical solution is
impractical or impossible.
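The resampling procedure can be written out by hand. The sketch below is plain Python, not the boot package: the statistic takes the data and a vector of indices, mirroring the two-argument convention described above, and the interval is a simple percentile interval. The function names and the (hits, at bats) rows are hypothetical.

```python
import random

def batting_average(data, indices):
    """Total hits over total at bats for the selected rows."""
    hits = sum(data[i][0] for i in indices)
    at_bats = sum(data[i][1] for i in indices)
    return hits / at_bats

def bootstrap(data, statistic, n_replicates=1200, seed=1):
    """Return the statistic on the original data and on resampled datasets."""
    rng = random.Random(seed)
    n = len(data)
    original = statistic(data, range(n))
    replicates = []
    for _ in range(n_replicates):
        # Sample n row indices with replacement: some repeat, some are missing.
        indices = [rng.randrange(n) for _ in range(n)]
        replicates.append(statistic(data, indices))
    return original, replicates

def percentile_interval(replicates, level=0.95):
    """Simple percentile interval taken from the sorted replicates."""
    ordered = sorted(replicates)
    tail = (1 - level) / 2
    last = len(ordered) - 1
    return ordered[int(tail * last)], ordered[int((1 - tail) * last)]

# Hypothetical (hits, at bats) rows, invented for this sketch.
players = [(30, 100), (45, 150), (10, 50), (60, 200), (25, 100)]
estimate, replicates = bootstrap(players, batting_average)
lower, upper = percentile_interval(replicates)
print(round(estimate, 4), round(lower, 4), round(upper, 4))
```

The spread of `replicates` stands in for the standard deviation that is hard to derive analytically for a ratio of sums.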
42. 18.5. Stepwise Variable Selection
A common, though increasingly discouraged, way to select
variables for a model is stepwise selection. This is the process of iteratively
adding and removing variables from a model and testing the model at each
step, usually using AIC.
See the book for the full results.
43. 18.6. Conclusion
Determining the quality of a model is an important step in the model-building
process. This can take the form of traditional tests of fit such as ANOVA or
more modern techniques like cross-validation.
The bootstrap is another means of determining model uncertainty, especially
for models where confidence intervals are impractical to calculate. All of these
can guide the choice of which variables are included in a model and
which are excluded.
44. Chapter 19. Regularization and Shrinkage
19.1. Elastic Net
The Elastic Net is a dynamic blending of lasso and ridge regression.
The lasso uses an L1 penalty to perform variable selection and dimension
reduction, while the ridge uses an L2 penalty to shrink the coefficients for
more stable predictions.
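45. The formula for the Elastic Net is:
$$\min_{\beta_0, \beta} \left[ \frac{1}{2N} \sum_{i=1}^{N} (y_i - \beta_0 - x_i^T \beta)^2 + \lambda P_\alpha(\beta) \right], \qquad P_\alpha(\beta) = (1 - \alpha) \frac{1}{2} \lVert \beta \rVert_2^2 + \alpha \lVert \beta \rVert_1$$
where λ is a complexity parameter controlling the amount of shrinkage (0 is
no penalty and ∞ is complete penalty).
α regulates how much of the solution is ridge versus lasso, with α = 0 being
complete ridge and α = 1 being complete lasso.
Γ, not seen here, is a vector of penalty factors (one value per variable) that
multiplies λ for fine tuning of the penalty applied to each variable.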
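The roles of λ and α are easy to see by evaluating the penalty term alone. The sketch below is plain Python and assumes the commonly documented form of the penalty, λ[(1 - α)/2 · Σβ² + α · Σ|β|], with optional per-variable penalty factors multiplying λ; any rescaling a library might apply to those factors is ignored. The function name and the coefficient vector are hypothetical.

```python
def elastic_net_penalty(beta, lam, alpha, penalty_factors=None):
    """Elastic Net penalty: lam * sum_j g_j * ((1 - alpha)/2 * b_j^2 + alpha * |b_j|)."""
    if penalty_factors is None:
        penalty_factors = [1.0] * len(beta)
    return lam * sum(
        g * ((1 - alpha) * 0.5 * b * b + alpha * abs(b))
        for b, g in zip(beta, penalty_factors)
    )

# Hypothetical coefficients, invented for this sketch.
beta = [3.0, -4.0, 0.0]
print(elastic_net_penalty(beta, lam=2.0, alpha=1.0))  # pure lasso (L1)
print(elastic_net_penalty(beta, lam=2.0, alpha=0.0))  # pure ridge (L2)
print(elastic_net_penalty(beta, lam=1.0, alpha=0.5))  # even blend
print(elastic_net_penalty(beta, lam=0.0, alpha=0.5))  # no penalty at all
```

Setting a variable's penalty factor to 0 removes it from the penalty, which is one way to force a predictor to stay in the model.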
47. glmnet
The glmnet package fits generalized linear models with the Elastic Net.
It is designed for speed and larger, sparser data.
Where functions like lm and glm take a formula to specify the model, glmnet
requires a matrix of predictors (including an intercept) and a response
matrix.
48. We will look at the American Community Survey (ACS) data for New York State. We
will throw every possible predictor into the model and see which are selected.
49. λ controls the amount of shrinkage.
By default glmnet fits the regularization path on 100 different values of λ.
The glmnet package has a function, cv.glmnet, that computes the cross-validation
automatically. By default α = 1, meaning only the lasso is calculated.
Selecting the best α requires an additional layer of cross-validation.
52. Finding the optimal value of α requires an additional layer of cross-validation,
and unfortunately glmnet does not do that automatically. This will require us
to run cv.glmnet at various levels of α, which will take a fairly large chunk of
time if performed sequentially, making this a good time to use parallelization.
The most straightforward way to run code in parallel is to use the
parallel, doParallel and foreach packages.
First, we build some helper objects to speed along the process.
When a two-layered cross validation is run, an observation should fall in
the same fold each time, so we build a vector specifying fold membership.
We also specify the sequence of α values that foreach will loop over.
It is generally considered better to lean toward the lasso rather than the
ridge, so we consider only α values greater than 0.5.
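The two helper objects can be sketched as follows. This is plain Python rather than the R objects the slides build; the function names, the choice of 5 folds, the step of 0.05 and the inclusion of 0.5 itself as the starting value are all assumptions made for illustration.

```python
import random

def make_fold_ids(n_rows, k=5, seed=2863):
    """Assign each row to one of k folds, fixed once so every alpha reuses it."""
    fold_ids = [(i % k) + 1 for i in range(n_rows)]
    random.Random(seed).shuffle(fold_ids)
    return fold_ids

def alpha_grid(start=0.5, stop=1.0, step=0.05):
    """Evenly spaced alpha values from start to stop, inclusive."""
    count = int(round((stop - start) / step))
    return [round(start + i * step, 10) for i in range(count + 1)]

the_folds = make_fold_ids(23)  # 23 rows is an arbitrary example size
alphas = alpha_grid()
print(the_folds)
print(alphas)
```

Building the fold vector once and passing it to every cross-validation run keeps each observation in the same fold across all values of α, so the errors are comparable.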
53. Before running a parallel job, a cluster (even on a single machine) must be started and
registered with makeCluster and registerDoParallel. After the job is done the cluster
should be stopped with stopCluster.
Setting .errorhandling to "remove" means that if an error occurs, that iteration will be
skipped. Setting .inorder to FALSE means that the order of combining the results does
not matter and they can be combined whenever returned, which yields significant
speed improvements. Because we are using the default combination function, list,
which takes multiple arguments at once, we can speed up the process by setting
.multicombine to TRUE.
We specify in .packages that glmnet should be loaded on each of the workers, again
leading to performance improvements. The operator %dopar% tells foreach to work in
parallel.
Parallel computing can be dependent on the environment, so we explicitly load some
variables into the foreach environment using .export, namely acsX, acsY, alphas and
theFolds.
61. 19.2. Bayesian Shrinkage
Bayesian shrinkage is useful when a model is built on data that does not have a large enough number of
rows for some combinations of the variables. For this example, we blatantly steal an
example