10. y ~ x
# yhat = b1x + b0
# Want to find b's that minimise distance
# between y and yhat
z ~ x + y
# zhat = b2x + b1y + b0
# Want to find b's that minimise distance
# between z and zhat
z ~ x * y
# zhat = b3(x⋅y) + b2x + b1y + b0
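As a rough illustration of how these formulas differ, the sketch below builds the design matrix for the additive and interaction formulas and lists the columns, one per coefficient. The data frame `d` and its values are made up for this example and are not from the slides.

```r
# Hypothetical toy data (not from the slides); z is generated without noise
d <- data.frame(x = 1:6, y = c(2, 1, 4, 3, 7, 5))
d$z <- 1 + 2 * d$x + 3 * d$y + 0.5 * d$x * d$y

# Each column of the design matrix corresponds to one coefficient b
cols_add <- colnames(model.matrix(z ~ x + y, data = d))
cols_int <- colnames(model.matrix(z ~ x * y, data = d))
cols_add  # intercept, x, y
cols_int  # intercept, x, y, and the x:y product term

# Because z has no noise, lm() recovers the generating coefficients
b <- unname(coef(lm(z ~ x * y, data = d)))
b
```

The `*` operator expands to both main effects plus their product, which is why `z ~ x * y` has one more coefficient than `z ~ x + y`.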
Tuesday, 16 November 2010
11. Assumptions
X is measured without error.
Relationship is linear.
Errors are independent.
Errors have normal distribution.
Errors have constant variance.
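One common way to eyeball several of these assumptions is to plot standardised residuals. The sketch below uses simulated data that satisfies the assumptions by construction. The seed, sample size and coefficients are arbitrary choices for illustration, and base graphics are used only to keep the example self-contained.

```r
set.seed(1)                      # arbitrary seed, for reproducibility
x <- runif(200, 0, 10)
y <- 3 + 2 * x + rnorm(200)      # linear, independent, normal, constant variance
mod <- lm(y ~ x)
res <- rstandard(mod)

# Residuals vs fitted: curvature suggests non-linearity,
# a funnel shape suggests non-constant variance
plot(fitted(mod), res)
abline(h = 0, lty = 2)

# Normal Q-Q plot: points far from the line suggest non-normal errors
qqnorm(res)
qqline(res)
```

These plots are informal checks rather than tests, and they say nothing about whether X was measured without error.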
18. qplot(x, y - x, data=dclean)
19. Your turn
Do the same thing for z and x. What
threshold might you use to remove
outlying values?
Are the errors from predicting z and y
from x related?
20. modz <- lm(z ~ x, data = diamonds, na.action = na.exclude)
coef(modz)
# zhat = 0.03 + 0.61x
qplot(x, rstandard(modz), data = diamonds)
last_plot() + ylim(-10, 10)
qplot(rstandard(mody), rstandard(modz))
24. Can we use a
linear model to
remove this trend?
Linear models are linear in
their parameters; the predictors can be
any transformation of the data.
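A minimal sketch of this point, using made-up noise-free data: the relationship below is curved in x, yet `lm()` fits it exactly because the model is still a weighted sum of the coefficients. The variable names and values are assumptions for illustration.

```r
# Hypothetical data: y is quadratic in x, but linear in b0, b1, b2
x <- seq(-3, 3, length.out = 25)
y <- 1 + 2 * x + 3 * x^2

# I() protects x^2 so it is treated as arithmetic, not formula syntax
fit <- lm(y ~ x + I(x^2))
b <- unname(coef(fit))
b
```

The same idea applies to `log()`, `sqrt()` or any other transformation of the predictors or the response.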
25. Your turn
Use a linear model to remove the effect of
carat on price. Confirm that this worked
by plotting model residuals vs. color.
How can you interpret the model
coefficients and residuals?
26. modprice <- lm(log(price) ~ log(carat),
  data = diamonds, na.action = na.exclude)
# Residuals back on the original scale: price relative to
# the price predicted from carat alone
diamonds$relprice <- exp(resid(modprice))
qplot(carat, relprice, data = diamonds)
diamonds <- subset(diamonds, carat < 2)
qplot(carat, relprice, data = diamonds)
qplot(carat, relprice, data = diamonds) +
  facet_wrap(~ color)
qplot(relprice, ..density.., data = diamonds,
  colour = color, geom = "freqpoly", binwidth = 0.2)
qplot(relprice, ..density.., data = diamonds,
  colour = cut, geom = "freqpoly", binwidth = 0.2)
27. Multiplicative model
log(Y) = a * log(X) + b
Y = c * X^a, where c = exp(b)
An additive model on the log scale becomes a
multiplicative model on the original scale.
The intercept becomes a scale factor (the value of Y at X = 1);
the slope becomes the exponent of X.
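The back-transformation of a log-log fit can be checked numerically. In the sketch below the values of `a` and `b` are hypothetical and the data are noise-free, so the fit recovers them exactly. Exponentiating the intercept gives the multiplicative constant, and doubling X multiplies Y by 2^a.

```r
# Hypothetical parameters on the log scale
a <- 1.7
b <- 0.5
X <- c(0.5, 1, 2, 4)
Y <- exp(b) * X^a                # equivalent to log(Y) = a * log(X) + b

fit <- lm(log(Y) ~ log(X))
c_hat <- exp(coef(fit)[[1]])     # multiplicative constant, Y at X = 1
a_hat <- coef(fit)[[2]]          # exponent of X

# Doubling X (from 1 to 2) multiplies Y by 2^a
ratio <- Y[3] / Y[2]
c(c_hat = c_hat, a_hat = a_hat, ratio = ratio)
```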
29. # Useful trick - close to 0, exp(x) ~ x + 1
x <- seq(-0.2, 0.2, length = 100)
qplot(x, exp(x)) + geom_abline(intercept = 1, slope = 1)
# Relative error of approximating exp(x) by 1 + x
qplot(x, (1 + x) / exp(x) - 1) +
  scale_y_continuous("Percent error", labels = scales::percent)
# Not so useful here because the x is also
# transformed
coef(modprice)
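To get a feel for how good the approximation is without plotting, the relative error of 1 + x against exp(x) can be tabulated at a few points. The grid of x values below is an arbitrary choice for illustration.

```r
# A few points on either side of zero (arbitrary grid)
x <- c(-0.2, -0.1, -0.05, 0.05, 0.1, 0.2)
lin <- 1 + x                          # the linear approximation
rel_err <- (lin - exp(x)) / exp(x)    # relative error of the approximation

# Error in percent; 1 + x never exceeds exp(x), so every entry is negative
data.frame(x = x, percent_error = round(100 * rel_err, 2))
```

Within this range the approximation is off by less than 3 percent, and the error shrinks quickly as x approaches 0.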
31. Your turn
Compare the results of the following two
functions. What can you say about the
model?
ddply(diamonds, "color", summarise,
  mean = mean(price))
coef(lm(price ~ color, data = diamonds))
32. Categorical data
Converted into a numeric matrix, with one
column for each level. Contains 1 if that
observation has that level, 0 otherwise.
However, if we just do that naively, we end
up with too many columns (because we
have one extra column for the intercept).
So the column for the first level is dropped, and
everything is relative to the first level.
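The encoding described above can be inspected with `model.matrix()`. The sketch below uses a small made-up data frame with an unordered factor `g`, so R applies its default treatment contrasts. Note that `color` in `diamonds` is an ordered factor, for which R's default is polynomial contrasts, so its coefficients will not look like this.

```r
# Hypothetical data: three levels, two observations each
d <- data.frame(
  g = factor(c("a", "a", "b", "b", "c", "c")),
  y = c(1, 3, 4, 6, 10, 12)
)

# Intercept plus one 0/1 column for every level except the first
X <- model.matrix(y ~ g, data = d)
colnames(X)

# Group means, and the fitted coefficients
means <- tapply(d$y, d$g, mean)
b <- unname(coef(lm(y ~ g, data = d)))
means
b   # intercept = mean of "a"; the rest are differences from "a"
```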
34. # What do you think this model does?
lm(log(price) ~ log(carat) + color,
  data = diamonds)
# What about this one?
lm(log(price) ~ log(carat) * color,
  data = diamonds)
# Or this one?
lm(log(price) ~ cut * color,
  data = diamonds)
# How can we interpret the results?
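One way to start interpreting such models is to count and name their coefficients. The sketch below uses simulated data with a hypothetical three-level factor `grp` standing in for color or cut. The seed, slopes and noise level are arbitrary assumptions.

```r
set.seed(2)                      # arbitrary seed
n <- 300
d <- data.frame(
  carat = runif(n, 0.2, 2),
  grp = factor(sample(c("A", "B", "C"), n, replace = TRUE))
)
# Each group gets its own slope on the log scale
slope <- c(A = 1.5, B = 1.7, C = 1.9)[as.character(d$grp)]
d$price <- exp(7 + slope * log(d$carat) + rnorm(n, sd = 0.1))

# Additive: one common slope, a separate intercept shift per group
add <- lm(log(price) ~ log(carat) + grp, data = d)
# Interaction: a separate intercept shift and a separate slope per group
int <- lm(log(price) ~ log(carat) * grp, data = d)

names(coef(add))
names(coef(int))
```

The interaction model has two extra coefficients here, one slope adjustment for each non-reference group.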
35. mod1 <- lm(log(price) ~ log(carat) + cut, data = diamonds)
mod2 <- lm(log(price) ~ log(carat) * cut, data = diamonds)
# One way is to explore predictions from the model
# over an evenly spaced grid. expand.grid makes
# this easy
grid <- expand.grid(
  carat = seq(0.2, 2, length = 20),
  cut = levels(diamonds$cut),
  KEEP.OUT.ATTRS = FALSE)
str(grid)
grid
grid$p1 <- exp(predict(mod1, grid))
grid$p2 <- exp(predict(mod2, grid))
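The same grid-and-predict pattern can be tried on simulated data, where the answer is known in advance. In the sketch below the two-level factor `grp`, the seed and the coefficients are all hypothetical. It shows one property of the additive model: on the original scale, the ratio between groups is the same at every carat.

```r
set.seed(3)                      # arbitrary seed
d <- data.frame(
  carat = runif(200, 0.2, 2),
  grp = factor(sample(c("A", "B"), 200, replace = TRUE))
)
d$price <- exp(7 + 1.6 * log(d$carat) + 0.2 * (d$grp == "B") +
  rnorm(200, sd = 0.1))

mod <- lm(log(price) ~ log(carat) + grp, data = d)

# Every combination of 20 carat values and 2 groups
grid <- expand.grid(
  carat = seq(0.2, 2, length.out = 20),
  grp = levels(d$grp),
  KEEP.OUT.ATTRS = FALSE)
grid$pred <- exp(predict(mod, grid))

# Additive on the log scale means a constant ratio on the price scale
ratio <- grid$pred[grid$grp == "B"] / grid$pred[grid$grp == "A"]
range(ratio)
```

With an interaction term in the model, this ratio would instead change with carat.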
36. Your turn
Plot the predictions from the two sets of
models. How are they different?