3. • We are going to examine the linear correlation between two
variables, that is, we are looking at how strong a linear relationship
there is between them.
• If there is a suitably strong linear relationship between the two
variables (and there is cause and effect) we can calculate a
“regression line”, which is given by:
E[Y | X = x] = y = α + βx
4. • The sample correlation coefficient r is given by:
• r = Sxy / (Sxx Syy)^(1/2)
• r is such that −1 ≤ r ≤ 1
• r is a measure of linear association and does not itself indicate “cause
and effect”
5. • ρ = corr(X,Y) = cov(X,Y) / (var(X) var(Y))^(1/2)
• The sample correlation coefficient, r, is an estimator of the population
correlation coefficient, ρ.
• r = Sxy / (Sxx Syy)^(1/2)
• Sxx = ∑(xi − x̄)² = ∑xi² − n x̄²
• Syy = ∑(Yi − ȳ)² = ∑Yi² − n ȳ²
• Sxy = ∑(xi − x̄)(Yi − ȳ) = ∑xi yi − n x̄ ȳ
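As an illustration (not part of the original notes), these definitions translate directly into code. A minimal Python sketch, using a perfectly linear toy data set so that r comes out as exactly 1:

```python
import math

def sample_correlation(x, y):
    """Return Sxx, Syy, Sxy and the sample correlation coefficient r,
    using the computational forms Sxx = ∑x² − n x̄², etc."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum(xi**2 for xi in x) - n * xbar**2
    syy = sum(yi**2 for yi in y) - n * ybar**2
    sxy = sum(xi * yi for xi, yi in zip(x, y)) - n * xbar * ybar
    r = sxy / math.sqrt(sxx * syy)
    return sxx, syy, sxy, r

# Toy data lying exactly on a straight line, so r = 1.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]
print(sample_correlation(x, y)[3])  # → 1.0
```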
6. • A new computerized ultrasound scanning technique has enabled
doctors to monitor the weight of unborn babies. The table below
shows the estimated weights for one particular foetus at fortnightly
intervals during the pregnancy.
Gestation period (weeks), x:     30    32    34    36    38    40
Estimated foetal weight (kg), y: 1.6   1.7   2.5   2.8   3.2   3.5
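Working this example through numerically (a sketch, not part of the notes; the slope and intercept formulas β̂ = Sxy/Sxx and α̂ = ȳ − β̂x̄ are the standard least-squares estimates):

```python
import math

x = [30, 32, 34, 36, 38, 40]          # gestation period (weeks)
y = [1.6, 1.7, 2.5, 2.8, 3.2, 3.5]    # estimated foetal weight (kg)

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar)**2 for xi in x)
syy = sum((yi - ybar)**2 for yi in y)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

r = sxy / math.sqrt(sxx * syy)        # sample correlation coefficient
beta = sxy / sxx                      # fitted slope
alpha = ybar - beta * xbar            # fitted intercept
print(round(r, 3), round(alpha, 2), round(beta, 4))  # → 0.984 -4.6 0.2043
```

The correlation is very strong (r ≈ 0.984), consistent with foetal weight growing roughly linearly over this range of gestation.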
8. Coefficient of Determination
• The proportion of the total variability of the responses “explained” by
a model is called the coefficient of determination, denoted R².
• The proportion is:
R² = SSREG / SSTOT = Sxy² / (Sxx Syy)
R² can take values between 0% and 100% inclusive.
9. Goodness of fit
• Partitioning the variability of the responses
• To help understand the “goodness of fit” of the model to the data,
the total variation in the responses, Syy = ∑(Yi − ȳ)², should be studied.
Some of the variation in the responses can be attributed to the
relationship with x (e.g. y may be high when x is high, low when x is
low) and some is random variation (unmodellable).
Just how much is attributable to, or explained by, the model is a
measure of the goodness of fit of the model.
10. • We start from an identity involving Yi (the observed y value), ȳ (the
overall average of the y values) and ŷi (the fitted value of y):
• Yi − ȳ = (Yi − ŷi) + (ŷi − ȳ)
• Squaring and summing both sides gives:
• ∑(Yi − ȳ)² = ∑(Yi − ŷi)² + ∑(ŷi − ȳ)²
• (the cross-product term vanishes for a least-squares fit)
• The sum on the left is the “total sum of squares” of the responses, denoted
here SSTOT.
• The second sum on the right is the sum of the squares of the deviations of
the fitted responses from the overall mean. It summarises the variability
accounted for, or “explained by the model”. It is called the regression sum
of squares, denoted here by SSREG
• The first sum on the right is the sum of the squares of the estimated errors
(response minus fitted response, generally referred to as the residual from the fit). It
summarises the remaining variability, that between the responses and
their fitted values and so unexplained by the model. It is called the residual
sum of squares, denoted by SSRES.
12. • So:
• SSTOT=SSRES+SSREG
• In this case, (simple linear regression model), note that the value of
the coefficient of determination is the square of the correlation
coefficient for the data.
• r = Sxy / (Sxx Syy)^(1/2)
• R² = SSREG / SSTOT = Sxy² / (Sxx Syy) = r²
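Both identities, SSTOT = SSRES + SSREG and R² = r², can be checked numerically. A minimal sketch (the data here are a hypothetical toy set, not from the notes):

```python
import math

# Hypothetical toy data, roughly linear with some noise.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.0, 9.8]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar)**2 for xi in x)
syy = sum((yi - ybar)**2 for yi in y)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

# Least-squares fitted line and fitted values.
beta = sxy / sxx
alpha = ybar - beta * xbar
yhat = [alpha + beta * xi for xi in x]

ss_tot = syy
ss_res = sum((yi - yh)**2 for yi, yh in zip(y, yhat))
ss_reg = sum((yh - ybar)**2 for yh in yhat)

r = sxy / math.sqrt(sxx * syy)
r2 = ss_reg / ss_tot

# The partition and R² = r² hold up to floating-point rounding.
print(abs(ss_tot - (ss_res + ss_reg)) < 1e-9)  # → True
print(abs(r2 - r**2) < 1e-9)                   # → True
```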
13. • A sample of ten claims and corresponding payments on settlement
for household policies is taken from the business of an insurance
company.
The amounts, in units of $100, are as follows:
Claims, x:    2.10  2.40  2.50  3.20  3.60  3.80  4.10  4.20  4.50  5.00
Payments, y:  2.18  2.06  2.54  2.61  3.67  3.25  4.02  3.71  4.38  4.45
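The sample correlation for this claims data can be computed with the formulas above; a minimal sketch (not part of the notes), using the computational forms Sxx = ∑x² − n x̄², etc.:

```python
import math

x = [2.10, 2.40, 2.50, 3.20, 3.60, 3.80, 4.10, 4.20, 4.50, 5.00]  # claims ($100s)
y = [2.18, 2.06, 2.54, 2.61, 3.67, 3.25, 4.02, 3.71, 4.38, 4.45]  # payments ($100s)

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum(xi**2 for xi in x) - n * xbar**2
syy = sum(yi**2 for yi in y) - n * ybar**2
sxy = sum(xi * yi for xi, yi in zip(x, y)) - n * xbar * ybar

r = sxy / math.sqrt(sxx * syy)
print(round(r, 3))  # → 0.958
```

The strong positive correlation (r ≈ 0.958) is what one would expect: larger claims tend to lead to larger settlement payments.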