CH 7
Correlation (the mutual relation of two or more things) and Linear Regression (regression analysis in which the dependent variable is assumed to be linearly related to the independent variable or variables)
Learning Objectives
1) How to use correlation analysis to
describe the relationship between two
interval-level variables.
2) How to use regression analysis to
estimate the effect of an independent
variable on a dependent variable.
3) How to perform and interpret dummy
variable regression.
4) How to use multiple regression to make
controlled comparisons.
The book has covered a fair amount of methodological ground.
• Ch 3 introduced two essential methods for analyzing the relationship between an independent variable and a dependent variable: 1) cross-tabulation and 2) mean comparison analysis.
• Ch 4 covered the logic and practice of controlled comparison: how to set up and interpret the relationship between an independent variable and a dependent variable, controlling for a third variable.
• Chs 5 and 6 covered the role of inferential statistics in evaluating the statistical significance of a relationship, and introduced measures of association.
• By now, you can: 1) frame a testable hypothesis; 2) set up the appropriate analysis; 3) interpret your findings; and 4) figure out the probability that your observed results occurred by chance.
In many ways, correlation and regression are similar to the methods you have already learned.
• Correlation analysis produces a measure of association, Pearson’s correlation coefficient, which gauges the direction and strength of a relationship between two interval-level variables.
• Regression analysis produces a statistic, the regression coefficient, that estimates the size of the effect of the independent variable on the dependent variable.
• Example: working with survey data, we want to investigate the relationship between individuals’ ages (independent variable) and the number of hours they spend watching television each day (dependent variable).
  • 1) Is the relationship positive, with older individuals watching more hours of TV than younger individuals?
  • 2) Is the relationship negative, with older people watching less TV than younger people?
  • 3) How strong is the relationship between age and the number of hours devoted to TV?
• Correlation analysis addresses these questions.
Regression analysis is similar to mean comparison analysis. In mean comparison analysis we learned:
1) How to divide subjects on the independent variable (for example, females and males).
2) How to compare values on the dependent variable (for example, mean Clinton thermometer ratings).
3) How to test the null hypothesis, which assumes that the observed difference is due to random sampling error.
• Similarly, regression analysis communicates the mean difference on the dependent variable (thermometer ratings) for subjects who differ on the independent variable (females compared with males).
• Just like a comparison of two sample means, regression analysis provides information that permits the researcher to determine the probability that the observed difference occurred by chance.
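The link between mean comparison and regression can be checked numerically. The sketch below uses made-up thermometer ratings (hypothetical values, not the book's data) and codes the independent variable as 0 for males and 1 for females. With that coding, the least-squares slope equals the difference between the two group means and the intercept equals the mean of the group coded 0. The helper name `ols_fit` is an illustrative assumption.

```python
from statistics import mean

def ols_fit(x, y):
    """Return (intercept, slope) of the least-squares line for paired lists x, y."""
    x_bar, y_bar = mean(x), mean(y)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
        (xi - x_bar) ** 2 for xi in x
    )
    return y_bar - slope * x_bar, slope

# Hypothetical thermometer ratings (0-100); gender coded 0 = male, 1 = female.
gender = [0, 0, 0, 1, 1, 1]
rating = [50, 55, 60, 58, 62, 66]

intercept, slope = ols_fit(gender, rating)

male_mean = mean(rating[:3])      # mean rating among cases coded 0
female_mean = mean(rating[3:])    # mean rating among cases coded 1
mean_difference = female_mean - male_mean

print(f"intercept = {intercept}, slope = {slope}, mean difference = {mean_difference}")
```

Here the slope and the mean difference are both 7, and the intercept is the male mean of 55.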
However, regression is different in two ways:
• 1) Regression analysis is very precise. It provides a statistic, the regression coefficient, that reveals the exact nature of the relationship between an independent variable and a dependent variable.
  • The regression coefficient reports “the amount of change in the dependent variable that is associated with a one-unit change in the independent variable.”
  • The regression coefficient is used only when the dependent variable is measured at the interval level.
  • The independent variable can come in any form: nominal, ordinal, or interval.
  • This chapter shows how to interpret regression analysis in which the independent variable is interval level.
  • The chapter also discusses a technique called dummy variable regression, which uses nominal or ordinal variables as independent variables.
• 2) A second distinguishing feature of regression is the ease with which it can be extended to the analysis of controlled comparisons.
  • Bivariate regression analyzes the relationship between a dependent variable and a single independent variable: one independent variable and one dependent variable.
  • Regression is remarkably flexible: it can be used to detect and evaluate spuriousness, and it allows us to model additive relationships and interaction relationships.
Correlation
• Example in the book: the relationship between two variables, the % of a state’s population that graduated from high school (independent variable) and the % of the eligible population that voted in the 1994 elections (dependent variable). The display is a scatterplot: the independent variable is measured along the horizontal axis and the dependent variable along the vertical axis.
  • Consider the overall pattern of this relationship: 1) Is it strong, moderate, or weak? 2) What is the direction of the relationship, positive or negative? You can probably arrive at reasonable answers to these questions.
  • As you move from lower to higher values of the independent variable (horizontal axis), the values of the dependent variable (vertical axis) tend to adjust themselves accordingly, clustering a bit higher on the turnout axis. The relationship is positive. But how strong is it?
  • In assessing strength, consider the consistency of the pattern.
  • If the relationship is strong, then just about every time you compare a state that has lower education with a state that has higher education, the second state will also have higher turnout. An increase in X (independent variable, horizontal axis) is associated with an increase in Y (dependent variable, vertical axis) most of the time.
  • If the relationship is weak, you will encounter many cases that do not fit the positive pattern: many higher-education states with turnouts that are about the same as, or lower than, those of lower-education states. An increase in X does not consistently occasion an increase in Y.
Correlation (continued)
• Rate the relationship on a scale from 0 to 1, where a rating close to 0 denotes a weak relationship, a rating around .5 a moderate relationship, and a rating close to 1 a strong relationship.
• In the book’s example, a rating close to 0 does not seem correct because the pattern has some predictability. Yet a rating of 1 does not seem right either, because you can find states in the “wrong” place on the turnout variable, given their levels of education.
• A rating around .5, somewhere in the moderate range, seems like a reasonable gauge of the strength of the relationship.
• Pearson’s correlation coefficient (lowercase r) uses this approach in determining the direction and strength of an interval-level relationship.
• Pearson’s r always has a value that falls between -1, signifying a perfectly negative association between the variables, and +1, a perfectly positive association between them. If no relationship exists, r = 0.
• The exact computation of r is not needed here, but it is important to understand the statistical basis of the correlation coefficient:
r = (covariation of X and Y) / (separate variation of X and Y)
• The numerator, the “covariation of X and Y,” measures the degree to which variation in X is associated with variation in Y. This value quantifies the thinking we applied to the scatterplot of states: compare two states, one having a lower value on X and one having a higher value on X.
Correlation (continued)
• If the second state also has higher turnout than the first state, the numerator will be positive.
• By contrast, if the state with a higher value on X has a lower value on Y, the numerator will be negative.
• If the pattern is inconsistent (the states have different values on X but similar values on Y), the numerator records this inconsistency and assumes a value close to 0.
• The denominator summarizes all the variation in both variables considered separately. If the covariation of X and Y is equal in size to this measure of the total variation in both variables, then r takes on a value of +1 (perfectly positive covariation) or -1 (perfectly negative covariation).
• If X and Y do not move together in a systematic way, then r assumes a value close to 0.
• The correlation coefficient for the relationship depicted in 7-1 is Pearson’s r = +.6.
• Pearson’s r is a symmetrical measure of association between two variables: the correlation between X (independent variable) and Y (dependent variable) is the same as the correlation between Y and X.
• Pearson’s r is neutral on the question of whether X causes Y or Y causes X. Therefore, one cannot attribute causation based on a correlation coefficient.
• Furthermore, Pearson’s r is not a PRE (proportional reduction in error) measure of association. It is neatly bounded by -1 and +1, so it communicates strength and direction by a common metric.
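The ratio behind r can be written out directly. The sketch below uses the standard formula for Pearson's r: the sum of cross-products of deviations, divided by the square root of the product of the two sums of squared deviations. Reading the slides' "separate variation of X and Y" as that denominator is an assumption, and the education and turnout figures are hypothetical, not the book's state data.

```python
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    """Pearson's r: covariation of X and Y over the separate variation of X and Y."""
    x_bar, y_bar = mean(x), mean(y)
    covariation = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    variation_x = sum((xi - x_bar) ** 2 for xi in x)
    variation_y = sum((yi - y_bar) ** 2 for yi in y)
    return covariation / sqrt(variation_x * variation_y)

# Hypothetical values: % high school graduates and % turnout for five states.
education = [70, 75, 80, 85, 90]
turnout = [35, 42, 38, 47, 44]

r_xy = pearson_r(education, turnout)
r_yx = pearson_r(turnout, education)               # symmetry: same value as r_xy
perfect_positive = pearson_r([1, 2, 3], [2, 4, 6])  # every point on a rising line
perfect_negative = pearson_r([1, 2, 3], [6, 4, 2])  # every point on a falling line

print(f"r(X, Y) = {r_xy:.3f}, r(Y, X) = {r_yx:.3f}")
print(perfect_positive, perfect_negative)
```

Swapping the two arguments leaves r unchanged, which is the symmetry property noted above, and the two perfectly linear examples reach the bounds of +1 and -1.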
Bivariate Regression
Bivariate example from the book: analyze the relationship between the scores received on an exam (dependent variable, Y) and the number of hours spent studying for the test (independent variable, X).
• The relationship is positive: students who studied more received better scores. The correlation between the variables is indeed strong.
• But regression analysis permits us to put a finer point on the relationship.
  • In this case, “each additional hour spent studying results in exactly a 6-point increase in exam score.”
• Moreover, the XY pattern can be summarized by a line.
• What linear equation would summarize the relationship?
Bivariate Regression (continued)
• To draw a line you must know two things: 1) the Y-intercept, the point at which the line crosses the Y axis (the value of Y when X is 0), and 2) the slope of the line.
• The regression coefficient, the slope of the line, is “rise over run”: “the amount of change in Y (dependent variable) for each unit change in X (independent variable).” With these two elements, we arrive at the general formula for a regression line, a linear equation that summarizes the relationship between X and Y:
Y = a + b(X)
• Here a represents the Y-intercept, which is 55: the score of students who did not study at all (X = 0). b represents the slope of the line, or regression coefficient.
• The regression coefficient (b) is 6. Thus the regression line for 7-1 (in the book) is:
Test score = 55 + 6(number of hours)
• Notice two aspects of this approach.
  • First, the regression equation provides a general summary of the relationship between X and Y. For any student, we can plug in the number of hours spent studying, do the math, and arrive at an exam score.
  • Second, the formula seems to hold some predictive power: the ability to estimate scores for students who do not appear in the data.
  • Example: 3.5 hours of studying. Our estimate: 55 + 6(3.5) = a score of 76.
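The plug-in arithmetic can be sketched as a small function. The intercept of 55 and slope of 6 come from the example; the function name `predicted_score` is an illustrative assumption.

```python
def predicted_score(hours, intercept=55.0, slope=6.0):
    """Evaluate the regression line: test score = intercept + slope * hours."""
    return intercept + slope * hours

no_study = predicted_score(0)      # the Y-intercept: score when X = 0
estimate = predicted_score(3.5)    # a value of X that need not appear in the data

print(no_study, estimate)  # 55.0 76.0
```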
Bivariate Regression (continued)
  • Using an established regression formula to predict values of a dependent variable for new values of an independent variable is a common application of regression analysis.
• Modify the example to make it somewhat more realistic. Assume a sample of 16 students drawn at random from the population.
• 1) In the data, each pair of students shares the same value on the independent variable but has different scores. For example, the two students who studied 1 hour scored 59 and 63. Each pair of cases therefore needs a one-number summary of its value on the dependent variable.
• 2) Calculate the mean value of the dependent variable for each value of the independent variable. For the two non-studiers, average their scores: (53 + 57) / 2 = 55; for the 1-hour cases, (59 + 63) / 2 = 61. Notice that averaging does not reproduce the data; instead it produces estimates of the actual test scores.
• Because these estimates do not represent real values of Y, they are given a separate label, $\hat{Y}$ (“Y-hat”).
• 3) How to describe the relationship between X and Y: “Based on my sample, each additional hour spent studying produced, on average, a 6-point increase in exam score.”
• So the regression coefficient, b, communicates the average change in Y for each unit change in X. A linear regression equation takes this general form:
$\hat{Y} = \hat{a} + \hat{b}(X)$
• $\hat{Y}$ (“Y-hat”) is the estimated mean value of the dependent variable, $\hat{a}$ (“a-hat”) is the average value of Y when X is 0, and $\hat{b}$ (“b-hat”) is the average change in Y for each unit change in X:
Estimated score = 55 + 6(number of hours)
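The averaging step can be sketched with the two pairs of scores quoted above. The slides do not list the remaining 12 students, so this only covers X = 0 and X = 1; the variable names are illustrative assumptions.

```python
from statistics import mean

# Scores quoted in the example: two non-studiers and two 1-hour students.
scores_by_hours = {0: [53, 57], 1: [59, 63]}

# Y-hat for each value of X is the mean score of the cases sharing that value.
y_hat = {hours: mean(scores) for hours, scores in scores_by_hours.items()}

a_hat = y_hat[0]                       # estimated mean of Y when X = 0
b_hat = (y_hat[1] - y_hat[0]) / (1 - 0)  # average change in Y per one-unit change in X

print(y_hat, a_hat, b_hat)
```

For these four cases the conditional means are 55 and 61, which gives an intercept of 55 and a slope of 6. None of the four actual scores equals its estimate, which is why the estimates get their own label.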
Bivariate Regression (continued)
• Regression analysis is built on the estimation of averages. Regression will use the available information to calculate a Y-intercept.
• If no empirical examples existed for X = 5 hours, regression would still yield an estimate, 55 + 6(5) = 85, an estimated average score.
• A regression line travels through the two-dimensional space defined by X (horizontal axis) and Y (vertical axis), estimating mean values along the way.
• Regression relentlessly summarizes linear relationships. Feed it sample values for X and Y, and it will return estimates for a and b.
• The coefficients are means estimated from sample data, so they contain random sampling error.
• Focus on the workhorse, the regression coefficient, b.
• Obviously this estimate contains some error, because the actual student scores fall a bit above or below the average for any given value of X.
• Just like any sample mean, the size of the error in a regression coefficient is measured by its standard error. So the real value of b in the population ($\beta$) is equal to the sample estimate, $\hat{b}$, within the bounds of standard error:
$\beta = \hat{b} \pm (\text{standard error of } \hat{b})$
Bivariate Regression (continued)
• All the statistical rules you have learned (the informal ±2 rule of thumb, the more formal 1.645 test, P-values, the inferential setup for testing the null hypothesis) apply to regression analysis.
• In evaluating the difference between two sample means, we tested the null hypothesis that the difference in the population is equal to 0. In its regression guise, the null hypothesis says much the same thing: that the true value of $\beta$ in the population is equal to 0.
• Under the null hypothesis, the true regression line is flat and horizontal. As in the comparison of two sample means, we test the null hypothesis by calculating a t-statistic, or t-ratio:
$t = (\hat{b} - \beta) / (\text{standard error of } \hat{b})$, with degrees of freedom (d.f.) = n - 2.
• As before, if the t-ratio is equal to or greater than 2, we can reject the null hypothesis.
• A precise P-value can also be obtained.
• For each 1-hour increase in studying time, we estimate a 6-point increase in exam score ($\hat{b}$ = +6). By computer, the standard error of $\hat{b}$ is .233. So the t-statistic is 6/.233 = 25.75, with a P-value that rounds to .000.
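The t-ratio arithmetic for the studying example can be sketched as follows. The estimate (6) and standard error (.233) come from the example. The band of two standard errors around the estimate is an added illustration of the informal ±2 rule of thumb, not a figure reported in the book, and the function name `t_ratio` is an assumption.

```python
def t_ratio(b_hat, std_error, beta_null=0.0):
    """t = (b-hat - beta) / (standard error of b-hat), with beta set by the null."""
    return (b_hat - beta_null) / std_error

b_hat = 6.0        # estimated change in exam score per hour of study
std_error = 0.233  # standard error of b-hat, as reported by the computer

t = t_ratio(b_hat, std_error)

# Informal rule of thumb: a band of two standard errors around the estimate.
lower = b_hat - 2 * std_error
upper = b_hat + 2 * std_error
reject_null = abs(t) >= 2

print(round(t, 2), round(lower, 3), round(upper, 3), reject_null)
```

The band runs from roughly 5.5 to 6.5 and sits far from 0, which agrees with a t-ratio of 25.75.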
• Turn now to a real-world relationship, the education-turnout example in 7-1, and to some further properties of regression analysis. 7-2 displays the estimated regression line. Where did this line originate?
• Linear regression finds a line that provides the best fit to the data points. Using each case’s value on X, it finds $\hat{Y}$, an estimated value of Y.
• It then calculates the difference between this estimate and the case’s actual value of Y. This difference is called prediction error and is represented by the expression $Y - \hat{Y}$, the actual value of Y minus the estimated value of Y.
Bivariate Regression (continued)
• Regression uses the values of the independent variable, percent high school graduates, to determine an estimated value on the dependent variable, percent turnout. Prediction error is the difference between a state’s actual level of turnout, Y, and its estimated turnout, $\hat{Y}$.
  • The prediction error for any given state may be positive (its actual turnout is higher than its estimated turnout) or negative (its actual turnout is lower than its estimated turnout). In fact, if one were to add up all the positive and negative prediction errors, they would sum to 0.
• When it finds the best-fitting line, regression works with the square of the difference, $(Y - \hat{Y})^2$.
• So for each state, regression squares the difference between the state’s actual level of turnout and its estimated level of turnout.
• In regression logic, the best-fitting line is the one that minimizes the sum of these squared prediction errors across all cases. That is, regression finds the line that minimizes the quantity $\sum (Y - \hat{Y})^2$.
• This criterion of best fit distinguishes garden-variety ordinary least squares (OLS) regression from other regression-based techniques. (The line in 7-2 is an OLS regression line.)
• Regression reported the equation that provides the best fit for the relationship between X and Y:
Estimated turnout = -26.27 + .87(% high school grads)
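The least-squares criterion can be illustrated with a short sketch. It uses the standard closed-form OLS solution for one independent variable and a handful of hypothetical state values (not the book's 50-state data, so the coefficients will differ from -26.27 and .87). The helper names are assumptions. The sketch checks two properties stated above: the prediction errors sum to 0, and any other line has a larger sum of squared errors.

```python
from statistics import mean

def ols_fit(x, y):
    """Return (a_hat, b_hat) for the least-squares line through paired lists x, y."""
    x_bar, y_bar = mean(x), mean(y)
    b_hat = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
        (xi - x_bar) ** 2 for xi in x
    )
    return y_bar - b_hat * x_bar, b_hat

def sum_squared_errors(x, y, a, b):
    """Sum of squared prediction errors (Y - Y-hat)^2 for the line a + b*X."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

# Hypothetical state data: % high school graduates and % turnout.
pct_hs_grads = [68, 72, 75, 80, 83, 86]
pct_turnout = [31, 38, 36, 45, 44, 50]

a_hat, b_hat = ols_fit(pct_hs_grads, pct_turnout)
residuals = [yi - (a_hat + b_hat * xi) for xi, yi in zip(pct_hs_grads, pct_turnout)]

best = sum_squared_errors(pct_hs_grads, pct_turnout, a_hat, b_hat)
tilted = sum_squared_errors(pct_hs_grads, pct_turnout, a_hat, b_hat + 0.1)
shifted = sum_squared_errors(pct_hs_grads, pct_turnout, a_hat + 1.0, b_hat)

print(f"sum of prediction errors: {sum(residuals):.6f}")
print(f"SSE best={best:.2f}, tilted={tilted:.2f}, shifted={shifted:.2f}")
```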
• How would you interpret each of the estimates for a and b?
• Consider the estimate for a, the level of turnout when X is 0: a turnout level of -26.27. A negative turnout rate cannot occur in the real world.
• Nonetheless, regression produced an estimate for $\hat{a}$, anchoring the line at a -26.27 turnout rate.
• For some applications of regression, the value of the Y-intercept, the estimate $\hat{a}$, has no meaningful interpretation. (Sometimes it is essential.)
• What about $\hat{b}$, the estimated effect of education on turnout?
Bivariate Regression (continued)
Two rules for interpreting a regression coefficient:
• First rule: be clear about the units in which the independent and dependent variables are measured, and make sure you know which is the dependent variable and which is the independent variable.
  • In the example, the dependent variable, Y, is measured in percentages: the % of each state’s eligible population that voted in the 1994 election.
  • The independent variable, X, is also expressed in percentages: the % of the population that has a high school degree.
• Second rule: the regression coefficient, b, is expressed in units of the dependent variable, not the independent variable. The coefficient, .87, tells us that “turnout (Y) increases, on average, by .87 of a percentage point for each 1-percentage-point increase in education (X).”
  • Remember that all the coefficients in a regression equation are measured in units of the dependent variable.
  • The intercept is the value of the dependent variable when X is 0.
  • The slope is the estimated change in the dependent variable for a one-unit change in the independent variable.
• From the sample, we obtained an estimate of .87 for the true population value of $\beta$.
• The null hypothesis would claim that $\beta$ is really 0, and that the sample estimate obtained, .87, is within the bounds of sampling error.
• Using the regression coefficient’s standard error, calculated by computer to be .17, we arrive at a t-ratio:
$t = (\hat{b} - \beta) / (\text{standard error of } \hat{b})$, with d.f. = n - 2
  = .87 / .17
  = 5.12, with d.f. = 50 - 2 = 48.
• The informal ±2 rule of thumb advises us to reject the null hypothesis. The P-value, a probability of 2.68E-06, verifies that advice.
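The t-ratio and a tail probability for the turnout example can be sketched with the standard library alone. The density is the standard Student's t formula, and the tail area is approximated with Simpson's rule. The slides do not say whether the reported P-value is one-tailed or two-tailed, so computing the one-tailed area here is an assumption, and the function names, upper integration limit and step count are illustrative choices.

```python
from math import gamma, pi, sqrt

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def upper_tail(t, df, upper=100.0, steps=20000):
    """Approximate P(T >= t) by Simpson's rule on [t, upper]; steps must be even."""
    h = (upper - t) / steps
    total = t_pdf(t, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(t + i * h, df)
    return total * h / 3

b_hat = 0.87      # estimated effect of education on turnout
std_error = 0.17  # standard error of b-hat
n_states = 50

t = (b_hat - 0.0) / std_error  # the null hypothesis sets beta to 0
df = n_states - 2
p_one_tailed = upper_tail(t, df)

print(f"t = {t:.2f}, d.f. = {df}, one-tailed P = {p_one_tailed:.2e}")
```

The tail area comes out on the order of a few millionths, the same order of magnitude as the reported P-value.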
R-Square Regression Analysis gives a precise estimate of the effect of an indep varb on
a dep varb. It looks at a relationship and reports the relationship’s exact
nature.
 “What, exactly is the effect of ed (Indep Varb) on turnout in the states (Dep V)
?” Regression Coeaffient provides an answer: “turnout increases by .87 % for
each 1 percent increase in the % of the states pop with a high school diploma.
Plus, the regression coefficient has a P-value of .000.”
 By itself does not measure the completeness of the relationship, the degree to
which Y is explained by X.
 States’ ed levels, though clearly related to turnout, provide an incomplete
explanation of it.
 In Regression Analysis, the completeness of a relationship is measured by
the statistic R2 (R square).
 R-square = a PRE measure, and so it poses the question of strength the same
way as lambda or Kendall’s tau: “how much better can we predict the dep varb
by knowing the indp varb than by not knowing the indp varb.”
 Consider state turnout data. Guess a states turnout (dep varb) without
knowing its ed level (Indep Varb). Lambda is the best guess for a nominal–
level varb is the varb’s measure of central tendency, its mode.
 In case of an interval-level dep varb, such as % turnout, the best guess
provided by variable’s measure of central tendency, its mean.
R-Square (continued)
• State turnout data example: 7-3 shows a scatterplot of states and the regression line. A flat line is drawn at 40% turnout. This value is the mean turnout for all 50 states, calculated like any mean.
• So $\bar{Y}$ = 40%. If we had no knowledge of the independent variable (education), we would guess a turnout rate of 40 (dependent variable) for each state, taken one at a time.
• This guess would serve well for many states, but it would produce quite a few errors.
  • Wyoming produced a turnout rate of about 57% in 1994. Our guess of 40 would have a large dose of error.
  • The size of this error would be 57 - 40 = 17. Wyoming’s turnout rate is 17 units higher than predicted, based only on the mean turnout for all states.
  • This error can be labeled $Y - \bar{Y}$, the actual value of Y minus the overall mean of Y. This value, calculated for every case, is the starting point for R-square.
  • R-square finds $(Y - \bar{Y})$ for each case, squares it, and then sums these squared values across all observations. The result is the total sum of squares (TSS).
  • TSS is an overall summary of all our missed guesses of turnout, based only on knowledge of the dependent variable.
• Reconsider the regression line in 7-3, and see how much it improves the prediction of Y.
• The regression line is the estimated level of turnout (dependent variable) calculated with knowledge of the independent variable, education level.
• For each state, we would not guess $\bar{Y}$, the overall mean of Y. Rather, we would guess $\hat{Y}$, the mean value of Y for a given value of X.
R-Square (continued)
  • Example: Wyoming has a value of 83 on the independent variable, since 83% of its population has a high school diploma. What would be an estimate of its turnout level?
  • Plugging 83 into the regression equation, for Wyoming we obtain -26.27 + .87(83) = 45.94, or about 46 on the turnout scale.
  • Is our new guess, 46, better than our old guess, 40? Somewhat. By guessing the mean, we “missed” Wyoming’s actual turnout by 17 units. The new estimate, 46, improves our old guess by 6 units, since $\hat{Y} - \bar{Y}$ is equal to 46 - 40 = 6. So $\hat{Y}$ puts us a bit closer to the real value of Y.
• But the distance between Wyoming’s actual turnout (Y) and our new estimate ($\hat{Y}$) remains unexplained. This is prediction error.
  • For Wyoming, the size of the prediction error would be $Y - \hat{Y}$, or 57 - 46 = 11.
  • Thus for Wyoming we could divide its total distance from the mean, 17 units, into two parts: the amount accounted for by the regression estimate, equal to 6, and the amount left unaccounted for by the regression estimate, equal to 11.
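The Wyoming arithmetic can be retraced in a few lines. All figures come from the example above; rounding the estimate to 46 before splitting the distance follows the slides, and the variable names are illustrative assumptions.

```python
actual_turnout = 57.0       # Wyoming's turnout, Y
mean_turnout = 40.0         # mean turnout for all states, Y-bar
a_hat, b_hat = -26.27, 0.87  # coefficients of the reported regression line
pct_hs_grads = 83           # Wyoming's value on X

estimate = a_hat + b_hat * pct_hs_grads  # about 45.94
rounded_estimate = round(estimate)       # 46, the Y-hat used in the slides

total_distance = actual_turnout - mean_turnout   # Y - Y-bar
explained = rounded_estimate - mean_turnout      # Y-hat - Y-bar
unexplained = actual_turnout - rounded_estimate  # Y - Y-hat, the prediction error

print(round(estimate, 2), total_distance, explained, unexplained)
```

The total distance of 17 splits into 6 units picked up by the regression estimate and 11 units of prediction error.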
  • More generally, in regression analysis the TSS has two components:
Total Sum of Squares = Regression Sum of Squares + Error Sum of Squares
TSS = RSS + ESS
$\sum (Y - \bar{Y})^2 = \sum (\hat{Y} - \bar{Y})^2 + \sum (Y - \hat{Y})^2$
• TSS is a summary of all the variation in the dependent variable. RSS, the regression sum of squares, is the component of TSS that we pick up by knowing the independent variable. ESS, the error sum of squares, is the component of TSS that is left over, or not explained by the regression equation.
• If RSS is a large chunk of TSS, then the independent variable is doing a lot of work in explaining the dependent variable.
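The decomposition can be verified numerically. The sketch below fits an OLS line to hypothetical data (not the book's state data) using the standard closed-form solution, then computes the three sums of squares separately and compares them.

```python
from statistics import mean

# Hypothetical data: % high school graduates (X) and % turnout (Y).
x = [68, 72, 75, 80, 83, 86]
y = [31, 38, 36, 45, 44, 50]

x_bar, y_bar = mean(x), mean(y)

# Closed-form OLS estimates for one independent variable.
b_hat = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
a_hat = y_bar - b_hat * x_bar
y_hat = [a_hat + b_hat * xi for xi in x]

tss = sum((yi - y_bar) ** 2 for yi in y)               # guessing the mean
rss = sum((yh - y_bar) ** 2 for yh in y_hat)           # picked up by knowing X
ess = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # left unexplained

print(f"TSS = {tss:.2f}, RSS = {rss:.2f}, ESS = {ess:.2f}, RSS + ESS = {rss + ess:.2f}")
```

The identity holds for a least-squares line with an intercept. It is not guaranteed for an arbitrary line drawn through the same points.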
R-Square (continued)
• As the contribution of RSS declines and the contribution of ESS increases, knowledge of the independent variable provides less help in explaining the dependent variable.
  • R-square is simply the ratio of RSS to TSS: $R^2 = \text{RSS} / \text{TSS}$
• R-square measures the goodness of fit between the regression line and the actual data.
• If X completely explains Y, so that RSS equals TSS, then R-square is 1.
• If RSS makes no contribution (if we would do just as well in accounting for Y without knowledge of X as with knowledge of X), then R-square is 0.
• R-square is a PRE measure and is always bounded by 0 and 1.
• Its value may be interpreted as the proportion of the variation in Y that is explained by X.
• The computer reports that R-square for the state data is equal to .36. Thus, 36% of the variation among the states in their turnout rates is accounted for by their levels of education.
• The leftover variation among the states, 64%, may be explained by other variables, but it is not accounted for by differences in education.
• R-square, sometimes labeled the coefficient of determination, bears a family resemblance to r, Pearson’s correlation coefficient.
• In fact, $R^2 = r^2$, so the value of $R^2$, .36 for the state data, is the square of r, +.6 for the same data. The problem with r is that it may mislead the consumer of political research into overestimating the relationship between two variables.
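The relationship between R-square and r can be checked with a short sketch. It computes R-square as RSS / TSS from a fitted line and, separately, Pearson's r from the deviations, then compares the two. The data are hypothetical (not the book's state data), and the last line simply squares the reported r of .6.

```python
from math import sqrt
from statistics import mean

# Hypothetical data: % high school graduates (X) and % turnout (Y).
x = [68, 72, 75, 80, 83, 86]
y = [31, 38, 36, 45, 44, 50]

x_bar, y_bar = mean(x), mean(y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)

# Pearson's r from the covariation and the separate variation.
r = sxy / sqrt(sxx * syy)

# R-square from the regression line: share of TSS picked up by knowing X.
b_hat = sxy / sxx
a_hat = y_bar - b_hat * x_bar
rss = sum((a_hat + b_hat * xi - y_bar) ** 2 for xi in x)
tss = syy
r_squared = rss / tss

print(f"r = {r:.3f}, r^2 = {r ** 2:.3f}, R-square = {r_squared:.3f}")
print(round(0.6 ** 2, 2))  # the state data: r = +.6 gives R-square = .36
```

Because squaring shrinks any value between 0 and 1, an r of .6 corresponds to only 36% of the variation explained.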
Examining relationships m2
 
Ch 7 regression_lab_outpu1
Ch 7 regression_lab_outpu1Ch 7 regression_lab_outpu1
Ch 7 regression_lab_outpu1
 
Correlation & Regression_
Correlation & Regression_Correlation & Regression_
Correlation & Regression_
 

Dernier

Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Celine George
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfagholdier
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...christianmathematics
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeThiyagu K
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfAdmir Softic
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphThiyagu K
 
Class 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfClass 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfAyushMahapatra5
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxVishalSingh1417
 
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...PsychoTech Services
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)eniolaolutunde
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdfSoniaTolstoy
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsTechSoup
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Sapana Sha
 
fourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writingfourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writingTeacherCyreneCayanan
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Krashi Coaching
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfJayanti Pande
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfsanyamsingh5019
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
 

Dernier (20)

INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptxINDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
Class 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfClass 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdf
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptx
 
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
 
fourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writingfourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writing
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdf
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 

Ch 7 correlation_and_linear_regression

  • 1. CH 7 Correlation (the mutual relation of two or more things) and Linear Regression (regression analysis in which the dependent variable is assumed to be linearly related to the independent variable or variables).
  • 2. Learning Objectives 1) How to use correlation analysis to describe the relationship between two interval-level variables. 2) How to use regression analysis to estimate the effect of an independent variable on a dependent variable. 3) How to perform and interpret dummy variable regression. 4) How to use multiple regression to make controlled comparisons.
  • 3. The book has covered a fair amount of methodological ground. Ch 3 taught two essential methods for analyzing the relationship between an independent variable and a dependent variable: 1) cross-tabulation and 2) mean comparison analysis. Ch 4 covered the logic and practice of controlled comparison: how to set up and interpret the relationship between an independent variable and a dependent variable, controlling for a third variable. Chs 5 and 6 covered the role of inferential statistics in evaluating the statistical significance of a relationship and introduced measures of association. By now, you can: 1) frame a testable hypothesis; 2) set up the appropriate analysis; 3) interpret your findings; and 4) figure out the probability that your observed results occurred by chance.
  • 4. In many ways, correlation and regression are similar to the methods you have learned. Correlation analysis produces a measure of association, Pearson's correlation coefficient, which gauges the direction and strength of a relationship between two interval-level variables. Regression analysis produces a statistic, the regression coefficient, that estimates the size of the effect of the independent variable on the dependent variable. Example: working with survey data, we want to investigate the relationship between individuals' ages (independent variable) and the number of hours they spend watching television each day (dependent variable). 1) Is the relationship positive, with older individuals watching more hours of TV? 2) Is the relationship negative, with older people watching less TV than younger people? 3) How strong is the relationship between age and the number of hours devoted to TV? Correlation analysis addresses these questions.
  • 5. Regression analysis is similar to mean comparison analysis, where we learned: 1) how to divide subjects on the independent variable, such as females and males; 2) how to compare values on the dependent variable, such as mean Clinton thermometer ratings; and 3) how to test the null hypothesis that the observed difference is due to random sampling error. Similarly, regression analysis communicates the mean difference on the dependent variable (thermometer ratings) for subjects who differ on the independent variable (females compared with males). Just like a comparison of two sample means, regression analysis provides information that lets the researcher determine the probability that the observed difference occurred by chance.
  • 6. However, regression is different in two ways. 1) Regression analysis is very precise. It provides a statistic, the regression coefficient, that reveals the exact nature of the relationship between an independent variable and a dependent variable. The regression coefficient reports "the amount of change in the dependent variable that is associated with a one-unit change in the independent variable." The regression coefficient is used only when the dependent variable is measured at the interval level. The independent variable can come in any form: nominal, ordinal, or interval. This chapter shows how to interpret regression analysis in which the independent variable is interval level. It also discusses a technique called dummy variable regression, which uses nominal or ordinal variables as independent variables. 2) A second distinguishing feature of regression is the ease with which it can be extended to the analysis of controlled comparisons. Regression with one independent variable and one dependent variable is called bivariate regression. Regression is remarkably flexible: it can be used to detect and evaluate spuriousness, and it allows us to model additive relationships and interaction
  • 7. Correlation. Example in the book: the relationship between two variables, the % of a state's population that graduated from high school (independent variable) and the % of the eligible population that voted in the 1994 elections (dependent variable). The display is a scatterplot: the independent variable is measured along the horizontal axis and the dependent variable along the vertical axis. Consider the overall pattern of this relationship: 1) Is it strong, moderate, or weak? 2) What is its direction, positive or negative? You can probably arrive at reasonable answers to these questions. In the book's example, as you move from lower to higher values of the independent variable (horizontal axis), the values of the dependent variable (vertical axis) tend to adjust accordingly, clustering a bit higher on the turnout axis. The relationship is positive. But how strong is it? In assessing strength, consider the consistency of the pattern. If the relationship is strong, then just about every time you compare a state that has lower education with a state that has higher education, the second state will also have higher turnout. So an increase in X (independent variable, horizontal axis) is associated with an increase in Y (dependent variable, vertical axis) most of the time. If the relationship is weak, you will encounter many cases that do not fit the positive pattern: many higher-education states with turnouts that are about the same as, or lower than, those of lower-education states. So an increase in X does not consistently accompany an increase in Y. [Assessing strength continues on the next slide.]
  • 8. Correlation, continued. Rate the relationship on a scale from 0 to 1, where a rating close to 0 denotes a weak relationship, a rating around .5 a moderate relationship, and a rating close to 1 a strong relationship. In the book's example, a rating close to 0 does not seem correct because the pattern has some predictability. Yet a rating of 1 does not seem right either, because you can find states in the "wrong" place on the turnout variable, given their levels of education. A rating around .5, somewhere in the moderate range, seems like a reasonable gauge of the strength of the relationship. Pearson's correlation coefficient (lowercase r) uses this approach to determine the direction and strength of an interval-level relationship. Pearson's r always falls between -1, signifying a perfectly negative association between the variables, and +1, a perfectly positive association. If no relationship exists, r = 0. The exact computation of r is not needed here, but it is important to understand the statistical basis of the correlation coefficient: r = (covariation of X and Y) / (separate variation of X and Y). The numerator, the covariation of X and Y, measures the degree to which variation in X is associated with variation in Y. This value quantifies the thinking we applied to the scatterplot of states: compare two states, one with a lower value on X and one with a higher value on X.
  • 9. Correlation, continued. If the second state also has higher turnout than the first state, the numerator will be positive. By contrast, if the state with the higher value on X has a lower value on Y, the numerator will be negative. If the pattern is inconsistent (the states have different values on X but similar values on Y), the numerator records this inconsistency and takes a value close to 0. The denominator summarizes all the variation in both variables, considered separately. If the covariation of X and Y equals the total variation in both variables, then r takes a value of +1 (perfectly positive covariation) or -1 (perfectly negative covariation). If X and Y do not move together in a systematic way, r takes a value close to 0. The correlation coefficient for the relationship depicted in 7-1 is Pearson's r = +.6. Pearson's r is a symmetrical measure of association between two variables: the correlation between X (independent variable) and Y (dependent variable) is the same as the correlation between Y and X. Pearson's r is neutral on the question of whether X causes Y or Y causes X, so one cannot attribute causation based on a correlation coefficient. Furthermore, Pearson's r is not a PRE (proportional reduction in error) measure of association. It is, however, neatly bounded by -1 and +1, so it communicates strength and direction by a common metric.
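The ratio described in slides 8 and 9 (covariation of X and Y over the separate variation of X and Y) can be sketched in a few lines of Python. This is a minimal illustration, not the book's procedure: the function name `pearson_r` and all sample values are assumptions chosen only to show the bounds and the symmetry of r.

```python
import math

def pearson_r(x, y):
    """Pearson's r: covariation of X and Y divided by their separate variation."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Numerator: how consistently X and Y move together around their means.
    covariation = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # Denominator: variation in X and in Y, each considered separately.
    variation_x = sum((xi - mean_x) ** 2 for xi in x)
    variation_y = sum((yi - mean_y) ** 2 for yi in y)
    return covariation / math.sqrt(variation_x * variation_y)

# Hypothetical values (not from the book).
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]     # rises in step with x
y_down = [10, 8, 6, 4, 2]   # falls in step with x
y_mixed = [3, 1, 4, 1, 5]   # a less consistent pattern

r_up = pearson_r(x, y_up)
r_down = pearson_r(x, y_down)
r_mixed = pearson_r(x, y_mixed)
print(r_up, r_down, r_mixed)
```

Swapping the arguments, `pearson_r(y_mixed, x)`, returns the same value, which is the symmetry property noted in slide 9.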
  • 10. Bivariate Regression. Bivariate example from the book: analyze the relationship between the scores received on an exam (dependent variable, Y) and the number of hours spent studying for the test (independent variable, X). The relationship is positive: students who studied more received better scores. The correlation between the variables is indeed strong. But regression analysis lets us put a finer point on the relationship. In this case, "each additional hour spent studying results in exactly a 6-point increase in exam score." Moreover, the XY pattern can be summarized by a line. What linear equation would summarize the relationship?
  • 11. Bivariate Regression, continued. To draw a line you must know two things: 1) the Y-intercept, the point at which the line crosses the Y axis (the value of Y when X is 0), and 2) the slope of the line. The regression coefficient, the slope of the line, is "rise over run": the amount of change in Y (dependent variable) for each unit change in X (independent variable). With these two elements, we arrive at the general formula for a regression line, a linear equation that summarizes the relationship between X and Y: $Y = a + b(X)$. Here a represents the Y-intercept, which is 55, the score of students who did not study at all (X = 0), and b represents the slope of the line, or regression coefficient. The regression coefficient (b) is 6. Thus the regression line for 7-1 (in the book) is: Test score = 55 + 6(number of hours). Notice two aspects of this approach. First, the regression equation provides a general summary of the relationship between X and Y: for any student we can plug in the number of hours spent studying, do the math, and arrive at an exam score. Second, the formula seems to hold some predictive power, the ability to estimate scores for students who do not appear in the data. Example: for 3.5 hours of studying, our estimate is 55 + 6(3.5) = a score of 76.
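As a quick check of the arithmetic in slide 11, the regression line Test score = 55 + 6(number of hours) can be written as a small function. The names `A_INTERCEPT`, `B_SLOPE`, and `predicted_score` are illustrative assumptions. The coefficients are the ones given in the slide.

```python
A_INTERCEPT = 55  # a: score for a student who did not study (X = 0)
B_SLOPE = 6       # b: points gained per additional hour of study

def predicted_score(hours):
    """Evaluate the regression line Y = a + b(X) for a given number of hours."""
    return A_INTERCEPT + B_SLOPE * hours

estimate = predicted_score(3.5)
print(estimate)  # 76.0
```

The same function gives 55 for a student who did not study and 85 for five hours of study, matching the estimates quoted elsewhere in these notes.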
  • 12. Bivariate Regression, continued. Using an established regression formula to predict values of a dependent variable for new values of an independent variable is a common application of regression analysis. Now modify the example to make it somewhat more realistic: assume a sample of 16 students drawn at random from the population. 1) In these data, two students can share the same value on the independent variable but have different scores, such as 59 and 63, and so on for the other pairs of cases, so each value of X needs a one-number summary of the dependent variable. 2) Calculate the mean value of the dependent variable for each value of the independent variable. For the two non-studiers, average their scores: (53 + 57) / 2 = 55; for the 1-hour cases, (59 + 63) / 2 = 61. Notice that averaging does not reproduce the data; instead it produces estimates of the actual test scores. Because these estimates do not represent real values of Y, they are given a separate label, $\hat{Y}$ ("Y-hat"). 3) How should we describe the relationship between X and Y? "Based on my sample, each additional hour spent studying produced, on average, a 6-point increase in exam score." So the regression coefficient, b, communicates the average change in Y for each unit change in X. A linear regression equation takes this general form: $\hat{Y} = \hat{a} + \hat{b}(X)$, where $\hat{Y}$ ("Y-hat") is the estimated mean value of the dependent variable, $\hat{a}$ ("a-hat") is the average value of Y when X is 0, and $\hat{b}$ ("b-hat") is the average change in Y for each unit change in X. Here: Estimated score = 55 + 6(number of hours).
  • 13. Bivariate Regression, continued. Regression analysis is built on the estimation of averages. Regression will use the available information to calculate a Y-intercept. Even if no empirical cases existed for X = 5 hours, regression would still yield an estimate, 55 + 6(5) = 85, an estimated average score. A regression line travels through the two-dimensional space defined by X (horizontal axis) and Y (vertical axis), estimating mean values along the way. Regression relentlessly summarizes linear relationships: feed it sample values for X and Y, and it will return estimates for a and b. These coefficients are means computed from sample data, so they contain random sampling error. Focus on the workhorse, the regression coefficient, b. This estimate contains some error, because the actual student scores fall a bit above or below the average for any given value of X. Just as with any sample mean, the size of the error in a regression coefficient is measured by its standard error. So the real value of b in the population ($\beta$) is equal to the sample estimate, $\hat{b}$, within the bounds of standard error: $\beta = \hat{b} \pm (\text{standard error of } \hat{b})$.
  • 14. Bivariate Regression, continued. All the statistical rules you have learned (the informal ±2 rule of thumb, the more formal 1.645 test, P-values, the inferential setup for testing the null hypothesis) apply to regression analysis. In evaluating the difference between two sample means, we tested the null hypothesis that the difference in the population is equal to 0. In its regression guise, the null hypothesis says much the same thing: that the true value of $\beta$ in the population is equal to 0, that is, $\beta = 0$ and the true regression line is flat and horizontal. As in the comparison of two sample means, we test the null hypothesis by calculating a t-statistic, or t-ratio: $t = (\hat{b} - \beta) / (\text{standard error of } \hat{b})$, with degrees of freedom (d.f.) = n − 2. If the t-ratio is equal to or greater than 2, we can reject the null hypothesis. A precise P-value can also be obtained. For each 1-hour increase in studying time, we estimate a 6-point increase in exam score ($\hat{b}$ = +6). The computer reports that the standard error of $\hat{b}$ is .233, so the t-statistic is 6 / .233 = 25.75, with a P-value that rounds to .000. Now turn to a real-world relationship, the education-turnout example in 7-1, and some further properties of regression analysis. 7-2 displays the estimated regression line. Where did this line originate? Linear regression finds the line that provides the best fit to the data points. Using each case's value on X, it finds $\hat{Y}$, an estimated value of Y. It then calculates the difference between this estimate and the case's actual value of Y. This difference is called prediction error and is represented by the expression $Y - \hat{Y}$, the actual value of Y minus the estimated value of Y.
  • 15. Bivariate Regression, continued. Regression uses the values of the independent variable, percent high school graduates, to determine an estimated value on the dependent variable, percent turnout. Prediction error is the difference between a state's actual level of turnout, Y, and its estimated turnout, $\hat{Y}$. The prediction error for any given state may be positive (its actual turnout is higher than its estimated turnout) or negative (its actual turnout is lower than its estimated turnout). In fact, if one were to add up all the positive and negative prediction errors, they would sum to 0. So when it finds the best-fitting line, regression works with the square of the difference, $(Y - \hat{Y})^2$. For each state, regression squares the difference between the state's actual level of turnout and its estimated level of turnout. In regression logic, the best-fitting line is the one that minimizes the sum of these squared prediction errors across all cases. That is, regression finds the line that minimizes the quantity $\sum (Y - \hat{Y})^2$. This criterion of best fit distinguishes garden-variety ordinary least squares (OLS) regression from other regression-based techniques. (The line in 7-2 is an OLS regression line.) Regression reported the equation that provides the best fit for the relationship between X and Y: Estimated turnout = -26.27 + .87(% high school grads). How would you interpret each of the estimates for a and b? Consider the estimate for a, the level of turnout when X is 0: a turnout level of -26.27. A negative turnout rate is impossible, and no state has 0 % high school graduates. Nonetheless, regression produced an estimate for $\hat{a}$, anchoring the line at a -26.27 turnout rate. For some applications of regression, the value of the Y-intercept, the estimate $\hat{a}$, has no meaningful interpretation (in others it is essential). What about $\hat{b}$, the estimated effect of education on turnout?
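The least-squares criterion in slide 15 can be sketched with the standard closed-form OLS solution for one independent variable. The data below are only the four exam scores quoted in slide 12 (53 and 57 at 0 hours, 59 and 63 at 1 hour), not the book's full 16-student sample or the state turnout data. The function names are illustrative assumptions.

```python
def ols_fit(x, y):
    """Return (a_hat, b_hat) minimizing the sum of squared prediction errors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b_hat = sxy / sxx                # slope
    a_hat = mean_y - b_hat * mean_x  # intercept
    return a_hat, b_hat

def sum_squared_errors(a, b, x, y):
    """Sum of (Y - Y_hat)^2 for the line Y_hat = a + b * X."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

# Only the four scores quoted in the slides; the rest of the sample is not shown.
hours = [0, 0, 1, 1]
scores = [53, 57, 59, 63]

a_hat, b_hat = ols_fit(hours, scores)
errors = [yi - (a_hat + b_hat * xi) for xi, yi in zip(hours, scores)]

print(a_hat, b_hat)   # intercept and slope of the fitted line
print(sum(errors))    # positive and negative prediction errors cancel
print(sum_squared_errors(a_hat, b_hat, hours, scores))
print(sum_squared_errors(55, 5, hours, scores))  # another line fits worse
```

On these four points the fitted line is 55 + 6(hours). The prediction errors sum to 0, and any other line, such as 55 + 5(hours), produces a larger sum of squared errors.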
  • 16. Bivariate Regression, continued. Two rules for interpreting a regression coefficient. First rule: be clear about the units in which the independent and dependent variables are measured, and make sure you know which is the dependent variable and which is the independent variable. In this example, the dependent variable, Y, is measured in percentages: the % of each state's eligible population that voted in the 1994 election. The independent variable, X, is also expressed in percentages: the % of the population that has a high school degree. Second rule: the regression coefficient, b, is expressed in units of the dependent variable, not the independent variable. The coefficient, .87, tells us that "turnout (Y) increases, on average, by .87 of a percentage point for each 1-percentage-point increase in education (X)." Remember that all the coefficients in a regression equation are measured in units of the dependent variable. The intercept is the value of the dependent variable when X is 0. The slope is the estimated change in the dependent variable for a one-unit change in the independent variable. From the sample, we obtained an estimate of .87 for the true population value of $\beta$. The null hypothesis would claim that $\beta$ is really 0 and that the sample estimate obtained, .87, is within the bounds of sampling error. The computer calculated the regression coefficient's standard error to be .17, so we arrive at a t-ratio: $t = (\hat{b} - \beta) / (\text{standard error of } \hat{b})$ = .87 / .17 = 5.12, with d.f. = n − 2 = 50 − 2 = 48. The informal ±2 rule of thumb advises us to reject the null hypothesis. The P-value, a probability of 2.68E-06, verifies that advice.
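The two t-ratios quoted in these notes can be reproduced directly from the coefficients and standard errors given in the slides. This sketch covers only the t-ratio and the informal ±2 rule of thumb; it does not compute a P-value. The function name `t_ratio` is an illustrative assumption.

```python
def t_ratio(b_hat, std_error, null_value=0.0):
    """t = (b_hat - beta) / (standard error of b_hat); beta is 0 under the null."""
    return (b_hat - null_value) / std_error

# Study-hours example: b_hat = 6, standard error = .233
t_study = t_ratio(6, 0.233)

# State turnout example: b_hat = .87, standard error = .17, n = 50 states
t_turnout = t_ratio(0.87, 0.17)
df_turnout = 50 - 2

# Informal rule of thumb: reject the null hypothesis when |t| >= 2.
reject_null = abs(t_turnout) >= 2

print(round(t_study, 2))    # 25.75
print(round(t_turnout, 2))  # 5.12
print(df_turnout, reject_null)
```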
  • 17. R-Square. Regression analysis gives a precise estimate of the effect of an independent variable on a dependent variable. It looks at a relationship and reports the relationship's exact nature. "What, exactly, is the effect of education (independent variable) on turnout in the states (dependent variable)?" The regression coefficient provides an answer: "Turnout increases by .87 of a percentage point for each 1-percentage-point increase in the share of the state's population with a high school diploma. Plus, the regression coefficient has a P-value of .000." But the regression coefficient by itself does not measure the completeness of the relationship, the degree to which Y is explained by X. States' education levels, though clearly related to turnout, provide an incomplete explanation of it. In regression analysis, the completeness of a relationship is measured by the statistic $R^2$ (R-square). R-square is a PRE measure, so it poses the question of strength the same way as lambda or Kendall's tau: "How much better can we predict the dependent variable by knowing the independent variable than by not knowing the independent variable?" Consider the state turnout data. Suppose you had to guess a state's turnout (dependent variable) without knowing its education level (independent variable). With lambda, the best guess for a nominal-level variable is its measure of central tendency, its mode. In the case of an interval-level dependent variable, such as % turnout, the best guess is provided by the variable's measure of central tendency, its mean.
  • 18. R-Square, continued. State turnout example: 7-3 shows a scatterplot of states and the regression line. A flat line is drawn at 40 % turnout. This value is the mean turnout for all 50 states, calculated like any mean. So $\bar{Y}$ = 40 %. If we had no knowledge of the independent variable (education), we would guess a turnout rate of 40 for each state, taken one at a time. This guess would serve well for many states but would produce quite a few errors. Wyoming produced a turnout rate of about 57 % in 1994, so our guess of 40 would carry a large dose of error. The size of this error is 57 − 40 = 17: Wyoming's turnout rate is 17 units higher than predicted based only on the mean turnout for all states. This error can be labeled $Y - \bar{Y}$, the actual value of Y minus the overall mean of Y. This value, calculated for every case, is the starting point for R-square. R-square finds $(Y - \bar{Y})$ for each case, squares it, and then sums these squared values across all observations. The result is the total sum of squares, TSS. TSS is an overall summary of all our missed guesses of turnout, based only on knowledge of the dependent variable. Now reconsider the regression line in 7-3 and see how much it improves the prediction of Y. The regression line is the estimated level of turnout (dependent variable), calculated with knowledge of the independent variable, education level. For each state, we would not guess $\bar{Y}$, the overall mean of Y. Rather, we would guess $\hat{Y}$, the mean value of Y for a given value of X.
  • 19. R-Square, continued. Example: Wyoming has a value of 83 on the independent variable, since 83 % of its population has a high school diploma. What would be an estimate of its turnout level? Plugging 83 into the regression equation, for Wyoming we obtain -26.27 + .87(83) = 46 on the turnout scale. Is our new guess, 46, better than our old guess, 40? Somewhat. By guessing the mean, we "missed" Wyoming's actual turnout by 17 units. The new estimate, 46, improves on our old guess by 6 units, since $\hat{Y} - \bar{Y}$ = 46 − 40 = 6. So $\hat{Y}$ puts us a bit closer to the real value of Y. But the distance between Wyoming's actual turnout (Y) and our new estimate ($\hat{Y}$) remains unexplained. This is prediction error. For Wyoming, the size of the prediction error is $Y - \hat{Y}$, or 57 − 46 = 11. Thus for Wyoming we can divide the total distance from the mean, 17 units, into two parts: the amount accounted for by the regression estimate, 6, and the amount left unaccounted for by the regression estimate, 11. More generally, in regression analysis the TSS has two components: Total Sum of Squares = Regression Sum of Squares + Error Sum of Squares, or TSS = RSS + ESS, that is, $\sum (Y - \bar{Y})^2 = \sum (\hat{Y} - \bar{Y})^2 + \sum (Y - \hat{Y})^2$. TSS is a summary of all the variation in the dependent variable. RSS, the regression sum of squares, is the component of TSS that we pick up by knowing the independent variable. ESS, the error sum of squares, is the component of TSS that is left over, or not explained by the regression equation. If RSS is a large chunk of TSS, then the independent variable is doing a lot of work in explaining the dependent variable.
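The Wyoming arithmetic in slide 19 can be traced step by step. The values used (actual turnout of 57, mean turnout of 40, 83 % high school graduates, and the equation -26.27 + .87X) all come from the slides. The variable names are illustrative assumptions.

```python
actual_turnout = 57   # Wyoming's approximate 1994 turnout (Y)
mean_turnout = 40     # mean turnout across all 50 states (Y-bar)
pct_hs_grads = 83     # Wyoming's value on the independent variable (X)

# Regression estimate from the slides' equation; about 45.94, rounded to 46.
raw_estimate = -26.27 + 0.87 * pct_hs_grads
estimated_turnout = round(raw_estimate)

total_distance = actual_turnout - mean_turnout   # Y - Y-bar
explained = estimated_turnout - mean_turnout     # Y-hat - Y-bar
unexplained = actual_turnout - estimated_turnout # Y - Y-hat (prediction error)

print(estimated_turnout, total_distance, explained, unexplained)
```

For a single case the raw distances add up (17 = 6 + 11). The identity TSS = RSS + ESS is a statement about the sums of squared distances across all cases, not about one case's squared pieces, since 17² does not equal 6² + 11².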
  • 20. R-Square, continued. As the contribution of RSS declines and the contribution of ESS increases, knowledge of the independent variable provides less help in explaining the dependent variable. R-square is simply the ratio of RSS to TSS: $R^2 = RSS / TSS$. R-square measures the goodness of fit between the regression line and the actual data. If X completely explains Y, so that RSS equals TSS, then R-square is 1. If RSS makes no contribution (if we would do just as well in accounting for Y without knowledge of X as with knowledge of X), then R-square is 0. R-square is a PRE measure and is always bounded by 0 and 1. Its value may be interpreted as the proportion of the variation in Y that is explained by X. The computer reports that R-square for the state data is equal to .36. Thus, 36 % of the variation among the states in their turnout rates is accounted for by their levels of education. The leftover variation among the states, 64 %, may be explained by other variables, but it is not accounted for by differences in education. R-square, sometimes labeled the coefficient of determination, bears a family resemblance to r, Pearson's correlation coefficient. In fact, $R^2 = r^2$, so the value of $R^2$, .36 for the state data, is the square of r, +.6 for the same data. The problem with r is that it may mislead the consumer of political research into overestimating the relationship between two variables.
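To see TSS = RSS + ESS and $R^2 = r^2$ hold numerically, the sketch below uses only the four exam scores quoted in slide 12 (53 and 57 at 0 hours, 59 and 63 at 1 hour). These four points are not the book's full sample, so the resulting R-square illustrates the identities rather than reproducing any result from the book. Variable names are assumptions.

```python
import math

hours = [0, 0, 1, 1]
scores = [53, 57, 59, 63]

n = len(hours)
mean_x = sum(hours) / n
mean_y = sum(scores) / n

sxx = sum((x - mean_x) ** 2 for x in hours)
syy = sum((y - mean_y) ** 2 for y in scores)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, scores))

# OLS estimates and fitted values.
b_hat = sxy / sxx
a_hat = mean_y - b_hat * mean_x
fitted = [a_hat + b_hat * x for x in hours]

tss = syy                                                # sum of (Y - Y-bar)^2
rss = sum((f - mean_y) ** 2 for f in fitted)             # sum of (Y-hat - Y-bar)^2
ess = sum((y - f) ** 2 for y, f in zip(scores, fitted))  # sum of (Y - Y-hat)^2

r_squared = rss / tss
r = sxy / math.sqrt(sxx * syy)  # Pearson's r for the same four points

print(tss, rss, ess)
print(r_squared, r ** 2)

# State data from the slides: r = +.6, so R-square is .36 and 64 % is unexplained.
state_r_squared = 0.6 ** 2
print(round(state_r_squared, 2), round(1 - state_r_squared, 2))
```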